Pairwise Document Similarity in Large Collections with MapReduce

Tamer Elsayed, Jimmy Lin, and Douglas W. Oard
Human Language Technology Center of Excellence and UMIACS Laboratory for Computational Linguistics and Information Processing
University of Maryland, College Park, MD 20742
{telsayed, jimmylin, oard}@umd.edu

Abstract

This paper presents a MapReduce algorithm for computing pairwise document similarity in large document collections. MapReduce is an attractive framework because it allows us to decompose the inner products involved in computing document similarity into separate multiplication and summation stages in a way that is well matched to efficient disk access patterns across several machines. On a collection consisting of approximately 900,000 newswire articles, our algorithm exhibits linear growth in running time and space in terms of the number of documents.

1 Introduction

Computing pairwise similarity on large document collections is a task common to a variety of problems such as clustering and cross-document coreference resolution. For example, in the PubMed search engine,1 which provides access to the life sciences literature, a "more like this" browsing feature is implemented as a simple lookup of document-document similarity scores computed offline. This paper considers a large class of similarity functions that can be expressed as an inner product of term weight vectors.
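The decomposition described in the abstract can be illustrated with a minimal in-memory sketch: group term weights by term (as an inverted index would), let each term contribute one partial product per pair of documents that share it, and sum the partial products per pair. This is only an illustration of the multiplication/summation split, not the paper's actual MapReduce implementation; all function and variable names here are invented for the example.

```python
from collections import defaultdict
from itertools import combinations

def pairwise_similarity(doc_vectors):
    """Compute all pairwise inner products by splitting them into
    per-term multiplications and a final summation over partial
    products, mirroring the two-stage decomposition sketched above."""
    # Group (doc, weight) postings by term, like an inverted index.
    postings = defaultdict(list)
    for doc, vector in doc_vectors.items():
        for term, weight in vector.items():
            postings[term].append((doc, weight))

    # Multiplication stage: each term emits one partial product per
    # pair of documents that share it.  Summation stage: partial
    # products for the same document pair are accumulated.
    partials = defaultdict(float)
    for term, plist in postings.items():
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            partials[(d1, d2)] += w1 * w2

    return dict(partials)

# Toy collection with made-up term weights.
docs = {
    "d1": {"apple": 2.0, "banana": 1.0},
    "d2": {"apple": 1.0, "cherry": 3.0},
    "d3": {"banana": 2.0},
}
sims = pairwise_similarity(docs)
# sims[("d1", "d2")] is 2.0 (from "apple"); pairs sharing no term are absent.
```

In a real MapReduce setting the two loops would become separate map and reduce stages, with the framework's shuffle grouping partial products by document pair; the point of the sketch is only that an inner product factors cleanly into independent per-term multiplications followed by a sum.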
For document collections that fit into random-access memory, the solution is straightforward. As collection size grows, however, it ultimately becomes necessary to resort to disk storage, at which point aligning computation order with disk access patterns becomes a challenge. Further growth in the document collection will ultimately make it desirable to spread the computation over several processors, at which point interprocess communication becomes a second potential bottleneck for which

†Department of Computer Science
‡The iSchool, College of Information Studies
1http://www.ncbi.nlm.nih.gov/PubMed