TAILIEUCHUNG - Scientific report: "Statistical Models for Topic Segmentation"

Jeffrey C. Reynar
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052 USA
jreynar@

Abstract

Most documents are about more than one subject, but many NLP and IR techniques implicitly assume documents have just one topic. We describe new clues that mark shifts to new topics, novel algorithms for identifying topic boundaries and the uses of such boundaries once identified. We report topic segmentation performance on several corpora as well as improvement on an IR task that benefits from good segmentation.

Introduction

Dividing documents into topically-coherent sections has many uses, but the primary motivation for this work comes from information retrieval (IR). Documents in many collections vary widely in length, and while the shortest may address one topic, modest-length and long documents are likely to address multiple topics or consist of sections that address various aspects of the primary topic. Despite this fact, most IR systems treat documents as indivisible units and index them in their entirety.

This is problematic for two reasons. First, most relevance metrics are based on word frequency, which can be viewed as a function of the topic being discussed (Church and Gale, 1995). For example, the word "header" is rare in general English, but it enjoys higher frequency in documents about soccer. In general, word frequency is a good indicator of whether a document is relevant to a query, but consider a long document containing only one section relevant to a query. If a keyword is used only in the pertinent section, its overall frequency in the document will be low and, as a result, the document as a whole may be judged irrelevant despite the relevance of one section.

The second reason it would be beneficial to index sections of documents is that, once a search engine has identified a relevant document, users would benefit from direct access to the relevant sections. This problem is compounded when searching multimedia.
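The frequency-dilution argument above can be made concrete with a small sketch. The three-section document, the helper name `relative_frequency`, and the use of raw relative term frequency as a stand-in for a relevance metric are all assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter


def relative_frequency(term, text):
    """Return the fraction of whitespace-separated tokens in `text` equal to `term`."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return Counter(tokens)[term.lower()] / len(tokens)


# Hypothetical document with three sections; only the second is about soccer.
sections = [
    "the council met to discuss the annual budget and local taxes",
    "the striker scored with a header and a second header won the match",
    "rain is expected over the weekend with strong winds on the coast",
]
document = " ".join(sections)

# Score for the document indexed as one indivisible unit.
doc_score = relative_frequency("header", document)

# Scores when each section is indexed separately.
section_scores = [relative_frequency("header", s) for s in sections]

print(f"whole document: {doc_score:.3f}")
for i, score in enumerate(section_scores, start=1):
    print(f"section {i}: {score:.3f}")
```

In this toy case the keyword's relative frequency in the soccer section is noticeably higher than in the document as a whole, which is the effect the introduction describes: indexing sections separately keeps a locally frequent term from being diluted by unrelated text.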
