Scientific report: "A Statistical Model for Domain-Independent Text Segmentation"

Masao Utiyama and Hitoshi Isahara
Communications Research Laboratory
2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0289, Japan
mutiyama@ and isahara@

Abstract

We propose a statistical method that finds the maximum-probability segmentation of a given text. This method does not require training data because it estimates probabilities from the given text. Therefore, it can be applied to any text in any domain. An experiment showed that the method is more accurate than or at least as accurate as a state-of-the-art text segmentation system.

1 Introduction

Documents usually include various topics. Identifying and isolating topics by dividing documents, which is called text segmentation, is important for many natural language processing tasks, including information retrieval (Hearst and Plaunt, 1993; Salton et al., 1996) and summarization (Kan et al., 1998; Nakao, 2000).

In information retrieval, users are often interested in particular topics (parts) of retrieved documents instead of the documents themselves. To meet such needs, documents should be segmented into coherent topics. Summarization is often used for a long document that includes multiple topics. A summary of such a document can be composed of summaries of the component topics. Identification of topics is the task of text segmentation.

A lot of research has been done on text segmentation (Kozima, 1993; Hearst, 1994; Okumura and Honda, 1994; Salton et al., 1996; Yaari, 1997; Kan et al., 1998; Choi, 2000; Nakao, 2000). A major characteristic of the methods used in this research is that they do not require training data to segment given texts. Hearst (1994), for example, used only the similarity of word distributions in a given text to segment the text. Consequently, these methods can be applied to any text in any domain, even if training data do not exist. This property is important when text segmentation is applied to information retrieval or summarization because both ...
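
As an illustration of segmenting a text using only the similarity of word distributions within that text, the following minimal Python sketch compares word counts on either side of each sentence gap and reports the gap with the lowest cosine similarity as a candidate topic boundary. It is not the method of Hearst (1994) or of this paper. The function names (`cosine`, `gap_similarities`, `weakest_gap`), the `window` parameter, and the simple lowercase tokenizer are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_similarities(sentences, window=2):
    """Similarity across each gap between sentences[i - 1] and sentences[i].

    `window` (hypothetical parameter) is the number of sentences pooled
    on each side of the gap before comparing word counts.
    """
    # Assumed tokenizer: lowercase alphabetic runs only.
    bags = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    sims = []
    for i in range(1, len(bags)):
        left = sum(bags[max(0, i - window):i], Counter())
        right = sum(bags[i:i + window], Counter())
        sims.append(cosine(left, right))
    return sims


def weakest_gap(sentences, window=2):
    """Return i such that the candidate boundary falls just before sentences[i]."""
    sims = gap_similarities(sentences, window)
    return 1 + min(range(len(sims)), key=sims.__getitem__)


if __name__ == "__main__":
    text = [
        "cat dog cat",
        "dog cat pet",
        "stock bond market",
        "market stock price",
    ]
    print(gap_similarities(text, window=1))
    print(weakest_gap(text))
```

No training data is involved: every count comes from the text being segmented, which is the property the paragraph above highlights. A practical system would need to choose several boundaries and handle stop words, both of which this sketch leaves out.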
