Scientific report: "Text Segmentation with LDA-Based Fisher Kernel"

Qi Sun, Runxin Li, Dingsheng Luo and Xihong Wu
Speech and Hearing Research Center, and Key Laboratory of Machine Perception (Ministry of Education)
Peking University, 100871, Beijing, China
sunq lirx dsluo wxh @

Abstract

In this paper we propose a domain-independent text segmentation method that consists of three components. Latent Dirichlet allocation (LDA) is employed to compute the semantic distribution of words, and semantic similarity is measured with the Fisher kernel. Finally, the globally best segmentation is found by dynamic programming. Experiments on Chinese data sets show that the technique can be effective. By introducing latent semantic information, our algorithm is robust on irregular-sized segments.

1 Introduction

The aim of text segmentation is to partition a document into a set of segments, each of which is coherent about a specific topic. This task is inspired by problems in information retrieval, summarization, and language modeling, in which the ability to provide access to smaller coherent segments in a document is desired.

A lot of research has been done on text segmentation. Some approaches use linguistic criteria (Beeferman et al., 1999; Mochizuki et al., 1998), while others use statistical similarity measures to uncover lexical cohesion. Lexical cohesion methods assume that a coherent topic segment contains parts with similar vocabularies.
For example, the TextTiling algorithm introduced by Hearst (1994) assumes that the local minima of the word similarity curve are the points of low lexical cohesion and thus the natural boundary candidates. Reynar (1998) proposed a method called dotplotting, which relies on the distribution of word repetitions to find tight regions of topic similarity graphically. One problem with these approaches is that they treat terms as uncorrelated, assigning them orthogonal directions in the feature space. In reality, however, words are correlated and sometimes even synonymous, so that texts with very few
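
The lexical cohesion idea and its limitation, as described above, can be made concrete with a minimal Python sketch. It is neither the method proposed in this paper nor Hearst's actual TextTiling implementation: it assumes whitespace tokenization, raw term counts, cosine similarity between the blocks on either side of each sentence gap, and strict local minima as boundary candidates. The function names (`cosine`, `gap_similarities`, `boundary_candidates`) and the toy sentences are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two bag-of-words count vectors."""
    # Only words present in both bags contribute to the dot product,
    # which is exactly the "orthogonal terms" assumption.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gap_similarities(sentences, window=2):
    """Similarity of the `window` sentences before and after each gap."""
    bags = [Counter(s.lower().split()) for s in sentences]
    sims = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        sims.append(cosine(left, right))
    return sims


def boundary_candidates(sims):
    """Sentence indices that start a new segment (strict interior minima)."""
    # sims[i] describes the gap between sentence i and sentence i + 1.
    return [i + 1 for i in range(1, len(sims) - 1)
            if sims[i] < sims[i - 1] and sims[i] < sims[i + 1]]


if __name__ == "__main__":
    toy = [
        "cat chases mouse",
        "mouse fears cat",
        "stocks fell sharply",
        "stocks rose sharply",
    ]
    sims = gap_similarities(toy, window=1)
    print(boundary_candidates(sims))  # [2]

    # Related meaning, no shared words: similarity is exactly zero.
    car = Counter("car drives fast".split())
    auto = Counter("automobile moves quickly".split())
    print(cosine(car, auto))  # 0.0
```

On the toy input, the only interior dip in the similarity curve falls between the second and third sentences, so sentence index 2 is proposed as the start of a new segment. The last two lines show the weakness: two sentences with related meaning but no shared words score exactly zero, because each distinct term occupies its own orthogonal direction.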
