Text Segmentation with LDA-Based Fisher Kernel

Qi Sun, Runxin Li, Dingsheng Luo and Xihong Wu
Speech and Hearing Research Center, and Key Laboratory of Machine Perception (Ministry of Education), Peking University, 100871, Beijing, China
{sunq, lirx, dsluo, wxh}@cis.pku.edu.cn

Abstract

In this paper we propose a domain-independent text segmentation method which consists of three components. Latent Dirichlet allocation (LDA) is employed to compute the semantic distributions of words, and we measure semantic similarity with the Fisher kernel. Finally, the globally best segmentation is obtained by dynamic programming. Experiments on Chinese data sets show that the technique is effective. By introducing latent semantic information, our algorithm is robust to irregular-sized segments.

1 Introduction

The aim of text segmentation is to partition a document into a set of segments, each of which is coherent about a specific topic. The task is motivated by problems in information retrieval, summarization and language modeling, in which access to smaller, coherent segments of a document is desired. A great deal of research has been done on text segmentation. Some approaches rely on linguistic criteria (Beeferman et al., 1999; Mochizuki et al., 1998), while others use statistical similarity measures to uncover lexical cohesion. Lexical cohesion methods assume that a coherent topic segment contains parts with similar vocabularies.
For example, the TextTiling algorithm introduced by Hearst (1994) assumes that the local minima of the word-similarity curve are points of low lexical cohesion, and thus natural boundary candidates. Reynar (1998) has proposed a method called dotplotting, which depends on the distribution of word repetitions to find tight regions of topic similarity graphically. One problem with these approaches is that they treat terms as uncorrelated, assigning them orthogonal directions in the feature space. But in reality words are correlated and sometimes even synonymous, so that texts with very few
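The local-minimum idea behind TextTiling, described above, can be illustrated with a minimal sketch. This is not the paper's code or Hearst's implementation; it is a simplified illustration in which the similarity between the blocks of sentences on either side of each candidate gap is scored with plain bag-of-words cosine similarity, and gaps whose score is a local minimum of the curve are returned as boundary candidates. All function and parameter names here are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words frequency vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def boundary_candidates(sentences: list[list[str]], window: int = 2) -> list[int]:
    """Return sentence-gap indices whose similarity is a local minimum.

    Gap i lies between sentences[i-1] and sentences[i]; its score is the
    cosine similarity of the `window` sentences on each side of the gap.
    """
    scores = []
    for i in range(1, len(sentences)):
        left = Counter(w for s in sentences[max(0, i - window):i] for w in s)
        right = Counter(w for s in sentences[i:i + window] for w in s)
        scores.append(cosine(left, right))
    # Local minima of the similarity curve (low lexical cohesion) are
    # the natural boundary candidates; scores[k] belongs to gap k + 1.
    return [k + 1 for k in range(1, len(scores) - 1)
            if scores[k] < scores[k - 1] and scores[k] < scores[k + 1]]
```

Note that this sketch exhibits exactly the weakness discussed above: two blocks sharing no surface terms score zero similarity even if their words are synonymous, which is the gap the LDA-based Fisher kernel is meant to close.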