Structural Topic Model for Latent Topical Structure Analysis

Hongning Wang, Duo Zhang, ChengXiang Zhai
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
{wang296, dzhang22, czhai}@cs.uiuc.edu

Abstract

Topic models have been successfully applied to many document analysis tasks to discover topics embedded in text. However, existing topic models generally cannot capture the latent topical structures in documents. Since languages are intrinsically cohesive and coherent, modeling and discovering latent topical transition structures within documents would be beneficial for many text analysis tasks. In this work, we propose a new topic model, the Structural Topic Model, which simultaneously discovers topics and reveals the latent topical structures in text by explicitly modeling topical transitions with a latent first-order Markov chain. Experiment results show that the proposed Structural Topic Model can effectively discover topical structures in text, and that the identified structures significantly improve the performance of tasks such as sentence annotation and sentence ordering.

1 Introduction

A great amount of effort has recently been made in applying statistical topic models (Hofmann, 1999; Blei et al., 2003) to explore word co-occurrence patterns, i.e., topics, embedded in documents. Topic models have become important building blocks of many interesting applications (see, e.g., Blei and Jordan, 2003; Blei and Lafferty, 2007; Mei et al., 2007; Lu and Zhai, 2008).
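The abstract's central idea, explicitly modeling sentence-to-sentence topical transitions with a first-order Markov chain, can be sketched as follows. This is a minimal illustration only: the topic count, initial distribution, and transition probabilities below are made-up assumptions, not parameters from the paper, which learns such quantities from data.

```python
import random

# Illustrative (hypothetical) parameters: 3 topics.
# INITIAL[i]        = P(first sentence has topic i)
# TRANSITIONS[i][j] = P(next sentence has topic j | current topic is i)
INITIAL = [0.5, 0.3, 0.2]
TRANSITIONS = [
    [0.7, 0.2, 0.1],  # topic 0 tends to persist across sentences
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
]

def sample_topic_sequence(num_sentences, rng=random):
    """Sample a sentence-level topic sequence from a first-order
    Markov chain: each topic depends only on the previous topic."""
    topics = [rng.choices(range(3), weights=INITIAL)[0]]
    for _ in range(num_sentences - 1):
        prev = topics[-1]
        topics.append(rng.choices(range(3), weights=TRANSITIONS[prev])[0])
    return topics

print(sample_topic_sequence(5))
```

Because the chain favors self-transitions in this toy setting, sampled sequences tend to contain runs of the same topic, mimicking the topical cohesion in natural text that the model aims to capture.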
In general, topic models can discover word clustering patterns in documents and project each document to a latent topic space formed by such word clusters. However, the topical structure in a document, i.e., the internal dependency between the topics, is generally not captured due to the exchangeability assumption (Blei et al., 2003), i.e., the document generation probabilities are invariant to content permutation. In reality, natural language text rarely consists of isolated, unrelated sentences but