Incorporating Lexical Priors into Topic Models

Jagadeesh Jagarlamudi, University of Maryland, College Park, USA, jags@umiacs.umd.edu
Hal Daumé III, University of Maryland, College Park, USA, hal@umiacs.umd.edu
Raghavendra Udupa, Microsoft Research, Bangalore, India, raghavu@microsoft.com

Abstract

Topic models have great potential for helping users understand document corpora. This potential is stymied by their purely unsupervised nature, which often leads to topics that are neither entirely meaningful nor effective in extrinsic tasks (Chang et al., 2009). We propose a simple and effective way to guide topic models to learn topics of specific interest to a user. We achieve this by providing sets of seed words that a user believes are representative of the underlying topics in a corpus. Our model uses these seeds to improve both topic-word distributions (by biasing topics to produce appropriate seed words) and document-topic distributions (by biasing documents to select topics related to the seed words they contain). Extrinsic evaluation on a document clustering task reveals a significant improvement when using seed information, even over other models that use seed information naively.

1 Introduction

Topic models such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) have emerged as a powerful tool to analyze document collections in an unsupervised fashion.
When fit to a document collection, topic models implicitly use document-level co-occurrence information to group semantically related words into a single topic. Since the objective of these models is to maximize the probability of the observed data, they have a tendency to explain only the most obvious and superficial aspects of a corpus. They effectively sacrifice performance on rare topics to do a better job in modeling frequently occurring words. The user is then left with a skewed impression of the corpus, and perhaps one that does not perform well in extrinsic tasks. To illustrate this problem, we ran LDA on the most .