Optimizing Language Model Information Retrieval System with Expectation Maximization Algorithm

Justin Liang-Te Chiu
Department of Computer Science and Information Engineering, National Taiwan University
1 Roosevelt Rd., Sec. 4, Taipei, Taiwan 106, ROC
b94902009@ntu.edu.tw

Jyun-Wei Huang
Department of Computer Science and Engineering, Yuan Ze University
135 Yuan-Tung Road, Chungli, Taoyuan, Taiwan, ROC
s976017@mail.yzu.edu.tw

Abstract

Statistical language modeling (SLM) has been used in many different domains for decades and has recently also been applied to information retrieval (IR). Documents retrieved using this approach are ranked according to their probability of generating the given query. In this paper, we present a novel approach that employs the generalized Expectation Maximization (EM) algorithm to improve language models by representing their parameters as observation probabilities of Hidden Markov Models (HMM). In the experiments, we demonstrate that our method outperforms standard SLM-based and tf.idf-based methods on TREC 2005 HARD Track data.

1 Introduction

In 1945, soon after the computer was invented, Vannevar Bush wrote a famous article, "As We May Think" (V. Bush, 1996), which formed the basis of research into Information Retrieval (IR). The pioneers in IR developed two models for ranking: the vector space model (G. Salton and M. J. McGill, 1986) and the probabilistic model (S. E. Robertson and S. Jones, 1976).
Since then, classical probabilistic models of relevance have been widely studied. For example, Robertson (S. E. Robertson and S. Walker, 1994; S. E. Robertson, 1977) modeled word occurrences in relevant and non-relevant classes and ranked documents according to the probability that they belong to the relevant class. In 1998, Ponte and Croft (1998) proposed a language modeling framework that opened a new point of view in IR. In this approach, they gave up the model of relevance; instead, they treated query generation as random sampling from each document model. The retrieval