A Comparison of Document, Sentence, and Term Event Spaces

Catherine Blake
School of Information and Library Science
University of North Carolina at Chapel Hill
North Carolina, NC 27599-3360
cablake@email.unc.edu

Abstract

The trend in information retrieval systems is from document to sub-document retrieval, such as sentences in a summarization system and words or phrases in a question-answering system. Despite this trend, systems continue to model language at a document level using the inverse document frequency (IDF). In this paper, we compare and contrast IDF with inverse sentence frequency (ISF) and inverse term frequency (ITF). A direct comparison reveals that all language models are highly correlated; however, the average ISF and ITF values are 5.5 and 10.4 higher than IDF. All language models appeared to follow a power law distribution with a slope coefficient of 1.6 for documents and 1.7 for sentences and terms. We conclude with an analysis of IDF stability with respect to random, journal, and section partitions of the 100,830 full-text scientific articles in our experimental corpus.

1 Introduction

The vector based information retrieval model identifies relevant documents by comparing query terms with terms from a document corpus.
The most common corpus weighting scheme is the term frequency (TF) x inverse document frequency (IDF), where TF is the number of times a term appears in a document and IDF reflects the distribution of terms within the corpus (Salton and Buckley, 1988). Ideally, the system should assign the highest weights to terms with the most discriminative power.

One component of the corpus weight is the language model used. The most common language model is the Inverse Document Frequency (IDF), which considers the distribution of terms between documents (see equation 1). IDF has played a central role in retrieval systems since it was first introduced more than thirty years ago (Sparck Jones, 1972).

$$\mathrm{IDF}(t_i) = \log_2 N - \log_2 n_i + 1 \qquad (1)$$

$N$ is the total number of corpus documents; $n_i$ is .
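
As an illustration only, equation 1 and the TF x IDF weight can be sketched in a few lines of Python. The text here is cut off before it defines n_i, so the sketch assumes the usual reading of document frequency: n_i is the number of corpus documents that contain term t_i. The function names `idf_weights` and `tf_idf` and the toy corpus are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Map each term to log2(N) - log2(n_i) + 1 (equation 1).

    documents: list of token lists. N is the number of documents.
    Assumption: n_i is the number of documents containing the term.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    return {
        term: math.log2(n_docs) - math.log2(n_i) + 1
        for term, n_i in doc_freq.items()
    }


def tf_idf(tokens, idf):
    """Weight each term in one document by raw term frequency times IDF."""
    tf = Counter(tokens)
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


# Toy corpus of four tokenized "documents" (made up for illustration).
corpus = [
    ["gene", "expression", "gene", "assay"],
    ["protein", "expression", "assay"],
    ["gene", "protein", "assay"],
    ["kinase", "assay"],
]

idf = idf_weights(corpus)
weights = tf_idf(corpus[0], idf)

print(idf["assay"])     # appears in all 4 documents -> 1.0
print(idf["kinase"])    # appears in 1 document      -> 3.0
print(weights["gene"])  # TF 2 x IDF 2.0             -> 4.0
```

With this form of the equation, a term that occurs in every document receives the minimum weight of 1, and the weight grows by one for each halving of the document frequency, which is the sense in which rarer terms are treated as more discriminative.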