TAILIEUCHUNG - Scientific report: "A Comparison of Document, Sentence, and Term Event Spaces"

Catherine Blake
School of Information and Library Science
University of North Carolina at Chapel Hill
North Carolina, NC 27599-3360
cablake@

Abstract

The trend in information retrieval systems is from document to sub-document retrieval, such as sentences in a summarization system and words or phrases in a question-answering system. Despite this trend, systems continue to model language at a document level using the inverse document frequency (IDF). In this paper, we compare and contrast IDF with inverse sentence frequency (ISF) and inverse term frequency (ITF). A direct comparison reveals that all language models are highly correlated; however, the average ISF and ITF values are and higher than IDF. All language models appeared to follow a power law distribution with a slope coefficient of for documents and for sentences and terms. We conclude with an analysis of IDF stability with respect to random, journal, and section partitions of the 100,830 full-text scientific articles in our experimental corpus.

1 Introduction

The vector based information retrieval model identifies relevant documents by comparing query terms with terms from a document corpus.
The most common corpus weighting scheme is term frequency (TF) x inverse document frequency (IDF), where TF is the number of times a term appears in a document and IDF reflects the distribution of terms within the corpus (Salton and Buckley, 1988). Ideally, the system should assign the highest weights to terms with the most discriminative power. One component of the corpus weight is the language model used. The most common language model is the inverse document frequency (IDF), which considers the distribution of terms between documents (see equation 1). IDF has played a central role in retrieval systems since it was first introduced more than thirty years ago (Sparck Jones, 1972).

$$\mathrm{IDF}(t_i) = \log_2(N) - \log_2(n_i) + 1 \tag{1}$$

$N$ is the total number of corpus documents; $n_i$ is ...
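The TF x IDF weighting described above can be illustrated with a minimal Python sketch. It reads equation 1 as IDF(t_i) = log2(N) - log2(n_i) + 1 and assumes that n_i is the number of documents containing term t_i, which is the usual definition but is cut off in the text here. The function names `idf` and `tf_idf` and the toy corpus are hypothetical and chosen only for illustration; this is not the author's implementation.

```python
import math
from collections import Counter


def idf(documents):
    """Return {term: log2(N) - log2(n_i) + 1} for a list of tokenized documents.

    Assumption: n_i is the number of documents that contain term t_i.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    return {
        term: math.log2(n_docs) - math.log2(n_i) + 1
        for term, n_i in doc_freq.items()
    }


def tf_idf(tokens, idf_weights):
    """Weight each term of one document by raw term frequency times IDF."""
    term_freq = Counter(tokens)
    return {term: count * idf_weights[term] for term, count in term_freq.items()}


# Toy corpus (hypothetical data, for illustration only).
corpus = [
    ["gene", "expression", "in", "yeast"],
    ["gene", "therapy", "in", "mice"],
    ["protein", "folding", "in", "yeast"],
    ["gene", "gene", "regulation", "in", "cells"],
]

weights = idf(corpus)                     # corpus-level IDF weights
doc_weights = tf_idf(corpus[3], weights)  # TF x IDF weights for the last document
```

In this toy corpus, "in" occurs in every document and receives the lowest IDF, while a term confined to a single document, such as "therapy", receives the highest. That matches the goal of giving the largest weights to the most discriminative terms. Passing a list of tokenized sentences instead of documents to the same function would give a sentence-level analogue, though whether that matches the paper's ISF definition is an assumption.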
