Scientific report: "Probabilistic Document Modeling for Syntax Removal in Text Summarization"

William M. Darling
School of Computer Science, University of Guelph
50 Stone Rd E, Guelph, ON N1G 2W1, Canada
wdarling@

Fei Song
School of Computer Science, University of Guelph
50 Stone Rd E, Guelph, ON N1G 2W1, Canada
fsong@

Abstract

Statistical approaches to automatic text summarization based on term frequency continue to perform on par with more complex summarization methods. To compute useful frequency statistics, however, the semantically important words must be separated from the low-content function words. The standard approach of using an a priori stopword list tends to result in both undercoverage, where syntactical words are seen as semantically relevant, and overcoverage, where words related to content are ignored. We present a generative probabilistic modeling approach to building content distributions for use with statistical multi-document summarization, where the syntax words are learned directly from the data with a Hidden Markov Model and are thereby deemphasized in the term frequency statistics. This approach is compared to both a stopword-list and a POS-tagging approach, and our method demonstrates improved coverage on the DUC 2006 and TAC 2010 datasets using the ROUGE metric.
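To make the baseline concrete, the following is a minimal sketch of term-frequency sentence scoring with an a priori stopword list, the standard approach the abstract contrasts against. It is not the HMM-based method the paper proposes. The stopword list, the function names (`tokenize`, `content_distribution`, `score_sentence`, `summarize`), and the averaging rule for sentence scores are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for illustration).
STOPWORDS = {"the", "a", "of", "to", "and", "is", "in", "on", "are", "by"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def content_distribution(sentences, stopwords=STOPWORDS):
    """Unigram probabilities over the tokens that survive the stopword filter."""
    counts = Counter(
        w for s in sentences for w in tokenize(s) if w not in stopwords
    )
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def score_sentence(sentence, dist):
    """Average probability of the sentence's content words (0.0 if it has none)."""
    words = [w for w in tokenize(sentence) if w in dist]
    return sum(dist[w] for w in words) / len(words) if words else 0.0


def summarize(sentences, n=1):
    """Return the n highest-scoring sentences under the content distribution."""
    dist = content_distribution(sentences)
    ranked = sorted(sentences, key=lambda s: score_sentence(s, dist), reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    docs = ["the cat sat on the mat", "a dog ran", "the cat saw the cat"]
    print(summarize(docs, n=1))
    # "however" is not in the fixed list, so it is counted as a content word.
    print("however" in content_distribution(["however the cat sat"]))
```

The last call illustrates undercoverage: a function word missing from the fixed list ends up in the content distribution. Overcoverage is the opposite failure, where a listed word that carries meaning in a particular corpus is discarded.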
1 Introduction

While the dominant problem in Information Retrieval in the first part of the century was finding relevant information within a datastream that is exponentially growing, the problem has arguably transitioned from finding what we are looking for to sifting through it. We can now be quite confident that search engines like Google will return several pages relevant to our queries, but rarely does one have time to go through the enormous amount of data that is supplied. Therefore, automatic text summarization, which aims at providing a shorter representation of the salient parts of a large amount of information, has been steadily growing in both importance and popularity over the last ...
