Scientific report: "Statistical Sense Disambiguation with Relatively Small Corpora Using Dictionary Definitions"

Alpha K. Luk
Microsoft Institute, North Ryde, NSW 2113, Australia
t-alphal@
Department of Computing, Macquarie University, NSW 2109, Australia

Abstract

Corpus-based sense disambiguation methods, like most other statistical NLP approaches, suffer from the problem of data sparseness. In this paper, we describe an approach which overcomes this problem using dictionary definitions. Using the definition-based conceptual co-occurrence data collected from the relatively small Brown corpus, our sense disambiguation system achieves an average accuracy comparable to human performance given the same contextual information.

1 Introduction

Previous corpus-based sense disambiguation methods require substantial amounts of sense-tagged training data (Kelly and Stone 1975; Black 1988; Hearst 1991) or aligned bilingual corpora (Brown et al. 1991; Dagan 1991; Gale et al. 1992). Yarowsky (1992) introduces a thesaurus-based approach to statistical sense disambiguation which works on monolingual corpora without the need for sense-tagged training data. By collecting statistical data of word occurrences in the context of different thesaurus categories from a relatively large corpus (10 million words), the system can identify salient words for each category. Using these salient words, the system is able to disambiguate polysemous words with respect to thesaurus categories.
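As a rough illustration of the salient-word idea described above, the following Python sketch scores context words for one category by the log ratio of their relative frequency near category members to their relative frequency in the whole corpus. The log-ratio scoring, the function name `salience_scores`, the `window` and `min_count` parameters, and the toy corpus are all assumptions made for illustration; they are not taken from this paper's text.

```python
import math
from collections import Counter


def salience_scores(tokens, category_words, window=2, min_count=1):
    """Score context words for one category (hypothetical helper).

    Assumed scoring: log(P(w | near a category word) / P(w)).
    Words seen fewer than `min_count` times in context are skipped,
    since their estimates would rest on too little evidence.
    """
    corpus_counts = Counter(tokens)
    n_corpus = len(tokens)

    # Count every word that appears within `window` tokens of a category word.
    context_counts = Counter()
    for i, tok in enumerate(tokens):
        if tok in category_words:
            lo = max(0, i - window)
            hi = min(n_corpus, i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    context_counts[tokens[j]] += 1

    n_context = sum(context_counts.values())
    scores = {}
    for word, count in context_counts.items():
        if count < min_count:
            continue
        p_in_context = count / n_context
        p_overall = corpus_counts[word] / n_corpus
        scores[word] = math.log(p_in_context / p_overall)
    return scores


# Toy corpus (illustrative only); "bird" stands in for a thesaurus category.
toy_tokens = ("the bird ate fish the tool cut steel "
              "the bird ate seed").split()
toy_scores = salience_scores(toy_tokens, {"bird"}, window=1)
```

In this toy run, "ate" scores higher than "the" because it is concentrated near the category word, whereas "the" is spread across the corpus. With a small corpus most context counts are tiny, so raising `min_count` discards many words outright.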
Statistical approaches like these generally suffer from the problem of data sparseness. To estimate the salience of a word with reasonable accuracy, the system needs the word to have a significant number of occurrences in the corpus. Having large corpora will help, but some words are simply too infrequent to make a significant statistical contribution even in a rather large corpus. Moreover, huge corpora are not generally available in all domains, and storage and processing of very large corpora can be problematic in some ... In this ...
