Scientific report: "Minimized Models for Unsupervised Part-of-Speech Tagging"

Sujith Ravi and Kevin Knight
University of Southern California, Information Sciences Institute
Marina del Rey, California 90292
sravi knight @

Abstract

We describe a novel method for the task of unsupervised POS tagging with a dictionary, one that uses integer programming to explicitly search for the smallest model that explains the data, and then uses EM to set parameter values. We evaluate our method on a standard test corpus using different standard tagsets (a 45-tagset as well as a smaller 17-tagset), and show that our approach performs better than existing state-of-the-art systems in both settings.

1 Introduction

In recent years, we have seen increased interest in using unsupervised methods for attacking different NLP tasks like part-of-speech (POS) tagging. The classic Expectation Maximization (EM) algorithm has been shown to perform poorly on POS tagging when compared to other techniques, such as Bayesian methods.

In this paper, we develop new methods for unsupervised part-of-speech tagging. We adopt the problem formulation of Merialdo (1994), in which we are given a raw word sequence and a dictionary of legal tags for each word type. The goal is to tag each word token so as to maximize accuracy against a gold tag sequence. Whether this is a realistic problem set-up is arguable, but an interesting collection of methods and results has accumulated around it, and these can be clearly compared with one another.
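The problem set-up described above (a raw word sequence, a dictionary of legal tags per word type, and accuracy measured against a gold tag sequence) can be illustrated with a small sketch. The toy dictionary, the sentence, and the function names `legal_taggings` and `accuracy` are assumptions made for illustration only; they are not taken from the paper, and the sketch shows the task definition, not the authors' tagging method.

```python
from itertools import product

# Hypothetical toy tag dictionary (an assumption, not the paper's dictionary):
# each word type maps to the set of tags it is allowed to take.
TAG_DICT = {
    "they": {"PRP"},
    "can": {"MD", "NN", "VB"},
    "fish": {"NN", "VB"},
}


def legal_taggings(tokens, tag_dict):
    """Yield every tag sequence that the dictionary permits for the tokens."""
    choices = [sorted(tag_dict[word]) for word in tokens]
    yield from product(*choices)


def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


tokens = ["they", "can", "fish"]
gold = ("PRP", "MD", "VB")  # hypothetical gold sequence for the toy sentence

candidates = list(legal_taggings(tokens, TAG_DICT))
# 1 * 3 * 2 = 6 candidate taggings for this sentence
best = max(candidates, key=lambda tags: accuracy(tags, gold))
```

An unsupervised tagger has no access to `gold` when choosing among the candidates; the gold sequence is used only afterwards to score the chosen tagging.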
We use the standard test set for this task, a 24,115-word subset of the Penn Treebank, for which a gold tag sequence is available. There are 5,878 word types in this test set. We use the standard tag dictionary, consisting of 57,388 word/tag pairs derived from the entire Penn Treebank. 8,910 dictionary entries are relevant to the 5,878 word types in the test set. Per-token ambiguity is about [...] tags/token, yielding approximately 10^6425 possible ways to tag the data. There are 45 distinct grammatical tags.
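The number of possible ways to tag a corpus under a tag dictionary is the product, over all tokens, of the number of legal tags for each token. Because that product is far too large to enumerate or even to store as a float, it is usually handled as a base-10 logarithm. The sketch below assumes that every token may independently take any tag its dictionary entry allows; the toy dictionary and the function name `log10_num_taggings` are hypothetical.

```python
import math

# Hypothetical toy tag dictionary (an assumption for illustration).
TAG_DICT = {
    "they": {"PRP"},
    "can": {"MD", "NN", "VB"},
    "fish": {"NN", "VB"},
}


def log10_num_taggings(tokens, tag_dict):
    """Return log10 of the number of dictionary-legal taggings.

    The count is the product of per-token tag choices, so its logarithm
    is the sum of the per-token logarithms.
    """
    return sum(math.log10(len(tag_dict[word])) for word in tokens)


tokens = ["they", "can", "fish"]
exponent = log10_num_taggings(tokens, TAG_DICT)
# 1 * 3 * 2 = 6 taggings, so exponent equals log10(6)
```

Applied to a full corpus and its dictionary, the returned value is the exponent in a statement of the form "approximately 10^x possible ways to tag the data".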
