Scientific report: "Weakly Supervised Part-of-Speech Tagging for Morphologically-Rich, Resource-Scarce Languages"

Kazi Saidul Hasan and Vincent Ng
Human Language Technology Research Institute
University of Texas at Dallas
Richardson, TX 75083-0688
saidul vince @

Abstract

This paper examines unsupervised approaches to part-of-speech (POS) tagging for morphologically-rich, resource-scarce languages, with an emphasis on Goldwater and Griffiths's (2007) fully-Bayesian approach originally developed for English POS tagging. We argue that existing unsupervised POS taggers unrealistically assume as input a perfect POS lexicon, and consequently, we propose a weakly supervised fully-Bayesian approach to POS tagging, which relaxes the unrealistic assumption by automatically acquiring the lexicon from a small amount of POS-tagged data. Since such relaxation comes at the expense of a drop in tagging accuracy, we propose two extensions to the Bayesian framework and demonstrate that they are effective in improving a fully-Bayesian POS tagger for Bengali, our representative morphologically-rich, resource-scarce language.

1 Introduction

Unsupervised POS tagging requires neither manual encoding of tagging heuristics nor the availability of data labeled with POS information. Rather, an unsupervised POS tagger operates by only assuming as input a POS lexicon, which consists of a list of possible POS tags for each word.
As we can see from the partial POS lexicon for English in Figure 1, "the" is unambiguous with respect to POS tagging, since it can only be a determiner (DT), whereas "sting" is ambiguous, since it can be a common noun (NN), a proper noun (NNP), or a verb (VB). In other words, the lexicon imposes constraints on the possible POS tags of each word, and such constraints are then used by an unsupervised tagger to label a new sentence. Conceivably, tagging accuracy decreases with the increase in ambiguity: unambiguous words ...

Word      POS tag(s)
running   NN, JJ
sting     NN, NNP, VB
the       DT

Figure 1: A partial lexicon for English
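
The role of the lexicon can be made concrete with a small sketch. The Python below stores the partial lexicon of Figure 1 as a mapping from each word to its permitted tags, and then shows one simple way a lexicon might be collected from a handful of tagged sentences, in the spirit of the weakly supervised setting described in the abstract. The function names (allowed_tags, is_ambiguous, acquire_lexicon) and the toy tagged sentences are hypothetical, chosen only for illustration; this is not the authors' acquisition procedure.

```python
from collections import defaultdict

# Partial lexicon copied from Figure 1: each word maps to its allowed POS tags.
LEXICON = {
    "running": {"NN", "JJ"},
    "sting": {"NN", "NNP", "VB"},
    "the": {"DT"},
}


def allowed_tags(word, lexicon):
    """Return the tags the lexicon permits for `word` (empty set if unlisted)."""
    return lexicon.get(word, set())


def is_ambiguous(word, lexicon):
    """A word is ambiguous when the lexicon lists more than one tag for it."""
    return len(allowed_tags(word, lexicon)) > 1


def acquire_lexicon(tagged_sentences):
    """Collect, for every word, the set of tags it carries in the tagged data.

    Assumption: a word's possible tags are exactly those observed for it,
    so a small sample will generally yield an incomplete lexicon.
    """
    lexicon = defaultdict(set)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            lexicon[word].add(tag)
    return dict(lexicon)


# Hypothetical toy data standing in for "a small amount of POS-tagged data".
tagged = [
    [("the", "DT"), ("sting", "NN")],
    [("sting", "VB"), ("the", "DT"), ("running", "JJ")],
]
acquired = acquire_lexicon(tagged)
```

Under these assumptions, the acquired entry for "sting" lacks the NNP tag that Figure 1 lists, which illustrates why a lexicon gathered from limited data is imperfect compared with the perfect lexicon that existing unsupervised taggers assume.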
