Scientific report: "Improved Unsupervised POS Induction through Prototype Discovery"

Omri Abend¹, Roi Reichart², Ari Rappoport¹
¹Institute of Computer Science, ²ICNC
Hebrew University of Jerusalem
omria011roiri arir @

Omri Abend is grateful to the Azrieli Foundation for the award of an Azrieli Fellowship.

Abstract

We present a novel, fully unsupervised algorithm for POS induction from plain text, motivated by the cognitive notion of prototypes. The algorithm first identifies landmark clusters of words, serving as the cores of the induced POS categories. The rest of the words are subsequently mapped to these clusters. We utilize morphological and distributional representations computed in a fully unsupervised manner. We evaluate our algorithm on English and German, achieving the best reported results for this task.

1 Introduction

Part-of-speech (POS) tagging is a fundamental NLP task, used by a wide variety of applications. However, there is no single standard POS tagging scheme, even for English. Schemes vary significantly across corpora, and even more so across languages, creating difficulties in using POS tags across domains and for multi-lingual systems (Jiang et al., 2009). Automatic induction of POS tags from plain text can greatly alleviate this problem, as well as eliminate the efforts incurred by manual annotations. It is also a problem of great theoretical interest. Consequently, POS induction is a vibrant research area (see Section 2).

In this paper we present an algorithm based on the theory of prototypes (Taylor, 2003), which posits that some members in cognitive categories are more central than others. These practically define the category, while the membership of other elements is based on their association with the central members. Our algorithm first clusters words based on a fine morphological representation. It then clusters the most frequent words, defining landmark clusters which constitute the cores of the categories. Finally, it maps the rest of the words to these categories.
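The landmark-then-map idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: it omits the morphological representation entirely, uses plain left/right neighbour counts as the distributional representation, and substitutes a greedy single-pass clustering with a cosine threshold for whatever clustering the paper uses. The names `context_vectors`, `cosine`, `induce_tags`, `n_frequent` and `threshold` are all hypothetical and chosen only for this illustration.

```python
from collections import Counter, defaultdict
import math


def context_vectors(tokens):
    """Map each word to a sparse count vector of its left/right neighbours."""
    vecs = defaultdict(Counter)
    for i, w in enumerate(tokens):
        if i > 0:
            vecs[w]["L:" + tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            vecs[w]["R:" + tokens[i + 1]] += 1
    return vecs


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def induce_tags(tokens, n_frequent=5, threshold=0.5):
    """Return {word: cluster_id}. Assumed toy procedure, not the paper's."""
    freq = Counter(tokens)
    vecs = context_vectors(tokens)
    # Sort by descending frequency, ties broken alphabetically (deterministic).
    ranked = sorted(freq, key=lambda w: (-freq[w], w))
    frequent, rest = ranked[:n_frequent], ranked[n_frequent:]

    # Step 1: build landmark clusters from the most frequent words only.
    clusters = []
    for w in frequent:
        best = max(clusters,
                   key=lambda c: cosine(vecs[w], c["centroid"]),
                   default=None)
        if best is not None and cosine(vecs[w], best["centroid"]) >= threshold:
            best["members"].append(w)
            best["centroid"].update(vecs[w])  # Counter.update adds counts
        else:
            clusters.append({"members": [w], "centroid": Counter(vecs[w])})

    # Step 2: map every remaining word to its most similar landmark cluster.
    assignment = {w: i for i, c in enumerate(clusters) for w in c["members"]}
    for w in rest:
        assignment[w] = max(
            range(len(clusters)),
            key=lambda i: cosine(vecs[w], clusters[i]["centroid"]))
    return assignment


tokens = ("the dog runs . a dog runs . the dog sleeps . "
          "a cat runs . the fox runs . a dog jumps .").split()
tags = induce_tags(tokens, n_frequent=5)
groups = defaultdict(list)
for word, cluster_id in sorted(tags.items()):
    groups[cluster_id].append(word)
for cluster_id in sorted(groups):
    print(cluster_id, groups[cluster_id])
```

In this toy corpus the rare words never take part in forming clusters; they are only attached to a category whose core was fixed by the frequent words, which is the prototype intuition the introduction describes. A rare word that shares no context with any landmark falls back to the first cluster here, a simplification a real system would need to handle.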
