TAILIEUCHUNG - Scientific report: "Unsupervised Part-of-Speech Tagging Employing Efficient Graph Clustering"

Chris Biemann
University of Leipzig, NLP Department
Augustusplatz 10/11, 04109 Leipzig, Germany
biem@

Abstract

An unsupervised part-of-speech (POS) tagging system that relies on graph clustering methods is described. Unlike in current state-of-the-art approaches, the kind and number of different tags is generated by the method itself. We compute and merge two partitionings of word graphs: one based on context similarity of high frequency words, another on log-likelihood statistics for words of lower frequencies. Using the resulting word clusters as a lexicon, a Viterbi POS tagger is trained, which is refined by a morphological component. The approach is evaluated on three different languages by measuring agreement with existing taggers.

1 Introduction

Motivation

Assigning syntactic categories to words is an important pre-processing step for most NLP applications. Essentially, two things are needed to construct a tagger: a lexicon that contains tags for words and a mechanism to assign tags to running words in a text. There are words whose tags depend on their use. Further, we also need to be able to tag previously unseen words. Lexical resources have to offer the possible tags, and our mechanism has to choose the appropriate tag based on the context.
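The division of labour described above, a lexicon that proposes possible tags and a mechanism that picks one from context, can be illustrated with a small Viterbi decoder. This is only a minimal sketch under stated assumptions, not the tagger of the paper: the function name `viterbi_tag`, the toy tag names, the bigram transition table, and the `floor` smoothing value are all hypothetical, emission probabilities are left out, and the lexicon is used purely as a hard constraint on the candidate tags. Words missing from the lexicon may take any tag, so context alone decides.

```python
import math

def viterbi_tag(words, lexicon, trans, all_tags, start="<s>", floor=1e-6):
    """Return the most probable tag sequence under a bigram tag model.

    lexicon  : dict word -> set of tags the word may take (hypothetical toy data)
    trans    : dict (previous_tag, tag) -> transition probability
    all_tags : candidate tags for words that are not in the lexicon
    floor    : probability assumed for transitions missing from `trans`
    """
    if not words:
        return []

    def candidates(word):
        # Unseen words are allowed every tag; context must disambiguate.
        return lexicon.get(word, all_tags)

    def logp(prev, tag):
        return math.log(trans.get((prev, tag), floor))

    # best[tag] = (log score of the best path ending in tag, that path)
    best = {t: (logp(start, t), [t]) for t in candidates(words[0])}
    for word in words[1:]:
        new_best = {}
        for tag in candidates(word):
            score, prev_path = max(
                ((s + logp(p, tag), path) for p, (s, path) in best.items()),
                key=lambda item: item[0],
            )
            new_best[tag] = (score, prev_path + [tag])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Toy data: "can" is ambiguous, "blorp" is not in the lexicon at all.
ALL_TAGS = {"DET", "NOUN", "VERB"}
LEXICON = {"the": {"DET"}, "can": {"NOUN", "VERB"}, "rusts": {"VERB"}}
TRANS = {
    ("<s>", "DET"): 0.8,
    ("DET", "NOUN"): 0.9,
    ("DET", "VERB"): 0.05,
    ("NOUN", "VERB"): 0.7,
    ("VERB", "VERB"): 0.05,
}

print(viterbi_tag(["the", "can", "rusts"], LEXICON, TRANS, ALL_TAGS))
print(viterbi_tag(["the", "blorp", "rusts"], LEXICON, TRANS, ALL_TAGS))
```

In both calls the context decides the middle tag: the ambiguous word and the unseen word are each tagged by the transition scores of their neighbours. A full tagger would also weight each candidate by an emission probability.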
Given a sufficient amount of manually tagged text, several approaches have demonstrated the ability to learn the instance of a tagging mechanism from manually labelled data and apply it successfully to unseen data. Those high-quality resources are typically unavailable for many languages and their creation is labour-intensive. We will describe an alternative needing much less human intervention. In this work, steps are undertaken to derive a lexicon of syntactic categories from unstructured text without prior linguistic knowledge. We employ two different techniques, one for high- and medium-frequency terms, one for medium- and low-frequency
