TAILIEUCHUNG - Scientific report: "Automatic Part-of-Speech Tagging for Bengali: An Approach for Morphologically Rich Languages in a Poor Resource Scenario"

Sandipan Dandapat, Sudeshna Sarkar, Anupam Basu
Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur, India 721302
sandipan sudeshna @

Abstract

This paper describes our work on building a Part-of-Speech (POS) tagger for Bengali. We have used Hidden Markov Model (HMM) and Maximum Entropy (ME) based stochastic taggers. Bengali is a morphologically rich language, and our taggers make use of morphological and contextual information about the words. Since only a small labeled training set is available (45,000 words), a simple stochastic approach does not yield very good results. In this work, we have studied the effect of using a morphological analyzer to improve the performance of the tagger. We find that the use of morphology helps improve the accuracy of the tagger, especially when less tagged data is available.

1 Introduction

Part-of-Speech (POS) taggers for natural language texts have been developed using linguistic rules, stochastic models, and a combination of both (hybrid taggers). Stochastic models (Cutting et al., 1992; Dermatas et al., 1995; Brants, 2000) have been widely used in POS tagging because of their simplicity and language independence. Among stochastic models, bi-gram and tri-gram Hidden Markov Models (HMM) are quite popular.
Development of a high-accuracy stochastic tagger requires a large amount of annotated text. Stochastic taggers with more than 95% word-level accuracy have been developed for English, German and other European languages for which large labeled data sets are available. Our aim here is to develop a stochastic POS tagger for Bengali, but we are limited by the lack of a large annotated corpus for Bengali. Simple HMM models do not achieve high accuracy when the training set is small. In such cases, additional information may be coded into the HMM model to achieve higher ...
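
To make the bi-gram HMM tagger mentioned above concrete, the following is a minimal sketch and not the authors' implementation. It assumes add-one smoothing for both transition and emission probabilities and uses a toy English corpus in place of Bengali data. All names (train_bigram_hmm, log_prob, viterbi, toy_corpus) are hypothetical, and no morphological information is used.

```python
import math
from collections import defaultdict

START = "<s>"  # assumed sentence-start pseudo-tag


def train_bigram_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(lambda: defaultdict(int))
    emissions = defaultdict(lambda: defaultdict(int))
    vocabulary = set()
    for sentence in tagged_sentences:
        previous = START
        for word, tag in sentence:
            transitions[previous][tag] += 1
            emissions[tag][word] += 1
            vocabulary.add(word)
            previous = tag
    return transitions, emissions, vocabulary


def log_prob(counts, context, event, num_outcomes):
    """Add-one smoothed log P(event | context) (smoothing is an assumption)."""
    row = counts.get(context, {})
    total = sum(row.values())
    return math.log((row.get(event, 0) + 1) / (total + num_outcomes))


def viterbi(words, transitions, emissions, vocabulary):
    """Return the most probable tag sequence under the bi-gram HMM."""
    if not words:
        return []
    tags = sorted(emissions)
    n_tags = len(tags)
    n_words = len(vocabulary) + 1  # one extra outcome for unseen words
    # best[i][t] = (log-probability of best path ending in t at i, previous tag)
    best = [{}]
    for t in tags:
        score = (log_prob(transitions, START, t, n_tags)
                 + log_prob(emissions, t, words[0], n_words))
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            emit = log_prob(emissions, t, words[i], n_words)
            prev, score = max(
                ((p, best[i - 1][p][0] + log_prob(transitions, p, t, n_tags))
                 for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = (score + emit, prev)
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))


# Toy English data, purely illustrative (not from the paper).
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train_bigram_hmm(toy_corpus)
print(viterbi(["the", "cat", "barks"], *model))
```

With so little training data, almost every probability here is dominated by the smoothing term, which illustrates why a plain HMM struggles in a poor-resource setting and why extra information such as morphology can help.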
