Scientific report: "Large-Coverage Root Lexicon Extraction for Hindi"

Cohan Sujay Carlos, Monojit Choudhury, Sandipan Dandapat
Microsoft Research India
monojitc@

Abstract

This paper describes a method, using morphological rules and heuristics, for the automatic extraction of large-coverage lexicons of stems and root word-forms from a raw text corpus. We cast the problem of high-coverage lexicon extraction as one of stemming followed by root word-form selection. We examine the use of POS tagging to improve precision and recall of stemming and thereby the coverage of the lexicon. We present accuracy, precision and recall scores for the system on a Hindi corpus.

1 Introduction

Large-coverage morphological lexicons are an essential component of morphological analysers. Morphological analysers find application in language processing systems for tasks like tagging, parsing and machine translation. While raw text is an abundant and easily accessible linguistic resource, high-coverage morphological lexicons are scarce or unavailable in Hindi, as in many other languages (Clement et al., 2004). Thus, the development of better algorithms for the extraction of morphological lexicons from raw text corpora is a task of considerable importance. A root word-form lexicon is an intermediate stage in the creation of a morphological lexicon.
In this paper, we consider the problem of extracting a large-coverage root word-form lexicon for the Hindi language, a highly inflectional and moderately agglutinative Indo-European language spoken widely in South Asia. Since a POS tagger, another basic tool, was available along with POS-tagged data to train it, and since the error patterns indicated that POS tagging could greatly improve the accuracy of the lexicon, we used the POS tagger in our experiments on lexicon extraction. Previous work in morphological lexicon extraction from a raw corpus often does not achieve very high precision and recall (de Lima, 1998; Oliver and Tadic, 2004). In some previous work the process .
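
The framing given in the abstract, stemming followed by root word-form selection, can be illustrated with a toy sketch. This is not the authors' system: the suffix inventory, the root-suffix priority list, the "most distinct suffixes" heuristic, and the function names `candidate_splits`, `stem_corpus` and `select_roots` are all assumptions made for illustration, and the Hindi forms are romanized for readability. No POS information is used here.

```python
from collections import defaultdict

# Hypothetical romanized suffix inventory; NOT the paper's rule set.
SUFFIXES = ["on", "a", "e", ""]
# Hypothetical suffixes that mark a root word-form, in priority order.
ROOT_SUFFIXES = ["a", ""]


def candidate_splits(word):
    """Yield every (stem, suffix) split of `word` licensed by SUFFIXES."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            yield word[: len(word) - len(suffix)], suffix


def stem_corpus(words):
    """Map each word to the candidate stem attested with the most suffixes."""
    vocab = set(words)
    suffixes_by_stem = defaultdict(set)
    for word in vocab:
        for stem, suffix in candidate_splits(word):
            suffixes_by_stem[stem].add(suffix)
    stems = {}
    for word in vocab:
        # Ties are broken in favour of the longer stem.
        stems[word] = max(
            (stem for stem, _ in candidate_splits(word)),
            key=lambda s: (len(suffixes_by_stem[s]), len(s)),
        )
    return stems


def select_roots(stems):
    """Pick one attested root word-form per stem, if any is attested."""
    forms_by_stem = defaultdict(set)
    for word, stem in stems.items():
        forms_by_stem[stem].add(word)
    roots = {}
    for stem, forms in forms_by_stem.items():
        for suffix in ROOT_SUFFIXES:
            if stem + suffix in forms:
                roots[stem] = stem + suffix
                break
    return roots


corpus = ["ladka", "ladke", "ladkon", "ghar", "gharon"]
stems = stem_corpus(corpus)
print(select_roots(stems))
```

On this toy corpus, the three inflected forms of "ladka" (boy) share the stem "ladk", and "ghar" (house) is its own stem. A stem with no attested root word-form is simply left out of the result; a real system would need a policy for that case.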
