Scientific report: "Unsupervised Learning of Arabic Stemming using a Parallel Corpus"

Monica Rogati
Computer Science Department, Carnegie Mellon University
mrogati@

Scott McCarley
IBM TJ Watson Research Center
jsmc@

Yiming Yang
Language Technologies Institute, Carnegie Mellon University
yiming@

(Author footnote: Work done while a summer intern at IBM TJ Watson Research Center.)

Abstract

This paper presents an unsupervised learning approach to building a non-English (Arabic) stemmer. The stemming model is based on statistical machine translation, and it uses an English stemmer and a small (10K sentences) parallel corpus as its sole training resources. No parallel text is needed after the training phase. Monolingual, unannotated text can be used to further improve the stemmer by allowing it to adapt to a desired domain or genre. Examples and results will be given for Arabic, but the approach is applicable to any language that needs affix removal. Our resource-frugal approach results in agreement with a state-of-the-art proprietary Arabic stemmer built using rules, affix lists, and human-annotated text, in addition to an unsupervised component. Task-based evaluation using Arabic information retrieval indicates an improvement of 22-38% in average precision over unstemmed text, and 96% of the performance of the proprietary stemmer above.

1 Introduction

Stemming is the process of normalizing word variations by removing prefixes and suffixes. From an information retrieval point of view, prefixes and suffixes add little or no additional meaning; in most cases, both the efficiency and effectiveness of text processing applications such as information retrieval and machine translation are improved. Building a rule-based stemmer for a new, arbitrary language is time consuming and requires experts with linguistic knowledge in that particular language. Supervised learning also requires large quantities of labeled data in the target language, and quality declines when using completely ...
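To make the notion of stemming as affix removal concrete, here is a minimal toy sketch in Python. It is not the method of the paper, which learns its stemmer from a parallel corpus rather than from hand-written lists. The affix lists, the minimum stem length, and the function name `strip_affixes` are all assumptions made for this illustration only.

```python
# Toy affix stripper: illustrates what "stemming as affix removal" means.
# The affix lists below are hypothetical examples chosen for this sketch;
# the paper learns its stemmer from data instead of relying on such lists.

PREFIXES = ["وال", "ال", "و"]  # wa+al-, al-, wa- (longest listed first)
SUFFIXES = ["ات", "ون", "ها"]  # -at, -un, -ha
MIN_STEM_LEN = 3  # assumed floor that guards against over-stripping


def strip_affixes(word: str) -> str:
    """Remove at most one listed prefix and at most one listed suffix."""
    stem = word
    # Try prefixes in list order; stop at the first one that applies.
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= MIN_STEM_LEN:
            stem = stem[len(prefix):]
            break
    # Same for suffixes, applied to whatever remains.
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM_LEN:
            stem = stem[:-len(suffix)]
            break
    return stem


if __name__ == "__main__":
    for w in ["الكتاب", "والكتاب", "معلمات", "ولد"]:
        print(w, "->", strip_affixes(w))
```

In this sketch the two surface forms with and without the conjunction reduce to the same string, which is the normalization effect that helps retrieval. The last word is left untouched because stripping its first letter would leave fewer than three characters. A fixed list like this one is exactly the kind of hand-built resource the paper aims to avoid.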
