Scientific report: "Arabic Morphological Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature Ranking"

Ryan Roth, Owen Rambow, Nizar Habash, Mona Diab, and Cynthia Rudin
Center for Computational Learning Systems, Columbia University, New York, NY 10115, USA
{ryanr, rambow, habash, mdiab, rudin}@

Abstract

We investigate the tasks of general morphological tagging, diacritization, and lemmatization for Arabic. We show that for all tasks we consider, both modeling the lexeme explicitly and retuning the weights of individual classifiers for the specific task improve the performance.

1 Previous Work

Arabic is a morphologically rich language: in our training corpus of about 288,000 words we find 3,279 distinct morphological tags, with up to 100,000 possible tags. Because of the large number of tags, it is clear that morphological tagging cannot be construed as a simple classification task. Hajic (2000) is the first to use a dictionary as a source of possible morphological analyses (and hence tags) for an inflected word form. He redefines the tagging task as a choice among the tags proposed by the dictionary, using a log-linear model trained on specific ambiguity classes for individual morphological features. Hajic et al. (2005) implement the approach of Hajic (2000) for Arabic.

In previous work, we follow the same approach (Habash and Rambow, 2005), using SVM classifiers for individual morphological features and a simple combining scheme for choosing among competing analyses proposed by the dictionary. Since the dictionary we use, BAMA (Buckwalter, 2004), also includes diacritics (orthographic marks not usually written), we extend this approach to the diacritization task in Habash and Rambow (2007). The work presented in this paper differs from this previous work in that (a) we introduce a new ...

1 This work was funded under the DARPA GALE program, contract HR0011-06-C-0023. We thank several anonymous reviewers for helpful comments. A longer version of this paper is available as a technical report.
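
The dictionary-constrained tagging idea described in the Previous Work section can be illustrated with a minimal sketch. It assumes that a dictionary has already proposed a small set of candidate analyses for a word and that one classifier per morphological feature has produced a predicted value. The feature names (`pos`, `gen`, `num`), their values, the weights, and the function `choose_analysis` are all hypothetical, and the weighted-agreement score is only one plausible combining scheme, not necessarily the one used by the authors.

```python
from typing import Dict, List

# An analysis maps a morphological feature name to its value.
# Feature names and values here are illustrative assumptions.
Analysis = Dict[str, str]


def choose_analysis(candidates: List[Analysis],
                    predictions: Dict[str, str],
                    weights: Dict[str, float]) -> Analysis:
    """Pick the dictionary candidate that best agrees with the classifiers.

    Each feature classifier "votes" for one value; a candidate earns the
    weight of every feature on which it matches the predicted value.
    Features without an explicit weight default to 1.0.
    """
    def score(analysis: Analysis) -> float:
        return sum(weights.get(feature, 1.0)
                   for feature, value in predictions.items()
                   if analysis.get(feature) == value)

    # max() returns the first candidate in case of a tie.
    return max(candidates, key=score)


# Two hypothetical analyses proposed by the dictionary for one word form.
candidates = [
    {"pos": "N", "gen": "f", "num": "sg"},
    {"pos": "V", "gen": "m", "num": "sg"},
]

# Hypothetical outputs of the per-feature classifiers.
predictions = {"pos": "N", "gen": "m", "num": "sg"}

# Weighting the part-of-speech classifier most heavily favors the noun reading.
pos_heavy = choose_analysis(candidates, predictions,
                            {"pos": 2.0, "gen": 1.0, "num": 0.5})

# Weighting the gender classifier most heavily favors the verb reading.
gen_heavy = choose_analysis(candidates, predictions,
                            {"pos": 0.5, "gen": 2.0, "num": 0.5})
```

Because the choice is restricted to the candidates the dictionary proposes, the classifiers never need to predict one of the thousands of full tags directly. Changing the per-classifier weights can change which candidate wins, which is why tuning those weights for a specific task can matter.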
