Scientific report: "Extracting and Classifying Urdu Multiword Expressions"

Annette Hautli, Department of Linguistics, University of Konstanz, Germany
Sebastian Sulger, Department of Linguistics, University of Konstanz, Germany

Abstract

This paper describes a method for automatically extracting and classifying multiword expressions (MWEs) for Urdu on the basis of a relatively small unannotated corpus (around […] million tokens). The MWEs are extracted by an unsupervised method and classified into two distinct classes, namely locations and person names. The classification is based on simple heuristics that take the co-occurrence of MWEs with distinct postpositions into account. The resulting classes are evaluated against a hand-annotated gold standard and achieve an f-score of […] and […] for locations and persons, respectively. A target application is the Urdu ParGram grammar, where MWEs are needed to generate a more precise syntactic and semantic analysis.

1 Introduction

Multiword expressions (MWEs) are expressions which can be semantically and syntactically idiosyncratic in nature; acting as a single unit, their meaning is not always predictable from their components. Their identification is therefore an important task for any Natural Language Processing (NLP) application that goes beyond the analysis of pure surface structure, in particular for languages with few other NLP tools available. There is a vast amount of literature on extracting and classifying MWEs automatically; many approaches rely on already available resources that aid during the acquisition process.

In the case of the Indo-Aryan language Urdu, a lack of linguistic resources such as annotated corpora or lexical knowledge bases impedes the task of detecting and classifying MWEs. Nevertheless, statistical measures and language-specific syntactic information can be employed to extract and classify MWEs. Therefore, the method described in this paper can partly overcome the ...
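
The idea of classifying an MWE by the postpositions it co-occurs with can be illustrated with a small sketch. This is not the authors' implementation: the postposition sets, the romanized toy sentences, the function name classify_mwe and the simple majority vote are all assumptions made for illustration. The sketch counts which postposition directly follows each occurrence of a candidate MWE and assigns the class with more votes.

```python
from collections import Counter

# Assumed postposition sets (romanized Urdu); illustrative only,
# not the sets used in the paper.
LOCATION_POSTPOSITIONS = {"mein", "se", "tak"}  # 'in', 'from', 'until'
PERSON_POSTPOSITIONS = {"ne", "ko"}             # ergative, dative/accusative

# Hypothetical toy corpus: one tokenized sentence per list.
SAMPLE_SENTENCES = [
    ["nawaz", "sharif", "ne", "kaha"],
    ["nawaz", "sharif", "ko", "bataya", "gaya"],
    ["woh", "new", "york", "mein", "rehta", "hai"],
    ["new", "york", "se", "lahore", "tak"],
]


def classify_mwe(mwe, sentences, min_count=1):
    """Label an MWE by the postpositions that directly follow it.

    mwe is a tuple of tokens; sentences is a list of token lists.
    Returns 'location', 'person', or 'unknown'.
    """
    n = len(mwe)
    votes = Counter()
    for tokens in sentences:
        # Stop early enough that a token always follows the candidate.
        for i in range(len(tokens) - n):
            if tuple(tokens[i:i + n]) == mwe:
                following = tokens[i + n]
                if following in LOCATION_POSTPOSITIONS:
                    votes["location"] += 1
                elif following in PERSON_POSTPOSITIONS:
                    votes["person"] += 1
    loc, per = votes["location"], votes["person"]
    # Too little evidence, or a tie: leave the candidate unclassified.
    if max(loc, per) < min_count or loc == per:
        return "unknown"
    return "location" if loc > per else "person"


if __name__ == "__main__":
    for candidate in [("nawaz", "sharif"), ("new", "york"), ("rehta", "hai")]:
        print(candidate, classify_mwe(candidate, SAMPLE_SENTENCES))
```

A majority vote is the simplest possible decision rule. Raising min_count would trade recall for precision on a real corpus, where some postpositions (for example "se") are ambiguous between several functions.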
