TAILIEUCHUNG - Scientific report: "Manually Annotated Hungarian Corpus"

Manually Annotated Hungarian Corpus

Zoltán Alexin, Department of Informatics, University of Szeged, alexin
Tibor Gyimóthy, Research Group on Artificial Intelligence at University of Szeged, gyimothy@
Csaba Hatvani, Department of Informatics, University of Szeged, hacso@
László Tihanyi, MorphoLogic, Budapest
János Csirik, Department of Informatics, University of Szeged, csirik@
Károly Bibok, Slavic Institute, University of Szeged, kbibok@
Gabor Proszeky, MorphoLogic, Budapest, proszeky@

Abstract

This paper presents the results of a two-year project during which a consortium of the University of Szeged and MorphoLogic Ltd., Budapest, developed a morpho-syntactically parsed and annotated (disambiguated) corpus for Hungarian. For morpho-syntactic encoding, the Hungarian version of MSD (Morpho-Syntactic Description) was used. The corpus contains texts from five topic areas: schoolchildren's compositions, fiction, computer-related texts, news, and legal texts. During annotation, linguists checked the morpho-syntactic parsing of each word. The consortium's researchers also studied how machine learning algorithms can find disambiguation rules for part-of-speech tagging. Because the corpus reaches 1 million text words, not counting punctuation characters, it may serve as a reference source for numerous future research applications. The corpus can be obtained freely via the Internet for research and educational purposes.

1 Introduction

The work began in 1998, when the authors started a research project on applying ILP (Inductive Logic Programming) learning methods to part-of-speech tagging. This research was done within the framework of a European ESPRIT project (LTR 20237, ILP2), where the first studies were based on the so-called TELRI corpus (Erjavec et al., 1998). Since the corpus annotation had several deficiencies and
