HLT-NAACL 2003 Workshop: Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, pp. 88-95, Edmonton, May-June 2003

POS-Tagger for English-Vietnamese Bilingual Corpus

Dinh Dien
Information Technology Faculty of Vietnam National University of HCMC, 20 C2 Hoang Hoa Tham, Ward 12, Tan Binh Dist., HCM City, Vietnam
ddien@saigonnet.vn

Hoang Kiem
Center of Information Technology Development of Vietnam National University of HCMC, 227 Nguyen Van Cu, District 5, HCM City
hkiem@citd.edu.vn

Abstract

Corpus-based Natural Language Processing (NLP) tasks for such popular languages as English, French, etc. have been well studied with satisfactory achievements. In contrast, corpus-based NLP tasks for unpopular languages (e.g. Vietnamese) are at a deadlock due to the absence of annotated training data for these languages. Furthermore, hand-annotation of even reasonably well-determined features such as part-of-speech (POS) tags has proved to be labor intensive and costly. In this paper, we suggest a solution to partially overcome the annotated resource shortage in Vietnamese by building a POS-tagger for an automatically word-aligned English-Vietnamese parallel corpus (named EVC). This POS-tagger made use of the Transformation-Based Learning (TBL) method to bootstrap the POS-annotation results of the English POS-tagger by exploiting the POS-information of the corresponding Vietnamese words via their word alignments in EVC. Then we directly project POS-annotations from the English side to Vietnamese via the available word alignments.
This POS-annotated Vietnamese corpus will be manually corrected to become annotated training data for Vietnamese NLP tasks such as POS-tagger, Phrase-Chunker, Parser, Word-Sense Disambiguator, etc.

1 Introduction

POS-tagging is the task of assigning to each word of a text the proper POS tag in its context of appearance. Although each word can be classified into various POS-tags, in a defined context it can only be attributed with a definite POS. As an example, in this sentence 2 can