Scientific report: "Automatic Discovery of Named Entity Variants – Grammar-driven Approaches to Non-alphabetical Transliterations"

Chu-Ren Huang, Institute of Linguistics, Academia Sinica, Taiwan, churenhuang@
Petr Simon, Institute of Linguistics, Academia Sinica, Taiwan, sim@
Shu-Kai Hsieh, DoFLAL, NIU, Taiwan, shukai@

Abstract

Identification of transliterated names is a particularly difficult task in Named Entity Recognition (NER), especially in the Chinese context. Of all possible variations of transliterated named entities, the difference between PRC and Taiwan is the most prevalent and most challenging. In this paper, we introduce a novel approach to the automatic extraction of diverging transliterations of foreign named entities by bootstrapping co-occurrence statistics from a tagged and segmented Chinese corpus. A preliminary experiment yields promising results and shows the approach's potential in NLP applications.

1 Introduction

Named Entity Recognition (NER) is one of the most difficult problems in NLP and Document Understanding. In the field of Chinese NER, several approaches have been proposed to recognize personal names, date and time expressions, and monetary and percentage expressions. However, the discovery of transliteration variations has not been well studied in Chinese NER. This is perhaps because transliterated forms in a non-alphabetic language such as Chinese are opaque and not easy to compare.
On the one hand, there is often more than one way to transliterate a foreign name. On the other hand, dialectal differences as well as different transliteration strategies often lead to the same named entity being transliterated differently in different Chinese-speaking communities.

Corpus   Example (Clinton)                Frequency
XIN      (variant 1, characters lost)         24382
CNA      (variant 1, characters lost)           150
XIN      (variant 2, characters lost)             0
CNA      (variant 2, characters lost)        120842

Table 1: Distribution of two transliteration variants for Clinton in two sub-corpora

Of all possible variations, the cross-strait difference between PRC and Taiwan is the most prevalent and most challenging.
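
A distribution like the one in Table 1 can be tallied by counting each variant per sub-corpus over segmented text. The following is only a minimal sketch under stated assumptions, not the authors' procedure: the function name `variant_distribution`, the toy documents, and the two Clinton renderings used as variants are all hypothetical choices made for illustration, and the counts bear no relation to the figures in Table 1.

```python
from collections import Counter


def variant_distribution(docs, variants):
    """Count how often each tracked variant occurs in each sub-corpus.

    docs: iterable of (corpus_label, tokens) pairs; tokens are assumed
          to be already segmented, as in a segmented Chinese corpus.
    variants: set of surface forms to track.
    """
    counts = Counter()
    for corpus, tokens in docs:
        for token in tokens:
            if token in variants:
                counts[(corpus, token)] += 1
    return counts


# Toy, made-up data for illustration only (not taken from the paper's corpus).
DOCS = [
    ("XIN", ["克林顿", "访问", "北京"]),
    ("XIN", ["克林顿", "发表", "讲话"]),
    ("CNA", ["柯林頓", "訪問", "台北"]),
    ("CNA", ["柯林頓", "提到", "克林顿"]),
]
# Assumed variant pair; chosen as an example, not recovered from Table 1.
VARIANTS = {"克林顿", "柯林頓"}

dist = variant_distribution(DOCS, VARIANTS)
for corpus in ("XIN", "CNA"):
    for variant in sorted(VARIANTS):
        # Counter returns 0 for pairs that never occurred.
        print(corpus, variant, dist[(corpus, variant)])
```

A strongly skewed count of this kind, where one form dominates one sub-corpus and is nearly absent from the other, is the pattern Table 1 reports for the two renderings of Clinton.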
