Scientific report: "Weakly Supervised Named Entity Transliteration and Discovery from Multilingual Comparable Corpora"

Alexandre Klementiev, Dan Roth
Dept. of Computer Science, University of Illinois, Urbana, IL 61801
klementi danr @

Abstract

Named Entity recognition (NER) is an important part of many natural language processing tasks. Current approaches often employ machine learning techniques and require supervised data. However, many languages lack such resources. This paper presents an (almost) unsupervised learning algorithm for automatic discovery of Named Entities (NEs) in a resource-free language, given a bilingual corpus in which it is weakly temporally aligned with a resource-rich language. NEs have similar time distributions across such corpora, and often some of the tokens in a multi-word NE are transliterated. We develop an algorithm that exploits both observations iteratively. The algorithm makes use of a new frequency-based metric for time distributions and a resource-free discriminative approach to transliteration. Seeded with a small number of transliteration pairs, our algorithm discovers multi-word NEs and takes advantage of a dictionary (if one exists) to account for translated or partially translated NEs. We evaluate the algorithm on an English-Russian corpus and show a high level of NE discovery in Russian.
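The observation that NEs have similar time distributions across weakly aligned corpora can be illustrated with a small sketch. The abstract does not spell out the frequency-based metric, so the code below is only an assumption: it compares the low-frequency discrete Fourier transform coefficients of normalized per-period occurrence counts. The function names, the `num_coeffs` parameter, and the toy counts are all hypothetical and are not taken from the paper.

```python
import cmath
import math


def normalize(counts):
    """Scale raw per-period counts so they sum to 1 (all-zero input stays zero)."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def dft(signal, num_coeffs):
    """Return the first `num_coeffs` discrete Fourier transform coefficients."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(num_coeffs)
    ]


def temporal_distance(counts_a, counts_b, num_coeffs=4):
    """Euclidean distance between low-frequency DFT coefficients.

    Assumed stand-in for a frequency-based metric; smaller means more similar.
    """
    fa = dft(normalize(counts_a), num_coeffs)
    fb = dft(normalize(counts_b), num_coeffs)
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(fa, fb)))


def rank_candidates(source_counts, candidates, num_coeffs=4):
    """Sort candidate words by temporal distance to the source word."""
    return sorted(
        candidates,
        key=lambda word: temporal_distance(source_counts, candidates[word], num_coeffs),
    )


# Hypothetical toy data: occurrences per time period in each corpus.
english_ne_counts = [0, 1, 5, 9, 4, 1, 0, 0]
russian_candidates = {
    "candidate_a": [0, 2, 10, 18, 8, 2, 0, 0],  # same burst, different volume
    "candidate_b": [3, 3, 3, 3, 3, 3, 3, 3],    # flat usage, e.g. a common word
}

print(rank_candidates(english_ne_counts, russian_candidates))
```

Normalizing the counts first makes the comparison depend on the shape of the distribution over time, not on how large each corpus is. In an iterative scheme like the one the abstract describes, a ranking of this kind would be combined with a transliteration score; that combination is not shown here.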
1 Introduction

Named Entity recognition has been getting much attention in NLP research in recent years, since it is seen as a significant component of higher-level NLP tasks such as information distillation and question answering. Most successful approaches to NER employ machine learning techniques, which require supervised training data. However, for many languages these resources do not exist. Moreover, it is often difficult to find experts in these languages, both for the expensive annotation effort and even for language-specific clues. On the other hand, comparable multilingual data (such as multilingual news streams) are becoming increasingly available (see section 4).
