Mining Wiki Resources for Multilingual Named Entity Recognition

Alexander E. Richman
Department of Defense
Washington, DC 20310
arichman@

Patrick Schone
Department of Defense
Fort George G. Meade, MD 20755
pjschon@

Abstract

In this paper, we describe a system by which the multilingual characteristics of Wikipedia can be utilized to annotate a large corpus of text with Named Entity Recognition (NER) tags, requiring minimal human intervention and no linguistic expertise. This process, though of value in languages for which resources exist, is particularly useful for less commonly taught languages. We show how the Wikipedia format can be used to identify possible named entities and discuss in detail the process by which we use the Category structure inherent to Wikipedia to determine the named entity type of a proposed entity. We further describe the methods by which English-language data can be used to bootstrap the NER process in other languages. We demonstrate the system by using the generated corpus as training sets for a variant of BBN's IdentiFinder in French, Ukrainian, Spanish, Polish, Russian, and Portuguese, achieving overall F-scores, on independent human-annotated corpora, comparable to a system trained on up to 40,000 words of human-annotated newswire.

1 Introduction

Named Entity Recognition (NER) has long been a major task of natural language processing. Most of the research in the field has been restricted to a few languages, and almost all methods require substantial linguistic expertise, whether creating a rule-based technique specific to a language or manually annotating a body of text to be used as a training set for a statistical engine or machine learning. In this paper, we focus on using the multilingual Wikipedia to automatically create an annotated corpus of text in any given language, with no linguistic expertise required on the part of the user at run time and only English knowledge required during development.
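To make the two core ideas concrete, here is a minimal Python sketch of (a) typing a candidate entity from its English Wikipedia category names and (b) bootstrapping a non-English article through its interlanguage link to English. Everything here is illustrative: `CATEGORY_KEYWORDS`, the helper names, and the toy `langlinks` and `en_categories` dictionaries are assumptions standing in for the paper's much richer set of category- and list-page rules and for real Wikipedia dump or API lookups.

```python
# Hypothetical keyword-to-type rules; the actual system derives its
# mapping from English Wikipedia Category and list-page phrases.
CATEGORY_KEYWORDS = {
    "PERSON": ["births", "deaths", "living people"],
    "ORGANIZATION": ["companies", "organizations", "political parties"],
    "GPE": ["countries", "cities", "capitals"],
}

def classify_by_categories(categories):
    """Guess a named entity type from English category names by simple
    keyword matching (a stand-in for the paper's rule set)."""
    for ne_type, keywords in CATEGORY_KEYWORDS.items():
        for cat in categories:
            if any(kw in cat.lower() for kw in keywords):
                return ne_type
    return None  # no match: leave the candidate entity untyped

def classify_foreign_title(title, langlinks, en_categories):
    """Bootstrap from English: follow the interlanguage link from a
    foreign-language article to its English counterpart, then type it
    from the English page's categories."""
    en_title = langlinks.get(title)
    if en_title is None:
        return None  # no English counterpart to fall back on
    return classify_by_categories(en_categories.get(en_title, []))

# Toy data standing in for Wikipedia dumps or API lookups.
langlinks = {"Parigi": "Paris", "Einstein (it)": "Albert Einstein"}
en_categories = {
    "Paris": ["Capitals in Europe", "Cities in France"],
    "Albert Einstein": ["1879 births", "1955 deaths"],
}

print(classify_foreign_title("Parigi", langlinks, en_categories))        # GPE
print(classify_foreign_title("Einstein (it)", langlinks, en_categories)) # PERSON
```

Once each proposed entity has a type, every wikilinked occurrence in the article text can be emitted with that tag, yielding the automatically annotated training corpus the abstract describes.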
