Mining Wiki Resources for Multilingual Named Entity Recognition

Alexander E. Richman
Department of Defense
Washington, DC 20310
arichman@

Patrick Schone
Department of Defense
Fort George G. Meade, MD 20755
pjschon@

Abstract

In this paper, we describe a system by which the multilingual characteristics of Wikipedia can be utilized to annotate a large corpus of text with Named Entity Recognition (NER) tags, requiring minimal human intervention and no linguistic expertise. This process, though of value in languages for which resources exist, is particularly useful for less commonly taught languages. We show how the Wikipedia format can be used to identify possible named entities and discuss in detail the process by which we use the Category structure inherent to Wikipedia to determine the named entity type of a proposed entity. We further describe the methods by which English-language data can be used to bootstrap the NER process in other languages. We demonstrate the system by using the generated corpus as training sets for a variant of BBN's IdentiFinder in French, Ukrainian, Spanish, Polish, Russian, and Portuguese, achieving overall F-scores, on independent human-annotated corpora, comparable to a system trained on up to 40,000 words of human-annotated newswire.

1 Introduction

Named Entity Recognition (NER) has long been a major task of natural language processing. Most of the research in the field has been restricted to a few languages, and almost all methods require substantial linguistic expertise, whether creating a rule-based technique specific to a language or manually annotating a body of text to be used as a training set for a statistical engine or machine learning. In this paper, we focus on using the multilingual Wikipedia to automatically create an annotated corpus of text in any given language, with no linguistic expertise required on the part of the user at run time and only English knowledge required during development.
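To make the two core ideas concrete, here is a minimal Python sketch of (a) typing a candidate entity from its English Wikipedia category names and (b) bootstrapping a non-English article through its interlanguage link to English. Everything here is illustrative: `CATEGORY_KEYWORDS`, the helper names, and the toy `langlinks` and `en_categories` dictionaries are assumptions standing in for the paper's much richer set of category- and list-page rules and for real Wikipedia dump or API lookups.

```python
# Hypothetical keyword-to-type rules; the actual system derives its
# mapping from English Wikipedia Category and list-page phrases.
CATEGORY_KEYWORDS = {
    "PERSON": ["births", "deaths", "living people"],
    "ORGANIZATION": ["companies", "organizations", "political parties"],
    "GPE": ["countries", "cities", "capitals"],
}

def classify_by_categories(categories):
    """Guess a named entity type from English category names by simple
    keyword matching (a stand-in for the paper's rule set)."""
    for ne_type, keywords in CATEGORY_KEYWORDS.items():
        for cat in categories:
            if any(kw in cat.lower() for kw in keywords):
                return ne_type
    return None  # no match: leave the candidate entity untyped

def classify_foreign_title(title, langlinks, en_categories):
    """Bootstrap from English: follow the interlanguage link from a
    foreign-language article to its English counterpart, then type it
    from the English page's categories."""
    en_title = langlinks.get(title)
    if en_title is None:
        return None  # no English counterpart to fall back on
    return classify_by_categories(en_categories.get(en_title, []))

# Toy data standing in for Wikipedia dumps or API lookups.
langlinks = {"Parigi": "Paris", "Einstein (it)": "Albert Einstein"}
en_categories = {
    "Paris": ["Capitals in Europe", "Cities in France"],
    "Albert Einstein": ["1879 births", "1955 deaths"],
}

print(classify_foreign_title("Parigi", langlinks, en_categories))        # GPE
print(classify_foreign_title("Einstein (it)", langlinks, en_categories)) # PERSON
```

Once each proposed entity has a type, every wikilinked occurrence in the article text can be emitted with that tag, yielding the automatically annotated training corpus the abstract describes.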
