Research paper: Recall-Oriented Learning of Named Entities in Arabic Wikipedia

Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer, Noah A. Smith
School of Computer Science, Carnegie Mellon University
P.O. Box 24866, Doha, Qatar / Pittsburgh, PA 15213, USA

Abstract

We consider the problem of NER in Arabic Wikipedia, a semisupervised domain adaptation setting for which we have no labeled training data in the target domain. To facilitate evaluation, we obtain annotations for articles in four topical groups, allowing annotators to identify domain-specific entity types in addition to standard categories. Standard supervised learning on newswire text leads to poor target-domain recall. We train a sequence model and show that a simple modification to the online learner, a loss function encouraging it to "arrogantly" favor recall over precision, substantially improves recall and F1. We then adapt our model with self-training on unlabeled target-domain data; enforcing the same recall-oriented bias in the self-training stage yields marginal gains.1

1 Introduction

This paper considers named entity recognition (NER) in text that is different from most past research on NER. Specifically, we consider Arabic Wikipedia articles with diverse topics beyond the commonly-used news domain.
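The recall-favoring loss mentioned in the abstract can be illustrated with a minimal sketch, assuming an asymmetric Hamming cost over BIO tag sequences: a missed gold entity token (a recall error) is penalized `beta` times more heavily than a spurious or mislabeled one. The function name and the `beta` parameter are illustrative assumptions, not taken from the paper's implementation.

```python
# Illustrative sketch (not the authors' exact implementation): an
# asymmetric Hamming cost over BIO tag sequences. A cost-augmented
# online learner trained with this cost is pushed toward recall,
# because missing a gold entity token costs beta > 1 while a
# precision error costs only 1.

def recall_oriented_cost(gold_tags, pred_tags, beta=3.0):
    """Asymmetric per-token cost between two BIO tag sequences.

    beta is a hypothetical tuning knob; beta > 1 makes false
    negatives (gold entity token tagged 'O') cost more than
    false positives or label confusions.
    """
    assert len(gold_tags) == len(pred_tags)
    cost = 0.0
    for g, p in zip(gold_tags, pred_tags):
        if g == p:
            continue
        if g != "O" and p == "O":   # missed entity token: recall error
            cost += beta
        else:                       # spurious or mislabeled token
            cost += 1.0
    return cost

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["O",     "I-PER", "O", "B-ORG"]
# one recall error (B-PER -> O) at weight beta, one label error at weight 1
print(recall_oriented_cost(gold, pred, beta=3.0))  # 4.0
```

With `beta = 1.0` this reduces to the standard Hamming cost; raising `beta` trades precision for recall during training.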
These data challenge past approaches in two ways. First, Arabic is a morphologically rich language (Habash, 2010). Named entities are referenced using complex syntactic constructions (cf. English NEs, which are primarily sequences of proper nouns). The Arabic script suppresses most vowels, increasing lexical ambiguity, and lacks capitalization, a key clue for English NER. Second, much research has focused on the use of news text for system building and evaluation. Wikipedia articles are not news, belonging instead to a wide range of domains that are not clearly …

1 The annotated dataset and a supplementary document with additional details of this work can be found at http .
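The abstract's self-training step follows the familiar recipe of labeling unlabeled target-domain text with the current model and retraining on the combined data. A minimal sketch, under assumptions (the toy `MajorityTagger` and `train_fn` stand in for a real sequence model and trainer; this is a generic loop, not the paper's exact procedure):

```python
from collections import Counter

class MajorityTagger:
    """Toy stand-in for a sequence model: predicts one fixed tag."""
    def __init__(self, tag):
        self.tag = tag
    def predict(self, tokens):
        return [self.tag] * len(tokens)

def train_fn(data):
    """Toy trainer: fit the single most frequent tag in the labeled data."""
    counts = Counter(t for _, tags in data for t in tags)
    return MajorityTagger(counts.most_common(1)[0][0])

def self_train(train, unlabeled, train_fn, rounds=3):
    """Generic self-training loop.

    train:     list of (tokens, tags) gold-labeled sentences
    unlabeled: list of token lists from the target domain
    """
    model = train_fn(train)
    for _ in range(rounds):
        # Tag the unlabeled target-domain data with the current model,
        # then retrain on gold plus automatically labeled sentences.
        auto = [(toks, model.predict(toks)) for toks in unlabeled]
        model = train_fn(train + auto)
    return model

model = self_train([(["a", "b"], ["O", "B-PER"]), (["c"], ["O"])],
                   [["d", "e"]], train_fn, rounds=2)
print(model.predict(["x", "y"]))  # ['O', 'O']
```

The paper's variant applies the same recall-oriented bias inside this retraining stage; here any cost-sensitive trainer could be substituted for `train_fn`.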
