Research paper: Recall-Oriented Learning of Named Entities in Arabic Wikipedia

Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer, Noah A. Smith
School of Computer Science, Carnegie Mellon University
P.O. Box 24866, Doha, Qatar / Pittsburgh, PA 15213, USA

Abstract

We consider the problem of NER in Arabic Wikipedia, a semisupervised domain adaptation setting for which we have no labeled training data in the target domain. To facilitate evaluation, we obtain annotations for articles in four topical groups, allowing annotators to identify domain-specific entity types in addition to standard categories. Standard supervised learning on newswire text leads to poor target-domain recall. We train a sequence model and show that a simple modification to the online learner, a loss function encouraging it to "arrogantly" favor recall over precision, substantially improves recall and F1. We then adapt our model with self-training on unlabeled target-domain data; enforcing the same recall-oriented bias in the self-training stage yields marginal gains.1

1 Introduction

This paper considers named entity recognition (NER) in text that is different from most past research on NER. Specifically, we consider Arabic Wikipedia articles with diverse topics beyond the commonly-used news domain.
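The recall-favoring loss mentioned in the abstract can be illustrated with a minimal sketch, assuming an asymmetric Hamming cost over BIO tag sequences: a missed gold entity token (a recall error) is penalized `beta` times more heavily than a spurious or mislabeled one. The function name and the `beta` parameter are illustrative assumptions, not taken from the paper's implementation.

```python
# Illustrative sketch (not the authors' exact implementation): an
# asymmetric Hamming cost over BIO tag sequences. A cost-augmented
# online learner trained with this cost is pushed toward recall,
# because missing a gold entity token costs beta > 1 while a
# precision error costs only 1.

def recall_oriented_cost(gold_tags, pred_tags, beta=3.0):
    """Asymmetric per-token cost between two BIO tag sequences.

    beta is a hypothetical tuning knob; beta > 1 makes false
    negatives (gold entity token tagged 'O') cost more than
    false positives or label confusions.
    """
    assert len(gold_tags) == len(pred_tags)
    cost = 0.0
    for g, p in zip(gold_tags, pred_tags):
        if g == p:
            continue
        if g != "O" and p == "O":   # missed entity token: recall error
            cost += beta
        else:                       # spurious or mislabeled token
            cost += 1.0
    return cost

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["O",     "I-PER", "O", "B-ORG"]
# one recall error (B-PER -> O) at weight beta, one label error at weight 1
print(recall_oriented_cost(gold, pred, beta=3.0))  # 4.0
```

With `beta = 1.0` this reduces to the standard Hamming cost; raising `beta` trades precision for recall during training.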
These data challenge past approaches in two ways. First, Arabic is a morphologically rich language (Habash, 2010). Named entities are referenced using complex syntactic constructions (cf. English NEs, which are primarily sequences of proper nouns). The Arabic script suppresses most vowels, increasing lexical ambiguity, and lacks capitalization, a key clue for English NER. Second, much research has focused on the use of news text for system building and evaluation. Wikipedia articles are not news, belonging instead to a wide range of domains that are not clearly …

1 The annotated dataset and a supplementary document with additional details of this work can be found at http .
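The abstract's self-training step follows the familiar recipe of labeling unlabeled target-domain text with the current model and retraining on the combined data. A minimal sketch, under assumptions (the toy `MajorityTagger` and `train_fn` stand in for a real sequence model and trainer; this is a generic loop, not the paper's exact procedure):

```python
from collections import Counter

class MajorityTagger:
    """Toy stand-in for a sequence model: predicts one fixed tag."""
    def __init__(self, tag):
        self.tag = tag
    def predict(self, tokens):
        return [self.tag] * len(tokens)

def train_fn(data):
    """Toy trainer: fit the single most frequent tag in the labeled data."""
    counts = Counter(t for _, tags in data for t in tags)
    return MajorityTagger(counts.most_common(1)[0][0])

def self_train(train, unlabeled, train_fn, rounds=3):
    """Generic self-training loop.

    train:     list of (tokens, tags) gold-labeled sentences
    unlabeled: list of token lists from the target domain
    """
    model = train_fn(train)
    for _ in range(rounds):
        # Tag the unlabeled target-domain data with the current model,
        # then retrain on gold plus automatically labeled sentences.
        auto = [(toks, model.predict(toks)) for toks in unlabeled]
        model = train_fn(train + auto)
    return model

model = self_train([(["a", "b"], ["O", "B-PER"]), (["c"], ["O"])],
                   [["d", "e"]], train_fn, rounds=2)
print(model.predict(["x", "y"]))  # ['O', 'O']
```

The paper's variant applies the same recall-oriented bias inside this retraining stage; here any cost-sensitive trainer could be substituted for `train_fn`.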
