TAILIEUCHUNG - Scientific report: "Automatic Acquisition of Named Entity Tagged Corpus from World Wide Web"

Automatic Acquisition of Named Entity Tagged Corpus from World Wide Web

Joohui An, Dept. of CSE, POSTECH, Pohang, Korea 790-784, minnie@
Seungwoo Lee, Dept. of CSE, POSTECH, Pohang, Korea 790-784, pinesnow@
Gary Geunbae Lee, Dept. of CSE, POSTECH, Pohang, Korea 790-784, gblee@

Abstract

In this paper, we present a method that automatically constructs a Named Entity (NE) tagged corpus from the web to be used for learning of Named Entity Recognition systems. We use an NE list and a web search engine to collect web documents that contain the NE instances. The documents are refined through sentence separation and text refinement procedures, and NE instances are finally tagged with the appropriate NE categories. Our experiments demonstrate that the suggested method can acquire, without any human intervention, an NE tagged corpus that is large enough and as useful as a manually tagged one.

1 Introduction

The current trend in Named Entity Recognition (NER) is to apply machine learning approaches, which are attractive because they are trainable and adaptable; consequently, porting a machine learning system to another domain is much easier than porting a rule-based one. Various supervised learning methods for Named Entity (NE) tasks have been applied successfully and have shown reasonably satisfactory performance (Zhou and Su, 2002; Borthwick et al., 1998; Sassano and Utsuro, 2000). However, most of these systems rely heavily on a tagged corpus for training.
For a machine learning approach, a large corpus is required to circumvent the data sparseness problem, but the dilemma is that the cost of annotating a large training corpus is non-trivial. In this paper, we suggest a method that automatically constructs an NE tagged corpus from the web to be used for learning of NER systems. We use an NE list and a web search engine to collect web documents that contain the NE instances. The documents are refined through the sentence separation and text refinement procedures and NE .
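The steps named above (sentence separation, text refinement, and tagging of NE instances) can be illustrated with a minimal sketch. It is not the authors' implementation: the NE list entries, the category names, the `<CATEGORY>...</CATEGORY>` tag format, the regular-expression sentence splitter, and the `min_tokens` filter are all assumptions made for illustration, and the web search step is left out, so the function takes documents that have already been collected.

```python
import re

# Hypothetical NE list (surface form -> category); the actual list used in
# the paper is not shown in this excerpt.
NE_LIST = {
    "Pohang": "LOCATION",
    "POSTECH": "ORGANIZATION",
    "Gary Geunbae Lee": "PERSON",
}


def split_sentences(text):
    """Naive sentence separation: break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def refine(sentences, min_tokens=4):
    """Stand-in for text refinement (an assumption, not the paper's procedure):
    normalise whitespace, then drop very short fragments and exact duplicates."""
    seen = set()
    kept = []
    for sentence in sentences:
        sentence = re.sub(r"\s+", " ", sentence)
        if len(sentence.split()) < min_tokens or sentence in seen:
            continue
        seen.add(sentence)
        kept.append(sentence)
    return kept


def tag_sentence(sentence, ne_list):
    """Wrap every NE instance from the list in a tag named after its category."""
    # Try longer names first so a long instance is not split by a shorter one.
    names = sorted(ne_list, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")

    def wrap(match):
        name = match.group(1)
        category = ne_list[name]
        return f"<{category}>{name}</{category}>"

    return pattern.sub(wrap, sentence)


def build_corpus(documents, ne_list):
    """Return the tagged sentences that contain at least one NE instance."""
    corpus = []
    for document in documents:
        for sentence in refine(split_sentences(document)):
            tagged = tag_sentence(sentence, ne_list)
            if tagged != sentence:  # keep only sentences where something was tagged
                corpus.append(tagged)
    return corpus
```

Matching is by exact surface string only, so an ambiguous name receives whatever category the list assigns to it; a fuller system would need to handle such ambiguity before the tagged sentences are used as training data.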
