Self-Training for Enhancement and Domain Adaptation of Statistical Parsers Trained on Small Datasets

Roi Reichart
ICNC, Hebrew University of Jerusalem
roiri@

Ari Rappoport
Institute of Computer Science, Hebrew University of Jerusalem
arir@

Abstract

Creating large amounts of annotated data to train statistical PCFG parsers is expensive, and the performance of such parsers declines when training and test data are taken from different domains. In this paper we use self-training to improve the quality of a parser and to adapt it to a different domain, using only small amounts of manually annotated seed data. We report significant improvement both when the seed and test data are in the same domain and in the out-of-domain adaptation scenario. In particular, we achieve a 50% reduction in annotation cost for the in-domain case, yielding an improvement of 66% over previous work, and a 20-33% reduction for the domain adaptation case. This is the first time that self-training with small labeled datasets has been applied successfully to these tasks. We were also able to formulate a characterization of when self-training is valuable.

1 Introduction

State-of-the-art statistical parsers (Collins, 1999; Charniak, 2000; Koo and Collins, 2005; Charniak and Johnson, 2005) are trained on manually annotated treebanks that are highly expensive to create. Furthermore, the performance of these parsers decreases as the distance between the genres of their training and test data increases.
Therefore, enhancing the performance of parsers trained on small manually annotated datasets is of great importance, both when the seed and test data are taken from the same domain (the in-domain scenario) and when they are taken from different domains (the out-of-domain or parser adaptation scenario). Since the problem is the expense of manual annotation, we define "small" to be sentences, which are the sizes of sentence sets that can be manually annotated by constituent structure in a
