Scientific report: "The Effect of Corpus Size in Combining Supervised and Unsupervised Training for Disambiguation"

Michaela Atterer, Institute for NLP, University of Stuttgart, atterer@
Hinrich Schütze, Institute for NLP, University of Stuttgart, hinrich@

Abstract

We investigate the effect of corpus size in combining supervised and unsupervised learning for two types of attachment decisions: relative clause attachment and prepositional phrase attachment. The supervised component is Collins’ parser, trained on the Wall Street Journal. The unsupervised component gathers lexical statistics from an unannotated corpus of newswire text. We find that the combined system only improves the performance of the parser for small training sets. Surprisingly, the size of the unannotated corpus has little effect due to the noisiness of the lexical statistics acquired by unsupervised learning.

1 Introduction

The best performing systems for many tasks in natural language processing are based on supervised training on annotated corpora such as the Penn Treebank (Marcus et al., 1993) and the prepositional phrase data set first described in (Ratnaparkhi et al., 1994). However, the production of training sets is expensive. They are not available for many domains and languages.
This motivates research on combining supervised with unsupervised learning, since unannotated text is in ample supply for most domains in the major languages of the world. The question arises how much annotated and unannotated data is necessary in combination learning strategies. We investigate this question for two attachment ambiguity problems: relative clause (RC) attachment and prepositional phrase (PP) attachment. The supervised component is Collins’ parser (Collins, 1997), trained on the Wall Street Journal. The unsupervised component gathers lexical statistics from an unannotated corpus of newswire text. The sizes of both types of corpora, annotated and unannotated, are of interest. We would expect that large annotated ...
