Scientific report: "Detecting Errors in Part-of-Speech Annotation"

Markus Dickinson
Department of Linguistics, The Ohio State University
dickinso@

W. Detmar Meurers
Department of Linguistics, The Ohio State University
dm@

Abstract

We propose a new method for detecting errors in "gold-standard" part-of-speech annotation. The approach locates errors with high precision based on n-grams occurring in the corpus with multiple taggings. Two further techniques, closed-class analysis and finite-state tagging guide patterns, are discussed. The success of the three approaches is illustrated for the Wall Street Journal corpus as part of the Penn Treebank.

1 Introduction

Part-of-speech (pos) annotated reference corpora, such as the British National Corpus (Leech et al., 1994), the Penn Treebank (Marcus et al., 1993), or the German Negra Treebank (Skut et al., 1997), play an important role for current work in computational linguistics. They provide training material for research on tagging algorithms, and they serve as a gold standard for evaluating the performance of such tools. High-quality pos-annotated text is also relevant as input for syntactic processing, for practical applications such as information extraction, and for linguistic research making use of pos-based corpus queries.

The gold-standard pos-annotation for such large reference corpora is generally obtained using an automatic tagger to produce a first annotation, followed by human post-editing.
While Sinclair (1992) provides some arguments for prioritizing a fully automated analysis, human post-editing has been shown to significantly reduce the number of pos-annotation errors. Brants (2000) reports that a single human post-editor reduces the error rate in the STTS annotation of the German Negra corpus produced by the TnT tagger to . Baker (1997) also reports an improvement of around 2 for a similar experiment carried out for an English sample originally tagged with accuracy by the CLAWS tagger. And Leech (1997) reports that .
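
The core idea stated in the abstract, locating n-grams that occur in the corpus with multiple taggings, can be illustrated with a small sketch. This is a simplified illustration and not the authors' implementation: the function name `variation_ngrams`, the toy corpus, and its tags are assumptions made for the example, and the sketch only collects candidates without deciding which of them are real errors rather than genuine ambiguity.

```python
from collections import defaultdict


def variation_ngrams(tagged_sentences, n):
    """Return word n-grams that occur with more than one tag sequence.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    n: length of the n-grams to inspect.
    Hypothetical helper for illustration only.
    """
    taggings = defaultdict(set)
    for sentence in tagged_sentences:
        # Slide a window of n tokens over the sentence.
        for i in range(len(sentence) - n + 1):
            window = sentence[i:i + n]
            words = tuple(word for word, _ in window)
            tags = tuple(tag for _, tag in window)
            taggings[words].add(tags)
    # Keep only n-grams whose occurrences disagree on the tagging.
    return {words: tags for words, tags in taggings.items() if len(tags) > 1}


# Toy corpus (invented data, not taken from the Wall Street Journal corpus).
corpus = [
    [("the", "DT"), ("plan", "NN"), ("to", "TO"), ("buy", "VB")],
    [("the", "DT"), ("plan", "VB"), ("to", "TO"), ("buy", "VB")],
    [("they", "PRP"), ("plan", "VBP"), ("to", "TO"), ("buy", "VB")],
]

for words, tags in variation_ngrams(corpus, 2).items():
    print(" ".join(words), sorted(tags))
```

In this toy data, "the plan" appears with both NN and VB for "plan", so it is reported as a candidate. A larger `n` supplies more surrounding context, which makes it less likely that a difference in tagging reflects a legitimate ambiguity.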
