Detecting Errors in Part-of-Speech Annotation

Markus Dickinson
Department of Linguistics
The Ohio State University
dickinso@ling.osu.edu

W. Detmar Meurers
Department of Linguistics
The Ohio State University
dm@ling.osu.edu

Abstract

We propose a new method for detecting errors in gold-standard part-of-speech annotation. The approach locates errors with high precision based on n-grams occurring in the corpus with multiple taggings. Two further techniques, closed-class analysis and finite-state tagging guide patterns, are discussed. The success of the three approaches is illustrated for the Wall Street Journal corpus as part of the Penn Treebank.

1 Introduction

Part-of-speech (POS) annotated reference corpora such as the British National Corpus (Leech et al., 1994), the Penn Treebank (Marcus et al., 1993), or the German Negra Treebank (Skut et al., 1997) play an important role in current work in computational linguistics. They provide training material for research on tagging algorithms, and they serve as a gold standard for evaluating the performance of such tools. High-quality POS-annotated text is also relevant as input for syntactic processing, for practical applications such as information extraction, and for linguistic research making use of POS-based corpus queries.

The gold-standard POS annotation for such large reference corpora is generally obtained by using an automatic tagger to produce a first annotation, followed by human post-editing. While Sinclair (1992) provides some arguments for prioritizing a fully automated analysis, human post-editing has been shown to significantly reduce the number of POS annotation errors. Brants (2000) shows that a single human post-editor reduces the 3.3% error rate in the STTS annotation of the German Negra corpus produced by the TnT tagger to 1.2%. Baker (1997) also reports an improvement of around 2% for a similar experiment carried out for an English sample originally tagged with 96.95% accuracy by the CLAWS tagger. And Leech (1997) reports that ...
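To make the core idea of the n-gram approach concrete, the following Python sketch collects every word n-gram in a POS-tagged corpus and reports those that occur with more than one tagging, which are the candidate error sites. This is a minimal illustration, not the authors' implementation: the function name, data structures, toy corpus, and tag choices are assumptions made for the example.

    from collections import defaultdict

    def variation_ngrams(tagged_corpus, n):
        """Find word n-grams occurring with more than one tagging.

        tagged_corpus: a list of (word, tag) pairs.
        Returns a dict mapping each varying word n-gram to the
        set of tag sequences it was annotated with.
        """
        taggings = defaultdict(set)
        # Slide a window of length n over the corpus, recording
        # the tag sequence observed for each word sequence.
        for i in range(len(tagged_corpus) - n + 1):
            window = tagged_corpus[i:i + n]
            words = tuple(w for w, _ in window)
            tags = tuple(t for _, t in window)
            taggings[words].add(tags)
        # Keep only n-grams whose taggings disagree somewhere.
        return {w: t for w, t in taggings.items() if len(t) > 1}

    # Hypothetical toy corpus: "off" is tagged RP in one
    # occurrence and IN in the other, inside an otherwise
    # identical word context.
    corpus = [("to", "TO"), ("ward", "VB"), ("off", "RP"),
              ("the", "DT"), ("threat", "NN"),
              ("to", "TO"), ("ward", "VB"), ("off", "IN"),
              ("the", "DT"), ("danger", "NN")]
    print(variation_ngrams(corpus, 3))

Run on the toy corpus, the sketch flags the trigrams around "off", which received two different tags in recurring identical word contexts. The intuition behind the method's high precision is that the longer the shared word context, the less likely the tagging difference reflects genuine ambiguity rather than an annotation error.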