Scientific report: "Automatic Detection and Correction of Errors in Dependency Treebanks"

Alexander Volokh, DFKI, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany
Günter Neumann, DFKI, Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany, neumann@

Abstract

Annotated corpora are essential for almost all NLP applications. Although they are expected to be of very high quality because of their importance for follow-up developments, they still contain a considerable number of errors. With this work we want to draw attention to this fact. Additionally, we try to estimate the number of errors and propose a method for their automatic correction. Although our approach finds only a portion of the errors that we suppose are contained in almost any annotated corpus, owing to the nature of its creation process, it has very high precision and is thus in any case beneficial for the quality of the corpus it is applied to. Finally, we compare it to a different method for error detection in treebanks and find that the errors we are able to detect are mostly different and that the two approaches are complementary.

1 Introduction

Treebanks and other annotated corpora have become essential for almost all NLP applications. Papers about corpora like the Penn Treebank [1] have thousands of citations, since most algorithms profit from annotated data during development and testing, and such corpora are thus widely used in the field. Treebanks are therefore expected to be of very high quality in order to guarantee reliability for their theoretical and practical uses.
The construction of an annotated corpus involves a lot of work performed by large groups. However, despite the fact that a lot of human post-editing and automatic quality assurance is done, errors cannot be avoided completely [5].

In this paper we propose an approach for finding and correcting errors in dependency treebanks. We apply our method to the English dependency corpus - conversion of the Penn ...
