Scientific report: "User Edits Classification Using Document Revision Histories"

Amit Bronner, Informatics Institute, University of Amsterdam
Christof Monz, Informatics Institute, University of Amsterdam

Abstract

Document revision histories are a useful and abundant source of data for natural language processing, but selecting relevant data for the task at hand is not trivial. In this paper we introduce a scalable approach for automatically distinguishing between factual and fluency edits in document revision histories. The approach is based on supervised machine learning using language model probabilities, string similarity measured over different representations of user edits, comparison of part-of-speech tags and named entities, and a set of adaptive features extracted from large amounts of unlabeled user edits. Applied to contiguous edit segments, our method achieves statistically significant improvements over a simple yet effective edit-distance baseline. It reaches high classification accuracy (88%) and is shown to generalize to additional sets of unseen data.

1 Introduction

Many online collaborative editing projects such as Wikipedia[1] keep track of complete revision histories. These contain valuable information about the evolution of documents in terms of content as well as language, style, and form. Such data is publicly available in large volumes and constantly growing.
According to Wikipedia statistics, in August 2011 the English Wikipedia contained [...] million articles with an average of [...] revisions per article. The average number of revision edits per month is about 4 million in English and almost 11 million in total for all [...]

Exploiting document revision histories has proven useful for a variety of natural language processing (NLP) tasks, including sentence compression (Nelken and Yamangil, 2008; Yamangil and Nelken, [...]

[1] http
[2] Average for the 5-year period between August 2006 and August 2011. The count includes edits by registered [...]
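
The abstract compares the proposed classifier against a simple edit-distance baseline but, in this excerpt, does not say how that baseline is defined. As a rough illustration only, the sketch below assumes a character-level Levenshtein distance with a fixed cutoff: small edits are labeled fluency edits and larger ones factual edits. The function names, the cutoff of 4, and the direction of the decision rule are all assumptions made for this example and are not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance, computed with the standard
    dynamic program using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def classify_edit(before: str, after: str, threshold: int = 4) -> str:
    """Hypothetical baseline: an edit whose distance is below the
    threshold is called a fluency edit, anything else a factual edit.
    The threshold value is an assumption, not a figure from the paper."""
    distance = levenshtein(before, after)
    return "fluency" if distance < threshold else "factual"
```

Under these assumptions, a spelling correction such as classify_edit("recieve", "receive") has distance 2 and is labeled a fluency edit, while replacing one year by another, as in classify_edit("1998", "2004"), has distance 4 and is labeled a factual edit. A fixed cutoff like this would presumably be tuned on labeled edits in practice.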
