Scientific report: "User Edits Classification Using Document Revision Histories"

Amit Bronner, Informatics Institute, University of Amsterdam
Christof Monz, Informatics Institute, University of Amsterdam

Abstract

Document revision histories are a useful and abundant source of data for natural language processing, but selecting relevant data for the task at hand is not trivial. In this paper we introduce a scalable approach for automatically distinguishing between factual and fluency edits in document revision histories. The approach is based on supervised machine learning using language model probabilities, string similarity measured over different representations of user edits, comparison of part-of-speech tags and named entities, and a set of adaptive features extracted from large amounts of unlabeled user edits. Applied to contiguous edit segments, our method achieves statistically significant improvements over a simple yet effective edit-distance baseline. It reaches high classification accuracy (88%) and is shown to generalize to additional sets of unseen data.

1 Introduction

Many online collaborative editing projects, such as Wikipedia¹, keep track of complete revision histories. These contain valuable information about the evolution of documents in terms of content as well as language, style, and form. Such data is publicly available in large volumes and constantly growing.
According to Wikipedia statistics, in August 2011 the English Wikipedia contained […] million articles with an average of […] revisions per article. The average number of revision edits per month is about 4 million in English and almost 11 million in total for all languages.²

Exploiting document revision histories has proven useful for a variety of natural language processing (NLP) tasks, including sentence compression (Nelken and Yamangil, 2008; Yamangil and Nelken, …

¹ http…
² Average for the five-year period between August 2006 and August 2011. The count includes edits by registered …
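The abstract compares the method against a simple edit-distance baseline, but this excerpt does not describe it. The following is only a minimal sketch of what such a baseline might look like. It assumes character-level Levenshtein distance and a hypothetical threshold parameter, `max_fluency_distance`, whose default value is illustrative and not taken from the paper. Small edits are labeled as fluency edits and larger ones as factual edits.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance with unit-cost insert, delete, substitute."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def classify_edit(before: str, after: str, max_fluency_distance: int = 4) -> str:
    """Label an edit segment by edit distance alone.

    max_fluency_distance is a hypothetical threshold chosen for illustration.
    """
    if levenshtein(before, after) <= max_fluency_distance:
        return "fluency"
    return "factual"


if __name__ == "__main__":
    print(classify_edit("recieve", "receive"))
    print(classify_edit("born in 1950", "born in Amsterdam in 1962"))
```

In practice the threshold would presumably be tuned on labeled edits rather than fixed by hand. A baseline of this kind uses only the size of the change, which is why the abstract's additional features (language model probabilities, part-of-speech tags, named entities) could be expected to help on short factual edits such as a changed date.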
