Scientific report: "Automatic Adaptation of Annotation Standards: Chinese Word Segmentation and POS Tagging – A Case Study"

Wenbin Jiang †, Liang Huang ‡, Qun Liu †

† Key Lab. of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, P.O. Box 2704, Beijing 100190, China
jiangwenbin liuqun @
‡ Google Research, 1350 Charleston Rd., Mountain View, CA 94043, USA
lianghuang@

Abstract

Manually annotated corpora are valuable but scarce resources, yet for many annotation tasks, such as treebanking and sequence labeling, there exist multiple corpora with different and incompatible annotation guidelines or standards. This seems to be a great waste of human effort, and it would be desirable to automatically adapt one annotation standard to another. We present a simple yet effective strategy that transfers knowledge from a differently annotated corpus to the corpus with the desired annotation. We test the efficacy of this method in the context of Chinese word segmentation and part-of-speech tagging, where no segmentation and POS tagging standards are widely accepted due to the lack of morphology in Chinese. Experiments show that adaptation from the much larger People's Daily corpus to the smaller but more popular Penn Chinese Treebank results in significant improvements in both segmentation and tagging accuracies, with error reductions of [...] and 14 respectively, which in turn helps improve Chinese parsing accuracy.

1 Introduction

Much of statistical NLP research relies on some sort of manually annotated corpora to train models, but these resources are extremely expensive to build, especially at a large scale, for example in treebanking (Marcus et al., 1993). However, the linguistic theories underlying these annotation efforts are often heavily debated, and as a result there often exist multiple corpora for the same task with vastly different and incompatible annotation philosophies. For example, just for English treebanking there have been
