TAILIEUCHUNG - Báo cáo khoa học: "Minimized Models for Unsupervised Part-of-Speech Tagging"

We describe a novel method for the task of unsupervised POS tagging with a dictionary, one that uses integer programming to explicitly search for the smallest model that explains the data, and then uses EM to set parameter values. We evaluate our method on a standard test corpus using different standard tagsets (a 45-tagset as well as a smaller 17-tagset), and show that our approach performs better than existing state-of-the-art systems in both settings. | Minimized Models for Unsupervised Part-of-Speech Tagging Sujith Ravi and Kevin Knight University of Southern California Information Sciences Institute Marina del Rey California 90292 sravi knight @ Abstract We describe a novel method for the task of unsupervised POS tagging with a dictionary one that uses integer programming to explicitly search for the smallest model that explains the data and then uses EM to set parameter values. We evaluate our method on a standard test corpus using different standard tagsets a 45-tagset as well as a smaller 17-tagset and show that our approach performs better than existing state-of-the-art systems in both settings. 1 Introduction In recent years we have seen increased interest in using unsupervised methods for attacking different NLP tasks like part-of-speech POS tagging. The classic Expectation Maximization EM algorithm has been shown to perform poorly on POS tagging when compared to other techniques such as Bayesian methods. In this paper we develop new methods for unsupervised part-of-speech tagging. We adopt the problem formulation of Merialdo 1994 in which we are given a raw word sequence and a dictionary of legal tags for each word type. The goal is to tag each word token so as to maximize accuracy against a gold tag sequence. Whether this is a realistic problem set-up is arguable but an interesting collection of methods and results has accumulated around it and these can be clearly compared with one another. We use the standard test set for this task a 24 115-word subset of the Penn Treebank for which a gold tag sequence is available. There are 5 878 word types in this test set. We use the standard tag dictionary consisting of 57 388 word tag pairs derived from the entire Penn 8 910 dictionary entries are relevant to the 5 878 word types in the test set. Per-token ambiguity is about tags token yielding approximately 106425 possible ways to tag the data. There are 45 distinct grammatical tags. In .

Bích Hạnh 102 9 pdf

Upload

Bấm vào đây để xem trước nội dung

Tải xuống

TÀI LIỆU LIÊN QUAN

Báo cáo khoa học: "Minimized models and grammar-informed initialization for supertagging with highly ambiguous lexicons"

9 59 0

Báo cáo khoa học: "Minimized Models for Unsupervised Part-of-Speech Tagging"

9 47 0

TÀI LIỆU XEM NHIỀU

Một Case Về Hematology (1)

8 462291 61

Giới thiệu :Lập trình mã nguồn mở

14 24918 79

Tiểu luận: Tư tưởng Hồ Chí Minh về xây dựng nhà nước trong sạch vững mạnh

13 11286 542

Câu hỏi và đáp án bài tập tình huống Quản trị học

14 10511 466

Phân tích và làm rõ ý kiến sau: “Bài thơ Tự tình II vừa nói lên bi kịch duyên phận vừa cho thấy khát vọng sống, khát vọng hạnh phúc của Hồ Xuân Hương”

3 9790 108

Ebook Facts and Figures – Basic reading practice: Phần 1 – Đặng Tuấn Anh (Dịch)

249 8876 1160

Tiểu luận: Nội dung tư tưởng Hồ Chí Minh về đạo đức

16 8467 426

Mẫu đơn thông tin ứng viên ngân hàng VIB

8 8090 2279

Giáo trình Tư tưởng Hồ Chí Minh - Mạch Quang Thắng (Dành cho bậc ĐH - Không chuyên ngành Lý luận chính trị)

152 7471 1763

Đề tài: Dự án kinh doanh thời trang quần áo nữ

17 7188 268

TỪ KHÓA LIÊN QUAN

TÀI LIỆU MỚI ĐĂNG

Giáo án mầm non chương trình đổi mới: Gia đình vui nhộn

4 374 3 26-11-2024

B2B Content Marketing: 2012 Benchmarks, Budgets & Trends

17 213 3 26-11-2024

Bảng màu theo chữ cái – V

11 153 2 26-11-2024

CHƯƠNG 2: RỦI RO THÂM HỤT TÀI KHÓA

28 152 1 26-11-2024

Đề tài " Dự báo về tác động của Tổ chức Thương mại Thế giới WTO đối với các doanh nghiệp xuất khẩu vừa và nhỏ Việt Nam – Những giải pháp đề xuất "

72 177 2 26-11-2024

Báo cáo " Thẩm quyền quản lí nhà nước đối với hoạt động quảng cáo thực trạng và hướng hoàn thiện "

7 196 7 26-11-2024

Báo cáo nghiên cứu khoa học " Vai trò chính quyền địa phương trong phát triển kinh tế : khu chuyên doanh gốm sứ ( Trung Quốc ) và Bát Tràng ( Việt Nam )("

11 206 1 26-11-2024

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Introduction to Data Mining

101 133 1 26-11-2024

Xinh xinh vườn nhà

6 128 0 26-11-2024

TRẮC NGHIỆM - CÁC BỆNH THIẾU DINH DƯỠNG THƯỜNG GẶP

32 201 2 26-11-2024

TÀI LIỆU HOT

Mẫu đơn thông tin ứng viên ngân hàng VIB

8 8090 2279

Giáo trình Tư tưởng Hồ Chí Minh - Mạch Quang Thắng (Dành cho bậc ĐH - Không chuyên ngành Lý luận chính trị)

152 7471 1763

Ebook Chào con ba mẹ đã sẵn sàng

112 4364 1369

Ebook Tuyển tập đề bài và bài văn nghị luận xã hội: Phần 1

62 6156 1258

Ebook Facts and Figures – Basic reading practice: Phần 1 – Đặng Tuấn Anh (Dịch)

249 8876 1160

Giáo trình Văn hóa kinh doanh - PGS.TS. Dương Thị Liễu

561 3790 680

Giáo trình Sinh lí học trẻ em: Phần 1 - TS Lê Thanh Vân

122 3909 609

Giáo trình Pháp luật đại cương: Phần 1 - NXB ĐH Sư Phạm

274 4618 562

Tiểu luận: Tư tưởng Hồ Chí Minh về xây dựng nhà nước trong sạch vững mạnh

13 11286 542

Bài tập nhóm quản lý dự án: Dự án xây dựng quán cafe

35 4454 490