Scientific report: "Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling"

Daichi Mochihashi, Takeshi Yamada, Naonori Ueda
NTT Communication Science Laboratories
Hikaridai 2-4, Keihanna Science City, Kyoto, Japan
daichi yamada ueda @

Abstract

In this paper, we propose a new Bayesian model for fully unsupervised word segmentation and an efficient blocked Gibbs sampler combined with dynamic programming for inference. Our model is a nested hierarchical Pitman-Yor language model, where a Pitman-Yor spelling model is embedded in the word model. We confirmed that it significantly outperforms previously reported results on both phonetic transcripts and standard datasets for Chinese and Japanese word segmentation. Our model can also be regarded as a way to construct an accurate word n-gram language model directly from the characters of an arbitrary language, without any word indications.

1 Introduction

"Word" is no trivial concept in many languages. Asian languages such as Chinese and Japanese have no explicit word boundaries, so word segmentation is a crucial first step when processing them. Even in Western languages, valid words are often not identical to space-separated tokens. For example, proper nouns such as "United Kingdom" or idiomatic phrases such as "with respect to" actually function as a single word, and we often condense them into the virtual words UK and . .
Unsupervised word segmentation is an important research area for extracting words from text streams, because the criteria for creating supervised training data can be arbitrary and will be suboptimal for applications that rely on segmentations. It is particularly difficult to create correct training data for speech transcripts, colloquial texts, and classics, where segmentations are often ambiguous, and it is impossible for unknown languages whose properties computational linguists might seek to uncover. From a scientific point of view, it is also interesting because it can shed light on how .
