TAILIEUCHUNG - Báo cáo khoa học: "Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling"

In this paper, we propose a new Bayesian model for fully unsupervised word segmentation and an efficient blocked Gibbs sampler combined with dynamic programming for inference. Our model is a nested hierarchical Pitman-Yor language model, where Pitman-Yor spelling model is embedded in the word model. We confirmed that it significantly outperforms previous reported results in both phonetic transcripts and standard datasets for Chinese and Japanese word segmentation. | Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling Daichi Mochihashi Takeshi Yamada Naonori Ueda NTT Communication Science Laboratories Hikaridai 2-4 Keihanna Science City Kyoto Japan daichi yamada ueda @ Abstract In this paper we propose a new Bayesian model for fully unsupervised word segmentation and an efficient blocked Gibbs sampler combined with dynamic programming for inference. Our model is a nested hierarchical Pitman-Yor language model where Pitman-Yor spelling model is embedded in the word model. We confirmed that it significantly outperforms previous reported results in both phonetic transcripts and standard datasets for Chinese and Japanese word segmentation. Our model is also considered as a way to construct an accurate word n-gram language model directly from characters of arbitrary language without any word indications. 1 Introduction Word is no trivial concept in many languages. Asian languages such as Chinese and Japanese have no explicit word boundaries thus word segmentation is a crucial first step when processing them. Even in western languages valid words are often not identical to space-separated tokens. For example proper nouns such as United Kingdom or idiomatic phrases such as with respect to actually function as a single word and we often condense them into the virtual words UK and . . In order to extract words from text streams unsupervised word segmentation is an important research area because the criteria for creating supervised training data could be arbitrary and will be suboptimal for applications that rely on segmentations. It is particularly difficult to create correct training data for speech transcripts colloquial texts and classics where segmentations are often ambiguous let alone is impossible for unknown languages whose properties computational linguists might seek to uncover. From a scientific point of view it is also interesting because it can shed light on how .

TAILIEUCHUNG - Chia sẻ tài liệu không giới hạn
Địa chỉ : 444 Hoang Hoa Tham, Hanoi, Viet Nam
Website : tailieuchung.com
Email : tailieuchung20@gmail.com
Tailieuchung.com là thư viện tài liệu trực tuyến, nơi chia sẽ trao đổi hàng triệu tài liệu như luận văn đồ án, sách, giáo trình, đề thi.
Chúng tôi không chịu trách nhiệm liên quan đến các vấn đề bản quyền nội dung tài liệu được thành viên tự nguyện đăng tải lên, nếu phát hiện thấy tài liệu xấu hoặc tài liệu có bản quyền xin hãy email cho chúng tôi.
Đã phát hiện trình chặn quảng cáo AdBlock
Trang web này phụ thuộc vào doanh thu từ số lần hiển thị quảng cáo để tồn tại. Vui lòng tắt trình chặn quảng cáo của bạn hoặc tạm dừng tính năng chặn quảng cáo cho trang web này.