TAILIEUCHUNG - Scientific report: "Improved Source-Channel Models for Chinese Word Segmentation"

The source model is used to estimate the generative probability of a word sequence, in which each word belongs to one word type. For each word type, a channel model is used to estimate the generative probability of a character string given the word type. There are therefore multiple channel models. We shall show in this paper that our models provide a statistical framework to incorporate a wide variety of linguistic knowledge and statistical models in a unified way. We evaluate the performance of our system using an annotated test set.

Improved Source-Channel Models for Chinese Word Segmentation

Jianfeng Gao, Mu Li, and Chang-Ning Huang
Microsoft Research Asia, Beijing 100080, China
{jfgao, t-muli, cnhuang}@

Abstract

This paper presents a Chinese word segmentation system that uses improved source-channel models of Chinese sentence generation. Chinese words are defined as one of the following four types: lexicon words, morphologically derived words, factoids, and named entities. Our system provides a unified approach to the four fundamental features of word-level Chinese language processing: (1) word segmentation, (2) morphological analysis, (3) factoid detection, and (4) named entity recognition. The performance of the system is evaluated on a manually annotated test set and is also compared with several state-of-the-art systems, taking into account the fact that the definition of Chinese words often varies from system to system.

1 Introduction

Chinese word segmentation is the initial step of many Chinese language processing tasks and has attracted a lot of attention in the research community. It is a challenging problem because there is no standard definition of Chinese words. In this paper, we define Chinese words as one of the following four types: entries in a lexicon, morphologically derived words, factoids, and named entities.

We then present a Chinese word segmentation system that provides a solution to the four fundamental problems of word-level Chinese language processing: word segmentation, morphological analysis, factoid detection, and named entity recognition (NER). There are no word boundaries in written Chinese text. Therefore, unlike in English, it may not be desirable to separate the solution to word segmentation from the solutions to the other three problems. Ideally, we would like to propose a unified approach to all four problems. The unified approach used in our system is based on the improved source-channel models of Chinese sentence generation, with two components: a source model ...
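The decomposition described above (a source model over word types combined with one channel model per type) can be sketched as a small dynamic-programming decoder. This is a minimal illustration under stated assumptions, not the authors' system: the probabilities, the `<NUM>` factoid class, the unigram form of the source model, and the function names `channel_logp` and `segment` are all hypothetical and chosen only to keep the example short.

```python
import math
import re

# Source model (assumed unigram, toy numbers): log P(word class).
# Each lexicon word is its own class; digit strings share the class "<NUM>".
SOURCE_LOGP = {
    "北京": math.log(0.2),
    "大学": math.log(0.2),
    "北京大学": math.log(0.3),
    "有": math.log(0.2),
    "<NUM>": math.log(0.1),
}


def channel_logp(chars, word_class):
    """Channel model: log P(chars | word_class), -inf if impossible."""
    if word_class == "<NUM>":
        # Hypothetical factoid channel: any non-empty digit string.
        return 0.0 if re.fullmatch(r"[0-9]+", chars) else -math.inf
    # Lexicon-word channel: the class generates exactly its own string.
    return 0.0 if chars == word_class else -math.inf


def segment(sentence, max_len=8):
    """Return the best-scoring list of (chars, word_class), or None."""
    n = len(sentence)
    best = [-math.inf] * (n + 1)  # best[i]: best log score of sentence[:i]
    back = [None] * (n + 1)       # back[i]: (start, class) of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] == -math.inf:
                continue
            chars = sentence[start:end]
            for word_class, source in SOURCE_LOGP.items():
                score = best[start] + source + channel_logp(chars, word_class)
                if score > best[end]:
                    best[end] = score
                    back[end] = (start, word_class)
    if n > 0 and back[n] is None:
        return None  # no class sequence can generate the sentence
    words = []
    pos = n
    while pos > 0:
        start, word_class = back[pos]
        words.append((sentence[start:pos], word_class))
        pos = start
    return words[::-1]


if __name__ == "__main__":
    print(segment("北京大学有2000"))
```

With these toy numbers, `segment("北京大学有2000")` keeps 北京大学 as a single lexicon word and groups the digits into one `<NUM>` factoid, because every additional word multiplies in another source-model factor.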
