Scientific report: "Use of Mutual Information Based Character Clusters in Dictionary-less Morphological Analysis of Japanese"

Hideki Kashioka, Yasuhiro Kawata, Yumiko Kinjo, Andrew Finch and Ezra W. Black
kashioka ykawata kinjo finch black @
ATR Interpreting Telecommunications Research Laboratories

Abstract

For languages whose character set is very large and whose orthography does not require spacing between words, such as Japanese, tokenizing and part-of-speech tagging are often the difficult parts of any morphological analysis. Practical systems that tackle this problem rely primarily on uncontrolled heuristics. The use of information on character sorts, however, mitigates this difficulty. This paper presents our method of incorporating character clustering based on mutual information into Decision-Tree Dictionary-less morphological analysis. We have confirmed that using natural classes significantly improves our morphological analyzer in both tokenizing and tagging Japanese text.

1 Introduction

Recent papers have reported cases of successful part-of-speech tagging with statistical language modeling techniques (Church, 1988; Cutting et al., 1992; Charniak et al., 1993; Brill, 1994; Nagata, 1994; Yamamoto, 1996).
Morphological analysis of Japanese, however, is more complex because, unlike in European languages, no spaces are inserted between words. In fact, even native Japanese speakers place word boundaries inconsistently. Consequently, individual researchers have been adopting different word boundaries and tag sets based on their own theory-internal justifications. For a practical system to utilize the different word boundaries and tag sets according to the demands of an application, it is necessary to coordinate the dictionary used, the tag sets, and numerous other parameters. Unfortunately, such a task is costly. Furthermore, it is difficult to maintain the accuracy needed to regulate the word boundaries. Also, depending on the purpose, new technical terminology may have to be collected.
