Scientific report: "Subword-based Tagging for Confidence-dependent Chinese Word Segmentation"

Ruiqiang Zhang (1, 2), Genichiro Kikui, and Eiichiro Sumita (1, 2)
(1) National Institute of Information and Communications Technology
(2) ATR Spoken Language Communication Research Laboratories, 2-2-2 Hikaridai, Seiika-cho, Soraku-gun, Kyoto 619-0288, Japan
Now the second author is affiliated with NTT.

Abstract

We proposed subword-based tagging for Chinese word segmentation to improve on the existing character-based tagging. The subword-based tagging was implemented using the maximum entropy (MaxEnt) and the conditional random fields (CRF) methods. We found that the proposed subword-based tagging outperformed the character-based tagging in all comparative experiments. In addition, we proposed a confidence measure approach to combine the results of a dictionary-based and a subword-tagging-based segmentation. This approach can produce an ideal tradeoff between the in-vocabulary rate and the out-of-vocabulary rate. Our techniques were evaluated using the test data from Sighan Bakeoff 2005. We achieved higher F-scores than the best results in three of the four corpora: PKU, CITYU, and MSR.

1 Introduction

Many approaches to Chinese word segmentation have been proposed in the past decades. Segmentation performance has improved significantly, from the earliest maximal match (dictionary-based) approaches to HMM-based approaches (Zhang et al., 2003) and recent state-of-the-art machine learning approaches such as maximum entropy (MaxEnt) (Xue and Shen, 2003), support vector machine (SVM) (Kudo and Matsumoto, 2001), conditional random fields (CRF) (Peng and McCallum, 2004), and minimum error rate training (Gao et al., 2004). By analyzing the top results in the first and second Bakeoffs (Sproat and Emerson, 2003; Emerson, 2005), we found that the top results were produced by direct or indirect use of so-called IOB tagging, which converts the problem of word segmentation into one of character tagging, so ...
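
To make the idea of turning word segmentation into character tagging concrete, here is a minimal sketch in Python. The excerpt does not spell out the tag set, so the convention used here is an assumption: "B" marks the first character of a multi-character word, "I" a later character of that word, and "O" a character that is a word by itself. The function names `words_to_iob` and `iob_to_words` are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
def words_to_iob(words):
    """Turn a segmented sentence (a list of words) into (character, tag) pairs.

    Assumed tag convention (not confirmed by the excerpt):
      B = first character of a multi-character word
      I = non-initial character of a multi-character word
      O = single-character word
    """
    tagged = []
    for word in words:
        if len(word) == 1:
            tagged.append((word, "O"))
        else:
            tagged.append((word[0], "B"))
            tagged.extend((ch, "I") for ch in word[1:])
    return tagged


def iob_to_words(tagged):
    """Rebuild the word list from (character, tag) pairs."""
    words = []
    for ch, tag in tagged:
        if tag == "I" and words:
            # Continue the word that is currently open.
            words[-1] += ch
        else:
            # "B" and "O" both start a new word.
            words.append(ch)
    return words


sentence = ["北京", "是", "首都"]
tags = words_to_iob(sentence)
# The tagging is lossless: decoding the tags recovers the segmentation.
assert iob_to_words(tags) == sentence
```

Under this view a tagger such as MaxEnt or CRF only has to predict one tag per character, and the segmentation follows from the decoded tag sequence.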
