Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Wordbreak Identification

Chu-Ren Huang, Institute of Linguistics, Academia Sinica, Taiwan, churen@gate.sinica.edu.tw
Petr Simon, Institute of Linguistics, Academia Sinica, Taiwan, sim@klubko.net
Shu-Kai Hsieh, DoFLAL, NIU, Taiwan, shukai@gmail.com
Laurent Prevot, CLLE-ERSS, CNRS, Universite de Toulouse, France, prevot@univ-tlse2.fr

Abstract

This paper addresses two remaining challenges in Chinese word segmentation. The challenge in HLT is to find a robust segmentation method that requires no prior lexical knowledge and no extensive training to adapt to new types of data. The challenge in modelling human cognition and acquisition is to segment words efficiently without using knowledge of wordhood. We propose a radical method of word segmentation to meet both challenges. The most critical concept that we introduce is that Chinese word segmentation is the classification of a string of character-boundaries (CBs) into either word-boundaries (WBs) or non-word-boundaries. In Chinese, CBs are delimited and distributed in between two characters. Hence we can use the distributional properties of CBs among the background character strings to predict which CBs are WBs.

1 Introduction: modeling and theoretical challenges

The fact that word segmentation remains a main research topic in the field of Chinese language processing indicates that there may be unresolved theoretical and processing issues.
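The boundary-classification view described above can be sketched in a few lines: enumerate the CB positions between adjacent characters, classify each CB as WB or not, and read the segmentation off the WB labels. The classifier below is a deliberately trivial stand-in (a fixed set of positions), not the distributional method the paper proposes; it only illustrates the reduction of segmentation to CB classification.

```python
def segment(text, is_word_boundary):
    """Segment `text` by classifying each character boundary (CB).

    `is_word_boundary(text, i)` decides whether the CB between
    text[i-1] and text[i] is a word boundary (WB). Any classifier
    with this interface yields a segmentation.
    """
    words, start = [], 0
    for i in range(1, len(text)):
        if is_word_boundary(text, i):
            words.append(text[start:i])  # close the current word at this WB
            start = i
    words.append(text[start:])  # last word runs to the end of the string
    return words

# Toy stand-in classifier: treat a fixed set of CB positions as WBs.
gold_wbs = {2, 4}
print(segment("ABCDE", lambda t, i: i in gold_wbs))  # ['AB', 'CD', 'E']
```

In this framing, the segmentation problem is entirely delegated to `is_word_boundary`; the paper's proposal amounts to implementing that decision from the distributional properties of the surrounding character strings rather than from a lexicon.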
In terms of processing, the fact is that none of the existing algorithms is robust enough to reliably segment unfamiliar types of texts without fine-tuning on massive training data. It is true that the performance of participating teams has steadily improved since the first SigHAN Chinese segmentation bakeoff (Sproat and Emerson, 2004). Bakeoff 3 in 2006 produced best f-scores of 95% and higher. However, these can only be achieved after training with the pre-segmented training dataset. This is still very far from real-world application.