Scientific report: "Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Wordbreak Identification"

Chu-Ren Huang (Institute of Linguistics, Academia Sinica, Taiwan; churen@), Petr Simon (Institute of Linguistics, Academia Sinica, Taiwan; sim@), Shu-Kai Hsieh (DoFLAL, NIU, Taiwan; shukai@), Laurent Prevot (CLLE-ERSS, CNRS, Universite de Toulouse, France; prevot@)

Abstract

This paper addresses two remaining challenges in Chinese word segmentation. The challenge in HLT is to find a robust segmentation method that requires no prior lexical knowledge and no extensive training to adapt to new types of data. The challenge in modelling human cognition and acquisition is to segment words efficiently without using knowledge of wordhood. We propose a radical method of word segmentation to meet both challenges. The most critical concept that we introduce is that Chinese word segmentation is the classification of a string of character-boundaries (CBs) as either word-boundaries (WBs) or non-word-boundaries. In Chinese, CBs are delimited and distributed between two characters. Hence we can use the distributional properties of CBs among the background character strings to predict which CBs are WBs.

1 Introduction: modeling and theoretical challenges

The fact that word segmentation remains a main research topic in the field of Chinese language processing indicates that there may be unresolved theoretical and processing issues.
In terms of processing, the fact is that none of the existing algorithms is robust enough to reliably segment unfamiliar types of text before fine-tuning with massive training data. It is true that the performance of participating teams has steadily improved since the first SigHAN Chinese segmentation bakeoff (Sproat and Emerson, 2004). Bakeoff 3 in 2006 produced best f-scores at 95% and higher. However, these can only be achieved after training with the pre-segmented training dataset. This is still very far away from real-world .
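The abstract's central idea, that segmentation amounts to labelling each character-boundary (CB) as a word-boundary (WB) or a non-word-boundary, can be illustrated with a minimal Python sketch. It shows only the representation, not the authors' method for predicting the labels from distributional properties. The function names `words_to_boundaries` and `boundaries_to_words`, the 0/1 label encoding, and the example sentence are assumptions made here for illustration and do not come from the paper.

```python
from typing import List


def words_to_boundaries(words: List[str]) -> List[int]:
    """Label every boundary between adjacent characters.

    1 marks a word-boundary (WB), 0 marks a non-word-boundary.
    Assumes every word is a non-empty string (illustrative encoding only).
    """
    labels: List[int] = []
    for i, word in enumerate(words):
        # Boundaries between characters inside one word are not WBs.
        labels.extend([0] * (len(word) - 1))
        # The boundary between this word and the next one is a WB.
        if i < len(words) - 1:
            labels.append(1)
    return labels


def boundaries_to_words(text: str, labels: List[int]) -> List[str]:
    """Recover the word sequence from a string and its CB labels."""
    if len(labels) != max(len(text) - 1, 0):
        raise ValueError("need exactly one label per character boundary")
    words: List[str] = []
    start = 0
    for i, label in enumerate(labels):
        if label == 1:
            # labels[i] sits between text[i] and text[i + 1].
            words.append(text[start:i + 1])
            start = i + 1
    if text:
        words.append(text[start:])
    return words


# Hypothetical example sentence, segmented by hand.
segmented = ["中文", "分词", "很", "难"]
labels = words_to_boundaries(segmented)
print(labels)  # [0, 1, 0, 1, 1]
print(boundaries_to_words("".join(segmented), labels))
```

A string of n characters has n - 1 CBs, so under this view a segmenter makes one binary decision per CB and never needs to consult a lexicon in order to represent its output.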
