One Tokenization per Source

Jin GUO
Kent Ridge Digital Labs
21 Heng Mui Keng Terrace, Singapore 119613

Abstract

We report in this paper the observation of one tokenization per source: the same critical fragment in different sentences from the same source almost always realizes one and the same of its many possible tokenizations. This observation is shown to be very helpful in sentence tokenization practice, and is argued to have far-reaching implications for natural language processing.

1 Introduction

This paper sets out to establish the hypothesis of one tokenization per source: if an ambiguous fragment appears two or more times in different sentences from the same source, it is extremely likely that all occurrences share the same tokenization.

Sentence tokenization is the task of mapping sentences from character strings into streams of tokens. It is a long-standing problem in Chinese language processing, since Chinese lacks explicit word delimiters such as the white-spaces of English. Researchers have gradually been turning to model the task as a general lexicalization or bracketing problem in computational linguistics, in the hope that the research might also benefit the study of similar problems in other languages. For instance, in machine translation it is widely agreed that many multiple-word expressions, such as idioms, compounds, and some collocations, while not explicitly delimited in sentences, are ideally treated as single lexicalized units.

The primary obstacle in sentence tokenization is the existence of uncertainty, both in the notion of words and tokens and in the recognition of words and tokens in context. The same fragment in different contexts may have to be tokenized differently. For instance, the character string "todayissunday" would normally be tokenized as "today is Sunday", but could also reasonably be tokenized as "today is sun day". In terms of possibility, it has been argued that no lexically possible tokenization can not …
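To make the two ideas above concrete (enumerating the lexically possible tokenizations of an undelimited fragment, and reusing one tokenization per source), here is a minimal Python sketch. It is not from the paper: the toy lexicon and all function and class names are illustrative assumptions, and the fewest-tokens fallback merely stands in for whatever disambiguation model is actually used.

    # A sketch of tokenization ambiguity and the one-tokenization-per-source
    # heuristic. Toy lexicon and all names are illustrative, not from the paper.

    LEXICON = {"today", "to", "day", "is", "sun", "sunday", "i", "s"}

    def all_tokenizations(s, lexicon=LEXICON):
        """Enumerate every split of s into lexicon words
        (the lexically possible tokenizations)."""
        if not s:
            return [[]]
        results = []
        for i in range(1, len(s) + 1):
            prefix = s[:i]
            if prefix in lexicon:
                for rest in all_tokenizations(s[i:], lexicon):
                    results.append([prefix] + rest)
        return results

    # The critical fragment admits several lexically possible tokenizations:
    for tokens in all_tokenizations("todayissunday"):
        print(tokens)
    # ['today', 'is', 'sunday'], ['today', 'is', 'sun', 'day'],
    # ['to', 'day', 'is', 'sunday'], ... among others.

    class SourceTokenizer:
        """Reuse one tokenization per (source, fragment) pair: once a critical
        fragment has been disambiguated within a source, every later occurrence
        in the same source receives the same tokenization."""

        def __init__(self, disambiguate):
            self.disambiguate = disambiguate  # fallback picking one tokenization
            self.memo = {}  # (source_id, fragment) -> chosen tokenization

        def tokenize(self, source_id, fragment):
            key = (source_id, fragment)
            if key not in self.memo:
                self.memo[key] = self.disambiguate(fragment)
            return self.memo[key]

    # Hypothetical fallback: prefer the tokenization with the fewest tokens.
    tok = SourceTokenizer(lambda f: min(all_tokenizations(f), key=len))
    print(tok.tokenize("doc1", "todayissunday"))  # ['today', 'is', 'sunday']
    print(tok.tokenize("doc1", "todayissunday"))  # reused, identical by construction

The memo table is what encodes the hypothesis: within one source, consistency across occurrences of the same fragment is assumed, so disambiguation runs once per fragment per source rather than once per sentence.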
