TAILIEUCHUNG - Scientific report: "Unsupervised Segmentation of Chinese Text by Use of Branching Entropy"

Zhihui Jin and Kumiko Tanaka-Ishii
Graduate School of Information Science and Technology, University of Tokyo

Abstract

We propose an unsupervised segmentation method based on an assumption about language data: that the increasing point of entropy of successive characters is the location of a word boundary. A large-scale experiment was conducted using 200 MB of unsegmented training data and 1 MB of test data, and a precision of 90% was attained with recall being around 80%. Moreover, we found that the precision was stable at around 90% independently of the learning data size.

1 Introduction

The theme of this paper is the following assumption:

The uncertainty of tokens coming after a sequence helps determine whether a given position is at a boundary. (A)

Intuitively, as illustrated in Figure 1, the variety of successive tokens at each character inside a word monotonically decreases according to the offset length, because the longer the preceding character n-gram, the longer the preceding context and the more it restricts the appearance of possible next tokens. For example, it is easier to guess which character comes after "natura" than after "na". On the other hand, the uncertainty at the position of a word border becomes greater, and the complexity increases, as the position is out of context. With the same example, it is difficult to guess which character comes after "natural ". This suggests that a word border can be detected by focusing on the differentials of the uncertainty of branching.

Figure 1: Intuitive illustration of a variety of successive tokens and a word boundary

In this paper, we report our study on applying this assumption to Chinese word segmentation by formalizing the uncertainty of successive tokens via the branching entropy (which we mathematically define in the next section). Our intention in this paper is above all to study the fundamental and scientific statistical property underlying language data, so that it can be applied to language engineering.

The above assumption (A) dates back to the fundamental work done by Harris (Harris, 1955), where he says that when the number of different tokens coming after every prefix of a word marks ...
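The intuition behind assumption (A) can be tried out on a toy corpus. The sketch below is only an illustration and not the authors' implementation: the excerpt does not include the paper's formal definition, so the code assumes the branching entropy of a context is the Shannon entropy of the empirical distribution of the next character. The names `successor_counts`, `branching_entropy`, and `corpus` are hypothetical, and the corpus is an unsegmented English string standing in for Chinese text.

```python
import math
from collections import Counter, defaultdict


def successor_counts(text, max_n):
    """For every context of length 1..max_n, count the characters that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(text)):
        for n in range(1, max_n + 1):
            if i + n < len(text):
                counts[text[i:i + n]][text[i + n]] += 1
    return counts


def branching_entropy(counts, context):
    """Entropy in bits of the next-character distribution after `context`.

    Assumption: plain Shannon entropy over observed successors;
    returns 0.0 for a context that was never seen.
    """
    followers = counts.get(context)
    if not followers:
        return 0.0
    total = sum(followers.values())
    return sum(-(c / total) * math.log2(c / total) for c in followers.values())


# Toy unsegmented corpus (hypothetical data, no spaces between words).
corpus = "naturallanguagenaturalsciencenaturalordernationnamenative"
counts = successor_counts(corpus, max_n=7)

# Entropy profile along the prefixes of "natural".
for end in range(1, len("natural") + 1):
    prefix = "natural"[:end]
    print(prefix, round(branching_entropy(counts, prefix), 3))
```

In this toy corpus the entropy drops to zero inside the word (only "l" ever follows "natura") and rises again after the full word "natural", where three different characters follow. That rise is the kind of increasing point the abstract associates with a word boundary.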


