TAILIEUCHUNG - Scientific report: "An Iterative Algorithm to Build Chinese Language Models"

An Iterative Algorithm to Build Chinese Language Models

Xiaoqiang Luo, Center for Language and Speech Processing, The Johns Hopkins University, 3400 N. Charles St., Baltimore, MD 21218, USA
Salim Roukos, IBM T. J. Watson Research Center, Yorktown

Abstract

We present an iterative procedure to build a Chinese language model (LM). We segment Chinese text into words based on a word-based Chinese language model. However, the construction of a Chinese LM itself requires word boundaries. To get out of this chicken-and-egg problem, we propose an iterative procedure that alternates two operations: segmenting text into words and building an LM. Starting with an initial segmented corpus and an LM based upon it, we use a Viterbi-like algorithm to segment another set of data. Then we build an LM based on the second set and use the resulting LM to segment the first corpus again. The alternating procedure provides a self-organized way for the segmenter to automatically detect unseen words and correct segmentation errors. Our preliminary experiment shows that the alternating procedure not only improves the accuracy of our segmentation but also discovers unseen words surprisingly well. The resulting word-based LM has a perplexity of 188 for a general Chinese corpus.

1 Introduction

In statistical speech recognition (Bahl et al., 1983), it is necessary to build a language model (LM) for assigning probabilities to hypothesized sentences. The LM is usually built by collecting statistics of words over a large set of text data. While doing so is straightforward for English, it is not trivial to collect statistics for Chinese words, since word boundaries are not marked in written Chinese text. Chinese is a morphosyllabic language (DeFrancis, 1984) in that almost all Chinese characters represent a single syllable and most Chinese characters are also morphemes. Since a word can be multi-syllabic, it is generally non-trivial to segment a Chinese sentence into words (Wu and Tseng, 1993). Since segmentation is ...
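The alternating procedure described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: it assumes a unigram LM in place of whatever model the paper uses, a crude per-character penalty for unseen strings, and a maximum word length of four characters. The names `build_lm`, `log_prob`, `segment`, `iterate`, and `MAX_WORD_LEN` are all hypothetical.

```python
import math
from collections import Counter

MAX_WORD_LEN = 4  # assumption: cap on candidate word length, in characters


def build_lm(segmented_corpus):
    """Build a unigram LM (word counts, total count) from segmented sentences."""
    counts = Counter(w for sent in segmented_corpus for w in sent)
    return counts, sum(counts.values())


def log_prob(word, lm):
    """Log score of a word; unseen strings get a per-character floor (assumed heuristic)."""
    counts, total = lm
    denom = total + len(counts) + 1
    if word in counts:
        return math.log(counts[word] / denom)
    # Unseen string: penalize each character so long unknown words stay unlikely.
    return len(word) * math.log(1.0 / denom)


def segment(text, lm):
    """Viterbi-like dynamic program: best-scoring split of text into words."""
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i] = best score of text[:i]
    back = [0] * (n + 1)            # back[i] = start index of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start] + log_prob(text[start:end], lm)
            if score > best[end]:
                best[end], back[end] = score, start
    words, end = [], n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


def iterate(seed_segmented, raw_b, rounds=3):
    """Alternate between segmenting one corpus and rebuilding the LM on the other."""
    raw_a = ["".join(sent) for sent in seed_segmented]  # first corpus, unsegmented
    seg_a, seg_b = seed_segmented, []
    lm = build_lm(seg_a)
    for _ in range(rounds):
        seg_b = [segment(s, lm) for s in raw_b]  # segment the second set
        lm = build_lm(seg_b)                     # LM from the second set
        seg_a = [segment(s, lm) for s in raw_a]  # re-segment the first corpus
        lm = build_lm(seg_a)                     # LM from the first corpus
    return lm, seg_a, seg_b
```

In this sketch, a string that the current LM has never seen can still surface as a run of short pieces in one pass and then be counted when the LM is rebuilt, which is one way to picture how alternation might pick up new words. How the paper actually scores unseen words is not stated in the excerpt above.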


