Scientific report: "Distribution-Based Pruning of Backoff Language Models"

Distribution-Based Pruning of Backoff Language Models

Jianfeng Gao, Microsoft Research China, No. 49 Zhichun Road, Haidian District, 100080 China, jfgao@
Kai-Fu Lee, Microsoft Research China, No. 49 Zhichun Road, Haidian District, 100080 China, kfl@

Abstract

We propose a distribution-based pruning of n-gram backoff language models. Instead of the conventional approach of pruning n-grams that are infrequent in training data, we prune n-grams that are likely to be infrequent in a new document. Our method is based on the n-gram distribution, i.e., the probability that an n-gram occurs in a new document. Experimental results show that our method performed 7-9% (word perplexity reduction) better than conventional cutoff methods.

1 Introduction

Statistical language modelling (SLM) has been successfully applied to many domains, such as speech recognition (Jelinek, 1990), information retrieval (Miller et al., 1999), and spoken language understanding (Zue, 1995). In particular, the n-gram language model (LM) has been demonstrated to be highly effective for these domains. An n-gram LM estimates the probability of a word given the preceding words, P(wn | w1, ..., wn-1). In applying an SLM, it is usually the case that more training data will improve a language model. However, as training data size increases, LM size increases, which can lead to models that are too large for practical use. To deal with this problem, count cutoff (Jelinek, 1990) is widely used to prune language models.
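Count cutoff can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it drops every n-gram whose training-data count does not exceed a threshold (the `count_cutoff` function and the toy trigram counts are assumptions for the example).

```python
from collections import Counter

def count_cutoff(ngram_counts, threshold=1):
    """Count-cutoff pruning (after Jelinek, 1990, simplified):
    keep only n-grams whose training count exceeds the threshold."""
    return Counter({ng: c for ng, c in ngram_counts.items() if c > threshold})

# Toy trigram counts from a hypothetical training corpus.
counts = Counter({
    ("the", "cat", "sat"): 5,
    ("cat", "sat", "on"): 4,
    ("on", "the", "mat"): 1,  # infrequent in training data: pruned
})

pruned = count_cutoff(counts, threshold=1)
```

With `threshold=1`, the singleton trigram is removed while the frequent trigrams survive; raising the threshold trades model size against coverage.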
The cutoff method deletes from the LM those n-grams that occur infrequently in the training data. It assumes that if an n-gram is infrequent in training data, it is also infrequent in testing data. But in the real world, training data rarely matches testing data perfectly, so the count cutoff method is not perfect. In this paper, we propose a distribution-based cutoff method. This approach estimates whether an n-gram is likely to be infrequent in testing data. To determine this likelihood, we divide the training ...
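The excerpt is cut off before the paper's estimator is defined, but the stated idea (prune n-grams that are unlikely to occur in a new document) can be illustrated with a simple proxy. The sketch below approximates the probability that an n-gram occurs in a new document by its document frequency across training documents; the function name, the threshold `min_doc_prob`, and the toy data are all assumptions for illustration, not the authors' actual distribution model.

```python
def document_frequency_prune(docs_ngrams, min_doc_prob=0.5):
    """Illustrative sketch, not the paper's exact estimator: estimate
    P(n-gram occurs in a new document) as the fraction of training
    documents containing it, and keep n-grams above min_doc_prob."""
    num_docs = len(docs_ngrams)
    df = {}
    for doc in docs_ngrams:
        for ng in set(doc):  # count each n-gram once per document
            df[ng] = df.get(ng, 0) + 1
    return {ng for ng, n in df.items() if n / num_docs >= min_doc_prob}

# Three hypothetical training documents, each a list of bigrams.
docs = [
    [("new", "york"), ("stock", "market")],
    [("new", "york"), ("language", "model")],
    [("new", "york")],
]

kept = document_frequency_prune(docs, min_doc_prob=0.5)
```

Here ("new", "york") appears in every document and is kept, while bigrams confined to a single document are pruned even if they were frequent within that one document, which is exactly the distinction the distribution-based view adds over raw counts.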
