Scientific report: "Distribution-Based Pruning of Backoff Language Models"

Distribution-Based Pruning of Backoff Language Models

Jianfeng Gao, Microsoft Research China, No. 49 Zhichun Road, Haidian District, 100080 China, jfgao@
Kai-Fu Lee, Microsoft Research China, No. 49 Zhichun Road, Haidian District, 100080 China, kfl@

Abstract

We propose a distribution-based pruning of n-gram backoff language models. Instead of the conventional approach of pruning n-grams that are infrequent in training data, we prune n-grams that are likely to be infrequent in a new document. Our method is based on the n-gram distribution, i.e., the probability that an n-gram occurs in a new document. Experimental results show that our method performed 7-9% (word perplexity reduction) better than conventional cutoff methods.

1 Introduction

Statistical language modelling (SLM) has been successfully applied to many domains, such as speech recognition (Jelinek, 1990), information retrieval (Miller et al., 1999), and spoken language understanding (Zue, 1995). In particular, the n-gram language model (LM) has been demonstrated to be highly effective for these domains. An n-gram LM estimates the probability of a word given the preceding words, P(wn | w1, ..., wn-1). In applying an SLM, it is usually the case that more training data will improve a language model. However, as training data size increases, LM size increases, which can lead to models that are too large for practical use. To deal with this problem, count cutoff (Jelinek, 1990) is widely used to prune language models.
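Count cutoff can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it drops every n-gram whose training-data count does not exceed a threshold (the `count_cutoff` function and the toy trigram counts are assumptions for the example).

```python
from collections import Counter

def count_cutoff(ngram_counts, threshold=1):
    """Count-cutoff pruning (after Jelinek, 1990, simplified):
    keep only n-grams whose training count exceeds the threshold."""
    return Counter({ng: c for ng, c in ngram_counts.items() if c > threshold})

# Toy trigram counts from a hypothetical training corpus.
counts = Counter({
    ("the", "cat", "sat"): 5,
    ("cat", "sat", "on"): 4,
    ("on", "the", "mat"): 1,  # infrequent in training data: pruned
})

pruned = count_cutoff(counts, threshold=1)
```

With `threshold=1`, the singleton trigram is removed while the frequent trigrams survive; raising the threshold trades model size against coverage.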
The cutoff method deletes from the LM those n-grams that occur infrequently in the training data. It assumes that if an n-gram is infrequent in training data, it is also infrequent in testing data. But in the real world, training data rarely matches testing data perfectly, so the count cutoff method is not perfect. In this paper, we propose a distribution-based cutoff method. This approach estimates whether an n-gram is likely to be infrequent in testing data. To determine this likelihood, we divide the training ...
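The excerpt is cut off before the paper's estimator is defined, but the stated idea (prune n-grams that are unlikely to occur in a new document) can be illustrated with a simple proxy. The sketch below approximates the probability that an n-gram occurs in a new document by its document frequency across training documents; the function name, the threshold `min_doc_prob`, and the toy data are all assumptions for illustration, not the authors' actual distribution model.

```python
def document_frequency_prune(docs_ngrams, min_doc_prob=0.5):
    """Illustrative sketch, not the paper's exact estimator: estimate
    P(n-gram occurs in a new document) as the fraction of training
    documents containing it, and keep n-grams above min_doc_prob."""
    num_docs = len(docs_ngrams)
    df = {}
    for doc in docs_ngrams:
        for ng in set(doc):  # count each n-gram once per document
            df[ng] = df.get(ng, 0) + 1
    return {ng for ng, n in df.items() if n / num_docs >= min_doc_prob}

# Three hypothetical training documents, each a list of bigrams.
docs = [
    [("new", "york"), ("stock", "market")],
    [("new", "york"), ("language", "model")],
    [("new", "york")],
]

kept = document_frequency_prune(docs, min_doc_prob=0.5)
```

Here ("new", "york") appears in every document and is kept, while bigrams confined to a single document are pruned even if they were frequent within that one document, which is exactly the distinction the distribution-based view adds over raw counts.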
