Research paper: "Smoothing a Tera-word Language Model"

Deniz Yuret, Koc University, dyuret@

Abstract

Frequency counts from very large corpora, such as the Web 1T dataset, have recently become available for language modeling. Omission of low-frequency n-gram counts is a practical necessity for datasets of this size. Naive implementations of standard smoothing methods do not realize the full potential of such large datasets with missing counts. In this paper I present a new smoothing algorithm that combines the Dirichlet prior form of MacKay and Peto (1995) with the modified back-off estimates of Kneser and Ney (1995), leading to a 31% perplexity reduction on the Brown corpus compared to a baseline implementation of Kneser-Ney discounting.

1 Introduction

Language models, i.e., models that assign probabilities to sequences of words, have proven useful in a variety of applications, including speech recognition and machine translation (Bahl et al., 1983; Brown et al., 1990). More recently, good results on lexical substitution and word sense disambiguation using language models have also been reported (Yuret, 2007). The recently introduced Web 1T 5-gram dataset (Brants and Franz, 2006) contains the counts of word sequences up to length five in a 10^12-word corpus derived from publicly accessible Web pages. As this corpus is several orders of magnitude larger than those used in previous language modeling studies, it holds the promise of more accurate, domain-independent probability estimates. However, naive application of the well-known smoothing methods does not realize the full potential of this dataset. In this paper I present experiments with modifications and combinations of various smoothing methods, using the Web 1T dataset for model building and the Brown corpus for evaluation. I describe a new smoothing method, Dirichlet-Kneser-Ney (DKN), that combines the Bayesian intuition of MacKay and Peto (1995) with the improved back-off estimation of Kneser and Ney (1995) and gives significantly better results.
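To make the combination concrete, below is a minimal sketch in Python of the two ideas for a bigram model: the Dirichlet prior form interpolates observed counts with a lower-order distribution, weighted by a concentration parameter alpha (MacKay and Peto, 1995), and the lower-order distribution uses Kneser-Ney continuation counts (how many distinct contexts a word follows) rather than raw frequencies. This is an illustrative assumption-laden simplification, not the paper's implementation: the single fixed `alpha` and the function names are hypothetical, and the paper works with up to 5-grams and fits context-dependent parameters.

```python
from collections import Counter, defaultdict

def dkn_bigram(tokens, alpha=1.0):
    """Sketch: Dirichlet-prior bigram estimate whose lower-order
    distribution uses Kneser-Ney continuation counts.
    `alpha` is a single fixed concentration parameter here; the
    paper's method fits context-dependent parameters instead."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])  # c(h): times h occurs as a context
    # Continuation counts: number of distinct contexts each word follows.
    contexts_of = defaultdict(set)
    for h, w in bigrams:
        contexts_of[w].add(h)
    total = sum(len(s) for s in contexts_of.values())
    p_cont = {w: len(s) / total for w, s in contexts_of.items()}

    def prob(w, h):
        # Dirichlet prior form: (c(h,w) + alpha * P_lower(w)) / (c(h) + alpha)
        return (bigrams[h, w] + alpha * p_cont.get(w, 0.0)) / (context_counts[h] + alpha)

    return prob

p = dkn_bigram("the cat sat on the mat and the cat ran".split(), alpha=0.5)
print(p("cat", "the"))   # seen bigram: estimate dominated by its count
print(p("mat", "ran"))   # unseen context: falls back to continuation probability
```

Note how the continuation counts change the back-off behavior: a word that appears often but only after one context (like "Francisco" after "San") gets a small lower-order probability, which is the key insight of Kneser and Ney (1995).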
