
An Empirical Investigation of Discounting in Cross-Domain Language Models

Greg Durrett and Dan Klein
Computer Science Division
University of California, Berkeley

Abstract

We investigate the empirical behavior of n-gram discounts within and across domains. When a language model is trained and evaluated on two corpora from exactly the same domain, discounts are roughly constant, matching the assumptions of modified Kneser-Ney LMs. However, when training and test corpora diverge, the empirical discount grows essentially as a linear function of the n-gram count. We adapt a Kneser-Ney language model to incorporate such growing discounts, resulting in perplexity improvements over modified Kneser-Ney and Jelinek-Mercer baselines.

1 Introduction

Discounting, or subtracting from the count of each n-gram, is one of the core aspects of Kneser-Ney language modeling (Kneser and Ney, 1995). For all but the smallest n-gram counts, Kneser-Ney uses a single discount, one that does not grow with the n-gram count, because such constant discounting was seen in early experiments on held-out data (Church and Gale, 1991). However, due to increasing computational power and corpus sizes, language modeling today presents a different set of challenges than it did 20 years ago. In particular, modeling cross-domain effects has become increasingly important (Klakow, 2000; Moore and Lewis, 2010), and deployed systems must frequently process data that is out-of-domain from the standpoint of the language model.

In this work, we perform experiments on held-out data to evaluate how discounting behaves in the cross-domain setting. We find that when training and testing on corpora that are as similar as possible, empirical discounts indeed do not grow with n-gram count, which validates the parametric assumption of Kneser-Ney smoothing. However, when the train and evaluation corpora differ even slightly, discounts generally exhibit linear growth in the count of the n-gram, with the amount of ...
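For reference (not reproduced from the paper itself), the single-discount scheme discussed above corresponds to the standard interpolated Kneser-Ney estimator, in which a fixed absolute discount D is subtracted from every observed n-gram count before interpolating with the lower-order model:

\[
P_{\mathrm{KN}}(w \mid h) \;=\; \frac{\max\bigl(c(hw) - D,\; 0\bigr)}{c(h)} \;+\; \lambda(h)\, P_{\mathrm{KN}}(w \mid h'),
\qquad
\lambda(h) \;=\; \frac{D \cdot N_{1+}(h\,\cdot)}{c(h)}
\]

Here h' is the history h with its earliest word dropped, and N_{1+}(h ·) is the number of distinct word types observed after h. Modified Kneser-Ney replaces the single D with three discounts D_1, D_2, D_{3+} selected by the n-gram's count, but all of them remain fixed as counts grow, which is exactly the assumption the paper examines.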

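The paper's measurements rest on comparing training counts with held-out counts. As a minimal illustrative sketch in the spirit of Church and Gale (1991), and assuming same-sized training and held-out corpora and hypothetical function names (this is not the authors' code), the empirical discount d(k) for n-grams seen k times in training can be estimated as k minus their average held-out count:

    from collections import Counter, defaultdict

    def empirical_discounts(train_tokens, heldout_tokens, n=3, max_count=10):
        """Return {k: d(k)}, where d(k) is k minus the average held-out count of
        n-grams that occurred exactly k times in the training tokens."""
        def ngram_counts(tokens):
            return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

        train_counts = ngram_counts(train_tokens)
        heldout_counts = ngram_counts(heldout_tokens)

        heldout_sum = defaultdict(float)  # total held-out count, keyed by training count k
        num_types = defaultdict(int)      # number of n-gram types with training count k
        for gram, k in train_counts.items():
            if k <= max_count:
                heldout_sum[k] += heldout_counts.get(gram, 0)
                num_types[k] += 1

        # Empirical discount: training count minus average held-out count.
        return {k: k - heldout_sum[k] / num_types[k] for k in sorted(num_types)}

A roughly flat d(k) across k supports the fixed-discount assumption of (modified) Kneser-Ney, while d(k) growing linearly with k is the cross-domain pattern the paper reports.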