Faster and Smaller N-Gram Language Models

Adam Pauls   Dan Klein
Computer Science Division
University of California, Berkeley

Abstract

N-gram language models are a major resource bottleneck in machine translation. In this paper, we present several language model implementations that are both highly compact and fast to query. Our fastest implementation is as fast as the widely used SRILM while requiring only 25% of the storage. Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques. We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models and SRILM by up to 300%.

1 Introduction

For modern statistical machine translation systems, language models must be both fast and compact. The largest language models (LMs) can contain as many as several hundred billion n-grams (Brants et al., 2007), so storage is a challenge. At the same time, decoding a single sentence can trigger hundreds of thousands of queries to the language model, so speed is also critical. As always, trade-offs exist between time, space, and accuracy, with many recent papers considering small-but-approximate noisy LMs (Chazelle et al., 2004; Guthrie and Hepple, 2010) or small-but-slow compressed LMs (Germann et al., 2009).

In this paper, we present several lossless methods for compactly but efficiently storing large LMs in memory. As in much previous work (Whittaker and Raj, 2001; Hsu and Glass, 2008), our methods are conceptually based on tabular trie encodings wherein each n-gram key is stored as the concatenation of one word (here, the last) and an offset encoding the remaining words (here, the context). After presenting a bit-conscious basic system that typifies such approaches, we improve […]
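To make the tabular trie idea concrete, here is a minimal sketch of a context-encoding trie in Python. It is not the paper's bit-packed implementation; the names ContextTrieLM, add, and count are illustrative assumptions. It only shows the key idea that each stored n-gram is addressed by the pair (offset of its context, id of its last word), so that looking up an n-gram is a walk of at most n table probes.

```python
# Sketch of a context-encoding trie: every stored n-gram gets an integer
# offset, and the key for w_1..w_n is (offset of its context w_1..w_{n-1},
# id of the last word w_n). Class and method names are illustrative.

class ContextTrieLM:
    def __init__(self):
        self.vocab = {}        # word string -> word id
        self.table = {}        # (context_offset, word_id) -> [offset, count]
        self.next_offset = 1   # offset 0 is reserved for the empty context

    def _word_id(self, word):
        return self.vocab.setdefault(word, len(self.vocab))

    def add(self, ngram, count):
        """Insert an n-gram (tuple of word strings) with its count.

        Proper prefixes that are not yet present are added with count 0.
        """
        context_offset = 0
        for i, word in enumerate(ngram):
            key = (context_offset, self._word_id(word))
            if key not in self.table:
                self.table[key] = [self.next_offset, 0]
                self.next_offset += 1
            if i == len(ngram) - 1:
                self.table[key][1] = count
            context_offset = self.table[key][0]

    def count(self, ngram):
        """Return the stored count of an n-gram, walking one word at a time."""
        context_offset = 0
        entry = None
        for word in ngram:
            word_id = self.vocab.get(word)
            if word_id is None:
                return 0
            entry = self.table.get((context_offset, word_id))
            if entry is None:
                return 0
            context_offset = entry[0]
        return entry[1] if entry else 0


lm = ContextTrieLM()
lm.add(("the",), 120)
lm.add(("the", "cat"), 7)
lm.add(("the", "cat", "sat"), 3)
assert lm.count(("the", "cat", "sat")) == 3
```

Because every n-gram reduces to a fixed-width (context offset, word id) pair, the actual systems in the paper can pack these pairs into tightly bit-packed sorted or hashed arrays rather than a general-purpose dictionary, which is where the per-n-gram savings come from.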
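The caching mentioned in the abstract is only described at a high level in this excerpt, so the sketch below shows just the general flavor: a small direct-mapped memoization layer placed in front of any LM scoring function, exploiting the fact that decoding a sentence repeats many identical n-gram queries. The class name CachedLM and the cache size are assumptions for illustration, not necessarily the paper's exact scheme.

```python
# Sketch of a direct-mapped query cache wrapped around an arbitrary LM
# scoring function. Names and sizes are illustrative assumptions.

class CachedLM:
    def __init__(self, score_fn, cache_size=1 << 16):
        self.score_fn = score_fn
        self.cache_size = cache_size
        self.keys = [None] * cache_size
        self.values = [0.0] * cache_size

    def score(self, ngram):
        slot = hash(ngram) % self.cache_size   # direct-mapped: one slot per bucket
        if self.keys[slot] == ngram:
            return self.values[slot]           # hit: skip the expensive trie lookup
        value = self.score_fn(ngram)           # miss: query the underlying LM
        self.keys[slot] = ngram
        self.values[slot] = value
        return value
```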
