Faster and Smaller N-Gram Language Models

Adam Pauls   Dan Klein
Computer Science Division
University of California, Berkeley

Abstract

N-gram language models are a major resource bottleneck in machine translation. In this paper, we present several language model implementations that are both highly compact and fast to query. Our fastest implementation is as fast as the widely used SRILM while requiring only 25% of the storage. Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques. We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models and SRILM by up to 300%.

1 Introduction

For modern statistical machine translation systems, language models must be both fast and compact. The largest language models (LMs) can contain as many as several hundred billion n-grams (Brants et al., 2007), so storage is a challenge. At the same time, decoding a single sentence can trigger hundreds of thousands of queries to the language model, so speed is also critical. As always, trade-offs exist between time, space, and accuracy, with many recent papers considering small-but-approximate noisy LMs (Chazelle et al., 2004; Guthrie and Hepple, 2010) or small-but-slow compressed LMs (Germann et al., 2009).

In this paper, we present several lossless methods for compactly but efficiently storing large LMs in memory. As in much previous work (Whittaker and Raj, 2001; Hsu and Glass, 2008), our methods are conceptually based on tabular trie encodings wherein each n-gram key is stored as the concatenation of one word (here, the last) and an offset encoding the remaining words (here, the context). After presenting a bit-conscious basic system that typifies such approaches, we improve […]
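To make the tabular trie idea concrete, here is a minimal sketch of a context-encoding trie in Python. It is not the paper's bit-packed implementation; the names ContextTrieLM, add, and count are illustrative assumptions. It only shows the key idea that each stored n-gram is addressed by the pair (offset of its context, id of its last word), so that looking up an n-gram is a walk of at most n table probes.

```python
# Sketch of a context-encoding trie: every stored n-gram gets an integer
# offset, and the key for w_1..w_n is (offset of its context w_1..w_{n-1},
# id of the last word w_n). Class and method names are illustrative.

class ContextTrieLM:
    def __init__(self):
        self.vocab = {}        # word string -> word id
        self.table = {}        # (context_offset, word_id) -> [offset, count]
        self.next_offset = 1   # offset 0 is reserved for the empty context

    def _word_id(self, word):
        return self.vocab.setdefault(word, len(self.vocab))

    def add(self, ngram, count):
        """Insert an n-gram (tuple of word strings) with its count.

        Proper prefixes that are not yet present are added with count 0.
        """
        context_offset = 0
        for i, word in enumerate(ngram):
            key = (context_offset, self._word_id(word))
            if key not in self.table:
                self.table[key] = [self.next_offset, 0]
                self.next_offset += 1
            if i == len(ngram) - 1:
                self.table[key][1] = count
            context_offset = self.table[key][0]

    def count(self, ngram):
        """Return the stored count of an n-gram, walking one word at a time."""
        context_offset = 0
        entry = None
        for word in ngram:
            word_id = self.vocab.get(word)
            if word_id is None:
                return 0
            entry = self.table.get((context_offset, word_id))
            if entry is None:
                return 0
            context_offset = entry[0]
        return entry[1] if entry else 0


lm = ContextTrieLM()
lm.add(("the",), 120)
lm.add(("the", "cat"), 7)
lm.add(("the", "cat", "sat"), 3)
assert lm.count(("the", "cat", "sat")) == 3
```

Because every n-gram reduces to a fixed-width (context offset, word id) pair, the actual systems in the paper can pack these pairs into tightly bit-packed sorted or hashed arrays rather than a general-purpose dictionary, which is where the per-n-gram savings come from.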
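The caching mentioned in the abstract is only described at a high level in this excerpt, so the sketch below shows just the general flavor: a small direct-mapped memoization layer placed in front of any LM scoring function, exploiting the fact that decoding a sentence repeats many identical n-gram queries. The class name CachedLM and the cache size are assumptions for illustration, not necessarily the paper's exact scheme.

```python
# Sketch of a direct-mapped query cache wrapped around an arbitrary LM
# scoring function. Names and sizes are illustrative assumptions.

class CachedLM:
    def __init__(self, score_fn, cache_size=1 << 16):
        self.score_fn = score_fn
        self.cache_size = cache_size
        self.keys = [None] * cache_size
        self.values = [0.0] * cache_size

    def score(self, ngram):
        slot = hash(ngram) % self.cache_size   # direct-mapped: one slot per bucket
        if self.keys[slot] == ngram:
            return self.values[slot]           # hit: skip the expensive trie lookup
        value = self.score_fn(ngram)           # miss: query the underlying LM
        self.keys[slot] = ngram
        self.values[slot] = value
        return value
```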
