TAILIEUCHUNG - Scientific report: "Scaling Phrase-Based Statistical Machine Translation to Larger Corpora and Longer Phrases"

Chris Callison-Burch, Colin Bannard
University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW
chris colin @

Josh Schroeder
Linear B Ltd., 39 B Cumberland Street, Edinburgh EH3 6RA
josh@

Abstract

In this paper we describe a novel data structure for phrase-based statistical machine translation which allows for the retrieval of arbitrarily long phrases while simultaneously using less memory than is required by current decoder implementations. We detail the computational complexity and average retrieval times for looking up phrase translations in our suffix array-based data structure. We show how sampling can be used to reduce the retrieval time by orders of magnitude with no loss in translation quality.

1 Introduction

Statistical machine translation (SMT) has an advantage over many other statistical natural language processing applications in that training data is regularly produced by other human activity. For some language pairs, very large sets of training data are now available. The publications of the European Union and United Nations provide gigabytes of data between various language pairs, which can be easily mined using a web crawler. The Linguistic Data Consortium provides an excellent set of off-the-shelf Arabic-English and Chinese-English parallel corpora for the annual NIST machine translation evaluation exercises.
The size of the NIST training data presents a problem for phrase-based statistical machine translation. Decoders such as Pharaoh (Koehn, 2004) primarily use lookup tables for the storage of phrases and their translations. Since retrieving longer segments of human-translated text generally leads to better translation quality, participants in the evaluation exercise try to maximize the length of phrases that are stored in lookup tables. The combination of large corpora and long phrases means that the table size can quickly become unwieldy. A
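To make the idea of a suffix array-based lookup concrete, the following is a minimal Python sketch of how occurrences of an arbitrarily long phrase could be located in a tokenized corpus without a precomputed phrase table. It is an illustration under stated assumptions, not the authors' implementation: the function names (build_suffix_array, find_phrase, sample_occurrences) are hypothetical, the construction is deliberately naive, and uniform random sampling is assumed because this excerpt does not describe the sampling strategy. The sketch covers only the monolingual lookup; finding the aligned translations of each occurrence is not shown.

```python
import random
from typing import List, Tuple


def build_suffix_array(tokens: List[str]) -> List[int]:
    """Return the start positions of all suffixes, sorted lexicographically.

    Naive construction (compares whole suffixes); fine for a toy corpus only.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_phrase(tokens: List[str], sa: List[int], phrase: List[str]) -> Tuple[int, int]:
    """Binary-search the half-open range [start, end) of suffix-array entries
    whose suffixes begin with `phrase`. The range is empty if the phrase is absent."""
    m = len(phrase)

    # Lower bound: first suffix whose length-m prefix is >= phrase.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] < phrase:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > phrase.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] <= phrase:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def sample_occurrences(sa: List[int], start: int, end: int,
                       max_samples: int, seed: int = 0) -> List[int]:
    """Return at most `max_samples` corpus positions from the matched range.

    Uniform sampling is an assumption made for this sketch.
    """
    positions = sa[start:end]
    if len(positions) <= max_samples:
        return sorted(positions)
    return sorted(random.Random(seed).sample(positions, max_samples))


corpus = "the cat sat on the mat and the cat slept".split()
suffix_array = build_suffix_array(corpus)
start, end = find_phrase(corpus, suffix_array, ["the", "cat"])
print(sorted(suffix_array[start:end]))  # [0, 7]
```

Each lookup costs two binary searches over the suffix array, so the phrase length does not have to be fixed in advance, and capping the number of occurrences inspected per phrase bounds the work done for very frequent phrases.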
