Scientific report: "Does more data always yield better translations?"

Guillem Gasco, Martha-Alicia Rocha, German Sanchis-Trilles, Jesus Andres-Ferrer and Francisco Casacuberta
Departament de Sistemes Informatics i Computacio, Universitat Politecnica de Valencia
Cami de Vera s/n, 46022 Valencia, Spain
{ggasco, mrocha, gsanchis, jandres, fcn}@

Abstract

Nowadays, there are large amounts of data available to train statistical machine translation systems. However, it is not clear whether all the training data actually help. A system trained on a subset of such huge bilingual corpora might outperform one trained on all the bilingual data. This paper studies these issues by analysing two training data selection techniques: one based on approximating the probability of an in-domain corpus, and another based on infrequent n-gram occurrence. Experimental results not only report significant improvements over random sentence selection but also an improvement over a system trained with the whole available data. Surprisingly, the improvements are obtained with just a small fraction of the data, which accounts for less than [...] of the sentences. Afterwards, we show that much larger room for improvement exists, although this is shown under non-realistic conditions.
1 Introduction

Globalisation and the popularisation of the Internet have led to a rapid increase in the amount of bilingual corpora available. Entities such as the European Union, the United Nations and other multinational organisations need to translate all the documentation they generate. Such translations happen every day and provide very large multilingual corpora, which are often difficult to process and significantly increase the computational requirements for training statistical machine translation (SMT) systems. For instance, the corpora made available for recent machine translation evaluations are in the order of 1 billion running words (Callison-Burch et al., 2010). However, two main problems arise when attempting to use this huge pool
