Scientific report: "An Automatic Filter for Non-Parallel Texts"

An Automatic Filter for Non-Parallel Texts

Chris Pike
Computer Science Department, New York University
715 Broadway, 7th Floor, New York, NY 10003, USA

I. Dan Melamed
Computer Science Department, New York University
715 Broadway, 7th Floor, New York, NY 10003, USA

Abstract

Numerous cross-lingual applications, including state-of-the-art machine translation systems, require parallel texts aligned at the sentence level. However, collections of such texts are often polluted by pairs of texts that are comparable but not parallel. Bitext maps can help to discriminate between parallel and comparable texts. Bitext mapping algorithms use a larger set of document features than competing approaches to this task, resulting in higher accuracy. In addition, good bitext mapping algorithms are not limited to documents with structural mark-up such as web pages. The task of filtering non-parallel text pairs represents a new application of bitext mapping algorithms.

1 Introduction

In June 2003, the U.S. government organized a Surprise Language Exercise for the NLP community. The goal was to build the best possible language technologies for a surprise language in just one month (Oard, 2003). One of the main technologies pursued was machine translation (MT). Statistical MT (SMT) systems were the most successful in this scenario because their construction typically requires less time than other approaches. On the other hand, SMT systems require large quantities of parallel text as training data. A significant collection of parallel text was obtained for this purpose from multiple sources. SMT systems were built and tested, and results were reported. Much later, we were surprised to discover that a significant portion of the training data was not parallel text! Some of the document pairs were on the same topic but not translations of each other. For today's sentence-based SMT systems, this kind of data is noise. How much better would the results have been if the noisy ...
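The abstract's claim that bitext maps can separate parallel from comparable texts can be made concrete with a small sketch. The Python below is only an illustration, not the filter described in the paper: it assumes a bitext map is already available as a list of (x, y) token-position correspondences, and the feature definitions (coverage, diagonal fit) and thresholds are assumptions chosen here for clarity, not the features or values the authors use.

```python
# Hypothetical sketch of a bitext-map-based filter. A bitext map is a set of
# (x, y) points, where x is a token position in text A and y is the position
# of a corresponding token in text B. In a truly parallel pair, the points
# are dense and cluster around the main diagonal; in a merely comparable pair
# they tend to be sparse and scattered.

from typing import List, Tuple


def bitext_map_features(points: List[Tuple[int, int]],
                        len_a: int, len_b: int) -> Tuple[float, float]:
    """Return (coverage, diagonal_fit) for a set of correspondence points."""
    if not points:
        return 0.0, 0.0
    # Coverage: fraction of positions in text A that appear in some point.
    coverage = len({x for x, _ in points}) / max(len_a, 1)
    # Diagonal fit: 1 minus the mean absolute deviation of the points from
    # the main diagonal after scaling both axes to [0, 1].
    deviation = sum(abs(x / len_a - y / len_b) for x, y in points) / len(points)
    return coverage, 1.0 - deviation


def looks_parallel(points: List[Tuple[int, int]],
                   len_a: int, len_b: int,
                   min_coverage: float = 0.25,
                   min_fit: float = 0.9) -> bool:
    """Crude filter: accept a pair only if its map is dense and near-diagonal."""
    coverage, fit = bitext_map_features(points, len_a, len_b)
    return coverage >= min_coverage and fit >= min_fit
```

For example, a dense, nearly diagonal map passes the filter, while the sparse, scattered map typical of a same-topic but non-translated pair falls below the coverage and fit thresholds and is rejected.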
