Scientific report: "Extracting Parallel Sub-Sentential Fragments from Non-Parallel Corpora"

Dragos Stefan Munteanu
University of Southern California, Information Sciences Institute
4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292
dragos@

Daniel Marcu
University of Southern California, Information Sciences Institute
4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292
marcu@

Abstract

We present a novel method for extracting parallel sub-sentential fragments from comparable, non-parallel bilingual corpora. By analyzing potentially similar sentence pairs using a signal processing-inspired approach, we detect which segments of the source sentence are translated into segments in the target sentence, and which are not. This method enables us to extract useful machine translation training data even from very non-parallel corpora, which contain no parallel sentence pairs. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system.

1 Introduction

Recently, there has been a surge of interest in the automatic creation of parallel corpora. Several researchers (Zhao and Vogel, 2002; Vogel, 2003; Resnik and Smith, 2003; Fung and Cheung, 2004a; Wu and Fung, 2005; Munteanu and Marcu, 2005) have shown how fairly good-quality parallel sentence pairs can be automatically extracted from comparable corpora and used to improve the performance of machine translation (MT) systems.
This work addresses a major bottleneck in the development of Statistical MT (SMT) systems: the lack of sufficiently large parallel corpora for most language pairs. Since comparable corpora exist in large quantities and for many languages (tens of thousands of words of news describing the same events are produced daily), the ability to exploit them for parallel data acquisition is highly beneficial for the SMT field.

Comparable corpora exhibit various degrees of parallelism. Fung and Cheung (2004a) describe corpora ranging from noisy parallel to .
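
To make the "signal processing-inspired" idea from the abstract more concrete, here is a minimal sketch of one way such a detector could work. It is not taken from the text above: the toy lexicon, the miss score of -1, the moving-average filter width, and the minimum fragment length are all assumptions chosen for illustration, and the function names (word_signal, smooth, positive_spans, extract_fragments) are hypothetical. Each source word gets a score depending on whether the lexicon links it to some target word; the scores are smoothed, and runs of words with a positive smoothed score are kept as candidate fragments.

```python
from typing import Dict, List, Tuple

Lexicon = Dict[Tuple[str, str], float]  # (source word, target word) -> score


def word_signal(source: List[str], target: List[str],
                lexicon: Lexicon, miss: float = -1.0) -> List[float]:
    """One value per source word: its best lexicon score against any
    target word, or `miss` when the lexicon links it to none of them."""
    signal = []
    for s in source:
        scores = [lexicon[(s, t)] for t in target if (s, t) in lexicon]
        signal.append(max(scores) if scores else miss)
    return signal


def smooth(signal: List[float], width: int = 3) -> List[float]:
    """Moving-average filter; the window is truncated at sentence edges."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def positive_spans(values: List[float], min_len: int = 3) -> List[Tuple[int, int]]:
    """Half-open (start, end) ranges where values stay positive for at
    least `min_len` consecutive positions."""
    spans, start = [], None
    for i, v in enumerate(values + [-1.0]):  # sentinel closes a trailing span
        if v > 0 and start is None:
            start = i
        elif v <= 0 and start is not None:
            if i - start >= min_len:
                spans.append((start, i))
            start = None
    return spans


def extract_fragments(source: List[str], target: List[str], lexicon: Lexicon,
                      width: int = 3, min_len: int = 3) -> List[str]:
    """Source-side fragments that appear to have a translation in `target`."""
    filtered = smooth(word_signal(source, target, lexicon), width)
    return [" ".join(source[a:b]) for a, b in positive_spans(filtered, min_len)]


# Toy example (invented data): only the middle of the source is translated.
source = "yesterday the president visited paris and bought cheese".split()
target = "le président a visité paris".split()
lexicon = {("the", "le"): 0.8, ("president", "président"): 0.9,
           ("visited", "visité"): 0.9, ("paris", "paris"): 1.0}
fragments = extract_fragments(source, target, lexicon)
```

The smoothing step is what lets the detector work on word spans rather than on isolated words: a single unlinked word inside a well-translated stretch does not necessarily break the fragment, and a single spurious link inside an untranslated stretch does not create one. A symmetric pass with source and target swapped would be needed to obtain the target side of each fragment.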
