Scientific report: "Extracting Parallel Sub-Sentential Fragments from Non-Parallel Corpora"

Dragos Stefan Munteanu
University of Southern California
Information Sciences Institute
4676 Admiralty Way, Suite 1001
Marina del Rey, CA 90292
dragos@

Daniel Marcu
University of Southern California
Information Sciences Institute
4676 Admiralty Way, Suite 1001
Marina del Rey, CA 90292
marcu@

Abstract

We present a novel method for extracting parallel sub-sentential fragments from comparable, non-parallel bilingual corpora. By analyzing potentially similar sentence pairs using a signal processing-inspired approach, we detect which segments of the source sentence are translated into segments in the target sentence, and which are not. This method enables us to extract useful machine translation training data even from very non-parallel corpora, which contain no parallel sentence pairs. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system.

1 Introduction

Recently, there has been a surge of interest in the automatic creation of parallel corpora. Several researchers (Zhao and Vogel, 2002; Vogel, 2003; Resnik and Smith, 2003; Fung and Cheung, 2004a; Wu and Fung, 2005; Munteanu and Marcu, 2005) have shown how fairly good-quality parallel sentence pairs can be automatically extracted from comparable corpora and used to improve the performance of machine translation (MT) systems.

This work addresses a major bottleneck in the development of Statistical MT (SMT) systems: the lack of sufficiently large parallel corpora for most language pairs. Since comparable corpora exist in large quantities and for many languages (tens of thousands of words of news describing the same events are produced daily), the ability to exploit them for parallel data acquisition is highly beneficial for the SMT field.

Comparable corpora exhibit various degrees of parallelism. Fung and Cheung (2004a) describe corpora ranging from noisy parallel to .
