Scientific report: "Machine Translation by Triangulation: Making Effective Use of Multi-Parallel Corpora"

Trevor Cohn and Mirella Lapata
Human Computer Research Centre, School of Informatics
University of Edinburgh
tcohn mlap @

Abstract

Current phrase-based SMT systems perform poorly when using small training sets. This is a consequence of unreliable translation estimates and low coverage over source and target phrases. This paper presents a method which alleviates this problem by exploiting multiple translations of the same source phrase. Central to our approach is triangulation, the process of translating from a source to a target language via an intermediate third language. This allows the use of a much wider range of parallel corpora for training, and can be combined with a standard phrase-table using conventional smoothing methods. Experimental results demonstrate BLEU improvements for triangulated models over a standard phrase-based system.

1 Introduction

Statistical machine translation (Brown et al., 1993) has seen many improvements in recent years, most notably the transition from word- to phrase-based models (Koehn et al., 2003). Modern SMT systems are capable of producing high quality translations when provided with large quantities of training data.
With only a small training sample, the translation output is often inferior to the output from using larger corpora, because the translation algorithm must rely on sparser estimates of phrase frequencies and must also back off to smaller phrases. This often leads to poor choices of target phrases and reduces the coherence of the output. Unfortunately, parallel corpora are not readily available in large quantities, except for a small subset of the world's languages (see Resnik and Smith (2003) for discussion), therefore limiting the potential use of current SMT systems.

In this paper we provide a means for obtaining more reliable translation frequency estimates from small datasets. We make use of multi-parallel corpora.
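The triangulation idea described in the abstract can be illustrated with a small sketch. This excerpt does not give the paper's actual estimation procedure, so what follows is only one plausible reading: it assumes the target phrase is conditionally independent of the source phrase given the intermediate phrase, so that p(t|s) is approximated by summing p(t|i) * p(i|s) over intermediate phrases i. It also assumes that "conventional smoothing" can be stood in for by simple linear interpolation. The function names, the interpolation weight, and the toy phrase tables are all hypothetical.

```python
from collections import defaultdict


def triangulate(src_to_piv, piv_to_tgt):
    """Build a source->target phrase table through an intermediate language.

    Assumes p(t|s) ~= sum_i p(t|i) * p(i|s), i.e. the target phrase depends
    on the source phrase only through the intermediate phrase.
    """
    table = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_piv.items():
        for piv, p_piv_given_src in pivots.items():
            # Intermediate phrases unseen in the second table add nothing.
            for tgt, p_tgt_given_piv in piv_to_tgt.get(piv, {}).items():
                table[src][tgt] += p_tgt_given_piv * p_piv_given_src
    return {src: dict(tgts) for src, tgts in table.items()}


def interpolate(direct, triangulated, weight=0.5):
    """Linearly mix a directly estimated table with a triangulated one.

    Source phrases missing from one table are down-weighted rather than
    renormalised; a real system would need to handle that case properly.
    """
    mixed = {}
    for src in set(direct) | set(triangulated):
        d = direct.get(src, {})
        t = triangulated.get(src, {})
        mixed[src] = {
            tgt: weight * d.get(tgt, 0.0) + (1.0 - weight) * t.get(tgt, 0.0)
            for tgt in set(d) | set(t)
        }
    return mixed


# Toy, made-up probabilities: French -> Spanish and Spanish -> English.
fr_es = {"chat": {"gato": 1.0}, "chien": {"perro": 0.8, "can": 0.2}}
es_en = {
    "gato": {"cat": 1.0},
    "perro": {"dog": 1.0},
    "can": {"dog": 0.5, "hound": 0.5},
}
# A sparse direct French -> English table that lacks "chat" entirely.
fr_en_direct = {"chien": {"dog": 1.0}}

fr_en_triangulated = triangulate(fr_es, es_en)
fr_en_combined = interpolate(fr_en_direct, fr_en_triangulated, weight=0.5)
```

In this toy setting the direct table has no entry for "chat", while the triangulated table supplies one through Spanish. This is the kind of coverage gain the abstract attributes to using a wider range of parallel corpora.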
