TAILIEUCHUNG - Scientific report: "Using Noisy Bilingual Data for Statistical Machine Translation"

Using Noisy Bilingual Data for Statistical Machine Translation

Stephan Vogel
Interactive Systems Lab, Language Technologies Institute
Carnegie Mellon University
vogel @

Abstract

SMT systems rely on a sufficient amount of parallel corpora to train the translation model. This paper investigates possibilities to use word-to-word and phrase-to-phrase translations extracted not only from clean parallel corpora but also from noisy comparable corpora. Translation results for a Chinese to English translation task are given.

1 Introduction

Statistical machine translation systems typically use a translation model trained on bilingual data and a language model for the target language trained on a possibly larger amount of monolingual data. Often the amount of clean parallel data is limited. This leads to the question of whether translation quality can be improved by using additional, noisier bilingual data.

Some approaches, such as (Fung and McKeown, 1997), have been developed to extract word translations from non-parallel corpora. In (Munteanu and Marcu, 2002), bilingual suffix trees are used to extract parallel sequences of words from a comparable corpus. 95% of those phrase translation pairs were judged to be correct. However, no results were reported on whether these additional translation correspondences resulted in improved translation quality.

2 The SMT System

Statistical translation as introduced in (Brown et al., 1993) is based on word-to-word translations. The SMT system used in this study relies on multi-word to multi-word translations. The term "phrase translations" will be used throughout this paper without implying that these multi-word translation pairs are phrases in some linguistic sense.

Phrase translations can be extracted from the Viterbi alignment of the alignment model. Phrase translation pairs are seen only a few times. In fact, most of the longer phrases are seen only once, even in the larger corpora. Using relative frequency to estimate the translation probability would make most
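
The two steps mentioned in this section, extracting phrase pairs from a word alignment and estimating translation probabilities by relative frequency, can be illustrated with a small sketch. This is not the system described in the paper: the consistency criterion used here is the common textbook one, and the function names, the `max_len` limit, and the toy sentence pair with its alignment are all assumptions made up for illustration.

```python
from collections import Counter


def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """Return phrase pairs consistent with a word alignment.

    alignment is a set of (i, j) links meaning src[i] is aligned to tgt[j].
    A pair is kept only if no link connects a word inside the target span
    to a word outside the source span. Only the minimal target span is
    returned (no extension over unaligned words), an assumed simplification.
    """
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to any word in the source span.
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Reject spans where a target word links outside the source span.
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            pairs.append((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


def relative_frequency(pair_counts):
    """Estimate p(target | source) = count(source, target) / count(source)."""
    src_totals = Counter()
    for (s, _), c in pair_counts.items():
        src_totals[s] += c
    return {(s, t): c / src_totals[s] for (s, t), c in pair_counts.items()}


# Hypothetical toy sentence pair with a one-to-one alignment (invented data).
src = ["wo", "xihuan", "cha"]
tgt = ["I", "like", "tea"]
alignment = {(0, 0), (1, 1), (2, 2)}

pairs = extract_phrase_pairs(src, tgt, alignment)
probs = relative_frequency(Counter(pairs))
for (s, t), p in sorted(probs.items()):
    print(f"p({t!r} | {s!r}) = {p}")
```

In this toy setup every phrase pair is observed exactly once, so each relative-frequency estimate comes out as 1.0, regardless of how reliable the pair actually is. That is the kind of behavior to keep in mind when counts are as sparse as described above.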
