Using Noisy Bilingual Data for Statistical Machine Translation

Stephan Vogel
Interactive Systems Lab, Language Technologies Institute
Carnegie Mellon University
vogel@cs.cmu.edu

Abstract

SMT systems rely on a sufficient amount of parallel data to train the translation model. This paper investigates ways to use word-to-word and phrase-to-phrase translations extracted not only from clean parallel corpora but also from noisy comparable corpora. Translation results for a Chinese-to-English translation task are given.

1 Introduction

Statistical machine translation systems typically use a translation model trained on bilingual data and a language model for the target language trained on a possibly larger monolingual corpus. Often the amount of clean parallel data is limited. This raises the question of whether translation quality can be improved by using additional, noisier bilingual data.

Some approaches, such as (Fung and McKeown, 1997), have been developed to extract word translations from non-parallel corpora. In (Munteanu and Marcu, 2002), bilingual suffix trees are used to extract parallel sequences of words from a comparable corpus. 95% of those phrase translation pairs were judged to be correct. However, no results were reported on whether these additional translation correspondences improved translation quality.

2 The SMT System

Statistical translation as introduced in (Brown et al., 1993) is based on word-to-word translations. The SMT system used in this study relies on multi-word to multi-word translations.
The term "phrase translations" will be used throughout this paper without implying that these multi-word translation pairs are phrases in some linguistic sense. Phrase translations can be extracted from the Viterbi alignment of the alignment model. Most phrase translation pairs are seen only a few times; in fact, most of the longer phrases are seen only once, even in the larger corpora. Using relative frequency to estimate the translation probability would make most