Name Translation in Statistical Machine Translation: Learning When to Transliterate

Ulf Hermjakob and Kevin Knight
University of Southern California, Information Sciences Institute
4676 Admiralty Way, Marina del Rey, CA 90292, USA
{ulf,knight}@isi.edu

Hal Daumé III
University of Utah, School of Computing
50 S Central Campus Drive, Salt Lake City, UT 84112, USA
me@hal3.name

Abstract

We present a method to transliterate names in the framework of end-to-end statistical machine translation. The system is trained to learn when to transliterate. For Arabic to English MT, we developed and trained a transliterator on a bitext of 7 million sentences and Google's English terabyte ngrams and achieved better name translation accuracy than 3 out of 4 professional translators. The paper also includes a discussion of challenges in name translation evaluation.

1 Introduction

State-of-the-art statistical machine translation (SMT) is bad at translating names that are not very common, particularly across languages with different character sets and sound systems. For example, consider the following automatic translation, taken from the NIST02-05 corpora:

(1) Arabic input: [Arabic text not preserved]
    SMT output: musicians such as Bach
    Correct translation: composers such as Bach, Mozart, Chopin, Beethoven, Schumann, Rachmaninoff, Ravel and Prokofiev

The SMT system drops most names in this example. Name dropping and mis-translation happen when the system encounters an unknown word, mistakes a name for a common noun, or trains on noisy parallel data.

The state of the art is poor for two reasons. First, although names are important to human readers, automatic MT scoring metrics such as Bleu do not encourage researchers to improve name translation in the context of MT. Names are vastly outnumbered by prepositions, articles, adjectives, common nouns, etc. Second, name translation is a hard problem: even professional human translators have trouble with names. Here are four reference translations taken from the same corpus, with mistakes: