Scientific report: "Name Translation in Statistical Machine Translation: Learning When to Transliterate"

Ulf Hermjakob and Kevin Knight
University of Southern California, Information Sciences Institute
4676 Admiralty Way, Marina del Rey, CA 90292, USA
ulf knight @

Hal Daumé III
University of Utah, School of Computing
50 S Central Campus Drive, Salt Lake City, UT 84112, USA
me@

Abstract

We present a method to transliterate names in the framework of end-to-end statistical machine translation. The system is trained to learn when to transliterate. For Arabic to English MT, we developed and trained a transliterator on a bitext of 7 million sentences and Google's English terabyte ngrams and achieved better name translation accuracy than 3 out of 4 professional translators. The paper also includes a discussion of challenges in name translation evaluation.

1 Introduction

State-of-the-art statistical machine translation (SMT) is bad at translating names that are not very common, particularly across languages with different character sets and sound systems. For example, consider the following automatic translation (taken from NIST02-05 corpora):

Arabic input: (Arabic text not preserved in this copy)
SMT output: musicians such as Bach
Correct translation: composers such as Bach, Mozart, Chopin, Beethoven, Schumann, Rachmaninoff, Ravel and Prokofiev

The SMT system drops most names in this example. Name dropping and mistranslation happen when the system encounters an unknown word, mistakes a name for a common noun, or trains on noisy parallel data.

The state of the art is poor for two reasons. First, although names are important to human readers, automatic MT scoring metrics (such as Bleu) do not encourage researchers to improve name translation in the context of MT. Names are vastly outnumbered by prepositions, articles, adjectives, common nouns, etc. Second, name translation is a hard problem: even professional human translators have trouble with names. Here are four reference translations taken from the same corpus, with mistakes.
