Scientific report: "Phrase-Based Backoff Models for Machine Translation of Highly Inflected Languages"

Mei Yang
Department of Electrical Engineering, University of Washington, Seattle, WA, USA
yangmei@

Katrin Kirchhoff
Department of Electrical Engineering, University of Washington, Seattle, WA, USA
katrin@

Abstract

We propose a backoff model for phrase-based machine translation that translates unseen word forms in foreign-language text by hierarchical morphological abstractions at the word and the phrase level. The model is evaluated on the Europarl corpus for German-English and Finnish-English translation and shows improvements over state-of-the-art phrase-based models.

1 Introduction

Current statistical machine translation (SMT) usually works well when the domain is fixed, the training and test data match, and a large amount of training data is available. Nevertheless, standard SMT models tend to perform much better on morphologically simple languages, whereas highly inflected languages with a large number of potential word forms are more problematic, particularly when training data is sparse.

SMT attempts to find a sentence $\hat{e}$ in the desired output language, given the corresponding sentence $f$ in the source language, according to

$$\hat{e} = \operatorname*{argmax}_{e} P(f \mid e)\, P(e) \qquad (1)$$

Most state-of-the-art SMT systems adopt a phrase-based approach, such that $e$ is chunked into $I$ phrases $\tilde{e}_1, \ldots, \tilde{e}_I$ and the translation model is defined over mappings between phrases in $e$ and in $f$, i.e. $P(\tilde{f} \mid \tilde{e})$. Typically, phrases are extracted from a word-aligned training corpus. Different inflected forms of the same lemma are treated as different words, and there is no provision for unseen forms, i.e. unknown words encountered in the test data are not translated at all but appear verbatim in the output. Although the percentage of such unseen word forms may be negligible when the training set is large and matches the test set well, it may rise drastically when training data is limited or comes from a different domain. Many .
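
As a rough illustration of the decision rule and of the unseen-form problem described above, the following toy sketch applies the argmax over P(f | e) P(e) one token at a time. It is a deliberate simplification, not the authors' system: it uses single words instead of phrases, a unigram language model, and no reordering. The table entries, probabilities, and function names are all invented for this example.

```python
# Toy translation table: PHRASE_TABLE[f][e] stands in for P(f | e).
# All entries and probabilities are invented for illustration.
PHRASE_TABLE = {
    "das": {"the": 0.7, "that": 0.3},
    "haus": {"house": 0.8, "home": 0.2},
}

# Toy unigram language model standing in for P(e); values are invented.
LANGUAGE_MODEL = {"the": 0.5, "that": 0.1, "house": 0.3, "home": 0.1}


def translate_token(token):
    """Return the candidate e maximizing P(f | e) * P(e) for one source token.

    A token with no table entry is returned unchanged, mimicking the
    verbatim pass-through of unseen word forms.
    """
    candidates = PHRASE_TABLE.get(token.lower())
    if candidates is None:
        return token  # unseen form: no translation available
    return max(
        candidates,
        key=lambda e: candidates[e] * LANGUAGE_MODEL.get(e, 1e-9),
    )


def translate(sentence):
    """Translate token by token (monotone, no reordering)."""
    return " ".join(translate_token(t) for t in sentence.split())


print(translate("das Haus"))    # the house
print(translate("das Hauses"))  # the Hauses
```

The second call shows the failure mode: the inflected form "Hauses" shares a lemma with "Haus", but because the table treats each surface form as a distinct entry, it is copied to the output untranslated.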
