Scientific report: "Four Techniques for Online Handling of Out-of-Vocabulary Words in Arabic-English Statistical Machine Translation"

Nizar Habash
Center for Computational Learning Systems
Columbia University
habash@

Abstract

We present four techniques for online handling of Out-of-Vocabulary words in Phrase-based Statistical Machine Translation. The techniques use spelling expansion, morphological expansion, dictionary term expansion, and proper name transliteration to reuse or extend a phrase table. We compare the performance of these techniques and combine them. Our results show a consistent improvement over a state-of-the-art baseline in terms of BLEU and a manual error analysis.

1 Introduction

We present four techniques for online handling of Out-of-Vocabulary (OOV) words in phrase-based Statistical Machine Translation (SMT).1 The techniques use morphological expansion (MorphEx), spelling expansion (SpellEx), dictionary word expansion (DictEx), and proper name transliteration (TransEx) to reuse or extend phrase tables online. We compare the performance of these techniques and combine them. We work with a standard Arabic-English SMT system that has already been optimized to minimize data sparsity through morphological preprocessing and orthographic normalization. Thus our baseline token OOV rate is rather low on average. None of our techniques is specific to Arabic, and all can be retargeted to other languages given the availability of technique-specific resources.
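To make the idea of extending a phrase table online more concrete, here is a minimal sketch of a token OOV rate computation and a spelling-expansion step. It is not the implementation from the paper: the toy phrase table, the function names, the one-edit matching rule, and the score penalty are all assumptions chosen for illustration, and the romanized tokens are placeholders.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy phrase table: source phrase -> list of (target phrase, score).
PhraseTable = Dict[str, List[Tuple[str, float]]]


def oov_tokens(sentence: List[str], table: PhraseTable) -> List[str]:
    """Return the tokens that occur in no source phrase of the table."""
    vocab: Set[str] = {tok for phrase in table for tok in phrase.split()}
    return [tok for tok in sentence if tok not in vocab]


def token_oov_rate(sentences: List[List[str]], table: PhraseTable) -> float:
    """Fraction of input tokens that are out of vocabulary."""
    total = sum(len(s) for s in sentences)
    missing = sum(len(oov_tokens(s, table)) for s in sentences)
    return missing / total if total else 0.0


def single_edit_variants(word: str, alphabet: str) -> Set[str]:
    """All strings one deletion, substitution, or insertion away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    substitutions = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return (deletes | substitutions | inserts) - {word}


def spell_expand(oov: str, table: PhraseTable, penalty: float = 0.5) -> PhraseTable:
    """Build new entries for an OOV word from in-vocabulary words one edit away.

    Assumption: only single-word source phrases are matched, and copied
    scores are scaled by an arbitrary penalty.
    """
    alphabet = "".join(
        sorted({ch for phrase in table for ch in phrase if not ch.isspace()})
    )
    new_entries: List[Tuple[str, float]] = []
    for variant in sorted(single_edit_variants(oov, alphabet)):
        for target, score in table.get(variant, []):
            new_entries.append((target, score * penalty))
    return {oov: new_entries} if new_entries else {}


if __name__ == "__main__":
    table: PhraseTable = {
        "ktAb": [("book", 0.8)],
        "fy": [("in", 0.9)],
        "Albyt": [("the house", 0.7)],
    }
    sentence = ["ktb", "fy", "Albyt"]
    print(oov_tokens(sentence, table))
    print(token_oov_rate([sentence], table))
    print(spell_expand("ktb", table))
```

In this sketch the new entries are returned separately rather than written into the table, so a caller can decide whether to merge them for a single input sentence or discard them afterwards. The other three techniques would need external resources (a morphological analyzer, a dictionary, a transliteration model) and are not sketched here.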
Our results show that we improve over a state-of-the-art baseline in relative BLEU score and handle all OOV instances. An error analysis shows that 60% of the time our OOV handling successfully produces acceptable output. Additionally, we still improve in BLEU score even as we increase our system's training data 10-fold.

1 This work was funded under the DARPA GALE program, contract HR0011-06-C-0023.

2 Related Work

Much work in MT has shown that orthographic and morpho-syntactic preprocessing of the
