Unsupervised Search for The Optimal Segmentation for Statistical Machine Translation

Coşkun Mermer1,3 and Ahmet Afsin Akin2,3
1 Bogazici University, Bebek, Istanbul, Turkey
2 Istanbul Technical University, Sariyer, Istanbul, Turkey
3 TUBITAK-UEKAE, Gebze, Kocaeli, Turkey
{coskun,ahmetaa}@uekae.tubitak.gov.tr

Abstract

We tackle the previously unaddressed problem of unsupervised determination of the optimal morphological segmentation for statistical machine translation (SMT) and propose a segmentation metric that takes into account both sides of the SMT training corpus. We formulate the objective function as the posterior probability of the training corpus according to a generative segmentation-translation model. We describe how the IBM Model-1 translation likelihood can be computed incrementally between adjacent segmentation states for efficient computation. Submerging the proposed segmentation method in a SMT task from morphologically-rich Turkish to English does not exhibit the expected improvement in translation BLEU scores and confirms the robustness of phrase-based SMT to translation unit combinatorics. A positive outcome of this work is the described modification to the sequential search algorithm of Morfessor (Creutz and Lagus, 2007) that enables arbitrary-fold parallelization of the computation, which unexpectedly improves the translation performance as measured by BLEU.
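As a point of reference for the IBM Model-1 translation likelihood mentioned in the abstract, the following is a minimal sketch of the standard Model-1 sentence-pair likelihood, log P(f | e) = log ε − m log(l + 1) + Σ_j log Σ_i t(f_j | e_i). It is not the authors' incremental computation. The function name, the `<NULL>` token, the probability floor, and the toy translation tables are all hypothetical, chosen only to show how the likelihood depends on the segmentation of one side.

```python
import math

def model1_log_likelihood(cond_tokens, gen_tokens, t_table, epsilon=1.0, floor=1e-12):
    """Log-likelihood of gen_tokens given cond_tokens under IBM Model 1.

    t_table maps (generated_token, conditioning_token) -> t(f | e).
    A NULL token is prepended to the conditioning side, as in the standard model.
    The floor is an assumption of this sketch to avoid log(0) for unseen pairs.
    """
    cond = ["<NULL>"] + list(cond_tokens)
    # Length/alignment prior: epsilon / (l + 1)^m
    log_prob = math.log(epsilon) - len(gen_tokens) * math.log(len(cond))
    for f in gen_tokens:
        # Marginalize over all possible alignments of f to a conditioning token
        total = sum(t_table.get((f, e), 0.0) for e in cond)
        log_prob += math.log(max(total, floor))
    return log_prob

# Hypothetical toy tables: the same English phrase scored against an
# unsegmented and a segmented version of one Turkish word.
english = ["in", "the", "houses"]
t_unsegmented = {
    ("in", "evlerde"): 0.3, ("the", "evlerde"): 0.2, ("houses", "evlerde"): 0.5,
    ("the", "<NULL>"): 0.4,
}
t_segmented = {
    ("houses", "ev"): 0.7, ("houses", "+ler"): 0.2,
    ("in", "+de"): 0.8, ("the", "<NULL>"): 0.4,
}

ll_unsegmented = model1_log_likelihood(["evlerde"], english, t_unsegmented)
ll_segmented = model1_log_likelihood(["ev", "+ler", "+de"], english, t_segmented)
print(ll_unsegmented, ll_segmented)
```

Because the likelihood is a product of per-token sums, changing the segmentation of a single word only alters the terms that involve that word's tokens and the length prior, which is the property an incremental update between adjacent segmentation states can exploit.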
1 Introduction

In statistical machine translation (SMT), words are normally considered as the building blocks of translation models. However, especially for morphologically complex languages such as Finnish, Turkish, Czech, Arabic, etc., it has been shown that using sub-lexical units obtained after morphological preprocessing can improve the machine translation performance over a word-based system (Habash and Sadat, 2006; Oflazer and Durgar El-Kahlout, 2007; Bisazza and Federico, 2009). However, the effect of segmentation on translation performance is indirect and difficult to isolate (Lopez and Resnik, 2006). The challenge .