BiTAM: Bilingual Topic AdMixture Models for Word Alignment

Bing Zhao and Eric P. Xing
{bzhao, epxing}@cs.cmu.edu
Language Technologies Institute and Machine Learning Department
School of Computer Science, Carnegie Mellon University

Abstract

We propose a novel bilingual topical admixture (BiTAM) formalism for word alignment in statistical machine translation. Under this formalism, the parallel sentence-pairs within a document-pair are assumed to constitute a mixture of hidden topics; each word-pair follows a topic-specific bilingual translation model. Three BiTAM models are proposed to capture topic sharing at different levels of linguistic granularity (i.e., at the sentence or word levels). These models enable word-alignment process to leverage topical contents of document-pairs. Efficient variational approximation algorithms are designed for inference and parameter estimation. With the inferred latent topics, BiTAM models facilitate coherent pairing of bilingual linguistic entities that share common topical aspects. Our preliminary experiments show that the proposed models improve word alignment accuracy and lead to better translation quality.

1 Introduction

Parallel data has been treated as sets of unrelated sentence-pairs in state-of-the-art statistical machine translation (SMT) models. Most current approaches emphasize within-sentence dependencies such as the distortion in (Brown et al., 1993), the dependency of alignment in HMM (Vogel et al., 1996), and syntax mappings in (Yamada and Knight, 2001).
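The idea stated in the abstract, that word-pairs follow topic-specific translation models weighted by a mixture of hidden topics, can be illustrated with a toy calculation. The sketch below is not the authors' BiTAM model or its inference procedure; it only shows how mixing topic-specific lexicons with topic weights can shift a translation probability. The function name, the two topics, the French target words, and all probabilities are made-up assumptions for illustration.

```python
# Hypothetical toy example: every name and number here is an illustrative
# assumption, not a value or an algorithm taken from the paper.

def mixed_translation_prob(f, e, topic_weights, topic_lexicons):
    """Return sum_k topic_weights[k] * p_k(f | e) over all topics k."""
    return sum(
        weight * lexicon.get(e, {}).get(f, 0.0)
        for weight, lexicon in zip(topic_weights, topic_lexicons)
    )

# Two made-up topic-specific lexicons p_k(f | e) for the English word "bank".
finance = {"bank": {"banque": 0.9, "rive": 0.1}}
geography = {"bank": {"banque": 0.2, "rive": 0.8}}
lexicons = [finance, geography]

# Assumed topic weights for a document-pair that is mostly about finance.
finance_doc = [0.8, 0.2]

p_banque = mixed_translation_prob("banque", "bank", finance_doc, lexicons)
p_rive = mixed_translation_prob("rive", "bank", finance_doc, lexicons)
print(p_banque, p_rive)
```

With these assumed weights, the finance reading of "bank" receives most of the probability mass; swapping the weights to favour the geography topic would reverse the preference.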

Beyond the sentence level, corpus-level word correlation and contextual-level topical information may help to disambiguate translation candidates and word-alignment choices. For example, the most frequent source words (e.g., functional words) are likely to be translated into words which are also frequent on the target side; words of the same topic generally bear correlations and similar translations. Extended contextual information is especially useful when translation models are vague due .