Phrase Linguistic Classification and Generalization for Improving Statistical Machine Translation

Adrià de Gispert
TALP Research Center, Universitat Politècnica de Catalunya (UPC), Barcelona
agispert@gps.tsc.upc.es

Abstract

In this paper, a method to incorporate linguistic information regarding single-word and compound verbs is proposed, as a first step towards an SMT model based on linguistically-classified phrases. By replacing these verb structures with the base form of the head verb, we achieve better statistical word alignment performance, and are able to better estimate the translation model and generalize to unseen verb forms during translation. Preliminary experiments for the English-Spanish language pair are performed, and future research lines are detailed.

1 Introduction

Since its revival in the beginning of the 1990s, statistical machine translation (SMT) has shown promising results in several evaluation campaigns. Starting from the original word-based models, results were further improved by the appearance of phrase-based translation models. However, many SMT systems still ignore any morphological analysis and work at the surface level of word forms. For highly-inflected languages, such as German or Spanish (or any language of the Romance family), this poses severe limitations both in training from parallel corpora and in producing a correct translation of an input sentence.
This lack of linguistic knowledge in SMT forces the translation model to learn different translation probability distributions for all inflected forms of nouns, adjectives or verbs (vengo, vienes, viene, etc.), and this suffers from the usual data sparseness. Despite recent efforts in the community to provide models with this kind of information (see Section 6 for details on related previous work), results are yet to be encouraging. In this paper we address the incorporation of morphological and shallow syntactic information regarding verbs and compound verbs, as a first step towards an SMT model based on linguistically-classified phrases.
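
The sparseness argument can be illustrated with a minimal sketch. It assumes a hypothetical toy lexicon (`TOY_LEMMAS`) and a hypothetical class-token format (`V[...]`); neither comes from the paper, and a real system would obtain base forms from a morphological analyser or tagger. The sketch only shows how mapping inflected forms to the base form of the verb pools counts that would otherwise be split across surface forms.

```python
from collections import Counter

# Hypothetical toy lexicon mapping inflected Spanish verb forms to a base form.
# This is an assumption for illustration only, not the paper's resource.
TOY_LEMMAS = {
    "vengo": "venir",
    "vienes": "venir",
    "viene": "venir",
}


def classify_verbs(tokens, lemmas=TOY_LEMMAS):
    """Replace each known verb form with a class token built from its base form."""
    return [f"V[{lemmas[t]}]" if t in lemmas else t for t in tokens]


# Tiny made-up corpus: each inflected form of "venir" occurs exactly once.
corpus = [
    "yo vengo mañana".split(),
    "tú vienes mañana".split(),
    "ella viene hoy".split(),
]

# Counts over surface forms: evidence for "venir" is split three ways.
surface_counts = Counter(t for sent in corpus for t in sent)

# Counts after classification: all three occurrences share one token.
class_counts = Counter(t for sent in corpus for t in classify_verbs(sent))

print(surface_counts["vengo"], surface_counts["vienes"], surface_counts["viene"])
print(class_counts["V[venir]"])
```

With surface forms, a model sees each verb form once; after classification, the single class token collects all three occurrences, which is the kind of pooling that should help alignment and translation model estimation.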