An Automatic Filter for Non-Parallel Texts

Chris Pike
Computer Science Department, New York University
715 Broadway, 7th Floor, New York, NY 10003, USA
lastname@cs.nyu.edu

I. Dan Melamed
Computer Science Department, New York University
715 Broadway, 7th Floor, New York, NY 10013, USA
lastname@cs.nyu.edu

Abstract

Numerous cross-lingual applications, including state-of-the-art machine translation systems, require parallel texts aligned at the sentence level. However, collections of such texts are often polluted by pairs of texts that are comparable but not parallel. Bitext maps can help to discriminate between parallel and comparable texts. Bitext mapping algorithms use a larger set of document features than competing approaches to this task, resulting in higher accuracy. In addition, good bitext mapping algorithms are not limited to documents with structural mark-up such as web pages. The task of filtering non-parallel text pairs represents a new application of bitext mapping algorithms.

1 Introduction

In June 2003, the U.S. government organized a "Surprise Language Exercise" for the NLP community. The goal was to build the best possible language technologies for a "surprise" language in just one month (Oard, 2003). One of the main technologies pursued was machine translation (MT).
Statistical MT (SMT) systems were the most successful in this scenario, because their construction typically requires less time than other approaches. On the other hand, SMT systems require large quantities of parallel text as training data. A significant collection of parallel text was obtained for this purpose from multiple sources. SMT systems were built and tested; results were reported. Much later, we were surprised to discover that a significant portion of the training data was not parallel text. Some of the document pairs were on the same topic but not translations of each other. For today's sentence-based SMT systems, this kind of data is noise. How much better would the results have been if the noisy