Scientific report: "An Automatic Filter for Non-Parallel Texts"

Chris Pike
Computer Science Department, New York University
715 Broadway, 7th Floor, New York, NY 10003, USA
lastname @

I. Dan Melamed
Computer Science Department, New York University
715 Broadway, 7th Floor, New York, NY 10013, USA
lastname @

Abstract

Numerous cross-lingual applications, including state-of-the-art machine translation systems, require parallel texts aligned at the sentence level. However, collections of such texts are often polluted by pairs of texts that are comparable but not parallel. Bitext maps can help to discriminate between parallel and comparable texts. Bitext mapping algorithms use a larger set of document features than competing approaches to this task, resulting in higher accuracy. In addition, good bitext mapping algorithms are not limited to documents with structural mark-up such as web pages. The task of filtering non-parallel text pairs represents a new application of bitext mapping algorithms.

1 Introduction

In June 2003, the U.S. government organized a Surprise Language Exercise for the NLP community. The goal was to build the best possible language technologies for a surprise language in just one month (Oard, 2003). One of the main technologies pursued was machine translation (MT). Statistical MT (SMT) systems were the most successful in this scenario, because their construction typically requires less time than other approaches.
On the other hand, SMT systems require large quantities of parallel text as training data. A significant collection of parallel text was obtained for this purpose from multiple sources. SMT systems were built and tested, and results were reported. Much later, we were surprised to discover that a significant portion of the training data was not parallel text. Some of the document pairs were on the same topic but not translations of each other. For today's sentence-based SMT systems, this kind of data is noise. How much better would the results have been if the noisy ...
