TAILIEUCHUNG - Scientific report: "Mining Parenthetical Translations from the Web by Word Alignment"

Mining Parenthetical Translations from the Web by Word Alignment

Dekang Lin, Google Inc., Mountain View, CA 94043, lindek@
Shaojun Zhao, University of Rochester, Rochester, NY 14627, zhao@
Benjamin Van Durme†, University of Rochester, Rochester, NY 14627, vandurme@
Marius Pasca, Google Inc., Mountain View, CA 94043, mars@

Abstract

Documents in languages such as Chinese, Japanese and Korean sometimes annotate terms with their translations in English inside a pair of parentheses. We present a method to extract such translations from a large collection of web documents by building a partially parallel corpus and using a word alignment algorithm to identify the terms being translated. The method is able to generalize across the translations for different terms and can reliably extract translations that occurred only once in the entire web. Our experiment on Chinese web pages produced more than 26 million pairs of translations, which is over two orders of magnitude more than previous results. We show that the addition of the extracted translation pairs as training data provides a significant increase in the BLEU score for a statistical machine translation system.

1 Introduction

In natural language documents, a term (word or phrase) is sometimes followed by its translation in another language in a pair of parentheses. We call these parenthetical translations.
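To make the notion of a parenthetical translation concrete, here is a minimal Python sketch of how candidate pairs might be collected from raw text before any alignment step. It is not the authors' implementation: the regular expression, the CJK character range, the function name `extract_candidates`, and the `max_left` parameter are all assumptions made for illustration, and the example sentence is made up.

```python
import re

# Assumption: a candidate is a run of CJK ideographs immediately followed by
# an English string inside half-width or full-width parentheses.
CJK = r"\u4e00-\u9fff"
PATTERN = re.compile(
    rf"([{CJK}]+)\s*[(（]\s*([A-Za-z][A-Za-z .'\-]*?)\s*[)）]"
)


def extract_candidates(text, max_left=10):
    """Return (left_context, english) pairs found in `text`.

    `left_context` is at most `max_left` characters preceding the
    parenthesis. It usually contains MORE than the translated term, so a
    later step (word alignment, in the paper) must find the term boundary.
    """
    pairs = []
    for match in PATTERN.finditer(text):
        left, english = match.group(1), match.group(2)
        pairs.append((left[-max_left:], english))
    return pairs


# Illustrative sentence (hypothetical, not taken from the paper's data).
print(extract_candidates("美国布鲁金斯学会(Brookings Institution)的学者"))
# -> [('美国布鲁金斯学会', 'Brookings Institution')]
```

Note that the left context in this example still includes the two characters for "US" in front of the institution name. A surface pattern alone cannot tell where the translated term starts, which is the gap that the word alignment step described in the abstract is meant to close.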
The following examples are from Chinese web pages (we added underlines to indicate what is being translated):

(1) […] (Brookings Institution) […] (Jeremy Shapiro) […]
(2) […] (indigestion) […] (gastritis) […]
(3) […] (not going to fly) […]
(4) […] (linear programming) […]

† Contributions made during an internship at Google.

The parenthetically translated terms are typically new words, technical terminologies, idioms, products, titles of movies, books, songs, and names of persons, organizations, locations, etc. Commonly, an author might use such a parenthetical when a given
