Scientific report: "An Alignment Method for Noisy Parallel Corpora based on Image Processing Techniques"

Jason S. Chang and Mathis H. Chen
Department of Computer Science, National Tsing Hua University, Taiwan
jschang@ mathis @
Phone: 886-3-5731069; Fax: 886-3-5723694

Abstract

This paper presents a new approach to the bitext correspondence problem (BCP) of noisy bilingual corpora based on image processing (IP) techniques. By using one of several ways of estimating the lexical translation probability (LTP) between pairs of source and target words, we can turn a bitext into a discrete gray-level image. We contend that the BCP, when seen in this light, bears a striking resemblance to the line detection problem in IP. Therefore, BCPs, including sentence and word alignment, can benefit from a wealth of effective, well-established IP techniques, including convolution-based filters, texture analysis and the Hough transform. This paper describes a new program, PlotAlign, that produces a word-level bitext map for noisy or non-literal bitext based on these techniques.

Keywords: alignment, bilingual corpus, image processing

1. Introduction

Aligned corpora have proved very useful in many tasks, including statistical machine translation, bilingual lexicography (Daille, Gaussier and Lange 1993), and word sense disambiguation (Gale, Church and Yarowsky 1992; Chen, Ker, Sheng and Chang 1997).
Several methods have recently been proposed for sentence alignment of the Hansards, an English-French corpus of Canadian parliamentary debates (Brown, Lai and Mercer 1991; Gale and Church 1991a; Simard, Foster and Isabelle 1992; Chen 1993), and for other language pairs such as English-German, English-Chinese and English-Japanese (Church, Dagan, Gale, Fung, Helfman and Satish 1993; Kay and Röscheisen 1993; Wu 1994). The statistical approach to machine translation (SMT) can be understood as a word-by-word model consisting of two sub-models: a language model for generating a source text segment s, and a translation model for mapping s to its .
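The idea stated in the abstract, turning a bitext into a gray-level image and then looking for lines in it, can be illustrated with a minimal sketch. This is not the PlotAlign program, whose details are not given in this excerpt. The toy probability table `TOY_LTP`, the function names `bitext_image` and `hough_best_line`, the four gray levels, the 15-degree angle step and the choice to weight each vote by gray level are all assumptions made for illustration.

```python
import math

# Hypothetical toy lexical translation probabilities (LTP). The values are
# invented for illustration and are not taken from the paper.
TOY_LTP = {
    ("the", "le"): 0.6,
    ("cat", "chat"): 0.9,
    ("sleeps", "dort"): 0.8,
    ("here", "ici"): 0.7,
    ("the", "chat"): 0.1,
}


def bitext_image(source, target, ltp, levels=4):
    """Return a gray-level image: pixel (i, j) is the quantized LTP of
    source word i and target word j, in the range 0 .. levels - 1."""
    return [
        [min(levels - 1, int(ltp.get((s, t), 0.0) * levels)) for t in target]
        for s in source
    ]


def hough_best_line(image, theta_step=15):
    """Run a coarse Hough transform over the image and return the
    (theta_degrees, rho, votes) cell with the most weighted votes.

    Lines are parametrized as rho = x*cos(theta) + y*sin(theta), with
    x the target position (column) and y the source position (row).
    Each non-zero pixel votes with its gray level (an assumption)."""
    votes = {}
    for i, row in enumerate(image):
        for j, gray in enumerate(row):
            if gray == 0:
                continue  # background pixels cast no votes
            for theta in range(0, 180, theta_step):
                t = math.radians(theta)
                rho = round(j * math.cos(t) + i * math.sin(t))
                votes[(theta, rho)] = votes.get((theta, rho), 0) + gray
    if not votes:
        return None  # image contains no non-zero pixels
    (theta, rho), score = max(votes.items(), key=lambda kv: kv[1])
    return theta, rho, score


source = ["the", "cat", "sleeps", "here"]
target = ["le", "chat", "dort", "ici"]
image = bitext_image(source, target, TOY_LTP)
print(image)
print(hough_best_line(image))
```

On this toy input the strongest accumulator cell lies at theta = 135 degrees and rho = 0, which is the main diagonal of the image: the four word pairs are aligned monotonically. In a noisy bitext the image would contain many spurious non-zero pixels, which is where the filtering techniques named in the abstract would presumably come in before line detection.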
