High-Performance Bilingual Text Alignment Using Statistical and Dictionary Information

Masahiko Haruno, Takefumi Yamazaki
NTT Communication Science Labs.
1-2356 Take Yokosuka-Shi Kanagawa 238-03 Japan
haruno@nttkb.ntt.jp yamazaki@nttkb.ntt.jp

Abstract

This paper describes an accurate and robust text alignment system for structurally different languages. For structurally different languages such as Japanese and English, only a limited number of word correspondences can be acquired statistically. The proposed method therefore uses two kinds of word correspondences to align bilingual texts. One is a general-purpose bilingual dictionary. The other is the set of word correspondences acquired statistically during the alignment process. Our method gradually determines sentence pairs (anchors) that correspond to each other by relaxing parameters. By combining the two kinds of word correspondences, the method obtains enough word correspondences for complete alignment. As a result, texts of various lengths and genres in structurally different languages can be aligned with high precision. Experimental results show that our system outperforms conventional methods on various kinds of Japanese-English texts.

1 Introduction

Corpus-based approaches that use bilingual texts are promising for various applications, including lexical knowledge extraction (Kupiec, 1993; Matsumoto et al., 1993; Smadja et al., 1996; Dagan and Church, 1994; Kumano and Hirakawa, 1994; Haruno et al., 1996), machine translation (Brown and others, 1993; Sato and Nagao, 1990; Kaji et al., 1992), and information retrieval (Sato, 1992). Most of this work assumes large aligned corpora.

Many methods have been proposed to align bilingual corpora. One of the major approaches is based on the statistics of simple features such as sentence length in words (Brown and others, 1991) or in characters (Gale and Church, 1993). These techniques are widely used because they can be implemented in an efficient and simple way through .