Scientific paper: "Sub-sentential Alignment Using Substring Co-Occurrence Counts"

Sub-sentential Alignment Using Substring Co-Occurrence Counts

Fabien Cromieres
GETA-CLIPS-IMAG, BP 53, 38041 Grenoble Cedex 9, France

Abstract

In this paper, we present an efficient method to compute the co-occurrence counts of any pair of substrings in a parallel corpus, and an algorithm that makes use of these counts to create sub-sentential alignments on such a corpus. This algorithm has the advantage of being as general as possible with regard to the segmentation of text.

1 Introduction

An interesting and important problem in the Statistical Machine Translation (SMT) domain is the creation of sub-sentential alignments in a parallel corpus (a bilingual corpus already aligned at the sentence level). These alignments can later be used, for example, to train SMT systems or to extract bilingual lexicons. Many algorithms have already been proposed for sub-sentential alignment. Some of them focus on word-to-word alignment, such as (Brown, 1997) or (Melamed, 1997). Others allow the generation of phrase-level alignments, such as (Och et al., 1999), (Marcu and Wong, 2002) or (Zhang, Vogel and Waibel, 2003). However, with the exception of Marcu and Wong, these phrase-level alignment algorithms still place their analyses at the word level, whether by first creating a word-to-word alignment or by computing correlation coefficients between pairs of individual words. This is, in our opinion, a limitation of these algorithms, mainly because it makes them rely heavily on our capacity to segment a sentence into words. And defining what a word is is not as easy as it might seem.
In particular, in many Asian writing systems (Japanese, Chinese or Thai, for example), there is no special symbol to delimit words, such as the blank in most non-Asian writing systems. Current systems usually work around this problem by using a segmentation tool to pre-process the data. There are, however, two major disadvantages: these tools usually need a lot of linguistic knowledge, such as lexical dictionaries and hand-crafted …
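The paper's efficient counting method is not detailed in this excerpt, but the quantity it computes can be illustrated with a naive brute-force sketch: for every aligned sentence pair, enumerate all character-level substrings on each side and count, for each (source substring, target substring) pair, the number of sentence pairs in which both occur. The `max_len` cap and the toy corpus below are illustrative assumptions, not part of the original method.

```python
from collections import Counter

def substrings(sentence, max_len=4):
    # All contiguous substrings up to max_len characters.
    # Working at the character level avoids committing to any word segmentation.
    return {sentence[i:j]
            for i in range(len(sentence))
            for j in range(i + 1, min(i + max_len, len(sentence)) + 1)}

def cooccurrence_counts(corpus, max_len=4):
    # corpus: list of (source_sentence, target_sentence) aligned pairs.
    # Returns a Counter mapping (source_substring, target_substring) to the
    # number of sentence pairs in which both substrings occur.
    counts = Counter()
    for src, tgt in corpus:
        for s in substrings(src, max_len):
            for t in substrings(tgt, max_len):
                counts[(s, t)] += 1
    return counts

# Toy parallel corpus (hypothetical data for illustration only).
corpus = [("abc", "xyz"), ("abd", "xyw")]
counts = cooccurrence_counts(corpus, max_len=2)
print(counts[("ab", "xy")])  # 2: "ab" and "xy" co-occur in both sentence pairs
```

This brute force is quadratic in sentence length on each side per pair, which is exactly why an efficient counting scheme, as the abstract promises, is needed at corpus scale.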
