Scientific report: "Enhancing Statistical Machine Translation with Character Alignment"

Ning Xi, Guangchao Tang, Xinyu Dai, Shujian Huang, Jiajun Chen
State Key Laboratory for Novel Software Technology
Department of Computer Science and Technology
Nanjing University, Nanjing 210046, China
xin tanggc dxy huangsj chenjj @

Abstract

The dominant practice in statistical machine translation (SMT) uses the same Chinese word segmentation specification for both the alignment and the translation rule induction steps when building a Chinese-English SMT system. This may be suboptimal, because the word segmentation that is better for alignment is not necessarily better for translation. To tackle this, we propose a framework that uses two different segmentation specifications for alignment and translation, respectively: we use the Chinese character as the basic unit for alignment, and then convert this alignment to a conventional word alignment for translation rule induction. Experimentally, our approach outperformed two baselines, a fully word-based system (using words for both alignment and translation) and a fully character-based system, in terms of alignment quality and translation performance.

1 Introduction

Chinese word segmentation is a necessary step in Chinese-English statistical machine translation (SMT) because Chinese sentences do not delimit words by spaces. The key characteristic of a Chinese word segmenter is the segmentation specification. As depicted in Figure 1(a), the dominant practice of SMT uses the same word segmentation for both word alignment and translation rule induction.
For brevity, we will refer to the word segmentation of the bilingual corpus as word segmentation for alignment (WSA for short), because it determines the basic tokens for alignment, and to the word segmentation of the aligned corpus as word segmentation for rules (WSR for short), because it determines the basic tokens of translation.

[Figure 1(a): Bilingual Corpus (WSA) -> Word alignment -> Aligned Corpus (WSA) -> Rule induction -> Translation Rules (WSR)]
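
The step of converting a character alignment into a conventional word alignment can be illustrated with a small sketch. This excerpt does not specify the conversion rule, so the rule used here is an assumption: a Chinese word is linked to an English token whenever at least one of its characters is linked to that token. The function name `char_to_word_alignment` and the data layout are hypothetical and chosen only for illustration.

```python
def char_to_word_alignment(words, char_links):
    """Project character-level alignment links onto word-level links.

    words: the segmented Chinese sentence as a list of words
        (the segmentation used for rule induction, WSR).
    char_links: iterable of (char_index, english_index) pairs, where
        char_index counts characters across the whole sentence.

    Assumed rule (not taken from the paper): a word is linked to an
    English token if at least one of its characters is linked to it.
    """
    # Map each character position to the index of the word containing it.
    char_to_word = []
    for word_index, word in enumerate(words):
        char_to_word.extend([word_index] * len(word))

    # Collect links in a set so duplicates from several characters collapse.
    word_links = set()
    for char_index, english_index in char_links:
        word_links.add((char_to_word[char_index], english_index))
    return sorted(word_links)


# Toy example: two Chinese words, four characters, two English tokens.
words = ["机器", "翻译"]  # "machine", "translation"
char_links = [(0, 0), (1, 0), (2, 1), (3, 1)]
print(char_to_word_alignment(words, char_links))  # [(0, 0), (1, 1)]
```

Under this assumed rule, unaligned characters simply contribute no links, so a word is left unaligned only when none of its characters is aligned. Other conversion rules (for example, requiring a majority of characters to agree) are equally plausible given this excerpt.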
