TAILIEUCHUNG - Scientific report: "Bilingually Motivated Domain-Adapted Word Segmentation for Statistical Machine Translation"

Yanjun Ma, Andy Way
National Centre for Language Technology, School of Computing, Dublin City University, Dublin 9, Ireland
yma away @

Abstract

We introduce a word segmentation approach to languages where word boundaries are not orthographically marked, with application to Phrase-Based Statistical Machine Translation (PB-SMT). Instead of using manually segmented monolingual domain-specific corpora to train segmenters, we make use of bilingual corpora and statistical word alignment techniques. First of all, our approach is adapted for the specific translation task at hand by taking the corresponding source (target) language into account. Secondly, this approach does not rely on manually segmented training data, so that it can be automatically adapted for different domains. We evaluate the performance of our segmentation approach on PB-SMT tasks from two domains and demonstrate that our approach scores consistently among the best results across different data conditions.

1 Introduction

State-of-the-art Statistical Machine Translation (SMT) requires a certain amount of bilingual corpora as training data in order to achieve competitive results. The only assumption of most current statistical models (Brown et al., 1993; Vogel et al., 1996; Deng and Byrne, 2005) is that the aligned sentences in such corpora should be segmented into sequences of tokens that are meant to be words.

Therefore, for languages where word boundaries are not orthographically marked, tools which segment a sentence into words are required. However, this segmentation is normally performed as a preprocessing step using various word segmenters. Moreover, most of these segmenters are usually trained on a manually segmented domain-specific corpus, which is not adapted for the specific translation task at hand, given that the manual segmentation is performed in a monolingual context. Consequently, such segmenters cannot ...
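
To make the preprocessing step described above concrete, here is a minimal sketch of a conventional dictionary-based segmenter (greedy forward maximum matching). It is a generic baseline for illustration, not the bilingually motivated method this paper proposes. The function name `forward_max_match`, the `max_len` parameter, the toy lexicon, and the example sentence are all assumptions introduced here.

```python
def forward_max_match(sentence, lexicon, max_len=4):
    """Greedy left-to-right segmentation.

    At each position, take the longest substring (up to max_len characters)
    found in the lexicon; fall back to a single character otherwise.
    """
    tokens = []
    i = 0
    while i < len(sentence):
        longest = min(max_len, len(sentence) - i)
        for size in range(longest, 0, -1):
            candidate = sentence[i:i + size]
            # A single character is always accepted so the loop terminates.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy data, for illustration only.
sentence = "统计机器翻译"  # "statistical machine translation"

# Without any segmenter, each character is treated as a token.
char_tokens = list(sentence)

# Two lexicons of different granularity give different "words".
coarse_lexicon = {"统计", "机器", "翻译", "机器翻译"}
fine_lexicon = {"统计", "机器", "翻译"}

word_tokens = forward_max_match(sentence, coarse_lexicon)
fine_tokens = forward_max_match(sentence, fine_lexicon)

print(char_tokens)
print(word_tokens)
print(fine_tokens)
```

The two lexicons segment the same sentence differently, which illustrates why a segmenter fixed in advance on monolingual resources may not match the granularity that suits a particular translation task or domain.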
