Scientific report: "Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models"

Jason Naradowsky
Department of Computer Science, University of Massachusetts Amherst, Amherst, MA 01003
narad@

Kristina Toutanova
Microsoft Research, Redmond, WA 98502
kristout@

Abstract

This paper describes an unsupervised dynamic graphical model for morphological segmentation and bilingual morpheme alignment for statistical machine translation. The model extends Hidden Semi-Markov chain models by using factored output nodes and special structures for its conditional probability distributions. It relies on morpho-syntactic and lexical source-side information (part-of-speech, morphological segmentation) while learning a morpheme segmentation over the target language. Our model outperforms a competitive word alignment system in alignment quality. Used in a monolingual morphological segmentation setting, it substantially improves accuracy over previous state-of-the-art models on three Arabic and Hebrew datasets.

1 Introduction

An enduring problem in statistical machine translation is sparsity. The word alignment models of modern MT systems attempt to capture $p(e_i \mid f_j)$, the probability that token $e_i$ is a translation of $f_j$. Underlying these models is the assumption that the word-based tokenization of each sentence is, if not optimal, at least appropriate for specifying a conceptual mapping between the two languages.
However, when translating between unrelated languages (a common task), disparate morphological systems can place an asymmetric conceptual burden on words, making the lexicon of one language much more coarse. This intensifies the problem of sparsity, as the large number of word forms created through morphologically productive processes hinders attempts to find concise mappings between concepts. For instance, Bulgarian adjectives may contain markings for gender.

Footnote: This research was conducted during the author's internship at Microsoft Research.
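To make the sparsity argument concrete, the following minimal sketch counts how many distinct vocabulary items a word-based tokenization produces compared with a morpheme-based one. It is an illustration only: the transliterated Bulgarian adjective forms, the hand-written `+` segmentation, and the names `SEGMENTED_FORMS` and `vocabulary_sizes` are assumptions made for this example and are not taken from the paper or produced by its model.

```python
# Toy illustration only: the transliterated forms and the hand-written
# "+" segmentation are assumptions for this sketch, not the paper's output.
# Each adjective appears in masculine, feminine, and neuter form.
SEGMENTED_FORMS = [
    "nov", "nov+a", "nov+o",              # "new"
    "cherven", "cherven+a", "cherven+o",  # "red"
    "zelen", "zelen+a", "zelen+o",        # "green"
]


def vocabulary_sizes(segmented_forms):
    """Return (number of word types, number of morpheme types)."""
    # Word-based tokenization: every inflected form is its own type.
    word_types = {form.replace("+", "") for form in segmented_forms}
    # Morpheme-based tokenization: stems and suffixes are shared across forms.
    morpheme_types = {m for form in segmented_forms for m in form.split("+")}
    return len(word_types), len(morpheme_types)


if __name__ == "__main__":
    words, morphemes = vocabulary_sizes(SEGMENTED_FORMS)
    print(f"word types: {words}, morpheme types: {morphemes}")
```

On this toy list the word-level vocabulary has nine types while the morpheme-level vocabulary has five (three stems plus two gender suffixes). Each additional adjective adds three word types but only one morpheme type, which is the kind of growth in word forms that the sparsity argument above refers to.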
