Scientific report: "A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining"

Hassan Sajjad, Alexander Fraser, Helmut Schmid
Institute for Natural Language Processing, University of Stuttgart
sajjad fraser schmid @

Abstract

We propose a novel model to automatically extract transliteration pairs from parallel corpora. Our model is efficient, language pair independent, and mines transliteration pairs in a consistent fashion in both unsupervised and semi-supervised settings. We model transliteration mining as an interpolation of transliteration and non-transliteration sub-models. We evaluate on NEWS 2010 shared task data and on parallel corpora with competitive results.

1 Introduction

Transliteration mining is the extraction of transliteration pairs from unlabelled data. Most transliteration mining systems are built using labelled training data or using heuristics to extract transliteration pairs. These systems are language pair dependent or require labelled information for training. Our system extracts transliteration pairs in an unsupervised fashion. It is also able to utilize labelled information if available, obtaining improved performance.

We present a novel model of transliteration mining defined as a mixture of a transliteration model and a non-transliteration model. The transliteration model is a joint source channel model (Li et al., 2004). The non-transliteration model assumes no correlation between source and target word characters, and independently generates a source and a target word using two fixed unigram character models.
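The non-transliteration model and the interpolation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function names, the relative-frequency estimator without smoothing, and the convention that `lam` weights the non-transliteration sub-model are all assumptions made for the example. The joint source channel model is not reproduced here, so its probability is simply passed in as a number.

```python
import math
from collections import Counter


def train_unigram(words):
    """Character unigram model by relative frequency (assumed estimator, no smoothing)."""
    counts = Counter(ch for word in words for ch in word)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def word_log_prob(word, unigram):
    """Log probability of a word whose characters are generated independently.

    Raises KeyError for a character unseen in training, since no smoothing is applied.
    """
    return sum(math.log(unigram[ch]) for ch in word)


def non_translit_log_prob(src, tgt, src_unigram, tgt_unigram):
    """Source and target words are generated independently, so their log probabilities add."""
    return word_log_prob(src, src_unigram) + word_log_prob(tgt, tgt_unigram)


def mixture_prob(p_translit, p_non_translit, lam):
    """Interpolate the two sub-models; lam (hypothetical name) weights the non-transliteration model."""
    return (1.0 - lam) * p_translit + lam * p_non_translit


# Toy data, invented purely for illustration.
src_unigram = train_unigram(["ab", "ba"])
tgt_unigram = train_unigram(["xy", "xx"])
p_non = math.exp(non_translit_log_prob("ab", "xy", src_unigram, tgt_unigram))
p_pair = mixture_prob(p_translit=0.2, p_non_translit=p_non, lam=0.25)
```

Working in log space for the per-character products avoids underflow on long words. The interpolation itself is done on plain probabilities here only to keep the sketch short.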
We use Expectation Maximization (EM) to learn parameters maximizing the likelihood of the interpolation of both sub-models. At test time, we label word pairs as transliterations if they have a higher probability assigned by the transliteration sub-model than by the non-transliteration sub-model. We extend the unsupervised system to a semi-supervised system by adding a new S-step to the EM algorithm. The S-step takes the
