Scientific report: "Rare Word Translation Extraction from Aligned Comparable Documents"

Emmanuel Prochasson and Pascale Fung
Human Language Technology Center
Hong Kong University of Science and Technology
Clear Water Bay, Kowloon, Hong Kong
{eemmanuel, pascale}@

Abstract

We present the first known result of high-precision bilingual extraction of rare words from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents, in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obtain a very high F-measure, between 80% and 98%, for recognizing and extracting correct translations of rare terms (from 1 to 5 occurrences). Moreover, we show that our system can be trained on one pair of languages and tested on a different pair, obtaining an F-measure of 77% for the classification of Chinese-English translations using a Spanish-French training corpus. Our method is therefore potentially applicable even to low-resource languages without training data.

1 Introduction

Rare words have long been a challenge to translate automatically with statistical methods because they occur so seldom. However, Zipf's law states that, for any corpus of natural-language text, the frequency of a word $w_n$ ($n$ being its rank in the frequency table) is roughly inversely proportional to $n$, so that the most frequent word occurs about twice as often as the second most frequent one. The logical consequence is that any corpus contains very few frequent words and many rare words.
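To make the consequence of Zipf's law concrete, the following is a minimal sketch, not taken from the paper. It builds a synthetic corpus in which the word of rank n occurs about top_frequency / n times, then measures what share of the vocabulary falls in the rare range of 1 to 5 occurrences mentioned in the abstract. The function names `zipf_corpus` and `rare_word_share`, the vocabulary size, and the top frequency are all illustrative assumptions.

```python
from collections import Counter


def zipf_corpus(vocab_size, top_frequency):
    """Build a synthetic token list (an assumption for illustration):
    the word of rank n occurs about top_frequency / n times, at least once."""
    tokens = []
    for rank in range(1, vocab_size + 1):
        freq = max(1, round(top_frequency / rank))
        tokens.extend([f"w{rank}"] * freq)
    return tokens


def rare_word_share(tokens, max_count=5):
    """Return the fraction of distinct words occurring at most max_count times."""
    counts = Counter(tokens)
    rare = sum(1 for c in counts.values() if c <= max_count)
    return rare / len(counts)


# Hypothetical sizes chosen only to keep the example small.
tokens = zipf_corpus(vocab_size=1000, top_frequency=500)
share = rare_word_share(tokens, max_count=5)
print(f"Share of vocabulary with at most 5 occurrences: {share:.2f}")
```

On this idealized corpus the large majority of distinct words are rare, which is the situation the paper sets out to address. Real corpora only approximate this distribution.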
We propose a novel approach to extract rare word translations from comparable corpora, relying on two main features. The first feature is the context-vector similarity (Fung, 2000; Chiao and Zweigenbaum, 2002; Laroche and Langlais, 2010): each word is characterized by its context in both source and target corpora, and words that are translations of each other should have similar contexts in both languages. The second feature follows the assumption that specific
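The context-vector idea described above can be sketched as follows. This is a generic illustration under stated assumptions, not the authors' implementation: it uses raw co-occurrence counts in a fixed window, a small hand-made seed lexicon to map source-language context words into the target language, and cosine similarity to compare vectors. The toy sentences, the lexicon, the window size, and all function names are hypothetical.

```python
import math
from collections import Counter


def context_vector(tokens, word, window=2):
    """Count the words found within `window` positions of each occurrence of `word`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vec[tokens[j]] += 1
    return vec


def translate_vector(vec, seed_lexicon):
    """Map a source context vector into the target language,
    dropping context words that the seed lexicon does not cover."""
    out = Counter()
    for w, c in vec.items():
        if w in seed_lexicon:
            out[seed_lexicon[w]] += c
    return out


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c * v[k] for k, c in u.items() if k in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy comparable "corpora" and seed lexicon (assumptions for illustration only).
fr_tokens = "le chat boit du lait le chien mange la viande".split()
en_tokens = "the cat drinks some milk the dog eats the meat".split()
seed_lexicon = {"le": "the", "la": "the", "boit": "drinks", "du": "some",
                "lait": "milk", "mange": "eats", "viande": "meat"}

source_vec = translate_vector(context_vector(fr_tokens, "chat"), seed_lexicon)
scores = {cand: cosine(source_vec, context_vector(en_tokens, cand))
          for cand in ("cat", "dog")}
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy setting the candidate "cat" scores higher than "dog" for the source word "chat". With rare words the context vectors are built from very few occurrences, which is why the paper combines this feature with a second one.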
