Scientific report: "A Method for Effective and Scalable Mining of Named Entity Transliterations from Large Comparable Corpora"

News stories are typically rich in NEs; comparable news corpora can therefore be expected to contain NETEs (Klementiev and Roth, 2006; Tao et al., 2006). The large quantity and perpetual availability of news corpora in many of the world’s languages make mining of NETEs a viable alternative to traditional approaches. It is this opportunity that we address in our work. In this paper, we detail an effective and scalable mining method, called MINT (MIning Named-entity Transliteration equivalents), for mining NETEs from large comparable corpora.

MINT: A Method for Effective and Scalable Mining of Named Entity Transliterations from Large Comparable Corpora

Raghavendra Udupa, K Saravanan, A Kumaran, Jagadeesh Jagarlamudi
Microsoft Research India, Bangalore 560080, INDIA
{raghavu, v-sarak, kumarana, jags}@

Abstract

In this paper we address the problem of mining transliterations of Named Entities (NEs) from large comparable corpora. We leverage the empirical fact that multilingual news articles with similar news content are rich in Named Entity Transliteration Equivalents (NETEs). Our mining algorithm, MINT, uses a cross-language document similarity model to align multilingual news articles and then mines NETEs from the aligned articles using a transliteration similarity model. We show that our approach is highly effective on 6 different comparable corpora between English and 4 languages from 3 different language families. Furthermore, it performs substantially better than a state-of-the-art competitor.

1 Introduction

Named Entities (NEs) play a critical role in many Natural Language Processing and Information Retrieval (IR) tasks. In Cross-Language Information Retrieval (CLIR) systems they play an even more important role, as the accuracy of their transliterations is shown to correlate highly with the performance of the CLIR systems (Mandl and Womser-Hacker, 2005; Xu and Weischedel, 2005). Traditional methods for transliterations have not proven to be very effective in CLIR. Machine Transliteration systems (AbdulJaleel and Larkey, 2003; Al-Onaizan and Knight, 2002; Virga and Khudanpur, 2003) usually produce incorrect transliterations, and translation lexicons such as hand-crafted or statistical dictionaries are too static to have good coverage of NEs¹ occurring in current news events. Hence there is a critical need for creating and continually updat…

* Currently with University of Utah.
¹ New NEs are introduced to the vocabulary of a language every day. On average, 260 and 452 new NEs appeared daily in the XIE and AFE segments…
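
The two-stage flow described in the abstract (first align articles across languages, then mine transliteration pairs from the aligned articles) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not the models used in the paper: `doc_similarity` stands in for the cross-language document similarity model with a cosine over bags of words mapped through a tiny hypothetical lexicon, `translit_similarity` stands in for the transliteration similarity model with a plain string-similarity ratio, target-language tokens are assumed to be already romanized, and capitalization is used as a crude NE candidate filter. The function names, thresholds, and toy data are all hypothetical.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def doc_similarity(src_tokens, tgt_tokens, lexicon):
    """Toy stand-in for a cross-language document similarity model:
    cosine over bags of words, after mapping target tokens into the
    source language through a small (hypothetical) lexicon."""
    src = Counter(t.lower() for t in src_tokens)
    tgt = Counter(lexicon[t] for t in tgt_tokens if t in lexicon)
    dot = sum(count * tgt[word] for word, count in src.items())
    norm = math.sqrt(sum(c * c for c in src.values())) * math.sqrt(
        sum(c * c for c in tgt.values())
    )
    return dot / norm if norm else 0.0


def translit_similarity(src_word, tgt_word):
    """Toy stand-in for a transliteration similarity model: a string
    similarity ratio between the source word and a romanized target word."""
    return SequenceMatcher(None, src_word.lower(), tgt_word.lower()).ratio()


def mine_netes(src_docs, tgt_docs, lexicon, doc_threshold=0.3, word_threshold=0.7):
    """Stage 1: pair each source article with its most similar target article.
    Stage 2: keep word pairs from the paired articles that look like
    transliterations of each other."""
    mined = set()
    for src in src_docs:
        scored = [(doc_similarity(src, tgt, lexicon), tgt) for tgt in tgt_docs]
        if not scored:
            continue
        best_score, best_tgt = max(scored, key=lambda pair: pair[0])
        if best_score < doc_threshold:
            continue  # no sufficiently similar article, so nothing is mined
        for src_word in src:
            if not src_word[:1].isupper():
                continue  # crude NE candidate filter (an assumption)
            for tgt_word in best_tgt:
                if translit_similarity(src_word, tgt_word) >= word_threshold:
                    mined.add((src_word, tgt_word))
    return mined


# Toy data: one English article and two romanized target-language articles.
english_docs = [["Obama", "met", "Clinton", "today"]]
target_docs = [
    ["baarish", "mausam"],
    ["obaamaa", "klintan", "aaj", "mile"],
]
toy_lexicon = {"aaj": "today", "mile": "met", "baarish": "rain", "mausam": "weather"}

pairs = mine_netes(english_docs, target_docs, toy_lexicon)
print(sorted(pairs))
```

Restricting the word-pair comparison to aligned articles is what keeps the second stage cheap: each source word is compared only against the words of one target article rather than against the whole target corpus.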
