Scientific report: "Customizing Parallel Corpora at the Document Level"

Monica ROGATI and Yiming YANG
Computer Science Department, Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA 15213
mrogati@ yiming@

Abstract

Recent research in cross-lingual information retrieval (CLIR) established the need for properly matching the parallel corpus used for query translation to the target corpus. We propose a document-level approach to solving this problem: building a custom-made parallel corpus by automatically assembling it from documents taken from other parallel corpora. Although the general idea can be applied to any application that uses parallel corpora, we present results for CLIR in the medical domain. In order to extract the best-matched documents from several parallel corpora, we propose ranking individual documents by using a length-normalized Okapi-based similarity score between them and the target corpus. This ranking allows us to discard 50-90% of the training data while avoiding the performance drop caused by a good but mismatched resource, and even improving CLIR effectiveness by 4-7% when compared to using all available training data.

1 Introduction

Our recent research in cross-lingual information retrieval (CLIR) established the need for properly matching the parallel corpus used for query translation to the target corpus (Rogati and Yang, 2004).
In particular, we showed that using a general-purpose machine translation (MT) system such as SYSTRAN, or a general-purpose parallel corpus, both of which perform very well for news stories (Peters, 2003), dramatically fails in the medical domain. To explore solutions to this problem, we used cosine similarity between training and target corpora as respective weights when building a translation model. This approach treats a parallel corpus as a homogeneous entity, an entity that is self-consistent in its domain and document quality. In this paper, we propose that instead of weighting entire resources, we can select individual ...
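The document-level selection described in the abstract can be illustrated with a small sketch. This excerpt does not give the exact scoring formula, so the code below is only an assumption: it uses a standard BM25-style (Okapi) term weight with length normalization, treats the pooled target corpus as a weighted query, and scores one side of each parallel document. The function names `rank_by_target_similarity` and `select_top`, and the parameters `k1`, `b`, and `keep_fraction`, are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def rank_by_target_similarity(candidates, target_docs, k1=1.2, b=0.75):
    """Return candidate indices ordered from most to least similar to the target corpus.

    Assumption: a BM25-style score, not necessarily the formula used in the paper.
    """
    if not candidates:
        return []

    docs = [Counter(text.lower().split()) for text in candidates]
    lengths = [sum(tf.values()) for tf in docs]
    avg_len = sum(lengths) / len(docs)

    # Document frequency of each term within the candidate pool.
    df = Counter()
    for tf in docs:
        df.update(tf.keys())

    # Collapse the target corpus into one relative-frequency profile (the "query").
    target_tf = Counter()
    for text in target_docs:
        target_tf.update(text.lower().split())
    total = sum(target_tf.values())

    scores = []
    for tf, length in zip(docs, lengths):
        score = 0.0
        for term, freq in tf.items():
            if term not in target_tf:
                continue
            idf = math.log(1 + (len(docs) - df[term] + 0.5) / (df[term] + 0.5))
            # Okapi term-frequency saturation with document length normalization.
            norm_tf = freq * (k1 + 1) / (freq + k1 * (1 - b + b * length / avg_len))
            score += (target_tf[term] / total) * idf * norm_tf
        scores.append(score)

    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def select_top(candidates, target_docs, keep_fraction=0.5):
    """Keep only the best-matched fraction of candidate documents."""
    ranking = rank_by_target_similarity(candidates, target_docs)
    keep = max(1, round(len(candidates) * keep_fraction))
    return [candidates[i] for i in ranking[:keep]]
```

In a real setting each candidate would be one document pair from a parallel corpus, scored on the side written in the target corpus language, and the retained pairs would then be used to train the translation model. The value of `keep_fraction` is a tuning choice; the abstract reports discarding 50-90% of the training data.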
