Scientific report: "Multilingual Document Clustering: an Heuristic Approach Based on Cognate Named Entities"

Soto Montalvo, GAVAB Group, URJC
Raquel Martinez, NLP IR Group, UNED, raquel@
Arantza Casillas, Dpt. EE, UPV-EHU
Victor Fresno, GAVAB Group, URJC

Abstract

This paper presents an approach for Multilingual Document Clustering in comparable corpora. The algorithm is heuristic in nature, and its only evidence for clustering is the identification of cognate named entities between both sides of the comparable corpora. One of the main advantages of this approach is that it does not depend on bilingual or multilingual resources. However, it depends on the possibility of identifying cognate named entities between the languages used in the corpus. An additional advantage of the approach is that it does not need any information about the right number of clusters; the algorithm calculates it. We have tested this approach with a comparable corpus of news written in English and Spanish. In addition, we have compared the results with a system which translates selected document features. The results obtained are encouraging.

1 Introduction

Multilingual Document Clustering (MDC) involves dividing a set of n documents, written in different languages, into a specified number k of clusters, so that documents that are similar to one another are in the same cluster.
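The idea stated in the abstract, linking documents across languages through cognate named entities and letting the number of clusters emerge from the data, can be illustrated with a small sketch. This excerpt does not say how cognates are identified or how clusters are formed, so everything below is an assumption made for illustration: cognates are detected with a normalized Levenshtein distance under a hypothetical threshold of 0.25, two documents are linked when they share at least one cognate entity, and clusters are the connected components of those links. The function names are hypothetical, and named entities are assumed to be already extracted.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def are_cognates(a: str, b: str, threshold: float = 0.25) -> bool:
    """Assumed test: normalized edit distance at or below a threshold."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return False
    return edit_distance(a, b) / longest <= threshold


def share_cognate(ents_a: set[str], ents_b: set[str]) -> bool:
    """True if any entity of one document is a cognate of one in the other."""
    return any(are_cognates(x, y) for x in ents_a for y in ents_b)


def cluster_by_cognates(docs: dict[str, set[str]]) -> list[set[str]]:
    """Group documents into connected components of the 'shares a cognate' relation.

    The number of clusters is not an input; it results from the links found.
    """
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(docs, 2):
        if share_cognate(docs[a], docs[b]):
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for doc_id in docs:
        groups.setdefault(find(doc_id), set()).add(doc_id)
    return list(groups.values())


# Toy input: document id -> named entities (assumed already extracted).
docs = {
    "en1": {"Tony Blair", "London"},
    "es1": {"Tony Blair", "Londres"},
    "en2": {"Barcelona", "Ronaldinho"},
    "es2": {"Barcelona", "Ronaldinho"},
}
for group in cluster_by_cognates(docs):
    print(sorted(group))
```

Linking on a single shared entity is deliberately permissive and would over-merge on a real news corpus, where frequent names such as countries appear in many unrelated stories; a practical heuristic would likely require several shared entities or weight them by category.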
While a multilingual cluster is composed of documents written in different languages, a monolingual cluster is composed of documents written in one language. MDC has many applications. The increasing number of documents written in different languages that are available electronically creates a need for applications that manage this information by filtering, retrieving, and grouping multilingual documents. MDC tools can facilitate tasks such as Cross-Lingual Information Retrieval, the training of parameters in statistics-based machine translation, or the alignment
