Scientific paper: "Identifying Word Translations from Comparable Corpora Using Latent Topic Models"

Identifying Word Translations from Comparable Corpora Using Latent Topic Models

Ivan Vulić, Wim De Smet and Marie-Francine Moens
Department of Computer Science, K.U. Leuven
Celestijnenlaan 200A, Leuven, Belgium

Abstract

A topic model outputs a set of multinomial distributions over words for each topic. In this paper, we investigate the value of bilingual topic models, i.e., a bilingual Latent Dirichlet Allocation model, for finding translations of terms in comparable corpora without using any linguistic resources. Experiments on a document-aligned English-Italian Wikipedia corpus confirm that the developed methods, which only use knowledge from word-topic distributions, outperform methods based on similarity measures in the original word-document space. The best results, obtained by combining knowledge from word-topic distributions with similarity measures in the original space, are also reported.

1 Introduction

Generative models for documents, such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003), are based on the idea that latent variables exist which determine how the words in documents might have been generated. Fitting a generative model means finding the best set of those latent variables in order to explain the observed data.
Within that setting, documents are observed as mixtures of latent topics, where topics are probability distributions over words. Our goal is to model and test the capability of probabilistic topic models to identify potential translations from document-aligned text collections. A representative example of such a comparable text collection is Wikipedia, where one may observe articles discussing the same topic but strongly varying in style, length and even vocabulary, while still sharing a certain amount of main concepts or topics. We try to establish a connection between such latent topics and an idea known as the distributional hypothesis (Harris, 1954): words with a similar meaning are often used in similar contexts.
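To make the core idea concrete, the following is a minimal, self-contained sketch of ranking translation candidates by the similarity of their word-topic distributions. It is not the paper's implementation: the vocabulary, the three-topic space, and all probability values are invented for illustration, and cosine similarity is used as one plausible choice of similarity measure over topic vectors.

```python
from math import sqrt

# Hypothetical per-word distributions over a shared bilingual topic space
# (3 topics). All words and numbers below are illustrative, not from the paper.
en_topics = {"river": [0.8, 0.1, 0.1]}
it_topics = {
    "fiume":    [0.7, 0.2, 0.1],   # "river"
    "musica":   [0.1, 0.8, 0.1],   # "music"
    "elezione": [0.1, 0.1, 0.8],   # "election"
}

def cosine(u, v):
    """Cosine similarity between two topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

def rank_translations(src_vec, tgt_topics):
    """Rank target-language words by topic-vector similarity to a source word."""
    scored = [(word, cosine(src_vec, vec)) for word, vec in tgt_topics.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank_translations(en_topics["river"], it_topics)
print(ranking[0][0])  # top-ranked candidate: "fiume"
```

Because both languages' words are represented in the same latent topic space, no dictionary or other linguistic resource is needed to compare them, which is what allows the method to operate on comparable, non-parallel corpora.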
