Learning Common Grammar from Multilingual Corpus

Tomoharu Iwata, Daichi Mochihashi, Hiroshi Sawada
NTT Communication Science Laboratories
2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto, Japan
{iwata,daichi,sawada}@cslab.kecl.ntt.co.jp

Abstract

We propose a corpus-based probabilistic framework to extract hidden common syntax across languages from non-parallel multilingual corpora in an unsupervised fashion. For this purpose, we assume a generative model for multilingual corpora, where each sentence is generated from a language-dependent probabilistic context-free grammar (PCFG), and these PCFGs are generated from a prior grammar that is common across languages. We also develop a variational method for efficient inference. Experiments on a non-parallel multilingual corpus of eleven languages demonstrate the feasibility of the proposed method.

1 Introduction

Languages share certain common properties (Pinker, 1994). For example, the word order in most European languages is subject-verb-object (SVO), and some words with similar forms are used with similar meanings in different languages. The reasons for these common properties can be attributed to: 1) a common ancestor language, 2) borrowing from nearby languages, and 3) the innate abilities of humans (Chomsky, 1965).

We assume hidden commonalities in syntax across languages, and try to extract a common grammar from non-parallel multilingual corpora. For this purpose, we propose a generative model for multilingual grammars that is learned in an unsupervised fashion.
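The generative assumption stated in the abstract (a common prior grammar generates one PCFG per language, and each PCFG generates that language's sentences) can be illustrated with a small sampling sketch. Everything concrete here is an assumption made for illustration and does not come from the text above: the toy rule set `RULES`, the pseudo-counts in `COMMON_PRIOR`, the choice of a Dirichlet distribution as the form of the prior, and the helper names. The sketch covers only the forward sampling direction, not the variational inference.

```python
import random

# Hypothetical toy grammar: each nonterminal maps to its candidate right-hand sides.
# Symbols that are not keys of RULES ("det", "noun", "verb") are terminals.
RULES = {
    "S": [("NP", "VP"), ("VP", "NP")],
    "NP": [("noun",), ("det", "noun")],
    "VP": [("verb",), ("verb", "NP"), ("NP", "verb")],
}

# Assumed common prior: one positive pseudo-count per rule, shared by all languages.
# Larger values make a rule more probable in every language on average.
COMMON_PRIOR = {
    "S": [8.0, 1.0],
    "NP": [3.0, 3.0],
    "VP": [2.0, 4.0, 1.0],
}


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_language_pcfg(rng):
    """Draw language-specific rule probabilities from the shared prior."""
    return {lhs: sample_dirichlet(COMMON_PRIOR[lhs], rng) for lhs in RULES}


def sample_sentence(pcfg, rng, symbol="S"):
    """Expand `symbol` top-down; the toy grammar is non-recursive, so this terminates."""
    if symbol not in RULES:
        return [symbol]
    rhs = rng.choices(RULES[symbol], weights=pcfg[symbol])[0]
    tokens = []
    for child in rhs:
        tokens.extend(sample_sentence(pcfg, rng, child))
    return tokens


rng = random.Random(0)
for language in ("language_a", "language_b"):  # placeholder language names
    pcfg = sample_language_pcfg(rng)
    print(language, " ".join(sample_sentence(pcfg, rng)))
```

Because both placeholder languages draw their rule probabilities from the same pseudo-counts, their grammars differ in detail but tend to agree on which rules dominate. That shared tendency is the kind of commonality a learned common grammar would be expected to capture.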
There are some computational models for capturing commonalities at the phoneme and word level (Oakes, 2000; Bouchard-Cote et al., 2008), but, as far as we know, no attempt has been made to extract commonalities at the syntax level from non-parallel and non-annotated multilingual corpora. In our scenario, we use probabilistic context-free grammars (PCFGs) as our monolingual grammar model. We assume that a PCFG for each language is generated from a general model that is common across languages, and each ...