Scientific report: "Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora"

ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora

Mārcis Pinnis1, Radu Ion2, Dan Ştefănescu2, Fangzhong Su3, Inguna Skadiņa1, Andrejs Vasiljevs1, Bogdan Babych3

1 Tilde, Vienības gatve 75a, Riga, Latvia, andrejs @
2 Research Institute for Artificial Intelligence, Romanian Academy, radu danstef @
3 Centre for Translation Studies, University of Leeds

Abstract

The lack of parallel corpora and linguistic resources for many languages and domains is one of the major obstacles for the further advancement of automated translation. A possible solution is to exploit comparable corpora (non-parallel bi- or multi-lingual text resources), which are much more widely available than parallel translation data. Our presented toolkit deals with parallel content extraction from comparable corpora. It consists of tools bundled in two workflows: (1) alignment of comparable documents and extraction of parallel sentences and (2) extraction and bilingual mapping of terms and named entities. The toolkit pairs similar bilingual comparable documents and extracts parallel sentences and bilingual terminological and named entity dictionaries from comparable corpora. This demonstration focuses on the English, Latvian, Lithuanian, and Romanian languages.

Introduction

In recent decades, data-driven approaches have significantly advanced the development of machine translation (MT). However, lack of sufficient bilingual linguistic resources for many languages and domains is still one of the major obstacles for further advancement of automated translation. At the same time, comparable corpora, i.e., non-parallel bi- or multilingual text resources such as daily news articles and large knowledge bases like Wikipedia, are much more widely available than parallel translation data. While methods for the use of parallel corpora in machine translation are well studied (Koehn, 2010), similar techniques for comparable corpora have
