Scientific report: "An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment"

Hassan Sajjad, Alexander Fraser, Helmut Schmid
Institute for Natural Language Processing, University of Stuttgart
sajjad fraser schmid @

Abstract

We propose a language-independent method for the automatic extraction of transliteration pairs from parallel corpora. In contrast to previous work, our method uses no form of supervision and does not require linguistically informed preprocessing. We conduct experiments on data sets from the NEWS 2010 shared task on transliteration mining and achieve an F-measure of up to 92%, outperforming most of the semi-supervised systems that were submitted. We also apply our method to English/Hindi and English/Arabic parallel corpora and compare the results with manually built gold standards which mark transliterated word pairs. Finally, we integrate the transliteration module into the GIZA++ word aligner and evaluate it on two word alignment tasks, achieving improvements in both precision and recall measured against gold standard word alignments.

1 Introduction

Most previous methods for building transliteration systems were supervised, requiring either handcrafted rules or a clean list of transliteration pairs, both of which are expensive to create. Such resources are also not applicable to other language pairs. In this paper, we show that it is possible to extract transliteration pairs from a parallel corpus using an unsupervised method.
We first align a bilingual corpus at the word level using GIZA++ and create a list of word pairs containing a mix of non-transliterations and transliterations. We train a statistical transliterator on the list of word pairs. We then filter out a few word pairs (those which have the lowest transliteration probabilities according to the trained transliteration system), which are likely to be non-transliterations. We retrain the transliterator on the filtered data set. This process is iterated, filtering ...
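
The train, filter, retrain loop described above can be illustrated with a minimal sketch. It is not the authors' system. The scorer is a deliberately simple character co-occurrence model standing in for the statistical transliterator, and the names `train`, `score`, `mine`, `drop_per_iter` and `iterations` are hypothetical. The excerpt breaks off before stating the stopping criterion, so a fixed iteration count is assumed here purely for illustration. The word pairs are assumed to come from a prior word alignment step, which is not shown.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]
Model = Dict[Tuple[str, str], float]


def train(pairs: List[Pair]) -> Model:
    """Toy stand-in for a transliteration model (an assumption, not the
    paper's model): estimate p(t | s) from character co-occurrence counts."""
    counts: Dict[Tuple[str, str], float] = defaultdict(float)
    totals: Dict[str, float] = defaultdict(float)
    for src, tgt in pairs:
        for s in src:
            for t in tgt:
                counts[(s, t)] += 1.0
                totals[s] += 1.0
    return {(s, t): c / totals[s] for (s, t), c in counts.items()}


def score(model: Model, pair: Pair) -> float:
    """Length-normalised score: for each target character, take the best
    p(t | s) over the source characters, then average."""
    src, tgt = pair
    if not src or not tgt:
        return 0.0
    return sum(max(model.get((s, t), 0.0) for s in src) for t in tgt) / len(tgt)


def mine(pairs: List[Pair], drop_per_iter: int = 1, iterations: int = 2) -> List[Pair]:
    """Iteratively retrain and drop the lowest-scoring pairs.

    The fixed iteration count is an assumed stopping rule."""
    if drop_per_iter < 1:
        raise ValueError("drop_per_iter must be at least 1")
    kept = list(pairs)
    for _ in range(iterations):
        if len(kept) <= drop_per_iter:
            break
        model = train(kept)  # retrain on the current (filtered) list
        kept.sort(key=lambda p: score(model, p), reverse=True)
        kept = kept[:-drop_per_iter]  # remove the least likely transliterations
    return kept


if __name__ == "__main__":
    # Hypothetical romanised word pairs, mixing transliterations and translations.
    candidates = [
        ("london", "landan"),
        ("paris", "peris"),
        ("berlin", "barlin"),
        ("house", "ghar"),
        ("water", "pani"),
    ]
    print(mine(candidates, drop_per_iter=1, iterations=2))
```

Retraining after every filtering step is the point of the loop: once some likely non-transliterations are removed, the model is estimated on cleaner data, so the next round of scores should separate the remaining pairs better. Which pairs the toy scorer removes depends on the data and is not meant to reflect the quality of a real transliteration model.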
