Scientific report: "An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment"

Hassan Sajjad, Alexander Fraser, Helmut Schmid
Institute for Natural Language Processing, University of Stuttgart
sajjad fraser schmid @

Abstract

We propose a language-independent method for the automatic extraction of transliteration pairs from parallel corpora. In contrast to previous work, our method uses no form of supervision and does not require linguistically informed preprocessing. We conduct experiments on data sets from the NEWS 2010 shared task on transliteration mining and achieve an F-measure of up to 92%, outperforming most of the semi-supervised systems that were submitted. We also apply our method to English/Hindi and English/Arabic parallel corpora and compare the results with manually built gold standards that mark transliterated word pairs. Finally, we integrate the transliteration module into the GIZA++ word aligner and evaluate it on two word alignment tasks, achieving improvements in both precision and recall measured against gold-standard word alignments.

1 Introduction

Most previous methods for building transliteration systems were supervised, requiring either handcrafted rules or a clean list of transliteration pairs, both of which are expensive to create. Such resources are also not applicable to other language pairs. In this paper, we show that it is possible to extract transliteration pairs from a parallel corpus using an unsupervised method.
We first align a bilingual corpus at the word level using GIZA++ and create a list of word pairs containing a mix of non-transliterations and transliterations. We train a statistical transliterator on the list of word pairs. We then filter out a few word pairs (those which have the lowest transliteration probabilities according to the trained transliteration system), which are likely to be non-transliterations. We retrain the transliterator on the filtered data set. This process is iterated, filtering ...
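The train/filter/retrain loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the excerpt does not specify the transliteration model or the stopping criterion, so the sketch swaps in a character-level IBM Model 1 as a stand-in transliterator and simply runs a fixed number of rounds. The names `train_char_model`, `score`, `mine`, `rounds`, and `drop_per_round` are hypothetical, and the toy `candidates` list stands in for word pairs that would come from a word alignment.

```python
from collections import defaultdict
import math


def train_char_model(pairs, em_iters=5):
    """Stand-in transliterator: character-level IBM Model 1 trained with EM.

    Assumes every word in `pairs` is non-empty. Returns a mapping
    (source_char, target_char) -> estimated p(target_char | source_char).
    """
    t = defaultdict(lambda: 1.0)  # uniform (unnormalized) initialization
    for _ in range(em_iters):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in pairs:
            for tc in tgt:
                z = sum(t[(sc, tc)] for sc in src)
                for sc in src:
                    frac = t[(sc, tc)] / z
                    count[(sc, tc)] += frac
                    total[sc] += frac
        t = defaultdict(float, {k: v / total[k[0]] for k, v in count.items()})
    return t


def score(model, src, tgt, floor=1e-6):
    """Length-normalized log-probability of tgt given src under the model."""
    logp = 0.0
    for tc in tgt:
        p = sum(model.get((sc, tc), 0.0) for sc in src) / len(src)
        logp += math.log(max(p, floor))
    return logp / len(tgt)


def mine(pairs, rounds=3, drop_per_round=1):
    """Repeatedly train on the current list, then drop the lowest-scoring pairs.

    A fixed number of rounds is an assumption made for this sketch.
    """
    pairs = list(pairs)
    for _ in range(rounds):
        if len(pairs) <= drop_per_round:
            break
        model = train_char_model(pairs)
        ranked = sorted(pairs, key=lambda p: score(model, *p), reverse=True)
        pairs = ranked[:len(ranked) - drop_per_round]
    return pairs


# Toy candidate list; upper case stands in for a second script.
candidates = [
    ("london", "LONDON"), ("berlin", "BERLIN"), ("madrid", "MADRID"),
    ("paris", "PARIS"), ("roma", "ROMA"), ("lima", "LIMA"),
    ("house", "GHAR"), ("water", "PANI"),
]
mined = mine(candidates, rounds=2, drop_per_round=1)
print(mined)
```

Each round removes exactly `drop_per_round` pairs, which mirrors the "filter out a few word pairs" step. Which pairs survive depends entirely on the stand-in model, so the toy run shows only the control flow and does not reproduce the paper's results.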

