TAILIEUCHUNG - Scientific report: "Rare Word Translation Extraction from Aligned Comparable Documents"

Rare Word Translation Extraction from Aligned Comparable Documents

Emmanuel Prochasson and Pascale Fung
Human Language Technology Center, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
eemmanuel pascale @

Abstract

We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents, in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obtain very high F-Measure (between 80% and 98%) for recognizing and extracting correct translations for rare terms (from 1 to 5 occurrences). Moreover, we show that our system can be trained on a pair of languages and tested on a different pair of languages, obtaining an F-Measure of 77% for the classification of Chinese-English translations using a training corpus of Spanish-French. Our method is therefore even potentially applicable to low-resource languages without training data.

1 Introduction

Rare words have long been a challenge to translate automatically using statistical methods because of their low number of occurrences. However, Zipf's Law claims that, for any corpus of natural language text, the frequency of a word $w_n$ ($n$ being its rank in the frequency table) will be roughly twice as high as the frequency of word $w_{n+1}$. The logical consequence is that in any corpus there are very few frequent words and many rare words.
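The skew between a few frequent words and many rare ones can be checked with a plain frequency count. The following is a minimal sketch, not taken from the paper: the helper names `rank_frequency` and `rare_word_share` and the toy text are assumptions made for illustration. The threshold of 5 occurrences follows the abstract's definition of rare terms.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (word, count) pairs sorted from most to least frequent."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()


def rare_word_share(text, max_count=5):
    """Fraction of distinct words that occur at most `max_count` times."""
    table = rank_frequency(text)
    if not table:
        return 0.0
    rare = sum(1 for _, count in table if count <= max_count)
    return rare / len(table)


# Hypothetical toy corpus: a repeated sentence plus a few one-off words.
toy_text = (
    "the cat sat on the mat and the dog sat near the cat " * 3
    + "a rare zygote appeared"
)

if __name__ == "__main__":
    for rank, (word, count) in enumerate(rank_frequency(toy_text), start=1):
        print(rank, word, count)
    print("share of rare word types:", rare_word_share(toy_text))
```

On a real corpus the same count would be run over the tokenized documents of each language, and the words at or below the threshold would be the candidates for rare word translation extraction.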
We propose a novel approach to extract rare word translations from comparable corpora, relying on two main features. The first feature is the context-vector similarity (Fung, 2000; Chiao and Zweigenbaum, 2002; Laroche and Langlais, 2010): each word is characterized by its context in both source and target corpora, and words in translation should have similar contexts in both languages. The second feature follows the assumption that specific
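The first feature, context-vector similarity, can be sketched as follows. This is an illustrative reading of the general technique, not the authors' implementation: the window size, the use of raw co-occurrence counts, the cosine measure, the seed dictionary and all function names are assumptions, since this excerpt does not specify them.

```python
from collections import Counter
from math import sqrt


def context_vector(tokens, word, window=2):
    """Count the words found within `window` positions of each occurrence of `word`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vec[tokens[j]] += 1
    return vec


def translate_vector(vec, seed_dict):
    """Map a source-language vector onto target words; entries missing from the seed dictionary are dropped."""
    out = Counter()
    for w, count in vec.items():
        if w in seed_dict:
            out[seed_dict[w]] += count
    return out


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * v[w] for w, count in u.items() if w in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical toy data: one French and one English sentence, plus a small seed dictionary.
src_tokens = "le chat noir mange la souris".split()
tgt_tokens = "the black cat eats the mouse".split()
seed = {"le": "the", "la": "the", "chat": "cat", "noir": "black", "mange": "eats"}

src_vec = translate_vector(context_vector(src_tokens, "souris"), seed)
sim_mouse = cosine(src_vec, context_vector(tgt_tokens, "mouse"))
sim_black = cosine(src_vec, context_vector(tgt_tokens, "black"))
```

In this toy setting the translated context of "souris" is closer to the context of "mouse" than to that of "black", which is the intuition behind the feature. A rare word has very few occurrences, so its context vector is sparse, which is one reason to combine this score with a second feature.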

