TAILIEUCHUNG - Scientific report: "Enhancing Statistical Machine Translation with Character Alignment"

Ning Xi, Guangchao Tang, Xinyu Dai, Shujian Huang, Jiajun Chen
State Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing 210046, China
xin tanggc dxy huangsj chenjj @

Abstract

The dominant practice in statistical machine translation (SMT) uses the same Chinese word segmentation specification in both the alignment and the translation rule induction steps when building a Chinese-English SMT system. This can be suboptimal, because a word segmentation that is better for alignment is not necessarily better for translation. To tackle this, we propose a framework that uses two different segmentation specifications for alignment and translation respectively: we use the Chinese character as the basic unit for alignment, and then convert this alignment to a conventional word alignment for translation rule induction. Experimentally, our approach outperformed two baselines, a fully word-based system (using words for both alignment and translation) and a fully character-based system, in terms of alignment quality and translation performance.

1 Introduction

Chinese word segmentation is a necessary step in Chinese-English statistical machine translation (SMT) because Chinese sentences do not delimit words by spaces. The key characteristic of a Chinese word segmenter is the segmentation specification. As depicted in Figure 1(a), the dominant practice of SMT uses the same word segmentation for both word alignment and translation rule induction.

For brevity, we will refer to the word segmentation of the bilingual corpus as word segmentation for alignment (WSA for short), because it determines the basic tokens for alignment, and refer to the word segmentation of the aligned corpus as word segmentation for rules (WSR for short), because it determines the basic tokens of translation.

[Figure 1(a): Bilingual Corpus (WSA) -> Word alignment -> Aligned Corpus (WSA) -> Rule induction -> Translation Rules (WSR)]
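The abstract describes aligning at the character level and then converting that alignment to a conventional word alignment. A minimal Python sketch of one way such a conversion could work is given here. The conversion rule is an assumption made for illustration (a Chinese word is linked to an English token whenever any of its characters is linked to it), and the function names `char_to_word_index` and `convert_alignment` are hypothetical, not taken from the paper.

```python
from typing import Iterable, List, Tuple


def char_to_word_index(words: List[str]) -> List[int]:
    """Map every character position in the sentence to the index of its word."""
    mapping: List[int] = []
    for word_idx, word in enumerate(words):
        # Each character of the word points back to the same word index.
        mapping.extend([word_idx] * len(word))
    return mapping


def convert_alignment(
    words: List[str], char_links: Iterable[Tuple[int, int]]
) -> List[Tuple[int, int]]:
    """Convert (char_idx, eng_idx) links into (word_idx, eng_idx) links.

    Assumed rule (illustrative only): a word inherits every link of
    each of its characters; duplicate links are merged.
    """
    mapping = char_to_word_index(words)
    word_links = {(mapping[char_idx], eng_idx) for char_idx, eng_idx in char_links}
    return sorted(word_links)


# Segmented Chinese side (the segmentation used for rule induction).
words = ["南京", "大学"]
# Character-level links: characters 0-1 -> English token 0, characters 2-3 -> token 1.
char_links = [(0, 0), (1, 0), (2, 1), (3, 1)]
word_links = convert_alignment(words, char_links)
print(word_links)
```

In this sketch the segmentation passed to `convert_alignment` plays the role of WSR, while the character positions play the role of WSA, so the two specifications can differ without re-running the aligner.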
