Scientific report: "Bilingually Motivated Domain-Adapted Word Segmentation for Statistical Machine Translation"

Yanjun Ma, Andy Way
National Centre for Language Technology, School of Computing, Dublin City University, Dublin 9, Ireland
yma away @

Abstract

We introduce a word segmentation approach to languages where word boundaries are not orthographically marked, with application to Phrase-Based Statistical Machine Translation (PB-SMT). Instead of using manually segmented monolingual domain-specific corpora to train segmenters, we make use of bilingual corpora and statistical word alignment techniques. First of all, our approach is adapted for the specific translation task at hand by taking the corresponding source (target) language into account. Secondly, this approach does not rely on manually segmented training data, so that it can be automatically adapted for different domains. We evaluate the performance of our segmentation approach on PB-SMT tasks from two domains and demonstrate that our approach scores consistently among the best results across different data conditions.

1 Introduction

State-of-the-art Statistical Machine Translation (SMT) requires a certain amount of bilingual corpora as training data in order to achieve competitive results. The only assumption of most current statistical models (Brown et al., 1993; Vogel et al., 1996; Deng and Byrne, 2005) is that the aligned sentences in such corpora should be segmented into sequences of tokens that are meant to be words. Therefore, for languages where word boundaries are not orthographically marked, tools which segment a sentence into words are required. However, this segmentation is normally performed as a preprocessing step using various word segmenters. Moreover, most of these segmenters are usually trained on a manually segmented domain-specific corpus, which is not adapted for the specific translation task at hand, given that the manual segmentation is performed in a monolingual context. Consequently, such segmenters cannot ...

