Scientific paper: "Sub-sentential Alignment Using Substring Co-Occurrence Counts"

Sub-sentential Alignment Using Substring Co-Occurrence Counts

Fabien Cromieres
GETA-CLIPS-IMAG, BP 53, 38041 Grenoble Cedex 9, France

Abstract

In this paper, we present an efficient method to compute the co-occurrence counts of any pair of substrings in a parallel corpus, and an algorithm that makes use of these counts to create sub-sentential alignments on such a corpus. This algorithm has the advantage of being as general as possible with regard to the segmentation of text.

1 Introduction

An interesting and important problem in the Statistical Machine Translation (SMT) domain is the creation of sub-sentential alignments in a parallel corpus (a bilingual corpus already aligned at the sentence level). These alignments can later be used, for example, to train SMT systems or to extract bilingual lexicons. Many algorithms have already been proposed for sub-sentential alignment. Some of them focus on word-to-word alignment, such as (Brown, 1997) or (Melamed, 1997). Others allow the generation of phrase-level alignments, such as (Och et al., 1999), (Marcu and Wong, 2002) or (Zhang, Vogel and Waibel, 2003). However, with the exception of Marcu and Wong, these phrase-level alignment algorithms still place their analyses at the word level, whether by first creating a word-to-word alignment or by computing correlation coefficients between pairs of individual words. This is, in our opinion, a limitation of these algorithms, mainly because it makes them rely heavily on our capacity to segment a sentence into words. And defining what a word is is not as easy as it might seem.
In particular, in many Asian writing systems (Japanese, Chinese or Thai, for example), there is no special symbol to delimit words, such as the blank in most non-Asian writing systems. Current systems usually work around this problem by using a segmentation tool to pre-process the data. There are, however, two major disadvantages: these tools usually need a lot of linguistic knowledge, such as lexical dictionaries and hand-crafted …
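The paper's efficient counting method is not detailed in this excerpt, but the quantity it computes can be illustrated with a naive brute-force sketch: for every aligned sentence pair, enumerate all character-level substrings on each side and count, for each (source substring, target substring) pair, the number of sentence pairs in which both occur. The `max_len` cap and the toy corpus below are illustrative assumptions, not part of the original method.

```python
from collections import Counter

def substrings(sentence, max_len=4):
    # All contiguous substrings up to max_len characters.
    # Working at the character level avoids committing to any word segmentation.
    return {sentence[i:j]
            for i in range(len(sentence))
            for j in range(i + 1, min(i + max_len, len(sentence)) + 1)}

def cooccurrence_counts(corpus, max_len=4):
    # corpus: list of (source_sentence, target_sentence) aligned pairs.
    # Returns a Counter mapping (source_substring, target_substring) to the
    # number of sentence pairs in which both substrings occur.
    counts = Counter()
    for src, tgt in corpus:
        for s in substrings(src, max_len):
            for t in substrings(tgt, max_len):
                counts[(s, t)] += 1
    return counts

# Toy parallel corpus (hypothetical data for illustration only).
corpus = [("abc", "xyz"), ("abd", "xyw")]
counts = cooccurrence_counts(corpus, max_len=2)
print(counts[("ab", "xy")])  # 2: "ab" and "xy" co-occur in both sentence pairs
```

This brute force is quadratic in sentence length on each side per pair, which is exactly why an efficient counting scheme, as the abstract promises, is needed at corpus scale.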
