Scientific report: "A DOM Tree Alignment Model for Mining Parallel Data from the Web"

Lei Shi¹, Cheng Niu¹, Ming Zhou¹, and Jianfeng Gao²
¹ Microsoft Research Asia, 5F Sigma Center, 49 Zhichun Road, Beijing 10080, P. R. China
² Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA
leishi, chengniu, mingzhou, jfgao @

Abstract

This paper presents a new web mining scheme for parallel data acquisition. Based on the Document Object Model (DOM), a web page is represented as a DOM tree. A DOM tree alignment model is then proposed to identify the translationally equivalent texts and hyperlinks between two parallel DOM trees. By tracing the identified parallel hyperlinks, parallel web documents are recursively mined. Compared with previous mining schemes, the benchmarks show that this new mining scheme improves the mining coverage, reduces mining bandwidth, and enhances the quality of mined parallel sentences.

1 Introduction

Parallel bilingual corpora are critical resources for statistical machine translation (Brown 1993) and cross-lingual information retrieval (Nie 1999). Additionally, parallel corpora have been exploited for various monolingual natural language processing (NLP) tasks, such as word-sense disambiguation (Ng 2003) and paraphrase acquisition (Callison 2005). However, large-scale parallel corpora are not readily available for most language pairs.
Even where resources are available, such as for English-French, the data are usually restricted to government documents (e.g., the Hansard corpus, which consists of French-English translations of debates in the Canadian parliament) or newswire texts. The governmentese that characterizes these document collections cannot be used on its own to train data-driven machine translation systems for a range of domains and language pairs. With a sharply increasing number of bilingual web sites, web mining for parallel data becomes a promising solution to this knowledge acquisition problem. In an effort to estimate the amount of bilingual data on the web, Ma and Liberman
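
The abstract's first two steps (representing a page as a DOM tree, then locating hyperlinks in two parallel trees) can be illustrated with a minimal Python sketch. This is not the paper's alignment model: the names `Node`, `DomBuilder`, `build_dom`, `hyperlinks`, and `pair_links_by_position` are hypothetical, the sample pages and URLs are invented, and pairing links by document order is a crude stand-in assumed here only to show where an alignment model would plug in.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not become "current".
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """A very small DOM node: an element, or a '#text' leaf."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class DomBuilder(HTMLParser):
    """Builds a Node tree from HTML using the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray closes.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            leaf = Node("#text", parent=self.current)
            leaf.text = data.strip()
            self.current.children.append(leaf)


def build_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def inner_text(node):
    """Concatenate the text of all '#text' leaves under a node."""
    if node.tag == "#text":
        return node.text
    parts = (inner_text(child) for child in node.children)
    return " ".join(part for part in parts if part)


def hyperlinks(node):
    """Yield (anchor text, href) pairs in document order."""
    if node.tag == "a" and "href" in node.attrs:
        yield (inner_text(node), node.attrs["href"])
    for child in node.children:
        yield from hyperlinks(child)


def pair_links_by_position(root_a, root_b):
    """Naive placeholder for alignment: pair the i-th link of each tree."""
    return list(zip(hyperlinks(root_a), hyperlinks(root_b)))


# Invented sample pages for illustration only.
en = ('<ul><li><a href="/en/news.html">News</a></li>'
      '<li><a href="/en/about.html">About us</a></li></ul>')
fr = ('<ul><li><a href="/fr/news.html">Actualités</a></li>'
      '<li><a href="/fr/about.html">À propos</a></li></ul>')
pairs = pair_links_by_position(build_dom(en), build_dom(fr))
```

Each pair in `pairs` is a candidate pair of parallel hyperlinks; a recursive miner in the spirit of the abstract would fetch both targets and repeat the process on them. Positional pairing fails as soon as the two pages differ in structure, which is the gap a real tree alignment model is meant to close.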
