Scientific report: "A DOM Tree Alignment Model for Mining Parallel Data from the Web"

Lei Shi¹, Cheng Niu¹, Ming Zhou¹ and Jianfeng Gao²
¹Microsoft Research Asia, 5F Sigma Center, 49 Zhichun Road, Beijing 10080, P. R. China
²Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA
{leishi, chengniu, mingzhou, jfgao}@

Abstract

This paper presents a new web mining scheme for parallel data acquisition. Based on the Document Object Model (DOM), a web page is represented as a DOM tree. A DOM tree alignment model is then proposed to identify the translationally equivalent texts and hyperlinks between two parallel DOM trees. By tracing the identified parallel hyperlinks, parallel web documents are recursively mined. Benchmarks show that, compared with previous mining schemes, the new scheme improves mining coverage, reduces mining bandwidth, and enhances the quality of the mined parallel sentences.

1 Introduction

Parallel bilingual corpora are critical resources for statistical machine translation (Brown 1993) and cross-lingual information retrieval (Nie 1999). Additionally, parallel corpora have been exploited for various monolingual natural language processing (NLP) tasks, such as word sense disambiguation (Ng 2003) and paraphrase acquisition (Callison 2005). However, large-scale parallel corpora are not readily available for most language pairs.
Even where resources are available, such as for English-French, the data are usually restricted to government documents (e.g., the Hansard corpus, which consists of French-English translations of debates in the Canadian parliament) or newswire texts. The governmentese that characterizes these document collections cannot be used on its own to train data-driven machine translation systems for a range of domains and language pairs. With a sharply increasing number of bilingual web sites, web mining for parallel data becomes a promising solution to this knowledge acquisition problem. In an effort to estimate the amount of bilingual data on the web, Ma and Liberman
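
To make the abstract's first step concrete (a web page represented as a DOM tree whose texts and hyperlinks can then be compared), the following is a minimal sketch using only Python's standard `html.parser` module. It is not the paper's implementation and does not perform any alignment; the names `Node`, `DomBuilder`, `build_dom`, `node_text` and `hyperlinks` are hypothetical and introduced here for illustration only.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One tree node: an element (tag is set) or a text chunk (tag is None)."""

    def __init__(self, tag=None, attrs=None, text=""):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.text = text
        self.children = []


class DomBuilder(HTMLParser):
    """Builds a simplified DOM tree (hypothetical helper, not a full DOM)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node(text=text))


def build_dom(html):
    """Parse an HTML string and return the root of the simplified tree."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def node_text(node):
    """Concatenate all text chunks below a node, in document order."""
    if node.tag is None:
        return node.text
    parts = (node_text(child) for child in node.children)
    return " ".join(part for part in parts if part)


def hyperlinks(node):
    """Yield (href, anchor text) pairs in document order."""
    if node.tag == "a" and "href" in node.attrs:
        yield node.attrs["href"], node_text(node)
    for child in node.children:
        yield from hyperlinks(child)
```

Calling `build_dom` on each page of a candidate bilingual pair and then iterating over `hyperlinks(root)` would give the text chunks and links that an alignment model could compare. How the paper actually scores and aligns the two trees is not covered by this excerpt, so that step is deliberately left out.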
