Scientific report: "Language ID in the Context of Harvesting Language Data off the Web"

Fei Xia, University of Washington, Seattle, WA 98195, USA, fxia@
William D. Lewis, Microsoft Research, Redmond, WA 98052, USA, wilewis@
Hoifung Poon, University of Washington, Seattle, WA 98195, USA, hoifung@

Abstract

As the arm of NLP technologies extends beyond a small core of languages, techniques for working with instances of language data across hundreds to thousands of languages may require revisiting and recalibrating the tried and true methods that are used. One of the NLP techniques that has been treated as “solved” is language identification (language ID) of written text. However, we argue that language ID is far from solved when one considers input spanning not dozens of languages, but rather hundreds to thousands, a number that one approaches when harvesting language data found on the Web. We formulate language ID as a coreference resolution problem, apply it to a Web harvesting task for a specific linguistic data type, and achieve a much higher accuracy than long-accepted language ID approaches.

1 Introduction

A large number of the world's languages have been documented by linguists; it is now increasingly common to post current research and data to the Web, often in the form of language snippets embedded in scholarly papers.
A particularly common format for linguistic data posted to the Web is interlinearized text, a format used to present language data and analysis relevant to a particular argument or investigation. Since interlinear examples consist of orthographically or phonetically encoded language data aligned with an English translation, the corpus of interlinear examples found on the Web, when taken together, constitutes a significant multilingual parallel corpus covering hundreds to thousands of the world's languages. Previous work has discussed methods for harvesting interlinear text off the Web (Lewis, 2006), enriching it via structural projections Xia and Lewis .
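
To make the idea of interlinear examples forming a parallel corpus concrete, the following is a minimal sketch in Python. It is an illustration only, not the authors' data model: the class `InterlinearExample`, its field names, the function `build_parallel_corpus`, and the German and French sample sentences are all assumptions introduced here. The gloss line is also an assumption, based on the common three-line layout of interlinear text; the passage above mentions only the language data and its English translation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class InterlinearExample:
    """One interlinear instance (hypothetical structure for illustration)."""
    language: str     # language name; in harvested data this must be identified
    source_line: str  # orthographically or phonetically encoded language data
    gloss_line: str   # word-by-word gloss (assumed, not stated in the text)
    translation: str  # free English translation


def build_parallel_corpus(examples):
    """Group (source, English translation) pairs by language."""
    corpus = defaultdict(list)
    for ex in examples:
        corpus[ex.language].append((ex.source_line, ex.translation))
    return dict(corpus)


# Invented sample data, used only to exercise the sketch.
examples = [
    InterlinearExample("German", "Der Hund schläft", "the dog sleep.3SG", "The dog sleeps."),
    InterlinearExample("German", "Ich sehe den Hund", "I see.1SG the.ACC dog", "I see the dog."),
    InterlinearExample("French", "Le chien dort", "the dog sleep.3SG", "The dog sleeps."),
]
corpus = build_parallel_corpus(examples)
```

Grouping by language is what makes the language label critical in this sketch: if an example is assigned the wrong language, its sentence pair lands in the wrong part of the parallel corpus.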
