Language ID in the Context of Harvesting Language Data off the Web

Fei Xia, University of Washington, Seattle, WA 98195, USA, fxia@u.Washington.edu
William D. Lewis, Microsoft Research, Redmond, WA 98052, USA, wilewis@microsoft.com
Hoifung Poon, University of Washington, Seattle, WA 98195, USA, hoifung@cs.Washington.edu

Abstract

As the arm of NLP technologies extends beyond a small core of languages, techniques for working with instances of language data across hundreds to thousands of languages may require revisiting and recalibrating the tried and true methods that are used. One of the NLP techniques that has been treated as “solved” is language identification (language ID) of written text. However, we argue that language ID is far from solved when one considers input spanning not dozens of languages, but rather hundreds to thousands, a number that one approaches when harvesting language data found on the Web. We formulate language ID as a coreference resolution problem, apply it to a Web harvesting task for a specific linguistic data type, and achieve a much higher accuracy than long-accepted language ID approaches.

1 Introduction

A large number of the world's languages have been documented by linguists; it is now increasingly common to post current research and data to the Web, often in the form of language snippets embedded in scholarly papers.
A particularly common format for linguistic data posted to the Web is interlinearized text, a format used to present language data and analysis relevant to a particular argument or investigation. Since interlinear examples consist of orthographically or phonetically encoded language data aligned with an English translation, the corpus of interlinear examples found on the Web, when taken together, constitutes a significant multilingual parallel corpus covering hundreds to thousands of the world's languages. Previous work has discussed methods for harvesting interlinear text off the Web (Lewis, 2006), enriching it via structural projections Xia and Lewis .