langid.py: An Off-the-shelf Language Identification Tool

Marco Lui and Timothy Baldwin
NICTA VRL
Department of Computing and Information Systems
University of Melbourne, VIC 3010, Australia
mhlui@unimelb.edu.au, tb@ldwin.net

Abstract

We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py, and provide an empirical comparison on 5 long-document datasets and 2 datasets from the microblog domain. We find that langid.py maintains consistently high accuracy across all domains, making it ideal for end-users that require language identification without wanting to invest in preparation of in-domain training data.

1 Introduction

Language identification (LangID) is the task of determining the natural language that a document is written in. It is a key step in automatic processing of real-world data, where a multitude of languages may be present. Natural language processing techniques typically pre-suppose that all documents being processed are written in a given language (e.g. English), but as focus shifts onto processing documents from internet sources such as microblogging services, this becomes increasingly difficult to guarantee.

Language identification is also a key component of many web services. For example, the language that a web page is written in is an important consideration in determining whether it is likely to be of interest to a particular user of a search engine, and automatic identification is an essential step in building language corpora from the web.
It has practical implications for social networking and social media, where it may be desirable to organize comments and other user-generated content by language. It also has implications for accessibility, since it enables automatic determination of the target language for automatic machine translation purposes.

Many applications could potentially benefit from automatic language identification, but building a customized solution per-application is .