Scientific report: "An Off-the-shelf Language Identification Tool"

An Off-the-shelf Language Identification Tool

Marco Lui and Timothy Baldwin
NICTA VRL, Department of Computing and Information Systems, University of Melbourne, VIC 3010, Australia
mhlui@ tb@

Abstract

We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py, and provide an empirical comparison on 5 long-document datasets and 2 datasets from the microblog domain. We find that langid.py maintains consistently high accuracy across all domains, making it ideal for end-users that require language identification without wanting to invest in preparation of in-domain training data.

1 Introduction

Language identification (LangID) is the task of determining the natural language that a document is written in. It is a key step in automatic processing of real-world data, where a multitude of languages may be present. Natural language processing techniques typically pre-suppose that all documents being processed are written in a given language (e.g. English), but as focus shifts onto processing documents from internet sources such as microblogging services, this becomes increasingly difficult to guarantee.

Language identification is also a key component of many web services. For example, the language that a web page is written in is an important consideration in determining whether it is likely to be of interest to a particular user of a search engine, and automatic identification is an essential step in building language corpora from the web. It has practical implications for social networking and social media, where it may be desirable to organize comments and other user-generated content by language. It also has implications for accessibility, since it enables automatic determination of the target language for automatic machine translation purposes.

Many applications could potentially benefit from automatic language identification, but building a customized solution per-application is .
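To make the task definition and the "organize comments by language" use case concrete, the following is a minimal sketch of a language identifier based on character-trigram overlap. It is a toy written for illustration only and is not the method or API of langid.py: the training sentences in `SAMPLES`, the scoring rule, and the function names `trigrams`, `identify`, and `group_by_language` are all assumptions introduced here, and a practical tool would be trained on far more data.

```python
from collections import Counter, defaultdict

# Hypothetical one-sentence training samples per language (illustration only).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is a sentence written in the english language",
    "de": "der schnelle braune fuchs springt über den faulen hund und dies ist ein satz in der deutschen sprache",
    "fr": "le renard brun rapide saute par-dessus le chien paresseux et ceci est une phrase écrite en langue française",
}


def trigrams(text):
    """Count overlapping character trigrams of the lowercased, space-padded text."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


# One trigram profile per language, built once from the samples above.
PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def identify(text):
    """Return the language code whose profile shares the most trigrams with text."""
    grams = trigrams(text)

    def overlap(profile):
        # Counter returns 0 for trigrams the profile has never seen.
        return sum(min(count, profile[g]) for g, count in grams.items())

    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))


def group_by_language(comments):
    """Bucket user-generated comments by their identified language."""
    groups = defaultdict(list)
    for comment in comments:
        groups[identify(comment)].append(comment)
    return dict(groups)
```

Calling `group_by_language` on a mixed list of comments returns a dictionary keyed by language code. With training data this small the sketch is only reliable on text that resembles the samples, which is the kind of in-domain preparation cost an off-the-shelf tool is meant to remove.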
