Scientific report: "Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments"

Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
{kgimpel, nschneid, brenocon, dipanjan, dpmills, jacobeis, mheilman, dyogatama, jflanigan, nasmith}@

Abstract

We address the problem of part-of-speech tagging for English data from the popular microblogging service Twitter. We develop a tagset, annotate data, develop features, and report tagging results nearing 90% accuracy. The data and tools have been made available to the research community with the goal of enabling richer text analysis of Twitter and related social media data sets.

1 Introduction

The growing popularity of social media and user-created web content is producing enormous quantities of text in electronic form. The popular microblogging service Twitter is one particularly fruitful source of user-created content, and a flurry of recent research has aimed to understand and exploit these data (Ritter et al., 2010; Sharifi et al., 2010; Barbosa and Feng, 2010; Asur and Huberman, 2010; O'Connor et al., 2010a; Thelwall et al., 2011). However, the bulk of this work eschews the standard pipeline of tools which might enable a richer linguistic analysis; such tools are typically trained on newstext and have been shown to perform poorly on Twitter (Finin et al., 2010).
One of the most fundamental parts of the linguistic pipeline is part-of-speech (POS) tagging, a basic form of syntactic analysis which has countless applications in NLP. Most POS taggers are trained from treebanks in the newswire domain, such as the Wall Street Journal corpus of the Penn Treebank (PTB; Marcus et al., 1993). Tagging performance degrades on out-of-domain data, and Twitter poses additional challenges due to the conversational nature of the text, the lack of conventional orthography, and 140-character ...
