Scientific report: "Baselines and Bigrams: Simple, Good Sentiment and Topic Classification"

Sida Wang and Christopher D. Manning
Department of Computer Science, Stanford University, Stanford, CA 94305
sidaw manning @

Abstract

Variants of Naive Bayes (NB) and Support Vector Machines (SVM) are often used as baseline methods for text classification, but their performance varies greatly depending on the model variant, features used and task/dataset. We show that: (i) the inclusion of word bigram features gives consistent gains on sentiment analysis tasks; (ii) for short snippet sentiment tasks, NB actually does better than SVMs (while for longer documents the opposite result holds); (iii) a simple but novel SVM variant using NB log-count ratios as feature values consistently performs well across tasks and datasets. Based on these observations, we identify simple NB and SVM variants which outperform most published results on sentiment analysis datasets, sometimes providing a new state-of-the-art performance level.

1 Introduction

Naive Bayes (NB) and Support Vector Machine (SVM) models are often used as baselines for other methods in text categorization and sentiment analysis research. However, their performance varies significantly depending on which variant, features and datasets are used. We show that researchers have not paid sufficient attention to these model selection issues.
Indeed, we show that the better variants often outperform recently published state-of-the-art methods on many datasets. We attempt to categorize which method, which variants and which features perform better under which circumstances. First, we make an important distinction between sentiment classification and topical text classification. We show that the usefulness of bigram features in bag-of-features sentiment classification has been underappreciated, perhaps because their usefulness is more of a mixed bag for topical text classification tasks. We then distinguish between short snippet sentiment tasks and longer reviews, showing
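
To make two ideas from the abstract concrete (adding word bigram features, and using NB log-count ratios as feature values), here is a minimal Python sketch. This excerpt does not give the formula, so the sketch assumes the common smoothed definition r = log((p/||p||_1) / (q/||q||_1)), where p and q are smoothed feature counts over the positive and negative training documents. The function names, the whitespace tokenizer, the binarized (presence-only) features and the smoothing parameter alpha are all illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def ngram_features(text, use_bigrams=True):
    """Return the set of unigrams (and optionally word bigrams) in a text.

    Assumption: lowercased whitespace tokenization and binarized features
    (a feature is either present or absent in a document).
    """
    tokens = text.lower().split()
    feats = set(tokens)
    if use_bigrams:
        feats.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return feats


def log_count_ratios(docs, labels, alpha=1.0, use_bigrams=True):
    """Compute a smoothed log-count ratio for every feature.

    labels are assumed to be +1 (positive) or -1 (negative).
    alpha is an assumed additive smoothing parameter.
    """
    pos, neg = Counter(), Counter()
    for text, y in zip(docs, labels):
        target = pos if y == 1 else neg
        target.update(ngram_features(text, use_bigrams))
    vocab = set(pos) | set(neg)
    p = {f: alpha + pos[f] for f in vocab}  # smoothed positive counts
    q = {f: alpha + neg[f] for f in vocab}  # smoothed negative counts
    p_norm, q_norm = sum(p.values()), sum(q.values())
    return {f: math.log((p[f] / p_norm) / (q[f] / q_norm)) for f in vocab}


def nb_feature_values(text, ratios, use_bigrams=True):
    """Map a document to {feature: log-count ratio} for the features it contains.

    These values could then be passed to a linear classifier such as an SVM
    in place of raw 0/1 indicators.
    """
    feats = ngram_features(text, use_bigrams)
    return {f: ratios[f] for f in feats if f in ratios}


def nb_score(text, ratios, bias=0.0, use_bigrams=True):
    """Naive Bayes style linear score: sum of the ratios of present features."""
    return sum(nb_feature_values(text, ratios, use_bigrams).values()) + bias
```

With bigrams enabled, a phrase such as "not good" receives its own ratio, separate from the ratios of "not" and "good". That is one plausible reading of why bigram features help on sentiment tasks. Training an SVM on the output of nb_feature_values is left out here; any linear SVM trainer could be substituted.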
