Scientific report: "A New Feature Selection Score for Multinomial Naive Bayes Text Classification Based on KL-Divergence"

Karl-Michael Schneider
Department of General Linguistics, University of Passau, 94032 Passau, Germany
schneide@

Abstract

We define a new feature selection score for text classification based on the KL-divergence between the distribution of words in training documents and their classes. The score favors words that have a similar distribution in documents of the same class but different distributions in documents of different classes. Experiments on two standard data sets indicate that the new method outperforms mutual information, especially for smaller categories.

1 Introduction

Text classification is the assignment of predefined categories to text documents. It has many applications in natural language processing tasks such as e-mail filtering, prediction of user preferences, and organization of web content. The Naive Bayes classifier is a popular machine learning technique for text classification because it performs well in many domains despite its simplicity (Domingos and Pazzani, 1997). Naive Bayes assumes a stochastic model of document generation. Using Bayes' rule, the model is inverted in order to predict the most likely class for a new document. We assume that documents are generated according to a multinomial event model (McCallum and Nigam, 1998). Thus a document is represented as a vector d_i = (x_{i1}, ..., x_{i|V|}) of word counts, where V is the vocabulary and each x_{it} \in \{0, 1, 2, ...\} indicates how often w_t occurs in d_i.
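The count-vector representation described above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's code; the vocabulary and document below are invented toy data.

```python
from collections import Counter

def count_vector(tokens, vocabulary):
    """Represent a tokenized document as a vector of word counts
    x_{it} over a fixed vocabulary V (multinomial event model)."""
    counts = Counter(tokens)
    # Counter returns 0 for words absent from the document.
    return [counts[w] for w in vocabulary]

# Hypothetical toy example
vocab = ["the", "ball", "game", "market"]
doc = ["the", "ball", "the", "game"]
print(count_vector(doc, vocab))  # [2, 1, 1, 0]
```

Note that the vector length is fixed by |V|, not by the document length, so all documents map into the same feature space.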
Given model parameters p(w_t | c_j) and class prior probabilities p(c_j), and assuming independence of the words, the most likely class for a document d_i is computed as

c(d_i) = \arg\max_j p(c_j) \, p(d_i | c_j)
       = \arg\max_j p(c_j) \prod_{t=1}^{|V|} p(w_t | c_j)^{n(w_t, d_i)}    (1)

where n(w_t, d_i) is the number of occurrences of w_t in d_i. p(w_t | c_j) and p(c_j) are estimated from training documents with known classes, using maximum likelihood estimation with a Laplacean prior:

p(w_t | c_j) = \frac{1 + \sum_{d_i \in c_j} n(w_t, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{d_i \in c_j} n(w_s, d_i)}
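The classification rule (1) and the Laplace-smoothed estimates can be sketched as follows. This is a minimal illustration under invented assumptions (the class names and toy corpus are hypothetical, not from the paper); the product in (1) is computed in log space, as is standard practice, to avoid numeric underflow.

```python
import math
from collections import Counter

def train(docs_by_class, vocabulary):
    """Estimate p(c_j) and Laplace-smoothed p(w_t | c_j) from
    tokenized training documents grouped by class."""
    n_docs = sum(len(docs) for docs in docs_by_class.values())
    priors, cond = {}, {}
    for c, docs in docs_by_class.items():
        priors[c] = len(docs) / n_docs
        counts = Counter(tok for doc in docs for tok in doc)
        total = sum(counts[w] for w in vocabulary)
        # Laplacean prior: add 1 to each count and |V| to the denominator.
        cond[c] = {w: (1 + counts[w]) / (len(vocabulary) + total)
                   for w in vocabulary}
    return priors, cond

def classify(doc, priors, cond):
    """Return argmax_j of log p(c_j) + sum_t n(w_t, d_i) log p(w_t | c_j)."""
    def log_score(c):
        s = math.log(priors[c])
        for w, n in Counter(doc).items():
            if w in cond[c]:  # out-of-vocabulary words are ignored
                s += n * math.log(cond[c][w])
        return s
    return max(priors, key=log_score)

# Hypothetical toy corpus for illustration
docs_by_class = {
    "sports":  [["ball", "game"], ["game", "win"]],
    "finance": [["stock", "market"], ["market", "win"]],
}
vocab = ["ball", "game", "win", "stock", "market"]
priors, cond = train(docs_by_class, vocab)
print(classify(["ball", "game", "game"], priors, cond))  # sports
```

Taking logarithms does not change the argmax, since log is monotonic; it merely turns the product over |V| word probabilities into a sum.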
