A Comparison of Event Models for Naive Bayes Text Classification

Andrew McCallum (mccallum@justresearch.com), Just Research, 4616 Henry Street, Pittsburgh, PA 15213
Kamal Nigam (knigam@cs.cmu.edu), School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213

Abstract

Recent approaches to text classification have used two different first-order probabilistic models for classification, both of which make the naive Bayes assumption. Some use a multi-variate Bernoulli model, that is, a Bayesian network with no dependencies between words and binary word features (e.g., Larkey and Croft 1996; Koller and Sahami 1997). Others use a multinomial model, that is, a unigram language model with integer word counts (e.g., Lewis and Gale 1994; Mitchell 1997). This paper aims to clarify the confusion by describing the differences and details of these two models, and by empirically comparing their classification performance on five text corpora. We find that the multi-variate Bernoulli model performs well with small vocabulary sizes, but that the multinomial model usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.
Introduction

Simple Bayesian classifiers have been gaining popularity lately and have been found to perform surprisingly well (Friedman 1997; Friedman et al. 1997; Sahami 1996; Langley et al. 1992). These probabilistic approaches make strong assumptions about how the data is generated, and posit a probabilistic model that embodies these assumptions; they then use a collection of labeled training examples to estimate the parameters of the generative model. Classification of new examples is performed with Bayes' rule by selecting the class that is most likely to have generated the example. The naive Bayes classifier is the simplest of these models, in that it assumes that all attributes of the examples are independent of each other given the context of the class. This is the so-called "naive Bayes assumption."
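The paper does not include code; as a minimal illustrative sketch of the two event models it compares (toy data, Laplace smoothing, and all function and variable names are my own, not the authors'), maximum-likelihood training and Bayes-rule classification might look like:

```python
import math
from collections import Counter

def train_multinomial(docs_by_class, vocab):
    """Multinomial event model: per-class word probabilities from word counts,
    with Laplace (add-one) smoothing over the vocabulary."""
    params = {}
    for c, docs in docs_by_class.items():
        counts = Counter()
        for d in docs:
            counts.update(d)
        total = sum(counts.values())
        params[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return params

def train_bernoulli(docs_by_class, vocab):
    """Multi-variate Bernoulli event model: per-class probability that each
    vocabulary word appears in a document, with add-one smoothing."""
    params = {}
    for c, docs in docs_by_class.items():
        doc_freq = Counter()
        for d in docs:
            doc_freq.update(set(d))  # presence/absence only, not counts
        n = len(docs)
        params[c] = {w: (doc_freq[w] + 1) / (n + 2) for w in vocab}
    return params

def classify(doc, params, priors, event_model, vocab):
    """Bayes' rule: pick the class maximizing log P(c) + log P(doc | c)."""
    best, best_lp = None, -math.inf
    for c, p in params.items():
        lp = math.log(priors[c])
        if event_model == "multinomial":
            # one factor per word occurrence; floor for out-of-vocabulary words
            for w in doc:
                lp += math.log(p.get(w, 1e-12))
        else:
            # one factor per vocabulary word, whether present or absent
            present = set(doc)
            for w in vocab:
                lp += math.log(p[w] if w in present else 1.0 - p[w])
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Toy example (invented data, two classes):
docs_by_class = {
    "sport": [["ball", "game", "ball"], ["game", "win"]],
    "tech": [["code", "bug"], ["code", "code", "ship"]],
}
vocab = {"ball", "game", "win", "code", "bug", "ship"}
priors = {"sport": 0.5, "tech": 0.5}
mn = train_multinomial(docs_by_class, vocab)
bn = train_bernoulli(docs_by_class, vocab)
```

The key contrast the paper draws is visible in `classify`: the multinomial model multiplies one probability per word *occurrence* (so repeated words count repeatedly), while the Bernoulli model multiplies one probability per *vocabulary word*, explicitly including the probability of each absent word.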
