Scientific report: "Document Classification Using a Finite Mixture Model"

Hang Li, Kenji Yamanishi
C&C Res. Labs., NEC
4-1-1 Miyazaki, Miyamae-ku, Kawasaki 216, Japan
Email: lihang, yamanisi @

Abstract

We propose a new method of classifying documents into categories. We define for each category a finite mixture model based on soft clustering of words. We treat the problem of classifying documents as that of conducting statistical hypothesis testing over finite mixture models, and employ the EM algorithm to efficiently estimate parameters in a finite mixture model. Experimental results indicate that our method outperforms existing methods.

1 Introduction

We are concerned here with the issue of classifying documents into categories. More precisely, we begin with a number of categories (e.g., tennis, soccer, skiing), each already containing certain documents. Our goal is to determine into which categories newly given documents ought to be assigned, and to do so on the basis of the distribution of each document's words.

Footnote: A related issue is the retrieval from a database of documents which are relevant to a given query (pseudodocument) (e.g., Deerwester et al. 1990; Fuhr 1989; Robertson and Jones 1976; Salton and McGill 1983; Wong and Yao 1989).

Many methods have been proposed to address this issue, and a number of them have proved to be quite effective (e.g., Apte, Damerau, and Weiss 1994; Cohen and Singer 1996; Lewis 1992; Lewis and Ringuette 1994; Lewis et al. 1996; Schutze, Hull, and Pedersen 1995; Yang and Chute 1994). The simple method of conducting hypothesis testing over word-based distributions in categories (defined in Section 2) is not efficient in storage and suffers from the data sparseness problem, i.e., the number of parameters in the distributions is large and the data size is not sufficiently large for accurately estimating them. In order to address this difficulty, Guthrie, Walker, and Guthrie (1994) have proposed using distributions based on what we refer to as hard clustering of words, i.e., in which a
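
To make the idea in the abstract concrete, here is a minimal Python sketch of a finite mixture model over word clusters, with EM used to estimate the mixture weights for each category and classification done by comparing log-likelihoods. It is an illustration under stated assumptions, not the authors' implementation: the cluster word distributions are assumed to be given and fixed, only the per-category mixture weights are estimated, and all names (em_mixture_weights, log_likelihood, classify, the toy clusters and training words) are hypothetical. The word "ball" belongs to both clusters, which is what makes the clustering soft.

```python
import math


def em_mixture_weights(words, cluster_word_probs, n_iter=50):
    """Estimate weights theta_k in P(w) = sum_k theta_k * P(w | k) by EM.

    Assumption: the cluster distributions P(w | k) are fixed; only the
    mixture weights are re-estimated.
    """
    k = len(cluster_word_probs)
    theta = [1.0 / k] * k  # start from uniform weights
    for _ in range(n_iter):
        counts = [0.0] * k
        for w in words:
            # E-step: responsibility of each cluster for this word
            joint = [theta[j] * cluster_word_probs[j].get(w, 0.0) for j in range(k)]
            total = sum(joint)
            if total == 0.0:
                continue  # word has zero probability under every cluster
            for j in range(k):
                counts[j] += joint[j] / total
        norm = sum(counts)
        if norm == 0.0:
            break  # no usable words; keep the current weights
        # M-step: new weights are the normalized expected counts
        theta = [c / norm for c in counts]
    return theta


def log_likelihood(words, theta, cluster_word_probs, floor=1e-12):
    """Log-likelihood of a word sequence under one category's mixture."""
    ll = 0.0
    for w in words:
        p = sum(t * probs.get(w, 0.0) for t, probs in zip(theta, cluster_word_probs))
        ll += math.log(max(p, floor))  # floor avoids log(0) for unseen words
    return ll


def classify(words, category_weights, cluster_word_probs):
    """Pick the category whose mixture gives the highest log-likelihood."""
    return max(
        category_weights,
        key=lambda c: log_likelihood(words, category_weights[c], cluster_word_probs),
    )


# Toy data (hypothetical): two soft clusters that share the word "ball".
clusters = [
    {"racket": 0.5, "serve": 0.3, "ball": 0.2},
    {"goal": 0.5, "kick": 0.3, "ball": 0.2},
]
training = {
    "tennis": ["racket", "serve", "ball", "racket", "ball"],
    "soccer": ["goal", "kick", "ball", "goal"],
}
weights = {c: em_mixture_weights(ws, clusters) for c, ws in training.items()}

print(classify(["serve", "ball"], weights, clusters))
```

Because each category stores only one weight per cluster rather than one parameter per word, a model of this shape needs far fewer parameters than a word-based distribution, which is the storage and data sparseness concern raised in the introduction.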
