Scientific report: "Improving Probabilistic Latent Semantic Analysis with Principal Component Analysis"

Ayman Farahat
Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA 94304

Francine Chen
Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA 94304
chen@

Abstract

Probabilistic Latent Semantic Analysis (PLSA) models have been shown to provide a better model for capturing polysemy and synonymy than Latent Semantic Analysis (LSA). However, the parameters of a PLSA model are trained using the Expectation Maximization (EM) algorithm, and as a result the trained model depends on the initialization values, so performance can be highly variable. In this paper we present a method for using LSA to initialize a PLSA model. We also investigate the performance of our method for the tasks of text segmentation and retrieval on personal-size corpora, and present results demonstrating the efficacy of our proposed approach.

1 Introduction

In modeling a collection of documents for information access applications, the documents are often represented as a "bag of words", that is, as term vectors composed of the terms and corresponding counts for each document. The term vectors for a document collection can be organized into a term-by-document co-occurrence matrix.
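The bag-of-words representation described above can be sketched in a few lines. This is only an illustration under simple assumptions that are not from the paper: whitespace tokenization, lowercasing, and three made-up example documents. The helper name `term_document_matrix` is hypothetical.

```python
from collections import Counter


def term_document_matrix(documents):
    """Return (vocabulary, matrix), where matrix[i][j] is the count of
    vocabulary[i] in documents[j].

    Assumption: terms are lowercased, whitespace-separated tokens.
    """
    # One bag of words (term -> count) per document.
    term_counts = [Counter(doc.lower().split()) for doc in documents]
    # Rows are ordered by the sorted vocabulary of the whole collection.
    vocabulary = sorted(set().union(*term_counts))
    # A Counter returns 0 for terms that a document does not contain.
    matrix = [[counts[term] for counts in term_counts] for term in vocabulary]
    return vocabulary, matrix


# Made-up toy collection, for illustration only.
docs = [
    "the bank raised interest rates",
    "the river bank flooded",
    "interest in the river grew",
]
vocabulary, matrix = term_document_matrix(docs)
for term, row in zip(vocabulary, matrix):
    print(f"{term:10s} {row}")
```

In this toy collection the term "bank" occupies a single row even though it is used in two different senses, which is the kind of ambiguity a raw count matrix cannot express.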
When directly using these representations, synonyms and polysemous terms (that is, terms with multiple senses or meanings) are not handled well. Methods for smoothing the term distributions through the use of latent classes have been shown to improve the performance of a number of information access tasks, including retrieval over smaller collections (Deerwester et al., 1990), text segmentation (Brants et al., 2002), and text classification (Wu and Gunopulos, 2002). The Probabilistic Latent Semantic Analysis model (PLSA) (Hofmann, 1999) provides a probabilistic framework that attempts to capture polysemy and synonymy in text for applications such as retrieval and segmentation. It uses a mixture decomposition to ...
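The abstract's point that an EM-trained PLSA model depends on its initialization can be illustrated with a minimal sketch. It assumes the commonly used symmetric aspect-model form P(d, w) = sum over z of P(z) P(d|z) P(w|z), which this excerpt does not spell out. It uses random initialization rather than the LSA-based initialization the paper proposes. The function name `plsa_em`, the toy counts, and the seeds are all hypothetical.

```python
import math
import random


def normalize(values):
    """Scale non-negative values to sum to 1 (uniform if they sum to 0)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def plsa_em(counts, num_topics, seed, iterations=30):
    """Fit a toy PLSA model with EM; counts[d][w] is the count n(d, w).

    Returns a list whose i-th entry is the log-likelihood of the
    parameters held at the start of iteration i.
    """
    rng = random.Random(seed)  # the seed fixes the random starting point
    num_docs, num_words = len(counts), len(counts[0])
    p_z = normalize([rng.random() + 0.1 for _ in range(num_topics)])
    p_d_z = [normalize([rng.random() + 0.1 for _ in range(num_docs)])
             for _ in range(num_topics)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(num_words)])
             for _ in range(num_topics)]
    history = []
    for _ in range(iterations):
        new_z = [0.0] * num_topics
        new_d_z = [[0.0] * num_docs for _ in range(num_topics)]
        new_w_z = [[0.0] * num_words for _ in range(num_topics)]
        log_lik = 0.0
        for d in range(num_docs):
            for w in range(num_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z | d, w) is proportional to P(z) P(d|z) P(w|z).
                joint = [p_z[z] * p_d_z[z][d] * p_w_z[z][w]
                         for z in range(num_topics)]
                total = sum(joint)
                log_lik += n * math.log(total)
                # Accumulate expected counts for the M-step.
                for z in range(num_topics):
                    expected = n * joint[z] / total
                    new_z[z] += expected
                    new_d_z[z][d] += expected
                    new_w_z[z][w] += expected
        history.append(log_lik)
        # M-step: renormalize the expected counts into distributions.
        p_z = normalize(new_z)
        p_d_z = [normalize(row) for row in new_d_z]
        p_w_z = [normalize(row) for row in new_w_z]
    return history


# Made-up document-by-term counts (4 documents, 6 terms).
counts = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 0, 0, 0, 1],
    [0, 0, 0, 3, 2, 2],
    [0, 1, 0, 2, 3, 1],
]
for seed in (0, 1, 2):
    history = plsa_em(counts, num_topics=2, seed=seed)
    print(f"seed={seed} final log-likelihood={history[-1]:.4f}")
```

Each run's log-likelihood should not decrease from one iteration to the next, but EM only finds a local optimum, so runs started from different seeds are not guaranteed to end at the same solution.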
