Scientific report: "Searching for Topics in a Large Collection of Texts"

Martin Holub, Jiri Semecky, Jiri Divis
Center for Computational Linguistics, Charles University, Prague
holub semecky @

Abstract

We describe an original method that automatically finds specific topics in a large collection of texts. Each topic is first identified as a specific cluster of texts and then represented as a virtual concept, which is a weighted mixture of words. Our intention is to employ these virtual concepts in document indexing. In this paper we show some preliminary experimental results and discuss directions of future work.

1 Introduction

In the field of information retrieval (for a detailed survey see e.g. Baeza-Yates and Ribeiro-Neto, 1999), document indexing and representing documents as vectors are among the most successful techniques. Within the framework of the well-known vector model, the indexed elements are usually individual words, which leads to high-dimensional vectors. However, several approaches try to reduce the high dimensionality of the vectors in order to improve the effectiveness of retrieval. The most famous is probably the method called Latent Semantic Indexing (LSI), introduced by Deerwester et al. (1990), which employs a specific linear transformation of the original word-based vectors using a system of "latent semantic concepts". Two other approaches that inspired us, namely Dhillon and Modha (2001) and Torkkola (2002), are similar to LSI but differ in how they project the document vectors into a space of lower dimension.
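To make the word-based vector model concrete, here is a minimal sketch in Python. It assumes whitespace tokenization and raw term counts as weights, and the toy documents and function names are hypothetical, not taken from the paper. It shows why the vectors are high dimensional: each distinct word in the collection gets its own coordinate.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Map every distinct word in the collection to a vector index."""
    words = sorted({w for doc in docs for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}


def to_vector(doc, vocab):
    """Represent a document as raw term counts over the vocabulary."""
    counts = Counter(doc.lower().split())
    vec = [0.0] * len(vocab)
    for word, n in counts.items():
        if word in vocab:
            vec[vocab[word]] = float(n)
    return vec


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy collection; a real one would have a far larger vocabulary.
docs = [
    "topics in large text collections",
    "searching large text collections",
    "linear algebra of matrices",
]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]

print(len(vocab))                       # dimensionality equals vocabulary size
print(cosine(vectors[0], vectors[1]))   # documents sharing words
print(cosine(vectors[0], vectors[2]))   # documents sharing no words
```

Even in this tiny collection the dimensionality equals the number of distinct words, which is the growth that the dimensionality reduction methods mentioned above try to counter.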
Our idea is to establish a system of "virtual concepts", which are linear functions represented by vectors extracted from automatically discovered "concept-formative clusters" of documents. In short, concept-formative clusters are semantically coherent and specific sets of documents that represent specific topics. This idea was originally proposed by Holub (2003), who hypothesizes that concept-oriented vector ...
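The idea of a virtual concept as a linear function over word-based vectors can be sketched as follows. The text above does not say how the concept vector is extracted from a cluster, so this sketch assumes, purely for illustration, that it is the length-normalized centroid of the cluster's document vectors. The function names and toy vectors are hypothetical, and this is not necessarily the authors' procedure.

```python
import math


def centroid(vectors):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def virtual_concept(cluster_vectors):
    """ASSUMPTION: word weights of the concept = normalized cluster centroid."""
    return normalize(centroid(cluster_vectors))


def index_document(doc_vector, concepts):
    """Apply each concept as a linear function (dot product) to the document."""
    return [sum(w * x for w, x in zip(c, doc_vector)) for c in concepts]


# Toy word-based vectors over a hypothetical 4-word vocabulary.
cluster_a = [[2.0, 1.0, 0.0, 0.0], [1.0, 2.0, 0.0, 0.0]]
cluster_b = [[0.0, 0.0, 3.0, 1.0], [0.0, 0.0, 1.0, 3.0]]
concepts = [virtual_concept(cluster_a), virtual_concept(cluster_b)]

doc = [1.0, 1.0, 0.0, 0.0]
coords = index_document(doc, concepts)
print(coords)  # one coordinate per virtual concept instead of one per word
```

Under this assumption, a document is indexed by one value per concept, so the dimensionality of the index equals the number of discovered clusters rather than the size of the vocabulary.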
