A Novel Burst-based Text Representation Model for Scalable Event Detection

Wayne Xin Zhao, Rishan Chen, Kai Fan, Hongfei Yan and Xiaoming Li
School of Electronics Engineering and Computer Science, Peking University, China
State Key Laboratory of Software, Beihang University, China
{batmanfly, tsunamicrs, fankaicn, yhf1029}@gmail.com, lxm@pku.edu.cn

Abstract

Mining retrospective events from text streams has been an important research topic. The classic text representation model (i.e., the vector space model) cannot model temporal aspects of documents. To address this, we propose a novel burst-based text representation model, denoted as BurstVSM. BurstVSM maps dimensions to bursty features instead of terms, which lets it capture both semantic and temporal information. Meanwhile, it significantly reduces the number of non-zero entries in the representation. We test it on scalable event detection, and experiments on a 10-year news archive show that our methods are both effective and efficient.

1 Introduction

Mining retrospective events (Yang et al., 1998; Fung et al., 2007; Allan et al., 2000) has been an important research topic in text mining. One standard way to do this is to cluster news articles into events with a two-step approach (Yang et al., 1998): (1) represent documents as vectors and calculate similarities between documents; (2) run a clustering algorithm to obtain document clusters as events.¹ The underlying text representation often plays a critical role in this approach, especially for long text streams.
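The two-step approach described above can be illustrated with a minimal sketch. It is not the authors' implementation: raw term counts, cosine similarity, the greedy single-pass clustering rule, the 0.5 threshold, and the helper names `to_vector`, `cosine`, and `cluster` are all assumptions chosen for illustration, and the toy documents are invented.

```python
import math
from collections import Counter


def to_vector(text):
    """Step 1a: represent a document as a term-frequency vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Step 1b: cosine similarity between two sparse term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster(docs, threshold=0.5):
    """Step 2: greedy single-pass clustering (an illustrative choice).

    Each document joins the first cluster whose seed document is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    """
    vectors = [to_vector(d) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical toy documents: two about an earthquake, two about an election.
docs = [
    "earthquake hits coastal city",
    "coastal city earthquake damage",
    "election results announced today",
    "election results disputed today",
]
events = cluster(docs)
print(events)  # [[0, 1], [2, 3]]
```

Note that the vectors in this sketch carry no timestamps, so two articles with the same vocabulary would be grouped together no matter how far apart in time they were published.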
In this paper, our focus is to study how to represent temporal documents effectively for event detection. Classical text representation methods, i.e., the Vector Space Model (VSM), have a few shortcomings when dealing with temporal documents. The major one is that VSM maps one dimension to one term, which completely ignores temporal information; VSM therefore can never capture the evolving trends in text streams. See the example in Figure 1 D1 and D2

Corresponding author.
¹ Post-processing may be .