Scientific paper: "Efficiently Accessing Wikipedia's Edit History"

Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia's Edit History

Oliver Ferschke, Torsten Zesch and Iryna Gurevych
Ubiquitous Knowledge Processing Lab, Computer Science Department
Technische Universität Darmstadt, Hochschulstrasse 10, D-64289 Darmstadt, Germany
http

Abstract

We present an open-source toolkit which allows (i) reconstructing past states of Wikipedia, and (ii) efficiently accessing the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but access is severely impeded by the lack of efficient tools for managing the huge amount of data provided. By using a dedicated storage format, our toolkit massively decreases the data volume to less than 2% of the original size, and at the same time provides an easy-to-use interface to access the revision data (an illustrative storage sketch appears at the end of this section). The language-independent design allows processing any language represented in Wikipedia. We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia's edit history.

1 Introduction

In the last decade, the free encyclopedia Wikipedia has become one of the most valuable and comprehensive knowledge sources in Natural Language Processing. It has been used for numerous NLP tasks, e.g. word sense disambiguation, semantic relatedness measures, or text categorization. A detailed survey on usages of Wikipedia in NLP can be found in Medelyan et al. (2009). The majority of Wikipedia-based NLP algorithms work on single snapshots of Wikipedia, which are published by the Wikimedia Foundation as XML dumps at irregular intervals. Such a snapshot only represents the state of Wikipedia at a certain fixed point in time, while Wikipedia actually is a dynamic resource that is constantly changed by its millions of editors.
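To make the snapshot setting concrete: the Wikimedia dumps follow the MediaWiki XML export format, in which each <page> element carries a <title> and, in the full-history dumps, one or more <revision> elements. The following StAX reader is a minimal sketch, not part of the toolkit described in the paper; the file name and class name are placeholders.

```java
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

/**
 * Streams a Wikipedia XML dump and counts the revisions stored per page.
 * Assumes the standard MediaWiki export format; "pages-meta-history.xml"
 * is a placeholder for an actual (decompressed) dump file.
 */
public class DumpReader {
    public static void main(String[] args) throws Exception {
        XMLStreamReader xml = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("pages-meta-history.xml"));
        String title = null;
        int revisions = 0;
        while (xml.hasNext()) {
            int event = xml.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                switch (xml.getLocalName()) {
                    case "page"     -> revisions = 0;                  // new article begins
                    case "title"    -> title = xml.getElementText();   // article name
                    case "revision" -> revisions++;                    // one stored edit
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && xml.getLocalName().equals("page")) {
                System.out.println(title + ": " + revisions + " revisions");
            }
        }
    }
}
```

Streaming with StAX rather than building a DOM matters here, since full-history dumps are far too large to hold in memory.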
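The abstract attributes the reduction to less than 2% of the original size (roughly fifty-fold) to a dedicated storage format. The section does not spell the format out, but one standard way to achieve such compression on revision data is delta encoding: store the first revision in full and every later revision only as its difference from the predecessor. The sketch below illustrates that idea under this assumption; all class and method names are hypothetical, not the toolkit's API.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of delta-based revision storage: revision 0 is kept in
 * full, every later revision only as a single-span change against its
 * predecessor. Illustrative only, not the toolkit's actual format.
 */
public class RevisionStore {

    /** One edit: replace `length` chars at `offset` with `insertedText`. */
    record Delta(int offset, int length, String insertedText) {}

    private String base;                        // full text of revision 0
    private final List<Delta> deltas = new ArrayList<>();

    public void addRevision(String text) {
        if (base == null) {
            base = text;
            return;
        }
        // O(n) per add in this sketch: rebuild the latest revision, then diff.
        deltas.add(diff(reconstruct(deltas.size()), text));
    }

    /** Rebuild revision `index` by replaying deltas on top of the base text. */
    public String reconstruct(int index) {
        String text = base;
        for (int i = 0; i < index; i++) {
            Delta d = deltas.get(i);
            text = text.substring(0, d.offset())
                 + d.insertedText()
                 + text.substring(d.offset() + d.length());
        }
        return text;
    }

    /** Naive single-span diff: trim the common prefix and suffix. */
    private static Delta diff(String oldText, String newText) {
        int start = 0;
        int maxStart = Math.min(oldText.length(), newText.length());
        while (start < maxStart && oldText.charAt(start) == newText.charAt(start)) {
            start++;
        }
        int oldEnd = oldText.length(), newEnd = newText.length();
        while (oldEnd > start && newEnd > start
                && oldText.charAt(oldEnd - 1) == newText.charAt(newEnd - 1)) {
            oldEnd--;
            newEnd--;
        }
        return new Delta(start, oldEnd - start, newText.substring(start, newEnd));
    }

    public static void main(String[] args) {
        RevisionStore store = new RevisionStore();
        store.addRevision("Wikipedia is a free encyclopedia.");
        store.addRevision("Wikipedia is a free online encyclopedia.");
        store.addRevision("Wikipedia is a free multilingual online encyclopedia.");
        System.out.println(store.reconstruct(2));
        // -> Wikipedia is a free multilingual online encyclopedia.
    }
}
```

The trade-off is reconstruction cost: rebuilding revision k requires replaying k deltas, which delta-based systems commonly mitigate by interleaving periodic full snapshots between delta runs.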
