Toward General-Purpose Learning for Information Extraction

Dayne Freitag
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213, USA
dayne@cs.cmu.edu

Abstract

Two trends are evident in the recent evolution of the field of information extraction: a preference for simple, often corpus-driven techniques over linguistically sophisticated ones; and a broadening of the central problem definition to include many non-traditional text domains. This development calls for information extraction systems which are as retargetable and general as possible. Here, we describe SRV, a learning architecture for information extraction which is designed for maximum generality and flexibility. SRV can exploit domain-specific information, including linguistic syntax and lexical information, in the form of features provided to the system explicitly as input for training. This process is illustrated using a domain created from Reuters corporate acquisitions articles. Features are derived from two general-purpose NLP systems, Sleator and Temperly's link grammar parser and Wordnet. Experiments compare the learner's performance with and without such linguistic information. Surprisingly, in many cases, the system performs as well without this information as with it.

1 Introduction

The field of information extraction (IE) is concerned with using natural language processing (NLP) to extract essential details from text documents automatically.
While the problems of retrieval, routing, and filtering have received considerable attention through the years, IE is only now coming into its own as an information management sub-discipline. Progress in the field of IE has been away from general NLP systems that must be tuned to work in a particular domain, toward faster systems that perform less linguistic processing of documents and can be more readily targeted at novel domains (e.g., Appelt et al. 1993). A natural part of this development has been the introduction of machine learning techniques to .