Extracting Sequences from the Web

Anthony Fader, Stephen Soderland, and Oren Etzioni
University of Washington, Seattle
{afader,soderlan,etzioni}@cs.washington.edu

Abstract

Classical Information Extraction (IE) systems fill slots in domain-specific frames. This paper reports on Seq, a novel open IE system that leverages a domain-independent frame to extract ordered sequences such as presidents of the United States or the most common causes of death in the U.S. Seq leverages regularities about sequences to extract a coherent set of sequences from Web text. Seq nearly doubles the area under the precision-recall curve compared to an extractor that does not exploit these regularities.

1 Introduction

Classical IE systems fill slots in domain-specific frames such as the time and location slots in seminar announcements (Freitag, 2000) or the terrorist organization slot in news stories (Chieu et al., 2003). In contrast, open IE systems are domain-independent but extract flat sets of assertions that are not organized into frames and slots (Sekine, 2006; Banko et al., 2007). This paper reports on Seq, an open IE system that leverages a domain-independent frame to extract ordered sequences of objects from Web text. We show that the novel domain-independent sequence frame in Seq substantially boosts the precision and recall of the system and yields coherent sequences filtered from low-precision extractions (Table 1).

Sequence extraction is distinct from set expansion (Etzioni et al., 2004; Wang and Cohen, 2007) because sequences are ordered and because the extraction process does not require seeds or HTML lists as input.

The domain-independent sequence frame consists of a sequence name s (e.g., presidents of the United States) and a set of ordered pairs (x, k), where x is a string naming a member of the sequence with name s and k is an integer indicating

Most common cause of death in the United States
1. heart disease
2. cancer
3. stroke
4. COPD
5. pneumonia
6. cirrhosis
7. AIDS
8. chronic liver disease
9. .