Error Mining on Dependency Trees

Claire Gardent
CNRS, LORIA, UMR 7503
Vandoeuvre-les-Nancy, F-54500, France
claire.gardent@loria.fr

Shashi Narayan
Universite de Lorraine, LORIA, UMR 7503
Villers-les-Nancy, F-54600, France
shashi.narayan@loria.fr

Abstract

In recent years, error mining approaches have been developed to help identify the most likely sources of parsing failures in parsing systems using handcrafted grammars and lexicons. However, the techniques they use to enumerate and count n-grams build on the sequential nature of a text corpus and do not easily extend to structured data. In this paper, we propose an algorithm for mining trees and apply it to detect the most likely sources of generation failure. We show that this tree mining algorithm permits identifying not only errors in the generation system (grammar, lexicon) but also mismatches between the structures contained in the input and the input structures expected by our generator, as well as a few idiosyncrasies/errors in the input data.

1 Introduction

In recent years, error mining techniques have been developed to help identify the most likely sources of parsing failure (van Noord, 2004; Sagot and de la Clergerie, 2006; de Kok et al., 2009). First, the input data (text) is separated into two subcorpora: a corpus of sentences that could be parsed (PASS) and a corpus of sentences that failed to be parsed (FAIL). For each n-gram of words (and/or part-of-speech tags) occurring in the corpus to be parsed, a suspicion rate is then computed which, in essence, captures the likelihood that this n-gram causes parsing to fail.
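The PASS/FAIL split and the per-n-gram suspicion rate described above can be illustrated with a minimal sketch. It assumes the simplest possible definition of the rate, namely the fraction of sentences containing an n-gram that belong to FAIL; the cited approaches refine this basic ratio in different ways, and this sketch is not meant to reproduce any one of them. The function names `ngrams` and `suspicion_rates` and the toy corpora are hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) occurring in one sentence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def suspicion_rates(pass_corpus, fail_corpus, n=1):
    """Map each n-gram to the share of its containing sentences that are in FAIL.

    Assumption: suspicion = (#FAIL sentences with the n-gram)
                            / (#all sentences with the n-gram).
    """
    fail_counts = Counter()
    total_counts = Counter()
    for sentence in pass_corpus:
        total_counts.update(ngrams(sentence, n))
    for sentence in fail_corpus:
        grams = ngrams(sentence, n)
        fail_counts.update(grams)
        total_counts.update(grams)
    return {gram: fail_counts[gram] / total_counts[gram] for gram in total_counts}


# Toy data (invented for illustration): tokenized sentences split by parse outcome.
PASS = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]
FAIL = [["the", "cat", "whom", "sleeps"]]

rates = suspicion_rates(PASS, FAIL, n=1)
# Rank n-grams from most to least suspicious.
ranking = sorted(rates.items(), key=lambda item: item[1], reverse=True)
```

In this toy setting, the unigram `("whom",)` occurs only in a failed sentence and therefore receives the highest rate, while `("the",)` occurs in every sentence and is only weakly suspicious. Counting each n-gram once per sentence, rather than once per occurrence, is a design choice of the sketch.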
These error mining techniques have been applied with good results on parsing output and shown to help improve the large-scale symbolic grammars and lexicons used by the parser. However, the techniques they use (e.g., suffix arrays) to enumerate and count n-grams build on the sequential nature of a text corpus and cannot easily extend to structured data. There are some NLP applications, though, where the processed data is .