SVM Model Tampering and Anchored Learning: A Case Study in Hebrew NP Chunking
Yoav Goldberg and Michael Elhadad
Computer Science Department, Ben Gurion University of the Negev
P.O.B 653 Be'er Sheva 84105, Israel
{yoavg,elhadad}@cs.bgu.ac.il
Abstract
We study the issue of porting a known NLP method to a language with little existing NLP resources, specifically Hebrew SVM-based chunking. We introduce two SVM-based methods: Model Tampering and Anchored Learning. These allow fine-grained analysis of the learned SVM models, which provides guidance to identify errors in the training corpus, distinguish the role and interaction of lexical features, and eventually construct a model with ∼10% error reduction. The resulting chunker is shown to be robust in the presence of noise in the training corpus, relies on fewer lexical features than was previously understood, and achieves an F-measure performance of 92.2 on automatically PoS-tagged text. The SVM analysis methods also provide general insight on SVM-based chunking.
1 Introduction
While high-quality NLP corpora and tools are available in English, such resources are difficult to obtain in most other languages. Three challenges must be met when adapting results established in English to another language: (1) acquiring high-quality annotated data; (2) adapting the English task definition to the nature of a different language; and (3) adapting the algorithm to the new language. This paper presents a case study in the adaptation of a well-known task to a language with few NLP resources available.
Specifically, we deal with SVM-based Hebrew NP chunking. In (Goldberg et al., 2006), we established that the task is not trivially transferable to Hebrew, but reported that SVM-based chunking (Kudo and Matsumoto, 2000) performs well. We extend that work and study the problem from 3 angles: (1) how to deal with a corpus that is smaller and with a higher level of noise than is available in English; we propose techniques that help identify suspicious data points in t F