Improving the Accuracy of Subcategorizations Acquired from Corpora

Naoki Yoshinaga
Department of Computer Science, University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033
yoshinag@is.s.u-tokyo.ac.jp

Abstract

This paper presents a method of improving the accuracy of subcategorization frames (SCFs) acquired from corpora in order to augment existing lexicon resources. I estimate a confidence value for each SCF using corpus-based statistics, and then cluster the SCF confidence-value vectors of words to capture co-occurrence tendencies among SCFs in the lexicon. I apply my method to SCFs acquired from corpora using the lexicons of two large-scale lexicalized grammars. The resulting SCFs achieve higher precision and recall than SCFs obtained by a naive frequency cut-off.

1 Introduction

Recently, a variety of methods have been proposed for the acquisition of subcategorization frames (SCFs) from corpora (surveyed in Korhonen, 2002). One interesting possibility is to use these techniques to improve the coverage of existing large-scale lexicon resources, such as the lexicons of lexicalized grammars. However, there has been little work on evaluating the impact of acquired SCFs, with the exception of Carroll and Fang (2004).

The problem when we integrate acquired SCFs into existing lexicalized grammars is the lower quality of the acquired SCFs, since they are acquired in an unsupervised manner rather than being manually coded.
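The naive frequency cut-off that serves as the point of comparison can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in this paper: the function name `frequency_cutoff`, the SCF labels, the toy counts, and the threshold value are all hypothetical.

```python
from collections import Counter, defaultdict


def frequency_cutoff(observations, threshold=0.05):
    """Keep, for each word, the SCFs whose relative frequency is >= threshold.

    observations: iterable of (word, scf) pairs, one per corpus occurrence
                  (a hypothetical input format chosen for this sketch).
    Returns a dict mapping each word to the set of SCFs that survive.
    """
    # Count how often each SCF was observed with each word.
    counts = defaultdict(Counter)
    for word, scf in observations:
        counts[word][scf] += 1

    lexicon = {}
    for word, scf_counts in counts.items():
        total = sum(scf_counts.values())
        # Relative frequency of an SCF among all occurrences of the word.
        lexicon[word] = {
            scf for scf, n in scf_counts.items() if n / total >= threshold
        }
    return lexicon


# Toy data: 20 occurrences of "give" with made-up SCF labels.
observations = (
    [("give", "NP_NP")] * 12 + [("give", "NP_PP")] * 7 + [("give", "S")] * 1
)
lexicon = frequency_cutoff(observations, threshold=0.1)
```

With the toy counts above, the frame seen once in twenty occurrences falls below the 0.1 threshold and is discarded. Lowering the threshold admits it again, which is the trade-off between coverage and noise in the acquired lexicon.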
If we attempt to compensate for the poor precision by being less strict in filtering out less likely SCFs, then we will end up with a larger number of noisy lexical entries, which is problematic for parsing with lexicalized grammars (Sarkar et al., 2000). We thus need some method of selecting the most reliable set of SCFs from the system output, as demonstrated in Korhonen (2002). In this paper, I present a method of improving the accuracy of SCFs acquired from corpora in order to augment existing lexicon resources. I first estimate a confidence value that a word can have each SCF using