Corpus-Oriented Development of Japanese HPSG Parsers

Kazuhiro Yoshida
Department of Computer Science, University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033
kyoshida@is.s.u-tokyo.ac.jp

Abstract

This paper reports the corpus-oriented development of a wide-coverage Japanese HPSG parser. We first created an HPSG treebank from the EDR corpus by using heuristic conversion rules, and then extracted lexical entries from the treebank. The grammar developed using this method attained wide coverage that could hardly be obtained by conventional manual development. We also trained a statistical parser for the grammar on the treebank, and evaluated the parser in terms of the accuracy of semantic-role identification and dependency analysis.

1 Introduction

In this study, we report the corpus-oriented development of a Japanese HPSG parser using the EDR Japanese corpus (2002). Although several researchers have attempted to utilize linguistic grammar theories, such as LFG (Bresnan and Kaplan, 1982), CCG (Steedman, 2001), and HPSG (Pollard and Sag, 1994), for parsing real-world texts, such attempts could hardly be successful, because manual development of wide-coverage, linguistically motivated grammars involves years of labor-intensive effort. Corpus-oriented grammar development is a grammar development method that has been proposed as a promising substitute for conventional manual development.
In corpus-oriented methods, a treebank of a target grammar is constructed first, and various grammatical constraints are extracted from the treebank. Previous studies reported that wide-coverage grammars can be obtained at low cost by using this method (Hockenmaier and Steedman, 2002; Miyao et al., 2004). The treebank can also be used for training statistical disambiguation models, and hence we can construct a statistical parser for the extracted grammar. The corpus-oriented method enabled us to develop a Japanese HPSG parser with semantic information whose coverage on real-world sentences is 95.3%. This high coverage