Alternative Approaches for Generating Bodies of Grammar Rules

Gabriel Infante-Lopez and Maarten de Rijke
Informatics Institute, University of Amsterdam
{infante,mdr}@science.uva.nl

Abstract

We compare two approaches for describing and generating bodies of rules used for natural language parsing. In today's parsers, rule bodies do not exist a priori but are generated on the fly, usually with methods based on n-grams, which are one particular way of inducing probabilistic regular languages. We compare two approaches for inducing such languages. One is based on n-grams, the other on minimization of the Kullback-Leibler divergence. The inferred regular languages are used for generating bodies of rules inside a parsing procedure. We compare the two approaches along two dimensions: the quality of the probabilistic regular language they produce, and the performance of the parser they were used to build. The second approach outperforms the first one along both dimensions.

1 Introduction

N-grams have had a big impact on the state of the art in natural language parsing. They are central to many parsing models (Charniak, 1997; Collins, 1997, 2000; Eisner, 1996), and despite their simplicity, n-gram models have been very successful. Modeling with n-grams is an induction task (Gold, 1967): given a sample set of strings, the task is to guess the grammar that produced that sample. Usually the grammar is not chosen from an arbitrary set of possible grammars, but from a given class.
Hence, grammar induction consists of two parts: choosing the class of languages amongst which to search, and designing the procedure for performing the search. By using n-grams for grammar induction, one addresses the two parts in one go. In particular, the use of n-grams implies that the solution will be searched for in the class of probabilistic regular languages, since n-grams induce probabilistic automata and, consequently, probabilistic regular languages. However, the class of probabilistic regular languages induced using .
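The correspondence between n-grams and probabilistic automata noted above can be sketched concretely. In the minimal illustration below (all names are ours, not from the paper), a bigram model estimated from a sample is read as a probabilistic automaton whose states are the previously seen symbol and whose transition probabilities are maximum-likelihood bigram estimates:

```python
from collections import defaultdict

# Illustrative sketch: a bigram model induced from a sample, viewed as a
# probabilistic automaton. States are the previous symbol; transitions
# carry maximum-likelihood bigram probabilities. Names are hypothetical.

START, END = "<s>", "</s>"

def induce_bigram_automaton(sample):
    """Return transition probabilities P(next | state) as nested dicts."""
    counts = defaultdict(lambda: defaultdict(int))
    for sequence in sample:
        state = START
        for symbol in list(sequence) + [END]:
            counts[state][symbol] += 1
            state = symbol  # the automaton's state is the last symbol seen
    return {
        state: {sym: n / sum(nexts.values()) for sym, n in nexts.items()}
        for state, nexts in counts.items()
    }

def probability(automaton, sequence):
    """Probability the induced automaton assigns to a whole sequence."""
    p, state = 1.0, START
    for symbol in list(sequence) + [END]:
        p *= automaton.get(state, {}).get(symbol, 0.0)
        state = symbol
    return p

sample = ["ab", "ab", "ac"]
automaton = induce_bigram_automaton(sample)
```

Because every string with nonzero probability is accepted by this finite-state device, the language induced by the bigram model is a probabilistic regular language, which is the point made in the paragraph above.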