Scientific report: "Using an Annotated Corpus as a Stochastic Grammar"
Rens Bod
Department of Computational Linguistics, University of Amsterdam
Spuistraat 134, NL-1012 VB Amsterdam
rens @

Abstract

In Data Oriented Parsing (DOP), an annotated corpus is used as a stochastic grammar. An input string is parsed by combining subtrees from the corpus. As a consequence, one parse tree can usually be generated by several derivations that involve different subtrees. This leads to a statistical model in which the probability of a parse is equal to the sum of the probabilities of all its derivations. In (Scha, 1990) an informal introduction to DOP is given, while (Bod, 1992a) provides a formalization of the theory. In this paper we compare DOP with other stochastic grammars in the context of Formal Language Theory. It is proved that it is not possible to create for every DOP-model a strongly equivalent stochastic CFG which also assigns the same probabilities to the parses. We show that the maximum probability parse can be estimated in polynomial time by applying Monte Carlo techniques. The model was tested on a set of hand-parsed strings from the Air Travel Information System (ATIS) spoken language corpus. Preliminary experiments yield 96% test set parsing accuracy.

1 Motivation

As soon as a formal grammar characterizes a nontrivial part of a natural language, almost every input string of reasonable length gets an unmanageably large number of different analyses.
Since most of these analyses are not perceived as plausible by a human language user, there is a need to distinguish the plausible parse(s) of an input string from the implausible ones. In stochastic language processing, it is assumed that the most plausible parse of an input string is its most probable parse. Most instantiations of this idea estimate the probability of a parse by assigning application probabilities to context-free rewrite rules (Jelinek, 1990), or by assigning combination probabilities to elementary structures (Resnik, 1992; Schabes, 1992).
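
To make the abstract's central claim concrete, the following Python sketch illustrates how a parse probability can be obtained by summing over derivations, and how the most probable parse might be estimated by sampling. It is a toy illustration only, not the algorithm of the paper: the table DERIVATIONS, its probabilities, and the function names parse_probabilities and monte_carlo_best_parse are all hypothetical, and the derivations are enumerated explicitly rather than built from corpus subtrees.

```python
import random
from collections import Counter

# Hypothetical toy data (not from the paper): each entry is one derivation,
# given as (parse it yields, probability of that derivation).
# Several derivations can yield the same parse tree.
DERIVATIONS = [
    ("parse_A", 0.20),
    ("parse_A", 0.20),
    ("parse_A", 0.15),
    ("parse_B", 0.35),
    ("parse_B", 0.10),
]


def parse_probabilities(derivations):
    """Exact parse probabilities: sum the probabilities of each parse's derivations."""
    totals = Counter()
    for parse, prob in derivations:
        totals[parse] += prob
    return dict(totals)


def monte_carlo_best_parse(derivations, n_samples, rng):
    """Estimate the most probable parse by sampling derivations.

    Derivations are drawn in proportion to their probabilities; the parse
    that is produced most often among the samples is returned as the estimate.
    """
    parses = [parse for parse, _ in derivations]
    weights = [prob for _, prob in derivations]
    sampled = rng.choices(parses, weights=weights, k=n_samples)
    counts = Counter(sampled)
    best_parse = counts.most_common(1)[0][0]
    return best_parse, counts


if __name__ == "__main__":
    print(parse_probabilities(DERIVATIONS))
    best, counts = monte_carlo_best_parse(DERIVATIONS, 10_000, random.Random(0))
    print(best, counts)
```

In this toy table the single most probable derivation belongs to parse_B, yet parse_A is the most probable parse once its derivations are summed. This is why selecting the best derivation alone is not sufficient in a model of this kind. In a realistic setting the derivations cannot be enumerated, so sampling is applied to the derivations themselves rather than to a precomputed list as assumed here.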