The particular domain chosen here as a case study is the problem of restoring missing accents to Spanish and French text. Because it requires the resolution of both semantic and syntactic ambiguity, and offers an objective ground truth for automatic evaluation, it is particularly well suited for demonstrating and testing the capabilities of the given algorithm. It is also a practical problem with immediate application.

PROBLEM DESCRIPTION

The general problem considered here is the resolution of lexical ambiguity, both syntactic and semantic, based on properties of the surrounding context.

DECISION LISTS FOR LEXICAL AMBIGUITY RESOLUTION: Application to Accent Restoration in Spanish and French

David Yarowsky
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104
yarowsky@unagi.cis.upenn.edu

Abstract

This paper presents a statistical decision procedure for lexical ambiguity resolution. The algorithm exploits both local syntactic patterns and more distant collocational evidence, generating an efficient, effective, and highly perspicuous recipe for resolving a given ambiguity. By identifying and utilizing only the single best disambiguating evidence in a target context, the algorithm avoids the problematic complex modeling of statistical dependencies. Although directly applicable to a wide class of ambiguities, the algorithm is described and evaluated in a realistic case study, the problem of restoring missing accents in Spanish and French text. Current accuracy exceeds 99% on the full task, and typically is over 90% for even the most difficult ambiguities.

INTRODUCTION

This paper presents a general-purpose statistical decision procedure for lexical ambiguity resolution based on decision lists (Rivest, 1987). The algorithm considers multiple types of evidence in the context of an ambiguous word, exploiting differences in collocational distribution as measured by log-likelihoods.
Unlike standard Bayesian approaches, however, it does not combine the log-likelihoods of all available pieces of contextual evidence, but bases its classifications solely on the single most reliable piece of evidence identified in the target context. Perhaps surprisingly, this strategy appears to yield the same or even slightly better precision than the combination-of-evidence approach when trained on the same features. It also brings with it several additional advantages, the greatest of which is the ability to include multiple, highly non-independent sources of evidence without complex modeling of dependencies. Some other advantages are significant ...
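
The single-best-evidence strategy described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the feature names (`left:du`, `near:mer`), the counts, the smoothing constant `alpha`, and the helper names `build_decision_list` and `classify` are all hypothetical, and ranking rules by the absolute log-likelihood ratio is an assumption based on the text's mention of log-likelihoods. The example uses the French ambiguity between "côte" and "côté", both of which appear as "cote" once accents are stripped.

```python
import math
from typing import NamedTuple


class Rule(NamedTuple):
    evidence: str    # one contextual feature, e.g. the word to the left
    label: str       # the reading this evidence favours
    strength: float  # absolute log-likelihood ratio (assumed ranking score)


def build_decision_list(counts, labels, alpha=0.1):
    """Turn per-evidence counts into rules sorted by decreasing strength.

    counts maps evidence -> {label: count}. alpha is a hypothetical
    smoothing constant that keeps the ratio finite when a count is zero.
    """
    a, b = labels
    rules = []
    for evidence, by_label in counts.items():
        count_a = by_label.get(a, 0) + alpha
        count_b = by_label.get(b, 0) + alpha
        llr = math.log(count_a / count_b)
        rules.append(Rule(evidence, a if llr > 0 else b, abs(llr)))
    rules.sort(key=lambda r: r.strength, reverse=True)
    return rules


def classify(rules, context, default):
    """Return the label of the strongest rule whose evidence is present.

    Only the first matching rule is used; weaker matching evidence is
    ignored rather than combined.
    """
    for rule in rules:
        if rule.evidence in context:
            return rule.label
    return default


# Hypothetical training counts for the accent-stripped French word "cote".
counts = {
    "left:la": {"côte": 300, "côté": 2},
    "left:du": {"côte": 4, "côté": 420},
    "near:mer": {"côte": 95, "côté": 30},
}
rules = build_decision_list(counts, labels=("côte", "côté"))

# A context containing both "du" to the left and "mer" nearby.
context = {"left:du", "near:mer"}
print(classify(rules, context, default="côte"))
```

In this sketch the context contains two pieces of evidence that point in opposite directions. Because `left:du` outranks `near:mer` in the sorted list, it alone decides the classification, so no assumption about how the two features depend on each other is needed.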