LATTICE-BASED WORD IDENTIFICATION IN CLARE

David M. Carter
SRI International
Cambridge Computer Science Research Centre
23 Millers Yard
Cambridge CB2 1RQ, U.K.
dmc@cam.sri.com

ABSTRACT

I argue that, because of spelling and typing errors and other properties of typed text, the identification of words and word boundaries in general requires syntactic and semantic knowledge. A lattice representation is therefore appropriate for lexical analysis. I show how the use of such a representation in the CLARE system allows different kinds of hypothesis about word identity to be integrated in a uniform framework. I then describe a quantitative evaluation of CLARE's performance on a set of sentences into which typographic errors have been introduced. The results show that syntax and semantics can be applied as powerful sources of constraint on the possible corrections for misspelled words.

1 INTRODUCTION

In many language processing systems, uncertainty in the boundaries of linguistic units at various levels means that data are represented not as a well-defined sequence of units but as a lattice of possibilities. It is common for speech recognizers to maintain a lattice of overlapping word hypotheses from which one or more plausible complete paths are subsequently selected.
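To make the idea of a lattice of overlapping hypotheses concrete, the following is a minimal sketch, not CLARE's implementation. It assumes a lattice can be stored as a list of (start, end, word) edges over positions in the input, with every edge pointing forward. The function name `complete_paths` and the example edges over the string "nowhere" are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def complete_paths(edges, start, end):
    """Enumerate every word sequence that spans start..end in the lattice.

    Assumes each edge is (begin, finish, word) with finish > begin,
    so the recursion always terminates.
    """
    outgoing = defaultdict(list)
    for begin, finish, word in edges:
        outgoing[begin].append((finish, word))

    def walk(position):
        if position == end:
            yield []  # reached the end: one complete (empty) continuation
            return
        for finish, word in outgoing[position]:
            for rest in walk(finish):
                yield [word] + rest

    return list(walk(start))


# Hypothetical lattice over the character positions of "nowhere":
# three competing segmentations share the same span 0..7.
edges = [
    (0, 7, "nowhere"),
    (0, 3, "now"),
    (3, 7, "here"),
    (0, 2, "no"),
    (2, 7, "where"),
]
paths = complete_paths(edges, 0, 7)
print(paths)
```

In this sketch all complete paths are returned unranked. Choosing among them is the step at which a real system would bring further knowledge to bear, for example acoustic scores in a speech recognizer or syntactic and semantic constraints.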
Syntactic parsing of either spoken or written language frequently makes use of a chart or well-formed substring table, because the correct bracketing of a sentence cannot easily be calculated deterministically. And lattices are also often used in the task of converting Japanese text typed in kana (syllabic symbols) to kanji: the lack of interword spacing in written Japanese and the complex morphology of the language mean that lexical items and their boundaries cannot be reliably identified without applying syntactic and semantic knowledge (Abe et al., 1986). In contrast, however, it is often assumed that, for languages written with interword spaces, it is sufficient to group an input character stream deterministically into