LATTICE-BASED WORD IDENTIFICATION IN CLARE

David M. Carter
SRI International Cambridge Computer Science Research Centre
23 Millers Yard, Cambridge CB2 1RQ

ABSTRACT

I argue that, because of spelling and typing errors and other properties of typed text, the identification of words and word boundaries in general requires syntactic and semantic knowledge. A lattice representation is therefore appropriate for lexical analysis. I show how the use of such a representation in the CLARE system allows different kinds of hypothesis about word identity to be integrated in a uniform framework. I then describe a quantitative evaluation of CLARE's performance on a set of sentences into which typographic errors have been introduced. The results show that syntax and semantics can be applied as powerful sources of constraint on the possible corrections for misspelled words.

1 INTRODUCTION

In many language processing systems, uncertainty in the boundaries of linguistic units, at various levels, means that data are represented not as a well-defined sequence of units but as a lattice of possibilities. It is common for speech recognizers to maintain a lattice of overlapping word hypotheses from which one or more plausible complete paths are subsequently selected.
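To make the idea of a lattice of overlapping word hypotheses concrete, the following is a minimal sketch in Python. It is not CLARE's implementation; the edge format, the function name complete_paths, and the example input are assumptions made purely for illustration. Each edge is a word hypothesis spanning two vertices, and a complete path is a sequence of edges running from the first vertex to the last.

```python
from collections import defaultdict


def complete_paths(edges, start, end):
    """Return every word sequence that spans the lattice from start to end.

    Each edge is a (source_vertex, target_vertex, word) triple. The sketch
    assumes the lattice is acyclic, i.e. every edge points forward.
    """
    outgoing = defaultdict(list)
    for src, dst, word in edges:
        outgoing[src].append((dst, word))

    def walk(node):
        if node == end:
            yield []  # reached the final vertex: one complete (empty) tail
            return
        for dst, word in outgoing[node]:
            for rest in walk(dst):
                yield [word] + rest

    return list(walk(start))


# Hypothetical lattice: the same stretch of input may be one word ("into")
# or two ("in" followed by "to"); both hypotheses coexist as edges.
EDGES = [
    (0, 2, "into"),
    (0, 1, "in"),
    (1, 2, "to"),
    (2, 3, "the"),
]

PATHS = complete_paths(EDGES, 0, 3)
for path in PATHS:
    print(" ".join(path))
```

In this sketch every complete path is enumerated; a real system would instead rank or prune paths, for example by applying syntactic and semantic constraints, rather than listing them all.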
Syntactic parsing of either spoken or written language frequently makes use of a chart or well-formed substring table, because the correct bracketing of a sentence cannot easily be calculated deterministically. Lattices are also often used in the task of converting Japanese text typed in kana (syllabic symbols) to kanji: the lack of interword spacing in written Japanese and the complex morphology of the language mean that lexical items and their boundaries cannot be reliably identified without applying syntactic and semantic knowledge (Abe et al., 1986). In contrast, however, it is often assumed that, for languages written with interword spaces, it is sufficient to group an input character stream deterministically into
