Lexicon and grammar in probabilistic tagging of written English

Andrew David Beale
Unit for Computer Research on the English Language
University of Lancaster
Bailrigg, Lancaster, England LA1 4YT
enbO25@ulucJancs.vaxl

Abstract

The paper describes the development of software for automatic grammatical analysis of unrestricted, unedited English text at the Unit for Computer Research on the English Language (UCREL) at the University of Lancaster. The work is currently funded by IBM and carried out in collaboration with colleagues at IBM UK (Winchester) and IBM Yorktown Heights. The paper will focus on the lexicon component of the word tagging system, the UCREL grammar, the databanks of parsed sentences, and the tools that have been written to support development of these components. This work has applications to speech technology, spelling correction, and other areas of natural language processing. Currently, our goal is to provide a language model using transition statistics to disambiguate alternative parses for a speech recognition device.

1. Text Corpora

Historically, the use of text corpora to provide empirical data for testing grammatical theories has been regarded as important to varying degrees by philologists and linguists of differing persuasions. The use of corpus citations in grammars and dictionaries pre-dates electronic data processing (Brown, 1984: 34).
While most of the generative grammarians of the 60s and 70s ignored corpus data, the increased power of the new technology nevertheless points the way to new applications of computerized text corpora in dictionary making, style checking, and speech recognition. Computer corpora present the computational linguist with the diversity and complexity of real language, which is more challenging for testing language models than intuitively derived examples. Ultimately, grammars must be judged by their ability to contend with the real facts of language and not just basic constructs extrapolated by grammarians.

2. Word Tagging

The .