Reduced n-gram models for English and Chinese corpora

Le Q. Ha, P. Hanna, D. W. Stewart and F. J. Smith
School of Electronics, Electrical Engineering and Computer Science
Queen's University Belfast, Belfast BT7 1NN, Northern Ireland, United Kingdom
lequanha@lequanha.com

Abstract

Statistical language models should improve as the size of the n-grams increases from 3 to 5 or higher. However, the number of parameters and calculations, and the storage requirement, increase very rapidly if we attempt to store all possible combinations of n-grams. To avoid these problems, the reduced n-gram approach previously developed by O'Boyle (1993) can be applied. A reduced n-gram language model can store an entire corpus's phrase-history length within feasible storage limits. Another theoretical advantage of reduced n-grams is that they are closer to being semantically complete than traditional models, which include all n-grams. In our experiments, the reduced n-gram Zipf curves are first presented and compared with previously obtained conventional n-gram curves for both English and Chinese. The reduced n-gram model is then applied to large English and Chinese corpora. For English, we can reduce the model sizes, compared to 7-gram traditional model sizes, by factors of 14.6 for a 40-million-word corpus and 11.0 for a 500-million-word corpus, while obtaining 5.8% and 4.2% improvements in perplexity. For Chinese, we gain a 16.9% perplexity reduction and reduce the model size by a factor larger than 11.2. This paper is a step towards the modeling of English and Chinese using semantically complete phrases in an n-gram model.

1 Introduction to the Reduced N-Gram Approach

Shortly after this laboratory first published a variable n-gram algorithm (Smith and O'Boyle, 1992), O'Boyle (1993) proposed a statistical method to improve language models based on the removal of overlapping phrases. A distortion in the use of phrase frequencies had been observed in the small railway timetable Vodis Corpus when the bigram RAIL …
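To make the overlap problem concrete, the following minimal sketch (not from the paper: the toy corpus, the count_ngrams helper and the phrase "british rail enquiries" are invented for illustration, loosely in the spirit of the railway-timetable example) shows how a conventional model counts every overlapping n-gram, so each occurrence of a longer phrase also increments the counts of all the shorter n-grams embedded in it. This both distorts the lower-order frequencies and multiplies the number of entries that must be stored as n grows.

    from collections import Counter

    def count_ngrams(tokens, n):
        # Conventional model: count every overlapping n-gram in the sequence.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    # Hypothetical toy corpus in which the phrase "british rail enquiries"
    # occurs twice; it is not taken from the Vodis Corpus.
    corpus = ("call british rail enquiries please "
              "british rail enquiries open at nine").split()

    for n in range(1, 4):
        grams = count_ngrams(corpus, n)
        print(f"distinct {n}-grams stored: {len(grams)}")

    # The embedded bigram inherits its whole count from the longer phrase:
    print(count_ngrams(corpus, 2)[("british", "rail")])              # 2
    print(count_ngrams(corpus, 3)[("british", "rail", "enquiries")]) # 2

Roughly speaking, a reduced n-gram model aims to keep such a phrase once at its full length rather than as all of its overlapping fragments, which is where the model-size reductions quoted in the abstract come from.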