
Reduced n-gram models for English and Chinese corpora

Le Q. Ha, P. Hanna, D. W. Stewart and F. J. Smith
School of Electronics, Electrical Engineering and Computer Science
Queen's University Belfast, Belfast BT7 1NN, Northern Ireland, United Kingdom
lequanha@lequanha.com

Abstract

Statistical language models should improve as the size of the n-grams increases from 3 to 5 or higher. However, the number of parameters and calculations, and the storage requirement, increase very rapidly if we attempt to store all possible combinations of n-grams. To avoid these problems, the reduced n-gram approach previously developed by O'Boyle (1993) can be applied. A reduced n-gram language model can store an entire corpus's phrase-history length within feasible storage limits. Another theoretical advantage of reduced n-grams is that they are closer to being semantically complete than traditional models, which include all n-grams. In our experiments, the reduced n-gram Zipf curves are first presented and compared with previously obtained conventional n-gram curves for both English and Chinese. The reduced n-gram model is then applied to large English and Chinese corpora. For English, compared with traditional 7-gram model sizes, we can reduce the model sizes by factors of 14.6 for a 40-million-word corpus and 11.0 for a 500-million-word corpus, while obtaining perplexity improvements of 5.8% and 4.2% respectively. For Chinese, we gain a 16.9% perplexity reduction, and we reduce the model size by a factor larger than 11.2. This paper is a step towards the modeling of English and Chinese using semantically complete phrases in an n-gram model.

1 Introduction to the Reduced N-Gram Approach

Shortly after this laboratory first published a variable n-gram algorithm (Smith and O'Boyle, 1992), O'Boyle (1993) proposed a statistical method to improve language models based on the removal of overlapping phrases. A distortion in the use of phrase frequencies had been observed in the small railway-timetable Vodis Corpus when the bigram RAIL …
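To make the overlap distortion concrete, the following sketch (our illustration, not the authors' code) measures what fraction of a short phrase's occurrences lie inside a longer superset phrase; when that fraction is high, the short phrase's raw frequency overstates its independent use. The token list and the phrases are hypothetical stand-ins for the truncated Vodis example above.

```python
# A minimal sketch of the overlap distortion, not the paper's method.
# The phrases and token sequence below are hypothetical examples.
def count_phrase(tokens, phrase):
    """Number of times `phrase` (a tuple of words) occurs in `tokens`."""
    n = len(phrase)
    return sum(1 for i in range(len(tokens) - n + 1)
               if tuple(tokens[i:i + n]) == phrase)

tokens = "I LOVE NEW YORK CITY AND NEW YORK CITY LOVES NEW YORK".split()
sub, sup = ("NEW", "YORK"), ("NEW", "YORK", "CITY")
ratio = count_phrase(tokens, sup) / count_phrase(tokens, sub)
print(f"{ratio:.0%} of '{' '.join(sub)}' occurrences lie inside "
      f"'{' '.join(sup)}'")
```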
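The abstract's warning about parameter and storage growth can be illustrated by counting distinct n-grams per order. This is a generic sketch under our own assumptions; "corpus.txt" is a hypothetical file of whitespace-separated tokens, not a resource from the paper.

```python
# Illustrative sketch: how the number of distinct n-grams grows with n,
# the storage problem that motivates reduced n-grams.
from collections import Counter

def distinct_ngrams(tokens, n):
    """Counter mapping each n-gram (tuple of words) to its frequency."""
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))

tokens = open("corpus.txt", encoding="utf-8").read().split()  # hypothetical
for n in range(1, 8):
    counts = distinct_ngrams(tokens, n)
    print(f"{n}-grams: {len(counts):>10} distinct "
          f"of {sum(counts.values())} total")
```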
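Likewise, the Zipf curves compared in the paper's experiments are rank-frequency plots on log-log axes. A sketch of how such points are derived follows, again with a hypothetical corpus file; the paper itself reports curves for much larger English and Chinese corpora.

```python
# Illustrative sketch: points of an n-gram Zipf curve
# (log rank vs. log frequency).
import math
from collections import Counter

def zipf_curve(tokens, n):
    """n-gram frequencies sorted descending; index i holds rank i + 1."""
    counts = Counter(tuple(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    return sorted(counts.values(), reverse=True)

tokens = open("corpus.txt", encoding="utf-8").read().split()  # hypothetical
freqs = zipf_curve(tokens, 3)
for rank in (1, 10, 100, 1000):
    if rank <= len(freqs):
        print(f"rank {rank}: frequency {freqs[rank - 1]} -> log-log point "
              f"({math.log10(rank):.2f}, {math.log10(freqs[rank - 1]):.2f})")
```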
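Finally, the perplexity figures quoted in the abstract come from held-out evaluation. The sketch below computes perplexity for a deliberately simple add-one-smoothed bigram model; the smoothing and model orders are our simplifications, not the paper's method, and the file names are hypothetical.

```python
# Illustrative sketch: perplexity of an add-one-smoothed bigram model,
# the kind of measure behind the paper's 5.8%/4.2%/16.9% figures.
# Add-one smoothing is our simplification, not the paper's technique.
import math
from collections import Counter

def bigram_perplexity(train, test):
    unigrams = Counter(train)
    bigrams = Counter(zip(train, train[1:]))
    vocab = len(set(train)) + 1            # +1 slot for unseen words
    log_prob = 0.0
    for w1, w2 in zip(test, test[1:]):
        p = (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(test) - 1))

train = open("train.txt", encoding="utf-8").read().split()  # hypothetical
test = open("test.txt", encoding="utf-8").read().split()    # hypothetical
print(f"perplexity: {bigram_perplexity(train, test):.1f}")
```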
