Scientific report: "Similarity-Based Estimation of Word Cooccurrence Probabilities"

Ido Dagan, Fernando Pereira
AT&T Bell Laboratories, 600 Mountain Ave., Murray Hill, NJ 07974, USA

Abstract

In many applications of natural language processing it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and "eat a beach" is more likely. Statistical NLP methods determine the likelihood of a word combination according to its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in a given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on "most similar" words. We describe a probabilistic word association model based on distributional word similarity, and apply it to improving probability estimates for unseen word bigrams in a variant of Katz's back-off model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error.

Introduction

Data sparseness is an inherent problem in statistical methods for natural language processing. Such methods use statistics on the relative frequencies of configurations of elements in a training corpus to evaluate alternative analyses or interpretations of new samples of text or speech. The most likely analysis is taken to be the one that contains the most frequent configurations. The problem of data sparseness arises when analyses contain configurations that never occurred in the training corpus. In that case it is not possible to estimate probabilities from observed frequencies, and some other estimation scheme has to be used. We focus here on a particular kind of configuration, word cooccurrence. Examples of such cooccurrences include ...
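The sparseness problem described above can be illustrated with a small Python sketch. It is not the authors' model. The toy word pairs, the function names, and the use of Jaccard overlap as the similarity measure are all assumptions made for this example; the paper uses its own distributional similarity measure inside a variant of Katz's back-off model. The sketch shows only the general idea: a relative-frequency estimate gives zero for every unseen pair, while an estimate that borrows evidence from similar words can still tell unseen pairs apart.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (w1, w2) cooccurrence pairs, e.g. verb-object.
pairs = [
    ("eat", "peach"), ("eat", "pear"), ("eat", "apple"),
    ("devour", "pear"), ("devour", "apple"),
    ("visit", "beach"), ("visit", "city"),
]

pair_counts = Counter(pairs)
w1_counts = Counter(w1 for w1, _ in pairs)

# For each first word, the set of second words observed with it.
contexts = defaultdict(set)
for w1, w2 in pairs:
    contexts[w1].add(w2)


def mle(w2, w1):
    """Relative-frequency estimate of P(w2 | w1); zero for unseen pairs."""
    if w1_counts[w1] == 0:
        return 0.0
    return pair_counts[(w1, w2)] / w1_counts[w1]


def jaccard(a, b):
    """Stand-in similarity (an assumption): overlap of observed contexts."""
    union = contexts[a] | contexts[b]
    return len(contexts[a] & contexts[b]) / len(union) if union else 0.0


def similarity_estimate(w2, w1):
    """Average P(w2 | w1') over other words w1', weighted by similarity to w1."""
    weights = {w: jaccard(w1, w) for w in w1_counts if w != w1}
    norm = sum(weights.values())
    if norm == 0:
        return 0.0
    return sum(wt * mle(w2, w) for w, wt in weights.items()) / norm


if __name__ == "__main__":
    for obj in ("peach", "beach"):
        print(obj, mle(obj, "devour"), similarity_estimate(obj, "devour"))
```

In this toy data neither "devour peach" nor "devour beach" was observed, so the relative-frequency estimate is zero for both. The similarity-weighted estimate ranks "devour peach" higher, because "devour" shares contexts with "eat", which was seen with "peach".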
