Scientific report: "Similarity-Based Methods For Word Sense Disambiguation"

Ido Dagan
Dept. of Mathematics and Computer Science, Bar Ilan University, Ramat Gan 52900, Israel

Lillian Lee
Div. of Engineering and Applied Sciences, Harvard University, Cambridge MA 01238, USA

Fernando Pereira
AT&T Labs - Research, 600 Mountain Ave., Murray Hill NJ 07974, USA

Abstract

We compare four similarity-based estimation methods against back-off and maximum-likelihood estimation methods on a pseudo-word sense disambiguation task in which we controlled for both unigram and bigram frequency. The similarity-based methods perform up to 40% better on this particular task. We also conclude that events that occur only once in the training set have a major impact on similarity-based estimates.

1 Introduction

The problem of data sparseness affects all statistical methods for natural language processing. Even large training sets tend to misrepresent low-probability events, since rare events may not appear in the training corpus at all.

We concentrate here on the problem of estimating the probability of unseen word pairs, that is, pairs that do not occur in the training set. Katz's back-off scheme (Katz, 1987), widely used in bigram language modeling, estimates the probability of an unseen bigram by utilizing unigram estimates. This has the undesirable result of assigning unseen bigrams the same probability if they are made up of unigrams of the same frequency.

Class-based methods (Brown et al., 1992; Pereira, Tishby, and Lee, 1993; Resnik, 1992) cluster words into classes of similar words, so that one can base the estimate of a word pair's probability on the averaged cooccurrence probability of the classes to which the two words belong. However, a word is therefore modeled by the average behavior of many words, which may cause the given word's idiosyncrasies to be ignored. For instance, the word "red" might well act like a generic color word in most cases, but it has distinctive ...
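
As a rough illustration of the back-off behavior described in the introduction, the sketch below estimates P(w2 | w1) for unseen pairs from unigram counts. It is not Katz's exact scheme: Katz (1987) uses Good-Turing discounting, whereas this sketch swaps in a fixed absolute discount (the `discount` parameter) to keep the arithmetic short. The function names `train` and `backoff_prob` and the toy word pairs are hypothetical and do not come from the paper.

```python
import math
from collections import Counter

def train(pairs):
    """Count pairs, first words, and second words from (w1, w2) tuples."""
    pair_counts = Counter(pairs)
    first_counts = Counter(w1 for w1, _ in pairs)
    second_counts = Counter(w2 for _, w2 in pairs)
    return pair_counts, first_counts, second_counts

def backoff_prob(w1, w2, pair_counts, first_counts, second_counts, discount=0.5):
    """Simplified back-off estimate of P(w2 | w1).

    Assumption: absolute discounting stands in for the Good-Turing
    discounting used in Katz's scheme.
    """
    total = sum(second_counts.values())
    c1 = first_counts[w1]
    if c1 == 0:
        # w1 never seen as a first word: fall back to the unigram estimate.
        return second_counts[w2] / total
    c12 = pair_counts[(w1, w2)]
    if c12 > 0:
        # Seen pair: discounted relative frequency.
        return (c12 - discount) / c1
    # Unseen pair: share the mass freed by discounting among unseen
    # second words, in proportion to their unigram probability.
    seen = {b for (a, b) in pair_counts if a == w1}
    freed_mass = discount * len(seen) / c1
    unseen_mass = sum(second_counts[w] / total
                      for w in second_counts if w not in seen)
    if unseen_mass == 0:
        return 0.0
    return freed_mass * (second_counts[w2] / total) / unseen_mass

# Toy data (hypothetical): "pear" and "sofa" have the same unigram count,
# and neither has been seen after "eat".
pairs = [("eat", "apple"), ("eat", "apple"), ("eat", "bread"),
         ("buy", "pear"), ("buy", "sofa"), ("see", "pear"), ("see", "sofa")]
counts = train(pairs)
p_pear = backoff_prob("eat", "pear", *counts)
p_sofa = backoff_prob("eat", "sofa", *counts)
vocab = sorted(counts[2])
p_total = sum(backoff_prob("eat", w, *counts) for w in vocab)
print(p_pear, p_sofa)  # the two estimates are equal
```

Because the unseen-pair estimate depends only on the unigram count of the second word, "eat pear" and "eat sofa" receive the same probability even though one is intuitively far more plausible. This is the limitation that motivates looking at similarity between words instead.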
