CONTEXTUAL WORD SIMILARITY AND ESTIMATION FROM SPARSE DATA

Ido Dagan
AT&T Bell Laboratories
600 Mountain Avenue, Murray Hill, NJ 07974
dagan@research.att.com

Shaul Marcus
Computer Science Department, Technion, Haifa 32000, Israel
shaul@cs.technion.ac.il

Shaul Markovitch
Computer Science Department, Technion, Haifa 32000, Israel
shaulm@cs.technion.ac.il

Abstract

In recent years there is much interest in word cooccurrence relations, such as n-grams, verb-object combinations, or cooccurrence within a limited context. This paper discusses how to estimate the probability of cooccurrences that do not occur in the training data. We present a method that makes local analogies between each specific unobserved cooccurrence and other cooccurrences that contain similar words, as determined by an appropriate word similarity metric. Our evaluation suggests that this method performs better than existing smoothing methods, and may provide an alternative to class-based models.

1 Introduction

Statistical data on word cooccurrence relations play a major role in many corpus-based approaches for natural language processing. Different types of cooccurrence relations are in use, such as cooccurrence within a consecutive sequence of words (n-grams), within syntactic relations (verb-object, adjective-noun, etc.), or the cooccurrence of two words within a limited distance in the context.
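To make the notion of a cooccurrence relation concrete, the following is a minimal sketch of counting ordered word pairs within a limited distance, and of why pairs that never occur in the training data are a problem. It is an illustration only, not the estimation method of this paper: the function names `window_cooccurrences` and `mle_probability`, the `max_distance` parameter, and the toy sentence are all assumptions introduced here. With `max_distance=1` the counts reduce to bigrams.

```python
from collections import Counter


def window_cooccurrences(tokens, max_distance=2):
    """Count ordered pairs (w1, w2) where w2 follows w1 within max_distance tokens.

    Hypothetical helper for illustration; max_distance=1 yields plain bigrams.
    """
    counts = Counter()
    for i, w1 in enumerate(tokens):
        # Look ahead at most max_distance positions, staying inside the sequence.
        for j in range(i + 1, min(i + 1 + max_distance, len(tokens))):
            counts[(w1, tokens[j])] += 1
    return counts


def mle_probability(counts, pair):
    """Relative frequency of a pair among all counted pairs (maximum likelihood)."""
    total = sum(counts.values())
    return counts[pair] / total if total else 0.0


# Toy training data (an assumption, not taken from the paper).
tokens = "the cat sat on the mat".split()
bigrams = window_cooccurrences(tokens, max_distance=1)
windowed = window_cooccurrences(tokens, max_distance=2)

# An observed pair gets a nonzero estimate; an unobserved pair gets exactly zero,
# even though it may be perfectly plausible in the language.
p_seen = mle_probability(bigrams, ("the", "cat"))
p_unseen = mle_probability(bigrams, ("cat", "mat"))
```

Under this plain relative-frequency estimate every unobserved pair receives probability zero, which is the situation that smoothing and similarity-based estimation are meant to address.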
Statistical data about these various cooccurrence relations is employed for a variety of applications, such as speech recognition (Jelinek, 1990), language generation (Smadja and McKeown, 1990), lexicography (Church and Hanks, 1990), machine translation (Brown et al.; Sadler, 1989), information retrieval (Maarek and Smadja, 1989), and various disambiguation tasks (Dagan et al., 1991; Hindle and Rooth, 1991; Grishman et al., 1986; Dagan and Itai, 1990). A major problem for the above applications is how to estimate the probability of cooccurrences that were not observed in the training corpus. Due to data sparseness in unrestricted language, the aggregate .