A Model of Lexical Attraction and Repulsion*

Doug Beeferman, Adam Berger, John Lafferty
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
dougb, aberger, lafferty

Abstract

This paper introduces new methods based on exponential families for modeling the correlations between words in text and speech. While previous work assumed the effects of word co-occurrence statistics to be constant over a window of several hundred words, we show that their influence is nonstationary on a much smaller time scale. Empirical data drawn from English and Japanese text, as well as conversational speech, reveals that the "attraction" between words decays exponentially, while stylistic and syntactic constraints create a "repulsion" between words that discourages close co-occurrence. We show that these characteristics are well described by simple mixture models based on two-stage exponential distributions, which can be trained using the EM algorithm. The resulting distance distributions can then be incorporated as penalizing features in an exponential language model.

1 Introduction

One of the fundamental characteristics of language, viewed as a stochastic process, is that it is highly nonstationary.
Throughout a written document, and during the course of spoken conversation, the topic evolves, affecting local statistics on word occurrences. The standard trigram model disregards this nonstationarity, as does any stochastic grammar which assigns probabilities to sentences in a context-independent fashion.

* Research supported in part by NSF grant IRI-9314969, DARPA AASERT award DAAH04-95-1-0475, and the ATR Interpreting Telecommunications Research Laboratories.

Stationary models are used to describe such a dynamic source for at least two reasons. The first is convenience: stationary models require a relatively small amount of computation to train and to apply. The second is ignorance: we know so little about how to model effectively the nonstationary characteristics of language
