Large-Scale Syntactic Language Modeling with Treelets

Adam Pauls  Dan Klein
Computer Science Division
University of California, Berkeley
Berkeley, CA 94720, USA
{adpauls,klein}@cs.berkeley.edu

Abstract

We propose a simple generative, syntactic language model that conditions on overlapping windows of tree context (or treelets) in the same way that n-gram language models condition on overlapping windows of linear context. We estimate the parameters of our model by collecting counts from automatically parsed text using standard n-gram language model estimation techniques, allowing us to train a model on over one billion tokens of data using a single machine in a matter of hours. We evaluate on perplexity and a range of grammaticality tasks, and find that we perform as well as or better than n-gram models and other generative baselines. Our model even competes with state-of-the-art discriminative models hand-designed for the grammaticality tasks, despite training on positive data alone. We also show fluency improvements in a preliminary machine translation experiment.

1 Introduction

N-gram language models are a central component of all speech recognition and machine translation systems, and a great deal of research centers around refining models (Chen and Goodman, 1998), efficient storage (Pauls and Klein, 2011; Heafield, 2011), and integration into decoders (Koehn, 2004; Chiang, 2005). At the same time, because n-gram language models only condition on a local window of linear word-level context, they are poor models of long-range syntactic dependencies. Although several lines of work have proposed generative syntactic language models that improve on n-gram models for moderate amounts of data (Chelba, 1997; Xu et al., 2002; Charniak, 2001; Hall, 2004; Roark, 2004), these models have only recently been scaled to the impressive amounts of data routinely used by n-gram language models (Tan et al., 2011).

In this paper, we describe a generative syntactic language model that conditions on local context treelets in
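To make the analogy in the abstract concrete, the following is a minimal sketch (not the authors' implementation) of the standard count-based n-gram estimation the paper builds on: probabilities come from relative frequencies of overlapping fixed-size windows collected from a corpus. In the treelet model, the counted units would instead be fragments of tree context read off automatically parsed text, but the counting and normalization proceed the same way. The function names and the toy corpus here are illustrative assumptions.

from collections import Counter

def collect_ngram_counts(sentences, n=3):
    """Count overlapping n-word windows and their (n-1)-word histories."""
    ngram_counts = Counter()
    history_counts = Counter()
    for sent in sentences:
        tokens = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(len(tokens) - n + 1):
            window = tuple(tokens[i:i + n])
            ngram_counts[window] += 1
            history_counts[window[:-1]] += 1
    return ngram_counts, history_counts

def mle_prob(word, history, ngram_counts, history_counts):
    """Relative-frequency estimate of P(word | history); real systems smooth this."""
    h = tuple(history)
    if history_counts[h] == 0:
        return 0.0
    return ngram_counts[h + (word,)] / history_counts[h]

# Toy usage; in the paper's setting the counted units would be treelets
# extracted from automatically parsed text rather than word windows.
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]
ngrams, histories = collect_ngram_counts(corpus, n=3)
print(mle_prob("sat", ["the", "cat"], ngrams, histories))  # 0.5

Scaling this style of counting to a billion tokens is what standard n-gram toolkits are engineered for, which is why reusing their estimation machinery for treelets keeps training tractable on a single machine.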