Generalized Interpolation in Decision Tree LM

Denis Filimonov†‡ (den@cs.umd.edu)
Mary Harper† (mharper@umd.edu)
‡Human Language Technology Center of Excellence, Johns Hopkins University
†Department of Computer Science, University of Maryland, College Park

Abstract

In the face of sparsity, statistical models are often interpolated with lower order (backoff) models, particularly in Language Modeling. In this paper, we argue that there is a relation between the higher order and the backoff model that must be satisfied in order for the interpolation to be effective. We show that in n-gram models, the relation is trivially held, but in models that allow arbitrary clustering of context (such as decision tree models), this relation is generally not satisfied. Based on this insight, we also propose a generalization of linear interpolation which significantly improves the performance of a decision tree language model.

1 Introduction

A prominent use case for Language Models (LMs) in NLP applications such as Automatic Speech Recognition (ASR) and Machine Translation (MT) is selection of the most fluent word sequence among multiple hypotheses. Statistical LMs formulate the problem as the computation of the model's probability to generate the word sequence $w_1 w_2 \ldots w_m \equiv w_1^m$, assuming that higher probability corresponds to more fluent hypotheses. LMs are often represented in the following generative form:

$$p(w_1^m) = \prod_{i=1}^{m} p(w_i \mid w_1^{i-1})$$

In the following discussion, we will refer to the function $p(w_i \mid w_1^{i-1})$ as a language model.
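The generative form above can be illustrated with a minimal Python sketch that scores a word sequence by summing the log of each conditional probability, then picks the highest-scoring hypothesis. This is only an illustration under stated assumptions, not the authors' implementation: the names `sequence_log_prob`, `most_fluent`, `toy_cond_prob`, and `TOY_TABLE` are hypothetical, the probabilities are invented, and the toy model looks only at the previous word to keep the table small.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface: cond_prob(word, history) returns p(word | history).
CondProb = Callable[[str, Sequence[str]], float]


def sequence_log_prob(words: Sequence[str], cond_prob: CondProb) -> float:
    """Chain rule in log space: log p(w_1..w_m) = sum_i log p(w_i | w_1..w_{i-1})."""
    total = 0.0
    for i, word in enumerate(words):
        # words[:i] is the full history w_1..w_{i-1} (empty for the first word).
        total += math.log(cond_prob(word, words[:i]))
    return total


def most_fluent(hypotheses: Sequence[List[str]], cond_prob: CondProb) -> List[str]:
    """Select the hypothesis to which the model assigns the highest probability."""
    return max(hypotheses, key=lambda h: sequence_log_prob(h, cond_prob))


# Toy conditional model with invented numbers (assumption, not from the paper).
# Keys are (previous word or None at sequence start, current word).
TOY_TABLE = {
    (None, "the"): 0.5, (None, "cat"): 0.5,
    ("the", "cat"): 0.8, ("the", "the"): 0.2,
    ("cat", "the"): 0.3, ("cat", "cat"): 0.7,
}


def toy_cond_prob(word: str, history: Sequence[str]) -> float:
    # Simplification for the sketch: ignore everything but the last history word.
    prev = history[-1] if history else None
    return TOY_TABLE[(prev, word)]


if __name__ == "__main__":
    hyps = [["cat", "the"], ["the", "cat"]]
    for h in hyps:
        print(h, sequence_log_prob(h, toy_cond_prob))
    print("selected:", most_fluent(hyps, toy_cond_prob))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and it leaves the ranking of hypotheses unchanged because the logarithm is monotonic.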
Note that the context space for this function, $w_1^{i-1}$, is arbitrarily long, necessitating some independence assumption, which usually consists of reducing the relevant context to the $n-1$ immediately preceding tokens:

$$p(w_i \mid w_1^{i-1}) \approx p(w_i \mid w_{i-n+1}^{i-1})$$

These distributions are typically estimated from observed counts of n-grams $w_{i-n+1}^{i}$ in the training data. The context space is still far too large; therefore, the models are recursively smoothed using lower order distributions. For instance, in a widely used n-gram LM, the .