Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability

Jonathan H. Clark, Chris Dyer, Alon Lavie, Noah A. Smith
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
{jhclark,cdyer,alavie,nasmith}@cs.cmu.edu

Abstract

In statistical machine translation, a researcher seeks to determine whether some innovation (e.g., a new feature, model, or inference algorithm) improves translation quality in comparison to a baseline system. To answer this question, he runs an experiment to evaluate the behavior of the two systems on held-out data. In this paper, we consider how to make such experiments more statistically reliable. We provide a systematic analysis of the effects of optimizer instability (an extraneous variable that is seldom controlled for) on experimental outcomes, and make recommendations for reporting results more accurately.

1 Introduction

The need for statistical hypothesis testing for machine translation (MT) has been acknowledged since at least Och (2003). In that work, the proposed method was based on bootstrap resampling and was designed to improve the statistical reliability of results by controlling for randomness across test sets.
However, there is no consistently used strategy that controls for the effects of unstable estimates of model parameters.1 While the existence of optimizer instability is an acknowledged problem, it is only infrequently discussed in relation to the reliability of experimental results, and, to our knowledge, there has yet to be a systematic study of its effects on hypothesis testing.

1 We hypothesize that the convention of "trusting" BLEU score improvements of, e.g., 1 is not merely due to an appreciation of what qualitative difference a particular quantitative improvement will have, but also an implicit awareness that current methodology leads to results that are not consistently reproducible.

In this paper, we present a series of experiments demonstrating that optimizer instability