Scientific paper: "Scaling to Very Very Large Corpora for Natural Language Disambiguation"

Scaling to Very Very Large Corpora for Natural Language Disambiguation

Michele Banko and Eric Brill
Microsoft Research
1 Microsoft Way
Redmond, WA 98052, USA
{mbanko, brill}@microsoft.com

Abstract

The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested, and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that for this particular application, correctly labeled training data is free. Since this will often not be the case, we examine methods for effectively exploiting very large corpora when labeled data comes at a cost.

1 Introduction

Machine learning techniques, which automatically learn linguistic information from online text corpora, have been applied to a number of natural language problems throughout the last decade. A large percentage of papers published in this area involve comparisons of different learning approaches trained and tested with commonly used corpora.
While the amount of available online text has been increasing at a dramatic rate, the size of training corpora typically used for learning has not. In part, this is due to the standardization of data sets used within the field, as well as the potentially large cost of annotating data for those learning methods that rely on labeled text. The empirical NLP community has put substantial effort into evaluating the performance of a large number of machine learning methods over fixed and relatively small data sets. Yet since we now have access to significantly more data, one has to wonder whether conclusions that have been drawn on small data sets carry over when these learning methods are trained using much larger corpora. In this paper, we present a study of the effects of data size on machine learning for natural language disambiguation. In particular, we study the problem of selection among confusable words, using orders of magnitude more training data than has ever been applied to this problem. First, we show learning curves for four different machine learning algorithms.
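To illustrate why labeled training data is free for confusion set disambiguation: any well-formed text supplies labeled examples, because each occurrence of a confusion-set member can be treated as the correct label for its surrounding context. The following is a minimal sketch of this idea, not the paper's actual experimental setup; the confusion set, context window size, and simple Naive Bayes learner are illustrative choices.

```python
import math
from collections import Counter

# Hypothetical confusion set: words writers commonly interchange.
CONFUSION_SET = {"then", "than"}

def extract_examples(text, confusion_set=CONFUSION_SET, window=2):
    """Harvest free labeled examples from well-formed text: for each
    occurrence of a confusion-set word, the surrounding words are the
    features and the word actually used is the label."""
    tokens = text.lower().split()
    examples = []
    for i, tok in enumerate(tokens):
        if tok in confusion_set:
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            examples.append((context, tok))
    return examples

class NaiveBayesDisambiguator:
    """Naive Bayes over bag-of-context-words features, with add-one smoothing."""

    def __init__(self, confusion_set):
        self.labels = sorted(confusion_set)
        self.label_counts = Counter()
        self.feature_counts = {label: Counter() for label in self.labels}
        self.vocab = set()

    def train(self, examples):
        for context, label in examples:
            self.label_counts[label] += 1
            for w in context:
                self.feature_counts[label][w] += 1
                self.vocab.add(w)

    def predict(self, context):
        total = sum(self.label_counts.values())
        best, best_score = None, float("-inf")
        for label in self.labels:
            # Smoothed log prior plus smoothed log likelihood of each context word.
            score = math.log((self.label_counts[label] + 1) / (total + len(self.labels)))
            denom = sum(self.feature_counts[label].values()) + len(self.vocab) + 1
            for w in context:
                score += math.log((self.feature_counts[label][w] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best
```

Because the labels come for free, scaling the training set is only a matter of feeding more raw text through `extract_examples`; no human annotation step limits corpus size.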
