Intelligent Selection of Language Model Training Data

Robert C. Moore    William Lewis
Microsoft Research
Redmond, WA 98052, USA
{bobmoore, wilewis}@

Abstract

We address the problem of selecting non-domain-specific language model training data to build auxiliary language models for use in tasks such as machine translation. Our approach is based on comparing the cross-entropy, according to domain-specific and non-domain-specific language models, for each sentence of the text source used to produce the latter language model. We show that this produces better language models, trained on less data, than both random data selection and two other previously proposed methods.

1 Introduction

Statistical N-gram language models are widely used in applications that produce natural-language text as output, particularly speech recognition and machine translation. It seems to be a universal truth that output quality can always be improved by using more language model training data, but only if the training data is reasonably well-matched to the desired output. This presents a problem, because in virtually any particular application the amount of in-domain data is limited. Thus it has become standard practice to combine in-domain data with other data, either by combining N-gram counts from in-domain and other data (usually weighting the counts in some way), or by building separate language models from different data sources and interpolating the language model probabilities either linearly or log-linearly. Log-linear interpolation is particularly popular in statistical machine translation (e.g., Brants et al., 2007), because the interpolation weights can easily be discriminatively trained to optimize an end-to-end translation objective function, such as BLEU, by making the log probability according to each language model a separate feature function in the overall translation model. The normal practice when using multiple language models in machine translation seems to be to train models on as much […]
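To make the selection criterion from the abstract concrete: one natural reading of the cross-entropy comparison is to score each candidate sentence by its per-word cross-entropy under the in-domain language model minus its per-word cross-entropy under a model of the general corpus, keeping sentences where that difference is small. The sketch below is a toy rendering of that idea, not the paper's implementation: the unigram models with add-one smoothing, the function names, and the threshold default are illustrative assumptions (the paper works with higher-order N-gram models).

import math
from collections import Counter

def train_unigram(sentences):
    # Toy unigram LM with add-one smoothing; a stand-in for the
    # N-gram models the paper actually uses.
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen words
    return lambda w: (counts[w] + 1) / (total + vocab)

def cross_entropy(model, sentence):
    # Per-word cross-entropy (in bits) of a sentence under the model.
    words = sentence.split()
    return -sum(math.log2(model(w)) for w in words) / max(len(words), 1)

def select(candidates, in_domain, general, threshold=0.0):
    # Keep candidates that look more like the in-domain text than like
    # the general corpus, i.e. low H_in(s) - H_gen(s).
    p_in = train_unigram(in_domain)
    p_gen = train_unigram(general)
    return [s for s in candidates
            if cross_entropy(p_in, s) - cross_entropy(p_gen, s) < threshold]

For instance, select(web_sentences, domain_corpus, web_sample) (all hypothetical names) would keep the web sentences whose cross-entropy difference favors the in-domain model; sweeping the threshold trades selected-corpus size against domain fit.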
