Scientific report: "Prediction of Learning Curves in Machine Translation"

Prasanth Kolachina, Nicola Cancedda, Marc Dymetman, Sriram Venkatapathy
LTRC, IIIT-Hyderabad, Hyderabad, India
Xerox Research Centre Europe, 6 chemin de Maupertuis, 38240 Meylan, France

Abstract

Parallel data in the domain of interest is the key resource when training a statistical machine translation (SMT) system for a specific purpose. Since ad-hoc manual translation can represent a significant investment in time and money, a prior assessment of the amount of training data required to achieve a satisfactory accuracy level can be very useful. In this work, we show how to predict what the learning curve would look like if we were to manually translate increasing amounts of data. We consider two scenarios: (1) monolingual samples in the source and target languages are available, and (2) an additional small amount of parallel corpus is also available. We propose methods for predicting learning curves in both these scenarios.

1 Introduction

Parallel data in the domain of interest is the key resource when training a statistical machine translation (SMT) system for a specific business purpose. In many cases it is possible to allocate some budget for manually translating a limited sample of relevant documents, be it via professional translation services or through increasingly fashionable crowdsourcing.
However, it is often difficult to predict how much training data will be required to achieve satisfactory translation accuracy, which prevents sound provisional budgeting. This prediction, or more generally the prediction of the learning curve of an SMT system as a function of available in-domain parallel data, is the objective of this paper. We consider two scenarios, representative of realistic situations.

1. In the first scenario (S1), the SMT developer is given only monolingual source and target samples from the relevant domain, and a small test parallel corpus.

This research was carried out during an internship at Xerox Research Centre Europe.
