Bucking the Trend: Large-Scale Cost-Focused Active Learning for Statistical Machine Translation

Michael Bloodgood
Human Language Technology Center of Excellence
Johns Hopkins University
Baltimore MD 21211
bloodgood@

Chris Callison-Burch
Center for Language and Speech Processing
Johns Hopkins University
Baltimore MD 21211
ccb@

Abstract

We explore how to improve machine translation systems by adding more translation data in situations where we already have substantial resources. The main challenge is how to buck the trend of diminishing returns that is commonly encountered. We present an active learning-style data solicitation algorithm to meet this challenge. We test it, gathering annotations via Amazon Mechanical Turk, and find that we get an order of magnitude increase in performance rates of improvement.

1 Introduction

Figure 1 shows the learning curves for two state-of-the-art statistical machine translation (SMT) systems for Urdu-English translation. Observe how the learning curves rise rapidly at first, but then a trend of diminishing returns occurs: put simply, the curves flatten. This paper investigates whether we can buck the trend of diminishing returns and, if so, how we can do it effectively. Active learning (AL) has been applied to SMT recently (Haffari et al., 2009; Haffari and Sarkar, 2009), but those studies were interested in starting with a tiny seed set of data, and they stopped their investigations after adding only a relatively tiny amount of data, as depicted in Figure 1.

In contrast, we are interested in applying AL when a large amount of data already exists, as is the case for many important language pairs. We develop an AL algorithm that focuses on keeping annotation costs (measured by time in seconds) low. It succeeds in doing this by only soliciting translations for parts of sentences. We show that this gets a savings in human annotation time above and beyond what the reduction in words annotated would have indicated by a factor of about three and .
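
The excerpt above does not spell out how the parts of sentences are chosen, so the following is only an illustrative sketch and should not be read as the authors' algorithm. It assumes one plausible selection rule: collect the n-grams in an untranslated pool that never occur in the existing training data, and request translations for the most frequent ones first. The function names `ngrams` and `select_phrases`, and the parameters `max_n` and `budget`, are hypothetical.

```python
from collections import Counter


def ngrams(tokens, max_n):
    """Yield every n-gram (as a tuple) of length 1..max_n from a token list."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def select_phrases(training_sentences, pool_sentences, max_n=3, budget=5):
    """Pick short phrases to send to annotators (assumed selection rule).

    Assumption: a phrase is worth soliciting if it does not appear anywhere
    in the source side of the existing training data; more frequent unseen
    phrases are requested first. Whitespace tokenization is a simplification.
    """
    # N-grams already covered by the existing training data.
    seen = set()
    for sentence in training_sentences:
        seen.update(ngrams(sentence.split(), max_n))

    # Count how often each uncovered n-gram occurs in the untranslated pool.
    counts = Counter()
    for sentence in pool_sentences:
        for gram in ngrams(sentence.split(), max_n):
            if gram not in seen:
                counts[gram] += 1

    # Most frequent first; ties are broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [" ".join(gram) for gram, _ in ranked[:budget]]


if __name__ == "__main__":
    training = ["the cat sat"]
    pool = ["the dog sat", "the dog ran"]
    print(select_phrases(training, pool, max_n=2, budget=3))
```

Under this assumed rule, annotators only see short phrases rather than whole sentences, which is one way the number of words annotated could be kept low.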
