Research paper: "Stochastic Iterative Alignment for Machine Translation Evaluation"

Stochastic Iterative Alignment for Machine Translation Evaluation

Ding Liu and Daniel Gildea
Department of Computer Science, University of Rochester, Rochester, NY 14627

Abstract

A number of metrics for automatic evaluation of machine translation have been proposed in recent years, with some metrics focusing on measuring the adequacy of MT output, and other metrics focusing on fluency. Adequacy-oriented metrics such as BLEU measure the n-gram overlap of MT outputs and their references, but do not represent sentence-level information. In contrast, fluency-oriented metrics such as ROUGE-W compute longest common subsequences, but ignore words not aligned by the LCS. We propose a metric based on stochastic iterative string alignment (SIA), which aims to combine the strengths of both approaches. We compare SIA with existing metrics and find that it outperforms them in overall evaluation and works especially well in fluency evaluation.

1 Introduction

Evaluation has long been a stumbling block in the development of machine translation systems, due to the simple fact that there are many correct translations for a given sentence. Human evaluation of system output is costly in both time and money, leading to the rise of automatic evaluation metrics in recent years. In the 2003 Johns Hopkins Workshop on Speech and Language Engineering, experiments on MT evaluation showed that BLEU and NIST do not correlate well with human judgments at the sentence level, even when they correlate well over large test sets (Blatz et al., 2003). Liu and Gildea (2005) also pointed out that, due to the limited references for every MT output, using the overlapping ratio of n-grams longer than 2 did not improve the sentence-level evaluation performance of BLEU. The problem leads to an even worse result in BLEU's fluency evaluation, which is supposed to rely on the long n-grams. In order to improve sentence-level evaluation performance, several metrics have been proposed, including ROUGE-W and ROUGE-S (Lin and Och, 2004) and METEOR (Banerjee and Lavie, 2005).
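To make the contrast in the abstract concrete, here is a minimal Python sketch (not from the paper) of the two quantities being compared: clipped n-gram precision, the per-n component inside BLEU, and longest-common-subsequence length, the quantity underlying ROUGE-L/ROUGE-W. Note that ROUGE-W additionally weights consecutive matches, which this sketch omits; the function names and example sentences are illustrative only.

    from collections import Counter

    def ngram_precision(candidate, reference, n):
        """Clipped n-gram precision for one candidate/reference pair,
        as computed (per n) inside BLEU."""
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        # Each candidate n-gram is credited at most as often as it
        # appears in the reference ("clipping").
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        return overlap / max(sum(cand.values()), 1)

    def lcs_length(candidate, reference):
        """Longest common subsequence length via standard dynamic
        programming; the basis of ROUGE-L (ROUGE-W adds a weight
        favoring consecutive matches)."""
        m, n = len(candidate), len(reference)
        table = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if candidate[i - 1] == reference[j - 1]:
                    table[i][j] = table[i - 1][j - 1] + 1
                else:
                    table[i][j] = max(table[i - 1][j], table[i][j - 1])
        return table[m][n]

    if __name__ == "__main__":
        cand = "the cat sat on the mat".split()
        ref = "the cat was sitting on the mat".split()
        print(ngram_precision(cand, ref, 2))  # 0.6: local overlap, no global order
        print(lcs_length(cand, ref))          # 5: sentence-level order, unaligned words ignored

The example exhibits the complementary weaknesses the paper targets: bigram precision rewards local matches regardless of sentence-level structure, while the LCS respects word order but gives no credit to words outside the single aligned subsequence.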
