Scientific Paper: "Comparing Automatic and Human Evaluation of NLG Systems"

Anja Belz, Natural Language Technology Group, CMIS, University of Brighton, UK
Ehud Reiter, Dept of Computing Science, University of Aberdeen, UK (ereiter@)

Abstract

We consider the evaluation problem in Natural Language Generation (NLG) and present results for evaluating several NLG systems with similar functionality, including a knowledge-based generator and several statistical systems. We compare evaluation results for these systems by human domain experts, human non-experts, and several automatic evaluation metrics, including NIST, BLEU, and ROUGE. We find that NIST scores correlate best with human judgments, but that all automatic metrics we examined are biased in favour of generators that select on the basis of frequency alone. We conclude that automatic evaluation of NLG systems has considerable potential, in particular where high-quality reference texts and only a small number of human evaluators are available. However, in general it is probably best for automatic evaluations to be supported by human-based evaluations, or at least by studies that demonstrate that a particular metric correlates well with human judgments in a given domain.

1 Introduction

Evaluation is becoming an increasingly important topic in Natural Language Generation (NLG), as in other fields of computational linguistics. Some NLG researchers are impressed by the success of the BLEU evaluation metric (Papineni et al., 2002) in Machine Translation (MT), which has transformed the MT field by allowing researchers to quickly and cheaply evaluate the impact of new ideas, algorithms, and data sets. BLEU and related metrics work by comparing the output of an MT system to a set of reference (gold-standard) translations, and in principle this kind of evaluation could be done with NLG systems as well. Indeed, NLG researchers are already starting to use BLEU (Habash, 2004; Belz, 2005) in their evaluations, as this is much cheaper and easier to…
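As a concrete illustration of the n-gram comparison described above, the following is a minimal, self-contained Python sketch of a BLEU-style score. It is a deliberate simplification for illustration only (a single reference, uniform n-gram weights, no smoothing) rather than the official BLEU implementation of Papineni et al. (2002), and the example sentences are invented.

    # Minimal sketch of BLEU-style scoring: clipped n-gram precision
    # against a reference text, combined with a brevity penalty.
    # Simplified for illustration: one reference, no smoothing.
    import math
    from collections import Counter

    def ngrams(tokens, n):
        """All contiguous n-grams of a token list."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def bleu(candidate, reference, max_n=4):
        """Geometric mean of clipped 1..max_n-gram precisions, times a brevity penalty."""
        cand, ref = candidate.split(), reference.split()
        log_precision_sum = 0.0
        for n in range(1, max_n + 1):
            cand_counts = Counter(ngrams(cand, n))
            ref_counts = Counter(ngrams(ref, n))
            # Clip each candidate n-gram count by its count in the reference,
            # so repeating a matching n-gram cannot inflate the score.
            matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
            total = sum(cand_counts.values())
            if matches == 0:
                return 0.0  # no overlap at this n-gram order (unsmoothed BLEU)
            log_precision_sum += math.log(matches / total)
        # Brevity penalty discourages very short candidates that would
        # otherwise achieve high precision trivially.
        bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
        return bp * math.exp(log_precision_sum / max_n)

    print(bleu("the cat sat on the mat", "the cat sat on a mat"))  # ~0.54

In a study like the one described here, such metric scores for each generator's outputs would then be set against human expert and non-expert ratings, e.g. via correlation coefficients, to test how well the automatic metric tracks human judgment.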
