QARLA: A Framework for the Evaluation of Text Summarization Systems

Enrique Amigó, Julio Gonzalo, Anselmo Peñas, Felisa Verdejo
Departamento de Lenguajes y Sistemas Informáticos
Universidad Nacional de Educación a Distancia
c/ Juan del Rosal 16 - 28040 Madrid - Spain
{enrique, julio, anselmo, felisa}@lsi.uned.es

Abstract

This paper presents a probabilistic framework, QARLA, for the evaluation of text summarisation systems. The input of the framework is a set of manual (reference) summaries, a set of baseline (automatic) summaries, and a set of similarity metrics between summaries. It provides (i) a measure to evaluate the quality of any set of similarity metrics, (ii) a measure to evaluate the quality of a summary using an optimal set of similarity metrics, and (iii) a measure to evaluate whether the set of baseline summaries is reliable or may produce biased results. Compared to previous approaches, our framework is able to combine different metrics and evaluate the quality of a set of metrics without any a-priori weighting of their relative importance. We provide quantitative evidence about the effectiveness of the approach to improve the automatic evaluation of text summarisation systems by combining several similarity metrics.

1 Introduction

The quality of an automatic summary can be established mainly with two approaches:

- Human assessments: The output of a number of summarisation systems is compared by human judges using some set of evaluation guidelines.

- Proximity to a gold standard: The best automatic summary is the one that is closest to some reference summary made by humans (see the sketch below).

Using human assessments has some clear advantages: the results of the evaluation are interpretable, and we can trace what a system is doing well and what it is doing poorly. But it also has a couple of serious drawbacks: (i) different human assessors reach different conclusions, and (ii) the outcome of a comparative evaluation exercise is not directly reusable for new techniques, i.e. a summarisation strategy developed after
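As a minimal sketch of the gold-standard approach described above: an automatic summary is scored by its similarity to a set of manual (reference) summaries, and the closest system wins. The unigram-overlap metric and the max-over-references scoring used here are hypothetical stand-ins chosen only for illustration; they are not the metrics or measures proposed in this paper.

```python
# Minimal sketch of "proximity to a gold standard" evaluation.
# The unigram-overlap similarity below is a hypothetical example metric,
# not a metric defined in this paper.

def unigram_overlap(candidate: str, reference: str) -> float:
    """Fraction of the reference's distinct words that also appear in the candidate."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    return len(cand & ref) / len(ref) if ref else 0.0

def gold_standard_score(candidate: str, references: list[str]) -> float:
    """Score a candidate summary by its proximity to the closest manual summary."""
    return max(unigram_overlap(candidate, r) for r in references)

# Usage: the automatic summary closest to a reference summary ranks highest.
references = ["the committee approved the budget after a long debate"]
systems = {
    "system_a": "the committee approved the budget",
    "system_b": "weather was sunny across the region",
}
ranking = sorted(systems, key=lambda s: gold_standard_score(systems[s], references),
                 reverse=True)
print(ranking)  # ['system_a', 'system_b']
```

The framework described in the abstract generalises this idea: instead of committing to a single similarity metric, it evaluates and combines a set of metrics without any a-priori weighting of their relative importance.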