Scientific report: "Evaluating CETEMPúblico, a free resource for Portuguese"

Diana Santos
SINTEF Tele og Data, Postboks 124 Blindern, N-0314 Oslo, Norway

Paulo Rocha
Departamento de Informática, Universidade do Minho, PT-4710-057 Braga, Portugal

Abstract

In this paper we present a thorough evaluation of a corpus resource for Portuguese, CETEMPúblico, a 180-million-word newspaper corpus free for R&D in Portuguese processing. We provide information that should be useful to those using the resource and that should allow considerable improvement in later versions. In addition, we think that the procedures presented can be of interest to the larger NLP community, since corpus evaluation and description is unfortunately not a common exercise.

1 Introduction

CETEMPúblico is a large corpus of European Portuguese newspaper language, available at no cost to the community dealing with the processing of Portuguese. It was created in the framework of the Computational Processing of Portuguese project, a government-funded initiative to foster language engineering of the Portuguese language. In evaluating this resource, we have two main goals in mind: to contribute to improving its usefulness, and to suggest ways of going about corpus evaluation in general, noting that most corpus projects are simply described and not evaluated.
1 CETEMPúblico stands for Corpus de Extractos de Textos Electrónicos MCT/Público, and its full reference is http cetempublico
2 See http

In fact, and despite the amount of research devoted to corpus processing nowadays, there is not much information about the actual corpora being processed, which may lead naive users and/or readers to conclude that this is not an interesting issue. In our opinion, that is the wrong conclusion. There is, in fact, a lot to be said about any particular corpus. We believe, in addition, that such information should be available when one is buying or even just
