Scientific report: "Evaluating CETEMPúblico, a free resource for Portuguese"


Diana Santos
SINTEF Tele og Data, Postboks 124 Blindern, N-0314 Oslo, Norway
Diana.Santos@informatics.sintef.no

Paulo Rocha
Departamento de Informática, Universidade do Minho, PT-4710-057 Braga, Portugal
Paulo.Rocha@alfa.di.uminho.pt

Abstract

In this paper we present a thorough evaluation of a corpus resource for Portuguese, CETEMPúblico, a 180 million word newspaper corpus free for R&D in Portuguese processing. We provide information that should be useful to those using the resource, and that should lead to considerable improvement in later versions. In addition, we think that the procedures presented can be of interest to the larger NLP community, since corpus evaluation and description is unfortunately not a common exercise.

1 Introduction

CETEMPúblico is a large corpus of European Portuguese newspaper language, available at no cost to the community dealing with the processing of Portuguese.[1] It was created in the framework of the Computational Processing of Portuguese project, a government-funded initiative to foster language engineering of the Portuguese language.[2]

In evaluating this resource, we have two main goals in mind: to help improve its usefulness, and to suggest ways of approaching corpus evaluation in general (noting that most corpus projects are simply described and not evaluated).
In fact, and despite the amount of research devoted to corpus processing nowadays, there is not much information about the actual corpora being processed, which may lead naive users and/or readers to conclude that this is not an interesting issue. In our opinion, that is the wrong conclusion. There is, in fact, a lot to be said about any particular corpus. We believe, in addition, that such information should be available when one is buying or even just

[1] CETEMPúblico stands for Corpus de Extractos de Textos Electrónicos MCT/Público, and its full reference is http://cgi.portugues.mct.pt/cetempublico
[2] See http://www.portugues.mct.pt
