Scientific report: "A Figure of Merit for the Evaluation of Web-Corpus Randomness"

Massimiliano Ciaramita
Institute of Cognitive Science and Technology, National Research Council, Roma, Italy

Marco Baroni
SSLMIT, Università di Bologna, Forlì, Italy
baroni@

Abstract

In this paper, we present an automated, quantitative, knowledge-poor method to evaluate the randomness of a collection of documents (corpus), with respect to a number of biased partitions. The method is based on the comparison of the word frequency distribution of the target corpus to word frequency distributions from corpora built in deliberately biased ways. We apply the method to the task of building a corpus via queries to Google. Our results indicate that this approach can be used reliably to discriminate biased and unbiased document collections and to choose the most appropriate query terms.

1 Introduction

The Web is a very rich source of linguistic data, and in the last few years it has been used intensively by linguists and language technologists for many tasks (Kilgarriff and Grefenstette, 2003). Among other uses, the Web allows fast and inexpensive construction of "general purpose" corpora, i.e., corpora that are not meant to represent a specific sub-language, but a language as a whole. There are several recent studies on the extent to which Web-derived corpora are comparable, in terms of variety of topics and styles, to traditional "balanced" corpora (Fletcher, 2004; Sharoff, 2006).

Our contribution in this paper is to present an automated, quantitative method to evaluate the "variety" or "randomness" (with respect to a number of non-random partitions) of a Web corpus. The more random (less biased towards specific partitions) a corpus is, the more it should be suitable as a general purpose corpus. We are not proposing a method to evaluate whether a sample of Web pages is a random sample of the Web, although this is a related issue (Bharat and Broder, 1998; Henzinger et al., 2000). Instead, we propose a method based on ...
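The core idea stated in the abstract, comparing the word frequency distribution of a target corpus with those of deliberately biased corpora, can be illustrated with a minimal sketch. The excerpt does not say which distance measure or smoothing the authors use, so the Jensen-Shannon divergence below is an assumption chosen only because it is symmetric and bounded. The function names (`word_distribution`, `jensen_shannon`, `distances_to_biased`) and the whitespace tokenization are likewise hypothetical.

```python
import math
from collections import Counter


def word_distribution(documents):
    """Relative unigram frequencies over a list of document strings.

    Tokenization is naive lowercase whitespace splitting (an assumption).
    """
    counts = Counter(word for doc in documents for word in doc.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions.

    Assumed stand-in measure, not taken from the paper. Returns a value
    in [0, 1]: 0 for identical distributions, 1 for disjoint vocabularies.
    """
    divergence = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mean = 0.5 * (pw + qw)
        if pw > 0:
            divergence += 0.5 * pw * math.log2(pw / mean)
        if qw > 0:
            divergence += 0.5 * qw * math.log2(qw / mean)
    return divergence


def distances_to_biased(target_docs, biased_corpora):
    """Distance from the target corpus to each named biased corpus."""
    target = word_distribution(target_docs)
    return {
        name: jensen_shannon(target, word_distribution(docs))
        for name, docs in biased_corpora.items()
    }


# Toy data, invented purely for illustration.
target_docs = ["the cat sat on the mat", "stocks fell on the news today"]
biased_corpora = {
    "sports": ["the team won the match", "the striker scored a goal"],
    "finance": ["stocks fell today", "the market closed lower on the news"],
}
scores = distances_to_biased(target_docs, biased_corpora)
```

Under this sketch, a target corpus that sits markedly closer to one biased corpus than to the others would be suspected of sharing that bias. How such distances are turned into a single figure of merit is not described in the excerpt and is left open here.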
