Scientific report: "A Figure of Merit for the Evaluation of Web-Corpus Randomness"

Massimiliano Ciaramita
Institute of Cognitive Science and Technology, National Research Council, Roma, Italy

Marco Baroni
SSLMIT, Università di Bologna, Forlì, Italy
baroni@

Abstract

In this paper, we present an automated, quantitative, knowledge-poor method to evaluate the randomness of a collection of documents (corpus), with respect to a number of biased partitions. The method is based on the comparison of the word frequency distribution of the target corpus to word frequency distributions from corpora built in deliberately biased ways. We apply the method to the task of building a corpus via queries to Google. Our results indicate that this approach can be used reliably to discriminate biased and unbiased document collections and to choose the most appropriate query terms.

1 Introduction

The Web is a very rich source of linguistic data, and in the last few years it has been used intensively by linguists and language technologists for many tasks (Kilgarriff and Grefenstette, 2003). Among other uses, the Web allows fast and inexpensive construction of general purpose corpora, i.e., corpora that are not meant to represent a specific sub-language, but a language as a whole. There are several recent studies on the extent to which Web-derived corpora are comparable, in terms of variety of topics and styles, to traditional balanced corpora (Fletcher, 2004; Sharoff, 2006).
Our contribution in this paper is to present an automated, quantitative method to evaluate the variety or randomness (with respect to a number of non-random partitions) of a Web corpus. The more random (less biased towards specific partitions) a corpus is, the more suitable it should be as a general purpose corpus. We are not proposing a method to evaluate whether a sample of Web pages is a random sample of the Web, although this is a related issue (Bharat and Broder, 1998; Henzinger et al., 2000). Instead, we propose a method based on ...
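
The comparison described in the abstract, between the word frequency distribution of a target corpus and those of deliberately biased corpora, can be illustrated with a minimal sketch. This excerpt does not say which distance measure or smoothing the authors use, so the sketch assumes Jensen-Shannon divergence over whitespace-tokenized unigrams as a stand-in. The function names (word_distribution, js_divergence, distances_to_biased) and the toy documents are hypothetical and serve illustration only.

```python
import math
from collections import Counter


def word_distribution(docs):
    """Relative unigram frequencies over a list of document strings."""
    counts = Counter(w for d in docs for w in d.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions.

    Assumption: this measure is a stand-in chosen for the sketch; it is
    symmetric, bounded in [0, 1], and needs no smoothing for unseen words.
    """
    div = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        m = 0.5 * (pw + qw)
        if pw > 0:
            div += 0.5 * pw * math.log2(pw / m)
        if qw > 0:
            div += 0.5 * qw * math.log2(qw / m)
    return div


def distances_to_biased(target_docs, biased_corpora):
    """Distance from the target corpus to each deliberately biased corpus."""
    target = word_distribution(target_docs)
    return {
        name: js_divergence(target, word_distribution(docs))
        for name, docs in biased_corpora.items()
    }


# Toy data (hypothetical): a small target collection and two biased ones.
target_docs = [
    "the team won the match after the markets closed",
    "the bank raised rates and the striker scored twice",
]
biased_corpora = {
    "sports": ["the striker scored and the team won the match"],
    "finance": ["the bank raised rates and the markets closed lower"],
}

scores = distances_to_biased(target_docs, biased_corpora)
for name, score in sorted(scores.items()):
    print(f"{name}: {score:.3f}")
```

Under this reading, a target corpus that is not skewed towards any single partition would be expected to sit at a comparable distance from each biased corpus, whereas a target that lies much closer to one biased corpus than to the others would look biased in that direction. How the paper turns such distances into a figure of merit is not shown in this excerpt.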
