Scientific report: "An efficient algorithm for building a distributional thesaurus (and other Sketch Engine developments)"

Pavel Rychlý, Masaryk University, Brno, Czech Republic (pary@ z)
Adam Kilgarriff, Lexical Computing Ltd, Brighton, UK (adam@)

Abstract

Gorman and Curran (2006) argue that thesaurus generation for billion+-word corpora is problematic as the full computation takes many days. We present an algorithm with which the computation takes under two hours. We have created, and made publicly available, thesauruses based on large corpora for (at time of writing) seven major world languages. The development is implemented in the Sketch Engine (Kilgarriff et al., 2004).

Another innovative development in the same tool is the presentation of the grammatical behaviour of a word against the background of how all other words of the same word class behave. Thus the English noun "constraint" occurs 75% in the plural. Is this a salient lexical fact? To form a judgement, we need to know the distribution for all nouns. We use histograms to present the distribution in a way that is easy to grasp.

1 Thesaurus creation

Over the last ten years, interest has been growing in distributional thesauruses (hereafter simply "thesauruses"). Following initial work by Sparck Jones (1964) and Grefenstette (1994), an early online distributional thesaurus presented in Lin (1998) has been widely used and cited, and numerous authors since have explored thesaurus properties and parameters (see the survey component of Weeds and Weir, 2005).
A thesaurus is created by taking a corpus, identifying contexts for each word, and identifying which words share contexts. For each word, the words that share most contexts (according to some statistic which also takes account of their frequency) are its nearest neighbours. Thesauruses generally improve in accuracy with corpus size. The larger the corpus, the more clearly the signal (of similar words) will be distinguished from the noise (of words that just happen to share a few contexts). Lin's was
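The three steps described above (take a corpus, identify contexts for each word, find words that share contexts) can be illustrated with a minimal Python sketch. It is a naive all-pairs computation on a toy corpus, not the efficient algorithm the paper presents. The weighting (positive pointwise mutual information) and the similarity measure (weighted Jaccard overlap) are assumptions chosen for illustration, standing in for "some statistic which also takes account of frequency". The function name `build_thesaurus` and the context labels such as `object_of:drink` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_thesaurus(pairs, top_n=3):
    """Naive distributional thesaurus from (word, context) tokens.

    Assumed for illustration: positive PMI weights and a weighted
    Jaccard similarity. Every pair of words is compared directly.
    """
    pair_freq = Counter(pairs)
    word_freq = Counter()
    ctx_freq = Counter()
    for (word, ctx), n in pair_freq.items():
        word_freq[word] += n
        ctx_freq[ctx] += n
    total = sum(pair_freq.values())

    # Weight each (word, context) pair so that frequent, uninformative
    # contexts count for less; keep only positive associations.
    weights = defaultdict(dict)
    for (word, ctx), n in pair_freq.items():
        pmi = math.log2(n * total / (word_freq[word] * ctx_freq[ctx]))
        if pmi > 0:
            weights[word][ctx] = pmi

    def similarity(a, b):
        wa, wb = weights[a], weights[b]
        shared = set(wa) & set(wb)
        overlap = sum(min(wa[c], wb[c]) for c in shared)
        union = sum(wa.values()) + sum(wb.values()) - overlap
        return overlap / union if union else 0.0

    words = sorted(weights)
    thesaurus = {}
    for w in words:
        scored = [(similarity(w, v), v) for v in words if v != w]
        scored = [(s, v) for s, v in scored if s > 0]
        scored.sort(key=lambda sv: (-sv[0], sv[1]))
        thesaurus[w] = [v for _, v in scored[:top_n]]
    return thesaurus


# Toy "corpus": each tuple is one observed (word, context) occurrence.
pairs = [
    ("beer", "object_of:drink"), ("beer", "object_of:pour"), ("beer", "modifier:cold"),
    ("wine", "object_of:drink"), ("wine", "object_of:pour"), ("wine", "modifier:red"),
    ("car", "object_of:drive"), ("car", "object_of:park"), ("car", "modifier:red"),
    ("truck", "object_of:drive"), ("truck", "object_of:park"), ("truck", "modifier:heavy"),
]

thesaurus = build_thesaurus(pairs)
for word, neighbours in thesaurus.items():
    print(word, "->", neighbours)
```

In this toy data, "wine" and "car" share only the context `modifier:red`, so "car" ranks below "beer" among the neighbours of "wine". This is the kind of noise from accidentally shared contexts that the text says a larger corpus helps to separate from the signal. Comparing every pair of words, as this sketch does, scales quadratically with vocabulary size.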
