An Efficient Indexer for Large N-Gram Corpora

Hakan Ceylan
Department of Computer Science
University of North Texas
Denton, TX 76203
hakan@

Rada Mihalcea
Department of Computer Science
University of North Texas
Denton, TX 76203
rada@

Abstract

We introduce a new publicly available tool that implements efficient indexing and retrieval of large N-gram datasets, such as the Web1T 5-gram corpus. Our tool indexes the entire Web1T dataset with an index size of only 100 MB and retrieves any N-gram with a single disk access. With an increased index size of 420 MB and duplicated data, it also allows users to issue wild card queries, provided that the wild cards in the query are contiguous. Furthermore, we implement some of the smoothing algorithms that are designed specifically for large datasets and have been shown to yield better language models than the traditional ones on the Web1T 5-gram corpus (Yuret, 2008). We demonstrate the effectiveness of our tool and the smoothing algorithms on the English Lexical Substitution task with a simple implementation that gives a considerable improvement over a basic language model.

1 Introduction

The goal of statistical language modeling is to capture the properties of a language through a probability distribution, so that the probabilities of word sequences can be estimated. Since the probability distribution is built from a corpus of the language by computing the frequencies of the N-grams found in the corpus, data sparsity is always an issue with language models. Hence, as is the case with many statistical models used in Natural Language Processing (NLP), the models perform much better with larger datasets. However, large datasets such as the Web1T 5-gram corpus of Brants and Franz (2006) present a major challenge: the language models built from these sets cannot fit in memory, so efficient access to the N-gram frequencies becomes an issue. Trivial methods such as linear or binary search over the corpus files are too inefficient to be practical.
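To make the single-disk-access retrieval claim concrete, the sketch below shows one common way such a lookup can be organized: a compact in-memory index maps a hash of each N-gram to a byte offset in an on-disk data file of tab-separated "ngram count" lines, so that fetching a count costs one seek and one read. This is a minimal sketch under assumed file layouts; the record format, hash function, and file names are hypothetical and not the authors' actual implementation.

    # Minimal sketch (assumed layout, not the tool's real format):
    # the index file holds fixed-width records of an 8-byte N-gram
    # hash followed by an 8-byte byte offset into the data file.
    import struct

    def load_index(index_path):
        """Read fixed-width (hash, offset) records into a dictionary."""
        index = {}
        with open(index_path, "rb") as f:
            while True:
                record = f.read(16)  # assumed: 8-byte hash + 8-byte offset
                if len(record) < 16:
                    break
                h, offset = struct.unpack("<QQ", record)
                index[h] = offset
        return index

    def get_count(ngram, index, data_path):
        """Fetch the frequency of an N-gram with one disk access."""
        h = hash(ngram)  # stand-in; a real index needs a stable 64-bit hash
        offset = index.get(h)
        if offset is None:
            return 0
        with open(data_path, "rb") as f:
            f.seek(offset)  # the single disk access
            line = f.readline().decode("utf-8").rstrip("\n")
            stored, count = line.rsplit("\t", 1)
            # verify the stored N-gram to guard against hash collisions
            return int(count) if stored == ngram else 0

Since only hashes and offsets are held in memory, the index stays small relative to the corpus, which is consistent with the 100 MB figure quoted in the abstract, though the exact scheme used by the released tool is not specified in this excerpt.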
