An Efficient Indexer for Large N-Gram Corpora

Hakan Ceylan, Department of Computer Science, University of North Texas, Denton, TX 76203, hakan@unt.edu
Rada Mihalcea, Department of Computer Science, University of North Texas, Denton, TX 76203, rada@cs.unt.edu

Abstract

We introduce a new publicly available tool that implements efficient indexing and retrieval of large N-gram datasets, such as the Web1T 5-gram corpus. Our tool indexes the entire Web1T dataset with an index size of only 100 MB and performs a retrieval of any N-gram with a single disk access. With an increased index size of 420 MB and duplicate data, it also allows users to issue wild card queries, provided that the wild cards in the query are contiguous. Furthermore, we also implement some of the smoothing algorithms that are designed specifically for large datasets and are shown to yield better language models than the traditional ones on the Web1T 5-gram corpus (Yuret, 2008). We demonstrate the effectiveness of our tool and the smoothing algorithms on the English Lexical Substitution task with a simple implementation that gives considerable improvement over a basic language model.

1 Introduction

The goal of statistical language modeling is to capture the properties of a language through a probability distribution, so that the probabilities of word sequences can be estimated.
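For instance, a simple N-gram model derives such probabilities from corpus counts via maximum-likelihood estimation. The following sketch illustrates the basic idea on a toy corpus; it is only an illustration of the general technique, not part of the tool described in this paper:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_prob(tokens, context, word):
    """Maximum-likelihood estimate of P(word | context) from raw counts:
    count(context + word) / count(context)."""
    n = len(context) + 1
    num = ngram_counts(tokens, n)[tuple(context) + (word,)]
    den = ngram_counts(tokens, n - 1)[tuple(context)]
    return num / den if den else 0.0

# Toy corpus: "the cat" occurs 2 times, "the" occurs 3 times.
corpus = "the cat sat on the mat the cat ran".split()
p = mle_prob(corpus, ["the"], "cat")  # 2/3
```

With a corpus the size of Web1T, such counts can no longer be held in memory, which motivates the disk-based indexing scheme described in this paper.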
Since the probability distribution is built from a corpus of the language by computing the frequencies of the N-grams found in the corpus, data sparsity is always an issue with language models. Hence, as is the case with many statistical models used in Natural Language Processing (NLP), the models give much better performance with larger data sets. However, large data sets such as the Web1T 5-gram corpus of Brants and Franz (2006) present a major challenge. The language models built from these sets cannot fit in memory, hence efficient access to the N-gram frequencies becomes an issue. Trivial methods such as linear or binary search