TAILIEUCHUNG - Scientific report: "Similarity-Based Estimation of Word Cooccurrence Probabilities"

Ido Dagan, Fernando Pereira
AT&T Bell Laboratories, 600 Mountain Ave., Murray Hill, NJ 07974, USA

Abstract

In many applications of natural language processing it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and "eat a beach" is more likely. Statistical NLP methods determine the likelihood of a word combination according to its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in a given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on "most similar" words. We describe a probabilistic word association model based on distributional word similarity, and apply it to improving probability estimates for unseen word bigrams in a variant of Katz's back-off model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error.

Introduction

Data sparseness is an inherent problem in statistical methods for natural language processing. Such methods use statistics on the relative frequencies of configurations of elements in a training corpus to evaluate alternative analyses or interpretations of new samples of text or speech. The most likely analysis will be taken to be the one that contains the most frequent configurations. The problem of data sparseness arises when analyses contain configurations that never occurred in the training corpus. Then it is not possible to estimate probabilities from observed frequencies, and some other estimation scheme has to be used. We focus here on a particular kind of configuration, word cooccurrence. Examples of such cooccurrences include ...
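The data sparseness problem described in the Introduction can be illustrated with a minimal Python sketch. It is not the method proposed in the paper: it only computes plain relative-frequency bigram estimates on a made-up toy corpus, to show that a bigram absent from the training data receives probability zero. The function name `bigram_mle` and the corpus are assumptions introduced for illustration.

```python
from collections import Counter


def bigram_mle(tokens):
    """Return a function giving the relative-frequency estimate of P(w2 | w1).

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Count each word as a bigram history (the last token has no successor).
    history_counts = Counter(tokens[:-1])
    # Count adjacent word pairs.
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    def prob(w1, w2):
        if history_counts[w1] == 0:
            return 0.0  # w1 never seen as a history in the corpus
        return bigram_counts[(w1, w2)] / history_counts[w1]

    return prob


# Toy training corpus (made up): "a peach" occurs, "a beach" does not.
corpus = "we eat a peach and they eat a pear near the beach".split()
p = bigram_mle(corpus)

print(p("a", "peach"))  # 0.5: seen once out of two occurrences of "a"
print(p("a", "beach"))  # 0.0: unseen bigram, although "beach" is in the corpus
```

The zero estimate for "a beach" says only that the pair was not observed, not that it is impossible, which is why some other estimation scheme is needed for unseen cooccurrences.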
