Scientific report: "Large-Coverage Root Lexicon Extraction for Hindi"

Cohan Sujay Carlos, Monojit Choudhury, Sandipan Dandapat
Microsoft Research India
monojitc@

Abstract

This paper describes a method, using morphological rules and heuristics, for the automatic extraction of large-coverage lexicons of stems and root word-forms from a raw text corpus. We cast the problem of high-coverage lexicon extraction as one of stemming followed by root word-form selection. We examine the use of POS tagging to improve precision and recall of stemming and thereby the coverage of the lexicon. We present accuracy, precision and recall scores for the system on a Hindi corpus.

1 Introduction

Large-coverage morphological lexicons are an essential component of morphological analysers. Morphological analysers find application in language processing systems for tasks like tagging, parsing and machine translation. While raw text is an abundant and easily accessible linguistic resource, high-coverage morphological lexicons are scarce or unavailable in Hindi, as in many other languages (Clement et al., 2004). Thus, the development of better algorithms for the extraction of morphological lexicons from raw text corpora is a task of considerable importance. A root word-form lexicon is an intermediate stage in the creation of a morphological lexicon.
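As a rough illustration of casting lexicon extraction as stemming followed by root word-form selection, the following minimal sketch may help. It is not the method of this paper: the suffix list, the romanized toy word-forms, the root-selection heuristic and the function names (`stem`, `extract_lexicon`, `precision_recall`) are all assumptions made for the example, and no POS tagging is involved.

```python
from collections import defaultdict

# Hypothetical suffix list for romanized toy data (longest first).
# This is an illustrative assumption, not the rule set used in the paper.
SUFFIXES = ["on", "a", "e"]


def stem(word, suffixes=SUFFIXES):
    """Strip the first matching suffix, keeping a stem of at least 2 characters."""
    for suf in suffixes:
        if word.endswith(suf) and len(word) - len(suf) >= 2:
            return word[: -len(suf)]
    return word


def extract_lexicon(tokens, root_suffix="a"):
    """Group word-forms by stem, then select one root word-form per stem."""
    groups = defaultdict(set)
    for tok in tokens:
        groups[stem(tok)].add(tok)

    lexicon = {}
    for s, forms in groups.items():
        candidate = s + root_suffix
        # Assumed heuristic: prefer stem + root_suffix when it is attested
        # in the corpus, otherwise fall back to the shortest attested form.
        if candidate in forms:
            lexicon[s] = candidate
        else:
            lexicon[s] = min(forms, key=lambda f: (len(f), f))
    return lexicon


def precision_recall(predicted, gold):
    """Set-based precision and recall of predicted root forms against a gold set."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    corpus = ["ladka", "ladke", "ladkon", "ghar"]
    lex = extract_lexicon(corpus)
    print(lex)
    print(precision_recall(set(lex.values()), {"ladka", "ghar", "kitab"}))
```

In this sketch, coverage depends directly on stemming quality: a word-form that is stemmed wrongly lands in the wrong group, and the selected root form for that group is then wrong or missing, which lowers both precision and recall.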
In this paper, we consider the problem of extracting a large-coverage root word-form lexicon for the Hindi language, a highly inflectional and moderately agglutinative Indo-European language spoken widely in South Asia. Since a POS tagger, another basic tool, was available along with POS-tagged data to train it, and since the error patterns indicated that POS tagging could greatly improve the accuracy of the lexicon, we used the POS tagger in our experiments on lexicon extraction. Previous work in morphological lexicon extraction from a raw corpus often does not achieve very high precision and recall (de Lima, 1998; Oliver and Tadic, 2004). In some previous work the process .
