TAILIEUCHUNG - Scientific report: "Unsupervised Learning of Arabic Stemming using a Parallel Corpus"

Unsupervised Learning of Arabic Stemming using a Parallel Corpus

Monica Rogati, Computer Science Department, Carnegie Mellon University, mrogati@
Scott McCarley, IBM TJ Watson Research Center, jsmc@
Yiming Yang, Language Technologies Institute, Carnegie Mellon University, yiming@

Work done while a summer intern at IBM TJ Watson Research Center.

Abstract

This paper presents an unsupervised learning approach to building a non-English (Arabic) stemmer. The stemming model is based on statistical machine translation, and it uses an English stemmer and a small (10K sentences) parallel corpus as its sole training resources. No parallel text is needed after the training phase. Monolingual, unannotated text can be used to further improve the stemmer by allowing it to adapt to a desired domain or genre. Examples and results will be given for Arabic, but the approach is applicable to any language that needs affix removal. Our resource-frugal approach results in agreement with a state-of-the-art proprietary Arabic stemmer built using rules, affix lists, and human-annotated text, in addition to an unsupervised component. Task-based evaluation using Arabic information retrieval indicates an improvement of 22-38% in average precision over unstemmed text, and 96% of the performance of the proprietary stemmer above.

1 Introduction

Stemming is the process of normalizing word variations by removing prefixes and suffixes. From an information retrieval point of view, prefixes and suffixes add little or no additional meaning; in most cases, both the efficiency and effectiveness of text processing applications such as information retrieval and machine translation are improved.

Building a rule-based stemmer for a new, arbitrary language is time consuming and requires experts with linguistic knowledge in that particular language. Supervised learning also requires large quantities of labeled data in the target language, and quality declines when using completely ...
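To make the notion of stemming as affix removal concrete, the following is a minimal sketch of a toy rule-based stripper. It is not the paper's method: the paper learns its stemmer from a parallel corpus, whereas this sketch hard-codes small affix lists, which is the kind of hand-built resource the paper aims to avoid. The affix lists, the simplified Latin transliteration, the minimum stem length, and the function name `strip_affixes` are all assumptions made for illustration.

```python
# Toy affix stripper. All affixes below are hypothetical examples in a
# simplified Latin transliteration; they are not taken from the paper.
PREFIXES = ("wal", "al", "w", "b")
SUFFIXES = ("at", "wn", "yn", "h")
MIN_STEM_LEN = 2  # assumed lower bound so short words are not over-stripped


def strip_affixes(word):
    """Remove at most one listed prefix and one listed suffix, longest first."""
    stem = word
    # Try longer prefixes first so "wal" is preferred over "w".
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if stem.startswith(prefix) and len(stem) - len(prefix) >= MIN_STEM_LEN:
            stem = stem[len(prefix):]
            break
    # Then try suffixes on what remains, again longest first.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM_LEN:
            stem = stem[:-len(suffix)]
            break
    return stem


if __name__ == "__main__":
    for surface_form in ("walktab", "alktab", "ktabh", "ktab"):
        print(surface_form, "->", strip_affixes(surface_form))
```

With these assumed lists, the four surface forms in the usage loop all reduce to the same string, `ktab`, which is the normalization effect that benefits retrieval. A fixed list like this over-strips or under-strips many words, which is one motivation for learning the split from data instead.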
