Scientific report: "Unsupervised Segmentation of Words Using Prior Distributions of Morph Length and Frequency"

Mathias Creutz
Neural Networks Research Centre, Helsinki University of Technology, 9800, FIN-02015 HUT, Finland

Abstract

We present a language-independent, unsupervised algorithm for the segmentation of words into morphs. The algorithm is based on a new generative probabilistic model, which makes use of relevant prior information on the length and frequency distributions of morphs in a language. Our algorithm is shown to outperform two competing algorithms when evaluated on data from a language with agglutinative morphology (Finnish), and it also performs well on English data.

1 Introduction

In order to artificially understand or produce natural language, a system presumably has to know the elementary building blocks, i.e., the lexicon, of the language. Additionally, the system needs to model the relations between these lexical units. Many existing NLP (natural language processing) applications make use of words as such units. For instance, in statistical language modelling, probabilities of word sequences are typically estimated, and bag-of-word models are common in information retrieval.

However, for some languages it is infeasible to construct lexicons for NLP applications if the lexicons contain entire words. Especially in agglutinative languages¹, such as Finnish and Turkish, the number of possible different word forms is simply too high. For example, in Finnish, a single verb may appear in thousands of different forms (Karlsson, 1987).

¹ In agglutinative languages, words are formed by the concatenation of morphemes.

According to linguistic theory, words are built from smaller units, morphemes. Morphemes are the smallest meaning-bearing elements of language and could be used as lexical units instead of entire words. However, the construction of a comprehensive morphological lexicon or analyzer based on linguistic theory requires a considerable amount of work by ...
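To make concrete what segmenting a word into morphs under a probabilistic model can look like, here is a minimal Python sketch. It is not the model proposed in the paper: it assumes a fixed, hand-written morph lexicon with made-up counts (`MORPH_COUNTS` is hypothetical) and a plain unigram morph model, and it finds the most probable split by dynamic programming. The paper's algorithm, by contrast, learns the morphs without supervision and uses priors on morph length and frequency.

```python
import math

# Hypothetical toy morph counts (NOT taken from the paper); illustration only.
MORPH_COUNTS = {"talo": 5, "ssa": 4, "ni": 3, "kin": 2, "auto": 4, "t": 1, "a": 1}


def segment(word, counts=MORPH_COUNTS):
    """Return the most probable split of `word` into known morphs, or None.

    Each morph is scored independently by its relative frequency
    (a unigram model); the best split minimizes the summed -log probability.
    """
    total = sum(counts.values())
    # best[i] holds (cost, morphs) for the prefix word[:i], or None if unreachable.
    best = [None] * (len(word) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is None or piece not in counts:
                continue
            cost = best[start][0] - math.log(counts[piece] / total)
            if best[end] is None or cost < best[end][0]:
                best[end] = (cost, best[start][1] + [piece])
    return None if best[-1] is None else best[-1][1]


if __name__ == "__main__":
    for w in ["talossani", "autossa", "xyz"]:
        print(w, "->", segment(w))
```

A word that cannot be built from the lexicon yields `None`. This is where a fixed lexicon falls short, and it is one reason an unsupervised method that discovers the morphs itself is attractive for languages with very many word forms.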
