Scientific report: "A Joint Statistical Model for Simultaneous Word Spacing and Spelling Error Correction for Korean"

Hyungjong Noh, Jeong-Won Cha, Gary Geunbae Lee

Department of Computer Science and Engineering, Pohang University of Science & Technology (POSTECH), San 31, Hyoja-Dong, Pohang, 790-784, Republic of Korea

Changwon National University, Department of Computer Information & Communication, 9 Sarim-dong, Changwon, Gyeongnam, Korea 641-773

nohhj@ jcha@ gblee@

Abstract

This paper presents a noisy-channel-based Korean preprocessor system that corrects word spacing and typographical errors. The proposed algorithm corrects both kinds of error simultaneously. By using an Eojeol transition pattern dictionary and statistical data such as Eumjeol n-gram and Jaso transition probabilities, the algorithm minimizes its reliance on huge word dictionaries.

1 Introduction

With the increasing use of messengers and SMS, we need an efficient text normalizer that processes colloquial-style sentences. As with general literary sentences, correcting word spacing errors and spelling errors is an essential problem for colloquial-style sentences.

Many algorithms have been used to correct word spacing errors; they can be divided into statistical algorithms and rule-based algorithms. Statistical algorithms generally use character n-grams (Eojeol¹ or Eumjeol² n-grams in Korean) (Kang and Woo, 2001; Kwon, 2002) or a noisy-channel model (Gao et al., 2003). Rule-based algorithms are mostly heuristic algorithms that reflect linguistic knowledge (Yang et al., 2005) to solve the word spacing problem.
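To make the idea of Eumjeol n-gram statistics for word spacing concrete, here is a minimal Python sketch. It is not the joint noisy-channel model this paper proposes; it is a much simpler per-gap bigram classifier, shown only to illustrate how syllable-pair counts can drive spacing decisions. The function names (`train_spacing_model`, `respace`), the 0.5 threshold, and the two-sentence toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def train_spacing_model(sentences):
    """Count, for each adjacent Eumjeol (syllable) pair, how often a space separates it."""
    spaced = Counter()  # (left, right) -> times a space occurred between the pair
    total = Counter()   # (left, right) -> times the pair occurred at all
    for sentence in sentences:
        chars = []          # syllables with spaces removed
        space_before = []   # space_before[i] is True if a space preceded chars[i]
        pending_space = False
        for ch in sentence.strip():
            if ch == " ":
                pending_space = True
            else:
                chars.append(ch)
                space_before.append(pending_space)
                pending_space = False
        for i in range(1, len(chars)):
            pair = (chars[i - 1], chars[i])
            total[pair] += 1
            if space_before[i]:
                spaced[pair] += 1
    return spaced, total


def respace(text, model, threshold=0.5):
    """Strip existing spaces, then insert one wherever the estimated P(space | pair) exceeds threshold."""
    spaced, total = model
    chars = [ch for ch in text if ch != " "]
    if not chars:
        return ""
    out = [chars[0]]
    for left, right in zip(chars, chars[1:]):
        pair = (left, right)
        # Assumption of this sketch: unseen pairs default to "no space".
        p_space = spaced[pair] / total[pair] if total[pair] else 0.0
        if p_space > threshold:
            out.append(" ")
        out.append(right)
    return "".join(out)


# Hypothetical toy corpus of correctly spaced sentences (not data from the paper).
toy_corpus = ["나는 학교에 간다", "나는 집에 간다"]
model = train_spacing_model(toy_corpus)
print(respace("나는학교에간다", model))
```

Each gap is decided independently here, so the sketch cannot trade off neighbouring decisions or correct spelling errors; a sequence model searched over whole sentences would be needed for that.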
The word spacing problem is treated especially in Japanese or Chinese, which do not use word boundaries, or in Korean, which is normally segmented into Eojeols rather than into words or morphemes.

The previous algorithms for spelling error correction basically use a word dictionary. Each word in a sentence is compared

¹ Eojeol is a Korean spacing unit which consists of one or more Eumjeols (morphemes).
² Eumjeol is a Korean syllable.
