A Joint Statistical Model for Simultaneous Word Spacing and Spelling Error Correction for Korean

Hyungjong Noh, Jeong-Won Cha, Gary Geunbae Lee

Department of Computer Science and Engineering, Pohang University of Science and Technology (POSTECH), San 31, Hyoja-Dong, Pohang, 790-784, Republic of Korea
Changwon National University, Department of Computer Information and Communication, 9 Sarim-dong, Changwon, Gyeongnam, Korea 641-773

nohhj@postech.ac.kr, jcha@changwon.ac.kr, gblee@postech.ac.kr

Abstract

This paper presents a noisy-channel-based Korean preprocessor system that corrects word spacing and typographical errors. The proposed algorithm corrects both types of error simultaneously. By using an Eojeol transition pattern dictionary and statistical data such as Eumjeol n-gram and Jaso transition probabilities, the algorithm minimizes its reliance on huge word dictionaries.

1 Introduction

With the increasing use of messengers and SMS, we need an efficient text normalizer that processes colloquial-style sentences. As with general literary sentences, correcting word spacing errors and spelling errors is an essential problem for colloquial-style sentences. Many algorithms have been used to correct word spacing errors; they can be divided into statistical algorithms and rule-based algorithms. Statistical algorithms generally use character n-grams (Eojeol¹ or Eumjeol² n-grams in Korean) (Kang and Woo, 2001; Kwon, 2002) or a noisy-channel model (Gao et al., 2003). Rule-based algorithms are mostly heuristic algorithms that reflect linguistic knowledge (Yang et al., 2005) to solve the word spacing problem.
1 Eojeol is a Korean spacing unit which consists of one or more Eumjeols (morphemes).
2 Eumjeol is a Korean syllable.

The word spacing problem is treated especially in Japanese and Chinese, which do not use word boundaries, and in Korean, which is normally segmented into Eojeols rather than into words or morphemes. The previous algorithms for spelling error correction basically use a word dictionary. Each word in a sentence is compared