Word Segmentation for Vietnamese Text Categorization: An Online Corpus Approach

Thanh V. Nguyen, Hoang K. Tran, Thanh T.T. Nguyen, and Hung Nguyen

Abstract: This paper extends a novel Vietnamese segmentation approach for text categorization. Instead of relying on an annotated training corpus or a lexicon, both of which are still lacking for Vietnamese, we use statistical information extracted directly from a commercial search engine, together with a genetic algorithm, to find the most reasonable segmentation. The extracted information is the document frequency of the segmented words. We conduct thorough experiments to find the most appropriate mutual information formula for the word segmentation step. Our experimental results on segmentation and categorization, obtained from online news abstracts, show that the approach is promising. It achieves nearly 80% agreement with human judgment on segmentation and over 90% micro-averaged F-measure in categorization. The processing time is less than one minute per document once enough statistical information has been cached.

Index Terms: Genetic Algorithm, Text Categorization, Web Corpus, Word Segmentation.

I. Introduction

It is well known that word segmentation is a major barrier in text categorization tasks for Asian languages such as Chinese, Japanese, Korean, and Vietnamese.
Although Vietnamese is written in extended Latin characters, it shares some characteristics with the other phonographic Southeast Asian languages. In these languages word boundaries are hard to determine, and their phonetic, grammatical, and semantic features differ from those of Indo-European languages. It is therefore difficult to fit Vietnamese into the widely and well-investigated approaches developed for Indo-European languages without an acceptable Vietnamese word segmentation. Why is identifying word boundaries vital for Vietnamese text categorization? According to [18] and our survey, most of the top-performing text categorization methods, the Support Vector Machine [8], k-Nearest Neighbor [16], Linear Least .