Word Segmentation for Vietnamese Text Categorization: An online corpus approach



Thanh V. Nguyen, Hoang K. Tran, Thanh T.T. Nguyen, and Hung Nguyen

Abstract: This paper extends a novel Vietnamese segmentation approach for text categorization. Instead of using an annotated training corpus or lexicon, which are still lacking in Vietnam, we use statistical information extracted directly from a commercial search engine, together with a genetic algorithm, to find the most reasonable segmentation. The extracted information is the document frequency of segmented words. We conduct thorough experiments to find the most appropriate mutual information formula for the word segmentation step. Our experimental results on segmentation and categorization, obtained from online news abstracts, show that the approach is promising. It achieves nearly 80% in human judgment on segmentation and over 90% micro-averaging F in categorization. The processing time is less than one minute per document when enough statistical information has been cached.

Index Terms: Genetic Algorithm, Text Categorization, Web Corpus, Word Segmentation.

I. Introduction

It is well known that word segmentation is a major barrier in text categorization tasks for Asian languages such as Chinese, Japanese, Korean, and Vietnamese. Although Vietnamese is written in extended Latin characters, it shares some characteristics with the other phonographic Southeast Asian languages. In Asian languages, word boundaries are hard to determine, and their phonetic, grammatical, and semantic features differ from those of Indo-European languages. Thus it is difficult to fit Vietnamese into the widely used and well-investigated approaches developed for Indo-European languages without acceptable Vietnamese word segmentation.

Why is identifying word boundaries in Vietnamese vital for Vietnamese text categorization? According to [18] and our survey, most top-performing text categorization methods, such as the Support Vector Machine [8], k-Nearest Neighbor [16], Linear Least ...
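
The abstract's idea of scoring candidate segmentations by the document frequency of their words can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the counts in `DOC_FREQ` and `TOTAL_DOCS` are invented stand-ins for cached search-engine statistics, the score is the standard pointwise mutual information rather than whichever formula the paper finally selects, and the search enumerates all segmentations exhaustively instead of using a genetic algorithm.

```python
import math
from itertools import combinations

# Hypothetical document frequencies standing in for cached search-engine counts.
TOTAL_DOCS = 1_000_000
DOC_FREQ = {
    "học": 50_000,
    "sinh": 40_000,
    "giỏi": 30_000,
    "học sinh": 20_000,
    "sinh giỏi": 500,
    "học sinh giỏi": 400,
}


def pmi(word):
    """Standard pointwise mutual information of a candidate word against
    its syllables. Single syllables score 0; unseen words score -inf."""
    syllables = word.split()
    if len(syllables) == 1:
        return 0.0
    joint = DOC_FREQ.get(word, 0) / TOTAL_DOCS
    independent = math.prod(DOC_FREQ.get(s, 0) / TOTAL_DOCS for s in syllables)
    if joint == 0 or independent == 0:
        return float("-inf")
    return math.log(joint / independent)


def segmentations(syllables):
    """Yield every way of grouping consecutive syllables into words."""
    n = len(syllables)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [" ".join(syllables[a:b]) for a, b in zip(bounds, bounds[1:])]


def best_segmentation(sentence):
    """Pick the segmentation with the highest total PMI (exhaustive search,
    used here in place of the genetic algorithm the paper describes)."""
    return max(
        segmentations(sentence.split()),
        key=lambda seg: sum(pmi(w) for w in seg),
    )


print(best_segmentation("học sinh giỏi"))
```

Exhaustive enumeration grows exponentially with the number of syllables, which is one plausible reason to prefer a heuristic search such as a genetic algorithm on full sentences.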

