Word Segmentation for Vietnamese Text Categorization: An Online Corpus Approach

Thanh V. Nguyen, Hoang K. Tran, Thanh . Nguyen, and Hung Nguyen

Abstract: This paper extends a novel Vietnamese word segmentation approach for text categorization. Instead of relying on an annotated training corpus or a lexicon, both of which are still lacking in Vietnam, we use statistical information extracted directly from a commercial search engine, together with a genetic algorithm, to find the most reasonable segmentation. The extracted information is the document frequency of the segmented words. We conduct extensive experiments to identify the most appropriate mutual information formula for the word segmentation step. Our experimental results on segmentation and categorization, obtained from online news abstracts, show that the approach is promising: it reaches nearly 80% agreement with human judgment on segmentation and over 90% micro-averaged F-measure in categorization. Processing takes less than one minute per document once enough statistical information has been cached.

Index Terms: Genetic Algorithm, Text Categorization, Web Corpus, Word Segmentation.

I. Introduction

It is well known that word segmentation is a major barrier in text categorization tasks for Asian languages such as Chinese, Japanese, Korean, and Vietnamese. Although Vietnamese is written in extended Latin characters, it shares some characteristics with the other phonographic Southeast Asian languages. In Asian languages, word boundaries are hard to determine, and their phonetic, grammatical, and semantic features differ from those of Indo-European languages. It is therefore difficult to fit Vietnamese into the widely used and well-investigated approaches developed for Indo-European languages without acceptable Vietnamese word segmentation.

Why is identifying word boundaries in Vietnamese vital for Vietnamese text categorization? According to [18] and our survey, most of the top-performing text categorization methods, the Support Vector Machine [8], k-Nearest Neighbor [16], Linear Least ...
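
To make the idea in the abstract concrete, the sketch below shows how document frequencies could be turned into a mutual information score for a candidate two-syllable word. It is an illustration only: the excerpt does not give the formula the authors selected, so standard pointwise mutual information is used as a stand-in, and the function name `pmi`, the constant `TOTAL_DOCS`, and all counts are hypothetical placeholders for search-engine hit counts.

```python
import math


def pmi(df_x: int, df_y: int, df_xy: int, total_docs: int) -> float:
    """Pointwise mutual information of two syllables from document counts.

    df_x, df_y -- number of documents containing each syllable alone
    df_xy      -- number of documents containing the two syllables together
    total_docs -- assumed size of the indexed collection (hypothetical)
    """
    if min(df_x, df_y, df_xy) <= 0:
        # No evidence that the pair occurs; treat it as maximally unrelated.
        return float("-inf")
    p_x = df_x / total_docs
    p_y = df_y / total_docs
    p_xy = df_xy / total_docs
    return math.log(p_xy / (p_x * p_y))


# Hypothetical counts standing in for search-engine document frequencies.
TOTAL_DOCS = 1_000_000
strong = pmi(df_x=5_000, df_y=4_000, df_xy=3_000, total_docs=TOTAL_DOCS)
weak = pmi(df_x=5_000, df_y=4_000, df_xy=20, total_docs=TOTAL_DOCS)

# A pair that co-occurs far more often than chance scores higher,
# which suggests the two syllables form a single word.
print(strong > weak)
```

In a pipeline like the one the abstract outlines, scores of this kind would serve as the fitness signal that a genetic algorithm uses to compare alternative segmentations of a sentence; caching the document frequencies avoids repeated search-engine queries.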
