Scientific report: "A Comparison and Semi-Quantitative Analysis of Words and Character-Bigrams as Features in Chinese Text Categorization"

Jingyang Li, Maosong Sun, Xian Zhang
National Lab. of Intelligent Technology & Systems, Department of Computer Sci. & Tech., Tsinghua University, Beijing 100084, China
lijingyang@ sms@ kevinn9@

Abstract

Words and character-bigrams are both used as features in Chinese text processing tasks, but no systematic comparison or analysis of their value as features for Chinese text categorization has been reported so far. We carry out a full performance comparison between them through experiments on various document collections (including a manually word-segmented corpus as a gold standard), together with a semi-quantitative analysis to elucidate the characteristics of their behavior, and we try to provide some preliminary clues for feature term choice (in most cases, character-bigrams are better than words) and dimensionality setting in text categorization systems.

1 Introduction

Because of the popularity of the Vector Space Model (VSM) in text information processing, document indexing (term extraction) acts as a prerequisite step in most text information processing tasks, such as Information Retrieval (Baeza-Yates and Ribeiro-Neto, 1999) and Text Categorization (Sebastiani, 2002).
It is empirically known that the indexing scheme has a nontrivial effect on system performance, especially for some Asian languages, in which there are no explicit word boundaries and even no natural semantic unit. Concretely, in Chinese Text Categorization tasks, the two most important indexing units (feature terms) are the word and the character-bigram, so the question is which kind of term should be chosen as the feature term: words or character-bigrams?

To obtain a comprehensive picture of feature choice beforehand, we review here the possible feature variants (or options). First, at the word level, we can do stemming, do stop-word pruning, include POS (Part of Speech) information, etc. Second, term ...
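
The two indexing units contrasted above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `char_bigrams` and `term_frequencies`, the example sentence, and its word segmentation are all hypothetical. The sketch also assumes that bigrams are taken only inside unbroken runs of CJK Unified Ideographs (U+4E00 to U+9FFF), so they never span punctuation or whitespace; the source does not say how such boundaries were handled.

```python
from collections import Counter


def char_bigrams(text):
    """Return overlapping character-bigrams from each run of CJK ideographs.

    Assumption: bigrams do not cross punctuation, whitespace, or any other
    character outside the basic CJK Unified Ideographs block.
    """
    bigrams = []
    run = []
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":
            run.append(ch)
        else:
            # A non-ideograph ends the current run; emit its bigrams.
            bigrams.extend(a + b for a, b in zip(run, run[1:]))
            run = []
    # Emit bigrams for the final run, if any.
    bigrams.extend(a + b for a, b in zip(run, run[1:]))
    return bigrams


def term_frequencies(terms):
    """Map each feature term to its raw count (an unweighted VSM vector)."""
    return Counter(terms)


sentence = "中文文本分类"
# Hypothetical manual segmentation; real word features need a segmenter
# or a hand-segmented corpus.
words = ["中文", "文本", "分类"]

bigram_features = term_frequencies(char_bigrams(sentence))
word_features = term_frequencies(words)
print(len(bigram_features), "bigram terms;", len(word_features), "word terms")
```

In this toy example the bigram index includes every segmented word plus the boundary-spanning bigrams "文文" and "本分". Bigram extraction needs no segmenter, but it yields a larger and noisier feature space.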
