TAILIEUCHUNG - Scientific report: "Text Segmentation with LDA-Based Fisher Kernel"

Text Segmentation with LDA-Based Fisher Kernel

Qi Sun, Runxin Li, Dingsheng Luo and Xihong Wu
Speech and Hearing Research Center, and Key Laboratory of Machine Perception (Ministry of Education), Peking University, 100871, Beijing, China
sunq lirx dsluo wxh @

Abstract

In this paper we propose a domain-independent text segmentation method, which consists of three components. Latent Dirichlet allocation (LDA) is employed to compute the semantic distribution of words, and semantic similarity is measured by the Fisher kernel. Finally, the globally best segmentation is found by dynamic programming. Experiments on Chinese data sets show that the technique can be effective. By introducing latent semantic information, our algorithm is robust to irregularly sized segments.

1 Introduction

The aim of text segmentation is to partition a document into a set of segments, each of which is coherent about a specific topic. This task is motivated by problems in information retrieval, summarization, and language modeling, in which the ability to provide access to smaller coherent segments of a document is desired.

A lot of research has been done on text segmentation. Some approaches use linguistic criteria (Beeferman et al., 1999; Mochizuki et al., 1998), while others use statistical similarity measures to uncover lexical cohesion. Lexical cohesion methods assume that a coherent topic segment contains parts with similar vocabularies. For example, the TextTiling algorithm introduced by Hearst (1994) assumes that the local minima of the word similarity curve are the points of low lexical cohesion and thus the natural boundary candidates. Reynar (1998) proposed a method called dotplotting, which relies on the distribution of word repetitions to find tight regions of topic similarity graphically.

One of the problems with those works is that they treat terms as uncorrelated, assigning them orthogonal directions in the feature space. But in reality words are correlated, and sometimes even synonymous, so that texts with very few
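
The lexical-cohesion idea described in the introduction can be illustrated with a small Python sketch. It is a simplified illustration only, not Hearst's full TextTiling algorithm and not the LDA-based method proposed in the paper: it compares adjacent sentences with bag-of-words cosine similarity and marks strict local minima of the resulting curve as boundary candidates. The function names and the toy sentences are hypothetical, chosen for this example. The last lines show the limitation noted above: two texts that share no surface words get similarity zero even when they are near-synonymous.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def gap_similarities(sentences):
    """Similarity across each gap between adjacent sentences."""
    bags = [Counter(s.lower().split()) for s in sentences]
    return [cosine(bags[i], bags[i + 1]) for i in range(len(bags) - 1)]


def boundary_candidates(sims):
    """Indices of interior gaps whose similarity is a strict local minimum."""
    return [
        i for i in range(1, len(sims) - 1)
        if sims[i] < sims[i - 1] and sims[i] < sims[i + 1]
    ]


# Hypothetical toy document: two sentences on one topic, then two on another.
sentences = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stock prices fell sharply today",
    "stock markets fell again today",
]
sims = gap_similarities(sentences)
print(boundary_candidates(sims))  # [1]: the gap between sentences 2 and 3

# Terms are treated as orthogonal: no shared words means similarity 0,
# even though the two texts are close in meaning.
synonym_sim = cosine(Counter("the automobile stopped".split()),
                     Counter("a car halted".split()))
print(synonym_sim)  # 0.0
```

A measure that takes latent semantics into account, such as the LDA-based Fisher kernel named in the abstract, is meant to address the zero-similarity case in the final lines.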
