Scientific report: "Using Bilingual Comparable Corpora and Semi-supervised Clustering for Topic Tracking"

Fumiyo Fukumoto
Interdisciplinary Graduate School of Medicine and Engineering, Univ. of Yamanashi
fukumoto@

Yoshimi Suzuki
Interdisciplinary Graduate School of Medicine and Engineering, Univ. of Yamanashi
ysuzuki@

Abstract

We address the problem of dealing with skewed data and propose a method for estimating effective training stories for the topic tracking task. Because only a small number of labelled positive stories are available, we extract story pairs, each consisting of a positive story and its associated stories, from bilingual comparable corpora. To overcome the problem of a large number of labelled negative stories, we classify them into clusters using k-means with EM. The results on the TDT corpora show the effectiveness of the method.

1 Introduction

With the exponential growth of information on the Internet, it is becoming increasingly difficult to find and organize relevant materials. Topic tracking, defined by the TDT project, is a research area that attacks this problem. It starts from a few sample stories and finds all subsequent stories that discuss the target topic. Here, a topic in the TDT context is something that happens at a specific place and time, associated with some specific actions. A wide range of statistical and ML techniques have been applied to topic tracking (Carbonell et al., 1999; Oard, 1999; Franz, 2001; Larkey, 2004).
The main task of these techniques is to tune the parameters or the threshold to produce optimal results. However, parameter tuning is a tricky issue for tracking (Yang, 2000) because the number of initial positive training stories is very small (one to four) and topics are localized in space and time. For example, Taipei Mayoral Elections and . Mid-term Elections are topics, but Elections is not a topic. Therefore, the system needs to estimate whether or not the test stories are on the same topic with little information about the topic. Moreover the .
