Scientific report: "Unsupervised Topic Modelling for Multi-Party Spoken Discourse"

Matthew Purver, CSLI, Stanford University, Stanford, CA 94305, USA, mpurver@
Thomas L. Griffiths, Dept. of Cognitive & Linguistic Sciences, Brown University, Providence, RI 02912, USA, tomgriffiths@
Konrad P. Kording, Dept. of Brain & Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, USA, kording@
Joshua B. Tenenbaum, Dept. of Brain & Cognitive Sciences, Massachusetts Institute of ...

Abstract

We present a method for unsupervised topic modelling which adapts methods used in document classification (Blei et al., 2003; Griffiths and Steyvers, 2004) to unsegmented multi-party discourse transcripts. We show how Bayesian inference in this generative model can be used to simultaneously address the problems of topic segmentation and topic identification: automatically segmenting multi-party meetings into topically coherent segments with performance which compares well with previous unsupervised segmentation-only methods (Galley et al., 2003) while simultaneously extracting topics which rate highly when assessed for coherence by human judges. We also show that this method appears robust in the face of off-topic dialogue and speech recognition errors.

1 Introduction

Topic segmentation (division of a text or discourse into topically coherent segments) and topic identification (classification of those segments by subject matter) are joint problems. Both are necessary steps in automatic indexing, retrieval and summarization from large datasets, whether spoken or written.

Both have received significant attention in the past (see Section 2), but most approaches have been targeted at either text or monologue, and most address only one of the two issues (usually for the very good reason that the dataset itself provides the other, for example by the explicit separation of individual documents or news stories in a collection). Spoken multi-party meetings pose a difficult problem: firstly, neither the ...
