Scientific report: "A Hierarchical Model of Web Summaries"

Yves Petinot, Kathleen McKeown and Kapil Thadani
Department of Computer Science
Columbia University
New York, NY 10027
ypetinot, kathy, kapil @

Abstract

We investigate the relevance of hierarchical topic models to represent the content of Web gists. We focus our attention on DMOZ, a popular Web directory, and propose two algorithms to infer such a model from its manually-curated hierarchy of categories. Our first approach, based on information-theoretic grounds, uses an algorithm similar to recursive feature selection. Our second approach is fully Bayesian and derived from the more general model, hierarchical LDA. We evaluate the performance of both models against a flat 1-gram baseline and show improvements in terms of perplexity over held-out data.

1 Introduction

The work presented in this paper is aimed at leveraging a manually created document ontology to model the content of an underlying document collection. While the primary usage of ontologies is as a means of organizing and navigating document collections, they can also help in inferring a significant amount of information about the documents attached to them, including path-level statistical representations of content and fine-grained views on the level of specificity of the language used in those documents. Our study focuses on the ontology underlying DMOZ¹, a popular Web directory. We propose two methods for crystallizing a hierarchical topic model against its hierarchy and show that the resulting models outperform a flat unigram model in their predictive power over held-out data.

¹ http

To construct our hierarchical topic models, we adopt the mixed membership formalism (Hofmann, 1999; Blei et al., 2010), where a document is represented as a mixture over a set of word multinomials. We consider the document hierarchy H (here, the DMOZ hierarchy) as a tree where internal nodes (category nodes) and leaf nodes (documents), as well as the edges connecting them, are ...
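
To make the mixed membership view and the perplexity comparison concrete, the following is a minimal sketch rather than either of the two algorithms proposed in the paper. It assumes a toy two-category hierarchy, additive smoothing, and mixture weights fixed by hand at 0.5 each; the gists, the category paths and the helper names `unigram`, `mixture` and `perplexity` are all hypothetical. A held-out gist is scored once under a flat unigram model estimated from all gists, and once under a mixture of the root multinomial and the multinomial of its own category.

```python
import math
from collections import Counter

def unigram(tokens, vocab, alpha=1.0):
    """Additively smoothed word multinomial over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def mixture(topics, weights):
    """Mix several word multinomials into one (weights must sum to 1)."""
    vocab = topics[0].keys()
    return {w: sum(lam * t[w] for lam, t in zip(weights, topics)) for w in vocab}

def perplexity(tokens, model):
    """exp of the negative mean log-probability per held-out token."""
    log_lik = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_lik / len(tokens))

# Hypothetical toy gists, grouped by category path (not DMOZ data).
train = {
    "Top/Sports": "official site of the football club "
                  "news and scores for football fans".split(),
    "Top/Computers": "official site of the python language "
                     "news and tutorials for python programmers".split(),
}
all_tokens = [w for toks in train.values() for w in toks]
vocab = sorted(set(all_tokens))

# One multinomial per node: the root sees every gist, each category only its own.
root = unigram(all_tokens, vocab)
category = {path: unigram(toks, vocab) for path, toks in train.items()}

# Held-out gist filed under Top/Sports; all of its words are in the vocabulary.
held_out = "football club scores for fans".split()

flat_ppl = perplexity(held_out, root)  # flat unigram baseline
# Assumed, hand-fixed mixture weights over the path root -> Top/Sports.
hier = mixture([root, category["Top/Sports"]], [0.5, 0.5])
hier_ppl = perplexity(held_out, hier)

print(f"flat: {flat_ppl:.2f}  hierarchical: {hier_ppl:.2f}")
```

On this toy data the path-level mixture assigns more probability to category-specific words such as "football", so its perplexity on the held-out gist is lower than that of the flat model. In a real setting the per-node multinomials and the mixture weights would be inferred rather than fixed by hand.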
