Scientific report: "Unsupervised Decomposition of a Document into Authorial Components"

Moshe Koppel, Navot Akiva
Dept. of Computer Science, Bar-Ilan University, Ramat Gan, Israel
{moishk,}@

Idan Dershowitz
Dept. of Bible, Hebrew University, Jerusalem, Israel
dershowitz@

Nachum Dershowitz
School of Computer Science, Tel Aviv University, Ramat Aviv, Israel
nachumd@

Abstract

We propose a novel unsupervised method for separating out distinct authorial components of a document. In particular, we show that, given a book artificially "munged" from two thematically similar biblical books, we can separate out the two constituent books almost perfectly. This allows us to automatically recapitulate many conclusions reached by Bible scholars over centuries of research. One of the key elements of our method is exploitation of differences in synonym choice by different authors.

1 Introduction

We propose a novel unsupervised method for separating out distinct authorial components of a document.

There are many instances in which one is faced with a multi-author document and wishes to delineate the contributions of each author. Perhaps the most salient example is that of documents of historical significance that appear to be composites of multiple earlier texts. The challenge for literary scholars is to tease apart the document's various components. More contemporary examples include analysis of collaborative online works in which one might wish to identify the contribution of a particular author for commercial or forensic purposes.

We treat two versions of the problem. In the first, easier, version, the document to be decomposed is given to us segmented into units, each of which is the work of a single author. The challenge is only to cluster the units according to author. In the second version, we are given an unsegmented document, and the challenge includes segmenting the document as well as clustering the resulting units.

We assume here that no information about the authors of
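
As a rough illustration of the first, easier version of the problem, the sketch below clusters pre-segmented units into two groups using only which member of each synonym set a unit uses. It is a minimal sketch under stated assumptions, not the method of the paper: the synonym sets in `SYNSETS`, the binary features, the cosine-based 2-means loop, and the toy English units are all hypothetical choices made for illustration.

```python
from itertools import combinations
from math import sqrt

# Hypothetical synonym sets: each tuple groups interchangeable words.
SYNSETS = [("big", "large"), ("start", "begin"), ("quick", "fast")]


def synonym_vector(tokens, synsets=SYNSETS):
    """Binary vector with one entry per synonym: 1.0 if the unit uses it."""
    present = set(tokens)
    return [1.0 if word in present else 0.0
            for synset in synsets for word in synset]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_units(units, iterations=10):
    """Assign each unit to one of two clusters (illustrative 2-means)."""
    vectors = [synonym_vector(u) for u in units]
    # Deterministic seeding: start from the two least similar units.
    i, j = min(combinations(range(len(vectors)), 2),
               key=lambda p: cosine(vectors[p[0]], vectors[p[1]]))
    centroids = [vectors[i], vectors[j]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: pick the more similar centroid.
        labels = [0 if cosine(v, centroids[0]) >= cosine(v, centroids[1]) else 1
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


# Toy units: two invented "authors" with opposite synonym preferences.
units = [
    "the big dog will start to run fast".split(),
    "a large cat will begin to move quick".split(),
    "big storms start fast".split(),
    "large ships begin quick voyages".split(),
]
labels = cluster_units(units)
```

In this toy input, units 0 and 2 share one set of synonym preferences and units 1 and 3 share the other, so the two pairs land in different clusters. Real text would need a language-appropriate synonym resource and a more robust clustering procedure.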
