Towards Robust Context-Sensitive Sentence Alignment for Monolingual Corpora

Rani Nelken and Stuart M. Shieber
Division of Engineering and Applied Sciences
Harvard University
33 Oxford St., Cambridge, MA 02138
{nelken,shieber}@deas.harvard.edu

Abstract

Aligning sentences belonging to comparable monolingual corpora has been suggested as a first step towards training text rewriting algorithms, for tasks such as summarization or paraphrasing. We present here a new monolingual sentence alignment algorithm, combining a sentence-based TF*IDF score, turned into a probability distribution using logistic regression, with a global alignment dynamic programming algorithm. Our approach provides a simpler and more robust solution achieving a substantial improvement in accuracy over existing systems.

1 Introduction

Sentence-aligned bilingual corpora are a crucial resource for training statistical machine translation systems. Several authors have suggested that large-scale aligned monolingual corpora could be similarly used to advance the performance of monolingual text-to-text rewriting systems, for tasks including summarization (Knight and Marcu, 2000; Jing, 2002) and paraphrasing (Barzilay and Elhadad, 2003; Quirk et al., 2004). Unlike bilingual corpora such as the Canadian Hansard corpus, which are relatively rare, it is now fairly easy to amass corpora of related monolingual documents.
For instance, with the advent of news aggregator services such as Google News, one can readily collect multiple news stories covering the same news item (Dolan et al., 2004). Utilizing such a resource requires aligning related documents at a finer level of resolution, identifying which sentences from one document align with which sentences from the other. Previous work has shown that aligning related monolingual documents is quite different from the well-studied multi-lingual alignment task. Whereas documents in a bilingual corpus are typically very closely aligned, monolingual corpora exhibit a much looser level of .
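
The pipeline named in the abstract (a sentence-level TF*IDF score, mapped to a probability by a logistic function, then fed to a global alignment dynamic program) can be illustrated with a small sketch. This is not the authors' implementation. The function names, the whitespace tokenizer, and the logistic coefficients `intercept` and `slope` are hypothetical choices for illustration; in practice the coefficients would be fit by logistic regression on labelled sentence pairs. The dynamic program is a generic Needleman-Wunsch-style recurrence restricted to one-to-one matches with free skips, which may differ from the recurrence used in the paper.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """One sparse TF*IDF vector per sentence; each sentence counts as a document for IDF."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    df = Counter(word for tokens in tokenized for word in set(tokens))
    return [
        {word: count * math.log(n / df[word]) for word, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_probability(similarity, intercept=-3.0, slope=8.0):
    """Logistic map from similarity to match probability (coefficients are assumed, not fitted)."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * similarity)))


def align_sentences(doc_a, doc_b):
    """Return index pairs (i, j) of a monotone one-to-one alignment between two sentence lists."""
    vectors = tfidf_vectors(doc_a + doc_b)
    vec_a, vec_b = vectors[:len(doc_a)], vectors[len(doc_a):]
    # A match is worth taking only when its probability exceeds 0.5.
    gain = [[match_probability(cosine(u, v)) - 0.5 for v in vec_b] for u in vec_a]

    rows, cols = len(doc_a), len(doc_b)
    best = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + gain[i - 1][j - 1],  # align sentence i-1 with j-1
                best[i - 1][j],                           # skip a sentence of doc_a
                best[i][j - 1],                           # skip a sentence of doc_b
            )

    # Trace back from the bottom-right corner to recover the matched pairs.
    pairs = []
    i, j = rows, cols
    while i > 0 and j > 0:
        g = gain[i - 1][j - 1]
        if g > 0 and best[i][j] == best[i - 1][j - 1] + g:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```

Calling `align_sentences(doc_a, doc_b)` on two lists of sentence strings returns the matched index pairs in document order. Because the dynamic program scores whole paths, a sentence's alignment depends on the matches chosen around it, which is one simple way to make the decision context-sensitive.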
