Scientific report: "Towards Robust Context-Sensitive Sentence Alignment for Monolingual Corpora"

Rani Nelken and Stuart M. Shieber
Division of Engineering and Applied Sciences, Harvard University, 33 Oxford St., Cambridge, MA 02138
{nelken,shieber}@deas.harvard.edu

Abstract

Aligning sentences belonging to comparable monolingual corpora has been suggested as a first step towards training text rewriting algorithms, for tasks such as summarization or paraphrasing. We present here a new monolingual sentence alignment algorithm, combining a sentence-based TF*IDF score, turned into a probability distribution using logistic regression, with a global alignment dynamic programming algorithm. Our approach provides a simpler and more robust solution achieving a substantial improvement in accuracy over existing systems.

1 Introduction

Sentence-aligned bilingual corpora are a crucial resource for training statistical machine translation systems. Several authors have suggested that large-scale aligned monolingual corpora could be similarly used to advance the performance of monolingual text-to-text rewriting systems, for tasks including summarization (Knight and Marcu, 2000; Jing, 2002) and paraphrasing (Barzilay and Elhadad, 2003; Quirk et al., 2004). Unlike bilingual corpora such as the Canadian Hansard corpus, which are relatively rare, it is now fairly easy to amass corpora of related monolingual documents. For instance, with the advent of news aggregator services such as Google News, one can readily collect multiple news stories covering the same news item (Dolan et al., 2004). Utilizing such a resource requires aligning related documents at a finer level of resolution, identifying which sentences from one document align with which sentences from the other.

Previous work has shown that aligning related monolingual documents is quite different from the well-studied multi-lingual alignment task. Whereas documents in a bilingual corpus are typically very closely aligned, monolingual corpora exhibit a much looser level of ...
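As a rough illustration of the scoring step named in the abstract, the sketch below computes a sentence-based TF*IDF cosine similarity and passes it through a logistic function to obtain a match probability. It is a minimal sketch under stated assumptions, not the authors' implementation: treating each sentence as its own "document" for the IDF statistics, whitespace tokenization, and the function names `tfidf_vectors`, `cosine`, and `match_probability` are all choices made for this example. The coefficients `a` and `b` are placeholders; in practice they would be fitted by logistic regression on labeled sentence pairs. The global alignment dynamic programming step is not shown.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Build one sparse TF*IDF vector per sentence.

    Assumption: each sentence is treated as a separate "document"
    when counting document frequencies.
    """
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: number of sentences in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_probability(similarity, a=-4.0, b=10.0):
    """Map a similarity score to a probability with a logistic function.

    a and b are hypothetical placeholder coefficients, not fitted values.
    """
    return 1.0 / (1.0 + math.exp(-(a + b * similarity)))
```

To score a candidate pair, build the vectors over the sentences of both documents together, then call `match_probability(cosine(vecs[i], vecs[j]))` for sentence `i` of one document and sentence `j` of the other.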
