TAILIEUCHUNG - Scientific report: "Thematic segmentation of texts: two methods for two kinds of texts"

Olivier FERRET, LIMSI-CNRS, Bât. 508 - BP 133, F-91403 Orsay Cedex, France, ferret@
Brigitte GRAU, LIMSI-CNRS, Bât. 508 - BP 133, F-91403 Orsay Cedex, France, grau@
Nicolas MASSON, LIMSI-CNRS, Bât. 508 - BP 133, F-91403 Orsay Cedex, France, masson@

Abstract

To segment texts into thematic units, we present here how a basic principle relying on word distribution can be applied to different kinds of texts. We start from an existing method that is well adapted to scientific texts, and we propose its adaptation to other kinds of texts by using semantic links between words. These relations are found in a lexical network that is automatically built from a large corpus. We compare the results of the two methods and give criteria for choosing the more suitable one according to text characteristics.

1. Introduction

Text segmentation according to a topical criterion is a useful process in many applications, such as text summarization or information extraction. Approaches that address this problem can be classified as knowledge-based or word-based. Knowledge-based systems such as Grosz and Sidner's (1986) require an extensive manual knowledge engineering effort to create the knowledge base (semantic network and/or frames), and this is only possible in very limited and well-known domains. To overcome this limitation and to process large amounts of text, word-based approaches have been developed.
Hearst (1997) and Masson (1995) make use of the word distribution in a text to find a thematic segmentation. These works are well adapted to technical or scientific texts, which are characterized by a specific vocabulary. To process narrative or expository texts such as newspaper articles, Kozima's (1993) and Morris and Hirst's (1991) approaches are based on lexical cohesion computed from a lexical network. These methods depend on the presence of the text vocabulary inside their network. So to avoid any restriction about domains in such kinds of .
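
The word-distribution principle mentioned above can be illustrated with a minimal sketch. It is not the authors' method, nor a faithful reproduction of Hearst's or Masson's algorithms: it only compares the vocabulary on either side of each sentence gap and proposes a boundary where the overlap drops. The function names (`tokenize`, `cosine`, `gap_similarities`, `segment`) and the `window` and `threshold` parameters are assumptions introduced for illustration.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def gap_similarities(sentences, window=2):
    """Similarity across each gap between consecutive sentences.

    For the gap before sentence `gap`, the words of up to `window`
    sentences on the left are compared with those on the right.
    """
    bags = [Counter(tokenize(s)) for s in sentences]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        scores.append(cosine(left, right))
    return scores


def segment(sentences, window=2, threshold=0.1):
    """Return indices i such that a boundary is proposed before sentence i.

    A boundary is proposed where the gap similarity is a local minimum
    and falls below `threshold` (an arbitrary, assumed value).
    """
    scores = gap_similarities(sentences, window)
    boundaries = []
    for k, score in enumerate(scores):
        prev_score = scores[k - 1] if k > 0 else float("inf")
        next_score = scores[k + 1] if k + 1 < len(scores) else float("inf")
        if score <= prev_score and score <= next_score and score < threshold:
            boundaries.append(k + 1)
    return boundaries
```

Because the sketch relies only on repeated surface words, it behaves best on texts with a specific, recurring vocabulary. This matches the observation above that distribution-based methods suit technical or scientific texts.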
