Scientific report: "Text Segmentation Using Reiteration and Collocation"

Amanda C. Jobbins
Department of Computing, Nottingham Trent University, Nottingham NG1 4BU, UK
ajobbins@

Lindsay J. Evett
Department of Computing, Nottingham Trent University, Nottingham NG1 4BU, UK
lje@

Abstract

A method is presented for segmenting text into subtopic areas. The proportion of related pairwise words is calculated between adjacent windows of text to determine their lexical similarity. The lexical cohesion relations of reiteration and collocation are used to identify related words. These relations are automatically located using a combination of three linguistic features: word repetition, collocation and relation weights. This method is shown to successfully detect known subject changes in text and corresponds well to the segmentations placed by test subjects.

Introduction

Many examples of heterogeneous data can be found in daily life. The Wall Street Journal archives, for example, consist of a series of articles about different subject areas. Segmenting such data into distinct topics is useful for information retrieval, where only those segments relevant to a user's query can be retrieved. Text segmentation could also be used as a pre-processing step in automatic summarisation. Each segment could be summarised individually and then combined to provide an abstract for a document.
Previous work on text segmentation has used term matching to identify clusters of related text. Salton and Buckley (1992) and later Hearst (1994) extracted related text portions by matching high frequency terms. Yaari (1997) segmented text into a hierarchical structure, identifying sub-segments of larger segments. Ponte and Croft (1997) used word co-occurrences to expand the number of terms for matching. Reynar (1994) compared all words across a text rather than the more usual nearest neighbours. A problem with using word repetition is that inappropriate matches can be made because of the lack of contextual information.
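
To make the idea of comparing adjacent windows by word repetition concrete, the following is a minimal sketch in Python. It is not the authors' implementation: it uses exact word repetition as the only relation (no collocation or relation weights), and the function names, the tokeniser and the fixed window size are all assumptions introduced for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (assumed tokeniser)."""
    return re.findall(r"[a-z]+", text.lower())


def repetition_similarity(left, right):
    """Proportion of cross-window word pairs that are exact repetitions."""
    if not left or not right:
        return 0.0
    matches = sum(1 for a in left for b in right if a == b)
    return matches / (len(left) * len(right))


def window_similarities(tokens, window_size):
    """Score every gap that has a full window of tokens on each side."""
    scores = []
    for gap in range(window_size, len(tokens) - window_size + 1):
        left = tokens[gap - window_size:gap]
        right = tokens[gap:gap + window_size]
        scores.append((gap, repetition_similarity(left, right)))
    return scores


def lowest_similarity_gap(tokens, window_size):
    """Return the gap with the lowest similarity, a candidate subtopic boundary."""
    scores = window_similarities(tokens, window_size)
    if not scores:
        return None
    return min(scores, key=lambda pair: pair[1])[0]
```

Because the sketch matches surface forms only, comparing tokenize("the river bank flooded") with tokenize("the bank raised rates") counts "bank" as a related pair even though the two senses differ, which is the kind of inappropriate match described above.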
