Text Segmentation Using Reiteration and Collocation

Amanda C. Jobbins
Department of Computing, Nottingham Trent University, Nottingham NG1 4BU, UK
ajobbins@resumix.com

Lindsay J. Evett
Department of Computing, Nottingham Trent University, Nottingham NG1 4BU, UK
lje@doc.ntu.ac.uk

Abstract

A method is presented for segmenting text into subtopic areas. The proportion of related pairwise words is calculated between adjacent windows of text to determine their lexical similarity. The lexical cohesion relations of reiteration and collocation are used to identify related words. These relations are automatically located using a combination of three linguistic features: word repetition, collocation and relation weights. This method is shown to successfully detect known subject changes in text and corresponds well to the segmentations placed by test subjects.

Introduction

Many examples of heterogeneous data can be found in daily life. The Wall Street Journal archives, for example, consist of a series of articles about different subject areas. Segmenting such data into distinct topics is useful for information retrieval, where only those segments relevant to a user's query can be retrieved. Text segmentation could also be used as a pre-processing step in automatic summarisation. Each segment could be summarised individually and then combined to provide an abstract for a document.
Previous work on text segmentation has used term matching to identify clusters of related text. Salton and Buckley (1992) and later Hearst (1994) extracted related text portions by matching high-frequency terms. Yaari (1997) segmented text into a hierarchical structure, identifying sub-segments of larger segments. Ponte and Croft (1997) used word co-occurrences to expand the number of terms for matching. Reynar (1994) compared all words across a text rather than the more usual nearest neighbours. A problem with using word repetition is that inappropriate matches can be made because of the lack of contextual information.
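To make the windowed comparison concrete, the following is a minimal sketch of scoring adjacent windows by word repetition alone. It is not the authors' implementation: the function names, the tokeniser, and the window size are assumptions chosen for illustration, and the collocation and relation-weight features are deliberately left out.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def repetition_similarity(left, right):
    """Proportion of cross-window word pairs that are exact repetitions.

    Every word in the left window is paired with every word in the
    right window; the score is matching pairs over all pairs.
    """
    if not left or not right:
        return 0.0
    matches = sum(1 for a in left for b in right if a == b)
    return matches / (len(left) * len(right))


def boundary_scores(tokens, window=4):
    """Score each candidate gap by comparing the windows on either side.

    The window size is a hypothetical default, not a value from the paper.
    Returns (gap_index, similarity) pairs; a low similarity suggests a
    possible subtopic boundary at that gap.
    """
    scores = []
    for gap in range(window, len(tokens) - window + 1):
        left = tokens[gap - window:gap]
        right = tokens[gap:gap + window]
        scores.append((gap, repetition_similarity(left, right)))
    return scores
```

Under this sketch, `repetition_similarity(tokenize("river bank"), tokenize("bank loan"))` gives a non-zero score purely because "bank" recurs, even though the two phrases concern different subjects. This is the kind of inappropriate match that repetition without contextual information can produce.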