Scientific report: "A Study on Automatically Extracted Keywords in Text Categorization"

Anette Hulth and Beata B. Megyesi
Department of Linguistics and Philology, Uppsala University, Sweden
bea@

Abstract

This paper presents a study on if and how automatically extracted keywords can be used to improve text categorization. In summary, we show that a higher performance, as measured by micro-averaged F-measure on a standard text categorization collection, is achieved when the full-text representation is combined with the automatically extracted keywords. The combination is obtained by giving higher weights to words in the full-texts that are also extracted as keywords. We also present results for experiments in which the keywords are the only input to the categorizer, either represented as unigrams or intact. Of these two experiments, the unigrams have the best performance, although neither performs as well as headlines only.

1 Introduction

Automatic text categorization is the task of assigning any of a set of predefined categories to a document. The prevailing approach is that of supervised machine learning, in which an algorithm is trained on documents with known categories. Before any learning can take place, the documents must be represented in a form that is understandable to the learning algorithm. A trained prediction model is subsequently applied to previously unseen documents to assign the categories.

In order to perform a text categorization task, there are two major decisions to make: how to represent the text, and what learning algorithm to use to create the prediction model. The decision about the representation is in turn divided into two subquestions: what features to select as input, and which type of value to assign to these features.

In most studies, the best performing representation consists of the full-length text, keeping the tokens in the document separate, that is, as unigrams. In recent years, however, a number of experiments have been ...
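The combination described in the abstract, in which full-text words that were also extracted as keywords receive a higher weight, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `weighted_term_counts`, the boost factor of 2.0, the whitespace tokenization, and the example text are all assumptions made for illustration, since this excerpt does not state the actual weighting scheme.

```python
from collections import Counter


def weighted_term_counts(text, keywords, boost=2.0):
    """Return unigram weights for `text`.

    Each weight is the raw term count, multiplied by `boost` when the
    unigram also occurs in one of the extracted keywords.
    The boost value is a hypothetical choice, not taken from the paper.
    """
    # Naive lowercase whitespace tokenization (an assumption for brevity).
    tokens = text.lower().split()
    # Keywords may be multiword phrases, so split them into unigrams.
    keyword_unigrams = {w for kw in keywords for w in kw.lower().split()}
    counts = Counter(tokens)
    return {
        tok: count * boost if tok in keyword_unigrams else float(count)
        for tok, count in counts.items()
    }


# Hypothetical document and keyword list.
doc = "stock markets fell as oil prices rose and oil exports fell"
weights = weighted_term_counts(doc, keywords=["oil prices"])
print(weights["oil"], weights["fell"])
```

The resulting dictionary could serve as the feature values of a unigram representation. Setting `boost=1.0` reduces it to plain term counts over the full text.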
