A Practical Solution to the Problem of Automatic Word Sense Induction

Reinhard Rapp
University of Mainz, FASK
D-76711 Germersheim, Germany
rapp@

Abstract

Recent studies in word sense induction are based on clustering global co-occurrence vectors, i.e. vectors that reflect the overall behavior of a word in a corpus. If a word is semantically ambiguous, this means that these vectors are mixtures of all its senses. Inducing a word's senses therefore involves the difficult problem of recovering the sense vectors from the mixtures. In this paper we argue that the demixing problem can be avoided, since the contextual behavior of the senses is directly observable in the form of the local contexts of a word. From human disambiguation performance we know that the context of a word is usually sufficient to determine its sense. Based on this observation, we describe an algorithm that discovers the different senses of an ambiguous word by clustering its contexts. The main difficulty with this approach, namely the problem of data sparseness, could be minimized by looking at only the three main dimensions of the context matrices.

1 Introduction

The topic of this paper is word sense induction, that is, the automatic discovery of the possible senses of a word. A related problem is word sense disambiguation: here the senses are assumed to be known, and the task is to choose the correct one when given an ambiguous word in context. Whereas until recently the focus of research had been on sense disambiguation, papers like Pantel & Lin (2002), Neill (2002), and Rapp (2003) give evidence that sense induction now also attracts attention.
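The context-clustering idea outlined in the abstract can be illustrated with a minimal sketch. It is not the paper's actual procedure: the toy contexts for the word "palm", the binary weighting, the cosine distance, the average-link clustering, and all function names are assumptions made for illustration. Only the overall shape (build a context matrix for one ambiguous word, keep its three main dimensions via singular value decomposition, cluster the contexts) follows the description above.

```python
import numpy as np

def build_context_matrix(contexts):
    """Binary matrix: one row per context, one column per neighbouring word."""
    vocab = sorted({w for ctx in contexts for w in ctx})
    col = {w: j for j, w in enumerate(vocab)}
    m = np.zeros((len(contexts), len(vocab)))
    for i, ctx in enumerate(contexts):
        for w in ctx:
            m[i, col[w]] = 1.0
    return m

def reduce_to_main_dimensions(m, k=3):
    """Project each context onto the k strongest singular directions."""
    u, s, _ = np.linalg.svd(m, full_matrices=False)
    k = min(k, len(s))
    return u[:, :k] * s[:k]

def cosine_distances(points):
    """Pairwise cosine distance between rows (assumed metric, not from the paper)."""
    norms = np.linalg.norm(points, axis=1, keepdims=True)
    unit = points / np.maximum(norms, 1e-12)
    return 1.0 - unit @ unit.T

def cluster_contexts(points, n_senses):
    """Average-link agglomerative clustering; returns one label per context."""
    dist = cosine_distances(points)
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_senses:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    clusters.sort(key=min)  # number clusters by their earliest context
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Hypothetical local contexts of the ambiguous word "palm" (target word removed).
contexts = [
    ["tree", "coconut", "beach"],
    ["tree", "leaves", "coconut"],
    ["tree", "beach", "leaves", "tropical"],
    ["hand", "finger", "wrist"],
    ["hand", "finger"],
    ["hand", "wrist", "sweaty"],
]

matrix = build_context_matrix(contexts)
reduced = reduce_to_main_dimensions(matrix, k=3)
labels = cluster_contexts(reduced, n_senses=2)
print(labels)
```

In this toy setting the two groups of contexts share no neighbouring words, so the separation is easy. With real corpus data the contexts are sparse and noisy, which is the situation the dimensionality reduction is meant to address. The number of senses is passed in by hand here, which is a further simplification.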
In the approach by Pantel & Lin (2002), all words occurring in a parsed corpus are clustered on the basis of the distances of their co-occurrence vectors. This is called global clustering. Since, by looking at differential vectors, their algorithm allows a word to belong to more than one cluster, each cluster a word is assigned to can be considered as one of its
