Scientific report: "Concept Unification of Terms in Different Languages for IR"

Qing Li, Sung-Hyon Myaeng
Information Communications University, Korea
liqing myaeng @

Yun Jin
Chungnam National University, Korea
wkim@

Bo-yeong Kang
Seoul National University, Korea
comeng99@

Abstract

For historical and cultural reasons, English phrases, especially proper nouns and new words, frequently appear in Web pages written primarily in Asian languages such as Chinese and Korean. Although these English terms and their equivalents in the Asian languages refer to the same concept, they are erroneously treated as independent index units in traditional Information Retrieval (IR). This paper describes the degree to which the problem arises in IR and suggests a novel technique to solve it. Our method first extracts an English phrase from Asian-language Web pages and then unifies the extracted phrase and its equivalent(s) in the language as one index unit. Experimental results show that the high precision of our conceptual unification approach greatly improves IR performance.

1 Introduction

The mixed use of English and local languages presents a classical problem of vocabulary mismatch in monolingual information retrieval (MIR). The problem is especially significant in Asian languages because words in the local languages are often mixed with English words.
Although English terms and their equivalents in a local language refer to the same concept, they are erroneously treated as independent index units in traditional MIR. Such separation of semantically identical words in different languages may limit retrieval performance. For instance, as shown in Figure 1, there are three kinds of Chinese Web pages containing information related to the Viterbi Algorithm. The first kind contains "Viterbi Algorithm" but not its Chinese equivalent. The second ...

[Figure 1: Three kinds of Chinese Web pages related to the Viterbi Algorithm. Sample page text: "... as the states of HMM (hidden Markov models), in which the Viterbi algorithm is employed for ..."]
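
The vocabulary mismatch described above can be illustrated with a toy inverted index. The following is only a minimal sketch, not the authors' implementation. The equivalence table `EQUIVALENTS`, the concept ID `C_VITERBI`, the Chinese term 维特比算法, the three sample pages, and the helper names `unify`, `build_index`, and `search` are all assumptions made for illustration. The documents are assumed to be pre-tokenized, and the phrase-extraction step is not modeled.

```python
from collections import defaultdict

# Hypothetical equivalence table: every surface form maps to one concept ID.
# The Chinese term is an illustrative assumption, not taken from the paper.
EQUIVALENTS = {
    "viterbi algorithm": "C_VITERBI",
    "维特比算法": "C_VITERBI",
}


def unify(term):
    """Return the concept ID for a known term, else the lowercased term."""
    key = term.lower()
    return EQUIVALENTS.get(key, key)


def build_index(docs, unify_terms=True):
    """Build an inverted index from pre-tokenized docs {doc_id: [terms]}."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            unit = unify(term) if unify_terms else term.lower()
            index[unit].add(doc_id)
    return index


def search(index, query, unify_terms=True):
    """Look up a single-term query and return matching doc IDs, sorted."""
    unit = unify(query) if unify_terms else query.lower()
    return sorted(index.get(unit, set()))


# Assumed sample pages: English only, Chinese only, and both forms.
DOCS = {
    "page1": ["Viterbi Algorithm", "HMM"],
    "page2": ["维特比算法", "HMM"],
    "page3": ["Viterbi Algorithm", "维特比算法"],
}

traditional = build_index(DOCS, unify_terms=False)
unified = build_index(DOCS, unify_terms=True)

# Separate index units: the Chinese-only page is missed.
print(search(traditional, "Viterbi Algorithm", unify_terms=False))
# One index unit per concept: all three pages are found.
print(search(unified, "Viterbi Algorithm"))
```

With separate index units, the English query retrieves only `page1` and `page3`. With unified units it also retrieves `page2`, and a query in either language returns the same pages.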
