Scientific report: "Alignment-Based Discriminative String Similarity"

Shane Bergsma and Grzegorz Kondrak
Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada, T6G 2E8
bergsma kondrak @

Abstract

A character-based measure of similarity is an important component of many natural language processing systems, including approaches to transliteration, coreference, word alignment, spelling correction, and the identification of cognates in related vocabularies. We propose an alignment-based discriminative framework for string similarity. We gather features from substring pairs consistent with a character-based alignment of the two strings. This approach achieves exceptional performance; on nine separate cognate identification experiments using six language pairs, we more than double the precision of traditional orthographic measures like Longest Common Subsequence Ratio and Dice's Coefficient. We also show strong improvements over other recent discriminative and heuristic similarity functions.

1 Introduction

String similarity is often used as a means of quantifying the likelihood that a pair of strings have the same underlying meaning, based purely on the character composition of the two words. Strube et al. (2002) use Edit Distance as a feature for determining if two words are coreferent. Taskar et al. (2005) use French-English common letter sequences as a feature for discriminative word alignment in bilingual texts. Brill and Moore (2000) learn misspelled-word to correctly-spelled-word similarities for spelling correction. In each of these examples, a similarity measure can make use of the recurrent substring pairings that reliably occur between words having the same meaning.

Across natural languages, these recurrent substring correspondences are found in word pairs known as cognates: words with a common form and meaning across languages. Cognates arise either from words in a common ancestor language (e.g. light/Licht, night/Nacht in English/German) or from foreign ...
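
The traditional orthographic baselines named in the abstract, Longest Common Subsequence Ratio (LCSR) and Dice's Coefficient, can be illustrated with a short sketch. It assumes the commonly used definitions, which this excerpt does not spell out: LCSR as the length of the longest common subsequence divided by the length of the longer word, and Dice's Coefficient as twice the number of shared character bigrams divided by the total number of bigrams in both words. The function names are hypothetical, and this is not the authors' implementation.

```python
from collections import Counter


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Assumed definition: LCS length divided by the longer word's length."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def bigrams(s: str) -> list:
    """Adjacent character pairs of s, in order."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice(a: str, b: str) -> float:
    """Assumed definition: 2 * shared bigrams / total bigrams in both words."""
    count_a, count_b = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(count_a.values()) + sum(count_b.values())
    if total == 0:
        return 0.0
    shared = sum((count_a & count_b).values())  # multiset intersection
    return 2 * shared / total


# The cognate pairs from the text, lowercased for comparison.
for english, german in [("light", "licht"), ("night", "nacht")]:
    print(english, german, lcsr(english, german), dice(english, german))
```

Both measures score each character or bigram match equally, regardless of which letters are involved. They therefore have no way to reward a recurrent correspondence such as "gh" in English against "ch" in German, which is the kind of substring pairing the introduction describes.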
