Scientific report: "An efficient algorithm for building a distributional thesaurus (and other Sketch Engine developments)"

Pavel Rychly, Masaryk University, Brno, Czech Republic, pary@ z
Adam Kilgarriff, Lexical Computing Ltd, Brighton, UK, adam@

Abstract

Gorman and Curran (2006) argue that thesaurus generation for billion+-word corpora is problematic as the full computation takes many days. We present an algorithm with which the computation takes under two hours. We have created, and made publicly available, thesauruses based on large corpora for (at time of writing) seven major world languages. The development is implemented in the Sketch Engine (Kilgarriff et al., 2004).

Another innovative development in the same tool is the presentation of the grammatical behaviour of a word against the background of how all other words of the same word class behave. Thus, the English noun constraint occurs in the plural 75% of the time. Is this a salient lexical fact? To form a judgement, we need to know the distribution for all nouns. We use histograms to present the distribution in a way that is easy to grasp.

1 Thesaurus creation

Over the last ten years, interest has been growing in distributional thesauruses (hereafter simply "thesauruses"). Following initial work by Sparck Jones (1964) and Grefenstette (1994), an early online distributional thesaurus presented in Lin (1998) has been widely used and cited, and numerous authors since have explored thesaurus properties and parameters (see the survey component of Weeds and Weir, 2005).
A thesaurus is created by taking a corpus, identifying contexts for each word, and identifying which words share contexts. For each word, the words that share most contexts (according to some statistic which also takes account of their frequency) are its nearest neighbours. Thesauruses generally improve in accuracy with corpus size. The larger the corpus, the more clearly the signal (of similar words) will be distinguished from the noise (of words that just happen to share a few contexts). Lin's was
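
The three steps described above (take a corpus, identify contexts for each word, find words that share contexts) can be illustrated with a minimal Python sketch. It is not the algorithm presented in this paper, and it makes no attempt at efficiency. The toy (word, context) pairs, the context labels, the function names, and the choice of a count-weighted Dice coefficient as the "statistic which also takes account of frequency" are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# Toy (word, context) observations. The context labels are hypothetical
# "relation:collocate" strings, standing in for whatever a real corpus
# analysis would extract.
PAIRS = [
    ("beer", "object_of:drink"), ("beer", "modifier:cold"), ("beer", "object_of:brew"),
    ("wine", "object_of:drink"), ("wine", "modifier:red"), ("wine", "modifier:cold"),
    ("tea", "object_of:drink"), ("tea", "modifier:hot"),
    ("car", "object_of:drive"), ("car", "modifier:red"),
]


def build_context_vectors(pairs):
    """Map each word to a Counter of the contexts it occurs in."""
    vectors = defaultdict(Counter)
    for word, context in pairs:
        vectors[word][context] += 1
    return vectors


def dice_similarity(a, b):
    """Count-weighted Dice coefficient (an assumed statistic, not the paper's).

    The shared mass is the sum of the smaller count over contexts that both
    words have; dividing by the total mass penalises very frequent words.
    """
    shared = sum(min(a[c], b[c]) for c in a.keys() & b.keys())
    total = sum(a.values()) + sum(b.values())
    return 2.0 * shared / total if total else 0.0


def nearest_neighbours(vectors, word, k=3):
    """Return up to k (neighbour, score) pairs, highest score first."""
    target = vectors.get(word, Counter())
    scored = []
    for other, vec in vectors.items():
        if other == word:
            continue
        score = dice_similarity(target, vec)
        if score > 0:  # words sharing no context are not neighbours
            scored.append((other, score))
    # Sort by descending score, breaking ties alphabetically.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


if __name__ == "__main__":
    vectors = build_context_vectors(PAIRS)
    for neighbour, score in nearest_neighbours(vectors, "beer"):
        print(neighbour, round(score, 3))
```

This sketch compares the target word against every other word, which is the kind of cost that becomes a problem for very large corpora; it is meant only to make the definition of "nearest neighbours" concrete.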
