Scientific report: "Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike"

Bart Jongejan
CST-University of Copenhagen, Njalsgade 140-142, 2300 København S, Denmark
bartj@

Hercules Dalianis†‡
†DSV, KTH - Stockholm University, Forum 100, 164 40 Kista, Sweden
‡Euroling AB, SiteSeeker, Igeldammsgatan 22c, 112 49 Stockholm, Sweden
hercules@

Abstract

We propose a method to automatically train lemmatization rules that handle prefix, infix and suffix changes to generate the lemma from the full form of a word. We explain how the lemmatization rules are created and how the lemmatizer works. We trained this lemmatizer on Danish, Dutch, English, German, Greek, Icelandic, Norwegian, Polish, Slovene and Swedish full form-lemma pairs, respectively. We obtained significant improvements of 24 percent for Polish, […] percent for Dutch, […] percent for English, […] percent for German and […] percent for Swedish compared to lemmatization with a suffix-only lemmatizer. Results for Icelandic deteriorated by […] percent. We also made an observation regarding the number of produced lemmatization rules as a function of the number of training pairs.

1 Introduction

Lemmatizers and stemmers are valuable human language technology tools for improving precision and recall in an information retrieval setting. For example, stemming and lemmatization make it possible to match a query in one morphological form with a word in a document in another morphological form. Lemmatizers can also be used in lexicography to find new words in text material, including the words' frequency of use. Other applications are the creation of index lists for book indexes as well as key word lists.

Lemmatization is the process of reducing a word to its base form, normally the dictionary look-up form (lemma) of the word. A trivial way to do this is by dictionary look-up. More advanced systems use hand-crafted or automatically generated transformation rules that look at the surface form of the word and […]
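The two approaches mentioned above, dictionary look-up and transformation rules on the surface form, can be illustrated with a minimal Python sketch. It is not the rule format or the training procedure of the paper: the names `LEXICON`, `RULES`, `apply_rule` and `lemmatize`, the `*` wildcard notation and the handful of German examples are all assumptions made for illustration. The sketch only shows why a rule that can touch the prefix and infix as well as the suffix covers forms that a suffix-only rule cannot.

```python
import re

# Hypothetical toy data for illustration only; nothing here comes from the paper.
# Dictionary look-up table for irregular forms.
LEXICON = {"war": "sein"}

# Transformation rules as (pattern, replacement) pairs. Each '*' stands for a
# non-empty run of characters that is copied unchanged into the lemma.
# Rules are tried in order, most specific first.
RULES = [
    ("*ge*t", "**en"),  # infix and suffix change: aufgemacht -> aufmachen
    ("ge*t", "*en"),    # prefix and suffix change: gemacht -> machen
    ("*st", "*en"),     # suffix-only change: machst -> machen
]


def apply_rule(word, pattern, replacement):
    """Rewrite word with one rule, or return None if the pattern does not match."""
    # Turn the literal parts into a regex with one capture group per wildcard.
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    regex = "^" + "(.+)".join(literal_parts) + "$"
    match = re.match(regex, word)
    if match is None:
        return None
    # Interleave the captured runs with the literal parts of the replacement.
    pieces = replacement.split("*")
    lemma = pieces[0]
    for captured, piece in zip(match.groups(), pieces[1:]):
        lemma += captured + piece
    return lemma


def lemmatize(word):
    """Dictionary look-up first, then the first matching rule, else the word itself."""
    if word in LEXICON:
        return LEXICON[word]
    for pattern, replacement in RULES:
        lemma = apply_rule(word, pattern, replacement)
        if lemma is not None:
            return lemma
    return word
```

With these toy rules, `lemmatize("aufgemacht")` returns `"aufmachen"` and `lemmatize("gemacht")` returns `"machen"`, while a lemmatizer restricted to the suffix-only rule `*st` would leave both forms unchanged.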
