Scientific report: "Part-of-Speech Tagging Considering Surface Form for an Agglutinative Language"

Do-Gil Lee and Hae-Chang Rim
Dept. of Computer Science and Engineering, Korea University
1, 5-ka, Anam-dong, Seongbuk-ku, Seoul 136-701, Korea
dglee rim @

Abstract

Previous probabilistic part-of-speech tagging models for agglutinative languages have considered only the lexical forms of morphemes, not the surface forms of words. This makes the probability calculation inaccurate. The proposed model is based on the observation that when several words (surface forms) share the same lexical forms, their probabilities of occurrence differ from one another. It is also designed to consider the lexical form of a word. Experiments show that the proposed model outperforms the bigram hidden Markov model (HMM)-based tagging model.

1 Introduction

Part-of-speech (POS) tagging is the task of assigning a proper POS tag to each linguistic unit, such as a word, in a given sentence. In English POS tagging, the word is used as the linguistic unit. However, the number of possible words in agglutinative languages such as Korean is almost infinite, because words can be freely formed by gluing morphemes together. Morpheme-unit tagging is therefore preferred over word-unit tagging in such languages and suits them better.

Figure 1 shows an example of the morpheme structure of a sentence, where the bold lines indicate the most likely morpheme-POS sequence. A solid line represents a transition between two morphemes across a word boundary, and a dotted line represents a transition between two morphemes within a word.

Previous probabilistic POS models for agglutinative languages have considered only the lexical forms of morphemes, not the surface forms of words. This makes the probability calculation inaccurate. The proposed model is based on the observation that when several words (surface forms) share the same lexical forms, their probabilities of occurrence differ from one another. Also, it is designed to consider lexical form of
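
The morpheme lattice described for Figure 1 can be illustrated with a minimal Python sketch of a baseline bigram HMM search, in which every word contributes several candidate morpheme-POS analyses and the search picks the most likely path through them. This is a sketch of the kind of lexical-form-only baseline the text refers to, not the proposed model. The romanized morphemes, the tag names (N, J, V, E), every probability value, and the function names are invented for illustration, and sharing one transition table between within-word and cross-word transitions is an assumption.

```python
import math
from typing import Dict, List, Tuple

Morph = Tuple[str, str]      # (lexical form, POS tag)
Analysis = List[Morph]       # one candidate analysis of a single word


def log_p(table: Dict[tuple, float], key: tuple, floor: float = 1e-6) -> float:
    """Log-probability lookup with a small floor for unseen events."""
    return math.log(table.get(key, floor))


def score_inside(analysis: Analysis, trans, emit) -> float:
    """Emissions plus within-word tag transitions (the dotted lines)."""
    total = 0.0
    for i, (morph, tag) in enumerate(analysis):
        total += log_p(emit, (tag, morph))
        if i > 0:
            total += log_p(trans, (analysis[i - 1][1], tag))
    return total


def viterbi(lattice: List[List[Analysis]], trans, emit, start: str = "<s>"):
    """Return (log score, chosen analyses) of the best path through the lattice."""
    best = [(0.0, [])]       # one (score, path) entry per previous candidate
    last_tags = [start]      # last POS tag of each previous candidate
    for candidates in lattice:
        new_best, new_tags = [], []
        for cand in candidates:
            inside = score_inside(cand, trans, emit)
            first_tag = cand[0][1]
            # Cross-word transition (the solid lines): previous last tag -> first tag.
            options = [
                (score + log_p(trans, (prev_tag, first_tag)) + inside, path)
                for (score, path), prev_tag in zip(best, last_tags)
            ]
            score, path = max(options, key=lambda o: o[0])
            new_best.append((score, path + [cand]))
            new_tags.append(cand[-1][1])
        best, last_tags = new_best, new_tags
    return max(best, key=lambda b: b[0])


# Toy two-word input with invented numbers: the first word has two analyses.
lattice = [
    [[("na", "N"), ("neun", "J")], [("nal", "V"), ("neun", "E")]],
    [[("sae", "N")]],
]
trans = {("<s>", "N"): 0.6, ("<s>", "V"): 0.4, ("N", "J"): 0.7,
         ("V", "E"): 0.8, ("J", "N"): 0.3, ("E", "N"): 0.6}
emit = {("N", "na"): 0.05, ("J", "neun"): 0.2, ("V", "nal"): 0.01,
        ("E", "neun"): 0.3, ("N", "sae"): 0.02}

best_score, best_path = viterbi(lattice, trans, emit)
print(best_path)
```

Both analyses of the first toy word are scored only through their lexical forms and tags; the surface form of the word never enters the score, which is the limitation the text attributes to earlier models.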
