Scientific report: "Automatic Sanskrit Segmentizer Using Finite State Transducers"

Vipul Mittal
Language Technologies Research Center, IIIT-H, Gachibowli, Hyderabad, India.
vipulmittal@

Abstract

In this paper, we propose a novel method for automatic segmentation of a Sanskrit string into different words. The input for our segmentizer is a Sanskrit string, encoded either as a Unicode string or as a Roman transliterated string, and the output is a set of possible splits with weights associated with each of them. We followed two different approaches to segment a Sanskrit text using sandhi[1] rules extracted from a parallel corpus of manually sandhi-split text. While the first approach augments the finite state transducer used to analyze Sanskrit morphology and traverses it to segment a word, the second approach generates all possible segmentations and validates each constituent using a morph analyzer.

1 Introduction

Sanskrit has a rich tradition of oral transmission of texts, and this process causes the text to undergo euphonic changes at the word boundaries. In oral transmission, the text is predominantly spoken as continuous speech. However, continuous speech makes the text ambiguous. To overcome this problem, there is also a tradition of reciting the pada-pāṭha (recitation of words) in addition to the recitation of a saṃhitā (a continuous sandhied text). In the written form, because of the dominance of oral transmission, the text is written as a continuous string of letters rather than a sequence of words. Thus the Sanskrit texts consist of a very long sequence of phonemes with the word boundaries having undergone ...

[1] Sandhi means the euphonic transformation of words when they are consecutively pronounced. Typically, when a word w1 is followed by a word w2, some terminal segment of w1 merges with some initial segment of w2 to be replaced by a smoothed phonetic interpolation, corresponding to minimizing the energy necessary to reconfigure the vocal organs at the juncture between the words.
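The second approach named in the abstract (generate candidate segmentations, then validate each constituent) can be illustrated with a minimal sketch. This is not the paper's implementation: the four vowel-sandhi rules, the six-word lexicon standing in for a morph analyzer, and the names SANDHI_RULES, LEXICON, is_valid_word and segment are all assumptions chosen for illustration, and the weighting of splits and the finite state transducer are omitted.

```python
from functools import lru_cache

# Toy reverse-sandhi table (an assumption, not the paper's extracted rules):
# surface string at a juncture -> possible (end of w1, start of w2) pairs.
SANDHI_RULES = {
    "ā": [("a", "a")],
    "e": [("a", "i")],
    "o": [("a", "u")],
    "ai": [("a", "e")],
}

# Hypothetical stand-in for a morphological analyzer: a tiny word list.
LEXICON = frozenset({"na", "asti", "ca", "iti", "eva", "tatra"})


def is_valid_word(word):
    """Return True if the toy 'morph analyzer' accepts the word."""
    return word in LEXICON


def segment(text):
    """Return every split of `text` whose constituents are all valid words."""

    @lru_cache(maxsize=None)
    def splits(rest):
        results = []
        # Case 1: the remaining string is itself a word.
        if is_valid_word(rest):
            results.append((rest,))
        for i in range(1, len(rest)):
            # Case 2: a plain word boundary with no sandhi change.
            head, tail = rest[:i], rest[i:]
            if is_valid_word(head):
                results.extend((head,) + more for more in splits(tail))
            # Case 3: undo a sandhi rule whose surface form starts at i.
            for surface, sources in SANDHI_RULES.items():
                if not rest.startswith(surface, i):
                    continue
                for left_end, right_start in sources:
                    head = rest[:i] + left_end
                    tail = right_start + rest[i + len(surface):]
                    if is_valid_word(head):
                        results.extend((head,) + more for more in splits(tail))
        # Remove duplicates while keeping discovery order.
        return tuple(dict.fromkeys(results))

    return [list(split) for split in splits(text)]


if __name__ == "__main__":
    for sandhied in ("nāsti", "ceti", "tatraiva"):
        print(sandhied, "->", segment(sandhied))
```

Because every recursive call works on a strictly shorter string, the search terminates. A realistic rule table would be far larger and would produce many competing splits, which is where the weights mentioned in the abstract become necessary.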
