Automatic Sanskrit Segmentizer Using Finite State Transducers

Vipul Mittal
Language Technologies Research Center, IIIT-H, Gachibowli, Hyderabad, India
vipulmittal@research.iiit.ac.in

Abstract

In this paper, we propose a novel method for automatic segmentation of a Sanskrit string into different words. The input for our segmentizer is a Sanskrit string, encoded either as a Unicode string or as a Roman transliterated string, and the output is a set of possible splits with weights associated with each of them. We followed two different approaches to segment a Sanskrit text using sandhi¹ rules extracted from a parallel corpus of manually sandhi-split text. The first approach augments the finite state transducer used to analyze Sanskrit morphology and traverses it to segment a word; the second approach generates all possible segmentations and validates each constituent using a morph analyzer.

1 Introduction

Sanskrit has a rich tradition of oral transmission of texts, and this process causes the text to undergo euphonic changes at the word boundaries. In oral transmission, the text is predominantly spoken as continuous speech. However, continuous speech makes the text ambiguous. To overcome this problem, there is also a tradition of reciting the pada-pāṭha (recitation of words) in addition to the recitation of a saṃhitā (a continuous sandhied text). In the written form, because of the dominance of oral transmission, the text is written as a continuous string of letters rather than a sequence of words. Thus the Sanskrit texts consist of a very long sequence of phonemes with the word boundaries having undergone .

¹ Sandhi means euphony transformation of words when they are consecutively pronounced. Typically, when a word w1 is followed by a word w2, some terminal segment of w1 merges with some initial segment of w2 to be replaced by a smoothed phonetic interpolation, corresponding to minimizing the energy necessary to reconfigure the vocal organs at the juncture between the words.
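To make the generate-and-validate idea from the abstract concrete, the following is a minimal Python sketch under stated assumptions. The three reverse-sandhi rules, the tiny word list standing in for a morph analyzer, the ASCII convention of writing long ā as "A", and the names SANDHI_RULES, LEXICON, is_valid_word, and segment are all hypothetical and chosen for illustration; they are not taken from the paper. The weighting of splits is omitted.

```python
from functools import lru_cache

# Toy reverse-sandhi rules (hypothetical, for illustration only):
# surface substring -> possible (end of left word, start of right word) pairs.
SANDHI_RULES = {
    "A": [("a", "a")],  # a + a -> long a, e.g. na + asti -> nAsti
    "e": [("a", "i")],  # a + i -> e,      e.g. na + iti  -> neti
    "o": [("a", "u")],  # a + u -> o
}

# Stand-in for a morphological analyzer: a tiny hand-made word list.
LEXICON = {"na", "asti", "iti", "ca", "api"}


def is_valid_word(word):
    """Accept a candidate only if the stand-in analyzer knows it."""
    return word in LEXICON


def segment(text):
    """Return every split of `text` whose parts are all valid words."""

    @lru_cache(maxsize=None)
    def splits(rest):
        results = []
        if is_valid_word(rest):
            results.append((rest,))
        for i in range(1, len(rest)):
            # Case 1: plain word boundary with no sound change.
            left, right = rest[:i], rest[i:]
            if is_valid_word(left):
                results.extend((left,) + tail for tail in splits(right))
            # Case 2: undo a sandhi rule whose surface form starts at i.
            for surface, pairs in SANDHI_RULES.items():
                if not rest.startswith(surface, i):
                    continue
                for left_end, right_start in pairs:
                    left = rest[:i] + left_end
                    right = right_start + rest[i + len(surface):]
                    if is_valid_word(left):
                        results.extend(
                            (left,) + tail for tail in splits(right)
                        )
        return results

    return splits(text)


if __name__ == "__main__":
    for word in ("nAsti", "neti", "cApi"):
        print(word, segment(word))
```

Because every candidate left word must pass is_valid_word before the recursion continues, invalid branches are pruned early, and the remaining string always shrinks, so the search terminates. A real system would replace LEXICON with a morphological analyzer and use rules extracted from a sandhi-split corpus.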