A Language-Independent Unsupervised Model for Morphological Segmentation

Vera Demberg
School of Informatics, University of Edinburgh
Edinburgh EH8 9LW, GB
v.demberg@sms.ed.ac.uk

Abstract

Morphological segmentation has been shown to be beneficial to a range of NLP tasks such as machine translation, speech recognition, speech synthesis and information retrieval. Recently, a number of approaches to unsupervised morphological segmentation have been proposed. This paper describes an algorithm that draws from previous approaches and combines them into a simple model for morphological segmentation that outperforms other approaches on English and German, and also yields good results on agglutinative languages such as Finnish and Turkish. We also propose a method for detecting variation within stems in an unsupervised fashion. The segmentation quality reached with the new algorithm is good enough to improve grapheme-to-phoneme conversion.

1 Introduction

Morphological segmentation has been shown to be beneficial to a number of NLP tasks such as machine translation (Goldwater and McClosky, 2005), speech recognition (Kurimo et al., 2006), information retrieval (Monz and de Rijke, 2002) and question answering. Segmenting a word into meaning-bearing units is particularly interesting for morphologically complex languages where words can be composed of several morphemes through inflection, derivation and composition.
Data sparseness for such languages can be significantly decreased when words are decomposed morphologically.

There exist a number of rule-based morphological segmentation systems for a range of languages. However, expert knowledge and labour are expensive, and the analyzers must be updated on a regular basis in order to cope with language change (the emergence of new words and their inflections). One might argue that unsupervised algorithms are not an interesting option from the engineering point of view, because rule-based systems usually lead to better results. However, segmentations .