Scientific report: "Speech Recognition of Czech - Inclusion of Rare Words Helps"

Petr Podvesky and Pavel Machek
Institute of Formal and Applied Linguistics
Charles University, Prague, Czech Republic
podvesky machek @

Abstract

Large vocabulary continuous speech recognition of inflective languages, such as Czech, Russian or Serbo-Croatian, is severely degraded by an excessive out-of-vocabulary rate. In this paper, we tackle the problem of vocabulary selection, language modeling and pruning for inflective languages. We show that by explicitly reducing the out-of-vocabulary rate we can achieve significant improvements in recognition accuracy while almost preserving the model size. Reported results are on Czech speech corpora.

1 Introduction

Large vocabulary continuous speech recognition of inflective languages is a challenging task for two main reasons. First, rich morphology generates a huge number of word forms that limited-size dictionaries do not capture, which leads to worse recognition results. Second, relatively free word order admits an enormous number of word sequences and thus impoverishes n-gram language models. In this paper we are concerned with the former issue.

Previous work on excessive vocabulary growth follows two main lines: authors have either broken words into sub-word units or adapted dictionaries in a multi-pass scenario. On Czech data, Byrne et al. (2001) suggest using linguistically motivated recognition units. Words are broken down into stems and endings, which serve as the recognition units in the first recognition phase; in the second phase, stems and endings are concatenated. On Serbo-Croatian, Geutner et al. (1998) also tested morphemes as the recognition units. Both groups of authors agreed that this approach is not beneficial for speech recognition of inflective languages. Vocabulary adaptation, however, brought considerable improvement. Both Ircing and Psutka (2001) on Czech and Geutner et al. (1998) on Serbo-Croatian reported substantial reduction of
