Nguyễn Văn Huy, Tạp chí KHOA HỌC & CÔNG NGHỆ 139(09): 229 - 236

NEURAL NETWORK-BASED TONAL FEATURE FOR VIETNAMESE SPEECH RECOGNITION USING MULTI SPACE DISTRIBUTION MODEL

Nguyen Van Huy*
College of Technology - TNU

SUMMARY
This paper presents a new approach for integrating bottleneck features (BNF), used here to extract tone information, into a Multi-Space Distribution Hidden Markov Model (MSD-HMM) for Vietnamese automatic speech recognition (Vietnamese ASR). To improve the performance of the tonal feature, we first present a procedure for extracting tonal features with a bottleneck Multilayer Perceptron (MLP) network; we call the result the tonal bottleneck feature (TBNF). Second, we describe an approach for adapting the TBNF to the MSD-HMM. A new system was trained with a BNF size and an MLP hidden-layer topology chosen for tone recognition. Experiments compare the new recognition system with TBNF integration against (1) a baseline system using MFCC features and a standard five-state HMM prototype, and (2) an MSD-HMM system using a widely used pitch extraction method, the Average Magnitude Difference Function (AMDF). Recognition accuracy on the test set is 80.69%, an improvement of 2.38% over the baseline system and 0.32% over the best MSD-HMM system using the standard AMDF pitch feature.

Keywords: multi-space distribution, bottleneck feature, tonal bottleneck feature, Vietnamese tone recognition, pitch feature

INTRODUCTION
Tonal languages such as Vietnamese, Mandarin and Cantonese use tones to mark phone-level distinctions, so tones are essential for distinguishing between words.
Such tone information is carried by excursions in fundamental frequency, a feature that most recognition systems today discard as irrelevant for speech recognition. Vietnamese is a tonal, monosyllabic language in which each syllable has only one of .