Statistical Modeling for Unit Selection in Speech Synthesis

Cyril Allauzen, Mehryar Mohri, and Michael Riley
AT&T Labs - Research
180 Park Avenue, Florham Park, NJ 07932, USA
{allauzen,mohri,riley}@research.att.com

Abstract

Traditional concatenative speech synthesis systems use a number of heuristics to define the target and concatenation costs, essential for the design of the unit selection component. In contrast to these approaches, we introduce a general statistical modeling framework for unit selection inspired by automatic speech recognition. Given appropriate data, techniques based on that framework can result in a more accurate unit selection, thereby improving the general quality of a speech synthesizer. They can also lead to a more modular and a substantially more efficient system.

We present a new unit selection system based on statistical modeling. To overcome the original absence of data, we use an existing high-quality unit selection system to generate a corpus of unit sequences. We show that the concatenation cost can be accurately estimated from this corpus using a statistical n-gram language model over units. We used weighted automata and transducers for the representation of the components of the system and designed a new and more efficient composition algorithm making use of string potentials for their combination. The resulting statistical unit selection is shown to be about 2.6 times faster than the last release of the AT&T Natural Voices Product, while preserving the same quality, and offers much flexibility for the use and integration of new and more complex components.

1 Motivation

A concatenative speech synthesis system (Hunt and Black, 1996; Beutnagel et al., 1999a) consists of three components. The first component, the text-analysis frontend, takes text as input and outputs a sequence of feature vectors that characterize the acoustic signal to synthesize. The first element of each of these vectors is the predicted phone or halfphone; other elements are .
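As noted in the abstract, the concatenation cost is estimated from a corpus of unit sequences with a statistical n-gram language model over units. The following is a minimal sketch of that idea only, assuming hypothetical unit labels and a simple add-one-smoothed bigram model; it is not the weighted-automata implementation described in the paper.

```python
# Minimal sketch, not the authors' implementation: a toy add-one-smoothed
# bigram model over made-up unit labels stands in for the n-gram language
# model trained on unit sequences produced by an existing synthesizer.
import math
from collections import defaultdict


def train_bigram(unit_sequences):
    """Collect unigram and bigram counts from unit-label sequences."""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    vocab = set()
    for seq in unit_sequences:
        padded = ["<s>"] + list(seq)
        for prev, cur in zip(padded, padded[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
            vocab.update((prev, cur))
    return unigrams, bigrams, vocab


def concatenation_cost(prev_unit, unit, unigrams, bigrams, vocab):
    """Cost of joining `unit` after `prev_unit`, as -log P(unit | prev_unit),
    with add-one smoothing so unseen joins get a finite but higher cost."""
    numerator = bigrams[(prev_unit, unit)] + 1
    denominator = unigrams[prev_unit] + len(vocab)
    return -math.log(numerator / denominator)


# Toy corpus of unit sequences (labels are hypothetical).
corpus = [["ax_1", "b_3", "aw_2"], ["ax_1", "b_3", "iy_7"]]
uni, bi, vocab = train_bigram(corpus)
print(concatenation_cost("ax_1", "b_3", uni, bi, vocab))   # frequent join, low cost
print(concatenation_cost("ax_1", "iy_7", uni, bi, vocab))  # unseen join, higher cost
```

In the system described here, such costs are instead encoded as weights of finite automata and transducers over units and combined with the other components by weighted composition.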