Options
Significance of spectral cues in automatic speech segmentation for Indian language speech synthesizers
Date Issued
01-10-2020
Author(s)
Baby, Arun
Prakash, Jeena J.
Subramanian, Aswin Shanmugam
Indian Institute of Technology, Madras
Abstract
Building speech synthesis systems for Indian languages is challenging owing to the fact that digital resources for these languages are hardly available. Vocabulary independent speech synthesis requires that a given text is split at the level of the smallest sound unit, namely, phone. The waveforms or models of phones are concatenated to produce speech. The waveforms corresponding to that of the phones are obtained manual (listening and marking) when digital resources are scarce. But the manual labeling of speech data (also known as speech segmentation) can lead to inconsistencies as the duration of phones can be as short as 10ms. The most common approach to automatic segmentation of speech is to perform forced alignment using monophone hidden Markov models (HMMs) that have been obtained using embedded re-estimation after flat start initialization. These results are then used in neural network frameworks to build better acoustic models for speech synthesis/recognition. Segmentation using this approach requires large amounts of data and does not work very well for low resource languages. To address the issue of paucity of data, signal processing cues like short-term energy (STE) and sub-band spectral flux (SBSF) are used in tandem with HMM based forced alignment for automatic speech segmentation. STE and SBSF are computed on the speech waveforms. STE yields syllable boundaries, while SBSF provides locations of significant change in spectral flux that are indicative of fricatives, affricates, and nasals. STE and SBSF cannot be used directly to segment an utterance. Minimum phase group delay based smoothing is performed to preserve these landmarks, while at the same time reducing the local fluctuations. The boundaries obtained with HMMs are corrected at the syllable level, wherever it is known that the syllable boundaries are correct. Embedded re-estimation of monophone HMM models is again performed using the corrected alignment. Thus, using signal processing cues and HMM re-estimation in tandem, robust monophone HMM models are built. These models are then used in Gaussian mixture model (GMM), deep neural network (DNN) and convolutional neural network (CNN) frameworks to obtain state-level frame posteriors. The boundaries are again iteratively corrected and re-estimated. Text-to-speech (TTS) systems are built for different Indian languages using phone alignments obtained with and without the use of signal processing based boundary corrections. Unit selection based and statistical parametric based TTS systems are built. The result of the listening tests showed a significant improvement in the quality of synthesis with the use of signal processing based boundary correction.
Volume
123