Group delay based phone segmentation for HTS
Date Issued
01-01-2014
Author(s)
Shanmugam, S. Aswin
Indian Institute of Technology, Madras
Abstract
HMM-based speech synthesis (HTS) is a state-of-the-art approach to text-to-speech synthesis. Segmentation of the training data is essential for building any text-to-speech system. Most conventional text-to-speech systems use phones as the basic unit of synthesis and use a speech recogniser to automatically segment the data at the phone level. As Indian languages are low-resource languages, accurate transcriptions are difficult to obtain owing to the paucity of data. Manual labeling at the phone level is not only laborious but also inaccurate. HMM-based flat-start segmentation does not work well at the sentence level. In this paper, we propose an event-driven approach to obtain better phone boundaries. Syllable-like events are detected in the speech signal and matched with the syllabified transcription of the text. The syllables are converted to phoneme sequences, and Baum-Welch embedded re-estimation is restricted to the syllable level. Subjective evaluations indicate that the proposed system has a lower word error rate than a conventional system that uses flat start to obtain phone boundaries. © 2014 IEEE.
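The idea of confining phone boundaries to detected syllable intervals can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `init_phone_boundaries`, the input layout, and the uniform split inside each syllable are assumptions made for illustration. In the approach described above, the boundaries within each syllable would be refined by Baum-Welch embedded re-estimation; the uniform split only stands in for a starting point that never crosses a syllable boundary.

```python
from typing import List, Tuple

Segment = Tuple[str, float, float]  # (phone label, start time, end time) in seconds


def init_phone_boundaries(
    syllable_bounds: List[Tuple[float, float]],
    syllable_phones: List[List[str]],
) -> List[Segment]:
    """Place initial phone boundaries inside each detected syllable interval.

    syllable_bounds: (start, end) of each syllable-like event found in the signal.
    syllable_phones: phoneme sequence of each syllable in the transcription.
    Both inputs are hypothetical; the uniform split is an assumption, not the
    re-estimation procedure used in the paper.
    """
    if len(syllable_bounds) != len(syllable_phones):
        raise ValueError("detected events and transcription syllables must match one-to-one")

    segments: List[Segment] = []
    for (start, end), phones in zip(syllable_bounds, syllable_phones):
        if not phones:
            raise ValueError("every syllable needs at least one phone")
        if end <= start:
            raise ValueError("syllable interval must have positive duration")
        step = (end - start) / len(phones)
        for i, phone in enumerate(phones):
            seg_start = start + i * step
            # Pin the last phone to the syllable end so no boundary drifts past it.
            seg_end = end if i == len(phones) - 1 else start + (i + 1) * step
            segments.append((phone, seg_start, seg_end))
    return segments


if __name__ == "__main__":
    bounds = [(0.0, 0.5), (0.5, 1.25)]
    phones = [["k", "a"], ["m", "a", "l"]]
    for label, t0, t1 in init_phone_boundaries(bounds, phones):
        print(f"{t0:.2f} {t1:.2f} {label}")
```

Because each syllable is processed independently, an error in one syllable's phone boundaries cannot propagate into its neighbours, which is the property the syllable-level restriction is meant to provide.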