A probabilistic approach to selecting units for speech synthesis based on acoustic similarity
Date Issued
01-01-2014
Author(s)
Abstract
Most unit selection synthesisers sound quite natural when the database contains many realisations of each sound unit drawn from a large number of contexts. A common problem with these synthesisers is unexpected prosody when the text presents a new context. The objective of this paper is to address this issue by selecting units that are appropriate to a specific context. Text-to-speech synthesisers typically select units using a number of features derived from the linguistic context. The key contribution of this paper is to show that the acoustic context, rather than the linguistic context, is crucial for improving naturalness. A probabilistic framework is proposed for selecting units based on their acoustic context. Reducing the variability in acoustic context improves both naturalness and intelligibility. Since the context is specified by acoustics alone, the approach can be applied to any language and perhaps even to multilingual synthesis. The proposed approach has been tested on two Indian languages. Improvements of up to 21.9% in DMOS and 73.93% in WER are observed relative to the conventional system that uses linguistic criteria. © 2014 IEEE.
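
The abstract does not give the details of the probabilistic framework, so the following is only an illustrative sketch of the general idea of reducing acoustic variability among candidate units. It assumes, purely for illustration, that each realisation of a unit is described by a small acoustic feature vector (here a hypothetical duration and mean F0), fits a diagonal Gaussian to all realisations, and keeps the most likely ones. The function names, the feature choice, the Gaussian model, and the `keep_fraction` parameter are all assumptions, not the paper's method.

```python
import math
from statistics import mean, pvariance


def fit_diagonal_gaussian(vectors):
    """Estimate per-dimension mean and variance (assumed model, not the paper's)."""
    dims = list(zip(*vectors))
    means = [mean(d) for d in dims]
    # Floor the variance so a constant feature cannot cause division by zero.
    variances = [max(pvariance(d), 1e-6) for d in dims]
    return means, variances


def log_likelihood(vector, means, variances):
    """Log-density of one feature vector under the diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(vector, means, variances)
    )


def select_units(candidates, keep_fraction=0.75):
    """Keep the acoustically most typical realisations of a unit.

    candidates: list of (unit_id, feature_vector) pairs (hypothetical layout).
    keep_fraction: share of realisations to retain (hypothetical parameter).
    """
    means, variances = fit_diagonal_gaussian([f for _, f in candidates])
    scored = sorted(
        candidates,
        key=lambda c: log_likelihood(c[1], means, variances),
        reverse=True,
    )
    n_keep = max(1, math.ceil(keep_fraction * len(scored)))
    return [unit_id for unit_id, _ in scored[:n_keep]]


# Hypothetical realisations of one syllable: [duration in s, mean F0 in Hz].
candidates = [
    ("ka_01", [0.12, 180.0]),
    ("ka_02", [0.13, 185.0]),
    ("ka_03", [0.11, 178.0]),
    ("ka_04", [0.30, 260.0]),  # acoustically atypical realisation
]
selected = select_units(candidates)
print(selected)
```

In this toy example the atypical realisation `ka_04` receives the lowest likelihood and is pruned, which narrows the acoustic spread of the units left for selection. Because the sketch relies only on acoustic features, nothing in it depends on the language of the recordings.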