  • Publication
    Prosody modeling for syllable-based concatenative speech synthesis of Hindi and Tamil
    (19-04-2011)
    Bellur, Ashwin; Narayan, K. Badri; Raghava Krishnan, K.
    This paper describes ways to improve prosody modeling in syllable-based concatenative speech synthesis systems for two Indian languages, Hindi and Tamil, within the unit selection paradigm. The syllable is a larger unit than the diphone and contains most of the coarticulation information. Although syllable-based synthesis is quite intelligible compared to diphone-based systems, naturalness, especially in terms of prosody, requires additional processing. Since the synthesizer is built using a cluster unit framework, a hybrid approach that combines rule-based and statistical models is proposed to better model the prosody of syllable-like units. It is further observed that prediction of phrase boundaries is crucial, particularly because Indian languages are replete with polysyllabic words. CART-based phrase modeling for Hindi and Tamil is discussed. Perceptual experiments show a significant improvement in the mean opinion score (MOS) for both Hindi and Tamil synthesizers.
  • Publication
    Group delay based phone segmentation for HTS
    (01-01-2014)
    Shanmugam, S. Aswin
    HMM-based speech synthesis (HTS) is a state-of-the-art approach to text-to-speech synthesis. Segmentation of the training data is essential for building any text-to-speech system. Most conventional text-to-speech systems use phones as the basic unit of synthesis and use a speech recogniser to automatically segment the data at the phone level. As Indian languages are low-resource languages, accurate transcriptions are difficult to obtain owing to the paucity of data. Manual labeling at the phone level is not only laborious but also inaccurate. HMM-based flat start segmentation does not work well at the sentence level. In this paper, we propose an event-driven approach to obtain better phone boundaries. Syllable-like events are detected in the speech signal and matched with the syllabified transcription of the text. The syllables are converted to phoneme sequences, and Baum-Welch embedded re-estimation is restricted to the syllable level. Subjective evaluations indicate that the proposed system has a lower word error rate than a conventional system that uses flat start for obtaining phone boundaries. © 2014 IEEE.
  • Publication
    Robust syllable segmentation and its application to syllable-centric continuous speech recognition
    (18-05-2010)
    Janakiraman, Rajesh; Chaitanya, Kumar J.
    The focus of this paper is two-fold: (a) to develop a knowledge-based robust syllable segmentation algorithm and (b) to establish the importance of accurate segmentation in both the training and testing phases of a speech recognition system. A robust algorithm for segmenting the speech signal into syllables is first developed. It uses a non-statistical technique based on group delay (GD) segmentation and vowel onset point (VOP) detection. The transcription corresponding to the utterance is syllabified using rules, which produces an annotation for the training data. The annotated training data is then used to train a syllable-based speech recognition system. The test signal is also segmented using the proposed algorithm. This segmentation information is then incorporated into the linguistic search space to reduce both computational complexity and word error rate (WER). WERs of 4.4% and 21.2% are reported on the TIMIT and NTIMIT databases, respectively. © 2010 IEEE.
  • Publication
    On parametric representations of the modified group delay
    (22-09-2008)
    Padmanabhan, R.
    The modified group delay (MODGD) is a group-delay-based representation suited for processing speech signals. The MODGD is parameterized by three entities: α, γ and lifter_w. Typically, optimal values of these parameters have to be determined by experimentation. In this paper, we propose a method to automatically determine an optimal value for lifter_w, which enables the other two parameters to be set to 1.0. This reduces the optimisation required to obtain meaningful MODGD values directly from the speech signal. © 2008 IEEE.
  • Publication
    Robustness of phase based features for speaker recognition
    (26-11-2009)
    Padmanabhan, R.; Parthasarathi, Sree Hari Krishnan
    This paper demonstrates the robustness of group-delay-based features for speech processing. An analysis of group delay functions is presented, which shows that these features retain formant structure even in noise. Furthermore, a speaker verification task performed on the NIST 2003 database shows lower error rates than the traditional MFCC features. We also discuss using feature diversity to dynamically choose the feature for every claimed speaker. Copyright © 2009 ISCA.
  • Publication
    Raga verification in carnatic music using longest common segment set
    (01-01-2015)
    Dutta, Shrey; Krishnaraj Sekhar, P. V.
    There are at least 100 rāgas that are regularly performed in Carnatic music concerts. The audience determines the identity of rāgas within a few seconds of listening to an item. Most of the audience consists of people who are only avid listeners and not performers. In this paper, an attempt is made to mimic the listener. A rāga verification framework is therefore suggested. The rāga verification system assumes that a specific rāga is claimed based on similarity of movements and motivic patterns. The system then checks whether this claimed rāga is correct. For every rāga, a set of cohorts is chosen. A rāga and its cohorts are represented using pallavi lines of compositions. A novel approach for matching, called Longest Common Segment Set (LCSS), is introduced. The LCSS scores for a rāga are then normalized with respect to its cohorts in two different ways. The resulting systems and a baseline system are compared for two partitionings of a dataset. A dataset of 30 rāgas from Charsur Foundation is used for analysis. An equal error rate (EER) of 12% is obtained.
  • Publication
    Zero resource speech synthesis using transcripts derived from perceptual acoustic units
    (01-01-2019)
    Karthik Pandia, D. S.
    Zerospeech synthesis is the task of building vocabulary-independent speech synthesis systems where transcriptions are unavailable for the training data. It is therefore necessary to convert the training data into a sequence of fundamental acoustic units (AUs) that can be used for synthesis at test time. This paper attempts to discover and model perceptual acoustic units consisting of steady-state and transient regions in speech. The transients roughly correspond to CV and VC units, while the steady-state regions correspond to sonorants and fricatives. The speech signal is first preprocessed by segmenting it into CVC-like units using a short-term energy-like contour. These CVC segments are clustered using a connected-components-based graph clustering technique. The clustered CVC segments are initialized such that the onsets (CV) and decays (VC) correspond to transients, and the rhyme corresponds to steady states. Following this initialization, the units are allowed to re-organise on the continuous speech into a final set of AUs in an HMM-GMM framework. The AU sequences thus obtained are used to train synthesis models. The performance of the proposed approach is evaluated on the Zerospeech 2019 challenge database. Subjective and objective scores show that reasonably good quality synthesis with low bit rate encoding can be achieved using the proposed AUs.
  • Publication
    Detection of syn flooding attacks using generalized autoregressive conditional heteroskedasticity (GARCH) modeling technique
    (18-05-2010)
    Ranjan, Nikhil; Gonsalves, Timothy A.
    This paper explores a fast and effective method to detect TCP SYN flooding attacks. The generalized autoregressive conditional heteroskedasticity (GARCH) model, which is the most commonly used statistical modeling technique for financial time series, is proposed as a new technique for denial-of-service attack detection. The exponential backoff and retransmission property of TCP during timeouts is exploited in the detection mechanism. We are able to detect low- as well as high-intensity SYN flooding attacks by modeling the difference between SYN and SYN+ACK packets using GARCH. Our studies show that this nonlinear volatility model performs better than earlier models such as linear prediction. © 2010 IEEE.
  • Publication
    Early vocabulary development through picture-based software solutions
    (01-01-2018)
    Kasthuri, G. R.; Ramanathan, P.; Jacob, N.
    Assistive technology enables children with disabilities to gain access, function independently and take advantage of schooling and social opportunities [1]. The need for alternative and augmentative communication (AAC) is not the same for all children [2], but AAC can aid in expressive language and also support intelligible speakers in developing and using communication skills in varied situations. While the range and flexibility of AAC have grown over the years, making devices accessible to children from varied economic and regional backgrounds is still a challenge. KAVI-PTS is designed as a picture-to-speech Android application and has been made available in several Asian languages. This application is conceived of as an inexpensive software alternative to communication charts. It can be easily configured to adjust contrast levels and customize selection modes, giving children access to a tailor-made communication solution.
  • Publication
    Signal processing based segmentation and HMM based acoustic clustering of syllable segments for low bit rate segment vocoder at 1.4 Kbps
    (01-12-2008)
    Chevireddy, Sadhana
    In this paper, we propose a novel approach for developing a segment-based vocoder at very low bit rates. The segmental unit chosen for coding is the syllable. A signal processing technique called automatic group-delay-based segmentation is used to obtain syllable-like segments. The segment codebook is prepared by acoustically clustering the syllable segments using a Hidden Markov Model (HMM) based unsupervised and incremental training algorithm. When the residual is modeled using MELP, a bit rate of 1.4 Kbps is achieved. The synthesized speech quality is compared with that of the standard MELP codec at 2.4 Kbps using the objective evaluation measure PESQ. Copyright by EURASIP.