  • Publication
    Studies on inter-speaker variability in speech and its application in automatic speech recognition
    (01-10-2011)
    In this paper, we give an overview of the problem of inter-speaker variability and its study in many diverse areas of speech signal processing. We first give an overview of vowel-normalization studies that minimize variations in the acoustic representation of vowel realizations by different speakers. We then describe the universal-warping approach to speaker normalization which unifies many of the vowel normalization approaches and also shows the relation between speech production, perception and auditory processing. We then address the problem of inter-speaker variability in automatic speech recognition (ASR) and describe techniques that are used to reduce these effects and thereby improve the performance of speaker-independent ASR systems. © 2011 Indian Academy of Sciences.
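    The survey above does not single out one normalization method; as a toy illustration of the vowel-normalization idea, the sketch below applies Lobanov-style per-speaker z-scoring to hypothetical formant measurements (all numbers invented).

```python
import numpy as np

# Hypothetical F1/F2 formant values (Hz) for two speakers producing the same
# three vowels; the numbers are invented purely for illustration.
speaker_a = np.array([[310, 2300], [620, 1750], [760, 1250]], dtype=float)
speaker_b = np.array([[390, 2600], [720, 2000], [880, 1450]], dtype=float)

def lobanov_normalize(formants):
    """Z-score each formant dimension per speaker (Lobanov-style normalization),
    removing speaker-specific shifts and scales of the vowel space."""
    return (formants - formants.mean(axis=0)) / formants.std(axis=0)

# After normalization, the two speakers' vowel spaces are far more comparable.
print(lobanov_normalize(speaker_a))
print(lobanov_normalize(speaker_b))
```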
  • Publication
    Improving acoustic models in TORGO dysarthric speech database
    (01-03-2018)
    Joy, Neethu Mariam
    Assistive speech-based technologies can improve the quality of life for people affected by dysarthria, a motor speech disorder. In this paper, we explore multiple ways to improve Gaussian mixture model and deep neural network (DNN) based hidden Markov model (HMM) automatic speech recognition systems for the TORGO dysarthric speech database. This work shows significant improvements over previous attempts at building such systems on TORGO. We trained speaker-specific acoustic models by tuning various acoustic model parameters, using speaker-normalized cepstral features and building complex DNN-HMM models with dropout and sequence-discrimination strategies. The DNN-HMM models for severe and severe-moderate dysarthric speakers were further improved by transferring information specific to dysarthric speech into DNN models trained on audio from both dysarthric and normal speakers, using a generalized distillation framework. To the best of our knowledge, this paper presents the best recognition accuracies reported on the TORGO database to date.
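    A minimal sketch of a generalized-distillation-style loss, assuming the usual combination of hard senone labels and temperature-softened teacher posteriors; the temperature, imitation weight and toy logits below are illustrative and not the paper's actual recipe.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def generalized_distillation_loss(student_logits, teacher_logits, hard_labels,
                                  temperature=2.0, imitation=0.5):
    """Weighted sum of cross-entropy with hard labels and cross-entropy with
    the teacher's temperature-softened posteriors (soft targets)."""
    p_student = softmax(student_logits)
    p_student_soft = softmax(student_logits, temperature)
    p_teacher_soft = softmax(teacher_logits, temperature)
    rows = np.arange(len(hard_labels))
    hard_ce = -np.log(p_student[rows, hard_labels]).mean()
    soft_ce = -(p_teacher_soft * np.log(p_student_soft)).sum(axis=-1).mean()
    return (1.0 - imitation) * hard_ce + imitation * soft_ce

# Toy example: 3 frames, 4 senone classes; all numbers are made up.
rng = np.random.default_rng(0)
student = rng.normal(size=(3, 4))
teacher = rng.normal(size=(3, 4))
labels = np.array([0, 2, 1])
print(generalized_distillation_loss(student, teacher, labels))
```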
  • Publication
    S-Vectors and TESA: Speaker Embeddings and a Speaker Authenticator Based on Transformer Encoder
    (01-01-2022)
    Mary, Narla John Metilda Sagaya; Katta, Sandesh Varadaraju
    One of the most popular speaker embeddings is x-vectors, which are obtained from an architecture that gradually builds a larger temporal context layer by layer. In this paper, we propose to derive speaker embeddings from a Transformer encoder trained for speaker classification. Self-attention, on which the Transformer encoder is built, attends to all the features over the entire utterance and may be better suited to capturing the speaker characteristics in an utterance. We refer to the speaker embeddings obtained from the proposed speaker classification model as s-vectors to emphasize that they are obtained from an architecture that heavily relies on self-attention. Through experiments, we demonstrate that s-vectors perform better than x-vectors. In addition to the s-vectors, we also propose a new architecture based on the Transformer encoder for speaker verification, as a replacement for speaker verification based on conventional probabilistic linear discriminant analysis (PLDA). This architecture is inspired by the next-sentence-prediction task of bidirectional encoder representations from Transformers (BERT), and we feed the s-vectors of two utterances to verify whether they belong to the same speaker. We name this architecture the Transformer encoder speaker authenticator (TESA). Our experiments show that the performance of s-vectors with TESA is better than that of s-vectors with conventional PLDA-based speaker verification.
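    A simplified PyTorch sketch of the s-vector idea, assuming mean pooling over a stock nn.TransformerEncoder followed by a speaker-classification head; the paper's exact architecture (input handling, pooling strategy, and the TESA verification head) is not reproduced here.

```python
import torch
import torch.nn as nn

class SVectorNet(nn.Module):
    """Toy speaker classifier whose pooled encoder output serves as the
    utterance-level speaker embedding (an s-vector-style representation)."""
    def __init__(self, feat_dim=80, d_model=256, n_heads=4, n_layers=4,
                 n_speakers=1000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_speakers)

    def forward(self, feats):                # feats: (batch, frames, feat_dim)
        h = self.encoder(self.proj(feats))   # self-attention over all frames
        embedding = h.mean(dim=1)            # temporal pooling -> embedding
        return embedding, self.classifier(embedding)

model = SVectorNet()
emb, logits = model(torch.randn(2, 300, 80))   # 2 utterances, 300 frames each
print(emb.shape, logits.shape)                 # (2, 256) and (2, 1000)
```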
  • Publication
    An automated technique to generate phone-to-articulatory label mapping
    (01-02-2017)
    Abraham, Basil
    Recent studies have shown that in the case of under-resourced languages, the use of articulatory features (AF) emerging from an articulatory model results in improved automatic speech recognition (ASR) compared to conventional mel frequency cepstral coefficient (MFCC) features. Articulatory features are more robust to noise and pronunciation variability than conventional acoustic features. To extract articulatory features, one method is to take conventional acoustic features like MFCC and build an articulatory classifier that outputs articulatory features (known as pseudo-AF). However, these classifiers require a mapping from phones to different articulatory labels (AL) (e.g., place of articulation and manner of articulation), which is not readily available for many under-resourced languages. In this article, we propose an automated technique to generate a phone-to-articulatory label (phone-to-AL) mapping for a new target language based on the knowledge of the phone-to-AL mapping of a well-resourced language. The proposed mapping technique is based on the center-phone capturing property of interpolation vectors emerging from the recently proposed phone cluster adaptive training (Phone-CAT) method. Phone-CAT is an acoustic modeling technique that belongs to the broad category of canonical state models (CSM), which includes the subspace Gaussian mixture model (SGMM). In Phone-CAT, the interpolation vector belonging to a particular context-dependent state has maximum weight for the center-phone in the case of monophone clusters, or for the AL of the center-phone in the case of AL clusters. These relationships from the various context-dependent states are used to generate a phone-to-AL mapping. The Phone-CAT technique makes use of all the speech data belonging to a particular context-dependent state. Therefore, multiple segments of speech are used to generate the mapping, which makes it more robust to noise and other variations. In this study, we obtain a phone-to-AL mapping for three under-resourced Indian languages, namely Assamese, Hindi and Tamil, based on the phone-to-AL mapping available for English. With the generated mappings, articulatory features are extracted for these languages using varying amounts of data in order to build an articulatory classifier. Experiments were also performed in a cross-lingual scenario assuming a small training data set (≈ 2 h) from each of the Indian languages, with articulatory classifiers built using large amounts of training data (≈ 22 h) from other languages including English (Switchboard task). Interestingly, cross-lingual performance is comparable to that of an articulatory classifier built with large amounts of native training data. Using articulatory features, more than 30% relative improvement was observed over conventional MFCC features for all three languages in a DNN framework.
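    As a hedged illustration of how such a mapping could be read off interpolation vectors, the sketch below accumulates weights over AL clusters per center phone and takes the arg-max cluster as that phone's label; the states, phones and weights are invented, and the actual Phone-CAT estimation is considerably more involved.

```python
import numpy as np

# Hypothetical Phone-CAT interpolation vectors: one weight vector per
# context-dependent state, with columns indexed by articulatory-label (AL)
# clusters of a well-resourced language.  All values are invented.
al_clusters = ["bilabial", "alveolar", "velar", "vowel"]
states = {
    # state id: (center phone of the state, interpolation weights over ALs)
    "s1": ("p", np.array([0.70, 0.10, 0.10, 0.10])),
    "s2": ("p", np.array([0.65, 0.15, 0.10, 0.10])),
    "s3": ("t", np.array([0.10, 0.75, 0.05, 0.10])),
    "s4": ("a", np.array([0.05, 0.05, 0.10, 0.80])),
}

def phone_to_al_mapping(states, al_clusters):
    """Accumulate interpolation weights over all states sharing a center phone
    and map each phone to the AL cluster with the largest accumulated weight."""
    acc = {}
    for center_phone, weights in states.values():
        acc.setdefault(center_phone, np.zeros(len(al_clusters)))
        acc[center_phone] += weights
    return {phone: al_clusters[int(np.argmax(total))]
            for phone, total in acc.items()}

print(phone_to_al_mapping(states, al_clusters))
# {'p': 'bilabial', 't': 'alveolar', 'a': 'vowel'}
```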
  • Publication
    FMLLR Speaker Normalization with i-Vector: In Pseudo-FMLLR and Distillation Framework
    (01-04-2018)
    Joy, Neethu Mariam; Kothinti, Sandeep Reddy
    When an automatic speech recognition (ASR) system is deployed for real-world applications, it often receives only one utterance at a time for decoding. This single utterance could be of short duration depending on the ASR task. In these cases, robust estimation of speaker normalizing methods like feature-space maximum likelihood linear regression (FMLLR) and i-vectors may not be feasible. In this paper, we propose two unsupervised speaker normalization techniques - one at the feature level and the other at the model level of acoustic modeling - to overcome the drawbacks of FMLLR and i-vectors in real-time scenarios. At the feature level, we propose the use of deep neural networks (DNN) to generate pseudo-FMLLR features from time-synchronous pairs of filterbank and FMLLR features. These pseudo-FMLLR features can then be used for DNN acoustic model training and decoding. At the model level, we propose a generalized distillation framework, where a teacher DNN trained on FMLLR features guides the training and optimization of a student DNN trained on filterbank features. In both the proposed methods, the ambiguity in choosing the speaker-specific FMLLR transform can be reduced by augmenting i-vectors to the input filterbank features. Experiments conducted on 33-h and 110-h subsets of the Switchboard corpus show that the proposed methods provide significant gains over DNNs trained on FMLLR, i-vector appended FMLLR, filterbank and i-vector appended filterbank features in the real-time scenario.
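    For reference, an FMLLR transform is a per-frame affine map of the features; the sketch below shows only that target operation on dummy data (with an identity transform as a stand-in), whereas the paper's contribution is to predict such pseudo-FMLLR features with a DNN and to distil FMLLR-trained teachers into filterbank-trained students.

```python
import numpy as np

def apply_fmllr(features, W):
    """Apply an FMLLR transform W = [A b] to a (frames x dim) feature matrix:
    each frame x is mapped to A @ x + b."""
    frames, dim = features.shape
    extended = np.hstack([features, np.ones((frames, 1))])   # append bias term
    return extended @ W.T                                     # (frames x dim)

dim = 40
rng = np.random.default_rng(1)
feats = rng.normal(size=(100, dim))                 # dummy acoustic frames
W = np.hstack([np.eye(dim), np.zeros((dim, 1))])    # identity transform stub
print(apply_fmllr(feats, W).shape)                  # (100, 40)
```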
  • Publication
    Multiple background models for speaker verification using the concept of vocal tract length and MLLR super-vector
    (01-09-2012)
    Sarkar, A. K.
    In this paper, we investigate the use of Multiple Background Models (M-BMs) in Speaker Verification (SV). We cluster the speakers using either their Vocal Tract Lengths (VTLs) or their speaker-specific Maximum Likelihood Linear Regression (MLLR) supervectors, and build a separate Background Model (BM) for each such cluster. We show that the use of M-BMs provides improved performance compared to the use of a single or gender-wise Universal Background Model (UBM). While the computational complexity during test remains the same for both M-BMs and the UBM, M-BMs require switching of models depending on the claimant, and score-normalization also becomes difficult. To overcome these problems, we propose a novel method which aggregates the information from Multiple Background Models into a single gender-independent UBM and is inspired by the conventional Feature Mapping (FM) technique. We show that this approach improves over the conventional UBM method while still permitting easy use of score-normalization techniques. The proposed method provides a relative improvement in Equal Error Rate (EER) of 13.65% in the case of VTL clustering, and 15.43% in the case of MLLR supervectors, when compared to the conventional single-UBM system. When AT-norm score-normalization is used, the proposed method provides a relative improvement in EER of 20.96% for VTL clustering and 22.48% for MLLR supervector based clustering. Furthermore, the proposed method is compared with a gender-dependent speaker verification system using a Gaussian Mixture Model-Support Vector Machine (GMM-SVM) supervector linear kernel. The experimental results show that the proposed method performs better than the gender-dependent speaker verification system. © 2012 Springer Science+Business Media, LLC.
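    A toy GMM-UBM-style verification score with a switchable background model, using scikit-learn's GaussianMixture on random data; MAP adaptation of the speaker model, the VTL/MLLR-supervector clustering and the proposed feature-mapping-style aggregation are not shown.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
dim = 20
background = rng.normal(size=(2000, dim))     # stand-in for background data
test_utt = rng.normal(size=(300, dim))        # stand-in for a test utterance

# One background model per speaker cluster (e.g. VTL- or MLLR-supervector-based
# clusters); here simply two GMMs trained on random splits for illustration.
bms = [GaussianMixture(n_components=8, random_state=0).fit(background[i::2])
       for i in range(2)]

# Enrolled speaker model; in practice this is MAP-adapted from the matching BM.
speaker_gmm = GaussianMixture(n_components=8, random_state=0).fit(
    rng.normal(loc=0.1, size=(500, dim)))

def llr_score(utt, speaker_model, background_model):
    """Average per-frame log-likelihood ratio used as the verification score."""
    return (speaker_model.score_samples(utt)
            - background_model.score_samples(utt)).mean()

# With M-BMs, the background model is switched according to the claimant's
# cluster; here we simply pick cluster 0.
print(llr_score(test_utt, speaker_gmm, bms[0]))
```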
  • Publication
    Decorrelating Feature Spaces for Learning General-Purpose Audio Representations
    (01-10-2022)
    Ghosh, Sreyan; Seth, Ashish
    Inspired by the recent progress in self-supervised learning for computer vision, in this paper, through the DeLoRes (Decorrelating latent spaces for Low Resource audio representation learning) framework, we introduce two new general-purpose audio representation learning approaches, DeLoRes-S and DeLoRes-M. Our main objective is to make our network learn representations in a resource-constrained setting (both data and compute) that generalize well across a diverse set of downstream tasks. Inspired by the Barlow Twins objective function, we propose learning embeddings that are invariant to distortions of an input audio sample while ensuring that they contain non-redundant information about the sample. We call this the DeLoRes learning framework, which we employ in different fashions with DeLoRes-S and DeLoRes-M. In our experiments, we learn audio representations with less than half the number of model parameters and 10% of the audio samples used by state-of-the-art algorithms, and achieve state-of-the-art results on 7 out of 11 tasks in linear evaluation and 4 out of 11 tasks in the fine-tuning setup. In addition to being simple and intuitive, our pre-training procedure is compute-efficient by construction. Furthermore, we conduct extensive ablation studies on our training algorithm, model architecture, and results, and make all our code and pre-trained models publicly available.
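    A minimal NumPy sketch of the Barlow-Twins-style objective the abstract builds on (invariance on the diagonal of the cross-correlation matrix, redundancy reduction off it); the weighting and the random embeddings are illustrative, and this is not the full DeLoRes-S/M training procedure.

```python
import numpy as np

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Drive the cross-correlation matrix of two distorted views' embeddings
    towards identity: diagonal terms enforce invariance, off-diagonal terms
    penalize redundancy between embedding dimensions."""
    n = z1.shape[0]
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-9)    # standardize each dimension
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-9)
    c = z1.T @ z2 / n                               # cross-correlation matrix
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()
    return on_diag + lam * off_diag

# Embeddings of two augmented views of the same batch of audio clips
# (random stand-ins here).
rng = np.random.default_rng(3)
z_a = rng.normal(size=(64, 128))
z_b = z_a + 0.1 * rng.normal(size=(64, 128))        # mildly distorted view
print(barlow_twins_loss(z_a, z_b))
```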
  • Publication
    Modified Mean and Variance Normalization: Transforming to Utterance-Specific Estimates
    (01-05-2016)
    Joshi, Vikas; Prasad, N. Vishnu
    Cepstral mean and variance normalization (CMVN) is an efficient noise compensation technique popularly used in many speech applications. CMVN eliminates the mismatch between training and test utterances by transforming them to zero mean and unit variance. In this work, we argue that some amount of useful information is lost during normalization, as every utterance is forced to have the same first- and second-order statistics, i.e., zero mean and unit variance. We propose to modify the CMVN methodology to retain this useful information and yet compensate for noise. The proposed normalization approach transforms every test utterance to an utterance-specific clean mean (i.e., the mean the utterance would have if the noise were absent) and clean variance, instead of zero mean and unit variance. We derive expressions to estimate the clean mean and variance from a noisy utterance. The proposed normalization is effective in recognizing voice commands that are typically short (single words or short phrases), where more advanced methods such as histogram equalization (HEQ) are not effective. Recognition results show a relative improvement (RI) in word error rate over conventional CMVN on the Aurora-2 database, as well as an RI of 20% over CMVN and an additional improvement over HEQ on short utterances of the Aurora-2 database.
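    A small sketch contrasting conventional CMVN with the idea of renormalizing to utterance-specific clean statistics; the clean mean and variance are simply assumed known here, whereas deriving their estimates from the noisy utterance is the paper's actual contribution.

```python
import numpy as np

def cmvn(feats):
    """Conventional CMVN: zero mean, unit variance per cepstral dimension."""
    return (feats - feats.mean(0)) / (feats.std(0) + 1e-9)

def modified_cmvn(feats, clean_mean, clean_std):
    """Normalize the noisy utterance, then map it to the estimated
    utterance-specific clean statistics instead of (0, 1)."""
    return cmvn(feats) * clean_std + clean_mean

rng = np.random.default_rng(4)
noisy = rng.normal(loc=2.0, scale=1.5, size=(50, 13))   # short noisy utterance
# Stand-ins for the clean statistics the paper estimates from the noisy data.
clean_mean = np.full(13, 0.5)
clean_std = np.full(13, 1.2)
out = modified_cmvn(noisy, clean_mean, clean_std)
print(out.mean(0)[:3], out.std(0)[:3])   # per-dimension ~clean_mean, ~clean_std
```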
  • Publication
    Acoustic modelling for speech recognition in Indian languages in an agricultural commodities task domain
    (01-01-2014)
    Mohan, Aanchan; Rose, Richard; Ghalehjegh, Sina Hamidi
    In developing speech recognition based services for any task domain, it is necessary to account for the support of an increasing number of languages over the life of the service. This paper considers a small vocabulary speech recognition task in multiple Indian languages. To configure a multi-lingual system in this task domain, an experimental study is presented using data from two linguistically similar languages - Hindi and Marathi. We do so by training a subspace Gaussian mixture model (SGMM) (Povey et al., 2011; Rose et al., 2011) under a multi-lingual scenario (Burget et al., 2010; Mohan et al., 2012a). Speech data was collected from the targeted user population to develop spoken dialogue systems in an agricultural commodities task domain for this experimental study. It is well known that acoustic, channel and environmental mismatch between data sets from multiple languages is an issue while building multi-lingual systems of this nature. As a result, we use a cross-corpus acoustic normalization procedure which is a variant of speaker adaptive training (SAT) (Mohan et al., 2012a). The resulting multi-lingual system provides the best speech recognition performance for both languages. Further, the effect of sharing "similar" context-dependent states from the Marathi language on the Hindi speech recognition performance is presented. © 2013 Elsevier B.V. All rights reserved.
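    For context, the core SGMM parameter-sharing idea (which makes multi-lingual sharing natural) is that each state's Gaussian means are projections of a low-dimensional state vector through globally shared matrices; the sketch below shows only that mean computation, with arbitrary dimensions and without the state-specific weights, covariances or speaker subspaces of the full model.

```python
import numpy as np

rng = np.random.default_rng(5)
feat_dim, subspace_dim, n_gauss, n_states = 40, 50, 8, 3

# Globally shared SGMM parameters (shared across languages in a multi-lingual
# setup): one projection matrix per shared Gaussian.
M = rng.normal(size=(n_gauss, feat_dim, subspace_dim))

# Low-dimensional state vectors; these (plus weights) are the state-specific part.
v = rng.normal(size=(n_states, subspace_dim))

# Each state's Gaussian means are projections of its state vector:
# means[j, i] = M[i] @ v[j]
means = np.einsum('ifs,js->jif', M, v)
print(means.shape)   # (n_states, n_gauss, feat_dim)
```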
  • Publication
    VTLN using analytically determined linear-transformation on conventional MFCC
    (02-04-2012)
    Sanand, D. R.
    In this paper, we propose a method to analytically obtain a linear transformation on conventional Mel frequency cepstral coefficient (MFCC) features that corresponds to conventional vocal tract length normalization (VTLN)-warped MFCC features, thereby simplifying VTLN processing. There have been many attempts to obtain such a linear transformation, but in all the previously proposed approaches either the signal processing is modified (and is therefore no longer conventional MFCC), or the linear transformation does not correspond to conventional VTLN warping, or the matrices being estimated are data dependent. In short, the conventional VTLN part of an automatic speech recognition (ASR) system cannot simply be replaced with any of the previously proposed methods. Umesh proposed the idea of using band-limited interpolation to perform VTLN warping on MFCC using plain cepstra. Motivated by this work, Panchapagesan and Alwan proposed a linear transformation to perform VTLN warping on conventional MFCC. However, in their approach, VTLN warping is specified in the Mel-frequency domain and is not equivalent to conventional VTLN. In this paper, we present an approach which also draws inspiration from the work of Umesh, and which we believe for the first time performs conventional VTLN as a linear transformation on conventional MFCC using the ideas of band-limited interpolation. Deriving such a linear transformation to perform VTLN allows us to use the VTLN matrices in a transform-based adaptation framework, with its associated advantages, while requiring the estimation of only a single parameter. Using four different tasks, we show that our proposed approach has almost identical recognition performance to conventional VTLN on both clean and noisy speech data. © 2012 IEEE.
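    A simplified illustration of turning a frequency warp into a single matrix acting on cepstra, using plain cepstra and linear interpolation rather than the paper's band-limited interpolation on conventional MFCC; the warp factor and dimensions are arbitrary.

```python
import numpy as np

n_freq, n_cep, alpha = 129, 13, 1.1    # warp factor alpha is illustrative

# Orthonormal DCT-II matrix truncated to n_cep rows: maps a log-spectrum
# (n_freq bins) to cepstra, c = C @ log_spectrum.
bins = np.arange(n_freq)
quef = np.arange(n_cep)[:, None]
C = np.sqrt(2.0 / n_freq) * np.cos(np.pi * (bins + 0.5) * quef / n_freq)
C[0] /= np.sqrt(2.0)
C_pinv = np.linalg.pinv(C)             # truncated cepstra -> approx. log-spectrum

# Frequency-warping matrix: row i interpolates the log-spectrum at the warped
# position i/alpha (linear interpolation here, purely for brevity).
P = np.zeros((n_freq, n_freq))
for i in range(n_freq):
    pos = min(i / alpha, n_freq - 1)
    lo = int(np.floor(pos))
    frac = pos - lo
    P[i, lo] = 1.0 - frac
    if lo + 1 < n_freq:
        P[i, lo + 1] = frac

# A single matrix that applies the VTLN warp directly to cepstral vectors.
A_alpha = C @ P @ C_pinv
cepstra = np.random.default_rng(6).normal(size=(100, n_cep))   # dummy frames
warped = cepstra @ A_alpha.T
print(A_alpha.shape, warped.shape)     # (13, 13) (100, 13)
```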