  • Publication
    Fast computation of speaker characterization vector using MLLR and sufficient statistics in anchor model framework
    (01-01-2010)
    Sarkar, A. K.
    ;
    The anchor modeling technique has been shown to be useful in reducing computational complexity for speaker identification and indexing of large audio databases. In this technique, speakers are projected onto a talker space spanned by a set of predefined anchor models, which are usually represented by Gaussian Mixture Models (GMMs). The characterization of each speaker involves calculating the likelihood with each of the anchor models, and is therefore expensive even in the GMM-Universal Background Model (GMM-UBM) framework using top-C mixtures per feature vector. In this paper, we propose a computationally efficient (fast) method to calculate the likelihood of speech utterances using anchor speaker-specific Maximum Likelihood Linear Regression (MLLR) matrices and sufficient statistics estimated from the utterance. We show that the proposed method is faster by an order of magnitude for evaluating the speaker characterization vector. Since anchor models use simple distance measures to identify speakers, they are used as a first stage to select N probable speakers, cascaded with a conventional GMM-UBM stage that finally identifies the speaker from this reduced set. We show that the proposed method in cascade combination performs 4.21× faster than the conventional cascade anchor model system with comparable performance. The experiments are performed on NIST 2004 SRE in the core condition. © 2010 ISCA.
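The abstract's core idea is that per-anchor likelihoods can be evaluated from sufficient statistics collected once over the utterance, instead of re-scoring every frame against every anchor model. As a rough illustration of that idea (not the authors' implementation), the sketch below collects zero- and first-order UBM statistics and then scores each anchor through a global MLLR mean transform; all shapes, variable names and the diagonal-covariance assumption are illustrative.

```python
# Hedged sketch: approximate per-anchor likelihoods from UBM sufficient statistics
# and anchor-specific MLLR mean transforms. Illustrative, not the authors' code.
import numpy as np

def ubm_sufficient_stats(frames, weights, means, covs):
    """Zero- and first-order statistics of an utterance under a diagonal-covariance UBM."""
    # frames: (T, D); weights: (C,); means, covs (variances): (C, D)
    T, D = frames.shape
    log_det = np.sum(np.log(covs), axis=1)                           # (C,)
    diff = frames[:, None, :] - means[None, :, :]                    # (T, C, D)
    log_gauss = -0.5 * (D * np.log(2 * np.pi) + log_det[None, :]
                        + np.sum(diff ** 2 / covs[None, :, :], axis=2))
    log_post = np.log(weights)[None, :] + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                          # responsibilities (T, C)
    n_c = post.sum(axis=0)                                           # zero-order stats (C,)
    f_c = post.T @ frames                                            # first-order stats (C, D)
    return n_c, f_c

def characterization_vector(n_c, f_c, means, covs, anchor_mllr):
    """Approximate per-anchor log-likelihood using only the sufficient statistics.

    anchor_mllr: list of (A, b) pairs, one global MLLR mean transform per anchor.
    Terms independent of the adapted means are dropped, so no pass over frames is needed.
    """
    scores = []
    for A, b in anchor_mllr:
        adapted = means @ A.T + b                                    # (C, D) adapted means
        term1 = np.sum(adapted / covs * f_c)
        term2 = 0.5 * np.sum(n_c[:, None] * adapted ** 2 / covs)
        scores.append(term1 - term2)
    return np.array(scores)

# toy usage: 5 anchors, 8-component UBM, 13-dim features
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 13))
weights = np.full(8, 1 / 8)
means, covs = rng.normal(size=(8, 13)), np.ones((8, 13))
anchors = [(np.eye(13), rng.normal(size=13)) for _ in range(5)]
n_c, f_c = ubm_sufficient_stats(frames, weights, means, covs)
print(characterization_vector(n_c, f_c, means, covs, anchors).shape)  # (5,)
```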
  • Publication
    Improving the Performance of Transformer Based Low Resource Speech Recognition for Indian Languages
    (01-05-2020)
    Shetty, Vishwas M.
    ;
    Sagaya Mary N J, Metilda
    ;
    The recent success of the Transformer-based sequence-to-sequence framework for various Natural Language Processing tasks has motivated its application to Automatic Speech Recognition. In this work, we explore the application of Transformers to low-resource Indian languages in a multilingual framework. We explore various methods to incorporate language information into a multilingual Transformer, i.e., (i) at the decoder and (ii) at the encoder. These methods include using language identity tokens or providing language information to the acoustic vectors. Language information can be given to the acoustic vectors in the form of a one-hot vector or by learning a language embedding. From our experiments, we observed that providing language identity always improved performance. The language embedding learned from our proposed approach, when added to the acoustic feature vector, gave the best result. The proposed approach with retraining gave 6%-11% relative improvements in character error rates over the monolingual baseline.
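For the variant the abstract reports as best, a learned language embedding added to the acoustic feature vectors, a minimal PyTorch sketch might look as follows. The dimensions, layer counts and module names are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch: add a learned language embedding to every acoustic frame
# before a Transformer encoder. Sizes and names are assumptions.
import torch
import torch.nn as nn

class LanguageAwareFrontend(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, num_langs=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)          # acoustic features -> model dim
        self.lang_emb = nn.Embedding(num_langs, d_model)  # one learned vector per language
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)

    def forward(self, feats, lang_id):
        # feats: (B, T, feat_dim); lang_id: (B,) integer language identities
        x = self.proj(feats) + self.lang_emb(lang_id).unsqueeze(1)   # broadcast over time
        return self.encoder(x)

# toy usage: batch of 2 utterances, 100 frames each, languages 0 and 2
frontend = LanguageAwareFrontend()
out = frontend(torch.randn(2, 100, 80), torch.tensor([0, 2]))
print(out.shape)  # torch.Size([2, 100, 256])
```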
  • Publication
    Transfer learning and distillation techniques to improve the acoustic modeling of low resource languages
    (01-01-2017)
    Abraham, Basil
    ;
    Seeram, Tejaswi
    ;
    Deep neural networks (DNN) require large amounts of training data to build robust acoustic models for speech recognition tasks. Our work aims to improve the low-resource-language acoustic model to reach a performance comparable to that of a high-resource scenario with the help of data and model parameters from other high-resource languages. We explore transfer learning and distillation methods, where a complex high-resource model guides or supervises the training of the low-resource model. The techniques include (i) a multilingual framework that borrows data from a high-resource language while training the low-resource acoustic model, with KL divergence based constraints added to bias the model towards the low-resource language, and (ii) distilling knowledge from the complex high-resource model to improve the low-resource acoustic model. The experiments were performed on three Indian languages, namely Hindi, Tamil and Kannada. All the techniques gave improved performance, with the multilingual framework with KL divergence regularization giving the best results. In all three languages, performance close to or better than the high-resource scenario was obtained.
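A hedged sketch of the distillation side of this recipe is shown below: a soft-target KL term from a high-resource teacher combined with cross-entropy on the low-resource labels. The temperature, weighting and class count are illustrative assumptions; the paper's KL-divergence regularization in the multilingual framework may be formulated differently.

```python
# Hedged sketch: knowledge distillation from a high-resource teacher to a
# low-resource student. Hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5, T=2.0):
    """Cross-entropy on hard targets plus KL to the teacher's softened posteriors."""
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return (1 - alpha) * ce + alpha * kl

# toy check: 8 frames, 500 senone classes
s = torch.randn(8, 500, requires_grad=True)
t = torch.randn(8, 500)
y = torch.randint(0, 500, (8,))
loss = distillation_loss(s, t, y)
loss.backward()
print(float(loss))
```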
  • Publication
    Non-negative subspace projection during conventional MFCC feature extraction for noise robust speech recognition
    (01-01-2013)
    Pavan Kumar, D. S.
    ;
    Bilgi, Raghavendra R.
    ;
    An additional feature processing algorithm using Non-negative Matrix Factorization (NMF) is proposed for inclusion in the conventional extraction of Mel-frequency cepstral coefficients (MFCC) to achieve noise robustness in HMM-based speech recognition. The proposed approach reconstructs log-Mel filterbank outputs of speech data from a set of building blocks that form the bases of a speech subspace. The bases are learned using the standard NMF of training data. A variation of learning the bases is proposed, which uses histogram-equalized activation coefficients during training, to achieve noise robustness. The proposed methods give up to 5.96% absolute improvement in recognition accuracy on the Aurora-2 task over a baseline with standard MFCCs, and up to 13.69% improvement when combined with other feature normalization techniques such as Histogram Equalization (HEQ) and Heteroscedastic Linear Discriminant Analysis (HLDA). © 2013 IEEE.
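The reconstruction step described here, projecting log-Mel filterbank outputs onto NMF bases learned from training data, could be prototyped roughly as below with scikit-learn. The non-negativity offset, clipping and basis count are assumptions made so the toy example runs; they are not the paper's settings.

```python
# Hedged sketch: learn NMF bases of log-Mel filterbank features on training data,
# then reconstruct test features from those bases before the usual DCT step of
# MFCC extraction. Offsets/clipping keep the data non-negative (an assumption).
import numpy as np
from sklearn.decomposition import NMF

def learn_bases(train_logmel, n_bases=20):
    offset = train_logmel.min()
    model = NMF(n_components=n_bases, init="nndsvda", max_iter=400)
    model.fit(np.clip(train_logmel - offset, 0.0, None))   # rows = frames, cols = Mel channels
    return model, offset

def reconstruct(model, offset, logmel):
    # project onto the learned speech subspace, then map back to the log-Mel domain
    activations = model.transform(np.clip(logmel - offset, 0.0, None))
    return activations @ model.components_ + offset

# toy usage with random stand-ins for log-Mel features (frames x 23 Mel channels)
train = np.random.randn(2000, 23)
test = np.random.randn(300, 23)
model, offset = learn_bases(train)
denoised = reconstruct(model, offset, test)
print(denoised.shape)  # (300, 23)
```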
  • Publication
    Modified cepstral mean normalization - Transforming to utterance specific non-zero mean
    (01-01-2013)
    Joshi, Vikas
    ;
    Prasad, N. Vishnu
    ;
    Cepstral Mean Normalization (CMN) is a widely used technique for channel compensation and noise robustness. CMN compensates for noise by transforming both train and test utterances to zero mean, thus matching the first-order moment of train and test conditions. Since all utterances are normalized to zero mean, CMN can lead to loss of discriminative speech information, especially for short utterances. In this paper, we modify CMN to reduce this loss by transforming every noisy test utterance to an estimate of its clean-utterance mean (the mean the utterance would have had if noise were not present) rather than to zero mean. A look-up table based approach is proposed to estimate the clean mean of a noisy utterance. The proposed method is particularly relevant for IVR-based applications, where the utterances are usually short and noisy. In such cases, techniques like Histogram Equalization (HEQ) do not perform well and a simple approach like CMN leads to loss of discrimination. We obtain a 12% relative improvement in WER over CMN on the Aurora-2 database; when we analyze only short utterances, we obtain relative improvements of 5% and 25% in WER over CMN and HEQ, respectively. Copyright © 2013 ISCA.
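A minimal sketch of the idea, shifting each noisy utterance to a clean-mean estimate retrieved from a look-up table rather than to zero, is given below. The nearest-neighbour lookup and the paired-mean table construction are simplifying assumptions; the paper's table design may differ.

```python
# Hedged sketch: modified CMN that shifts a noisy utterance to an estimated
# clean mean looked up from training data, instead of shifting it to zero mean.
import numpy as np

def build_lookup(noisy_means, clean_means):
    """Table of (noisy utterance mean -> clean utterance mean) pairs from training data."""
    return np.asarray(noisy_means), np.asarray(clean_means)

def estimate_clean_mean(table, noisy_mean):
    keys, values = table
    # nearest-neighbour lookup in cepstral-mean space (a simple stand-in for the
    # lookup strategy described in the paper)
    idx = np.argmin(np.linalg.norm(keys - noisy_mean, axis=1))
    return values[idx]

def modified_cmn(features, table):
    noisy_mean = features.mean(axis=0)
    clean_mean = estimate_clean_mean(table, noisy_mean)
    return features - noisy_mean + clean_mean   # utterance now has the estimated clean mean

# toy usage: 13-dim cepstra, 50 training utterance-mean pairs
table = build_lookup(np.random.randn(50, 13), np.random.randn(50, 13))
utt = np.random.randn(120, 13)
normalized = modified_cmn(utt, table)
print(np.allclose(normalized.mean(axis=0), estimate_clean_mean(table, utt.mean(axis=0))))
```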
  • Publication
    Investigation of Ensemble features of Self-Supervised Pretrained Models for Automatic Speech Recognition
    (01-01-2022)
    Arunkumar, A.
    ;
    Sukhadia, Vrunda N.
    ;
    Self-supervised learning (SSL) based models have been shown to generate powerful representations that can be used to improve the performance of downstream speech tasks. Several state-of-the-art SSL models are available, and each of these models optimizes a different loss, which gives rise to the possibility of their features being complementary. This paper proposes using an ensemble of such SSL representations and models, which exploits the complementary nature of the features extracted by the various pretrained models. We hypothesize that this results in a richer feature representation and show results for the ASR downstream task. To this end, we use three SSL models that have shown excellent results on ASR tasks, namely HuBERT, Wav2Vec2.0, and WavLM. We explore the ensemble of models fine-tuned for the ASR task and the ensemble of features using the embeddings obtained from the pretrained models for a downstream ASR task. We get a relative improvement of 10% in ASR performance over individual models and pretrained features when using the LibriSpeech (100 h) and WSJ datasets for the downstream tasks.
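One simple way to realize such a feature ensemble is to concatenate frame-level embeddings from the three pretrained models, as sketched below with the Hugging Face transformers library. The checkpoint names are common public ones chosen for illustration and are not taken from the paper.

```python
# Hedged sketch: frame-level feature ensembling by concatenating embeddings
# from three pretrained SSL models. Checkpoint names are assumptions.
import torch
from transformers import Wav2Vec2Model, HubertModel, WavLMModel

wav2vec2 = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base")

def ensemble_features(waveform):
    """waveform: (batch, samples) 16 kHz audio; returns concatenated frame embeddings."""
    with torch.no_grad():
        feats = [m(waveform).last_hidden_state for m in (wav2vec2, hubert, wavlm)]
    # all three models use a 20 ms frame rate, so time lengths match up to rounding
    min_len = min(f.shape[1] for f in feats)
    return torch.cat([f[:, :min_len, :] for f in feats], dim=-1)   # (batch, frames, 3 * 768)

# toy usage: one second of silence
print(ensemble_features(torch.zeros(1, 16000)).shape)
```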
  • Publication
    Gram Vaani ASR Challenge on spontaneous telephone speech recordings in regional variations of Hindi
    (01-01-2022)
    Bhanushali, Anish
    ;
    Bridgman, Grant
    ;
    Deekshitha, G.
    ;
    Ghosh, Prasanta
    ;
    Kumar, Pratik
    ;
    Kumar, Saurabh
    ;
    Kolladath, Adithya Raj
    ;
    Ravi, Nithya
    ;
    Seth, Aaditeshwar
    ;
    Seth, Ashish
    ;
    Singh, Abhayjeet
    ;
    Sukhadia, Vrunda N.
    ;
    ;
    Udupa, Sathvik
    ;
    Durga Prasad, Lodagala V.S.V.
    This paper describes the corpus and baseline systems for the Gram Vaani Automatic Speech Recognition (ASR) challenge on regional variations of Hindi. The corpus for this challenge comprises spontaneous telephone speech recordings collected by a social technology enterprise, Gram Vaani. The regional variations of Hindi, together with the spontaneity of speech, natural background noise and transcriptions of variable accuracy due to crowdsourcing, make it a unique corpus for ASR on spontaneous telephonic speech. Around 1,108 hours of real-world spontaneous speech recordings, including 1,000 hours of unlabelled training data, 100 hours of labelled training data, 5 hours of development data and 3 hours of evaluation data, have been released as part of the challenge. The efficacy of both the training and test sets is validated on different ASR systems, in both the traditional time-delay neural network-hidden Markov model (TDNN-HMM) framework and a fully neural end-to-end (E2E) setup. The word error rate (WER) and character error rate (CER) on the eval set for a TDNN model trained on 100 hours of labelled data are 29.7% and 15.1%, respectively. In the E2E setup, the WER and CER on the eval set for a conformer model trained on 100 hours of data are 32.9% and 19.0%, respectively.
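For readers unfamiliar with the metrics quoted above, WER and CER are simply edit distances normalized by reference length, computed over words and characters respectively; a small self-contained helper is sketched below (illustrative, not part of the challenge tooling).

```python
# Hedged illustration: word and character error rate as normalized edit distance.
def edit_distance(ref, hyp):
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / len(ref)

print(wer("the cat sat", "the cat sat down"), cer("abcd", "abed"))  # 0.333..., 0.25
```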
  • Publication
    Improved cepstral mean and variance normalization using Bayesian framework
    (01-12-2013)
    Prasad, N. Vishnu
    ;
    Cepstral Mean and Variance Normalization (CMVN) is a computationally efficient normalization technique for noise robust speech recognition. The performance of CMVN is known to degrade for short utterances, due to insufficient data for parameter estimation and loss of discriminable information, as all utterances are forced to have zero mean and unit variance. In this work, we propose to use posterior estimates of the mean and variance in CMVN, instead of the maximum likelihood estimates. This Bayesian approach, in addition to providing a robust estimate of the parameters, is also shown to preserve discriminable information without an increase in computational cost, making it particularly relevant for Interactive Voice Response (IVR)-based applications. The relative WER reductions of this approach w.r.t. Cepstral Mean Normalization, CMVN and Histogram Equalization are (i) 40.1%, 27% and 4.3% on the Aurora2 database for all utterances, (ii) 25.7%, 38.6% and 30.4% on the Aurora2 database for short utterances, and (iii) 18.7%, 12.6% and 2.5% on the Aurora4 database. © 2013 IEEE.
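A hedged sketch of what posterior mean and variance estimates could look like is given below, using a standard Normal-Inverse-Gamma conjugate prior whose hyperparameters would come from the training corpus. The paper's exact prior parameterization is not given in the abstract, so the formulas here are the textbook ones.

```python
# Hedged sketch: CMVN where the utterance mean and variance are replaced by
# posterior estimates under a Normal-Inverse-Gamma prior. Hyperparameters are
# illustrative assumptions (in practice estimated from training data).
import numpy as np

def bayesian_cmvn(feats, mu0, kappa0, alpha0, beta0):
    # feats: (T, D) cepstra of one utterance; mu0, beta0 are per-dimension arrays (D,)
    n = feats.shape[0]
    xbar = feats.mean(axis=0)
    ss = ((feats - xbar) ** 2).sum(axis=0)

    mu_post = (kappa0 * mu0 + n * xbar) / (kappa0 + n)            # posterior mean of the mean
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * (kappa0 + n))
    var_post = beta_n / (alpha_n - 1.0)                           # posterior mean of the variance

    return (feats - mu_post) / np.sqrt(var_post)

# toy usage: a short 20-frame utterance with 13-dim cepstra; the prior dominates
# when few frames are available, which is the short-utterance case in the abstract
D = 13
out = bayesian_cmvn(np.random.randn(20, D),
                    mu0=np.zeros(D), kappa0=10.0,
                    alpha0=5.0, beta0=4.0 * np.ones(D))
print(out.shape)  # (20, 13)
```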
  • Publication
    Combining speaker and noise feature normalization techniques for automatic speech recognition
    (18-08-2011)
    García, L.
    ;
    Benítez, C.
    ;
    Segura, J. C.
    ;
    This work deals with strategies to jointly reduce speaker and environment mismatches in Automatic Speech Recognition. The consequences of environmental mismatch for the performance of the conventional Vocal Tract Length Normalization (VTLN) algorithm are analyzed, observing the sensitivity of the warping factor distributions to falling SNR. A new combined speaker-noise normalization strategy, which reduces the effect of noise on VTLN by applying Histogram Equalization, is proposed and evaluated on the AURORA2 and AURORA4 databases. Solid results are obtained and discussed to analyze the effectiveness of the described technique. © 2011 IEEE.
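The combination described, equalizing features so the VTLN warp-factor statistics are less sensitive to noise, might be sketched as below. The rank-based Gaussian HEQ variant and the stubbed warp-factor grid search are assumptions for illustration; the acoustic-model scoring is left abstract.

```python
# Hedged sketch: histogram-equalize each cepstral dimension to a reference Gaussian
# before the VTLN warp-factor search. The scorer is left abstract.
import numpy as np
from scipy.stats import norm

def histogram_equalize(feats):
    """Map each dimension to a standard normal via its empirical CDF (rank-based HEQ)."""
    T = feats.shape[0]
    out = np.empty_like(feats)
    for d in range(feats.shape[1]):
        ranks = np.argsort(np.argsort(feats[:, d]))          # 0 .. T-1
        cdf = (ranks + 0.5) / T                               # empirical CDF, kept in (0, 1)
        out[:, d] = norm.ppf(cdf)                             # Gaussian reference distribution
    return out

def estimate_warp_factor(feats, score_fn, warps=np.arange(0.88, 1.13, 0.02)):
    """Grid search over VTLN warp factors; score_fn(feats, warp) stands in for the
    acoustic-model likelihood of the warped features."""
    equalized = histogram_equalize(feats)
    return max(warps, key=lambda w: score_fn(equalized, w))

# toy usage with a dummy scorer that prefers warps close to 1.0
warp = estimate_warp_factor(np.random.randn(200, 13), lambda f, w: -abs(w - 1.0))
print(round(warp, 2))  # 1.0
```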
  • Publication
    Correlation networks for speaker normalization in automatic speech recognition
    (01-01-2018)
    Rini Sharon, A.
    ;
    Reddy Kothinti, Sandeep
    ;
    In this paper, we propose using common representation learning (CRL) for speaker normalization in automatic speech recognition (ASR). Conventional methods like feature-space maximum likelihood linear regression (fMLLR) require a two-pass decode, and their performance is often limited by the amount of data available at test time. While i-vectors do not require a two-pass decode, a significant number of input frames are required for their estimation. Hence, as an alternative, a regression model employing correlational neural networks (CorrNet) for multi-view CRL is proposed. In this approach, the CorrNet training methodology treats normalized and un-normalized features as two parallel views of the same speech data. Once trained, this network generates frame-wise fMLLR-like features, thus overcoming the limitations of fMLLR and i-vectors. The recognition accuracy using the proposed CorrNet-generated features is comparable with that of the i-vector counterparts and significantly better than un-normalized features like filterbank. With CorrNet features, we get absolute improvements in word error rate of 2.5% for TIMIT, 2.69% for WSJ84 and 3.2% for Switchboard-33hour over un-normalized features.
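A much-simplified sketch of a CorrNet-style setup is shown below: two view encoders (un-normalized filterbank and fMLLR features) trained with a reconstruction term plus a correlation term, so that at test time the filterbank branch alone yields fMLLR-like features. Layer sizes, loss weighting and the single-hidden-layer encoders are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch: CorrNet-style two-view model producing fMLLR-like features
# from un-normalized filterbank input. Sizes and losses are assumptions.
import torch
import torch.nn as nn

class CorrNet(nn.Module):
    def __init__(self, fbank_dim=40, fmllr_dim=40, hidden=256):
        super().__init__()
        self.enc_fbank = nn.Sequential(nn.Linear(fbank_dim, hidden), nn.Tanh())
        self.enc_fmllr = nn.Sequential(nn.Linear(fmllr_dim, hidden), nn.Tanh())
        self.dec_fmllr = nn.Linear(hidden, fmllr_dim)

    def forward(self, fbank, fmllr):
        h1, h2 = self.enc_fbank(fbank), self.enc_fmllr(fmllr)
        return h1, h2, self.dec_fmllr(h1)        # fMLLR-like output from the fbank view

def correlation_loss(h1, h2, eps=1e-8):
    """Negative mean per-dimension correlation between the two hidden views."""
    h1 = h1 - h1.mean(dim=0)
    h2 = h2 - h2.mean(dim=0)
    corr = (h1 * h2).sum(dim=0) / (h1.norm(dim=0) * h2.norm(dim=0) + eps)
    return -corr.mean()

# toy training step on a batch of 32 parallel frames
model = CorrNet()
fbank, fmllr = torch.randn(32, 40), torch.randn(32, 40)
h1, h2, pred = model(fbank, fmllr)
loss = nn.functional.mse_loss(pred, fmllr) + correlation_loss(h1, h2)
loss.backward()
print(float(loss))
```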