  • Publication
    Modified cepstral mean normalization - Transforming to utterance specific non-zero mean
    (01-01-2013)
    Joshi, Vikas; Prasad, N. Vishnu
    Cepstral Mean Normalization (CMN) is a widely used technique for channel compensation and noise robustness. CMN compensates for noise by transforming both train and test utterances to zero mean, thus matching the first-order moments of the train and test conditions. Since all utterances are normalized to zero mean, CMN can lead to a loss of discriminative speech information, especially for short utterances. In this paper, we modify CMN to reduce this loss by transforming every noisy test utterance to an estimate of the clean utterance mean (the mean the utterance would have had if noise were not present) rather than to zero mean. A look-up table based approach is proposed to estimate the clean mean of a noisy utterance. The proposed method is particularly relevant for IVR-based applications, where utterances are usually short and noisy. In such cases, techniques like Histogram Equalization (HEQ) do not perform well, and a simple approach like CMN leads to loss of discrimination. We obtain a 12% relative improvement in WER over CMN on the Aurora-2 database; when we analyze only short utterances, we obtain relative improvements of 5% and 25% in WER over CMN and HEQ, respectively. Copyright © 2013 ISCA.
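    The idea in this abstract can be illustrated with a minimal numpy sketch: conventional CMN subtracts the per-utterance mean of each cepstral dimension, while the modified variant shifts the utterance to an estimate of its clean mean instead of zero. This is not the authors' implementation; in particular, the clean-mean estimate (which the paper obtains from a look-up table) is simply passed in here as an input.

    ```python
    import numpy as np

    def cmn(features):
        """Conventional CMN: zero-mean each cepstral dimension.

        features: (num_frames, num_ceps) array of cepstral coefficients.
        """
        return features - features.mean(axis=0)

    def modified_cmn(features, clean_mean):
        """Modified CMN: shift the utterance to an estimate of its clean
        mean instead of zero, preserving first-order discriminative
        information across utterances.

        clean_mean: (num_ceps,) hypothetical estimate of the mean the
        utterance would have had without noise (the paper derives it from
        a look-up table; here it is an assumed input).
        """
        return features - features.mean(axis=0) + clean_mean
    ```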
  • Publication
    Improved cepstral mean and variance normalization using Bayesian framework
    (01-12-2013)
    Prasad, N. Vishnu
    Cepstral Mean and Variance Normalization (CMVN) is a computationally efficient normalization technique for noise-robust speech recognition. The performance of CMVN is known to degrade for short utterances, due to insufficient data for parameter estimation and loss of discriminable information, as all utterances are forced to have zero mean and unit variance. In this work, we propose to use posterior estimates of the mean and variance in CMVN, instead of the maximum-likelihood estimates. This Bayesian approach, in addition to providing robust parameter estimates, is also shown to preserve discriminable information without increased computational cost, making it particularly relevant for Interactive Voice Response (IVR)-based applications. The relative WER reductions of this approach with respect to Cepstral Mean Normalization, CMVN, and Histogram Equalization are (i) 40.1%, 27%, and 4.3% on the Aurora2 database for all utterances, (ii) 25.7%, 38.6%, and 30.4% on the Aurora2 database for short utterances, and (iii) 18.7%, 12.6%, and 2.5% on the Aurora4 database. © 2013 IEEE.
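    One common way to realize "posterior estimates instead of maximum-likelihood estimates" is conjugate-prior shrinkage: the utterance statistics are pulled toward prior statistics (e.g., pooled from training data), which stabilizes normalization when the utterance is short. The sketch below uses this shrinkage form; the specific prior, its form, and the weights `kappa`/`nu` are illustrative assumptions, since the abstract does not specify them.

    ```python
    import numpy as np

    def bayesian_cmvn(features, prior_mean, prior_var, kappa=10.0, nu=10.0):
        """CMVN using MAP-style (shrinkage) parameter estimates.

        features: (num_frames, num_ceps) cepstral features.
        prior_mean, prior_var: (num_ceps,) prior statistics, e.g. pooled
        from training data. kappa/nu control how strongly short
        utterances are shrunk toward the prior (assumed values).
        """
        n = features.shape[0]
        ml_mean = features.mean(axis=0)
        ml_var = features.var(axis=0)
        # Posterior estimates: weighted combination of utterance (ML)
        # statistics and prior statistics; for large n they approach ML.
        post_mean = (n * ml_mean + kappa * prior_mean) / (n + kappa)
        post_var = (n * ml_var + nu * prior_var) / (n + nu)
        return (features - post_mean) / np.sqrt(post_var)
    ```

    Because short utterances are normalized with shrunk statistics rather than their own exact mean and variance, different utterances are no longer forced onto identical first- and second-order statistics, which is how discriminable information can survive normalization.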
  • Publication
    Combining speaker and noise feature normalization techniques for automatic speech recognition
    (18-08-2011)
    García, L.; Benítez, C.; Segura, J. C.
    This work deals with strategies to jointly reduce speaker and environment mismatches in Automatic Speech Recognition. The consequences of environmental mismatch for the performance of the conventional Vocal Tract Length Normalization (VTLN) algorithm are analyzed, observing the sensitivity of the warping-factor distributions to falling SNR. A new combined speaker-noise normalization strategy, which reduces the effect of noise in VTLN by applying Histogram Equalization, is proposed and evaluated on the AURORA2 and AURORA4 databases. Solid results are obtained and discussed to analyze the effectiveness of the described technique. © 2011 IEEE.
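    Histogram Equalization, the noise-compensation component used here, maps each feature dimension's empirical distribution onto a reference distribution. A minimal sketch, assuming a standard-normal reference (implementations may instead use a CDF estimated from clean training data):

    ```python
    import numpy as np
    from statistics import NormalDist

    def histogram_equalization(features):
        """Per-dimension histogram equalization to a Gaussian reference.

        Each dimension is rank-transformed to its empirical CDF value and
        mapped through the inverse CDF of a standard normal, so the output
        distribution matches the reference regardless of the noise-induced
        distortion of the input distribution.
        """
        n, d = features.shape
        out = np.empty_like(features, dtype=float)
        inv_cdf = NormalDist().inv_cdf
        for k in range(d):
            ranks = features[:, k].argsort().argsort()   # 0 .. n-1
            probs = (ranks + 0.5) / n                    # empirical CDF, kept inside (0, 1)
            out[:, k] = [inv_cdf(p) for p in probs]
        return out
    ```

    The mapping is monotone per dimension, so it preserves the rank order of feature values while replacing their distribution, which is why it compensates nonlinear distortions that a mean/variance shift cannot.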
  • Publication
    Efficient speaker and noise normalization for robust speech recognition
    (01-12-2011)
    Joshi, Vikas; Bilgi, Raghavendra; Benitez, C.; Garcia, L.
    In this paper, we describe a computationally efficient approach for combining speaker- and noise-normalization techniques. In particular, we combine the simple yet effective Histogram Equalization (HEQ) for noise compensation with Vocal Tract Length Normalization (VTLN) for speaker normalization. While it is intuitive to remove noise first and then perform VTLN, this is difficult since HEQ performs noise compensation in the cepstral domain, while VTLN involves warping in the spectral domain. In this paper, we investigate the use of the recently proposed T-VTLN approach to speaker normalization, in which matrix transformations are applied directly to cepstral features. We show that the speaker-specific warp factors estimated from noisy speech using this approach closely match those estimated from clean speech. Further, using sub-band HEQ (S-HEQ) and T-VTLN, we obtain significant relative improvements of 20% and 33.54% over the baseline in recognition accuracy on the Aurora-2 and Aurora-4 tasks, respectively. Copyright © 2011 ISCA.
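    The warp-factor estimation this abstract relies on can be sketched as a grid search: for each candidate warp factor, apply the corresponding cepstral-domain transform and keep the factor that scores best. Both the precomputed warp matrices and the scoring function below are stand-ins (in practice the score would be an acoustic-model likelihood, and the matrices come from the T-VTLN formulation); this is an illustrative sketch, not the paper's implementation.

    ```python
    import numpy as np

    def estimate_warp_factor(features, warp_matrices, score_fn):
        """Grid-search warp-factor estimation for a linear cepstral warp.

        features: (num_frames, num_ceps) cepstral features.
        warp_matrices: dict mapping candidate warp factor -> (d, d) matrix
        that applies the warp directly in the cepstral domain (assumed
        precomputed). score_fn(warped) stands in for the acoustic-model
        likelihood used in practice.
        """
        best_alpha, best_score = None, -np.inf
        for alpha, M in warp_matrices.items():
            warped = features @ M.T       # transform every frame linearly
            s = score_fn(warped)
            if s > best_score:
                best_alpha, best_score = alpha, s
        return best_alpha
    ```

    Because the warp is a fixed linear transform per candidate factor, the search costs one matrix multiply per candidate, which is the computational appeal of doing VTLN in the cepstral domain.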
  • Publication
    Modified Mean and Variance Normalization: Transforming to Utterance-Specific Estimates
    (01-05-2016)
    Joshi, Vikas; Prasad, N. Vishnu
    Cepstral mean and variance normalization (CMVN) is an efficient noise-compensation technique popularly used in many speech applications. CMVN eliminates the mismatch between training and test utterances by transforming them to zero mean and unit variance. In this work, we argue that some useful information is lost during normalization, as every utterance is forced to have the same first- and second-order statistics, i.e., zero mean and unit variance. We propose to modify the CMVN methodology to retain this useful information and yet compensate for noise. The proposed normalization approach transforms every test utterance to its utterance-specific clean mean (i.e., the utterance mean if the noise were absent) and clean variance, instead of zero mean and unit variance. We derive expressions to estimate the clean mean and variance from a noisy utterance. The proposed normalization is effective in recognizing voice commands, which are typically short (single words or short phrases) and for which more advanced methods [such as histogram equalization (HEQ)] are not effective. Recognition results show a relative improvement (RI) of (Formula presented.) in word error rate over conventional CMVN on the Aurora-2 database, and an RI of 20 and (Formula presented.) over CMVN and HEQ on short utterances of the Aurora-2 database.
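    Mechanically, the modification extends the shift used for CMN to second-order statistics: standardize the utterance as in conventional CMVN, then rescale to the estimated clean variance and shift to the estimated clean mean. A minimal sketch, with the clean statistics supplied as inputs (the paper derives them from the noisy utterance itself):

    ```python
    import numpy as np

    def modified_cmvn(features, clean_mean, clean_var):
        """Modified CMVN: map the utterance to utterance-specific clean
        statistics instead of zero mean / unit variance.

        features: (num_frames, num_ceps) cepstral features.
        clean_mean, clean_var: (num_ceps,) hypothetical estimates of the
        statistics the utterance would have had without noise.
        """
        mu = features.mean(axis=0)
        sigma = features.std(axis=0)
        standardized = (features - mu) / sigma        # conventional CMVN
        return standardized * np.sqrt(clean_var) + clean_mean
    ```

    Since the target statistics differ per utterance, short utterances keep distinct first- and second-order statistics after normalization, which is the discriminative information conventional CMVN discards.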