  • Publication
    Transfer learning and distillation techniques to improve the acoustic modeling of low resource languages
    (01-01-2017)
    Abraham, Basil; Seeram, Tejaswi;
    Deep neural networks (DNN) require a large amount of training data to build robust acoustic models for speech recognition tasks. Our work aims to improve a low-resource language acoustic model so that it reaches performance comparable to a high-resource scenario, with the help of data and model parameters from other high-resource languages. We explore transfer learning and distillation methods, where a complex high-resource model guides or supervises the training of the low-resource model. The techniques include (i) a multilingual framework that borrows data from a high-resource language while training the low-resource acoustic model, with KL-divergence-based constraints added to bias the model towards the low-resource language, and (ii) distilling knowledge from the complex high-resource model to improve the low-resource acoustic model. The experiments were performed on three Indian languages, namely Hindi, Tamil and Kannada. All the techniques gave improved performance, with the multilingual framework with KL-divergence regularization giving the best results. In all three languages, performance close to or better than the high-resource scenario was obtained.
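As a rough illustration of the distillation idea in this abstract, here is a minimal PyTorch sketch of a combined objective: a hard-label cross-entropy term on the low-resource data plus a KL-divergence term that pulls the low-resource (student) model towards the high-resource (teacher) model. The weighting `alpha` and temperature `T` are hypothetical hyper-parameters, not values from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5, T=2.0):
    """Hard-label cross-entropy on the low-resource data plus a KL term
    that biases the student towards the high-resource teacher.
    alpha and T are illustrative assumptions, not the paper's settings."""
    # Cross-entropy against the hard senone targets
    ce = F.cross_entropy(student_logits, targets)
    # KL divergence between teacher and student posteriors (soft targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return (1.0 - alpha) * ce + alpha * kl
```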
  • Publication
    Articulatory feature extraction using CTC to build articulatory classifiers without forced frame alignments for speech recognition
    (01-01-2016)
    Abraham, Basil; ; Joy, Neethu Mariam
    Articulatory features provide robustness to speaker and environment variability by incorporating speech production knowledge. Pseudo-articulatory features are a way of extracting articulatory features using articulatory classifiers trained from speech data. One of the major problems faced in building articulatory classifiers is the requirement of speech data aligned in terms of articulatory feature values at the frame level. Manually aligning data at the frame level is a tedious task, and alignments obtained from phone alignments using a phone-to-articulatory feature mapping are prone to errors. In this paper, a technique using the connectionist temporal classification (CTC) criterion to train an articulatory classifier based on a bidirectional long short-term memory (BLSTM) recurrent neural network (RNN) is proposed. The CTC criterion eliminates the need for forced frame-level alignments. Articulatory classifiers were also built using different neural network architectures like deep neural networks (DNN), convolutional neural networks (CNN) and BLSTMs with frame-level alignments, and were compared to the proposed CTC approach. Among the different architectures, articulatory features extracted using classifiers built with BLSTMs gave better recognition performance. Further, the proposed BLSTM with CTC gave the best overall performance on both the SVitchboard (6 hours) and Switchboard (33 hours) data sets.
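The core of this setup can be sketched in a few lines of PyTorch: a BLSTM emits per-frame label posteriors (plus a CTC blank), and `nn.CTCLoss` trains it from per-utterance articulatory label sequences with no frame-level alignment. All layer sizes and label counts below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BLSTMArticulatoryClassifier(nn.Module):
    """BLSTM tagger trained with CTC, so no frame-level AF alignments are
    needed; sizes are assumptions, not the paper's configuration."""
    def __init__(self, feat_dim=40, hidden=320, num_labels=10):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        # +1 output unit for the CTC blank symbol
        self.proj = nn.Linear(2 * hidden, num_labels + 1)

    def forward(self, x):                   # x: (batch, frames, feat_dim)
        h, _ = self.blstm(x)
        return self.proj(h).log_softmax(-1)

model = BLSTMArticulatoryClassifier()
ctc = nn.CTCLoss(blank=10)                  # blank index = num_labels
feats = torch.randn(4, 200, 40)             # dummy batch of feature frames
labels = torch.randint(0, 10, (4, 30))      # per-utterance AF label sequences
log_probs = model(feats).transpose(0, 1)    # CTCLoss expects (T, N, C)
loss = ctc(log_probs, labels,
           input_lengths=torch.full((4,), 200, dtype=torch.long),
           target_lengths=torch.full((4,), 30, dtype=torch.long))
```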
  • Publication
    Overcoming data sparsity in acoustic modeling of low-resource language by borrowing data and model parameters from high-resource languages
    (01-01-2016)
    Abraham, Basil; ; Joy, Neethu Mariam
    In this paper, we propose two techniques to improve the acoustic model of a low-resource language: (i) pooling data from closely related languages using a phoneme mapping algorithm to build acoustic models like the subspace Gaussian mixture model (SGMM), phone cluster adaptive training (Phone-CAT), deep neural network (DNN) and convolutional neural network (CNN); using the low-resource language data, we then adapt the aforementioned models towards that language. (ii) Using models built from high-resource languages, we borrow subspace model parameters from SGMM/Phone-CAT, or hidden layers from DNN/CNN; the language-specific parameters are then estimated using the low-resource language data. The experiments were performed on four Indian languages, namely Assamese, Bengali, Hindi and Tamil. Relative improvements of 10 to 30% were obtained over the corresponding monolingual models in each case.
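A minimal sketch of the second technique for the DNN case, with hypothetical layer sizes: the hidden layers of a trained high-resource DNN are carried over, and only a fresh language-specific output layer is estimated on the low-resource data.

```python
import torch.nn as nn

# Illustrative sizes; the point is reusing high-resource hidden layers
# while re-estimating only the language-specific output layer.
high_res_dnn = nn.Sequential(
    nn.Linear(440, 1024), nn.Sigmoid(),
    nn.Linear(1024, 1024), nn.Sigmoid(),
    nn.Linear(1024, 4000),                  # high-resource senone layer
)
# ... assume high_res_dnn has been trained on the high-resource language ...

low_res_dnn = nn.Sequential(
    *list(high_res_dnn.children())[:-1],    # borrowed hidden layers
    nn.Linear(1024, 1500),                  # fresh low-resource senone layer
)
# Optionally freeze the borrowed layers and train only the new output layer
# (the borrowed modules are shared by reference with high_res_dnn here).
for layer in list(low_res_dnn.children())[:-1]:
    for p in layer.parameters():
        p.requires_grad = False
```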
  • Publication
    Improved acoustic modeling of low-resource languages using shared SGMM parameters of high-resource languages
    (06-09-2016)
    Joy, Neethu Mariam; Abraham, Basil; Navneeth, K.;
    In this paper, we investigate methods to improve the recognition performance of low-resource languages with limited training data by borrowing subspace parameters from a high-resource language in the subspace Gaussian mixture model (SGMM) framework. As a first step, only the state-specific vectors are updated using the low-resource language data, while retaining all the globally shared parameters from the high-resource language. This approach gave improvements only in some cases. However, when both the state-specific and weight projection vectors are re-estimated with the low-resource language data, we get consistent improvements in performance over the conventional monolingual SGMM of the low-resource language. Further, we conducted experiments to investigate the effect of the different shared parameters on the acoustic model built using the proposed method. Experiments were done on the Tamil, Hindi and Bengali corpora of the MANDI database. Relative improvements of 16.17% for Tamil, 13.74% for Hindi and 12.5% for Bengali over the respective monolingual SGMMs were obtained.
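For orientation, the parameters referred to above are those of the standard SGMM (written here without sub-states, following Povey et al.'s notation): the state-specific vector combines with globally shared projections to give per-state Gaussian means and weights.

```latex
\mu_{ji} = \mathbf{M}_i \mathbf{v}_j, \qquad
w_{ji} = \frac{\exp\!\left(\mathbf{w}_i^{\top}\mathbf{v}_j\right)}
              {\sum_{i'=1}^{I}\exp\!\left(\mathbf{w}_{i'}^{\top}\mathbf{v}_j\right)}
```

Here the subspace projections M_i, the weight projection vectors w_i and the covariances are shared across all states, while v_j is state-specific. In the method above, the shared parameters come from the high-resource language, and v_j (and subsequently the w_i) are re-estimated on the low-resource data.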
  • Publication
    An automated technique to generate phone-to-articulatory label mapping
    (01-02-2017)
    Abraham, Basil;
    Recent studies have shown that in the case of under-resourced languages, the use of articulatory features (AF) emerging from an articulatory model results in improved automatic speech recognition (ASR) compared to conventional mel-frequency cepstral coefficient (MFCC) features. Articulatory features are more robust to noise and pronunciation variability than conventional acoustic features. One method to extract articulatory features is to take conventional acoustic features like MFCCs and build an articulatory classifier that outputs articulatory features (known as pseudo-AF). However, these classifiers require a mapping from phones to the different articulatory labels (AL) (e.g., place of articulation and manner of articulation), which is not readily available for many under-resourced languages. In this article, we propose an automated technique to generate a phone-to-articulatory label (phone-to-AL) mapping for a new target language based on the knowledge of the phone-to-AL mapping of a well-resourced language. The proposed mapping technique is based on the center-phone capturing property of the interpolation vectors emerging from the recently proposed phone cluster adaptive training (Phone-CAT) method. Phone-CAT is an acoustic modeling technique that belongs to the broad category of canonical state models (CSM), which includes the subspace Gaussian mixture model (SGMM). In Phone-CAT, the interpolation vector belonging to a particular context-dependent state has maximum weight for the center phone in the case of monophone clusters, or for the AL of the center phone in the case of AL clusters. These relationships from the various context-dependent states are used to generate a phone-to-AL mapping. The Phone-CAT technique makes use of all the speech data belonging to a particular context-dependent state. Therefore, multiple segments of speech are used to generate the mapping, which makes it more robust to noise and other variations. In this study, we obtain phone-to-AL mappings for three under-resourced Indian languages, namely Assamese, Hindi and Tamil, based on the phone-to-AL mapping available for English. With the generated mappings, articulatory features are extracted for these languages using varying amounts of data in order to build an articulatory classifier. Experiments were also performed in a cross-lingual scenario assuming a small training data set (≈ 2 h) from each of the Indian languages, with articulatory classifiers built using a large amount of training data (≈ 22 h) from other languages, including English (Switchboard task). Interestingly, cross-lingual performance is comparable to that of an articulatory classifier built with large amounts of native training data. Using articulatory features, more than 30% relative improvement was observed over conventional MFCC features for all three languages in a DNN framework.
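A toy reconstruction of the mapping step, relying only on the center-phone capturing property stated above: each context-dependent state votes for the AL cluster where its interpolation vector peaks, and votes are accumulated per phone. All names and data structures here are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict
import numpy as np

def derive_phone_to_al(state_center_phones, interp_vectors, al_cluster_names):
    """state_center_phones: {state_id: center phone} (hypothetical inputs);
    interp_vectors: {state_id: Phone-CAT interpolation vector over AL
    clusters}. Accumulating votes over all states means many speech
    segments back each phone-to-AL decision, as the abstract describes."""
    votes = defaultdict(lambda: defaultdict(float))
    for state, phone in state_center_phones.items():
        v = interp_vectors[state]                 # weights over AL clusters
        votes[phone][al_cluster_names[int(np.argmax(v))]] += 1.0
    # Pick the most-voted articulatory label for each phone
    return {p: max(als, key=als.get) for p, als in votes.items()}
```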
  • Publication
    Generalized distillation framework for speaker normalization
    (01-01-2017)
    Joy, Neethu Mariam; Kothinti, Sandeep Reddy; ; Abraham, Basil
    The generalized distillation framework has been shown to be effective for speech enhancement in the past. In this paper, we extend this idea to speaker normalization without any explicit adaptation data. In the generalized distillation framework, we assume the presence of some "privileged" information to guide the training process in addition to the training data. In the proposed approach, the privileged information is obtained from a "teacher" model trained on speaker-normalized FMLLR features. The "student" model is trained on un-normalized filterbank features and uses the teacher's supervision for cross-entropy training. The proposed distillation method does not need first-pass decode information during testing and, unlike FMLLR or i-vectors, imposes no constraints on the duration of the test data for computing speaker-specific transforms. Experiments done on the Switchboard and AMI corpora show that the generalized distillation framework yields improvements over un-normalized features with or without i-vectors.
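A minimal sketch of one training step under this framework, with hypothetical model and parameter names: the teacher sees speaker-normalized FMLLR features (the privileged information), while the student sees only un-normalized filterbanks and is trained on a mixture of hard senone targets and the teacher's soft posteriors.

```python
import torch
import torch.nn.functional as F

def generalized_distillation_step(student, teacher, fbank, fmllr, senones,
                                  lam=0.5, T=1.0):
    """One step in the spirit of generalized distillation; `lam` (imitation
    weight) and `T` (temperature) are assumptions, not the paper's values."""
    with torch.no_grad():
        soft = F.softmax(teacher(fmllr) / T, dim=-1)   # privileged targets
    logits = student(fbank)                            # un-normalized input
    hard_ce = F.cross_entropy(logits, senones)
    soft_ce = -(soft * F.log_softmax(logits / T, dim=-1)).sum(-1).mean()
    return (1.0 - lam) * hard_ce + lam * soft_ce
```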
  • Publication
    Articulatory and stacked bottleneck features for low resource speech recognition
    (01-01-2018)
    Shetty, Vishwas M.; Sharon, Rini A.; Abraham, Basil; Seeram, Tejaswi; Prakash, Anusha; Ravi, Nithya;
    In this paper, we discuss the benefits of using articulatory and stacked bottleneck features (SBF) for low-resource speech recognition. Articulatory features (AF), which capture the underlying attributes of speech production, are found to be robust to channel and speaker variations. However, building an efficient articulatory classifier to extract AF requires an enormous amount of data. For low-resource acoustic modeling, we propose to train the bidirectional long short-term memory (BLSTM) articulatory classifier by pooling data from the available low-resource Indian languages, namely Gujarati, Tamil and Telugu. This is done in the context of the Microsoft Indian Language challenge. Similarly, we train a multilingual bottleneck feature extractor and an SBF extractor using the pooled data. To bias the SBF network towards the target language, the second network in the stacked architecture was trained using the target language alone. The performance of an ASR system trained with stand-alone AF is observed to be on par with the multilingual bottleneck features. When the AF and the biased SBF are appended, they are found to outperform the conventional filterbank features in the multilingual deep neural network (DNN) framework and the high-resolution mel-frequency cepstral coefficient (MFCC) features in the time-delay neural network (TDNN) framework.
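A compact sketch of the stacked-bottleneck idea, with all sizes and the `splice` helper assumed for illustration: a first bottleneck network is trained on the pooled languages, its bottleneck outputs are spliced over context and fed to a second network trained on the target language alone, whose bottleneck activations form the biased SBF.

```python
import torch
import torch.nn as nn

class BottleneckNet(nn.Module):
    """Feed-forward net whose low-dimensional bottleneck activations are
    used as features; layer sizes are illustrative assumptions."""
    def __init__(self, in_dim, bn_dim=40, out_dim=3000):
        super().__init__()
        self.front = nn.Sequential(nn.Linear(in_dim, 1024), nn.ReLU(),
                                   nn.Linear(1024, bn_dim))   # bottleneck
        self.back = nn.Sequential(nn.ReLU(), nn.Linear(bn_dim, out_dim))

    def forward(self, x):
        bn = self.front(x)
        return self.back(bn), bn

# Stage 1 trained on pooled Gujarati+Tamil+Telugu data; stage 2 trained on
# the target language alone, which biases the stacked features towards it.
stage1 = BottleneckNet(in_dim=440)       # e.g. 40-dim fbank spliced +/-5
stage2 = BottleneckNet(in_dim=40 * 11)   # stage-1 bottlenecks spliced +/-5

def extract_sbf(spliced_fbank, splice):
    """`splice(frames)` is an assumed helper that stacks each frame with
    its context neighbours before the second stage."""
    with torch.no_grad():
        _, bn1 = stage1(spliced_fbank)   # first-stage bottleneck features
        _, sbf = stage2(splice(bn1))     # stacked bottleneck features (SBF)
    return sbf
```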
  • Publication
    Joint estimation of articulatory features and acoustic models for low-resource languages
    (01-01-2017)
    Abraham, Basil; ; Joy, Neethu Mariam
    Using articulatory features for speech recognition improves the performance of low-resource languages. One way to obtain articulatory features is by using an articulatory classifier (pseudo-articulatory features). The performance of the articulatory features depends on the efficacy of this classifier, but training such a robust classifier for a low-resource language is constrained by the limited amount of training data. We can overcome this by training the articulatory classifier using a high-resource language. This classifier can then be used to generate articulatory features for the low-resource language. However, this technique fails when the high- and low-resource languages have mismatches in their environmental conditions. In this paper, we address both of the aforementioned problems by jointly estimating the articulatory features and the low-resource acoustic model. The experiments were performed on two low-resource Indian languages, namely Hindi and Tamil. English was used as the high-resource language. Relative improvements of 23% and 10% were obtained for Hindi and Tamil, respectively.
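A minimal sketch of what joint estimation could look like in a neural framework, with hypothetical network and variable names: the articulatory classifier and the acoustic model are chained, so the senone-level loss back-propagates into the articulatory network and adapts it to the low-resource language's conditions.

```python
import torch

def joint_step(af_net, am_net, optimizer, fbank, senones, loss_fn):
    """One joint update; `af_net` (articulatory classifier, initialized on
    the high-resource language) and `am_net` (low-resource acoustic model)
    are assumed modules whose input/output dimensions match."""
    af = af_net(fbank)                        # pseudo-articulatory features
    logits = am_net(torch.cat([fbank, af], dim=-1))
    loss = loss_fn(logits, senones)
    optimizer.zero_grad()
    loss.backward()                           # updates both networks jointly
    optimizer.step()
    return loss.item()
```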
  • Publication
    DNNs for unsupervised extraction of pseudo FMLLR features without explicit adaptation data
    (01-01-2016)
    Joy, Neethu Mariam; Baskar, Murali Karthick; ; Abraham, Basil
    In this paper, we propose the use of deep neural networks (DNN) as a regression model to estimate feature-space maximum likelihood linear regression (FMLLR) features from unnormalized features. During training, pairs of unnormalized features (input) and corresponding FMLLR features (target) are provided, and the network is optimized to reduce the mean-square error between its output and the target FMLLR features. During testing, the unnormalized features are passed through this DNN feature extractor to obtain FMLLR-like features without any supervision or first-pass decode. Further, the FMLLR-like features are generated frame by frame, requiring no explicit adaptation data, unlike FMLLR or i-vectors. Our proposed approach is therefore suitable for scenarios where there is little adaptation data. The proposed approach provides sizable improvements over basis-FMLLR and conventional FMLLR when normalization is done at the utterance level on the TIMIT and Switchboard (33-hour) data sets.
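A minimal sketch of the regression model, with illustrative layer sizes and random tensors standing in for real features: the DNN is trained with a mean-square-error loss to map spliced un-normalized frames to their FMLLR targets, frame by frame.

```python
import torch
import torch.nn as nn

# Regression sketch: un-normalized features in, FMLLR-like features out.
# All dimensions are assumptions (e.g. 40-dim features spliced +/-5 frames).
regressor = nn.Sequential(
    nn.Linear(440, 1024), nn.Tanh(),
    nn.Linear(1024, 1024), nn.Tanh(),
    nn.Linear(1024, 40),                # 40-dim FMLLR-like output
)
mse = nn.MSELoss()
unnorm = torch.randn(256, 440)          # spliced un-normalized frames
fmllr_targets = torch.randn(256, 40)    # FMLLR features used as targets
loss = mse(regressor(unnorm), fmllr_targets)
# At test time, frames simply pass through `regressor`: no adaptation data,
# supervision, or first-pass decode is needed.
```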
  • Publication
    Improved phone-cluster adaptive training acoustic model
    (16-11-2016)
    Joy, Neethu Mariam; ; Abraham, Basil; Navneeth, K.
    Phone-cluster adaptive training (Phone-CAT) is a subspace-based acoustic modeling technique inspired by cluster adaptive training (CAT) and the subspace Gaussian mixture model (SGMM). This paper explores three extensions to the basic Phone-CAT model to improve its recognition performance, viz., increasing the phonetic subspace dimension, including sub-states, and including a speaker subspace. The latter two extensions are similar in implementation to those of SGMM, as both acoustic models share a similar subspace framework. However, since the phonetic subspace dimension of Phone-CAT is constrained to be equal to the number of monophones, the first extension is not straightforward to implement. We propose a two-stage Phone-CAT model in which we increase the phonetic subspace dimension to the number of monophone states. This model still retains the center-phone capturing property of the state-specific vectors in basic Phone-CAT. Experiments done on a 33-hour training subset of the Switchboard database show improvements in the recognition performance of the basic Phone-CAT model with the inclusion of the proposed extensions.
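For context, the basic Phone-CAT relationship being extended can be written as follows (notation assumed from the Phone-CAT/SGMM literature): the i-th Gaussian mean of state j is an interpolation over monophone cluster means, so the state-specific interpolation vector has one dimension per monophone, and the two-stage model above enlarges it to one dimension per monophone state.

```latex
\mu_{ji} = \mathbf{M}_i \mathbf{v}_j, \qquad
\mathbf{M}_i = \left[\, \mu_i^{(1)} \;\; \mu_i^{(2)} \;\; \cdots \;\; \mu_i^{(P)} \,\right]
```

Here each column of M_i is the corresponding Gaussian mean of one of the P monophone clusters, and v_j is the interpolation vector whose largest weight falls on the center phone of state j.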