  • Publication
    Improving the Performance of Transformer Based Low Resource Speech Recognition for Indian Languages
    (01-05-2020)
    Shetty, Vishwas M.; Sagaya Mary N J, Metilda;
    The recent success of the Transformer-based sequence-to-sequence framework for various Natural Language Processing tasks has motivated its application to Automatic Speech Recognition. In this work, we explore the application of Transformers to low-resource Indian languages in a multilingual framework. We explore various methods to incorporate language information into a multilingual Transformer, i.e., (i) at the decoder and (ii) at the encoder. These methods include using language identity tokens or providing language information to the acoustic vectors, either as a one-hot vector or as a learned language embedding. From our experiments, we observed that providing language identity always improved performance. The language embedding learned with our proposed approach, when added to the acoustic feature vector, gave the best result. The proposed approach with retraining gave 6%-11% relative improvements in character error rates over the monolingual baseline.
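    A minimal sketch (not the authors' implementation) of the core idea above: a learned language embedding is added to every acoustic feature frame before a Transformer encoder. All dimensions, module names and the toy input are assumptions.
    ```python
    import torch
    import torch.nn as nn

    class MultilingualAcousticEncoder(nn.Module):
        def __init__(self, feat_dim=80, d_model=256, n_langs=4, n_layers=6, n_heads=4):
            super().__init__()
            self.input_proj = nn.Linear(feat_dim, d_model)
            # one embedding vector per language, added to every acoustic frame
            self.lang_embed = nn.Embedding(n_langs, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)

        def forward(self, feats, lang_id):
            # feats: (batch, time, feat_dim); lang_id: (batch,)
            x = self.input_proj(feats)
            x = x + self.lang_embed(lang_id).unsqueeze(1)  # broadcast over time
            return self.encoder(x)

    enc = MultilingualAcousticEncoder()
    out = enc(torch.randn(2, 100, 80), torch.tensor([0, 2]))
    print(out.shape)  # torch.Size([2, 100, 256])
    ```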
  • Publication
    Investigation of Ensemble Features of Self-Supervised Pretrained Models for Automatic Speech Recognition
    (01-01-2022)
    Arunkumar, A.; Sukhadia, Vrunda N.;
    Self-supervised learning (SSL) based models have been shown to generate powerful representations that can be used to improve the performance of downstream speech tasks. Several state-of-the-art SSL models are available, and each of these models optimizes a different loss, which gives rise to the possibility of their features being complementary. This paper proposes using an ensemble of such SSL representations and models, which exploits the complementary nature of the features extracted by the various pretrained models. We hypothesize that this results in a richer feature representation and show results for the ASR downstream task. To this end, we use three SSL models that have shown excellent results on ASR tasks, namely HuBERT, Wav2Vec2.0, and WavLM. We explore the ensemble of models fine-tuned for the ASR task and the ensemble of features using the embeddings obtained from the pretrained models for a downstream ASR task. We get a relative improvement of 10% in ASR performance over individual models and pretrained features when using the LibriSpeech (100h) and WSJ datasets for the downstream tasks.
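    A rough sketch of the feature-ensemble idea: frame-level representations from several pretrained SSL models are concatenated into one richer feature vector. The Hugging Face checkpoint names below are assumed public equivalents of the models mentioned, not the paper's exact setup.
    ```python
    import torch
    from transformers import AutoModel

    names = ["facebook/hubert-base-ls960", "facebook/wav2vec2-base", "microsoft/wavlm-base"]
    models = [AutoModel.from_pretrained(n).eval() for n in names]

    wav = torch.randn(1, 16000)  # 1 second of 16 kHz audio (dummy input)
    with torch.no_grad():
        feats = [m(wav).last_hidden_state for m in models]  # each (1, frames, 768)
    # all three base models share the same CNN front-end stride, so frame counts match
    ensemble = torch.cat(feats, dim=-1)                      # (1, frames, 3 * 768)
    print(ensemble.shape)
    ```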
  • Publication
    Gram Vaani ASR Challenge on spontaneous telephone speech recordings in regional variations of Hindi
    (01-01-2022)
    Bhanushali, Anish; Bridgman, Grant; Deekshitha, G.; Ghosh, Prasanta; Kumar, Pratik; Kumar, Saurabh; Kolladath, Adithya Raj; Ravi, Nithya; Seth, Aaditeshwar; Seth, Ashish; Singh, Abhayjeet; Sukhadia, Vrunda N.; ; Udupa, Sathvik; Durga Prasad, Lodagala V.S.V.
    This paper describes the corpus and baseline systems for the Gram Vaani Automatic Speech Recognition (ASR) challenge in regional variations of Hindi. The corpus for this challenge comprises spontaneous telephone speech recordings collected by a social technology enterprise, Gram Vaani. The regional variations of Hindi, together with the spontaneity of speech, natural background and transcriptions of variable accuracy due to crowdsourcing, make it a unique corpus for ASR on spontaneous telephonic speech. Around 1108 hours of real-world spontaneous speech recordings, including 1000 hours of unlabelled training data, 100 hours of labelled training data, 5 hours of development data and 3 hours of evaluation data, have been released as a part of the challenge. The efficacy of both training and test sets is validated on different ASR systems, in both a traditional time-delay neural network-hidden Markov model (TDNN-HMM) framework and a fully neural end-to-end (E2E) setup. The word error rate (WER) and character error rate (CER) on the eval set for a TDNN model trained on 100 hours of labelled data are 29.7% and 15.1%, respectively, while in the E2E setup the WER and CER on the eval set for a Conformer model trained on 100 hours of data are 32.9% and 19.0%, respectively.
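    For reference, a minimal sketch of how the reported WER and CER metrics are computed: Levenshtein edit distance over words (WER) or characters (CER), normalized by the reference length.
    ```python
    def edit_distance(ref, hyp):
        # classic dynamic-programming Levenshtein distance over two sequences
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution
        return d[len(ref)][len(hyp)]

    def wer(ref, hyp):
        return edit_distance(ref.split(), hyp.split()) / len(ref.split())

    def cer(ref, hyp):
        # spaces counted as characters here; conventions vary
        return edit_distance(list(ref), list(hyp)) / len(ref)

    print(wer("the cat sat", "the cat sit"), cer("the cat sat", "the cat sit"))
    ```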
  • Publication
    Span Classification with Structured Information for Disfluency Detection in Spoken Utterances
    (01-01-2022)
    Ghosh, Sreyan; Kumar, Sonal; Singla, Yaman Kumar; Shah, Rajiv Ratn;
    Existing approaches to disfluency detection focus on solving a token-level classification task for identifying and removing disfluencies in text. Moreover, most works leverage only the contextual information captured by the linear sequences in text, thus ignoring the structured information in the text that is efficiently captured by dependency trees. In this paper, building on the span classification paradigm of entity recognition, we propose a novel architecture for detecting disfluencies in transcripts of spoken utterances, incorporating both contextual information through transformers and long-distance structured information captured by dependency trees through graph convolutional networks (GCNs). Experimental results show that our proposed model achieves state-of-the-art results on the widely used English Switchboard dataset for disfluency detection and outperforms prior art by a significant margin. We make all our code publicly available on GitHub.
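    A simplified sketch (not the authors' architecture) of the combination described above: transformer token embeddings are refined by a graph convolution over the dependency tree, and candidate spans are then classified. Shapes, the span representation, and the toy adjacency matrix are illustrative assumptions.
    ```python
    import torch
    import torch.nn as nn

    class GCNLayer(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.linear = nn.Linear(dim, dim)

        def forward(self, x, adj):
            # x: (tokens, dim); adj: (tokens, tokens) dependency adjacency with self-loops
            deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
            return torch.relu(self.linear((adj / deg) @ x))  # row-normalized propagation

    class SpanClassifier(nn.Module):
        def __init__(self, dim=768, n_classes=2):
            super().__init__()
            self.gcn = GCNLayer(dim)
            self.scorer = nn.Linear(2 * dim, n_classes)

        def forward(self, token_emb, adj, spans):
            # token_emb: transformer output (tokens, dim); spans: list of (start, end)
            h = self.gcn(token_emb, adj)
            span_reps = torch.stack([torch.cat([h[s], h[e]]) for s, e in spans])
            return self.scorer(span_reps)  # (num_spans, n_classes)

    tokens, dim = 10, 768
    adj = torch.eye(tokens)  # stand-in for a real dependency-tree adjacency matrix
    model = SpanClassifier(dim)
    logits = model(torch.randn(tokens, dim), adj, [(1, 3), (4, 6)])
    print(logits.shape)  # torch.Size([2, 2])
    ```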
  • Publication
    S-Vectors and TESA: Speaker Embeddings and a Speaker Authenticator Based on Transformer Encoder
    (01-01-2022)
    Mary, Narla John Metilda Sagaya; ; Katta, Sandesh Varadaraju
    One of the most popular speaker embeddings is x-vectors, which are obtained from an architecture that gradually builds a larger temporal context with layers. In this paper, we propose to derive speaker embeddings from a Transformer's encoder trained for speaker classification. Self-attention, on which the Transformer's encoder is built, attends to all the features over the entire utterance and might be better suited to capturing the speaker characteristics in an utterance. We refer to the speaker embeddings obtained from the proposed speaker classification model as s-vectors to emphasize that they are obtained from an architecture that heavily relies on self-attention. Through experiments, we demonstrate that s-vectors perform better than x-vectors. In addition to the s-vectors, we also propose a new architecture based on the Transformer's encoder for speaker verification, as a replacement for speaker verification based on conventional probabilistic linear discriminant analysis (PLDA). This architecture is inspired by the next-sentence prediction task of bidirectional encoder representations from Transformers (BERT), and we feed the s-vectors of two utterances to verify whether they belong to the same speaker. We name this architecture the Transformer encoder speaker authenticator (TESA). Our experiments show that s-vectors with TESA perform better than s-vectors with conventional PLDA-based speaker verification.
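    A rough sketch of the two components described above, under assumed dimensions and a simple mean-pooling choice (not the exact s-vector/TESA recipe): a Transformer encoder over acoustic frames is pooled into an utterance-level embedding, and a small head scores a pair of such embeddings as same or different speaker.
    ```python
    import torch
    import torch.nn as nn

    class SVectorExtractor(nn.Module):
        def __init__(self, feat_dim=40, d_model=256, n_layers=4, n_heads=4):
            super().__init__()
            self.proj = nn.Linear(feat_dim, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)

        def forward(self, feats):                 # feats: (batch, time, feat_dim)
            h = self.encoder(self.proj(feats))    # (batch, time, d_model)
            return h.mean(dim=1)                  # utterance-level speaker embedding

    class PairVerifier(nn.Module):
        def __init__(self, d_model=256):
            super().__init__()
            self.head = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, 2))  # same / different speaker

        def forward(self, emb_a, emb_b):
            return self.head(torch.cat([emb_a, emb_b], dim=-1))

    extractor, verifier = SVectorExtractor(), PairVerifier()
    e1 = extractor(torch.randn(1, 200, 40))
    e2 = extractor(torch.randn(1, 150, 40))
    print(verifier(e1, e2).shape)  # torch.Size([1, 2])
    ```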
  • Publication
    CCC-wav2vec 2.0: Clustering Aided Cross-Contrastive Self-Supervised Learning of Speech Representations
    (01-01-2023)
    Lodagala, Vasista Sai; Ghosh, Sreyan;
    While Self-Supervised Learning has helped reap the benefits of scale from the available unlabeled data, the learning paradigms are continuously being improved. We present a new pre-training strategy named ccc-wav2vec 2.0, which uses clustering and an augmentation-based cross-contrastive loss as its self-supervised objective. Through the clustering module, we scale down the influence of those negative examples that are highly similar to the positive. The cross-contrastive loss is computed between the encoder output of the original sample and the quantizer output of its augmentation, and vice versa, bringing robustness to the pre-training strategy. ccc-wav2vec 2.0 achieves up to 15.6% and 12.7% relative WER improvement over the baseline wav2vec 2.0 on the test-clean and test-other sets of LibriSpeech, respectively, without the use of any language model. The proposed method also achieves up to 14.9% relative WER improvement over the baseline wav2vec 2.0 when fine-tuned on Switchboard data.
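    A heavily simplified sketch of a cross-contrastive objective in the spirit described above: an InfoNCE-style loss computed between representations of an utterance and of its augmentation, in both directions. The quantizer, masking and clustering modules of ccc-wav2vec 2.0 are omitted; this is not the released implementation.
    ```python
    import torch
    import torch.nn.functional as F

    def info_nce(anchor, targets, temperature=0.1):
        # anchor, targets: (frames, dim); frame i of targets is the positive for anchor i
        logits = F.cosine_similarity(anchor.unsqueeze(1), targets.unsqueeze(0), dim=-1)
        labels = torch.arange(anchor.size(0))
        return F.cross_entropy(logits / temperature, labels)

    def cross_contrastive_loss(enc_orig, q_aug, enc_aug, q_orig):
        # encoder output of the original vs. quantized targets of the augmentation,
        # and vice versa
        return 0.5 * (info_nce(enc_orig, q_aug) + info_nce(enc_aug, q_orig))

    frames, dim = 50, 256
    loss = cross_contrastive_loss(torch.randn(frames, dim), torch.randn(frames, dim),
                                  torch.randn(frames, dim), torch.randn(frames, dim))
    print(loss.item())
    ```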
  • Publication
    UnFuSeD: Unsupervised Finetuning Using Self-Supervised Distillation
    (01-01-2023)
    Seth, Ashish; Ghosh, Sreyan; ; Manocha, Dinesh
    In this paper, we introduce UnFuSeD, a novel approach to leverage self-supervised learning and reduce the need for large amounts of labeled data for audio classification. Unlike prior works, which directly fine-tune a self-supervised pre-trained encoder on a target dataset, we use the encoder to generate pseudo-labels for unsupervised fine-tuning before the actual fine-tuning step. We first train an encoder using a novel self-supervised learning (SSL) algorithm on an unlabeled audio dataset. Then, we use that encoder to generate pseudo-labels on our target-task dataset by clustering the extracted representations. These pseudo-labels are then used to guide self-distillation on a randomly initialized model, which we call unsupervised fine-tuning. Finally, the resultant encoder is fine-tuned on our target-task dataset. Through UnFuSeD, we propose the first system that moves away from the generic SSL paradigm in the literature of pre-training and fine-tuning the same encoder, and we present a novel self-distillation-based system to leverage SSL pre-training for low-resource audio classification. In practice, UnFuSeD achieves state-of-the-art results on the LAPE Benchmark, significantly outperforming all our baselines. Additionally, UnFuSeD allows us to achieve this with an approximately 40% reduction in the number of parameters over the previous state-of-the-art system. We make all our code publicly available at https://github.com/Sreyan88/LAPE.
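    A condensed sketch of the pseudo-labelling step described above (not the full UnFuSeD recipe): representations from a pre-trained encoder are clustered with k-means, and the cluster indices serve as pseudo-labels for training a randomly initialized student before supervised fine-tuning. The placeholder encoder, student and data are assumptions.
    ```python
    import torch
    import torch.nn as nn
    from sklearn.cluster import KMeans

    feat_dim, n_clusters, n_utts = 512, 16, 200
    pretrained_encoder = nn.Linear(16000, feat_dim)   # placeholder for a real SSL encoder
    student = nn.Sequential(nn.Linear(16000, 256), nn.ReLU(), nn.Linear(256, n_clusters))

    audio = torch.randn(n_utts, 16000)                # dummy target-task audio
    with torch.no_grad():
        reps = pretrained_encoder(audio).numpy()
    # cluster the extracted representations; cluster ids act as pseudo-labels
    pseudo_labels = torch.tensor(KMeans(n_clusters=n_clusters, n_init=10).fit_predict(reps),
                                 dtype=torch.long)

    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for _ in range(5):                                # "unsupervised fine-tuning" loop
        loss = nn.functional.cross_entropy(student(audio), pseudo_labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(loss.item())
    ```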
  • Publication
    Investigation of Robustness of HuBERT Features from Different Layers to Domain, Accent and Language Variations
    (01-01-2022)
    Kumar, Pratik; Sukhadia, Vrunda N.;
    In this paper, we investigate the use of a pre-trained HuBERT model to build downstream Automatic Speech Recognition (ASR) models using data that differ in domain, accent and even language. We use the standard ESPnet recipe with HuBERT as the pre-trained model, whose output is fed as input features to a downstream Conformer model built from target-domain data. We compare the performance of HuBERT pre-trained features with a baseline Conformer model built with Mel-filterbank features. We observe that as the domain, accent and bandwidth (as in the case of Switchboard data) vary, the relative improvements in performance over the baseline decrease significantly. Further, with more labelled data in the target domain, the relative improvement narrows, and both systems become comparable. We also investigate the effect on ASR performance when outputs from intermediate layers of HuBERT are used as features, and show that these are more suitable for data in a different language, since they capture more of the acoustic representation. Finally, we compare the output of the Convolutional Neural Network (CNN) feature encoder used in pre-trained models with Mel-filterbank features and show that Mel-filterbanks are often better features for modelling data from different domains.
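    An illustrative sketch of extracting features from an intermediate HuBERT layer with the Hugging Face transformers API; the checkpoint name and layer index are assumptions, not the paper's configuration.
    ```python
    import torch
    from transformers import HubertModel

    model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()
    wav = torch.randn(1, 16000)                        # dummy 1 s of 16 kHz audio
    with torch.no_grad():
        out = model(wav, output_hidden_states=True)
    # hidden_states[0] is the projected CNN front-end output; hidden_states[i] is layer i
    layer_9 = out.hidden_states[9]                     # (1, frames, 768)
    print(len(out.hidden_states), layer_9.shape)
    ```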
  • Publication
    Decorrelating Feature Spaces for Learning General-Purpose Audio Representations
    (01-10-2022)
    Ghosh, Sreyan; Seth, Ashish;
    Inspired by the recent progress in self-supervised learning for computer vision, in this paper, through the DeLoRes (Decorrelating latent spaces for Low Resource audio representation learning) framework, we introduce two new general-purpose audio representation learning approaches, DeLoRes-S and DeLoRes-M. Our main objective is to make our network learn representations in a resource-constrained setting (both data and compute) that generalize well across a diverse set of downstream tasks. Inspired by the Barlow Twins objective function, we propose learning embeddings that are invariant to distortions of an input audio sample while ensuring that they contain non-redundant information about the sample. We call this the DeLoRes learning framework, which we employ in different fashions with DeLoRes-S and DeLoRes-M. In our experiments, we learn audio representations with less than half the number of model parameters and 10% of the audio samples used by state-of-the-art algorithms, and achieve state-of-the-art results on 7 out of 11 tasks in linear evaluation and 4 out of 11 tasks in the fine-tuning setup. In addition to being simple and intuitive, our pre-training procedure is compute-efficient by construction. Furthermore, we conduct extensive ablation studies on our training algorithm, model architecture, and results, and make all our code and pre-trained models publicly available.
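    A minimal sketch of the Barlow Twins-style objective the DeLoRes framework builds on: the cross-correlation matrix between embeddings of two distorted views is pushed towards the identity, giving an invariance term on the diagonal and a redundancy-reduction term off it. This is a generic formulation, not the exact DeLoRes implementation.
    ```python
    import torch

    def barlow_twins_loss(z1, z2, lam=5e-3):
        # z1, z2: (batch, dim) embeddings of two augmentations of the same audio
        z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)   # normalize each dimension
        z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
        c = (z1.T @ z2) / z1.size(0)                  # cross-correlation matrix
        on_diag = (torch.diagonal(c) - 1).pow(2).sum()              # invariance term
        off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum() # redundancy reduction
        return on_diag + lam * off_diag

    loss = barlow_twins_loss(torch.randn(32, 128), torch.randn(32, 128))
    print(loss.item())
    ```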
  • Publication
    Towards Speech to Speech Machine Translation focusing on Indian Languages
    (01-01-2023)
    Mujadia, Vandan; ; ; Sangal, Rajeev; Sharma, Dipti Misra
    We introduce an SSMT (Speech to Speech Machine Translation, aka Speech to Speech Video Translation) pipeline, as a web application for translating videos from one language to another by cascading multiple language modules. Our speech translation system combines highly accurate speech-to-text (ASR) for Indian English, pre-processing modules to bridge ASR-MT gaps such as spoken disfluency and punctuation, robust machine translation (MT) systems for multiple language pairs, an SRT module for the translated text, a text-to-speech (TTS) module, and a module to render the translated synthesized audio onto the original video. It is a user-friendly, flexible, and easily accessible system. We aim to provide a complete configurable speech translation experience to users and researchers with this system. It also supports human intervention: users can edit the outputs of different modules, and the edited output can then be used for subsequent processing to improve overall output quality. By adopting a human-in-the-loop approach, the aim is to configure the technology in such a way that it can assist humans and help reduce the human effort involved in speech translation between English and Indian languages. To the best of our knowledge, this is the first fully integrated system for English to Indian languages (Hindi, Telugu, Gujarati, Marathi, and Punjabi) video translation. Our evaluation shows that one can get a MOS score of 3.5+ using the developed pipeline with human intervention for English to Hindi. A short video demonstrating our system is available at https://youtu.be/MVftzoeRg48.
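    A schematic sketch of the cascade structure of such a pipeline (ASR, disfluency/punctuation clean-up, MT, TTS, rendering), with every stage as a placeholder function; none of these implementations are from the paper.
    ```python
    def asr(audio: bytes) -> str:
        return "um so this is the uh spoken sentence"   # placeholder transcript

    def clean_and_punctuate(text: str) -> str:
        # stand-in for disfluency removal + punctuation restoration
        return "So this is the spoken sentence."

    def translate(text: str, tgt_lang: str) -> str:
        return f"[{tgt_lang} translation of] {text}"     # placeholder MT output

    def tts(text: str, tgt_lang: str) -> bytes:
        return text.encode()                             # placeholder synthesized audio

    def ssmt_pipeline(audio: bytes, tgt_lang: str = "hi") -> bytes:
        # cascade the modules; each stage's output feeds the next
        transcript = asr(audio)
        cleaned = clean_and_punctuate(transcript)
        translated = translate(cleaned, tgt_lang)
        return tts(translated, tgt_lang)

    print(ssmt_pipeline(b"raw-audio", "hi"))
    ```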