12 Nov 2021
IIT Madras, AI4Bharat, Microsoft, RBCDSAI
In this paper, the authors study whether wav2vec-style pre-training transfers to Indic languages (yes, it does). To answer this question, they curate 17,314 hours of raw audio for pre-training across 40 languages from 4 language families, and run ablation studies on the pre-training corpus, the fine-tuning data, and task-specific language information.
Step 1: Curated 17,314 hours of raw audio data for pre-training across 40 languages.
- The authors crawl raw audio from YouTube for 22 languages, and from News On Air (government-run radio) for 40 languages. For more details, see Table 1 in the paper.
- Apply a VAD (voice activity detection) algorithm to remove long silences.
- Remove audio files with an SNR below 15 dB.
- Lastly, chunk the audio files to a maximum duration of 15 seconds (a rough sketch of this filtering and chunking is given after this list).
- The uncompressed data size is 1.5 terabytes.
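A minimal sketch of the Step 1 filtering and chunking, assuming 16 kHz WAV input; `estimate_snr_db` is a crude placeholder of my own, not the paper's VAD/SNR pipeline:

```python
import numpy as np
import soundfile as sf

MAX_CHUNK_SEC = 15   # paper chunks audio to at most 15 s
MIN_SNR_DB = 15      # paper discards files with SNR < 15 dB

def estimate_snr_db(audio: np.ndarray, frame_len: int = 2048) -> float:
    """Crude SNR estimate: loud-frame energy vs. quiet-frame energy.
    A placeholder, not the estimator used in the paper."""
    frames = audio[: len(audio) // frame_len * frame_len].reshape(-1, frame_len)
    energy = (frames ** 2).mean(axis=1) + 1e-12
    noise = np.percentile(energy, 10)    # quietest frames ~ noise floor
    signal = np.percentile(energy, 90)   # loudest frames ~ speech
    return 10 * np.log10(signal / noise)

def filter_and_chunk(in_path: str, out_prefix: str) -> int:
    """Drop low-SNR files and split the rest into <= 15 s chunks."""
    audio, sr = sf.read(in_path)
    if audio.ndim > 1:                   # down-mix to mono
        audio = audio.mean(axis=1)
    if estimate_snr_db(audio) < MIN_SNR_DB:
        return 0                         # rejected: too noisy
    chunk_len = MAX_CHUNK_SEC * sr
    n_chunks = 0
    for start in range(0, len(audio), chunk_len):
        sf.write(f"{out_prefix}_{n_chunks:04d}.wav",
                 audio[start:start + chunk_len], sr)
        n_chunks += 1
    return n_chunks
```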
Step 2: IndicWav2Vec: A multilingual ASR model for Indian Languages
- Pre-training
- Uses the wav2vec 2.0 architecture.
- Training is done using the masked contrastive loss plus a diversity loss that encourages better utilisation of the codebook.
- Employs temperature-based data sampling across languages with a variable alpha in (0, 1): lower alpha pushes languages towards equal participation, while higher alpha follows the natural, size-proportional distribution (see the sampling sketch after this list).
- Fine-tuning: Done on task-specific data by learning a projection layer over the output of the context network, trained with the CTC loss and SpecAugment (a minimal CTC head sketch follows this list).
- Decoding: Combines the acoustic model, the language model, and the lexicon to score beams and pick the best one.
- Re-scoring: A Transformer-based LM is used alongside the N-gram LM to re-score the candidate hypotheses (both scoring steps are sketched below).
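The language balancing in the pre-training bullet corresponds to the usual temperature-based sampling, p_l ∝ n_l^alpha. A small sketch, with illustrative (made-up) corpus sizes, not values from the paper:

```python
import numpy as np

def language_sampling_probs(hours_per_lang: dict, alpha: float) -> dict:
    """Temperature-based sampling: p_l proportional to n_l ** alpha.
    alpha -> 0 gives near-uniform sampling (low-resource languages up-sampled);
    alpha -> 1 samples proportionally to the amount of data per language."""
    langs = list(hours_per_lang)
    weights = np.array([hours_per_lang[l] for l in langs], dtype=float) ** alpha
    probs = weights / weights.sum()
    return dict(zip(langs, probs))

corpus = {"lang_a": 2000, "lang_b": 900, "lang_c": 30}   # hypothetical hours
print(language_sampling_probs(corpus, alpha=0.3))        # flatter distribution
print(language_sampling_probs(corpus, alpha=1.0))        # follows data sizes
```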
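For the fine-tuning bullet, a minimal PyTorch sketch of a projection head over the context-network features trained with the CTC loss; `context_dim`, the batch shapes, and the omission of SpecAugment are my simplifications:

```python
import torch
import torch.nn as nn

class CTCHead(nn.Module):
    """Linear projection from context-network features to output-token logits."""
    def __init__(self, context_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(context_dim, vocab_size)  # vocab includes the CTC blank

    def forward(self, context_features: torch.Tensor) -> torch.Tensor:
        # context_features: (batch, time, context_dim) from the pre-trained encoder
        return self.proj(context_features)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def ctc_step(head, context_features, targets, input_lengths, target_lengths):
    logits = head(context_features)                          # (B, T, V)
    log_probs = logits.log_softmax(dim=-1).transpose(0, 1)   # CTCLoss expects (T, B, V)
    return ctc_loss(log_probs, targets, input_lengths, target_lengths)
```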
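And a toy view of the decoding and re-scoring bullets: each beam is scored by combining the acoustic-model and N-gram LM scores (plus a word insertion bonus), and the n-best list is then re-ranked with a Transformer LM. The weights and function names are assumptions, not values from the paper:

```python
def beam_score(am_logprob: float, ngram_lm_logprob: float, n_words: int,
               lm_weight: float = 1.0, word_score: float = 0.0) -> float:
    """First pass: acoustic model + weighted n-gram LM + word insertion bonus."""
    return am_logprob + lm_weight * ngram_lm_logprob + word_score * n_words

def rescore(hypotheses, transformer_lm_logprob, rescore_weight: float = 0.5):
    """Second pass: add a Transformer-LM score to each n-best hypothesis and re-rank.
    `hypotheses` is a list of (text, first_pass_score); `transformer_lm_logprob`
    is a callable returning the Transformer LM log-probability of a text."""
    rescored = [(text, score + rescore_weight * transformer_lm_logprob(text))
                for text, score in hypotheses]
    return max(rescored, key=lambda x: x[1])
```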
Step 3: Results and discussion.
- Ablation studies on pre-training.
- Pre-training is helpful.
- Diversity in the pre-training languages is better than just collecting large quantities of data in a few languages.
- Larger models are better, given large amounts of data.
- Initialising from a converged model is a better choice than training from scratch.