11 Sep 2019
Google
In this paper, the authors experiment with RNN-T-based ASR models in a multilingual setting.
- Multilingual RNN-T (A0)
- The authors use it as a baseline (a minimal sketch of the RNN-T joint network follows).
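As a rough illustration of the RNN-T architecture the paper builds on, here is a minimal sketch of the joint network that combines encoder frames with prediction-network states; the dimensions, the `tanh` combination, and all names are assumptions for illustration, not the paper's code:

```python
import torch
import torch.nn as nn

class RNNTJoint(nn.Module):
    """Joint network of an RNN-Transducer: combines encoder frames and
    prediction-network (label) states into logits over vocab + blank."""

    def __init__(self, enc_dim, pred_dim, joint_dim, vocab_size):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size + 1)  # +1 for the blank label

    def forward(self, enc, pred):
        # enc: (batch, T, enc_dim), pred: (batch, U, pred_dim)
        # Broadcast-add over the (T, U) alignment lattice.
        joint = torch.tanh(
            self.enc_proj(enc).unsqueeze(2) + self.pred_proj(pred).unsqueeze(1)
        )
        return self.out(joint)  # (batch, T, U, vocab_size + 1)
```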
- With a language ID (A1)
- The authors condition A0 on a one-hot language ID vector (a sketch of this conditioning follows).
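A minimal sketch of one common way to inject the language ID, assuming PyTorch and assuming the one-hot vector is appended to every acoustic frame; the exact injection point and the function name are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def condition_on_language(features, lang_id, num_langs):
    """Append a one-hot language vector to every acoustic frame.

    features: (batch, time, feat_dim) acoustic frames.
    returns:  (batch, time, feat_dim + num_langs)
    """
    batch, time, _ = features.shape
    one_hot = F.one_hot(torch.tensor(lang_id), num_classes=num_langs).float()
    # Broadcast the same language vector across the batch and time axes.
    one_hot = one_hot.expand(batch, time, num_langs)
    return torch.cat([features, one_hot], dim=-1)
```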
- Imbalanced dataset (A2)
- Using data sampling, the authors try to reduce the data/prior bias in the overall dataset (a sketch of one such scheme follows this block).
- A0+sampling.
- A1+sampling.
- A1+sampling+60k: same as A1+sampling, but the sampling is stopped after 60k steps.
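One common way to realize such sampling is temperature-style flattening of the per-language prior; this scheme is an assumption for illustration, and the paper's exact schedule may differ:

```python
import numpy as np

def sampling_probs(utterance_counts, alpha):
    """Interpolate between the natural data distribution (alpha=1)
    and a uniform distribution over languages (alpha=0)."""
    natural = np.asarray(utterance_counts, dtype=np.float64)
    natural /= natural.sum()
    smoothed = natural ** alpha  # flatten the skewed prior
    return smoothed / smoothed.sum()

# Example: a heavily skewed 3-language dataset.
counts = [1_000_000, 100_000, 10_000]
print(sampling_probs(counts, alpha=1.0))  # natural prior: ~[0.90, 0.09, 0.009]
print(sampling_probs(counts, alpha=0.5))  # partially flattened toward uniform
```

Under this reading, stopping the sampling after 60k steps (the A1+sampling+60k variant) roughly corresponds to switching back to the natural distribution, i.e. alpha=1.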
- Using language adapters.
- A1+adapters: the authors add adapter modules trained separately for each language on top of the shared model (see the sketch below).
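A minimal sketch of per-language residual adapters; the bottleneck size, layer norm, and insertion point are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class LanguageAdapters(nn.Module):
    """One small residual adapter per language, inserted after an
    encoder layer of the shared multilingual model."""

    def __init__(self, dim, num_langs, bottleneck=64):
        super().__init__()
        self.adapters = nn.ModuleList(
            nn.Sequential(
                nn.LayerNorm(dim),
                nn.Linear(dim, bottleneck),
                nn.ReLU(),
                nn.Linear(bottleneck, dim),
            )
            for _ in range(num_langs)
        )

    def forward(self, x, lang_id):
        # Residual connection: the shared representation passes through
        # unchanged, plus a small per-language correction.
        return x + self.adapters[lang_id](x)
```

Because of the residual connection, each language only learns a small correction on top of the shared representation, which is consistent with the observation below that adapters mainly help the lower-resource languages.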
- Baselines:
- Monolingual
- CTC
- RNN-T
- The authors observe that conditioning on a language ID is an essential step in a multilingual setting, yielding a performance boost on languages with a smaller footprint in the training dataset.
- The authors also observe that sampling degrades performance in both cases, with and without the language ID (A0+sampling and A1+sampling).
- The authors observe that adapter modules do not provide much of a boost overall but are important for languages with smaller training datasets.
- Finally, a multilingual system is a better choice than monolingual systems, especially in the case of low-resource languages.
For a detailed comparison of the above models, see Tables 2 and 3 in the paper.