2018
Google and Toyota
The authors of this paper propose to train a joint E2E multilingual model using the LAS architecture. LAS is a seq2seq architecture, i.e., an encoder and a decoder trained jointly, so there is no need for separate pronunciation and language models.
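To make the LAS setup concrete for myself, here is a minimal PyTorch-style sketch of the listener (encoder) and an attention-based speller (decoder). The class names, layer sizes, and vocabulary size are my own illustrative choices, not the paper's actual configuration.

```python
import torch
import torch.nn as nn


class Listener(nn.Module):
    """Encoder ("listener"): maps acoustic frames to a higher-level representation."""

    def __init__(self, input_dim=80, hidden_dim=256):
        super().__init__()
        self.rnn = nn.LSTM(input_dim, hidden_dim, num_layers=3,
                           batch_first=True, bidirectional=True)

    def forward(self, features):            # features: (batch, time, input_dim)
        encoded, _ = self.rnn(features)     # (batch, time, 2 * hidden_dim)
        return encoded


class Speller(nn.Module):
    """Decoder ("speller"): attends over the encoder output and emits graphemes,
    so it implicitly plays the role of pronunciation and language models."""

    def __init__(self, vocab_size, enc_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.enc_proj = nn.Linear(enc_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=1, batch_first=True)
        self.rnn = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, encoded, targets):    # targets: (batch, out_len) grapheme ids
        keys = self.enc_proj(encoded)
        queries = self.embed(targets)
        context, _ = self.attn(queries, keys, keys)
        decoded, _ = self.rnn(torch.cat([queries, context], dim=-1))
        return self.out(decoded)            # (batch, out_len, vocab_size) logits


# Teacher-forced forward pass on dummy data (4 utterances, 80-dim filterbank frames)
listener, speller = Listener(), Speller(vocab_size=100)
feats = torch.randn(4, 200, 80)
prev_graphemes = torch.randint(0, 100, (4, 30))
logits = speller(listener(feats), prev_graphemes)
```

Because the speller predicts graphemes directly from audio, the same network covers all languages as long as their scripts are in the output vocabulary, which is what makes the joint multilingual training possible.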
In this paper the authors train 6 variants of the LAS model for the ASR task:
The authors try to transcribe code-switched (Hindi-Tamil) utterances using model 2, i.e., the joint model. Model 2 sticks faithfully to one language, i.e., it transcribes only Tamil or only Hindi. Furthermore, for some utterances it transliterates the Hindi part. The authors explain this by saying that the language model is dominant (I do not understand this claim — how did they conclude that the LM is dominant?). Lastly, I am not clear on the dataset-creation part.
Lastly, the authors study what happens if the model is conditioned on a wrong language ID, i.e., model 4 is used but with an incorrect language ID. They label Urdu utterances as Hindi, and to their surprise the model transliterates the Urdu utterances into Hindi (Devanagari) script. This shows that the model is very faithful to / overfit on the language ID provided.
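My rough understanding of how this language conditioning could work, as a PyTorch sketch: embed the language code and concatenate it to every acoustic frame before the encoder. The class name, dimensions, number of languages, and the index chosen for "Hindi" are all illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn


class LanguageConditionedListener(nn.Module):
    """Encoder conditioned on a language ID by concatenating a learned
    language embedding to every acoustic frame (sizes are illustrative)."""

    def __init__(self, num_languages, input_dim=80, lang_dim=8, hidden_dim=256):
        super().__init__()
        self.lang_embed = nn.Embedding(num_languages, lang_dim)
        self.rnn = nn.LSTM(input_dim + lang_dim, hidden_dim, num_layers=3,
                           batch_first=True, bidirectional=True)

    def forward(self, features, lang_id):
        # features: (batch, time, input_dim); lang_id: (batch,) integer language codes
        lang = self.lang_embed(lang_id)                           # (batch, lang_dim)
        lang = lang.unsqueeze(1).expand(-1, features.size(1), -1)
        encoded, _ = self.rnn(torch.cat([features, lang], dim=-1))
        return encoded


# Probing with the *wrong* language ID, as in the Urdu-labelled-as-Hindi experiment:
encoder = LanguageConditionedListener(num_languages=9)
urdu_feats = torch.randn(1, 200, 80)        # dummy Urdu utterance
hindi_id = torch.tensor([3])                # pretend index 3 means "Hindi"
encoded = encoder(urdu_feats, hindi_id)     # decoder would then emit Devanagari graphemes
```

Under this view, the surprising result makes sense: the language embedding steers the output script so strongly that the model writes what it hears in the script of the supplied language ID rather than the language actually spoken.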