2018
Google and Toyota
The authors of this paper propose to train a joint E2E multilingual model using the LAS architecture. LAS is a seq2seq architecture, i.e., an encoder and a decoder trained jointly, so there is no need for separate pronunciation and language models.
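To make the LAS setup concrete for myself, here is a minimal PyTorch-style sketch of the listener (encoder) and an attention-based speller (decoder). The class names, layer sizes, and vocabulary size are my own illustrative choices, not the paper's actual configuration.

```python
import torch
import torch.nn as nn


class Listener(nn.Module):
    """Encoder ("listener"): maps acoustic frames to a higher-level representation."""

    def __init__(self, input_dim=80, hidden_dim=256):
        super().__init__()
        self.rnn = nn.LSTM(input_dim, hidden_dim, num_layers=3,
                           batch_first=True, bidirectional=True)

    def forward(self, features):            # features: (batch, time, input_dim)
        encoded, _ = self.rnn(features)     # (batch, time, 2 * hidden_dim)
        return encoded


class Speller(nn.Module):
    """Decoder ("speller"): attends over the encoder output and emits graphemes,
    so it implicitly plays the role of pronunciation and language models."""

    def __init__(self, vocab_size, enc_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.enc_proj = nn.Linear(enc_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=1, batch_first=True)
        self.rnn = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, encoded, targets):    # targets: (batch, out_len) grapheme ids
        keys = self.enc_proj(encoded)
        queries = self.embed(targets)
        context, _ = self.attn(queries, keys, keys)
        decoded, _ = self.rnn(torch.cat([queries, context], dim=-1))
        return self.out(decoded)            # (batch, out_len, vocab_size) logits


# Teacher-forced forward pass on dummy data (4 utterances, 80-dim filterbank frames)
listener, speller = Listener(), Speller(vocab_size=100)
feats = torch.randn(4, 200, 80)
prev_graphemes = torch.randint(0, 100, (4, 30))
logits = speller(listener(feats), prev_graphemes)
```

Because the speller predicts graphemes directly from audio, the same network covers all languages as long as their scripts are in the output vocabulary, which is what makes the joint multilingual training possible.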
In this paper the authors train 6 variants of the LAS model for the ASR task:
The authors try to transcribe code-switched (Hindi-Tamil) utterances using model 2, i.e., the joint model. Model 2 sticks faithfully to one language, i.e., it transcribes only Tamil or only Hindi. Furthermore, for some utterances it transliterates the Hindi part. The authors explain this by saying that the language model is dominant (I do not understand this claim — how did they conclude that the LM is dominant?). Lastly, I am not clear on the dataset-creation part.
Lastly, the authors study what happens if the model is conditioned on a wrong language ID, i.e., model 4 is used but with an incorrect language ID. They label Urdu utterances as Hindi, and to their surprise the model transliterates the Urdu utterances into Hindi (Devanagari) script. This shows that the model is very faithful to / overfit on the language ID provided.
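My rough understanding of how this language conditioning could work, as a PyTorch sketch: embed the language code and concatenate it to every acoustic frame before the encoder. The class name, dimensions, number of languages, and the index chosen for "Hindi" are all illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn


class LanguageConditionedListener(nn.Module):
    """Encoder conditioned on a language ID by concatenating a learned
    language embedding to every acoustic frame (sizes are illustrative)."""

    def __init__(self, num_languages, input_dim=80, lang_dim=8, hidden_dim=256):
        super().__init__()
        self.lang_embed = nn.Embedding(num_languages, lang_dim)
        self.rnn = nn.LSTM(input_dim + lang_dim, hidden_dim, num_layers=3,
                           batch_first=True, bidirectional=True)

    def forward(self, features, lang_id):
        # features: (batch, time, input_dim); lang_id: (batch,) integer language codes
        lang = self.lang_embed(lang_id)                           # (batch, lang_dim)
        lang = lang.unsqueeze(1).expand(-1, features.size(1), -1)
        encoded, _ = self.rnn(torch.cat([features, lang], dim=-1))
        return encoded


# Probing with the *wrong* language ID, as in the Urdu-labelled-as-Hindi experiment:
encoder = LanguageConditionedListener(num_languages=9)
urdu_feats = torch.randn(1, 200, 80)        # dummy Urdu utterance
hindi_id = torch.tensor([3])                # pretend index 3 means "Hindi"
encoded = encoder(urdu_feats, hindi_id)     # decoder would then emit Devanagari graphemes
```

Under this view, the surprising result makes sense: the language embedding steers the output script so strongly that the model writes what it hears in the script of the supplied language ID rather than the language actually spoken.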