The standard structure of current speech recognition systems consists of three main stages. First, the sound waveform is passed through feature extraction to generate relatively compact feature vectors at a frame rate of around 100 Hz. Second, these feature vectors are fed to an acoustic model that has been trained to associate particular vectors with particular speech units; commonly, this is realized as a set of Gaussian mixture models (GMMs) of the distributions of feature vectors corresponding to context-dependent phones. Finally, the output of these models provides the relative likelihoods for the different speech sounds needed by a hidden Markov model (HMM) decoder, which searches for the most likely allowable word sequence.
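As a rough illustration of this three-stage pipeline, the sketch below extracts MFCC features at approximately a 100 Hz frame rate and scores every frame against per-state GMMs, leaving the HMM search to the decoder. The file name, the use of librosa for feature extraction, and all model parameters are placeholders for illustration only, not the system described here.

```python
# Minimal sketch of the three-stage pipeline (hypothetical data and parameters).
import numpy as np
import librosa  # assumed available for feature extraction

# Stage 1: feature extraction -- 13 MFCCs at a ~100 Hz frame rate
# (hop of 160 samples at a 16 kHz sampling rate).
y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file
feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=160).T  # (frames, 13)

# Stage 2: acoustic model -- one diagonal-covariance GMM per (context-dependent)
# phone state; the parameters here are random placeholders, not trained models.
n_states, n_mix, dim = 3, 4, feats.shape[1]
rng = np.random.default_rng(0)
means = rng.normal(size=(n_states, n_mix, dim))
variances = np.ones((n_states, n_mix, dim))
weights = np.full((n_states, n_mix), 1.0 / n_mix)

def gmm_loglik(x, means, variances, weights):
    """Log-likelihood of one frame under each state's diagonal-covariance GMM."""
    diff = x - means  # broadcast to (states, mix, dim)
    comp = -0.5 * np.sum(diff**2 / variances + np.log(2 * np.pi * variances), axis=-1)
    return np.log(np.sum(weights * np.exp(comp), axis=-1))  # (states,)

frame_logliks = np.array([gmm_loglik(f, means, variances, weights) for f in feats])

# Stage 3: the HMM decoder would combine these per-frame likelihoods with
# transition and language-model scores (e.g. via Viterbi search) to find the
# most likely allowable word sequence; that search is omitted here.
print(frame_logliks.shape)  # (n_frames, n_states)
```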
The acoustic model is trained using a corpus of examples that have been manually or automatically labeled. For Gaussian-mixture distribution models, this can be done according to a maximum-likelihood criterion via the EM algorithm. However, this is not optimal: typically, we would rather have a discriminative criterion that optimizes the ability to distinguish different classes, rather than just the match within each class.
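To make the training criterion concrete, the sketch below fits one GMM per class by maximum likelihood using EM (here via scikit-learn's GaussianMixture); the data, labels, and mixture sizes are invented for illustration. The EM objective only maximizes the likelihood of the frames within each class, which is exactly the limitation a discriminative criterion is meant to address.

```python
# Hedged sketch of maximum-likelihood GMM training via EM (placeholder data;
# in practice the labels come from a manually or automatically labeled corpus).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
dim, n_classes = 13, 3
feats = rng.normal(size=(3000, dim))            # per-frame feature vectors
labels = rng.integers(0, n_classes, size=3000)  # per-frame phone-class labels

# Maximum-likelihood training: fit each class's GMM on its own frames only.
# EM maximizes the fit *within* each class; nothing in this objective pushes
# the different classes' models apart from one another.
gmms = []
for c in range(n_classes):
    gmm = GaussianMixture(n_components=4, covariance_type="diag", random_state=0)
    gmm.fit(feats[labels == c])
    gmms.append(gmm)

# At recognition time every class's GMM scores every frame; a discriminative
# criterion would instead train the models to favor the correct class's score
# over its competitors'.
logliks = np.stack([g.score_samples(feats) for g in gmms], axis=1)  # (frames, classes)
print((logliks.argmax(axis=1) == labels).mean())  # near chance here: data is random
```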
Likelihoods versus posteriors