CS 224S / LINGUIST 281 Speech Recognition, Synthesis, and Dialogue
If Time: Evaluation
Word Error Rate
Reminder: VQ
To compute p(ot|qj):
Compute the distance between feature vector ot and each codeword (prototype vector) in a preclustered codebook, where the distance is either Euclidean or Mahalanobis.
Choose the codeword vk that is closest to ot, and then look up the likelihood of vk given HMM state j in the B matrix.
Computing bj(vk)
[Figure: training vectors plotted by feature value 1 and feature value 2 for state j]

$$b_j(v_k) = \frac{\text{number of vectors with codebook index } k \text{ in state } j}{\text{number of vectors in state } j} = \frac{14}{56} = \frac{1}{4}$$
Summary: VQ
Training:
Do VQ and then use Baum-Welch to assign probabilities to each symbol
Decoding:
Do VQ and then use the symbol probabilities in decoding
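A minimal decode-time sketch of this lookup (not from the slides), assuming a NumPy codebook array and a B matrix indexed as [state, codeword]; the function name vq_likelihood is made up for illustration:

```python
import numpy as np

def vq_likelihood(o_t, codebook, B, state_j):
    """Illustrative VQ observation likelihood: quantize o_t to its
    nearest codeword, then look up that symbol's probability for
    HMM state j in the B matrix (rows = states, cols = codewords)."""
    # Euclidean distance from o_t to every codeword in the codebook
    dists = np.linalg.norm(codebook - o_t, axis=1)
    k = int(np.argmin(dists))          # index of the closest codeword v_k
    return B[state_j, k]               # b_j(v_k)

# Toy example: 4 codewords in a 2-D feature space, 3 HMM states
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
B = np.array([[0.25, 0.25, 0.25, 0.25],
              [0.70, 0.10, 0.10, 0.10],
              [0.10, 0.10, 0.40, 0.40]])
print(vq_likelihood(np.array([0.9, 0.8]), codebook, B, state_j=1))
```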
Multivariate Gaussians
Baum-Welch for multivariate Gaussians
Better than VQ
VQ is insufficient for real ASR. Instead, assume the possible values of the observation feature vector ot are normally distributed, and represent the observation likelihood function bj(ot) as a Gaussian with mean μj and variance σj²:
$$f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
Gaussian PDFs
A Gaussian is a probability density function; probability is the area under the curve. To make it a probability, we constrain the area under the curve to be 1. BUT
We will be using point estimates: the value of the Gaussian at a point.
Technically these are not probabilities, since a pdf gives a probability over an interval and needs to be multiplied by dx. As we will see later, this is OK, since the same factor is omitted from all Gaussians, so the argmax is still correct.
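For concreteness, a small sketch (not from the slides) of using the Gaussian density as a point estimate; it also checks that scaling every likelihood by the same dx leaves the argmax unchanged:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Value of the univariate Gaussian density at point x
    (a point estimate, not a true probability)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# The density value can exceed 1 for small sigma, which is fine for a pdf:
print(gaussian_pdf(0.0, mu=0.0, sigma=0.1))   # ~3.99

# Scaling every likelihood by the same dx does not change the argmax:
likelihoods = [gaussian_pdf(1.2, mu=m, sigma=1.0) for m in (0.0, 1.0, 2.0)]
dx = 1e-3
assert max(range(3), key=lambda i: likelihoods[i]) == \
       max(range(3), key=lambda i: likelihoods[i] * dx)
```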
P(o|q)

If we knew which observations were produced by which state, we could estimate each state's Gaussian directly:

$$\hat{\mu}_i = \frac{1}{T}\sum_{t=1}^{T} o_t \quad \text{s.t. } o_t \text{ is state } i$$

$$\hat{\sigma}_i^2 = \frac{1}{T}\sum_{t=1}^{T} (o_t - \mu_i)^2 \quad \text{s.t. } q_t \text{ is state } i$$

Since the states are hidden, Baum-Welch instead weights each observation by the state-occupancy probability ξt(i):

$$\hat{\mu}_i = \frac{\sum_{t=1}^{T} \xi_t(i)\, o_t}{\sum_{t=1}^{T} \xi_t(i)}
\qquad
\hat{\sigma}_i^2 = \frac{\sum_{t=1}^{T} \xi_t(i)\, (o_t - \mu_i)^2}{\sum_{t=1}^{T} \xi_t(i)}$$
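A minimal sketch of these occupancy-weighted updates for a single state, assuming the ξt(i) values have already been computed by forward-backward; the helper name reestimate_gaussian is hypothetical:

```python
import numpy as np

def reestimate_gaussian(obs, xi):
    """Occupancy-weighted re-estimation of a single state's Gaussian.
    obs: (T,) observations; xi: (T,) state-occupancy probabilities xi_t(i).
    Returns the updated mean and variance for that state."""
    mu = np.sum(xi * obs) / np.sum(xi)
    var = np.sum(xi * (obs - mu) ** 2) / np.sum(xi)
    return mu, var

# Toy example: 5 frames with soft occupancy weights for one state
obs = np.array([1.0, 1.2, 0.8, 3.0, 3.2])
xi = np.array([0.9, 0.8, 0.9, 0.1, 0.2])
print(reestimate_gaussian(obs, xi))   # mean/variance dominated by the first 3 frames
```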
Multivariate Gaussians
Instead of a single mean μ and variance σ²:

$$f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

we now use a vector of means μ and a covariance matrix Σ.
Multivariate Gaussians
Defining μ and Σ

$$\mu = E(x)$$

$$\Sigma = E\left[(x-\mu)(x-\mu)^T\right]$$

$$\sigma_{ij}^2 = E\left[(x_i - \mu_i)(x_j - \mu_j)\right]$$
[Figures: 2-D Gaussians with μ = [0 0] and Σ = I, Σ = 0.6I, Σ = 2I. As Σ becomes larger, the Gaussian becomes more spread out; as Σ becomes smaller, the Gaussian becomes more compressed.]
[Figures: diagonal covariances such as Σ = [[1, 0], [0, 1]] and Σ = [[0.6, 0], [0, 2]]. As we increase the off-diagonal entries, there is more correlation between the value of x and the value of y.]
[Figures: decreasing the off-diagonal entries (#1-2); increasing the variance of one diagonal dimension (#3).]
Text and figures from Andrew Ng's lecture notes for CS 229.
In two dimensions
Diagonal covariance
The diagonal of the covariance matrix contains the variance of each dimension, σii², so this means we consider the variance of each acoustic feature (dimension) separately:

$$b_j(o_t) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{jd}^2}} \exp\!\left(-\frac{1}{2}\left(\frac{o_{td} - \mu_{jd}}{\sigma_{jd}}\right)^2\right)$$

$$b_j(o_t) = \frac{1}{(2\pi)^{D/2} \prod_{d=1}^{D} \sigma_{jd}} \exp\!\left(-\frac{1}{2}\sum_{d=1}^{D} \frac{(o_{td} - \mu_{jd})^2}{\sigma_{jd}^2}\right)$$
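A small sketch of evaluating this diagonal-covariance likelihood with NumPy (illustrative only; the function name is made up):

```python
import numpy as np

def diag_gaussian_likelihood(o_t, mu_j, var_j):
    """b_j(o_t) for a single state with a diagonal-covariance Gaussian.
    o_t, mu_j, var_j are all length-D vectors (var_j holds sigma_jd^2)."""
    norm = 1.0 / np.sqrt(2.0 * np.pi * var_j)          # per-dimension 1/sqrt(2*pi*sigma^2)
    expo = np.exp(-0.5 * (o_t - mu_j) ** 2 / var_j)    # per-dimension exponent term
    return float(np.prod(norm * expo))                 # product over the D dimensions

# Toy 3-dimensional example
o_t = np.array([1.0, -0.5, 2.0])
mu_j = np.array([0.8, 0.0, 1.5])
var_j = np.array([0.5, 1.0, 2.0])
print(diag_gaussian_likelihood(o_t, mu_j, var_j))
```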
Natural extension of the univariate case, where now μi is the mean vector for state i:

$$\hat{\mu}_i = \frac{\sum_{t=1}^{T} \xi_t(i)\, o_t}{\sum_{t=1}^{T} \xi_t(i)}$$

$$\hat{\Sigma}_i = \frac{\sum_{t=1}^{T} \xi_t(i)\, (o_t - \mu_i)(o_t - \mu_i)^T}{\sum_{t=1}^{T} \xi_t(i)}$$
Mixtures of Gaussians
M mixtures of Gaussians:
$$f(x \mid \mu_{jk}, \Sigma_{jk}) = \sum_{k=1}^{M} c_{jk}\, \frac{1}{(2\pi)^{D/2}\, |\Sigma_{jk}|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_{jk})^T \Sigma_{jk}^{-1} (x - \mu_{jk})\right)$$
For diagonal covariance:
$$b_j(o_t) = \sum_{k=1}^{M} c_{jk} \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{jkd}^2}} \exp\!\left(-\frac{1}{2}\left(\frac{o_{td} - \mu_{jkd}}{\sigma_{jkd}}\right)^2\right)$$

$$b_j(o_t) = \sum_{k=1}^{M} \frac{c_{jk}}{(2\pi)^{D/2} \prod_{d=1}^{D} \sigma_{jkd}} \exp\!\left(-\frac{1}{2}\sum_{d=1}^{D} \frac{(o_{td} - \mu_{jkd})^2}{\sigma_{jkd}^2}\right)$$
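A minimal sketch of the diagonal-covariance GMM likelihood for one state, assuming per-component weight, mean, and variance arrays (names are illustrative):

```python
import numpy as np

def gmm_diag_likelihood(o_t, weights, means, variances):
    """b_j(o_t) for one state modeled as an M-component GMM with
    diagonal covariances.  weights: (M,), means/variances: (M, D)."""
    # Per-component diagonal Gaussian densities, evaluated at o_t
    norm = 1.0 / np.sqrt(2.0 * np.pi * variances)            # (M, D)
    expo = np.exp(-0.5 * (o_t - means) ** 2 / variances)     # (M, D)
    comp = np.prod(norm * expo, axis=1)                      # (M,) product over dimensions
    return float(np.dot(weights, comp))                      # mixture-weighted sum

# Toy example: M=2 components, D=3 dimensions
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]])
variances = np.array([[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]])
print(gmm_diag_likelihood(np.array([0.1, -0.2, 0.3]), weights, means, variances))
```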
GMMs
Summary: each state has a likelihood function parameterized by:
M mixture weights
M mean vectors of dimensionality D
Either: M covariance matrices of size D x D
Or, more likely: M diagonal covariance matrices of size D x D, which is equivalent to M variance vectors of dimensionality D
Training a GMM
Problem: how do we train a GMM if we don't know which component is accounting for any particular observation? Intuition: we use Baum-Welch to find this for us, just as we did for finding the hidden states that accounted for the observations.
Now, define ξtm(j), the probability of being in state j at time t with the m-th mixture component accounting for ot:

$$\xi_{tm}(j) = \frac{\sum_{i} \alpha_{t-1}(i)\, a_{ij}\, c_{jm}\, b_{jm}(o_t)\, \beta_t(j)}{\alpha_F(T)}$$

The re-estimation equations are then:

$$\hat{\mu}_{jm} = \frac{\sum_{t=1}^{T} \xi_{tm}(j)\, o_t}{\sum_{t=1}^{T}\sum_{k=1}^{M} \xi_{tk}(j)}
\qquad
\hat{c}_{jm} = \frac{\sum_{t=1}^{T} \xi_{tm}(j)}{\sum_{t=1}^{T}\sum_{k=1}^{M} \xi_{tk}(j)}$$

$$\hat{\Sigma}_{jm} = \frac{\sum_{t=1}^{T} \xi_{tm}(j)\, (o_t - \mu_{jm})(o_t - \mu_{jm})^T}{\sum_{t=1}^{T}\sum_{k=1}^{M} \xi_{tk}(j)}$$
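A rough sketch of these updates for a single state, assuming the ξtm(j) values are already available from forward-backward. One deliberate difference from the formulas as printed above: the mean and variance of each component m are normalized here by that component's own occupancy Σt ξtm(j), the common form of the EM update, while the mixture weights use the total occupancy:

```python
import numpy as np

def reestimate_gmm_state(obs, xi):
    """Re-estimate one state's GMM parameters from precomputed
    mixture-occupancy probabilities xi_tm(j).
    obs: (T, D) observation vectors; xi: (T, M) occupancies for state j.
    Returns (mixture weights, means, diagonal variances)."""
    occ = xi.sum(axis=0)                       # per-component occupancy, shape (M,)
    weights = occ / occ.sum()                  # c_jm: share of the total occupancy
    means = (xi.T @ obs) / occ[:, None]        # mu_jm, shape (M, D)
    variances = np.empty_like(means)
    for m in range(xi.shape[1]):
        diff = obs - means[m]                  # (T, D)
        variances[m] = (xi[:, m][:, None] * diff ** 2).sum(axis=0) / occ[m]
    return weights, means, variances

# Toy example: T=4 frames, D=2 features, M=2 mixture components
obs = np.array([[0.0, 0.1], [0.2, -0.1], [2.0, 1.9], [2.1, 2.0]])
xi = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
print(reestimate_gmm_state(obs, xi))
```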
Embedded Training
Components of a speech recognizer:
Feature extraction: not statistical
Language model: word transition probabilities, trained on some other corpus
Acoustic model:
  Pronunciation lexicon: the HMM structure for each word, built by hand
  Observation likelihoods bj(ot)
  Transition probabilities aij
And we'd be done! But we don't have word and phone boundaries, nor phone labeling.
Embedded training
Instead:
We'll train each phone HMM embedded in an entire sentence
We'll do word/phone segmentation and alignment automatically as part of the training process
Embedded Training
Likelihoods:
Initialize μ and σ² of each state to the global mean and variance of all the training data
Embedded Training
Given: phoneset, pron lexicon, transcribed wavefiles
Build a whole-sentence HMM for each sentence
Initialize the A probabilities to 0.5, or to zero
Initialize the B probabilities to the global mean and variance
Run multiple iterations of Baum-Welch
During each iteration, we compute forward and backward probabilities
Viterbi training
Baum-Welch training says:
We need to know what state we were in, to accumulate counts of a given output symbol ot. We'll compute ξt(i), the probability of being in state i at time t, by using forward-backward to sum over all possible paths that might have been in state i and output ot.
Forced Alignment
Computing the Viterbi path over the training data is called forced alignment, because we know which word string to assign to each observation sequence; we just don't know the state sequence. So we use the aij to constrain the path to go through the correct words, and otherwise do normal Viterbi. Result: a state sequence!
Baum-Welch
$$\hat{b}_j(v_k) = \frac{n_j(\text{s.t. } o_t = v_k)}{n_j}$$

where nij is the number of frames with a transition from i to j in the best path, and nj is the number of frames where state j is occupied.
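A small illustrative sketch of these counts, assuming we already have the Viterbi-aligned state sequence and the VQ symbol sequence for one utterance (the function name and array layout are made up):

```python
import numpy as np

def viterbi_counts(states, symbols, n_states, n_symbols):
    """Count-based re-estimation from a single Viterbi-aligned utterance.
    states[t] is the aligned state at frame t, symbols[t] the VQ symbol.
    Returns (A_hat, B_hat) estimated from transition and emission counts."""
    A = np.zeros((n_states, n_states))
    B = np.zeros((n_states, n_symbols))
    for t in range(len(states) - 1):
        A[states[t], states[t + 1]] += 1           # n_ij: transitions i -> j on the best path
    for t in range(len(states)):
        B[states[t], symbols[t]] += 1              # n_j(s.t. o_t = v_k)
    A_hat = A / np.maximum(A.sum(axis=1, keepdims=True), 1)   # normalize by counts leaving state i
    B_hat = B / np.maximum(B.sum(axis=1, keepdims=True), 1)   # normalize by n_j
    return A_hat, B_hat

states = [0, 0, 1, 1, 1, 2]
symbols = [3, 3, 1, 1, 0, 2]
A_hat, B_hat = viterbi_counts(states, symbols, n_states=3, n_symbols=4)
print(B_hat[1])   # estimated b_1(v_k) for each symbol v_k
```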
Viterbi Training
Much faster than Baum-Welch, but doesn't work quite as well. The tradeoff is often worth it, though.
2/4/09
45
The Gaussian parameters are then re-estimated directly from the frames that the best path assigns to each state, e.g.:

$$\hat{\sigma}_i^2 = \frac{1}{N_i}\sum_{t=1}^{T} (o_t - \mu_i)^2 \quad \text{s.t. } q_t = i$$
Log domain
In practice, do all computation in the log domain. This avoids underflow:
Instead of multiplying lots of very small probabilities, we add numbers that are not so small.
In log space, a product of probabilities becomes a sum:

$$\log(p_1 p_2 \cdots p_n) = \log p_1 + \log p_2 + \cdots + \log p_n$$
Log domain
Repeating the diagonal-covariance observation likelihood:

$$b_j(o_t) = \frac{1}{(2\pi)^{D/2} \prod_{d=1}^{D} \sigma_{jd}} \exp\!\left(-\frac{1}{2}\sum_{d=1}^{D} \frac{(o_{td} - \mu_{jd})^2}{\sigma_{jd}^2}\right)$$

Taking the log:

$$\log b_j(o_t) = C_j - \frac{1}{2}\sum_{d=1}^{D} \frac{(o_{td} - \mu_{jd})^2}{\sigma_{jd}^2},
\quad \text{where } C_j = -\frac{D}{2}\log(2\pi) - \sum_{d=1}^{D} \log \sigma_{jd}$$

Note that this looks like a weighted Mahalanobis distance! This also may justify why these aren't really probabilities (they are point estimates); they are really just distances.
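A minimal sketch of computing this log likelihood directly in the log domain (illustrative; names are made up), with the per-state constant folded in:

```python
import numpy as np

def log_diag_gaussian(o_t, mu_j, var_j):
    """log b_j(o_t) for a diagonal-covariance Gaussian, computed
    directly in the log domain to avoid underflow."""
    D = len(o_t)
    const = -0.5 * D * np.log(2.0 * np.pi) - 0.5 * np.sum(np.log(var_j))
    mahal = np.sum((o_t - mu_j) ** 2 / var_j)     # weighted Mahalanobis-style distance
    return const - 0.5 * mahal

o_t = np.array([1.0, -0.5, 2.0])
mu_j = np.array([0.8, 0.0, 1.5])
var_j = np.array([0.5, 1.0, 2.0])
# Agrees with the log of the linear-domain likelihood (up to floating-point error):
print(log_diag_gaussian(o_t, mu_j, var_j))
```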
Evaluation
How to evaluate the word string output by a speech recognizer?
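The standard answer is Word Error Rate (named in the outline above). A minimal sketch of computing it with the usual word-level edit distance; this is an illustrative implementation, not code from the course:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / len(reference),
    computed with a standard edit-distance dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,   # substitution / match
                           dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1)         # insertion
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat today"))  # 2/6
```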
2) Acoustic Model:
Gaussians for computing p(o|q)
3) Lexicon/Pronunciation Model
HMM: what phones can follow each other
Pronunciation Modeling
Summary
Speech Recognition Architectural Overview
Hidden Markov Models in general
Forward
Viterbi Decoding