RNN-based AMs + Introduction to Language Modeling
Lecture 9
CS 753
Instructor: Preethi Jyothi
Recall RNN definition
[Figure: RNN cell with input x_t, hidden state h_t and output y_t, unfolded over time steps x_1, x_2, x_3, ...]
Two main equations govern RNNs:
h_t = H(W x_t + V h_{t-1} + b^(h))
y_t = O(U h_t + b^(y))
where W, V, U are the input-hidden, hidden-hidden and hidden-output weight matrices respectively; b^(h) and b^(y) are bias vectors; H is the activation function applied to the hidden layer and O is the output function
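To make the two equations concrete, here is a minimal NumPy sketch of a single-layer RNN forward pass; it assumes tanh for H and the identity for O, and all dimensions are illustrative rather than taken from the slides:

```python
import numpy as np

def rnn_forward(xs, W, V, U, b_h, b_y, h0):
    """Run the two RNN equations over a sequence of input vectors xs:
    h_t = tanh(W x_t + V h_{t-1} + b_h)   (H = tanh here)
    y_t = U h_t + b_y                     (O = identity here)"""
    h, ys = h0, []
    for x in xs:
        h = np.tanh(W @ x + V @ h + b_h)   # hidden-state update
        ys.append(U @ h + b_y)             # per-step output
    return ys, h

# Illustrative sizes: 3-dim inputs, 4-dim hidden state, 2-dim outputs.
rng = np.random.default_rng(0)
W, V, U = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
ys, h_final = rnn_forward([rng.normal(size=3) for _ in range(5)],
                          W, V, U, np.zeros(4), np.zeros(2), np.zeros(4))
```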
Training RNNs
• An unrolled RNN is just a very deep feedforward network with weights shared across time steps (trained with backpropagation through time)
[Figure: two-layer (stacked) RNN unfolded over time, with first-layer hidden states h_{t,1} feeding second-layer hidden states h_{t,2}]
• RNNs can be stacked in layers to form deep RNNs
• Empirically shown to perform better than shallow RNNs on ASR [G13]
[G13] A. Graves, A. Mohamed, G. Hinton, “Speech Recognition with Deep Recurrent Neural Networks”, ICASSP, 2013.
Vanilla RNN Model
y_t = O(U h_t + b^(y))
Vanilla RNNs struggle to retain information over long time spans; LSTM cells address this with gated memory.
[Figure: LSTM memory cell controlled by input, forget, and output gates]
• Memory cell: Neuron that stores information over long time periods
• Forget gate: When on, memory cell retains its previous contents; otherwise, memory cell forgets its contents.
• When input gate is on, write into memory cell
• When output gate is on, read from the memory cell
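The slide describes the gates qualitatively; as a sketch, the standard LSTM update below shows how the forget, input, and output gates control the memory cell (this is the common textbook parameterization, not necessarily the exact variant used in [G13]):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wo, Wc, bf, bi, bo, bc):
    """One LSTM step on the concatenated vector [x, h_prev]."""
    z = np.concatenate([x, h_prev])
    f = sigmoid(Wf @ z + bf)        # forget gate: keep previous cell contents when ~1
    i = sigmoid(Wi @ z + bi)        # input gate: allow writing into the cell when ~1
    o = sigmoid(Wo @ z + bo)        # output gate: allow reading from the cell when ~1
    c_tilde = np.tanh(Wc @ z + bc)  # candidate contents to write
    c = f * c_prev + i * c_tilde    # memory cell stores information across time steps
    h = o * np.tanh(c)              # exposed hidden state read from the cell
    return h, c
```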
Bidirectional RNNs
[Figure: bidirectional RNN over the input sequence "hello world .", with separate forward and backward hidden layers]
• BiRNNs process the data in both directions with two separate hidden layers
• Goal: a single RNN model that addresses these issues and does not rely on HMM-based alignments [G14]
[G14] A. Graves, N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks”, ICML, 2014.
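To illustrate the bidirectional processing described above, here is a small NumPy sketch: one RNN reads the inputs left to right, a second reads them right to left, and their hidden states are concatenated at each time step (the step functions and dimensions are illustrative assumptions):

```python
import numpy as np

def birnn_states(xs, fwd_step, bwd_step, h0_f, h0_b):
    """Concatenate hidden states from a left-to-right and a right-to-left RNN."""
    hs_f, h = [], h0_f
    for x in xs:                 # forward pass over the sequence
        h = fwd_step(x, h)
        hs_f.append(h)
    hs_b, h = [], h0_b
    for x in reversed(xs):       # backward pass over the sequence
        h = bwd_step(x, h)
        hs_b.append(h)
    hs_b.reverse()               # re-align backward states with time order
    return [np.concatenate([hf, hb]) for hf, hb in zip(hs_f, hs_b)]

# Illustrative: each direction is a plain tanh RNN with its own weights.
rng = np.random.default_rng(0)
Wf, Vf = rng.normal(size=(4, 3)), rng.normal(size=(4, 4))
Wb, Vb = rng.normal(size=(4, 3)), rng.normal(size=(4, 4))
states = birnn_states([rng.normal(size=3) for _ in range(6)],
                      lambda x, h: np.tanh(Wf @ x + Vf @ h),
                      lambda x, h: np.tanh(Wb @ x + Vb @ h),
                      np.zeros(4), np.zeros(4))
```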
RNN-based Acoustic Model
[Figure: deep bidirectional RNN acoustic model mapping acoustic frames x_{t-1}, x_t, x_{t+1} to outputs y_{t-1}, y_t, y_{t+1} through stacked forward and backward hidden layers]
• H was implemented using LSTMs in [G13]. Input: acoustic feature vectors, one per frame; output: phones + space
• Deep bidirectional LSTM networks were used to do phone recognition on TIMIT
• Trained using the Connectionist Temporal Classification (CTC) loss [covered in later class]
[G13] A. Graves, et al., “Speech recognition with deep recurrent neural networks”, ICASSP, 2013.
[Table 1 from [G13]: TIMIT phoneme recognition results with the RNN-based acoustic model. ‘Epochs’ is the number of passes through the training set before convergence; ‘PER’ is the phoneme error rate on the core test set.]
• Language models are used in many applications:
  • Speech recognition
  • Machine translation
• Popular toolkits:
  • SRILM Toolkit: http://www.speech.sri.com/projects/srilm/
  • KenLM Toolkit: https://kheafield.com/code/kenlm/
  • OpenGrm: http://opengrm.org/
Introduction to probabilistic LMs
Probabilistic or Statistical Language Models
• A language model assigns a probability to a word sequence via the chain rule: Pr(w_1, ..., w_n) = Π_i Pr(w_i | w_1, ..., w_{i-1})
• What is the obvious limitation here? We’ll never see enough data to estimate probabilities conditioned on arbitrarily long word histories
Simplifying Markov Assumption
• Markov chain: limited memory of previous word history; only the last m words are included:
  Pr(w_i | w_1, ..., w_{i-1}) ≈ Pr(w_i | w_{i-m}, ..., w_{i-1})
• Unigram model:
  Pr_ML(w_1) = π(w_1) / Σ_i π(w_i)
• Bigram model:
  Pr_ML(w_2 | w_1) = π(w_1, w_2) / Σ_i π(w_1, w_i)
  (π(·) denotes counts in the training data)
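A small sketch of these maximum-likelihood estimates, assuming π(·) stands for counts collected from a tokenized training corpus (the corpus and function names below are illustrative):

```python
from collections import Counter

def train_ngram_mle(sentences):
    """Maximum-likelihood unigram and bigram estimates from counts pi(.):
    Pr_ML(w1) = pi(w1) / sum_i pi(w_i)
    Pr_ML(w2|w1) = pi(w1, w2) / sum_i pi(w1, w_i)"""
    unigrams, bigrams, contexts = Counter(), Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
        contexts.update(sent[:-1])   # times each word appears with a following word
    total = sum(unigrams.values())
    pr_unigram = lambda w: unigrams[w] / total
    pr_bigram = lambda w2, w1: bigrams[(w1, w2)] / contexts[w1] if contexts[w1] else 0.0
    return pr_unigram, pr_bigram

# Toy corpus; real corpora would also add sentence boundary markers like <s>, </s>.
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
pr_uni, pr_bi = train_ngram_mle(corpus)
print(pr_uni("the"))        # 3/9
print(pr_bi("cat", "the"))  # pi(the, cat) / sum_i pi(the, w_i) = 2/3
```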
Example
• Closed vocabulary task: Use a fixed vocabulary, V. We know all the words in advance.
• More realistic setting: an open vocabulary task, where we don’t know all the words in advance and encounter out-of-vocabulary (OOV) words at test time (a common fix, mapping OOV words to a special token, is sketched below)
• Extrinsic evaluation: plug the language model into a downstream system (e.g., a speech recognizer) and measure end-task performance
• Time-consuming process!
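As referenced above, a common way to cope with OOV words is to fix a vocabulary and map everything outside it to a special token; a minimal sketch, where the `<unk>` token and the count cut-off are illustrative choices:

```python
from collections import Counter

def build_vocab(sentences, min_count=2, unk="<unk>"):
    """Keep words seen at least min_count times; everything else will map to <unk>."""
    counts = Counter(w for sent in sentences for w in sent)
    return {w for w, c in counts.items() if c >= min_count} | {unk}

def map_oov(sentence, vocab, unk="<unk>"):
    """Replace out-of-vocabulary words with the <unk> token."""
    return [w if w in vocab else unk for w in sentence]

vocab = build_vocab([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(map_oov(["the", "zebra", "sat"], vocab))  # ['the', '<unk>', 'sat']
```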
Intrinsic Evaluation
• How likely does the model consider the text in a test set?
• Perplexity(test) = (1 / Pr_model[test])^(1/N), where N is the number of words in the test set
• How closely does the model approximate the actual (test set) distribution?
• KL-divergence between two distributions X and Y:
  D_KL(X || Y) = Σ_σ Pr_X[σ] log(Pr_X[σ] / Pr_Y[σ])
• Equals zero iff X = Y; otherwise positive
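A tiny sketch of this quantity for two discrete distributions stored as dictionaries (it assumes Y assigns non-zero probability wherever X does; purely illustrative, not from the slides):

```python
import math

def kl_divergence(pr_x, pr_y):
    """D_KL(X||Y) = sum over sigma of Pr_X[sigma] * log(Pr_X[sigma] / Pr_Y[sigma])."""
    return sum(p * math.log(p / pr_y[sigma]) for sigma, p in pr_x.items() if p > 0)

X = {"a": 0.5, "b": 0.5}
Y = {"a": 0.9, "b": 0.1}
print(kl_divergence(X, X))  # 0.0: zero iff the distributions are identical
print(kl_divergence(X, Y))  # > 0 otherwise
```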
• Empirical cross entropy of the model on the test set:
  (1 / (#words/#sents)) · (1 / #sents) · Σ_{σ ∈ test} log(1 / Pr_model[σ])
    = (1/N) Σ_{σ ∈ test} log(1 / Pr_model[σ])
  where N = #words
• How does (1/N) Σ_{σ ∈ test} log(1 / Pr_model[σ]) relate to perplexity?
Perplexity vs. Empirical Cross Entropy
log(perplexity) = (1/N) log(1 / Pr_model[test])
                = (1/N) log( Π_{σ ∈ test} 1 / Pr_model[σ] )
                = (1/N) Σ_{σ ∈ test} log(1 / Pr_model[σ])