
RNN-based AMs

+ Introduction to Language Modeling

Lecture 9

CS 753
Instructor: Preethi Jyothi
Recall RNN definition
[Figure: A single RNN cell (H, O) maps input xt and previous state ht-1 to state ht and output yt; unfolding it over time gives copies for x1, x2, x3, … with states h0, h1, h2, … and outputs y1, y2, y3, …]
Two main equations govern RNNs:

h_t = H(W x_t + V h_{t-1} + b^(h))

y_t = O(U h_t + b^(y))

where W, V and U are the matrices of input-hidden weights, hidden-hidden weights and
hidden-output weights respectively; b^(h) and b^(y) are bias vectors; and H is the
activation function applied to the hidden layer
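
As a concrete illustration of these two equations (not part of the original slide), here is a minimal NumPy sketch of one unrolled forward pass; the dimensions, tanh/softmax choices and random weights are placeholder assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Placeholder dimensions and weights (assumptions for illustration only)
d_in, d_h, d_out = 4, 8, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(d_h, d_in))    # input-hidden weights
V = rng.normal(size=(d_h, d_h))     # hidden-hidden weights
U = rng.normal(size=(d_out, d_h))   # hidden-output weights
b_h, b_y = np.zeros(d_h), np.zeros(d_out)

xs = [rng.normal(size=d_in) for _ in range(5)]   # toy input sequence x_1..x_5
h = np.zeros(d_h)                                # h_0
ys = []
for x in xs:
    h = np.tanh(W @ x + V @ h + b_h)   # h_t = H(W x_t + V h_{t-1} + b^(h))
    ys.append(softmax(U @ h + b_y))    # y_t = O(U h_t + b^(y))
```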
Training RNNs
• An unrolled RNN is just a very deep feedforward network

• For a given input sequence:

• create the unrolled network

• add a loss function node to the network

• then, use backpropagation to compute the gradients

• This algorithm is known as backpropagation through time


(BPTT)
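
As a hedged sketch of BPTT (the lecture does not prescribe a framework; PyTorch autograd is an assumption here): unrolling the recurrence and calling backward() on the summed loss is exactly backpropagation through the unrolled network.

```python
import torch
import torch.nn.functional as F

d_in, d_h, d_out, T = 4, 8, 3, 6            # placeholder sizes
W = torch.randn(d_h, d_in, requires_grad=True)
V = torch.randn(d_h, d_h, requires_grad=True)
U = torch.randn(d_out, d_h, requires_grad=True)

xs = torch.randn(T, d_in)                    # toy input sequence
targets = torch.randint(0, d_out, (T,))      # toy per-frame targets

h = torch.zeros(d_h)
loss = torch.zeros(())
for t in range(T):                           # create the unrolled network
    h = torch.tanh(W @ xs[t] + V @ h)
    logits = U @ h
    loss = loss + F.cross_entropy(logits.unsqueeze(0), targets[t].unsqueeze(0))

loss.backward()   # BPTT: gradients flow back through every time step
# W.grad, V.grad, U.grad now hold gradients summed over the whole sequence
```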

Deep RNNs

[Figure: A two-layer deep RNN unfolded over time: layer-1 states h0,1, h1,1, h2,1 are fed as inputs to layer-2 states h0,2, h1,2, h2,2, which produce outputs y1, y2, y3 from inputs x1, x2, x3]
• RNNs can be stacked in layers to form deep RNNs
• Empirically shown to perform better than shallow RNNs on
ASR [G13]

[G13] A. Graves, A. Mohamed, G. Hinton, “Speech Recognition with Deep Recurrent Neural Networks”, ICASSP, 2013.
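
A minimal NumPy sketch (placeholder sizes and weights) of the stacking idea: the hidden-state sequence of layer 1 becomes the input sequence of layer 2.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h, T, n_layers = 4, 8, 5, 2          # placeholder sizes

# One (W, V) pair per layer; layer 0 reads x_t, higher layers read h_{t,l-1}
Ws = [rng.normal(size=(d_h, d_in))] + \
     [rng.normal(size=(d_h, d_h)) for _ in range(n_layers - 1)]
Vs = [rng.normal(size=(d_h, d_h)) for _ in range(n_layers)]

seq = [rng.normal(size=d_in) for _ in range(T)]   # toy inputs x_1..x_T
for layer in range(n_layers):
    h, out = np.zeros(d_h), []
    for x in seq:
        h = np.tanh(Ws[layer] @ x + Vs[layer] @ h)
        out.append(h)
    seq = out                                # layer l's states feed layer l+1
# `seq` now holds the top-layer states h_{t,2}, which would feed the output layer
```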
Vanilla RNN Model

h_t = H(W x_t + V h_{t-1} + b^(h))

y_t = O(U h_t + b^(y))

H : element-wise application of the sigmoid or tanh function

O : the softmax function

Vanilla RNNs run into problems of exploding and vanishing gradients.


Exploding/Vanishing Gradients
• In deep networks, gradients in early layers are computed as the
product of terms from all the later layers

• This leads to unstable gradients:

• If the terms in later layers are large enough, gradients in early
  layers (which are products of these terms) can grow
  exponentially large: Exploding gradients

• If the terms in later layers are small, gradients in early
  layers will tend to decrease exponentially: Vanishing gradients

• To address this problem in RNNs, Long Short Term Memory


(LSTM) units were proposed [HS97]
[HS97] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, 1997.
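
A tiny numeric illustration (mine, not from the slide) of why long products of per-layer terms are unstable: repeated multiplication by factors slightly below or above 1.

```python
# Stand-in for the per-step backward factor (roughly |V| * |H'(.)|)
small, large = 0.9, 1.1
steps = 100
print(small ** steps)   # ~2.7e-05 : gradient vanishes
print(large ** steps)   # ~1.4e+04 : gradient explodes
```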
Long Short Term Memory Cells

[Figure: LSTM cell: a memory cell whose writes, retention and reads are controlled by an input gate, a forget gate and an output gate]

• Memory cell: Neuron that stores information over long time periods
• Forget gate: When on, memory cell retains previous contents.
Otherwise, memory cell forgets contents.
• When input gate is on, write into memory cell
• When output gate is on, read from the memory cell
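
A minimal NumPy sketch of one step of a standard LSTM cell (the common formulation without peephole connections; sizes and weights are placeholders), showing how the gates control writing, retention and reading of the memory cell:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step."""
    z = np.concatenate([x, h_prev])
    i = sigmoid(p["Wi"] @ z + p["bi"])   # input gate: write into the memory cell?
    f = sigmoid(p["Wf"] @ z + p["bf"])   # forget gate: retain previous contents?
    o = sigmoid(p["Wo"] @ z + p["bo"])   # output gate: read from the memory cell?
    g = np.tanh(p["Wg"] @ z + p["bg"])   # candidate content to write
    c = f * c_prev + i * g               # memory cell update
    h = o * np.tanh(c)                   # exposed hidden state
    return h, c

d_in, d_h = 4, 8                         # placeholder sizes
rng = np.random.default_rng(2)
p = {k: rng.normal(size=(d_h, d_in + d_h)) for k in ("Wi", "Wf", "Wo", "Wg")}
p.update({k: np.zeros(d_h) for k in ("bi", "bf", "bo", "bg")})

h, c = np.zeros(d_h), np.zeros(d_h)
for x in [rng.normal(size=d_in) for _ in range(5)]:
    h, c = lstm_step(x, h, c, p)
```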
Bidirectional RNNs

[Figure: Bidirectional RNN over the inputs xhello, xworld, x. : a forward layer (Hf, Of) computes h1,f, h2,f, h3,f left to right, a backward layer (Hb, Ob) computes h3,b, h2,b, h1,b right to left, and the forward and backward outputs are concatenated at each position]

• BiRNNs process the data in both directions with two separate hidden layers

• Outputs from both hidden layers are concatenated at each position
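
A minimal NumPy sketch (placeholder weights) of the bidirectional idea: run one RNN left to right, another right to left over the same inputs, and concatenate the two hidden states at each position.

```python
import numpy as np

def run_rnn(xs, W, V):
    h, states = np.zeros(V.shape[0]), []
    for x in xs:
        h = np.tanh(W @ x + V @ h)
        states.append(h)
    return states

rng = np.random.default_rng(3)
d_in, d_h, T = 4, 8, 3                                               # placeholder sizes
Wf, Vf = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))   # forward layer
Wb, Vb = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))   # backward layer

xs = [rng.normal(size=d_in) for _ in range(T)]           # e.g. x_hello, x_world, x_.
fwd = run_rnn(xs, Wf, Vf)                                # h_{1,f}, ..., h_{T,f}
bwd = run_rnn(xs[::-1], Wb, Vb)[::-1]                    # h_{1,b}, ..., h_{T,b}
outputs = [np.concatenate(pair) for pair in zip(fwd, bwd)]   # 2*d_h values per position
```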


ASR with RNNs

• We have seen how neural networks can be used for acoustic


models in ASR systems

• Main limitation: Frame-level training targets derived from HMM-


based alignments

• Goal: Single RNN model that addresses this issue and does not
rely on HMM-based alignments [G14]

[G14] A. Graves, N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks”, ICML, 2014.
RNN-based Acoustic Model
[Figure: Deep bidirectional RNN acoustic model: forward states h1,f, …, h3,f and backward states h3,b, …, h1,b computed over input frames xt-1, xt, xt+1 jointly produce outputs yt-1, yt, yt+1]

• H was implemented using LSTMs in [G13]. Input: Acoustic feature vectors, one per frame;
Output: Phones + space
• Deep bidirectional LSTM networks were used to do phone recognition on TIMIT
• Trained using the Connectionist Temporal Classification (CTC) loss [covered in a later class]

[G13] A. Graves, et al., “Speech recognition with deep recurrent neural networks”, ICASSP, 2013.
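
The exact configuration of [G13]/[G14] is not reproduced here; the following is a hedged PyTorch sketch of a model with the same shape, a deep bidirectional LSTM over per-frame acoustic features projected to label-plus-blank posteriors and trained with the CTC loss. Feature dimension, layer sizes, sequence lengths and label inventory are placeholder assumptions.

```python
import torch
import torch.nn as nn

n_feats, n_hidden, n_labels = 40, 250, 62          # placeholder sizes
blank = 0                                          # CTC blank symbol

lstm = nn.LSTM(n_feats, n_hidden, num_layers=3,
               bidirectional=True, batch_first=True)
proj = nn.Linear(2 * n_hidden, n_labels + 1)       # labels + blank
ctc = nn.CTCLoss(blank=blank)

feats = torch.randn(1, 300, n_feats)               # (batch, frames, features), toy input
labels = torch.randint(1, n_labels + 1, (1, 40))   # toy label sequence (no blanks)

hidden, _ = lstm(feats)                            # (1, 300, 2 * n_hidden)
log_probs = proj(hidden).log_softmax(dim=-1)       # per-frame posteriors over labels + blank
loss = ctc(log_probs.transpose(0, 1),              # CTCLoss expects (frames, batch, labels)
           labels,
           torch.tensor([300]),                    # input lengths
           torch.tensor([40]))                     # target lengths
loss.backward()
```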
RNN-based Acoustic Model

Table 1. TIMIT Phoneme Recognition Results. ‘Epochs’ is the number of passes through
the training set before convergence. ‘PER’ is the phoneme error rate on the core test set.

NETWORK              WEIGHTS   EPOCHS   PER
CTC-3L-500H-TANH     3.7M      107      37.6%
CTC-1L-250H          0.8M      82       23.9%
CTC-1L-622H          3.8M      87       23.0%
CTC-2L-250H          2.3M      55       21.0%
CTC-3L-421H-UNI      3.8M      115      19.6%
CTC-3L-250H          3.8M      124      18.6%
CTC-5L-250H          6.8M      150      18.4%
TRANS-3L-250H        4.3M      112      18.3%
PRETRANS-3L-250H     4.3M      144      17.7%
[G13] A. Graves, et al., “Speech recognition with deep recurrent neural networks”, ICASSP, 2013.
So far, we’ve looked at acoustic models…
[Figure: ASR pipeline: Acoustic Indices → Acoustic Models → Triphones → Context Transducer → Monophones → Pronunciation Model → Words → Language Model → Word Sequence]
Next, language models
[Figure: ASR pipeline: Acoustic Indices → Acoustic Models → Triphones → Context Transducer → Monophones → Pronunciation Model → Words → Language Model → Word Sequence]

• Language models

• provide information about word reordering

Pr(“she class taught a”) < Pr(“she taught a class”)

• provide information about the most likely next word

Pr(“she taught a class”) > Pr(“she taught a speech”)


Application of language models

• Speech recognition

• Pr(“she taught a class”) > Pr(“sheet or tuck lass”)

• Machine translation

• Handwriting recognition/Optical character recognition

• Spelling correction of sentences

• Summarization, dialog generation, information retrieval, etc.


Popular Language Modelling Toolkits

• SRILM Toolkit:

http://www.speech.sri.com/projects/srilm/

• KenLM Toolkit:

https://kheafield.com/code/kenlm/

• OpenGrm NGram Library:

http://opengrm.org/
Introduction to probabilistic LMs
Probabilistic or Statistical Language Models

• Given a word sequence, W = {w1, … , wn}, what is Pr(W)?

• Decompose Pr(W) using the chain rule:

Pr(w1,w2,…,wn-1,wn) = Pr(w1) Pr(w2|w1) Pr(w3|w1,w2)…Pr(wn|w1,…,wn-1)

• Sparse data with long word contexts: How do we estimate


the probabilities Pr(wn|w1,…,wn-1)?
Estimating word probabilities

• Accumulate counts of words and word contexts

• Compute normalised counts to get next-word probabilities

• E.g. Pr(“class” | “she taught a”)

       = π(“she taught a class”) / π(“she taught a”)

  where π(“…”) refers to counts derived from a large English text corpus

• What is the obvious limitation here? We’ll never see enough data
Simplifying Markov Assumption

• Markov chain:

  • Limited memory of previous word history: Only the last m words are included

• 1st-order language model (or bigram model)

  Pr(w1,w2,…,wn-1,wn) ≅ Pr(w1|<s>) Pr(w2|w1) Pr(w3|w2)…Pr(wn|wn-1)

• 2nd-order language model (or trigram model)

  Pr(w1,w2,…,wn-1,wn) ≅ Pr(w1|<s>) Pr(w2|<s>,w1) Pr(w3|w1,w2)…Pr(wn|wn-2,wn-1)

• An Ngram model is an (N−1)th-order Markov model


Estimating Ngram Probabilities

• Maximum Likelihood Estimates

• Unigram model

  PrML(w1) = π(w1) / Σi π(wi)

• Bigram model

  PrML(w2 | w1) = π(w1, w2) / Σi π(w1, wi)
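
A small sketch (mine, not the lecture's) of computing these maximum-likelihood estimates from raw counts, with sentence boundaries padded by <s> and </s>:

```python
from collections import Counter

def train_bigram_mle(sentences):
    bigrams, history = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        bigrams.update(zip(words[:-1], words[1:]))
        history.update(words[:-1])   # Σ_i π(w1, w_i): how often w1 occurs as a history
    def prob(w2, w1):
        # Pr_ML(w2 | w1) = π(w1, w2) / Σ_i π(w1, w_i)
        return bigrams[(w1, w2)] / history[w1] if history[w1] else 0.0
    return prob

# Usage on the toy corpus of the next slide:
prob = train_bigram_mle(["The dog chased a cat",
                         "The cat chased away a mouse",
                         "The mouse eats cheese"])
print(prob("cat", "The"))   # 1/3
```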
Example

The dog chased a cat



The cat chased away a mouse

The mouse eats cheese

What is Pr(“The cat chased a mouse”) using a bigram model?

Pr(“<s> The cat chased a mouse </s>”)
  = Pr(“The|<s>”) ⋅ Pr(“cat|The”) ⋅ Pr(“chased|cat”) ⋅ Pr(“a|chased”) ⋅ Pr(“mouse|a”) ⋅ Pr(“</s>|mouse”)
  = 3/3 ⋅ 1/3 ⋅ 1/2 ⋅ 1/2 ⋅ 1/2 ⋅ 1/2 = 1/48
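
A quick self-contained check of this arithmetic; the six conditional probabilities are the hand-counted values from the three-sentence corpus above.

```python
from fractions import Fraction as F

# Pr(The|<s>), Pr(cat|The), Pr(chased|cat), Pr(a|chased), Pr(mouse|a), Pr(</s>|mouse)
probs = [F(3, 3), F(1, 3), F(1, 2), F(1, 2), F(1, 2), F(1, 2)]

p = F(1)
for q in probs:
    p *= q
assert p == F(1, 48)
print(p)   # 1/48
```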

Example

The dog chased a cat



The cat chased away a mouse

The mouse eats cheese

What is Pr(“The dog eats cheese”) using a bigram model?

Pr(“<s> The dog eats cheese </s>”)
  = Pr(“The|<s>”) ⋅ Pr(“dog|The”) ⋅ Pr(“eats|dog”) ⋅ Pr(“cheese|eats”) ⋅ Pr(“</s>|cheese”)
  = 3/3 ⋅ 1/3 ⋅ 0/1 ⋅ 1/1 ⋅ 1/1 = 0!   Due to unseen bigrams

How do we deal with unseen bigrams? We’ll come back to it.


Open vs. closed vocabulary task

• Closed vocabulary task: Use a fixed vocabulary, V. We know all the words in advance.

• Open vocabulary task: A more realistic setting where we don’t know all the words in
  advance and encounter out-of-vocabulary (OOV) words at test time.

• Create an unknown word token: <UNK>

• Estimating <UNK> probabilities: Determine a vocabulary V. Change all words in the
  training set not in V to <UNK> (one way to pick V is sketched after this list)
• Now train its probabilities like a regular word
• At test time, use <UNK> probabilities for words not seen in training
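
A minimal sketch of this <UNK> recipe; choosing the vocabulary as the words seen at least twice in training is one illustrative convention, not the slide's prescription.

```python
from collections import Counter

def build_vocab(train_sentences, min_count=2):
    """Illustrative vocabulary choice: keep words seen at least min_count times."""
    counts = Counter(w for s in train_sentences for w in s.split())
    return {w for w, c in counts.items() if c >= min_count}

def map_unk(sentence, vocab):
    """Replace out-of-vocabulary words by the <UNK> token."""
    return [w if w in vocab else "<UNK>" for w in sentence.split()]

# Training: map rare words to <UNK>, then estimate Ngram probabilities as usual.
# Test time: apply map_unk() with the same vocabulary before scoring a sentence.
```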
Evaluating Language Models

• Extrinsic evaluation:

• To compare Ngram models A and B, use both within a


specific speech recognition system (keeping all other
components the same)

• Compare word error rates (WERs) for A and B

• Time-consuming process!
Intrinsic Evaluation

• Evaluate the language model in a standalone manner

• How likely does the model consider the text in a test set?

• How closely does the model approximate the actual (test


set) distribution?

• Same measure can be used to address both questions —


perplexity!
Measures of LM quality

• How likely does the model consider the text in a test set?

• How closely does the model approximate the actual (test


set) distribution?

• Same measure can be used to address both questions —


perplexity!
Perplexity (I)

• How likely does the model consider the text in a test set?

• Perplexity(test) = 1/Prmodel[test]

• Normalized by text length:

• Perplexity(test) = (1/Prmodel[test])^(1/N) where N = number of tokens in test

• e.g. If the model predicts i.i.d. words from a dictionary of size L,
  per-word perplexity = (1/(1/L)^N)^(1/N) = L
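
A short sanity check (not from the slide) that the i.i.d.-uniform case indeed works out to L:

```python
import math

def perplexity(token_probs):
    """Perplexity from per-token model probabilities: (1 / Π p_i)^(1/N)."""
    N = len(token_probs)
    return math.exp(sum(-math.log(p) for p in token_probs) / N)

L, N = 1000, 50
print(perplexity([1.0 / L] * N))   # 1000.0, i.e. exactly L
```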
Intuition for Perplexity
• Shannon’s guessing game builds intuition for perplexity

• What is the surprisal factor in predicting the next word?

• At the stall, I had tea and _________

        biscuits   0.1
        samosa     0.1
        coffee     0.01
        rice       0.001
        ⋮
        but        0.00000000001

• A better language model would assign a higher probability to the actual word that
  fills the blank (and hence lead to lower surprisal/perplexity)
Measures of LM quality

• How likely does the model consider the text in a test set?

• How closely does the model approximate the actual (test


set) distribution?

• Same measure can be used to address both questions —


perplexity!
Perplexity (II)

• How closely does the model approximate the actual (test set)
distribution?
• KL-divergence between two distributions X and Y

DKL(X||Y) = Σσ PrX[σ] log (PrX[σ]/PrY[σ])
• Equals zero iff X = Y ; Otherwise, positive

• How to measure DKL(X||Y)? We don’t know X!

• DKL(X||Y) = Σσ PrX[σ] log(1/PrY[σ]) − H(X),
  where the first term is the cross entropy between X and Y
  and H(X) = −Σσ PrX[σ] log PrX[σ]

• Empirical cross entropy:

  (1/|test|) Σσ∈test log(1/PrY[σ])
Perplexity vs. Empirical Cross Entropy

• Empirical Cross Entropy (ECE)

  ECE = (1/#sents) Σσ∈test log(1/Prmodel[σ])

• Normalized Empirical Cross Entropy = ECE/(avg. length)

  = (1/(#words/#sents)) ⋅ (1/#sents) Σσ∈test log(1/Prmodel[σ])

  = (1/N) Σσ∈test log(1/Prmodel[σ]),   where N = #words

• How does (1/N) Σσ∈test log(1/Prmodel[σ]) relate to perplexity?
Perplexity vs. Empirical Cross Entropy

log(perplexity) = (1/N) log(1/Pr[test])

                = (1/N) log Πσ∈test (1/Prmodel[σ])

                = (1/N) Σσ∈test log(1/Prmodel[σ])

Thus, perplexity = exp(normalized cross entropy)

Example perplexities for Ngram models trained on WSJ (80M words): 


Unigram: 962, Bigram: 170, Trigram: 109
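
A brief numeric check (with made-up sentence probabilities, not real LM outputs) that exponentiating the normalized empirical cross entropy recovers the perplexity defined earlier:

```python
import math

# Toy test set: (model probability of the sentence, number of word tokens in it)
test = [(1e-6, 5), (3e-9, 8), (2e-4, 3)]
N = sum(n for _, n in test)                                     # total #words

cross_entropy = sum(math.log(1.0 / p) for p, _ in test) / N     # (1/N) Σ log(1/Pr_model[σ])
perplexity = (1.0 / math.prod(p for p, _ in test)) ** (1.0 / N)

assert math.isclose(perplexity, math.exp(cross_entropy))
print(perplexity, math.exp(cross_entropy))
```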


Introduction to smoothing of LMs
Recall example

The dog chased a cat



The cat chased away a mouse

The mouse eats cheese

What is Pr(“The dog eats cheese”)?

Pr(“<s> The dog eats cheese </s>”)
  = Pr(“The|<s>”) ⋅ Pr(“dog|The”) ⋅ Pr(“eats|dog”) ⋅ Pr(“cheese|eats”) ⋅ Pr(“</s>|cheese”)
  = 3/3 ⋅ 1/3 ⋅ 0/1 ⋅ 1/1 ⋅ 1/1 = 0!   Due to unseen bigrams
Unseen Ngrams
• Even with MLE estimates based on counts from large text
corpora, there will be many unseen bigrams/trigrams that
never appear in the corpus

• If any unseen Ngram appears in a test sentence, the


sentence will be assigned probability 0

• Problem with MLE estimates: maximises the likelihood of the


observed data by assuming anything unseen cannot happen
and overfits to the training data

• Smoothing methods: Reserve some probability mass for Ngrams that
  don’t occur in the training corpus
Add-one (Laplace) smoothing

Simple idea: Add one to all bigram counts. That means,

  PrML(wi | wi-1) = π(wi-1, wi) / π(wi-1)

becomes

  PrLap(wi | wi-1) = (π(wi-1, wi) + 1) / (π(wi-1) + V)

where V is the vocabulary size
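
A small sketch applying add-one smoothing to the earlier toy corpus: the unseen bigram “dog eats” now gets a nonzero probability. Taking V to be the set of word types in the corpus plus </s> is one reasonable convention among several.

```python
from collections import Counter

sentences = ["The dog chased a cat",
             "The cat chased away a mouse",
             "The mouse eats cheese"]

bigrams, history, vocab = Counter(), Counter(), set()
for s in sentences:
    words = ["<s>"] + s.split() + ["</s>"]
    vocab.update(words[1:])                     # word types incl. </s>, excl. <s>
    bigrams.update(zip(words[:-1], words[1:]))
    history.update(words[:-1])

V = len(vocab)                                  # vocabulary size (10 here)

def pr_laplace(w, prev):
    return (bigrams[(prev, w)] + 1) / (history[prev] + V)

print(pr_laplace("eats", "dog"))     # 1/11 ≈ 0.09 instead of 0
print(pr_laplace("chased", "dog"))   # 2/11, for the one bigram with "dog" actually seen
```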
