Long short-term memory
Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term
memory." Neural computation 9.8 (1997): 1735-1780.
01
Long Short-Term Memory (LSTM)
Olivia Ni
• Recurrent Neural Networks (RNN)
• The Problem of Long-Term Dependencies
• LSTM Networks
• The Core Idea Behind LSTMs
• Step-by-Step LSTM Walk Through
• Variants on LSTMs
• Conclusions & References
• Appendix (BPTT & Gradient Exploding/Vanishing)
02
Outline
• Idea:
• condition the neural network on all previous information and tie the weights
at each time step
• Assumption: temporal information matters (i.e. time series data)
03
Recurrent Neural Networks (RNN)
[Figure: an RNN unrolled in time — at each step $t$, the cell takes $Input_t$ and the previous short-term memory $STM_{t-1}$, and produces $Output_t$ and $STM_t$.]
• STM = Short-term memory
• RNN Definition:
• Model Training:
• All model parameters $\theta = \{U, V, W\}$ can be updated by gradient descent
04
Recurrent Neural Networks (RNN)
$s_t = \sigma(U x_t + W s_{t-1})$
$o_t = \mathrm{softmax}(V s_t)$
$\theta^{(i+1)} \leftarrow \theta^{(i)} - \eta\,\nabla_\theta C(\theta^{(i)})$
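As a concrete companion to the definition above, here is a minimal NumPy sketch of the forward recurrence $s_t = \sigma(U x_t + W s_{t-1})$, $o_t = \mathrm{softmax}(V s_t)$; the toy dimensions and the sigmoid/softmax helpers are illustrative assumptions, not part of the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, U, W, V, s0):
    """Unroll s_t = sigmoid(U x_t + W s_{t-1}), o_t = softmax(V s_t) over a sequence."""
    s, states, outputs = s0, [], []
    for x in xs:                        # one iteration per time step, weights tied
        s = sigmoid(U @ x + W @ s)      # new short-term memory STM_t
        states.append(s)
        outputs.append(softmax(V @ s))  # output at time t
    return states, outputs

# toy example (dimensions assumed for illustration)
rng = np.random.default_rng(0)
d_in, d_hid, d_out, T = 4, 8, 3, 5
U = rng.normal(size=(d_hid, d_in))
W = rng.normal(size=(d_hid, d_hid))
V = rng.normal(size=(d_out, d_hid))
xs = [rng.normal(size=d_in) for _ in range(T)]
states, outputs = rnn_forward(xs, U, W, V, np.zeros(d_hid))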
• Example: consider trying to predict the last word in each of these sentences:
• “The clouds are in the sky.”
• “I grew up in France… I speak fluent French.”
• Issue: in theory, RNNs can handle such “long-term dependencies,” but in practice they fail to learn them.
05
The Problem of Long-Term Dependencies
• RNN Training Issue:
(1) The gradient is a product of Jacobian matrices, each associated with a step
in the forward computation
(2) The same matrix is multiplied in at each time step during BPTT
• The gradient quickly becomes very small or very large ⟹ vanishing or exploding gradients
• The error surface is either very flat or very steep
06
The Problem of Long-Term Dependencies
• Possible Solutions:
• Gradient Exploding:
• Clipping (https://arxiv.org/abs/1211.5063?context=cs)
• Gradient Vanishing:
• Better Initialization (https://arxiv.org/abs/1504.00941)
• Gating Mechanism (LSTM, GRU, etc.)
• Attention Mechanism (https://arxiv.org/pdf/1706.03762.pdf)
07
The Problem of Long-Term Dependencies
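As a small illustration of the clipping remedy listed above (the first link, Pascanu et al.), here is a minimal sketch of gradient-norm clipping; the threshold and the list-of-arrays representation of the gradient are assumptions for illustration only.

import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale gradient arrays so their global L2 norm does not exceed max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)  # small epsilon avoids division issues
        grads = [g * scale for g in grads]
    return grads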
08
LSTM Networks – The Core Idea Behind LSTMs
• STM = Short-term memory
• LSTM = Long Short-term memory
[Figure: the unrolled RNN chain from before, contrasted with an unrolled LSTM chain — each LSTM cell passes both the short-term memory $STM_t$ and a long-term memory $LSTM_t$ to the next time step.]
09
LSTM Networks – The Core Idea Behind LSTMs
• STM = Short-term memory
• LSTM = Long Short-term memory
[Figure: a single LSTM cell with its two memory paths — the long-term memory ($LSTM$) and the short-term memory ($STM$).]
10
LSTM Networks – Step-by-Step LSTM Walk Through (0/4)
• The cell state runs straight down
the entire chain, with only some
minor linear interactions.
⟹ Easy for information to flow along it unchanged
• The LSTM does have the ability
to remove or add information to
the cell state, carefully regulated
by structures called gates.
11
LSTM Networks – Step-by-Step LSTM Walk Through (1/4)
• Forget gate (sigmoid + pointwise
multiplication operation):
decides what information we’re
going to throw away from the
cell state
• 1: “completely keep this”
• 0: “completely get rid of this”
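The forget-gate formula itself appears only as a figure on the original slide; in the notation of Colah's “Understanding LSTM Networks” post (which this walkthrough follows, see References), it is:
$f_t = \sigma\!\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$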
12
LSTM Networks – Step-by-Step LSTM Walk Through (2/4)
• Input gate (sigmoid + pointwise
multiplication operation):
decides what new information
we’re going to store in the cell
state
(Figure label “Vanilla RNN”: the tanh candidate layer has the same form as a vanilla RNN state update.)
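The input-gate formulas were likewise shown as a figure; in the same notation, a sigmoid layer decides which values to update and a tanh layer proposes the candidate values:
$i_t = \sigma\!\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \qquad \tilde{C}_t = \tanh\!\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)$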
13
LSTM Networks – Step-by-Step LSTM Walk Through (3/4)
• Cell state update: forget the things we decided to forget earlier and add the new candidate values, scaled by how much we decided to update each value
• $f_t$: decides which parts to forget
• $i_t$: decides which parts to update
⟹ $C_t$ is updated at time step $t$ and changes slowly! (see the formula below)
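In the same notation, the cell-state update described above is:
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$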
14
LSTM Networks – Step-by-Step LSTM Walk Through (4/4)
• Output gate (sigmoid + pointwise multiplication operation): decides what information from the cell state we're going to output
⟹ $h_t$ is updated at time step $t$ and changes faster! (see the formulas and the one-step sketch below)
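In the same notation, the output gate and the new hidden state are:
$o_t = \sigma\!\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t \odot \tanh(C_t)$

Putting the four steps together, here is a minimal NumPy sketch of one LSTM step; the weight shapes, bias terms, and the concatenation layout of $[h_{t-1}, x_t]$ are assumptions for illustration, not the slides' own code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    """One LSTM step: all gates act on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x])    # [h_{t-1}, x_t]
    f = sigmoid(Wf @ z + bf)           # forget gate
    i = sigmoid(Wi @ z + bi)           # input gate
    C_tilde = np.tanh(Wc @ z + bc)     # candidate cell values
    C = f * C_prev + i * C_tilde       # cell state: changes slowly
    o = sigmoid(Wo @ z + bo)           # output gate
    h = o * np.tanh(C)                 # hidden state: changes faster
    return h, C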
15
LSTM Networks – Variants on LSTMs (1/3)
• LSTM with Peephole Connections
• Idea: allow gate layers to look at
the cell state
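The peephole equations were shown as a figure; assuming the formulation from the same post, each gate additionally reads the cell state:
$f_t = \sigma\!\left(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f\right), \quad i_t = \sigma\!\left(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i\right), \quad o_t = \sigma\!\left(W_o \cdot [C_t, h_{t-1}, x_t] + b_o\right)$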
16
LSTM Networks – Variants on LSTMs (2/3)
• LSTM with Coupled Forget/Input Gates
• Idea: we only forget when we’re
going to input something in its
place, and vice versa.
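With the coupled gates, the cell-state update from step (3/4) becomes (same notation, formula taken from the post this walkthrough follows):
$C_t = f_t \odot C_{t-1} + (1 - f_t) \odot \tilde{C}_t$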
17
LSTM Networks – Variants on LSTMs (3/3)
• Gated Recurrent Unit (GRU)
• Idea:
• combine the forget and input gates
into a single “update gate”
• merge the cell state and hidden state
Update gate:
Reset gate:
State Candidate:
Current State:
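The formulas next to the four labels above were lost in extraction; the standard GRU equations, written in the same notation as the LSTM walkthrough, are:
Update gate: $z_t = \sigma\!\left(W_z \cdot [h_{t-1}, x_t]\right)$
Reset gate: $r_t = \sigma\!\left(W_r \cdot [h_{t-1}, x_t]\right)$
State candidate: $\tilde{h}_t = \tanh\!\left(W \cdot [r_t \odot h_{t-1}, x_t]\right)$
Current state: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$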
• Explain by: Backpropagation Through Time (BPTT)
• RNN Training Issue: Gradient Vanishing & Gradient Exploding
• Review: Backpropagation (BP)
18
Appendix – The Problem of Long-Term Dependencies
• Gradient Descent for Neural Networks:
• Computing the gradient $\partial C(\theta)/\partial w_{ij}^{(l)}$ involves millions of parameters.
• To compute it efficiently, we use backpropagation.
• Each gradient is computed from two pre-computed terms, one from the forward pass and one from the backward pass.
19
Appendix – The Problem of Long-Term Dependencies
$\dfrac{\partial C(\theta)}{\partial w_{ij}^{(l)}} = \dfrac{\partial C(\theta)}{\partial z_i^{(l)}}\,\dfrac{\partial z_i^{(l)}}{\partial w_{ij}^{(l)}}$
• WLOG, we use $w_{ij}^{(l)}$ to demonstrate
• Forward pass:
20
Appendix – The Problem of Long-Term Dependencies
$\dfrac{\partial z_i^{(l)}}{\partial w_{ij}^{(l)}} = \begin{cases} x_j, & \text{if } l = 1 \\ a_j^{(l-1)}, & \text{if } l > 1 \end{cases}$
$\dfrac{\partial C(\theta)}{\partial w_{ij}^{(l)}} = \dfrac{\partial C(\theta)}{\partial z_i^{(l)}}\,\dfrac{\partial z_i^{(l)}}{\partial w_{ij}^{(l)}}$
• WLOG, we use $w_{ij}^{(l)}$ to demonstrate
• Backward pass:
21
Appendix – The Problem of Long-Term Dependencies
Output layer ($l = L$):
$\delta_i^{(L)} = \dfrac{\partial C(\theta)}{\partial z_i^{(L)}} = \dfrac{\partial C(\theta)}{\partial y_i}\,\dfrac{\partial y_i}{\partial z_i^{(L)}} = \dfrac{\partial C(\theta)}{\partial y_i}\,\dfrac{\partial a_i^{(L)}}{\partial z_i^{(L)}} = \dfrac{\partial C(\theta)}{\partial y_i}\,\dfrac{\partial \sigma(z_i^{(L)})}{\partial z_i^{(L)}} = \dfrac{\partial C(\theta)}{\partial y_i}\,\sigma'(z_i^{(L)})$
Hidden layers ($l < L$):
$\delta_i^{(l)} = \dfrac{\partial C(\theta)}{\partial z_i^{(l)}} = \sum_k \dfrac{\partial C(\theta)}{\partial z_k^{(l+1)}}\,\dfrac{\partial z_k^{(l+1)}}{\partial a_i^{(l)}}\,\dfrac{\partial a_i^{(l)}}{\partial z_i^{(l)}} = \dfrac{\partial a_i^{(l)}}{\partial z_i^{(l)}}\sum_k \delta_k^{(l+1)}\,\dfrac{\partial z_k^{(l+1)}}{\partial a_i^{(l)}} = \sigma'(z_i^{(l)})\sum_k \delta_k^{(l+1)}\,w_{ki}^{(l+1)}$
Putting both cases together:
$\delta_i^{(l)} \triangleq \dfrac{\partial C(\theta)}{\partial z_i^{(l)}} = \begin{cases} \sigma'(z_i^{(L)})\,\dfrac{\partial C(\theta)}{\partial y_i}, & \text{if } l = L \\[4pt] \sigma'(z_i^{(l)})\,\displaystyle\sum_k \delta_k^{(l+1)}\,w_{ki}^{(l+1)}, & \text{if } l < L \end{cases}$
$\dfrac{\partial C(\theta)}{\partial w_{ij}^{(l)}} = \dfrac{\partial C(\theta)}{\partial z_i^{(l)}}\,\dfrac{\partial z_i^{(l)}}{\partial w_{ij}^{(l)}}$
• WLOG, we use $w_{ij}^{(l)}$ to demonstrate
• Backward pass:
22
Appendix – The Problem of Long-Term Dependencies
$\delta_i^{(l)} \triangleq \dfrac{\partial C(\theta)}{\partial z_i^{(l)}} = \begin{cases} \sigma'(z_i^{(L)})\,\dfrac{\partial C(\theta)}{\partial y_i}, & \text{if } l = L \\[4pt] \sigma'(z_i^{(l)})\,\displaystyle\sum_k \delta_k^{(l+1)}\,w_{ki}^{(l+1)}, & \text{if } l < L \end{cases}$
• Concluding Remarks for Backpropagation (BP):
$\dfrac{\partial C(\theta)}{\partial w_{ij}^{(l)}} = \underbrace{\dfrac{\partial C(\theta)}{\partial z_i^{(l)}}}_{\text{from the backward pass}}\;\underbrace{\dfrac{\partial z_i^{(l)}}{\partial w_{ij}^{(l)}}}_{\text{from the forward pass}}$
23
Appendix – The Problem of Long-Term Dependencies
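As a concrete companion to the BP recursion above, here is a minimal NumPy sketch of the forward and backward passes for a fully-connected network; the sigmoid activation at every layer and the squared-error cost are illustrative assumptions, not part of the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, weights, biases):
    """Compute dC/dW^l and dC/db^l via delta^L = dC/dy * sigma'(z^L) and
    delta^l = sigma'(z^l) * (W^{l+1}.T @ delta^{l+1}), with C = 0.5 * ||a^L - y||^2."""
    # forward pass: cache every z^l and a^l
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # backward pass: the delta recursion from the slides
    grads_W, grads_b = [], []
    delta = (activations[-1] - y) * sigmoid(zs[-1]) * (1.0 - sigmoid(zs[-1]))  # delta^L
    for l in range(len(weights) - 1, -1, -1):
        grads_W.insert(0, np.outer(delta, activations[l]))  # dC/dW^l = delta^l (a^{l-1})^T
        grads_b.insert(0, delta)
        if l > 0:
            delta = (weights[l].T @ delta) * sigmoid(zs[l - 1]) * (1.0 - sigmoid(zs[l - 1]))
    return grads_W, grads_b

# toy usage (shapes assumed): a 4-3-2 network
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
biases = [np.zeros(3), np.zeros(2)]
gW, gb = backprop(rng.normal(size=4), np.array([1.0, 0.0]), weights, biases)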
• Recap: Recurrent Neural Network (RNN) Architecture
• Model Training:
• All model parameters $\theta = \{U, V, W\}$ can be updated by gradient descent
24
Appendix – The Problem of Long-Term Dependencies
$s_t = \sigma(U x_t + W s_{t-1})$
$o_t = \mathrm{softmax}(V s_t)$
$\theta^{(i+1)} \leftarrow \theta^{(i)} - \eta\,\nabla_\theta C(\theta^{(i)})$
25
Appendix – The Problem of Long-Term Dependencies
$s_t = \sigma(U x_t + W s_{t-1}), \quad o_t = \mathrm{softmax}(V s_t)$
For $\theta = \{U, V, W\}$, update $\theta^{(i+1)} \leftarrow \theta^{(i)} - \eta\,\nabla_\theta C(\theta^{(i)})$, where $C \triangleq \sum_t C^{(t)}$; WLOG, we use $C^{(3)}$ to explain.
• If $\theta$ were NOT tied across time steps (a separate $W^{(t)}, U^{(t)}, V^{(t)}$ per step):
$W^{(t)} \leftarrow W^{(t)} - \dfrac{\partial C^{(3)}(\theta)}{\partial W^{(t)}}, \quad U^{(t)} \leftarrow U^{(t)} - \dfrac{\partial C^{(3)}(\theta)}{\partial U^{(t)}} \;\; (t = 1, 2, 3), \qquad V^{(3)} \leftarrow V^{(3)} - \dfrac{\partial C^{(3)}(\theta)}{\partial V^{(3)}}$
• With $\theta$ tied (as in the RNN), the per-step contributions sum into a single update per parameter:
$W \leftarrow W - \sum_{t=1}^{3}\dfrac{\partial C^{(3)}(\theta)}{\partial W^{(t)}}, \qquad U \leftarrow U - \sum_{t=1}^{3}\dfrac{\partial C^{(3)}(\theta)}{\partial U^{(t)}}, \qquad V \leftarrow V - \dfrac{\partial C^{(3)}(\theta)}{\partial V^{(3)}}$
26
Appendix – The Problem of Long-Term Dependencies
$s_t = \sigma(U x_t + W s_{t-1}), \quad o_t = \mathrm{softmax}(V s_t)$
For $\theta = \{U, V, W\}$, update $\theta^{(i+1)} \leftarrow \theta^{(i)} - \eta\,\nabla_\theta C(\theta^{(i)})$, where $C \triangleq \sum_t C^{(t)}$; WLOG, we use $C^{(3)}$ to explain.
• With $\theta$ tied, each gradient expands through the chain of hidden states:
$\dfrac{\partial C^{(3)}(\theta)}{\partial W} = \dfrac{\partial C^{(3)}(\theta)}{\partial o_3}\,\dfrac{\partial o_3}{\partial s_3}\,\dfrac{\partial s_3}{\partial W} = \sum_{k=0}^{3}\dfrac{\partial C^{(3)}(\theta)}{\partial o_3}\,\dfrac{\partial o_3}{\partial s_3}\,\dfrac{\partial s_3}{\partial s_k}\,\dfrac{\partial s_k}{\partial W} = \sum_{k=0}^{3}\dfrac{\partial C^{(3)}(\theta)}{\partial o_3}\,\dfrac{\partial o_3}{\partial s_3}\left(\prod_{j=k+1}^{3}\dfrac{\partial s_j}{\partial s_{j-1}}\right)\dfrac{\partial s_k}{\partial W}$
$\dfrac{\partial C^{(3)}(\theta)}{\partial U} = \dfrac{\partial C^{(3)}(\theta)}{\partial o_3}\,\dfrac{\partial o_3}{\partial s_3}\,\dfrac{\partial s_3}{\partial U} = \sum_{k=1}^{3}\dfrac{\partial C^{(3)}(\theta)}{\partial o_3}\,\dfrac{\partial o_3}{\partial s_3}\left(\prod_{j=k+1}^{3}\dfrac{\partial s_j}{\partial s_{j-1}}\right)\dfrac{\partial s_k}{\partial U}$
$\dfrac{\partial C^{(3)}(\theta)}{\partial V} = \dfrac{\partial C^{(3)}(\theta)}{\partial o_3}\,\dfrac{\partial o_3}{\partial V}$
27
Appendix – The Problem of Long-Term Dependencies
Each factor $\partial s_j / \partial s_{j-1}$ is a Jacobian matrix:
$\dfrac{\partial s_3}{\partial s_k} = \prod_{j=k+1}^{3}\dfrac{\partial s_j}{\partial s_{j-1}} = \prod_{j=k+1}^{3} W^{T}\,\mathrm{diag}\!\left(\sigma'(s_{j-1})\right)$
These products appear in the gradients from the previous slide:
$\dfrac{\partial C^{(3)}(\theta)}{\partial W} = \sum_{k=0}^{3}\dfrac{\partial C^{(3)}(\theta)}{\partial o_3}\,\dfrac{\partial o_3}{\partial s_3}\left(\prod_{j=k+1}^{3}\dfrac{\partial s_j}{\partial s_{j-1}}\right)\dfrac{\partial s_k}{\partial W}, \qquad \dfrac{\partial C^{(3)}(\theta)}{\partial U} = \sum_{k=1}^{3}\dfrac{\partial C^{(3)}(\theta)}{\partial o_3}\,\dfrac{\partial o_3}{\partial s_3}\left(\prod_{j=k+1}^{3}\dfrac{\partial s_j}{\partial s_{j-1}}\right)\dfrac{\partial s_k}{\partial U}, \qquad \dfrac{\partial C^{(3)}(\theta)}{\partial V} = \dfrac{\partial C^{(3)}(\theta)}{\partial o_3}\,\dfrac{\partial o_3}{\partial V}$
with $s_t = \sigma(U x_t + W s_{t-1})$, $o_t = \mathrm{softmax}(V s_t)$, and $\theta^{(i+1)} \leftarrow \theta^{(i)} - \eta\,\nabla_\theta C(\theta^{(i)})$.
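To see numerically why repeatedly multiplying by $W^T\,\mathrm{diag}(\sigma'(s_{j-1}))$ makes the gradient vanish or explode, here is a small sketch; the random recurrent matrices, the sequence length, and the use of 0.25 (the maximum of the sigmoid derivative) are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
d, T = 16, 50
for scale in (1.0, 4.0, 16.0):          # small / borderline / large recurrent weights
    W = scale * rng.normal(size=(d, d)) / np.sqrt(d)
    g = np.ones(d)                      # stand-in for the upstream gradient dC/ds_T
    for _ in range(T):
        g = W.T @ (0.25 * g)            # one factor per step back; diag(sigma') bounded by 0.25*I
    print(f"scale={scale}: gradient norm after {T} steps = {np.linalg.norm(g):.3e}")

# typical behaviour: the norm collapses toward 0 for small weights (vanishing)
# and blows up for large weights (exploding)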
• Understand the difficulty of training recurrent neural networks
• Gradient Exploding
• Gradient Vanishing
• One possible solution to the gradient-vanishing problem is the gating mechanism, which is the key concept behind LSTM
• LSTM can be “deep” if we stack multiple LSTM cells
• Extensions:
• Uni-directional vs. bi-directional
• One-to-one, one-to-many, many-to-one, many-to-many (with or without an encoder-decoder)
28
Conclusions
• Understanding LSTM Networks
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
• Prof. Hung-yi Lee Courses
https://www.youtube.com/watch?v=xCGidAeyS4M
https://www.youtube.com/watch?v=rTqmWlnwz_0
• On the difficulty of training recurrent neural networks
https://arxiv.org/abs/1211.5063
• UDACITY Courses: Intro to Deep Learning with PyTorch
https://classroom.udacity.com/courses/ud188
29
References
30
Thanks for listening.