Recurrent Neural
Networks
Daniel Thorngren
Sharath T.S.
Shubhangi Tandon
Motivation
● Translational Invariance
● Deep computational graph:
[Diagram: an RNN reading the character stream "T O B E O R ..." and producing outputs y(0), y(1), y(2), y(3), y(4), ..., y(t).]
Computation Graphs
● Different sources use different systems.
This is the system the book uses.
● Nodes are variables
● Connections are functions
● A variable is computed using all of the
connections pointing towards it
● Can compute derivatives by applying
chain rule, working backwards through
the graph.
NOT ALL GRAPHS FOLLOW THESE RULES!
>:(
[Diagram: a small example graph with input x, weight W, hidden variable h, output y, target ŷ, and loss L.]
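As a toy illustration of computing derivatives by the chain rule on such a graph (the scalar values and squared-error loss here are illustrative assumptions, echoing the node names x, h, W, y, L in the diagram):

import numpy as np

# Forward pass through a tiny graph: x -> h -> y, compared to target yhat in loss L.
x, W, yhat = 0.5, 1.3, 0.2
h = np.tanh(x)              # hidden variable
y = W * h                   # prediction
L = 0.5 * (y - yhat) ** 2   # loss node

# Backward pass: apply the chain rule from L back towards x.
dL_dy = y - yhat
dL_dW = dL_dy * h
dL_dh = dL_dy * W
dL_dx = dL_dh * (1 - h ** 2)   # tanh'(x) = 1 - tanh(x)^2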
RNN Computation Graphs - Unfolding
[Diagram: the folded RNN graph and its unfolded version across inputs x(0), x(1), x(2), ..., x(t). Legend: y = output, L = loss function, ŷ = truth, h = hidden layer, x = input; a square on a connection indicates that it feeds the next timestep.]
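Written out, the unfolded graph corresponds to the standard recurrence below (a common parameterization with tanh hidden units and a softmax output; the symbols U, W, V, b, c for the input-to-hidden, hidden-to-hidden, and hidden-to-output weights and biases are assumptions borrowed from the book's notation, while y and ŷ follow the legend above, i.e. y is the output and ŷ the truth):

h^{(t)} = \tanh\big(b + W h^{(t-1)} + U x^{(t)}\big)
y^{(t)} = \mathrm{softmax}\big(c + V h^{(t)}\big)
L = \sum_t L^{(t)}\big(y^{(t)}, \hat{y}^{(t)}\big)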
Common Design Patterns
[Diagram: three recurrent design patterns: (1) standard, with hidden-to-hidden recurrence and an output at every step; (2) recurrence carried only through the output, so the output is the only information passed forward; (3) hidden-to-hidden recurrence with a single output computed at the end of the sequence.]
Training
● Compute forward in time, work backwards for gradient.
○ Exploding/vanishing gradient problem
● Teacher forcing:
○ Feed the true previous outputs into the hidden layer during training, instead of the model's own outputs
[Diagram: teacher forcing. At training time the true output y(t) is fed into the hidden layer at step t+1 and the loss L compares the model's prediction ŷ(t) with the truth; at test time the model's own output is fed back instead.]
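A minimal sketch of the difference between the two regimes (step_fn is a hypothetical cell function taking the current input, the previously fed-back output, and the hidden state; it is not from the slides):

# Teacher forcing (training): feed the ground-truth previous output back in.
def train_pass(x_seq, y_true, h, step_fn):
    outputs, prev_out = [], None
    for x, y in zip(x_seq, y_true):
        y_hat, h = step_fn(x, prev_out, h)
        outputs.append(y_hat)
        prev_out = y            # <-- true output, not the model's prediction
    return outputs, h

# Free-running (test time): feed the model's own previous output back in.
def test_pass(x_seq, h, step_fn):
    outputs, prev_out = [], None
    for x in x_seq:
        y_hat, h = step_fn(x, prev_out, h)
        outputs.append(y_hat)
        prev_out = y_hat        # <-- model's own prediction
    return outputs, h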
Recursive Neural Nets
● Map a sequence to a tree, and reduce
the tree one layer at a time until you
reach a single point, your output
● Many choices of how to arrange the
tree.
[Diagram: a recursive network combining inputs x(1)...x(4) pairwise with shared weight matrices U and W up to a single root, whose output y is compared with the target ŷ in the loss L.]
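A minimal sketch of one possible tree arrangement (a balanced pairwise reduction; the combine rule and weight shapes are illustrative assumptions):

import numpy as np

U, W = np.random.randn(8, 8), np.random.randn(8, 8)

def combine(left, right):
    """Merge two child representations into one parent node with shared weights."""
    return np.tanh(U @ left + W @ right)

def reduce_tree(nodes):
    """Reduce a sequence pairwise, layer by layer, until a single root remains."""
    while len(nodes) > 1:
        paired = [combine(nodes[i], nodes[i + 1]) for i in range(0, len(nodes) - 1, 2)]
        if len(nodes) % 2:                 # carry an unpaired node up unchanged
            paired.append(nodes[-1])
        nodes = paired
    return nodes[0]

xs = [np.random.randn(8) for _ in range(4)]   # x(1) ... x(4)
root = reduce_tree(xs)                        # single point -> feeds the output/loss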
Deep Recurrent Neural Nets
Can add depth to any of the stages mentioned:
- Multiple recurrent layers
- Extra input, output, and hidden-layer processing
- Plus direct (skip) connections between hidden layers
[Diagram: the three deep-RNN variants above, with stacked hidden layers h1 and h2.]
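A minimal sketch of the "multiple recurrent layers" variant (layer sizes and tanh units are illustrative assumptions):

import numpy as np

n_in, n_h1, n_h2, n_out, T = 4, 8, 8, 3, 10
rng = np.random.default_rng(0)
Wx, W1 = rng.normal(0, 0.1, (n_h1, n_in)), rng.normal(0, 0.1, (n_h1, n_h1))
U,  W2 = rng.normal(0, 0.1, (n_h2, n_h1)), rng.normal(0, 0.1, (n_h2, n_h2))
V      = rng.normal(0, 0.1, (n_out, n_h2))

h1, h2 = np.zeros(n_h1), np.zeros(n_h2)
for x in rng.normal(size=(T, n_in)):
    h1 = np.tanh(Wx @ x + W1 @ h1)    # first recurrent layer
    h2 = np.tanh(U @ h1 + W2 @ h2)    # second recurrent layer stacked on top
    y = V @ h2                        # output read from the top layer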
What makes Recurrent Networks so special? Sequences!
(1) Vanilla mode without an RNN (e.g. image classification).
(2) Sequence output (e.g. image captioning).
(3) Sequence input (e.g. sentiment analysis).
(4) Sequence input and sequence output (e.g. machine translation).
(5) Synced sequence input and output (e.g. video classification, labeling each frame).
The unreasonable effectiveness of RNNs
- Character-level language model
- An LSTM trained on the text of Leo Tolstoy's War and Peace
- Sample outputs shown after 100, 300, 700, and 2000 training iterations
Challenges of Vanishing and Exploding Gradients
Hidden-state recurrence relation, analysed via the power method (the equation was an image on the slide; see the reconstruction after the list below)
- A spectral radius of the recurrent weight matrix greater than 1 makes the gradient explode; less than 1 makes it vanish
- The variance is multiplied at every cell (i.e. at every timestep)
- For feed-forward networks of fixed depth n:
  - To obtain some desired variance v* at the output, choose the individual weights with variance v = (v*)^(1/n), i.e. the n-th root of v*
  - Such carefully chosen scaling can avoid the vanishing and exploding gradient problem
- For RNNs the same weights are reused at every timestep, so this trick is unavailable and long-term dependencies cannot be captured effectively
  - The gradient of a long-term interaction has exponentially smaller magnitude than that of a short-term interaction
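The recurrence relation named in the slide title can be reconstructed as follows, for the simplified linear case with no inputs or nonlinearity (following chapter 10 of the Deep Learning book):

h^{(t)} = W^\top h^{(t-1)} = \big(W^t\big)^\top h^{(0)}

If W admits an eigendecomposition W = Q \Lambda Q^\top with orthogonal Q, then

h^{(t)} = Q^\top \Lambda^t Q \, h^{(0)}

so, as in the power method, components of h^{(0)} aligned with eigenvalues of magnitude greater than 1 blow up while the rest decay towards zero; the same \Lambda^t factor scales gradients during back-propagation, which is the source of the exploding/vanishing behaviour.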
- After a forward pass, the slopes of the non-linearities are fixed
- Back-propagation then behaves like a linear system whose coefficients (the slopes of the non-linearities) were fixed by the forward pass, so derivatives can shrink or grow geometrically across many steps
Loss function of a char-level RNN
(After Karpathy's min-char-rnn; assumes NumPy imported as np, and that the parameters Wxh, Whh, Why, bh, by and vocab_size are defined at module level.)

def lossFun(inputs, targets, hprev):
  """
  inputs, targets are both lists of integers.
  hprev is a Hx1 array holding the initial hidden state.
  Returns the loss, gradients on model parameters, and last hidden state.
  """
  xs, hs, ys, ps = {}, {}, {}, {}
  hs[-1] = np.copy(hprev)
  loss = 0
  # forward pass
  for t in xrange(len(inputs)):
    xs[t] = np.zeros((vocab_size, 1))  # encode in 1-of-k representation
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh)  # hidden state
    ys[t] = np.dot(Why, hs[t]) + by  # unnormalized log probabilities for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t]))  # probabilities for next chars
    loss += -np.log(ps[t][targets[t], 0])  # softmax (cross-entropy loss)
  # backward pass: compute gradients going backwards
  dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
  dbh, dby = np.zeros_like(bh), np.zeros_like(by)
  dhnext = np.zeros_like(hs[0])
  for t in reversed(xrange(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1  # backprop into y; see http://cs231n.github.io/neural-networks-case-study/#grad if confused here
    dWhy += np.dot(dy, hs[t].T)
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext   # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh  # backprop through tanh nonlinearity
    dbh += dhraw
    dWxh += np.dot(dhraw, xs[t].T)
    dWhh += np.dot(dhraw, hs[t-1].T)
    dhnext = np.dot(Whh.T, dhraw)
  for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -5, 5, out=dparam)  # clip to mitigate exploding gradients
  return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]
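For context, a minimal training-loop sketch showing how lossFun might be called (the names data, char_to_ix, hidden_size, and the Adagrad hyperparameters are illustrative assumptions in the spirit of min-char-rnn, not part of the original slides):

# Hypothetical setup: `data` is the training text, `char_to_ix` maps chars to ints.
seq_length, learning_rate = 25, 1e-1
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by)   # Adagrad memory
hprev = np.zeros((hidden_size, 1))                # initial hidden state
p = 0                                             # position in the training text
while p + seq_length + 1 < len(data):
    inputs  = [char_to_ix[ch] for ch in data[p:p+seq_length]]
    targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]
    loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
    for param, dparam, mem in zip([Wxh, Whh, Why, bh, by],
                                  [dWxh, dWhh, dWhy, dbh, dby],
                                  [mWxh, mWhh, mWhy, mbh, mby]):
        mem += dparam * dparam
        param += -learning_rate * dparam / np.sqrt(mem + 1e-8)  # Adagrad update
    p += seq_length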
Understanding gradient flow dynamics
How to overcome gradient issues?
Remedial strategies #1
- Gradient clipping, for exploding gradients (see the sketch after this list)
- Skip connections through time
  - Add direct connections with an integer-valued skip length
  - Example: residual/skip connections as in ResNet
- Leaky units
  - A linear self-connection lets the effects of remembering and forgetting be adapted more smoothly and flexibly by adjusting a real-valued α rather than an integer-valued skip length
  - α can be sampled from a distribution or learned
- Removing connections
  - Replace short connections with longer ones so that units interact with both far-off and nearby information
  - Explicit, discrete updates take place at different times, with a different frequency for different groups of units
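A minimal sketch of gradient-norm clipping, assuming NumPy and a list of gradient arrays like the ones returned by lossFun above (the threshold of 5.0 is an illustrative choice):

import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale gradient arrays so that their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g * g) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-8)
        grads = [g * scale for g in grads]
    return grads

# Usage: dWxh, dWhh, dWhy, dbh, dby = clip_by_global_norm([dWxh, dWhh, dWhy, dbh, dby])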
Remedial strategies #2
- Regularization to maintain information flow
  - Require the gradient being back-propagated at any time step t to be similar in magnitude to the gradient of the loss at the final step, e.g. via the regularizer (the equation on the slide was an image; reconstructed here):

    \Omega = \sum_t \left( \frac{\left\| \left( \nabla_{h^{(t)}} L \right) \frac{\partial h^{(t)}}{\partial h^{(t-1)}} \right\|}{\left\| \nabla_{h^{(t)}} L \right\|} - 1 \right)^2

  - For easy gradient computation, \nabla_{h^{(t)}} L is treated as a constant
  - Doesn't perform as well as the LSTM when data is abundant
  - Perhaps because treating the gradient as a constant doesn't scale well
Echo State Networks
- Recurrent and input weights are fixed; only the output weights are learnable
- Relies on the idea that a big, random expansion of the input vector can often make it easy for a linear model to fit the data
- Fix the recurrent weights to have some spectral radius such as 3; the state does not explode, thanks to the stabilizing effect of saturating nonlinearities like tanh
- Sparse connectivity: very few non-zero values in the hidden-to-hidden weights
  - Creates loosely coupled oscillators, so information can hang around in a particular part of the network
- It is important to choose the scale of the input-to-hidden connections: they need to drive the states of the loosely coupled oscillators, but they mustn't wipe out the information those oscillators hold about the recent history
- Echo state networks can also be used to initialize the weights of a fully trainable recurrent network (Sutskever, 2012; Sutskever et al., 2013)
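A minimal NumPy sketch of an echo state network, under illustrative assumptions (reservoir size, spectral-radius target, sparsity level, and the ridge-regression readout are choices made for this example, not prescriptions from the slides):

import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, n_out = 3, 200, 1

# Fixed, sparse recurrent weights, rescaled to a chosen spectral radius
# (here 1.2; the slide notes even larger values such as 3 are stabilized by tanh).
W = rng.normal(size=(n_res, n_res)) * (rng.random((n_res, n_res)) < 0.05)
W *= 1.2 / np.max(np.abs(np.linalg.eigvals(W)))
W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))   # fixed input scaling

def run_reservoir(X):
    """Collect reservoir states for an input sequence X of shape (T, n_in)."""
    h = np.zeros(n_res)
    states = []
    for x in X:
        h = np.tanh(W_in @ x + W @ h)
        states.append(h.copy())
    return np.array(states)

# Only the linear readout W_out is trained (ridge regression).
X, Y = rng.normal(size=(500, n_in)), rng.normal(size=(500, n_out))  # placeholder data
H = run_reservoir(X)
ridge = 1e-2
W_out = np.linalg.solve(H.T @ H + ridge * np.eye(n_res), H.T @ Y)
Y_pred = H @ W_out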
LSTMs
Unrolled RNN
The repeating module in a standard RNN contains
a single layer.
The repeating module in an LSTM contains four
interacting layers.
LSTMs
- Adding even more structure
- LSTM: an RNN cell with 4 gates (i, f, o, g) that control how information is retained
- The input value can be accumulated into the cell state if the sigmoidal input gate allows it
- The cell state unit has a linear self-loop whose weight is controlled by the forget gate
- The output of the cell can be shut off by the output gate
- All the gating units have a sigmoid nonlinearity, while the 'g' gate can have any squashing nonlinearity
- The i and g gates interact multiplicatively:
  - g: what value between -1 and 1 should be added to the cell state
  - i: whether to go through with that write
- Forget gate: can kill gradients in the LSTM if set to zero; initialize its bias to 1 at the start so gradients flow nicely and the LSTM learns to shut or open the gate whenever it wants
- The cell state unit can also be used as an extra input to the gating units (peephole connections)
LSTM - Equations
(The forget, input, update, and output equations were shown as an image; a standard reconstruction follows.)
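Reconstructed in the notation of colah's "Understanding LSTMs" post (listed in the references), with σ the logistic sigmoid, ⊙ elementwise multiplication, and W_*, b_* the per-gate weights and biases:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)            (forget gate)
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)            (input gate)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)     (candidate values, the 'g' gate above)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t         (cell-state update)
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)            (output gate)
h_t = o_t \odot \tanh(C_t)                              (hidden state / output)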
Gated Recurrent Units
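The GRU diagram on this slide was an image. For reference, the formulation from colah's post (note that papers differ on which branch the update gate z_t scales) is:

z_t = \sigma(W_z \cdot [h_{t-1}, x_t])                  (update gate)
r_t = \sigma(W_r \cdot [h_{t-1}, x_t])                  (reset gate)
\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])   (candidate state)
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t   (interpolated update)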
LSTM : Search Space Odyssey
- 2015 paper by Greff et al. (arXiv:1503.04069, in the references)
- Compares eight variants of the LSTM architecture against the vanilla LSTM, including:
  - Coupled input and forget gates (GRU-like)
  - Without peephole connections
  - Without the output gate
  - Without the cell input/output non-linearities, etc.
- About 5,200 training runs in total, roughly 15 years of CPU time
- No variant gave a major improvement: the classic LSTM architecture works about as well as the other versions
Encoder Decoder Frameworks : Seq2Seq
Sequence to Sequence with Attention - NMT
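Both the seq2seq and attention slides were diagrams. For reference, the attention mechanism of Bahdanau et al. (2015) (arXiv:1409.0473, listed in the references) computes, for decoder step i, alignment scores over the encoder states h_j, a softmax weighting, and a context vector:

e_{ij} = a(s_{i-1}, h_j)
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}
c_i = \sum_j \alpha_{ij} h_j

where s_{i-1} is the previous decoder state and a(\cdot) is a small feed-forward scoring network; the decoder conditions on c_i when producing output i.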
Explicit Memory
● Motivation
○ Some knowledge can be implicit, subconscious, and difficult to verbalize
■ Ex - how a dog looks different from a cat.
○ It can also be explicit, declarative and straightforward to put into words
■ Ex - everyday commonsense knowledge -> a cat is a kind of animal
■ Ex - Very specific facts -> "the meeting with the sales team is at 3:00 PM, room 141."
○ Neural networks excel at storing implicit knowledge but struggle to memorize facts
■ SGD requires a sample to be repeated several times for a NN to memorize it, and even then not precisely (Graves et al., 2014b)
○ Such explicit memory allows systems to rapidly and intentionally store and retrieve
specific facts and to sequentially reason with them.
Memory Networks
● Memory networks include a set of memory cells that can be accessed via an addressing
mechanism.
○ Originally required a supervision signal instructing them how to use their memory cells (Weston et al., 2014)
○ Graves et al. (2014b) introduced neural Turing machines (NTMs)
■ Able to learn to read from and write arbitrary content to memory cells without explicit supervision about which actions to undertake
■ Allow end-to-end training using a content-based soft attention mechanism (Bahdanau et al., 2015)
Memory Networks
● Soft addressing (content-based)
○ Each cell's state is a vector; the weight used to read from or write to a cell is a function of that cell's content
■ The weights can be produced with a softmax across all cells
○ A complete vector-valued memory can be retrieved even if the query pattern matches only some of its elements (see the sketch below)
● Hard addressing (location-based)
○ Output a discrete memory location, or treat the weights as probabilities and choose one particular cell to read from or write to
○ Requires specialized optimization algorithms
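A minimal sketch of content-based soft addressing over a memory matrix (the cosine-similarity scoring and the sharpening parameter beta are illustrative assumptions in the spirit of the NTM addressing mechanism):

import numpy as np

def soft_read(memory, key, beta=5.0):
    """Read from memory (N cells x D dims) with a softmax over cosine similarities to `key`."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w = np.exp(beta * sims)
    w /= w.sum()                  # soft addressing weights across all cells
    return w @ memory, w          # weighted read vector and the weights

memory = np.random.randn(8, 16)                 # 8 cells, 16-dimensional contents
query = memory[3] + 0.1 * np.random.randn(16)   # partial/noisy match to cell 3
read_vec, weights = soft_read(memory, query)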
Memory Networks
Resources and References
- http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- https://www.youtube.com/watch?v=iX5V1WpxxkY&list=PL16j5WbGpaM0_Tj8CRmurZ8Kk1gEBc7fg&index=10
- https://www.coursera.org/learn/neural-networks/home/week/7
- https://www.coursera.org/learn/neural-networks/home/week/8
- http://cs231n.stanford.edu/slides/2016/winter1516_lecture10.pdf
- https://arxiv.org/abs/1409.0473
- https://arxiv.org/pdf/1308.0850.pdf
- https://arxiv.org/pdf/1503.04069.pdf
Thank you !!
Optimisation for Long term dependencies
- Problem
  - Whenever the model can represent long-term dependencies, the gradient of a long-term interaction has exponentially smaller magnitude than the gradient of a short-term interaction
  - This does not make learning impossible, but it can take a very long time to learn long-term dependencies
  - Gradient-based optimization becomes increasingly difficult, with the probability of successful training rapidly approaching 0 for sequences of only length 10 or 20 (Bengio et al., 1994)
- Leaky units & multiple time scales
  - Skip connections through time
  - Leaky units: the linear self-connection approach allows this effect to be adapted more smoothly and flexibly by adjusting the real-valued α rather than the integer-valued skip length (see the equation below)
  - Removing connections
- Gradient clipping
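One common way to write the leaky-unit update referred to above (h̃ here denotes the ordinary nonlinear update computed from x^{(t)} and h^{(t-1)}; the symbol is an illustrative choice):

h^{(t)} = \alpha \, h^{(t-1)} + (1 - \alpha) \, \tilde{h}^{(t)}, \qquad 0 < \alpha < 1

An α near 1 lets information persist across many timesteps; an α near 0 lets the state be replaced quickly. As noted in the remedial-strategies slide, α can be fixed, sampled from a distribution, or learned.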
