The document discusses recurrent neural networks (RNNs) and their ability to handle sequence data. It points to resources on LSTMs and the effectiveness of RNNs, describes five types of sequence problems RNNs can address, and explains why RNNs are well suited to sequences. Challenges such as vanishing and exploding gradients are discussed, along with strategies like LSTM gates, gradient clipping and leaky units that help with learning long-term dependencies. Memory networks and their soft addressing mechanisms are also summarized.
3. (1) Vanilla mode without an RNN: fixed-size input to fixed-size output (e.g. image classification).
(2) Sequence output (e.g. image captioning).
(3) Sequence input (e.g. sentiment analysis).
(4) Sequence input and sequence output (e.g. machine translation).
(5) Synced sequence input and output (e.g. video classification, labelling each frame).
What makes Recurrent Networks so special? Sequences!
4. The unreasonable effectiveness of RNNs
- Character-level language model (a sampling sketch follows below)
- An LSTM trained on Leo Tolstoy’s War and Peace
- Sample outputs after 100, 300, 700 and 2000 training iterations
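A minimal sketch of how such character-level samples are typically drawn from a trained model. The parameter names (Wxh, Whh, Why, bh, by, vocab_size) follow the min-char-rnn convention used in the loss function on slide 7; the function itself is an illustrative assumption, not something shown on the slide.

import numpy as np

def sample_chars(h, seed_ix, n, Wxh, Whh, Why, bh, by, vocab_size):
    """Sample n character indices from the model, starting from hidden
    state h and the seed character index seed_ix."""
    x = np.zeros((vocab_size, 1))
    x[seed_ix] = 1
    ixes = []
    for _ in range(n):
        h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)   # recurrence
        y = np.dot(Why, h) + by                              # unnormalized log-probs
        p = np.exp(y) / np.sum(np.exp(y))                    # softmax over next chars
        ix = np.random.choice(vocab_size, p=p.ravel())       # sample the next char
        x = np.zeros((vocab_size, 1))
        x[ix] = 1
        ixes.append(ix)
    return ixes  # map back to characters with an ix_to_char dictionary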
5. Challenges of Vanishing and Exploding Gradients
Hidden state recurrence relation h_t = W^T h_{t-1}, analyzed using the power method
- The spectral radius of W makes the gradient explode (if > 1) or vanish (if < 1), as illustrated in the sketch below
- Variance multiplies at every cell (or timestep)
- For feed-forward networks of fixed size (depth n):
- to obtain some desired variance v*, choose the individual weights with variance v = (v*)^(1/n), the n-th root of v*
- carefully chosen scaling can avoid the vanishing and exploding gradient problem
- For RNNs, which reuse the same weights at every timestep, this means we cannot effectively capture long-term dependencies.
- The gradient of a long-term interaction has exponentially smaller magnitude than that of a short-term
interaction
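A toy NumPy illustration of this point (hypothetical code, not from the slides): repeatedly multiplying a gradient vector by the same recurrent Jacobian, as backpropagation through time does, makes its norm shrink or grow exponentially depending on the spectral radius.

import numpy as np

def grad_norm_after_T_steps(spectral_radius, T=50, n=32, seed=0):
    """Scale a random matrix W to a given spectral radius, then apply
    W^T repeatedly to a vector, as backprop through time would."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, n))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))  # set spectral radius
    g = rng.standard_normal(n)
    for _ in range(T):
        g = W.T @ g            # one backward step through the linear recurrence
    return np.linalg.norm(g)

for rho in (0.9, 1.0, 1.1):
    print(rho, grad_norm_after_T_steps(rho))
# norms shrink toward 0 for rho < 1 and blow up for rho > 1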
6. - After a forward pass, the gradients (local slopes) of the non-linearities are fixed.
- Backpropagation is then like going forwards through a linear system in which the slope of the
non-linearity has been fixed.
7. Loss function of a char-level RNN
def lossFun(inputs, targets, hprev):
  """
  inputs, targets are both lists of integers.
  hprev is Hx1 array of initial hidden state
  returns the loss, gradients on model parameters, and last hidden state
  assumes numpy imported as np and Wxh, Whh, Why, bh, by, vocab_size defined as globals
  """
  xs, hs, ys, ps = {}, {}, {}, {}
  hs[-1] = np.copy(hprev)
  loss = 0
  # forward pass
  for t in range(len(inputs)):
    xs[t] = np.zeros((vocab_size, 1))  # encode in 1-of-k representation
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh)  # hidden state
    ys[t] = np.dot(Why, hs[t]) + by  # unnormalized log probabilities for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t]))  # probabilities for next chars
    loss += -np.log(ps[t][targets[t], 0])  # softmax (cross-entropy) loss
  # backward pass: compute gradients going backwards
  dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
  dbh, dby = np.zeros_like(bh), np.zeros_like(by)
  dhnext = np.zeros_like(hs[0])
  for t in reversed(range(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1  # backprop into y; see http://cs231n.github.io/neural-networks-case-study/#grad if confused here
    dWhy += np.dot(dy, hs[t].T)
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext  # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh  # backprop through tanh nonlinearity
    dbh += dhraw
    dWxh += np.dot(dhraw, xs[t].T)
    dWhh += np.dot(dhraw, hs[t-1].T)
    dhnext = np.dot(Whh.T, dhraw)
  for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -5, 5, out=dparam)  # clip to mitigate exploding gradients
  return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]
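A minimal usage sketch for lossFun, assuming min-char-rnn-style global parameters; the sizes and toy inputs below are illustrative assumptions, not values from the slides. A full training loop would then apply a parameter update (min-char-rnn uses Adagrad) with the already-clipped gradients.

import numpy as np

# hypothetical setup: sizes chosen only for illustration
hidden_size, vocab_size = 100, 65
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # input  -> hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden
Why = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> output
bh  = np.zeros((hidden_size, 1))
by  = np.zeros((vocab_size, 1))

inputs  = [1, 5, 3, 0]   # toy character indices
targets = [5, 3, 0, 2]   # the next character at every position
hprev   = np.zeros((hidden_size, 1))

loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
print(loss)  # roughly len(inputs) * log(vocab_size) for an untrained model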
10. Remedial strategies #1
- Gradient clipping for exploding gradients
- Skip connections (through time)
- Integer-valued skip length
- Example: ResNet applies the same idea across depth
- Leaky units (see the sketch after this list)
- The linear self-connection approach allows the effects of remembering and forgetting to
be adapted more smoothly and flexibly by adjusting the real-valued α rather than by
adjusting an integer-valued skip length.
- α can be sampled from a distribution or learned.
- Removing connections
- The network learns to interact with both far-off and nearby context
- Explicit and discrete updates take place at different times, with a different
frequency for different groups of units
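A minimal sketch of a leaky unit (hypothetical code): the unit keeps a linear self-connection with weight α, so its state is a running average whose time scale is set by the real-valued α rather than by an integer skip length.

import numpy as np

def leaky_unit(values, alpha=0.9):
    """mu_t = alpha * mu_{t-1} + (1 - alpha) * v_t
    alpha near 1 -> long memory of the past; alpha near 0 -> fast forgetting."""
    mu = 0.0
    trace = []
    for v in values:
        mu = alpha * mu + (1.0 - alpha) * v  # linear self-connection
        trace.append(mu)
    return np.array(trace)

x = np.r_[np.ones(20), np.zeros(20)]      # a step signal
print(leaky_unit(x, alpha=0.95)[-1])      # still remembers the earlier 1s
print(leaky_unit(x, alpha=0.5)[-1])       # has essentially forgotten them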
11. Remedial strategies #2
- Regularization to maintain information flow
- Require the back-propagated gradient at any time step t to be similar in magnitude to the gradient of the
loss at the very last layer, via the regularizer (Pascanu et al., 2013)
Ω = Σ_t ( ‖(∇_h(t) L) · (∂h(t)/∂h(t−1))‖ / ‖∇_h(t) L‖ − 1 )²
- For easy gradient computation, ∇_h(t) L is treated as a constant
- Doesn’t perform as well as leaky units when data is abundant
- Perhaps because the constant-gradient assumption doesn’t scale well.
12. Echo State Networks
- Recurrent and input weights are fixed; only the output weights are learnable (see the sketch after this list).
- Relies on the idea that a big, random expansion of the input vector can often make it easy for a
linear model to fit the data.
- Fix the recurrent weights to have some spectral radius such as 3; the state does not explode, due to the
stabilizing effect of saturating nonlinearities like tanh.
- Sparse connectivity - very few non-zero values in the hidden-to-hidden weights
- Creates loosely coupled oscillators, so information can hang around in a particular part of
the network.
- Important to choose the scale of the input-to-hidden connections. They need to drive the
states of the loosely coupled oscillators, but they mustn't wipe out the information those
oscillators contain about the recent history.
- Echo states can also be used to initialize the weights in a fully trainable recurrent network (Sutskever, 2012; Sutskever
et al., 2013).
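A minimal echo state network sketch built on these ideas (hypothetical code: the 1-D input series, the ridge-regression readout and the hyperparameter values, including the spectral radius, are assumptions chosen for illustration; only W_out is learned).

import numpy as np

def echo_state_fit(u, y, reservoir_size=200, spectral_radius=0.9,
                   sparsity=0.1, ridge=1e-6, seed=0):
    """Fixed random sparse reservoir; only the linear readout is trained."""
    rng = np.random.default_rng(seed)
    # sparse random hidden-to-hidden weights, rescaled to the target spectral radius
    W = rng.standard_normal((reservoir_size, reservoir_size))
    W *= rng.random((reservoir_size, reservoir_size)) < sparsity
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    W_in = rng.uniform(-0.5, 0.5, reservoir_size)  # input-to-hidden scale matters

    # run the reservoir over the input sequence, collecting states
    x = np.zeros(reservoir_size)
    states = []
    for u_t in u:
        x = np.tanh(W @ x + W_in * u_t)
        states.append(x.copy())
    X = np.array(states)

    # ridge regression for the output weights (the only learned parameters)
    W_out = np.linalg.solve(X.T @ X + ridge * np.eye(reservoir_size), X.T @ y)
    return W_out, X

# toy usage: predict the next value of a sine wave with a linear readout
t = np.linspace(0, 20 * np.pi, 2000)
u, y = np.sin(t[:-1]), np.sin(t[1:])
W_out, X = echo_state_fit(u, y)
print(np.mean((X @ W_out - y) ** 2))  # small training error despite fixed reservoir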
16. LSTMs
- Adding even more structure
- LSTM: an RNN cell with 4 gates that control how information is retained (a step-by-step sketch follows this list)
- The input value can be accumulated into the state if the sigmoidal input gate allows it.
- The cell state unit has a linear self-loop whose weight is controlled by the forget gate.
- The output of the cell can be shut off by the output gate.
- All the gating units have a sigmoid nonlinearity, while the ‘g’ gate can have any squashing
nonlinearity.
- The i and g gates form a multiplicative interaction:
- g - what value between -1 and 1 should be added to the cell state
- i - whether to go through with that write at all
- Forget gate - can kill gradients in the LSTM if set to zero. Initialize its bias to 1 at the start so gradients
flow nicely and the LSTM learns to shut or open the gate whenever it wants.
- The state unit can also be used as an extra input to the gating units (peephole connections).
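A minimal sketch of a single LSTM step with these four gates, assuming concatenated weight matrices Wx, Wh and bias b (hypothetical names; shapes and values chosen only for illustration).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wx, Wh, b):
    """One LSTM timestep. Wx: (4H, D), Wh: (4H, H), b: (4H,)."""
    H = h_prev.shape[0]
    a = Wx @ x + Wh @ h_prev + b          # pre-activations for all 4 gates
    i = sigmoid(a[0:H])                   # input gate: go through with the write?
    f = sigmoid(a[H:2*H])                 # forget gate: weight of the linear self-loop
    o = sigmoid(a[2*H:3*H])               # output gate: expose the cell or shut it off
    g = np.tanh(a[3*H:4*H])               # candidate in (-1, 1) to add to the cell state
    c = f * c_prev + i * g                # cell state: linear self-loop plus gated write
    h = o * np.tanh(c)                    # hidden output
    return h, c

# toy usage with random parameters; the forget-gate bias slice is initialized to 1
D, H = 8, 16
rng = np.random.default_rng(0)
Wx, Wh = rng.standard_normal((4*H, D)) * 0.1, rng.standard_normal((4*H, H)) * 0.1
b = np.zeros(4*H); b[H:2*H] = 1.0
h, c = np.zeros(H), np.zeros(H)
for _ in range(5):
    h, c = lstm_step(rng.standard_normal(D), h, c, Wx, Wh, b)
print(h.shape, c.shape)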
19. LSTM: A Search Space Odyssey
- 2015 paper by Greff et al.
- Compares 8 different variants of the LSTM architecture
- GRUs (coupled input and forget gates)
- Without peephole connections
- Without the output gate
- Without non-linearities at the output and forget gates, etc.
- Around 5200 experimental runs, over 15 years of CPU time
- No variant gave a major improvement in results; the classic LSTM architecture
works about as well as the other versions
22. Explicit Memory
● Motivation
○ Some knowledge can be implicit, subconscious, and difficult to verbalize
■ Ex - how a dog looks different from a cat.
○ It can also be explicit, declarative and straightforward to put into words
■ Ex - everyday commonsense knowledge -> a cat is a kind of animal
■ Ex - Very specific facts -> “the meeting with the sales team is at 3:00 PM, room 141.”
○ Neural networks excel at storing implicit knowledge but struggle to memorize facts
■ SGD requires a sample to be presented several times before a NN memorizes it, and even then
not precisely. (Graves et al., 2014b)
○ Such explicit memory allows systems to rapidly and intentionally store and retrieve
specific facts and to sequentially reason with them.
23. Memory Networks
● Memory networks include a set of memory cells that can be accessed via an addressing
mechanism.
○ Originally required a supervision signal instructing them how to use their memory cells
(Weston et al., 2014)
○ Graves et al. (2014b) introduced NTMs (Neural Turing Machines)
■ able to learn to read from and write arbitrary content to memory cells without
explicit supervision about which actions to undertake
■ allow end-to-end training using a content-based soft attention mechanism
(Bahdanau et al., 2015)
24. Memory Networks
● Soft addressing - (content based) - see the sketch after this list
○ Each cell’s state is a vector; the weight used to read from or write to a cell is a function of that cell’s content.
■ The weights can be produced using a softmax across all cells.
○ We can retrieve a complete vector-valued memory even if we are only able to produce a pattern that
matches some but not all of its elements
● Hard addressing - (location based)
○ Output a discrete memory location, i.e. treat the weights as probabilities and choose one particular
cell to read from or write to
○ Requires specialized optimization algorithms
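A minimal sketch of content-based soft addressing (hypothetical code; the memory/key shapes and the cosine-similarity scoring follow NTM-style attention and are assumptions here, not details from the slides).

import numpy as np

def soft_read(memory, key, beta=20.0):
    """memory: (N, D) array of N cell vectors; key: (D,) query pattern.
    Returns softmax weights over cells and the blended read vector."""
    # cosine similarity between the key and every memory cell
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    scores = beta * sims                  # beta sharpens the focus
    w = np.exp(scores - scores.max())
    w /= w.sum()                          # softmax across all cells
    return w, w @ memory                  # weights and the weighted read

# toy usage: a partial pattern retrieves the full stored vector
memory = np.array([[1.0, 0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0, 1.0],
                   [1.0, 1.0, -1.0, -1.0]])
key = np.array([1.0, 0.0, 0.0, 0.0])      # matches some but not all elements of cell 0
w, read = soft_read(memory, key)
print(np.round(w, 3), np.round(read, 2))  # weight concentrates on cell 0; the read
                                          # approximately recovers its full vector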
27. Optimisation for Long term dependencies
- Problem
- Specifically, whenever the model is able to represent long term dependencies, the
gradient of a long term interaction has exponentially smaller magnitude than the gradient
of a short term interaction.
- It does not mean that it is impossible to learn, but that it might take a very long time to
learn long-term dependencies.
- gradient-based optimization becomes increasingly difficult with the probability of
successful training reaching 0 for sequences of only length 10 or 20
- Leaky units & multiple time scales
- Skip connections through time
- Leaky units - The linear self-connection approach allows this effect to be adapted more
smoothly and flexibly by adjusting the real-valued α rather than by adjusting the
integer-valued skip length.
- Remove connections
- Gradient clipping (a norm-clipping sketch follows below)
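The char-RNN code on slide 7 clips each gradient element to [-5, 5]; another common variant (Pascanu et al., 2013) rescales the whole gradient whenever its norm exceeds a threshold. A minimal sketch of that norm-clipping variant (hypothetical code; the threshold is chosen only for illustration).

import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """If the combined gradient norm exceeds max_norm, rescale all gradients
    so the norm equals max_norm; the update direction is preserved."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-8)
        grads = [g * scale for g in grads]
    return grads, total_norm

# toy usage with exaggeratedly large gradients
grads = [np.full((3, 3), 100.0), np.full((3,), -50.0)]
clipped, norm = clip_by_global_norm(grads, max_norm=5.0)
print(norm, np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # large norm -> 5.0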