4-Recurrent Neural Network
Recurrent Neural Network
Feedforward neural network (FFNN) as MLP
• Information only flows in one direction.
• No sense of time and no memory of previous data.
Time Delay Neural Network (TDNN)
• TDNN is an FFNN designed to classify temporal patterns with shift invariance
(independence from the starting point, so the windowed samples can be treated as i.i.d. for an MLP and shuffled)
in sequential data, much like a 1D convolution with stride 1.
• It is the simplest way of handling time-varying sequential data
and allows conventional backpropagation algorithms to be used.
• TDNN gets its inputs by sliding a window of size n across the time steps of the sequential data
and treating the n consecutive samples as one input vector (see the sketch below).
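As an illustration only (the function name make_windows and the toy data are made up here), a minimal NumPy sketch of building such windowed inputs might look like this:

```python
import numpy as np

def make_windows(series, n):
    """Slide a window of size n over a 1D series; each row of the result
    is one flat feature vector of n consecutive samples (a TDNN input)."""
    return np.stack([series[i:i + n] for i in range(len(series) - n + 1)])

x = np.arange(10, dtype=np.float32)   # toy sequence 0, 1, ..., 9
windows = make_windows(x, n=3)        # shape (8, 3): 8 shift-invariant samples
```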
Disadvantages of TDNN
• The success of TDNN depends on finding an appropriate window size.
- A small window does not capture longer dependencies.
- A large window increases the number of parameters and may add unnecessary noise.
• TDNN works well for some short-memory problems such as Atari games,
but cannot handle long-memory problems over hundreds of time steps such as stock prediction.
• Because TDNN has a fixed window size, it cannot handle sequential data of variable length
such as language translation.
• FFNNs do not have any explicit memory of past data.
Their ability to capture temporal dependency is limited to the window size.
Even within a particular time window, the input is treated as a multidimensional feature vector
rather than as a sequence of observations, so the benefit of sequential information is lost.
(The network is unaware of the temporal structure.)
Recurrent Neural Network (RNN)
• An RNN (e.g., LSTM) makes predictions based on the current and previous inputs recurrently,
while an FFNN makes decisions based only on the current input.
• An RNN processes the input sequence one element x_t at a time (current input) and
maintains a hidden state vector h_t as a memory of past information (previous inputs).
• It learns to selectively retain relevant information, capturing temporal dependency or structure
across multiple time steps when processing sequential data.
• An RNN does not require a fixed-size time window and can handle variable-length sequences.
SimpleRNN
• At each time step t: x_t is the input, y_t the output, h_{t-1} the previous hidden state, and h_t the current hidden state.
• The hidden state vector h_t acts as a memory of past information in the sequential data.
• Weight matrices W_{xh}, W_{hh}, W_{hy} are shared at every time step (without W_{hh}, it reduces to an MLP).
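As a sketch only, assuming the usual SimpleRNN update h_t = tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h) with output y_t = W_{hy} h_t + b_y (the function and variable names below are illustrative, not the lecture's code):

```python
import numpy as np

def simple_rnn_forward(xs, Wxh, Whh, Why, bh, by):
    """Run a SimpleRNN over a sequence xs of shape (T, input_dim).
    The same three weight matrices are reused at every time step."""
    h = np.zeros(Whh.shape[0])                  # initial hidden state
    hs, ys = [], []
    for x_t in xs:                              # one sequence element at a time
        h = np.tanh(Wxh @ x_t + Whh @ h + bh)   # new hidden state (memory)
        ys.append(Why @ h + by)                 # output at this step
        hs.append(h)
    return np.array(hs), np.array(ys)
```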
Backpropagation Through Time (BPTT)
• A DNN uses backpropagation to update the weights so as to minimize the cost.
• An RNN uses a variation of backpropagation called BPTT to handle temporal dependencies.
• How do we update the 3 weight matrices W_{xh}, W_{hh}, W_{hy} that are shared at every time step?
U = W_{xh}: input-to-hidden weights
V = W_{hh}: hidden-to-hidden weights
W = W_{hy}: hidden-to-output weights
Weight update in BPTT
• The unfolded network is treated as one big FFNN that takes the current input together with all previous
inputs as one large input, and shares the weights U = W_{xh}, V = W_{hh}, W = W_{hy} at every time step.
• In the forward pass, we may pretend the weights are not shared, so that each time step has its own copies
U_0, U_1, . . . , U_t, V_1, . . . , V_t and W.
• In the backward pass, compute the gradients with respect to all weights U_0, U_1, . . . , U_t, V_1, . . . , V_t
and W in the unfolded network, exactly as in standard backpropagation.
• The final gradients of U and V are the sum (or average) of the gradients of the U_i's and V_i's respectively,
and these are applied to the RNN weight updates (see the sketch below).
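Continuing the forward-pass sketch above, and assuming a squared-error loss purely for illustration (bias gradients omitted for brevity), the gradient of every unrolled copy of the shared weights is accumulated like this:

```python
import numpy as np

def bptt_gradients(xs, targets, hs, ys, Wxh, Whh, Why):
    """Sum the gradient contributions of every unrolled copy of the
    shared weights (squared-error loss assumed for illustration)."""
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dh_next = np.zeros(Whh.shape[0])          # gradient arriving from step t+1
    for t in reversed(range(len(xs))):
        dy = ys[t] - targets[t]               # dL/dy_t for squared-error loss
        dWhy += np.outer(dy, hs[t])           # every step adds to the shared W
        dh = Why.T @ dy + dh_next             # dL/dh_t: output path + future path
        dz = (1.0 - hs[t] ** 2) * dh          # backprop through tanh
        h_prev = hs[t - 1] if t > 0 else np.zeros_like(hs[0])
        dWxh += np.outer(dz, xs[t])           # this unrolled copy of U
        dWhh += np.outer(dz, h_prev)          # this unrolled copy of V
        dh_next = Whh.T @ dz                  # send gradient to step t-1
    return dWxh, dWhh, dWhy
```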
Truncated BPTT
• BPTT can be slow when training an RNN on very long sequential data.
Furthermore, the accumulation of gradients over many time steps can shrink their values toward zero,
or grow them until they eventually overflow.
• Truncated BPTT limits the number of time steps used in the backward pass
and estimates the gradient used to update the weights rather than computing it fully.
• When using TBPTT, we choose the number of time steps (lookback) as a hyperparameter
to split long input sequences into subsequences that are both long enough to capture
relevant past information for making predictions and short enough to train efficiently.
• For time series, the input data is divided into overlapping subsequences over the same time interval.
Each subsequence consists of lookback consecutive time steps and forms one training sample.
During training, TBPTT is carried out over each individual sample (subsequence) for lookback time steps.
• RNN input data: a 3D tensor with shape (batch size, timesteps, input dim); see the sketch below.
• Lookback values of up to about 200 have been used successfully for many tasks.
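A minimal NumPy sketch (illustrative names and toy data only) of turning a univariate series into the 3D tensor of overlapping lookback-length subsequences:

```python
import numpy as np

def to_subsequences(series, lookback):
    """Split a 1D series into overlapping windows of lookback consecutive
    steps, shaped (samples, timesteps, input_dim) as an RNN expects."""
    windows = np.stack([series[i:i + lookback]
                        for i in range(len(series) - lookback + 1)])
    return windows[..., np.newaxis]            # input_dim axis of size 1

series = np.sin(np.linspace(0, 20, 500)).astype(np.float32)
X = to_subsequences(series, lookback=50)       # shape (451, 50, 1)
```

Unlike the flat TDNN windows sketched earlier, each sample here keeps its time axis, so the RNN still sees a sequence of observations rather than one feature vector.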
State maintenance
• Within a batch, the individual samples (subsequences) are still independent.
• Keras provides two ways of maintaining RNN hidden states (and the additional cell states in LSTM);
a sketch follows below.
- Default mode: states are only maintained within each individual sample for the lookback number
of time steps, starting from a fresh initial state.
- Stateful mode: states are maintained across successive batches, so that the final state of the
i-th sample of the current batch is used as the initial state of the i-th sample of the next batch.
• To use stateful mode, the following must hold.
- The batch size must divide the total number of samples, because all batches must contain the same
number of samples.
- If B_t and B_{t+1} are successive batches, then B_{t+1}[i] is the follow-up sequence to B_t[i] for all i.
- Samples must not be shuffled, so that state carries over across batches: B_t[i] → B_{t+1}[i].
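A minimal Keras sketch of stateful mode (sizes and data are hypothetical, and minor API details may differ between Keras versions):

```python
import numpy as np
from tensorflow import keras

batch_size, timesteps, input_dim = 16, 50, 1

# Stateful mode: the final state of sample i in one batch becomes the
# initial state of sample i in the next batch, so the batch size is fixed.
model = keras.Sequential([
    keras.Input(batch_shape=(batch_size, timesteps, input_dim)),
    keras.layers.LSTM(32, stateful=True),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

X = np.random.rand(batch_size * 10, timesteps, input_dim).astype("float32")
y = np.random.rand(batch_size * 10, 1).astype("float32")

# shuffle=False keeps the alignment B_t[i] -> B_{t+1}[i] across batches.
model.fit(X, y, batch_size=batch_size, epochs=2, shuffle=False)
model.reset_states()   # clear the states manually, e.g. between sequences
```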
Long Short-Term Memory (LSTM)
LSTM is designed to capture long-term dependencies and overcomes the vanishing/exploding
gradient problem of SimpleRNN.
* Long-term dependency
If the gap between the relevant information and
the point where it is needed becomes large,
the RNN's ability to learn the connection decreases.
i_t = \sigma(W_{xhi} x_t + W_{hhi} h_{t-1} + b_{hi})   (input gate)
f_t = \sigma(W_{xhf} x_t + W_{hhf} h_{t-1} + b_{hf})   (forget gate)
o_t = \sigma(W_{xho} x_t + W_{hho} h_{t-1} + b_{ho})   (output gate)
g_t = \tanh(W_{xhg} x_t + W_{hhg} h_{t-1} + b_{hg})   (= RNN core)
c_t = f_t \odot c_{t-1} + i_t \odot g_t   (cell state)
h_t = o_t \odot \tanh(c_t)   (hidden state)

In compact form:
\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix}
= \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}
  W \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix}
• The input and forget gates determine how the next cell state
is influenced by the input and the previous hidden state.
• The output gate determines how the hidden state and the output
are influenced by the cell state.
• SimpleRNN is (essentially) LSTM with i_t = o_t = 1, f_t = 0,
since then h_t = \tanh(\tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)); a forward-step sketch follows below.
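As a sketch only (the stacked-weight layout and names below are illustrative assumptions, not the lecture's notation), one LSTM step following the equations above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*H, D+H): the four gate blocks
    (i, f, o, g) stacked, applied to the concatenated [x_t, h_{t-1}]."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[0:H])          # input gate
    f = sigmoid(z[H:2*H])        # forget gate
    o = sigmoid(z[2*H:3*H])      # output gate
    g = np.tanh(z[3*H:4*H])      # candidate (RNN core)
    c = f * c_prev + i * g       # cell state: elementwise, no matrix multiply
    h = o * np.tanh(c)           # hidden state
    return h, c
```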
Backpropagation
• SimpleRNN: computing the gradient of h_t involves many repeated factors of W (and tanh).
⇒ largest singular value of W < 1 gives vanishing gradients; > 1 gives exploding gradients.
• LSTM: the cell state c_t, acting like an accumulator over time, ensures that gradients do not decay.
Backpropagation from c_t to c_{t-1} uses only elementwise multiplication by f_t,
with no matrix multiplication by W (since c_t = f_t \odot c_{t-1} + i_t \odot g_t).
Gated Recurrent Unit (GRU)
GRU is a simplified variant of LSTM that merges the cell state and the hidden state,
and combines the input gate and the forget gate into a single update gate.
* Hidden units affected by the reset and update gates learn to capture dependencies over different time scales.
Units that learn to capture short-term dependencies tend to have frequently active reset gates,
while those that capture longer-term dependencies have mostly active update gates.
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, Cho et al., 2014
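For reference, the standard GRU update from Cho et al. (2014), with update gate z_t and reset gate r_t (the weight symbols below are chosen here for illustration, and some libraries swap the roles of z_t and 1 - z_t):

z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1} + b_z)   (update gate)
r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1} + b_r)   (reset gate)
\tilde{h}_t = \tanh(W_{xh} x_t + W_{hh} (r_t \odot h_{t-1}) + b_h)   (candidate state)
h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t   (hidden state)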
Bidirectional LSTM
• Sometimes you want to incorporate information from data both preceding and following the current step,
so the input sequence is presented and learned both forward and backward (a Keras sketch follows below).
y_t = W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y
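A minimal Keras sketch of a bidirectional LSTM (sizes are illustrative):

```python
from tensorflow import keras

timesteps, input_dim = 50, 8

# The Bidirectional wrapper runs one LSTM forward and one backward over the
# sequence and, by default, concatenates their hidden states.
model = keras.Sequential([
    keras.Input(shape=(timesteps, input_dim)),
    keras.layers.Bidirectional(keras.layers.LSTM(32)),
    keras.layers.Dense(1),
])
model.summary()
```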
Stacked LSTM
• LSTM layers are stacked one on top of another.
• Deep models are often exponentially more efficient at representing some functions than shallow ones.
\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix}
= \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}
  W^{l} \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^{l} \end{pmatrix}
c_t^{l} = f_t \odot c_{t-1}^{l} + i_t \odot g_t
h_t^{l} = o_t \odot \tanh(c_t^{l})
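A minimal Keras sketch of a two-layer stacked LSTM (sizes are illustrative): return_sequences=True makes the lower layer emit its hidden state at every time step, which the upper layer then consumes as its input sequence.

```python
from tensorflow import keras

timesteps, input_dim = 50, 8

model = keras.Sequential([
    keras.Input(shape=(timesteps, input_dim)),
    keras.layers.LSTM(64, return_sequences=True),  # lower layer: h_t^{l-1} for all t
    keras.layers.LSTM(32),                         # upper layer: final hidden state only
    keras.layers.Dense(1),
])
model.summary()
```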