
4. Recurrent Neural Network


Introduction to Deep Learning

Seungsang Oh
Dept of Mathematics & Dept of Data Science
Korea University

Contents:
• Deep Neural Network
• Convolutional Neural Network
• Recurrent Neural Network
• Attention Mechanism
• Auto-Encoder & VAE
• Generative Adversarial Network
• Natural Language Processing
• Graph Neural Network

Recurrent Neural Network

Feedforward neural network (FFNN), e.g. MLP
• Information flows in one direction only.
• No sense of time and no memory of previous data.

Recurrent neural network (RNN)
• Information also flows backward through recurrent (feedback) connections.
• Sense of time and memory of previous data.

Time Delay Neural Network (TDNN)
• TDNN is an FFNN whose purpose is to classify temporal patterns with shift-invariance
(independence of the starting point, so the windows can be treated as i.i.d. data for an MLP and may be shuffled)
in sequential data, much like a 1D convolution with stride 1.
• It is the simplest way of handling time-varying sequential data
and allows conventional backpropagation algorithms.
• TDNN gets its inputs by sliding a window of size n across the time steps of the sequential data
and treating each set of n consecutive samples as one input vector (a code sketch follows the equations below).

input(t)   = [x_t, x_{t-1}, ..., x_{t-n+1}]
input(t-1) = [x_{t-1}, x_{t-2}, ..., x_{t-n}]
input(t-2) = [x_{t-2}, x_{t-3}, ..., x_{t-n-1}]
...
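Below is a minimal NumPy sketch of this sliding-window construction (the function name, toy series, and window size n = 3 are illustrative, not from the slides):

```python
import numpy as np

def sliding_windows(x, n):
    """Build TDNN inputs from a 1D series x: one row per time step t,
    ordered [x_t, x_{t-1}, ..., x_{t-n+1}] as in the equations above."""
    T = len(x)
    return np.stack([x[t - n + 1 : t + 1][::-1] for t in range(n - 1, T)])

x = np.arange(10, dtype=float)      # toy series x_0 ... x_9
inputs = sliding_windows(x, n=3)    # e.g. the row for t=2 is [x_2, x_1, x_0]
print(inputs.shape)                 # (8, 3)
```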

Disadvantages of TDNN
• The success of TDNN depends on finding an appropriate window size.
- A small window size does not capture longer dependencies.
- A large window size increases the number of parameters and may add unnecessary noise.
• TDNN works well for some short-memory problems such as Atari games,
but cannot handle long-memory problems spanning hundreds of time steps, such as stock prediction.
• Because TDNN has a fixed window size, it cannot handle sequential data of variable length,
such as language translation.
• FFNNs have no explicit memory of past data:
their ability to capture temporal dependency is limited to the window size.
Even within a particular time window, the input is treated as a multidimensional feature vector
rather than as a sequence of observations, so the benefit of sequential information is lost
(the model is unaware of the temporal structure).

Recurrent Neural Network (RNN)
• RNN (LSTM) makes predictions based on the current and previous inputs recurrently,
while FFNN makes decisions based only on the current input.
• RNN processes the input sequence one element x_t at a time (current input) and
maintains a hidden state vector h_t as a memory of past information (previous inputs).
• It learns to selectively retain relevant information so as to capture temporal dependency or structure
across multiple time steps when processing sequential data.
• RNN does not require a fixed-size time window and can handle variable-length sequences.

SimpleRNN
• Each time step t: input x_t, output y_t, previous hidden state h_{t-1}, current hidden state h_t.
• The hidden state vector h_t acts as a memory of past information in the sequential data.
• Weight matrices W_xh, W_hh, W_hy are shared at every time step (without W_hh, it is an MLP).

h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
y_t = W_hy h_t + b_y
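A minimal NumPy sketch of these two equations, assuming vector-valued states and the weight shapes implied above (all names are illustrative):

```python
import numpy as np

def simple_rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run the SimpleRNN equations over a sequence xs = [x_1, ..., x_T],
    starting from h_0 = 0 and reusing the same weights at every step."""
    h = np.zeros(W_hh.shape[0])
    hs, ys = [], []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)   # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
        ys.append(W_hy @ h + b_y)                # y_t = W_hy h_t + b_y
        hs.append(h)
    return ys, hs
```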

Backpropagation Through Time (BPTT)
• DNN uses backpropagation to update the weights so as to minimize the cost.
• RNN uses a variation of backpropagation called BPTT to handle temporal dependencies.

• How do we update the 3 weight matrices W_xh, W_hh, W_hy that are shared at every time step?
U = W_xh  (input-to-hidden weights)
V = W_hh  (hidden-to-hidden weights)
W = W_hy  (hidden-to-output weights)

Weight update in BPTT
• The unfolded network is treated as one big FFNN that takes the current and all previous
inputs as one big input sample and shares the weights U = W_xh, V = W_hh, W = W_hy at every time step.
• In the forward pass, we may pretend the weights are not shared, so that each time step has its own copies
U_0, U_1, ..., U_t, V_1, ..., V_t and W.
• In the backward pass, compute the gradients with respect to all weights U_0, U_1, ..., U_t, V_1, ..., V_t
and W in the unfolded network, as in standard backpropagation.
• The final gradients of U and V are the average (or sum) of the gradients of the U_i's and V_i's respectively,
and these are applied to the RNN weight updates (a worked sketch follows below).
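The following is a minimal NumPy sketch of this procedure for the SimpleRNN above, using a squared-error loss (the loss choice and all names are illustrative); it accumulates, i.e. sums, the per-step gradients of each shared weight matrix:

```python
import numpy as np

def bptt_gradients(xs, targets, W_xh, W_hh, W_hy, b_h, b_y):
    """BPTT for the SimpleRNN above with loss = sum_t 0.5*||y_t - target_t||^2.
    Per-step gradients of the shared weights are accumulated (summed) over time."""
    # forward pass, storing every hidden state for the backward pass
    hs, ys = [np.zeros_like(b_h)], []
    for x in xs:
        hs.append(np.tanh(W_xh @ x + W_hh @ hs[-1] + b_h))
        ys.append(W_hy @ hs[-1] + b_y)

    # backward pass through the unfolded network
    dW_xh, dW_hh, dW_hy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy)
    db_h, db_y = np.zeros_like(b_h), np.zeros_like(b_y)
    dh_next = np.zeros_like(b_h)                      # gradient arriving from step t+1
    for t in reversed(range(len(xs))):
        dy = ys[t] - targets[t]                       # dL/dy_t for squared error
        dW_hy += np.outer(dy, hs[t + 1]); db_y += dy  # this step's copy of W = W_hy
        dh = W_hy.T @ dy + dh_next                    # through the output and through time
        da = dh * (1.0 - hs[t + 1] ** 2)              # tanh'(a) = 1 - tanh(a)^2
        dW_xh += np.outer(da, xs[t])                  # accumulate into shared U = W_xh
        dW_hh += np.outer(da, hs[t]); db_h += da      # accumulate into shared V = W_hh
        dh_next = W_hh.T @ da
    return dW_xh, dW_hh, dW_hy, db_h, db_y
```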

Truncated BPTT
• BPTT can be slow when training an RNN on problems with very long sequential data.
Furthermore, the accumulation of gradients over many time steps can cause the values to shrink
toward zero, or to grow until they eventually overflow.
• Truncated BPTT (TBPTT) limits the number of time steps used in the backward pass
and estimates the gradient used to update the weights rather than computing it fully.
• When using TBPTT, we choose the number of time steps (lookback) as a hyperparameter
to split long input sequences into subsequences that are both long enough to capture
the relevant past information for making predictions and short enough to train efficiently.
• For time series, the input data is divided into overlapping subsequences of the same length (time interval).
Each subsequence consists of lookback consecutive time steps and forms one training sample.
During training, TBPTT is run over individual samples (subsequences) for lookback time steps.
• RNN input data: a 3D tensor with shape (batch_size, timesteps, input_dim).
• Lookback values of up to around 200 have been used successfully for many tasks
(a data-preparation sketch follows below).
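A minimal NumPy sketch of this preparation step, turning a long (T, input_dim) series into the 3D tensor of overlapping lookback-length samples (the value right after each window is used as a toy target; all names are illustrative):

```python
import numpy as np

def make_subsequences(series, lookback):
    """series: array of shape (T, input_dim).
    Returns X with shape (num_samples, lookback, input_dim) and
    y with shape (num_samples, input_dim): the value right after each window."""
    X, y = [], []
    for start in range(len(series) - lookback):
        X.append(series[start : start + lookback])
        y.append(series[start + lookback])
    return np.array(X), np.array(y)

series = np.random.randn(1000, 1)            # toy univariate time series
X, y = make_subsequences(series, lookback=50)
print(X.shape, y.shape)                      # (950, 50, 1) (950, 1)
```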

State maintenance
• Within a batch, individual samples (subsequences) are still independent.
• Keras provides 2 ways of maintaining RNN hidden states (and the additional cell states in LSTM).
- Default mode: states are maintained only over each individual sample, for the lookback number
of time steps, starting from a fresh initial state.
- Stateful mode: states are maintained across successive batches, so that the final state of
the i-th sample of the current batch is used as the initial state of the i-th sample of the next batch.
• To use stateful mode, the following must hold.
- The batch size must divide the total number of samples, because all batches must have the same
number of samples.
- If B_t and B_{t+1} are successive batches, then B_{t+1}[i] is the follow-up sequence to B_t[i] for all i.
- Samples must not be shuffled, so that the state carries over correctly: B_t[i] → B_{t+1}[i]
(a stateful-mode sketch follows below).
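A minimal tf.keras sketch of stateful mode under these assumptions (layer sizes, data, and epoch count are placeholders; argument names can differ slightly across Keras versions):

```python
import numpy as np
from tensorflow import keras

batch_size, lookback, input_dim = 32, 50, 1
X = np.random.randn(batch_size * 30, lookback, input_dim)  # sample count divisible by batch_size
y = np.random.randn(batch_size * 30, 1)

model = keras.Sequential([
    keras.layers.LSTM(64, stateful=True,
                      batch_input_shape=(batch_size, lookback, input_dim)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

for epoch in range(5):
    # shuffle=False keeps the alignment B_t[i] -> B_{t+1}[i] across batches
    model.fit(X, y, batch_size=batch_size, epochs=1, shuffle=False)
    model.reset_states()   # clear the carried-over states between epochs
```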

Long Short-Term Memory (LSTM)
LSTM is designed to capture long-term dependencies and to overcome the vanishing/exploding
gradient problem of SimpleRNN.

* Long-term dependency
If the gap between the relevant information and
the point where it is needed becomes large,
the RNN's ability to learn it decreases.

Long Short-Term Memory, Hochreiter 1997


• LSTM uses a cell state vector c_t to remember past information over a long interval,
in addition to the hidden state vector h_t of SimpleRNN.
• Input, forget and output gates regulate the flow of information into and out of the cell state.

i_t = σ(W_xhi x_t + W_hhi h_{t-1} + b_hi)       input gate
f_t = σ(W_xhf x_t + W_hhf h_{t-1} + b_hf)       forget gate
o_t = σ(W_xho x_t + W_hho h_{t-1} + b_ho)       output gate
g_t = tanh(W_xhg x_t + W_hhg h_{t-1} + b_hg)    (= RNN core)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t                 cell state
h_t = o_t ⊙ tanh(c_t)                           hidden state

Compact form:  (i_t; f_t; o_t; g_t) = (σ; σ; σ; tanh)( W [x_t; h_{t-1}] )

• The input and forget gates determine how the next cell state
is influenced by the input and the previous hidden state.
• The output gate determines how the hidden state and the output
are influenced by the cell state.
• SimpleRNN corresponds to LSTM with i_t = o_t = 1 and f_t = 0,
which gives h_t = tanh(tanh(W_xh x_t + W_hh h_{t-1} + b_h)).
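A minimal NumPy sketch of one LSTM step that mirrors the compact form above, with the four gate blocks stacked in one weight matrix W acting on [x_t; h_{t-1}] (the stacking order i, f, o, g is an assumption of this sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: W maps the concatenation [x_t, h_{t-1}]
    to the 4 stacked gate pre-activations (i, f, o, g), each of size H."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b   # shape (4H,)
    i = sigmoid(z[0:H])                       # input gate
    f = sigmoid(z[H:2*H])                     # forget gate
    o = sigmoid(z[2*H:3*H])                   # output gate
    g = np.tanh(z[3*H:4*H])                   # RNN core
    c = f * c_prev + i * g                    # cell state
    h = o * np.tanh(c)                        # hidden state
    return h, c
```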
Backpropagation
• SimpleRNN: computing the gradient of h_t involves many repeated factors of W_hh (and tanh derivatives).
⇒ largest singular value of W_hh < 1: vanishing gradients;  > 1: exploding gradients.

• LSTM: the cell state c_t acts like an accumulator over time and keeps gradients from decaying.
Backpropagation from c_t to c_{t-1} involves only elementwise multiplication by f_t,
with no matrix multiplication by W  (since c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t).
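A small numerical check of this claim (all gate values below are random placeholders): unrolling c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t and differentiating c_T with respect to c_0 gives exactly the elementwise product of the forget gates, with no weight matrix involved.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H = 5, 3
f = rng.uniform(0.5, 1.0, size=(T, H))   # forget gate values f_1 ... f_T
i = rng.uniform(size=(T, H))             # input gate values
g = rng.uniform(size=(T, H)) * 2 - 1     # candidate values g_t

def unroll(c0):
    """c_t = f_t * c_{t-1} + i_t * g_t, starting from c_0."""
    c = c0
    for t in range(T):
        c = f[t] * c + i[t] * g[t]
    return c

c0 = rng.normal(size=H)
eps = 1e-6
# finite-difference estimate of d c_T / d c_0 (elementwise, per component)
numeric = (unroll(c0 + eps) - unroll(c0 - eps)) / (2 * eps)
analytic = f.prod(axis=0)                # product of forget gates only, no W anywhere
print(np.allclose(numeric, analytic))    # True
```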

Gated Recurrent Unit (GRU)
GRU is a simplified variant of LSTM that merges the cell state and the hidden state,
and combines the input gate and the forget gate into a single update gate.

* Hidden units affected by the reset and update gates learn to capture dependencies over different time scales.
Units that learn to capture short-term dependencies tend to have frequently active reset gates,
while those that capture longer-term dependencies have mostly active update gates.

Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, Cho 2014
LSTM (for comparison):
i_t = σ(W_xhi x_t + W_hhi h_{t-1} + b_hi)       input gate
f_t = σ(W_xhf x_t + W_hhf h_{t-1} + b_hf)       forget gate
o_t = σ(W_xho x_t + W_hho h_{t-1} + b_ho)       output gate
g_t = tanh(W_xhg x_t + W_hhg h_{t-1} + b_hg)    (= RNN core)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t                 cell state
h_t = o_t ⊙ tanh(c_t)                           hidden state

GRU:
z_t = σ(W_z x_t + U_z h_{t-1} + b_z)            update gate  (plays the role of the input and forget gates)
r_t = σ(W_r x_t + U_r h_{t-1} + b_r)            reset gate
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1}) + b_h) (= RNN core)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t           hidden state  (plays the role of the cell and hidden states)
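A minimal NumPy sketch of one GRU step implementing these equations (weight names follow the slide; shapes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
    """One GRU step using the equations above."""
    z = sigmoid(Wz @ x + Uz @ h_prev + bz)               # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev + br)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev) + bh)   # candidate (RNN core)
    return (1.0 - z) * h_prev + z * h_tilde              # new hidden state
```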

Bidirectional LSTM
• Sometimes you want to incorporate information from the data both before and after the current position,
so the input sequence is presented and learned both forward and backward.

y_t = W_hy^→ h_t^→ + W_hy^← h_t^← + b_y,   where h_t^→ and h_t^← are the forward and backward hidden states.
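A minimal tf.keras sketch (sizes are placeholders): the Bidirectional wrapper runs one LSTM over the sequence forward and another backward; note that by default Keras concatenates the two directions' outputs rather than summing them as in the equation above.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(None, 8)),                       # (timesteps, input_dim), variable length
    keras.layers.Bidirectional(keras.layers.LSTM(32)),  # forward LSTM + backward LSTM
    keras.layers.Dense(1),                              # output layer
])
model.summary()
```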

Stacked LSTM
• LSTM layers are stacked one on top of another.
• Deep models are often exponentially more efficient at representing some functions
than shallow ones.

(i_t; f_t; o_t; g_t) = (σ; σ; σ; tanh)( W^l [h_t^{l-1}; h_{t-1}^l] )
(l indexes the layer; layer l takes h_t^{l-1}, the output of the layer below at time t, as its input)

c_t^l = f_t ⊙ c_{t-1}^l + i_t ⊙ g_t
h_t^l = o_t ⊙ tanh(c_t^l)

• When stacking RNNs in Keras, you must set return_sequences=True on every recurrent layer
except the last (a sketch follows below).

The input shape of each recurrent layer is (batch_size, timesteps, input_dim).
The output shape of a recurrent layer is (batch_size, timesteps, output_dim) when return_sequences=True,
and (batch_size, output_dim) when False.
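A minimal tf.keras sketch of two stacked LSTM layers (sizes are placeholders); the first layer returns its full sequence of hidden states, so the second receives a (batch_size, timesteps, 64) input:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(None, 8)),                  # (timesteps, input_dim)
    keras.layers.LSTM(64, return_sequences=True),  # outputs (batch_size, timesteps, 64)
    keras.layers.LSTM(32),                         # outputs (batch_size, 32)
    keras.layers.Dense(1),
])
model.summary()
```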
Dropout in LSTM
• Applying standard dropout to the recurrent connections results in poor performance,
since the noise caused by dropout at each time step (a per-step mask) prevents the network
from retaining long-term memory (it erases all related past memories).
(1) RNN regularization dropout (non-recurrent connections only): h_t^{l-1} ⊙ m_t, with m_{t,i} ~ Bernoulli(1-p)
• Dropout with a per-sequence mask generates one dropout mask per input sequence
and keeps it the same at every time step (on both the non-recurrent and recurrent connections).
The elements of the hidden and cell states that are not dropped persist throughout
the entire sequence, maintaining long-term memory.
(2) RNNdrop: c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t ⊙ m
(3) Variational RNN dropout: x_t ⊙ m_x and h_{t-1} ⊙ m_h
(4) Weight-dropped LSTM: (W_hh* ⊙ M) h_{t-1}  (DropConnect on the 4 hidden-to-hidden weight matrices)
(5) Recurrent dropout: c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t ⊙ m  (or m_t)
(6) Zoneout: c_t = c_{t-1} ⊙ m_t^c + (f_t ⊙ c_{t-1} + i_t ⊙ g_t) ⊙ (1 - m_t^c)
             h_t = h_{t-1} ⊙ m_t^h + o_t ⊙ tanh(f_t ⊙ c_{t-1} + i_t ⊙ g_t) ⊙ (1 - m_t^h)
    (dropped units keep their corresponding activations from the previous time step)
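For reference, tf.keras's LSTM layer exposes dropout (on the input connections) and recurrent_dropout (on the hidden-to-hidden connections) arguments; the sketch below uses placeholder rates, and which of the variants above a given Keras version implements should be checked against its documentation.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(None, 8)),
    keras.layers.LSTM(64,
                      dropout=0.2,             # mask on the input connections
                      recurrent_dropout=0.2),  # mask on the hidden-to-hidden connections
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```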
