Lecture Notes - Recurrent Neural Networks
In this module, you learnt about the different kinds of RNN architectures and their usage on problems that
involve sequences. Then, you learnt about backpropagation and the problem of vanishing and exploding
gradients in RNNs. You also learnt about bidirectional RNNs, which are a variant of the standard RNN.
Finally, you learnt about the LSTM network and its variant, the GRU network, both of which help in tackling
the problem of vanishing gradients faced by a vanilla RNN network.
You learnt that a normal feedforward neural network is insufficient for modelling sequence data. Some
examples of sequence data are:
● Time series
● Music
● Videos
● Text
You learnt that sequential data contains multiple entities. The order in which these entities are present is
important.
You learnt about the architecture of an RNN, which is designed to take into account the multiple entities
present in a sequence. The architecture of an RNN and the feedforward equations are shown below.
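In one standard form (a sketch using the notation followed in the rest of these notes, with σ denoting the
layer's activation function), the feedforward equation of a recurrent layer 'l' at timestep 't' is:

a_t^(l) = σ( W_F^(l) a_t^(l-1) + W_R^(l) a_{t-1}^(l) + b^(l) )

where W_F^(l) is the feedforward weight matrix connecting layer 'l-1' to layer 'l', W_R^(l) is the recurrent
weight matrix within layer 'l', and b^(l) is the bias of layer 'l'.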
You learnt that an RNN consists of recurrent layers. The weights on the recurrent connections of a layer are
denoted by WR. You also learnt that WR is a square matrix because it connects each and every neuron at
timestep 't' in layer 'l' with each and every neuron at timestep 't+1' in the same layer 'l'.
You also learnt that each activation is dependent on two things: the activation in the previous layer ‘l-1’ at
the current timestep ‘t’, and the activation in the same layer ‘l’ at the previous timestep ‘t-1’.
You also learnt about the matrix sizes of the terms involved in the feedforward equations. The following
table shows the matrix sizes:
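Under the convention that activations are stored with one column per data point (a sketch, since
conventions can differ across implementations), the sizes work out as follows:

Term         Size
a_t^(l)      (number of neurons in layer 'l') x batch_size
W_F^(l)      (number of neurons in layer 'l') x (number of neurons in layer 'l-1')
W_R^(l)      (number of neurons in layer 'l') x (number of neurons in layer 'l')
b^(l)        (number of neurons in layer 'l') x 1, broadcast across the batch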
In the above notation, ‘t’ denotes the timestep, ‘l’ denotes the layer in the network and batch_size is the
number of data points passed in one go.
You also learnt that there's a more compact way to write the feedforward equation of an RNN:

a_t^(l) = σ( W^(l) [a_t^(l-1), a_{t-1}^(l)] + b^(l) )

where
W^(l) = [ W_F^(l) | W_R^(l) ] is the column-wise concatenation of the weight matrices at layer 'l', and
[a_t^(l-1), a_{t-1}^(l)] is the row-wise concatenation of the activations a_t^(l-1) and a_{t-1}^(l).
Next, you went through the different types of RNN architectures. You learnt that changing the form of the
input and/or the output leads to a different architecture (a minimal Keras sketch contrasting two of these
follows the list below). The different types of RNN that you learnt about are:
● Many-to-one architecture:
You learnt that this architecture takes a sequence as the input and produces a single entity as the output.
You used this architecture in the C-code generator, which was a character-level text generator.
● Many-to-many architecture:
You learnt that this type of RNN can be used to model data that involves sequences in the input as well as
the output. The important thing to note here is that the input and output sequences must have a
one-to-one correspondence, and therefore the input and output sequences are equal in length. You used
this type of architecture while building a POS tagger, where the input was a sentence and the output was a
part-of-speech tag for each word in the sentence.
● Encoder-decoder architecture:
You learnt that this is also a many-to-many architecture, but one in which the input and output sequences
don't have a one-to-one correspondence. As a result, more often than not, the lengths of the input and
output sequences are not equal. You learnt that this architecture can be deployed in problems such as
language translation and document summarization. You also learnt that the errors are backpropagated
from the decoder to the encoder; the encoder and decoder are different RNNs altogether, each with its own
set of weights. The loss is calculated at each timestep and can either be backpropagated at each timestep,
or the cumulative loss (the sum of the losses from all the timesteps of a sequence) can be backpropagated
after the entire sequence has been ingested. In practice, the errors are generally backpropagated once the
RNN has ingested an entire batch.
● One-to-many architecture:
You learnt that this type of architecture has a single entity as the input and a sequence as the output. You
can use this architecture for generation tasks such as music generation, creating drawings and text
generation.
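As a concrete illustration of the first two architectures, here is a minimal Keras sketch (the vocabulary size,
sequence length and layer widths are hypothetical, and this is not the course's C-code generator or POS
tagger). In Keras, return_sequences=False keeps only the last timestep's activation (many-to-one), while
return_sequences=True emits an activation at every timestep (many-to-many with equal lengths). An
encoder-decoder model would instead wire two separate recurrent networks together with the functional API.

from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, TimeDistributed

vocab_size, seq_len, num_tags = 10000, 100, 12    # hypothetical sizes for illustration only

# Many-to-one: read the whole sequence, predict a single entity
many_to_one = Sequential([
    Input(shape=(seq_len,)),
    Embedding(vocab_size, 64),
    SimpleRNN(128),                               # only the final activation is returned
    Dense(1, activation='sigmoid')
])

# Many-to-many (equal lengths): one output per input, e.g. one POS tag per word
many_to_many = Sequential([
    Input(shape=(seq_len,)),
    Embedding(vocab_size, 64),
    SimpleRNN(128, return_sequences=True),        # one activation per timestep
    TimeDistributed(Dense(num_tags, activation='softmax'))
])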
After going through the architectures, you learnt about the mechanism by which gradients flow in an RNN.
This mechanism is called backpropagation through time (BPTT). You learnt that any given output in an RNN
depends not only on the current input but also on the inputs from previous timesteps. The gradients,
therefore, not only flow back from the output layer to the input layer, but also flow back in time from the
last timestep to the first timestep, hence the name backpropagation through time.
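As a rough sketch of why gradients can vanish or explode over long sequences, consider a simplified
single-layer recurrence a_t = σ(W_F x_t + W_R a_{t-1} + b), and let z_t denote its pre-activation. By the
chain rule, the gradient flowing from timestep T back to timestep t contains the factor

∂a_T / ∂a_t = ∏_{k=t+1}^{T} ∂a_k / ∂a_{k-1} = ∏_{k=t+1}^{T} diag(σ'(z_k)) W_R

which is a product of (T - t) Jacobians. If the norms of these Jacobians are consistently smaller than 1, the
product shrinks towards zero as the gap grows (vanishing gradients); if they are consistently larger than 1,
it grows without bound (exploding gradients).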
You learnt that when an entire sequence is available offline, you can make use of future entities by 'looking
ahead'. You can feed such offline sequences to an RNN in the regular order as well as in the reverse order
to get better results on whatever task you're doing. Such an RNN is called a bidirectional RNN. You also
learnt that in a bidirectional RNN, the input at each timestep is a concatenation of the entity at that position
in the regular order and the entity at that position in the reverse order. For example, for a sentence of
length 100, the input at the first timestep is a concatenation of the first word x_1 and the last word x_100.
You learnt that a bidirectional RNN has twice as many parameters as a vanilla RNN.
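A minimal Keras sketch of this idea is shown below (hypothetical sizes; the layer widths are illustrative, not
the course's exact model). The Bidirectional wrapper runs one copy of the layer over the sequence in the
regular order and a second copy in the reverse order, then concatenates their outputs, which is why the
wrapped layer ends up with twice the parameters of a single layer.

from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, SimpleRNN, Dense

model = Sequential([
    Input(shape=(100,)),                  # sequences of 100 token ids
    Embedding(10000, 64),
    Bidirectional(SimpleRNN(128)),        # one copy reads forwards, one reads backwards
    Dense(1, activation='sigmoid')
])
model.summary()                           # the Bidirectional layer shows 2x the parameters of SimpleRNN(128)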
To get rid of the vanishing gradients problem, you learnt that researchers came up with another type of
cell that can be used inside an RNN layer, called the LSTM cell.
There are three inputs to an LSTM cell: the cell state from the previous timestep (c_{t-1}), the activation
from the previous timestep (h_{t-1}), and the input at the current timestep (x_t) coming from the previous
layer. There are two outputs: the current cell state (c_t) and the current activation (h_t), which goes in two
directions, into the next timestep and into the next layer, just as normal RNN activations are passed in two
directions.
You also learnt that there are three gates: forget gate, update gate and the output gate. The forget gate is
used to discard information from the previous cell state. The update gate writes new information to the
previous cell state. After discarding and writing new information, you get the new cell state.
You learnt that an LSTM layer is made of multiple LSTM cells and an LSTM network can have multiple
LSTM layers stacked on top of each other in the same way as an RNN with multiple RNN layers. The
feedforward equations of an LSTM network are:
f_t = σ( W_f [h_{t-1}, x_t] + b_f )
i_t = σ( W_i [h_{t-1}, x_t] + b_i )
c'_t = tanh( W_c [h_{t-1}, x_t] + b_c )
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c'_t
o_t = σ( W_o [h_{t-1}, x_t] + b_o )
h_t = o_t ⊙ tanh(c_t)
where ⊙ denotes elementwise multiplication.
You also learnt that each of the four weight matrices involved in the LSTM feedforward equations is a
column-wise concatenation of the feedforward weights (WF) and the recurrent weights (WR) of the layer:
W_f = [ W_{F_f} | W_{R_f} ]
W_i = [ W_{F_i} | W_{R_i} ]
W_c = [ W_{F_c} | W_{R_c} ]
W_o = [ W_{F_o} | W_{R_o} ]
You learnt that, as a result of having four weight matrices and four bias vectors, an LSTM layer has four
times as many parameters as a vanilla RNN layer (with the same number of units and inputs).
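As a quick sanity check with hypothetical sizes: for a layer with 128 units receiving 64-dimensional inputs,
a vanilla RNN layer has 128 × (64 + 128) + 128 = 24,704 parameters, while the corresponding LSTM layer
has 4 × 24,704 = 98,816 parameters, since the forget gate, update gate, candidate cell state and output
gate each have their own weight matrix and bias.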
Finally, you briefly saw an LSTM variant: the gated recurrent unit (GRU). A GRU network consists of GRU
layers, which consist of GRU cells that are similar to LSTM cells. However, a GRU network has fewer
parameters than an LSTM network: a GRU layer has three weight matrices compared to the four in an LSTM
layer, which means that a GRU layer has roughly three times as many parameters as a vanilla RNN layer.
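Continuing the hypothetical example above, the same 128-unit layer implemented as a GRU would have
about 3 × 24,704 = 74,112 parameters (the exact count can differ slightly between implementations; for
example, Keras's default GRU uses an extra set of bias terms).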
Finally, you learnt how to build different types of RNNs in Python using the Keras library. You are
encouraged to go through the RNN code provided to you.
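A minimal sketch of how such models are assembled in Keras is shown below (hypothetical sizes; this is
not the course's provided notebook). Printing the summary lets you compare the per-layer parameter
counts directly, since all three recurrent layers here receive 128-dimensional inputs.

from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, LSTM, GRU, Dense

model = Sequential([
    Input(shape=(100,)),                       # sequences of 100 token ids
    Embedding(input_dim=10000, output_dim=128),
    SimpleRNN(128, return_sequences=True),     # vanilla recurrent layer
    LSTM(128, return_sequences=True),          # ~4x the parameters of the SimpleRNN layer
    GRU(128),                                  # ~3x the parameters of the SimpleRNN layer
    Dense(1, activation='sigmoid')
])
model.summary()                                # shows the parameter count of each layer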
● You can download this document from the website for self-use only.
● Any copies of this document, in part or full, saved to disc or to any other storage medium may only be used
for subsequent, self-viewing purposes or to print an individual extract or copy for non-commercial personal
use only.
● Any further dissemination, distribution, reproduction, copying of the content of the document herein or the
uploading thereof on other websites or use of content for any other commercial/unauthorized purposes in
any way which could infringe the intellectual property rights of UpGrad or its contributors, is strictly
prohibited.
● No graphics, images or photographs from any accompanying text in this document will be used separately
for unauthorised purposes.
● No material in this document will be modified, adapted or altered in any way.
● No part of this document or UpGrad content may be reproduced or stored in any other web site or included
in any public or private electronic retrieval system or service without UpGrad’s prior written permission.
● Any rights not expressly granted in these terms are reserved.