
1

Presented By: Praanav Bhowmik
& Niraj Kumar
Introduction To RNN

2

KnolX Etiquettes
Lack of etiquette and manners is a huge turn-off.
Punctuality
Join the session 5 minutes prior to the session start time. We start on time and conclude on time!
Feedback
Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent mode; feel free to step out of the session if you need to attend an urgent call.
Avoid Disturbance
Avoid unwanted chit-chat during the session.

3

Our Agenda
01 Introduction To Neural Networks
02 What is Neural Network?
03 What is RNN?
04 Types of RNN
05 Feed Forward Neural Network

4

Introduction To Neural
Network

5

What is Neural Network ?
Neural networks, as used in deep learning, consist of different layers connected to each other and are modelled on the structure and functions of the human brain. They learn from huge volumes of data and use complex algorithms to train the network.

6

Popular Neural Networks

7

Introduction To RNN

8

Introduction to RNN
Do you know how Google’s autocomplete feature predicts the rest of the words a user is typing?

9

Feed Forward Neural Network
● In a feed-forward neural network, information always flows in one direction (forward): from the input nodes to the hidden layers (if any), and then to the output nodes. There are no cycles or loops in the network.
● Decisions are based on the current input only.
● There is no memory of the past.
● There is no look-ahead to future inputs.
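
As a minimal sketch (not from the slides), the forward pass of such a network is a pure function of the current input: nothing is remembered between calls. The layer sizes and the NumPy implementation are illustrative assumptions.

import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # One forward pass: input -> hidden -> output. Nothing is remembered between calls.
    h = np.tanh(W1 @ x + b1)                      # hidden layer
    logits = W2 @ h + b2                          # output layer
    return np.exp(logits) / np.exp(logits).sum()  # softmax over the classes

# Illustrative fixed sizes: 4 input features, 8 hidden units, 3 output classes.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)
print(feed_forward(rng.normal(size=4), W1, b1, W2, b2))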

10

Feed Forward Neural Network: Limitations
● A feed-forward neural network has several limitations due to its architecture:
● It accepts a fixed-size vector as input (e.g. an image).
● It produces a fixed-size vector as output (e.g. probabilities of different classes).
● It performs this input-to-output mapping using a fixed number of computational steps (e.g. the number of layers).
● These limitations make it difficult to model time-series problems where the input and output are real-valued sequences.

11

What is Recurrent
Neural Network

12

Recurrent Neural Network
● A Recurrent Neural Network works on the principle of saving the output of a layer and feeding it back to the input in order to help predict the output of the layer.

13

● A Recurrent Neural Network is essentially a generalization of the feed-forward neural network that has an internal memory. RNNs are a special kind of neural network designed to deal effectively with sequential data. This kind of data includes time series (a list of values of some parameter over a certain period of time), text documents, which can be seen as sequences of words, or audio, which can be seen as a sequence of sound frequencies over time.
● An RNN is recurrent in nature because it performs the same function for every element of the input, while the output for the current input depends on the previous computation. To make a decision, it considers the current input together with what it has learned from the previous inputs.
● Cells whose output is a function of inputs from previous time steps are also known as memory cells.
● Unlike feed-forward neural networks, RNNs can use their internal state (memory) to process sequences of inputs. In other neural networks, all the inputs are independent of each other; in an RNN, the inputs are related to each other through this state.
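
A compact way to write this memory-cell idea (standard notation, not taken verbatim from the slide) is:

h_t = f(x_t, h_{t-1}), \qquad y_t = g(h_t)

where f is the recurrent cell, applied with the same parameters at every time step, and g maps the state to an output.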

14

Recurrent Neural Network: Intuition
● The Recurrent Neural Network (RNN) is a neural network model proposed in the 1980s for modelling time series.
● The structure of the network is similar to that of a feed-forward neural network, with the distinction that it allows a recurrent hidden state whose activation at each time step depends on that of the previous time step (a cycle).

15

Why RNN?
● The basic challenge with a classic feed-forward neural network is that it has no memory: each training example given to the model is treated independently of the others. To work with sequential data using such models, we need to show them the entire sequence in one go as a single training example. This is problematic because the number of words in a sentence can vary and, more importantly, this is not how we tend to process a sentence in our head.
● When we read a sentence, we read it word by word, keep the prior words / context in memory, and update our understanding based on the new words we incrementally read until we understand the whole sentence. This is the basic idea behind RNNs: they iterate through the elements of the input sequence while maintaining an internal "state" that encodes everything seen so far. The state of the RNN is reset between two different, independent sequences.
● Recurrent neural networks are a special type of neural network in which the outputs from previous time steps are fed as input to the current time step.
● Another distinguishing characteristic of recurrent networks is that they share parameters across time steps. While a feed-forward network has different weights at each layer, a recurrent neural network reuses the same weight parameters at every time step. These weights are still adjusted through backpropagation (through time) and gradient descent during training. A minimal sketch of this shared-weight loop is shown below.
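
As an illustration (not from the slides), the sketch below shows that idea in code: one set of weights reused at every time step, a state carried through the sequence, and a reset between independent sequences. The layer sizes and random inputs are assumptions chosen only for the example.

import numpy as np

def run_rnn(sequence, W_x, W_h, b):
    # Iterate through the elements of the sequence while maintaining an internal state.
    h = np.zeros(W_h.shape[0])                 # the state is reset for each independent sequence
    for x_t in sequence:
        h = np.tanh(W_x @ x_t + W_h @ h + b)   # the same weights are reused at every time step
    return h                                   # the final state encodes everything seen so far

rng = np.random.default_rng(1)
W_x, W_h, b = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), np.zeros(5)
short_seq = [rng.normal(size=3) for _ in range(3)]   # 3 time steps
long_seq = [rng.normal(size=3) for _ in range(7)]    # 7 time steps, handled by the same weights
print(run_rnn(short_seq, W_x, W_h, b), run_rnn(long_seq, W_x, W_h, b))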

16

What does an RNN look like?

17

How does an RNN work?

18

● The way RNNs do this is by taking the output of each neuron (the input nodes are fed into a hidden layer with sigmoid or tanh activations) and feeding it back to that neuron as an input. By doing this, the network not only receives new pieces of information at every time step, but also adds to these new pieces of information a weighted version of the previous output. As you can see, the hidden-layer outputs are passed through a conceptual delay block so that the previous state hₜ₋₁ can be fed back into the hidden layer. What is the point of this? Simply that we can now model time- or sequence-dependent data.
● This gives these neurons a kind of "memory" of the previous inputs they have seen, since those inputs are, in a sense, quantified by the output being fed back into the neuron.
● A particularly good example of this is predicting text sequences. Consider the following text string: "A girl walked into a bar, and she said 'Can I have a drink please?'. The bartender said 'Certainly { }'". There are many options for what could fill the { } symbol in the above string, for instance "miss", "ma'am" and so on. However, other words could also fit, such as "sir", "Mister" etc. To get the correct gender of the noun, the neural network needs to "recall" that two previous words designating the likely gender (i.e. "girl" and "she") were used.
● A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to its successor. This flow of information through time (or through the sequence) is shown in the diagram below, which unrolls the loop:
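
A minimal sketch of one such step (illustrative; the exact equations are not on the slide): the new hidden state mixes the current input with a weighted version of the previous output, which plays the role of the delay block.

import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # New state = squashed mix of the current input and a weighted version of the previous output.
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)   # W_h @ h_prev is the feedback / delay path

rng = np.random.default_rng(2)
W_x, W_h, b = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), np.zeros(5)
h = np.zeros(5)                                    # the state before the first input
h = rnn_step(rng.normal(size=3), h, W_x, W_h, b)   # h_0
h = rnn_step(rng.normal(size=3), h, W_x, W_h, b)   # h_1 now depends on both inputs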

19

● This unrolled network shows how we can supply a stream of data (naturally suited to sequences, lists and time-series data) to the recurrent neural network. For instance, first we supply the word vector for "A" to the network F; the outputs of the nodes in F are fed into the "next" network and also act as a stand-alone output (h₀). The next network (though it is the same network) F, at time t = 1, takes the next word vector, for "girl", together with the previous output h₀, into its hidden nodes, producing the next output h₁, and so on.
● NOTE: Although the diagram shows it this way for ease of explanation, the words themselves, i.e. "A", "girl" etc., are not input directly into the neural network. Neither are their one-hot vector representations; rather, an embedding vector for each word (e.g. from Word2Vec) is used.
● One last thing to note: the weights of the connections between time steps are shared, i.e. there is not a different set of weights for each time step. The weights are the same at every time step because we have the same single RNN cell looped back onto itself.
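
To make the unrolling concrete, here is an illustrative sketch that chains one shared cell over the sentence fragment used above. The random embeddings are stand-ins for real Word2Vec vectors, and the sizes are assumptions for the example.

import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

words = ["A", "girl", "walked", "into", "a", "bar"]
rng = np.random.default_rng(3)
embed = {w: rng.normal(size=8) for w in set(words)}   # stand-ins for Word2Vec embeddings

W_x, W_h, b = rng.normal(size=(16, 8)), rng.normal(size=(16, 16)), np.zeros(16)
h = np.zeros(16)                                      # initial state
for w in words:                                       # the same cell is reused at every step
    h = rnn_step(embed[w], h, W_x, W_h, b)            # produces h_0, h_1, ... in turn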

20

Types of RNN

21

Introduction to Recurrent Neural Network

22

RNN Limitations
● The problem with RNNs is that, as time passes and they are fed more and more new data, they start to "forget" the earlier data they have seen (the vanishing gradient problem), as it gets diluted between the new data, the transformation by the activation function, and the weight multiplication. This means they have a good short-term memory, but struggle to remember things that happened a while ago (data seen many time steps in the past).
● The more time steps we have, the more chance there is of the back-propagated gradients either accumulating and exploding, or vanishing down to nothing.
● Consider the following representation of a recurrent neural network:
● Here, hₜ is the new state (current time step), hₜ₋₁ is the previous state (previous time step), and xₜ is the current input.
● U and V are the weight matrices connecting the inputs and the recurrent outputs, respectively. We then often apply a softmax over the outputs hₜ.
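
The slide does not spell the formula out, so the following is a hedged reconstruction in the slide's own symbols (with W as an assumed output weight matrix and f a squashing activation such as tanh or the sigmoid):

h_t = f(U x_t + V h_{t-1}), \qquad \hat{y}_t = \operatorname{softmax}(W h_t)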

23

Notice, however, that if we go back three time steps in our recurrent neural network, we have the following:
From the above you can see that, as we work our way back in time, we are essentially adding deeper and deeper layers to our network. This causes a problem: consider the gradient of the error with respect to the weight matrix U during back-propagation through time. It looks something like this:
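The equation itself appears as an image on the original slide; a standard rough form of it, reconstructed here as an assumption consistent with the slide's symbols (E₃ being the error at, say, time step 3), is:

\frac{\partial E_3}{\partial U} \approx \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial h_3}\left(\prod_{j=k+1}^{3}\frac{\partial h_j}{\partial h_{j-1}}\right)\frac{\partial h_k}{\partial U}

Each factor \partial h_j / \partial h_{j-1} contains the derivative of the activation function, which is what the next points discuss.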
● The equation above is only a rough approximation of what is going on during back-propagation through time. Each of these gradients involves calculating the gradient of the sigmoid function. The problem with the sigmoid function occurs when the input values are such that the output is close to either 0 or 1: at that point the gradient is very small (the function saturates).
● For example, say the value decreases like 0.863 → 0.532 → 0.356 → 0.192 → 0.117 → 0.086 → 0.023 → 0.019...
● You can see that there is not much change in the last three iterations.
● It means that when you multiply many sigmoid gradients together, you are multiplying many values that are potentially much less than one, and this leads to the vanishing gradient problem.
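
A quick numeric illustration of that multiplication (not from the slides): the sigmoid's derivative is at most 0.25, so chaining many such factors shrinks the product towards zero.

import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)                 # never larger than 0.25 (the value at z = 0)

grads = sigmoid_grad(np.random.default_rng(4).normal(size=20))
print(grads.max(), np.prod(grads))       # every factor <= 0.25, so the product collapses towards 0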

24

The mathematics that computes this change is multiplicative, which means that the gradient calculated at a step deep in the neural network is multiplied back through the weights earlier in the network. Said differently, the gradient calculated deep in the network is "diluted" as it moves back through the net, which can cause the gradient to vanish, giving the vanishing gradient problem its name.
The actual factor that is multiplied through a recurrent neural network in the backpropagation algorithm is referred to by the mathematical variable Wrec (the recurrent weight). It poses two problems:
● When Wrec is small, we experience a vanishing gradient problem.
● When Wrec is large, we experience an exploding gradient problem.
Note that both of these problems are often referred to by the simpler name of the "vanishing gradient problem".
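
Both regimes are easy to see with illustrative numbers (not from the slides): repeatedly multiplying by a recurrent factor below 1 drives the gradient towards zero, while a factor above 1 blows it up.

w_rec_small, w_rec_large, steps = 0.5, 1.5, 50

print(w_rec_small ** steps)   # ~8.9e-16 -> the gradient vanishes
print(w_rec_large ** steps)   # ~6.4e+08 -> the gradient explodes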

25

Solving Vanishing Gradient Problem
Solving the Exploding Gradient Problem
● For exploding gradients, it is possible to use a modified version of the backpropagation algorithm called truncated backpropagation. Truncated backpropagation limits the number of timesteps that backpropagation is performed over, stopping the algorithm before the exploding gradient problem occurs.
● You can also introduce penalties, which are hard-coded techniques for reducing a backpropagation's impact as it moves through the shallower layers of a neural network.
● Lastly, you could introduce gradient clipping, which imposes an artificial ceiling on how large the gradient can become in the backpropagation algorithm.
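
As an illustration of the last point (a hedged sketch; deep-learning frameworks ship their own clipping utilities), clipping rescales the gradient whenever its norm exceeds a chosen ceiling:

import numpy as np

def clip_gradient(grad, max_norm=5.0):
    # Rescale the gradient so that its L2 norm never exceeds max_norm.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])      # an "exploding" gradient with norm 50
print(clip_gradient(g))          # [ 3. -4.]: direction kept, norm capped at 5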

26

Solving the Vanishing Gradient Problem
● Weight initialization is one technique that can be used to mitigate the vanishing gradient problem. It involves deliberately choosing the initial values of the weights in a neural network so that the backpropagation algorithm does not end up with weights that are unrealistically small.
● You could also use echo state networks, a specific type of neural network designed to avoid the vanishing gradient problem.
● The most important solution to the vanishing gradient problem is a specific type of neural network called the Long Short-Term Memory network (LSTM), pioneered by Sepp Hochreiter and Jürgen Schmidhuber. Recall that Hochreiter was the scientist who originally identified the vanishing gradient problem.
● LSTMs are used primarily in problems related to speech recognition, with one of the most notable examples being Google using an LSTM for speech recognition in 2015 and seeing a 49% decrease in transcription errors.
● LSTMs are considered the go-to neural network for practitioners implementing recurrent neural networks.

27

References
● MachineLearningMastery
● AnalyticsVidhya
● SlideShare

28

Thank You !
Get in touch with us:
Lorem Studio, Lord Building
D4456, LA, USA
