Module 4 Recurrent Neural Network
RECURRENT NEURAL
NETWORK
BY
Dr REEMA MATHEW A
PROFESSOR, CSE
SYLLABUS
Why RNN?
The input layer ‘x’ takes in the input to the neural network and
processes it and passes it onto the middle layer.
The middle layer ‘h’ can consist of multiple hidden layers, each with
its own activation functions and weights and biases.
The Recurrent Neural Network will standardize the different
activation functions and weights and biases so that each hidden
layer has the same parameters.
Then, instead of creating multiple hidden layers, it will create one
and loop over it as many times as required.
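The idea of one hidden layer that is looped over can be sketched as follows. This is a minimal illustration, not a full RNN: the tanh activation, the parameter names W, U, b and all numeric values are assumptions chosen for the example.

```python
import math

def rnn_step(h_prev, x, W, U, b):
    """One pass through the single hidden layer: h = tanh(W h_prev + U x + b)."""
    return [
        math.tanh(
            sum(W[i][j] * h_prev[j] for j in range(len(h_prev)))
            + sum(U[i][k] * x[k] for k in range(len(x)))
            + b[i]
        )
        for i in range(len(b))
    ]

# Hypothetical toy parameters: hidden size 2, input size 1.
W = [[0.5, -0.3], [0.1, 0.4]]    # hidden-to-hidden weights
U = [[1.0], [-1.0]]              # input-to-hidden weights
b = [0.0, 0.1]                   # hidden bias

inputs = [[0.2], [0.7], [-0.4]]  # a sequence of three time steps
h = [0.0, 0.0]                   # initial hidden state
states = []
for x in inputs:                 # the same W, U, b are reused at every step
    h = rnn_step(h, x, W, U, b)
    states.append(h)

print(len(states), "hidden states computed from one shared set of parameters")
```

A longer input sequence only means more loop iterations; the number of parameters stays the same.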
Feed-Forward Neural Networks vs Recurrent
Neural Networks
RNN-TYPES/VARIANTS
https://www.simplilearn.com/tutorials/deep-learning-tutorial/rnn
RNN-TYPES/VARIANTS
https://medium.com/analytics-vidhya/what-is-rnn-a157d903a88
Objectives-26/10/23
RNN design
Encoder – decoder sequence to sequence
architectures
Language modeling example of RNN
Deep recurrent networks
Recursive neural networks
RNN DESIGN
3 TYPES
• Recurrent networks that produce an output at each time
step and have recurrent connections between hidden
units, illustrated in figure 10.3
• Recurrent networks that produce an output at each
time step and have recurrent connections only from the
output at one time step to the hidden units at the next
time step, illustrated in figure 10.4
• Recurrent networks with recurrent connections between
hidden units, that read an entire sequence and then
produce a single output, illustrated in figure 10.5
The figure does not specify the choice of activation function for the hidden
units. Here we assume the hyperbolic tangent activation function.
Also, the figure does not specify exactly what form the output and loss function
take.
Here we assume that the output is discrete, as if the RNN is used to predict
words or characters.
A natural way to represent discrete variables is to regard the output o as giving
the unnormalized log probabilities of each possible value of the discrete
variable.
We can then apply the softmax operation as a post-processing step to obtain a
vector ŷ of normalized probabilities over the output. Forward propagation
begins with a specification of the initial state h^{(0)}. Then, for each time step
from t = 1 to t = τ, we apply the following update equations:
a^{(t)} = b + W h^{(t-1)} + U x^{(t)}
h^{(t)} = \tanh(a^{(t)})
o^{(t)} = c + V h^{(t)}
\hat{y}^{(t)} = \mathrm{softmax}(o^{(t)})
where the parameters are the bias vectors b and c along with the weight
matrices U, V and W, respectively, for input-to-hidden, hidden-to-output and
hidden-to-hidden connections. This is an example of a recurrent network that
maps an input sequence to an output sequence of the same length. The total
loss for a given sequence of x values paired with a sequence of y values
would then be just the sum of the losses over all the time steps. For
example, if L^{(t)} is the negative log-likelihood of y^{(t)} given x^{(1)}, ..., x^{(t)},
then
L(\{x^{(1)}, ..., x^{(τ)}\}, \{y^{(1)}, ..., y^{(τ)}\}) = \sum_t L^{(t)} = -\sum_t \log p_{model}(y^{(t)} | \{x^{(1)}, ..., x^{(t)}\})
where p_{model}(y^{(t)} | \{x^{(1)}, ..., x^{(t)}\}) is given by reading the entry for y^{(t)} from the model’s
output vector ŷ^{(t)}. Computing the gradient of this loss function with respect to the
parameters is an expensive operation. The gradient computation involves performing
a forward propagation pass moving left to right through our illustration of the
unrolled graph in figure 10.3, followed by a backward propagation pass moving
right to left through the graph.
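A minimal sketch of this forward pass is given here. It assumes the update a = b + W h_prev + U x, h = tanh(a), o = c + V h, ŷ = softmax(o), and a per-step loss equal to the negative log of the probability assigned to the target. The toy sizes and all numeric values are hypothetical; only the forward pass and the summed loss are shown, not the backward pass.

```python
import math

def matvec(M, v):
    """Matrix-vector product on plain Python lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(o):
    """Turn unnormalized log probabilities into normalized probabilities."""
    top = max(o)                           # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def forward(xs, ys, U, V, W, b, c, h0):
    """Run the RNN over a sequence; return per-step predictions and the total loss."""
    h, total_loss, y_hats = h0, 0.0, []
    for x, y in zip(xs, ys):               # t = 1 .. tau
        a = [bi + wh + ux for bi, wh, ux in zip(b, matvec(W, h), matvec(U, x))]
        h = [math.tanh(ai) for ai in a]
        o = [ci + vh for ci, vh in zip(c, matvec(V, h))]
        y_hat = softmax(o)
        total_loss += -math.log(y_hat[y])  # L(t): negative log-likelihood of the target
        y_hats.append(y_hat)
    return y_hats, total_loss

# Hypothetical toy sizes: 3 input symbols (one-hot), 2 hidden units, 3 output classes.
U = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]    # input-to-hidden, 2 x 3
W = [[0.1, 0.4], [-0.3, 0.2]]               # hidden-to-hidden, 2 x 2
V = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]  # hidden-to-output, 3 x 2
b, c = [0.0, 0.0], [0.0, 0.0, 0.0]

xs = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]      # input sequence x(1), x(2), x(3)
ys = [1, 2, 0]                              # target class index at each step
y_hats, loss = forward(xs, ys, U, V, W, b, c, h0=[0.0, 0.0])
print(f"total loss over {len(xs)} steps: {loss:.4f}")
```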
Encoders-Decoders - Sequence to Sequence Architecture
An RNN typically has fixed-size input and output vectors, i.e., the
lengths of both the input and output vectors are predefined.
This is not desirable in use cases such as speech recognition,
machine translation, etc., where the input and output
sequences need not be of fixed or equal length.
For example, “How have you been?” translates into French as
“Comment avez-vous été?”. Here, neither the input nor the output
sequence is fixed in size. In this context, if you want to build a
language translator using an RNN, you do not want to define the
sequence lengths beforehand.
This architecture was pioneered in 2014 and has been adopted as
the core technology inside Google’s translate service.
The idea behind the design of this model is to enable it
to process input where we do not constrain the length.
One RNN will be used as an encoder, and another as a
decoder.
The output vector generated by the encoder and the
input vector given to the decoder will possess a fixed
size.
However, they need not be equal.
The output generated by the encoder can either be
given as a whole chunk or can be connected to the
hidden units of the decoder unit at every time step.
The RNNs in the encoder and decoder can be simple
RNNs, LSTMs, or GRUs.
Encoder-Decoder Model
Encoder
• Multiple RNN cells can be stacked together to form the encoder. The RNN
reads the inputs sequentially.
• For every timestep (each input) t, the hidden state (hidden vector) h is
updated according to the input at that timestep, X[t].
• After all the inputs are read by the encoder model, the final hidden state of
the model represents the context/summary of the whole input sequence.
• Example: Consider the input sequence “I am a Student” to be encoded.
There will be 4 timesteps in total (4 tokens) for the encoder model. At each
time step, the hidden state h is updated using the previous hidden
state and the current input.
• At the first timestep t1, the previous hidden state h0 is taken to be
zero or randomly chosen.
• So the first RNN cell updates the current hidden state using the first input
and h0.
• Each layer outputs two things: the updated hidden state and the output for
that stage.
• The outputs at each stage are discarded, and only the hidden states are
propagated to the next layer.
• The hidden states h_t are computed using the formula:
h_t = f(W^{(hh)} h_{t-1} + W^{(hx)} x_t)
• At the second timestep t2, the hidden state h1 and the second input X[2]
are given as input, and the hidden state h2 is computed from both.
• This happens for all four stages in the example.
Encoder Vector
• This is the final hidden state produced from the encoder part of the
model. It is calculated using the formula above.
• This vector aims to encapsulate the information for all input elements
in order to help the decoder make accurate predictions.
• It acts as the initial hidden state of the decoder part of the model.
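The encoder steps can be sketched as follows: the tokens are read one at a time, the per-step outputs are ignored, and only the final hidden state is kept as the encoder vector. The one-hot token encoding, tanh as the function f, the weight names W_hh and W_hx, and all numeric values are assumptions made for this illustration.

```python
import math

def encode(inputs, W_hh, W_hx, h0):
    """Read the inputs in order and return the final hidden state (encoder vector)."""
    h = h0
    for x in inputs:
        # New hidden state from the previous hidden state and the current input.
        h = [
            math.tanh(
                sum(w * v for w, v in zip(W_hh[i], h))
                + sum(w * v for w, v in zip(W_hx[i], x))
            )
            for i in range(len(h0))
        ]
    return h

VOCAB = ["I", "am", "a", "Student"]   # toy vocabulary for the example sentence

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in VOCAB]

# Hypothetical weights: hidden size 2, input size 4.
W_hh = [[0.2, -0.1], [0.4, 0.3]]
W_hx = [[0.5, -0.5, 0.1, 0.9], [-0.3, 0.7, 0.2, -0.6]]

context = encode([one_hot(w) for w in "I am a Student".split()], W_hh, W_hx, [0.0, 0.0])
short = encode([one_hot("I"), one_hot("am")], W_hh, W_hx, [0.0, 0.0])
print(context, short)
```

Both the four-token and the two-token input yield a vector of the same size, which is what lets the decoder work with inputs of any length.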
Decoder
• The decoder generates the output sequence by predicting the next
output y_t given the hidden state h_t.
• The input to the decoder is the final hidden vector obtained at the
end of the encoder model.
• Each layer has three inputs: the hidden vector from the previous layer
h_{t-1}, the previous layer's output y_{t-1}, and the original hidden vector h.
• At the first layer, the output vector of the encoder, the special symbol
START, and an empty hidden state h_{t-1} are given as input; the outputs
obtained are y_1 and the updated hidden state h_1.
• The second layer takes the updated hidden state h_1, the previous
output y_1, and the original hidden vector h as inputs, and produces the
hidden vector h_2 and the output y_2.
• The outputs produced at each timestep of the decoder are the actual output.
The model keeps predicting until the END symbol occurs.
• The decoder is a stack of several recurrent units, where each predicts an
output y_t at time step t.
• Each recurrent unit accepts a hidden state from the previous unit and
produces an output as well as its own hidden state.
• Any hidden state h_t is computed using the formula:
h_t = f(W^{(hh)} h_{t-1})
Output Layer
• We use the softmax activation function at the output layer.
• It produces a probability distribution from a vector of
values, in which the target class should receive a high probability.
• The output y_t at time step t is computed using the formula:
y_t = \mathrm{softmax}(W^{S} h_t)
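A hedged sketch of the decoding loop follows. It starts from the encoder vector and the START symbol, feeds each predicted symbol back in as the next input, applies softmax at the output layer, and stops at END. The greedy (argmax) choice, the max_len safety limit, tanh as the function f, the weight names W_hh, W_hy, W_s and the toy vocabulary are all assumptions made for illustration.

```python
import math

VOCAB = ["<START>", "<END>", "w1", "w2"]   # hypothetical output vocabulary

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(o):
    top = max(o)
    exps = [math.exp(v - top) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(i, n):
    return [1.0 if j == i else 0.0 for j in range(n)]

def decode(context, W_hh, W_hy, W_s, max_len=10):
    """Greedy decoding: pick the most probable symbol at each step until END."""
    h = context                              # encoder vector initialises the state
    prev = VOCAB.index("<START>")
    out = []
    for _ in range(max_len):                 # guard against never predicting END
        pre = [a + b for a, b in zip(matvec(W_hh, h),
                                     matvec(W_hy, one_hot(prev, len(VOCAB))))]
        h = [math.tanh(p) for p in pre]      # updated hidden state
        probs = softmax(matvec(W_s, h))      # distribution over the vocabulary
        prev = max(range(len(VOCAB)), key=lambda k: probs[k])
        if VOCAB[prev] == "<END>":
            break
        out.append(VOCAB[prev])
    return out

# Hypothetical weights: hidden size 2, vocabulary size 4.
W_hh = [[0.3, -0.2], [0.1, 0.5]]
W_hy = [[0.2, 0.0, -0.4, 0.6], [0.0, 0.3, 0.5, -0.7]]
W_s = [[0.1, 0.1], [-0.5, 0.9], [1.2, -0.3], [-0.8, 0.4]]

output = decode([0.4, -0.6], W_hh, W_hy, W_s, max_len=5)
print(output)
```

With untrained weights the predicted symbols carry no meaning; the point is the control flow of the loop.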
We have shown the probabilistic prediction of the next word at each of the timestamps from
1 to 4.
Ideally, we would like the probability of the next word to be predicted correctly from the
probabilities of the previous words.
Each one-hot encoded input vector has length four, in which only one bit is 1 and the
remaining bits are 0s. The main flexibility here is in the dimensionality p of the hidden
representation, which we set to 2 in this case.
As a result, the matrix Wxh will be a 2 × 4 matrix, so that it maps a one-hot encoded
input vector into a hidden vector of size 2.
As a practical matter, each column of Wxh corresponds to one of the four words, and one
of these columns is copied by the expression Wxh x_t.
Note that this expression is added to Whh h_{t-1} and then transformed with the tanh function
to produce the final expression.
The final output y_t is defined by Why h_t. Note that the matrices Whh and Why are of sizes 2
× 2 and 4 × 2, respectively.
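The shapes in this example can be checked with a small sketch. It uses a 2 × 4 matrix W_xh, a 2 × 2 matrix W_hh and a 4 × 2 matrix W_hy, and shows that multiplying W_xh by a one-hot vector simply copies one column. All numeric values are made up for illustration.

```python
import math

def matvec(M, v):
    """Matrix-vector product on plain Python lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Hypothetical weights with the sizes from the example (p = 2, vocabulary of 4).
W_xh = [[0.1, 0.2, 0.3, 0.4],
        [0.5, 0.6, 0.7, 0.8]]                 # 2 x 4: one column per word
W_hh = [[0.0, 0.1],
        [0.1, 0.0]]                           # 2 x 2
W_hy = [[1.0, 0.0], [0.0, 1.0],
        [1.0, 1.0], [-1.0, 1.0]]              # 4 x 2

x_t = [0, 1, 0, 0]        # one-hot vector for the second word of the vocabulary
h_prev = [0.0, 0.0]       # previous hidden state

copied = matvec(W_xh, x_t)                    # picks out the second column of W_xh
h_t = [math.tanh(a + b) for a, b in zip(copied, matvec(W_hh, h_prev))]
y_t = matvec(W_hy, h_t)                       # one continuous score per word

print(copied)             # [0.2, 0.6]
print(len(h_t), len(y_t))
```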
In this case, the outputs are continuous values (not probabilities) in which
larger values indicate greater likelihood of presence. These continuous values
are eventually converted to probabilities with the softmax function, and
therefore one can treat them as substitutes for log probabilities.
The word “cat” is predicted in the first time-stamp with a value of 1.3, although
this value seems to be (incorrectly) outstripped by “mouse” for which the
corresponding value is 1.7.
However, the word “chased” seems to be predicted correctly at the next
time-stamp.
As in all learning algorithms, one cannot hope to predict every value exactly,
and such errors are more likely to be made in the early iterations of the
backpropagation algorithm.
However, as the network is repeatedly trained over multiple iterations, it
makes fewer errors over the training data.
Apply SGD: compute the loss for a sentence (in practice, a batch of sentences), compute the
gradients, update the weights, and repeat this process.
Deep recurrent networks
Recursive neural networks
Recursive neural networks represent yet another generalization of
recurrent networks, with a different kind of computational graph, which
is structured as a deep tree, rather than the chain-like structure of RNNs.
The typical computational graph for a recursive network is illustrated in
figure 10.14.
Recursive neural networks were introduced by Pollack (1990), and their
potential use for learning to reason was described by Bottou (2011).
Recursive networks have been successfully applied to processing data
structures as input to neural nets (Frasconi et al., 1997, 1998), in natural
language processing (Socher et al., 2011a,c, 2013a), as well as in
computer vision (Socher et al., 2011b).
One clear advantage of recursive nets over recurrent nets is that for a
sequence of the same length τ, the depth (measured as the number of
compositions of nonlinear operations) can be drastically reduced from τ to
O(log τ ), which might help deal with long-term dependencies.
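The depth reduction can be illustrated by counting composition levels. This sketch assumes a balanced binary tree for the recursive net and one nonlinear composition per element for the recurrent chain; the function names are hypothetical.

```python
def chain_depth(n):
    """Recurrent net: one nonlinear composition per sequence element."""
    return n

def balanced_tree_depth(n):
    """Recursive net on a balanced binary tree: merge pairs until one node is left."""
    depth = 0
    while n > 1:
        n = (n + 1) // 2   # each level combines neighbouring pairs of nodes
        depth += 1
    return depth

for tau in (8, 100, 1024):
    print(tau, chain_depth(tau), balanced_tree_depth(tau))
# prints:
# 8 8 3
# 100 100 7
# 1024 1024 10
```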
An open question is how to best structure the tree. One option is to have a
tree structure that does not depend on the data, such as a balanced binary
tree.
In some application domains, external methods can suggest the appropriate
tree structure.
For example, when processing natural language sentences, the tree
structure for the recursive network can be fixed to the structure of the parse
tree of the sentence provided by a natural language parser (Socher et al.,
2011a, 2013a).
https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
https://medium.com/analytics-vidhya/undestanding-recurrent-neural-network-rnn-and-long-short-term-memory-lstm-30bc1221e80d
LSTM (Long Short Term Memory)
LSTMs are a type of Recurrent Neural Network for learning long-term
dependencies. They are commonly used for processing and predicting
time-series data.
They were introduced by Hochreiter & Schmidhuber (1997).
From the image we can see that LSTMs have a chain-like structure. In general
RNNs, the repeating module has a single neural network layer. In LSTMs, on
the other hand, it has four layers that interact in a special way.
Why LSTM
If we are trying to predict the last word in
“the clouds are in the sky,” we don’t need
any further context – it’s pretty obvious the
next word is going to be sky.
In such cases, where the gap between the
relevant information and the place that it’s
needed is small, RNNs can learn to use the
past information.
But there are also cases where we need
more context. Consider trying to predict the
last word in the text “I grew up in France… I
speak fluent French.”
Unfortunately, as that gap grows, RNNs
become unable to learn to connect the
information.
LSTMs are explicitly designed to avoid
the long-term dependency problem.
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
https://www.simplilearn.com/tutorials/deep-learning-tutorial/rnn
The Core Idea Behind LSTMs
The key to LSTMs is the cell state, the
horizontal line running through the top of the
diagram.
The cell state is kind of like a conveyor belt. It
runs straight down the entire chain, with only
some minor linear interactions.
It’s very easy for information to just flow
along it unchanged.
The LSTM does have the ability to remove or
add information to the cell state, carefully
regulated by structures called gates.
Forget gate, Input Gate, output Gate
Gates are a way to optionally let information through. They are composed
out of a sigmoid neural net layer and a pointwise multiplication operation.
An LSTM has three of these gates, to protect and control the cell state.
Working of LSTM
Step 1: Decide How Much Past Data It Should Remember-
Forget gate
The first step in the LSTM is to decide which information should be
omitted from the cell in that particular time step.
The sigmoid function determines this. It looks at h_{t-1} and x_t, and
outputs a number between 0 and 1 for each number in the cell
state C_{t-1}.
A 1 represents “completely keep this” while a 0 represents
“completely get rid of this.”
Forget Gate:
This gate decides which information is important and should be stored, and
which information should be forgotten.
It removes the unimportant information from the cell, which helps
performance.
This gate takes two inputs: the output generated by the previous cell and
the input of the current cell. These are multiplied by the weights, the bias
is added, and the sigmoid function is applied to the result.
A value between 0 and 1 is generated, and based on this value we decide which
information to keep.
If the value is 0, the forget gate removes that information; if the value is 1,
the information is important and is remembered.
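The forget gate can be sketched as a sigmoid layer followed by a pointwise multiplication with the previous cell state. The weight and bias names W_f and b_f and all values are hypothetical; the large positive and negative biases are chosen only to push the two gate values close to 1 and 0.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(h_prev, x, W_f, b_f):
    """f = sigmoid(W_f [h_prev, x] + b_f): one value in (0, 1) per cell-state entry."""
    concat = h_prev + x    # concatenate previous output and current input
    return [sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
            for row, b in zip(W_f, b_f)]

# Hypothetical values: hidden size 2, input size 1.
W_f = [[0.1, 0.2, 0.3], [0.1, 0.2, 0.3]]
b_f = [10.0, -10.0]        # first entry is kept, second is forgotten
h_prev, x = [0.5, -0.5], [1.0]
c_prev = [2.0, -1.5]       # previous cell state

f = forget_gate(h_prev, x, W_f, b_f)
c_kept = [fi * ci for fi, ci in zip(f, c_prev)]   # pointwise multiplication
print(f, c_kept)
```

The first cell-state entry passes through almost unchanged, while the second is scaled to nearly zero.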
Step 2: Decide How Much This Unit Adds to the Current State – Input gate
With the current input at x(t), the input gate analyzes the important information:
John plays football, and the fact that he was the captain of his college team is
important.
“He told me yesterday over the phone” is less important; hence it's forgotten. This
process of adding some new information can be done via the input gate.
• Input Gate:
• This gate is used to add information to the cell. It decides which
values should be added to the cell by using an activation function such as
sigmoid.
• It creates an array of candidate information to be added. This is done
using another activation function, tanh, which generates values
between -1 and 1.
• The sigmoid function acts as a filter and regulates what information is
added to the cell.
Step 3: Decide What Part of the Current Cell State
Makes It to the Output – Output gate
The third step is to decide what the output will be. First, we run a sigmoid layer,
which decides what parts of the cell state make it to the output. Then, we put the
cell state through tanh to push the values to be between -1 and 1 and multiply it by
the output of the sigmoid gate.
Let’s consider this example to predict the next word in the sentence: “John played
tremendously well against the opponent and won for his team. For his
contributions, brave ____ was awarded player of the match.”
There could be many choices for the empty space. The current input brave is an
adjective, and adjectives describe a noun. So, “John” could be the best output after
brave.
Output Gate:
This gate is responsible for selecting important information
from the current cell and showing it as output. It creates a vector of
values using the tanh function, which ranges from -1 to 1. It uses the
previous output and the current input, passed through a sigmoid
function, as a regulator that decides which values should be
shown as output.
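The three gates can be combined into a single LSTM time step, sketched here. It assumes the commonly used formulation in which each gate is a sigmoid layer over the concatenated previous output and current input. The parameter names (W_f, W_i, W_c, W_o and matching biases) and all numeric values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Compute W v + b on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step; returns the new output h and the new cell state c."""
    z = h_prev + x                                            # [h_{t-1}, x_t]
    f = [sigmoid(v) for v in affine(p["W_f"], z, p["b_f"])]   # forget gate
    i = [sigmoid(v) for v in affine(p["W_i"], z, p["b_i"])]   # input gate
    g = [math.tanh(v) for v in affine(p["W_c"], z, p["b_c"])] # candidate values
    o = [sigmoid(v) for v in affine(p["W_o"], z, p["b_o"])]   # output gate
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# Hypothetical parameters: hidden size 2, input size 1 (each W is 2 x 3).
params = {
    "W_f": [[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]], "b_f": [1.0, 1.0],
    "W_i": [[0.3, -0.2, 0.1], [0.2, 0.1, -0.3]], "b_i": [0.0, 0.0],
    "W_c": [[0.5, 0.4, -0.6], [-0.4, 0.3, 0.2]], "b_c": [0.0, 0.0],
    "W_o": [[0.1, 0.1, 0.1], [-0.2, 0.2, 0.4]], "b_o": [0.0, 0.0],
}

h, c = [0.0, 0.0], [0.0, 0.0]
for x in ([0.5], [-1.0], [0.25]):        # a short input sequence
    h, c = lstm_step(x, h, c, params)
print(h, c)
```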
LSTM Use Case- predict stock prices
Now that you understand how LSTMs work, let’s do a practical implementation to predict the prices
of stocks using the “Google stock price” data.
Based on the stock price data between 2012 and 2016, we will predict the stock prices of 2017.
https://www.simplilearn.com/tutorials/machine-learning-tutorial/stock-price-prediction-using-machine-learning
RNN REVIEW
GRU
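The reset gate r_t and the update gate z_t are computed as
r_t = \sigma(W_r [h_{t-1}, x_t])
z_t = \sigma(W_z [h_{t-1}, x_t])
where W_r and W_z are weight matrices that are learned during training.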
GRU Architecture
1. Input layer: The input layer takes in sequential data, such as a
sequence of words or a time series of values, and feeds it into the
GRU.
2. Hidden layer: The hidden layer is where the recurrent computation
occurs. At each time step, the hidden state is updated based on the
current input and the previous hidden state. The hidden state is a
vector of numbers that represents the network’s “memory” of the
previous inputs.
3. Reset gate: The reset gate determines how much of the previous
hidden state to forget. It takes as input the previous hidden state and
the current input, and produces a vector of numbers between 0 and
1 that controls the degree to which the previous hidden state is
“reset” at the current time step.
4. Update gate: The update gate determines how much of the candidate
activation vector to incorporate into the new hidden state. It takes as
input the previous hidden state and the current input, and produces a
vector of numbers between 0 and 1 that controls the degree to which the
candidate activation vector is incorporated into the new hidden state.
5. Candidate activation vector: The candidate activation vector is a
modified version of the previous hidden state that is “reset” by the reset
gate and combined with the current input. It is computed using a tanh
activation function that squashes its output between -1 and 1.
6. Output layer: The output layer takes the final hidden state as input and
produces the network’s output. This could be a single number, a sequence
of numbers, or a probability distribution over classes, depending on the
task at hand.
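The components listed above can be put together in a sketch of one GRU time step. It assumes bias-free gates over the concatenated previous hidden state and current input, and the interpolation h = (1 - z) * h_prev + z * candidate, so that a larger update-gate value incorporates more of the candidate. Some references and libraries swap the roles of z and 1 - z. The name W_h and all numeric values are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gru_step(x, h_prev, W_r, W_z, W_h):
    """One GRU time step; returns the new hidden state."""
    concat = h_prev + x                                  # [h_{t-1}, x_t]
    r = [sigmoid(v) for v in matvec(W_r, concat)]        # reset gate
    z = [sigmoid(v) for v in matvec(W_z, concat)]        # update gate
    reset_h = [rk * hk for rk, hk in zip(r, h_prev)]     # "reset" previous state
    cand = [math.tanh(v) for v in matvec(W_h, reset_h + x)]  # candidate activation
    return [(1.0 - zk) * hk + zk * ck for zk, hk, ck in zip(z, h_prev, cand)]

# Hypothetical weights: hidden size 2, input size 1 (each matrix is 2 x 3).
W_r = [[0.2, -0.1, 0.4], [0.3, 0.1, -0.2]]
W_z = [[-0.3, 0.2, 0.1], [0.1, 0.4, 0.3]]
W_h = [[0.5, -0.4, 0.6], [-0.2, 0.3, 0.7]]

h = [0.0, 0.0]
for x in ([1.0], [0.5], [-0.5]):         # a short input sequence
    h = gru_step(x, h, W_r, W_z, W_h)
print(h)
```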
3. Learning Ability
LSTM: Due to its more complex architecture, LSTM can potentially learn more
complex patterns and relationships in the data. It is well-suited for tasks where
capturing long-term dependencies is critical.
GRU: GRU can still learn to capture long-term dependencies effectively. It
performs well in many natural language processing tasks and is a popular choice
for various sequence modeling tasks.
4. Training Speed
LSTM: LSTM has more parameters, which can result in slightly slower training
times compared to GRU, especially on larger datasets.
GRU: With fewer parameters, GRU may have faster training times, making it more
efficient for larger datasets.
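The difference in parameter count can be made concrete with a rough calculation. It assumes each gate or candidate block has one weight matrix over the concatenated hidden state and input plus one bias vector, giving four such blocks for LSTM and three for GRU. Library implementations may count biases differently, so the exact totals are illustrative.

```python
def gated_rnn_params(input_size, hidden_size, n_blocks):
    """Weights and biases for n_blocks gate/candidate layers of a single cell."""
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return n_blocks * per_block

input_size, hidden_size = 10, 20          # hypothetical layer sizes
lstm_params = gated_rnn_params(input_size, hidden_size, n_blocks=4)
gru_params = gated_rnn_params(input_size, hidden_size, n_blocks=3)

print(lstm_params, gru_params)            # 2480 1860
print(gru_params / lstm_params)           # 0.75
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size.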
[Figure: LSTM cell (input gate, cell state, output gate) compared with a GRU cell]
https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be
CASE STUDY-BERT (Bidirectional Encoder Representations from Transformers),
Sentiment Analysis
https://www.kaggle.com/code/harshjain123/bert-for-everyone-tutorial-implementation
https://medium.com/@princedede/sentiment-analysis-using-python-hotel-reviews-case-study-c6b81f7cfa96