
Module 4 Recurrent Neural Network


MODULE 4

RECURRENT NEURAL
NETWORK
BY
Dr REEMA MATHEW A
PROFESSOR, CSE

SYLLABUS
Why RNN?

Issues in the feed forward neural network : -


1. Can’t handle sequential data.
2. Consider only current input.
3. Can’t memorize the previous input.
Recurrent Neural Network
What Are Recurrent Neural Networks (RNN)?
 Recurrent Neural Networks (RNNs) are a type of artificial neural network
designed to process sequences of data. They work especially well for tasks
involving sequential data, such as time series, speech, and natural language.
 RNN works on the principle of saving the output of a particular layer and feeding
this back to the input in order to predict the output of the layer.
 Below is how you can convert a Feed-Forward Neural Network into a Recurrent
Neural Network:

 The nodes in different layers of the neural network are compressed
to form a single layer of recurrent neural networks. A, B, and C are
the parameters of the network.

 Here, “x” is the input layer, “h” is the hidden layer, and “y” is the output
layer. A, B, and C are the network parameters used to improve the output
of the model. At any given time t, the hidden state is computed from the
current input x(t) and the previous hidden state h(t-1), which carries
information from x(t-1) and earlier inputs. The output at any given time is
fed back into the network to improve the output.

Why Recurrent Neural Networks?


 RNNs were created because there were a few issues in the feed-forward
neural network:
• Cannot handle sequential data
• Considers only the current input
• Cannot memorize previous inputs
 The solution to these issues is the RNN. An RNN can handle
sequential data, accepting the current input data, and previously
received inputs. RNNs can memorize previous inputs due to their
internal memory.
How Do Recurrent Neural Networks Work?
 In Recurrent Neural networks, the information cycles through
a loop to the middle hidden layer

 The input layer ‘x’ takes in the input to the neural network and
processes it and passes it onto the middle layer.
 The middle layer ‘h’ can consist of multiple hidden layers, each with
its own activation functions and weights and biases.
 The Recurrent Neural Network will standardize the different
activation functions and weights and biases so that each hidden
layer has the same parameters.
 Then, instead of creating multiple hidden layers, it will create one
and loop over it as many times as required.
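The idea of creating one hidden layer and looping over it can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions, not a reference implementation: the inputs and the hidden state are single numbers, and the names `run_rnn`, `w_x`, `w_h` and `b` are hypothetical, chosen only for this example.

```python
import math

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0, h0=0.0):
    """Loop over one hidden layer, reusing the same parameters at every step.

    inputs: list of numbers, one scalar input per time step.
    Returns the list of hidden states, one per time step.
    """
    h = h0
    states = []
    for x in inputs:
        # The same w_x, w_h and b are applied at each time step.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# The loop runs as many times as the sequence is long.
print(len(run_rnn([1.0, 0.0])), len(run_rnn([1.0, 0.0, 0.0, 0.0])))

# The second state still reflects the first input, even though the second input is 0.
print(run_rnn([1.0, 0.0])[1], run_rnn([0.0, 0.0])[1])
```

Because the parameters are shared, the same function handles sequences of any length without adding weights.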
Feed-Forward Neural Networks vs Recurrent
Neural Networks

 A feed-forward neural network allows information to flow only
in the forward direction, from the input nodes, through the
hidden layers, and to the output nodes.
 There are no cycles or loops in the network.
 In a feed-forward neural network, the decisions are based on
the current input.
 It doesn’t memorize past data, so it cannot use earlier inputs
to predict what comes next.
 Feed-forward neural networks are used in general regression
and classification problems.

Advantages of Recurrent Neural Network


 1.Ability To Handle Variable-Length Sequences
 RNNs are designed to handle input sequences of variable length, which
makes them well-suited for tasks such as speech recognition, natural
language processing, and time series analysis.
 2.Memory Of Past Inputs
 RNNs have a memory of past inputs, which allows them to capture
information about the context of the input sequence. This makes them
useful for tasks such as language modeling, where the meaning of a word
depends on the context in which it appears.
 3.Parameter Sharing
 RNNs share the same set of parameters across all time steps, which
reduces the number of parameters that need to be learned and can lead to
better generalization.
 4.Non-Linear Mapping
 RNNs use non-linear activation functions, which allows them to learn complex, non-
linear mappings between inputs and outputs.
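 5.Sequential Processing
 RNNs process input sequences one step at a time, so the computation per step stays
small and fixed regardless of sequence length. The trade-off is that this
step-by-step dependence makes them hard to parallelize across time steps.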
 6.Flexibility
 RNNs can be adapted to a wide range of tasks and input types, including text, speech,
and image sequences.
 7.Improved Accuracy
 RNNs have been shown to achieve state-of-the-art performance on a variety of
sequence modeling tasks, including language modeling, speech recognition, and
machine translation.
 These advantages make RNNs a powerful tool for sequence modeling and analysis,
and have led to their widespread use in a variety of applications, including natural
language processing, speech recognition, and time series analysis.

Disadvantages of Recurrent Neural Network


 1.Vanishing And Exploding Gradients
RNNs can suffer from the problem of vanishing or exploding gradients, which can
make it difficult to train the network effectively. This occurs when the gradients of
the loss function with respect to the parameters become very small or very large
as they propagate through time.
 2.Computational Complexity
 RNNs can be computationally expensive to train, especially when dealing
with long sequences. This is because the network has to process each
input in sequence, which can be slow.
 3.Difficulty In Capturing Long-Term Dependencies
 Although RNNs are designed to capture information about past inputs, they
can struggle to capture long-term dependencies in the input sequence. This
is because the gradients can become very small as they propagate through
time, which can cause the network to forget important information.
 4.Lack Of Parallelism
 RNNs are inherently sequential, which makes it difficult to parallelize the
computation. This can limit the speed and scalability of the network.
 5.Difficulty In Choosing The Right Architecture
 There are many different variants of RNNs, each with its own advantages
and disadvantages. Choosing the right architecture for a given task can be
challenging, and may require extensive experimentation and tuning.
 6.Difficulty In Interpreting The Output
 The output of an RNN can be difficult to interpret, especially when dealing
with complex inputs such as natural language or audio. This can make it
difficult to understand how the network is making its predictions.

Two Issues of Standard RNNs


 1. Vanishing Gradient Problem
 Recurrent Neural Networks enable you to model time-dependent and
sequential data problems, such as stock market prediction, machine
translation, and text generation. You will find, however, RNN is hard to train
because of the gradient problem.
 RNNs suffer from the problem of vanishing gradients. The gradients carry
information used in the RNN, and when the gradient becomes too small, the
parameter updates become insignificant. This makes the learning of long
data sequences difficult.
 2. Exploding Gradient Problem
 While training a neural network, if the slope tends to grow exponentially
instead of decaying, this is called an Exploding Gradient. This problem
arises when large error gradients accumulate, resulting in very large
updates to the neural network model weights during the training process.
 Long training time, poor performance, and bad accuracy are the major
issues in gradient problems.
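Both gradient problems can be seen in a small numeric sketch. It assumes a deliberately simplified linear recurrence h(t) = w_h * h(t-1) with no activation function, so the gradient of the last state with respect to the first is just w_h multiplied by itself once per time step. The function name `backprop_factor` is hypothetical.

```python
def backprop_factor(w_h, steps):
    """Gradient of h_T with respect to h_0 for h_t = w_h * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w_h  # one factor of w_h per time step
    return grad

vanishing = backprop_factor(0.5, 50)  # shrinks towards zero
exploding = backprop_factor(1.5, 50)  # grows very large
print(vanishing, exploding)
```

With a recurrent weight below 1 the factor collapses towards zero over 50 steps, and with a weight above 1 it grows by many orders of magnitude. That is the behaviour described above for vanishing and exploding gradients.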
Gradient Problem Solutions
The Architecture of Recurrent Neural Networks

RNN-TYPES/VARIANTS

https://www.simplilearn.com/tutorials/deep-learning-tutorial/rnn
RNN-TYPES/VARIANTS
https://medium.com/analytics-vidhya/what-is-rnn-a157d903a88
Objectives-26/10/23
RNN design
Encoder – decoder sequence to sequence
architectures
Language modeling example of RNN
Deep recurrent networks
Recursive neural networks

RNN DESIGN
3 TYPES
 Recurrent networks that produce an output at each time
step and have recurrent connections between hidden
units, illustrated in figure 10.3
 Recurrent networks that produce an output at each
time step and have recurrent connections only from the
output at one time step to the hidden units at the next
time step, illustrated in figure 10.4
 Recurrent networks with recurrent connections between
hidden units, that read an entire sequence and then
produce a single output, illustrated in figure 10.5
The figure does not specify the choice of activation function for the hidden
units. Here we assume the hyperbolic tangent activation function.
Also, the figure does not specify exactly what form the output and loss function
take.
Here we assume that the output is discrete, as if the RNN is used to predict
words or characters.
A natural way to represent discrete variables is to regard the output o as giving
the unnormalized log probabilities of each possible value of the discrete
variable.
We can then apply the softmax operation as a post-processing step to obtain a
vector ŷ of normalized probabilities over the output. Forward propagation
begins with a specification of the initial state h(0). Then, for each time step
from t = 1 to t = τ, we apply the following update equations:

$$a^{(t)} = b + W h^{(t-1)} + U x^{(t)}$$
$$h^{(t)} = \tanh\big(a^{(t)}\big)$$
$$o^{(t)} = c + V h^{(t)}$$
$$\hat{y}^{(t)} = \mathrm{softmax}\big(o^{(t)}\big)$$

 where the parameters are the bias vectors b and c along with the weight
matrices U, V and W, respectively, for input-to-hidden, hidden-to-output and
hidden-to-hidden connections. This is an example of a recurrent network that
maps an input sequence to an output sequence of the same length. The total
loss for a given sequence of x values paired with a sequence of y values
would then be just the sum of the losses over all the time steps. For
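example, if L(t) is the negative log-likelihood of y(t) given x(1), ..., x(t),
then

$$L\big(\{x^{(1)},\dots,x^{(\tau)}\},\{y^{(1)},\dots,y^{(\tau)}\}\big) = \sum_t L^{(t)} = -\sum_t \log p_{\text{model}}\big(y^{(t)} \mid \{x^{(1)},\dots,x^{(t)}\}\big)$$

 where p_model(y(t) | {x(1), ..., x(t)}) is given by reading the entry for y(t) from the model’s
output vector ŷ(t). Computing the gradient of this loss function with respect to the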
parameters is an expensive operation. The gradient computation involves performing
a forward propagation pass moving left to right through our illustration of the
unrolled graph in figure 10.3, followed by a backward propagation pass moving
right to left through the graph.
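The forward pass described above (tanh hidden units, softmax outputs, loss summed over time steps) can be sketched in plain Python. The parameter names U, V, W, b and c follow the text. The helper names `matvec`, `softmax` and `rnn_forward`, and the toy sizes and weight values, are assumptions made only for this illustration.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(o):
    """Turn unnormalized log probabilities into normalized probabilities."""
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_forward(xs, ys, U, W, V, b, c, h0):
    """Forward pass over one sequence.

    xs: input vectors x(1)..x(tau); ys: target class indices y(1)..y(tau).
    Returns (list of y_hat vectors, total negative log-likelihood).
    """
    h = h0
    y_hats, loss = [], 0.0
    for x, y in zip(xs, ys):
        a = [bi + wh + ux for bi, wh, ux in zip(b, matvec(W, h), matvec(U, x))]
        h = [math.tanh(ai) for ai in a]
        o = [ci + vh for ci, vh in zip(c, matvec(V, h))]
        y_hat = softmax(o)
        y_hats.append(y_hat)
        loss += -math.log(y_hat[y])  # L(t): negative log-likelihood of the target
    return y_hats, loss

# Toy sizes (assumed): 2 input features, 2 hidden units, 3 output classes.
U = [[0.1, -0.2], [0.3, 0.4]]
W = [[0.5, 0.0], [0.0, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0]
c = [0.0, 0.0, 0.0]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [0, 2, 1]

y_hats, loss = rnn_forward(xs, ys, U, W, V, b, c, h0=[0.0, 0.0])
print(len(y_hats), loss)
```

The backward pass is omitted here. It would walk the same loop in reverse, which is why the gradient computation is expensive and cannot skip any time step.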
Encoders-Decoders - Sequence to Sequence Architecture
 A plain RNN typically has fixed-size input and output vectors, i.e., the
lengths of both the input and output vectors are predefined.
 This is not desirable in use cases such as speech recognition,
machine translation, etc., where the input and output
sequences need not be fixed in length or of the same length.
 For example, "How have you been?" translates into French as "Comment
avez-vous été?". Here, neither the input nor output sequences are fixed
in size. In this context, if you want to build a language
translator using an RNN, you do not want to define the
sequence lengths beforehand.
This architecture is relatively new, having been pioneered in
2014, and has been adopted as the core technology
inside Google’s translate service.
The idea behind the design of this model is to enable it
to process input where we do not constrain the length.
One RNN will be used as an encoder, and another as a
decoder.
The output vector generated by the encoder and the
input vector given to the decoder will possess a fixed
size.
However, they need not be equal.
The output generated by the encoder can either be
given as a whole chunk or can be connected to the
hidden units of the decoder unit at every time step.
The RNNs in the encoder and decoder can be simple
RNNs, LSTMs, or GRUs.

Encoder-Decoder Model

There are three main blocks in the encoder-decoder model,


• Encoder
• Hidden Vector
• Decoder
The Encoder will convert the input sequence into a single fixed-size
vector (hidden vector).
The decoder will convert the hidden vector into the output sequence.
Encoder-Decoder models are jointly trained to maximize the conditional
probabilities of the target sequence given the input sequence.
How Does the Sequence to Sequence Model Work?

Encoder

• Multiple RNN cells can be stacked together to form the encoder. The RNN
reads each input sequentially.
• For every timestep (each input) t, the hidden state (hidden vector) h is
updated according to the input at that timestep, X[t].
• After all the inputs are read by the encoder model, the final hidden state of
the model represents the context/summary of the whole input sequence.
• Example: Consider the input sequence “I am a Student” to be encoded.
There will be 4 timesteps in total (4 tokens) for the Encoder model. At each
time step, the hidden state h will be updated using the previous hidden
state and the current input.
• At the first timestep t1, the previous hidden state h0 will be considered as
zero or randomly chosen.
• So the first RNN cell will update the current hidden state with the first input
and h0.
• Each layer outputs two things: the updated hidden state and the output for
each stage.
• The outputs at each stage are discarded and only the hidden states are
propagated to the next layer.
• The hidden states h_t are computed from the previous hidden state and the
current input. In the usual formulation, with f a nonlinearity such as tanh and
W^(hh), W^(hx) the hidden-to-hidden and input-to-hidden weight matrices:

$$h_t = f\big(W^{(hh)} h_{t-1} + W^{(hx)} x_t\big)$$

At the second timestep t2, the hidden state h1 and the second input X[2]
are given as input, and together they produce the hidden state h2.
This happens for all four stages in the example.
Encoder Vector
• This is the final hidden state produced from the encoder part of the
model. It is calculated using the formula above.
• This vector aims to encapsulate the information for all input elements
in order to help the decoder make accurate predictions.
• It acts as the initial hidden state of the decoder part of the model.
Decoder
• The Decoder generates the output sequence by predicting the next
output Yt given the hidden state ht.
• The input for the decoder is the final hidden vector obtained at the
end of encoder model.
• Each layer has three inputs: the hidden vector from the previous layer,
ht-1; the previous layer's output, yt-1; and the original hidden vector h.
• At the first layer, the output vector of the encoder, the special START
symbol, and an empty hidden state ht-1 are given as input; the outputs
obtained are y1 and the updated hidden state h1.
• The second layer takes the updated hidden state h1, the previous
output y1, and the original hidden vector h as inputs, and produces the
hidden vector h2 and output y2.
• The outputs produced at each timestep of the decoder form the actual output.
The model keeps predicting until the END symbol occurs.
• A stack of several recurrent units where each predicts an output y_t at a
time step t.
• Each recurrent unit accepts a hidden state from the previous unit and
produces an output as well as its own hidden state.
•Any hidden state h_i is computed using the formula:
Output Layer
• We use the softmax activation function at the output layer.
• It produces a probability distribution from a vector of
values, in which the target class should receive a high probability.
• The output y_t at time step t is computed using the formula:

$$y_t = \mathrm{softmax}\big(W^{S} h_t\big)$$

We calculate the outputs using the hidden state at the current
time step together with the respective weight W(S).
Softmax is used to create a probability vector that will help us
determine the final output (e.g. word in the question-answering
problem).
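The encoder, encoder vector, decoder and softmax output layer described above can be tied together in a compact sketch. It simplifies the description in one respect, stated here as an assumption: the decoder receives the encoder vector only as its initial hidden state, and at each step is fed just its own previous output. All function names, weight names and weight values are hypothetical, and the weights are untrained, so the decoded token ids carry no meaning.

```python
import math

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def one_hot(i, n):
    return [1.0 if j == i else 0.0 for j in range(n)]

def softmax(o):
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_cell(W_h, W_in, h, x):
    """New hidden state: tanh(W_h h + W_in x)."""
    return [math.tanh(a + b) for a, b in zip(matvec(W_h, h), matvec(W_in, x))]

def encode(tokens, W_h, W_in, vocab_size):
    """Read the tokens one by one; the final hidden state is the encoder vector."""
    h = [0.0] * len(W_h)  # h0 taken as zeros
    for t in tokens:
        h = rnn_cell(W_h, W_in, h, one_hot(t, vocab_size))
    return h

def decode(context, W_h, W_in, W_s, start, end, max_len=10):
    """Greedy decoding: feed back the previous output until END or max_len."""
    h, y_prev, outputs = context, start, []
    n_out = len(W_s)
    for _ in range(max_len):
        h = rnn_cell(W_h, W_in, h, one_hot(y_prev, n_out))
        probs = softmax(matvec(W_s, h))
        y_prev = max(range(n_out), key=lambda k: probs[k])
        if y_prev == end:
            break
        outputs.append(y_prev)
    return outputs

# Untrained, hand-picked weights: hidden size 2, vocabularies of size 4.
ENC_W_H = [[0.5, 0.0], [0.0, 0.5]]
ENC_W_IN = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
DEC_W_H = [[0.5, -0.5], [0.5, 0.5]]
DEC_W_IN = [[0.2, 0.0, -0.2, 0.1], [0.0, 0.2, 0.1, -0.2]]
W_S = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
START, END = 0, 1

# Token ids 0..3 stand in for "I am a Student".
context = encode([0, 1, 2, 3], ENC_W_H, ENC_W_IN, vocab_size=4)
translation = decode(context, DEC_W_H, DEC_W_IN, W_S, START, END)
print(context, translation)
```

The input and output lengths are not tied to each other here: the encoder loop runs once per input token, while the decoder loop stops when END is predicted or `max_len` is reached.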
Character-level Language Model
 Training Time
Language modeling example of RNN
 In order to illustrate the workings of the RNN, we will use a toy example of a
single sequence defined on a vocabulary of four words. Consider the sentence:
The cat chased the mouse
 In this case, we have a lexicon of four words, which are
{“the,” “cat,” “chased,” “mouse”}. In Figure 7.4, we have shown the
probabilistic prediction of the next word at each of the timestamps from 1 to 4.
 Ideally, we would like the probability of the next word to be predicted correctly from the
probabilities of the previous words.
 Each one-hot encoded input vector has length four, in which only one bit is 1 and the
remaining bits are 0s. The main flexibility here is in the dimensionality p of the hidden
representation, which we set to 2 in this case.
 As a result, the matrix Wxh will be a 2 × 4 matrix, so that it maps a one-hot encoded
input vector into a hidden vector of size 2.
 As a practical matter, each column of Wxh corresponds to one of the four words, and one
of these columns is copied by the expression Wxh x_t.
 Note that this expression is added to Whh h_{t-1} and then transformed with the tanh function
to produce the final expression.
 The final output o_t is defined by Why h_t. Note that the matrices Whh and Why are of sizes 2
× 2 and 4 × 2, respectively.
In this case, the outputs are continuous values (not probabilities) in which
larger values indicate greater likelihood of presence. These continuous values
are eventually converted to probabilities with the softmax function, and
therefore one can treat them as substitutes for log probabilities.
The word “cat” is predicted in the first time-stamp with a value of 1.3, although
this value seems to be (incorrectly) outstripped by “mouse” for which the
corresponding value is 1.7.
 However, the word “chased” seems to be predicted correctly at the next time-
stamp.
As in all learning algorithms, one cannot hope to predict every value exactly,
and such errors are more likely to be made in the early iterations of the
backpropagation algorithm.
 However, as the network is repeatedly trained over multiple iterations, it
makes fewer errors over the training data.
Training applies SGD: compute the loss for a sentence (in practice, a batch of sentences), compute the
gradients, update the weights, and repeat this process.
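The shapes in this toy example can be checked with a short sketch: a four-word vocabulary, a hidden size of 2, and weight matrices of sizes 2 × 4, 2 × 2 and 4 × 2. The weight values below are made up for illustration and are untrained, so the probabilities they produce are not those of the figure.

```python
import math

VOCAB = ["the", "cat", "chased", "mouse"]

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in VOCAB]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def softmax(o):
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative weights with the sizes from the text.
W_xh = [[0.1, 0.2, 0.3, 0.4],
        [0.5, 0.6, 0.7, 0.8]]                              # 2 x 4
W_hh = [[0.1, 0.0], [0.0, 0.1]]                            # 2 x 2
W_hy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]   # 4 x 2

# Multiplying W_xh by a one-hot vector copies out one column of W_xh.
copied_column = matvec(W_xh, one_hot("cat"))
print(copied_column)  # the "cat" column: [0.2, 0.6]

# Predict the next word after each of the first four words of the sentence.
h = [0.0, 0.0]
next_word_probs = []
for word in ["the", "cat", "chased", "the"]:
    h = [math.tanh(a + b)
         for a, b in zip(matvec(W_xh, one_hot(word)), matvec(W_hh, h))]
    scores = matvec(W_hy, h)            # continuous values, not probabilities
    next_word_probs.append(softmax(scores))

print(next_word_probs[0])
```

Training with SGD would then adjust the three matrices so that the probability of the correct next word rises at each timestamp.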
Deep recurrent networks
Recursive neural networks
Recursive neural networks represent yet another generalization of
recurrent networks, with a different kind of computational graph, which
is structured as a deep tree, rather than the chain-like structure of RNNs.
The typical computational graph for a recursive network is illustrated in
figure 10.14.
 Recursive neural networks were introduced by Pollack (1990), and their
potential use for learning to reason was described by Bottou (2011).
 Recursive networks have been successfully applied to processing data
structures as input to neural nets (Frasconi et al., 1997, 1998), in natural
language processing (Socher et al., 2011a,c, 2013a), as well as in
computer vision (Socher et al., 2011b).
One clear advantage of recursive nets over recurrent nets is that for a
sequence of the same length τ, the depth (measured as the number of
compositions of nonlinear operations) can be drastically reduced from τ to
O(log τ ), which might help deal with long-term dependencies.
 An open question is how to best structure the tree. One option is to have a
tree structure that does not depend on the data, such as a balanced binary
tree.
In some application domains, external methods can suggest the appropriate
tree structure.
 For example, when processing natural language sentences, the tree
structure for the recursive network can be fixed to the structure of the parse
tree of the sentence provided by a natural language parser (Socher et al.,
2011a, 2013a).
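The depth advantage can be made concrete with a small sketch that counts nonlinear compositions. It assumes a data-independent balanced binary tree, as in the first option above, and uses a made-up scalar `combine` function as a stand-in for the shared recursive unit. All names are hypothetical.

```python
import math

def combine(left, right):
    """One nonlinear composition step (stand-in for the shared unit)."""
    return math.tanh(0.5 * left + 0.5 * right)

def chain_reduce(values):
    """Recurrent-style chain: fold left to right. Returns (result, depth)."""
    state, depth = values[0], 0
    for v in values[1:]:
        state = combine(state, v)
        depth += 1
    return state, depth

def tree_reduce(values):
    """Recursive-style balanced tree: combine neighbours level by level."""
    depth = 0
    while len(values) > 1:
        paired = [combine(values[i], values[i + 1])
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2 == 1:
            paired.append(values[-1])  # odd element is carried up unchanged
        values = paired
        depth += 1
    return values[0], depth

seq = [0.1 * i for i in range(16)]
print(chain_reduce(seq)[1], tree_reduce(seq)[1])  # prints: 15 4
```

For 16 elements the chain needs 15 successive compositions while the tree needs 4, which matches the reduction from roughly τ to O(log τ).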

Challenges of training Recurrent Networks.


 Recurrent neural networks are very hard to train
because of the fact that the time-layered network is a
very deep network, especially if the input sequence is
long.
 In other words, the depth of the temporal layering is
input-dependent. As in all deep networks, the sensitivities of the loss
function (i.e., loss gradients) vary highly across the different temporal
layers.
 Furthermore, even though the loss function has highly
varying gradients to the variables in different layers, the
same parameter matrices are shared by different
temporal layers. This combination of varying sensitivity
and shared parameters in different layers can lead to
some unusually unstable effects.
Gated RNNs - LSTM and GRU
 What is LSTM (Long Short-Term Memory)?
 RNNs are not able to memorize data for a long time and begin to forget their previous
inputs. LSTMs are used as a solution to this short-term memory.
 In an RNN, when new information is added, the RNN completely modifies the existing
information.
 An RNN is not able to distinguish between important and not-so-important information.
 In an LSTM, by contrast, there is only a small modification to the existing information when new
information is added, because the LSTM contains gates which determine the flow of
information.
 LSTMs are also used to overcome the vanishing and exploding gradient problems.

 https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-
44e9eb85bf21
https://medium.com/analytics-vidhya/undestanding-recurrent-neural-network-rnn-and-long-
short-term-memory-lstm-30bc1221e80d
LSTM (Long Short Term Memory)
 LSTMs are a type of Recurrent Neural Network for learning long-term
dependencies. They are commonly used for processing and predicting time-series
data.
 They were introduced by Hochreiter & Schmidhuber (1997).
 From the image we can see LSTMs have a chain-like structure. General
RNNs have a single neural network layer. LSTMs, on the other hand, have
four layers that interact in a special way.

Why LSTM
 If we are trying to predict the last word in
“the clouds are in the sky,” we don’t need
any further context – it’s pretty obvious the
next word is going to be sky.
 In such cases, where the gap between the
relevant information and the place that it’s
needed is small, RNNs can learn to use the
past information.
 But there are also cases where we need
more context. Consider trying to predict the
last word in the text “I grew up in France… I
speak fluent French.”
 Unfortunately, as that gap grows, RNNs
become unable to learn to connect the
information.
LSTMs are explicitly designed to avoid
the long-term dependency problem.
https://colah.github.io/posts/2015-08-Understanding-
LSTMs/

Workings of LSTMs in RNN

https://www.simplilearn.com/tutorials/deep-learning-
tutorial/rnn
The Core Idea Behind LSTMs
 The key to LSTMs is the cell state, the
horizontal line running through the top of the
diagram.
 The cell state is kind of like a conveyor belt. It
runs straight down the entire chain, with only
some minor linear interactions.
 It’s very easy for information to just flow
along it unchanged.
 The LSTM does have the ability to remove or
add information to the cell state, carefully
regulated by structures called gates.
 Forget gate, Input gate, Output gate
 Gates are a way to optionally let information
through. They are composed out of a
sigmoid neural net layer and a pointwise
multiplication operation.
 An LSTM has three of these gates, to
protect and control the cell state.
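A gate in this sense is small enough to sketch directly: a sigmoid produces a value between 0 and 1 for each element, and a pointwise multiplication scales the information by it. The names `sigmoid`, `apply_gate` and `gate_logits` are hypothetical, and the gate inputs are fixed by hand here rather than computed by a learned layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def apply_gate(gate_logits, values):
    """Scale each value by a sigmoid gate in (0, 1), element by element."""
    return [sigmoid(g) * v for g, v in zip(gate_logits, values)]

# Large positive logit: let through; large negative: block; zero: let half through.
gated = apply_gate([10.0, -10.0, 0.0], [2.0, 2.0, 2.0])
print(gated)
```

The first value passes almost unchanged, the second is almost removed, and the third is halved.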

Working of LSTM
Step 1: Decide How Much Past Data It Should Remember-
Forget gate
 The first step in the LSTM is to decide which information should be
omitted from the cell in that particular time step.
 A sigmoid layer determines this. It looks at ht-1 and xt, and
outputs a number between 0 and 1 for each number in the previous cell
state Ct-1.
 A 1 represents “completely keep this” while a 0 represents
“completely get rid of this.”
 Forget Gate:
 This gate decides which information is important and should be stored and
which information to forget.
 It removes unimportant information from the cell, which helps
performance.
 This gate takes two inputs: one is the output generated by the previous cell and
the other is the input of the current cell. These inputs are multiplied by the
weights, the bias is added, and the sigmoid function is applied to the result.
 A value between 0 and 1 is generated, and based on this we decide which
information to keep.
 If the value is 0, the forget gate removes that information; if the value is 1,
the information is important and is remembered.

 Consider the following two sentences:


 Let the output of h(t-1) be “Alice is good in Physics. John, on the other
hand, is good at Chemistry.”
 Let the current input at x(t) be “John plays football well. He told me
yesterday over the phone that he had served as the captain of his
college football team.”
 The forget gate realizes there might be a change in context after
encountering the first full stop. It compares with the current input
sentence at x(t). The next sentence talks about John, so the information
on Alice is deleted. The position of the subject is vacated and assigned
to John.
Step 2: Decide How Much This Unit Adds to the
Current State - Input gate
 In the second layer, there are two parts. One is the sigmoid function, and the other
is the tanh function. The sigmoid function decides which values to let through
(a value between 0 and 1). The tanh function gives weightage to the values which are passed, deciding
their level of importance (-1 to 1).

 With the current input at x(t), the input gate analyzes the important information —
John plays football, and the fact that he was the captain of his college team is
important.
 “He told me yesterday over the phone” is less important; hence it's forgotten. This
process of adding some new information can be done via the input gate.

• Input Gate:
• This gate is used to add information to the cell. It is responsible for
deciding which values should be added to the cell, using an activation function
such as sigmoid.
• It creates an array of information that has to be added. This is done by
using another activation function called tanh, which generates values
between -1 and 1.
• The sigmoid function acts as a filter and regulates what information has to
be added to the cell.

Step 3: Decide What Part of the Current Cell State
Makes It to the Output – Output gate
 The third step is to decide what the output will be. First, we run a sigmoid layer,
which decides what parts of the cell state make it to the output. Then, we put the
cell state through tanh to push the values to be between -1 and 1 and multiply it by
the output of the sigmoid gate.

 Let’s consider this example to predict the next word in the sentence: “John played
tremendously well against the opponent and won for his team. For his
contributions, brave ____ was awarded player of the match.”
 There could be many choices for the empty space. The current input brave is an
adjective, and adjectives describe a noun. So, “John” could be the best output after
brave.

Output Gate:
This gate is responsible for selecting important information
from the current cell and showing it as output. It creates a vector of
values from the cell state using the tanh function, which ranges from -1 to 1. It uses
the previous output and the current input, passed through a sigmoid
function, as a regulator that decides which values should be
shown as output.
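The three steps (forget gate, input gate, output gate) can be combined into one LSTM step. The sketch below uses the standard LSTM update with scalar input, hidden state and cell state to keep it short. The function name `lstm_step`, the parameter dictionary `p`, and every weight value are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step for scalar x, h and c.

    p maps each of 'f', 'i', 'g', 'o' to a tuple (w_x, w_h, bias).
    Returns (h, c): the new hidden state and the new cell state.
    """
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre("f"))      # Step 1, forget gate: how much of c_prev to keep
    i = sigmoid(pre("i"))      # Step 2, input gate: how much new information to add
    g = math.tanh(pre("g"))    # Step 2, candidate values between -1 and 1
    o = sigmoid(pre("o"))      # Step 3, output gate: how much of the cell to show
    c = f * c_prev + i * g     # update the cell state
    h = o * math.tanh(c)       # output is a filtered view of the cell state
    return h, c

# Forget gate pushed to "keep" and input gate pushed to "closed" via their biases.
params = {
    "f": (0.0, 0.0, 20.0),
    "i": (0.0, 0.0, -20.0),
    "g": (1.0, 1.0, 0.0),
    "o": (0.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, p=params)
print(h, c)
```

With these hand-set biases the cell state passes through almost unchanged, which mirrors the conveyor-belt picture: information flows along the cell state unless a gate removes or adds to it.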
LSTM Use Case- predict stock prices
 Now that you understand how LSTMs work, let’s do a practical implementation to predict the prices
of stocks using the “Google stock price” data.
 Based on the stock price data between 2012 and 2016, we will predict the stock prices of 2017.
https://www.simplilearn.com/tutorials/machine-
learning-tutorial/stock-price-prediction-using-
machine-learning

Variants on Long Short Term Memory


 1. LSTM with peephole connections (2000): let the gate layers look at the cell state
 2. LSTM with coupled forget and input gates
 3. Gated Recurrent Unit, or GRU (2014)

RNN REVIEW
GRU

GRU stands for Gated Recurrent Unit, which is a type of
recurrent neural network (RNN) architecture that is similar to
LSTM.
GRU has a simpler architecture than LSTM, with fewer
parameters, which can make it easier to train and more
computationally efficient.
In LSTM, the memory cell state is maintained separately from
the hidden state and is updated using three gates: the input
gate, output gate, and forget gate.
 In GRU, the memory cell state is replaced with a “candidate
activation vector,” which is updated using two gates: the reset
gate and update gate.
How Does GRU Work?
GRU is a popular alternative to LSTM for modeling sequential data,
especially in cases where computational resources are limited or where a
simpler architecture is desired.
GRU processes sequential data one element at a time, updating its
hidden state based on the current input and the previous hidden state.
At each time step, the GRU computes a “candidate activation vector”
that combines information from the input and the previous hidden state.
This candidate vector is then used to update the hidden state for the next
time step.
The candidate activation vector is computed using two gates: the reset
gate and the update gate. The reset gate determines how much of the
previous hidden state to forget, while the update gate determines how
much of the candidate activation vector to incorporate into the new
hidden state.
 Step 1: The reset gate r_t and update gate z_t are computed using the current input
x_t and the previous hidden state h_{t-1}:

$$r_t = \sigma\big(W_r \cdot [h_{t-1}, x_t]\big)$$
$$z_t = \sigma\big(W_z \cdot [h_{t-1}, x_t]\big)$$

where W_r and W_z are weight matrices that are learned during training, and σ is the sigmoid function.

 Step 2: The candidate activation vector h̃_t is computed using the
current input x_t and a modified version of the previous hidden state
that is "reset" by the reset gate:

$$\tilde{h}_t = \tanh\big(W \cdot [r_t \odot h_{t-1}, x_t]\big)$$

where W is another weight matrix.

 Step 3: The new hidden state h_t is computed by combining the
candidate activation vector with the previous hidden state,
weighted by the update gate:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
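The three GRU steps can be sketched as a single function. The example uses scalars instead of vectors, omits bias terms, and splits each weight matrix into a hidden part and an input part, which is equivalent to multiplying by the concatenation of the previous hidden state and the input. The function name `gru_step` and all weight values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, w_r, w_z, w_h):
    """One GRU step for scalar x and h.

    Each weight argument is a pair (weight_on_hidden, weight_on_input).
    """
    r = sigmoid(w_r[0] * h_prev + w_r[1] * x)                 # Step 1: reset gate
    z = sigmoid(w_z[0] * h_prev + w_z[1] * x)                 # Step 1: update gate
    h_tilde = math.tanh(w_h[0] * (r * h_prev) + w_h[1] * x)   # Step 2: candidate
    return (1.0 - z) * h_prev + z * h_tilde                   # Step 3: new hidden state

# Update gate near 0: the previous hidden state is kept.
kept = gru_step(1.0, 0.4, w_r=(1.0, 1.0), w_z=(0.0, -30.0), w_h=(1.0, 1.0))
# Update gate near 1: the candidate replaces the previous hidden state.
replaced = gru_step(1.0, 0.4, w_r=(1.0, 1.0), w_z=(0.0, 30.0), w_h=(1.0, 1.0))
print(kept, replaced)
```

The two calls differ only in the update-gate weights, which shows how that gate alone decides how much of the candidate activation enters the new hidden state.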

GRU Architecture
1. Input layer: The input layer takes in sequential data, such as a
sequence of words or a time series of values, and feeds it into the
GRU.
2. Hidden layer: The hidden layer is where the recurrent computation
occurs. At each time step, the hidden state is updated based on the
current input and the previous hidden state. The hidden state is a
vector of numbers that represents the network’s “memory” of the
previous inputs.
3. Reset gate: The reset gate determines how much of the previous
hidden state to forget. It takes as input the previous hidden state and
the current input, and produces a vector of numbers between 0 and
1 that controls the degree to which the previous hidden state is
“reset” at the current time step.
4. Update gate: The update gate determines how much of the candidate
activation vector to incorporate into the new hidden state. It takes as
input the previous hidden state and the current input, and produces a
vector of numbers between 0 and 1 that controls the degree to which the
candidate activation vector is incorporated into the new hidden state.
5. Candidate activation vector: The candidate activation vector is a
modified version of the previous hidden state that is “reset” by the reset
gate and combined with the current input. It is computed using a tanh
activation function that squashes its output between -1 and 1.
6. Output layer: The output layer takes the final hidden state as input and
produces the network’s output. This could be a single number, a sequence
of numbers, or a probability distribution over classes, depending on the
task at hand.

Sl No | Parameter | LSTM | GRU
1 | Architecture | LSTM has a more complex architecture compared to GRU. It consists of three gates: the input gate (i), forget gate (f), and output gate (o). These gates control the flow of information through the cell state, allowing the LSTM to remember or forget information over time. | GRU has a simplified architecture with two gates: the update gate (z) and reset gate (r). The update gate controls how much of the previous hidden state should be retained, and the reset gate determines how much of the past information to forget.
2 | Number of Parameters | LSTM typically has more parameters than GRU due to its additional gate and separate cell state. This can make LSTM more powerful but also more prone to overfitting, especially on smaller datasets. | GRU has fewer parameters since it uses two gates rather than three. This can make it more computationally efficient and less prone to overfitting, making it a good choice for smaller datasets.
3 | Learning Ability | Due to its more complex architecture, LSTM can potentially learn more complex patterns and relationships in the data. It is well-suited for tasks where capturing long-term dependencies is critical. | GRU can still learn to capture long-term dependencies effectively. It performs well in many natural language processing tasks and is a popular choice for various sequence modeling tasks.
4 | Training Speed | LSTM has more parameters, which can result in slightly slower training times compared to GRU, especially on larger datasets. | With fewer parameters, GRU may have faster training times, making it more efficient for larger datasets.
LSTM

Sigmoid squishes values to be between 0 and 1


Forget gate

Input Gate
Cell State

Output Gate
GRU

https://towardsdatascience.com/understanding-gru-
networks-2ef37df6c9be
CASE STUDY-BERT (Bidirectional Encoder
Representations from Transformers), Sentiment
Analysis
 https://www.kaggle.com/code/harshjain123/bert-for-everyone-tutorial-
implementation
 https://medium.com/@princedede/sentiment-analysis-using-python-hotel-
reviews-case-study-c6b81f7cfa96
