DL Unit4
RECURRENT NEURAL NETWORK
Recurrent Neural Network(RNN) is a type of Neural Network where
the output from the previous step is fed as input to the current step.
In traditional neural networks, all the inputs and outputs are
independent of each other, but in cases when it is required to predict the
next word of a sentence, the previous words are required and hence
there is a need to remember the previous words.
Thus RNN came into existence, which solved this issue with the help of
a Hidden Layer.
The main and most important feature of RNN is its Hidden State, which
remembers some information about a sequence.
The state is also referred to as Memory State since it remembers the
previous input to the network.
It uses the same parameters for each input, performing the same task
on all the inputs and hidden states to produce the output. This reduces the
number of parameters, unlike other neural networks.
RNNs were created because feed-forward neural networks have a few
limitations:
I. Cannot handle sequential data
II. Considers only the current input
III. Cannot memorize previous inputs
The current state is computed as:
h_t = f(h_{t-1}, x_t)
Applying the tanh activation function:
h_t = tanh(W_hh · h_{t-1} + W_xh · x_t)
The output is computed as:
Y_t = W_hy · h_t
Where,
h_t = current state
h_{t-1} = previous state
x_t = input state
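The recurrence can be sketched as a minimal NumPy forward pass (the sizes and random weights below are arbitrary illustrative choices, not from the original):

```python
import numpy as np

# Minimal vanilla RNN forward pass:
#   h_t = tanh(W_hh @ h_prev + W_xh @ x_t)
#   y_t = W_hy @ h_t
rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 4, 3, 2

W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
W_hy = rng.standard_normal((output_size, hidden_size)) * 0.1

def rnn_step(x_t, h_prev):
    """One recurrent step: returns the new hidden state and the output."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t
    return h_t, y_t

# Run a sequence of 5 timesteps; the same weights are reused at every
# step (parameter sharing).
h = np.zeros(hidden_size)
for t in range(5):
    x_t = rng.standard_normal(input_size)
    h, y = rnn_step(x_t, h)

print(h.shape, y.shape)  # (3,) (2,)
```

Note that the hidden state carried between iterations is what gives the network its memory of previous inputs.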
Types of RNN:
Feedforward networks map one input to one output, and while we’ve visualized
recurrent neural networks in this way in the above diagrams, they do not
actually have this constraint. Instead, their inputs and outputs can vary in length,
and different types of RNNs are used for different use cases, such as music
generation, sentiment classification, and machine translation.
One To One: There is only one pair here. A one-to-one architecture is used
in traditional neural networks.
One To Many: In a one-to-many network, a single input can produce
multiple outputs. One-to-many networks are used, for example, in music
generation.
Many To One: In this scenario, a single output is produced by combining
many inputs from distinct time steps. Sentiment analysis and emotion
identification use such networks, in which the class label is determined by a
sequence of words.
Many To Many: Here both the input and the output are sequences, and
their lengths may differ (for example, two inputs may yield three outputs).
Machine translation systems, such as English-to-French translation
systems, use many-to-many networks.
Applications of RNN:
1. Machine Translation
2. Speech Recognition
3. Sentiment Analysis
4. Automatic Image Tagger
Advantages of RNN:
1. Ability to handle variable-length sequences
2. Memory of past inputs
3. Parameter sharing
4. Flexibility
5. Non-linear mapping
6. Sequential processing
7. Improved accuracy
Disadvantages of RNN:
1. Computational Complexity
2. Vanishing and Exploding gradients
3. Difficulty in long-term dependencies
4. Lack of parallelism
5. Difficulty in choosing the right Architecture
6. Difficulty in interpreting the output
For a bidirectional RNN, the hidden state is computed in both directions:
H_t(Forward) = A(X_t · W_XH(Forward) + H_{t-1}(Forward) · W_HH(Forward) + b_H(Forward))
H_t(Backward) = A(X_t · W_XH(Backward) + H_{t+1}(Backward) · W_HH(Backward) + b_H(Backward))
where,
A = activation function,
W = weight matrix
b = bias
The hidden state at time t is given by a combination of H_t(Forward) and
H_t(Backward). The output at any given hidden state is:
Y_t = H_t · W_AY + b_Y
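A minimal NumPy sketch of the bidirectional computation (toy sizes and random weights are assumptions; concatenating the two directions is one common way to combine them, used here for illustration):

```python
import numpy as np

# Bidirectional RNN hidden states: one pass left-to-right using H_{t-1},
# one pass right-to-left using H_{t+1}.
rng = np.random.default_rng(1)
T, input_size, hidden = 4, 3, 2
X = rng.standard_normal((T, input_size))

def make_params():
    """Weights (W_XH, W_HH) and bias b_H for one direction."""
    return (rng.standard_normal((hidden, input_size)) * 0.1,
            rng.standard_normal((hidden, hidden)) * 0.1,
            np.zeros(hidden))

fwd, bwd = make_params(), make_params()

# Forward direction: depends on the previous hidden state H_{t-1}.
H_f = np.zeros((T, hidden))
h = np.zeros(hidden)
for t in range(T):
    h = np.tanh(fwd[0] @ X[t] + fwd[1] @ h + fwd[2])
    H_f[t] = h

# Backward direction: depends on the next hidden state H_{t+1}.
H_b = np.zeros((T, hidden))
h = np.zeros(hidden)
for t in reversed(range(T)):
    h = np.tanh(bwd[0] @ X[t] + bwd[1] @ h + bwd[2])
    H_b[t] = h

# Combine the two directions (here: concatenation per timestep).
H = np.concatenate([H_f, H_b], axis=1)
print(H.shape)  # (4, 4)
```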
Applications of Bidirectional RNN:
Speech Recognition
Translation
Handwritten Recognition
Protein Structure Prediction
Part-of-speech tagging
Dependency Parsing
Entity Extraction
Encoder:
Multiple RNN cells can be stacked together to form the encoder. The
RNN reads each input sequentially.
For every timestep (each input) t, the hidden state (hidden vector) h is
updated according to the input at that timestep, X[t].
After all the inputs are read by encoder model, the final hidden state of
the model represents the context/summary of the whole input sequence.
Example: Consider the input sequence “I am a Student” to be encoded.
There will be 4 timesteps in total (4 tokens) for the Encoder model. At
each time step, the hidden state h will be updated using the previous
hidden state and the current input.
At the first timestep t1, the previous hidden state h0 will be considered as
zero or randomly chosen. So the first RNN cell will update the current
hidden state with the first input and h0. Each layer outputs two things —
updated hidden state and the output for each stage. The outputs at each
stage are discarded; only the hidden states are propagated to the
next step.
The hidden states h_t are computed using the formula:
h_t = f(W^(hh) · h_{t-1} + W^(hx) · x_t)
At the second timestep t2, the hidden state h1 and the second input X[2]
will be given as input, and the hidden state h2 will be updated according
to both inputs. This happens for all four timesteps in the example taken.
A stack of several recurrent units (LSTM or GRU cells for better
performance) where each accepts a single element of the input sequence,
collects information for that element, and propagates it forward.
In the question-answering problem, the input sequence is a collection of
all words from the question. Each word is represented as x_i where i is
the order of that word.
This simple formula represents the result of an ordinary recurrent neural
network. As you can see, we just apply the appropriate weights to the previous
hidden state h_(t-1) and the input vector x_t.
Encoder Vector
This is the final hidden state produced from the encoder part of the
model. It is calculated using the formula above.
This vector aims to encapsulate the information for all input elements in
order to help the decoder make accurate predictions.
It acts as the initial hidden state of the decoder part of the model.
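The encoder walk-through above can be sketched in NumPy (a minimal illustration; the token list, random stand-in embeddings, and sizes are assumptions, not real learned values):

```python
import numpy as np

# Encoder sketch: read tokens one at a time and keep only the final
# hidden state as the context (encoder) vector.
rng = np.random.default_rng(2)
emb, hidden = 4, 3
W_hh = rng.standard_normal((hidden, hidden)) * 0.1
W_hx = rng.standard_normal((hidden, emb)) * 0.1

tokens = ["I", "am", "a", "Student"]                  # 4 timesteps
embeddings = {w: rng.standard_normal(emb) for w in tokens}

h = np.zeros(hidden)                                   # h0: zero initial state
for w in tokens:
    # Per-step outputs are discarded; only h is propagated forward.
    h = np.tanh(W_hh @ h + W_hx @ embeddings[w])

context = h   # encoder vector: summary of the whole input sequence
print(context.shape)  # (3,)
```

The `context` vector is what gets handed to the decoder as its initial hidden state.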
Decoder:
The Decoder generates the output sequence by predicting the next output
Yt given the hidden state ht.
The input for the decoder is the final hidden vector obtained at the end of
the encoder model.
Each decoder step has three inputs: the hidden vector from the previous
step h_{t-1}, the previous step's output y_{t-1}, and the original encoder
hidden vector h.
At the first step, the encoder's output vector, the special START symbol,
and an empty hidden state h_{t-1} are given as input; the outputs
obtained are y1 and the updated hidden state h1 (the information already
emitted as output is removed from the hidden vector).
The second step takes the updated hidden state h1, the previous
output y1, and the original hidden vector h as its current inputs, and
produces the hidden vector h2 and output y2.
The outputs produced at each timestep of the decoder are the actual output.
The model keeps predicting outputs until the END symbol occurs.
A stack of several recurrent units where each predicts an output y_t at a
time step t.
Each recurrent unit accepts a hidden state from the previous unit and
produces an output as well as its own hidden state.
In the question-answering problem, the output sequence is a collection of
all words from the answer. Each word is represented as y_i where i is the
order of that word.
Example: Decoder.
Any hidden state h_t is computed using the formula:
h_t = f(W^(hh) · h_{t-1})
As you can see, we are just using the previous hidden state to compute the
next one.
Output Layer:
We use Softmax activation function at the output layer.
It is used to produce a probability distribution from a vector of values,
with the target class having the highest probability.
The output y_t at time step t is computed using the formula:
y_t = softmax(W^S · h_t)
We calculate the outputs using the hidden state at the current time
step together with the respective weight W(S). Softmax is used to
create a probability vector that will help us determine the final
output (e.g. word in the question-answering problem).
The power of this model lies in the fact that it can map sequences of
different lengths to each other. The input and output lengths are not
tied to each other and can differ. This opens a whole new range of
problems that can now be solved using such an architecture.
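The decoder loop with its softmax output layer can be sketched as follows (a toy vocabulary, random weights, and greedy decoding; the reserved END index and all sizes are assumptions for illustration):

```python
import numpy as np

# Decoder sketch: start from the encoder context vector, repeatedly update
# the hidden state, and map it through softmax to a token distribution.
rng = np.random.default_rng(3)
hidden, vocab = 3, 5
W_hh = rng.standard_normal((hidden, hidden)) * 0.1
W_s = rng.standard_normal((vocab, hidden)) * 0.1   # output-layer weights W(S)

def softmax(z):
    e = np.exp(z - z.max())      # subtract max for numerical stability
    return e / e.sum()

h = rng.standard_normal(hidden)  # stand-in for the encoder context vector
END = vocab - 1                  # reserve the last index as the END symbol
outputs = []
for _ in range(10):              # cap the length in case END never appears
    h = np.tanh(W_hh @ h)        # h_t computed from the previous hidden state
    probs = softmax(W_s @ h)     # probability distribution over the vocabulary
    y = int(np.argmax(probs))    # greedy choice of the next token
    if y == END:
        break
    outputs.append(y)

print(outputs)
```

A real decoder would also feed the previous output y_{t-1} (via an embedding) into each step; that input is omitted here to keep the sketch minimal.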
Applications:
This architecture has many applications, such as:
Google’s Machine Translation
Question answering chatbots
Speech recognition
Time series applications, etc.
S1, S2, and S3 are the hidden states (memory units) at times t1, t2,
and t3, respectively, and Ws is the weight matrix associated with them.
X1, X2, and X3 are the inputs at times t1, t2, and t3, respectively,
and Wx is the weight matrix associated with them.
Y1, Y2, and Y3 are the outputs at times t1, t2, and t3, respectively,
and Wy is the weight matrix associated with them.
For any time, t, we have the following two equations:
S_t = g1(Wx · X_t + Ws · S_{t-1})
Y_t = g2(Wy · S_t)
where g1 and g2 are activation functions.
We will now perform the back propagation at time t = 3.
Let the error function be:
E_t = (d_t − Y_t)²
Here, we employ the squared error, where d_3 is the desired output at time
t = 3.
In order to do backpropagation, it is necessary to change the weights that are
associated with inputs, memory units, and outputs.
Adjusting Wy:
To better understand, we can look at the following image:
Explanation:
E3 is a function of Y3. Hence, we differentiate E3 with respect to Y3.
Y3 is a function of Wy. Hence, we differentiate Y3 with respect to Wy.
∂E3/∂Wy = (∂E3/∂Y3) · (∂Y3/∂Wy)
Adjusting Ws:
To better understand, we can look at the following image:
Explanation:
E3 is a function of Y3. Therefore, we differentiate E3 with respect to Y3.
Y3 is a function of S3. Therefore, we differentiate Y3 with respect to S3.
S3 is a function of Ws. Therefore, we differentiate S3 with respect to Ws.
But it is not enough to stop here; we must also consider the previous
time steps. We must also partially differentiate the error function with
respect to the memory units S2 and S1, taking the weight matrix Ws into
account.
It is essential to note that a memory unit, such as St, is a function of its
predecessor memory unit, St-1.
Therefore, we differentiate S3 with respect to S2, and S2 with respect to S1.
In general, we can describe this formula as:
∂E3/∂Ws = Σ_{i=1}^{3} (∂E3/∂Y3) · (∂Y3/∂S3) · (∂S3/∂S_i) · (∂S_i/∂Ws)
Adjusting Wx:
To better understand, we can look at the following image:
Explanation:
E3 is a function of Y3. Therefore, we differentiate E3 with respect to Y3.
Y3 is a function of S3. Therefore, we differentiate Y3 with respect to S3.
S3 is a function of Wx. Therefore, we differentiate S3 with respect to Wx.
We cannot stop here either; we must also consider the preceding time
steps and partially differentiate the error function with respect to the
memory units S2 and S1, taking the weight matrix Wx into account.
In general, we can describe this formula as:
∂E3/∂Wx = Σ_{i=1}^{3} (∂E3/∂Y3) · (∂Y3/∂S3) · (∂S3/∂S_i) · (∂S_i/∂Wx)
Limitations:
Backpropagation through time (BPTT) is practical only for a limited
number of time intervals, such as 8 or 10. If we continue to backpropagate
further, the gradient becomes too small. This is known as the
"vanishing gradient" problem, and it arises because the contribution of
information diminishes geometrically over time. Therefore, if the number
of time steps is greater than about 10, the information is effectively
discarded.
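The geometric decay described above can be demonstrated numerically. This toy sketch approximates the recurrent Jacobian ∂S_t/∂S_{t-1} by the weight matrix alone (ignoring the activation derivative, an assumed simplification):

```python
import numpy as np

# Vanishing-gradient illustration: the gradient through k timesteps
# contains a product of k Jacobian factors. If the recurrent weight
# matrix has spectral norm below 1, that product shrinks geometrically.
Ws = np.array([[0.4, 0.1],
               [0.1, 0.4]])     # small recurrent weight matrix

grad = np.eye(2)
norms = []
for _ in range(20):
    grad = Ws @ grad            # one more factor per extra timestep
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[9], norms[19])  # the norm decays toward zero
```

With weights whose largest singular value exceeds 1, the same product grows instead, which is the exploding-gradient side of the problem.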
Structure of LSTM:
LSTM has a chain structure that contains four neural networks and
different memory blocks called cells.
1. Forget gate: The information that is no longer useful in the cell state is
removed with the forget gate. The inputs h_t-1 and x_t are passed through
the sigmoid function, which outputs values between 0 and 1 that filter the
previous cell state. The equation for the forget gate is:
f_t = σ(W_f · [h_t-1, x_t] + b_f)
2. Input gate: The input gate adds useful information to the cell state.
First, the information is regulated using the sigmoid function, which
filters the values to be remembered (similar to the forget gate) using the
inputs h_t-1 and x_t. Then, a vector is created using the tanh function,
giving outputs from -1 to +1, which contains all the possible values from
h_t-1 and x_t. Finally, the values of the vector and the regulated values
are multiplied to obtain the useful information. The equations for the
input gate are:
i_t = σ(W_i · [h_t-1, x_t] + b_i)
Ĉ_t = tanh(W_c · [h_t-1, x_t] + b_c)
C_t = f_t ⊙ C_t-1 + i_t ⊙ Ĉ_t
Where,
⊙ denotes element-wise multiplication
tanh is tanh activation function
3. Output gate: The task of extracting useful information from the current cell
state to be presented as output is done by the output gate. First, a vector is
generated by applying tanh function on the cell. Then, the information is
regulated using the sigmoid function and filter by the values to be remembered
using inputs h_t-1 and x_t. At last, the values of the vector and the regulated
values are multiplied to be sent as an output and input to the next cell. The
equation for the output gate is:
o_t = σ(W_o · [h_t-1, x_t] + b_o)
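The gate equations above can be combined into a single LSTM cell step, sketched in NumPy (toy sizes and random weights are assumptions; the structure follows the standard gate formulation):

```python
import numpy as np

# One LSTM cell step: forget gate, input gate, cell-state update,
# output gate, and the hidden-state update h_t = o_t * tanh(C_t).
rng = np.random.default_rng(4)
n_in, n_h = 3, 4
concat = n_h + n_in              # size of [h_{t-1}, x_t]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W_f, W_i, W_c, W_o = (rng.standard_normal((n_h, concat)) * 0.1
                      for _ in range(4))
b_f = b_i = b_c = b_o = np.zeros(n_h)

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    C_tilde = np.tanh(W_c @ z + b_c)       # candidate values
    C_t = f_t * C_prev + i_t * C_tilde     # new cell state
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(C_t)               # new hidden state
    return h_t, C_t

# Both the hidden state h and the cell state C carry across timesteps.
h, C = np.zeros(n_h), np.zeros(n_h)
for t in range(3):
    h, C = lstm_step(rng.standard_normal(n_in), h, C)

print(h.shape, C.shape)  # (4,) (4,)
```

The cell state C_t is the long-term memory path: it is modified only by element-wise gating, which is what lets gradients survive across many timesteps.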
Advantages of LSTM:
Long-term dependencies can be captured by LSTM networks. They have
a memory cell that is capable of long-term information storage.
In traditional RNNs, there is a problem of vanishing and exploding
gradients when models are trained over long sequences. By using a gating
mechanism that selectively recalls or forgets information, LSTM
networks deal with this problem.
LSTM enables the model to capture and remember important context,
even when there is a significant time gap between relevant events in the
sequence. So LSTMs are used where understanding context is important,
e.g. machine translation.
Disadvantages of LSTM:
Compared to simpler architectures like feed-forward neural networks,
LSTM networks are computationally more expensive. This can limit their
scalability for large-scale datasets or constrained environments.
Training LSTM networks can be more time-consuming compared to
simpler models due to their computational complexity. So training
LSTMs often requires more data and longer training times to achieve
high performance.
Since sentences are processed word by word in a sequential manner, it is
hard to parallelize the work of processing them.
Applications of LSTM :
Language Modeling
Speech Recognition
Time Series Forecasting
Anomaly Detection
Recommender Systems
Video Analysis