DL Unit4
RECURRENT NEURAL NETWORK
Recurrent Neural Network(RNN) is a type of Neural Network where
the output from the previous step is fed as input to the current step.
In traditional neural networks, all the inputs and outputs are
independent of each other, but in cases when it is required to predict the
next word of a sentence, the previous words are required and hence
there is a need to remember the previous words.
Thus RNN came into existence, which solved this issue with the help of
a Hidden Layer.
The main and most important feature of RNN is its Hidden State, which
remembers some information about a sequence.
The state is also referred to as Memory State since it remembers the
previous input to the network.
It uses the same parameters for each input, performing the same task
on all the inputs and hidden states to produce the output. This reduces the
number of parameters, unlike other neural networks.
RNNs were created because feed-forward neural networks have a few
limitations:
I. Cannot handle sequential data
II. Considers only the current input
III. Cannot memorize previous inputs
The current state is computed as:
h_t = f(h_{t-1}, x_t)
Applying the tanh activation function:
h_t = tanh(W_hh · h_{t-1} + W_xh · x_t)
The output is computed as:
Y_t = W_hy · h_t
Where,
h_t = current state
h_{t-1} = previous state
x_t = input state
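The recurrence can be sketched as a minimal NumPy forward pass (the sizes and random weights below are arbitrary illustrative choices, not from the original):

```python
import numpy as np

# Minimal vanilla RNN forward pass:
#   h_t = tanh(W_hh @ h_prev + W_xh @ x_t)
#   y_t = W_hy @ h_t
rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 4, 3, 2

W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
W_hy = rng.standard_normal((output_size, hidden_size)) * 0.1

def rnn_step(x_t, h_prev):
    """One recurrent step: returns the new hidden state and the output."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t
    return h_t, y_t

# Run a sequence of 5 timesteps; the same weights are reused at every
# step (parameter sharing).
h = np.zeros(hidden_size)
for t in range(5):
    x_t = rng.standard_normal(input_size)
    h, y = rnn_step(x_t, h)

print(h.shape, y.shape)  # (3,) (2,)
```

Note that the hidden state carried between iterations is what gives the network its memory of previous inputs.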
Types of RNN:
Feedforward networks map one input to one output, and while we’ve visualized
recurrent neural networks in this way in the above diagrams, they do not
actually have this constraint. Instead, their inputs and outputs can vary in length,
and different types of RNNs are used for different use cases, such as music
generation, sentiment classification, and machine translation.
One To One: There is only one pair here. A one-to-one architecture is used
in traditional neural networks.
One To Many: In a one-to-many network, a single input can produce
multiple outputs. One-to-many networks are used, for example, in music
generation.
Many To One: In this scenario, a single output is produced by combining
many inputs from distinct time steps. Sentiment analysis and emotion
identification use such networks, in which the class label is determined by a
sequence of words.
Many To Many: Here both the input and the output are sequences, and
their lengths may differ (for example, two inputs may yield three outputs).
Machine translation systems, such as English-to-French translation
systems, use many-to-many networks.
Applications of RNN:
1. Machine Translation
2. Speech Recognition
3. Sentiment Analysis
4. Automatic Image Tagger
Advantages of RNN:
1. Ability to handle variable-length sequences
2. Memory of past inputs
3. Parameter sharing
4. Flexibility
5. Non-linear mapping
6. Sequential processing
7. Improved accuracy
Disadvantages of RNN:
1. Computational Complexity
2. Vanishing and Exploding gradients
3. Difficulty in long-term dependencies
4. Lack of parallelism
5. Difficulty in choosing the right Architecture
6. Difficulty in interpreting the output
For a bidirectional RNN, the hidden state is computed in both directions:
H_t(Forward) = A(X_t · W_XH(Forward) + H_{t-1}(Forward) · W_HH(Forward) + b_H(Forward))
H_t(Backward) = A(X_t · W_XH(Backward) + H_{t+1}(Backward) · W_HH(Backward) + b_H(Backward))
where,
A = activation function,
W = weight matrix
b = bias
The hidden state at time t is given by a combination of H_t(Forward) and
H_t(Backward). The output at any given hidden state is:
Y_t = H_t · W_AY + b_Y
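A minimal NumPy sketch of the bidirectional computation (toy sizes and random weights are assumptions; concatenating the two directions is one common way to combine them, used here for illustration):

```python
import numpy as np

# Bidirectional RNN hidden states: one pass left-to-right using H_{t-1},
# one pass right-to-left using H_{t+1}.
rng = np.random.default_rng(1)
T, input_size, hidden = 4, 3, 2
X = rng.standard_normal((T, input_size))

def make_params():
    """Weights (W_XH, W_HH) and bias b_H for one direction."""
    return (rng.standard_normal((hidden, input_size)) * 0.1,
            rng.standard_normal((hidden, hidden)) * 0.1,
            np.zeros(hidden))

fwd, bwd = make_params(), make_params()

# Forward direction: depends on the previous hidden state H_{t-1}.
H_f = np.zeros((T, hidden))
h = np.zeros(hidden)
for t in range(T):
    h = np.tanh(fwd[0] @ X[t] + fwd[1] @ h + fwd[2])
    H_f[t] = h

# Backward direction: depends on the next hidden state H_{t+1}.
H_b = np.zeros((T, hidden))
h = np.zeros(hidden)
for t in reversed(range(T)):
    h = np.tanh(bwd[0] @ X[t] + bwd[1] @ h + bwd[2])
    H_b[t] = h

# Combine the two directions (here: concatenation per timestep).
H = np.concatenate([H_f, H_b], axis=1)
print(H.shape)  # (4, 4)
```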
Applications of Bidirectional RNN:
Speech Recognition
Translation
Handwritten Recognition
Protein Structure Prediction
Part-of-speech tagging
Dependency Parsing
Entity Extraction
Encoder:
Multiple RNN cells can be stacked together to form the encoder. The
RNN reads each input sequentially.
For every timestep (each input) t, the hidden state (hidden vector) h is
updated according to the input at that timestep, X[t].
After all the inputs are read by encoder model, the final hidden state of
the model represents the context/summary of the whole input sequence.
Example: Consider the input sequence “I am a Student” to be encoded.
There will be 4 timesteps in total (4 tokens) for the Encoder model. At
each time step, the hidden state h will be updated using the previous
hidden state and the current input.
At the first timestep t1, the previous hidden state h0 will be considered as
zero or randomly chosen. So the first RNN cell will update the current
hidden state with the first input and h0. Each layer outputs two things —
updated hidden state and the output for each stage. The outputs at each
stage are discarded; only the hidden states are propagated to the
next step.
The hidden states h_t are computed using the formula:
h_t = f(W^(hh) · h_{t-1} + W^(hx) · x_t)
At the second timestep t2, the hidden state h1 and the second input X[2]
will be given as input, and the hidden state h2 will be updated according
to both inputs. This happens for all four timesteps in the example taken.
A stack of several recurrent units (LSTM or GRU cells for better
performance) where each accepts a single element of the input sequence,
collects information for that element, and propagates it forward.
In the question-answering problem, the input sequence is a collection of
all words from the question. Each word is represented as x_i where i is
the order of that word.
This simple formula represents the result of an ordinary recurrent neural
network. As you can see, we just apply the appropriate weights to the previous
hidden state h_(t-1) and the input vector x_t.
Encoder Vector
This is the final hidden state produced from the encoder part of the
model. It is calculated using the formula above.
This vector aims to encapsulate the information for all input elements in
order to help the decoder make accurate predictions.
It acts as the initial hidden state of the decoder part of the model.
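The encoder walk-through above can be sketched in NumPy (a minimal illustration; the token list, random stand-in embeddings, and sizes are assumptions, not real learned values):

```python
import numpy as np

# Encoder sketch: read tokens one at a time and keep only the final
# hidden state as the context (encoder) vector.
rng = np.random.default_rng(2)
emb, hidden = 4, 3
W_hh = rng.standard_normal((hidden, hidden)) * 0.1
W_hx = rng.standard_normal((hidden, emb)) * 0.1

tokens = ["I", "am", "a", "Student"]                  # 4 timesteps
embeddings = {w: rng.standard_normal(emb) for w in tokens}

h = np.zeros(hidden)                                   # h0: zero initial state
for w in tokens:
    # Per-step outputs are discarded; only h is propagated forward.
    h = np.tanh(W_hh @ h + W_hx @ embeddings[w])

context = h   # encoder vector: summary of the whole input sequence
print(context.shape)  # (3,)
```

The `context` vector is what gets handed to the decoder as its initial hidden state.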
Decoder:
The Decoder generates the output sequence by predicting the next output
Yt given the hidden state ht.
The input for the decoder is the final hidden vector obtained at the end of
the encoder model.
Each decoder step has three inputs: the hidden vector from the previous
step h_{t-1}, the previous step's output y_{t-1}, and the original encoder
hidden vector h.
At the first step, the encoder's output vector, the special START symbol,
and an empty hidden state h_{t-1} are given as input; the outputs
obtained are y1 and the updated hidden state h1 (the information already
emitted as output is removed from the hidden vector).
The second step takes the updated hidden state h1, the previous
output y1, and the original hidden vector h as its current inputs, and
produces the hidden vector h2 and output y2.
The outputs produced at each timestep of the decoder are the actual output.
The model keeps predicting outputs until the END symbol occurs.
A stack of several recurrent units where each predicts an output y_t at a
time step t.
Each recurrent unit accepts a hidden state from the previous unit and
produces an output as well as its own hidden state.
In the question-answering problem, the output sequence is a collection of
all words from the answer. Each word is represented as y_i where i is the
order of that word.
Example: Decoder.
Any hidden state h_t is computed using the formula:
h_t = f(W^(hh) · h_{t-1})
As you can see, we are just using the previous hidden state to compute the
next one.
Output Layer:
We use Softmax activation function at the output layer.
It is used to produce a probability distribution from a vector of values,
with the target class having the highest probability.
The output y_t at time step t is computed using the formula:
y_t = softmax(W^S · h_t)
We calculate the outputs using the hidden state at the current time
step together with the respective weight W(S). Softmax is used to
create a probability vector that will help us determine the final
output (e.g. word in the question-answering problem).
The power of this model lies in the fact that it can map sequences of
different lengths to each other. The input and output lengths are not
tied to each other and can differ. This opens a whole new range of
problems that can now be solved using such an architecture.
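The decoder loop with its softmax output layer can be sketched as follows (a toy vocabulary, random weights, and greedy decoding; the reserved END index and all sizes are assumptions for illustration):

```python
import numpy as np

# Decoder sketch: start from the encoder context vector, repeatedly update
# the hidden state, and map it through softmax to a token distribution.
rng = np.random.default_rng(3)
hidden, vocab = 3, 5
W_hh = rng.standard_normal((hidden, hidden)) * 0.1
W_s = rng.standard_normal((vocab, hidden)) * 0.1   # output-layer weights W(S)

def softmax(z):
    e = np.exp(z - z.max())      # subtract max for numerical stability
    return e / e.sum()

h = rng.standard_normal(hidden)  # stand-in for the encoder context vector
END = vocab - 1                  # reserve the last index as the END symbol
outputs = []
for _ in range(10):              # cap the length in case END never appears
    h = np.tanh(W_hh @ h)        # h_t computed from the previous hidden state
    probs = softmax(W_s @ h)     # probability distribution over the vocabulary
    y = int(np.argmax(probs))    # greedy choice of the next token
    if y == END:
        break
    outputs.append(y)

print(outputs)
```

A real decoder would also feed the previous output y_{t-1} (via an embedding) into each step; that input is omitted here to keep the sketch minimal.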
Applications:
This architecture has many applications, such as:
Google’s Machine Translation
Question answering chatbots
Speech recognition
Time series applications, etc.
S1, S2, and S3 are the hidden states (memory units) at times t1, t2,
and t3, respectively, and Ws is the weight matrix associated with them.
X1, X2, and X3 are the inputs at times t1, t2, and t3, respectively,
and Wx is the weight matrix associated with them.
Y1, Y2, and Y3 are the outputs at times t1, t2, and t3, respectively,
and Wy is the weight matrix associated with them.
For any time, t, we have the following two equations:
S_t = g1(Wx · X_t + Ws · S_{t-1})
Y_t = g2(Wy · S_t)
where g1 and g2 are activation functions.
We will now perform the back propagation at time t = 3.
Let the error function be:
E_t = (d_t − Y_t)²
Here, we employ the squared error, where d_3 is the desired output at time
t = 3.
In order to do backpropagation, it is necessary to change the weights that are
associated with inputs, memory units, and outputs.
Adjusting Wy:
To better understand, we can look at the following image:
Explanation:
E3 is a function of Y3. Hence, we differentiate E3 with respect to Y3.
Y3 is a function of Wy. Hence, we differentiate Y3 with respect to Wy.
∂E3/∂Wy = (∂E3/∂Y3) · (∂Y3/∂Wy)
Adjusting Ws:
To better understand, we can look at the following image:
Explanation:
E3 is a function of Y3. Therefore, we differentiate E3 with respect to Y3.
Y3 is a function of S3. Therefore, we differentiate Y3 with respect to S3.
S3 is a function of Ws. Therefore, we differentiate S3 with respect to Ws.
But it is not enough to stop here; we must also consider the previous
time steps. We must also partially differentiate the error function with
respect to the memory units S2 and S1, taking the weight matrix Ws into
account.
It is essential to note that a memory unit, such as St, is a function of its
predecessor memory unit, St-1.
Therefore, we differentiate S3 with respect to S2, and S2 with respect to S1.
In general, we can describe this formula as:
∂E3/∂Ws = Σ_{i=1}^{3} (∂E3/∂Y3) · (∂Y3/∂S3) · (∂S3/∂S_i) · (∂S_i/∂Ws)
Adjusting Wx:
To better understand, we can look at the following image:
Explanation:
E3 is a function of Y3. Therefore, we differentiate E3 with respect to Y3.
Y3 is a function of S3. Therefore, we differentiate Y3 with respect to S3.
S3 is a function of Wx. Therefore, we differentiate S3 with respect to Wx.
We cannot stop here either; we must also consider the preceding time
steps and partially differentiate the error function with respect to the
memory units S2 and S1, taking the weight matrix Wx into account.
In general, we can describe this formula as:
∂E3/∂Wx = Σ_{i=1}^{3} (∂E3/∂Y3) · (∂Y3/∂S3) · (∂S3/∂S_i) · (∂S_i/∂Wx)
Limitations:
Backpropagation through time (BPTT) is practical only for a limited
number of time intervals, such as 8 or 10. If we continue to backpropagate
further, the gradient becomes too small. This is known as the
"vanishing gradient" problem, and it arises because the contribution of
information diminishes geometrically over time. Therefore, if the number
of time steps is greater than about 10, the information is effectively
discarded.
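The geometric decay described above can be demonstrated numerically. This toy sketch approximates the recurrent Jacobian ∂S_t/∂S_{t-1} by the weight matrix alone (ignoring the activation derivative, an assumed simplification):

```python
import numpy as np

# Vanishing-gradient illustration: the gradient through k timesteps
# contains a product of k Jacobian factors. If the recurrent weight
# matrix has spectral norm below 1, that product shrinks geometrically.
Ws = np.array([[0.4, 0.1],
               [0.1, 0.4]])     # small recurrent weight matrix

grad = np.eye(2)
norms = []
for _ in range(20):
    grad = Ws @ grad            # one more factor per extra timestep
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[9], norms[19])  # the norm decays toward zero
```

With weights whose largest singular value exceeds 1, the same product grows instead, which is the exploding-gradient side of the problem.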
Structure of LSTM:
LSTM has a chain structure that contains four neural networks and
different memory blocks called cells.
1. Forget gate: The information that is no longer useful in the cell state is
removed with the forget gate. The inputs h_t-1 and x_t are passed through
the sigmoid function, which outputs values between 0 and 1 that filter the
previous cell state. The equation for the forget gate is:
f_t = σ(W_f · [h_t-1, x_t] + b_f)
2. Input gate: The input gate adds useful information to the cell state.
First, the information is regulated using the sigmoid function, which
filters the values to be remembered (similar to the forget gate) using the
inputs h_t-1 and x_t. Then, a vector is created using the tanh function,
giving outputs from -1 to +1, which contains all the possible values from
h_t-1 and x_t. Finally, the values of the vector and the regulated values
are multiplied to obtain the useful information. The equations for the
input gate are:
i_t = σ(W_i · [h_t-1, x_t] + b_i)
Ĉ_t = tanh(W_c · [h_t-1, x_t] + b_c)
C_t = f_t ⊙ C_t-1 + i_t ⊙ Ĉ_t
Where,
⊙ denotes element-wise multiplication
tanh is tanh activation function
3. Output gate: The task of extracting useful information from the current cell
state to be presented as output is done by the output gate. First, a vector is
generated by applying tanh function on the cell. Then, the information is
regulated using the sigmoid function and filter by the values to be remembered
using inputs h_t-1 and x_t. At last, the values of the vector and the regulated
values are multiplied to be sent as an output and input to the next cell. The
equation for the output gate is:
o_t = σ(W_o · [h_t-1, x_t] + b_o)
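The gate equations above can be combined into a single LSTM cell step, sketched in NumPy (toy sizes and random weights are assumptions; the structure follows the standard gate formulation):

```python
import numpy as np

# One LSTM cell step: forget gate, input gate, cell-state update,
# output gate, and the hidden-state update h_t = o_t * tanh(C_t).
rng = np.random.default_rng(4)
n_in, n_h = 3, 4
concat = n_h + n_in              # size of [h_{t-1}, x_t]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W_f, W_i, W_c, W_o = (rng.standard_normal((n_h, concat)) * 0.1
                      for _ in range(4))
b_f = b_i = b_c = b_o = np.zeros(n_h)

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    C_tilde = np.tanh(W_c @ z + b_c)       # candidate values
    C_t = f_t * C_prev + i_t * C_tilde     # new cell state
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(C_t)               # new hidden state
    return h_t, C_t

# Both the hidden state h and the cell state C carry across timesteps.
h, C = np.zeros(n_h), np.zeros(n_h)
for t in range(3):
    h, C = lstm_step(rng.standard_normal(n_in), h, C)

print(h.shape, C.shape)  # (4,) (4,)
```

The cell state C_t is the long-term memory path: it is modified only by element-wise gating, which is what lets gradients survive across many timesteps.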
Advantages of LSTM:
Long-term dependencies can be captured by LSTM networks. They have
a memory cell that is capable of long-term information storage.
In traditional RNNs, there is a problem of vanishing and exploding
gradients when models are trained over long sequences. By using a gating
mechanism that selectively recalls or forgets information, LSTM
networks deal with this problem.
LSTM enables the model to capture and remember important context,
even when there is a significant time gap between relevant events in the
sequence. So LSTMs are used where understanding context is important,
e.g. machine translation.
Disadvantages of LSTM:
Compared to simpler architectures like feed-forward neural networks,
LSTM networks are computationally more expensive. This can limit their
scalability for large-scale datasets or constrained environments.
Training LSTM networks can be more time-consuming compared to
simpler models due to their computational complexity. So training
LSTMs often requires more data and longer training times to achieve
high performance.
Since sentences are processed word by word in a sequential manner, it is
hard to parallelize the work of processing them.
Applications of LSTM :
Language Modeling
Speech Recognition
Time Series Forecasting
Anomaly Detection
Recommender Systems
Video Analysis