Long Short Term Memory Networks Explanation


Prerequisites: Recurrent Neural Networks 

To solve the problem of vanishing and exploding gradients in a deep Recurrent Neural Network, many variations were developed. One of the most famous of them is the Long Short Term Memory Network (LSTM). In concept, an LSTM recurrent unit tries to "remember" all the past knowledge that the network has seen so far and to "forget" irrelevant data. This is done by introducing different activation-function layers called "gates" for different purposes. Each LSTM recurrent unit also maintains a vector called the Internal Cell State, which conceptually describes the information that was chosen to be retained by the previous LSTM recurrent unit.

LSTM networks are the most commonly used variation of Recurrent Neural Networks (RNNs). The critical components of the LSTM are the memory cell and the gates (including the forget gate, but also the input gate); the inner contents of the memory cell are modulated by the input and forget gates. Assuming that both of these gates are closed, the contents of the memory cell will remain unmodified between one time-step and the next. This gating structure allows information to be retained across many time-steps, and consequently also allows gradients to flow across many time-steps. This is how the LSTM model overcomes the vanishing gradient problem that occurs with most Recurrent Neural Network models.

 A Long Short Term Memory Network consists of four different gates, each serving a different purpose, as described below:

  1. Forget Gate(f): At the forget gate, the current input is combined with the previous output to generate a fraction between 0 and 1 that determines how much of the previous state needs to be preserved (or, in other words, how much of the state should be forgotten). This output is then multiplied with the previous state. Note: an activation output of 1.0 means "remember everything" and an activation output of 0.0 means "forget everything." From this perspective, a better name for the forget gate might be the "remember gate".
  2. Input Gate(i): The input gate operates on the same signals as the forget gate, but here the objective is to decide which new information is going to enter the state of the LSTM. The output of the input gate (again a fraction between 0 and 1) is multiplied with the output of the tanh block that produces the new values to be added to the previous state. This gated vector is then added to the previous state to generate the current state.
  3. Input Modulation Gate(g): It is often considered a sub-part of the input gate, and much of the literature on LSTMs does not even mention it, assuming it is inside the input gate. It is used to modulate the information that the input gate will write onto the internal cell state, by adding non-linearity to the information and making the information zero-mean. This is done to reduce the learning time, as zero-mean input converges faster. Although this gate's actions are less important than the others and it is often treated as a refinement, it is good practice to include it in the structure of the LSTM unit.
  4. Output Gate(o): At the output gate, the input and the previous state are gated as before to generate another scaling fraction, which is combined with the output of the tanh block applied to the current state. This result is then given out as the output. Both the output and the new state are fed back into the LSTM block. (A code sketch of these gate computations follows this list.)
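To make the gate computations concrete, here is a minimal NumPy sketch. The sizes, the random initialization, and the omission of bias terms are hypothetical simplifications for illustration, not a reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes: input vector of size 3, hidden state of size 4.
rng = np.random.default_rng(0)
x_t = rng.standard_normal(3)       # current input
h_prev = rng.standard_normal(4)    # previous output (hidden state)
z = np.concatenate([x_t, h_prev])  # combined signal fed to every gate

# One randomly initialized weight matrix per gate (learned in practice).
W_f, W_i, W_g, W_o = (rng.standard_normal((4, 7)) for _ in range(4))

f = sigmoid(W_f @ z)   # forget gate: fraction in (0, 1) per state element
i = sigmoid(W_i @ z)   # input gate: fraction in (0, 1)
g = np.tanh(W_g @ z)   # input modulation gate: zero-mean values in (-1, 1)
o = sigmoid(W_o @ z)   # output gate: fraction in (0, 1)
```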

The basic workflow of a Long Short Term Memory Network is similar to the workflow of a Recurrent Neural Network with the only difference being that the Internal Cell State is also passed forward along with the Hidden State. 

Working of an LSTM recurrent unit:  

  1. Take as input the current input, the previous hidden state, and the previous internal cell state.
  2. Calculate the values of the four different gates by following the steps below:
    • For each gate, calculate the parameterized vector for the current input and the previous hidden state by multiplying the concatenation of the two vectors with the respective weight matrix for that gate.
    • Apply the respective activation function for each gate element-wise on the parameterized vectors: the sigmoid function for the forget, input, and output gates, and the hyperbolic tangent for the input modulation gate.
  3. Calculate the current internal cell state by first calculating the element-wise multiplication vector of the input gate and the input modulation gate, then calculate the element-wise multiplication vector of the forget gate and the previous internal cell state and then add the two vectors. 
    c_{t} = i\odot g + f\odot c_{t-1}
  4. Calculate the current hidden state by first taking the element-wise hyperbolic tangent of the current internal cell state vector and then performing element-wise multiplication with the output gate (see the sketch of the full forward step after this list).
    h_{t} = o\odot tanh(c_{t})
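Putting the four steps together, a single forward step of an LSTM unit can be sketched as follows; the stacked weight layout, the helper name lstm_step, and the random initialization are assumptions made for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W: (4*H, X+H) stacked gate weights, b: (4*H,)."""
    H = h_prev.shape[0]
    z = np.concatenate([x_t, h_prev])  # step 1: combine the inputs
    a = W @ z + b                      # step 2a: parameterized vectors
    f = sigmoid(a[0:H])                # step 2b: forget gate (sigmoid)
    i = sigmoid(a[H:2*H])              # input gate (sigmoid)
    g = np.tanh(a[2*H:3*H])            # input modulation gate (tanh)
    o = sigmoid(a[3*H:4*H])            # output gate (sigmoid)
    c_t = i * g + f * c_prev           # step 3: new internal cell state
    h_t = o * np.tanh(c_t)             # step 4: new hidden state
    return h_t, c_t

# Usage with hypothetical sizes (input 3, hidden 4):
rng = np.random.default_rng(0)
W, b = rng.standard_normal((16, 7)) * 0.1, np.zeros(16)
h, c = np.zeros(4), np.zeros(4)
h, c = lstm_step(rng.standard_normal(3), h, c, W, b)
```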

The working described above is illustrated by the following diagram.

[Figure omitted: an LSTM recurrent unit; the blue circles denote element-wise multiplication.]

The weight matrix W contains different weights for the current input vector and the previous hidden state for each gate.

Just like Recurrent Neural Networks, an LSTM network also generates an output at each time step and this output is used to train the network using gradient descent. 
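As a sketch of how an output is produced at every time step, the unit can be unrolled over a sequence with a hypothetical softmax output layer attached to the hidden state. This reuses the lstm_step helper from the sketch above, and all sizes and weights are again illustrative assumptions:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# Hypothetical setup: sequence of 5 inputs (size 3), hidden size 4, 2 classes.
rng = np.random.default_rng(1)
xs = rng.standard_normal((5, 3))
W, b = rng.standard_normal((16, 7)) * 0.1, np.zeros(16)
W_y = rng.standard_normal((2, 4)) * 0.1   # hypothetical output-layer weights

h, c = np.zeros(4), np.zeros(4)
for x_t in xs:
    h, c = lstm_step(x_t, h, c, W, b)     # from the sketch above
    y_hat_t = softmax(W_y @ h)            # predicted output at this step
    print(y_hat_t)
```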

The main difference between the back-propagation algorithms of Recurrent Neural Networks and Long Short Term Memory Networks lies in the mathematics of the gradient computation.

Let \overline{y}_{t} be the predicted output at each time step and y_{t} be the actual output at each time step. Then the error at each time step is given by:

E_{t} = -y_{t}log(\overline{y}_{t})

The total error is thus given by the summation of errors at all time steps. 

E = \sum _{t} E_{t}
\Rightarrow E = \sum _{t} -y_{t}log(\overline{y}_{t})
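
A minimal sketch of computing this error, assuming hypothetical one-hot targets y and predicted probabilities y_hat with one row per time step:

```python
import numpy as np

# Hypothetical one-hot targets and predicted probabilities,
# each of shape (num_time_steps, num_classes).
y = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)
y_hat = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])

E_t = -np.sum(y * np.log(y_hat), axis=1)  # error at each time step
E = E_t.sum()                             # total error: sum over time steps
```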

Similarly, the value \frac{\partial E}{\partial W} can be calculated as the summation of the gradients at each time step.

\frac{\partial E}{\partial W} = \sum _{t} \frac{\partial E_{t}}{\partial W}

Using the chain rule, and the fact that \overline{y}_{t} is a function of h_{t}, which in turn is a function of c_{t}, the following expression arises:

\frac{\partial E_{t}}{\partial W} = \frac{\partial E_{t}}{\partial \overline{y}_{t}}\frac{\partial \overline{y}_{t}}{\partial h_{t}}\frac{\partial h_{t}}{\partial c_{t}}\frac{\partial c_{t}}{\partial c_{t-1}}\frac{\partial c_{t-1}}{\partial c_{t-2}}.......\frac{\partial c_{0}}{\partial W}

Thus the total error gradient is given by the following:

\frac{\partial E}{\partial W} = \sum _{t} \frac{\partial E_{t}}{\partial \overline{y}_{t}}\frac{\partial \overline{y}_{t}}{\partial h_{t}}\frac{\partial h_{t}}{\partial c_{t}}\frac{\partial c_{t}}{\partial c_{t-1}}\frac{\partial c_{t-1}}{\partial c_{t-2}}.......\frac{\partial c_{0}}{\partial W}

Note that the gradient equation involves a chain of \partial c_{t} terms for LSTM back-propagation, while the gradient equation involves a chain of \partial h_{t} terms for a basic Recurrent Neural Network.

How does LSTM solve the problem of vanishing and exploding gradients? 

Recall the expression for c_{t}:

c_{t} = i\odot g + f\odot c_{t-1}

The value of the gradients is controlled by the chain of derivatives starting from \frac{\partial c_{t}}{\partial c_{t-1}}. Expanding this value using the expression for c_{t}:

\frac{\partial c_{t}}{\partial c_{t-1}} = \frac{\partial c_{t}}{\partial f}\frac{\partial f}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial c_{t-1}} + \frac{\partial c_{t}}{\partial i}\frac{\partial i}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial c_{t-1}} + \frac{\partial c_{t}}{\partial g}\frac{\partial g}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial c_{t-1}} + f

Here the first three terms account for the dependence of f, i, and g on the previous hidden state, while the final term f is the direct derivative of f\odot c_{t-1} with respect to c_{t-1}.

For a basic RNN, the term \frac{\partial h_{t}}{\partial h_{t-1}} after a certain time starts to take values either consistently greater than 1 or consistently less than 1; this is the root cause of the vanishing and exploding gradients problem. In an LSTM, the term \frac{\partial c_{t}}{\partial c_{t-1}} does not have such a fixed pattern and can take any positive value at any time step. Thus, it is not guaranteed that over a large number of time steps the term will converge to 0 or diverge completely. If the gradient starts converging towards zero, the weights of the gates can be adjusted to bring the term closer to 1. Since during the training phase the network adjusts exactly these weights, it learns when to let the gradient converge to zero and when to preserve it.
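
A tiny numerical illustration of this point, using hypothetical constant per-step factors: when the repeated RNN factor stays below 1 the gradient product vanishes, while an LSTM forget gate that training has pushed close to 1 preserves the gradient across the same number of steps.

```python
steps = 100

# Basic RNN: the per-step factor dh_t/dh_{t-1} stays in the same range,
# e.g. consistently below 1, so the gradient vanishes over many steps.
rnn_factor = 0.9
print(rnn_factor ** steps)     # ~2.7e-05: effectively zero

# LSTM: the per-step factor dc_t/dc_{t-1} is dominated by the forget
# gate f, which training can push close to 1 to preserve the gradient.
forget_gate = 0.999
print(forget_gate ** steps)    # ~0.905: gradient preserved
```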
 


