LONG SHORT-TERM MEMORY (LSTM)
Laxmidhar Behera
Professor, Department of Electrical Engineering
Indian Institute of Technology Kanpur
Overview
● Recurrent Neural Networks (RNNs)
● Limitations of RNNs
● Long Short-Term Memory Networks (LSTMs)
○ Motivation
○ Architecture
■ Forward Phase
■ Backward Phase
● Weight Update Laws
○ Examples
○ Drawbacks
○ Variants on LSTM
Recurrent Neural Networks (RNNs)
● A traditional neural network assumes that the current output is not a function of previous outputs -- a very bad
assumption for many practical tasks.
● For example, if you want to predict the next word in a sentence, you had better know which words came before it, e.g.:
John lives in ___ (Delhi / book)
● RNNs make use of sequential information. The output of an RNN depends on the previous computations, and
hence it has a “memory”.
Vanishing Gradient Problem in RNNs
● RNN forward phase:
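As a sketch, with assumed notation, a standard RNN forward phase and the gradient term behind the vanishing gradient problem can be written as:

h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t), \qquad y_t = W_{hy} h_t

\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=k+1}^{t} \operatorname{diag}\left(1 - h_j \odot h_j\right) W_{hh}

Because |\tanh'(\cdot)| \le 1, this product of t - k Jacobians typically shrinks exponentially as t - k grows, so error signals from distant time steps vanish.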
● An LSTM’s cell state is designed to update linearly and hence can act as a constant error carousel (CEC),
overcoming the vanishing gradient problem.
● Along with this cell state, LSTM cells have three gates with which the LSTM can choose to retain its memory
over arbitrary periods of time and also forget when necessary.
LSTM: Architecture
● An LSTM neuron, also called an LSTM cell, has a cell state and three gates.
● The cell state (Ct) functions as the CEC. The three gates remove, add and pass
information to and from the cell state.
● The 3 gates are
1. Forget Gate (ft)
2. Input Gate (it)
Cell update (see the equations sketched below)
● This cell state works as a CEC since its activation is linear (identity), so errors flowing through it are not repeatedly squashed.
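A sketch of these gate and cell-state equations in their standard form (notation partly assumed: u_t is the input, y_{t-1} the previous output, \sigma the logistic sigmoid, \odot elementwise multiplication):

f_t = \sigma\left(W_f [y_{t-1}, u_t] + b_f\right)
i_t = \sigma\left(W_i [y_{t-1}, u_t] + b_i\right)
\tilde{C}_t = \tanh\left(W_C [y_{t-1}, u_t] + b_C\right)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t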
LSTM: Architecture
● Output Gate (ot):
○ It decides what information from the cell state we are going to output.
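In the same sketched notation, the output gate and the cell output are:

o_t = \sigma\left(W_o [y_{t-1}, u_t] + b_o\right), \qquad s_t = \tanh(C_t), \qquad y_t = o_t \odot s_t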
Weight updates:
● Note that we use the truncated real-time recurrent learning (RTRL) assumptions here as well. Therefore, we get
LSTM: Backward Phase
Output gate weights:
Note: ot is the output gate activation, Wo are the output
gate weights, ut is the input to the neuron, yt is
the output, and st = tanh(Ct).
Therefore,
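As a sketch of the result (assuming, as in the note above, y_t = o_t \odot s_t with o_t = \sigma(W_o u_t), and writing E_t for the error at time t):

\frac{\partial E_t}{\partial W_o} = \frac{\partial E_t}{\partial y_t} \frac{\partial y_t}{\partial o_t} \frac{\partial o_t}{\partial W_o} = \frac{\partial E_t}{\partial y_t} \odot s_t \odot o_t (1 - o_t) \, u_t^{\top}

Under the truncation, the error is not propagated further back in time through y_{t-1} or C_{t-1} when computing this term.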
LSTM: Examples
● Reber Grammar
● E.g.: BTXXVPSE is a Reber string; BPTVVB is a non-Reber string.
● We start at B, and move from one node to the next, adding the symbols we pass to our string as we go.
When we reach the final E, we stop. If there are two paths we can take (e.g. after T we can go to either
S or X), we randomly choose one with equal probability, as in the sketch after this list.
● An S can follow a T, but only if the immediately preceding symbol was a B. A V or a T can follow a T,
but only if the symbol immediately preceding it was either a T or a P.
● In order to know what symbol sequences are legal, therefore, any system which recognizes Reber
strings must have some form of memory, which can use not only the current input but also fairly recent
history in making a decision. RNNs can recognize a Reber grammar.
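As a minimal sketch of how such strings can be generated (the transition table below encodes the standard Reber grammar graph; the state numbering and the function name are chosen for illustration):

import random

# Standard Reber grammar graph: each state maps to the (symbol, next state)
# choices available from it; generation starts after B and stops at E.
REBER_TRANSITIONS = {
    0: [('T', 1), ('P', 2)],
    1: [('S', 1), ('X', 3)],
    2: [('T', 2), ('V', 4)],
    3: [('X', 2), ('S', 5)],
    4: [('P', 3), ('V', 5)],
}

def make_reber_string():
    # Start at B, walk the graph choosing uniformly at random, end at E.
    state, symbols = 0, ['B']
    while state != 5:
        symbol, state = random.choice(REBER_TRANSITIONS[state])
        symbols.append(symbol)
    symbols.append('E')
    return ''.join(symbols)

print(make_reber_string())   # e.g. 'BTXXVPSE'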
LSTM: Examples
● Embedded Reber Grammar
● Using this grammar, two types of strings are generated: one kind made using the top path through the
graph, BT<Reber string>TE, and one kind made using the lower path, BP<Reber string>PE (generation is
sketched after this list).
● To recognize these as legal strings, and to tell them apart from illegal strings such as BP<Reber
string>TE, a system must be able to remember the second symbol of the series, regardless of the
length of the intervening input, and to compare it with the second-to-last symbol seen.
● A simple RNN can no longer solve this task, but an LSTM can solve it after about 10,000 training
examples.
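Continuing the sketch above, an embedded Reber string wraps an ordinary Reber string between matching T...T or P...P symbols (this reuses the illustrative make_reber_string helper):

def make_embedded_reber_string():
    # Choose the top (T) or bottom (P) path with equal probability.
    wrapper = random.choice(['T', 'P'])
    return 'B' + wrapper + make_reber_string() + wrapper + 'E'

print(make_embedded_reber_string())   # e.g. 'BTBTXXVPSETE'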
LSTM EXAMPLE CODE
Problem:
Output: a binary label (0 or 1) associated with each input in a sequence, initially all 0. Once the cumulative sum
of the input values in the sequence exceeds a threshold, the output value flips from 0 to 1.
Example:
Output: 0, 0, 0, 1, 1, 1
LSTM EXAMPLE CODE
● LSTM architecture: a hidden layer of LSTM cells (H) unrolled over time steps 1, 2, 3, ..., N-1, N
(here, N is the sequence length), with an input and an output at each time step.
LSTM example code
Let us implement this problem in Python using Keras library:
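A minimal sketch of one possible Keras implementation is given below. The 20 LSTM neurons, binary cross-entropy loss and 10,000 training sequences follow the slides; the sequence length, threshold and input distribution are assumptions (here: length 10, threshold = length / 4, inputs uniform in [0, 1]).

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, TimeDistributed

def make_sequence(n_steps=10):
    # Random inputs in [0, 1]; the label flips to 1 once the cumulative
    # sum exceeds the (assumed) threshold of n_steps / 4.
    x = np.random.rand(n_steps)
    y = (np.cumsum(x) > n_steps / 4.0).astype(np.float32)
    return x.reshape(n_steps, 1), y.reshape(n_steps, 1)

def make_dataset(n_sequences, n_steps=10):
    X, Y = zip(*(make_sequence(n_steps) for _ in range(n_sequences)))
    return np.array(X), np.array(Y)

n_steps = 10
X_train, y_train = make_dataset(10000, n_steps)  # 10,000 training sequences
X_test, y_test = make_dataset(1000, n_steps)

model = Sequential([
    LSTM(20, input_shape=(n_steps, 1), return_sequences=True),   # 20 LSTM neurons
    TimeDistributed(Dense(1, activation='sigmoid')),              # one label per time step
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=5, batch_size=32)
print(model.evaluate(X_test, y_test))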
LSTM example code
Here, we train the LSTM model on 10,000 training sequences and then test it on the test sequences.
LSTM Example Code
● An accuracy of more than 95% is achieved when a hidden layer
consisting of 20 LSTM neurons is used.
● The best accuracy is achieved when a binary cross-entropy loss
function is used instead of a mean squared error loss function.
● Binary cross-entropy loss:
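In standard form, with y_i the target label and \hat{y}_i the predicted probability over N outputs:

L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]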
LSTM: Drawbacks & Variants
● Each gate of an LSTM cell receives connections only from the input units and
the outputs of all cells, but not from the cell state of its own cell. As long
as the output gate is closed, the gates receive no information about the cell
state they control, which may affect network performance.
● Peephole LSTMs [2] are a variant in which the gate layers also take input from
the cell state (see the sketch after this list).
● Zaremba et al. [3] rigorously tested 10,000 variants of LSTMs and concluded that the GRU performs
best in most cases; initializing the forget-gate bias to a larger value lets other LSTM variants
reach similar results.
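A sketch of the peephole modification for, e.g., the forget and output gates (the peephole weights V_f, V_o are names assumed for illustration; note that the output gate peeks at the current cell state C_t):

f_t = \sigma\left(W_f [y_{t-1}, u_t] + V_f \odot C_{t-1} + b_f\right), \qquad
o_t = \sigma\left(W_o [y_{t-1}, u_t] + V_o \odot C_t + b_o\right)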
References
1. Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory.
Neural computation, 9(8), pp.1735-1780.
2. Gers, F.A., Schmidhuber, J. and Cummins, F., 1999. Learning to forget:
Continual prediction with LSTM.
3. Zaremba, W., Sutskever, I. and Vinyals, O., 2014. Recurrent neural
network regularization. arXiv preprint arXiv:1409.2329.