
Long Short-Term Memory (LSTM)


Lecture 11

LONG SHORT-TERM
MEMORY (LSTM)

Laxmidhar Behera
Professor, Department of Electrical Engineering
Indian Institute of Technology Kanpur
Overview
● Recurrent Neural Networks (RNNs)
● Limitations of RNNs
● Long Short Term Memory Networks (LSTMs)
○ Motivation
○ Architecture
■ Forward Phase
■ Backward Phase
● Weight Update Laws
○ Examples
○ Drawbacks
○ Variants on LSTM
Recurrent Neural Networks (RNNs)

● A traditional neural network assumes that the current output is independent of previous outputs, which is a very bad
idea for many practical tasks.
● For example, to predict the next word in a sentence, you need to know which words came before it, e.g.:
John lives in ___ (Delhi / book)
● RNNs make use of sequential information. The output of an RNN depends on the previous computations and
hence the network has a “memory”.


Vanishing Gradient Problem in RNNs
● RNN forward phase (equation reconstructed below):

● Weight update at the nth time step (BPTT, reconstructed below):

● Also, note that the gradient of the state s with respect to earlier states is itself a chain-rule product, e.g. the last equation below.


● Since the activation functions are tanh or sigmoid, their derivatives lie between 0 and 1. Hence, the gradient
values shrink exponentially fast and eventually vanish after a few time steps. Gradient contributions from
“far away” steps become negligible.
● Thus the network may not learn long-range dependencies.
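A reconstruction of the missing equations, in a common notation (the slide's original symbols are not shown in this text, so the letters U, V, W are assumptions): the forward phase, the BPTT weight update at the nth time step, and the chain-rule product whose factors shrink.

s_t = \tanh(U x_t + W s_{t-1}), \qquad y_t = \mathrm{softmax}(V s_t)

\frac{\partial E_n}{\partial W} = \sum_{k=0}^{n} \frac{\partial E_n}{\partial y_n} \frac{\partial y_n}{\partial s_n} \frac{\partial s_n}{\partial s_k} \frac{\partial s_k}{\partial W}

\frac{\partial s_n}{\partial s_k} = \prod_{j=k+1}^{n} \frac{\partial s_j}{\partial s_{j-1}} = \prod_{j=k+1}^{n} \mathrm{diag}\left(1 - s_j^2\right) W

Each factor contains the derivative of tanh (at most 1), so the product shrinks rapidly as n - k grows.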
Vanishing Gradient in RNNs: An example
● To predict the last word in “the clouds are in the sky,”
we don’t need any further context – it’s pretty
clear the next word is going to be “sky”.

● The gap between the relevant information and the
place where it’s needed is small, so RNNs can learn to
use the past information. (Figure 1. Short Range Dependencies)

● To predict the last word in the text “I spent a
major portion of my youth in China …… I
speak fluent Chinese,” we need context from much earlier in the sequence.
● RNNs suffer from the vanishing gradient problem and
will not learn such long-range dependencies.
(Figure 2. Long Range Dependencies)
LSTM: Motivation
Constant Error Carousel (CEC)
● Consider a single self-recurrent unit with activation function g and recurrent weight w (the slide's figure is not reproduced here).
● The unit's output, and the error backpropagated from the (k+1)th time step to the kth time step, are reconstructed below.
To maintain a constant error carousel (CEC), the backpropagated error must neither grow nor shrink, which fixes g and w as shown below.
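A reconstruction of the CEC condition from [1], using the symbols suggested by the slide (a unit with activation function g, self-recurrent weight w and net input net_k = w y_{k-1}); treat the exact notation as an assumption:

y_k = g(\mathrm{net}_k) = g(w\, y_{k-1})

\delta_k = g'(\mathrm{net}_k)\, w\, \delta_{k+1}

For the backpropagated error \delta to stay constant we need g'(\mathrm{net}_k)\, w = 1, which is satisfied by a linear activation g(x) = x together with w = 1.0. This linear, self-connected unit is exactly what the LSTM cell state implements.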

● An LSTM’s cell state is designed to update linearly and hence can act as a CEC overcoming the vanishing
gradient problem.

● Along with this cell state, LSTM cells have three gates with which they can choose to retain their memory
over arbitrary periods of time and also forget when necessary.
LSTM: Architecture
● An LSTM neuron, also called an LSTM cell, has a cell state and three gates.
● The cell state (Ct) functions as the CEC. The three gates remove, add and pass
information to and from the cell state.
● The 3 gates are (standard equations reconstructed below):
1. Forget Gate (ft)
2. Input Gate (it)
3. Output Gate (ot)
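The standard gate equations (biases written explicitly; this is the usual presentation and almost certainly what the slide showed):

f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)
i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)
o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)

Each gate is a sigmoid layer over the previous output h_{t-1} and the current input x_t, producing values between 0 and 1 that scale how much information passes through.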


LSTM: Architecture
● The three gates:
○ Forget Gate (ft):
■ It decides what information we’re going to throw away
from the cell state.

■ It is a neural network which looks at ht−1 and xt, and outputs
a number between 0 and 1 for each number in the cell
state Ct−1. A ‘1’ represents “completely keep this” while
a ‘0’ represents “completely get rid of this.”

■ For example, in a language model trying to predict the next word,


the cell state might include the gender of the present subject, so
that the correct pronouns can be used. When we see a new subject,
we want to forget the gender of the old subject.
LSTM: Architecture
● Input Gate:
○ It decides what new information we’re going to store in the cell
state.
○ Protects the memory contents stored in cell from perturbation
by irrelevant inputs.
○ This has two parts:
■ Calculate the candidate values that should be added
to the state.
■ Calculate the input gate (it) to decide how much we update
each state value.

Cell update (reconstructed below):

● This cell state works as a CEC since its own activation is linear.
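The standard candidate and cell-state update, matching the description above (the candidate is written \tilde{C}_t; notation assumed):

\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)

C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t

The forget gate scales the old state, the input gate scales the candidate, and the two are simply added, so errors flowing through Ct are not repeatedly squashed.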
LSTM: Architecture
● Output Gate:
○ It decides what part of the cell state we are going to output
(see the reconstruction below).

○ E.g., in the language-model example, we try to predict the next word.


○ Suppose the model has just seen a subject; it might want to output
information relevant to a verb, in case that’s what is coming
next.

○ It might also output whether the subject is singular or plural,


so that we know what form a verb should be conjugated into if
that’s what follows next.
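The standard output equations; the output gate was already listed above, and the cell output follows from it (notation as before):

o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)

h_t = o_t \odot \tanh(C_t)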
LSTM: Backward Phase
● We update all the weights of the LSTM neuron by gradient descent. The gradients are calculated using Real
Time Recurrent Learning (RTRL).
● Let the input for the neuron be ut = [xt ht−1]’.
● For a single neuron, the forward equations are reconstructed below.
Note: Ct is the cell state, yt is the output,
ot is the output gate, ft is the forget gate, it is the input gate, st = tanh(Ct).
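A reconstruction of the single-neuron forward equations in the ut notation above, consistent with the note (biases omitted, Wc denotes the cell-input weights; these symbols are assumptions):

f_t = \sigma(W_f u_t), \quad i_t = \sigma(W_i u_t), \quad o_t = \sigma(W_o u_t)

C_t = f_t\, C_{t-1} + i_t \tanh(W_c u_t)

s_t = \tanh(C_t), \qquad y_t = o_t\, s_t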

Weight updates:

1. Forget gate weights:


LSTM: Backward Phase
Forget gate weights:
● Here we approximate the gradient by assuming that errors arriving at the gate inputs are not propagated
further back in time; only the recursion through the cell state is kept. This is also known as truncated RTRL. So, the
CECs are the only part of the system through which errors can flow back forever. This makes
LSTM’s updates efficient without significantly affecting learning power. Therefore, we get the update reconstructed below.
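A reconstruction of the truncated-RTRL forget-gate update in the notation above, with e_t = \partial E / \partial y_t and learning rate \eta (the exact form on the slide is not reproduced here):

\frac{\partial C_t}{\partial W_f} \approx f_t\, \frac{\partial C_{t-1}}{\partial W_f} + C_{t-1}\, f_t (1 - f_t)\, u_t^{\top}

\frac{\partial E}{\partial W_f} = e_t\, o_t \left(1 - s_t^2\right) \frac{\partial C_t}{\partial W_f}, \qquad \Delta W_f = -\eta\, \frac{\partial E}{\partial W_f}

The first line is the running derivative of the cell state kept by RTRL (truncated to the CEC); the second converts it into a gradient through s_t = tanh(C_t) and y_t = o_t s_t.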
LSTM: Backward Phase
Input gate weights:
Note: it is the input gate, Wi are the
input gate weights, ut is the input to the
neuron, Ct is the cell state, yt is the
output, ot is the output gate, ft is the
forget gate, st = tanh(Ct)

● Note that we have used the truncated RTRL assumption here as well. Therefore, we get the update reconstructed below.
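A reconstruction of the corresponding input-gate update (same assumptions and notation as for the forget gate):

\frac{\partial C_t}{\partial W_i} \approx f_t\, \frac{\partial C_{t-1}}{\partial W_i} + \tanh(W_c u_t)\, i_t (1 - i_t)\, u_t^{\top}

\frac{\partial E}{\partial W_i} = e_t\, o_t \left(1 - s_t^2\right) \frac{\partial C_t}{\partial W_i}, \qquad \Delta W_i = -\eta\, \frac{\partial E}{\partial W_i}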
LSTM: Backward Phase
Output gate weights:
Note: ot is the output gate, Wo are the output
gate weights, ut is the input to the neuron, yt is
the output, st = tanh(Ct).

Therefore, the update is reconstructed below.
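Since yt = ot st depends on Wo only through ot, no RTRL recursion is needed here (reconstruction, same notation as above):

\frac{\partial E}{\partial W_o} = e_t\, s_t\, o_t (1 - o_t)\, u_t^{\top}, \qquad \Delta W_o = -\eta\, \frac{\partial E}{\partial W_o}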
LSTM: Examples
● Reber Grammar
● E.g., BTXXVPSE is a Reber string; BPTVVB is a non-Reber string.

● We start at B, and move from one node to the next, adding the symbols we pass to our string as we go.
When we reach the final E, we stop. If there are two paths we can take, e.g. after T we can go to either
S or X, we randomly choose one (with equal probability).

● An S can follow a T, but only if the immediately preceding symbol was a B. A V or a T can follow a T,
but only if the symbol immediately preceding it was either a T or a P.

● In order to know what symbol sequences are legal, therefore, any system which recognizes Reber
strings must have some form of memory, which can use not only the current input but also fairly recent
history in making a decision. RNNs can recognize a Reber grammar (a generator is sketched below).
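A minimal sketch of a Reber-string generator. The grammar graph figure is not reproduced in this text, so the transition table below is the standard Reber grammar from the literature and should be treated as an assumption; it does produce the example string BTXXVPSE.

import random

# state -> list of (symbol, next_state); the walk starts with 'B' in state 1
# and ends with 'E' when state 6 is reached.
REBER_GRAPH = {
    1: [('T', 2), ('P', 3)],
    2: [('S', 2), ('X', 4)],
    3: [('T', 3), ('V', 5)],
    4: [('X', 3), ('S', 6)],
    5: [('P', 4), ('V', 6)],
}

def make_reber_string():
    state, symbols = 1, ['B']
    while state != 6:
        symbol, state = random.choice(REBER_GRAPH[state])  # equal probability at each fork
        symbols.append(symbol)
    symbols.append('E')
    return ''.join(symbols)

print(make_reber_string())  # e.g. 'BTXXVPSE' or 'BPVVE'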
LSTM: Examples
● Embedded Reber Grammar
LSTM: Examples
● Embedded Reber Grammar

● Using this grammar two types of strings are generated: one kind which is made using the top path
through the graph: BT<reber string>TE and one kind made using the lower path: BP<reber
string>PE.

● To recognize these as legal strings, and learn to tell them from illegal strings such as BP<reber
string>TE, a system must be able to remember the second symbol of the series, regardless of the
length of the intervening input, and to compare it with the second-last symbol seen.

● A simple RNN can no longer solve this task, but an LSTM can solve it after about 10,000 training
examples.
LSTM EXAMPLE CODE
Problem:

Input: A sequence of random values between 0 and 1.

Output: A binary label (0 or 1) associated with each input, initially all 0. Once the cumulative sum of the input
values in the sequence exceeds a threshold, the output value flips from 0 to 1.

Example:

Input: 0.12, 0.24, 0.98, 0.45, 0.10, 0.56 threshold: 1.5

Cumulative Sum: 0.12, 0.36, 1.34, 1.79, 1.89, 2.45

Output: 0, 0, 0, 1, 1, 1
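A quick check of the labelling rule on the example above, using plain numpy (this snippet is not from the slide):

import numpy as np

x = np.array([0.12, 0.24, 0.98, 0.45, 0.10, 0.56])
threshold = 1.5
cumulative = np.cumsum(x)                      # [0.12, 0.36, 1.34, 1.79, 1.89, 2.45]
labels = (cumulative > threshold).astype(int)  # flips to 1 once the sum exceeds 1.5
print(labels)                                  # [0 0 0 1 1 1]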
LSTM EXAMPLE CODE
● LSTM architecture:

• A single input for each time-step.

• A hidden layer consisting of 20 LSTM neurons.

• Each LSTM neuron has a forget gate, an input gate and an output gate.

• A single output for each time-step.

• Unrolling the network: the hidden layer H is replicated across time steps 1, 2, ..., N-1, N, with one input
and one output per step (here, N is the sequence length).

[Figure: the unrolled LSTM network]
LSTM example code
Let us implement this problem in Python using the Keras library:
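The slide's code itself is not reproduced in this text, so the following is a minimal sketch of how the task can be implemented with the tensorflow.keras API. The sequence length (10), the threshold (one quarter of the sequence length), the optimizer and the number of epochs are assumptions; the 20-neuron hidden layer, the 10,000 training sequences and the binary cross-entropy loss come from the surrounding slides.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, TimeDistributed, Dense

def get_sequences(n_sequences, seq_len=10, threshold=None):
    # Random inputs in [0, 1); the label flips from 0 to 1 once the cumulative
    # sum exceeds the threshold (assumed here to be seq_len / 4).
    if threshold is None:
        threshold = seq_len / 4.0
    X = np.random.rand(n_sequences, seq_len)
    y = (np.cumsum(X, axis=1) > threshold).astype(np.float32)
    return X[..., np.newaxis], y[..., np.newaxis]  # shapes: (n, seq_len, 1)

# Hidden layer of 20 LSTM neurons, one sigmoid output per time step.
model = Sequential()
model.add(LSTM(20, input_shape=(10, 1), return_sequences=True))
model.add(TimeDistributed(Dense(1, activation='sigmoid')))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

X_train, y_train = get_sequences(10000)
X_test, y_test = get_sequences(100)
model.fit(X_train, y_train, epochs=5, batch_size=32)
loss, acc = model.evaluate(X_test, y_test)
print('test accuracy: %.3f' % acc)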
LSTM example code
Here, we train the LSTM model on 10,000 training sequences and then evaluate it on held-out test sequences.
LSTM Example Code
● An accuracy of more than 95% is achieved when a hidden layer
consisting of 20 LSTM neurons is used.
● The best accuracy is achieved when we use a binary cross-entropy loss
function instead of a mean squared error loss function.
● Binary cross-entropy loss (reconstructed below):
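The binary cross-entropy loss over N labelled outputs, with \hat{y}_i the predicted probability and y_i the true label:

L = -\frac{1}{N} \sum_{i=1}^{N} \left[\, y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]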
LSTM: Drawbacks & Variants
● Each LSTM gate receives connections only from the input units and the
outputs of all cells, but not from the cell state of its own cell. As long
as the output gate is closed, the gates receive no information about the
cell state, which may affect network performance.
● Peephole LSTMs [2] are a variant in which the gate layers take input from
the cell state as well.

● Another variation is to use coupled forget and input gates, in which we
only input new values to the state when we forget something older (see the
reconstruction below).
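With coupled gates the forget gate also decides how much new information enters, so the separate input gate disappears (a common way to write this variant; notation as before):

C_t = f_t \odot C_{t-1} + (1 - f_t) \odot \tilde{C}_t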
LSTM: Drawbacks & Variants
● Gated Recurrent Units (GRUs):
○ Similar to an LSTM unit, a GRU has gating units that modulate the flow of information
inside the unit, however without a separate memory cell.
○ It combines the forget and input gates into a single “update gate.”

○ A reset gate rt controls how much of the previous state is used when forming the new
candidate (equations reconstructed below).

○ It also merges the cell state and hidden state.
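The standard GRU equations as commonly presented (biases omitted; the exact symbols are an assumption):

z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right)
r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right)
\tilde{h}_t = \tanh\left(W \cdot [r_t \odot h_{t-1}, x_t]\right)
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t

Here z_t is the update gate (playing the role of the coupled forget/input gates) and r_t is the reset gate; there is no separate cell state.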

● Zaremba [3] rigorously tested over 10,000 LSTM variants and concluded that the GRU is the
best one in most cases; for the other LSTM variants, initializing the offset (bias) of the forget gate
to a larger value gives similar results.
References
1. Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory.
Neural computation, 9(8), pp.1735-1780.
2. Gers, F.A., Schmidhuber, J. and Cummins, F., 1999. Learning to forget:
Continual prediction with LSTM.
3. Zaremba, W., Sutskever, I. and Vinyals, O., 2014. Recurrent neural
network regularization. arXiv preprint arXiv:1409.2329.
