The document provides an overview of LSTM (Long Short-Term Memory) networks. It first reviews RNNs (Recurrent Neural Networks) and their limitations in capturing long-term dependencies. It then introduces LSTM networks, which address this issue using forget, input, and output gates that let the network retain information over many time steps. Code examples are provided to demonstrate how an LSTM remembers information over many time steps. Resources for further reading on LSTMs and RNNs are listed at the end.
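As a preview of the gating mechanism mentioned above, here is a minimal sketch of a single LSTM step in plain NumPy. It follows the standard forget/input/output gate formulation; the function and variable names are illustrative, not a library API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step with forget (f), input (i), and output (o) gates."""
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])        # forget gate: how much old memory to keep
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])        # input gate: how much new information to write
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])        # output gate: how much memory to expose
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate memory content
    c_t = f * c_prev + i * c_tilde   # cell state: gated mix of old and new memory
    h_t = o * np.tanh(c_t)           # hidden state: gated view of the memory
    return h_t, c_t

# Tiny demo with random parameters (shapes only; the values are meaningless).
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {k: rng.normal(size=(n_hid, n_in)) for k in "fioc"}
U = {k: rng.normal(size=(n_hid, n_hid)) for k in "fioc"}
b = {k: np.zeros(n_hid) for k in "fioc"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
```

Because the forget gate can stay close to 1, the cell state c_t can carry information across many steps largely unchanged, which is what the gates buy you over a plain RNN.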
2. Human Learning
• "We are the sum total of our experiences. None of us are the same as we were yesterday, nor will be tomorrow." (B. J. Neblett)
• Memorization: every time we gain new information, we store it for future reference.
• Combination: not all tasks are the same, so we couple our analytical skills with a combination of our memorized, previous experiences to reason about the world.
3. Outline
• Review of RNN
  • SimpleRNN
  • LSTM (forget, input, and output gates)
  • GRU (reset and update gates)
• LSTM
  • Motivation
  • Introduction
  • Code example
4. Review of RNN
• Jack Ma is a Chinese business magnate, and his native language is ____
13. Summary
• The hidden state can be viewed as "memory": it tries to capture information from all previous time steps.
• The output is determined by the current input and this "memory".
• The hidden state cannot capture all of the previous information.
• An RNN reuses the same parameters W at every time step (a CNN, by contrast, shares parameters across spatial locations), as the sketch below illustrates.
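The following NumPy sketch makes the two key points concrete: the same parameters (W_xh, W_hh, b) are reused at every time step, and the hidden state h carries "memory" of everything seen so far. All shapes and values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_in, n_hid = 5, 3, 4                    # time steps, input dim, hidden dim
W_xh = rng.normal(size=(n_hid, n_in))       # input-to-hidden weights
W_hh = rng.normal(size=(n_hid, n_hid))      # hidden-to-hidden weights
b = np.zeros(n_hid)

h = np.zeros(n_hid)                         # initial "memory" is empty
for t in range(T):                          # the SAME W_xh, W_hh, b at every step
    x_t = rng.normal(size=n_in)             # current input
    h = np.tanh(W_xh @ x_t + W_hh @ h + b)  # new memory = f(current input, old memory)
```

Because h is squashed into a fixed-size vector at every step, it cannot retain everything, which is the limitation motivating LSTM.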
45. • A sample is an individual training example. The "batch_size" variable is hence the count of samples you send to the neural network, i.e. how many different examples you feed at once.
• Time steps are ticks of time: how long in time each of your samples is. For example, a sample can contain 128 time steps, where each time step could be a 30th of a second for signal processing. In Natural Language Processing (NLP), a time step may be associated with a character, a word, or a sentence, depending on the setup.
• Features are simply the number of dimensions fed at each time step. For example, in NLP a word could be represented by 300 features using word2vec. In the case of signal processing, suppose your signal is 3D: you have an X, a Y, and a Z signal, such as an accelerometer's measurements on each axis. Then you would have 3 features sent at each time step for each sample. The sketch below shows how these three quantities map onto an LSTM input.
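Here is a minimal Keras sketch (assuming TensorFlow's Keras; the layer sizes and all numbers are illustrative) that feeds 3-feature, 128-time-step samples, like the accelerometer example above, into an LSTM:

```python
import numpy as np
from tensorflow import keras

time_steps, features = 128, 3        # 128 ticks of time, 3 dims per tick (X, Y, Z)
model = keras.Sequential([
    keras.Input(shape=(time_steps, features)),  # per-sample shape; the batch dim is implicit
    keras.layers.LSTM(32),                      # 32 hidden units; returns the final hidden state
    keras.layers.Dense(1),                      # e.g. one regression target per sample
])

batch_size = 16                                 # number of samples fed at once
x = np.random.randn(batch_size, time_steps, features).astype("float32")
print(model.predict(x).shape)                   # (16, 1): one output per sample
```

The input tensor has shape (batch_size, time_steps, features), which is exactly the (samples, ticks of time, dimensions per tick) decomposition described above.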