Deep Learning for Text
Neural Networks, RNN, LSTM
CHIEN CHIN CHEN
Deep Learning with Python - Manning Publications
François Chollet, 2017
Preface
This subject is intended as a primer for those with no experience
in neural networks.
◦ We will not go into the math of the underlying algorithm,
backpropagation.
Instead, we first present basic components of neural networks.
◦ Neurons, fully connected networks.
And then talk about some advanced network structures.
◦ Embedding layers, recurrent neural networks, long short-term memory
networks.
Examples of IMDB review sentiment analysis will be provided to
practice these well-known networks.
2
Anatomy of a Neural Network (1/4)
What is a neural network?
◦ A network normally consists of layers and weights.
◦ Analyzes the input (data features) to predict or estimate something.
[Figure: a network whose input layer takes age = 36, gender = 1, browses via iPhone = 1, and annual salary = 80, followed by hidden layers, and whose output layer scores three car types: sedan 0.58, SUV 0.14, sports car 0.28.]
3
Anatomy of a Neural Network (2/4)
The prediction is very similar to regression
◦ Aggregate A LOT OF regression functions to make predictions.
◦ Layers extract (transform) features iteratively.
◦ It is a kind of feature engineering, but the features are learned (acquired) automatically.
[Figure: the same network with inputs age = 36, gender = 1, browses via iPhone = 1, annual salary = 80; the hidden layers transform them into learned features such as 0.71, 0.43, -0.22, 0.71, 0.68.]
4
Anatomy of a Neural Network (3/4)
One big thing is to determine the weights of a network.
◦ Fortunately, it is a supervised learning process.
◦ The weights are learned automatically from training data through
some sophisticated math.
Another big thing is to specify the structure of your network.
◦ Fully connected, recurrent, encoder-decoder, etc.
◦ Picking the right structure is more an art than a science…
Age  Gender  iPhone  …  Annual salary  Car type
36   1       1       …  80             1
44   0       1       …  135            2
…
23   1       0       …  48             2
5
Anatomy of a Neural Network (4/4)
The overview of neural network learning:
[Figure: the training loop — training data is fed into the network (structure & weights), we evaluate how good the network is, and sophisticated math is used to adjust the weights.]
6
IMDB Sentiment Dataset
The dataset comes packaged with Keras.
◦ Has been preprocessed – the reviews (sequences of words) have been turned into sequences of integers (unique word ids).
50,000 polarized reviews.
◦ 25,000 for training and 25,000 for testing.
◦ Each set consists of 50% negative and 50% positive reviews.
The sentiment detection task is simply a binary classification.
◦ Positive/negative reviews.
7
Neuron (1/4)
Introduced in the 1950s by Frank Rosenblatt, the perceptron offered a
novel algorithm for finding patterns in data.
◦ To mimic the operation of a living neuron cell.
◦ The cell weights incoming signals when deciding when to fire.
◦ When the cell reaches a certain level of charge, it fires, sending an
electrical signal out.
8
Neuron (2/4)
Rosenblatt’s research was to teach a machine to recognize
images.
◦ E.g., hot dog/not hot dog.
The features of the input image were small subsections of the image.
A weight was assigned to each feature to measure its importance.
Activation function: once the
weighted sum is above a
certain threshold, the
perceptron outputs 1!!
Here, the activation function
is a step function.
9
Neuron (3/4)
The reason for having the bias is that you need the
perceptron to be resilient to inputs of all zeros.
◦ Sometimes, you need your network not to output 0 when facing
inputs of 0.
◦ With this bias, you have no problem.
Of course, the weights will be learned from training examples,
rather than assigned manually!!
10
Neuron (4/4)
The base unit of neural networks is the neuron!!
Some people regard the perceptron as a special case of a neuron
that uses a step function as its activation function.
Many prefer using a continuously differentiable nonlinear
function (e.g., sigmoid) as the activation function of a neuron.
◦ To be able to model difficult problems.
◦ Allowing backpropagation training.
Sigmoid function: 𝑆(𝑥) = 1 / (1 + 𝑒^(−𝑥))
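As a quick, hedged illustration (not taken from the slides), a single neuron can be computed in a few lines of NumPy; the weights, bias, and inputs below are made-up values.

import numpy as np

def step(z):
    # Perceptron-style activation: output 1 once the weighted sum crosses the threshold (here 0).
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    # Continuously differentiable activation: S(x) = 1 / (1 + e^(-x)).
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.4, -0.2, 0.7])   # made-up weights
b = 0.1                          # made-up bias
x = np.array([1.0, 0.5, 0.3])    # made-up input features

z = np.dot(w, x) + b             # weighted sum plus bias
print(step(z), sigmoid(z))       # perceptron output vs. sigmoid neuron output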
11
Fully Connected Network (1/11)
The most popular neural network structure
◦ Each input element has a connection to every neuron in the next layer.
◦ And the outputs of layer k become the input of layer k+1, and so on.
[Figure: a fully connected network with an input layer, hidden layers, and an output layer.]
12
Fully Connected Network (2/11)
To process text information, we need to vectorize the input text.
◦ Multi-hot encoding/Embedding layer
Let’s start with the multi-hot encoding and a simple three-layer
structure.
[Figure: a three-layer network whose input layer is the multi-hot vector (term1 = 0, term2 = 1, term3 = 1, …, term|V| = 0), followed by hidden layers and a single output node ("Positive or not?") whose value here is 0.14.]
13
Fully Connected Network (3/11)
Loading the IMDB dataset, only
returning the top 10,000 most
frequently occurring words
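A minimal sketch of this loading step, assuming the standard Keras IMDB helper:

from tensorflow.keras.datasets import imdb

# Keep only the 10,000 most frequently occurring words.
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)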
14
Fully Connected Network (4/11)
Define a function that turns each
review into a multi-hot vector.
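A plausible version of this function, following the multi-hot approach described above (train_data and test_data come from the previous loading sketch):

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # One row per review; position j is set to 1 if word id j appears in the review.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)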
15
Fully Connected Network (5/11)
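A plausible three-layer fully connected sketch for this step (the hidden-layer sizes of 16 are assumptions):

from tensorflow.keras import models, layers

model = models.Sequential([
    layers.Dense(16, activation="relu", input_shape=(10000,)),  # hidden layer on the multi-hot input
    layers.Dense(16, activation="relu"),                        # second hidden layer
    layers.Dense(1, activation="sigmoid"),                      # output: positive or not
])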
16
Fully Connected Network (6/11)
Loss function: a.k.a. objective function, measures how well
the learned model (weights) fits the training data.
◦ Binary crossentropy for a binary classification problem (see the formula sketch below).
◦ yᵢ: 1/0, the target value of the i-th training example.
◦ ŷᵢ: the model's output for the i-th training example.
◦ Categorical crossentropy for multi-class classification problem.
◦ MSE for regression problem.
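For reference, the standard binary crossentropy over N training examples (not reproduced from the slide) can be written as:

L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]

That is, the loss is small when ŷᵢ is close to the target yᵢ for every training example.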
17
Fully Connected Network (7/11)
Optimizer: approach to adjust (update) network weights.
◦ SGD, RMSprop, Adam, … etc; most of them are based on gradient
descent.
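The compile call for this step is likely close to the following sketch (RMSprop and the accuracy metric are assumptions consistent with the book):

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])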
18
Fully Connected Network (8/11)
We split training data into training/validation sets.
◦ The validation set is used to search appropriate hyper-parameters.
◦ E.g., activation function, number of hidden nodes, number of layers ….
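A simple way to carve out a validation set (the 10,000-review split size is an assumption following the book; x_train comes from the earlier vectorization sketch):

import numpy as np

y_train = np.asarray(train_labels).astype("float32")

# Hold out the first 10,000 reviews for validation; train on the remaining 15,000.
x_val, partial_x_train = x_train[:10000], x_train[10000:]
y_val, partial_y_train = y_train[:10000], y_train[10000:]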
19
Fully Connected Network (9/11)
epochs=20
◦ Optimizers adjust network weights iteratively.
◦ After all training data has gone through the network once, and the errors have been used by the optimizer to update the weights, we call this one epoch of neural network training.
◦ Note that too many epochs could make your model overfit!!
batch_size=512
◦ The size of a random subset of the training data.
◦ To speed up training and to avoid getting stuck in a local optimum.
◦ Default is 32.
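Putting the hyper-parameters above together, the training call probably resembles this sketch (model and data names follow the earlier sketches):

history = model.fit(partial_x_train, partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))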
20
Fully Connected Network (10/11)
Example of overfitting the training dataset
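The overfitting behaviour can be inspected from the history object returned by fit(); a hedged matplotlib sketch:

import matplotlib.pyplot as plt

history_dict = history.history
epochs = range(1, len(history_dict["loss"]) + 1)

# Training loss keeps dropping while validation loss starts rising: overfitting.
plt.plot(epochs, history_dict["loss"], "bo", label="Training loss")
plt.plot(epochs, history_dict["val_loss"], "b", label="Validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()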
21
Fully Connected Network (11/11)
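A minimal evaluation sketch for this final step (x_test and test_labels come from the earlier sketches):

test_loss, test_acc = model.evaluate(x_test, test_labels)
print("Test accuracy:", test_acc)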
22
Embedding Layer (1/10)
Multi-hot vectors are too high-dimensional and too sparse.
◦ In the last example, vector dimension is 10,000!!
Word embeddings are low-dimensional, dense vectors.
◦ The dimension of word embeddings is usually less than 1,000.
◦ Word embeddings are learned from data.
◦ To make synonyms embedded into similar (close) vectors.
Two ways to obtain word embeddings:
◦ Learn word embeddings jointly with the task you care about (e.g., the IMDB sentiment analysis).
◦ Using pre-trained word embeddings (e.g., Word2Vec).
Note that it is good to learn task-specific embeddings if you have plenty of
task-specific data; pre-trained embeddings can sometimes be too general for
specific tasks.
23
Embedding Layer (2/10)
The following example shows how to add embedding layers
into your neural network.
One difficulty – different review lengths; but the size of
network inputs needs to be fixed!!
◦ Here, we simply examine the first 100 words of each review.
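Keras' pad_sequences helper can truncate/pad every review to a fixed length; a sketch under that assumption (truncating="post" keeps the first 100 tokens):

from tensorflow.keras.preprocessing.sequence import pad_sequences

maxlen = 100  # examine only the first 100 words of each review

x_train = pad_sequences(train_data, maxlen=maxlen, truncating="post")
x_test = pad_sequences(test_data, maxlen=maxlen, truncating="post")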
24
Embedding Layer (3/10)
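A plausible sketch of this model, with an Embedding layer followed by Flatten (the embedding dimension of 16 is an assumption, and input_length follows the tf.keras 2.x API):

from tensorflow.keras import models, layers

embedding_dim = 16  # assumption; the actual value may differ

model = models.Sequential([
    # Maps each of the 10,000 word ids to an embedding_dim-dimensional vector.
    layers.Embedding(10000, embedding_dim, input_length=100),
    layers.Flatten(),                       # 100 x embedding_dim -> one long vector
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])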
25
Embedding Layer (4/10)
[Figure: an input review encoded as word ids 1, 14, 22, …, 1067; the embedding layer looks up each id's vector (e.g., id 1 → 2.5, 0.7, …, 1.8 and id 1067 → 3.7, 0.8, …, 0.1), and the looked-up vectors are then flattened into one long vector (2.5, 0.7, …, 0.1) for the next layer.]
26
Embedding Layer (5/10)
The weights of the embedding layer are initially random!!
During training, the weights are gradually adjusted to meet
the goal of the downstream task (e.g., sentiment analysis).
Once fully trained, the embeddings could be applied to
other similar tasks.
◦ Pre-trained embeddings.
◦ E.g., product review analysis
[Figure: the embedding layer's weights form a matrix of #unique_token x embedding_dimension, that is, 10000 x 300.]
27
Embedding Layer (6/10)
When training word embeddings, the embedding quality
depends on the size of your training set.
◦ If you have little training data, you might not be able to learn
appropriate task-specific embeddings.
An alternative is to load embedding vectors from a
precomputed embedding space that captures generic
language structures.
◦ Remember Word2Vec?
Below, we demonstrate how to load Word2Vec word
embeddings into a Keras Embedding layer.
28
Embedding Layer (7/10)
Let’s load Word2Vec embedding vectors first.
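One common way to do this (an assumption; the slides may use a different tool) is gensim's KeyedVectors with the Google News Word2Vec binary:

from gensim.models import KeyedVectors

# The file path is an assumption; point it at your local copy of the vectors.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(w2v["movie"].shape)  # each word is a 300-dimensional vector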
29
Embedding Layer (8/10)
Next, build up the embedding matrix that we will load into
an Embedding layer for our neural network model.
Not every word can be found in a
pre-trained model!!
Zero embedding if a word is
unknown to Word2Vec!!
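A sketch of building the 10,000 x 300 matrix from the IMDB word index (w2v comes from the previous sketch; the id offset of 3 follows Keras' IMDB convention, and unknown words keep a zero vector):

import numpy as np
from tensorflow.keras.datasets import imdb

max_words = 10000
embedding_dim = 300

word_index = imdb.get_word_index()  # maps each word to an integer id
embedding_matrix = np.zeros((max_words, embedding_dim))

for word, i in word_index.items():
    idx = i + 3  # ids 0-2 are reserved (padding, start, unknown) in the Keras IMDB data
    if idx < max_words and word in w2v:
        embedding_matrix[idx] = w2v[word]  # stays all-zero if Word2Vec does not know the word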
30
Embedding Layer (9/10)
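A possible sketch for this step loads embedding_matrix from the previous step into the Embedding layer and freezes it, so the pre-trained vectors are not modified during training:

from tensorflow.keras import models, layers

model = models.Sequential([
    layers.Embedding(10000, 300, input_length=100),
    layers.Flatten(),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

# Load the pre-computed Word2Vec matrix and keep it fixed during training.
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])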
31
Embedding Layer (10/10)
When lots of training data is available, models with a task-
specific embedding are generally more powerful (accurate)
than those with pre-trained embeddings.
◦ In this example, we train models with 25,000 reviews.
◦ That is why the model with pre-trained embeddings is inferior to the
first task-specific embedding model.
◦ Self-practice: randomly select 200 reviews for training, and compare
the two models.
32
Recurrent Neural Networks (1/6)
A network structure useful for sequential data.
◦ E.g., time series and TEXT – token by token!!
Offering a kind of memory mechanism that helps process
data sequentially.
◦ Like a human being, it enables the model to observe review tokens one by
one, and gradually realize the emotion of the author.
By contrast, the neural networks seen so far keep no state
between inputs.
◦ Each input is processed independently.
33
Recurrent Neural Networks (2/6)
The simplest recurrent neural network (RNN) is a type of neural
network that has an internal loop.
◦ It processes sequences by iterating through the sequence elements
◦ Maintaining a state containing (memorizing) information relative to what it has
seen so far.
◦ There are many different RNNs (e.g., LSTM) fitting this definition, but with
more sophisticated structures.
[Figure: an RNN reading the review "I do not like this movie …" token by token. At each timestep t, it considers the input at t and the current state, which is usually the output of timestep t-1; the output is then carried forward as the state.]
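Conceptually, this loop can be made concrete with a small NumPy toy (random, untrained weights; the numbers are illustrative only):

import numpy as np

input_dim, state_dim = 4, 3
W = np.random.randn(state_dim, input_dim)  # input-to-state weights
U = np.random.randn(state_dim, state_dim)  # state-to-state weights
b = np.zeros(state_dim)

inputs = np.random.randn(10, input_dim)    # a made-up sequence of 10 timesteps
state_t = np.zeros(state_dim)              # initial state

for input_t in inputs:
    output_t = np.tanh(W @ input_t + U @ state_t + b)  # combine current input with current state
    state_t = output_t                                  # the output is carried forward as the state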
34
Recurrent Neural Networks (3/6)
Keras offers the SimpleRNN layer to help us construct a
simple RNN.
In the following example, we sequentially process an IMDB
review (i.e., token by token).
◦ Each token is first converted into an embedding (dim. 16).
◦ Then, the SimpleRNN layer processes all the embeddings one by
one.
◦ Last, the final output of the RNN layer is fed into a dense layer to
predict (classify) the sentiment of the review.
35
Recurrent Neural Networks (4/6)
[Figure: the input review "I do not like this movie …" goes through an Embedding Layer (dimension 16); the SimpleRNN (dimension 16) reads the embeddings one by one, updating its state; the last output of the SimpleRNN is fed to a Dense layer with a sigmoid, giving a value between 0 and 1 that indicates positive/negative.]
36
Recurrent Neural Networks (5/6)
NO Flatten()!!
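A plausible sketch matching the description on the previous slides (embedding and RNN dimensions of 16). As the slide stresses, there is no Flatten(): the Dense layer receives only the last output of the SimpleRNN.

from tensorflow.keras import models, layers

model = models.Sequential([
    layers.Embedding(10000, 16, input_length=100),  # each token id -> a 16-dim embedding
    layers.SimpleRNN(16),                           # returns only its final output vector
    layers.Dense(1, activation="sigmoid"),          # positive/negative probability
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])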
37
Recurrent Neural Networks (6/6)
38
LSTM (1/4)
One major shortcoming of simple RNN:
◦ Unable to retain old information in the current state after running
many timesteps.
◦ In other words, long-term dependencies are hard to keep.
Long Short-Term Memory (LSTM) develops a way to carry
information across many timesteps.
◦ A conveyor belt (or carry track) running parallel to the sequence you are processing.
◦ Information from the sequence can jump onto the belt or off it, and be transported to a later timestep.
39
LSTM (2/4)
Additional information flow – carry track:
◦ cₜ: the carry at time t.
◦ It is combined with the current input and state to affect the state sent to the next timestep.
40
LSTM (3/4)
Computing the new carry state (i.e., cₜ₊₁) is a little bit complicated.
◦ Forget gate, input (candidate) gate, output (update) gate.
◦ We are not going into the details of these gates.
Just keep in mind: the design of LSTM allows past
information to be reinjected at a later time.
◦ To have a long-term memory of what has happened.
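For readers who want a rough picture despite the details being skipped here, a NumPy toy in the spirit of the book's simplified pseudocode can make the carry flow concrete (weights are random and untrained; biases are omitted; the names are illustrative only):

import numpy as np

rng = np.random.default_rng(0)
input_dim, dim = 4, 3

# One (random) weight set per path: output, "input" gate, "forget" gate, candidate info.
Wo, Uo, Vo = rng.standard_normal((dim, input_dim)), rng.standard_normal((dim, dim)), rng.standard_normal((dim, dim))
Wi, Ui = rng.standard_normal((dim, input_dim)), rng.standard_normal((dim, dim))
Wf, Uf = rng.standard_normal((dim, input_dim)), rng.standard_normal((dim, dim))
Wk, Uk = rng.standard_normal((dim, input_dim)), rng.standard_normal((dim, dim))

inputs = rng.standard_normal((10, input_dim))
state_t = np.zeros(dim)
c_t = np.zeros(dim)  # the carry running parallel to the sequence

for input_t in inputs:
    output_t = np.tanh(Wo @ input_t + Uo @ state_t + Vo @ c_t)  # carry influences the new state
    i_t = np.tanh(Wi @ input_t + Ui @ state_t)                  # what to let onto the belt
    f_t = np.tanh(Wf @ input_t + Uf @ state_t)                  # what to forget from the belt
    k_t = np.tanh(Wk @ input_t + Uk @ state_t)                  # candidate information
    c_t = i_t * k_t + c_t * f_t                                 # update the carry
    state_t = output_t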
41
LSTM (4/4)
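A plausible sketch for this final example simply swaps the SimpleRNN layer for an LSTM layer (training settings are assumptions; x_train here is the padded id sequences from the earlier pad_sequences sketch):

from tensorflow.keras import models, layers

model = models.Sequential([
    layers.Embedding(10000, 16, input_length=100),
    layers.LSTM(16),                        # LSTM instead of SimpleRNN
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, train_labels, epochs=10, batch_size=512, validation_split=0.2)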
42
