Unit 5b - Natural Language Processing
Chapter 11 from the Deep Learning Illustrated book
1
Preprocessing NLP Data
Tokenization: word_tokenize, sent_tokenize
Converting all characters to lowercase
Removing stop words: a, an, the, of, at, …
Removing punctuation
Stemming
Handling n-grams
Whether a given preprocessing step is worthwhile can be investigated empirically: incorporate the step and observe whether it impacts the accuracy of your deep learning model downstream.
2
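As a quick illustration of tokenization, the nltk tokenizers can be applied directly to raw text. The sketch below uses a made-up example sentence and assumes the punkt tokenizer models have been downloaded (nltk.download('punkt')).
from nltk import word_tokenize, sent_tokenize
text = "NLP is fun. It is also useful!"   # hypothetical example text
sent_tokenize(text)   # ['NLP is fun.', 'It is also useful!']
word_tokenize(text)   # ['NLP', 'is', 'fun', '.', 'It', 'is', 'also', 'useful', '!']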
Preprocessing NLP Data
Most of these preprocessing steps are
available in nltk (the Natural Language
Toolkit) and gensim libraries.
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import gutenberg
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer   # used for stemming below
import string                                # punctuation list used later
import gensim
from gensim.models.phrases import Phraser, Phrases
from gensim.models.word2vec import Word2Vec
# One-time downloads may be needed, e.g.:
# nltk.download('gutenberg'); nltk.download('punkt'); nltk.download('stopwords')
3
Preprocessing NLP Data
gutenberg.fileids()
len(gutenberg.fileids())
>>> 18
len(gutenberg.words())
>>> 2621613
gberg_sents = gutenberg.sents()
stemmer = PorterStemmer()
stpwrds = stopwords.words('english') + list(string.punctuation)
[stemmer.stem(w.lower()) for w in gberg_sents[4]
    if w.lower() not in stpwrds]
4
Preprocessing NLP Data
Bigrams
phrases = Phrases(gberg_sents) # train detector
bigram = Phraser(phrases)
bigram.phrasegrams
>>> {'two_daughters': (19, 11.966813731181547),
>>> 'her_sister': (195, 17.7960829227865),
>>> 'more_than': (541, 29.023584433996874),
>>> 'had_been': (1256, 22.306024648925288),
>>> 'Miss_Taylor': (48, 453.75918026073305),
>>> ...
5
Preprocessing NLP Data
lower_sents = []
for s in gberg_sents:
    lower_sents.append([w.lower() for w in s
                        if w.lower() not in list(string.punctuation)])
lower_bigram = Phraser(Phrases(lower_sents))
lower_bigram["jon lives in new york city".split()]
>>> ['jon', 'lives', 'in', 'new_york', 'city']
clean_sents = []
for s in lower_sents:
    clean_sents.append(lower_bigram[s])
6
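With clean_sents in hand, a word2vec model can be trained using the gensim Word2Vec class imported earlier. The sketch below is illustrative: the hyperparameter values are assumptions rather than the book's exact settings, and the argument names assume gensim 4.x.
model = Word2Vec(sentences=clean_sents,
                 vector_size=64,   # dimensionality of the word vectors
                 sg=1,             # 1 = skip-gram, 0 = CBOW (compared later in the unit)
                 window=10,        # context window size
                 min_count=10,     # ignore words occurring fewer than 10 times
                 workers=4,
                 epochs=5,
                 seed=42)
model.wv['father'][:8]             # first few components of one word vector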
Word Vectors
Vector representations of words are the
information-dense alternative to one-hot
encodings of words.
Whereas one-hot representations capture
information about word location only, word
vectors (also known as word embeddings or
vector-space embeddings) capture information
about word meaning as well as location.
This additional information enables deep learning NLP models to automatically learn linguistic features.
7
Word Vectors
The overarching concept is to assign each word
within a corpus to a particular, meaningful
location within a multidimensional space called
the vector space.
Initially, each word is assigned to a random
location.
By considering the words that tend to be used around a given word within our corpus, each word can gradually be shifted into a location that represents its meaning.
8
Word Vectors
Commencing at the first word in our corpus
and moving to the right one word at a time
until we reach the final word in our corpus, we
consider each word to be the target word.
Two of the most popular techniques for
converting natural language into word vectors
are word2vec and GloVe.
10
Word Vectors
The closer two words are within vector space, the
closer their meaning, as determined by the
similarity of the context words appearing near
them in natural language.
Synonyms and common misspellings of a given
word would be expected to have nearly identical
context words and therefore nearly identical
locations in vector space.
Words that are used in similar contexts, such as
those that denote days of the week, tend to occur
near each other in vector space.
11
Word-Vector Arithmetic
Remarkably, because particular movements
across vector space turn out to be an efficient
way for relevant word information to be stored in
the vector space, these movements come to
represent relative meanings between words.
12
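For example, the classic analogy king - man + woman ≈ queen can be probed with gensim's most_similar method. This is a sketch that assumes the word2vec model trained earlier in the unit (here called model); the actual results depend on the corpus, and a small corpus like Project Gutenberg may not reproduce the textbook answer.
# Word-vector arithmetic: add the 'woman' vector to 'king' and subtract 'man'.
model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3)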
Word Embeddings with word2vec
Skip-gram (SG): predicts context words given the target word. Better for a smaller corpus; represents rare words well.
CBOW: predicts the target word given context words. Multiple times faster; represents frequent words slightly better.
17
Plotting Word Vectors
Human brains are not well suited to
visualizing anything in greater than three
dimensions.
We can use dimensionality-reduction techniques, such as t-distributed stochastic neighbor embedding (t-SNE), to approximately map the locations of words from high-dimensional word-vector space down to two or three dimensions.
18
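A sketch of this reduction with scikit-learn's TSNE, assuming the gensim 4.x word2vec model trained earlier in the unit; the perplexity value and the use of pandas for plotting are illustrative choices, not the book's exact code.
import pandas as pd
from sklearn.manifold import TSNE
words = model.wv.index_to_key        # vocabulary, most frequent words first (gensim 4.x)
coords_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(model.wv.vectors)
coords = pd.DataFrame(coords_2d, columns=['x', 'y'])
coords['token'] = words
coords.plot.scatter('x', 'y', figsize=(8, 8))   # scatter plot of the 2-D word map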
IMDb processed with DNN
We load the dataset, keeping the 5,000 most frequent tokens (num_words=5000) and skipping the 50 most frequent ones (skip_top=50), which are largely uninformative, very common words.
from tensorflow.keras.datasets import imdb
(x_train, y_train), (x_valid, y_valid) = imdb.load_data(num_words=5000, skip_top=50)
19
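The reviews vary in length, so they are padded or truncated to a fixed length before being fed to a network. A minimal sketch with Keras's pad_sequences follows; note that the models later in the unit assume different lengths (100 for the dense, RNN, and LSTM models; 400 for the CNN; 200 for the Bi-LSTM models), matching each Embedding layer's input_length.
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Pad short reviews with zeros and truncate long ones so every review has 100 tokens.
x_train = pad_sequences(x_train, maxlen=100, padding='pre', truncating='pre', value=0)
x_valid = pad_sequences(x_valid, maxlen=100, padding='pre', truncating='pre', value=0)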
IMDb processed with DNN
Let us create a baseline ANN with word
embedding, flatten, dense, dropout and an
output layer.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense, Dropout
model = Sequential()
model.add(Embedding(5000, 64, input_length=100))  # n_unique_words = 5000 and 64 dimensions
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
20
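The slide omits the compilation and training calls. A sketch follows, using the padded data from the earlier slide; the optimizer, batch size, and number of epochs are illustrative assumptions. The same compile-and-fit pattern applies to the CNN, RNN, LSTM, and Bi-LSTM models on the following slides.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train,
          batch_size=128, epochs=4,
          validation_data=(x_valid, y_valid))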
IMDb processed with DNN
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 100, 64) 320000
_________________________________________________________________
flatten_1 (Flatten) (None, 6400) 0
_________________________________________________________________
dense_1 (Dense) (None, 64) 409664
_________________________________________________________________
dropout_1 (Dropout) (None, 64) 0
_________________________________________________________________
dense_2 (Dense) (None, 1) 65
=================================================================
Total params: 729,729
Trainable params: 729,729
Non-trainable params: 0
_________________________________________________________________
21
IMDb processed with DNN
This model achieved a validation accuracy of
84.5%.
The area under the receiver operating characteristic (ROC) curve is fairly high at 92.9%.
Our dense classifier is not specialized to detect
patterns of multiple tokens occurring in a
sequence that might predict film-review
sentiment.
E.g., the token pair not-good is predictive of negative sentiment.
22
IMDb processed with a CNN
Convolutional layers are particularly adept
at detecting spatial patterns.
Let us use them to detect spatial patterns
among words—like the not-good
sequence.
23
IMDb processed with a CNN
from tensorflow.keras.layers import SpatialDropout1D, Conv1D, GlobalMaxPooling1D
model = Sequential()
model.add(Embedding(5000, 64, input_length=400))
model.add(SpatialDropout1D(0.2))
model.add(Conv1D(256, 3, activation='relu'))  # 256 filters, each spanning 3 tokens
# model.add(Conv1D(256, 3, activation='relu'))  # a second conv layer could be added here
model.add(GlobalMaxPooling1D())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
24
IMDb processed with a CNN
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 400, 64) 320000
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 400, 64) 0
_________________________________________________________________
conv1d_1 (Conv1D) (None, 398, 256) 49408
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 256) 0
_________________________________________________________________
dense_1 (Dense) (None, 256) 65792
_________________________________________________________________
dropout_1 (Dropout) (None, 256) 0
_________________________________________________________________
dense_2 (Dense) (None, 1) 257
=================================================================
Total params: 435,457
25
IMDb processed with a CNN
This CNN model achieved a better validation
accuracy of 89.6%.
The area under the ROC curve is also higher
than the DNN at 96.2%.
Our ConvNet classifier outperformed our dense
net—perhaps in large part because its
convolutional layer is adept at learning patterns
of words that predict some outcome, such as
whether a film review is favorable or negative.
26
IMDb processed with an RNN
The filters within convolutional layers tend to
excel at learning short sequences like triplets of
words (k = 3), but a document of natural
language like a movie review might contain
much longer sequences of words that, when
considered all together, would enable the model
to accurately predict some outcome.
To handle long sequences of data like this, there
exists a family of deep learning models called
recurrent neural networks (RNNs).
27
IMDb processed with an RNN
Let us build a model with one RNN layer.
from tensorflow.keras.layers import SimpleRNN
model = Sequential()
model.add(Embedding(10000, 64, input_length=100))
model.add(SpatialDropout1D(0.2))
model.add(SimpleRNN(256, dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
28
IMDb processed with an RNN
The results of running this model were not
encouraging.
We found that the training loss, after going
down steadily over the first half-dozen epochs,
began to jump around after that.
This indicates that the model is struggling to
learn patterns even within the training data.
We got a validation accuracy of 77.6
percent and an ROC AUC of 84.9 percent.
29
IMDb processed with an RNN
The main reason for the worst results so far is that simple RNNs can only backpropagate through ~10 time steps before the gradient diminishes so much that parameter updates become negligibly small.
Because of this, simple RNNs are rarely used in
practice: More-sophisticated recurrent layer types
like LSTMs, which can backpropagate through
~100 time steps, are far more common.
30
IMDb processed with an LSTM
Let us build a model with one LSTM layer.
from tensorflow.keras.layers import LSTM
model = Sequential()
model.add(Embedding(10000, 64, input_length=100))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(256, dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
31
IMDb processed with an LSTM
Training loss decreased steadily epoch
over epoch, suggesting that model-fitting
proceeded more conventionally than with
our simple RNN.
The results are not great, however.
Despite its relative sophistication, our LSTM
performed only as well as our baseline dense
model.
We got a validation accuracy of 84.8 percent
and an ROC AUC of 92.8 percent.
32
IMDb processed with a Bi-LSTM
Bidirectional LSTMs (Bi-LSTMs) are a
clever variation on standard LSTMs.
Whereas the latter involve backpropagation
in only one direction, bidirectional LSTMs
involve backpropagation in both directions
(backward and forward over timesteps).
This extra backpropagation doubles
computational complexity, but if
accuracy is paramount to your application,
it is often worth the trouble.
33
IMDb processed with a Bi-LSTM
Let us build a model with a Bi-LSTM layer.
35
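The code for this model does not appear on the slide. The sketch below follows the pattern of the unit's other models, wrapping a single LSTM layer in Keras's Bidirectional wrapper; the layer sizes and the input length of 200 are assumptions consistent with the surrounding slides rather than the book's exact values.
from tensorflow.keras.layers import Bidirectional
model = Sequential()
model.add(Embedding(10000, 64, input_length=200))
model.add(SpatialDropout1D(0.2))
model.add(Bidirectional(LSTM(256, dropout=0.2)))
model.add(Dense(1, activation='sigmoid'))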
IMDb with stacked Bi-LSTMs
Let us build a model with two Bi-LSTM
layers (stacked over one another).
model = Sequential()
model.add(Embedding(10000, 64, input_length=200))
model.add(SpatialDropout1D(0.2))
model.add(Bidirectional(LSTM(64, dropout=0.2, return_sequences=True)))
model.add(Bidirectional(LSTM(64, dropout=0.2)))
model.add(Dense(1, activation='sigmoid'))
36
IMDb with stacked Bi-LSTMs
The stacked Bi-LSTM outperformed its
unstacked cousin by a noteworthy margin,
with a validation accuracy of 87.8 percent
and an ROC AUC of 94.9%.
It still lags behind the accuracy of the CNN model.
This is perhaps due to the relatively small size of the IMDb dataset.
Perhaps larger datasets would facilitate effective backpropagation over the many timesteps associated with LSTM layers.
37
Transfer Learning in NLP
ULMFiT (Universal Language Model Fine-Tuning)
ELMo (Embeddings from Language Models)
BERT (Bidirectional Encoder Representations from Transformers) from Google.
Pretrained BERT models can be fine-tuned to particular NLP tasks.
38
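As a glimpse of what such pretrained models make possible, the Hugging Face transformers library (an assumption about tooling, not used elsewhere in this unit) exposes fine-tuned models behind a one-line pipeline. A minimal sketch:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')  # downloads a default pretrained sentiment model
classifier('This film was an absolute delight from start to finish.')
# returns a list with one dict containing a 'label' (POSITIVE/NEGATIVE) and a confidence 'score'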
Multi ConvNet Sentiment Classifier
[Architecture diagram: input layer → embedding → parallel convolutional branches → concatenate → dense → dense → sigmoid output]
39
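The slide shows only the architecture diagram. Below is a minimal sketch of such a multi-ConvNet classifier using the Keras functional API: several parallel Conv1D branches with different kernel lengths read the same embedded review, and their pooled outputs are concatenated before the dense layers. The kernel sizes (2, 3, 4), filter counts, and dense-layer sizes are illustrative assumptions, not necessarily the book's exact values.
from tensorflow.keras.models import Model
from tensorflow.keras.layers import (Input, Embedding, SpatialDropout1D, Conv1D,
                                     GlobalMaxPooling1D, concatenate, Dense, Dropout)

input_layer = Input(shape=(400,), name='input')
embedded = Embedding(5000, 64, name='embedding')(input_layer)
embedded = SpatialDropout1D(0.2, name='drop_embed')(embedded)

# Parallel convolutional branches, each looking for patterns of a different length
branches = []
for k in (2, 3, 4):
    conv = Conv1D(256, k, activation='relu', name='conv_%d' % k)(embedded)
    branches.append(GlobalMaxPooling1D(name='maxpool_%d' % k)(conv))

concatenated = concatenate(branches)
dense = Dense(256, activation='relu')(concatenated)
dense = Dropout(0.5)(dense)
dense = Dense(64, activation='relu')(dense)
dense = Dropout(0.5)(dense)
output = Dense(1, activation='sigmoid')(dense)

model = Model(input_layer, output)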
Comparing Sentiment Classifiers
40
Chapter Summary
Preprocessing NLP data
Word Embeddings with word2vec
IMDb processing
DNN
CNN
SimpleRNN
LSTM
Bi-LSTM
Stacked Bi-LSTM
Multi-ConvNet
41