Unit 5b - Natural Language Processing

Natural Language Processing
Chapter 11 from the Deep Learning Illustrated book

1
Preprocessing NLP Data
 Tokenization: word_tokenize, sent_tokenize
 Converting all characters to lowercase
 Removing stop words: a, an, the, of, at, …
 Removing punctuation
 Stemming
 Handling n-grams
Whether a given preprocessing step is worthwhile can be investigated empirically: incorporate the step and observe whether it improves the accuracy of the downstream deep learning model.
2
Preprocessing NLP Data
 Most of these preprocessing steps are available in the nltk (Natural Language Toolkit) and gensim libraries.
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import gutenberg
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string
import gensim
from gensim.models.phrases import Phraser, Phrases
from gensim.models.word2vec import Word2Vec
# first-time use: nltk.download('gutenberg'), nltk.download('punkt'), nltk.download('stopwords')

3
Preprocessing NLP Data
gutenberg.fileids()               # the 18 texts in nltk's Project Gutenberg corpus
len(gutenberg.fileids())
>>> 18
len(gutenberg.words())
>>> 2621613
gberg_sents = gutenberg.sents()   # corpus as a list of tokenized sentences

stpwrds = stopwords.words('english') + list(string.punctuation)

# lowercase a sentence and drop stop words and punctuation
[w.lower() for w in gberg_sents[4] if w.lower() not in stpwrds]

# additionally stem each remaining word
stemmer = PorterStemmer()
[stemmer.stem(w.lower()) for w in gberg_sents[4]
 if w.lower() not in stpwrds]

4
Preprocessing NLP Data
 Bigrams
phrases = Phrases(gberg_sents) # train detector
bigram = Phraser(phrases)
bigram.phrasegrams
>>> {'two_daughters': (19, 11.966813731181547),
>>> 'her_sister': (195, 17.7960829227865),
>>> 'more_than': (541, 29.023584433996874),
>>> 'had_been': (1256, 22.306024648925288),
>>> 'Miss_Taylor': (48, 453.75918026073305),
>>> ...

5
Preprocessing NLP Data
lower_sents = []
for s in gberg_sents:
    lower_sents.append([w.lower() for w in s
                        if w.lower() not in list(string.punctuation)])

lower_bigram = Phraser(Phrases(lower_sents))
lower_bigram["jon lives in new york city".split()]
>>> ['jon', 'lives', 'in', 'new_york', 'city']

# higher min_count and threshold make bigram detection more conservative
lower_bigram = Phraser(Phrases(lower_sents, min_count=32, threshold=64))

clean_sents = []
for s in lower_sents:
    clean_sents.append(lower_bigram[s])

6
Word Vectors
 Vector representations of words are the
information-dense alternative to one-hot
encodings of words.
 Whereas one-hot representations capture
information about word location only, word
vectors (also known as word embeddings or
vector-space embeddings) capture information
about word meaning as well as location.
 This additional information enables deep learning
NLP models to automatically learn linguistic
features.
7
Word Vectors
 The overarching concept is to assign each word
within a corpus to a particular, meaningful
location within a multidimensional space called
the vector space.
 Initially, each word is assigned to a random
location.
 By considering the words that tend to be used
around a given word within our corpus, the words
can gradually be shifted into locations that
represent the meaning of the words.

8
Word Vectors
 Commencing at the first word in our corpus
and moving to the right one word at a time
until we reach the final word in our corpus, we
consider each word to be the target word.
 Two of the most popular techniques for
converting natural language into word vectors
are word2vec and GloVe.

“It was a bright cold day in April, and the clocks were striking”
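 To make the target/context idea concrete, here is a minimal sketch (not from the book) that enumerates skip-gram (target, context) training pairs for the sentence above; the window size of 2 is an assumption for illustration.
sentence = "it was a bright cold day in april".split()
window = 2

pairs = []
for i, target in enumerate(sentence):
    # every word within `window` positions of the target becomes a context word
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

print(pairs[:6])
>>> [('it', 'was'), ('it', 'a'), ('was', 'it'), ('was', 'a'), ('was', 'bright'), ('a', 'it')]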
9
Word Vectors

10
Word Vectors
 The closer two words are within vector space, the
closer their meaning, as determined by the
similarity of the context words appearing near
them in natural language.
 Synonyms and common misspellings of a given
word would be expected to have nearly identical
context words and therefore nearly identical
locations in vector space.
 Words that are used in similar contexts, such as
those that denote days of the week, tend to occur
near each other in vector space.
11
Word-Vector Arithmetic
 Remarkably, because particular directions of movement
across the vector space turn out to be an efficient
way to store information about words, these
movements come to represent relative meanings
between words.

V_king – V_man + V_woman = ?

V_bezos – V_amazon + V_tesla = ?
V_windows – V_microsoft + V_google = ?
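 With a trained gensim model, such analogies can be queried directly. A minimal sketch (assuming 'king', 'man', and 'woman' all appear in the model's vocabulary; the top result would be expected to be a word like 'queen'):
model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)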

12
Word Embeddings with word2vec
Architecture     Predicts                          Relative strengths
Skip-gram (SG)   Context words given target word   Better for a smaller corpus; represents rare words well
CBOW             Target word given context words   Multiple times faster; represents frequent words slightly better

Comparison of word2vec architectures

 In practice, the SG architecture is the better choice when
we're working with a small corpus; it represents rare
words in word-vector space well.
 In contrast, CBOW is much more computationally
efficient, so it is the better option when you're working
with a very large corpus.
13
Word Embeddings with word2vec
 A major alternative to word2vec is GloVe.
 GloVe and word2vec differ in their underlying
methodology: word2vec uses predictive models,
while GloVe is count based.
 Performance-wise, the two are quite similar.
 One potential advantage of GloVe is that it was
designed to be parallelized over multiple
processors or even multiple machines, so it
might be a good option if you’re looking to create
a word-vector space with many unique words
and a very large corpus.
14
Running word2vec
 Just a single line of code will do.
model = Word2Vec(sentences=clean_sents, size=64, sg=1,
                 window=10, iter=5, min_count=10, workers=4)
 sentences: the list of tokenized (and cleaned) sentences
 size: number of dimensions of the word-vector space
 sg: 1 selects the skip-gram architecture; 0 selects CBOW
 window: number of context words considered on either side of the target word
 iter: number of training passes (epochs) over the corpus
 min_count: minimum corpus frequency for a word to be included in the space
 workers: number of processing cores to use
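 Note: the call above uses pre-4.0 gensim parameter names. If you are on gensim 4.x, a minimal equivalent sketch:
# gensim 4.x renamed size -> vector_size and iter -> epochs
model = Word2Vec(sentences=clean_sents, vector_size=64, sg=1,
                 window=10, epochs=5, min_count=10, workers=4)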
15
Evaluating Word Vectors
 There are two broad perspectives to consider when
evaluating the quality of word vectors: intrinsic
and extrinsic evaluations.
 Extrinsic evaluations assess the performance of
the word vectors within the downstream NLP
application of interest, e.g., a sentiment-analysis
classifier.
 In contrast, intrinsic evaluations assess the word
vectors not on the final NLP application, but on
intermediate measures such as word similarity or
vector arithmetic.
16
Evaluating Word Vectors
model.wv.most_similar('father', topn=3)
>>> [('mother', 0.8257375359535217),
>>>  ('brother', 0.7275018692016602),
>>>  ('sister', 0.7177823781967163)]
model.wv.doesnt_match("mother father sister brother dog".split())
>>> 'dog'
model.wv.most_similar(positive=['father', 'woman'], negative=['man'])
>>> top result: 'mother'
model.wv.most_similar(positive=['husband', 'woman'], negative=['man'])
>>> ????

17
Plotting Word Vectors
 Human brains are not well suited to
visualizing anything in greater than three
dimensions.
 We can use dimensionality-reduction techniques
to approximately map word locations from the
high-dimensional word-vector space down to two
or three dimensions.
 A common choice is t-distributed stochastic neighbor
embedding (t-SNE); see the sketch below.
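 A minimal sketch (not from the book) of projecting some of the learned word vectors to 2-D with scikit-learn's t-SNE, assuming a gensim 3.x model where the vocabulary list is model.wv.index2word:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = model.wv.index2word[:200]        # first 200 vocabulary words (gensim 3.x attribute)
coords = TSNE(n_components=2, random_state=42).fit_transform(model.wv[words])

plt.figure(figsize=(10, 10))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y), fontsize=8)   # label each point with its word
plt.show()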

18
IMDb processed with DNN
 We load the dataset:
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences

(x_train, y_train), (x_valid, y_valid) = imdb.load_data(num_words=5000, skip_top=50)

 We standardize the length of the reviews:
x_train = pad_sequences(x_train, maxlen=100, padding='pre', truncating='pre', value=0)
x_valid = pad_sequences(x_valid, maxlen=100, padding='pre', truncating='pre', value=0)

 E.g., decoding x_train[5] back into words now gives us:


'PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD
PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD
PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD
PAD PAD PAD PAD PAD PAD UNK begins better than UNK ends funny UNK UNK russian
UNK crew UNK UNK other actors UNK UNK those scenes where documentary shots UNK
UNK spoiler part UNK message UNK UNK contrary UNK UNK whole story UNK UNK does
UNK UNK UNK UNK'
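 For reference, a minimal sketch (not from the book) of how an integer-encoded review can be decoded back into words, assuming the standard Keras IMDb index offset of 3 with 0 = PAD, 1 = START, 2 = UNK:
word_index = imdb.get_word_index()
word_index = {word: index + 3 for word, index in word_index.items()}  # shift by the reserved offset
for token, index in (('PAD', 0), ('START', 1), ('UNK', 2)):
    word_index[token] = index
index_word = {index: word for word, index in word_index.items()}

print(' '.join(index_word.get(i, 'UNK') for i in x_train[5]))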

19
IMDb processed with DNN
 Let us create a baseline ANN with a word-embedding
layer, flatten, dense, and dropout layers, and an
output layer.
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, Dropout

model = Sequential()
model.add(Embedding(5000, 64, input_length=100))  # n_unique_words = 5000 and 64 dimensions
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
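 A minimal sketch of compiling and training the model; the optimizer, batch size, and epoch count here are illustrative assumptions, not necessarily the book's exact settings.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=128, epochs=4,
          validation_data=(x_valid, y_valid))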

20
IMDb processed with DNN
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 100, 64) 320000
_________________________________________________________________
flatten_1 (Flatten) (None, 6400) 0
_________________________________________________________________
dense_1 (Dense) (None, 64) 409664
_________________________________________________________________
dropout_1 (Dropout) (None, 64) 0
_________________________________________________________________
dense_2 (Dense) (None, 1) 65
=================================================================
Total params: 729,729
Trainable params: 729,729
Non-trainable params: 0
_________________________________________________________________
21
IMDb processed with DNN
 This model achieved a validation accuracy of
84.5%.
 The area under the receiver operating
characteristic (ROC) curve is fairly high at 92.9%.
 Our dense classifier is not specialized to detect
patterns of multiple tokens occurring in a
sequence that might predict film-review
sentiment.
 E.g., the token pair "not good" is predictive of
negative sentiment.

22
IMDb processed with a CNN
 Convolutional layers are particularly adept
at detecting spatial patterns.
 Let us use them to detect spatial patterns
among words, like the "not good" sequence.

23
IMDb processed with a CNN
from keras.layers import SpatialDropout1D, Conv1D, GlobalMaxPooling1D

model = Sequential()
model.add(Embedding(5000, 64, input_length=400))
model.add(SpatialDropout1D(0.2))
model.add(Conv1D(256, 3, activation='relu'))    # 256 filters, kernel size 3
# model.add(Conv1D(256, 3, activation='relu'))  # optional second convolutional layer
model.add(GlobalMaxPooling1D())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

24
IMDb processed with a CNN
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 400, 64) 320000
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 400, 64) 0
_________________________________________________________________
conv1d_1 (Conv1D) (None, 398, 256) 49408
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 256) 0
_________________________________________________________________
dense_1 (Dense) (None, 256) 65792
_________________________________________________________________
dropout_1 (Dropout) (None, 256) 0
_________________________________________________________________
dense_2 (Dense) (None, 1) 257
=================================================================
Total params: 435,457
25
IMDb processed with a CNN
 This CNN model achieved a better validation
accuracy of 89.6%.
 The area under the ROC curve is also higher
than the DNN's, at 96.2%.
 Our ConvNet classifier outperformed our dense
net, perhaps in large part because its
convolutional layer is adept at learning patterns
of words that predict some outcome, such as
whether a film review is favorable or unfavorable.

26
IMDb processed with an RNN
 The filters within convolutional layers tend to
excel at learning short sequences like triplets of
words (k = 3), but a document of natural
language like a movie review might contain
much longer sequences of words that, when
considered all together, would enable the model
to accurately predict some outcome.
 To handle long sequences of data like this, there
exists a family of deep learning models called
recurrent neural networks (RNNs).

27
IMDb processed with an RNN
 Let us build a model with one RNN layer.

from keras.layers import SimpleRNN

model = Sequential()
model.add(Embedding(10000, 64, input_length=100))
model.add(SpatialDropout1D(0.2))
model.add(SimpleRNN(256, dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

28
IMDb processed with an RNN
 The results of running this model were not
encouraging.
 We found that the training loss, after going
down steadily over the first half-dozen epochs,
began to jump around after that.
 This indicates that the model is struggling to
learn patterns even within the training data.
 We got a validation accuracy of 77.6
percent and an ROC AUC of 84.9 percent.

29
IMDb processed with an RNN
 The main reason for the worst results so far is
that simple RNNs can only backpropagate
through ~10 time steps before the gradient
diminishes so much that parameter updates
become negligibly small.
 Because of this, simple RNNs are rarely used in
practice: More-sophisticated recurrent layer types
like LSTMs, which can backpropagate through
~100 time steps, are far more common.
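 A minimal numeric sketch of why the gradient vanishes: if each backward step through time scales the gradient by a factor below 1 (0.8 here is an illustrative assumption), the gradient shrinks geometrically with the number of time steps.
factor = 0.8
print(factor ** 10)    # ≈ 0.11 after 10 time steps
print(factor ** 100)   # ≈ 2e-10 after 100 time steps: effectively no learning signal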

30
IMDb processed with an LSTM
 Let us build a model with one LSTM layer.

from keras.layers import LSTM

model = Sequential()
model.add(Embedding(10000, 64, input_length=100))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(256, dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

31
IMDb processed with an LSTM
 Training loss decreased steadily epoch
over epoch, suggesting that model-fitting
proceeded more conventionally than with
our simple RNN.
 The results are not great, however.
 Despite its relative sophistication, our LSTM
performed only as well as our baseline dense
model.
 We got a validation accuracy of 84.8 percent
and an ROC AUC of 92.8 percent.
32
IMDb processed with a Bi-LSTM
 Bidirectional LSTMs (Bi-LSTMs) are a
clever variation on standard LSTMs.
 Whereas a standard LSTM processes the input
sequence in only one direction, a Bi-LSTM
processes it in both directions (forward and
backward over timesteps).
 This roughly doubles the computational cost,
but if accuracy is paramount to your application,
it is often worth the trouble.
33
IMDb processed with a Bi-LSTM
 Let us build a model with a Bi-LSTM layer.

from keras.layers import LSTM
from keras.layers.wrappers import Bidirectional  # in newer Keras: from keras.layers import Bidirectional

model = Sequential()
model.add(Embedding(10000, 64, input_length=200))
model.add(SpatialDropout1D(0.2))
model.add(Bidirectional(LSTM(256, dropout=0.2)))
model.add(Dense(1, activation='sigmoid'))
34
IMDb processed with a Bi-LSTM
 The straightforward conversion from LSTM
to Bi-LSTM yielded substantial performance gains:
we got a validation accuracy of 86.0 percent
and an ROC AUC of 93.5 percent, making it
our second-best model so far, trailing only
our convolutional architecture.
 Stacking additional layers can improve the
performance further, as the next slide shows.

35
IMDb with stacked Bi-LSTMs
 Let us build a model with two Bi-LSTM
layers (stacked over one another).

model = Sequential()
model.add(Embedding(10000, 64, input_length=200))
model.add(SpatialDropout1D(0.2))
model.add(Bidirectional(LSTM(64, dropout=0.2, return_sequences=True)))  # pass the full sequence to the next recurrent layer
model.add(Bidirectional(LSTM(64, dropout=0.2)))
model.add(Dense(1, activation='sigmoid'))

36
IMDb with stacked Bi-LSTMs
 The stacked Bi-LSTM outperformed its
unstacked cousin by a noteworthy margin,
with a validation accuracy of 87.8 percent
and an ROC AUC of 94.9%.
 It still lags behind the accuracy of the CNN model.
 This is perhaps due to the relatively small size
of the IMDb dataset.
 Larger datasets might facilitate more
effective backpropagation over the many
timesteps associated with LSTM layers.
37
Transfer Learning in NLP
 ULMFiT (universal language model fine-tuning)
 ELMo (embeddings from language models)
 BERT (bidirectional encoder representations
from transformers), from Google
 Pretrained BERT models fine-tuned to particular
NLP tasks have set new state-of-the-art
benchmarks across a broad range of
applications, while requiring much less
training time and less data to get there.

38
Multi ConvNet Sentiment Classifier
Input layer
Embedding
Conv (k=2) | Conv (k=3) | Conv (k=4)   (three parallel branches)
Max Pool   | Max Pool   | Max Pool
Concatenate
Dense
Dense
Sigmoid output
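 A minimal sketch of this architecture using the Keras functional API; the layer sizes, dropout rates, and sequence length below are assumptions for illustration, not necessarily the book's exact hyperparameters.
from keras.models import Model
from keras.layers import (Input, Embedding, SpatialDropout1D, Conv1D,
                          GlobalMaxPooling1D, concatenate, Dense, Dropout)

input_layer = Input(shape=(400,), dtype='int16')
emb = Embedding(5000, 64)(input_layer)
emb = SpatialDropout1D(0.2)(emb)

# three parallel convolutional branches with filter lengths 2, 3, and 4
branches = []
for k in (2, 3, 4):
    conv = Conv1D(256, k, activation='relu')(emb)
    branches.append(GlobalMaxPooling1D()(conv))

merged = concatenate(branches)
dense = Dense(256, activation='relu')(merged)
dense = Dropout(0.5)(dense)
dense = Dense(64, activation='relu')(dense)
output = Dense(1, activation='sigmoid')(dense)

model = Model(input_layer, output)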
39
Comparing Sentiment Classifiers
Model              ROC AUC (%)
Dense              92.9
Convolutional      96.1
SimpleRNN          84.9
LSTM               92.8
Bi-LSTM            93.5
Stacked Bi-LSTM    94.9
GRU                93.0
Conv-LSTM          94.5
Multi-ConvNet      96.2

40
Chapter Summary
 Preprocessing NLP data
 Word embeddings with word2vec
 IMDb sentiment classification with:
   DNN
   CNN
   SimpleRNN
   LSTM
   Bi-LSTM
   Stacked Bi-LSTM
   Multi-ConvNet

41
