08-DL-Deep Learning For Text Data (Transfer Learning in NLP)
For example, suppose the original text was ‘in the beginning god created heaven
and earth’, which after pre-processing and removal of stopwords
became ‘beginning god created heaven earth’. What we are trying to achieve is
this: given [beginning, god, heaven, earth] as the context, predict the target
center word, which is ‘created’ in this case.
Word2Vec
• The Continuous Bag of Words (CBOW) Model
Implementing the Continuous Bag of Words (CBOW) Model
Build the corpus vocabulary
Build a CBOW (context, target) generator
Build the CBOW model architecture
Train the Model
Get Word Embeddings
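Below is a minimal sketch of these five steps, assuming a TensorFlow 2.x environment where the tf.keras preprocessing utilities are available; the toy corpus, window size, embedding dimension and epoch count are illustrative assumptions rather than values from any reference implementation.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

corpus = ["beginning god created heaven earth"]   # toy pre-processed corpus (assumption)
window_size, embed_dim = 2, 50

# 1. Build the corpus vocabulary
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
word2id = tokenizer.word_index                     # word -> integer id (ids start at 1)
vocab_size = len(word2id) + 1                      # +1 for the padding id 0
sequences = tokenizer.texts_to_sequences(corpus)

# 2. Build a CBOW (context, target) generator
def cbow_pairs(seq, window):
    for i, target in enumerate(seq):
        context = seq[max(0, i - window):i] + seq[i + 1:i + window + 1]
        context += [0] * (2 * window - len(context))   # pad contexts to a fixed length
        yield context, target

X, y = zip(*[p for seq in sequences for p in cbow_pairs(seq, window_size)])
X = np.array(X)
y = tf.keras.utils.to_categorical(y, vocab_size)

# 3. Build the CBOW model architecture: average the context embeddings, predict the target
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    tf.keras.layers.Lambda(lambda x: tf.reduce_mean(x, axis=1)),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam")

# 4. Train the model (epoch count kept tiny for the sketch)
model.fit(X, y, epochs=5, verbose=0)

# 5. Get word embeddings from the learned embedding layer
word_embeddings = model.layers[0].get_weights()[0]   # shape: (vocab_size, embed_dim)
```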
Word2Vec
• The Skip-gram Model
• The Skip-gram model architecture usually tries to achieve the reverse of
what the CBOW model does. It tries to predict the source context
words (surrounding words) given a target word (the center word)
Word2Vec
• The Skip-gram Model
• Consider our simple sentence from earlier, “the quick brown fox jumps
over the lazy dog”. If we use the CBOW model with a context window of size 2,
we get (context_window, target_word) pairs such as ([quick, fox], brown),
([the, brown], quick), ([the, dog], lazy) and so on.
• Now considering that the skip-gram model’s aim is to predict the context
from the target word, the model typically inverts the contexts and targets,
and tries to predict each context word from its target word. Hence the task
becomes to predict the context [quick, fox] given target
word ‘brown’ or [the, brown] given target word ‘quick’ and so on. Thus the
model tries to predict the context_window words based on the
target_word.
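To make the inversion concrete, here is a small plain-Python sketch (window size 2 assumed) that prints both kinds of training pairs for the example sentence.

```python
# Contrast CBOW and skip-gram training pairs for the example sentence (window size 2).
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + window + 1]
    print(f"CBOW pair:      ({context}, {target!r})")            # predict target from context
    for ctx_word in context:
        print(f"Skip-gram pair: ({target!r} -> {ctx_word!r})")   # predict each context word from target
```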
Word2Vec
• Implementing the Skip-gram Model
The implementation will focus on five parts
• Build the corpus vocabulary
• Build a skip-gram [(target, context), relevancy] generator
• Build the skip-gram model architecture
• Train the Model
• Get Word Embeddings
Word2Vec
• Implementing the Skip-gram Model
For this, we feed our skip-gram model pairs of (X, Y) where X is
our input and Y is our label. We do this by using [(target, context), 1] pairs
as positive input samples where target is our word of interest and context is
a context word occurring near the target word and the positive label
1 indicates this is a contextually relevant pair. We also feed in [(target,
random), 0] pairs as negative input samples where target is again our word
of interest but random is just a randomly selected word from our vocabulary
which has no context or association with our target word. Hence
the negative label 0 indicates this is a contextually irrelevant pair. We do this
so that the model can then learn which pairs of words are contextually
relevant and which are not and generate similar embeddings for semantically
similar words.
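A sketch of such a [(target, context), relevancy] generator is shown below, using the skipgrams helper from tf.keras, which emits both positive (label 1) and negatively sampled (label 0) pairs; the toy corpus and window size are assumptions.

```python
# Sketch of a [(target, context), relevancy] generator via Keras' skipgrams helper,
# which produces positive (label 1) and negative-sampled (label 0) pairs.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import skipgrams

corpus = ["the quick brown fox jumps over the lazy dog"]   # toy corpus (assumption)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
vocab_size = len(tokenizer.word_index) + 1
id2word = {i: w for w, i in tokenizer.word_index.items()}

sequence = tokenizer.texts_to_sequences(corpus)[0]
pairs, labels = skipgrams(sequence, vocabulary_size=vocab_size,
                          window_size=2, negative_samples=1.0)

# Inspect a few generated samples, e.g. ([brown, fox], 1) or ([brown, dog], 0)
for (target, context), label in zip(pairs[:5], labels[:5]):
    print(f"([{id2word[target]}, {id2word[context]}], {label})")
```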
Word2Vec
Skip-gram Model
The GloVe Model
• The GloVe model stands for Global Vectors; it is an unsupervised learning
model that can be used to obtain dense word vectors similar to Word2Vec.
However, the technique is different: training is performed on an aggregated
global word-word co-occurrence matrix, giving us a vector space with meaningful
sub-structures. The method was developed at Stanford by Pennington et al.
• The basic methodology of the GloVe model is to first create a huge word-context
co-occurrence matrix consisting of (word, context) pairs such that each element in
this matrix represents how often a word occurs with the context (which can be a
sequence of words). The idea is then to apply matrix factorization to
approximate this matrix as depicted in the following figure.
The GloVe Model
• Considering the Word-Context (WC) matrix, Word-Feature (WF) matrix and Feature-Context (FC)
matrix, we try to factorize WC = WF x FC, i.e. we aim to reconstruct WC from WF and FC
by multiplying them. For this, we typically initialize WF and FC with some random weights,
multiply them to get WC’ (an approximation of WC), and measure how close it is to
WC. We repeat this multiple times, using Stochastic Gradient Descent (SGD) to minimize the error.
Finally, the Word-Feature matrix (WF) gives us the word embeddings for each word, where F can
be preset to a specific number of dimensions.
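The following toy numpy sketch illustrates only the factorization-by-gradient-descent idea described above; it is not the actual GloVe objective (which also uses a weighting function and bias terms), and the matrix sizes, learning rate and stand-in co-occurrence counts are arbitrary assumptions.

```python
# Toy illustration of factorizing a word-context co-occurrence matrix WC ≈ WF x FC
# with plain gradient descent (the real GloVe objective adds weights and bias terms).
import numpy as np

rng = np.random.default_rng(42)
n_words, n_contexts, n_features = 100, 100, 25                     # arbitrary sizes (assumption)

WC = rng.poisson(1.0, size=(n_words, n_contexts)).astype(float)    # stand-in co-occurrence counts
WF = rng.normal(scale=0.1, size=(n_words, n_features))             # word-feature matrix (random init)
FC = rng.normal(scale=0.1, size=(n_features, n_contexts))          # feature-context matrix (random init)

lr = 0.001
for step in range(200):
    error = WF @ FC - WC                  # how far WC' = WF x FC is from WC
    WF -= lr * error @ FC.T               # gradient step for WF
    FC -= lr * WF.T @ error               # gradient step for FC

word_embeddings = WF                      # rows of WF serve as the word embeddings
```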
The GloVe Model
The FastText Model
• The FastText model was first introduced by Facebook in 2016 as an extension of,
and supposedly an improvement on, the vanilla Word2Vec model.
• Overall, FastText is a framework for learning word representations and also
performing robust, fast and accurate text classification.
• The framework is open-sourced by Facebook on GitHub and claims to have the
following.
• Recent state-of-the-art English word vectors.
• Word vectors for 157 languages trained on Wikipedia and Crawl.
• Models for language identification and various supervised tasks.
The FastText Model
• The Word2Vec model typically ignores the morphological structure of each word
and considers a word as a single entity. The FastText model considers each word
as a bag of character n-grams. This is also called a subword model.
• We add special boundary symbols < and > at the beginning and end of words.
This enables us to distinguish prefixes and suffixes from other character
sequences. We also include the word w itself in the set of its n-grams, to learn a
representation for each word (in addition to its character n-grams). Taking the
word where and n=3 (tri-grams) as an example, it will be represented by the
character n-grams: <wh, whe, her, ere, re> and the special sequence <where>
representing the whole word. Note that the sequence <her>, corresponding to the
word her, is different from the tri-gram her from the word where.
The FastText Model
• In practice, the paper recommends extracting all the n-grams for 3 ≤ n ≤ 6.
This is a very simple approach, and different sets of n-grams could be
considered, for example taking all prefixes and suffixes. We typically associate a
vector representation (embedding) to each n-gram for a word. Thus, we can
represent a word by the sum of the vector representations of its n-grams or the
average of the embedding of these n-grams. Thus, due to this effect of leveraging
n-grams from individual words based on their characters, there is a higher chance
for rare words to get a good representation since their character based n-grams
should occur across other words of the corpus.
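A small sketch of this subword idea follows: extract the boundary-marked character n-grams (3 ≤ n ≤ 6) plus the whole-word sequence, then represent a word as the sum of their vectors. The n-gram vectors here are random stand-ins; in FastText they are learned.

```python
# Sketch of FastText-style subword handling: boundary-marked character n-grams
# plus the whole word, with the word vector formed as the sum of n-gram vectors.
import numpy as np

def char_ngrams(word, min_n=3, max_n=6):
    marked = f"<{word}>"
    grams = {marked}                                     # the special sequence for the whole word
    for n in range(min_n, max_n + 1):
        grams.update(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams

print(sorted(char_ngrams("where", 3, 3)))                # ['<wh', '<where>', 'ere', 'her', 're>', 'whe']

dim = 50
rng = np.random.default_rng(0)
ngram_vectors = {}                                       # stand-in embedding table (learned in practice)

def word_vector(word):
    grams = char_ngrams(word)
    for g in grams:                                      # lazily create a vector per n-gram
        ngram_vectors.setdefault(g, rng.normal(size=dim))
    return np.sum([ngram_vectors[g] for g in grams], axis=0)   # or the average

vec = word_vector("where")
```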
Trends in Universal Sentence Embedding Models
• The concept of sentence embeddings is not new; back when word embeddings
were first built, one of the easiest ways to create a baseline sentence
embedding model was by averaging.
• A baseline sentence embedding model can be built by simply averaging the
individual word embeddings for every sentence/document (similar in spirit to bag
of words, where we lose the inherent context and sequence of words in the
sentence).
• There are also more sophisticated approaches, like encoding sentences as a
linear weighted combination of their word embeddings and then
removing some of the common principal components.
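A minimal sketch of the averaging baseline is shown below; the word-vector lookup table is a random stand-in, whereas in practice it would come from a pre-trained Word2Vec, GloVe or FastText model.

```python
# Baseline sentence embedding by averaging word vectors (hypothetical lookup table).
import numpy as np

dim = 100
rng = np.random.default_rng(1)
word_vectors = {w: rng.normal(size=dim)
                for w in "the quick brown fox jumps over lazy dog".split()}

def average_embedding(sentence, word_vectors, dim):
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

sent_vec = average_embedding("The quick brown fox", word_vectors, dim)   # shape: (100,)
```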
Trends in Universal Sentence Embedding Models
• Doc2Vec is also a very popular approach, proposed by Mikolov et al.
• Herein, they propose the Paragraph Vector, an unsupervised
algorithm that learns fixed-length feature embeddings from variable-
length pieces of texts, such as sentences, paragraphs, and documents.
• Based on the above depiction, the model represents each document
by a dense vector which is trained to predict words in the document.
The only difference is the paragraph or document ID, which is used along
with the regular word tokens to build the embeddings. Such a
design enables this model to overcome the weaknesses of bag-of-
words models.
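As a hands-on illustration, Paragraph Vectors can be trained with gensim's Doc2Vec class; the sketch below assumes the gensim 4.x API, and the toy documents and hyperparameters are illustrative.

```python
# Sketch of Paragraph Vectors (Doc2Vec) with gensim (assumes gensim 4.x is installed).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = ["the quick brown fox jumps over the lazy dog",
        "in the beginning god created heaven and earth"]
tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

model = Doc2Vec(tagged, vector_size=50, window=2, min_count=1, epochs=40)

doc_vec = model.dv[0]                                           # learned vector for document 0
new_vec = model.infer_vector("god created the earth".split())   # embed an unseen document
```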
Neural-Net Language Models (NNLM)
• Neural-Net Language Models (NNLM) is a very early idea based on a
neural probabilistic language model proposed by Bengio et al.
• They talk about learning a distributed representation for words which
allows each training sentence to inform the model about an
exponential number of semantically neighboring sentences. The
model learns simultaneously a distributed representation for each
word along with the probability function for word sequences,
expressed in terms of these representations. Generalization is
obtained because a sequence of words that has never been seen
before gets high probability if it is made of words that are similar (in
the sense of having a nearby representation) to words forming an
already seen sentence.
Google has built a universal sentence embedding model, nnlm-en-dim128, which is a token-based text embedding
model trained with a three-hidden-layer feed-forward Neural-Net Language Model on the English Google News
200B corpus. This model maps any body of text into 128-dimensional embeddings.
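A usage sketch is shown below; the TF-Hub module handle and version are assumptions, and both tensorflow and tensorflow_hub need to be installed.

```python
# Sketch of obtaining 128-dimensional embeddings from the nnlm-en-dim128 module on TF-Hub.
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/nnlm-en-dim128/2")    # module handle assumed
embeddings = embed(["The quick brown fox.", "In the beginning."])
print(embeddings.shape)                                          # expected: (2, 128)
```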
Skip-Thought Vectors
• Skip-Thought Vectors were also one of the first models in the domain of
unsupervised learning-based generic sentence encoders.
• In the paper ‘Skip-Thought Vectors’, using the continuity of text from
books, the authors trained an encoder-decoder model that tries to reconstruct the
surrounding sentences of an encoded passage. Sentences that share semantic
and syntactic properties are mapped to similar vector representations.
Quick Thought Vectors
• Quick Thought Vectors is a more recent unsupervised approach towards learning
sentence embeddings. Details are given in the paper ‘An efficient
framework for learning sentence representations’. Interestingly, they reformulate
the problem of predicting the context in which a sentence appears as a
classification problem by replacing the decoder with a classifier in the regular
encoder-decoder architecture.
• Thus, given a sentence and the context in which it appears, a classifier
distinguishes context sentences from other contrastive sentences based on their
embedding representations. Given an input sentence, it is first encoded by using
some function. But instead of generating the target sentence, the model
chooses the correct target sentence from a set of candidate sentences. Viewing
generation as choosing a sentence from all possible sentences, this can be seen
as a discriminative approximation to the generation problem.
Quick Thought Vectors
InferSent
• InferSent is interestingly a supervised learning approach to learning universal
sentence embeddings using natural language inference data. This is hardcore
supervised transfer learning, where just like we get pre-trained models trained on
the ImageNet dataset for computer vision, they have universal sentence
representations trained using supervised data from the Stanford Natural
Language Inference datasets.
• The dataset used by this model is the SNLI dataset that comprises 570k human-
generated English sentence pairs, manually labeled with one of the three
categories: entailment, contradiction and neutral. It captures natural language
inference useful for understanding sentence semantics.
InferSent
• Based on the architecture depicted in the above figure, we can see that it uses a shared sentence
encoder that outputs a representation for the premise u and the hypothesis v. Once the sentence
vectors are generated, 3 matching methods are applied to extract relations between u and v:
• Concatenation (u, v)
• Element-wise product u ∗ v
• Absolute element-wise difference |u − v|
• The resulting vector is then fed into a 3-class classifier consisting of multiple fully connected
layers culminating in a softmax layer.
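The matching step can be sketched in PyTorch as follows; the sentence encoder itself is omitted, and the embedding size and classifier layer sizes are assumptions.

```python
# Sketch of the InferSent matching layer: combine premise u and hypothesis v,
# then feed a 3-class classifier (entailment / contradiction / neutral).
import torch
import torch.nn as nn

d = 2048                                   # sentence embedding size (assumption)
u = torch.randn(32, d)                     # batch of premise embeddings (stand-in)
v = torch.randn(32, d)                     # batch of hypothesis embeddings (stand-in)

# Concatenation (u, v), absolute element-wise difference |u - v|, element-wise product u * v
features = torch.cat([u, v, torch.abs(u - v), u * v], dim=1)

classifier = nn.Sequential(
    nn.Linear(4 * d, 512),                 # fully connected layers (sizes assumed)
    nn.ReLU(),
    nn.Linear(512, 3),                     # 3-class logits
)
logits = classifier(features)
probs = torch.softmax(logits, dim=1)
```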
InferSent
Universal Sentence Encoder
• Universal Sentence Encoder from Google is one of the latest and best universal sentence
embedding models which was published in early 2018! The Universal Sentence Encoder encodes
any body of text into 512-dimensional embeddings that can be used for a wide variety of NLP
tasks including text classification, semantic similarity and clustering.
• It is trained on a variety of data sources and a variety of tasks with the aim of dynamically
accommodating a wide variety of natural language understanding tasks which require modeling
the meaning of sequences of words rather than just individual words.
Universal Sentence Encoder
• Essentially, they have two versions of their model available in TF-Hub as universal-sentence-
encoder. Version 1 makes use of the transformer-network based sentence encoding model and
Version 2 makes use of a Deep Averaging Network (DAN) where input embeddings for words and
bi-grams are first averaged together and then passed through a feed-forward deep neural
network (DNN) to produce sentence embeddings. We will be using Version 2 in our hands-on
demonstration shortly.
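A usage sketch via TF-Hub follows; the module handle and version are assumptions.

```python
# Sketch of using the Universal Sentence Encoder from TF-Hub to get 512-dimensional embeddings.
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")   # module handle assumed
embeddings = embed(["How are you?", "What is your age?"])
print(embeddings.shape)                                                     # expected: (2, 512)
```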
BERT
• BERT (Bidirectional Encoder Representations from Transformers) is a
recent paper published by researchers at Google AI Language. It has
caused a stir in the Machine Learning community by presenting state-
of-the-art results in a wide variety of NLP tasks, including Question
Answering (SQuAD v1.1), Natural Language Inference (MNLI), and
others.
• BERT’s key technical innovation is applying the bidirectional training
of Transformer, a popular attention model, to language modelling.
This is in contrast to previous efforts which looked at a text sequence
either from left to right or combined left-to-right and right-to-left
training.
BERT
• BERT makes use of Transformer, an attention mechanism that learns
contextual relations between words (or sub-words) in a text. In its
vanilla form, Transformer includes two separate mechanisms — an
encoder that reads the text input and a decoder that produces a
prediction for the task.
• When training language models, there is a challenge of defining a
prediction goal. Many models predict the next word in a sequence
(e.g. “The child came home from ___”), a directional approach which
inherently limits context learning. To overcome this challenge, BERT
uses two training strategies:
BERT
• Masked LM (MLM)
• Before feeding word sequences into BERT, 15% of the words in each
sequence are replaced with a [MASK] token. The model then
attempts to predict the original value of the masked words, based on
the context provided by the other, non-masked, words in the
sequence.
• In technical terms, the prediction of the output words requires:
• Adding a classification layer on top of the encoder output.
• Multiplying the output vectors by the embedding matrix, transforming them
into the vocabulary dimension.
• Calculating the probability of each word in the vocabulary with softmax.
The BERT loss function takes into consideration only the prediction of the masked values and ignores the prediction of the non-masked words.
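The following toy numpy sketch mirrors this description: mask roughly 15% of the tokens, score every position against the vocabulary by multiplying the encoder output with the embedding matrix, apply softmax, and compute the loss only at the masked positions. The "encoder" here is a random stand-in, and the 80/10/10 masking refinement from the paper is omitted.

```python
# Toy illustration of the Masked LM objective (stand-in encoder, assumed dimensions).
import numpy as np

rng = np.random.default_rng(0)
vocab = ["[PAD]", "[MASK]", "the", "child", "came", "home", "from", "school"]
word2id = {w: i for i, w in enumerate(vocab)}
hidden = 16

tokens = "the child came home from school".split()
ids = np.array([word2id[t] for t in tokens])

# Replace ~15% of the tokens with [MASK].
mask_positions = rng.choice(len(ids), size=max(1, int(0.15 * len(ids))), replace=False)
masked_ids = ids.copy()
masked_ids[mask_positions] = word2id["[MASK]"]

embedding_matrix = rng.normal(size=(len(vocab), hidden))
encoder_output = embedding_matrix[masked_ids] + rng.normal(scale=0.1, size=(len(ids), hidden))  # stand-in encoder

logits = encoder_output @ embedding_matrix.T           # transform outputs to the vocabulary dimension
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # softmax over the vocabulary

# The loss only considers the prediction at the masked positions.
loss = -np.log(probs[mask_positions, ids[mask_positions]]).mean()
```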
BERT
• Next Sentence Prediction (NSP)
• In the BERT training process, the model receives pairs of sentences as
input and learns to predict if the second sentence in the pair is the
subsequent sentence in the original document. During training, 50%
of the inputs are a pair in which the second sentence is the
subsequent sentence in the original document, while in the other
50% a random sentence from the corpus is chosen as the second
sentence. The assumption is that the random sentence will be
disconnected from the first sentence.
BERT
• To help the model distinguish between the two sentences in training,
the input is processed in the following way before entering the
model:
• A [CLS] token is inserted at the beginning of the first sentence and a [SEP]
token is inserted at the end of each sentence.
• A sentence embedding indicating Sentence A or Sentence B is added to each
token. Sentence embeddings are similar in concept to token embeddings with
a vocabulary of 2.
• A positional embedding is added to each token to indicate its position in the
sequence. The concept and implementation of positional embedding are
presented in the Transformer paper
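A small sketch of this input preparation is shown below; token strings are used in place of real WordPiece ids, and the sentence pair is a toy example.

```python
# Sketch of preparing a sentence pair for NSP: [CLS]/[SEP] tokens, segment ids (A=0, B=1),
# and position ids for the positional embeddings.
sentence_a = "the man went to the store".split()
sentence_b = "he bought a gallon of milk".split()

tokens       = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
segment_ids  = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)   # Sentence A vs Sentence B
position_ids = list(range(len(tokens)))                                     # index for positional embedding

assert len(tokens) == len(segment_ids) == len(position_ids)
```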
To predict if the second sentence is indeed connected to the first, the following steps are performed:
1. The entire input sequence goes through the Transformer model.
2. The output of the [CLS] token is transformed into a 2×1 shaped vector, using a simple classification layer (learned matrices of weights and biases).
3. The probability of IsNextSequence is calculated with softmax.
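And a minimal numpy sketch of the NSP head itself, with random placeholders standing in for the learned weights and the [CLS] encoder output:

```python
# Sketch of the NSP head: [CLS] output -> 2-way classification layer -> softmax.
import numpy as np

rng = np.random.default_rng(0)
hidden = 768                                   # BERT-base hidden size
cls_output = rng.normal(size=hidden)           # stand-in for the encoder output at [CLS]

W = rng.normal(scale=0.02, size=(2, hidden))   # learned classification weights (stand-in)
b = np.zeros(2)                                # learned biases (stand-in)

logits = W @ cls_output + b                    # 2x1 vector of scores: [IsNext, NotNext]
probs = np.exp(logits) / np.exp(logits).sum()  # softmax -> P(IsNextSequence), P(NotNext)
```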