
Class 08

Deep Learning for Text Data


(Transfer Learning in NLP)
Dr Tran Anh Tuan
Department of Math & Computer Sciences
University of Science, HCMC

Contents
• Baseline Averaged Sentence Embeddings
• Doc2Vec
• Neural-Net Language Models
• Skip-Thought Vectors
• Quick-Thought Vectors
• InferSent
• Universal Sentence Encoder
• BERT

Transfer Learning
• Transfer learning is an exciting concept where we try to leverage prior
knowledge from one domain and task in a different domain and task. The
inspiration comes from us humans: we have an inherent ability to avoid
learning everything from scratch.
Transfer Learning
• We transfer and leverage our knowledge from what we have learnt in the past to
tackle a wide variety of tasks. In computer vision, we have excellent large
datasets available, like ImageNet, on which we get a suite of world-class,
state-of-the-art pre-trained models to leverage for transfer learning. But
what about Natural Language Processing?
Transfer Learning in NLP
• In this topic, we will showcase several state-of-the-art generic sentence
embedding encoders, which tend to give surprisingly good performance on
transfer learning tasks, especially with small amounts of data, as compared to
word embedding models. We will be covering the following models:
• Baseline Averaged Sentence Embeddings
• Doc2Vec
• Neural-Net Language Models
• Skip-Thought Vectors
• Quick-Thought Vectors
• InferSent
• Universal Sentence Encoder
• BERT
Transfer Learning in NLP
• Why do we need Embeddings?

• With regard to speech or image recognition systems, we already get information
in the form of rich, dense feature vectors embedded in high-dimensional datasets
like audio spectrograms and image pixel intensities.
• However, when it comes to raw text data, especially with count-based models like Bag
of Words, we are dealing with individual words, which may have their own
identifiers but do not capture the semantic relationships among words. This
leads to huge, sparse word vectors for textual data, and thus, if we do not have
enough data, we may end up with poor models or even overfit the data
due to the curse of dimensionality.
Transfer Learning in NLP
• Predictive methods like Neural Network based language models try to predict
a word from its neighboring words by looking at word sequences in the corpus, and in
the process they learn distributed representations, giving us dense word
embeddings.
• A neural network language model is a language model based on Neural
Networks, exploiting their ability to learn distributed representations to reduce
the impact of the curse of dimensionality.
Neural network language model
Word Embedding
• If we have a good numeric representation of text data which captures even the
context and semantics, we can use it for a wide variety of downstream real-world
tasks like sentiment analysis, text classification, clustering, summarization,
translation and so on. The fact of the matter is, machine learning or deep learning
models run on numbers, and embeddings are the key to encoding text data that
will be used by these models.
Universal Embeddings
• A big trend here has been the emergence of so-called ‘Universal Embeddings’, which are
basically pre-trained embeddings obtained by training deep learning models on
a huge corpus.
• This enables us to use these pre-trained (generic) embeddings in a wide variety of
tasks, including scenarios with constraints like a lack of adequate data. This is a
perfect example of transfer learning, leveraging prior knowledge from pre-trained
embeddings to solve a completely new task!
• The following figure showcases some recent trends in Universal Word & Sentence
Embeddings
Universal Embeddings
Trends in Word Embedding Models
• The word embedding models are perhaps some of the older and more mature models,
developed starting with Word2Vec in 2013. The three most common models,
which leverage deep learning (unsupervised approaches) to embed word vectors in a
continuous vector space based on semantic and contextual similarity, are:
• Word2Vec
• GloVe
• FastText
• These models are based on the distributional hypothesis from the
field of distributional semantics, which tells us that words which occur and
are used in the same contexts are semantically similar to one another and
have similar meanings (‘a word is characterized by the company it keeps’).
Word2Vec
• There are two different model architectures which can be leveraged by Word2Vec
to create these word embedding representations. These include,
• The Continuous Bag of Words (CBOW) Model
• The Skip-gram Model

Considering a simple sentence, "the quick brown fox jumps over the lazy dog",
this can be turned into pairs of (context_window, target_word): if we consider a
context window of size 2, we have examples like ([quick, fox], brown),
([the, brown], quick), ([the, dog], lazy) and so on. Thus the model tries to
predict the target_word based on the context_window words.
Word2Vec
• The Continuous Bag of Words (CBOW) Model

Implementing the Continuous Bag of Words (CBOW) Model
• Build the corpus vocabulary
• Build a CBOW (context, target) generator
• Build the CBOW model architecture
• Train the Model
• Get Word Embeddings

For example, if the original text was ‘in the beginning god created heaven
and earth’, which after pre-processing and removal of stopwords
becomes ‘beginning god created heaven earth’, then what we are trying to
achieve is this: given [beginning, god, heaven, earth] as the context,
predict the target center word, which is ‘created’ in this case.
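As a rough illustration of these steps (not the lecture's own implementation), here is a minimal Keras sketch of a CBOW network: the context word ids are embedded, averaged, and passed through a softmax over the vocabulary to predict the target word. The vocabulary size, embedding size and window size are assumed values.

import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size = 5000    # assumed vocabulary size
embed_size = 100     # assumed embedding dimensionality
window_size = 2      # context words on each side of the target

# CBOW: embed the 2*window_size context word ids, average the embeddings,
# and predict the target word with a softmax over the vocabulary.
cbow = models.Sequential([
    layers.Embedding(vocab_size, embed_size),
    layers.Lambda(lambda x: tf.reduce_mean(x, axis=1)),
    layers.Dense(vocab_size, activation="softmax"),
])
cbow.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# Training data would be rows of context word ids with target word ids as labels,
# e.g. context [beginning, god, heaven, earth] -> target 'created'.
# After training, cbow.layers[0].get_weights()[0] holds the word embeddings.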
Word2Vec
• The Skip-gram Model
• The Skip-gram model architecture usually tries to achieve the reverse of
what the CBOW model does. It tries to predict the source context
words (surrounding words) given a target word (the center word)
Word2Vec
• The Skip-gram Model
• Considering our simple sentence from earlier, "the quick brown fox jumps
over the lazy dog": with the CBOW model, we get pairs
of (context_window, target_word) where, for a context window
of size 2, we have examples like ([quick, fox], brown), ([the, brown],
quick), ([the, dog], lazy) and so on.
• Now considering that the skip-gram model’s aim is to predict the context
from the target word, the model typically inverts the contexts and targets,
and tries to predict each context word from its target word. Hence the task
becomes to predict the context [quick, fox] given target
word ‘brown’ or [the, brown] given target word ‘quick’ and so on. Thus the
model tries to predict the context_window words based on the
target_word.
Word2Vec
• Implementing the Skip-gram Model
The implementation will focus on five parts
• Build the corpus vocabulary
• Build a skip-gram [(target, context), relevancy] generator
• Build the skip-gram model architecture
• Train the Model
• Get Word Embeddings
Word2Vec
• Implementing the Skip-gram Model
For this, we feed our skip-gram model pairs of (X, Y) where X is
our input and Y is our label. We do this by using [(target, context), 1] pairs
as positive input samples where target is our word of interest and context is
a context word occurring near the target word and the positive label
1 indicates this is a contextually relevant pair. We also feed in [(target,
random), 0] pairs as negative input samples where target is again our word
of interest but random is just a randomly selected word from our vocabulary
which has no context or association with our target word. Hence
the negative label 0 indicates this is a contextually irrelevant pair. We do this
so that the model can then learn which pairs of words are contextually
relevant and which are not and generate similar embeddings for semantically
similar words.
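A quick, hedged sketch of such a generator, using the skipgrams helper shipped with Keras: it pairs each target with real context words (label 1) and randomly sampled negative words (label 0). The word-id sequence and vocabulary size below are made up for illustration.

from tensorflow.keras.preprocessing.sequence import skipgrams

vocab_size = 5000                       # assumed vocabulary size
sentence_ids = [12, 47, 3, 89, 256, 7]  # word ids for one tokenized sentence

pairs, labels = skipgrams(sentence_ids, vocabulary_size=vocab_size,
                          window_size=2, negative_samples=1.0)
# pairs:  [(target_id, context_id), ...]
# labels: 1 for a real (target, context) pair, 0 for a negative sample
for (target, context), label in zip(pairs[:5], labels[:5]):
    print(target, context, label)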
Word2Vec

Skip-gram Model
The GloVe Model
• The GloVe model stands for Global Vectors; it is an unsupervised learning
model which can be used to obtain dense word vectors similar to Word2Vec.
However, the technique is different and training is performed on an aggregated
global word-word co-occurrence matrix, giving us a vector space with meaningful
sub-structures. This method was developed at Stanford by Pennington et al.
• The basic methodology of the GloVe model is to first create a huge word-context
co-occurrence matrix consisting of (word, context) pairs such that each element in
this matrix represents how often a word occurs with the context (which can be a
sequence of words). The idea then is to apply matrix factorization to
approximate this matrix, as depicted in the following figure.
The GloVe Model
• Considering the Word-Context (WC) matrix, Word-Feature (WF) matrix and Feature-Context (FC)
matrix, we try to factorize WC = WF x FC, i.e., we aim to reconstruct WC from WF and FC
by multiplying them. For this, we typically initialize WF and FC with some random weights and
multiply them to get WC' (an approximation of WC), then measure how close it is to
WC. We do this multiple times using Stochastic Gradient Descent (SGD) to minimize the error.
Finally, the Word-Feature matrix (WF) gives us the word embeddings for each word, where F can
be preset to a specific number of dimensions.
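A toy numpy sketch of this factorization idea, with a random co-occurrence matrix and plain full-batch gradient descent standing in for SGD; real GloVe optimizes a weighted least-squares loss over log co-occurrence counts, so this only illustrates the WC ≈ WF x FC intuition.

import numpy as np

rng = np.random.default_rng(0)
V, C, F = 50, 50, 10                     # vocab size, context size, feature dims (assumed)
WC = rng.poisson(1.0, size=(V, C)).astype(float)   # stand-in co-occurrence counts

WF = rng.normal(scale=0.1, size=(V, F))  # Word-Feature matrix (the embeddings)
FC = rng.normal(scale=0.1, size=(F, C))  # Feature-Context matrix
lr = 0.01
for step in range(1000):
    err = WF @ FC - WC                   # WC' - WC, the reconstruction error
    WF -= lr * err @ FC.T                # gradient step on WF
    FC -= lr * WF.T @ err                # gradient step on FC

word_embeddings = WF                     # each row is a word's F-dimensional embedding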
The GloVe Model
The FastText Model
• The FastText model was first introduced by Facebook in 2016 as an extension of,
and supposedly an improvement on, the vanilla Word2Vec model.
• Overall, FastText is a framework for learning word representations and also
performing robust, fast and accurate text classification.
• The framework is open-sourced by Facebook on GitHub and claims to provide the
following:
• Recent state-of-the-art English word vectors.
• Word vectors for 157 languages trained on Wikipedia and Crawl.
• Models for language identification and various supervised tasks.
The FastText Model
• The Word2Vec model typically ignores the morphological structure of each word
and considers a word as a single entity. The FastText model instead considers each word
as a bag of character n-grams. This is also called a subword model.
• We add special boundary symbols < and > at the beginning and end of words.
This enables us to distinguish prefixes and suffixes from other character
sequences. We also include the word w itself in the set of its n-grams, to learn a
representation for each word (in addition to its character n-grams). Taking the
word where and n=3 (tri-grams) as an example, it will be represented by the
character n-grams <wh, whe, her, ere, re> and the special sequence <where>
representing the whole word. Note that the sequence <her>, corresponding to the
word her, is different from the tri-gram her from the word where.
The FastText Model
• In practice, the paper recommends extracting all the n-grams for 3 ≤ n ≤ 6.
This is a very simple approach, and different sets of n-grams could be
considered, for example taking all prefixes and suffixes. We typically associate a
vector representation (embedding) with each n-gram of a word. Thus, we can
represent a word by the sum of the vector representations of its n-grams or the
average of the embeddings of these n-grams. Due to this effect of leveraging
character n-grams from individual words, there is a higher chance
for rare words to get a good representation, since their character-based n-grams
should occur across other words of the corpus.
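A minimal sketch of this n-gram extraction step (the function name is hypothetical; in FastText the word vector is then the sum or average of the vectors of these n-grams):

def char_ngrams(word, n_min=3, n_max=6):
    token = f"<{word}>"              # add the boundary symbols < and >
    grams = {token}                  # include the whole word itself
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams

print(sorted(char_ngrams("where", n_max=3)))
# ['<wh', '<where>', 'ere', 'her', 're>', 'whe']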
Trends in Universal Sentence Embedding Models
• The concept of sentence embeddings is not very new; back
when word embeddings were first built, one of the easiest ways to build a baseline
sentence embedding model was by averaging.
• A baseline sentence embedding model can be built by just averaging the
individual word embeddings for every sentence/document (similar to bag
of words, where we lose the inherent context and sequence of words in the
sentence); a minimal sketch of this baseline follows below.
• There are more sophisticated approaches, like encoding sentences as a
linear weighted combination of their word embeddings and then
removing some of the common principal components.
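A hedged sketch of this averaging baseline; word_vectors is assumed to be any word-to-vector lookup (for example a trained Word2Vec, GloVe or FastText table), and the dimensionality is an assumption.

import numpy as np

def average_sentence_embedding(sentence, word_vectors, dim=100):
    # keep only words for which we have an embedding
    vectors = [word_vectors[w] for w in sentence.lower().split()
               if w in word_vectors]
    if not vectors:                   # no known words: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(vectors, axis=0)   # element-wise average of the word embeddings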
Trends in Universal Sentence Embedding Models
• Doc2Vec is also a very popular approach, proposed by Mikolov et al.
• Herein, they propose the Paragraph Vector, an unsupervised
algorithm that learns fixed-length feature embeddings from variable-
length pieces of text, such as sentences, paragraphs, and documents.
• In this approach, the model represents each document
by a dense vector which is trained to predict words in the document.
The only difference from Word2Vec is the paragraph or document ID, used along
with the regular word tokens to build out the embeddings. Such a
design enables this model to overcome the weaknesses of bag-of-
words models.
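A short, hedged gensim sketch of this idea; the toy corpus and hyperparameters are my own, not the lecture's.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = ["the quick brown fox jumps over the lazy dog",
          "the lazy dog sleeps all day"]
# each document gets a tag (its id), used alongside the word tokens
tagged = [TaggedDocument(words=doc.split(), tags=[i])
          for i, doc in enumerate(corpus)]

model = Doc2Vec(tagged, vector_size=50, window=2, min_count=1, epochs=40)
doc_vec = model.infer_vector("a quick fox".split())   # embed an unseen document
print(doc_vec.shape)   # (50,)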
Neural-Net Language Models (NNLM)
• The Neural-Net Language Model (NNLM) is a very early idea, based on the
neural probabilistic language model proposed by Bengio et al.
• They talk about learning a distributed representation for words which
allows each training sentence to inform the model about an
exponential number of semantically neighboring sentences. The
model learns simultaneously a distributed representation for each
word along with the probability function for word sequences,
expressed in terms of these representations. Generalization is
obtained because a sequence of words that has never been seen
before gets high probability if it is made of words that are similar (in
the sense of having a nearby representation) to words forming an
already seen sentence.
Google has built a universal sentence embedding model, nnlm-en-dim128, which is a token-based text embedding
model trained with a three-hidden-layer feed-forward Neural-Net Language Model on the English Google News
200B corpus. This model maps any body of text into 128-dimensional embeddings.
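A hedged usage sketch, loading the model from TF-Hub (the exact module handle and version are assumptions):

import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/nnlm-en-dim128/2")
embeddings = embed(["The quick brown fox.", "Transfer learning is useful."])
print(embeddings.shape)   # (2, 128): one 128-dimensional vector per input text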
Skip-Thought Vectors
• Skip-Thought Vectors were also one of the first models in the domain of
unsupervised learning-based generic sentence encoders.
• In their proposed paper, ‘Skip-Thought Vectors’, using the continuity of text from
books, they have trained an encoder-decoder model that tries to reconstruct the
surrounding sentences of an encoded passage. Sentences that share semantic
and syntactic properties are mapped to similar vector representations.
Quick Thought Vectors
• Quick Thought Vectors is a more recent unsupervised approach towards learning
sentence embeddings. Details are given in the paper ‘An efficient
framework for learning sentence representations’. Interestingly, they reformulate
the problem of predicting the context in which a sentence appears as a
classification problem, by replacing the decoder with a classifier in the regular
encoder-decoder architecture.
• Thus, given a sentence and the context in which it appears, a classifier
distinguishes context sentences from other contrastive sentences based on their
embedding representations. Given an input sentence, it is first encoded by using
some function. But instead of generating the target sentence, the model
chooses the correct target sentence from a set of candidate sentences. Viewing
generation as choosing a sentence from all possible sentences, this can be seen
as a discriminative approximation to the generation problem.
Quick Thought Vectors
InferSent
• InferSent is interestingly a supervised learning approach to learning universal
sentence embeddings using natural language inference data. This is hardcore
supervised transfer learning, where just like we get pre-trained models trained on
the ImageNet dataset for computer vision, they have universal sentence
representations trained using supervised data from the Stanford Natural
Language Inference datasets.
• The dataset used by this model is the SNLI dataset that comprises 570k human-
generated English sentence pairs, manually labeled with one of the three
categories: entailment, contradiction and neutral. It captures natural language
inference useful for understanding sentence semantics.
InferSent
• Based on the InferSent architecture, we can see that it uses a shared sentence
encoder that outputs a representation for the premise u and the hypothesis v. Once the sentence
vectors are generated, 3 matching methods are applied to extract relations between u and v:
• Concatenation (u, v)
• Element-wise product u ∗ v
• Absolute element-wise difference |u − v|
• The resulting vector is then fed into a 3-class classifier consisting of multiple fully connected
layers culminating in a softmax layer.
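A small sketch of how these matching methods are combined into the classifier input; the 4096-dimensional encoder output is an assumption based on InferSent's default BiLSTM max-pooling encoder.

import numpy as np

u = np.random.rand(4096)   # premise embedding from the shared encoder
v = np.random.rand(4096)   # hypothesis embedding from the shared encoder

# concatenation, absolute element-wise difference, element-wise product
features = np.concatenate([u, v, np.abs(u - v), u * v])
print(features.shape)      # (16384,), fed into the fully connected layers + softmax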
InferSent
Universal Sentence Encoder
• Universal Sentence Encoder from Google is one of the latest and best universal sentence
embedding models which was published in early 2018! The Universal Sentence Encoder encodes
any body of text into 512-dimensional embeddings that can be used for a wide variety of NLP
tasks including text classification, semantic similarity and clustering.
• It is trained on a variety of data sources and a variety of tasks with the aim of dynamically
accommodating a wide variety of natural language understanding tasks which require modeling
the meaning of sequences of words rather than just individual words.
Universal Sentence Encoder
• Essentially, they have two versions of their model available in TF-Hub as universal-sentence-
encoder. Version 1 makes use of the transformer-network based sentence encoding model and
Version 2 makes use of a Deep Averaging Network (DAN) where input embeddings for words and
bi-grams are first averaged together and then passed through a feed-forward deep neural
network (DNN) to produce sentence embeddings. We will be using Version 2 in our hands-on
demonstration shortly.
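A hedged usage sketch; the DAN-based module version on TF-Hub (version 4 here) is an assumption.

import tensorflow_hub as hub

use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = use(["How are you?", "What is your age?"])
print(embeddings.shape)   # (2, 512): one 512-dimensional vector per sentence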
BERT
• BERT (Bidirectional Encoder Representations from Transformers) is a
recent paper published by researchers at Google AI Language. It has
caused a stir in the Machine Learning community by presenting state-
of-the-art results in a wide variety of NLP tasks, including Question
Answering (SQuAD v1.1), Natural Language Inference (MNLI), and
others.
• BERT’s key technical innovation is applying the bidirectional training
of Transformer, a popular attention model, to language modelling.
This is in contrast to previous efforts which looked at a text sequence
either from left to right or combined left-to-right and right-to-left
training.
BERT
• BERT makes use of Transformer, an attention mechanism that learns
contextual relations between words (or sub-words) in a text. In its
vanilla form, Transformer includes two separate mechanisms — an
encoder that reads the text input and a decoder that produces a
prediction for the task.
• When training language models, there is a challenge of defining a
prediction goal. Many models predict the next word in a sequence
(e.g. “The child came home from ___”), a directional approach which
inherently limits context learning. To overcome this challenge, BERT
uses two training strategies:
BERT
• Masked LM (MLM)
• Before feeding word sequences into BERT, 15% of the words in each
sequence are replaced with a [MASK] token. The model then
attempts to predict the original value of the masked words, based on
the context provided by the other, non-masked, words in the
sequence.
• In technical terms, the prediction of the output words requires:
• Adding a classification layer on top of the encoder output.
• Multiplying the output vectors by the embedding matrix, transforming them
into the vocabulary dimension.
• Calculating the probability of each word in the vocabulary with softmax.
• The BERT loss function takes into consideration only the prediction of the masked
values and ignores the prediction of the non-masked words.
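A hedged sketch of this masked-word objective at inference time, using the Hugging Face transformers fill-mask pipeline with a pre-trained BERT; the model name is an assumption, and this is not the original pre-training code.

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
# BERT predicts the most likely words for the [MASK] position from context
for pred in fill_mask("The child came home from [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))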
BERT
• Next Sentence Prediction (NSP)
• In the BERT training process, the model receives pairs of sentences as
input and learns to predict if the second sentence in the pair is the
subsequent sentence in the original document. During training, 50%
of the inputs are a pair in which the second sentence is the
subsequent sentence in the original document, while in the other
50% a random sentence from the corpus is chosen as the second
sentence. The assumption is that the random sentence will be
disconnected from the first sentence.
BERT
• To help the model distinguish between the two sentences in training,
the input is processed in the following way before entering the
model:
• A [CLS] token is inserted at the beginning of the first sentence and a [SEP]
token is inserted at the end of each sentence.
• A sentence embedding indicating Sentence A or Sentence B is added to each
token. Sentence embeddings are similar in concept to token embeddings with
a vocabulary of 2.
• A positional embedding is added to each token to indicate its position in the
sequence. The concept and implementation of positional embedding are
presented in the Transformer paper.
To predict if the second sentence is indeed connected to the first, the following steps are performed:
1. The entire input sequence goes through the Transformer model.
2. The output of the [CLS] token is transformed into a 2×1 shaped vector, using a simple classification layer
(learned matrices of weights and biases).
3. The probability of IsNextSequence is calculated with softmax.

When training the BERT model, Masked LM and Next Sentence Prediction are trained together, with the goal of
minimizing the combined loss function of the two strategies.
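A hedged sketch of this input formatting using the Hugging Face tokenizer, showing the inserted [CLS]/[SEP] tokens and the Sentence A/B segment ids; the example sentences are made up.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("The man went to the store.", "He bought a gallon of milk.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'man', ..., '[SEP]', 'he', 'bought', ..., '[SEP]']
print(enc["token_type_ids"])   # 0s mark Sentence A tokens, 1s mark Sentence B tokens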
How to use BERT (Fine-tuning)
• BERT can be used for a wide variety of language tasks, while only adding a
small layer to the core model:
• Classification tasks such as sentiment analysis are done similarly to Next Sentence
classification, by adding a classification layer on top of the Transformer output for
the [CLS] token (see the sketch after this list).
• In Question Answering tasks (e.g. SQuAD v1.1), the software receives a question
regarding a text sequence and is required to mark the answer in the sequence. Using
BERT, a Q&A model can be trained by learning two extra vectors that mark the
beginning and the end of the answer.
• In Named Entity Recognition (NER), the software receives a text sequence and is
required to mark the various types of entities (Person, Organization, Date, etc) that
appear in the text. Using BERT, a NER model can be trained by feeding the output
vector of each token into a classification layer that predicts the NER label.
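A hedged sketch of the classification setup mentioned in the first bullet: a pre-trained BERT with a small classification head on the [CLS] output, via Hugging Face transformers. The model name and two-class setup are illustrative, and the fine-tuning loop itself is omitted.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

inputs = tokenizer("This movie was surprisingly good!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits    # shape (1, 2): one score per class
print(logits.softmax(dim=-1))          # predicted class probabilities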
THANK YOU

