
Word2Vec — a baby step in Deep Learning but a giant leap towards Natural Language Processing

Let's demystify word2vec with reasoning, examples and mathematics
Suvro Banerjee
May 12, 2018 · 10 min read

Note from the author: I am extremely happy and overwhelmed that so
many people read and found this article useful. I am working on a
project related to a ChatBot, and Word2Vec was a part of that bigger
project. I will upload it to my publication (https://medium.com/explore-
artificial-intelligence) along with the code and many more interesting
things on AI which I am exploring as well. I don't want you to miss out on
any of these, so if you think it's applicable to you, you might want to
follow this publication. With love :)

Introduction
The Word2Vec model is used to learn vector representations of words,
called "word embeddings". This is typically done as a preprocessing step,
after which the learned vectors are fed into a discriminative model
(typically an RNN) to generate predictions and perform all sorts of
interesting things.

Why learn word embeddings


Image and audio processing systems work with rich, high-dimensional
datasets: for image data, the input is encoded as vectors of raw
pixel intensities, so all the information needed to relate entities in
the system (say, cats and dogs) is already present in the data itself.

Natural language processing systems, in contrast, have traditionally
treated words as discrete atomic symbols, so 'cat' may be represented
as Id537 and 'dog' as Id143. These encodings are arbitrary and provide
no useful information to the system about the relationships that may
exist between the individual symbols. This means that the model can
leverage very little of what it has learned about 'cats' when it is
processing data about 'dogs' (for example, that they are both animals,
four-legged, pets, etc.).

Representing words as unique, discrete ids furthermore leads to data
sparsity, and usually means that we may need more data in order to
successfully train statistical models. Using vector representations can
overcome some of these obstacles.

Let's take an example.

The traditional approach to NLP involved a lot of domain knowledge of
linguistics itself. Understanding terms such as phonemes and
morphemes was pretty standard, as there are whole linguistics classes
dedicated to their study. Let's look at how traditional NLP would try to
understand a word like "uninterested".

Say our goal is to gather some information about this word
(characterize its sentiment, find its definition, etc.). Using our domain
knowledge of language, we can break this word up into 3 parts.

We understand that the prefix "un" indicates an opposing or opposite
idea, and we know that "ed" can specify the time period (past tense) of
the word. By recognizing the meaning of the stem word "interest", we
can easily deduce the definition and sentiment of the whole word. Seems
pretty simple, right? However, when you consider all the different
prefixes and suffixes in the English language, it would take a very skilled
linguist to understand all the possible combinations and meanings.

Deep learning, at its most basic level, is all about representation
learning. Here, we're going to take the same approach by creating
representations of words through large datasets.

Word Vectors

Let's represent each word as a d-dimensional vector, and let's use d = 6.
Take a short example sentence, say "I love NLP and I like dogs". From
this sentence, we want to create a word vector for each unique word.

Now let’s think about how to fill in the values. We want the values to be
filled in such a way that the vector somehow represents the word and its
context, meaning, or semantics. One method is to create a co-occurrence
matrix.
A co-occurrence matrix is a matrix that contains the number of counts of
each word appearing next to all the other words in the corpus (or
training set). Let’s visualize this matrix.

Notice that through this simple matrix, we’re able to gain pretty useful
insights. For example, notice that the words ‘love’ and ‘like’ both contain
1’s for their counts with nouns (NLP and dogs). They also have 1’s for the
count with “I”, thus indicating that the words must be some sort of verb.
With a larger dataset than just one sentence, you can imagine that this
similarity will become clearer as 'like', 'love', and other synonyms
begin to have similar word vectors, because they are used in similar
contexts.

Now, although this is a great starting point, we notice that the
dimensionality of each word vector will grow linearly with the size of the
corpus. If we had a million words (not really a lot by NLP standards),
we'd have a million-by-million matrix, which would be extremely
sparse (lots of 0's). Definitely not the best in terms of storage efficiency.
There have been numerous advancements in finding the most optimal
ways to represent these word vectors. The most famous of which is
Word2Vec.

Formal Treatment

Vector space models (VSMs) represent (embed) words in a continuous
vector space where semantically similar words are mapped to nearby
points ('are embedded nearby each other'). VSMs have a long, rich
history in NLP, but all methods depend in some way or another on the
Distributional Hypothesis, which states that words that appear in the
same contexts share semantic meaning. The different approaches that
leverage this principle can be divided into two categories:

1. Count-based methods (e.g. Latent Semantic Analysis)

2. Predictive methods (e.g. Neural Probabilistic Language Models)

The distinction is -

Count-based methods compute the statistics of how often some word co-
occurs with its neighbor words in a large text corpus, and then map
these count-statistics down to a small, dense vector for each word.

Predictive models directly try to predict a word from its neighbors in
terms of learned small, dense embedding vectors (considered parameters
of the model).

Word2vec is a particularly computationally-efficient predictive model for
learning word embeddings from raw text. It comes in two flavors: the
Continuous Bag-of-Words model (CBOW) and the Skip-Gram model.
Algorithmically, these models are similar, except that CBOW predicts
target words from source context words, while the skip-gram does the
inverse and predicts source context words from the target words.

We will focus on the skip-gram model in the rest of the discussion.

The Mathematics at work


Neural probabilistic language models are traditionally trained using the
Maximum Likelihood principle to maximize the probability of the next
word w_t (for "target") given the previous words h (for "history") in terms
of a softmax function:
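One common way to write this softmax (a reconstruction, with V denoting the vocabulary) is

P(w_t \mid h) = \text{softmax}\big(\text{score}(w_t, h)\big) = \frac{\exp\{\text{score}(w_t, h)\}}{\sum_{w' \in V} \exp\{\text{score}(w', h)\}}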

where score(w_t, h) computes the compatibility of the target word w_t with
the context h (a dot product is commonly used).

We train this model by maximizing its log-likelihood on the training set,
i.e. by maximizing the objective below.
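Written out using the definitions above (a reconstruction of the standard form), the per-example objective is

J_{\text{ML}} = \log P(w_t \mid h) = \text{score}(w_t, h) - \log\Big(\sum_{w' \in V} \exp\{\text{score}(w', h)\}\Big)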

This yields a properly normalized probabilistic model for language
modeling.

The same argument can also be shown in a slightly different
formulation, one that clearly exposes the choice variable (the parameters)
we adjust to maximize this objective.

Our goal is to find word representations that are useful for predicting
the surrounding words given a current word. In particular, we wish to
maximize the average log probability across our entire corpus:
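In the usual skip-gram notation, with a corpus of T words w_1, \dots, w_T and a context window of size c, this objective reads

\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t; \theta)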

This equation essentially says that there is some probability p of
observing a particular word that's within a window of size c of the
current word w_t. This probability is conditioned on the current word w_t
and some setting of the parameters theta (determined by our model). We
wish to set these parameters theta so that this probability is maximized
across our entire corpus.

Basic Parametrization: Softmax Model


The basic skip-gram model defines the probability p through the softmax
function, as we saw earlier. If we consider w_i to be a one-hot
encoded vector of dimension N and theta to be an N × K embedding
matrix, where we have N words in our vocabulary and our
learned embeddings have dimension K, then we can define the following.
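One common simplified way of writing this (treating the rows of theta as the word vectors, and ignoring the separate input and output embedding matrices used in the original paper) is

p(w_{t+j} \mid w_t; \theta) = \frac{\exp\big(\theta_{w_{t+j}}^{\top} \theta_{w_t}\big)}{\sum_{w=1}^{N} \exp\big(\theta_{w}^{\top} \theta_{w_t}\big)}

where \theta_w denotes the K-dimensional row of theta corresponding to word w.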

It is worth noting that after learning, the matrix theta can be thought of
as an embedding lookup matrix.

In terms of architecture, it is a simple three-layer neural network (see the sketch after this list):

1. Take a 3-layer neural network (1 input layer + 1 hidden layer + 1 output layer).

2. Feed it a word and train it to predict its neighbouring word.

3. Remove the last (output) layer and keep the input and hidden layers.

4. Now, input a word from within the vocabulary. The output given at
the hidden layer is the 'word embedding' of the input word.
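Here is a minimal sketch of that setup in plain NumPy; the toy vocabulary size, embedding dimension and variable names are illustrative assumptions, not taken from the original post.

```python
import numpy as np

vocab_size, embed_dim = 10, 6            # N words in the vocabulary, K-dimensional embeddings
rng = np.random.default_rng(0)

W_in = rng.normal(scale=0.1, size=(vocab_size, embed_dim))    # input layer  -> hidden layer
W_out = rng.normal(scale=0.1, size=(embed_dim, vocab_size))   # hidden layer -> output layer

def forward(word_id):
    """One forward pass: one-hot input -> hidden (embedding) -> softmax over the vocabulary."""
    hidden = W_in[word_id]                # same as one_hot(word_id) @ W_in
    logits = hidden @ W_out
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return hidden, exp / exp.sum()

embedding, neighbour_probs = forward(3)
print(embedding)   # after training, this hidden activation is the word embedding of word 3
```

After training on (word, neighbour) pairs, dropping W_out and keeping W_in gives exactly the embedding lookup table described in steps 3 and 4 above.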

This parametrization has a major disadvantage that limits its usefulness
for very large corpora. Specifically, we notice that in order to
compute a single forward pass of our model, we must sum over the
entire vocabulary in order to evaluate the softmax function. This
is prohibitively expensive on large datasets, so we look to alternate
approximations of this model for the sake of computational efficiency.

Improving Computational Efficiency

For feature learning in word2vec we do not need a full probabilistic
model. The CBOW and skip-gram models are instead trained using a
binary classification objective (logistic regression) to discriminate the
real target words (w_t) from k imaginary (noise) words ~w in the same
context.

Mathematically, the objective (for each example) is to maximize
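the following quantity, written here (as a reconstruction) in the NCE-style notation of the TensorFlow tutorial:

J_{\text{NEG}} = \log Q_{\theta}(D = 1 \mid w_t, h) + k \,\mathbb{E}_{\tilde{w} \sim P_{\text{noise}}}\big[\log Q_{\theta}(D = 0 \mid \tilde{w}, h)\big]

Here Q_\theta(D = 1 \mid w, h) is the model's binary logistic-regression probability that word w is a real word observed in context h, computed from the learned embedding vectors, and the expectation over the noise distribution is approximated in practice by drawing k noise words.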

This objective is maximized when the model assigns high probabilities to
the real words and low probabilities to noise words. Technically, this is
called Negative Sampling. The updates it proposes approximate the
updates of the full softmax in the limit, but it is computationally
especially appealing because computing the loss function now scales only
with the number of noise words that we select (k) and not with all the
words in the vocabulary (V). This makes it much faster to train. Packages
like TensorFlow use a very similar loss function called noise-contrastive
estimation (NCE) loss.
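For reference, here is a minimal sketch of how this loss is usually wired up with TensorFlow's tf.nn.nce_loss; the vocabulary size, embedding dimension and variable names are illustrative assumptions.

```python
import tensorflow as tf

vocab_size, embed_dim, num_sampled = 10000, 128, 64   # num_sampled = k noise words per example

embeddings = tf.Variable(tf.random.uniform([vocab_size, embed_dim], -1.0, 1.0))
nce_weights = tf.Variable(tf.random.truncated_normal([vocab_size, embed_dim], stddev=0.1))
nce_biases = tf.Variable(tf.zeros([vocab_size]))

def batch_nce_loss(center_ids, context_ids):
    # Look up the embeddings of the centre (target) words in the batch.
    embed = tf.nn.embedding_lookup(embeddings, center_ids)
    # NCE loss: discriminate the true context words from num_sampled sampled noise words.
    return tf.reduce_mean(
        tf.nn.nce_loss(
            weights=nce_weights,
            biases=nce_biases,
            labels=tf.reshape(context_ids, [-1, 1]),
            inputs=embed,
            num_sampled=num_sampled,
            num_classes=vocab_size,
        )
    )
```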

Intuitive Feel of Skip-gram model


As an example, let’s consider the dataset-

the quick brown fox jumped over the lazy dog

We first form a dataset of words and the contexts in which they appear.
For now, let’s stick to the vanilla definition and define ‘context’ as the
window of words to the left and to the right of a target word. Using a
window size of 1, we then have the dataset of (context, target) pairs.

([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox), ...

Recall that skip-gram inverts contexts and targets, and tries to predict
each context word from its target word, so the task becomes to predict
‘the’ and ‘brown’ from ‘quick’, ‘quick’ and ‘fox’ from ‘brown’, etc.

Therefore our dataset becomes of (input, output) pairs as -

(quick, the), (quick, brown), (brown, quick), (brown, fox), ...
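Generating these (input, output) pairs takes only a few lines of Python; the window size of 1 matches the example above.

```python
corpus = "the quick brown fox jumped over the lazy dog".split()
window = 1

pairs = []
for i, target in enumerate(corpus):
    # Every word within `window` positions of the target becomes an output word.
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((target, corpus[j]))

print(pairs[:4])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick')]
```

(The pairs listed in the text start from 'quick' because the edge words have incomplete context windows; this sketch simply keeps them, so ('the', 'quick') also appears.)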

The objective function is defined over the entire dataset, but we typically
optimize it with stochastic gradient descent (SGD) using one
example at a time (or a 'minibatch' of batch_size examples, where
typically 16 <= batch_size <= 512). So let's look at one step of this
process.

Let's imagine that at training step t we observe the first training case
above, where the goal is to predict 'the' from 'quick'. We select
num_noise noisy (contrastive) examples by drawing from some noise
distribution P(w), typically the unigram distribution. (The unigram model
posits that each word occurrence is independent of all other word
occurrences, i.e. we can think of the generation process as a sequence of
dice rolls.)

For simplicity let's say num_noise=1 and we select 'sheep' as the noise
example. Next we compute the loss for this pair of observed and noisy
examples, i.e. the objective at time step t becomes:
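With the reconstructed notation from earlier, this single-example objective is

J^{(t)}_{\text{NEG}} = \log Q_{\theta}(D = 1 \mid \text{the}, \text{quick}) + \log Q_{\theta}(D = 0 \mid \text{sheep}, \text{quick})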

The goal is to make an update to the embedding parameters theta to
maximize this objective function. We do this by deriving the gradient of
the loss with respect to the embedding parameters theta.
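In symbols, the quantity we need is

\frac{\partial}{\partial \theta} J^{(t)}_{\text{NEG}}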

We then perform an update to the embeddings by taking a small step in
the direction of the gradient. When this process is repeated over the
entire training set, it has the effect of 'moving' the embedding vectors
around for each word until the model is successful at discriminating real
words from noise words.

We can visualize the learned vectors by projecting them down to 2
dimensions. When we inspect these visualizations, it becomes apparent
that the vectors capture some general, and in fact quite useful, semantic
information about words and their relationships to one another.
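A common way to do this projection is t-SNE. Here is a minimal sketch with scikit-learn, assuming `embeddings` is an N x K NumPy array of learned vectors and `words` is the matching list of tokens (both names are placeholders):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project the K-dimensional embeddings down to 2 dimensions for plotting.
points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

plt.figure(figsize=(10, 10))
plt.scatter(points[:, 0], points[:, 1], s=5)
for (x, y), word in zip(points, words):
    plt.annotate(word, (x, y), fontsize=8)
plt.show()
```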
My Next Blog
An Introduction to Recurrent Neural Networks

Sources
1. TensorFlow implementation of Word2Vec

2. Distributed Representations of Words and Phrases and their
Compositionality — a research paper by Tomas Mikolov, Ilya
Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean

3. A practical guide to word2vec by Aneesh Joshi

4. Natural Language Processing Review by Adit Deshpande

5. Language Model by Rohan Verma

