Word2Vec - A Baby Step in Deep Learning But A Giant Leap Towards Natural Language Processing
Introduction
The Word2Vec model is used for learning vector representations of words,
called “word embeddings”. This is typically done as a preprocessing step,
after which the learned vectors are fed into a discriminative model
(typically an RNN) to generate predictions and perform all sorts of
interesting things.
Word Vectors
Now let’s think about how to fill in the values. We want the values to be
filled in such a way that the vector somehow represents the word and its
context, meaning, or semantics. One method is to create a co-occurrence
matrix.
A co-occurrence matrix contains, for each word, the count of how many
times it appears next to every other word in the corpus (or training
set). Let’s visualize this matrix.
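As a minimal sketch of what building this matrix could look like
(assuming the one-sentence corpus used in this example is something
like “I love NLP and I like dogs”):

```python
from collections import defaultdict

corpus = "I love NLP and I like dogs".split()  # assumed example sentence
vocab = sorted(set(corpus))

# counts[w1][w2] = number of times w2 appears directly next to w1
counts = defaultdict(lambda: defaultdict(int))
for i, word in enumerate(corpus):
    for j in (i - 1, i + 1):          # window of 1 word on each side
        if 0 <= j < len(corpus):
            counts[word][corpus[j]] += 1

# print the co-occurrence matrix row by row
print("\t" + "\t".join(vocab))
for w1 in vocab:
    print(w1 + "\t" + "\t".join(str(counts[w1][w2]) for w2 in vocab))
```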
Notice that even from this simple matrix we can gain useful insights.
For example, the words ‘love’ and ‘like’ both have counts of 1 with the
nouns (NLP and dogs). They also have counts of 1 with “I”, suggesting
that both words must be some sort of verb. With a dataset larger than a
single sentence, you can imagine that this similarity will become
clearer, as ‘like’, ‘love’, and other synonyms begin to have similar
word vectors because they are used in similar contexts.
Formal Treatment
The distinction is between count-based and predictive methods.
Count-based methods compute the statistics of how often some word co-
occurs with its neighbor words in a large text corpus, and then map
these count-statistics down to a small, dense vector for each word.
Predictive methods, by contrast, directly try to predict a word from
its neighbors in terms of learned small, dense embedding vectors, which
are treated as parameters of the model; Word2Vec belongs to this second
family.
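As one illustrative (not canonical) sketch of the count-based route, a
co-occurrence matrix can be factored with a truncated SVD, LSA-style,
to obtain a small, dense vector per word; the random matrix below is
just a stand-in for real counts:

```python
import numpy as np

# X: a (vocab_size x vocab_size) co-occurrence count matrix,
# e.g. built as in the earlier snippet. Random counts used as a stand-in.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(300, 300)).astype(float)

# Truncated SVD: keep the top k singular directions as dense word vectors.
k = 50
U, S, Vt = np.linalg.svd(X, full_matrices=False)
word_vectors = U[:, :k] * S[:k]      # one k-dimensional vector per word

print(word_vectors.shape)            # (300, 50)
```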
Our goal is to find word representations that are useful for predicting
the surrounding words given a current word. In particular, we wish to
maximize the average log probability across our entire corpus:
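One standard way to write this objective (the skip-gram formulation,
where T is the number of training words and c is the context-window
size) is:

```latex
\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-c \le j \le c \\ j \ne 0}}\log p\left(w_{t+j}\mid w_{t}\right)
```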
It is worth noting that after learning, the matrix theta can be thought of
as an embedding lookup matrix.
3. Remove the last (output layer) and keep the input and hidden layer.
4. Now, input a word from within the vocabulary. The output given at
the hidden layer is the ‘word embedding’ of the input word (see the
sketch below).
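A minimal NumPy sketch of steps 3 and 4, assuming the learned
input-to-hidden weights are stored in a matrix theta of shape
(vocab_size, embedding_dim): feeding a one-hot vector through the input
layer simply selects a row of theta, which is why theta acts as an
embedding lookup table.

```python
import numpy as np

vocab = ["the", "quick", "brown", "fox"]          # toy vocabulary (assumed)
word_to_id = {w: i for i, w in enumerate(vocab)}

embedding_dim = 8
# Stand-in for the learned input-to-hidden weight matrix (theta).
theta = np.random.randn(len(vocab), embedding_dim)

def word_embedding(word):
    # One-hot input times theta == selecting the corresponding row of theta.
    one_hot = np.zeros(len(vocab))
    one_hot[word_to_id[word]] = 1.0
    return one_hot @ theta                        # hidden-layer output

print(word_embedding("quick"))                    # the 'word embedding' of 'quick'
print(np.allclose(word_embedding("quick"), theta[word_to_id["quick"]]))  # True
```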
We first form a dataset of words and the contexts in which they appear.
For now, let’s stick to the vanilla definition and define ‘context’ as the
window of words to the left and to the right of a target word. Using a
window size of 1, we then have the dataset of (context, target) pairs.
Recall that skip-gram inverts contexts and targets, and tries to predict
each context word from its target word, so the task becomes to predict
‘the’ and ‘brown’ from ‘quick’, ‘quick’ and ‘fox’ from ‘brown’, etc.
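A small sketch of generating these (target, context) pairs with a
window size of 1, assuming the classic example sentence “the quick
brown fox jumped over the lazy dog”:

```python
sentence = "the quick brown fox jumped over the lazy dog".split()

window = 1
pairs = []  # (target, context) pairs for skip-gram
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

print(pairs[:4])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick')]
```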
The objective function is defined over the entire dataset, but we
typically optimize it with stochastic gradient descent (SGD), using one
example at a time (or a ‘minibatch’ of batch_size examples, where
typically 16 <= batch_size <= 512).
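A hedged sketch of slicing such (target, context) pairs into
minibatches for SGD; the pairs list and batch_size here are purely
illustrative, not the referenced TensorFlow code:

```python
import random

# (target, context) pairs such as those produced in the previous sketch.
pairs = [("quick", "the"), ("quick", "brown"), ("brown", "quick"),
         ("brown", "fox"), ("fox", "brown"), ("fox", "jumped")]

def minibatches(data, batch_size):
    # Shuffle once per pass, then yield fixed-size slices.
    data = list(data)
    random.shuffle(data)
    for start in range(0, len(data) - batch_size + 1, batch_size):
        yield data[start:start + batch_size]

for step, batch in enumerate(minibatches(pairs, batch_size=2)):
    print("step", step, "->", batch)   # each batch feeds one SGD update
```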
So let’s look at one step of this process. Let’s imagine that at a
given training step we observe the first training case above, where the
goal is to predict ‘the’ from ‘quick’. We select num_noise noisy
(contrastive) examples by drawing from some noise distribution,
typically the unigram distribution P(w) (the unigram model posits that
each word occurrence is independent of all other word occurrences, i.e.
we can think of the generation process as a sequence of dice rolls).
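A minimal sketch of drawing num_noise contrastive samples from a
unigram distribution estimated from corpus counts (names like
num_noise follow the prose above; the corpus is the assumed example
sentence):

```python
import numpy as np
from collections import Counter

corpus = "the quick brown fox jumped over the lazy dog".split()
counts = Counter(corpus)
vocab = sorted(counts)

# Unigram distribution P(w): probability proportional to each word's count.
probs = np.array([counts[w] for w in vocab], dtype=float)
probs /= probs.sum()

num_noise = 3
rng = np.random.default_rng(0)
# Noise (contrastive) words drawn for the training case (target='quick', context='the').
noise_words = rng.choice(vocab, size=num_noise, p=probs)
print(noise_words)
```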
Sources
1. Tensorflow implementation of Word2Vec