Explaining The Intuition of Word2Vec & Implementing It in Python
Table of Contents
Introduction
What is a Word Embedding?
Word2Vec Architecture
- CBOW (Continuous Bag of Words) Model
- Continuous Skip-Gram Model
Implementation
- Data
- Requirements
- Import Data
- Preprocess Data
- Embed
- PCA on Embeddings
Concluding Remarks
Resources
Introduction
There are two main architectures behind the success of word2vec: the skip-gram and CBOW architectures.
The skip-gram model is a simple neural network with one hidden layer trained to predict the probability of a given word appearing in the context of an input word. Intuitively, you can think of the skip-gram model as the opposite of the CBOW model: it takes the current word as input and tries to accurately predict the words before and after it. In other words, the model tries to learn and predict the context words around the specified input word. Experiments assessing the accuracy of this model found that increasing the range of context words improves the quality of the resulting word vectors, but it also increases the computational complexity. The process is illustrated below.
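To make the idea concrete, here is a minimal sketch (my own illustration, not code from the article) of how skip-gram (input word, context word) training pairs are generated for a toy sentence with a context window of 2:

# Illustrative sketch: building skip-gram training pairs for one sentence
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2  # number of context words taken on each side of the input word

pairs = []
for i, word in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if i != j:
            pairs.append((word, sentence[j]))

print(pairs[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]

Each pair asks the network to predict the context word on the right given the input word on the left; a larger window produces more pairs, which is where the extra computational cost comes from.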
Implementation
Data
For the purposes of this tutorial we’ll be working with the Shakespeare dataset. You can find the file I used for this tutorial here; it includes all the lines Shakespeare wrote for his plays.
Requirements
nltk==3.6.1
node2vec==0.4.3
pandas==1.2.4
matplotlib==3.3.4
gensim==4.0.1
scikit-learn==0.24.1
Note: Since we’re working with NLTK, you might need to download the following corpora for the rest of the tutorial to work. This can easily be done with the following commands:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
Import Data
Note: Change the PATH variable to the path of the data you’re working
with.
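A minimal sketch of the import step, assuming the data is the Kaggle Shakespeare_data.csv file and that the dialogue lives in a PlayerLine column (both the file name and the column name are assumptions on my part):

import pandas as pd

PATH = 'Shakespeare_data.csv'  # assumed file name; change this to the path of your copy of the data

df = pd.read_csv(PATH)
# keep only the raw text of each line; 'PlayerLine' is an assumed column name
lines = df['PlayerLine'].astype(str).tolist()
print(len(lines))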
Preprocess Data
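A minimal preprocessing sketch, assuming we lowercase each line, strip punctuation, tokenize with NLTK and drop English stopwords (the exact steps and the names clean_line, lines and sentences are my own, not the article's):

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def clean_line(line):
    # lowercase and remove anything that isn't a letter or whitespace
    line = re.sub(r'[^a-z\s]', '', line.lower())
    # tokenize and drop common English stopwords
    return [tok for tok in word_tokenize(line) if tok not in stop_words]

sentences = [clean_line(line) for line in lines]  # 'lines' comes from the import step above
sentences = [s for s in sentences if s]           # drop lines that end up empty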
Embed
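A minimal sketch of training the embeddings with gensim and querying the words closest to 'thou' (the hyperparameters below are assumptions, not necessarily the article's exact settings):

from gensim.models import Word2Vec

# train a skip-gram model (sg=1) on the tokenized lines
model = Word2Vec(
    sentences=sentences,
    vector_size=100,   # dimensionality of the word vectors
    window=5,          # context window size
    min_count=5,       # ignore words that appear fewer than 5 times
    sg=1,              # 1 = skip-gram, 0 = CBOW
)

# words whose vectors are closest to 'thou'
print(model.wv.most_similar('thou'))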
Words in the Shakespeare data which are most similar to 'thou' (Image provided by the Author)
PCA on Embeddings
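A minimal sketch of projecting the learned vectors down to two dimensions with scikit-learn's PCA and plotting them with matplotlib (plotting only the 100 most frequent words is an assumption for readability):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = model.wv.index_to_key[:100]  # the 100 most frequent words in the vocabulary
vectors = model.wv[words]            # their embedding vectors

# reduce the embedding dimensionality to 2 for visualisation
coords = PCA(n_components=2).fit_transform(vectors)

plt.figure(figsize=(10, 8))
plt.scatter(coords[:, 0], coords[:, 1], s=10)
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y), fontsize=8)
plt.title('PCA projection of Word2Vec embeddings')
plt.show()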
Words similar to each other are placed closer to one another. (Image provided by the author)