Explaining The Intuition of Word2Vec & Implementing It in Python
Table of Contents
Introduction
What is a Word Embedding?
Word2Vec Architecture
- CBOW (Continuous Bag of Words) Model
- Continuous Skip-Gram Model
Implementation
- Data
- Requirements
- Import Data
- Preprocess Data
- Embed
- PCA on Embeddings
Concluding Remarks
Resources
Introduction
There are two main architectures behind the success of word2vec: the skip-gram and CBOW architectures.
The skip-gram model is a simple neural network with one hidden layer trained to predict the probability of a given word appearing in the context of an input word. Intuitively, you can think of the skip-gram model as the opposite of the CBOW model: it takes the current word as input and tries to accurately predict the words before and after it. In other words, the model tries to learn and predict the context words around the specified input word. Experiments assessing the accuracy of this model found that increasing the range of context words improves the quality of the resulting word vectors, but it also increases the computational complexity. The process is illustrated below.
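To make the idea concrete, here is a minimal sketch (my own illustration, not code from the article) of how skip-gram (input word, context word) training pairs are generated for a toy sentence with a context window of 2:

# Illustrative sketch: building skip-gram training pairs for one sentence
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2  # number of context words taken on each side of the input word

pairs = []
for i, word in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if i != j:
            pairs.append((word, sentence[j]))

print(pairs[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]

Each pair asks the network to predict the context word on the right given the input word on the left; a larger window produces more pairs, which is where the extra computational cost comes from.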
Implementation
Data
For the purposes of this tutorial we’ll be working with the Shakespeare dataset. You can find the file I used for this tutorial here; it includes all the lines Shakespeare wrote for his plays.
Requirements
nltk==3.6.1
node2vec==0.4.3
pandas==1.2.4
matplotlib==3.3.4
gensim==4.0.1
scikit-learn==0.24.1
Note: Since we’re working with NLTK, you might need to download the following corpora for the rest of the tutorial to work. This can easily be done with the following commands:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
Import Data
Note: Change the PATH variable to the path of the data you’re working
with.
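A minimal sketch of the import step, assuming the data is the Kaggle Shakespeare_data.csv file and that the dialogue lives in a PlayerLine column (both the file name and the column name are assumptions on my part):

import pandas as pd

PATH = 'Shakespeare_data.csv'  # assumed file name; change this to the path of your copy of the data

df = pd.read_csv(PATH)
# keep only the raw text of each line; 'PlayerLine' is an assumed column name
lines = df['PlayerLine'].astype(str).tolist()
print(len(lines))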
Preprocess Data
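A minimal preprocessing sketch, assuming we lowercase each line, strip punctuation, tokenize with NLTK and drop English stopwords (the exact steps and the names clean_line, lines and sentences are my own, not the article's):

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def clean_line(line):
    # lowercase and remove anything that isn't a letter or whitespace
    line = re.sub(r'[^a-z\s]', '', line.lower())
    # tokenize and drop common English stopwords
    return [tok for tok in word_tokenize(line) if tok not in stop_words]

sentences = [clean_line(line) for line in lines]  # 'lines' comes from the import step above
sentences = [s for s in sentences if s]           # drop lines that end up empty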
Embed
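A minimal sketch of training the embeddings with gensim and querying the words closest to 'thou' (the hyperparameters below are assumptions, not necessarily the article's exact settings):

from gensim.models import Word2Vec

# train a skip-gram model (sg=1) on the tokenized lines
model = Word2Vec(
    sentences=sentences,
    vector_size=100,   # dimensionality of the word vectors
    window=5,          # context window size
    min_count=5,       # ignore words that appear fewer than 5 times
    sg=1,              # 1 = skip-gram, 0 = CBOW
)

# words whose vectors are closest to 'thou'
print(model.wv.most_similar('thou'))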
Words in the Shakespeare data which are most similar to 'thou' (Image provided by the Author)
PCA on Embeddings
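A minimal sketch of projecting the learned vectors down to two dimensions with scikit-learn's PCA and plotting them with matplotlib (plotting only the 100 most frequent words is an assumption for readability):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = model.wv.index_to_key[:100]  # the 100 most frequent words in the vocabulary
vectors = model.wv[words]            # their embedding vectors

# reduce the embedding dimensionality to 2 for visualisation
coords = PCA(n_components=2).fit_transform(vectors)

plt.figure(figsize=(10, 8))
plt.scatter(coords[:, 0], coords[:, 1], s=10)
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y), fontsize=8)
plt.title('PCA projection of Word2Vec embeddings')
plt.show()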
Words similar to each other are placed closer to one another. (Image provided by the author)