Creating Word Embeddings: Coding the Word2Vec Algorithm in Python Using Deep Learning
by Eligijus Bujokas | Towards Data Science
When I was writing another article that showcased how to use word embeddings for a text classification task, I realized that I always used pre-trained word embeddings downloaded from an external source (for example, https://nlp.stanford.edu/projects/glove/). I started thinking about how to create word embeddings from scratch, and that is how this article was born. My main goal is for people to read this article together with my code snippets and get an in-depth understanding of the logic behind the creation of vector representations of words.
All the code used in this article can be found in this repository: https://github.com/Eligijus112/word-embedding-creation
The high-level workflow is: read the text -> preprocess the text -> create (x, y) data points -> create one-hot encoded (X, Y) matrices -> train a neural network -> extract the weights from the input layer.
From wiki: Word embedding is the collective name for a set of language
modeling and feature learning techniques in natural language processing
(NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. The term word2vec literally translates to word to vector. For example, the word king might be mapped to a two-dimensional vector such as [0.2, 0.9].
Word embeddings are created using a neural network with one input layer,
one hidden layer and one output layer.
To create word embeddings, the first thing that is needed is text. Let us create a simple example: 12 sentences stating some well-known facts about a fictional royal family, such as:
The royal family is the king and queen and their children
Prince is only a boy now
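In the snippets below, the corpus is assumed to be stored in a plain Python list of strings called texts; only the two sentences shown above are written out here, the full example uses 12 of them:

# The corpus as a list of raw sentences (the full example contains 12)
texts = [
    "The royal family is the king and queen and their children",
    "Prince is only a boy now",
    # ... the remaining sentences about the royal family
]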
The computer does not understand that the words king, prince and man are closer together in a semantic sense than the words queen, princess and daughter. All it sees are characters encoded in binary. So how do we make the computer understand the relationships between certain words? By creating X and Y matrices and using a neural network.
When creating the training matrices for word embeddings, one of the hyperparameters is the window size of the context (w). The minimum value is 1, because without context the algorithm cannot work. Let us take the first sentence and assume that w = 2.

The word in bold is called the focus word, and the 2 words to the left and 2 words to the right (because w = 2) are the so-called context words. So we can start building our data points:
import re

def clean_text(
    string: str,
    punctuations=r'''!()-[]{};:'"\,<>./?@#$%^&*_~''',
    stop_words=['the', 'a', 'and', 'is', 'be', 'will']) -> str:
    """
    A method to clean text
    """
    # Cleaning the urls
    string = re.sub(r'https?://\S+|www\.\S+', '', string)

    # Cleaning the html elements
    string = re.sub(r'<.*?>', '', string)

    # Removing the punctuations
    for x in string.lower():
        if x in punctuations:
            string = string.replace(x, "")

    # Converting the text to lower
    string = string.lower()

    # Removing stop words
    string = ' '.join([word for word in string.split() if word not in stop_words])

    # Cleaning the whitespaces
    string = re.sub(r'\s+', ' ', string).strip()

    return string
Text preprocessing function
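For example, running the first sentence of our corpus through the function lowercases everything and removes the stop words:

clean_text("The royal family is the king and queen and their children")
# returns: 'royal family king queen their children'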
The next step is the pipeline that creates the (focus word, context word) pairs from the list of strings texts.
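A minimal sketch of such a pipeline, assuming the clean_text function above and a window size of 2 (the helper name create_word_pairs is illustrative, not the exact code from the repository):

def create_word_pairs(texts, window=2):
    """
    For every focus word, pair it with each word that is at most
    `window` positions to the left or right in the same cleaned sentence
    """
    word_pairs = []
    for text in texts:
        words = clean_text(text).split()
        for i, focus_word in enumerate(words):
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    word_pairs.append([focus_word, words[j]])
    return word_pairs

word_pairs = create_word_pairs(texts)

Running this over the royal-family corpus yields pairs such as: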
['future', 'king'],
['future', 'prince'],
['king', 'prince'],
['king', 'future'],
['prince', 'king'],
['prince', 'future'],
['daughter', 'princess'],
['princess', 'daughter'],
['son', 'prince']
...
After the initial creation of the data points, we need to assign a unique
integer (often called index) to each unique word of our vocabulary. This will
be used further on when creating one-hot encoded vectors.
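One simple way to build such a mapping (a sketch; the helper in the repository may differ in details) is to collect every unique word from the pairs, sort them alphabetically and enumerate them:

# Every unique word that appears in the (focus, context) pairs,
# sorted alphabetically and mapped to a consecutive index
all_words = sorted({word for pair in word_pairs for word in pair})
unique_word_dict = {word: index for index, word in enumerate(all_words)}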
Building this mapping for our cleaned text gives the dictionary:
unique_word_dict = {
'beautiful': 0,
'boy': 1,
'can': 2,
'children': 3,
'daughter': 4,
'family': 5,
'future': 6,
'king': 7,
'man': 8,
'now': 9,
'only': 10,
'prince': 11,
'princess': 12,
'queen': 13,
'realm': 14,
'royal': 15,
'rule': 16,
'son': 17,
'strong': 18,
'their': 19,
'woman': 20
}
What we have created up to this point is still not neural-network friendly, because our data consists of (focus word, context word) pairs. In order for the computer to start doing computations, we need a clever way to transform these data points into numbers. One such clever way is the one-hot encoding technique.
For example, suppose a text contains 3 unique words: blue, sky and car. The one-hot representation of each word is then:
'blue' = [1, 0, 0]
'car' = [0, 1, 0]
'sky' = [0, 0, 1]
A sequence of words is then encoded by stacking the corresponding one-hot vectors as rows; for example, the sequence blue, sky, blue, car becomes the matrix:

A =
[
1, 0, 0
0, 0, 1
1, 0, 0
0, 1, 0
]
We will be creating two matrices, X and Y, with the exact same technique.
The X matrix will be created using the focus words and the Y matrix will be
created using the context words.
Recall the first three data points which we created given the texts about
royalties:
['future', 'king'],
['future', 'prince'],
['king', 'prince']
The one-hot encoded X matrix (words future, future, king) in Python would be:

[array([0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 array([0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])]
The one-hot encoded Y matrix (words king, prince, prince) in Python would be:

[array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.])]
Here n is the number of created data points (pairs of focus words and context words); both the X and Y matrices have dimension n x 21.
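A minimal sketch of how the one-hot X and Y lists might be built from the word pairs and the dictionary above (again illustrative; the full gist lives in the linked repository):

import numpy as np

n_words = len(unique_word_dict)  # 21 in our example

X, Y = [], []
for focus_word, context_word in word_pairs:
    # One-hot vector for the focus word
    x_row = np.zeros(n_words)
    x_row[unique_word_dict[focus_word]] = 1
    # One-hot vector for the context word
    y_row = np.zeros(n_words)
    y_row[unique_word_dict[context_word]] = 1
    X.append(x_row)
    Y.append(y_row)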
# Converting the matrices into an array
X = np.asarray(X)
Y = np.asarray(Y)
We now have X and Y matrices built from the focus word and context word
pairs. The next step is to choose the embedding dimension. I will choose the
dimension to be equal to 2 in order to later plot the words and see whether
similar words form clusters.
The hidden layer dimension is the size of our word embedding. The output layer's activation function is softmax, and the activation function of the hidden layer is linear. The input dimension is equal to the total number of unique words (remember, our X matrix has dimension n x 21). Each input node will have two weights connecting it to the hidden layer. These weights are the word embeddings! After the training of the network, we extract these weights and discard everything else; we do not necessarily care about the output.
For the training of the network, we will use keras and tensorflow:
# Deep learning:
from keras.models import Model
from keras.layers import Dense, Input

# Defining the size of the embedding
embed_size = 2

# Defining the neural network
inp = Input(shape=(X.shape[1],))
x = Dense(units=embed_size, activation='linear')(inp)
x = Dense(units=Y.shape[1], activation='softmax')(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Optimizing the network weights
model.fit(
    x=X,
    y=Y,
    batch_size=256,
    epochs=1000
)

# Obtaining the weights from the neural network.
# These are the so-called word embeddings

# The input layer weights
weights = model.get_weights()[0]

# Creating a dictionary to store the embeddings in. The key is a unique word and
# the value is the numeric vector
words = list(unique_word_dict.keys())
embedding_dict = {}
for word in words:
    embedding_dict.update({
        word: weights[unique_word_dict.get(word)]
    })
After the training of the network, we can obtain the weights and plot the
results:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
for word in list(unique_word_dict.keys()):
    coord = embedding_dict.get(word)
    plt.scatter(coord[0], coord[1])
    plt.annotate(word, (coord[0], coord[1]))
plt.show()
As we can see, the words 'man', 'future', 'prince' and 'boy' sit in one corner of the plot while 'daughter', 'woman' and 'princess' sit in another, forming two clusters. All this was achieved from just 21 unique words and 12 sentences.
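Besides eyeballing the plot, we can sanity-check the embeddings numerically, for example with cosine similarity (a quick illustration, not part of the original pipeline):

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Words from the same cluster should tend to score higher
# than words from different clusters
print(cosine_similarity(embedding_dict['prince'], embedding_dict['boy']))
print(cosine_similarity(embedding_dict['prince'], embedding_dict['princess']))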
Often in practice, pre-trained word embeddings are used, with typical embedding dimensions of 100, 200 or 300. I personally use the embeddings stored here: https://nlp.stanford.edu/projects/glove/.
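A GloVe file of that kind can be loaded into the same dictionary format used above; a sketch, assuming glove.6B.100d.txt has been downloaded and unzipped from the link:

import numpy as np

glove_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.split()
        # Each line is a word followed by its embedding values
        glove_embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')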