
Creating Word Embeddings: Coding the Word2Vec Algorithm in Python Using Deep Learning

Understanding the intuition behind word embedding creation with deep learning

Eligijus Bujokas

Published in Towards Data Science · 7 min read · Mar 5, 2020

While writing another article that showcased how to use word embeddings for a text classification task, I realized that I had always used pre-trained word embeddings downloaded from an external source (for example, https://nlp.stanford.edu/projects/glove/). I started wondering how to create word embeddings from scratch, and that is how this article was born. My main goal is for readers to go through this article and its code snippets and come away with an in-depth understanding of the logic behind the creation of vector representations of words.

The whole code can be found here:

https://github.com/Eligijus112/word-embedding-creation

The short version of the word embedding creation process can be summarized in the following pipeline:

Read the text -> Preprocess text -> Create (x, y) data points -> Create one-hot encoded (X, Y) matrices -> Train a neural network -> Extract the weights from the input layer

In this article, I will briefly explain every step of the way.

From Wikipedia: Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. The term word2vec literally translates to word to vector. For example,

“dad” = [0.1548, 0.4848, …, 1.864]

“mom” = [0.8785, 0.8974, …, 2.794]

The most important feature of word embeddings is that semantically similar words have a smaller distance between them (Euclidean, cosine, or other) than words that have no semantic relationship. For example, words like “mom” and “dad” should be closer together than the words “mom” and “ketchup” or “dad” and “butter”.
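As a minimal illustration of this distance idea, here is a small sketch that computes cosine similarity for a few hypothetical 2-dimensional vectors (the numbers below are made up for illustration and are not taken from any trained model):

import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors: closer to 1 means more similar."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 2-dimensional embeddings, for illustration only
mom = np.array([0.88, 2.79])
dad = np.array([0.15, 1.86])
ketchup = np.array([2.40, -0.50])

print(cosine_similarity(mom, dad))      # high similarity (~0.97)
print(cosine_similarity(mom, ketchup))  # much lower similarity (~0.10)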

Word embeddings are created using a neural network with one input layer,
one hidden layer and one output layer.


To create word embeddings, the first thing we need is text. Let us create a simple example: 12 sentences stating some well-known facts about a fictional royal family (the same sentences are collected into a Python list right after them):

The future king is the prince
Daughter is the princess
Son is the prince
Only a man can be a king
Only a woman can be a queen
The princess will be a queen
Queen and king rule the realm
The prince is a strong man
The princess is a beautiful woman
The royal family is the king and queen and their children
Prince is only a boy now
A boy will be a man
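For reference, these sentences can be stored in a plain Python list; the preprocessing and pair-creation code further below assumes they are available under the name texts:

texts = [
    "The future king is the prince",
    "Daughter is the princess",
    "Son is the prince",
    "Only a man can be a king",
    "Only a woman can be a queen",
    "The princess will be a queen",
    "Queen and king rule the realm",
    "The prince is a strong man",
    "The princess is a beautiful woman",
    "The royal family is the king and queen and their children",
    "Prince is only a boy now",
    "A boy will be a man",
]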

The computer does not understand that the words king, prince and man are closer together in a semantic sense than the words queen, princess, and daughter. All it sees are characters encoded as binary. So how do we make the computer understand the relationship between certain words? By creating X and Y matrices and using a neural network.

When creating the training matrices for word embeddings, one of the hyperparameters is the context window size (w). Its minimum value is 1, because without context the algorithm cannot work. Let us take the first sentence and assume that w = 2.

The future king is the prince

The first word, The, is the focus word, and up to 2 words to its left and 2 words to its right (because w = 2) are the so-called context words. Since The is the first word of the sentence, its only context words are to its right, so we can start building our data points:

(The, future), (The, king)

Now if we scan the whole sentence we would get:

(The, future), (The, king),
(future, the), (future, king), (future, is),
(king, the), (king, future), (king, is), (king, the),
(is, future), (is, king), (is, the), (is, prince),
(the, king), (the, is), (the, prince),
(prince, is), (prince, the)

From 6 words we are able to create 18 data points. In practice, we do some preprocessing of the text and remove stop words like is, the, a, etc. By scanning our whole text document and appending the data points, we create the initial input, which we can then transform into matrix form.

import re

def clean_text(
    string: str,
    punctuations=r'''!()-[]{};:'"\,<>./?@#$%^&*_~''',
    stop_words=['the', 'a', 'and', 'is', 'be', 'will']) -> str:
    """
    A method to clean text
    """
    # Cleaning the urls
    string = re.sub(r'https?://\S+|www\.\S+', '', string)

    # Cleaning the html elements
    string = re.sub(r'<.*?>', '', string)

    # Removing the punctuations
    for x in string.lower():
        if x in punctuations:
            string = string.replace(x, "")

    # Converting the text to lower
    string = string.lower()

    # Removing stop words
    string = ' '.join([word for word in string.split() if word not in stop_words])

    # Cleaning the whitespaces
    string = re.sub(r'\s+', ' ', string).strip()

    return string

Text preprocessing function

The full pipeline to create the (X, Y) word pairs given a list of strings texts:

# Defining the window for context
window = 2

# Creating a placeholder for the scanning of the word list
word_lists = []
all_text = []

for text in texts:

    # Cleaning the text and splitting it into a list of words
    text = clean_text(text).split()

    # Appending to the all text list
    all_text += text

    # Creating a context dictionary
    for i, word in enumerate(text):
        for w in range(window):
            # Getting the context that is ahead by *window* words
            if i + 1 + w < len(text):
                word_lists.append([word] + [text[(i + 1 + w)]])
            # Getting the context that is behind by *window* words
            if i - w - 1 >= 0:
                word_lists.append([word] + [text[(i - w - 1)]])


Creation of data points

The first entries of the created data points:

['future', 'king'],
['future', 'prince'],
['king', 'prince'],
['king', 'future'],
['prince', 'king'],
['prince', 'future'],
['daughter', 'princess'],
['princess', 'daughter'],
['son', 'prince']
...

After the initial creation of the data points, we need to assign a unique integer (often called an index) to each unique word in our vocabulary. This will be used later on when creating one-hot encoded vectors.

def create_unique_word_dict(text: list) -> dict:
    """
    A method that creates a dictionary where the keys are unique words
    and key values are indices
    """
    # Getting all the unique words from our text and sorting them alphabetically
    words = list(set(text))
    words.sort()

    # Creating the dictionary for the unique words
    unique_word_dict = {}
    for i, word in enumerate(words):
        unique_word_dict.update({
            word: i
        })

    return unique_word_dict


Creation of unique word dictionary

After using the above function on the full token list (all_text), we get the dictionary:

unique_word_dict = {
'beautiful': 0,
'boy': 1,
'can': 2,
'children': 3,
'daughter': 4,
'family': 5,
'future': 6,
'king': 7,
'man': 8,
'now': 9,
'only': 10,
'prince': 11,
'princess': 12,
'queen': 13,
'realm': 14,
'royal': 15,
'rule': 16,
'son': 17,
'strong': 18,
'their': 19,
'woman': 20
}

What we have created up to this point is still not neural-network friendly, because our data consists of (focus word, context word) pairs of strings. In order for the computer to start doing computations, we need a clever way to transform these data points into data points made up of numbers. One such clever way is the one-hot encoding technique.

One-hot encoding transforms a word into a vector made up of zeros, with the single coordinate representing that word set to 1. The vector size is equal to the number of unique words in the document. For example, let us define a simple list of strings:

a = ['blue', 'sky', 'blue', 'car']

There are 3 unique words: blue, sky and car. One hot representation for each
word:

'blue' = [1, 0, 0]
'car' = [0, 1, 0]
'sky' = [0, 0, 1]

Thus the list can be converted into a matrix:

A =
[
1, 0, 0
0, 0, 1
1, 0, 0
0, 1, 0
]
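A minimal numpy sketch of this toy example (the variable names here are for illustration only and are not part of the article's pipeline):

import numpy as np

a = ['blue', 'sky', 'blue', 'car']

# Map every unique word, sorted alphabetically, to an index
vocab = {word: i for i, word in enumerate(sorted(set(a)))}  # {'blue': 0, 'car': 1, 'sky': 2}

# One-hot encode each word in the list into a row of the matrix
A = np.zeros((len(a), len(vocab)))
for row, word in enumerate(a):
    A[row, vocab[word]] = 1

print(A)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]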

We will be creating two matrices, X and Y, with the exact same technique.
The X matrix will be created using the focus words and the Y matrix will be
created using the context words.

Recall the first three data points which we created from the texts about the royal family:

['future', 'king'],
['future', 'prince'],
['king', 'prince']

The one-hot encoded X matrix (words future, future, king) in Python would be:

[array([0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 array([0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])]

The one-hot encoded Y matrix (words king, prince, prince) in Python would be:

[array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.])]

The final sizes of these matrices will be n x m, where

n - number of created data points (pairs of focus words and context words)

m - number of unique words

import numpy as np
from tqdm import tqdm

# Defining the number of features (unique words)
n_words = len(unique_word_dict)

# Getting all the unique words
words = list(unique_word_dict.keys())

# Creating the X and Y matrices using one hot encoding
X = []
Y = []

for i, word_list in tqdm(enumerate(word_lists)):
    # Getting the indices
    main_word_index = unique_word_dict.get(word_list[0])
    context_word_index = unique_word_dict.get(word_list[1])

    # Creating the placeholders
    X_row = np.zeros(n_words)
    Y_row = np.zeros(n_words)

    # One hot encoding the main word
    X_row[main_word_index] = 1

    # One hot encoding the Y matrix words
    Y_row[context_word_index] = 1

    # Appending to the main matrices
    X.append(X_row)
    Y.append(Y_row)

# Converting the matrices into an array
X = np.asarray(X)
Y = np.asarray(Y)


Creating the X and Y matrices

We now have X and Y matrices built from the focus word and context word
pairs. The next step is to choose the embedding dimension. I will choose the
dimension to be equal to 2 in order to later plot the words and see whether
similar words form clusters.

Neural network architecture

The hidden layer dimension is the size of our word embedding. The output layer's activation function is softmax. The activation function of the hidden layer is linear. The input dimension is equal to the total number of unique words (remember, our X matrix is of dimension n x 21). Each input node will have two weights connecting it to the hidden layer. These weights are the word embeddings! After the training of the network, we extract these weights and discard all the rest; we do not necessarily care about the output.

For the training of the network, we will use keras and tensorflow:

# Deep learning:
from keras.models import Model
from keras.layers import Dense, Input

# Defining the size of the embedding
embed_size = 2

# Defining the neural network
inp = Input(shape=(X.shape[1],))
x = Dense(units=embed_size, activation='linear')(inp)
x = Dense(units=Y.shape[1], activation='softmax')(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Optimizing the network weights
model.fit(
    x=X,
    y=Y,
    batch_size=256,
    epochs=1000
)

# Obtaining the weights from the neural network.
# These are the so-called word embeddings

# The input layer
weights = model.get_weights()[0]

# Creating a dictionary to store the embeddings in. The key is a unique word and
# the value is the numeric vector
embedding_dict = {}
for word in words:
    embedding_dict.update({
        word: weights[unique_word_dict.get(word)]
    })


Training and obtaining weights
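A quick sanity check that is not part of the original snippet but follows from the setup: with 21 unique words and embed_size = 2, the weight matrix extracted from the first Dense layer should have shape (21, 2), i.e. one 2-dimensional embedding per word:

print(weights.shape)           # expected: (21, 2)
print(embedding_dict['king'])  # a 2-dimensional vector for the word 'king'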

After the training of the network, we can obtain the weights and plot the
results:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
for word in list(unique_word_dict.keys()):
    coord = embedding_dict.get(word)
    plt.scatter(coord[0], coord[1])
    plt.annotate(word, (coord[0], coord[1]))


Visualization of the embeddings

As we can see, the words ‘man’, ‘future’, ‘prince’, ‘boy’ end up in one corner of the plot and ‘daughter’, ‘woman’, ‘princess’ in another, forming clusters. All this was achieved from just 21 unique words and 12 sentences.
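To put a number on these clusters, one could rank each word's nearest neighbours by cosine similarity over embedding_dict. This helper is not part of the original article and is only a sketch:

import numpy as np

def nearest_words(query: str, embeddings: dict, top_n: int = 3) -> list:
    """Return the top_n words closest to the query word by cosine similarity."""
    q = embeddings[query]
    scores = {}
    for word, vec in embeddings.items():
        if word == query:
            continue
        scores[word] = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]

# With a well-trained model, the neighbours of 'king' should include words
# like 'prince' or 'man' rather than 'princess' or 'daughter'
print(nearest_words('king', embedding_dict))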

Often in practice, pre-trained word embeddings are used, with typical word embedding dimensions being 100, 200 or 300. I personally use the embeddings stored here: https://nlp.stanford.edu/projects/glove/.
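As an example of how such pre-trained vectors can be read, here is a minimal sketch that assumes a downloaded GloVe file such as glove.6B.100d.txt, where each line contains a word followed by its 100 floating-point coordinates:

import numpy as np

def load_glove(path: str) -> dict:
    """Load GloVe vectors into a {word: numpy vector} dictionary."""
    embeddings = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return embeddings

glove = load_glove('glove.6B.100d.txt')
print(glove['king'].shape)  # (100,)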
