INTRODUCTION
1.1 INTRODUCTION
Distributional representations of words learned from neural language modelling (NLM) [1][2][4] have been shown to improve many NLP tasks such as machine translation [5][6], part-of-speech tagging [4], and dependency parsing [7]. The conventional method of representing a word is “one-hot vector” encoding, where the entry is ‘1’ at the word index and ‘0’ otherwise. Unfortunately, one-hot vector representation has three major drawbacks: the length of the vector increases with the vocabulary size, leading to the curse of dimensionality [1]; it does not capture the semantic and syntactic relationships between words, since the Euclidean distance between any two vectors is the same; and it induces greater data sparsity.
With the distributional representation of words produced by word2vec, the entire vocabulary can be represented with low-dimensional vectors. The Continuous Bag of Words (CBOW) and skip-gram models [2] represent words as real-valued vectors based on the past and future occurrences of words in raw text. Additionally, semantic and syntactic regularities are captured within the word embeddings [3]. For example, vector (“France”) − vector (“Italy”) + vector (“Rome”) results in a vector that is nearest to vector (“Paris”). To the best of the authors’ knowledge, none of the NLP research proposed so far for Kannada has taken word vectors into consideration. In the proposed work, the authors train a neural network model on a couple of million words, with a modest word-vector dimensionality between 100 and 350. However, the most commonly used Word2Vec model fails to consider several factors, such as the word order in the given corpus, the POS-tag relationship between words in the context, and the subword information within each word. Incorporating a combination of these known techniques into the Word2Vec model has been shown to increase training efficiency and improve the quality of the word representations. It is difficult for good representations of uncommon words to be learned with conventional word2vec. Especially in morphologically rich languages such as German, Chinese and Kannada, rare words contribute significant meaning to a sentence. The traditional word2vec model and most of its variants tend to represent rare words as (‘UNK’), losing much of the valuable information in the sentence.
1.3 OBJECTIVES
The main objective of this paper is to introduce word vectors for the Kannada language by carefully choosing a combination of known techniques that are only occasionally used together; the authors call the proposed model the Kannada-Word2Vec model (kW2V). As pre-trained word vectors are fundamental for any NLP task, the final outcome of this work is to train high-quality word vectors for Kannada and make them publicly available.
The scope of this work includes training the proposed model on a large dataset to generate quality distributed word representations for Kannada words and to make them publicly available for further research. The tricks and techniques explained in this paper would help the model fine-tune its parameters and provide a computationally efficient algorithm for very large datasets.
METHODOLOGY
Continuous Bag of Words Model (CBOW)
The CBOW architecture is similar to the continuous skip-gram model, but instead of predicting all the context words given the target word, it aims to predict the target word given the vector representations of the context words. This is similar to the feedforward neural network language model (NNLM), where the hidden non-linear activation layer is replaced by a projection layer. The model hypothesis is given below.
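In its standard form (following the CBOW formulation in [2]; the notation here is added for illustration), the target word is predicted from the average of its context vectors through a softmax over the vocabulary:

p(w_t \mid w_{t-c}, \ldots, w_{t+c}) = \frac{\exp\!\big({v'_{w_t}}^{\top} \bar{v}\big)}{\sum_{w=1}^{V} \exp\!\big({v'_{w}}^{\top} \bar{v}\big)},
\qquad
\bar{v} = \frac{1}{2c} \sum_{-c \le j \le c,\ j \ne 0} v_{w_{t+j}}

where v_w and v'_w are the input (projection) and output vectors of word w, c is the one-sided context window size, and V is the vocabulary size.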
The word frequency distribution of any language shows that a small set of words (e.g. illi, mattu, alli) occurs far more often than others. Such frequently occurring words tend to overfit the parameters of the neural network model; in other words, the model fails to capture the co-occurrence information of rarely occurring words. A standard approach to counter this biased distribution is to subsample the frequently occurring words as in (6).
Each word w_i in the vocabulary is discarded with the probability P(w_i) = 1 - \sqrt{t / f(w_i)}, as given in (6), where f(w_i) is the frequency of the word w_i in the corpus and t is a hyperparameter, typically around 10^{-5}.
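As a minimal sketch of this step (the helper names and corpus handling are assumptions, not the authors' code; the keep probability is simply the complement of the discard rule in (6)):

import math
import random

def keep_probability(count, total_count, t=1e-5):
    # Relative frequency f(w_i) of the word in the corpus.
    f = count / total_count
    # Complement of the discard probability in (6): keep with probability sqrt(t / f).
    return min(1.0, math.sqrt(t / f))

def subsample(tokens, counts, t=1e-5):
    # Randomly drop frequent tokens so that rare words receive relatively more updates.
    total = sum(counts.values())
    return [w for w in tokens
            if random.random() < keep_probability(counts[w], total, t)]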
Word2Vec treats each word as a unique element in the vector representation model, assigning a distinct representation to each word in the vocabulary while ignoring the internal structure of the word itself. Even though the Word2Vec model performs well enough for many NLP tasks across multiple languages, it fails to capture the rich information in languages with a huge vocabulary of rarely occurring words. A simple but very intuitive approach to this problem is to generate vectors for each character n-gram of a given word during the training process. In practice, the set of n-grams of a word, denoted by N_w, is restricted to n-grams of 3 to 6 characters. The word vector is then the simple addition of these n-gram vectors, denoted by z_g, with the original word itself included as a special sequence, as in (7): v_w = \sum_{g \in N_w} z_g.
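A minimal sketch of how the character n-grams could be enumerated (the boundary markers '<' and '>' and the helper name follow common subword practice and are not taken from the paper):

def char_ngrams(word, n_min=3, n_max=6):
    # Add boundary markers so prefixes and suffixes are distinguishable.
    marked = "<" + word + ">"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    # The whole word is kept as its own special sequence.
    grams.add(marked)
    return grams

# Per (7), the word vector is the sum of its n-gram vectors:
#   v_w = sum(z[g] for g in char_ngrams(w))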
The skip-gram and CBOW models are position independent. Many NLP tasks involving syntax, such as POS-tagging and dependency parsing, depend on the word order in the given text. The standard skip-gram model uses only one output embedding matrix to predict all the words in the given context. The modified version of skip-gram uses a separate output matrix for each position in the predicted context. At test time, the next word is predicted based on its position relative to the target word, selecting the word from the corresponding output vector matrix. The number of parameters to be learnt is the context size times that of the actual skip-gram model, but this approach significantly improves syntax-based tasks such as POS-tagging and dependency parsing.
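A rough numpy sketch of these position-dependent output matrices (the array names, initialisation, and window size are illustrative; the softmax/negative-sampling training loop is omitted):

import numpy as np

V, D, C = 46060, 300, 5                                     # vocabulary size, vector dimension, one-sided window
W_in = np.random.normal(scale=0.01, size=(V, D))
W_out = np.random.normal(scale=0.01, size=(2 * C, V, D))    # one output matrix per relative position

def position_index(offset, c=C):
    # Map a relative offset in {-c..-1, 1..c} to an index in {0..2c-1}.
    return offset + c if offset < 0 else offset + c - 1

def score(target_id, context_id, offset):
    # Unnormalised score of a context word appearing at the given relative position.
    return W_out[position_index(offset)][context_id] @ W_in[target_id]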
A. Training corpus
Most of the data previously used for training word vector models for the English language includes the English Gigaword, Common Crawl, and English Wikipedia datasets, which are publicly available. The authors' primary focus in this work is to generate word vectors for the Kannada language. They used a University literature corpus of 22 MB, which has 46060 unique words.
B. Raw text to Unicode format
The literary data described in the previous subsection is processed through a Perl script which converts the Kannada words from a text file into Unicode format. The Unicode data is further processed to create a vocabulary dictionary for which the word vectors are eventually generated.
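The authors used a Perl script for this conversion; as a hedged illustration only, an equivalent preprocessing step in Python (file name assumed) could normalise the text to a canonical Unicode form and count the word types:

import unicodedata
from collections import Counter

def build_vocab(path="kannada_corpus.txt"):
    vocab = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Normalise to NFC so visually identical Kannada strings share one code-point sequence.
            vocab.update(unicodedata.normalize("NFC", line).split())
    return vocab

vocab = build_vocab()
print(len(vocab), "unique words")            # 46060 for the corpus used here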
C. Experimental parameters
The dimensionality of the word vectors during neural network training is set to 300, with negative sampling of 5 words. A context window of 5 is chosen for the skip-gram model during training. Both input and output word vectors are initialized randomly from a normal distribution. The learning rate during backpropagation is set to 0.025 and later reduced to 0.01 to obtain the optimal learning curve. The neural network model is trained for 128 epochs, and the final model is chosen based on the lowest cross-entropy error during backpropagation.
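The paper does not name a training toolkit; as one possible realisation of these settings, a gensim configuration (corpus file name and min_count are assumptions) would look roughly like this:

from gensim.models import Word2Vec

sentences = [line.split() for line in open("kannada_corpus.txt", encoding="utf-8")]

model = Word2Vec(
    sentences,
    vector_size=300,     # word-vector dimensionality
    window=5,            # context window
    sg=1,                # skip-gram (sg=0 selects CBOW)
    negative=5,          # 5 negative samples per positive pair
    alpha=0.025,         # initial learning rate
    min_alpha=0.01,      # learning rate decayed towards 0.01
    epochs=128,
    min_count=1,         # assumed: keep all 46060 unique words
)
model.wv.save("kw2v.kv")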
To evaluate the quality of word vectors, the most commonly employed experiments are the word similarity task and the word analogy task. In their experiments, the authors report results for both. As none of the previous research work in Kannada NLP has tried to generate word vectors, they compared their results with the most prevalent word vector models for English, namely word2vec and GloVe.
A. Qualitative analysis
Before evaluating the models on the previously mentioned tasks, the authors plotted the 100 most frequent words in the dataset in order to provide a visual representation of the word vectors and analyse how related words are grouped together.
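The projection used for the plot is not stated; a common choice is a 2-D t-SNE projection of the 100 most frequent word vectors (a sketch reusing the vectors saved in the training example above):

import numpy as np
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors
from sklearn.manifold import TSNE

wv = KeyedVectors.load("kw2v.kv")
top_words = wv.index_to_key[:100]            # gensim keeps words sorted by frequency
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    np.array([wv[w] for w in top_words]))

plt.figure(figsize=(10, 10))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for (x, y), w in zip(coords, top_words):
    plt.annotate(w, (x, y), fontsize=8)
plt.savefig("top100_words.png")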
C. Word similarity task
Here again, due to the lack of an existing word similarity dataset for Kannada, the authors created a demo test set containing 100 word similarity questions. The results are compared with the English word similarity benchmark WordSim-353 [14], evaluated on the skip-gram and CBOW word2vec models. All the experiments are shown in Table 2.
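Such a similarity test is typically scored as the Spearman correlation between the model's cosine similarities and the human ratings; a minimal sketch, assuming a tab-separated test file (file name and format are illustrative):

from gensim.models import KeyedVectors
from scipy.stats import spearmanr

wv = KeyedVectors.load("kw2v.kv")
model_scores, human_scores = [], []
with open("kannada_wordsim_100.tsv", encoding="utf-8") as f:
    for line in f:                                           # each line: word1 <TAB> word2 <TAB> human rating
        w1, w2, rating = line.strip().split("\t")
        if w1 in wv and w2 in wv:
            model_scores.append(wv.similarity(w1, w2))       # cosine similarity of the two vectors
            human_scores.append(float(rating))

rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation: {rho:.3f}")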