
CHAPTER 1

INTRODUCTION

1.1 INTRODUCTION

Distributional representations of words learned from neural language modelling (NLM) [1][2][4] have been shown to improve many NLP tasks such as machine translation [6][7], part-of-speech tagging [4] and dependency parsing [8]. The conventional way of representing a word is "one-hot" encoding, where the entry is '1' at the word's index and '0' everywhere else. Unfortunately, the one-hot representation has three major drawbacks: the length of the vector grows with the vocabulary size, leading to the curse of dimensionality [1]; it does not capture semantic or syntactic relationships between words, since the Euclidean distance between any two vectors is the same; and it induces greater data sparsity.
With the distributional representation of words provided by word2vec, the entire vocabulary can be represented as low-dimensional vectors. The Continuous Bag of Words (CBOW) and skip-gram models [2] represent words as real-valued vectors based on the past and future occurrences of words in raw text. Additionally, semantic and syntactic regularities are captured within the word embeddings [3]. For example, vector("France") - vector("Italy") + vector("Rome") results in a vector that is nearest to vector("Paris"). To the best of the authors' knowledge, none of the previously proposed NLP research in Kannada has taken word vectors into consideration. In the proposed work, the authors train a neural network model on a few million words, with a modest word-vector dimensionality of between 100 and 350. However, the most commonly used Word2Vec model fails to consider parameters such as the word order in the given corpus, the POS-tag relationship between the words in the context, and the subword information within each word. Incorporating a combination of these known techniques into the Word2Vec model has been shown to increase training efficiency and improve the quality of the word representations. Good representations of uncommon words are difficult to learn with standard word2vec, yet in morphologically rich languages such as German, Chinese and Kannada, rare words contribute significant meaning to a sentence. The traditional word2vec model and most of its variants tend to represent rare words as 'UNK', losing much of the valuable information in the sentence.
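Such analogy arithmetic can be queried directly once a model is trained. The sketch below uses gensim and a hypothetical vector file to mirror the example above (English words are used only for illustration; the proposed work targets Kannada):

```python
from gensim.models import KeyedVectors

# Hypothetical file: pre-trained vectors saved in word2vec text format.
vectors = KeyedVectors.load_word2vec_format("kw2v_300d.txt")

# vector("France") - vector("Italy") + vector("Rome") should be nearest to vector("Paris").
print(vectors.most_similar(positive=["France", "Rome"], negative=["Italy"], topn=1))
```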



1.2 PROBLEM STATEMENT

None of the previously proposed NLP research in Kannada has taken word vectors into consideration. In the proposed work, the authors train a neural network model on a few million words, with a modest word-vector dimensionality of between 100 and 350. However, the most commonly used Word2Vec model fails to consider parameters such as the word order in the given corpus, the POS-tag relationship between the words in the context, and the subword information within each word.

1.3 OBJECTIVES

The main objective of this paper is to introduce word vectors for the Kannada language by carefully choosing a combination of known techniques that are only occasionally used together; the authors call the proposed model the Kannada-Word2Vec (kW2V) model. As pre-trained word vectors are fundamental for any NLP task, the final outcome of this work is to train high-quality word vectors for the Kannada language and make them publicly available.

1.4 SCOPE FOR THE STUDY

The scope of this work includes training the proposed model on a large dataset, thereby generating quality distributed word representations for Kannada words and making them publicly available for further research. The tricks and techniques explained in this paper help the model fine-tune its parameters and provide a computationally efficient algorithm on very large datasets.

1.5 NEED FOR THE STUDY


Pre-trained word vectors are a key requirement for many NLP tasks, yet generating word vectors for Indian languages has drawn relatively little attention. There is a need for a distributed representation of Kannada words built with an optimal neural network model that combines various known techniques.



CHAPTER 2

METHODOLOGY


2.1 RELATED WORK


Distributional representation of words has been prevalent in the neural network community for many years. One of the pioneering efforts is the neural network architecture proposed by Bengio et al. [1], where a feedforward neural network with a sequential projection layer and non-linear functions in the hidden layer was used to learn the joint conditional probability of words in a large corpus, eventually producing word vector representations and a statistical neural network language model (NNLM). The NNLM performs better than n-gram models, and this work has been followed by numerous others. A rather different approach was to learn word vectors directly instead of focusing on learning language models [4]: a binary classification task that decides whether a word is related to its context or not is carried out to generate word vectors, and the word representations generated this way were later used for other NLP tasks. The two neural network architectures CBOW and skip-gram [2], which learn vector space representations for the English vocabulary, have been quite successful; the representation of a target word is learned with the help of the words appearing in its context. In later work the same authors introduce a test to evaluate the semantic and syntactic knowledge captured in the word vectors. The statistical knowledge of word co-occurrence is also significant for learning word embeddings: the Global Vectors (GloVe) model [5] proposed by Pennington et al. incorporates corpus-level co-occurrence statistics into the word vectors. From a linguistic perspective, many architectures based on POS tagging [10] and sentence dependency parsing have recently been proposed. Recent work also includes a novel architecture to learn a sentiment lexicon for Chinese words [11].

2.2 Skip-gram Model


The objective of the skip-gram model is to model the probability of the words that occur in the context of a given target word. Given a sequence of words w_1, w_2, w_3, ..., w_T and a context window of size c, the skip-gram objective is to maximize the average log conditional probability of the context words given each target word,

    (1/T) Σ_{t=1}^{T} Σ_{-c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t),

summing the log probabilities over each context window and finally averaging over the total number of context windows.

Fig.1. Skip-gram model
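As a concrete illustration of this objective, the sketch below evaluates it with a full softmax at toy scale. The matrix names W_in and W_out are assumptions for illustration, not the paper's notation; a real implementation would use negative sampling or hierarchical softmax for efficiency.

```python
import numpy as np

# Log probability of one context word given a target word, with a full softmax.
# W_in, W_out: assumed (V, d) input and output embedding matrices.
def log_prob_context_given_target(context_idx, target_idx, W_in, W_out):
    scores = W_out @ W_in[target_idx]                     # dot product with every output vector
    log_softmax = scores - np.log(np.exp(scores).sum())
    return log_softmax[context_idx]

def skipgram_objective(word_ids, c, W_in, W_out):
    """Average log probability of the context words within a window of size c."""
    T = len(word_ids)
    total = 0.0
    for t in range(T):
        for j in range(-c, c + 1):
            if j == 0 or not 0 <= t + j < T:
                continue
            total += log_prob_context_given_target(word_ids[t + j], word_ids[t], W_in, W_out)
    return total / T
```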



2.3 Continuous Bag of Words

The CBOW architecture is similar to the continuous skip-gram model, but instead of predicting all the context words given the target word, it aims to predict the target word given the vector representations of the context words. This is similar to the feedforward neural network language model (NNLM), where the hidden non-linear activation layer is replaced by a projection layer. The model hypothesis is to predict the target word from the average of the context word vectors, i.e. p(w_t | w_{t-c}, ..., w_{t+c}) is a softmax over the scores of the averaged context vectors.

Fig.2. Continuous Bag of Words (CBOW) model
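A minimal sketch of the corresponding CBOW scoring step, under the same assumed W_in/W_out notation used in the skip-gram sketch above:

```python
import numpy as np

# CBOW step: predict the target word from the average of its context vectors.
# W_in, W_out: assumed (V, d) input and output embedding matrices.
def cbow_log_prob(target_idx, context_ids, W_in, W_out):
    h = W_in[context_ids].mean(axis=0)                    # projection layer: averaged context vectors
    scores = W_out @ h
    log_softmax = scores - np.log(np.exp(scores).sum())
    return log_softmax[target_idx]
```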



2.4 Frequent Word Subsampling

The word frequency distribution of any language indicates that a small set of words (e.g. illi, mattu, alli) occurs far more often than the others. Such frequently occurring words tend to overfit the parameters of the neural network model; in other words, the model fails to capture the co-occurrence information of rarely occurring words. A standard approach to counter this biased distribution is to subsample the frequently occurring words.

Each word w_i in the training text is discarded with probability

    P(w_i) = 1 - sqrt(t / f(w_i)),

where f(w_i) is the frequency of the word w_i in the vocabulary and t is a threshold hyperparameter, typically around 10^-5 [3].
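A small sketch of this subsampling step, assuming the threshold t = 10^-5 used as the default in [3]:

```python
import math
import random

def subsample(words, t=1e-5):
    """Randomly discard frequent words with probability P(w) = 1 - sqrt(t / f(w))."""
    total = len(words)
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    kept = []
    for w in words:
        f = counts[w] / total                  # relative frequency of w in the corpus
        p_discard = 1.0 - math.sqrt(t / f)     # negative for rare words, so they are always kept
        if random.random() > p_discard:
            kept.append(w)
    return kept
```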

2.5 Subword Information

Word2Vec treats each word as a unique element of the vector representation model, assigning a distinct representation to each word in the vocabulary and ignoring the internal structure of the word itself. Even though the Word2Vec model performs well enough for many NLP tasks over multiple languages, it fails to capture the rich information in languages with a huge vocabulary of rarely occurring words. A simple but very intuitive approach to this problem is to generate vectors for the character n-grams of each word during training. In practice, the set of n-grams N_w of a word w is restricted to n-grams of 3 to 6 characters, together with the word itself. The word vector is then simply the sum of these n-gram vectors,

    v_w = Σ_{g ∈ N_w} z_g,

where z_g denotes the vector of the character n-gram g [9].
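A sketch of the character n-gram extraction, following the convention in [9] of adding '<' and '>' boundary markers and keeping the whole word as a special n-gram (the helper name is hypothetical; Kannada words work the same way since Python strings are Unicode):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with boundary markers as in [9]."""
    marked = "<" + word + ">"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    grams.add(marked)                 # the whole word is kept as a special n-gram
    return grams

# The word vector is then the sum of the vectors of its n-grams:
#   v_word = sum(z[g] for g in char_ngrams(word))
# where z maps every n-gram seen in training to its vector.
```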



2.6 Structured Skip-gram

The skip-gram and CBOW models are position independent, whereas many syntax-oriented NLP tasks such as POS-tagging and dependency parsing depend on the word order in the given text. The standard skip-gram uses only one output embedding matrix to predict all the words in the given context. The structured skip-gram modification [12] instead uses a separate output matrix for each position in the predicted context. At prediction time, a context word is scored according to its position relative to the target word, using the corresponding output matrix. The number of output parameters to be learned grows by a factor of the context size compared with the standard skip-gram model, but this approach significantly improves syntax-based tasks such as POS-tagging and dependency parsing.
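A sketch of the per-position output matrices that distinguish structured skip-gram from the standard model (the names, sizes and initialization below are assumptions for illustration only):

```python
import numpy as np

# One output matrix per relative position in the window, instead of a single shared matrix.
# V = vocabulary size, d = vector dimension, c = window size.
def init_output_matrices(V, d, c, seed=0):
    rng = np.random.default_rng(seed)
    # positions -c..-1 and +1..+c each get their own (V, d) output matrix
    return {pos: rng.normal(scale=0.01, size=(V, d))
            for pos in range(-c, c + 1) if pos != 0}

def predict_at_position(target_vec, pos, output_matrices):
    """Score every vocabulary word as the context word at relative position pos."""
    scores = output_matrices[pos] @ target_vec
    return int(np.argmax(scores))
```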

2.7 Word2Vec Parameters and Computational Structure



CHAPTER 3
EXPERIMENTS

A. Training corpus
Most of the data previously used for training word-vector models for English includes the English Gigaword, Common Crawl and English Wikipedia datasets, which are publicly available. The authors' primary focus in this work is to generate word vectors for the Kannada language. They used a university literature corpus of 22 MB containing 46,060 unique words.

B. Raw text to Unicode format
The literary data described in the previous subsection is processed by a Perl script that converts the Kannada words in the text file into Unicode format. The Unicode data is further processed to create a vocabulary dictionary for which the word vectors are eventually generated.
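The authors use a Perl script for this step; the sketch below shows an equivalent pass in Python (the file name and the simple whitespace tokenization are assumptions), producing the vocabulary dictionary:

```python
import unicodedata
from collections import Counter

def build_vocabulary(path="kannada_corpus.txt"):
    """Read raw Kannada text, normalize it to a canonical Unicode form,
    and build a word -> frequency dictionary."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    text = unicodedata.normalize("NFC", text)   # canonical Unicode normalization
    tokens = text.split()                       # simple whitespace tokenization
    return Counter(tokens)

vocab = build_vocabulary()
print(len(vocab), "unique words")               # the paper reports 46060 unique words
```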

C. Experimental parameters
The word-vector dimensionality during neural network training is set to 300, with negative sampling of 5 words. A context window of 5 is chosen for the skip-gram model during training. Both the input and output word vectors are initialized with random normal values. The learning rate during backpropagation is set to 0.025 and later reduced to 0.01 to obtain an optimal learning curve. The neural network is trained for 128 epochs, and the final model is chosen based on the lowest cross-entropy error during backpropagation.
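The paper trains its own kW2V model; purely to illustrate the stated hyperparameters, an equivalent skip-gram configuration in gensim (not the authors' tooling) would look like this:

```python
from gensim.models import Word2Vec

# Placeholder corpus: in practice, a list of tokenized Kannada sentences.
sentences = [["illi", "mattu", "alli"]]

model = Word2Vec(
    sentences,
    vector_size=300,   # word-vector dimensionality
    window=5,          # context window of 5
    sg=1,              # skip-gram architecture
    negative=5,        # 5 negative samples per positive pair
    alpha=0.025,       # initial learning rate
    min_alpha=0.01,    # learning rate decays to 0.01
    epochs=128,        # 128 training epochs
    min_count=1,
)
model.wv.save_word2vec_format("kw2v_300d.txt")  # hypothetical output file
```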



CHAPTER 4
RESULT AND DISCUSSION

To evaluate the quality of the word vectors, the most commonly employed experiments are the word similarity task and the word analogy task. In their experiments the authors report results for both tasks. As none of the previous research in Kannada NLP has tried to generate word vectors, they compared their results with the most prevalent English word-vector models, word2vec and GloVe.

A. Qualitative analysis
Before evaluating the models on the tasks mentioned above, the authors plotted the 100 most frequent words in the dataset in order to provide a visual representation of the word vectors and analyse how related words are grouped together.

Fig 3. Vector space representation of words.
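A sketch of how such a plot can be produced, assuming a PCA projection to two dimensions (the authors do not state which projection method they used) and a hypothetical vector file:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("kw2v_300d.txt")  # hypothetical file
top_words = vectors.index_to_key[:100]     # assumes the file is sorted by word frequency
points = PCA(n_components=2).fit_transform(vectors[top_words])

plt.figure(figsize=(10, 10))
plt.scatter(points[:, 0], points[:, 1], s=5)
for (x, y), word in zip(points, top_words):
    plt.annotate(word, (x, y), fontsize=8)
plt.title("2-D projection of the top 100 Kannada word vectors")
plt.show()
```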



B. Syntactic word analogy task
As no standard syntactic word analogy task exists for the Kannada language, the authors created an experiment consisting of 100 word analogy test cases. The limited size and vocabulary of their dataset prevented them from creating a universal word analogy task. The results measure how often the model predicts the correct analogy word among its top five most probable outputs. The most commonly used syntactic word analogy benchmark for English is the MSR dataset, which contains 8000 syntactic word analogy questions [13]. The experimental results, compared with the standard English word analogy task, are shown in Table 1.

Table 1. Results on word analogy tasks (%).
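A sketch of how such a top-five analogy accuracy could be computed with gensim, assuming the 100 test cases are available as (a, b, c, expected) tuples meaning "a is to b as c is to expected" (the file and its format are not specified in the paper):

```python
from gensim.models import KeyedVectors

def analogy_accuracy_top5(vectors, questions):
    """Percentage of analogies whose answer appears in the model's top five predictions."""
    correct = 0
    for a, b, c, expected in questions:
        # answer vector ~ vector(b) - vector(a) + vector(c)
        predictions = vectors.most_similar(positive=[b, c], negative=[a], topn=5)
        if expected in [word for word, _ in predictions]:
            correct += 1
    return 100.0 * correct / len(questions)

vectors = KeyedVectors.load_word2vec_format("kw2v_300d.txt")   # hypothetical file
# questions would be loaded from the authors' 100 hand-built test cases
```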

C. Word similarity task
Here again, due to the lack of an existing Kannada word similarity benchmark, the authors created a test set containing 100 word similarity questions. The results are compared with the English word similarity benchmark WordSim-353 [14], evaluated on the skip-gram and CBOW word2vec models. All the results are shown in Table 2.

Table 2. Results on word similarity tasks (%).
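Word similarity benchmarks such as WordSim-353 are usually scored by correlating the model's cosine similarities with human ratings; the authors do not state their exact scoring method, so the sketch below uses Spearman correlation as one plausible choice:

```python
from gensim.models import KeyedVectors
from scipy.stats import spearmanr

def similarity_score(vectors, pairs):
    """pairs: list of (word1, word2, human_rating) tuples."""
    model_sims, human_sims = [], []
    for w1, w2, rating in pairs:
        if w1 in vectors and w2 in vectors:
            model_sims.append(vectors.similarity(w1, w2))  # cosine similarity
            human_sims.append(rating)
    return spearmanr(model_sims, human_sims).correlation

vectors = KeyedVectors.load_word2vec_format("kw2v_300d.txt")   # hypothetical file
```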



CHAPTER 5

CONCLUSION AND OUTLOOK


In this paper the authors have proposed a neural network language model that combines several recently proposed techniques in order to generate sophisticated word embeddings for the Kannada language. Their findings indicate that the word vectors can be further improved by training the proposed model on a larger dataset. The tricks and techniques explained in this paper help the model fine-tune its parameters and provide a computationally efficient algorithm on very large datasets. The future scope of this work includes training the proposed model on a huge dataset, thereby generating quality distributed word representations for Kannada words and making them publicly available for further research.



REFERENCES
[1] Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C. 2003. A neural probabilistic language model.
Journal of Machine Learning Research 3:1137–1155.
[2] Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
[3] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; and Dean, J. 2013b. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546.
[4] Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. 2011. Natural language
processing (almost) from scratch. The Journal of Machine Learning Research 12:2493–2537.
[5] Pennington, J., Socher, R., and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
[6] Shujie Liu, Nan Yang, Mu Li, and Ming Zhou. 2014. A recursive recurrent neural network for statistical machine translation. In Proceedings of ACL, pages 1491–1500.
[7] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul.
2014. Fast and robust neural network joint models for statistical machine translation. In 52nd Annual
Meeting of the Association for Computational Linguistics, Baltimore, MD, USA, June.
[8] Danqi Chen and Christopher D Manning. 2014. A fast and accurate dependency parser using neural
networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
(EMNLP), pages 740–750.
[9] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
[10] Lin Qiu, Yong Cao, Zaiqing Nie, and Yong Rui. 2014. Learning word representation considering
proximity and ambiguity. In Twenty-Eighth AAAI Conference on Artificial Intelligence.
[11] F. Wang, Y. Zhou and M. Lan, "Dimensional sentiment analysis of traditional Chinese words using pretrained Not-quite-right Sentiment Word Vectors and supervised ensemble models," 2016 International Conference on Asian Language Processing (IALP), Tainan, 2016, pp. 300-303.
[12] Ling, Wang & Dyer, Chris & W Black, Alan & Trancoso, Isabel. (2015). Two/Too Simple Adaptations of
Word2Vec for Syntax Problems. 10.3115/v1/N15-1142.



[13] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space
word representations. In HLT-NAACL, pages 746–751.
[14] Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of WWW, pages 406–414. ACM.

