dimensional dense vector representation of a word. Word2vec has two models, Continuous Bag-of-Words (CBoW) and Skip-gram. The authors demonstrated that these vectors, despite their low dimensionality, capture semantic as well as analogical relations between words, e.g. vec(Paris) - vec(France) + vec(Germany) ≈ vec(Berlin). The Word2vec model overcomes problems of the BoW model such as the high dimensionality of the representation, and gives better results on word-to-word similarity. We present an approach for text classification using word embeddings generated by the Word2vec model.

The rest of the paper is structured as follows. Section II describes the relevant literature. The methodology of our approach is described in Section III. Sections IV and V detail the experiments and results. Section VI concludes.

II. RELATED WORK

Text classification is generally approached as a classification task. Supervised learning approaches like support vector machines[9], Naive Bayes[10], and logistic regression[11] have been used with BoW features for classification. Along with BoW, other features have also been explored for text classification, such as part-of-speech[12], semantic compositionality[13], syntax structures[14] and ontology-based knowledge[15]. These features have not demonstrated a reasonable gain over BoW features[16].

Recently, distributed representations of words[7][8] have been used as features for text classification. The Word2vec[7] model has been widely used to generate word embeddings for many NLP tasks. In [17], a naive distance function using word2vec vectors is proposed, which gives the dissimilarity between a given pair of sentences. This distance function is shown to achieve state-of-the-art results on 8 classification datasets. A short text similarity technique is presented in [18], in which external knowledge is used together with word2vec embeddings. A Bag-of-Embeddings model is proposed in [19], based on the assumption that words have different distributional characteristics under different classes. It also demonstrates better results on balanced and imbalanced datasets. Multi-label document classification is presented in [20] with convolutional network topologies and word2vec. Most recent work using word2vec relies on the globally trained model from Google.

In the presented work, we use Word2Vec, which yields semantically meaningful features for the text classification task. Text document length is also a significant factor in learning the feature representation of text. In [21], an attempt was made to reduce the word2vec features using K-means clustering. We explore a summarization algorithm for the same purpose. Globally trained vectors are not optimal for some tasks[22]. Hence, we use a locally trained word embedding model in our work.

III. METHODOLOGY

Our proposed methodology for text classification is shown in Fig. 2. The input is a text document. The text is preprocessed before the feature generation module. In the feature generation module, we generate a feature representation of the text using BoW and Word2Vec. The BoW feature of a text document consists of the tf-idf weights of each word in the text.

Fig. 2: Architecture for Text classification using Word Embedding

In tf-idf, the aim is to find important words to facilitate classification. For a collection of $N$ documents, where $d$ is an individual document and $t$ is a word in document $d$, the tf-idf weight is given in equation (1):

$$W_{td} = \frac{f_{td}}{n_{td}} \cdot \log\left(\frac{N}{f_{dt}}\right) \tag{1}$$

where $W_{td}$ is the tf-idf weight of the term $t$ in document $d$. $f_{td}/n_{td}$ is the term frequency (tf), which is the ratio of the number of occurrences of $t$ in document $d$ to the total number of terms in $d$. $\log(N/f_{dt})$ is the inverse document frequency (idf), which is the log of the ratio of the total number of documents $N$ to the number of documents in which term $t$ occurs.

Word embedding features are generated using the Word2Vec Skip-gram model[7], which generates word embeddings that capture semantic similarity and scales to very large data sets. Specifically, the authors propose a simple neural network with an input, a hidden and an output layer. Each word vector is trained to maximize the log probability of neighboring words, shown in equation (2)[23], i.e. given a sequence of words $w_1, \ldots, w_T$:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{j \in nb(t)} \log p(w_j \mid w_t) \tag{2}$$

where $nb(t)$ is the set of neighboring words of word $w_t$ and $p(w_j \mid w_t)$ is the hierarchical softmax of the associated word vectors $V_{w_j}$ and $V_{w_t}$[23]. The features for a text file with BoW and Word2vec are generated by averaging over the words present in the text document. This gives the feature of the whole text document. After the features are generated, they are used in the classification module. In this module, a classifier is trained with BoW and Word2Vec features. Finally, the performance of the classifier is measured.

IV. EXPERIMENTS

A. Experimental Setup

For our experiments we use the Reuters-21578[24] dataset. This is one of the most widely used datasets for text classification research. It is an imbalanced dataset, as the categories do not all contain an equal number of documents. Of all the categories, the 8 most frequent contain the majority of the dataset. We use these 8 categories for our experiments. We also explored text
summarization to get important feature words to improve performance. Summarization of the preprocessed text is needed to get an abstract of the text. The length of the document is an important factor that affects the quality of the result. Our approach considers the whole document as a single entity, so summarization is performed to overcome the effect of document length. For this, the TextRank[25] model is used because of its state-of-the-art results. In addition, TextRank does not require linguistic knowledge, nor is it domain or language specific, which makes it highly portable. Performance is measured using accuracy, precision and recall. For Word2Vec features, we generate the feature of a text document as the mean of the vectors of the words present in the text.

increases with an increase in the number of features used in the classifier. But Word2Vec works very well (89.27) on 100

TABLE II: Word2Vec model performance on original and summarized text

No. of features    Word2Vec Model Accuracy
                   Original    Summarized
100                89.87       88.39
300                90.18       88.53
700                90.79       88.60
1000               91.13       88.65
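
The two feature representations described in the methodology, the tf-idf weight of equation (1) and the mean of the word vectors of a document, can be illustrated with a minimal Python sketch. It assumes documents that are already tokenized into word lists. The function names `tfidf_weights` and `mean_vector`, the toy documents and the two-dimensional toy embedding table are hypothetical and are not taken from the paper; in practice the vectors would come from a trained Skip-gram model.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document, following equation (1)."""
    n_docs = len(docs)  # N: total number of documents
    # f_dt: number of documents in which each term occurs
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)  # f_td: occurrences of each term in this document
        total = len(doc)       # n_td: total number of terms in this document
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def mean_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding; None if none do."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy data, hypothetical and for illustration only.
docs = [["oil", "price", "rise"], ["oil", "export", "oil"], ["grain", "price"]]
weights = tfidf_weights(docs)

# Stand-in for vectors learned by a Skip-gram model (assumption).
toy_embeddings = {"oil": [1.0, 0.0], "price": [0.0, 1.0], "rise": [1.0, 1.0]}
doc_vector = mean_vector(docs[0], toy_embeddings)
```

Words without an embedding are skipped in the average, which is one possible design choice; the paper does not state how out-of-vocabulary words are handled.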
