dimensional dense vector representations of words. Word2vec has two models, the Continuous Bag-of-Words (CBoW) model and the Skip-gram model. The authors demonstrated that these low-dimensional vectors capture semantic relations as well as analogical relations between words, e.g. vec(Paris) - vec(France) + vec(Germany) ≈ vec(Berlin). The Word2vec model overcomes problems of the BoW model, such as the high dimensionality of the representation, and gives better results on word-to-word similarity. We present an approach for text classification using word embeddings generated by the Word2vec model.
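For illustration, the analogical property can be checked with a pretrained embedding model through gensim. This is a minimal sketch under our own assumptions; the choice of the Google News vectors and the word casing are only examples and not part of the original experiments.

import gensim.downloader as api

# Load pretrained Google News word vectors (an example choice; large download).
vectors = api.load("word2vec-google-news-300")

# Analogy query: vec(Paris) - vec(France) + vec(Germany) should be close to vec(Berlin).
print(vectors.most_similar(positive=["Paris", "Germany"], negative=["France"], topn=1))
# expected to rank "Berlin" first if the analogy holds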
The rest of the paper is structured as follows. Section II describes the relevant literature. The methodology of our approach is described in Section III. Sections IV and V detail the experiments and results. Section VI concludes the paper.
II. RELATED WORK

Text classification is broadly solved as a supervised classification task. Supervised learning approaches such as support vector machines [9], Naive Bayes [10] and logistic regression [11] have been used with BoW features for classification. Along with BoW, other features have also been explored for text classification, such as part-of-speech tags [12], semantic compositionality [13], syntax structures [14] and ontology-based knowledge [15]. These features have not demonstrated a reasonable gain over BoW features [16].

Recently, distributed representations of words [7][8] have been used as features for text classification. The Word2vec [7] model has been widely used to generate word embeddings for many NLP tasks. In [17] the authors proposed a distance function over word2vec vectors which gives the dissimilarity between a given pair of sentences, and demonstrated that it achieves state-of-the-art results on eight classification datasets. A short-text similarity technique is presented in [18], where the authors combine external knowledge with word2vec embeddings. A Bag-of-Embeddings model is proposed in [19], based on the assumption that words have different distributional characteristics under different classes; the authors also demonstrate better results on both balanced and imbalanced datasets. Multi-label document classification is presented in [20] using convolutional network topologies and word2vec. Most of this recent work uses the globally trained word2vec model released by Google.

In the presented work, we use Word2Vec, which provides semantically meaningful features for the text classification task. Text document length is also a significant factor in learning the feature representation of text. In [21] the authors attempted to reduce the word2vec feature set using K-means clustering; we explore a summarization algorithm for the same purpose. Globally trained vectors are not optimal for some tasks [22], hence we use a locally trained word embedding model in our work.
III. METHODOLOGY

Our proposed methodology for text classification is shown in Fig. 2.

Fig. 2: Architecture for text classification using word embeddings

The input is a text document. Preprocessing of the text is done before the feature generation module. In the feature generation module, we generate feature representations of the text using BoW and Word2Vec.
The BoW feature of a text document consists of the tf-idf weights of each word in the text. In tf-idf the aim is to find the important words in order to facilitate classification. For a collection of N documents, where d is an individual document and t is a word in document d, the tf-idf weight is given in equation (1):

    W_{t,d} = \frac{f_{t,d}}{n_d} \cdot \log\frac{N}{n_t}    (1)

where W_{t,d} is the tf-idf weight of term t in document d, f_{t,d}/n_d is the term frequency (tf), i.e. the ratio of the number of occurrences of t in document d to the total number of terms n_d in d, and \log(N/n_t) is the inverse document frequency (idf), i.e. the logarithm of the ratio of the total number of documents N to the number of documents n_t in which term t occurs.
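As a concrete illustration of equation (1), the following minimal Python sketch computes tf-idf weights for a toy corpus directly from the definition above; the toy documents and variable names are ours and are introduced only for the example.

import math
from collections import Counter

docs = [
    "oil prices rise as demand grows",
    "central bank raises interest rates",
    "oil exports fall while interest in grain rises",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# n_t: number of documents in which term t occurs (document frequency)
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))

def tfidf(term, tokens):
    # term frequency: occurrences of the term / total terms in the document
    tf = tokens.count(term) / len(tokens)
    # inverse document frequency: log(N / n_t), as in equation (1)
    idf = math.log(N / doc_freq[term])
    return tf * idf

print(tfidf("oil", tokenized[0]))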
The word embedding features are generated with the Word2Vec Skip-gram model [7], which produces word embeddings that capture semantic similarity and scales to very large data sets. Specifically, the authors propose a simple neural network with an input, a hidden and an output layer. Each word vector is trained to maximize the log probability of its neighboring words, as shown in equation (2) [23]; i.e., given a sequence of words w_1, \ldots, w_T,

    \frac{1}{T} \sum_{t=1}^{T} \sum_{j \in nb(t)} \log p(w_j \mid w_t)    (2)

where nb(t) is the set of neighboring words of word w_t and p(w_j \mid w_t) is the hierarchical soft-max of the associated word vectors V_{w_j} and V_{w_t} [23]. The BoW and Word2Vec features of a text file are generated by averaging over the words present in the text document, which yields a feature vector for the whole document. After feature generation, these features are passed to the classification module, where a classifier is trained with the BoW and Word2Vec features. Finally, the performance of the classifier is measured.
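The pipeline above can be sketched with gensim and scikit-learn as follows. This is a minimal illustration under our own assumptions: the tokenization, vector size, window and the choice of logistic regression are placeholders, not the exact settings of the paper.

import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# tokenized_docs: list of token lists, labels: list of class labels (toy stand-ins)
tokenized_docs = [["oil", "prices", "rise"], ["bank", "raises", "rates"]]
labels = ["crude", "interest"]

# Train a local Skip-gram model (sg=1) on the corpus itself (gensim 4.x API).
w2v = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5, min_count=1, sg=1)

def doc_vector(tokens, model):
    # Document feature = mean of the vectors of the words present in the document.
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([doc_vector(t, w2v) for t in tokenized_docs])

# Train any standard classifier on the document vectors.
clf = LogisticRegression(max_iter=1000).fit(X, labels)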
IV. EXPERIMENTS

A. Experimental Setup

For the experiments we use the Reuters-21578 [24] dataset, the most widely used dataset for text classification research. It is an imbalanced dataset, as the categories do not contain equal numbers of documents. Among all the categories, the 8 most frequent categories contain the majority of the dataset, and we use these 8 categories for our experiments.
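One way to obtain the 8 most frequent categories is via the NLTK interface to Reuters-21578. The sketch below is an assumption about tooling; the original paper does not state how the corpus was loaded.

from collections import Counter
from nltk.corpus import reuters  # requires: nltk.download('reuters')

# Count documents per category and keep the 8 most frequent ones.
counts = Counter(cat for fid in reuters.fileids() for cat in reuters.categories(fid))
top8 = [cat for cat, _ in counts.most_common(8)]

# Collect (text, label) pairs restricted to the top-8 categories.
data = [(reuters.raw(fid), cat)
        for fid in reuters.fileids()
        for cat in reuters.categories(fid) if cat in top8]
print(top8, len(data))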
We also explored text summarization to obtain the important feature words and improve performance. Summarization of the preprocessed text is needed to obtain an abstract of the text. The length of the document is an important factor that affects the quality of the result; since our approach considers the whole document as a single entity, summarization is applied to reduce the effect of document length. For this, the TextRank [25] model is used due to its state-of-the-art results. Moreover, TextRank requires no linguistic knowledge and is neither domain- nor language-specific, which makes it highly portable.
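A minimal TextRank-style extractive summarizer can be sketched as below. It builds a sentence graph weighted by TF-IDF cosine similarity and ranks sentences with PageRank, which is one common realization of TextRank; it is not necessarily the exact implementation used in the paper, and the example sentences are ours.

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, top_k=3):
    # Sentence similarity graph weighted by TF-IDF cosine similarity.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    graph = nx.from_numpy_array(sim)
    # Rank sentences with PageRank and keep the top_k, in original order.
    scores = nx.pagerank(graph)
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [sentences[i] for i in sorted(ranked)]

doc_sentences = [
    "Oil prices rose sharply on Monday.",
    "Analysts said demand growth was the main driver.",
    "The central bank left interest rates unchanged.",
    "Exporters expect shipments to increase next quarter.",
]
print(textrank_summary(doc_sentences, top_k=2))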
The performance is measured using accuracy, precision and recall. For the Word2Vec features, we generate the feature of a text document as the mean of the vectors of the words present in the text.
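For completeness, a sketch of how accuracy, precision and recall can be computed with scikit-learn. The synthetic stand-in data, held-out split and macro averaging are our own choices and are not specified in the paper; in our setting X would hold BoW tf-idf or averaged Word2Vec document vectors and y the 8 Reuters categories.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Stand-in feature matrix and labels for the sketch.
X, y = make_classification(n_samples=400, n_features=100, n_informative=20,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)

print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred, average="macro", zero_division=0))
print("recall   :", recall_score(y_test, pred, average="macro", zero_division=0))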
Fig. 3: Classification results

TABLE II: Word2Vec model performance (accuracy) on original and summarized text

No. of features    Original    Summarized
100                89.87       88.39
300                90.18       88.53
700                90.79       88.60
1000               91.13       88.65

C. Analysis

The results show that our approach works better for text classification on the Reuters-21578 dataset. The performance of BoW increases as the number of features used by the classifier increases, but Word2Vec already works very well (89.27) with only 100 features.

REFERENCES

[1] Harris, Zellig. "Distributional structure." Word, 1954.
[2] Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. "Indexing by latent semantic analysis." Journal of the American Society of Information Science, 41(6):391-407, 1990.
[3] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Journal of Machine Learning Research 3.Jan (2003): 993-1022.
[4] Petterson, J., Buntine, W., Narayanamurthy, S. M., Caetano, T. S., and Smola, A. J. "Word features for latent Dirichlet allocation." In NIPS, pp. 1921-1929, 2010.
[5] Mikolov, T., Yih, W., and Zweig, G. "Linguistic regularities in continuous space word representations." In NAACL, pp. 746-751, 2013.
[6] Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin. "A neural probabilistic language model." J. Mach. Learn. Res., 3:1137-1155, 2003.
[7] Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
[8] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[9] Joachims, T. "Text categorization with support vector machines: Learning with many relevant features." In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, DE), pp. 137-142, 1998.
[10] Andrew McCallum and Kamal Nigam. "A comparison of event models for naive Bayes text classification." In Proc. of AAAI Workshop on Learning for Text Categorization, 1998.
[11] Kamal Nigam, John Lafferty, and Andrew McCallum. "Using maximum entropy for text classification." In Proc. of IJCAI Workshop on Machine Learning for Information Filtering, 1999.
[12] David D. Lewis. "An evaluation of phrasal and clustered representations on a text categorization task." In Proc. of SIGIR, 1995.
[13] Karo Moilanen and Stephen Pulman. "Sentiment composition." In Proc. of RANLP, 2007.
[14] Nakagawa Tetsuji, Inui Kentaro, and Kurohashi Sadao. "Dependency tree-based sentiment classification using CRFs with hidden variables." In Proc. of NAACL-HLT, 2010.
[15] Sonawane, Sheetal S., and Parag Kulkarni. "Concept based document similarity using graph model." International Journal of Information Technology (2019): 1-12.
[16] Sida Wang and Christopher D. Manning. "Baselines and bigrams: Simple, good sentiment and topic classification." In Proc. of ACL, 2012.
[17] Kusner, Matt, et al. "From word embeddings to document distances." International Conference on Machine Learning, 2015.
[18] Kenter, Tom, and Maarten de Rijke. "Short text similarity with word embeddings." Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 2015.
[19] Jin, Peng, Yue Zhang, Xingyuan Chen, and Yunqing Xia. "Bag-of-Embeddings for text classification." In IJCAI, vol. 16, pp. 2824-2830, 2016.
[20] Lenc, Ladislav, and Pavel Král. "Word embeddings for multi-label document classification." Proceedings of the International Conference on Recent Advances in Natural Language Processing, RANLP 2017.
[21] Ma, Long, and Yanqing Zhang. "Using Word2Vec to process big text data." 2015 IEEE International Conference on Big Data (Big Data). IEEE, 2015.
[22] Diaz, Fernando, Bhaskar Mitra, and Nick Craswell. "Query expansion with locally-trained word embeddings." arXiv preprint arXiv:1605.07891 (2016).
[23] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems, 2013.
[24] Reuters-21578 text categorization collection. http://www.daviddlewis.com/resources/testcollections/reuters21578/
[25] Mihalcea, Rada, and Paul Tarau. "TextRank: Bringing order into text." Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004.