
Text Classification Using Word Embeddings

Mukund N. Helaskar
Department of Computer Engineering
Pune Institute of Computer Technology
Pune, India
mukundpict7@gmail.com

Sheetal S. Sonawane
Department of Computer Engineering
Pune Institute of Computer Technology
Pune, India
sssonawane@pict.edu

Abstract—The growth of the internet has had an immense impact on data generation, and most of the data in the world is in textual form. There is a need to access and use this data efficiently and easily; hence, text classification is a widely studied problem in the research community. Its applications span diverse domains such as news filtering, opinion mining, and information retrieval. At the core of these applications are machine learning algorithms, which require fixed-length vectors as input. Bag-of-Words is a popular model for representing text in numeric vector form, but it has several disadvantages: word order is ignored, and the representation becomes high-dimensional and sparse when the vocabulary is large. Word embeddings are distributed vector representations of words. These representations inherit a semantic notion of similarity and have shown state-of-the-art results in many core natural language processing tasks. In the presented work, we perform text classification using these word embeddings and measure the performance.

Index Terms—Data Mining, Machine Learning (ML), Natural Language Processing (NLP), Text Classification, Word Embedding.
I. INTRODUCTION

Text classification is a broadly studied area in data mining. With the growth of the internet there has been an exponential increase in the online availability of information, accessible from a variety of sources such as digital libraries, social network feeds, scientific literature, e-books, and so on. This data is present in both structured and unstructured form, and there is a need to handle data of such magnitude efficiently. The main goal of data mining is to extract useful information from textual data, which involves operations such as information retrieval and text classification.

The text classification problem is defined as follows: given a set of documents D = {d1, d2, ..., dn}, where n is the total number of documents, each document is assigned a label from the set L = {l1, l2, ..., lk}, where k is the number of classes. Based on the features of a document d, the classification model should assign the correct label l. There are multiple methods for text classification: supervised, unsupervised, and semi-supervised, as shown in Fig. 1.

[Fig. 1: Text classification methods]

Text classification comes in two forms, single-label and multi-label. At the core of the text classification task are machine learning algorithms, which need the text input to be represented as a fixed-length numeric vector. The most commonly used representation is Bag-of-Words (BoW) [1], owing to its simplicity and efficiency: it counts the number of occurrences of each word in the text. But the BoW model has several disadvantages. Word order is not taken into consideration, so different texts with the same words may have the same representation. Term frequency/inverse document frequency (tf-idf) weighting is also used with BoW; it gives higher weight to rare (and therefore more informative) words and lower weight to frequently occurring words, since those contribute little. But the BoW model, even with tf-idf weighting, has very little sense of word semantics. It also does not capture the distance between individual words: "Kohli", "Tendulkar", and "Federer" are all equally distant, whereas semantically "Kohli" should be closer to "Tendulkar" than to "Federer".
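To make this limitation concrete, the following minimal Python sketch (our own illustration; the three one-word "documents" and the optional pretrained model are assumptions, not part of the paper) shows that in a tf-idf space distinct words are mutually orthogonal, whereas distributed embeddings give graded similarity:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Three one-word "documents": in a BoW/tf-idf space every distinct word
    # gets its own axis, so all pairs are equally (maximally) distant.
    docs = ["kohli", "tendulkar", "federer"]
    X = TfidfVectorizer().fit_transform(docs).toarray()
    for i, j in [(0, 1), (0, 2)]:
        cos = X[i] @ X[j] / (np.linalg.norm(X[i]) * np.linalg.norm(X[j]))
        print(docs[i], docs[j], round(float(cos), 2))  # both pairs: 0.0

    # With distributed embeddings the same pairs get graded similarity
    # (hypothetical check, assuming the tokens exist in the model vocabulary):
    # import gensim.downloader as api
    # wv = api.load("glove-wiki-gigaword-50")
    # print(wv.similarity("kohli", "tendulkar"), wv.similarity("kohli", "federer"))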
There are many methods that attempt to solve this problem. Latent Semantic Indexing [2] uses singular value decomposition to uncover the structure of documents and identify hidden (latent) relationships between words. It works well for a small, static set of documents and is widely used in information retrieval, but for a large corpus it becomes computationally expensive. Latent Dirichlet Allocation (LDA) [3] is another technique, which represents text in terms of topics: the idea is that a text consists of topics and topics consist of similar words, so LDA represents the text as a distribution over these topics. These approaches produce better representations of text, but they do not improve on word-distance-based tasks [4][5].

Word embeddings are distributed representations of words, and many models for learning them have been proposed in recent work [6][7][8]. One of the most popular is Word2vec [7], which generates low-dimensional, dense vector representations of words.


Word2vec has two models, Continuous Bag-of-Words (CBoW) and Skip-gram. The authors demonstrated that these low-dimensional vectors inherit semantic as well as analogical relations between words, e.g. vec(Paris) - vec(France) + vec(Germany) ≈ vec(Berlin). The Word2vec model overcomes problems of the BoW model such as the high dimensionality of the representation, and it gives better results on word-to-word similarity. We present an approach for text classification using word embeddings generated by the Word2vec model.

The rest of the paper is structured as follows. Section II describes the relevant literature. The methodology of our approach is described in Section III. Section IV details the experiments and results, and Section V concludes the paper.
II. RELATED WORK

Text classification has mostly been addressed as a supervised classification problem. Supervised learning approaches such as support vector machines [9], Naive Bayes [10], and logistic regression [11] have been used with BoW features for classification. Along with BoW, other features have also been explored for text classification, such as part-of-speech tags [12], semantic compositionality [13], syntactic structures [14], and ontology-based knowledge [15]. These features have not demonstrated reasonable gains over BoW features [16].

Recently, distributed representations of words [7][8] have been used as features for text classification. The Word2vec [7] model has been widely used to generate word embeddings for many NLP tasks. In [17] the authors proposed a distance function over word2vec vectors that gives the dissimilarity between a given pair of sentences, and demonstrated that it achieves state-of-the-art results on 8 classification datasets. A short-text similarity technique is presented in [18], where the authors combine external knowledge with word2vec embeddings. The Bag-of-Embeddings model is proposed in [19], based on the assumption that words have different distributional characteristics under different classes; the authors demonstrate better results on both balanced and imbalanced datasets. Multi-label document classification using convolutional network topologies and word2vec is presented in [20]. Most recent work uses the globally trained word2vec model released by Google.

In the presented work, we use Word2Vec, which provides semantically meaningful features for the text classification task. Text document length is also a significant factor in learning the feature representation of text. In [21] the authors attempted to reduce the word2vec feature set using K-means clustering; we explore a summarization algorithm for the same purpose. Globally trained vectors are not optimal for some tasks [22]; hence, we use a locally trained word embedding model in our work.

III. METHODOLOGY

Our proposed methodology for text classification is shown in Fig. 2. The input is a text document. Preprocessing of the text is done before the feature generation module. In the feature generation module, we generate feature representations of the text using BoW and Word2Vec. The BoW feature of a text document consists of the tf-idf weights of each word in the text.

[Fig. 2: Architecture for text classification using word embeddings]

In tf-idf the motive is to find important words to facilitate classification. For a collection of N documents, where d is an individual document and t is a word occurring in d, the tf-idf weight is given in equation (1):

    W_{t,d} = \frac{f_{t,d}}{n_d} \cdot \log\frac{N}{n_t}        (1)

where W_{t,d} is the tf-idf weight of term t in document d, f_{t,d} / n_d is the term frequency (tf), i.e. the ratio of the number of occurrences of t in d to the total number of terms n_d in d, and log(N / n_t) is the inverse document frequency (idf), i.e. the log of the ratio of the total number of documents N to the number of documents n_t in which term t occurs.
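As a concrete reading of equation (1), the following minimal Python sketch (the toy corpus and function name are our own illustration, not from the paper) computes the tf-idf weight directly from the definition:

    import math
    from collections import Counter

    def tfidf_weight(term, doc, corpus):
        """W_{t,d} = (f_{t,d} / n_d) * log(N / n_t), as in equation (1)."""
        tf = Counter(doc)[term] / len(doc)             # f_{t,d} / n_d
        N = len(corpus)                                # total number of documents
        n_t = sum(1 for d in corpus if term in d)      # documents containing t
        return tf * math.log(N / n_t) if n_t else 0.0

    corpus = [["rates", "rise", "again"],
              ["oil", "prices", "rise"],
              ["oil", "exports", "fall"]]
    print(tfidf_weight("prices", corpus[1], corpus))   # rare term: higher weight
    print(tfidf_weight("rise", corpus[1], corpus))     # common term: lower weight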
The word embedding features are generated using the Word2Vec Skip-gram model [7], which produces word embeddings that inherit semantic similarity and scales to very large data sets. Specifically, the authors propose a simple neural network with an input, a hidden, and an output layer. Each word vector is trained to maximize the log probability of its neighboring words, as shown in equation (2) [23]: given a sequence of words w_1, ..., w_T, the objective is

    \frac{1}{T} \sum_{t=1}^{T} \sum_{j \in nb(t)} \log p(w_j \mid w_t)        (2)

where nb(t) is the set of neighboring words of word w_t and p(w_j | w_t) is the hierarchical softmax of the associated word vectors V_{w_j} and V_{w_t} [23]. The Word2Vec feature of a text document is then generated by averaging the vectors of the words present in the document, which yields a single fixed-length feature vector for the whole document. After feature generation, the BoW and Word2Vec features are passed to the classification module, where a classifier is trained on each representation. Finally, the performance of the classifier is measured.
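A minimal sketch of this feature-generation step, assuming gensim 4.x (the tiny corpus and the hyperparameter values are illustrative placeholders, not the settings used in the paper):

    import numpy as np
    from gensim.models import Word2Vec

    # Locally trained Skip-gram model (sg=1) on the tokenized corpus.
    corpus = [["oil", "prices", "rise", "sharply"],
              ["grain", "exports", "fall", "this", "quarter"]]
    model = Word2Vec(sentences=corpus, vector_size=100, window=5,
                     min_count=1, sg=1, epochs=20)

    def doc_vector(tokens, model):
        """Document feature = mean of the vectors of its in-vocabulary words."""
        vecs = [model.wv[t] for t in tokens if t in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

    features = np.vstack([doc_vector(doc, model) for doc in corpus])
    print(features.shape)  # (2, 100): one fixed-length vector per document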
IV. EXPERIMENTS

A. Experimental Setup

For the experiments we use the Reuters-21578 [24] dataset, the most widely used dataset in text classification research. It is an imbalanced dataset, as the categories do not contain equal numbers of documents. Among all the categories, the 8 most frequent ones contain the majority of the dataset, and we use these 8 categories for our experiments. We also explored text summarization to obtain important feature words and improve performance: summarizing the preprocessed text yields an abstract of the text. Document length is an important factor in the quality of the result, and since our approach treats the whole document as a single entity, summarization is used to reduce the effect of document length. For this we use the TextRank [25] model, owing to its state-of-the-art results; TextRank requires no linguistic knowledge and is neither domain- nor language-specific, which makes it highly portable. Performance is measured using accuracy, precision, and recall. For the Word2Vec features, we generate the feature vector of a text document as the mean of the vectors of the words present in the text.
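A sketch of this setup using the copy of Reuters-21578 shipped with NLTK (our choice of loader; the paper does not name its tooling). Restricting to single-label documents is also our simplifying assumption for the sketch:

    from collections import Counter
    from nltk.corpus import reuters  # requires a prior nltk.download("reuters")

    # Keep single-label documents and select the 8 most frequent categories.
    single = [f for f in reuters.fileids() if len(reuters.categories(f)) == 1]
    counts = Counter(reuters.categories(f)[0] for f in single)
    top8 = [c for c, _ in counts.most_common(8)]

    docs = [(reuters.raw(f), reuters.categories(f)[0])
            for f in single if reuters.categories(f)[0] in top8]
    print(top8)       # e.g. earn, acq, crude, trade, ... (the classes of Table I)
    print(len(docs))  # documents available for training and evaluation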

B. Results

Table I shows the performance comparison of the BoW model and the presented Word2Vec-based model. Our model for text classification outperforms the BoW model. A support vector machine (SVM) classifier is used for classification, owing to its state-of-the-art performance.

TABLE I: Comparison of BoW and Word2Vec model performance

Classes      BoW Model          Word2Vec Model
             Prec.   Recall     Prec.   Recall
acq          0.84    0.92       0.94    0.97
crude        0.93    0.75       1.00    0.87
earn         0.66    0.95       0.71    1.00
grain        0.88    0.92       0.94    0.87
interest     0.75    0.75       0.80    0.80
money-fx     0.81    0.83       0.91    0.92
ship         0.90    0.56       1.00    0.75
trade        0.89    0.82       0.94    0.96

Fig. 3 depicts the classification results, where the y-axis is classification accuracy and the x-axis is the number of features. Our proposed model shows better performance at every point.

[Fig. 3: Classification results]
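A minimal sketch of the classification and evaluation step with scikit-learn, shown here for the BoW/tf-idf features (the split ratio and classifier settings are our own placeholders; the paper only states that an SVM is used). The Word2Vec run is analogous, with the tf-idf matrices replaced by stacked doc_vector() features from the earlier sketch:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC
    from sklearn.metrics import accuracy_score, classification_report

    texts = [t for t, _ in docs]    # (text, label) pairs from the setup sketch
    labels = [c for _, c in docs]
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=0.2, random_state=0, stratify=labels)

    # Linear SVM on tf-idf features; per-class precision and recall as in Table I.
    vec = TfidfVectorizer(stop_words="english")
    clf = LinearSVC().fit(vec.fit_transform(X_tr), y_tr)
    pred = clf.predict(vec.transform(X_te))

    print(accuracy_score(y_te, pred))
    print(classification_report(y_te, pred))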
becomes computing and memory sensitive. Sparseness and
high-dimensionality is also there. Hence, using Word2Vec
TABLE I: Comparison of BoW and Word2Vec model as features for text classification is efficient. For large scale
performance data training time is reasonable for Word2Vec vectors. Also,
Dimensionality of Word2Vec vector doesn’t make noticeable
Classes BoW Model Word2Vec Model impact on the result.
Prec. Recall Prec. Recall
acq 0.84 0.92 0.94 0.97 V. C ONCLUSION
crude 0.93 0.75 1.00 0.87 In our work, we present text classification using word
earn 0.66 0.95 0.71 1.00 embeddings. Using word embeddings generated by simple
grain 0.88 0.92 0.94 0.87 neural network as features of text, improves performance of
interest 0.75 0.75 0.80 0.80 the text classification over Bag-of-Words features. Semantic
money-fx 0.81 0.83 0.91 0.92 and Syntactic properties of word embeddings make classi-
ship 0.90 0.56 1.00 0.75 fication of the text more efficient even for less number of
trade 0.89 0.82 0.94 0.96 features. Our approach is simple as only one parameter used
for classification is the word vectors of the text. In future
Fig.3. depicts classification results where y-axis is work, other word embeddings model with better features of
classification accuracy and x-axis shows the number of text can be explored for text classification task. Also, use of
features. Our proposed model shows better performance at external knowledge sources can be explored to improve the
every point. performance.

REFERENCES

[1] Harris, Zellig. "Distributional structure." Word, 1954.
[2] Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. "Indexing by latent semantic analysis." Journal of the American Society for Information Science, 41(6):391-407, 1990.
[3] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Journal of Machine Learning Research 3.Jan (2003): 993-1022.
[4] Petterson, J., Buntine, W., Narayanamurthy, S. M., Caetano, T. S., and Smola, A. J. "Word features for latent Dirichlet allocation." In NIPS, pp. 1921-1929, 2010.
[5] Mikolov, T., Yih, W., and Zweig, G. "Linguistic regularities in continuous space word representations." In NAACL, pp. 746-751, 2013.
[6] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. "A neural probabilistic language model." J. Mach. Learn. Res., 3:1137-1155, 2003.
[7] Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
[8] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

[9] Joachims, T. Text categorization with support vector machines: learning
with many relevant features. In Proceedings of ECML-98, 10th European
Conference on Machine Learning (Chemnitz, DE), pp. 137-142, 1998.
[10] Andrew Mccallum and Kamal Nigam. A comparison of event models for
naive bayes text classification. In Proc. of AAAI Workshop on Learning
for Text Categorization, 1998.
[11] Kamal Nigam, John Lafferty, and Andrew Mccallum. Using maximum
entropy for text classification. In Proc. of IJCAI Workshop on Machine
Learning for Information Filtering, 1999.
[12] David D. Lewis. An evaluation of phrasal and clustered representations
on a text categorization task. In Proc. of SIGIR, 1995.
[13] Karo Moilanen and Stephen Pulman. Sentiment composition. In Proc.
of RANLP, 2007.
[14] Nakagawa Tetsuji, Inui Kentaro, and Kurohashi Sadao. Dependency tree-
based sentiment classification using CRFs with hidden variables. In Proc. of NAACL-
HLT, 2010.
[15] Sonawane, Sheetal S., and Parag Kulkarni. ”Concept based document
similarity using graph model.” International Journal of Information Tech-
nology (2019): 1-12.
[16] Sida Wang and Christopher D. Manning. Baselines and bigrams: Simple,
good sentiment and text classification. In Proc. of ACL, 2012.
[17] Kusner, Matt, et al. ”From word embeddings to document dis-
tances." International Conference on Machine Learning, 2015.
[18] Kenter, Tom, and Maarten De Rijke. ”Short text similarity with word
embeddings.” Proceedings of the 24th ACM international on conference
on information and knowledge management. ACM, 2015.
[19] Jin, Peng, Yue Zhang, Xingyuan Chen, and Yunqing Xia. ”Bag-of-
Embeddings for Text Classification.” In IJCAI, vol. 16, pp. 2824-2830.
2016.
[20] Lenc, Ladislav, and Pavel Král. "Word Embeddings for Multi-label
Document Classification.”Proceedings of the International Conference
Recent Advances in Natural Language Processing, RANLP 2017.
[21] Ma, Long, and Yanqing Zhang. ”Using Word2Vec to process big text
data.” 2015 IEEE International Conference on Big Data (Big Data). IEEE,
2015.
[22] Diaz, Fernando, Bhaskar Mitra, and Nick Craswell. ”Query
expansion with locally-trained word embeddings.”arXiv preprint
arXiv:1605.07891(2016).
[23] Mikolov, Tomas, et al. ”Distributed representations of words and phrases
and their compositionality.” Advances in neural information processing
systems. 2013.
[24] http://www.daviddlewis.com/resources/testcollections/reuters21578/
[25] Mihalcea, Rada, and Paul Tarau. ”Textrank: Bringing order into text.”
Proceedings of the 2004 conference on empirical methods in natural
language processing. 2004.
