Learning Improved Class Vector For Multi-Class Question Type Classification
ABSTRACT
Recent research in NLP has exploited word embeddings to achieve outstanding results in various tasks such as spam filtering, text classification and summarization. Present word embedding algorithms have the power to capture semantic and syntactic knowledge about a word, but not enough to portray the distinct meanings of a polysemous word. Many works have utilized sense embeddings to integrate all possible meanings into the word vector, which is computationally expensive. Context embedding is another way to identify a word's actual meaning, but it is hard to enumerate every context with a small dataset. This paper proposes a methodology to generate improved class-specific word vectors that enhance the distinctive property of a word in a class in order to tackle the light polysemy problem in question classification. The proposed approach is compared with baseline approaches, tested using deep learning models on the TREC, Kaggle and Yahoo questions datasets, and attains 93.6%, 91.8% and 89.2% accuracy respectively.
Keywords: Class-specific vector, Deep Learning Models, Polysemy, Question Classification, Word2vec.
This paper introduces an approach that overcomes the drawbacks of the work mentioned in [7] and other commonly used techniques. Our approach generates improved class vectors that are independent of class labels and also aims to classify multi-class categorized datasets. In contrast to previous work, the proposed approach can generate class vectors for questions labelled with their answer type, such as yes/no, factoid, list and summary, and also for questions labelled with their domain, such as business, entertainment and so on. For example, in previous work, tweets were collected from two classes: hurricane related and unrelated. The class vector for a hurricane related tweet is based on the words having the least cosine similarity with the label 'hurricane'. This approach does not work for questions classified by their expected answer type. Therefore, a general methodology is proposed in this paper to produce class vectors that do not depend on the class label. Our work has been tested on both types of datasets: TREC questions are classified by their semantic class, Kaggle questions are classified by topics or issues related to Python such as run time error, installation etc., and Yahoo Questions are classified by list, summary, factoid and yes/no answer type.

Our proposed approach uses a modified skip gram model to generate word vectors and an improved tf-idf score (tf-idf-cf) to express class information for question classification. The experiments performed on the mentioned datasets using the improved class-specific vector representation help to demonstrate the merit of the proposed approach in tackling the light polysemy problem in the question classification task. The contributions made in this paper are:

● Our work is the first to utilize class-specific word embeddings for multi-class question classification.
● Unlike the previous approach, class vectors are unrelated to class labels.
● Our work targets the polysemy problem in the question classification task.

The paper is structured as follows: section 2 discusses the related literature, feature extraction methods are explained in section 3, the proposed approach is presented in section 4, the experimental setup, dataset descriptions and results are given in section 5, and section 6 concludes the paper with insight into future work, followed by references.

2. RELATED WORK

Question classification is a vital step of the question processing module in a question answering system. It helps to find out the type of answer expected by the user and also the related documents. This section describes the existing work in question classification. The techniques involved in classifying questions fall into three groups: rule based, machine learning and hybrid techniques. The rule based or hand crafted methodology is the very basic approach to classification, in which experts extract words or their combinations as features. In [8], Biswas and others designed syntactic patterns for questions in the TREC QA track. The TREC QA track is a collection of questions which are categorized by Li and Roth into two levels: 6 coarse classes and 50 fine classes. The syntactic rule based approach achieved ~98% accuracy for classification. Another rule based approach added semantic hand crafted rules to the syntactic rules in [9] on the same dataset and tested these rules using machine learning models. The linear SVM gives 91.6% accuracy and the manual approach gives 97% accuracy. Although rule based approaches are more accurate in the classification task, this practice takes enormous effort and time. With machine learning models, these drawbacks can be overcome and more latent features of the text can be extracted. [10] explored deep learning models such as CNN, LSTM and a hybrid framework for question classification using the word2vec algorithm on a Turkish translation of the UIUC English dataset. These models were tested on word vectors generated using the skip gram and CBOW embedding models. The study found that the skip gram model's word vectors brought the highest accuracy with the CNN model. The RNN model and its variants have also gained popularity [11,12] in the same task. Stefan and others [11] developed an algorithm for assessing question quality using a sequence model. The questions are a collection of student feedback from a tutoring system, iSTART. In total, 4575 questions were collected and manually coded between very shallow (1) and very deep (4). The experiments were implemented on multiple RNN models such as GRU, Bi-GRU and LSTM using GloVe and FastText embeddings. The Bi-GRU model gives the best performance with 81.22% accuracy. Bi-LSTM has also given improved results on a question collection about daily meetings and conversation [12]. The study reported an accuracy of 90.9% and a loss value of 0.316. Question classification on a Chinese dataset tested the word2vec algorithm with an attention based deep learning model and compared the results with other models such as CNN, LSTM and Bi-GRU [13]. The results show that the precision of the attention based LSTM and the attention based Bi-GRU CNN model is the same, but the f-score of the attention based Bi-GRU CNN is the highest among all models, i.e. 0.784. Various previous works have contributed to the word2vec algorithm [14,15,22], giving a new direction to different NLP classification tasks. But all the existing techniques generate one embedding per word, which fails to determine the correct sense of a word when the word actually has multiple senses, i.e. a polysemous word.

To eliminate the polysemy problem, several models have been introduced to induce multiple embeddings for a word, and these multiple embeddings were trained on machine learning classifiers [17,16]. One way to discriminate between different senses of a word is to learn the context of the
target word. The k-means clustering algorithm used by Huang et al. [18] empirically assigns k senses to each ambiguous word. The local contexts of a word are grouped into k clusters, which limits the knowledge that can be gathered for distinguishing the related sense. Neelakantan et al. [17] extended this idea and utilized the skip-gram model for context clustering. A fusion of the skip-gram model and context clustering was proposed, where the cluster centroid is equivalent to the sense vector and is sent to the skip gram model for updating. This approach suffers from an expensive training computation cost.

Some research work has also focused on morphology, i.e. the sub-word level, to obtain multiple embeddings per word. Unlike previous work, Bojanowski et al. [19] explored the internal structure of a word and modified the skip-gram model with character n-grams. Each character n-gram has a vector representation and their sum generates the word vector. This approach has been combined with a Gaussian mixture model, where each Gaussian component represents a different sense of the word [20,21].

The convolutional neural network has been explored in much research work to help produce context relevant embeddings. Jingyun et al. [23] designed a two layer recurrent CNN model to capture context relevant concepts. The first layer presents the (word, concept) pair using pre-trained word vectors (local information) and the second layer hidden states are concatenated using a Bi-GRU according to the word input time (global information). Both kinds of information are aggregated at an attention layer to generate context relevant word embeddings for classifying short text datasets: TREC, MR (movie review) and the AG news corpus. The GCN (Graph Convolutional Network), a variant of the CNN, has been explored to extract local information using lexical relations in language and global information from a BERT model [24-26]. Both kinds of knowledge are combined via an attention mechanism through different layers of the network. The hybrid of GCN and BERT (VGCN-BERT) has performed outstandingly in text classification on various datasets: SST-2, MR, CoLA, ArangoHate and FountaHate acquired 91.93%, 86.49%, 83.68%, 88.43% and 81.26% f1-score respectively.

The class-specific vector representation of words in a corpus [7] is another popular work for text classification. Modified versions of the skip-gram model and the continuous bag of words (CBOW) model were proposed for generating class vectors. The linear compositionality property of word vectors has been exploited for adding class information to general word vectors [35-38]. A parallel CNN framework was designed for classifying binary categorized datasets: SemEval 2013 and hurricane related tweets. With the proposed features and deep learning model, SemEval 2013 attains 73.15% and the hurricane related tweets attain 88.19%. The drawbacks observed in this work (see section 1) have been removed in our proposed approach. We test our approach on the TREC, Kaggle and Yahoo question datasets using deep learning models.

3. FEATURE EXTRACTION METHODS

3.1 Modified Skip Gram Model

Figure 1 Architecture of Modified Skip-Gram Model. N is the size of the vocabulary; M is the context window size.

The skip gram model is a shallow neural network architecture used to implement the word2vec algorithm. It predicts the context word embeddings from a given target word. The earliest version of the model represents a word with a single vector, which is not sufficient to address the polysemy obstacle in text classification. In the proposed approach, the modified skip gram model [7] is used to update the context vectors of a word using its class-specific embedding; its architecture is given in Figure 1, and the updated objective function for the modified skip gram model is given in Equation 1.

\mathcal{L} = \sum_{w \in C} \log p(V_{w,c})    (1)

where w is the target word and C is the corpus.
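As an illustration of the objective behind Equation 1, the following is a minimal NumPy sketch of one skip-gram update with a full softmax; it omits negative sampling, sub-sampling and the class-specific update details of the full model, and the function and variable names (skipgram_step, W_in, W_out) are illustrative rather than taken from the actual implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def skipgram_step(W_in, W_out, target_idx, context_idx, lr=0.01):
    """One (target, context) update of a plain skip-gram model.

    W_in  : (N, d) input/word embeddings; in the modified model these rows
            would hold the class-specific vectors V_{w,c}.
    W_out : (N, d) output/context embeddings.
    """
    v = W_in[target_idx]               # current embedding of the target word
    scores = W_out @ v                 # (N,) unnormalised scores over the vocabulary
    probs = softmax(scores)            # p(context | target), the term inside Equation 1
    loss = -np.log(probs[context_idx] + 1e-12)

    # Gradient of the negative log likelihood w.r.t. the scores
    grad_scores = probs.copy()
    grad_scores[context_idx] -= 1.0

    # Back-propagate into both embedding matrices
    grad_v = W_out.T @ grad_scores     # (d,)
    W_out -= lr * np.outer(grad_scores, v)
    W_in[target_idx] -= lr * grad_v
    return loss
```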
3.2 Tf-idf-cf

The popular tf-idf algorithm is one of the most widely used statistical measures for finding the relevance of each word in a dataset [27,28]. Many researchers have continuously improved the weighting formula to perform better in various NLP applications. We use an improved tf-idf scoring formula to find the relevant terms in each class. Liu et al. [29] introduced an in-class characteristic into the tf-idf (term frequency - inverse document frequency) weighting formula to exploit its ability to differentiate documents among different classes. The in-class feature says that if a term's frequency is high and the term is present in a small portion of documents, then it is qualified to discriminate documents into classes. The formula for tf-idf-cf is written in Equation 2:

a_{ij} = tf_{ij} \cdot \log\left(\frac{N}{n_j}\right) \cdot \frac{n_{c_{ij}}}{N_{c_i}}    (2)

where tf_{ij} is the term frequency of term j in document i, N is the total count of documents, n_j is the count of documents in which term j appears, n_{c_{ij}} is the count of documents in which term j is present within the same class c that document i belongs to, and N_{c_i} is the number of documents within the same class that document i belongs to. The improved formula with a smoothing technique is given in Equation 3:

a_{ij} = \log(tf_{ij} + 1.0) \cdot \log\left(\frac{N + 1.0}{n_j}\right) \cdot \frac{n_{c_{ij}}}{N_{c_i}}    (3)
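To make the computation of Equation 3 concrete, the following is a minimal Python sketch that scores every term of every document; it assumes tokenized questions with one class label each, and the function and variable names are illustrative rather than those of the actual implementation.

```python
import math
from collections import Counter

def tf_idf_cf(docs, labels):
    """Smoothed tf-idf-cf weight (Equation 3) for every term of every document.

    docs   : list of token lists (one per question)
    labels : class label of each document
    Returns a list of {term: weight} dicts, one per document.
    """
    N = len(docs)
    # n_j: number of documents containing term j
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # N_ci: number of documents per class; n_cij: per-(class, term) document counts
    class_docs = Counter(labels)
    class_df = Counter()
    for doc, label in zip(docs, labels):
        for term in set(doc):
            class_df[(label, term)] += 1

    weights = []
    for doc, label in zip(docs, labels):
        tf = Counter(doc)
        w = {}
        for term, tf_ij in tf.items():
            idf = math.log((N + 1.0) / df[term])
            cf = class_df[(label, term)] / class_docs[label]
            w[term] = math.log(tf_ij + 1.0) * idf * cf
        weights.append(w)
    return weights
```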
4. PROPOSED APPROACH

In this paper, a methodology for automatic question classification is proposed to handle the light text polysemy problem. Using the linear compositionality property of word embeddings, the different meanings of a word are captured in a class information vector, one vector for each class [39]. This section describes the procedure for executing this approach; its flowchart is shown in Figure 2.

(a) Input dataset

The details of the question datasets used for this work are given in section 5.2. In the TREC dataset, the training set has questions with their labels and the testing set has questions only, whereas the Kaggle dataset is split into training and testing sets with a 7:3 ratio and all questions are labelled.

(b) Pre-processing

Text pre-processing is an important step to convert raw textual data into a structure better suited for machine learning models. The following steps are taken in sequence to pre-process the questions: Tokenization, Stop Words Removal and Stemming. The output of the pre-processing step gives the necessary words, which are given as input to the word2vec algorithm [40-43].
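For illustration, a minimal sketch of such a pipeline using NLTK is shown below; the specific tokenizer, stop word list and stemmer are assumptions, since the paper only names the three steps.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-time resource downloads; the exact stemmer and stop list are illustrative choices.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

_STOP = set(stopwords.words("english"))
_STEM = PorterStemmer()

def preprocess(question):
    """Tokenization -> stop word removal -> stemming, in that order."""
    tokens = word_tokenize(question.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in _STOP]
    return [_STEM.stem(t) for t in tokens]

# Example: preprocess("What causes a runtime error during Python installation?")
```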
(c) Vector Representation of Words

The question classification accuracy depends upon the quality of the vector representation of words. In our proposed approach, improved class-specific word vectors are generated to tackle the light polysemy problem in question classification. Figure 3 illustrates the flow of the proposed approach, which is based upon Sicong's work [7]. After pre-processing, the words belonging to a class are separated and the tf-idf-cf weight is calculated for each word in the class. A higher tf-idf-cf score signifies a higher occurrence of the word in the class; in short, that word can represent the class. For generating word embeddings, the set of words of a class is given as input to the modified Skip-Gram model of Sicong's work [7]. Then the top-n words with the highest tf-idf-cf score are picked and the average of their word vectors (Equation 4) gives the class vector, vector(class); the approach has been tested for different values of n. Thus, this methodology generates as many class vectors as there are classes in the dataset.

vector(class) = \frac{1}{n}\left(vector(w_1) + vector(w_2) + \cdots + vector(w_n)\right)    (4)

The representation of the class-specific word embedding V_{w,c} is obtained through the linear compositionality property of vectors, as the summation of the general word vector V_w and the class vector vector(class), as shown in Equation 5.

V_{w,c} = V_w + vector(class)    (5)

The V_{w,c} for all the words of a class are given for training in the modified skip gram model to obtain updated word vectors. These vector representations capture semantic as well as conceptual knowledge for each occurrence of a word. Essentially, the number of embeddings per word depends on the number of classes in which the word occurs. Thus we get multiple class-specific word embeddings for training classifiers on multi-class datasets.
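A small Python sketch of the class vector construction of Equations 4 and 5 is shown below; it assumes word vectors and per-class tf-idf-cf scores are available as dictionaries, and the function names are illustrative, not taken from the actual implementation. The resulting class-specific embeddings would then be fed back into the modified skip gram model as described above.

```python
import numpy as np

def class_vector(word_vectors, class_scores, n=10):
    """Equation 4: average the vectors of the top-n tf-idf-cf words of a class.

    word_vectors : dict word -> np.ndarray of shape (d,)
    class_scores : dict word -> tf-idf-cf score of the word in this class
    """
    top = sorted(class_scores, key=class_scores.get, reverse=True)[:n]
    top = [w for w in top if w in word_vectors]
    return np.mean([word_vectors[w] for w in top], axis=0)

def class_specific_embeddings(word_vectors, class_words, vec_class):
    """Equation 5: V_{w,c} = V_w + vector(class) for every word of the class."""
    return {w: word_vectors[w] + vec_class
            for w in class_words if w in word_vectors}
```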
(d) Question Classification

The proposed methodology of generating class-specific word vectors for question classification is evaluated using deep learning models against the existing class vector work and other baseline approaches [44-48].
5. EXPERIMENT AND DATASET

5.1 Experimental Setup

The introduced approach for automatic question classification has been analyzed on a system with 8GB main memory and an Intel Core i5 processor, implemented in a Python 3.6 environment using Keras [30]. The word2vec algorithm is implemented with the NumPy library using the following hyper-parameter values: window_size=3, embedding_dimension=150, epochs=40, learning_rate=0.01. The vector representations of the questions and their labels are fed to the classifiers under 10-fold cross validation to calculate the accuracy of the proposed approach. The confusion matrix is computed to find the classification accuracy on the testing data with the formula given in Equation 6:

Accuracy = \frac{\text{Correctly labelled questions}}{\text{Total number of questions}}    (6)
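As an illustration of the evaluation protocol, the following is a minimal NumPy sketch of the accuracy measure of Equation 6 and a simple 10-fold split of the question indices; it is a simplified stand-in for the actual Keras training loop, and the function names are illustrative.

```python
import numpy as np

def accuracy(pred_labels, true_labels):
    """Equation 6: correctly labelled questions / total number of questions."""
    pred_labels = np.asarray(pred_labels)
    true_labels = np.asarray(true_labels)
    return float((pred_labels == true_labels).sum()) / len(true_labels)

def kfold_indices(num_questions, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_questions)
    folds = np.array_split(idx, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx
```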
5.2 Datasets

Table 1. Number of Questions in the TREC, Kaggle and Yahoo Questions Datasets

| Dataset Name | About | Total Samples | Training Samples | Testing Samples |
| TREC [32] | Open domain questions / 6 coarse and 50 fine classes | 6463 | 5453 | 1000 |
| Kaggle [33] | Python question collection from Stack Overflow / 13378 classes | 342184 | 239528 | 102656 |
| Yahoo Questions [34] | Collection of questions posted on Yahoo / 4 classes | 43500 | 30450 | 13050 |

The proposed approach has been evaluated on three multi-class question datasets. The first is the TREC dataset, an open domain dataset which is classified into six main categories and further into 50 fine categories. The second is a Kaggle dataset, a collection of Python related questions containing 342184 questions categorized into 13378 classes. The third dataset is a collection of Yahoo Questions that are categorized on the basis of the expected answer, such as yes/no, summary, factoid and list. The classifiers are fed with 70% of the questions for training and the remaining 30% for testing. The details of the training and testing questions are given in Table 1.

5.3 Baseline Approaches

This section explains the commonly used methodologies for question classification.

1. Word2vec [13,15]: With the Word2vec algorithm, words are represented as continuous vectors. One word has one vector representation.

2. Word2vec + tf-idf-cf [31]: This methodology provides weights to word embeddings using the tf-idf-cf score [29], which makes it possible to compute more than one embedding for a word. The higher the tf-idf-cf score of a word, the more important the word is in the class.

3. Word2vec + class vector [7]: The approach described in [7] generates class vectors to represent a class, but with a few drawbacks (mentioned in section 1). Our work eliminates these gaps and increases the classification accuracy.

5.4 Results

Table 2. Results on the TREC dataset

| Feature set | CNN [12] | Bi-LSTM [10] | ABBC [13] |
| Word2vec | 82 | 83.5 | 85 |
| Word2vec + tf-idf-cf | 83.9 | 85.7 | 87.7 |
| Word2vec + class vector | 86.3 | 88.6 | 90.2 |
| Proposed approach | 88.2 | 91.7 | 93.6 |

*ABBC - Attention Based Bi-GRU CNN

Table 3. Results on the Kaggle dataset

| Feature set | CNN [12] | Bi-LSTM [10] | ABBC [13] |
| Word2vec | 81.1 | 84.4 | 87.3 |
| Word2vec + tf-idf-cf | 83 | 86 | 87.3 |
| Word2vec + class vector | 85.8 | 87.4 | 89.1 |
| Proposed approach | 87.4 | 89.5 | 91.8 |
Figure 5 Graphical Representation of Results on the Kaggle dataset.

Table 4. Results on the Yahoo Questions dataset

| Feature set | CNN [12] | Bi-LSTM [10] | ABBC [13] |
| Word2vec | 75.8 | 76.5 | 80 |
| Word2vec + tf-idf-cf | 78 | 78.5 | 82.3 |
| Word2vec + class vector | 81.2 | 82.7 | 85.8 |
| Proposed approach | 84.5 | 86.4 | 89.2 |

We use the CNN, Bi-LSTM and ABBC models with different feature sets to examine the efficiency of the proposed approach. The word embedding is the essential feature for classification, but the addition of class information to the word vectors increases their discriminative power in the dataset. The class vectors proposed in the last section are used to update the context word embedding of a word in a class and also expand the distinctive property of a word for each appearance.

The elementary target of our work is to determine the efficacy of the question classification task utilizing the proposed improved class-specific word embeddings. The graphical representation (Figures 4-6) of the results attained on the TREC, Kaggle and Yahoo Questions datasets shows that the proposed approach achieves competitive and better results when compared with the baseline approaches.

The proposed approach gives 93.6%, 91.8% and 89.2% classification accuracy on the TREC, Kaggle and Yahoo Question datasets respectively with the ABBC model, which shows an improvement with respect to the other baseline approaches (Tables 2-4). The comparative analysis made with different vector features using deep learning models concludes that the ABBC framework is best at extracting semantic and contextual information from class-specific vectors to handle the light polysemy problem in question classification. When compared with the main baseline approach, the proposed work shows a ~3% improvement.

The tf-idf-cf weight of a word has also shown its significance in question classification, as seen in the second baseline approach and the proposed approach. The word2vec + tf-idf-cf feature increases accuracy by ~2.5% for all datasets when compared with word2vec