Neural Computing and Applications

https://doi.org/10.1007/s00521-020-04725-w

ORIGINAL ARTICLE

A deep learning analysis on question classification task using Word2vec representations
Seyhmus Yilmaz1 • Sinan Toklu1

Received: 21 February 2019 / Accepted: 7 January 2020


© Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract
Question classification is an essential first step in automatic question answering systems, and linguistic features play a significant role in building an accurate question classifier. Recently, deep learning systems have achieved remarkable success in various text-mining problems such as sentiment analysis, document classification, spam filtering, document summarization, and web mining. In this study, we investigate several deep learning architectures for question classification in Turkish, a highly inflectional, agglutinative language in which words are formed by adding suffixes (morphemes) to a root. As a non-Indo-European language, Turkish has some unique features that make it challenging for natural language processing. For instance, Turkish has no grammatical gender or noun classes. User questions in Turkish are used to train and test the deep learning architectures, and the architectures are compared in terms of test accuracy and 10-fold cross-validation accuracy. We use two major deep learning models, long short-term memory (LSTM) and convolutional neural networks (CNN), and we also implement the CNN-LSTM and CNN-SVM combinations, along with a number of variants of these architectures obtained by changing the vector size and the embedding type. In addition, we build word embeddings using the Word2vec method with the CBOW and skip-gram models and different vector sizes on a large corpus composed of user questions. We further investigate the effect of using different pre-trained Word2vec word embeddings on these deep learning architectures. The experimental results show that the choice of Word2vec model has a significant impact on the accuracy of the different deep learning models. Additionally, because no labeled Turkish question dataset was available, another contribution of this study is a new Turkish question dataset translated from the UIUC English question dataset. Using these techniques, we reach an accuracy of 94% on the question dataset.

Keywords Deep learning · Question classification · SVM · Word embedding · Word2vec

✉ Seyhmus Yilmaz
seyhmusyilmaz@duzce.edu.tr

Sinan Toklu
sinantoklu@duzce.edu.tr

1 Department of Computer Engineering, Faculty of Engineering, Düzce University, Konuralp Campus, 81620 Düzce, Turkey

1 Introduction

The amount of data on the Internet, including the size and variety of online text documents, is growing rapidly. As a result, users are dissatisfied with programs that return only ranked lists of texts, which they must then spend time browsing manually. In most natural language processing settings, what a user wants is an accurate answer to the question asked. The aim of a question answering (QA) system is to reply directly to natural language queries posed by people. QA is a popular research area that combines NLP with information retrieval.

Question classification, or question categorization, identifies the category of a question posed to a QA system. A significant part of QA systems [1–3] and other dialog systems is to map a question to the probable category of its answer [4]. For instance, the question "What country is famous for chocolate" ought to be classified into the kind of


location (country). Such information narrows the search space for finding the correct answer string. It may also suggest different strategies for searching for and verifying a candidate answer. For instance, classifying the question "Who is the prime minister of Belgium" as a "human (person)" question means that a search strategy specific to the human (person) type should be used.

Document categorization is closely related to question categorization [5]. Although document categorization has received a great deal of scientific attention in recent times, question categorization, particularly in Turkish, remains a novel academic problem. While a number of academic publications have attempted to classify documents in Turkish, to the best of our knowledge there are no significant academic papers on question categorization in Turkish. The most significant distinction between document categorization and question categorization is that a document is much longer than a question, so every word and character in a question could be important. As a result, it is harder to extract features from a single question than from a long document [5].

In question classification tasks, there are three families of techniques: machine learning, rule-based, and hybrid techniques [6, 7]. In our study, deep learning techniques (CNN [8], LSTM [9]), SVM [10], and their combinations are used to classify user questions based on the Word2vec methods, both skip-gram and continuous bag-of-words (CBOW). Our major contributions start from the observation that most papers related to question classification focus on English and have not studied an agglutinative language, in which words are formed by adding suffixes (morphemes) to the word root. Turkish has proven challenging for NLP. Most of the difficulties stem from the complicated morphology of Turkish and from how morphology interacts with syntax [11]. As a non-Indo-European language, Turkish has some distinctive features that make it difficult for natural language processing. For instance, Turkish does not have grammatical gender or noun classes [12]. To express differences in age, courtesy or familiarity toward the addressee, social distance, and levels of politeness, the language relies on second-person pronouns. In general, it is hard to capture these nuances with natural language processing methods that have commonly been used for Indo-European languages such as German and English. The meaning of a noun can also be changed by the extensive use of affixes [12]. Although a number of studies on text classification for Turkish based on Word2vec methods have been carried out [13], to our knowledge there is no significant study on text classification in Turkish based on deep learning using Word2vec skip-gram or CBOW.

For linguistic and grammatical reasons, natural languages have numerous words that derive from the same morphological class. In Turkish especially, there is a huge number of derived words because of the structure of the language [14]. Turkish uses derivational and inflectional affixes to derive new words; a word can naturally take several hundred forms, and millions of forms can be created from every verbal root [15].

For that reason, a single Turkish word can carry nearly the meaning of a whole English sentence. A Turkish word used in context can take numerous derivational and inflectional suffixes, for example:

gör + ebil + iyor + du → (s)he was able to see (it)
okul + da + idi + ler → they were in the school
gönder + ebil + ecek + se + n → if you will be able to send (it)

Consequently, lemmatization is in most cases very important for obtaining the uninflected word forms needed to apply information retrieval (IR) and other document processing techniques to Turkish. Lemmatization is generally used to improve the performance of information retrieval systems. Its main aim is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a root form. It provides a link between the surface forms of related words and their dictionary forms. For the reasons mentioned above, there is no lemmatization tool for Turkish that is as effective as those available for English. This is another difficulty in text processing for Turkish (for more details about the difficulties of Turkish, see [16]).

Another problem with Turkish is the lack of resources for studying textual data. At the beginning of this study, we were unable to find any Turkish question dataset, so we decided to translate an English question dataset into Turkish in order to assess the performance of the proposed methods as well as possible. After this study, we plan to share this Turkish question dataset with researchers who wish to work in this area.

Previously, bag of words (BoW) was a common technique for extracting features from questions [6], but it has some major disadvantages. Bag-of-words techniques cannot capture the semantics of a word. For instance, the words "car" and "automobile" in training questions are frequently used in the same context [17]. As a result,


misclassification would be possible. Moreover, the vectors for these words are orthogonal in a bag-of-words system. This makes the problem more serious when building a model for questions. For instance, "powerful", "strong", and "Paris" are all equally distant from one another in BoW [17].

To tackle this challenge, we require a mathematical representation of words. In this representation, every word x is assigned a vector f(x) such that if x and y are syntactically and semantically similar, then f(x) and f(y) are nearby vectors. A representation of words with this property, called distributed representations of words or Word2vec, was introduced by Mikolov et al. [18] and is used in the feature extraction step of our study. The idea behind this word representation technique is that words with a semantic or syntactic relation appear in the same contexts with high likelihood [19]. Consequently, if word1 and word2 occur in the same context, the vectors of these two words ought to be somewhat closer to each other.

In Word2vec, representing words in a vector space helps learning algorithms achieve better results in NLP systems by grouping related words. Distributed representations of words computed using neural networks are remarkable because the computed vectors explicitly encode numerous linguistic regularities and patterns [18]. Somewhat unexpectedly, many of these patterns can be expressed as linear translations. For instance, the result of the vector calculation vec("king") − vec("man") + vec("woman") is closer to vec("queen") than to any other word vector [18].

Our main contributions are as follows:

(1) Most papers related to question classification focus on English and have not studied an agglutinative language, in which words are formed by adding suffixes (morphemes) to the word root. Turkish has proven challenging for NLP. As a non-Indo-European language, Turkish has some distinctive features that make it difficult for natural language processing.
(2) Another contribution is an analysis of the effect of using different pre-trained Word2vec word embeddings on different deep learning architectures. The first technique shown in our study uses the Word2vec algorithms, CBOW and skip-gram, to cluster the words in the corpus and transform all words into vectors in the space. The other technique builds every feature vector as a combination of the vectors of the words of the question. The Word2vec technique is used to extract the word vectors. After that, CNN, LSTM, CNN-LSTM, and CNN-SVM combinations are used for classification. By applying these four techniques, an average accuracy of 92.77% using CNN, 90.86% using LSTM, 91.8% using CNN-LSTM, and 92.07% using CNN-SVM was achieved on the Turkish question dataset.
(3) No labeled Turkish question dataset was available, so another contribution of this study is a new Turkish question dataset, translated from the UIUC English question dataset [20].

1.1 Deep learning for text mining

Recently, deep learning has attracted the attention of developers and researchers, since deep learning algorithms have achieved extraordinary performance in a variety of natural language processing applications. Deep learning algorithms are a type of machine learning algorithm consisting of multiple layers of perceptrons that model the human brain [21]. In other words, deep learning is an architecture that includes numerous layers of nonlinear data processing and a technique for learning feature representations at successive layers. Various deep learning architectures have been used in different tasks, for example long short-term memory (LSTM), deep neural networks (DNN), deep restricted Boltzmann machines (RBM), and convolutional neural networks (CNN). In addition, deep learning architectures have obtained remarkable results on natural language processing problems such as sentiment analysis and text classification. In the following, we describe some related works that use deep learning.

In [22], the authors use long short-term memory (LSTM) to predict the sentiment of Roman Urdu tweets. In addition to the LSTM model, they also use other classifiers such as random forest and Naive Bayes. Their LSTM deep learning model with word embeddings outperforms the other models on sentiment analysis in Roman Urdu. The authors of [23] introduce a method to identify fuzziness in legal text with a deep neural network model for the Thai language. In their paper, fuzziness is defined as an imprecise meaning in legal text documents, which can be vague when read by a system. They generated a labeled corpus from four Thai law codes: the Civil and Commercial Code, the Civil Procedure Code, the Criminal Code, and the Criminal Procedure Code. To classify the fuzziness, three situations are defined: a verdict that requires the production of evidence, a verdict that depends on a judge's view, and a verdict that points to other units. In addition to the deep neural network model, the authors use some machine learning algorithms such as SVM (support vector machine, random forest, decision

tree). According to the experimental results, the deep neural network performs considerably better, with an accuracy of 97.54% on the whole dataset. In [24], the authors combine sentiment prediction and morphological evaluation for Punjabi, an Indian language, using deep learning. To accomplish this, they perform sentiment analysis of farmer suicide cases reported in Punjabi on Punjabi online sites; 275 suicide cases from Punjab are used. For classification, they use a deep neural network and morphological Punjabi text classification. They achieve an accuracy of 95.45%.

An ensemble architecture of shallow and deep neural network methods is applied to Vietnamese sentiment analysis in [25]. In this paper, the authors experiment on three different datasets. In [26], the authors build a Vietnamese chatbot using the seq2seq model with an attention mechanism, but they use a limited dataset. The chatbot is able to respond to users, but the messages it produces should be improved by enlarging the dataset in order to obtain a meaningful dialogue. The authors of [27] use a method based on deep learning algorithms to determine whether a document in Russian was written by a specific author. They also compare different vector representations of the textual data, for example character n-grams and label encoding. They run the experiments on Russian corpora gathered from common social media websites. Lastly, they compare different deep learning algorithms such as long short-term memory (LSTM), convolutional neural networks (CNN), recurrent neural networks (RNN), and combinations of these architectures. In [28], a novel framework for managing negative messages on social media websites is introduced. To obtain better classification and training performance on social mentions, the authors considerably alter the way the deep learning architecture layers are combined. In addition, they re-train the pre-trained word embedding vectors so that they better reflect sentiment terms, and introduce the resulting sentiment-embedded word vectors. The authors of [29] employ an ensemble architecture that combines long short-term memory (LSTM) and convolutional neural network (CNN) architectures for sentiment analysis of Arabic tweets. In this model, they do not use any feature extraction technique or any complicated method to obtain particular features. They use only a pre-trained word vector representation, which is pre-trained on a Twitter corpus.

2 Related works

In classic question answering systems, there are three separate steps [30]:

1. Question processing: The purpose of this step is to understand the questions asked by users [31]; logical operations are applied for the representation and classification of the questions. In other words, this step classifies the questions asked by users, and it is also called the question classification step.
2. Document extraction and processing: This step selects a set of related documents and extracts a set of paragraphs, depending on the focus of the question. The answer is sought within these paragraphs.
3. Answer processing: This step aims to select the response based on the related fragments of the documents. This requires pre-processing of the data so as to pair an answer with the question asked.

Fig. 1 shows the common architecture of a question answering system [32].

There are numerous techniques for solving the question classification problem. Many of them can be separated into three groups: rule-based, machine learning, and hybrid techniques [6, 7]. In rule-based techniques [33], the system tries to match questions against manually written rules. However, deciding on precise rules takes an enormous amount of time and effort, because a wide variety of question types has to be understood. For highly inflectional languages such as Turkish, it is very difficult to find all probable question types. The main reason why this kind of classification method is uncommon is that its overall accuracy is not even close to that of approaches that use machine learning [6]. In [34], the authors argue that machine learning approaches are better than manual methods. In contrast to manual approaches, machine learning approaches provide a considerably easier way to categorize questions. Because the classifier is learnt from the data, this type of implementation can easily be adapted to a new system.

Finally, hybrid approaches are novel and not widespread; only a limited number of papers use this technique. We briefly mention some of these studies. The authors of [5] proposed a hybrid method for a Persian closed-domain question classification system. They created a taxonomy themselves and prepared a database of 9500 questions with the assistance of a few scholars. They achieve a satisfactory accuracy of 80.5%, considering the high number of question classes.

Unlike rule-based methods, machine learning methods are able to automatically build an accurate classification implementation using various features of questions [5]. In


Fig. 1 The general architecture of NLQA system
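The three steps listed for classic question answering systems can be illustrated with a minimal Python sketch. It is illustrative only: the rule patterns, category names, function names, toy passages, and the word-overlap ranking are hypothetical assumptions and do not reproduce the system in Fig. 1 or any cited work. Step 1 uses hand-written rules in the spirit of the rule-based techniques; step 3 is reduced to pairing the category with the best passage, and real answer extraction is omitted.

```python
import re

# Hypothetical hand-written rules: (pattern, expected answer category).
RULES = [
    (re.compile(r"^(what|which) country\b"), "LOCATION"),
    (re.compile(r"^who\b"), "HUMAN"),
    (re.compile(r"^(how many|how much)\b"), "NUMERIC"),
]


def classify_question(question):
    """Step 1, question processing: return the expected answer category."""
    text = question.lower().strip()
    for pattern, category in RULES:
        if pattern.search(text):
            return category
    return "UNKNOWN"


def retrieve_passages(question, passages, top_k=1):
    """Step 2, document extraction: rank passages by words shared with the question."""
    terms = set(re.findall(r"\w+", question.lower()))
    ranked = sorted(
        passages,
        key=lambda p: len(terms & set(re.findall(r"\w+", p.lower()))),
        reverse=True,
    )
    return ranked[:top_k]


def answer(question, passages):
    """Step 3, answer processing (simplified): pair the category with the best passage."""
    return classify_question(question), retrieve_passages(question, passages)[0]
```

For example, with two toy passages, `answer("What country is famous for chocolate", ["Belgium is famous for chocolate.", "Ankara is the capital of Turkey."])` pairs the category `LOCATION` with the first passage. Writing such rules by hand for every question type is exactly the effort that the machine learning approaches try to avoid.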

addition, various machine learning approaches are used for question classification, for example SVM, Naive Bayes, decision trees, K-nearest neighbors, and deep learning classifiers. However, SVM (support vector machine) is the main machine learning approach used for question classification [30]. The authors of [30] employed a support vector machine and dimension reduction with bag-of-n-grams feature vectors. To achieve their goals, the authors chose to employ as few linguistic features as possible.

In some machine learning approaches, questions are reformulated as a tree. In [35], the authors used a tree kernel with an SVM to classify questions and achieved an accuracy of 80.2%, but they did not use semantic or syntactic features in their experiments. Moreover, the authors of [36] implemented a kernel function called Hierarchical Directed Acyclic Graph (HDAG) that directly accepts structured natural language information, for example several levels of chunks and their relations. The authors of [37] proposed a hierarchical method using the SNoW learning architecture to classify questions. In that study, a two-phase classification process is employed. In the first phase, the five most likely coarse-grained question categories are returned. In the next phase, the question is categorized into one of the child categories of these five coarse-grained categories, with an accuracy of 84.2%.

In [4], the authors proposed two approaches to augment the semantic features of defined headwords based on WordNet. Using Maximum Entropy (ME) and linear support vector machine (SVM) methods, they reach accuracies of 89.2% and 89.0%, respectively. Mollaei in [38] classifies sentences into two levels, coarse-grained and fine-grained categories, based on the category of the answer to each question. They then extract features using a sliding-window setting on the conditional random fields (CRF) architecture and train a CRF question classifier. The aim of this paper is to categorize Persian questions, and the authors collected the dataset used for the study themselves. Most of the questions are obtained from junior high and primary school documents, and the rest are taken from commonly posed questions on some Internet sites. A satisfactory accuracy of 79.8% was achieved, considering the large number of question categories. Mishra et al. combine semantic, syntactic, and lexical features, which yields better classification results. Additionally, they adopt three different methods, nearest neighbors (NN), support vector machines (SVM), and Naive Bayes (NB), based on bag of n-grams and/or bag of words. The authors also show that combining the semantic, syntactic, and lexical features with the SVM classifier improves the classification results.

Loni in [39] uses roughly the same features as those introduced by the authors of [4]. The improvement is that the authors employed a dimension reduction method close to principal component analysis (PCA), called latent semantic analysis (LSA), to reduce the feature space to a much smaller dimension. In their study, back-propagation neural networks (BPNN) and support vector machines are used. Their results show that back-propagation neural networks obtain better results than SVM. Razzaghnoori et al. [6] use clustering algorithms to cluster the words in the vocabulary and then


Table 1 Outline of related studies

Language | Dataset | Feature extraction method | Classification technique | Accuracy (%) | References
English | UIUC question dataset | Bag of n-grams | SVM | 87.4 | [35]
English | UIUC question dataset | Bag of n-grams | DT | 84.2 | [35]
English | UIUC question dataset | Bag of n-grams | NN | 79.8 | [35]
English | UIUC question dataset | Bag of n-grams | NB | 83.2 | [35]
English | UIUC question dataset | Words, POS tags, chunks, head chunks, named entities, semantically associated words; operators over these basic features are used to generate more complicated features | SNoW | 91 | [34]
English | Penn treebank using an additional treebank with 1153 words | Lexical feature, POS tags | – | – | [33]
English | UIUC question dataset | N-grams, named entities | SVM | 82.0 | [30]
English | UIUC question dataset | N-grams | SVM | 80.2 | [30]
English | NTCIR-QAC1 | Semantic information, named entities, words | SVM using HDAG kernel | 88.0 | [36]
English | UIUC question dataset | WordNet for quoted and target strings; lexical and syntactic info: chunks, named entity tags, language model, POS tags | ME (maximum entropy model) | 86.0 | [40]
English | UIUC question dataset | Words, named entity, semantic information | – | 91 | [41]
English | UIUC question dataset | WordNet semantic features for n-grams, word shape, headword, wh-words, headword | SVM and ME | 89 | [4]
Persian | Mostly junior high and primary school | POS tags, N-gram, position of tokens, question informer, words, question words | CRF | 85.3 | [38]
Persian | QURANIC Question | N-gram, Verse Finder, special word detection, POS tags, lemma, length of question, normalized word | SVM with defined rules | 75.9 | [5]
English | UIUC question dataset | Headwords, word-shapes, related words, hypernyms, bigrams, wh-words | – | 93.8 | [39]
Persian | UTQD.2016 | Word2vec, tf-idf | MLP, SVM, LSTM, RNN | 85 | [6]

they convert every question into a vector. MLP and SVM are then used for classification. With these techniques, they achieve an average accuracy of 72% with the support vector machine and 72.46% with the multilayer perceptron on three different databases. Additionally, they prepare UTQD-2016 (University of Tehran Question Dataset 2016). The questions in this dataset are taken from several types of Jeopardy game shown on Iran's official TV. In their third technique, each question is converted to a matrix in which every row is the Word2vec representation of a word. They then use an LSTM [6] model to classify the questions and achieve an average accuracy of 81.77% on the three question databases. Table 1 outlines the related studies in question classification.

In addition, machine translation (MT), which translates data from a source language into a target language or vice versa [42], is another way to accomplish question answering tasks, especially for under-resourced languages [43]. The authors of [44] proposed baseline frameworks for cross-lingual OpenQA with two machine translation-oriented approaches that translate the training data and the test data, respectively. In the translate-test setting, they used Google Translate to translate from Ukrainian, Polish, and Tamil into English, and they used their own translator for French, Portuguese, German, Chinese, and Russian. In [45], a novel method called XLDA (cross-lingual data augmentation) is introduced to improve the performance of natural language processing systems, including question answering (QA). In this method, the authors replace a segment of the NLP input text with its translation in another language. As a result, they improve all languages in the XNLI dataset by up to 4.8%. The system achieved state-of-the-art performance for three languages, including low-resource languages such as Urdu. The authors of [46]


proposed an Indonesian question analyzer to extend an English monolingual question answering architecture into Indonesian-English cross-language question answering. Using a question analyzer on the source language side gives better results than using question translation on the target language side. It should be noted that the question analyzer is also suitable for monolingual question answering in Indonesian. The authors of [47] examine the relation between automatic and manual translation evaluation metrics. To do this, they start with a standard QA dataset and generate manual and automatic translations, producing a question set with five different varieties of translation results. First, they translate the questions into Japanese manually and then translate the Japanese dataset back into English using five different approaches. For machine translation, they use Google Translate, Yahoo Translate, Moses, and Travatar. In [48], the authors generate question–answer pairs with the questions and answers both in Hindi and English. To classify an input question into categories depending on the expected answer, they build a deep neural architecture. During testing, they translate the Hindi answers into English and produce a gold answer set by merging the original English answer and the translated English answer for each question. Their method reached accuracies of 80.30% and 90.12% for fine and coarse categories, respectively. In [49], the authors concentrate on question answering systems and give an outline of the current state of the area for the cross-lingual (CLQA) and multilingual (MLQA) sub-tasks. Furthermore, they include an initial effort to analyze the results of a basic deep learning architecture in several language settings. The authors of [50] generate a feature for every combination of translation direction and technique, and train an architecture that learns optimal feature weights. On a large forum dataset consisting of messages in Chinese, Arabic, and English, their new learn-to-translate method outperforms a strong baseline that translates all text into English and then trains a classifier based only on English (translated or original) text.

3 Methodology and results

In this part, the feature extraction techniques are explained in depth. These techniques are very important for classifying the nature of the questions. After that, the classification techniques and the classifiers are investigated, and we demonstrate the approaches used. Fig. 2 illustrates the procedure of converting words to vectors and classifying the questions into the related classes in our proposed approaches, and thus shows the general architecture of this study. We then use a question classification algorithm to classify questions using the Word2vec methods, both skip-gram and continuous bag-of-words (CBOW).

3.1 Question database

Turkish question datasets are scarce in comparison with English ones. In this study, we use a Turkish question dataset that we translated from an English question dataset. To improve the quality of translation of this

Fig. 2 The general architecture of this study


Table 2 Distribution of question categories in the UIUC Question dataset

Category          # Train   # Test
ABBREVIATION      86        9
  abb             16        1
  exp             70        8
DESCRIPTION       1162      94
  Reason          191       6
  Description     274       7
  Manner          276       2
  Definition      421       123
ENTITY            1250      94
  Currency        4         6
  Religion        4         0
  Letter          9         0
  Instrument      10        1
  Symbol          11        0
  Plant           13        5
  Body            16        2
  Lang            16        2
  Word            26        0
  Vehicle         27        4
  Technique       38        1
  Color           40        10
  Substance       41        15
  Product         42        4
  Event           56        2
  Sport           62        1
  Term            93        7
  Dis.med.        103       2
  Food            103       4
  Animal          112       16
  Creative        207       0
  Other           217       12
HUMAN             1223      65
  Description     25        3
  Group           47        6
  Individual      189       55
  Title           962       1
LOCATION          835       81
  Mountain        21        3
  State           66        7
  City            129       18
  Country         155       3
  Other           464       50
NUMERIC           896       113
  Order           6         0
  Temp            8         5
  Code            9         0
  Speed           9         6
  Weight          11        4
  Size            13        0
  Period          27        8
  Distance        34        16
  Other           52        12
  Money           71        3
  Percent         75        3
  Date            218       47
  Count           363       9

database, we received assistance from a professional company [51]. This database uses Li and Roth's (2002) taxonomy. They introduced a two-layered classification, which is broadly utilized for question classification. This dataset consists of 6 coarse categories and 50 fine-grained categories, indicated as 'Coarse:fine', such as 'LOCATION:city'. All main classes and sub-classes of this question database [11] are shown in Table 2. In this database, there are 5500 questions in the training data and 500 questions in the test data in total. We built our Turkish dataset from this English dataset for our experiments, and we are planning to make this dataset publicly available for research purposes.

3.2 Word2vec

The Word2vec model was introduced by Mikolov et al. [18, 52] and is illustrated in Fig. 3. Word2vec is a shallow, two-layer neural network that is trained to reconstruct the linguistic contexts of words. This algorithm produces word vectors from an enormous amount of text as input by observing the contextual data in which the input words appear [53]. In the Word2vec vector space, these vectors usually have a few hundred dimensions. Every distinct word in the text is assigned a corresponding vector in the Word2vec space [54]. In the Word2vec space, the vectors are located such that words which occur in similar contexts are positioned in close proximity to one another. It is a computationally efficient predictive system for learning word embeddings from a raw corpus.

Word2vec has two different approaches:

1. Skip gram
2. CBOW (Continuous Bag of Words)

Those two methods are algorithmically close to each other [55]. In the Continuous Bag of Words (CBOW) architecture, the algorithm predicts the center (target) word based on the neighboring words. Statistically, Continuous Bag of Words smooths over much of the


Fig. 3 The skip gram and CBOW model

distributional information in the corpus, since it treats a whole context as one observation. This is beneficial for a small corpus.

The skip gram algorithm is illustrated in Fig. 3. Skip gram is very similar to CBOW, except that it exchanges the output and input; it is the inverse of Continuous Bag of Words. It predicts all surrounding words ('context') from one input word. Essentially, the training aim of the skip gram algorithm is to discover word vectors that are useful for finding nearby words in the related contexts [56]. When the corpus is bigger, the skip gram model is more effective, because every context-center pair is treated as a new observation. The Continuous Bag of Words and skip gram architectures can be seen in Fig. 3.

3.3 Training Word2vec embedding model

As part of this study, we train both skip gram and CBOW Word2vec models with different parameters on a Turkish Wikipedia corpus. We prefer Wikipedia as the dataset since it is the biggest open, multilingual encyclopedia on the Internet, and its documents are clearly organized by topic. Thus, Wikipedia corpora are very appropriate for the proposed Word2vec model. While training on the Wikipedia corpus, we remove words that occur fewer than 5 times because these words have too little data and are generally meaningless for training Word2vec in our study (for example, some are stop words or emoticons). By using this corpus, we have built our skip gram and CBOW models with vector lengths of 100, 200, 300 and 400. The Wikipedia used in our study is:

Tr-Wiki: a Turkish Wikipedia snapshot numbered 20180120.

In order to implement our Word2vec model, we use Gensim [57] to produce a set of word embeddings by setting the dimensionality D, the context window size W, the number of negative samples ns and the skip gram flag sg. The context window size is kept at its default value, W = {5}. For this window size, four different dimensionalities D = {100, 200, 300, 400} are used to investigate both high and low dimensions for the Word2vec vectors. Consequently, 8 different Word2vec models in total are produced by changing D and sg for the Wikipedia corpus. For the remaining values, we chose the default parameters. We set negative sampling to 5, batch words to 10,000, the minimum count of words to 5 and the number of iterations to 5. Furthermore, we investigate the impact of the vector dimensionality on the Wikipedia corpus. Turkish Wikipedia contains about 1 million articles. After eliminating words with a frequency of less than 5, more than 200 thousand Turkish words remain in the corpus. The main parameters used in this study are listed in Table 3.

3.4 Visualization of word embeddings

The authors of [58] introduced an embedding method to visualize syntactic and semantic analogies, and they conducted experiments to show whether the resulting visualizations capture the salient structure of the word embeddings produced with Word2vec. PCA (principal component analysis), semantic axes and cosine distance histograms are utilized as visualization methods [59, 60]. In this study, principal component analysis is used to show the syntactic and semantic relations of the words in two-


dimensional space because most papers use PCA to visualize their data.

As mentioned above, one of the main parameters of Word2vec is 'the dimension of the embedding layer'. This dimension determines the number of dimensions of the space in which the words are represented. For example, if the dimension of the embedding layer is 100, our words are represented in a 100-dimensional space. The embedding layer has a very high dimension, so a technique such as PCA should be used to decrease the dimension for visualization. The PCA technique can then be used to visualize the words represented in this 100-dimensional space in two-dimensional space. The models we visualize have many thousands of word vectors, and most words overlap in the vector space; therefore, it is impossible to see all word vectors in a single plot, as illustrated in Fig. 4.

After training, we visualize one of the learned Word2vec embeddings as an example, which is shown in Fig. 4. In this way, related words are kept close together on the graph, while the distance between unrelated words is maximized [61]. In order to classify the questions based on the above-mentioned features, we train four classifiers using the following models.

1. CNN (Convolutional Neural Networks): In NLP models, a Convolutional Neural Network convolves a filter over a fixed-length representation of words given as input to the architecture [62], and the final classification is obtained through a sequence of nonlinear function mappings and matrix multiplications [63]. The semantics of local forms of textual data, such as a phrase, can be learned automatically by such architectures, without having to keep memory over the whole sequence of words for a given sequence [62]. In our study, for the question

Fig. 4 2D results for our one of the Word2vec models with PCA


Fig. 5 Convolutional Neural Networks (CNN) for sentence classification model architecture for an example

classification performance of a Convolutional Neural Network, efficient word vector representations are significant factors. Word2vec methods [52] are able to produce vectors which allow the architecture to be successful on the task of relating the words to their context in a given window [64].

In order to classify questions, we used a Convolutional Neural Network (CNN) model explained by [8, 12, 65] and highlighted in Fig. 5. To make our system appropriate for this study, we changed some parameters, which are explained as follows. The CNN model has four layers: the convolutional layer, the max-pooling layer, the dropout layer and the fully connected layer. In order to construct a classifier utilizing a CNN, every question is represented as a word embedding matrix. Given a question containing $n$ words $v_1, v_2, v_3, \ldots, v_n$, every word is replaced with its $d$-dimensional pre-trained word embedding, and the embeddings are stacked row-wise to generate an instance matrix $V_i \in \mathbb{R}^{n \times d}$.

Now, each layer is described in detail.

In a question, define a vector $V_i \in \mathbb{R}^{d}$, where $V_i$ represents a $d$-dimensional embedding vector for the $i$th word. Here, the questions are padded to make all questions the same length as the longest question, because all questions must have the same number of words in a CNN model. Every question $j$ is represented by a two-dimensional $n \times k$ matrix $c_j = [v_1, v_2, \ldots, v_n]$, where $k$ denotes the dimensionality of the $V_i$ embedding and $n$ represents the maximum number of words. In order to generate a feature map $q$ for $r$ words, a convolution over the $i{:}i+r$ sub-matrix is applied by a filter $T$ with region size $r \times k$. The feature $q_i$ is produced via the convolutional layer by using a nonlinear function $F$:

$$q_i = F(T \cdot c_{i:i+r} + b) \qquad (1)$$

where $F$ is a nonlinear function, for example ReLU (the Rectified Linear Unit), $T \in \mathbb{R}^{r \times k}$ represents a filter, $b$ is the bias term and $c_{i:i+r}$ represents the sub-matrix produced from $r$ words. Thus, a question with $n$ words is mapped by the Convolutional Neural Network to $n - r + 1$ features $q = [q_1, q_2, q_3, \ldots, q_{n-r+1}]$. Here, every $q_i$ represents the higher-level word representations within the filter.

In our model, the max-pooling layer is utilized to take the feature $q$ with the maximum value in order to choose the most significant feature generated by a filter. To gain multiple representations, we can apply filters with different region sizes and convolutional functions to every question. The output of the max-pooling layer for all filters is then transferred to the fully connected layer with six probabilistic output units, one for each of the database class labels. The dropout rate is set to 0.5. Dropout is frequently applied at this layer as a means of regularization [66]. The activation function used in this method is ReLU. Max pooling extracts the maximum value $o_i$ from each feature map $i$ (1-max pooling). Specifically, we choose ReLU (the Rectified Linear Unit) because it is broadly employed in modern deep neural network architectures such as CNN. Compared to traditional activation functions such as sigmoid and tanh, ReLU offers a few benefits: (1) it is simple and fast to compute, (2) it mitigates the vanishing gradient problem and (3) it induces sparseness [67].

In our CNN model, we also use filter windows of 5, 4 and 3 with 400, 300, 200 and 100 feature maps and an l2


Table 3 The main parameters used in the Word2vec models

Parameters        Value
Sentences         None
Windows size      5
Dimensionality    400, 300, 200, 100
max_vocab_size    None
sg ({0,1})        1 for skip gram, otherwise CBOW
hs ({0,1})        0
Negative          5
hashfxn
Iter              5
Sample            0.001
Batch_words       10,000
Seed              1
min_alpha         0.0001
min_count         5
cbow_mean         1
null_word         0
Workers           4
trim_rule         None
sorted_vocab      1
Alpha             0.025

constraint of 3. The results of the CNN model using Word2vec are shown in Table 4.

In addition, accuracy is utilized as the evaluation metric. Accuracy is defined as the number of correctly classified questions divided by the total number of tested questions [68]. In question classification, the formula of the accuracy is as follows.

$$\text{Accuracy} = \frac{\text{the count of correctly classified questions}}{\text{the count of test questions}} \qquad (2)$$

In this study, the accuracy is reported at the coarse level because coarse groups are used for classification. We used both Word2vec word embeddings, CBOW and skip gram, as feature extraction techniques. In addition, we also compare the results in terms of 10-fold cross-validation accuracy.

2. LSTM (long short-term memory): LSTM is an updated variation of recurrent neural networks (RNN), and it plays a vital role for computers to understand text documents.

BPTT (the backpropagation through time algorithm) and RTRL (the Real-Time Recurrent Learning algorithm) extend the ordinary backpropagation algorithm to fit the RNN model, but a Recurrent Neural Network has difficulty remembering data from the distant past. Through RTRL and BPTT, error signals tend to vanish after flowing backward for a number of steps, becoming incapable of changing the weights [9]. Consequently, long-term dependencies are difficult to learn [9] (for a more comprehensive analysis, see (1)–(3) in [69]). LSTM is able to hold data over much longer periods than 10–12 time steps [70], which is the limit of BPTT and RTRL systems. It applies prior data to the current task and is implemented to allow machines to learn to solve problems whose inputs and outputs are sequences of varying length.

Specifically, we use LSTM to classify questions using Word2vec representations. As shown in Fig. 6, LSTM provides a gating structure, which includes four parts: an output gate $o_t$, a forget gate $f_t$, a memory cell $C_t$ and an input gate $i_t$. In the classic LSTM, all three gates take the data from the inputs at the present time step and the outputs at the prior time step. All symbols are listed in Table 5, where the present time step and the prior time step are denoted by $t$ and $t - 1$, respectively. $C$ acts like a conveyor belt along which the data travel. Previous data are deleted and fresh data are added via $f$ and $i$. The output for the present time step is generated by the system via $o$. The most important part of the long short-term memory architecture is $C$, which retains the data for a long period. Therefore, long-term dependencies are supported by this mechanism. The following equations summarize the above-mentioned process.

$$f_t = \varphi(W_f \cdot [h_{t-1}, x_t] + b_f) \qquad (3)$$

$$i_t = \varphi(W_i \cdot [h_{t-1}, x_t] + b_i) \qquad (4)$$

$$o_t = \varphi(W_o \cdot [h_{t-1}, x_t] + b_o) \qquad (5)$$

$$C'_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \qquad (6)$$

Fig. 6 The LSTM architecture
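As a minimal sketch of the gating mechanism, the following pure-Python function performs one step of the standard LSTM, combining the gate equations (3)–(6) with the cell and hidden-state updates of Eqs. (7) and (8). The names `lstm_step` and `encode_question`, the parameter dictionary and the toy dimensions are illustrative assumptions; they are not the implementation or the configuration used in the experiments of this study.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, the gate nonlinearity."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W . v + b for a list-of-rows matrix W and vectors v, b."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def lstm_step(x_t, h_prev, C_prev, params):
    """One step of the standard LSTM; params is a hypothetical dict of weights."""
    hx = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    f = [sigmoid(z) for z in affine(params["W_f"], hx, params["b_f"])]         # Eq. (3)
    i = [sigmoid(z) for z in affine(params["W_i"], hx, params["b_i"])]         # Eq. (4)
    o = [sigmoid(z) for z in affine(params["W_o"], hx, params["b_o"])]         # Eq. (5)
    C_cand = [math.tanh(z) for z in affine(params["W_c"], hx, params["b_c"])]  # Eq. (6)
    C = [f_k * c_k + i_k * n_k for f_k, c_k, i_k, n_k in zip(f, C_prev, i, C_cand)]  # Eq. (7)
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, C)]                       # Eq. (8)
    return h, C


def encode_question(word_vectors, params, hidden_size):
    """Many-to-one use: feed the word vectors in order, keep the last hidden state."""
    h = [0.0] * hidden_size
    C = [0.0] * hidden_size
    for x_t in word_vectors:
        h, C = lstm_step(x_t, h, C, params)
    return h
```

In a many-to-one classifier, the vector returned by `encode_question` would be passed to a fully connected layer with softmax output; that layer is omitted here to keep the sketch short.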


Fig. 7 LSTM structure with peepholes

$$C_t = f_t \odot C_{t-1} + i_t \odot C'_t \qquad (7)$$

$$h_t = o_t \odot \tanh(C_t) \qquad (8)$$

We have explained the standard LSTM structure so far. However, there are various LSTM structures which support the learning of long-term dependencies, and not all LSTMs are the same as the standard one. One well-known LSTM structure, which adds 'peephole connections', is used in our method. This model allows the gate layers to look at the cell state. Figure 7 shows the LSTM structure with peepholes.

In [71], the authors propose a sigmoid layer named the 'forget gate layer' in order to remove some of the previous data from the memory cell and free memory space for newly arriving data. In [69], the cell state is monitored by the gate layers via 'peephole connections'. According to [72], such connections improve performance on timing tasks in which the network has to learn to measure exact intervals between events.

In this section, we use a variation of LSTM, which adds the 'peephole connections' to the mechanism of Fig. 6 (illustrated in Fig. 7) to let the memory cell $C_{t-1}$ directly control the gates, as shown in the equations:

$$f_t = \varphi(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f) \qquad (9)$$

$$i_t = \varphi(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i) \qquad (10)$$

$$C'_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \qquad (11)$$

$$C_t = f_t \odot C_{t-1} + i_t \odot C'_t \qquad (12)$$

In these equations, $W$ is the transition matrix for the input $x_t$, $C_{t-1}$ is the memory cell, $h_{t-1}$ is the hidden state vector, $\odot$ denotes component-wise multiplication and $\varphi$ indicates the sigmoid function.

The output gate $o_t$ controls the present hidden state value $h_t$, which is the result of applying a nonlinearity to the memory cell contents:

$$o_t = \varphi(W_o \cdot [C_t, h_{t-1}, x_t] + b_o) \qquad (13)$$

$$h_t = o_t \odot \tanh(C_t) \qquad (14)$$

In the following phase, the hidden state $h_t$ of the present time step is utilized for the acquisition of $h_{t+1}$. In other words, the word sequence is recursively processed by the long short-term memory by calculating its internal

Fig. 8 Types of architecture mapping in LSTM


hidden state $h_t$ at every time step. The hidden activations of the final time step can be regarded as the semantic representation of the entire sequence and are utilized as input for the classification layer.

Furthermore, there are various types of architecture mapping in LSTM [73]. These are one-to-one, one-to-many, many-to-one and many-to-many mappings, as represented in Fig. 8.

A many-to-one structure produces one output value after taking multiple input values. In this structure, the input is in the form of a sequence, and thus the hidden states functionally depend on both the input at that time step and the prior hidden state. An example is sequence input such as tweets, where an input tweet is categorized as expressing negative or positive sentiment. Because of that, the many-to-one model is selected in the case of question classification. In the many-to-one architecture, there is only one output. For example, our input is 'Avrupa'nın en büyük ülkesi hangisidir?' ('What is the biggest country in Europe?'). In this question classification example, the input is a question (a sequence of words) and the output is a probability showing that the input question belongs to the country class. Table 6 illustrates the results of the LSTM architecture with Word2vec.

The details of the many-to-one structure can be seen in Fig. 9, where every rectangle represents a vector and arrows represent functions, for example, matrix multiplication. The output vectors are shown at the top, the input vectors are at the bottom, and the vectors in the middle hold the state of the RNN [74].

Fig. 9 The details of many to one structure

3. CNN-LSTM model: The basic architecture of the proposed model is illustrated in Fig. 10, which is adapted from [65, 71, 75], and it summarizes the combination of the two deep neural network models: CNN and LSTM.

In this method, a softer form, in which the maximum operation is executed over a smaller region of the feature maps, is used rather than max pooling over time. With this method, temporal data are better preserved and the max-pooling layer generates a sequence rather than a single value [71]. In the following layer, the data are then fed into an LSTM cell with a many-to-one structure and a fully connected layer with softmax output. Table 7 illustrates the results of the CNN-LSTM model using Word2vec.

Fig. 10 CNN-LSTM structure with an example
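The difference between max pooling over time and the softer, local pooling used before the LSTM cell can be illustrated with a small sketch. The pooling window of 2 and the toy feature map below are assumptions chosen for illustration only; the pooling size actually used in the CNN-LSTM model is not specified here.

```python
def max_over_time(feature_map):
    """Global max pooling: collapses a whole feature map to a single value."""
    return max(feature_map)


def local_max_pool(feature_map, pool_size, stride=None):
    """Max over consecutive windows, so the output is still an ordered sequence."""
    stride = stride or pool_size
    last_start = len(feature_map) - pool_size
    return [max(feature_map[i:i + pool_size]) for i in range(0, last_start + 1, stride)]


# Toy feature map produced by one convolutional filter (hypothetical values).
feature_map = [0.1, 0.7, 0.3, 0.0, 0.9, 0.2, 0.4, 0.5]

print(max_over_time(feature_map))      # 0.9
print(local_max_pool(feature_map, 2))  # [0.7, 0.3, 0.9, 0.5]
```

The global variant keeps only the strongest response, whereas the local variant returns one value per window in the original order, which is what allows a recurrent layer to consume the pooled features as a sequence.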


3.4.1 Input layer

In this architecture, this is the initial part, and it represents each question as a row of vectors. Every vector denotes a word-level token. Every word in the question is mapped to a vector with one of the fixed sizes of 400, 300, 200 or 100, based upon the Word2vec model used. Every question $j$ is represented by a two-dimensional $n \times k$ matrix $c_j = [v_1, v_2, \ldots, v_n]$, where $k$ denotes the dimensionality of the $v_i$ embedding and $n$ denotes the maximum number of words. The questions are padded using <Pad> in order to make all questions the same length as the longest question, since all questions must have the same number of words in this layer.

3.4.2 Convolutional layer with feature maps

Every input is a sequence of vectors, and this layer scans it with filters of fixed length. In this method, filter sizes of 3, 4 and 5 are utilized to capture the features of words. The filters shift with a stride of one row and one column. Every filter captures various features in a question with the Rectifier Linear Unit function (ReLU) in order to represent them in the feature map.

3.4.3 Max-pooling layer

This layer down-samples and reduces the features in the feature map after the convolutional layer. The maximum operation is the most commonly employed method at the max-pooling layer. For that reason, we selected the maximum operation in this study. In order to decrease the computation in the subsequent layers and keep the most significant feature, the maximum value is selected. After that, the system applies the dropout method with a dropout value of 0.5 to decrease overfitting.

3.4.4 LSTM layer

The aim of choosing the LSTM is to capture the sequential information by using the prior information. In the LSTM layer, the output vectors from the dropout layer are taken as inputs. The layer contains a set number of units and cells, and its final output has the same number of units.

3.4.5 Fully connected layer

The LSTM layer produces a single matrix from its outputs by combining and merging them and then passes it to a fully connected layer. In order to classify questions, the network converts the array into an output using the fully connected layer and softmax.

4. CNN-SVM: SVM (Support Vector Machine) is a supervised machine learning algorithm, which can be applied to both regression and classification problems [10]. The aim of supervised machine learning is to learn and generalize an input-output mapping. In question classification problems, a set of questions is used as input and their relevant classes are used as output.

In the previous deep architectures, the last layer is a fully connected layer with softmax output, which is equivalent to a linear classifier. The prior layers act as a feature extractor in this setting. In this way, we follow the authors of [71]. That study shows that this feature extractor can extract valuable features, producing points that can be separated satisfactorily by a simple linear classifier. The general framework of the CNN-SVM architecture is shown in Fig. 11. In addition, those features can be helpful for other types of linear classifier. In this method, a Linear SVM is used as the last layer. Here, the typical cross-entropy loss function

Fig. 11 The general framework of CNN-SVM architecture


$$L = \mathrm{CE}(\text{target}, \mathrm{softmax}(W, \phi(\text{input}))) \qquad (15)$$

is replaced with the Linear SVM's loss function. Here, $\phi(\text{input})$ denotes the features extracted by the previous layers, CE is the cross-entropy and $L$ is the loss. After replacing the final layer with an SVM, the loss function is as follows:

$$L = \sum_{i}^{n} \sum_{j}^{6} R(t_j W_j \phi_i + 1) + C \lVert W_j \rVert^2 \qquad (16)$$

where $R$ is ReLU (the Rectifier Linear Unit function) and $t$ is the target, which can be $-1$ or $1$. This loss function shows the loss when the one-vs-rest scheme is utilized, since our question classification data form a multiclass dataset. For a Linear SVM with a soft margin, the loss function becomes

$$L = \sum_{i}^{n} \sum_{j}^{6} R(t_j W_j \phi_i + 1 - \xi_{ij}) + R(\xi_{ij}) + C \lVert W_j \rVert^2 \qquad (17)$$

Table 8 illustrates the results of the CNN-SVM model using Word2vec.

4 Conclusions

In this article, we apply several deep learning methods to a question dataset based upon Word2vec embedding vectors, both skip gram and CBOW, which can effectively capture the semantic and syntactic relations among words. To do this, the Word2vec algorithms first calculate word vectors for the vocabulary words. The algorithms initialize the word vectors with random vectors. Then, the algorithms try to raise the cosine similarity between words and their contexts, which can be defined based upon the system. Therefore, given a large amount of text from the Wikipedia corpus, these algorithms are able to place word vectors in the vector space such that their closeness reflects the relatedness of the corresponding words.
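The notion of closeness referred to here is cosine similarity. As a small illustration, the function below computes it for two vectors in pure Python. The three-dimensional toy vectors and the words attached to them are hypothetical and are not taken from the trained Tr-Wiki embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, for illustration only.
toy_vectors = {
    "ülke": [0.9, 0.1, 0.3],   # "country"
    "şehir": [0.8, 0.2, 0.4],  # "city"
    "hızlı": [0.1, 0.9, 0.0],  # "fast"
}

related = cosine_similarity(toy_vectors["ülke"], toy_vectors["şehir"])
unrelated = cosine_similarity(toy_vectors["ülke"], toy_vectors["hızlı"])
print(related > unrelated)  # True for these toy values
```

A value close to 1 indicates vectors pointing in nearly the same direction, a value near 0 indicates unrelated directions, and negative values indicate opposing directions.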

Table 4 The results of CNN model using Word2vec

Number of feature vectors The type of Word2vec model Accuracy (test) Accuracy (10-cross fold validation)

100 CBOW 91.8 86.5
100 Skip gram 92.8 89.3
200 CBOW 92.4 86.7
200 Skip gram 92.6 89.5
300 CBOW 92.4 87.2
300 Skip gram 94 89.6
400 CBOW 92.4 86.9
400 Skip gram 93.8 89.5
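The two accuracy columns reported in the result tables follow Eq. (2), computed once on the held-out test questions and once averaged over ten folds. The sketch below shows both computations in pure Python. The contiguous, unshuffled fold split and the function names are assumptions made for illustration; how the folds were drawn in the experiments is not described in this study.

```python
def accuracy(predicted, gold):
    """Eq. (2): correctly classified questions divided by all tested questions."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


gold = ["LOCATION", "HUMAN", "NUMERIC", "ENTITY"]
predicted = ["LOCATION", "HUMAN", "ENTITY", "ENTITY"]
print(accuracy(predicted, gold))  # 0.75
```

For cross-validation, a classifier would be trained on each `train` index list and scored with `accuracy` on the matching `test` list, and the ten scores would then be averaged.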

Table 5 The symbols used in LSTM architecture

Symbol                        Description
$C$                           CEC (the memory cell)
$C'$                          The fresh candidate value
$z, r, f, i, o$               Update, reset, forget, input and output gates
$\varphi$                     Sigmoid function
$h$                           Output
$b_C, b_z, b_f, b_o, b_i$     Offset values
$\tanh$                       Hyperbolic tangent function
$x$                           Input
$W_c$                         The weight vector for input
$W_z$                         The weight vector for the update gate
$W_o, W_i, W_f$               The weight vectors of the output, input and forget gates, respectively
$\odot$                       Component-wise multiplication
$*$                           Product of two scalars or product of a scalar and a vector
$1 -$                         All elements of vectors are subtracted from 1
$\oplus$                      Element-wise sum of two vectors


Table 6 The Results of LSTM using Word2vec model


Number of feature vectors The type of Word2vec model Accuracy (test) Accuracy (10-cross fold validation)

100 CBOW 91 87.8


100 Skip gram 90.8 86.9
200 CBOW 90.6 87.6
200 Skip gram 90.3 87.7
300 CBOW 90.2 88.2
300 Skip gram 91.4 88.4
400 CBOW 91.6 88.5
400 Skip gram 91 88.5

Table 7 The Results of CNN-LSTM architecture


Number of feature vectors The type of Word2vec model Accuracy (test) Accuracy (10-cross fold validation)

100 CBOW 90.6 86.9


100 Skip gram 91 88.6
200 CBOW 92 89.5
200 Skip gram 92.8 88.9
300 CBOW 91.2 87.7
300 Skip gram 92.6 89
400 CBOW 91.4 87.4
400 Skip gram 92.8 89.6

Table 8 The Results of the CNN-SVM model


Number of feature vectors The type of Word2vec model Accuracy (test) Accuracy (10-cross fold validation)

100 CBOW 91.8 83.4


100 Skip gram 91.8 86.8
200 CBOW 91.2 81.1
200 Skip gram 92.8 87.1
300 CBOW 90.8 84
300 Skip gram 93 89.5
400 CBOW 92 84.6
400 Skip gram 93.2 89.4

The methods used for classification combine Word2vec, both skip gram and CBOW, with deep learning techniques. This study focuses on an agglutinative language, and to the best of our knowledge, this is the first time that question classification is studied in depth in an agglutinative language. As a result, we have reached satisfactory statistics according to the experimental results of the classes. Prior studies on question classification concentrate on different tasks, for instance named entity or similar class categorization [76] and integrating a rule-based technique with an HMM-based sequence classification technique [77]; their answers might not generalize to question classification tasks in an agglutinative language. On the other hand, similar question classification studies in other languages [6, 71] have not investigated the impact of the Word2vec variations, both CBOW and skip gram, and of parameters such as the feature vector size on question classification performance. The experimental results in this study illustrate that the above-mentioned


factors can definitely influence the classification performance of question classification systems.

Generally, in our study, we investigated CNN, LSTM, CNN-LSTM and CNN-SVM when using both Word2vec methods, CBOW and skip gram. Comparing the two types of Word2vec method, the CNN, CNN-LSTM and CNN-SVM models using skip gram achieve higher accuracy on the question classification dataset than those using CBOW (Tables 4, 7, 8). In contrast to CNN, CNN-LSTM and CNN-SVM, using CBOW generally achieves better results with the LSTM structure. In the CNN-LSTM model, using skip gram is better than using CBOW in most cases. In addition, we obtained the best result with the CNN model, an accuracy of 94%, when using skip gram with 300 feature vectors. We also observe that a suitable training corpus can cover more of the vocabulary of the classification dataset. Therefore, the relation between the corpus and the classification dataset yields better question-level representations.

Lastly, compared with a similar study [71] performed on the same dataset, in which the authors reached an accuracy of 95.4% with LSTM in English, our results are lower. The most important reason for this is the language structure of Turkish, as we mentioned earlier in Sect. 1. Moreover, there is an absence of effective lemmatization tools for the Turkish language compared to the English language.

To conclude our study, we recommend one direction for future work that we think may be worth exploring. In the future, a hybrid feature extraction technique based on more than one word embedding method can be used to increase the accuracy of the question classification system. By using this hybrid method, the system will be able to take advantage of all the embedding methods used.

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.

References

1. Blooma MJ, Goh DHL, Chua AYK, Ling Z (2008) Applying question classification to Yahoo! Answers. In: Applications of digital information and Web Technologies, ICADIWT 2008. First
3. Mishra M, Mishra VK, Sharma HR (2013) Question classification using semantic, syntactic and lexical features. Int J Web Semant Technol 4(3):39
4. Zhiheng H, Marcus T, Zengchang Q (2008) Question classification using head words and their hypernyms. In: Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, pp 927–936
5. Ehsan S, Mojgan F (2014) A hybrid approach for question classification in Persian automatic question answering systems. In: 4th international e conference on computer and knowledge engineering (ICCKE). IEEE, pp 279–284
6. Razzaghnoori M, Sajedi H, Jazani IK (2018) Question classification in Persian using word vectors and frequencies. Cogn Syst Res 47:16–27
7. Hao T, Xie W, Xu F (2015) A WordNet expansion-based approach for question targets identification and classification. In: Chinese computational linguistics and natural language processing based on naturally annotated Big Data. Springer, Cham, pp 333–344
8. Kim Y (2014) Convolutional neural networks for sentence classification. arxiv preprint: arxiv:1408.5882
9. Hu F, Li L, Zhang ZL (2017) Emphasizing essential words for sentiment classification based on recurrent neural networks. J Comput Sci Technol 32(4):785–795. https://doi.org/10.1007/s11390-017-1759-2
10. https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/
11. Le-Hong P, Phan XH, Nguyen TD (2015) Using dependency analysis to improve question classification. In: Knowledge and Systems Engineering. Springer, Cham, pp 653–665
12. Bilić P, Primorac J, Valtýsson B (eds) (2018) Technologies of labour and the politics of contradiction. Springer, Cham, p 85
13. Şahin G (2017) Turkish document classification based on Word2Vec and SVM classifier. In: Signal processing and communications applications conference (SIU), 2017 25th, IEEE, pp 1–4
14. http://user.ceng.metu.edu.tr/*mturhan/dfa/node14.html
15. Ozturkmenoglu O, Alpkocak A (2012) Comparison of different lemmatization approaches for information retrieval on Turkish text collection. In: 2012 International symposium on innovations in intelligent systems and applications. IEEE
16. Oflazer K (2014) Turkish and its challenges for language processing. Lang Resour Eval 48(4):639–653
17. Wang C (2016) What are the limitations of the Bag-of-Words model? [Online]. Available: https://www.quora.com/What-are-the-limitations-of-the-Bag-of-Words-model
18. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119
19. Bollegala D, Maehara T, Kawarabayashi K (2015) Embedding semantic relations into word representations. In: Twenty-fourth international joint conference on artificial intelligence
20. http://cogcomp.cs.illinois.edu/Data/QA/QC/
21. Abdi A et al (2019) Deep learning-based sentiment classification of evaluative text based on Multi-feature fusion. Inf Process Manag 56(4):1245–1259
22. Ghulam H et al (2019) Deep learning-based sentiment analysis for Roman Urdu text. Procedia Comput Sci 147:131–135
23. Sangkeettrakarn C, Haruechaiyasak C, Theeramunkong T (2019)
international conference on the IEEE, pp 229–234 Fuzziness detection in Thai law texts using deep learning.
2. Silva J, Luı́sa C, Mendes AC, Andreas W (2011) From symbolic In: 2019 10th International conference of information and com-
to sub-symbolic information in question classification. Artif Intell munication technology for embedded systems (IC-ICTES). IEEE
Rev 35(2):137–154 24. Singh J et al (2018) Morphological evaluation and sentiment
analysis of Punjabi text using deep learning classification. J King
Saud Univ-Comput Inf Sci (2018)


25. Nguyen H-Q, Nguyen Q-U (2018) An ensemble of shallow and deep learning algorithms for Vietnamese Sentiment Analysis. In: 2018 5th NAFOSTED conference on information and computer science (NICS). IEEE
26. Nguyen T, Shcherbakov M (2018) A neural network based Vietnamese Chatbot. In: 2018 International conference on system modeling & advancement in research trends (SMART). IEEE
27. Dmitrin YV (dmitrinyuri@gmail.com), Botov DS Comparison of deep neural network architectures for authorship attribution of Russian Social Media Texts
28. Vo K et al (2019) Handling negative mentions on social media channels using deep learning. J Inf Telecommun 1–23
29. Heikal M, Torki M, El-Makky N (2018) Sentiment analysis of Arabic Tweets using deep learning. Procedia Comput Sci 142:114–122
30. Hacioglu K, Ward W (2003) Question classification with support vector machines and error correcting codes. In: Proceedings of HLT-NAACL, Association for Computational Linguistics, Morristown, USA, vol 2, pp 28–30
31. Loni B (2011) A survey of state-of-the-art methods on question classification
32. Athira PM, Sreeja M, Reghuraj PC (2013) Architecture of an ontology-based domain-specific natural language question answering system. Int J Web Semant Technol 4(4):31
33. Hermjakob U (2001) Parsing and question classification for question answering. In: Proceedings of the workshop on open-domain question answering. Association for Computational Linguistics, vol 12, pp 1–6
34. Close LW (2002) Question classification using language modeling. Center of Intelligent Information Retrieval (CIIR). Technical report
35. Dell Z, Wee SL (2003) Question classification using support vector machines. In: Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 26–32
36. Suzuki J, Taira H, Sasaki Y, Maeda E (2003) Question classification using HDAG kernel. In: Proceedings of the ACL 2003 workshop on multilingual summarization and question answering. Association for Computational Linguistics, vol 12, pp 61–68
37. Li X, Roth D (2002) Learning question classifiers. In: Proceeding of the 19th international conference on computational linguistics. Association for Computational Linguistics, Morristown, USA, vol 1, pp 1–7
38. Mollaei A, Rahati-Quchani S, Estaji A (2012) Question classification in Persian language based on conditional random fields. In: 2012 2nd international conference on computer and knowledge engineering (ICCKE). IEEE, pp 295–300
39. Loni BK, Seyedeh H, Wiggers P (2011) Latent semantic analysis for question classification with neural networks. In: 2011 IEEE workshop on automatic speech recognition and understanding (ASRU). IEEE, pp 437–442
40. Blunsom P, Kocik K, Curran J (2006) Question classification with log-linear models. In: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 615–616
41. Santosh K, Ray Shailendra S, Joshi BP (2010) A semantic approach for question classification using WordNet and Wikipedia. Pattern Recognit Lett 31(13):1935–1943
42. Lee C-H, Lee H-Y (2019) Cross-lingual transfer learning for question answering. arXiv preprint arXiv:1907.06042
43. https://ahcweb01.naist.jp/DigRevURL/index.html
44. Liu J et al (2019) XQA: a cross-lingual open-domain question answering dataset. In: Proceedings of the 57th annual meeting of the Association for Computational Linguistics
45. Singh J et al (2019) XLDA: cross-lingual data augmentation for natural language inference and question answering. arXiv preprint arXiv:1905.11471
46. Faruqi MI, Purwarianti A (2011) An Indonesian question analyzer to enhance the performance of Indonesian-English CLQA. In: Proceedings of the 2011 international conference on electrical engineering and informatics. IEEE, pp 1–6
47. Sugiyama K, Mizukami M, Neubig G, Yoshino K, Sakti S, Toda T, Nakamura S (2015) An investigation of machine translation evaluation metrics in cross-lingual question answering. In: Proceedings of the tenth workshop on statistical machine translation, pp 442–449
48. Gupta D, Kumari S, Ekbal A, Bhattacharyya P (2018) MMQA: a multi-domain multi-lingual question-answering framework for English and Hindi. In: Proceedings of the eleventh international conference on language resources and evaluation (LREC-2018)
49. Loginova E, Varanasi S, Neumann G (2018) Towards multilingual neural question answering. In: European conference on advances in databases and information systems. Springer, Cham, pp 274–285
50. Ture F, Boschee E (2016) Learning to translate for multilingual question answering. In: Proceedings of the 2016 conference on empirical methods in natural language processing, pp 573–584
51. http://www.benila.com.tr/index.php
52. Mikolov T et al (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
53. https://github.com/akoksal/turkish-word2Vec
54. https://israelg99.github.io/2017-03-23-Word2Vec-Explained/
55. https://medium.com/@mubuyuk51/word2vec-nedir-t%C3%BCrk%C3%A7e-f0cfab20d3ae
56. Mandelbaum A, Shalev A (2016) Word embeddings and their use in sentence classification tasks. arXiv preprint arXiv:1610.08229
57. https://radimrehurek.com/gensim/models/word2vec.html
58. Liu S, Bremer PT, Thiagarajan JJ, Srikumar V, Wang B, Livnat Y, Pascucci V (2018) Visual exploration of semantic relationships in neural word embeddings. IEEE Trans Vis Comput Graph 24(1):553–562
59. Chen Z et al (2018) Evaluating semantic relations in neural word embeddings with biomedical and general domain knowledge bases. BMC Med Inf Decis Mak 18(2):65
60. https://medium.com/@patrickhk/practice-ntlk-word2vec-pca-wordcloud-jieba-on-harry-potter-series-and-chinese-content-ca6f845b3293
61. https://www.tensorflow.org/tutorials/representation/word2vec
62. Zhang Y, Wallace B (2015) A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820
63. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5300751/
64. Banerjee I et al (2019) Comparative effectiveness of convolutional neural network (CNN) and recurrent neural network (RNN) architectures for radiology text report classification. Artif Intell Med 97:79–88
65. Yang X, Macdonald C, Ounis I (2018) Using word embeddings in twitter election classification. Inf Retr J 21(2–3):183–207
66. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
67. https://arxiv.org/pdf/1707.08214.pdf
68. Liu Y et al (2018) Feature extraction based on information gain and sequential pattern for English question classification. IET Softw 12(6):520–526
69. Hochreiter S (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int J Uncertain Fuzziness Knowl Syst 6(2):107–116


70. Zhou H et al (2016) Exploiting syntactic and semantics information for chemical–disease relation extraction. Database 2016
71. https://github.com/thtrieu/qclass_dl/blob/master/ProjectDescription.pdf, https://github.com/thtrieu/qclass_dl/blob/master/ProjectPresentation.pdf
72. Gers FA, Schmidhuber J (2000) Recurrent nets that time and count. In: Proceedings of the IEEE-INNS-ENNS international joint conference on neural networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, vol 3, pp 189–194
73. http://karpathy.github.io/2015/05/21/rnneffectiveness/
74. http://www.jackdermody.net/brightwire/article/Sequence_to_Sequence_with_LSTM
75. Alayba AM, Palade V, England M, Iqbal R (2018) A combined CNN and LSTM model for arabic sentiment analysis. In: International cross-domain conference for machine learning and knowledge extraction. Springer, Cham, pp 179–191
76. Derici C, Celik K, Kutbay E, Aydın Y, Güngör T, Özgür A, Kartal G (2015) Question analysis for a closed domain question answering system. In: International conference on intelligent text processing and computational linguistics. Springer, Cham, pp 468–482
77. Dönmez İ, Adalı E (2017) Turkish question answering application with course-grained semantic matrix representation of sentences. In: Computer science and engineering (UBMK), 2017 international conference on IEEE, pp 6–11

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

