Fikir Setie Tezera

DSpace Institution
DSpace Repository http://dspace.org

Computer Science thesis
2022-07
Amharic Question Generation from

Amharic legal Text Documents by
Using Deep Learning Approach
FIKIR, SETIE TEZERA
http://ir.bdu.edu.et/handle/123456789/14385
Downloaded from DSpace Repository, DSpace Institution's institutional repository
BAHIR DAR UNIVERSITY
BAHIR DAR INSTITUTE OF TECHNOLOGY
SCHOOL OF RESEARCH AND POSTGRADUATE
STUDIES
Faculty of Computing
MSc. Thesis on:
Amharic Question Generation from Amharic legal Text

Documents by Using Deep Learning Approach
By:
FIKIR SETIE TEZERA
Program: Computer Science

Advisor: Esubalew A. (PhD)
JULY 2022
Bahir Dar, Ethiopia
Amharic Question Generation from Amharic Legal Text
Documents with Using Deep Learning Approach
FIKIR SETIE TEZERA
Amharic Question Generation from Amharic Legal Text Documents using

Deep Learning Approach
Submitted to the School of Research and Graduate Studies of Bahir Dar Institute of
Technology, BDU in Partial Fulfillment of the Requirements for the Degree of Master
Science in Computer Science in the Computing Faculty
Advisor Name: Esubalew A. (PhD)
September 8, 2022
BAHIR DAR, ETHIOPIA
© 2022
FIKIR SETIE TEZERA
ALL RIGHTS RESERVED
ii
DECLARATION
This is to certify that the thesis entitled “Amharic Question Generation from Amharic Legal
Text Documents by using Deep Learning Approach”, submitted in partial fulfillment of the
requirements for the degree of Master of Science in Computer Science under Faculty of
Computing, Bahir Dar Institute of Technology, is a record of original work carried out by
me and has never been submitted to this or any other institution to get any other degree or
certificates. The assistance and help I received during the course of this investigation have
been duly acknowledged.
FIKIR SETIE TEZERA 03-08-22
Name of the Candidate Signature Date
iii
iv
ACKNOWLEDGEMENTS
First and foremost, I would like to thank the almighty God and His Mother Virgin Marry
for helping me start and complete this work.
Next, I would like to express my heart-felt gratitude to my advisor Dr. Esubalew Alemneh
for his Continues support and courage, without his constructive comment, detailed advice
and Professional guidance, the study would not have reached this level. I especially give
great value for the fast and detail feedback he gives me in every level of the study. I feel
honored to be one of his advisees. I would also thank Dr. Tesfa T. for the constructive
comment he gives me during the progress presentation of the study.
I would also thank my beloved mother Abeze Melaku who rise me without father with
great love, affection. The Encouragement and the moral support she gave me helped me to
be who I am now; also, I would Like thank a lot my beloved husband, Dr. Achamyeleh
G. for being near to me to support and encourage me throughout the study.
Finally, I want to say thank you a lot to my best friends, Habtamu A. and Derje S. For
commenting my title selection, during data gathering and also motive until the
accomplishment of the study. I want to acknowledge you with best of wish for your
successful life.
v
TABLE OF CONTENTS
DECLARATION ............................................................................................................... iii

ACKNOWLEDGEMENTS ................................................................................................ v
TABLE OF CONTENTS ................................................................................................... vi
LIST OF ABBREVIATIONS ............................................................................................ ix
LIST OF FIGURES ............................................................................................................ x
LIST OF TABLES ............................................................................................................. xi
ABSTRACT...................................................................................................................... xii
Chapter 1 ............................................................................................................................. 1
1. Introduction ..................................................................................................................... 1
1.1. Background .............................................................................................................. 1
1.1.1 Question Generation Applications...................................................................... 2
1.2. Statement of the problem ......................................................................................... 4
1.3. Objective of the study .............................................................................................. 5
1.3.1. General objective ............................................................................................... 5
1.3.2. Specific objective .............................................................................................. 5
1.4. Scope and Limitation of the study............................................................................ 6
1.5. Significance of the Study ......................................................................................... 6
1.6. Methodology of the research .................................................................................... 6
1.6.1. Research design methodology ........................................................................... 6
1.7. Organization of the thesis ......................................................................................... 7
Chapter 2 ............................................................................................................................. 8
2. Literature Review............................................................................................................ 8
2.1. Overview of question generation ............................................................................. 8
2.2. History of question generation ................................................................................. 9
2.3. Technique for Question Generation ....................................................................... 12
2.3.1 Neural network approaches ............................................................................ 13
2.3.1.1. Recurrent Neural Network (RNN) ........................................................ 13
2.3.1.2. Long Short-Term Memory (LSTM) ...................................................... 15
vi
2.3.1.3 Bi-LSTM ................................................................................................ 16
2.3.1.4. GRU ...................................................................................................... 16
2.3.1.5. Convolutional Neural Network ............................................................. 18
2.3.2. Activation functions ........................................................................................ 20
2.3.3. Optimizers ....................................................................................................... 21
2.4. Word Embedding ................................................................................................... 22
2.4.1 Word2Vec ......................................................................................................... 22
2.5. Question Construction Approaches ........................................................................ 23
2.6. Amharic Language ................................................................................................. 27
2.6.1. Amharic word class ......................................................................................... 27
2.6.2. Amharic sentence structure ............................................................................. 28
2.6.3. Amharic Interrogative Sentence Structure ...................................................... 28
2.7. Related works on question generation and approaches .......................................... 28
2.7.1 Question generation for English language........................................................ 28
2.7.2 Question generation for Amharic language ...................................................... 30
2.8. Summary ................................................................................................................ 30
Chapter 3 ........................................................................................................................... 32
3. Design of Amharic Question Generation Model .......................................................... 32
3.2 Model Architecture ............................................................................................ 33
3.3 Preprocessing ..................................................................................................... 34
3.3.1 Tokenization ............................................................................................... 35
3.3.2 Normalization ............................................................................................. 35
3.3.3 Stop Word Removal .................................................................................... 36
3.3.4 Stemming .................................................................................................... 36
Chapter 4 ........................................................................................................................... 38
4. Experiments and evaluation....................................................................................... 38
4.1 Introduction ........................................................................................................ 38
4.2 Experimentation ................................................................................................. 38
4.2.1 Data Collection ........................................................................................... 38
4.2.1 Implementation ........................................................................................... 39
4.2.2 Hyperparameters ......................................................................................... 39
vii
4.3 Evaluation and Results ....................................................................................... 41
4.3.1 Performance Metrics ................................................................................... 41
4.3.2 Experimental Result of CNN Model........................................................... 42
4.3.3 Experimental Result of LSTM Model ........................................................ 44
4.3.4 Experimental Result of Bi-LSTM model .................................................... 45
4.3.5 Discussions ................................................................................................. 46
Chapter 5 ........................................................................................................................... 49
5. Conclusions and Recommendations .......................................................................... 49
5.1. Conclusions ........................................................................................................ 49
5.2. Contributions of this study ................................................................................. 49
5.3. Recommendations .............................................................................................. 50
References ......................................................................................................................... 51
viii
LIST OF ABBREVIATIONS
ADFQG Amharic Dataset for Question Generation
AQG Automatic Question Generation
AAQG Automatic Amharic Question Generation
AUTOQUEST Automatic Question
BLEU Bilingual Evaluation Understudy
Bi-LSTM Bidirectional Long Short-Term Memory
CNN Convolutional Neural Network
DNN Deep Neural Network
FAQs Factoid Answer Questions
GRU Gate Recurrent Unit
LSTM Long Short-Term Memory
METEOR Metrics for evaluation for translation with explicit order
NER Named Entity Recognition
NLP Natural Language Processing
POS Part Of Speech
QA Question Answering
QG Question Generation
RNN Recurrent Neural Network
ROUGE Recall-Oriented Understudy for Gisting Evaluation
SQAuD Stanford Question Answering Dataset
SRL Semantic Role Labeling
Seq2Seq Sequence to Sequence
Word2vec Word to vector
ix
LIST OF FIGURES
FIGURE 2.1 ROLLED-UP RNN (VENKATACHALAM, 2019) ............................................................. 13
FIGURE 2.2 INFORMATION TRACKING OF RNN (VENKATACHALAM, 2019) .................................. 13
FIGURE 2.3. UNROLLED RNN ....................................................................................................... 14
FIG 2.4. LSTM ARCHITECTURE ADAPTED FROM (YU Y., 2019).................................................... 15
FIGURE 2.5. BI-LSTM ARCHITECTURE ADAPTED FROM (CARLO TASSO., 2018) .......................... 16
FIGURE 2.6 ARCHITECTURE OF GRU ADAPTED FROM (YU Y., 2019) ........................................... 17
FIGURE 2.7 CONVOLUTIONAL NEURAL NETWORK FIGURE SOURCE: [(HADDAD ET AL., 2020)]... 19
FIGURE 3:1 PROPOSED GENERAL ARCHITECTURE OF AUTOMATIC AMHARIC QUESTION
GENERATION MODEL ............................................................................................................. 34
FIGURE 4.1 TRAINING AND VALIDATION ACCURACY CURVE OF CNN MODEL ............................. 43
FIGURE 4.2 TRAINING AND VALIDATION LOSS CURVE OF CNN MODEL ....................................... 43
FIGURE 4.3 TRAINING AND VALIDATION ACCURACY CURVE OF LSTM MODEL........................... 44
FIGURE 4.4 TRAINING AND VALIDATION LOSS CURVE OF LSTM MODEL .................................... 44
FIGURE 4.5 TRAINING AND VALIDATION ACCURACY CURVE OF BI-LSTM MODEL...................... 45
FIGURE 4.6 TRAINING AND VALIDATION LOSS CURVE OF BI-LSTM MODEL ............................... 46
x
LIST OF TABLES
TABLE 1.1. SUMMARY OF DIFFERENT QUESTION GENERATION MODELS WITH PURPOSE,
METHOD, EVALUATION STRATEGIES AND WITH ITS MESS ............................................ 11
TABLE 2.2: SUMMARY OF RELATED WORK ......................................................................... 31

TABLE 3.1 SAMPLE DATASET WITH ANSWER, QUESTIONS AND THE CLASS PAIR .................. 32
TABLE 4.1 COMPARISON RESULT OF THE THREE DEEP LEARNING ALGORITHMS .................. 48
xi
ABSTRACT
Questioning is the main tool to grasp knowledge in our day-to-day activities. But the manual
construction of questions is time-consuming, expensive, and needs experts in the area. So,
developing automatic question generation can reduce the time to construct and the need for human
labor. Numerous studies have been done on question generation in full resource languages like
English, Chinese, and others using various recent techniques. However, two works are being done
on the Amharic question generation problem using traditional approaches including rule- and
template-based. It needs hand crafted rules and templates to train the model which is so time
consuming and tedious and also the performance of the model heavy depend on the size and quality
of rules and template to training. So, it is not the effective for big data. Also, there is no available
question generation data set to overcome thus problems, this study uses deep learning for Amharic
question generation problem, a new method of neural network with attached rules to address the
aforementioned issues. Since Amharic is a low-resource language for NLP we construct rules to
generate questions. To do this we consider the use of tokenization, normalization, stop word
removal, and stemming then feed it to the deep learning model which is CNN, LSTM, and Bi-LSTM
to generate questions based on the given input. Training data is prepared manually which is so
tedious and time-consuming because there is no available Question Generation training dataset.
It is about 6100 Question-Answer and paired with five classes. The class depicts as ((0) for how
much, (1) who, (2) what, (3) when, (4) where, and (5) for others. Accuracy, precision, F-measure,
and confusion matrix are performance measures that are used to assess the model's overall
effectiveness when applied to the provided dataset. According to performance measurement, in
this study's third trial, LSTM, CNN, and Bi-LSTM the maximum achieved accuracy rates of 92%,
94%, and 95 %. The results showed that the proposed Bi-LSTM overcame the challenges of
Amharic question generation better than the other two models.
Key words: Amharic, Question Generator, Deep Learning, word2vec, and Natural Language
Processing.
xii
Chapter 1
1. Introduction
1.1. Background
Questioning is the main tool to grasp knowledge in our day-to-day activities. Because its drive to the clear
idea, it develops the thinking and rehearsal potential and facilitate the learning environment.
People may ask questions to fill their knowledge gaps or to understand their living world with the
help of technology. Our aim here is to enable a machine to generate questions by learning using
natural language processing (NLP). In this information-oriented world, the most rewarding and
promising task now is using what machines have learned to solve urgent issues using two ways
i.e., making them answer questions or vice versa, letting them ask meaningful questions (Zhang,
2020). To do so natural language processing (NLP) is a technique to analyze and represent texts
by machine. It allows the machine to do a task by replacing humans with minimal time effectively.
The next is visualization is question generation (QG), QG is a very important task of NLP (natural
language processing) that aims to generate natural and meaningful questions from an input, which
is unstructured (e.g., Text/passage...) and structured (e.g., database) (Le, Kojiri, & Pinkwart, 2014).
It aims to generate natural and relevant questions that can be answered by the given input. Also,
QG input can be text, image, or audio that can be used as input to produce the corresponding output
which is the question. The QG task is motivated in two different ways: (i) creating Q-A pairs from
customizable contents so that custom QA or dialogue systems may be built quickly; (ii) producing
large-scale QA pairs of acceptable quality that can be utilized as supplementary QA model training
data or to increase the effectiveness of human annotation during the building of QA datasets (Tuan,
2019).
QG is an inseparable task of our life that touches many aspects of our lives (Baghaee, 2017). It can
be used broadly for education and interactive & dialogue question answering. Moreover, it plays
an indispensable role in education in discussion in the classroom, for assessment as well as for
exam/test preparation. For instance, manual exam preparation is so time-consuming and tedious.
However, by using the automatic question generation system, the overall workload of the teachers
can be reduced so that they spend much of their time on their academic issues such as research to
1
solve their community problems, community services, and so on. And also, it can be used for
knowledge and skill acquisition for students in the way of reading compression, academic writing
etc.
QG can be used for a dialogue system (Yue, 2020)) like Chabot in medicine to assist the health
environment by providing possible questions to start and give feedback (Mostafazadeh, 2016).
Question generation can be used to put sample frequently asked questions on the search engines
as suggestion on relatedness. So that it helps the people with the problem of expressing their exact
need to search and also to avoid forgetful nature of humans to write what exactly to ask (Baghaee,
2017).
The output of QG can be input for the other NLP application research for the improvement of the
system. It can provide correct/robust input for question answering systems (QAS) to improve the
performance of the system. Hence QAS aims to provide the exact answer for this by eliminating
the time to browse a lot of links. The existing Amharic question answering system was trained
with rule-based/manual constructed questions-answer pairs that is time-consuming, and inefficient
for large-scale data, and due to this Amharic question answering is not effective and not publicly
available.
1.1.1 Question Generation Applications
QG tasks can be applied in different streams. One use case is the implementation of assisted/guided
learning. That is, a question generation system will determine a learner's capabilities. Based on the
learner's capabilities, the QG system will generate types of questions, such as questions requiring
factoid answers or questions involving more complex reasoning processes on the learner’s behalf.
The goal of QG is to generate a valid and fluent question according to a given input. Question
Generation can be used in many scenarios, such as automatic tutoring systems, improving the
performance of Question Answering models, and enabling chatbots to lead a conversation. A
variation of this use case is simple dialog agents or chat bots, much like Eliza, a dialog agent will
take user input, and rephrase it as a question conducting an endless dialogue.
2
Another use case is online assistants, such as travel assistants. Here is a customer inquiry from
some company about a travel request. A QG system will gather travel specifics by questioning the
customer.
QG is the task of creating questions that can be answered, given the content of a document (or
collection of documents). While most QG systems are used to generate FAQs automatically, I've
found that they can also be used to help do other kinds of natural language generation tasks, such
as multi-document summarization. Here's the basic idea: take a collection of documents and use
QG to generate a large number of questions (along with the paragraphs/passages they've been
generated from). Then, bucket up the questions according to some similarity metric so that you'll
have a semi-unique set of questions (and their answers). You can then use the most frequently
occurring Qs to identify some set of topics that should be addressed in your generated summary
or as a topical outline for the document you're generating.
Again, to dive in, we might consider automatic question-answering applications for different
domains. Automatic question generation (AQG) was defined by Rus et al.,(2008) “the task of
automatically generating questions from various inputs such as raw text, database, or semantic
representation.” Writing good questions, though, is challenging work, which can be a time-
consuming and labor-intensive process. The increasing availability of digital information, together
with the deployment of various question-answering applications, led to the development of
research in AQG.
Even if the research on automatic question generation started back in the 1970s, it is still under
development for different applications in different languages like English, China, and Portugal,
and despite all the above applications of question generation.
As far as the researcher’s knowledge is concerned, there is no available research work in Amharic
question generation using deep learning, so the researcher aims to design an automatic Amharic
question generator model that takes an Amharic legal text documents (sentences) and produces an
automatic Amharic question using deep learning approaches. The researcher used recent deep
learning approaches. We used legal text documents extracted from chilot.me/federal law website
to train the model.
3
1.2. Statement of the problem
Questioning is the main tool to grasp knowledge in our day-to-day activities. Because its drive to
the clear idea, it develops the thinking and rehearsal potential of students that facilitate the learning
environment. But the manual generation/construction of questions is time-consuming, expensive,
and needs experts in the area. So, developing automatic question generation can reduce the time
to construct and the need for human labor.
There is pioneer researches work in Amharic question generation that is the automatic generation
of Amharic math word problem and equation (AMW) (Andinet, 2020) which used the template-
based approach using elementary school mathematic textbooks and worksheets data. The paper
used shallow NLP techniques that result in a limitation on generating deeper semantic meaning for
full AMW problems. Also, it is ineffective for large-scale data. And the author proposed to use
deeper NLP techniques. Other remaining pioneers work on automatic factoid Amharic question
generation from historical text using a rule-based approach (Getaneh Damtie, 2021) It used the
transformation on rule with NER and POS Tagger information for sentence and answer key
selection to change the text content to its corresponding questions.
The above two research works used template and rule-based approaches to generate questions.
The problem is that they are not effective for large-scale data. Template- approach is effective only
for a specific domain. And also rule-based approach limitation is the performance of the system is
dependent on the rules defined to transform the sentences to corresponding questions.
So, to alleviate the above problems in this paper we proposed deep learning approach which is part
of neural network (NN). Because NN works by interpreting sentence as a sequence of words and
converting it into another sequence of output by don’t concern on the dependence of words and
other relationships in the sentence (seq2seq) (Ferreira F, 2020).
Question generation can be used to improve the question answering system by providing correct
and robust input as an output of QG and input to QA. In the meantime, Amharic question
answering is a researchable area and it has been under improvement for over a year. It aims to
provide the exact answer by eliminating the time to browse different links to get Amharic texts.
4
But a researcher in Amharic QA (Seid M., 2009; Desalegh A, 2013; Wondwossen T, 2013; Brook
E., 2013; Medhanit G, 2019) used manual annotated questions and answer pair to train their model.
The problem is manual preparation of questions requires experience, training, and resources. And
also, it is tedious, time-consuming, and inefficient for large data to extract knowledge for the
machine. So, developing a system for AQG to provide well-designed question-answer (training
corpus) can be used as input for question answering system to improve the overall performance of
the system. And also, until now, there is no publicly available corpus for question generation.
To alleviate the above problems, we have designed an Amharic question generator model by using
a deep learning approach.
Research Questions
The designed model should answer the following questions;
1. Can we use deep learning for automatic Amharic question generation?
2. To what extent is it possible to develop automatic question generators for Amharic

language using a selected approach?
3. Which deep learning algorithm is best for automatic Amharic question generation?
1.3. Objective of the study
1.3.1. General objective
The general objective of this study is to design an automatic question generation model for
Amharic legal text documents using machine learning.
1.3.2. Specific objective
The study specifically aims:
✔ To prepare training data (corpus) from Amharic articles.

✔ To build models that formulate questions from the given input text/document.
✔ To train the model using the selected machine learning approaches.
✔ To evaluate the model.
5
1.4. Scope and Limitation of the study
This research aims to develop automatic question generators from only Amharic legal documents
text input. It doesn’t consider audio/video, image, and table input to generate questions as output
for question answering to improve its performance. It generates only WH (what
(ምንድን/ምን/ስለምን/ምን ምን), when (መቸ), where (የት/ከየት), how much (ስንት) and who
(ማን/በማን/ከማን/ለማን)) questions.
1.5. Significance of the Study
Designing this automatic Amharic question generation using sequence to sequence end to end with
predefined rules is a contribution of this thesis. Automatic question generator models are used for
different natural language applications including interactive question answering for education and
health care to assist professionals and the environment and to put frequently asked questions to
search engines. Another considerable significance of this thesis is it can give insight to future
researchers in this area since it is a pioneer in the area of Amharic language.
1.6. Methodology of the research
Research methodology is the overall process and the way of solving the identified problems. It
includes methods, techniques, and approaches for data collection, analysis, training, and design of
the model.
1.6.1. Research design methodology
In this study, researchers used the design science research methodology recommended by March
et al., (1995) and Peffers et al., (2007). In this methodology, first, the problem is assessed by
literature and observations then based on the identified problem, proposed artifacts are designed.
Finally, the designed artifacts are evaluated to improve the model and finally communication is
there for submission thesis report and presentation to the experts for evaluation. This design
science methodology includes tools, methods, algorithms, and evaluation mechanisms. Due to this
reason, researchers used this methodology for automatic question generation for Amharic legal
documents research.
6
1.7. Organization of the thesis
The remaining part of this thesis is organized as follows. The second chapter which is a literature
review includes background theory, application of question generation and approaches, and
methods of existing works with related work on other languages with different applications. The
third chapter discussed the design and implementation of the model including data collection.
Analysis and training of the model. Chapter four includes experimental results and findings of
automatic Amharic question generation, and then in chapter five the results and discussions can be
elaborated including recommendations and research findings.
7
Chapter 2
2. Literature Review
This chapter discusses the overview of question generation and its background theory.it contains
a definition of question generation, its history, and a review of related and significant publications
State of the art in this area and the approach, techniques applied to question generation, and related
works are to be covered.
2.1. Overview of question generation
“Question generation is the task of automatically generating questions from inputs such as raw
text, database, or semantic representation”. Rus et al. (2008). Question generation is a natural
language generation in which the input has a means of possibilities depending on the application.
And it can be text in the form of sentence/paragraph (Grasser et al., 2009, Heilman & Smith l.,
2010), image (visual question generation) (Mostafazadeh et al., 2016, Zhang et al., 2017), and
knowledge graph (Serban et al., 2016a). The main goal of question generation is to create natural,
relevant, correct, grammatical, and human-understandable questions.
Questions in question generation can be classified into two classes: subjective and objective.
Objective questions are questions that can be answered by WH-(what/ምንድን….ነው?, when/መቸ?,
where/የት?, who/ማን? and so on...) questions and fill-in-the-blank questions. Fill-in-the-Blank

questions are also classified into two that are close (with one or more blanks and alternative
/distractors answers and open fill-in-the-blank are with one blank and one answer questions
(Agrarwal, 2012). Subjective questions are which are open-ended to answer like, definitions, short
answer questions, essay questions, and opinion questions.
In terms of question complexity, question generation can be classified into shallow and deep
question generation. Shallow questions are question that focuses on facts that can be answered by
one short answer (such as what, who, where, when, yes/no) (Agrarwal, 2012) and deep questions
needs logical thinking to answer (such as why, why-not, how, what-if and what-if-not) (Chen et
al., 2018).
Based on the purpose the designed question generation can be also classified into educational and
interactive & dialogue question answering. It can be used for support education for knowledge
8
acquisition/verification (Wolfe, 1976), knowledge assessment (Heilman & Smith, 2010), and
tutorial (Lindberg et al., 2013; Graesser et al., 2008).
Based on the way of question generation at the end, it can be classified into generative based and
retrieval based. The generative method works by generating and retrieval based on ranking. (Duan
et al., 2017).
Currently, the successful result of deep learning in other NLP research areas such as question
answering (Song et al.,2017), reading comprehension (Nguyen et. al., 2016), and machine
translation leads researchers to use deep learning approaches for question generation. Those are
the neural network, and sequence-to-sequence modeling and they are generally generative and
retrieval based. The establishment of deep learning algorithms (Recurrent Neural Network,
Convolutional Neural Network, Gate Recurrent Unit, Long Short-Term Memory, etc.., and
merging of various algorithms for certain tasks) make the models show performance improvement.
Question generation is a very important NLP task that provides credible applications for our life,
especially in education and question-answering systems. There has been an important trend in
2010 (The First Shared Task Evaluation Challenge on Question Generation (QG-STEC)) in
question generation, a trend which discontinued for now but is continuing in the background.
However, there is still a long way to go for question generation.
2.2. History of question generation
Research on question generation starts back in the 1970's and it has been under improvement and
investigation. Different researchers used different approaches to design their AQG model.
Generally, there are four basic approaches for question generation. These are syntactic, temple,
semantic and recent neural network-based approaches. The above three approaches respectively
work with predefined rules and templates (placeholder) whereas the neural network approach is
data-driven.
The syntactic approach works by considering syntactic structure/not semantic and it is feasible for
WH-questions. By following three basic techniques i.e., sentence simplification, key phrase
identification, and application of transformation rule. The pioneer of QG which used this approach
9
was AUTOQUEST by Wolfe (1976) for self-learning for novice English learners. And this work
shows that automatically generated questions can be effective as human-generated questions.
Paper by Heilman & Smith, (2011), used NLP syntactic parser by over-generating and ranking
technique for course assessment. The paper used some transformation rules to apply to the input
sentence to be converted to its corresponding question then it ranked the over-generated question
based to assure question quality.
Andrenucci & Sneiders, (2005) introduce template approach for question answering by putting
frequently asked questions followed by Chen et al., (2009); Lindberg et al.,(2013) on child support
for reading comprehension to self-questioning instruction and for intelligent tutor respectively by
using predefined text that holds variable to be filled by the coming input. Eg. Like “why <auxiliary-
verb><x>.
The recent work by Du et al., (2017) uses neural networks to read compression. To train their
model with sequence to sequence and end-to-end fashion used SQAuD QA dataset. It is data-
driven. So, it avoids time preparing manually crafted corpus. But it needs large training data to
train its model.
We investigate recent and advanced question generator models which are question generators for
question answering purposes (Duan et al., 2017). They used two approaches i.e. generative and
retrieval based trained by SQuAD dataset. It is data-driven. They explore how question generation
output/result can boost the performance of question answering and they get the result that the
questions generated by AQG model improve the performance of their question answering system.
Question generation models can vary in purpose, method, evaluation strategies, and language.
These are summarized as follows:
10
Table 1.1. Summary of different question generation models with purpose, method, evaluation
strategies and with its mess
Author Purpose Method/approach Evaluation methods
AUTOQUEST For reading Syntactic parsing Used human evaluation

(Wolfe,1976) comprehension Drawback
The parse is not effective due
to memory
Little focused on a semantic
approach
Rus et al. To support Pure syntactic (rule Used human evaluation

(2010) literature writing based) Didn’t use machine learning
tools
Husman A. For question Syntactic parsing with Used precise and recall for
(2012) answering rule interaction evaluation of a question
Limited rule results in recall
and precise question variations
Linderberg Reading Template-based Used human evaluation

(2013) compression approach with semantic limited variety of questions
role labeling
Heilmith For education Syntactic parsing Used human evaluation

&Smith (2009, (assessment) transformation rules Inefficient for very large
2010, 2011) with new over- question generation
generating and ranking
techniques
11
Du and Cardie For question- Neural network with Used human and automatic
(2018) answer pair gated co-reference evaluation
corpus knowledge technique
Du et al. For reading End to end via Automatic and human
(2017) comprehension sequence-to-sequence evaluation
learning Needs large-scale data to train
its model. ineffective for under-
resourced language like
Amharic
Duan et al., For Question Neural network Used automatic evaluation

.(2017) answering retrieval-based via
CNN and generation-
based via RNN.
Serban et al., For question- Neural network Used automatic (bleu, meteor)
(2016) answer corpus and human evaluation
Used knowledge freebase facts
rather than documents.
It needs large data to train the
model which is inefficient for
under-resource languages like
Amharic.
It faces the problem of out of
vocabulary and unknown
words.
2.3. Technique for Question Generation
Generally, question generation has been done using two broad approaches which are the traditional
approach (rule/template based) which is used manually constructed data to train their model, and
the new approach neural network which is data-drive done using RNN with LSTM, CNN, and
GRU, etc.
12
2.3.1 Neural network approaches
2.3.1.1. Recurrent Neural Network (RNN)

RNN is a neural network approach for sequential that values the context data because its output is
not dependent only on the current input but also on the previous input. RNN takes one or more
inputs and produces one or more outputs by deciding based on the hidden state. RNN has a loop
inside its architecture that act as an internal memory so it helps to capture/remember the previous
output to act as input with current state so the same input may produce different output due to the
hidden state. We can say that RNN has two input that is present and recent past.
Now a day RNN is used in many application areas such as natural language processing (NLP),
machine translation, and question answering so this success leads to being used in question
generation research.
Figure 2.1 Rolled-up RNN (Venkatachalam, 2019)
Figure 2.2 Information tracking of RNN (Venkatachalam, 2019)
RNN for simple loop process which is rolled on A state (RNN cell) by taking input xt and ht, looping
transferring information from one state to another process as seen in figure 2.1. So RNN works
13
recursive manner to its input and output as its name indicates thus steps until optimization of the
network reaches its threshold value as it can see in figure 2.1 and figure 2.2.
RNN consists of a sequence of inputs x=(x1,……. xt ) with hidden state h and optional output y,
and at each time step of t, hidden state of h(t) is updated as follows
h(t)=f(h(t-1),xt)………………………………………………………………………. Equation (1)
Where f is a non-linear activation.
For a series of input RNN the information passing of every single network creates unrolled RNNs,
which makes RNNs work in a sequence of data.
Figure 2.3. Unrolled RNN
For the sequence of inputs RNN works by taking the first input x(0) to state A then output h(0)
which is input together(x(1)+h(0) )=h(1)for the next step and produce to h(1)and so on as seen in
figure 2.3. So RNN works recursively until it reaches its threshold
Generally, figure 2.2 and as seen in figure 2.3the hidden state unrolled RNN update it as follows;
h(t)=Tanh(Whhht-1+Wxhxt)…………………………………………………………. Equation (2)
Yt=whyht……………………………………………………………………………. Equation (3)
Where ht hidden state with activation function
Tanh, activation function
Wh, is the weight matrix for input to hidden state that helps to measure the importance of the
current input and previous input to produce output.
Xt, input vector
Yt, new hidden state on the recurrent network which serves as input for the coming process.
14
2.3.1.2. Long Short-Term Memory (LSTM)
RNN face vanishing and exploding gradient on the input which yields difficulty to capture long-
term dependence for a large sequence of inputs by considering its relevance so to alleviate this
problem it used LSTM which is the advance/special type of RNN that has memory cell in their
hidden layer to remember its stored text knowledge for a long time.
LSTM has three gates which are forget gate, an input gate, and an output gate, and each gate has
its function to the entire. Forget gate function is to decide which part of the previous cell state (ct-
1) should be forgotten to free memory based on its relevance that is if ft=1 means keep information
whereas ft=0, get rid of the information. The second gate helps LSTM to decide what to update or
what new information to store in the cell state (replacement of forgotten). Finally, the output gate
helps it to decide which is relevant to produce an output as depicted in figure 2.4.
Figure2.4. LSTM Architecture Adapted from (Yu Y., 2019)
As seen in Figure2.4 LSTM network works as follows

1. Forget gate is the first sigmoid function which applied to current input xt and ht-1 current hidden
state to decide what to forget/throw from the previous cell ct-1computed as follows1.
𝑓𝑡 = 𝜎(𝑊𝑓ℎℎ𝑡 − 1 + 𝑊𝑓𝑥𝑥𝑡 + 𝑏𝑓 ………………………………………….…Equation (4)
2. Input gate used by LSTM to decide what new information to store on the cell state by updating
the old state (ct-1) with new ct.
𝑖𝑡 = 𝜎(𝑊𝑖ℎℎ𝑡 − 1 + 𝑊𝑖𝑥𝑥𝑡 + 𝑏𝑖)……………………………………….…… ...Equation (5)
Ct = ft ∗ ct − 1 + it ∗ C………………………………………………………….. . Equation (6)
15
3. Output gate used to decide which is relevant to produce an output.
Ot = 𝜎(Wohht − 1 + Woxx + bo) ………………………....…………………. ..Equation (7)
htot. tanh(ct) …………………………………….………………………...….. Equation (8)
2.3.1.3 Bi-LSTM
A bidirectional LSTM (a biLSTM) is a sequence processing model that consists of two LSTMs,
one of which receives input forward and the other of which receives it backward. With the help of
BiLSTMs, the network has access to more information, which benefits the algorithm's context.
The output of each LSTM is combined using their total as the data processing proceeds backward
and forwards one step at a time. Additionally, bidirectional LSTM (Bi-LSTM) allows for the
simultaneous extraction of a greater amount of contextual data. These models, which are state-of-
the-art solutions classifying sequential data into many classes, are more successful in resolving the
speech recognition issue.
Figure 2.5. Bi-LSTM Architecture Adapted from (Carlo Tasso., 2018)
2.3.1.4. GRU
Gate recurrent unite is a specialized version of RRN designed by Cho et.al. , (2014) for machine
translation. With two that is reset gate and update gate by reducing one gate of LSTM to control
the information flow. It used a reset gate used to control the data that flows into memory and an
update gate used to control the data that flow out of memory.
16
Figure 2.6 Architecture of GRU Adapted from (Yu Y., 2019)
GRU works as follows as depicted in Figure 2.6:-
Update gate
Used to determine/decide how much of the past/previous information needs to be passed to the
future. This gate is used to solve vanish gradient problem of RRN.
𝑍𝑡 = 𝜎(𝑊𝑧 𝑋𝑡 + 𝑈𝑧 ℎ𝑡1 _1)……………………………………………………………… Equation (9)

Reset gate zt
Used to decide how much of the previous information needs to be forgotten.
rt = σ(wz xt+ UZ ht1 ) …………………………………………………………. Equation (10)
The current state of the memory is computed as
ℎ̀𝑡 = 𝑡𝑎𝑛 ℎ (𝑤𝑥𝑡 + 𝑟𝑡 ⊙ 𝑈 ℎ𝑡_1 )……………………………..………………….. Equation (11)
ht = (1zt )ht1 + zt ht ……………………….……………………………………. Equation (12)

Where
17
zt= update gate with time step t
rt= update gate with time step t
ht´= current memory content
ht= final memory content at time step t and sigma (σ), tan and Hadamard (element-wise)
multiplication (⊙)
Both LSTM and GRU are effective for a sequence-to-sequence data for the memorization process.
2.3.1.5. Convolutional Neural Network

Convolution Neural Networks (CNNs) are multi-layered artificial neural networks that can extract
features from the picture, and text data as well as other complicated features in data. CNN's have
been applicable in computer vision applications such as image segmentation, object detection, and
classification. However, CNN's have recently been used to solve text-related issues like question
generation.
Convolutional neural networks (CNN) are a type of forwarding neural network that includes a
layer of neurons that performs the convolution process (Maslej et al. 2020). The architecture of
this network was influenced by the role of the ocular nerve. Neurons respond to the input of
surrounding neurons' activations using a convolutional kernel, also known as a filter, of a certain
scale. CNN's are deep, feed-forward artificial neural networks successfully applied to analyze
visual imagery, Part-of-speech tagging, name object recognition, semantic parsing, sentence
modeling, and search query retrieval have all recently been shown to be successful with CNN
models (Collobert et al., 2011; Kalchbrenner et al., 2014; Shen et al., 2014).
18
Figure 2.7 Convolutional Neural Network Figure source: [(Haddad et al., 2020)]
As shown in figure 2.6 CNN includes three types of layers: convolutional, pooling, and fully
connected layers.
To compute feature maps from the previous layer, each convolutional layer has many convolution
kernels. The convolution method would have learned distinct feature representations of original
inputs and passed them to subsequent layers.
The receptive field of a single neuron is described as a region of neighboring neurons that is
mapped to a single neuron on the next layer. The input is convolved by a trainable kernel, and the
result is applied with an element-wise nonlinear activation function to produce a feature map.
Pooling layers are used to reduce the number of outputs, reduce computational complexity, and
avoid over-fitting. Since duplicate data is generated when the convolution layers are applied, the
19
sampling layers are normally applied just behind them. The kernels change as they pass through
the individual inputs (Ogunfunmi et al., 2019).
2.3.2. Activation functions
The activation functions help the network use the important information and suppress the irrelevant
data points. When compared with a neuron-based model that is in our brains, the activation function
is at the end selecting what is to be fired to the next neuron. (Jain V., 2019)
Activation functions can be classified into two network architectures and conventional activation.
Conventional includes Tanh and Sigmoid activation functions and network architectural ReLU,
ELU, SELU, GELU, and ISRLU included(Nguyen et al., 2021).
Mapping the product of inputs and trainable parameters was the function of Tanh functions to a
range of -1 to 1. It reveals a potential problem called vanishing gradients, which occurs when the
gradient is too small or too big, or a saturated problem. The same as Tanh Sigmoid also has the
problem of vanishing gradient approaches to 0 and 1. This problem causes trainable parameters to
not be adjusted, resulting in stopping learning. However, when ReLU is trapped on the left side of
zero, it may experience a problem known as a dead state. When a problem affects many nodes in
a network, the network's output suffers.
(Nguyen et al., 2021)Furthermore, neurons have the same sign of gradient with ReLU, all the
layer's weights either increase or decrease. To solve vanishing gradients and dead states, ElU was
developed. The same as ReLU in positive functions, but differs in negative functions. In negative
functions, ReLU responded to 0 but ELU produces negative values even if weights updated only
in one direction cannot occur problems. SELU is developed for batch normalization before feeding
data to the neural network. The above-listed activation function didn’t solve batch dropout
regularization, but GELU solved it. The exponent function on the left side of the zero highly cost
computation problem occurred in ELU. To solve this problem and speed up learning ISRLU was
developed, but it needs a different experimental setup.
20
2.3.3. Optimizers
Optimizers are algorithms that are used to adjust the parameters of the designed model by
increasing the loss function and increasing the accuracy of the model.
( Yang L. & Shami, 2020) Yang defined the term hyperparameter optimization as a search for the
collection of precise model configuration arguments that result in the model's optimal performance
on a given dataset. The value of a hyperparameter is an attribute that is used to monitor and control
the learning process. Gradient descent is one of the most widely used optimization algorithms and
the most common method for optimizing neural networks. It aims to adjust weight and minimize
the loss or cost. It is a technique for minimizing a cost function J(θ) with model parameters by
updating the parameters in the opposite direction of the cost function ∇θJ(θ) gradients for the
parameters. The size of the measures we take to achieve a local minimum is determined by our
learning rate.
There is a way of gradient descent, each with a different amount of data used to calculate the
objective function's gradient. Batch gradient descent, stochastic gradient descent, and mini-batch
gradient descent are the three forms of gradient descent(Nabi J., 2019).
Stochastic Gradient Descent is the second optimizer, that tries to run a training epoch for each
within the dataset and update the model’s parameters more frequently, and it requires less memory
as no need to store values of loss functions was its advantage. It has a high variance in model
parameters and to get the same convergence as gradient descent needs to slowly reduce the value
of the learning rate were its disadvantages(Zou et al., 2019).
Adaptive gradient descent (Adagrad) is the third type of optimizer. It is an algorithm for gradient-
based optimization that avoids the concept of using fixed learning rates and uses dynamic learning
rates for every parameter. They are tailored to the frequency at which a parameter is updated during
training using parameter-specific learning rates. This implies that for infrequent parameters, larger
updates are performed and for regular parameters, smaller updates are performed. The
improvement from Adagrad is known as the Adaptive delta (Adadelta). It is a robust extension of
adagrad that adapts the learning rate based on a moving window of gradient updates, instead of
21
accumulating all past gradients. Delta in Adadelta refers to the difference between the current
weights and newly updated weights.
Nesterov Accelerated Adaptive Moment Estimation (Nadam) is also the other type of objective
function which combines Adam and Nesterov. Nadam uses Nesterov to update the gradient one
step ahead by replacing the previous ˆm of adam optimizer with the current ˆm. The learning
process is accelerated by summing up the exponential decay of the moving averages for the
previous and current gradient.
2.4. Word Embedding
Word embedding is a vector representation of words in the text. It captures the context of each
word in a text document, syntactic (structure), and semantic (meaning) similarity, integration with
other words (karani, 2019). Word vectorization is a process of changing the words in to vectors or
some sort of real numbers which can be preprocessed by the natural language modeling. After all
the final output is word embedding. Word embedding (output of vectorized word) can be able to
store the meaning of the word using low-dimensional vector (representation of words that
incorporates context.), and it can be used in multi aspect of NLP researches. Mikolov et al. (2013)
define word2vec's appearance as one of the first word embedding models that set off a huge wave
in the field of lexical semantics, the effects of which can still be felt today. In all cases, the word
is associated with the same representation, even though different contexts may cause different
meanings of the word, some of which may be semantically unrelated. Sense representations are an
attempt to remedy the deficiency of word embeddings in terms of sense conflation. The word
embedding approach is holding word co-occurrence statistical information with certain sequence
to determine phrases, words, or paraphrases. Naturally, every feed-forward neural network that
takes words from a dictionary or word gallery as input and embeds them as vectors into a lower
dimensional space, after that made adjustment via back-propagation, certainly produces word
embedding, which is usually referred to as Embedding Layer.
2.4.1 Word2Vec
Word embedding defined by Wang et al., (2019) is a real-valued vector representation of words
that incorporates both semantic and syntactic interpretations from a vast, unlabeled corpus. It's a
powerful tool that's commonly utilized in modern natural language processing (NLP) applications
22
like semantic analysis, information retrieval, and dependency parsing. Shi et al., (2019) define
embedding as the process of converting each text into a numerical representation in the form of a
vector. Word2vec is one of the most often used methods. When compared to older techniques,
embedding vectors generated by the word2vec technique provide several improvements, including
latent semantic analysis.
The semantics of words in a sentence or document is preserved by the Word2Vec embedding

model, and the phrase's contextual integrity is preserved(Sivakumar et al., 2020). Although the
embedding matrix is tiny, it contains information on the term's aspect. Because of its simplified
design, it is faster than any other methodology. This type of model is effective and good in both
small and large datasets. Given unlabeled training data, it is a Neural Network model that encodes
the semantic information of each term in the corpus. To determine semantic similarity, it calculates
the cosine resemblance between the word vectors. The vectors of similar meaning words are
comparable, whereas the vectors of different meaningful words are diverse. Skip Gram (SG) and
Continuous Bag of Words (CBOW) are the two flavors. Nikolentzos et al., (2017) explained word
vectors are learned using back propagation and stochastic gradient descent methods. A single layer
is hidden in both models. When W is the weight matrix, the input to the SG is a single word Wi,
and the output is a context of words (W0, 1,..,W0; N). The CBOW takes a context of words (W0,
1, ..., W0; N) as input and outputs a single word, Y.
2.5. Question Construction Approaches
Existing AQG approaches can be classified into three categories: syntax-based, template-based,
and semantics-based Le et al. (2014). Syntax-based approaches (also called transformation-based
approaches) are especially effective for short sentences, where questions are generated about
explicit factual information at the sentence level. They usually work through the following three
steps Le et al. (2014): (a) delete the identified target concept, (b) place a determined question
keyword at the first position of the question, and (c) convert the verb into a grammatically correct
form considering auxiliary and model verbs. For example, the sentence Barack Hussein Obama II
served as the 44th president of the United States from 2009 to 2017 can be transformed into the
following question: Who served as the 44th president of the United States from 2009 to 2017? The
aforementioned processes can be carried out by an algorithm without it being aware of the
underlying meaning of the changed sentence, and it may also produce grammatically incorrect
23
queries due to the shallow realization of the modified parse trees used to represent the input
sentences.
Heilman (2011) focuses on automatically generating factual WH-questions by applying

combinations of general rules. Their system breaks AQG down into a multistep process: simplified
factual statements are first extracted from complex inputs by (optionally) altering or transforming
lexical items, syntactic structure, and semantics. Next, the sentences are separately transformed
into questions by applying sequences of simple, linguistically motivated transformations such as
subject-auxiliary inversion and Wh-movement. They employ some core NLP tools in their system
to analyze the linguistic properties of input sentences: Stanford Phrase Structure Parser to
automatically sentence-split, tokenize, and parse input texts; the Tregex Tree Searching Language
and Tool to identify the relevant syntactic elements (e.g., subjects of sentences); Supersense
Tagger to label word tokens with high-level semantic classes such as noun. person, noun. location,
and so on; and ARKref Noun Phrase Coreference Tool to identify the antecedents of pronouns.
The system of Agarwal and Mannem (2011) uses analysis of the senses of discourse connectives
to generate questions of the type why, when, give an example, and yes/no. For sentences containing
at least one connective, the system finds the question types based on discourse relation (e.g., the
sense of the discourse “because” is “casual,” so the selected question type is “why”). To get the
final question, a set of transformations is applied to the content and the arguments found in the
task of content selection. For example, from the sentence “Because shuttlecock flight is affected
by wind, competitive badminton is played indoors,” the following question is generated: “Why is
competitive badminton played indoors”?
The template-based approaches rely on the idea that a question template can capture a class of
context-specific questions with the same structure. For example, Mostow and Chen, (2009)
developed templates such as: “What would happen if <X>?” for conditional text, and “Why
<auxiliary-verb><X>?” for linguistic modality, where <X> is the placeholder mapped to semantic
roles annotated by a semantic role labeler. These question templates can only be used for these
specific entity relationships, whereas for other kinds of entity relationships, new templates must
be defined. These approaches are most suitable for applications with a special purpose, which
sometimes comes within a closed domain. In a common template, there are three components:
24
question and answer, which are the output from the learning document, and entries, which are used
for template matching.
Hussein et al. (2014) propose a rule-based QG system that is trained by storing a huge amount of
template rules in the data store. The proposed system uses a pure syntactic pattern-matching
approach to generate content-related questions to improve the independent study of any textual
material. The training phase is performed by annotating various sentences related to an application
domain with certain keywords and POS (part-of-speech) tags available in OpenNLP. Based on the
associated POS tags, the system tries to find a similar template rule in its data store, and if there is
no match, the user is asked to add a new template. For doing this, the user is prompted for a WH-
type, a new verb, and question tags.
Le et al. (2014) propose a QG approach, which makes use of semantic information available on
WordNet to give students ideas related to a discussion topic and guide them on how to expand it.
After analyzing and parsing a given topic text to extract important concepts, every extracted noun
or noun phrase is used as a resource to search for its related concepts in WordNet. The proposed
approach consists of three steps. First, question templates are generated without using WordNet
(e.g., “What is <X>,” whereas X is a noun or a noun phrase extracted from the discussion topic).
The second step is question construction using retrieved hyponyms and the previously generated
templates (e.g., “What is activation energy,” where “activation energy” is one of the hyponyms of
the noun “energy”). Finally, questions are generated using the example sentences provided by
WordNet and ARK (Heilman & Smith, 2009), a syntax-based tool for generating questions from
English sentences or phrases, which achieved 43.3% acceptability for the top 10 ranked questions
and produced an average of 6.8 acceptable questions per 250 words of Wikipedia texts.
Each of the different approaches has its advantages in the task of AQG. While the template-based
methods usually generate domain-related questions that are correct grammatically, the syntax-
based methods provide better coverage of the text, and the semantic-based methods use some
background resources to provide more interesting questions. Some of the more recent techniques
try to combine the different approaches to develop high-quality questions.
Mazidi and Nielsen (2015) present a template-based QG system that is built on multiple views of
text: syntactic structure retrieved from dependency parse, paired with information from semantic
25
role labels and discourse cues. Their system builds a lexical-semantic tree structure using
dependencies from the Stanford dependency parsers and semantic roles extracted from SENNA.
The inclusion of modifiers such as casual, temporal, and locative from SENNA allows the
generation of semantically oriented questions such as how and why questions. For QG, each
sentence is matched against a list of around 50 possible patterns, which were manually constructed
to match features that might appear within the tree structure of each sentence. Questions are
generated whenever a pattern matches the sentence. Their results show that the dependency parse
can provide a good foundation for QG, particularly when combined with information from
multiple sources.
Labutov et al. (2015) propose a crowdsourcing-based methodology for generating deep

comprehension questions from the novel text. They use an ontology-crowd-relevance workflow
for generating high-level questions that cannot be answered from a single sentence in the article.
Their method involves three stages: first, the text is decomposed into a two-dimensional ontology
represented as category-section pairs (e.g., category: person, section: early life), then high-level
templates are solicited from the crowd workers (e.g., “Who were the key influences on <Person>
in his/her childhood?”), and finally, a subset of templates is retrieved for a target text segment
based on its ontological categories. In this research, the ontology is a Cartesian product of article
categories and article section names (derived from Freebase and Wikipedia, respectively), and it
is used by the crowd workers for creating the relevant templates. The templates are converted into
questions by filling in the article-specific entity extracted from the title. If ontological labels are
not available, they are inferred from the text by a standard text classification algorithm using basic
tf-idf features. Similarly, the relevance of a question template to an article can be estimated using
the Euclidean distance between the question features and the article segment.
In contrast to the previous, mostly rule-based methods, Du, Shao, and Cardie (2017) have defined
the task of QG as a purely data-driven sequence-to-sequence learning problem that directly maps
a preselected text passage, spanning a sentence or a paragraph, to a question. Both the input text
and the output question are modeled as two distinct word sequences, which does not restrict the
generated questions to the words in the input text or even to the entire vocabulary of input
sequences. This is an important property of good reading comprehension questions focused on
understanding rather than just memorizing the learned material. The sequence-to-sequence QG
26
algorithm selects the next output token by maximizing the conditional log-likelihood of the
predicted question sequence, given the input text and the previously selected tokens. The
conditional probability of each token is calculated using the long short-term memory encoder-
decoder architecture trained on a corpus of a sentence–question pairs, where each word is
represented as a pretrained embedding of 300 dimensions. In Du et al. (2017), the sequence-to-
sequence method is trained and tested on disjoint subsets of the SQuAD corpus Rajpurkar, Zhang,
Lopyrev, and Liang (2016), where it naturally outperforms the rule-based method of Heilman
(2011), which was not tuned on the same dataset.
Recently, the work of Du et al. (2017) was extended in several directions. Thus, assuming that the
answer contains certain spans of the text from the input passage (like in the SQuAD corpus), Song,
Wang, Hamza, Zhang, and Gildea (2018) enrich the encoder with the question context calculated
as matching between the target answer and the entire passage. Another answer-aware QG method,
which utilizes paragraph-level context, is presented in Zhao, Ni, Ding, and Ke (2018)). In contrast
to these two works, Scialom, Piwowarski, and Staiano (2019) show how to adapt a novel,
transformer architecture to the task of neural question generation when the answer is not provided
as part of the input. To deal with rare/unseen words, they enhanced the basic Transformer
architecture with a copying mechanism, a place holding strategy, and contextualized word
embedding.
2.6. Amharic Language
Amharic is the most widely spoken Semitic language family next to Arabic in the world and it has
more than hundred million speakers inside the country. It is the official language of the Federal
Democratic Republic of the Ethiopian government. It has its witting script adapted from Ge’ez
alphabet called Fidel with 33 consonant letters and 7 vowels for each consonant total 231 Fidel’s
patters and 44 letter (275 characters) and it is written from left to right in contrast to Arabic, Hebrew.
2.6.1. Amharic word class
Amharic language has general eight-word class those are: ስም ፤ ግስ ፤ ትውሳከ ግስ ፤ መስተዋድድ ፤
ተውላጠ ስም ፤ ቅፅል ፤ መስተፃምር ቃለ አጋኖ። (Amare G., 1990)
27
2.6.2. Amharic sentence structure
A general sentence structure in Amharic differs from English language which is Subject-Verb-
Object (SVO) whereas Amharic language sentence structure Subject-Verb-Object (SOV) or a
simple Subject + Verb arrangement. There is also other formats OSV but not common.
2.6.3. Amharic Interrogative Sentence Structure
An interrogative sentence is a sentence that asks a question and ends with a question mark.
Interrogative sentence construction differs from language to language. For Amharic language we
can ask about the subject of the sentence, the complement of the sentence, the action of the
sentence, and the specifier of the sentence (Amare G, 1990).
Common Amharic integrative words in integrative sentences are:
ስ ን ት, ስ ን ቱ, ከስ ን ት, በስ ን ት,
ማን , ማን , ማነ ው, እ ነ ማን , ማን ማን , ማን ን , ማና ማን , እ ነ ማን ን , የ ማን , ከማን ኛው,
ምን ,ምን ዴን , የ ምን , ምኑ ን , ስለምን , ምን ,ማን ኛው, ማን ኛይቱ, ማን ኛዋ, ማን ኛቸው, በ ማን ,
ተና ገ ር , ጥቀስ , ግሇፅ , ዘ ርዝር , ጥራ,
የ ት, የ ቱ, በየ ት, የ ቷ, የ ቲቱ, የ ቶቹ, ወዳት, ከየ ትኛው, የ የ ት, የ ትኛው, የ ትኛዋ, ከየ ት, እ ስ ከየ ት, የ

ትኛው,
በ ምን , እ ስ ከምን , ከምን , ወዯምን , ምን ጊዜ, and
መቸ, በ መቸ, እ ስ ከመቸ.
2.7. Related works on question generation and approaches
2.7.1 Question generation for English language
Du X., (2017) du attempted to create the first NQG, which inputs a phrase into an RNN-based
encoder and outputs a question regarding the text using a decoder. The attentiveness mechanism
helps the decoder pay attention to the sections of the input text that are most important for
28
formulating a question. However, this model does not encode the target answer, and thus may not
be able to learn which information to emphasize for question generation. However, this model
does not encode the target answer, and thus may not be able to learn which information to
emphasize for question generation. Due to this, it is not effective for paragraph-level QG only
effective for sentence level.
Singh J.,(2018) attempted to generate questions from the given text using deep learning methods
CNN, GRU, and LSTM using SquAD data set to tian the model. They reported that GRUs
evaluation score close little lesser (≈15% lesser) than LSTM in time of training whereas CNN
score is less than both LSTM and GRUs (≈25% lesser) training time this implies that CNN is also
effective to generating text (text questions) which capture semantic and achieves competitive
evaluation.
Duan N., (2017) attempted question generation for question answering using deep learning with
CQA (community question answering dataset). Their aim is How to use large-scale QA pairs
automatically crawled and processed from the Community-QA website as training data to create
questions from given passages using neural networks to improve the question answering system
(QAS). They used two approaches one is a retrieval-based method using a convolution neural
network (CNN), and the other is a generation-based method using a recurrent neural network
(RNN). They evaluate the generated questions using BLEU score and human annotator They
tested three benchmark datasets on sentence selection task, including SQuAD, MS MARCO, and
WiKiQA, and integrating QA pair to end QA task shows the significant improvement that indicates
that QA and QG are dual tasks that can boost each other
Du X., (2018) attempted question generation large-scale question-answer pair dataset for question
answering training data. They used conference knowledge with a neural network that is LSTM.
the proposed gated Coreference knowledge for Neural Question Generation (CorefNQG), a neural
sequence model with a novel gating mechanism that leverages continuous representations of
coreference clusters the set of mentions used to refer to each entity and to better encode linguistic
knowledge introduced by coreference, for paragraph-level question generation. But it needs a large
training dataset
29
2.7.2 Question generation for Amharic language
Even if there is lots of work in Question generation with rapid advancement globally like English
language with full of resources question generation is a new research area in Amharic language.
There is a very few works in this field. Pioneer researches work in Amharic question generation
is a generation of Amharic math word problem and equation (AMW) (Andinet Assefa Bekele,
2020) which used template-based approach on elementary school mathematic textbooks and
worksheets data. The paper used Walt Information Center (WIC) tagged corpus, which is manually
tagged by the Ethiopians Languages Research Center staff member, and 154 gathered Amharic
math word problems from other educational worksheets. It is passed with segmentation,
tokenization, and PoS tagging preprocessing processes then creating AMW problems and
equations is carried out after preprocessing example AMW problems. Those are the creation and
modification of templates. A template contains AMW problems with a placeholder to be field by
the incoming input. The paper was it used shallow NLP techniques that result in a limitation on
generating deeper semantic meaning for full AWM problems Also, it is ineffective for large-scale
data. And the author proposed to use deeper NLP techniques.
Other remaining pioneers work on automatic factoid Amharic question generation from historical
text using a rule-based approach (Getaneh Damtie, 2021). It used the transformation rule with NER
and POS Tagger information for sentence and answer key selection to change the text content to
its corresponding questions. It used historical data to train its model. They got the experimental
result of 86.4 %t accuracy for PoS tagger, 82.0 % accuracy for NER, and 95.3 % accuracy for
relevant sentence selection. The overall accuracy of the question generator was 84.6 percent. Its
limitation was the performance depends on the numbers, varieties, and qualities of the rules. And
is not effective for large-scale data.
2.8. Summary
The related work of Question generation is summarized and analyzed as follows in the table
30
Table 2.2: Summary of related work
No Autor Year Title Gap Approach Result
1 Du et 2017 Learning to Cannot handle sequence-to- The system

al. Ask: Neural paragraph sequence ranked best in
Question context, only learning 38.4% of
Generation for done RNN (RNN) evaluations
Reading with an
Comprehension average rank of
1.94 inter rate
agreement of
0.23
2 Du X. 2018 Harvesting Needs massive Deep learning F-measure of

and Paragraph-Level data to train with 78.8%
Cardie Question- coreference
C. Answer Pairs
from Wikipedia
3 Andinet 2020 Automatic Semantic Template 93.84%

Assefa Generation of problem based accuracy
Amharic Math achieved
Word Problem
and Equation
4 Getaneh 2021 Automatic Ineffective in Rule based 86.4%

Damite Amharic Factual large data size accuracy
Question achieved
Generation from Performance
Historic Text depends on rules
Using Rule- size and quality
Based Approach
31
Chapter 3
3. Design of Amharic Question Generation Model
3.1 Introduction
This chapter discussed about the proposed deep-learning-based automatic Amharic question
generation model's design and work flow of the designed model including text pre-processing tasks
like stemming, normalization, and tokenization. It also discussed deep learning model's training
process and use the dataset to assess the model's performance.
3.2.The Dataset preparation
The dataset is organized based on law data extracted from their website (chilot.me/federal laws)
(more talking about the constitution) with 6100 answer questions paired in training data. The data
preparation format is as depicted below.
Table 3.1 sample dataset with answer, questions and the class pair
Answer Question Class
A: ፍትሀ ነገስት በ1240 አካባቢ በግብጽ አገር ቅብጡ አቡል Q: ፍትሀ ነገስት መቸ 3
ፋዳኢል እብን አል-አሳል በአረብኛ የጻፉት ህገ መንግስት ነው። ተጻፈ?
A: ፍትሀ ነገስት በ1240 አካባቢ በግብጽ አገር ቅብጡ አቡል Q: ፍትሀ ነገስት በማን 1
ፋዳኢል እብን አል-አሳል በአረብኛ የጻፉት ህገ መንግስት ነው። ነው የተጻፈው ?
A: ፍትሀ ነገስት በ1240 አካባቢ በግብጽ አገር ቅብጡ አቡል Q: እብን አል አሳል 1
ፋዳኢል እብን አል-አሳል በአረብኛ የጻፉት ህገ መንግስት ነው። ማናቸው ?
A: እብን አል-አሳል የጻፉት ህገ መንግስት ህጎቹን የወሰዱት በከፊል Q: ፍትሀ ነገስት ህጎቹ 4
ከሀዋርያት ጽህፈቶችና በከፊል ደግሞ ከድሮ ቢዛንታይን ነገስታት የተወሰዱት ከየት ነበር ?
ህጎች ነበር ።
32
A: እብን አል-አሳል የጻፉት ህገ መንግስት ህጎቹን የወሰዱት በከፊል Q: ፍትሀ ነገስት ህጎቹ 4
ከሀዋርያት ጽህፈቶችና በከፊል ደግሞ ከድሮ ቢዛንታይን ነገስታት ምንጭ ከየት ነበር ?
ህጎች ነበር ።
When dealing with the data utilization, the questions are generated from the articles with intended
question generation patterns such as only WH (what (ምንድን/ምን/ስለምን/ምን ምን), when (መቸ),
where (የት/ከየት), how much (ስንት) and who (ማን/በማን/ከማን/ለማን)).
3.2 Model Architecture

As presented in Figure 3.1 below, the general outflow model of the automatic Amharic question
generation model starts from collecting Amharic corpus from different sources. Firstly, revise
different literature related to our study and SQAuD dataset format from different sources. Based
on the literature and the time constraint our dataset is limited to legal text documents extracted
from chilot.me/federal laws domain. We totally collected 6500 Amharic-question answer sets and
given for Amharic experts for evaluation of the annotation. The expert reviewed our dataset and
only 6100 Amharic question answer set remains because the others were rejected due to syntactic
problems. This total dataset passes through our preprocessing step before splitting into training
and testing. The output of preprocessing dataset padding into the equal length of 100 sizes and
changed into text-to-sequence. When the data is ready for training, we begin developing our deep
learning model in a hyperparameter experimental configuration.
On the other hand, we collected 612,138 Amharic sentences from different sources and build our
word2vec model for the embedding layer in order to handle the context. After compiling our data
with the model, we had defined, we trained the model by fitting it to the data across a series of
rounds or epochs starting with the default parameters, changing the experimental setup as
necessary to get the lowest loss and highest accuracy. The final aim of this thesis study would be
to validate and test the model's performance. As a result, the model had three modules: a
preprocessing module, a module for model building, and a feature mapping trained model with an
Amharic question-answer set. In the last module if the output of the SoftMax classifier results in a
33
prediction less than 4 levels then the model generates Amharic questions or nothing response to
the mode if the classifier is classified as another class.
Figure 3:1 Proposed General Architecture of Automatic Amharic Question Generation Model
3.3 Preprocessing
In this section, text preprocessing is used to turn the raw text into useful data. Preprocessing
involves eliminating undesired, special characters, and punctuation from raw data and replacing it
with a single alphabet (normalization) in a separate representation. Due to the effect on the model's
accuracy and performance is lessened but not numbers because of the usability of our dataset.
34
3.3.1 Tokenization
Tokenization is the process of dividing data into small word-sized segments or the entire corpus
into word-level segments. It breaks down a stream of text into words by taking the input text from
a user or from the provided corpus and dividing it into a series of tokens. Finally, it provides a list
of the words that will be utilized in the following preprocessing stage. Due to the fact that Amharic
special characters are not recognizable as those in other languages, the model we utilized was also
divided into character levels. White spaces and other punctuation, such as the question mark "?"
are typically employed as the closest approximations of word-to-word delimiters in Latin
(boundary markers between sequences of words). Amharic has its own punctuation marks that
divide texts or phrases into a stream of words, just like the other languages. Amharic punctuation
marks include ‘ሁለት ነጥብ’ (፡)/two points/ ኮለን (,) /colon/ ‘አራት ነጥብ(።)/full stop/, ‘ነጠላ ሰረዝ’(፣
)/semicolon/, ‘ድርብ ሰረዝ’ (፤)/double comma/, ጥያቄ ምልክት?’/question mark/, ቃል አጋኖ
(!)/exclamation mark/, ይዘት (.)/dot/, and ትምህርተ ስላቅ (¡)/Education sarcasm/are used as sentence
delimiter or as white space.
3.3.2 Normalization
Tokenization divides the provided text into words or segments. Since there are three different sorts
of normalization problems in the Amharic language, normalization in this study changes a list of
words into a more standardized order. The first one is compound words think as different words
in the system like ‘ህገ’ and ‘መግስት’ but this is a single word in Amharic. We handle this type of
word by a list of double words and substitute like ህገ_መግስት if consequently occur. The second
type of normalization which contains a slash (/), dot(.) occurs sometimes as a title like ፍ/ቤት
expands and write as ፍርድ_ቤት. Lastly, the Amharic normalization made due to consistent writing
styles like አላማ ዐላማ ዓላማ substitute {አ፣ ዐ፣ዓ} by አ.
On the other hand, certain linguists and academics have expressed reservations about the
application of such Amharic character normalization techniques. They claim that the algorithm
replaces each character with a sound, which is the problem. As a result, there is a problem with
meaning and language standardization. Take a look at the following two words: ሰረቀ (stole) and
ሠረቀ(stole) shape of the initial character, which is named ሰ and ሠ, is the only difference between
35
the two words. Because the linguistic experiment produced a variety of results depending on norms
for the Amharic Language, we looked into whether the aforementioned normalization process had
an effect on our model. Our research indicates that the Amharic question generation model is
unaffected by the aforementioned normalized techniques. As a result, new Amharic character
normalization algorithms are required to create Amharic language semantic rules, as the
aforementioned Amharic normalization algorithm would not have been suitable for all Amharic
natural language processing.
3.3.3 Stop Word Removal
(Singh & Siddiqui, 2012) Since stop words alter the meaning of key terms in a document, they are
frequently removed from many NLP applications. Given that it covers the majority of the page
and requires more memory and processing time during model training. Because stop words in
Amharic are context-dependent like those in other languages, there aren't any standard or universal
ones.
Every text in a language has the most common terms, but they also contain the least amount of
information. Conjunctions, articles, and prepositions are commonly used in texts written in
Amharic but rarely provide any helpful information. In our task, word embedding containing the
context of the model is helpful to generate Amharic questions from even unseen answers.
Therefore, the removal of stop words from our corpus prior to training the model was done in order
to increase the accuracy, decrease memory usage, and produce better results by concentrating on
the key terms that lessen noise and false positives. Therefore, the word like ወይም ፣ በዚህ ፣ እንደ ፣
ነገር ፣ እና ፣ ና ፣ ፣ … occurred in the corpus remove by a list of stop words.
3.3.4 Stemming
Stemming is the process of keeping the stem word for each word in sentences while deleting any
additional suffixes, prefixes, and infixes. In Natural Language Preprocessing applications,
eliminating affixes is a crucial preprocessing step for a morphologically rich language like
Amharic at the time of stemming. The main goal of stemming is to identify words with similar
roots or stems that were given the same vector representation during word embedding. For this
reason, even though our stemmer has limitations, stemming was included in our preprocessing step
36
for this investigation. For example, the words (ሰባሪዎች፣ ተሰበረ፣ ሊሰበሩ፣ እንሰብራቸዋለን፣ ተሰበሩ፣
ተሰበረች፣ ልትሰበር) or as a future (ይሰብራል፣ ትሰብራለች፣ ይሰብራሉ፣ እሰብራለን) or as a past
(ሰብራለች፣ ሰብሯል፣ ሰብረናል፣ሰብረዋል) have root ስ-ብ-ር it’s better to save time and memory during
training also the other advantages of stemming. Under- and over-stemming was one of the study's
limitations. After the preprocessing, the manual stemming technique is used to resolve such issues.
To sum-up this, after preprocessing is done the preprocessed training dataset is feed to the
designed model.
37
Chapter 4
4. Experiments and evaluation
4.1 Introduction
To validate the proposed Automatic Amharic question generation, model a series of experiments
were done in this chapter. This section of the research also discusses data collection and evaluation
of the model. In this chapter, the basic procedures behind the implementation of question
generation from Amharic articles are investigated. The best fit procedures are depicted and the
selection of network strategy is elaborated via showing the performance graphs, which visualize
the casual design approaches.
4.2 Experimentation
4.2.1 Data Collection
The data collection of this research is not an easy task instead it is very tedious and time-consuming
work to collect 6500 questions and answer pairs within its context in the law domain from
(chilot.me). The collected question and answer pairs are given for one language expert to review
and the question pair which is passed by the language experts taken as a dataset for this research.
After the review of the language expert only 6100 question and answer pairs passed, and the
remaining 400 questions and answer pairs were rejected from our dataset because of
misconstruction of the sentence and not following the syntactic and grammatical rules of Amharic
language. We used different data cleaning mechanisms like stop word removal, punctuation mark
removal, titles, and abbreviations expanding and stemming into the word stem but not removing
numbers in our research context because it indicates time. The data cleaning step's main aim was
to prepare the data for training and testing and minimize resource consumption.
The dataset is split into training, validation, and testing within the ratio of 80%, 10%, and 10% of
the total respectively. Besides, the training data size (4880), validation data size (610), and testing
data size (610). However, the testing training dataset, the validation dataset and the training dataset
is independent to each other to know the performance of the model by unseen data during training.
38
4.2.1 Implementation
Since Python is a free open-source programming language that is utilized for data analysis and
machine learning in the field of artificial intelligence, we employ Anaconda software tools in this
experiment. We utilized the Jupiter code editor and the TensorFlow environment from Anaconda.
An object-oriented programming language that enables quick application development is Python.
It is software that uses a high-level programming language that enables you to create desktop and
web apps. While Jupiter is a Python environment for scientific creation, it also features editing,
interactive debugging, and testing.
4.2.2 Hyperparameters
The experimental results were affected by several deep learning hyperparameters; however, the
loss function was the one we used in our study. One of the most crucial features of deep learning
algorithms is this. The loss function, which produces the loss, forecasts the model's error, which is
used to assess the gradients.
Using error backpropagation, the neuron weights in a particular layer are modified to reduce the
error rate in subsequent assessments. Due to the multi-class nature of our classification in this
work, we used categorical cross-entropy loss functions. based on the typical categorical cross-
entropy (Ho & Wookey, 2019) calculated as:
𝑘 𝑚 4. 1
1
𝐽𝑐𝑐𝑒 = − ∑ ∑ 𝑦𝑘 𝑚 𝑥𝑙𝑜𝑔(ℎ𝜃(𝑥𝑚, 𝑘))
𝑀
𝑘=1 𝑚=1
Where, M = number of training examples, K = number of classes
Yk m = target label for training example m for class k, X = input for training example m
ℎ𝜃 = model with neural network weights 𝜃
The next hyperparameter that factors into deep learning algorithms experiment is optimization.
Zou et al. (2019) Define an optimization function that lowers the prediction error rate of the model.
Adam was the deep learning optimizer we picked for our experiment out of the several available.
Because it calculates each parameter's learning evaluation, it had a far higher learning rate than the
39
others and was successful in fixing our challenge (Maslej Krešňáková et al., 2020). Like RMSprop,
Adam has an exponentially decaying average of past gradients v(t) as well as an exponentially
decaying average of past squared gradients m(t). Since all moving averages are set to zero, the
calculation of the moments is skewed towards zero. This happens most often in the early stages of
decay when the decay parameters (β1, β2) are similar to 1. Such biased can be removed using
modified estimations m^t, v^t:
𝑚𝑡 𝑣𝑡 4. 2
𝑚^𝑡 = 𝑣^𝑡 =
1 − β𝑡1 1 − β𝑡2
β1 = The exponential decay rate for the first moment estimates(0.9)
β2 = The exponential decay rate for the second − moment estimates(0.999)
Even though a lot of optimizers, in this research from the above optimizers
The dropout rate served as the third deep learning hyper-parameter configuration in our
experiment. Dropping neurons is one of the techniques used to lessen overfitting. During the three
experiments, we used dropout as a variable hyper parameter, but dropout rate 0.39 is appropriate
for our model. During training, it at random disconnects inputs from the first completely connected
layer to the second fully connected layer, and during testing, it turns on all neurons. It makes it
possible for many, redundant nodes to activate when given a similar pattern, aiding in the
generalization of our model and reducing overfitting(Park et al., 2019).
According to Maslej Krešňáková et al. (2020)during model training, dropout is necessary; the neural
network's neurons and connections are randomly eliminated. Individual neurons can be
disregarded or eliminated to prevent over-adaptation and overfitting. A new sub-network with
different neurons than the previous iteration is created after each iteration. Such a strategy results
in a collection of sub-networks, which has a higher probability of capturing random occurrences
in data than a single robust network. When adopting this strategy, the parameter that influences
the likelihood of choosing some neurons must be a collection of sub-networks that have a higher
likelihood than a single robust network of detecting random occurrences in the data.
Model regularization was the other deep learning hyper parameter that was used. It permits the
imposition of fines during optimization on either layer activity or parameter. The loss function that
40
the network optimizes includes the total of these penalties. We used kernel regularized l2 in our
models. In order to avoid overfitting, L2 regularization penalizes the model's weights, limiting
their range to tiny values. In this study, we examined the regularization of the two kernels,
however, our model was successful in reducing the overfitting curve in L2 kernel regularization
with a penalizing rate of 0.002.
Hyperparameter Optimization's (HPO) objective is to automate the hyperparameter tuning

procedure and give customers the ability to employ machine learning models to solve practical
problems. It's crucial to employ the proper optimization process to locate appropriate hyper-
parameters. Random search is a decision-theoretic technique that randomly chooses hyper-
parameter combinations from the search space given a constrained execution time(L. Yang &
Shami, 2020). We used random hyperparameter optimization techniques at a time to restrict the
last range and then randomly test in one direction.
4.3 Evaluation and Results
4.3.1 Performance Metrics
We choose performance metrics and confusion metrics to measure the model performance using
standard classification metrics such as accuracy, precision, recall, and f1-score (Maslej et al. 2020).
For multiple classification problems, such metrics are simple to obtain and can be computed as
follows:
Accuracy: It is calculated as the sum of correct classifications divided by the total number of
classifications. The mathematical expression is given below.
TP + TN 4. 3
Accuracy =
TP + TN + FN + FP
Precision: it is a measure of the true positive among all positives. The mathematical expression
is given below
𝑇𝑃 4. 4
Precision =
𝑇𝑃 + 𝐹𝑃
Recall: commonly called sensitivity, corresponds to the true positive rate of the considered class.
The mathematical expression is given below
41
𝑇𝑃 4. 5
𝑅𝑒𝑐𝑎𝑙𝑙 =
𝑇𝑃 + 𝐹𝑁
F1-score: Is the weighted average of precision and recall. The mathematical expression is given
below.
2 ∗ (𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 ∗ 𝑅𝑒𝑐𝑎𝑙𝑙) 4. 6
𝐹1 − 𝑠𝑐𝑜𝑟𝑒 =
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑅𝑒𝑐𝑎𝑙𝑙
where:
True Positive (TP): is an outcome where the model correctly predicts the positive class
True Negative (TN): is an outcome where the model correctly predicts the negative class
False Positive (FP): is an outcome where the model incorrectly predicts the positive class
False Negative (FN): is an outcome where the model incorrectly predicts the negative class.
The other evaluation matrix used in this study is the confusion matrix. Based on the dataset
sentence's real labels and a predicated label, such metrics might be produced for each class to be
used in the multi-label categorization. Three deep learning methods were tested in three separate
tests using the predetermined constant and variable hyperparameters. Both graphical analysis and
in-text interpretation were used to examine the comparison result of the experiment using those
methodologies. Mean Square Error (MSE), Root Mean Square Error, and Mean Average Error is
used to calculate the cost of training. The confusion matrix is also used to compute the
classification outcomes of the model. The experiment was conducted using a dataset that was
divided into training, testing, and validation portions at ratios of 0.7 for training, 0.1 for testing,
and 0.2 for validation.
4.3.2 Experimental Result of CNN Model
As a result of applying the aforementioned hyper parameters, the experiment on a convolutional

neural network produced the following findings. Based on the confusion matrix and evaluation
criteria, the model's performance was assessed. The experiment that outperformed training
accuracy and minimal loss with an average evaluation matrix was displayed in Figure 4:2 among
the three experiments.
42
Figure 4.1 Training and validation Accuracy curve of CNN Model
The CNN model's training and validation accuracy on automatic Amharic question generation
challenges are shown in Figure 4:2. This outcome was superior to those of the remaining CNN
studies, with training accuracy and validation accuracy proceeding largely indistinguishably.
Because the value had changed and the result had no variance, the training was terminated early
at epoch 10.
Figure 4.2 Training and validation Loss curve of CNN Model
The loss of the CNN model during training and validation is shown by the above graph (Figure
4:3). The model exhibited no overfitting difficulties because we used model regularization and
43
dropout rate balancing in this experiment. This experiment's findings were superior to those of
the two others.
4.3.3 Experimental Result of LSTM Model
Using a set of specified hyper parameters, the LSTM model performed three experiments. The
experimental result that performs better on the Amharic question generation dataset is analyzed
below.
Figure 4.3 Training and validation Accuracy curve of LSTM Model
As shown in Figure 4.4 the training and validation accuracy of LSTM model has good performance
as the result presented.
Figure 4.4 Training and Validation Loss curve of LSTM Model
According to Figures 4.3 and 4.4, which show accuracy and validation of the training and loss
curves, the LSTM model performed well in this experiment. Accuracy, precision, recall, and f1-
score are used to evaluate it based on performance measures and a confusion matrix similar to that
44
of other models. Table 4.5 illustrates the performance characteristics of the LSTM model in single
experiments. The Table demonstrates that the average accuracy was 87%.
4.3.4 Experimental Result of Bi-LSTM model
Experimental results of Bi-LSTM model were analyzed and interpreted like that of other models.
Figure 4.5 Training and validation Accuracy curve of Bi-LSTM Model
The training accuracy and validation accuracy distribution for resolving Amharic question
generation issues during the experiment show which of the three experiments works best, as shown
by the curve (Figure 4.6). As a result, the Bi-LSTM model's training procedure was properly
validated, as the result showed. Around epoch 7, the model's training data were, nevertheless, more
difficult than the validation data. The issue is resolved by including the model's regular learning
rate(Smith, 2018).
45
Figure 4.6 Training and validation Loss Curve of Bi-LSTM Model
In comparison to the other two results, the experimental outcome of the Bi-LSTM model on this
experiment was effective. Figure 4.7 above illustrates that the model has no overfitting issues.
Since we used a regularized model and a batch size that was appropriate for our little dataset. Since
it is successfully classified and taught via backpropagation over two backward and forward layers,
similar to how CNN flattens its layers, at the end of the process. The total performance of the Bi-
LSTM model on Amharic question-generating issues is shown in Table (4.6). This is the outcome
of the one trial using performance measures, which was successful.
After a classification of question generation into five classes, then we change the vector into text
and by using the text matching technique retrieve the answer pair of the generated question.
4.3.5 Discussions
In this section, we used a testing dataset and various evaluation methodologies to measure the
model's overall performance. Three distinct experiments using three different deep learning
algorithms were conducted with both fixed and variable hyperparameters. Both word interpretation
and graphical analysis were used to assess the comparison result obtained from those
methodologies. Then, using evaluation measures, the average accuracy was examined in a single
table. Following is a discussion of each of the three studies' hyperparameters and how they affected
our experiment.
In experiment 1, all three deep learning methods were tested using padding, embedding, split ratio,
and vocabulary size within the defined constant hyperparameter configuration. Particularly, from
46
the variable’s activation = tanh, batch size = 32, random state = 0, and number of neurons = 100;
the hyperparameter optimizer = SGD. We used the Tanh activation function in this experiment,
although there was a problem with the gradient vanishing between -1 and 1(Nguyen et al., 2021).
The vanishing gradient issue prevents the model from just entering a global minimum. The
activation function problem affects the results of this experiment because the model parameters
for SGD vary significantly. The learning rate must be gradually decreased to reach the same
convergence as gradient descent. In this experiment, CNN performance was particularly dismal
for LSTM (86%) and Bi-LSTM (89%) (92 percent).
The second experiment's variable hyper parameters were random state=0, Optimizer=NAdam,
activation=ReLU, batch size=16, epochs=10, dropout=0.2, and number of neurons =60 instead of
the hyper parameter used in experiment 1. In this experiment, the activation function encountered
dead states between 0 and 1. However, the experiment's batch size was mostly determined by the
dataset. Because of this, the model's performance in comparison to other tests has decreased to a
minimum. Our study takes into account the dropout rate as well. This is problematic for our issue
because we picked a 0.2 dropout rate for our experiment. All negative values and dead state issues
have the activation function set to 0. The accuracy of LSTM, CNN, and Bi-LSTM in this
experiment was 83%, 85%, and 88%, respectively. Due to the aforementioned factors, all
algorithms exhibit low performance in the experimental results.
The constant hyper parameters were set during the previous experiment in both experiments 1 and
2. However, the parameters were as follows: random state = 42; optimizer = Adam; activation =
GELU; regularize = l2(0.001); batch size = 8; epochs = 10; dropout = 0.39; and number of neurons
= 40. Compared to the others, the Adam optimizers utilized in this experiment have a very good
learning rate. The GELU (gaussian rectified linear unit) activation function that was employed in
this experiment can learn from mini-batch data by resolving the dead state problem in ReLU. The
overfitting issues of the graph from 0 to 42 are minimized and regularized using the random state.
Due to the tiny dataset in this experiment, the batch size was also suitable for the models. In this
experiment, we employed model regularization and a small batch size with a dropout rate of 0.39
to improve model performance.
Additionally, in this experiment, nearly all algorithms—such as LSTM (92 percent), CNN (94
percent), and Bi-LSTM—obtained a solid average result for Amharic question generation
47
difficulties (95 percent). After the model classifies the question class less than 4 then we used a
rule-based approach to generate from the question-answer pair. In addition to this, our model
classifies the question class as less than 4 then we used a rule-based approach to generate questions
from the question-answer pair. Firstly our model classifies the question based on learned data and
its classification vector back changed into text and retired its question to display the user. The rule
does not respond for the user to anything if the model classification is 4 or in the other class of
question classification.
Table 4.1 comparison result of the three deep learning algorithms
Model Experiment 1 Experiment 2 Experiment 3

LSTM 86 83 92
CNN 89 85 94
Bi-LSTM 92 88 95
48
Chapter 5
5. Conclusions and Recommendations
This chapter discussed the observation from this research work and the recommendation made by
the researcher for further improvements.
5.1.Conclusions
In this study, our aim is to generate questions from Amharic legal text documents using deep
learning. We collected law data from their website called chilot.me/federal laws. we prepaid the
data set with question and answer pia with five classes (Answer, Question, Class) sett as class
depicts as ((0) for how much(ስንት), (1) for who (ማን/በማን/ከማን/ለማን) , (2) for what
(ምንድን/ምን/ስለምን/ምን ምን), (3) for when (መቸ), (4) for where (የት/ከየት), and (5) for others.
By doing so 6500 Amharic question-and-answer pairs in the law field were prepaid and examined
by the experts, those question passed by the experts were feed to the designed model. After the
review of the language expert only 6100 question and answer pairs passed, and the remaining 400
questions and answer pairs were rejected from our dataset because of misconstruction of the
sentence and not following the syntactic and grammatical rules of Amharic language.
Using the dataset and hyper-parameters that were built up in three separate experiments for each
mode the trained model predicts the predefined class and finally generate questions within answer
pair. The effectiveness of the model was evaluated using confusion metrics and performance
metrics like accuracy, precision, recall, and f1-score. According to the findings, during the third
experiment, LSTM, CNN, and Bi-LSTM, respectively, achieved accuracy of 92%, 94%, and 95%.
According to the testing data, Bi- LSTM outperformed the other models by achieving a state-of-
the-art result of 96 percent accuracy. Preparing data sets and information in Amharic is difficult.
Constrained language resources, were challenges in this study.
5.2. Contributions of this study

The major contribution of this study includes:
▪ The study designed an automatic question generation model by deep learning approach
would be the first model.
▪ Preparing the labeled dataset on legal text documents extracted from chilot.me/federal laws
website.
49
▪ We have developed three deep learning models LSTM, CNN, and Bi-LSTM; those are
useful for Amharic question generation problems.
• We developed word2vec word embedding models that handle the vocabulary and context
by CBOW within a size of 35 MB 612,138 sentences within 920,797 vocabulary size as
embedding layers.
5.3. Recommendations
In this study, generating factoid questions from Amharic legal text documents using deep learning
with labeled data set by shows promising result. To make full-fledged Amharic question
generation further research should be conducted. Based on our observation in our thesis we
recommend as follow
➢ Increase the dataset size to enhance the overall performance of the system.
➢ Increase the class and question varieties to generate quality questions.
➢ Incorporate non-factoid questions to have full-fledged Amharic question generation
➢ Incorporate different and recent NLP tools that enhances the performance of the system
➢ Design Automatic question Generation by pre-trained models by using BERT.
50
References
Agarwal, M., Mannem, P. (2011, June). Automatic gap-fill question generation from text
books. In Proceedings of the 6th Workshop on Innovative Use of NLP for Building
Educational Applications (pp. 56-64). Association for Computational Linguistics.
arXiv:1712.09827v1 [cs.CL] 28
Becker, L., Basu, S., &Vanderwende, L. (2012, June). Mind the gap: learning to choose gaps
for question generation. In Proceedings of the 2012 Conference of the North
American Chapter of the Association for Computational Linguistics: Human
Language Technologies (pp. 742-751). Association for Computational Linguistics.
Bekele, A.A. (2020). Automatic Generation of Amharic Math Word Problem and Equation.
Journal of Computer and communications, 59-77.
https://doi.org/10.4236/jcc.2020.88006
Bernardo Leite (2020). Automatic Question Generation for the Portuguese Language,
Thesis.
Brown, J. C., Frishkoff, G. A., &Eskenazi, M. (2005). Automatic question generation for
vocabulary assessment. In Proceedings of the conference on Human Language
Technology and Empirical Methods in Natural Language Processing (pp. 819-826).
Association for Computational Linguistics.
Chali, Y., and Hasan, S. (2015). Towards Topic-to-Question Generation. Computational
linguistic, 41(1), 1-20. doi:10.1162/coli_a_00206
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011).
Natural language processing (almost) from scratch. Journal of Machine Learning
Research, 12(ARTICLE), 2493–2537.
Damtie G. (2021). Automatic Amharic Factual Question Generation From Historic Text Using
Rule Based Approach thesis.
Das, B., Majumder, M. (2017). Factual open cloze question generation for assessment of
Learner’s knowledge. Int J Educ Technol High Educ.https://doi.org/10.1186/s41239-
017-0060-3
51
Du, J., Qi, F., & Sun, M. (2019). Using bert for word sense disambiguation. ArXiv Preprint
ArXiv:1909.08358.
Guy Danon & Mark Last (2017). A Syntactic Approach to Domain-Specific Automatic
Question Generation. ArXiv: 1712.09827.
Haddad, B., Orabe, Z., Al-Abood, A., & Ghneim, N. (2020). Arabic Offensive Language
Detection with Attention-based Deep Neural Networks. Proceedings of the 4th
Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task
on Offensive Language Detection, 76–81.
Heilman, M., & Smith, N. A. (2010, June). Good question! Statistical ranking for question
generation. In Human Language Technologies: The 2010 Annual Conference of the
North American Chapter of the Association for Computational Linguistics (pp. 609-
617). Association for Computational Linguistics.
Ho, Y., & Wookey, S. (2019). The real-world-weight cross-entropy loss function: Modeling
the costs of mislabeling. IEEE Access, 8, 4806–4813.
Ibrahim Eldesoky Fatto, (2014). Semantic based automatic question generation using
Artificial Immune System. Computer Engineering and Intelligent Systems
www.iiste.org ISSN 2222-1719 ISSN 2222-2863 (Online) Vol.5, No.8, 2014.
Jaspreet Singha, &YashvardhanSharmaa, (2018). Encoder-Decoder Architectures for
Generating Questions, International Conference on Computational Intelligence and
Data Science (ICCIDS).
Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A convolutional neural network
for modelling sentences. ArXiv Preprint ArXiv:1404.2188.
Ken Peffers, (2006).the design science research process: a model for producing and
presenting information systems research.
https://arxiv.org/ftp/arxiv/papers/2006/2006.02763.pdf
Kim, J., Kim, H., & others. (2017). An effective intrusion detection classifier using long
short-term memory with gradient descent optimization. 2017 International Conference
on Platform Technology and Service (PlatCon), 1–6.
Kunichika, H., Katayama, T., Hirashima, T., & Takeuchi, A. (2004). Automated question
52
generation methods for intelligent English learning systems and its evaluation. In
Proc. of ICCE.
Kurdi,G., Leo, J. , Parsia, B. et al.(2019). A systematic review of Automatic Question
Generation for Educational Purposes, In the International Journal of Artificial
Intelligence in Education, Springer.
LAST, M and Danon, G. (2020). Automatic question generation. Wires Data and knowledge
Discover. . doi:10.1002/widm.1382
Le NT., Kojiri T., Pinkwart N. (2014) Automatic Question Generation for Educational
Applications – The State of Art. In: van Do T., Thi H., Nguyen N. (eds) Advanced
Computational Methods for Knowledge Engineering. Advances in Intelligent
Systems and Computing, vol 282. Springer, Cham. https://doi.org/10.1007/978-3-
319-06569-4_24 282:325-338
Liu, M., Calvo, R. A., & Rus, V. (2012). G-Asks: An intelligent automatic question generation
system for academic writing support. Dialogue Discourse, 3(2), 101-124.
Loureiro, D., Pilehvar, M. T., Rezaee, K., & Camacho-Collados, J. (2020). Language
models and Word Sense Disambiguation: An overview and analysis. ArXiv.
Maslej Krešňáková, V., Sarnovsky, M., Butka, P., & Machova, K. (2020). Comparison of
Deep Learning Models and Various Text Pre-Processing Techniques for the Toxic
Comments Classification. Applied Sciences, 10, 8631.
https://doi.org/10.3390/app10238631
Mazidi, K., Nielsen, R. D. (2014). Linguistic considerations in automatic question generation.
In Proceedings of the 52nd Annual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers) (Vol. 2, pp. 321-326).
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word
representations in vector space. ArXiv Preprint ArXiv:1301.3781.
Miroslav Blšták , (2018). Automatic Question Generation Based on Sentence Structure
Analysis, thesis.
Mitkov, R., & Ha, L. A. (2003,). Computer-aided generation of multiple-choice tests.In
Proceedings of the HLT-NAACL 03 workshop on Building educational applications
using natural language processing-Volume 2 (pp. 17-22). Association for
Computational Linguistics.
53
Nguyen, A., Pham, K., Ngo, D., Ngo, T., & Pham, L. (2021). An Analysis of State-of-the-art
Activation Functions For Supervised Deep Neural Network.
Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., & Deng, L. (2016).
Msmarco: A human generated machine reading comprehension dataset. arXiv
preprint arXiv:1611.09268.
Nikolentzos, G., Meladianos, P., Rousseau, F., Stavrakas, Y., & Vazirgiannis, M. (2017).
Multivariate gaussian document representation from word embeddings for text
categorization. Proceedings of the 15th Conference of the European Chapter of the
Association for Computational Linguistics: Volume 2, Short Papers, 450–455.
Nwafor Chidinma A and Ikechukwu E. Onyenwe2,(2021). An automated multiple-choice
question generation using natural language processing techniques
Ogunfunmi, T., Ramachandran, R. P., Togneri, R., Zhao, Y., & Xia, X. (2019). A primer on
deep learning architectures and applications in speech processing. Circuits, Systems,
and Signal Processing, 38(8), 3406–3432.
Park, S., Song, K., Ji, M., Lee, W., & Moon, I.-C. (2019). Adversarial dropout for recurrent
neural networks. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01),
4699–4706.
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). Squad: 100,000+ questions for
machine comprehension of text. arXiv preprint arXiv:1606.05250.
Rus, V., Wyse, B., Piwek, P., Lintean, M., Stoyanchev, S., Moldovan, C. (2010, July). The
first question generation shared task evaluation challenge. In Proceedings of the 6th
International Natural Language Generation Conference (pp. 251-257). Association
for Computational Linguistics.
Seng T. (2019). Recent Advances in Neural Question Generation. National University of
Singapore.
Shen, Y., He, X., Gao, J., Deng, L., & Mesnil, G. (2014). Learning semantic representations
using convolutional neural networks for web search. Proceedings of the 23rd
International Conference on World Wide Web, 373–374.
Shi, M., Wang, K., & Li, C. (2019). A C-LSTM with word embedding model for news text
classification. 2019 IEEE/ACIS 18th International Conference on Computer and
Information Science (ICIS), 253–257.
54
Sivakumar, S., Videla, L. S., Kumar, T. R., Nagaraj, J., Itnal, S., & Haritha, D. (2020).
Review on Word2Vec Word Embedding Neural Net. 2020 International Conference on
Smart Electronics and Communication (ICOSEC), 282–290.
Tina Baghaee, (2017). Automatic Neural Question Generation Using Community-Based
Question Answering Systems, University of Lethbridge
Tuan L, Darsh J Shah, and Regina Barzilay (2019 Capturing Greater Context for Question
Generation. ArXiv: 1910.10274v1.
Viera Maslej-Kreš ˇnáková Peter Butka and Kristína Machová, M. S. *. (2020).
Comparison of Deep Learning Models and Various Text Pre-Processing Techniques
for the Toxic Comments Classification. Applied Sciences, 1–26.
Wang, B., Wang, A., Chen, F., Wang, Y., & Kuo, C.-C. J. (2019). Evaluating word
embedding models: methods and experimental results. APSIPA Transactions on Signal
and Information Processing, 8.
Wiedemann, G., Remus, S., Chawla, A., & Biemann, C. (2019). Does BERT Make Any
Sense? Interpretable Word Sense Disambiguation with Contextualized Embeddings.
Yang, L., & Shami, A. (2020). On hyperparameter optimization of machine learning
algorithms: Theory and practice. Neurocomputing, 415, 295–316.
Zijing Zhang (2020). Cue Word Guided Question Generation BERT Model Fine-Tined on
Natural Question Dataset. Thesis, university of Waikato.
Zou, F., Shen, L., Jie, Z., Zhang, W., & Liu, W. (2019). A sufficient condition for
convergences of adam and rmsprop. Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 11127–11135.
55

Fikir Setie Tezera

Uploaded by

Copyright:

Available Formats

Fikir Setie Tezera

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Fikir Setie Tezera

Uploaded by

Copyright:

Available Formats

DSpace Institution

DSpace Repository http://dspace.org

Amharic Question Generation from

FIKIR, SETIE TEZERA

MSc. Thesis on:

Amharic Question Generation from Amharic legal Text

Program: Computer Science

FIKIR SETIE TEZERA

Amharic Question Generation from Amharic Legal Text Documents using

Advisor Name: Esubalew A. (PhD)

FIKIR SETIE TEZERA 03-08-22

Name of the Candidate Signature Date

DECLARATION ............................................................................................................... iii

TABLE 2.2: SUMMARY OF RELATED WORK ......................................................................... 31

1.1.1 Question Generation Applications

The designed model should answer the following questions;

1. Can we use deep learning for automatic Amharic question generation?

2. To what extent is it possible to develop automatic question generators for Amharic

1.3. Objective of the study

1.3.1. General objective

1.3.2. Specific objective

The study specifically aims:

✔ To prepare training data (corpus) from Amharic articles.

1.5. Significance of the Study

1.6. Methodology of the research

1.6.1. Research design methodology

2.1. Overview of question generation

where/የት?, who/ማን? and so on...) questions and fill-in-the-blank questions. Fill-in-the-Blank

2.2. History of question generation

Author Purpose Method/approach Evaluation methods

AUTOQUEST For reading Syntactic parsing Used human evaluation

Rus et al. To support Pure syntactic (rule Used human evaluation

Linderberg Reading Template-based Used human evaluation

Heilmith For education Syntactic parsing Used human evaluation

Duan et al., For Question Neural network Used automatic evaluation

2.3. Technique for Question Generation

2.3.1.1. Recurrent Neural Network (RNN)

Figure 2.1 Rolled-up RNN (Venkatachalam, 2019)

Figure 2.2 Information tracking of RNN (Venkatachalam, 2019)

Figure 2.3. Unrolled RNN

Figure2.4. LSTM Architecture Adapted from (Yu Y., 2019)

As seen in Figure2.4 LSTM network works as follows

Figure 2.5. Bi-LSTM Architecture Adapted from (Carlo Tasso., 2018)

GRU works as follows as depicted in Figure 2.6:-

𝑍𝑡 = 𝜎(𝑊𝑧 𝑋𝑡 + 𝑈𝑧 ℎ𝑡1 _1)……………………………………………………………… Equation (9)

Used to decide how much of the previous information needs to be forgotten.

rt = σ(wz xt+ UZ ht1 ) …………………………………………………………. Equation (10)

The current state of the memory is computed as

ℎ̀𝑡 = 𝑡𝑎𝑛 ℎ (𝑤𝑥𝑡 + 𝑟𝑡 ⊙ 𝑈 ℎ𝑡_1 )……………………………..………………….. Equation (11)

ht = (1zt )ht1 + zt ht ……………………….……………………………………. Equation (12)

ht´= current memory content

2.3.1.5. Convolutional Neural Network

2.3.2. Activation functions

2.4. Word Embedding

The semantics of words in a sentence or document is preserved by the Word2Vec embedding

2.5. Question Construction Approaches

Heilman (2011) focuses on automatically generating factual WH-questions by applying

Labutov et al. (2015) propose a crowdsourcing-based methodology for generating deep

2.6. Amharic Language