Amharic Language Query Processing in Database Using Natural Language Interface
By
Smegnew Asemie
August, 2008
Jimma, Ethiopia
School of Computing
Jimma Institute of Technology
Jimma University
DEDICATION
This thesis is dedicated to my sister Mulusew Asemie and her husband Abebe Alemu. They have
always been great role models for me because of their commitment, and they have always
encouraged me to make the best effort to realize my dreams.
ACKNOWLEDGEMENT
Before all, I praise the almighty God and his mother St. Mary for making everything the way it
is. My greatest gratitude is extended to my advisor Mr. Getachew Mamo (assistant professor)
and my co-advisor Mr. Tefery Kebebew for their positive encouragement before the work
started and for their constructive comments and guidance after it began. They sacrificed
their time for ongoing discussions during the difficult period of selecting a title for the study.
It is also my pleasure to express my thanks to Dr. Millita Luke and Debela Tesfaye for their
constructive ideas on the area at the time of title selection and afterwards.
I would also like to thank my special friend Abeba Hailemaryam for all that she did over the
past two years. She treated me as the best of sisters treats her brothers, and she taught me
how to make a friend.
My deepest thanks go to all my friends and classmates, especially Workineh Tesema, Zerihun
Olana, Andualem Chekol, and Birhanu Ambes; our relationship is more than friendship. Besides
this, I would like to acknowledge all staff members of the School of Computing for their help
in completing this research on time.
Special thanks also go to my family, mainly my mother Dejitnu Abate, my sisters Mulusew
Asemie, Tirusew Asemie, and Birtukan Asemie, and my brother Gedefaw Mola, who have been
behind me, supporting and encouraging me through difficult times.
Table of Contents
ACKNOWLEDGEMENT
LIST OF TABLES
LIST OF FIGURES
LIST OF ACRONYMS
ABSTRACT
CHAPTER ONE
INTRODUCTION
1.1. Background
1.4. Methodology
1.4.5. Evaluation
2.5. Related works
Introduction
4.8. Result
Introduction
5.7. Join Queries
Conclusions
LIST OF TABLES
Table 2.1: Sample database table
Table 3.1: The Amharic Character Representation
Table 4.1: Sample stemming table
Table 4.2: A Sample for compound words
Table 4.3: Table_Handling Table
Table 4.4: Column_Handling Table
Table 4.5: Conditional_Word Table
Table 5.1: Employee table structure
Table 5.2: Department table structure
Table 5.3: Employee on education structure
Table 5.4: List query request and results
Table 5.5: Accuracy of List Query
Table 5.6: Single conditional query and results
Table 5.7: Accuracy of single conditional query
Table 5.8: Multiple conditional query and results
Table 5.9: Accuracy of multiple condition query
Table 5.10: Aggregate function query and results
Table 5.11: Accuracy of aggregate function
LIST OF FIGURES
Figure 2.1: Vertical Taxonomy of Information Retrieval model
Figure 2.2: Intermediate representation language architecture
Figure 2.3: Commonly used architecture of NLIDBs
Figure 4.1: Architecture of Amharic Language Interface for database
LIST OF ACRONYMS
AI: Artificial Intelligence
ASCII: American Standard Code for Information Interchange
ATN: Augmented Transition Network
CLIR: Cross Language Information Retrieval
DB: Database
DBMS: Database Management System
EAGLi: Engine for Answers in Genomics Literature
ECoSA: Ethiopian Computer Standards Association
Elfsoft: English Language Frontend Software
eLSSNL: eLibrary Searching System by Natural Language
FAM: File Access Manager
GINLIDB: Generic Interactive Natural Language Interface to Database
HCI: Human Computer Interaction
HMM: Hidden Markov Model
IDA: Intelligent Data Access
IE: Information Extraction
INLAND: Informal Natural Language Access to Navy Data
IQR: Intermediate Query Representation
IR: Information Retrieval
MASQUE: Modular Answering System for Queries in English
ABSTRACT
In the present computing world, computer-based information technologies are extensively
used to help organizations, private companies, and academic and educational institutions
manage their processes and information systems. Information systems are used to manage data.
A general information management system that is capable of managing several kinds of data
stored in a database is known as a Database Management System (DBMS). A Database
Management System is a collection of interrelated data and a set of programs to access those
data. Database systems are designed to manage large bodies of information.
To allow access to databases without knowledge of SQL, researchers have been working since
1976 on natural language interfaces to databases (NLIDB). As the name
suggests, an NLIDB allows an ordinary user to query a database in natural language. In this
paper, we propose an Amharic Natural Language Interface to Database (ANLIDB). Here, making a
request is as simple as asking a human in the local language (Amharic). We
deal with the Amharic alphabet, punctuation, and syntactic structure, and with problems in
retrieving Amharic text.
We have designed and developed an interface in the local language so that users can
use the system without knowledge of English or SQL. To
address this issue, we have developed an algorithm to efficiently map Amharic sentences into
Structured Query Language (SQL). We divided the algorithm into three parts: an algorithm to
handle select queries, an algorithm to handle conditional queries, and an algorithm to handle
aggregation.
The algorithm has been implemented in Java and tested on a Human Resource Management
(HRM) database containing Employee, Department, and Employee on education tables. The
prototype can handle list queries, conditional queries, and aggregate functions. The accuracy
of the system is measured in terms of precision, as a percentage, with two classes that label
each query response as Correct or Incorrect. The prototype system achieves good performance,
and the overall efficiency of the system is observed to be 91%.
CHAPTER ONE
INTRODUCTION
1.1. Background
Information has long played an important role in our lives; most people
try to get the information they need before making a decision. Recently, with the growth of
technologies such as computers and laptops, personal digital assistants (PDAs), cellular phones,
and the internet, information can be accessed almost anywhere, at any time, by anybody,
including those who do not necessarily have a computer background. One of the major sources of
information is the database. A database contains a collection of related data, stored in a
systematic way to model a part of the world. In order to extract information from a database,
one needs to formulate a query in such a way that the computer will understand it and produce
the desired output. However, not everybody is able to write such queries, especially those who
lack a computer background [1].
Language is the primary means of communication used by humans. Natural Language
Processing (NLP) is a technique that enables a computer to understand a natural language
and communicate easily with a human being. It is also becoming one of the most active
techniques used in Human Computer Interaction (HCI). In the context of
HCI, there are many NLP applications, such as Information Retrieval Systems,
Information Extraction, Speech Recognition, Language Translation, Question Answering (QA),
Natural Language Interface to Database (NLIDB), and Dialog Systems [4].
In our day-to-day activities, computers play an important role in minimizing workload and
completing tasks on time. Unlike most user-computer interfaces, a Natural Language Interface
allows users to communicate fluently with a computer system with very little preparation. Even
though natural language may be the easiest symbol system for people to learn and use, it has
proved to be the hardest for a computer to master.
The Internet is the largest data provider today, and it caters to users of all kinds. The
vastness of the data makes it necessary to store data in an organized manner so that it is easy
to search, retrieve, and maintain [1]. For this purpose, the most logical and commonly used
storage method is the database. To access data in a database, knowledge of a
database language is required.
To enable database queries to be performed by users with little or no SQL querying ability,
companies like Elfsoft (English Language Frontend Software, which has developed SQL Tutor)
have analyzed the abilities of Natural Language Processing to develop products that let people
interact with a database in simple English. This enables a user to simply enter queries in
English into a natural language database interface; this kind of application is known as a
Natural Language Interface to a Database (NLIDB) [1].
An NLIDB deals with the representation of a user request to a database in his/her native
language. The NLIDB then maps the user request to standard SQL to retrieve the desired results
from the target database. The purpose of this interface/system is to facilitate access by hiding
the complexities of database query language syntax. Thus the user writes his/her request much
like an email message and submits it to the NLIDB system. The system then interprets the request
and translates it into an accurate database query so that precise results can be retrieved.
Hence, an easy-to-use user interface comes into the picture, which allows diverse users to
access data. There is a need to design and develop an interface in the local language so that a
user without knowledge of English and SQL can easily use the system.
The problem of natural language access to a database is divided into two sub-components [2]:
a linguistic component and a database component. The linguistic component translates the
natural language input sentence into a formal query and, after the database search, generates a
natural language response. The database component performs traditional database management
functions. A question entered in natural language is translated into a formal query. This query
is then processed by the database management system, and the result is returned
to the natural language component, where generation routines produce a surface language
version of the response.
medicine from the table information for a particular disease, one has to create a query like
select medicine from information where disease = 'Cold'. However, a user who doesn't know
SQL and can't use the English language may not be able to access the database in this way. So,
to make database applications easy to use for these people, we present a model and an algorithm
to convert an Amharic sentence into SQL and retrieve relevant text from a relational database.
An HRM database with Employee, Department, and Employee on education tables is used as a case
study for developing natural language query processing for a database system. Queries are
asked as Amharic sentences to retrieve relevant information from the database.
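The query quoted above can be tried out directly. The following is a minimal sketch in Python using the standard `sqlite3` module (the prototype described in this thesis is implemented in Java); the table contents `MedicineA` and `MedicineB` and the helper name `medicines_for` are placeholders invented for illustration, not data from the study.

```python
import sqlite3

# In-memory stand-in for the "information" table mentioned in the text.
# The rows are invented placeholders, not real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE information (disease TEXT, medicine TEXT)")
conn.executemany(
    "INSERT INTO information VALUES (?, ?)",
    [("Cold", "MedicineA"), ("Malaria", "MedicineB")],
)


def medicines_for(disease):
    """Run the SQL a user would otherwise have to write by hand."""
    rows = conn.execute(
        "SELECT medicine FROM information WHERE disease = ?", (disease,)
    ).fetchall()
    return [row[0] for row in rows]


print(medicines_for("Cold"))
```

A natural language interface has to produce exactly this kind of statement from the user's sentence, which is why the user no longer needs to know the table name, the column names, or the SQL syntax.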
So far, different techniques, such as pattern matching, syntax-based, semantic grammar based,
and intermediate representation language systems, have been used to develop NLIDBs. Among these
techniques, the study employed pattern matching and similarity checking to develop
Amharic language text retrieval from a relational database.
1.4. Methodology
1.4.1. Literature Review
For the successful completion of this research, literature related to language interfaces
to databases and data retrieval from databases was reviewed.
relationships between words. A parser for the Amharic language has been developed only at the
research level; no practical tool has been developed yet. Hence, to work around this
limitation, the researcher replaced the parser with an algorithm developed in such a way that
it can identify column names and column values without using a parser.
Because of its advantages and its suitability for this particular research, the
researcher used the pattern matching and similarity checking techniques for the generation and
execution of SQL queries from the user's input in natural language (Amharic). One
main advantage of pattern matching is simplicity: no elaborate parsing and
interpretation modules are needed, and the systems are easy to implement. In addition,
pattern-matching systems often manage to come up with some reasonable answer, even if the input
is out of the range of sentences the patterns were designed to handle.
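To make the idea of similarity checking concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the similarity measure actually used by the prototype is not specified in this section, so `difflib.SequenceMatcher` from the standard library is swapped in; the lexicon `COLUMN_WORDS`, the column names, and the threshold of 0.6 are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical lexicon: Amharic words mapped to column names of an
# assumed Employee table. Neither the words nor the columns are taken
# from the thesis prototype.
COLUMN_WORDS = {"ስም": "name", "ደመወዝ": "salary", "ዕድሜ": "age"}


def best_column(token, threshold=0.6):
    """Return the column whose lexicon word is most similar to the token,
    or None when no word reaches the (assumed) similarity threshold."""
    best_word, best_score = None, 0.0
    for word in COLUMN_WORDS:
        score = SequenceMatcher(None, token, word).ratio()
        if score > best_score:
            best_word, best_score = word, score
    return COLUMN_WORDS[best_word] if best_score >= threshold else None


print(best_column("ስም"))      # exact match with a lexicon word
print(best_column("ደመወዙ"))   # suffixed form, still close to "ደመወዝ"
```

The point of the similarity step is that an inflected or slightly misspelled token can still be tied to a column without a full morphological parser, while tokens that resemble nothing in the lexicon are rejected.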
1.4.5. Evaluation
The study involves developing the designed system and evaluating its performance. Amharic
sentences were used to evaluate the performance.
researcher used only the pattern matching and similarity checking techniques for converting an
Amharic sentence into Structured Query Language (SQL).
No need of training: The user does not need any previous training on
language interfaces for accessing a database. The system is highly user friendly and easy to
use for end users.
Simple and easy to use: The interface is very simple and easy to use because end
users can write the query in their native (Amharic) language.
Moreover, the study can serve as a springboard for others who are interested in doing more
research in this area, by providing them with some of the basic theoretical and practical
knowledge they may need to start with.
CHAPTER TWO
LITERATURE REVIEW
2.1. Survey on Natural Language Processing (NLP) Applications
This chapter surveys the area of natural language processing by reviewing different articles on
the area. We introduce information retrieval systems, question answering systems based on open
and closed domain question answering, and statistical or semantic question answering systems.
Among open and closed domain question answering systems, AnswerBus, START, TextMap,
EAGLi, and WolframAlpha are discussed, along with different articles on statistical or
semantic question answering systems. We also introduce dialog between humans and computers in
natural language, called a Dialog System. Different works by researchers in the area of dialog
systems are discussed, such as ELIZA, HAPPY ASSISTANT, BIRD QUEST, CHAT 80, and
PLANES. In addition, Natural Language Interface to Database (NLIDB) is discussed, including
the techniques used to implement it, namely pattern matching, syntax
based, semantic grammar, and intermediate representation. Furthermore, we
discuss the advantages and disadvantages of NLIDB, the common architecture of NLIDB, and
related works in the area in which we have implemented our system.
Spanish, Italian or Portuguese and extracts possible answers in English. It uses a specific
dictionary. It classifies all words of the retrieved documents into two categories, matching
and non-matching, according to its predefined formula. It uses five search engines and
directories to retrieve web pages that are relevant to the user question. The system uses a
simple language recognition model to determine whether the question is in English or in any of
the other five languages. If the question language is not English, AnswerBus sends the question
to AltaVista's translation tool and obtains the question translated into English.
Katz et al. [13] developed the world's first web-based open domain Question Answering system,
START (SynTactic Analysis using Reversible Transformations). It has been online and
continuously operating since December 1993. Questions are asked in English about places,
movies, people, dictionary definitions, and much more. It uses semantic parsing. They built a
system that integrates heterogeneous data sources using an object-property-value model to
answer user questions. They identify three main challenges in getting a computer to answer
such questions: understanding the question, identifying where to find the information, and
fetching the information itself. It handles all varieties of media, including text, diagrams,
images, audio and video clips, data sets, web pages, etc. START is considered the best system
at returning good answers to the user [14]. The major problem with this system is that it
accepts only simple questions related to its domains, such as Geography, Science & Reference,
Arts & Entertainment, and History and Culture; it cannot answer questions about causes and
methods [14]. It uses a template of the form <subject relation object>. The START system gives
a proper answer when the query is asked and the answer appears in its frequently asked
database. In other cases, the user has to navigate to a web page to get an exact answer.
TEXTMAP [15], an open domain intelligent Question Answering assistant, focuses on
developing an algorithm that automatically mines vast amounts of data in order to answer
questions posed in natural language. It handles question types such as factoid questions (What
is the capital of Morocco?), cause questions (Why is there no cure for the cold?), biography
questions (What do you know about Dick Cheney?), and event questions (What do you know
about the Kobe earthquake?). TEXTMAP employs a combination of rule-based, supervised,
and unsupervised machine learning algorithms that are trained on massive amounts of data. It
supports the English, Spanish, and German languages. It provides a web-based interface to the
user.
Another open domain Question Answering system, EAGLi [16] (Engine for Answers in
Genomics Literature), retrieves relevant answers from a selected taxonomy. It uses a browser
and a predictive model and also includes advanced search. It supports only biology and medicine
questions. It provides a web-based interface to the user.
The computational knowledge engine WolframAlpha [17] is an online service that answers
factual queries directly by computing the answer from an external source rather than providing
a list of documents or web pages. It is built on a toolkit with capabilities such as
mathematics, computer algebra, symbolic and numerical computation, visualization, and
statistics. It deals with facts, not with options. The computation time for each query is
limited.
between them, which is important for letting users write simple English queries. A limitation
of the system is that it uses a predefined syntax structure for entering the query in natural
language.
Rajendra Akerkar and Manish Joshi [20] discussed a natural language interface
that accepts questions in natural language and generates textual responses. It uses a keyword
matching approach. It presents rules to tackle the phenomenon using a shallow parsing
technique. The experimental results show that the approach they used provides high accuracy and
produces reasonable textual responses.
database. The system uses semantic grammar techniques and is implemented in the Prolog
language. CHAT-80 was an impressive, efficient, and sophisticated system. The database of
CHAT-80 consists of facts about oceans, major seas, major rivers, and major cities for about
150 of the world's countries, and a small English vocabulary that is sufficient for querying
the database. The basic method followed by CHAT-80 is to attach some extra control information
to the logical form of a query in order to make it an efficient piece of Prolog program that
can be executed directly to produce the answer.
D.L. Waltz [25] developed PLANES (Programmed Language-based Enquiry System) at the
University of Illinois Coordinated Science Laboratory. PLANES includes an English language
front end with the ability to understand and explicitly answer user requests. It carries out
clarifying dialogues with the user and also answers vague or poorly defined questions. The
system was developed using a database of the U.S. Navy 3-M (Maintenance
and Material Management) system, which holds aircraft maintenance and flight data. The idea
can be directly applied to other non-hierarchical record-based databases.
Database Languages
A database system provides at least one language, which includes a Data Definition Language
(DDL) to specify the database schema, a Data Manipulation Language (DML) to express
database queries and updates, and a Data Query Language (DQL) for retrieving data. SQL is
a widely used database language.
I. Data Manipulation Language (DML): It is a language for accessing and manipulating the
data contained in a database. Accessing the data refers to the manipulation of information
stored in the database, namely: (i) insertion of new information into the database, (ii)
deletion of information from the database, and (iii) modification of information stored in the
database [2]. DML has two classes: (i) procedural DML, in which the user specifies what data
is required and how to get those data, and (ii) non-procedural (also referred to as
declarative) DML, in which the user specifies what data is needed without specifying how to get
those data.
II. Data Definition Language (DDL): It is a language for creating and manipulating the
structure of the data. The schema created by DDL is stored in the data dictionary, which
contains metadata, that is, data about data. The data values stored in the database must
satisfy certain consistency constraints, such as domain constraints, referential integrity,
assertions, and authorization, as defined in the data dictionary.
III. Data Query Language (DQL): It is a language for retrieving information from a
database.
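The three language classes can be seen side by side in a short script. This is a minimal sketch in Python with the standard `sqlite3` module; the `employee` table, its columns, and the sample rows are invented for illustration and are not the HRM schema used later in the thesis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the schema (illustrative table and column names).
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, salary REAL)"
)

# DML: insert, modify, and delete information.
conn.execute("INSERT INTO employee (id, name, salary) VALUES (1, 'Abebe', 5000)")
conn.execute("INSERT INTO employee (id, name, salary) VALUES (2, 'Mulu', 7000)")
conn.execute("UPDATE employee SET salary = 5500 WHERE id = 1")
conn.execute("DELETE FROM employee WHERE id = 2")

# DQL: retrieve information. The statement is declarative: it says what
# is wanted, not how the rows are to be located.
rows = conn.execute("SELECT name, salary FROM employee").fetchall()
print(rows)
```

An NLIDB of the kind discussed in this thesis targets only the last class: it generates retrieval (SELECT) statements, while the schema and the data are assumed to exist already.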
Query Interface
A query interface to a database is a system that helps the user access the information
stored in a database. A Natural Language Interface (NLI) is one kind of query interface, in
which the user can input the query in natural language. Besides NLI, a number of traditional
user interfaces are used by Database Management System (DBMS) packages, such as Spreadsheet-like
Interface, Forms-based Interface, Database Query Interface, Graphical User Interface,
Query-By-Example, and Command Line Interface.
express the definition of NLIDB, Abhijeet [27] says it is a communication channel between the
user and the computer; without knowledge of any programming language, a user can act as a
programmer. Through these systems, users can interact with a database in a more convenient and
flexible way.
A Natural Language Interface to Database (NLIDB) is a system that allows the user to access
information stored in a database by typing requests expressed in some natural language. In the
last few decades, many NLIDB systems have been developed through which users can interact
with a database in a more convenient and flexible way. Because of this, this application of NLP
is still very widely used today [6]. Natural language interfaces have long been a very
interesting area of research. The aim of an NLIDB is to provide an
interface where a user can interact with a database more easily using his/her natural language
and access or retrieve his/her information [1]. Moreover, an NLIDB is a system that converts a
query in the native language into SQL.
Pattern-Matching Systems
Pattern matching is the earliest and simplest technique for implementing a natural
language interface to database (NLIDB). It relies on fixed patterns and rules [7]. The rules
state that if an input word or sentence matches a given pattern, the associated action is taken.
Those actions are also stored in the database [27]. The main advantage of the pattern matching
approach is that no elaborate parsing and interpretation modules are required, and the systems
are very easy to implement. Also, pattern-matching systems often manage to come up with some
reasonable answer, even if the input is out of the range of sentences the patterns were
designed to handle [26]. One of the best natural language processing systems that works in this
style is ELIZA [1]. To illustrate, Ashish Kumar [26] presents the following example.
Countries_Table
Country    Capital    Language
France     Paris      French
Italy      Rome       Italian
India      Delhi      Hindi
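A pattern-matching interface over the table above can be sketched in a few lines. The following Python example is an illustration only: the two patterns, the lower-case table name `countries`, and the function name `answer` are assumptions made here, not the rules given in [26] or the algorithm of this thesis.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE countries (country TEXT, capital TEXT, language TEXT)")
conn.executemany(
    "INSERT INTO countries VALUES (?, ?, ?)",
    [("France", "Paris", "French"),
     ("Italy", "Rome", "Italian"),
     ("India", "Delhi", "Hindi")],
)

# Country names are read from the table so the patterns follow the data.
NAMES = "|".join(
    re.escape(row[0]) for row in conn.execute("SELECT country FROM countries")
)

# Each rule: "<keyword> ... <country>" -> report that column of the matching row.
PATTERNS = [
    (re.compile(rf"\bcapital\b.*\b({NAMES})\b", re.IGNORECASE), "capital"),
    (re.compile(rf"\blanguage\b.*\b({NAMES})\b", re.IGNORECASE), "language"),
]


def answer(question):
    """Return the value selected by the first matching pattern, or None."""
    for pattern, column in PATTERNS:
        match = pattern.search(question)
        if match:
            # The column name comes from the fixed rule list above,
            # never from user input.
            row = conn.execute(
                f"SELECT {column} FROM countries WHERE country = ?",
                (match.group(1).capitalize(),),
            ).fetchone()
            return row[0] if row else None
    return None


print(answer("What is the capital of Italy?"))
print(answer("Which language is spoken in India?"))
```

The sketch shows both sides of the technique: no parser is needed, but only questions that contain one of the fixed keywords followed by a known country name are understood, and everything else falls through to `None`.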
2.4.2.2. Syntax-Based Systems
In a syntax-based system, the user's question is analyzed syntactically, i.e., it is parsed, and
the resulting syntactic tree is mapped to an expression in some database query language [1].
Syntax-based systems use a grammar that describes the possible syntactic structures of the
user's questions. Syntax-based NLIDBs usually interface to application-specific database
systems that provide database query languages carefully designed to facilitate the mapping from
the parse tree to the database query.
The main advantage of syntax-based approaches is that they provide detailed information
about the structure of a sentence. A parse tree contains a lot of information about the sentence
structure: starting from a single word and its part of speech, how words can be grouped together
to form a phrase, and how phrases can be grouped together to form more complex phrases, until a
complete sentence is built. With this information, we can map semantic meanings to
certain production rules (or nodes in a parse tree). One example of a syntax-based system is
LUNAR. In such a system, the grammar simply describes the possible syntactic structures of the
user's question.
As Neelu Nihalani [28] presents, the syntax-based approach has several problems. First, not all
nodes should be mapped; some nodes have to be left just as they are, without adding any
semantic meaning, and it is not always clear which nodes should be mapped and which should not.
Moreover, the same node in different parse trees is not necessarily translated in all the
trees. Second, a sentence can have multiple correct parse trees, and if all are translated,
they may lead to different query results. Last, it is difficult for a syntax-based
approach to directly map a parse tree into a general database query language such as SQL
(Structured Query Language).
2.4.2.3.
In semantic grammar systems, the handling of requests and responses is still done by parsing
the input and mapping the parse tree to a database query. The difference, in this case, is that
the grammar's categories do not necessarily correspond to syntactic concepts. The basic idea of
a semantic grammar system is to simplify the parse tree as much as possible by removing
unnecessary nodes or combining some nodes together. Based on this idea, the semantic grammar
system can better reflect the semantic representation without having complex parse tree
structures. Besides smaller structures, the semantic grammar approach also provides a special
way of assigning a name to a certain node in the tree, thus resulting in less ambiguity
compared to the syntax-based approach [28].
The main drawback of the semantic grammar approach is that it requires some prior knowledge of
the elements in the domain, which makes it difficult to port to other domains. In addition, a
parse tree in a semantic grammar system has specific structures and unique node labels, which
could hardly be useful for other applications. Many of the systems developed so far, such as
LUNAR and LADDER, use this semantic grammar approach.
2.4.2.4.
Due to the difficulty of directly translating a sentence into a general database query language
using a syntax-based approach, intermediate representation systems were proposed. The idea
is to first map a sentence into a logical query language and then translate this logical
query into a general database query language such as SQL. Figure 2.2 shows a possible
architecture of an intermediate representation language system [26].
No artificial language: One advantage of NLIDBs is supposed to be that the user is not required
to learn an artificial communication language. Formal query languages are difficult to learn and
master, at least by non-computer-specialists. Graphical interfaces and form-based interfaces are
easier to use by occasional users; still, invoking forms, linking frames, selecting restrictions from
menus, etc. constitute artificial communication languages that have to be learned and mastered
by the end-user. In contrast, an ideal NLIDB would allow queries to be formulated in the user's
native language. This means that an ideal NLIDB would be more suitable for occasional users,
since there would be no need for the user to spend time learning the system's communication
language.
Simple, easy to use: Consider a database with a query language or a certain form designed to
display the query. While an NLIDB system only requires a single input, a form-based interface may
contain multiple inputs (fields, scroll boxes, combo boxes, radio buttons, etc.) depending on the
capability of the form. In the case of a query language, a question may need to be expressed
using multiple statements which contain one or more sub queries with join operations as
the connector.
Better for Some Questions: It has been argued that there are some kinds of questions (e.g.
questions involving negation or quantification) that can be easily expressed in natural language,
but that seem difficult (or at least tedious) to express using graphical or form-based interfaces.
For example, "Which department has no programmers?" (negation), or "Which company
supplies every department?" (universal quantification), can be easily expressed in natural
language, but they would be difficult to express in most graphical or form-based interfaces.
Questions like the above can, of course, be expressed in database query languages like SQL, but
complex database query language expressions may have to be written.
Fault tolerance: Most NLIDB systems tolerate minor grammatical errors. In a conventional
computer system, by contrast, the lexicon usually has to match its definition exactly, the syntax
has to follow fixed rules, and any error causes the input to be rejected automatically. Most
computer systems also provide no support for incomplete sentences [2].
ii.
Linguistic coverage not obvious: A frequent complaint against NLIDBs is that the system's
linguistic capabilities are not obvious to the user. As already mentioned, current NLIDBs can
only cope with limited subsets of natural language. Users find it difficult to understand (and
remember) what kinds of questions the NLIDB can or cannot cope with. For example, Masque
[54] is able to understand "What are the capitals of the countries bordering the Baltic and
bordering Sweden?", which leads the user to assume that the system can handle all kinds of
conjunctions (false positive expectation). However, the question "What are the capitals of the
countries bordering the Baltic and Sweden?" cannot be handled. Similarly, a failure to answer a
particular query can lead the user to assume that equally difficult queries cannot be answered,
while in fact they can be answered (false negative expectation).
Formal query languages, form-based interfaces, and graphical interfaces typically do not suffer
from these problems. In the case of formal query languages, the syntax of the query language is
usually well-documented, and any syntactically correct query is guaranteed to be given an
answer. In the case of form-based and graphical interfaces, the user can usually understand what
sorts of questions can be input, by browsing the options offered on the screen; and any query that
can be input is guaranteed to be given an answer [2].
Linguistic vs. conceptual failures: When the NLIDB cannot understand a question, it is often
not clear to the user whether the rejected question is outside the system's linguistic coverage, or
whether it is outside the system's conceptual coverage. Thus, users often try to rephrase
questions referring to concepts the system does not know (e.g. rephrasing questions about
salaries towards a system that knows nothing about salaries), because they think that the problem
is caused by the system's limited linguistic coverage. In other cases, users do not try to rephrase
questions the system could conceptually handle, because they do not realize that the particular
phrasing of the question is outside the linguistic coverage, and that an alternative phrasing of the
same question could be answered. Some NLIDBs attempt to solve this problem by providing
diagnostic messages, showing the reason a question cannot be handled (e.g. unknown word,
syntax too complex, unknown concept, etc.).
Users assume intelligence: NLIDB users are often misled by the system's ability to process
natural language, and they assume that the system is intelligent, that it has common sense, or that
it can deduce facts, while in fact most NLIDBs have no reasoning abilities. This problem
does not arise in formal query languages, form-based interfaces, and graphical interfaces,
where the capabilities of the system are more obvious to the user. For example, when a user
submits the query "list the names of farmers who are 35 years old", he/she does not mention the
word "age", assuming that the system will infer it automatically, but the system is not that
intelligent.
Inappropriate Medium: It has been argued that natural language is not an appropriate medium
for communicating with a computer system. Natural language is claimed to be too verbose or too
ambiguous for human-computer interaction. NLIDB users have to type long questions, while in
form-based interfaces only fields have to be filled in, and in graphical interfaces most of the
work can be done by mouse-clicking. In a natural language interface, the user has to type a full
sentence with all the connectors (articles, prepositions, etc.), whereas graphical or form-based
interfaces do not require this [29].
A. Syntactic Analysis
The word syntax means the grammatical arrangement of words in a sentence and their relationship
with each other. The objective of syntactic analysis is to find the syntactic structure of the
sentence. This step splits the sentence into simpler elements called tokens. The spelling
checker then checks whether each token is spelled correctly, that is, whether the token is available
in the system dictionary. The ambiguity reduction function reduces the ambiguity in a sentence and
simplifies the task of the parser.
B. Parse Tree
The output of syntactic analysis is a parse tree. It represents the syntactic structure of a sentence
according to some formal grammar. A parse tree is composed of nodes and branches; each node
is either a root node, a branch node, or a leaf node. In a parse tree, an interior node is a phrase
and is called a non-terminal of the grammar, while a leaf node is a word and is called a terminal
of the grammar.
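The node structure described above can be sketched in Java as follows. This is only an illustration: the class name ParseNode and the sample labels are hypothetical and are not taken from the system described in this thesis.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal parse-tree node: interior nodes carry a phrase label, leaves carry a word. */
final class ParseNode {
    final String label;                                   // non-terminal name, or the word for a leaf
    final List<ParseNode> children = new ArrayList<>();

    ParseNode(String label, ParseNode... kids) {
        this.label = label;
        for (ParseNode k : kids) {
            children.add(k);
        }
    }

    /** A node without children is a leaf, i.e. a terminal of the grammar. */
    boolean isLeaf() {
        return children.isEmpty();
    }

    /** Collects the terminals (words) of the tree from left to right. */
    List<String> leaves() {
        List<String> out = new ArrayList<>();
        collect(this, out);
        return out;
    }

    private static void collect(ParseNode n, List<String> out) {
        if (n.isLeaf()) {
            out.add(n.label);
            return;
        }
        for (ParseNode c : n.children) {
            collect(c, out);
        }
    }
}
```

Reading the leaves from left to right gives back the words of the sentence, while the interior labels record the phrase structure.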
C. Semantic Analysis
Semantic analysis is concerned with creating representations of the meaning of linguistic inputs. It
deals with how to determine the meaning of a sentence from the meaning of its parts. It
generates a logical query, which is the input of the Database Query Generator.
has to explicitly specify the attribute name. For example, if the user phrases his/her query as
"display employee location", the system does not recognize it. However, if the same question is
rephrased as "display employee address", the system can recognize it and respond by generating
the SQL query accordingly.
Hendrix et al. [33] designed LIFER/LADDER, a natural language interface to a database system
which gives information about US Navy ships. This system uses a semantic grammar to parse
questions and uses a distributed database. The system consists of three major components: (a)
INLAND (Informal Natural Language Access to Navy Data), (b) IDA (Intelligent Data Access)
and (c) FAM (File Access Manager). It supports multiple table queries with join conditions.
Language features that increase system usability, such as spelling correction, processing of
incomplete inputs, and run-time system personalization, are also included in the system.
Woods W. A. [29] developed the LUNAR system, which answers questions about rock samples
brought back from the moon. The system makes use of two databases, one for chemical analyses
and one for literature references. It uses an Augmented Transition Network (ATN) parser and
procedural semantics. It consists of three components: (i) a general purpose grammar, (ii) a parser
for a large subset of natural English, and (iii) a rule driven semantic interpretation component. The
first component is responsible for transforming natural language input into a disposable program
that carries out its intent, and the third component deals with executing programs against a database
to determine answers to queries. The performance was quite impressive: it handled 78%
of requests without any errors, and this ratio rose to 90% when dictionary errors were corrected.
Runvanpura [34] developed the SQ-HAL system. It is platform independent and has multiuser
support. The system is written in Perl, which has powerful string manipulation capabilities.
It uses a top down parsing methodology. It has a limited thesaurus, the user has to enter the
relationships manually, and there is no direct method of retrieving column names. Moreover, the
system cannot determine synonyms for table names and column names; hence, the user has to enter
synonym words manually.
Chauhari S. et al. [35] developed DBXplorer, a multi-step system that answers keyword queries
over relational databases. It proposes a methodology that uses a symbol table to store the tables,
columns, and rows of all data values that are looked up during the search, in order to identify the
locations that contain all the keywords appearing in the question. The
system has been implemented using a commercial relational database and web server and allows
users to interact via a browser front-end.
Rashid Ahmad et al. [36] proposed an algorithm that efficiently maps a natural language query
entered in the Urdu language into structured query language. The system accepts the
user query either in question form or in request form. The algorithm was implemented in Visual
C#.NET and was tested on a database containing student and employee data. The dictionary is
manually constructed and is database specific. The program correctly maps 85% of the natural
language queries.
Amardeep Kaur [1] presented the design and implementation of a natural language interface
to an agricultural database in the Punjabi language. The system uses an MS Access database and
accepts input in a specified template. Table names, column names and query conditions are mapped
manually. The author considers only a limited set of words.
Anh Kim Nguyen and Phuong Hong Nguyen [37] in their paper constructed a natural language
interface to relational databases, which accepts fuzzy questions as inputs and generates answers
in the form of tables or short answers. By using derivation evaluation mechanism, the author
constructed a set of translation rules for all possible structures in standard trees of user questions
to translate it into SQL query.
Veera Boonjing and Chang Hsu [38] proposed a metadata search approach to provide practical
solutions to the natural language database query problem. Here the metadata grew in a largely
linear manner and the search was linguistics-free. A new class of reference dictionary integrated
four types of enterprise metadata: enterprise information models, database values, user-words,
and query cases. The interpretation of the input could be easily identified with the help of the
graphical representation method. It uses a branch-and-bound method to identify the optimal
interpretation, which leads to SQL generation. The necessary condition is that the text input
contains at least one entry of the reference dictionary and that the input is complete and
grammatically correct, so that it leads to a single correct SQL query.
Androutsopoulos et al. [39] proposed the MASQUE system (Modular Answering System for
Queries in English). The system is a powerful and portable natural language front end for
Prolog databases. It answers written English questions referring to a certain knowledge domain,
such as geography or airplanes. Each question is transformed into a suitable database query using
the Prolog database. It uses an extraposition grammar parser, and it transforms each question into a
single SQL query.
B. Sujatha et al. [40] discussed a novel architecture for a natural language interface to databases,
which uses a pragmatic approach, with illustrations. Special language features that increase system
usability, such as spelling correction, processing of inputs, and runtime system performance, were
also discussed. The three-level architecture consists of a client level, an
intermediate server level and a database level. The presentation of example queries and of the
dictionary permits the user to better understand the contents of the database, which facilitates
query formulation. The table names are presented on the interface, which helps the user find out
what tables are present in the database.
H. V. Jagdish et al. [41] developed the NALIX system, a generative interactive Natural Language
Query Interface to an XML database. The system accepts an English language sentence as
query input, which can include aggregation, nesting, and value joins, among other things. The
system can be classified as a syntax based system. The transformation process has three steps: (a)
generating a parse tree, (b) validating the parse tree, and (c) translating the parse tree into an XQuery
expression. It reformulates the input query as an XQuery expression by mapping the
grammatical proximity of the parsed natural language tokens to the proximity of the
corresponding elements in the resulting XML. The system makes little attempt to understand
natural language itself.
Porfirio P. Filipe et al. [42] discussed a Natural Language Interface for Database. It allows the
user to formulate multimedia queries. Here the questions are first translated into logic language
and then to SQL which is processed by database management to respond to the queries.
Rukshan et al. [43] proposed the natural language interface to database, which allowed input in
the form of an English query through a convenient interface over the internet. A limited data
dictionary was used where all possible words related to particular system would be included.
Niculae Stratica [44] developed a querying system for CINDI (Concordia Virtual Library System).
The system takes natural language input and gives a structured representation of the answer in the
form of structured query language. The study uses the Link Parser to semantically parse the
query, and it uses WordNet to build the conceptual knowledge base from the database schema.
The system was tested using information contained in the virtual library. As discussed by the
author himself, the system has some limitations: values should be in double quotes, table
names and attribute names should be specified, and a template should also be specified.
Considering the limitations of the NLIDB systems developed by various researchers in
this field, we have designed and developed a Natural Language Query Interface for Amharic
Text Retrieval from Relational Database to fill this gap. Our system improves the results
retrieved from the database by using a new mapping algorithm based on the structure of the
Amharic language. The system uses a human resource
database with appropriate tables whose data are stored in Amharic. The user
formulates the query as an Amharic sentence, and the system analyzes the user's query and
converts it into Structured Query Language (SQL).
We have collected 120 sample user queries from ordinary people who have no knowledge of
database languages. The accuracy of the system is measured in terms of precision percentage
with two classes that identify the query response as Correct Queries or Incorrect Queries.
CHAPTER THREE
THE AMHARIC WRITING SYSTEM
3.1. Introduction
As Betelehem [46] and Danel [47] discussed, as cited in Lo [48], the Blackwell Encyclopedia of
Writing Systems defines the term writing system as "a set of visible or tactile signs used to
represent units of a language in a systematic way". Amharic is a Semitic language spoken
predominantly in Ethiopia. It is the working language of the country, which has a population of over
90 million at present. Amharic was the national language of Ethiopia until 1983 E.C.
[49]. Currently it is the official working language of the Federal Democratic Republic of
Ethiopia and thus has official status nationwide. It is also the official or working language of
several of the states/regions within the federal system, including Amhara and the multi-ethnic
Southern Nations, Nationalities and Peoples region. The language is spoken as a mother tongue by
a large segment of the population in the northern and central regions of Ethiopia and as a second
language by many others. It is the second most spoken Semitic language in the world next to
Arabic [50]. One of the major differences between Amharic and other Semitic languages like Arabic
and Hebrew is that Amharic is written from left to right, as English is [51].
According to [52] [53], Amharic is probably the second largest language in Ethiopia (after
Oromo, a Cushitic language) and possibly one of the five largest languages on the African
continent. As cited in [54], there are three Semitic languages which are found only in Ethiopia and
Eritrea: Geez, Amharic and Tigrinya, all of which are written in the
Ethiopic system. As cited in [55], the Geez syllabary is a solely Ethiopian writing system, used
nowhere else in the world except Eritrea (which was once part of Ethiopia) and Israel (by
Ethiopian Jews).
Geez plays a significant role in the development and expansion of the Amharic language and writing
system. Several religious texts, such as the Bible and translations of Arabic Christian texts from
Egypt, and literature such as qine and poems are all written in Geez. The emergence and
expansion of Geez inscriptions in the Ethiopic script can be traced back to the 4th century AD, when
Geez was the language of the empire of Aksum in northern Ethiopia. Even though the use of Geez is
now limited to the Orthodox Church, it is still a source for the coining of Ethiopian literary and
technical terms [56]. In spite of the relatively large number of speakers, Amharic is still a language
for which very few computational linguistic resources have been developed.
3.2. Amharic Alphabet
The Ethiopic writing system, which the Amharic language uses, consists of a core of thirty-three
characters (Fidel), each of which occurs in one basic form and in six other forms, all known
as orders. The seven orders (the first basic order and the other six orders) of the Ethiopic script
represent the different sounds of a consonant-vowel combination. Amharic has 231 (7x33)
different characters and nearly 40 other characters [57]. Most labialized consonants are
basically redundant, and there are actually only 39 context independent phonemes
(monophones); of the 275 symbols of the script, only about 231 remain if the redundant ones are
removed [57]. The 40 additional characters have a special feature representing labialization, such
as gwe from g, qwe from q, lu from l and gu from g [58].
There are seven vowels in the Amharic alphabet, /ə/, /u/, /i/, /a/, /e/, /ɨ/, and /o/. Based on
their point of articulation, they are grouped as the peripheral vowels /u/, /i/, /a/, /e/, and /o/ and
the central vowels /ə/ and /ɨ/, which are used more often than the peripherals. Another point worth
mentioning is that two consonants can appear as a cluster in the middle or at the end of words,
whereas clusters at the beginning of a word are very restricted, and /ɨ/ is used there. As an
example, the symbolic representations of the seven forms of the Amharic characters (ha),
(le), (me) are shown in the following table.
1st order   2nd order   3rd order   4th order   5th order   6th order   7th order
(H)         (Hu)        (Hi)        (Ha)        (He)        (H)         (Ho)
(L)         (Lu)        (Li)        (La)        (Le)        (L)         (Lo)
(M)         (Mu)        (Mi)        (Ma)        (Me)        (M)         (Mo)
has a different meaning from its constituent terms "hode", which means stomach, and "sefee",
which means wide. In literal term matching retrieval systems, the constituent terms of the
compound noun are considered independent, and a document which contains one of these
terms is treated as relevant. This phenomenon results in the retrieval of irrelevant materials for a
query which contains one of the constituent terms. However, concept based retrieval systems, like
LSI, can partially handle this problem, as the co-occurrence frequency of the constituent terms is
taken into account in determining the relation between the terms [62] [63].
Different Amharic word processing programs have been developed since 1987 (e.g. Power
Geez, Geez, Agafari, Visual Geez, Ethiopic, etc.) [66]. These programs use the same
English keyboard differently. That is, two Amharic word processors can use the same key to
represent two different Amharic characters. As a result, whenever data is passed between
different Amharic word processing programs, that data runs the risk of corruption. ECoSA
(Ethiopian Computer Standards Association), a professional association, is working to solve the
problems that result from the inconsistency among the available Amharic software.
Most of these programs are written to work only with Microsoft Word. However, a few
can work with other programs, and Visual Geez is one of these exceptions. Visual Geez has
two versions, VG2 and VG2000, developed for different versions of Microsoft Office
products. Both the test collection and the sample queries used in this research are written in the
VG2 version of Visual Geez [67].
CHAPTER FOUR
METHODS AND ALGORITHMS
4.1. Architecture of the System
In this chapter, we present the architecture of our Amharic language interface to databases.
The interface accepts an Amharic sentence as input and generates an SQL query. The generated
query is then executed on the actual database, and the results are retrieved and
displayed to the user. The input is analyzed semantically based on a domain
dependent dictionary.
Tokenize
In the first part of the process, the Amharic input sentence is tokenized. During
tokenization, the sentence is broken down into words called tokens. These tokens are
stored in an array list or hash map. Tokens may represent the name of a table, column, or row, a
command, an operation, a value, or a word that is not useful. As presented in [69],
unnecessary, frequently repeated words called stop words are removed, and the remaining words are
stored in the array list.
Tokenization of a text depends on the characteristics of the language in which it is
written. The Amharic language has its own punctuation marks which demarcate words in a
stream of characters. These include the colon : (hulet netib), the four dots or double colon (arat
netib), the semicolon (derib serez), the comma (netela serez), the exclamation mark ! (qalagano)
and the question mark ? (teyakimeleket). Nowadays, however, hulet netib (:) has been replaced by
white space and is no longer used to separate words. To let Java read Amharic Unicode text,
the system uses the following code.
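// Requires: import java.io.*;
// Read the file as UTF-8 so that Amharic (Ethiopic) characters are decoded correctly.
FileInputStream fileInput = new FileInputStream(filePath);
InputStreamReader inputStream = new InputStreamReader(fileInput, "utf-8");
BufferedReader bufferedReader = new BufferedReader(inputStream);
StringBuffer stringBuffer = new StringBuffer();
String lineContent = null;
// readLine() strips the line terminator, so a newline is re-added to keep
// the last word of one line separate from the first word of the next line.
while ((lineContent = bufferedReader.readLine()) != null) {
    stringBuffer.append(lineContent).append('\n');
}
bufferedReader.close();
String content = stringBuffer.toString();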
Algorithm for handling Unicode characters
Stop words removal: Not all words of a user query, or of the database documents, have
equal value for mapping a database query. The least important words are called stop words.
Stop words are non-context bearing words, also known as noisy words, which are excluded
from the input sentence to speed up the process. Stop words do not represent objects or concepts
of the world, and in our case stop words do not represent table names, column names or column
values. They often belong to syntactic classes such as articles, pronouns, particles, and
prepositions. These words contribute little to query mapping and similarity
identification. Thus, they can be removed from the text by comparing each term in the text
with a list of common words developed for a particular language and sometimes for a particular
domain. Stop word removal should be done carefully; otherwise it may affect the system.
Sample stop words in these categories are: , , , , , , etc.
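The tokenization and stop word removal steps can be sketched in Java as below. This is a minimal sketch under stated assumptions, not the implementation used in this work: the class name AmharicTokenizerSketch is hypothetical, the stop word list is supplied by the caller, and the delimiter set (white space, the Ethiopic punctuation range U+1361 to U+1368, and the ASCII marks ? and !) is an assumption based on the punctuation marks listed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class AmharicTokenizerSketch {
    // Assumed delimiters: white space, Ethiopic punctuation U+1361..U+1368, and ASCII ? and !
    private static final String DELIMITERS = "[\\s\\u1361-\\u1368?!]+";

    /** Splits a sentence into tokens and drops every token found in the stop word set. */
    static List<String> tokenize(String sentence, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String t : sentence.trim().split(DELIMITERS)) {
            // split() can yield an empty first element when the sentence starts with a delimiter
            if (!t.isEmpty() && !stopWords.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```

Keeping the stop word list outside the method makes it possible to load it from a language specific or domain specific file, as the paragraph above suggests.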
Spelling Checker
A spell checker is a tool that checks the spelling of the words in a user query,
validates them, i.e. checks whether they are correctly or incorrectly spelled, and, when it
has doubts about the spelling of a word, suggests possible alternatives. The two core
functionalities provided by a spell checker are spelling error detection and spelling error
correction. Error detection verifies the validity of a word in the language, while error
correction suggests corrections for a misspelled word.
We use the spelling checker developed by Tefery Kebebew at Jimma University.
This spelling checker includes Amharic Unicode writing tools, so users can write their
queries without changing the system writing mode. The tool performs well, with only a few
dictionary errors.
Stemmer
For grammatical reasons, documents use different forms of a word. There are
families of derivationally related words with similar meanings, for instance words that share the
root democrat but are different morphological variants, such as democracy, democratic, and
democratization [70]. The goal of stemming is to reduce inflectional forms and sometimes
derivationally related forms of a word to a common base form. Stemming is also used to reduce
the size of the dictionary (i.e. the number of distinct terms used in representing a set of
documents). A smaller dictionary requires less storage space and less processing time.
According to Atelach A. [57], the stemmer finds all possible segmentations of a given word
according to the morphological rules of the language and then selects the most likely prefix and
suffix for the word based on corpus statistics. It strips off the prefix and suffix, and then tries to
look up the remaining stem (or alternatively, some morphologically motivated variants of it) in a
dictionary to verify that it is a possible stem of the word.
The Amharic language makes use of prefixing, suffixing and infixing to create inflectional and
derivational word forms. In a morphologically complex language like Amharic, a stemmer plays a
major role in information retrieval. As presented in [60], the stemmer uses
exception list and normalization list files to stem a given word. Normalization is used to correct a
variant of a word to
its stem after suffix is removed for some words (e.g. for a word will be removed as
affix and will be normalized to which is the stem).
Like the parser discussed in section 1.4.4, stemmers for the Amharic language have been developed
only at a research level; no practical tool has been developed yet. Therefore, to benefit from
stemming, we have developed a limited lookup table stemmer based on the column names,
table names, and conditions of our particular system.
[Table: sample entries of the stemmer lookup table, with columns TOKEN and STEMMED]
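A lookup table stemmer of this kind can be sketched in a few lines of Java. The class name LookupStemmerSketch and its methods are hypothetical, and the fallback of returning an unknown token unchanged is an assumption, since the text does not say how unlisted tokens are treated.

```java
import java.util.HashMap;
import java.util.Map;

final class LookupStemmerSketch {
    // Maps an inflected token (TOKEN column) to its stem (STEMMED column)
    private final Map<String, String> table = new HashMap<>();

    /** Registers one row of the lookup table. */
    void add(String token, String stem) {
        table.put(token, stem);
    }

    /** Returns the stored stem, or the token itself when it is not in the table (assumed fallback). */
    String stem(String token) {
        return table.getOrDefault(token, token);
    }
}
```

Because the table only has to cover the table names, column names and condition words of one database, it stays small, at the cost of having to be extended by hand for every new domain.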
Normalization
As discussed in chapter 3, one of the problems of the Amharic writing system is the
variation of alphabets (fidels) that share the same pronunciation. Tessema [71] has developed an
analyzer that normalizes documents so that letters with the same sound are written in one specific
form, together with their orders [71]. In addition
to these normalizations, Yimam [72] investigated and found that some other orders of these
letters, which are used interchangeably in documents, should also be normalized to one form.
For each word in the token list
    For each character in the word
        If the character is a variant form of a letter, or any order thereof, then
            Replace it by the corresponding order of the standard form of that letter
        End if
    End for
End for
Algorithm for normalizing the input words
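The normalization loop can be sketched in Java as follows. The letter groups used here are an assumption: they are the commonly cited Amharic homophone series (the series starting at U+1210 and U+1280 mapped to the series at U+1200, U+1220 to U+1230, U+12D0 to U+12A0, and U+1340 to U+1338) and are not taken from the text above. The sketch relies on the fact that each series occupies eight consecutive code points in the Unicode Ethiopic block, so an order keeps its offset within the series. The class name FidelNormalizerSketch is hypothetical.

```java
final class FidelNormalizerSketch {
    // Each row is {first code point of a variant series, first code point of the standard series}.
    // These pairs are assumed homophone groups, not the list used in the thesis.
    private static final int[][] SERIES = {
        {0x1210, 0x1200},
        {0x1280, 0x1200},
        {0x1220, 0x1230},
        {0x12D0, 0x12A0},
        {0x1340, 0x1338},
    };

    /** Replaces every variant character by the same order of its standard series. */
    static String normalize(String word) {
        StringBuilder sb = new StringBuilder(word.length());
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            for (int[] s : SERIES) {
                if (c >= s[0] && c < s[0] + 8) {
                    c = (char) (s[1] + (c - s[0]));   // keep the order (offset within the series)
                    break;
                }
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```

Working with code point offsets avoids listing all seven orders of every letter by hand; characters outside the listed series are copied unchanged.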
In addition, compound words are handled so that table names and column names are mapped
correctly.
[Table: compound word handling, with columns FIRST_WORD, SECOND_WORD and CONCATENATION]
database with the different ways they can be expressed in a sentence. Thirdly, the
conditional_words table contains the conditional symbols, such as <, >, <=, and >=, together with
the words that express them in natural language. The resulting logical query is then passed to the
query generator. The overall mapping is presented in the following tables [1].
Table_ Handling Table
Mapped words (table names): EMPLOYEE, EMPLOYEE, EMPLOYEE_ON_EDUCATION
Mapped words (column names): Id, SEX, FILDE_OF_STUDY, HIRE_DATE, DEPARTMENT, COLLEG, SALARY,
NAME, FILDE_OF_STUDING, POSITION
Mapped words (conditional symbols): <, >, <=, >=, !=
Retrieve token_word and mapped_word from the database table and store them in a new hashmap;
Get the first token from normalized_words using the StringTokenizer method;
While (tokens still exist in normalized_words) {
    Iterate over the hashmap;
    While (iterator has a next value) {
        Get token_word;
        Get mapped_word;
        If (token_word contains token) {
            TableName/ColumnName = mapped_word;
        }
    }
    token = next token;
}
Algorithm for Mapping Table Name and Column Name
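The mapping algorithm above can be sketched in Java as follows. This is an illustrative sketch, not the code of the system: the class name NameMapperSketch is hypothetical, the dictionary is passed in as a map instead of being read from the database, and stopping at the first matching entry is a design choice of the sketch (the pseudocode keeps iterating).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class NameMapperSketch {
    /**
     * Maps each normalized token to a table or column name.
     * dictionary: token word (possibly several variants in one string) -> mapped name.
     */
    static List<String> map(List<String> normalizedTokens, Map<String, String> dictionary) {
        List<String> mapped = new ArrayList<>();
        for (String token : normalizedTokens) {
            for (Map.Entry<String, String> e : dictionary.entrySet()) {
                if (e.getKey().contains(token)) {   // same containment test as the pseudocode
                    mapped.add(e.getValue());
                    break;                          // first match wins (choice of this sketch)
                }
            }
        }
        return mapped;
    }
}
```

Note that the containment test is a substring match, so very short tokens can match unrelated dictionary entries; an exact match on each stored variant would be stricter.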
analyze the syntactic structure of the Amharic sentence used to retrieve text
from a database. The sentence or question contains the name of a table, an attribute or a value. For
instance, when the user wants to retrieve the whole content of a table, the query consists of three
Amharic words whose English equivalents are [all] [employees] [display], respectively. This
structure contains a table name, employee, and
no column name. So this sentence gives the SQL query SELECT * FROM
EMPLOYEE;. This query displays the entire content of the table.
In the other case, to retrieve a certain column from the table, the sentence
should include the column name. For instance, when the user wants to display the
names of the employees, the appropriate Amharic query consists of four words whose English
equivalents are [Employees] [Name] [List] [display], respectively. In this
structure, Employee is a table name, name is a column name, and the
remaining words do not contribute to generating the query in this system, because the
principal words for generating the query are the table name and the column name. So the
query is SELECT NAME FROM EMPLOYEE;. In this sentence the
column name has no column value, i.e. the column name is used for selection only.
Likewise, if the sentence contains more than one column name and each of them is
recognized as used for selection only, the column names are listed directly after the SELECT
keyword. For example, when the user wants to retrieve the name and the salary of an employee, the
request is stated in Amharic. In that sentence, name and
salary are identified as column names, and neither has a column value; that is,
both column names are used for selection only. The expected SQL query for this
sentence is SELECT NAME, SALARY FROM EMPLOYEE.
In another case, when the user wants to retrieve certain rows of a certain column, the system must determine which column name is used for the condition and which for selection. To do so, we analyzed questions presented in different request forms. For example, for a query such as retrieve the name and ID number of employees whose sex is male, the Amharic request is structured like [] [] [] [] [] [] [] [] []. This structure reads [their sex] [male] [been] [employees] [name] [and] [identification] [numbers] [display]. In this sentence /employee is a table name and /sex, /name, and _/idnumber are column names. Here /sex is a column name used for a condition that restricts the retrieved results.
As another example, consider displaying the names of the employees who were hired in 2005. In Amharic this query is [] [/] [] [] [] []. Its structure reads [in 2005] [year] [hired] [employees] [name] [display]. In the given sentence /employee is a table name and /hired and /name are column names. The token /hired is a column name used for a condition that restricts the results to be displayed. In these sentences column names appear both before and after the table name, and from this we identified a rule: a token recognized as a column name before the table name is treated as a column name used for a condition, and a token recognized as a column name after the table name is treated as a column name used for selection.
However, there is an exception to this rule for identifying the condition column. For example, [ ] [ ] [] [] [ ] [] [] [] [] [] [] means select the name and sex of the employees who earn more than 10000 birr and work in the accounting department. The structure reads [accounting] [department] [employees] [their salary] [10000] [more than] [have been] [name] [and] [their sex] [display]. This structure shows that the column name /salary appears after the table name /employees. To handle this and similar exceptions, we check for the word /have_been after the table name. If the word or appears after the table name, a column name that comes after the table name and before is treated as a column name used for a condition.
In general, we conclude that IndexNumberOf(Cc) < IndexNumberOf(TableName) < IndexNumberOf(Cs), where Cc is a column name used for a condition and Cs is a column name used for selection.
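The positional rule and its exception can be illustrated with the short Java sketch below. It is only an illustration under stated assumptions: the class ColumnRoles and its method are hypothetical, the sentence is assumed to be already tokenized and mapped to schema names, and the placeholder token have_been stands in for the Amharic word checked after the table name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: decide whether each column name serves the condition (Cc) or the
// selection (Cs), based on its position relative to the table name.
class ColumnRoles {
    final List<String> conditionColumns = new ArrayList<>();
    final List<String> selectionColumns = new ArrayList<>();

    // tokens: sentence tokens in order, already mapped to schema names.
    // marker: placeholder for the "have been" word (an assumption of this sketch).
    static ColumnRoles classify(List<String> tokens, String tableName,
                                List<String> columnNames, String marker) {
        int tableIndex = tokens.indexOf(tableName);
        if (tableIndex < 0) {
            throw new IllegalArgumentException("No table name in sentence");
        }
        int markerIndex = tokens.indexOf(marker);
        ColumnRoles roles = new ColumnRoles();
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (!columnNames.contains(token)) {
                continue; // not a column name
            }
            boolean beforeTable = i < tableIndex;
            // Exception: a column after the table name but before the marker
            // is still a condition column.
            boolean betweenTableAndMarker = markerIndex > tableIndex && i < markerIndex;
            if (beforeTable || betweenTableAndMarker) {
                roles.conditionColumns.add(token);
            } else {
                roles.selectionColumns.add(token);
            }
        }
        return roles;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("accounting", "DEP_NAME", "EMPLOYEE", "SALARY",
                "10000", "more_than", "have_been", "NAME", "and", "SEX", "display");
        ColumnRoles roles = classify(tokens, "EMPLOYEE",
                Arrays.asList("DEP_NAME", "SALARY", "NAME", "SEX"), "have_been");
        System.out.println("condition: " + roles.conditionColumns);
        System.out.println("selection: " + roles.selectionColumns);
    }
}
```

For the accounting department example, the sketch places DEP_NAME and SALARY in the condition list and NAME and SEX in the selection list, in line with the exception described above.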
Key: table name, column name, column value, conditional word, keywords
0 <= IndexNumberOf(TableName) < IndexNumberOf().
Based on this word we formulate the rule to handle ORDER BY queries. For example, ; display all employees whose salary is less than 2500 ordered by name. In this sentence the array value of TableName is 5 and the array value of is 7, i.e. indexes 4 and 6 when counting from zero, so 0 <= 4 < 6. Based on this rule the query is converted. The above natural (Amharic) language query is converted into:
SELECT * FROM EMPLOYEE
WHERE Salary < 2500
ORDER BY Name;
Group By: - This database query is used to display the results in groups. To recognize a Group By query in the given input, we check for the word / in the sentence. This keyword may come before or after the table name, depending on the user's request. Based on the keyword we formulate the GROUP BY query. For example, ; this means display all employees whose salary is less than 3550 grouped by their departments. This natural language query is converted into:
SELECT * FROM EMPLOYEE
Count (): - This query is used for counting the results that fulfil the query or the condition. To recognize this SQL function we look for the word in the given sentence. This keyword is found after the table name in the sentence. For example, , which means display the total number of female employees. This query is converted into:
SELECT COUNT(*) AS TOTAL FROM EMPLOYEE
WHERE Sex = ;
SUM (): - This database query is used to add up column values based on the specification. To formulate the SUM() query, we identify the word in the given sentence. This keyword is found after the table name and follows the Order By rule. For example, ; which means display the total salary in the management department. This query is converted into:
SELECT SUM(Salary) AS TOTAL_SALARY FROM EMPLOYEE
WHERE Department = ;
MAX (), MIN (), AVG (): - These database queries are used to select the maximum, minimum, and average of the values respectively. To formulate the query, we identify the word for maximum, for minimum, and for average in the given Amharic sentences. For example, ; this means display the maximum salary in the computer science department. This query is converted into:
SELECT MAX(salary) AS MAXIMUM_SALARY FROM EMPLOYEE
WHERE Department = ;
We have identified the following rules to develop the algorithms.
RULE #1: If the sentence does not contain a table name, the sentence is invalid for translation.
RULE #2: If the sentence contains a table name only, the query selects the whole table. SELECT * FROM TABLE_NAME;
RULE #3: If the sentence contains both a table name and column names, and the column names are positioned next to the table name, the column names are used for selection. SELECT (COLUMN_NAME), [COLUMN_NAME] FROM TABLE_NAME;
RULE #4: If the sentence contains both table name and column names, and if the column name positioned next to table name and a token
FROM TABLE_NAME WHERE COLUMN_NAME COLUMN_VALUE;
RULE #6: If the sentence contains both a table name and column names, and column names are positioned both before and after the table name, the column name placed before the table name is the column name for the condition (COLUMN_NAMEc) and the column name placed after the table name is the column name for selection (COLUMN_NAMEs). SELECT (COLUMN_NAMEs), [COLUMN_NAMEs] FROM TABLE_NAME WHERE COLUMN_NAMEc = COLUMN_NAMEc_VALUE;
RULE #7: If the sentence contains both a table name and column names, the column name is positioned before the table name, and the column name is at the beginning of the sentence, then the column value is the word/s located next to the column name (after stop words have been removed). If the word placed next to the column value is a conditional word, the sign is the mapped_word of the conditional word; otherwise the sign is the equal sign.
RULE #8: If the sentence contains both a table name and column names, the column name is positioned before the table name, and the word positioned next to the column name is a table name or a column name, then the column value is the word/s placed before the column name (after stop words have been removed), and the sign is the equal sign.
RULE #9: If the sentence contains both a table name and column names, the column name is positioned before the table name, the word positioned before the column name is a conditional word, and the word positioned next to the column name is either a table name or a column name, then the word before the column name is the condition (a sign) and the value is the word placed before the condition (sign).
RULE #10: If the natural language sentence contains the word , the query includes the COUNT() function.
RULE #11: If the natural language sentence contains the word , the query includes the AVG() function, and the column to be averaged is found next to the word .
RULE #12: If the natural language sentence contains the word , the query includes the MAX() function, and the column to be compared is found next to the word .
RULE #13: If the natural language sentence contains the word , the query includes the MIN() function, and the column to be compared is found next to the word .
RULE #14: If the natural language sentence contains the word , the query includes the SUM() function, and the column to be added is found before the word .
RULE #15: If the natural language sentence contains the word , the query includes a condition with the keyword BETWEEN. The first comparison value is found before the word and the second value is found after the word .
RULE #16: If the natural language sentence contains the word , the query includes a GROUP BY clause. The column used for grouping is found before the word .
RULE #17: If the sentence contains both a table name and column names, and the column names belong to different tables, the tables are joined with the keyword INNER JOIN.
RULE #18: If the sentence contains both a table name and column names, and the sentence contains the keyword /, the query includes LIKE in the WHERE condition. The keyword used for the condition is found before the word / and the column name used for checking is found before the keyword used for the condition.
RULE #19: If the sentence contains both a table name and column names, and it contains the keyword , the query includes an ORDER BY clause. The column name used for the comparison is found before the word .
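As a rough illustration of how several of these rules combine into a query string, the following Java sketch covers RULE #1, RULE #2, RULE #3, RULE #6 and the aggregate rules (#10 to #14). The class SqlAssembler, its parameters, and the choice to join multiple conditions with AND are assumptions of this sketch rather than the thesis implementation; conditions are passed in as ready-made strings.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch: assemble a SELECT statement from the parts identified by the rules.
class SqlAssembler {

    // aggregate: null, or one of COUNT, SUM, AVG, MAX, MIN (RULE #10 to #14).
    // conditions: ready-made comparisons such as "SEX = 'M'" (hypothetical input).
    static String build(String table, List<String> selectColumns, List<String> conditions,
                        String aggregate, String aggregateColumn) {
        if (table == null || table.isEmpty()) {
            // RULE #1: without a table name the sentence cannot be translated.
            throw new IllegalArgumentException("RULE #1: no table name, sentence is invalid");
        }
        String selectList;
        if ("COUNT".equals(aggregate)) {
            selectList = "COUNT(*)";
        } else if (aggregate != null) {
            selectList = aggregate + "(" + aggregateColumn + ")";
        } else if (selectColumns.isEmpty()) {
            selectList = "*"; // RULE #2: table name only
        } else {
            selectList = String.join(", ", selectColumns); // RULE #3
        }
        StringBuilder sql = new StringBuilder("SELECT ").append(selectList)
                .append(" FROM ").append(table);
        if (!conditions.isEmpty()) {
            // RULE #6: condition columns end up in the WHERE clause.
            sql.append(" WHERE ").append(String.join(" AND ", conditions));
        }
        return sql.append(';').toString();
    }

    public static void main(String[] args) {
        System.out.println(build("EMPLOYEE", Collections.<String>emptyList(),
                Collections.<String>emptyList(), null, null));
        System.out.println(build("EMPLOYEE", Arrays.asList("NAME", "SALARY"),
                Collections.<String>emptyList(), null, null));
    }
}
```

A production system would normally bind condition values through prepared-statement parameters instead of concatenating them into the SQL text; the sketch concatenates only to keep the rule-to-clause correspondence visible.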
Algorithm to handle A2
Int Table_location = -1;
User Input = Input;
Put the input on ArrayList = input_list;
4.8. Result
The results found from the database are again sent to the application program, and the
application sends the results in a form that is understandable to the users. Finally, the result is
displayed in the interface accordingly so that the user can see the converted query as well as the
result retrieved from the database.
CHAPTER FIVE
IMPLEMENTATION, RESULTS and DISCUSSIONS
5.1. Introduction
As mentioned previously, in line with the main objective of this study, the researchers have developed an Amharic language interface to database using Java 8.02 with JDK 8 update 101 (1.8.0_101), with a MySQL database at the back end for the dictionary and the actual database. This section of the chapter deals with the experimentation on the designed Amharic language interface to database discussed in 4.5. To use the system, the user first enters the query as an Amharic sentence; the system then displays the generated SQL, and the results are retrieved from the database accordingly.
In the next few sections, the processes involved and the results obtained during the researchers' experimentation with the database system are presented in detail. This includes the steps users follow to execute different types of queries, such as queries for selection of the whole table, queries for selection of certain columns, queries with a single condition, queries with multiple conditions, aggregate functions, join queries, and grouping and ordering queries. In addition, the results of the researchers' evaluation of the proposed system along different dimensions are reported. Another section of the chapter analyzes the results of the experimentation, including the database used to validate the prototype, the categories of the queries, and the forms of the queries. Finally, the overall performance measurement of the designed system is presented.
As displayed in figure 5.1, to start accessing the database the user first enters his/her request as a sentence in Amharic in the text box provided, and then generates and executes the query by clicking the GENERATE button. The system then converts the user's natural language query into a selection of the whole table, a selection of certain columns from a table, or a selection of certain rows from certain columns (conditions), including aggregate functions and grouping and ordering queries.
the list of all employees and all information about them, in both rows and columns. The result of such a query for selection of the whole table is shown in figure 5.2.
and rows of the employee table and displays the required data accordingly. The query can be paraphrased in different forms such as: , etc.
check the rule specified in the previous chapter. According to RULE#3, if the user query contains both a table name and column names, and the column names are positioned next to the table name, the column names are used for selection. So the system converts the query into SELECT SEX, LEVEL, HIRE_DATE FROM Employee;. Finally, this query is sent to and executed on the database, and the required data is retrieved and displayed with the columns sex, level, and hire date for all rows. Figure 5.3 shows the result of such a query for selection of certain columns. The different paraphrases of such queries are: ,
,
, , ,
, etc.
about employees' education, the query is joined with the employee table.
attribute by using the keyword INNER JOIN. So, according to this rule, the query is converted into SELECT NAME, DEP_NAME FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID. This query is sent to the database, which retrieves the data, and the result is displayed in the table as shown in figure 5.6. The paraphrases of this type of query look like:
,
,
, ,
etc.
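The join rule can be sketched in Java as follows. The sketch is an assumption-laden illustration: the class JoinSketch, the small column-to-table map, and the restriction to the employee and department tables are all hypothetical simplifications; only the join condition on DEP_ID is taken from the query shown above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of RULE #17: when the requested columns belong to different tables,
// the FROM clause joins those tables with INNER JOIN.
class JoinSketch {

    // Column -> owning table, limited to a few columns for illustration.
    static final Map<String, String> COLUMN_TABLE = new LinkedHashMap<>();
    static {
        COLUMN_TABLE.put("NAME", "employee");
        COLUMN_TABLE.put("SALARY", "employee");
        COLUMN_TABLE.put("DEP_NAME", "department");
    }

    static String fromClause(List<String> columns) {
        Set<String> tables = new LinkedHashSet<>();
        for (String column : columns) {
            String table = COLUMN_TABLE.get(column);
            if (table == null) {
                throw new IllegalArgumentException("Unknown column: " + column);
            }
            tables.add(table);
        }
        if (tables.size() == 1) {
            return "FROM " + tables.iterator().next();
        }
        // Only employee and department are known here, so two tables means both.
        return "FROM department INNER JOIN employee ON employee.DEP_ID = department.DEP_ID";
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("NAME", "DEP_NAME");
        System.out.println("SELECT " + String.join(", ", columns) + " " + fromClause(columns) + ";");
    }
}
```

Extending the idea to the third table would require one more entry per column and a second join condition on EMP_ID, as in the three-table queries listed in the evaluation tables.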
For example, suppose a user wants to find the total number of employees and the average salary of those who work in the mathematics department and whose names start with the letter ; they can phrase it as . According to RULE#18, if the sentence contains the keyword /, the query includes LIKE in the WHERE condition. In the same fashion, by RULE#10 and RULE#11, if the sentence contains the word , the query includes the COUNT() function, and if the keyword is , the query includes the AVG() function. Therefore, according to RULE#18, RULE#10, and RULE#11, the query is converted into SELECT COUNT(*) AS TOTAL, AVG(SALARY) AS AVGSALARY FROM
Language Understanding: - The user is expected to include the column name in the query. For instance, a query may simply state that the sex is female, but our system would not understand it, and the user should reframe the query as .
Simplification: - The user can type the query without complex expressions and without strong knowledge of SQL.
We collected 120 sample user queries from ordinary people who have no knowledge of database languages. The accuracy of the system is measured in terms of precision percentage, with each query response classified into one of two classes: Correct Queries and Incorrect Queries.
5.10.1. Analysis of Results
5.10.1.1.
We used an academic employee database to validate the functionality of the developed prototype, the ALIDB system. This database contains three tables, namely Employee, Department, and Employee on education. Each table has its own columns. The structure of the tables is presented below:
COLUMN NAME | DATA TYPE | COLLATION | COMMENTS
EMP_ID | VARCHAR (10) | utf8_unicode_ci | Employee identification number
NAME | VARCHAR (30) | utf8_unicode_ci | Name of employee
SEX | VARCHAR (4) | utf8_unicode_ci | Sex of employee
FILD_STUDY | VARCHAR (50) | utf8_unicode_ci | Field of study
LEVEL | VARCHAR (15) | utf8_unicode_ci | Level of employee
DEP_ID | VARCHAR (10) | utf8_unicode_ci | Department identification number
HIRE_DATE | DATE | utf8_unicode_ci |
SALARY | VARCHAR (7) | utf8_unicode_ci | Salary of employee
POSITION | VARCHAR (15) | utf8_unicode_ci | Position of employee

COLUMN NAME | DATA TYPE | COLLATION | COMMENTS
DEP_ID | VARCHAR (10) | utf8_unicode_ci |
DEP_NAME | VARCHAR (30) | utf8_unicode_ci | Name of department
COLLAGE | VARCHAR (30) | utf8_unicode_ci | Name of college

COLUMN NAME | DATA TYPE | COLLATION | COMMENTS
EMP_ID | VARCHAR (10) | utf8_unicode_ci |
UNIVERSITY | VARCHAR (30) | utf8_unicode_ci | University of employee attending
FILED_OF_STUDING | VARCHAR (30) | utf8_unicode_ci |
STARTING_YEAR | DATE | utf8_unicode_ci | The year employee started studying
Table 5. 3: Employee on education structure
5.10.1.2.
To evaluate actual user query requests we divided them into four categories: Query for Selection, Query for a Single Condition, Query for Multiple Conditions, and Query for Aggregate Function, including Group By and Order By. We included join queries in every category.
C/I
SELECT
NAME,
SEX, C
HIRE_DATE, LEVEL FROM
employee;
SELECT
NAME,
SEX, C
SELECT
NAME,
SEX, C
HIRE_DATE LEVEL FROM
employee;
10
11
12
SELECT NAME,
HIRE_DATE
SALARY, C
FROM
Employee;
13
SELECT
SALARY,
HIRE_DATE
NAME, C
FROM
Employee;
14
SELECT
SALARY, C
15
16
17
18
19
20
department
department.DEP_ID;
22
FROM
department
JOIN
employee
employee.DEP_ID
INNER
ON
=
department.DEP_ID;
23
SELECT
NAME,
SEX, C
HIRE_DATE,
DEP_NAME,
LEVEL FROM
department
department.DEP_ID;
24
department
JOIN
employee
employee.DEP_ID
INNER
ON
=
department.DEP_ID;
25
department
JOIN
employee
employee.DEP_ID
department.DEP_ID;
INNER
ON
=
26
SELECT NAME,
DEP_NAME
department
University, C
INNER
employee
JOIN
ON
employee.DEP_ID
department.DEP_ID
JOIN
FROM
emp_on_edu
emp_on_edu.EMP_ID
=
INNER
ON
=
employee.EMP_ID;
27
emp_on_edu
emp_on_edu.EMP_ID
=
INNER
ON
=
employee.EMP_ID;
28
SELECT
NAME,
SEX, C
HIRE_DATE,
DEP_NAME,
University,
LEVEL
FROM
department
INNER
JOIN
employee
ON
employee.DEP_ID
department.DEP_ID
JOIN
emp_on_edu
emp_on_edu.EMP_ID
=
INNER
ON
=
employee.EMP_ID;
29
30
Total queries | Correct queries (C) | Incorrect queries (I) | Accuracy
30 | 30 | 0 | 100%
[Bar chart of correct and incorrect query counts for the select query category]
A single conditional query selects all columns or specific columns for specific rows. This category of query handles only a single condition and has the form: SELECT selection [, selection] FROM table_name WHERE condition.
No.
C/I
SELECT
NAME,
SALARY, C
HIRE_DATE C
SELECT
NAME,
FROM Employee
POSITION C
SELECT
NAME,
FROM Employee
POSITION C
SELECT
NAME,
FROM Employee
POSITION C
10
11
SELECT
NAME,
DEP_NAME I
employee ON employee.DEP_ID =
department.DEP_ID;
12
13
14
emp_on_edu
INNER
JOIN
employee ON employee.EMP_ID =
emp_on_edu.EMP_ID
WHERE FILD_OF_STUDING =
'';
15
16
SELECT
NAME
FROM C
employee.DEP_ID
department.DEP_ID
WHERE
DEP_NAME
'';
17
SELECT
NAME
FROM C
department INNER JOIN employee
ON
employee.DEP_ID
department.DEP_ID
WHERE
DEP_NAME
'';
18
SELECT
NAME
FROM C
ON
employee.DEP_ID
department.DEP_ID
WHERE
DEP_NAME
'';
19
SELECT
NAME
FROM C
employee.DEP_ID
department.DEP_ID
WHERE
DEP_NAME
'';
20
employee
ON
employee.DEP_ID
department.DEP_ID
WHERE
DEP_NAME
'';
21
22
employee.DEP_ID
department.DEP_ID
WHERE
DEP_NAME
'';
23
employee ON employee.DEP_ID =
department.DEP_ID
WHERE LEVEL = ' ';
24
SELECT
NAME,
EMP_ID, C
DEP_NAME FROM department
department.DEP_ID
WHERE SALARY > ';
25
26
SELECT
FILD_OF_STUDING
NAME, C
FROM
employee.DEP_ID
ON
emp_on_edu.EMP_ID
employee.EMP_ID
WHERE LEVEL_EDU = '
';
27
SELECT
FILD_OF_STUDING
emp_on_edu
University, C
INNER
FROM
JOIN
employee ON employee.EMP_ID =
emp_on_edu.EMP_ID
WHERE LEVEL_EDU = '
';
28
29
30
';
Table 5. 6: Single conditional query and results
Total queries | Correct queries (C) | Incorrect queries (I) | Accuracy
30 | 28 | 2 | 93.33%
[Bar chart of correct and incorrect query counts for the single condition query category]
As revealed by the single conditional query performance test, shown in figure 5.10 above, the accuracy of the system for this type of query is 93.33%. This indicates that, except for a small number of queries (only two in this case), the system is highly accurate. This implies that the prototype of our proposed system has high validity and reliability in converting and executing users' queries as per their request.
C/I
department.DEP_ID
WHERE SALARY > '' AND
DEP_NAME = ' ';
4
SELECT
NAME
FROM C
department INNER JOIN employee
ON
employee.DEP_ID
department.DEP_ID
WHERE LEVEL = ' '
AND DEP_NAME = '
';
employee.DEP_ID
ON
emp_on_edu.EMP_ID
employee.EMP_ID
WHERE University = ' '
AND DEP_NAME = '';
5
ON
employee.DEP_ID
ON
emp_on_edu.EMP_ID
employee.EMP_ID
WHERE University = ' '
AND DEP_NAME = '';
6
SELECT
NAME
FROM C
department INNER JOIN employee
ON
employee.DEP_ID
department.DEP_ID
WHERE LEVEL = ' '
AND DEP_NAME = '';
7
SELECT
NAME
FROM C
department INNER JOIN employee
ON
employee.DEP_ID
department.DEP_ID
WHERE LEVEL = ' '
AND DEP_NAME = '';
8
SELECT
NAME
FROM C
Employee
WHERE LEVEL = ' '
AND SEX = '';
SELECT
Employee
NAME
FROM C
10
4500 SELECT
Employee
NAME
FROM C
11
SELECT
4500 Employee
NAME
FROM C
12
4500
SELECT
NAME
FROM C
department INNER JOIN employee
ON
employee.DEP_ID
department.DEP_ID
WHERE SALARY > '4500' AND
DEP_NAME = '' AND
SEX = '';
13
SELECT
NAME,
SEX, I
HIRE_DATE, LEVEL FROM
department INNER JOIN employee
ON
employee.DEP_ID
ON
=
employee.EMP_ID
WHERE POSITION null 'null'
SELECT
NAME
FROM I
department INNER JOIN employee
ON
employee.DEP_ID
department.DEP_ID
WHERE DEP_NAME = '';
15
employee ON employee.DEP_ID =
department.DEP_ID INNER JOIN
emp_on_edu
emp_on_edu.EMP_ID
ON
=
employee.EMP_ID
WHERE LEVEL_EDU = '
' AND COLLAGE = '
';
16
SELECT
NAME,
HIRE_DATE, LEVEL
Employee
SEX, C
FROM
SELECT
NAME
FROM C
department INNER JOIN employee
ON
employee.DEP_ID
ON
=
employee.EMP_ID
WHERE LEVEL = ' '
AND LEVEL_EDU = ' '
AND COLLAGE = ' ';
18
19
ON
employee.DEP_ID
department.DEP_ID
WHERE SALARY < '' AND
COLLAGE = ' ';
20
21
5000
SELECT
NAME,
DEP_NAME C
SALARY
>=
'5000'
SELECT
Employee
NAME
WHERE SEX !=
FROM C
'' AND
ON
employee.DEP_ID
department.DEP_ID
WHERE SALARY < '' AND
SEX != '' AND COLLAGE =
' ';
24
25
SALARY FROM
department
INNER JOIN employee ON
employee.DEP_ID
department.DEP_ID
WHERE
SALARY
>=
'5000'
SELECT
NAME,
University I
FROM department INNER JOIN
employee ON employee.DEP_ID =
ON
emp_on_edu.EMP_ID
employee.EMP_ID
WHERE
SEX
LEVEL_EDU
''
AND
''
AND
DEP_NAME = '
';
27
ON
employee.DEP_ID
department.DEP_ID
WHERE SALARY > '5000' AND
COLLAGE = ' ';
28
5000 SELECT
NAME,
DEP_NAME C
department.DEP_ID
WHERE SALARY <= '5000' AND
SEX = '' AND DEP_NAME =
'';
29
ON
employee.DEP_ID
department.DEP_ID
WHERE SALARY >= '' AND
COLLAGE = ' ';
30
ON
employee.DEP_ID
department.DEP_ID
WHERE SALARY <= '5000' AND
COLLAGE = ' ';
Table 5. 8: Multiple conditional query and results
Total queries | Correct queries (C) | Incorrect queries (I) | Accuracy
30 | 25 | 5 | 83.33%
[Bar chart of correct and incorrect query counts for the multiple condition query category]
C/I
WHERE
DEP_NAME
' ';
2
DEP_NAME
SELECT AVG(SALARY) AS C
AVGSALARY FROM department
INNER
JOIN
employee
employee.DEP_ID
ON
=
department.DEP_ID
WHERE DEP_NAME = '';
4
SELECT MAX(SALARY) AS C
MAXIMUMSALARY
FROM
department INNER JOIN employee
ON
employee.DEP_ID
department.DEP_ID
WHERE
DEP_NAME
'';
5
SELECT MIN(SALARY) AS C
MINIMUMSALARY
FROM
department INNER JOIN employee
ON
employee.DEP_ID
department.DEP_ID
WHERE
DEP_NAME
'';
6
SELECT
Employee
NAME
FROM C
employee.DEP_ID
=
department.DEP_ID
WHERE
DEP_NAME
''
Order by SALARY;
8
10
11
12
SELECT
MIN(SALARY)
MINIMUMSALARY
AS C
FROM
Employee
WHERE
FIELD_STUDY
'';
13
SELECT
SUM(SALARY)
TOTALSALARY
AS C
FROM
employee.DEP_ID
department.DEP_ID
WHERE DEP_NAME = '';
14
SELECT
SUM(SALARY)
TOTALSALARY
AS C
FROM
Employee;
15
SELECT SUM(SALARY) AS C
TOTALSALARY
FROM
department INNER JOIN employee
ON
employee.DEP_ID
department.DEP_ID
WHERE COLLAGE = '';
16
17
JOIN
employee
employee.DEP_ID
ON
=
department.DEP_ID
WHERE DEP_NAME = '';
18
JOIN
employee
employee.DEP_ID
ON
=
department.DEP_ID
WHERE DEP_NAME = '';
19
JOIN
employee
ON
employee.DEP_ID
department.DEP_ID
WHERE NAME Like '%' AND
DEP_NAME = '';
20
21
22
SELECT
MIN(SALARY)
MINIMUMSALARY
AS I
FROM
Employee;
23
FROM
employee.DEP_ID
department.DEP_ID
WHERE DEP_NAME = '';
24
MIN(SALARY)AS
MINIMUMSALARY
FROM
employee.DEP_ID
department.DEP_ID
WHERE DEP_NAME = ''
Group by SEX;
25
26
FROM Employee
WHERE NAME Like '%';
28
MAX(SALARY)
AS
MAXIMUMSALARY
FROM
Employee
WHERE NAME Like '%';
29
SELECT
SUM(SALARY)AS I
TOTALSALARY
FROM
department INNER JOIN employee
ON
employee.DEP_ID
department.DEP_ID;
30
employee ON employee.DEP_ID =
department.DEP_ID INNER JOIN
emp_on_edu
ON
emp_on_edu.EMP_ID
employee.EMP_ID
WHERE LEVEL_EDU = ''
AND DEP_NAME = '
';
Total queries | Correct queries (C) | Incorrect queries (I) | Accuracy
30 | 26 | 4 | 86.67%
[Bar chart of correct and incorrect query counts for the aggregate function category]
Figure 5.12: System performance of aggregate function
Similarly, for this type of query, 30 user queries were presented as input to the system. The system generated 26 of them correctly, giving 86.67% accuracy, whereas it generated the remaining 4 queries incorrectly. Based on our evaluation, the system exhibited strong validity and reliability, apart from these few inaccuracies, which indicates that it functions well in generating user queries in the aggregate function category.
5.10.1.3. OVERALL MEASUREMENT
Besides evaluating the system's accuracy separately for selection, single condition, multiple condition, and aggregate function queries, we also measured the system's overall accuracy across all forms of queries. To do so, we divided the total number of correct queries generated by the system by the total number of input queries. As shown in figure 5.13, the system we developed has an overall accuracy of 91%, which implies that the system's validity and reliability are very high.
Overall accuracy of correct queries = 109 / 120 ≈ 0.91 = 91%
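The accuracy figures can be reproduced with a few lines of Java. The per-category counts (30, 28, 25 and 26 correct out of 30 each) are taken from the result tables; the class and method names are only illustrative. Note that 109 / 120 is about 90.83%, which the text rounds to 91%.

```java
// Sketch: per-category and overall accuracy from the correct-query counts.
class AccuracyReport {

    // Percentage of correct queries; total must be positive.
    static double percent(int correct, int total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        return 100.0 * correct / total;
    }

    public static void main(String[] args) {
        String[] categories = {"selection", "single condition",
                "multiple condition", "aggregate function"};
        int[] correct = {30, 28, 25, 26};
        int perCategory = 30;
        int totalCorrect = 0;
        for (int i = 0; i < categories.length; i++) {
            totalCorrect += correct[i];
            System.out.printf("%s: %.2f%%%n", categories[i], percent(correct[i], perCategory));
        }
        int totalQueries = perCategory * categories.length;
        System.out.printf("overall: %d/%d = %.2f%%%n",
                totalCorrect, totalQueries, percent(totalCorrect, totalQueries));
    }
}
```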
[Figure 5.13: chart of correct versus incorrect queries]
that questions can be raised about the reasons for the high performance of our NLIDB system; to address such questions, we describe the features that our system incorporated during its design and development, as well as the aspects that contributed to its performance.
One similarity with previous NLIDB systems by other developers, such as Himani Jain [31], Khalel Al-Rabbah and Safwan Shantnawi [31], Faraj et al. [32] and many others, is that it provides a user interface to a database system in a natural language, which is Amharic in our case. In addition, it is intended to address the difficulties of database users who have no knowledge of English; it enables such users to use the Amharic language for their queries and access to the database. Like other previously created NLIDBs, including the above ones, our NLIDB system also targets users who have no prior expert knowledge or skill in operating a database; in our case, any novice user can operate the database system whenever they need to obtain the data they require.
Another similarity is that, like the NLIDB system developed by Khaleel AlRabalah and Safwan Shatnawi [31], who proposed an Arabic Natural Language Interface to Database (ANLIDB), for the design of our ANLIDB system we developed and implemented an algorithm that maps and extracts phrases from a natural language query, phrased and entered in Amharic, and constructs and executes queries in SQL form. In addition, we also used a relational database, like other previous systems.
As for the features that make our ANLIDB system different from previous systems, it incorporates unique features and aspects that gave rise to its high performance as well as its simplicity and user-friendly nature. These structural and linguistic aspects are discussed as follows.
To begin with, one is the syntactic and semantic features and the structural complexity of the Amharic language, the natural language used in developing our NLIDB system, compared to other natural languages such as English, Arabic, Hindi, Urdu and others on which previous NLIDB systems were developed. In the design of their database systems, the various authors created algorithms that let the database understand queries in the natural language and convert them into SQL queries. For the ANLIDB system we designed, we likewise developed an algorithm that is specific to, and only works for, the natural language of our database system, which is Amharic.
For instance, Himani Jain [30], in developing a Hindi Language Interface to Database, did not fully investigate the structure of the natural language of the system, Hindi, and did not create an efficient algorithm for the design of the NLIDB. As a result, the HNILDB developed cannot handle complex forms of queries such as multiple conditions, aggregate functions, and grouping and ordering. It can execute only list and single conditional query types. This is not a limitation of this particular system only; other prior NLIDB systems such as UNLIDB (Urdu Natural Language Interface for Database) and PUNLIDB (Punjabi Natural Language Interface for Data Base), built by Rashid Ahmad et al. [36] and Amardeep Kaur [1] respectively, also did not form algorithms for their respective natural language databases; hence their systems could only execute simple query forms. Unlike such previous NLIDB systems, we tried to overcome these limitations by developing an algorithm for the natural language, Amharic, that did not exist previously. This reduced the difficulty that our database could have faced in understanding natural language queries in Amharic sentences as a result of character variations, punctuation and other lexical and semantic variations in the structure of the language. This makes it simple for our database system to communicate with the database user without language barriers. In addition, the algorithm we developed also strengthened our NLIDB system's ability to generate and execute queries of varying complexity; it can respond to, convert into SQL form, and generate data for query types that fall under the categories of list query, single conditional, multiple conditional, and aggregate function, grouping and ordering.
CHAPTER SIX
Conclusions, Recommendation and Future Works
6.1. Conclusions
In line with its aim and objective, this study developed an Amharic language interface to a database system. The system targets users who have no knowledge or skill of database systems and who do not have a good command of the English language. To address the difficulties of such users, we designed and developed a user interface to a database system which enables users to pose their queries in a natural language, Amharic in our case, and retrieve data from the database accordingly. To evaluate the system we measured its performance using an academic employee database, and analyzed the validity and reliability of the NLIDB system we created.
Generally, our study revealed the following findings, and the conclusions based on them are stated as follows:
Evaluation of the proposed system revealed that novice users can operate the system without difficulty. This is because the system recognizes various characters and can understand and execute user queries with multiple forms of expression and structure.
The system performance for selection queries was found to be 100% accurate. Hence, it is possible to conclude that the system has full functionality for selection (list) queries.
Despite a small decrease of 6.67%, the system is found to be highly operational, with 93.33% accuracy, for single conditional queries.
Apart from a noticeable drop in performance compared to the previous two
forms of queries, our system has shown strong performance in generating queries with
aggregate functions as requested by the user, with 86.7% accuracy.
Compared to the other three query forms, multiple conditional queries had the
lowest accuracy score in our performance evaluation, 83.33%. The researchers
attribute this decrease in the performance of the NLIDB to query complexity: as queries grow
more complex and request data from several columns, the system may not
process them exactly as the user intends when the column names are not specified in the query.
The overall analysis of the system revealed that it is highly
effective in generating and executing user queries of all the categories evaluated. Its
overall accuracy, the share of queries answered correctly, was found to be 91%. However, this also
indicates the need for future improvement to ensure the system's full functionality.
The ANLIDB system we have created was noted for its strengths: it properly
executes user queries of various types, it is simple, easy to use, and user friendly,
and it reduces the difficulties users previously faced due to lack
of SQL knowledge and of a good command of the English language. This can be attributed to
the features incorporated into it, including the systematic
algorithm developed for its natural language, Amharic.
However, this does not mean that our NLIDB system, or the way it is designed and
developed, is without drawbacks or limitations. The first relates to
data manipulation: the system only lets users SELECT data and does not
enable them to DELETE, UPDATE, or INSERT data in the database.
Another limitation is that whenever users want to access data belonging to a
particular column, they are expected to specify the column name, as the system will not
understand and comply with their queries unless they do so. It has no
feature that lets it infer the intended column from a value alone when the column
name is not stated in the query. For instance, in a query that contains the Amharic word
for "male", that word indicates that the sex is male, but our system ignores it unless the
query also names the corresponding column explicitly.
Thirdly, the system has difficulty understanding date values expressed with general
keywords, such as the Amharic words for "tomorrow", "yesterday", and "September", or
phrases such as "on the 12th of September". This is because the researchers could not find any
previously developed parser, part-of-speech tagger, or similar language
tools for the Amharic language; as a result, they could not incorporate pattern matching and
similarity checking of date values into the system. Hence, users are
expected to provide the complete value exactly as it is stored in the corresponding column
of the database. In addition, our system cannot understand fuzzy questions that use words
such as "much", "small", "bad", "good", and the like.
Lastly, our system is domain-dependent: it was built for an employee
database, and hence it cannot be used with data outside the domain for which it was originally
created.
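The overall accuracy listed among the findings above can be checked against the four per-category figures. The following is a minimal sketch, assuming every query category was tested with the same number of queries so that the overall accuracy equals the unweighted mean of the category accuracies; the test-set sizes are not stated in this chapter, and the dictionary keys and function name are illustrative only.

```python
# Per-category accuracies (in percent) as reported in the conclusions.
reported = {
    "list": 100.0,
    "single_conditional": 93.33,
    "aggregate": 86.7,
    "multiple_conditional": 83.33,
}


def overall_accuracy(per_category):
    """Unweighted mean of the category accuracies.

    Assumption: each category has the same number of test queries;
    otherwise a weighted mean would be required.
    """
    return sum(per_category.values()) / len(per_category)


overall = overall_accuracy(reported)
print(f"{overall:.2f}%")  # 90.84%, which rounds to the reported 91%
```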
We also encourage other researchers working in the same track to develop parsers, POS taggers, and
stemmers for the Amharic language. Such tools would ease the task of creating new and more
capable algorithms than those produced by the current researchers, making it
possible to reduce limitations of the kind our NLIDB system has, namely the
communication barriers with users that result from its inability to understand
multiple and more complex values in natural language queries.
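One limitation noted in the conclusions is that a value word such as the word for "male" is ignored unless the query also names the column. The sketch below illustrates one possible direction for future work, not the method used in this study: a small value lexicon that maps a value word to the column it implies. The lexicon entries, the table and column names, the stored codes "M" and "F", and the function names are all hypothetical, and English glosses stand in for the Amharic tokens a real system would use.

```python
from typing import Dict, List, Tuple

# Hypothetical value lexicon: value word -> (implied column, stored value).
# English glosses stand in for Amharic tokens; names are illustrative only.
VALUE_LEXICON: Dict[str, Tuple[str, str]] = {
    "male": ("sex", "M"),
    "female": ("sex", "F"),
}


def infer_conditions(tokens: List[str]) -> List[Tuple[str, str]]:
    """Return the (column, value) pairs implied by value words in the query."""
    return [VALUE_LEXICON[token] for token in tokens if token in VALUE_LEXICON]


def build_select(table: str, tokens: List[str]) -> Tuple[str, List[str]]:
    """Build a parameterized SELECT statement from the inferred conditions."""
    conditions = infer_conditions(tokens)
    sql = f"SELECT * FROM {table}"
    if conditions:
        where = " AND ".join(f"{column} = ?" for column, _ in conditions)
        sql += f" WHERE {where}"
    params = [value for _, value in conditions]
    return sql, params


sql, params = build_select("employee", ["list", "male", "employees"])
print(sql, params)  # SELECT * FROM employee WHERE sex = ? ['M']
```

Keeping the values as parameters rather than embedding them in the SQL string means the generated statement can be passed to a database driver without quoting the user's words by hand.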
REFERENCE
[1]
[2]
[3]
[4]
[5]
[6]
[7] J. Patel and J. Dave, A Survey: Natural Language Interface to Databases, Int. J. Adv.
Eng. Res. Dev., 2015.
[8]
[9]
[71] M. Tessema, Design and implementation of Amharic Search Engine, Addis Ababa
University, 2007.
[72] S. Yimam, Amharic Question Answering For Factoid Questions, Addis Ababa
University, 2009.
[73] M. Wordofa, Semantic Indexing and Document Clustering for
Amharic Information Retrieval, Addis Ababa University, 2013.
[74] S. Arora, K. Batra, and S. Singh, " Dialogue System: A Brief Review",
[75]
Appendix
Appendix I: The Amharic character set [58]
[Table: the Amharic characters arranged by order (1st to 7th), followed by the Amharic numerals for 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100.]