Proceedings of the International Technology Management Conference, Antalya, Turkey, 2015
Concept Extraction from Arabic Text Based On Semantic Analysis
Hassan Najadat1, Ahmad Rawashdeh2
Computer Information Systems Department1, Department of Electrical and Computer Engineering2
Jordan University of Science and Technology, Jordan1, University of Cincinnati, USA2
najadat@just.edu.jo1, rawashay@mail.uc.edu2
ABSTRACT
Concept extraction can help in building ontologies, which are the main components of the semantic Web. Ontologies are not only used in the semantic Web but also in other fields, such as Information Retrieval, where they improve retrieval. In this work, an Automatic Concept Extractor, which processes Arabic text, is presented. The algorithm of the Automatic Concept Extractor tags the words in the text, finds the pattern of each noun, and outputs only those nouns whose patterns match one of the concept patterns in the concept extraction rules. The result of each rule was evaluated individually to find the rules with the highest precision. Two datasets were crawled from the Web and converted to XML, and each rule was tested twice, once with each dataset as input. The average precision of the rules showed that the rules with the patterns "Tafe'el" (تفعيل) and "Fe'aleh" (فعالة) achieved high precision.
KEYWORDS
Natural Language Processing, Ontology Web Language, Semantic Analysis, Term Frequency, Arabic Text.
1. INTRODUCTION
It has been a dream to have computers that understand all of the digital data, and to give computers the same ability that allows humans to understand, summarize, infer, and answer questions [1]. This ability can be achieved if machines read and understand in the same way humans do. Human reading involves representing newly acquired data as knowledge and adding this newly acquired knowledge to the knowledge already stored. This knowledge representation is the reason humans are capable of tremendous tasks after reading a passage, a capability that machines lack. Machines, on the other hand, are crippled by the chains of insensible syntax, shut out of the incredibly vast space of semantics that humans enjoy exploring. It is very appealing to represent the data stored in large data sources such as the Web as knowledge, in order to gain the capabilities explained previously. This may not be attainable unless the importance and usage of knowledge representations are understood very well.
The first step in giving machines the ability to infer is identifying the concepts and relations. The machine should be able to differentiate among the words that represent entities in real life ("concepts"), the words that describe these concepts (properties, adjectives), and the words that relate these concepts ("relations").
The objective of this paper is to identify concepts automatically in Arabic text. This represents a step forward in Arabic semantic processing. The extracted list of concepts can be used on its own as a keyword list for indexing documents in an Information Retrieval system, for instance, or it can be used along with extracted relations to allow machines to draw new conclusions.
The problem can be addressed as follows: given an Arabic text (T), find the words that are Candidate Concepts (CC). Concepts cannot be verbs, adjectives, or particles; they can only be nouns. Therefore, the first problem is identifying the nouns. The second problem is distinguishing the nouns that represent concepts from the nouns that represent anything other than concepts, such as names and properties.
2. RELATED WORKS
Concepts can be extracted based on statistical and syntactical analysis. Term Frequency (TF), Inverse Document Frequency (IDF), and entropy are all statistical measurements that can be used to extract a draft set of concepts; a combination of TF and IDF can also be used. Although statistical measures extract words that tend to be indices more than concepts, they contain the seed of what can be considered the full set of concepts. These seed terms become usable especially after removing the very high- and low-frequency items, known as empty terms. This method was applied to structured data of law codes [2]. In this research, a preprocessing step that converts unstructured Arabic web documents to a more structured format is required.
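To make the statistical route concrete, the following is a minimal Python sketch of this kind of TF-IDF seed-term extraction with empty-term removal. The tokenization, thresholds, and sample corpus are illustrative assumptions, not the setup of [2]:

import math
from collections import Counter

def tfidf_seed_terms(docs, min_df=2, max_df_ratio=0.7):
    # docs: list of token lists. Terms with very low or very high
    # document frequency are treated as "empty terms" and removed.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = Counter()
    for doc in docs:
        tf = Counter(doc)
        for term, count in tf.items():
            if df[term] < min_df or df[term] / n > max_df_ratio:
                continue  # drop empty terms
            # accumulate TF * IDF over the corpus
            scores[term] += (count / len(doc)) * math.log(n / df[term])
    return [term for term, _ in scores.most_common()]

# Illustrative use with whitespace-tokenized Arabic sentences:
docs = [s.split() for s in [
    "تخصيب اليورانيوم في المفاعل",
    "تنشيط الاقتصاد و تخصيب اليورانيوم",
    "مباراة في كرة القدم",
    "ارتفاع في أسعار النفط"]]
print(tfidf_seed_terms(docs)[:5])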
In the medical field, automatic ontology creation assists physicians in classifying the various postoperative complications. It saves them the effort of doing this manually with thesauri for coding activities, a manual procedure that is not accurate and can lead to difficulties. Using a syntax tool [3], an ontology of the "is-a" relationship was extracted from Patient Discharge Summary (PDS) reports [4].
Machine learning can be used to train systems to extract the proper ontology elements. For instance, the system described by Craven et al. [5] was given an ontology describing the relevant ontology elements to extract. It was also provided with a collection of documents labeled with these ontology elements, representing instances of the ontology. That allowed the system to find a way to map the desired sections of the documents to the desired ontology elements.
It has been claimed that the motivation "everything is on the Web, but we just cannot find what we need" is not true or accurate. This statement has allegedly been disproved by a study of the amount of information available when searching for accommodations, where availability information was found only 21 percent of the time, and only 7 out of 16 categories of room-feature details were found to be covered [6]. However, there are some arguments against the study. First, the sample was not representative: studying only a collection of tourism websites is not sufficient to support such a general conclusion. Second, the statement "everything is on the Web, but we just cannot find what we need" means there is a large amount of data on the Web, not that all data is there. Finally, it is not about how much is there but rather how many types or categories of data, since there is hope of having more information as people pay more attention to such issues.
There is a difference between discovering semantic structures and discovering semantic knowledge. One example of semantic structures is the set of menu items under "News": international, local, economics, sport, and science. These represent taxonomies of structures but not knowledge; no fact can be asserted as knowledge from which anything can be concluded, and even semantic structure extraction can be done automatically [7].
Liu and Zhai [8] developed an instance-based learner to extract data. It learns the structures of the items of interest after the user labels some documents, which marks the learning phase. After that, for each new instance, the similarity is computed to determine to which label each item should belong. It can clearly be noticed that both the former and the latter systems process structure, not semantics. This means that no semantic language or knowledge representation is involved, which indicates the narrow scope of such applications.
In summary, semantic processing enables concluding, summarizing, paraphrasing, and question answering; therefore, the semantic Web is the future of the Web, as was stated by the inventor of the World Wide Web (WWW), Tim Berners-Lee [1]. It relies on ontologies, which are formal descriptions of concepts and relations; they are quite similar to Object-Oriented Programming (OOP) concepts and UML diagrams. Ontologies are represented in ontology languages such as the Resource Description Framework (RDF) or the Ontology Web Language (OWL), with the ability to convert between these languages. Automatic ontology creation can increase the speed at which the semantic Web is created. There are different approaches to the creation of ontologies, ranging from statistical, syntactical, template-based, and rule-based approaches to machine learning and combinations of them all. Natural language processing tools, which include taggers and stemmers, may be used to extract the concepts and relations. The domain, the language, and the purpose of the ontology may affect the choice of approach used for extracting and building the ontology.
Compared to the above work, the importance of this research lies in applying semantic Web techniques to the Arabic language domain and reporting that experience, which serves as a gateway to encouraging more research in this important field of study. In addition, it is one of the few works to provide sufficient details about the dataset used and to describe techniques that could be used to build future datasets for the local community.
3. DATASET CREATION AND STANDARDIZATION
This section explains how the datasets were created and standardized. The first two subsections describe the two datasets, Arabic Cable News Network (CNN) [9] and Al-Jazirah [10], respectively, and explain how each was converted to XML files.
The datasets available in Arabic are not nearly as good as those available in English. Finding a free Arabic dataset was difficult, given the lack of activity and openness in Arabic research as well as the limited resources. What distinguishes news data in general is the type of words included, which tend to belong to various fields and areas. This reflects the nature of the news, where more than one subject is covered, such as sport, business, politics, health, technology, and science.
3.1. Arabic CNN “News Archive”
Data set 1 represents the news from the
Arabic CNN website over the years 2004, 2005
and 2006. It was collected from Arabic CNN.
The files were processed to remove irrelevant
data such as index, sitemap and the Arabic
CNN homepage. The files were converted to
XML files. Where the set of attributes are: date,
stories, time, title, content, Paragraph, Image
description, and Image URL.
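As an illustration of the conversion step, below is a minimal Python sketch that serializes one already-scraped story into an XML record carrying the attributes listed above. The element and attribute names are assumptions, since the paper does not publish its exact schema:

import xml.etree.ElementTree as ET

def story_to_xml(story):
    # Serialize one news story (a dict of already-extracted fields)
    # into an XML record. Tag names here are assumed, not the paper's.
    day = ET.Element("stories", date=story["date"])
    item = ET.SubElement(day, "story", time=story["time"])
    ET.SubElement(item, "title").text = story["title"]
    content = ET.SubElement(item, "content")
    for p in story["paragraphs"]:
        ET.SubElement(content, "paragraph").text = p
    img = ET.SubElement(item, "image", url=story["image_url"])
    img.set("description", story["image_description"])
    return ET.tostring(day, encoding="unicode")

print(story_to_xml({
    "date": "2004-01-15", "time": "10:30", "title": "عنوان الخبر",
    "paragraphs": ["الفقرة الأولى", "الفقرة الثانية"],
    "image_url": "http://example.com/img.jpg",
    "image_description": "وصف الصورة"}))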
Each of the resulting XML files represents the stories and news of a particular day and is named after the date of the news. The number of files decreased from 17,404 in the original unstructured dataset (HTML format) to 1,096 (XML format), and the number of folders decreased from 1,034 to one. The reduction comes from merging the different stories, related by day or subject and distributed across several files, into one file, and from removing the irrelevant files. Rather than having a folder for each subject or day of news, all the resulting XML files are stored in a single folder.
3.2. Al-Jazirah “Al-Jazirah Archive”
Data set 2 represents the news from the Al-Jazirah archive page. The original dataset, downloaded from the Al-Jazirah archive website, contains 87,080 files and 11,436 folders with a size of 464 MB. The converted dataset contains one folder and 10,611 files with a size of 284 MB. The conversion clearly results in a smaller number of files and folders, less space consumption and, most importantly, a more structured and standard dataset. Merging multiple files into single XML files accounts for the reduction.
Table 1 summarizes data sets one and two. It includes the size and number of files for each data set before and after cleaning and standardization.
Table 1: The datasets before and after cleaning and standardization

          Before cleaning/standardization    After cleaning/standardization
          Size       Files                   Size       Files
Data 1    87.4 MB    17,404                  65.6 MB    1,096
Data 2    595 MB     86,293                  284 MB     10,611
4. EXTRACTING CANDIDATE CONCEPTS
In this section, a description of the system used to extract and serialize the candidate concepts is provided. To start the system, the path of the dataset folder containing the XML files and the number of files to be processed must be provided through the Graphical User Interface (GUI).
There are three main methods, used to identify verbs, nouns, and special words. The IsNoun method determines whether a word is a noun or not; in Figure 1, its parameter "reason" is used for analytical purposes.
The following verse, taken from Al Masree's book [11], was used to help in tagging the nouns:

بِالْجَرِّ وَالتَّنْوِينِ وَالنِّدَا وَأَلْ وَمُسْنَدٍ لِلْاسْمِ تَمْيِيزٌ حَصَلْ

(A noun is distinguished by the genitive case, tanween, the vocative, the definite article "أل", and being predicated of.)
The function IsVerb is quite similar to IsNoun in general, but it differs in the type of markers used for tagging the word. It returns a Boolean value indicating whether the word is a verb or not, and its parameters are similar to the IsNoun parameters. Figure 2 provides the set of rules that extract the verbs.
The Is_Special_Word function checks whether the word belongs to any of the following groups: preposition (حرف جر), exception (استثناء), future (مستقبل), demonstrative (إشارة), relative (موصول), question (سؤال), temporal adverb (ظرف زمان), spatial adverb (ظرف مكان), kana and its sisters (كان وأخواتها), thanna and its sisters (ظن وأخواتها), the conditional form (الشرط), vocative (النداء), harf naseb of nouns and verbs (حرف نصب الأسماء والأفعال), and harf jazem (حرف جزم).
Name: IsNoun. Returns true if the word is a noun.
Input: string word, ref string reason, string previous
Output: true if word is a noun, otherwise false.
Steps:
1. If Is_Special_Word(word) Return False.
2. If word starts with “ال”, “وال”, “بال”, “كال”, or “فال” OR
   If word ends with “ة” OR
   If word ends with Kasra (كسرة) or Tanween (تنوين) OR
   If IsNounPattern(word, Stemmed(word)) OR
   If Nouns_Hashtable.contains(word) OR
   If Is_Harf_Neda(previous) OR
   If Is_Harf_Jar(previous) OR
   If IsVerb(previous) THEN Return True.
   Else Return False.

Figure 1: The function IsNoun
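To make the control flow of Figure 1 concrete, here is a minimal Python sketch. The particle lists are illustrative stand-ins for the system's hash tables, and the IsNounPattern and IsVerb(previous) checks of the figure are omitted for brevity:

# Illustrative stand-ins for the system's hash tables and particle lists.
SPECIAL_WORDS = {"في", "من", "إلى", "لم", "سوف"}
HURUF_JAR = {"في", "من", "إلى", "على", "عن"}       # prepositions
HURUF_NEDA = {"يا", "أيا"}                          # vocative particles
NOUN_PREFIXES = ("ال", "وال", "بال", "كال", "فال")
KASRA_OR_TANWEEN = ("\u0650", "\u064B", "\u064C", "\u064D")

def is_noun(word, previous, known_nouns=frozenset()):
    # Step 1 of Figure 1: special words are never nouns.
    if word in SPECIAL_WORDS:
        return False
    # Step 2: any one noun marker suffices.
    return (word.startswith(NOUN_PREFIXES)
            or word.endswith("ة")
            or word.endswith(KASRA_OR_TANWEEN)
            or word in known_nouns
            or previous in HURUF_NEDA    # noun after a vocative particle
            or previous in HURUF_JAR)    # noun after a preposition

print(is_noun("المدرسة", previous=""))   # True: definite-article prefix
print(is_noun("سوف", previous=""))       # False: special word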
Name: IsVerb. Returns true if the word is a verb.
Input: string word, ref string reason, string previous
Output: true if word is a verb, otherwise false.
Steps:
1. If Is_Special_Word(word) Return False.
2. If word ends with one of the verb suffixes … OR
   If IsVerbPattern(word) OR
   If Verbs_Hashtable.contains(word) OR
   If Is_Harf_Jazem(previous) OR
   If Is_Harf_Future(previous) THEN Return True.
   Else Return False.

Figure 2: The function IsVerb
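A matching Python sketch of Figure 2, under the same assumptions; since the verb-suffix list and IsVerbPattern are elided in the source, only the lookup and the context checks are shown:

SPECIAL_WORDS = {"في", "من", "إلى", "لم", "سوف"}   # as in the IsNoun sketch
HURUF_JAZM = {"لم", "لما"}                          # jussive particles
HURUF_FUTURE = {"سوف"}                              # future marker

def is_verb(word, previous, known_verbs=frozenset()):
    # Step 1 of Figure 2: special words are never verbs.
    if word in SPECIAL_WORDS:
        return False
    # Step 2: the suffix and pattern tests are elided in the source,
    # so only the lookup and the preceding-particle checks appear here.
    return (word in known_verbs
            or previous in HURUF_JAZM     # verb after a jussive particle
            or previous in HURUF_FUTURE)  # verb after a future marker

print(is_verb("يذهب", previous="لم"))    # True: preceded by a jussive particle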
Figure 3 provides the pseudocode of the is_Candidate_Concept function, which is used to extract the concepts from the nouns. The function returns true, false, or undetermined. The rules to be used can be specified in a text file. The rules in Figure 3 were used to extract candidate concepts, and their precisions were evaluated.
Name: is_Candidate_Concept. Returns the matched pattern in patt.
Input: string word, ref string reason, ref string patt
Output: true if the word is a candidate concept, false or undetermined otherwise.
Steps:
1. patt ← Word_Pattern(word)
2. If patt = مفعل || مفعلة || مفعال || مِفعل || مِفعلة || مِفعال || فعّال || فاعول THEN
   reason ← اسم آلة (tool name)
   Return true.
3. If patt = فاعل || مِفعل THEN
   reason ← اسم فاعل (agent name)
   Return true.
4. If patt = فعال || فعالة || فعل || فعيل || تفعيل THEN
   reason ← مصدر (infinitive)
   Return true.
5. If patt ends with ون || ين THEN
   reason ← جمع مذكر سالم (sound masculine plural)
   Return false.
6. If patt ends with ان THEN
   reason ← مثنى (dual)
   Return false.
7. If patt ends with ات THEN
   reason ← جمع مؤنث سالم (sound feminine plural)
   Return false.
8. Else
   reason ← no rule left
   Return undetermined.

Figure 3: The function is_Candidate_Concept
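The decision in Figure 3 amounts to a pattern lookup followed by suffix checks. A minimal Python sketch follows, where the input is assumed to be the template already produced by the Word_Pattern step:

TOOL_NAME = {"مفعل", "مفعلة", "مفعال", "فعّال", "فاعول"}   # اسم آلة
AGENT_NAME = {"فاعل", "مفعل"}                               # اسم فاعل
INFINITIVE = {"فعال", "فعالة", "فعيل", "تفعيل"}             # مصدر

def is_candidate_concept(patt):
    # patt is the template produced by the Word_Pattern step (assumed).
    if patt in TOOL_NAME or patt in AGENT_NAME or patt in INFINITIVE:
        return True                     # accepted noun pattern
    if patt.endswith(("ون", "ين")):
        return False                    # sound masculine plural
    if patt.endswith("ان"):
        return False                    # dual
    if patt.endswith("ات"):
        return False                    # sound feminine plural
    return None                         # undetermined: no rule left

print(is_candidate_concept("تفعيل"))    # True: infinitive pattern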
5. RESULTS AND CONCLUSION
Only one rule was used to generate concepts at a time, and the two datasets, Arabic CNN and Al-Jazirah, were used separately, in order to determine the rule with the highest precision and to give a clear picture of the comparative performance of the rules. The result of each rule was evaluated manually. Initially, all of the rules were applied together, but when evaluating the results it was difficult to discover the cause of the false positives and the rules responsible for them. Therefore, the results are described from a single-rule perspective, to help any complementary research decide which rules to consider.
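For reference, the per-rule precision reported below is the usual ratio of true positives to all extracted words. A trivial sketch, with illustrative counts only:

def precision(true_positives, false_positives):
    # Precision of one extraction rule over one dataset.
    return true_positives / (true_positives + false_positives)

# The counts below are illustrative, not the paper's raw numbers.
print(round(100 * precision(1429, 9), 2))   # 99.37, the upper end for "تفعيل"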
The rule “تفعيل” is a pattern of the infinitive (“مصدر”); it is the infinitive of the augmented verb “فعَّل”. This rule gives the highest precision due to its exclusive nature: there exists no other part of speech in Arabic, including the adjective, that shares this pattern with the infinitive. Table 2 shows the first seven rows of the evaluated result of the rule “تفعيل” for the Arabic CNN dataset only. The left column (“Concepts”) lists the true concepts, and the right column (“Not Concepts”) lists the wrongly classified words, which represent the false positives. As can be concluded from the sample result, these false positives are concepts, but with the syntactic suffix “اً” attached; in addition, one of the other false positives in the Al-Jazirah dataset result is a name. The non-concepts that appeared in the result are تدنيا and تعديا. Figure 4 shows the precision of the rule for both datasets, Al-Jazirah and Arabic CNN; the vertical axis is the precision, while the horizontal axis is the dataset. The precision of this rule ranges from 99.31 to 99.37 percent.
Table 2: Sample result of “تفعيل”

Concepts    Not Concepts
ت ين        تدنيا
تنشيط       تعديا
تجديد
تحصين
تخصيص
تخصيب
توجيه
توصيف
Figure 5: The precisions of all the rules
The rules, sorted in ascending order by their average precision, are “مفعلة مِفعلة”, “فاعول”, “مفعال مِفعال”, “فاعل مِفعل”, “مِفعل مفعل”, “فعّال”, “فعالة فعَّالة”, “فعال”, “فعلة”, “فعيل”, and “تفعيل”, as indicated in Figure 6.
Figure 4: The precision of “تفعيل”
Figure 5 provides a clear picture of the comparative performance of the rules. The bars with the dotted pattern represent the precision of the rules for the Al-Jazirah dataset, while the bars with the lined pattern are for the Arabic CNN dataset. The vertical axis is the precision and the horizontal axis is the rules.
Figure 6: Average precisions
It has been shown that automatically extracting candidate concepts from Arabic text is feasible; the high precision that some rules achieved, such as “تفعيل”, supports this finding. The rules for extracting the candidate concepts are based on Arabic patterns (morphological features); no initial set of concepts, terms, or ontology was used. The tested rules belong to the categories infinitive (“مصدر”), tool name (“اسم الآلة”), and agent name (“اسم الفاعل”). Some of the patterns used in the rules also match adjectives and names, which consequently reduces the precision of their rules. Thus, the precision can be increased by finding a method for removing the adjectives and names automatically, or by using only the patterns that cannot be patterns of adjectives; removing names can be achieved by storing a list of names in advance. In addition, automatic relation extraction needs to be done to achieve the ultimate goal of knowledge representation.
REFERENCES

[1] Berners-Lee, T., Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web, by Its Inventor. San Francisco: Harper, 1999; 157.
[2] Lame, G., Using NLP Techniques to Identify Legal Ontology Components: Concepts and Relations. Artificial Intelligence and Law, 2006; 12(4): 379-396.
[3] Fabre, C., Bourigault, D., Linguistic Clues for Corpus-Based Acquisition of Lexical Dependencies. In Proceedings of Corpus Linguistics, University Centre for Computer Corpus Research on Language (UCREL), Lancaster University, UK, 2001.
[4] Charlet, J., Bachimont, B., Jaulent, M., Building Medical Ontologies by Terminology Extraction from Texts: An Experiment for the Intensive Care Units. Computers in Biology and Medicine, 2006; 36(7-8): 857-870.
[5] Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., Slattery, S., Learning to Construct Knowledge Bases from the World Wide Web. Artificial Intelligence, 2000; 118(1-2): 69-113.
[6] Hepp, M., Semantic Web and Semantic Web Services. IEEE Internet Computing, 2006; 10(2): 85-88.
[7] Mukherjee, S., Guizhen, Y., Wenfang, T., Ramakrishnan, I., Automatic Discovery of Semantic Structures in HTML Documents. Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR), Paraná, Brazil, 2003; 1: 245.
[8] Liu, B., Zhai, Y., Extracting Web Data Using Instance-Based Learning. World Wide Web, 2005; 10(2): 113-132.
[9] Arabic CNN, http://www.newsarchiver.com/ (accessed 23 June 2008).
[10] Aljazirah Archive, http://www.aljazirah.com/aarchive.htm (accessed 23 June 2008).
[11] Al Masree, B., Ibn Ageel Explanation “Sharh Ibn Ageel”. Saida, Beirut: Modern Library, 1998; 2: 7.