
Concept Extraction from Arabic Text Based On Semantic Analysis

Hassan Najadat (Computer Information Systems Department, Jordan University of Science and Technology, Jordan, najadat@just.edu.jo)
Ahmad Rawashdeh (Department of Electrical and Computer Engineering, University of Cincinnati, USA, rawashay@mail.uc.edu)

ABSTRACT

Concept extraction can help in building ontologies, which are the main component of the semantic Web. Ontologies are used not only in the semantic Web but also in other fields, such as Information Retrieval, where they improve retrieval quality. In this work, an Automatic Concept Extractor that processes Arabic text is presented. The algorithm tags the words in the text, finds the pattern of each noun, and outputs only those nouns whose patterns match one of the patterns in the concept extraction rules. The result of each rule was evaluated individually to find the rules with the highest precision. Two datasets were crawled from the Web and converted to XML, and each rule was tested twice, once with each dataset as the input. The average precision of the rules showed that the rules with the patterns Tafe'el ("تفعيل") and Fe'aleh ("فعالة") achieved high precision.

KEYWORDS

Natural Language Processing, Ontology Web Language, Semantic Analysis, Term Frequency, Arabic Text.

1. INTRODUCTION

It has long been a dream to have computers that understand all of the digital data, giving machines the same ability that allows humans to understand, summarize, infer, and answer questions [1]. This ability can be achieved if machines read and understand the same way humans do. Human reading involves representing newly acquired data as knowledge and adding it to the knowledge already stored. This knowledge representation is the reason humans can perform remarkable tasks after reading a passage, a capability machines lack. Machines remain crippled by the chains of insensible syntax, relative to the incredibly vast space of semantics that humans enjoy exploring. It is very appealing to represent the data stored in large data sources, such as the Web, as knowledge in order to gain the capabilities explained above. This is unlikely to succeed unless the importance and usage of knowledge representations are understood well.

The first step in giving machines the ability to infer is identifying the concepts and relations. The machine should be able to differentiate among the words that represent real-life entities ("concepts"), the words that describe these concepts ("properties", adjectives), and the words that relate these concepts ("relations"). The objective of this paper is to identify concepts automatically in Arabic text, which represents a step forward in Arabic semantic processing. The extracted list of concepts can be used on its own as a keyword list for indexing documents in an Information Retrieval system, for instance, or together with extracted relations to allow machines to draw new conclusions.

The problem can be addressed as follows: given an Arabic text (T), find the words that are Candidate Concepts (CC). Concepts cannot be verbs, adjectives, or particles; they can only be nouns.
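The abstract's description of the extractor suggests a simple three-step pipeline: tag, derive the pattern, match against the rules. The following is a minimal Python sketch of that pipeline, not the authors' implementation; the helpers tag_words and word_pattern and the constant CONCEPT_PATTERNS are hypothetical stand-ins for components the paper does not show.

# Minimal sketch of the extraction pipeline described above.
# tag_words(), word_pattern(), and CONCEPT_PATTERNS are hypothetical.

CONCEPT_PATTERNS = {"تفعيل", "فعالة"}  # e.g. the two highest-precision patterns

def extract_candidate_concepts(text, tag_words, word_pattern):
    """Return nouns whose morphological pattern matches a concept rule."""
    candidates = []
    for word, tag in tag_words(text):      # step 1: tag every word
        if tag != "noun":                  # concepts can only be nouns
            continue
        pattern = word_pattern(word)       # step 2: find the noun's pattern
        if pattern in CONCEPT_PATTERNS:    # step 3: match against the rules
            candidates.append(word)
    return candidates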
Therefore, the first problem is identifying the nouns. The second problem is distinguishing the nouns that represent concepts from the nouns that represent anything else, such as names and properties.

2. RELATED WORKS

Concepts can be extracted based on statistical and syntactical analysis. Term Frequency (TF), Inverse Document Frequency (IDF), and entropy are all statistical measurements that can be used to extract a first draft of concepts; a combination of TF and IDF can also be used. Although statistical measures tend to extract words that are more like indices than concepts, they contain the seed of what can be considered the full set of concepts. These seed terms are usable especially after removing the very high- and very low-frequency items, known as empty terms. This method was applied to structured data of law codes [2]. In the present research, a preprocessing step of converting unstructured Arabic web documents to a more structured format is required.

In the medical field, automatic ontology creation assists physicians in classifying the various postoperative complications. It saves them the effort of coding activities manually with thesauri, a procedure that is inaccurate and can lead to difficulties. Using the syntax tool of [3], an ontology of the "is-a" relationship was extracted from Patient Discharge Summary (PDS) reports [4].

Machine learning can be used to train systems to extract the proper ontology elements. For instance, the system described by Craven et al. [5] was given an ontology describing the relevant elements to extract, along with a collection of documents labeled with the ontology elements they instantiate. This allowed the system to learn a mapping from the desired sections of the documents to the desired ontology elements.

It has been claimed that the motivation "everything is on the Web, but we just cannot find what we need" is not true or accurate. This statement has allegedly been disproved by a study of the information available when searching for accommodations, where availability information was found only 21 percent of the time and only 7 out of 16 categories of room-feature details were covered [6]. However, there are some arguments against the study. First, the sample was not representative: studying only a collection of tourism websites is not sufficient to support such a general conclusion. Second, the statement means there is a large amount of data on the Web, not that all data is there. Finally, it is not about how much is there but rather how many types or categories of data, since there is hope of having more information as people pay more attention to such issues.

There is a difference between discovering semantic structures and discovering semantic knowledge. One example of semantic structures is the set of menu items under "News": international, local, economics, sport, and science. These represent taxonomies of structures, but not knowledge: no fact can be asserted into a knowledge base to help conclude anything, and even the semantic structure extraction can be done automatically [7]. Liu and Zhai [8] developed an instance-based learner to extract data. It learns the structures of the items of interest from documents labeled by the user during the learning phase. After that, for each new instance the similarity is computed to determine to which label each item should belong.
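As a concrete illustration of the statistical approach mentioned at the start of this section, the sketch below ranks terms by TF-IDF and drops the very frequent and very rare "empty terms", keeping the remainder as seed concepts. It is a minimal Python illustration under assumed thresholds, not taken from any of the cited systems.

import math
from collections import Counter

def tf_idf_seed_terms(documents, min_df=2, max_df_ratio=0.5):
    """Rank terms by TF-IDF, skipping 'empty terms' (too rare or too common).
    documents: list of token lists. Returns terms sorted by best score."""
    n_docs = len(documents)
    df = Counter()                         # document frequency of each term
    for doc in documents:
        df.update(set(doc))
    scores = {}
    for doc in documents:
        tf = Counter(doc)
        for term, count in tf.items():
            if df[term] < min_df or df[term] > max_df_ratio * n_docs:
                continue                   # remove empty terms
            idf = math.log(n_docs / df[term])
            scores[term] = max(scores.get(term, 0.0), (count / len(doc)) * idf)
    return sorted(scores, key=scores.get, reverse=True)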
It can be clearly noticed that both of these systems process structure, not semantics: no semantic language or knowledge representation is involved, which indicates the narrow scope of such applications.

In summary, semantic processing enables concluding, summarizing, paraphrasing, and question answering; therefore, the semantic Web is the future of the Web, as stated by the inventor of the World Wide Web (WWW), Tim Berners-Lee [1]. It relies on ontologies, which are formal descriptions of concepts and relations, quite similar to Object-Oriented Programming (OOP) concepts and UML diagrams. Ontologies are represented in ontology languages such as the Resource Description Framework (RDF) or the Ontology Web Language (OWL), with the ability to convert between these languages. Automatic ontology creation can increase the speed at which the semantic Web is created. The approaches to creating ontologies range from statistical, syntactical, template-based, rule-based, and machine learning methods to combinations of all of these. Natural language processing tools, including taggers and stemmers, may be used to extract the concepts and relations. The domain, the language, and the purpose of the ontology may affect the choice of approach for extracting and building it.

Compared to the above, the importance of this research lies in applying semantic Web techniques to the Arabic language domain and reporting that experience, which serves as a gateway to encourage more research in this important field of study. In addition, it is one of the few works to provide sufficient detail about the dataset used and to describe techniques that could be used to build future datasets for the local community.

3. DATASET CREATION AND STANDARDIZATION

This section explains how the datasets were created and standardized. The first two subsections describe the two datasets, Arabic Cable News Network (CNN) [9] and Al-Jazirah [10], respectively. The last subsection explains how the datasets were converted to XML files.

The datasets available in Arabic are not nearly as good as those available in English. Finding a free Arabic dataset was difficult, given the lack of activity and openness in Arabic research as well as the limited resources. What distinguishes news data in general is that its words tend to belong to various fields and areas, reflecting the nature of news, where more than one subject is covered: sport, business, politics, health, technology, and science.

3.1. Arabic CNN "News Archive"

Dataset 1 represents the news from the Arabic CNN website over the years 2004, 2005, and 2006. The files were processed to remove irrelevant data such as the index, sitemap, and the Arabic CNN homepage, and were then converted to XML files with the attributes: date, stories, time, title, content, paragraph, image description, and image URL. Each resulting XML file represents the stories and news of a particular day, and each file is named to reflect the date of its news.
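As a hypothetical illustration of what one converted story might look like, the snippet below builds a day file with Python's standard library. The element and attribute names mirror the attribute list above but are otherwise an assumption, since the paper does not show its actual schema.

import xml.etree.ElementTree as ET

# Build one day's XML file from crawled stories (hypothetical schema
# based on the attribute list above: date, story, time, title, ...).
day = ET.Element("day", date="2005-03-14")
story = ET.SubElement(day, "story", time="09:30")
ET.SubElement(story, "title").text = "..."          # story headline
paragraph = ET.SubElement(story, "paragraph")       # one content paragraph
paragraph.text = "..."
image = ET.SubElement(story, "image", url="http://...")
image.set("description", "...")                     # image caption

ET.ElementTree(day).write("2005-03-14.xml", encoding="utf-8")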
The number of files decreased from 17,404 in the original unstructured dataset (HTML format) to 1,096 (XML format), and the number of folders decreased from 1,034 to one. The reduction comes from merging the different stories, related by day or subject and distributed across several files, into one file, and from removing the irrelevant files. Rather than keeping a folder for each subject or day of news, all stories in the XML files were stored within a single folder.

3.2. Al-Jazirah "Al-Jazirah Archive"

Dataset 2 represents the news from the Al-Jazirah archive page. The original dataset downloaded from the Al-Jazirah archive website contains 87,080 files and 11,436 folders, with a size of 464 MB. The converted dataset contains one folder and 10,611 files with a size of 284 MB. The conversion clearly results in fewer files and folders, less space consumption, and, most importantly, a more structured and standard dataset. Merging multiple files into single XML files accounts for the reduction. Table 1 summarizes datasets 1 and 2, with the size and number of files of each dataset before and after cleaning and standardization.

Table 1: The datasets before and after cleaning and standardization

           Before clean/standardization | After clean/standardization
           Size      Files              | Size      Files
  Data 1   87.4 MB   17,404             | 65.6 MB   1,096
  Data 2   595 MB    86,293             | 284 MB    10,611

4. EXTRACTING CANDIDATE CONCEPTS

This section describes the system used to serialize the extracted candidate concepts. To start the system, the path of the dataset folder containing the XML files and the number of files to be processed must be provided; this is the system as seen from the highest level of view, the Graphical User Interface (GUI).

Three main methods are used to identify verbs, nouns, and special words. The IsNoun method determines whether a word is a noun. In Figure 1, the parameter "reason" is used for analytical purposes. The following verse, taken from Al Masree's book [11], was used to help in tagging nouns:

بِالجَرِّ والتَّنوينِ والنِّدا وَأَل وَمُسنَدٍ لِلاسمِ تَمييزٌ حَصَل

The function IsVerb is broadly similar to IsNoun but differs in the type of markers used for tagging the word. It returns a Boolean value indicating whether the word is a verb, and its parameters are similar to those of IsNoun. Figure 2 provides the set of rules that extract verbs.

The Is_Special_Word function checks whether the word belongs to any of the following groups: preposition (حرف جر), exception (استثناء), future (مستقبل), demonstrative (إشارة), relative (موصول), question (سؤال), temporal adverb (ظرف زمان), spatial adverb (ظرف مكان), kana and its sisters (كان وأخواتها), thanna and its sisters (ظن وأخواتها), conditional (الشرط), vocative (النداء), harf naseb (حرف نصب الأسماء والأفعال), and harf jazem (حرف جزم).

Name: IsNoun. Returns true if the word is a noun.
Input: string word, ref string reason, string previous
Output: true if word is a noun, otherwise false.
Steps:
1. If Is_Special_Word(word) Return False.
2. If word starts with "ال", "وال", "بال", "كال", or "فال" OR
   If word ends with "ة" OR
   If word ends with Kasra (كسرة) or Tanween (تنوين) OR
   If IsNounPattern(word, Stemmed(word)) OR
   If Nouns_Hashtable.contains(word) OR
   If Is_Harf_Neda(previous) OR
   If Is_Harf_Jar(previous) OR
   If IsVerb(previous)
   THEN Return True. Else Return False.

Figure 1: The function IsNoun
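A runnable Python rendering of the heuristics in Figure 1 might look as follows. The lookup tables and the helper functions passed in are hypothetical stubs, since the paper does not list their contents; it is a sketch of the figure, not the authors' code.

# Sketch of Figure 1's noun test; the tables below are stubs.
NOUN_PREFIXES = ("ال", "وال", "بال", "كال", "فال")
KNOWN_NOUNS = set()                            # stub for Nouns_Hashtable
HURUF_NEDA = {"يا"}                            # vocative particles (stub)
HURUF_JAR = {"في", "من", "إلى", "على", "عن"}   # prepositions (stub)

def is_noun(word, previous, is_special_word, is_noun_pattern, is_verb, stem):
    """Return True if any of the noun markers from Figure 1 fires."""
    if is_special_word(word):
        return False
    return (word.startswith(NOUN_PREFIXES)
            or word.endswith("ة")
            # kasra or tanween diacritics on the last letter
            or word.endswith(("\u0650", "\u064B", "\u064C", "\u064D"))
            or is_noun_pattern(word, stem(word))
            or word in KNOWN_NOUNS
            or previous in HURUF_NEDA          # preceded by vocative particle
            or previous in HURUF_JAR           # preceded by preposition
            or is_verb(previous))              # noun often follows a verb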
Name: IsVerb. Returns true if the word is a verb.
Input: string word, ref string reason, string previous
Output: true if word is a verb, otherwise false.
Steps:
1. If Is_Special_Word(word) Return False.
2. If word ends with one of the verb suffixes … OR
   If IsVerbPattern(word) OR
   If Verbs_Hashtable.contains(word) OR
   If Is_Harf_Jazem(previous) OR
   If Is_Harf_Future(previous)
   THEN Return True. Else Return False.

Figure 2: The function IsVerb

Figure 3 provides the code of the is_Candidate_Concept function, which extracts the concepts from the nouns. The function returns true, false, or undetermined. The rules to be used can be selected through a text file. The rules in Figure 3 were used to extract candidate concepts, and their precisions were evaluated.

Name: is_Candidate_Concept. Returns the pattern matched as a string.
Input: string word, ref string reason, ref string patt
Output: true if the word is a candidate concept.
Steps:
1. patt ← Word_Pattern(word)
2. If patt is one of the tool-name patterns (مفعل، مفعال، مفعلة، فعّالة، فاعول، …)
   reason ← اسم آلة (tool name); Return true.
3. If patt = فاعل or مُفعِل
   reason ← اسم فاعل (agent name); Return true.
4. If patt is one of the infinitive patterns (فعالة، فعال، فعيل، فعلة، …)
   reason ← مصدر (infinitive); Return true.
5. If patt ends with ين or ون
   reason ← جمع مذكر سالم (sound masculine plural); Return false.
6. If patt ends with ان
   reason ← مثنى (dual); Return false.
7. If patt ends with ات
   reason ← جمع مؤنث سالم (sound feminine plural); Return false.
8. Else reason ← no rule left; Return undetermined.

Figure 3: The function is_Candidate_Concept
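Pattern matching of this kind can be sketched by mapping a stem's root letters onto the template letters ف/ع/ل. The helper below is an illustrative assumption only: real Arabic morphological analysis (weak radicals, gemination, infixes) needs a proper analyzer and is considerably harder than this.

# Illustrative only: test whether a word instantiates a template such as
# "تفعيل" by substituting a 3-letter root for the ف/ع/ل slots.
def matches_pattern(word, template, root):
    """True if mapping the root onto the template reproduces the word,
    e.g. root د-ر-س in template تفعيل yields تدريس."""
    if len(root) != 3:
        return False
    mapping = dict(zip("فعل", root))
    rendered = "".join(mapping.get(ch, ch) for ch in template)
    return rendered == word

# e.g. matches_pattern("تدريس", "تفعيل", "درس") -> True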
5. RESULTS AND CONCLUSION

Only one rule was used to generate concepts at a time, and the two datasets, Arabic CNN and Al-Jazirah, were used separately, in order to determine the rule with the highest precision and to give a clear comparative picture of the rules' performance. The result of each rule was evaluated manually. Initially, all of the rules were applied together, but while evaluating the results it was difficult to discover the cause of the false positives and which rules produced them. Therefore, the results are reported from a single-rule perspective, to help any complementary research decide which rules to adopt.

The rule "تفعيل" is a pattern of the infinitive ("مصدر"); it is the infinitive of the augmented verb "فعَّل". This rule gives the highest precision due to its exclusive nature: no other part of speech in Arabic, including the adjective, shares this pattern with the infinitive. Table 2 shows the first rows of the evaluated result of the rule "تفعيل" for the Arabic CNN dataset only. The left column, "Concepts", holds the true concepts, and the right column, "Not Concepts", holds the wrongly classified words, the false positives. As the sample shows, the false positives are concepts attached to the syntactical suffix "ا"; another false positive in the Al-Jazirah result is a name. The non-concepts that appeared in the result are تدنيا and تعديا.

Table 2: Sample result of "تفعيل"

  Concepts   Not Concepts
  ت ين       تدنيا
  تنشيط      تعديا
  تجديد
  تحصين
  تخصيص
  تخصيب
  توجيه
  توصيف

Figure 4 shows the precision of the rule for both datasets, Al-Jazirah and Arabic CNN; the vertical axis is the precision and the horizontal axis is the dataset. The precision of this rule ranges from 99.31 to 99.37.

Figure 4: The precision of "تفعيل"

Figure 5 provides a clear comparative picture of the rules' performance. The bars with the dotted pattern represent the precision of the rules on the Al-Jazirah dataset, while the bars with the lined pattern are for the Arabic CNN dataset; the vertical axis is the precision and the horizontal axis is the rules.

Figure 5: All rules' precisions

Sorted in ascending order of average precision, the rules are «فاعول», «مفعلة/مِفعلة», «مفعال/مِفعال», «فاعل/مُفعل», «مِفعل/مفعل», «فعال», «فعالة/فعَّالة», «فعلة», «فعيل», and «تفعيل», as indicated in Figure 6.

Figure 6: Average precisions

It has been shown that automatically extracting candidate concepts from Arabic text is feasible; the high precision achieved by some rules, such as "تفعيل", supports this finding. The rules for extracting the candidate concepts are based on Arabic patterns (morphological features); no initial set of concepts, terms, or ontology was used. The rules tested belong to the categories infinitive ("مصدر"), tool name ("اسم الآلة"), and agent name ("اسم الفاعل"). Some of the patterns used in the rules also match adjectives and names, which reduces the precision of those rules. Thus, precision can be increased by removing adjectives and names automatically, or by using only the patterns that cannot be adjective patterns; removing names can be achieved by storing a list of names in advance. In addition, automatic relation extraction needs to be done to achieve the ultimate goal of knowledge representation.
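The manual evaluation above amounts to computing a per-rule precision: the fraction of the words a rule outputs that are true concepts. A small sketch, assuming hypothetical inputs:

def rule_precision(extracted, true_concepts):
    """Precision of one extraction rule as a percentage: TP / (TP + FP).
    extracted: words the rule output; true_concepts: manually judged set."""
    if not extracted:
        return 0.0
    tp = sum(1 for w in extracted if w in true_concepts)
    return 100.0 * tp / len(extracted)

# e.g. a rule outputting 1000 words of which 994 are concepts scores 99.4,
# in line with the 99.31-99.37 range reported for "تفعيل".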
REFERENCES

[1] Berners-Lee, T., Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web, by Its Inventor. San Francisco: Harper, 1999; 157.
[2] Lame, G., Using NLP Techniques to Identify Legal Ontology Components: Concepts and Relations. Artificial Intelligence and Law, 2006; 12(4): 379-396.
[3] Fabre, C., Bourigault, D., Linguistic clues for corpus-based acquisition of lexical dependencies. In Proceedings of Corpus Linguistics, University Centre for Computer Corpus Research on Language (UCREL), Lancaster University, UK, 2001.
[4] Charlet, J., Bachimont, B., Jaulent, M., Building Medical Ontologies by Terminology Extraction from Texts: An experiment for the intensive care units. Computers in Biology and Medicine, 2006; 36(7-8): 857-870.
[5] Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., Slattery, S., Learning to Construct Knowledge Bases from the World Wide Web. Artificial Intelligence, 2000; 118(1-2): 69-113.
[6] Hepp, M., Semantic Web and Semantic Web Services. IEEE Internet Computing, 2006; 10(2): 85-88.
[7] Mukherjee, S., Guizhen, Y., Wenfang, T., Ramakrishnan, I., Automatic Discovery of Semantic Structures in HTML Documents. In Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR), Paraná, Brazil, 2003; 1: 245.
[8] Liu, B., Zhai, Y., Extracting Web Data Using Instance-Based Learning. World Wide Web, 2005; 10(2): 113-132.
[9] Arabic CNN, http://www.newsarchiver.com/ (accessed 23 June 2008).
[10] Aljazirah Archive, http://www.aljazirah.com/aarchive.htm (accessed 23 June 2008).
[11] Al Masree, B., Ibn Ageel Explanation "Sharh Ibn Ageel". Saida, Beirut: Modern Library, 1998; 2: 7.