Ontology Handbook
Ontology Learning
Alexander Maedche¹ and Steffen Staab²

¹ FZI Research Center for Information Technologies, University of Karlsruhe, Germany
  email: maedche@fzi.de
² Institute AIFB, University of Karlsruhe, Germany
  email: sst@aifb.uni-karlsruhe.de
1.1 Introduction
and for the algorithm library. Comprehensive user interfaces are provided
to the ontology engineer to help select relevant data, apply processing and
transformation techniques or start a specific extraction mechanism. Data
processing can also be triggered by the selection of an ontology learning
algorithm that requires a specific representation. Results are merged us-
ing the result set structure and presented to the ontology engineer with
different views of the ontology structures.
extraction or clustering. Hence, we may reuse algorithms from the library for
acquiring different parts of the ontology definition.
Subsequently, we introduce some of these algorithms available in our im-
plementation. In general, we use a multi-strategy learning and result combi-
nation approach, i.e. each algorithm that is plugged into the library generates
normalized results that adhere to the ontology structures sketched above and
that may be combined into a coherent ontology definition. Several algorithms
are described in more detail in the following section.
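The multi-strategy combination step can be sketched as follows. This is a minimal, hypothetical illustration, not the actual result set structure of the implementation: each algorithm is assumed to emit proposals (e.g. scored concept pairs) normalized to [0, 1], and proposals from all strategies are merged by averaging.

```python
from collections import defaultdict

def combine_results(result_sets):
    """result_sets: list of dicts, one per learning algorithm, each
    mapping a proposal (e.g. a concept pair) to a normalized score
    in [0, 1]. Returns the merged proposals with the average score
    over all strategies that produced them."""
    scores = defaultdict(list)
    for results in result_sets:
        for proposal, score in results.items():
            scores[proposal].append(score)
    return {p: sum(s) / len(s) for p, s in scores.items()}
```

A proposal suggested by several algorithms thus keeps a single, averaged score, which is one simple way to present a coherent combined result to the ontology engineer.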
Definition 1.3.1. Let lef_{l,d} be the lexical entry frequency of the lexical entry
l in a document d. Let df_l be the overall document frequency of lexical entry
l. Then tfidf_{l,d} of the lexical entry l for the document d is given by:

    tfidf_{l,d} = lef_{l,d} · log(|D| / df_l).    (1.1)
Tfidf weighs the frequency of a lexical entry in a document with a factor
that discounts its importance when it appears in almost all documents.
Therefore, terms that appear too rarely or too frequently are ranked lower
than terms that hold the balance. A final step has to be carried out on
the computed tfidf_{l,d}: a list of all lexical entries contained in at least
one of the documents from the corpus D, excluding lexical entries that appear
in a standard list of stopwords, is produced. The tfidf values for lexical
entries l are computed as follows:
Definition 1.3.2.

    tfidf_l := Σ_{d∈D} tfidf_{l,d},  tfidf_l ∈ IR.    (1.2)
The user may define a threshold k ∈ IR+ that tfidfl has to exceed. The
lexical entry approach has been evaluated in detail (e.g. for varying k and
different selection strategies).
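The two definitions can be sketched directly in code. This is a minimal illustration of Equations (1.1) and (1.2), not the actual implementation; the stopword list and the example corpus are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stopword list; a real system would use a standard one.
STOPWORDS = {"the", "a", "of", "and", "in", "is"}

def tfidf_per_document(docs):
    """docs: list of token lists, one per document.
    Returns {(lexical_entry, doc_index): tfidf} following Eq. (1.1):
    tfidf_{l,d} = lef_{l,d} * log(|D| / df_l)."""
    n_docs = len(docs)
    # Document frequency df_l: number of documents containing l.
    df = Counter()
    for doc in docs:
        for l in set(doc):
            if l not in STOPWORDS:
                df[l] += 1
    scores = {}
    for d, doc in enumerate(docs):
        lef = Counter(t for t in doc if t not in STOPWORDS)
        for l, freq in lef.items():
            scores[(l, d)] = freq * math.log(n_docs / df[l])
    return scores

def tfidf_total(docs, k=0.0):
    """Sum tfidf_{l,d} over all documents (Eq. 1.2) and keep only
    lexical entries whose total exceeds the threshold k."""
    total = Counter()
    for (l, _d), v in tfidf_per_document(docs).items():
        total[l] += v
    return {l: v for l, v in total.items() if v > k}
```

Raising k keeps fewer, more corpus-specific lexical entries, which matches the role of the user-defined threshold described above.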
    ... NP {, NP}* {,} or other NP ...
When we apply this pattern to a sentence, it can be inferred that the
NPs referring to concepts to the left of "or other" are subconcepts of the NP
referring to a concept on the right. For example, from the sentence
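A minimal sketch of how such a lexico-syntactic pattern can be matched is given below. All names are hypothetical, and the sketch assumes noun phrases have already been chunked and normalized to single tokens; a real system works on parsed text.

```python
import re

# Matches "NP1, NP2, ..., or other NP0" and yields (hyponym, hypernym)
# pairs, i.e. every NP left of "or other" is a subconcept of the NP
# on the right.
PATTERN = re.compile(r"(\w+(?:\s*,\s*\w+)*),?\s+or other\s+(\w+)")

def hearst_or_other(sentence):
    """Extract (subconcept, superconcept) pairs from one sentence."""
    pairs = []
    for m in PATTERN.finditer(sentence):
        hypernym = m.group(2)
        for hyponym in re.split(r"\s*,\s*", m.group(1)):
            pairs.append((hyponym, hypernym))
    return pairs
```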
Association rules have been established in the area of data mining, thus,
finding interesting association relationships among a large set of data items.
Many industries become interested in mining association rules from their
databases (e.g. for helping in many business decisions such as customer re-
lationship management, cross-marketing and loss-leader analysis). A typical
example of association rule mining is market basket analysis. This process
analyzes customer buying habits by finding associations between the differ-
ent items that customers place in their shopping baskets. The information
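The basic association-rule computation behind market basket analysis can be sketched as follows. This is a minimal illustration restricted to rules between single items, with hypothetical example baskets; it is not the full algorithm of [1.10].

```python
from itertools import combinations
from collections import Counter

def pair_rules(baskets, min_support=0.5, min_confidence=0.5):
    """Find rules x -> y between single items.
    support(x -> y) = P(x and y); confidence(x -> y) = P(y | x)."""
    n = len(baskets)
    item_count = Counter()
    pair_count = Counter()
    for basket in baskets:
        items = set(basket)
        item_count.update(items)
        pair_count.update(combinations(sorted(items), 2))
    rules = []
    for (a, b), c in pair_count.items():
        support = c / n
        if support < min_support:
            continue
        # A frequent pair yields up to two rules, one per direction.
        for x, y in ((a, b), (b, a)):
            confidence = c / item_count[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules
```

Items that frequently co-occur in baskets thus give rise to rules whose support and confidence exceed the user-defined thresholds.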
    L1               a_{i,1}        L2       a_{i,2}
    "Mecklenburgs"   area           hotel    hotel
    "hairdresser"    hairdresser    hotel    hotel
    "balconies"      balcony        access   access
    "room"           room           TV       television
The algorithm for learning generalized association rules uses the con-
cept hierarchy, an excerpt of which is depicted in Figure 1.3, and the con-
cept pairs from above (among many other concept pairs). In our actual ex-
periments, it discovered a large number of interesting and important non-
taxonomic conceptual relations. A few of them are listed in Table 1.2. Note
that in this table we also list two conceptual pairs, viz. (area, hotel)
and (room, television), that are not presented to the user, but that
are pruned. The reason is that there are ancestral association rules, viz.
(area, accommodation) and (room, furnishing), respectively, with higher
confidence and support measures.
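This pruning step can be sketched as follows. The hierarchy excerpt and the measures are hypothetical stand-ins for Figure 1.3 and Table 1.2: a pair is dropped when a rule over ancestral concepts holds with both higher support and higher confidence.

```python
# Hypothetical excerpt of the concept hierarchy (child -> parent).
PARENT = {
    "hotel": "accommodation",
    "television": "furnishing",
}

def ancestors(concept):
    """Return all proper ancestors of a concept in the hierarchy."""
    result = set()
    while concept in PARENT:
        concept = PARENT[concept]
        result.add(concept)
    return result

def prune_rules(rules):
    """rules maps a concept pair (c1, c2) to (support, confidence).
    A pair is pruned when some rule over ancestral concepts has both
    higher support and higher confidence."""
    kept = {}
    for (c1, c2), (sup, conf) in rules.items():
        candidates = [
            (a1, a2)
            for a1 in ancestors(c1) | {c1}
            for a2 in ancestors(c2) | {c2}
            if (a1, a2) != (c1, c2)
        ]
        dominated = any(
            pair in rules
            and rules[pair][0] > sup
            and rules[pair][1] > conf
            for pair in candidates
        )
        if not dominated:
            kept[(c1, c2)] = (sup, conf)
    return kept
```

With measures like those above, (area, hotel) and (room, television) are dominated by their ancestral rules and only the more general pairs survive.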
corpus remain in the ontology. The user can also control the pruning of con-
cepts which are neither contained in the domain-specific nor in the generic
corpus. We refer the interested reader to [1.1], where the pruning algorithms
are described in further detail.
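One plausible frequency-based criterion for this kind of corpus-driven pruning can be sketched as follows. This is an illustrative simplification, not the exact algorithm of [1.1]: a concept survives only if it is relatively more frequent in the domain-specific corpus than in the generic corpus.

```python
from collections import Counter

def prune_concepts(domain_tokens, generic_tokens):
    """Keep a concept only if its relative frequency in the
    domain-specific corpus exceeds its relative frequency in the
    generic corpus. Concepts occurring in neither corpus are pruned."""
    d_total = max(len(domain_tokens), 1)
    g_total = max(len(generic_tokens), 1)
    d = Counter(domain_tokens)
    g = Counter(generic_tokens)
    return {
        concept
        for concept in set(d) | set(g)
        if d[concept] / d_total > g[concept] / g_total
    }
```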
1.4 Implementation
This section describes the implemented ontology learning system Text-To-
Onto, which is embedded in KAON, the Karlsruhe Ontology and Semantic Web
infrastructure. KAON is an open-source ontology management and appli-
cation infrastructure targeted at semantics-driven business applications. It
includes a comprehensive tool suite allowing easy ontology management and
application.
Text-To-Onto is open-source and available for download at the KAON Web page.
1.5 Conclusion
Until recently, ontology learning per se, i.e. the comprehensive construction of
ontologies, did not exist as a discipline of its own. However, much work in a
number of fields (computational linguistics, information retrieval, machine
learning, databases, and software engineering) has actually researched and
practiced techniques for solving parts of the overall problem.
We have introduced ontology learning as an approach that may greatly
facilitate the construction of ontologies by the ontology engineer. The notion
of Ontology Learning introduced in this article aims at the integration of a
multitude of disciplines in order to facilitate the construction of ontologies.
The overall process is considered to be semi-automatic with human interven-
tion. It relies on the “balanced cooperative modeling” paradigm, describing a
coordinated interaction between human modeler and learning algorithm for
the construction of ontologies for the Semantic Web.
References
1.1 A. Maedche: Ontology Learning for the Semantic Web. Kluwer Academic Publishers, 2002.
1.2 A. Maedche and S. Staab: Mining Ontologies from Text. In Proceedings of EKAW-2000, Springer Lecture Notes in Artificial Intelligence (LNAI-1937), Juan-les-Pins, France, 2000.
1.3 B. Motik, A. Maedche and R. Volz: A Conceptual Modeling Approach for Building Semantics-Driven Enterprise Applications. In Proceedings of the 1st International Conference on Ontologies, Databases and Applications of Semantics (ODBASE-2002), California, USA, 2002.
1.4 A. Maedche, B. Motik, L. Stojanovic, R. Studer and R. Volz: Ontologies for Enterprise Knowledge Management. IEEE Intelligent Systems, 2002.
1.5 L. Kaufman and P. Rousseeuw: Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, 1990.
1.6 F. Pereira, N. Tishby and L. Lee: Distributional Clustering of English Words. In Proceedings of ACL-93, 1993.
1.7 C. Manning and H. Schuetze: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, Massachusetts, 1999.
1.8 H. Cunningham, R. Gaizauskas, K. Humphreys and Y. Wilks: Three Years of GATE. In Proceedings of the AISB'99 Workshop on Reference Architectures and Data Standards for NLP, Edinburgh, U.K., April 1999.
1.9 G. Neumann, R. Backofen, J. Baur, M. Becker and C. Braun: An Information Extraction Core System for Real World German Text Processing. In Proceedings of ANLP-97, Washington, USA, 1997.
1.10 R. Agrawal, T. Imielinski and A. Swami: Mining Association Rules between Sets of Items in Large Databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C., May 26-28, 1993.
1.11 M. Hearst: Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), Nantes, France, 1992.
1.12 J.-U. Kietz, R. Volz and A. Maedche: Semi-Automatic Ontology Acquisition from a Corporate Intranet. In International Conference on Grammar Inference (ICGI-2000), Lecture Notes in Artificial Intelligence, 2000.