Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Elena Simperl Philipp Cimiano
Axel Polleres Oscar Corcho
Valentina Presutti (Eds.)
Volume Editors
Elena Simperl
Karlsruhe Institute of Technology, Institute AIFB
Englerstrasse 11, 76131 Karlsruhe, Germany
E-mail: elena.simperl@aifb.uni-karlsruhe.de
Philipp Cimiano
University of Bielefeld, CITEC
Morgenbreede 39, 33615 Bielefeld, Germany
E-mail: cimiano@cit-ec.uni-bielefeld.de
Axel Polleres
Siemens AG Österreich
Siemensstrasse 90, 1210 Vienna, Austria
E-mail: axel.polleres@siemens.com
Oscar Corcho
Technical University of Madrid
C/ Severo Ochoa, 13, 28660 Boadilla del Monte, Madrid, Spain
E-mail: ocorcho@fi.upm.es
Valentina Presutti
ISTC-CNR, STLab
Via Nomentana 56, 00161 Rome, Italy
E-mail: valentina.presutti@istc.cnr.it
The Extended Semantic Web Conference (ESWC) is a major venue for dis-
cussing the latest scientific results and technology innovations around semantic
technologies. Building on its past success, ESWC seeks to broaden its focus to
span other relevant research areas in which Web semantics plays an important
role.
The goal of the Semantic Web is to create a Web of knowledge and services
in which the semantics of content is made explicit and content is linked both
to other content and services, allowing novel applications to combine content
from heterogeneous sites in unforeseen ways and support enhanced matching
between users’ needs and content. These complementarities are reflected in the
outline of the technical program of ESWC 2012; in addition to the research
and in-use tracks, we featured two special tracks putting particular emphasis on
inter-disciplinary research topics and areas that show the potential of exciting
synergies for the future: eGovernment and digital libraries. ESWC 2012 pre-
sented the latest results in research, technologies, and applications in its field.
Besides the technical program organized over multiple tracks, the conference fea-
tured several exciting, high-profile keynotes in areas directly related or adjacent
to semantic technologies; a workshop and tutorial program; system descriptions
and demos; a poster exhibition; a project networking session; a doctoral sympo-
sium, as well as the ESWC summer school, which was held immediately prior
to the conference. As prominent co-located events we were happy to welcome
the OWL Experiences and Development workshop (OWLED), as well as the AI
Challenge.
The technical program of the conference received 212 submissions, which
were reviewed by the Program Committee of the respective tracks; each was
coordinated by Track Chairs who oversaw dedicated Program Committees. The
review process included paper bidding, assessment by at least three Program
Committee members, paper rebuttal, and meta-reviewing for each submission
subject to acceptance in the conference program and proceedings. In all, 53
papers were selected as a result of this process, following uniform evaluation
criteria devised for all technical tracks.
The PhD symposium received 18 submissions, which were reviewed by the
PhD Symposium Program Committee. Thirteen papers were selected for pre-
sentation at a separate track and for inclusion in the ESWC 2012 proceedings.
ESWC 2012 had the pleasure and honor to welcome seven renowned keynote
speakers from academia and industry, addressing a variety of exciting topics of
the highest relevance to the research agenda of the semantic technologies community
and its impact on ICT.
We would like to take the opportunity to express our gratitude to the Chairs,
Program Committee members, and additional reviewers of all refereed tracks,
who ensured that this year’s conference maintained the highest standards of
scientific quality. Our thanks are also offered to the Organizing Committee of
the conference, for their dedication and hard work in selecting and coordinating a
wide array of interesting workshops, tutorials, posters, and panels that completed
the program of the conference. Special thanks go to the various organizations
who kindly supported our conference as sponsors, to the Sponsorship Chair who
coordinated these activities, and to the team around STI International who
provided an excellent service in all administrative and logistic issues related to
the organization of the event. Last, but not least, we would like to thank
the Proceedings Chair, the development team of the EasyChair conference
management system, and our publisher, Springer, for their support in the
preparation of this volume and the publication of the proceedings.
Organizing Committee
General Chair
Elena Simperl Karlsruhe Institute of Technology, Germany
Program Chairs
Philipp Cimiano University of Bielefeld, Germany
Axel Polleres Siemens AG Österreich, Vienna, Austria
Workshop Chairs
Alexandre Passant DERI, Ireland
Raphaël Troncy EURECOM, France
Tutorials Chairs
Emanuele della Valle University of Aberdeen, UK
Irini Fundulaki FORTH-ICS, Greece
Sponsorship Chair
Frank Dengler Karlsruhe Institute of Technology, Germany
Publicity Chair
Paul Groth VU University of Amsterdam, The Netherlands
Panel Chair
John Davies British Telecom, UK
Proceedings Chair
Antonis Bikakis University College London, UK
Treasurer
Alexander Wahler STI International, Austria
Program Committee
Track Chairs
Linked Open Data Track
Sören Auer Chemnitz University of Technology, Germany
Juan Sequeda University of Texas at Austin, USA
Ontologies Track
Chiara Ghidini FBK, Italy
Dimitris Plexousakis University of Crete and FORTH-ICS, Greece
Reasoning Track
Giovambattista Ianni Università della Calabria, Italy
Markus Kroetzsch University of Oxford, UK
EGovernment Track
Asunción Gómez-Pérez Universidad Politecnica de Madrid, Spain
Vassilios Peristeras European Commission
Referees
Steering Committee
Chair
John Domingue
Members
Sponsoring Institutions
Table of Contents
Invited Talks
Semantic Web/LD at a Crossroads: Into the Garbage Can or To
Theory? (Abstract) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Abraham Bernstein
Ontologies Track
Representing Mereotopological Relations in OWL Ontologies with
OntoPartS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
C. Maria Keet, Francis C. Fernández-Reyes, and
Annette Morales-González
Reasoning Track
Modelling Structured Domains Using Description Graphs and Logic
Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
Despoina Magka, Boris Motik, and Ian Horrocks
From Web 1.0 to Social Semantic Web: Lessons Learnt from a Migration
to a Medical Semantic Wiki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 618
Thomas Meilender, Jean Lieber, Fabien Palomares, and Nicolas Jay
EGovernment Track
A Publishing Pipeline for Linked Government Data . . . . . . . . . . . . . . . . . . 778
Fadi Maali, Richard Cyganiak, and Vassilios Peristeras
Achieving Interoperability through Semantic Technologies in the Public
Administration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 793
Chiara Di Francescomarino, Mauro Dragoni, Matteo Gerosa,
Chiara Ghidini, Marco Rospocher, and Michele Trainotti
PhD Symposium
Tackling Incompleteness in Information Extraction –
A Complementarity Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 808
Christina Feilmayr
A Framework for Ontology Usage Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 813
Jamshaid Ashraf
Formal Specification of Ontology Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 818
Edelweis Rohrer
Leveraging Linked Data Analysis for Semantic Recommender
Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 823
Andreas Thalhammer
Sharing Statistics for SPARQL Federation Optimization, with
Emphasis on Benchmark Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 828
Kjetil Kjernsmo
A Reuse-Based Lightweight Method for Developing Linked Data
Ontologies and Vocabularies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 833
Marı́a Poveda-Villalón
Optimising XML–RDF Data Integration: A Formal Approach to
Improve XSPARQL Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 838
Stefan Bischof
Software Architectures for Scalable Ontology Networks . . . . . . . . . . . . . . . 844
Alessandro Adamou
Identifying Complex Semantic Matches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 849
Brian Walshe
Abraham Bernstein
University of Zurich
bernstein@ifi.uzh.ch
Scylla and Charybdis were mythical sea monsters noted by Homer. Scylla was
rationalized as a rock shoal (described as a six-headed sea monster) and Charybdis
was a whirlpool. They were regarded as a sea hazard located close enough to each
other that they posed an inescapable threat to passing sailors; avoiding Charybdis
meant passing too close to Scylla and vice versa. According to Homer, Odysseus
was forced to choose which monster to confront while passing through the strait...
Since its inception, Semantic Web research projects have tried to sail the strait
between the Scylla of overly theoretical irrelevance and the Charybdis of non-
scientific applied projects.
Like Odysseus, the Semantic Web community has been wooed by the neatness of
theoretical explorations of knowledge representation methods that threaten to
crash the community into Scylla, the rock of irrelevance.
On the other side, the maelstrom of Charybdis attracts the community as
it tries to fulfill the next vision of the next Web, thereby losing its scientific
identity.
In this talk I will discuss and exemplify the strengths, weaknesses, opportuni-
ties, and pitfalls (or threats) of each of these extremes. I will use this analysis as
a basis for exploring some possible strategies to navigate the potentially stormy
seas of the Semantic Web community’s future.
Today, governments and businesses have to deal with high degrees of complexity:
the products they offer are highly individualized, there are many regulations
they must comply with, and all of this has to be dealt with under a growing rate
of change. Many organizations have tried to meet this challenge by reducing
complexity, for instance through the elimination of exceptions. Jeroen van Grondelle,
principal architect at Be Informed, argues that the only way to deal with com-
plexity is by embracing it.
Ontologies have proven to be an excellent way to deal with all the differ-
ent concepts that are introduced when products are defined and the supporting
business processes are designed. When the right conceptualization is chosen, on-
tologies capture these policy choices and constraints in a natural way. Ontologies
deal well with the heterogeneous nature of policy and regulations, which often
originate from different legal sources and have different owners.
The benefits exist throughout the entire policy lifecycle. The formal, precise
nature of ontologies improves the quality and consistency of the choices made
and reduces ambiguity. Because ontologies are well interpretable by machines, Be
Informed succeeds in inferring many of the supporting services, such as process
applications and decision services, from the ontologies, thereby eliminating the
need for systems development. And the ability to infer executable services also
allows for advanced what-if analysis and simulation of candidate policies before
they are put into effect.
Jeroen will show some examples where ontologies were successfully applied in
the public sector and what the impact was on all parties involved, from policy
officers to citizens. He will also present some of the research challenges encoun-
tered when these new audiences are confronted with ontologies, a technology
that typically is of course unfamiliar to them.
Alon Halevy
Aleksander Kolcz
Twitter, USA
alek@ir.iit.edu
Twitter represents a large complex network of users with diverse and continu-
ously evolving interests. Discussions and interactions range from very small to
very large groups of people, and most of them occur in public. Interests are
both long and short term and are expressed by the content generated by the
users as well as via the Twitter follow graph, i.e. who is following whose content.
Understanding user interests is crucial to providing a good Twitter experience
by helping users connect to others and find relevant information and interesting
information sources.
The manner in which information is spread over the network and communica-
tion attempts are made can also help in identifying spammers and other service
abuses.
Understanding users and their preferences is also a very challenging problem
due to the very large volume of information, the fast rate of change, and the short
nature of the tweets. Large scale machine learning as well as graph and text
mining have been helping us to tackle these problems and create new opportu-
nities to better understand our users. In the talk I will describe a number of
challenging modeling problems addressed by the Twitter team as well as our
approach to creating frameworks and infrastructure to make learning at scale
possible.
Monica S. Lam
With the rise of cloud services, users’ personal data, from photos to bank trans-
actions, are scattered and hosted by a variety of application service providers.
Communication services like email and social networking, by virtue of helping
users share, have the unique opportunity to gather all data shared in one place.
As users shift their communication medium from email to social networks, per-
sonal data are increasingly locked up in a global, proprietary social web.
We see the rise of the mobile phone as an opportunity to re-establish an open
standard, as social data are often produced, shared, and consumed on the mobile
devices directly. We propose an API where apps can interact with friends’ phones
directly, without intermediation through a centralized communication service.
Furthermore, this information can then be made available on our own devices to
personalize and improve online interactions. Based on this API, we have created
a working prototype called Musubi (short for Mobile, Social, and UBIquitous)
along with various social applications, all of which are available on the Android
market.
Márta Nagy-Rothengass
European Commission
Marta.Nagy-Rothengass@ec.europa.eu
Data is today everywhere. The quantity and growth of generated data are enormous,
and its proper management challenges not only individual users but also business and
public organisations. We are facing data/information overflow, and we often
struggle to store, manage, analyse and preserve all of our data.
At the same time, this large and growing amount of data offers us, through its
intelligent linkage, analysis and processing:
– business opportunities like establishment of new, innovative digital services
towards end users and organisations;
– better decision making support in business and public sector; and
– increased intelligence and better knowledge extraction.
This is the reason why many of us see data as the oil of the 21st century. We
are challenged to unlock the potential and the added value of complex and big
data. It has become a competitive advantage to offer the right data to the right
people at the right time.
In my talk I will introduce value chain thinking on data, then analyse
its main technology and business challenges, and report on the ongoing and
envisaged policy, infrastructure, research and innovation activities at the European
level.
Successful political campaigns have mastered the tactics and strategies used
to effectively present an argument, manage and respond with authority during
crisis, influence the debate and shape public perception.
Yet, in today's 24/7 media environment it has become more difficult than ever
to set an agenda, frame an issue or engage an audience.
Four years ago, Barack Obama set a new standard for campaigning by chang-
ing the way new media was used to build an aspirational brand, engage and
empower supporters, raise money and turn out voters. As the 2012 presidential
race unfolds, the campaigns are stepping up their game. And in this cycle, they
are embracing digital media more than ever.
However, it is not only the President's campaign and his opponents who are
faced with the challenge of creating a narrative and framing the public debate.
Organizations in the private sector often deal with similarly complex issues
as they struggle to deliver tailored messages to their target audience, regardless
of whether that audience is customers, investors, media, the general public or even potential
employees.
From storytelling to big data lifestyle targeting: Julius van de Laar will provide
a first-hand account of how today's most effective campaigns leverage battle-
tested strategies combined with new media tools to create a persuasive narrative,
and of how these translate into actionable strategies for the corporate context.
Olaf Hartig
Humboldt-Universität zu Berlin
hartig@informatik.hu-berlin.de
Abstract. The World Wide Web currently evolves into a Web of Linked Data
where content providers publish and link data as they have done with hypertext
for the last 20 years. While the declarative query language SPARQL is the de
facto standard for querying a priori defined sets of data from the Web, no language exists
for querying the Web of Linked Data itself. However, it seems natural to ask
whether SPARQL is also suitable for such a purpose.
In this paper we formally investigate the applicability of SPARQL as a query
language for Linked Data on the Web. In particular, we study two query models:
1) a full-Web semantics where the scope of a query is the complete set of Linked
Data on the Web and 2) a family of reachability-based semantics which restrict
the scope to data that is reachable by traversing certain data links. For both models
we discuss properties such as monotonicity and computability as well as the im-
plications of querying a Web that is infinitely large due to data generating servers.
1 Introduction
The emergence of vast amounts of RDF data on the WWW has spawned research on
storing and querying large collections of such data efficiently. The prevalent query lan-
guage in this context is SPARQL [16] which defines queries as functions over an RDF
dataset, that is, a fixed, a priori defined collection of sets of RDF triples. This definition
naturally fits the use case of querying a repository of RDF data copied from the Web.
However, most RDF data on the Web is published following the Linked Data prin-
ciples [5], contributing to the emerging Web of Linked Data [6]. This practice allows
for query approaches that access the most recent version of remote data on demand.
More importantly, query execution systems may automatically discover new data by
traversing data links. As a result, such a system answers queries based on data that is
not only up-to-date but may also include initially unknown data. These features are the
foundation for true serendipity, which we regard as the most distinguishing advantage
of querying the Web itself, instead of a predefined, bounded collection of data.
While several research groups work on systems that evaluate SPARQL basic graph
patterns over the Web of Linked Data (cf. [9], [10,12] and [13,14]), we notice a shortage
of work on theoretical foundations and properties of such queries. Furthermore, there is
a need to support queries that are more expressive than conjunctive (basic graph pattern
based) queries [17]. However, it seems natural to assume that SPARQL could be used
in this context because the Web of Linked Data is based on the RDF data model and
SPARQL is a query language for RDF data. In this paper we challenge this assumption.
Contributions. In this paper we understand queries as functions over the Web of Linked
Data as a whole. To analyze the suitability of SPARQL as a language for such queries,
we have to adjust the semantics of SPARQL. More precisely, we have to redefine the
scope for evaluating SPARQL algebra expressions. In this paper we discuss two ap-
proaches for such an adjustment. The first approach uses a semantics where the scope
of a query is the complete set of Linked Data on the Web. We call this semantics full-
Web semantics. The second approach introduces a family of reachability-based seman-
tics which restrict the scope to data that is reachable by traversing certain data links.
We emphasize that both approaches allow for query results that are based on data from
initially unknown sources and, thus, enable applications to tap the full potential of the
Web. Nevertheless, both approaches precisely define the (expected) result for any query.
As a prerequisite for defining the aforementioned semantics and for studying theoret-
ical properties of queries under these semantics, we introduce a theoretical framework.
The basis of this framework is a data model that captures the idea of a Web of Linked
Data. We model such a Web as an infinite structure of documents that contain RDF
data and that are interlinked via this data. Our model allows for infiniteness because
the number of entities described in a Web of Linked Data may be infinite; so may the
number of documents. The following example illustrates such a case:
Example 1. Let ui denote an HTTP scheme based URI that identifies the natural num-
ber i. There is a countably infinite number of such URIs. The WWW server which
is responsible for these URIs may be set up to provide a document for each natural
number. These documents may be generated upon request and may contain RDF data
including the RDF triple (ui , http://.../next, ui+1 ). This triple associates the natural number
i with its successor i+1 and, thus, links to the data about i+1 [19]. An example for such
a server is provided by the Linked Open Numbers project1 .
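To make the idea of such data-generating servers concrete, the following is a minimal Python sketch of a server-side function that produces a document for any requested number. The URI pattern and the next predicate are illustrative placeholders, not the actual URIs used by the Linked Open Numbers project.

```python
# Sketch of a server that generates Linked Data about natural numbers on request.
# The URIs and the "next" predicate are illustrative placeholders.

def number_uri(i: int) -> str:
    return f"http://example.org/number/{i}"

NEXT = "http://example.org/vocab/next"

def document_for(uri: str) -> set[tuple[str, str, str]]:
    """Return the RDF triples of the generated document for a number URI."""
    i = int(uri.rsplit("/", 1)[1])
    # The single triple links the number i to its successor i+1 and thereby
    # creates a data link to the (likewise generated) document about i+1.
    return {(number_uri(i), NEXT, number_uri(i + 1))}

# Dereferencing the URI for 1 yields a document that links to the URI for 2,
# and so on without end -- an infinite Web of Linked Data.
print(document_for(number_uri(1)))
```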
In addition to the data model our theoretical framework comprises a computation model.
This model is based on a particular type of Turing machine which formally captures the
limited data access capabilities of computations over the Web.
We summarize the main contributions of this paper as follows:
– We present a data model and a computation model that provide a theoretical frame-
work to define and to study query languages for the Web of Linked Data.
– We introduce a full-Web semantics and a family of reachability-based semantics
for a (hypothetical) use of SPARQL as a language for queries over Linked Data.
– We systematically analyze SPARQL queries under the semantics that we introduce.
This analysis includes a discussion of satisfiability, monotonicity, and computabil-
ity of queries under the different semantics, a comparison of the semantics, and a
study of the implications of querying a Web of Linked Data that is infinite.
Related Work. Since its emergence the WWW has attracted research on declarative
query languages for the Web. For an overview on early work in this area we refer to [8].
Most of this work understands the WWW as a hypertext Web. Nonetheless, some of the
foundational work can be adopted for research on Linked Data. The computation model
that we use in this paper is an adaptation of the ideas presented in [1] and [15].
1 http://km.aifb.kit.edu/projects/numbers/
In addition to the early work on Web queries, query execution over Linked Data
on the WWW has attracted much attention recently [9,10,12,13,14]. However, existing
work primarily focuses on various aspects of (query-local) data management, query ex-
ecution, and optimization. The only work we are aware of that aims to formally capture
the concept of Linked Data and to provide a well-defined semantics for queries in this
context is Bouquet et al.’s [7]. They define three types of query methods for conjunctive
queries: a bounded method which only uses RDF data referred to in queries, a direct
access method which assumes an oracle that provides all RDF graphs which are “rel-
evant” for a given query, and a navigational method which corresponds to a particular
reachability-based semantics. For the latter Bouquet et al. define a notion of reachabil-
ity that allows a query execution system to follow all data links. As a consequence, the
semantics of queries using this navigational method is equivalent to, what we call, cAll -
semantics (cf. Section 5.1); it is the most general of our reachability-based semantics.
Bouquet et al.’s navigational query model does not support other, more restrictive no-
tions of reachability, as is possible with our model. Furthermore, Bouquet et al. do not
discuss full SPARQL, theoretical properties of queries, or the infiniteness of the WWW.
While we focus on the query language SPARQL in the context of Linked Data on the
Web, the theoretical properties of SPARQL as a query language for a fixed, predefined
collection of RDF data are well understood today [2,3,16,18]. Particularly interesting
in our context are semantical equivalences between SPARQL expressions [18] because
these equivalences may also be used for optimizing SPARQL queries over Linked Data.
Structure of the Paper. The remainder of this paper is organized as follows. Sec-
tion 2 introduces the preliminaries for our work. In Section 3 we present the data model
and the computation model. Sections 4 and 5 discuss the full-Web semantics and the
reachability-based semantics for SPARQL, respectively. We conclude the paper in Sec-
tion 6. For full technical proofs of all results in this paper we refer to [11].
2 Preliminaries
This section provides a brief introduction of RDF and the query language SPARQL.
We assume pairwise disjoint, countably infinite sets U (all HTTP scheme based
URIs2 ), B (blank nodes), L (literals), and V (variables, denoted by a leading ’?’ sym-
bol). An RDF triple t is a tuple (s, p, o) ∈ (U ∪B)×U ×(U ∪B ∪L). For any RDF triple
t = (s, p, o) we define terms(t) = {s, p, o} and uris(t) = terms(t) ∩ U. Overloading
function terms, we write terms(G) = ⋃_{t∈G} terms(t) for any (potentially infinite) set
G of RDF triples. In contrast to the usual formalization of RDF we allow for infinite
sets of RDF triples which we require to study infinite Webs of Linked Data.
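As a plain illustration of these auxiliary functions (not part of the paper's formalism), the following Python sketch represents RDF triples as tuples of strings and treats any string with an http:// prefix as a URI; everything else stands in for blank nodes and literals.

```python
# RDF triples as (s, p, o) tuples of strings; in this sketch a term is a URI
# exactly if it starts with "http://" (matching the HTTP-only assumption above).

Triple = tuple[str, str, str]

def is_uri(term: str) -> bool:
    return term.startswith("http://")

def terms(t: Triple) -> set[str]:
    s, p, o = t
    return {s, p, o}

def uris(t: Triple) -> set[str]:
    # uris(t) = terms(t) ∩ U
    return {x for x in terms(t) if is_uri(x)}

def terms_of_graph(G: set[Triple]) -> set[str]:
    # terms(G) = union of terms(t) over all t in G
    return set().union(*(terms(t) for t in G))
```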
In this paper we focus on the core fragment of SPARQL discussed by Pérez et al. [16]
and we adopt their formalization approach, that is, we use the algebraic syntax and
the compositional set semantics introduced in [16]. SPARQL expressions are defined
recursively: i) A triple pattern (s, p, o) ∈ (V ∪ U) × (V ∪ U) × (V ∪ U ∪ L) is a
2 For the sake of simplicity we assume in this paper that URIs are HTTP scheme based URIs.
However, our models and results may be extended easily to all possible types of URIs.
SPARQL expression3. ii) If P1 and P2 are SPARQL expressions, then (P1 AND P2 ),
(P1 UNION P2 ), (P1 OPT P2 ), and (P1 FILTER R) are SPARQL expressions where R is a
filter condition. For a formal definition of filter conditions we refer to [16]. To denote the
set of all variables in all triple patterns of a SPARQL expression P we write vars(P ).
To define the semantics of SPARQL we introduce valuations, that is, partial map-
pings μ : V → U ∪ B ∪ L. The evaluation of a SPARQL expression P over a potentially
infinite set G of RDF triples, denoted by [[P ]]G , is a set of valuations. In contrast to the
usual case, this set may be infinite in our scenario. The evaluation function [[·]]· is de-
fined recursively over the structure of SPARQL expressions. Due to space limitations,
we do not reproduce the full formal definition of [[·]]· here. Instead, we refer the reader to
the definitions given by Pérez et al. [16]; even if Pérez et al. define [[·]]· for finite sets of
RDF triples, it is trivial to extend their formalism for infiniteness (cf. appendix in [11]).
A SPARQL expression P is monotonic if for any pair G1 , G2 of (potentially infinite)
sets of RDF triples such that G1 ⊆ G2 , it holds that [[P ]]G1 ⊆ [[P ]]G2 . A SPARQL ex-
pression P is satisfiable if there exists a (potentially infinite) set G of RDF triples such
that [[P ]]G ≠ ∅. It is trivial to show that any non-satisfiable expression is monotonic.
In addition to the traditional notion of satisfiability we shall need a more restrictive
notion for the discussion in this paper: A SPARQL expression P is nontrivially satisfi-
able if there exists a (potentially infinite) set G of RDF triples and a valuation μ such
that i) μ ∈ [[P ]]G and ii) μ provides a binding for at least one variable; i.e. dom(μ) ≠ ∅.
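To ground these notions, here is a small Python sketch of the triple-pattern base case of the evaluation function; it is only an illustration (variables are '?'-prefixed strings) and omits the AND, UNION, OPT, and FILTER operators of the full algebra.

```python
# Evaluating a single triple pattern over a set G of RDF triples. A valuation maps
# variables ("?"-prefixed strings) to RDF terms. This covers only the base case of
# [[.]]_G, not the composite SPARQL operators.

Triple = tuple[str, str, str]
Valuation = dict[str, str]

def is_var(x: str) -> bool:
    return x.startswith("?")

def eval_triple_pattern(tp: Triple, G: set[Triple]) -> list[Valuation]:
    solutions = []
    for t in G:
        mu: Valuation = {}
        ok = True
        for pattern_term, data_term in zip(tp, t):
            if is_var(pattern_term):
                if mu.get(pattern_term, data_term) != data_term:
                    ok = False   # variable already bound to a different term
                    break
                mu[pattern_term] = data_term
            elif pattern_term != data_term:
                ok = False       # constant component does not match
                break
        if ok:
            solutions.append(mu)
    return solutions

# Monotonicity of this fragment: adding triples to G can only add solutions.
G1 = {("http://ex.org/a", "http://ex.org/p", "http://ex.org/b")}
G2 = G1 | {("http://ex.org/c", "http://ex.org/p", "http://ex.org/d")}
tp = ("?x", "http://ex.org/p", "?y")
assert all(mu in eval_triple_pattern(tp, G2) for mu in eval_triple_pattern(tp, G1))
```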
In this section we introduce theoretical foundations which shall allow us to define and
to analyze query models for Linked Data. In particular, we propose a data model and
introduce a computation model. For these models we assume a static view of the Web;
that is, no changes are made to the data on the Web during the execution of a query.
We model the Web of Linked Data as a potentially infinite structure of interlinked doc-
uments. Such documents, which we call Linked Data documents, or LD documents for
short, are accessed via URIs and contain data that is represented as a set of RDF triples.
While the three elements D, data, and adoc completely define a Web of Linked Data
in our model, we point out that these elements are abstract concepts and, thus, are not
available to a query execution system. However, by retrieving LD documents, such a
system may gradually obtain information about the Web. Based on this information the
system may (partially) materialize these three elements. In the following we discuss the
three elements and introduce additional concepts that we need to define queries.
We say a Web of Linked Data W = (D, data, adoc) is finite if and only if D is
finite; otherwise, W is infinite. Our model allows for infiniteness to cover cases where
Linked Data about an infinite number of identifiable entities is generated on the fly. The
Linked Open Numbers project (cf. Example 1) illustrates that such cases are possible in
practice. Another example is the LinkedGeoData project4 which provides Linked Data
about any circular and rectangular area on Earth [4]. Covering these cases enables us to
model queries over such data and analyze the effects of executing such queries.
Even if a Web of Linked Data W = (D, data, adoc) is infinite, Definition 1 requires
countability for D. We emphasize that this requirement does not restrict us in modeling
the WWW as a Web of Linked Data: In the WWW we use URIs to locate documents
that contain Linked Data. Even if URIs are not limited in length, they are words over a
finite alphabet. Thus, the infinite set of all possible URIs is countable, as is the set of all
documents that may be retrieved using URIs.
The mapping data associates each LD document d ∈ D in a Web of Linked Data
W = (D, data, adoc) with a finite set of RDF triples. In practice, these triples are ob-
tained by parsing d after d has been retrieved from the Web. The actual retrieval mech-
anism is not relevant for our model. However, as prescribed by the RDF data model,
Definition 1 requires that the data of each d ∈ D uses a unique set of blank nodes.
To denote the (potentially infinite but countable) set of all RDF triples in W we write
AllData(W); i.e., it holds: AllData(W) = ⋃{ data(d) | d ∈ D }.
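For intuition only, the structural part of the data model can be written down as a small Python sketch; the class and field names are ours, and naturally only a finite fragment of a Web of Linked Data can ever be materialized like this.

```python
from dataclasses import dataclass

Triple = tuple[str, str, str]

@dataclass
class WebOfLinkedData:
    """Abstract model W = (D, data, adoc), restricted to a finite, materialized fragment."""
    D: frozenset[str]                    # identifiers of the LD documents
    data: dict[str, frozenset[Triple]]   # each document's finite set of RDF triples
    adoc: dict[str, str]                 # dereferenceable URI -> its authoritative document

def all_data(W: WebOfLinkedData) -> set[Triple]:
    # AllData(W) = union of data(d) over all d in D
    return set().union(*(W.data[d] for d in W.D))
```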
Since we use URIs as identifiers for entities, we say that an LD document d ∈ D
describes the entity identified by URI u ∈ U if there exists (s, p, o) ∈ data(d) such that
s = u or o = u. Notice, there might be multiple LD documents that describe an entity
identified by u. However, according to the Linked Data principles, each u ∈ U may also
serve as a reference to a specific LD document which is considered as an authoritative
source of data about the entity identified by u. We model the relationship between
URIs and authoritative LD documents by mapping adoc. Since some LD documents
may be authoritative for multiple entities, we do not require injectivity for adoc. The
“real world” mechanism for dereferencing URIs (i.e. learning about the location of the
authoritative LD document) is not relevant for our model. For each u ∈ U that cannot
be dereferenced (i.e. "broken links") or that is not used in W it holds u ∉ dom(adoc).
A URI u ∈ U with u ∈ dom(adoc) that is used in the data of an LD document d1 ∈ D
constitutes a data link to the LD document d2 = adoc(u) ∈ D. These data links form a
4 http://linkedgeodata.org
graph structure which we call link graph. The vertices in such a graph represent the LD
documents of the corresponding Web of Linked Data; edges represent data links.
To study the monotonicity of queries over a Web of Linked Data we require a concept
of containment for such Webs. For this purpose, we introduce the notion of an induced
subweb which resembles the concept of induced subgraphs in graph theory.
Definition 2. Let W = (D, data, adoc) and W′ = (D′, data′, adoc′) be Webs of Linked
Data. W′ is an induced subweb of W if i) D′ ⊆ D, ii) ∀ d ∈ D′ : data′(d) = data(d),
and iii) ∀ u ∈ U_{D′} : adoc′(u) = adoc(u), where U_{D′} = {u ∈ U | adoc(u) ∈ D′}.
It can be easily seen from Definition 2 that specifying D′ is sufficient to unambiguously
define an induced subweb (D′, data′, adoc′) of a given Web of Linked Data. Further-
more, it is easy to verify that for an induced subweb W′ of a Web of Linked Data W it
holds AllData(W′) ⊆ AllData(W).
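Using the same dictionary-based representation as in the sketch above, the three conditions of Definition 2 can be checked mechanically; the function below is only an illustration and the parameter names are ours.

```python
# Check whether (D2, data2, adoc2) is an induced subweb of (D, data, adoc).
# D, D2 are sets of document identifiers; data, data2 map documents to triple sets;
# adoc, adoc2 map URIs to their authoritative documents.

def is_induced_subweb(D2, data2, adoc2, D, data, adoc) -> bool:
    if not D2 <= D:                                     # i)  D' ⊆ D
        return False
    if any(data2[d] != data[d] for d in D2):            # ii) data'(d) = data(d) for all d in D'
        return False
    U_D2 = {u for u, d in adoc.items() if d in D2}      # U_{D'}: URIs whose document lies in D'
    return all(adoc2.get(u) == adoc[u] for u in U_D2)   # iii) adoc'(u) = adoc(u) on U_{D'}
```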
In addition to the structural part, our data model introduces a general understanding
of queries over a Web of Linked Data:
Definition 3. Let W be the infinite set of all possible Webs of Linked Data (i.e. all
3-tuples that correspond to Definition 1) and let Ω be the infinite set of all possible
valuations. A Linked Data query q is a total function q : W → 2^Ω.
The notions of satisfiability and monotonicity carry over naturally to Linked Data
queries: A Linked Data query q is satisfiable if there exists a Web of Linked Data
W such that q(W ) is not empty. A Linked Data query q is nontrivially satisfiable if
there exists a Web of Linked Data W and a valuation μ such that i) μ ∈ q(W ) and
ii) dom(μ) = ∅. A Linked Data query q is monotonic if for every pair W1 , W2 of Webs
of Linked Data it holds: If W1 is an induced subweb of W2 , then q(W1 ) ⊆ q(W2 ).
Usually, functions are computed over structures that are assumed to be fully (and di-
rectly) accessible. In contrast, we focus on Webs of Linked Data in which accessibility
is limited: To discover LD documents and access their data we have to dereference
URIs, but the full set of those URIs for which we may retrieve documents is unknown.
Hence, to properly analyze a query model for Webs of Linked Data we must define a
model for computing functions on such a Web. This section introduces such a model.
In the context of queries over a hypertext-centric view of the WWW, Abiteboul and
Vianu introduce a specific Turing machine called Web machine [1]. Mendelzon and
Milo propose a similar machine model [15]. These machines formally capture the lim-
ited data access capabilities on the WWW and thus present an adequate abstraction for
computations over a structure such as the WWW. Based on these machines the authors
introduce particular notions of computability for queries over the WWW. These notions
are: (finitely) computable queries, which correspond to the traditional notion of com-
putability; and eventually computable queries whose computation may not terminate
but each element of the query result will eventually be reported during the computation.
We adopt the ideas of Abiteboul and Vianu and of Mendelzon and Milo for our work.
More precisely, we adapt the idea of a Web machine to our scenario of a Web of Linked
Data. We call our machine a Linked Data machine (or LD machine, for short). Based on
this machine we shall define finite and eventual computability for Linked Data queries.
Encoding (fragments of) a Web of Linked Data W = (D, data, adoc) on the tapes
of such an LD machine is straightforward because all relevant structures, such as the
sets D or U, are countably infinite. In the remainder of this paper we write enc(x) to
denote the encoding of some element x (e.g. a single RDF triple, a set of triples, a full
Web of Linked Data, a valuation, etc.). For a detailed definition of the encodings we use
in this paper, we refer to the appendix in [11]. We now define LD machine:
Definition 4. An LD machine is a multi-tape Turing machine with five tapes and a
finite set of states, including a special state called expand. The five tapes include two,
read-only input tapes: i) an ordinary input tape and ii) a right-infinite Web tape which
can only be accessed in the expand state; two work tapes: iii) an ordinary, two-way
infinite work tape and iv) a right-infinite link traversal tape; and v) a right-infinite,
append-only output tape. Initially, the work tapes and the output tape are empty, the
Web tape contains a (potentially infinite) word that encodes a Web of Linked Data, and
the ordinary input tape contains an encoding of further input (if any). Any LD machine
operates like an ordinary multi-tape Turing machine except when it reaches the expand
state. In this case LD machines perform the following expand procedure: The machine
inspects the word currently stored on the link traversal tape. If the suffix of this word
is the encoding enc(u) of some URI u ∈ U and the word on the Web tape contains
enc(u) enc(adoc(u)) , then the machine appends enc(adoc(u)) to the (right) end
of the word on the link traversal tape by copying from the Web tape; otherwise, the
machine appends to the word on the link traversal tape.
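Abstracting away the tape mechanics, the only data access an LD machine has is this URI-by-URI expansion. The following Python sketch is our own abstraction of that access pattern, with a pair of dictionaries standing in for the Web tape; it is not the Turing-machine construction itself.

```python
# Simulates the data access of an LD machine: triples can only be obtained by
# "expanding" (dereferencing) one URI at a time; there is no operation that
# enumerates all documents of the queried Web.

Triple = tuple[str, str, str]

class LinkTraversalAccess:
    def __init__(self, adoc: dict[str, str], data: dict[str, set[Triple]]):
        self._adoc = adoc                        # URI -> authoritative document (hidden)
        self._data = data                        # document -> its RDF triples (hidden)
        self.retrieved: list[set[Triple]] = []   # plays the role of the link traversal tape

    def expand(self, uri: str) -> set[Triple]:
        """Look up the authoritative document for `uri`; empty if it cannot be dereferenced."""
        doc = self._adoc.get(uri)
        found = self._data[doc] if doc is not None else set()
        self.retrieved.append(found)             # appended to the link traversal tape
        return found
```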
Notice how any LD machine M is limited in the way it may access a Web of Linked
Data W = (D, data, adoc) that is encoded on its Web tape: M may use the data of any
particular d ∈ D only after it performed the expand procedure using a URI u ∈ U for
which adoc(u) = d. Hence, the expand procedure simulates a URI based lookup which
conforms to the (typical) data access method on the WWW. We now use LD machines
to adapt the notion of finite and eventual computability [1] for Linked Data queries:
Definition 5. A Linked Data query q is finitely computable if there exists an LD ma-
chine which, for any Web of Linked Data W encoded on the Web tape, halts after a
finite number of steps and produces a possible encoding of q(W ) on its output tape.
Definition 6. A Linked Data query q is eventually computable if there exists an LD
machine whose computation on any Web of Linked Data W encoded on the Web tape
has the following two properties: 1.) the word on the output tape at each step of the
computation is a prefix of a possible encoding of q(W ) and 2.) the encoding enc(μ )
of any μ ∈ q(W ) becomes part of the word on the output tape after a finite number of
computation steps.
Any machine for a non-satisfiable query may immediately report the empty result. Thus:
Fact 1. Non-satisfiable Linked Data queries are finitely computable.
In our analysis of SPARQL-based Linked Data queries we shall discuss decision prob-
lems that have a Web of Linked Data W as input. For such problems we assume the
computation may only be performed by an LD machine with enc(W ) on its Web tape:
4 Full-Web Semantics
Based on the concepts introduced in the previous section we now define and study
approaches that adapt SPARQL as a language for expressing Linked Data queries.
The first approach that we discuss is full-Web semantics where the scope of each
query is the complete set of Linked Data on the Web. Hereafter, we refer to SPARQL
queries under this full-Web semantics as SPARQLLD queries. The definition of these
queries is straightforward and makes use of SPARQL expressions and their semantics:
Definition 8. Let P be a SPARQL expression. The SPARQLLD query that uses P, denoted
by Q^P, is a Linked Data query that, for any Web of Linked Data W, is defined as:
Q^P(W) = [[P]]_AllData(W). Each valuation μ ∈ Q^P(W) is a solution for Q^P in W.
In the following we study satisfiability, monotonicity, and computability of SPARQLLD
queries and we discuss implications of querying Webs of Linked Data that are infinite.
To compute the complete result of a SPARQLLD query, an LD machine requires access to the data of all LD documents in the queried Web of Linked Data. Recall that,
initially, the machine has no information about what URI to use for performing an ex-
pand procedure with which it may access any particular document. Hence, to ensure
that all documents have been accessed, the machine must expand all u ∈ U. This pro-
cess never terminates because U is infinite. Notice, a real query system for the WWW
would have a similar problem: To guarantee that such a system sees all documents, it
must enumerate and lookup all (HTTP scheme) URIs.
The computability of any Linked Data query is a general, input independent property
which covers the worst case (recall, the requirements given in Definitions 5 and 6 must
hold for any Web of Linked Data). As a consequence, in certain cases the computation
of some (eventually computable) SPARQLLD queries may still terminate:
Example 3. Let Q^PEx2 be a monotonic SPARQLLD query which uses the SPARQL ex-
pression PEx2 = (u1, u2, u3) that we introduce in Example 2. Recall, PEx2 is satisfiable
but not nontrivially satisfiable. The same holds for Q^PEx2 (cf. Proposition 1). An LD
machine for Q^PEx2 may take advantage of this fact: As soon as the machine discovers an
LD document which contains RDF triple (u1, u2, u3), the machine may halt (after re-
porting {μ∅} with dom(μ∅) = ∅ as the complete query result). In this particular case
the machine would satisfy the requirements for finite computability. However, Q^PEx2 is
still only eventually computable because there exist Webs of Linked Data that do not
contain any LD document with RDF triple (u1, u2, u3); any (complete) LD machine
based computation of Q^PEx2 over such a Web cannot halt (cf. proof of Theorem 1).
The example illustrates that the computation of an eventually computable query over a
particular Web of Linked Data may terminate. This observation leads us to a decision
problem which we denote as TERMINATION(SPARQLLD). This problem takes a Web
of Linked Data W and a satisfiable SPARQLLD query Q^P as input and asks whether
an LD machine exists that computes Q^P(W) and halts. For discussing this problem we
note that the query in Example 3 represents a special case, that is, SPARQLLD queries
which are satisfiable but not nontrivially satisfiable. The reason why an LD machine
for such a query may halt is the implicit knowledge that the query result is complete
once the machine identified the empty valuation μ∅ as a solution. Such a completeness
criterion does not exist for any nontrivially satisfiable SPARQLLD query:
Lemma 1. There is no nontrivially satisfiable SPARQLLD query Q^P for which there
exists an LD machine that, for any Web of Linked Data W encoded on the Web tape,
halts after a finite number of computation steps and outputs an encoding of Q^P(W).
Lemma 1 shows that the answer to TERMINATION(SPARQLLD) is negative in most
cases. However, the problem in general is undecidable (for LD machines) since the in-
put for the problem includes queries that correspond to the aforementioned special case.
Theorem 2. TERMINATION(SPARQLLD) is not LD machine decidable.
We now focus on the implications of potentially infinite Webs of Linked
Data for SPARQLLD queries. However, we assume a finite Web first:
Proposition 2. SPARQLLD queries over a finite Web of Linked Data have a finite result.
The following example illustrates that a similarly general statement does not exist when
the queried Web is infinite such as the WWW.
Example 4. Let W_inf = (D_inf, data_inf, adoc_inf) be an infinite Web of Linked Data that
contains LD documents for all natural numbers (similar to the documents in Exam-
ple 1). Hence, for each natural number k ∈ N+, identified by u_k ∈ U, there exists an LD
document adoc_inf(u_k) = d_k ∈ D_inf such that data_inf(d_k) = {(u_k, succ, u_{k+1})}, where
succ ∈ U identifies the successor relation for N+. Furthermore, let P1 = (u1, succ, ?v)
and P2 = (?x, succ, ?y) be SPARQL expressions. It can be seen easily that the result
of SPARQLLD query Q^P1 over W_inf is finite, whereas Q^P2(W_inf) is infinite.
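Spelled out (writing each valuation as a set of bindings), the two results from Example 4 are:

```latex
\[
Q^{P_1}(W_{\mathrm{inf}}) = \bigl\{ \{?v \mapsto u_2\} \bigr\} \quad \text{(finite)}, \qquad
Q^{P_2}(W_{\mathrm{inf}}) = \bigl\{ \{?x \mapsto u_k,\ ?y \mapsto u_{k+1}\} \mid k \in \mathbb{N}^{+} \bigr\} \quad \text{(countably infinite)}.
\]
```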
The example demonstrates that some SPARQLLD queries have a finite result over some
infinite Web of Linked Data and some queries have an infinite result. Consequently,
we are interested in a decision problem FINITENESS(SPARQLLD) which asks, given a
(potentially infinite) Web of Linked Data W and a satisfiable SPARQL expression P,
whether Q^P(W) is finite. Unfortunately, we cannot answer the problem in general:
5 Reachability-Based Semantics
Our results in the previous section show that SPARQL queries under full-Web seman-
tics have a very limited computability. As a consequence, any SPARQL-based query ap-
proach for Linked Data that uses full-Web semantics requires some ad hoc mechanism
to abort query executions and, thus, has to accept incomplete query results. Depending
on the abort mechanism the query execution may even be nondeterministic. If we take
these issues as an obstacle, we are interested in an alternative, well-defined semantics
for SPARQL over Linked Data. In this section we discuss a family of such seman-
tics which we call reachability-based semantics. These semantics restrict the scope of
queries to data that is reachable by traversing certain data links using a given set of URIs
as starting points. Hereafter, we refer to queries under any reachability-based seman-
tics as SPARQLLD(R) queries. In the remainder of this section we formally introduce
reachability-based semantics, discuss theoretical properties of SPARQLLD(R) queries,
and compare SPARQLLD(R) to SPARQLLD .
5.1 Definition
Under a reachability-based semantics, an LD document is in the scope of a query only
if there is a path in the link graph of a Web of Linked Data to the document in question;
the potential starting points for such a path are LD documents that are authoritative for a given set of
entities. However, allowing for arbitrary paths might be questionable in practice be-
cause this approach would require following all data links (recursively) for answering a
query completely. Consequently, we introduce the notion of a reachability criterion that
supports an explicit specification of what data links should be followed.
Definition 9. Let T be the infinite set of all possible RDF triples and let P be the in-
finite set of all possible SPARQL expressions. A reachability criterion c is a (Turing)
computable function c : T × U × P → {true, false}.
An example for a reachability criterion is cAll which corresponds to the aforementioned
approach of allowing for arbitrary paths to reach LD documents; hence, for each tuple
(t, u, P) ∈ T × U × P it holds cAll(t, u, P) = true. The complement of cAll is cNone
which always returns false. Another example is cMatch which specifies the notion of
reachability that we use for link traversal based query execution [10,12].
cMatch(t, u, P) = true if there exists a triple pattern tp in P such that t matches tp,
and false otherwise,
where an RDF triple t = (x1, x2, x3) matches a triple pattern tp = (x̃1, x̃2, x̃3) if for
all i ∈ {1, 2, 3} it holds: if x̃i ∉ V, then x̃i = xi.
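A Python sketch of these criteria (an illustration, not the paper's formal machinery): variables are again '?'-prefixed strings, and a SPARQL expression is represented only by the set of its triple patterns, which is all cMatch inspects.

```python
# The reachability criteria c_All, c_None and c_Match as Python predicates.
# A criterion receives the triple t containing the link, the URI u of the link,
# and (here) the set of triple patterns of the query expression.

Triple = tuple[str, str, str]
TriplePattern = tuple[str, str, str]   # components may be "?"-prefixed variables

def matches(t: Triple, tp: TriplePattern) -> bool:
    """t matches tp if every non-variable component of tp equals the corresponding component of t."""
    return all(p.startswith("?") or p == x for p, x in zip(tp, t))

def c_all(t: Triple, u: str, patterns: set[TriplePattern]) -> bool:
    return True                                     # follow every data link

def c_none(t: Triple, u: str, patterns: set[TriplePattern]) -> bool:
    return False                                    # follow no data link

def c_match(t: Triple, u: str, patterns: set[TriplePattern]) -> bool:
    return any(matches(t, tp) for tp in patterns)   # follow links in matching triples only
```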
We call a reachability criterion c1 less restrictive than another criterion c2 if i) for
each (t, u, P) ∈ T × U × P for which c2(t, u, P) = true, it also holds that c1(t, u, P) = true,
and ii) there exists a (t′, u′, P′) ∈ T × U × P such that c1(t′, u′, P′) = true but
c2(t′, u′, P′) = false. It can be seen that cAll is the least restrictive criterion, whereas
cNone is the most restrictive criterion. We now define reachability of LD documents:
Definition 10. Let S ⊂ U be a finite set of seed URIs; let c be a reachability criterion;
let P be a SPARQL expression; and let W = (D, data, adoc) be a Web of Linked Data.
An LD document d ∈ D is (c, P )-reachable from S in W if either
1. there exists a URI u ∈ S such that adoc(u) = d; or
2. there exist d′ ∈ D, t ∈ data(d′), and u ∈ uris(t) such that i) d′ is (c, P)-reachable
from S in W, ii) adoc(u) = d, and iii) c(t, u, P) = true.
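Over a finite, fully materialized web (again represented by dictionaries), the set of documents that are (c, P)-reachable from the seed URIs can be computed by a simple traversal. The sketch below is only an illustration under that finiteness assumption; an LD machine cannot enumerate an infinite Web like this.

```python
from collections import deque

Triple = tuple[str, str, str]

def reachable_documents(seeds: set[str], c, patterns, adoc: dict[str, str],
                        data: dict[str, set[Triple]]) -> set[str]:
    """Documents (c, P)-reachable from the seed URIs S in a finite dict-based web.

    c(t, u, patterns) decides whether the data link via URI u in triple t is followed;
    only URI components for which adoc is defined constitute data links.
    """
    reached: set[str] = set()
    queue: deque[str] = deque()
    for u in seeds:                               # clause 1: documents of the seed URIs
        if u in adoc and adoc[u] not in reached:
            reached.add(adoc[u])
            queue.append(adoc[u])
    while queue:                                  # clause 2: follow qualifying data links
        d = queue.popleft()
        for t in data[d]:
            for u in t:                           # u ranges over uris(t) with adoc(u) defined
                if u in adoc and c(t, u, patterns) and adoc[u] not in reached:
                    reached.add(adoc[u])
                    queue.append(adoc[u])
    return reached
```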
Based on reachability of LD documents we define reachable parts of a Web of Linked
Data. Such a part is an induced subweb covering all reachable LD documents. Formally:
Definition 11. Let S ⊂ U be a finite set of URIs; let c be a reachability criterion; let
P be a SPARQL expression; and let W = (D, data, adoc) be a Web of Linked Data.
The (S, c, P)-reachable part of W, denoted by W_c^(S,P), is an induced subweb (D_R,
data_R, adoc_R) of W such that D_R = {d ∈ D | d is (c, P)-reachable from S in W}.
We now use the concept of reachable parts to define SPARQLLD(R) queries.
Definition 12. Let S ⊂ U be a finite set of URIs; let c be a reachability criterion; and
let P be a SPARQL expression. The SPARQLLD(R) query that uses P , S, and c, denoted
by Q_c^(P,S), is a Linked Data query that, for any Web of Linked Data W, is defined as
Q_c^(P,S)(W) = [[P]]_AllData(W_c^(S,P)) (where W_c^(S,P) is the (S, c, P)-reachable part of W).
As can be seen from Definition 12, our notion of SPARQLLD(R) consists of a family
of (reachability-based) query semantics, each of which is characterized by a certain
reachability criterion. Therefore, we refer to SPARQLLD(R) queries for which we use a
particular reachability criterion c as SPARQLLD(R) queries under c-semantics.
Definition 12 also shows that query results depend on the given set S ⊂ U of seed
URIs. It is easy to see that any SPARQLLD(R) query which uses an empty set of seed
URIs is not satisfiable and, thus, monotonic and finitely computable. We therefore con-
sider only nonempty sets of seed URIs in the remainder of this paper.
Since any SPARQLLD query over a finite Web of Linked Data has a finite result (cf.
Proposition 2), we use Proposition 3, case 2, to show the same for SPARQLLD(R) :
Proposition 4. The result of any SPARQLLD(R) query Q_c^(P,S) over a finite Web of Linked
Data W is finite; so is the (S, c, P)-reachable part of W.
For the case of an infinite Web of Linked Data the results of SPARQLLD(R) queries
may be either finite or infinite. In Example 4 we found the same heterogeneity for
SPARQLLD . However, for SPARQLLD(R) we may identify the following dependencies.
Proposition 5 provides valuable insight into the dependencies between reachability cri-
teria, the (in)finiteness of reachable parts of an infinite Web, and the (in)finiteness
of query results. In practice, however, we are primarily interested in answering two questions.
Before we may come back to the aforementioned disparity, we focus on the computabil-
ity of SPARQLLD(R) queries. We first show the following, noteworthy result.
Definition 13. A reachability criterion c ensures finiteness if for any Web of Linked
Data W , any (finite) set S ⊂ U of seed URIs, and any SPARQL expression P , the
(S, c, P )-reachable part of W is finite.
6 Conclusions
Our investigation of SPARQL as a language for Linked Data queries reveals the fol-
lowing main results. Some special cases aside, the computability of queries under any
of the studied semantics is limited and no guarantee for termination can be given. For
reachability-based semantics it is at least possible that some of the (non-special case)
query computations terminate; although, in general it is undecidable which. As a conse-
quence, any SPARQL-based query system for Linked Data on the Web must be prepared
for query executions that discover an infinite amount of data and that do not terminate.
Our results also show that –for reachability-based semantics– the aforementioned
issues must be attributed to the possibility for infiniteness in the queried Web (which is
a result of data generating servers). Therefore, it seems worthwhile to study approaches
for detecting whether the execution of a SPARQLLD(R) query traverses an infinite path
in the queried Web. However, the mentioned issues may also be addressed by another,
alternative well-defined semantics that restricts the scope of queries even further (or
differently) than our reachability-based semantics. It remains an open question how
such an alternative may still allow for queries that tap the full potential of the Web.
We also show that computability depends on satisfiability and monotonicity and that
for (almost all) SPARQL-based Linked Data queries, these two properties directly cor-
respond to the same property for the used SPARQL expression. While Arenas and Pérez
show that the core fragment of SPARQL without OPT is monotonic [3], it requires fur-
ther work to identify (non-)satisfiable and (non-)monotonic fragments and, thus, enable
an explicit classification of SPARQL-based Linked Data queries w.r.t. computability.
References
1. Abiteboul, S., Vianu, V.: Queries and computation on the web. Theoretical Computer Sci-
ence 239(2) (2000)
2. Angles, R., Gutierrez, C.: The Expressive Power of SPARQL. In: Sheth, A.P., Staab, S.,
Dean, M., Paolucci, M., Maynard, D., Finin, T., Thirunarayan, K. (eds.) ISWC 2008. LNCS,
vol. 5318, pp. 114–129. Springer, Heidelberg (2008)
3. Arenas, M., Pérez, J.: Querying Semantic Web Data with SPARQL. In: PODS (2011)
4. Auer, S., Lehmann, J., Hellmann, S.: LinkedGeoData: Adding a Spatial Dimension to the
Web of Data. In: Bernstein, A., Karger, D.R., Heath, T., Feigenbaum, L., Maynard, D., Motta,
E., Thirunarayan, K. (eds.) ISWC 2009. LNCS, vol. 5823, pp. 731–746. Springer, Heidelberg
(2009)
5. Berners-Lee, T.: Linked Data (2006),
http://www.w3.org/DesignIssues/LinkedData.html
6. Bizer, C., Heath, T., Berners-Lee, T.: Linked Data – the story so far. Journal on Semantic
Web and Information Systems 5(3) (2009)
7. Bouquet, P., Ghidini, C., Serafini, L.: Querying the Web of Data: A Formal Approach. In:
Gómez-Pérez, A., Yu, Y., Ding, Y. (eds.) ASWC 2009. LNCS, vol. 5926, pp. 291–305.
Springer, Heidelberg (2009)
8. Florescu, D., Levy, A.Y., Mendelzon, A.O.: Database techniques for the world-wide web: A
survey. SIGMOD Record 27(3) (1998)
9. Harth, A., Hose, K., Karnstedt, M., Polleres, A., Sattler, K.-U., Umbrich, J.: Data Summaries
for On-Demand Queries over Linked Data. In: WWW (2010)
10. Hartig, O.: Zero-Knowledge Query Planning for an Iterator Implementation of Link Traver-
sal Based Query Execution. In: Antoniou, G., Grobelnik, M., Simperl, E., Parsia, B.,
Plexousakis, D., De Leenheer, P., Pan, J. (eds.) ESWC 2011, Part I. LNCS, vol. 6643,
pp. 154–169. Springer, Heidelberg (2011)
11. Hartig, O.: SPARQL for a Web of Linked Data: Semantics and Computability (Extended
Version). CoRR abs/1203.1569 (2012), http://arxiv.org/abs/1203.1569
12. Hartig, O., Bizer, C., Freytag, J.-C.: Executing SPARQL Queries over the Web of Linked
Data. In: Bernstein, A., Karger, D.R., Heath, T., Feigenbaum, L., Maynard, D., Motta, E.,
Thirunarayan, K. (eds.) ISWC 2009. LNCS, vol. 5823, pp. 293–309. Springer, Heidelberg
(2009)
13. Ladwig, G., Tran, T.: Linked Data Query Processing Strategies. In: Patel-Schneider, P.F.,
Pan, Y., Hitzler, P., Mika, P., Zhang, L., Pan, J.Z., Horrocks, I., Glimm, B. (eds.) ISWC 2010,
Part I. LNCS, vol. 6496, pp. 453–469. Springer, Heidelberg (2010)
14. Ladwig, G., Tran, T.: SIHJoin: Querying Remote and Local Linked Data. In: Antoniou, G.,
Grobelnik, M., Simperl, E., Parsia, B., Plexousakis, D., De Leenheer, P., Pan, J. (eds.) ESWC
2011, Part I. LNCS, vol. 6643, pp. 139–153. Springer, Heidelberg (2011)
15. Mendelzon, A.O., Milo, T.: Formal models of web queries. Inf. Systems 23(8) (1998)
16. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of SPARQL. ACM Transac-
tions on Database Systems 34(3) (2009)
17. Picalausa, F., Vansummeren, S.: What are real SPARQL queries like? In: SWIM (2011)
18. Schmidt, M., Meier, M., Lausen, G.: Foundations of SPARQL query optimization. In: Proc. of
the 13th Int. Conference on Database Theory, ICDT (2010)
19. Vrandecić, D., Krötzsch, M., Rudolph, S., Lösch, U.: Leveraging non-lexical knowledge for
the linked open data web. In: RAFT (2010)
Linked Data-Based Concept Recommendation:
Comparison of Different Methods
in Open Innovation Scenario
D. Damljanovic, M. Stankovic, and P. Laublet
1 Introduction
The ability to innovate is essential to the economic wellbeing, growth and survival of
most companies, especially when market competition becomes strong. With the
global economic uncertainties of recent years, companies and innovation experts
have started to question the old innovation models and to seek new, more efficient ones. The
paradigm of Open Innovation (OI) [1] has been proposed as a way to outsource innovation
and to seek solutions to R&D problems outside the company and its usual network
of collaborators. OI is intended to leverage existing knowledge and ideas that the
company is unaware of, and to democratize the process of innovation. In recent
years, one interesting realisation of OI is the one that encourages innovation
to emerge over the Web. This realization is the core business of companies such as
Hypios.com, Innocentive.com and NineSigma.com, which provide Web innovation
platforms where companies with R&D needs can post problems and find innovative
solutions. The companies looking to innovate, called seekers, represent their
R&D needs through an innovation problem statement describing the context of the
problem to be solved. Such a statement is then published on a problem-solving platform.
Experts, called solvers, then submit their solutions. The seeker then selects the
best contribution and acquires the rights to use it, often in exchange for a prize to the
solver and any other due fees.
Identifying potential solvers and broadcasting problems to their attention
is already used by Web innovation platforms to boost problem-solving activity
[2]. In our previous work [3] we developed a method for solver finding that leverages
the user traces (e.g., blogs, publications, presentations) available in Linked Data.
However, finding users with expertise in the problem topics is often not good
enough, as Web innovation platforms also seek greater diversity in solutions, both in
terms of the domains of knowledge they come from and in terms of different
perspectives on the problem. Existing OI research strongly argues [4] that truly
innovative solutions often come from solvers whose competence is not in the topics
directly found in the problem description, but rather from those who are experts in a
different domain and can transfer the knowledge from one domain to another. One
way to identify and involve such lateral solvers is to search for the concepts lateral to
the problem. Such concepts then might be contained in the user profiles of experts
likely to submit solutions, or in the possibly existing solutions in the form of research
publications or patents. The key challenge thus comes down to the identification of
expertise topics, directly and laterally related to the problem in question.
With the emergence of the Linked Open Data (LOD) project1, which continues
to stimulate the creation, publication and interlinking of RDF graphs with those already
in the LOD cloud, the amount of published triples had grown to 31 billion by 2011 and continues
to grow. The value of Linked Data lies in the large number of concepts and relations
between them that are made explicit and can hence be used to infer relations more
effectively than deriving the same kind of relations from text. We propose two
independently developed methods for topic discovery based on Linked Data. The
first method, called hyProximity, is a structure-based similarity measure that explores
different strategies based on the semantics inherent in an RDF graph, while the
second one, Random Indexing, applies a well-known statistical semantics technique from
Information Retrieval to RDF, in order to identify the relevant set of both direct and
lateral topics. As the baseline we use the state-of-the-art adWords keyword recommender
from Google, which finds similar topics based on their distribution in textual
corpora and in corpora of search queries. We evaluate the performance of these methods
against a ‘gold standard’ created from the solution descriptions submitted to Hypios in the
last year. In addition, we conduct a user study aimed at gaining a
more fine-grained insight into the nature of the generated recommendations.
1 http://linkeddata.org/
In this section we discuss existing measures of semantic relatedness and systems
that use them in different scenarios, including concept recommendation, followed by
approaches that use Linked Data.
Legacy Approaches: Although our focus is the semantic relatedness of concepts, our
challenge is quite similar to term recommendation, which has been studied for decades.
Semantically related terms have been used to help users choose the right tags in collaborative
filtering systems [5]; to discover alternative search queries [6]; for query
refinement [7]; to enhance expert finding results [8]; for ontology maintenance [9],
[10]; and in many other scenarios. Different techniques and sources are used
and combined to develop Measures of Semantic Relatedness (MSRs). These measures
can be split into two major categories: 1) graph-based measures and 2) distributional
measures. In what follows we briefly examine each category of MSRs.
Graph-based measures make use of semantics (e.g., hyponymy or meronymy)
and/or lexical relationships (e.g., synonyms) within a graph to determine semantic
proximity between the concepts. For example, [11] exploits the hypernym graphs of
WordNet2, [7] uses Galois lattices to provide recommendations based on domain
ontologies, whereas [12] uses the ODP taxonomy3. Some approaches (e.g. [10]) rely on
the graph of Wikipedia categories to provide recommendations. Different approaches
use different graph measures to calculate the semantic proximity of concepts. Shortest
path is among the most common of such measures. It is often enhanced by taking into
account the information content of the graph nodes [13]. To the best of our knowledge,
these approaches have not been applied to knowledge bases of a size and richness
comparable to that of DBpedia4. Even the Wikipedia-based measures (e.g. [10]) do not go
beyond exploring categories, nor do they leverage the rich information inherent in
DBpedia. The MSR that we propose in this paper builds upon the existing graph-based
measures but is highly adapted to the rich structure of Linked Data sources, as it leve-
rages different types of relations between the concepts in the graph.
Distributional measures rely on the distributional properties of words in large text
corpora. Such MSRs deduce semantic relatedness by leveraging co-occurrences of
concepts. For example, the approach presented in [14] uses co-occurrence in research
papers, weighted with a function derived from the tf-idf measure [15], to establish a
notion of word proximity. Co-occurrence in tags [5] and in search results [16] is also
commonly used. In [17], the authors introduce Normalized Web Distance (NWD) as a
generalization of the Normalized Google Distance (NGD) [16] MSR and investigate its
performance with six different search engines. Their evaluation (based on the correlation
with human judgment) showed the best performance for the Exalead-based
NWD measure, closely followed by Yahoo!, Altavista, Ask and Google. A distributional
measure applied to a task similar to ours is considered in [8], where using
2 http://wordnet.princeton.edu/
3 http://www.dmoz.org
4 While DBpedia contains more than 3.5 million concepts, the current version of WordNet has 206,941 word-sense pairs, and ODP has half a million categories.
• Hierarchical links: The properties that help to organize the concepts based on
their types (e.g., rdf:type5 and rdfs:subClassOf) or categories (e.g., dcterms:subject
and skos:broader). The links created by those properties connect a concept to a
category concept – the one serving to organize other concepts into classes.
• Transversal links: The properties that connect concepts without the aim to estab-
lish a classification or hierarchy. The majority of properties belong to this group,
and they create direct and indirect links between ordinary, non-category concepts.
In our concept discovery we will treat the two types of links differently, due to their
different nature, and we will devise three different approaches in order to be able to
work with different data sets that might or might not contain both types of links. An
early version of our approach treating hierarchical links only is presented in [22].
hyProximity(c, IC) = Σ_{ci ∈ IC} dv(c, ci)    (1)
The hyProximity of a concept c to the set of initial concepts IC is the sum of the
distance values between the concept c and each concept ci from the
set of initial seed concepts IC. The distance value between the concept c and an initial
concept ci, denoted dv(c, ci), is inversely proportional to the value of a chosen
distance function, i.e. dv(c, ci) = p(c, ci)/d(c, ci). Different distance functions d(c, ci)
and ponderation functions p(c, ci) can be used, and we will describe some of them in
the remainder of this paper. The calculation of hyProximity can be performed using
5 All the prefixes used in this paper can be looked up at http://prefix.cc
Algorithm 1. The generation of concept candidates, as well as the distance value
function, depends on the exploration strategy used. In the following subsections we
present a variety of strategies.
Algorithm 1.
1. get initial topic concepts IC
2. for each seed concept c in IC:
a. while distance_level++ < maxLevel:
i. generate concept candidates for the current distance_level
ii. for each concept candidate ci:
1. value(ci) = dv(c,ci)
2. get previousValue(ci) from Results
3. put <ci, previousValue(ci)+value(ci)> to Results
3. sort Results in decreasing order of hyProximity
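To make the procedure concrete, the following Python sketch implements the accumulation and sorting steps of Algorithm 1; the callables generate_candidates and distance_value stand in for the strategy-specific candidate generation and dv(c, ci) computation described below and are our assumptions, not code from the paper.

from collections import defaultdict

def hyproximity(initial_concepts, generate_candidates, distance_value, max_level=2):
    """Accumulate hyProximity scores for concepts related to the seed set IC.

    generate_candidates(seed, level): candidate concepts at the given distance level (assumed).
    distance_value(seed, candidate, level): dv(seed, candidate) = p/d as defined above (assumed).
    """
    results = defaultdict(float)
    for seed in initial_concepts:                                  # step 2: each seed concept c in IC
        for level in range(1, max_level + 1):                      # step 2a: distance levels up to maxLevel
            for candidate in generate_candidates(seed, level):     # step 2a.i: candidate generation
                results[candidate] += distance_value(seed, candidate, level)  # steps 2a.ii.1-3
    # step 3: sort results in decreasing order of accumulated hyProximity
    return sorted(results.items(), key=lambda item: item[1], reverse=True)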
In finding candidate concepts using the hierarchical links, we can distinguish several
ways to calculate distances. Our previous studies [22] allowed us to isolate one
particular function that gives the best results, and we will use it here. Figure 1 represents
an example graph of concepts (black nodes) and their categories/types6 (white nodes),
and it will help us illustrate the distance function. Our hierarchical distance function
considers all the non-category concepts that share a common category with x (in the
case of our example – only the concept b) to be at distance 1. To find candidate con-
cepts at distance n, we consider each category connected to the starting concept (x)
over n links, and find all concepts connected to it over any of its subcategories. In our
example, this approach would lead to considering {b,c,d} as being at distance 2 from
x. Different ponderation schemes can be used along with the distance functions. A
standard choice in graph distance functions is to use the informational content [13] of
the category (-log(p), where p is the probability of finding the category in the graph of
DBpedia categories when going from the bottom up) as a ponderation function. Applied to
our case, the ponderation function p(c, ci) takes as its value the informational
content of the first category over which one may find c when starting from ci.
6 For the sake of simplicity, we will refer to both categories and types, as well as other possible grouping relations used to construct a hierarchy, as categories.
As the higher level categories normally have lower informational content, this func-
tion naturally gives higher hyProximity values to concept candidates found over cate-
gories closer to the initial concepts.
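As a hedged illustration of this ponderation, the snippet below computes the informational content -log(p) of a category from assumed counts of the concepts it covers; the numbers are toy values, not statistics from DBpedia.

import math

def information_content(category, concepts_under_category, total_concepts):
    """-log(p), where p approximates the probability of reaching `category` when
    walking the category graph bottom-up; broad categories get low scores and
    therefore contribute less to hyProximity."""
    return -math.log(concepts_under_category[category] / total_concepts)

# Toy counts (assumptions): a broad category covers many concepts, a narrow one few.
counts = {"Category:Technology": 500_000, "Category:Lithium-ion_batteries": 120}
for category in counts:
    print(category, round(information_content(category, counts, 3_500_000), 3))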
Mixed Distance Function. The mixed distance function assigns the distance n to all
the concepts found at distance n by the hierarchical function and to those found at
the same distance by the transversal function.
Latent Semantic Analysis (LSA) [23] is one of the pioneering methods for automatically
finding contextually related words. The assumption behind this and other statistical
semantics methods is that words which appear in similar contexts (with the same set
of other words) are synonyms. Synonyms tend not to co-occur with one another directly,
so indirect inference is required to draw associations between words used to
express the same idea [19]. This method has been shown to approximate human performance
in many cognitive tasks such as the Test of English as a Foreign Language
(TOEFL) synonym test, the grading of content-based essays and the categorisation of
groups of concepts (see [19]). However, one problem with this method is scalability:
it starts by generating a term x document matrix which grows with the number of
terms and the number of documents and thus becomes very large for large corpora.
To find the final LSA model, Singular Value Decomposition (SVD) and subsequent
dimensionality reduction are commonly used. This technique requires the
factorization of the term-document matrix, which is computationally costly. Moreover,
the LSA model cannot easily and efficiently be computed in an incremental or
out-of-memory fashion. The Random Indexing (RI) method [18] circumvents these
problems by avoiding the need for matrix factorization in the first place. RI can be
seen as an approximation of LSA which has been shown to reach similar results
(see [24] and [25]). RI can be updated incrementally and, in addition, the term x document
matrix does not have to be loaded into memory at once – loading one row at a time is
enough for computing context vectors. Instead of starting with the full term x document
matrix and then reducing its dimensionality, RI starts by creating almost orthogonal
random vectors (index vectors) for each document. Such a random vector is
created by setting a certain number of randomly selected dimensions to either +1 or
-1. Each term is represented by a vector (term vector) which is a combination of all
index vectors of the documents in which it appears. For an object consisting of multiple
terms (e.g. a document or a search query with several terms), the vector of the
object is the combination of the term vectors of its terms.
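A minimal Random Indexing sketch over tokenised documents is given below; the vector dimensionality and the number of non-zero seed entries are arbitrary choices for illustration, not values reported by the authors.

import numpy as np

def random_index_vector(dim=512, seeds=10, rng=None):
    """Sparse, nearly orthogonal index vector: a few randomly placed +1/-1 entries."""
    rng = rng or np.random.default_rng()
    vec = np.zeros(dim)
    positions = rng.choice(dim, size=seeds, replace=False)
    vec[positions] = rng.choice([-1.0, 1.0], size=seeds)
    return vec

def build_term_vectors(documents, dim=512):
    """Each term vector is the sum of the index vectors of the documents it appears in."""
    rng = np.random.default_rng(42)
    index_vectors = {doc_id: random_index_vector(dim, rng=rng) for doc_id in documents}
    term_vectors = {}
    for doc_id, tokens in documents.items():
        for token in tokens:
            term_vectors.setdefault(token, np.zeros(dim))
            term_vectors[token] += index_vectors[doc_id]
    return index_vectors, term_vectors

# Toy corpus (assumption): two documents sharing one term.
docs = {"doc1": ["sensor", "battery"], "doc2": ["battery", "charging"]}
index_vectors, term_vectors = build_term_vectors(docs)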
In order to apply RI to an RDF graph, we first generate a set of documents which
represent this graph, by generating one virtual document for each URI in the graph.
Then, we generate a semantic index from the virtual documents. This semantic index
is then searched in order to retrieve similar literals/URIs. Virtual documents can
be of different depths; in the simplest case, for a representative URI S, a virtual
document of depth one is the set of triples where S is the subject – in addition, if any object
in this set of triples is a URI, we also include all triples where that URI is the subject
and the object is a literal. The reason for this is that literals such as labels are
often used to describe URIs. A sample virtual document of depth one is shown in
Figure 2, where the graph is first expanded down one level from node S. Further on,
we also expand the graph from nodes O1 and O2 to include only those statements
whose objects are literals. A sample row that will be added to the term x document
matrix is illustrated in Table 1.
Fig. 2. From a representative subgraph to the virtual document for URI S: L - literals, O - non-
literal objects (URIs), P - RDF properties
Table 1. A sample row in the term x document matrix for the virtual document in Figure 2. The
number of documents is equal to the number of URIs in the graph, and the number of terms is
equal to the number of URIs and literals.
S P1 .. P10 L1 .. L8 O1 O2
S 10 1 .. 1 1 .. 1 3 4
.. .. .. .. .. .. .. .. .. ..
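A hedged sketch of the depth-one virtual document construction using rdflib follows; the example graph, the property URIs and the choice to tokenise properties and objects as strings are our assumptions.

from rdflib import Graph, URIRef, Literal

def virtual_document(graph, uri):
    """Depth-one virtual document for `uri`: all triples with `uri` as subject plus,
    for every URI object, the triples where that object is the subject and the
    object is a literal (labels and similar descriptive text)."""
    tokens = []
    for _, p, o in graph.triples((uri, None, None)):
        tokens.extend([str(p), str(o)])
        if isinstance(o, URIRef):
            for _, p2, o2 in graph.triples((o, None, None)):
                if isinstance(o2, Literal):
                    tokens.extend([str(p2), str(o2)])
    return tokens

# Toy graph (assumed data): S --P1--> O1, and O1 has a label literal.
g = Graph()
s, o1 = URIRef("http://example.org/S"), URIRef("http://example.org/O1")
g.add((s, URIRef("http://example.org/P1"), o1))
g.add((o1, URIRef("http://www.w3.org/2000/01/rdf-schema#label"), Literal("Object one")))
print(virtual_document(g, s))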
Traditionally, the semantic index captures the similarity of terms based on their
contextual distribution in a large document collection, and the similarity between
documents based on the similarities of the terms contained within. By creating a se-
mantic index for an RDF graph, we are able to determine contextual similarities between
graph nodes (e.g., URIs and literals) based on their neighbourhood – if two
nodes are related to a similar set of other nodes, they will appear as contextually
related according to the semantic index. We use the cosine function to calculate the
similarity between the input term (literal or URI) vector and the existing vectors in the
generated semantic index (vector space model). While the generated semantic index
can be used to calculate similarities between all combinations of term/document-
term/document, we focus on document-document search only: suggesting a set of
representative URIs related to a set of seed URIs or ICs.
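The document-document search can be sketched as below, assuming each virtual document (URI) already has a vector in the semantic index; combining the seed vectors by summation is our assumption.

import numpy as np

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def recommend_uris(seed_uris, doc_vectors, top_n=10):
    """Rank all indexed URIs by cosine similarity to the combined seed vector."""
    query = np.sum([doc_vectors[uri] for uri in seed_uris], axis=0)
    ranked = [(uri, cosine(query, vec)) for uri, vec in doc_vectors.items()
              if uri not in seed_uris]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)[:top_n]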
• Extract problem URIs. We took the 26 problem descriptions and extracted their
key concepts using a natural language processing service that links the key con-
cepts in a given English text to DBpedia entities. We use Zemanta7 for this extraction,
but other services such as OpenCalais8 or DBpedia Spotlight9 may also be
used (a sketch of such a call is shown after this list). This service has been shown to perform well for the task of recognizing
Linked Data entities from text in recent evaluations [26].
• Extract solution URIs. For each problem we collected the submitted solutions
(142 in total) and extracted the key concepts in the same way we did for the problem texts.
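For illustration, the sketch below extracts DBpedia URIs from a problem text via the public DBpedia Spotlight endpoint mentioned in the footnotes; the endpoint URL, parameters and response fields reflect the current public service and are assumptions here, since the authors used Zemanta.

import requests

def extract_dbpedia_uris(text, confidence=0.4):
    """Annotate free text and return the DBpedia resource URIs found in it."""
    response = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",   # public endpoint (assumption)
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    return [resource["@URI"] for resource in response.json().get("Resources", [])]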
The key concepts extracted by Zemanta were not verified by human users. While in
the case of key concept extraction from problems this verification was feasible, in the
case of solutions it was not, as it would violate the confidentiality agreement. We
therefore had to work with automatically extracted and non-validated concepts, trust-
ing that Zemanta’s error rate would not affect the correctness of our further study, and
that the impact of potential errors would affect all approaches equally. Note
that when evaluating the baseline, we did not need to extract the key concepts, as the
Google Keyword tool would generate a set of keywords that we could then compare
to the words in the submitted solutions without any need for linking them to URIs.
As the baseline we used the Google AdWords Keyword Tool10. This tool is a good
candidate for a baseline because it is a state-of-the-art commercial tool applying some of
the best Information Retrieval practices to text. In the legacy platform that Hypios uses
for finding solvers, such a tool plays a crucial role, as it is capable of suggesting up
to 600 similar terms which can then be used to search for solvers. This large number
7 developer.zemanta.com
8 http://www.opencalais.com/
9 http://dbpedia.org/spotlight
10 https://adwords.google.com/select/KeywordToolExternal
of suggested terms is important for the task of Web crawling in order to find relevant
experts. Hypios crawls the Web in order to identify and extract expert information
and thus enrich its existing database of experts. Google AdWords is also widely used
in tasks with similar purposes, such as placing adverts relevant to the page consumers
are viewing. Using concept-ranking methods inspired by Linked Data, our aim is to improve
on this baseline; our hypothesis is that the Linked Data-based similarity metrics described
in this paper can do so. In what follows we detail the experiments conducted to test this hypothesis.
4.1 Results
We took the key concepts extracted from the problems, and fed them to our methods
and to the baseline system, which all generated an independent set of recommended
concepts. We then calculated the performance for each method by comparing the
results with those collected in the gold standard. The results, shown in Figure 3, indi-
cate that the mixed hyProximity measure performs best with regard to precision. This
measure should therefore be used in the end-user applications, as the users can typi-
cally consult only a limited number of top-ranked suggestions. With regard to recall,
Random Indexing outperforms the other approaches for 200 top-ranked suggestions.
It is especially useful in cases where it is possible to consider a large number of
suggestions that include false positives, such as when the keyword suggestions
are used for expert crawling. The balanced F-measure indicates that the
transversal hyProximity method might be the best choice when precision and recall
are equally important and fewer than 350 suggestions are considered; beyond this threshold the
mixed hyProximity is a better choice. The hyProximity measures improve the baseline
across all performance measures, while Random Indexing improves it only with regard
to recall and F-measure for fewer than 200 suggestions. The significance of the
differences between each pair of methods is confirmed by a paired T-test (p<0.05).
Fig. 3. Comparison of methods: precision (top-left), recall (top-right), F-measure (bottom left).
On x axis: the number of suggestions provided by the systems.
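The gold-standard comparison described above can be sketched as follows; the function computes precision, recall and the balanced F-measure for the top-k suggestions, with variable names of our own choosing.

def precision_recall_f1(suggested, gold, k):
    """Precision, recall and balanced F-measure of the top-k suggested concepts
    against the set of gold-standard concepts."""
    top_k = suggested[:k]
    hits = len(set(top_k) & set(gold))
    precision = hits / k if k else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1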
The relatively low precision and recall scores for all methods, including the
baseline, can be explained by the fact that our ‘gold standard’ is not complete: some
concepts might not appear in solutions, even if relevant, as not all relevant experts
were motivated to propose a solution. This is a natural consequence of the difficulty
of the task. However, our evaluation with such an incomplete dataset still gives an
insight into different flavors of our similarity measures, and to compensate for this
incompleteness, we conduct a user-centric study in order to test the quality of the
generated suggestions.
5 User Evaluation
We conducted a user study in order to cover the aspects of the methods’ performance
that could not have been covered by the previous evaluation. The reason is that rely-
ing on the solutions received for a particular problem gives insight into a portion of
the relevant topics only, as some correct and legitimate solutions might not have been
submitted due to the lack of interest in the problem prize, and in such cases our gold
standard would not take such topics into account. Further on, the user study allowed a
more fine-grained view on the quality of recommendations, as we focused on the
following two aspects:
• Relevancy: the degree to which a concept suggestion is relevant to the given innovation
problem, in the sense that the concept might lead to a potential solver of a solution
to this problem if used in the expert search. We used a scale from 1 to 5: (1)
extremely irrelevant, (2) irrelevant, (3) not sure, (4) relevant, (5) extremely relevant.
• Unexpectedness: the degree of unexpectedness of a concept suggestion for the
user evaluator, on a scale from 1 to 5: (1) evident - suggestions such as those that appear
in the problem description; (2) easy - suggestions that the user would have easily
thought of based on the initial seed concepts; (3) neutral; (4) unexpected - keywords
that the user would not have thought of in the given context, although the
concept is known to them; (5) new unexpected - keywords that were unknown to
the user, who had to look up their meaning in a dictionary or encyclopedia.
Suggestions being both relevant and unexpected would represent the most valuable
discoveries for the user in the innovation process, and a good concept recommenda-
tion system for this use case should be capable of providing such suggestions.
Twelve users familiar with OI scenarios (employees of OI companies and PhD stu-
dents in OI-related fields) participated in the study. They were asked to choose a sub-
set of innovation problems from the past practice of hypios.com and evaluate the
recommended concepts. This generated a total of 34 problem evaluations, consisting
of 3060 suggested concepts/keywords. For each chosen innovation problem, the evaluators
were presented with the lists of the 30 top-ranked suggestions generated by adWords,
hyProximity (mixed approach) and Random Indexing. We then asked them to
rate the relevancy and unexpectedness of the suggestions using the scales described above.
The choice of our subjects was based on two criteria: their ability to judge
relevancy in this particular sense came from their experience with OI problems, and
at the same time they were not domain experts but had rather general knowledge, so
the topics they judged as unexpected would most likely also be unexpected
for an average innovation seeker from a client company.
Table 2. Average relevance and unexpectedness ratings per method

Measure                          adWords      hyProximity (mixed)  Random Indexing
Relevance                        2.930±0.22   3.693±0.23           3.330±0.25
Unexpectedness                   2.859±0.16   2.877±0.25           3.052±0.22
Unexpectedness (relevancy >= 4)  2.472±0.31   2.542±0.36           2.635±0.36
Unexpectedness (relevancy = 5)   1.760±0.22   1.842±0.31           1.767±0.36
As shown in Table 2, the Linked Data measures outperform the baseline system
across all criteria. While hyProximity scores best considering the general relevance of
suggestions in isolation, Random Indexing scores best in terms of unexpectedness.
With regard to the unexpectedness of the highly relevant results (relevancy >= 4), Random
Indexing outperforms the other systems; however, hyProximity offers slightly
more unexpected suggestions if we consider only the most relevant results (relevancy = 5).
We tested the differences in relevance for all methods using the paired T-test
over subjects' individual means, and the tests indicated that the difference in relevance
between each pair is significant (p < 0.05). The difference in unexpectedness is signifi-
cant only in the case of Random Indexing vs. baseline. This demonstrates the real abili-
ty of Linked Data-based systems to provide the user with valuable relevant concepts.
In the follow-up study, we asked the raters to describe, in their own words, the suggestions
they were presented with from each system (identified as System 1, 2, and 3).
The adjectives most commonly used to describe adWords suggestions were “redundant”
and “Web-oriented”. This indeed corresponds to the fact that the system is not fully
adapted to the OI scenario, but also to the fact that it is based on a statistical approach,
which is more influenced by the statistical properties of Web content than by the
meaning of things. HyProximity suggestions were most commonly described as “really
interesting” and “OI-oriented”, while the suggestions of Random Indexing were
most often characterized as “very general”. Depending on the user’s preference for more
general or more specific concepts, it is therefore possible to advise which of the two
methods is more suitable for a specific use case.
To illustrate the qualitative aspects of the suggestions, we provide an example of concept
suggestions from all three systems on our website11.
6 Conclusion
We presented two Linked Data-based concept recommendation methods and evaluated
them against a state-of-the-art Information Retrieval approach which served
11 http://research.hypios.com/?page_id=165
as our baseline. We argue that our methods are suitable in an Open Innovation scena-
rio where the suggested concepts are used to find potential solvers for a given prob-
lem. Our results show that both proposed methods improve the baseline in different
ways, thus suggesting that Linked Data can be a valuable source of knowledge for the
task of concept recommendation. The gold standard-based evaluation reveals a supe-
rior performance of hyProximity in cases where precision is preferred; Random
Indexing performed better in case of recall. In addition, our user study evaluation
confirmed the superior performance of Linked Data-based approaches both in terms
of relevance and unexpectedness. The unexpectedness of the most relevant results
was also higher with the Linked Data-based measures. Users also indicated that Ran-
dom Indexing provided more general suggestions, while those provided by hyProxim-
ity were more granular. Therefore, these two methods can be seen as complementary,
and in our future work we will consider combining them, as their different natures seem
to have the potential to improve the properties of the query process.
References
1. Chesbrough, H.W.: Open Innovation: The New Imperative for Creating and Profiting from
Technology. Harvard Business Press (2003)
2. Speidel, K.-P.: Problem-Description in Open Problem-Solving. How to overcome Cogni-
tive and Psychological Roadblocks. In: Sloane, P. (ed.) A Guide to Open Innovation and
Crowdsourcing. Advice from Leading Experts. KoganPage, London (2011)
3. Stankovic, M., Jovanovic, J., Laublet, P.: Linked Data Metrics for Flexible Expert Search
on the Open Web. In: Antoniou, G., Grobelnik, M., Simperl, E., Parsia, B., Plexousakis,
D., De Leenheer, P., Pan, J. (eds.) ESWC 2011, Part I. LNCS, vol. 6643, pp. 108–123.
Springer, Heidelberg (2011)
4. Jeppesen, L.B., Lakhani, K.R.: Marginality and Problem Solving Effectiveness in Broad-
cast Research. Organization Science 20 (2009)
5. Sigurbjörnsson, B., van Zwol, R.: Flickr tag recommendation based on collective know-
ledge. In: Proceeding of the 17th International Conference on World Wide Web, WWW
2008, p. 327. ACM Press, New York (2008), doi:10.1145/1367497.1367542
6. Mei, Q., Zhou, D., Church, K.: Query suggestion using hitting time. In: Proceedings of the
17th ACM Conference on Information and Knowledge Management - CIKM 2008, New York,
USA, p. 469 (2008)
7. Safar, B., Kefi, H.: OntoRefiner, a user query refinement interface usable for Semantic
Web Portals. In: Proceedings of Application of Semantic Web Technologies to Web
Communities Workshop, ECAI 2004, pp. 65–79 (2004)
8. Macdonald, C., Ounis, I.: Expertise drift and query expansion in expert search. In: Pro-
ceedings of the Sixteenth ACM Conference on Information and Knowledge Management -
CIKM 2007, p. 341. ACM Press, New York (2007)
9. Cross, V.: Semantic Relatedness Measures in Ontologies Using Information Content and
Fuzzy Set Theory. In: Proc. of the 14th IEEE Int’l Conf. on Fuzzy Systems, pp. 114–119
(2005)
10. Gasevic, D., Zouaq, A., Torniai, C., Jovanovic, J., Hatala, M.: An Approach to Folksono-
my-based Ontology Maintenance for Learning Environments. IEEE Transactions on
Learning Technologies (2011) (in press)
11. Burton-Jones, A., Storey, V.C., Sugumaran, V., Purao, S.: A Heuristic-Based Methodolo-
gy for Semantic Augmentation of User Queries on the Web. In: Song, I.-Y., Liddle, S.W.,
Ling, T.-W., Scheuermann, P. (eds.) ER 2003. LNCS, vol. 2813, pp. 476–489. Springer,
Heidelberg (2003)
12. Ziegler, C.-N., Simon, K., Lausen, G.: Automatic Computation of Semantic Proximity Using
Taxonomic Knowledge. In: Proceedings of the
15th ACM International Conference on Information and Knowledge Management, CIKM
2006, Arlington, Virginia, USA, pp. 465–474. ACM, New York (2006)
13. Resnik, P.: Using Information Content to Evaluate Semantic Similarity in a Taxonomy
(1995)
14. Matos, S., Arrais, J.P., Maia-Rodrigues, J., Oliveira, J.L.: Concept-based query expansion
for retrieving gene related publications from MEDLINE. BMC Bioinformatics (2010)
15. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, New
York (1983)
16. Cilibrasi, R.L., Vitanyi, P.M.B.: The Google Similarity Distance. IEEE Transactions on
Knowledge and Data Engineering 19(3), 370–383 (2007), doi:10.1109/TKDE.2007.48
17. Gracia, J., Mena, E.: Web-Based Measure of Semantic Relatedness. In: Bailey, J., Maier,
D., Schewe, K.-D., Thalheim, B., Wang, X.S. (eds.) WISE 2008. LNCS, vol. 5175, pp.
136–150. Springer, Heidelberg (2008)
18. Sahlgren, M.: An introduction to random indexing. In: Methods and Applications of Se-
mantic Indexing Workshop at the 7th International Conference on Terminology and
Knowledge Engineering, TKE 2005 (2005)
19. Cohen, T., Schvaneveldt, R., Widdows, D.: Reflective random indexing and indirect infe-
rence: A scalable method for discovery of implicit connections. Journal of Biomedical In-
formatics (2009)
20. Passant, A.: dbrec — Music Recommendations Using DBpedia. In: Patel-Schneider, P.F.,
Pan, Y., Hitzler, P., Mika, P., Zhang, L., Pan, J.Z., Horrocks, I., Glimm, B. (eds.) ISWC
2010, Part II. LNCS, vol. 6497, pp. 209–224. Springer, Heidelberg (2010)
21. Waitelonis, J., Sack, H.: Towards Exploratory Video Search Using Linked Data. In: 2009
11th IEEE International Symposium on Multimedia, pp. 540–545. IEEE (2009),
doi:10.1109/ISM.2009.111
22. Stankovic, M., Breitfuss, W., Laublet, P.: Linked-Data Based Suggestion of Relevant Top-
ics. In: Proceedings of I-SEMANTICS Conference 2011, Graz, Austria, September 7-9
(2011)
23. Deerwester, S., Dumais, S., Furnas, G., Landauer, T., Harshman, R.: Indexing by latent
semantic analysis. Journal of the American Society for Information Science 41, 391–407
(1990)
24. Karlgren, J., Sahlgren, M.: From words to understanding. In: Uesaka, Y., Kanerva, P.,
Asoh, H. (eds.) Foundations of Real-World Intelligence, pp. 294–308. CSLI Publications,
Stanford (2001)
25. Cohen, T.: Exploring medline space with random indexing and Pathfinder networks. In:
Annual Symposium Proceedings/AMIA Symposium, pp. 126–130 (2008)
26. Rizzo, G., Troncy, R.: NERD: Evaluating Named Entity Recognition Tools in the Web of
Data. In: ISWC 2011 Workshop on Web Scale Knowledge Extraction (WEKEX), Bonn,
Germany (2011)
Finding Co-solvers on Twitter,
with a Little Help from Linked Data
M. Stankovic, M. Rowe, and P. Laublet
1 Introduction
Modern challenges that the science and engineering worlds face today are often
interdisciplinary and require the research cooperation of teams of people in order to produce
good solutions. Analysis of tendencies in research publications [1] shows that more and
more multi-university teams produce accepted papers. Similarly, industrial innovation
challenges often require a collaborative effort of experts from across different
disciplines to work together. In this sense, innovation problem solving platforms, such
as Innocentive1, have started to propose problem challenges for teams of problem
solvers. Supporting users in the task of forming productive multidisciplinary teams
therefore plays an important role in a multitude of innovation-related situations.
Existing social studies [2] on the topic of forming teams investigate people’s
preferences when it comes to the choice of co-workers; they underline the importance
of co-worker similarity (both in terms of shared interests/expertise and in terms of
shared social connections) together with the expertise of co-workers given the work
task. Conveniently, more and more data about users (i.e. their social connections,
topics of interest, their work) is available on the Web; this opens the way for a
1 http://www.innocentive.com/
co-worker recommendation approach that takes into account those different qualities
of the user and performs useful recommendations.
In this paper we address the challenge of recommending potential co-solvers
to people facing an innovation or research problem for which they are only partially
competent. The recommendation is performed based on the data available in social
networks (both the user’s social graph and the user-generated content), taking into
account both the compatibility of candidates with the user and the complementarity of
their competence. The research questions driving our work are: how can we suggest
appropriate co-solvers that were potentially previously unknown to the user? And
what information can be used to enrich the initially available user and problem data to
provide the best suggestions? In exploring these two research questions we have
devised an approach to recommend co-solvers for open innovation problems. The
contributions from this work are three-fold, and are as follows: (1) Profile Expansion:
we present methods for expanding topics that measure semantic relatedness between
topics using the Linked Data graph; (2) Similarity Measures: we describe three
similarity measures that exploit either relations between topic concepts or social
connections; (3) Evaluation: we assess the performance of our approach when different
profile expansion and similarity measures are used through two user studies.
The remainder of the paper is organized as follows: In section 2 we describe the
Open Innovation related scenario of use for which we developed our approach and we
present related work covering a) the recommendation of co-workers; b) expert
finding, and; c) measures of semantic relatedness. Our core approach is presented in
section 3 together with different alternatives for several parts of the recommendation
process. In section 4 we present two user study-based evaluations of our approach
executed over the social network Twitter.com. In section 5 we present our directions
of future work and conclude the paper.
2 http://techcrunch.com/2010/04/22/facebook-edgerank/
a multitude of topics has not yet been fully explored. To the best of our knowledge,
existing approaches do not respond to the needs of OI scenarios, where the
requirements in terms of the expertise of a potential problem solver are slightly different
from those used to select experts in most expert-finding approaches. The necessary
focus on finding diverse and laterally relevant experts has, to the best of our
knowledge, also not been addressed by existing expert-finding approaches.
3 http://wordnet.princeton.edu/
4 http://www.dmoz.org/
5 While DBpedia contains more than 3.5 million concepts, the current version of WordNet has 206,941 word-sense pairs, and ODP has half a million categories.
3 Recommending Co-solvers
In our general approach, a Web user (called a seed user) approaches our system with
the intention to find potential collaborators for a particular research challenge or
innovation problem (called problem hereafter). He provides the text of the
problem/challenge and gives some identifier that he uses on a social networking
system; this allows access to: (1) his social connections and (2) some content that he
created. The system then proceeds with the creation of profiles for both the user and
the problem. Those profiles contain Linked Data identifiers of topics extracted from
the provided textual elements (from the text of the problem/challenge in the case of
the Problem Profile or from the content that the user has created in the case of the
User Profile). Optionally, an additional phase of profile enrichment may be performed
(called Profile Expansion). This functions by expanding the initial profiles in order
to broaden their topics and thus compensate for any incompleteness – i.e. where the
topics may be too specific. Similarity scoring is performed over a base of candidate
user profiles in order to select those candidate users that (1) are the most similar to
the seed user and (2) whose profiles fit the given innovation problem. Similarity
scoring can work both with the initial user and problem profile as well as with the
extended ones. Particular similarity functions will be further discussed in Section 3.3.
3.2 Profiling
In the profiling phase, user and problem profiles are created from the provided textual
elements (posts and biography in the case of user profiles and problem text in the case
of problem profiles). The topic profiles (denoted TP in equation (1)), regardless of the
type of entity (user or problem) that they concern, are sets of tuples assembled from a
Linked Data URI of a topic concept and a value w representing the importance of this
particular topic for the profile. In essence, this topic profile is a concept vector, where
each element’s index represents a unique concept. The value stored in a given element
is the frequency of the concept’s appearance in either a) the user’s past information
or b) the problem definition. In the phase of profile expansion, the values w of
additional related topics correspond to the relatedness of the topic to the given profile,
as expressed by the particular measure of semantic relatedness used to perform the
profile expansion.
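A minimal sketch of a topic profile as a concept vector is given below, assuming an upstream step that returns the list of DBpedia URIs extracted from the user's posts or the problem text.

from collections import Counter

def topic_profile(extracted_uris):
    """TP: mapping from topic URI to weight w, here the frequency of the concept
    in the user's past content or in the problem definition."""
    return dict(Counter(extracted_uris))

tp_problem = topic_profile([
    "http://dbpedia.org/resource/Semantic_Web",
    "http://dbpedia.org/resource/Sensor",
    "http://dbpedia.org/resource/Sensor",
])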
Different operations are possible over the topic profiles. For instance, in our work we
will rely on the difference TP(problem)-TP(user), called difference topics, which
represents the topics found in the problem topic profile that are not found in the topic
profile of the seed user - this yields the topics for which there is no record of the seed
user’s past knowledge. In addition to topic profiles, social profiles (denoted SP in
equation (2)) are also created for the users, and they contain the list of user’s social
connections:
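The difference topics and the social profile can be sketched as below; since equation (2) is not reproduced in the text above, representing SP simply as the set of a user's connections is our assumption.

def difference_topics(tp_problem, tp_user):
    """TP(problem) - TP(user): problem topics with no record in the seed user's profile."""
    return {uri: w for uri, w in tp_problem.items() if uri not in tp_user}

def social_profile(connection_ids):
    """SP: the identifiers of the user's social connections."""
    return set(connection_ids)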
6 http://developer.zemanta.com
7 http://www.opencalais.com/
8 http://dbpedia.org/spotlight
and the seed user profile, as found by the search function of the particular social
network used – in the case of Twitter using their built-in search functionality. Users
found by those different queries constitute a base of candidate user profiles for every
recommendation. These two particular ways of harvesting candidate users correspond to
the general intention of finding people similar to the seed user (in the sense of interests
and social connections) and relevant for the topics of the problem. Users that our seed
user is already friends with are eliminated from the possible recommendations, as we
assume that they are known to the seed user and as such would not represent valuable
discoveries. In cases where they are considered relevant, it is relatively easy to tweak
the system to also include the user’s friends in the recommendations.
product (PC) of the three elementary measures as the aggregate measure. For instance
PC(Cosinet,Cosines,Cosinedt)= Cosinet•Cosines•Cosinedt. Alternatively it is possible to
use the sum of weighted values of elementary similarity measures, in which case the
weights may be adjusted by a machine-learning approach [9] in order to adapt to the
preferences of each user. The PC measure, as opposed to any linear combination of
elementary functions, penalizes candidates that rank extremely poorly on any
single similarity function (0•x=0), regardless of a high ranking on another function.
Candidates ranked highly on only one similarity function can therefore not be
ranked better than those that are similar in all required aspects.
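A sketch of the product combination PC follows, with the elementary similarity functions passed in as callables (our assumption about how they would be wired together).

import math

def pc_score(candidate, elementary_similarities):
    """Product of the elementary similarity scores for one candidate; a zero on any
    single criterion zeroes the aggregate, penalising one-sided candidates."""
    return math.prod(sim(candidate) for sim in elementary_similarities)

def rank_candidates(candidates, elementary_similarities):
    return sorted(candidates,
                  key=lambda c: pc_score(c, elementary_similarities),
                  reverse=True)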
the greater the cumulative relatedness the greater the relatedness between the profile
and the related concept. The top-n topics are selected based on their cumulative
relatedness – we set n=30 for our experiments.
Our approach uses the links created over these two types of properties in different
ways, appropriate to the nature of those links. According to the equation (4) the value
9 All the prefixes used in this paper can be looked up at http://prefix.cc
of our HPSR measure for two topics t1 and t2 is the sum of valorisations of the
connections achieved over hierarchical links (first component of the formula) and
those achieved over transversal links (second component of the formula). In the
treatment of hierarchical links we take all the common categories C(t1, t2) of the two
topics, and then for each common category Ci we count the informational content [22]
of this category as -log(pb) where pb is the probability of finding the category in the
graph of DBpedia categories when going from the bottom up. The sum of values for
all common categories represents the strength of links established between t1 and t2
over the hierarchical properties. The transversal links are treated slightly differently.
For each property p, from the previously defined set of relevant properties P, we
count the number of links connecting t1 and t2 over the property p (given by function
link(p, t1, t2)) and weight them using weighting function pond(p,t1). The value of the
weighting function is calculated as -log(n/M), where n is the number of other concepts
to which the concept t1 is connected over the same property p, and M is a large
constant, larger than any possible value of n (in our case the total number of
concepts in DBpedia).
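A hedged sketch of the HPSR computation described above is given below; the helpers for common categories, category probabilities, per-property link counts and out-degrees stand in for queries against DBpedia and are assumptions.

import math

def hpsr(t1, t2, common_categories, category_prob, link_counts, out_degree, total_concepts):
    """HPSR(t1, t2): informational content summed over the common categories
    (hierarchical links) plus link counts weighted by -log(n/M) (transversal links).
    The measure is asymmetric, so t1 should be the seed or difference topic."""
    hierarchical = sum(-math.log(category_prob(c)) for c in common_categories(t1, t2))
    transversal = 0.0
    for prop, n_links in link_counts(t1, t2).items():   # links between t1 and t2 over property p
        n = out_degree(t1, prop)                         # concepts reachable from t1 over p
        transversal += n_links * -math.log(n / total_concepts)
    return hierarchical + transversal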
As our formula is not symmetric, i.e., HPSR(a,b) is not equal to HPSR(b,a), it is
always calculated by putting the topics that belong to the seed user or to the difference
topics as the first parameter, and the topics from the candidate user profiles as the
second parameter.
4 Evaluation
In order to evaluate the performance of different similarity measures and approaches to
profile expansion when providing co-solver recommendations, we performed two
experiments involving Twitter users. Recent studies show the growth of scholarly
Twitter accounts10 and their use for communication in scientific communities, especially
the Computer Science research community [32], thus making Twitter a valuable resource for
co-solver recommendations. We first created three multidisciplinary innovation
problems11, inspired by descriptions of existing research challenges and projects that
we found online, each involving topics related to the Semantic Web and to at least one
10 Third of all scholars are said to have a Twitter account today: http://www.scribd.com/doc/37621209/2010-Twitter-Survey-Report
11 Available on our website http://research.hypios.com/?page_id=184
other domain (in particular: Sensors, Online Advertising and Venture Capital
Investments). We then used different alternatives of our method to suggest possible
collaborators corresponding to our raters, by relying on candidate user profiles created
according to our approach described in 3.2.2 and 3.2.3. In the first experiment (§4.1)
we used a gold standard obtained from 3 raters and then assessed the performance of
different permutations of profile expansion methods with the interest similarity
measure and the difference topic measure. We omitted social similarity from this stage
due to the differences in the social networks of the 3 solvers – as each solver has
different potential candidates – the gold standard in this case uses the intersection of
candidates recommended to each solver. In the second experiment (§4.2) we evaluated
all profile expansion methods with all ranking methods using a group of 12 raters. In
this case we did not use inter-rater agreement for the gold standard, but instead
evaluated performance on an individual basis. We were therefore able to include social
similarity as a ranking technique and evaluate its performance. Performing these two
studies allows a comparison of the performance of different profile expansion and
similarity measures when a) recommending co-solvers to a group of users, in the case
of experiment 1, and b) recommending co-solvers to individual users, in experiment 2.
To gauge the performance of different permutations of profile expansion and similarity
measures we used the following evaluation metrics:
Discounted Cumulative Gain (DCG) quantifies the value of items in the top-n
ranked suggestions as well as the quality of their ranking. For each ranking resulting
from a particular ranking alternative, we take the 10 best-ranked user candidates and
look at the ratings users generated for them. If the user candidate found at position i is
rated positively by users we take the rating to be 1; otherwise we consider it equal
to 0. The importance of positively rated candidates found at lower positions in a
particular ranking is downgraded logarithmically in order to favour ranking
alternatives that put the best candidates towards the top.
Average Precision (AvePn) computes the average value of precision as a function of
recall, on the interval recall ∈ [0,1]. For each position in a concrete ranking we
calculate the precision up to that point, and the gain in recall achieved at that point as
opposed to the previous point. The product of those gives an idea of the value a user
would gain by looking at the suggestions up to a particular rank.
DCG = rating(1) + Σ_{i=2..10} rating(i) / log2(i)    (5)

AvePn = Σ_i P(i) · Δrecall(i)    (6)
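The two metrics can be sketched as follows for binary ratings; the cut-offs of 10 (DCG) and 40 (AveP) match the ranks considered in the evaluation, and the variable names are ours.

import math

def dcg_at_10(ratings):
    """ratings[i] is 1 if the candidate at rank i+1 was rated positively, else 0."""
    top = ratings[:10]
    return top[0] + sum(r / math.log2(i) for i, r in enumerate(top[1:], start=2))

def average_precision(ratings, total_relevant, k=40):
    """Sum over ranks of precision-at-i times the recall gained at position i."""
    score, hits = 0.0, 0
    for i, rated_positive in enumerate(ratings[:k], start=1):
        if rated_positive:
            hits += 1
            score += (hits / i) * (1 / total_relevant)   # precision(i) * recall gain at i
    return score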
4.1 Evaluation 1
In our first evaluation, we approached a group of 3 researchers from the field of the
Semantic Web and presented them with our 3 problems for which they were
collectively, as a group, only partially competent. For each rater we generated co-
solver suggestions using different combinations of similarity measures and profile
expansion approaches. We then took the top-10 suggestions from each different
method and mixed all the suggestions together in one randomized list. This resulted in
reasonably sized lists (i.e., 30-50), as some users were recommended by several
methods, but on the other hand limited the possibilities of evaluation to the methods
defined prior to user rating. The raters then rated candidates by answering whether they
would work with the suggested user on the given innovation problem. Raters were
instructed to base their ratings on a holistic perception of the suitability of the suggested
user, and to rate positively only the users who are both competent and seem to be
potential co-workers. Prior to the calculation of any performance measures we calculated
the inter-rater agreement using the kappa statistic defined in [33] for each pair of raters.
The value of k was, at first, below the threshold of 0.6 for some of the rater pairs.
We then allowed the raters to anonymously see the ratings of other group members and
alter their ratings if they felt the others were right. After the second round the inter-
rater agreement was above 0.6 for all rater pairs and all problems (0.74 on
average). We then used the majority agreement between raters to derive our decision
labels for each candidate user (i.e., recommended or not recommended).
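For illustration, pairwise inter-rater agreement can be computed as below; Cohen's kappa from scikit-learn is used as a stand-in, since the exact statistic of [33] is not reproduced here.

from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def pairwise_kappa(ratings_by_rater):
    """ratings_by_rater: {rater_id: [0/1 decision per candidate, same order for all raters]}.
    Returns Cohen's kappa (a stand-in for the statistic of [33]) for every pair of raters."""
    return {
        (a, b): cohen_kappa_score(ratings_by_rater[a], ratings_by_rater[b])
        for a, b in combinations(ratings_by_rater, 2)
    }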
Fig. 2. DCG of rankings obtained with different methods of expansion applied to problem
profiles (right) and to seed user profiles (left)
DCG values for similarity measures based on the Cosine and Weighted Overlap
functions, run with all three expansion methods, are shown in Figure 2. In the case of
rankings based on the similarity with the difference topics, it is clear that the HPSR
method of profile enrichment dominates the other expansion methods. This method is
much less efficient when it comes to ranking based on the interest similarity of
candidate users to the seed users, where DMSR slightly outperforms the other
methods, with a small improvement over the standard approach. The figure shows
average values over all 3 problems, and the differences in method performance have
been confirmed to be significant by a T-test (p<0.05) for HPSR versus DMSR for the
similarity with difference topics, but not in the case of interest similarity. It should be
noted that our expansion methods are not applicable to the calculation of social
similarity as this measure relies on SP vectors that contain no topics. It is indeed
reasonable to expect that distributional measures, based on the distribution of topics in
user profiles, would work well for user-to-user similarity. The enrichment of problem
topics using Linked Data-based measures, on the other hand, has already been shown
to perform well in keyword suggestion scenarios [31] and it is reasonable to expect
that an enrichment based on the meaning of topics would allow better mapping of the
problem’s conceptual space and reach users whose profiles have a more complete
coverage of this space.
4.2 Evaluation 2
In our second evaluation, we solicited ratings from 12 individual Twitter users,
experts in the field of Semantic Web. Similar to the previous study, we provided them
with the list of candidate co-solvers and asked them to select those that they would
work with on a given problem, for which they were only partially competent. The
same 3 problems were used by each rater, thereby resulting in 36 sets of rated
suggestions. This time, in order to generate a more reusable gold standard, we asked
the raters to evaluate all the possible user candidates we collected whose profiles
contained at least one of the difference topics. Each set of suggestions for each
problem-user pair contained 80-240 suggestions to evaluate. This was time-
consuming for raters but resulted in a gold standard that covered virtually all
candidates that could be recommended by any variation of our approach. Such a gold
standard allowed us to perform additional comparisons of methods, and especially
focus on composite similarity measures that were not the subject of the first study.
Ranked candidate lists were generated using the following combinations of
similarity functions and profile expansion methods:
• PC(Cosines,Cosinedt,Cosinet): composite function that is a product of interest,
social and the similarity with difference topics counted using cosine similarity.
• PRF(PC(Cosines,Cosinedt,Cosinet)): PRF problem profile expansion with
composite similarity.
• PC(Cosines,HPSR(Cosinedt),Cosinet): HPSR expansion performed on difference
topics prior to calculating the similarity with difference topics.
• PC(Cosines,Cosinedt,DMSR(Cosinet)): DMSR expansion performed over the
seed user profile prior to calculating interest similarity.
• PC(Cosines, HPSR(Cosinedt),DMSR(Cosinet)): composite function in which
HPSR is used to expand profile topics and DMSR to expand seed user topic
profile prior to calculating the similarities.
In the above user study we described the results obtained when using hyProximity and
Distributional profile expansion measures over the elementary similarity functions.
For brevity, in this section, we omit these results from the second study and
concentrate on the remaining permutations and their combinations using a composite
similarity function – something that was not possible in the first study. We focus on
composite measures as they allow us to gain a more complete insight into the impact of
profile expansion on our multi-criteria recommendation task as a whole. As shown in
Figure 3, according to the DCG measure for the first 10 ranked suggestions, the
approaches with topic enrichment by either PRF, HPSR or DMSR consistently show
better results than the basic approach on all 3 problems. The overall values are at an
expected level for the relevance scale used. HPSR performs slightly better than the
other methods in most cases. However, the mixed aggregate function (where HPSR is
applied to the enrichment of problems and DMSR to the enrichment of user profiles)
mostly gives lower results than the individual enrichment approaches. The cause of
this might simply be that expanding both problem and seed user profiles induces too
much of a difference with regard to the input data and might divert the co-solver
search in an undesired direction. The results shown in Figure 4 represent the case
when the Cosine similarity function is used. When the Weighted Overlap is used, the
results show negligible differences, with the order of the best alternatives unchanged,
and are omitted for brevity.
Fig. 3. Average DCG for all raters for different alternatives of composite similarity functions
Fig. 4. AvePn of composite approaches (y axis), counted at different rank positions from 1-40
(x axis). Better approaches reach higher AveP at lower ranks.
Similar results are observed with the Average Precision used as the performance
metric (Figure 4). It shows that even on a larger set of best-ranked candidates (40) the
individual expansion methods dominate the mixed one. All the expansion methods
also dominate the basic approach. The methods that gain higher values of AveP at
lower numbers of rank positions are the ones that give more valuable suggestions at
higher ranks and allow the user to discover valuable collaborators while going through a
lower number of suggestions. In this case, HPSR enrichment has slightly better results
than the other methods. In order to give a better insight into the usefulness of the
results generated with our approach, we provide an example of co-solver suggestions on
our website12.
12
http://research.hypios.com/?page_id=184
Top-k Linked Data Query Processing
1 Introduction
In recent years, the amount of Linked Data has increased rapidly. According to
the Linked Data principles (http://www.w3.org/DesignIssues/LinkedData.html),
dereferencing a Linked Data URI via HTTP should return a machine-readable
description of the entity identified by the URI. Each URI therefore represents a
virtual “data source” (see Fig. 1).
In this context, researchers have studied the problem of Linked Data query
processing [3,5,6,10,11,16]. Processing structured queries over Linked Data can
be seen as a special case of federated query processing. However, instead of re-
lying on endpoints that provide structured querying capabilities (e.g., SPARQL
interfaces), only HTTP URI lookups are available. Thus, entire sources have to
be retrieved. Even for a single trivial query, hundreds of sources have to be pro-
cessed in their entirety [10]. Aiming at delivering up-to-date results, sources often
cannot be cached, but have to be fetched from external hosts. Thus, efficiency
and scalability are essential problems in the Linked Data setting.
A widely adopted strategy for dealing with efficiency and scalability problems
is to perform top-k processing. Instead of computing all results, top-k query
processing approaches produce only the “best” k results [8]. This is based on the
observation that results may vary in “relevance” (which can be quantified via a
ranking function), and users, especially on the Web, are often interested in only
a few relevant results. Let us illustrate top-k Linked Data query processing:
Src. 1. ex:beatles
  ex:beatles foaf:name "The Beatles" ;
    ex:album ex:sgt_pepper ;
    ex:album ex:help .

Src. 2. ex:sgt_pepper
  ex:sgt_pepper foaf:name "Sgt. Pepper" ;
    ex:song "Lucy" .

Src. 3. ex:help
  ex:help foaf:name "Help!" ;
    ex:song "Help!" .

Fig. 1. Linked Data sources describing “The Beatles” and their songs “Help!” and
“Lucy”
1 SELECT * WHERE
2 {
3   ex:beatles ex:album ?album .
4   ?album ex:song ?song .
5 }

Fig. 2. Example query returning songs in Beatles albums. The query comprises two
triple patterns q1 (line 3) and q2 (line 4).
Example 1. For the query in Fig. 2, the URIs ex:beatles, ex:help and
ex:sgt_pepper are dereferenced to produce results for ?song and ?album. The
results are retrieved from different sources, which vary in “relevance” (i.e., ex:help
provides the precise name for the song “Help!”, while ex:sgt_pepper merely holds
“Lucy” as name for a song, which is actually called “Lucy in the Sky with Dia-
monds”). Such differences are captured by a ranking function, which is used in
top-k query processing to measure the result relevance. For the sake of simplicity,
assume a ranking function assigns triples in ex:beatles (s1 ) a score of 1, triples in
ex:sgt_pepper (s2 ) score 2, and those in ex:help (s3 ) a score of 3. Further, assume
our ranking function assigns increasing values with increasing triple relevance.
While appealing, top-k processing has not been studied in the Linked Data
(and the general RDF) context before. Aiming at the adaptation of top-k
processing to the Linked Data setting, we provide the following contributions:
– Top-k query processing has been studied in different contexts [8]. Closest
to our work is top-k querying over Web-accessible databases [20]. However,
the Linked Data context is unique to the extent that only URI lookups are
available for accessing data. Instead of retrieving partial results matching
query parts from sources that are exposed via query interfaces (of the cor-
responding database endpoints), we have to retrieve entire sources via URI
lookups. To the best of our knowledge, this is the first work towards top-k
Linked Data query processing.
– We show that in a Linked Data setting, more detailed score information is
available. We propose strategies for using this knowledge to provide tighter
score bounds (and thus allow an earlier termination) as compared to top-k
processing in other scenarios [2,13,17]. Further, we propose an aggressive
technique for pruning partial query results that cannot contribute to the
final top-k result.
Definition 1 (RDF Triple, RDF Graph). Given a set of URIs U and a set
of literals L, t = ⟨s, p, o⟩ ∈ U × U × (U ∪ L) is an RDF triple, and a set of RDF
triples is called an RDF graph.
The Linked Data principles, used to access and publish RDF data on the Web,
mandate that (1) HTTP URIs shall be used as URIs and that (2) dereferencing
a URI returns a description of the resource identified by the URI. Thus, a URI
d can be seen as a Linked Data source, whose content, namely a set of RDF
triples T^d, is obtained by dereferencing d. Triples in T^d contain other HTTP
URI references (links), connecting d to other sources. The union of all sources
in U forms a Linked Data graph G = {t | t ∈ T^di ∧ di ∈ U}.
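As an illustration of treating a URI as a virtual data source, the sketch below dereferences a URI with HTTP content negotiation and parses the returned description into a set of triples. It assumes the third-party requests and rdflib libraries and a hypothetical example URI; it is not part of the system described in this paper.

import requests
import rdflib

def dereference(uri):
    """Fetch the RDF description behind a Linked Data URI, i.e. the source content T^d."""
    resp = requests.get(uri, headers={"Accept": "text/turtle"}, timeout=30)
    resp.raise_for_status()
    g = rdflib.Graph()
    g.parse(data=resp.text, format="turtle")
    return set(g)  # the set of RDF triples obtained by dereferencing d

# Example (hypothetical URI): triples = dereference("http://example.org/beatles")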
Query Model. The standard language for querying RDF is SPARQL [15].
Previous work on Linked Data query processing focused on processing basic
graph patterns (BGP), which is a core feature of SPARQL.
Result Model. Often, every triple pattern in a BGP query Q shares a common
variable with at least one other pattern such that Q forms a connected graph.
Computing results to a BGP query over G amounts to the task of graph pattern
matching. Basically, a result to a query Q evaluated over G (given by μG (Q)) is
a subgraph of G that matches Q. The set of all results for query Q is denoted
by ΩG (Q).
Fig. 3. (a) Query plan providing a sorted access, query execution and the scheduler.
(b) Rank join operator with data from our “Beatles” example.
Query plans in relational databases generally consist of access plans for in-
dividual relations. Similarly, Linked Data query plans can be seen as being
composed of access plans at the bottom-level (i.e., one for each triple pat-
tern). An access plan for query Q = {q1 , . . . , qn } is a tree-structured query
plan constructed in the following way: (1) At the lowest level, leaf nodes are
source scan operators, one for every source that is relevant for triple pattern
qi (i.e., one for every d ∈ source(qi)). (2) The next level contains selection
operators, one for every scan operator. (3) The root node is a union operator
∪(σT^d1(qi), . . . , σT^dn(qi)), which combines the outputs of all selection operators
for qi (with di ∈ source(qi)). At the next levels, the outputs of the access plans
(of their root operators) are successively joined to process all triple patterns of
the query, resulting in a tree of operators.
Example 2. Fig. 3 (a) shows an example query plan for the query in Fig. 2.
Instead of scan and join, their top-k counterparts scan-sort and rank-join are
shown (explained in the next section). There are three source scan operators,
one for each of the sources: ex:beatles (s1), ex:sgt_pepper (s2), and ex:help
(s3). Together with selection and union operators, they form two access plans
for the patterns q1 and q2 . The output of these access plans is combined using
one join operator.
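The plan structure just described can be sketched as plain data: one access plan (scan, selection, union) per triple pattern, joined at the next levels into a left-deep tree. The sketch below is illustrative only; the tuple-based operator encoding and the source_index lookup are assumptions, not the paper's implementation.

def access_plan(pattern, sources):
    """One access plan per triple pattern: scan every relevant source, select, union."""
    scans = [("scan", d) for d in sources]                 # one (scan-sort) operator per source
    selections = [("select", pattern, scan) for scan in scans]
    return ("union", pattern, selections)                  # root of the access plan

def left_deep_plan(patterns, source_index):
    """Join the access plans of all triple patterns into a left-deep operator tree."""
    plans = [access_plan(q, source_index[q]) for q in patterns]
    tree = plans[0]
    for p in plans[1:]:
        tree = ("join", tree, p)
    return tree

# For the query of Fig. 2: left_deep_plan(["q1", "q2"], {"q1": ["s1"], "q2": ["s2", "s3"]})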
Push-Based Processing. In previous work [10,11], push-based execution using
symmetric hash join operators was shown to have better performance than pull-
based implementations (such as [6]). In a push-based model, operators push
their results to subsequent operators instead of pulling from input operators,
i.e., the execution is driven by the incoming data. This leads to better behavior
in network settings, because, unlike in pull-based execution models, the query
execution is not blocked, when a single source is delayed [11].
3.1 Preliminaries
Besides the source index employed for Linked Data query processing, we need a
ranking function as well as a sorted access for top-k processing [7,14,17].
Traditional rank join operators “pull” their inputs in order to produce an output [7,17,20]. In compliance with [17],
we adapt the pull/bound rank join (PBRJ) algorithm template for a push-based
execution in the Linked Data setting. For simplicity, the following presentation of
the PBRJ algorithm assumes binary joins (i.e., each join has two inputs).
In a pull-based implementation, operators call a next method on their input
operators to obtain new data. In a push-based execution, the control flow is
inverted, i.e., operators have a push method that is called by their input oper-
ators. Algorithm 1 shows the push method of the PBRJ operator. The input
from which the input element r was pushed is identified by i ∈ {1, 2}. Note,
by input element we mean either a triple (if the input is a union operator) or
a partial query result (if the input is another rank join operator). First, the
input element r is inserted into the hash table Hi (line 3). Then, we probe the
other input’s hash table Hj for valid join combinations (i.e., the join condition
evaluates to “true”; see line 4), which are then added to the output queue O
(line 5). Output queue O is a priority queue such that the result with the highest
score is always first. The threshold Γ is updated using the bounding strategy B,
providing an upper bound on the scores of future join results (i.e., result com-
binations comprising “unseen” input elements). When a join result in queue O
has a score equal to or greater than the threshold Γ , we know there is no future
result having a higher score. Thus, the result is ready to be reported to a subse-
quent operator. If output O contains k results, which are ready to be reported,
the algorithm stops reading inputs (so-called early termination).
As reported in [17], the PBRJ has two parameters: its bounding strategy B and
its pulling strategy P. For the former, the corner-bound is commonly employed
and is also used in our approach. The latter strategy, however, is proposed for
a pull-based execution and is thus not directly applicable. Similar to the idea
behind the pulling strategy, we aim to have control over the results that are
pushed to subsequent operators. Because a push-based join has no influence
over the data flow, we introduce a scheduling strategy to regain control. Now,
the push method only adds join results to the output queue O, but does not push
them to a subsequent operator. Instead the pushing is performed in a separate
activate method as mandated by the scheduling strategy.
Algorithm 1. PBRJ.push(r)
Input: Pushed input element r on input i ∈ {1, 2}
Data: Bounding strategy B, output queue O, threshold Γ , hash tables H1 , H2
1 if i = 1 then j = 2;
2 else j = 1;
3 Insert r into hash table Hi ;
4 Probe Hj for valid join combinations with r ;
5 foreach valid join combination o do Insert o into O;
6 Γ ← B.update();
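For concreteness, here is a compact sketch of the push method with hash tables, a priority output queue and a corner-bound threshold. It is a simplification under assumed data structures (bindings as Python dictionaries, inputs arriving in descending score order), not the code evaluated in Sect. 4; Δ is summation and the threshold follows the corner bound of [17], Γ = max(Δ(α1, β2), Δ(α2, β1)).

import heapq

class PBRJ:
    """Push-based rank join over two inputs (illustrative sketch)."""

    def __init__(self, join_vars):
        self.join_vars = join_vars        # variables shared by the two inputs
        self.H = {1: [], 2: []}           # hash tables (plain lists here for brevity)
        self.O = []                       # output priority queue (max-heap via negated scores)
        self.alpha = {1: None, 2: None}   # highest score seen on each input
        self.beta = {1: None, 2: None}    # lowest (most recent) score seen on each input
        self.gamma = float("inf")         # corner-bound threshold

    def push(self, r, score, i):
        j = 2 if i == 1 else 1
        self.H[i].append((r, score))
        self.alpha[i] = score if self.alpha[i] is None else max(self.alpha[i], score)
        self.beta[i] = score
        # probe the other input's hash table for valid join combinations
        for r2, score2 in self.H[j]:
            if all(r.get(v) == r2.get(v) for v in self.join_vars):
                result = {**r, **r2}
                heapq.heappush(self.O, (-(score + score2), id(result), result))
        # update the upper bound on scores of join results involving unseen elements
        if self.alpha[1] is not None and self.alpha[2] is not None:
            self.gamma = max(self.alpha[1] + self.beta[2], self.alpha[2] + self.beta[1])

An activate method (not shown here) would pop from O every result whose score is at least gamma and report it to the subsequent operator, as described below.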
Since join results are only reported when their score is equal to or greater
than the threshold Γ, it is essential that the upper bound is as low (tight) as
possible. In other words, a tight upper bound allows for an early termination of
the top-k join procedure, which results in fewer sources being loaded and fewer triples
being processed. The most common choice for B is the corner bound strategy:
Scheduling Strategies. Deciding which input to pull from has a large effect
on operator performance [17]. Previously, this decision was captured in a pulling
strategy employed by the join operator implementation. However, in push-based
systems, the execution is not driven by results, but by the input data. Join op-
erators are only activated when input is actively being pushed from operators
lower in the operator tree. Therefore, instead of pulling, we propose a schedul-
ing strategy that determines which operators in a query plan are scheduled for
execution. That is, we move the control over which input is processed from the
join operator to the query engine, which orchestrates the query execution.
Algorithm 2 shows the execute method that takes a query Q and the number
of results k as input and returns the top-k results. First, we obtain a query plan
P from the plan method (line 1). We then use the scheduling strategy S to
obtain the next operator that should be scheduled for execution (line 2). The
scheduling strategy uses the current execution state as captured by the operators
in the query plan to select the next operator. We then activate the selected
operator (line 4). We select a new operator (line 5) until we either have obtained
the desired number of k results or there is no operator to be activated, i.e., all
inputs have been exhausted (line 3).
Algorithm 3 shows the activate method (called by execute) for the rank join
operator. Intuitively, the activate method triggers a “flush” of the operator’s
output buffer O. That is, all computed results having a score larger than or equal
to the operator’s threshold Γ (line 1) are reported to the subsequent operator
(lines 2-3). An activate method for a scan-sort operator of a source d simply
pushes all triples in d in a sorted fashion. Further, activate for selection and
union operators causes them to push their outputs to a subsequent operator.
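The driver loop can be sketched as follows; plan_builder, scheduler.next_op and op.activate are assumed helpers with the semantics just described (activate is assumed to return final query results only for the root operator, and an empty list otherwise). The sketch only illustrates the control flow, not the actual engine.

def execute(query, k, scheduler, plan_builder):
    """Drive the push-based plan until k results are reported or all inputs are exhausted."""
    plan = plan_builder(query)           # assumed helper: builds the operator tree of Fig. 3(a)
    reported = []                        # query results reported so far
    op = scheduler.next_op(plan)         # the scheduling strategy selects the next operator
    while op is not None and len(reported) < k:
        reported.extend(op.activate())   # flush results scoring at least the operator's threshold
        op = scheduler.next_op(plan)
    return reported[:k]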
Now, the question remains how a scheduling strategy should select the next
operator (nextOp method). We can apply the idea behind the state-of-the-art
pulling strategy [17] to perform corner-bound-adaptive scheduling. Basically, we
choose the input which leads to the highest reduction in the corner-bound:
Definition 6 (Corner-Bound-Adaptive Scheduling). Given a rank join
operator, we prefer the input that could produce join results with the highest
scores. That is, we prefer input 1 if Δ(α2 , β1 ) > Δ(α1 , β2 ), otherwise we prefer
input 2. In case of ties, the input with the least current depth, or the input with
the least index is preferred. The scheduling strategy then “recursively” selects
and activates operators that may provide input elements for the preferred input.
That is, in case the chosen input is another rank join operator, which has an
empty output queue, the scheduling strategy selects and activates operators for
its preferred input in the same manner.
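The preference test of Definition 6 can be written as a small helper, where Δ is summation and alpha/beta are the per-input score bounds maintained by the rank join sketch above; depth counts the elements read from each input. This is an illustration, not the paper's code.

def preferred_input(alpha, beta, depth):
    """Return 1 or 2: the input whose next elements could yield the higher-scoring join results."""
    gain1 = alpha[2] + beta[1]   # Delta(alpha2, beta1): potential of reading more from input 1
    gain2 = alpha[1] + beta[2]   # Delta(alpha1, beta2): potential of reading more from input 2
    if gain1 != gain2:
        return 1 if gain1 > gain2 else 2
    return 1 if depth[1] <= depth[2] else 2   # ties: least current depth, then least index

# preferred_input({1: 1, 2: 3}, {1: 1, 2: 3}, {1: 2, 2: 1})  ->  2 (tie broken by depth)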
Example 3. Assume k = 1 and let ti,j denote the j-th triple in source i (e.g.,
t1,2 = ⟨ex:beatles, ex:album, ex:sgt_pepper⟩). First, our scheduling strategy
prefers input 1 and selects (via nextOp) and activates scan-sort(s1), sel(q1),
and union(q1). Note, also input 2 would have been a valid choice, as the threshold
(respectively α, β) is not set yet. The rank join reads t1,2 and t1,3 as new input
elements from union(q1), and both elements are inserted into H1 (α1 = β1 = 1).
The scheduler now prefers input 2 (as input 1 is exhausted) and selects and
activates scan-sort(s3), sel(q2), and union(q2), because source 3 has triples with
higher scores than source 2. Now, union(q2) pushes t3,2 and α2 respectively β2
is set to υ(t3,2) = 3. Employing summation as Δ, the threshold Γ is set to 4
(as max{1 + 3, 1 + 3} = 4). Then, t3,2 is inserted into H2 and the joins between
t3,2 and the elements in H1 are attempted; t1,3 ⋈ t3,2 yields a result μ, which is
then inserted into the output queue. Finally, as υ(μ) = 4 ≥ Γ = 4 holds, μ is
reported as the top-1 result and the algorithm terminates. Note that not all inputs
have been processed, i.e., source 2 has not been scanned (cf. Fig. 3).
the opportunity for pruning arises only when k (or more) complete results have
been produced (by the root join operator).
More precisely, let Q be a query and μ(Qf) a partial query result, with Qf
as the “finished” part and Qr as the “remaining” part (Qf ⊂ Q and Qr = Q \ Qf). The
upper bound on the scores of all final results based on μ(Qf) ∈ ΩG(Qf) can be
obtained by aggregating the score of μ(Qf) and the maximal score υu^Q(Qr) of
results μ(Qr) ∈ ΩG(Qr). υu^Q(Qr) can be computed as the aggregation of maximal
source upper bounds obtained for every triple pattern in Qr = {q1, . . . , qm}, i.e.,
υu^Q(Qr) = Δ(υu^Q(q1), . . . , υu^Q(qm)), where υu^Q(qi) = max{υu(d) | d ∈ source(qi)}.
A tighter bound for υu^Q(Qr) can be obtained if Qr contains one or more entity
queries (see previous section), by aggregating their scores in a greedy fashion.
Last, the following theorem can be established (see proof in [19]):
Theorem 2. A result μf ∈ ΩG(Qf) cannot be part of the top-k results for Q
if Δ(υ(μf), υu^Q(Qr)) < min{υ(μ) | μ ∈ ΩG^k(Q)}, where ΩG^k(Q) are the currently
known k results of Q.
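The pruning condition of Theorem 2 amounts to a single comparison; the sketch below assumes summation as Δ and a list holding the scores of the currently known top-k results.

def can_prune(partial_score, remaining_upper_bound, current_topk_scores, k):
    """Prune a partial result if even its best possible completion cannot enter the top-k."""
    if len(current_topk_scores) < k:
        return False   # fewer than k complete results are known, so nothing can be pruned yet
    return partial_score + remaining_upper_bound < min(current_topk_scores)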
4 Experimental Evaluation
In the following, we present our evaluation and show that (1) top-k processing
outperforms state-of-the-art Linked Data query processing, when producing only
a number of top results, and (2) our tighter bounding and early pruning strategy
outperform baseline rank join operators in the Linked Data setting.
Systems. In total, we implemented three different systems, all based on push-
based join processing. For all queries, we generated left-deep query plans with
random orders of join operators. All systems use the same plans and are different
only in the implementation of the join operator.
First, we have the push-based symmetric hash join operator (shj ) [10,11],
which does not employ top-k processing techniques, but instead produces all
results and then sorts them to obtain the requested top-k results. Also, there
are two implementations of the rank join operator. Both use the corner-bound-
adaptive scheduling strategy (which has been shown to be optimal in previous
work [17]), but with different bounding strategies. The first uses the corner-
bound (rj-cc) from previous work [17], while the second (rj-tc) employs our
optimization with tighter bounds and early result pruning. The shj baseline is
used to study the benefits of top-k processing in the Linked Data setting, while
rj-cc is employed to analyze the effect of the proposed optimizations.
All systems were implemented in Java 6. Experiments were run on a Linux
server with two Intel Xeon 2.80 GHz Dual-Core CPUs, 8 GB RAM and a Seagate
ST31000340AS 1TB hard disk. Before each query execution, all operating system
caches were cleared. The presented values are averages collected over three runs.
Dataset and Queries. We use 8 queries from the Linked Data query set of
the FedBench benchmark. Due to schema changes in DBpedia and time-outs
observed during the experiments (> 2 min), three of the 11 FedBench queries
were omitted. Additionally, we use 12 queries we created. In total, we have 20
queries that differ in the number of results they produce (from 1 to 10118) and
in their complexity in terms of the number of triple patterns (from 2 to 5). A
complete listing of our queries can be found in [19].
To obtain the dataset, we executed all queries directly over the Web of Linked
Data using a link-traversal approach [6] and recorded all Linked Data sources
that were retrieved during execution. In total, we downloaded 681,408 Linked
Data sources, comprising a total of 1,867,485 triples. From this dataset we cre-
ated a source index that is used by the query planner to obtain relevant sources
for the given triple patterns.
Scores were randomly assigned to triples in the dataset. We applied three
different score distributions: uniform, normal (μ = 5, σ² = 1) and exponential
(λ = 1). This allows us to abstract from a particular ranking and examine
the applicability of top-k processing for different classes of functions. We used
summation as the score aggregation function Δ.
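For illustration, the three score assignments can be reproduced with Python's standard random module; the fixed seed and the representation of triples as hashable tuples are assumptions of this sketch.

import random

def assign_scores(triples, dist="n", seed=0):
    """Attach a random score to each triple: uniform, normal (mu=5, sigma^2=1) or exponential (lambda=1)."""
    rng = random.Random(seed)
    draw = {"u": rng.random,
            "n": lambda: rng.gauss(5, 1),        # sigma = 1, hence sigma^2 = 1
            "e": lambda: rng.expovariate(1.0)}[dist]
    return {t: draw() for t in triples}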
We observed that network latency greatly varies between hosts and evaluation
runs. In order to systematically study the effects of top-k processing, we thus
decided to store the sources locally, and to simulate Linked Data query processing
on a single machine (as done before [10,11]).
Parameters. Parameter k ∈ {1, 5, 10, 20} denotes the number of top results
to be computed. Further, there are three different score distributions d ∈
{u, n, e} (uniform, normal and exponential, respectively).
Overall Results. Fig. 4a shows an overview of processing times for all queries
(k = 1, d = n). We can see that for all queries the rank join approaches perform
better or at least equal to the baseline shj operator. On average, the execution
times for rj-cc and rj-tc were 23.13s and 20.32s, whereas for shj it was 43.05s.
This represents an improvement in performance of the rj-cc and rj-tc operators
over the shj operator by factors of 1.86 and 2.14, respectively.
The improved performance of the rank join operators is due to top-k pro-
cessing, because these operators do not have to process all input data in order
to produce the k top results, but can terminate early. On the other hand, the
shj implementation produces all results. Fig. 4b shows the average number of
retrieved sources for different values of k. We can see clearly that the rank join
approaches retrieve fewer sources than the baseline approach. In fact, rj-cc and
rj-tc retrieve and process only 41% and 34%, respectively, of the sources that
the shj approach requires. This is a significant advantage in the Linked Data
context, where sources can only be retrieved in their entirety.
However, we also see that the rank join operators sometimes do not perform
better than shj. In these cases, the result is small (e.g., Q18 has only two results).
The rank join operators have to read all inputs and compute all results in these
cases. For example, for Q20 the rank join approaches retrieve and process all
35103 sources, just as the shj approach does.
Bounding Strategies. We now examine the effect of the bounding strategies on
overall execution time. The average processing times mentioned earlier represent
Fig. 4. (a) All queries with their evaluation times (k = 1, d = n). (b) Average number
of sources over all queries (different k, d = n). (c) Average evaluation time over all
queries (different k, d = n). (d) Average evaluation time over all queries (different
score distributions, k = 10). (e) Average evaluation time over all queries with varying
number of triple patterns (k = 1, d = n). The y-axes show evaluation time in seconds
for (a) and (c)–(e), and the number of retrieved sources for (b); the series are rj-cc,
rj-tc and shj.
an improvement of 12% of rj-tc over rj-cc. For Q3, the improvement is even
higher, where rj-tc takes 11s, compared to 30s for rj-cc.
The improved performance can be explained with the tighter, more precise
bounding strategy of rj-tc compared to rj-cc. For example, our bounding strat-
egy can take advantage of a large star-shaped subexpression with 3 patterns
in Q3, leading to better performance because of better upper bound estimates.
Moreover, we observed that the look-ahead strategy helps to calculate a much
tighter upper bound especially when there are large score differences between
successive elements from a particular input.
In both cases, a tighter (more precise) bound means that results can be re-
ported earlier and less inputs have to be read. This is directly reflected in the
number of sources that are processed by rj-tc and rj-cc, where on average, rj-tc
requires 23% fewer sources than rj-cc. Note, while in Fig. 4a rj-tc’s performance
often seems to be comparable to rj-cc, Fig. 4b makes the differences more clear
in terms of the number of retrieved sources. For instance, both systems require
an equal amount of processing time for Q17. However, rj-tc retrieves 7% fewer
sources. Such “small” savings did not show properly in our evaluation (as we
retrieved sources locally), but would affect processing time in a real-world setting
with network latency.
Concerning the outlier Q19, we noticed that rj-tc did read slightly more in-
put (2%) than rj-cc. This behavior is due to our implementation: Sources are
retrieved in parallel to join execution. In some cases, the join operators and the
source retriever did not stop at the same time.
We conclude that rj-tc performs equally well or better than rj-cc. For some
queries (i.e., entity queries and inputs with large score differences) we are able
to achieve performance gains up to 60% compared to the rj-cc baseline.
Early Pruning. We observed that this strategy leads to lower buffer sizes (thus,
less memory consumption). For instance with Q9, rj-tc could prune 8% of its
buffered data. However, we also noticed that the number of sources loaded and
scanned is actually the key factor. While pruning had positive effects, the im-
provement is small compared to what could be achieved with tighter bounds (for
Q9 73% of total processing time was spent on loading and scanning sources).
Effect of Result Size k. Fig. 4c depicts the average query processing time
for all three approaches at different k (with d = n). We observed that the
time for shj is constant in k, as shj always computes all results, and that the
rank join approaches outperform shj for all k. However, with increasing k, more
inputs need to be processed. Thus, the runtime differences between the rank join
approaches and shj operator become smaller. For instance, for k = 1 the average
time saving over all queries is 46% (52%) for rj-cc (rj-tc), while it is only 31%
(41%) for k = 10.
Further, we can see in Fig. 4c that rj-tc outperforms rj-cc over all values for
k. The differences are due to our tighter bounding strategy, which substantially
reduces the amount of required input. For instance, for k = 10, rj-tc requires
21% fewer inputs than rj-cc on average.
We see that rj-tc and rj-cc behave similarly for increasing k. Both operators
become less efficient with increasing k (Fig. 4c).
Effect of Score Distributions. Fig. 4d shows average processing times for
all approaches for the three score distributions. We see that the performance of
both rank join operators varied only slightly w.r.t. different score distributions.
For instance, rj-cc performed better by 7% on the normal distribution compared
to the uniform distribution. The shj operator has constant evaluation times over
all distributions.
Effect of Query Complexity. Fig. 4e shows average processing times (with
k = 1, d = n) for different numbers of triple patterns. Overall, processing times
increase for all systems with an increasing number of patterns. Again, we see
that the rank join operators outperform shj for all query sizes. In particular, for
queries with 5 triple patterns, we noticed the effects of our entity bounds more
clearly, as those queries often contained entity queries of length up to 3.
5 Related Work
The top-k join problem has been addressed before, as discussed by a recent sur-
vey [8]. The J* rank join, based on the A* algorithm, was proposed in [14]. Other
rank join algorithms, HRJN and HRJN*, were introduced in [7] and further ex-
tended in [12]. In contrast to previous works, we aim at the Linked Data context.
As recent work [6,3,10,11] has shown, Linked Data query processing introduces
various novel challenges. In particular, in contrast to the state-of-the-art pull-based
rank join, we need a push-based execution for queries over Linked Data.
We therefore adapt pull strategies to the push-based execution model (based on
operator scheduling). Further, our work is different from prior work on Web-
accessible databases [20], because we rely exclusively on simple HTTP lookups
for data access, and use only basic statistics in the source index.
There are different bounding strategies: In [2,17], the authors introduced a
new Feasible-Region (FR) bound for the general setting of n-ary joins and mul-
tiple score attributes. However, it has been proven that the PBRJ template is
instance-optimal in the restricted setting of binary joins using corner-bound and
a single score attribute [2,17]. We adapt the corner-bound to the Linked Data
setting and provide tighter, more precise bounds that allow for earlier termina-
tion and better performance.
Similar to our pruning approach, [18] estimates the likelihood of partial results
contributing to a final result (if the estimate is below a given threshold partial
results are pruned). However, [18] addressed the top-k selection problem, which
is different from our top-k join problem. More importantly, we do not rely on
probabilistic estimates for pruning, but employ accurate upper bounds. Thus,
we do not approximate final top-k results.
6 Conclusion
We discussed how existing top-k join techniques can be adapted to the Linked
Data context. Moreover, we provide two optimizations: (1) tighter bounds es-
timation for early termination, and (2) aggressive result pruning. We show in
real-world Linked Data experiments that top-k processing can substantially im-
prove performance compared to the state-of-the-art baseline. Further perfor-
mance gains could be observed using the proposed optimizations. In future work,
we would like to address different scheduling strategies as well as further Linked Data
aspects like network latency or source availability.
References
1. Elbassuoni, S., Ramanath, M., Schenkel, R., Sydow, M., Weikum, G.: Language-
model-based ranking for queries on RDF-graphs. In: CIKM, pp. 977–986 (2009)
2. Finger, J., Polyzotis, N.: Robust and efficient algorithms for rank join evaluation.
In: SIGMOD, pp. 415–428 (2009)
3. Harth, A., Hose, K., Karnstedt, M., Polleres, A., Sattler, K., Umbrich, J.: Data
summaries for on-demand queries over linked data. In: World Wide Web (2010)
4. Harth, A., Kinsella, S., Decker, S.: Using Naming Authority to Rank Data and
Ontologies for Web Search. In: Bernstein, A., Karger, D.R., Heath, T., Feigenbaum,
L., Maynard, D., Motta, E., Thirunarayan, K. (eds.) ISWC 2009. LNCS, vol. 5823,
pp. 277–292. Springer, Heidelberg (2009)
5. Hartig, O.: Zero-Knowledge Query Planning for an Iterator Implementation of Link
Traversal Based Query Execution. In: Antoniou, G., Grobelnik, M., Simperl, E.,
Parsia, B., Plexousakis, D., De Leenheer, P., Pan, J. (eds.) ESWC 2011, Part I.
LNCS, vol. 6643, pp. 154–169. Springer, Heidelberg (2011)
6. Hartig, O., Bizer, C., Freytag, J.-C.: Executing SPARQL Queries over the Web of
Linked Data. In: Bernstein, A., Karger, D.R., Heath, T., Feigenbaum, L., Maynard,
D., Motta, E., Thirunarayan, K. (eds.) ISWC 2009. LNCS, vol. 5823, pp. 293–309.
Springer, Heidelberg (2009)
7. Ilyas, I.F., Aref, W.G., Elmagarmid, A.K.: Supporting top-k join queries in rela-
tional databases. The VLDB Journal 13, 207–221 (2004)
8. Ilyas, I.F., Beskales, G., Soliman, M.A.: A survey of top-k query processing tech-
niques in relational database systems. ACM Comput. Surv. 58, 11:1–11:58 (2008)
9. Klyne, G., Carroll, J.J., McBride, B.: Resource description framework (RDF): con-
cepts and abstract syntax (2004)
10. Ladwig, G., Tran, T.: Linked Data Query Processing Strategies. In: Patel-
Schneider, P.F., Pan, Y., Hitzler, P., Mika, P., Zhang, L., Pan, J.Z., Horrocks,
I., Glimm, B. (eds.) ISWC 2010, Part I. LNCS, vol. 6496, pp. 453–469. Springer,
Heidelberg (2010)
11. Ladwig, G., Tran, T.: SIHJoin: Querying Remote and Local Linked Data. In: An-
toniou, G., Grobelnik, M., Simperl, E., Parsia, B., Plexousakis, D., De Leenheer,
P., Pan, J. (eds.) ESWC 2011, Part I. LNCS, vol. 6643, pp. 139–153. Springer,
Heidelberg (2011)
12. Li, C., Chang, K.C.-C., Ilyas, I.F., Song, S.: RankSQL: query algebra and optimiza-
tion for relational top-k queries. In: SIGMOD, pp. 131–142 (2005)
13. Mamoulis, N., Yiu, M.L., Cheng, K.H., Cheung, D.W.: Efficient top-k aggregation
of ranked inputs. ACM Trans. Database Syst. (2007)
14. Natsev, A., Chang, Y.-C., Smith, J.R., Li, C.-S., Vitter, J.S.: Supporting incre-
mental join queries on ranked inputs. In: VLDB, pp. 281–290 (2001)
15. Prud’hommeaux, E., Seaborne, A.: SPARQL Query Language for RDF. W3C Rec-
ommendation (2008)
16. Schmedding, F.: Incremental SPARQL evaluation for query answering on linked
data. In: Workshop on Consuming Linked Data in Conjunction with ISWC (2011)
17. Schnaitter, K., Polyzotis, N.: Optimal algorithms for evaluating rank joins in
database systems. ACM Trans. Database Syst. 35, 6:1–6:47 (2010)
18. Theobald, M., Weikum, G., Schenkel, R.: Top-k query evaluation with probabilistic
guarantees. In: VLDB, pp. 648–659 (2004)
19. Wagner, A., Tran, D.T., Ladwig, G., Harth, A., Studer, R.: Top-k linked data
query processing (2011), http://www.aifb.kit.edu/web/Techreport3022
20. Wu, M., Berti-Equille, L., Marian, A., Procopiuc, C.M., Srivastava, D.: Processing
top-k join queries. In: VLDB, pp. 860–870 (2010)
Preserving Information Content in RDF
Using Bounded Homomorphisms
Abstract. The topic of study in the present paper is the class of RDF
homomorphisms that substitute one predicate for another throughout
a set of RDF triples, on the condition that the predicate in question
is not also a subject or object. These maps turn out to be suitable for
reasoning about similarities in information content between two or more
RDF graphs. As such they are very useful e.g. for migrating data from
one RDF vocabulary to another. In this paper we address a particular
instance of this problem and try to provide an answer to the question
of when we are licensed to say that data is being transformed, reused or
merged in a non-distortive manner. We place this problem in the context
of RDF and Linked Data, and study the problem in relation to SPARQL
construct queries.
1 Introduction
As yet, the World Wide Web shows a bias towards getting the information to
flow, at the expense of maintaining the integrity of the circulated information.
Maintaining integrity is usually recognised as a very real and increasingly acute
need, though. Take public sector information: open public sector information is
a valuable national resource, and there is widespread agreement that dissem-
ination promotes transparent and accountable government, improves quality of
service, and in general serves to maintain a well-informed public. Yet, whilst the
political pressure for reusable public sector information is building momentum,
as witnessed e.g. by the European Public Sector Information Directive of 2003,
governments as suppliers and authoritative sources of information on the Web
must nevertheless acknowledge the challenges related to maintaining the primary
nature of its information. This points to a general tension between two funda-
mental requirements of the data-oriented Web: Keep the data freely flowing, but
shepherd the data into sanctioned use. In the present paper we shall place this
problem in the context of RDF and Linked Data, and study it in relation to
SPARQL construct queries.
Example 1. The Cultural Heritage Management Office in Oslo is the City of
Oslo’s adviser on questions of cultural conservation of architecturally and cul-
turally valuable buildings, sites and environments. It maintains a list of protected
buildings, known as ‘the yellow list’, which has been transformed to RDF and
published as Linked Data [11]. A small excerpt is given below:
<http://sws.ifi.uio.no/gulliste/kulturminne/208/5/6643335/597618>
rdf:type gul:Kontor ;
hvor:gateadresse "Akersgata 44" ; geo:long "10.749" ;
hvor:postnummer "0180" ; geo:lat "59.916" .
Note that there is no explicit representation of city or country, and no grouping
of similar data. Suppose now that we wish to lift all available information about
culturally valuable buildings in Norway to the national level. We do so by adding
Oslo and Norway as parts of the address data. Also, we add types to buildings by
linking to the relevant class from the CIDOC CRM standard for cultural heritage
information (http://www.cidoc-crm.org/ ). For heuristic purposes we also group
geographical information and address information respectively around suitably
typed nodes:
CONSTRUCT{ ?x rdf:type ?y, cidoc:E25.Man-Made_Feature;
vcard:adr [ rdf:type vcard:Address;
vcard:street-address ?street; vcard:zip-code ?code;
vcard:locality geonames:3143242; # Oslo
vcard:country-name "Norway"@en ] ;
vcard:geo [ rdf:type geo:Point; geo:lat ?lat; geo:long ?long ] }
WHERE{ ?x rdf:type ?y;
hvor:gateadresse ?street; hvor:postnummer ?code;
geo:lat ?lat; geo:long ?long . }
The structural change to the data caused by the construct query is rather
thoroughgoing and extensive. Yet, there is still a principled relationship between
structural elements of the two graphs, e.g. the property hvor:gateadresse morphs
into the sequence of properties vcard:adr, vcard:street-address. Moreover, the
pairs of resources that are linked by hvor:gateadresse in the former graph re-
main linked by vcard:adr, vcard:street-address in the latter graph, and no other
pairs are similarly related. Indeed, the transformation can easily be seen to be
systematic in the sense that all pairs related in the same manner in the source
graph are transformed uniformly in terms of the same structural element in the
target graph. It is also non-distortive in the sense that no other pair of resources
are so related. Contrast with the case in which we replace hvor:postnummer with
vcard:locality, whilst keeping everything else as-is. We would then not be able
to distinguish between cities and zip-codes in the target graph, and would in
that sense have distorted the information from the source.
The purpose of the present paper is to give these intuitions mathematical
content. That is, we wish to formulate a criterion to sort conservative from
non-conservative ways of transforming data. Since we take construct queries
as our paradigm of RDF transformation, this means sorting conservative from
non-conservative construct queries. It is important to note that the uniformity
and non-distortiveness criteria we shall propose are purely structural, and do
not heed the semantics of the vocabulary elements involved. To the question
‘what makes the chain vcard:adr, vcard:zip-code an adequate representation of
hvor:gateadresse?’ the only answer is ‘because somebody wishes it to be so’.
What our criteria have to offer is thus nothing more than a clear notion of what
you are rationally committed to, in terms of the structure of the target graph,
once you have made your choice of representatives for elements of the source
graph. We will do so by studying a class of RDF homomorphisms that sub-
stitutes one edge for another throughout a set of RDF triples, subject to the
condition that the edge in question is not also a vertex.
The paper is organised as follows: Section 2 defines the general concept
of an RDF homomorphism, and distinguishes the subset of conditional edge-
substitutions mentioned above. We shall call them p-maps. Section 3 recapitu-
lates the basic syntax and semantics of the SPARQL query language. Section 4
defines the notion of a bounded p-map and argues that it gives an adequate
criterion of conservativeness. Section 5 generalises the conservativeness criterion
to handle more sophisticated construct queries, e.g. that of Example 1. Section 6
presents essential results on the computational properties of computing p-maps,
whilst Section 7 closes with a summary and a few pointers to future lines of
research.
Related Work. Our homomorphisms differ from those of [1, 2] which essentially
rename blank nodes in order to mimic the semantics of RDF as defined in [6].
To the best of our knowledge, our particular notion of RDF homomorphism, and
the use of it, is novel. Considered as an embedding of one graph into another
a p-map can be viewed in two different ways which, although they amount to
the same formally, give rather different gestalts to the central issue. Looked at
from one angle, our problem (i.e. embedding a source into a target) resembles
data exchange: Given one source of data marked up in one way, one wants to
migrate the data to some target repository in a way that conforms to the target’s
schema. Yet, it differs from the problem studied in [4] in that our setup takes the
target to be fixed and possibly non-empty. Looked at from another angle, the
problem concerns how to extend an RDF graph conservatively. More specifically,
it concerns the problem of how to ensure that a transformation of source data into
a target repository does not interfere with the assertive content of the source.
Yet, it is unlike logic-based conservative extensions [5, 7, 8] in that the logical
vocabulary is being replaced as the source is ‘extended’ into the target. As such
bounded p-maps may also have a role to play in data fusion, which is defined as
“the process of fusing multiple records representing the same real-world object
into a single, consistent, and clean representation” [3].
Thus, a p-map is an RDF homomorphism in which the only elements that are
allowed to vary are edges: If h : G −→ H is an RDF homomorphism between
RDF graphs G and H, then h(g) ∈ H for all triples g ∈ G, while if h is a p-map,
then ⟨a, h(p), b⟩ ∈ H for all triples ⟨a, p, b⟩ ∈ G. This is a natural class of
homomorphisms to study for our purposes since edges are typically vocabulary
elements, while vertexes contain the “real” data. Note, though, that the definition
of a p-map is not without subtleties, given that a single element in an RDF
graph may be both a vertex and an edge:
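A direct check of this condition can be sketched as follows, with RDF graphs represented as Python sets of (s, p, o) tuples and the map h as a dictionary on edges (identity everywhere else). This merely illustrates the definition; it is not code from the paper.

def is_p_map(h, G, H):
    """h is a p-map of G to H: every triple of G, with its edge p replaced by h(p), is in H."""
    vertexes = {s for (s, _, o) in G} | {o for (s, _, o) in G}
    for (s, p, o) in G:
        if (s, h.get(p, p), o) not in H:
            return False            # the image of some triple is missing from H
        if p in vertexes and h.get(p, p) != p:
            return False            # an edge that also occurs as a vertex must be left fixed
    return True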
According to this definition SPARQL graph patterns do not contain blank nodes.
As shown in [1] it is easy to extend the definition in this respect, but as blank
nodes behave like variables in select queries, we shall not care to do so. We use
var(S) to denote the set of variables occurring in a set of triples S, and varp (S)
to denote those occurring as edges, i.e. in the second element of triples.
Definition 8. The answer to a query ⟨S, x⟩ over a graph G, written ⟦S, x⟧(G),
is the set {μ(x) | μ ∈ ⟦S⟧G}.
We end this section with a lemma that links the principal notions introduced
so far. It shows, essentially, that answers to queries and evaluations of SPARQL
patterns are interchangeable idioms for talking about transformations of RDF
graphs:
Lemma 1. Let G and H be RDF graphs, and h any function from UG to UH.
Then,
1. ⟦S, x⟧(G) ⊆ ⟦h(S), x⟧(H) iff ⟦S⟧G ⊆ ⟦h(S)⟧H.
2. ⟦h(S), x⟧(H) ⊆ ⟦S, x⟧(G) iff ⟦h(S)⟧H ⊆ ⟦S⟧G.
Proof. The claim follows immediately from Definition 8 and the fact that
dom(h) ∩ V = ∅, whence var(S) = var(h(S)).
4 Degrees of Conservativeness
Having assembled the requisite preliminaries, we turn to the problem of analysing
the notion of a conservative construct query. We shall limit the analysis in this
section to the simple case where the query transforms RDF triples to RDF
triples. Let G be any RDF graph. As a tentative characterisation we may say
that a construct query is conservative if applied to G it evaluates to a graph
As we shall see, each bound reflects a different aspect of the structure of the
target in the source. It is easy to check that (p1) is strictly stronger than (p2),
and that (p2) is strictly stronger than (p3). To be sure, there are other bounds,
but these are particularly simple and natural. We shall need the following lemma:
Lemma 2. If varp(t) = ∅ and ⟦t, x⟧(G) ≠ ∅ for a triple pattern t, then
π2(h(t)) = h(π2(t)) for any p-map h.
easy to check that ⟦h(Sb & Sc)⟧H = ⟦h(Sb) & h(Sc)⟧H, whence μ ∈ ⟦h(Sb)⟧H ⋈
⟦h(Sc)⟧H by Definition 7(2). It follows from Definition 6 that μ = μb ∪ μc for
compatible μb and μc such that μb ∈ ⟦h(Sb)⟧H and μc ∈ ⟦h(Sc)⟧H. Now, since
⟦Sb & Sc, x⟧(G) ≠ ∅ and varp(Sb & Sc) = ∅, by the supposition of the case, we have
⟦Sb, y⟧(G) ≠ ∅ and varp(Sb) = ∅ for y such that yi ∈ dom(μb) for all yi ∈ y,
and similarly for Sc. Therefore the induction hypothesis applies, so μb ∈ ⟦Sb⟧G
and μc ∈ ⟦Sc⟧G by Lemma 1. We have already assumed that μb and μc are
compatible, so μb ∪ μc ∈ ⟦Sb⟧G ⋈ ⟦Sc⟧G = ⟦Sb & Sc⟧G by Definition 7(2). Since
μb ∪ μc = μ, we are done.
Theorem 1 and Theorem 2 show that p1-maps induce a transformation between
RDF graphs that is exact in the sense that the diagram in Figure 1 commutes.
That is, whatever answer a query Q yields over G, h(Q) yields precisely the same
answer over H. Interestingly, the converse is also true: if a function induces an,
in this sense, exact transformation between graphs, then it is a p1-map:
Theorem 3. Let h be any function from U to itself. If for all SPARQL patterns
S we have ⟦S, x⟧(G) = ⟦h(S), x⟧(H), then h is a p1-map of G to H.
Proof. The proof is by induction on the complexity of S. The induction step is
easy, so we show only the base case where S is a triple pattern t. Suppose that
h is not a homomorphism between G and H. Then there is a triple ⟨a, p, b⟩ ∈ G
such that ⟨h(a), h(p), h(b)⟩ ∉ H. Let t := ⟨x, p, y⟩ and x := (x, y). Then h(t) =
⟨x, h(p), y⟩, and (a, b) ∈ ⟦t, x⟧(G) \ ⟦h(t), x⟧(H). We therefore have ⟦t, x⟧(G) ≠
⟦h(t), x⟧(H). Suppose next that h does not satisfy (p1). Then there is a triple
⟨a, h(p), b⟩ ∈ H such that ⟨a, p, b⟩ ∉ G, and t := ⟨x, p, y⟩ separates G and H by
a similar argument.
The class of p1-maps thus completely characterises the pairs of graphs for which
there is an exact triple-to-triple translation of select queries from one to the other
(Fig. 1 depicts the commuting diagram: Q evaluated over G and h(Q) evaluated
over H, with h mapping G to H, yield the same set of answer tuples in 2^(U^n)).
Note that exactness here does not mean that the source and target are isomorphic.
The target may contain more information in the form of triples, as long as these
triples do not have source edges that map to them. Indeed, a p1-map need not
even be injective:
Example 2. Assume we have the following RDF graphs: G := {⟨a, p, b⟩, ⟨a, q, b⟩},
H1 := {⟨a, r, b⟩} and H2 := {⟨a, r, b⟩, ⟨c, s, d⟩}. Then {p → r, q → r} is a p1-map
of G to H1, and of G to H2, given that h is the identity on vertexes.
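Example 2 can be replayed in code. The (p1) test below is reconstructed from the proof of Theorem 3 (a triple ⟨a, h(p), b⟩ in H must have a preimage ⟨a, p, b⟩ in G); together with the is_p_map sketch given earlier it confirms the two p1-maps of the example.

def satisfies_p1(h, G, H):
    """(p1): for every mapped edge, a triple (a, h(p), b) in H implies (a, p, b) in G."""
    for p, q in h.items():
        for (a, e, b) in H:
            if e == q and (a, p, b) not in G:
                return False
    return True

G = {("a", "p", "b"), ("a", "q", "b")}
H1 = {("a", "r", "b")}
H2 = {("a", "r", "b"), ("c", "s", "d")}
h = {"p": "r", "q": "r"}
print(satisfies_p1(h, G, H1), satisfies_p1(h, G, H2))   # True True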
Characterisation results similar to Theorem 2 and Theorem 3 are easily forthcoming
for p2- and p3-maps as well. The proofs are reruns with minor modifications
of that for p1-maps.
Theorem 4. Let h be any function from U to itself and suppose ⟦S, x⟧(G) ≠ ∅
and varp(S) = ∅. Then, h : G −→ H is a p2-map iff u ∈ ⟦h(S), x⟧(H) \
⟦S, x⟧(G) implies u ∉ UG for any u ∈ u, and h : G −→ H is a p3-map iff
u ∈ ⟦h(S), x⟧(H) \ ⟦S, x⟧(G) implies u ∉ UG for some u ∈ u.
[Figure: the source graph G, its image h(G) and the target graph H, with edge
labels such as type, lat, long, g-adresse/s-address and p-num/z-code. Marked
arrows represent triples satisfying bounds (p1), (p2) and (p3), respectively. The
set of target vertexes is partitioned into two sets, VG and VH \ VG, illustrated
by the dashed line.]
Proof. In the limiting case that ⟦S⟧G = ∅, we have ⟦C, S⟧(G) = ∅ as well,
whence the theorem holds vacuously. For the principal case where ⟦S⟧G ≠ ∅,
suppose g ∈ ⟦S, S⟧(G) = ∪μ∈⟦S⟧G (μ(ρμ(S))). By Definition 3 we have that S
does not contain blank nodes, so g ∈ ∪μ∈⟦S⟧G (μ(S)). It follows that g = μ(t)
for a triple pattern t in S and some μ ∈ ⟦S⟧G. By assumption, h is a p-map
of S to C, whence h(t) is a triple pattern in C, and since var(h(t)) = var(t),
it follows that μ(h(t)) ∈ ∪μ∈⟦S⟧G (μ(ρμ(C))). It remains to show that h(g) =
μ(h(t)). Since g = μ(t), it suffices to show that h(μ(t)) = μ(h(t)), which is just
the commutativity of μ and h. The relationships between the different graphs
are illustrated in Figure 2. Now assume that h from S to C is restricted by a
bound (p1)–(p3), indicated by (p) in the figure. Then for every t ∈ C there is a
t′ ∈ S such that the bound holds. For all μ such that μ(t) ∈ ⟦C, S⟧(G) we have
μ(t′) ∈ ⟦S, S⟧(G), but then the p-map h must be restricted by the same bound
as between ⟦S, S⟧(G) and ⟦C, S⟧(G).
Thus, if there is a bounded p-map from the WHERE block S to the CONSTRUCT block C
of a construct query, then any sub-graph that matches the former can be p-mapped
with the same bound into the result of the query (Fig. 2 shows the commuting
square relating S and C via h under a bound (p), and ⟦S, S⟧(G) and ⟦C, S⟧(G)
under the same bound, with μ mapping the blocks to their instantiations). By the
properties of bounded p-maps, therefore, we are licensed to say that the construct
query is a conservative transformation.
The class of c-maps thus consists of those path-maps that behave like a p-map
on unary paths (1), are sensitive to =V - and =E -equivalence (2, 3), and never
truncate paths (4). The next theorem shows that every c-map of G↑ to H ↑
induces a p-map of κG (G) to κH (H), for some κG and κH :
Definition 17. Suppose α and β are paths in the RDF graphs G and H, re-
spectively. We shall say that a c-map f : G↑ −→ H ↑ is respectively a c1-, c2- or
c3-map if one of the following conditions holds:
Proof. Suppose f is a c3-map and assume that there is a triple g := ⟨a, hf(p), b⟩ ∈
κH(H↑) where a, b ∈ VG. We need to show that ⟨a, p, b⟩ ∈ κG(G↑). Given that
κG and κH are surjective by Lemma 4, let α ∈ G↑ and β ∈ H↑ be such that
π2(κG(α)) = p and κH(β) = g. By the definition of hf given in Lemma 5,
f(α) =E β, so by bound (c3) there is a γ ∈ G↑ where f(γ) = β. We have
g =V β =V γ by Definition 16 and Definition 15, and γ =E α by Definition 16,
since f(α) =E β. This means that κG(γ) = ⟨a, p, b⟩. For the converse direction,
suppose hf is a p3-map and assume, for some α ∈ G↑ and β ∈ H↑, that f(α) =E
β and a, b ∈ VG, where a := px(β) and b := dt(β). By the definition of hf we
have hf(π2(κG(α))) = π2(κH(β)), so let κH(β) := ⟨a, hf(p), b⟩. Since hf satisfies
(p3), we have ⟨a, p, b⟩ ∈ κG(G↑). By Lemma 4, there is a γ ∈ G↑ such that
κG(γ) = ⟨a, p, b⟩. Since κG(γ) =E κG(α), we have f(γ) =E f(α), and given that
γ =V f(γ) =V β, we arrive at f(γ) = β. It is easy to adjust the membership of
a, b, px(β), dt(β) in VG in the proof and confirm the claim for the two other
pairs of corresponding bounds.
Expanding the class of conservative construct queries to also handle paths re-
quires the following generalisation of Theorem 6:
of α. This means we may not have μ(ρμ(f(α))) =E μ′(ρμ′(f(α))), whence f is not a c-map of S, S(G) to C, S(G). In the absence of blank nodes in C this situation cannot arise, ρ becomes redundant, and the proof becomes a straightforward generalisation of that for Theorem 6.
As this proof-sketch is designed to show, extra care is required when the construct query C contains blank nodes—as it does for instance in Example 1. However, the preceding lemmata and theorems lay out all the essential steps. More specifically, all that is needed in order to accommodate Example 1 and similar ones is to substitute equivalence classes of paths for paths throughout, where equivalence is equality up to relabelling of blank nodes. The verification of this claim is a rerun with minor modifications, and has therefore been left out.
6 Computational Properties
The problem of deciding whether there exists a homomorphism between two
(standard) graphs is well-known to be NP-complete. Since p-maps are more
restricted than generic graph homomorphisms, identifying p-maps between RDF
graphs is an easier task. In fact it can be done in polynomial time, the verification
of which is supported by the following lemmata:
Lemma 6. Let h1 and h2 be p-maps of G1 and G2 respectively to H. Then
h1 ∪h2 is a p-map of G1 ∪G2 to H if h1 (u) = h2 (u) for all u ∈ dom(h1 )∩dom(h2 ).
Lemma 7. If h1, h2 are bounded p-maps such that h1(u) = h2(u) for all u ∈ dom(h1) ∩ dom(h2), then h1 ∪ h2 is a bounded p-map satisfying the weaker of the two bounds.
According to Lemma 6 the task of finding a p-map of G to H can be reduced to the task of finding a set of p-maps of sub-graphs of G into H that are compatible with respect to shared domain elements. Lemma 7 then tells us that to check whether the
resulting p-map is bounded by some bound pn, it suffices to check whether each
of the smaller maps is. This procedure, each step of which is clearly polynomial,
does not require any backtracking, whence:
Theorem 9. Given two RDF graphs G and H, finding a p-map h : G −→ H,
bounded or not, is a problem polynomial in the size of G and H.
Proof (Sketch). For any RDF graphs G and H, fix the set VG of nodes occurring as vertexes in G. Then for each p ∈ EG construct a p-map of Gp := {⟨a, p′, b⟩ ∈ G | p′ = p} into H. This amounts to iterating through the edges of H and finding one, say q, such that i) ⟨a, p, b⟩ ∈ Gp → ⟨a, q, b⟩ ∈ Hq and ii) if p ∈ VG then p = q. Lemma 6 tells us that the union of these maps is a p-map of G to H, i.e. no choice of q for p is a wrong choice. There is therefore no need for backtracking, whence a p-map can be computed in polynomial time. To check whether it satisfies a given bound pn, it suffices by Lemma 7 to check that each of the maps hp of Gp to Hq does. That is, for each element ⟨a, q, b⟩ ∈ Hq \ Gp, check that the required triple is in Gp. This is clearly a polynomial check.
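To make the procedure concrete, the following is a minimal Python sketch of the predicate-wise search, under our own assumptions: graphs are plain sets of (subject, predicate, object) tuples and the helper names are ours. It illustrates the argument only; it is not the authors' implementation (their tool, Mapper Dan, is described below).

```python
# Minimal sketch of the polynomial p-map search described above.
# Graphs are sets of (subject, predicate, object) triples; the triple
# representation and helper names are illustrative, not from the paper.

def vertices(graph):
    """Nodes occurring in subject or object position."""
    return {s for s, _, _ in graph} | {o for _, _, o in graph}

def edges(graph):
    """Nodes occurring in predicate position."""
    return {p for _, p, _ in graph}

def find_p_map(G, H):
    """Return a predicate mapping h (dict) inducing a p-map of G into H,
    or None if no such map exists. No backtracking is needed: any
    admissible choice of q for p can be kept (cf. Lemma 6)."""
    VG = vertices(G)
    h = {}
    for p in edges(G):
        Gp = {(a, q, b) for (a, q, b) in G if q == p}
        image = None
        for q in edges(H):
            # (i) every triple of Gp must reappear in H under q;
            # (ii) if p is also a vertex of G, it must be mapped to itself.
            if all((a, q, b) in H for (a, _, b) in Gp) and (p not in VG or p == q):
                image = q
                break
        if image is None:
            return None
        h[p] = image
    return h

# Example: renaming a predicate while keeping shared vertices fixed.
G = {("ex:a", "ex:phone", "ex:b")}
H = {("ex:a", "ex:tel", "ex:b")}
print(find_p_map(G, H))   # {'ex:phone': 'ex:tel'}
```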
For c-maps the situation is more complex. Since the composition of an RDF
graph may be exponentially larger than the graph itself, the problem is no longer
polynomial. More precisely, if G is an RDF graph and κ a composition function for G, then $|\kappa(G^{\uparrow})| \leq \sum_{n=1}^{|V_G|} |E_G| \times n!$. Yet, this is not a problem for any
realistically sized construct query. An experimental application is up and running
at http://sws.ifi.uio.no/MapperDan/ . Mapper Dan takes two RDF graphs or a
construct query as input, lets the user specify which bounds to apply to which
predicates, and checks whether there is a map under the given bound between
the two graphs or between the WHERE and CONSTRUCT block of the construct query.
In the cases where a bound is violated Mapper Dan offers guidance, if possible,
as to how to obtain a stratified map which satisfies the bounds. A map can
be used to translate the source RDF data to the target vocabulary, produce a
construct query which reflects the map, or to rewrite SPARQL queries.
This paper provides a structural criterion that separates conservative from non-
conservative uses of SPARQL construct queries. Conservativity is here measured
against the ‘asserted content’ of the underlying source, which is required to be
preserved by the possible change of vocabulary induced by the construct clause.
Our problem led us to consider a class of RDF homomorphisms (p-maps) the
existence of which guarantees that the source and target interlock in a reciprocal
simulation. Viewed as functions from triples to triples, p-maps are computable in
polynomial time. The complexity increases with more complex graph patterns.
The class of p-maps has other applications besides that described here, e.g. as the
basis for a more refined notion of RDF merging. As of today merging is based on
the method of taking unions modulo the standardising apart of blank nodes. If
one also wants a uniform representation of the data thus collected, this method is too crude. What one would want, rather, is a way of transforming the data by
swapping vocabulary elements whilst, as far as it goes, preserving the information
content of all the involved sources (this is not easily achieved by subsuming a set
of properties or types under a common super-type in an ontology). Such a merge
procedure may turn out to be an important prerequisite for truly RESTful write
operations on the web of linked data.
Assessing Linked Data Mappings
Using Network Measures
Abstract. Linked Data is at its core about the setting of links between
resources. Links provide enriched semantics, pointers to extra informa-
tion and enable the merging of data sets. However, as the amount of
Linked Data has grown, there has been the need to automate the cre-
ation of links and such automated approaches can create low-quality links
or unsuitable network structures. In particular, it is difficult to know
whether the links introduced improve or diminish the quality of Linked
Data. In this paper, we present LINK-QA, an extensible framework that
allows for the assessment of Linked Data mappings using network met-
rics. We test five metrics using this framework on a set of known good
and bad links generated by a common mapping system, and show the
behaviour of those metrics.
1 Introduction
Linked Data features a distributed publication model that allows for any data
publisher to semantically link to other resources on the Web. Because of this
open nature, several mechanisms have been introduced to semi-automatically
link resources on the Web of Data to improve its connectivity and increase
its semantic richness. This partially automated introduction of links begs the
question as to which links are improving the quality of the Web of Data or
are just adding clutter. This notion of quality is particularly important because
unlike the regular Web, there is not a human deciding based on context whether a
link is useful or not. Instead, automated agents (with currently less capabilities)
must be able to make these decisions.
There are a number of possible ways to measure the quality of links. In this
work, we explore the use of network measures as one avenue of determining the
quality. These statistical techniques provide summaries of the network along dif-
ferent dimensions, for example, by detecting how interlinked a node is within
a network [3]. The application of these measures for use in quality measurement
is motivated by recent work applying networks measures to the Web of Data [11].
Concretely, we pose the question of whether network measures can be used
to detect changes in quality with the introduction of new links (i.e. mappings)
2 Network Definitions
We now introduce the definitions used throughout this paper. The graph we
want to study will be referred to as the Data Network. It is the network of facts
provided by the graph of the Web of Data, excluding the blank nodes.
In this paper, we sample the Data Network by collecting information about the
neighbourhood of selected sets of resources within it. A resource’s neighbourhood
consists of a direct neighbourhood and an extended neighbourhood:
Figure 1 shows an example of a local network for vi; the two node markings distinguish the nodes in Ni from those in Ni∗. Local networks created around nodes from G are the focus of the analysis performed by our framework. It is worth noting that the union of all local neighbourhoods of every node in G is equivalent to this graph. That is, $G \equiv \bigcup_{v_i \in N} G_i$.
Fig. 1. Example of direct and extended neighbourhood around a source node
3 Network Metrics
Based on the above definitions, we now detail a set of 5 network metrics to use in quality detection. Relevant prior work on network analysis was the key criterion for establishing these metrics. Although Linked Data networks are different from social networks, we used the latter as a starting point. The degree, clustering coefficient and
centrality measures are justified as measures of network robustness [1,10]. The
other two metrics are based on studies of the Web of Data as a network that
show that fragmentation of the SameAs network is common and thus may be a
sign of low quality [12].
In specifying these metrics, one must not only define the measure itself but also
what constitutes quality with respect to that measure. Defining such a “quality
goal” is difficult as we are only beginning to obtain empirical evidence about
what network topologies map to qualitative notions of quality [10]. To address
this problem, for each metric, we define an ideal and justify it with respect to
some well-known quality notions from both network science and Linked Data
publication practice. We consider the following to be broad quality goals that should be reached by the creation of links:
1. modifications should bring the topology of the network closer to that of a power law network, to make the network more robust against random failure;
2. modifications should lower the differences between the centrality of the hubs in the network, to make the network more robust against targeted failure of these critical nodes;
3. modifications should increase the clustering within topical groups of resources and also lower the average path length between groups (i.e. foster a small world network).
3.1 Degree
This measures how many hubs there are in a network. The aim is to have a network which allows for fast connectivity between different parts of the network, thus making it easier for automated agents to find a variety of information through traversal. Power-law networks are known to be robust against random
failure and are a characteristic of small world networks [1].
Measure. The degree of a node is given by its number of incoming and outgoing
edges.
$m_i^{degree} = |\{e_{ij} \mid v_j \in N_i, e_{ij} \in E_i\}| + |\{e_{ji} \mid v_j \in N_i, e_{ji} \in E_i\}|$
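As a minimal illustration (using our own representation of a local network as a list of (source, label, target) edges, not the LINK-QA code), the degree measure can be read as follows.

```python
# Sketch of the degree measure for a node in its local network.
# Edges are (source, label, target) tuples; this representation is
# illustrative and not taken from the LINK-QA implementation.

def degree(node, edges):
    incoming = sum(1 for (s, _, t) in edges if t == node)
    outgoing = sum(1 for (s, _, t) in edges if s == node)
    return incoming + outgoing

edges = [("a", "p", "b"), ("b", "q", "a"), ("a", "p", "c")]
print(degree("a", edges))  # 3
```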
Ideal. The highest average clustering coefficient a network can have is 1, mean-
ing that every node is connected to every other node (the network is said to
be “complete”). Although this is a result the Web of Data should not aim at,
as most links would then be meaningless, an increase of the clustering coeffi-
cient is a sign of cohesiveness among local clusters. The emergence of such topic
oriented clusters are common in the Web of Data and are in line with having
a small world. We thus set an average clustering coefficient of 1 as a goal and
define the distance accordingly. S being the set of all resources, the distance to
the ideal is 1 minus the average clustering coefficient of the nodes in S.
$d^{clustering} = 1 - \frac{1}{|S|}\sum_{v_i \in S} m_i^{clustering}$
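Assuming per-node clustering coefficients are already available, the distance to this ideal is simply one minus their average; the short sketch below is our own reading of the formula.

```python
# Sketch: distance to the ideal clustering coefficient of 1.
# clustering[v] is assumed to hold the clustering coefficient of node v.

def clustering_distance(clustering, S):
    return 1 - sum(clustering[v] for v in S) / len(S)

clustering = {"a": 0.5, "b": 1.0, "c": 0.25}
print(clustering_distance(clustering, ["a", "b", "c"]))  # ~0.4167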
3.3 Centrality
Commonly used estimates of the centrality of a node in a graph are betweenness
centrality, closeness centrality, and degree centrality. All these values indicate the
critical position of a node in a topology. For this metric, we focus on betweenness
centrality, which indicates the likelihood of a node being on the shortest path
between two other nodes. The computation of betweenness centrality requires
knowing the complete topology of the studied graph. Because our metrics are
node-centric and we only have access to the local neighbourhood, we use the
ratio of incoming and outgoing edges as a proxy.
$d^{centrality} = \sum_{i \in V} \frac{\max_{j \in V}(m_j^{centrality}) - m_i^{centrality}}{|V| - 1}$
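One possible reading of the in/out edge proxy and of the distance formula, in our own edge-list representation rather than the LINK-QA code, is sketched below.

```python
# Sketch: centrality proxy (ratio of incoming to outgoing edges) and the
# aggregated distance to the ideal. The edge representation is illustrative.

def centrality(node, edges):
    incoming = sum(1 for (s, _, t) in edges if t == node)
    outgoing = sum(1 for (s, _, t) in edges if s == node)
    return incoming / outgoing if outgoing else 0.0

def centrality_distance(scores):
    """scores: dict mapping each node of V to its centrality value."""
    best = max(scores.values())
    return sum(best - m for m in scores.values()) / (len(scores) - 1)

edges = [("a", "p", "b"), ("c", "q", "a"), ("d", "r", "a"), ("a", "s", "c")]
scores = {v: centrality(v, edges) for v in ("a", "b", "c", "d")}
print(centrality_distance(scores))  # ~0.67
```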
The very common owl:sameAs property can be improperly asserted. One way to
confirm a given sameAs relation is correct is to find closed chains of sameAs re-
lations between the linking resource and the resource linked. This metric detects
whether there are open sameAs chains in the network.
Measure. The metric counts the number of sameAs chains that are not closed.
Let $p_{ik} = \{e_{ij_1}, \ldots, e_{j_y k}\}$ be a path of length y, defined as a sequence of edges with the same label $l(p_{ik})$. The number of open chains is defined as
$m_i^{paths} = |\{p_{ik} \mid l(p_{ik}) = \text{owl:sameAs}, k \neq i\}|$
Ideal. Ideally, we would like to have no open sameAs chains in the WoD. If
the new links contribute to closing the open paths, their impact is considered
positive.
$d^{paths} = \sum_{v_i \in V} m_i^{paths}$
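A sketch of one way to count open chains follows; the interpretation (follow owl:sameAs edges from the resource and count maximal chains that never return to it) and the edge-list representation are ours, not the authors' code.

```python
# Sketch: counting open owl:sameAs chains starting from a resource.

def open_sameas_chains(start, edges, label="owl:sameAs"):
    sameas = {}
    for s, p, t in edges:
        if p == label:
            sameas.setdefault(s, []).append(t)

    open_chains = 0
    stack = [(start, {start})]
    while stack:
        node, seen = stack.pop()
        successors = [t for t in sameas.get(node, []) if t != node]
        if not successors and node != start:
            open_chains += 1                    # chain ended away from start
        for t in successors:
            if t == start:
                continue                        # chain closed, not counted
            if t not in seen:
                stack.append((t, seen | {t}))
    return open_chains

edges = [("a", "owl:sameAs", "b"), ("b", "owl:sameAs", "c")]
print(open_sameas_chains("a", edges))  # 1: a -> b -> c never returns to a
```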
Measure. The measure counts the number of new edges brought to a resource through the sameAs relation(s). The initial set of edges is defined as $A_i = \{e_{ij} \mid l(e_{ij}) = \text{owl:sameAs}, j \in N_i^+\}$; the set of edges brought in by the neighbours connected through a sameAs relation is defined as $B_i = \{e_{jl} \mid v_l \in N_j^+, l \neq i, e_{ij} \in N_i^+, l(e_{ij}) = \text{owl:sameAs}\}$. Finally, the gain is the difference between the two sets:
$m_i^{description} = |B_i \setminus A_i|$
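The gain can be sketched as below; the simplification of comparing (predicate, target) pairs rather than edge identities, and the edge-list representation, are our assumptions.

```python
# Sketch: description richness gain of a resource i, i.e. the facts its
# sameAs neighbours contribute beyond those i already has. Illustrative only.

def description_gain(i, edges, label="owl:sameAs"):
    own = {(p, t) for (s, p, t) in edges if s == i}
    neighbours = {t for (s, p, t) in edges if s == i and p == label}
    contributed = {(p, t) for (s, p, t) in edges
                   if s in neighbours and t != i}
    return len(contributed - own)

edges = [("a", "owl:sameAs", "b"), ("b", "foaf:name", "Alice")]
print(description_gain("a", edges))  # 1: the sameAs neighbour adds a new fact
```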
Fig. 2. Interaction between the different components of LINK-QA. The external inputs are indicated in dashed lines pointing towards the processes (rounded box) using them.
4.1 Components
Select. This component is responsible for selecting the set of resources to be
evaluated. This can be done through a variety of mechanisms including sampling
the Web of Data, using a user specified set of resources, or looking at the set
of resources to be linked by a link discovery algorithm. It is left to the user to
decide whether the set of resources is a reasonable sample of the Data Network.
Construct. Once a set of resources is selected, the local network, as defined in
Definition 2, is constructed for each resource. The local networks are created by
querying the Web of Data. Practically, LINK-QA makes use of either SPARQL
endpoints or data files to create the graph surrounding a resource. In particular,
sampling is achieved by first sending a SPARQL query to a list of endpoints. If
no data is found, LINK-QA falls back on de-referencing the resource.
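A minimal sketch of this fallback strategy is given below. It assumes Python with the SPARQLWrapper and rdflib libraries instead of the actual Java/Jena implementation, and the endpoint list and DESCRIBE query are our assumptions rather than details of LINK-QA.

```python
# Sketch of the "Construct" step: query SPARQL endpoints first, then fall
# back to dereferencing the resource. Not the actual LINK-QA code.
from SPARQLWrapper import SPARQLWrapper
from rdflib import Graph

def local_network(resource_uri, endpoints):
    for endpoint in endpoints:
        try:
            sparql = SPARQLWrapper(endpoint)
            sparql.setQuery(f"DESCRIBE <{resource_uri}>")
            graph = sparql.query().convert()   # DESCRIBE typically yields an rdflib Graph
            if len(graph) > 0:
                return graph
        except Exception:
            continue                           # try the next endpoint
    graph = Graph()
    graph.parse(resource_uri)                  # fall back to dereferencing
    return graph

# Hypothetical usage:
# g = local_network("http://dbpedia.org/resource/Amsterdam",
#                   ["http://dbpedia.org/sparql"])
```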
Extend. The “Extend” component adds new edges that are provided as input
to the framework. These input edges are added to each local network where
they apply. Once these edges are added, we compute a set of new local networks
around the original set of selected resources. The aim here is to measure the
impact of these new edges on the overall Data Network. This impact assessment
is done by the Compare component.
Analyse. Once the original local network and its extended local networks have
been created, an analysis consisting of two parts is performed:
Compare. The results coming from both analyses (before and after adding the new edges) are compared to the ideal distributions for the different metrics. The
comparison is provided to the user.
4.2 Implementation
The implementation is available as free software at http://bit.ly/Linked-QA,
and takes as input a set of resources, information from the Web of Data (i.e.
SPARQL endpoints and/or de-referencable resources) and a set of new triples
to perform quality assessment on. The implementation is written in Java and
uses Jena for interacting with RDF data. In particular, Jena TDB is used to
cache resource descriptions. Any23 is used for dereferencing data in order to get
a good coverage of possible publication formats.
The implementation generates HTML reports for the results of the quality
assessment. These reports are divided into three sections:
1. An overview of the status of the different metrics based on the change of distance to the ideal distribution when the new links are added. The status is “green” if the distance to the ideal decreased and “red” otherwise. The relative change is also indicated. These statuses are derived from the change in $d^{\text{metric name}}$ observed when adding new links.
2. One graph per metric showing the distribution of the $m^{\text{metric name}}$ values obtained before and after adding the new set of links. The rendering of these graphs is done by the Google Graph API.
3. A table reporting, for all of the metrics, the resources for which the score $m_i^{\text{metric name}}$ has changed most after the introduction of the new links.
It is important to note that LINK-QA is aimed at analysing a set of links and
providing insights to aid manual verification. There is no automated repair of
the links nor an exact listing of the faulty links. Outliers - resources that rank
farthest from the ideal distribution for a metric - are pointed out, but the final
assessment is left to the user.
5 Metric Analysis
The framework is designed to analyse the potential impact of a set of link can-
didates prior to their publication on the Web of Data. To evaluate this, we test
the links produced by a project using state of the art link generation tools: The
European project LOD Around the Clock (LATC) aims to enable the use of
the Linked Open Data cloud for research and business purposes. One goal of
the project is the publication of new high quality links. LATC created a set of
linking specifications (link specs) for the Silk engine, whose output are link sets.
In order to assess the correctness of link specs, samples taken from the generated
links are manually checked. This results in two reference sets containing all the
positive (correct, good) and negative (incorrect, bad) links of the sample. The
link specs along with the link sets they produce, and the corresponding manu-
ally created reference sets are publicly available.1 Based on these link sets we
performed experiments to answer the following questions:
1. Do positive linksets decrease the distance to a metric’s defined ideal, whereas
negative ones increase it? If that is the case, it would allow us to distinguish
between link sets having high and low ratios of bad links.
2. Is there a correlation between outliers and bad links? If so, resources that
rank farthest from the ideal distribution of a metric would relate to incorrect
links from/to them.
Table 1. Detection result for each metric for both good and bad links. Blank - no
detection, I - Incorrect detection, C - correct detection. (lgd = linkedgeodata).
1 https://github.com/LATC/24-7-platform/tree/master/link-specifications
A global success rate can be quickly drawn from Table 1 by considering the cumulative numbers of “C” and “I” to compute recall and precision scores.
$recall = \frac{I + C}{B + I + C} = \frac{21 + 20}{19 + 21 + 20} = 0.68$
$precision = \frac{C}{I + C} = \frac{20}{21 + 20} = 0.49$
These two values indicate a mediocre success of our metrics on these data sets.
From the table and these values, we conclude that common metrics such as
centrality, clustering, and degree are insufficient for detecting quality. Addition-
ally, while the Description Richness and Open SameAs Chain metrics look more
promising, especially at detecting good and bad links, respectively, they report
too many false positives for reference sets of the opposite polarity.
We now present a more in-depth analysis of the results found in the table
focusing on the sensitivity of the metrics, their detection accuracy and their
agreement.
Sensitivity of Metrics. The presence in Table 1 of blank fields indicates that the
metric was not able to detect any change in the topology of the neighbourhood of
resources, meaning that it fails at the first goal. We realise that the Degree metric
is the only one to always detect changes. A behaviour that can be explained by
the fact that adding a new link almost always yields new connections and thus
alters the degree distribution.
The low performance of other metrics in detecting change can be explained by
either a lack of information in the local neighbourhood or a stable change. The
first may happen in the case of metrics such as the sameAs chains. If no sameAs
relations are present in the local network of the two resources linked, there will be
no chain modified and, thus, the metric will not detect any positive or negative
effect for this new link. A stable change can happen if the link created does not
impact the global distribution of the metric. The results found in Table 1 report changes in distributions with respect to the ideals defined; if the distribution does not change with the addition of the links, the metric is ineffective.
led us to aim at a small world topology, which does not correlate with the re-
sults found in our experiments; 2. Coverage of sample: The use of a sample
of the studied network forces us to consider a proxy for the actual metrics we
would have had computed on the actual network. Most noticeably, the centrality
measure featured in our prototype is a rough approximation. For this metric in
particular, a wider local neighbourhood around a resource would lead to better
estimates. The same applies to the detection of sameAs chains which may span
outside of the local neighbourhood we currently define; 3. Validity of metrics:
The somewhat better performance of Linked Data specific network measures
suggests that such tailored metrics may be more effective than “class” met-
rics. The degree, clustering and centrality metrics look at the topology of the
network without considering its semantics. However, as it is confirmed by our
experiments, the creation of links is very much driven by these semantics and the
eventual changes in topology do not provide us with enough insights alone. Our
intuition, to be verified, is that effective metrics will leverage both the topological
and semantic aspect of the network.
We believe a future path forward is to gain more empirical evidence for par-
ticular topologies and their connection to quality. The sampling of the Web of
Data will also have to be reconsidered and may need to be defined with respect
to a particular metric.
for detecting bad links, they show a trend in this direction, indicating that the
predictions could be improved with the combination of multiple metrics. We
exclude the cluster coefficient from the following analysis. Given a ranking of resources for each metric, we can assign each resource a sorted list of its ranks, e.g. Fabulous Disaster → (3, 3, 10, 17). A resource’s n-th rank, considering n = 1...4 metrics, is then determined by taking the (n − 1)-th element of this list. Ideally, we would like to see negative resources receiving smaller n-th ranks than positive ones. The distributions of the n-th ranks for all ns are shown in
Figure 4. These charts indicate that a combination indeed improves the results:
For example when combining 2 metrics, the probability of finding a negative
resource on one of the first 5 ranks increases from about 20 to 30 percent,
whereas an equi-distribution would only yield 10 percent (5 negative resources
in 50 links). This effect increases, as can be observed in the right column: The
positive-to-negative ratio is 0.6 for n = 4, which shows that a combination of
metrics is effective in detecting incorrect links.
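The rank-combination step can be sketched as follows; the code is our own, the rank list for Fabulous Disaster is taken from the example above, and the second resource is made up.

```python
# Sketch: combining per-metric rankings into n-th ranks.

def nth_rank(rank_list, n):
    """n-th rank over n metrics: the (n - 1)-th element of the sorted list."""
    return sorted(rank_list)[n - 1]

ranks = {"Fabulous Disaster": [3, 3, 10, 17], "Other Resource": [1, 8, 9, 22]}
print(nth_rank(ranks["Fabulous Disaster"], 2))  # 3
```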
Fig. 4. Distribution of negative and positive resources by their n-th rank. For “negative” and “positive”, the y-axis shows the absolute number of resources detected for every bucket of five ranks. For “relative” it shows the ratio of negative to positive links.
6 Related Work
In this section, we provide a review of related work touching on this paper. We
particularly focus on quality with respect to the Semantic Web but also briefly
touch on Network Analysis and the automated creation of links.
6.1 Quality
Improving data quality has become an increasingly pressing issue as the Web of
Data grows. For example, the Pedantic Web group has encouraged data providers
to follow best practices [15]. Much of the work related to quality has been on
the application of information quality assessment on the Semantic Web. In the
WIQA framework [4], policies can be expressed to determine whether to trust a
given information item based on both provenance and background information
expressed as Named Graphs [5]. Hartig and Zhao follow a similar approach using
annotated provenance graphs to perform quality assessment [19]. Harth et al. [13]
introduce the notion of naming authority to rank data expressed in RDF based
on network relationships and PageRank.
Trust is often thought of as being synonymous with quality and has been widely studied, including in artificial intelligence, the Web and the Semantic Web. For
a readable overview of trust research in artificial intelligence, we refer readers
to Sabater and Sierra [20]. For a more specialized review of trust research as
it pertains to the Web see [8]. Artz and Gil provide a review of trust tailored
particularly to the Semantic Web [2]. Specific works include the IWTrust al-
gorithm for question answering systems [24] and tSPARQL for querying trust
values using SPARQL [14]. Our approach differs from these approaches in that
it focuses on using network measures to determine quality.
Closer to our work is the early work by Golbeck investigating trust networks
in the Semantic Web [9]. This work introduced the notion of using network
analysis type algorithms for determining trust or quality. However, this work
focuses on trust from the point of view of social networks, not on networks in
general. In some more recent work [11], network analysis has been used to study
the robustness of the Web of Data. Our work differs in that it takes a wider view
of quality beyond just robustness. The closest work is most likely the work by
Bonatti et al., which uses a variety of techniques for determining trust to perform
robust reasoning [16]. In particular, they use a PageRank style algorithm to rank
the quality of various sources while performing reasoning. Their work focuses
on using these inputs for reasoning whereas LINK-QA specifically focuses on
providing a quality analysis tool. Additionally, we provide for multiple measures
for quality. Indeed, we see our work as complementary as it could provide input
into the reasoning process.
7 Conclusion
In this paper, we described LINK-QA, an extensible framework for performing
quality assessment on the Web of Data. We described five metrics that might
be useful to determine quality of Linked Data. These metrics were analysed
using a set of known good and bad quality links created using the mapping tool
Silk. The metrics were shown to be partially effective at detecting such links.
From these results, we conclude that more tailored network measures need to
be developed or that such a network based approach may need a bigger sample
than the one we introduced. We are currently looking at finding more semantics-
based measures, such as the sameAs chains. We are also looking at the interplay
of different measures and the combined interpretation of their results.
References
1. Adamic, L.A.: The Small World Web. In: Abiteboul, S., Vercoustre, A.-M. (eds.)
ECDL 1999. LNCS, vol. 1696, pp. 443–452. Springer, Heidelberg (1999)
2. Artz, D., Gil, Y.: A survey of trust in computer science and the Semantic Web. J.
Web Sem. 5(2), 58–71 (2007)
3. Barabási, A.L.: Linked (Perseus, Cambridge, Massachusetts) (2002)
4. Bizer, C., Cyganiak, R.: Quality-driven information filtering using the WIQA policy
framework. Journal of Web Semantics 7(1), 1–10 (2009)
5. Carroll, J.J., Bizer, C., Hayes, P., Stickler, P.: Named graphs, provenance and trust.
In: International World Wide Web Conference (2005)
6. Choi, N., Song, I.Y., Han, H.: A survey on ontology mapping. ACM SIGMOD
Record 35(3), 34–41 (2006)
7. Ding, L., Finin, T.: Characterizing the Semantic Web on the Web. In: Cruz, I.,
Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo,
L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 242–257. Springer, Heidelberg (2006)
8. Golbeck, J.: Trust on the world wide web: a survey. Foundations and Trends in
Web Science 1(2), 131–197 (2006)
9. Golbeck, J., Parsia, B., Hendler, J.: Trust Networks on the Semantic Web. In:
Klusch, M., Omicini, A., Ossowski, S., Laamanen, H. (eds.) CIA 2003. LNCS
(LNAI), vol. 2782, pp. 238–249. Springer, Heidelberg (2003)
10. Guéret, C., Groth, P., van Harmelen, F., Schlobach, S.: Finding the Achilles Heel
of the Web of Data: Using Network Analysis for Link-Recommendation. In: Patel-
Schneider, P.F., Pan, Y., Hitzler, P., Mika, P., Zhang, L., Pan, J.Z., Horrocks,
I., Glimm, B. (eds.) ISWC 2010, Part I. LNCS, vol. 6496, pp. 289–304. Springer,
Heidelberg (2010)
11. Guéret, C., Wang, S., Schlobach, S.: The web of data is a complex system - first
insight into its multi-scale network properties. In: Proc. of the European Conference
on Complex Systems, pp. 1–12 (2010)
12. Guéret, C., Wang, S., Groth, P., Schlobach, S.: Multi-scale analysis of the web
of data: A challenge to the complex system’s community. Advances in Complex
Systems 14(04), 587 (2011)
13. Harth, A., Kinsella, S., Decker, S.: Using Naming Authority to Rank Data and
Ontologies for Web Search. In: Bernstein, A., Karger, D.R., Heath, T., Feigenbaum,
L., Maynard, D., Motta, E., Thirunarayan, K. (eds.) ISWC 2009. LNCS, vol. 5823,
pp. 277–292. Springer, Heidelberg (2009)
14. Hartig, O.: Querying Trust in RDF Data with tSPARQL. In: Aroyo, L., Traverso,
P., Ciravegna, F., Cimiano, P., Heath, T., Hyvönen, E., Mizoguchi, R., Oren, E.,
Sabou, M., Simperl, E. (eds.) ESWC 2009. LNCS, vol. 5554, pp. 5–20. Springer,
Heidelberg (2009)
15. Hogan, A., Harth, A., Passant, A., Decker, S., Polleres, A.: Weaving the Pedantic
Web. In: Linked Data on the Web Workshop (LDOW 2010) at WWW 2010 (2010)
16. Hogan, A., Bonatti, P., Polleres, A., Sauro, L.: Robust and scalable linked data rea-
soning incorporating provenance and trust annotations. Journal of Web Semantics
(2011) (to appear) (accepted)
17. Ngonga Ngomo, A.-C., Auer, S.: Limes - a time-efficient approach for large-scale
link discovery on the web of data. In: Proc. of IJCAI (2011)
18. Niu, X., Wang, H., Wu, G., Qi, G., Yu, Y.: Evaluating the Stability and Credibility
of Ontology Matching Methods. In: Antoniou, G., Grobelnik, M., Simperl, E.,
Parsia, B., Plexousakis, D., De Leenheer, P., Pan, J. (eds.) ESWC 2011, Part I.
LNCS, vol. 6643, pp. 275–289. Springer, Heidelberg (2011)
19. Hartig, O., Zhao, J.: Using Web Data Provenance for Quality Assessment. In: Proc.
of the 1st Int. Workshop on the Role of Semantic Web in Provenance Management
(SWPM) at ISWC, Washington, USA (2009)
20. Sabater, J., Sierra, C.: Review on Computational Trust and Reputation Models.
Artificial Intelligence Review 24(1), 33 (2005)
21. Theoharis, Y., Tzitzikas, Y., Kotzinos, D., Christophides, V.: On graph features of
semantic web schemas. IEEE Transactions on Knowledge and Data Engineering 20,
692–702 (2007)
22. Toupikov, N., Umbrich, J., Delbru, R., Hausenblas, M., Tummarello, G.: DING!
Dataset Ranking using Formal Descriptions. In: WWW 2009 Workshop: Linked
Data on the Web (LDOW 2009), Madrid, Spain (2009)
23. Volz, J., Bizer, C., Gaedke, M., Kobilarov, G.: Silk: A link discovery framework
for the web of data. In: 2nd Linked Data on the Web Workshop LDOW 2009, pp.
1–6. CEUR-WS (2009)
24. Zaihrayeu, I., da Silva, P.P., McGuinness, D.L.: IWTrust: Improving User Trust in
Answers from the Web. In: Herrmann, P., Issarny, V., Shiu, S.C.K. (eds.) iTrust
2005. LNCS, vol. 3477, pp. 384–392. Springer, Heidelberg (2005)
A Novel Concept-Based Search for the Web of Data
Using UMBEL and a Fuzzy Retrieval Model
Knowledge and Data Engineering Group, Trinity College Dublin, Dublin, Ireland
{Melike.Sah,Vincent.Wade}@scss.tcd.ie
Abstract. As the size of Linked Open Data (LOD) increases, the search and
access to the relevant LOD resources becomes more challenging. To overcome
search difficulties, we propose a novel concept-based search mechanism for the
Web of Data (WoD) based on the UMBEL concept hierarchy and a fuzzy-based retrieval model. The proposed search mechanism groups LOD resources with the same concepts to form categories, which are called concept lenses, for more
efficient access to the WoD. To achieve concept-based search, we use UMBEL
concept hierarchy for representing context of LOD resources. A semantic
indexing model is applied for efficient representation of UMBEL concept
descriptions and a novel fuzzy-based categorization algorithm is introduced for
classification of LOD resources to UMBEL concepts. The proposed fuzzy-
based model was evaluated on a particular benchmark (~10,000 mappings). The
evaluation results show that we can achieve highly acceptable categorization
accuracy and perform better than the vector space model.
1 Introduction
A key research focus in Web technology community is Linked Data. The term Linked
Data describes best practices for creating typed links between data from different
sources using a set of Linked Data principles. This ensures that published data
becomes part of a single global data space, which is known as “Web of Data” (WoD)
or “Linked Open Data” (LOD). Since the data is structured and relationships to other
data resources are explicitly explained, LOD allows discovery of new knowledge by
traversing links. However, as the number of datasets and data on the LOD is
increasing, current LOD search engines are becoming more important to find relevant
data for further exploration. This is analogous to the problem of the original Web [1].
However, current LOD search mechanisms are more focused on providing automated
information access to services and simple search result lists for users [2, 3]. For
example, they present search results in decreasing relevance order based on some
criterion (i.e. relevance to class names). However, result list based presentations of
retrieved links/resources do not provide efficient means to access LOD resources
since URIs or titles of LOD resources are not very informative. More efficient access
and discovery mechanisms on the WoD are crucial for finding starting points for
browsing and exploring potential data/datasets by Web developers and data engineers.
Currently, there are few approaches that investigate this problem [1, 4, 11].
Our objective is to improve current search mechanisms on the WoD with a novel
concept-based search method. The dictionary definition of concept is “a general
notion or idea”. Concept-based search systems provide search results based on the
meaning, general notion of information objects so that search results can be presented
in more meaningful and coherent ways. Key challenges in supporting concept-based
search are: (1) the availability of a broad conceptual structure, which comprises good
concept descriptions, (2) extraction of high-quality terms from LOD resources for the
representation of resource context and categorization under the conceptual structure,
and (3) a robust categorization algorithm. In this paper, we focus on these issues in
order to introduce a novel concept-based search mechanism for the WoD.
2 Related Work
that use different schemas. In contrast, faceted search systems are generally bound to
specific schema properties and it can be difficult to generate useful facets for large
and heterogeneous data of the WoD [12]. In addition, scalability and system
performance is another issue for faceted systems over LOD [13].
1 DBpedia and Yago provide rich structures for linking instance data. However they do not have a consistent framework of concepts (topics) for representing those instances.
2 It will be made public. A video demo is available at http://www.scss.tcd.ie/melike.sah/concept_lenses.swf
Given that indexing and caching of the WoD is very expensive, our approach is based on existing 3rd party services. In particular, we use Sindice search for querying the WoD and the Sindice Cache for retrieving RDF descriptions of LOD resources [2]. The Lucene IR framework is utilized for indexing of concepts and for the implementation of the fuzzy retrieval model. The server side is implemented with Java Servlets and
uses Jena for processing RDF. The client side is written using Javascript and AJAX.
In Figure 2, a screen shot of the concept-based search interface is presented. The
user interface presents the list of categories (concepts) at the left of the screen for
quick navigation access to concept lenses (LOD resources grouped based on context).
Fig. 2. The user interface of the concept-based search mechanism – categories (concepts) at the
left of the screen and concept lenses in the main panel of the screen
following reasons: the URI of a resource may contain keywords relevant to the
context of the resource. Titles (dc:title), names (foaf:name) and labels (rdfs:label), so-called label features, usually include informative content about the resource.
property names typically provide further knowledge about “what type of the resource
is”. For example, birth place, father, mother, spouse are properties associated with
persons. However, some property names are generic (i.e. rdfs:label, rdf:type,
foaf:name, etc.) and do not provide information about the context. To overcome this,
we compiled a list of generic property names and if a property name matches any of
these, it is not accepted. On the other hand, type (rdf:type and dc:type) and subject
(dc:subject) provides the most discriminative features about the context of a LOD
resource. For instance, type and subject values can provide useful knowledge about
concepts (general notion or idea) of the resource for correct categorization. For
instance, for label “ocean” the context is not clear. But if we know type is album, then
we can understand that “ocean” is a label of an “album”.
From each LOD resource, keywords are extracted from the features as explained
above. Then, qualifiers and prepositions are removed from the extracted keywords, to
enhance categorization accuracy. For instance, for the context “hockey games in
Canada”, the main concept is “hockey games” and not the “country” Canada. Thus,
we remove qualifiers/prepositions and the words after them for better term
extraction. Qualifier removal is based on keyword matching of from, in, of and has.
To ensure correct matching, there must be a space before/after the qualifiers, e.g.
words in italic are removed after the qualifiers: people from Alaska, reservoirs in
Idaho, mountains of Tibet, chair has four legs. This has the effect of generalizing the
concepts, which is perfectly reasonable for our purpose of categorizing and browsing
based on higher level of concepts.
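A possible rendering of this clean-up step is sketched below; the qualifier list is the one given above, while the function itself and its tokenisation are our own assumptions.

```python
# Sketch: drop a qualifier/preposition ("from", "in", "of", "has") and
# everything after it, so that "hockey games in Canada" generalises to
# "hockey games".
QUALIFIERS = {"from", "in", "of", "has"}

def strip_qualifiers(phrase):
    words = phrase.split()
    for i, w in enumerate(words):
        if w.lower() in QUALIFIERS:
            return " ".join(words[:i])
    return phrase

print(strip_qualifiers("people from Alaska"))     # "people"
print(strip_qualifiers("reservoirs in Idaho"))    # "reservoirs"
print(strip_qualifiers("chair has four legs"))    # "chair"
```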
On the other hand, after initial experiments, we observed that many LOD resources
do not have information about label, type and subject. To improve lexical data
mining, we also apply a semantic enrichment technique, where more lexical data is
gathered from the linked data graph of the resource by traversing owl:sameAs and
dbpedia:WikiPageRedirect links. If a resource has such links, first we obtain RDF
description of these resources and apply the feature extraction techniques explained
above. Finally, from the obtained and enriched terms, stop words are removed and
stemming is applied, where we obtain the final terms, which we call LOD terms.
LOD Term Weights Based on Features. Since different LOD terms have comparative
importance on the context of the LOD resource, terms are weighted. For example, type
and subject features provide more discriminative terms. Therefore terms which appear
in these features should be weighted higher. To achieve this, we divided LOD resource
features into two groups: Important features (I) and Other features (O). For important
features, terms from the type and subject features are combined to form a vector. For other features, terms from the URI, label and property names are combined to form a vector. We
use the normalized term frequency (tf) of these features for term weighting as given
below,
if $t \in I$: $w(t) = 0.5 + 0.5 \times \frac{tf(t)_I}{\max(tf_I)}$;  if $t \in O$: $w(t) = \frac{tf(t)_O}{\max(tf_O)}$ (1)
where w(t) is the weight of term t, and $tf(t)_I$ and $tf(t)_O$ are the term frequencies of term t in the important and other features respectively. $\max(tf_I)$ and $\max(tf_O)$ represent the maximum term frequency in those features. For terms that appear in important
features (I), a minimum weight threshold of 0.5 is used to encourage syntactic
matches to these terms within UMBEL concept descriptions for categorization. On
the other hand, inverse document frequency (idf) can also be used together with term
frequency. However, idf calculation for each LOD term is expensive. For dynamic idf
calculations, dynamic search on the LOD is required for each term, which is
computationally expensive. For offline calculations of idf, we need to continuously
index the LOD, which is a resource-intensive task. Thus, we did not use idf.
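Equation (1) can be read as the following sketch; the dictionaries holding raw term frequencies for the I and O feature groups, and the rule that the important-feature weight wins when a term occurs in both groups, are our assumptions.

```python
# Sketch of Eq. (1): weight terms from the important features (type, subject)
# with a 0.5 floor, and terms from the other features (URI, label, property
# names) by plain normalised term frequency.

def term_weights(important_tf, other_tf):
    weights = {}
    if important_tf:
        max_i = max(important_tf.values())
        for t, tf in important_tf.items():
            weights[t] = 0.5 + 0.5 * tf / max_i
    if other_tf:
        max_o = max(other_tf.values())
        for t, tf in other_tf.items():
            weights.setdefault(t, tf / max_o)   # important features win on overlap
    return weights

print(term_weights({"album": 2, "music": 1}, {"ocean": 1}))
# {'album': 1.0, 'music': 0.75, 'ocean': 1.0}
```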
may provide further relevant lexical terms about “sports”. Instead of accepting sub-
concepts as a part of the concept, we separately index sub-concepts labels as subl. The
subl include terms that occur in all inferred sub-concepts’ URIs, preferred and
alternative labels. In addition, we observed that many LOD resources contain links to
super-concepts. For example, a resource about a writer also contains information that
writer is a person. Thus, we index super-concept labels for more robust descriptors,
where supl contain terms that occur in all inferred super-concepts’ URIs, preferred
and alternative labels. Finally, al contain all the terms that appear in all parts.
UMBEL is formatted in RDF N-triple format and we load the triples into a triple
store (Jena persistent storage using a MySQL DB) to extract the terms from UMBEL
concepts. Each concept is divided into semantic parts of uri, cl, subl, supl and al using
SPARQL queries, where concept descriptions are extracted from each semantic part.
From the concept descriptions, stop words are removed, as they have no semantic
importance to the description, and words are stemmed into their roots using the Porter
stemmer. The resultant words are accepted as concept terms. Finally, the extracted
concept terms from the semantic parts are indexed. To do this, we consider each
concept as a unique document and each semantic part is separately indexed as a term
vector under the document (concept) using Lucene IR framework. In addition, the
maximum normalized term frequency and inverse document frequency term value of
each semantic part is calculated (which is subsequently used by the fuzzy retrieval
model) and indexed together with the concept for quick retrieval. The inverted
concept index is used for categorization of LOD resources.
It should also be noted that concept descriptions can be enhanced with lexical
variations using WordNet. However, typically UMBEL descriptions include such
word variations in alternative labels. This is the advantage of UMBEL being built
upon OpenCyc, since OpenCyc contains rich lexical concept descriptions using
WordNet. For the UMBEL concept <http://umbel.org/umbel/rc/Automobile> for
instance, preferred label is car and alternative labels are auto, automobile,
automobiles, autos, cars, motorcar and motorcars. This demonstrates a rich set of
apparent lexical variations. Since these rich lexical descriptions are available in
UMBEL, we did not use other lexical enhancement techniques because accuracy of
the automated enhancements may affect categorization performance significantly.
Fuzzy Retrieval Model. First, UMBEL concept candidates are retrieved by searching
LOD terms in all labels (al) of concepts. Then, for each LOD term, t, a relevancy
score to every found UMBEL concept, c, is calculated by using a fuzzy function,
μ (t , c) ∈ [0,1] , on uri, cl, subl and supl semantic parts (since different parts have
relative importance on the context of the concept c). Thus, μ(t, c) shows the degree of membership of the term t to all semantic parts of the concept c, where high values of μ(t, c) show that t is a good descriptor for the concept c and μ(t, c) = 0 means that the term t is not relevant for c. For the calculation of μ(t, c), first the membership degree of the term t to each part p has to be computed:
Definition 1: The membership degree of the term t to each part p = [cl, subl, supl, uri] is a fuzzy function μ(t, c, p) ∈ [0,1], which is based on the tf × idf model. First, we calculate the normalized term frequency (ntf) of t in the cl, subl and supl parts of the concept c,
$ntf(t,c,cl) = 0.5 + 0.5 \times \frac{tf(t,c,cl)}{\max(tf(c,cl))}, \qquad \forall p \in \{subl, supl\}: ntf(t,c,p) = \frac{tf(t,c,p)}{\max(tf(c,p))}$ (2)
where tf (t , c, p ) represents term frequency of the term t in the part p of the concept c
and max(tf (c, p )) represents maximum term frequency in the part p of the concept c.
We calculate local normalized term frequencies for each semantic part, rather than
calculating normalized term frequency using all terms of the concept in all semantic
parts. In this way, term importance for a particular semantic part is obtained and
frequent terms have higher value. For cl a minimum threshold value of 0.5 is set,
since cl contains preferred/alternative terms of the concept, which is important for the
context of the concept c. Then, for each part, p, we calculate the idf value of t in p,
$\forall p \in \{cl, subl, supl\}: idf(t,c,p) = \log\frac{C}{(n : t \in p) + 1}$ (3)
here, n : t ∈ p is the number of semantic parts that contain the term t in p (i.e.
n : t ∈ cl ) and C is the total number of concepts in the collection. Again idf of a term
in a particular semantic part is calculated instead of idf of a term in the whole corpus.
In this way, rare terms that occur in a particular semantic part are assigned higher values, which means that rare terms are more important for the semantic part p.
Next, the tf × idf value of the term t in the semantic part p is computed,
$\forall p \in \{cl, subl, supl\}: tf{\times}idf(t,c,p) = ntf(t,c,p) \times idf(t,c,p)$ (4)
Finally, the membership degree of the term t to each part p is a fuzzy value,
$\forall p \in \{cl, subl, supl\}: \mu(t,c,p) = \frac{tf{\times}idf(t,c,p)}{\max(tf{\times}idf(c,p))}$ (5)
where μ(t, c, p) ∈ [0,1] equals the normalized tf × idf value of the term t in the part p. In this way, a fuzzy relevancy score is generated, where the term that has the maximum tf × idf value in the part p has μ(t, c, p) = 1, and μ(t, c, p) reduces as the term importance decreases. As we discussed earlier, the maximum tf × idf value for each
semantic part is calculated and indexed during the semantic indexing for better
algorithm performance. Last, μ(t, c, uri) ∈ [0,1] equals a normalized term frequency,
$\mu(t,c,uri) = 0.5 + 0.5 \times tf(t,c,uri) \Big/ \sum_{i=1}^{n} tf(t_i,c,uri)$ (6)
here, the term frequency of t in uri is divided by the total number of terms in the uri. A minimum threshold value of 0.5 is set, since uri terms are important. If the uri contains one term, μ(t, c, uri) = 1, which means the term is important for the uri. The term importance decreases as the number of terms in the uri increases.
Definition 2: Relevancy of the term t to the concept c is calculated by μ(t, c), where the membership degrees of the term t to the parts uri, cl, subl and supl are combined,
$\mu(t,c) = \frac{w_{uri}\,\mu(t,c,uri) + w_{cl}\,\mu(t,c,cl) + w_{subl}\,\mu(t,c,subl) + w_{supl}\,\mu(t,c,supl)}{w_{uri} + w_{cl} + w_{subl} + w_{supl}}$ (7)
where μ(t, c) ∈ [0,1], and $w_{uri}$, $w_{cl}$, $w_{subl}$ and $w_{supl}$ are constant coefficients that help discriminate features obtained from different parts. For example, the terms obtained from the uri and cl can be weighted higher than those from subl and supl. The parameter values were experimentally determined, as we discuss later in the evaluations section.
Definition 3: Finally, the relevancy of all LOD terms T = {t1, ..., tm} to the concept c is
$\mu(T,c) = \sum_{i=1}^{m}\big(\mu(t_i,c) \times w(t_i)\big) \Big/ \sum_{i=1}^{m} w(t_i)$ (8)
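Definitions 1–3 can be condensed into the sketch below. The dictionary-based layout of a concept's semantic parts, the placeholder coefficient values and all function names are our assumptions; the actual system computes these scores over a Lucene index rather than in-memory dictionaries.

```python
import math

# Sketch of the fuzzy retrieval model of Definitions 1-3. A concept is a dict
# of semantic parts (uri, cl, subl, supl), each mapping terms to raw term
# frequencies. Layout, names and coefficients are illustrative only.

PARTS = ("cl", "subl", "supl")
W = {"uri": 2.0, "cl": 2.0, "subl": 1.0, "supl": 1.0}    # placeholder weights

def ntf(term, concept, part):                            # Eq. (2)
    tf = concept[part][term]
    max_tf = max(concept[part].values())
    return 0.5 + 0.5 * tf / max_tf if part == "cl" else tf / max_tf

def idf(term, part, corpus):                             # Eq. (3)
    n = sum(1 for c in corpus if term in c[part])
    return math.log(len(corpus) / (n + 1))

def mu_part(term, concept, part, corpus):                # Eqs. (4)-(5)
    if term not in concept[part]:
        return 0.0
    tfidf = {t: ntf(t, concept, part) * idf(t, part, corpus)
             for t in concept[part]}
    max_tfidf = max(tfidf.values())
    return tfidf[term] / max_tfidf if max_tfidf else 0.0

def mu_uri(term, concept):                               # Eq. (6)
    if term not in concept["uri"]:
        return 0.0
    return 0.5 + 0.5 * concept["uri"][term] / sum(concept["uri"].values())

def mu_term(term, concept, corpus):                      # Eq. (7)
    total = W["uri"] * mu_uri(term, concept) + sum(
        W[p] * mu_part(term, concept, p, corpus) for p in PARTS)
    return total / sum(W.values())

def mu(weights, concept, corpus):                        # Eq. (8)
    return sum(mu_term(t, concept, corpus) * w for t, w in weights.items()) \
        / sum(weights.values())

corpus = [
    {"uri": {"automobile": 1}, "cl": {"car": 2, "auto": 1},
     "subl": {"convertible": 1}, "supl": {"vehicle": 1}},
    {"uri": {"mountain": 1}, "cl": {"mountain": 2},
     "subl": {"volcano": 1}, "supl": {"place": 1}},
    {"uri": {"writer": 1}, "cl": {"writer": 1, "author": 1},
     "subl": {"novelist": 1}, "supl": {"person": 1}},
]
lod_terms = {"car": 1.0, "auto": 0.75}                   # weighted LOD terms
print(mu(lod_terms, corpus[0], corpus))                  # ~0.30: best match
print(mu(lod_terms, corpus[1], corpus))                  # 0.0: unrelated concept
```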
5 Evaluations
This section discusses the evaluation setup and the experiments undertaken to test the
performance of our approach. A particular benchmark was created to evaluate: (1)
performance of different LOD features, (2) categorization accuracy of the fuzzy
retrieval model against the vector space model, (3) efficiency of system performance.
5.1 Setup
Fig. 3. Categorization accuracy of the proposed fuzzy retrieval model with respect to different
LOD resource features and the semantic enrichment technique
Proposed Fuzzy Retrieval Model. Figure 3 shows precision and recall of the
proposed fuzzy retrieval model: (1) with different LOD resource features and (2) with
and without the semantic enrichment technique. The results show that among all LOD
resource features, type feature alone gave the best precision of 70.98% and 88.74%
without and with the enrichment respectively. This is because most resources contain
type feature, which provides knowledge about the context of resources. The subject feature achieved a precision of ~62%, the uri and label features alone did not perform well (~28%), and property names performed the worst. Among combinations of
different LOD resource features, type+uri and type+label provided the best accuracy
without the semantic enrichment with a precision of 85.55% and 86.38% respectively.
Other combinations did not improve the overall accuracy despite more LOD terms
being used in the categorization. Another interesting outcome is that the semantic
enrichment technique did not have a significant impact on the categorization accuracy
(~1% improvement) except for the type feature, where the enrichment technique improved the precision and recall by ~18%. In addition, we noticed that in
some cases all possible mappings from DBpedia to UMBEL are not included, e.g. a
volcano mountain is mapped as umbel:Mountain, but not as umbel:Volcano. Besides,
DBpedia uses more general mappings, for example, a science fiction writer is mapped
as umbel:Writer, despite the existence of umbel:ScienceFictionWriter. This could be
because of human error since a manual mapping process3 is involved, which can be error-prone. Although these particular cases affected categorization accuracy, the
proposed fuzzy retrieval model achieved high accuracy on the benchmark. Especially
high performance is achieved by using the type feature and the type+uri and
type+label features (with and without the enrichment). The results are promising
because typically LOD resources contain data about type and labels of the resource,
which can be used to provide high quality categorization.
3 http://umbel.googlecode.com/svn/trunk/v100/External%20Ontologies/dbpediaOntology.n3
Fig. 4. Comparison of the vector space model with the proposed fuzzy retrieval model
Vector Space Model. Since the proposed fuzzy retrieval model extends tf × idf with a
fuzzy relevancy score calculation using semantic structure of concepts, we compared
the categorization accuracy against the tf × idf model. In the vector space model, concept descriptions can be represented in a number of ways: using (1) only uri, (2) uri+cl, (3) uri+cl+subl, and (4) uri+cl+subl+supl. On the concept representation alternatives,
we applied tf × idf retrieval model on the benchmark. For fair comparison, the same
clean-up steps are applied to the vector space model (i.e. stemming, stop word and
qualifier removal) and the same voting algorithm is used if there is more than one
maximum categorization. In contrast, concept weights to all LOD terms are calculated
using the tf × idf scheme. In Figure 4, the best results are shown, which are achieved by the type feature. The results show that the vector space model did not perform well. The best precision and recall are obtained by using uri+cl, with a precision and recall of 31.77% and 47.42% without the semantic enrichment and of 33.09% and 47.75% with the semantic enrichment. When using all semantic parts, the precision of the vector space model decreases to 20.37%, compared to the 88.74% precision of the proposed fuzzy retrieval model.
Discussion of Results. Our fuzzy retrieval model performs outstandingly better than
the vector space model for the following reason. tf × idf is a robust statistical model,
which works well with good training data. Traditional concept-based IR systems
[5,6,7] use the top 2-3 levels of a concept hierarchy (few hundred concepts) with
hundreds of training documents. In contrast, we use the whole 28,000 UMBEL
concepts. Moreover, each concept contains little lexical information in the different
semantic parts of the concept, such as in URI, preferred/alternative labels and
super/sub-concept(s) labels to describe that concept. tf × idf cannot discriminate terms
only using combined terms and often few LOD terms are matched to many concepts
(sometimes hundreds) with the same tf × idf scores. We propose a more intuitive
approach, where our fuzzy retrieval model extends tf × idf with a fuzzy relevancy
score calculation based on semantic structure of concepts, i.e. terms from the concept,
sub-concept(s) and super-concept(s) have certain importance in retrieval. Besides,
relevancy scores are combined according to their importance to the concept. Hence,
this more intuitive approach performs astoundingly better than tf × idf, which does not discriminate term importance based on the semantic structure of a concept hierarchy.
[Figure: system response time (sec) plotted against the number of LOD terms (0–100).]
We have presented a novel approach for concept-based search on the Web of Data.
The proposed innovative search mechanism is based on UMBEL concept hierarchy,
fuzzy-based retrieval model and categorical result list presentation. Our approach
groups LOD resources with the same concepts to generate concept lenses that can provide
efficient access to the WoD and enables concept-based browsing. The concept-based
search is achieved using UMBEL for representing context of LOD resources. Then, a
semantic indexing model is applied for efficient representation of UMBEL concept
descriptions. Finally a fuzzy-based retrieval algorithm is introduced for categorization
of LOD resources to UMBEL concepts. Evaluations show that the proposed fuzzy-
based model achieves highly acceptable results on a particular benchmark and
outperforms the vector space model in categorization accuracy, which is crucial for
correct formation of concept lenses.
The introduced semantic indexing and fuzzy retrieval model are not inherently
dependent on UMBEL vocabulary and should be applicable to multiple vocabularies.
References
1. Tummarello, G., Cyganiak, R., Catasta, M., Danielczyk, S., Delbru, R., Decker, S.:
Sig.ma: live views on the Web of Data. Journal of Web Semantics 8(4), 355–364 (2010)
2. Delbru, R., Campinas, S., Tummarello, G.: Searching Web Data: an Entity Retrieval and
High-Performance Indexing Model. Journal of Web Semantics 10, 33–58 (2012)
3. D’Aquin, M., Motta, E., Sabou, M., Angeletou, S., Gridinoc, L., Lopez, V., Guidi, D.:
Toward a New Generation of Semantic Web Applications. IEEE Intelligent Systems
(2008)
4. Heim, P., Ertl, T., Ziegler, J.: Facet Graphs: Complex Semantic Querying Made Easy. In:
Aroyo, L., Antoniou, G., Hyvönen, E., ten Teije, A., Stuckenschmidt, H., Cabral, L.,
Tudorache, T. (eds.) ESWC 2010. LNCS, vol. 6088, pp. 288–302. Springer, Heidelberg
(2010)
5. Chirita, P.A., Nejdl, W., Paiu, R., Kohlschütter, C.: Using ODP metadata to personalize
search. In: International ACM SIGIR Conference (2005)
6. Sieg, A., Mobasher, B., Burke, R.: Web Search Personalization with Ontological User
Profiles. In: International Conference on Information and Knowledge Management (2007)
7. Labrou, Y., Finin, T.: Yahoo! As An Ontology – Using Yahoo! Categories to Describe
Documents. In: International Conference on Information and Knowledge Management
(1999)
8. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval (1983)
9. Steichen, B., O’Connor, A., Wade, V.: Personalisation in the Wild – Providing
Personalisation across Semantic, Social and Open-Web Resources. ACM Hypertext (2011)
10. Carpineto, C., Romano, G.: Optimal Meta Search Results Clustering. In: SIGIR (2010)
11. Erling, O.: Faceted Views over Large-Scale Linked Data. In: Linked Data on the Web
(LDOW) Workshop, co-located with International World Wide Web Conference (2009)
12. Teevan, J., Dumais, S.T., Gutt, Z.: Challenges for Supporting Faceted Search in Large,
Heterogeneous Corpora like the Web. In: Workshop on HCIR (2008)
13. Shangguan, Z., McGuinness, D.L.: Towards Faceted Browsing over Linked Data. In:
AAAI Spring Symposium: Linked Data Meets Artificial Intelligence (2010)
14. White, R.W., Kules, B., Drucker, S.M., Schraefel, M.C.: Supporting Exploratory Search.
Introduction to Special Section of Communications of the ACM 49(4), 36–39 (2006)
Unsupervised Learning of Link Discovery
Configuration
1 Introduction
Identity links between data instances described in different sources provide a major part of the added value of linked data. In order to facilitate data integration, newly
published data sources are commonly linked to reference repositories: popular
datasets which provide good coverage of their domains and are considered re-
liable. Such reference repositories (e.g., DBpedia or Geonames) serve as hubs:
other repositories either link their individuals to them or directly reuse their
URIs. However, establishing links between datasets still represents one of the
most important challenges to achieve the vision of the Web of Data. Indeed, such
a task is made difficult by the fact that different datasets do not share commonly
accepted identifiers (such as ISBN codes), do not rely on the same schemas and
ontologies (therefore using different properties to represent the same informa-
tion) and often implement different formatting conventions for attributes.
Automatic data linking often relies on fuzzy similarity functions comparing
relevant characteristics of objects in the considered datasets. More precisely, a
data linking task can be specified as the evaluation of a decision rule establishing
matching systems (e.g., RiMOM [9]), or systems which primarily rely on the
dataset matching stage (e.g., CODI [12]). However, in most other cases, a dedi-
cated decision rule has to be established for each link discovery task (i.e., each
pair of datasets to link). Existing systems in the Semantic Web area take two
different approaches to realise this:
Manual configuration, where the decision rule is specified by the user. Besides requiring user effort, the clear disadvantage of such an approach is that it relies on extensive knowledge from the user about the structure and content of the two datasets to link, as well as on a reasonable level of intuition regarding the performance of (often complex) similarity functions in a particular situation.
Learning from training data, where the appropriate decision rule is produced by analyzing the available labeled data. This method is followed, for example, by the ObjectCoref system [7]. It alleviates the need for user input to establish the decision rule, but requires the availability of a substantial set of robust training data (although some methods, like active learning [10], can reduce the required amount of data).
– Assumption 1: While different URIs are often used to denote the same
entity in different repositories, distinct URIs within one dataset can be ex-
pected to denote distinct entities.
– Assumption 2: Datasets D1 and D2 have a strong degree of overlap.
– Assumption 3: A meaningful similarity function produces results in the interval [0, 1] and returns values close to 1.0 for pairs of matching individuals.
The method described in this paper proposes to use a genetic algorithm guided
by a fitness criterion using these assumptions to assess the expected quality of
a decision rule, and of the derived set of links. Our method goes a step further
than existing methods, as it chooses an appropriate similarity function for a
given matching task as well as a suitable filtering criterion, rather than relying
3 Algorithm
Applying a genetic algorithm to the problem of optimizing a decision rule re-
quires solving three issues: how relevant parameters of a decision rule are encoded
as a set of genes, what fitness measure to use to evaluate candidate solutions,
and how to use selection and variation operators to converge on a good solution.
$sim(P(I_1), P(I_2)) = f_{agg}(w_{11} \cdot sim_{11}(V_{11}, V_{21}), \ldots, w_{mn} \cdot sim_{mn}(V_{1m}, V_{2n}))$
where:
– $sim_{ij}$ is the function which measures the similarity between the values of the attributes $a_{1i}$ of $P(I_1)$ and $a_{2j}$ of $P(I_2)$,
– $w_{ij}$ is a numeric weight ($0 \leq w_{ij} \leq 1$),
– $f_{agg}$ is an aggregation function.
We considered two alternative filtering criteria: a threshold-based one and a nearest-neighbour one. The former requires that $sim(P(I_1), P(I_2)) \geq t$, where $t$ is a threshold value. The latter chooses for each instance $I_1$ in the source dataset the $I_2$ such that $sim(P(I_1), P(I_2)) = \max_j(sim(P(I_1), P(I_j)))$. This criterion is applicable in cases where we expect each $I_1$ to have a matching $I_2$.
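A minimal sketch of such a decision rule follows, assuming illustrative attribute names, weights and threshold; difflib's ratio is used as a stand-in for similarity measures such as Jaro or edit distance, and a weighted average plays the role of the aggregation function f_agg.

```python
# Sketch of a decision rule: a weighted aggregation of attribute similarities
# followed by threshold-based filtering. Attribute names, weights and the
# threshold are illustrative assumptions.
from difflib import SequenceMatcher

def string_sim(a, b):
    # Stand-in for string similarity measures such as Jaro or edit distance.
    return SequenceMatcher(None, a, b).ratio()

# One ((a_1i, a_2j), sim_ij, w_ij) entry per compared attribute pair.
RULE = [
    (("name", "label"), string_sim, 0.7),
    (("birthYear", "yearOfBirth"), string_sim, 0.3),
]
THRESHOLD = 0.9

def sim(profile1, profile2, rule=RULE):
    # Weighted average as one possible aggregation function f_agg.
    weighted = [w * f(profile1[a1], profile2[a2]) for (a1, a2), f, w in rule]
    return sum(weighted) / sum(w for _, _, w in rule)

def match(profile1, profile2):
    return sim(profile1, profile2) >= THRESHOLD  # threshold-based filtering

p1 = {"name": "J. Smith", "birthYear": "1970"}
p2 = {"label": "J. Smith", "yearOfBirth": "1970"}
print(sim(p1, p2), match(p1, p2))
```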
Each of these parameters is represented by a gene. The resulting genotypes are evaluated by applying the decision rule to the matching task and calculating the fitness function.
In the absence of labelled data it is not possible to estimate the quality of a set
of mappings accurately. However, there are indirect indicators corresponding to
“good characteristics” of sets of links which can be used to assess the fitness of a
given decision rule. To establish such indicators, we rely on the assumptions we
made about the matching task. Traditionally, the quality of the matching output
is evaluated by comparing it with the set of true mappings $M^t$ and calculating the precision $p$ and recall $r$ metrics. Precision is defined as $p = \frac{|tp|}{|tp| + |fp|}$, where $tp$ is the set of true positives (mappings $m = (I_1, I_2)$ such that both $m \in M$ and $m \in M^t$) and $fp$ is the set of false positives ($m \in M$, but $m \notin M^t$). Recall is calculated as $r = \frac{|tp|}{|tp| + |fn|}$, where $fn$ is the set of false negatives ($m \notin M$, but $m \in M^t$). In the absence of gold standard mappings, we use Assumption 1 to
which increase precision are favored, while recall is only used to discriminate
between solutions with similar estimated precision. This “cautious” approach is
also consistent with the requirements of many real-world data linking scenarios,
as the cost of an erroneous mapping is often higher than the cost of a missed
correct mapping.
In order to incorporate Assumption 3, the final fitness function gives a preference to the solutions which accept mappings with similarity degrees close to 1: $F^{\sim}_{fit} = F^{\sim}_{0.1} \cdot (1 - (1 - sim_{avg})^2)$. In this way, the fitness function is able to discriminate between such decision rules as $avg(0.5 \cdot jaro(name, label), 0.5 \cdot edit(birthYear, yearOfBirth)) \geq 0.98$ and $avg(0.05 \cdot jaro(name, label), 0.05 \cdot edit(birthYear, yearOfBirth), 0.9 \cdot edit(name, yearOfBirth)) \geq 0.098$. While
these two rules would produce the same output in most cases, comparing irrelevant attributes (like name and yearOfBirth) is not desirable, because it increases the possibility of spurious mappings without adding any value.
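To illustrate the shape of this computation, the sketch below reconstructs a possible fitness function under explicit assumptions: pseudo-precision is estimated from the uniqueness of mappings (Assumption 1), pseudo-recall from the coverage of the source dataset (Assumption 2), and the two are combined into a precision-oriented F_0.1 score before the similarity factor is applied. These estimates are illustrative stand-ins, not the paper's exact definitions.

```python
# Hedged sketch of a fitness in the spirit of F~_fit = F~_0.1 * (1 - (1 - sim_avg)^2).
# The pseudo-precision / pseudo-recall estimates are illustrative reconstructions.
def pseudo_precision(mappings):
    # Assumption 1: a source or target instance mapped to several distinct
    # instances signals erroneous mappings.
    sources = [s for s, _ in mappings]
    targets = [t for _, t in mappings]
    unique = sum(1 for s, t in mappings
                 if sources.count(s) == 1 and targets.count(t) == 1)
    return unique / len(mappings) if mappings else 0.0

def pseudo_recall(mappings, source_size):
    # Assumption 2: strong overlap, so coverage of the source dataset
    # serves as a recall proxy.
    return len({s for s, _ in mappings}) / source_size if source_size else 0.0

def fitness(mappings, similarities, source_size, beta=0.1):
    p, r = pseudo_precision(mappings), pseudo_recall(mappings, source_size)
    if p + r == 0:
        return 0.0
    f_beta = (1 + beta**2) * p * r / (beta**2 * p + r)  # precision-oriented F_0.1
    sim_avg = sum(similarities) / len(similarities)
    return f_beta * (1 - (1 - sim_avg) ** 2)  # Assumption 3: favor similarities near 1

mappings = [("s1", "t1"), ("s2", "t2"), ("s2", "t3")]
print(fitness(mappings, [0.95, 0.9, 0.6], source_size=3))
```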
While we used $F^{\sim}_{fit}$ as the main fitness criterion, to test the effect of the choice of a fitness function on the performance of the genetic algorithm, we implemented an alternative fitness function: the neighbourhood growth function $F^{NG}_{fit}$. While the pseudo-F-measure tries to estimate the quality of the resulting mappings to guide the evolution of candidate solutions, $F^{NG}_{fit}$ tries to exploit a desired property of a “good” similarity function: namely, that it should be able to discriminate well between different possible candidate mappings. To measure this property, we adapt the neighbourhood growth indicator defined in [2], where it was used to achieve an optimal clustering of instance matching results for a pre-defined similarity function, as an alternative to the threshold-based filtering criterion. Here we use this indicator as an alternative fitness criterion for selecting the most appropriate similarity functions.
Definition 5: Let $M_x$ represent a set of mappings $(I_x, I_{xj})$ between an individual $I_x \in I_1$ and a set of individuals $I_{xj} \in I_{2x} \subseteq I_2$. Let $sim_{max} = \max_j(sim(P(I_x), P(I_{xj})))$. Then, the neighbourhood growth $NG(I_x)$ is defined as the number of mappings in $M_x$ such that their similarity values are higher than $1 - c \cdot (1 - sim_{max})$, where $c$ is a constant.
Intuitively, high values of $NG(I_x)$ indicate that the neighbourhood of an instance is “cluttered”, and the similarity measure cannot adequately distinguish between different matching candidates. The fitness function for a set of compared instance pairs $M$ is then defined as $F^{NG}_{fit} = 1 / avg_x(NG(I_x))$. As this function does not require applying the filtering criterion, it only learns the similarity function, but not the optimal threshold. However, the threshold can be determined after the optimal similarity function has been derived: $t$ is selected in such a way that it maximises the $F^{\sim}_{fit}$ function over the set of compared pairs.
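Definition 5 and the derived fitness can be sketched directly; the candidate similarity values and the constant c in the example below are placeholders.

```python
# Sketch of the neighbourhood growth indicator NG(I_x) and the derived fitness
# F_fit^NG = 1 / avg_x(NG(I_x)). Candidate similarities and c are placeholders.
def neighbourhood_growth(candidate_sims, c=2.0):
    """candidate_sims: similarity values of all mappings (I_x, I_xj) for one I_x."""
    sim_max = max(candidate_sims)
    cutoff = 1 - c * (1 - sim_max)
    return sum(1 for s in candidate_sims if s > cutoff)

def ng_fitness(per_instance_sims, c=2.0):
    growths = [neighbourhood_growth(sims, c) for sims in per_instance_sims if sims]
    return 1.0 / (sum(growths) / len(growths))

# Two source instances: one with a clear best match, one with a cluttered
# neighbourhood of near-equal candidates.
print(ng_fitness([[0.95, 0.4, 0.3], [0.8, 0.79, 0.78]]))
```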
The algorithm takes as input two instance sets $I_1$ and $I_2$ and two sets of potential attributes $A_1$ and $A_2$. Each set of attributes $A_i$ includes all literal property values at a distance $l$ from individuals in $I_i$. In our experiments we used $l = 1$, additionally including paths of length 2 if an individual was connected to a literal through a blank node. In order to filter out rarely defined properties, we also remove all attributes $a_{ij}$ for which $\frac{|\{P(I_i) \mid a_{ij} \in P(I_i), I_i \in I\}|}{|I|} < 0.5$.
As the first step, the algorithm initializes the population of size $N$. For the initial population, the values of the genotype are set in the following way (see the sketch after this list):
– A set of $k$ pairs of attributes $(a_{1i}, a_{2j})$ is selected randomly from the corresponding sets $A_1$ and $A_2$.
– For these pairs of attributes, the similarity functions $sim_{ij}$ and the corresponding weights $w_{ij}$ are assigned randomly, while for all other pairs they are set to nil.
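A minimal sketch of this initialization, assuming a simple dictionary-based genotype (the concrete gene encoding used by the authors is not reproduced in this excerpt):

```python
# Sketch of random genotype initialization: pick k attribute pairs, assign
# random similarity functions and weights, leave all other pairs at nil (None).
# The genotype layout and the pool of similarity functions are assumptions.
import random

SIM_FUNCTIONS = ["jaro", "edit", "jaccard"]   # assumed available measures
AGGREGATIONS = ["avg", "max", "min"]          # assumed aggregation options

def random_genotype(attrs1, attrs2, k=2):
    genotype = {(a1, a2): None for a1 in attrs1 for a2 in attrs2}  # nil by default
    for pair in random.sample(list(genotype), k):
        genotype[pair] = (random.choice(SIM_FUNCTIONS), random.random())  # (sim_ij, w_ij)
    return {"pairs": genotype,
            "aggregation": random.choice(AGGREGATIONS),
            "threshold": random.uniform(0.5, 1.0)}

def initial_population(attrs1, attrs2, n=100, k=2):
    return [random_genotype(attrs1, attrs2, k) for _ in range(n)]

population = initial_population(["name", "birthYear"], ["label", "yearOfBirth"], n=5)
print(population[0])
```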
At each new iteration, the chromosomes in the updated population are again evaluated using the $F_{fit}$ fitness function, and the process is repeated. The algorithm stops if the pre-defined number of iterations $n_{iter}$ is reached or if the algorithm
converges before this, i.e., the average fitness does not increase for $n_{conv}$ generations. The phenotype with the best fitness in the final population is returned by the algorithm as its result.
4 Evaluation
To validate our method, we performed experiments with two types of datasets.
First, we tested our approach on the benchmark datasets used in the instance
matching tracks of the OAEI 2010 and OAEI 2011 ontology matching competi-
tions2 , to compare our approach with state-of-the-art systems. Second, we used
several datasets extracted from the linked data cloud to investigate the effect of
different parameter settings on the results.
4.1 Settings
As discussed above, a genetic algorithm starts with an initial population of random solutions and iteratively creates new generations through selection, mutation and crossover. In our experiments, we used the following default parameters:
– rates for the different recombination operators: $r_{el} = 0.1$, $r_m = 0.6$, and $r_c = 0.3$;
– rates for the different mutation options: $p^m_{att} = 0.3$, $p^m_{wgt} = 0.15$, $p^m_{sym} = 0.15$, $p^m_t = 0.3$, $p^m_{agg} = 0.1$ (ensuring equivalent probabilities for modifying the list of compared properties, the comparison parameters, and the threshold);
– termination criterion: $n_{iter} = 20$ (found to be sufficient for convergence in most cases);
– fitness function: $F^{\sim}_{fit}$, except when comparing $F^{\sim}_{fit}$ with $F^{NG}_{fit}$.
The genetic algorithm is implemented as a method in the KnoFuss architec-
ture [11]. Relevant subsets of two datasets are selected using SPARQL queries.
Each candidate decision rule is used as an input of the KnoFuss tool to create the
corresponding set of links. To reduce the computation time, an inverted Lucene3
index was used to perform blocking and pre-select candidate pairs. Each individ-
ual in the larger dataset was indexed by all its literal properties. Each individual
in the smaller dataset was only compared to individuals returned by the index
when searching on all its literal properties, and pairs of compared individuals
were cached in memory. Common pre-processing techniques (such as removing
stopwords and unifying synonyms) were applied to the literal properties.
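As an illustration of the blocking step, the following pure-Python inverted index is a simplified stand-in for the Lucene-based index described above; the tokenization and the example data are assumptions.

```python
# Simplified stand-in for the Lucene-based blocking: index every individual of
# the larger dataset by the tokens of its literal properties, then pre-select
# as candidates only the individuals sharing at least one token with the query.
from collections import defaultdict

def tokens(individual):
    return {t.lower() for value in individual["literals"] for t in value.split()}

def build_index(individuals):
    index = defaultdict(set)
    for uri, ind in individuals.items():
        for tok in tokens(ind):
            index[tok].add(uri)
    return index

def candidates(individual, index):
    cand = set()
    for tok in tokens(individual):
        cand |= index[tok]
    return cand

large = {"db:p1": {"literals": ["John Smith", "1970"]},
         "db:p2": {"literals": ["Jane Doe"]}}
index = build_index(large)
print(candidates({"literals": ["J. Smith 1970"]}, index))
```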
Table 1. Comparison of F1-measure with other tools on the OAEI 2010 benchmark [4]
of the Restaurants dataset exist: the version originally used in the OAEI 2010
evaluation which contained a bug (some individuals included in the gold standard
were not present in the data), and the fixed version, which was used in other
tests (e.g., [13], [7]). To be able to compare with systems which used both variants
of the dataset, we also used both variants in our experiments. The OAEI 2011
benchmark includes seven test cases, which involve matching three subsets of
the New York Times linked data (people, organisations, and locations) with
DBpedia, Freebase, and Geonames datasets.
We compared our algorithm with the systems participating in the OAEI 2010
tracks as well as with the FBEM system [13], whose authors provided the bench-
mark datasets for the competition. We report in Table 1 the performance of the KnoFuss system using decision rules learned through our genetic algorithm (denoted KnoFuss+GA) as the average F1-measure obtained over 5 runs of the algorithm with a population size N = 1000. The solution produced by the
Table 3. Comparison of F1-measure with other tools on the OAEI 2011 benchmark [5]
converge to the optimal solution, and the algorithm usually converged well before
20 generations5.
Given the larger scale of the OAEI 2011 benchmark, to speed up the algo-
rithm we used random sampling with the sample size s = 100 and reduced the
population size to N = 100. To improve the performance, a post-processing step
was applied: the 1-to-1 rule was re-enforced, and for a source individual only 1
mapping was retained. As shown in Table 3, these settings were still sufficient
to achieve high performance: the algorithm achieved the highest F 1 measure on
4 test cases out of 7 and the highest average F 1 measure. These results verify
our original assumptions that (a) the fitness function based on the pseudo-F-
measure can be used as an estimation of the actual accuracy of a decision rule
and (b) the genetic algorithm provides a suitable search strategy for obtaining
a decision rule for individual matching.
high performance (F1 above 0.9), and increasing the population size N led to improvements in performance as well as more robust results (lower $\sigma_{F1}$). In fact, for the Music contributors test case, the results produced using $F^{\sim}_{fit}$ and the ideal-case F1 were almost equivalent. For the Research papers dataset (Table 5), we trained the algorithm on several samples taken from the DOI dataset and then applied the resulting decision rules to the complete test case (10,000 individuals in the DOI dataset). This was done to emulate use cases involving large-scale repositories, in which running many iterations of the genetic algorithm over complete datasets is not feasible. From Table 5 we can see that starting from 100 sample individuals the algorithm achieved stable performance, which is consistent
9 http://dblp.l3s.de/
10 http://km.aifb.kit.edu/projects/btc-2010/
11 http://dx.doi.org/
12 Experiments were performed on a Linux desktop with two Intel Core 2 Duo processors and 3GB of RAM.
Table 5. Results obtained for the Research papers dataset (for all sample sizes, pop-
ulation size N = 100 was used)
with the results achieved for the OAEI 2011 benchmark. Applying the resulting
decision rules to the complete dataset also produced results with precision and
recall values similar to the ones achieved on the partial sample. Finally, to test
the effect of the chosen fitness function on the performance, we compared the
pseudo-F-measure $F^{\sim}_{fit}$ and neighbourhood growth $F^{NG}_{fit}$ fitness functions. We applied the algorithm to the Music contributors and Book authors datasets, as well as to the NYT-Geonames and NYT-Freebase (people) test cases from the OAEI 2011 benchmark (without applying post-processing). The results reported in Table 6 show that both functions are able to achieve high accuracy, with $F^{\sim}_{fit}$ providing more stable performance. This validates our initial choice of $F^{\sim}_{fit}$ as a suitable fitness criterion and reinforces our assumption that features of the similarity distribution can indirectly serve to estimate the actual fitness.
References
1. de Carvalho, M.G., Laender, A.H.F., Goncalves, M.A., da Silva, A.S.: A genetic
programming approach to record deduplication. IEEE Transactions on Knowledge
and Data Engineering 99(PrePrints) (2010)
2. Chaudhuri, S., Ganti, V., Motwani, R.: Robust identification of fuzzy duplicates.
In: ICDE 2005, pp. 865–876 (2005)
3. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: A
survey. IEEE Transactions on Knowledge and Data Engineering 19(1), 1–16 (2007)
4. Euzenat, J., et al.: Results of the ontology alignment evaluation initiative 2010. In:
Workshop on Ontology Matching (OM 2010), ISWC 2010 (2010)
5. Euzenat, J., et al.: Results of the ontology alignment evaluation initiative 2011.
In: Workshop on Ontology Matching (OM 2011), ISWC 2011 (2011)
6. Fellegi, I.P., Sunter, A.B.: A theory for record linkage. Journal of American Sta-
tistical Association 64(328), 1183–1210 (1969)
7. Hu, W., Chen, J., Qu, Y.: A self-training approach for resolving object coreference
on the semantic web. In: WWW 2011, pp. 87–96 (2011)
8. Isele, R., Bizer, C.: Learning linkage rules using genetic programming. In: Workshop
on Ontology Matching (OM 2011), ISWC 2011, Bonn, Germany (2011)
9. Li, J., Tang, J., Li, Y., Luo, Q.: RiMOM: A dynamic multistrategy ontology align-
ment framework. IEEE Transactions on Knowledge and Data Engineering 21(8),
1218–1232 (2009)
10. Ngonga Ngomo, A.C., Lehmann, J., Auer, S., Höffner, K.: RAVEN - active learning
of link specifications. In: Workshop on Ontology Matching (OM 2011), ISWC 2011
(2011)
11. Nikolov, A., Uren, V., Motta, E., de Roeck, A.: Integration of Semantically An-
notated Data by the KnoFuss Architecture. In: Gangemi, A., Euzenat, J. (eds.)
EKAW 2008. LNCS (LNAI), vol. 5268, pp. 265–274. Springer, Heidelberg (2008)
12. Noessner, J., Niepert, M., Meilicke, C., Stuckenschmidt, H.: Leveraging Termino-
logical Structure for Object Reconciliation. In: Aroyo, L., Antoniou, G., Hyvönen,
E., ten Teije, A., Stuckenschmidt, H., Cabral, L., Tudorache, T. (eds.) ESWC 2010.
LNCS, vol. 6089, pp. 334–348. Springer, Heidelberg (2010)
13. Stoermer, H., Rassadko, N., Vaidya, N.: Feature-Based Entity Matching: The
FBEM Model, Implementation, Evaluation. In: Pernici, B. (ed.) CAiSE 2010.
LNCS, vol. 6051, pp. 180–193. Springer, Heidelberg (2010)
14. Volz, J., Bizer, C., Gaedke, M., Kobilarov, G.: Discovering and Maintaining Links
on the Web of Data. In: Bernstein, A., Karger, D.R., Heath, T., Feigenbaum, L.,
Maynard, D., Motta, E., Thirunarayan, K. (eds.) ISWC 2009. LNCS, vol. 5823,
pp. 650–665. Springer, Heidelberg (2009)
15. Zardetto, D., Scannapietro, M., Catarci, T.: Effective automated object matching.
In: ICDE 2010, pp. 757–768 (2010)
Graph Kernels for RDF Data
1 Introduction
RDF-formatted data on the World Wide Web leads to new types of distributed information systems and poses new challenges and opportunities for data mining research. As
an official standard of the World Wide Web Consortium (W3C), the Resource Descrip-
tion Framework (RDF) establishes a universal graph-based data model not intuitively
suited for standard machine learning (ML) algorithms. Recent efforts of research, in-
dustry and public institutions in the context of Linked Open Data (LOD) initiatives
have led to considerable amounts of RDF data sets being made available and linked to
each other on the web [1]. Besides that, there is progress in research on extracting RDF
from text documents [2]. Consequently, the question of systematically exploiting the
knowledge therein by data mining approaches becomes highly relevant.
This paper focuses on making instances represented by means of RDF graph struc-
tures available as input to existing ML algorithms, which can solve data mining tasks
relevant to RDF, such as class-membership prediction, property value prediction, link
prediction or clustering. As an example, consider the mining of the social network
emerging from the linked user profiles available in FOAF, a popular RDF-based vocab-
ulary [3]. Relevant tasks in this setting include the identification of similar users (e.g.
for identity resolution) or the prediction of user interests (e.g. for recommendations).
Existing approaches to mining the Semantic Web (SW) have either focused on one
specific semantic data representation [4, 5] based on RDF or on one of the specific
tasks mentioned above, i.e. the data representation and the learning algorithm have been
devised specifically for the problem at hand [6, 7, 8, 9, 10].
In contrast, kernels [11] promise a highly flexible approach by providing a powerful
framework for decoupling the data representation from the learning task: Specific kernel
functions can be deployed depending on the format of the input data and combined in
a “plug-and-play”-style with readily available kernel machines for all standard learning
tasks. In this context, the challenge of learning from RDF data can be reformulated
as designing adequate kernel functions for this representation. In fact, various kernel
functions for general graph structures have been proposed over the last years. However,
the properties of RDF graphs are different from general graphs: e.g. graphs representing
chemical compounds usually have few node labels which occur frequently in the graph
and nodes in these graphs have a low degree. In contrast, RDF node labels are used as
identifiers occurring only once per graph and nodes may have a high degree.
In this paper, we bridge the gap between the too general and too specialized ap-
proaches to mining RDF data. We first review existing approaches for mining from
semantically annotated data on the one hand and from general graphs on the other. We
discuss the problems of these approaches with respect to their flexibility (e.g. appli-
cability in various tasks, combination with different learning algorithms, different data
representations etc.) and their suitability as data representations for RDF-type data.
Improving on that, we introduce two versatile families of graph kernels based on inter-
section graphs and intersection trees. We discuss why our approach can be more easily
applied by non-experts to solve a wider range of data mining tasks using a wider range
of ML techniques on a wider range of RDF data sets compared to the more specialized
approaches found in the related work. To the best of our knowledge, these are the first
kernels which can interface between any RDF graph and any kernel machine, while
exploiting the specifics of RDF. By applying standard Support Vector Machines (SVM)
to RDF and RDF(S) data sets we show how our kernels can be used for solving two
different learning problems on RDF graphs: property value prediction and link predic-
tion. Despite its broader applicability we achieve comparable performance to methods
which are either more specialized on a specific data format and data mining task or too
general to exploit the specifics of RDF graphs efficiently.
In Section 2, we introduce foundations on kernel methods and discuss related work.
In Section 3 we describe a new family of kernel functions and their applicability to
learning from RDF data. In Section 4, we report experiments which demonstrate their
flexibility and performance. We conclude with a discussion and outlook in Section 5.
In this section, we briefly introduce the required formalisms for kernel methods and
present related work with respect to mining Semantic Web (SW) data. After introduc-
ing kernel methods and the learning problems arising from SW data, we discuss the
applicability of general graph kernels to semantic data representations, before detailing
on the existing methods for specific learning problems in the context of the SW.
interest. In the spirit of the Convolution Kernel by Haussler [13], which represents a
generic way of defining kernels, kernel functions for general graphs have been devised
by counting common subgraphs of two graphs. While the subgraph isomorphism prob-
lem, i.e. the problem of identifying whether a given input graph contains another input
graph as its subgraph, is known to be NP-complete, the search for common subgraphs
with specific properties can often be performed more efficiently. Thus, kernel functions
have been defined for various substructures such as walks [14], cycles [15] and trees
[16]. Any of these kernels is applicable to RDF data as long as it is defined on directed,
labeled graphs. While these kernel functions are applicable in a wide range of applica-
tions, they do not exploit the specific properties of RDF data. RDF graphs are directed,
labeled graphs in which nodes may be identified by their label, i.e. their URI.
Definition 2 (RDF Graph). An RDF graph is defined as a set of triples of the form
G = {(s, p, o)} = (V, E), where the subject s ∈ V is an entity, the predicate p denotes
a property, and the object o ∈ V is either another entity or, in case of a relation whose
values are data-typed, a literal. The vertices v ∈ V of G are defined by all elements that
occur either as subject or object of a triple. Each edge e ∈ E in the graph is defined by
a triple (s, p, o): the edge that connects s to o and has label p.1
Generally, we will look at RDF entities as the instances for learning. For example, two
sets of entities, identified by their Uniform Resource Identifiers (URIs) could be positive
and negative classes in a classification scenario. The argument entities’ neighborhood in
the overall RDF graph forms the basis for their kernel-induced feature representations.
Essentially, all proposed kernel functions are thus based on a neighborhood graph which
is obtained by a breadth-first search up to depth k starting from the entity of interest.
We have defined two versions of the neighborhood graph: the intersection graph (see
Section 3.1) and the intersection tree (see Section 3.2).
We define RDF kernels in a similar manner to other graph kernels by adopting the
idea of counting subgraphs with a specific structure in the input graphs. The essential
difference is that, as RDF builds on unique node labels, each RDF subgraph can occur at
most once in the input graph. This is not the case in general graphs, where it is common
that several nodes carry the same label – thus yielding potentially several equivalent
subgraphs. Therefore, when calculating the kernel function between two RDF graphs,
it is not necessary to identify the interesting structures and their frequencies in the two
graphs separately. Instead, it is sufficient to analyze a single structure which contains
the features of interest common in both input graphs.2 For each of the definitions of
the neighborhood graphs sketched above, we have defined a way of representing their
common structures, which are used as basis for the two families of kernel functions we
define: In Section 3.1 we will present kernel functions which are based on intersection
graphs (obtained from two instance graphs), in Section 3.2 kernel functions based on
intersection trees (on the basis of instance trees) will be presented.
The intersection graph of two graphs is a graph containing all the elements the two
graphs have in common.
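Because RDF node labels are unique, the intersection graph can be sketched as a plain set intersection of the two instance graphs' triple sets; the example triples are illustrative and blank nodes are ignored in this simplification.

```python
# Sketch: with unique node labels (URIs), the intersection graph of two RDF
# instance graphs is, essentially, the set of triples they share. Blank nodes
# are ignored in this simplification; the example data is illustrative.
def intersection_graph(triples1, triples2):
    return set(triples1) & set(triples2)

g1 = {("ex:alice", "foaf:knows", "ex:bob"),
      ("ex:alice", "foaf:interest", "ex:SemanticWeb")}
g2 = {("ex:carol", "foaf:knows", "ex:bob"),
      ("ex:alice", "foaf:interest", "ex:SemanticWeb")}
print(intersection_graph(g1, g2))
```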
Note that if the intersection graph contains a given subgraph, this subgraph is also a subgraph of each of the two input graphs. Inversely, if a subgraph is contained in both instance graphs, it is part of the intersection graph. Thus, calculating a kernel function based on a set of subgraphs can be reduced to constructing the intersection graph in the first step and then counting the substructures of interest therein. We have defined kernels with implicit feature sets based on walks, paths, and (connected) subgraphs:
1 Note that this defines a multigraph which allows for multiple edges between the same two nodes, as long as the type of the relation is different.
2 Gärtner et al. [14] have proposed kernel functions which are based on counting common structures in the direct product graph. In the case of graphs with unique node labels, like RDF, this is equivalent to what we call the Intersection Graph which we define in Section 3.1.
Walks and Paths. Connected elements within the intersection graphs are likely to
yield more interesting results than a set of arbitrary relations taken from the intersec-
tion graph. We have therefore defined additional kernels whose features are restricted
to subsets of all edge-induced subgraphs. We have focused on walks and paths as inter-
esting subsets, as they represent property chains in RDF.
Definition 5 (Walk, Path). A walk in a graph $G = (V, E)$ is defined as a sequence of vertices and edges $v_1, e_1, v_2, e_2, \ldots, v_n, e_n, v_{n+1}$ with $e_i = (v_i, p_i, v_{i+1}) \in E$. The length of a walk denotes the number of edges it contains.
A path is a walk which does not contain any cycles, i.e. a walk for which the additional condition $v_i \neq v_j\ \forall i \neq j$ holds. We denote the set of walks of length $l$ in a graph $G$ by $walks_l(G)$, and the set of paths up to length $l$ by $paths_l(G)$.
Definition 6 (Walk Kernel, Path Kernel). The Walk Kernel for maximum length $l$ and discount factor $\lambda > 0$ is defined by $\kappa_{l,\lambda}(G_1, G_2) = \sum_{i=1}^{l} \lambda^i\, |\{w \mid w \in walks_i(G_1 \cap G_2)\}|$. Analogously, the Path Kernel is defined by $\kappa_{l,\lambda}(G_1, G_2) = \sum_{i=1}^{l} \lambda^i\, |\{p \mid p \in paths_i(G_1 \cap G_2)\}|$.
The corresponding feature space consists of one feature per walk (resp. path). In the definition, the parameter $\lambda > 0$ serves as a discount factor and allows longer walks (paths) to be weighted differently from shorter ones. If $\lambda > 1$, longer walks (paths) receive more weight; if $\lambda < 1$, shorter ones contribute more weight.
As paths and walks are edge-induced substructures of a graph, the validity of the
proposed kernel functions can be shown using the same argument as for edge-induced
subgraphs. The kernel function can be calculated iteratively by constructing walks of
length i as extension of walks of length i − 1. In each iteration an edge is appended at
the end of the walks found in the previous iteration. For counting paths, the condition
that $v_i \notin \{v_1, \ldots, v_{i-1}\}$ has to be added when constructing the paths of length $i$.
A different approach for calculating these kernel functions is based on the powers of
the intersection graph’s adjacency matrix. The adjacency matrix M is a representation
of a graph in the form of a matrix with one row and one column per node in the graph
and entries xij = 1 if the graph contains an edge from node i to node j, 0 otherwise.
Fig. 1. (a) Example of an instance tree, (b) example of a complete subtree and (c) example of a
partial subtree of the instance tree
Each entry $x_{ij}$ of the $k$-th power of $M$ can be interpreted as the number of walks of length $k$ from node $i$ to node $j$. Therefore, the number of walks up to length $k$ in the graph can be obtained as $\sum_{i=1}^{k} \sum_{j=1}^{n} \sum_{l=1}^{n} (M^i)_{jl}$. By setting the elements $x_{jj}$ of $M^i$ to 0, this formula can also be used for the path kernel.3
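A small numpy sketch of this matrix-based walk counting follows (with an optional discount factor λ as in the walk kernel); the example adjacency matrix is arbitrary.

```python
# Sketch: counting walks up to length k in the intersection graph via powers of
# its adjacency matrix M, optionally discounted by lambda as in the walk kernel.
import numpy as np

def walk_count(M, k, lam=1.0):
    """Sum over i = 1..k of lam^i times the number of walks of length i."""
    total, power = 0.0, np.eye(M.shape[0])
    for i in range(1, k + 1):
        power = power @ M          # now equals M^i
        total += (lam ** i) * power.sum()
    return total

# Arbitrary example adjacency matrix of a small intersection graph (0 -> 1 -> 2).
M = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
print(walk_count(M, 2))            # 3 walks: two of length 1, one of length 2
print(walk_count(M, 2, lam=0.5))   # discounted variant as in the walk kernel
```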
The kernel functions presented in the previous section are based on the calculation of
the intersection graph. The use of the intersection graph may become problematic as
its calculation is potentially expensive: the whole instance graph for each entity has to
be extracted and the two graphs have to be intersected explicitely. However, the size
of the instance graph grows exponentially with the number of hops crawled from the
entity to build up the instance graph. Merging the two steps of extracting the instance
graphs and intersecting them is not directly feasible: Consider an entity e which can be
reached within k hops from both entities e1 and e2 , but through different paths. In this
case, e would be part of the intersection graph, but it would not be reachable from either
e1 or e2 in this graph. In the following, we present a different way of extracting com-
mon neighborhoods of two entities, which enables a direct construction of the common
properties, without building the instance graphs. This alternative method is based on the
use of instance trees instead of instance graphs. Instance trees are obtained based on the
graph expansion with respect to an entity of interest e (as for example defined in [17]).
The graph expansion can grow infinitely if the graph contains cycles. To avoid this
problem and to limit the size of the obtained trees the graph expansion is bound by
depth d. While in the original RDF graph, node labels were used as identifiers, i.e. each
node label occurred exactly once, this is not true in the expanded graph. If there is more
than one path from entity $e$ to another entity $e'$, then $e'$ will occur more than once, and thus the label of $e'$ is not unique anymore. An intersection of the instance trees leads
to an intersection tree in the same spirit as the intersection graph. We introduce two
changes to the instance trees as obtained by direct expansion from the data graph:
The intersection tree can be extracted directly from the data graph without constructing the instance trees explicitly, using Algorithm 1. The algorithm extracts the intersection tree $it_d(e_1, e_2)$ from the data graph: starting from the two entities $e_1$ and $e_2$, the intersection tree is built using breadth-first search. Two cases have to be distinguished. In cases where one of the entities $e_1$ or $e_2$ is found, a new node is added to the tree with a dummy label; for nodes with this label, the common relations of $e_1$ and $e_2$ are added as children. The second case covers all nodes which do not correspond to one of the entities: for them, a new node with the node's URI (respectively its label) is added to the tree, and the children of these latter nodes are all relations of this node in the data graph.
As in the case of the intersection graphs, our proposed kernel functions are based
on counting elements in the intersection trees. The features of interest are restricted to
features which contain the root of the tree. This is because features which are not con-
nected to the root element may be present in each instance tree, but may by construction
not be part of the intersection tree.
Full Subtrees. The first kernel function we propose based on the intersection tree is
the full subtree kernel, which counts the number of full subtrees of the intersection tree
itd (e1 , e2 ), i.e. of the intersection tree of depth d for the two entities e1 and e2 . A full
subtree of a tree t rooted at a node v is the tree with root v and all descendants of v in t
(see Figure 1 for an example).
Definition 9 (Full Subtree Kernel). The Full Subtree Kernel is defined as the number of full subtrees in the intersection tree. Subtrees of different height are weighted differently using a discount factor $\lambda$: $\kappa_{st}(e_1, e_2) = st(root(it_d(e_1, e_2)))$, where $st(v) = 1 + \lambda \sum_{c \in children(v)} st(c)$.
The corresponding feature mapping consists of one feature per subtree. Counting the
number of full subtrees in the kernel is equivalent to counting the walks starting at the
root of the intersection tree. The full subtree kernel is valid due to this equivalence.
Partial Subtrees. Given a tree $T = (V, E)$, its partial subtrees are defined by subsets $V' \subset V$ and $E' \subset E$ such that $T' = (V', E')$ is a tree. We propose to define a kernel function which counts the number of partial subtrees in the intersection tree $it_d(e_1, e_2)$ which are rooted at the root of $it_d(e_1, e_2)$.
Definition 10 (Partial Subtree Kernel). The Partial Subtree Kernel is defined as the number of partial trees that the intersection tree contains. A discount factor $\lambda$ gives more or less weight to trees with greater depth: $\kappa_{pt}(e_1, e_2) = t(root(it_d(e_1, e_2)))$, where $t$ is defined as $t(v) = \prod_{c \in children(v)} (\lambda\, t(c) + 1)$. The function $t(v)$ returns the number of partial subtrees with root $v$ that the tree rooted at $v$ contains, weighted by depth with a discount factor $\lambda$.
The corresponding feature space consists of one feature per partial tree up to depth $d$ with root $e_i$ replaced by a dummy node in the data graph. The value of each feature is the number of times a partial tree occurs.
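Both subtree recursions can be sketched on a minimal tree representation in which each node is given by the list of its children; the example tree and the discount factor are arbitrary.

```python
# Sketch of the two subtree recursions on an intersection tree where each node
# is represented simply by the list of its children (labels are irrelevant for
# the counts). The example tree and lambda are arbitrary.
def full_subtree_count(children, lam=0.5):
    # st(v) = 1 + lam * sum_{c in children(v)} st(c)
    return 1 + lam * sum(full_subtree_count(c, lam) for c in children)

def partial_subtree_count(children, lam=0.5):
    # t(v) = prod_{c in children(v)} (lam * t(c) + 1)
    value = 1.0
    for c in children:
        value *= lam * partial_subtree_count(c, lam) + 1
    return value

# Root with two children, one of which has one further child.
tree = [[[]], []]
print(full_subtree_count(tree))     # value of st at the root
print(partial_subtree_count(tree))  # value of t at the root
```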
4 Evaluation
To show the flexibility of the proposed kernels, we have conducted evaluations on
real world data sets in two learning tasks: a property value prediction task and a link
prediction task. The kernel functions are implemented in Java using Jena4 for processing RDF. For the property value prediction task, we used SVMlight [18] with the JNI Kernel Extension.5 For the link prediction task, we used the ν-SVM implementation of LIBSVM [19].
applied these graph kernels to the instance graphs of depth 2 which were extracted from
the RDF data graph. We chose the parameters of the compared approaches according to
the best setting reported in the literature: The maximum depth of trees in the Weisfeiler-
Lehman kernel was set to 2, the discount factor for longer walks in the Gärtner kernel
was set to 0.5. On the SWRC dataset, the best configuration obtained by Bloehdorn
and Sure [6] in the original paper was used for comparison. This kernel configuration,
denoted by sim-ctpp-pc combines the common class similarity kernel described in their
paper with object property kernels for the workedOnBy, worksAtProject and publica-
tion. Additionally, also on the SWRC dataset, we compared the performance of our
kernels on the whole data set (including the schema) to a setting where all relations
which are part of the schema were removed (lowest part in Table 1).
Discussion. Our results show that with specific parameters our kernels reach compa-
rable error and higher F1-measure than the kernels proposed by Bloehdorn and Sure.
Considering that our approach is generic and can be used off the shelf in many sce-
narios, while their kernel function was designed manually for the specific application
scenario, this is a positive result. Our kernel functions also perform well with respect to
other graph kernels: The Weisfeiler-Lehman kernel is not able to separate the training
data in the SWRC dataset as it can match only very small structures. While the Gaert-
ner kernel achieves results which are comparable to our results, its calculation is more
expensive - due to the cubic time complexity of the matrix inversion. Our results show
that the walk kernel and the path kernel perform better in terms of classification errors
when the maximum size of the substructures are increased. As for the partial subtree
kernel it turned out that parameter tuning is difficult and that the results strongly de-
pend on the choice of the discount factor. Last but not least, a surprising result is that
the kernels which are based on the intersection graph perform better on the reduced
data set which does not contain the schema information. Our explanation for this is that
Graph Kernels for RDF Data 145
the intersection graph contains part of the schema and thus produces a similar overlap
for many of the instances. Further results which we do not report here show that with
increasing maximum size k of the considered substructures precision is increased and
recall is decreased. This may be due to the fact that bigger structures describe instances in a more precise way, thus refining the description of elements belonging to the class while, on the other hand, losing generality of the class description.
This is a valid kernel function for α, β ≥ 0, as the space of valid kernel functions is closed under sums and multiplication by a positive scalar [11]. Any kernel function for RDF entities may be used as the subject and object kernel ($\kappa_s$ and $\kappa_o$).
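Although the full definition of the link kernel is not reproduced in this excerpt, its combination of subject and object kernels can be sketched as follows; the pair-based signature, the placeholder entity kernel and the α, β values are assumptions.

```python
# Sketch of a link kernel over candidate (subject, object) pairs, combining a
# subject kernel and an object kernel: kappa_link = alpha * kappa_s + beta * kappa_o.
# The pair-based signature and the placeholder entity kernel are assumptions.
def placeholder_entity_kernel(e1, e2):
    # Stand-in for any RDF entity kernel (e.g. an intersection-tree kernel).
    return 1.0 if e1 == e2 else 0.0

def link_kernel(pair1, pair2, kappa_s=placeholder_entity_kernel,
                kappa_o=placeholder_entity_kernel, alpha=2.0, beta=1.0):
    (s1, o1), (s2, o2) = pair1, pair2
    # Valid kernel for alpha, beta >= 0: sum of scaled valid kernels.
    return alpha * kappa_s(s1, s2) + beta * kappa_o(o1, o2)

print(link_kernel(("ex:alice", "ex:aifb"), ("ex:alice", "ex:stlab")))
```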
Datasets. We evaluated the link prediction approach on two datasets. The first evaluation setting uses the SWRC dataset, as for property value prediction. In this scenario the goal is to predict the affiliation relation, i.e. to decide for a tuple of person and
research group whether the given person works in the given research group. The sec-
ond dataset is a reduced version of the LiveJournal dataset used for age prediction. The
reduced dataset contains descriptions of 638 people and their social network. Overall,
there are 8069 instances of the foaf:knows relation that is learned in this setting.
Evaluation Method. RDF is based on the open-world assumption, which means that
if a fact is not stated, it may not be assumed to not hold. Rather, the truth value of
this relation is unknown. This is problematic as no negative training and test data is
available. As to the training phase, some initial experiments using one-class SVM [21]
did not yield promising results. We therefore considered some negative training data to be necessary for the evaluation: the positive training instances were complemented with an equal number of unknown instances. We consider this number of
negative instances a good trade-off between the number of assumptions made and the
gain in quality of the trained classifier. The obtained training set is a balanced dataset
which does not present any bias towards one of the classes.
Evaluation cannot be based on positive or negative classifications due to the absence
of negative data. Thus, a ranking-based approach to the evaluation was chosen: positive
instances should be ranked higher in the result set than negative ones. For evaluation, a
modified version of 5-fold cross validation was used: the positive instances were split
in five folds, each fold was complemented with an equal number of unknown instances
and the unused unknown instances were used as additional test data. Each model was
trained on four folds and evaluated on the fifth fold and the additional test data. NDCG
[22] and bpref [23] were used as evaluation measures.
Compared Approaches. We have compared our approach to the SUNS approach pre-
sented by Huang et al. [8] and to the link kernel with the Weisfeiler-Lehman kernel [16]
as entity kernel. The SUNS approach is based on a singular value decomposition of the relation matrix and generates predictions based on the completed matrix. We experimented with different settings for the SUNS parameter, with the parameter range taken from Huang et al. [8]; however, our evaluation procedure is substantially different from theirs, and we report the best results we obtained. Results are reported in
Table 2. The settings of the Weisfeiler-Lehman kernel are the same as in the property
value prediction task. For all SVM-based evaluations presented here, the parameter ν
was set to 0.5.
Discussion. Our results show that our Link Prediction approach can outperform the
SUNS approach in terms of NDCG and bpref. Partly, this may be due to the chosen
evaluation procedure: the SUNS approach can only deal with those relation instances
which were present in the training phase: as only 80% of the available data were used
for training, some instances of the domain or range may not occur in the training data
and thus no prediction is obtained for those. To obtain a complete ranking, a minimum
value was assigned to these instances.6 Additionally, a pruning step is part of the prepro-
cessing steps of SUNS, which removes elements with few connections. In comparison
with the Weisfeiler-Lehman kernel our kernel functions achieve comparable results on
the SWRC dataset. As to the relation of α and β in the link kernel, best results are
obtained for $\alpha/\beta = 2$ in the SWRC dataset and for $\alpha/\beta = 1$ in the LiveJournal dataset.
We suppose that the higher importance of the subject in the SWRC dataset is due to the
smaller number of objects in the range of the relation. In contrast, in the LiveJournal
dataset there is an equal number of elements in the domain and the range of the rela-
tion. Another finding from additional experiments which we do not report here is that
the quality of the model increases with growing ν. This means that a more complex model achieves better generalisation results on both datasets.
6 This explains the very low bpref achieved by the SUNS approach: all positive instances for which no prediction could be obtained are ranked lower than all negative instances.
5 Conclusion
With the advent of the Resource Description Framework (RDF) and its broad uptake,
e.g. within the Linked Open Data (LOD) initiative, the problem of mining from seman-
tic data has become an important issue. In this paper, we have introduced a principled approach for exploiting RDF graph structures within established machine learning al-
gorithms by designing suitable kernel functions which exploit the specific properties of
RDF data while remaining general enough to be applicable to a wide range of learn-
ing algorithms and tasks. We have introduced two versatile families of kernel functions
for RDF entities based on intersection graphs and intersection trees and have shown
that they have an intuitive, powerful interpretation while remaining computationally
efficient. In an empirical evaluation, we demonstrated the flexibility of this approach
and showed that kernel functions within this family can compete with hand-crafted kernel
functions and computationally more demanding approaches.
In the future, we plan to extend this framework to other substructures that can be
shown to be present in intersection graphs resp. intersection trees and to apply it to new
tasks and other Linked Open Data (LOD) data sets.
Acknowledgments. The work was funded by the EU project XLike under FP7.
References
[1] Bizer, C., Heath, T., Berners-Lee, T.: Linked data - the story so far. Int. Journal on Semantic
Web and Information Systems 5(3), 1–22 (2009)
[2] Augenstein, I., Padó, S., Rudolph, S.: Lodifier: Generating Linked Data from Unstructured
Text. In: Simperl, E., et al. (eds.) ESWC 2012. LNCS, vol. 7295, pp. 210–224. Springer,
Heidelberg (2012)
[3] Brickley, D., Miller, L.: FOAF vocabulary specification. Technical report, FOAF project
(2007), http://xmlns.com/foaf/spec/20070524.html (Published online on
May 24, 2007)
[4] Fanizzi, N., d’Amato, C.: A Declarative Kernel for ALC Concept Descriptions. In: Espos-
ito, F., Raś, Z.W., Malerba, D., Semeraro, G. (eds.) ISMIS 2006. LNCS (LNAI), vol. 4203,
pp. 322–331. Springer, Heidelberg (2006)
[5] Fanizzi, N., d’Amato, C., Esposito, F.: Statistical Learning for Inductive Query Answer-
ing on OWL Ontologies. In: Sheth, A.P., Staab, S., Dean, M., Paolucci, M., Maynard, D.,
Finin, T., Thirunarayan, K. (eds.) ISWC 2008. LNCS, vol. 5318, pp. 195–212. Springer,
Heidelberg (2008)
[6] Bloehdorn, S., Sure, Y.: Kernel Methods for Mining Instance Data in Ontologies. In: Aberer,
K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P.,
Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ASWC 2007 and
ISWC 2007. LNCS, vol. 4825, pp. 58–71. Springer, Heidelberg (2007)
[7] Rettinger, A., Nickles, M., Tresp, V.: Statistical Relational Learning with Formal Ontolo-
gies. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) ECML PKDD
2009. LNCS, vol. 5782, pp. 286–301. Springer, Heidelberg (2009)
[8] Huang, Y., Tresp, V., Bundschus, M., Rettinger, A., Kriegel, H.-P.: Multivariate Prediction
for Learning on the Semantic Web. In: Frasconi, P., Lisi, F.A. (eds.) ILP 2010. LNCS,
vol. 6489, pp. 92–104. Springer, Heidelberg (2011)
148 U. Lösch, S. Bloehdorn, and A. Rettinger
[9] Bicer, V., Tran, T., Gossen, A.: Relational Kernel Machines for Learning from Graph-
Structured RDF Data. In: Antoniou, G., Grobelnik, M., Simperl, E., Parsia, B., Plexousakis,
D., De Leenheer, P., Pan, J. (eds.) ESWC 2011, Part I. LNCS, vol. 6643, pp. 47–62.
Springer, Heidelberg (2011)
[10] Thor, A., Anderson, P., Raschid, L., Navlakha, S., Saha, B., Khuller, S., Zhang, X.-N.: Link
Prediction for Annotation Graphs Using Graph Summarization. In: Aroyo, L., Welty, C.,
Alani, H., Taylor, J., Bernstein, A., Kagal, L., Noy, N., Blomqvist, E. (eds.) ISWC 2011,
Part I. LNCS, vol. 7031, pp. 714–729. Springer, Heidelberg (2011)
[11] Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge Univer-
sity Press (2004)
[12] Getoor, L., Friedman, N., Koller, D., Pfeffer, A., Taskar, B.: Probabilistic relational models.
In: Introduction to Statistical Relational Learning. MIT Press (2007)
[13] Haussler, D.: Convolution kernels on discrete structures. Technical Report UCS-CRL-99-
10, University of California at Santa Cruz (1999)
[14] Gärtner, T., Flach, P., Wrobel, S.: On graph kernels: Hardness results and efficient al-
ternatives. In: Schölkopf, B., Warmuth, M.K. (eds.) COLT/Kernel 2003. LNCS (LNAI),
vol. 2777, pp. 129–143. Springer, Heidelberg (2003)
[15] Horváth, T., Gärtner, T., Wrobel, S.: Cyclic pattern kernels for predictive graph mining.
In: Proc. of the 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining
(KDD 2004), pp. 158–167. ACM Press, New York (2004)
[16] Shervashidze, N., Borgwardt, K.: Fast subtree kernels on graphs. In: Advances in Neural
Information Processing Systems, vol. 22 (2009)
[17] Güting, R.H.: Datenstrukturen und Algorithmen. B.G. Teubner, Stuttgart (1992)
[18] Joachims, T.: Making large-scale SVM learning practical. In: Advances in Kernel Methods
- Support Vector Learning (1999)
[19] Chang, C.C., Lin, C.-J.: LIBSVM: A library for support vector machines. ACM Transac-
tions on Intelligent Systems and Technology 2, 27:1–27:27 (2011), Software available at
http://www.csie.ntu.edu.tw/˜cjlin/libsvm
[20] Sure, Y., Bloehdorn, S., Haase, P., Hartmann, J., Oberle, D.: The SWRC Ontology – Se-
mantic Web for Research Communities. In: Bento, C., Cardoso, A., Dias, G. (eds.) EPIA
2005. LNCS (LNAI), vol. 3808, pp. 218–231. Springer, Heidelberg (2005)
[21] Schölkopf, B., Platt, J.C., Shawe-Taylor, J., Smola, A., Williamson, R.C.: Estimating the
support of a high-dimensional distribution. Technical report (1999)
[22] Järvelin, K., Kekäläinen, J.: IR evaluation methods for retrieving highly relevant documents.
In: Proc. of the 23rd Annual International ACM SIGIR Conference on Research and Devel-
opment in Information Retrieval, pp. 41–48. ACM (2000)
[23] Buckley, C., Voorhees, E.M.: Retrieval evaluation with incomplete information. In: Proc. of
the 27th Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval (SIGIR 2004), pp. 25–32. ACM (2004)
EAGLE: Efficient Active Learning of Link
Specifications Using Genetic Programming
Abstract. With the growth of the Linked Data Web, time-efficient ap-
proaches for computing links between data sources have become indis-
pensable. Most Link Discovery frameworks implement approaches that
require two main computational steps. First, a link specification has to be
explicated by the user. Then, this specification must be executed. While
several approaches for the time-efficient execution of link specifications
have been developed over the last few years, the discovery of accurate
link specifications remains a tedious problem. In this paper, we present
EAGLE, an active learning approach based on genetic programming.
EAGLE generates highly accurate link specifications while reducing the
annotation burden for the user. We evaluate EAGLE against batch learn-
ing on three different data sets and show that our algorithm can detect
specifications with an F-measure above 90% while requiring only a small number of questions.
1 Introduction
The growth of the Linked Data Web over the last years has led to a compendium
of currently more than 30 billion triples [3]. Yet, it still contains a relatively low
number of links between knowledge bases (less than 2% at the moment). Devis-
ing approaches that address this problem still remains a very demanding task.
This is mainly because the difficulty behind Link Discovery is twofold: First,
the quadratic complexity of Link Discovery requires time-efficient approaches
that can efficiently compute links when given a specification of the conditions
under which a link is to be built [14,21] (i.e., when given a so-called link specifica-
tion). Such specifications can be of arbitrary complexity, ranging from a simple
comparison of labels (e.g., for finding links between countries) to the comparison
of a large set of features of different types (e.g., using population, elevation and
labels to link villages across the globe). In previous work, we have addressed this
task by developing the LIMES1 framework. LIMES provides time-efficient ap-
proaches for Link Discovery and has been shown to outperform other frameworks
significantly [20].
1 http://limes.sf.net
The second difficulty behind Link Discovery lies in the detection of accu-
rate link specifications. Most state-of-the-art Link Discovery frameworks such
as LIMES and SILK [14] adopt a property-based computation of links between
entities. To ensure that links can be computed with a high accuracy, these frame-
works provide (a) a large number of similarity measures (e.g., Levenshtein and Jaccard for strings) for comparing property values and (b) manifold means for com-
bining the results of these measures to an overall similarity value for a given
pair of entities. When faced with this overwhelming space of possible combina-
tions, users often adopt a time-demanding trial-and-error approach to detect an
accurate link specification for the task at hand. There is consequently a blatant
need for approaches that support the user in the endeavor of finding accurate
link specifications. From a user’s perspective, approaches for the semi-automatic
generation of link specification must support the user by
1. reducing the time frame needed to detect a link specification (time efficiency),
2. generating link specifications that lead to a small number of false positives
and negatives (accuracy), and
3. providing the user with easily readable and modifiable specifications (read-
ability).
In this paper, we present the EAGLE algorithm, a supervised machine-learning
algorithm for the detection of link specifications that abides by the three criteria
presented above. One of the main drawbacks of machine-learning approaches
is that they usually require a large amount of training data to achieve a high
accuracy. Yet, the generation of training data can be a very tedious process. EA-
GLE surmounts this problem by implementing an active learning approach [27].
Active learning allows the interactive annotation of highly informative training
data. Therewith, active learning approaches can minimize the amount of training
data necessary to compute highly accurate link specifications.
Overall, the contributions of this paper are as follows:
– We present a novel active learning approach to learning link specifications
based on genetic programming.
– We evaluate our approach on three different data sets and show that we
reach F-measures of above 90% by asking between 10 and 20 questions even
on difficult data sets.
– We compare our approach with state-of-the-art approaches on the DBLP-
ACM dataset and show that we outperform them with respect to runtime
while reaching a comparable accuracy.
The advantages of our approach are manifold. In addition to its high accuracy,
it generates readable link specifications which can be altered by the user at
will. Furthermore, given the superior runtime of LIMES on string and numeric
properties, our approach fulfills the requirements for use in an interactive setting.
Finally, our approach requires only very little human effort to discover link
specifications of high accuracy, as shown by our evaluation.
The rest of this paper is organized as follows: First, we give a brief overview of
the state of the art. Thereafter, we present the formal framework within which
EAGLE is defined. This framework is the basis for the subsequent specification of
our approach. We then evaluate our approach with several parameters on three
different data sets. We demonstrate the accuracy of our approach by computing
its F-measure. Moreover, we show that EAGLE is time-efficient by comparing
its runtime with that of other approaches on the ACM-DBLP dataset. We also
compare our approach with its non-active counterpart and study when the use
of active learning leads to better results.
2 Related Work
Over the last years, several approaches have been developed to address the time
complexity of link discovery. Some of these approaches focus on particular do-
mains of applications. For example, the approach implemented in RKB knowl-
edge base (RKB-CRS) [11] focuses on computing links between universities and
conferences while GNAT [25] discovers links between music data sets. Further
simple or domain-specific approaches can be found in [9,23,13,28,24]. In addition,
domain-independent approaches have been developed that aim to facilitate link
discovery across the Web. For example, RDF-AI [26] implements a five-step
approach that comprises the preprocessing, matching, fusion, interlinking and
post-processing of data sets. SILK [14] is a time-optimized tool for link discov-
ery. It implements a multi-dimensional blocking approach that is guaranteed to
be lossless thanks to the overlapping blocks it generates. Another lossless Link
Discovery framework is LIMES [20], which addresses the scalability problem by
implementing time-efficient similarity computation approaches for different data
types and combining those using set theory. Note that the task of discovering
links between knowledge bases is closely related to record linkage [30,10,5,17].
To the best of our knowledge, the problem of discovering accurate link specifications
has only been addressed in very recent literature by a small number
of approaches: The SILK framework [14] now implements a batch learning
approach to discover link specifications based on genetic programming which is
similar to the approach presented in [6]. The algorithm implemented by SILK
also treats link specifications as trees but relies on a large amount of annotated
data to discover high-accuracy link specifications. The RAVEN algorithm [22] on
the other hand is an active learning approach that treats the discovery of specifi-
cations as a classification problem. It discovers link specifications by first finding
class and property mappings between knowledge bases automatically. RAVEN
then uses these mappings to compute linear and boolean classifiers that can be
used as link specifications. A related approach that aims to detect discriminative
properties for linking is presented in [29]. In addition to these approaches,
several machine-learning approaches have been developed to learn classifiers for
record linkage. For example, machine-learning frameworks such as FEBRL [7]
and MARLIN [4] rely on models such as Support Vector Machines [16,8], de-
cision trees [31] and rule mining [1] to detect classifiers for record linkage. Our
approach, EAGLE, goes beyond previous work in three main ways. First, it is
an active learning approach. Thus, it does not require the large amount of training
data required by batch learning approaches such as FEBRL, MARLIN and the
batch genetic programming approach implemented in SILK.
3 Preliminaries
In the following, we present the core of the formalization and notation necessary
to implement EAGLE. We first formalize the Link Discovery problem. Then, we
give an overview of the grammar that underlies link specifications in LIMES
and show how the resulting specifications can be represented as trees. We show
how the discovery of link specifications can consequently be modeled as a genetic
programming problem. Subsequently, we give some insight into active learning and
then present the active learning model that underlies our work.
4 Approach
As we have formalized link specifications as trees, we can use Genetic Program-
ming (GP) to solve the problem of finding the most appropriate complex link
specification for a given pair of knowledge bases. The basic idea
behind genetic programming [18] is to generate increasingly better solutions to
a given problem by applying a number of genetic operators to the current
population. In the following, we will denote the population at time t by g^t. Genetic
operators simulate natural selection mechanisms such as mutation and repro-
duction to enable the creation of individuals that best abide by a given fitness
function. One of the key problems of genetic programming is that it is a non-
deterministic procedure. In addition, it usually requires a large training data set
to detect accurate solutions. In this paper, we propose the combination of GP
and active learning [27]. Our intuition is that by merging these approaches, we
can infuse some determinism into the GP procedure by allowing it to select the
most informative data for the population. Thus, we can improve the convergence
of GP approaches while reducing the labeling effort necessary to use them. In
the following, we present our implementation of the different GP operators on
link specifications and how we combine GP and active learning.
[Figure: tree representation of link specifications. A metricOp node with threshold σ combines the measures m1 and m2 into a complex measure m; a specOp node combines two sub-specifications spec1 and spec2 (each a measure m with a threshold θ) into a complex specification spec.]
4.1 Overview
Algorithm 1. EAGLE
Require: Specification of the two knowledge bases K_S and K_T
  Get the sets S and T of instances as specified in K_S and K_T, respectively.
  Get the property mapping between (K_S, K_T).
  Get a reference mapping by asking the user to label n random pairs (s, t) ∈ S × T.
  repeat
    Evolve the population of the given size for a number of generations.
    Compute the n most informative link candidates and ask the user to label them.
  until stop condition reached
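A hypothetical Python sketch of this control loop is given below. The helper callables (labeling, evolution, query selection, population initialization) are placeholders for the components explained in the following subsections; they are not part of the actual EAGLE implementation.

```python
import random

def eagle_loop(candidates, label_fn, evolve_fn, informative_fn, init_population,
               n_queries=10, generations=10, max_rounds=5):
    """Sketch of the EAGLE control loop (Algorithm 1); the genetic and
    active-learning details are injected as callables."""
    # Reference mapping: ask the user (label_fn) to label n random candidate pairs.
    training = {pair: label_fn(pair) for pair in random.sample(candidates, n_queries)}
    population = init_population()
    for _ in range(max_rounds):              # stands in for "until stop condition reached"
        for _ in range(generations):         # evolve between two rounds of inquiries
            population = evolve_fn(population, training)
        # Ask the user to label the most informative link candidates.
        for pair in informative_fn(population, candidates, n_queries):
            training[pair] = label_fn(pair)
    return population, training
```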
In the following, we will explicate each of the steps of our algorithm in more
detail. Each of these steps will be exemplified by using the link specification
shown in Figure 5.
Fig. 5. Example link specification: an AND specification operator combining two atomic specifications, a trigrams measure with threshold 0.8 and a Jaccard measure with threshold 0.5.
Evolution is the primary process which enables GP to solve problems and drives
the development of efficient solutions for a given problem. At the beginning
of our computation the population is empty and must be filled with randomly
generated individuals. This is carried out by generating random trees whose nodes
are filled with functions or terminals as required. For this paper, we defined the
operators (functions and terminals) of the genotype for generating
link specifications as follows: all metricOp and specOp were set to be functions.
Terminal symbols were thresholds and measures. Note that these operators can
be extended at will. In addition, all operators were mapped to certain constraints
so as to ensure that EAGLE only generates valid program trees. For example, the
operator that compares numeric properties only accepts terminals representing
numeric properties from the knowledge bases.
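As an illustration, such a random tree could be grown as sketched below. The dictionary-based encoding and the set of measure names are illustrative assumptions, not the exact LIMES genotype.

```python
import random

MEASURES = ["trigrams", "jaccard", "levenshtein"]   # assumed terminal measures

def random_specification(property_pairs, depth=2):
    """Grow a random link-specification tree: leaves apply an atomic measure to a
    source/target property pair with a threshold, inner nodes combine two
    sub-specifications with a specification operator (AND/OR) and a threshold."""
    if depth == 0 or random.random() < 0.5:
        return {"measure": random.choice(MEASURES),
                "properties": random.choice(property_pairs),
                "threshold": round(random.uniform(0.5, 1.0), 2)}
    return {"op": random.choice(["AND", "OR"]),
            "threshold": round(random.uniform(0.5, 1.0), 2),
            "children": [random_specification(property_pairs, depth - 1),
                         random_specification(property_pairs, depth - 1)]}

# Example usage: random_specification([("rdfs:label", "rdfs:label")])
```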
Let g^t be the population at iteration t. To evolve a population to the
generation g^{t+1} we first determine the fitness of all individuals of generation g^t
(see Section 4.3). These fitness values build the basis for selecting individuals
for the genetic operator reproduction. We use a tournament setting between two
selected individuals to decide which one is copied to the next generation g^{t+1}.
The individuals selected to build the population of g^{t+1} are the n fittest from
the union of the set of newly created individuals and g^t. Note that we iteratively
generate new populations of potentially fitter individuals.
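A minimal sketch of such an evolution step is shown below; fitness, crossover, and mutation are supplied as callables, and the default rates mirror the values reported in Section 5, while the details of the real operators differ.

```python
import random

def next_generation(population, fitness, crossover, mutate,
                    n=20, mutation_rate=0.6, crossover_rate=0.6):
    """Sketch of one evolution step: tournament-based reproduction followed by
    survival of the n fittest from offspring and parents combined."""
    offspring = []
    while len(offspring) < n:
        # Tournament: of two randomly selected individuals, the fitter one reproduces.
        a, b = random.sample(population, 2)
        child = a if fitness(a) >= fitness(b) else b
        if random.random() < crossover_rate:
            child = crossover(child, random.choice(population))
        if random.random() < mutation_rate:
            child = mutate(child)
        offspring.append(child)
    # The n fittest individuals from the union of parents and offspring survive.
    return sorted(population + offspring, key=fitness, reverse=True)[:n]
```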
4.3 Fitness
The aim of the fitness function is to approximate how well a solution (i.e., a link
specification) solves the problem at hand. In the supervised machine learning
setting, this is equivalent to computing how well a link specification maps the
training data at hand. To determine the fitness of an individual we first build the
link specification that is described by the tree at hand. Given the set of available
training data O = {(xi , yi ) ∈ S × T }, we then run the specification by using the
sets S(O) = {s ∈ S : ∃t ∈ T : (s, t) ∈ O} and T (O) = {t ∈ T : ∃s ∈ S : (s, t) ∈
O}. The result of this process is a mapping M that is then evaluated against O
by the means of the standard F-measure defined as
F = 2PR / (P + R), where P = |M ∩ O| / |M| and R = |M ∩ O| / |O|. (1)
Note that by running the linking on S(O) and T (O), we can significantly reduce
EAGLE’s runtime.
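As a sketch, the fitness of Equation (1) can be computed as follows, treating the training data O simply as a set of positive link pairs (a simplification of the setting described above).

```python
def f_measure(mapping, reference):
    """Fitness of a specification: F-measure of its mapping M against the
    positive training links O (Equation 1)."""
    mapping, reference = set(mapping), set(reference)
    if not mapping or not reference:
        return 0.0
    overlap = len(mapping & reference)
    precision = overlap / len(mapping)
    recall = overlap / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```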
Fig. 7. Crossover example. Assume two individuals with program trees like the one in (a). A crossover operation can replace subtrees to produce an offspring like (b).
The main idea behind the reduced amount of labeling effort required by
active learning approaches is that they only require highly informative training
data from the user. Finding these most informative pieces of information is
usually carried out by measuring the amount of information that the labeling of
a training data item would bear. Given the setting of EAGLE in which several
possible solutions co-exist, we opted for applying the idea of active learning by
committees as explicated in [19]. The idea here is to consistently entertain a finite
and incomplete set of solutions to the problem at hand. The most informative
link candidates are then considered to be the pairs (s, t) ∈ S × T upon which the
different solutions disagree the most. In our case, these are the link candidates
that maximize the disagreement function δ((s, t)):
where M_i are the mappings generated by the population g^t. The pairs (s, t) that
lead to the highest disagreement score are presented to the user, who provides
the system with the correct labels. The training set is then updated and used
to compute the next generations of solutions.
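Since the exact disagreement function δ is not reproduced here, the sketch below uses one plausible committee-based instantiation (votes for a pair multiplied by votes against it, maximal when the committee is split); the function used in EAGLE may differ.

```python
def most_informative(mappings, candidates, n):
    """Select the n candidate pairs on which the committee of mappings
    (one mapping per individual of the population) disagrees the most."""
    def delta(pair):
        votes_for = sum(1 for m in mappings if pair in m)
        return votes_for * (len(mappings) - votes_for)
    return sorted(candidates, key=delta, reverse=True)[:n]
```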
5 Evaluation
The goal of the first experiment, called Drugs, was to measure how well we can
detect a manually created LIMES specification. For this purpose, we generated
owl:sameAs link candidates between Drugs in DailyMed and Drugbank by using
their rdfs:label. The second experiment, Movies, was carried out by using the
results of a LATC2 link specification. Here, we fetched the links generated by a
link specification that linked movies in DBpedia to movies in LinkedMDB [12],
gathered the rdfs:label of the movies as well as the rdfs:label of their direc-
tors in the source and target knowledge bases and computed a specification that
aimed to reproduce the set of links at hand as exactly as possible. Note that this
specification is hard to reproduce as the experts who created this link specifica-
tion applied several transformations to the property values before carrying out
the similarity computation that led to the results at hand. Finally, in our third
experiment (Publications), we used the ACM-DBLP dataset described in [17].
Our aim here was to compare our approach with other approaches with respect
to both runtime and F-measure.
Table 1. Characteristics of the datasets used for the evaluation of EAGLE. S stands
for source, T for target.
All experiments were carried out on one core of an AMD Opteron quad-core
processor (2 GHz) with the following settings: the population size was
set to 20 or 100. The maximal number of generations was set to 50. In all
active learning experiments, we carried out 10 inquiries per iteration cycle. In
addition, we had the population evolve for 10 generations between all inquiries.
The mutation and crossover rates were set to 0.6. For the batch learners, we
set the number of generations to the size of the training data. Note that this
setup puts active learning at a disadvantage, as the batch learners then have more
data and more iterations on the data to learn the best possible specification.
We used this setting as a counterpart to the questions that can be asked by
the active learning approach. During our experiments, the Java Virtual Machine
was allocated 1GB RAM. All experiments were repeated 5 times.
5.2 Results
The results of our experiments are shown in the Figures below3 . In all figures,
Batch stands for the batch learners while AL stands for the active learners. The
2 http://lact-project.eu
3 Extensive results are available at the LIMES project website at http://limes.sf.net
numbers in brackets are the sizes of the populations used. The results of the
Drugs experiments clearly show that our approach can easily detect simple link
specifications. In this experiment, 10 questions were sufficient for the batch and
active learning versions of EAGLE to generate link specifications with an F-
measure equivalent to the baseline of 99.9% F-measure. The standard deviation
lay around 0.1% for all experiments with both the batch and the active learner.
Fig. 8. Results of the Drugs experiment. Mean F-Measure of five runs of batch and
active learner, both using population sizes of 20 and 100 individuals. The baseline is
at 99.9% F-measure.
Fig. 9. Results of the Movies experiment. Mean F-measures of five runs of batch and
active learner, both using population sizes of 20 and 100 individuals. The baseline is
at 97.6% F-measure.
Fig. 10. Results of the Publications experiment. Mean F-measures of five runs of batch
and active learner, both using population sizes of 20 and 100 individuals. The baseline
is at 97.2% F-measure.
As stated above, we chose the ACM-DBLP data set because it has been used in
previous work to compare the accuracy and learning curve of different machine
learning approaches for deduplication. As our results show (see Table 2), we
reach an accuracy comparable to that of the other approaches. One of the main
advantages of our approach is that it is considerably more time-efficient than
all other approaches. In particular, while we are approximately 3 to 7 times faster
than MARLIN, we are more than 14 times faster than FEBRL on this data set.
So far, only a few other approaches have been developed for learning link
specifications from data. RAVEN [22] is an active learning approach that views
the learning of link specifications as a classification task. While it bears the
advantage of being deterministic, it is limited to learning certain types of classifiers
(boolean or linear). Thus, it is only able to learn a subset of the specifications that
can be generated by EAGLE. Another genetic programming-based approach to
link discovery is implemented in the SILK framework [15]. This approach is, however,
a batch learning approach and consequently suffers from the drawbacks of all batch
learning approaches, as it requires a very large number of human annotations to
learn link specifications of a quality comparable to that of EAGLE.
References
1. Agrawal, R., Imieliński, T., Swami, A.: Mining association rules between sets of
items in large databases. SIGMOD Rec. 22, 207–216 (1993)
2. Arasu, A., Götz, M., Kaushik, R.: On active learning of record matching packages.
In: SIGMOD Conference, pp. 783–794 (2010)
3. Auer, S., Lehmann, J., Ngonga Ngomo, A.-C.: Introduction to Linked Data and
Its Lifecycle on the Web. In: Polleres, A., d’Amato, C., Arenas, M., Handschuh, S.,
Kroner, P., Ossowski, S., Patel-Schneider, P. (eds.) Reasoning Web 2011. LNCS,
vol. 6848, pp. 1–75. Springer, Heidelberg (2011)
4. Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string
similarity measures. In: KDD, pp. 39–48 (2003)
5. Bleiholder, J., Naumann, F.: Data fusion. ACM Comput. Surv. 41(1), 1–41 (2008)
6. Carvalho, M.G., Laender, A.H.F., Gonçalves, M.A., da Silva, A.S.: Replica identi-
fication using genetic programming. In: Proceedings of the 2008 ACM Symposium
on Applied Computing, SAC 2008, pp. 1801–1806. ACM, New York (2008)
7. Christen, P.: Febrl: an open source data cleaning, deduplication and record linkage
system with a graphical user interface. In: KDD 2008, pp. 1065–1068 (2008)
8. Cristianini, N., Ricci, E.: Support vector machines. In: Kao, M.-Y. (ed.) Encyclo-
pedia of Algorithms. Springer (2008)
9. Cudré-Mauroux, P., Haghani, P., Jost, M., Aberer, K., de Meer, H.: idmesh: graph-
based disambiguation of linked data. In: WWW, pp. 591–600 (2009)
10. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: A
survey. IEEE Transactions on Knowledge and Data Engineering 19, 1–16 (2007)
11. Glaser, H., Millard, I.C., Sung, W.-K., Lee, S., Kim, P., You, B.-J.: Research on
linked data and co-reference resolution. Technical report, University of Southamp-
ton (2009)
12. Hassanzadeh, O., Consens, M.: Linked movie data base. In: Bizer, C., Heath, T.,
Berners-Lee, T., Idehen, K. (eds.) Proceedings of the WWW 2009 Workshop on
Linked Data on the Web, LDOW 2009 (2009)
13. Hogan, A., Polleres, A., Umbrich, J., Zimmermann, A.: Some entities are more
equal than others: statistical methods to consolidate linked data. In: Workshop
on New Forms of Reasoning for the Semantic Web: Scalable & Dynamic (NeFoRS
2010) (2010)
14. Isele, R., Jentzsch, A., Bizer, C.: Efficient Multidimensional Blocking for Link
Discovery without losing Recall. In: WebDB (2011)
15. Isele, R., Bizer, C.: Learning Linkage Rules using Genetic Programming. In: Sixth
International Ontology Matching Workshop (2011)
16. Sathiya Keerthi, S., Lin, C.-J.: Asymptotic behaviors of support vector machines
with gaussian kernel. Neural Comput. 15, 1667–1689 (2003)
17. Köpcke, H., Thor, A., Rahm, E.: Comparative evaluation of entity resolution ap-
proaches with fever. Proc. VLDB Endow. 2(2), 1574–1577 (2009)
18. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means
of Natural Selection (Complex Adaptive Systems). The MIT Press (1992)
19. Liere, R., Tadepalli, P.: Active learning with committees for text categorization.
In: Proceedings of the Fourteenth National Conference on Artificial Intelligence,
pp. 591–596 (1997)
20. Ngonga Ngomo, A.-C.: A Time-Efficient Hybrid Approach to Link Discovery. In:
Sixth International Ontology Matching Workshop (2011)
21. Ngonga Ngomo, A.-C., Auer, S.: LIMES - A Time-Efficient Approach for Large-
Scale Link Discovery on the Web of Data. In: Proceedings of IJCAI (2011)
22. Ngonga Ngomo, A.-C., Lehmann, J., Auer, S., Höffner, K.: RAVEN – Active Learn-
ing of Link Specifications. In: Proceedings of OM@ISWC (2011)
23. Nikolov, A., Uren, V., Motta, E., de Roeck, A.: Overcoming Schema Heterogene-
ity between Linked Semantic Repositories to Improve Coreference Resolution. In:
Gómez-Pérez, A., Yu, Y., Ding, Y. (eds.) ASWC 2009. LNCS, vol. 5926, pp. 332–
346. Springer, Heidelberg (2009)
24. Papadakis, G., Ioannou, E., Niederée, C., Palpanas, T., Nejdl, W.: Eliminating the
redundancy in blocking-based entity resolution methods. In: JCDL (2011)
25. Raimond, Y., Sutton, C., Sandler, M.: Automatic interlinking of music datasets on
the semantic web. In: Proceedings of the 1st Workshop about Linked Data on the
Web (2008)
26. Scharffe, F., Liu, Y., Zhou, C.: RDF-AI: an architecture for RDF datasets match-
ing, fusion and interlink. In: Proc. IJCAI 2009 Workshop on Identity, Reference,
and Knowledge Representation (IR-KR), Pasadena, CA, US (2009)
27. Settles, B.: Active learning literature survey. Technical Report 1648, University of
Wisconsin-Madison (2009)
28. Sleeman, J., Finin, T.: Computing foaf co-reference relations with rules and ma-
chine learning. In: Proceedings of the Third International Workshop on Social Data
on the Web (2010)
29. Song, D., Heflin, J.: Automatically Generating Data Linkages Using a Domain-
Independent Candidate Selection Approach. In: Aroyo, L., Welty, C., Alani, H.,
Taylor, J., Bernstein, A., Kagal, L., Noy, N., Blomqvist, E. (eds.) ISWC 2011, Part
I. LNCS, vol. 7031, pp. 649–664. Springer, Heidelberg (2011)
30. Winkler, W.: Overview of record linkage and current research directions. Technical
report, Bureau of the Census - Research Report Series (2006)
31. Yuan, Y., Shaw, M.J.: Induction of fuzzy decision trees. Fuzzy Sets Syst. 69, 125–
139 (1995)
Combining Information Extraction,
Deductive Reasoning and Machine Learning
for Relation Prediction
1 Introduction
The prediction of the truth value of an (instantiated) relation or statement (i.e.,
a link in an RDF graph) is a common theme in such diverse areas as information
extraction (IE), deductive reasoning and machine learning. In the course of this
paper we consider statements in the form of (s, p, o) RDF triples where s and o are
entities and where p is a predicate. In IE, one expects that the relation of inter-
est can be derived from subsymbolic unstructured sensory data such as texts or
images and the goal is to derive a mapping from sensory input to statements.
In deductive reasoning, one typically has available a set of facts and axioms and
deductive reasoning is used to derive additional true statements. Relational ma-
chine learning also uses a set of true statements but estimates the truth values of
novel statements by exploiting regularities in the data. Powerful methods have
been developed for all three approaches and all have their respective strengths
and shortcomings. IE can only be employed if sensory information is available
that is relevant to a relation, deductive reasoning can only derive a small subset
of all statements that are true in a domain and relational machine learning is
only applicable if the data contains relevant statistical structure. The goal of this
paper is to combine the strengths of all three approaches modularly, in the sense
that each step can be optimized independently. In a first step, we extract triples
using IE, where we assume that the extracted triples have associated certainty
values. In this paper we will only consider IE from textual data. Second, we
perform deductive reasoning to derive the set of provably true triples. Finally,
in the third step, we employ machine learning to exploit the dependencies be-
tween statements. The predicted triples are then typically ranked for decision
support. The complete system can be interpreted as a form of scalable hierar-
chical Bayesian modeling. We validate our model using data from the YAGO2
ontology, and from Linked Life Data and Bio2RDF, all of which are part of the
Linked Open Data (LOD) cloud.
The paper is organized as follows. The next section discusses related work.
Section 3 describes and combines IE and deductive reasoning. Section 4 describes
the relational learning approach. Section 5 presents various extensions and in
Section 6 we discuss scalability. Section 7 contains our experimental results and
Section 8 presents our conclusions.
2 Related Work
Multivariate prediction generalizes supervised learning to predict several variables
jointly, conditioned on some inputs. The improved predictive performance in mul-
tivariate prediction, if compared to simple supervised learning, has been attributed
to the sharing of statistical strength between the multiple tasks, i.e., data is used
more efficiently (see [32] and citations therein for a review). Due to the large degree
of sparsity of the relationship data in typical semantic graph domains, we expect
that multivariate prediction can aid the learning process in such domains.
Our approach is also related to conditional random fields [20]. The main dif-
ferences are the modularity of our approach and that our data does not exhibit
the linear structure assumed in conditional random fields.
Recently, there has been quite some work on the relationship between kernels
and graphs [7] [33] [11]. Kernels for semi-supervised learning have, for example,
been derived from the spectrum of the Graph-Laplacian. Kernels for semantically
rich domains have been developed by [8]. In [36] [35] approaches for Gaussian
process based link prediction have been presented. Link prediction is covered
and surveyed in [27] [13]. Inclusion of ontological prior knowledge to relational
learning has been discussed in [28].
From early on there has been considerable work on supporting ontologies using
machine learning [24] [9] [21], while data mining perspectives for the Semantic
Web have been described by [1] [25]. A recent overview of the state of the art has
been presented in [29]. The transformation of text into the RDF structure of the
semantic web via IE is a highly active area of research [23] [30] [5] [6] [2] [4] [34] [3]
[26] [14]. [22] describes a perspective of ILP for the Semantic Web. We consider
machine learning approaches that have been applied to relation prediction in
the context with the Semantic Web. In [19] the authors describe SPARQL-ML,
a framework for adding data mining support to SPARQL. SPARQL-ML was
P (X = 1|S)
which is the probability that the statement represented by X is true given the
sensory information S. Otherwise no restrictions apply to the IE part in our
approach, e.g., it could be based on rules or on statistical classifiers. Note that
IE is limited to predict statements for which textual or other sensory information
is available.
In the applications we have textual information text_s describing the subject
and textual information text_o describing the object and we can write1
P(X = 1|text_s, text_o). (1)
In other applications we might also exploit text that describes the predicate
text_p or text that describes the relationship text_{s,p,o} (e.g., a document where
a user (s) evaluates a movie (o) and the predicate is p = “likes”) [16]. A recent
overview on state of the art IE methods for textual data can be found in [29].
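As an illustration only (the IE component itself is not specified here), such probability estimates could be produced by a simple text classification pipeline over the concatenated subject and object descriptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: concatenated subject/object texts and triple truth labels.
texts = ["gene BRCA1 dna repair ... breast cancer tumor suppressor",
         "gene OPN1LW photoreceptor ... asthma airway inflammation"]
labels = [1, 0]

# TF-IDF features plus logistic regression yield probability estimates
# P(X = 1 | text_s, text_o) for unseen subject/object pairs.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict_proba(["gene TP53 cell cycle ... cancer"])[:, 1])
```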
P (X = 1|KB)
For all triples that cannot be proven to be true, we assume that P (X = 1|KB)
is a small nonnegative number. This number reflects our belief that triples not
known to be true might still be true.
4.1 Notation
Consider (s, p, o) triple statements where s and o are entities and p is a predicate.
Note that a triple typically describes an attribute of a subject, e.g., (Jack, height,
Tall), or a relationship (Jack, likes, Jane). Let {e_i} be the set of known
entities in the domain. We assume that each entity is assigned to exactly one
class c(i). This assumption will be further discussed in Section 5. Let Nc be the
number of entities in class c.
We also assume that the set of all triples in which an entity ei can occur as a
subject is known and is a finite, possibly large, ordered set (more details later)
and contains Mc(i) elements. For each potential triple (s, p, o) we introduce
a random variable X which is in state one when the triple is true and is in
state zero otherwise. More precisely, X_{i,k} = 1 if the k-th triple involving e_i as a
subject is true and X_{i,k} = 0 otherwise. Thus, {X_{i,k}}_{k=1}^{M_{c(i)}} is the set of all random
variables assigned to the subject entity e_i.
We now assume that there are dependencies between all statements with the
same subject entity.
α_i = A h_i, (3)
and the probability of observing X_{i,k} = 1 is modeled as sig(α_{i,k}),
where sig(in) = 1/(1 + exp(−in)) is the logistic function. In other words, α_{i,k}
is the true but unknown activation that specifies the probability of observing
Xi,k = 1. Note that αi,k is continuous with −∞ < αi,k < ∞ such that a
Gaussian distribution assumption is sensible, whereas discrete probabilities are
bounded by zero and one.
We assume that α_{i,k} is not known directly, but that we have a noisy version
available for each α_{i,k} in the form of
f_{i,k} = α_{i,k} + ε_{i,k},
where ε_{i,k} is independent Gaussian noise with variance σ². f_{i,k} is now calculated
in the following way from sensory information and the knowledge base. We
simply write
P̂(X_{i,k} = 1|S, KB) = sig(f_{i,k}),
such that the combined evidence from the sensors and the knowledge base is transferred into
f_{i,k} = inv-sig(P̂(X_{i,k} = 1|S, KB)), (6)
where inv-sig is the inverse of the logistic function. Thus probabilities close to one
are mapped to large positive f-values and probabilities close to zero are mapped
to large negative f-values. The resulting F-matrix contains the observed data
in the probabilistic model (see Figure 1).
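A minimal sketch of this mapping, assuming the probabilities are clipped away from 0 and 1 so that the f-values stay finite:

```python
import numpy as np

def to_f_matrix(prob, eps=1e-6):
    """Map a matrix of estimated probabilities P(X_ik = 1 | S, KB) to f-values
    via the inverse logistic (logit) function, as in Equation (6)."""
    p = np.clip(np.asarray(prob, dtype=float), eps, 1.0 - eps)  # avoid +/- infinity
    return np.log(p / (1.0 - p))                                # inv-sig = logit
```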
Note that our generative model corresponds to the probabilistic PCA (pPCA)
described in [31] and thus we can use the learning equations from that paper.
Let F be the N_c × M_c matrix of f-values for class c and let
C = F^T F.
The pPCA estimate of the loading matrix is then
Â = U_d (Λ_d − σ̂² I)^{1/2} R, (7)
where the d column vectors in the M_c × d matrix U_d are the principal eigenvectors
of C, with corresponding eigenvalues λ_1, ..., λ_d in the d × d diagonal matrix Λ_d,
and R is an arbitrary d × d orthogonal rotation matrix.3 We also get
σ̂² = (1 / (M_c − d)) · Σ_{j=d+1}^{M_c} λ_j.
Finally, we obtain
α̂_i = Â M^{−1} Â^T f_i, with M = Â^T Â + σ̂² I. (8)
3 A practical choice is the identity matrix R = I. Also note that we assume that the mean is equal to zero, which can be justified in sparse domains.
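Assuming the standard Tipping and Bishop estimates with R = I and zero mean, the reconstruction of the activations can be sketched as follows:

```python
import numpy as np

def ppca_reconstruct(F, d):
    """Sketch of the pPCA-based smoothing: eigendecompose C = F^T F, estimate
    the noise level from the discarded eigenvalues, and reconstruct the
    activations alpha_hat_i = A M^-1 A^T f_i for every row of F."""
    C = F.T @ F
    eigvals, eigvecs = np.linalg.eigh(C)             # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1]                  # sort descending
    lam, U = eigvals[idx], eigvecs[:, idx]
    Ud, lam_d = U[:, :d], lam[:d]
    sigma2 = lam[d:].sum() / (F.shape[1] - d)        # discarded-eigenvalue average
    A = Ud * np.sqrt(np.maximum(lam_d - sigma2, 0))  # A_hat = U_d (Lam_d - sigma^2 I)^(1/2)
    M = A.T @ A + sigma2 * np.eye(d)
    return F @ A @ np.linalg.inv(M) @ A.T            # rows are the alpha_hat_i
```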
Finally we want to comment on how we define the set of all possible triples un-
der consideration. In most applications there is prior knowledge available about
what triples should be considered. Also, typed relations constrain the number
of possible triples. In some applications it makes sense to restrict triples based
on observed triples: we define the set of all possible statements in a class c to
be all statements (s, p, o) where s is in class c and where a triple (s′, p, o) has
been observed in the data for at least one entity s′ ∈ c.
5.3 Aggregation
After training, the learning model only considers dependencies between triples
with the same subject entity. Here we discuss how additional information can
be made useful for prediction.
describing the person. Similarly, if the knowledge base contains the statement
(Movie, isGenre, Action), we add the term “action” to the keywords describing
the movie.
6 Scalability
We consider the scalability of the three steps: deductive reasoning, IE, and ma-
chine learning. Deductive reasoning with less expressive ontologies scales up to
billions of statements [10]. Additional scalability can be achieved by giving up
completeness. As already mentioned, each class is modeled separately, such that,
if the number of entities per class and potential triples per entity are constant,
machine learning scales linearly with the size of the knowledge base. The ex-
pensive part of the machine learning part is the eigen decomposition required in
Equation 7. By employing sparse matrix algebra, this computation scales linearly
with the number of nonzero elements in F . To obtain a sparse F , we exploit the
sensory information only for the test entities and train the machine learning com-
ponent only on the knowledge base information, i.e., replace P̂ (Xi,k = 1|S, KB)
with P̂(X_{i,k} = 1|KB) in Equation 6. Then we assume that P(X = 1|KB) = ε,
a small positive constant, for all triples that are not and cannot be proven
true. We then subtract inv-sig(ε) from F prior to the decomposition and add
inv-sig(ε) to all α. The sparse setting can handle settings with millions of entities
in each class and millions of potential triples for each entity.
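A sketch of this sparse variant using a truncated sparse SVD is given below; estimating σ̂² from the squared Frobenius norm is an assumption about how the discarded eigenvalues would be accounted for.

```python
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def sparse_ppca_factors(F_sparse, d):
    """With a sparse F (knowledge-base evidence only), the top-d eigenpairs of
    F^T F follow from a truncated SVD, so the cost scales with the nonzeros."""
    F_sparse = csr_matrix(F_sparse, dtype=float)
    U, s, Vt = svds(F_sparse, k=d)                   # F ~ U diag(s) Vt
    lam_d = s ** 2                                   # eigenvalues of F^T F
    Ud = Vt.T                                        # corresponding eigenvectors
    mc = F_sparse.shape[1]
    # trace(F^T F) equals the squared Frobenius norm, i.e. the sum of all eigenvalues.
    sigma2 = (F_sparse.power(2).sum() - lam_d.sum()) / (mc - d)
    return Ud, lam_d, sigma2
```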
7 Experiments
7.1 Associating Diseases with Genes
As the costs for gene sequencing are dropping, it is expected to become part of
clinical practice. Unfortunately, for many years to come the relationships between
genes and diseases will remain only partially known. The task here is to predict
diseases that are likely associated with a gene based on knowledge about gene
and disease attributes and about known gene-disease patterns.
Disease genes are those genes involved in the causation of, or associated with
a particular disease. At this stage, more than 2500 disease genes have been
discovered. Unfortunately, the relationship between genes and diseases is far from
simple since most diseases are polygenic and exhibit different clinical phenotypes.
High-throughput genome-wide studies like linkage analysis and gene expression
profiling typically result in hundreds of potential candidate genes and it is still a
challenge to identify the disease genes among them. One reason is that genes can
often perform several functions and a mutational analysis of a particular gene
reveals dozens of mutation sites that lead to different phenotype associations with
diseases like cancer [18]. An analysis is further complicated since environmental
and physiological factors come into play as well as exogenous agents like viruses
and bacteria.
Despite this complexity, it is quite important to be able to rank genes in terms
of their predicted relevance for a given disease as a valuable tool for researchers
and with applications in medical diagnosis, prognosis, and a personalized treat-
ment of diseases.
In our experiments we extracted information on known relationships between
genes and diseases from the LOD cloud, in particular from Linked Life Data and
Bio2RDF, forming the triples (Gene, related to, Disease). In total, we considered
2462 genes and 331 diseases. We retrieved textual information describing genes
and diseases from corresponding text fields in Linked Life Data and Bio2RDF.
For IE, we constructed one global classifier that predicts the likelihood of a
gene-disease relationship based on the textual information describing the gene
and the disease. The system also considered relevant interaction terms between
keywords and between keywords and identifiers and we selected in total the 500
most relevant keywords and interaction terms. We performed the following experiments:
– ML: We trained a model using only the gene disease relationship, essentially a
collaborative filtering system. Technically, Equation 6 uses P̂ (Xi,k = 1|KB),
i.e., no sensory information.
– IE: This is the predictive performance based only on IE, using Equation 1.
– ML + IE: Here we combine ML with IE, as discussed in the paper. We
combine the knowledge base with IE as described in Section 3.3 and then
apply Equation 6 and Equation 8.
Figure 2 shows the results. As can be seen, the performance of the IE part is
rather weak and ML gives much better performance. It can nicely be seen that
the combination of ML and IE is effective and provides the best results.
Fig. 2. Results on the Gene-Disease Data as a function of the rank d of the approxima-
tion. For each gene in the data set, we randomly selected one related to statement to be
treated as unknown (test statement). In the test phase we then predicted all unknown
related to entries, including the entry for the test statement. The test statement should
obtain a high likelihood value, if compared to the other unknown related to entries.
The normalized discounted cumulative gain (nDCG@all) [17] is a measure to evaluate
a predicted ranking.
We obtained 440 entities representing the selected writers. We selected 354 en-
tities with valid yago:hasWikipediaUrl statements. We built the following five
models:
– ML: Here we considered the variables describing the writers’ nationality (in
total 4) and added information on the city where a writer was born. In total,
we obtained 233 variables. Technically, Equation 6 uses P̂ (Xi,k = 1|KB),
i.e., no sensory information.
We performed 10-fold cross validation for each model, and evaluated them with
the area under the precision-recall curve. Figure 3 shows the results. We see that
the ML contribution was weak but could be improved significantly by adding
information on the country of birth (ML+AGG). The IE component gives ex-
cellent performance but ML improves the results by approximately 3 percentage
points. Finally, by including geo-reasoning, the performance can be improved by
another percentage point. This is a good example where all three components,
geo-reasoning, IE and machine learning fruitfully work together.
Fig. 3. The area under curve for the YAGO2 Core experiment as a function of the
rank d of the approximation
8 Conclusions
In this paper we have combined information extraction, deductive reasoning and
relational machine learning to integrate all sources of available information in a
modular way. IE supplies evidence for the statements under consideration and
machine learning models the dependencies between statements. Thus even if it
is not evident that a patient has diabetes just from IE from text, our approach
has the ability to provide additional evidence by exploiting correlations with
other statements, such as the patient’s weight, age, regular exercise and insulin
intake. We discussed the case that an entity belongs to more than one ontological
class and addressed aggregation. The approach was validated using data from
the YAGO2 ontology, and the Linked Life Data ontology and Bio2RDF. In the
experiments associating diseases with genes we could show that our approach
to combine IE with machine learning is effective in applications where a large
number of relationships need to be predicted. In the experiments on predicting
the writers’ nationality we could show that IE could be combined with machine
learning and geo-reasoning for the overall best predictions. In general, the ap-
proach is most effective when the information supplied via IE is complementary
to the information supplied by statistical patterns in the structured data and if
reasoning can add relevant covariate information.
References
1. Berendt, B., Hotho, A., Stumme, G.: Towards Semantic Web Mining. In: Hor-
rocks, I., Hendler, J. (eds.) ISWC 2002. LNCS, vol. 2342, pp. 264–278. Springer,
Heidelberg (2002)
2. Biemann, C.: Ontology learning from text: A survey of methods. LDV Forum 20(2)
(2005)
3. Buitelaar, P., Cimiano, P.: Ontology Learning and Population: Bridging the Gap
between Text and Knowledge. IOS Press (2008)
4. Cimiano, P.: Ontology Learning and Population from Text: Algorithms, Evaluation
and Applications. Springer (2006)
5. Cimiano, P., Hotho, A., Staab, S.: Comparing conceptual, divisive and agglomerative
clustering for learning taxonomies from text. In: Proceedings of the 16th European
Conference on Artificial Intelligence, ECAI 2004 (2004)
6. Cimiano, P., Staab, S.: Learning concept hierarchies from text with a guided ag-
glomerative clustering algorithm. In: Proceedings of the ICML 2005 Workshop on
Learning and Extending Lexical Ontologies with Machine Learning Methods (2005)
7. Cumby, C.M., Roth, D.: On kernel methods for relational learning. In: ICML (2003)
8. D’Amato, C., Fanizzi, N., Esposito, F.: Non-parametric statistical learning meth-
ods for inductive classifiers in semantic knowledge bases. In: IEEE International
Conference on Semantic Computing - ICSC 2008 (2008)
9. Fanizzi, N., d’Amato, C., Esposito, F.: DL-FOIL Concept Learning in Description
Logics. In: Železný, F., Lavrač, N. (eds.) ILP 2008. LNCS (LNAI), vol. 5194, pp.
107–121. Springer, Heidelberg (2008)
10. Fensel, D., van Harmelen, F., Andersson, B., Brennan, P., Cunningham, H., Della
Valle, E., Fischer, F., Huang, Z., Kiryakov, A., Lee, T.K.-I., Schooler, L., Tresp,
V., Wesner, S., Witbrock, M., Zhong, N.: Towards larkc: A platform for web-scale
reasoning. In: ICSC, pp. 524–529 (2008)
11. Gärtner, T., Lloyd, J.W., Flach, P.A.: Kernels and distances for structured data.
Machine Learning 57(3) (2004)
12. Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B.: Bayesian Data Analysis, 2nd
edn. Chapman and Hall/CRC Texts in Statistical Science (2003)
13. Getoor, L., Diehl, C.P.: Link mining: a survey. SIGKDD Explorations (2005)
14. Grobelnik, M., Mladenic, D.: Knowledge discovery for ontology construction. In:
Davies, J., Studer, R., Warren, P. (eds.) Semantic Web Technologies. Wiley (2006)
15. Huang, Y., Tresp, V., Bundschus, M., Rettinger, A., Kriegel, H.-P.: Multivariate
Prediction for Learning on the Semantic Web. In: Frasconi, P., Lisi, F.A. (eds.)
ILP 2010. LNCS, vol. 6489, pp. 92–104. Springer, Heidelberg (2011)
16. Jakob, N., Müller, M.-C., Weber, S.H., Gurevych, I.: Beyond the stars: Exploiting
free-text user reviews for improving the accuracy of movie recommendations. In:
TSA 2009 - 1st International CIKM Workshop on Topic-Sentiment Analysis for
Mass Opinion Measurement (2009)
17. Järvelin, K., Kekäläinen, J.: IR evaluation methods for retrieving highly relevant
documents. In: SIGIR 2000 (2000)
18. Kann, M.G.: Advances in translational bioinformatics: computational approaches
for the hunting of disease genes. Briefing in Bioinformatics 11 (2010)
19. Kiefer, C., Bernstein, A., Locher, A.: Adding Data Mining Support to SPARQL
via Statistical Relational Learning Methods. In: Bechhofer, S., Hauswirth, M.,
Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 478–492.
Springer, Heidelberg (2008)
20. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: Proba-
bilistic models for segmenting and labeling sequence data. In: ICML (2001)
21. Lehmann, J.: Dl-learner: Learning concepts in description logics. JMLR (2009)
22. Lisi, F.A., Esposito, F.: An ilp perspective on the semantic web. In: Semantic Web
Applications and perspectives (2005)
23. Maedche, A., Staab, S.: Semi-automatic engineering of ontologies from text. In:
Proceedings of the 12th International Conference on Software Engineering and
Knowledge Engineering (2000)
24. Maedche, A., Staab, S.: Ontology Learning. In: Handbook on Ontologies 2004.
Springer (2004)
25. Mika, P.: Social Networks and the Semantic Web. Springer (2007)
26. Paaß, G., Kindermann, J., Leopold, E.: Learning prototype ontologies by hierarchi-
cal latent semantic analysis. In: Knowledge Discovery and Ontologies (2004)
27. Popescul, A., Ungar, L.H.: Statistical relational learning for link prediction. In:
Workshop on Learning Statistical Models from Relational Data (2003)
28. Rettinger, A., Nickles, M., Tresp, V.: Statistical Relational Learning with Formal
Ontologies. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.)
ECML PKDD 2009. LNCS, vol. 5782, pp. 286–301. Springer, Heidelberg (2009)
29. Sarawagi, S.: Information extraction. Foundations and Trends in Databases 1(3),
261–377 (2008)
30. Sowa, J.F.: Ontology, metadata, and semiotics. In: International Conference on
Computational Science (2000)
31. Tipping, M.E., Bishop, C.M.: Probabilistic principal component analysis. Journal
of the Royal Statistical Society, Series B 61, 611–622 (1999)
32. Tresp, V., Yu, K.: Learning with dependencies between several response variables.
In: Tutorial at ICML 2009 (2009)
33. Vishwanathan, S.V.N., Schraudolph, N., Kondor, R.I., Borgwardt, K.: Graph ker-
nels. Journal of Machine Learning Research - JMLR (2008)
34. Völker, J., Haase, P., Hitzler, P.: Learning expressive ontologies. In: Buitelaar, P.,
Cimiano, P. (eds.) Ontology Learning and Population: Bridging the Gap between
Text and Knowledge. IOS Press (2008)
35. Xu, Z., Kersting, K., Tresp, V.: Multi-relational learning with gaussian processes.
In: Proceedings of the 21st International Joint Conference on Artificial Intelligence,
IJCAI 2009 (2009)
36. Yu, K., Chu, W., Yu, S., Tresp, V., Xu, Z.: Stochastic relational models for discrim-
inative link prediction. In: Advances in Neural Information Processing Systems,
NIPS 2006 (2006)
Automatic Configuration Selection Using Ontology
Matching Task Profiling
Abstract. An ontology matching system can usually be run with different con-
figurations that optimize the system’s effectiveness, namely precision, recall, or
F-measure, depending on the specific ontologies to be aligned. Changing the con-
figuration has potentially high impact on the obtained results. We apply matching
task profiling metrics to automatically optimize the system’s configuration de-
pending on the characteristics of the ontologies to be matched. Using machine
learning techniques, we can automatically determine the optimal configuration in
most cases. Even using a small training set, our system determines the best config-
uration in 94% of the cases. Our approach is evaluated using the AgreementMaker
ontology matching system, which is extensible and configurable.
1 Introduction
Ontology matching is becoming increasingly important as more semantic data, i.e., data
represented with Semantic Web languages such as RDF and OWL, are published and
consumed over the Web especially in the Linked Open Data (LOD) cloud [14]. Auto-
matic ontology matching techniques [10] are increasingly supported by more complex
systems, which use a strategy of combining several matching algorithms or matchers,
each taking into account one or more ontology features. Methods that combine a set of
matchers range from linear combination functions [17] to matcher cascades [7, 10, 25],
and to arbitrary combination strategies modeled as processes where specific matchers
play the role of combiners [4, 16]. The choice of parameters for each matcher and of
the ways in which the matchers can be combined may yield a large set of possible con-
figurations. Given an ontology matching task, that is, a set of ontologies to be matched,
an ontology engineer will painstakingly create and test many of those configurations
manually to find the most effective one as measured in terms of precision, recall, or
F-measure. Therefore, when given a new matching task, ontology engineers would like
to start from their own set of available configurations and automatically determine the
configuration that provides the best results without starting from scratch.
Research partially supported by NSF Awards IIS-0812258 and IIS-1143926 and by the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7061. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, AFRL, or the U.S. Government.
However, an ontology matching system that has been configured to match specific on-
tologies may not produce good results when matching other types of ontologies. For ex-
ample, LOD ontologies vary widely; they can be very small (at the schema level), shallow,
and poorly axiomatized such as GeoNames,1 large with medium depth and medium ax-
iomatization such as in DBpedia,2 or large, deep, and richly axiomatized such as Yago.3
In this paper we propose a machine learning method that takes as input an ontology
matching task (consisting of two ontologies) and a set of configurations and uses match-
ing task profiling to automatically select the configuration that optimizes matching
effectiveness. Our approach is implemented in the AgreementMaker [2] ontology match-
ing system and evaluated against the datasets provided by the 2011 Ontology Alignment
Evaluation Initiative (OAEI)4 . We show that the automatically configured system out-
performs the manually configured system. Furthermore, since the OAEI datasets are
designed to test systems on a variety of ontologies characterized by heterogeneous fea-
tures, they provide a rich and varied testbed. This testbed demonstrates the capability
of our method to match many real-world ontologies and to achieve good effectiveness
on new matching tasks without requiring manual tuning.
Although there are several machine learning approaches for ontology and schema
matching, our approach differs from them in key aspects. Some approaches learn to
classify mappings as correct or incorrect by combining several matchers [8] or similar-
ity functions [13]. Others exploit user feedback to learn optimal settings for a number
of parameters such as the similarity threshold used for mapping selection [25], or the
weights of a linear combination of several matchers and the rate of iterative propaga-
tion of similarity [7]. Our approach learns how to select the optimal configuration for a
matching task without imposing any restriction on the complexity of the adopted con-
figurations, thus allowing for the reuse of fine-tuned configurations tested in previous
ontology matching projects. We also achieve very good results with a limited train-
ing set by focusing on the feature selection problem, identifying fine-grained ontology
profiling metrics, and combining these metrics to profile the ontology matching task.
The rest of this paper is organized as follows: Section 2 defines the problem of con-
figuration selection and describes the configurations of the AgreementMaker system
used in our experiments. Section 3 describes the proposed learning method to automat-
ically select the best configuration and discusses the matching task profiling techniques
adopted. Section 4 describes the experiments carried out to evaluate the approach. Sec-
tion 5 discusses related work. Finally, Section 6 presents our conclusions.
2 Preliminaries
In this section we will explain some preliminary concepts, define the problem, and give
a brief description of our AgreementMaker system and its configurations.
1 http://www.geonames.org
2 http://www.dbpedia.org
3 http://www.mpi-inf.mpg.de/yago-naga/yago/
4 http://oaei.ontologymatching.org/2011/
[Fig. 1. The five AgreementMaker configurations (a)–(e) discussed below: each takes the input ontologies O1 and O2 (and, in (d), a mediating ontology OM), runs syntactic and lexical matchers such as BSM, ASM, VMM, PSM, LEX, and MM in parallel, combines their results with LWC combiners, and refines the combined alignment with a structural matcher such as IISM or GFM.]
Our work began with a general purpose configuration that consisted of syntactic matchers—
the Base Similarity Matcher (BSM), the Parametric String Matcher (PSM), and the
Vector-based Multi-word Matcher (VMM)—running in parallel and combined into a
final alignment by the Linear Weighted Combination (LWC) Matcher [4]. We later
extended our syntactic algorithms with lexicon lookup capabilities (LEX) and added
a refining structural matcher—the Iterative Instance and Structural Matcher (IISM)—
leading to the configuration shown in Figure 1(a). The configuration consists of running
syntactic and lexical matching algorithms, combining their results, and then refining the
combined alignment using a structural algorithm; this is a recurring pattern in all of our
configurations.
Some ontologies may contain labels and local names that require more advanced
string similarity techniques that are needed to perform syntactic matching. For this rea-
son, our second configuration, which is shown in Figure 1(b), features the Advanced
Similarity Matcher (ASM) [5]. This matcher extends BSM to find the similarity be-
tween syntactically complex concept labels.
As our work extended into matching biomedical ontologies [6] we found that the
lexicon lookup capability of our algorithms became very important in those tasks. Such
capability was especially useful because the biomedical ontologies include synonym
and definition annotations for some of the concepts. For these types of ontologies, we
use the configuration shown in Figure 1(c), which aggregates the synonyms and defini-
tions into the lexicon data structure. There is more than one combination step so as to
group similar matching algorithms before producing a final alignment.
An extension of the previous lexical-based configuration is shown in Figure 1(d).
This configuration adds the Mediating Matcher (MM) to aggregate the synonyms and
definitions of a third ontology, called the mediating ontology, into the lexicon data struc-
ture to improve recall. Finally, when matching several ontologies at a time, overall pre-
cision and runtime are more important. For this purpose we use a configuration, shown
in Figure 1(e), which features the combination of two syntactic matchers and is refined
by a structural matching algorithm that ensures precision and runtime efficiency.
Our proposed matching process follows the steps represented in Figure 2. First the
pair of ontologies to be matched is evaluated by the matching task profiling algorithm.
Based on the detected profile, a configuration is selected and the matcher stack is in-
stantiated. Finally, the ontology matching step is performed and an output alignment is
obtained.
Attribute Richness. Attribute Richness (AR) is defined as the average number of at-
tributes (datatype properties) per class and is computed as the number of attributes for
all classes (att) divided by the number of classes [26].
Class Richness. Class Richness (CR) is defined as the ratio of the number of classes
for which instances exist (|Ci |) divided by the total number of classes defined in the
ontology [26].
Label Uniqueness. This metric captures the number of terms whose local name and
label differ so as to determine whether we can find additional information in the term
labels. We define Label Uniqueness (LU) as the percentage of terms that have a label
that differs from their local name (diff ).
Average Depth. This metric describes the average depth (D) of the classes in an on-
tology defined as the mean of the depth over all the classes Ci (D(Ci )).
WordNet Coverage. WordNet is a lexical database for the English language [20].
It groups English words into sets of synonyms called synsets, provides short, general
definitions, and records the various semantic relations between these synsets. WordNet
is generally used in the ontology matching process to get information about the words
contained in the attributes of a term. WordNet Coverage (WC) has been introduced as
a feature that evaluates for each pair of terms whether none, one, or both of them
can be found in WordNet [8]. Unlike previous approaches, in our system we
compute WC as the percentage of terms with a label or local name (id) present in WordNet
(covered). We compute two WordNet Coverage metrics, one for local names and one
for labels.
Null Label and Comment. Two other very important features of the ontology that we
profile in our approach are the percentage of terms with no comment or no label, named
respectively Ncomment or Nlabel . They are defined as the number of terms that have no
comment (|NCcomment |) or no label (|NC label |) divided by the number of terms.
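A few of these metrics can be approximated with rdflib as sketched below. Counting all datatype properties for AR and deriving the local name from the URI fragment are simplifying assumptions, not the exact AgreementMaker implementation.

```python
from rdflib import Graph, RDF, RDFS, OWL

def local_name(term):
    s = str(term)
    return s.rsplit("#", 1)[-1].rsplit("/", 1)[-1]

def profile_ontology(path):
    """Compute a few of the profiling metrics described above for one ontology."""
    g = Graph()
    g.parse(path)  # rdflib guesses the serialization from the file extension
    classes = set(g.subjects(RDF.type, OWL.Class))
    attributes = set(g.subjects(RDF.type, OWL.DatatypeProperty))
    terms = classes | attributes | set(g.subjects(RDF.type, OWL.ObjectProperty))

    def share(pred):
        return sum(1 for t in terms if pred(t)) / len(terms) if terms else 0.0

    return {
        # AR: attributes (datatype properties) per class
        "AR": len(attributes) / len(classes) if classes else 0.0,
        # LU: terms whose label differs from their local name
        "LU": share(lambda t: g.value(t, RDFS.label) is not None
                    and str(g.value(t, RDFS.label)).lower() != local_name(t).lower()),
        # Nlabel / Ncomment: terms with no label / no comment
        "Nlabel": share(lambda t: g.value(t, RDFS.label) is None),
        "Ncomment": share(lambda t: g.value(t, RDFS.comment) is None),
    }
```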
To combine the two values of a metric computed on the two input ontologies, we define
FS as the ratio of the lower value (m_L) to the higher value (m_H)
of the two metric values, divided by the logarithm of their difference. When the two
values are equal, the value of FS is one, while the more they differ, the closer to zero
that value will be.
FS_m = m_L / (m_H · [log(m_H − m_L + 1) + 1]) (1)
The concatenation of A and FS into a function FS-A provides a more expressive mea-
sure. It considers how much a particular aspect is present in each of the two ontologies
as well as how much they share that aspect. Experiments described in Section 4 confirm
this intuition showing that FS-A outperforms all the other functions.
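To make Equation (1) concrete, the following sketch computes FS for a pair of metric values and assembles an FS-A feature vector; the natural logarithm, the dictionary-based profile format, and the function names are assumptions of this illustration rather than details given in the text.

import math

def fs(m1, m2):
    # Eq. (1): ratio of the lower to the higher value, damped by the log of their difference.
    # Assumes the natural logarithm and m_high > 0.
    m_low, m_high = min(m1, m2), max(m1, m2)
    return m_low / (m_high * (math.log(m_high - m_low + 1) + 1))

def fs_a_vector(profile1, profile2):
    # FS-A: the raw metric values of both ontologies (the A part) concatenated
    # with the pairwise FS values (the FS part).
    keys = sorted(profile1)
    a_part = [profile1[k] for k in keys] + [profile2[k] for k in keys]
    fs_part = [fs(profile1[k], profile2[k]) for k in keys]
    return a_part + fs_part

print(fs(0.5, 0.5))            # equal values -> 1.0
print(round(fs(0.9, 0.3), 3))  # diverging values -> 0.227, well below 1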
For each input ontology pair (matching task) we compute the metrics previously de-
scribed and use them at runtime to predict the best configuration. The space of all the
possible ontology pairs to be matched can be divided into a number of subspaces. We
define those subspaces as the sets of ontology pairs sharing the same configuration as
the best configuration for those matching tasks. From a machine learning point of view,
each subspace corresponds to a class to be predicted.
Unlike rule-based approaches, our approach allows us to find correlations
between metrics and matching configurations without explicitly defining them. No
matter how complex a matcher is, its performance on the training set will allow the
learning algorithm to determine its suitability to new matching tasks.
Supervised learning consists of techniques which, based on a set of manually labeled
training examples, create a function modeling the data [19]. Each training example is
composed of a feature vector and a desired output value. When the range of the output
value is a continuous interval, it is called a regression problem. In the discrete case, it
is called classification. The function created by the algorithm should be able to predict
the output value for a new input feature vector. In our problem, the input feature vector
is the result of matching task profiling while the output class is one of the possible
configurations.
Building a robust classifier is not trivial: the main problem is how to generate a
good training set. It should be a highly representative subset of the problem’s data. In
addition, the larger it is, the higher the quality of the classification will be. An important
aspect is the distribution of the classes in the training set. It should reflect the nature of
the data to be classified and contain an adequate number of examples for every class.
In order to generate our training set, the system goes through the steps shown in
Figure 4. For each ontology pair in a matching task, we compute the metrics introduced
in Section 3.1 and store them as data points in the training set. The system is then run with all
the given configurations, each generating an alignment for every matching task. We then
evaluate the precision, recall, and F-measure of each generated alignment and store the
results in an evaluation matrix. For each matching task, the configuration that optimizes
the precision, recall, or F-measure (depending on the users’ needs) is chosen and stored
as the correct class for the corresponding data points in the training set.
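A minimal sketch of this training-set generation loop is given below; the callable parameters (profile, run, evaluate) stand in for the corresponding system components and are not actual AgreementMaker APIs.

def build_training_set(matching_tasks, configurations, profile, run, evaluate,
                       optimize="fmeasure"):
    # profile(task) -> feature vector; run(config, task) -> alignment;
    # evaluate(alignment, reference) -> dict with "precision", "recall", "fmeasure".
    training_set = []
    for task, reference in matching_tasks:
        features = profile(task)
        scores = {c: evaluate(run(c, task), reference)[optimize] for c in configurations}
        best_configuration = max(scores, key=scores.get)   # class label for this data point
        training_set.append((features, best_configuration))
    return training_set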
We have tested this approach using several standard classifiers and different fea-
ture vectors, which are compared in Section 4. The trained model is used by
AgreementMaker to predict the correct configuration before starting a matching process.
The system also takes into account the confidence value returned by the classification
algorithm, which represents the certainty of the classifier’s prediction. If this value is
under a certain threshold the system ignores the classification and chooses the default
configuration (Figure 1(a)).
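A sketch of this confidence-based fallback follows; the 0.6 threshold and the scikit-learn-style classifier interface are assumptions of this illustration, not values or APIs reported in the paper.

def choose_configuration(profile_vector, classifier, default_configuration, threshold=0.6):
    # Keep the predicted configuration only if the (already trained) classifier is
    # confident enough; otherwise fall back to the default configuration of Figure 1(a).
    probabilities = classifier.predict_proba([profile_vector])[0]
    best = probabilities.argmax()
    if probabilities[best] < threshold:
        return default_configuration
    return classifier.classes_[best]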
In this section we describe the experimental results obtained using the approach intro-
duced in Section 3. Our approach has been evaluated using the AgreementMaker system
with the configurations explained in Section 2.2. Our experiments have been run using
233 matching tasks with reference alignments provided in OAEI 2011.
For our evaluation, we use several classifiers in order to gauge their impact on correctly
classifying the matching tasks. The classifiers we use are well-known, each belonging to
a different category: k-NN (Instance-based) [1], Naive Bayes (Probability-based) [12],
Multilayer Perceptron (Neural Network) [11], and C4.5 (Decision Tree) [24].
For our first experiment, which is shown in Table 2, we perform a k-fold cross-
validation with k = {2, 10} using each classifier and comparing the obtained accuracy,
defined as the percentage of correctly classified instances. 10-fold cross-validation is
considered the standard for evaluating a learning algorithm while we also tested with
2-fold cross-validation to gauge the robustness of our approach, since we want as small
a training set as possible. We experimented with all of our combination functions to
generate the matching task profile. As can be seen from the table, a k-NN classifier
(with k = 3) exhibits the best accuracy in all the tests. Furthermore, we can see that
Table 2. Accuracy obtained with k-fold cross-validation (k = 2, 10) for each combination function and classifier

Combination Function | Cross-validation | k-NN  | Naive Bayes | Multilayer | C4.5
ST                   | 10-fold          | 88.1% | 55.0%       | 84.1%      | 85.9%
ST                   | 2-fold           | 86.0% | 57.0%       | 82.9%      | 83.6%
A                    | 10-fold          | 85.7% | 55.3%       | 84.5%      | 86.1%
A                    | 2-fold           | 82.9% | 56.2%       | 82.6%      | 84.9%
FS                   | 10-fold          | 87.6% | 54.9%       | 84.9%      | 88.0%
FS                   | 2-fold           | 85.8% | 55.3%       | 82.6%      | 83.7%
FS-A                 | 10-fold          | 89.9% | 55.7%       | 88.1%      | 88.1%
FS-A                 | 2-fold           | 89.4% | 57.4%       | 83.8%      | 83.8%
the FS-A combination function has a significant impact on the overall results and in
particular on making the approach more robust.
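This cross-validation setup can be reproduced along the following lines; the paper does not state which toolkit was used, so the scikit-learn calls below are only one possible realization of the experiment.

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def cross_validated_accuracy(X, y, folds=10):
    # X: one FS-A profile vector per matching task; y: its best configuration label.
    knn = KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(knn, X, y, cv=folds, scoring="accuracy").mean()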
The graph of Figure 5(a) shows the accuracy that was obtained by varying the train-
ing set size and the classifier used. Each point is computed by averaging 100 runs.
In each run, a random subset of matching tasks is selected for the training set and the
model is evaluated against the remaining part of the dataset. Based on the results above,
we use FS-A as the combination function in this test.
Naive Bayes is the worst performing classifier because the conditional independence
assumption between the selected metrics does not hold. In other words, some of our
metrics are interdependent. Since the dataset used is not big, instance-based classifiers
perform slightly better than others. This is because they are able to learn quickly even
from very small datasets. Other methods require a significant number of examples for
each class to be capable of training robust models, while an instance-based classifier can
make useful predictions using few examples per class. Therefore, this kind of classifier
works well when limited data is available, as in our case. The other two classifiers,
Fig. 5. Comparison of (a) classifiers and of (b) combination functions
Multilayer Perceptron and C4.5, show similar results, each being slightly worse than
k-NN. The more the training set size increases, the more they close the gap with k-NN,
confirming this characteristic of instance-based classifiers.
The results are also compared with a baseline in which the system randomly chooses
among the configurations. Since the number of configu-
rations is five, the probability of selecting the right one is 20%, independently of the
size of the training set. We believe these results are valuable and encouraging,
because even with a small training set our selection is able to outperform random selec-
tion substantially. As an example, when only 5% of the dataset is used for training, our
method is able to choose the best configuration in seven cases out of ten. Moreover, in
the other three cases, the results are only slightly worse than the best configuration. The
two experiments above lead us to choose the k-NN classifier.
In Figure 5(b) we compare the combination functions, which we described in Section
3.2, as a function of the size of the training set. From the graphs in the figure it is clear
that FS-A outperforms the other combination methods. Therefore we use FS-A in our
following experiments.
4.2 OAEI Results
In the previous section, we have shown that our learning system is robust. In this section
we use three tracks of the OAEI 2011 competition, namely Benchmark, Benchmark2,
and Conference, to study the impact of our approach in terms of F-measure.
In order to guarantee an equally distributed and representative training set, we in-
troduce the constraint that it should contain at least some representative instances for
each of the five classes. Using this approach, our experiments were run using only 20%
of the whole dataset as the training set, obtaining 94% accuracy. The remaining 6% of
misclassified examples are further divided: 3.83% for which the algorithm chooses the
second best configuration, and 2.17% for which it selects the third best one. The worst
performing configurations (fourth and fifth) are never chosen. It is worth noting that
even in the case where the algorithm fails to choose the best configuration, the chosen
configuration is acceptable and in some cases is better than the one chosen manually.
Overall, our results were improved by using the automatic selection instead of the
manual selection previously used in our system. This is because we are now able to
choose the best configuration on a task-by-task basis, rather than on a track-by-track
basis. Different configurations are used to match different tasks of the same domain,
while selection performed by an expert is usually domain-specific.
In Table 3 we show the percentage of tasks in which our automatic approach out-
performs the manual selection, for each of the considered tracks (Benchmark, Bench-
mark2, and Conference). We also show the average increase in F-measure for the tasks
that were improved (the remaining ones were already optimal). Our new automatic
selection leads to a significant improvement in all of the tracks, especially in the Con-
ference track where more than half of the matching tasks showed improvement.
While in Table 3 we compare the results obtained over the whole test set, in Table
4 we show our improvements in terms of F-measure on some particularly interesting
sub-tracks. We present an automatic versus manual selection comparison as before, but
also versus an ideal classifier (i.e., a classifier for which the best configuration is always
selected).
Table 4. Average F-measure of the manual, automatic, and ideal selections on selected sub-tracks

              | Benchmark (301-304) | Benchmark2 (221-249) | Conference
Manual (M)    | 83.7%               | 82.4%                | 56.5%
Automatic (A) | 86.7%               | 83.3%                | 61.0%
Ideal (I)     | 87.0%               | 83.6%                | 63.8%
Δ(A − M)      | 3.0%                | 0.9%                 | 4.5%
Δ(I − A)      | 0.3%                | 0.3%                 | 2.8%
We selected for this comparison a subset of the previously mentioned tracks, choos-
ing the ones that we consider the most relevant, because they represent real-world ex-
amples. The sub-tracks are: Benchmark (301-304), Benchmark2 (221-249), and the
Conference track as a whole. The Anatomy track is not included because it is composed
of a single sub-track, which is correctly classified by both the automatic and manual se-
lections. The table shows the average F-measures obtained by the selection approaches
in these sub-tracks. We also show the difference between the automatic and manual
selections (Δ(A − M )) and between the ideal and automatic selections (Δ(I − A)).
In Figure 6 we show the performance obtained by the manual, automatic, and ideal
selections in the specified tasks of the Benchmark and Conference tracks. In most of
these test cases the automatic selection chooses the correct configuration, that is, the
automatically chosen configuration is also the ideal configuration. In some of these
cases the manual selection chooses the correct configuration as well. An interesting
case is provided by confOf-ekaw, where the three different modalities choose three
different configurations. However, even in this case, the automatic selection chooses a
better configuration than the one chosen manually.
5 Related Work
Several approaches have been proposed, whose objective is to improve the performance
of ontology schema matching by automatically setting configuration parameters. An
early approach considers a decision making technique that supports the detection of
suitable matchings based on a questionnaire that is filled out by domain and matching
experts [22]. In the continuation of this work, a rule-based system is used to rank a set
of matchers by their expected performance on a particular matching task [21].
The RiMOM system profiles the ontology matching task by evaluating an overall
lexical and structural similarity degree between the two ontologies [17]. These simi-
larity measures are used to change the linear combination weights associated with a
lexical and a structural matcher. These weights are adaptively set using matching task
profiling; however, only two metrics are used, far fewer than the number
considered in this paper. Furthermore, the configuration cannot be changed.
The learning-based approaches for system configuration fall under two types: (1)
learning to classify correspondences as correct/incorrect [8, 15, 23]; (2) learning opti-
mal parameter values for a system [13, 16, 18]. As compared to these approaches, our
contribution is that we learn to select the best configuration among a set of available
configurations for each matching task. This introduces a new challenge: we must con-
sider features describing the ontologies and their mutual compatibility as a whole and
define ontology metrics for this purpose, both being novel contributions.
We now describe in more detail three of the machine-learning approaches [8, 13, 16].
The first approach learns the combination of several matchers, considered as black
boxes [8]. The system uses a mapping classifier to label mappings as correct or in-
correct based on a mapping profile, which encompasses features of the matchers, lex-
ical and structural features of the concepts to be matched, and very simple ontology
features. Our configuration selection is based exclusively on ontology features and em-
beds several more expressive features. It also has the following two advantages. First,
we avoid executing every configuration of the system because, given a matching task,
only the optimal configuration is selected, leading to significant savings of time and
computational resources. Second, our classifier is trained on a small subset of the over-
all dataset. In particular, where in our experiments the overall dataset is split in a 1:4
ratio between training and evaluation data, the results for [8] are obtained with a ra-
tio of 3:1 in the Benchmark and 4:1 in the Conference datasets. Our datasets are ap-
proximately the same size, but we use a much smaller training set, requiring less user
effort. To the best of our knowledge, the minimization of the training set size has not
been investigated by others. The second approach features a technique to automatically
configure a schema matching system, which has been implemented in the eTuner ap-
plication [16]. This approach is interesting because it is based on the use of synthetic
data obtained by perturbing some features of the schemas to be aligned. However, it
applies to matching relational schemas, not to ontologies. The third approach combines
several similarity metrics by learning an aggregated similarity function [13]. However,
the degree to which the matching system is configured in our approach is significantly
higher.

Fig. 6. Comparison between manual, automatic, and ideal selections for some of the track tasks
Finally, we mention some approaches that adopt active learning techniques in order
to automatically tune the system on the basis of user feedback [7, 9, 25]. However, all
these approaches learn to set system-specific parameters, such as the threshold [25],
the optimal weights in a linear combination function, or the number of iterations of
a similarity propagation matcher [7], while our system allows for configurations of
arbitrary complexity.
References
[1] Aha, D.W., Kibler, D., Albert, M.K.: Instance-based Learning Algorithms. Machine Learn-
ing 6(1), 37–66 (1991)
[2] Cruz, I.F., Palandri Antonelli, F., Stroe, C.: AgreementMaker: Efficient Matching for Large
Real-World Schemas and Ontologies. PVLDB 2(2), 1586–1589 (2009)
[3] Cruz, I.F., Palandri Antonelli, F., Stroe, C.: Integrated Ontology Matching and Evaluation.
In: International Semantic Web Conference, Posters & Demos (2009)
[4] Cruz, I.F., Palandri Antonelli, F., Stroe, C.: Efficient Selection of Mappings and Automatic
Quality-driven Combination of Matching Methods. In: ISWC International Workshop on
Ontology Matching (OM). CEUR Workshop Proceedings, vol. 551, pp. 49–60 (2009)
[5] Cruz, I.F., Stroe, C., Caci, M., Caimi, F., Palmonari, M., Antonelli, F.P., Keles, U.C.: Using
AgreementMaker to Align Ontologies for OAEI 2010. In: ISWC International Workshop
on Ontology Matching (OM). CEUR Workshop Proceedings, vol. 689, pp. 118–125 (2010)
[6] Cruz, I.F., Stroe, C., Pesquita, C., Couto, F., Cross, V.: Biomedical Ontology Matching Us-
ing the AgreementMaker System (Software Demonstration). In: International Conference
on Biomedical Ontology (ICBO). CEUR Workshop Proceedings, vol. 833, pp. 290–291
(2011)
[7] Duan, S., Fokoue, A., Srinivas, K.: One Size Does Not Fit All: Customizing Ontology
Alignment Using User Feedback. In: Patel-Schneider, P.F., Pan, Y., Hitzler, P., Mika, P.,
Zhang, L., Pan, J.Z., Horrocks, I., Glimm, B. (eds.) ISWC 2010, Part I. LNCS, vol. 6496,
pp. 177–192. Springer, Heidelberg (2010)
[8] Eckert, K., Meilicke, C., Stuckenschmidt, H.: Improving Ontology Matching Using Meta-
level Learning. In: Aroyo, L., Traverso, P., Ciravegna, F., Cimiano, P., Heath, T., Hyvönen,
E., Mizoguchi, R., Oren, E., Sabou, M., Simperl, E. (eds.) ESWC 2009. LNCS, vol. 5554,
pp. 158–172. Springer, Heidelberg (2009)
[9] Ehrig, M., Staab, S., Sure, Y.: Bootstrapping Ontology Alignment Methods with APFEL.
In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) ISWC 2005. LNCS, vol. 3729,
pp. 186–200. Springer, Heidelberg (2005)
[10] Euzenat, J., Shvaiko, P.: Ontology Matching. Springer, Heidelberg (2007)
[11] Gardner, M.W., Dorling, S.R.: Artificial Neural Networks (the Multilayer Perceptron)–a
Review of Applications in the Atmospheric Sciences. Atmospheric Environment 32(14-15),
2627–2636 (1998)
[12] John, G.H., Langley, P.: Estimating Continuous Distributions in Bayesian Classifiers. In:
Conference on Uncertainty in Artificial Intelligence (UAI), pp. 338–345 (1995)
[13] Hariri, B.B., Sayyadi, H., Abolhassani, H., Esmaili, K.S.: Combining Ontology Alignment
Metrics Using the Data Mining Techniques. In: ECAI International Workshop on Context
and Ontologies, pp. 65–67 (2006)
[14] Jain, P., Hitzler, P., Sheth, A.P., Verma, K., Yeh, P.Z.: Ontology Alignment for Linked Open
Data. In: Patel-Schneider, P.F., Pan, Y., Hitzler, P., Mika, P., Zhang, L., Pan, J.Z., Horrocks,
I., Glimm, B. (eds.) ISWC 2010, Part I. LNCS, vol. 6496, pp. 402–417. Springer, Heidelberg
(2010)
[15] Köpcke, H., Rahm, E.: Training Selection for Tuning Entity Matching. In: International
Workshop on Quality in Databases and Management of Uncertain Data (QDB/MUD), pp.
3–12 (2008)
[16] Lee, Y., Sayyadian, M., Doan, A., Rosenthal, A.: eTuner: Tuning Schema Matching Soft-
ware Using Synthetic Scenarios. VLDB Journal 16(1), 97–122 (2007)
[17] Li, J., Tang, J., Li, Y., Luo, Q.: RiMOM: A Dynamic Multistrategy Ontology Alignment
Framework. IEEE Transactions on Knowledge and Data Engineering 21(8), 1218–1232
(2009)
[18] Marie, A., Gal, A.: Boosting Schema Matchers. In: Meersman, R., Tari, Z. (eds.) OTM
2008. LNCS, vol. 5331, pp. 283–300. Springer, Heidelberg (2008)
[19] Marsland, S.: Machine Learning: an Algorithmic Perspective. Chapman & Hall/CRC (2009)
[20] Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.: Introduction to WordNet:
an On-line Lexical Database. International Journal of Lexicography 3(4), 235–244 (1990)
[21] Mochol, M., Jentzsch, A.: Towards a Rule-Based Matcher Selection. In: Gangemi, A., Eu-
zenat, J. (eds.) EKAW 2008. LNCS (LNAI), vol. 5268, pp. 109–119. Springer, Heidelberg
(2008)
[22] Mochol, M., Jentzsch, A., Euzenat, J.: Applying an Analytic Method for Matching Ap-
proach Selection. In: ISWC International Workshop on Ontology Matching (OM). CEUR
Workshop Proceedings, vol. 225 (2006)
[23] Ngo, D., Bellahsene, Z., Coletta, R.: YAM++ Results for OAEI 2011. In: ISWC Interna-
tional Workshop on Ontology Matching (OM). CEUR Workshop Proceedings, vol. 814, pp.
228–235 (2011)
[24] Quinlan, R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San
Mateo (1993)
[25] Shi, F., Li, J., Tang, J., Xie, G., Li, H.: Actively Learning Ontology Matching via User Inter-
action. In: Bernstein, A., Karger, D.R., Heath, T., Feigenbaum, L., Maynard, D., Motta, E.,
Thirunarayan, K. (eds.) ISWC 2009. LNCS, vol. 5823, pp. 585–600. Springer, Heidelberg
(2009)
[26] Tartir, S., Budak Arpinar, I., Moore, M., Sheth, A.P., Aleman-Meza, B.: OntoQA: Metric-
Based Ontology Quality Analysis. In: IEEE Workshop on Knowledge Acquisition from Dis-
tributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources, vol. 9,
pp. 45–53 (2005)
TELIX: An RDF-Based Model for Linguistic Annotation
1 Introduction
A linguistic annotation is a descriptive or analytic mark dealing with raw lan-
guage data extracted from texts or any other kind of recording. A large and
heterogeneous set of linguistic features can be involved. Typical linguistic
annotations include part-of-speech tagging, syntactic segmentation, morpholog-
ical analysis, co-reference marks, phonetic segmentation, prosodic phrasing and
discourse structures, among others.
There is an increasing need for vendors to interchange linguistic informa-
tion and annotations, as well as the source documents they refer to, among
different software tools. Text analysis and information acquisition often require
incremental steps with associated intermediate results. Moreover, tools and or-
ganizations make use of shared resources such as thesauri or annotated corpora.
Clearly, appropriate standards that support this open information interchange
are necessary. These standards must provide the means to model and serialize
the information as files.
In [16], the following requirements for a linguistic annotation framework are
identified: expressive adequacy, media independence, semantic adequacy, unifor-
mity, openness, extensibility, human readability, processability and consistency.
We postulate that the RDF framework features these properties, and therefore
constitutes a solid foundation. RDF graphs make use of custom vocabularies de-
fined by ontologies. Therefore, we introduce TELIX, a lightweight ontology that
provides comprehensive coverage of linguistic annotations and builds on previous
resources, such as feature structures and SKOS concept schemes. TELIX takes
advantage of the RDF/OWL expressive power and is compatible with legacy
materials. Moreover, translation from traditional linguistic annotation formats
to RDF is possible, as shown in [6].
This paper is organized as follows. The next section revises the RDF frame-
work and discusses the advantages of representing linguistic annotations as RDF
graphs. The main contribution of this paper is TELIX, an OWL ontology which
is described in Section 3. Details on how to embed linguistic annotations using
the RDFa syntax are given in Section 4. Finally, Section 5 examines previous
initiatives, and conclusions and connections to ongoing, similar proposals are
presented in Section 6.
This section introduces TELIX (Text Encoding and Linguistic Information eX-
change), an OWL vocabulary designed to permit the representation of linguistic
information as RDF graphs. It extends SKOS XL, which allows capturing lexical
entities as RDF resources, and it overcomes SKOS limitations of expressiveness.
TELIX introduces a number of classes and properties to provide natural lan-
guage acquisition and extraction tools with interchangeable, multilingual lexical
resources such as dictionaries or thesauri, as well as representing the outcomes of
text analyses, i.e., annotation content. The reader is invited to read the TELIX
specification [11] where complete details about the ontology and modeling deci-
sions are provided.
The TELIX namespace is http://purl.org/telix/ns#, although for the
sake of brevity, in this paper it is assumed to be the default namespace. The
namespace URL uses HTTP content negotiation to redirect to the OWL spec-
ification or the HTML documentation. The OWL file can be downloaded from
http://purl.org/telix/telix.owl.
Multiple segmentations of the same textual fragment are possible. For exam-
ple, tokens are assumed to be auxiliary entities, defined as contiguous strings of
alphabetic or numeric characters separated by separator characters such
as whitespace. Note that punctuation is included in the resulting list of tokens
in some parsers and discourse analysis tools. Tokens enable tools to provide di-
vergent lexical understandings of a given piece of text. The same bunch of words
can be interpreted differently depending on the focus of the analysis. Consider,
for instance, the string “maraging steel”, composed of two tokens (t1, t2). It can
be seen either as a composition of two single words “[maraging_t1] [steel_t2]” or as
the collocation “[maraging_t1 steel_t2]”, making the whole string a single lexical unit.
These lexical issues are critical when dealing with technical terminology, where
term boundaries are fuzzy and disputed. TELIX does not enforce a concrete
analysis regarding the lexical disambiguation of texts. The concept Token pro-
vides a free-focus word segmentation of the text, over which upper segmentation
layers (such as term identification) can be built.
More refined textual units, such as title, chapter, itemized list, etc., are not
part of TELIX. However, as TELIX is an OWL ontology, it can be extended or
combined with other ontologies to fit the specific requirements of a particular
application.
Fig. 1. This figure shows the links between label occurrences in a text and lexical
entities in the lexicon. In addition, the property sense attaches meanings to both
words and their occurrences. In the depicted example, an occurrence of “alloy” in the
sentence “Bronze, an alloy of copper and tin, was the first alloy discovered” (a
LabelOccurrence) realizes the skosxl:Label ex:label-alloy with literal form "alloy"@en,
and both are linked via sense to dbpedia:alloy.
– EI : PI → ℘(RI × RI ), where ℘(X) denotes the power set of the set X. The
function EI defines the extension of a property as a set of pairs of resources.
– SI : (V ∩ U ) → (RI ∪ PI ) defines the interpretation of URI references.
– LI : (V ∩ L) → (L ∪ RI ) defines the interpretation of literals.
(a) AVM representation of the feature structure for “alloy”:
    [ word | POS: noun | AGR: [ GEND: masc, NUM: sg ] ]

(b) RDF-based TELIX representation:

[ rdf:type telix:LabelOccurrence ;
  telix:value "alloy"@en ;
  telix:pos telix:NNS ;
  telix:agreement [ telix:number telix:Singular ;
                    telix:gender telix:Masculine ] ] .
original feature structure and the resultant RDF graph, where each pair of
nodes of G, connected by a grammatical feature, is translated to an RDF
triple.
– π5 (ψ) = CEI , where CEI is the class extension function of I, defined
by: CEI (c) = {a ∈ RI : (a, c) ∈ EI (I(type))} in G. Therefore, for each
ni ∈ ψ(s), the application of π5 returns (I(π1 (ni )), I(π3 (s))) ∈ EI (I(type)) =
(I(ηi ), I(c)) ∈ EI (I(type)) in G. This mapping retains the species names
interpretation of G (types of nodes) in the RDF graph: ψ : S → V.
TELIX introduces a core set of concepts and properties that represent S and
F respectively. This vocabulary permits the application of mappings π1 , π2 and
π3 to a given linguistic theory in order to express feature structures using the
RDF data model. Part of these grammatical features refer to morpho-syntactic
information, such as number (telix:number), person (telix:person), gender
(telix:gender) and tense (telix:tense) for agreement, or part-of-speech in-
formation (telix:pos). Furthermore, TELIX provides collections of values over
which these properties range. Some of these collections are based on existing
linguistic classifications. This is the case of part-of-speech tags, adapted from
the list used in the Penn Treebank Project (which can be extended to deal with
other languages). The purpose is to facilitate the exchange and integration of
linguistic information by reusing resources widely-adopted by the community.
As an example, Figure 2 illustrates the outcome produced by the application
of the π mappings to a feature structure. The left side of the figure is the feature
structure, represented here with the graphical Attribute-Value Matrix notation,
which captures the grammatical analysis of the word “alloy”. The right side shows
the resulting RDF graph (in N3 syntax).
TELIX also covers other aspects of text analysis, such as syntactic and dis-
course structures. RDF translations are provided for both constituent parse trees
(as partially illustrated in Figure 4) and dependency graphs. Furthermore, dis-
cursive entities are defined. With regard to referring expressions, TELIX in-
troduces the properties correfers, antecedent and anaphora to express different
coreference nuances. Rhetorical relations are also supplied to represent the un-
derlying structure at the discourse level of a given text.
It is worth mentioning that although TELIX provides machinery to represent
feature structures as RDF graphs, it does not cover complex constraints or fea-
ture structure operations (such as unification). In other words, TELIX permits
the representation of feature structures, but not reasoning over them.
NP → DT JJ NN   (constituent tree for the noun phrase “the first alloy”; its RDF-based
TELIX representation follows)
[ rdf:type telix:NounPhrase ;
telix:childNode [ rdf:type telix:DT ;
telix:childNode [ rdf:type telix:LabelOccurrence ;
telix:value "the"@en ] ] ;
telix:childNode [ rdf:type telix:JJ ;
telix:childNode [ rdf:type telix:LabelOccurrence ;
telix:value "first"@en ] ] ;
telix:childNode [ rdf:type telix:NN ;
telix:childNode [ rdf:type telix:LabelOccurrence ;
telix:value "alloy"@en ;
telix:realizes ex:label-alloy ;
telix:sense dbpedia:alloy ;
telix:agreement [ telix:number telix:Singular ;
telix:gender telix:Masculine ] ] ] .
Fig. 5. Document body annotated with RDFa attributes. Namespaces are omitted
template. Then, additional <p>, <div> and <span> tags must be introduced as
required until the document mark-up structure matches the text segmentation.
At this point, the document looks like the example in Figure 5. Note that sen-
tences are delimited by <span> tags nested inside the <p> tags. Tag nesting
captures multiple levels of structure (sections, paragraphs, sentences, parts of
sentences, words. . . ). Due to the tree-based model of XML documents, it is not
possible to build structures that overlap without one being contained within the
other. However, the RDFa document can be combined with other RDF docu-
ments sharing the same URIs.
Note that tags make sentence segmentation explicit in the document. There-
fore, it is no longer necessary that the producer and the consumer of the doc-
ument implement the same segmentation algorithm in order to unequivocally
agree on the scope of each sentence. As the boundaries of each sentence are
explicitly marked in the document, and the number of sentences is unambigu-
ous, location-based references have a solid ground to build on. TELIX supports
location-based references as a fallback option to be backward compatible with
legacy tools.
5 Previous Work
TELIX builds on the experience of a chain of proposed languages, ontologies
and frameworks that have previously addressed the effective exchange of textual
resources in order to facilitate automated processing.
Most notably, TEI (Text Encoding Initiative) [1] is an XML-based encoding
scheme that is specifically designed for facilitating the interchange of data among
research groups using different programs or application software. TEI provides
an exhaustive analysis about how to encode the structure of textual sources, fea-
ture structures or graphics and tables in XML format. Although TEI defines a
very detailed specification of linguistic annotations, its XML syntax does not fa-
cilitate the integration of heterogeneous layers of annotations. Since most of the
linguistic workflows (UIMA, Gate, etc.) rely on multiple modules covering dif-
ferent layers of annotations, an RDF-based format to represent the annotations,
such as TELIX, is more suitable to be used by these systems. More concretely,
TELIX offers some advantages over TEI, derived from the more flexible nature
of RDF graphs with respect to XML trees, permitting the description of several
layers of annotations linked to the source document.
GrAF [17] is another graph-based format for linguistic annotation encoding,
although it does not rely on RDF but on an ad hoc XML syntax. Since our proposal
is based on RDF, it elegantly solves the graph merging problem. Moreover,
GrAF annotations can be translated into RDF [6], thus existing GrAF anno-
tations can easily be translated into TELIX. Furthermore, another advantage
of using the RDF framework is the availability of a standard query language,
namely SPARQL. Both GrAF and TELIX are motivated by LAF (Linguistic
Annotation Framework [16]), which identifies the requirements and defines the
main decision principles for a common annotation format. TELIX supports in-
tegrated multilayered annotations and enables multiple annotations to the same
text fragment. However, although TELIX includes support for stand-off anno-
tations (based on offsets), it discourages them. Instead TELIX proposes a com-
bination of URI identifiers and RDFa annotations in mark-up documents.
LMF (Lexical Markup Framework, ISO 24613:2008) is a model of lexical re-
sources. It is suitable for the levels of annotations that are attached to a lexical
entry, but not for syntactic annotations in the case of non-lexicalized grammars.
Being an XML format, it lacks the advantages of RDF discussed in this paper.
Other ontologies have been proposed to represent linguistic information. The
most noteworthy one is GOLD [12], which is specified in OWL and provides
a vocabulary to represent natural languages. GOLD is designed as a refined
extension of the SUMO ontology. TELIX and GOLD have some resemblances,
although they diverge in their goals: TELIX is more annotation-oriented, while
GOLD aims to provide the means to describe natural languages formally.
The OLiA ontologies [7] provide OWL vocabularies to describe diverse lin-
guistic phenomena, from terminology and grammatical information to discourse
structures. TELIX and OLiA take different approaches to similar goals, in par-
ticular regarding constituent-based syntactic trees. TELIX also contributes a
formal foundation to translate feature structures into RDF.
The Lemon model [19] proposes its own vocabulary to model words and senses
in RDF. However, TELIX prefers to take advantage of (and extend) the SKOS
framework for modeling lexical entities, as discussed in Section 3.2. Regarding
WordNet [23], there is some overlap with TELIX in the treatment of lexical
entities. Nevertheless, they are potentially complementary, e.g., WordNet
synsets can be used as values of TELIX’s sense property.
6 Conclusions
This paper proposes the use of the RDF framework in combination with an on-
tology (TELIX) for linguistic annotation. Despite the considerable body of pre-
vious and current proposals with similar goals, the authors believe that TELIX
sits in a previously unoccupied space because of its comprehensiveness and its
orientation toward information exchange on the Web of Data. A comprehensive
evaluation of TELIX with respect to the related works is planned for the coming
months. Among the works that are concurrently being developed and that are
closely tied to TELIX, POWLA [8] is a recent proposal of an OWL/DL formal-
ization to represent linguistic corpora based on the abstract model PAULA [10],
a complete and complex XML model of linguistic annotations. Another ongoing
initiative is NIF [15], also based on OLiA. NIF and TELIX coincide in the use
of URIs to univocally identify text fragments, enabling the handling of multiply-
anchored annotations over them. The main difference between NIF and TELIX
is that the latter offers a corpus level that is not provided by the former.
Acknowledgment. The work described in this paper has been partially sup-
ported by the European Commission under ONTORULE Project (FP7-ICT-
2008-3, project reference 231875).
References
1. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Technical re-
port, TEI Consortium (2012), http://www.tei-c.org/Guidelines/P5/
2. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia:
A Nucleus for a Web of Open Data. In: Aberer, K., Choi, K.-S., Noy, N., Allemang,
D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R.,
Schreiber, G., Cudré-Mauroux, P. (eds.) ISWC/ASWC 2007. LNCS, vol. 4825, pp.
722–735. Springer, Heidelberg (2007)
3. Bechhofer, S., Miles, A.: SKOS Simple Knowledge Organization System Reference.
W3C recommendation, W3C (August 2009),
http://www.w3.org/TR/2009/REC-skos-reference-20090818/
4. Birbeck, M., Adida, B.: RDFa primer. W3C note, W3C (October 2008),
http://www.w3.org/TR/2008/NOTE-xhtml-rdfa-primer-20081014/
5. Carroll, J.J., Bizer, C., Hayes, P., Stickler, P.: Named graphs, Provenance and
Trust. In: WWW 2005: Proceedings of the 14th International Conference on World
Wide Web, pp. 613–622. ACM, New York (2005)
6. Cassidy, S.: An RDF realisation of LAF in the DADA annotation server. In: Pro-
ceedings of ISA-5, Hong Kong (2010)
7. Chiarcos, C.: An Ontology of Linguistic Annotations. LDV Forum 23(1), 1–16
(2008)
8. Chiarcos, C.: POWLA: Modeling Linguistic Corpora in OWL/DL. In: Simperl,
E., et al. (eds.) ESWC 2012. LNCS, vol. 7295, pp. 225–239. Springer, Heidelberg
(2012)
9. Derdek, S., El Ghali, A.: Une chaîne UIMA pour l’analyse de documents de régle-
mentation. In: Proceeding of SOS 2011, Brest, France (2011)
10. Dipper, S.: XML-based stand-off representation and exploitation of multi-level lin-
guistic annotation. In: Proceedings of Berliner XML Tage 2005 (BXML 2005), pp.
39–50 (2005)
11. Lévy, F. (ed.): D1.4 Interactive ontology and policy acquisition tools. Technical
report, Ontorule project (2011), http://ontorule-project.eu/
12. Farrar, S., Langendoen, T.: A Linguistic Ontology for the Semantic Web. GLOT
International 7, 95–100 (2003)
13. Hayes, P.: RDF semantics. W3C recommendation. W3C (February 2004),
http://www.w3.org/TR/2004/REC-rdf-mt-20040210/
14. Heath, T., Bizer, C.: Linked Data: Evolving the Web into a Global Data Space,
1st edn. Morgan & Claypool (2011)
15. Hellmann, S.: NLP Interchange Format (NIF) 1.0 specification,
http://nlp2rdf.org/nif-1-0
16. Ide, N., Romary, L.: International Standard for a Linguistic Annotation Frame-
work. Journal of Natural Language Engineering 10 (2004)
17. Ide, N., Suderman, K.: GrAF: a graph-based format for linguistic annotations. In:
Proceedings of the Linguistic Annotation Workshop, LAW 2007, Stroudsburg, PA,
USA, pp. 1–8. Association for Computational Linguistics (2007)
18. King, P.J.: An Expanded Logical Formalism for Head-Driven Phrase Structure
Grammar. Arbeitspapiere des SFB 340 (1994)
19. McCrae, J., Spohr, D., Cimiano, P.: Linking Lexical Resources and Ontologies
on the Semantic Web with Lemon. In: Antoniou, G., Grobelnik, M., Simperl, E.,
Parsia, B., Plexousakis, D., De Leenheer, P., Pan, J. (eds.) ESWC 2011, Part I.
LNCS, vol. 6643, pp. 245–259. Springer, Heidelberg (2011)
20. Miles, A., Bechhofer, S.: SKOS Simple Knowledge Organization System eX-
tension for Labels (SKOS-XL). W3C recommendation, W3C (August 2009),
http://www.w3.org/TR/2009/REC-skos-reference-20090818/skos-xl.html
21. Pollard, C.: Lectures on the Foundations of HPSG. Technical report, Unpublished
manuscript: Ohio State University (1997),
http://www-csli.stanford.edu/~sag/L221a/cp-lec-notes.pdf
22. Sag, I.A., Baldwin, T., Bond, F., Copestake, A., Flickinger, D.: Multiword Expres-
sions: A Pain in the Neck for NLP. In: Gelbukh, A. (ed.) CICLing 2002. LNCS,
vol. 2276, pp. 1–15. Springer, Heidelberg (2002)
23. Schreiber, G., van Assem, M., Gangemi, A.: RDF/OWL representation of Word-
Net. W3C working draft, W3C (June 2006),
http://www.w3.org/TR/2006/WD-wordnet-rdf-20060619/
24. Seaborne, A., Harris, S.: SPARQL 1.1 query. W3C working draft, W3C (October
2009), http://www.w3.org/TR/2009/WD-sparql11-query-20091022/
25. Tobin, R., Cowan, J.: XML information set, W3C recommendation, W3C, 2nd edn.
(February 2004), http://www.w3.org/TR/2004/REC-xml-infoset-20040204
LODifier: Generating Linked Data from Unstructured Text
Abstract. The automated extraction of information from text and its transforma-
tion into a formal description is an important goal in both Semantic Web research
and computational linguistics. The extracted information can be used for a va-
riety of tasks such as ontology generation, question answering and information
retrieval. LODifier is an approach that combines deep semantic analysis with
named entity recognition, word sense disambiguation and controlled Semantic
Web vocabularies in order to extract named entities and relations between them
from text and to convert them into an RDF representation which is linked to DB-
pedia and WordNet. We present the architecture of our tool and discuss design
decisions made. An evaluation of the tool on a story link detection task gives
clear evidence of its practical potential.
1 Introduction
The term Linked Data (LD) stands for a new paradigm of representing information
on the Web in a way that enables the global integration of data and information in
order to achieve unprecedented search and querying capabilities. This represents an
important step towards the realization of the Semantic Web vision. At the core of the
LD methodology is a set of principles and best practices describing how to publish
structured information on the Web. In recent years these recommendations have been
adopted by an increasing number of data providers ranging from public institutions to
commercial entities, thereby creating a distributed yet interlinked global information
repository.
The formalism underlying this “Web of Linked Data” is the Resource Description
Framework (RDF) which encodes structured information as a directed labelled graph.
Hence, in order to publish information as Linked Data, an appropriate graph-based
representation of it has to be defined and created. While this task is of minor difficulty
and can be easily automatized if the original information is already structured (as, e.g.,
in databases), the creation of an adequate RDF representation for unstructured sources,
particularly textual input, constitutes a challenging task and has not yet been solved to
a satisfactory degree.
Most current approaches [7,19,16,6] that deal with the creation of RDF from plain
text fall into the categories of relation extraction or ontology learning. Typically, these
approaches process textual input very selectively, that is, they scan the text for linguistic
[Fig. 1. The LODifier architecture: unstructured text is tokenized; named entities are
recognized with Wikifier; the text is parsed with C&C and lemmatized; the outputs are
combined into LOD-linked RDF.]
patterns that realize a small number of pre-specified types of information (e.g., is-CEO-
of relations). This strategy is oriented toward a high precision of the extracted structured
information and certainly adequate if the result of the extraction process is meant to be
used for accumulation of large sets of factual knowledge of a predefined form.
In contrast to these approaches, we propose a strategy which aims at translating the
textual input in its entirety into a structural RDF representation. We aim at open-domain
scenarios in which no a-priori schema for the information to be extracted is available.
Applying our method to a document similarity task we demonstrate that it is indeed
both practical and beneficial to retain the richness of the full text as long as possible.
Our system, LODifier, employs robust techniques from natural language processing
(NLP) including named entity recognition (NER), word sense disambiguation (WSD)
and deep semantic analysis. The RDF output is embedded in the Linked Open Data
(LOD) cloud by using vocabulary from DBpedia and WordNet 3.0.
Plan of the Paper. Section 2 begins by sketching the architecture of the system. Sec-
tion 3 presents an evaluation of LODifier on a document similarity task. After dis-
cussing related work in Section 4, we conclude in Section 5.
2 The System
This section describes the resources and algorithms used to build LODifier. Figure 1
shows the architecture of the system. After tokenization, mentions of entities in the
input text are recognized using the NER system Wikifier [18] and mapped onto DBpedia
URIs. Relations between these entities are detected using the statistical parser C&C and
the semantics construction toolkit Boxer [8], which generates discourse representation
structures (DRSs) [14]. Thereafter, the text is lemmatized and words are disambiguated
with the WSD tool UKB [1] to get WordNet mappings. The RDF graph is then created
by further processing the Boxer DRS output, transforming it into triples. Finally, it is
enriched with the DBpedia URIs (to link its entities to the LOD cloud) and the WordNet
sense URIs (to do the same for the relations). The following subsections provide details
on the individual processing steps.
The first step is to identify mentioned individuals. They are recognized using the NER
system Wikifier [18] that enriches English plain text with Wikipedia links. If Wiki-
fier finds a named entity, it is substituted by the name of the corresponding English
Wikipedia page. Applied to the test sentence
The New York Times reported that John McCarthy died. He invented the pro-
gramming language LISP.

Wikifier produces the following output:
[[The New York Times]] reported that [[John McCarthy (computer scientist)|
John McCarthy]] died. He invented the [[Programming language|programming
language]] [[Lisp (programming language)|Lisp]].
The next step is to generate DBpedia URIs out of the Wikifier output and link those
DBpedia URIs to previously introduced Boxer classes.
DBpedia [4] is a large, freely available domain-independent multilingual ontology
extracted from Wikipedia, comprising Wikipedia page names, infobox templates, cate-
gorization information, images, geo-coordinates and links to external webpages. DBpe-
dia contains links to various data sets including FOAF, Geonames and WordNet.
We exploit the fact that every Wikipedia page has a corresponding DBpedia page,
which allows for a straightforward conversion of Wikipedia URLs to DBpedia URIs.
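A minimal sketch of that conversion is given below; the character-escaping details of DBpedia resource URIs are simplified in this illustration.

from urllib.parse import quote

def wikipedia_title_to_dbpedia_uri(title):
    # DBpedia resource URIs mirror Wikipedia page titles, with spaces as underscores.
    return "http://dbpedia.org/resource/" + quote(title.replace(" ", "_"), safe="_(),'")

print(wikipedia_title_to_dbpedia_uri("John McCarthy (computer scientist)"))
# -> http://dbpedia.org/resource/John_McCarthy_(computer_scientist)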
ccg(1,
rp(s:dcl,
ba(s:dcl,
lx(np, n,
t(n, ’The_New_York_Times’, ’The_New_York_Times’, ’NNS’, ’I-NP’, ’O’)),
fa(s:dcl\np,
t((s:dcl\np)/s:em, ’reported’, ’report’, ’VBD’, ’I-VP’, ’O’),
fa(s:em,
t(s:em/s:dcl, ’that’, ’that’, ’IN’, ’I-SBAR’, ’O’),
ba(s:dcl,
lx(np, n,
t(n, ’John_McCarthy’, ’John_McCarthy’, ’NNP’, ’I-NP’, ’I-PER’)),
t(s:dcl\np, ’died’, ’die’, ’VBD’, ’I-VP’, ’O’))))),
t(period, ’.’, ’.’, ’.’, ’O’, ’O’))).
ccg(2,
rp(s:dcl,
ba(s:dcl,
t(np, ’He’, ’he’, ’PRP’, ’I-NP’, ’O’),
fa(s:dcl\np,
t((s:dcl\np)/np, ’invented’, ’invent’, ’VBD’, ’I-VP’, ’O’),
fa(np:nb,
t(np:nb/n, ’the’, ’the’, ’DT’, ’I-NP’, ’O’),
fa(n,
t(n/n, ’programming_language’, ’programming_language’, ’NN’, ’I-NP’, ’O’),
t(n, ’LISP’, ’LISP’, ’NNP’, ’I-NP’, ’O’))))),
t(period, ’.’, ’.’, ’.’, ’O’, ’O’))).
created by extracting properties from infoboxes and templates within Wikipedia articles.
The Raw Infobox Property Definition Set consists of a URI definition for each property
as well as a label. However, this property set turned out to be much too restricted to
cover all the relations identified by Boxer.
Therefore, we decided to map Boxer relations onto RDF WordNet class types in-
stead.
WordNet [10] is a large-scale lexical database for English. Its current version contains
more than 155,000 words (nouns, verbs, adjectives and adverbs), grouped into sets of
synonyms, which are called synsets. Ambiguous words belong to several synsets (one per
word sense). The synsets are linked to other synsets by conceptual relations. Synsets
contain glosses (short definitions) and short example sentences. RDF WordNet [3] is
a Linked Data version of WordNet. For each word it provides one URI for each word
sense. To map instances of words onto URIs the words have to be disambiguated.
For word sense disambiguation (WSD), we apply UKB [1], an unsupervised graph-
based WSD tool, to all our input words, but focus on the results for words which have
given rise to relations in the Boxer output. We use the disambiguated RDF WordNet
URIs as the Linked Data hooks for these relations.
______________________________ _______________________
| x0 x1 x2 x3 | | x4 x5 x6 |
|..............................| |.......................|
(| male(x0) |+| event(x4) |)
| named(x0,john_mccarthy,per) | | invent(x4) |
| programming_language(x1) | | agent(x4,x0) |
| nn(x1,x2) | | patient(x4,x2) |
| named(x2,lisp,nam) | | event(x5) |
| named(x3,new_york_times,org) | | report(x5) |
|______________________________| | agent(x5,x3) |
| theme(x5,x6) |
| proposition(x6) |
| ______________ |
| | x7 | |
| x6:|..............| |
| | event(x7) | |
| | die(x7) | |
| | agent(x7,x0) | |
| |______________| |
|_______________________|
– For each discourse referent, a blank node (bnode) is introduced. If it has been rec-
ognized as a NE by Boxer, we assign a URI from the ne: namespace to it via the
property drsclass:named. If a corresponding DBpedia URI could be identified via
Wikifier, we link the blank node to that DBpedia URI via owl:sameAs.
– The assignment of a Boxer class (that is, a unary predicate) to a discourse referent is
expressed by an RDF typing statement which associates a class URI to the discourse
referent’s bnode. For closed-class relations (like event), the class URI comes from
the predefined vocabulary (using the namespace drsclass:), for relations from
the open class we use the appropriate word sense URI extracted from WordNet via
UKB (in the namespace wn30:) or create a URL (in the namespace class:).
– A closed-class binary relation between two discourse referents (e.g., agent) is
expressed by an “ordinary” RDF triple with the referents’ bnodes as subject and
object, and using the corresponding URI from the closed-class Boxer vocabulary
namespace drsrel:. For open-class relations, the namespace rel: is used instead.
– Finally, we may encounter embedded DRSs, possibly related by complex con-
ditions expressing logical (disjunction, implication, negation) or modal (necessity,
possibility) operators.
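A minimal sketch of the first three rules using rdflib follows; the namespace URIs, the input format, and the omission of nested DRSs, WordNet sense URIs, and open-class URLs are simplifications of this illustration, not LODifier's actual implementation.

from rdflib import Graph, BNode, Namespace, URIRef
from rdflib.namespace import OWL, RDF

DRSCLASS = Namespace("http://example.org/drsclass/")   # placeholder namespace URIs
DRSREL = Namespace("http://example.org/drsrel/")

def simple_drs_to_rdf(referents, unary_conditions, binary_conditions, dbpedia_links):
    # referents: ids such as "x0"; unary_conditions: (predicate, referent) pairs;
    # binary_conditions: (relation, referent1, referent2) triples;
    # dbpedia_links: referent -> DBpedia URI found by Wikifier.
    g = Graph()
    bnodes = {r: BNode() for r in referents}                   # one bnode per discourse referent
    for ref, uri in dbpedia_links.items():
        g.add((bnodes[ref], OWL.sameAs, URIRef(uri)))          # hook into the LOD cloud
    for predicate, ref in unary_conditions:
        g.add((bnodes[ref], RDF.type, DRSCLASS[predicate]))    # unary predicate -> typing triple
    for relation, ref1, ref2 in binary_conditions:
        g.add((bnodes[ref1], DRSREL[relation], bnodes[ref2]))  # binary relation -> plain triple
    return g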
The result of applying this strategy to our example text is shown in Figure 4.
2 To obtain output that adheres to the current W3C RDF specification and is entirely supported
by standard-compliant tools, we refrain from using named graphs or quads to encode nested
DRS.
Thus, document pairs are predicted to describe the same topic exactly if they have a
similarity of θ or more.
The parameter θ is usually determined with supervised learning. We randomly split
our dataset k times (we use k=1000) into equally-sized training and testing sets. For
each split, we compute an optimal decision boundary θ̂ as the threshold which predicts
the training set as well as possible. More precisely, we choose θ̂ so that its distance
to wrongly classified training document pairs is minimized. Formally, θ̂ is defined as
follows: Let postrain and negtrain be the positive and negative partitions of the training
set respectively. Then:
\hat{\theta} = \arg\min_{\theta} \left[ \sum_{dp \in pos_{train}} \min\left(0,\, sim(dp) - \theta\right)^2 + \sum_{dp \in neg_{train}} \min\left(0,\, \theta - sim(dp)\right)^2 \right]
We can then compute the accuracy of θ̂ on the current split’s test set, consisting of the
positive and negative partitions postest and negtest .
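A minimal sketch of this threshold search and of the resulting test-set accuracy is given below; the explicit grid of candidate thresholds and the helper names are assumptions of this illustration, not the authors' implementation. The pos_/neg_ arguments are lists of similarity values sim(dp).

def best_threshold(pos_train, neg_train, candidates):
    # Choose the candidate theta minimizing the squared distance to wrongly
    # classified training pairs, following the definition of theta-hat above.
    def loss(theta):
        return (sum(min(0.0, s - theta) ** 2 for s in pos_train)
                + sum(min(0.0, theta - s) ** 2 for s in neg_train))
    return min(candidates, key=loss)

def accuracy(theta, pos_test, neg_test):
    # Fraction of test pairs classified correctly: positives need sim >= theta,
    # negatives need sim < theta.
    correct = (sum(1 for s in pos_test if s >= theta)
               + sum(1 for s in neg_test if s < theta))
    return correct / (len(pos_test) + len(neg_test))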
We write Ck (G) for the set of endpoint pairs of all paths of length ≤ k in G, that is, the set of salient relations in G.3 Furthermore,
we write Rel(G) for the set of relevant nodes in an RDF graph G. As motivated in Sec-
tion 3.2, not all URIs are equally relevant and we experiment with different choices. We
can now define a family of similarity measures called path relevance overlap similarity
(proSim):
proSim_{k,Rel,f}(G_1, G_2) =
  \frac{\sum_{a,b \in Rel(G_1),\; (a,b) \in C_k(G_1) \cap C_k(G_2)} f(\ell(a,b))}
       {\sum_{a,b \in Rel(G_1),\; (a,b) \in C_k(G_1)} f(\ell(a,b))}

where ℓ(a, b) denotes the length of the (shortest) path connecting a and b.
In words, the denominator of proSim determines the set of relevant semantic relations
Rel in G1 – modelled as the set of pairs of relevant URIs that are linked by a path of
length ≤ k – and quantifies them as the sum over a function applied to their path lengths.
The numerator does the same for the intersection of the relevant nodes from G1 and G2 .
We experiment with three instantiations for the function f . The first one, proSimcnt,
uses f(ℓ) = 1, that is, it just counts the number of paths irrespective of their length. The
second one, proSimlen, uses f(ℓ) = 1/ℓ, giving less weight to longer paths. The third one,
proSimsqlen, uses f(ℓ) = 1/√ℓ, discounting long paths less aggressively than proSimlen .
All measures of the proSim family have the range [0;1], where 0 indicates no overlap
and 1 perfect overlap. It is deliberately asymmetric: the overlap is determined relative to
the paths of G1 . This reflects our intuitions about the task at hand. For a document to be
similar to a seed story, it needs to subsume the seed story but can provide additional, new
information on the topic. Thus, the similarity should be maximal whenever G1 ⊆ G2 ,
which holds for proSim.
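For illustration, proSim can be computed from precomputed shortest-path tables as sketched below; the dictionary-based input format is an assumption, and path lengths are taken from G1, as in the formula above.

import math

def pro_sim(paths_g1, paths_g2, relevant_g1, f):
    # paths_g1 / paths_g2: dict mapping a node pair (a, b) to the length of the shortest
    # path (of length <= k) between a and b; pairs without such a path are absent.
    # relevant_g1: the set Rel(G1) of relevant URIs; f: weighting function over path lengths.
    denominator = sum(f(length) for (a, b), length in paths_g1.items()
                      if a in relevant_g1 and b in relevant_g1)
    numerator = sum(f(length) for (a, b), length in paths_g1.items()
                    if a in relevant_g1 and b in relevant_g1 and (a, b) in paths_g2)
    return numerator / denominator if denominator else 0.0

# Weighting functions for the three variants discussed above:
def f_cnt(length): return 1.0                        # proSim_cnt
def f_len(length): return 1.0 / length               # proSim_len
def f_sqlen(length): return 1.0 / math.sqrt(length)  # proSim_sqlen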
3.4 Results
As described in Section 3.2, we experiment with several variants for defining the set of
relevant URIs (Variants 1 to 3, both normal and extended). These conditions apply to
all bag-of-URI and proSim models.
The results of our evaluation are shown in Table 1. The upper half shows results for
similarity measures without structural knowledge. At 63%, the bag-of-words baseline
clearly outperforms the random baseline (50%). It is in turn outperformed by almost
all bag-of-URI baselines, which yield accuracies of up to 76.4%. Regarding parameter
choice, we see the best results for Variant 3, the most inclusive definition of relevant
URIs (cf. Section 3.2). The URI baseline also gains substantially from the extended
setting, which takes further Linked Open Data relations into account.
Moving over to the structural measures, proSim, we see that all parametrizations of
proSim perform consistently above the baselines. Regarding parameters, we again see a
consistent improvement for Variant 3 over Variants 1 and 2. In contrast, the performance
is relatively robust with respect to the path length cutoff k or the inclusion of further
Linked Open Data (extended setting).
3 Shortest paths can be computed efficiently using Dijkstra’s algorithm [9]. We exclude paths
across “typing” relations such as event which would establish short paths between every pair
of event nodes (cf. Figure 3) and drown out meaningful paths.
[Figure: comparison of the bag-of-words model and the bag-of-URI model (Variant 3,
extended); recall axis.]
4 Related Work
There are various approaches to extracting relationships from text. These approaches
usually include the annotation of text with named entities and relations and the extrac-
tion of those relations. Two approaches that are very similar to LODifier are those of [5]
and [19]. They both use NER, POS-tagging and parsing to discover named entities and rela-
tions between them. The resulting relations are converted to RDF. The disadvantage of
these methods, however, is that they rely on labelled data as a basis for extracting relations,
which is not flexible, as labelled data requires manual annotation.
In terms of the pursued goal, that is, processing natural language, converting the result into RDF and possibly publishing it as linked (open) data, LODifier shares its underlying motivation with the NLP2RDF framework.4 The latter provides a generic and flexible framework for representing any kind of NLP processing result in RDF. While the current version of LODifier is a stand-alone tool not resting on this framework, a future integration might further improve its interoperability and reusability.
The Pythia system [20], which targets natural language question answering over information systems, also employs deep semantic analysis of the posed questions in order to translate them into SPARQL queries, which are then posed against RDF stores. Pythia does, however, presume the existence of a lexicon specifying how lexical expressions are to be mapped to RDF entities of the queried data source. The approach is thereby inherently domain-specific, whereas we aim at an open-domain setting where no a-priori lexical mappings or specific schematic information are available.
The AsKNet system [12] aims at automatically creating a semantic network. Its processing strategy is similar to ours: the system uses C&C and Boxer to extract semantic relations. To decide which nodes refer to the same entities, similarity scores are computed based on spreading activation, and nodes are then merged. An approach building on AsKNet is presented in [22], which uses AsKNet to build a semantic network based on relations between concepts instead of the relations between named entities already present in AsKNet. The resulting graph is then converted to RDF. AsKNet and LODifier differ in the way they disambiguate named entities: LODifier uses NER and WSD methods before generating RDF triples and describes the entities and relations using DBpedia and WordNet URIs, whereas AsKNet first generates semantic structures from text and then tries to merge nodes and edges based on similarity. Moreover, the graph output of the latter is not interlinked with other data sources. This is one of the key features of LODifier, and we feel that we have only scratched the surface regarding the benefits of interlinking.
information as a semantic graph; and (c) linking up the concepts and relations in the
input to the LOD cloud. These benefits provide additional types of information for subsequent processing steps, which are generally not provided by “shallow” approaches.
Concretely, we have demonstrated that the LODifier representations improve topic de-
tection over competitive shallow models by using a document similarity measure that
takes semantic structure into account. More generally, we believe that methods like ours
are suitable whenever there is little data, for example, in domain-specific settings.
A few of the design decisions made for the RDF output may not be adequate for all
conceivable applications of LODifier. The use of blank nodes is known to bring about
computational complications, and in certain cases it is beneficial to Skolemize them into URIs, e.g., using MD5 hashes. Employing owl:sameAs to link discourse referents
to their DBpedia counterparts might lead to unwanted logical ramifications in case the
RDF output is flawed. Hence, we will provide a way to configure LODifier to produce
RDF in an encoding that meets the specific requirements of the use case.
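As an illustration of the Skolemization option mentioned above, the following Python sketch (using rdflib, with names of our choosing) replaces every blank node in a graph by a URI derived from an MD5 hash. Hashing the node’s internal identifier is only one simple variant; a content-based hash, or the built-in skolemize() helper of recent rdflib versions, could be used instead. The variable lodifier_output_graph in the usage comment is a hypothetical placeholder.

import hashlib
from rdflib import Graph, URIRef, BNode

def skolemize_md5(graph, base="http://example.org/skolem/"):
    """Return a copy of `graph` with blank nodes replaced by URIs built
    from an MD5 hash of each blank node's internal identifier."""
    out = Graph()
    def sk(term):
        if isinstance(term, BNode):
            digest = hashlib.md5(str(term).encode("utf-8")).hexdigest()
            return URIRef(base + digest)
        return term
    for s, p, o in graph:
        out.add((sk(s), p, sk(o)))
    return out

# usage (placeholder name): skolemized = skolemize_md5(lodifier_output_graph)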
A current shortcoming of LODifier is its pipeline architecture, which treats the modules as independent, so that errors in one module cannot be recovered from later. We will consider joint inference methods to find a globally most coherent analysis [11]. Regarding structural similarity measures, we have only scratched the surface of the possibilities. More involved graph matching procedures remain a challenge for efficiency reasons; however, this is an area of active research [17].
Possible applications of LODifier are manifold. It could be used to extract DBpedia
relation instances from textual resources, provided a mapping from WordNet entities
to DBpedia relations is given. Moreover, our system could also be applied for hybrid
search, that is, integrated search over structured and non-structured sources. In such a
setting, the role of LODifier would be to “pre-process” unstructured information sources (off-line) into a representation that matches the structured information sources. This would reduce on-line search to the application of structured query processing techniques to a unified dataset. Presuming that a good accuracy at the semantic micro-level can be achieved, our method could also prove valuable in the domain of question answering. In that case, LODifier could be used to transform unstructured resources into RDF against which structured (SPARQL) queries generated by question answering systems such as Pythia [20] could be posed.
Acknowledgements. We thank Pia Wojtinnek, Johan Bos and Stephen Clark for stim-
ulating discussions and technical advice about the Boxer system as well as Christina
Unger and Philipp Cimiano for pointers to related work. We acknowledge the funding
of the European Union (project XLike in FP7).
References
1. Agirre, E., de Lacalle, O.L., Soroa, A.: Knowledge-based WSD and specific domains: per-
forming over supervised WSD. In: Proceedings of the International Joint Conferences on
Artificial Intelligence, Pasadena, CA (2009)
2. Allan, J.: Introduction to topic detection and tracking, pp. 1–16. Kluwer Academic Publish-
ers, Norwell (2002)
3. van Assem, M., van Ossenbruggen, J.: WordNet 3.0 in RDF (2011),
http://semanticweb.cs.vu.nl/lod/wn30/ (Online; accessed July 12, 2011)
4. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S.:
DBpedia - A crystallization point for the Web of Data. Web Semant. 7, 154–165 (2009)
5. Byrne, K., Klein, E.: Automatic extraction of archaeological events from text. In: Proceed-
ings of Computer Applications and Quantitative Methods in Archaeology, Williamsburg, VA
(2010)
6. Cafarella, M.J., Ré, C., Suciu, D., Etzioni, O., Banko, M.: Structured querying of web text: A
technical challenge. In: Proceedings of the Conference on Innovative Data Systems Research,
Asilomar, CA (2007)
7. Cimiano, P., Völker, J.: Text2Onto. In: Montoyo, A., Muñoz, R., Métais, E. (eds.) NLDB
2005. LNCS, vol. 3513, pp. 227–238. Springer, Heidelberg (2005)
8. Curran, J.R., Clark, S., Bos, J.: Linguistically Motivated Large-Scale NLP with C&C and
Boxer. In: Proceedings of the ACL 2007 Demo Session, pp. 33–36 (2007)
9. Dijkstra, E.W.: A note on two problems in connexion with graphs. Numerische Mathematik 1,
269–271 (1959)
10. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. MIT Press (1998)
11. Finkel, J.R., Manning, C.D.: Hierarchical Joint Learning: Improving Joint Parsing and
Named Entity Recognition with Non-Jointly Labeled Data. In: Proceedings of the 48th An-
nual Meeting of the Association for Computational Linguistics, pp. 720–728 (2010)
12. Harrington, B., Clark, S.: Asknet: Automated semantic knowledge network. In: Proceedings
of the American Association for Artificial Intelligence, Vancouver, BC, pp. 889–894 (2007)
13. Hitzler, P., Krötzsch, M., Rudolph, S.: Foundations of Semantic Web Technologies. Chapman
& Hall/CRC (2009)
14. Kamp, H., Reyle, U.: From Discourse to Logic: Introduction to Model-theoretic Semantics
of Natural Language, Formal Logic and Discourse Representation Theory. Studies in Lin-
guistics and Philosophy, vol. 42. Kluwer, Dordrecht (1993)
15. Koehn, P.: Statistical Significance Tests for Machine Translation Evaluation. In: Proceed-
ings of Empirical Methods in Natural Language Processing, Barcelona, Spain, pp. 388–395
(2004)
16. Li, Y., Chu, V., Blohm, S., Zhu, H., Ho, H.: Facilitating pattern discovery for relation extrac-
tion with semantic-signature-based clustering. In: Proceedings of the ACM Conference on
Information and Knowledge Management, pp. 1415–1424 (2011)
17. Lösch, U., Bloehdorn, S., Rettinger, A.: Graph Kernels for RDF Data. In: Simperl, E., et al.
(eds.) ESWC 2012. LNCS, pp. 134–148. Springer, Heidelberg (2012)
18. Milne, D., Witten, I.H.: Learning to link with Wikipedia. In: Proceedings of the ACM Con-
ference on Information and Knowledge Management (2008)
19. Ramakrishnan, C., Kochut, K.J., Sheth, A.P.: A Framework for Schema-Driven Relation-
ship Discovery from Unstructured Text. In: Cruz, I., Decker, S., Allemang, D., Preist, C.,
Schwabe, D., Mika, P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp.
583–596. Springer, Heidelberg (2006)
20. Unger, C., Cimiano, P.: Pythia: Compositional Meaning Construction for Ontology-Based
Question Answering on the Semantic Web. In: Muñoz, R., Montoyo, A., Métais, E. (eds.)
NLDB 2011. LNCS, vol. 6716, pp. 153–160. Springer, Heidelberg (2011)
21. Valiente, G.: Algorithms on Trees and Graphs. Springer, Berlin (2002)
22. Wojtinnek, P.-R., Harrington, B., Rudolph, S., Pulman, S.: Conceptual Knowledge Acquisi-
tion Using Automatically Generated Large-Scale Semantic Networks. In: Croitoru, M., Ferré,
S., Lukose, D. (eds.) ICCS 2010. LNCS, vol. 6208, pp. 203–206. Springer, Heidelberg (2010)
POWLA: Modeling Linguistic Corpora in
OWL/DL
Christian Chiarcos
1 Background
Within the last 30 years, the maturation of language technology and the increas-
ing importance of corpora in linguistic research produced a growing number of
linguistic corpora with increasingly diverse annotations. While the earliest annotations focused on part-of-speech and syntax annotation, later NLP research also included semantic, anaphoric and discourse annotations, and with the rise of statistical MT, a large number of parallel corpora became available. In parallel, specialized technologies were developed to represent these annotations, to perform the annotation task, and to query and visualize them. Yet, the tools
and representation formalisms applied were often specific to a particular type of
annotation, and they offered limited possibilities to combine information from
different annotation layers applied to the same piece of text. Such multi-layer
corpora became increasingly popular,1 and, more importantly, they represent a
valuable source to study interdependencies between different types of annota-
tion. For example, the development of a semantic parser usually takes a syntac-
tic analysis as its input, and higher levels of linguistic analysis, e.g., coreference
resolution or discourse structure, may take both types of information into con-
sideration. Such studies, however, require that all types of annotation applied to a particular document are integrated into a common representation that provides lossless and comfortable access to the linguistic information conveyed in the annotations, without requiring overly laborious conversion steps in advance.
At the moment, state-of-the-art approaches to corpus interoperability build on standoff XML [5,26] and relational databases [12,17]. The underlying data models are, however, graph-based, and this paper pursues the idea that RDF and
1 For example, parts of the Penn Treebank [29], originally annotated for parts of speech and syntax, were later annotated with nominal semantics, semantic roles, time and event semantics, discourse structure and anaphoric coreference [30].
RDF databases can be applied to the task of representing all possible annotations of a corpus in an interoperable way, to integrate their information without any restrictions (as imposed, for example, by conflicting hierarchies or overlapping segments in an XML-based format), and to provide means to store and to query this information regardless of the annotation layer from which it originates. Using data types defined in OWL/DL as the basis of this RDF representation makes it possible to specify and to verify formal constraints on the correct representation of linguistic corpora in RDF. POWLA, the approach described here, formalizes data models of generic linguistic data structures for linguistic corpora as OWL/DL concepts and definitions (the POWLA TBox) and represents the data as OWL/DL individuals in RDF (the POWLA ABox).
POWLA takes its conceptual point of departure from the assumption that
any linguistic annotation can be represented by means of directed graphs [3,26]:
Aside from the primary data (text), linguistic annotations consist of three prin-
cipal components, i.e., segments (spans of text, e.g., a phrase), relations between
segments (e.g., dominance relation between two phrases) and annotations that
describe different types of segments or relations.
In graph-theoretical terms, segments can be formalized as nodes, relations
as directed edges and annotations as labels attached to nodes and/or edges.
These structures can then be connected to the primary data by means of pointers.
A number of generic formats were proposed on the basis of such a mapping
from annotations to graphs, including ATLAS [3] and GrAF [26]. Below, this is
illustrated for the PAULA data model, which underlies the POWLA format.
Traditionally, PAULA is serialized as an XML standoff format; it is specifically designed to support multi-layer corpora [12], and it has been successfully applied to develop an NLP pipeline architecture for text summarization [35] and in the development of the corpus query engine ANNIS [38]. See Fig. 1 for an example of the mapping of syntax annotations to the PAULA data model.
RDF also formalizes directed (multi-)graphs, so an RDF linearization of the
PAULA data model yields a generic RDF representation of text-based linguistic
annotations and corpora in general. The idea underlying POWLA is to represent
linguistic annotations by means of RDF, and to employ OWL/DL to define
PAULA data types and consistency constraints for these RDF data.
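To give a flavour of such a linearization, the following Python/rdflib sketch encodes a minimal annotation graph — two terminals and one dominating phrase node — as RDF triples. The class and property names (Terminal, Node, hasString, hasLabel, dominates, startsAt) and the namespace are illustrative placeholders only and are not claimed to be the actual POWLA vocabulary.

from rdflib import Graph, Literal, Namespace, RDF

# Illustrative namespace and property names -- placeholders, not the
# official POWLA vocabulary.
EX = Namespace("http://example.org/powla-sketch#")

g = Graph()
g.bind("ex", EX)

# Two terminal nodes (tokens) with their surface strings and offsets.
for term, (text, pos) in {"t1": ("the", 0), "t2": ("wall", 4)}.items():
    g.add((EX[term], RDF.type, EX.Terminal))
    g.add((EX[term], EX.hasString, Literal(text)))
    g.add((EX[term], EX.startsAt, Literal(pos)))

# One phrase node dominating both terminals; the annotation is a label.
g.add((EX.np1, RDF.type, EX.Node))
g.add((EX.np1, EX.hasLabel, Literal("NP")))
for term in ("t1", "t2"):
    g.add((EX.np1, EX.dominates, EX[term]))

print(g.serialize(format="turtle"))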
2 POWLA
This section first summarizes the data types in PAULA, then their formalization
in POWLA, and then the formalization of linguistic corpora with OWL/DL.
The data model underlying PAULA is derived from labeled directed acyclic
(multi)graphs (DAGs). Its most important data types are thus different types of
nodes, edges and labels [14]:
Fig. 1. Using PAULA data structures for constituent syntax (German example sentence
taken from the Potsdam Commentary Corpus, [34])
Also, layers and documents can be assigned labels. These correspond to metadata, rather than annotations, e.g., the date of creation or the name of the annotator.
So far, integrating corpora into the LOD cloud has not been suggested, probably mostly because of the gap between the linguistics and the Semantic Web communities. Recently, however, some interdisciplinary efforts have been brought forward in the context of the Open Linguistics Working Group of the Open Knowledge Foundation [13], an initiative of experts from different fields concerned with linguistic data, whose activities – to a certain extent – converge towards the creation of a Linguistic Linked Open Data (LLOD) (sub-)cloud that will comprise different types of linguistic resources, including, unlike the current LOD cloud, linguistic corpora. The following subsections describe ways in which linguistic corpora may be linked with other LOD (resp. LLOD) resources.
So far, two resources have been converted using POWLA: the NEGRA corpus, a German newspaper corpus with annotations for morphology and syntax [33] as well as coreference [32], and the MASC corpus, a manually annotated subcorpus of the Open American National Corpus, annotated for a wide range of phenomena [23]. MASC is represented in GrAF, and a GrAF converter has been
developed [11].
MASC includes semantic annotations with FrameNet and WordNet senses [1].
WordNet senses are represented by sense keys as string literals. As sense keys
are stable across different WordNet versions, this annotation can be trivially
rendered as URI references pointing to an RDF version of WordNet. (However,
the corresponding WordNet version 3.1 is not yet available in RDF.)
FrameNet annotations in MASC make use of feature structures (attribute-
value pairs where the value can be another attribute-value pair), which are not
yet fully supported by the GrAF converter. However, reducing feature structures
to simple attribute-value pairs is possible. The values are represented in POWLA
as literals, but can likewise be transduced to properties pointing to URIs, if the
corresponding FrameNet version is available. An OWL/DL version of FrameNet
has been announced on the FrameNet site; it is, however, available only after registration and hence is not, strictly speaking, an open resource.
With these kinds of resources being made publicly available, it would be possible
to develop queries that combine elements of both the POWLA corpus and lexical-
semantic resources. For example, one may query for sentences about land, i.e.,
‘retrieve every (POWLA) sentence that contains a (WordNet-)synonym of land’.
Such queries can be applied, for example, to develop semantics-sensitive corpus
querying engines for linguistic corpora.
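A sketch of such a semantics-sensitive query is given below as a SPARQL query issued via Python’s rdflib. The property names used (ex:hasTerminal, ex:hasString, wn:lexicalForm, wn:inSynset) are hypothetical stand-ins for whatever the deployed POWLA and WordNet-RDF vocabularies actually provide, and the graph is assumed to already contain both the corpus and the lexical data.

from rdflib import Graph

# Hypothetical vocabularies; real POWLA / WordNet-RDF property names differ.
query = """
PREFIX ex: <http://example.org/powla-sketch#>
PREFIX wn: <http://example.org/wordnet-sketch#>

SELECT DISTINCT ?sentence WHERE {
  ?sentence a ex:Sentence ;
            ex:hasTerminal ?token .
  ?token    ex:hasString  ?surface .
  # any word form that shares a synset with "land"
  ?sense     wn:lexicalForm ?surface ;
             wn:inSynset    ?synset .
  ?landSense wn:lexicalForm "land" ;
             wn:inSynset    ?synset .
}
"""

g = Graph()   # assume corpus and WordNet data were loaded into g here
for row in g.query(query):
    print(row.sentence)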
In a similar way, corpora can also be linked to other resources in the LOD
cloud that provide identifiers that can be used to formalize corpus meta data,
e.g., provenance information. Lexvo [15], for example, provides identifiers for languages, and GeoNames [36] provides codes for geographic regions. ISOcat [28]
is another repository of meta data (and other) categories maintained by ISO
TC37/SC4, for which an RDF interface has recently been proposed [37].
Similarly, references to terminology repositories may be used instead of string-
based annotations. For example, the OLiA ontologies [8] formalize numerous
annotation schemes for morphosyntax, syntax and higher levels of linguistic de-
scription, and provide a linking to the morphosyntactic profile of ISOcat [9]
with the General Ontology of Linguistic Description [19], and other terminol-
ogy repositories. By comparing OLiA annotation model specifications with tags
used in a particular layer annotated according to the corresponding annotation scheme, the transduction from string-based annotations to references to community-maintained category repositories is eased. Using such a
resource to describe the annotations in a given corpus, it is possible to abstract
from the surface form of a particular tag and to interpret linguistic annotations on a conceptual basis.
Linking corpora with terminology and metadata repositories is thus a way
to achieve conceptual interoperability between linguistic corpora and other
resources.
Although PAULA predates the official LAF linearization GrAF [26] by several years [16], it shares its basic design as an XML standoff format and the underlying
graph-based data model. One important difference is, however, the treatment
of segmentation [14]. While PAULA provides formalized terminal elements with
XLink/XPointer references to spans in the primary data, GrAF describes seg-
ments by a sequence of numerical ‘anchors’. Although the resolution of GrAF
anchors is comparable to that of Terminals in PAULA, the key difference is that
anchor resolution is not formalized within the GrAF data model.
This has implications for the RDF linearizations of GrAF data: the RDF linearization of GrAF recently developed by [7] represents anchors as literal strings consisting of two numerical, space-separated IDs (character offsets), as in GrAF. This approach, however, provides no information on how these IDs should be interpreted (the reference to the primary data is not expressed). In POWLA, Terminals are modeled as independent resources, and information about the surface string and the original order of tokens is provided. Another difference is that this RDF linearization of GrAF is based on single GrAF files (i.e., single annotation layers) and does not build up a representation of the entire annotation project; instead, corpus organization is expressed implicitly through the file structure inherited from the underlying standoff XML. It is thus not directly possible to formulate SPARQL queries that refer to the same annotation layer in different documents or corpora.
Closer to our conceptualization is [4] who used OWL/DL to model a multi-
layer corpus with annotations for syntax and semantics. The advantages of
OWL/DL for the representation of linguistic corpora were carefully worked out
by the authors. Similar to our approach, [4] employed an RDF query language
for querying. However, this approach was specific to a selected resource and its particular annotations, whereas POWLA is a generic formalism for linguistic corpora, based on established data models and developed for the interoperable formalization of arbitrary linguistic annotations assigned to textual data.
As emphasized above, a key advantage of the representation of linguistic re-
sources in OWL/RDF is that they can be published as linked data [2], i.e., that
different corpus providers can provide their annotations at different sites, and
link them to the underlying corpus. For example, the Prague Czech-English De-
pendency Treebank4 , which is an annotated translation of parts of the Penn
Treebank, could be linked to the original Penn Treebank. Consequently, the var-
ious and rich annotations applied to the Penn Treebank [30] can be projected
onto Czech.5 Similarly, existing linkings between corpora and lexical-semantic
resources, represented so far by string literals, can be transduced to URI ref-
erences if the corresponding lexical-semantic resources are provided as linked
data. An important aspect here is that corpora can be linked to other resources from the Linked Open Data cloud using the same formalism.
4 http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T25
5 Unlike existing annotation projection approaches, however, this would not require that English annotations are directly applied to the Czech data – which introduces additional noise – but instead, SPARQL allows us to follow the entire path from Czech to English to its annotations, with the noisy part (the Czech-English alignment) clearly separated from the secure information (the annotations).
Finally, linked data resources can be used to formalize metadata or linguistic annotations. This makes it possible, for example, to use information from terminology repositories to query a corpus. As such, the corpus can be linked to terminology repositories like the OLiA ontologies, ISOcat or GOLD, and these community-defined data categories can be used to formulate queries that are independent of the annotation scheme but use an abstract and well-defined vocabulary.
In this way, linguistic annotations in POWLA are not only structurally inter-
operable (they use the same representation formalism), but also conceptually
interoperable (they use the same vocabulary).
References
13. Chiarcos, C., Hellmann, S., Nordhoff, S.: The Open Linguistics Working Group
of the Open Knowledge Foundation. In: Chiarcos, C., Nordhoff, S., Hellmann, S.
(eds.) Linked Data in Linguistics. Representing and Connecting Language Data
and Language Metadata, pp. 153–160. Springer, Heidelberg (2012)
14. Chiarcos, C., Ritz, J., Stede, M.: By all these lovely tokens... Merging conflicting
tokenizations. Journal of Language Resources and Evaluation (LREJ) 46(1), 53–74
(2011)
15. De Melo, G., Weikum, G.: Language as a foundation of the Semantic Web. In: Pro-
ceedings of the 7th International Semantic Web Conference (ISWC 2008), vol. 401
(2008)
16. Dipper, S.: XML-based stand-off representation and exploitation of multi-level lin-
guistic annotation. In: Proceedings of Berliner XML Tage 2005 (BXML 2005),
Berlin, Germany, pp. 39–50 (2005)
17. Eckart, K., Riester, A., Schweitzer, K.: A discourse information radio news database
for linguistic analysis. In: Chiarcos, C., Nordhoff, S., Hellmann, S. (eds.) Linked
Data in Linguistics, Springer, Heidelberg (2012)
18. Eckart, R.: Choosing an XML database for linguistically annotated corpora.
Sprache und Datenverarbeitung 32(1), 7–22 (2008)
19. Farrar, S., Langendoen, D.T.: An OWL-DL implementation of GOLD: An ontology
for the Semantic Web. In: Witt, A.W., Metzing, D. (eds.) Linguistic Modeling
of Information and Markup Languages: Contributions to Language Technology,
Springer, Dordrecht (2010)
20. Francopoulo, G., Bel, N., George, M., Calzolari, N., Monachini, M., Pet, M., So-
ria, C.: Multilingual resources for NLP in the Lexical Markup Framework (LMF).
Language Resources and Evaluation 43(1), 57–70 (2009)
21. Hellmann, S.: The semantic gap of formalized meaning. In: Aroyo, L., Antoniou, G.,
Hyvönen, E., ten Teije, A., Stuckenschmidt, H., Cabral, L., Tudorache, T. (eds.)
ESWC 2010. LNCS, vol. 6089, pp. 462–466. Springer, Heidelberg (2010)
22. Hellmann, S., Unbehauen, J., Chiarcos, C., Ngonga Ngomo, A.: The TIGER Corpus
Navigator. In: 9th International Workshop on Treebanks and Linguistic Theories
(TLT-9), Tartu, Estonia, pp. 91–102 (2010)
23. Ide, N., Fellbaum, C., Baker, C., Passonneau, R.: The manually annotated sub-
corpus: A community resource for and by the people. In: Proceedings of the ACL-
2010, pp. 68–73 (2010)
24. Ide, N., Pustejovsky, J.: What does interoperability mean, anyway? Toward an op-
erational definition of interoperability. In: Proceedings of the Second International
Conference on Global Interoperability for Language Resources (ICGL 2010), Hong
Kong, China (2010)
25. Ide, N., Romary, L.: International standard for a linguistic annotation framework.
Natural language engineering 10(3-4), 211–225 (2004)
26. Ide, N., Suderman, K.: GrAF: A graph-based format for linguistic annotations. In:
Proceedings of The Linguistic Annotation Workshop (LAW) 2007, Prague, Czech
Republic, pp. 1–8 (2007)
27. Kemps-Snijders, M., Windhouwer, M., Wittenburg, P., Wright, S.: ISOcat: Cor-
ralling data categories in the wild. In: Proceedings of the International Conference
on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco (May
2008)
28. Kemps-Snijders, M., Windhouwer, M., Wittenburg, P., Wright, S.: ISOcat: Re-
modelling metadata for language resources. International Journal of Metadata,
Semantics and Ontologies 4(4), 261–276 (2009)
29. Marcus, M., Santorini, B., Marcinkiewicz, M.A.: Building a large annotated corpus
of English: the Penn Treebank. Computational Linguistics 19(2), 313–330 (1993)
30. Pustejovsky, J., Meyers, A., Palmer, M., Poesio, M.: Merging PropBank, NomBank,
TimeBank, Penn Discourse Treebank and Coreference. In: Proc. of ACL Workshop
on Frontiers in Corpus Annotation (2005)
31. Rubiera, E., Polo, L., Berrueta, D., El Ghali, A.: TELIX: An RDF-based Model for
Linguistic Annotation. In: Simperl, E., et al. (eds.) ESWC 2012. LNCS, vol. 7295,
pp. 195–209. Springer, Heidelberg (2012)
32. Schiehlen, M.: Optimizing algorithms for pronoun resolution. In: Proceedings of the
20th International Conference on Computational Linguistics (COLING), Geneva,
pp. 515–521 (August 2004)
33. Skut, W., Brants, T., Krenn, B., Uszkoreit, H.: A linguistically interpreted corpus
of German newspaper text. In: Proc. ESSLLI Workshop on Recent Advances in
Corpus Annotation, Saarbrücken, Germany (1998)
34. Stede, M.: The Potsdam Commentary Corpus. In: Proceedings of the ACL Work-
shop on Discourse Annotation, pp. 96–102, Barcelona, Spain (2004)
35. Stede, M., Bieler, H.: The MOTS Workbench. In: Mehler, A., Kühnberger, K.-U.,
Lobin, H., Lüngen, H., Storrer, A., Witt, A. (eds.) Modeling, Learning, and Proc.
of Text-Tech. Data Struct. SCI, vol. 370, pp. 15–34. Springer, Heidelberg (2011)
36. Vatant, B., Wick, M.: GeoNames ontology, version 3.01 (February 2012),
http://www.geonames.org/ontology (accessed March 15, 2012)
37. Windhouwer, M., Wright, S.E.: Linking to linguistic data categories in ISOcat. In:
Linked Data in Linguistics (LDL 2012), Frankfurt/M., Germany (accepted March
2012)
38. Windhouwer, M., Wright, S.E.: Linking to linguistic data categories in ISOcat.
In: Chiarcos, C., Nordhoff, S., Hellmann, S. (eds.) Linked Data in Linguistics.
Representing and Connecting Language Data and Language Metadata, pp. 99–
107. Springer, Heidelberg (2012)
Representing Mereotopological Relations
in OWL Ontologies with OntoPartS
1 Introduction
Part-whole relations are essential for knowledge representation, in particular
in terminology and ontology development in subject domains such as biology,
medicine, GIS, and manufacturing. Usage of part-whole relations is further complicated when they are merged with topological or mereotopological relations, such as tangential proper part, where the part touches the boundary of the whole it is part of; e.g., the FMA has 8 basic locative part-whole relations [14] and GALEN has 26 part-whole and locative part-whole relations1. Mereotopology is also useful for annotating and querying multimedia documents and cartographic maps; e.g., annotating a photo of a beach where the area of the photo that depicts the sand touches the area that depicts the seawater, so that, together with the knowledge that, among other locations, Varadero is a tangential proper part of Cuba, the semantically enhanced system can infer possible locations where the photo has been taken or, vice versa, propose that the photo may depict a beach scene.
Efforts have gone into figuring out which part-whole relations there are [24,11],
developing a logic language with which one can represent the semantics of the
1 http://www.opengalen.org/tutorials/crm/tutorial9.html up to /tutorial16.html
relations [2,8], and how to use the two together [20,25,26]. The representa-
tion of mereotopology in Description Logics (DL) has not been investigated,
but related efforts in representing the Region Connection Calculus (RCC) in DLs do exist [17,9,21,23,6]. Currently, the advances in mereotopology are not directly transferrable to a Semantic Web setting due to the differences in languages and theories, and they lack software support to make them usable for the ontology developer. Yet, ontologists require a way to effectively handle these
part-whole relations during ontology development without necessarily having to
become an expert in theories about part-whole relations, mereotopology, and
expressive ontology languages. Moreover, structured and guided usage can pre-
vent undesirable deductions and increase the number of desirable deductions even without the need to add expressiveness to the language. For instance, instance classification: let NTPLI be a ‘non-tangential proper located in’ relation and EnclosedCountry ≡ Country ⊓ ∃NTPLI.Country; given the assertions NTPLI(Lesotho, South Africa), Country(Lesotho), and Country(South Africa), a reasoner will correctly deduce EnclosedCountry(Lesotho). With merely ‘part-of’, one would not have been able to obtain this result.
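The example can be written down in OWL directly; the sketch below gives one possible Turtle rendering (parsed here with Python’s rdflib), with an illustrative namespace and NTPLI standing in for the property an OntoPartS user would select. This is our own encoding of the example, not the file the tool produces; a DL reasoner loading these axioms would classify Lesotho as an EnclosedCountry.

from rdflib import Graph

# Illustrative IRIs; only the axiom pattern is the point here.
axiom = """
@prefix :    <http://example.org/ontoparts-sketch#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

:NTPLI a owl:ObjectProperty .
:Country a owl:Class .

:EnclosedCountry a owl:Class ;
    owl:equivalentClass [
        a owl:Class ;
        owl:intersectionOf ( :Country
                             [ a owl:Restriction ;
                               owl:onProperty :NTPLI ;
                               owl:someValuesFrom :Country ] )
    ] .

:Lesotho a :Country ; :NTPLI :South_Africa .
:South_Africa a :Country .
"""

g = Graph()
g.parse(data=axiom, format="turtle")
print(len(g), "triples")  # a DL reasoner would infer :EnclosedCountry(:Lesotho)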
Thus, there are three problems: (i) the lack of oversight on plethora of part-
whole relations, that include real parthood (mereology) and parts with their
locations (mereotopology), (ii) the challenge to figure out which one to use
when, and (iii) underspecified representation and reasoning consequences. To
solve these problems we propose the OntoPartS tool to guide the modeller.
To ensure a solid foundation, transparency, a wide coverage of the types of part-
whole relations, and effectiveness during ontology development, we extend the
taxonomy of part-whole relations of [11] with the novel addition of mereotopo-
logical relations, driven by the KGEMT mereotopological theory [22], resulting
in a taxonomy of 23 part-whole relations. We describe the design rationale and
trade-offs with respect to what has to be simplified from KGEMT to realise as
much as possible in OWL so that OntoPartS can load OWL/OWL2-formalised
ontologies, and, if desired, modify the OWL file with the chosen relation. To en-
able quick selection of the appropriate relation, we use a simplified OWL-ized
DOLCE ontology for the domain and range restrictions imposed on the part-
whole relations and therewith let the user take ‘shortcuts’, which reduces the
selection procedure to 0-4 options based on just 2-3 inputs. The usability of OntoPartS and the effectiveness of the approach were evaluated and shown to improve efficiency and accuracy in modelling.
In the remainder of the paper, we first describe the theoretical foundation of
the mereotopological relations and trade-offs for OWL (Section 2). We describe
the design, implementation, and evaluation of OntoPartS in Section 3, discuss
the proposed solution in Section 4, and conclude in Section 5.
and has been investigated also for Description Logics (e.g., [1,2]). There is a first important distinction between parthood and a meronymic relation (‘part’ in natural language only), and, second, there is an additional aspect concerning parthood and location [11]. The second dividing characteristic is the domain and range of
and location [11]. The second dividing characteristic is the domain and range of
the relations (which are taken from the DOLCE foundational ontology [12] in
[11]). Particularly relevant here are the containment and location axioms (Eqs. 1
and 2) [11], where ED = EnDurant (enduring entity, e.g., a wall), R = Region
(e.g., the space that the wall occupies, the Alpine region), and has_2D and has_3D for surfaces and space are shorthand relations standing for DOLCE’s qualities and qualia; note that the domain and range is Region that has an object occupying it, hence this does not imply that those objects are also related by structural parthood.
[Figure content: taxonomy of part-whole relations. Abbreviations: part-of (P), mpart-of (mP), proper-part-of (PP), s-part-of (StP), spatial-part-of (SpP), involved-in (II), proper-s-part-of (PStP), proper-spatial-part-of (PSpP), proper-involved-in (PII), contained-in (CI), located-in (LI), proper-contained-in (PCI), proper-located-in (PLI), equal-contained-in (ECI), equal-located-in (ELI), tangential-proper-contained-in (TPCI), nontangential-proper-contained-in (NTPCI), tangential-proper-located-in (TPLI), nontangential-proper-located-in (NTPLI). The legend distinguishes subsumption in the original taxonomy, subsumption for proper part-of, and subsumption dividing non-proper and equal from proper part-of.]
Fig. 1. Graphical depiction of the extension of the mereological branch of the basic
taxonomy of part-whole relations with proper parthood and mereotopological relations
(meronymic relations subsumed by mpart-of not shown)
t2, t3), GEM of (M + t4, t5), GEMT of (MT + GEM + t10, t12, t13), and
KGEMT of (GEMT + t14, t15, t16). In addition, (t14-t16) require (t17-t19)
and there are additional axioms and definitions (like t20-t27) that can be built
up from the core ones.
Using Eqs. (1-2) and (t1-t27), we now extend the part-whole taxonomy of
[11] with the mereotopological relations as defined in Eqs. (3-10), which is shown
graphically in Fig. 1. The tangential and nontangential proper parthood relations
are based on axioms 65 and 66 in [22], which are (t24) and (t23), respectively,
in Table 1, and the same DOLCE categories are used as in Eqs. (1-2); see Fig. 1
and Table 1 for abbreviations.
∀x, y (ECI(x, y) ≡ CI(x, y) ∧ P(y, x))   (3)
∀x, y (PCI(x, y) ≡ PPO(x, y) ∧ R(x) ∧ R(y) ∧ ∃z, w(has_3D(z, x) ∧ has_3D(w, y) ∧ ED(z) ∧ ED(w)))   (4)
∀x, y (NTPCI(x, y) ≡ PCI(x, y) ∧ ∀z(C(z, x) → O(z, y)))   (5)
∀x, y (TPCI(x, y) ≡ PCI(x, y) ∧ ¬NTPCI(x, y))   (6)
∀x, y (ELI(x, y) ≡ LI(x, y) ∧ P(y, x))   (7)
∀x, y (PLI(x, y) ≡ PPO(x, y) ∧ R(x) ∧ R(y) ∧ ∃z, w(has_2D(z, x) ∧ has_2D(w, y) ∧ ED(z) ∧ ED(w)))   (8)
∀x, y (NTPLI(x, y) ≡ PLI(x, y) ∧ ∀z(C(z, x) → O(z, y)))   (9)
∀x, y (TPLI(x, y) ≡ PLI(x, y) ∧ ¬NTPLI(x, y))   (10)
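To illustrate how (5) and (6) separate tangential from non-tangential containment, the following small Python sketch evaluates the two definitions over a finite toy domain in which connection (C), overlap (O) and proper containment (PCI) are given extensionally. This is only a didactic rendering of the definitions, not part of OntoPartS, and the relation extensions are made up.

def is_ntpci(x, y, regions, PCI, C, O):
    """Non-tangential proper containment (Eq. 5): every region connected
    to x also overlaps y. PCI, C, O are sets of pairs over `regions`."""
    return (x, y) in PCI and all((z, y) in O for z in regions if (z, x) in C)

def is_tpci(x, y, regions, PCI, C, O):
    """Tangential proper containment (Eq. 6): PCI holds but NTPCI does not."""
    return (x, y) in PCI and not is_ntpci(x, y, regions, PCI, C, O)

# Toy universe: region a is properly contained in b; region c is connected
# to a but does not overlap b, so a touches b's boundary (tangential case).
regions = {"a", "b", "c"}
PCI = {("a", "b")}
C = {("a", "a"), ("b", "a"), ("c", "a")}
O = {("a", "b"), ("b", "b")}
print(is_tpci("a", "b", regions, PCI, C, O))   # True
print(is_ntpci("a", "b", regions, PCI, C, O))  # False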
Note that one also can use another foundational ontology, such as SUMO or
GFO, provided it at least contains categories equivalent to the ones from DOLCE
we use here. Concerning the interaction between this proposal for mereotopol-
ogy and DOLCE: DOLCE includes the GEM mereological theory, of which
KGEMT is an extension, and does not contain mereotopology; hence, our taxo-
nomic categorisation of the mereotopological relations and additional KGEMT
axioms do not lead to an inconsistency. Interaction with DOLCE’s—or any other
Table 2. Subsets of KGEMT that can be represented in the OWL species; t9, t20,
t21, t22, t23, t24 can be simplified and added as primitives to each one
The main requirement of the software is to hide the logic involved in the formal
definition of the 23 part-whole relations, and translate it to a set of intuitive
steps and descriptions that will guide the user to make the selection and decision
effortlessly. The selection procedure for the 23 possible relations should be made
as short as possible and present only a small, relevant subset of suggestions from which the user can select the one that best fits the situation. A set of (top-level)
categories should be proposed to quickly discriminate among relations since the
user may be more familiar with the categories’ notions for domain and range than
with the relations’ definitions, therewith standardizing the criteria for selecting
the relations. Simple examples must be given for each relation and category.
Finally, the user must also be able to save the selected relation to the ontology file from which the classes of interest were taken.
Given these basic functional requirements, some design decisions were made
for OntoPartS. From a generic perspective, a separate tool is an advantage,
because then it can be used without binding the ontologist to a single ontology
editor. Another consideration is usability testing. We chose to use a rapid way
of prototyping to develop the software to quickly determine whether it is really
helpful. Therefore, we implemented a stand-alone application that works with
OWL files. We also chose to use the DOLCE top-level ontology categories for
the standardization of the relationships’ decision criteria.
To structure the selection procedure in a consistent way in the implementa-
tion, we use activity diagrams to describe the steps that a user has to carry out
to interact with OntoPartS and to select the appropriate relation. An activity
diagram for the selection process of the mereotopological relations is available
in the online supplementary material. The selection of the appropriate relation
incorporates some previous ideas of a decision diagram and topological principles
as an extension of mereological theories [10,25,26], and questions and decision
points have been added to reflect the extended taxonomy. For the mereotopo-
logical relations considered, in principle, the decision for the appropriate one can be made in two ways: either first identify the applicable mereotopological relations and then distinguish between located in and contained in, or vice versa. In the
OntoPartS interface, we have chosen to reduce the sequence of questions to a
single question (check box) that appears only when the domain and range are
regions, which asks whether the classes are geographical entities.
Concerning the most expressive species, OWL 2 DL, and KGEMT: antisymmetry (t3), the second-order axioms (t5, t13), and the closure operators (t14-t19) are omitted, and the definitions of relations are simplified to only domain and range axioms and their position in the hierarchy (recollect Table 2). In addi-
tion, OWL’s IrreflexiveObjectProperty and AsymmetricObjectProperty can be
used only with simple object properties, which clashes with a taxonomy of object properties and with transitivity (which are deemed more important); therefore, (t25, t27) will also not appear in the OWL file. The combination of
the slimmed KGEMT and extended taxonomy of part-whole object properties
together with a DOLCE ultra-ultra-light is included in the online supplementary
material as MereoTopoD.owl. This OWL file contains the relations with human-
readable names, as included in the taxonomy of [11] and the mereotopological
extension depicted in Fig. 1, since it is generally assumed to be more workable to write out abbreviations in the domain ontology, as this increases the readability and understandability of the ontology from a human expert’s point of view.
Observe that at the class level, we have the so-called “all-some” construction for property axioms, and if the modeller wants to modify it with a min-, max-, or exact cardinality (e.g., ‘each spinal column is a proper part of exactly one human’), then this goes beyond OWL 2 DL because the properPartOf object property is not simple. Further, transitivity is a feature of OWL DL, OWL Lite, OWL 2 DL, EL and RL, but not QL. Because one cannot know upfront the setting of the ontology, we keep the hierarchy of relations but do not add the relational properties when writing into the .owl file; the user can add them afterward.
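A minimal sketch of the kind of property hierarchy written to the .owl file, and of a relational property the user might add afterwards, is given below in Python/rdflib. The namespace and the abbreviated property names are our own and are not claimed to match MereoTopoD.owl exactly; the comments record the OWL 2 DL limitation discussed above.

from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

# Illustrative namespace; names abbreviate relations of the extended taxonomy.
EX = Namespace("http://example.org/ontoparts-sketch#")

g = Graph()
g.bind("ex", EX)

# Hierarchy of part-whole object properties (kept in the generated file).
g.add((EX.partOf, RDF.type, OWL.ObjectProperty))
g.add((EX.properPartOf, RDF.type, OWL.ObjectProperty))
g.add((EX.properPartOf, RDFS.subPropertyOf, EX.partOf))

# Relational properties are left to the user; declaring transitivity makes
# properPartOf non-simple, so a qualified cardinality such as
# "properPartOf exactly 1 Human" would then fall outside OWL 2 DL.
g.add((EX.properPartOf, RDF.type, OWL.TransitiveProperty))

print(g.serialize(format="turtle"))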
Fig. 2. Selecting the part and whole classes in OntoPartS from the loaded OWL file
would have had two classes that are both, say, processes, then there is only
one option (involved in, as OntoPartS includes the taxonomy of [11]). The
software provides the suggestions, a verbalization of the possible relationship(s),
e.g., “Campground coincides exactly with RuralArea”, and typical examples, as can
be seen in Fig. 4. Once she selects the desired relation from the ones proposed
by the software, she can choose to add this relationship to the OWL file by sim-
ply clicking the button labelled “Save relationship to file” (Fig. 4 bottom) and
continue either with other classes and selection of a part-whole relation or with
developing the ontology in the ontology development environment of choice. ♦
phase such that it can be done more efficiently and with fewer errors. To this end, a qualitative and two preliminary quantitative evaluations have been carried out.
The materials used for the experiments were the DOLCE taxonomy, the
taxonomy with the 23 part-whole relations, and the beta version of the On-
toPartS tool. The domain ontology about computers was developed using Protégé 4.0 and was divided into two versions, one with and one without part-whole relations.
Results. The students in the first experiment asserted 380 part-whole relations
among the classes in the domain ontology (37 mereotopological), of which 210
were correct, i.e., on average, about 12.4 correct assertions/participant (variation
between 5 and 37); for the second experiment, the numbers are, respectively, 82
(22 mereotopological), 58, and an average of 9.7 (with variation between 0 and
27). Given the controlled duration of the second experiment, this amounts to,
on average, a mere 4 minutes to choose the correct relation with OntoPartS.
Evaluating the mistakes made by the participants revealed that an incorrect
selection of part-whole relation was due to, mainly, an incorrect association of a
domain ontology class to a DOLCE category. This was due to the late discovery
of the tool-tip feature in OntoPartS by some participants and the lack of an
“undo” button (even though a user could have switched back to the ontology
development environment and deleted the assertion manually). Several errors
were due to the absence of inverses in the beta version of the OntoPartS tool,
leading some participants to include Metal constitutes some Hardware. 83%
of the errors in the second experiment were made by those who did not use
OntoPartS, which looks promising for OntoPartS.
The response in the qualitative evaluation was unanimous disbelief that selection could be made this easy and quick, and the desire was expressed to consult the formal and ontological foundations. As such, OntoPartS stimulated interest in education on the topic, along the lines of “the tool makes it easy, so the theory surely will be understandable”.
Overall, it cannot be concluded that modelling part-whole relations with OntoPartS results in statistically significantly fewer errors—for this we need access
to more and real ontology developers so as to have a sufficiently large group
whose results can be analysed statistically. Given the speed with which correct
relations were selected, the automated guidelines do assist with representation
of part-whole relations such that it can be done more efficiently and quickly.
The experimentation also aided in improving OntoPartS’s functionality and
usability, so that it is now a fully working prototype.
4 Discussion
Despite the representation and reasoning limitations with the DL-based OWL
2 species, there are several modelling challenges that can be addressed with the
mereotopological and part-whole relation taxonomy together with OntoPartS, and these address the three problems identified in the introduction.
the is-a vs. part-of confusion—i.e., using the is-a relation where a part-of relation
should have been used—common among novice ontology developers, and no such
errors were encountered during our experiments either (recollect Section 3.3). By
making part-whole relations easily accessible without the need for immediate in-
depth investigation into the topic, it is expected that this type of error may also
be prevented without any prior substantial training on the topic.
The fine-grained distinctions between the parthood relations enable, among others, proper instance classification as mentioned in the introduction for EnclosedCountry, thanks to being able to select the right relation and there-
with capturing the intended semantics of EnclosedCountry. If, on the other
hand, the modeller would have known only about proper part-of but not proper
located in, then she could only have asserted that Lesotho is a proper part of
South Africa, which holds (at best) only spatially but not administratively. Not
being able to make such distinctions easily leads to inconsistencies in the on-
tology or conflicts in ontology import, alignment, or integration. OntoClean [7]
helps with distinguishing between geographic and social entity, and, in analogy,
OntoPartS aids relating the entities with the appropriate relation.
5 Conclusions
References
1. Artale, A., Franconi, E., Guarino, N., Pazzi, L.: Part-whole relations in object-
centered systems: An overview. DKE 20(3), 347–383 (1996)
2. Bittner, T., Donnelly, M.: Computational ontologies of parthood, componenthood,
and containment. In: Proc. of IJCAI 2005, pp. 382–387. AAAI Press, Cambridge
(2005)
3. Cohn, A.G., Renz, J.: Qualitative spatial representation and reasoning. In: Hand-
book of Knowledge Representation, ch. 13, pp. 551–596. Elsevier (2008)
4. Egenhofer, M.J., Herring, J.R.: Categorizing binary topological relations between
regions, lines, and points in geographic databases. Tech. Rep. 90-12, National Cen-
ter for Geographic Information and Analysis, University of California (1990)
5. Eschenbach, C., Heydrich, W.: Classical mereology and restricted domains. Inter-
national Journal of Human-Computer Studies 43, 723–740 (1995)
6. Grütter, R., Bauer-Messmer, B.: Combining OWL with RCC for spatiotermino-
logical reasoning on environmental data. In: Proc. of OWLED 2007 (2007)
7. Guarino, N., Welty, C.: An overview of OntoClean. In: Staab, S., Studer, R. (eds.)
Handbook on Ontologies, pp. 151–159. Springer (2004)
8. Horrocks, I., Kutz, O., Sattler, U.: The even more irresistible SROIQ. In: Proc.
of KR 2006, pp. 452–457 (2006)
9. Katz, Y., Cuenca Grau, B.: Representing qualitative spatial information in OWL-
DL. In: Proc. of OWLED 2005, Galway, Ireland (2005)
10. Keet, C.M.: Part-Whole Relations in Object-Role Models. In: Meersman, R., Tari,
Z., Herrero, P. (eds.) OTM 2006 Workshops. LNCS, vol. 4278, pp. 1116–1127.
Springer, Heidelberg (2006)
11. Keet, C.M., Artale, A.: Representing and reasoning over a taxonomy of part-whole
relations. Applied Ontology 3(1-2), 91–110 (2008)
12. Masolo, C., Borgo, S., Gangemi, A., Guarino, N., Oltramari, A.: Ontology library.
WonderWeb Deliverable D18 (ver. 1.0) (December 31, 2003),
http://wonderweb.semanticweb.org
13. McGuinness, D.L., van Harmelen, F.: OWL Web Ontology Language Overview.
W3C Recommendation (2004), http://www.w3.org/TR/owl-features/
14. Mejino, J.L.V., Agoncillo, A.V., Rickard, K.L., Rosse, C.: Representing complexity
in part-whole relationships within the foundational model of anatomy. In: Proc. of
the AMIA Fall Symposium, pp. 450–454 (2003)
15. Motik, B., Grau, B.C., Horrocks, I., Wu, Z., Fokoue, A., Lutz, C.: OWL 2 Web
Ontology Language Profiles. W3c recommendation, W3C (October 27, 2009),
http://www.w3.org/TR/owl2-profiles/
16. Motik, B., Patel-Schneider, P.F., Parsia, B.: OWL 2 web ontology language struc-
tural specification and functional-style syntax. W3c recommendation, W3C (Oc-
tober 27, 2009) http://www.w3.org/TR/owl2-syntax/
17. Nutt, W.: On the Translation of Qualitative Spatial Reasoning Problems into
Modal Logics. In: Burgard, W., Christaller, T., Cremers, A.B. (eds.) KI 1999.
LNCS (LNAI), vol. 1701, pp. 113–124. Springer, Heidelberg (1999)
18. Odell, J.J.: Advanced Object-Oriented Analysis & Design using UML. Cambridge
University Press, Cambridge (1998)
19. Randell, D.A., Cui, Z., Cohn, A.G.: A spatial logic based on regions and connection.
In: Proc. of KR 1992, pp. 165–176. Morgan Kaufmann (1992)
20. Schulz, S., Hahn, U., Romacker, M.: Modeling anatomical spatial relations with
description logics. In: AMIA 2000 Annual Symposium, pp. 779–783 (2000)
21. Stocker, M., Sirin, E.: Pelletspatial: A hybrid RCC-8 and RDF/OWL reasoning
and query engine. In: Proc. of OWLED 2009, Chantilly, USA, October 23-24.
CEUR-WS, vol. 529 (2009)
22. Varzi, A.: Spatial reasoning and ontology: parts, wholes, and locations. In: Hand-
book of Spatial Logics, pp. 945–1038. Springer, Heidelberg (2007)
23. Wessel, M.: Obstacles on the way to qualitative spatial reasoning with descrip-
tion logics: some undecidability results. In: Proc. of DL 2001, Stanford, CA, USA,
August 1-3. CEUR WS, vol. 49 (2001)
24. Winston, M.E., Chaffin, R., Herrmann, D.: A taxonomy of partwhole relations.
Cognitive Science 11(4), 417–444 (1987)
25. Yang, W., Luo, Y., Guo, P., Tao, H., He, B.: A Model for Classification of Topo-
logical Relationships Between Two Spatial Objects. In: Wang, L., Jin, Y. (eds.)
FSKD 2005. LNCS (LNAI), Part II, vol. 3614, pp. 723–726. Springer, Heidelberg
(2005)
26. Zhong, Z.-N., et al.: Representing topological relationships among heterogeneous
geometry-collection features. J. of Comp. Sci. & Techn. 19(3), 280–289 (2004)
Evaluation of the Music Ontology Framework
1 Introduction
The Music Ontology [19] was first published in 2006 and provides a framework
for distributing structured music-related data on the Web. It has been used
extensively over the years, both as a generic model for the music domain and as
a way of publishing music-related data on the Web.
Until now the Music Ontology has never been formally evaluated and com-
pared with related description frameworks. In this paper, we perform a quan-
titative evaluation of the Music Ontology framework. We want to validate the
Music Ontology with regards to its intended use, and to get a list of areas the
Music Ontology community should focus on in further developments.
As more and more BBC web sites are using ontologies [20], we ultimately
want to reach a practical evaluation methodology that we can apply to other
domains. Those ontologies are mainly written by domain experts, and we would
need to evaluate how much domain data they can capture. We would also need to
identify possible improvements in order to provide relevant feedback to domain
experts.
We first review previous work on ontology evaluation in § 2. We devise our
evaluation methodology in § 3, quantifying how well real-world user-needs fit
within our Music Ontology framework. We perform in § 4 the actual evaluation,
and compare several alternatives for each step of our evaluation process. We
discuss the results and conclude in § 5.
Structural Metrics. Web ontologies are defined through an RDF graph. This
graph can be analysed to derive evaluation metrics. These metrics, evaluating
the structure of the graph defining the ontology but not the ontology itself,
are called structural metrics. For example the AKTiveRank system [1] includes
a metric quantifying the average amount of edges in which a particular node
corresponding to a concept is involved. This metric therefore gives an idea of how
much detail a concept definition in the evaluated ontology holds. Another set of
examples are the structural ontology measures defined in [9], including maximum
and minimum depth and breadth of the concept hierarchy. Such metrics do not
capture the intended use of the evaluated ontology. We therefore do not consider
using structural metrics in our evaluation.
We now devise our methodology for evaluating our Music Ontology framework,
based on the data-driven and the task-based evaluation methodologies de-
scribed in § 2.4 and § 2.5. We want this evaluation methodology to allow us to
validate our ontology with regards to real-world information-seeking behaviours.
We consider evaluating our knowledge representation framework against a
dataset of verbalised music-related user needs. We isolate a set of music-related
needs drawn from different sets of users, and we measure how well a music
information system backed by our knowledge representation frameworks could
handle these queries. Our evaluation methodology involves the following steps:
– We can use the results of previous works in extracting query features from
similar datasets;
– We can extract features from the dataset by following a statistical approach;
– We can manually extract features from a random sample of the dataset.
We also consider extracting a weight wf for each feature f , capturing the relative
importance of f within the dataset. Moreover, these weights are normalised so
that their sum is equal to one.
We now evaluate how well these features map to our knowledge representation
framework. The corresponding measure captures the ontology fit. The Music
Ontology was designed not to duplicate terms that could be borrowed from other web ontologies (for example, foaf:Person, dc:title or po:Broadcast). We take this design choice into account: in the last step of our evaluation process
we therefore also consider terms from FOAF1 , Dublin Core2 and the Programmes
Ontology3.
We develop an ontology fit measure capturing how well the extracted features
can be mapped to our ontology. For a query feature f , we define δ as follows.
δ(f) = 1 if f is expressible within the ontology, and 0 otherwise.   (2)
Our ontology fit measure for a set of verbalised queries Q is then the weighted sum of the δ(f) for each feature f extracted from Q:
Δ = Σ_f w_f · δ(f)   (3)
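Computing Δ is then a simple weighted sum; the sketch below is our own illustration, with made-up features and weights, assuming the extracted features are stored together with their normalised weights and a flag stating whether they are expressible in the ontology.

def ontology_fit(features):
    """Weighted ontology fit (Eqs. 2-3): `features` maps a query feature
    to a pair (normalised weight, expressible-in-the-ontology flag)."""
    return sum(w for w, expressible in features.values() if expressible)

# Toy feature set with weights summing to one (all values are made up).
features = {
    "artist name":    (0.40, True),
    "track title":    (0.30, True),
    "lyrics excerpt": (0.20, True),
    "personal mood":  (0.10, False),
}
print(round(ontology_fit(features), 2))  # 0.9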
3.4 Discussion
This ‘query-driven’ evaluation methodology corresponds to a particular kind of
a task-based evaluation methodology, where the task is simply to be able to
answer a set of musical queries and the evaluation metric focuses on coverage
(the insertion and substitution errors are not considered). The gold-standard
associated with this task is that such queries are fully expressed in terms of
our knowledge representation framework — the application has a way to derive
accurate answers for all of them. This evaluation methodology also corresponds
to a particular kind of data-driven evaluation. We start from a corpus of text,
corresponding to our dataset of verbalised user needs, which we analyse and try
to map to our knowledge representation framework.
A similar query-driven evaluation of an ontology-based music search system
is performed by Baumann et al. [4]. They gather a set of 1500 verbalised queries
issued to their system, which they cluster manually into five high-level categories (requests for artists, songs, etc.) in order to gain insights into the coverage of their system. We use a similar methodology, although we define a quantitative
evaluation measure which takes into account much more granular query features.
We also consider automating steps of this evaluation process.
categories of users: casual users and users of music libraries. We derive an ontol-
ogy fit measure for each of these categories.
Table 1. Comparison of the features identified in [2] and in [14] along with correspond-
ing Music Ontology terms
(3318 verbalised user needs) and a subset of Yahoo Questions (4805 verbalised
user needs). Most user queries include editorial information (artist name, track
name, etc.), as spotted in previous analyses of similar datasets. When a query includes some information about a musical item, this information will most likely be related to vocal parts: singer, lyrics, etc. The three most cited musical genres
are “rock”, “classical” and “rap”. The queries often include information about
space and time (e.g. when and where the user heard about that song). They also
include information about the access medium: radio, CD, video, online etc. A
large part of the queries include personal feelings, illustrated by the terms “love”
or “like”. Finally, some of them include information about constituting parts of
a particular musical item (e.g. “theme”).
We could consider the words occurring the most in our dataset as query
features and their counts as a weight. However, the same problem as in the
ontology fit derived in § 4.1 also arises. The Music Ontology term corresponding
to one of these features is highly context-dependent. There are two ways to
overcome these issues. On the one hand, we can keep our evaluation purely on
a lexical level. We are particularly interested in such an evaluation because it
allows us to include other music-related representation frameworks which are
not ontologies but just specifications of data formats, therefore providing some
insights for comparison. On the other hand, we can extract underlying topics
from our corpus of verbalised user needs, and consider these topics as query
features. We therefore move our evaluation to the conceptual level.
Evaluation at the Lexical Level. We now derive a measure of the lexical coverage
of our ontology. We first produce a vector space representation of these verbalised
user needs and of labels and comments within the Music Ontology specification.
We first remove common stop words. We then map the stemmed terms to vector
dimensions and create vectors for our dataset and our ontology using tf-idf. We
also include in our vector space other music-related representation frameworks.
We finally compute cosine distances between pairs of vectors, captured in table 2.
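A possible sketch of this lexical-level comparison is given below, using scikit-learn as a stand-in toolkit (the paper does not name one); the document strings are placeholders, and the stemming step is assumed to have been applied beforehand.

# A sketch of the lexical-level comparison, assuming the query dataset and each
# specification document have been collapsed into one text string each.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = {
    "queries":        "looking for the title of a rock song heard on the radio",
    "music_ontology": "a track is a musical work available as part of a record",
    "xspf":           "a playlist format listing track location title creator",
}
names = list(documents)
# Stop-word removal is built in; stemming would be applied beforehand as in the paper.
vectors = TfidfVectorizer(stop_words="english").fit_transform([documents[n] for n in names])
similarities = cosine_similarity(vectors[0], vectors[1:])  # queries vs. each framework
for name, score in zip(names[1:], similarities[0]):
    print(name, round(score, 4))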
We first note that the results in this table are not comparable with the ontol-
ogy fit results derived in the rest of this article. They are not computed using the
same methodology as defined in § 3. We note that our Music Ontology framework
performs better than the other representation frameworks: it is closer to the
dataset of user queries. These results are due to the fact that our ontology en-
compasses a wider scope of music-related information than the others, which are
dedicated to specific use-cases. For example XSPF is specific to playlists, iTunes
XML and hAudio to simple editorial metadata, Variations3 to music libraries
and AceXML to content-based analysis and machine learning. The lexical cover-
age of the Music Ontology framework is therefore higher. Of course this measure
is very crude and just captures the lexical overlap between specification docu-
ments and our dataset of user queries. It can serve for comparison purposes, but
not to validate our framework against this dataset.
Evaluation at the Conceptual Level. We now want to go beyond this lexical layer.
We try to extract from our dataset a set of underlying topics. We then consider
these topics as our query features and compute an ontology fit measure from
them by following the methodology described in § 3.
We consider that our corpus of musical queries reflects the underlying set of
topics it addresses. A common way of modelling the contribution of these topics
to the i-th word in a given document (in our case a musical query) is as follows:

P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j) \cdot P(z_i = j) \qquad (4)

where T is the number of latent topics, z_i is a latent variable indicating the topic
from which the i-th word was drawn, P(w_i | z_i = j) is the probability of the word
Table 2. Cosine similarities between vectorised specification documents and the casual
users dataset. We use labels and descriptions of terms for web ontologies and textual
specifications for other frameworks.
Ontology Similarity
Music Ontology 0.0812
ID3 version 2.3.0 0.0526
hAudio 0.0375
Musicbrainz 0.0318
XSPF 0.026
ACE XML 0.0208
iTunes XML 0.0182
ID3 version 2.4.0 0.0156
Variations3 FRBR-based model, phase 2 0.0112
FRBR Core & Extended 0.0111
MODS 0.0055
MPEG-7 Audio 0.0013
w_i under the j-th topic, and P(z_i = j) is the probability of choosing a word from
the j-th topic in the current document. For example, in a corpus dealing with
performances and recording devices, P (w|z) would capture the content of the
underlying topics. The performance topic would give high probability to words
like venue, performer or orchestra, whereas the recording device topic would
give high probability to words like microphone, converter or signal. Whether a
particular document concerns performances, recording devices or both would be
captured by P (z).
Latent Dirichlet Allocation (LDA) [6] provides such a model. In LDA,
documents are generated by first picking a distribution over topics from a Dirichlet
distribution, which determines P(z). We then pick a topic from this distribution
and a word from that topic according to P(w|z) to generate the words in
the document. We use the same methodology as in [10] to discover topics.
We use an approximate inference algorithm via Gibbs sampling for LDA [12].
We first pre-process our dataset of musical queries by stemming terms, removing
stop words and removing words that appear in fewer than five queries. Repeated
experiments for different numbers of topics (20, 50, 100 and 200) suggest that a
model incorporating 50 topics best captures our data. We reach the set of topics
illustrated in table 3.
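The topic-extraction step could be sketched as follows, here with gensim as an assumed toolkit; note that gensim's LdaModel uses online variational inference rather than the Gibbs sampler of [12], so this is only an approximation of the described setup, and the function name is illustrative.

# A sketch of topic extraction over the pre-processed queries (token lists).
from gensim import corpora, models

def extract_topics(queries, num_topics=50):
    """queries: list of token lists (stemmed, stop words removed)."""
    dictionary = corpora.Dictionary(queries)
    dictionary.filter_extremes(no_below=5)           # drop words in fewer than 5 queries
    corpus = [dictionary.doc2bow(q) for q in queries]
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics)
    # The topic proportions over the corpus can serve as feature weights (cf. below).
    return lda, dictionary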
We consider these topics as our query features. For each topic, we use its
relative importance in the dataset as a feature weight. We manually map each
topic to terms within our Music Ontology framework to derive the ontology fit
measure described in § 3.3. The corresponding ontology fit measure is 0.723.
However, this measure of the ontology fit is still arguable. Some of the topics
inferred are not sufficiently precise to be easily mapped to Music Ontology terms.
A subjective mapping still needs to be done to relate the extracted topics with
a set of ontology terms. Moreover, some crucial query features are not captured
within the extracted topics. For example a lot of queries include an implicit
Table 3. Top words in the first six topics inferred through Latent Dirichlet Allocation
over our dataset of musical queries
notion of uncertainty (such as “I think the title was something like Walk Away”),
which is not expressible within our ontology.
A possible improvement to the evaluation above would be to use a Correlated
Topic Model [5], which also models the relationships between topics. This would
allow us to not only evaluate the coverage of concepts within our ontology, but
also the coverage of the relationships between these concepts. It remains future
work to develop an accurate measure for evaluating how the inferred graph of
topics can be mapped to an ontological framework. Another promising approach
for ontology evaluation is to estimate how well a generative model based on an
ontology can capture a set of textual data.
We do not follow exactly the manual data analysis methodology used by Lee
et al. [14], which partially structures the original queries by delimiting the parts
that correspond to a particular recurrent feature. Indeed, it is important for our
purpose that we extract the whole logical structure of the query. This will lead
to a more accurate ontology fit measure (but derived from a smaller dataset)
than in the previous sections.
Once these queries have been pre-processed, we assign a weight for each dis-
tinct feature. Such weights are computed as described in § 3.2. We give the main
query features, as well as the corresponding weight and the corresponding Mu-
sic Ontology term, in table 4. We then compute our ontology fit measure as
described in § 3.3. We find an ontology fit measure of 0.749.
Dataset of User Queries. In order to cope with the lack of data availability
for this category of users, we consider re-using the questions selected in Sug-
imoto’s study [22] from a binder of recorded reference questions asked at the
University of North Carolina Chapel Hill Music Library between July 15, 1996
and September 22, 1998. These questions were chosen to cover a typical range
of possible questions asked in a music library. These questions are:
1. What is the address for the Bartok Archive in NY?
2. Can you help me locate Civil War flute music?
3. I am a percussion student studying the piece “Fantasy on Japanese Wood
Prints” by Alan Hovhaness. I wondered if there was any information available
about the actual Japanese wood prints that inspired the composer. If so,
what are their titles, and is it possible to find prints or posters for them?
4. Do you have any information on Francis Hopkinson (as a composer)?
5. What are the lyrics to “Who will Answer”? Also, who wrote this and who
performed it?
5 Conclusion
References
1. Alani, H., Brewster, C.: Metrics for ranking ontologies. In: Proceedings of the 4th
Int. Workshop on Evaluation of Ontologies for the Web (2006)
2. Bainbridge, D., Cunningham, S.J., Downie, S.J.: How people describe their music
information needs: A grounded theory analysis of music queries. In: Proceedings
of the 4th International Conference on Music Information Retrieval (2003)
3. Baumann, S.: A music library in the palm of your hand. In: Proceedings of the
Contact Forum on Digital libraries for musical audio (Perspectives and tendencies
in digitalization, conservation, management and accessibility), Brussels (June 2005)
4. Baumann, S., Klüter, A., Norlien, M.: Using natural language input and audio
analysis for a human-oriented MIR system. In: Proceedings of Web Delivery of
Music (WEDELMUSIC) (2002)
5. Blei, D.M., Lafferty, J.D.: A correlated topic model of Science. The Annals of
Applied Statistics 1(1), 17–35 (2007)
6. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. The Journal of
Machine Learning Research 3(3), 993–1022 (2003)
7. Brewster, C., Alani, H., Dasmahapatra, S., Wilks, Y.: Data driven ontology evalu-
ation. In: Proceedings of the International Conference on Language Resources and
Evaluation, Lisbon, Portugal, pp. 164–168 (2004)
8. Elhadad, M., Gabay, D., Netzer, Y.: Automatic Evaluation of Search Ontologies in
the Entertainment Domain using Text Classification. In: Applied Semantic Tech-
nologies: Using Semantics in Intelligent Information Processing. Taylor and Francis
(2011)
9. Fernández, M., Overbeeke, C., Sabou, M., Motta, E.: What Makes a Good Ontol-
ogy? A Case-Study in Fine-Grained Knowledge Reuse. In: Gómez-Pérez, A., Yu,
Y., Ding, Y. (eds.) ASWC 2009. LNCS, vol. 5926, pp. 61–75. Springer, Heidelberg
(2009)
10. Griffiths, T., Steyvers, M.: Finding scientific topics. Proceedings of the National
Academy of Sciences (2004)
11. Guarino, N., Welty, C.: Evaluating ontological decisions with ONTOCLEAN. Com-
munications of the ACM 45(2), 61–65 (2002)
12. Heinrich, G.: Parameter estimation for text analysis. Technical report, University
of Leipzig & vsonix GmbH, Darmstadt, Germany (April 2008)
13. Lavbic, D., Krisper, M.: Facilitating ontology development with continuous evalu-
ation. Informatica 21(4), 533–552 (2010)
14. Lee, J.H., Downie, J.S., Jones, M.C.: Preliminary analyses of information
features provided by users for identifying music. In: Proceedings of the
International Conference on Music Information Retrieval (2007)
15. Lozano-Tello, A., Gomez-Perez, A.: ONTOMETRIC: A method to choose the ap-
propriate ontology. Journal of Database Management 15(2), 1–18 (2004)
16. Maedche, A., Staab, S.: Measuring Similarity between Ontologies. In: Gómez-Pérez,
A., Benjamins, V.R. (eds.) EKAW 2002. LNCS (LNAI), vol. 2473, pp. 251–263.
Springer, Heidelberg (2002)
17. IFLA Study Group on the Functional Requirements for Bibliographic Records.
Functional requirements for bibliographic records - final report. UBCIM Publica-
tions - New Series, vol.19 (September 1998),
http://www.ifla.org/VII/s13/frbr/frbr1.htm (last accessed March 2012)
18. Porzel, R., Malaka, R.: A task-based approach for ontology evaluation. In: Pro-
ceedings of the ECAI Workshop on Ontology Learning and Population (2004)
19. Raimond, Y., Abdallah, S., Sandler, M., Giasson, F.: The music ontology. In: Pro-
ceedings of the International Conference on Music Information Retrieval, pp. 417–
422 (September 2007)
20. Raimond, Y., Scott, T., Oliver, S., Sinclair, P., Smethurst, M.: Use of Semantic
Web technologies on the BBC Web Sites. In: Linking Enterprise Data, pp. 263–
283. Springer (2010)
21. Saxton, M.L., Richardson, J.V.: Understanding reference transactions. Academic
Press (May 2002)
22. Sugimoto, C.R.: Evaluating reference transactions in academic music libraries.
Master’s thesis, School of Information and Library Science of the University of
North Carolina at Chapel Hill (2007)
23. Vrandečić, D., Sure, Y.: How to Design Better Ontology Metrics. In: Franconi, E.,
Kifer, M., May, W. (eds.) ESWC 2007. LNCS, vol. 4519, pp. 311–325. Springer,
Heidelberg (2007)
24. Zhang, Y., Li, Y.: A user-centered functional metadata evaluation of moving image
collections. Journal of the American Society for Information Science and Technol-
ogy 59(8), 1331–1346 (2008)
The Current State of SKOS Vocabularies
on the Web
1 Introduction
[Fig. 1: concepts labelled 'Emotion', 'Love' (skos:altLabel 'Affection') and 'Joy', linked via skos:prefLabel, skos:broader and skos:narrower]
OWL ontologies into SKOS, where some typicality of SKOS vocabularies may
be useful to guide the conversion.
SKOS1 , accepted as a W3C Recommendation in August 2009, is one of a
number of Semantic Web knowledge representation languages. Other such lan-
guages include the Resource Description Framework (RDF)2 , the RDF Schema
(RDFS)3 , and the Web Ontology Language (OWL)4 . SKOS is a language de-
signed to represent traditional knowledge organization systems, whose representations
have weak semantics and are used for simple retrieval and navigation.
Such systems include thesauri, subject headings, classification schemes,
taxonomies, glossaries and other structured controlled vocabularies.
The basic element in SKOS is the concept, which can be viewed as a ‘unit of
thought’ (an idea, meaning or object) that is subjective and independent of the
term used to label it [1]. Concepts can be semantically linked through
hierarchical and associative relations.
Figure 1 shows an example of the usage of SKOS constructs. There are
four skos:Concept instances in the figure, representing the concepts Emotion, Love,
Joy and Beauty. The skos:broader and skos:narrower properties are used
to show that the concepts are hierarchically arranged, while the skos:related
property is used to show the associative relations between the concepts. SKOS
provides three properties for associating lexical labels to conceptual resources;
skos:prefLabel, skos:altLabel and skos:hiddenLabel. SKOS documenta-
tion properties such as skos:definition are used to provide additional textual
information regarding the concept.
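The example of Figure 1 could be written down, for instance, with rdflib; the namespace and the choice of which concepts are marked as related are illustrative, not prescribed by the paper.

# A sketch of the Fig. 1 example expressed with rdflib (illustrative namespace).
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/emotions/")
g = Graph()
for name in ("Emotion", "Love", "Joy", "Beauty"):
    g.add((EX[name], RDF.type, SKOS.Concept))
    g.add((EX[name], SKOS.prefLabel, Literal(name)))
g.add((EX.Love, SKOS.altLabel, Literal("Affection")))
# Hierarchical relations between the concepts.
g.add((EX.Emotion, SKOS.narrower, EX.Love))
g.add((EX.Emotion, SKOS.narrower, EX.Joy))
g.add((EX.Love, SKOS.broader, EX.Emotion))
# An associative relation (illustrative choice of concept pair).
g.add((EX.Love, SKOS.related, EX.Beauty))
print(g.serialize(format="turtle"))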
One of SKOS’ aims is to provide a bridge between different communities of
practice within the Library and Information Sciences and the Semantic Web
communities. This is accomplished by transferring existing models of knowledge
organization systems to the Semantic Web technology context [1]. Knowledge
organization system (KOS) is a general term, referring to the tools that present
the organized interpretation of knowledge structures [2]. This includes a variety
1 http://www.w3.org/2004/02/skos/
2 http://www.w3.org/RDF/
3 http://www.w3.org/2001/sw/wiki/RDFS
4 http://www.w3.org/2001/sw/wiki/OWL
of schemes that are used to organize, manage and retrieve information. There
are several types of KOS. Hodge [3] groups them into three general categories:
Term lists (flat vocabularies): emphasize lists of terms, often with defini-
tions; e.g., authority files, glossaries, dictionaries, gazetteers, code lists;
Classifications and categories (multi-level vocabularies): emphasize the
creation of subject sets; e.g., taxonomies, subject headings, classification
schemes; and
Relationship lists (relational vocabularies): emphasize the connections be-
tween terms and concepts; e.g., thesauri, semantic networks, ontologies.
We wish to find how many SKOS vocabularies are publicly available for use on
the Web and how many fall into one of the listed categories. Additionally, we
are interested in learning what the SKOS vocabularies look like in terms of size,
shape and depth of the vocabulary structure. We are interested in understanding
which of the SKOS constructs listed in the SKOS Reference document [1] are
actually being used in the SKOS vocabularies and how often these constructs
are used.
Related Work. While research has attempted to characterize Semantic Web
documents such as OWL ontologies and RDF(S) documents on the Web [4,5,6],
to the authors' knowledge there has been no attempt at characterizing SKOS
vocabularies. Our approach is similar to the work of Wang et al. [4], which
focused on OWL ontologies and RDFS documents.
Identifying SKOS vocabularies. For the purposes of this survey, we used the following
definition of a SKOS vocabulary: a SKOS vocabulary is a vocabulary that
at the very least contains SKOS concept(s) used directly, or SKOS constructs
that indirectly imply the use of a SKOS concept, such as the use of SKOS semantic
relations.
Each candidate SKOS vocabulary was screened in the following way to identify
it as a SKOS vocabulary:
Collecting survey data. Since we are interested in both the asserted and inferred
version of the SKOS vocabularies, we performed the data recording in two stages:
with and without invocation of an automatic reasoner such as Pellet or JFact.
The data collected without invocation of an automatic reasoner may suggest
the actual usage of SKOS constructs in those vocabularies. As for the rest of
the analysis, such as identifying the shape and structure of the vocabularies, we
need to collect the data from the inferred version of the vocabularies, hence the
use of an automatic reasoner.
In the first stage of data collection, no automatic reasoner is invoked, since
we are interested in the asserted version of the SKOS vocabularies. For each
candidate SKOS vocabulary, we count and record the number of instances for
all SKOS constructs listed in the SKOS Reference [1]. We also record the IRIs of
all Concept Schemes present in each SKOS vocabulary.
In the second stage of data collection, we applied a reasoner, and collected
and recorded the following for each SKOS vocabulary:
Filtering out multiple copies of the same SKOS vocabularies. We use the recorded
information in the previous stage to filter structurally identical SKOS vocabu-
laries. We compare the Concept Scheme IRI to search for duplicate vocabularies.
For two or more vocabularies having the same Concept Scheme IRI, we com-
pare the record for each SKOS construct count. We make a pairwise comparison
between each SKOS construct count, taking two vocabularies at a time.
1/ If the two vocabularies have identical records, we then check the content
of these vocabularies. This is done by making a pairwise comparison between
the instances of skos:Concept in one vocabulary to the other. If the two vo-
cabularies have the same instances of skos:Concept, then one copy of these
vocabularies is kept and the duplicate vocabulary is removed. Otherwise, follow
the next step.
2/ If the two vocabularies do not have identical records or identical instances
of skos:Concept, we assume that one vocabulary is a newer version of the other.
If the two vocabularies belong to the same category (Thesaurus, Taxonomy, or
Glossary), we keep the latest version of the vocabulary and remove the older one.
Otherwise, both vocabularies are kept.
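A rough sketch of this filtering step is given below, assuming each vocabulary has been summarised into a small record; the field names and the version handling are illustrative, not taken from the paper.

# A sketch of duplicate filtering over summarised vocabulary records.
def is_duplicate(v1, v2):
    """Vocabularies with the same Concept Scheme IRI are duplicates if they agree
    on every SKOS construct count and on their sets of skos:Concept IRIs."""
    return (v1["scheme_iri"] == v2["scheme_iri"]
            and v1["construct_counts"] == v2["construct_counts"]
            and v1["concepts"] == v2["concepts"])

def filter_vocabularies(vocabularies):
    kept = []
    for v in vocabularies:
        handled = False
        for i, k in enumerate(kept):
            if v["scheme_iri"] != k["scheme_iri"]:
                continue
            if is_duplicate(v, k):
                handled = True                      # identical copy: keep only one
                break
            if v["category"] == k["category"]:      # assume the newer version supersedes
                if v["version"] > k["version"]:
                    kept[i] = v
                handled = True
                break
        if not handled:
            kept.append(v)
    return kept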
Analysing the survey results. In analysing the collected data, we found that some
of the vocabularies that are known to be SKOS vocabularies fail in Step 2 (to
be identified as a SKOS vocabulary). We manually inspected these vocabularies.
We found several patterns of irregularity in the vocabulary representation and
considered them as modelling slips made by ontology engineers when authoring
the vocabularies. For each type of modelling slip, we decide whether the error is
intentional or unintentional, and if fixing the error would change the content of
the vocabulary. If the error is unintentional and fixing the error does not change
the content of the vocabulary, then we can apply fixing procedures to correct
the modelling slips. All fixed vocabularies were included in the survey for further
analysis.
We calculated the mode, mean, median and standard deviation for the occur-
rence of each construct collected from the previous process. The analysis focused
on two major aspects of the vocabularies; the usage of SKOS constructs and the
structure of the vocabularies. In terms of the usage of SKOS constructs, we
analysed which constructs were most used in the SKOS vocabularies. As for the
structural analysis of the SKOS vocabulary, we introduced a SKOS metric, M,
defined as the 8-tuple (S, D, L, R, MAXB, FH, BH, FA):

[Two example vocabulary structures: Example 1 with five nodes (a-e) and Example 2 with eight nodes (a-h)]

    M    S  D  L  R  MAXB  FH  BH  FA
    MA   5  2  0  1  1     2   1   0
    MB   8  2  3  1  1     2   1   0
Note that even though FH and BH are the same for both examples, because
each example has the same skos:narrower and skos:broader relations, the
structures of the two examples are different. However, by looking at S, L and R,
we may distinguish the structure of Example 1 from that of Example 2.
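A few of these structural quantities could be computed along the following lines, assuming the inferred skos:broader relation is available as a dictionary; the precise definitions of the branching factors (FH, BH, FA) are not reproduced here, and the reading of "roots" and "loose concepts" is an assumption.

# A sketch of some components of the structural metric M.
def structural_metrics(broader):
    """broader: dict mapping a concept to the set of its skos:broader concepts."""
    concepts = set(broader) | {b for bs in broader.values() for b in bs}
    has_narrower = {b for bs in broader.values() for b in bs}   # concepts with children
    S = len(concepts)                                           # size
    loose = [c for c in concepts if not broader.get(c) and c not in has_narrower]
    roots = [c for c in concepts if not broader.get(c) and c in has_narrower]
    MAXB = max((len(bs) for bs in broader.values()), default=0) # max skos:broader per concept

    def depth(c, seen=()):
        parents = broader.get(c) or ()
        if c in seen or not parents:
            return 0
        return 1 + max(depth(p, seen + (c,)) for p in parents)

    D = max((depth(c) for c in concepts), default=0)            # maximum depth
    return {"S": S, "D": D, "L": len(loose), "R": len(roots), "MAXB": MAXB}

# Example reproducing the MB row above: S=8, D=2, L=3, R=1, MAXB=1.
broader = {"b": {"a"}, "c": {"b"}, "d": {"a"}, "e": {"d"},
           "f": set(), "g": set(), "h": set()}
print(structural_metrics(broader))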
Following the categories of KOS discussed in Section 1, we also defined some
rules using the SKOS metric, M, to categorise the vocabularies in our corpus:
vocabulary but never used. The fixing procedure for this type of modelling slip
was to replace the skos:narrower property with the skos:member property to
show the relationship between a member and its collection. Applying the fixing
procedure fixed 1 SKOS vocabulary.
- invalid datatype usage, such as an invalid dateTime, integer or string datatype.
This mistake was considered an unfixable error. There were 14 SKOS vocabularies
excluded from the survey.
Type 3: Use of unsupported datatype. This was reported by the reasoner when
encountering a user-defined datatype. Note that this is not strictly a modelling
slip but rather a consequence of limitations of the reasoner, which hindered us
from obtaining the required data. To deal with this case, we first checked whether
the user-defined datatype was actually used to type data in the vocabulary.
If the datatype was not in use, we excluded it from the datatype list
and reclassified the vocabulary. There were 9 vocabularies excluded from the
survey.
Fixing the modelling slips resulted in 24 additional SKOS vocabularies in-
cluded in the corpus, which gave us the final number of 478 SKOS vocab-
ularies. The summary figures and reasons for exclusion are presented in Ta-
bles 1 and 2. The full results and analysis can be found at
http://www.myexperiment.org/packs/237.
Figure 3(a) shows the percentage of the SKOS construct usage. For each
SKOS construct, a SKOS vocabulary was counted as using the construct if the
construct is used at least once in the vocabulary. In this stage, we only counted
the asserted axioms in each of the SKOS vocabularies.
Of all the SKOS constructs that are made available in the SKOS
Recommendation [1], skos:Concept, skos:prefLabel, and skos:broader are
the three most used constructs in the vocabularies, with 95.6%, 85.6% and 69.5%,
respectively.

[Fig. 3: (a) usage of SKOS constructs; (b) usage by type of SKOS vocabulary]

28 out of 35 SKOS constructs were used in less than 10% of the
vocabularies. There were eight SKOS constructs that were not used in any of
the vocabularies.
The rules in Section 2 were used to categorise the vocabularies following the
types of KOS as described in Section 1. Figure 3(b) shows a chart representing
different types of SKOS vocabulary. As shown in this figure, 61% or 293 of the
vocabularies are categorised as Taxonomy. The second largest type is Glossary
with 27% or 129 vocabularies. 11% or 54 vocabularies fell into the Thesaurus
category.
The remaining 1% or 2 vocabularies were categorised as Others. Further in-
spection revealed that the two vocabularies in the Others category are a Glos-
sary with 4 skos:related properties on 2 pairs of the concepts (out of 333
concepts) and a snippet of a real SKOS vocabulary intended to show the use of
the skos:related property. We decided to reclassify these two vocabularies,
placing the first into the Glossary category and the other into the Thesaurus category.
Figure 4 plots the size of SKOS vocabularies and the maximum depth of the
vocabulary structure. Each subgraph represents different categories of SKOS
vocabulary; Thesaurus, Taxonomy and Glossary. Within each category, the vo-
cabularies are sorted according to their size in descending order. The size of
SKOS vocabulary was calculated based on the number of SKOS concepts in the
vocabulary. The maximum depth of vocabulary was calculated based on hier-
archical relations; skos:broader and skos:narrower; in the vocabulary. Fig-
ure 4(a) shows that the smallest size of vocabulary for the Thesaurus category
is 3 concepts and the largest is 102614 concepts. The maximum depth of the
vocabulary structure ranged from 1 to 13 levels. For the Taxonomy category as
shown in Figure 4(b), the smallest size of vocabulary was 2 concepts and the
largest was 75529 concepts. The maximum depth of the vocabulary structure
ranged from 1 to 16 levels. Figure 4(c) plots size of vocabularies from the Glos-
sary category. The smallest size of vocabulary is 1 concept and the largest is
40945 concepts. The maximum depth of the vocabulary structure for the Tax-
onomy category ranged from 1 to 16 levels. Note that there was no maximum
depth for the Glossary category, because no hierarchical relations were present
in the vocabularies of this category.
Figures 5(a) and 5(b) show the number of loose concepts, root concepts and
maximum skos:broader relations for each vocabulary structure. For the Thesaurus
category, the number of loose concepts ranged from 0 (meaning all concepts
are connected to at least one other concept) to 4426. The number of root
concepts ranged from 1 to 590. The maximum skos:broader
Fig. 5. (a) Number of loose concepts, root concepts, maximum skos:broader rela-
tions, hierarchical branching factors, and associative branching factors of vocabulary
structure
4 Discussion
As shown in Figure 3(a), of all the SKOS constructs that are available in the
SKOS standards, only 3 out of 35 SKOS constructs are used in more than 60%
15 http://labs.mondeca.com/skosReader.html
16 http://poolparty.punkt.at/wp-content/uploads/2011/07/Survey Do Controlled Vocabularies Matter 2011 June.pdf
search, data integration and structure for content navigation are the main ap-
plication areas for controlled vocabularies. Standards like SKOS have gained
greater awareness amongst the participants, which shows that the web-paradigm
has entered the world of controlled vocabularies.
The results show that there is no direct relationship between the size of a vocabulary
and the maximum depth of its structure. We can see from the graph that
the maximum depth of small vocabularies is similar to that of large vocabularies
for both the Thesaurus and Taxonomy categories.
For the Thesaurus category, 43 out of 55 or 80% of the vocabularies have a
maximum number of skos:broader relations greater than one. This means that at least
one of the concepts in these vocabularies has more than one broader concept,
which makes them poly-hierarchies. 93% of the vocabularies have more
than one root concept, which means that these vocabularies have the shape of multi-
trees. As for the Taxonomy category, 76% of the vocabularies have a maximum number of
skos:broader relations greater than one and 81% of the vocabularies have more
than one root concept.
As for the hierarchical and associative branching factor results, we found one
anomaly in the Taxonomy category, where one of the vocabularies had a depth D
of 1 and only one root concept. One particularly noteworthy value of the metric is
that the hierarchical FBF, FH, is 2143, which is one less than the total vocabulary
size, S, of 2144. This vocabulary is an outlier, with a single root node
and a very broad, shallow hierarchy. If we were to exclude this vocabulary, the
range of the hierarchical FBF for the Taxonomy category would be between 1 and 56. The
hierarchical FBF, alongside hierarchy depth, is important in determining the
indexing and search time for a particular query [12].
5 Conclusion
Our method for collecting and analysing SKOS vocabularies has enabled us
to gain an understanding of the type and typicality of those vocabularies. We
found out that all but two of the SKOS vocabularies that we collected from the
Web fell into one of the categories listed by the traditional KOS; flat vocabular-
ies (Glossary), multi-level vocabularies (Taxonomy) and relational vocabularies
(Thesaurus). In the future, we plan to select several SKOS vocabularies from
each category and study them in more detail in terms of the use and function of
the vocabularies in applications.
Based on the results of this survey, a typical taxonomy looks like a polyhierarchy
that is 2 levels deep, with a hierarchical forward branching factor (FBF_H) of 10
concepts and a hierarchical backward branching factor (BBF_H) of 3 concepts.
A typical thesaurus also looks like a polyhierarchy, 6 levels deep, with an
FBF_H of 3 concepts and a BBF_H of 2 concepts, additionally having associative
relationships with an associative forward branching factor (FBF_A) of 1 concept.
In this survey, we collected 478 vocabularies that according to our definition
are SKOS vocabularies. Three years after becoming a W3C Recommendation,
the use of SKOS remains low. However, our total of SKOS vocabularies may
be artificially low, with some being hidden from our collection method. The
reasons for some of these vocabularies not being accessible to an automated
process could be confounding factors such as proprietary issues, vocabularies
stored within SVN repositories, etc.
done our best to deploy various methods in collecting SKOS vocabularies that
are publicly available on the Web.
Acknowledgements. Nor Azlinayati Abdul Manaf is receiving the scholarship
from Majlis Amanah Rakyat (MARA), an agency under the Malaysian Govern-
ment, for her doctorate studies. Many thanks to the reviewers who gave insightful
comments and suggestions to improve this paper.
References
1. Miles, A., Bechhofer, S.: SKOS simple knowledge organization system reference.
W3C recommendation, W3C (2009)
2. Zeng, M.L., Chan, L.M.: Trends and issues in establishing interoperability among
knowledge organization systems. JASIST 55(5), 377–395 (2004)
3. Hodge, G.: Systems of Knowledge Organization for Digital Libraries: Beyond Tra-
ditional Authority Files. Council on Library and Information Resources (2000)
4. Wang, T.D., Parsia, B., Hendler, J.: A Survey of the Web Ontology Landscape. In:
Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M.,
Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 682–694. Springer, Heidelberg
(2006)
5. Ding, L., Finin, T.W.: Characterizing the Semantic Web on the Web. In: Cruz, I.,
Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo,
L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 242–257. Springer, Heidelberg (2006)
6. Tempich, C., Volz, R.: Towards a benchmark for semantic web reasoners - an
analysis of the DAML ontology library. In: EON (2003)
7. Poole, D.L., Mackworth, A.K.: Artificial Intelligence: Foundations of Computa-
tional Agents. Cambridge University Press, New York (2010)
8. van Assem, M., Malaisé, V., Miles, A., Schreiber, G.: A Method to Convert The-
sauri to SKOS. In: Sure, Y., Domingue, J. (eds.) ESWC 2006. LNCS, vol. 4011,
pp. 95–109. Springer, Heidelberg (2006)
9. Panzer, M., Zeng, M.L.: Modeling classification systems in SKOS: some challenges
and best-practice recommendations. In: Proceedings of the 2009 International Con-
ference on Dublin Core and Metadata Applications, Dublin Core Metadata Initia-
tive, pp. 3–14 (2009)
10. Summers, E., Isaac, A., Redding, C., Krech, D.: LCSH, SKOS and linked data. In:
Proceedings of the 2008 International Conference on Dublin Core and Metadata
Applications (DC 2008), Dublin Core Metadata Initiative, pp. 25–33 (2008)
11. Binding, C.: Implementing Archaeological Time Periods Using CIDOC CRM and
SKOS. In: Aroyo, L., Antoniou, G., Hyvönen, E., ten Teije, A., Stuckenschmidt,
H., Cabral, L., Tudorache, T. (eds.) ESWC 2010, Part I. LNCS, vol. 6088, pp.
273–287. Springer, Heidelberg (2010)
12. Roszkowski, M.: Using taxonomies for knowledge exploration in subject gateways.
In: Proceedings of the 17th Conference on Professional Information Resources,
INFORUM 2011 (2011)
The ISOcat Registry Reloaded
1 Introduction
The linguistics community has accumulated a tremendous amount of research
data over the past decades. It has also realized that the data, being the back-bone
of published research findings, deserves equal treatment in terms of archiving
and accessibility. For the sustainable management of research data, archiving
infrastructures are being built, with metadata-based issues taking center stage.
Metadata schemes need to be defined to adequately describe the large variety of
research data. The construction of schemas and the archiving of resources will
be conducted locally, usually in the place where the research data originated.
To ensure the interoperability of all descriptional means, the ISOcat data cate-
gory registry has been constructed (see http://www.isocat.org). The registry,
implementing ISO12620:2009 [4], aims at providing a set of data categories for
the description of concepts and resources in various linguistic disciplines (syn-
tax, semantics, etc.), but also features a section on metadata terms, which is our
primary concern in this paper. Linguists giving a metadata-based description
of their research data are solicited to only use metadata descriptors from the
registry. When the registry lacks an entry, researchers are encouraged to extend
it by defining new data categories. The registry has a governing body to ensure
the quality and merit of all entries submitted for standardization. It is hoped
that its grass-roots nature helps define a sufficiently large set of metadata
descriptors, of which a standardized subset reflects a consensus across a large user base.
While the grass-roots approach is appealing, the organization of the registry’s
content as a glossary of descriptors with little structure is problematic. With the
metadata term set now containing 450+ entries, with new entries added regu-
larly, it becomes increasingly hard to browse and manage its content. To address
this issue, we propose to re-organise the rather flat knowledge structure into a
more systematic, formal and hierarchical representation, following in the footsteps
of schema.org. The new structure can serve as a complement to the existing
one, giving users an alternative and more accessible entry point to the registry.
The remainder of this paper is structured as follows: Sect. 2 gives an account
of the ISOcat registry. In Sect. 3, we describe our ontological reconstruction
and re-engineering to represent the contents of the glossary by a hierarchically-
structured concept scheme. Sect. 4 discusses ontology engineering issues and
sketches future work, and Sect. 5 concludes.
Since this encodes the notion of genus-differentia definitions, it is clear that any
set of (related) definitions induces a concept system. Notwithstanding, the DCR
Style Guidelines also point out that “concept systems, such as are implied here
by the reference to broader and related concepts, should be modeled in Relation
Registries outside the DCR.” In line with the policy to disallow (formal) rela-
tionships between complex data categories, the DCR guidelines continue saying
“Furthermore, different domains and communities of practice may dif-
fer in their choice of the immediate broader concept, depending upon
any given ontological perspective. Harmonized definitions for shared DCs
should attempt to choose generic references insofar as possible.”
This policy can induce quite some tension or confusion. While the definition of a
DC must reference a superordinate concept, it should reference a rather generic
than a rather specific superordinate concept. Moreover, superordinate concepts
in the definiens are referenced with natural-language expressions rather than
with formal references (say, by pointing to existing terms of the registry).
In the sequel, we will present a reconstruction of a concept system from the
many hundred data category entries and their definitions. This concept system
then makes formally explicit the relationships between ISOcat terms and the
concepts they denote. The concept system could then be seen as a more formal
(and complementary) account of the ISOcat registry; the system could, in fact,
be understood as the possible content of a relation registry making explicit all
relations between the ISOcat Metadata data categories.
<owl:Class rdf:ID="Corpus">
  <rdfs:subClassOf rdf:resource="#Resource"/>
  <owl:oneOf rdf:parseType="Collection">
    <owl:Thing rdf:about="#ComparableCorpus"/>
    <owl:Thing rdf:about="#ParallelCorpus"/>
    <owl:Thing rdf:about="#Treebank"/> [...]
  </owl:oneOf>
</owl:Class>
class hierarchy of linguistic resources, despite the fact that their definitions fail
to follow the DCR Style Guidelines promoting genus-differentia definitions.5
Fig. 3 depicts our initial class skeleton that we derived from the DCs given
in the table. Its top class (just below Thing) stems from the complex/open DC
/resourceClass/. The elements cited in its example section, however, should
5 Two entries have explanation sections adding to their definitions. The explanation
section of DC-3900 mentions “type[s] of written resource such as Text, Annotation,
Lexical research, Transcription etc”, whereas the respective section of DC-3901 men-
tions that “[d]ifferent types of written resources have different controlled vocabularies
for SubType: the type ’Lexical research’ has as SubType vocabulary {dictionary, ter-
minology, wordlist, lexicon,...}. In case the Written Resource Type is Annotation the
SubType specifies the type of annotation such as phonetic, morphosyntax etc.”
Incompleteness. With the class hierarchy giving a bird's-eye view on the ISO-
cat glossary, several gaps can be easily spotted. For many of the main classes,
there are no corresponding entries in the ISOcat registry. Entries are missing,
for instance, for “resource”, “lexicon”, “corpus”, “experiment” etc. although ref-
erences are made to them in the definition and example sections of many DCs.
There are many minor gaps. There is, for instance, the DC-2689 /audioFileFormat/,
but there are no corresponding DCs for “videoFileFormat”, “doc-
umentFileFormat” etc. Moreover, there are DCs of type complex/open but their
type could be complex/closed. DC-2516 /derivationMode/, for instance, could
easily be closed by adding the simple DC “semi-automatic” to the existing values
“manual” and “automatic”.
There are cases where a data category’s association with its profiles is in-
complete. There is, for instance, DC-2008 /languageCode/; it is only associated
with the TDG Morphosyntax. Instead of also associating this DC with the TDG
Metadata, users have created yet another, but conceptually identical DC, namely
/languageId/ (DC-2482), and have associated it with the TDG Metadata.
relationships between complex/closed DCs and the simple DCs of their value
range. Moreover, there is ample opportunity to connect to existing ontologies
instead of inventing terminology anew.
We believe that our re-representation addresses these issues. It has the po-
tential to serve the goals of the ISOcat user community; it adds to the precision
of the ISOcat metadata-related content and groups together entries that are
semantically related; its hierarchical structure gives users a bird's-eye view to
better access and manage a large repertoire of expert terminology.
4 Discussion
4.1 Expert Vocabulary: From Glossary to Ontology
The ISOcat registry is designed as a glossary of terms, and this design can
quickly be understood by a large user base without expertise in knowledge re-
presentation. Users can easily define a data category whenever they believe such
an entry is missing. ISOcat’s ease-of-use is also its fundamental shortcoming,
however. The definition of a DC is given in natural language, and hence, is
inherently vague and often open to multiple interpretations. Also, ISOcat entries
vary in style and quality, given the collaborative authoring effort. The increasing
size of the TDG Metadata, now containing more than 450 terms, its glossary-
like organization, the current data curation policy of the registry – authors can
only modify the entries they own – may prompt users to rashly define their own
data category instead of identifying and re-using an appropriate existing one.
Nevertheless, it is hoped that a standardization process, once set in motion, will
lead to an expert vocabulary most linguists agree upon.
It is clear that the definitions of the ISOcat metadata terms spawn a concept
system. Simple DCs are related to complex DCs because they appear in the
value range of the latter, and it is also possible to define subsumption relations
between simple DCs. Moreover, genus-differentia definitions relate to each other
definiendum and definiens. The non-adherence of authors to good practise when
defining, potentially prompted by a policy that disallows formal relationships
between complex DCs, is responsible for many of the weaknesses identified.
It is argued that relationships between complex DCs should be represented
in a relation registry [8]. DCR authors are encouraged to keep the definitions of
their entries deliberately vague so that this vagueness can then be addressed –
in varying manner – externally by using the relation registry. While the relation
registry is currently used for the mapping of terms from different TDGs or
vocabularies (using SKOS-like relation types such as broader and narrower, see
[6]), we find it questionable whether this is a viable approach for intra-vocabulary
mapping within the TDG Metadata. It would be unclear, e.g., how to draw
the line between explicitly and implicitly defined relations in the ISOcat data
category registry and those defined in the relation registry, and confusion could
arise when the content of the two registries contradicts each other.
In fact, the concept scheme we derived from our analysis could be seen as an
incarnation of the relation registry. But in light of the previous discussion, it
must be an officially sanctioned one, aiming at giving an adequate account of
ISOcat metadata-related content.
The concept scheme can serve as a tool to better browse and manage the ISO-
cat term registry for metadata. It can inform curation efforts to render precise
the definition of existing entries, or to create new entries to fill the gaps made
obvious by our ontological reengineering. For this, the concept scheme and the
ISOcat registry need to be synchronized. This can be achieved by enforcing the
policy that authors of new DCs must somehow provide anchor points that link
a DC to a node in the hierarchy. Reconsider the entry /resourceClass/ (cf.
Table 1, page 7). It could be “semantically enriched” by making explicit the
class hierarchy that is only implicitly given in the informal language definition
of the entry: Resource is a class. Corpus is a subclass of Resource.
Lexicon is a subclass of Resource etc.
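Such an explicit statement of the hierarchy could look as follows with rdflib; the base IRI and the selection of subclasses are illustrative rather than actual ISOcat identifiers.

# A sketch of the "semantic enrichment" described above: the class hierarchy that
# is only implicit in the definition of /resourceClass/ is stated explicitly.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

ISOCAT = Namespace("http://www.isocat.org/datcat/")   # hypothetical base IRI
g = Graph()
g.add((ISOCAT.Resource, RDF.type, OWL.Class))
for sub in ("Corpus", "Lexicon", "Experiment"):
    g.add((ISOCAT[sub], RDF.type, OWL.Class))
    g.add((ISOCAT[sub], RDFS.subClassOf, ISOCAT.Resource))
print(g.serialize(format="turtle"))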
The semantic enrichment of the DC’s definition could then prompt users to
create entries for “corpus”, “lexicon”, “experiment” etc. Alternatively, and more
in line with common usage in many dictionaries, users could be encouraged to
associate the term being defined with broader, narrower, or related terms.
We hope that our concept scheme serves as a starting, reference and entry
point to the content of the ISOcat metadata-related vocabulary. For this, it needs
to be “in sync” but also officially sanctioned to better reflect, at any given time,
the content of the ISOcat registry. Our concept scheme, when understood as a
“relation registry”, has the advantage that – by following schema.org and its
OWL version (see http://schema.org/docs/schemaorg.owl) – it is based on
existing, open, and widely-used W3C standards. Future work will address how to
best profit from this technology in terms of sharing vocabulary with schema.org
and distributing metadata about linguistic resources using microformats.
5 Conclusion
The ISOcat registry has taken a central role in those parts of the linguistics
community that care about metadata. Its low-entry barrier allows users to con-
tribute towards a set of terms for the description of linguistic resources. The
ISOcat registry will continue to serve this role, but the registry and its users can
profit from the provisions we have outlined. With the re-representation of the
ISOcat metadata registry into a hierarchical structure, we have gained a bird's-
eye view of its content. Our work unveiled current shortcomings of the ISOcat
registry from a knowledge representation perspective, where class hierarchies are
often constructed centrally and in a systematic and top-down manner.
Many of the problems that we have highlighted are typical for distributed work
on a lexicographic resource; here, contributors often take a local stance asking
whether a glossary contains a certain term suitable for some given application
of the term, or not. With a glossary growing to many hundred entries, it is
not surprising that there will be two or more entries denoting the same concept
(synonymy), or two entries sharing the same data category name having different
(homonymy) or only partially overlapping (polysemy) meanings, etc.
A large part of our critique could be addressed by pointing out the “private”
nature of the DCs in the TDG Metadata. Once the standardization process of the
data categories gains traction, many of the issues can be addressed and solved.
We believe that our ontological approach would greatly support this process.
DC owners are encouraged to consult our formalization and check whether
their entries can be improved by the bird's-eye view now at their disposal. The
standardization body is encouraged to use our ontology to identify “important”
DCs and schedule them for standardization. For users of the registry, it serves as an
efficient access method complementing existing search and browse functionality.
Our hierarchy is one of possibly many interpretations of the ISOcat metadata
registry. With on-going work on the registry, it will need to be revised accord-
ingly. Note that we do not seek to replace the TDG Metadata with the ontology
we have reconstructed by interpreting its content. It is intended to support ex-
isting workflows in order to obtain an ISOcat-based metadata repertoire that
progresses towards the completeness and high-quality of its entries.
The URL http://www.sfs.uni-tuebingen.de/nalida/isocat/ points to the
current version of the hierarchy. Feedback is most welcome!
References
1. DCR Style Guidelines. Version ”2010-05-16”,
http://www.isocat.org/manual/DCRGuidelines.pdf (retrieved December 5, 2011)
2. Data Category specifications. Clarin-NL ISOcat workshop (May 2011),
http://www.isocat.org/manual/tutorial/2011/ISOcat-DC-specifications.pdf
(retrieved December 5, 2011)
3. Int’l Organization of Standardization. Data elements and interchange formats –
Information interchange – Representation of dates and times (ISO-8601), Geneva
(2009)
4. Int’l Organization of Standardization. Terminology and other language and content
resources - Specification of data categories and management of a Data Category
Registry for language resources (ISO-12620), Geneva (2009)
5. Int’l Organization of Standardization. Terminology work – Principles and methods
(ISO-704), Geneva (2009)
6. Schuurman, I., Windhouwer, M.: Explicit semantics for enriched documents. What
do ISOcat, RELcat and SCHEMAcat have to offer? In: Proceedings of Supporting
Digital Humanities, SDH 2011 (2011)
7. Soldatova, L.N., King, R.D.: An ontology of scientific experiments. Journal of the
Royal Society Interface 3(11), 795–803 (2006)
8. Wright, S.E., Kemps-Snijders, M., Windhouwer, M.A.: The OWL and the ISOcat:
Modeling Relations in and around the DCR. In: LRT Standards Workshop at LREC
2010, Malta (May 2010)
SCHEMA - An Algorithm for Automated
Product Taxonomy Mapping in E-commerce
1 Introduction
In recent years the Web has increased dramatically in both size and range,
playing an increasingly important role in our society and world economy. For
instance, the estimated revenue for e-commerce in the USA grew from $7.4
billion in 2000 to $34.7 billion in 2007 [10]. Furthermore, a study by Zhang et
al. [25] indicates that the amount of information on the Web currently doubles
in size roughly every five years. This exponential growth also means that it is
becoming increasingly difficult for a user to find the desired information.
To address this problem, the Semantic Web was conceived to make the Web
more useful and understandable for both humans and computers, in conjunction
with usage of ontologies, such as the GoodRelations [9] ontology for products.
Unfortunately, as it stands today, the vast majority of the data on the Web has
not been semantically annotated, resulting in search failures, as search engines
do not understand the information contained in Web pages. Traditional keyword-
based search cannot properly filter out irrelevant Web content, leaving it up to
the user to pick out relevant information from the search results.
2 Related Work
The field of taxonomy or schema mapping has generated quite some interest
in recent years. It is closely related to the field of ontology mapping, with one
important difference: whereas for matching of taxonomies (hierarchical struc-
tures), and schemas (graph structures), techniques are used that try to guess
the meaning implicitly encoded in the data representation, ontology mapping
algorithms try to exploit knowledge that is explicitly encoded in the ontolo-
gies [22]. In other words, due to the explicit formal specification of concepts and
relations in an ontology, the computer does not need to guess the meaning. In
order to interpret the meaning of concepts in an ontology or schema, algorithms
often exploit the knowledge contained in generalised upper ontologies, such as
SUMO [18] or WordNet [17]. In this way the semantic interoperability between
different ontologies is enhanced, facilitating correct matching between them. The
3 SCHEMA
This section discusses the SCHEMA framework, together with all the assump-
tions for our product taxonomy matching algorithm. Figure 1 illustrates the
high-level overview of the framework. This sequence of steps is executed for ev-
ery category in the source taxonomy. First, the name of the source category is
disambiguated, to acquire a set of synonyms of the correct sense. This set is used
to find candidate categories from the target taxonomy, and is needed to account
for the varying denominations throughout taxonomies. After the Candidate Tar-
get Category Selection, every candidate category path is compared with the path
of the source category, by means of the Candidate Target Path Key Comparison.
The best-fitting candidate target category is selected as the winner. The objec-
tive of SCHEMA is to map source categories to a selected target category if and
only if all products in the source category fit in the selected target category.
This reflects our definition of a successful and meaningful category mapping.
First, the general assumptions — the basis for the development of SCHEMA
— are explained. Next, each step of the framework, as shown in Fig. 1, will be
discussed in more detail.
Fig. 2. Mapping example for Overstock (top) to Amazon (bottom) categories. Normal
lines indicate a parent-child relationship; dashed lines indicate SCHEMA’s mapping.
that there possibly is no direct match for a very specific source category in the
target taxonomy. In such a case, it makes sense to match the source category
to a more general target category, as from a hierarchical definition, products
from a specific category should also fit into a more general class. Figure 2 shows
that category ‘Books’ (Overstock) is mapped to ‘Books’ (Amazon), as one would
expect. Unfortunately, there is no direct match for ‘Humor Books’ (Overstock)
in Amazon. However, humor books are also a kind of books, so SCHEMA will
map this category to the more general ‘Books’ category from Amazon. The more
general category is found by following the defined mapping for the parent of the
current source category. Note that root mappings are precluded.
SCHEMA’s last assumption is that, as capitalisation in category names
does not affect the meaning, all lexical matching is performed case-insensitively.
The first step in creating a mapping for a category from the source taxonomy,
is to disambiguate the meaning of its name. As different taxonomies use vary-
ing denominations to identify the same classes, it is required that synonyms of
the source category label are taken into account for finding candidate target
categories. However, using all synonyms could result in inclusion of synonyms
of a faulty sense, which could for example cause a ‘laptop’ to be matched with
a book to write notes (i.e., a notebook). To account for this threat, SCHEMA
uses a disambiguation procedure in combination with WordNet [17], to find only
synonyms of the correct sense for the current source category. This procedure is
based on context information in the taxonomy, of which can be expected that
it gives some insight into the meaning of the source category name. Concerning
the general assumption on composite categories in Sect. 3.1, SCHEMA disam-
biguates every part of the source category (Split Term Set) separately. The result
after disambiguation is called the Extended Split Term Set. Note that the target
taxonomy does not play a role in the source category disambiguation.
Algorithm 1 explains the procedure that is used to create the Extended Split
Term Set for the current source category. First, Algorithm 1 splits the (composite)
source category into separate classes: the Split Term Set. The same split
is performed for all children, and for the parent of the source category, which
will act as ‘context’ for the disambiguation process. Next, the disambiguation
procedure itself, which will be discussed shortly, is called for every split part
of the source category. The result, the Extended Split Term Set, contains a set
of synonyms of the correct sense for each individual split term. The Extended
Split Term Set is used in SCHEMA to find candidate target categories, and to
evaluate co-occurrence of nodes for path-comparison.
The disambiguation procedure scores each candidate sense of a split term against the context, as described in Algorithm 2. For every possible sense of the target word, the overlap between its related glosses and the plain context words is assessed. The length of the longest common substring is used as the similarity measure, and the sense with the highest accumulated score is picked as the winner.
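The gloss-overlap scoring can be sketched as follows, with the longest common substring computed via Python's difflib. The gloss expansion (here only the sense's own gloss and its hypernyms' glosses) and the scoring details are an approximation of the procedure described above, not a reimplementation of Algorithm 2.

    from difflib import SequenceMatcher
    from nltk.corpus import wordnet as wn

    def lcs_length(a, b):
        # Length of the longest common substring of two strings.
        m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
        return m.size

    def disambiguate(word, context_words):
        # Pick the WordNet sense of `word` whose glosses overlap most with the
        # context words taken from the surrounding taxonomy nodes.
        best_sense, best_score = None, -1
        context = " ".join(w.lower() for w in context_words)
        for sense in wn.synsets(word):
            glosses = [sense.definition()] + [h.definition() for h in sense.hypernyms()]
            score = sum(lcs_length(g.lower(), context) for g in glosses)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense

    # Parent and sibling categories act as the context that steers the choice.
    print(disambiguate("tub", ["Home & Garden", "Home Improvement", "Shower"]))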
Figure 3 shows some candidates that have been found for category 'Tubs' from Overstock. The Source Category Disambiguation procedure discussed in Sect. 3.2 results in the following Extended Split Term Set: {{Tubs, bathtub, bathing tub, bath, tub}}. Synonym 'bath' is sufficient for candidate category 'Kitchen & Bath Fixtures' (at the top of Fig. 3) to be selected. As 'bath' is included in the split target part 'Bath Fixtures' (as a separate word), it matches according to Algorithm 3, making target category 'Kitchen & Bath Fixtures' a superset of source category 'Tubs'. Hence it is classified as a semantic match and selected as a proper candidate target category.
Fig. 3. Source category path for 'Tubs' in Overstock, with associated candidate target categories from Amazon (nodes are annotated with the keys A, B, C, D referred to in the text)
Two nodes are considered identical if and only if their Extended Split Term Sets are the same. A node from the source path and a node from the candidate target path are seen as identical when Algorithm 3, the Semantic Match procedure, decides so. The result is a key list for both the source path and the current candidate target path.
Figure 3 shows the key lists for the source and candidate target paths for category 'Tubs'. The candidate path at the bottom is a good example of how Semantic Match classifies nodes as being similar. Candidate node 'Tools & Home Improvement' is assigned the same key ('B') as source node 'Home Improvement', as the former is a superset of the latter, so all products under the latter should fit into the former. Considering candidate 'Bath Hardware' itself, one of the synonyms of source category 'Tubs' ('bath') is included in the name of the candidate category. Hence, 'Bath Hardware' gets the same key ('C') as 'Tubs'.
For the key lists found for the source and candidate path, the similarity is assessed using the Damerau-Levenshtein distance [4]. This measure captures the (dis)similarity and transposition of the nodes, so both the number of co-occurring nodes and the consistency of the node order are taken into account. As the Damerau-Levenshtein distance is used in normalised form, a dissimilar node in a long candidate path would be penalised less than the same dissimilar node in a shorter path, which can lead to biased results. Therefore, a penalty is added for every unique key assigned solely to the candidate path, or, more precisely, for every node for which no match exists in the source path. The formula used as similarity measure for the key lists is as follows:
candidateScore = 1 − (damLev(K_src, K_candidate) + p) / (max(|K_src|, |K_candidate|) + p)    (1)
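Reading max(|K_src|, |K_candidate|) as the maximum of the two key-list lengths, the score of Eq. (1) can be sketched as follows. The distance function is the standard restricted (optimal string alignment) Damerau-Levenshtein variant; function names and the example keys are illustrative.

    def damerau_levenshtein(a, b):
        # Restricted (optimal string alignment) Damerau-Levenshtein distance.
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)  # transposition
        return d[len(a)][len(b)]

    def candidate_score(k_src, k_candidate):
        p = len(set(k_candidate) - set(k_src))  # keys that occur only in the candidate path
        dist = damerau_levenshtein(k_src, k_candidate)
        return 1 - (dist + p) / (max(len(k_src), len(k_candidate)) + p)

    # Keys of Fig. 3: source path 'ABC' versus the bottom candidate path 'ABDC'.
    print(round(candidate_score("ABC", "ABDC"), 3))  # 0.6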
where K_src and K_candidate denote the key lists of the source and candidate path, and p is the penalty. When no candidate target path achieves a sufficiently high score, the source category is mapped to its parent's mapping (but excluding the root), according to the assumption in Sect. 3.1. The complete framework procedure then repeats for the next source taxonomy category.
4 Evaluation
In order to assess SCHEMA's performance, it is compared to similar algorithms. We have chosen to compare it with PROMPT [19], a general-purpose algorithm that is well known in the field of ontology mapping. Additionally, the algorithm of Park & Kim [20] is included in the comparison, due to its particular focus on product taxonomy mapping. First, we briefly discuss how the evaluation has been set up. Then, we present the results for each algorithm and discuss their relative performance.
4.2 Results
Table 1 presents a comparison of average precision, recall and F1 -score for every
algorithm. Tables 2, 3, and 4 give a more detailed overview of the results achieved
by SCHEMA, the algorithm of Park & Kim, and PROMPT, respectively.
As shown in Table 1, SCHEMA performs better than PROMPT and the algorithm of Park & Kim on both average recall and F1-score. The recall has improved considerably: by 221% in comparison with the algorithm of Park & Kim, and by 384% against PROMPT. This can partly be attributed to the ability of SCHEMA to cope with lexical variations in category names, using the Levenshtein distance metric, as well as its ability to properly deal with composite categories. Furthermore, SCHEMA maps a category node to its parent's mapping when no suitable candidate path is found, improving the recall when the reference taxonomy only includes a more general product concept. Achieving a high recall is important in e-commerce applications, as the main objective is to automatically combine the products of heterogeneous product taxonomies in one overview, in order to reduce search failures. A low recall means that many categories would not be aligned, and consequently that many products would be missing from search results. For this reason, it is generally better to map to a more general category than not to map at all. Worth mentioning is the slight decrease in average precision for SCHEMA compared with the algorithm of Park & Kim: 42.21% against 47.77%. This is due to the trade-off between precision and recall: achieving a higher recall means that more categories are mapped, which inevitably includes some less accurate mappings.
References
1. Aumueller, D., Do, H.H., Massmann, S., Rahm, E.: Schema and Ontology Matching
with COMA++. In: ACM SIGMOD International Conference on Management of
Data 2005 (SIGMOD 2005), pp. 906–908. ACM (2005)
2. Banerjee, S., Pedersen, T.: An Adapted Lesk Algorithm for Word Sense Disam-
biguation Using WordNet. In: Gelbukh, A. (ed.) CICLing 2002. LNCS, vol. 2276,
pp. 136–145. Springer, Heidelberg (2002)
3. Castano, S., Ferrara, A., Montanelli, S.: H-MATCH: An Algorithm for Dynami-
cally Matching Ontologies in Peer-Based Systems. In: 1st VLDB Int. Workshop on
Semantic Web and Databases (SWDB 2003), pp. 231–250 (2003)
4. Damerau, F.J.: A Technique for Computer Detection and Correction of Spelling
Errors. Communications of the ACM 7(3), 171–176 (1964)
5. Do, H.-H., Melnik, S., Rahm, E.: Comparison of Schema Matching Evaluations.
In: Chaudhri, A.B., Jeckle, M., Rahm, E., Unland, R. (eds.) Web Databases and
Web Services 2002. LNCS, vol. 2593, pp. 221–237. Springer, Heidelberg (2003)
6. Ehrig, M., Sure, Y.: Ontology Mapping - An Integrated Approach. In: Bussler,
C.J., Davies, J., Fensel, D., Studer, R. (eds.) ESWS 2004. LNCS, vol. 3053, pp.
76–91. Springer, Heidelberg (2004)
7. Ehrig, M., Staab, S.: QOM – Quick Ontology Mapping. In: McIlraith, S.A., Plex-
ousakis, D., van Harmelen, F. (eds.) ISWC 2004. LNCS, vol. 3298, pp. 683–697.
Springer, Heidelberg (2004)
8. Giunchiglia, F., Shvaiko, P., Yatskevich, M.: S-Match: An Algorithm And An Im-
plementation of Semantic Matching. In: Dagstuhl Seminar Proceedings of Semantic
Interoperability and Integration 2005 (2005)
9. Hepp, M.: GoodRelations: An Ontology for Describing Products and Services Of-
fers on the Web. In: Gangemi, A., Euzenat, J. (eds.) EKAW 2008. LNCS (LNAI),
vol. 5268, pp. 329–346. Springer, Heidelberg (2008)
10. Horrigan, J.B.: Online Shopping. Pew Internet & American Life Project Report 36
(2008)
11. Kalfoglou, Y., Schorlemmer, M.: Ontology Mapping: The State of the Art. The
Knowledge Engineering Review 18(1), 1–31 (2003)
12. Kilgarriff, A., Rosenzweig, J.: Framework and Results for English SENSEVAL.
Computers and the Humanities 34(1-2), 15–48 (2000)
13. Lesk, M.: Automatic Sense Disambiguation using Machine Readable Dictionaries:
How to Tell a Pine Cone from an Ice Cream Cone. In: 5th Annual International
Conference on Systems Documentation (SIGDOC 1986), pp. 24–26. ACM (1986)
14. Levenshtein, V.: Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Soviet Physics Doklady 10(8), 707–710 (1966)
15. Madhavan, J., Bernstein, P.A., Rahm, E.: Generic Schema Matching with Cupid.
In: 27th International Conference on Very Large Data Bases (VLDB 2001). pp.
49–58. Morgan Kaufmann Publishers Inc. (2001)
16. Melnik, S., Garcia-Molina, H., Rahm, E.: Similarity Flooding: A Versatile Graph
Matching Algorithm and its Application to Schema Matching. In: 18th Interna-
tional Conference on Data Engineering (ICDE 2002). pp. 117–128. IEEE (2002)
17. Miller, G.A.: WordNet: A Lexical Database for English. Communications of the
ACM 38(11), 39–41 (1995)
18. Niles, I., Pease, A.: Towards a Standard Upper Ontology. In: International Confer-
ence on Formal Ontology in Information Systems 2001 (FOIS 2001). ACM (2001)
19. Noy, N.F., Musen, M.A.: The PROMPT Suite: Interactive Tools for Ontology
Merging and Mapping. International Journal of Human-Computer Studies 59(6),
983–1024 (2003)
20. Park, S., Kim, W.: Ontology Mapping between Heterogeneous Product Taxonomies
in an Electronic Commerce Environment. International Journal of Electronic Com-
merce 12(2), 69–87 (2007)
21. Rahm, E., Bernstein, P.A.: A Survey of Approaches to Automatic Schema Match-
ing. The VLDB Journal 10(4), 334–350 (2001)
22. Shvaiko, P., Euzenat, J.: A Survey of Schema-Based Matching Approaches. In:
Spaccapietra, S. (ed.) Journal on Data Semantics IV. LNCS, vol. 3730, pp. 146–
171. Springer, Heidelberg (2005)
23. VijayaLakshmi, B., GauthamiLatha, A., Srinivas, D.Y., Rajesh, K.: Perspectives of Semantic Web in E-Commerce. International Journal of Computer Applications 25(10), 52–56 (2011)
24. Yu, Y., Hillman, D., Setio, B., Heflin, J.: A Case Study in Integrating Multiple
E-commerce Standards via Semantic Web Technology. In: Bernstein, A., Karger,
D.R., Heath, T., Feigenbaum, L., Maynard, D., Motta, E., Thirunarayan, K. (eds.)
ISWC 2009. LNCS, vol. 5823, pp. 909–924. Springer, Heidelberg (2009)
25. Zhang, G.Q., Zhang, G.Q., Yang, Q.F., Cheng, S.Q., Zhou, T.: Evolution of the
Internet and its Cores. New Journal of Physics 10(12), 123027 (2008)
Voting Theory for Concept Detection
Abstract. This paper explores the issue of detecting concepts for ontology learning from text. Using our tool OntoCmaps, we investigate various metrics from graph theory and propose voting schemes based on these metrics. The idea has its roots in social choice theory, and our objective is to mimic consensus in automatic learning methods and to increase the confidence in concept extraction through the identification of the best performing metrics, the comparison of these metrics with standard information retrieval metrics (such as TF-IDF) and the evaluation of various voting schemes. Our results show that three graph-based metrics, Degree, Reachability and HITS-Hub, were the most successful in identifying relevant concepts contained in two gold standard ontologies.
1 Introduction
Building domain ontologies is one of the pillars of the Semantic Web. However, it is now widely acknowledged within the research community that domain ontologies do not scale well when created manually, due to the constantly increasing amount of data and the evolving nature of knowledge. (Semi-)automating the ontology building process (ontology learning) is thus unavoidable for the full realization of the Semantic Web.
Ontology learning (from texts, XML, etc.) is generally decomposed into a number of steps or layers, which target the different components of an ontology: concepts, taxonomy, conceptual relationships, axioms and axiom schemata [3]. This paper is concerned with the first building block of ontologies, namely concepts (classes). In fact, concept extraction is a very active research field, which is of interest to all knowledge engineering disciplines. Generally, research in ontology learning from texts considers that a lexical item (a term) becomes a concept once it reaches a certain value on a given metric (e.g., TF-IDF). Numerous metrics such as TF-IDF, C/NC value or entropy [3, 4, 8, 15] have been proposed to identify the most relevant terms from corpora in
information retrieval and ontology learning. For example, some approaches such as Text2Onto [4] and OntoGen [7] rely on metrics such as TF-IDF to evaluate term relevance. However, the presented solutions generally either adopt one metric or require that the user identify the most suitable metric for the task at hand [3]. Following our previous work on graph-theory-based metrics for concept and relation extraction in ontology learning [19], we propose to enrich this perspective by:
─ testing various metrics from graph theory, and
─ taking into account a number of metrics when suggesting suitable concepts, based on social choice theory [5, 14].
1.1 Motivation
This work aims at exploring the following research questions:
─ Do we obtain better results with graph-based metrics rather than with traditional
information retrieval measures?
In our previous work [19], we showed that some graph-based metrics are a promising
option to identify concepts in an ontology learning system. This paper continues ex-
ploring this aspect by enriching the set of studied measures and extending the experi-
ment to another gold standard.
─ Do we obtain better results with voting schemes rather than with base metrics?
Social choice theory studies methods for the aggregation of various opinions in order to reach a consensus [5]. This theory is appealing in our case for two main reasons: first, at the practical level, it provides a means to aggregate the results of various metrics in order to recommend concepts. Second, at the theoretical level, it gracefully integrates the idea of consensus, which is one of the main goals of ontologies. In fact, ontologies are meant to improve the communication between computers, between humans and computers, and between humans [12]. At this level, another research question is: how can we mimic consensus with automatic ontology learning methods? Although consensus is generally concerned with human users, our hypothesis is that mimicking this characteristic at the level of automatic methods will provide more reliable results.
1.2 Contributions
This paper explores various metrics and voting schemes for the extraction of concepts
from texts. Besides bringing a different perspective to this research avenue, the signi-
ficance of our proposal is that it is applicable to a number of issues related to the Se-
mantic Web, including (but not limited to) learning relationships, helping experts
collaboratively build an ontology and reducing the noise that results from the auto-
matic extraction methods.
2 Background
This paper is based on our ontology learning tool, OntoCmaps [19], which in turn is
derived from our previous work [20, 21]. OntoCmaps is a “complete” ontology learn-
ing system in the sense that it extracts primitive and defined classes (concepts), con-
ceptual relationships (i.e. relations with domain and range), taxonomical relationships
(is-a links) and equivalence classes’ axioms (e.g. AI = Artificial Intelligence).
OntoCmaps relies on dependency-based patterns to create a forest of multi-digraphs
constituted of nodes (terms) and edges (hierarchical and conceptual relations). An
example of pattern is:
Semantic Analysis
Is_a (knowledge representation, Artificial Intelligence technique)
By multi-digraphs, we mean that there can be multiple directed relationships from a given term X to a given term Y. For each term X, there can be various relationships to a set of terms S, which constitutes a term map. Some term maps might be isolated, while others might be linked to other term maps through relationships, hence creating a forest. Figure 1 shows a term map around the term "intelligent agent", which can in turn be related to the term "agent", which itself has a term map (and so on). Once the extraction of term maps is performed, the tool filters the results based on various graph-based metrics by assigning several scores to the potential candidates. These scores serve to promote candidate terms to concepts in the ontology. In our previous work [19], we identified a number of graph-based metrics as potentially useful measures for extracting terms and relationships. We found promising results by comparing these graph-based metrics (Degree, Betweenness, PageRank and HITS-Authority) to information retrieval metrics such as TF-IDF and TF. We showed that graph-based metrics outperformed these commonly used metrics in identifying relevant candidate concepts. We also tested some voting schemes (an intersection voting scheme and a majority voting scheme) and discovered that they contributed to increasing the precision of our results.
This paper further investigates this previous study by expanding the set of considered graph-based metrics and by using voting theory methods to consider the vote of each metric for the selection of domain concepts. In fact, voting theory can be used to consider the contribution of each metric and to decrease the noise that results from an NLP pipeline. Voting theory has been applied in a number of works in artificial intelligence, such as agent group decision-making [6], information mashups [2], and ontology merging [14], but to our knowledge there is no ontology learning tool that proposes to identify concepts through graph-based measures and to increase the confidence of the extractions by aggregating the results of the various metrics through voting theory. This type of aggregation, rooted in social choice theory [14], is similar in spirit to the ensemble learning methods frequently used in machine learning [13]. However, as previously stated, experimenting with voting theories has the potential to mimic real-world vote aggregation and seems a suitable approach to establishing consensus in learning domain concepts.
Concept detection through vote aggregation is closely related to the problem of
rank aggregation, which is a well-known problem in the context of Web search where
there is a need of finding a consensus between the results of several search engines
[5]. Vote aggregation can be defined as the process of reaching a consensus between
various rankings of alternatives, given the individual ranking preferences of several
voters [2]. In the context of a vote, each metric is considered as a voter.
3.1 Metrics
After the extraction of term maps, OntoCmaps assigns rankings to the extracted terms based on scores from various measures from graph theory (see below). In fact, since OntoCmaps generates a network of terms and relationships, computational network analysis methods are applicable in this case. As outlined by [11], text mining in general and concept extraction in particular can be considered as a process of network traversal and weighting. In this paper, in addition to Degree, PageRank, HITS-Authority and Betweenness presented in [19], we computed three additional metrics: HITS-Hubs, Clustering Coefficient and Reachability Centrality. As explained below, these metrics are generally good indicators of the connectedness and the accessibility of a node, which are two properties that might indicate the importance of a node (here, a term) [11, 19].
The following metrics were calculated using the JUNG API [10]:
Degree (Deg) assigns a score to each term based on the number of its outgoing and
incoming relationships;
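As an illustration of such scores, the sketch below computes comparable metrics on a toy term map with NetworkX (SciPy is needed for the PageRank and HITS routines). The paper itself uses the JUNG Java API, so exact values and parameterizations may differ.

    import networkx as nx

    # Toy term map: directed edges between extracted terms.
    g = nx.DiGraph()
    g.add_edges_from([
        ("intelligent agent", "agent"),
        ("intelligent agent", "environment"),
        ("agent", "artificial intelligence"),
        ("environment", "agent"),
    ])

    hubs, authorities = nx.hits(g)
    scores = {
        "Degree": dict(g.degree()),                   # incoming plus outgoing edges
        "PageRank": nx.pagerank(g),
        "Betweenness": nx.betweenness_centrality(g),
        "HITS-Hubs": hubs,
        "HITS-Authority": authorities,
        "Clustering coefficient": nx.clustering(g),
        # Reachability centrality, read here as the number of terms reachable from a node.
        "Reachability": {n: len(nx.descendants(g, n)) for n in g},
    }

    for metric, values in scores.items():
        best = max(values, key=values.get)
        print(f"{metric}: best term = {best!r}")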
Here, we introduce voting theory methods, which can generally be divided into two main classes: score-based methods and rank-based methods.
In the score-based methods, each metric assigns a score to the elements of a list (here, the extracted terms), and the resulting list must take into account the score assigned by each metric. Given the universe of domain terms DT, which is composed of all the nominal expressions extracted through our dependency patterns [19], the objective of the vote is to select the most popular terms t ∈ DT, given multiple metrics m ∈ M. Each metric m computes a score S_t^m for a term t. This score is used to create a fully ordered list T_m for each metric.
Sum and maximum values are generally two functions that are used to assign an
aggregated score to the terms [18]. We implemented two voting schemes based on
scores: the intersection voting scheme and the majority voting scheme.
In the Intersection Voting Scheme, we select the terms for which there is a con-
sensus among all the metrics and the score assigned is the sum of the scores of each
individual metric normalized by their number.
In the Majority Voting Scheme, we select the terms for which there exists a vote
from at least 50% of the metrics. The score is again the normalized sum of the score
of each individual metric participating in the vote.
Each graph-based metric produces a full list of terms (DT) ordered in decreasing order of scores. Top-k lists (partial orderings) may be created from fully ordered lists by setting a threshold on the value of the metrics. In fact, such a threshold might be set to increase a metric's precision: in this case, only the portion of the list whose score is greater than or equal to the threshold is kept for each metric, and the voting schemes operate on these partial lists.
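The two score-based schemes can be sketched as follows; the term scores are made up, the function names are ours, and the normalisation follows the description above (sum of scores divided by the number of metrics for the intersection scheme, and by the number of participating metrics for the majority scheme).

    def intersection_voting(metric_scores):
        # Keep only terms on which every metric voted; average their scores.
        voters = list(metric_scores.values())
        common = set.intersection(*(set(s) for s in voters))
        return {t: sum(s[t] for s in voters) / len(voters) for t in common}

    def majority_voting(metric_scores):
        # Keep terms voted for by at least 50% of the metrics; average the
        # scores of the metrics that actually participated in the vote.
        voters = list(metric_scores.values())
        result = {}
        for term in set().union(*(set(s) for s in voters)):
            votes = [s[term] for s in voters if term in s]
            if len(votes) >= len(voters) / 2.0:
                result[term] = sum(votes) / len(votes)
        return result

    scores = {
        "degree":       {"agent": 0.9, "environment": 0.4, "robot": 0.2},
        "reachability": {"agent": 0.8, "robot": 0.3},
        "hits_hub":     {"agent": 0.7, "environment": 0.5},
    }
    print(intersection_voting(scores))  # only 'agent' is backed by all three metrics
    print(majority_voting(scores))      # 'agent', 'environment' and 'robot' get >= 2 votes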
4 Methodology
4.1 Dataset
We used a corpus of 30,000 words on the SCORM standard which was extracted from
the SCORM manuals [16] and which was used in our previous experiments [19]. This
corpus was exploited to generate a gold standard ontology that was validated by a do-
main expert. To counterbalance the bias that may be introduced by relying on a unique
domain expert, we performed user tests to evaluate the correctness of the gold standard.
We randomly extracted concepts and their corresponding conceptual and taxonomical
relationships from the gold standard and exported them in Excel worksheets. The work-
sheets were then sent together with the domain corpus and the obtained gold standard
ontology to 11 users from Athabasca University, Simon Fraser University, the Univer-
sity of Belgrade, and the University of Lugano. The users were university professors
(3), postdoctoral researchers (2), and PhD (5) and master’s (1) students. The users were
instructed to evaluate their ontology subset by reading the domain corpus and/or having a look at the global ontology. Each user had a distinct set of items (no duplicated items) composed of 20 concepts and all their conceptual and taxonomical relationships. Almost 29% of the entire gold standard was evaluated by users, and overall more than 93% of the concepts were accepted as valid and understandable by these users. The size of the sample and the fact that it was selected randomly provide solid evidence that the results of the user evaluation can be generalized to the entire gold standard.
To improve its quality, there have been slight modifications to the previous gold
standard: class labels were changed by using lemmatization techniques instead of
stemming, which introduced some changes in the GS classes. Additionally, some
defined classes were also created, and new relationships were discovered due to new
patterns added to OntoCmaps. The following table shows the statistics associated to
the classes in our current GS1.
Once the GS ontology was created, we ran the OntoCmaps tool on the same cor-
pus. The aim was to compare the expert GS concepts with the concepts learned by the
tool. We ran our ontology learning tool on the SCORM corpus and generated a rank-
ing of the extractions based on all the above-mentioned metrics: Degree, Between-
ness, PageRank, Hits, Clustering Coefficient and Reachability. The tool extracted
2423 terms among which the metrics had to choose the concepts of the ontology.
We also tested our metrics and voting schemes on another smaller corpus (10574
words) on Artificial Intelligence (AI) extracted from Wikipedia pages about the topic.
The tool extracted 1508 terms among which the metrics had to choose the concepts of
the ontology. Table 2 shows the statistics of the extracted AI gold standard.
1
http://azouaq.athabascau.ca/Corpus/SCORM/Corpus.zip
For the SCORM GS, small lists included Top-k with k = 50, 100, 200 (up to ~14.5% of the expected terms), and large lists had k > 200. In the AI GS, small lists were Top-50 and Top-100 (up to ~13% of the expected terms).
Our experimental evaluation of the different ranking methods tests each of the individual metrics and each of the aforementioned voting systems. There are a number of methods that are used to evaluate similar types of research: in information retrieval and ontology learning, the results are generally evaluated using precision/recall and F-measure [3]. In our case, we chose to concentrate on the precision measure, as ontology learning methods have difficulty obtaining good precision results (see for example [Brewster et al., 2009], the results of Text2Onto in [4], and our experiments [19]). Moreover, it is better to offer correct results to the user rather than a more complete but noisier list of concepts [9]. In voting theory and rank aggregation studies [2], the results are often evaluated through a Social Welfare Function (SWF). An SWF is a mathematical function that measures the increased social welfare of the voting system. SWFs employed in similar research include Precision Optimal Aggregation and the Spearman Footrule distance [2, 17]. Given that Precision Optimal Aggregation is similar in spirit to the precision metric employed in information retrieval, we employed standard precision (Precision Function) against our GS:
Precision = (number of items the metric identified correctly) / (total number of items generated by the metric)
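In code, this precision function reduces to a set intersection over a metric's output and the gold standard, as in the following sketch (the term lists are invented for illustration):

    def precision(returned_terms, gold_standard_concepts):
        # Fraction of the returned terms (e.g. a Top-k list) found in the gold standard.
        returned = set(returned_terms)
        if not returned:
            return 0.0
        return len(returned & set(gold_standard_concepts)) / len(returned)

    gold = {"agent", "environment", "knowledge representation"}
    top_k = ["agent", "environment", "notebook", "robot"]
    print(precision(top_k, gold))  # 0.5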
4.3 Experiments
Quality of Individual Metrics. In [1], the authors indicate that the performance of each individual ranker might have a strong influence on the overall impact of the aggregation. Therefore, we decided first to assess the performance of each metric in various partial lists: Top-50, Top-100, Top-600, Top-1000, Top-1500 and Top-2000. Table 3 shows the performance of each metric in each of these lists.
For smaller N-lists (N = 50, 100, 200), we can notice that Betweenness, PageRank and Degree are the best performing metrics, while Reachability, Degree and HITS-Hub become the best ones with larger lists (N = 400–2000). Only the Degree metric is consistently present among the best three results of each Top-N list.
In order to compare our results and to perform another experiment, we tested our metrics on the second gold standard (AI). The following table shows the results of this experiment. We notice that HITS-Hub and Reachability give the best performance overall.
Table 5. Metrics selection using the CfsSubsetEval Attribute Evaluator2 and a BestFirst search

    Attributes      Number of folds (%)   Number of folds (%)   Number of folds (%)
                    SCORM                 AI                    ALL
  1 Betw            10 (100%)             1 (10%)               0 (0%)
  2 Prank           4 (40%)               10 (100%)             8 (80%)
  3 Deg             10 (100%)             10 (100%)             2 (20%)
  4 HITS(Auth)      0 (0%)                7 (70%)               10 (100%)
  5 HITS(Hubs)      10 (100%)             0 (0%)                1 (10%)
  6 Reach           10 (100%)             0 (0%)                10 (100%)
  7 CC              8 (80%)               0 (0%)                0 (0%)
Table 5 shows how many times each metric was selected during a 10-fold cross-validation. We can see that some metrics are used more often than others during each cross-validation. According to these results, only two metrics, Degree and Reachability, are present in all 10 folds of our cross-validation (10 (100%)) over two datasets: Degree appears over the SCORM and AI datasets, while Reachability appears over the SCORM and combined (ALL) datasets. However, we can notice that each individual GS has other significant metrics.
Based on these results, we decided to compute the following voting schemes:
─ Intersection Voting Schemes (IVS_1, IVS_2 and IVS_3), where IVS_1 is based on
all the metrics except the clustering coefficient (which appears to be significant
only for SCORM): Hits_Hub, Hits_Authority, PageRank, Degree, Reachability and
Betweenness. IVS_2 uses Reachability and Betweenness while IVS_3 is based on
Betweenness, Reachability, Hits_Hub and Degree.
─ Majority Voting Schemes (MVS_1 and MVS_2), where MVS_1 and MVS_2 use the same metrics as IVS_1 and IVS_3, respectively.
─ Borda, Nauru and Runoff, which were all based on the metrics Betweenness, Reachability, Degree and HITS-Hubs, the best metrics for the SCORM GS (a sketch of these rank-based schemes is given after this list).
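The following sketch implements the general Borda, Nauru (Dowdall) and instant-runoff rules over ranked term lists; it illustrates the standard voting rules rather than the authors' exact variants, and the rankings shown are invented.

    from collections import Counter

    def borda(rankings):
        # Each voter gives n-1, n-2, ..., 0 points from top to bottom of its ranking.
        scores = Counter()
        for ranking in rankings:
            n = len(ranking)
            for pos, term in enumerate(ranking):
                scores[term] += n - 1 - pos
        return scores

    def nauru(rankings):
        # Dowdall system: 1, 1/2, 1/3, ... points per position.
        scores = Counter()
        for ranking in rankings:
            for pos, term in enumerate(ranking, start=1):
                scores[term] += 1.0 / pos
        return scores

    def runoff_winner(rankings):
        # Repeatedly eliminate the candidate with the fewest first-place votes.
        remaining = set().union(*rankings)
        while len(remaining) > 1:
            firsts = Counter(next(t for t in r if t in remaining) for r in rankings)
            remaining.remove(min(remaining, key=lambda t: firsts[t]))
        return remaining.pop()

    ranks = [["agent", "robot", "environment"],
             ["agent", "environment", "robot"],
             ["robot", "agent", "environment"]]
    print(borda(ranks).most_common())
    print(nauru(ranks).most_common())
    print(runoff_winner(ranks))  # 'agent'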
Precision Optimal Aggregation Results on the SCORM and AI GS. In the Top-
50 list of the SCORM GS, we noticed that all the voting schemes, except Runoff
(96% precision), were successful (100% precision) in identifying relevant concepts
among the highest ranked 50 terms. However, as the number of considered terms
increases (Table 6), we can notice that the intersection voting schemes and the majority voting schemes (~82%) slightly beat the other voting schemes (Runoff: 77.5%, Nauru: 79.8%, and Borda: 80.5%). In our experiments on the AI GS (Table 6), the best performing voting schemes were:
─ Nauru first (90%), and then Runoff, IVS_1 and MVS_2 with 88%, in the Top-50 list;
─ Runoff first (81.5%), Nauru (80%), and then IVS_1 and MVS_2 with 79%, in the Top-200 list;
─ IVS_2 first (67.5%), Nauru and Runoff with 67%, Borda with 66%, and then IVS_1 and MVS_2 with 65.5%, in the Top-600 list.
2
In Weka, CfsSubsetEval evaluates the worth of a subset of attributes by considering the indi-
vidual predictive ability of each feature along with the degree of redundancy between them.
As we can see in Table 7, the metrics TF-IDF and TF are more successful when they are applied to the pre-filtered domain terms (TFIDF (DT) and TF (DT)). We can also notice that the graph-based metrics and their combination through voting schemes beat the traditional metrics (compare Table 3 and Table 7). Up to the Top-200 list, Betweenness is the best performing metric, then Reachability (in Top-400, Top-600 and Top-1000), then HITS-Hub (Top-1500), and finally Degree (Top-2000).
Comparison with Other Metrics on the AI Gold Standard. We repeated the same experiment on the AI GS. As shown in Table 8, among the traditional metrics, the best performing ones are again TFIDF (DT) and TF (DT). If we compare these metrics from Table 8 with the graph-based ones (Table 4), we also see that graph-based metrics again have much better performance in all the Top-k lists (k = 50, 100, 200, 400, 600 and 1000). For example, the best in the Top-50 list are Degree and HITS-Hub with 88% (versus 72% for TFIDF (DT)), and in the Top-600, the best is HITS-Hub (69.67%) versus 50.83% for TF (DT) and TFIDF (DT).
Based on the results presented in Tables 3, 4, 7 and 8, we ran a paired-sample t-test on each of these metric combinations; the differences were statistically significant in favor of graph-based metrics in general, and in favor of Degree, Reachability and HITS-Hub in particular.
5 Discussion
In this section, we summarize our findings and the limitations of our work.
5.1 Findings
Our findings are related to our initial research questions:
Do we obtain better results with graph-based metrics rather than with traditional ones?
Our experiments confirm this research hypothesis; the best performing metrics per Top-k list are summarized in Table 9. We can observe that Degree is consistently present and that Degree, HITS-Hub and Reachability seem to be the best performing graph-based metrics. This result is confirmed by our machine learning experiments (Table 5) for at least two metrics, Degree and Reachability.
Do we obtain better results with voting schemes rather than with base metrics?
As far as voting schemes are concerned, the first question is whether we were able to increase the precision of the results by using these voting schemes (see Table 9). In previous experiments [19], we noticed that some voting schemes allowed us to obtain better performance, but our ranked lists contained only those terms whose weight was greater than the mean value of the considered metric, which already had a strong impact on the precision of each metric.
Table 9. Best performing voting schemes and base metrics per Top-k list on the SCORM and AI gold standards

               SCORM                                      AI
  Top-50       100%: all voting schemes except Runoff     90%: Nauru
               96%: Bet and PageRank                      88%: Deg and HITS-hub
  Top-100      97%: IVS_3, MVS_1, MVS_2                   86%: Runoff
               96%: Bet                                   88%: HITS-hub
  Top-200      87%: IVS_1 and MVS_2                       81.5%: Runoff
               88%: Bet                                   79.5%: HITS-hub
  Top-400      83.75%: IVS_3 and MVS_1                    72.75%: Runoff
               81.75%: Reach                              74.5%: HITS-hub
  Top-600      82.67%: IVS_1, IVS_3, MVS_1, MVS_2         67.5%: IVS_2
               82.33%: Reach                              69.67%: HITS-hub
  Top-1000     77.7%: IVS_1 and MVS_2                     60.7%: IVS_1 and MVS_2
               77.6%: Reach                               58.2%: HITS-hub and Reach
  Top-1500     71.26%: IVS_1 and MVS_2                    NA
               71.20%: HITS-hub
  Top-2000     65.15%: IVS_1 and MVS_2                    NA
               64%: Degree
Despite a small increase in almost all cases in favor of the voting schemes, the difference between voting schemes and base metrics such as Degree, HITS-Hub and Reachability was not really noteworthy. This raises the question of whether such voting schemes are really necessary and whether the identified best graph-based metrics would not be enough, especially if we do not take the mean value as a threshold for the metrics. Having identified that the best base metrics were Degree, Reachability and HITS-Hub, we tried some combinations of metrics on the SCORM GS. Despite an improvement of voting theory schemes (e.g., Borda) in some Top-n lists, we did not notice a major difference. Our future work will continue testing combinations of voting schemes and voting theory measures, based on these metrics, on various gold standards. We also plan to compare this voting-based approach with ensemble machine learning algorithms.
5.2 Limitations
One of the most difficult aspects in evaluating this type of work is the necessity to
build a gold standard, which in general requires a lot of time and resources. Building a
GS that represents a universal ground truth is not possible. Ideally, the experiments
presented in this paper should be repeated over various domains to evaluate the generalizability of the approach. However, this is often impossible due to the cost of such a large-scale evaluation. In this paper, we extended our previous evaluation to another
corpus, and we also extended the set of tested metrics and voting schemes. Future
work will have to continue the validation of our approach and to expand the set of
“traditional” metrics (such as C/NC value) to be compared with graph-based metrics.
Another limitation is that the metrics that we propose for discovering concepts are graph-based metrics, which involves processing the corpus to obtain a graph, while metrics commonly used in information retrieval, such as TF-IDF, only require the corpus itself. In our experiments, we always relied on OntoCmaps to generate this graph. However, we do not believe that this represents a threat to the external validity of our findings, as these metrics are already applied successfully in other areas such as social network analysis and information retrieval, and they do not depend on anything other than a set of nodes (terms) and edges (relationships).
Finally, despite our focus on concepts in this paper, such a graph-based approach is
worth the effort only if the aim is to extract a whole ontology and not only concepts,
as it involves discovering terms and relationships between terms. This requirement is
also closely linked to another limitation: since we rely on deep NLP to produce such a
graph, it requires time to process the corpus and calculate the graph-based metrics.
However, we believe that this is not a major limitation, as ontologies are not supposed
to be generated on the fly.
6 Conclusion
In this paper, we presented various experiments involving a) the comparison between graph-based metrics and traditional information retrieval metrics and b) the comparison between various voting schemes, including schemes relying on voting theory. Our findings indicate that graph-based metrics consistently outperform traditional metrics in our experiments. In particular, Degree, Reachability and HITS-Hub seem to be the best performing ones. Although voting schemes increased precision in our experiments, the improvement was only slight compared to the three best performing base metrics.
References
1. Adali, S., Hill, B., Magdon-Ismail, M.: The Impact of Ranker Quality on Rank Aggrega-
tion Algorithms: Information vs. Robustness. In: Proc. of 22nd Int. Conf. on Data Engi-
neering Workshops. IEEE (2006)
2. Alba, A., Bhagwan, V., Grace, J., Gruhl, D., Haas, K., Nagarajan, M., Pieper, J., Robson,
C., Sahoo, N.: Applications of Voting Theory to Information Mashups. In: IEEE Interna-
tional Conference on Semantic Computing, pp. 10–17 (2008)
3. Cimiano, P.: Ontology Learning and Population from Text. Algorithms, Evaluation and
Applications. Springer (2006)
4. Cimiano, P., Völker, J.: Text2Onto. In: Montoyo, A., Muñoz, R., Métais, E. (eds.) NLDB
2005. LNCS, vol. 3513, pp. 227–238. Springer, Heidelberg (2005)
5. Dwork, C., Kumar, R., Naor, M., Sivakumar, D.: Rank aggregation methods for the Web.
In: Proc. of the 10th International Conference on WWW, pp. 613–622. ACM (2001)
6. Endriss, U.: Computational Social Choice: Prospects and Challenges. Procedia Computer
Science 7, 68–72 (2011)
7. Fortuna, B., Grobelnik, M., Mladenic, D.: Semi-automatic Data-driven Ontology Con-
struction System. In: Proc. of the 9th Int. Multi-Conference Information Society, pp. 309–
318. Springer (2006)
8. Frantzi, K.T., Ananiadou, S.: The C/NC value domain independent method for multi-word
term extraction. Journal of Natural Language Processing 3(6), 145–180 (1999)
9. Hatala, M., Gašević, D., Siadaty, M., Jovanović, J., Torniai, C.: Can Educators Develop
Ontologies Using Ontology Extraction Tools: an End User Study. In: Proc. 4th Euro. Conf.
Technology-Enhanced Learning, pp. 140–153 (2009)
10. JUNG, http://jung.sourceforge.net/ (last retrieved on December 6, 2011)
11. Kozareva, Z., Hovy, E.: Insights from Network Structure for Text Mining. In: Proc. of the
49th Annual Meeting of the ACL Human Language Technologies, Portland (2011)
12. Maedche, A., Staab, S.: Ontology Learning for the Semantic Web. IEEE Intelligent Sys-
tems 16(2), 72–79 (2001)
13. Polikar, R.: Bootstrap inspired techniques in computational intelligence: ensemble of clas-
sifiers, incremental learning, data fusion and missing features. IEEE Signal Processing
Magazine 24, 59–72 (2007)
14. Porello, D., Endriss, U.: Ontology Merging as Social Choice. In: Proceedings of the 12th
International Workshop on Computational Logic in Multi-agent Systems (2011)
15. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Informa-
tion Processing & Management 24(5), 515–523 (1988)
16. SCORM (2011), http://www.adlnet.gov (last retrieved on December 10, 2011)
17. Sculley, D.: Rank Aggregation for Similar Items. In: Proc. of the 7th SIAM International
on Data Mining (2007)
18. Shili, L.: Rank aggregation methods. In: WIREs Comp. Stat. 2010, vol. 2, pp. 555–570
(2010)
19. Zouaq, A., Gasevic, D., Hatala, M.: Towards Open Ontology Learning and Filtering. In-
formation Systems 36(7), 1064–1081 (2011)
20. Zouaq, A., Nkambou, R.: Evaluating the Generation of Domain Ontologies in the Knowledge Puzzle Project. IEEE Trans. on Knowledge and Data Eng. 21(11), 1559–1572 (2009)
21. Zouaq, A.: An Ontological Engineering Approach for the Acquisition and Exploitation of
Knowledge in Texts. PhD Thesis, University of Montreal (2008) (in French)
Modelling Structured Domains
Using Description Graphs and Logic Programming
1 Introduction
OWL 2 [7] is commonly used to represent objects with complex structure, such as com-
plex assemblies in engineering applications [8], human anatomy [22], or the structure of
chemical molecules [10]. In order to ground our discussion, we next present a concrete
application of the latter kind; however, the problems and the solution that we identify
apply to numerous similar scenarios.
The European Bioinformatics Institute (EBI) has developed the ChEBI ontology—a
public dictionary of molecular entities used to ensure interoperability of applications
supporting tasks such as drug discovery [17]. In order to automate the classification of
molecular entities, ChEBI descriptions have been translated into OWL and then clas-
sified using state of the art Semantic Web reasoners. While this has uncovered numer-
ous implicit subsumptions between ChEBI classes, the usefulness of the approach was
somewhat limited by a fundamental inability of OWL 2 to precisely represent the struc-
ture of complex molecular entities. As we discuss in more detail in Section 3, OWL 2
exhibits a so-called tree-model property [23], which prevents one from describing non-
tree-like relationships using OWL 2 schema axioms. For example, OWL 2 axioms can
state that butane molecules have four carbon atoms, but they cannot state that the four
atoms in a cyclobutane molecule are arranged in a ring. Please note that this applies
to schema descriptions only: the structure of a particular cyclobutane molecule can be
represented using class and property assertions, but the general definition of all cyclobu-
tane molecules—a problem that terminologies such as ChEBI aim to solve—cannot be captured using OWL 2 schema axioms.
This work was supported by the EU FP7 project SEALS and the EPSRC projects ConDOR, ExODA, and LogMap.
To ensure decidability, in Section 5 we require the modeller to specify an ordering that describes which DGs may imply the existence of instances of other DGs; a cyclic ontology that is not compatible with this precedence relation entails a special propositional symbol. A cyclic ontology can still entail useful consequences, but termination of reasoning can no longer be guaranteed.
In Section 6 we consider the problem of reasoning with ontologies including only
negation-free rules. We show that the standard bottom-up evaluation of logic programs
can decide the relevant reasoning problems for semantically acyclic ontologies, and that
it can also decide whether an ontology is semantically acyclic. Furthermore, in Section
7 we show that this result can be extended to ontologies with stratified negation.
In Section 8 we present the results of a preliminary evaluation of our formalism.
We show that molecule descriptions from the ChEBI ontology can be translated into a
DGLP ontology that entails the desired subsumption consequences. Furthermore, de-
spite the very preliminary nature of our implementation, we show that reasoning with
DGLP ontologies is practically feasible. Thus, in this paper we lay the theoretical foun-
dations of a novel, expressive, and OWL 2 RL-compatible ontology language that is
well suited to modelling objects with complex structure.
The proofs of all technical results presented in this paper are given in a technical
report that is available online.1
2 Preliminaries
We assume the reader to be familiar with OWL and description logics. For brevity, we
write OWL axioms using the DL notation; please refer to [1] for an overview of the DL
syntax and semantics. Let Σ = (ΣC , ΣF , ΣP ) be a first-order logic signature, where
ΣC , ΣF , and ΣP are countably infinite sets of constant, function, and predicate sym-
bols, respectively, and where ΣP contains the 0-ary predicate ⊥. The arity of a predicate
A is given by ar (A). A vector t1 , . . . , tn of first-order terms is often abbreviated as t.
An atom is a first-order formula of the form A(t), where A ∈ ΣP and t is a vector of
the terms t1, . . . , t_ar(A). A rule r is an implication of the form B1 ∧ . . . ∧ Bn ∧ not Bn+1 ∧ . . . ∧ not Bm → H1 ∧ . . . ∧ Hk, where each Bi and Hj is an atom; we write body+(r) = {B1, . . . , Bn}, body−(r) = {Bn+1, . . . , Bm}, and head(r) = {H1, . . . , Hk}. The grounding of a rule r w.r.t. a set of terms T is the set of rules obtained by substituting the variables of r by the terms of T in all possible ways. Given a logic program P, the program ground(P) is obtained from P by replacing each rule r ∈ P with its grounding w.r.t. HU(P).
Let I ⊆ HB(P) be a set of ground atoms. Then, I satisfies a ground rule r if body+(r) ⊆ I and body−(r) ∩ I = ∅ imply head(r) ⊆ I. Furthermore, I is a model of a (not necessarily ground) program P, written I |= P, if ⊥ ∉ I and I satisfies each rule r ∈ ground(P). Given a negation-free program P, a set I is a minimal model of P if I |= P and no I′ ⊊ I exists such that I′ |= P. The Gelfond-Lifschitz reduct P^I of a logic program P w.r.t. I is obtained from ground(P) by removing each rule r such that body−(r) ∩ I ≠ ∅, and by removing all negated atoms not Bi from the remaining rules. A set I is a stable model of P if I is a minimal model of P^I. Given a fact A, we write P |= A if A ∈ I for each stable model I of P; otherwise, we write P ⊭ A.
A substitution is a partial mapping of variables to ground terms. The result of apply-
ing a substitution θ to a term, atom, or a set of atoms M is written as M θ and is defined
as usual. Let P be a logic program in which no predicate occurring in the head of a rule
in P also occurs negated in the body of a (possibly different) rule in P . Operator TP
applicable to a set of facts X is defined as follows:
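The displayed definition of TP is lost to the page break here; the standard immediate-consequence operator that we assume is TP(X) = X ∪ ⋃{head(r) | r ∈ ground(P), body+(r) ⊆ X, body−(r) ∩ X = ∅}, iterated until a fixpoint is reached. The sketch below illustrates this bottom-up evaluation informally, with facts as Python tuples and rules as functions over the current fact set; it is an illustration, not the paper's formal definition.

    def t_p(rules, facts):
        # One application of T_P: the input facts plus everything the rules derive.
        derived = set(facts)
        for rule in rules:
            derived |= rule(facts)
        return frozenset(derived)

    def t_p_fixpoint(rules, facts):
        # Iterate T_P until no new facts are produced (T_P^infinity in the finite case).
        current = frozenset(facts)
        while True:
            nxt = t_p(rules, current)
            if nxt == current:
                return current
            current = nxt

    # Example rule B(x) -> A(x), written as a function over the current fact set.
    rule_b_to_a = lambda facts: {("A", x) for (p, x) in facts if p == "B"}
    print(sorted(t_p_fixpoint([rule_b_to_a], {("B", "b1")})))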
3 Motivating Application
We next motivate our work using examples from the chemical Semantic Web appli-
cation mentioned in the introduction. The goal of this application is to automatically
classify chemical entities based on descriptions of their properties and structure. Un-
fortunately, as discussed in [20], OWL cannot describe cyclic structures with sufficient
precision. This causes problems when modelling chemical compounds since molecules
often have cyclic parts. For example, the cyclobutane molecule contains four carbon
atoms connected in a ring,2 as shown in Figure 1(a). One might try to represent this
structure using the following OWL axiom:
Fig. 1. (a) Chemical structure of cyclobutane; (b) interpretation I
Such an axiom, however, cannot capture the ring structure precisely, which prevents the entailment of the desired consequences. For example, one cannot define the class of molecules containing four-membered rings in such a way that it is correctly identified as a superclass of cyclobutane.
The formalism from [20] addresses this problem by augmenting an OWL ontology
with a set of rules and a set of description graphs (DGs), where each DG describes a
complex object by means of a directed labeled graph. To avoid misunderstandings, we
refer to the formalism from [20] as DGDL (Description Graph Description Logics), and
to the formalism presented in this paper as DGLP (Description Graph Logic Programs).
Thus, cyclobutane can be described using the DG shown in Figure 2(a). The first-order
semantics of DGDL ontologies ensures that all models of an ontology correctly repre-
sent the DG structure; for example, interpretation I from Figure 1(c) does not satisfy
the DG in Figure 1(a). Nevertheless, the first-order models of DGDL ontologies can still
be insufficiently precise. For example, the interpretation I shown in Figure 1(d) sat-
isfies the definition of cyclobutane under the semantics of DGDL ontologies. We next
show how the presence of models with excess information can restrict entailments.3
One might describe the class of hydrocarbon molecules (i.e., molecules consisting
exclusively of hydrogens and carbons) using axiom (2). One would expect the definition
of cyclobutane (as given in a DGDL ontology) and (2) to imply subsumption (3).
3
Krötzsch et al. [13] also suggest an extension of OWL 2 for the representation of graph-like
structures. As we show next, the first-order semantics of this formalism exhibits the same
problems as that of DGDL ontologies.
This, however, is not the case, since interpretation I does not satisfy axiom (3). One
might preclude the existence of extra atoms by adding cardinality restrictions requiring
each cyclobutane to have exactly four atoms. Even so, axiom (3) would not be entailed
because of a model similar to I, but where one carbon atom is also an oxygen atom.
One could eliminate such models by introducing disjointness axioms for all chemical
elements. Such gradual circumscription of models, however, is not an adequate solution,
as one can always think of additional information that needs to be ruled out [18].
In order to address such problems, we present a novel expressive formalism that
we call Description Graph Logic Programs (DGLP). DGLP ontologies are similar to
DGDL ontologies in that they extend OWL ontologies with DGs and rules. In our case,
however, the ontology is restricted to OWL 2 RL so that the ontology can be trans-
lated into rules [9]. We give semantics to our formalism by translating DGLP ontolo-
gies into logic programs with function symbols. As is common in logic programming,
the translation is interpreted under stable models. Consequently, interpretations such
as I are not stable models of the DG in Figure 2(a), and hence subsumption (3) is
entailed.
Logic programs with function symbols can axiomatise infinite non-tree-like struc-
tures, so reasoning with DGLP ontologies is trivially undecidable [4]. Our goal, how-
ever, is not to model arbitrarily large structures, but to describe complex objects up to a
certain level of granularity. For example, acetic acid has a carboxyl part, and carboxyl
has a hydroxyl part, but hydroxyl does not have an acetic acid part (see Fig. 3(a)).
In Section 5 we exploit this intuition and present a condition that ensures decidabil-
ity. In particular, we require the modeller to specify an ordering on DGs that, intuitively,
describes which DGs are allowed to imply existence of other DGs. Using a suitable test,
one can then check whether implications between DGs are acyclic and hence whether
DGs describe structures of bounded size only. The resulting semantic acyclicity condi-
tion allows for the modelling of naturally-arising molecular structures, such as acetic
acid, that would be ruled out by existing syntax-based acyclicity conditions [6,16].
Fig. 2. (a) The cyclobutane description graph Gcb; (b) the stable model of LP(O) from Example 2
Fig. 3. The chemical graph of acetic acid and the GAA and the Gcxl DGs: (a) chemical graph of acetic acid; (b) acetic acid DG; (c) carboxyl DG
in [9] and included in R, and datatypes can be handled as in [15]. Similarly, we could think of F as an OWL 2 ABox, as ABox assertions correspond directly to facts [9]. An example of a DGLP ontology is ⟨{GAA, Gcxl}, {(GAA, Gcxl)}, ∅, {AceticAcid(a)}⟩.
We next define the semantics of DGLP via a translation into logic programs. Since
R and F are already sets of rules and ≺ serves only to check acyclicity, we only need
to specify how to translate DGs into rules.
Definition 4 (Start, Layout, and Recognition Rule). Let G = (V, E, λ, A, m) be a description graph and let f_1^G, . . . , f_{|V|−1}^G be fresh distinct function symbols uniquely associated with G. The start rule s_G, the layout rule ℓ_G, and the recognition rule r_G of G are defined as follows:
The start and layout rules of a description graph serve to unfold the graph's structure. The function terms f_1^G(x), . . . , f_{|V|−1}^G(x) correspond to existential restrictions whose existentially quantified variables have been skolemised.
Example 1. The DG of cyclobutane from Figure 2 can be naturally represented by the existential restriction (4). The skolemised version of (4) is the start rule (s_Gcb). The layout rule (ℓ_Gcb) encodes the edges and the labelling of the description graph.4 Finally, the rule r_Gcb is responsible for identifying the cyclobutane structure:
4
In the rest of the paper for simplicity we assume that bonds are unidirectional.
Gcb(x1, x2, x3, x4, x5) → Cyclobutane(x1) ∧ ⋀_{2≤i≤4} Bond(xi, xi+1) ∧ Bond(x5, x2) ∧ ⋀_{2≤i≤5} HasAtom(x1, xi) ∧ ⋀_{2≤i≤5} Carbon(xi)    (ℓ_Gcb)
⋀_{2≤i≤5} HasAtom(x1, xi) ∧ ⋀_{2≤i≤5} Carbon(xi) ∧ ⋀_{2≤i≤4} Bond(xi, xi+1) ∧ Bond(x5, x2) → Gcb(x1, x2, x3, x4, x5)    (r_Gcb)
Next, we define Axioms(DG), which is a logic program that encodes a set of DGs.
Definition 5 (Axioms(DG)). For a description graph G = (V, E, λ, A, m), the program Axioms(G) is the set of rules that contains the start rule s_G and the layout rule ℓ_G if m ∈ {⇒, ⇔}, and the recognition rule r_G if m ∈ {⇐, ⇔}. For a set of description graphs DG = {Gi}_{1≤i≤n}, let Axioms(DG) = ⋃_{Gi∈DG} Axioms(Gi).
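As an illustration of how Axioms(G) can be produced mechanically, the sketch below generates textual start and layout rules for the cyclobutane DG, mirroring the rules (s_Gcb) and (ℓ_Gcb) above; the rule syntax and function names are ours, and the recognition rule is omitted for brevity.

    def dg_rules(name, main_class, labels, edges):
        # labels: {vertex_index: [class names]}, edges: [(i, j, role)]; vertex 1 is the root.
        n = len(labels)
        xs = [f"x{i}" for i in range(1, n + 1)]
        fs = [f"f{i}_{name}(x)" for i in range(1, n)]   # skolem terms f_1^G(x), ...
        start = f"{main_class}(x) -> {name}(x, {', '.join(fs)})"
        head_atoms = [f"{cls}({xs[i - 1]})" for i, classes in labels.items() for cls in classes]
        head_atoms += [f"{role}({xs[i - 1]}, {xs[j - 1]})" for i, j, role in edges]
        layout = f"{name}({', '.join(xs)}) -> " + " AND ".join(head_atoms)
        return start, layout

    labels = {1: ["Cyclobutane"], 2: ["Carbon"], 3: ["Carbon"], 4: ["Carbon"], 5: ["Carbon"]}
    edges = ([(1, i, "HasAtom") for i in range(2, 6)] +
             [(2, 3, "Bond"), (3, 4, "Bond"), (4, 5, "Bond"), (5, 2, "Bond")])
    for rule in dg_rules("Gcb", "Cyclobutane", labels, edges):
        print(rule)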
For each DGLP ontology O = ⟨DG, ≺, R, F⟩, we denote with LP(O) the program Axioms(DG) ∪ R ∪ F. To check whether a class C is subsumed by a class D, we can proceed as in standard OWL reasoning: we assert C(a) for a fresh individual a, and we check whether D(a) is entailed.
Definition 6 (Subsumption). Let O be a DGLP ontology, let C and D be unary predicates occurring in O, and let a be a fresh individual not occurring in O. Then, D subsumes C w.r.t. O, written O |= C ⊑ D, if LP(O) ∪ {C(a)} |= D(a) holds.
Example 2. We now show how a DGLP ontology can be used to obtain the inferences
described in Section 3. Rule (r1 ) encodes the class of four-membered ring molecules:
Molecule(x) ∧ ⋀_{1≤i≤4} HasAtom(x, yi) ∧ ⋀_{1≤i≤3} Bond(yi, yi+1) ∧ Bond(y4, y1) ∧ ⋀_{1≤i<j≤4} not yi = yj → MolWith4MemberedRing(x)    (r1)
The use of the equality predicate = in the body of r1 does not require an extension to
our syntax: if = occurs only in the body and not in the head of the rules, then negation
of equality can be implemented using a built-in predicate. In addition, we represent the
class of hydrocarbons with rules (r2 ) and (r3 ).
Molecule(x) ∧ HasAtom(x, y) ∧ not Carbon(y) ∧ not Hydrogen(y) → NHC(x)    (r2)
Molecule(x) ∧ not NHC(x) → HydroCarbon(x)    (r3)
Cyclobutane(x) → Molecule(x)    (r4)
Finally, we state that cyclobutane is a molecule using (r4), which corresponds to the OWL 2 RL axiom Cyclobutane ⊑ Molecule. Let DG = {Gcb}, let ≺ = ∅, let R = {r1, r2, r3, r4}, let F = {Cyclobutane(a)}, and let O = ⟨DG, ≺, R, F⟩. Figure 2(b) shows the only stable model of LP(O), by inspection of which we see that LP(O) |= HydroCarbon(a) and LP(O) |= MolWith4MemberedRing(a), as expected.
5 Semantic Acyclicity
Deciding whether a logic program with function symbols entails a given fact is known
to be undecidable in general [4]. This problem is closely related to the problem of
reasoning with datalog programs with existentially quantified rule heads (known as
tuple-generating dependencies or tgds) [5]. For such programs, conditions such as weak
acyclicity [6] or super-weak acyclicity [16] ensure the termination of bottom-up reason-
ing algorithms. Roughly speaking, these conditions examine the syntactic structure of
the program’s rules and check whether values created by a rule’s head can be propagated
so as to eventually satisfy the premise of the same rule. Due to the similarity between
tgds and our formalism, such conditions can also be applied to DGLP ontologies. These
conditions, however, may overestimate the propagation of values introduced by existen-
tial quantification and thus rule out unproblematical programs that generate only finite
structures. As we show in Example 4, this turns out to be the case for programs that
naturally arise from DGLP representations of molecular structures.
To mitigate this problem, we propose a new semantic acyclicity condition. The idea
is to detect repetitive construction of DG instances by checking the entailment of a spe-
cial propositional symbol Cycle. To avoid introducing an algorithm-specific procedural
definition, our notion is declarative. The graph ordering ≺ of a DGLP ontology O is
used to extend LP(O) with rules that derive Cycle whenever an instance of a DG G1
implies the existence of an instance of a DG G2 but G1 ⊀ G2.
Definition 7 (Check(O)). Let Gi = (Vi, Ei, λi, Ai, mi), i ∈ {1, 2}, be two description graphs. We define ChkPair(G1, G2) and ChkSelf(Gi) as follows:
ChkPair(G1, G2) = {G1(x1, . . . , x_{|V1|}) ∧ A2(xk) → Cycle | 1 ≤ k ≤ |V1|}    (5)
ChkSelf(Gi) = {Gi(x1, . . . , x_{|Vi|}) ∧ Ai(xk) → Cycle | 1 < k ≤ |Vi|}    (6)
Let DG = {Gi}_{1≤i≤n} be a set of description graphs and let ≺ be a graph ordering on DG. We define Check(DG, ≺) as follows:
Check(DG, ≺) = ⋃_{i,j ∈ {1,...,n}, i ≠ j, Gi ⊀ Gj} ChkPair(Gi, Gj) ∪ ⋃_{1≤i≤n} ChkSelf(Gi)
Example 3. Figure 3(a) shows the structure of acetic acid molecules and the parts they
consist of. In this example, however, we focus on the description graphs for acetic acid
(GAA ) and carboxyl (Gcxl ), which are shown in Figures 3(b) and 3(c), respectively.
Since an instance of acetic acid implies the existence of an instance of a carboxyl, but
not vice versa, we define our ordering as GAA ≺ Gcxl . Thus, for DG = {GAA , Gcxl } and
≺ = {(GAA , Gcxl )}, set Check(DG, ≺) contains the following rules:
Gcxl (x1 , x2 , x3 ) ∧ AceticAcid(xi ) → Cycle for 1 ≤ i ≤ 3
GAA (x1 , x2 , x3 ) ∧ AceticAcid(xi ) → Cycle for 2 ≤ i ≤ 3
Gcxl (x1 , x2 , x3 ) ∧ Carboxyl(xi ) → Cycle for 2 ≤ i ≤ 3
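For illustration only, the following Python sketch generates the Check(DG, ≺) rules of Definition 7 as strings; it assumes, as a simplification of ours, that each description graph is given just by its arity and its main concept, and it reproduces exactly the rule families listed above for GAA and Gcxl.

def check_rules(graphs, ordering):
    # graphs: name -> (arity, main concept A); ordering: pairs (G1, G2) with G1 ≺ G2.
    rules = []
    for g1, (n1, _) in graphs.items():                 # ChkPair(G1, G2) for G1 ⊀ G2
        for g2, (_, a2) in graphs.items():
            if g1 != g2 and (g1, g2) not in ordering:
                vs = ", ".join(f"x{k}" for k in range(1, n1 + 1))
                rules += [f"{g1}({vs}) ∧ {a2}(x{k}) → Cycle" for k in range(1, n1 + 1)]
    for g, (n, a) in graphs.items():                   # ChkSelf(G)
        vs = ", ".join(f"x{k}" for k in range(1, n + 1))
        rules += [f"{g}({vs}) ∧ {a}(x{k}) → Cycle" for k in range(2, n + 1)]
    return rules

graphs = {"GAA": (3, "AceticAcid"), "Gcxl": (3, "Carboxyl")}
for r in check_rules(graphs, {("GAA", "Gcxl")}):
    print(r)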
We next define when a DGLP ontology is semantically acyclic. Intuitively, this condi-
tion will ensure that the evaluation of LP(O) does not generate a chain of description
graph instances violating the DG ordering.
Definition 8. A DGLP ontology O is said to be semantically acyclic if and only if
LP(O) ∪ Check(O) ⊭ Cycle.
Example 4. Let DG = {GAA , Gcxl } with mAA = mcxl = ⇔ , let ≺ = {(GAA , Gcxl )},
let F = {AceticAcid(a)}, and let O = ⟨DG, ≺, ∅, F⟩. By Definition 5, logic program
LP(O) contains F and the following rules (HP abbreviates HasPart):
AceticAcid(x) → GAA (x, f1 (x), f2 (x))
GAA (x, y, z) → AceticAcid(x) ∧ Methyl(y) ∧ Carboxyl(z) ∧ HP(x, y) ∧ HP(x, z)
Methyl(y) ∧ Carboxyl(z) ∧ HP(x, y) ∧ HP(x, z) → GAA (x, y, z)
Carboxyl(x) → Gcxl (x, g1 (x), g2 (x))
Gcxl (x, y, z) → Carboxyl(x) ∧ Carbonyl(y) ∧ Hydroxyl(z) ∧ HP(x, y) ∧ HP(x, z)
Carbonyl(y) ∧ Hydroxyl(z) ∧ HP(x, y) ∧ HP(x, z) → Gcxl (x, y, z)
Let also Check(O) = Check(DG, ≺) as defined in Example 3. The stable model of
P = LP(O) ∪ Check(O) can be computed using the TP operator:
TP∞ = {AceticAcid(a), GAA (a, f1 (a), f2 (a)), HP(a, f1 (a)), HP(a, f2 (a)),
Methyl(f1 (a)), Carboxyl(f2 (a)), Gcxl (f2 (a), g1 (f2 (a)), g2 (f2 (a))), Carbonyl(g1 (f2 (a))),
Hydroxyl(g2 (f2 (a))), HP(f2 (a), g1 (f2 (a))), HP(f2 (a), g2 (f2 (a)))}
Since Cycle is not in the (only) stable model of P , we have P ⊭ Cycle and O is
semantically acyclic. However, P is neither weakly [6] nor super-weakly acyclic [16].
This, we believe, justifies the importance of semantic acyclicity for our applications.
Example 5 shows how functions may trigger infinite generation of DG instances.
Example 5. Let O = ⟨{G}, ∅, {B(x) → A(x)}, {A(a)}⟩ be a DGLP ontology where G
is such that Axioms(G) is as follows:
Axioms(G) = {A(x) → G(x, f(x)), G(x1 , x2 ) → A(x1 ) ∧ B(x2 ) ∧ R(x1 , x2 )}
For Check(O) = {G(x1 , x2 ) ∧ A(x2 ) → Cycle} and P = LP(O) ∪ Check(O), we
have TP∞ = {A(a), G(a, f(a)), R(a, f(a)), B(f(a)), A(f(a)), Cycle, . . .}. Now O is not
semantically acyclic because Cycle ∈ TP∞ , which indicates that TP can be applied to F
in a repetitive way without terminating.
Semantic acyclicity is a sufficient, but not a necessary termination condition: bottom-up
evaluation of LP(O) ∪ Check(O) can terminate even if O is not semantically acyclic.
Example 6. Let O = ⟨{G}, ∅, {R(x1 , x2 ) ∧ C(x1 ) → A(x2 )}, {A(a), C(a)}⟩ be a DGLP
ontology where G, Check(O), and P are defined as in Example 5. One can
see that {A(a), C(a), G(a, f(a)), R(a, f(a)), B(f(a)), A(f(a)), Cycle, G(f(a), f(f(a))),
B(f(f(a))), R(f(a), f(f(a)))} is the stable model of P computable by finitely many ap-
plications of the TP operator; however, O is not semantically acyclic since Cycle ∈ TP∞ .
In the present section, we consider the problem of reasoning with a DGLP ontology
O = ⟨DG, ≺, R, F⟩ where R is negation-free. Intuitively, one can simply apply the
TP operator to P = LP(O) ∪ Check(O) and compute TP1 , TP2 , . . . , TPi , and so on. By
Theorem 7, for some i we will either reach a fixpoint or derive Cycle. In the former case,
we have the stable model of O (if ⊥ ∉ TPi ), which we can use to decide the relevant
reasoning problems; in the latter case, we know that O is not semantically acyclic.
Theorem 7. Let O = ⟨DG, ≺, R, F⟩ be a DGLP ontology with R negation-free, and
let P = LP(O) ∪ Check(O). Then, Cycle ∈ TPi or TPi+1 = TPi for some i ≥ 1.
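To make the procedure sketched before Theorem 7 concrete, the following Python sketch (again an illustration of ours, not the authors' implementation) applies a naive immediate-consequence operator round by round until it either derives Cycle or reaches a fixpoint. Each rule is encoded, purely for simplicity, as a function from the current fact set to the atoms it derives, and the usage replays the program of Example 5 together with its Check(O) rule.

def tp_until_fixpoint_or_cycle(rules, facts):
    # Apply T_P round by round; stop when Cycle is derived or nothing new is added.
    current = set(facts)
    while True:
        derived = set()
        for rule in rules:
            derived |= rule(current)
        if "Cycle" in derived:
            return current | derived, True    # Cycle derived: not semantically acyclic
        if derived <= current:
            return current, False             # fixpoint reached: stable model computed
        current |= derived

def args(facts, pred):
    return [f[1:] for f in facts if f[0] == pred]

def expand(facts):       # A(x) → G(x, f(x))
    return {("G", x, f"f({x})") for (x,) in args(facts, "A")}

def unfold(facts):       # G(x1, x2) → A(x1) ∧ B(x2) ∧ R(x1, x2)
    out = set()
    for (x1, x2) in args(facts, "G"):
        out |= {("A", x1), ("B", x2), ("R", x1, x2)}
    return out

def b_to_a(facts):       # B(x) → A(x)
    return {("A", x) for (x,) in args(facts, "B")}

def check(facts):        # G(x1, x2) ∧ A(x2) → Cycle
    return {"Cycle"} if any(("A", x2) in facts for (_, x2) in args(facts, "G")) else set()

model, cyclic = tp_until_fixpoint_or_cycle([expand, unfold, b_to_a, check], {("A", "a")})
print(cyclic)    # True: the ontology of Example 5 is not semantically acyclic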
By Theorem 7, checking the semantic acyclicity of O is thus decidable. If the stable
model of LP(O) ∪ Check(O) is infinite, then Cycle is derived; however, the converse
does not hold, as shown in Example 6. Furthermore, a stable model of LP(O), if it
exists, is clearly contained in the stable model of LP(O) ∪ Check(O), and the only
possible difference between the two stable models is for the latter to contain Cycle.
The following result shows that, as long as LP(O) is stratified, one can always assign
the cycle checking rules in Check(O) to the appropriate strata and thus obtain a DG-
stratification of LP(O) ∪ Check(O).
Lemma 1. Let O = DG, ≺, R, F be a DGLP ontology. If σ is a stratification of
LP(O), then σ can be extended to a DG-stratification σ′ of LP(O) ∪ Check(O).
The following theorem implies that, given a stratifiable DGLP ontology, we can decide
whether the ontology is semantically acyclic, and if so, we can compute its stable model
and thus solve all relevant reasoning problems.
Theorem 9. Let O be a DGLP ontology and P = LP(O) ∪ Check(O). If P1 , . . . , Pn
is a stratification partition of P w.r.t. a DG-stratification of P , then, for each j with
1 ≤ j ≤ n, there exists i ≥ 1 such that Cycle ∈ U_{Pj}^i, or U_{Pj}^{i+1} = U_{Pj}^i and U_{Pj}^i is finite.
Our tests have shown all of the ontologies to be acyclic. Furthermore, all tests have
correctly classified the relevant molecules into appropriate molecule classes; for exam-
ple, we were able to conclude that acetylene has exactly two carbons, that cyclobu-
tane has a four-membered ring, and that dinitrogen is inorganic. Please note that none
of these inferences can be derived using the approach from [20] due to the lack of
negation-as-failure, or using OWL only due to its tree-model property.
All tests completed in a reasonable amount of time: no test required more
than a few minutes. Given the prototypical character of our application, we consider
these results to be encouraging and we take them as evidence of the practical feasibility
of our approach. The most time-intensive test was T4 , which identified molecules con-
taining a four-membered ring. We do not consider this surprising, given that the rule for
recognising T4 contains many atoms in the body and thus requires evaluating a complex
join. We noticed, however, that reordering the atoms in the body of the rule significantly
reduces reasoning time. Thus, trying to determine an appropriate ordering of rule atoms
via join-ordering optimisations, such as the one used for query optimisation in relational
databases, might be a useful technique in an optimised DGLP implementation.
References
1. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F.: The Descrip-
tion Logic Handbook: Theory, Implementation, and Applications. CUP (2007)
2. Baget, J.F., Leclère, M., Mugnier, M.-L., Salvat, E.: On rules with existential variables: Walk-
ing the decidability line. Artif. Intell. 175(9-10), 1620–1654 (2011)
3. Baral, C., Gelfond, M.: Logic Programming and Knowledge Representation. Journal of
Logic Programming 19, 73–148 (1994)
4. Beeri, C., Vardi, M.Y.: The Implication Problem for Data Dependencies. In: Even, S., Kariv,
O. (eds.) ICALP 1981. LNCS, vol. 115, pp. 73–85. Springer, Heidelberg (1981)
5. Cali, A., Gottlob, G., Lukasiewicz, T., Marnette, B., Pieris, A.: Datalog+/-: A family of logi-
cal knowledge representation and query languages for new applications. In: LICS (2010)
6. Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data Exchange: Semantics and Query An-
swering. In: Calvanese, D., Lenzerini, M., Motwani, R. (eds.) ICDT 2003. LNCS, vol. 2572,
pp. 207–224. Springer, Heidelberg (2002)
7. Grau, B.C., Horrocks, I., Motik, B., Parsia, B., Patel-Schneider, P.F., Sattler, U.: OWL 2: The
next step for OWL. J. Web Sem. 6(4), 309–322 (2008)
8. Graves, H.: Representing Product Designs Using a Description Graph Extension to OWL 2.
In: Proc. of the 5th OWLED Workshop (2009)
9. Grosof, B.N., Horrocks, I., Volz, R., Decker, S.: Description Logic Programs: Combining
Logic Programs with Description Logic. In: WWW (2003)
10. Hastings, J., Dumontier, M., Hull, D., Horridge, M., Steinbeck, C., Sattler, U., Stevens, R.,
Hörne, T., Britz, K.: Representing Chemicals using OWL, Description Graphs and Rules. In:
OWLED (2010)
11. Horrocks, I., Patel-Schneider, P.F., Boley, H., Tabet, S., Grosof, B., Dean, M.: SWRL: A
semantic web rule language combining OWL and RuleML. W3C Member Submission (May
21, 2004), http://www.w3.org/Submission/SWRL/
12. Krötzsch, M., Rudolph, S., Hitzler, P.: ELP: Tractable Rules for OWL 2. In: Sheth, A.P.,
Staab, S., Dean, M., Paolucci, M., Maynard, D., Finin, T., Thirunarayan, K. (eds.) ISWC
2008. LNCS, vol. 5318, pp. 649–664. Springer, Heidelberg (2008)
13. Krötzsch, M., Maier, F., Krisnadhi, A., Hitzler, P.: A better uncle for owl: nominal schemas
for integrating rules and ontologies. In: Srinivasan, S., Ramamritham, K., Kumar, A., Ravin-
dra, M.P., Bertino, E., Kumar, R. (eds.) WWW, pp. 645–654. ACM (2011)
14. Levy, A.Y., Rousset, M.C.: Combining Horn Rules and Description Logics in CARIN. Arti-
ficial Intelligence 104(1-2), 165–209 (1998)
15. Lutz, C., Areces, C., Horrocks, I., Sattler, U.: Keys, nominals, and concrete domains. J. of
Artificial Intelligence Research 23, 667–726 (2004)
16. Marnette, B.: Generalized Schema-Mappings: from Termination to Tractability. In: PODS
(2009)
17. de Matos, P., Alcántara, R., Dekker, A., Ennis, M., Hastings, J., Haug, K., Spiteri, I., Turner,
S., Steinbeck, C.: Chemical Entities of Biological Interest: an update. Nucleic Acids Re-
search 38(Database-Issue), 249–254 (2010)
18. McCarthy, J.: Circumscription - a form of non-monotonic reasoning. Artif. Intell. (1980)
19. Motik, B., Cuenca Grau, B., Horrocks, I., Wu, Z., Fokoue, A., Lutz, C.: OWL 2 Web Ontol-
ogy Language: Profiles, W3C Recommendation (October 27, 2009)
20. Motik, B., Grau, B.C., Horrocks, I., Sattler, U.: Representing Ontologies Using Description
Logics, Description Graphs, and Rules. Artif. Int. 173, 1275–1309 (2009)
21. Motik, B., Sattler, U., Studer, R.: Query Answering for OWL-DL with Rules. J. Web
Sem. 3(1), 41–60 (2005)
22. Rector, A.L., Nowlan, W.A., Glowinski, A.: Goals for concept representation in the GALEN
project. In: SCAMC 1993, pp. 414–418 (1993)
23. Vardi, M.Y.: Why is modal logic so robustly decidable? DIMACS Series in Discrete Mathe-
matics and Theoretical Computer Science, pp. 149–184 (1996)
Extending Description Logic Rules
1 Introduction
Several different paradigms have been devised to model ontologies for the Seman-
tic Web [6]. Currently, the most prominent approaches for modeling this knowl-
edge are description logics (DLs) [1] and rules based on the logic programming
paradigm. Although both are based on classical logic, they differ significantly
and the search for a satisfactory integration is still ongoing [4,11].
Even if the DL-based Web Ontology Language OWL [5], a W3C standard,
is the main language for modeling ontologies in the Semantic Web, rule-based ap-
proaches have also proven very successful. Included in many commercial
applications, rules continue to be pursued in parallel to OWL using the Rule
Interchange Format RIF [2], also a W3C standard, as a rule exchange layer. Un-
derstanding the differences between both paradigms in order to come up with
workable combinations has become a major effort in current research.
This paper extends the work presented in [13], where it has been shown that,
in fact, many rules can be expressed in OWL. We extend this work to include
some types of rules previously excluded. We formally define C-Rules, a set of
rules that can be embedded directly into OWL extended with role conjunction.
We also discuss how our approach can be used in conjunction with previous
weaker methods for embedding rules based on nominal schemas.
To express C-Rules in DL notation we employ the DL SROIQ(∃ ), an ex-
tension of SROIQ [8], which underlies OWL 2 DL. SROIQ(∃ ) encompasses
SROIQ adding a restricted form of role conjunction.
This work was supported by the National Science Foundation under award 1017225
III: Small: TROn – Tractable Reasoning with Ontologies.
2 Preliminaries
We introduce SROIQ(∃ ), a DL fragment that adds role conjunction, in a
restricted way, to SROIQ [8]. Axioms of the form R1 ⊓ R2 ⊑ V are allowed in
SROIQ(∃ ), where R1 and R2 are two (possibly complex) roles.1
Roles which appear on the right-hand sides of axioms of the form R1 ⊓ R2 ⊑
V are restricted to only appear in concepts of the form ∃V.C. Although this
1 In a sense, role conjunction was already implicit in [14], and is also used in [13], for
a similar purpose.
precondition might look very restrictive, it suffices for the use of role conjunction
for expressing rules, as discussed in this paper. As a technical note, in terms of
regularity of RBoxes (required for decidability), we assume that for a role V
appearing in an axiom R1 ⊓ R2 ⊑ V we have that both R1 ≺ V and R2 ≺ V (≺
indicates the order in a regular role hierarchy).
SROIQ(∃ ) bears the same semantics as SROIQ, with the exception of the
role conjunction constructor. The formal semantics is as usual (see, e.g., [17]),
and for lack of space we do not repeat it here. Note that it follows easily from
the arguments laid out in [17] that SROIQ(∃ ) is decidable.
Undirected Graph. The direction of the edges in the graph can be changed using
the inverse role constructor. Therefore, there is no need for our algorithm in
Definition 1 to check the direction of binary predicates when we apply different
simplifications. Below, S is a fresh role name in the knowledge base.
Rule Subset      Eq. Subset      Set α to the KB
R(x, y)          S(y, x)         R− ⊑ S
Translating Terminal Rules. A terminal rule R is a rule of the form ⋀ Bi → H.
We have that the body of the rule ⋀ Bi contains one, and at most one, free FOL
variable x appearing only once in a unary predicate of the form B(x) (the graph
has been reduced to the root vertex, and therefore, there is only one variable
left appearing only once). The body Bi might also contain other predicates of
the form C(a) or R(b, c) s.t. a, b, and c are constants. The head H is composed
of a single unary predicate H(x) s.t. x is the same free variable that appears in
the body.
A terminal rule R is translated into a DL inclusion axiom of the form ⊓ Bi ⊑ H.
This axiom contains a fresh concept H on the right-hand side of the inclusion
axiom and a concept intersection on the left-hand side featuring the following
elements:
– A fresh concept B standing for the unary predicate B(x) s.t. x is the only
free variable appearing in the rule.
– A concept ∃U.(C ⊓ {a}) for every unary predicate of the form C(a) appearing
in the body where a is a constant.
– A concept ∃U.({b} ⊓ ∃R.{c}) for every binary predicate of the form R(b, c)
appearing in the body of the rule where b and c are constants.
The argument just given also constitutes a proof of Lemma 2.
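A minimal Python sketch of this translation, assuming only the three bullet points above and treating concepts as plain strings (the function name and the string-based DL syntax are our own illustrative choices), is the following.

def terminal_rule_to_axiom(var_atom, const_unary, const_binary, head):
    # Build the inclusion for a terminal rule B(x) ∧ C1(a1) ∧ ... ∧ R1(b1,c1) ∧ ... → H(x),
    # following the bullet points above; U is the universal role.
    parts = [var_atom]                                               # fresh concept B for B(x)
    parts += [f"∃U.({c} ⊓ {{{a}}})" for (c, a) in const_unary]       # unary atoms with constants
    parts += [f"∃U.({{{b}}} ⊓ ∃{r}.{{{c}}})" for (r, b, c) in const_binary]  # binary atoms with constants
    return f"{' ⊓ '.join(parts)} ⊑ {head}"

# Terminal rule  B(x) ∧ C(a) ∧ R(b, c) → H(x):
print(terminal_rule_to_axiom("B", [("C", "a")], [("R", "b", "c")], "H"))
# B ⊓ ∃U.(C ⊓ {a}) ⊓ ∃U.({b} ⊓ ∃R.{c}) ⊑ H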
Again, we see that any GR where every vertex u has d(u) ≥ 3 (possibly
obtained after several reduction steps) cannot be simplified. Otherwise the graph
can be reduced to a single edge (u, t) s.t. u ∈ H and t ∈ H with H the head of
the rule. Note that the procedure is almost the same as in Section 3, except for
the accepting condition.
The process to translate the rule into a set of equivalent SROIQ(∃ ) state-
ments and proofs remain the same as the one presented in Section 3 except for
the trivial translation of the terminal rule.
It is important to remark that in some cases a B-Rule may not be expressible
while a U-Rule with the same body is. The second vertex might block a possible
role reduction forbidding further simplifications. As an example we have that
R1 (x, y) ∧ R2 (x, w) ∧ R3 (w, y) ∧ R4 (y, z) ∧ R5 (w, z) → C(x) is expressible in
SROIQ(∃ ) using our approach, while R1 (x, y)∧R2 (x, w)∧R3 (w, y)∧R4 (y, z)∧
R5 (w, z) → C(x, z) is not.
5 Examples
We start with a worked example for our transformation. As initial rule, we use
where a and b are constants and x, y, and z are free variables. Transformations
following the discussion from Section 3 are detailed in Table 1.
Note that the rule listed in step 6 of Table 1 can already be directly translated
to SROIQ(∃ ) as ∃M. ∃E.{a} ∃U.({a} ∃Y.{b}) Z. But to improve
readability of the paper, our rule reduction approach has been presented in a
simpler form, avoiding such shortcuts. So, although the method shown is sound
and correct, there are U-Rules and B-Rules, as the one presented in the example,
where at some step of the reduction process no further simplifications are strictly
required. An earlier translation of the rule reduces the number of statements that
need to be added to the knowledge base. Recall, in particular, that rules with
tree shaped graphs are directly expressible in DL [11,13].
Also, we have that reductions according to our transformations are applied
non-deterministically. Although any rule reduction leading to a terminal rule is
essentially correct, there might be differences in the set of axioms added to the
knowledge base. For example, let R be a U-Rule containing the binary predicates
A(x, y) and B(y, z) s.t. both y and z are variables not appearing anywhere else
in the rule (hence, we have that d(y) = 2 and d(z) = 1). In the next reduction
step, we can decide which variable, y or z, we want to erase.
Assuming we want to reduce A(x, y) and B(y, z) to Z(x), there are two dif-
ferent ways of doing so, namely (1) first reducing y, and (2) first reducing z. In
the first case, we end up with two axioms A ◦ B ⊑ C and ∃C.⊤ ⊑ Z, while in the
Table 1. Reduction example. For every step, substitute the rule in the previous row with
the one in the current row, and add the axioms in the second column to the knowledge
base.
This rule places a pair of individuals under the binary predicate IllegalRe-
viewerOf if the first is a teacher of the student who is the author of the reviewed
paper. It can be transformed into the following set of SROIQ(∃ ) axioms.
TeacherOf ◦ AuthorOf ⊑ R1
ReviewerOf ⊓ R1 ⊑ R2
R2 ⊑ IllegalReviewerOf
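The pattern of this worked example can be written down generically. The following Python sketch is purely illustrative: the fresh role names R1 and R2 are assumptions of this example, and the function only covers rules whose body is a role chain from x to y plus one extra role atom between x and y, as in the axioms above.

def path_rule_to_axioms(chain, extra, head):
    # extra(x, y) ∧ chain_1(x, z1) ∧ ... ∧ chain_n(z_{n-1}, y) → head(x, y):
    # introduce a fresh role for the chain, intersect it with the extra atom,
    # and subsume the result under the head role.
    return [f"{' ◦ '.join(chain)} ⊑ R1",
            f"{extra} ⊓ R1 ⊑ R2",
            f"R2 ⊑ {head}"]

for ax in path_rule_to_axioms(["TeacherOf", "AuthorOf"], "ReviewerOf", "IllegalReviewerOf"):
    print(ax)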
In earlier sections of this paper we have shown how to translate some FOL rules
into DL notation. Although some rules can be translated to SROIQ(∃ ) using
the presented approach there are still more complex rules that cannot be simpli-
fied in the same way. To express these rules we employ the DL SROIQV (∃ ).
SROIQV(∃ ) adds nominal schemas, a DL constructor that can be used as
"variable nominal classes", to the previously described SROIQ(∃ ). We will
refrain from introducing all formal details and refer the reader to [9,10,11,15]
for this. While the semantic intuition behind nominal schemas is the same as
that behind DL-safe variables, nominal schemas integrate seamlessly with DL
syntax. As a consequence, the DL fragment SROIQV(∃ ) encompasses DL-safe
variables while staying within the DL/OWL language paradigm avoiding the use
of hybrid approaches.
Using these nominal schemas we are able to express FOL rules that are not
part of the treatment in Sections 3 and 4. Consider, for example, the rule
∃R1 .(∃R4 .{z} ⊓ ∃R5 .{w}) ⊓ ∃R2 .{z} ⊓ ∃R3 .({w} ⊓ ∃R6 .{z}) ⊑ C
Note that, as already stated, nominal schemas do not share the same semantics
defined for FOL variables. Nominal schemas, as DL-safe variables, are restricted
to stand only for nominals which are explicitly present in the knowledge base,
while FOL variables can represent both named and unknown individuals. There-
fore, the statements presented in the example just given are not strictly equiva-
lent. Despite this fact, nominal schemas allow us to retain most of the entailments
from the original FOL axiom without increasing the worst-case complexity of the
DL fragment.
Although nominal schemas do not increase the worst-case complexity of the
language [15], the number of different nominal schemas per axiom can affect the
performance of the reasoning process [3,10]. It is therefore desirable to use as
few nominal schemas as possible.
We now discuss two different ways of translating complex rules into DLs. First
we prove the following.
Proof. Given a rule R, we first roll up all binary predicates containing one constant
to simplify them, as shown in Section 3. All binary predicates in the rule containing
the same pair of variables are also replaced by a single binary predicate as
described under Unifying Binary Predicates in Section 3.
Due to these transformations, we can now assume without loss of generality
that the rule R contains only unary predicates with a constant, binary predicates
with two constants, and binary predicates with two variables as arguments.
Now choose two variables x and y s.t. x is a root vertex and y is not. Using
the inverse role construct we can now swap arguments in binary predicates s.t. x
always appears in the first argument, and y appears in the first argument of every
predicate in which the other variable is not x. The variables selected will be the
only ones not substituted by a nominal schema in the translated rule.
The rule body is now translated as shown in Table 2. The resulting DL ex-
pressions are joined by conjunction. Bi (y), R(x, y), and Ri (y, vi ) are all the
predicates where y appears.
Finally, the head H(x) can be rewritten into the concept H (or if it is a
binary predicate H(x, z), a concept of the form ∃H.{z}), and the implication
arrow replaced by class inclusion ⊑.
It is straightforward to formally verify the correctness of this transformation,
and parts of the proof are similar to the correctness proof from [15] for the
embedding of binary Datalog into SROIQV.
Clearly, the number of nominal schemas used to represent rule R is n − 2, the
total number of free variables minus 2.
∃R1 .(∃R4 .{z} ⊓ ∃R5 .{w}) ⊓ ∃R2 .{z} ⊓ ∃R3 .{w} ⊓ ∃U.({w} ⊓ ∃R6 .{z}) ⊑ C
As another example, the following rule transforms into the subsequent axiom.
Proof. By grounding every variable but three in the rule to named individuals
we end up with a larger number of rules s.t. each one of them contains only
three different free variables.6 All these new grounded rules are expressible in
DL using the approach presented in Section 3 of this paper.
While the first of the approaches just mentioned allows us to represent all knowl-
edge in SROIQV(∃ ), the second one, although initially looking more efficient,
requires preprocessing steps. Further research and algorithms are required to
smartly deal with nominal schemas other than through such grounding, a cum-
bersome technique that requires too much space and time for current reasoners
[3,10].
Let us finally return to the regularity issue discussed at the very end of Sec-
tion 5. In the example discussed there, if we desire to also add the statement
IllegalReviewerOf ⊑ ReviewerOf to the knowledge base, we cannot do so directly
without violating regularity. Using nominal schemas, however, we can weaken
this axiom to the form
∃IllegalReviewerOf.{x} ⊑ ∃ReviewerOf.{x}
(or, e.g., to
∃IllegalReviewerOf− .{x} ⊑ ∃ReviewerOf− .{x}
6 Note that any rule with three variables can be reduced using our approach. Having
only three nodes in the graph, for each of them we have that d(u) ≤ 2 and therefore
all of them can be reduced.
or to both), where {x} is a nominal schema. Essentially, this means that the
role inclusion will apply in case the first argument or the second argument (the
filler) is a known individual, i.e., whenever the individuals connected by the
IllegalReviewerOf property are not both unnamed. While this is weaker than the
standard semantics, it should provide a viable workaround in many cases. Also
note that, alternatively, the regularity violation could be avoided by using a
similarly weakened form of any of the other statements involved in the violation.
References
1. Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P. (eds.):
The Description Logic Handbook: Theory, Implementation, and Applications, 2nd
edn. Cambridge University Press (2007)
2. de Bruijn, J.: RIF RDF and OWL Compatibility. W3C Recommendation (June 22,
2010), http://www.w3.org/TR/rif-rdf-owl/
3. Carral Martı́nez, D., Krisnadhi, A., Maier, F., Sengupta, K., Hitzler, P.: Reconciling
OWL and rules. Tech. rep., Kno.e.sis Center, Wright State University, Dayton,
Ohio, U.S.A. (2011), http://www.pascal-hitzler.de/
4. Hitzler, P., Parsia, B.: Ontologies and rules. In: Staab, S., Studer, R. (eds.) Hand-
book on Ontologies, 2nd edn., pp. 111–132. Springer (2009)
5. Hitzler, P., Krötzsch, M., Parsia, B., Patel-Schneider, P.F., Rudolph, S. (eds.):
OWL 2 Web Ontology Language: Primer. W3C Recommendation (October 27,
2009), http://www.w3.org/TR/owl2-primer/
6. Hitzler, P., Krötzsch, M., Rudolph, S.: Foundations of Semantic Web Technologies.
Chapman & Hall/CRC (2009)
7. Horrocks, I., Patel-Schneider, P.F., Bechhofer, S., Tsarkov, D.: OWL Rules: A
proposal and prototype implementation. Journal of Web Semantics 3(1), 23–40
(2005)
8. Horrocks, I., Kutz, O., Sattler, U.: The even more irresistible SROIQ. In: Proc.
of the 10th Int. Conf. on Principles of Knowledge Representation and Reasoning
(KR 2006), pp. 57–67. AAAI Press (2006)
9. Knorr, M., Hitzler, P., Maier, F.: Reconciling OWL and non-monotonic rules for
the Semantic Web. Tech. rep., Kno.e.sis Center, Wright State University, Dayton,
OH, U.S.A. (2011), http://www.pascal-hitzler.de/
10. Krisnadhi, A., Hitzler, P.: A tableau algorithm for description logics with nominal
schemas. Tech. rep., Kno.e.sis Center, Wright State University, Dayton, OH, U.S.A.
(2011), http://www.pascal-hitzler.de/
11. Krisnadhi, A., Maier, F., Hitzler, P.: OWL and Rules. In: Polleres, A., d’Amato,
C., Arenas, M., Handschuh, S., Kroner, P., Ossowski, S., Patel-Schneider, P. (eds.)
Reasoning Web 2011. LNCS, vol. 6848, pp. 382–415. Springer, Heidelberg (2011)
12. Krötzsch, M.: Description Logic Rules, Studies on the Semantic Web, vol. 008. IOS
Press/AKA (2010)
13. Krötzsch, M., Rudolph, S., Hitzler, P.: Description logic rules. In: Ghallab, M.,
Spyropoulos, C.D., Fakotakis, N., Avouris, N.M. (eds.) Proceeding of the 18th
European Conference on Artificial Intelligence, Patras, Greece, July 21-25, pp.
80–84. IOS Press, Amsterdam (2008)
14. Krötzsch, M., Rudolph, S., Hitzler, P.: ELP: Tractable rules for OWL 2. In: Sheth,
A.P., Staab, S., Dean, M., Paolucci, M., Maynard, D., Finin, T., Thirunarayan, K.
(eds.) ISWC 2008. LNCS, vol. 5318, pp. 649–664. Springer, Heidelberg (2008)
15. Krötzsch, M., Maier, F., Krisnadhi, A.A., Hitzler, P.: A better uncle for OWL:
Nominal schemas for integrating rules and ontologies. In: Proceedings of the 20th
International Conference on World Wide Web (WWW 2011), pp. 645–654. ACM
(2011)
16. Motik, B., Sattler, U., Studer, R.: Query answering for OWL DL with rules. J. of
Web Semantics 3(1), 41–60 (2005)
17. Rudolph, S., Krötzsch, M., Hitzler, P.: Cheap Boolean Role Constructors for De-
scription Logics. In: Hölldobler, S., Lutz, C., Wansing, H. (eds.) JELIA 2008. LNCS
(LNAI), vol. 5293, pp. 362–374. Springer, Heidelberg (2008)
Prexto: Query Rewriting
under Extensional Constraints in DL-Lite
Riccardo Rosati
1 Introduction
The DL-Lite family of description logics [4,2] is currently one of the most studied on-
tology specification languages. DL-Lite constitutes the basis of the OWL2 QL language
[1], which is part of the standard W3C OWL2 ontology specification language. The
distinguishing feature of DL-Lite is to identify ontology languages in which expressive
queries, in particular, unions of conjunctive queries (UCQs), over the ontology can be
efficiently answered. Therefore, query answering is the most studied reasoning task in
DL-Lite (see, e.g., [13,9,7,15,6,5]).
The most common approach to query answering in DL-Lite is through query rewrit-
ing. This approach consists of computing a so-called perfect rewriting of the query with
respect to a TBox: the perfect rewriting of a query q for a TBox T is a query q′ that
can be evaluated on the ABox only and produces the same results as if q were evaluated
on both the TBox and the ABox. This approach is particularly interesting in DL-Lite,
because, for every UCQ q, query q′ can be expressed in first-order logic (i.e., SQL),
therefore query answering can be delegated to a relational DBMS, since it can be re-
duced to the evaluation of an SQL query on the database storing the ABox.
The shortcoming of the query rewriting approach is that the size of the rewritten
query may be exponential with respect to the size of the original query. In particular,
this is true when the rewritten query is in disjunctive normal form, i.e., is a UCQ. On
the other hand, [5] shows the existence of polynomial perfect rewritings of the query in
nonrecursive datalog.
However, it turns out that the disjunctive normal form is necessary for practical ap-
plications of the query rewriting technique, since queries of more complex forms, once
translated in SQL, produce queries with nested subexpressions that, in general, cannot
be evaluated efficiently by current DBMSs. So, while in some cases resorting to more
compact and structurally more complex perfect rewritings may be convenient, in gen-
eral this strategy does not solve the problem of arriving at an SQL expression that can
be effectively evaluated on the database.
In this scenario, a very interesting way to limit the size of the rewritten UCQ has been
proposed in [11]. This approach proposes the use of the so-called ABox dependencies
to optimize query rewriting in DL-LiteA . ABox dependencies are inclusions between
concepts and roles which are interpreted as integrity constraints over the ABox: in other
words, the ABox is guaranteed to satisfy such constraints. In the presence of such con-
straints, the query answering process can be optimized, since this additional knowledge
about the extensions of concepts and roles in the ABox can be exploited for optimizing
query answering. Intuitively, the presence of ABox dependencies acts in a complemen-
tary way with respect to TBox assertions: while the latter complicate query rewriting,
the former simplify it, since they state that some of the TBox assertions are already
satisfied by the ABox.
As explained in [11], ABox dependencies have a real practical interest, since they
naturally arise in many applications of ontologies, and in particular in ontology-based
data access (OBDA) applications, in which a DL ontology acts as a virtual global
schema for accessing data stored in external sources, and such sources are connected
through declarative mappings to the global ontology. It turns out that, in practical cases,
many ABox dependencies may be (automatically) derived from the mappings between
the ontology and the data sources.
In this paper, we present an approach that follows the ideas of [11]. More specifically,
we present Prexto, an algorithm for computing a perfect rewriting of a UCQ in the
description logic DL-LiteA . Prexto is based on the query rewriting algorithm Presto
[13]: with respect to the previous technique, Prexto has been designed to fully exploit
the presence of extensional constraints to optimize the size of the rewriting; moreover,
differently from Presto, it also uses concept and role disjointness assertions, as well as
role functionality assertions, to reduce the size of the rewritten query.
As already observed in [11], the way extensional constraints interact with reason-
ing, and in particular query answering, is not trivial at all: e.g., [11] defines a complex
condition for the deletion of a concept (or role) inclusion from the TBox due to the
presence of extensional constraints. In our approach, we use extensional constraints in
a very different way from [11], which uses such constraints to “deactivate” correspond-
ing TBox assertions in the TBox: conversely, we are able to define significant query
minimizations even for extensional constraints for which there exists no corresponding
TBox assertions. Based on these ideas, we define the Prexto algorithm: in particular,
we restructure and extend the Presto query rewriting algorithm to fully exploit the
presence of extensional constraints.
Finally, we show that the above optimizations allow Prexto to outperform the exist-
ing query rewriting techniques for DL-Lite in practical cases. In particular, we compare
Prexto both with Presto and with the optimization presented in [11].
The paper is structured as follows. After some preliminaries, in Section 3 we intro-
duce extensional constraints and the notion of extensional constraint Box (EBox). In
Section 4 we discuss the interaction between intensional and extensional constraints in
query answering. Then, in Section 5 we present the Prexto query rewriting algorithm,
and in Section 6 we compare Prexto with existing techniques for query rewriting in
DL-LiteA . We conclude in Section 7.
B −→ A | ∃Q | δ(U )        E −→ ρ(U )
C −→ B | ¬B                F −→ ⊤D | T1 | · · · | Tn
Q −→ P | P −               V −→ U | ¬U
R −→ Q | ¬Q
In such rules, A, P , and U respectively denote an atomic concept (i.e., a concept name),
an atomic role (i.e., a role name), and an attribute name, P − denotes the inverse of an
atomic role, whereas B and Q are called basic concept and basic role, respectively.
Furthermore, δ(U ) denotes the domain of U , i.e., the set of objects that U relates to
values; ρ(U ) denotes the range of U , i.e., the set of values that U relates to objects;
⊤D is the universal value-domain; T1 , . . . , Tn are n pairwise disjoint unbounded value-
domains. A DL-LiteA TBox T is a finite set of assertions of the form
B ⊑ C    Q ⊑ R    E ⊑ F    U ⊑ V    (funct Q)    (funct U )
From left to right, the first four assertions respectively denote inclusions between con-
cepts, roles, value-domains, and attributes. In turn, the last two assertions denote func-
tionality on roles and on attributes. In fact, in DL-LiteA TBoxes we further impose that
roles and attributes occurring in functionality assertions cannot be specialized (i.e., they
cannot occur in the right-hand side of inclusions). We call concept disjointness asser-
tions the assertions of the form B1 ⊑ ¬B2 , and call role disjointness assertions the
assertions of the form Q1 ⊑ ¬Q2 .
A DL-LiteA ABox A is a finite set of membership (or instance) assertions of the
forms A(a), P (a, b), and U (a, v), where A, P , and U are as above, a and b belong to
ΓO , the subset of ΓC containing object constants, and v belongs to ΓV , the subset of
ΓC containing value constants, where {ΓO , ΓV } is a partition of ΓC .
The semantics of a DL-LiteA ontology is given in terms of first-order logic (FOL)
interpretations I over a non-empty domain ΔI such that ΔI = ΔV ∪ ΔIO , where ΔIO
is the domain used to interpret object constants in ΓO , and ΔV is the fixed domain
(disjoint from ΔIO ) used to interpret data values. Furthermore, in DL-LiteA the Unique
Name Assumption (UNA) is adopted, i.e., in every interpretation I, and for every pair
c1 , c2 ∈ ΓC , if c1 ≠ c2 then c1I ≠ c2I . The notion of satisfaction of inclusion, disjoint-
ness, functionality, and instance assertions in an interpretation is the usual one in DL
ontologies (we refer the reader to [10] for more details).
We denote with Mod(O) the set of models of an ontology O, i.e., the set of FOL
interpretations that satisfy all the TBox and ABox assertions in O. An ontology is in-
consistent if Mod(O) = ∅ (otherwise, O is called consistent). As usual, an ontology O
entails an assertion φ, denoted O |= φ, if φ is satisfied in every I ∈ Mod(O).
Given an ABox A, we denote by IA the DL-LiteA interpretation such that, for every
concept instance assertion C(a), aI ∈ C I iff C(a) ∈ A, for every role instance asser-
tion R(a, b), ⟨aI , bI ⟩ ∈ RI iff R(a, b) ∈ A, and for every attribute instance assertion
U (a, v), ⟨aI , v I ⟩ ∈ U I iff U (a, v) ∈ A.
We now recall queries, in particular conjunctive queries and unions of conjunctive
queries. A conjunctive query (CQ) q is an expression of the form
q(x) ← α1 , . . . , αn
3 Extensional Constraints
We now define the notion of EBox, which constitutes a set of extensional constraints,
i.e., constraints over the ABox. The idea of EBox has been originally introduced in [11],
under the name of ABox dependencies.
The following definitions are valid for every DL, under the assumption that the asser-
tions are divided into extensional assertions and intensional assertions, and extensional
assertions correspond to atomic instance assertions.
Given a set of intensional assertions N and an interpretation I, we say that I satisfies
N if I satisfies every assertion in N .
An extensional constraint box, or simply EBox, is a set of intensional assertions.
Notice that, from the syntactic viewpoint, an EBox is identical to a TBox. Therefore,
entailment of an assertion φ with respect to an EBox E (denoted by E |= φ) is defined
exactly in the same way as in the case of TBoxes.
Given an ABox A and an EBox E, we say that A is valid for E if IA satisfies E.
Definition 1. (Admissible ABox) Given a TBox T and an EBox E, an ABox A is an
admissible ABox for T and E if A is consistent with T and A is valid for E. We denote
with ADM(T , E) the set of ABoxes A that are admissible for T and E.
Informally, an EBox acts as a set of integrity constraints over the ABox. Differently
from other recent approaches that have proposed various forms of integrity constraints
for DL ontologies (e.g., [8,14]), an EBox constrains the ABox while totally discarding
the TBox, since the notion of validity with respect to an EBox only considers the ABox.
We are now ready to define the notion of perfect rewriting in the presence of both a
TBox and an EBox.
Definition 2. (Perfect rewriting in the presence of an EBox) Given a TBox T , an EBox
E, and a UCQ Q, a FOL query φ is a perfect rewriting of Q with respect to T , E if,
for every ABox A ∈ ADM(T , E), T , A |= Q iff IA |= φ.
The above definition establishes a natural notion of perfect rewriting in the presence
of an EBox E. Since E constrains the admissible ABoxes, the more selective is E (for
the same TBox T ), the more restricted the set ADM(T , E) is. If, for instance, E, E ′ are
two EBoxes such that E ⊂ E ′ , we immediately get from the above definitions that
ADM(T , E) ⊇ ADM(T , E ′ ). Now, let Q be a UCQ, let φ be a perfect rewriting of Q
with respect to T , E and let φ′ be a perfect rewriting of Q with respect to T , E ′ : φ
will have to satisfy the condition T , A |= Q iff IA |= φ for more ABoxes A than
query φ′ . Consequently, φ will have to be a more complex query than φ′ . Therefore,
larger EBoxes in principle allow for obtaining simpler perfect rewritings.
Suppose we are given a TBox T = {Student ⊑ Person}, an empty EBox E0 , and an EBox
E1 = {Student ⊑ Person}. Now, given a query q(x) ← Person(x), a perfect rewriting
of this query with respect to T , E0 is
q(x) ← Person(x)
q(x) ← Student(x)
while a perfect rewriting of query q with respect to T , E1 is the query q itself. Namely,
under the EBox E1 we can ignore the TBox concept inclusion Student ⊑ Person, since
it is already satisfied by the ABox.
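A minimal Python sketch of the naive pruning suggested by this example follows. It mirrors only the Student/Person case and, as the next paragraphs show, such pruning is not sound for arbitrary TBoxes, so the sketch should be read as motivation rather than as a general rewriting procedure; the encoding of entailment as a callable is our own assumption.

def rewrite_atomic(concept, subsumees, ebox_entails):
    # One CQ per TBox-subsumee of `concept`, dropping subsumees whose extension
    # is already accounted for by `concept` according to the EBox.
    return [f"q(x) ← {b}(x)"
            for b in subsumees
            if b == concept or not ebox_entails(b, concept)]

subs = ["Person", "Student"]                                   # subsumees of Person w.r.t. T
print(rewrite_atomic("Person", subs, lambda b, a: False))      # empty EBox E0: two CQs
print(rewrite_atomic("Person", subs,
                     lambda b, a: (b, a) == ("Student", "Person")))  # EBox E1: only q itself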
However, as already explained in [11], we cannot always ignore TBox assertions
that also appear in the EBox (and are thus already satisfied by the ABox). For instance,
let q be the query q ← C(x). If the TBox T contains the assertions ∃R ⊑ C and
D ⊑ ∃R− and the EBox E contains the assertion ∃R ⊑ C, we cannot ignore this last
inclusion when computing a perfect rewriting of q (or when answering query q). In fact,
suppose the ABox is A = {D(a)}: then A ∈ ADM(T , E) and query q is entailed by T , A.
But actually q is not entailed by T ′ , A where T ′ = T − E.
From the query rewriting viewpoint, a perfect rewriting of q with respect to T is
q ← C(x)
q ← R(x, y)
q ← D(y)
while a perfect rewriting of q with respect to T ′ = T − E is simply
q ← C(x)
And of course, the ABox A shows that this last query is not a perfect rewriting of q
with respect to T , E. Therefore, also when computing a perfect rewriting, we cannot
simply ignore the inclusions of the TBox that are already satisfied by the ABox (i.e.,
that belong to the EBox).
The example above shows that we need to understand under which conditions we are
allowed to use extensional constraints to optimize query rewriting.
5 Prexto
In this section we present the algorithm Prexto (Perfect Rewriting under EXTensional
cOnstraints). Prexto makes use of the algorithm Presto, originally defined in [13],
which computes a nonrecursive datalog program constituting a perfect rewriting of
a UCQ Q with respect to a DL-LiteA TBox T . The algorithm Presto is reported in
Figure 1. We refer the reader to [13] for a detailed explanation of the algorithm. For
our purposes, it suffices to remind that the program returned by Presto uses auxiliary
datalog predicates, called ontology-annotated (OA) predicates, to represent every basic
concept and basic role that is involved in the query rewriting. E.g., the basic concept
Algorithm Presto(Q, T )
Input: UCQ Q, DL-LiteR TBox T
Output: nr-datalog query Q′
begin
Q′ = Rename(Q);
Q′ = DeleteUnboundVars(Q′ );
Q′ = DeleteRedundantAtoms(Q′ , T );
Q′ = Split(Q′ );
repeat
if there exist r ∈ Q′ and ej-var x in r
such that Eliminable(x, r, T ) = true
and x has not already been eliminated from r
then begin
Q′′ = EliminateEJVar(r, x, T );
Q′′ = DeleteUnboundVars(Q′′ );
Q′′ = DeleteRedundantAtoms(Q′′ , T );
Q′ = Q′ ∪ Split(Q′′ )
end
until Q′ has reached a fixpoint;
for each OA-predicate pnα occurring in Q′
do Q′ = Q′ ∪ DefineAtomView(pnα , T )
end
B is represented by the OA-predicate p1B , while the basic role R is represented by the
OA-predicate p2R (the superscript represents the arity of the predicate).1
In the following, we modify the algorithm Presto. In particular, we make the fol-
lowing changes:
1. the final for each cycle of the algorithm (cf. Figure 1) is not executed: i.e., the rules
defining the OA-predicates are not added to the returned program;
2. the algorithm DeleteRedundantAtoms is modified to take into account the
presence of disjointness assertions and role functionality assertions in the
TBox. More precisely, the following simplification rules are added to algorithm
DeleteRedundantAtoms(Q , T ) (in which we denote basic concepts by B, C,
basic roles by R, S, and datalog rules by the symbol r):
(a) if p2R (t1 , t2 ) and p2S (t1 , t2 ) occur in r and T |= R ⊑ ¬S, then eliminate r from
Q′ ;
(b) if p2R (t1 , t2 ) and p2S (t2 , t1 ) occur in r and T |= R ⊑ ¬S − , then eliminate r
from Q′ ;
(c) if p1B (t) and p1C (t) occur in r and T |= B ⊑ ¬C, then eliminate r from Q′ ;
(d) if p2R (t1 , t2 ) and p1C (t1 ) occur in r and T |= ∃R ⊑ ¬C, then eliminate r from
Q′ ;
1 Actually, to handle Boolean subqueries, also 0-ary OA-predicates (i.e., predicates with no
arguments) are defined: we refer the reader to [13] for more details.
(e) if p2R (t1 , t2 ) and p1C (t2 ) occur in r and T |= ∃R− ⊑ ¬C, then eliminate r
from Q′ ;
(f) if p0α and p0β occur in r and T |= α0 ⊑ ¬β 0 , then eliminate r from Q′ ;
(g) if p1B (t) and p0α occur in r and T |= B 0 ⊑ ¬α0 , then eliminate r from Q′ ;
(h) if p2R (t1 , t2 ) and p0α occur in r and T |= R0 ⊑ ¬α0 , then eliminate r from Q′ ;
(i) if p2R (t1 , t2 ) and p2R (t1 , t2′ ) (with t2 ≠ t2′ ) occur in r and (funct R) ∈ T , then,
if t2 and t2′ are two different constants, then eliminate r from Q′ ; otherwise,
replace r with the rule σ(r), where σ is the substitution which poses t2 equal
to t2′ ;
(j) if p2R (t2 , t1 ) and p2R (t2′ , t1 ) (with t2 ≠ t2′ ) occur in r and (funct R− ) ∈ T ,
then, if t2 and t2′ are two different constants, then eliminate r from Q′ ; other-
wise, replace r with the rule σ(r), where σ is the substitution which poses t2
equal to t2′ .
Example 1. Let us show the effect of the new transformations added to
DeleteRedundantAtoms through two examples. First, suppose T = {B ⊑
¬B ′ , (funct R)} and suppose r is the rule
q(x) ← p1B (y), p2R (x, y), p2R (x, z), p1B′ (z)
Then, the above case (i) of algorithm DeleteRedundantAtoms can be applied, which
transforms r into the rule
q(x) ← p1B (y), p2R (x, y), p1B′ (y)
Now, the above case (c) of algorithm DeleteRedundantAtoms can be applied, hence
this rule is deleted from the program. Intuitively, this is due to the fact that this rule
looks for elements belonging both to concept B and to concept B ′ , which is impossible
because the disjointness assertion B ⊑ ¬B ′ is entailed by the TBox T . Therefore, it is
correct to delete the rule from the program.
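For illustration, the following Python toy reproduces the two simplification cases used in Example 1, namely the functionality-based unification of case (i) restricted to variables and the disjointness-based rule deletion of case (c). The tuple encoding of atoms is our own choice, and the sketch deliberately ignores the constant-handling branch of case (i) as well as the role-based disjointness cases.

def simplify_rule(body, functional_roles, disjoint_concepts):
    # Atoms are tuples ('R', t1, t2) or ('B', t); variables are strings.
    body = list(body)
    changed = True
    while changed:
        changed = False
        for a1 in body:
            for a2 in body:
                if (a1 is not a2 and len(a1) == 3 and len(a2) == 3
                        and a1[0] == a2[0] and a1[0] in functional_roles
                        and a1[1] == a2[1] and a1[2] != a2[2]):
                    old, new = a2[2], a1[2]               # case (i): unify the two fillers
                    body = [tuple(new if t == old else t for t in atom) for atom in body]
                    body = list(dict.fromkeys(body))      # drop duplicate atoms
                    changed = True
                    break
            if changed:
                break
    for (p1, *ts1) in body:                               # case (c): disjoint concepts on the same term
        for (p2, *ts2) in body:
            if len(ts1) == 1 and ts1 == ts2 and frozenset((p1, p2)) in disjoint_concepts:
                return None                               # rule is unsatisfiable, delete it
    return body

body = [("B", "y"), ("R", "x", "y"), ("R", "x", "z"), ("Bp", "z")]
print(simplify_rule(body, {"R"}, {frozenset(("B", "Bp"))}))   # None: the rule is deleted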
From now on, when we speak about Presto we refer to the above modified version
of the algorithm, and when we speak about DeleteRedundantAtoms we refer to the
above modified version which takes into account disjointness and functionality asser-
tions.
The Prexto algorithm is defined in Figure 2. The algorithm consists of the
following four steps:
1. the nonrecursive datalog program P is computed by executing the Presto algo-
rithm. This program P is not a perfect rewriting of Q yet, since the definition of the
intermediate OA-predicates is missing;
2. the program P ′ is then constructed (by the three for each cycles of the algorithm).
This program contains rules defining the intermediate OA-predicates, i.e., the con-
cept and role predicates used in the program P . To compute such rules, the algo-
rithm makes use of the procedure MinimizeViews, reported in Figure 3. This pro-
cedure takes as input a basic concept (respectively, a basic role) B and computes a
minimal subset Φ′ of the set Φ of the subsumed basic concepts (respectively, sub-
sumed basic roles) of B which extensionally covers the set Φ, as explained below.
Algorithm Prexto(Q, T , E )
Input: UCQ Q, DL-LiteA TBox T , DL-LiteA EBox E
Output: UCQ Q′
begin
P = Presto(Q, T );
P ′ = ∅;
for each OA-predicate p2R occurring in P do
Φ = MinimizeViews(R, E , T );
P ′ = P ′ ∪ {p2R (x, y) ← S(x, y) | S is a role name and S ∈ Φ}
∪ {p2R (x, y) ← S(y, x) | S is a role name and S − ∈ Φ};
for each OA-predicate p1B occurring in P do
Φ = MinimizeViews(B, E , T );
P ′ = P ′ ∪ {p1B (x) ← C(x) | C is a concept name and C ∈ Φ}
∪ {p1B (x) ← R(x, y) | ∃R ∈ Φ} ∪ {p1B (x) ← R(y, x) | ∃R− ∈ Φ};
for each OA-predicate p0N occurring in P do
Φ = MinimizeViews(N 0 , E , T );
P ′ = P ′ ∪ {p0N ← C(x) | C is a concept name and C 0 ∈ Φ}
∪ {p0N ← R(x, y) | R is a role name and R0 ∈ Φ};
P = P ∪ P ′;
Q′ = Unfold(P );
Q′ = DeleteRedundantAtoms(Q′ , E );
return Q′
end
3. then, the overall nonrecursive datalog program P ∪ P ′ is unfolded, i.e., turned into
a UCQ Q′ . This is realized by the algorithm Unfold which corresponds to the usual
unfolding of a nonrecursive program;
4. finally, the UCQ Q′ is simplified by executing the algorithm
DeleteRedundantAtoms, which takes as input the UCQ Q′ and the EBox
E (notice that, conversely, the first execution of DeleteRedundantAtoms within
the Presto algorithm uses the TBox T as input).
Notice that the bottleneck of the whole process is the above step 3, since the number
of conjunctive queries generated by the unfolding may be exponential with respect to
the length of the initial query Q (in particular, it may be exponential with respect to the
maximum number of atoms in a conjunctive query of Q). As shown by the following ex-
ample, the usage of extensional constraints done at step 2 through the MinimizeViews
algorithm is crucial to handle the combinatorial explosion of the unfolding.
Example 2. Let T be the following DL-LiteA TBox:
Company ⊑ ∃givesHighSalaryTo−                 FulltimeStudent ⊑ Unemployed
∃givesHighSalaryTo− ⊑ Manager                 FulltimeStudent ⊑ Student
Manager ⊑ Employee                            isBestFriendOf ⊑ knows
Employee ⊑ HasJob                             (funct isBestFriendOf)
∃receivesGrantFrom ⊑ StudentWithGrant         (funct isBestFriendOf− )
StudentWithGrant ⊑ FulltimeStudent            HasJob ⊑ ¬Unemployed
Algorithm MinimizeViews(B, E , T )
Input: basic concept (or basic role, or 0-ary predicate) B,
DL-LiteA EBox E , DL-LiteA TBox T
Output: set of basic concepts (or basic roles, or 0-ary predicates) Φ
begin
Φ = {B ′ | T |= B ′ ⊑ B};
Φ′ = ∅;
for each B ′ ∈ Φ do
if there exists B ′′ ∈ Φ such that E |= B ′ ⊑ B ′′ and E ⊭ B ′′ ⊑ B ′
then Φ′ = Φ′ ∪ {B ′ };
Φ = Φ − Φ′ ;
while there exist B ′ , B ′′ ∈ Φ
such that B ′ ≠ B ′′ and E |= B ′ ⊑ B ′′ and E |= B ′′ ⊑ B ′
do Φ = Φ − {B ′′ };
return Φ
end
E1 = FulltimeStudent ⊑ StudentWithGrant
E2 = ∃receivesGrantFrom ⊑ StudentWithGrant
E3 = HasJob ⊑ Employee
E4 = Manager ⊑ Employee
Let us first consider an empty EBox. In this case, during the execution of
Prexto(q1 , T , ∅) the algorithm MinimizeViews simply computes the subsumed sets
of Student, knows, HasJob, which are, respectively:
MinimizeViews(Student, ∅, T ) =
{Student, FulltimeStudent, StudentWithGrant, ∃receivesGrantFrom}
MinimizeViews(knows, ∅, T ) =
{knows, knows− , isBestFriendOf, isBestFriendOf− }
MinimizeViews(HasJob, ∅, T ) =
{HasJob, Employee, Manager, ∃givesHighSalaryTo− }
Since every such set consists of four predicates, the UCQ returned by the unfolding
step in Prexto(q1 , T , ∅) contains 64 CQs. This is also the size of the final UCQ, since
in this case no optimizations are computed by the algorithm DeleteRedundantAtoms,
because neither the disjointness assertion nor the role functionality assertions of T have
any impact on the rewriting of query q1 .
Conversely, let us consider the EBox E: during the execution of Prexto(q1 , T , E),
we obtain the following sets from the execution of the algorithm MinimizeViews:
MinimizeViews(Student, E, T ) =
{Student, StudentWithGrant}
MinimizeViews(knows, E, T ) =
{knows, knows− , isBestFriendOf, isBestFriendOf− }
MinimizeViews(HasJob, E, T ) =
{Employee, ∃givesHighSalaryTo− }
Thus, the algorithm MinimizeViews returns only two predicates for Student and only
two predicates for HasJob. Therefore, the final unfolded UCQ is constituted of 16 CQs
(since, as above explained, the final call to DeleteRedundantAtoms does not produce
any optimization).
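The minimization just traced can be reproduced with a small Python sketch of the MinimizeViews pseudocode of Figure 3. The entailment tests are mocked here with explicit pair sets (a real implementation would query T and E through a reasoner), so this is a sketch under those assumptions rather than the actual Prexto code.

def minimize_views(b, candidates, tbox_entails, ebox_entails):
    # Φ := all candidates B' with T |= B' ⊑ B
    phi = [bp for bp in candidates if tbox_entails(bp, b)]
    # drop B' if some B'' in Φ satisfies E |= B' ⊑ B'' but not E |= B'' ⊑ B'
    dropped = {bp for bp in phi
               if any(bq != bp and ebox_entails(bp, bq) and not ebox_entails(bq, bp)
                      for bq in phi)}
    phi = [bp for bp in phi if bp not in dropped]
    # collapse EBox-equivalent predicates, keeping one representative each
    result = []
    for bp in phi:
        if not any(ebox_entails(bp, bq) and ebox_entails(bq, bp) for bq in result):
            result.append(bp)
    return result

# Subsumees of Student w.r.t. T and the EBox inclusions E1, E2 of Example 2.
candidates = ["Student", "FulltimeStudent", "StudentWithGrant", "∃receivesGrantFrom"]
t_pairs = {(c, "Student") for c in candidates} | {(c, c) for c in candidates}
e_pairs = {("FulltimeStudent", "StudentWithGrant"), ("∃receivesGrantFrom", "StudentWithGrant")}
print(minimize_views("Student", candidates,
                     lambda x, y: (x, y) in t_pairs,
                     lambda x, y: (x, y) in e_pairs or x == y))
# ['Student', 'StudentWithGrant'], as computed for MinimizeViews(Student, E, T) above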
We now focus on the proof of correctness of Prexto, which is based on the known re-
sults about the Presto algorithm. Indeed, to prove correctness of Prexto, essentially we
have to show that the modifications done with respect to the Presto algorithm preserve
correctness.
In particular, it is possible to prove the following properties:
1. The additional simplification rules added to the DeleteRedundantAtoms algo-
rithm preserve completeness of the algorithm. More specifically, it can be easily
shown that, in every execution of the algorithm DeleteRedundantAtoms within
Presto, every additional rule transformation either produces a rule that is equiv-
alent (with respect to the TBox T ) to the initial rule, or deletes a rule which is
actually empty, i.e., which does not contribute to any nonempty conjunctive query
in the final UCQ.
2. The optimization realized by the MinimizeViews algorithm is correct. More pre-
cisely, the following property can be easily shown:
Lemma 1. Let T be a TBox, let E be an EBox, let B be a basic concept, let Φ
be the set of basic concepts subsumed by B in T , and let Φ′ be the set returned by
MinimizeViews(B, E, T ). Then, for every ABox A valid for E, the following property holds:
⋃_{B′∈Φ} (B′)IA = ⋃_{B′∈Φ′} (B′)IA
An analogous property can be shown when B is a basic role (or a 0-ary predicate).
From the above lemma, it easily follows that step 2 of Prexto is correct.
3. In step 4 of Prexto, the final simplification of conjunctive queries realized by the
execution of the algorithm DeleteRedundantAtoms over the EBox E is correct.
This immediately follows from the correctness of DeleteRedundantAtoms shown
in the above point 1 and from the fact that the final UCQ is executed on the ABox,
i.e., it is evaluated on the interpretation IA .
Therefore, from the above properties and the correctness of the original Presto algo-
rithm, we are able to show the correctness of Prexto.
Finally, it is easy to verify the following property, which states that the computational
cost of Prexto is no worse than all known query rewriting techniques for DL-LiteA
which compute UCQs.
Theorem 2. Prexto(Q, T , E) runs in polynomial time with respect to the size of T ∪ E,
and in exponential time with respect to the maximum number of atoms in a conjunctive
query in the UCQ Q.
6 Comparison
We now compare the optimizations introduced by Prexto with the current techniques
for query rewriting in DL-Lite.
In particular, we consider the simple DL-LiteA ontology of Example 2 and compare
the size of the UCQ rewritings generated by the current techniques (in particular, Presto
and the rewriting based on the TBox minimization technique TBox-min shown in [11])
with the size of the UCQ generated by Prexto. To single out the impact of the different
optimizations introduced by Prexto, we present three different execution modalities
for Prexto: (i) without considering the EBox (we call this modality Prexto-noEBox); (ii)
without considering disjointness axioms and role functionality axioms in the TBox (we
call this modality Prexto-noDisj); (iii) considering all axioms both in the TBox and
in the EBox (we call this modality Prexto-full). Moreover, we will consider different
EBoxes of increasing size, to better illustrate the impact of the EBox on the size of the
rewriting.
Let T be the DL-LiteA ontology of Example 2 and let E1 , . . . , E4 be the following
EBoxes:
E1 = {E1 }
E2 = {E1 , E2 }
E3 = {E1 , E2 , E3 }
E4 = {E1 , E2 , E3 , E4 }
where E1 , . . . , E4 are the concept inclusion assertions defined in Example 2. Finally,
let q0 , q1 , q2 , q3 be the following simple queries:
q0 (x) ← Student(x)
q1 (x) ← Student(x), knows(x, y), HasJob(y)
q2 (x) ← Student(x), knows(x, y), HasJob(y), knows(x, z), Unemployed(z)
q3 (x) ← Student(x), knows(x, y), HasJob(y), knows(x, z), Unemployed(z),
knows(x, w), Student(w)
The table reported in Figure 4 shows the impact on rewriting (and answering) queries
q0 , q1 , q2 and q3 of: (i) the disjointness axiom and the functional role axioms in T ; (ii)
the EBoxes E1 , . . . , E4 . In the table, we denote by Presto+unfolding the UCQ obtained
by unfolding the nonrecursive datalog program returned by the Presto algorithm, and
denote by TBox-min the execution of Presto+unfolding which takes as input the TBox
minimized by the technique presented in [11] using the extensional inclusions in the
EBox. These two rows can be considered as representative of the state of the art in
query rewriting in DL-Lite (with and without extensional constraints): indeed, due to
the simple structure of the TBox and the queries, every existing UCQ query rewriting
technique for plain DL-Lite ontologies (i.e., ontologies without EBoxes) would generate
UCQs of size analogous to Presto+unfolding (of course, we are not considering the
approaches where the ABox is preprocessed, in which of course much more compact
query rewritings can be defined [7,11]).
The third column of the table displays the results when the empty EBox was con-
sidered, while the fourth, fifth, sixth, and seventh column respectively report the results
when the EBox E1 , E2 , E3 , E4 , was considered. The numbers in these columns repre-
sent the size of the UCQ generated when rewriting the query with respect to the TBox
T and the EBox E: more precisely, this number is the number of CQs which constitute
the generated UCQ. We refer to Example 2, for an explanation of the results obtained
in the case of query q1 .
The results of Figure 4 clearly show that even a very small number of EBox axioms
may have a dramatic impact on the size of the rewritten UCQ, and that this is already
the case for relatively short queries (like query q2 ): this behavior is even more apparent
for longer queries like q3 . In particular, notice that, even when only two extensional in-
clusions are considered (case E = E2 ), the minimization of the UCQ is already very sig-
nificant. Moreover, for the queries under examination, extensional inclusions are more
effective than disjointness axioms and role functionality axioms on the minimization of
the rewriting size.
The results also show that the technique presented in [11] for exploiting extensional
inclusions does not produce any effect in this case. This is due to the fact that the
extensional inclusions considered in our experiment do not produce any minimization
of the TBox according to the condition expressed in [11]. Conversely, the technique for
exploiting extensional constraints of Prexto is very effective. For instance, notice that
this technique is able to use extensional constraints (like E2 and E3 ) which have no
counterpart in the TBox, in the sense that such concept inclusions are not entailed by
the TBox T .
Finally, we remark that the above simple example shows a situation which is actually
not favourable for the algorithm, since there are very few extensional constraints and
short (or even very short) queries: nevertheless, the experimental results show that, even
in this setting, our algorithm is able to produce very significant optimizations. Indeed,
the ideas which led to the Prexto algorithm came out of a large OBDA project that our
research group is currently developing with an Italian Ministry. In this project, several
relevant user queries could not be executed by our ontology reasoner (Quonto [3]) due
to the very large size of the rewritings produced. For such queries, the minimization
of the rewriting produced by the usage of the Prexto optimizations is actually much
more dramatic than the examples reported in the paper, because the queries are more
complex (at least ten atoms) and the number of extensional constraints is larger than in
the example. As a consequence, Prexto was able to lower the number of conjunctive
queries generated, and thus the total query evaluation time, typically by two to three
orders of magnitude: e.g., for one query, the total evaluation time passed from more
than 11 hours to 42 seconds; seven other queries, whose rewritings could not even be
computed or executed because of memory overflow of either the query rewriter or the
DBMS query parser, could be executed in a few minutes, or even a few seconds, after
the optimization.
7 Conclusions
In this paper we have presented a query rewriting technique for fully exploiting the pres-
ence of extensional constraints in a DL-LiteA ontology. Our technique clearly proves
that extensional constraints may produce a dramatic improvement of query rewriting,
and consequently of query answering over DL-LiteA ontologies.
We remark that it is immediate to extend Prexto to OWL2 QL: the features of OWL2 QL
that are not covered by DL-LiteA mainly consist of additional role assertions (symmetric,
asymmetric, reflexive, and irreflexive role assertions). These as-
pects can be easily dealt with by Prexto through a simple extension of the algorithm.
We believe that the present approach can be extended in several directions. First, it
would be extremely interesting to generalize the Prexto technique to ontology-based
data access (OBDA), where the ABox is only virtually specified through declarative
mappings over external data sources: as already mentioned in the introduction, in this
scenario extensional constraints would be a very natural notion, since they could be
automatically derived from the mapping specification. Then, it would be very interest-
ing to extend the usage of extensional constraints beyond DL-LiteA ontologies: in this
respect, a central question is whether existing query rewriting techniques for other de-
scription logics (e.g., [9,12]) can be extended with optimizations analogous to the ones
of Prexto. Finally, we plan to fully implement our algorithm within the Quonto/Mastro
system [3] for DL-LiteA ontology management.
Acknowledgments. This research has been partially supported by the ICT Collabo-
rative Project ACSI (Artifact-Centric Service Interoperation), funded by the EU under
FP7 ICT Call 5, 2009.1.2, grant agreement n. FP7-257593.
References
1. OWL 2 web ontology language profiles (2009),
http://www.w3.org/TR/owl-profiles/
2. Artale, A., Calvanese, D., Kontchakov, R., Zakharyaschev, M.: The DL-Lite family and rela-
tions. J. of Artificial Intelligence Research 36, 1–69 (2009)
3. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Poggi, A., Rodriguez-Muro, M.,
Rosati, R., Ruzzi, M., Savo, D.F.: The Mastro system for ontology-based data access. Se-
mantic Web J. 2(1), 43–53 (2011)
4. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Tractable reasoning
and efficient query answering in description logics: The DL-Lite family. J. of Automated
Reasoning 39(3), 385–429 (2007)
5. Gottlob, G., Schwentick, T.: Rewriting ontological queries into small nonrecursive datalog
programs. In: Proc. of the 24th Int. Workshop on Description Logic, DL 2011 (2011)
6. Kikot, S., Kontchakov, R., Zakharyaschev, M.: On (in)tractability of OBDA with OWL2QL.
In: Proc. of the 24th Int. Workshop on Description Logic, DL 2011 (2011)
7. Kontchakov, R., Lutz, C., Toman, D., Wolter, F., Zakharyaschev, M.: The combined approach
to query answering in DL-Lite. In: Proc. of the 12th Int. Conf. on the Principles of Knowledge
Representation and Reasoning (KR 2010), pp. 247–257 (2010)
8. Motik, B., Horrocks, I., Sattler, U.: Bridging the gap between OWL and relational databases.
J. of Web Semantics 7(2), 74–89 (2009)
9. Pérez-Urbina, H., Motik, B., Horrocks, I.: Tractable query answering and rewriting under
description logic constraints. J. of Applied Logic 8(2), 186–209 (2010)
10. Poggi, A., Lembo, D., Calvanese, D., De Giacomo, G., Lenzerini, M., Rosati, R.: Linking
Data to Ontologies. In: Spaccapietra, S. (ed.) Journal on Data Semantics X. LNCS, vol. 4900,
pp. 133–173. Springer, Heidelberg (2008)
11. Rodriguez-Muro, M., Calvanese, D.: Dependencies: Making ontology based data access
work in practice. In: Proc. of the 5th Alberto Mendelzon Int. Workshop on Foundations
of Data Management, AMW 2011 (2011)
12. Rosati, R.: On conjunctive query answering in EL. In: Proc. of the 20th Int. Workshop on
Description Logic (DL 2007). CEUR Electronic Workshop Proceedings, vol. 250, pp. 451–
458 (2007), http://ceur-ws.org/
13. Rosati, R., Almatelli, A.: Improving query answering over DL-Lite ontologies. In: Proc. of
the 12th Int. Conf. on the Principles of Knowledge Representation and Reasoning (KR 2010),
pp. 290–300 (2010)
14. Tao, J., Sirin, E., Bao, J., McGuinness, D.L.: Integrity constraints in OWL. In: Proc. of the
24th AAAI Conf. on Artificial Intelligence, AAAI 2010 (2010)
15. Thomas, E., Pan, J.Z., Ren, Y.: TrOWL: Tractable OWL 2 Reasoning Infrastructure. In:
Aroyo, L., Antoniou, G., Hyvönen, E., ten Teije, A., Stuckenschmidt, H., Cabral, L., Tu-
dorache, T. (eds.) ESWC 2010, Part II. LNCS, vol. 6089, pp. 431–435. Springer, Heidelberg
(2010)
Semi-automatically Mapping Structured Sources
into the Semantic Web
1 Introduction
The set of sources in the Linked Data cloud continues to grow rapidly. Many of
these sources are published directly from existing databases using tools such as
D2R [8], which makes it easy to convert relational databases into RDF. This con-
version process uses the structure of the data as it is organized in the database,
which may not be the most useful structure for the information in RDF. Either way,
there is often no explicit semantic description of the contents of a source, and it requires
a significant effort if one wants to do more than simply convert a database into RDF.
(This research is based upon work supported in part by the Intelligence Advanced
Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL)
contract number FA8650-10-C-7058. The views and conclusions contained herein
are those of the authors and should not be interpreted as necessarily representing
the official policies or endorsements, either expressed or implied, of IARPA, AFRL,
or the U.S. Government.)
The result of the ease with which one can publish
data into the Linked Data cloud is that there is lots of data published in RDF
and remarkably little in the way of semantic descriptions of much of this data.
In this paper, we present an approach to semi-automatically building source
models that define the contents of a data source in terms of a given ontology. The
idea behind our approach is to bring the semantics into the conversion process
so that the process of converting a data source produces a source model. This
model can then be used to generate RDF triples that are linked to an ontology
and to provide a SPARQL end point that converts the data on the fly into RDF
with respect to a given ontology. Users can define their own ontology or bring in
an existing ontology that may already have been used to describe other related
data sources. The advantage of this approach is that it allows the source to be
transformed in the process of creating the RDF triples, which makes it possible
to generate RDF triples with respect to a specific domain ontology.
The conversion to RDF is a critical step in publishing sources into the Linked
Data cloud and this work makes it possible to convert sources into RDF with the
underlying semantics made explicit. There are other systems, such as R2R [7]
and W3C’s R2RML [9], that define languages for specifying mappings between
sources, but neither provides support for automatically constructing these mappings. This
paper describes work that is part of our larger effort on developing techniques
for performing data-integration tasks by example [23]. The integrated system is
available as an open-source tool called Karma1 .
2 Motivating Example
The bioinformatics community has produced a growing collection of databases
with vast amounts of data about diseases, drugs, proteins, genes, etc. Nomen-
clatures and terminologies proliferate and significant efforts have been under-
taken to integrate these sources. One example is the Semantic MediaWiki Linked
Data Extension (SMW-LDE) [5], designed to support unified querying, naviga-
tion, and visualization through a large collection of neurogenomics-relevant data
sources. This effort focused on integrating information from the Allen Brain At-
las (ABA) with standard neuroscience data sources. Their goal was to “bring
ABA, Uniprot, KEGG Pathway, PharmGKB and Linking Open Drug Data [16]
data sets together in order to solve the challenge of finding drugs that target
elements within a disease pathway, but are not yet used to treat the disease.”
We use the same scenario to illustrate and evaluate our contributions, com-
paring our results to the published SMW-LDE results (see Figure 1). We use
logical rules to formally define the mapping between data sources and an ontol-
ogy. Specifically, we use global-local-as-view (GLAV) rules [13] commonly used
in data integration [15] and data exchange [3] (i.e., rules whose antecedent and
consequent are conjunctive formulas). The rule antecedent is the source relation
that defines the columns in the data source. The rule consequent specifies how
the source data elements are defined using the ontology terms. For example, the
1
https://github.com/InformationIntegrationGroup/Web-Karma-Public
[Figure legend: subclass, object property, and data property links; visible labels include Thing,
Top, name, alternativeLabel, alternativeSymbol, pharmGKBId, and isTargetedBy]
Fig. 1. The ontology used in the SMW-LDE study, one of the KEGG Pathway sources
used, and the source model that defines the mapping of this source to the ontology
first term, Pathway(uri(Accession Id)) specifies that the values in the Acces-
sion Id column are mapped to the Pathway class, and that these values should
be used to construct the URIs when the source description is used to gener-
ate RDF. The second term, name(uri(Accession Id), Name), specifies that the
values in the Accession Id column are related to the values in the Name column using
the name property.
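As an illustration of how such a source model is used to generate RDF, the sketch below applies the two rule terms quoted above to one table row. It is only a plain-Python approximation of the idea, not Karma's implementation; the base namespace and the sample row values are made up.

    BASE = "http://example.org/kegg/"              # hypothetical namespace

    def uri(value):
        return "<" + BASE + str(value).replace(" ", "_") + ">"

    def row_to_triples(row):
        # Pathway(uri(Accession_Id)): the Accession Id value becomes the subject URI.
        subject = uri(row["Accession Id"])
        yield (subject, "rdf:type", ":Pathway")
        # name(uri(Accession_Id), Name): the Name value is attached via the name property.
        yield (subject, ":name", '"' + row["Name"] + '"')

    # Hypothetical sample row, only for illustration.
    for s, p, o in row_to_triples({"Accession Id": "path:00232", "Name": "Caffeine metabolism"}):
        print(s, p, o, ".")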
The task in the SMW-LDE scenario is to define source models for 10 data
sources. Writing these source models by hand, or the equivalent R2R rules is
laborious and requires significant expertise. In the next sections we describe
how our system can generate source models automatically and how it enables
users to intervene to resolve ambiguities.
computes the most succinct mapping, and the user interface allows the user to
guide the process towards the desired interpretation (Section 3.4).
The task is now to learn the labeling function φ̂(n, v). As mentioned above,
users label columns of data, but to learn φ̂(n, v) we need training data that
assigns semantic types to each value in a column. We assume that columns
contain homogeneous values, so from a single labeled column (n, {v1 , v2 , . . .}, t)
we generate a set of training examples {(n, v1 , t), (n, v2 , t), . . .} as if each value
in the column had been labeled using the same semantic type t.
For each triple (n, v, t) we compute a feature vector (fi ) that characterizes
the syntactic structure of the column name n and the value v. To compute the
feature vector, we first tokenize the name and the value. Our tokenizer uses
white space and symbol characters to break strings into tokens, but identifies
numbers as single tokens. For example, the name Accession Id produces the
tokens (“Accession”, “ ”, “Id”), the value PA2039 produces the tokens (“PA”,
2039), and the value 72.5°F produces the tokens (72.5, °, F).
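A minimal sketch of a tokenizer with this behaviour (an assumption about the mechanics, not Karma's code): numbers become single numeric tokens, letter runs stay together, and whitespace and symbols are tokens of their own.

    import re

    TOKEN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z]+|\s+|.")

    def tokenize(text):
        # Numbers (possibly with a decimal point) are returned as numeric tokens;
        # everything else is returned as the matched string.
        return [float(tok) if tok[0].isdigit() else tok for tok in TOKEN.findall(text)]

    print(tokenize("Accession Id"))   # ['Accession', ' ', 'Id']
    print(tokenize("PA2039"))         # ['PA', 2039.0]
    print(tokenize("72.5°F"))         # [72.5, '°', 'F']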
Each fi is a Boolean feature function fi (n, v) that tests whether the name,
value or the resulting tokens have a particular feature. For example, valueS-
tartsWithA, valueStartsWithB, valueStartsWithPA are three different feature func-
tions that test whether the value starts with the characters ‘A’, ‘B’ or the sub-
string “PA”; hasNumericTokenWithOrderOfMagnitude1, hasNumericTokenWithOr-
derOfMagnitude10 are feature functions that test whether the value contains nu-
meric tokens of order of magnitude 1 and 10 respectively. In general, features
are defined using templates of the form predicate(X), and are instantiated for
different values of X that occur within the training data. In our scenario, valueS-
tartsWith(X) is instantiated with X=‘P’ and X=‘A’ because “PA2039” is in the
first column and “Arthritis, Rheumatoid” is in the last column; however, there
will be no valueStartsWithB feature because no value starts with the character
‘B’. Our system uses 21 predicates; the most commonly instantiated ones are:
nameContainsToken(X), nameStartsWith(X), valueContainsToken(X), valueStarts-
With(X), valueHasCapitalizedToken(), valueHasAllUppercaseToken(), valueHasAl-
phabeticalTokenOfLength(X), valueHasNumericTokenWithOrderOfMagnitude(X),
valueHasNumericTokenWithPrecision(X), valueHasNegativeNumericToken().
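The sketch below illustrates how two of these templates could be instantiated from training data and turned into Boolean feature functions. It is an illustration of the mechanism described above, not Karma's feature code, and it covers only valueStartsWith(X) and nameContainsToken(X).

    def instantiate_features(training):
        # training: list of (column name, value, semantic type) triples.
        features = set()
        for name, value, _type in training:
            features.add(("valueStartsWith", str(value)[:1]))
            for token in str(name).lower().split():
                features.add(("nameContainsToken", token))
        return sorted(features)

    def feature_vector(name, value, features):
        # Boolean vector f_i(n, v) over the instantiated features.
        name_tokens = str(name).lower().split()
        return [int((pred == "valueStartsWith" and str(value).startswith(arg)) or
                    (pred == "nameContainsToken" and arg in name_tokens))
                for pred, arg in features]

    feats = instantiate_features([("Accession Id", "PA2039", "Pathway"),
                                  ("Name", "Arthritis, Rheumatoid", "Disease")])
    print(feature_vector("Accession Id", "PA2039", feats))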
A CRF is a discriminative model, and it is practical to construct feature vectors
with hundreds or even thousands of overlapping features. The model learns the
weight for each feature based on how relevant it is in identifying the semantic
types by optimizing a log-linear objective function that represents the joint like-
lihood of the training examples. A CRF model is useful for this problem because
it can handle large numbers of features, learn from a small number of examples,
and exploit the sequential nature of many structured formats, such as dates,
temperatures, addresses, etc. To control execution times, our system labels and
learns the labeling function using at most 100 randomly selected values from a
column. With 100 items, labeling is instantaneous and learning takes up to 10
seconds for sources with over 50 semantic types.
[Figure legend: node sets Voc, Vtc, Vtp; edge sets Edp, Eop, Esc; # marks an edge weight; Xxx marks a column name]
Fig. 3. The graph defines the search space for source models and provides the infor-
mation for the user interface to enable users to refine the computed source model
ontology. The algorithm for building the graph has three sequential steps: graph
initialization, computing nodes closure, and adding the links.
Graph Initialization: We start with an empty graph called G. In this step,
for each semantic type assigned to a column, a new node with a unique la-
bel is added to the graph. A semantic type is either a class in the ontology
or a pair consisting of the name of a datatype property and its domain. We
call the corresponding nodes in the graph Vtc and Vtp respectively. Applying
this step on the source shown in Figure 3 results in Vtc = {} and Vtp =
{pharmGKBId1, pharmGKBId2, pharmGKBId3, pharmGKBId4, name1, name2, name3,
geneSymbol1}.
Computing Nodes Closure: In addition to the nodes that are mapped from
semantic types, we have to find nodes in the ontology that relate those semantic
types. We search the ontology graph and for every class node that has a path
to the nodes corresponding to semantic types, we create a node in the graph. In
other words, we get all the class nodes in the ontology from which the semantic
types are reachable. To compute the paths, we consider both properties and
isa relationships. The nodes added in this step are called Voc . In the example,
we would have Voc = {Thing1, Top1, Gene1, Pathway1, Drug1, Disease1}. In
Figure 3, solid ovals represent {Vtc ∪ Voc }, which are the nodes mapped from
classes of the ontology, and the dashed ovals represent Vtp, which are the semantic
types corresponding to datatype properties.
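A minimal sketch of this closure step (an assumption about the mechanics, not Karma's code): treat the ontology as a directed graph whose edges follow object properties and isa links, and collect every class from which some semantic-type node is reachable by walking the edges backwards.

    def nodes_closure(ontology_edges, type_classes):
        # ontology_edges: (source class, target class) pairs for property and isa links;
        # type_classes: the classes carrying the assigned semantic types.
        reverse = {}
        for src, dst in ontology_edges:
            reverse.setdefault(dst, set()).add(src)
        closure, frontier = set(), list(type_classes)
        while frontier:
            node = frontier.pop()
            for parent in reverse.get(node, ()):
                if parent not in closure:
                    closure.add(parent)
                    frontier.append(parent)
        return closure

    # Hypothetical ontology fragment, only for illustration.
    edges = [("Thing", "Top"), ("Top", "Gene"), ("Top", "Pathway"), ("Gene", "Pathway")]
    print(nodes_closure(edges, {"Gene", "Pathway"}))   # {'Thing', 'Top', 'Gene'}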
Adding the Links: The final step in constructing the graph is adding the links
to express the relationships among the nodes. We connect two nodes in the graph
if there is a datatype property, object property, or isa relationship that connects
their corresponding nodes in the ontology. More precisely, for each pair of nodes
in the graph, u and v:
– If v ∈ Vtp , i.e., v is a semantic type mapped from a datatype property, and
u corresponds to the domain class of that semantic type, we create a directed
weighted link (u, v) with a weight equal to one (w = 1). For example, there
[Figure legend: Steiner nodes, nodes in the Steiner tree, edges in the Steiner tree, deleted edges;
screenshot annotations point out classes & properties, column names, semantic types, and the
desired relationship]
Fig. 5. Karma screen showing the PharmGKBPathways source. Clicking on the pencil
icon brings up a menu where users can specify alternative relationships between classes.
Clicking on a semantic type brings up a menu where the user can select the semantic
types from the ontology. A movie showing the user interface in action is available at
http://isi.edu/integration/videos/karma-source-modeling.mp4
Karma visualizes a source model as a tree of nodes displayed above the column
headings of a source. Figure 5 shows the visualization of the source model corre-
sponding to the Steiner tree shown in Figure 4(a). The root of the Steiner tree ap-
pears at the top, and shows the name of the class of objects that the table is about (in
our example the table is about diseases2 ). The Steiner nodes corresponding to the
semantic types are shown just below the column headings. The nodes between the
root and the semantic types show the relationships between the different objects
represented in the table. Internal nodes of the Steiner tree (e.g., nodes 4, 5 and 8)
consist of the name of an object property, shown in italics and a class name (a sub-
class of the range of the property). The property defines the relationship between
the class named in the parent node and the class of the current node. For example,
node 4 is “disrupts Pathway”, which means that the Disease (node 1) disrupts the
Pathway represented by the columns under node 4. The leaves of the tree (nodes 6,
7, 9, etc.) show the name of data properties. For example, node 6 is pharmGKBId,
meaning that the column contains the pharmGKBId of the Pathway in node 4.
2
Selection of the root is not unique for ontologies that declare property inverses. In this
example, any of the classes could have been selected as the root yielding equivalent
models.
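The most succinct model mentioned earlier is selected as a minimum-weight Steiner tree connecting the semantic-type nodes in the graph (Figure 4). The sketch below only illustrates that selection step, using networkx's Steiner-tree approximation on an undirected, hypothetical fragment of the graph; Karma's own graph is directed and weighted as described above, and this is not its implementation.

    import networkx as nx
    from networkx.algorithms.approximation import steiner_tree

    # Hypothetical fragment of the weighted graph; node names echo Figure 3.
    G = nx.Graph()
    G.add_weighted_edges_from([
        ("Disease1", "Pathway1", 1), ("Disease1", "Drug1", 1),
        ("Pathway1", "Drug1", 1), ("Pathway1", "pharmGKBId1", 1),
        ("Pathway1", "name1", 1), ("Top1", "Disease1", 1),
    ])

    terminals = ["pharmGKBId1", "name1", "Drug1"]   # nodes carrying semantic types
    model = steiner_tree(G, terminals, weight="weight")
    print(sorted(model.edges()))                    # the tree connecting them through Pathway1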
Fig. 6. Karma screen showing the user interaction to change the model of a column
from a Pathway label to a Drug label
4 Evaluation
We evaluated our approach by generating source models for the same set of
sources integrated by Becker et al. [5], as described in Section 2. The objective
of the evaluation was 1) to assess the ability of our approach to produce source
models equivalent to the mappings Becker et al. defined for these sources, and
2) to measure the effort required in our approach to create the source models.
Becker et al. defined the mappings using R2R, so we used their R2R mapping files
as a specification of how data was to be mapped to the ontology. Our objective
was to replicate the effect of the 41 R2R mapping rules defined in these files. Each
R2R mapping rule maps a column in our tabular representation. We measured
effort in Karma by counting the number of user actions (number of menu choices
to select correct semantic types or adjust paths in the graph) that the user had
to perform. Effort measures for the R2R solution are not available, but the effort appears
to be substantial given that the rules are expressed in multiple pages of RDF.
Using Karma we constructed 10 source models that specify mappings equiv-
alent to all of the 41 R2R mapping rules. Table 1 shows the number of actions
Table 1. Evaluation Results for Mapping the Data Sources using Karma
                                                        # User Actions
Source            Table Name    # Columns    Assign Semantic Type   Specify Relationship    Total
PharmGKB          Genes              8                 8                      0                8
                  Drugs              3                 3                      0                3
                  Diseases           4                 4                      0                4
                  Pathways           5                 2                      1                3
ABA               Genes              6                 3                      0                3
KEGG Pathway      Drugs              2                 2                      0                2
                  Diseases           2                 2                      0                2
                  Genes              1                 1                      0                1
                  Pathways           6                 3                      1                4
UniProt           Genes              4                 1                      0                1
                  Total:            41          Total: 29              Total: 2        Total: 31
                  Avg. # User Actions/Column = 31/41 = 0.76
Events database   19 Tables         64          Total: 43              Total: 4        Total: 47
                  Avg. # User Actions/Column = 47/64 = 0.73
required to map all the data sources. The Assign Semantic Type column shows
the number of times we had to manually assign a semantic type. We started this
evaluation with no training data for the semantic type identification. Out of the
29 manual assignments, 24 were for specifying semantic types that the system
had never seen before, and 5 to fix incorrectly inferred types.
The Specify Relationship column shows the number of times we had to select
alternative relationships using a menu (see Figure 5). For the PharmGKB and
KEGG Pathway sources, 1 action was required to produce a model semantically
equivalent to the R2R mapping rule. The total number of user actions was 31,
0.76 per R2R mapping rule, a small effort compared to writing R2R mapping
rules in RDF. The process took 11 minutes of interaction with Karma for a user
familiar with the sources and the ontology.
In a second evaluation, we mapped a large database of events into the ACE
OWL Ontology [12]. The ontology has 127 classes, 74 object properties, 68 data
properties and 122 subclass axioms. The database contains 19 tables with a
total of 64 columns. We performed this evaluation with no training data for the
semantic type identification. All 43 manual semantic type assignments were for
types that the system had not seen before, and Karma was able to accurately
infer the semantic types for the 21 remaining columns. Karma automatically
computed the correct source model for 15 of 19 tables and required one manual
relationship adjustment for each of the remaining 4 tables. The average number
of nodes in our graph data structure was 108, less than the number of nodes in
the ontology (127 classes and 68 types for data properties). The average time for
graph construction and Steiner tree computation across the 19 tables was 0.82
seconds, which suggests that the approach scales to real mid-size ontologies. The
process took 18 minutes of interaction with Karma.
5 Related Work
There is significant work on schema and ontology matching and mapping [21,6].
An excellent recent survey [22] focuses specifically on mapping relational
databases into the semantic web. Matching discovery tools, such as LSD [10]
or COMA [20], produce element-to-element matches based on schemas and/or
data. Mapping generation tools, such as Clio [11] and its extensions [2], Altova
MapForce (altova.com), or NEON’s ODEMapster [4], produce complex map-
pings based on correspondences manually specified by the user in a graphical
interface or produced by matching tools. Most of these tools are geared toward
expert users (ontology engineers or DB administrators). In contrast, Karma fo-
cuses on enabling domain experts to model sources by automating the process
as much as possible and providing users an intuitive user interface to resolve
ambiguities and tailor the process. Karma produces complex GLAV mappings
under the hood, but users do not need to be aware of the logical complexities of
data integration/exchange. They see the source data in a familiar spreadsheet
format annotated with hierarchical headings, and they can interact with it to
correct and refine the mappings.
Alexe et al. [1] elicit complex data exchange rules from examples of source data
tuples and the corresponding tuples over the target schema. Karma could use
this approach to explain its model to users via examples, and as an alternative
method for users to customize the model by editing the examples.
Schema matching techniques have also been used to identify the semantic
types of columns by comparing them with labeled columns [10]. Another ap-
proach [19] is to learn regular expression-like rules for data in each column and
use these expressions to recognize new examples. Our CRF approach [14] im-
proves over these approaches by better handling variations in formats and by
exploiting a much wider range of features to distinguish between semantic types
that are very similar, such as those involving numeric values.
The combination of the D2R [8] and R2R [7] systems can also express GLAV
mappings, as Karma does. D2R maps a relational database into RDF with a schema
closely resembling the database. Then R2R can transform the D2R-produced
RDF into a target RDF that conforms to a given ontology using an expressive
transformation language. R2RML [9] directly maps a relational database to the
desired target RDF. In both cases, the user has to manually write the mapping
rules. In contrast, Karma automatically proposes a mapping and lets the user
correct/refine the mapping interactively. Karma could easily export its GLAV
rules into the R2RML or D2R/R2R formats.
6 Discussion
A critical challenge of the Linked Data cloud is understanding the semantics of
the data that users are publishing to the cloud. Currently, users are linking their
information at the entity level, but to provide deeper integration of the available
data, we also need semantic descriptions in terms of shared ontologies. In this
paper we presented a semi-automated approach to building the mappings from
a source to a domain ontology.
Often sources require complex cleaning and transformation operations on the
data as part of the mapping. We plan to extend Karma’s interface to express
these operations and to include them in the source models. In addition, we plan
to extend the approach to support modeling a source in which the relationships
among columns contain a cycle.
References
1. Alexe, B., ten Cate, B., Kolaitis, P.G., Tan, W.C.: Designing and refining schema
mappings via data examples. In: SIGMOD, Athens, Greece, pp. 133–144 (2011)
2. An, Y., Borgida, A., Miller, R.J., Mylopoulos, J.: A semantic approach to dis-
covering schema mapping expressions. In: Proceedings of the 23rd International
Conference on Data Engineering (ICDE), Istanbul, Turkey, pp. 206–215 (2007)
3. Arenas, M., Barcelo, P., Libkin, L., Murlak, F.: Relational and XML Data Ex-
change. Morgan & Claypool, San Rafael (2010)
4. Barrasa-Rodriguez, J., Gómez-Pérez, A.: Upgrading relational legacy data to the
semantic web. In: Proceedings of WWW Conference, pp. 1069–1070 (2006)
5. Becker, C., Bizer, C., Erdmann, M., Greaves, M.: Extending SMW+ with a Linked
Data integration framework. In: Proceedings of ISWC (2010)
6. Bellahsene, Z., Bonifati, A., Rahm, E.: Schema Matching and Mapping, 1st edn.
Springer (2011)
7. Bizer, C., Schultz, A.: The R2R Framework: Publishing and Discovering Mappings
on the Web. In: Proceedings of the First International Workshop on Consuming
Linked Data (2010)
8. Bizer, C., Cyganiak, R.: D2R Server–publishing relational databases on the seman-
tic web. Poster at the 5th International Semantic Web Conference (2006)
9. Das, S., Sundara, S., Cyganiak, R.: R2RML: RDB to RDF Mapping Language,
W3C Working Draft (March 24, 2011), http://www.w3.org/TR/r2rml/
10. Doan, A., Domingos, P., Levy, A.Y.: Learning source descriptions for data integra-
tion. In: Proceedings of WebDB, pp. 81–86 (2000)
11. Fagin, R., Haas, L.M., Hernández, M.A., Miller, R.J., Popa, L., Velegrakis, Y.: Clio:
Schema mapping creation and data exchange. In: Conceptual Modeling: Founda-
tions and Applications - Essays in Honor of John Mylopoulos, pp. 198–236 (2009)
12. Fink, C., Finin, T., Mayfield, J., Piatko, C.: OWL as a target for information extraction
systems (2008)
13. Friedman, M., Levy, A.Y., Millstein, T.D.: Navigational plans for data integration.
In: Proceedings of AAAI, pp. 67–73 (1999)
14. Goel, A., Knoblock, C.A., Lerman, K.: Using conditional random fields to exploit
token structure and labels for accurate semantic annotation. In: Proceedings of
AAAI 2011 (2011)
15. Halevy, A.Y.: Answering queries using views: A survey. The VLDB Journal 10(4),
270–294 (2001)
16. Jentzsch, A., Andersson, B., Hassanzadeh, O., Stephens, S., Bizer, C.: Enabling
tailored therapeutics with linked data. In: Proceedings of the WWW Workshop on
Linked Data on the Web, LDOW (2009)
17. Kou, L., Markowsky, G., Berman, L.: A fast algorithm for Steiner trees. Acta
Informatica 15, 141–145 (1981)
18. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic
models for segmenting and labeling sequence data. In: Proceedings of the Eigh-
teenth International Conference on Machine Learning, pp. 282–289 (2001)
19. Lerman, K., Plangrasopchok, A., Knoblock, C.A.: Semantic labeling of online in-
formation sources. IJSWIS, special issue on Ontology Matching (2006)
20. Massmann, S., Raunich, S., Aumueller, D., Arnold, P., Rahm, E.: Evolution of
the COMA match system. In: Proceedings of the Sixth International Workshop on
Ontology Matching, Bonn, Germany (2011)
21. Shvaiko, P., Euzenat, J.: A Survey of Schema-Based Matching Approaches. In:
Spaccapietra, S. (ed.) Journal on Data Semantics IV. LNCS, vol. 3730, pp. 146–
171. Springer, Heidelberg (2005)
22. Spanos, D.E., Stavrou, P., Mitrou, N.: Bringing relational databases into the se-
mantic web: A survey. In: Semantic Web. IOS Pre-press (2011)
23. Tuchinda, R., Knoblock, C.A., Szekely, P.: Building mashups by demonstration.
ACM Transactions on the Web (TWEB) 5(3) (2011)
Castor: A Constraint-Based SPARQL Engine
with Active Filter Processing
1 Introduction
As the adoption of semantic web technologies grows, the fields of application become broader,
ranging from general facts from Wikipedia, to scientific publications metadata, govern-
ment data, or biochemical interactions. The Resource Description Framework (RDF) [9]
provides a standard knowledge representation model, a key component for interconnect-
ing data from various sources. SPARQL [12] is the standard language for querying RDF
data sources. Efficient evaluation of such queries is important for many applications.
State-of-the-art SPARQL engines (e.g., Sesame [5], Virtuoso [6] or 4store [7]) are
based on relational database technologies. They are mostly designed for scalability, i.e.,
the ability to handle increasingly large datasets. However, they have difficulty solving
complex queries, even on small datasets.
We approach SPARQL queries from a different perspective. We propose Castor, a
new SPARQL engine based on Constraint Programming (CP). CP is a technology for
solving NP-hard problems. It has been shown to be efficient for graph matching prob-
lems [20,15], which are closely related to SPARQL [3]. Castor is very competitive with
the state-of-the-art engines and outperforms them on complex queries.
Contributions. A first technical description of this work has been published in [13]. The
present paper presents a number of enhancements of the first Castor prototype.
Outline. The next section describes the SPARQL language and how it is implemented
in state-of-the-art engines. Section 3 presents our CP approach of SPARQL queries.
Section 4 shows the major parts of our system. Section 5 contains the experimental
results.
2 Background
Data in the semantic web are represented by a graph [9]. Nodes are identified by URIs1
and literals, or they may be blank. Edges are directed and labeled by URIs. We will
call such a graph an RDF dataset. Figure 1b shows an example of an RDF dataset. Note
that we can equivalently represent the dataset as a set of triples (Fig. 1a). Each triple
describes an edge of the graph. The components of a triple are respectively the source
node identifier (the subject), the edge label (the predicate) and the destination node
identifier (the object).
:Alice :worksFor :ACME .
:Alice :age 24 .
:Bob :worksFor :ACME .
:Bob :age 42 .
:Carol :worksFor :ACME .
:Carol :age 50 .
:Dave :worksFor :UnitedCorp .
:Dave :age 42 .
[Fig. 1: (a) the triples listed above; (b) the same dataset drawn as a graph over :Alice, :Bob,
:Carol, :Dave, :ACME, and :UnitedCorp, with :worksFor and :age edges]
SPARQL [12] is the standard query language for RDF. The basis of a query is a
triple pattern, i.e., a triple whose components may be variables. A set of triple patterns
is called a basic graph pattern (BGP) as it can be represented by a pattern graph to be
matched in the dataset. A solution of a BGP is an assignment of every variable to an
RDF value, such that replacing the variables by their assigned values in the BGP yields
a subset of the dataset viewed as a triple set. From now on, we will use the triple set
representation of the dataset. More complex patterns can be obtained by composing
BGPs together and by adding filters. Figure 2 shows an example SPARQL query with
one BGP and one filter.
1 For the sake of readability, throughout the paper we abbreviate URIs to CURIEs
(http://www.w3.org/TR/curie/).
SELECT * WHERE {
  ?p1 :worksFor :ACME .      (P1)
  ?p1 :age ?age1 .           (P2)
  ?p2 :worksFor :ACME .      (P3)
  ?p2 :age ?age2 .           (P4)
  FILTER (?age1 < ?age2)
}
Fig. 2. SPARQL query example on the dataset shown in Fig. 1. The query returns all pairs of
employees working at ACME, the first one being younger than the second one.
Formally, the set of solutions of a BGP over the dataset G (viewed as a triple set) is
BGP_G = { μ | μ(BGP) ⊆ G }.
When evaluating a SPARQL query, the solution set of the graph pattern is trans-
formed into a list, to which the solution modifiers (such as DISTINCT, ORDER BY, and LIMIT) are then applied.
2 The condition on the variables appearing in c restricts the language to safe filters, without
limiting its expressive power [2].
– Triple stores store the whole dataset in one three-column table. Each row represents
one triple. Examples in this category are Sesame, 4store [7], Virtuoso [6], RDF-
3x [10] and Hexastore [19].
– Vertically partitioned tables maintain one two-column table for each predicate. The
resulting smaller tables are sometimes more convenient than the single large table
of triple stores. However, there is a significant overhead when variables appear in
place of predicates in the query. An example is SW-Store [1].
– Schema-specific systems map legacy relational databases to RDF triples using a
user-specified ontology. Queries are translated to SQL and performed on the re-
lational tables. Native RDF datasets can be transformed to relational tables if the
user provides the structure. Thus, such systems do not handle well the schema-less
nature of RDF.
For the purpose of this paper, we will focus on triple stores, which are very popular and
well-performing generic engines.
The solutions for a single triple pattern can be retrieved efficiently from a triple store
using redundant indexes. Combining multiple triple patterns however involves joining
the solution sets together, i.e., merging mappings that assign the same value to common
variables. Such operations can be more or less expensive depending on the order in
which they are performed. Query engines carefully construct a join graph optimizing
the join order. The join graph is then executed bottom-up, starting from the leaves, the
triple patterns, and joining the results together. Filters are applied once all their variables
appear in a solution set.
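The core join operation, merging mappings that agree on their shared variables, can be sketched as follows. This is a naive nested-loop join over the example data of Fig. 1, only for illustration; real engines use indexed join algorithms and carefully chosen join orders.

    def join(solutions1, solutions2):
        # Merge every pair of mappings that assign the same values to common variables.
        result = []
        for m1 in solutions1:
            for m2 in solutions2:
                if all(m1[v] == m2[v] for v in m1.keys() & m2.keys()):
                    result.append({**m1, **m2})
        return result

    p1 = [{"?p1": ":Alice"}, {"?p1": ":Bob"}, {"?p1": ":Carol"}]           # ?p1 :worksFor :ACME
    p2 = [{"?p1": ":Alice", "?age1": 24}, {"?p1": ":Bob", "?age1": 42},
          {"?p1": ":Carol", "?age1": 50}, {"?p1": ":Dave", "?age1": 42}]   # ?p1 :age ?age1
    print(join(p1, p2))   # three mappings; :Dave does not work for ACME and is dropped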
The join graph optimization problem has been largely studied for relational databases
(e.g., [11,16]). Many results were also adapted to semantic web databases (e.g., [17]).
Figure 3 shows an example of executing a query in a triple store. Here, the filter can
only be applied at the very last stage of the evaluation, as it involves variables from
different parts of the query.
The relational database approach to SPARQL queries focuses on the triple patterns to
build the solutions. We propose another view focusing on the variables.
A solution to a query is an assignment of the variables of the query to values of the
dataset. The set of values that can be assigned to a variable is called its domain. The
domain of a variable is initially the set of all URIs, blank nodes and literals occurring
Fig. 3. Executing the query from Fig. 2 in a triple store evaluates the join graph bottom-up. Note
that, depending on the used join algorithms, some intermediate results may be produced lazily
and need not be stored explicitly. The URIs of the employees are abbreviated by their first letter.
in the dataset. We construct solutions by selecting for each variable a value from its
domain and checking that the obtained assignment satisfies the triple patterns and the
filters (i.e., the constraints).
Constructing all solutions can be achieved by building a search tree. Each node con-
tains the domains of the variables. The root node contains the initial domains. At each
node of the tree, a variable is assigned to a value in its domain (i.e., its domain is re-
duced to a singleton), and constraints are propagated to reduce other variable domains.
Whenever a domain becomes empty, the branch of the search tree is pruned. The form
of the search tree thus depends on the choice of variable at each node and the order of
the children (i.e., how the values are enumerated in the domain of the variables). Let us
consider for example the query of Fig. 2. When assigning ?age1 to 42, we can propagate
the constraint ?age1 < ?age2 to remove from the domain of ?age2 every value which
is not greater than 42 and, if all values are removed, we can prune this branch.
This is the key idea of constraint programming: prune the search tree by using the
constraints to remove inconsistent values from the domains of the variables. Each con-
straint is used successively until the fix-point is reached. This process, called propaga-
tion, is repeated at every node of the tree. There are different levels of propagation. An
algorithm with higher complexity will usually be able to prune more values. Thus, a
trade-off has to be found between the achieved pruning and the time taken.
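The following minimal sketch shows the propagate-and-branch loop in its generic form, with domains as plain Python sets and the running filter ?age1 < ?age2 as the only propagator. It illustrates the idea described above; it is not Castor's solver, which relies on the specialised data structures of Section 4.

    def less(x, y):
        # Propagator for x < y: returns True when it has pruned something.
        def propagate(domains):
            dx, dy = domains[x], domains[y]
            new_x = {v for v in dx if dy and v < max(dy)}
            new_y = {v for v in dy if dx and v > min(dx)}
            changed = new_x != dx or new_y != dy
            domains[x], domains[y] = new_x, new_y
            return changed
        return propagate

    def search(domains, propagators):
        changed = True
        while changed:                                   # propagation to a fix-point
            changed = any(p(domains) for p in propagators)
            if any(not d for d in domains.values()):
                return                                   # empty domain: prune the branch
        if all(len(d) == 1 for d in domains.values()):
            yield {v: next(iter(d)) for v, d in domains.items()}
            return
        var = next(v for v, d in domains.items() if len(d) > 1)
        for value in sorted(domains[var]):               # enumerate the children
            child = {v: set(d) for v, d in domains.items()}
            child[var] = {value}
            yield from search(child, propagators)

    doms = {"?age1": {24, 42, 50}, "?age2": {24, 42, 50}}
    print(list(search(doms, [less("?age1", "?age2")])))  # the three pairs with ?age1 < ?age2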
Figure 4 shows the search tree for the running example. At the root node, the triple
patterns restrict the domains of ?p1 and ?p2 to only :Alice, :Bob and :Carol, i.e., the
employees working at ACME, and ?age1 and ?age2 to { 24, 42, 50 }. The filter removes
value 50 from ?age1, as there is no one older than 50. Similarly, 24 is removed from
?age2. Iterating the process, we can further remove :Carol from ?p1 and :Alice from
?p2. Compared to the relational database approach, we are thus able to exploit the filters
at the beginning of the search.
The tree is explored in a depth-first strategy. Hence only the path from the root to
the current node is kept in memory. In most constraint programming systems, instead
of keeping copies of the domains along the path to the root, one maintains the domains
of the current node and a trail. The trail contains the minimal information needed to
restore the current domains to any ancestor node.
[Figure 4: search tree for the query of Fig. 2; after branching (e.g., ?p2 = :B vs. ?p2 = :C), the
leaves correspond to the three solutions (:A, 24, :B, 42), (:A, 24, :C, 50), and (:B, 42, :C, 50)]
Fig. 4. Executing the query in Fig. 2 with constraint programming explores the search tree top-
down. The triple patterns and filters are used at every node to reduce the domains of the variables.
The URIs of the employees are abbreviated by their first letter.
4 Implementation
We evaluated the constraint-based approach using a state-of-the-art CP solver in [13].
While such an implementation delivered some results, the cost of restoring the domains
in generic solvers is too high for large datasets. Hence, we have built a specialized
lightweight solver called Castor.
Castor is a prototype SPARQL engine based on CP techniques. When executing a
query, a domain is created for every variable of the query, containing all values occur-
ring in the dataset. For efficiency, every value is represented by an integer. Constraints
correspond to the triple patterns, filters and solution modifiers. The associated pruning
functions, called propagators, are registered with the domains of the variables on which
the constraints are stated. The propagators will then be called whenever the domains are
modified. The search tree is explored in a depth-first strategy. A leaf node where every
domain is a singleton is a solution, which is returned by the engine.
In this section, we describe the major components of Castor: the constraints and their
propagators, the representation of the domains, the mapping of RDF values to integers,
and the triple indexes used to store the dataset and propagating the triple pattern con-
straints.
4.1 Constraints
Constraints and propagators are the core of a CP solver. SPARQL queries have three
kinds of constraints: triple patterns, filters and solution modifiers. The associated prop-
agators can achieve different levels of consistency, depending on their complexity and
properties of the constraint. We first show the different levels of consistency that are
achieved by Castor. Then, we explain the propagators for the different constraints.
To be correct, a propagator should at least ensure that the constraint is satisfied once
every variable in the constraint is bound (i.e., its domain is a singleton). However, to re-
duce the search space, propagators can prune the domains when variables are unbound.
Propagators can be classified by their achieved level of consistency [4], i.e., the amount
of pruning they can achieve.
Triple Patterns. A triple pattern is a constraint involving three variables, one for each
component. For ease of reading, we consider constants to be variables whose domains
are singletons. The pruning is performed by retrieving all the triples from the dataset
where the components of the bound variables correspond to the assigned value. Val-
ues of domains of unbound variables that do not appear in the resulting set of triples
are pruned. If the pruning is performed with only one unbound variable, we achieve
forward-checking consistency. Castor achieves more pruning by performing the pruning
when one or two variables are unbound. If all three variables are bound, the propagator
checks if the triple is in the dataset and empties a domain if this is not the case.
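A minimal sketch of this pruning (not Castor's implementation, which looks triples up through its indexes): for each component, values with no supporting triple compatible with the current domains of the other two components are removed. Constants are modelled, as in the text, as variables with singleton domains.

    def prune_triple_pattern(dataset, domains, s, p, o):
        # dataset: a set of (subject, predicate, object) triples;
        # domains: maps each of s, p, o to its current set of candidate values.
        changed = False
        for pos, var in enumerate((s, p, o)):
            others = [(i, v) for i, v in enumerate((s, p, o)) if i != pos]
            support = {t[pos] for t in dataset
                       if all(t[i] in domains[v] for i, v in others)}
            new = domains[var] & support
            changed |= new != domains[var]
            domains[var] = new
        return changed

    data = {(":Alice", ":worksFor", ":ACME"), (":Bob", ":worksFor", ":ACME"),
            (":Carol", ":worksFor", ":ACME"), (":Dave", ":worksFor", ":UnitedCorp")}
    doms = {"?p1": {":Alice", ":Bob", ":Carol", ":Dave"},
            "_pred": {":worksFor"}, "_obj": {":ACME"}}      # constants as singleton domains
    prune_triple_pattern(data, doms, "?p1", "_pred", "_obj")
    print(doms["?p1"])                                      # {':Alice', ':Bob', ':Carol'}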
Filters. Castor has a generic propagator for filters achieving forward-checking consis-
tency. When traversing the domain of the unbound variable, we can check if the filter
is satisfied by evaluating the SPARQL expression as described by the W3C recommen-
dation [12]. It provides a fallback to easily handle any filter, but is not very efficient.
When possible, specialized algorithms are preferred.
For example, the propagator for the sameTerm(?x, ?y) filter can easily achieve do-
main consistency. The constraint states that ?x and ?y are the same RDF term. The
domains of both variables should be the same. Hence, when a value is removed from
one domain, the propagator removes that value from the other domain.
Propagators for monotonic constraints [18], e.g., ?x < ?y, can easily achieve
bounds consistency. Indeed, for constraint ?x < ?y, we have max(?x) < max(?y) and
min(?x) < min(?y). The pruning is performed by adjusting the upper bound of ?x and
the lower bound of ?y.
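On a bounds representation (where a domain is kept only as a [lo, hi] pair of integer value identifiers, as described later), this pruning amounts to two constant-time updates. The sketch below is an illustration under that assumption, not Castor's code.

    def propagate_less(x_bounds, y_bounds):
        # Tighten max(?x) below max(?y) and min(?y) above min(?x);
        # bounds are integer value identifiers, so the adjacent id is +/- 1.
        x_lo, x_hi = x_bounds
        y_lo, y_hi = y_bounds
        x_hi = min(x_hi, y_hi - 1)
        y_lo = max(y_lo, x_lo + 1)
        return (x_lo, x_hi), (y_lo, y_hi)   # an empty domain shows up as lo > hi

    print(propagate_less((3, 7), (3, 7)))   # ((3, 6), (4, 7))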
Solution Modifiers. The DISTINCT keyword in SPARQL removes duplicates from the
results. Such operation can also be handled by constraints. When a solution is found,
a new constraint is added stating that any further solution must be different from the
current one. The propagator achieves forward-checking consistency, i.e., when all vari-
ables but one are bound, we remove the value of the already found solution from the
domain of the unbound variable.
When the ORDER BY and LIMIT keywords are used together, the results shall only
include the n best solutions according to the specified ordering. After n solutions have
been found, we add a new constraint stating that any new solution must be “better” than
the worst solution so far. This technique is known as branch-and-bound.
Values are partitioned into the following classes, shown in ascending order. The or-
dering of the values inside each class is also given. When not specified, or to solve
ambiguous cases, the values are ordered by their lexical form.
1. Blank nodes: ordered by their internal identifier
2. URIs
3. Plain literals without language tags
4. xsd:string literals
5. Boolean literals: first false, then true values
6. Numeric literals: ordered first by their numerical value, then by their type URI
7. Date/time literals: ordered chronologically
8. Plain literals with language tags: ordered first by language tag, then by lexical form
9. Other literals: ordered by their type URI
We map the values of a dataset to consecutive integers starting from 1, such that
v1 <T v2 ⇔ id(v1) < id(v2), where <T denotes the total order defined above.
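A minimal sketch of such an id assignment follows; only a few of the classes above are handled, and the (kind, lexical form) encoding of values is an assumption for illustration, not Castor's internal representation.

    def order_key(value):
        kind, lexical = value                 # e.g. ("uri", ":Alice") or ("number", "42")
        rank = {"blank": 1, "uri": 2, "plain": 3, "number": 6, "datetime": 7}.get(kind, 9)
        if kind == "number":
            return (rank, float(lexical), lexical)   # numeric value first, then lexical form
        return (rank, lexical, "")

    def assign_ids(values):
        # Consecutive integers starting from 1, respecting the total order.
        return {v: i for i, v in enumerate(sorted(values, key=order_key), start=1)}

    ids = assign_ids([("uri", ":Alice"), ("number", "42"), ("number", "24.0"), ("plain", "hello")])
    print(ids)   # :Alice gets id 1, "hello" id 2, and 24.0 comes before 42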
structures representing the domain should perform such operations efficiently. There
are two kinds of representations. The discrete representation keeps track of every single
value in the domain. The bounds representation only keeps the lowest and highest value
of the domain according to the total order defined in Section 4.2. We propose a dual
view, leveraging the strengths of both representations.
Discrete Representation. The domain is represented by its size and two arrays, dom and
map. The first size elements of dom are in the domain of the variable; the others have been
removed (see Fig. 5). The map array maps values to their position in the dom array.
dom: d g f c b | h a e      (the first size = 5 entries are in the domain; h, a, e have been removed)
map: a→6  b→4  c→3  d→0  e→7  f→2  g→1  h→5
Fig. 5. Example representation of the domain { b, c, d, f , g }, such that size = 5, when the initial
domain is { a, . . . , h }. The first size values in dom belong to the domain; the last values are those
which have been removed. The map array maps values to their position in dom. For example, value
b has index 4 in the dom array. In such a representation, only the size needs to be kept in the trail.
To remove a value, we swap it with the last value of the domain (i.e., the value
directly to the left of the size marker), reduce size by one and update the map array.
Such an operation is done in constant time.
Alternatively, we can restrict the domain to the intersection of itself and a set S. We
move all values of S which belong to the first size elements of dom to the beginning
of dom and set size to the size of the intersection. Such an operation is done in O(|S|),
where |S| is the size of S. Castor uses the restriction operation in propagators achieving
forward-checking consistency.
Operations on the bounds, however, are inefficient. This major drawback is due to the
unsorted dom array. Searching for the minimum or maximum value requires the traver-
sal of the whole domain. Increasing the lower bound or decreasing the upper bound
involves removing every value between the old and new bound one by one.
As the order of the removed values is not modified by any operation, the domain can
be restored in constant time by setting the size marker back to its initial position. The
trail, i.e., the data structure needed to restore the domain to any ancestor node of the
search tree, is thus a stack of the sizes.
Note that this is not the standard representation of discrete domains in CP. However,
the trail of standard representations is too heavy for our purpose and size of data.
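The sketch below mirrors this representation (the dom and map arrays, the size marker, and a trail of sizes); it is a plain-Python illustration of Fig. 5 and of the operations just described, not Castor's implementation, and the method names are hypothetical.

    class DiscreteDomain:
        def __init__(self, n):
            self.dom = list(range(n))          # values 0..n-1
            self.map = list(range(n))          # map[v] = position of v in dom
            self.size = n
            self.trail = []

        def remove(self, v):                   # O(1): swap v with the last kept value
            pos = self.map[v]
            if pos >= self.size:
                return                         # already removed
            last = self.dom[self.size - 1]
            self.dom[pos], self.dom[self.size - 1] = last, v
            self.map[last], self.map[v] = pos, self.size - 1
            self.size -= 1

        def restrict(self, keep):              # O(|keep|): intersect the domain with a set
            marker = 0
            for v in keep:
                pos = self.map[v]
                if pos < self.size:            # v is still in the domain: move it to the front
                    other = self.dom[marker]
                    self.dom[marker], self.dom[pos] = v, other
                    self.map[v], self.map[other] = marker, pos
                    marker += 1
            self.size = marker

        def checkpoint(self):                  # push the current size on the trail
            self.trail.append(self.size)

        def restore(self):                     # pop: undo all removals since the checkpoint
            self.size = self.trail.pop()

    d = DiscreteDomain(8)
    d.checkpoint(); d.remove(4); d.restrict({1, 2, 3, 6})
    print(sorted(d.dom[:d.size]))              # [1, 2, 3, 6]
    d.restore()
    print(sorted(d.dom[:d.size]))              # [0, 1, 2, 3, 4, 5, 6, 7]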
Bounds Representation. The domain is represented by its bounds, i.e., its minimum and
maximum values. In contrast to the discrete representation, the bound representation is
an approximation of the exact domain. We assume all values between the bounds are
present in the domain.
In such a representation, we cannot remove a value in the middle of the domain as
we cannot represent a hole inside the bounds. However, increasing the lower bound or
decreasing the upper bound is done in constant time.
Since the data structure for this representation is small (only two numbers), the trail
contains copies of the whole data structure. Restoring the domains involves restoring
both bounds.
Dual View. Propagators achieving forward-checking or domain consistency remove
values from the domains. Thus, they require a discrete representation. However, propa-
gators achieving bounds consistency only update the bounds of the domains. For them
to be efficient, we need a bounds representation. Hence, Castor creates two variables
xD and xB (with discrete and bounds representations, respectively) for every SPARQL variable
?x. Constraints are stated using only one of the two variables, depending on which rep-
resentation is the most efficient for the associated propagator. In particular, monotonic
constraints are stated on bounds variables whereas triple pattern constraints are stated
on discrete variables.
An additional constraint xD = xB ensures the correctness of the dual approach. Achiev-
ing domain consistency for this constraint is too costly, as it amounts to performing every
bounds operation on the discrete representation as well. Instead, the propagator in Castor
achieves forward-checking consistency, i.e., once one variable is bound the other will
be bound to the same value. As an optimization, when restricting a domain to its inter-
section with a set S, we filter out values of S which are outside the bounds and update
the bounds of xB . Such optimization does not change the complexity of the operation,
as it has to traverse the whole set S anyway.
5 Experimental Results
To assess the performance of our approach, we have run the SPARQL Performance
Benchmark (SP2 Bench) [14]. SP2 Bench consists of a deterministic dataset generator,
and 12 representative queries to be executed on the generated datasets. The datasets
represent relationships between fictive academic papers and their authors, following the
model of academic publications in the DBLP database.
We compare the performance of four engines: Sesame 2.6.1 [5], Virtuoso 6.1.4 [6],
4store 1.1.4 [7] and our own Castor. Sesame was configured to use its native on-disk
store with three indexes (spoc, posc, ospc). The other engines were left in their default
configuration. We did not include RDF-3x in the comparison as it is unable to handle
the filters appearing in the queries. For queries involving filters, we have also tested a
version of Castor that does not post them as constraints, but instead evaluate them in a
post-processing step.
[Figure 6 panels: Q12a (ASK version of q5a), Q4 (BGP with “x1 < x2” filter), Q6 (Negation),
and Q7 (Nested negation); series: Virtuoso, Sesame, 4store, Castor, and Castor (no filters)]
Fig. 6. Castor is competitive and often outperforms state-of-the-art SPARQL engines on complex
queries. The x-axis represents the dataset size in terms of number of triples. The y-axis is the
query execution time. Both axes have a logarithmic scale.
We have generated 6 datasets with 10k, 25k, 250k, 1M and 5M triples. We have
performed three cold runs of each query over all the generated datasets, i.e., between
two runs the engines were restarted and the system caches cleared with “sysctl -w
vm.drop_caches=3”. We have set a timeout of 30 minutes. Please note that cold runs
may not give the most significant results for some engines. E.g., Virtuoso aggressively
fills its cache on the first query in order to perform better on subsequent queries. How-
ever, such a setting corresponds to the one used by the authors of SP2 Bench, so we have
chosen to use it as well. All experiments were conducted on an Intel Pentium 4 3.2 GHz
computer running ArchLinux 64bits with kernel 3.2.6, 3 GB of DDR-400 RAM and a
40 GB Samsung SP0411C SATA/150 disk with ext4 filesystem. We report the time
spent to execute the queries, not including the time needed to load the datasets.
The authors of SP2 Bench have identified four queries that are more challenging than
the others: Q4, Q5a, Q6 and Q7. The execution times of these queries, along with two
variations of Q5a, are reported in Figure 6.
Q5a and Q5b compute the same set of solutions. Q5a enforces the equality of two
variables with a filter, whereas Q5b uses a single variable for both. Note that such
optimization is difficult to do automatically, as equivalence does not imply identity
in SPARQL. For example, "42"^^xsd:integer and "42.0"^^xsd:decimal compare
equal in a filter, but are not the same RDF term and may thus not be matched in a BGP.
Detecting whether one can replace the two equivalent variables by a single one requires
a costly analysis of the dataset, which is not performed by any of the tested engines.
Sesame and 4store timed out when trying to solve Q5a on the 250k and above datasets.
Virtuoso does not differentiate equivalent values and treats equality as identity. Such
behavior breaks the SPARQL standard and can lead to wrong results. Castor does no
query optimization, but still performs equally well on both variants thanks to its ability
to exploit the filter at every node of the search tree. Q12a replaces the SELECT keyword
by ASK in Q5a. The solution is a boolean value reflecting whether there exists a solution
to the query. Thus, we only have to look for the first solution. However, Castor still
needs to initialize the search tree, which is the greatest cost. Virtuoso and 4store behave
similarly to Q5a, but Sesame is able to find the answer much more quickly.
Executing Q4 results in many solutions (e.g., for the 1M dataset, Q4 results in 2.5 × 10^6
solutions versus 3.5 × 10^4 solutions for Q5a). The filter does not allow for much
pruning, as shown by the very similar performances between the two variants of Castor.
Nevertheless, Castor is still competitive with the other engines. None of the engines
were able to solve the query for the 5M dataset in less than 30 minutes.
Table 1. Castor is the fastest or second fastest engine for nearly every query. The ranking of the
engines is shown for each query. The last column is the average rank for every engine.
[Figure 7 panels: Q3c (Single-variable filter, no results), Q8 (Union with many “!=” filters),
Q11 (ORDER BY and LIMIT), and Q12b (ASK version of q8); series: Virtuoso, Sesame, 4store,
Castor, and Castor (no filters)]
Fig. 7. On simpler queries, Castor is also very competitive with state-of-the-art SPARQL engines.
The x-axis represents the dataset size in terms of number of triples. The y-axis is the query
execution time. Both axes have a logarithmic scale.
Figure 7 shows the results for the other queries, except Q12c. Query Q12c involves
an RDF value that is not present in the dataset. It is solved in constant time by all
engines equally well. For all queries, Castor is competitive with the other engines or
outperforms them. The sharp decrease in performance of Castor on Q11 between the
250k and the 1M datasets is due to the fixed size of the triple store cache. The hit ratio
drops from 99.6% to 40.4%.
For each query, we sort the engines in lexicographical order, first by the largest
dataset solved, then by the execution time on the largest dataset. The obtained ranks
are shown in Table 1. Castor is ranked first for 5 queries out of 16, and second for all
other queries but one. The 4store engine is ranked first on 8 queries, but does not fare
as well on the other queries. In most of the queries where 4store is ranked first, the ex-
ecution time of Castor is very close to the execution time of 4store. Virtuoso performs
well on some difficult queries (Q6 and Q7), but is behind for the other queries. Sesame
performs the worst of the tested engines.
6 Conclusion
Acknowledgments. The authors want to thank the anonymous reviewers for their con-
structive comments. The first author is supported as a Research Assistant by the Belgian
FNRS (National Fund for Scientific Research). This research is also partially supported
by the Interuniversity Attraction Poles Programme (Belgian State, Belgian Science Pol-
icy) and the FRFC project 2.4504.10 of the Belgian FNRS.
References
1. Abadi, D.J., Marcus, A., Madden, S.R., Hollenbach, K.: SW-Store: a vertically partitioned
DBMS for semantic web data management. The VLDB Journal 18, 385–406 (2009)
2. Angles, R., Gutierrez, C.: The Expressive Power of SPARQL. In: Sheth, A.P., Staab, S.,
Dean, M., Paolucci, M., Maynard, D., Finin, T., Thirunarayan, K. (eds.) ISWC 2008. LNCS,
vol. 5318, pp. 114–129. Springer, Heidelberg (2008)
3. Baget, J.-F.: RDF Entailment as a Graph Homomorphism. In: Gil, Y., Motta, E., Benjamins,
V.R., Musen, M.A. (eds.) ISWC 2005. LNCS, vol. 3729, pp. 82–96. Springer, Heidelberg
(2005)
4. Bessiere, C.: Handbook of Constraint Programming, ch. 3. Elsevier Science Inc. (2006)
5. Broekstra, J., Kampman, A., van Harmelen, F.: Sesame: A Generic Architecture for Storing
and Querying RDF and RDF Schema. In: Horrocks, I., Hendler, J. (eds.) ISWC 2002. LNCS,
vol. 2342, pp. 54–68. Springer, Heidelberg (2002)
6. Erling, O., Mikhailov, I.: RDF Support in the Virtuoso DBMS. In: Pellegrini, T., Auer, S.,
Tochtermann, K., Schaffert, S. (eds.) Networked Knowledge - Networked Media. Studies in
Computational Intelligence, vol. 221, pp. 7–24. Springer, Heidelberg (2009)
7. Harris, S., Lamb, N., Shadbolt, N.: 4store: The design and implementation of a clustered RDF
store. In: 5th International Workshop on Scalable Semantic Web Knowledge Base Systems
(SSWS 2009), at ISWC 2009 (2009)
8. Hose, K., Schenkel, R., Theobald, M., Weikum, G.: Database foundations for scalable RDF
processing. In: Polleres, A., d’Amato, C., Arenas, M., Handschuh, S., Kroner, P., Ossowski,
S., Patel-Schneider, P. (eds.) Reasoning Web 2011. LNCS, vol. 6848, pp. 202–249. Springer,
Heidelberg (2011)
9. Klyne, G., Carroll, J.J., McBride, B.: Resource description framework (RDF): Concepts and
abstract syntax (2004),
http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/
10. Neumann, T., Weikum, G.: RDF-3X: a RISC-style engine for RDF. Proc. VLDB Endow. 1,
647–659 (2008)
11. Piatetsky-Shapiro, G., Connell, C.: Accurate estimation of the number of tuples satisfying
a condition. In: Proceedings of the 1984 ACM SIGMOD International Conference on Man-
agement of Data, SIGMOD 1984, pp. 256–276. ACM, New York (1984)
12. Prud’hommeaux, E., Seaborne, A.: SPARQL query language for RDF (January 2008),
http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/
13. le Clément de Saint-Marcq, V., Deville, Y., Solnon, C.: An Efficient Light Solver for Query-
ing the Semantic Web. In: Lee, J. (ed.) CP 2011. LNCS, vol. 6876, pp. 145–159. Springer,
Heidelberg (2011)
14. Schmidt, M., Hornung, T., Lausen, G., Pinkel, C.: SP2 Bench: A SPARQL performance
benchmark. In: Proc. IEEE 25th Int. Conf. Data Engineering, ICDE 2009, pp. 222–233
(2009)
15. Solnon, C.: Alldifferent-based filtering for subgraph isomorphism. Artificial Intelli-
gence 174(12-13), 850–864 (2010)
16. Steinbrunn, M., Moerkotte, G., Kemper, A.: Heuristic and randomized optimization for the
join ordering problem. The VLDB Journal 6, 191–208 (1997)
17. Stocker, M., Seaborne, A., Bernstein, A., Kiefer, C., Reynolds, D.: SPARQL basic graph
pattern optimization using selectivity estimation. In: Proceeding of the 17th International
Conference on World Wide Web, WWW 2008, pp. 595–604. ACM, New York (2008)
18. Van Hentenryck, P., Deville, Y., Teng, C.M.: A generic arc-consistency algorithm and its
specializations. Artificial Intelligence 57(2-3), 291–321 (1992)
19. Weiss, C., Karras, P., Bernstein, A.: Hexastore: sextuple indexing for semantic web data
management. Proc. VLDB Endow. 1, 1008–1019 (2008)
20. Zampelli, S., Deville, Y., Solnon, C.: Solving subgraph isomorphism problems with con-
straint programming. Constraints 15, 327–353 (2010)
A Structural Approach to Indexing Triples
1 Introduction
As an essential part of the W3C’s semantic web stack, the RDF data model
is finding increasing use in a wide range of web data management scenarios,
including linked data1 . Due to its increasing popularity and application, recent
years have witnessed an explosion of proposals for the construction of native
RDF data management systems (also known as triplestores) that store, index,
and process massive RDF data sets.
While we refer to recent surveys such as [12] for a full overview of these pro-
posals, we can largely discern two distinct classes of approaches. Value-based
approaches focus on the use of robust relational database technologies such as
B+ -trees and column-stores for the physical indexing and storage of massive
1 http://linkeddata.org/
RDF graphs, and employ established relational database query processing tech-
niques for the processing of SPARQL queries [1, 7, 15, 18, 22]. While value-based
triplestores have proven successful in practice, they mostly ignore the native
graph structure, such as the paths and star patterns that naturally occur in RDF data
sets and queries. (Although some value-based approaches consider extensions to
capture and materialize such common patterns in the data graph [1, 15].)
Graph-based approaches, in contrast, try to capture and exploit exactly this
richer graph structure. Examples include GRIN [20] and DOGMA [6], that pro-
pose index structures based on graph partitioning and distances in the graphs,
respectively. A hybrid approach is taken in dipLODocus[RDF], where value-
based indexes are introduced for more or less homogeneous sets of subgraphs [23].
These somewhat ad-hoc approaches work well for an established query workload
or class of graph patterns, but it is unclear how the indexed patterns can flexibly
support general SPARQL queries outside of the supported set.
Structural indexes have been successfully applied in the semi-structured and
XML data management context to exploit structural graph information in query
processing. A structural index is essentially a reduced version of the original data
graph where nodes have been merged according to some notion of structural
similarity such as bisimulation [4, 5, 9, 13]. These indexes effectively take into
account the structure of both the graph and the query, rather than just the values
appearing in the query as is the case for value-based indexes. Furthermore, the
success of structural indexes hinges on a precise coupling between the expressive
power of a general query language and the organization of data by the indexes [9].
The precise class of queries that they can support is therefore immediately clear,
thereby addressing the shortcomings of other graph-based approaches.
While structural indexes have been explored for RDF data, for example in
the Parameterizable Index Graph [19] and gStore [24] proposals, these proposals
simplify the RDF data model to that of resource-centric edge-labeled graphs
over a fixed property label alphabet (disallowing joins on properties), which is
not well-suited to general SPARQL query evaluation where pattern matching
is triple-centric, i.e., properties have the same status as subjects and objects.
The relevance of such queries is observed by studies of the usage of SPARQL in
practice [3,16]. Furthermore, there is no tight coupling of structural organization
of these indexes to the expressivity of a practical fragment of SPARQL.
Motivated by these observations, we have initiated the SAINT-DB (Struc-
tural Approach to INdexing Triples DataBase) project to study the foundations
and engineering principles for native RDF management systems based on struc-
tural indexes that are faithful to both the RDF data model and the SPARQL
query language. As an initial foundation, we have recently established a precise
structural characterization of practical SPARQL fragments in terms of graph
simulations [8]. Our goal in SAINT-DB is to leverage this characterization in
the design of native structural indexing solutions for massive RDF data sets.
Fig. 1. A small RDF graph, with triples labeled for ease of reference
language, contains complete triple information and therefore allows for the re-
trieval of sets of triples rather than sets of resources. (2) A formalization of the
structural index, coupled to the expressivity of practical fragments of SPARQL,
is given, together with the algorithms for building and using it. (3) We demon-
strate the effective integration of structural indexing into a state-of-the-art triple
store with cost-based query optimization.
We proceed as follows. In Sec. 2 we introduce our basic terminology for query-
ing RDF data. In Sec. 3 we present the principles behind triple-based structural
indexes for RDF. In Sec. 4 we then discuss how these principles can be put into
practice in a state-of-the-art triple store. In Sec. 5, we present an empirical
study demonstrating the effectiveness of the new indexes within this extended triple
store. Finally, in Sec. 6 we present our main conclusions and give
indications for further research.
2 Preliminaries
The formal definition of BGP queries is as follows. Let V = {?x, ?y, ?z, . . . }
be a set of variables, disjoint from U. A triple pattern is an element of (U ∪ V)3 .
We write vars(p) for the set of variables occurring in triple pattern p. A basic
graph pattern (BGP for short) is a set of triple patterns. A BGP query (or simply
query for short) is an expression Q of the form select X where P where P is
a BGP and X is a subset of the variables mentioned in P .
Example 1. As an example, the following BGP query retrieves, from the RDF
graph of Fig. 1, those pairs of people pa and pc such that pa is a CEO and pc
has a social relationship with someone directly related to pa .
select ?pa , ?pc
where { (?pa , type, CEO), (?pa , ?relab , ?pb ), (?pb , ?relbc , ?pc ),
(?relab , type, socialRel), (?relbc , type, socialRel)}
To formally define the semantics of triple patterns, BGPs, and BGP queries,
we need to introduce the following concepts. A mapping μ is a partial function
μ : V → U that assigns values in U to a finite set of variables. The domain of μ,
denoted by dom(μ), is the subset of V where μ is defined. The restriction μ[X]
of μ to a set of variables X ⊆ V is the mapping with domain dom(μ) ∩ X such
that μ[X](?x) = μ(?x) for all ?x ∈ dom(μ) ∩ X. Two mappings μ1 and μ2 are
compatible, denoted μ1 ∼ μ2 , when for all common variables ?x ∈ dom(μ1 ) ∩
dom(μ2 ) it is the case that μ1 (?x) = μ2 (?x). Clearly, if μ1 and μ2 are compatible,
then μ1 ∪ μ2 is again a mapping. We define the join of two sets of mappings Ω1
and Ω2 as Ω1 ⋈ Ω2 := {μ1 ∪ μ2 | μ1 ∈ Ω1, μ2 ∈ Ω2, μ1 ∼ μ2}, and the projection
of a set of mappings Ω to X ⊆ V as πX(Ω) := {μ[X] | μ ∈ Ω}. If p is a triple
pattern then we denote by μ(p) the triple obtained by replacing the variables in
p according to μ. Semantically, triple patterns, BGPs, and queries evaluate to a
set of mappings when evaluated on an RDF graph D:
\[
\llbracket p \rrbracket_D := \{\mu \mid \mathrm{dom}(\mu) = \mathrm{vars}(p) \text{ and } \mu(p) \in D\},
\]
\[
\llbracket \{p_1, \ldots, p_n\} \rrbracket_D := \llbracket p_1 \rrbracket_D \Join \cdots \Join \llbracket p_n \rrbracket_D,
\]
\[
\llbracket \text{select } X \text{ where } P \rrbracket_D := \pi_X\big(\llbracket P \rrbracket_D\big).
\]
Example 2. Let Q be the query of Example 1 and D be the dataset of Fig. 1.
Then ⟦Q⟧_D = {{?pa ↦ sue, ?pc ↦ jane}}. In other words, Q evaluated on D
contains a single mapping μ, where μ(?pa) = sue and μ(?pc) = jane.
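To make these definitions concrete, here is a small Python sketch (all names and the toy data are ours, not from the paper) that evaluates a BGP over a set of triples using exactly the mapping join and projection defined above.

def match_pattern(pattern, triple):
    """Return the mapping {var: value} if `triple` matches `pattern`, else None."""
    mu = {}
    for p_elem, t_elem in zip(pattern, triple):
        if p_elem.startswith("?"):                 # variable position
            if p_elem in mu and mu[p_elem] != t_elem:
                return None
            mu[p_elem] = t_elem
        elif p_elem != t_elem:                     # constant mismatch
            return None
    return mu

def eval_pattern(pattern, data):
    return [mu for mu in (match_pattern(pattern, t) for t in data) if mu is not None]

def join(omega1, omega2):
    """Join two sets of mappings: keep unions of compatible pairs."""
    result = []
    for m1 in omega1:
        for m2 in omega2:
            if all(m1[v] == m2[v] for v in m1.keys() & m2.keys()):  # compatibility
                result.append({**m1, **m2})
    return result

def eval_bgp(bgp, data, projection=None):
    omega = [{}]                                   # neutral element of the join
    for pattern in bgp:
        omega = join(omega, eval_pattern(pattern, data))
    if projection is not None:
        omega = [{v: mu[v] for v in projection if v in mu} for mu in omega]
    return omega

# Tiny dataset in the spirit of Example 1 (the triples are illustrative):
D = {("sue", "type", "CEO"), ("sue", "rel1", "bob"), ("bob", "rel2", "jane"),
     ("rel1", "type", "socialRel"), ("rel2", "type", "socialRel")}
Q = [("?pa", "type", "CEO"), ("?pa", "?relab", "?pb"), ("?pb", "?relbc", "?pc"),
     ("?relab", "type", "socialRel"), ("?relbc", "type", "socialRel")]
print(eval_bgp(Q, D, projection=["?pa", "?pc"]))   # [{'?pa': 'sue', '?pc': 'jane'}]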
Fig. 2. A structural index for the RDF graph of Fig. 1. As described in Example 3, a
few edges have been suppressed for clarity of presentation.
This proposition indicates two natural ways in which we can use a structural index I
as an alternative means to compute ⟦P⟧_D:
participate in the required joins are pruned. Indeed, note that computing all
embeddings of P in the trivial index I in which each block consists of a single
triple will be as hard as computing the result of P on D itself. On the other
hand, while it is trivial to compute all embeddings of P into the other trivial
index J in which all triples are kept in a single block, we always have α(pi ) = D
and hence no pruning is achieved.
We next outline a method for constructing structural indexes that are guaran-
teed to have optimal pruning power for the class of so-called pure acyclic BGPs,
in the following sense.
Note that the inclusion $\pi_{\mathrm{vars}(p)}(\llbracket P \rrbracket_D) \subseteq \llbracket p \rrbracket_{\bigcup_{\alpha \in A} \alpha(p)}$ always holds due to Prop. 1.
The converse inclusion does not hold in general, however.
Stated differently, pruning-optimality says that every element in ⟦p⟧_{α(p)} can
be extended to a matching in ⟦P⟧_D, for every triple pattern p ∈ P and every
embedding α of P into I. Hence, when using Prop. 1 to compute ⟦P⟧_D we indeed
optimally prune each relation ⟦p_i⟧_D to be joined.
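A minimal sketch of how the pruned sets can be used, assuming the index assigns one block (a set of triples) to each triple pattern in an embedding, and reusing join and eval_pattern from the previous sketch (all names are ours; computing the embeddings themselves is not shown):

def alpha_union(pattern, embeddings):
    """Union, over all embeddings alpha, of the block that alpha assigns to `pattern`.
    Each embedding is a dict mapping a triple pattern to a set of triples."""
    pruned = set()
    for alpha in embeddings:
        pruned |= alpha[pattern]
    return pruned

def eval_bgp_with_index(bgp, embeddings, projection=None):
    """Evaluate each triple pattern only against its pruned candidate set,
    then join the resulting mapping sets as before."""
    omega = [{}]
    for pattern in bgp:
        omega = join(omega, eval_pattern(pattern, alpha_union(pattern, embeddings)))
    if projection is not None:
        omega = [{v: mu[v] for v in projection if v in mu} for mu in omega]
    return omega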
As already mentioned, we will give a method for constructing structural in-
dexes that are pruning-optimal w.r.t. the class of so-called pure acyclic BGPs.
Here, purity and acyclicity are defined as follows.
The other restriction, acyclicity, is a well-known concept for relational select-
project-join queries [2]. Its adaptation to BGP queries is as follows.
5 Experimental Validation
5.1 Experimental Setup
We have implemented SAINT-DB on top of RDF-3X version 0.36². All experiments
described in this section have been run on an Intel Core i7 (quad core, 3.06 GHz,
8MB cache) workstation with 8GB main memory and a three-disk RAID 5 array
(750GB, 7200rpm, 32MB cache) running 64-bit Ubuntu Linux.
Our performance indicator is the number of I/O read requests issued by
SAINT-DB and RDF-3X, measured by counting the number of calls to the buffer
manager’s readPage function. Thereby, our measurements are independent of
the page buffering strategies of the system. Since SAINT-DB currently does not
yet feature compression of the B+ -tree leaf pages, we have also turned off leaf
compression in RDF-3X for fairness of comparison. During all of our experiments
the structural indexes were small enough to load and keep in main memory. The
computation of the set of all embeddings into the index hence does not incur
any I/O read requests, and is not included in the figures mentioned.
Datasets, Queries, and Indexes. We have tested SAINT-DB on two syn-
thetic datasets and one real-world dataset. The first synthetic dataset, denoted
CHAIN, is used to demonstrate the ideal that SAINT-DB can achieve on highly
graph-structured and repetitive data. It contains chains of triples of the form
(x1 , y1 , x2 ), (x2 , y2 , x3 ), . . . , (xn , yn , xn+1 ), with chain length n ranging from 3
to 50. Each chain is repeated 1000 times and CHAIN includes around 1 mil-
lion triples in total. The full simulation index sim(CHAIN) has been gener-
ated accordingly, and consists of 1316 index blocks, each consisting of 1000
triples. On CHAIN we run queries that also have a similar chain-shaped style
(?x1 , ?y1 , ?x2 ), (?x2 , ?y2 , ?x3 ), . . . , (?xn , ?yn , ?xn+1 ), with n varying from 4 to 7.
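A small sketch of how such a chain dataset and the corresponding chain-shaped queries could be generated in Python; the identifier scheme is ours, as the paper does not specify the exact identifiers:

def chain_triples(chain_id, length):
    """One chain of `length` triples: (x1, y1, x2), (x2, y2, x3), ..."""
    return [(f"x{chain_id}_{i}", f"y{chain_id}_{i}", f"x{chain_id}_{i+1}")
            for i in range(1, length + 1)]

def chain_dataset(repetitions=1000, min_len=3, max_len=50):
    data = []
    chain_id = 0
    for length in range(min_len, max_len + 1):
        for _ in range(repetitions):
            chain_id += 1
            data.extend(chain_triples(chain_id, length))
    return data

def chain_query(n):
    """Chain-shaped BGP (?x1, ?y1, ?x2), ..., (?xn, ?yn, ?xn+1)."""
    return [(f"?x{i}", f"?y{i}", f"?x{i+1}") for i in range(1, n + 1)]

data = chain_dataset(repetitions=10)   # with repetitions=1000 this yields on the order of a million triples
print(len(data))
print(chain_query(4))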
The second synthetic dataset, denoted LUBM, is generated by the Lehigh
University Benchmark data generator [11] and contains approximately 2 million
triples. For this dataset, we computed the depth-2 simulation index sim2 (LUBM),
which consists of 222 index blocks. Index blocks have varying cardinalities, con-
taining from as few as 1 triple to as many as 190,000 triples.
The real-world RDF dataset, denoted SOUTHAMPTON, is published by the
University of Southampton3 . It contains approximately 4 million triples. For this
dataset, we also computed the depth-2 simulation index sim2 (SOUTHAMPTON),
which consists of 380 index blocks. Index blocks have varying cardinalities, con-
taining from as few as 1 triple to as many as 10⁶ triples.
For all datasets the indexes in their current non-specialized form require only
a few megabytes and therefore can be kept in main memory. A specialized in-
memory representation could easily further reduce this footprint. The detailed
description of the queries used can be found online4 . In the rest of this section we
denote queries related to the LUBM dataset as L1, . . . , L16, and those related
to SOUTHAMPTON as S1, . . . , S7.
2 http://code.google.com/p/rdf3x/
3 http://data.southampton.ac.uk/
4 http://www.win.tue.nl/~yluo/saintdb/
[Figure 3 shows per-query read-request counts for the LUBM queries L1 and L4–L16 and the SOUTHAMPTON queries S1–S7.]
Fig. 3. Number of read requests for different query processing strategies
Table 1. Read requests for SAINT-DB and RDF-3X on the CHAIN dataset. The
columns denote the length of the chain in the query. Speed-up is the ratio of read
requests of RDF-3X over those of SAINT-DB.
4 5 6 7
SAINT-DB 306 350 393 438
RDF-3X 3864 4799 5734 6669
Speed-up 12.63 13.71 14.59 15.23
Table 2. Read requests for SAINT-DB and RDF-3X on the LUBM and SOUTHAMP-
TON datasets. Speed-up is the ratio of read requests of RDF-3X over those of SAINT-
DB.
C1 C2 C3
L2 L3 L4 L9 S1 S2 S4 L1 L5 L6 L7 L8
SAINT-DB 116 5 163 18 18 36 64 238 39 47 38 7
RDF-3X 89 5 123 12 16 35 53 194 132 39 268 7
Speed-up 0.77 1.00 0.75 0.67 0.89 0.97 0.83 0.82 3.38 0.83 7.05 1.00
C3
L10 L11 L12 L13 L14 L15 L16 S3 S5 S6 S7
SAINT-DB 25 41 0 53 1519 352 288 48 410 173 175
RDF-3X 21 30 281 109 2668 2178 1224 33 424 316 236
Speed-up 0.84 0.73 ∞ 2.06 1.76 6.19 4.25 0.69 1.03 1.83 1.35
6 Concluding Remarks
In this paper, we have presented the first results towards triple-based structural
indexing for RDF graphs. Our approach is grounded in a formal coupling be-
tween practical fragments of SPARQL and structural characterizations of their
expressive power. An initial empirical validation of the approach shows that it is
possible and profitable to augment current value-based indexing solutions with
structural indexes for efficient RDF data management.
In this first phase of the SAINT-DB investigations, we have focused primarily
on the formal framework and design principles. We are currently shifting our
focus to a deeper investigation into the engineering principles and infrastructure
necessary to put our framework into practice. Some basic issues for further study
in this direction include: alternatives to the B+-tree data structure for physical
storage and access of indexes and data sets; more sophisticated optimization and
query processing solutions for reasoning over both the index and data graphs;
efficient external memory computation and maintenance of indexes; and, exten-
sions to richer fragments of SPARQL, e.g., with the OPTIONAL and UNION
constructs.
References
1. Abadi, D., et al.: SW-Store: a vertically partitioned DBMS for semantic web data
management. VLDB J. 18, 385–406 (2009)
2. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley
(1995)
3. Arias, M., Fernández, J.D., Martínez-Prieto, M.A., de la Fuente, P.: An empirical
study of real-world SPARQL queries. In: USEWOD (2011)
4. Arion, A., Bonifati, A., Manolescu, I., Pugliese, A.: Path summaries and path
partitioning in modern XML databases. WWW 11(1), 117–151 (2008)
5. Brenes Barahona, S.: Structural summaries for efficient XML query processing.
PhD thesis, Indiana University (2011)
6. Bröcheler, M., Pugliese, A., Subrahmanian, V.S.: DOGMA: A Disk-Oriented Graph
Matching Algorithm for RDF Databases. In: Bernstein, A., Karger, D.R., Heath,
T., Feigenbaum, L., Maynard, D., Motta, E., Thirunarayan, K. (eds.) ISWC 2009.
LNCS, vol. 5823, pp. 97–113. Springer, Heidelberg (2009)
7. Fletcher, G.H.L., Beck, P.W.: Scalable indexing of RDF graphs for efficient join
processing. In: CIKM, Hong Kong, pp. 1513–1516 (2009)
8. Fletcher, G.H.L., Hidders, J., Vansummeren, S., Luo, Y., Picalausa, F., De Bra,
P.: On guarded simulations and acyclic first-order languages. In: DBPL, Seattle
(2011)
9. Fletcher, G.H.L., Van Gucht, D., Wu, Y., Gyssens, M., Brenes, S., Paredaens, J.:
A methodology for coupling fragments of XPath with structural indexes for XML
documents. Information Systems 34(7), 657–670 (2009)
10. Gentilini, R., Piazza, C., Policriti, A.: From bisimulation to simulation: Coarsest
partition problems. J. Autom. Reasoning 31(1), 73–103 (2003)
11. Guo, Y., Pan, Z., Heflin, J.: LUBM: A benchmark for OWL knowledge base sys-
tems. J. Web Sem. 3(2-3), 158 (2005)
12. Luo, Y., Picalausa, F., Fletcher, G.H.L., Hidders, J., Vansummeren, S.: Storing and
indexing massive rdf datasets. In: De Virgilio, R., et al. (eds.) Semantic Search over
the Web, Data-Centric Systems and Applications, pp. 29–58. Springer, Heidelberg
(2012)
13. Milo, T., Suciu, D.: Index Structures for Path Expressions. In: Beeri, C., Buneman,
P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 277–295. Springer, Heidelberg (1998)
14. Neumann, T., Weikum, G.: Scalable join processing on very large RDF graphs. In:
SIGMOD, pp. 627–640 (2009)
15. Neumann, T., Weikum, G.: The RDF-3X engine for scalable management of RDF
data. VLDB J. 19(1), 91–113 (2010)
16. Picalausa, F., Vansummeren, S.: What are real SPARQL queries like? In: Proceed-
ings of the International Workshop on Semantic Web Information Management,
SWIM 2011, pp. 7:1–7:6. ACM, New York (2011)
17. Prud’hommeaux, E., Seaborne, A.: SPARQL query language for RDF. Technical
report, W3C Recommendation (2008)
18. Sidirourgos, L., et al.: Column-store support for RDF data management: not all
swans are white. Proc. VLDB Endow. 1(2), 1553–1563 (2008)
19. Tran, T., Ladwig, G.: Structure index for RDF data. In: Workshop on Semantic
Data Management, SemData@ VLDB (2010)
20. Udrea, O., Pugliese, A., Subrahmanian, V.S.: GRIN: A graph based RDF index.
In: AAAI, Vancouver, B.C., pp. 1465–1470 (2007)
21. van Glabbeek, R.J., Ploeger, B.: Correcting a Space-Efficient Simulation Algo-
rithm. In: Gupta, A., Malik, S. (eds.) CAV 2008. LNCS, vol. 5123, pp. 517–529.
Springer, Heidelberg (2008)
22. Weiss, C., Karras, P., Bernstein, A.: Hexastore: Sextuple Indexing for Semantic
Web Data Management. In: VLDB, Auckland, New Zealand (2008)
23. Wylot, M., Pont, J., Wisniewski, M., Cudré-Mauroux, P.: dipLODocus[RDF]—
Short and Long-Tail RDF Analytics for Massive Webs of Data. In: Aroyo, L.,
Welty, C., Alani, H., Taylor, J., Bernstein, A., Kagal, L., Noy, N., Blomqvist, E.
(eds.) ISWC 2011, Part I. LNCS, vol. 7031, pp. 778–793. Springer, Heidelberg
(2011)
24. Zou, L., Mo, J., Chen, L., Özsu, M.T., Zhao, D.: gStore: Answering SPARQL
queries via subgraph matching. Proc. VLDB Endow. 4(8), 482–493 (2011)
Domain Specific Data Retrieval on the Semantic Web
Tuukka Ruotsalo1,2,3
1 School of Information, University of California, Berkeley, USA
2 Department of Media Technology, Aalto University, Finland
3 Helsinki Institute for Information Technology (HIIT), Finland
tuukka.ruotsalo@aalto.fi
1 Introduction
Search engines have revolutionized the way we search and fetch information by being
able to automatically locate documents on the Web. Search engines are mostly used
to locate text documents that match queries expressed as a set of keywords. Recently,
the document centric Web has been complemented with structured metadata, such as
the Linked Open Data cloud (LOD) [3]. In such datasets structured and semantic data
descriptions complement the current Internet infrastructure through the use of machine-
understandable information provided as annotations [2]. Annotations are produced man-
ually in many organizations, but automatic annotation has also become mature enough
to work at Web scale [13]. As a result, we are witnessing an increasing amount of struc-
tured data being published on the Web.
Standards such as RDF(S) [5] and publishing practices for linked data have en-
abled seamless access to structured Web data, but the underlying collections remain
indexed using domain specific ontologies and schemas. In fact, such domain specific
structure is the underlying element empowering the Semantic Web. For example, the
data from cultural heritage providers is very different from the data published by scientific
literature publishers: it is indexed with different vocabularies and, in the end, serves dif-
ferent information needs. As a result, different data collections are being published as
linked open data and accessed on the Web, but each individual publisher can de-
cide on the semantics used to annotate the particular data collection. This imposes
specific challenges for retrieval methods operating on such a dataspace:
1. Structured object data. Search is targeted to objects or entities that are increasingly
described using a combination of structured information and free text descriptions.
For example, a tourist attraction could be described with information about the lo-
cation of the site as coordinate data, the categorization of the site through references
to a thesaurus or an ontology, and the description of the attraction in free text format.
2. Recall orientation. The subset of the linked data cloud identified as relevant for a specific
application is often limited in size. Data collections contain hundreds of thousands
or millions of objects, as opposed to billions in conventional Web search. This favors recall-
oriented retrieval methods.
3. Semantic gap between search and indexing vocabulary. Objects originate from domain-
specific curated collections and are described using expert vocabulary. For
example, a user searching for scientific objects inside a museum could be interested
in spheres, galvanometers, and optical instruments, but could use the terms "science"
and "object" to express her information need.
2 Retrieval Framework
We use a retrieval framework based on the VSM and extend it to utilize RDF triples as
indexing features. We show how indexing can be done for RDF triples, how cosine similarity
can be computed over such a data representation, and how reasoning and query expansion can
be incorporated into the retrieval framework.
We start with a retrieval method based on the well-known Vector Space Model of in-
formation retrieval [15]. We use metadata expressed as ontology-based annotations and
utilize RDF as a representation language. RDF describes data as triples, where each
triple value can be either a resource R or a literal L. The index can become infeasibly
large in terms of the size of the triple space if triples were used directly as indexing
features. Using the pure VSM of information retrieval, each document representation
would be a vector with one dimension for every deduced triple, the maximum dimensionality
being the number of possible triples in the domain, T ∈ R × R × (R ∪ L). It is well known that high dimensionality
often causes problems in similarity measurements and has been recognized to be prob-
lematic in ontology-based search [6,1]. This would hurt the performance of the VSM,
because many of the matching concepts would be the same in the tail of super concepts,
i.e. almost all documents would be indexed using the triples consisting of resources
appearing in the upper levels of the ontology hierarchies.
We reformulate the indexing of the documents and the triples in the deductive closure
of their annotations as vectors describing occurrences of each triple given the property
of the triple. Splitting the vector space based on the property is not a new idea and has
recently been used in RDF indexing [12,4]. An intuition behind this is that properties
often specify the point of view to the entity. For example, annotating Europe as a man-
ufacturing place or subject matter should lead to completely different weighting of the
resource, depending on the commonality of Europe as a subject matter or as a manufac-
turing place. In addition, properties are not expected to be used as query terms alone,
but only combined with either subjects or objects of the triple. For example, it is un-
likely that a user would express her information need by inserting a query to return all
documents with dc:subject in the annotations. However, a user could construct a query
that would request all documents with dc:subject having a value Europe. Literals are
treated separately from concepts. We tokenize literals to words and stem them using the
Porter stemming method. After this they are stored in the same vectors as the concepts.
In practice, the data is often described using a schema, where the subject of the triple
is the identifier for the entity being described, as in the data used in our experiments.
However, our indexing strategy enables indexing of arbitrary RDF graphs.
Accessing the correct vector space quickly in the query phase requires
an external index. For this purpose we define a posting list that maps a query to the index of the
correct vector space. We propose a model over possible vector spaces: one
for every possible property, and two additional vector spaces for subjects and objects.
From now on we refer to these actually indexed subjects and objects as concepts,
to avoid confusing them with the subject of an RDF triple. Every concept is indexed in a
vector space that records the occurrences of the concept in the annotations of a specific
entity. These vector spaces are referred to as y, and they form a set of vector spaces Y of
length x, i.e., Yx = {y1, ..., yx}.
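The per-property splitting can be illustrated with a short Python sketch; all names and the toy data are illustrative and not taken from the paper:

from collections import defaultdict

def build_vector_spaces(documents):
    """documents: {doc_id: [(s, p, o), ...]} -> {space_key: {doc_id: {term: count}}}.
    One vector space per property, plus a subject space and an object space."""
    spaces = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for doc_id, triples in documents.items():
        for s, p, o in triples:
            spaces[p][doc_id][o] += 1              # property-specific space
            spaces["__subject__"][doc_id][s] += 1  # extra space for subjects
            spaces["__object__"][doc_id][o] += 1   # extra space for objects
    return spaces

docs = {
    "doc1": [("doc1", "dc:subject", "aat:Astronomy"),
             ("doc1", "sm:placeOfCreation", "tgn:Germany")],
    "doc2": [("doc2", "dc:subject", "tgn:Europe")],
}
spaces = build_vector_spaces(docs)
print(sorted(spaces.keys()))
print(dict(spaces["dc:subject"]["doc1"]))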
This indexing strategy requires a large number of vector spaces, but the triple di-
mension of each matrix is lower because the maximum term length k for triples is
the number of resources and literals R, and for the document dimension only the
documents that have triples in the particular vector space are indexed. This avoids the
high dimensionality problem when computing similarity estimates.
2.2 Weighting
The purpose of the indexing strategy is not only to reduce the dimensionality to make
computation faster, but also to enable more accurate weighting by avoiding the prob-
lems caused by the high dimensionality. Intuitively, some of the triples are likely to
be much less relevant for the ranking than others. For example, matching a query only
based on a triple <rdf:Resource, rdf:Resource, rdf:Resource> will lead to a match to
all documents, but is meaningless for search purposes. On the other hand, a resource
Helsinki, should be matched to all documents indexed with resource Helsinki, but also
to the documents indexed with Europe, because they belong into the same deductive
closure, but with smaller weight. For this purpose we use tf-idf weighting over the re-
sources within a specific vector space. In normalized form tf is:
\[
tf_{i,j} = \left( \frac{N_{i,j}}{\sum_{k} N_{k,j}} \right)^{\frac{1}{2}}, \qquad (1)
\]
\[
idf_i = 1 + \log\!\left( \frac{N}{n_i + 1} \right), \qquad (2)
\]
where n_i is the number of documents in which the resource i appears within the specific
vector space, and N is the total number of documents in the system.
individual resource in a specific vector space is given by:
The tf-idf effect in triple-space is achieved based on the annotation mass on resources,
but also through reasoning. For example, the resource Europe is likely to have much
more occurrences in the index than the resource Finland, since the index contains the
deductive closures of the triples from annotations using resource identifiers also for
other European countries. This makes the idf value for resource Finland higher than for
resource Europe. A document annotated with resources Germany, France and Finland
would increase the tf value for the resource Europe, because through deductive reason-
ing Germany, France and Finland are a part of Europe. Naturally, the tf could also be
higher in case the document is annotated with several occurrences of the same resource,
for example as a result of an automatic annotation procedure based on text analysis [13].
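A Python sketch of this weighting within a single vector space, following Equations (1)–(2) and the assumed product above (the function and data structures are illustrative):

import math
from collections import defaultdict

def tf_idf_weights(space, total_docs):
    """space: {doc_id: {term: count}} within one vector space.
    Returns {doc_id: {term: weight}}."""
    df = defaultdict(int)                # n_i: documents in this space containing term i
    for terms in space.values():
        for term in terms:
            df[term] += 1
    weights = {}
    for doc_id, terms in space.items():
        total = sum(terms.values())
        weights[doc_id] = {}
        for term, count in terms.items():
            tf = math.sqrt(count / total)                      # Eq. (1)
            idf = 1 + math.log(total_docs / (df[term] + 1))    # Eq. (2), N = documents in the system
            weights[doc_id][term] = tf * idf                   # assumed Eq. (3)
    return weights

# continuing the toy structures from the previous sketch:
# weights = tf_idf_weights(spaces["dc:subject"], total_docs=len(docs))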
2.3 Ranking
In the vector model the triple vectors can be used to compute the degree of similar-
ity between each document d stored in the system and the query q. The vector model
evaluates the similarity between the vector representing an individual document Vdj
and a query Vq . We reformulate the cosine similarity to take into account a set of vector
spaces, one for each possible combination of triples given the models y ∈ Y, as opposed
to the classic VSM that would use only one vector space for all features. For this pur-
pose, we adopt the modified cosine similarity ranking formula used in Apache Lucene
open source search engine1 , where the normalization based on Euclidean distance is
replaced with a length norm and a coord-factor. The length norm is computed as:
\[
ln(V_{d_j,y}) = \frac{1}{\sqrt{nf}}, \qquad (4)
\]
where nf is the number of features present used to index the document dj in the vector
space y under interest. The coord-factor is computed as:
\[
cf(q, d_j) = \frac{mf}{k}, \qquad (5)
\]
where mf is the number of matching features in all vector spaces for document dj and
query q and k is the total number of features in the query.
In our use case, these have two clear advantages compared to the classic cosine sim-
ilarity. First, the use of the length norm gives more value to documents with fewer triple
occurrences within the vector space under interest. In our case this means that docu-
ments annotated with fewer triples within a particular vector space y get a relatively higher
similarity score. This is intuitive, because the knowledge base could contain manually
annotated documents with only a few triples and automatically annotated documents with
dozens of triples. In addition, some vector spaces can end up having more triples than
others, as a result of reasoning or more intense annotation. The number of matching
features in the query should also increase the similarity of the query and document. This
effect is captured by the coord-factor. We can now write the similarity as:
\[
sim(q, d_j) = cf(q, d_j) \cdot \sum_{y=1}^{x} \sum_{i=1}^{k} \left( w_{i,y_j} \cdot ln(V_{y_{d_j}}) \right), \qquad (6)
\]
where the dot product of the vectors now determines the weight w_{i,y_j} and is computed
across all vector spaces y. In this way the ranking formula enables several vector spaces
to represent a single document, because the length norm is computed for each vector space
separately. This can be used directly with our triple-space indexing.
The model approximates the importance of all the different combinations of y ∈ Y
separately. Intuitively, this is a coherent approach: the importance of a concept in the
domain is dependent on the use of the concept in a triple context. Note that our approach
does not normalize across the vector spaces. This favors matches in several vector spaces
over many matches in a single vector space. For example, a query with sev-
eral triples with the property dc:subject and a single triple with the property dc:creator
would favor documents that have matches for both dc:subject and dc:creator over documents that
have matches for only one of the properties.
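A Python sketch of the resulting scoring, assuming per-space document weights as produced above and representing the query as a mapping from vector-space keys to weighted terms (this representation is our own convention, not the paper's):

import math

def score(query, doc_weights_by_space, doc_id):
    """query: {space_key: {term: weight}}.
    doc_weights_by_space: {space_key: {doc_id: {term: weight}}}.
    Implements the modified cosine of Eq. (6): coord-factor times the sum, over
    vector spaces, of the dot product scaled by the per-space length norm."""
    total_query_features = sum(len(terms) for terms in query.values())
    matching, acc = 0, 0.0
    for space_key, q_terms in query.items():
        d_terms = doc_weights_by_space.get(space_key, {}).get(doc_id, {})
        if not d_terms:
            continue
        length_norm = 1.0 / math.sqrt(len(d_terms))       # Eq. (4)
        for term, q_w in q_terms.items():
            if term in d_terms:
                matching += 1
                acc += q_w * d_terms[term] * length_norm
    coord = matching / total_query_features if total_query_features else 0.0  # Eq. (5)
    return coord * acc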
1 The features of the similarity computation that are not used in our method and experiments are omitted. The full description of the original ranking formula can be found at http://lucene.apache.org/
The adaptation of the vector space model that we presented in the earlier section as-
sumes the existence of document vectors that can then be stored in separate vector
spaces. RDF(S) semantics enable deductive reasoning on the triple space. Using such
information in the indexing phase is often called document expansion. This means that
the document vectors are constructed based on the triple-space resulting from a deduc-
tive reasoning process.
For example, an annotation with the object Paris could be predicated by different
properties. One document could be created in Paris while another document could have
Paris as a subject matter. Through deductive reasoning both of these annotation triples
are deduced to a triple, where the property pointing to the concept Paris is rdf:Resource.
In a similar way, the concept Paris can be deduced through subsumption reasoning to
France, Europe, and so on.
If a search engine receives a query about Paris, it should not matter for the search
engine whether the user is interested in Paris in the role of subject matter or place of
creation. Therefore, the search engine should rank these cases equally based on only
the information that the documents are somehow related to Paris. In other words, based
on the triple, where the property is rdf:Resource. On the other hand, if the user specifies
an interest in Paris as a subject matter, the documents annotated in such way can be
ranked higher by matching them to a vector space of subject matters. This functionality
is already enabled using the vector space model for triple space by indexing deductive
closures along with the original triples.
Another way to improve the accuracy of the method is ontology-based query expan-
sion. While deductive reasoning provides logical deduction based on the relations avail-
able in the ontologies, the user can also be interested in other related documents. For
example, users interested in landscape paintings could also be interested in seascape
paintings, landscape photographs, and so on. These can be related in the ontology further
away, or through relationships that are not covered by the standard RDF(S) reasoning.
Ontologies can be very unbalanced, and depending on the concepts used in the anno-
tation, different levels of query expansion may be necessary. For example, a document
annotated with the concept Buildings may already be general enough and match
many types of buildings, while a document annotated with the concept Churches might
indicate a user's interest not only in churches, but also in other types of religious buildings.
Whether a concept is semantically close to another concept, and therefore a
good candidate for query expansion, can be approximated using its position in the
ontological hierarchy [14]. The more specific the concept is, the more expansion can be
allowed. We use the Wu-Palmer measure to estimate the importance of a resource (sub-
ject, predicate, and object separately) given the original resource in the query triple.
Formally, the Wu-Palmer measure for resources c and c′ is:
\[
rel_{WP}(c, c') = \frac{2\,l(s(c, c'), r)}{l(c, s(c, c')) + l(c', s(c, c')) + 2\,l(s(c, c'), r)}, \qquad (7)
\]
where l(c, c′) is a function that returns the smallest number of nodes on the path con-
necting c and c′ (including c and c′ themselves), s(c, c′) is a function that returns the
lowest common super-resource of resources c and c′, and r is the root resource of the
ontology.
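The measure can be computed directly from a concept hierarchy; below is a Python sketch under the assumption that the hierarchy is given as a child-to-parent map (the helper names and the toy hierarchy are illustrative):

def path_to_root(concept, parent):
    """List of nodes from `concept` up to the root, inclusive."""
    path = [concept]
    while concept in parent:
        concept = parent[concept]
        path.append(concept)
    return path

def wu_palmer(c1, c2, parent):
    p1, p2 = path_to_root(c1, parent), path_to_root(c2, parent)
    root = p1[-1]
    ancestors2 = set(p2)
    lcs = next(node for node in p1 if node in ancestors2)   # lowest common super-resource
    def l(x, y):
        # number of nodes on the path from x up to its ancestor y, including both ends
        return path_to_root(x, parent).index(y) + 1
    depth_lcs = l(lcs, root)
    return 2 * depth_lcs / (l(c1, lcs) + l(c2, lcs) + 2 * depth_lcs)   # Eq. (7)

parent = {"landscape paintings": "paintings", "seascape paintings": "paintings",
          "paintings": "visual works", "visual works": "objects"}
print(round(wu_palmer("landscape paintings", "seascape paintings", parent), 2))  # 0.6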
The resources having a Wu-Palmer value above a certain threshold are selected for
query expansion. We construct all triples that are possible based on the
resources determined by the Wu-Palmer measure and select the most general ones as
the expanded triples used in the actual similarity computation. This means that
all subjects, properties, and objects of any triple in the query are expanded by taking all
permutations of the resources in the resulting sets, and the most general combination is
selected. By the most general combination, we mean the triple that has the longest
subsumption distance from the original triple in terms of the expanded subject, pred-
icate, and object, each measured individually. This also removes possible redundancy among
the original query triples, such as the inclusion of one triple in another.
In case other relations are used in the expansion, all of the triples are included. In
other words, we include only the most general case in terms of subsumption, but include
related terms as new triples. The rationale behind including only the most general triple
is that including all possible super-triples could lead to a substantial amount of matching
triples and may hurt the accuracy of the similarity computation.
The Wu-Palmer measure can be used to dynamically control the query expansion
level over an index of concepts that forms a tree. Such a tree can be constructed in
many different ways. A trivial case is to use only subsumption hierarchies, a semanti-
cally coherent taxonomy of concepts. However, ontologies also enable other relations
to be used in query expansion. We refer to different combinations of such relations as the
query expansion strategy.
We investigate the following query expansion strategies: related terms only, sub-
sumption only, and full expansion. The related terms only strategy means that a semantic clique
is formed based on the nodes directly related to the concept being expanded (arc
distance of one), but no subsumption reasoning is used. The subsumption only strategy means
that the query is expanded using transitive reasoning in subsumption hierarchies; that is,
additional query expansion to concepts other than those in the deductive clo-
sure can be done only using subsumption hierarchies. Full expansion means that both
subsumption and related terms are used in expansion, and the tree index is built using
subsumption relations and the related terms of each concept reached through subsumption.
Related terms are not treated as transitive.
For example, using only the subsumption hierarchies, we could deduce the informa-
tion that the concept ”landscape paintings” is related to its superconcept ”paintings”
and through that to the concept ”seascape paintings”, because they have a common su-
perconcept. Using the full expansion we could obtain an additional information that
”seascape paintings” is further related to ”seascapes”, ”marinas” and so on.
3 Experiments
3.2 Data
We used a dataset in the domain of cultural heritage, where the documents have high
quality annotations. The dataset consists of documents that describe museum items,
including artwork, fine arts and scientific instruments, and points of interest, such as
visiting locations, statues, and museums. The data was obtained from the Museo Galileo
in Florence, Italy, and from Heritage Malta. The document annotations utilize the
Dublin Core properties and required extensions for the cultural heritage domain, such
as material, object type, and place of creation of the document described. An example
annotation of a document describing a scientific instrument from the Museo Galileo is
shown in Figure 1.
<dc:identifier> <urn:imss:instrument:402015> .
<sm:physicalLocation> <http://www.imss.fi.it/> .
<dc:title> "Horizontal dial" .
<dc:subject> "Measuring time" .
<dc:description> "Sundial, complete with gnomon..." .
<dc:subject> <aat:300054534> . (Astronomy)
<sm:dateOfCreation> <sm:time_1501_1600> . (16th Century)
<sm:material> <aat:300010946> . (Gilt Brass)
<sm:objectType> <aat:300041614> . (Sundial)
<sm:placeOfCreation> <tgn:7000084> (Germany)
<sm:processesAndTechniques> <aat:300053789> . (Gilding)
<dc:terms/isPartOf> "Medici collections" .
<rdf:type> <sm:Instrument> .
Fig. 1. An example of the data used in the experiments. Subjects of the triples are all identifiers
of the resource being described and are therefore omitted. The description is shortened.
The documents are indexed with RDF(S) versions of Getty Vocabularies2 . The
RDF(S) versions of the Getty Vocabularies are lightweight ontologies that are trans-
formed to RDF(S) from the original vocabularies, where concepts are organized in
subsumption hierarchies and have related term relations. Geographical instances are
structured in meronymical hierarchies that represent geographical inclusion. Temporal
data is described using a hand crafted ontology that has concepts for each year, decade,
2 http://www.getty.edu/research/conducting_research/vocabularies/
century, and millennium organized in a hierarchy. Literal values are indexed in the VSM
as Porter stemmed tokenized words.
The query set consists of 40 queries that were defined by domain experts in the same
museums where the datasets were curated. Figure 2 shows two example queries, one for
astronomers and subject matter optics, and another for physicist Leopoldo Nobili and
subject matter of galvanometers, batteries and electrical engineering. Relevance assess-
ments corresponding to the query set were provided for a set of 500 documents in both
museums. Museum professionals provided relevance assessments for the dataset by as-
sessing each document as either relevant or not relevant, separately for each of the queries.
The dataset and relevance assessments were created specifically for this study. This
is a relatively large set of queries and relevance assessments for a one-off experiment, be-
cause recall is analyzed with full coverage by domain experts, meaning that all of
the documents were manually inspected against all of the queries. Pooling or automatic
pre-filtering was not used. This makes the relevance assessments highly reliable, avoids
bias caused by automatic pre-filtering, and takes into account all possible semantic rel-
evance, even non-trivial connections judged relevant by the domain experts.
The domain experts were asked to create queries typical for the domain, such that
the query set would also include non-trivial queries with respect to the underlying collection.
For example, a query containing the concept ”seascapes” was judged relevant also for
objects annotated with the concept ”landscape paintings”, and for objects annotated
with ”marinas”, ”boats”, ”harbors” and so on. The judges were allowed to inspect the
textual description in addition to the image of the objects when assessing relevance.
Fig. 2. An example of two sets of queries defined by experts in the Museo Galileo. The names-
paces dc and sm refer to the Dublin Core and a custom extension of the Dublin Core properties for
the cultural heritage domain, and aat to the Art and Architecture Thesaurus of the Getty Founda-
tion. The subject of each RDF triple is omitted, because it is rdf:Resource for these queries.
When a relevant document is not retrieved at all, the precision value in the above
equation is taken to be 0. For a single information need, the average precision corresponds
to the area under the precision-recall curve; the mean average precision (MAP) averages this value over the set of queries.
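For reference, assuming the standard definitions (our notation, not reproduced from the paper), with R_q the set of relevant documents for query q and P(d) the precision at the rank at which relevant document d is retrieved:

\[
AP(q) = \frac{1}{|R_q|} \sum_{d \in R_q} P(d), \qquad MAP = \frac{1}{|Q|} \sum_{q \in Q} AP(q).
\]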
4 Results
Figures 3 and 4 summarize the results. Figure 3 presents the precision-recall curves for each
method variant when no query expansion is used. Figure 4 presents the
same results for the best query expansion, determined by the Wu-Palmer cut-off that was
found to lead to the best MAP for each method variant; in other words, the best achieved
combination of indexing strategy and query expansion. The following main findings can be
observed. First, using the full indexing leads to the best overall performance. It performs
as well as subsumption indexing when a combination of query expansion and
reasoning is used, but outperforms the subsumption indexing at low recall levels and
in the case where no query expansion is used.
Fig. 3. Precision plotted on 11 recall levels for different reasoning and indexing strategies. No
query expansion is used. The values are averaged over the 40 queries.
Second, the subsumption indexing and
full indexing outperform related terms indexing in all tasks. The results show up to
76% improvement compared to a variation where no reasoning and query expansion
are used.
Both subsumption indexing and full indexing, which also uses subsumption,
perform clearly better than related term indexing. The performance is in-
creased by 0.15 (68%) in MAP compared to the baseline. The gains in performance
achieved using the subsumption and full indexing strategies imply that subsumption rea-
soning is the most important factor affecting the accuracy of the retrieval. The full indexing
strategy clearly outperforms the other strategies in overall performance and performs
best even at the lowest recall levels. An interesting finding is that the subsumption reason-
ing and indexing strategy performs worse than the related terms strategy at low recall
levels when reasoning is not complemented with query expansion.
Query expansion has a significant overall effect and, in addition to reasoning, is an
important factor affecting the accuracy of the retrieval. Query expansion increases the
accuracy by up to 0.16 (76%) in terms of MAP when the full expansion reasoning and index-
ing strategy is used. It is notable that the subsumption reasoning and indexing strategy
actually performs only on par with the baseline approach when no addi-
tional query expansion is used. This indicates that the combination of the correct indexing
strategy and query expansion is crucial to achieve optimal accuracy. Additional query
expansion using super concepts from the ontologies was found to be most effective when
using a Wu-Palmer cut-off value between 0.9 and 0.7. This corresponds to an expansion of
zero to three nodes in the ontology graph in addition to the standard RDF(S) reasoning.
Fig. 4. Precision plotted on 11 recall levels for different reasoning and indexing strategies. The
best combination of query expansion and reasoning is used. The values are averaged over the 40
queries.
The results show that the gold standard and queries favor recall-oriented
methods, which was expected in a domain-specific setting. For example,
the subsumption indexing strategy with the Wu-Palmer cut-off at 0.4 leads to equally
good performance as the cut-off 0.7, while cut-off values of 0.5 and 0.6 perform worse.
This indicates that extensive query expansion compensates for the better semantic ap-
proximation achieved by using related term relations together with subsumption reason-
ing. We believe this is due to the fact that our dataset consists of documents from a
relatively specific domain and a collection of only 1000 documents. In additional runs
we observed that the precision-recall curves exhibit different trade-offs when varying the
Wu-Palmer cut-off values. Using more query expansion increases recall, but does not
hurt precision as much as could be expected. Our conclusion is that our dataset
favors recall-oriented approaches without a serious precision trade-off. This may not
be the case in settings where data is retrieved from a data cloud that is linked to other
domains. Therefore, we believe that full expansion with mild query expansion leads to
the best overall performance.
In this paper, we propose an indexing and retrieval framework for structured Web data
to support domain specific retrieval of RDF data. The framework is computationally
feasible because it avoids the high dimensionality of the triple space in similarity com-
putation by using triple based indexing. We conducted a set of experiments to validate
the performance of the approach and combine different reasoning, indexing and query
expansion strategies. We show that ontology-based query expansion and reasoning im-
prove Semantic Web data retrieval and can be effectively used in our adap-
tation of the vector space model. We also provide empirical evidence for the
effect of a self-tuning query expansion method that is based on a metric measuring the
depth of the ontology graphs. The experimental evaluation of the framework led to the
following conclusions:
We conducted experiments that tested a number of different techniques and their com-
binations. However, the experimental setup leaves room for further research. While we
used two separate collections and queries from different annotators and institutions,
these were indexed using the same ontologies. The data used in the experiments is from
the cultural heritage domain and may not generalize to other more open domains.
We measured the performance of the methods against an expert-created gold standard on
a set of domain-specific annotations in the cultural heritage domain. The relevance as-
sessments are determined manually for the whole dataset, unlike in some other datasets
proposed for semantic search evaluation, such as the Semantic Search Workshop data
[9], where the relevance assessments were determined by assessing relevance for doc-
uments pooled from the top 100 results of each of the participating systems, and where queries
were very short and in text format. This ensures that our dataset enables measuring
recall and that all of the query-document matches, even non-trivial ones, are present. The set
of queries for which the relevance assessments were created is in the form of sets
of triples. This avoids the problems of query construction and disambiguation of the
terms, which means that we are able to measure the retrieval performance indepen-
dently of the user interface or initial query construction method. While we realize that
disambiguation and query construction are essential for search engines, we think that
they are problems of their own to be tackled by the Semantic Search community. Our
methods are therefore valid for information filtering scenarios and search scenarios that
can use novel query construction methods, such as faceted search, or query suggestion
techniques. Since the methods operate only on a numerical space of triples and implement
ranking independently of the specific RDF dataset, they could also be implemented as
a ranking layer on top of database management systems that support more formal query
languages such as SPARQL. Our experiments were run on a gold standard acquired
specifically for this study, which makes the results and the gold standard
highly reliable. However, we used a relatively small dataset of 1000 documents, which
makes the task recall-oriented. However, the methods themselves scale to large collec-
tions, because the indexing and retrieval framework does not make any assumptions
beyond the classic VSM and is able to limit the dimensionality of the VSM by
splitting the space separately for each property. Small collections are typical
in domain-specific search, and the results may not be directly comparable with results
obtained for other collections. While full query expansion with subsumption reasoning
works well for such a homogeneous dataset, this might not be true for more varied
datasets. This is because the best performance was achieved with a Wu-
Palmer cut-off value of 0.4, which allows traversing the supertree of a concept for several
nodes. This could hurt accuracy when applied to larger, precision-oriented datasets.
However, our results are a clear indication of the effectiveness of both query expansion
and reasoning.
Furthermore, ontologies are not the only source of semantic information. Our
method operates in a pure numerical vector space, which makes it possible to apply standard
dimensionality reduction and topic modeling methods that could reveal semantics
based on collection statistics. Since our experiments showed that maximal query expan-
sion using ontologies leads to the best retrieval accuracy, such methods are an interesting
future research direction.
Acknowledgements. The work was conducted under the Academy of Finland Grant
(135536), Fulbright Technology Industries of Finland Grant, and Finnish Foundation
for Technology Promotion Grant.
References
1. Agirre, E., Arregi, X., Otegi, A.: Document expansion based on wordnet for robust ir. In:
Proceedings of the 23rd International Conference on Computational Linguistics: Posters,
COLING 2010, pp. 9–17. Association for Computational Linguistics, Stroudsburg (2010)
2. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web: Scientific American. Scientific
American (May 2001)
3. Bizer, C., Heath, T., Berners-Lee, T.: Linked data - the story so far. Int. J. Semantic Web Inf.
Syst. 5(3), 1–22 (2009)
4. Blanco, R., Mika, P., Vigna, S.: Effective and Efficient Entity Search in RDF Data. In: Aroyo,
L., Welty, C., Alani, H., Taylor, J., Bernstein, A., Kagal, L., Noy, N., Blomqvist, E. (eds.)
ISWC 2011, Part I. LNCS, vol. 7031, pp. 83–97. Springer, Heidelberg (2011)
5. Brickley, D., Guha, R.V.: RDF vocabulary description language 1.0: RDF Schema W3C
recommendation. Recommendation, World Wide Web Consortium (February 10, 2004)
6. Castells, P., Fernandez, M., Vallet, D.: An adaptation of the vector-space model for ontology-
based information retrieval. IEEE Transactions on Knowledge and Data Engineering 19(2),
261–272 (2007)
7. Fazzinga, B., Gianforme, G., Gottlob, G., Lukasiewicz, T.: Semantic web search based on
ontological conjunctive queries. Web Semantics: Science, Services and Agents on the World
Wide Web 9(4), 453–473 (2011)
8. Férnandez, M., Cantador, I., López, V., Vallet, D., Castells, P., Motta, E.: Semantically en-
hanced information retrieval: An ontology-based approach. Web Semantics: Science, Ser-
vices and Agents on the World Wide Web 9(4), 434–452 (2011)
9. Halpin, H., Herzig, D., Mika, P., Blanco, R., Pound, J., Thompson, H., Duc, T.T.: Evaluat-
ing ad-hoc object retrieval. In: Proceedings of the International Workshop on Evaluation of
Semantic Technologies, Shanghai, China. CEUR, vol. 666 (November 2010)
10. Kiryakov, A., Popov, B., Ognyanoff, D., Manov, D., Kirilov, A., Goranov, M.: Semantic
Annotation, Indexing, and Retrieval. In: Fensel, D., Sycara, K., Mylopoulos, J. (eds.) ISWC
2003. LNCS, vol. 2870, pp. 484–499. Springer, Heidelberg (2003)
11. Ning, X., Jin, H., Jia, W., Yuan, P.: Practical and effective ir-style keyword search over se-
mantic web. Information Processing & Management 45(2), 263–271 (2009)
12. Pérez-Agüera, J.R., Arroyo, J., Greenberg, J., Iglesias, J.P., Fresno, V.: Using bm25f for
semantic search. In: Proceedings of the 3rd International Semantic Search Workshop, SEM-
SEARCH 2010, pp. 2:1–2:8. ACM, New York (2010)
13. Ruotsalo, T., Aroyo, L., Schreiber, G.: Knowledge-based linguistic annotation of digital cul-
tural heritage collections. IEEE Intelligent Systems 24(2), 64–75 (2009)
14. Ruotsalo, T., Mäkelä, E.: A comparison of corpus-based and structural methods on approx-
imation of semantic relatedness in ontologies. International Journal on Semantic Web and
Information Systems 5(4), 39–56 (2009)
15. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Communi-
cations of the ACM 18(11), 613–620 (1975)
16. Vallet, D., Fernández, M., Castells, P.: An Ontology-Based Information Retrieval Model. In:
Gómez-Pérez, A., Euzenat, J. (eds.) ESWC 2005. LNCS, vol. 3532, pp. 455–470. Springer,
Heidelberg (2005)
Exchange and Consumption of Huge RDF Data
1 Introduction
The amount and size of published RDF datasets has dramatically increased in the
emerging Web of Data. Publication efforts, such as Linked Open Data1 have “de-
mocratized” the creation of such structured data on the Web and the connection
between different data sources [7]. Several research areas have emerged along-
side this; RDF indexing and querying, reasoning, integration, ontology match-
ing, visualization, etc. A common Publication-Exchange-Consumption workflow
(Figure 1) is involved in almost every application in the Web of Data.
Publication. After RDF data generation, publication refers to the process of
making RDF data publicly available for diverse purposes and users. Besides RDF
publication with dereferenceable URIs, data providers tend to expose their data
as a file to download (RDF dump), or via a SPARQL endpoint, a service which
interprets the SPARQL query language [2].
Exchange. Once the consumer has discovered the published information, the
exchange process starts. Datasets are serialized in traditional plain formats (e.g.
RDF/XML [5], N3 [4] or Turtle [3]), and universal compressors (e.g. gzip) are
commonly applied to reduce their size.
Consumption. The consumer has to post-process the information in several
ways. Firstly, a decompression process must be performed. Then, the serialized
1 http://linkeddata.org/
[Fig. 1. The Publication-Exchange-Consumption workflow: RDF data are published (as dereferenceable URIs, RDF dumps, or via SPARQL endpoints/APIs), exchanged, and consumed for tasks such as indexing, reasoning/integration, and quality/provenance assessment.]
RDF must be parsed and indexed, obtaining a data structure more suitable for
tasks such as browsing and querying.
The scalability issues of this workflow arise in the following running exam-
ple. Let us suppose that you publish a huge RDF dataset like Geonames (112
million triples about geographical entities). Plain data take up 12.07 GB (in
Ntriples2 ), and compression should be applied. For instance, its gzip-compressed
dump takes 0.69 GB. Thus, compression is necessary for efficient exchange (in
terms of time) when managing huge RDF. However, after decompression, data
remain in a plain format and an intensive post-processing is required3 . Even
when data is shared through a SPARQL endpoint, some queries can return large
amounts of triples, hence the results must be compressed too.
Nowadays, the potential of huge RDF is seriously underexploited due to the
large space they take up, the powerful resources required to process them, and the
large consumption time. Similar problems arise when managing RDF in mobile
devices; although the amount of information could be potentially smaller, these
devices have more restrictive requirements for transmission costs/latency, and for
post-processing due to their inherent memory and CPU constraints [14]. A first
approach to lighten this workflow is a binary RDF serialization format, called
HDT (Header-Dictionary-Triples) [11], recently accepted as a W3C Member
Submission [6]. This proposal highlights the need to move beyond plain RDF
syntaxes towards a data-centric view. HDT modularizes the data and uses the skewed
structure of RDF graphs [9] to achieve compression. In practical terms, HDT-based
representations take up to 15 times less space than traditional RDF formats [11].
Whereas publication and exchange were partially addressed in HDT, the con-
sumption is underexploited; HDT provides basic retrieval capabilities which can
be used for limited resolution of SPARQL triple patterns. This paper revisits
these capabilities for speeding up consumption within the workflow above. We
propose i) to publish and exchange RDF serialized in HDT, and then ii) to per-
form a lightweight post-process (at consumption) enhancing the HDT represen-
tation with additional structures providing a full-index for RDF retrieval. The
resulting enhanced HDT representation (referred to as HDT-FoQ: HDT Focused on
Querying) enables the exchanged RDF to be directly queryable with SPARQL,
speeding up the workflow in several correlated dimensions:
2 http://www.w3.org/TR/rdf-testcases/#ntriples
3 Post-processing is the computation needed at consumption (parsing+indexing) before any query can be issued.
2 State-of-the-Art
Huge RDF datasets are currently serialized in verbose formats (RDF/XML [5],
N3 [4] or Turtle [3]), originally designed for a document-centric Web. Although
they compact some constructions, they are still dominated by a human-readable
view which adds an unnecessary overhead to the final dataset representation.
It increases transmission costs and delays final data consumption within the
Publication-Exchange-Consumption workflow.
Besides serialization, the overall performance of the workflow is determined
by the efficiency of the external tools used for post-processing and consum-
ing huge RDF. Post-processing transforms RDF into any binary representation
which can be efficiently managed for specific consumption purposes. Although
it is performed once, the amount of resources required for it may be prohibitive
for many potential consumers; this is especially significant for mobile devices with
limited computational configurations.
Finally, the consumption performance is determined by the mechanisms used
to access and retrieve RDF data. These are implemented around the SPARQL
[2] foundations and their efficiency depends on the performance yielded by RDF
indexing techniques. Relational-based solutions such as Virtuoso [10] are widely
accepted and used to support many applications consuming RDF. On the other
hand, some stores build indexes for all possible combinations of elements in RDF
(SPO, SOP, PSO, POS, OPS, OSP), allowing i) all triple patterns to be directly
resolved in the corresponding index, and ii) the first join step within a BGP to
be resolved through fast merge-join. Hexastore [18] performs a memory-based
implementation which, in practice, is limited by the space required to represent
and manage the index replication. RDF-3X [17] performs multi-indexing on a
disk-resident solution which compresses the indexes within B+ -trees. Thus, RDF-
3X enables the management of larger datasets at the expense of overloading
querying processes with expensive I/O transferences.
HDT is a binary serialization format which organizes RDF data in three logical
components. The Header includes logical and physical metadata describing the
RDF dataset and serves as an entry point to its information. The Dictionary
provides a catalog of the terms used in the dataset and maps them to unique
integer IDs. It enables terms to be replaced by their corresponding IDs and allows
high levels of compression to be achieved. The Triples component represents the
pure structure of the underlying graph after the ID replacement.
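As a rough illustration of this split (a simplified sketch, not the actual HDT encoding), the following maps RDF terms to integer IDs and rewrites triples as ID-triples; the example terms are invented.

# Simplified sketch of the Dictionary/Triples separation: terms are catalogued
# once and triples are rewritten over integer IDs. Real HDT additionally keeps
# separate ID ranges per role and compresses both components.

triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),   # invented example data
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:bob", "foaf:name", '"Bob"'),
]

dictionary = {}                             # term -> integer ID

def term_id(term):
    if term not in dictionary:
        dictionary[term] = len(dictionary) + 1
    return dictionary[term]

id_triples = [tuple(term_id(t) for t in triple) for triple in triples]

print(dictionary)    # the catalog of terms
print(id_triples)    # pure graph structure: [(1, 2, 3), (1, 4, 5), (3, 4, 6)]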
Publication and exchange processes are partially addressed by HDT. Although
it is a machine-oriented format, the Header gathers human-friendly textual meta-
data such as the provenance, size, quality, or physical organization (subparts and
their location). Thus, it is a mechanism to discover and filter published datasets.
In turn, the Dictionary and Triples partition mainly aims at efficient exchange;
it reduces the redundancy inherent in an RDF graph by isolating terms and
structure. This division has proved effective in RDF stores [17].
HDT allows different implementations for the dictionary and the triples. Besides
achieving compression, some implementations can be optimized to support native
data retrieval. The original HDT proposal [11] gains insights into this issue through
a triples implementation called Bitmap Triples (BT). This section firstly gives
basic notions of succinct data structures, and then revisits BT emphasizing how
these structures can allow basic consumption.
Succinct data structures [16] represent data using as little space as possible and
provide direct access. These savings allow them to be managed in faster levels of
the memory hierarchy, achieving competitive performance. They provide three
primitives (S is a sequence of length n from an alphabet Σ):
- ranka(S, i) counts the occurrences of a ∈ Σ in S[1, i].
- selecta(S, i) locates the position for the i-th occurrence of a ∈ Σ in S.
- access(S, i) returns the symbol stored in S[i].
In this paper, we make use of succinct data structures for representing sequences
of symbols. We distinguish between binary sequences (bitsequences) and general
sequences. i) Bitsequences are a special case drawn from Σ = {0, 1}. They can
be represented using n + o(n) bits of space while answering the three previous
primitives in constant time. ii) General sequences, drawn from larger alphabets Σ, can be
represented with wavelet trees, which answer the same primitives in O(log |Σ|) time.
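The following naive Python sketch (a plain list scan, with none of the n + o(n) space or constant-time guarantees) only illustrates what the three primitives return; the bitsequence happens to be the Bp used in the Bitmap Triples example below.

# Naive illustration of access/rank/select over a bitsequence (1-based positions).
# Succinct structures answer these queries in constant time within n + o(n) bits;
# here we simply scan the list.

B = [1, 0, 1, 0, 0]

def access(B, i):                # symbol stored in B[i]
    return B[i - 1]

def rank(B, a, i):               # occurrences of a in B[1, i]
    return sum(1 for b in B[:i] if b == a)

def select(B, a, i):             # position of the i-th occurrence of a in B
    count = 0
    for pos, b in enumerate(B, start=1):
        count += (b == a)
        if count == i:
            return pos
    raise ValueError("fewer than i occurrences of a")

assert access(B, 4) == 0
assert rank(B, 1, 3) == 2        # two 1-bits in B[1, 3]
assert select(B, 1, 2) == 3      # the second 1-bit is at position 3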
HDT describes Bitmap Triples (BT) as a specific triples encoding which represents
the RDF graph through its adjacency matrix. In practice, BT slices the matrix
by subject and encodes the predicate-object lists for each subject in the dataset.
Let us suppose that the triples below comprise all occurrences of subject s:
{(s, p1 , o11 ), · · · , (s, p1 , o1n1 ), (s, p2 , o21 ), · · · (s, p2 , o2n2 ), · · · (s, pk , oknk )}
These triples are reorganized into predicate-object adjacency lists as follows:
s → [(p1, (o11, · · · , o1n1)), (p2, (o21, · · · , o2n2)), · · · , (pk, (ok1, · · · , oknk))].
Each list represents a predicate, pj , related to s and contains all objects reach-
able from s through this predicate.
This transformation is illustrated in Figure 2; the ID-based triples represen-
tation (labeled as ID-triples) is firstly presented, and its reorganization in adja-
cency lists is shown on its right. As can be seen, adjacency lists draw tree-shaped
structures containing the subject ID in the root, the predicate IDs in the middle
level, and the object IDs in the leaves (note that each tree has as many leaves as
occurrences of the subject in the dataset). For instance, the right tree represents
the second adjacency list in the dataset, thus it is associated to the subject 2
(rooting the tree). In the middle level, the tree stores (in a sorted way) the three
IDs representing the predicates related to the current subject: 5,6, and 7. The
leaves comprise, in a sorted fashion, all objects related to the subject 2 through
a given predicate: e.g. objects 1 and 3 are reached through the path 2,6; which
means that the triples (2,6,1) and (2,6,3) are in the dataset.
BT implements a compact mechanism for modeling an RDF graph as a forest
containing as many trees as different subjects are used in the dataset. This
assumption allows subjects to be implicitly represented by considering that the
i-th tree draws the adjacency list related to the i-th subject. Moreover, two
integer sequences: Sp and So , are used for storing, respectively, the predicate and
the object IDs within the adjacency lists. Two additional bitsequences: Bp and Bo
(storing list cardinalities) are used for delimitation purposes. This is illustrated
on the right side of Figure 2. As can be seen, Sp stores the five predicate IDs
involved in the adjacency lists: {7, 8, 5, 6, 7} and Bp contains five bits: {10100},
which are interpreted as follows. The first 1-bit (in Bp [1]) means that the list for
the first subject begins at Sp [1], and the second 1-bit (in Bp [3]) means that the list
for the second subject begins at Sp [3]. The cardinality of the list is obtained by
subtracting the positions, hence the adjacency lists for the first and the second
subject contain respectively 3 − 1 = 2 and 6 − 3 = 3 predicates. The information
stored in So = {2, 4, 4, 1, 3, 4} and Bo = {111101} is similarly interpreted, but
note that adjacency lists, at this level, are related to subject-predicate pairs.
BT gives a practical representation of the graph structure which allows triples
to be sequentially listed. However, directly accessing the triples in the i-th list
would require a sequential search until the i-th 1-bit is found in the bitsequence.
Direct access (in constant time) to any adjacency list could be easily achieved
with a little spatial o(n) overhead on top of the original bitsequence sizes. It en-
sures constant time resolution for rank, select, and access, and allows efficient
primitive operations to be implemented on the adjacency lists:
– findPred(i): returns the list of predicates related to the subject i (Pi ), and
the position pos in which this list begins in Sp . This position is obtained as
pos = select1(Bp , i), and Pi is retrieved from Sp [pos, select1 (Bp , i + 1) − 1].
– filterPred(Pi,j): performs a binary search on Pi and returns the position
of the predicate j in Pi , or 0 if it is not in the list.
– findObj(pos): returns the list of objects (Opos) related to the subject-
predicate pair represented in Sp[pos]. This list is extracted from
So[select1(Bo, pos), select1(Bo, pos + 1) − 1].
– filterObj(Oj ,k): performs a binary search on Oj and returns the position
of the object k in Oj , or 0 if it is not in the list.
Table 1 summarizes how these primitives can be used to resolve some triple pat-
terns in SPARQL: (S,P,O), (S,P,?O), (S,?P,O), and (S,?P,?O). Let us sup-
pose that we perform the pattern (2,6,?) over the triples in Figure 2. BT firstly
retrieves (by findPred(2)) the list of predicates related to the subject 2: P2 =
{5, 6, 7}, and its initial position in Sp : posini = 3. Then, filterPred(P2,6) re-
turns posoff = 2 as the position of 6 in P2. This allows us to obtain the
position in which the pair (2,6) is represented in Sp: pos = posini + posoff − 1 =
3 + 2 − 1 = 4, since P2 starts in Sp[3] and 6 is the second element in this list.
Finally, findObj(4) is executed for retrieving the final result comprising the list
of objects O4 = {1, 3} related to the pair (2,6).
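The sketch below reproduces this resolution of (S,P,?O) in Python over the running example; it follows the convention used in the text, with 1-bits marking where each adjacency list begins, and ignores all compression aspects of BT. The names and the past-the-end convention in select are illustrative choices.

# Sketch of Bitmap Triples resolution for the pattern (S, P, ?O).
# Positions are 1-based; 1-bits in Bp/Bo mark where each adjacency list begins.

Sp = [7, 8, 5, 6, 7]             # predicate IDs, one list per subject
Bp = [1, 0, 1, 0, 0]
So = [2, 4, 4, 1, 3, 4]          # object IDs, one list per subject-predicate pair
Bo = [1, 1, 1, 1, 0, 1]

def select(B, a, i):
    """Position of the i-th occurrence of a in B (past-the-end if there is none)."""
    count = 0
    for pos, b in enumerate(B, start=1):
        count += (b == a)
        if count == i:
            return pos
    return len(B) + 1

def list_bounds(B, i):
    """Start/end positions (inclusive) of the i-th adjacency list."""
    return select(B, 1, i), select(B, 1, i + 1) - 1

def find_pred(i):                # predicates of subject i, plus their start position in Sp
    start, end = list_bounds(Bp, i)
    return Sp[start - 1:end], start

def filter_pred(preds, j):       # 1-based position of predicate j in the list, 0 if absent
    return preds.index(j) + 1 if j in preds else 0

def find_obj(pos):               # objects of the pair stored at Sp[pos]
    start, end = list_bounds(Bo, pos)
    return So[start - 1:end]

# Resolve (2, 6, ?O) as in the text.
preds, start = find_pred(2)              # preds = [5, 6, 7], start = 3
offset = filter_pred(preds, 6)           # 6 is the second predicate of the list
print(find_obj(start + offset - 1))      # pair (2,6) is at Sp[4] -> prints [1, 3]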
– findSubj(i): returns the list of subjects related to the predicate i and the
positions in which they occur in Wp . This operation is described in Algorithm
1. It iterates over all occurrences of the predicate i and processes one of them
for each step. It locates the occurrence position in Wp (line 3) and uses it
for retrieving the subject (line 4) which is added to the result set.
– filterSubj(i,j): checks whether the predicate i and the subject j are
related. It is described in Algorithm 2. It delimits the predicate list for the
j-th subject, and counts the occurrences of the predicate i up to posj (yielding oj) and
up to posj+1 (yielding oj+1). Iff oj+1 > oj, the subject and the predicate are related.
Hence, the wavelet tree contributes with a PS-O index which allows two addi-
tional patterns to be efficiently resolved (row BT+Wp in Table 2). Both (?S,P,?O)
and (?S,P,O) first perform findSubj(P) to retrieve the list of subjects related
to the predicate P. Then, (?S,P,?O) executes findObj for each retrieved subject
and obtains all objects related to it. In turn, (?S,P,O) performs a filterObj
for each subject to test if it is related to the object given in the pattern.
Let us suppose that, having the triples in Figure 2, we ask for all subjects and
objects related through the predicate 7: (?S,7,?O). findSubj(7) obtains the
list of two subjects related to the predicate (S = {1, 2}) and their positions in
Wp (pos={1,5}). The subsequent findObj(1) and findObj(5) return the lists of
objects {2} and {4}, respectively, representing the triples (1,7,2) and (2,7,4).
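Continuing the toy sketch above (and replacing the wavelet tree by a plain scan over the predicate sequence, which loses its efficiency but keeps the semantics), the predicate-based access path of (?S,P,?O) can be illustrated as follows; find_obj refers to the function from the previous sketch.

# Illustration of findSubj for the pattern (?S, P, ?O): locate the occurrences of
# the predicate in Wp (a wavelet tree does this with select; we simply scan), and
# map each occurrence position back to its subject with rank1 on Bp.

Wp = [7, 8, 5, 6, 7]             # same contents as Sp, used as the PS-O entry point
Bp = [1, 0, 1, 0, 0]

def rank(B, a, i):
    return sum(1 for b in B[:i] if b == a)

def find_subj(p):
    """Subjects related to predicate p and the positions of p in Wp."""
    positions = [i for i, q in enumerate(Wp, start=1) if q == p]
    subjects = [rank(Bp, 1, pos) for pos in positions]   # i-th list <-> i-th subject
    return subjects, positions

subjects, positions = find_subj(7)
print(subjects, positions)       # [1, 2] [1, 5], as in the example above
# find_obj(1) and find_obj(5) from the previous sketch then return [2] and [4].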
[Fig. 3. Final HDT-FoQ configuration for the running example: the subject bitsequence Bp = {1,0,1,0,0}, the predicate sequence Wp = {7,8,5,6,7} (encoded as a wavelet tree), the object level Bo = {1,1,1,1,0,1} and So = {2,4,4,1,3,4}, and the additional OP-Index with SoP = {4,1,4,3,5,2} and BoP = {1,1,1,1,0,0}.]

Algorithm 3. findPredSubj(i)
1: posi ← select1(BoP, i);
2: posi+1 ← select1(BoP, i + 1) − 1;
3: for (x = posi to posi+1) do
4:   ptr ← select1(Bo, BoP[x]);
5:   P[ ] ← access(Wp, ptr);
6:   S[ ] ← rank1(Bp, ptr)
7: end for
8: return P; S
5 Experimental Evaluation
This section studies the Publication-Exchange-Consumption workflow on a real-
world setup in which the three main agents are involved:
We first analyze the impact of using HDT as a basis for publication, exchange and
consumption within the studied workflow, and compare its performance with
respect to those obtained for the methods currently used in each process. Then,
we focus on studying the performance of HDT-FoQ as the querying infrastructure
for SPARQL: we measure response times for triple pattern and join resolution.
All experiments are carried out on a heterogeneous configuration of real-world
datasets of different sizes and from different application domains (Table 3). We
report “user” times in all experiments. The HDT prototype is developed in C++
and compiled using g++-4.6.1 -O3 -m64. Both the HDT library and a visual
tool to generate/browse/query HDT files are publicly available4.
However, size is the most important factor due to its influence on the subsequent
processes (Table 4). HDT+lzma is the best choice. It achieves highly-compressed
representations: for instance, it takes 2 and 3 times less space than lzma and gzip
for dbpedia. This spatial improvement determines exchange and decompression
(for consumption) times as shown in Tables 6 and 7.
On the one hand, the combination of HDT and lzma is the clear winner for
exchange because of its high-compressibility. Its transmission costs are smaller
than those of the other alternatives: it improves them by 10−20 minutes for the
largest dataset. On the other hand, HDT+gzip is the most efficient at decompres-
sion, but its improvement is not enough to make up for the time lost in exchange
with respect to HDT+lzma. However, its performance is much better than the one
achieved by universal compression over plain RDF. Thus, HDT-based publication
and its subsequent compression (especially with lzma) arises as the most efficient
choice for exchanging RDF within the Web of Data.
The next step focuses on making the exchanged datasets queryable for con-
sumption. We implement the traditional process, which relies on the indexing
of plain RDF through any RDF store. We choose three systems6 : Virtuoso (re-
lational solution), RDF3X (multi-indexing solution), and Hexastore (in-memory
solution). We compare their performance against HDT-FoQ, which builds addi-
tional structures on the HDT-serialized datasets previously exchanged.
Table 8 compares these times. As can be seen, HDT-FoQ excels for all datasets:
its time is between one and two orders of magnitude lower than that obtained
6 Hexastore has been kindly provided by the authors. http://www.openlinksw.com/ (Virtuoso), http://www.mpi-inf.mpg.de/~neumann/rdf3x/ (RDF-3X).
for the other techniques. For instance, HDT-FoQ takes 43.98 seconds to index
geonames, whereas RDF3X and Virtuoso use respectively 45 minutes and 5 hours.
It demonstrates how HDT-FoQ leverages the binary HDT representation to make
RDF quickly queryable through its retrieval functionality. Finally, it is worth
noting that Virtuoso does not finish the indexing for dbpedia after more than
1 day, and Hexastore requires a more powerful computational configuration for
indexing datasets larger than linkedMDB. This fact shows that we successfully
achieve our goal of reducing the amount of computation required by the consumer
to make the RDF obtained within the Web of Data queryable.
Overall Performance. This section comprises an overall analysis of the pro-
cesses above. Note that publication is decoupled from this analysis because it is
performed only once, and its cost is attributed to the data provider. Thus, we
consider the times for exchange and consumption. These times are shown in Table
9. It compares the time needed for a conventional implementation against that
of the HDT driven approach. We choose the most efficient configurations in each
case: i) Comp.RDF+Indexing comprises lzma compression over the plain RDF
representation and indexing in RDF3X, and ii) Comp.HDT+HDT-FoQ compresses
the obtained HDT with lzma and then obtains HDT-FoQ.
The workflow is completed between 10 and 15 times faster using the HDT
driven approach. Thus, the consumer can start using the data in a shorter time,
but also with a more limited computational configuration as reported above.
References
1. Compact Data Structures Library (libcds), http://libcds.recoded.cl/
2. SPARQL Query Language for RDF. W3C Recomm. (2008),
http://www.w3.org/TR/rdf-sparql-query/
Impact of Using Relationships between Ontologies
1 Introduction
From the users’ perspective, the most important aspect of Semantic Web Search
Engines (SWSEs) [7] is the ability to support the search for ontologies which
match their requirements. Indeed, finding ontologies is a complex and creative
process which requires a lot of intuition. In addition, it also requires manual
analyses of the content of candidate ontologies to choose the ones that are ad-
equate to their intended use. For these reasons, the automatic Ontology Selec-
tion process has been studied in several different contexts in recent years
[4,5,14,18,22,26,27,29,31] with the aim of improving the methods used to collect,
assess and rank candidate ontologies. However, the more user-centric Ontology
Search process, here defined as the activity of browsing the results from a SWSE
to identify the ontologies adequate to the search goal, has not been researched
extensively until now. (This work was funded by the EC IST-FF6-027595 NeOn Project.)
Such an activity is becoming crucial for the rapidly grow-
ing set of scenarios and applications relying on the reuse of existing ontologies.
Our view is that one of the issues hampering efficient ontol-
ogy search is that the results generated by SWSEs, such as Watson
(http://watson.kmi.open.ac.uk), Swoogle (http://swoogle.umbc.edu)
or Sindice (http://sindice.com), are not structured appropriately. These
systems return flat lists of ontologies, where ontologies are treated as if they were
independent of each other while, in reality, they are implicitly related. For
example, the query “Conference Publication” currently1 gives 218 ontologies as
a result in Watson. The first two pages of results list several items, including
http://lsdis.cs.uga.edu/projects/semdis/sweto/testbed_v1_1.owl,
http://lsdis.cs.uga.edu/projects/semdis/sweto/testbed_v1_3.owl and
http://lsdis.cs.uga.edu/projects/semdis/sweto/testbed_v1_4.owl,
which represent different versions of the same ontology (isPrevVersionOf).
Another common situation is when an ontology has been
translated into different ontology languages. This is the case for the first
(http://reliant.teknowledge.com/DAML/Mid-level-ontology.owl) and
second (http://reliant.teknowledge.com/DAML/Mid-level-ontology.daml)
results of the query “student university researcher ” in Watson or the sec-
ond (http://annotation.semanticweb.org/iswc/iswc.daml) and third
(http://annotation.semanticweb.org/iswc/iswc.owl) results of the same
query in Swoogle. These ontologies are obviously two different encodings of
the same model. Analogously, it is not hard to find ontologies connected
through other, more sophisticated relations such as different levels of similarity
(isLexicallySimilarTo, regarding the vocabulary, and isSyntacticallySimilarTo,
regarding the axioms), as well as the relationship between ontologies that
originate from the same provenance, as expressed through their URIs having
the same second level domain2 name (ComesFromTheSameDomain).3
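As a small illustration (an approximation for this paper, not Kannel's detection code), the ComesFromTheSameDomain relation boils down to comparing the second-level domains of two ontology URIs:

# Toy check of the ComesFromTheSameDomain relation: two ontologies are related
# when their URIs share the same second-level domain. This is an illustrative
# approximation, not the implementation used by Kannel.

from urllib.parse import urlparse

def second_level_domain(uri):
    host = urlparse(uri).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def comes_from_same_domain(uri1, uri2):
    return second_level_domain(uri1) == second_level_domain(uri2)

print(comes_from_same_domain(
    "http://reliant.teknowledge.com/DAML/Mid-level-ontology.owl",
    "http://reliant.teknowledge.com/DAML/Mid-level-ontology.daml"))   # True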
It is our view that the failure of these systems to provide structured views of
the result space hampers the ontology search process, as the result space becomes
unnecessarily large and full of redundancies. Hence, we have been investigating
the hypothesis that making explicit the relations between ontologies and using
them to structure the results of a SWSE system would support a more efficient
ontology search process. In previous publications [1,3], we presented a software
framework, Kannel, which is able to detect and make explicit relationships
between ontologies in large ontology repositories. In this paper we present a
comparative study evaluating the improvement brought by extending a SWSE
(Watson) by making explicit the relationships between ontologies and presenting
them in the results of the ontology search task (the Watson+Kannel system).
To this purpose, we have used a task-oriented and user-centred approach [20].
1 That is, on 16/12/2011.
2 A second-level domain (SLD) is a domain that is directly below a top-level domain such as com, net and org; see http://en.wikipedia.org/wiki/Second-level_domain
3 The formal definitions of the above ontology relations are reported in [1,3].
2 Related Work
The discussion of the related work follows two main directions. The first is related
to the process of developing Semantic Web Search Engine systems, while the
second concerns different types of relationships between ontologies that have
been studied in the literature. Most of the research work related to the ontology
search task concerns the development of SWSE systems [7], including: Watson
[8], Sindice [28], Swoogle [11], OntoSelect [4], ontokhoj [5] and OntoSearch [32].
All these systems have the aim of collecting and indexing ontologies from the
web and providing, based on keywords or other inputs, efficient mechanisms to
retrieve ontologies and semantic data. To the best of our knowledge, there is
no study regarding the comparison of the above ontology search engines. The
most common issues addressed by these systems are Ontology Selection – how to
automatically identify/select the set of relevant ontologies from a given collection
[26,27] – and Ontology Evaluation – how to assess the quality and relevance of
an ontology [14,27]. Several studies have contributed to the solution of both the
above problems, including approaches to ranking ontologies [18] and to select
appropriate ontologies [22,29,31]. These works focus on the mechanisms required
to support SWSE systems in automatically identifying ontologies from their
collections and presenting them in a ranked list to the users. However, Ontology
Search, as the activity of using a SWSE to find appropriate ontologies, has not
been considered before from a user-centric point of view. Furthermore, we can
find in the literature many works related to the field of Search Engine Usability and
how humans interact with search engines [17], but such studies have not yet been
applied to SWSEs.
Ontologies are not isolated artefacts: they are, explicitly or implicitly, re-
lated to each other. Kleshchev [21] characterised, at a very abstract level, a
number of relations between ontologies such as sameConceptualisation, Resem-
blance, Simplification and Composition, without providing formal definitions for
them, and without providing mechanisms to detect them. Heflin [19] was the first
to study formally some of the different types of links between ontologies, fo-
cusing on the crucial problems of versioning and evolution. Although these
links are available in OWL, one of the most widely used web ontology languages,
they are rarely used [2]. Several approaches have focused on the comparison
of different versions of ontologies in order to find the differences [16]. In par-
ticular, PromptDiff [25] compares the structure of ontologies and OWLDiff
(http://semanticweb.org/wiki/OWLDiff) computes the differences by entail-
ment checking over the two sets of axioms. SemVersion [30] compares two ontologies
and computes the differences at both the structural and the semantic levels.
Gangemi et al. [15] defined ontology integration as the construction of an on-
tology C that formally specifies the union of the vocabularies of two other on-
tologies A and B. The most interesting case is when A and B commit to the
conceptualisation of the same domain of interest or of two overlapping domains.
In particular, A and B may be related by being alternative ontologies, truly over-
lapping ontologies, equivalent ontologies with vocabulary mismatches, overlapping
ontologies with disjoint domain, or homonymically overlapping ontologies. There
also exists an extensive collection of works, including [10,12,13,33], that propose
formal definitions of the ontology mapping concept. Most of them formalise map-
pings between concepts, relations and instances of two ontologies, to establish
an alignment between them, while we focus on relationships between whole on-
tologies. Finally, studies have targeted ontology comparison in order to identify
overlaps between ontologies [23] and many measures exist to compute particular
forms of similarity between ontologies [9].
All these studies discuss particular relations separately. While they contribute
interesting elements for us to build on, we focus here on assessing the impact of
providing various ontology relations to users of SWSE systems.
3 Systems Used
Kannel is an ontology-based framework for detecting and managing relation-
ships between ontologies for large ontology repositories. Watson is a gateway
to the Semantic Web that collects, analyses and gives access to ontologies and
semantic data available online. These two systems have already been detailed
in [3,8] respectively. Therefore, in this section we only describe the integration
of Watson with Kannel’s features to explain in more detail how Kannel is
used on top of Watson’s repository and integrated into its interface.
Watson+Kannel4 is an extension of Watson where Watson’s ontology space
has been processed by Kannel to detect implicit relationships between on-
tologies (similarity, inclusion, versioning, common provenance). In addition, two
relationship-based mechanisms were added to the Watson ontology search inter-
face (see Fig. 1). They are:
4 The Watson+Kannel integration can be tested at http://smartproducts1.kmi.open.ac.uk:8080/WatsonWUI-K.
4 Methodology
The general aim of this study is to evaluate whether the inclusion of ontology
relationships to structure the results of a SWSE system, as described in the pre-
vious section, improves ontology search. In particular, we consider the following
two major aspects to be evaluated:
4.1 Participants
Sixteen members of the Open University, from PhD students to senior researchers,
participated in the evaluation. They were randomly divided into two groups and
asked to perform three ontology search tasks to the best of their ability6 using
5 The available links are based on the similarity, versioning, common provenance and inclusion relationships as they are described in [1,3].
6 In this work, we considered the tasks to be successfully achieved when the users were satisfied with the ontologies they found.
Fig. 2. Profiles of the participants in the two groups W and W+K. Answers to the
corresponding questions could range from 0 to 5. The average is shown.
4.2 Tasks
Each individual participant was asked to realise three different ontology search
tasks that are described (as presented to the participants) below:
Task 2 - Annotation. Consider the two links provided7 . They are both about
the same domain, which is books. Consider them as webpages to which you
want to add semantic annotations based on ontologies. To achieve this goal,
you need to find ontologies using Watson (or Watson+Kannel) that can
be used to annotate the above web pages.
The data for evaluation is collected from two main sources: questionnaires
and videos. Regarding the first: two main questionnaires were designed for this
evaluation. One, regarding the background of users, was filled in by the partic-
ipants before realising the tasks (cf. participant profiles in Fig. 2). The other
one, filled in after realising the tasks, included questions regarding the user’s
satisfaction and confidence in the results obtained for the three ontology search
tasks. Questions in this second questionnaire asked how users felt they succeeded
with the tasks, how confident they were about having explored a significant part
of the relevant ontology space and their overall opinion about the ability of the
tool to support them in the tasks. Videos capturing the screen of participants as
they realised the given ontology tasks were used to collect concrete information
regarding the performance of users in these tasks. Analysing the videos, we were
able to measure the average time taken for each task, the number of pages visited
and the number of ontologies inspected.
7 http://www.amazon.co.uk/Shockwave-Rider-John-Brunner/dp/0345467175/ref=sr_1_1?ie=UTF8&qid=1284638728&sr=1-1-spell and http://www.booksprice.com/compare.do?inputData=top+gear&Submit2.x=0&Submit2.y=0&Submit2=Search&searchType=theBookName.
5 Results
The results of our evaluation are presented from both the users’ efficiency and
satisfaction perspectives. We also discuss how the different ontology relationships
included in Watson+Kannel were used to support ontology search in the
W+K group.
The diagrams in Fig. 3 show the main results regarding the efficiency of the
two groups (W and W+K) with respect to the three following measures:
Time is the time in minutes taken to realise the task, i.e., between the beginning
of the session, until the user was satisfied with the results obtained. This is
the most obvious way to assess the performance of users in ontology search.
Page is the number of pages of results a user would have viewed in order to
realise a task. This gives an indication of the effort required in browsing the
results of the SWSE to identify relevant ontologies.
Link is the number of links followed to realise the task, which corresponds to
the number of ontologies being inspected.
It clearly appears in Fig. 3 that our hypothesis (that including ontology rela-
tionships in the results of an ontology search system would make the ontology
search task more efficient) has been confirmed. Indeed, for the indicator Time
and Page, the differences between the W and W+K groups show a significant
improvement (taking into account the three tasks, the t-test result for Time
was 0.017 and for Page was 0.013, significant at the α = 0.05 level). While we can
observe a difference for the Link indicator, this difference was not shown to be
statistically significant (T-test result was 0.27). For example, in Task 3, it typ-
ically took 2.5 minutes less to achieve the task when using Watson+Kannel,
and required inspecting only half as many result pages and two thirds of the links
compared to when using Watson only.
It is worth noticing however that there are significant discrepancies in the
results obtained for the three different tasks as shown in Fig. 3. In particular, it
appears that the differences between W and W+K are less significant in Task 1.
One of the possible explanations for this phenomenon is that it took some time
for participants in group W+K to explore and learn the features provided by
Watson+Kannel that were not present in Watson. To support this interpre-
tation, we analysed the videos corresponding to the group W+K to determine
to what extent the features provided by Kannel were used in the three ontol-
ogy search tasks. As shown in Fig. 4, the features of Kannel (especially, the
ontology relation links) were used significantly less for Task 1 than they were for
Tasks 2 and 3 (see charts A and B). It appears that, after Task 1, users learnt
to use the ontology relation mechanisms provided by Watson+Kannel more
efficiently (see charts C – regarding the Link mechanism – and D – regarding
the GroupBy mechanism).
Fig. 3. Performance profiles for the three ontology search tasks, regarding the Time,
Page and Link indicators. Each ’box’ represents the median (black line), the quartiles
(top and bottom of the box), min and max values for each indicator, in each group
for each task. Grey boxes correspond to the profiles of the W group, white boxes
correspond to the W+K group.
Fig. 4. Use of the ontology relation features in Watson+Kannel by group W+K. (A)
shows the average number of ontology relation links followed and the number of times
the “group by” mechanism was applied in each task; (B) shows how many participants
used these mechanisms in each task. The diagrams (C) and (D) show the distributions
of the number of uses (number of times a feature is used – x axis – by number of users
– y axis) for Link and GroupBy respectively.
Fig. 5. Median answers to the 5 user satisfaction questions in the two evaluation groups
Fig. 6. Use of the 6 relations by the W+K group over the three ontology search tasks
In addition to the users’ efficiency and satisfaction, we briefly discuss the extent
to which different relations were used to support ontology search. Measures of
the use of the Link and GroupBy functions in the Watson+Kannel system
over the three tasks in group W+K are summarised in the charts of Fig. 6.
they reach stability or adapt to changes in the domain. Thus, once we have
detected the links between different versions of ontologies, it becomes possible
to explore how such ontologies evolve on the Semantic Web, in particular with
the aim of discovering relevant high level ontology evolution patterns, which can
be used to focus ontology search around notions such as ‘stability’ and ‘activ-
ity’. Finally, from a practical point of view and as part of our broader work on
building a framework for the management of relationships between ontologies
(see e.g., [3]), one of our future directions of research is to extend the set of
relationships between ontologies that can be considered by our system. One of
the most interesting aspects here concerns providing mechanisms to explore not
only single, atomic relations between ontologies, but also the relations derived
from the combination of others (e.g., compatibility and disagreement [6]).
References
1. Allocca, C.: Making explicit semantic relations between ontologies in large ontology
repositories. PhD Symposium at the European Semantic Web Conference (2009)
2. Allocca, C.: Automatic Identification of Ontology Versions Using Machine Learning
Techniques. In: Antoniou, G., Grobelnik, M., Simperl, E., Parsia, B., Plexousakis,
D., De Leenheer, P., Pan, J. (eds.) ESWC 2011, Part I. LNCS, vol. 6643, pp.
352–366. Springer, Heidelberg (2011)
3. Allocca, C., d’Aquin, M., Motta, E.: Door - towards a formalization of ontology
relations. In: Proc. of the Inter. Conf. on Knowledge Engineering and Ontology
Development, KEOD, pp. 13–20 (2009)
4. Buitelaar, P., Eigner, T., Declerck, T.: Ontoselect: A dynamic ontology library
with support for ontology selection. In: Proceedings of the Demo Session at the
International Semantic Web Conference (2004)
5. Chintan, P., Kaustubh, S., Yugyung, L., Park, E.K.: OntoKhoj: A semantic web
portal for ontology searching, ranking, and classification. In: Proc. 5th ACM Int.
Workshop on Web Information and Data Management, New Orleans, Louisiana,
USA, pp. 58–61 (2003)
6. d’Aquin, M.: Formally measuring agreement and disagreement in ontologies. In:
K-CAP, pp. 145–152. ACM (2009)
7. d’Aquin, M., Ding, L., Motta, E.: Semantic web search engines. In: Domingue,
J., Fensel, D., Hendler, J.A. (eds.) Handbook of Semantic Web Technologies, pp.
659–700. Springer, Heidelberg (2011)
8. d’Aquin, M., Motta, E.: Watson, more than a semantic web search engine. Semantic
Web Journal 2(1), 55–63 (2011)
9. David, J., Euzenat, J.: Comparison between Ontology Distances (Preliminary Re-
sults). In: Sheth, A.P., Staab, S., Dean, M., Paolucci, M., Maynard, D., Finin,
T., Thirunarayan, K. (eds.) ISWC 2008. LNCS, vol. 5318, pp. 245–260. Springer,
Heidelberg (2008)
10. David, J., Euzenat, J., Šváb-Zamazal, O.: Ontology Similarity in the Alignment
Space. In: Patel-Schneider, P.F., Pan, Y., Hitzler, P., Mika, P., Zhang, L., Pan, J.Z.,
Horrocks, I., Glimm, B. (eds.) ISWC 2010, Part I. LNCS, vol. 6496, pp. 129–144.
Springer, Heidelberg (2010)
11. Ding, L., Finin, T., Joshi, A., Pan, R., Cost, R.S., Peng, Y., Reddivari, P., Doshi,
V.C., Sachs, J.: Swoogle: A Search and Metadata Engine for the Semantic Web. In:
Proc. of the 13th ACM Conf. on Information and Knowledge Management. ACM
Press (November 2004)
12. Ehrig, M.: Ontology Alignment: Bridging the Semantic Gap. Semantic Web and
Beyond, vol. 4. Springer, New York (2007)
13. Euzenat, J.: Algebras of Ontology Alignment Relations. In: Sheth, A.P., Staab, S.,
Dean, M., Paolucci, M., Maynard, D., Finin, T., Thirunarayan, K. (eds.) ISWC
2008. LNCS, vol. 5318, pp. 387–402. Springer, Heidelberg (2008)
14. Gangemi, A., Catenacci, C., Ciaramita, M., Lehmann, J.: A theoretical framework
for ontology evaluation and validation. In: Proceedings of SWAP 2005, the 2nd
Italian Semantic Web Workshop, Trento, Italy, December 14-16. CEUR Workshop
Proceedings (2005)
15. Gangemi, A., Pisanelli, D.M., Steve, G.: An overview of the onions project: Ap-
plying ontologies to the integration of medical terminologies. Technical report.
ITBM-CNR, V. Marx 15, 00137, Roma, Italy (1999)
16. Gonçalves, R.S., Parsia, B., Sattler, U.: Analysing multiple versions of an ontology:
A study of the nci thesaurus. In: Description Logics (2011)
17. Gordon, M., Pathak, P.: Finding information on the World Wide Web: the retrieval
effectiveness of search engines. Information Processing & Management 35(2), 141–
180 (1999)
18. Alani, H., Brewster, C., Shadbolt, N.: Ranking Ontologies with AKTiveRank. In:
Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M.,
Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 1–15. Springer, Heidelberg
(2006), http://eprints.ecs.soton.ac.uk/12921/01/iswc06-camera-ready.pdf
19. Heflin, J., Pan, Z.: A Model Theoretic Semantics for Ontology Versioning. In: McIl-
raith, S.A., Plexousakis, D., van Harmelen, F. (eds.) ISWC 2004. LNCS, vol. 3298,
pp. 62–76. Springer, Heidelberg (2004)
20. Ingwersen, P.: Information Retrieval Interaction. Taylor Graham (1992)
21. Kleshchev, A., Artemjeva, I.: An analysis of some relations among domain ontolo-
gies. Int. Journal on Inf. Theories and Appl. 12, 85–93 (2005)
22. Lozano-Tello, A., Gómez-Pérez, A.: ONTOMETRIC: A Method to Choose the
Appropriate Ontology. Journal of Database Management 15(2) (April-June 2004)
23. Maedche, A., Staab, S.: Comparing ontologies-similarity measures and a compari-
son study. In: Proc. of EKAW 2002 (2002)
24. Motta, E., Mulholland, P., Peroni, S., d’Aquin, M., Gomez-Perez, J.M., Mendez, V.,
Zablith, F.: A Novel Approach to Visualizing and Navigating Ontologies. In: Aroyo,
L., Welty, C., Alani, H., Taylor, J., Bernstein, A., Kagal, L., Noy, N., Blomqvist,
E. (eds.) ISWC 2011, Part I. LNCS, vol. 7031, pp. 470–486. Springer, Heidelberg
(2011)
25. Noy, N.F., Musen, M.A.: Promptdiff: A fixed-point algorithm for comparing ontol-
ogy versions. In: 18th National Conf. on Artificial Intelligence, AAAI (2002)
26. Sabou, M., Lopez, V., Motta, E.: Ontology Selection for the Real Semantic Web:
How to Cover the Queen’s Birthday Dinner? In: Staab, S., Svátek, V. (eds.) EKAW
2006. LNCS (LNAI), vol. 4248, pp. 96–111. Springer, Heidelberg (2006)
27. Sabou, M., Lopez, V., Motta, E., Uren, V.: Ontology selection: Ontology evaluation
on the real semantic web. In: 15th International World Wide Web Conference
(WWW 2006), Edinburgh, Scotland, May 23-26 (2006)
28. Tummarello, G., Delbru, R., Oren, E.: Sindice.com: Weaving the Open Linked
Data. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B.,
Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux,
P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 552–565. Springer,
Heidelberg (2007),
http://dblp.uni-trier.de/db/conf/semweb/iswc2007.html#TummarelloDO07
29. Hong, T.-P., Chang, W.-C., Lin, J.-H.: A Two-Phased Ontology Selection Approach
for Semantic Web. In: Khosla, R., Howlett, R.J., Jain, L.C. (eds.) KES 2005. LNCS
(LNAI), vol. 3684, pp. 403–409. Springer, Heidelberg (2005)
30. Volkel, M.: D2.3.3.v2 SemVersion Versioning RDF and Ontologies. EU-IST Net-
work of Excellence (NoE) IST-2004-507482 KWEB
31. Xiaodong, W., Guo, L., Fang, J.: Automated ontology selection based on descrip-
tion logic. In: CSCWD, pp. 482–487 (2008)
32. Zhang, Y., Vasconcelos, W., Sleeman, D.H.: Ontosearch: An ontology search engine.
In: Bramer, M., Coenen, F., Allen, T. (eds.) SGAI Conf., pp. 58–69. Springer,
Heidelberg (2004)
33. Zimmermann, A., Krötzsch, M., Euzenat, J., Hitzler, P.: Formalizing ontology
alignment and its operations with category theory. In: Proceeding of the 4th Inter.
Conf. on Formal Ontology in Information Systems, FOIS, pp. 277–288. IOS Press
(2006)
Enhancing OLAP Analysis with Web Cubes
1 Introduction
Business intelligence (BI) comprises a collection of techniques used for extract-
ing and analyzing business data, to support decision-making. Decision-support
systems (DSS) include a broad spectrum of analysis capabilities, from simple re-
ports to sophisticated analytics. These applications include On-Line Analytical
Processing (OLAP) [9], a set of tools and algorithms for querying large mul-
tidimensional databases usually called data warehouses (DW). Data in a DW
come from heterogeneous and distributed operational sources, and go through
a process, denoted ETL (standing for Extraction, Transformation, and Load-
ing). In OLAP, data are usually perceived as a cube. Each cell in this data cube
contains a measure or set of measures representing facts and contextual informa-
tion (the latter called dimensions). For some data-analysis tasks (e.g., worldwide
price evolution of a certain product), the data contained in the DSS do not suf-
fice. External data sources (like the web) can provide useful multidimensional
information, although usually too volatile to be permanently stored in the DW.
We now present, through a use case, the research problems that appear in this
scenario, and our approach for a solution to some of them.
Fig. 1. Multidimensional sales report (Time: Q3-2011)

                                     November 2011                    December 2011
Product                 Locat.  profit #sales unitPrice unitCost  profit #sales unitPrice unitCost
Canon T3i  Kit 18-55    NJ         870     10      870      783     736      8      875      783
                        NY        1044     12      870      783     609      7      870      783
           Body only    NJ         340      4      850      765     375      5      840      765
                        NY         425      5      850      765     300      4      840      765
                        WA        1020     12      850      765     510      6      850      765
      T3   Kit 18-55    CA         780     13      460      400     480      8      460      400
                        NJ        1200     15      480      400     560      7      480      400
Nikon D3100 Kit 18-55   CA         630     10      610      547     945     15      610      547
                        NY         732     12      608      547     366      6      608      547
                        WA         340      5      615      547     189      3      610      547
           Kit 55-200   NY         750      6      725      600    1500     12      725      600
           Body only    CA         400      8      500      450     250      5      500      450
                        NY         385      7      505      450     330      6      505      450
      D5100 Kit 18-55   NJ        1215     15      810      729     688      8      815      729
           Body only    CA         456      6      746      670     456      6      746      670
cameras. The company sells products from several manufacturers and wants
to identify candidates for “best deal” offers. In today’s business, web
information is crucial for this. Price policies must take into account current
deals found on the Internet (e.g., number of available offers, shipping policies,
expected delivery time), as well as user opinions and product features. Jane, the
company’s data analyst, manually searches the web, querying different sites
and building spreadsheets with the collected data. Then she analyzes local data
at the DSS, together with data from web sources. This procedure is not only
inefficient but also imprecise. Jane needs flexible and intelligent tools to get an
idea of what is being offered on the web. We propose to make Jane’s work more
productive, by semi-automatically extracting multidimensional information from
web data sources. This process produces what we denote web cubes, which
can then be related to local OLAP cubes through a set of operators that we
study in this paper. A key assumption is that web cubes only add knowledge for
decision-making, and are not aimed at replacing precise information obtained
from traditional DSS. Thus, we do not need these data to be complete, not
even perfectly sound: Jane only needs a “few good answers” [17] to enhance
her analysis, and our approach takes this into account. In a nutshell, web cubes
are data cubes (obtained from web data sources) expressed using an RDF [10]
vocabulary. We show through an use case how web cubes could be used to
enhance existing DSS.
The case starts with Jane using her local DSS to analyze sales of digital
cameras. From local cubes she produces a multidimensional report (Figure 1),
actually a data cube with dimensions Product, Geography, and Time, and mea-
sures profit, #sales, unitPrice, and unitCost. From the report, Jane identifies that
the sales of Canon T3 and T3i cameras have dropped in December. She conjec-
tures that probably these products are being offered on the web at better prices,
thus she decides to build a web cube to retrieve information about offers of these
camera models. To start the process of building web cubes, Jane specifies her
information requirements, which in this particular case are: price, delivery time
and shipping costs of new Canon T3 and T3i DSLR cameras. She will try to ob-
tain sales facts with these three measures, if possible with the same dimensions
as the local cube. We assume that there is a catalogue of web data sources,
with metadata that allows deciding which sources are going to be queried, the
query mechanisms available for each source, and the format of the results.
Web data are available in many formats. Each one of these formats can be
accessed using different mechanisms. For example RDF data can be published via
SPARQL [16] endpoints, or extracted from HTML pages that publish RDFa [1]
(among other formats), while XML data may be the result of querying RESTful
web services (also known as web APIs). Tabular data may be extracted from
HTML tables or retrieved from data sharing platforms, such as Google Fusion
Tables1 . In this paper we do not deal with the problem of retrieving web cubes.
Well-known Information Retrieval and Natural Language Processing techniques
can be used for this. For integrating the information retrieved from the data
sources, and for representing the web cubes that are built after data extraction,
we propose to use RDF as the data model. For the latter task, we devised a
vocabulary called Open Cubes that we present in Section 3. It is highly possible
that not all of Jane’s requirements can be satisfied. For example, data could
be obtained at an aggregation level different from the requested one, or may
be incomplete. We do not deal with these issues in this paper. Continuing with
our use case, let us suppose that from www.overstock.com, Jane obtains the
following RDF triples.
@prefix dc: <http://purl.org/dc> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <http://schema.org/> .
From these triples, a web cube is built (represented using the Open Cube vocab-
ulary). Figure 2 shows this cube, in report format. Note that in this example,
the new cube has the same dimensions as the cube in Figure 1, but different
measures. Also notice that we have maximized the number of returned results,
leading to the presence of null values (denoted by ‘-’ in Figure 2), which should
be replaced by appropriate values (e.g., ‘unknown state’) to guarantee the cor-
rectness of the results when performing OLAP operations. Jane now wants to
compare the price of each product in the local cube, with the price for the same
product in the web cube. This requires using OLAP operators like roll-up, slice,
dice, or drill-across, over web Cubes. Jane then realizes that in the Geography di-
mension of the web cube, data are presented per city instead of per state (which
is the case in the local cube). Thus, she needs to transform the web cube to the
same level of detail as the local cube, taking both cubes to the country level,
using a roll-up operation in the Geography dimension. After this, both cubes can
be merged, and Jane can compare the prices in the local cube with those found
1 http://www.google.com/fusiontables/Home/
Fig. 2. Web cube in report format (Time: Q3-2011, December 2011)

Product                         Geography             unitPrice deliveryTime shippingCost
SLR Camera
  Canon T3i  Kit 18-55          USA  -    -              850.82      10           0       (1)
                                USA  NY   Amityville     799.95       -          19.95    (2)
                                USA  -    -              760.00       5           0       (3)
             Body only          USA  -    -              672.99       -           0       (4)
  Canon T3   Kit 18-55          USA  NJ   Somerset       466.82       -           -       (5)
                                USA  -    -              476.99       7           0       (6)
Fig. 3. Local cube merged with the web cube at the country level (Time: Q3-2011, December 2011); rows marked (web) contain the data found on the Internet

Product                 Geography  profit #sales unitPrice unitCost deliveryTime shippingCost
Canon T3i  Kit 18-55    USA         672.5    15     872.5    783.0       -            -
           (web)        USA           -       -     803.6       -        7.5         19.95
           Body only    USA         395.0    15     843.3    765.0       -            -
           (web)        USA           -       -     673.0       -        -            0.0
      T3   Kit 18-55    USA         520.0    15     470.0    400.0       -            -
           (web)        USA           -       -     471.9       -        7.0          0.0
Nikon D3100 Kit 18-55   USA         500.0    24     609.3    547.0       -            -
           Kit 55-200   USA        1500      12     725      600         -            -
           Body only    USA         290.0    11     502.5    450.0       -            -
      D5100 Kit 18-55   USA         688       8     815      729         -            -
           Body only    USA         456       6     746      670         -            -
on the Internet (grey rows). Figure 3 shows the result (Section 4 shows how web
cubes can be mapped to the multidimensional model of the local cube).
Contributions. The following research questions arise in the scenario described
above: Is it possible to use web data to enhance local OLAP analysis, without
the burden of incorporating data sources and data requirements into the existent
DSS life-cycle? What definitions, data-models, and query mechanisms are needed
to accomplish these tasks? Our main goal is to start giving answers to some
of these questions. Central to this goal is the representation and querying of
multidimensional data over the Semantic Web. Therefore, our main contributions
are: (a) We introduce Open Cubes, a vocabulary specified using RDFS that
allows representing the schema and instances of OLAP cubes; it extends
similar proposals and makes them operational, since it is devised not only for data
publishing, but also for operating over RDF representations of multidimensional data
(Section 3); (b) We show how typical OLAP operators can be expressed
in SPARQL 1.1 using the vocabulary introduced in (a). We give an algorithm for
generating SPARQL 1.1 CONSTRUCT queries for the OLAP operators, and show
that implementing these operators is feasible. The basic assumption here is that
web cubes are composed of a limited number of instances (triples) of interest
(Section 4); (c) We sketch a mapping from a web cube to the multidimensional
(in what follows, MD) model, in order to be able to operate with the local cubes.
We do this through an example that shows how web cubes can be exported to
the local system, using the Mondrian OLAP server2 (Section 4).
2 http://mondrian.pentaho.com/documentation/schema.php
2 Preliminaries
RDF and SPARQL. The Resource Description Framework (RDF) [10] is a data
model for expressing assertions about resources identified by a Uniform Resource
Identifier (URI). Assertions are expressed as triples subject - predicate - object,
where the subject is always a resource, the predicate is a resource, and the object
may be a resource or a literal. Blank nodes are used to represent anonymous
resources or resources without a URI, typically with a structural function, e.g., to group
a set of statements. Data values in RDF are called literals and can only be
objects. A set of RDF triples (an RDF graph) can be seen as a directed graph
where subject and object are nodes, and predicates are arcs. Many formats for
RDF serialization exist. The examples presented in this paper use Turtle [2].
RDF Schema (RDFS) [3] is a particular RDF vocabulary where a set of re-
served words can be used to describe properties like attributes of resources, and
to represent relationships between resources. Some of these reserved words are
rdfs:range [range], rdfs:domain [dom], rdf:type [type], rdfs:subClassOf [sc],
and rdfs:subPropertyOf [sp].
SPARQL is the W3C standard query language for RDF [16]. The query eval-
uation mechanism of SPARQL is based on subgraph matching: RDF triples
are interpreted as nodes and edges of directed graphs, and the query graph is
matched to the data graph, instantiating the variables in the query graph defini-
tion. The selection criteria is expressed as a graph pattern in the WHERE clause,
consisting basically in a set of triple patterns connected by the ‘.’ operator. The
SPARQL 1.1 specification [5], with status of working draft at the moment of
writing this paper, extends the power of SPARQL in many ways. Particularly
relevant to our work is the support of aggregate functions and the inclusion of
the GROUP BY clause.
OLAP. In OLAP, data are organized as hypercubes whose axes are dimensions.
Each point in this multidimensional space is mapped through facts into one or
more spaces of measures. Dimensions are structured in hierarchies of levels that
allow analysis at different levels of aggregation. The values in a dimension level
are called members, which can also have properties or attributes. Members in
a dimension level must have a corresponding member in the upper level in the
hierarchy, and this correspondence is defined through so-called rollup functions.
In our running example we have a cube with sales data. For each sale we have
four measures: quantity of products sold, profit, price and cost per product (see
Figure 1, which shows a cube in the form of a report). We also have three dimen-
sions: Product, Geography (geographical location of the point of sale), and Time.
Figure 4 shows a possible schema for each dimension, and for the sales facts. We
can see that in dimension Geography, cities aggregate over states, and states over
countries. A well-known set of operations are defined over cubes. We present (for
clarity, rather informally) some of these operations next, following [9] and [19].
[Figure 4. Schemas of the dimensions and of the sales facts: in Product, model rolls up to manufacturer and to category; in Geography, state rolls up to country (cities roll up to states); in Time, month rolls up to quarter and quarter to year.]
[Figure 5 (excerpt). Example cube instances over products prod1 and prod2, illustrating (c) Slice(C, Product, sum) and (d) Dice(C, Time, date > date1).]
they address [18]. In particular, the Data Cube vocabulary does not provide the
means to perform OLAP operations over data. This and other issues will be
further discussed in Section 5.
Open Cubes is based on the multidimensional data model presented in [6],
whose main concepts are dimensions and facts.4
Dimensions have a schema and a set of instances; the schema contains the
name of the dimension, a set of levels and a partial order defined among them;
a dimension level is described by attributes; a dimension instance contains a
set of partial functions, called roll-up functions, that specify how level members
are aggregated. Facts also have a schema and instances; the former contains
the name of the fact, a set of levels and a set of measures; the latter is a partial
function that maps points of the schema into measure values. Figure 6 graphically
presents the most relevant concepts in the Open Cubes vocabulary, where bold
nodes represent classes and regular nodes represent properties. Labelled directed
arcs between nodes represent properties with a defined domain and range among
concepts in the vocabulary. We omit properties whose range is a literal value.
In Open Cubes, the class oc:Dimension and a set of related levels, mod-
elled by the oc:Level property, represent dimension schemas. A partial order
among levels is defined using properties oc:parentLevel and oc:childLevel,
while the attributes of each level member are modelled using the oc:Attribute
property. A fact schema is represented by the class oc:FactSchema and a set
of levels and measures, which are modelled using the properties oc:Level and
oc:Measure respectively. For each measure, the aggregation function that can
be used to compute its aggregated value can be stated using
the oc:hasAggFunction property. As an example, Figure 8 shows RDF triples
(in Turtle notation) that represent the schemas of the Products dimension and
the Sales fact from the report in Figure 7, using the Open Cubes vocabulary.
The prefixes oc and eg represent the Open Cube vocabulary and the URI of the
RDF graph of this example, respectively.
Dimension instances are modelled using a set of level members, which are rep-
resented by the oc:LevelMember class. The properties oc:parentLevelMember
4 We could have used a more complex model, like [7]. However, the chosen model is expressive enough to capture the most usual OLAP features.
Figure 7 (report format). Time: Q3-2011, December 2011.

Product                              Geography            date1 (price, qtySold)   date2 (price, qtySold)
DSLR Camera Canon T3i Kit 18-55      USA OR Portland      714.54, 2                714.54, 3
            Canon T3  Kit 18-55      USA NJ Somerset      466.82, 5                466.82, 5
                                            Jersey City   480, 4                   480, 3
eg:products rdf:type oc:Dimension ;
    oc:dimHasLevel eg:product ;
    oc:dimHasLevel eg:model ;
    oc:dimHasLevel eg:manufacturer ;
    oc:dimHasLevel eg:category .

eg:product rdf:type oc:Level .
eg:model rdf:type oc:Level .
eg:manufacturer rdf:type oc:Level .
eg:category rdf:type oc:Level .

eg:product oc:parentLevel eg:model .
eg:model oc:parentLevel eg:manufacturer .
eg:model oc:parentLevel eg:category .

eg:sales rdf:type oc:FactSchema ;
    oc:hasLevel eg:product ;
    oc:hasLevel eg:city ;
    oc:hasLevel eg:date ;
    oc:hasMeasure eg:price ;
    oc:hasMeasure eg:qtySold .

eg:price rdf:type oc:Measure ;
    oc:hasAggFunction avg .

eg:qtySold rdf:type oc:Measure ;
    oc:hasAggFunction sum .
and oc:childLevelMember represent the roll-up functions that specify the navi-
gation among level members. Instances of oc:FactInstance class represent fact
instances. The set of level members and the values of each measure are related
to the fact instance using pre-defined properties of type Level and Measure
respectively. For example, Figure 9 shows how the instances in Figure 7 can be
represented using the Open Cubes vocabulary. It is worth noting that subjects
in RDF triples should be either blank nodes or URIs. For clarity, we use constant
values between quotation marks to represent the identifier of each level member.
These constant values should be replaced by URIs that uniquely identify each
of the level members. The oc:hasFactId property makes it possible to provide a literal
that uniquely identifies each fact instance within a collection of cubes or a multi-
dimensional database. Due to space reasons we omit the oc:childLevelMember
relationships between the level members, and only show the RDF representation
of one complete fact instance (the first tuple in Figure 7).
Assume the following auxiliary functions: (a) newVar() generates unique SPARQL variable names; (b) value(v) returns the value stored
in variable v; (c) levels(s) returns all the levels in a schema s (i.e., all the values
of ?l that satisfy s oc:hasLevel ?l); (d) measures(s) returns all the measures
in a schema s (all the values of ?m that satisfy s oc:hasMeasure ?m); and (e)
aggF unction(m) returns the aggregation function of measure m (all the values
of ?f that satisfy m oc:hasAggFunction ?f). Also assume that there is a level
dlo in dimension d such that there exists a path between dlo and dlr which
contains only arcs with type oc:parentLevel. The function levelsPath(d1, d2)
retrieves the ordered list of levels in the path between d1 and d2 (including both
levels). Also assume that it is possible to access and modify different parts of
a SPARQL query via the properties: resultFormat, graphPattern, and groupBy,
among others. The add(s) function appends s to a particular part of the query.
eg:salesMonth rdf:type oc:FactSchema ;
    oc:hasLevel eg:product ;
    oc:hasLevel eg:city ;
    oc:hasLevel eg:month ;
    oc:hasMeasure eg:price ;
    oc:hasMeasure eg:qtySold .
SalesByMonth schema
CONSTRUCT { ?id oc:hasSchema eg:salesMonth . ?id eg:product ?prod .
            ?id eg:city ?city . ?id eg:month ?mon .
            ?id eg:price ?priceMonth . ?id eg:qtySold ?qtyMonth . }
WHERE { {
  SELECT ?prod ?city ?mon (AVG(?price) AS ?priceMonth)
         (SUM(?qty) AS ?qtyMonth)
         (iri(fn:concat("http://example.org/salesInstances#", "sales", "_",
              fn:substring-after(?prod, "http://example.org/salesInstances#"), "_",
              fn:substring-after(?city, "http://example.org/salesInstances#"), "_",
              fn:substring-after(?mon, "http://example.org/salesInstances#"))) AS ?id)
  WHERE {
    ?i oc:hasSchema eg:sales . ?i eg:product ?prod .
    ?i eg:city ?city . ?i eg:date ?date .
    ?i eg:price ?price . ?i eg:qtySold ?qty .
    ?date oc:parentLevelMember ?mon . ?mon oc:inLevel eg:month
  } GROUP BY ?prod ?city ?mon } }
SalesByMonth instances
SalesWithoutGeo schema
CONSTRUCT { ?id oc:hasSchema eg:salesWithoutGeo . ?id eg:city ?city .
            ?id eg:date ?date . ?id eg:price ?avgPrice .
            ?id eg:qtySold ?sumQtySold . }
WHERE { {
  SELECT ?city ?date (AVG(?price) AS ?avgPrice)
         (SUM(?qty) AS ?sumQtySold)
         (iri(fn:concat("http://example.org/salesInstances#", "salesSGeo", "_",
              fn:substring-after(?city, "http://example.org/salesInstances#"), "_",
              fn:substring-after(?date, "http://example.org/salesInstances#"))) AS ?id)
  WHERE {
    ?i oc:hasSchema eg:sales .
    ?i eg:city ?city . ?i eg:date ?date .
    ?i eg:price ?price . ?i eg:qtySold ?qty .
  } GROUP BY ?city ?date } }
SalesWithoutGeo instances
using Open Cubes, and produces SQL code that loads data into the database
created in the definition phase. Figure 12 shows a portion of the XML file that
represents the cube shown in Figure 2, closing the cycle of our running example:
Jane requested the web cubes, which were retrieved, and represented in RDF
using the Open Cubes vocabulary; then she operated over the RDF representation
of the web cube, and finally imported it to the local DSS, for joint analysis with
the local cube.
Algorithm 1. Generates the SPARQL query that builds the Roll-Up instances
Input: so original schema, sr new schema, dlo level of D ∈ so , dlr level of D ∈ sr
Output: qOuter is a SPARQL CONSTRUCT query that creates the roll-up instances
1: qOuter.graphPattern.add(?id , oc:hasSchema, sr )
2: qInner.graphPattern.add(?i , oc:hasSchema, so )
3: for all l ∈ L = levels(so ) do
4: if l ≠ dlo then
5: newVar(li )
6: qOuter.resultFormat.add(?id , l, value(li ))
7: qInner.resultFormat.add(value(li ))
8: qInner.graphPattern.add(?i , l, value(li ))
9: qInner.groupBy.add(value(li ))
10: end if
11: end for
12: for all m ∈ M = measures(so ) do
13: f = aggFunction(m)
14: newVar(mi ); newVar(agi )
15: qOuter.resultFormat.add(?id , m, value(agi ))
16: qInner.resultFormat.add(f(value(mi )) AS agi )
17: qInner.graphPattern.add(?i , m, value(mi ))
18: end for
19: for all dli ∈ path = levelsP ath(dlo , dlr ) do
20: newVar(lmi )
21: if dli = dlo then
22: qInner.graphPattern.add(?i , value(dli ), value(lmi ))
23: else
24: newVar(plmi )
25: qInner.graphPattern.add(value(plmi ), oc:inLevel, value(dli ))
26: if dli = dlr then
27: qInner.graphPattern.add(value(lmi ), oc:parentLevelMember,value(plmi ) )
28: end if
29: end if
30: end for
31: newVar(lmi )
32: qInner.groupBy.add(value(plmi ))
33: qInner.resultFormat.add(value(plmi ))
34: qOuter.resultFormat.add(?id , dlr , value(plmi ))
35: qOuter.graphPattern.set(qInner)
36: return qOuter
<Schema>
  <Cube name="WCubeSales">
    <Table name="wcube_sales_fact" />
    <Dimension name="Products" foreignKey="product_id">
      <Hierarchy hasAll="false" primaryKey="product_id">
        <Table name="products" />
        <Level name="Model" column="model_id" uniqueMembers="true" />
        <Level name="Manufacturer" column="manuf_id" uniqueMembers="true" />
        <Level name="Category" column="category_id" uniqueMembers="true" />
      </Hierarchy>
    </Dimension>
    <Dimension name="Time" foreignKey="date">
      ...
    </Dimension>
    <Measure name="UnitPrice" column="unit_price" aggregator="avg" />
    <Measure name="Delivery Time" column="del_time" aggregator="avg" />
    <Measure name="Shipping Cost" column="ship_cost" aggregator="avg" />
  </Cube>
</Schema>
5 Related Work
Our work is closely related to the idea of situational BI, a term coined in [11].
Situational BI focuses on executing complex queries, mainly natural language
queries, over unstructured and (semi-) structured data; in particular, unstruc-
tured text retrieved from documents is seen as a primary source of information.
In the context of situational BI the process of augmenting local data with data
retrieved from web sources is discussed in [12].
In [4] a vocabulary called RDF Data Cube (DC) is presented. This vocabulary
is focused on representing statistical data, according to the SDMX data model.
Although this underlying data model shares some terms with traditional multi-
dimensional data models, the semantics of some of the concepts are different. An
example of this is the concept of slices. Slices, as defined in the DC vocabulary,
represent subsets of observations, fixing values for one or more dimensions. Slices
are not defined in terms of an existing cube, they are defined as new structures
and new instances (observations). An example can be found in [4], Section 7.
The semantics of the slice operator in the MD model is quite different, as shown
in Section 2. Besides, while dimensions and their hierarchical nature are first-class
citizens in MD models, DC dimensions are flat concepts that identify
observations at a single granularity. The DC vocabulary does not provide the
constructs to explicitly represent hierarchies within dimensions, neither at the
schema level (DataStructureDefinition) nor at the instance level. As the DC vo-
cabulary adheres to Linked Data principles, hierarchies within dimensions may
be inferred from external hierarchies, whenever possible. For example, members
of a dimension stated to be of type foaf:Person can be grouped according to
their place of work using the foaf:workplaceHomepage property. This is clearly
not enough to guarantee that the model can support OLAP operations such
as roll-up, which require representing hierarchical relationships between dimension
levels and level members. Some of the problems found when trying to map cubes
expressed in the DC vocabulary into a multidimensional model are discussed in
[8]. In light of the above, we decided to build a new vocabulary from scratch,
instead of extending DC.
References
1. Adida, B., Birbeck, M.: RDFa Primer, Bridging the Human and Data Webs (2008),
http://www.w3.org/TR/xhtml-rdfa-primer/
2. Beckett, D., Berners-Lee, T.: Turtle - Terse RDF Triple Language (2011),
http://www.w3.org/TeamSubmission/turtle/
3. Brickley, D., Guha, R., McBride, B.: RDF Vocabulary Description Language 1.0:
RDF Schema (2004), http://www.w3.org/TR/rdf-schema/
4. Cyganiak, R., Field, S., Gregory, A., Halb, W., Tennison, J.: Semantic Statistics:
Bringing Together SDMX and SCOVO. In: Proc. of the WWW2010 Workshop on
Linked Data on the Web, pp. 2–6. CEUR-WS.org (2010)
5. Harris, S., Seaborne, A.: SPARQL 1.1 Query Language (2010),
http://www.w3.org/TR/sparql11-query/
6. Hurtado, C.A., Mendelzon, A.O., Vaisman, A.A.: Maintaining Data Cubes un-
der Dimension Updates. In: Proc. of the 15th International Conference on Data
Engineering, ICDE 1999, pp. 346–355. IEEE Computer Society, Washington, DC
(1999)
7. Hurtado, C.A., Gutiérrez, C., Mendelzon, A.O.: Capturing summarizability with
integrity constraints in OLAP. ACM Transactions on Database Systems 30(3),
854–886 (2005)
8. Kämpgen, B., Harth, A.: Transforming statistical linked data for use in OLAP
systems. In: Proc. of the 7th International Conference on Semantic Systems, I-
Semantics 2011, New York, NY, USA, pp. 33–40 (2011)
9. Kimball, R.: The Data Warehouse Toolkit. J. Wiley and Sons (1996)
10. Klyne, G., Carroll, J.J., McBride, B.: Resource Description Framework (RDF):
Concepts and Abstract Syntax (2004), http://www.w3.org/TR/rdf-concepts/
11. Löser, A., Hueske, F., Markl, V.: Situational Business Intelligence. In: Castellanos,
M., Dayal, U., Sellis, T. (eds.) BIRTE 2008. LNBIP, vol. 27, pp. 1–11. Springer,
Heidelberg (2009)
12. Löser, A., Nagel, C., Pieper, S.: Augmenting Tables by Self-supervised Web Search.
In: Löser, A. (ed.) BIRTE 2010. LNBIP, vol. 84, pp. 84–99. Springer, Heidelberg
(2011)
13. Nebot, V., Llavori, R.B.: Building data warehouses with semantic data. In:
EDBT/ICDT Workshops. ACM International Conference Proceeding Series. ACM
(2010)
14. Niemi, T., Niinimäki, M.: Ontologies and summarizability in OLAP. In: Proc. of the
2010 ACM Symposium on Applied Computing, SAC 2010, pp. 1349–1353. ACM,
New York (2010)
15. Niinimäki, M., Niemi, T.: An ETL Process for OLAP Using RDF/OWL Ontologies.
In: Spaccapietra, S., Zimányi, E., Song, I.-Y. (eds.) Journal on Data Semantics
XIII. LNCS, vol. 5530, pp. 97–119. Springer, Heidelberg (2009)
16. Prud’hommeaux, E., Seaborne, A.: SPARQL Query Language for RDF (2008),
http://www.w3.org/TR/rdf-sparql-query/
17. Sequeda, J., Hartig, O.: Towards a query language for the web of data (a vision
paper). CoRR, abs/1110.3017 (2011)
18. Shoshani, A.: OLAP and statistical databases: similarities and differences. In:
PODS 1997, pp. 185–196. ACM, New York (1997)
19. Vassiliadis, P.: Modeling multidimensional databases, cubes and cube operations.
In: SSDBM, pp. 53–62. IEEE Computer Society (1998)
Query-Independent Learning to Rank
for RDF Entity Search
1 Introduction
retrieving RDF data because RDF triples form a graph, and graph patterns matching
subgraphs of this graph can be specified as SPARQL queries. Most endpoints, which
provide public Web access to the kind of RDF data mentioned above, support
SPARQL queries. While keyword search is clearly easier to use, structured query
languages such as SPARQL can provide the expressiveness (technical) users may
need in order to capture complex information needs, and to fully harness the structure
and semantics captured by the underlying data. In fact, many queries posed on the
Web are actually specified using form- and facet-based interfaces (e.g. faceted search
provided by Yahoo!, Amazon and eBay). The inputs provided by the users through
these interfaces are actually mapped to structured queries.
Since structured queries precisely capture the constraints the candidate answers
must satisfy, the underlying engine can return perfectly sound and complete results.
That is, all results can be found (complete) and every one of them perfectly matches
the query (sound). However, given the large amount of data, queries may result in a
large number of results, while only a few of them may be of interest to the user. In
this case, ranking and returning only the top-k results is the standard strategy used in
practical scenarios to improve efficiency and response time behavior. Studies have
shown that users typically scan results beginning from the top, and usually focus only
on the top three or four [1]. However, how do we rank entity search results in this
structured query scenario, given all entities equally (i.e. perfectly) match the query?
A few specific approaches have been proposed to deal with ranking RDF results
[5,7,10]. Most of these approaches [5] assume an ambiguous keyword query such that
the ranking problem is mainly understood as one of computing content relevance,
i.e. to find out whether the resource’s content is relevant with respect to the query. In
the structured query setting, all resources are equally relevant. Ranking approaches
[10,11] that can be used to distinguish resources in this setting are mainly based on
centrality, a notion of “popularity” that is derived from the data via PageRank.
Besides centrality, we study the use of other features and incorporate them into a
learning to rank (LTR) framework for ranking entity search results, given structured
queries. The main contributions of this paper can be summarized as follows:
(1) Learning to rank over RDF data. LTR [2] is a state-of-the-art IR technique that
learns a ranking function from labeled training data (relevance judgments). We show
how LTR can be adopted for ranking entity search results over RDF data.
(2) Query-independent features. Critical for the performance of LTR are features.
For this specific structured query setting, we systematically identify query-
independent features (those that go beyond content relevance) and individually
analyze their impacts on ranking performance.
(3) Access logs based ground truth and training data. While LTR offers high
performance, it critically depends on the availability of relevance judgments for
training. We observed from our experiments based on real users (via a crowdsourcing-
based evaluation recently proposed in [3]) that the final results strongly correlate with
the number of visits (#visits) captured in the access logs. We provide a detailed
analysis of this correlation and, for cases where training data and ground truth are not
easy to obtain, we propose the use of #visits as an alternative.
Using both cross-domain and domain-specific real world datasets, we evaluate the
proposed LTR approach and show its superior performance over two relevant
baselines. Results suggest that combining different features yields high and robust
performance. Surprisingly, the use of features that are derived from the external Web
corpus (features that are independent of the query and local dataset) yields the best
performance in many cases.
Structure. The remainder of the paper is structured as follows. We firstly discuss
related work in Section 2. Then, our adaptation of LTR is presented in Section 3.
Experimental results are discussed in Section 4 before we conclude in Section 5.
2 Related Work
Approaches for ranking in the RDF setting can be distinguished into those which
consider the relevance of a resource with respect to the query, and the others, which
derive different features (e.g. popularity) from cues captured in the data such as
centrality and frequency.
Our focus is on ranking results to structured queries, which, as opposed to ambiguous keyword queries, are
precisely defined such that the query semantics can be fully harnessed to produce
answers that are equally relevant. Thus in principle, content relevance can be
expected to be less important in this case, and other features should be considered for
ranking. Among the approaches mentioned above, the only exception that deals with
structured queries is the LM-based ranking of RDF triples (graphs). As discussed, this
work does not directly capture content relevance but relies on informativeness. We
consider this as one baseline and show that using additional features can substantially
outperform this. Previous works build upon the vector space model [4], language
models [7], and probabilistic IR [5]. In this work, we adopt yet another popular IR
paradigm, namely LTR [2]. This paradigm constitutes the state-of-the-art in IR, and is
widely used by commercial Web search engines.
Approaches described in this subsection do not take the query into account, but
rather use query-independent features. An example of such features that are
independent of the query is centrality, which can be derived from the graph-structured
nature of the underlying data using algorithms such as PageRank [8] and HITS [9].
The aim of PageRank is to give a global, query-independent score to each page. The
score computed by PageRank for a given page captures the likelihood that a random
Web surfer will land on that page.
The first adoption of PageRank in the structured data setting was proposed for
Entity-Relation graphs representing databases, and specific approaches for dealing
with RDF graphs have been introduced recently. For instance, ResourceRank [10] is
such a PageRank adapted metric that is iteratively computed for each resource in the
RDF dataset. Also, a two layered version of PageRank has been proposed [11], where
a resource gets a high rank if it has a high PageRank within its own graph, and if this
graph has a high PageRank in the LOD cloud (which is also considered as a graph
where nodes represent datasets). The difficulty in adapting PageRank to the
structured data setting is that the graph here – as opposed to the Web graph – has
heterogeneous nodes and edges (different types of resources and different relations
and attribute edges). A solution is to manually assign weights to different relations,
but this approach is only applicable in a restricted domain such as paper-author-
conference collections [13].
Instead of centrality, simpler features based on frequency counts have also
been used in the RDF setting. For instance, structured queries (graph patterns)
representing interpretations of keyword queries have been ranked based on the
frequency counts of nodes and edges [16]. Just like PageRank scores, these counts
aim to reflect the popularity of the nodes and edges in the query pattern such that
more popular queries are preferred. The use of frequency has a long tradition in IR.
Term and inverse document frequencies are commonly used to measure the
importance of a term for a document relative to other terms in the collection.
These query-independent features can be directly applied to our structured query
setting to distinguish between the results that are equally relevant. In a systematic
fashion, we identify different categories of features that can be used for our LTR
approach, including centrality and frequency. We show that besides the features
derived from the corpus (i.e. the underlying RDF graph), external information on the
Web provides useful features too. We compare and show that the use of different
features can outperform the ResourceRank baseline, which is based on centrality.
For each query, we build the set of training pairs $(x_i, x_j)$, $i, j \in \{1, \ldots, n\}$,
such that $t_i > t_j$, where $x_i$ denotes the feature vector of the $i$-th answer and $t_i$ its
target feature.
In other words, for each query, we take all the pairs of the feature vectors of the
answers to the query such that we put the answer with a higher target feature on the
first place. To each pair $(x_i, x_j)$ we associate a cost $c_{ij}$:
$$ c_{ij} = \frac{2}{1 + e^{-(t_i - t_j)}} - 1. $$
Intuitively we can think of $c_{ij}$ as the confidence in the correct ordering of the pair
$(x_i, x_j)$, or as the penalty which the learning algorithm receives if it makes a mistake
on this pair. We can observe that if $t_i = t_j$ then $c_{ij} = 0$, so we are not
confident at all that $x_i$ should be ranked higher than $x_j$. On the other hand, if $t_i \gg t_j$,
then the value of $c_{ij}$ gets close to 1, and the learning algorithm obtains a big
penalty for making a mistake on this pair.
The list of pairs $(x_i, x_j)$ with their associated cost $c_{ij}$ is the input to the RankSVM
[17] algorithm described below. The goal is to learn a weight vector $w$ of the
same dimensions as the training vectors $x_i$. Then, given a new vector $x$ representing
the feature vector of an answer to be ranked, we can compute the score of the answer,
which is equal to the inner product between the weight vector and the vector $x$,
$score(x) = \langle w, x \rangle$.
The ranking is then obtained by sorting answers by their scores.
$$ w \cdot (x_i - x_j) \ge 1 - \xi_{ij}, \quad \xi_{ij} \ge 0, \; \forall (x_i, x_j). $$
Number of subjects @ K. This feature is a count of the triples that have as subject
the node for which we extract the feature. In Figure 1 on the left, the value of this
feature at level 1 is 2 (because two arrows go out) and the value of this feature at level
2 is 3 (= 2 + 1).
Number of objects @ K. This feature is computed in a similar way to the number of
subjects @ K, the difference being that now the number of triples with the node as
object is counted (arrows coming in). The graph on the right side of Figure 1 illustrates
the computation of this feature.
Number of types of outgoing predicates @ K. At each level this feature is the count
of the elements of the set of predicates occurring at that level. The anchor node is the
subject. This feature is illustrated on the left side of Figure 2.
Number of types of incoming predicates @ K. At each level this feature is the count
of the elements of the set of predicates occurring at that level. The anchor node is the
object. This feature is illustrated on the right side of Figure 2.
3.3.2 PageRank
This section briefly describes the PageRank algorithm and how it applies to our case.
PageRank was introduced in the early days of web search out of a need for a global,
query independent ranking of the web pages. PageRank assumes a directed graph as
input and will give a score to each of the nodes as a result. PageRank is based on the
random walk model, which assumes that a very large number of users walk the graph
choosing at each step a random neighbor of the current node or jumping to any node
in the graph. The score of a node is given by the expected number of users being at
the given node at a moment in time. The scores are computed recursively from the
following equation:
$$ r = d \cdot A \cdot r + (1 - d) \cdot v, $$
where $N$ is the number of nodes in the graph, $r$ is the PageRank vector containing
the score for each node and is initialized with 0, and $A$ is the transition matrix constructed
such that $A_{u,v} = 1$ if there is an edge from node $v$ to node $u$ and 0 otherwise.
Moreover, to eliminate nodes which do not link to any other node, we consider a sink
node $z$ such that $A_{z,u} = 1$ for every node $u$ without outgoing edges, and $A_{u,z} = 0$ for all $u$.
Finally, the columns of $A$ are normalized to sum up to 1; $v$ is the jump vector and its
entries are $1/N$; $d$ is the damping factor.
In case of the web, the graph is made of the web pages as nodes and the hyperlinks
as edges. In our case the nodes are DBpedia or Yago resources or categories, and the
edges are properties. For illustration, Figure 3 shows a subgraph from the Yago
knowledge base.
where $v$ is a node in the graph, $n_u$ is the total number of nodes connected to $u$, and $u$ is
a node connected to $v$; the scores are initialized to 1.
1 http://developer.yahoo.com/search/boss/
For example, for the resource corresponding to the person Neil Armstrong, we make a web search with the
query ‘Neil Armstrong’ and obtain that the number of search results is 3720000.
4 Experiments
Given RDF datasets and SPARQL queries, we obtained results using a triple store. In
the experiments, we run different versions of the proposed LTR algorithm and
baselines to compute different rankings of these results. The goals of the experiments
are (1) to compare LTR against the baselines and (2) to analyze the performance of
individual features (feature sets). As performance measures, we use the standard
measures NDCG and Spearman’s correlation coefficient. We build upon the data,
queries and methodology proposed by the recent SemSearch Challenge evaluation
initiative [3].
We have two sets of queries3. The first set is a subset of the entity queries provided by
the SemSearch Challenge dataset. It consists of 25 queries, for which we obtain
answers from DBpedia and Yago, two datasets containing encyclopedic knowledge
that were extracted from Wikipedia infoboxes. Answers from these datasets
correspond to Wikipedia articles. We used the Wikipedia access logs from June 2010
to January 2011 (available at http://dammit.lt/wikistats/).
The other set consists of 24 queries, whose answers are computed from the
Semantic Web Dog Food (SWDF) dataset [18]. SWDF contains information about
people, conferences, workshops, papers and organizations from the Semantic Web
field. The dataset is built from metadata about conferences such as ISWC and ESWC,
starting from the year 2006. For the USEWOD 2011 Data Challenge [19], a dataset4
of access logs on the SWDF corpus was released.
2 http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
3 http://aidemo.ijs.si/paper_supplement/dali_eswc2012_queries.zip
4 http://data.semanticweb.org/usewod/2011/challenge.html
While the first set of queries is used to evaluate ranking in a general setting, the
second one is used to analyze how the approaches perform in a domain-specific
setting.
[Bar chart: percentage of votes vs. percentage of visits (0–40%) for Manhattan, Brooklyn, Queens, The Bronx, Harlem, and Staten Island.]
Fig. 4. Percentage of votes and visits for the query "List of boroughs of New York City",
NDCG = 0.993, average confidence = 0.675
There are also a few questions where the rankings based on #votes and #visits are
not so similar. For instance for the question “Books of Jewish Canon” the NDCG
score is only 0.57. However, the average user confidence is also lower in this case
(only 0.476). Other questions of this type are “Names of hijackers in the September
11 attacks”, “Ratt albums” and “Ancient Greek city-kingdoms of Cyprus”. All these
questions are relatively specific. We observed that in these cases users indicated relatively
low confidence, and the agreement between users is also low, suggesting that it was
difficult for them to choose the correct answers.
Figure 5 shows the correlations between NDCG scores computed for the ranking
based on #visits, confidence and agreement values for each question. By agreement
between users we mean the percentage of votes the answer with the highest number of
votes has obtained. We can see that in general the ranking based on #votes is quite
similar to the ranking based on #visits. More exactly, the average NDCG score is
0.86. For 15 of the 25 queries, the answer with most votes corresponds to the article
that is most visited on Wikipedia. Further, we see that the higher the confidence of the
users, the higher is also the NDCG based on #visits. Also, NDCG based on #visits
correlates with agreement. This means that when users are confident and agree on the
results, the ranking based on #visits closely matches the ranking based on #votes.
4.3 Systems
As baselines for evaluating the proposed ranking models, we have implemented two
ranking methods described in the related work. The first is ResourceRank (ResRank)
[10], which provides a global, query-independent PageRank inspired ranking score.
The second baseline (LM) is based on building language models for the query and
results [20]. As discussed, it actually relies on a rather query-independent metric
called witness count, which measures the “informativeness” of RDF triples. This
count is estimated based on the number of results obtained from searching the Web
with the labels of the subject, predicate and object of the triple as queries. Because the
number of triples in DBpedia and Yago is large, it was not feasible for us to submit
the resulting search requests. For this baseline, we could obtain results only for the
smaller SWDF dataset. The last one, called Wikilog, is considered an upper-limit
baseline, which ranks results based on #visits in the access logs.
We have implemented several LTR systems based on different features and labels
(target features). In particular, we used the four different categories discussed before,
namely (1) features based on graph centrality, (2) features based on external sources,
(3) features based on the RDF dataset, and (4) the complete set of all features. Two
target features were used, namely #votes (systems using these labels for training are
denoted by the prefix ‘H_’) and #visits (systems denoted by prefix ‘L_’).
Looking at individual features we found that features like the number of search
results, the number of objects, number of objects @ 2, the number of different
incoming predicates @ 2 and the n-gram count are among the best features for both
DBpedia and Yago, achieving NDCG scores of about 0.8.
References
1. Cutrell, E., Guan, Z.: What are you looking for?: an eye-tracking study of information
usage in web search. In: CHI 2007, pp. 407–416 (2007)
2. Liu, T.-Y.: Learning to Rank for Information Retrieval. Foundations and Trends in
Information Retrieval 3(3), 225–331 (2009)
3. Blanco, R., Halpin, H., Herzig, D.M., Mika, P., Pound, J., Thompson, H., Tran, D.T.:
Repeatable and Reliable Search System Evaluation using Crowd-Sourcing. In: SIGIR
2011, pp. 923–932 (2011)
4. Castells, P., Fernández, M., Vallet, D.: An Adaptation of the Vector-Space Model for
Ontology-Based Information Retrieval. IEEE Trans. Knowl. Data Eng., 261–272 (2007)
5. Blanco, R., Mika, P., Vigna, S.: Effective and Efficient Entity Search in RDF Data. In:
Aroyo, L., Welty, C., Alani, H., Taylor, J., Bernstein, A., Kagal, L., Noy, N., Blomqvist,
E. (eds.) ISWC 2011, Part I. LNCS, vol. 7031, pp. 83–97. Springer, Heidelberg (2011)
6. Nie, Z., Ma, Y., Shi, S., Wen, J.-R., Ma, W.-Y.: Web Object Retrieval. In: WWW 2007,
pp. 81–90 (2007)
7. Kasneci, G., Elbassuoni, S., Weikum, G.: MING: mining informative entity relationship
subgraphs. In: Proceedings of the 18th ACM Conference on Information and Knowledge
Management, CIKM 2009, pp. 1653–1656 (2009)
8. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing
order to the web. Technical report, Stanford Digital Libraries (1998)
9. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM 46(3), 604–
632 (1999)
10. Hogan, A., Harth, A., Decker, S.: ReConRank: A Scalable Ranking Method for Semantic
Web Data with Context. In: SSWS 2006 (2006)
11. Delbru, R., Toupikov, N., Catasta, M., Tummarello, G., Decker, S.: Hierarchical Link
Analysis for Ranking Web Data. In: Aroyo, L., Antoniou, G., Hyvönen, E., ten Teije, A.,
Stuckenschmidt, H., Cabral, L., Tudorache, T. (eds.) ESWC 2010. LNCS, vol. 6089, pp.
225–239. Springer, Heidelberg (2010)
12. Jeh, G., Widom, J.: Scaling personalized web search. In: WWW 2003, pp. 271–279 (2003)
13. Hristidis, V., Hwang, H., Papakonstantinou, Y.: Authority-Based Keyword Search in
Databases. ACM Transactions on Database Systems 33(1) (2008)
14. Chakrabarti, S.: Dynamic Personalized Pagerank in Entity-Relation Graphs. In: WWW
2007, pp. 571–580 (2007)
15. Nie, Z., Zhang, Y., Wen, J.-R., Ma, W.-Y.: Object-Level ranking: Bringing Order to Web
Objects. In: WWW 2005, pp. 567–574 (2005)
16. Thanh, T., Wang, H., Rudolph, S., Cimiano, P.: Top-k Exploration of Query Candidates
for Efficient Keyword Search on Graph-Shaped (RDF) Data. In: ICDE 2009, pp. 405–416
(2009)
17. Joachims, T.: Optimizing search engines using clickthrough data. In: KDD 2002, pp. 133–
142 (2002)
18. Möller, K., Heath, T., Handschuh, S., Domingue, J.: Recipes for Semantic Web Dog Food
— The ESWC and ISWC Metadata Projects. In: Aberer, K., Choi, K.-S., Noy, N.,
Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi,
R., Schreiber, G., Cudré-Mauroux, P. (eds.) ISWC/ASWC 2007. LNCS, vol. 4825, pp.
802–815. Springer, Heidelberg (2007)
19. Berendt, B., Hollink, L., Hollink, V., Luczak-Rösch, M., Möller, K., Vallet, D.: USEWOD
2011. In: WWW (Companion Volume) 2011, pp. 305–306 (2011)
20. Elbassuoni, S., Ramanath, M., Schenkel, R., Sydow, M., Weikum, G.: Language-Model-
Based Ranking for Queries on RDF-Graphs. In: CIKM 2009, pp. 977–986 (2009)
21. Joachims, T.: Making Large-Scale SVM Learning Practical. In: Scholkopf, B., Burges, C.,
Smola, A. (eds.) Advances in Kernel-Methods - Support Vector Learning. MIT Press
(1999)
22. Rupnik, J.: Stochastic subgradient approach for solving linear support vector machines. In:
SiKDD (2008)
COV4SWS.KOM: Information Quality-Aware
Matchmaking for Semantic Services
1 Introduction
From the very beginning of semantic Web service (SWS) research, service discov-
ery and matchmaking have attracted large interest in the research community
[9,13,18]. The underlying techniques to measure the similarity between a service
request and service offers have been continuously improved, but matchmakers
still rely on a particular information quality regarding the syntactic and seman-
tic information given in a service description. There are several reasons why the
quality of syntactic and semantic service descriptions differs between service do-
mains. While in one domain, a well-accepted ontology describing the particular
(industrial) domain could be available, such an ontology might be missing for
other domains. Furthermore, it could be the case that the usage of a certain do-
main ontology in a specific industry is compulsory due to legal constraints, as
is the case in the energy domain. In an upcoming Internet of Services, it is even
possible that there will be premium service marketplaces for certain domains,
which will only publish a service advertisement if certain quality standards re-
garding the service description are met. All things considered, the quality of
service descriptions will differ from service domain to service domain.
In this paper, we present our work on information quality-aware service match-
making. We propose an adaptation mechanism for matchmaking, which is based
on the usability and impact (with regard to service discovery) of syntactic de-
scriptions and semantic annotations on different levels of the service description
Likewise, the SAWSDL specification does not restrict the type of semantic
concepts a modelReference should point to. The only requirement is that the
concepts are identifiable via URI references. This is an advantage in so far as it
allows for a maximum of flexibility in annotating functional service descriptions.
Yet, this fact poses a problem if the concepts need to be automatically processed
and interpreted in some form. In the context of our work, we will assume the
semantic concepts to be formally defined in an OWL DL ontology. As a second
constraint, we only consider the first URI from a modelReference; all other URIs
are disregarded. This constraint is primarily made for practical reasons, as
there is no agreement on what additional modelReferences actually address: they could
refer to a semantic concept from another domain ontology or address
preconditions and effects, as is done for operations in WSMO-Lite [23].
[Figure: aggregation of similarities between a request operation a and an offer operation b. Native similarities computed on the interface (weight w_iface), operation (sim_op(a,b), weight w_op), input parameter (weight w_in), and output parameter (sim_out(a,b), weight w_out) levels are combined into the aggregated similarity sim_agg(a,b).]
$$ sim_{agg}(a, b) = sim_{iface}(a, b) \cdot w_{iface} + sim_{op}(a, b) \cdot w_{op} + sim_{in}(a, b) \cdot w_{in} + sim_{out}(a, b) \cdot w_{out} \qquad (2) $$
Once similarities between all pairs of operations in a service request and service
offer have been computed, the overall service similarity simserv is derived by
finding an optimal matching of operations: The final matching for a pair of ser-
vices is conducted between their respective union set of operations, disregarding
how the operations are organized into interfaces. Formally, let I and J be the
sets of operations in a service request R and offer O, respectively. Let xij be a
binary variable, indicating whether i ∈ I has been matched with j ∈ J. Then,
$$ sim_{serv}(R, O) = \frac{1}{|I|} \sum_{i \in I,\, j \in J} x_{ij} \cdot sim_{agg}(i, j) \qquad (3) $$
Subsequent to the matching process, the weights of all matched edges are
summed up and divided by the cardinality of the original sets. This yields the
similarity for two sets of components. If the cardinality of the two sets differs,
the following strategy is used: Generally, the cardinality of the set associated
with the service request is decisive: If an offer lacks requested operations or
outputs, its overall similarity decreases. For inputs, the cardinality of the set
associated with the service offer is decisive: If an offer requires more inputs
than the request provides, its overall similarity decreases. Such a procedure does
not exclude any services due to a mismatch in the number of parameters or
operations. Instead, these offers are implicitly punished by a reduction in overall
similarity. The approach is based on the notion that such service offers may still
be able to provide a part of the initially requested functionality or outputs, or
may be invoked by providing additional inputs.
the concepts. The most intuitive way to compute semantic relatedness between
nodes in a graph would be the measurement of the shortest distance (path length)
between the graph nodes [5]. In the following, we refer to this measure as simP L .
Furthermore, we make use of the metrics by Resnik [20] and Lin [15]:
$$ sim_{Resnik}(A, B) = -\log p(anc(A, B)) \qquad (4) $$
Given the design matrix and vector of predictors, the standard OLS estimator
can be applied [25]. It yields the initial estimate of level weights, namely the
vector β̂ (Eq. 7). In order to derive the final level weights, we further process the
vector. First, negative level weights, which can potentially result from the OLS
estimator, are set to 0, resulting in β̃ (Eq. 8). This ensures that increasing simi-
larities on the individual levels do not have a negative impact on the aggregated
similarity as it would be contradictory to common sense if higher similarity on
one level resulted in diminished overall similarity. Second, the entries are normal-
ized such that their sum matches the maximum relevance, resulting in the final
vector w (Eq. 9). This ensures that a pair of operations with perfect similarity
on all matching levels is precisely assigned the actual maximum relevance.
$$ \hat{\beta} = (X^{\top}X)^{-1}X^{\top}y = \big(\hat{\beta}_{iface}, \hat{\beta}_{op}, \hat{\beta}_{in}, \hat{\beta}_{out}\big) = (-0.063,\ 0.401,\ 0.506,\ 0.197) \qquad (7) $$
$$ \tilde{\beta} = \big(\max(0, \hat{\beta}_{iface}),\ \max(0, \hat{\beta}_{op}),\ \max(0, \hat{\beta}_{in}),\ \max(0, \hat{\beta}_{out})\big) = (0,\ 0.401,\ 0.506,\ 0.197) \qquad (8) $$
$$ w = \big(\tilde{\beta}_{iface}/s,\ \tilde{\beta}_{op}/s,\ \tilde{\beta}_{in}/s,\ \tilde{\beta}_{out}/s\big) = (0,\ 0.363,\ 0.458,\ 0.178), \quad s = \tilde{\beta}_{iface} + \tilde{\beta}_{op} + \tilde{\beta}_{in} + \tilde{\beta}_{out} \qquad (9) $$
4 Experimental Evaluation
that the full potential of COV4SWS.KOM will only be revealed if the annota-
tions address all service abstraction levels. However, SAWSDL-TC is a standard
test collection for SWS matchmaking and needs to be employed to accomplish
comparability with the results of other approaches. SAWSDL-TC is also used in
the International Semantic Service Selection Contest – Performance Evaluation
of Semantic Service Matchmakers (S3 Contest) [13], which serves as an annual
contest to compare and discuss matchmakers for different service formalisms.
Nevertheless, we assess our evaluation to be preliminary. We used SME27 to
compare our results with other state-of-the-art matchmaking algorithms.
We performed evaluation runs using different configurations of our match-
maker; due to space constraints, we will only present the most important evalu-
ation runs in the following. The interested reader can download the XAM4SWS
matchmaker project to conduct evaluation runs using different configurations of
COV4SWS.KOM. The applied configurations are depicted in Table 1; they make
use of different weightings of service abstraction levels on matchmaking results
and either apply simResnik , simLin , or simP L , as presented in Section 3.2.
For the OLS-based computation of weightings, the actual weights are iden-
tified using k-fold cross-validation [17]. In cross-validation, k–1 partitions of a
test data collection are applied for training purposes (i.e., the determination
of weights) while the remaining partition is applied for testing purposes (i.e.,
matchmaking). This is repeated k times in order to apply every partition in
testing; validation results are averaged over all rounds of training and testing.
In the example at hand, k=42 since every query and corresponding relevance
set from SAWSDL-TC serves as a partition from the service set. The neces-
sary probability values for simResnik and simLin have been calculated based
on SAWSDL-TC, i.e., we counted the appearances of semantic concepts in the
service collection and derived the probabilities from this observation.
XAM4SWS project) – as the fastest contestant in the SAWSDL track. This can
be traced back to the caching mechanisms applied (cp. Section 3.2).
[Precision–recall curves for the evaluated configurations: Version 1b (AP' = 0.734), Version 2b (AP' = 0.796), Version 3b (AP' = 0.808), Version 4b (AP' = 0.823).]
5 Related Work
Since the seminal paper of Paolucci et al. [18], a large number of different match-
making approaches has been proposed. In the following, we will consider adaptive
matchmakers for SAWSDL, which today provide the best results in terms of IR
metrics. For a broader discussion, we refer to Klusch et al. – according to their
classification, COV4SWS.KOM classifies as an adaptive and non-logic-based se-
mantic matchmaker [9,13].
iMatcher applies an adaptive approach to service matchmaking by learning
different weightings of linguistic-based similarity measures [8,13]. iSeM is an
adaptive and hybrid semantic service matchmaker which combines matching of
the service signature and the service specification [11]. Regarding the former,
strict and approximated logical matching are applied, regarding the latter, a
stateless, logical plug-in matching is deployed. In SAWSDL-MX, three kinds
of filtering, based on logic, textual information, and structure are applied; the
matchmaker adaptively learns the optimal aggregation of those measures using
a given set of services [12]. Notably, COV4SWS.KOM and SAWSDL-MX/iSeM
have been developed completely independently. URBE calculates the syntactic or
semantic similarity between inputs and outputs [19]. Furthermore, the similarity
between the associated XSD data types for a given pair of inputs or outputs is
calculated based on predefined values. Weights may be determined manually.
In our former work, we have presented LOG4SWS.KOM, which is also a
matchmaker for service formalisms like SAWSDL and hRESTS [14,22]. This
matchmaker shares some features with COV4SWS.KOM, especially the fallback
strategy and the operations-focused matching approach.
However, LOG4SWS.KOM applies a completely different strategy to assess
the similarity of service components, as the matchmaker is based on logic-based
DoMs and their numerical equivalents. Most importantly, an automatic
adaptation to different qualities of syntactic and semantic information on differ-
ent service abstraction levels is not provided.
6 Conclusion
In this paper, we proposed an information quality-aware approach to service
matchmaking. Through the adaptation to different degrees of impact on single
service abstraction levels, it is possible to adapt our matchmaker to different
service domains. For this, we discussed the usage of similarity metrics from the
field of information theory and the OLS-based adaptation of the matchmaking
process regarding the quality of semantic and syntactic information on different
service abstraction levels. We evaluated different versions of the corresponding
matchmaker COV4SWS.KOM for SAWSDL. The combination of operations-
focused matching, similarity metrics from the field of information theory, and
self-adaptation based on the weights of different service abstraction levels led to
top evaluation results regarding IR metrics.
References
1. Baader, F., Nutt, W.: Basic Description Logics. In: The Description Logic Hand-
book: Theory, Implementation and Applications, ch. 2, pp. 47–100. Cambridge
University Press (2003)
2. Bellur, U., Kulkarni, R.: Improved Matchmaking Algorithm for Semantic Web Ser-
vices Based on Bipartite Graph Matching. In: 2007 IEEE International Conference
on Web Services, pp. 86–93 (2007)
3. Booth, D., Liu, C.K. (eds.): Web Service Description Language (WSDL) Version
2.0 Part 0: Primer. W3C Recommendation (June 2007)
4. Bourgeois, F., Lassalle, J.C.: An extension of the Munkres algorithm for the as-
signment problem to rectangular matrices. Communications of the ACM 14(12),
802–804 (1971)
5. Budanitsky, A., Hirst, G.: Evaluating WordNet-based Measures of Lexical Semantic
Relatedness. Computational Linguistics 32(1), 13–47 (2006)
6. Farrell, J., Lausen, H. (eds.): Semantic Annotations for WSDL and XML Schema.
W3C Recommendation (August 2007)
7. Gomadam, K., Verma, K., Sheth, A.P., Li, K.: Keywords, Port Types and Seman-
tics: A Journey in the Land of Web Service Discovery. In: SWS, Processes and
Applications, ch. 4, pp. 89–105. Springer (2006)
8. Kiefer, C., Bernstein, A.: The Creation and Evaluation of iSPARQL Strategies
for Matchmaking. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M.
(eds.) ESWC 2008. LNCS, vol. 5021, pp. 463–477. Springer, Heidelberg (2008)
9. Klusch, M.: Semantic Web Service Coordination. In: Schumacher, M., Helin, H.,
Schuldt, H. (eds.) CASCOM: Intelligent Service Coordination in the Semantic Web,
ch. 4, pp. 59–104. Birkhäuser Verlag (2008)
10. Klusch, M., Fries, B., Sycara, K.P.: OWLS-MX: A hybrid Semantic Web service
matchmaker for OWL-S services. Journal of Web Semantics 7(2), 121–133 (2009)
11. Klusch, M., Kapahnke, P.: iSeM: Approximated Reasoning for Adaptive Hybrid
Selection of Semantic Services. In: Aroyo, L., Antoniou, G., Hyvönen, E., ten Teije,
A., Stuckenschmidt, H., Cabral, L., Tudorache, T. (eds.) ESWC 2010, Part II.
LNCS, vol. 6089, pp. 30–44. Springer, Heidelberg (2010)
12. Klusch, M., Kapahnke, P., Zinnikus, I.: Hybrid Adaptive Web Service Selection
with SAWSDL-MX and WSDL-Analyzer. In: Aroyo, L., Traverso, P., Ciravegna,
F., Cimiano, P., Heath, T., Hyvönen, E., Mizoguchi, R., Oren, E., Sabou, M.,
Simperl, E. (eds.) ESWC 2009. LNCS, vol. 5554, pp. 550–564. Springer, Heidelberg
(2009)
13. Klusch, M., Küster, U., König-Ries, B., Leger, A., Martin, D., Paolucci, M., Bern-
stein, A.: 4th International Semantic Service Selection Contest – Retrieval Perfor-
mance Evaluation of Matchmakers for Semantic Web Services, S3 Contest (2010)
14. Lampe, U., Schulte, S., Siebenhaar, M., Schuller, D., Steinmetz, R.: Adaptive
Matchmaking for RESTful Services based on hRESTS and MicroWSMO. In: Work-
shop on Enhanced Web Service Technologies (WEWST 2010), pp. 10–17 (2010)
15. Lin, D.: An Information-Theoretic Definition of Similarity. In: Fifteenth Interna-
tional Conference on Machine Learning, pp. 296–304 (1998)
16. Miller, G.A.: WordNet: a lexical database for English. Communications of the
ACM 38(11), 39–41 (1995)
17. Mitchell, T.M.: Machine Learning. McGraw-Hill (1997)
18. Paolucci, M., Kawamura, T., Payne, T.R., Sycara, K.: Semantic Matching of
Web Services Capabilities. In: Horrocks, I., Hendler, J. (eds.) ISWC 2002. LNCS,
vol. 2342, pp. 333–347. Springer, Heidelberg (2002)
19. Plebani, P., Pernici, B.: URBE: Web Service Retrieval Based on Similarity Evalu-
ation. IEEE Trans. on Knowledge and Data Engineering 21(11), 1629–1642 (2009)
20. Resnik, P.: Semantic Similarity in a Taxonomy: An Information-Based Measure
and its Application to Problems of Ambiguity in Natural Language. Artificial In-
telligence Research 11, 95–130 (1999)
21. Sakai, T., Kando, N.: On Information Retrieval Metrics designed for Evaluation with
Incomplete Relevance Assessments. Information Retrieval 11(5), 447–470 (2008)
22. Schulte, S., Lampe, U., Eckert, J., Steinmetz, R.: LOG4SWS.KOM: Self-Adapting
Semantic Web Service Discovery for SAWSDL. In: 2010 IEEE 6th World Congress
on Services, pp. 511–518 (2010)
23. Vitvar, T., Kopecký, J., Viskova, J., Fensel, D.: WSMO-Lite Annotations for Web
Services. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.)
ESWC 2008. LNCS, vol. 5021, pp. 674–689. Springer, Heidelberg (2008)
24. Wang, R.Y., Strong, D.M.: Beyond Accuracy: What Data Quality Means to Data
Consumers. Management Information Systems 12(4), 5–33 (1996)
25. Wooldridge, J.M.: Introductory Econometrics: A Modern Approach, 2nd edn.
Thomson South-Western (2003)
Automatic Identification of Best Answers
in Online Enquiry Communities
1 Introduction
Nowadays, online enquiry platforms and Question Answering (Q&A) websites
represent an important source of knowledge for information seekers. According to
Alexa,1 14% of Yahoo!'s traffic goes to its Q&A website whereas the Stack Exchange2 (SE) Q&A network boasts an average of 3.7 million visits per day.
It is very common for popular Q&A websites to generate many replies for
each posted question. In our datasets, we found that on average each question
thread received 9 replies, with some questions attracting more than 100 answers.
With such mass of content, it becomes vital for online community platforms
to put in place efficient policies and procedures to allow the discovery of best
answers. This allows community members to quickly find prime answers, and
1 Alexa, http://www.alexa.com
2 Stack Exchange, http://stackexchange.com
to reward those who provide quality content. The process adopted by Q&A systems for rating best answers ranges from restricting answer ratings to the author of the question (e.g., the SAP Community Network3 (SCN) forums) to opening them up to all community members (e.g., SE). What most such communities have in common is that the process of marking best answers is almost entirely manual. The side effect is that many threads are left without any such markings. In our datasets, about 50% of the threads lacked pointers to a best answer. Although much research has investigated the automatic assessment of answer quality and the identification of best answers [1], little work has been devoted to comparing such models across different communities.
In this paper we apply a model for identifying best answers on three differ-
ent enquiry communities: the SCN forums (SCN), Server Fault 4 (SF) and the
Cooking community5 (CO). We test our model using various combinations of
user, content, and thread features to discover how such groups of features influ-
ence best answer identification. We also study the impact of community-specific
features to evaluate how platform design impacts best answer identification.
Accordingly, the main contributions of our paper are:
3 SAP Community Network, http://scn.sap.com
4 Server Fault, http://serverfault.com
5 Cooking community, http://cooking.stackexchange.com
2 Related Work
Many different approaches have been investigated for assessing content quality
on various social media platforms [1]. Most of those approaches are based on estimating content quality from two groups of features: content features and user attributes. Content-based quality estimation postulates that the content and the metadata associated with a particular answer can be used for deriving the value of an answer, while the user-based view considers behavioural information about answerers to be relevant for identifying the merit of a post.
Content-based assessment of quality has been applied to both textual [2,3,4] and non-textual [5,2,3,4] content. Textual features normally include readabil-
ity measures such as the Gunning-Fog index, n-grams or words overlap [3,4].
Content metadata like ratings, length and creation date [5,2,3,4] have also been
investigated in this context. In this paper we also use common content features,
such as content length and Gunning-Fog index, alongside user features and other
novel features related to the online community platform.
Some approaches for assessing answer quality rely on assessing the expertise or
importance of the users themselves who provided the answers. Such assessment
is usually performed by applying link based algorithms such as ExpertiseRank
[6] which incorporates user expertise with PageRank, and HITS for measuring
popularity or connectivity of users [7,3,8].
Another line of research focused on identifying existing answers to new ques-
tions on Q&A systems. Ontologies and Natural Language Processing (NLP)
methods have been proposed for extracting relevant entities from questions and
matching them to existing answers [9,10,11,12]. Other methods involved more
standard Information Retrieval (IR) techniques like Probabilistic Latent Semantic Analysis [13], query rewriting [14] and translation models [5]. Most of these works, however, focus on measuring the relevance of answers to questions, rather than on the quality of those answers. Other approaches analyse the role of posts, to distinguish between conversational and informational questions [15], or between questions, acknowledgements, and answers [16]. Although out of the scope of this paper, such approaches could be used to filter out non-answer posts from discussion threads, which could improve best answer prediction.
Our work differs from all the above in that, in addition to using common content and user features, we also use thread features that take into account certain characteristics of the individual threads, such as score ratios, order of answers, etc. In addition to those features, we also present a contextual topical reputation model for estimating how knowledgeable the answerer is likely to be. Also, much previous work concentrated on studying single communities, whereas in this paper we investigate and compare the results across three communities, thus establishing a better idea of how generic the findings are.
features on these predictions. For training our answer classifier, we use three main types of features: content, user, and thread features. All these features are strictly generated from the information available at the time of the feature extraction (i.e., future information is not taken into account while generating attributes). The different attributes are described in the following sections.
H_A(u_i) = -\frac{1}{2}\left( P(Q|u_i)\log P(Q|u_i) + P(A|u_i)\log P(A|u_i) \right) \qquad (1)
– Normalised Topic Entropy: Calculates the concentration (H_T) of a user's posts across different topics. Low entropy indicates focus on particular topics. In our case, topics are given by the tags associated with a question or the category of the post. Each user's tags T_{u_i} are derived from the topics attached to the questions asked or answered by the user. This can be used to calculate the probability P(t_j|u_i) of having a topic t_j given a user u_i:

H_T(u_i) = -\frac{1}{|T_{u_i}|}\sum_{j=1}^{|T_{u_i}|} P(t_j|u_i)\log P(t_j|u_i) \qquad (2)
the user topical reputation function E_{u_i} and a question q with a set of topics T_q, the reputation embedded within a post related to question q is given by:

E_P(q, u_i) = \sum_{j=1}^{|T_q|} E_{u_i}(t_j) \qquad (3)

E_{u_i}(t_j) = \sum_{a \in A_{u_i,t_j}} S(a) \qquad (4)
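A minimal sketch of how these attributes could be computed; it is our own illustration, and the input structures (per-user post counts, tag counts, and answer scores grouped by tag) are hypothetical stand-ins for the datasets described in Section 4.

import math

def activity_entropy(n_questions, n_answers):
    # H_A(u_i), equation (1): entropy over the user's question vs. answer posts.
    total = n_questions + n_answers
    probs = [n / total for n in (n_questions, n_answers) if n > 0]
    return -0.5 * sum(p * math.log(p) for p in probs)

def topic_entropy(tag_counts):
    # H_T(u_i), equation (2): normalised entropy over the user's tags.
    total = sum(tag_counts.values())
    probs = [c / total for c in tag_counts.values() if c > 0]
    return -sum(p * math.log(p) for p in probs) / len(tag_counts)

def topical_reputation(question_tags, answer_scores_by_tag):
    # E_P(q, u_i), equations (3) and (4): sum of the user's answer scores S(a)
    # over the answers tagged with each topic of question q.
    return sum(sum(answer_scores_by_tag.get(tag, [])) for tag in question_tags)

print(topic_entropy({"python": 5, "sql": 1}))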
Table 1. Differences between the Core Features Set (19 features) and the Extended Features Set† (23 features)

User features
  Core (10): Reputation, Post Rate, Normalised Activity Entropy, Number of Answers, Answers Ratio, Number of Best Answers, Best Answers Ratio, Number of Questions, Questions Ratio, Normalised Topic Entropy, Topical Reputation.
  Extended (11): Reputation, Age, Post Rate, Normalised Activity Entropy, Number of Answers, Answers Ratio, Number of Best Answers, Best Answers Ratio, Number of Questions, Questions Ratio, Normalised Topic Entropy, Topical Reputation.

Content features
  Core (5): Answer Age, Number of Question Views, Number of Words, Gunning Fog Index, Flesch-Kincaid Grade Level.
  Extended (7): Score, Answer Age, Number of Question Views, Number of Comments, Number of Words, Gunning Fog Index, Flesch-Kincaid Grade Level.

Thread features
  Core (4): Number of Answers, Answer Position, Relative Answer Position, Topical Reputation Ratio.
  Extended (5): Score, Number of Answers, Answer Position, Relative Answer Position, Topical Reputation Ratio.

† Only valid for the Server Fault and Cooking datasets.
4 Datasets
Our experiments are conducted on three different datasets. The first two are sub-communities extracted from the April 2011 Stack Exchange (SE) public datasets:6 the Server Fault (SF) user group and the non-technical Cooking website
6 As part of the public Stack Exchange dataset, the Server Fault and Cooking datasets are available online at http://www.clearbits.net/get/1698-apr-2011.torrent
(CO) composed of cooking enthusiasts. The other dataset is obtained from the
SAP Community Network (SCN) forums and consists of posts submitted to 33
different forums between December 2003 and July 2011.7
Our three datasets come in different formats and structures. To facilitate their representation, integration, and analysis, we converted all three datasets to a common RDF format and structure (Figure 1). Data is dumped into an SQL database (1) and then converted to RDF based on the SIOC8 ontology using D2RQ9 (2). The RDF is then loaded into a triple store where knowledge augmentation functions are executed (3). Such functions simply extend the knowledge graph of each dataset by adding additional statements and properties (i.e., topical reputation, answer length, votes ratio, etc.). This workflow serves as the input of the learning algorithms used for predicting content quality (4). We extended SIOC to represent a Q&A vocabulary10. The flexibility of RDF enabled us to add features without requiring schema redesign. A summary of the mappings of our datasets to SIOC classes is given in Table 2.
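As a hedged illustration of step (2), the sketch below builds SIOC-based triples for one question/answer pair with rdflib; the base URI, the property choices, and the simplified user typing are our assumptions rather than the authors' exact D2RQ mapping.

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF

SIOC = Namespace("http://rdfs.org/sioc/ns#")
SIOCT = Namespace("http://rdfs.org/sioc/types#")
BASE = Namespace("http://example.org/qa/")   # hypothetical base URI

g = Graph()
g.bind("sioc", SIOC)
g.bind("sioct", SIOCT)

question, answer, user = BASE["question/42"], BASE["answer/4242"], BASE["user/alice"]

g.add((question, RDF.type, SIOCT.Question))
g.add((answer, RDF.type, SIOCT.Answer))
g.add((answer, SIOC.reply_of, question))          # thread structure
g.add((answer, SIOC.has_creator, user))
g.add((user, RDF.type, SIOC.UserAccount))         # simplification of the paper's user mapping
# Augmentation step (3): attach derived features as additional statements.
g.add((answer, BASE["answerLength"], Literal(257)))

print(g.serialize(format="turtle"))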
In our experiments we train a categorical learning model for identifying the best
answers in our three datasets. For each thread, the best answer annotation is used
for training and validating the model. Because SCN best answer annotation is
8 SIOC Ontology, http://sioc-project.org
9 D2RQ Platform, http://www4.wiwiss.fu-berlin.de/bizer/d2rq
10 Q&A Vocabulary, http://purl.org/net/qa/ns#
Table 2. SIOC Class Mappings of the Stack Exchange and SCN Forums Datasets
Input Dataset: SCN               Input Dataset: SF and CO    RDF Output
User                             User                        sioc:OnlineAccount / foaf:Person
Thread (first thread Post)       Question                    sioct:Question
Post (not in first position)     Answer                      sioct:Answer
Post (with 10 points)            Best Answer                 sioct:BestAnswer
-                                Comment                     sioct:Comment
Forum                            Tag                         sioct:Tag (topic)
based on the author ratings, we use the best answer rating (i.e. 10) as the model
class and discard the other ratings (i.e. 2 and 6) for training the SCN model.
A standard 10-fold cross-validation scheme is applied for evaluating the generated model. Each model uses the features described earlier in the paper. Decision tree algorithms have been found to be the most successful in such contexts [3,17]. We use the Multi-Class Alternating Decision Tree learning algorithm due to its consistently superior results compared to the other decision tree algorithms we tested (J48, Random Forests, Alternating Tree and Random Trees).
To evaluate the performance of the learning algorithm, we use precision (P), recall (R) and the harmonic mean F-measure (F1) as well as the area under the Receiver Operating Characteristic (ROC) curve. The precision measure represents the proportion of retrieved best answers that were real best answers. Recall measures the proportion of best answers that were successfully retrieved. We also plot the ROC curve and use the Area Under the Curve (AUC) metric for estimating the classifier accuracy.
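A sketch of this evaluation setup in scikit-learn; since Multi-Class Alternating Decision Trees are not available there, a plain decision tree stands in, which is a substitution on our part, and the feature matrix below is random toy data.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
X = rng.random((200, 19))            # toy stand-in for the 19 core features
y = rng.integers(0, 2, 200)          # toy labels: 1 = best answer

scores = cross_validate(
    DecisionTreeClassifier(random_state=0),
    X, y, cv=10,                     # standard 10-fold cross-validation
    scoring=("precision", "recall", "f1", "roc_auc"),
)
for metric in ("test_precision", "test_recall", "test_f1", "test_roc_auc"):
    print(metric, scores[metric].mean())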
We run two experiments: the first compares the performance of our model for identifying best answers across all three datasets, using the core and extended feature sets. The second experiment focuses on evaluating the influence of each group of features on best answer identification.
Table 3. Average Precision, Recall, F1 and AU C for the SCN Forums, Server Fault
and Cooking datasets for different feature sets and extended features sets (marked with
+) using the Multi-Class Alternating Decision Tree classifier
and CO datasets, we also train another basic model based on answer scores and answer score ratios, since such features are specifically designed as a rating of content quality and usefulness.
Surprisingly, our results from all three datasets do not confirm previous research on the importance of content length for quality prediction. For each of our datasets, precision and recall were very low, with an F1 median of 0.619 (SCN: 0.619 / SF: 0.537 / Cooking: 0.644). This might be due to the difference between our data and the data used in the literature, which were taken from general Q&A communities such as Yahoo! Answers [3] and the Naver community [5].
The SF and CO models trained on the answer scores highlight positive correlations between best answers and scores. However, this positive influence diminishes as the data grow from CO to the larger SF: CO shows a high F1 of 0.753 with Answer Score, whereas the SF result is 0.625. Training the SE models on Answer Score Ratios shows even higher results, with an F1 of 0.806 for SF and 0.866 for Cooking. Overall, the answer score ratio appears to be a good predictor of answer quality, which shows that the SF and CO collaborative voting models are effective. In particular, it shows that taking into account the relative voting proportions between answers (i.e., score ratios) is a better approach than only considering absolute scores.
Core Features Models: Here we focus on the comparison of feature types (i.e., user, content and thread features) and the impact of using the extended feature set on the identification process. We trained a model for each dataset and feature set. Results in Table 3 show that using the thread features we introduced in this paper increases accuracy in all three datasets over user and content features. Results also show that the F1 when combining all core user, content, and thread features was 11%, 9.3%, and 5.4% higher for SCN, SF, and CO respectively, than the best F1 achieved when using these feature groups individually.
Fig. 2. Bean Plots representing the distribution of different features and best answers
for the SCN Forums (SCN), the Server Fault (SF) and Cooking (C) datasets
Overall, when using all the core features (common to all datasets), SCN performed better than SF (+7.1%) and CO (+6.4%). Predictions for CO were slightly more accurate than for SF, probably due to its smaller size. However, results in Table 3 show that the F1 with all core features is lower than with the Answer Score Ratio by 4.6% for SF and 9.9% for CO. This reflects the value of this particular feature for best answer identification on such platforms.
Figure 2 shows the distributions of best answers (good) and non-best answers (answer) for post length for all our datasets, and for answer scores for SF and CO. Best answers tend to be shorter in SCN, and longer in SF and CO. This variation could be driven by the difference in data sizes and topics as well as external factors such as community policies (e.g., community editing in SE).
Table 4. Top features ranked by Information Gain Ratio for the SCN, Server Fault and Cooking datasets. Type of feature is indicated by U/C/T for User/Content/Thread
Core Features: First we focus the analysis on the core features set. Table 4 shows that SCN's most important feature for best answer identification appears to be the topical reputation ratio, which also ranks high in the other datasets: 3rd in SF and 5th in CO. The number of answers also comes high in each dataset: 2nd for SCN and SF, and 3rd for CO. Note that our training datasets only contained threads with best answers. Hence the shorter the thread (i.e., the fewer answers), the easier it is to identify the best answer. Similarly, the best answers ratio and number of best answers also proved to be good features for best answer prediction. Figure 3 shows the correlations with best answers (good) and non-best answers (bad) for the top five features in each dataset.
The distribution of SCN topical reputation in Figure 3 is narrower than the distributions for SF and CO. This highlights the difference between the SCN and SE reputation models: contrary to SE, SCN only allows positive reputation. For core features, SF, CO, and SCN have a generally similar mode of operation. However, SCN is less affected by answer position due to the difference in platform editing policies. SE favours small threads whereas SCN does not. This difference leads to a better correlation of the number of answers with best answers in SE.
According to Table 4, user features appear to be dominant, with some thread features amongst the most influential. The number of thread answers and the historical activities of users are particularly useful (e.g., the number and ratio of a user's best answers). User reputation in SCN plays a more important role than in SF and
Fig. 3. Bean Plots representing the distribution of the top five features for the SCN Forums (first row), Server Fault (second row) and Cooking (third row) datasets
CO, which is probably a reflection of the community policies that put emphasis on members' reputation.
In SAP’s SCN, user activity focus seems to play a notable role (topical rep-
utation, answer and question ratios, activity entropy, etc.). These features are
further down the list for SF and CO.
7 Conclusions
References
1. Chai, K., Potdar, V., Dillon, T.: Content quality assessment related frameworks for
social media. In: Proc. Int. Conf. on Computational Science and Its Applications
(ICCSA), Heidelberg (2009)
2. Liu, Y., Agichtein, E.: You’ve got answers: towards personalized models for pre-
dicting success in community question answering. In: Proc. 46th Annual Meeting of
the Association for Computational Linguistics on Human Language Technologies:
Short Papers, Columbus, Ohio (2008)
3. Agichtein, E., Castillo, C., Donato, D., Gionis, A., Mishne, G.: Finding high-quality
content in social media. In: First ACM Int. Conf. on Web Search and Data Mining,
Palo Alto, CA (2008)
4. Bian, J., Liu, Y., Zhou, D., Agichtein, E., Zha, H.: Learning to recognize reliable
users and content in social media with coupled mutual reinforcement. In: Int. World
Wide Web Conf., Madrid (2009)
5. Jeon, J., Croft, W.B., Lee, J.H., Park, S.: A framework to predict the quality of
answers with non-textual features. In: SIGIR, Washington. ACM Press (2006)
6. Zhang, J., Ackerman, M., Adamic, L.: Expertise networks in online communities:
structure and algorithms. In: Proc. 16th Int. World Wide Web Conf., Banff (2007)
7. Jurczyk, P., Agichtein, E.: Discovering authorities in question answer communities
by using link analysis. In: ACM 16th Conf. Information and Knowledge Manage-
ment, CIKM 2007 (2007)
8. Suryanto, M., Lim, E., Sun, A., Chiang, R.: Quality-aware collaborative question
answering: methods and evaluation. In: Proc. 2nd ACM Int. Conf. on Web Search
and Data Mining, Barcelona (2009)
9. McGuinness, D.: Question answering on the semantic web. IEEE Intelligent Sys-
tems 19(1) (2004)
10. Narayanan, S., Harabagiu, S.: Question answering based on semantic structures.
In: Proc. 20th Int. Conf. on Computational Linguistics, Geneva (2004)
11. Lopez, V., Pasin, M., Motta, E.: AquaLog: An Ontology-Portable Question An-
swering System for the Semantic Web. In: Gómez-Pérez, A., Euzenat, J. (eds.)
ESWC 2005. LNCS, vol. 3532, pp. 546–562. Springer, Heidelberg (2005)
12. Wang, Y., Wang, W., Huang, C.: Enhanced semantic question answering system for
e-learning environment. In: 21st Int. Conf. on Advanced Information Networking
and Applications Workshops, AINAW, vol. 2 (2007)
13. Qu, M., Qiu, G., He, X., Zhang, C., Wu, H., Bu, J., Chen, C.: Probabilistic question
recommendation for question answering communities. In: Proc. 18th Int. World
Wide Web Conf., Madrid (2009)
14. Kwok, C., Etzioni, O., Weld, D.: Scaling question answering to the web. ACM
Transactions on Information Systems (TOIS) 19(3) (2001)
15. Harper, F., Moy, D., Konstan, J.: Facts or friends?: distinguishing informational
and conversational questions in social Q&A sites. In: Proc. 27th Int. Conf. on
Human Factors in Computing Systems, CHI, Boston, MA (2009)
16. Kang, J., Kim, J.: Analyzing answers in threaded discussions using a Role-Based
information network. In: Proc. IEEE Int. Conf. Social Computing, Boston, MA
(2011)
17. Rowe, M., Angeletou, S., Alani, H.: Predicting Discussions on the Social Semantic
Web. In: Antoniou, G., Grobelnik, M., Simperl, E., Parsia, B., Plexousakis, D., De
Leenheer, P., Pan, J. (eds.) ESWC 2011, Part II. LNCS, vol. 6644, pp. 405–420.
Springer, Heidelberg (2011)
Characterising Emergent Semantics
in Twitter Lists
Abstract. Twitter lists organise Twitter users into multiple, often over-
lapping, sets. We believe that these lists capture some form of emer-
gent semantics, which may be useful to characterise. In this paper we
describe an approach for such characterisation, which consists of de-
riving semantic relations between lists and users by analyzing the co-
occurrence of keywords in list names. We use the vector space model
and Latent Dirichlet Allocation to obtain similar keywords according to
co-occurrence patterns. These results are then compared to similarity
measures relying on WordNet and to existing Linked Data sets. Results
show that co-occurrence of keywords based on members of the lists pro-
duce more synonyms and more correlated results to that of WordNet
similarity measures.
1 Introduction
The active involvement of users in the generation of content on the Web has led
to the creation of a massive amount of information resources that need to be
organized so that they can be better retrieved and managed. Different strategies
have been used to overcome this information overload problem, including the use
of tags to annotate resources in folksonomies, and the use of lists or collections
to organize them. The bottom-up nature of these user-generated classification
systems, as opposed to systems maintained by a small group of experts, have
made them interesting sources for acquiring knowledge. In this paper we conduct
a novel analysis of the semantics of emergent relations obtained from Twitter
lists, which are created by users to organize others they want to follow.
Twitter is a microblogging platform where users can post short messages known as tweets. Twitter was started in 2006 and has experienced continuous growth since then, currently reaching 100 million users1. In this social network
users can follow other users so that they can receive their tweets. Twitter users
1 http://blog.twitter.com/2011/09/one-hundred-million-voices.html
Fig. 1. Diagram showing different user roles in twitter lists. Boxes indicate list names.
are allowed to classify people into lists (see figure 1). The creator of the list is
known as the curator. List names are freely chosen by the curator and consist of
keywords. Users other than the curator can then subscribe to receive tweets from
the listed users. Similarly to what happens with folksonomies [7,19], the classifi-
cation system formed by connections between curators, subscribers, listed users,
and list names, can be considered as a useful resource for knowledge extraction.
In this work we analyze term co-occurrence patterns in these lists to identify
semantic relations between all these elements. Co-occurrence may happen due
to the simultaneous use of keywords in different lists created by curators, or in
lists followed by subscribers, or in lists under which users are listed.
For instance, Table 1 summarizes the lists under which an active and well-known researcher in the Semantic Web field has been listed. The first column presents the most frequent keywords used by curators of these lists, while the second column shows keywords according to the number of subscribers. We can see that semantic_web and semweb are frequently used to classify this user, which suggests a strong relationship between both keywords. In fact, these keywords can be considered as synonyms since they refer to the same concept. Though less frequent, other keywords such as semantic, tech and web_science are also related to this context. The remaining keywords favoured by subscribers (e.g., connections) are more general and less informative for our purposes.
We consider that Twitter Lists represent a potentially rich source for harvest-
ing knowledge, since they connect curators, members, subscribers and terms. In
this paper we explore which of such connections lead to emergent semantics and
produce most related terms. We analyze terms using the vector space model [24]
and a topic modeling method, the Latent Dirichlet Allocation [5]. Then we use
metrics based on the WordNet synset structure [10,26,16] to measure the se-
mantic similarity between keywords. In addition, we ground keywords to Linked
Open Data and present the relations found between them. This type of analy-
sis lays the foundation for the design of procedures to extract knowledge from
Twitter lists. For instance, ontology development can benefit from the emerging vocabulary that can be obtained from these user-generated sources.
In the following we present the models used to obtain relations between key-
words from Twitter lists. In section 3 we introduce the similarity metrics based
on WordNet, and we describe the technique used to gather relations from linked
data. Next we present, in section 4, the results of our study. Finally we describe
the related work in section 5, and present the conclusions in section 6.
Table 1. Most frequent keywords found in list names where the user has been listed
Curators                     Subscribers
semantic_web    39           semantic_web              570
semweb          22           semweb                    100
semantic         7           who-my-friends-talk-to     93
tech             7           connections                82
web_science      5           rock_stars                 55
We use the vector space model [24] to represent list keywords and their rela-
tionships with curators, members and subscribers. Each keyword is represented by three vectors of different dimensions according to the type of relation represented. The use of vectors allows us to calculate the similarity between keywords using standard measures such as the angle cosine.
Twitter lists can be defined as a tuple T L = (C, M, S, L, K, Rl , Rk ) where
C, M, S, L, and K are sets of curators, members (of lists), subscribers, list names,
and keywords respectively, Rl ⊆ C × L × M defines the relation between curators, list names, and members, and Rk ⊆ L × K represents keywords appearing in a
list name. A list φ is defined as (c, l, Mc,l ) where Mc,l = {m ∈ M |(c, l, m) ∈ Rl }.
A subscription to a list can be represented then by (s, c, l, Mc,l). To represent
keywords we use the following vectors:
- For the use of a keyword k according to curators we define kcurator as a vector of dimension |C| where the entry wc = |{(c, l, Mc,l) | (l, k) ∈ Rk}| corresponds to the number of lists created by the curator c that contain the keyword k.
- For the use of a keyword k according to members we use a vector kmember of dimension |M| where the entry wm = |{(c, l, m) ∈ Rl | (l, k) ∈ Rk}| corresponds to the number of lists containing the keyword k under which the member m has been listed.
- For the use of a keyword k according to subscribers we utilize a vector ksubscriber of dimension |S| where the entry ws = |{(s, c, l, Mc,l) | (l, k) ∈ Rk}| corresponds to the number of times that s has subscribed to a list containing the keyword k.
In the vector space model we can measure the similarity between keywords by calculating the cosine of the angle between the corresponding vectors in the same dimension. For two vectors k_i and k_j the similarity is

sim(k_i, k_j) = \frac{k_i \cdot k_j}{\|k_i\| \, \|k_j\|}
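The curator-based vectors and the cosine measure above can be sketched as follows; this is our own illustration, and the tuple layout and helper names (relations, keywords_of) are hypothetical.

import math
from collections import defaultdict

def curator_vectors(relations, keywords_of):
    # relations: iterable of (curator, list_name, member) tuples, i.e. Rl.
    # keywords_of(list_name): the keywords appearing in that list name, i.e. Rk.
    vectors = defaultdict(lambda: defaultdict(int))   # keyword -> {curator: count}
    seen = set()
    for curator, list_name, _member in relations:
        if (curator, list_name) in seen:
            continue                                   # count each curator's list once
        seen.add((curator, list_name))
        for keyword in keywords_of(list_name):
            vectors[keyword][curator] += 1
    return vectors

def cosine(u, v):
    # u, v: sparse vectors given as {dimension: weight} dictionaries.
    dot = sum(w * v.get(d, 0) for d, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

Member- and subscriber-based vectors would be built analogously from Rl and from the subscription tuples.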
We also use Latent Dirichlet Allocation (LDA) [5] to obtain similar keywords.
LDA is an unsupervised technique where documents are represented by a set
of topics and each topic consists of a group of words. The LDA topic model is an improvement over bag-of-words approaches, including the vector space model, since LDA does not require documents to share words to be judged similar. As long as they contain similar words (words that appear together with the same words in other documents) they will be judged similar. Thus documents are viewed as a mixture of probabilistic topics that are represented as a T-dimensional random
variable θ. For each document, the topic distribution θ has a Dirichlet prior
p(θ|α) ∼ Dir(α). In the generative story, each document is generated by first picking a topic distribution θ from the Dirichlet prior and then using that topic distribution to sample latent topic variables zi. LDA makes the assumption that
each word is generated from one topic where zi is a latent variable indicating
the hidden topic assignment for word wi . The probability of choosing a word wi
under topic zi , p(wi |zi , β), depends on different documents.
We use the bag of words model to represent documents as input for LDA. For
our study keywords are documents and words are the different users according
to their role in the list structure. To represent keywords we use the following
sets:
- For a keyword k according to curators we use the set kbagCurator = {c ∈
C|(c, l, m) ∈ Rl ∧ (l, k) ∈ Rk } representing the curators that have created a list
containing the keyword k.
- For a keyword k according to members we use a set kbagMember = {m ∈
M |(c, l, m) ∈ Rl ∧(l, k) ∈ Rk } corresponding to the users who have been classified
under lists containing the keyword k.
- For a keyword k according to subscribers we use a set kbagSubscriber = {s ∈
S|(s, c, l, Mc,l) ∧ (l, k) ∈ Rk }, that is the set of users that follow a list containing
the keyword k.
LDA is then executed for all the keywords in the same representation schema
(i.e., based on curators, members, or subscribers) generating a topic distribution
θ for each document. We can compute similarity between two keywords ki and
kj in the same representation schema by measuring the angle cosine of their
corresponding topic distributions θi and θj .
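A minimal sketch of this step, assuming the bags of users per keyword have already been built; gensim is our choice of library (the paper does not name a toolkit), and the variable names are illustrative.

from gensim import corpora, models
from gensim.matutils import cossim

def lda_topic_distributions(bags, num_topics=50):
    # bags: dict mapping each keyword to the list of user ids playing one role
    # (curator, member, or subscriber) in lists containing that keyword.
    documents = [[str(user) for user in users] for users in bags.values()]
    dictionary = corpora.Dictionary(documents)
    corpus = [dictionary.doc2bow(doc) for doc in documents]
    lda = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    # Topic distribution theta for every keyword (document).
    return {kw: lda.get_document_topics(bow, minimum_probability=0.0)
            for kw, bow in zip(bags.keys(), corpus)}

def keyword_similarity(thetas, ki, kj):
    # Cosine of the angle between the two topic distributions.
    return cossim(thetas[ki], thetas[kj])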
A natural measure of similarity between words is the length of the path con-
necting the corresponding synsets [22,16]. The shorter the path the higher the
similarity. This length is usually calculated in the noun and verb is-a hierar-
chy according to the number of synsets in the path connecting the two words.
In the case of two synonyms, both words belong to the same synset and thus
the path length is 1. A path length of 2 indicates an is-a relation. For a path
length of 3 there are two possibilities: (i) both words are under the same hy-
pernym known as common subsumer, and therefore the words are siblings, and
(ii) both words are connected through an in-between synset defining an in-
direct is-a relation. For path lengths of 4 or more, the interpretation becomes harder.
However, the weakness of using path length as a similarity measure in WordNet is that it does not take into account the level of specificity of synsets in the hierarchy. For instance, measure and communication have a path length of 3 and share abstraction as a common subsumer. Despite the low path length, this relation may not correspond to the human concept of similarity due to the high level of abstraction of the concepts involved.
Abstract synsets appear at the top of the hierarchy, while more specific ones are placed at the bottom. Thus, Wu and Palmer [26] propose a similarity measure which includes the depth of the synsets and of the least common subsumer (see equation 1). The least common subsumer lcs is the deepest hypernym that subsumes both synsets, and depth is the length of the path from the root to the synset. This similarity ranges between 0 and 1; the larger the value, the greater the similarity between the terms. For the terms measure and communication, both synsets have depth 4, and the depth of the lcs abstraction is 3; therefore, their similarity is 0.75.
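Equation 1 is not reproduced in the text above; the standard Wu and Palmer formulation, consistent with the worked example (2 · 3 / (4 + 4) = 0.75), is:

sim_{WP}(s_1, s_2) = \frac{2 \cdot depth(lcs(s_1, s_2))}{depth(s_1) + depth(s_2)} \qquad (1)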
Jiang and Conrath [16] propose a distance measure that combines hierarchical and distributional information. Their formula includes features such as local network density (i.e., children per synset), synset depth, weight according to the link type, and the information content IC of the synsets and of the least common subsumer. The information content of a synset is calculated as the inverse log of its probability of occurrence in the WordNet hierarchy. This probability is based on the frequency of words subsumed by the synset. As the probability of a synset increases, its information content decreases. The Jiang and Conrath distance can be computed using equation 2 when only the information content is used. A shorter distance means a stronger semantic relation. The IC of measure and communication is 2.95 and 3.07 respectively, while abstraction has an IC of 0.78; thus their semantic distance is 4.46.
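Equation 2 is likewise missing from the text; the information-content-only form of the Jiang and Conrath distance, which matches the example (2.95 + 3.07 − 2 · 0.78 = 4.46), is:

dist_{JC}(s_1, s_2) = IC(s_1) + IC(s_2) - 2 \cdot IC(lcs(s_1, s_2)) \qquad (2)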
We use, in section 4, the path length, Wu and Palmer similarity, and Jiang and
Conrath distance to study the semantics of the relations extracted from Twitter
lists using the vector space model and LDA.
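As a hedged illustration of how these three WordNet measures could be computed in practice, the sketch below uses NLTK (our choice; the paper does not name a toolkit) and requires the wordnet and wordnet_ic corpora to be downloaded:

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')   # information content from the Brown corpus

def wordnet_measures(word1, word2):
    s1 = wn.synsets(word1, pos=wn.NOUN)[0]   # first noun sense, for illustration
    s2 = wn.synsets(word2, pos=wn.NOUN)[0]
    return {
        # number of synsets on the connecting path (NLTK returns its reciprocal)
        'path_length': 1.0 / s1.path_similarity(s2),
        'wu_palmer': s1.wup_similarity(s2),
        # NLTK returns a similarity; its reciprocal is the JC distance used here
        'jiang_conrath_distance': 1.0 / s1.jcn_similarity(s2, brown_ic),
    }

print(wordnet_measures('measure', 'communication'))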
the resources. In our case we discard the initial owl:sameAs relation between
DBpedia and OpenCyc resources, and keep the assertion that Anthropology
and Sociology are Social Sciences.
Fig. 2. Linked data showing the relation between the anthropology and sociology keywords (diagram omitted; both keywords are grounded to DBpedia resources that are owl:sameAs OpenCyc resources of rdf:type opencyc:social science)
PREFIX dbpr: <http://dbpedia.org/resource/>
SELECT *
WHERE { dbpr:Anthropology ?relation1 ?node1 . ?node1 ?relation2 ?node2 .
        dbpr:Sociology ?relation4 ?node3 . ?node3 ?relation3 ?node2 . }

Listing 1.1. SPARQL query for finding relations between two DBpedia resources
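The listing could be executed against the public DBpedia endpoint roughly as follows; this usage sketch (including the LIMIT clause) is our own addition and uses SPARQLWrapper, which is not mentioned in the paper.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbpr: <http://dbpedia.org/resource/>
    SELECT * WHERE {
      dbpr:Anthropology ?relation1 ?node1 . ?node1 ?relation2 ?node2 .
      dbpr:Sociology    ?relation4 ?node3 . ?node3 ?relation3 ?node2 .
    } LIMIT 100
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    # Each row is one indirect path connecting the two resources.
    print(row["relation1"]["value"], row["node2"]["value"], row["relation3"]["value"])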
4 Experiment Description
Data Set: Twitter offers an Application Programming Interface (API) for data
collection. We collected a snowball sample of users and lists as follows. Starting
with two initial seed users, we collected all the lists they subscribed to or are
members of. There were 260 such lists. Next, we expanded the user layer based
on current lists by collecting all other users who are members of or subscribers to
these lists. This yielded an additional set of 2573 users. In the next iteration, we
expanded the list layers by collecting all lists that these users subscribe to or are
members of. In the last step, we collected 297,521 lists under which 2,171,140
users were classified. The lists were created by 215,599 distinct curators, and
616,662 users subscribe to them6 . From list names we extracted, by approximate
matching of the names with dictionary entries, 5932 unique keywords; 55% of
them were found in WordNet. The dictionary was created from article titles and
redirection pages in Wikipedia.
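The snowball collection loop described above could be sketched as follows; the two fetch helpers are hypothetical placeholders for Twitter API calls and are not part of the paper.

def snowball(seed_users, iterations, fetch_lists_for_user, fetch_users_for_list):
    # fetch_lists_for_user(u): lists u subscribes to or is a member of (hypothetical helper)
    # fetch_users_for_list(l): members of and subscribers to list l (hypothetical helper)
    users, lists = set(seed_users), set()
    frontier_users = set(seed_users)
    for _ in range(iterations):
        # Expand the list layer from the current user frontier.
        new_lists = {l for u in frontier_users for l in fetch_lists_for_user(u)} - lists
        lists |= new_lists
        # Expand the user layer from the newly discovered lists.
        frontier_users = {u for l in new_lists for u in fetch_users_for_list(l)} - users
        users |= frontier_users
    return users, lists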
Obtaining Relations from Lists: For each keyword we created the vectors
and the bags of words for each of the three user-based representations defined in
section 2. We calculated cosine similarity in the corresponding user-based vector
space. We also ran the LDA algorithm over the bags of words and calculated the cosine similarity between the topic distributions produced for each document. We
kept the 5 most similar terms for each keyword according to the Vector-space
and LDA-based similarities.
6
The data set can be found here: http://goo.gl/vCYyD
Fig. 3. Coefficient of correlation between Vector-space and LDA similarity with respect to WordNet measures (JC distance and WP similarity), for the models based on members, subscribers, and curators (chart data omitted)

Fig. 4. Average Jiang and Conrath distance and Wu and Palmer similarity (chart data omitted)
WordNet Analysis: For each pair of similar keywords we calculated their sim-
ilarity according to Jiang and Conrath (JC) and Wu and Palmer (WP) formulas.
To gain an initial insight into these measures we calculate the correlation between them (see Figure 3). We use Pearson's correlation coefficient, which divides the covariance of the two variables by the product of their standard deviations.
In general these results show that Vector-space and LDA similarity based on members produce the results most similar to those of the WordNet measures. Vector-space similarity based on subscribers and curators also produces correlated results, although significantly lower. The results of LDA similarity based on subscribers are correlated with JC distance but not with WP similarity. Finally, LDA based on curators produces results that are not correlated with the WordNet similarities.
The correlation results can be partially explained by measuring the average JC distance and WP similarity7 (see figure 4). Vector-space and LDA similarities based on members have the shortest JC distance and two of the top three WP similarity values. Vector-space similarity based on subscribers also has a short JC distance and a high WP similarity. For the rest of the similarities, JC distances are longer and WP similarities lower.
7 The averages were calculated over relations for which both terms were in WordNet.
Table 2. Path length in WordNet for similar Keywords according to Vector-space and
LDA models
Fig. 5. Relations according to the depth of the least common subsumer LCS (from ≥ 5 to ≥ 10), for the VSM and LDA models based on members, subscribers, and curators (chart data omitted)
Fig. 6. Relations according to the path length (3 to 10) for those cases where the least common subsumer has depth greater than or equal to 5, for the VSM and LDA models based on members, subscribers, and curators (chart data omitted)
In addition to the depth of the LCS, the other variable to explore is the
length of the path setting up the relation. The stacked columns in figure 6 show
the cumulative percentage of relations found by Vector-space and LDA models
according to the path length of the relation in WordNet, with a depth of the least
common subsumer greater than or equal to 5. From the chart we can state that
Vector-space similarity based on subscribers produces the highest percentage of
relations (26.19%) with a path length ≤ 10. This measure also produces the
highest percentage of relations for path lengths ranging from 9 to 4. The Vector-
space similarity based on members produces the second highest percentage of
relations for path lengths from 10 to 6.
In summary, we have shown that similarity models based on members produce
the results that are most directly related to the results of similarity measures
based on WordNet. These models find more synonyms and direct is-a relations when compared to the models based on subscribers and curators. These results
suggest that some users are classified under different lists named with synonyms
or with keywords representing a concept in a distinct level of specificity. We also
discovered that the majority of relations found by any model have a path length
≥ 3 and involve a common subsumer. Vector-space model based on subscribers
produces the highest number of relations that can be considered specific (depth
of LCS ≥ 5 or 6). However, for more specific relations ( 7 ≤ depth of LCS ≤
9) similarity models based on members produce a higher number. In addition
we considered the path length, for those relations containing a LCS placed in a
depth ≥ 5 in the hierarchy, as a variable influencing the relevance of a relation.
The Vector-space model based on subscribers finds the highest number of relations with 4 ≤ length ≤ 10. In general, similarity models based on curators produce a lower number of relations. We think this may be due to the scarcity of lists per curator: in our dataset each curator has created 1.38 lists on average.
Linked Data Analysis: Our approach found DBpedia resources for 63.77% of the keywords extracted from Twitter Lists. On average, for 41.74% of the relations we found the related keywords in DBpedia. For each relation found by Vector-space or LDA similarity we query the linked data set looking for patterns between the related keywords. Figure 7 shows the results according to the path length of the relations found in the linked data set. These results are similar to the ones produced by the WordNet similarity measures. That is, similarity based on members produces the highest number of synonyms and direct relations, though in this case Vector-space similarity produces more synonyms than LDA. Vector-space similarity based on subscribers has the highest number of relations of length 3, followed by Vector-space and LDA similarity based on members.
(Fig. 7: relations according to the path length in the linked data set, for the VSM and LDA models based on members, subscribers, and curators; chart data omitted)
Given that the Vector-space model based on members found the majority of
direct relations, we present, in table 3, the relations identified in the linked data
set. Broad term and subClassOf are among the most frequent relations. This
means that members of lists are usually classified in lists named with keywords
representing a concept with a different level of specificity. Other relations that
are difficult to elicit from traditional lexicons are also obtained, such as developer,
genre or largest city.
Table 4. Indirect relations of length 3 found in the linked data set for the relations established by the Vector-space model based on subscribers

Pattern rs --relation1--> object <--relation2-- rt
  relation1 / relation2            Share     Example
  type / type                      67.35%    nokia → company ← intel
  subClassOf / subClassOf          30.61%    philanthropy → activities ← fundraising

Pattern rs <--relation1-- object --relation2--> rt
  relation1 / relation2            Share     Example
  genre / genre                    12.43%    theater ← Aesthetica → film
  genre / occupation               10.27%    fiction ← Adam Maxwell → writer
  occupation / occupation           8.11%    poet ← Alina Tugend → writer
  product / product                 7.57%    clothes ← ChenOne → fashion
  product / industry                9.73%    blogs ← UserLand Software → internet
  occupation / known for            5.41%    author ← Adeline Yen Mah → writing
  known for / known for             3.78%    skeptics ← Rebecca Watson → atheist
  main interest / main interest     3.24%    politics ← Aristotle → government
5 Related Work
Twitter has been investigated from different perspectives including network char-
acteristics, user behaviors, and tweet semantics among others. Twitter network
properties, geographical features, and users have been studied in [15,17]. In [15] the authors use the HITS algorithm to identify hubs and authorities from the network structure, while in [17] the authors categorise users according to their behav-
iors. To identify the tweet semantics some proposals [2,1,23,6] annotate them
with semantic entities using available services such as Zemanta, Open Calais,
and DBpedia Spotlight [21]. In [2] tweets are linked to news articles and are
enriched with semantic annotations to create user profiles. These semantic an-
notations of tweets have been used in a faceted search approach [1]. In [23]
tweets and their semantic annotations are represented according to existing vo-
cabularies such as FOAF, Dublin Core, and SIOC, and are used to map tweets
to websites of conferences and events. In [6] the authors use the semantic entities identified in tweets to obtain the concepts associated with user profiles. In addition, some classifiers have been proposed in [8] to extract players and events from sport tweets. Twitter allows the use of hashtags as a way to keep conversations around certain topics. In [18] the authors have studied hashtags as candidate identifiers of concepts.
With respect to Twitter Lists, they have been used to distinguish elite users,
such as celebrities, media, organizations, and bloggers [25]. In this work the authors provide an analysis of the information flow on Twitter and show the dueling importance of mass media and opinion leaders. In addition, in [9] lists have been used
as a source for discovering latent characteristics of users.
In the broader context of the Web 2.0 the emerging semantics of folksonomies
have been studied under the assumption that it is possible to obtain a vocabulary
from these classification systems. In folksonomies the set of tags around resources
tends to converge [13] and users in the same social groups are more likely to use
the same set of tags [20]. The semantics of the emerging relations between tags
have been studied in [7,19]. A survey of the state of the art on this matter can
be found in [11].
6 Conclusions
In this paper we have described different models to elicit semantic relations from Twitter lists. These models represent keyword co-occurrence in lists based on three user roles: curators, subscribers and members. We measure similarity between keywords using the vector-space model and a topic-based model known as LDA. Then we use WordNet similarity measures, including Wu and Palmer similarity and Jiang and Conrath distance, to compare the results of the vector-space and LDA models.
Results show that applying vector-space and LDA metrics based on members produces the results most correlated with those of the WordNet-based metrics. We found that these measures produce relations with the shortest Jiang and Conrath distance and high Wu and Palmer similarities. In addition, we categorize the relations found by each model according to the path length in WordNet. Models based on members produce the highest number of synonyms and of direct is-a relations. However, most of the relations have a path length ≥ 3 and have
a common subsumer. We analyze these relations using the depth of the LCS
and the path length as variables that help to identify the relevance of relations.
This analysis shows that the vector-space model based on subscribers finds the
highest number of relations when relevance is defined by a depth of LCS ≥ 5,
and the path length of relations is between 10 and 4.
We also investigate the type of relations found by each of the models using
general knowledge bases published as linked data. We categorize the relations
elicited by each model according to the path length in the linked data set. These
results confirm that the models based on members produce the highest number of
synonyms and direct relations. In addition, we find that direct relations obtained
from models based on members are mostly Broader Term and subclassOf. Finally,
we study the type of relations obtained from the vector-space model based on
subscribers with a path length of 3 and find that mostly they represent sibling
keywords sharing a common class, and subjects that are related through an
individual.
References
1. Abel, F., Celik, I., Houben, G.-J., Siehndel, P.: Leveraging the Semantics of Tweets
for Adaptive Faceted Search on Twitter. In: Aroyo, L., Welty, C., Alani, H., Taylor,
J., Bernstein, A., Kagal, L., Noy, N., Blomqvist, E. (eds.) ISWC 2011, Part I. LNCS,
vol. 7031, pp. 1–17. Springer, Heidelberg (2011)
2. Abel, F., Gao, Q., Houben, G.-J., Tao, K.: Semantic Enrichment of Twitter Posts
for User Profile Construction on the Social Web. In: Antoniou, G., Grobelnik, M.,
Simperl, E., Parsia, B., Plexousakis, D., De Leenheer, P., Pan, J. (eds.) ESWC
2011, Part II. LNCS, vol. 6644, pp. 375–389. Springer, Heidelberg (2011)
3. Bizer, C., Heath, T., Berners-Lee, T.: Linked Data - The Story So Far. International
Journal on Semantic Web and Information Systems, IJSWIS (2009)
4. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hell-
mann, S.: DBpedia - A crystallization point for the Web of Data. Journal of Web
Semantic 7(3), 154–165 (2009)
5. Blei, D., Ng, A., Jordan, M.: Latent Dirichlet allocation. Journal of Machine Learn-
ing Research 3, 993–1022 (2003)
6. Cano, A.E., Tucker, S., Ciravegna, F.: Follow me: Capturing entity-based seman-
tics emerging from personal awareness streams. In: Making Sense of Microposts
(#MSM 2011), pp. 33–44 (2011)
7. Cattuto, C., Benz, D., Hotho, A., Stumme, G.: Semantic Grounding of Tag Re-
latedness in Social Bookmarking Systems. In: Sheth, A.P., Staab, S., Dean, M.,
Paolucci, M., Maynard, D., Finin, T., Thirunarayan, K. (eds.) ISWC 2008. LNCS,
vol. 5318, pp. 615–631. Springer, Heidelberg (2008)
8. Choudhury, S., Breslin, J.: Extracting semantic entities and events from sports
tweets. In: Proceedings of the ESWC 2011 Workshop on ’Making Sense of Micro-
posts’. CEUR Workshop Proceedings, vol. 718 (May 2011)
9. Kim, D., Jo, Y.: Analysis of Twitter lists as a potential source for discovering latent characteristics of users. In: Workshop on Microblogging at the ACM Conference on Human Factors in Computer Systems (CHI 2010), Atlanta, GA, USA (2010)
10. Fellbaum, C.: WordNet and wordnets, 2nd edn., pp. 665–670. Elsevier, Oxford
(2005)
11. García-Silva, A., Corcho, O., Alani, H., Gómez-Pérez, A.: Review of the state of the
art: discovering and associating semantics to tags in folksonomies. The Knowledge
Engineering Review 27(01), 57–85 (2012)
12. García-Silva, A., Szomszor, M., Alani, H., Corcho, O.: Preliminary results in tag
disambiguation using dbpedia. In: Knowledge Capture (K-Cap 2009)-Workshop on
Collective Knowledge Capturing and Representation-CKCaR (2009)
13. Golder, S.A., Huberman, B.A.: Usage patterns of collaborative tagging systems.
Journal of Information Science 32(2), 198–208 (2006)
14. Heim, P., Lohmann, S., Stegemann, T.: Interactive Relationship Discovery via the
Semantic Web. In: Aroyo, L., Antoniou, G., Hyvönen, E., ten Teije, A., Stucken-
schmidt, H., Cabral, L., Tudorache, T. (eds.) ESWC 2010, Part I. LNCS, vol. 6088,
pp. 303–317. Springer, Heidelberg (2010)
15. Java, A., Song, X., Finin, T., Tseng, B.: Why We Twitter: An Analysis of a Mi-
croblogging Community. In: Zhang, H., Spiliopoulou, M., Mobasher, B., Giles, C.L.,
McCallum, A., Nasraoui, O., Srivastava, J., Yen, J. (eds.) WebKDD 2007. LNCS,
vol. 5439, pp. 118–138. Springer, Heidelberg (2009)
16. Jiang, J.J., Conrath, D.W.: Semantic similarity based on corpus statistics and
lexical taxonomy. CoRR, cmp-lg/9709008 (1997)
17. Krishnamurthy, B., Gill, P., Arlitt, M.: A few chirps about twitter. In: Proceedings
of the First Workshop on Online Social Networks, WOSN 2008, pp. 19–24. ACM,
New York (2008)
18. Laniado, D., Mika, P.: Making Sense of Twitter. In: Patel-Schneider, P.F., Pan, Y.,
Hitzler, P., Mika, P., Zhang, L., Pan, J.Z., Horrocks, I., Glimm, B. (eds.) ISWC
2010, Part I. LNCS, vol. 6496, pp. 470–485. Springer, Heidelberg (2010)
19. Markines, B., Cattuto, C., Menczer, F., Benz, D., Hotho, A., Gerd, S.: Evaluating
similarity measures for emergent semantics of social tagging. In: Proceedings of
the 18th International Conference on World Wide Web, WWW 2009, pp. 641–650.
ACM, New York (2009)
20. Marlow, C., Naaman, M., Boyd, D., Davis, M.: Ht06, tagging paper, taxonomy,
flickr, academic article, to read. In: Proceedings of the Seventeenth Conference on
Hypertext and Hypermedia, pp. 31–40. ACM Press (2006)
21. Mendes, P.N., Jakob, M., García-Silva, A., Bizer, C.: Dbpedia spotlight: Shedding
light on the web of documents. In: Proceedings of the 7th International Conference
on Semantic Systems, I-Semantics (2011)
22. Pedersen, T., Patwardhan, S., Michelizzi, J.: Wordnet: Similarity - measuring the
relatedness of concepts. In: AAAI, pp. 1024–1025. AAAI Press / The MIT Press
(2004)
23. Rowe, M., Stankovic, M.: Mapping tweets to conference talks: A goldmine for
semantics. In: Social Data on the Web Workshop, International Semantic Web
Conference (2010)
24. Salton, G., Mcgill, M.J.: Introduction to Modern Information Retrieval. McGraw-
Hill, Inc., New York (1986)
25. Wu, S., Hofman, J.M., Mason, W.A., Watts, D.J.: Who says what to whom on
twitter. In: Proceedings of the 20th International Conference on World Wide Web,
WWW 2011, pp. 705–714. ACM, New York (2011)
26. Wu, Z., Palmer, M.: Verb semantics and lexical selection. In: Proc. of the 32nd
Annual Meeting on Association for Computational Linguistics, pp. 133–138 (1994)
Crowdsourcing Taxonomies
Abstract. Taxonomies are great for organizing and searching web content. As such, many popular classes of web applications utilize them. However, their manual generation and maintenance by experts is a time-costly procedure, resulting in static taxonomies. On the other hand, mining and statistical approaches may produce low-quality taxonomies. We thus propose a drastically new approach, based on the proven, increasing human involvement in and desire to tag/annotate web content. We define the required input from humans in the form of explicit structural relationships, e.g., supertype-subtype relationships between concepts. Hence we harvest, via common annotation practices, the collective wisdom of users with respect to the (categorization of) web content they share and access. We further define the principles upon which crowdsourced taxonomy construction algorithms should be based. The resulting problem is NP-Hard. We thus provide and analyze heuristic algorithms that aggregate human input and resolve conflicts. We evaluate our approach with synthetic and real-world crowdsourcing experiments and on a real-world taxonomy.
1 Introduction
Social media applications and research are receiving increasing attention. A key defining characteristic is the increased human involvement. Even before today's success of social media applications, many applications became extremely successful due to the clever exploitation of implicit human input (e.g., Google's ranking function) or explicit human input (e.g., Linux open source contributions). Social media and the social web have taken this to the next level. Humans contribute content and share, annotate, tag, rank, and evaluate content. Specialized software aggregates such human input for various applications (from content search engines to recommendation systems, etc.). The next wave in this thread comes from crowdsourcing systems, in which key tasks are performed by humans (either in isolation or in conjunction with automata) [7]. Lately, within the realm of data and information retrieval systems, crowdsourcing is gaining momentum as a means to improve system performance/quality [3,13]. A
This work was partially funded by the EIKOS research project, within the THALES
framework, administered by the Greek Ministry for Education, Life Long Learning,
and Religious Affairs.
Our model does not depend on any “experts”. We adopt an automated approach,
with the additional feature that users explicitly provide us with relations between
the keywords (so-called “tags”) they employ to annotate the content they share.
Humans have a good understanding of the supertype-subtype relations between
various thematic categories, since these naturally exist around them. So, we
aim to exploit extended tagging and categorization capabilities in order to
develop high-quality taxonomies. We ask users to contribute metadata in
this format: tag_a → tag_d. Here, tag_a is a supertype topic and represents a higher
node in a potential concept hierarchy, whilst tag_d is a subtype topic. The arrow
between them connotes an Ancestor → Descendant (A → D) relation. In figure
build it from scratch. Based on the above example, suppose that we eventually
receive an Arts → Music vote. If we had previously made a mistake when
filling the knowledge gap between Arts and Music, an efficient rectification
must take place based on the now complete information. To sum up,
dynamic, piece-wise, online taxonomy maintenance is a key characteristic.
As mentioned, the shape and structure of the taxonomy emerge from the crowd’s
subjective will, as it evolves. In general, as the number of participants increases,
the output quality becomes higher. When conflicts arise, several conflicting
taxonomy states emerge as alternatives. One of them will be associated with the
greatest number of votes. In this way, the new accepted state of the taxonomy will
emerge. At the end, this process will converge to a structure, entirely defined from
the community’s aggregated knowledge. At this point, we can claim that this
final product objectively depicts a complete taxonomy. But how do we evaluate
such a taxonomy? We would like to compare it against a gold-standard taxonomy
and see how they differ; but there is no standardized, “ideal” structure on which
everyone agrees. Even if we compare taxonomies created by experts there will
not be a 100% match, since both the rules for creating the taxonomy and the
input data are often contradictory and obscure.
Users are asked to provide us with Ancestor → Descendant relations, but
any given vote has its own interpretation, depending on the current state of
the taxonomy. Any incoming tag relation will be classified into a category, based
on the relative positions of the nodes it touches. For example, in figure 1 the
following possible scenarios can arise as far as an incoming vote is concerned:
relationship between these nodes, our algorithms for handling this situation
will be able to eventually yield the proper tree structure. If, however, there
is no such relation between the nodes of a crosslink, then our algorithms will
inescapably produce a supertype-subtype relation between the two. When a
relation like this arises, special handling is required: our current idea for this
involves users supplying “negative” votes when they see incorrect crosslink
relationships established in the current taxonomy. We leave discussion of this
to future work.
Before we continue with the problem formulation we specify two solution in-
variants our approach maintains and give some insight into the taxonomy-building
algorithm that follows.
Tree Properties: This is the primary invariant we maintain. A tree is an
undirected graph in which any two vertices are connected by exactly one path.
There are no cycles, and this is a principle we maintain throughout the taxonomy's
evolution. Starting with many shallow subtrees, as votes enter the system, we
detect relations between more and more tags. The independent trees gradually
form a forest and we use a common “Global Root” node to join them.
Maximum Votes Satisfiability: We also wish to preserve a quantitative
characteristic. Our purpose is to utilize every incoming Ancestor → Descendant
relation and embed it in the current structure. If conflicts arise, our solution
is to derive a taxonomy structure which, as a whole, satisfies the maximum
number of users' votes. At this point we need to mention that, according to our
model, there is no constraint on the number of votes a user can submit to the
system (see “free-for-all tagging” in [15]). Satisfiability is measured not on a
per-user basis but over all votes.
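The following is a minimal sketch (not the authors' implementation) of how this invariant could be measured: a vote tag_a → tag_d counts as satisfied if a is an ancestor of d in the current tree, and the total is summed over all votes. The parent-map representation is an assumption made purely for illustration.

```python
# Minimal sketch: measuring how many Ancestor -> Descendant votes the
# current taxonomy satisfies. The parent-map layout is illustrative.

def is_ancestor(parent, a, d):
    """Return True if node a lies on the path from d up to the root."""
    node = parent.get(d)
    while node is not None:
        if node == a:
            return True
        node = parent.get(node)
    return False

def satisfied_votes(parent, votes):
    """Count votes (a, d) whose Ancestor -> Descendant relation holds."""
    return sum(1 for a, d in votes if is_ancestor(parent, a, d))

# Example: a tiny taxonomy rooted at a synthetic global root.
parent = {"Arts": "ROOT", "Music": "Arts", "Jazz": "Music"}
votes = [("Arts", "Music"), ("Arts", "Jazz"), ("Jazz", "Arts")]
print(satisfied_votes(parent, votes))  # -> 2
```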
Finally, we need to state that our solution does not take any measures to
address synonymy or polysemy issues. Although, according to [12], these are not
major problems for social media, we admit that users tend to annotate their
content with idiosyncratic tags, which in our case can lead to wrong keyword
interpretation and create links that users do not intend to recommend. This
issue is orthogonal to our work since we focus on structural development and
thus we can assume a controlled vocabulary without loss of generality.
Lastly, in case w is a separate node in the tree (line 18), the new relation
forms a crosslink and is handled appropriately. Hereafter, we describe every tree
transformation triggered by each of these cases.
TRSFM Create New Tree: In this simple scenario the taxonomy does not
yet include any of the two nodes of the new vote. So the ancestor node u is
attached to the global root via a shortcut (R → u) and node v plays the role of
its child (see figure 2a.).
This addition forms a new tree with only two nodes. In the future, it will be
expanded with more nodes or get merged to another expanding tree.
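As an illustration, the sketch below mimics this transformation under the assumption that the taxonomy is kept as a simple parent map rooted at a synthetic global root; names such as GLOBAL_ROOT and the shortcut set are illustrative, not the paper's code.

```python
# Sketch of the "Create New Tree" transformation under the assumption that
# the taxonomy is stored as a parent map rooted at a synthetic global root.

GLOBAL_ROOT = "ROOT"

def create_new_tree(parent, shortcuts, u, v):
    """Handle a vote u -> v when neither node is in the taxonomy yet:
    attach u under the global root via a shortcut and make v its child."""
    parent[u] = GLOBAL_ROOT
    shortcuts.add((GLOBAL_ROOT, u))   # R -> u is a shortcut, not a voted edge
    parent[v] = u

parent, shortcuts = {}, set()
create_new_tree(parent, shortcuts, "Arts", "Music")
print(parent)  # {'Arts': 'ROOT', 'Music': 'Arts'}
```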
TRSFM Merge: Merge is used in two similar cases. In line 8 of the core
algorithm we ask to attach a new node u to our taxonomy but in a generic
scenario its descendant node v already has a parent node. The same happens
in line 12, where we need to annex u's tree to that of v with
a link between them. In figure 3a we observe that both node C and u “compete
for the paternity” of v. In order to maintain the tree properties, only one of
the potential parents can be directly connected with v. We arbitrarily choose
node C to be the direct ancestor (parent) of v and set node u to be parent of
C, which is in accordance with the Maximum Votes Satisfiability invariant, since
an Ancestor → Descendant (u → v) relation takes place. The (temporary)
state formed in the middle of figure 3a suffers from the same “paternity conflict”
problem - now between A and u over C. Following the same reasoning, we finally
place node u on top of v’s tree being now the parent of v’s root. Since there is
no given relation yet between u and A we form a shortcut between them. We
also maintain a forward edge from u to v so as not to lose the information we have
regarding the vote for the u → v relationship.
Definition 2 Forward Edge: A latent relation between two nodes. The source
node is an ancestor in the taxonomy and the target is a descendant. Forward
edges do not refer to Parent → Child links and remain hidden, since they violate
the tree's properties.
The idea behind this transformation is that since we don’t have enough evidence
to decide on the partial order of v's ancestors, we temporarily send node u to the
root. Subsequent relations will reveal the correct order.
If v is a root of a tree, an Attach New Child transformation is called.
TRSFM Expand Vertically: In this case the newly incoming vote is inter-
preted as a crosslink according to the current state of the taxonomy. As shown in
figure 3b, the logic we follow resembles, to some extent, that of the Merge transformation.
First, we locate the common ancestor A of u and v and break the link between it
and the immediate root E of the subtree that includes v. The independent sub-
tree is now linked with u via a shortcut formed between u and its root E. Two
forward edges are spawned to indicate latent Ancestor → Descendant relations.
For the sake of completeness we report here that in contrast to common edges
between nodes, shortcuts are not converted to forward edges when the link that
connects the two nodes breaks.
Algorithm 2. Crosslink Elimination
total nodes. This is an extreme scenario where all the tree nodes form a chain.
As we already stated, at this point, the BCR algorithm forms n possible tree
instances, with every one of them representing a unique cyclic rotation of the
nodes making up the BCR path. For every one of these n instances, we iterate
over the n nodes it consists of and explore their outgoing edges to verify whether
they form a backedge or not. The number of these outgoing edges is obviously
also at most n-1. Therefore, the overall asymptotic worst-case complexity of the
Backedge Conflict Resolution transformation is O(n³).
Theorem 1. If s denotes the number of votes, the worst-case asymptotic complexity of the CrowdTaxonomy algorithm is O(s · n³).
Proof. Vote Processing is called s times, once for each incoming vote. Each
time, a single transformation is applied. The worst-case complexity of the latter
equals O(n³). Therefore the worst-case asymptotic complexity of the CrowdTaxonomy
algorithm is O(s · n³).
This analysis depicts a worst-case scenario and is basically presented for the
sake of completeness, vis-à-vis the NP-hardness result presented earlier. As the
experiments showed, the algorithm's behaviour in terms of absolute time is
approximately linear in the number of votes. This is because (i) BCR paths are
much smaller than n and (ii) the number of outgoing links per node is also much
smaller than n. In future work we plan to present an average-case analysis and
provide better estimates than n, which is a very loose upper bound.
5 Experimentation
“break” this tree into distinct node (pair) relations, generate additional conflict-
ing pairs and feed (correct and incorrect pairs) to our algorithm.
Our algorithm was written in C and we used GLib. The interface of the crowd-
sourcing experiment was implemented in HTML/PHP and ran over Apache/MySQL.
Fig. 6. Recall and Precision using a percentage of Correct votes
Fig. 7. F-Scores having a mixture of Correct and False votes
6 Related Work
[4], [5], [19] and [18] apply association rule mining to induce relations between
terms and use them to form taxonomies. For text corpora, Sanderson and Croft
automatically derive a hierarchy of concepts and develop a statistical model
where term x subsumes term y if P(x|y) ≥ 0.8 and P(y|x) < 1, where P(a|b)
denotes the probability that a document contains term a given that it contains
term b. Schmitz extended this and applied additional thresholds in order
to deal with problems caused by uncontrolled vocabulary. [6] and [14] underline
the importance of folksonomies and the need to extract hierarchies for searching,
categorization and navigation purposes. They present approaches that operate
based on agglomerative clustering. A similarity measure is used to compute the
proximity between all tags and then a bottom-up procedure takes place. Nodes
under a threshold are merged into clusters in a recursive way and eventually
compose a taxonomy. Heymann and Garcia-Molina [11] present another technique
with good results. Given a space of resources and tags, they form a vector for
every tag and set its i-th element to the number of times the tag has been assigned
to object i. They then use cosine similarity to compute all pairwise vector proximities
and represent the tags as nodes of a graph with weighted edges that correspond to
their similarity. To extract a taxonomy, they iterate over the nodes in
descending order of centrality and, based on a threshold, attach each node's
neighbours either as its children or to the root.
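For concreteness, here is a small sketch of the Sanderson and Croft subsumption test described earlier in this section, computed from document-term sets. The thresholds follow that description, but the code is only an illustration, not the original implementation.

```python
# Sketch of the subsumption test described above: term x subsumes term y if
# P(x|y) >= 0.8 and P(y|x) < 1, where P(a|b) is the fraction of documents
# containing b that also contain a.

def subsumes(docs, x, y, threshold=0.8):
    with_y = [d for d in docs if y in d]
    with_x = [d for d in docs if x in d]
    if not with_y or not with_x:
        return False
    p_x_given_y = sum(1 for d in with_y if x in d) / len(with_y)
    p_y_given_x = sum(1 for d in with_x if y in d) / len(with_x)
    return p_x_given_y >= threshold and p_y_given_x < 1.0

docs = [{"music", "jazz"}, {"music", "rock"}, {"music"}, {"sports"}]
print(subsumes(docs, "music", "jazz"))  # True: all jazz docs mention music
print(subsumes(docs, "jazz", "music"))  # False
```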
As Plangprasopchok et al. note in [17] and [16] all these approaches make the
assumption that frequent words represent general terms. This does not always
hold, and any threshold-tuning approach leads to a trade-off between accurate but
shallow taxonomies and large but noisy ones. Also, all of the above works assume
a static tag space, despite its dynamicity [10].
7 Conclusions
References
1. Endeca, http://www.endeca.com/
2. Facetmap, http://facetmap.com/
3. Alonso, O., Lease, M.: Crowdsourcing 101: Putting the wsdm of crowds to work
for you: A tutorial. In: International Conference on WSDM (February 2011)
4. Au Yeung, C.-M., Gibbins, N., Shadbolt, N.: User-induced links in collaborative
tagging systems. In: Proceeding of the 18th ACM Conference on Information and
Knowledge Management, CIKM 2009 (2009)
5. Barla, M., Bieliková, M.: On deriving tagsonomies: Keyword relations coming from
crowd. In: Computational Collective Intelligence. Semantic Web, Social Networks
and Multiagent Systems (2009)
6. Brooks, C.H., Montanez, N.: Improved annotation of the blogosphere via autotag-
ging and hierarchical clustering. In: Proceedings of the 15th International Confer-
ence on World Wide Web, WWW 2006 (2006)
7. Doan, A., Ramakrishnan, R., Halevy, A.Y.: Crowdsourcing systems on the world-
wide web. Commun. ACM (2011)
8. Franklin, M.J., Kossmann, D., Kraska, T., Ramesh, S., Xin, R.: Crowddb: answer-
ing queries with crowdsourcing. In: ACM SIGMOD Conference (2011)
9. Garey, M., Johnson, D.: Computers and Intractability: A Guide to the Theory of
NP-Completeness. W.H. Freeman (1979)
10. Halpin, H., Robu, V., Shepherd, H.: The complex dynamics of collaborative tag-
ging. In: 16th WWW Conference (2007)
11. Heymann, P., Garcia-Molina, H.: Collaborative creation of communal hierarchical
taxonomies in social tagging systems. Technical report (2006)
12. Heymann, P., Paepcke, A., Garcia-Molina, H.: Tagging human knowledge. In: Third
ACM International Conference on Web Search and Data Mining, WSDM 2010
(2010)
13. Ipeirotis, P.: Managing crowdsourced human computation: A tutorial. In: Interna-
tional Conference on WWW (March 2011)
14. Liu, K., Fang, B., Zhang, W.: Ontology emergence from folksonomies. In: 19th
ACM CIKM (2010)
15. Marlow, C., Naaman, M., Boyd, D., Davis, M.: Position Paper, Tagging, Taxonomy,
Flickr, Article, ToRead. In: Collaborative Web Tagging Workshop at WWW 2006
(2006)
16. Plangprasopchok, A., Lerman, K.: Constructing folksonomies from user-specified
relations on flickr. In: 18th WWW Conference (2009)
17. Plangprasopchok, A., Lerman, K., Getoor, L.: Growing a tree in the forest: con-
structing folksonomies by integrating structured metadata. In: 6th ACM SIGKDD
Conference (2010)
18. Sanderson, M., Croft, B.: Deriving concept hierarchies from text. In: 22nd ACM
SIGIR Conference (1999)
19. Schmitz, P.: WWW 2006 (2006)
20. Shapira, A., Yuster, R., Zwick, U.: All-pairs bottleneck paths in vertex weighted
graphs. In: 18th ACM-SIAM SODA Symposium (2007)
21. Triantafillou, P.: Anthropocentric data systems. In: 37th VLDB Conference (Vi-
sions and Challenges) (2011)
Generating Possible Interpretations for Statistics
from Linked Open Data
Heiko Paulheim
Abstract. Statistics are very present in our daily lives. Every day, new
statistics are published, showing the perceived quality of living in differ-
ent cities, the corruption index of different countries, and so on. Interpret-
ing those statistics, on the other hand, is a difficult task. Often, statistics
collect only very few attributes, and it is difficult to come up with hy-
potheses that explain, e.g., why the perceived quality of living in one city
is higher than in another. In this paper, we introduce Explain-a-LOD,
an approach which uses data from Linked Open Data for generating hy-
potheses that explain statistics. We show an implemented prototype and
compare different approaches for generating hypotheses by analyzing the
perceived quality of those hypotheses in a user study.
1 Introduction
Statistical data plays an important role in our daily lives. Every day, a new
statistic is published, telling about, e.g., the perceived quality of living in dif-
ferent cities (used as a running example throughout the following sections), the
corruption in different countries, or the box office revenue of films. While it is of-
ten possible to retrieve a statistic on a certain topic quite easily, interpreting that
statistic is a much more difficult task. The raw data of a statistic often consists
only of a few attributes, collected; in the extreme case, it may only comprise a
source and a target attribute, such as a city and its score. Therefore, formulating
hypotheses, e.g., why the perceived quality of living is higher in some cities than
in others is not easy and requires additional background information.
While there are tools for discovering correlations in statistics, those tools re-
quire that the respective background information is already contained in the
statistic. For example, the quality of living in a city may depend on the popu-
lation size, the weather, or the presence of cultural institutions such as cinemas
and theaters. For discovering those correlations, the respective data has to be
contained in the dataset. For creating useful hypotheses, the dataset should con-
tain a larger number of attributes, which makes the compilation of such a dataset
a large amount of manual work.
More severely, the selection of attributes for inclusion in a statistical dataset
introduces a bias: attributes are selected since the person creating the dataset
already assumes a possible correlation. For discovering new and unexpected hy-
potheses, this turns out to be a chicken-and-egg problem: we have to know what
we are looking for to include the respective attribute in the dataset. For example,
if we assume that the cultural life in a city influences the quality of living, we
will include background information about theaters and festivals in our dataset.
For many common statistical datasets (e.g. datasets which relate real-world
entities of a common class with one or more target variables), there is background
information available in Linked Open Data [2]. In the quality of living example,
information about all major cities in the world can be retrieved from the semantic
web, including information about the population and size, the weather, and
facilities that are present in that city. Thus, Linked Open Data appear to be an
ideal candidate for generating attributes to enhance statistical datasets, so that
new hypotheses for interpreting the statistic can be found.
In this paper, we introduce Explain-a-LOD, a prototype for automatically
generating hypotheses for explaining statistics by using Linked Open Data. Our
prototype implementation can import arbitrary statistics files (such as CSV
files), and uses DBpedia [3] for generating attributes in a fully automatic fash-
ion. While our main focus is on enhancing statistical datasets with background
information, we have implemented the full processing chain in our prototype,
using correlation analysis and rule learning for producing hypotheses which are
presented to the user.
The rest of this paper is structured as follows. In Sect. 2, we introduce our
approach, show a proof-of-concept prototype, and discuss the underlying algo-
rithms. Section 3 discusses the validity of the approach and the individual algo-
rithms with the help of a user study. In Sect. 4, we review related approaches.
We conclude with a summary and an outlook on future research directions.
2 Approach
We have developed an approach for using Linked Open Data in a way that new
hypotheses for interpreting statistics can be generated. The approach starts with
a plain statistic, e.g., a CSV file, and comes up with hypotheses, which can be
output in a user interface. To that end, three basic steps are performed: first, the
statistical data is enhanced such that additional data from Linked Open Data
is added, second, hypotheses are sought in this enhanced data set by means of
correlation analysis and rule learning, and third, the hypotheses that are found
are presented to the user. The basic workflow of our approach is depicted in
Fig. 1. We have implemented that approach in a proof-of-concept prototype.
[Fig. 1. Basic workflow of the approach: raw statistical data passes through entity recognition and feature generation (FeGeLOD, drawing on a library of feature generation strategies and the Linked Open Data cloud) and feature pre-selection, yielding enriched statistical data that is analyzed by simple correlation analysis and rule learning before the hypotheses are presented.]
Open Data, so that additional information about those entities can be retrieved.
For the first prototype of FeGeLOD, we have used a very basic mechanism
for entity recognition: it retrieves all possible matching resources, e.g.,
http://dbpedia.org/resource/Vancouver for the city name Vancouver, and
performs an optional type check, e.g., for dbpedia-owl:City. If the landing page
of the first step is a disambiguation page, all disambiguated entities are followed,
and the first one matching the type check is used.
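A minimal sketch of this kind of lookup is shown below; it checks a candidate DBpedia URI against the desired type with a SPARQL ASK query. The endpoint URL and the specific type check are illustrative assumptions, not the prototype's actual code, and disambiguation handling is omitted.

```python
# Minimal sketch of the entity recognition step described above: map a city
# name to a DBpedia resource and verify its type. Endpoint and type check
# are illustrative assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://dbpedia.org/sparql"

def is_city(name):
    uri = "http://dbpedia.org/resource/" + name.replace(" ", "_")
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        ASK { <%s> a <http://dbpedia.org/ontology/City> }
    """ % uri)
    return sparql.query().convert().get("boolean", False), uri

ok, uri = is_city("Vancouver")
print(ok, uri)
```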
[Figure: processing example — a record such as (City: Vancouver, index: 106) passes through named entity recognition, feature generation, and feature selection.]
– Unqualified relations. Features are generated for incoming and outgoing re-
lations without any information about the related entity. For example, a
city may have incoming relations of type dbpedia-owl:foundationPlace.
Those features can be generated as boolean (incoming/outgoing relations of
the specified type exist or not) or numeric (counting the related entities)
features¹.
– Qualified relations. Unlike unqualified relations, boolean or numeric fea-
tures are generated including the type of the related entity, e.g., the pres-
ence or number of entities of type dbpedia-owl:Company which have a
dbpedia-owl:foundationPlace relation to a city. The detailed YAGO typing
system leads to a lot of very specific features, such as the number of airlines
founded in 2000 that are located in a city (a sketch of both kinds of relation
features follows after this list).
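The sketch below illustrates the two kinds of relation features using SPARQL COUNT queries against DBpedia; the endpoint, the chosen property, and the type restriction are illustrative assumptions rather than FeGeLOD's actual generation strategies.

```python
# Sketch of generating unqualified- and qualified-relation features for one
# entity via SPARQL counts. Property and type choices are illustrative.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://dbpedia.org/sparql"

def count(query):
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setReturnFormat(JSON)
    sparql.setQuery(query)
    rows = sparql.query().convert()["results"]["bindings"]
    return int(rows[0]["n"]["value"]) if rows else 0

def relation_features(entity_uri):
    features = {}
    # Unqualified: how many incoming dbo:foundationPlace relations exist.
    features["in_foundationPlace"] = count("""
        SELECT (COUNT(?s) AS ?n) WHERE {
          ?s <http://dbpedia.org/ontology/foundationPlace> <%s> }""" % entity_uri)
    # Qualified: restrict the related entities to a type (dbo:Company).
    features["in_foundationPlace_Company"] = count("""
        SELECT (COUNT(?s) AS ?n) WHERE {
          ?s <http://dbpedia.org/ontology/foundationPlace> <%s> ;
             a <http://dbpedia.org/ontology/Company> }""" % entity_uri)
    # Boolean variants simply test the counts for > 0.
    return features

print(relation_features("http://dbpedia.org/resource/Vancouver"))
```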
[Figure: excerpt of the DBpedia graph around dbpedia:Darmstadt — rdf:type dbpedia-owl:City, dbpedia-owl:populationTotal 141471, and incoming dbpedia-owl:headquarter relations from dbpedia:European_Space_Operations_Centre and dbpedia:EUMETSAT, which has rdf:type dbpedia-owl:Organization.]
classes dbpedia-owl:City or even owl:Thing, which are true for all entities.
Likewise, qualified relations may yield a large number of features which are not
useful, such as the number of entities of type yago:ArtSchoolsInParis which
are located in a city: this attribute will have a non-zero value only for one entity,
i.e., Paris.
Since those features are very unlikely to produce useful hypotheses, we ap-
ply a simple heuristic to filter them out before processing the dataset in order
to improve the runtime behavior of the remaining processing steps. Given a
threshold p, 0 ≤ p ≤ 1, we discard all features that have a ratio of more than p
unknown, equal, or different values (different values, however, are not discarded
for numeric features). In our previous experiments, values of p between 0.95 and
0.99 have proved to produce data sets of reasonable size without reducing the
results’ quality significantly [15].
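A possible reading of this heuristic is sketched below; the exact bookkeeping in FeGeLOD may differ.

```python
# Sketch of the pre-selection heuristic described above: drop a feature if the
# ratio of unknown values, of a single repeated value, or of all-distinct values
# exceeds a threshold p (distinct values are kept for numeric features).
from collections import Counter

def keep_feature(values, p=0.99, numeric=False):
    n = len(values)
    unknown = sum(1 for v in values if v is None)
    counts = Counter(v for v in values if v is not None)
    most_common = counts.most_common(1)[0][1] if counts else 0
    distinct = len(counts)
    if unknown / n > p:
        return False            # almost always missing
    if most_common / n > p:
        return False            # almost always the same value
    if not numeric and distinct / n > p:
        return False            # almost all values different
    return True

print(keep_feature([1, 1, 1, None], p=0.6))             # False: one value dominates
print(keep_feature([3, 7, 2, 9], p=0.6, numeric=True))  # True
```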
The result of the data preparation step is a table with many additional at-
tributes. That table can then be further analyzed to generate possible hypothe-
ses. Currently, we pursue two strategies for creating hypotheses:
– The correlation of each attribute with the respective target attribute is an-
alyzed. Attributes that are highly correlated (positively or negatively) lead
to a hypothesis such as “Cities with a high value of population have a low
quality of living” (a sketch of this strategy follows after this list).
– Rule learning is used to produce more complex hypotheses which may take
more than one feature into account. We have used the standard machine
learning library Weka [4] for rule learning. Possible algorithms are class
association rule mining [1] and separate-and-conquer rule learners [6]; in the
latter case, only the first, i.e., most general, rules are used, as the subsequent
rules are often not valid on the whole data set.
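The sketch below illustrates the correlation-based strategy on a toy table: attributes are ranked by their absolute Pearson correlation with the target and verbalized with a simple template. The column names and the wording template are illustrative assumptions.

```python
# Sketch of the correlation-based strategy: rank enriched attributes by the
# absolute Pearson correlation with the target and verbalize the top ones.
from scipy.stats import pearsonr

def correlation_hypotheses(rows, target, top_k=3):
    """rows: dict mapping attribute name -> list of numeric values."""
    scored = []
    for name, values in rows.items():
        if name == target:
            continue
        r, _ = pearsonr(values, rows[target])
        scored.append((abs(r), r, name))
    hypotheses = []
    for _, r, name in sorted(scored, reverse=True)[:top_k]:
        direction = "high" if r > 0 else "low"
        hypotheses.append(
            "Entities with a high value of %s have a %s %s" % (name, direction, target))
    return hypotheses

data = {"population": [2.6, 0.6, 1.8], "events": [40, 90, 70], "score": [80, 106, 95]}
print(correlation_hypotheses(data, "score"))
```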
Table 2. Number of features generated for the two data sets used in the study. This
table shows the numbers without any post-processing feature selection. The boolean
and numerical variants of relations and qualified relations produce an equal number of
features.
3 Experimental Evaluation
3.1 Setup
We have conducted the user study with 18 voluntary participants, who were
undergraduate and graduate students as well as researchers at Technische Uni-
versität Darmstadt. The participants were between 24 and 45 years old, 15 of
them were male, 3 female.
For the evaluation, we have used two statistics datasets: the already mentioned
Mercer quality of living survey with data², which comprises 218 cities, and the
corruption perception dataset by Transparency International³, which comprises
178 countries. With our entity recognition approach, we could map 97.7% of the
cities and 99.4% of the countries to the corresponding URIs in DBpedia.
For each data set, we have generated hypotheses with the approaches dis-
cussed above, using the different feature generation algorithms, and used the
top three hypotheses from both the simple correlation analysis and the rule
learning approach. Table 2 depicts the number of features generated and used
in each dataset.
For the rule learning approach, we also used a joint set of all feature generators,
so that rules involving features from different generators could also be found. As
the joint set of features cannot produce any new hypotheses when only regarding
correlations of single features, that dataset was only used with the rule learning
approach. After removing duplicates (a hypothesis with only one feature can be
found by both approaches), we had 37 hypotheses for the Mercer dataset and
² Data available at http://across.co.nz/qualityofliving.htm
³ Data available at http://www.transparency.org/policy research/surveys indices/cpi/2010/results
3.2 Results
The first goal was to understand which strategies for feature generation and for
creating the hypotheses work well, also in conjunction. To that end, we analyzed
the ratings of the respective hypotheses. Figure 5 shows the results for the Mercer
dataset, Figure 6 shows the respective results for the Transparency International
dataset. The intra-class correlation (i.e., the agreement score of the participants)
was 0.9044 and 0.8977, respectively.
The first basic observation is that the evaluations for both datasets are very
different. For the Mercer dataset, simple correlations produce the more plausible
hypotheses, while for the Transparency International dataset, rule learning is
significantly better in some cases. In both cases, the best rated hypotheses are
produced when using the type features. In both cases, joining all the attributes
in a common dataset did not lead to significantly better rules.
For the Mercer dataset, the best rated hypotheses were Cities in which many
events take place have a high quality of living (found with correlation analysis
from unqualified relations, average rating 3.94), and Cities that are European
Capitals of Culture have a high quality of living (generated from a type feature
with type yago:EuropeanCapitalsOfCulture, found both by correlation analy-
sis and with a rule learner, rating 3.89). The worst rated hypotheses were Cities
where at least one music record was made and where at least 22 companies or or-
ganizations are located have a high quality of living (generated with a rule learner
from unqualified relations, rating 1.5) and Cities that are the hometown of at
least 18 bands, but the headquarter of at most one airline founded in 2000, have
a high quality of living (generated with a rule learner from qualified relations,
rating also 1.5).
For the Transparency International Dataset, the two best rated hypotheses
are Countries of type Least Developed Countries have a high corruption index
(generated by correlation analysis from a type feature with type yago:Least-
DevelopedCountries, rating 4.29), and Countries where no military conflict is
carried out and where no schools and radio stations are located have a high cor-
ruption index (generated by rule learning from three different qualified relation
features, rating 4.24). The two worst rated hypotheses are Countries with many
mountains have a low corruption index and Countries where no music groups
⁴ The hypotheses used in the evaluation are listed at http://www.ke.tu-darmstadt.de/resources/explain-a-lod/user-study
Fig. 5. Average user ratings of the hypotheses generated for the Mercer dataset, ana-
lyzed by feature generation and hypothesis generation strategy
that have been disbanded in 2008 come from have a high corruption index (both
generated by correlation analysis from a qualified relation feature, ratings 1.39
and 1.28, respectively).
There are some hypotheses that are rated badly, because the explanations
they hint at are not trivial to see. For example, one hypothesis generated for
the Mercer dataset is Cities with a high longitude value have a high quality of
living (average rating 1.52). When looking at a map, this hypothesis becomes
plausible: it separates cities in, e.g., North America, Australia, and Japan, from
those in, e.g., Africa and India. Interestingly enough, a corresponding hypothe-
sis concerning the latitude (which essentially separates cities in the third world
from those in the rest of the world) was rated significantly (p < 0.05) higher
(rating 3.15). Another example of a hypothesis that is not trivial to interpret
is the following: Countries with an international calling code greater than 221
have a high corruption index (rating 1.69). Those calling codes mostly identify
African countries. On the other hand, the following hypothesis is rated signifi-
cantly higher (rating 4.0): Countries in Africa have a high corruption index.
The second goal of the user study was to get an impression of how the overall
usefulness of the tool is perceived. Figure 7 shows the results of the general
questions. The hypotheses got positive results on three of the four scales, i.e.,
the users stated that the results were at least moderately useful, surprising,
and non-trivial. The latter two are significantly better than the average value
of three with p < 0.05. The trustworthiness of the results, on the other hand,
was not rated well (p < 0.01). These results show that the tool is well suited
Fig. 6. Average user ratings of the hypotheses generated for the Transparency Inter-
national dataset, analyzed by feature generation and hypothesis generation strategy
for generating hypotheses, but the generated hypotheses always need a human
to judge whether they are valid explanations or not.
At the end of the questionnaire, users were asked to give some additional
comments. One user asked for more detailed information on certain explanations,
e.g., showing the average corruption of African and non-African states for a hy-
pothesis such as Countries in Africa have a high corruption index. Another user
remarked that some rules are hard to comprehend without background knowl-
edge (such as those involving latitude/longitude values, as discussed above).
Another user remarked that longer hypotheses were in general less plausible.
This may partly explain the bad performance of the rule-based approaches on
the Mercer dataset. Rule learning approaches most often seek to find rules that
have a good coverage and accuracy, i.e., split the dataset into positive and neg-
ative examples as well as possible. Since rule learning algorithms may choose
combinations of arbitrary features for that, it may happen that an unusual com-
bination of features leads to a good separation of the example space, but that
the resulting rule is not perceived as a very plausible one.
One example is the following rule, which was among the worst rated hypothe-
ses (average rating of 1.5): Cities which are the hometown of at least 18 bands,
but are the headquarter of at most one airline founded in 2000, have a high qual-
ity of life. While the second condition may increase the rule’s accuracy by some
percent, it decreases the perceived plausibility of the rule, mostly since there is
no obvious coherence between bands and airlines. In contrast, the following rule
received a significantly (p < 0.01) higher average rating (2.72): Cities that are
the origin of at least 33 artists and bands have a high quality of life. On the other
hand, the first rule has an accuracy of 98.0%, while the second rule has an ac-
curacy of only 88.6%. This shows that, while the accuracy of rules may increase
with additional, non-related features, this does not necessarily imply an increase
in the perceived plausibility. A similar observation can be made for correlation
analysis: the best-rated hypothesis for the Transparency International dataset,
Countries of type Least Developed Countries have a high corruption index is ac-
tually the one in the set of hypotheses with the lowest correlation (Pearson’s
correlation coefficient 0.39, rating 4.33).
Another observation made is that rule-based approaches are capable of find-
ing very exact conditions, i.e., they find the value which separates best between
positive and negative examples. One example is the following pair of corresponding
rules: Countries with a high HDI have a low corruption index (average rating: 4.0,
found with correlation analysis), and Countries with a HDI less than 0.712 have a
high corruption index (average rating: 3.39, found with rule learner). While both
hypotheses express the same finding, the second one, which is formulated in a more
specific way, is rated significantly (p < 0.05) lower. These examples show that very
accurate rules are not always perceived as plausible at the same time.
4 Related Work
There is a vast body of work that is concerned with the analysis of statistical
data [14]. Given a statistic, there are various methods to find out correlations
and interrelations of the attributes contained in those statistics. Highly developed
toolkits such as R [9] can be used for performing such analyses.
Those methods always assume that all the possible attributes are known, and
thus, they are only capable of finding correlations between attributes that are in-
cluded in the statistic. The work presented in this paper can be seen as a comple-
ment to those approaches, as it enhances a dataset by a multitude of attributes
that can then be examined by such statistical analysis algorithms and tools.
One of the works closest to Explain-a-LOD is proposed by Zapilko et al. [20].
The authors propose a method for publishing statistical data as linked data,
which allows for combining different such data sets. Kämpgen and Harth sug-
gest a similar approach for analyzing statistical linked data with online analytical
processing (OLAP) tools [11]. They discuss a common schema for such data and
present various case studies. While OLAP allows for asking for specific correla-
tions (i.e., the user has to come up with the hypotheses by himself upfront), our
approach generates hypotheses automatically. Furthermore, while we are able to
exploit any arbitrary, general-purpose datasets, such as DBpedia, the authors of
the two approaches are restricted to specialized statistical datasets, following a
specific schema. Nevertheless, including such specific statistical linked data sets
in our approach may help increase the quality of our hypotheses significantly.
g-SEGS [13] uses ontologies as background knowledge in data mining tasks.
Ontologies are used as additional taxonomic descriptions for nominal attributes.
For example, a nominal attribute with the values Student, Apprentice, Employee,
Self-employed, and Unemployed may be augmented with a taxonomy of those
values. Thus, regularities that hold for all people in education (regardless of
whether they are students or apprentices) may be found better. In contrast to
our approach, g-SEGS uses T-Box information, while we use A-Box information.
Furthermore, in g-SEGS, the ontology has to be known in advance and mapped
to the dataset manually. This makes it difficult to discover new hypotheses, since
the designer of the ontology can be tempted to model only those facts in the
ontology that are considered relevant for the mining problem at hand.
SPARQL-ML [10] is an approach that foresees the extension of the SPARQL
query language [18] with a specialized statement to learn a model for a spe-
cific concept or numeric attribute in an RDF dataset. Such models can be seen
as explanations in the way we use them in Explain-a-LOD. However, the ap-
proach requires support of the endpoint in question, e.g., DBpedia, to support
the SPARQL-ML language extensions. In contrast, our approach works with any
arbitrary SPARQL endpoint providing Linked Open Data.
Mulwad et al. have proposed an approach for annotating tables on the web
[12]. The authors try to automatically generate links to DBpedia both for entities
in the table as well as for column names, which are linked to classes in ontologies.
Unlike the approach presented in this paper, the authors are not concerned with
creating hypotheses. Since tables are typical ways to present statistical data on
the web, their approach could be a useful complement to Explain-a-LOD for
generating hypotheses on arbitrary tabular statistical data found on the web.
bands and airlines are not semantically close, which lowers the total plausibil-
ity. A rule with two conditions involving, e.g., bands and TV stars, or airlines
and logistics companies, would probably be perceived as more plausible, since the
semantic distance between the conditions is lower. Therefore, an interesting re-
search direction would be finding accurate, but semantically coherent rules.
Concerning the presentation of the hypotheses, several improvements can be
thought of. The sorting of hypotheses by their rating is essential to the user, since
the best hypotheses are expected to be on top. However, our user study showed
that the natural ratings (such as the correlation coefficient for simple attributes)
do not always reflect the perceived plausibility. In future user studies, we want
to explore the impact of different rating measures for hypotheses. Furthermore,
the verbalization of hypotheses does not always work well because of the mixed
quality of the labels used in the datasets [7]. Here, we aim for more intuitive and
readable verbalizations, such as proposed in [16].
Finally, an interactive user interface would be helpful, where the user can mark
implausible hypotheses (such as a correlation between the number of mountains
in a country and the country’s corruption index) and receive an explanation
and/or an alternative hypothesis. Taking the informal feedback from the user
study into account, it would also be helpful to provide evidence for a hypothesis,
e.g., list those instances that fulfill a certain condition. Such a functionality might
also help to improve the trust in the hypotheses generated by Explain-a-LOD,
which was not perceived very high in our user study.
In summary, we have introduced an approach and an implemented prototype
that demonstrates how Linked Open Data can help in generating hypotheses for
interpreting statistics. The results of the user study show that the approach
is valid and produces useful results.
References
1. Agrawal, R., Imielinski, T., Swami, A.: Mining Association Rules between Sets of
Items in Large Databases. In: Proceedings of the ACM SIGMOD International
Conference on Management of Data, pp. 207–216 (1993)
2. Bizer, C., Heath, T., Berners-Lee, T.: Linked Data - The Story So Far. International
Journal on Semantic Web and Information Systems 5(3), 1–22 (2009)
3. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hell-
mann, S.: DBpedia - A crystallization point for the Web of Data. Web Semantics
- Science Services and Agents on the World Wide Web 7(3), 154–165 (2009)
4. Bouckaert, R.R., Frank, E., Hall, M., Holmes, G., Pfahringer, B., Reutemann, P.,
Witten, I.H.: WEKA — Experiences with a Java open-source project. Journal of
Machine Learning Research 11, 2533–2541 (2010)
5. Callahan, E.S., Herring, S.C.: Cultural bias in wikipedia content on famous persons.
Journal of the American Society for Information Science and Technology 62(10),
1899–1915 (2011)
6. Cohen, W.W.: Fast effective rule induction. In: Twelfth International Conference
on Machine Learning, pp. 115–123. Morgan Kaufmann (1995)
7. Ell, B., Vrandečić, D., Simperl, E.: Labels in the Web of Data. In: Aroyo, L., Welty,
C., Alani, H., Taylor, J., Bernstein, A., Kagal, L., Noy, N., Blomqvist, E. (eds.)
ISWC 2011, Part I. LNCS, vol. 7031, pp. 162–176. Springer, Heidelberg (2011)
8. Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L.A. (eds.): Feature Extraction – Foun-
dations and Applications. Springer (2006)
9. Ihaka, R.: R: Past and future history. In: Proceedings of the 30th Symposium on
the Interface (1998)
10. Kiefer, C., Bernstein, A., Locher, A.: Adding Data Mining Support to SPARQL
Via Statistical Relational Learning Methods. In: Bechhofer, S., Hauswirth, M.,
Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 478–492.
Springer, Heidelberg (2008)
11. Kämpgen, B., Harth, A.: Transforming statistical linked data for use in olap sys-
tems. In: 7th International Conference on Semantic Systems, I-SEMANTICS 2011
(2011)
12. Mulwad, V., Finin, T., Syed, Z., Joshi, A.: Using linked data to interpret tables.
In: Proceedings of the First International Workshop on Consuming Linked Data,
COLD 2010 (2010)
13. Novak, P.K., Vavpetič, A., Trajkovski, I., Lavrač, N.: Towards semantic data min-
ing with g-segs. In: Proceedings of the 11th International Multiconference Infor-
mation Society, IS 2009 (2009)
14. Ott, R.L., Longnecker, M.: Introduction to Statistical Methods and Data Analysis.
Brooks/Cole (2006)
15. Paulheim, H., Fürnkranz, J.: Unsupervised Feature Generation from Linked Open
Data. In: International Conference on Web Intelligence, Mining, and Semantics,
WIMS 2012 (2012)
16. Piccinini, H., Casanova, M.A., Furtado, A.L., Nunes, B.P.: Verbalization of rdf
triples with applications. In: ISWC 2011 – Outrageous Ideas track (2011)
17. Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: a core of semantic knowledge.
In: Proceedings of the 16th International Conference on World Wide Web, WWW
2007, pp. 697–706. ACM (2007)
18. W3C: SPARQL Query Language for RDF (2008),
http://www.w3.org/TR/rdf-sparql-query/
19. Wrobel, S.: An algorithm for multi-relational discovery of subgroups. In: Sympo-
sium on Pattern Discovery in Databases (PKDD 1997) (1997)
20. Zapilko, B., Harth, A., Mathiak, B.: Enriching and analysing statistics with linked
open data. In: Conference on New Techniques and Technologies for Statistics,
NTTS (2011)
Green-Thumb Camera: LOD Application
for Field IT
Abstract. Home gardens and green interiors have recently been receiv-
ing increased attention owing to the rise of environmental consciousness
and growing interest in macrobiotics. However, because the cultivation
of greenery in a restricted urban space is not necessarily a simple matter,
overgrowth or extinction may occur. In regard to both interior and ex-
terior greenery, it is important to achieve an aesthetic balance between
the greenery and the surroundings, but it is difficult for amateurs to
imagine the future form of the mature greenery. Therefore, we propose
an Android application, Green-Thumb Camera, which queries the LOD
cloud for a plant that fits the environmental conditions, based on sensor
information from a smartphone, and overlays its grown form on the space using AR.
1 Introduction
Home gardens and green interiors have been receiving increased attention owing
to the rise of environmental consciousness and growing interest in macrobiotics.
However, the cultivation of greenery in a restricted urban space is not necessarily
a simple matter. In particular, as the need to select greenery to fit the space is
a challenge for those without gardening expertise, overgrowth or extinction may
occur. In regard to both interior and exterior greenery, it is important to achieve
an aesthetic balance between the greenery and the surroundings, but it is difficult
for amateurs to imagine the future form of the mature greenery. Even if the user
checks images of mature greenery in gardening books, there will inevitably be
a gap between the reality and the user’s imagination. To solve these problems,
the user may engage the services of a professional gardening advisor, but this
involves cost and may not be readily available.
Therefore, we considered it would be helpful if an ‘agent’ service offering
gardening expertise were available on the user’s mobile device. In this paper,
we describe our development of Green-Thumb Camera, which recommends a
plant to fit the user’s environmental conditions (sunlight, temperature, etc.) by
using a smartphone’s sensors. Moreover, by displaying its mature form as 3DCG
using AR (augmented reality) techniques, the user can visually check if the
plant matches the user’s surroundings. Thus, a user without gardening expertise
is able to select a plant to fit the space and achieve aesthetic balance with the
surroundings.
The AR in this paper refers to annotation of computational information to
suit human perception, in particular, overlapping of 3DCG with real images.
This technique’s development dates back to the 1990s, but lately it has been
attracting growing attention, primarily because of its suitability for recent mobile
devices. AR on mobile devices realizes the fusion of reality and computational
information everywhere. Research [25] on AR for mobile devices was conducted
in the 1990s, but it did not attract public attention because “mobile” computers
and sensors were large and hard to carry, and networks were slow.
The remainder of this paper is organized as follows. Section 2 describes our
proposed service, focusing on plant recommendation and the AR function. Sec-
tion 3 reports an experiment, and section 4 outlines related work, mainly on
AR-based services. Section 5, the final section, presents conclusions and identi-
fies future issues.
Fig. 2. Example of plant display (top: before growth, bottom: after growth)
Sunlight
This factor indicates the illuminance suitable for growing each plant and has
several levels such as shade, light shade, and sunny [4,5].
To determine the current sunlight, we used a built-in illuminance sensor
on the smartphone. After the application boots up, if the user brings the
smartphone to the space where he/she envisages putting the plant and
pushes the start button on the screen, the sunlight at the space is mea-
sured. If it is less than 3000 lux, it is deemed to be a shade area. If it is more
than 3000 lux but less than 10000 lux, it is deemed to be light shade, and if
more than 10000 lux, it is deemed to be a sunny area. If the sunlight level
measured by the sensor fits that required by the plant, it is deemed suitable.
Temperature
This factor indicates the range (min, max) of suitable temperature for a
plant. The lower and the upper limits of the range are determined by
reference to the same sites as for the sunlight factor.
To get the temperature, we referred to past monthly average temperatures
for each prefecture from the Japan Meteorological Agency (JMA) [6], using
the current month and area (described below), instead of the current
temperature. The temperature for indoor plants from November to February is
the average winter indoor temperature for each prefecture from
WEATHERNEWS INC. (WN) [7]. If the temperature obtained this way is
within the plant's range, it is deemed suitable.
Planting Season
The planting season means a suitable period (start, end) for starting to
grow a plant (planting or sowing). The periods are set on a monthly basis
according to some gardening sites[8,9].
To get the current month, we simply used the Calendar class provided by the
Android OS. However, the season is affected by the geographical location
(described below). Therefore, it is set one month later in the south area, and
one month earlier in the north area. In the northernmost area, it is set two
months earlier, because the periods are given mainly for Tokyo (middle of
Japan) on most websites. If the current month is in the planting season for
the plant, it is deemed suitable.
Planting Area
The planting area means a suitable area for growing a plant. It is set by
provincial area according to a reference book used by professional gardeners[4].
To get the current area, we used the GPS function on the smartphone. Then,
we classified the current location (latitude, longitude) into the 47 prefectures
of Japan, and determined the provincial area. If the current location is in
the area for the plant, it is deemed suitable.
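The sketch below restates these checks in code, using the illuminance thresholds and factor ranges described above; the plant record fields are illustrative stand-ins for the attributes read from the Plant LOD.

```python
# Sketch of the environmental suitability check described above. The plant
# record fields are illustrative; the app reads them from the Plant LOD.

def classify_sunlight(lux):
    if lux < 3000:
        return "shade"
    if lux < 10000:
        return "light shade"
    return "sunny"

def is_suitable(plant, lux, temperature, month, area):
    if classify_sunlight(lux) != plant["sunlight"]:
        return False
    if not (plant["temp_min"] <= temperature <= plant["temp_max"]):
        return False
    start, end = plant["season_start"], plant["season_end"]
    # Planting seasons may wrap around the year end (e.g. October to March).
    in_season = (start <= month <= end) if start <= end else (month >= start or month <= end)
    if not in_season:
        return False
    return area in plant["areas"]

fern = {"sunlight": "light shade", "temp_min": 10, "temp_max": 28,
        "season_start": 10, "season_end": 3, "areas": {"Kanto", "Kansai"}}
print(is_suitable(fern, lux=5000, temperature=18, month=11, area="Kanto"))  # True
```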
Plant LOD and SPARQL Query. In this section, we describe how a plant
is recommended based on the above factors.
As a recommendation mechanism, we first tried to formulate a function on
the basis of multivariate analysis, but gave it up because priority factors differ
depending on the plant. Next, we created a decision tree per plant, because the
reasons for a recommendation are relatively easily analyzed from the tree
structure, and then we evaluated the recommendation accuracy [30]. However, this
approach obviously poses a difficulty in terms of scaling up, since manual creation
of training data is costly. Therefore, we prepared Plant LOD based on collective
“Fern”. In addition, we created approx. 100 plants mainly for species native to
Japan. Each plant of the Plant class has almost 300 Properties, but most of them
are inherited from “Thing”, “Species” and “Eukaryote”. So we added 11 Proper-
ties to represent necessary attributes for plant cultivation, which correspond to
{ Japanese name, English name, country of origin, description, sunlight, temper-
ature (min), temperature (max), planting season (start), planting season (end),
blooming season (start), blooming season (end), watering amount, annual grass
(true or false), related website, image URL, 3DCG URL, planting area, planting
difficulty }. Fig. 4 illustrates the overall architecture of the Plant LOD, where
prefixes gtc: and gtcprop: mean newly created instances and attributes. The
Plant LOD is now stored in a cloud DB (DYDRA[11]) and a SPARQL endpoint
is offered to the public.
The semi-automatic creation of LOD in this paper is greatly inspired by an
invited talk by T. Mitchell at ISWC 2009 [29], and involves a bootstrapping method
based on ONTOMO [31] and dependency parsing based on WOM Scouter [32].
But the plant names can be easily collected from a list on any gardening site
and we have already defined the necessary attributes based on our service re-
quirements. Therefore, what we would like to collect in this case is the value of
the attribute for each plant. Using the bootstrapping method [12], we first gener-
ate specific patterns from web pages based on some keys, which are the names
of the attributes, and then we apply the patterns to other web pages to ex-
tract the values of the attributes. This method is mainly used for the extraction
of <property, value> pairs from structured parts of a document such as tables
and lists. However, we found there are many (amateur) gardening sites that
explain the nature of the plant only in plain text. Therefore, we created an
extraction method based on dependency parsing. It first follows the modifica-
tion relation in a sentence from a seed term, which is the name of the plant or
the attribute, and then extracts triples such as <plant name, property, value> or
<-, property, value>. Either way, a key or seed for extraction is retrieved from
our predefined schema of the Plant LOD, to collate with existing LOD such as DBpedia.
Also, for correction of mistakes, we extracted the values of a plant from more
than 100 websites. If extracted values are identical, we sum up the Google PageRanks
of their source sites and determine the best possible value and the second-best.
Finally, a user determines a correct value from the proposed ones. We conducted
this semi-automatic extraction of the values for the 13 attributes of the 90 plants
that we added, and then created the Plant LOD. In a recent experiment, the
best possible values achieved an average precision of 85% and an average recall
of 77%. We are now conducting a more detailed evaluation, and the results will
be discussed in another paper.
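The value-selection step can be pictured with the following small sketch, which groups candidate values, weights each group by the summed PageRank of its source sites, and returns the two top-ranked candidates for the user to confirm; the data layout is an assumption made for illustration.

```python
# Sketch of the vote-like value selection described above: candidate values
# extracted from many pages are grouped, each group is weighted by the sum
# of the PageRanks of its source sites, and the top two are proposed.
from collections import defaultdict

def rank_candidate_values(extractions):
    """extractions: list of (value, source_pagerank) pairs for one attribute."""
    scores = defaultdict(float)
    for value, pagerank in extractions:
        scores[value] += pagerank
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:2]   # best possible value and the second-best

extractions = [("sunny", 6.0), ("light shade", 4.0), ("sunny", 3.0)]
print(rank_candidate_values(extractions))  # [('sunny', 9.0), ('light shade', 4.0)]
```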
The SPARQL query includes the above-mentioned environmental factors ob-
tained from the sensors in FILTER evaluation, and is set to return the top three
plants in the reverse order of the planting difficulty within the types of Plant
class. It should be noted that SPARQL 1.0 does not have a conditional branching
statement such as IF-THEN or CASE-WHEN in SQL. Thus, certain restrictions
are difficult to express, such as whether the current month is within the plant-
ing season or not. Different conditional expressions are required for two cases
such as March to July and October to March. Of course, we can express such
a restriction using logical-or(||) and logical-and(&&) in FILTER evaluation, or
the UNION keyword in the WHERE clause. But it would be a redundant expression in
some cases (see the sketch below, where ?start, ?end, and MNT denote the start month,
the end month, and the current month, respectively). On the other hand, the SPARQL
1.1 draft [13] includes IF among its functional forms, so we hope for an early
finalization of the 1.1 specification and wide dissemination of its implementations.
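Since the query listing itself is not reproduced here, the following is only a hedged sketch of how such a query might be issued from a client, with the wrap-around planting-season restriction expressed via ||/&& in FILTER. The endpoint URL, the gtcprop: namespace URI, and the property local names are placeholders inferred from the schema described above, not the authors' actual query.

```python
# Hedged sketch (not the authors' listing) of a SPARQL 1.0 query with the
# wrap-around planting-season restriction expressed via ||/&& in FILTER.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://example.org/plant-lod/sparql"   # placeholder, not the real DYDRA URL
MNT, LUX_LEVEL, TEMP = 11, "light shade", 18        # values taken from the phone's sensors

query = """
PREFIX gtcprop: <http://example.org/gtcprop/>
SELECT ?plant ?difficulty WHERE {
  ?plant gtcprop:sunlight "%s" ;
         gtcprop:temperatureMin ?tmin ; gtcprop:temperatureMax ?tmax ;
         gtcprop:plantingSeasonStart ?start ; gtcprop:plantingSeasonEnd ?end ;
         gtcprop:plantingDifficulty ?difficulty .
  FILTER (?tmin <= %d && ?tmax >= %d)
  # The season may wrap around the year end, hence the redundant disjunction:
  FILTER ( (?start <= ?end && ?start <= %d && %d <= ?end) ||
           (?start >  ?end && (%d >= ?start || %d <= ?end)) )
}
ORDER BY ASC(?difficulty)
LIMIT 3
""" % (LUX_LEVEL, TEMP, TEMP, MNT, MNT, MNT, MNT)

sparql = SPARQLWrapper(ENDPOINT)
sparql.setReturnFormat(JSON)
sparql.setQuery(query)
# results = sparql.query().convert()   # would return the top three candidate plants
```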
in the camera view, recognizes its three-dimensional position and attitude, and
then displays 3DCGs in Metasequoia format on the marker. The 3DCG can
quickly change its size and tilt according to the marker’s position and attitude
through the camera. We have already prepared 90 kinds of plant 3DCG data for
recommendation.
4 Related Work
Recently, the remarkable progress of mobile devices has made AR functionality
available ubiquitously. Mobile devices and AR have a strong affinity, because it
becomes possible to overlay virtual information on reality everywhere. Several
systems have already been reported in the literature and proposed as commercial
services; they can be roughly classified into two categories depending on how AR
is used: to annotate text information on real objects, and/or to materialize virtual
objects in the real scene.
The former includes Sekai Camera[18], which sparked an AR boom in Japan,
and Layar[19], VTT(Technical Research Centre of Finland)[20], and Takemura
et al.[21]. Sekai Camera displays tags related to the real objects existing in a
town, which show the users’ comments and reviews. Layar annotates the text
information for restaurants, convenience stores and spots on a landscape, and
then provides their search function. Research at VTT concerned a system en-
abling a worker assembling industrial components to see the next parts and
how to attach them through a camera. Takemura et al. realized a system employing a
wearable computer for annotating information on buildings.
The latter includes My.IKEA[22] and USPS Virtual Box Simulator[23].
My.IKEA realized simulation of furniture arrangement in the user’s home
through a camera by displaying 3DCG of the furniture on a corresponding
marker that comes with a catalog. Virtual Box Simulator is a system to show
3DCG boxes for courier services for determining the suitable box size for an
object to be dispatched. The service that we propose in this paper also adopts
the approach of materializing virtual objects in real scenes and displays 3DCG
of non-existent objects as well. However, while the other systems materialize
the predefined objects statically bound to the markers, our service materializes
more adaptive objects by using the recommendation function according to the
real situation estimated by the sensors.
Moreover, we introduce three kinds of research on combining AR with another
technique. Regarding the combination with the recommendation function, Guven
et al.[24] show 3DCG avatars of real reviewers for a product by reading a marker
on the product, and then provide useful information on that product through
conversation with the avatars. Our service also shows adaptive information in
context with the AR. However, while the AR of this research is only used to
show the avatar, AR of our service shows the recommended object itself and
overlays it on the real scene to check the aesthetic balance with the surroundings.
Therefore, it would be a more practical use of AR.
Regarding the combination with software agents, Nagao et al. proposed agent
augmented reality[25] a decade ago and introduced the applications of shopping
support and a traveler’s guide system.
Furthermore, regarding the combination with plants, there is the research of Nishida et al. [26]. They used a 3DCG fairy that personifies the plant and whose appearance reflects the plant's physical condition, thus introducing a game flavor into plant cultivation. Our service also applies AR to plants. However, while they focused on plant cultivation, our service targets the planting phase, the selection of plants, and checking whether they will blend in with the scenery. In fact, there has been little ICT research on plants for non-expert users who enjoy gardening, although precision farming offers agricultural field analysis using sensors for experts. The most practical service for non-expert users may still be a search engine for plant names. Focusing on those non-expert users, we provide adaptive information in context by combining the semantic information from the sensors and LOD with AR.
Finally, apart from AR, we introduce two kinds of research regarding sensors and semantics. The first is the Semantic Sensor Network (SSN), in which sensor data is annotated with semantic metadata to support environmental monitoring and decision making. SemSorGrid4Env [28] applies it to flood emergency response planning. Our service architecture is similar to SSN; however, instead of searching and reasoning within mashed-up semantic sensor data, we assume the existence of LOD on the net, to which the sensor data is connected.
The second is social sensor research, which integrates existing social networking services with physical-presence awareness, such as RFID data and Twitter posts with GPS data, to encourage users' collaboration and communication. Live Social Semantics (LSS) [27] applied it to several conferences and suggested new interests to the users. It resembles our service architecture in that face-to-face contact events based on RFID are connected to the social information on the net. However, owing to the difference in objective (social support versus field support), the information flow is reversed: in our architecture the sensor (client) side requests the LOD on the net, whereas in LSS the social information (DB) side collects the sensor data.
References
1. Srinivasan, A.: Handbook of precision agriculture: principles and applications.
Routledge (2006)
2. Berners-Lee, T.: Design Issues: Linked Data (2006),
http://www.w3.org/DesignIssues/LinkedData.html
3. Linked Data - Connect Distributed Data across the Web,
http://linkeddata.org/
4. Neo-Green Space Design. Seibundo Shinkosha Publishing (1996)
5. engeinavi (in Japanese), http://www.engeinavi.jp/db/.
6. Japan Meteorological Agency, http://www.jma.go.jp/jma/indexe.html
7. weathernews, http://weathernews.com/?language=en
8. Makimo Plant (in Japanese),
http://www.makimo-plant.com/modules/maintenance/index.php
9. Angyo: The Village of Garden Plants, http://www.jurian.or.jp/en/index.html
10. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.G.: DBpedia:
A Nucleus for a Web of Open Data. In: Aberer, K., Choi, K.-S., Noy, N., Allemang,
D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R.,
Schreiber, G., Cudré-Mauroux, P. (eds.) ISWC/ASWC 2007. LNCS, vol. 4825, pp.
722–735. Springer, Heidelberg (2007)
30. Kawamura, T., Mishiro, N., Ohsuga, A.: Green-Thumb Phone: Development of AR-
based Plant Recommendation Service on Smart Phone. In: Proc. of International
Conference on Advanced Computing and Applications, ACOMP (2011)
31. Kawamura, T., Shin, I., Nakagawa, H., Nakayama, K., Tahara, Y., Ohsuga, A.:
ONTOMO: Web Service for Ontology Building - Evaluation of Ontology Rec-
ommendation using Named Entity Extraction. In: Proc. of IADIS International
Conference WWW/INTERNET 2010, ICWI (2010)
32. Kawamura, T., Nagano, S., Inaba, M., Mizoguchi, Y.: Mobile Service for Reputa-
tion Extraction from Weblogs - Public Experiment and Evaluation. In: Proc. of
Twenty-Second Conference on Artificial Intelligence, AAAI (2007)
Assembling Rule Mashups in the Semantic Web
RIF is a family of languages, called dialects, covering different kinds of rules: from
logic-programming [10] to production rules [5]. The syntax and semantics of each
dialect is rigorously and formally specified, sharing a common core of machinery.
Among their shared features, RIF dialects include support for annotations.
In RIF, annotations allow metadata to be attached to almost every syntactic
element of the language, from the RIF document itself (top element in the hier-
archy) to the terminal symbols of the grammar1. No more than one annotation
is allowed per element. An annotation has the form (* id ϕ *), where id rep-
resents the identifier of the annotated syntactic element (a URI), and ϕ is a RIF
formula that captures the metadata. In particular, ϕ is a frame (an expression of
the form s[p ->o]) or a conjunction of frames (i.e., And(s1[p1 ->o1 ], . . . ,
sn [pn ->on ])).
The RIF machinery for annotations is very flexible and permits great syntactic freedom. For instance, the identifier (id) is not required in general. However, when it is absent, metadata cannot be attached to the element, nor can cross-references be made between elements of a RIF document. Thus, our advice to authors of RIF documents is to assign identifiers to at least groups and rules in order to facilitate reuse.
1 The RIF normative interchange syntax is an XML grammar, although for the sake of readability, in this paper we will use the more human-readable RIF Presentation Syntax (PS), defined in the informative part of the specification.
Group (
(* ex:bf-rule ex:bf-rule [
dc:title -> "Breast feeding contraindication alert for Nafarelin"
dc:relation -> <http://www.drugbank.ca/drugs/DB00666>
dc:source -> <http://www.sometrustworthysite.com/>
]
*)
?Person[ex:avoid-drug -> <http://www.drugbank.ca/drugs/DB00666>]
:-
?Person # foaf:Person
?Person[foaf:gender -> "female"
ex:status -> ex:breastFeeding]
)
Fig. 1. A RIF rule with annotations
Annotation ϕ                          π(ϕ)
s[p -> o]                             { s p o }
s[p1 -> o1 ... pn -> on]              { s p1 o1 ; ... ; pn on }
And(F1, ..., Fn)                      { π(F1) } ∪ ... ∪ { π(Fn) }
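For instance, applying π to the annotation of the rule in Fig. 1 yields the following triples (serialized here in Turtle for illustration):

    ex:bf-rule dc:title    "Breast feeding contraindication alert for Nafarelin" ;
               dc:relation <http://www.drugbank.ca/drugs/DB00666> ;
               dc:source   <http://www.sometrustworthysite.com/> .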
To translate from RDF back to RIF, the inverse mapping π⁻¹ is applied. The hierarchy described by some RDF graphs cannot be mapped by π⁻¹ because the data structures are not isomorphic: a graph (RDF) is more general than a tree (XML). To ensure the transformation is feasible, a constraint is introduced: each of the connected subgraphs defined by the edges labelled rulz:subset and the inverse of rulz:inRuleset must have a tree-shaped structure. In other words, a ruleset can have no more than one super-ruleset, no cycles can occur within a subset hierarchy, and a rule can be part of no more than one ruleset.
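As a small Turtle sketch of this constraint (the resource names are invented; only rulz:inRuleset, as used later in Fig. 5, appears):

    # Acceptable: each rule belongs to at most one ruleset.
    ex:rule1 a rulz:Rule ; rulz:inRuleset ex:groupA .
    ex:groupA a rulz:Ruleset .

    # Rejected by the constraint: ex:rule2 would belong to two rulesets,
    # so the corresponding subgraph is no longer a tree.
    ex:rule2 a rulz:Rule ; rulz:inRuleset ex:groupA , ex:groupB .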
The π⁻¹ mapping may produce a single tree or several of them. If there are multiple trees, i.e., there are disconnected rulesets and rules, a new root node Group is generated to subsume all the structures under a single tree. Moreover, if the output of the mapping is a single rule that is not part of any group, a root node Group is also generated, as required by the RIF/XML syntax.
As indicated at the beginning of this section, RIF does not permit metadata to be attached to an entity that lacks an identifier. Note that this restriction does not exist in RDF, because blank nodes are allowed as the subject of triples. Consequently, if blank nodes are present in the RDF graph, the π⁻¹ mapping mints identifiers (URIs) for rules and rulesets to complete the transformation.
2 http://vocab.deri.ie/rulz
3 RIF Assembler
RIF Assembler receives one or more RIF documents as input and produces
a single RIF file. The instructions to select the domain rules that will form
part of the output and drive their transformation are also provided by rules.
These assembly instructions are metarules (i.e., second-order rules). The tool
can also read OWL ontologies and RDF datasets, which are taken into ac-
count in the metarule reasoning process, but are not part of the output of the
system.
RIF Assembler does not create rules from scratch. Any rule in the output must be present in the input (note that the reverse is not necessarily true), although it may be subject to changes during the process. The transformations are limited to a rule's metadata and its location within the structure of the ruleset. More specifically, it is not possible to modify the formulation (and therefore the meaning) of the rule.
RIF parsing: The input RIF documents are parsed to populate two data struc-
tures3 . Individual Abstract Syntax Trees (AST) are generated for each rule,
and incorporated into a pool. As a consequence, rules are disengaged from
the source documents. At the same time, the rule and group metadata, as
well as the hierarchical organization of the original documents, are converted
to RDF as explained in the previous section. This information is put into an
RDF store and merged with other domain knowledge sources, such as OWL
ontologies and RDF datasets.
Assembling: In the previous step rules and groups are abstracted from their
sources and syntax. This makes it possible to handle them as RDF resources
and to manipulate them by simply querying and updating the RDF graph.
Metarules specify the conditions and restrictions to be satisfied in order to
produce the tailored system. More precisely, metarules are fired to create,
modify and delete rule and group metadata; to select and delete rules; or
to rearrange the hierarchy by creating/deleting groups and changing rule
membership. It is worth noting that domain rules and metarules live in
separate universes; at no point do they mix with each other.
RIF generation: The last step of RIF Assembler generates a single AST from
the individual ASTs available in the rule pool. The RDF graph is queried
to find out which rules and groups are to be included in the output and
how they nest. The annotations are also obtained from the RDF graph. The
unified AST is then serialized as a RIF/XML document.
This process has been implemented in a web application built upon the Jena Framework4. To implement the metarules, the forward-chaining engine of the Jena general-purpose rule-based reasoner (also known as Jena Rules) is used5. Input and output RIF documents, on the other hand, use the RIF/XML syntax.
A live instance of the application is available at http://ontorule-project.eu/rifle-web-assembler/. The user interface is shown in Figure 3 and is divided into three areas:
1. A toolbar to execute the main operations, namely: (a) to export the graph
to an RDF/XML file; (b) to execute SPARQL queries; (c) to export the
assembled rules to a RIF document; (d) to compute a partial evaluation of
the assembly and (e) to reset the application. The partial evaluation is a
feature that supports multistage assembly processes.
2. A list of the input documents including: domain rules in RIF/XML, domain
knowledge in RDF and OWL files, and metarules in Jena rules. The docu-
ments can be uploaded from local files, URIs or by direct input (typing or
pasting the text in a form).
3 To parse RIF/XML, we use the RIFle library, available at http://rifle.sf.net.
4 http://incubator.apache.org/jena/
5 http://incubator.apache.org/jena/documentation/inference/
3. The panel on the right displays statistics about the information loaded in the system and the resulting model after applying the metarules. For instance, in Figure 3, four domain rules have been loaded from one RIF document, but only three rules remain after the execution of the metarules.
Finally, RIF Assembler makes use of Parrot, a RIF and OWL documentation
tool [16], in order to display the combinations of ontologies and rules in the input
files and the final assembled ruleset.
4 Usage Scenarios
Two usage scenarios are presented in this section. The first one is a simplistic,
imaginary example in the health care domain, and illustrates the concept of rule
mashups. The second one is a realistic and more sophisticated scenario related to
knowledge reuse within ArcelorMittal, the world’s largest steel producer. More
details about the latter scenario can be found at [7].
Nowadays, drugs are shipped with Patient Information Leaflets (PIL) which con-
tain essential information about the product. The structure of a PIL, e.g., “list
of excipients”, “contra-indications”, and “use during pregnancy and lactation”,
is defined by regulations, for instance, European Directive 2001/83/EC.
One can imagine that, in the near future, pharmaceutical companies will pub-
lish this information on the web (open data). Although some information, such as
the list of excipients, may be modeled and made available as RDF graphs using
vocabularies such as SNOMED6 , other information, such as contra-indications
and interactions with other medicinal products, may require using RIF rules. As
an example, Figure 1 contains a rule that alerts breast-feeding women not to
take a particular drug.
In this hypothetical scenario, RIF Assembler may be used to process the rules
harvested from the web. As not all the sources are equally trustworthy, RIF
Assembler may execute metarules that decide which rules are reliable, taking
into account, for instance, provenance information contained in annotations. An
example of these metarules would be: “keep all the rules that come from a source
whitelisted by the World Health Organization”. The output of the assembly
process would be a ruleset potentially usable for making decisions on treatments
and prescriptions. Such a system would support professionals in medicine and
be the foundation for personal software assistants (virtual doctors).
"Group for
galvanization
quality control"
rulz:hasSubset
"Group for
"Group for
Rulesets "Group for assig."
defects"
phenom."
priority priority priority
follows follows
Tasks
"Defect id. "Phenomena "Coil assig.
task" id. task" task"
Fig. 4. RDF graph representing the rules and rulesets of the steel industry scenario
relation between tasks and business processes. The follows property defines a total order within sets of tasks. The realizes property is used in rule and ruleset annotations to connect them to the tasks (see Figure 4). Using this ontology, we can express the fact that "The group to which Rule 1 belongs realizes the defect identification task that implements the galvanization quality process".
Rules are drawn from the shared pool and their metadata are augmented
and refined with specific knowledge borrowed from other ArcelorMittal facilities.
Metarules instruct RIF Assembler on how to manipulate and organize the input
rules. These assembly instructions are dictated by business experts in the domain
who are aware of the specifics throughout the production line. In the case of the
Avilés galvanization line, the metarules carry out three activities:
1. Creating a generic quality system from the rule pool. Metarules such as the
one in Figure 5 select the rules that are relevant to the galvanization quality
process and put them in a group as depicted in Figure 4. Rules from the
pool that are not relevant are simply discarded. Another metarule assigns
priorities to each group based on the precedence relation between tasks.
2. Augmenting the system with additional rules from similar factories. Some
metarules determine which business policies in use in other facilities may
also be applied in Avilés, even if the associated rules are not part of the
generic quality system. These rules are simply added to the final ruleset.
3. Refining the system by attending to the specifics of the galvanization pro-
cess in Avilés. In some cases, rules from another factory may also be added,
replacing rules from the generic quality system. This is the case with rules
related to electrogalvanization, which is a refinement of the generic galva-
nization process and is available only at certain factories.
Although not used in this scenario, RIF Assembler may also exploit OWL in-
ference, particularly the OWL-RL profile [12], to augment the expressivity of
metarules. This opens the door to handling ontologies with complex hierarchies,
such as those that describe business processes beyond the simple case addressed
in this scenario. Another practical application of reasoning within RIF Assem-
bler is to combine rules that are annotated with respect to different ontologies,
and that consequently require alignment.
Figure 4 depicts the RDF graph at an intermediate stage of the assembly
process. The solid lines represent annotations of rules and statements from the
ontology that is provided as input. The dashed lines indicate inferences derived
by the metarules. All the ruleset resources and their hierarchy were created by
the metarules.
[R1:
(?rule rdf:type rulz:Rule)
(?rule bp:realizes ?task)
(?rule rulz:inRuleset ?group)
(?rule bp:factory bp:Pool)
(?task bp:implements bp:QualityGalvanizationProcess)
(?task rdfs:label ?taskName)
strConcat("Autogenerated ruleset for task ",?taskName,?rulesetName)
makeTemp(?newGroup)
->
remove(2)
(?newGroup rdf:type rulz:Ruleset)
(?newGroup rdfs:label ?rulesetName)
(?newGroup bp:realizes ?task)
(?newGroup bp:scope bp:GalvanizationQualityProcess)
(?newGroup bp:factory bp:FactoryInAviles)
(?rule rulz:inRuleset ?newGroup)
]
Fig. 5. Metarule that creates new groups for rules from the pool that realize a relevant
business task
5 Related Work
Business Rule Management Systems (BRMS) evolved from (production) rule engines to cope with requirements beyond the mere execution of rules. One of the
key features of modern BRMSs is the rule repository [9], where artifacts and
their metadata are stored. Rule metadata are particularly important in the last
step of the BRMS lifecycle: maintenance of the rule-based application [13]. This
final step involves knowledge reuse and adaptation to changes in the application
context. Leading BRMS products feature some mechanism to extract rules, i.e.,
generate rulesets that contain subsets of the complete knowledge base, by speci-
fying conditions applied to rule metadata [3]. This is the case of IBM WebSphere
ILOG JRules, which permits the execution of queries against rule metadata. Im-
proving on this feature was one of the motivations for our work on knowledge
reusability. In particular, RIF Assembler extends the query functionality with
the execution of metarules, permitting not only selection, but also modification
of the ruleset structure.
The Object Management Group7 (OMG) introduced the Model-driven Archi-
tecture (MDA) paradigm [15], which aims to model real systems using standards
such as UML, MOF or XMI. The models defined with these standards remain
abstract, and actual implementations can be obtained via automatic code gen-
eration. Although the topic of knowledge reuse is shared with RIF Assembler,
different drivers motivate these approaches: MDA is model-centric while RIF
Assembler is rule-centric.
SPIN [11] is a W3C Member Submission that proposes an RDF syntax for
SPARQL queries. Some of these queries, namely CONSTRUCT and SPARUL
queries, may express rules [14]. Therefore this work and SPIN share the idea
that rules can be represented by RDF resources. This permits the construction
of hybrid models that combine the model (ontology) and the queries, and where
queries can modify the model itself. We chose to build on RIF and not on SPAR-
QL/SPIN, because the former covers a wider range of rule languages. Thus, RIF
Assembler can be used with any BRMS that supports the RIF standard, while
using SPIN would require translation from different rule languages to SPIN.
XSPARQL provides a language to transform between RDF and XML [1]. It is conceivable to use it to generate RIF/XML from SPARQL queries. In this sense, it can go beyond what is currently possible with RIF Assembler. However, there is a price to pay for this flexibility: while RIF Assembler only requires rule-writing skills, an ability that can be presumed in the target user community, XSPARQL requires technical knowledge of the RIF/XML syntax, the RDF model and the SPARQL query language.
There are similarities between our work and metaprogramming, i.e., programs
that generate other programs [17]. RIF Assembler can be seen as metapro-
gramming where both the final program and the metaprogram are expressed
in rules (RIF and Jena Rules, respectively). However, RIF Assembler does not
alter the formulation of the rules, and therefore it is limited with respect to the
7 http://www.omg.org/
6 Conclusions
The vision of RIF Assembler relies on an appealing idea, namely, that rules can
be reified as RDF resources and treated as first-class web citizens. By doing so,
the doors are opened to rules linking to arbitrary resources in the web of data,
and vice versa. Many applications may exploit this idea, such as rule search en-
gines and rule-based personal assistants in the areas of ambient intelligence and
eHealth.
In this paper, two scenarios give insight into the potential of RIF Assembler. We show that, assuming annotated rules are available, it is possible to derive mashup rulesets simply by writing down assembly instructions as metarules. These metarules can be informed by domain experts, dramatically simplifying the construction of families of decision-support systems compared with previous, manual approaches.
Knowledge reuse, in particular the reuse of rules, is of critical importance
to any organization. We envision that the functionality of RIF Assembler may
eventually be an integral part of future BRMS products. As the availability of
rules increases on the web and in corporate environments, fostered by the adop-
tion of RIF, reuse will become easier. However, RIF Assembler goes beyond the
pure exchange of rules. It proposes that rules can be mixed and manipulated
independently of their source. For instance, given two rule-based systems A and
B, respectively developed with IBM JRules and JBoss Drools (both Java-based
environments), their rules can be exported to RIF. This permits the use of RIF
Assembler to select subsets of A and B to create a new system (described in
RIF), C, that can be translated to JRules, Drools or any other execution envi-
ronment, such as the C-based CLIPS. RIF Assembler supports scenarios where reuse involves not only the portability of previously built solutions but also the rearrangement of knowledge in order to meet new requirements and contexts of use, as in the ArcelorMittal scenario.
RIF Assembler does not yet analyze the contents and semantics of the rules.
Therefore, it is difficult to detect and handle contradictions between the rules.
Similarly, RIF Assembler does not provide automated consistency checks of the
assembled ruleset. It is up to the user to decide and implement metarules to deal
with rulesets that do not merge seamlessly. Nevertheless, the tool provides some
help in this task. For instance, it makes it easy to replace a troublesome rule with a
better alternative. The authors will continue to improve the tool in this direction, following the findings of the ONTORULE project on static rule consistency checking [6].
References
1. Akhtar, W., Kopecký, J., Krennwallner, T., Polleres, A.: XSPARQL: Traveling
between the XML and RDF Worlds – and Avoiding the XSLT Pilgrimage. In:
Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008.
LNCS, vol. 5021, pp. 432–447. Springer, Heidelberg (2008)
2. Boley, H., Kifer, M.: RIF overview. W3C Working Group Note, W3C (June 2010),
http://www.w3.org/TR/rif-overview/
3. Boyer, J., Mili, H.: Agile Business Rule Development - Process, Architecture, and
JRules Examples. Springer (2011)
4. de Bruijn, J.: RIF RDF and OWL Compatibility. Recommendation, W3C (June
2010), http://www.w3.org/TR/rif-rdf-owl/
5. de Sainte Marie, C., Hallmark, G., Paschke, A.: RIF Production Rule Dialect.
Recommendation, W3C (June 2010), http://www.w3.org/TR/rif-prd/
6. Fink, M.: D2.6 consistency maintenance final report. Deliverable, ONTORULE
project (2011)
7. González-Moriyón, G., Polo, L., Berrueta, D., Tejo-Alonso, C.: D5.5 final steel
industry public demonstrators. Deliverable, ONTORULE project (2011)
8. Hawke, S., Polleres, A.: RIF In RDF. Working Group Note, W3C (May 2011)
9. Herbst, H., Myrach, T.: A repository system for business rules. In: Mark (ed.)
Database Application Semantics, pp. 119–138. Chapman & Hall, London (1997)
10. Kifer, M., Boley, H.: RIF Basic Logic Dialect. Recommendation, W3C (June 2010),
http://www.w3.org/TR/rif-bld/
11. Knublauch, H., Hendler, J.A., Idehen, K.: SPIN - Overview and Motivation. Mem-
ber Submission, W3C (February 2011)
12. Motik, B., Fokoue, A., Horrocks, I., Wu, Z., Lutz, C., Grau, B.C.: OWL 2 web
ontology language profiles. W3C recommendation, W3C (October 2009),
http://www.w3.org/TR/2009/REC-owl2-profiles-20091027/
13. Nelson, M., Rariden, R., Sen, R.: A lifecycle approach towards business rules man-
agement. In: Proceedings of the 41st Annual Hawaii International Conference on
System Sciences, pp. 113–113 (January 2008)
14. Polleres, A.: From sparql to rules (and back). In: Proceedings of the 16th Interna-
tional Conference on World Wide Web WWW 2007, p. 787 (2007)
15. Poole, J.D.: OMG Model-Driven Architecture Home Page (2001),
http://www.omg.org/mda/index.htm
16. Tejo-Alonso, C., Berrueta, D., Polo, L., Fernández, S.: Metadata for Web Ontolo-
gies and Rules: Current Practices and Perspectives. In: García-Barriocanal, E., Cebeci, Z., Okur, M.C., Öztürk, A. (eds.) MTSR 2011. CCIS, vol. 240, pp. 56–67. Springer, Heidelberg (2011)
17. Visser, E.: Meta-programming with Concrete Object Syntax. In: Batory, D., Con-
sel, C., Taha, W. (eds.) GPCE 2002. LNCS, vol. 2487, pp. 299–315. Springer,
Heidelberg (2002)
Product Customization as Linked Data
Edouard Chevalier and François-Paul Servant
Renault SA
13 avenue Paul Langevin
92359 Plessis Robinson, France
{edouard.chevalier,francois-paul.servant}@renault.com
of comparison you want: price and CO2 emissions of small gasoline cars with
sun-roof, etc.
Hardly surprising: the data such applications require is not available yet. Are
manufacturers reluctant to open up their data about their ranges? If they are,
they will have to change as more daring competitors enter the game and begin
publishing their data: the cost of not appearing at all in the results of searches
made by potential customers would be overwhelming, particularly if such search
results easily convert to purchase orders.
On the other hand, it should be noted that cars are more complicated to handle than books: books are searched on the basis of a very small set of properties (title, author, ...); they are well identified (e.g. through ISBN); and comparisons between commercial offers only involve completely defined products. Cars, in contrast, are customizable, which is a crucial aspect of the problem: rather than fully specified products, what you compare are sets of them, that is, partially defined products.
In industries practicing "Build to Order" of fully customizable products, ranges are huge, because of the number of features and options a customer can choose from: more than 10²⁰ different cars are for sale at Renault, and 30 to 40 decisions are needed to completely differentiate one of them. These ranges are not only huge, they are also complex, because of the many constraints between features which invalidate some of their combinations: if every combination of distinctive features and options were possible, there would be 10²⁵ different Renault cars, not our mere 10²⁰, meaning that you have only one chance in 100,000 of defining an existing Renault car if you choose its specifications without taking the constraints into account.
Specifying those product ranges requires the use of a vocabulary able to repre-
sent the constraints. This can be done by means of Semantic Web languages [2],
but using this data in practical applications requires sophisticated automatic rea-
soning to handle the constraints. Publishing such range definitions on the web
clearly won’t bring many practical results soon, as one cannot expect strong
reasoning capabilities from client agents.
Though difficult to specify and hard to manipulate, these product ranges are
nevertheless described rather effectively, for human users, by means of dedicated
web applications called configurators. A configurator helps a user interactively
define a product step by step, each step describing a valid partially defined prod-
uct (PDP), with a start price and a list of remaining choices given all previous
selections. Each of these choices links to another PDP until completion. Thus,
the configuration process traverses a graph whose nodes are PDPs. Now identify
each PDP with a URI returning the list of the PDPs it is linked to, among other
relevant information: what you have is a description of the range as Linked Data.
This is how a configuration engine can publish descriptions of complex products
on the web of data, which agents without reasoning capabilities can effectively
understand.
This should be of some benefit to any configurator application, whatever its
actual implementation. To name but one - the easy sharing of PDPs between
Because the set of different products that a customer can specify and order is far too large to be enumerated, ranges of customizable products are defined in intension. The specification of a family of similar products (typically those of the same "model") is based on a "lexicon", i.e., a set of variables representing the relevant descriptive attributes: body type, type of fuel, color, etc. In a completely defined product, each of these variables is assigned exactly one value. Such a value is called a "specification" in ISO-10303 STEP AP 214 terminology, a term that we will use throughout this paper. In the Renault range, the variables are discrete: each of them, e.g. the type of fuel, has a finite list of possible values, e.g. gasoline, diesel, electric, gasoline-electric hybrid.
A set of constraints then restricts the possible combinations of specifications. The Product Range Specification (PRS) is therefore a Constraint Satisfaction Problem (CSP), and the many PRS-related questions that have to be answered in the day-to-day operation of the business are computationally hard (SAT being NP-complete). Renault has developed tools based on a compiled representation of this CSP. The computationally hard part of the problem is fully solved in an offline phase, guaranteeing bounded and fast response times for most queries: for configuration queries, time is linear in the size of the compiled representation, which happens to remain small enough [3].
We list here the features that we think are necessary in a good configurator. Not all of them may be available in the implementations seen on the web, either because the software supporting the configuration process (the configuration engine) is not able to provide them, or because of poor application design that sticks to the old way of selling products, typically imposing a predefined order in which the user must make her choices; this simplifies the handling of the configuration process, at the expense of user comfort.
The main point is that the configuration engine should guarantee completeness of inference, that is, every consequence logically entailed by a given state in the configuration process is actually inferred when in that state, and not later [3]. This is absolutely necessary if users are to choose freely from specifications compatible with their previous selections, and to be barred from making choices that no valid configuration satisfies; in other words, if the whole range is to be made accessible to them, without their ever having to backtrack from dead ends.
Here is a list of desirable features from a functional perspective:
3 Configuration API
The configuration process can be modeled as Linked Data. This provides the
basis for a simple, yet generic, Configuration API.
configService?chosenSpec=spec1&chosenSpec=spec2&... (1)
2 http://purl.org/coo/ns
and returns the next list of specifications to choose from, all guaranteed to be compatible with the input. Choosing one of them is then just a matter of adding it to the list of "chosenSpec" query parameters and fetching the updated state of the configuration process.
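For instance, a sequence of choices over the toy range of Sect. 5 (the specification identifiers are those of the example lexicon; the service name is the placeholder used in template (1)) could look like:

    configService?chosenSpec=Model1
    configService?chosenSpec=Model1&chosenSpec=Diesel
    configService?chosenSpec=Model1&chosenSpec=Diesel&chosenSpec=SunRoof

Each request identifies the partially defined product reached so far.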
Note that a query such as (1) identifies a configuration and can be used as a URI for the configuration in question; or, more precisely, it redirects to an actual URI of it. We can therefore improve the service by making it return
the URI of the linked configuration along with each compatible specification:
the representation of the configuration resource then contains a list of couples
(compatible specification, linked Configuration).
Such a service makes it easy to implement a configurator application: access-
ing a configuration URI returns the data needed to build the corresponding web
page: basically a list of links to the next configurations. Every configurator ap-
plication on the web could be (re-)implemented this way: it is just a matter
of wrapping the configuration engine in a REST service that provides the data needed to generate the HTML.
When implementing such an application, one can play with HTTP content negotiation, in quite classical Linked Data style, to respond to a given configuration URI either with data or with an HTML page: either a page built from the data, or the unadulterated data themselves. In the HTML page, the data can be included as RDFa or microdata markup, stating in particular that the page describes the Configuration.
3.2 Querying
The Linked Data based API makes it possible to crawl the range, starting from the root of the service (the "empty configuration"). Only valid configurations are returned.
It is useful to also provide a way to query the dataset. The template of the service (1) can be used as a simple querying API. Note, however, that not every combination of specifications is valid. The service should detect such invalid conjunctions and return a 404 Not Found HTTP error. Only configuration engines that support free order can provide such functionality in every circumstance.
It would be tempting to query the dataset using SPARQL syntax:
SELECT ?conf WHERE { ?conf :chosenSpec :spec1 , :spec2 . }
but according to SPARQL semantics, this would return all the configurations with spec1 and spec2, possibly several billion of them. What we would expect here is not a list of configurations but a single one, or none if spec1 and spec2 are not compatible. It is feasible to implement the intended semantics with SPARQL, but the syntax is a bit cumbersome and therefore far less attractive, so we did not implement a SPARQL endpoint.
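For the record, one way the intended semantics could be written in SPARQL 1.1 (a sketch, not something the service exposes) is to keep only the configuration whose chosen specifications are exactly the ones given:

    SELECT ?conf WHERE {
      ?conf :chosenSpec :spec1 , :spec2 .
      FILTER NOT EXISTS {
        ?conf :chosenSpec ?other .
        FILTER ( ?other != :spec1 && ?other != :spec2 )
      }
    }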
supports free choice order. The configuration corresponding to a car model links to the specifications compatible with that model; now, using a text search engine such as Lucene, index the (model, specification) pairs with the text form of the specification as the key. Then, searching for "air conditioning, sun roof, MP3" (say) returns a list of (model, specification) pairs; making the configuration service conjoin the car model and the relevant specifications yields configurations matching the text search. This also adapts to the case where the configuration engine supports free order only after some main choices have been made, for instance if the car model, engine and level of equipment must be chosen before all options can be selected in free order.
4 Renault Implementation
Traditionally, we have provided access to the functionalities of our configuration engine through a Java API. Recent plans for important changes to the Renault web site created an opportunity to also provide Linked Data based access.
The definition of the current commercial offer is managed by upstream systems and then compiled into the binary data used by the configuration engine (size: <100 MB). It is published by means of a REST service, as described above: Linked Data is materialized on the fly when PDPs are queried (30 KB per PDP). When the definition of the range is updated, part of the knowledge base used by the configuration engine is replaced. URIs of PDPs include the release number of the knowledge base, so all previous URIs become "deprecated", but they can still be queried by clients: an HTTP 301 redirects to the new URI if the PDP still exists in the range, and a 404 is returned otherwise. In the latter case, the service can be re-queried to get a "similar" product.
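A hypothetical exchange illustrating this behaviour (the URI paths below are invented; only the release-number mechanism is as described above):

    GET .../rel41/AF3k     (a PDP URI minted under knowledge-base release 41)
    -> 301 Moved Permanently, Location: .../rel42/AF3k   (the PDP still exists in release 42)

    GET .../rel41/ZZ9q
    -> 404 Not Found       (the PDP is no longer in the range; ask the service for a similar product)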
The implementation of the service uses Jersey3 (the reference implementation of JAX-RS4). As of this writing (February 2012) only JSON data are returned, and only for the German and Italian markets.5 All functionalities of our configuration engine are made accessible through the JSON data, including querying in free order, maximum price, conflict resolution, completion, etc. (optional query parameters are added to the configuration URIs to implement some of them).
A cursory look at the data5 may convey the impression that a configuration URI does not contain the list of chosen specifications which defines the configuration. It does, though, only encoded in a short form. We anticipated that configurations would be shared on Twitter, and using a URL shortener or an internal index might degrade performance, since vast numbers of configuration URIs are generated: 100-300 to represent a single configuration (indeed, many links are included, since we provide free order of choices).
Regarding performance, the HTTP response time for accessing one configuration is around 20-30 milliseconds.
3 http://jersey.java.net/
4 http://jcp.org/en/jsr/detail?id=311
5 http://co.rplug.renault.{de,it}/docs
5 Configuration Ontology
This simple ontology6 describes the classes and properties involved in the mod-
eling of the configuration process as Linked Data.
As a partially defined product whose completion to a valid product always
exists, a configuration can be seamlessly described in the GoodRelations ontology
framework and can participate in the web of data for e-business. This ontology
is generic, that is, applicable to any kind of customizable product: it does not
depend on the set of variables and specifications with which a given product is
defined.
Examples. In the following, we use examples about a very simple range of cars,
all of the same model called Model1. The lexicon contains four variables:
– Fuel Type: {Gasoline, Diesel}
– Temperature Control: {Heating, AirCond}
– Radio Type: {NoAudio, SimpleRadio, RadioMP3}
– Roof: {NormalRoof, SunRoof}.
The product range specification is defined in Tab. 1, by the list of specifica-
tions available on the three levels of equipment. The total number of different
completely defined cars is 8.
Notations. The RDF examples are written in Turtle syntax, using the prefix
“co” for this configuration ontology, “gr” for GoodRelations, “vso” for the Vehicle
Sales Ontology and “r” for the specifications.
Now say you want a radio, but you do not care what kind it is. Because a configuration engine may support choices such as r:SimpleRadio OR r:RadioMP3, we adopt the convention that if two or more of the co:chosenSpec of a Configuration correspond to the same variable, they are to be interpreted as ORed (in fact XORed, since a completely defined product assigns exactly one value per variable).
ex:Conf2 a co:Configuration ;
    co:chosenSpec r:Model1 , r:SimpleRadio , r:RadioMP3 .
This means that the car has either an r:SimpleRadio or an r:RadioMP3, but not both.
Choice order. Choices are made one at a time and in a given order, which may
matter. Of course it doesn’t impact the characteristics of the product in any way,
but it can be used by the application, for instance to display a textual description
of the configuration. This could be achieved with an additional co:choiceSeq
property having rdf:Seq as its range.
Given the co:definingChoice(s) of a Configuration, some specifications are implied, i.e., included in any completely defined product matching the configuration; some are impossible, i.e., they can no longer be chosen; and others are simply compatible: they can still be chosen from among several alternatives.
Proposing the selection of several specifications at once. Let us note that this model supports the selection of several specifications at once. This can be useful from a marketing point of view, for instance to emphasize certain packs of specifications or certain fully featured configurations:
ex:Conf3
    co:chosenSpec r:Model1 ;
    co:possible [ a co:ConfigurationLink ;
        rdfs:label "Over-equipped configuration!" ;
        co:specToBeAdded r:AirCond , r:RadioMP3 , r:SunRoof ;
        co:linkedConf ex:overEquippedConf ] .
Product Customization as Linked Data 613
co:alternative. A user may want to change one of her previous selections. This property lists those of the co:chosenSpec that can be changed: it links the configuration to a similar one, with one of the co:chosenSpec removed or changed. This property cannot be used when the chosen specification in question happens to be implied by the other choices. For instance, on ex:Conf1PlusDiesel, r:Diesel can be replaced by r:Gasoline:
ex:Conf1PlusDiesel co:alternative [
    a co:ConfigurationLink ;
    co:specToBeRemoved r:Diesel ;
    co:specToBeAdded r:Gasoline ;
    co:linkedConf ex:Conf1PlusGasoline ] .
Starting prices of the linked configurations can be embedded within the RDF data returned when dereferencing the configuration:
ex:Conf1 co:possible [ a co:ConfigurationLink ;
    co:specToBeAdded r:Diesel ;
    co:linkedConf ex:Conf1PlusDiesel ] .
ex:Conf1PlusDiesel gr:hasPriceSpecification [
    a gr:UnitPriceSpecification ; gr:hasCurrency "EUR" ;
    gr:hasMinCurrencyValue "10000.00"^^xsd:float ] .
The suffix "Model" may seem misleading when used for a Configuration, as it suggests something such as "Ford T", and not "Ford T with Air Conditioning and MP3 connection plug" (itself not a completely defined product, since you can still choose, say, the color: it is a "prototype of similar products").
On the other hand, a configuration has a price. It may be seen as a commercial offer, or as the expression of a customer's wish list, and can therefore be considered a gr:Offering as well. Giving, as we do, the start price of a co:Configuration through gr:hasPriceSpecification makes it a de facto gr:Offering. Also, the range depends on the vendor, a typical characteristic of an offer; e.g., two PC vendors may both sell a PC with an Intel Core i7 2500K and 4 GB RAM (this is a configuration), yet propose different disk capacities.
So, a Configuration can be considered as both a gr:ProductOrServiceModel and a gr:Offering.
This ontology is generic: it does not depend on the variables and specifications used to define a product, and it allows a publisher to use its own terms as specifications. This is an important point, as the whole purpose of the configuration process is to end up with an order for a completely defined product, which implies defining it in the manufacturing company's terms. On the other hand, there are shared product vocabularies on the web, and nothing prevents us from adding triples that use terms from such vocabularies to the description of a Configuration. Example using the Vehicle Sales Ontology:
ex:Conf5 a co:Configuration ;
    co:chosenSpec r:Model1 , r:Gasoline ;
    vso:fuelType dbpedia:Gasoline .
an agent can implement a text-based search mechanism with small indexes and calls to the configuration service. What about search engines, then? We expect them to index our products as a matter of course.
The harsh reality, though, is that ranges are huge. We can proudly announce the availability on the web of data of our 10²⁰ descriptions of completely defined products, and of even more partially defined ones, yet this is far more than what the most obstinate robot can cope with. So we cannot but give thought to the fact that indexing will be partial.
Basically, configurations will be indexed by specifications. The semantics of the properties used to describe a Configuration should be carefully taken into account when deciding which specifications indexing will be based on. For instance, if the values of the co:possible property were used to index configurations, queries searching for products containing several specifications could return matches that do not actually include their conjunction: spec1 and spec2 can each be individually compatible with a given configuration while their conjunction is impossible. Or they could be displayed at a lower price than the true one: the start price of a configuration generally increases when options are added. The only way to return accurate results would be to query the configuration service at runtime; while this is a simple thing for a specialized agent to do, search engines will not do it. As another example, indexing configurations with chosen and implied specifications only would require building a very large index to get matches for searches involving many specifications. The best solution probably uses the union of the values of co:chosenSpec, co:impliedSpec and co:defaultSpec.
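For concreteness, this union could be gathered per configuration with a query along the following lines (a sketch only; recall from Sect. 3.2 that the service itself does not expose a SPARQL endpoint):

    SELECT ?conf ?spec WHERE {
      ?conf a co:Configuration .
      { ?conf co:chosenSpec  ?spec }
      UNION { ?conf co:impliedSpec ?spec }
      UNION { ?conf co:defaultSpec ?spec }
    }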
Of course we do not know how search engines will proceed. We enable them to
crawl the dataset, either starting just from its root (the “empty configuration”), or
from any configuration, and following links whose semantics is precisely defined
in the co:ConfigurationLink class. We provide them with enough information to
customize their strategies. For instance, they can choose which links they follow.
Not all specifications are of equal interest: the sun roof, the MP3 connector, etc.
are probably more important - for a customer as well as for a search engine -
than, say, the color of the ashtray.
On the other hand, the "sitemap" file of the web site is the place for the publisher to list the configurations that indexing robots should consider first. A still unanswered question is: which configurations should be included in the sitemap file to get the most out of it from a marketing point of view? Clearly, the choice should be driven by marketing data: for instance, which specifications and configurations should be "pushed" toward the customer?
6 Benefits
6.1 Improved Architecture
As noted in Section 4, the access we historically provided to the functionalities of our configuration engine was through a Java API. Switching to a REST-based API brought its own benefits. Before this change, our configuration engine and
The development of several new client applications is under way, and the costs
are much lower than with our previous Java API: the GUI developer does not
have to understand the concepts underlying configuration, nor (for the larger
part) to learn an API. Basically, she just has to display the links found in the
data.
Configurations truly deserve their status as first-class objects. They represent Partially Defined Products. They also capture the exact expression of the customer's wish list, constrained by the definition of the range: a very important point of concern from a marketing point of view! Global identifiers for configurations may be put to a number of uses, most of which increase the visibility of the commercial offer. To name a few:
Agents knowledgeable about the buying habits and preferences of consumers can use this data to generate ads that better match their possible wishes. For instance, if a user known to be young and accustomed to buying and downloading music issues a query about cars, an ad for a small car with an MP3 adaptor can be displayed.
7 Conclusion
References
1. Hepp, M.: GoodRelations: An Ontology for Describing Products and Services Of-
fers on the Web. In: Gangemi, A., Euzenat, J. (eds.) EKAW 2008. LNCS (LNAI),
vol. 5268, pp. 329–346. Springer, Heidelberg (2008)
2. Badra, F., Servant, F.P., Passant, A.: A Semantic Web Representation of a Product
Range Specification based on Constraint Satisfaction Problem in the Automotive
Industry. In: OSEMA Workshop ESWC (2011),
http://ceur-ws.org/Vol-748/paper4.pdf
3. Pargamin, B.: Vehicle Sales Configuration: the Cluster Tree Approach. In: ECAI
Workshop on Configuration (2002)
From Web 1.0 to Social Semantic Web: Lessons Learnt from a Migration to a Medical Semantic Wiki
1 Introduction
During the last two decades, the Internet has totally changed the way information is published and shared in most scientific areas, including medicine. The first websites, in web 1.0 fashion, were made of static pages and hyperlinks allowing only limited interaction between editors and readers. Information sharing then evolved with the rise of web 2.0, which allows users to contribute to the contents. Numerous studies have shown the positive impact of such evolutions on medical information systems [11,23]. Participative web applications can be implemented and used in a collaborative way to build large databases. Finally, the semantic web has appeared. The semantic web aims at creating and sharing formalized information in order to make it available to both humans and machines. The social semantic web
is considered as the merging of web 2.0 and the semantic web, i.e. a web where
shared formal information is edited collaboratively.
The Kasimir research project started in 1997. It aims at providing tools to
assist decision making by practitioners and, more generally, decision knowledge
management in oncology. The project is conducted in partnership with Oncolor,
an association gathering physicians from Lorraine (a region of France) involved
in oncology. On its static website, Oncolor publishes more than 140 medical guidelines written in HTML in a web 1.0 fashion. This base is built through a consensus between medical experts and is continually updated according to the state of the art in oncology and to evolutions of the local context. In order to facilitate the creation, maintenance and publication of guidelines, Oncolor has expressed the need for more efficient and collaborative tools. Moreover, it would be of great benefit if the knowledge contained in the guidelines were formalised and made available to semantic systems, particularly to Kasimir, since knowledge acquisition is a bottleneck for building knowledge systems.
In this paper, an application of a semantic wiki approach to medical guideline editing is reported.1 The expected benefits are twofold: first, online collaborative work is simplified by the use of wikis; second, semantic technologies allow the creation of additional services by making use of external medical resources such as terminologies, online ontologies, and medical publication websites. However, despite the efforts of the semantic wiki community to simplify its systems, it is still hard for medical experts to create semantic annotations. This implies the need to take into account both structured and unstructured content and also, when possible, to include dedicated tools for formalising data. In such cases, the implementation and development of semantic wiki extensions are required.
The rest of the paper is structured as follows: Section 2 describes the ap-
plication context. The migration of the static Oncolor website to a collaborative system is presented in Section 3, while Section 4 describes the addition of semantic
annotations and services. After a report on our evaluation study in Section 5,
some related work is introduced in Section 6. Section 7 is a discussion about the
benefits of the system, as well as ongoing and future work.
2 Context
treatment or dental care. Since guidelines are intended for both medical staff and patients, editors have exploited various kinds of formats in order to be both precise and didactic. Most guidelines follow the same structure. The first part introduces the guideline with a few sentences that explain in which circumstances the guideline applies and which treatments will be proposed. The next part is a textual description of clinical and paraclinical investigations that can lead to the starting point of the guideline. This starting point is often a staging step that classifies the patient according to international classifications. These classifications are presented as simple tables. Depending on the classification results, decision trees guide the reader to the next step, which details the medical recommendation, available in various formats such as medical publications in PDF or hypertext links to distant resources. Finally, guidelines conclude with advice about medical supervision and sometimes with a lexicon of specific terms.
As in all medical information systems, data quality in oncology is critical.
Each guideline should be reviewed every second year by experts. Two kinds of
editors can be identified in the reviewing process:
– Medical experts contribute their technical knowledge. They are gathered in committees under the supervision of coordinators who make sure the guidelines are complete and consistent. Most medical experts have poor computer skills, limited to word processing and Internet browsing.
– Oncolor staff manage communication between the committee members and create the final guideline layout. They also check that guidelines are up to date and propose new ways to facilitate their diffusion, while public health physicians check the consistency of the information base. Most Oncolor employees do not have better computer skills than the medical experts, except for a computer graphic designer. In particular, Oncolor does not have a webmaster on its staff.
Guidelines are made available on the Oncolor website [2], which also contains
various information about local healthcare services and provides links to dedi-
cated tools. This site also stores other Oncolor projects, including a thesaurus
of pharmacological products which is closely related to oncology guidelines. It
contains information about drugs used in cancer treatment.
Created in the mid 1990s, this website was built entirely with a commercial WYSIWYG HTML editor. The resulting HTML code is barely readable, due to successive technology evolutions. The first pages were created using only HTML; then, over the past 15 years, CSS, Javascript and XHTML were introduced, and a few pages also use ASP. All these evolutions have led to convoluted pages in which only the visual aspect matters and the document structure is hard to identify. Over the years, updating the website has become more and more complex for the Oncolor staff. All the pages published on the Oncolor website must be validated to follow the principles of the HONcode certification [1], which guarantees the quality and the independence of the content.
In this context, Oncolor has been asked to integrate a collaborative tool to simplify the guideline creation and maintenance process. Moreover, it would be of great benefit for Oncolor to keep track of all changes to the guidelines. That is why the system has to provide a versioning file system and some social tools to allow communication between experts during the updating process.
Wikis and Semantic Wikis: The Migration Process. Traditional wikis are usually based on a set of editable pages, organised into categories and connected by hyperlinks. They became the symbol of the interactivity promoted by web 2.0. One of the founding principles of wikis, which is also the main driver of their popularity, is their ease of use, even for people with limited computer skills. Wikis are created and maintained through specific content management systems, the wiki engines, while wikitexts enable the structuring, layout, and linking of articles. At this point, an idea emerged: to exploit the stored pieces of knowledge automatically.
Indeed, one limitation of wikis shows up when querying the data contained in their pages. Search is usually done through string matching, without considering the meaning of the words. For example, the system cannot answer a query like "Give me the list of all currently reigning kings." The solution used in Wikipedia is the manual generation of lists. However, manually generating all the lists answering the queries users may raise is, at the very least, tedious, if not impossible. This has motivated the addition of a semantic layer to wikis. Moreover, it would be interesting if the information contained in wikis were available through external services.
Semantic wikis were born from the application of wiki principles in the se-
mantic web context. A semantic wiki is similar to a traditional one in the sense
that it is a website where contents are edited in a collaborative way by users and
are organised into editable and searchable pages. However, semantic wikis are
not limited to natural language text. They characterise the resources and the
links between them. This information is formalised and thus becomes usable by
a machine, through processes of artificial reasoning. Thus, semantic wikis can be
viewed as wikis that are improved by the use of semantic technologies as well as
collaborative tools for editing formalised knowledge.
Semantic wikis correspond to both Oncolor's and Kasimir's needs: guidelines can be written collaboratively, and semantic technologies make it possible to formalise and extract structured content.
The first part of the migration was choosing the most suitable semantic wiki engine. Although many semantic wiki engines have emerged over the last 10 years,
only four open source projects seem active at this time: AceWiki [17], KiWI [22],
Ontowiki [13], and Semantic Mediawiki [16]. AceWiki uses ACE (Attempto Con-
trolled English), a sub-language of English that can be translated directly into
first order logic. However, Oncolor guidelines are already written in French and
the development of a controlled language for French medical guidelines that
covers all the contents would be tedious. Ontowiki and KiWI focus on RDF triple editing through dedicated interfaces such as dynamic forms. Their approaches are very strict and do not seem compatible with the import of unstructured content. Moreover, no large-scale deployment of these engines can be found, and their development and user communities are limited, so fewer extensions are available and support is weak.
Semantic Mediawiki (SMW) seems to be the best solution. SMW is an ex-
tension of Mediawiki, the engine used by Wikipedia. For the sake of simplicity
for users, it integrates RDF triple editing into its wikitext. In this way, it enables the creation of typed links that can also be used to indicate the attributes of the page. Another interesting point of SMW is its popularity: there is
a large community of developers around it, and this community produces many
extensions, such as editing forms, the integration of an inference engine, etc.
For instance, the Halo extension (http://www.projecthalo.com/) proposes forms, an auto-completion system, the integration of a SPARQL endpoint, and much more. The only limitation for our migration is that SMW does not provide an extension for drawing the trees that are frequently used in the guidelines, but we have developed a decision tree editor, as will be discussed further. Tutorials and community support make
the installation of SMW simple. Less than one hour is needed to install it for
anybody with average computer skills.
Once the semantic wiki had been installed, a specific skin corresponding to the Oncolor graphics standards was built to customise the application. The next part of the work was to import the guidelines into the wiki. To conform to the wiki syntax, the content had to be converted into wikitext. For each guideline, the HTML content was extracted, and HTML pages were merged when a guideline contained more than one page. The table of contents was automatically extracted and marked up when possible. However, the state of the HTML code made it impossible to systematically identify the document structure. It can be noted that the migration would have been simpler if CSS had been used from the start. Then, unnecessary content such as browsing elements and
JavaScript functions was removed. A parser was also used to transform HTML
into wikitext when simple tags were detected (images, tables, etc.). Moreover, by
using the parser together with context analysis, specific fields were identified. The objective was to extract useful information about each guideline, such as the date of its last update or keywords. In addition, by examining the website folder structure, an anatomical classification of the guidelines was identified. This classification was reused as a basis for guideline categorisation in the wiki.
Despite all our efforts, the layout of the imported guidelines then had to be checked. Due to the critical nature of the information, this check was done by the Oncolor staff. On average, a person needed half a day to check each guideline.
Additionally, the Oncolor thesaurus of pharmacology was imported. As its content is closely related to the guidelines, it was important to make it available in the same information system. One page was created per described drug. In this case, the simplicity of the HTML pages made the migration easier.
To migrate the guidelines, Mediawiki's import facilities were used, which allow importing wikitext content from text files. In the wiki, some templates were built to highlight the fields previously identified. An excerpt of a resulting page is shown in Figure 1. All the guidelines are presently in the wiki.
In the usual philosophy of wikis, everybody can edit pages, even anonymously. Despite the importance of making this information available to the public, medical data are critical and the guidelines must be approved by Oncolor experts before being made publicly accessible. Moreover, if an expert modifies a guideline, the modification
Decision trees were imported from the previous website as bitmap pictures. At
this point, guideline updates can also be simplified by proposing an online ed-
itor. KcatoS is a Mediawiki extension that allows the collaborative drawing
of decision trees. KcatoS decision tree language is a graphical representation
based on a small set of geometrical figures connected by directed edges. This rep-
resentation was directly inspired by the graphics standards of Oncolor. Indeed,
guidelines use visual representations that can mostly be viewed as trees. An advantage of using these graphics standards is that the Oncolor experts already know them. We want to preserve Oncolor's graphic semantics in order to facilitate the
understanding of guidelines by physicians.
From a semantic point of view, each kind of node has its own meaning; rounded rectangles, for example, represent medical situations.
Most of the time, decision trees can be considered as structures from which a meaning can be extracted. In order to avoid ambiguities and to guarantee guideline consistency, classical syntactic rules for decision trees are used, and a syntax-checking module can verify that an edited tree respects these rules. Thus, KcatoS can offer an export algorithm that transforms decision trees into OWL.
KcatoS’s export algorithm defines two classes: Situation and
Recommendation. The first one represents some patient information while the
second one represents the description of the decision proposed by the system.
These classes are linked by the property hasRecommendation. This means that for each situation there is a recommendation associated with it.
A tree is read using depth-first search. Each node is transformed using rules that take into account its shape and its ancestors.
The export algorithm creates many concepts and properties. Including all
of them in the semantic wiki would decrease the ease of navigation because
it would lead to the creation of numerous pages. In order to avoid these page
creations, translated trees are stored in a specific file and linked to the wiki.
Thus, created ontologies are made available for other semantic web applications.
From a technical point of view, the OWL API [15] is used to perform the export.
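As a rough illustration of how such an exported tree could be reused by other semantic web applications, the following SPARQL sketch lists each situation together with its associated recommendation. The class and property names follow the description above, but the kcatos: namespace prefix is invented for illustration.

# Hypothetical sketch: query an ontology exported by KcatoS to list every
# Situation together with the Recommendation linked to it by hasRecommendation.
PREFIX kcatos: <http://example.org/kcatos#>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?situationLabel ?recommendationLabel
WHERE {
  ?situation a kcatos:Situation ;
             rdfs:label ?situationLabel ;
             kcatos:hasRecommendation ?recommendation .
  ?recommendation a kcatos:Recommendation ;
                  rdfs:label ?recommendationLabel .
}
ORDER BY ?situationLabel

A query of this kind could, for example, feed an external decision-support tool with the situation/recommendation pairs encoded in a guideline.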
Extracting the whole semantics of a guideline is a tedious job that has to be done by a medical expert with skills in knowledge engineering. As Oncolor does not have this kind of specialist on its staff, formalising the guidelines would be a large investment. Moreover, it is still difficult for non-specialists to understand the benefits that semantics could bring to a medical information system. That is why the key idea of the project is to insert useful semantic annotations into the guidelines step by step, in order to increase Oncolor's interest in semantic web technologies. The first way to introduce semantics is to exploit the fields identified during the guideline migration. To improve their visualisation and their updating, SMW template and query mechanisms were used.
SMW offers many ways to edit semantic annotations. The most basic way to create annotations is wikitext, which can be improved thanks to templates. Templates are generic pre-developed page layouts that can be embedded in several wiki pages. They can also manage variables that are instantiated in the corresponding page. For instance, a template is used to generate the box in the top right corner of the page shown in Figure 1. The template used to create this box is generic enough to be applied to all guideline pages, and its use allows flexible modifications. As templates are simple to use (and can be further simplified by associating forms with them), they provide a simple way to create annotation fields that can be filled in by any user without specific skills.
Semantic annotations can then be exploited by SMW's inline query engine. Using a simple query language, semantic searches can be run directly within a page, and the results are displayed as tables, lists, etc. Combined with templates, semantic queries are a simple way to create dynamic content relying on semantic annotations.
Fig. 3. An excerpt of an inline query that requests the guidelines that are out of date, and the wiki page that contains the result
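The inline query excerpted in Figure 3 is written in SMW's own query language and is not reproduced here. Purely to illustrate the idea, an equivalent check could also be expressed in SPARQL over the wiki's exported annotations, for instance through the SPARQL endpoint provided by the Halo extension; the prop: namespace and the Last_update property name below are hypothetical.

# Hypothetical sketch: list guidelines whose last-update annotation is older
# than a chosen cutoff date (guidelines are expected to be reviewed every two
# years). Namespace and property name are invented for illustration.
PREFIX prop: <http://example.org/oncowiki/property#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

SELECT ?guideline ?lastUpdate
WHERE {
  ?guideline prop:Last_update ?lastUpdate .
  FILTER (?lastUpdate < "2010-01-01"^^xsd:date)   # cutoff chosen by the editor
}
ORDER BY ?lastUpdate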
5 Evaluation
To carry out the evaluation, the opinions of the users were collected. The people interviewed were the four main contributors from the Oncolor staff: two public health physicians, a computer graphic designer, and a medical secretary.
The first interesting point is that, before the project began, the only thing they knew about wikis was Wikipedia, and none of them had ever contributed to one. Despite this, three contributors estimated that less than one day of self-training is needed to learn wikitext and become an efficient contributor. The only difficulties are related to particular layouts (tables and references) and to advanced wiki functions dealing with user management and page history. The only reluctance about migrating to a wiki concerned guideline quality: they agreed that the old system was time-consuming, but it had the advantage of producing high-quality guidelines. Experiments were conducted in which Oncolor's old website and the semantic wiki were updated with the same modifications. They show that quality did not suffer from the change and that the efficiency of updating has been increased by the semantic wiki.
Fig. 4. Example of data that can be imported from DBpedia and Drugbank about
Gemcitabine using SPARQL queries
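The queries behind Figure 4 are not reproduced here. As a sketch only, the following query could be run against the public DBpedia SPARQL endpoint to retrieve the English abstract for Gemcitabine; a similar query against a DrugBank RDF endpoint could fetch pharmacological fields, with the exact property names to be checked against the actual datasets.

# Illustrative sketch: retrieve the English abstract for Gemcitabine from DBpedia.
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?abstract
WHERE {
  <http://dbpedia.org/resource/Gemcitabine> dbo:abstract ?abstract .
  FILTER (lang(?abstract) = "en")
}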
Our panel cited the main advantages they see in using a wiki. They agreed that wikis are collaborative tools that allow more reactivity and more flexibility in the update process. It was also said that wikis improve working conditions by allowing remote work, which was impossible with the previous system. Moreover, they recognised that the wiki increases the quality of the editing process and of the guidelines themselves, by allowing the standardisation of the guidelines and by simplifying the work on their layout.
In our system, the preferred contribution is the querying of the medical publication websites Pubmed and Cismef, which automatically proposes a bibliography related to a guideline. The previous system did not permit this kind of function, which has been judged very useful. It is really important for the project that the Oncolor staff appreciated this contribution, which relies on semantic web technologies. Moreover, all participants declared that they are interested in using MeSH annotations and want to take this experimentation further.
6 Related Work
on semantic forms and focuses totally on structured content while our project
aims at migrating already existing unstructured data.
Semantic wikis have already been experimented with in various domains. In particular, the building of a semantic portal for the AIFB Institute described in [14] shows how important the technical settings are for increasing wiki performance and how difficult it is to find the right balance between structured and unstruc-
tured data. This last issue has also been tackled in [24].
In this paper, a migration from a web 1.0 website containing medical data to a
semantic wiki has been described. The first step was the migration of data from
an HTML website to a collaborative solution, Semantic Mediawiki. The second
step consisted in adding a semantic layer to show the benefits that semantic web
technologies could bring.
Among the difficulties we met, the analysis of the HTML version of the guidelines was hard because of invalid code, the result of using different HTML editors that followed the evolution of the standard over a decade. It appears that a correct use of HTML and CSS would have simplified the migration, particularly the identification of tables of contents and specific fields. Moreover, medical information is critical, and its migration implies a long verification effort by medical experts. According to Oncolor members, about 70 days of work were necessary to check and correct all the guidelines.
Once the semantic wiki had been installed, the use of traditional wiki editing tools was easily learnt by the Oncolor staff. However, we have noticed that the creation and use of semantic annotations remain difficult for non-experts in knowledge engineering, although semantic wikis seem to be a simple approach. For example, the SMW inline query language is hard to handle for non computer specialists, and template construction also requires computer skills. Some tools have yet to be implemented to improve this aspect, in the spirit of semantic forms and the Halo project.
Another problem was to find the right balance between structured and unstructured data. The advantage of structured data is the typing, which makes the data easy to reuse in the semantic web context. However, structured data are still difficult to edit and exploit, as shown in the context of semantic wikis. Moreover, most existing information sources are unstructured, and tedious work would be necessary to transform them. This job would be expensive and time-consuming, so its benefits have to be demonstrated first to non-experts in the semantic web. Our methodology was to add semantic annotations step by step to improve the semantic wiki's quality. Until now, our work has consisted in showing the improvements, so that future developments will be made at Oncolor's request.
Introducing structured information yields benefits when it is done in accordance with already existing resources. In the medical domain, numerous thesauri and information sources have been created, and it is hard for non medical specialists to determine which ones can be used. This choice has to be made according to
the goal of the application, with the approval of medical specialists. For instance, it was hard to determine which thesaurus should be used to index the guidelines. We finally chose MeSH at Oncolor's request, although SNOMED or UMLS seem more complete and CIM-10 seems simpler. The reason was that the link to medical publication websites is useful for editors and provides additional information for readers.
Finally, the use of data from the semantic web is a major concern in the medical domain, due to the critical nature of the data. Using external resources seems to cause a certain reluctance among clinicians. Each source has to be approved by medical authorities before it can be exploited by a medical system. In particular, all sources must at least follow the principles of the HONcode certification.
Currently, our work focuses on minor technical adaptations of the wiki to Oncolor's needs. Our next task will be to gradually increase the presence of semantic annotations. The long-term goal is to obtain a structured knowledge base that contains all the information provided by the oncology guidelines. For such a project to be successful, several issues have to be taken into account. The project must be able to rely on several medical experts to structure and check information. From this point of view, Oncolor will have a crucial supporting role to play, so their satisfaction is really important. Moreover, to complete the formalisation, resources that are more expressive than MeSH will be needed; SNOMED or UMLS seem to be better options. Finally, the scale of this final ontology will require significant improvements in ontology engineering tools, particularly for editing and maintenance.
References
1. Honcode, http://www.hon.ch/ (last consulted: December 2011)
2. Oncolor website, http://www.oncolor.fr (last consulted: December 2011)
3. Pubmed, http://www.ncbi.nlm.nih.gov/pubmed/ (last consulted: December
2011)
4. Badra, F., d’Aquin, M., Lieber, J., Meilender, T.: EdHibou: a customizable inter-
face for decision support in a semantic portal. In: Proc. of the Poster and Demo.
Session at the 7th International Semantic Web Conference (ISWC 2008), Karl-
sruhe, Germany, October 28 (2008)
5. Bail, S., Horridge, M., Parsia, B., Sattler, U.: The Justificatory Structure of the
NCBO BioPortal Ontologies. In: Aroyo, L., Welty, C., Alani, H., Taylor, J., Bern-
stein, A., Kagal, L., Noy, N., Blomqvist, E. (eds.) ISWC 2011, Part I. LNCS,
vol. 7031, pp. 67–82. Springer, Heidelberg (2011)
6. Belleau, F., Nolin, M.-A., Tourigny, N., Rigault, P., Morissette, J.: Bio2rdf: To-
wards a mashup to build bioinformatics knowledge systems. Journal of Biomedical
Informatics 41(5), 706–716 (2008)
7. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hell-
mann, S.: Dbpedia - a crystallization point for the web of data. Journal of Web
Semantics: Science, Services and Agents on the WWW 7(3), 154–165 (2009)
8. Bodenreider, O.: Biomedical ontologies in action: Role in knowledge management,
data integration and decision support. In: IMIA Yearbook Medical Informatics,
pp. 67–79 (2008)
9. D’Aquin, M., Brachais, S., Lieber, J., Napoli, A.: Decision Support and Knowledge
Management in Oncology using Hierarchical Classification. In: Kaiser, K., Miksch,
S., Tu, S.W. (eds.) Proc. of the Symp. on Computerized Guidelines and Protocols,
CGP 2004, Prague, Czech Republic. Studies in Health Technology and Informatics,
vol. 101, pp. 16–30. IOS Press (2004)
10. Darmoni, S.J., Thirion, B., Leroyt, J.P., Douyère, M., Lacoste, B., Godard, C.,
Rigolle, I., Brisou, M., Videau, S., Goupyt, E., Piott, J., Quéré, M., Ouazir, S.,
Abdulrab, H.: A search tool based on ’encapsulated’ MeSH thesaurus to retrieve
quality health resources on the internet. Medical Informatics and The Internet in
Medicine 26(3), 165–178 (2001)
11. Giustini, D.: How Web 2.0 is changing medicine. BMJ 333, 1283–1284 (2006)
12. Hassanzadeh, O., Kementsietsidis, A., Lim, L., Miller, R.J., Wang, M.: Linkedct:
A linked data space for clinical trials. CoRR, abs/0908.0567 (2009)
13. Heino, N., Dietzold, S., Martin, M., Auer, S.: Developing Semantic Web Applica-
tions with the OntoWiki Framework. In: Pellegrini, T., Auer, S., Tochtermann, K.,
Schaffert, S. (eds.) Networked Knowledge - Networked Media. SCI, vol. 221, pp.
61–77. Springer, Heidelberg (2009)
14. Herzig, D.M., Ell, B.: Semantic MediaWiki in Operation: Experiences with Building
a Semantic Portal. In: Patel-Schneider, P.F., Pan, Y., Hitzler, P., Mika, P., Zhang,
L., Pan, J.Z., Horrocks, I., Glimm, B. (eds.) ISWC 2010, Part II. LNCS, vol. 6497,
pp. 114–128. Springer, Heidelberg (2010)
15. Horridge, M., Bechhofer, S.: The owl api: A java api for owl ontologies. Semantic
Web 2(1), 11–21 (2011)
16. Krötzsch, M., Vrandecic, D., Völkel, M., Haller, H., Studer, R.: Semantic wikipedia.
Journal of Web Semantics 5, 251–261 (2007)
17. Kuhn, T.: How controlled english can improve semantic wikis. In: Proc. of the 4th
Workshop on Semantic Wikis, European Semantic Web Conference 2009. CEUR
Workshop Proceedings (2009)
18. Köstlbacher, A., Maurus, J., Hammwöhner, R., Haas, A., Haen, E., Hiemke, C.:
Opendrugwiki – using a semantic wiki for consolidating, editing and reviewing of
existing heterogeneous drug data. In: 5th Workshop on Semantic Wikis Linking
Data and People, SemWiki 2010 (May 2010)
19. Lange, C., Schaffert, S., Skaf-Molli, H., Völkel, M. (eds.): 4th Semantic Wiki Work-
shop (SemWiki 2009) at the 6th European Semantic Web Conference (ESWC
2009), Hersonissos, Greece, June 1. CEUR Workshop Proc., vol. 464. CEUR-
WS.org (2009)
20. Nelson, S.J.: Medical terminologies that work: The example of mesh. In: Proc. of the
10th Int. Symposium on Pervasive Systems, Algorithms, and Networks, December
14-16, pp. 380–384 (2009)
21. Rospocher, M., Eccher, C., Ghidini, C., Hasan, R., Seyfang, A., Ferro, A., Miksch,
S.: Collaborative Encoding of Asbru Clinical Protocols, pp. 1–8 (2010)
22. Schaffert, S., Eder, J., Grünwald, S., Kurz, T., Radulescu, M., Sint, R., Stroka, S.:
Kiwi - a platform for semantic social software. In: Lange, et al. [19]
23. Schreiber, W., Giustini, D.: Pathology in the era of Web 2.0. American Journal of
Clinical Pathology 132, 824–828 (2009)
24. Sint, R., Stroka, S., Schaffert, S., Ferstl, R.: Combining unstructured, fully struc-
tured and semi-structured information in semantic wikis. In: Lange, et al. [19]
25. Wishart, D.S., Knox, C., Guo, A., Cheng, D., Shrivastava, S., Tzur, D., Gautam,
B., Hassanali, M.: Drugbank: a knowledgebase for drugs, drug actions and drug
targets. Nucleic Acids Research 36(Database-Issue), 901–906 (2008)
Semantics Visualization for Fostering Search Result
Comprehension
Abstract. Current search engines present search results in an ordered list even
if semantic technologies are used for analyzing user queries and the document
contents. The semantic information that is used during the search result
generation mostly remains hidden from the user although it significantly
supports users in understanding why search results are considered as relevant
for their individual query. The approach presented in this paper utilizes
visualization techniques for offering visual feedback about the reasons the
results were retrieved. It represents the semantic neighborhood of search results,
the relations between results and query terms as well as the relevance of search
results and the semantic interpretation of query terms for fostering search result
comprehension. It also provides visual feedback for query enhancement. Therefore, not only are the search results visualized, but further information that arises during search processing is also used to improve the visual presentation and to offer more transparency in search result generation. The results of an evaluation in a real application scenario show that the presented approach considerably supports users in assessment and decision-making tasks and eases information seeking in digital semantic knowledge bases.
1 Introduction
The optimal use of information and knowledge plays a major role in global competition and forms the basis for the competitiveness of industrial companies. Semantic technologies provide adequate tools for linking heterogeneous data sources, as well as for generating a broader context that facilitates information access and enables data exchange between different systems [1]. With the ongoing
establishment of semantic technologies like the Resource Description Framework
(RDF)1, the Web Ontology Language (OWL)2 and semantic-oriented query languages
1 http://www.w3.org/RDF/
2 http://www.w3.org/TR/owl-features/
like SPARQL3, these developments are no longer limited to specific domains but are also adopted in the daily search processes of web-based search engines [2]. In both domain-specific applications and web-based search engines, the results of search processing are usually presented in sorted lists. In most cases the ordering of list entries represents the relevance of the results for the user's individual search according to various criteria [3], so the most relevant result is placed in the first row, followed by less important ones. With this kind of result presentation, the semantic information of the documents that is used during search result generation and the analysis of search terms remains in most cases hidden from the user, even though this information considerably supports users in information-seeking tasks and in selecting appropriate documents for further examination.
According to Hearst [4], efficient and informative feedback is critically important for designing search user interfaces. This includes in particular feedback about query formulation and about the reasons the particular results were retrieved. However, relevance indicators other than list ordering, such as numerical scores or special icons, are less frequently used because the meaning of the relevance score is opaque to the user in these presentations [5]. This is because the majority of existing relevance indicators only present a single relevance value per search result that summarizes all criteria, instead of offering a more fine-grained insight into search result processing.
In order to offer users an adequate tool that nevertheless provides the possibility to assess the relevance of retrieved search results, we developed a novel approach that utilizes information visualization techniques and the semantic information that emerges during search result generation. The major contributions and benefits of our approach are:
• Support for relevance assessment: The presented approach supports users in
assessing the relevance of search results and offers more transparency in the search
result generation process.
• Query-Result-Relation visualization: The visual representation of relations between
query terms and search results as well as the retrieved semantic meaning of query
terms offers a fine-grained visual overview of search result relevancies and
facilitates the information seeking and decision making process.
• Visual feedback for query-enhancement: The illustration of additional attributes
and possible terms related to a given search request allows users to narrow search
results and to refine the individual search process.
The rest of the paper is organized as follows: In the next section we introduce our
approach for presenting search results in semantic domains and give a detailed
description of all parts and features. Then we introduce the application scenario of the
visualization and give an overview of its domain. We present the evaluation that we performed to compare our approach with already existing solutions, followed by a related work section, a discussion, and an outlook on future work. As a detailed
description of the whole search process with all technical aspects is beyond the scope
of this paper, we only briefly describe the semantic background processing and focus
on the aspects of the visualization component and the advantages of semantics for
visualizing search results.
3 http://www.w3.org/TR/rdf-sparql-protocol/
4 http://www.semavis.com
Giving adequate feedback about the reasons the results were retrieved is one of the
major challenges for designing adequate search user interfaces. This is especially
important for semantic search engines, in which the meaning of query terms is
interpreted by means of semantically modeled entities, because the interpretation
might be highly ambiguous. For example, the query term ford might be interpreted as the name attribute of a car manufacturer, as the surname attribute of the famous inventor, or as the title attribute of an activity for crossing rivers. Each of these
interpretations will deliver a completely different result set. So it is not sufficient to
only present the relations between query terms and results, but it is also necessary to
point out the semantic interpretation of the given query terms to allow an
unambiguous assessment of retrieved results.
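The paper does not detail the underlying query machinery. Purely to make the ambiguity concrete, and with an entirely invented setup, a single literal such as ford can match several attribute/type combinations in an RDF knowledge base, and each combination implies a different result set:

# Invented illustration: the literal "ford" matches different attributes of
# differently typed resources; each (type, attribute) pair corresponds to a
# different semantic interpretation of the query term.
SELECT DISTINCT ?type ?attribute
WHERE {
  ?entity a ?type ;
          ?attribute ?value .
  FILTER (isLiteral(?value) && lcase(str(?value)) = "ford")
}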
To meet these demands and to provide an adequate tool that allows users to
unambiguously determine the most relevant result for their individual search, our
approach visualizes both query-result-relations and the interpreted semantic meaning
of query terms. Therefore, each term of the given query is presented in an attribute
node of the visualization. The interpreted semantic meaning that emerged during search processing is visible in the label of the attribute node. So for every possible
interpretation a new node is created that represents the query term and its semantic
meaning. The relations between search results and the instantiated5 attribute nodes are
depicted as directed and weighted edges between attribute nodes and result nodes. As
mentioned above, the weighting of an edge is derived from the retrieved similarity
between the result and the attribute node, whereby the results are placed nearer to
more relevant query terms and attribute nodes respectively.
5 ‘instantiated’ in this context means that a query term is assigned to a specific attribute.
Fig. 2. Left: The visualization of query-result-relations reveals that only one of the five results
is semantically related to the queried application area. Right: The visual representation of the
identified semantic meanings of query terms avoids mistakes and ambiguity in result
assessment tasks.
Figure 2 shows the result visualization of the query ‘kuka robots in construction
industry’, where the term kuka is identified as manufacturer, the term robot as
function carrier and construction industry as application area. The visualization
reveals that only one of the results is related to the queried application area whereas
other results are related to the given manufacturer (Figure 2 left). The second example
shows the visualization of the results for the query ‘cylinder’. The given term is on
the one hand identified as shape of an object and on the other hand as a specific
function carrier. By visualizing the connections between search results and related
interpretations of the query term, users can easily recognize the results that match
their initial search intention.
Search results in semantic domains are not only retrieved by analyzing the content of
resources but also by considering the semantic information and the semantic structure
respectively. For example, a resource that matches only one of the given query terms is rated higher in the result list when the remaining terms match semantically related resources. In some cases, semantic search processing enables the retrieval of
highly relevant resources even if the given query terms are not contained in the
resources. Especially in such cases where semantic structures are responsible for
result generation, it can be a very time-consuming and tedious task to identify the
right results for the individual search process. So it is important to provide an
adequate presentation that allows users to unambiguously assess the retrieved results
and enables them to comprehend why specific results are considered as relevant. To
offer this kind of feedback, the proposed approach presents the related resources that are responsible for result retrieval, together with their related attributes, in expandable attribute nodes. Thus each of these nodes contains resources from the semantic neighborhood
of retrieved results that are of some relevance for the result generation. The labels of
these expandable attribute nodes are derived from the conjoint concept in the
semantic structure to indicate their meaning.
This results in different lengths of the visible connections and indicates the relevance between specific query terms and search results.
Fig. 4. The visual recommendation of additional attributes and possible terms for query
enhancement offers a visual tool for narrowing search results
3 Application Scenario
4 Evaluation
For evaluating our approach we performed a user study in which we compared the
visualization with a common list presentation (Figure 5). The study mainly focuses on the questions of whether our visualization approach can support users in assessing search results and whether it satisfies the needs of searchers. To verify our assumption, we investigated the task completion time and
formulated the following hypothesis:
• H1: There is a difference in task completion time between the list presentation and
the visualization in assessing search results.
In addition to the task completion time, we measured user satisfaction as a subjective evaluation criterion.
6 Demonstration is available at http://athena.igd.fraunhofer.de/Processus/semavis.html. Note that the knowledge base of the online demonstrator is currently only available in German and contains only selected resources. Possible queries for demonstration are ‘kuka roboter bauindustrie’, ‘glattes stückgut handhaben’ and ‘glas transportieren’.
condition (in this case the different user interfaces). In contrast to between-group designed experiments, in within-group designs fewer participants are needed and individual differences between the participants are isolated more effectively [12]. Possible learning effects when switching between conditions are controlled by a systematic randomization of condition and task ordering. Furthermore, participants were advised to disregard the knowledge from previous conditions and to explicitly show the solution of tasks by means of elements in the user interface.
Altogether the experiment contains three tasks that had to be accomplished by every participant with both conditions (list presentation and visualization). Because the focus of the evaluation is the comparison of two different user interfaces and not the investigation of the whole search process, we were able to pre-assign the query terms for every task. So every participant retrieves the same results for every task and thus also the same visual representation, and the evaluation outcome is not influenced by other factors.
In the first task participants had to identify the relations between each search result and the terms of the given query. The second task was of the same type as the first task, with the difference that the result contains hierarchically structured attributes instead of only flat attributes. In the third task participants had to identify the most relevant item for a specific search situation. To ensure that the solution could be found in each condition, we performed several pretests. We also ensured that each participant gets the same visual presentation for each task and condition. The time limit for each task was set to three minutes. If a wrong answer was given or a participant could not solve a task, the completion time of the task was also set to three minutes.
4.2 Procedure
Altogether 17 participants, mainly graduates and students, attended the evaluation. The average participant was between 24 and 29 years old. The participants were mainly involved in computer science (M = 4.65; SD = 0.6)7 and had no previous knowledge of the engineering domain. After a general introduction to the user study and an explanation of the procedure and tasks, participants got a brief introduction to both systems in systematically randomized ordering. Both systems were queried with a reference query, and participants had the chance to ask questions about the systems. After each task, participants had to rate their overall satisfaction with the system on a scale from 1 to 9 and answer three additional questions concerning their subjective opinion of the system on a Likert scale from 1 (strongly disagree) to 5 (strongly agree). After participants had completed all tasks, they had to answer a brief demographic questionnaire.
4.3 Results
Figure 6 shows the average task completion times for each of the three tasks and both
conditions. The direct comparison of the average task completion times reveals that
participants performed better with our visualization approach (avg(t) = 51.3 sec; SD =
25.8) compared to the list presentation (avg(t) = 88.1 sec; SD = 30.1). A paired-
samples t-test also suggests that there is a significant difference in the task completion
time between the group who used the list presentation and the group who used our
visualization approach (t(50)=7.8028, p<0.05).
Hence the null hypothesis is refuted and the alternative hypothesis is confirmed. The comparison of means also indicates that users performed significantly faster with the visualization approach than with the list presentation. So we can proceed from the assumption that visualizing search results while taking semantic information into account has a positive effect on efficiency when assessing search result relevance.
7 Measured on a five-point scale (5 = very much experience; 1 = very little experience) in the demographic part of the questionnaire.
5 Related Work
8 Measured on a five-point Likert scale.
beside extracted figures from relevant articles, query terms highlighted in the title and
boldfaced in the text excerpts for communicating reasons the particular results were
retrieved. Even though term highlighting can be useful for improving search result list
presentations, it does not reveal the semantic interpretation of search results, nor does it spare users from scanning the whole result list to get an overview.
6 Discussion
The introduced approach was applied and evaluated in the field of mechanical
engineering and automation technology. Although this domain contains highly
complex processes and different kinds of heterogeneous users, domain experts were able to model it semantically and build a comprehensive model that gives different stakeholders access to heterogeneous resources. In such well-defined domains, aspects like data diversity, user roles and processes are in some way controllable, and the data access methods can be accurately aligned to the specific tasks of the stakeholders.
The results of the evaluation showed that the proposed visualization approach
performed very well in the present domain. Nevertheless, further investigations are
needed to prove if the proposed approach is also transferable to other domains and if
it can be seamlessly integrated in semantic web search engines.
Currently, most search user interfaces are based on result list presentations and
usually show the titles and surrogates of the results. Because of the public's great familiarity with this commonly used search result presentation, there is a certain degree of risk in introducing a novel approach in user interfaces. Even if
novel approaches provide a variety of extended features and easier information
access, the success of each innovation in user interfaces is measured by the
acceptance of the users. Although the results of the evaluation show that the
introduced visualization approach performed well in a controlled experimental
environment and users are convinced of its benefits, there is still the need to prove whether visualization techniques will be applicable in web search engines. However, current
trends show an increased use of information visualization techniques in search user
interfaces.
In this paper we introduced a novel approach for visualizing search results in semantic
knowledge bases. The results of the evaluation showed that the utilization of semantic information in search result visualization successfully fosters search result comprehension and supports users in assessing retrieved resources. The approach also performed well for presenting different semantic interpretations of query terms and query-result relations respectively. The visual recommendation of novel dimensions
and immediate visual feedback for query refinement additionally fosters the common
search strategies of users and offers more transparency in search result processing.
For future work we plan the extension of query refinement features. In particular
we plan to implement the removal and change of attribute values that is not included
Acknowledgements. This work has been carried out within the Core-Technology Cluster
(Innovative User Interfaces and Visualizations) of the THESEUS research program,
partially funded by the German Federal Ministry of Economics and Technology. We thank
H. J. Hesse, R. Traphöner and C. Dein (THESEUS PROCESSUS, Attensity Europe
GmbH) for the inspiring discussions, the provision of the data and the support during the
development of the data connection. We are also grateful to all participants who spent their time participating in the evaluation.
References
1. Shadbolt, N., Berners-Lee, T., Hall, W.: The Semantic Web Revisited. IEEE Intelligent
Systems 21(3), 96–101 (2006)
2. Fernandez, M., Lopez, V., Sabou, M., Uren, V., Vallet, D., Motta, E., Castells, P.:
Semantic Search Meets the Web. In: 2008 IEEE International Conference on Semantic
Computing, pp. 253–260 (2008)
3. Cutrell, E., Robbins, D., Dumais, S., Sarin, R.: Fast, flexible filtering with phlat. In:
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp.
261–270 (2006)
4. Hearst, M.A.: Search User Interfaces. Cambridge University Press (2009)
5. White, R.W., Bilenko, M., Cucerzan, S.: Studying the Use of Popular Destinations to
Enhance Web Search Interaction. In: Proceedings of the 30th Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval, pp. 159–166
(2007)
6. Stab, C., Breyer, M., Nazemi, K., Burkhardt, D., Hofmann, C., Fellner, D.W.: SemaSun:
Visualization of Semantic Knowledge based on an improved Sunburst Visualization
Metaphor. In: Proceedings of World Conference on Educational Multimedia, Hypermedia
and Telecommunications 2010, pp. 911–919. AACE, Chesapeake (2010)
7. Stab, C., Nazemi, K., Fellner, D.W.: SemaTime - Timeline Visualization of Time-
Dependent Relations and Semantics. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D.,
Chung, R., Hammound, R., Hussain, M., Kar-Han, T., Crawfis, R., Thalmann, D., Kao, D.,
Avila, L. (eds.) ISVC 2010, Part III. LNCS, vol. 6455, pp. 514–523. Springer, Heidelberg
(2010)
8. Nazemi, K., Breyer, M., Hornung, C.: SeMap: A Concept for the Visualization of
Semantics as Maps. In: Stephanidis, C. (ed.) UAHCI 2009, Part III. LNCS, vol. 5616, pp.
83–91. Springer, Heidelberg (2009)
9. Ward, M., Grinstein, G., Keim, D.: Interactive Data Visualization: Foundations,
Techniques, and Applications. A. K. Peters, Ltd., Natick (2010)
10. Jansen, B.J., Spink, A., Pedersen, J.O.: A Temporal Comparison of Altavista Web
Searching. Journal of the American Society for Information Science and
Technology 56(6), 559–570 (2005)
11. Jansen, B.J., Spink, A., Koshman, S.: Web Searcher Interaction with the Dogpile.com
Metasearch Engine. Journal of the American Society for Information Science and
Technology 58(5), 744–755 (2007)
12. Lazar, J., Feng, J.H., Hochheiser, H.: Research Methods in Human - Computer Interaction.
John Wiley & Sons (2010)
13. Schenk, S., Saathoff, C., Staab, S., Scherp, A.: SemaPlorer - Interactive Semantic
Exploration. Journal of Web Semantics: Science, Services and Agents on the World Wide
Web 7(4), 298–304 (2009)
14. Heim, P., Lohmann, S., Stegemann, T.: Interactive Relationship Discovery via the
Semantic Web. In: Aroyo, L., Antoniou, G., Hyvönen, E., ten Teije, A., Stuckenschmidt,
H., Cabral, L., Tudorache, T. (eds.) ESWC 2010, Part I. LNCS, vol. 6088, pp. 303–317.
Springer, Heidelberg (2010)
15. Microsoft Academic Search, http://academic.research.microsoft.com
16. Stoyanovich, J., Lodha, M., Mee, W., Ross, K.A.: SkylineSearch: semantic ranking and
result visualization for pubmed. In: Proceedings of the 2011 International Conference on
Management of Data, SIGMOD 2011, pp. 1247–1250 (2011)
17. Nguyen, T., Zhang, J.: A Novel Visualization Model for Web Search Results. IEEE
Transactions on Visualization and Computer Graphics 12(5), 981–988 (2006)
18. Aula, A.: Enhancing the readability of search result summaries. In: Proceedings of HCI
2004, pp. 6–10 (2004)
19. Hearst, M.A., Divoli, A., Guturu, H., Ksikes, A., Nakov, P., Wooldridge, M.A., Ye, J.: BioText Search Engine: beyond abstract search. Bioinformatics 23(16), 2196 (2007)
Evaluating Scientific Hypotheses Using the SPARQL
Inferencing Notation
1 Introduction
1 http://www.w3.org/Submission/2011/SUBM-spin-overview-20110222/
2 http://www.w3.org/Submission/2011/SUBM-spin-modeling-20110222/
2 Methods
2.1 Overview
HyQue evaluates hypotheses (and assigns an evaluation score) by executing SPIN rules over the pertinent knowledge extracted from a HyQue Knowledge Base (HKB). A hypothesis is formulated as a logical expression in which elements of the hypothesis correspond to biological entities of interest. HyQue maps the hypothesis, expressed using terminology from the HyQue ontology3, to the relevant SPIN rules, which execute SPARQL queries to retrieve data from the HKB. Finally, HyQue executes additional SPIN rules over the extracted data to obtain a quantitative measure of hypothesis support. Figure 1 provides a graphical overview of HyQue.
Fig. 1. HyQue uses SPIN rules to evaluate a hypothesis over RDF linked data and OWL ontologies. The dashed rectangle represents OWL ontologies. Rounded rectangles are RDF resources.
3 The HyQue ontology, linked data, and SPIN rules are available at the project website: http://hyque.semanticscience.org
1. protein-protein binding
2. protein-nucleic acid binding
3. molecular activation
4. molecular inhibition
5. gene induction
6. gene repression
7. transport
All Linked Data (encoded using RDF) and ontologies (encoded using OWL) that
comprise the HKB are available at the project website.
HyQue uses rules to calculate a numerical score for a hypothesis based on the degree
of support the hypothesis has from statements in the HKB. HyQue first attempts to
identify statements about experimentally verified events in the HKB that have a high
degree of matching to a hypothesized event, and then assesses these statements using
domain specific rules to assign a score to the hypothesized event. If there is a statement
about an experimentally reported GAL gene/protein interaction in the HKB that ex-
actly matches a hypothesized event, then that event will be assigned a maximum score
when it is evaluated by HyQue. In contrast, if a hypothesized event describes an inter-
action between a protein A and a protein B but there is a statement in the HKB assert-
ing that protein A does not interact with protein B, then the hypothesis will be assigned
a low score based on the negation of the hypothesized event by experimental data.
Different HyQue rules add or subtract different numerical values based on whether the
relevant experimental data has properties that provide support for a hypothesized
event. For instance, if an event is hypothesized to occur in a specific cellular compartment, e.g. the nucleus, but the HKB only contains a statement that such an event takes place in a different cellular component, e.g. the cytoplasm, then a rule could be formulated
such that the hypothesis, while not directly supported by experimental evidence, will
be penalized less than if the event had been asserted to not take place at all.
Based on such scoring rules, each event type has a maximum possible score. When
a hypothesized event is evaluated by HyQue, it is assigned a normalized score calculated as the sum of the outputs of the relevant rule(s) divided by the maximum possible score. In this way, if an event has full experimental support, it will have an overall score of 1, while if only some properties of the hypothesized event are supported by statements in the HKB it will have a score between 0 and 1.
Overall proposition and hypothesis scores are calculated by additional rules based
on the operators that relate events. If a proposition specifies ‘event A’ OR ‘event B’
OR ‘event C’ then the maximum event score will be assigned as the proposition score,
while if the ‘AND’ operator was used, the mean event score will be assigned as the
proposition score. Using the mean reflects the relative contribution of each event
score while still maintaining a normalized value between 0 and 1. Similar rules are
used to calculate an overall hypothesis score based on proposition scores.
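The actual rules are published as SPIN at the project website and are not reproduced here. As a simplified sketch of the aggregation logic just described, with placeholder ex: terms rather than the real HyQue ontology identifiers, a proposition score could be computed as the mean of its event scores for AND and as their maximum for OR:

# Simplified sketch of proposition-score aggregation; the ex: terms are
# placeholders, not actual HyQue ontology identifiers.
PREFIX ex: <http://example.org/hyque-sketch#>

SELECT ?proposition
       (IF(?op = "AND", AVG(?eventScore), MAX(?eventScore)) AS ?propositionScore)
WHERE {
  ?proposition ex:hasOperator ?op ;
               ex:specifiesEvent ?event .
  ?event ex:hasEvaluationScore ?eventScore .
}
GROUP BY ?proposition ?op

A hypothesis-level rule would aggregate the resulting proposition scores in the same way, according to the operator that relates the propositions.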
HyQue uses SPIN to execute rules that reflect this scoring system.
In the SPIN representation, the SPARQL variable ‘?this’ has a special meaning and refers to any instance of the class the rule is linked to. SPIN rules are linked to classes in the HyQue ontology using the spin:rule predicate.
This hypothesis rule uses another rule, calculateHypothesisScore, to cal-
culate the hypothesis score, and the output of executing this rule is bound to the vari-
able ?hypothesisEvalScore. Note that the hypothesis rule is constrained to a
HyQue hypothesis that ‘has component part’ (hyque:HYPOTHESIS_0000010) some
‘proposition’ (hyque:HYPOTHESIS_0000001) that ‘has attribute’ a proposition
evaluation. In this way HyQue rules are chained together – when one rule is executed,
all the rules it depends on are executed until no new statements are created. In this
case, because a hypothesis evaluation score requires a proposition evaluation score,
when the hypothesis evaluation rule is executed, the HyQue SPIN rule for calculating
a proposition score is executed as well. Each proposition evaluation is asserted to be
‘obtained from’ the event evaluations corresponding to the event(s) specified by (hy-
que:HYPOTHESIS_0000012) the proposition. Each event evaluation is also asserted
to be ‘obtained from’ the scores determined for each event property (the agent, target,
location etc.) and the statements in the HKB the scores are based on.
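This provenance chain is what later allows an evaluation to be traced back to the supporting data. As a hedged sketch only, with a placeholder IRI standing in for the ‘obtained from’ property, the chain could be followed with a SPARQL property path:

# Hedged sketch: starting from each hypothesis evaluation, follow one or more
# 'obtained from' links down to the underlying proposition and event
# evaluations, property scores and HKB statements. All ex: terms are placeholders.
PREFIX ex: <http://example.org/hyque-sketch#>

SELECT ?evaluation ?source
WHERE {
  ?evaluation a ex:HypothesisEvaluation ;
              ex:obtainedFrom+ ?source .
}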
Domain specific rules for HyQue pertain to the domain of interest. An example of
a domain specific rule is calculateActivateEventScore corresponding to
the following SPARQL query:
SELECT ?activateEventScore
WHERE {
BIND (:calculateActivateAgentTypeScore(?arg1)
AS ?agentTypeScore) .
BIND (:calculateActivateTargetTypeScore(?arg1)
AS ?targetTypeScore) .
BIND (:calculateActivateLogicalOperatorScore(?arg1)
AS ?logicalOperatorScore) .
BIND (:penalizeNegation(?arg1) AS ?negationScore) .
BIND (3 AS ?maxScore) .
BIND (((((?agentTypeScore + ?targetTypeScore) +
?logicalOperatorScore) + ?negationScore) /
?maxScore) AS ?activateEventScore) .
}
One of these sub-rules, which scores the event's logical operator, corresponds to the following SPARQL query:
SELECT ?score
WHERE {
?arg1 ’has logical operator’ ?logical_operator .
BIND (IF((?logical_operator = ’positive regulation of molecular
function’), 1, -1) AS ?score) .
}
Thus, if the logical operator specified in a hypothesis event is of type ‘positive regula-
tion of molecular function’ (GO:0044093) the rule will return 1, and otherwise the
rule will return -1. The calculateActivateEventScore rule is composed of
several sub-rules of this format. HyQue uses similar rules for each of the seven event
types listed in section 2.2 to evaluate hypotheses.
SPIN rules were composed using the free edition of TopBraid Composer 3.5. Hy-
Que executes SPIN rules using the open source SPIN API 1.2.0 and Jena 2.6.4.
3 Results
HyQue currently uses a total of 63 SPIN rules to evaluate hypotheses. 18 of these are
system rules, and the remaining 45 are domain specific rules that calculate evaluation
scores based on well understood principles of the GAL gene network in yeast as de-
scribed in section 2.5. These rules have been used to evaluate 5 representative hy-
potheses about the GAL domain, one of which is presented in detail in section 3.1.
3.1 Evaluating a Hypothesis about GAL Gene Induction and Protein Inhibition
The following is a natural language description of a hypothesis about the GAL gene
network that has been evaluated by HyQue. Individual events are indicated by the
letter ‘e’, followed by a number to uniquely identify them. Events are related by the
AND operator in this hypothesis, while the two sets of events (typed as propositions
in the HyQue hypothesis ontology) are related by the OR operator.
Two domain specific SPIN rules were executed to evaluate this hypothesis: calcu-
lateInduceEventScore for e1-e5 and calculateInhibitEventScore
for e6, in conjunction with system rules to calculate overall proposition and hypothe-
sis scores based on the event scores.
By identifying and evaluating statements in the HKB that experimentally support
e1, the calculateInduceEventScore rule assigns e1 a score of 4 out of a
maximum score of 5 (see Table 1). This corresponds to a normalized score of 0.8.
Similarly, events 2-5 also receive a score of 0.8. The calculateInhibitE-
ventScore rule assigns event 6 a score of 1 based on comparable scoring rules.
Therefore, the proposition specifying e4, e5 and e6 receives a higher score (0.87 – the
mean of the individual event scores) than the proposition specifying e1, e2 and e3
(with a mean score of 0.8). Because the two propositions were related by the OR op-
erator, the hypothesis is assigned an overall score that is the maximum of the two
proposition scores, in this case, a value of 0.87.
Table 1. SPIN rules executed to evaluate a hypothetical GAL gene induction event, their
outcomes, and contribution to an overall hypothesis score assigned by HyQue
The complete HyQue evaluations of this hypothesis as well as that of four addi-
tional hypotheses are available as RDF at the project website.
4 Discussion
Using SPIN rules to evaluate HyQue hypotheses has several advantages. While Hy-
Que “version 1.0” used SPARQL queries to obtain relevant statements from the HKB,
the scoring rules used to evaluate those statements were hard-coded in system code.
HyQue's SPIN evaluation rules can be represented as RDF, which gives users the potential to query for HyQue rules that meet specific conditions, as well as to link to and aggregate those rules. In addition, users can create their own SPIN
rules to meet specific evaluation criteria and augment existing HyQue rules to include
them. In this way, different scientists may use the same data to evaluate the same
hypotheses and arrive at unique evaluations depending on the domain principles
encoded by the SPIN rules they use, as demonstrated in section 3.2. Encoding evalua-
tion criteria as SPIN rules also ensures that the source of an evaluation can be expli-
citly stated, both in terms of the rules executed and the data the rules were executed
over. This is crucial for formalizing the outcomes of scientific reasoning such that
research conclusions can be confidently stated.
Separating HyQue system rules from the GAL domain specific rules highlights the
two aspects of the HyQue scoring system. Specifically, HyQue currently encodes
certain assumptions about how events in hypotheses may be related to one another,
and how these relations are used to determine an overall hypothesis score, as well as
domain specific assumptions about how to evaluate data in the context of knowledge
about the GAL gene network. However, because assumptions about hypothesis struc-
ture are encapsulated by HyQue system rules, they may be changed or augmented
without affecting the GAL domain specific rules, and vice versa. HyQue system rules
can be extended over time to facilitate the evaluation of hypotheses that have funda-
mentally different structures than those currently presented as demonstrations. We
envision a future iteration of HyQue where users can submit unique system and do-
main specific rules to use for evaluating hypotheses and in this way further research
in their field by exploring novel interpretations of experimental data and hypotheses.
Similarly, it may be possible in future for HyQue users to select from multiple sets of
evaluation rules and to compare the hypothesis evaluations that result.
Crafting SPIN rules requires knowledge of SPARQL, which, while being used in a
number of life-science related projects[3, 5, 20-22], may present a barrier to some
users. Similarly, representing hypotheses as RDF to submit to HyQue is not a trivial
activity. To address the latter, we have developed an online form based system for
specifying hypothesis details and converting them to RDF, available at the project
website.
The Rule Interchange Format (RIF)4 is the W3C standard for representing and ex-
changing rules between rule systems. SPIN, a W3C member submission, has been
identified as an effort complementary to RIF [23], and because there is some discussion of RIF and RDF compatibility5, SPIN and RIF may become compatible if the RIF working group remains active6. HyQue provides a relevant use case and motivation for enabling such compatibility. However, given that SPIN rules may be represented as RDF and executed over any RDF store using SPARQL (both W3C standards), and that the motivation of SPIN is specifically to execute SPARQL as rules, compatibility with RIF is not of immediate concern in the context of HyQue.
5 Conclusions
We present an extended version of HyQue that uses SPIN rules to evaluate hypothe-
ses encoded as RDF, and makes the evaluation, including the data it is based upon,
also available as RDF. In this way, users are able to explicitly trace a path from hy-
pothesis to evaluation and the supporting experimental data, and vice versa. We have
demonstrated how HyQue evaluates a specific hypothesis about the GAL gene net-
work in yeast with an explanation of the scoring rules used and their outcomes.
Evaluations of additional hypotheses, as well as HKB data and HyQue SPIN rules are
available at http://hyque.semanticscience.org.
References
1. Neylon, C., Wu, S.: Article-Level Metrics and the Evolution of Scientific Impact. PloS
Biology 7(11), e1000242 (2009)
2. Howe, D., Costanzo, M., Fey, P., Gojobori, T., Hannick, L., Hide, W., Hill, D.P., Kania,
R., Schaeffer, M., St Pierre, S., et al.: Big data: The future of biocuration.
Nature 455(7209), 47–50 (2008)
3. Belleau, F., Nolin, M.-A., Tourigny, N., Rigault, P., Morissette, J.: Bio2RDF: towards a
mashup to build bioinformatics knowledge systems. Journal of Biomedical Informat-
ics 41(5), 706–716 (2008)
4 http://www.w3.org/TR/2010/NOTE-rif-overview-20100622/
5 http://www.w3.org/TR/2010/REC-rif-rdf-owl-20100622/
6 http://www.w3.org/Submission/2011/02/Comment/
4. Tari, L., Anwar, S., Liang, S., Cai, J., Baral, C.: Discovering drug-drug interactions: a text-
mining and reasoning approach based on properties of drug metabolism. Bioinformatics
26(18), i547–i553 (2010)
5. Nolin, M.A., Dumontier, M., Belleau, F., Corbeil, J.: Building an HIV data mashup using
Bio2RDF. Briefings in Bioinformatics (2011)
6. Blonde, W., Mironov, V., Venkatesan, A., Antezana, E., De Baets, B., Kuiper, M.: Rea-
soning with bio-ontologies: using relational closure rules to enable practical querying. Bio-
informatics 27(11), 1562–1568 (2011)
7. Villanueva-Rosales, N., Dumontier, M.: yOWL: an ontology-driven knowledge base for
yeast biologists. Journal of Biomedical Informatics 41(5), 11 (2008)
8. Karp, P.D.: Artificial intelligence methods for theory representation and hypothesis forma-
tion. Comput. Appl. Biosci. 7(3), 301–308 (1991)
9. Karp, P.: Design Methods for Scientific Hypothesis Formation and Their Application to
Molecular Biology. Machine Learning 12(1-3), 89–116 (1993)
10. Karp, P.D., Ouzounis, C., Paley, S.: HinCyc: a knowledge base of the complete genome
and metabolic pathways of H. influenzae. In: Proc. Int. Conf. Intell. Syst. Mol. Biol.,
vol. 4, pp. 116–124 (1996)
11. Zupan, B., Bratko, I., Demsar, J., Juvan, P., Curk, T., Borstnik, U., Beck, J.R., Halter, J.,
Kuspa, A., Shaulsky, G.: GenePath: a system for inference of genetic networks and pro-
posal of genetic experiments. Artif. Intell. Med. 29(1-2), 107–130 (2003)
12. King, R.D., Rowland, J., Oliver, S.G., Young, M., Aubrey, W., Byrne, E., Liakata, M.,
Markham, M., Pir, P., Soldatova, L., et al.: The automation of science. Science 324(5923),
85–89 (2009)
13. Soldatova, L., King, R.D.: Representation of research hypotheses. In: Bio-Ontologies
2010: Semantic Applications in Life Sciences, Boston, MA (2010)
14. Callahan, A., Dumontier, M., Shah, N.: HyQue: Evaluating hypotheses using Semantic
Web technologies. In: Bio-Ontologies: Semantic Applications in the Life Sciences, Bos-
ton, MA (2010)
15. Callahan, A., Dumontier, M., Shah, N.H.: HyQue: evaluating hypotheses using Semantic
Web technologies. J. Biomed. Semantics 2(suppl. 2), S3 (2011)
16. Ideker, T., Thorsson, V., Ranish, J.A., Christmas, R., Buhler, J., Eng, J.K., Bumgarner, R.,
Goodlett, D.R., Aebersold, R., Hood, L.: Integrated genomic and proteomic analyses of a
systematically perturbed metabolic network. Science 292(5518), 929–934 (2001)
17. Racunas, S.A., Shah, N.H., Albert, I., Fedoroff, N.V.: HyBrow: a prototype system for
computer-aided hypothesis evaluation. Bioinformatics 20(suppl. 1), i257–i264 (2004)
18. Bhat, P.J., Murthy, T.V.: Transcriptional control of the GAL/MEL regulon of yeast Sac-
charomyces cerevisiae: mechanism of galactose-mediated signal transduction. Mol. Mi-
crobiol. 40(5), 1059–1066 (2001)
19. Peng, G., Hopper, J.E.: Evidence for Gal3p’s cytoplasmic location and Gal80p’s dual cy-
toplasmic-nuclear location implicates new mechanisms for controlling Gal4p activity in
Saccharomyces cerevisiae. Mol. Cell Biol. 20(14), 5140–5148 (2000)
20. Kobayashi, N., Ishii, M., Takahashi, S., Mochizuki, Y., Matsushima, A., Toyoda, T.: Se-
mantic-JSON: a lightweight web service interface for Semantic Web contents integrating
multiple life science databases. Nucleic Acids Res. 39(Web Server issue), W533–W540
(2011)
21. Chen, B., Dong, X., Jiao, D., Wang, H., Zhu, Q., Ding, Y., Wild, D.J.: Chem2Bio2RDF: a
semantic framework for linking and data mining chemogenomic and systems chemical
biology data. BMC Bioinformatics 11, 255 (2010)
22. Antezana, E., Kuiper, M., Mironov, V.: Biological knowledge management: the emerging
role of the Semantic Web technologies. Brief Bioinform. 10(4), 392–407 (2009)
23. Polikoff, I.: Comparing SPIN with RIF (July 05, 2011),
http://topquadrantblog.blogspot.com/2011/06/
comparing-spin-with-rif.html (accessed December 7, 2011)
Declarative Representation of Programming
Access to Ontologies
1 Introduction
One of the most challenging issues in implementing Semantic Web applications is
that they are built using two different technologies: object-oriented programming
for the application logic and ontologies for the knowledge representation. Object-
oriented programming provides for maintainability, reusability and robustness
in the implementation of complex software systems. Ontologies provide power-
ful means for knowledge representation and reasoning and are useful for various
application domains. For accessing ontological knowledge from object-oriented
software systems, there are solutions like ActiveRDF [8] and Jastor1. Most of
these frameworks make use of the structural similarities of both paradigms,
e.g., similar inheritance mechanisms and utilize simple solutions known from the
field of object-relational mapping. But with the use of these existing tools some
problems cannot be solved: Typically, the structural similarities lead to a one-to-
one mapping between ontology concepts, properties and individuals and object-
oriented classes, fields and objects, respectively. This leads to a data-centric
1 http://jastor.sourceforge.net/ last visit June 24, 2011.
Jim works for a multimedia company and is responsible for the integration of
knowledge-base access in an object-oriented media annotation framework. The
media annotation framework should support the user in annotating multimedia
content such as images or video clips. Jim shall use an ontology for representing
annotated media as well as the multimedia annotations. He has not been involved
in the design of the ontologies. His task is to define the programming interfaces
to access and update the knowledge-base seamlessly from the application. He
has to consider that further specializations toward domain-specific annotations
could result in changes of the implementation.
Figure 1(a) shows an excerpt of the ontology used by Jim to model the mul-
timedia metadata. The example is based on the Multimedia Metadata Ontol-
ogy (M3O) [13] for representing annotation, decomposition, and provenance in-
formation of multimedia data. It models the annotations of an image with an
EXIF2 geo-point wgs84:Point3 and a FoaF4 person foaf:Person as image creator.
As we can see from the different namespaces, the m3o:Image, wgs84:Point and
foaf:Person concepts and their superconcepts dul:InformationEntity, dul:Object
and finally dul:Entity are defined in different ontologies. The inheritance and
import relationships are shown in Figure 1b, which is important for a proper API representation.
2 http://www.exif.org/ last visit dec 05, 2011.
3 Basic Geo (WGS84 lat/long) Vocabulary http://www.w3.org/2003/01/geo/ provides the namespace, last visit dec 05, 2011.
4 http://www.foaf-project.org/ last visit dec 05, 2011.
A class is generated for each of the concepts defined in the ontology. The relationships between con-
cepts are represented as fields of the domain classes, e.g., the satisfies relationship
between the m3o:AnnotationSituation and the m3o:AnnotationPattern concept is
represented as satisfies field of type AnnotationPattern in the AnnotationSitua-
tion class. The generated class structure gives Jim no information about how
to use it, i.e., which classes to instantiate when annotating an image with a
geo-point or a creator. In fact one has to instantiate the class representations
AnnotationPattern, AnnotationSituation, Image, EXIFGeoPoint, ImageConcept and
EXIFGeoPointConcept and fill all the fields representing the relationships, namely
defines, classifies, hasSetting and satisfies.
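The flavour of such a generated, one-to-one class structure can be sketched as follows; this is a hypothetical, much simplified rendering in Python (generated code in this setting would typically be Java), restricted to a few of the classes and relationship fields named above.

# Simplified sketch of a naive one-to-one mapping: every concept becomes a
# class, every relation becomes a field. Nothing in the structure tells the
# developer which objects to instantiate and wire up for an annotation.
class AnnotationPattern:
    def __init__(self):
        self.defines = []          # concept classes defined by the pattern

class AnnotationSituation:
    def __init__(self):
        self.satisfies = None      # -> AnnotationPattern
        self.hasSetting = []       # entities taking part in the situation

class Image:                       # the annotated entity
    pass

class EXIFGeoPoint:                # the annotation value
    pass

# Annotating a single image with a geo-point means wiring all of this by hand.
pattern, situation = AnnotationPattern(), AnnotationSituation()
situation.satisfies = pattern
situation.hasSetting = [Image(), EXIFGeoPoint()]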
Furthermore, not all class representations are of direct concern for Jim’s appli-
cation. Some of these representations provide direct content for the application,
like the annotated entity — the Image — or the annotation entities — the EXIF-
GeoPoint and the FoaFPerson. Other classes only provide the structure necessary
for a proper knowledge representation. The M3O ontology uses the Description
& Situation (D&S) ontology design pattern. Description & Situation is another
reification [3] formalism in contrast to the RDF reification5. For using D&S as
reification formalism one has to add additional resources, the description, situ-
ation and the classifying concepts. The class representation for these concepts
are of no use for Jim when using the API in his application. For this reason, he
decides to encapsulate them from direct access and hide them from an eventual
application developer.
models. All the API models are used in our evaluation in Section 5. Jim first
identifies the functionality to be provided by the API, the annotation of im-
ages. Jim decides to provide a class for this annotation, the annotation class.
In the following, we describe the different designs of the three APIs. API-1:
He defines the set of concepts and properties involved in this functionality. Jim
classifies the concepts in this set according to how they are used in the ap-
plication and he splits them into two disjoint sets. The first set contains all
concepts representing the content the application works on. In our terminol-
ogy, we call them content concepts. We would like to emphasize that in our
scenario Jim as an API developer will not have to know about the terminol-
ogy we use at all; but it is significantly easier in this paper to use our ter-
minology to explain the different decisions he may take when developing the
API. For our example Jim chooses the m3o:Image, the wgs84:Point and the
foaf:Person to provide the content. The other set contains the concepts of struc-
tural concern for the knowledge representation. Subsequently, we call these con-
cepts structure concepts. For Jim these concepts are m3o:AnnotationPattern,
m3o:AnnotationSituation, m3o:AnnotatedConcept, m3o:GeoPointConcept, and
m3o:CreatorConcept, and he wants his API to encapsulate and hide class repre-
sentations of such concepts from the application. In our terminology, we call a
set of concepts and relations related to an API class a semantic unit SU =
(CO, SO, R) with CO the set of content concepts, SO the structure con-
cepts and R the set of relations. For our example, semantic units are, e.g.,
the annotation as described above or the geopoint consisting of the wgs84:Point
together with its latitude and longitude. Jim wants his API to be prepared
for arbitrary multimedia content and new types of annotations. The ontology
provides abstract concepts for multimedia content and annotations in its inher-
itance structure presented in Figure 1b. But not all concepts from this structure
are of interest to the application. Thus Jim decides to use only the least com-
mon subsumers, e.g., dul:InformationObject for annotatable multimedia content
and dul:Object for annotations. Jim implements interfaces representing these
two concepts.
Jim is now able to design the API. He defines a class for the annotation func-
tionality as shown in Figure 3. In addition, he defines a class for each content
concept the application works on, in this case Image, EXIFGeoPoint and
FoaFPerson. These classes implement the interfaces derived from the inheri-
tance structure of the ontology, InformationEntity and Object. The Infor-
mationEntity interface has to be realized by classes representing multimedia
content, e.g., by the Image class. The Object interface has to be realized by an-
notation entities, e.g., the classes EXIFGeoPoint and FoaFPerson. All these
classes and interfaces together with the operations form a so-called pragmatic
unit. A pragmatic unit is a tuple P U = (C, F, M ) that contains the classes
C, the fields F and the methods M of an object-oriented model and that relates
to a specific semantic unit in the underlying knowledge model.
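Both notions can be transcribed directly into data structures. The following hypothetical Python sketch is a literal rendering of the tuple definitions SU = (CO, SO, R) and PU = (C, F, M), populated with the concept names of the running example (relation names are given without namespace prefixes).

from dataclasses import dataclass

@dataclass
class SemanticUnit:
    # SU = (CO, SO, R): content concepts, structure concepts, relations.
    content_concepts: set
    structure_concepts: set
    relations: set

@dataclass
class PragmaticUnit:
    # PU = (C, F, M): the classes, fields and methods of the object-oriented
    # model that realise one semantic unit.
    classes: set
    fields: set
    methods: set

# The 'annotation' semantic unit from the running example.
annotation_su = SemanticUnit(
    content_concepts={"m3o:Image", "wgs84:Point", "foaf:Person"},
    structure_concepts={"m3o:AnnotationPattern", "m3o:AnnotationSituation",
                        "m3o:AnnotatedConcept", "m3o:GeoPointConcept",
                        "m3o:CreatorConcept"},
    relations={"defines", "classifies", "hasSetting", "satisfies"})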
representations. Our API should provide classes to support the application de-
veloper in performing these operations in an easy and well encapsulated way.
(R5) Method Behavior. APIs provide methods to access or manipulate API
entities or to query for entity properties. In some cases, it might be necessary
to fall back to reasoning on the ontology [10] to be able to answer queries. For
example, querying for all instances of a specific concept could be such a query.
A method for such a query performed on the Java representation could guar-
antee soundness but never completeness. The same also applies for consistency
preservation. In some cases, the API could restrict its behavior in a way that
it ensures the consistency of the represented knowledge. We expect the API to
either inform the calling method or throw an exception when the requested action would affect the consistency of the represented knowledge. Sometimes, it is not
possible or practical for complexity reasons to restrict the API behavior. In this
case the API cannot ensure the consistency. Currently, we focus on cases where
restrictions or query answering on the API are possible, e.g., qualified number
restrictions on properties. A reasoner integration to ensure validity of operations
remains for future work.
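What such a behavioural restriction might look like in generated code can be illustrated with a hypothetical setter derived from a qualified number restriction; the class name and cardinality below are invented and only show the pattern of failing fast instead of silently producing an inconsistent knowledge base.

class ConsistencyError(Exception):
    # Raised when an update would violate a restriction from the ontology.
    pass

class GeneratedAnnotationSituation:
    MAX_SATISFIES = 1              # hypothetical qualified number restriction

    def __init__(self):
        self._satisfies = []

    def add_satisfies(self, pattern):
        # The generated method enforces the cardinality instead of leaving
        # the represented knowledge in an inconsistent state.
        if len(self._satisfies) >= self.MAX_SATISFIES:
            raise ConsistencyError("another 'satisfies' value would violate "
                                   "the qualified number restriction")
        self._satisfies.append(pattern)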
In the first step, the Model of Ontologies (MoOn) is used to represent cru-
cial properties of the target API as properties of the ontology in a declarative
manner. In MoOn, concepts are classified as either being content concepts or
structure concepts. Semantic units are defined and one can adapt parts of
the ontology’s inheritance structure to the API. Figure 6 shows the semantic
unit annotation from our running example in the MoOn-based representation.
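A purely illustrative impression of such a declarative description, and of how a generator could consume it, is given below; the key names and the structure of the dictionary are invented, as the concrete MoOn metamodel is not reproduced here.

# Stand-in for a MoOn-based description: the API-shaping decisions are data.
moon_description = {
    "content_concepts": {"m3o:Image", "wgs84:Point", "foaf:Person"},
    "structure_concepts": {"m3o:AnnotationPattern", "m3o:AnnotationSituation",
                           "m3o:AnnotatedConcept", "m3o:GeoPointConcept",
                           "m3o:CreatorConcept"},
}

def plan_api(moon):
    # Content concepts become publicly exposed classes; structure concepts
    # are generated but encapsulated and hidden from the application developer.
    return {"public_classes": sorted(moon["content_concepts"]),
            "hidden_classes": sorted(moon["structure_concepts"])}

print(plan_api(moon_description))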
To show the applicability of our approach, we have developed and applied the
OntoMDE toolkit to generate APIs from different ontologies. We have selected
ontologies with different characteristics in terms of complexity, level of abstrac-
tion, degree of formalization, provenance, and domain-specificity. We have used
the OntoMDE toolkit to generate APIs for the Pizza12 and Wine13 ontologies.
As less formal real-world ontologies, we have chosen the Ontology for Media
Resources (OfMR)14 of the W3C and the CURIO15 ontology used in the We-
KnowIt project16. Finally, we have used OntoMDE to generate APIs for the M3O [13], on which our running example is based, and the Event-Model-F (EMF) [14].
To demonstrate the flexibility and adaptability, we used OntoMDE to generate
different APIs from the same input ontology, from slightly changed versions of
the same ontology and to integrate legacy APIs into our ontology access API. We
have selected the M3O ontology, the OfMR aligned with the M3O, and an EXIF17 ontology aligned to the M3O as input ontologies for this study. As outlined for
our example in Section 2.4, we designed different possible APIs for accessing the
M3O. Then, we generated these APIs from the M3O ontology by changing the
declarative information about programming access on the MoOn and the OAM.
To show the integration capabilities of OntoMDE, we use the OAM to integrate
legacy APIs for the Image class in the M3O API.
With the first use case, the generation of APIs for the Pizza and Wine ontolo-
gies, we have shown that our approach is capable of processing OWL ontologies,
(R1,R9). From applying OntoMDE to multiple ontologies with different charac-
teristics, we can conclude that the general idea of distinguishing concepts into
content concepts or structure concepts is applicable to all tested ontolo-
gies. The concrete sets of content concepts and structure concepts strongly depend on the characteristics of the ontology. In simple, less formal ontologies most of the concepts are content concepts of direct concern for the application. In contrast, in complex ontologies with a high level of abstraction and intense use
of reification more of the concepts tend to be structure concepts. The organi-
zation of concepts in semantic units is also applicable to all kinds of ontologies.
Again, we encounter differences depending on the characteristics of the ontology.
Simple ontologies often only allow for few and usually small semantic units.
Complex ontologies allow for multiple partially overlapping semantic units
with potentially many concepts.
We have also investigated the flexibility and adaptability of our approach. Re-
garding the adaptability, we have integrated the java.awt.image package as legacy
APIs for representing images into the APIs of our example. Using the OAM, the
integration of the generated API and the legacy API could be conducted in a
12 The pizza ontology, http://www.co-ode.org/ontologies/pizza/2007/02/12/, last visit dec 5, 2011.
13 http://www.w3.org/TR/owl-guide/wine.rdf last visit dec 5, 2011.
14 http://www.w3.org/TR/mediaont-10/ last visit dec 5, 2011.
15 http://www.weknowit.eu/content/curio collaborative user resource interaction ontology, last visit dec 5, 2011.
16 http://www.weknowit.eu/ last visit dec 5, 2011.
17 http://www.exif.org/specifications.html last visit dec 5, 2011.
few steps. As mentioned, we have generated different APIs for the ontology from
our example. We have also shown that changes of the API model could be ac-
complished by modifications on the MoOn, such as “choice of pragmatic units” or “choice of content concepts”. As you can see, these changes result in different
numbers of pragmatic units and generated concept classes. To demonstrate the
flexibility regarding the actual RDF-persistence layer used, we have changed the
back-end API of the OntoMDE approach. We used our own RDF-persistence
layer Winter [12] as well as the RDF-persistence layer Alibaba18 . This change
of the backend could be conducted within a short time of about one hour. This
addresses requirements (R5), (R6), and (R7).
6 Related Work
The problem space of object relational impedance mismatch and the set of con-
ceptual and technical difficulties is addressed frequently in literature, e.g. in
[5,15,16,2]. Among others, Fowler provides in his book [1] a wide collection of
patterns to common object relational mapping problems. Due to the fact that
many problems in persistence and code generation for ontologies are similar to
problems from the field of relational databases many approaches utilize object-
relational strategies for object-triple problems, for example like ActiveRDF [8],
a persistence API for RDF adapting the object-relational ActiveRecord pattern
from Fowler’s book, or OTMj19, a framework that transfers some of Fowler’s patterns to the field of object-triple mapping. Most of the other frameworks, like
AliBaba, OWL2Java [6], Jastor20, OntologyBeanGenerator21, Àgogo [9], and
others, use similar techniques adapting object-relational solutions. An overview
can be found at Tripresso22, a project web site on mapping RDF to the object-
oriented world. These frameworks use a simple mapping model for transforming
each concept of the ontology into a class representation in a specific programming
language like Java or Ruby. Properties are mapped to fields. Only Àgogo [9] is
a programming-language independent model driven approach for automatically
generating ontology APIs. It introduces an intermediate step based on a Do-
main Specific Language (DSL). This DSL captures domain concepts necessary
to map ontologies to object-oriented representations, but it does not capture the pragmatics.
The mappings used to generate the MoOn from the OWL ontologies are based
on the work done for the Ontology Definition Metamodel (ODM) [4,11]. The
Ontology Definition Metamodel [7] is an initiative of the OMG23 for defining an
ontology development platform on top of MDA technologies like UML.
18 http://www.openrdf.org/doc/alibaba/2.0-alpha4/ last visit dec 5, 2011.
19 https://projects.quasthoffs.de/otm-j last visit dec 5, 2011.
20 http://jastor.sourceforge.net/ last visit dec 5, 2011.
21 http://protege.cim3.net/cgi-bin/wiki.pl?OntologyBeanGenerator last visit dec 5, 2011.
22 http://semanticweb.org/wiki/Tripresso last visit dec 5, 2011.
23 http://www.omg.org/ last visit dec 5, 2011.
7 Conclusion
We have presented with MoOn and OAM a declarative representation of prop-
erties of ontologies and their entities with regard to their use in applications and
application programming interfaces (APIs). On this basis, we have introduced a
multi-step model-driven approach to generate APIs from OWL-based ontologies.
The approach allows for user-driven customizations to reflect the needs in a spe-
cific application context. This distinguishes our approach from other approaches
performing a naive one-to-one mapping of the ontology concepts and properties
to the API classes and fields, respectively. With our approach, we relieve the developers of the tedious and time-consuming API development task such
that they can concentrate on developing the application’s functionalities. The
declarative nature of our approach eases reusability and maintainability of the generated API. In the case of a change of the ontology or the API, most of the time only the declarative representation has to be adapted and a new API can
be generated. In our case studies, we applied our approach to several ontologies
covering different characteristics in terms of complexity, level of abstraction, de-
gree of formalization, provenance, and domain-specificity. For our future work,
we plan to integrate the support for different method behaviors (see R5) and
the dynamic extensibility of ontologies. The support of the dynamic extensibil-
ity of ontologies strongly depends on the persistence layer used. Another idea is
to use the declarative representation in combination with the ontology to prove
consistency of the data representation and manipulation in the API regarding
the ontology.
References
1. Fowler, M.: Patterns of Enterprise Application Architecture. Addison-Wesley Long-
man, Amsterdam (2002)
2. Fussell, M.L. (ed.): Foundations of Object Relational Mapping (2007),
http://www.database-books.us/databasesystems_0003.php
3. Gangemi, A., Mika, P.: Understanding the Semantic Web through Descriptions
and Situations. In: Meersman, R., Schmidt, D.C. (eds.) CoopIS 2003, DOA 2003,
and ODBASE 2003. LNCS, vol. 2888, pp. 689–706. Springer, Heidelberg (2003)
4. Hart, L., Emery, P.: OWL Full and UML 2.0 Compared (2004),
http://uk.builder.com/whitepapers/
0and39026692and60093347p-39001028qand00.html
5. Ireland, C., Bowers, D., Newton, M., Waugh, K.: A classification of object-
relational impedance mismatch. In: Chen, Q., Cuzzocrea, A., Hara, T., Hunt, E.,
Popescu, M. (eds.) DBKDA, pp. 36–43. IEEE Computer Society (2009)
6. Kalyanpur, A., Pastor, D.J., Battle, S., Padget, J.A.: Automatic Mapping of OWL
Ontologies into Java. In: SEKE (2004)
7. Ontology Definition Metamodel. Object Modeling Group (May 2009),
http://www.omg.org/spec/ODM/1.0/PDF
8. Oren, E., Delbru, R., Gerke, S., Haller, A., Decker, S.: Activerdf: object-oriented
semantic web programming. In: WWW. ACM (2007)
9. Parreiras, F.S., Saathoff, C., Walter, T., Franz, T., Staab, S.: à gogo: Automatic
Generation of Ontology APIs. In: IEEE Int. Conference on Semantic Computing.
IEEE Press (2009)
10. Parreiras, F.S., Staab, S., Winter, A.: Improving design patterns by description
logics: A use case with abstract factory and strategy. In: Kühne, T., Reisig, W.,
Steimann, F. (eds.) Modellierung. LNI, vol. 127, pp. 89–104. GI (2008)
11. Rahmani, T., Oberle, D., Dahms, M.: An Adjustable Transformation from OWL
to Ecore. In: Petriu, D.C., Rouquette, N., Haugen, Ø. (eds.) MODELS 2010, Part
II. LNCS, vol. 6395, pp. 243–257. Springer, Heidelberg (2010)
12. Saathoff, C., Scheglmann, S., Schenk, S.: Winter: Mapping RDF to POJOs revis-
ited. In: Poster and Demo Session, ESWC, Heraklion, Greece (2009)
13. Saathoff, C., Scherp, A.: Unlocking the Semantics of Multimedia Presentations in
the Web with the Multimedia Metadata Ontology. In: WWW. ACM (2010)
14. Scherp, A., Franz, T., Saathoff, C., Staab, S.: F–a model of events based on the
foundational ontology DOLCE+DnS Ultralight. In: K-CAP 2009. ACM, New York
(2009)
15. Ambler Scott, W.: Crossing the object-data divide (March 2000),
http://drdobbs.com/architecture-and-design/184414587
16. Ambler Scott, W.: The object-relational impedance mismatch (January 2010),
http://www.agiledata.org/essays/impedanceMismatch.html
17. Wirfs-Brock, R., Wilkerson, B.: Object-Oriented Design: A Responsibility Driven
Approach. SIGPLAN Notices (October 1989)
Clinical Trial and Disease Search with Ad Hoc
Interactive Ontology Alignments
1 Introduction
images they would additionally like to know whether previous diagnoses ex-
ist, if there has been a change in the case, and what kind of medication and
treatment plan is foreseen. This requires the medical images to be annotated
accordingly so that the radiologists can obtain all the necessary information
starting with a computed tomography (CT) or magnetic resonance (MR) image,
and the case description in form of a patient record. We will explain how an
LODD (http://www.w3.org/wiki/HCLSIG/LODD) application based on dis-
eases, drugs, and clinical trials can be used to improve the (ontology-based)
clinical reporting process while, at the same time improving the patient follow-up
treatment process (i.e., monitoring the patient’s health condition and the devel-
opment of the disease). We will focus on the essential part of ontology matching
between the medical reference ontology, Radlex3 [5], and the available and rele-
vant LODD data contained in DrugBank, DailyMed, and Diseasome which are
mediated through the LinkedCT resources. LinkedCT, see http://linkedct.org,
aims at publishing the first open Semantic Web data source for clinical tri-
als data; it contains more than 60,000 trials, 14,243 conditions, and 67,271
interventions.
Essentially, the important mapping task between LinkedCT and Diseasome
must be seen in the context of a more complex medical workflow which we
will explain in detail. In addition, the mapping of several additional resources
can only be done interactively, at query time, to meet both the data and the
intentions of the clinician who searches for trial and drug information. The reason
for this is that a radiologist can only decide in an ad hoc fashion whether two
proposed “equality” or “related” matches are appropriate in a specific patient
and knowledge retrieval context. This paper is structured as follows. Section 2
describes the clinical problem statement and argues in favour of a context-based
interactive approach; section 3 describes the workflow we created in order to
meet the clinical requirements while embedding the context-based interactive
approach into a concrete knowledge retrieval scenario. Section 5 provides a first
evaluation of the approach to meet the clinical requirements; section 6 concludes.
A first analysis of the envisioned search functionality revealed that the asso-
ciation between observed diseases (e.g., lymphoma) and different, related types
of the same disease serves as a valuable knowledge resource (as it can be used
for refining the search query) when searching for similar patients and/or clinical
trials. We identified related LODD resources to capture valuable associations
between the various high-level concepts such as diseases, interventions, medica-
tions, symptoms, etc. Those concepts occur within the clinical diagnostic process
and are thus very relevant for defining search queries. Three related LODD resources have been identified: DrugBank4, Diseasome5, and DailyMed6. Figure
1 shows the identified LODD resources and a potential interlinking.
Existing medical ontologies for anatomy and disease related aspects (cf. FMA,
RadLex or NCI-Thesaurus) usually focus on one particular domain, such as
anatomy or radiology, and do not cover relations that link concepts from other
domains such as those which link associated findings with diseases. Most medical
ontologies of this scale for anatomy, disease, or drug aspects can be summarised
as: (a) they are very large models, (b) they have extensive is-a hierarchies of up to tens of thousands of classes which are organised according to different views, (c)
they have complex relationships in which classes are connected by a number
of different relations, (d) their terminologies are rather stable (especially for
anatomy) in that they should not differ too much in the different ontologies
4 DrugBank is a large repository of small molecule and biotech drugs that contains detailed information about drugs including chemical, pharmacological, and pharmaceutical data in addition to sequence, structure, and pathway information. The Linked Data DrugBank contains 1,153,000 triples and 60,300 links.
5 Diseasome contains information about 4,300 disorders and disease genes linked by known disorder-gene associations. It also indicates the common genetic origin of many diseases. The list of disorders, disease genes, and associations between them comes from the Online Mendelian Inheritance in Man (OMIM), which is a compilation of human disease genes and phenotypes. The Linked Data Diseasome contains 88,000 triples and 23,000 links.
6 DailyMed provides up-to-date information about marketed drugs. Human Prescription Labels, OTC Labels, and Homeopathic Labels sum up to several million entries.
(we will show the opposite for the cancer disease parts), and (e) their modelling
principles are well defined and documented.
A variety of methods for ontology alignment have been proposed [13,2,4,1,8,6].
The objective of the state-of-the-art in ontology mapping research includes the
development of scalable methods (e.g., by combining very efficient string-based
methods with more complex structural methods), and tools for supporting users
to tackle the interoperability problem between distributed knowledge sources
(e.g., editors for iterative, semi-automatic mapping with advanced incremental
visualisations [9]). In addition, cognitive support frameworks for ontology map-
ping really involve users [3], or try to model a natural language dialogue for
interactive semantic mediation [11].
One of the ontology matching problems in medicine is still that, in many cases,
complex ontology matching algorithms cannot be used because they do not scale
to sizes of medical ontologies—complex methods for ontology alignment in the
medical domain turned out to be unfeasible because the concept and relation
matrix is often on the scale of 100000 × 100000 alignment cells and appropri-
ate subontologies cannot be created with state-of-the-art methods because of
complex inter-dependencies.
Another problem is that when using those methods, we can only work with
static mappings as a result of an offline matching process in which the mappings
are independent of the context in which they are used. In the context of our
medical use case, however, we learned in discussions with our clinicians that es-
tablishing clinically relevant associations between given clinical concepts has the
potential to improve the search functionality. But that comes with the condi-
tion of a context-dependent quality and relevance of established associations
(i.e., alignments) between clinical concepts which determines to which extent
the search functionality can be improved.
We argued in [14] that annotating medical images with information available
from LODD can eventually improve their search and navigation through ad-
ditional semantic links. One outcome of our ontology engineering methodology
[15,13] was the semi-automatic alignment between radiology-related OWL on-
tologies (FMA [7] and Radlex). This alignment could be used to provide new
connections in the medicine-related linked data cloud. The fact that context-
dependency may play a pivotal role for static alignments from FMA to Radlex
was shown in [16]. Why should this problem be more severe in the context of an
online information retrieval task where diseases from LinkedCT and Diseasome
have to be aligned?
Basically speaking, small changes in the nomenclature can make a big difference in the adequacy of the proposed mappings. This is driven by the fact
that, in medicine, usually a very specific difference in the concept names might
make a huge difference for their interpretation (therefore we should not even try
to map LinkedCT and Diseasome unless there are exact matches). But the
absence of globally unique identifiers for diseases (the URIs) forces us to provide
the mappings. Instead of trying to infer such mappings on the large scale for
disease data sources (LinkedCT, for example, has only 830 owl:sameAs links to
Diseasome but contains more than 4600 different disease URIs), we try to estab-
lish an ad hoc mapping. Further, we ask whether static n-to-n mappings between
two medical ontologies are really necessary and represent really what is desired
in a specific query situation. In the following, we will argue that, at least in our
medical use case, a rather different set of requirements exists.
For example, a clinical expert identifies patient cancer cells in the imaging
data and is sure that they are of type “lymphoma”. Hence our (Radlex) search
term is “lymphoma”. In order to decide on the follow-up treatment steps of the
patient, he wants to search for similar patients/trials where similar patients have
been successfully treated (case-based reasoning). Lymphoma diseases of patients
can, however, be distinguished along three orthogonal dimensions: type, stage, and grade.
As lymphomas of different type, stage, and grade grow at different rates, they
respond differently to specific treatments. For that reason, clinicians need to
know the particular type, stage, and grade of a patient’s lymphoma for an ade-
quate treatment. Accordingly, we have to filter out the relevant trials, or in other
words, align the trials according to this complex information background.
For that reason, we cannot rely on semantic/structure-based knowledge models to establish associations for “fine-tuning” our search space. To the best of our knowledge, no formal knowledge structure exists that relates these three dimensions (type, stage, and grade) in a proper manner/model.
For the same reason, we also did not use a semantic similarity measure.
Initially, it might appear appropriate to use abbreviation lists, synonym sets,
etc. Although it seems to be obvious that purely syntactic (approximate) string
matching techniques are not sufficient to deal with different data representations,
different linguistic surface forms, and missing information about type, stage, and
grade, it is wrong to believe that the potential increase in recall when using such
query expansion comes at only a small loss of precision.
However, our interactive workflow is designed to increase precision at a stable
recall level as our examples from LinkedCT and Diseasome will show. (The recall
level can, however, be controlled by declaring the thresholds for individual ad
hoc matchers.)
3 Proposed Workflow
This process is entirely transparent to the user for two reasons. First, while using the direct-manipulation facetted browsing and search interface, he knows exactly at
which stage he is and exactly when additional input from him is required (stage
5). Second, built on the mental model of the retrieval stages and the context
knowledge he has gained by inspecting the patient file, reflecting on the Radlex
term, and following only the disease links he is interested in, the clinician knows
how to interpret the proposed mappings. It becomes clear that the clinician
is not necessarily interested in pure equality (or subsumption mapping), but
uses a more underspecified “related-to” mapping/alignment. Interestingly, this
“related-to” mapping only makes sense in this dynamic retrieval process where
he or she uses the proposed mappings as a further anchor point in his extended
search context. In other words, the user defines what he means by the mappings
in an ad hoc fashion. Therefore, we cannot use static mappings at all or record our
mappings as static ones. The question is whether this procedure really enhances
the robustness of the overall search interface in specific, use-case-relevant retrieval
situations (cf. section 5).
Fig. 2. Workflow of the Clinical Information Retrieval and Ontology Matching Tasks
whereby the function d(d1i , d2j ) counts the distance between two letters.
The more measures we have that agree on two (disease) terms as being
“related”, the more stars are visible as our recommendation for the clinician.
However, the decision whether the recommendation is in line with the user’s
expectations is taken by the clinician and includes the outcome of πuser .
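A minimal sketch of this aggregation, with invented toy measures and thresholds standing in for the system's actual distance functions and weights, could look as follows.

# Each measure returns a distance in [0, 1]; a measure "agrees" that two
# disease names are related if its distance stays below its own threshold.
def exact_distance(a, b):
    return 0.0 if a.lower() == b.lower() else 1.0

def token_distance(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(ta & tb) / max(len(ta | tb), 1)

def star_recommendation(name1, name2, measures):
    # measures: list of (distance_function, threshold) pairs; the number of
    # agreeing measures is shown to the clinician as that many stars.
    return sum(1 for dist, t in measures if dist(name1, name2) <= t)

stars = star_recommendation("Mantle Cell Lymphoma", "Lymphoma, Mantle Cell",
                            [(exact_distance, 0.0), (token_distance, 0.6)])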
For example, n-gram measures such as Weighted Jaccard make it possible
to measure the similarity based on specific words and ignore others which are
expected to be unimportant. But how should we know which tokens are unim-
portant? Formally, the Weighted Jaccard measure takes two disease names d1 , d2
and computes (only) the fraction of tokens that are present in both multi-word
terms.
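As a rough illustration, an unweighted token-overlap variant of this idea can be written in a few lines; the weighting scheme actually used by the system is not reproduced here.

import re

def tokens(name):
    return set(re.findall(r"[a-z0-9]+", name.lower()))

def jaccard_tokens(d1, d2):
    # Fraction of word tokens shared by two multi-word disease names.
    t1, t2 = tokens(d1), tokens(d2)
    return len(t1 & t2) / len(t1 | t2) if (t1 | t2) else 0.0

print(jaccard_tokens("Castleman's Disease", "Castleman"))
# ~0.33 with plain tokens; a weighted variant discounting "disease" scores higher
print(jaccard_tokens("Lymphocytic Leukemia, L1", "Lymphocytic Leukemia, L2"))
# 0.5 even though the stage difference is decisive for finding related trials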
5 Evaluation
We evaluated the results of our individual similarity measures and found some
special characteristics of the measures when applied to our specific data. The
Weighted Jaccard method is useful to crop off stop words like “disease” as in
“Castleman’s Disease” to map against “Castleman”, but is completely useless
for, e.g., the stages of a disease case, a factor extremely important for lym-
phoma cases. This outcome reveals that when using this measure for medical
linked data, it would be based on the wrong assumption that medical concept
names contain negligible tokens. But our staging information and related type
information is coded in those tokens. For example, the LinkedCT terms “Lym-
phocytic Leukemia, L1” and “Lymphocytic Leukemia, L2” differ only in the
stage number, but this information is essential for finding related trials and
drugs.
But also the traditional string similarity measures such as simed cannot be
used without caution. For example, in [16] we emphasised (in the context of
the FMA ontology) that it contains closely related concepts such as “Anterior
cervical lymph node” and “Set of anterior cervical lymph nodes” which could
not be identified as duplicates with the simed function. The linguistic surface forms might vary to an even greater extent when taking multiple
ontologies (i.e., LinkedCT and Diseasome) into account. On the other hand,
simed works very well in identifying related staging cases when only the stage
numbers or similar sub-expressions are different (cf . the “Lymphocytic Leukemia
L1 / L2”).
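This behaviour can be reproduced with a plain normalised edit similarity standing in for the simed function referenced above (whose exact definition appears on an earlier page); the sketch below divides a standard Levenshtein distance by the length of the longer string.

def levenshtein(a, b):
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def sim_ed(a, b):
    # Normalised edit similarity in [0, 1] (a stand-in for simed).
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

print(sim_ed("Lymphocytic Leukemia, L1", "Lymphocytic Leukemia, L2"))   # ~0.96
print(sim_ed("Anterior cervical lymph node",
             "Set of anterior cervical lymph nodes"))                   # lower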
As a result of the observation that several distance functions have different
performances depending on the characteristics of the data (such as length of
the string, token permutations, etc.) we evaluated our ensemble measure with
the “star” recommendations. Since the functions are independently applied to
the disease names d1 , d2 and aggregated into a combined measure by speci-
fying the thresholds and weights of the individual distance function calls, a
vast improvement in the robustness could be achieved. A closer examination of
the disease names stored in LinkedCT and Diseasome provides an explanation:
many disease names contain long digit sequences, for instance ’G09330582163324’
(LinkedCT). In most cases the digits refer to important type, stage, and grade
Fig. 4. Mantle Cell - Choice box for the proposed and selected “related”-mappings to
Diseasome
information which is covered by the combined measure. The type, stage, and
grade information we finally optimised for, are the following:
1. the type of the disease: “Diabetes Mellitus, Type 1” , “Diabetes Mellitus,
Type 2”,
2. the stage of the disease: “Early Stage Breast Cancer (Stage 1-3)”,
3. the age of the patient as an indirect type classification: “ICU Patients 18
Years or Older”,
4. the date of the disease outbreak: “2009 H1N1 Influenza”,
5. genetic information; location of deleted region in the chromosome: “22q11.2
Deletion Syndrome”.
We then evaluated the interactive procedure in the context of our lymphoma
case. The lymphoma case reveals that LinkedCT enumerates 459 lymphoma
disease URIs (and Diseasome only 25 URIs) matching the same name observations.
Since either the LinkedCT term or the Diseasesome term often lacks context
information, we can only rely on the interactive workflow because the context
is provided by the context factors explained above—and which are accessible to
the expert while focussing on the suggestions the system provides. Here is an
example of the individual measure outcomes for “Mantle Cell Lymphoma” from
LinkedCT (figure 3), for which no links to Diseasome exist.
The aggregated “star recommendation” is attached to the suggested choice
box for the ad hoc “related” mappings (figure 4).
The choice box is displayed within the facetted browsing tool. Our tool allows a user to filter thousands of results according to the best-ranked string matches and linked data relations in a convenient way. Clinicians can retrieve the trials they might be interested in within just a few seconds. We implemented
the graphical user interface by using open-source knowledge management tools
6 Conclusion
We explained how an LODD application based on diseases, drugs, and clini-
cal trials can be used to improve the clinical reporting process. In order to get
information about trials, drugs, and diseases, several LODD sources can be ad-
dressed and the contained knowledge can be combined. The clinical problem
statement suggested that in order to make the application useful for improving
the patient follow-up treatment process, specific non-existing mappings must be
provided. The important mapping task lies between LinkedCT and Diseasome
in the context of a more complex medical workflow which we developed. We
argued that ad hoc interactive string-based ontology alignments should be used
to propose several ad hoc mappings to the user. The user can then verify them
at the knowledge retrieval stage while using the facetted browsing tool which
implements the graphical user interface of the proposed workflow. An evalua-
tion has shown that the interactive approach is useful and that underspecified
“related” mappings are often more useful than precise “equality” mappings in
the medical domain. In general, the discovery of such “related” links which are
typed, dynamic, and bound to a specific user, data, and application context,
should be an active research area.
Acknowledgements. We would like to thank all colleagues at DFKI and all part-
ners in the MEDICO and RadSpeech projects for their valuable contributions
to the iterative process described here. This research has been supported by the
THESEUS Programme funded by the German Federal Ministry of Economics
and Technology (01MQ07016).
References
1. Doan, A., Madhavan, J., Domingos, P., Halevy, A.: Ontology matching: A ma-
chine learning approach. In: Handbook on Ontologies in Information Systems,
pp. 397–416. Springer (2003)
2. Euzenat, J., Shvaiko, P.: Ontology matching. Springer, Heidelberg (2007)
3. Falconer, S.M., Noy, N., Storey, M.-A.D.: Towards understanding the needs of
cognitive support for ontology mapping, vol. 225 (2006)
4. Kalfoglou, Y., Schorlemmer, W.M.: Ontology mapping: The state of the art. In:
Semantic Interoperability and Integration (2005)
5. Langlotz, C.P.: Radlex: A new method for indexing online educational materials.
RadioGraphics 26, 1595–1597 (2006)
6. Noy, N.F.: Tools for mapping and merging ontologies. In: Handbook on Ontologies,
pp. 365–384 (2004)
7. Noy, N.F., Rubin, D.L.: Translating the foundational model of anatomy into OWL.
Web Semant. 6, 133–136 (2008)
8. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching.
The VLDB Journal 10, 334–350 (2001)
9. Robertson, G.G., Czerwinski, M.P., Churchill, J.E.: Visualization of mappings be-
tween schemas. In: CHI 2005: Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems, pp. 431–439. ACM, New York (2005)
10. Seifert, S., Kelm, M., Möller, M., Mukherjee, S., Cavallaro, A., Huber, M., Comani-
ciu, D.: Semantic annotation of medical images. In: Proceedings of SPIE Medical
Imaging, San Diego, CA, USA (2010)
11. Sonntag, D.: Embedded benchmarking and expert authoring for ontology mapping
and alignment generation. In: Proceedings of the Fifth International Conference
on Formal Ontology in Information Systems, FOIS (2008)
12. Sonntag, D., Schulz, C., Reuschling, C., Galarraga, L.: Radspeech, a mobile dia-
logue system for radiologists. In: Proceedings of the International Conference on
Intelligent User Interfaces, IUI (2012)
13. Sonntag, D., Wennerberg, P., Buitelaar, P., Zillner, S.: Pillars of Ontology Treat-
ment in the Medical Domain. In: Cases on Semantic Interoperability for Infor-
mation Systems Integration: Practices and Applications, pp. 162–186. Information
Science Reference (2010)
14. Sonntag, D., Wennerberg, P., Zillner, S.: Applications of an ontology engineering
methodology. AAAI Spring Symposium Series (2010)
15. Wennerberg, P., Zillner, S., Möller, M., Buitelaar, P., Sintek, M.: KEMM: A Knowl-
edge Engineering Methodology in the Medical Domain. In: Proceedings of the 2008
Conference on Formal Ontology in Information Systems, pp. 79–91. IOS Press,
Amsterdam (2008)
16. Zillner, S., Sonntag, D.: Aligning medical ontologies by axiomatic models, corpus
linguistic syntactic rules and context information. In: Proceedings of the 24th In-
ternational Symposium on Computer-based Medical Systems, CMBS (2011)
Towards Fuzzy Query-Relaxation for RDF
Abstract. In this paper, we argue that query relaxation over RDF data
is an important but largely overlooked research topic: the Semantic Web
standards allow for answering crisp queries over crisp RDF data, but
what of use-cases that require approximate answers for fuzzy queries over
crisp data? We introduce a use-case from an EADS project that aims
to aggregate intelligence information for police post-incident analysis.
Query relaxation is needed to match incomplete descriptions of entities
involved in crimes to structured descriptions thereof. We first discuss
the use-case, formalise the problem, and survey current literature for
possible approaches. We then present a proof-of-concept framework for
enabling relaxation of structured entity-lookup queries, evaluating differ-
ent distance measures for performing relaxation. We argue that beyond
our specific scenario, query relaxation is important to many potential
use-cases for Semantic Web technologies, and worthy of more attention.
1 Introduction
RDF is a flexible data format, and is well-suited to data integration scenarios.
However, specifying precise queries over integrated, incomplete, heterogeneous
data is much more challenging than in closed, homogeneous settings.
Writing precise queries requires precise knowledge of the modelling and content
of the data. Even if a querying agent knows its exact information needs—and is
able to specify those needs in a crisp, structured request—often, the query will
not align well with heterogeneous data. Further, a querying agent may not be
able to specify the precise “scope” of answers it is interested in, but may instead
only be able to specify some “ideal” criteria that would be desirable.
Current Semantic Web standards and tools only go so far towards matching
the needs of the querying agent and the content of the dataset. RDFS and OWL
only facilitate finding more crisp answers to queries—answers matched directly
by the data or its entailments—and do not directly support a continuous notion
of distance (or similarity) for resources. For example, a query asking for a 2012
blue sports car on sale in New York may also be interested in a 2010 navy
roadster on sale in Newark. Although the subsumption relationship between a
sports car and a roadster could be modelled in RDFS, the distance of resources
such as blue/navy (vs. blue/red) and New York/Newark (vs. New York/Los
Angeles) cannot be succinctly axiomatised in RDFS/OWL for interpretation by
the query answering system. Instead, we argue that the RDF descriptions of
resources can be used to compute an inductive, generic notion of distance.
In this paper, we thus advocate a relaxed form of RDF query-answering where
the query should be interpreted as specifying the ideal criteria for answers, such
that other relevant (but non-crisp) scored answers are returned. This is similar to
top-k query-answering for Information Retrieval engines: a paradigm that works
well in highly-heterogeneous, incomplete scenarios, including Web search.
We first present an industrial use-case from the European Aeronautic De-
fence and Space Company (EADS) that requires matching witness observations
against crisp knowledge integrated from various law-enforcement and intelligence
agencies (§ 2). Next, we provide a survey of literature that relates to the needs
of EADS’ use-case and to query relaxation (§ 3). We then propose a generic
framework for building a relaxed RDF query engine (§ 4); we currently focus on
entity-lookup queries using similarities of RDF terms. Subsequently, we discuss
a generic technique for extracting distance/similarity scores between resources
based on their structured descriptions (§ 5). We then outline an early prototype
for relaxing queries—that represent witness observations—against a dataset of
vehicle descriptions, testing different similarity measures (§ 6). We conclude that
our own results are too preliminary for deployment, but argue that RDF query
relaxation (or more generally “fuzzy querying”) is an important, timely research
topic not only for EADS’ use-case, but also for various potential and diverse
Semantic Web applications involving vague/uncertain user requirements.
2 Use-Case Overview
Our use-case arises from an on-going research project at the European Aeronau-
tic Defence and Space Company (EADS): a large European aerospace, defence
and military contractor. EADS Innovation Works is the corporate research and
technology department of EADS that explores areas of mobility, security and
environment. One of the team’s key interests relates to civilian security, and en-
abling increased agency collaboration through use of intelligent systems. EADS
has been working on the development of systems for intelligence analysis to aid
post-crime police investigations [30], where analysts need to process raw infor-
mation, determine valuable evidence, and identify the entities involved and their
relationships. Investigations often rely on human observations, including police
reports, or statements from victims, witnesses and informers.
Such data are the result of subjective assessment and often carry inherent
vagueness and uncertainty. Human observations provide an estimate of the entity
observed, described in natural language, and may be imprecise (e.g., stating that a suspect was 1.77 m tall when (s)he was 1.79 m tall) or vague (e.g., “between
1.7 m – 1.85 m”, or “average height”, etc.). Previous work [28,29] analysed issues
with using human intelligence data (HUMINT), and presented methods to align
data in different formats (numeric, textual and ranges). Herein, we view such
observations as structured fuzzy queries to be executed over crisp data.
4 Relaxation Framework
We now propose a conceptual framework for relaxation of an entity query: a list
of attribute–value or attribute–variable pairs Q := (p1 , o1 ), . . . , (pn , on ) such
that each pi is a property URI and each oi is either a variable, a URI or a literal
(i.e., Q ⊂ U × VUL).1 A crisp response consists of entities (subjects) with
predicate–object edges directly matching the query, as well as bindings for any
variables in o1 , . . . , on . In SPARQL terms, this query model roughly corresponds
to basic graph patterns with a common subject variable; for example:
SELECT * WHERE { ?s :colour :blue ; :city :NY ; :type :Sport ; :year 2010 ; :reg ?r . }
To relax queries, we define a matcher as a function M : VUL × UL → R[0,1]
that maps a pair of values into a relaxation score: a value in [0, 1] where 1
indicates that the two values are not interchangeable, and 0 indicates perfect
interchangeability (e.g., M (c, c) := 0, M (?v, c) := 0). Each matcher is a distance
function between the query and entity values, respectively. The match function
1 The query is given an ordering for later convenience. We re-use standard RDF notation: V denotes variables, U URIs and L RDF literals. AB denotes A ∪ B.
may not be symmetric for pairs of values at different levels of specificity: e.g.,
M (:blue, :navy) might be 0.2 suggesting :navy as a good relaxation for generic
:blue, whereas M (:navy, :blue) might be 0.5 since :navy is more specific.
Matchers then form the core of the relaxation framework, and can be instan-
tiated in different ways (cf. [24]). For numeric attribute matchers (e.g. :year),
normalised distances can be used: letting maxi and mini denote the max./min.
values for a numeric property pi appearing in the data, qoi a value in the query
and eoi a value for an entity, we can apply a normalised numeric matcher
Mi : (qoi, eoi) → |qoi − eoi| / (maxi − mini). For string attributes with functional
character strings (e.g., registration plates), lexical matchers can be used; we later
use a Levenshtein edit-distance matcher for licence plates such that
Mi : (qoi, eoi) → Lev(qoi, eoi) / max(|qoi|, |eoi|);
other matchers can be used as appropriate. For categorical attributes—with URIs
or a discrete set of literals as values (e.g., colour, city)—creating a matcher of-
ten requires background knowledge about the different values; as per Schumacher
and Bergmann [27], we thus propose to use a similarity table for such attributes,
computed by a background matching process (discussed later in § 5).
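To make these matcher styles more concrete, the following minimal Python sketch (our illustration, not the authors’ implementation) shows a normalised numeric matcher, a normalised Levenshtein matcher for plate-like strings, and a table-based matcher for categorical values; the colour table shown is a hypothetical example.

def numeric_matcher(q_val, e_val, min_val, max_val):
    """Normalised numeric distance, e.g. for a :year attribute."""
    if max_val == min_val:
        return 0.0
    return abs(q_val - e_val) / (max_val - min_val)

def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def lexical_matcher(q_str, e_str):
    """Normalised Levenshtein distance, e.g. for registration plates."""
    if not q_str and not e_str:
        return 0.0
    return levenshtein(q_str, e_str) / max(len(q_str), len(e_str))

# Hypothetical similarity table for a categorical attribute such as :colour.
COLOUR_TABLE = {(":blue", ":navy"): 0.2, (":navy", ":blue"): 0.5}

def categorical_matcher(q_val, e_val, table):
    """Table lookup: identical values match crisply; unknown pairs do not relax."""
    if q_val == e_val:
        return 0.0
    return table.get((q_val, e_val), 1.0)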
Thus, the relaxation framework may involve multiple matchers: a toolbox
of appropriate matchers can be offered to an administrator. Where a suitable
matcher is not found for a pair of values, the query engine can resort to returning
standard “crisp” answers, allowing for an ad-hoc, incremental relaxation frame-
work. We currently do not consider inference or relaxation of properties etc.; our
framework could perhaps be extended as per the literature surveyed in § 3.
For a query Q = (p1, o1), ..., (pn, on) and entity E, the matchers generate a
tuple of numeric distances M1...n(Q, E) = (d1, ..., dn). Considering the query as
the origin, the matchers map entities onto points in an n-dimensional Euclidean
space with each dimension ranging over [0, 1] (a unit n-cube). Where an entity
has multiple values for a given attribute, the closest to the query is used; where
an entity is not assigned a query-attribute, the matcher returns 1.² Thereafter,
entities on the origin are crisp matches. Otherwise, the distance from an entity to
the query-origin can be measured straightforwardly as a Euclidean distance (in
this case, √(d1² + ... + dn²)), or with the root-mean-square deviation (RMSD: √((d1² + ... + dn²)/n)).
The overall distance from the query-origin to each entity gives an overall
relaxation score that can be used to order presentation of results, or to perform
top-k thresholding. Further, users can annotate query attribute–value pairs with
a vagueness score that allows for controlling the relaxation of individual facets
(e.g., to allow more relaxation for :colour than :city). Thus a vague query is
defined as Q := (p1, o1, v1), ..., (pn, on, vn) where v1, ..., vn ∈ R[0,1] (i.e.,
Q ⊂ U × VUL × R[0,1]). A value vi indicates a threshold for di such that the
entity will only be considered a result if di ≤ vi (e.g., if vi := 0, then (pi, oi) must
have a crisp match). Thus, the origin and the coordinates v1, ..., vn prescribe a
region of space (an m-orthotope, for m the number of non-crisp query attributes)
within which results must fall, allowing relaxation results to be tweaked.
² Intuitively, this is the relaxed form of standard conjunctive query answering.
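As a complement to the matcher sketches above, the following illustrative Python snippet aggregates a per-attribute distance tuple into an overall relaxation score (Euclidean or RMSD, as defined above) and applies per-attribute vagueness thresholds; the attribute names and values are assumed for the example.

from math import sqrt

def relaxation_score(distances, use_rmsd=False):
    """Euclidean distance from the query origin, or RMSD if requested."""
    total = sum(d * d for d in distances)
    return sqrt(total / len(distances)) if use_rmsd else sqrt(total)

def within_vagueness(distances, thresholds):
    """An entity qualifies only if every per-attribute distance stays within its threshold."""
    return all(d <= v for d, v in zip(distances, thresholds))

# Hypothetical example: three query attributes with per-attribute distances and vagueness values.
distances = [0.0, 0.2, 0.4]     # e.g. matcher outputs for (:type, :colour, :reg)
thresholds = [0.0, 0.4, 0.4]    # vagueness values v_i
if within_vagueness(distances, thresholds):
    print(round(relaxation_score(distances), 3))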
6 Proof of Concept
an ‘LD’ licence plate". The relaxation framework is used to derive a ranked list
of cars from the dataset in order of their relevance to the observation. In this
respect, the observation acts like a query which should be executed against the
car instances. Results should include not only those cars which directly match
the characteristics given in the observation, but also similar cars. Different char-
acteristics of the observation can be annotated with different vagueness values.
For demonstration purposes, we decided that the chosen dataset should con-
tain information about a significant number of car instances, with attributes for
(at least) make, model, colour and body-type, covering common facets of vehi-
cle observations. We thus took an existing structured dataset describing 50,000
car instances based on a popular Irish website advertising used cars. Each car
instance is described using the following six properties: vehicle make (48 unique
values; e.g., Toyota), make–model (491 values; e.g., Toyota Corolla), body style
(8 values; e.g., Saloon), fuel type (5 values; e.g., Diesel), colour (13 values after
normalisation; e.g., navy), and registration (i.e., unique licence plate; 50,000 val-
ues). Taking the raw data, colours were normalised into a set of thirteen defined
values, a new set of UK-style licence plates was randomly generated, and the
data were modelled in RDF using Hepp’s Vehicle Sales Ontology (VSO).4
Notably, all vehicle attributes except licence-plates are categorical, and thus
require tables that encode similarity/distance scores. To relax licence-plate val-
ues, we allow wildcard characters in the query and use the normalised Leven-
shtein measure mentioned earlier. For colour, the thirteen values were mapped to
an L*a*b* three-dimensional colour space, where Delta-E was used to compute
(Euclidean) distances between the colours and generate a matrix for relaxation.5
An open, non-trivial challenge was then posed by the other properties. For fuel-
type, it was particularly unclear what kind of relaxation behaviour should be
expected for the use-case; a set of regular distance scores were manually defined.
Of more interest were the make–model, model and body-style attributes, for
which further background information was needed.
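As one concrete illustration of how such a colour matrix can be produced, the sketch below computes pairwise Delta-E (CIE76, i.e., Euclidean) distances over L*a*b* coordinates and normalises them into [0, 1]; the coordinates listed are rough illustrative values, not the ones used in the paper, and the helper names are ours.

from math import sqrt

# Approximate L*a*b* coordinates for a few normalised colours (illustrative values only).
LAB = {
    "black": (0.0, 0.0, 0.0),
    "white": (100.0, 0.0, 0.0),
    "blue":  (32.3, 79.2, -107.9),
    "navy":  (13.0, 47.5, -64.7),
    "red":   (53.2, 80.1, 67.2),
}

def delta_e(c1, c2):
    """CIE76 Delta-E: the Euclidean distance between two colours in L*a*b* space."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(LAB[c1], LAB[c2])))

def colour_distance_table(colours):
    """Pairwise distances normalised by the largest Delta-E so that values fall in [0, 1]."""
    raw = {(a, b): delta_e(a, b) for a in colours for b in colours if a != b}
    max_d = max(raw.values())
    return {pair: d / max_d for pair, d in raw.items()}

table = colour_distance_table(list(LAB))
print(round(table[("blue", "navy")], 2))   # navy comes out as a comparatively close relaxation of blue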
Table 1. Top ten make–model matches overall (left) and for distinct makes (right)
endpoint and refining these mappings in a second phase [21]. Although numerous
matches were found, many make–models were not reconciled to DBpedia URIs.
Instead, we adopted a manual approach by appending make–model strings
onto the DBpedia namespace URI, replacing spaces with underscores. However,
this approach also encountered problems. First, of the 491 string values, 68 mod-
els (14%) did not have a corresponding reference in DBpedia (404 Not Found):
some of the unmatched models were colloquial UK/Irish names (e.g., the make–
model Citroen Dispatch is known elsewhere as Citroën Jumpy), some were
misspelt, and some had encoding issues. These values were manually mapped
based on suggestions from Wikipedia search. Second, some of the matches that
were found returned little data. Of these, some were redirect stub resources
(e.g., Citroen C3 redirects to Citroën C3), where we “forwarded” the mapping
through the redirect. Others still were disambiguation pages (e.g., Ford Focus
disambiguates to Ford Focus International, Ford Focus America and Ford
Focus BEV). Furthermore, some resources redirected to disambiguation pages
(e.g., Ford Focus ST redirects to the Ford Focus disambiguation page). Here,
we mapped strings to multiple resources; for instance, Ford Focus was mapped to the
set of DBpedia resources for { Ford Focus, Ford Focus America, Ford Focus
International and Ford Focus BEV }. In total, 90 strings (18.3%) had to be
manually mapped to DBpedia. Given the mappings, we retrieved 53k triples of
RDF data from DBpedia, following redirects and disambiguation links.
We applied concurrence over the dataset with a threshold t := 200 (each
shared pair with card(p, v) = 200 would generate >20,000 raw concurrence scores,
each below 0.005). For the 491 original make–model values, concurrence found
non-zero similarity scores for 184k (76%) of the 241k total car-model pairs pos-
sible, with an average absolute match score of 0.08 across all models.
The top ten overall results are presented in Table 1, where we also present
the top ten matches for models with different makes; note that matches are
symmetric and that all matches presented in the table had a similarity score
⁷ Alternatively, the similarities could be normalised into a non-parametric, rank-based distance, where a relaxation value of 0.5 includes the top-half similar models.
and H resp.), we took the average of all edition-similarities between both groups
(i.e., the arithmetic mean of scores for all edition-pairs in S × H).
Observation C: “A light Audi A3 8L, 2006 UK reg. starts with SW and ends with M”.
Relaxed query: {(colour, white, 0.4), (edition, Audi-A-8l, 0.1), (reg, SW?6??M, 0.4)}

All Approaches (same results)
 №  result                         score
 1  Audi A3 8L  SW06RWM  yellow    0.92
 2  Audi A3 8L  SF56GCN  white     0.91
 3  Audi A3 8L  BW06LJN  gray      0.90
 4  Audi A3 8L  SW04TVH  black     0.85
 5  Audi A3 8L  AE56MWM  maroon    0.83
7 Conclusion
References
1. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hell-
mann, S.: DBpedia - a crystallization point for the Web of Data. J. Web Sem. 7(3),
154–165 (2009)
2. Boriah, S., Chandola, V., Kumar, V.: Similarity measures for categorical data: A
comparative evaluation. In: SDM, pp. 243–254 (2008)
3. Bruno, N., Chaudhuri, S., Gravano, L.: Top-k selection queries over relational
databases: Mapping strategies and performance evaluation. ACM Trans. DB
Syst. 27(2), 153–187 (2002)
4. Chaudhuri, S., Datar, M., Narasayya, V.R.: Index selection for databases: A
hardness study and a principled heuristic solution. IEEE Trans. Knowl. Data
Eng. 16(11), 1313–1323 (2004)
5. Chu, W.W.: Cooperative database systems. In: Wiley Encyclopedia of Computer
Science and Engineering, John Wiley & Sons, Inc. (2008)
6. Dabrowski, M., Acton, T.: Modelling preference relaxation in e-commerce. In:
FUZZ-IEEE, pp. 1–8 (2010)
7. Dolog, P., Stuckenschmidt, H., Wache, H., Diederich, J.: Relaxing RDF queries
based on user and domain preferences. J. Intell. Inf. Syst. 33(3), 239–260 (2009)
8. Elbassuoni, S., Ramanath, M., Weikum, G.: Query Relaxation for Entity-
Relationship Search. In: Antoniou, G., Grobelnik, M., Simperl, E., Parsia, B., Plex-
ousakis, D., De Leenheer, P., Pan, J. (eds.) ESWC 2011, Part II. LNCS, vol. 6644,
pp. 62–76. Springer, Heidelberg (2011)
9. Gaasterland, T.: Cooperative answering through controlled query relaxation. IEEE
Expert 12(5), 48–59 (1997)
10. Gaasterland, T., Godfrey, P., Minker, J.: An overview of cooperative answering. J.
Intell. Inf. Syst. 1(2), 123–157 (1992)
11. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-
based explicit semantic analysis. In: IJCAI, pp. 1606–1611 (2007)
12. Goodall, D.W.: A new similarity index based on probability. Biometrics 22(4)
(1966)
13. Grice, P.: Logic and conversation. Syntax and Semantics 3 (1975)
14. Hogan, A., Zimmermann, A., Umbrich, J., Polleres, A., Decker, S.: Scalable and
distributed methods for entity matching, consolidation and disambiguation over
Linked Data corpora. J. Web Sem. 10, 76–110 (2012)
15. Hu, W., Chen, J., Qu, Y.: A self-training approach for resolving object coreference
on the semantic web. In: WWW, pp. 87–96 (2011)
16. Huang, H., Liu, C., Zhou, X.: Computing Relaxed Answers on RDF Databases. In:
Bailey, J., Maier, D., Schewe, K.-D., Thalheim, B., Wang, X.S. (eds.) WISE 2008.
LNCS, vol. 5175, pp. 163–175. Springer, Heidelberg (2008)
17. Hurtado, C.A., Poulovassilis, A., Wood, P.T.: Query Relaxation in RDF. In: Spac-
capietra, S. (ed.) Journal on Data Semantics X. LNCS, vol. 4900, pp. 31–61.
Springer, Heidelberg (2008)
18. Ioannou, E., Papapetrou, O., Skoutas, D., Nejdl, W.: Efficient Semantic-Aware
Detection of Near Duplicate Resources. In: Aroyo, L., Antoniou, G., Hyvönen, E.,
ten Teije, A., Stuckenschmidt, H., Cabral, L., Tudorache, T. (eds.) ESWC 2010,
Part II. LNCS, vol. 6089, pp. 136–150. Springer, Heidelberg (2010)
19. Kiefer, C., Bernstein, A., Stocker, M.: The Fundamentals of iSPARQL: A Virtual
Triple Approach for Similarity-Based Semantic Web Tasks. In: Aberer, K., Choi,
K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., May-
nard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ISWC/ASWC
2007. LNCS, vol. 4825, pp. 295–309. Springer, Heidelberg (2007)
20. Lopes, N., Polleres, A., Straccia, U., Zimmermann, A.: AnQL: SPARQLing Up
Annotated RDFS. In: Patel-Schneider, P.F., Pan, Y., Hitzler, P., Mika, P., Zhang,
L., Pan, J.Z., Horrocks, I., Glimm, B. (eds.) ISWC 2010, Part I. LNCS, vol. 6496,
pp. 518–533. Springer, Heidelberg (2010)
21. Maali, F., Cyganiak, R., Peristeras, V.: Re-using cool URIs: Entity reconciliation
against LOD hubs. In: LDOW (2011)
22. Nikolov, A., Uren, V.S., Motta, E., De Roeck, A.: Integration of Semantically
Annotated Data by the KnoFuss Architecture. In: Gangemi, A., Euzenat, J. (eds.)
EKAW 2008. LNCS (LNAI), vol. 5268, pp. 265–274. Springer, Heidelberg (2008)
23. Noessner, J., Niepert, M., Meilicke, C., Stuckenschmidt, H.: Leveraging Termino-
logical Structure for Object Reconciliation. In: Aroyo, L., Antoniou, G., Hyvönen,
E., ten Teije, A., Stuckenschmidt, H., Cabral, L., Tudorache, T. (eds.) ESWC 2010,
Part II. LNCS, vol. 6089, pp. 334–348. Springer, Heidelberg (2010)
24. Oldakowski, R., Bizer, C.: SemMF: A framework for calculating semantic similarity
of objects represented as RDF graphs. In: ISWC (Poster Proc.) (2005)
25. Poulovassilis, A., Wood, P.T.: Combining Approximation and Relaxation in Se-
mantic Web Path Queries. In: Patel-Schneider, P.F., Pan, Y., Hitzler, P., Mika, P.,
Zhang, L., Pan, J.Z., Horrocks, I., Glimm, B. (eds.) ISWC 2010, Part I. LNCS,
vol. 6496, pp. 631–646. Springer, Heidelberg (2010)
26. Saïs, F., Pernelle, N., Rousset, M.-C.: Combining a Logical and a Numerical
Method for Data Reconciliation. In: Spaccapietra, S. (ed.) Journal on Data Se-
mantics XII. LNCS, vol. 5480, pp. 66–94. Springer, Heidelberg (2009)
27. Schumacher, J., Bergmann, R.: An Efficient Approach to Similarity-Based Re-
trieval on Top of Relational Databases. In: Blanzieri, E., Portinale, L. (eds.)
EWCBR 2000. LNCS (LNAI), vol. 1898, pp. 273–284. Springer, Heidelberg (2000)
28. Stampouli, D., Brown, M., Powell, G.: Fusion of soft information using TBM. In:
13th Int. Conf. on Information Fusion (2010)
29. Stampouli, D., Roberts, M., Powell, G.: Who dunnit? An appraisal of two people
matching techniques. In: 14th Int. Conf. on Information Fusion (2011)
30. Stampouli, D., Vincen, D., Powell, G.: Situation assessment for a centralised intel-
ligence fusion framework for emergency services. In: 12th Int. Conf. on Information
Fusion (2009)
31. Stuckenschmidt, H.: A Semantic Similarity Measure for Ontology-Based Informa-
tion. In: Andreasen, T., Yager, R.R., Bulskov, H., Christiansen, H., Larsen, H.L.
(eds.) FQAS 2009. LNCS, vol. 5822, pp. 406–417. Springer, Heidelberg (2009)
32. Tkalčič, M., Tasič, J.F.: Colour spaces: perceptual, historical and applicational
background. In: IEEE EUROCON, pp. 304–308 (2003)
Learning Driver Preferences of POIs
Using a Semantic Web Knowledge System
1 Introduction
The in-vehicle navigation system tries to minimize driver distraction by enforc-
ing restrictions on the way information is presented to the driver. One such
constraint restricts the number of items that can be displayed in a list on a sin-
gle screen to a fixed number of slots. When the driver searches for banks in the
car, for example, the search results are displayed as a list filled in these slots. If
the number of search results exceeds the number of slots, then the extra results
are pushed to the next page. A more personalized experience can be provided
by showing these results sorted according to the driver’s preferences. Using the
history of Points of Interest (POIs) that the driver visits, we can build a model
of what kind of places he/she (henceforth referred to as he) is more likely to
prefer. For example, the driver may have certain preferences for banks. We want
to understand his bank preferences so that they can be used to personalize the
result the next time he is searching for one. Our system, which uses a Semantic
Web Knowledge System to explore this personalization, is called Semantic User
Preference Engine or Supe. The places that the driver visits are stored in our
Knowledge Base as data that is structured using RDF. We first use a Machine
Learning algorithm to build a preference model of the places that the driver
likes from this history. The next time a driver searches for places of a certain
category, we use this model to re-order the search results according to his/her
estimated affinity.
In this paper, we describe how Supe is used to study driver preferences of POIs,
as follows. First, we discuss our system and how it relies on RDF and Linked
Data to represent information. Then, we describe how the ontology and the
Linked Data about the preferred POIs are collected from multiple sources and
used to build the preference model for the driver. This includes how the Linked
Data is translated into the desired input for the Machine Learning algorithms
that were used. This is followed by a description of how Supe uses the built
preference model to reorder the results of a POI search to match the driver’s
estimated affinity for these places. We then explain the implementation and the
evaluation of our prototype system. We also discuss related work on using Linked
Data and the Semantic Web for modeling user preferences and recommendations.
Lastly, we conclude by summarizing our findings along with future work.
2.1 Overview
The primary objective of Supe is to collect driver preferences and use the pref-
erence model to provide personalized POI search results to the driver. To be
successful in modeling driver preferences, it needs to understand the driver as
well as the POIs. By defining the semantics in a Knowledge Base, we tie the
understanding of place and driver data to an ontology. Supe also provides In-
telligent Services, which use a machine learning engine in the background, for
the presentation of the preferred places to the driver. Fig. 1 depicts the system
overview.
The Semantic Web Knowledge System is at the heart of Supe and resides in
the cloud. It contains a Knowledge Base, Intelligent Services, RESTful endpoints
and access control mechanisms which are described in detail in Section 2.2. Supe
is also connected to multiple devices that help collect POI data for the driver.
Applications running on these devices use the exposed services to search and
look-up places from multiple sources on the web. A place that the driver selects
from the search results is accessible on the navigation system in his vehicle. For
example, the driver might want to ‘send’ a POI from their desktop before getting
into the car. Alternatively, they may select a place from their smart phones
using their personal suite of installed applications and send it to the vehicle to
be used as a destination for the in-vehicle navigation system. A third scenario
finds places that the driver has visited using the history (GPS logs, etc.) of the
in-vehicle head unit. In all three cases, Supe keeps track of the place consumed
(i.e. navigated to, called, etc.) in order to add it to the driver’s preferences. Once
the preference model is built, it can be used to personalize the POI search in
the in-vehicle navigation system. When the navigation system requests a place
search, e.g. a bank, the system first retrieves POIs that match the search criteria,
reorders these results according to the driver’s preferences and returns the list
to the navigation system.
Knowledge Base. The Supe Knowledge Base is based on Semantic Web and
Linked Data standards. It is responsible for describing and storing the driver and
place information. It also allows for using Web Services on-the-fly as a source for
real-time place data in RDF. The Knowledge Base is composed of the following:
2. Linked Data: All instances for place and user data in the Knowledge Base
are identified by URIs belonging to the instance namespace and are repre-
sented in RDF. Internally, the Intelligent Services can access this data as
server objects that can be modified locally using the Jena framework [9].
Any data is also accessible in RDF or JSON [4] to the applications run-
ning on the devices for easy consumption as dereferenceable URIs, following the
Linked Data Design Principles [2]. Locally, the RDF data for each instance
is grouped together into an instance molecule and stored in a database with
its URI as the primary key. We base our grouping of Linked Data triples
into instance molecules on a specialization of the browseable-graph [2] terminol-
ogy, in which triples that have the URI of the instance in the object position of
the triple are absent. Grouping triples into instance molecules gives us
a loosely coupled browseable graph of data without duplicates. The search
look-up is realized using a separate index that uses only the location and
category. Since we are using Jena on the server side, any necessary RDFS
inferences on the molecule can be performed when it is retrieved from the
Knowledge Base. A concept similar to the instance molecule, the RDF
molecule, is also found in Ding et al. [5].
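A minimal sketch of this instance-molecule grouping, using Python and rdflib rather than Jena, is given below; the namespaces, properties and instance URIs are hypothetical.

from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/instance/")    # hypothetical instance namespace
ONT = Namespace("http://example.org/ontology/")   # hypothetical ontology namespace

g = Graph()
bank = EX["bank-of-america-123"]
g.add((bank, ONT.hasName, Literal("Bank of America")))
g.add((bank, ONT.hasRating, Literal(4.0)))
g.add((EX["some-user"], ONT.visited, bank))   # incoming edge: not part of the molecule

def instance_molecule(graph, uri):
    """Group the triples describing an instance: only those with the URI in the subject position."""
    molecule = Graph()
    for triple in graph.triples((uri, None, None)):
        molecule.add(triple)
    return molecule

mol = instance_molecule(g, bank)
print(len(mol))   # 2 -- the incoming 'visited' triple is excluded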
We also create a Linked Data wrapper around multiple sources on the web to
find and retrieve POI instance data. For example, we can access each place
already present on Yelp as Linked Data, as we wrap the Yelp API 1 and ‘lift’
the associated place data into our ontology. We use similar wrappers for
other sources, like Google Places API2 , or data stored locally in an RDBMS
store. Lifting all the data into the same ontology allows us to integrate
multiple sources and also create a richer set of information for modeling
driver preference by merging such data.
1 http://www.yelp.com/developers/getting_started/api_overview
2 http://code.google.com/apis/maps/documentation/places/
Access Control API: Due to the highly personal nature of the data and to
prevent RESTful services, Knowledge Services, Intelligent Services and applica-
tions running on the client devices (collectively referred to as platform services
below) from corrupting other service or instance data, Supe also has an access
control mechanism in place. Users and platform services use an identifier and
a secret passkey combination for authentication. To simplify access control, we
use an approach that restricts access based on namespaces rather than on in-
dividual instances. In order to access instances in the necessary namespace, an
authentication challenge needs to be met successfully. Access control rules for
the different namespaces are described below in brief:
1. Ontology Namespace: All platform services have access to the ontology.
2. Instance Namespace: All platform services have access to the Linked Data
that is not user or service specific.
3. Namespaces for each user: All user specific data, like his home or work
address, etc., that is not owned by any particular platform service belongs to
his separate namespace. Access is granted to all platform services and client
devices that can respond to user authentication challenge.
4. Namespaces for each platform service: Data belonging to a platform
service but not specific to any user (e.g. service configuration, service static
objects, etc.) belongs to this namespace. A platform service cannot access
data in some other service’s namespace.
5. Subspaces for each platform service within a user’s namespace: All
platform services with user specific data save it in a sub-space created for
that service within the user’s namespace. A platform service cannot access
data belonging to a user that it does not have the necessary authentication
for. Alternatively, even if it does have user authentication, it cannot access
another service’s data.
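The following sketch illustrates how such namespace-based rules might be checked; the namespace layout, helper names and rule encoding are assumptions made for illustration and are not Supe’s actual implementation.

ONTOLOGY_NS = "http://example.org/ontology/"
INSTANCE_NS = "http://example.org/instance/"
USERS_NS = "http://example.org/users/"
SERVICES_NS = "http://example.org/services/"

def can_access(uri, service, authenticated_users):
    """Decide whether a platform service may access a URI, given the users it is authenticated for."""
    # Rules 1-2: the ontology and non-user-specific instance data are open to all platform services.
    if uri.startswith(ONTOLOGY_NS) or uri.startswith(INSTANCE_NS):
        return True
    # Rule 4: a service namespace is accessible only to that service.
    if uri.startswith(SERVICES_NS):
        return uri.startswith(SERVICES_NS + service + "/")
    # Rules 3 and 5: user data requires user authentication; a service sub-space within a
    # user's namespace is additionally restricted to the owning service.
    if uri.startswith(USERS_NS):
        for user in authenticated_users:
            user_prefix = USERS_NS + user + "/"
            if uri.startswith(user_prefix):
                rest = uri[len(user_prefix):]
                in_service_subspace = "/" in rest            # e.g. .../users/bob/<service>/...
                return (not in_service_subspace) or rest.startswith(service + "/")
        return False
    return False

# Example: a hypothetical 'poi-history' service reading data in Bob's namespace.
print(can_access("http://example.org/users/bob/poi-history/visit-42", "poi-history", ["bob"]))  # True
print(can_access("http://example.org/users/bob/calendar/event-1", "poi-history", ["bob"]))      # False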
When client applications send data to the in-vehicle navigation system, Supe
tracks the visited/consumed POIs and stores them as driver history. This data
is then used to build a statistical model of the driver’s place preferences by
training a machine learning algorithm. This process is described below. Fig. 4
shows the steps involved in converting POI data about a bank, which the user
selected in the navigation system, to machine learnable data.
Fetching Linked Data for POIs (Lifting): Identifiers of POIs that are
tracked are used to build the preference model. Data for these places is retrieved
from multiple web sources using the POI Search knowledge service. This returns
place data in RDF using the ontology after lifting the web service response. For
example, when the user selects a bank called “Bank of America”, the lifting
process converts the JSON response of the web service (e.g. Yelp API) into an
instance molecule as shown in Fig. 4 (a). Lifting data to a single ontology also
allows us to support multiple sources for the POI data. Different Web Services
like Google Places API have different representations for data (e.g. different
metrics, different category hierarchy, etc.). Search results from these sources can
be integrated into the same instance molecule, using appropriate lifting logic.
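The sketch below illustrates this kind of lifting logic on a simplified, hypothetical Yelp-style JSON response, mapping it into RDF triples of an example ontology with rdflib; the JSON fields, property names and URIs are assumptions for illustration only.

from rdflib import Graph, Literal, Namespace

ONT = Namespace("http://example.org/ontology/")   # hypothetical ontology namespace
EX = Namespace("http://example.org/instance/")    # hypothetical instance namespace

def lift_business(json_obj):
    """Lift a simplified Yelp-style JSON business record into an RDF instance molecule."""
    g = Graph()
    poi = EX[json_obj["id"]]
    g.add((poi, ONT.hasName, Literal(json_obj["name"])))
    g.add((poi, ONT.hasRating, Literal(float(json_obj["rating"]))))
    g.add((poi, ONT.hasCategory, ONT[json_obj["category"].capitalize()]))
    g.add((poi, ONT.hasCity, Literal(json_obj["location"]["city"])))
    return g

response = {   # simplified stand-in for a web-service response
    "id": "bank-of-america-sunnyvale",
    "name": "Bank of America",
    "rating": 4.0,
    "category": "bank",
    "location": {"city": "Sunnyvale"},
}
molecule = lift_business(response)
print(molecule.serialize(format="turtle"))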
a) Business Data JSON Returned by Yelp API d) Bank Table with Training Instances for Classifier
Fig. 4. Building User Preferences (Note: Only subset of the actual data used, is shown.)
Adding Context: A major design consideration for using an ontology for rep-
resentation in Supe is to automatically extrapolate context that would help provide
a richer description of the data. The Context knowledge service is used to add
user and situation context to POI instance molecule using rules programmed in
the service. To add user data, it first loads the data for the driver from the local
Linked Data store. Based on the programmed rules, it then automatically adds
triples for the user context to the place data instance molecule. For example, if
the user has his home or work address stored, then the service is able to add
distance from home or distance from work context. As shown in Fig. 4 (b) and
(c), user context hasDistanceFromWork is added to the Bank of America in-
stance molecule. Another context that can be added is Situation. For example,
information on whether the visit was in the morning or evening, along with the time
spent, can help determine whether the driver accessed an ATM or actually had to
stop by the bank for a longer time.
Conversion of RDF Instance Molecule to Machine Learnable Data:
The instance molecule containing the POI and context data is used to learn
the user’s preferences using a content based approach. This instance needs to
be first converted to a representation that machine learning algorithms can un-
derstand. The translation from Linked Data to machine learning instances is
relatively straightforward. We explain this translation as a conversion of in-
stance molecules into a table containing training data, as used by conventional
machine learning algorithms, below. (Due to lack of space, Fig. 4 (d) only shows
a representative set of columns that can be derived from Fig. 4 (c)). The tables
are implemented as frequency counts for values in each column and persisted in
the object store. The learning algorithm used is described later, in Section 2.4.
1. The Table: All instances belonging to a certain category are grouped to-
gether in a single table. For example, Fig. 4 (d) shows a subset of the table
for banks.
2. Rows in the Table: Each instance to be added to the preference model
translates to a row in the table. The URI can either be dropped from the
training (since row identifiers usually do not contribute to the learnt model)
or can alternatively be added as a new column in the table. We use the latter
approach to bias the model with a higher preference to a previously visited
place 3 .
3. Columns in the Table: The property–value pairs for the instance get trans-
lated as columns and values in the table. The properties for which the type
of the table is a domain, appear as columns. For example, while the has-
DriveThruATM is a property of Banks, the hasName property is inherited
from the parent concept POI and is also present in the table in Fig. 4.
4. Values in the Table: RDF Literal values are translated as one of string,
numeric or nominal values in a column. For example, the translation of the
values for the hasName & hasRating properties from Fig. 4 (c) to columns
in the table in Fig. 4 (d) is trivial. These translation rules can be specified ex-
plicitly in the ontology at construction time. Alternatively, a type detection
mechanism could be used to identify the data types. For values of properties
that are blank nodes, we use nesting of tables. This results in tracking in-
ner frequency counts for the properties. Section 2.4 describes how the inner
nested tables are used to compute preference. For values of properties that
are URIs, we can choose to either (i) use the lexical value of the identifier as
the value of the attribute in the machine learning table or (ii) represent the
instance in a nested table, similar to blank nodes.
5. Class Column in the Table: The class value in the table for the training
instances is marked as either ‘preferred’ or ‘notpreferred’ based on the way
the data is tracked. The POIs in the driver history of visited places are
marked as preferred (e.g. Bank of America is marked as preferred in the
class column in Fig. 4 (d)).
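A minimal sketch of the translation rules above is shown below; it flattens an instance molecule (represented here as a plain Python dict standing in for the RDF) into one training row, keeping the URI as a column and labelling the row with the preferred/notpreferred class. Nested tables for blank nodes and the persisted frequency counts are omitted, and all names are illustrative.

def molecule_to_row(uri, properties, preferred):
    """Translate one instance molecule into a training row for its category table."""
    row = {"uri": uri}                       # rule 2: keep the URI as an extra column
    for prop, value in properties.items():   # rule 3: property-value pairs become columns
        if isinstance(value, dict):          # rule 4: blank nodes would become nested tables
            continue                         #         (omitted in this sketch)
        row[prop] = value                    # literals map to string/numeric/nominal values
    row["class"] = "preferred" if preferred else "notpreferred"   # rule 5
    return row

# Hypothetical instance molecule for a bank the driver visited.
bank = {
    "hasName": "Bank of America",
    "hasRating": 4.0,
    "hasDriveThruATM": True,
    "hasDistanceFromWork": 1.2,
}
print(molecule_to_row("ex:bank-of-america-123", bank, preferred=True))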
Once the preference model contains some POI data, applications requesting a
search of POIs can be presented with personalized results. In our use-case, the
user might want to search for banks around him using the in-vehicle navigation
system, when in a place that he does not know much about. The steps involved
in realizing this are part of the Personalized POI Search Intelligent Service, and
are described below.
Fetching POIs and Adding Context: The personalized POI search service
uses the POI search knowledge service to retrieve places that match the search
criteria from web sources. Necessary user and situation context are added to the
list of instances, which were already lifted into RDF by the knowledge service,
if necessary data is present. These two steps are similar to the steps described
in Section 2.3. The search service then uses the Preference Scoring service to
estimate the user’s affinity to each place.
Scoring Each POI Using the Stored Preference Model: The user’s pref-
erence model is loaded by the Preference Scoring Intelligent Service from the
object store. Each place instance it receives is first converted into a form suit-
able for input to scoring by the machine learning algorithm. For calculating the
preference of a POI to the user, we use a scoring function that computes the
Euclidean distance, in a way similar to the nearest neighbor calculation in un-
supervised classification algorithms. The scores for the POIs are returned to the
search service. It can then sort the list of POIs according to the score and present
the personalized list to the requesting client application.
where D(:x) = √( (1 − P(“Sunnyvale” | preferred))² + (1 − P(“123 Murphy St.” | preferred))² + … )
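Read this way, the scoring can be sketched as follows: for each attribute value of a candidate POI, a conditional probability P(value | preferred) is estimated from the stored frequency counts, and the score is the Euclidean distance of these probabilities from 1 (lower meaning more preferred). The smoothing, attribute names and counts below are assumptions for illustration.

from math import sqrt

def p_value_given_preferred(counts, attribute, value):
    """Estimate P(value | preferred) from frequency counts, with simple add-one smoothing."""
    attr_counts = counts.get(attribute, {})
    total = sum(attr_counts.values())
    return (attr_counts.get(value, 0) + 1) / (total + len(attr_counts) + 1)

def distance_from_unity(counts, poi):
    """Euclidean distance of the per-attribute conditional probabilities from 1."""
    terms = [(1.0 - p_value_given_preferred(counts, attr, val)) ** 2 for attr, val in poi.items()]
    return sqrt(sum(terms))

# Hypothetical frequency counts learnt from the driver's preferred bank visits.
preferred_counts = {
    "hasCity": {"Sunnyvale": 8, "Mountain View": 2},
    "hasDriveThruATM": {True: 9, False: 1},
}
candidate = {"hasCity": "Sunnyvale", "hasDriveThruATM": True}
print(round(distance_from_unity(preferred_counts, candidate), 3))   # lower means more preferred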
a coffee shop in Santa Cruz while driving San Francisco for a break”, etc. We
compared the two scoring functions, with accuracy calculated as follows.
For the list of places returned, sorted according to either scoring mechanism, if
the dummy user’s selection matched the first place in the list, we counted
the preference modeling for that task as a success. Accuracy represents the % of
successful tasks. The results of the experiment are shown below in Fig. 5. These
two scoring mechanisms were selected for (i) checking that the preferences worked
by manually checking ‘under-the-hood’ of the learnt models, and (ii) establishing
a baseline for incorporating more sophisticated scoring techniques in the future.
Fig. 6. Screenshots of the Implemented User Study Application: Search result for banks
(a) in daily commute area (places previously visited are marked) (b) in a new area using
Distance-from-Unity metric
In the absence of such explicit input, the Distance-from-Unity function was able
to better home in on the driver’s preferences than the naïve-Bayes scoring function,
and was therefore chosen to personalize the search results.
4 Related Work
Before we discuss similar work that creates content-based preference models for
users, we briefly describe other work related to parts of our system that have
been explored in the Semantic Web. The POI Search Knowledge Services that we
define are wrappers around existing Web Services and produce RDF as output
similar to Semantic Web Services[10]. They borrow the ‘lifting’ concept from
work in Semantically Annotated WSDL services[7], where the output of the
web service is ‘lifted’ from XML into a semantic model. In our case, we have
programmed the logic for lifting in our different Knowledge Services that may
choose from different sources, like Yelp or Google Places, for place data, since the
domain is fairly constant. But there are other solutions (e.g., Ambite et al. [1])
that can alternatively be used to automatically find alignments between sources
and construct Semantic Web Services. Though access control mechanisms for
the Semantic Web have been explored for a long time, they have only recently
been actively researched for Linked Data. Hollenbach et al. [6] describe a
decentralized approach to access control for Linked Data at the document level.
Wagner et al. [14] describe how policies can be implemented for controlling access
to private Linked Data in the semantic smart grid. Our approach for controlling
access does not include RDF descriptions of authentication and authorization
credentials but is instead based on existing practices for Web Services which
allow for easy integration with client devices. It is also easier to implement.
In the past year, the combination of RDF and machine learning has gained some
traction. The work that is perhaps most similar to our approach of converting
RDF to a machine learning table is found in Lin et al. [8]. For the movie domain,
they try to predict whether a movie receives more than $2M in its opening week by con-
verting the RDF graph for the movie to a Relational Bayesian Classifier. One ma-
with integrating from multiple sources, adding contexts, RDF to machine learn-
ing translations, etc. Optimizing this performance would be a valuable learning
opportunity. Lastly, another direction for future work is using the system to ex-
plore other domains, like music and navigation, for personalization in the vehi-
cle. We believe that Supe has the potential to graduate from an in-use intelligent
system for studying and understanding driver preferences of POIs to a real-world
deployment that can provide a comprehensive personalized experience in the car.
References
1. Ambite, J.L., Darbha, S., Goel, A., Knoblock, C.A., Lerman, K., Parundekar, R.,
Russ, T.: Automatically Constructing Semantic Web Services from Online Sources.
In: Bernstein, A., Karger, D.R., Heath, T., Feigenbaum, L., Maynard, D., Motta,
E., Thirunarayan, K. (eds.) ISWC 2009. LNCS, vol. 5823, pp. 17–32. Springer,
Heidelberg (2009)
2. Berners-Lee, T.: Design issues: Linked data (2006),
http://www.w3.org/DesignIssues/LinkedData.html
3. Bicer, V., Tran, T., Gossen, A.: Relational Kernel Machines for Learning from
Graph-Structured RDF Data. In: Antoniou, G., Grobelnik, M., Simperl, E., Parsia,
B., Plexousakis, D., De Leenheer, P., Pan, J. (eds.) ESWC 2011, Part I. LNCS,
vol. 6643, pp. 47–62. Springer, Heidelberg (2011)
4. Crockford, D.: The application/json media type for javascript object notation, json
(2006), https://tools.ietf.org/html/rfc4627
5. Ding, L., Finin, T., Peng, Y., Da Silva, P., McGuinness, D.: Tracking RDF graph
provenance using RDF molecules. In: Proc. of the 4th International Semantic Web
Conference, Poster (2005)
6. Hollenbach, J., Presbrey, J., Berners-Lee, T.: Using RDF metadata to enable access
control on the social semantic web. In: Workshop on Collaborative Construction,
Management and Linking of Structured Knowledge (2009)
7. Kopecký, J., Vitvar, T., Bournez, C., Farrell, J.: SAWSDL: Semantic Annotations for
WSDL and XML Schema. IEEE Internet Computing, 60–67 (2007)
8. Lin, H.T., Koul, N., Honavar, V.: Learning Relational Bayesian Classifiers from
RDF Data. In: Aroyo, L., Welty, C., Alani, H., Taylor, J., Bernstein, A., Kagal, L.,
Noy, N., Blomqvist, E. (eds.) ISWC 2011, Part I. LNCS, vol. 7031, pp. 389–404.
Springer, Heidelberg (2011)
9. McBride, B.: Jena: A semantic web toolkit. IEEE Internet Computing 6(6), 55–59
(2002)
10. McIlraith, S.A., Son, T.C., Zeng, H.: Semantic web services. IEEE Intelligent Sys-
tems 16(2), 46–53 (2001)
11. Passant, A.: dbrec — Music Recommendations Using DBpedia. In: Patel-Schneider,
P.F., Pan, Y., Hitzler, P., Mika, P., Zhang, L., Pan, J.Z., Horrocks, I., Glimm, B. (eds.)
ISWC 2010, Part II. LNCS, vol. 6497, pp. 209–224. Springer, Heidelberg (2010)
12. Pazzani, M., Billsus, D.: Learning and revising user profiles: The identification of
interesting web sites. Machine learning 27(3), 313–331 (1997)
13. Pazzani, M.J., Billsus, D.: Content-Based Recommendation Systems. In:
Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) Adaptive Web 2007. LNCS, vol. 4321,
pp. 325–341. Springer, Heidelberg (2007)
14. Wagner, A., Speiser, S., Raabe, O., Harth, A.: Linked data for a privacy-aware
smart grid. In: INFORMATIK 2010 Workshop-Informatik für die Energiesysteme
der Zukunft (2010)
An Approach for Named Entity Recognition in Poorly
Structured Data
Abstract. This paper describes an approach for the task of named entity recog-
nition in structured data containing free text as the values of its elements. We
studied the recognition of the entity types of person, location and organization
in bibliographic data sets from a concrete, large-scale digital library initiative. Our ap-
proach is based on conditional random fields models, using features designed to
perform named entity recognition in the absence of strong lexical evidence, and
exploiting the semantic context given by the data structure. The evaluation re-
sults support that, with the specialized features, named entity recognition can be
done in free text within structured data with an acceptable accuracy. Our ap-
proach was able to achieve a maximum precision of 0.91 at 0.55 recall and a
maximum recall of 0.82 at 0.77 precision. The achieved results were always
higher than those obtained with Stanford Named Entity Recognizer, which was
developed for grammatically well-formed text. We believe this level of quality
in named entity recognition allows the use of this approach to support a wide
range of information extraction applications in structured data.
1 Introduction
purpose, but typically it consists in semantically richer data, which follows a struc-
tured data model, and on which more effective computation methods can be applied.
Information resources in digital libraries are usually described, along with their
context, by structured data records. These data, commonly referred to in the
digital library community as metadata, may serve many purposes, the most rele-
vant being resource discovery. Those records often contain unstructured data in natu-
ral language text, which might be useful for judging the relevance of the resource.
The natural hypothesis is that if that information can be represented with finer-grained
semantics, then the quality of the system is expected to improve.
This paper addresses a particular task of information extraction, typically called
named entity recognition (NER), which deals with the textual references to entities,
that is, when they are referred to by means of names occurring in natural language
expressions, instead of structured data. This task deals with the particular problem of
how to locate these references in the data set and how to classify them according to their
entity type [4].
We describe a NER approach, which we studied on the particular case of metadata
from the cultural heritage domain, represented in the generic Dublin Core1 data mod-
el, which typically contains uncontrolled free text in the values of its data elements.
We refer to this kind of data as poorly structured data. Typical examples of such data
elements are the titles, subjects, and publishing information.
NER has been extensively researched in grammatically well-formed text. In poorly
structured data however, the text may not be grammatically well-formed, so our as-
sumption is also that the data structure provides a semantic context which may sup-
port the NER task.
This paper presents an analysis of the NER problem in poorly structured data, de-
scribes a novel NER approach to address this kind of data, and presents an evaluation
of the approach on a real set of data. The paper will follow with an introduction to
NER and related work in Section 2. The proposed approach is presented in Section 3,
and the evaluation procedure and results are presented in Section 4. Section 5 con-
cludes and presents future work.
The NER task refers to locating atomic elements in text and classifying them into
predefined categories such as the names of persons, organizations, locations, expres-
sions of time, quantities, etc. [4].
Initial approaches were based on manually constructed finite state patterns and/or
collections of entity names [4]. However, named entity recognition was soon consid-
ered a typical scenario for the application of machine learning algorithms, because
of the potential availability of many types of evidence, which form the algorithm’s
input variables [5]. Current solutions can reach an F-measure accuracy around 90%
[4] in grammatically well-formed text, thus a near-human performance.
1
http://dublincore.org/
3 Approach
3.1 Analysis
From our analysis of named entities found in structured data sets, we can highlight the
following points:
• Availability of lexical evidence varies in many cases. In some data elements
we found grammatically well-structured text; in other elements we found
short sentences containing very limited lexical evidence, or plain expressions
with practically non-existent lexical evidence. We also observed that, in some
cases, analysis of the same field across several records revealed a mix of all
these cases.
• Instead of lexical evidence, textual patterns are often available in some cases
and could be exploited as evidence for NER. For example,
punctuation marks play an important role, but their use may differ from how
they are used in natural language text.
• These data elements are typically modeled with general semantics. The
semantics associated with each element influences the type of named entities
found in the actual records. Therefore, we observed different probability
distributions for each entity type across data elements.
• One of the major sources of evidence is the actual name of the entities. Each
entity type presents names with different words and lengths, and also with
different degrees of ambiguity with other words and entity types.
From this analysis we believe that a generic approach must be highly adaptable, not
only to the data set under consideration but also to each data element. Text found in
each element across the whole data set is likely to be associated with particular pat-
terns and degrees of available lexical evidence.
On a more generic level, the approach should have a strong focus on the disam-
biguation of the names between the supported entity types, and be able to disambigu-
ate between entity names and other nouns/words.
We studied the three entity types on which most NER research has been focused, and
which are commonly known as enamex [15]: person, location and organization. In
addressing these three entity types, we wanted to design an approach that was not
limited to a set of known entity names, but could recognize any named entity of the
supported entity types, as usually done in NER in grammatically well-structured text.
As mentioned in the previous section, in structured data the characteristics of the
names of persons, organizations, and places are a strong evidence for recognizing the
named entities and determining their entity type. Therefore, in order to allow the pre-
dictive model to use the likelihood of a token being part of a named entity, we have
collected name usage statistics from comprehensive data sets of persons, organiza-
tions and locations.
Person and organization name statistics were extracted from VIAF - Virtual Inter-
national Authority File [16]. VIAF is a joint effort of several national libraries from
all continents towards a consolidated data set gathered for many years about the crea-
tors of the bibliographic resources held at these libraries.
Location name statistics were extracted from Geonames [17], a geographic ontolo-
gy that covers all countries and contains over eight million locations.
A description of how the statistics were extracted, and used in the predictive mod-
el, is presented in Section 3.4.
3.4 Features
Several features were defined to give the predictive model the capability to capture
distinct aspects of the text, such as locating potential names, disambiguating between
entity types and other words, or detecting textual patterns from syntactical and lexical
evidence. This section presents the definition of these features.
A set of features were defined to provide the predictive model with some evidence
for locating potential names of entities in the text. These features were created based
on data or statistics taken from the comprehensive listings of names described in Sec-
tion 3.2. Each entity type has different characteristics in the way entities are named,
so we defined the features in different ways for each entity type.
The features for person names explore how frequently a word was found in person
names, making a distinction between first names, surnames and names that appear in
lowercase. Let F denote a bag built from all first names found in VIAF, let S de-
note a bag built from all surnames found in VIAF, and let C be a bag built from all
names found non-capitalized in VIAF. Writing #B(t) for the number of occurrences
of token t in a bag B, we define three real-valued features, one per bag:

  first-name feature:  log(1 + #F(xi) / Σt∈F #F(t))
  surname feature:     log(1 + #S(xi) / Σt∈S #S(t))
  lowercase feature:   log(1 + #C(xi) / Σt∈C #C(t))
For organizations, only one feature was defined. Let C be a bag built from all words
and punctuation marks found in the names of organizations in VIAF; we define the
following real-valued feature:

  organization feature:  log(1 + #C(xi) / Σt∈C #C(t))
For places, the diversity of the names makes the frequency of use of the words not
effective, so one feature was defined, using the type of geographic entity and the
highest population known for a place in whose name the word appears. Let C de-
note a bag built from all tokens found in the names of continents and countries. Simi-
larly, let D, E, F and G denote bags built from all tokens found in the names of cities,
administrative divisions or islands, natural geographic entities, and other geographic
features, respectively. Also let maxpop(t) denote a function that returns the
maximum population found among locations whose name contains token t. We defined
the following real-valued feature:

  location feature(xi) =
    1,                                  if xi ∈ C
    min(100000, maxpop(xi)) / 100000,   if xi ∈ D
    0.7,                                if xi ∈ E
    0.6,                                if xi ∈ F
    0.1,                                if xi ∈ G
    0,                                  otherwise
Some features are based on data extracted from the WordNet [20] of the language
matching the language of the source text, which in the case we studied was English.
These features provide evidence to disambiguate between named entities of the target
types and other words.
With the aim of disambiguating between proper nouns referring to other entity types
and proper nouns referring to persons, locations and organizations, we define a
binary feature in {0, 1}. Let P denote the set of all variants in synsets which
have a part-of-speech value of proper noun, and let G, H, I, J, K, L denote the sets of
variants in synsets which are hyponyms, either directly or transitively, of one of the
synsets² geographic area#noun#1, landmass#noun#1, district#noun#1, body of wa-
ter#noun#1, organization#noun#5, and person#noun#1, respectively. The feature is
defined as:

  1, if xi ∈ P \ (G ∪ H ∪ I ∪ J ∪ K ∪ L)
  0, otherwise
We also use WordNet to capture the possible parts-of-speech of some tokens. We
defined a binary feature in {0, 1} which indicates whether the token exists in a
synset with part-of-speech noun. Let A denote the set of variants in synsets which
have a part-of-speech value of noun; the feature is defined as:

  1, if xi ∈ A
  0, otherwise

Four similar binary features were defined for other parts-of-speech.
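As a small illustration of such part-of-speech indicator features, the sketch below uses NLTK’s WordNet interface as a stand-in for the WordNet data described here; the feature names are ours.

from nltk.corpus import wordnet as wn   # requires the NLTK WordNet corpus to be installed

def pos_features(token):
    """Binary features indicating whether the token appears in a synset of each part of speech."""
    return {
        "is_noun":      int(bool(wn.synsets(token, pos=wn.NOUN))),
        "is_verb":      int(bool(wn.synsets(token, pos=wn.VERB))),
        "is_adjective": int(bool(wn.synsets(token, pos=wn.ADJ))),
        "is_adverb":    int(bool(wn.synsets(token, pos=wn.ADV))),
    }

print(pos_features("bridge"))   # "bridge" is both a noun and a verb in WordNet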
We also defined features to capture syntactical characteristics of the text and the
tokens. Two binary features indicate whether token xi is at the start or at the end of
the value of the data element. The case of the token is captured through two further
binary features, which indicate whether the token is a word with its first letter in
uppercase, or with all of its letters in uppercase, respectively. The token’s character
length is captured by a numeric feature.
The tokens are also used in a nominal feature over a set T, where T denotes the
set of tokens built from the three preceding tokens and the two following tokens of
every named entity found in the training data; the feature takes the value xi if xi ∈ T,
and a default value otherwise.
² To refer to Princeton WordNet synsets, we use the notation w#p#i, where i corresponds to the
i-th sense of a literal w with part of speech p.
Capitalization statistics of words in the data set are extracted and used in a feature.
Let C denote the bag of capitalized words in the data set, and let D denote the bag of
non-capitalized words in the data set; we define the following real-valued feature:

  log( #C(xi) / (1 + #D(xi)) )
Since typically each data element will have values with different characteristics, a
feature is necessary to capture the data element in which the text is contained. We de-
fined a nominal feature over D, where D denotes the set of data element
identifiers of the data model (for example, in data encoded in XML, these identifiers
consist of the XML element’s namespace and the element’s name).
Additional features are defined in a similar way, but they refer to the three previous
tokens and the two following tokens, instead of the current one.
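To make the name-based and capitalisation features concrete, the following sketch computes them for a single token under the reconstruction given above; the bags of names are tiny illustrative stand-ins for the VIAF, Geonames and data-set statistics, and the exact functional form used in the paper may differ.

from collections import Counter
from math import log

# Tiny illustrative bags; in the paper these are built from VIAF, Geonames and the data set itself.
FIRST_NAMES = Counter({"john": 120, "maria": 95, "nuno": 12})
SURNAMES = Counter({"smith": 150, "freire": 8, "calado": 6})
CAPITALISED = Counter({"Lisbon": 40, "John": 30, "Bridge": 5})
NON_CAPITALISED = Counter({"bridge": 60, "lisbon": 1})

def bag_frequency_feature(bag, token):
    """log(1 + relative frequency of the token in the bag): the name-frequency features."""
    total = sum(bag.values())
    return log(1 + bag[token.lower()] / total) if total else 0.0

def capitalisation_feature(token):
    """log of the ratio of capitalised to (1 + non-capitalised) occurrences in the data set."""
    cap = CAPITALISED[token]
    return log(cap / (1 + NON_CAPITALISED[token.lower()])) if cap else 0.0

token = "John"
print(round(bag_frequency_feature(FIRST_NAMES, token), 4))   # first-name feature for the token
print(round(capitalisation_feature(token), 4))               # capitalisation feature for the token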
4 Evaluation
The evaluation of our approach was performed on data sets from Europeana3,
which consist of descriptions of digital objects of cultural interest. This data set fol-
lows a data model using mainly Dublin Core elements, and named entities appear in
3
http://www.europeana.eu/
data elements for titles, textual descriptions, tables of contents, subjects, authors and
publication.
The data set contains records originating from several European providers from the
cultural sector, such as libraries, museums and archives. Several European languages
are present, even within the description of the same object, for example when the
object being described is of a different language than the one used to create its de-
scription.
The providers from which this data originates follow different practices for describing
the digital objects, which results in highly heterogeneous data. Lexical
evidence is very limited in this data set, so it provides a good scenario for the evalua-
tion of the evidence made available by the structure and textual patterns of the data.
This section describes the experimental setup and its results. It continues with a
description of the data set used for evaluation, and then describes the evaluation proce-
dure. Results of the evaluation are presented afterwards, and the section finishes with
the results of the evaluation of individual features.
Table 1. Data elements studied in the data set and total annotated named entities
4
The data set is available for research use at http://web.ist.utl.pt/~nuno.freire/ner/
5
Element definitions were taken from the Dublin Core Metadata Terms.
The evaluation data set was manually annotated. In very few cases, the manual an-
notation was uncertain, because the data records may not contain enough information
to support a correct annotation. For example, some sentences with named entities
were too short, and no other information was available in the record to support a deci-
sion on the classification of the named entities into their entity type. Named entities
were annotated with their enamex type. If the annotator was unsure of the enamex
type of a named entity, he would annotate it as unknown. These annotations were not
considered for the evaluation of the results, and any recognition made in these entities
was discarded.
4.3 Results
The overall results of the evaluation of all entity types are presented in Fig. 2, and the
results of each entity type are presented in Fig. 1. The results of our approach were
6
The percentage of correctly identified named entities in all named entities found.
7
The percentage of named entities found compared to all existing named entities.
8
The weighted harmonic mean of precision and recall (equal weights for recall and precision).
Fig. 1. Precision, recall and F1 results of the three enamex entity types measured on the evalua-
tion data set
Both Stanford NER and our approach are based on CRFs. Although the implemen-
tation of CRFs used was not the same, and other differences exist in how CRFs are
used, we believe that the difference in the results obtained with both approaches is
due to the different features used, therefore supporting our initial hypothesis that the
semantic context of the data structure, and non-lexical features, could support NER.
An interesting result can be observed at the lowest confidence threshold result for the
entity location, where Stanford NER was actually able to achieve an F1 of 0.68, but the
probability given by the CRF predictive model was close to zero for more than 70%
of the recognized named entities. This observation suggests that the lack of lexical
evidence had a major impact on its results.
Results of both approaches generally showed lower values for recall than for pre-
cision. In our approach, overall recall ranged from 0.55 to 0.82, while overall precision
ranged from 0.77 to 0.91. Given the importance of the features based on the names of
entities, as shown in the next section, we believe that the lower recall is mainly caused
by names that had no presence in the entity names data sets. However, we were not
able to empirically support this conclusion.
Our approach was able to achieve a high precision of 0.91 at 0.55 recall, or reach a
recall of 0.82 at 0.77 precision. We believe these values reached levels high enough to
support a wide range of information extraction applications, which may have different
requirements for recall or precision.
Fig. 2. Precision, recall and F1 results of all entity types measured on the evaluation data set
for the overall results; in the evaluation of the individual entity types, we always
used combinations including these three groups of features, so that the results could
be more easily compared and analyzed.
All features contributed to the best performing combination, for all entity types, in
at least two of the cross-validation folds. The features which were used the least for
the best overall results were nevertheless often used when evaluated on the results of
the individual entity types. Therefore we believe that all features should be used when
applying this approach to other data sets.
We can also observe that the features that detected the names of the entities were
always used in the overall results. In addition, a number of other features were used
very often. This seems to indicate that textual patterns were very relevant for providing
evidence for NER.
In the results of the data element feature, it is worth noting that it was used
only in 10% or 20% of the folds in the overall results for locations and organizations,
but for persons it was used in 60% of the folds. This indicates that the textual patterns
in which persons are referenced were distinct across data elements, while for the other
entity types the patterns were more uniform across data elements. Our analysis
pointed out that, in the data elements for creators and contributors, the names of persons often appeared in inverse order (that is, surname, first_names), while in the other elements they appeared in direct order (that is, first_names surname). We therefore conclude that the semantic context given by the data structure is generally not required to allow the recognition of the entities, but in some cases it can provide important evidence for the predictive model.
We presented an approach for the task of named entity recognition in structured data containing free text as the values of its elements. This approach is based on the extraction of features from the text, which allows the predictive model to operate more independently of lexical evidence than named entity recognition systems developed for grammatically well-formed text.
Our approach was able to achieve a maximum precision of 0.91 at 0.55 recall, and
a maximum recall of 0.82 at 0.77 precision. The achieved results were significantly
higher than those obtained with the baseline. We believe this level of quality in named
entity recognition allows the use of this approach to support a wide range of informa-
tion extraction applications in digital library metadata.
Although we have specifically studied metadata from the cultural heritage sector,
we believe our approach has general applicability to any poorly structured data model.
In future work we will explore the use of ontologies for creating features to im-
prove the recognition of named entities. We will also address the resolution of the
recognized named entities in linked data contexts and ontologies.
References
1. Grimes, S.: Unstructured Data and the 80 Percent Rule: Investigating the 80%. Technical re-
port, Clarabridge Bridgepoints (2008)
2. Shilakes, C., Tylman, J.: Enterprise Information Portals. Merrill Lynch Report (1998)
3. Sarawagi, S.: Information Extraction. Found. Trends Databases 1, 261–377 (2008)
4. Nadeau, D., Sekine, S.: A survey of named entity recognition and classification. Linguisti-
cae Investigationes 30 (2007)
5. McCallum, A., Freitag, D., Pereira, F.: Maximum entropy Markov models for information
extraction and segmentation. In: International Conference on Machine Learning (2000)
6. Martins, B., Borbinha, J., Pedrosa, G., Gil, J., Freire, N.: Geographically-aware informa-
tion retrieval for collections of digitized historical maps. In: 4th ACM Workshop on Geo-
graphical Information Retrieval (2007)
7. Freire, N., Borbinha, J., Calado, P., Martins, B.: A Metadata Geoparsing System for Place
Name Recognition and Resolution in Metadata Records. In: ACM/IEEE Joint Conference
on Digital Libraries (2011)
8. Sporleder, C.: Natural Language Processing for Cultural Heritage Domains. Language and
Linguistics Compass 4(9), 750–768 (2010)
9. King, P., Poulovassilis, A.: Enhancing database technology to better manage and exploit
Partially Structured Data. Technical report, University of London (2000)
10. Williams, D.: Combining Data Integration and Information Extraction. PhD thesis, Uni-
versity of London (2008)
11. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., Aswani, N., Roberts, I., Gor-
rell, G., Funk, A., Roberts, A., Damljanovic, D., Heitz, T., Greenwood, M., Saggion, H.,
Petrak, J., Li, Y., Peters, W.: Text Processing with GATE (Version 6). University of Shef-
field Department of Computer Science (2011) ISBN 978-0956599315
12. Michelson, M., Knoblock, C.: Creating Relational Data from Unstructured and Ungram-
matical Data Sources. Journal of Artificial Intelligence Research 31, 543–590 (2008)
13. Guo, J., Xu, G., Cheng, X., Li, H.: Named Entity Recognition in Query. In: 32nd Annual
ACM SIGIR Conference (2009)
14. Du, J., Zhang, Z., Yan, J., Cui, Y., Chen, Z.: Using Search Session Context for Named
Entity Recognition in Query. In: 33rd Annual ACM SIGIR Conference (2010)
15. Grishman, R., Sundheim, B.: Message Understanding Conference - 6: A Brief History. In:
Proc. International Conference on Computational Linguistics (1996)
16. Bennett, R., Hengel-Dittrich, C., O’Neill, E., Tillett, B.B.: VIAF (Virtual International Au-
thority File): Linking Die Deutsche Bibliothek and Library of Congress Name Authority
Files. In: 72nd IFLA General Conference and Council (2006)
17. Vatant, B., Wick, M.: Geonames Ontology (2006),
http://www.geonames.org/ontology/
18. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional Random Fields: Probabilistic
Models for Segmenting and Labeling Sequence Data. In: Proceedings of the Eighteenth In-
ternational Conference on Machine Learning, pp. 282–289. Morgan Kaufmann Publishers
Inc. (2001)
19. Wallach, H.: Conditional Random Fields: An Introduction. Technical Report MS-CIS-04-
21. Department of Computer and Information Science, University of Pennsylvania (2004),
http://www.cs.umass.edu/~wallach/technical_reports/
wallach04conditional.pdf
20. Miller, G.A., Beckwith, R., Fellbaum, C.D., Gross, D., Miller, K.: WordNet: An online
lexical database. Int. J. Lexicograph. 3(4), 235–244 (1990)
21. McCallum, A.: MALLET: A Machine Learning for Language Toolkit (2002),
http://mallet.cs.umass.edu
22. The Unicode Consortium: Unicode Text Segmentation (2010),
http://www.unicode.org/reports/tr29/
23. Sekine, S., Isahara, H.: IREX: IR and IE Evaluation project in Japanese. In: Proc. Confe-
rence on Language Resources and Evaluation (2000)
24. Michie, D., Spiegelhalter, D.J., Taylor, C.C.: Machine learning, neural and statistical classi-
fication. Prentice Hall, Englewood Cliffs (1994)
25. Goodman, J.: Sequential Conditional Generalized Iterative Scaling. In: Proceedings of the
40th Annual Meeting of the Association for Computational Linguistics, pp. 9–16 (2002)
26. Finkel, J.R., Grenager, T., Manning, C.: Incorporating Non-local Information into Infor-
mation Extraction Systems by Gibbs Sampling. In: 43rd Annual Meeting of the Associa-
tion for Computational Linguistics (2005)
27. Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In: Conf. on Natural Language Learning (2003)
28. Kohavi, R., John, G.: Wrappers for feature selection. Artificial Intelligence 97(1-2), 273–
324 (1997)
Supporting Linked Data Production for Cultural
Heritage Institutes:
The Amsterdam Museum Case Study
Abstract. Within the cultural heritage field, proprietary metadata and vocabu-
laries are being transformed into public Linked Data. These efforts have mostly
been at the level of large-scale aggregators such as Europeana where the origi-
nal data is abstracted to a common format and schema. Although this approach
ensures a level of consistency and interoperability, the richness of the original
data is lost in the process. In this paper, we present a transparent and interactive
methodology for ingesting, converting and linking cultural heritage metadata into
Linked Data. The methodology is designed to maintain the richness and detail of
the original metadata. We introduce the XMLRDF conversion tool and describe
how it is integrated in the ClioPatria semantic web toolkit. The methodology and
the tools have been validated by converting the Amsterdam Museum metadata
to a Linked Data version. In this way, the Amsterdam Museum became the first
‘small’ cultural heritage institution with a node in the Linked Data cloud.
1 Introduction
Cultural heritage institutions such as museums, archives or libraries typically have large
databases of metadata records describing the objects they curate as well as thesauri and
other authority files used for these metadata fields. At the same time, in the Linked Data
cloud a number of general datasets exist, such as GeoNames, VIAF or DBPedia. Im-
porting the individual cultural heritage metadata into the Linked Data cloud and linking
to these general datasets improves its reusability and integration.
While larger cultural heritage institutions such as the German National Library1 or
British National Library2 have the resources to produce their own Linked Data, meta-
data from smaller institutions is currently only being added through large-scale aggre-
gators. A prime example is Europeana, whose goal is to serve as an aggregator for cultural heritage institution data. This is to be achieved through a process of ingesting the metadata records, restructuring them to fit the Europeana Data Model and publishing them
1 http://permalink.gmane.org/gmane.culture.libraries.ngc4lib/7544
2 http://www.bl.uk/bibliographic/datafree.html
as Linked Data on Europeana servers. This approach ensures a level of consistency and interoperability between the datasets from different institutions. The automatic ingestion process, conversion into new data formats and external hosting by the aggregator, however, create the problem of a disconnect between the cultural heritage institute's original metadata and the Linked Data version.
Rather than having Linked Data ingestion done automatically by large aggre-
gators, we present a methodology that is both transparent and interactive. The method-
ology covers data ingestion, conversion, alignment and Linked Data publication. It is
highly modular with clearly recognizable data transformation steps, which can be eval-
uated and adapted based on these evaluations. This design allows the institute’s collec-
tion managers, who are most knowledgeable about their own data, to perform or oversee
the process themselves. We describe a stack of tools that allow collection managers to
produce a Linked Data version of their metadata that maintains the richness of the orig-
inal data including the institute-specific metadata classes and properties. By providing
a mapping to a common schema, interoperability is achieved. This has been previously
investigated by Tordai et al. [9], of which this work is a continuation. We provide a
partial validation of both these tools and the general methodology by using it to convert
the metadata from the Amsterdam Museum to RDF and serving it as Linked Data.
2 Methodology Overview
To convert collection metadata into Linked Data, we here describe the general method-
ology. The input is the original collection metadata as provided by aggregators or indi-
vidual cultural heritage institutions. The result of the workflow process is the collection
metadata in semantic format (RDF). Links are established between vocabulary terms
used in the collections.
Figure 1 shows the general workflow for the conversion and linking of the provided
metadata. The methodology is built on the ClioPatria semantic server [11]. ClioPatria
provides feedback to the user about intermediary or final RDF output in the form of an
RDF browser and by providing statistics on the various RDF graphs. This feedback is
crucial for the intended interactivity. The approach takes the form of a modular work-
flow, supported by two tools. Both XMLRDF and Amalgame are packages of the ClioPatria semantic web toolkit. ClioPatria itself is based on SWI-Prolog, and XMLRDF can therefore use its expressiveness for more complex conversions. ClioPatria and
its packages are available from http://ClioPatria.swi-prolog.org/.
In the first step of this workflow, we ingest the XML into the ClioPatria environment.
This can be either a static XML file with the metadata or the result of an OAI harvesting
operation. We give an example in the case study in Section 6.3. In the second step, the
XML is converted to crude RDF format. This is done using the XMLRDF tool, which
is documented in Section 3. This crude RDF is then rewritten in RDF adhering to the
chosen metadata format, which is done using graph rewrite rules. These rules are exe-
cuted by the XMLRDF tool to produce the final RDF representation of the collection
metadata. An example of ingested XML, crude RDF and rewritten RDF is presented
in Section 6.3 and Figure 3. Next, the user can provide an RDFS metadata schema
which relates the produced classes and properties to the metadata schema of choice.
The XMLRDF tool provides support for this by presenting the user with a schema tem-
plate based on the RDF data loaded. In Step 5, links are established between vocabulary
concepts that are used in the collection metadata and other vocabularies. This is done
using the Amalgame tool, which is documented in van Ossenbruggen et al. [7]. In Sec-
tion 5, we describe Amalgame and its methodology insofar it is part of the more general
conversion strategy.
1. XML ingestion (ClioPatria)
2. Direct transformation to ‘crude’ RDF (XMLRDF)
3. RDF restructuring (XMLRDF)
4. Create a metadata mapping schema (XMLRDF)
5. Align vocabularies with external sources (Amalgame)
6. Publish as Linked Data (ClioPatria)
Fig. 1. General workflow for converting and linking metadata. The figure lists the various steps
and relates them to either the XMLRDF or Amalgame tool or to the ClioPatria server itself.
The RDF data can be served as Linked Open Data using ClioPatria. The server per-
forms HTTP content negotiation. If the HTTP request asks for ‘text/html’, the server
responds with an HTML page showing the (object) metadata in human-readable form.
When ‘application/rdf+xml’ is requested, the server responds by providing the Concise
Bounded Description in RDF triples3 . This adheres to the Linked Data requirements [1].
Additionally, a SPARQL endpoint is provided, where the data can be queried.
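As a concrete illustration of this content negotiation, the same resource URI can be requested with different Accept headers; the sketch below uses Python with the requests library, and the URI is a made-up example rather than an actual Amsterdam Museum address:

import requests

# Hypothetical resource URI served by a ClioPatria Linked Data server
uri = "http://example.org/am/proxy-27659"

# Asking for HTML returns the human-readable object page
html = requests.get(uri, headers={"Accept": "text/html"})

# Asking for RDF/XML returns the Concise Bounded Description of the resource
rdf = requests.get(uri, headers={"Accept": "application/rdf+xml"})

print(html.headers.get("Content-Type"))
print(rdf.headers.get("Content-Type"))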
but for some archival data, the ordering is important and should be explicitly modeled
in the target RDF.
The XMLRDF tool is designed to translate each of these syntactic artifacts into a
proper semantic model where objects and properties are typed and semantically related
to a semantic metamodel such as the widely used SKOS and Dublin Core vocabularies.
It does so in two main steps, shown in Figure 1. First, a syntactic conversion of the
source XML into crude RDF is made. The produced RDF is as close as possible to the
source metadata. This step is described in Section 3.1. The RDF produced in this way
can be re-structured and enriched in the second step, which is in turn subdivided into
multiple sub-steps. We describe this in Section 3.2.
The interactivity in XMLRDF stems from the fact that individual rules, or combi-
nations of rules, can be run independently of each other and the resulting intermediate
RDF can be quickly inspected through the ClioPatria browser/visualization. This allows
the user to write rules, evaluate the result and adapt the rule if necessary.
of Prolog routines for typical conversion tasks. In some cases these will not satisfy,
in which case expertise in programming Prolog becomes necessary. We here give an
overview of the XMLRDF rewriting rules and provide a number of RDF transformation
recipes. There are three types of production rules:
1. Propagation rules add triples.
2. Simplification rules delete triples and add new triples.
3. Simpagation rules are in between: they match triples, delete triples and add triples.
The overall syntax for the three rule-types is (in the order above):
<name>? @@ <triple>* ==> <guard>? , <triple>*.
<name>? @@ <triple>* <=> <guard>? , <triple>*.
<name>? @@ <triple>* \ <triple>* <=> <guard>? , <triple>*.
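To make the behaviour of the three rule types concrete, the following Python sketch mimics only their match/delete/add semantics on a set of (subject, predicate, object) triples; it is an illustration, not the XMLRDF rule language or its implementation, and guards and variable binding are omitted:

def propagation(triples, matched, added):
    """Propagation: if the matched triples are present, add new triples."""
    return triples | added if matched <= triples else triples

def simplification(triples, matched, added):
    """Simplification: delete the matched triples and add new triples."""
    return (triples - matched) | added if matched <= triples else triples

def simpagation(triples, kept, removed, added):
    """Simpagation: keep some matched triples, delete others, add new ones."""
    if (kept | removed) <= triples:
        return (triples - removed) | added
    return triples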
Fixing the Node-Structure. In some cases, we might want to add triples by concate-
nating multiple values. The dimensions rule in Figure 2 is an example of this, where we
add a concatenation of dimension values as an additional triple, using a new predicate.
Since we do not delete the original metadata, some data duplication occurs. In addition
to this, some literal fields need to be rewritten, sometimes to (multiple) new literals and
sometimes to a named or bnode instance.
The simplification rules can be used to map the record-based structure to the desired
structure. An example is the use to altlabel rule in Figure 2, which converts the ISO
term-based thesaurus constructs to the SKOS variant. This rule takes the alternative
term and re-asserts it as the skos:altLabel for the main concept.
Another use for the simplification rules is to delete triples, by having an empty action
part (Prolog ‘true’), as shown by the clean empty rule in Figure 2. This can be used to
delete triples with empty literals or otherwise obsolete triples.
Some blank nodes provide no semantic organization and can be removed, relating their
properties to the parent node. At other places, intermediate instances must be created
(as blank nodes or named instances).
Re-establish Internal Links. The crude RDF often contains literals where it should
have references to other RDF instances. Some properties represent links to other works
in the collection. The property value is typically a literal representing a unique identifier
to the target object such as the collection identifier or a database key. This step replaces
the predicate-value with an actual link to the target resource. The rewrite rules can use
the RDF background knowledge to determine the correct URI.
Re-establish External Links. This step re-establishes links from external resources
such as vocabularies which we know to be used during the annotation. In this step we
only make mappings of which we are absolutely sure; i.e., if there is any ambiguity, we
maintain the value as a blank node created in the previous step. An example of such a
rule is the content person rule shown in Figure 2.
Assign URIs to Blank Nodes Where Applicable. Any blank node we may wish to link
to from the outside world needs to be given a real URI. The record-URIs are typically
created from the collection-identifier. For other blank nodes, we look for distinguishing
(short) literals. The construct {X} can be used on the condition and action side of a rule. If used, there must be exactly one such construct on each side: one for the resource to be deleted and one for the resource to be added. All resources for which the condition matches are renamed. The assign uris rule in Figure 2 is an example of this. The {S} binds the (blank node) identifier to be renamed. The Prolog guard generates a URI (see below) which replaces all occurrences of the resource.
Utility Predicates. The rewriting process is often guided by a guard which is, as al-
ready mentioned, an arbitrary Prolog goal. Because translation of repositories shares
a lot of common tasks, we developed a library for these. An example of such a util-
ity predicate is the literal to id predicate, which generates a URI from a literal by
A) XML sample:
<record priref="27659">
  <title>Koperen etsplaat met portret van Clement de Jonghe</title>
  <maker>Rembrandt (1606-1669)</maker>
  <object.type>etsplaat</object.type>
  <dimension>
    <dimension.value>21</dimension.value>
    <dimension.unit>cm</dimension.unit>
  </dimension>
  <associated.subject></associated.subject>
</record>
B) Crude RDF: an am:Record node with am:priref "27659", am:objectType "etsplaat", a maker literal "Rembrandt (1606-1669)", an empty associated-subject literal, and an am:dimension blank node carrying am:dimensionValue "21" and am:dimensionUnit "cm".
C) Rewritten RDF: the record becomes am:proxy-27659, the dimension blank node gains the rdfs:label "21 cm", and the empty triple is removed.
Fig. 3. Example of the different steps of XMLRDF. A) shows an XML sample snippet describing a single record. B) is the result of the direct conversion to crude RDF. C) shows the graph after the rules from Figure 2 have been applied. In that final graph, the URI for the creator is used instead of the literal; a concatenated dimension label is added to the blank node; the empty ‘associated subject’ triple is removed; and the record has a proxy-based URI.
mapping all characters that are not allowed in a (Turtle) identifier to , as shown in the
assign uris rule in Figure 2.
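A rough Python equivalent of such a utility predicate is sketched below; the underscore replacement character and the base URI are assumptions made for illustration (the actual predicate is a Prolog routine in the XMLRDF library):

import re

def literal_to_id(literal, base="http://example.org/am/"):
    """Mint a URI from a literal by replacing characters that are not
    allowed in a (Turtle) local name; underscore is assumed here."""
    local = re.sub(r"[^0-9A-Za-z]+", "_", literal.strip())
    return base + local.strip("_")

print(literal_to_id("Rembrandt (1606-1669)"))
# http://example.org/am/Rembrandt_1606_1669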
am:contentPersonName rdfs:subPropertyOf dcterms:subject .
am:proxy_22093 am:contentPersonName "Job Cohen" .
Fig. 4. RDF fragment showing how metadata mapping ensures interoperability. The bottom part
of the figure shows an example triple relating an object to the name of a depicted person. Dublin
Core (the metadata standard used in Europeana for object descriptions) only has a single notion
of the subject of a work. By mapping the specific properties to the more general property using rdfs:subPropertyOf in the metadata schema, an application capable of RDFS reasoning can
infer that the object has “Job Cohen” as its subject. We therefore achieve interoperability without
discarding the complexity of the original data.
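The inference described in the caption can be reproduced with an off-the-shelf RDFS reasoner; the sketch below uses rdflib together with the owlrl package, and the am namespace URI is a placeholder rather than the museum's actual one:

from rdflib import Graph, Literal, Namespace, RDFS
from rdflib.namespace import DCTERMS
import owlrl  # RDFS/OWL-RL closure for rdflib graphs

AM = Namespace("http://example.org/am/")  # hypothetical namespace

g = Graph()
# Schema mapping: the museum-specific property specialises dcterms:subject
g.add((AM.contentPersonName, RDFS.subPropertyOf, DCTERMS.subject))
# Instance data: the proxy of an object depicting Job Cohen
g.add((AM["proxy_22093"], AM.contentPersonName, Literal("Job Cohen")))

# Compute the RDFS closure; the generic dcterms:subject triple is now entailed
owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)
print((AM["proxy_22093"], DCTERMS.subject, Literal("Job Cohen")) in g)  # True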
In this section, we describe how we used the above described methodology to convert
the Amsterdam Museum metadata and vocabularies to five-star Linked Data that is com-
patible with the Europeana Data Model (EDM). This includes linking the vocabularies
used in the metadata values to external sources. The input files, the intermediary and
converted RDF, the schema mapping files as well as the alignment strategies are all
available online at http://semanticweb.cs.vu.nl/lod/am/data.html,
where they are listed for each step. We here present an overview.
The Amsterdam Museum4 is a Dutch museum hosting cultural heritage objects related
to Amsterdam and its citizens. Among these objects are paintings, drawings, prints,
glass and silver objects, furniture, books, costumes, etc. all linked to Amsterdam’s his-
tory and modern culture. At any given moment, around 20% of the objects are on dis-
play in the museum’s exhibition rooms, while the rest is stored in storage depots.
As do many museums, the Amsterdam Museum uses a digital data management
system to manage their collection metadata and authority files, in this case the propri-
etary Adlib Museum software5 . As part of the museum’s policy of sharing knowledge,
in 2010, the Amsterdam Museum made their entire collection available online using
a creative commons license. The collection can be browsed through a web-interface6.
Second, for machine consumption, an XML REST API was provided that can be used
to harvest the entire collection’s metadata or retrieve specific results based on search-
terms such as on creator or year. The latter API has been used extensively in multiple
Cultural Heritage-related app-building challenges.
Europeana enables people to explore the digital resources of Europe’s museums, li-
braries, archives and audio-visual collections. Among its goals, Europeana will act as
an aggregator for European Linked Cultural Data. The idea is that the data from indi-
vidual cultural heritage institutions can be integrated by mapping them to a common
metadata model: the Europeana Data Model (EDM) [5].
EDM adheres to the principles of the Web of Data and is defined using RDF. The
model is designed to support the richness of the content providers' metadata but also
enables data enrichment from a range of third party sources. EDM supports multi-
ple providers describing the same object, while clearly showing the provenance of all
the data that links to the digital object. This is achieved by incorporating the proxy-
aggregation mechanism from the Object Re-use and Exchange (ORE) model7. For our
purpose, this means that an Amsterdam Museum metadata record gives rise to both a
4 http://amsterdammuseum.nl
5 http://www.adlibsoft.com/
6 http://collectie.ahm.nl
7 http://www.openarchives.org/ore/
proxy resource as well as an aggregation resource. The RDF triples that make up the ob-
ject metadata (creator, dimensions etc.) have the proxy as their source while the triples
that are used for provenance (data provider, rights etc.) as well as alternate representa-
tion (e.g. digital thumbnails) have the aggregation resource as their source.
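To make the proxy/aggregation split concrete, the sketch below builds both resources for one record with rdflib; the ORE terms are standard, while the record URIs, namespace and the particular properties attached to each resource are illustrative assumptions rather than the exact EDM and Amsterdam Museum vocabulary:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DC, RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")
EX = Namespace("http://example.org/am/")  # hypothetical record namespace

g = Graph()
proxy = EX["proxy-27659"]
aggregation = EX["aggregation-27659"]

# Object metadata (creator, dimensions, ...) is attached to the proxy
g.add((proxy, RDF.type, ORE.Proxy))
g.add((proxy, DC.creator, Literal("Rembrandt (1606-1669)")))

# Provenance and alternate representations are attached to the aggregation
g.add((aggregation, RDF.type, ORE.Aggregation))
g.add((aggregation, DC.rights, Literal("Creative Commons")))

# ORE ties the proxy to its aggregation
g.add((proxy, ORE.proxyIn, aggregation))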
For its actual metadata, the EDM builds on proven metadata standards that are used
throughout the cultural heritage domain. Dublin Core is used to represent object meta-
data and SKOS to represent thesauri and vocabularies. Next to these properties, a limited
number of EDM-specific properties are introduced, including predicates that allow for
event-centric metadata.
Object Metadata. The object metadata consist of metadata records for the 73,447 ob-
jects including creator, dimensions, digital reproductions, related exhibitions etc. This
dataset was converted to 6,301,012 triples in the crude RDF transform. 669,502 dis-
tinct RDF subjects were identified as well as 94 distinct properties. 497,534 of the RDF
objects were identified as literals.
To enrich the crude RDF, a total of 58 XMLRDF rewrite rules were made. Of these
rules, 24 were used to re-establish links to the thesaurus and 5 rules reestablished links
to persons. An additional 4 rules made inter-object relations explicit. 10 rules were
‘clean-up’ rules. The remaining rules include rules that provide URIs to resources, rules that rewrite untyped literals into language-typed RDF literals, rules reifying nested blank nodes and rules combining literal values into one human-readable literal. Examples
of these rules are shown in Figure 2, specifically clean empty, assign uris, title nl,
content person and dimensions. The rules shown there are relatively simple for the
sake of clarity.
The 55 rules that are executed first are not EDM-specific, as they translate the XML
record structure into their RDF equivalent. The three rules that are executed last map
the data to the EDM. These rules explicitly build the aggregation-proxy construct for
each of the records and move the record properties to the appropriate resource (object
metadata to the proxy, provenance data and reproduction info to the aggregation).
In total, executing the rewriting rules resulted in 5,700,371 triples with 100 predi-
cates and 933,891 subjects, of which 566,239 are blank nodes.
We constructed an RDFS mapping file relating the 100 Amsterdam Museum prop-
erties to the EDM properties through the rdfs:subPropertyOf construct. Seven proper-
ties were mapped to EDM-specific properties (ens:hasMet, ens:happenedAt, etc.) and
three properties were defined as subproperties of rdfs:label, the rest of the properties are
defined as subproperties of Dublin Core properties. Two Amsterdam Museum classes
‘am:Exhibition’ and ‘am:Locat’ were defined as rdfs:subClassOf of the EDM class
‘ens:Event’.
Thesaurus Metadata. This dataset consists of 28,000 concepts used in the object meta-
data fields, including geographical terms, motifs, events etc. In the crude RDF transfor-
mation step, this was converted to 601,819 RDF triples about 160,571 distinct subjects,
using 19 distinct properties. 55,780 RDF objects are literals.
Most term-based thesauri, including the AM thesaurus, have a more or less uniform
structure (ISO 25964) for which the standard RDF representation is SKOS. We there-
fore chose to rewrite the AM thesaurus directly to SKOS format. For this purpose, we
constructed 23 rewriting rules. 6 rules establish links by mapping literal values to URIs
resulting in the SKOS object relations skos:broader, skos:narrower and skos:related. 4
rules mapped the original thesaurus’ USE/USEFOR constructs to skos:altLabels (cf.
the use to altlabel in Figure 2), 6 rules were clean-up rules. The remaining rules in-
clude rules that give URIs, give type relations and relate the skos:Concepts to a concept
scheme. In total after the rewrite, 160,701 RDF triples remain, describing 28,127 sub-
jects using 13 properties. Since the conversion already produced most of the SKOS
properties, the RDFS mapping file only contains the (new) skos:ConceptScheme triples
and mappings that relate the Amsterdam Museum notes to skos:notes.
Person Authority File. This dataset contains biographical information on 66,968 per-
sons related to the objects or the metadata itself. This relation includes creators, past
or present owners, depicted persons etc. In the crude RDF transformation step, the per-
son authority file was converted to 301,143 RDF triples about 66,968 distinct subjects,
using 21 distinct properties. 143,760 RDF objects are literals.
Since the crude RDF was already well structured and no additional literal rewriting
or mapping to URIs was required, only 2 rules are needed for the people authority file.
One changes the type of the records to ‘Person’, while the second one gives URIs to the
persons. These minor translations did not change the above statistics.
Since the current version of the EDM does not specify how biographical meta-
data is to be represented, we mapped to properties from the RDA Group 2 metadata
standard8. These properties include given and family names, birth and death dates etc.
As a side note, informed by this conversion, this metadata set is currently considered as
the EDM standard for biographical information. In total 20 rdfs:subPropertyOf relations were defined. The am:Person class was also defined as rdfs:subClassOf of ens:Agent.
that those values exist. Although in this case the name should be available and unique,
there were three unmapped values after this rule was applied (for the remaining 2775
values, the correct URI was found). ClioPatria allows us to identify the erroneous val-
ues quickly. A new rule can be constructed for these, either rewriting them or removing
the unmapped triples. Alternatively, the triples can be maintained, as was done in the
Amsterdam Museum case.
Another example where the method is only partially successful is for two AM prop-
erties am:contentSubject and am:contentPersonName. These relate an object to either
a concept or a person (for example a painting depicting a nobleman and a building).
Dublin Core provides the dcterms:subject property which does not differentiate be-
tween the types. In the schema, we defined the AM properties as subproperties of
dcterms:subject. An application capable of RDFS reasoning can infer that the object
has dcterms:subject both the person and the concept. We therefore achieve some inter-
operability without discarding the complexity of the original data as expressed using
the properties of the ‘am’ namespace.
To illustrate step 5, we aligned the thesaurus and person authority file with a number of
external sources using Amalgame. We report on the final alignment strategies.
Thesaurus. We mapped the thesaurus partly to the Dutch AATNed9 thesaurus and
partly to GeoNames10 . The thesaurus was first split into a geographical and a non-
geographical part consisting of 15851 and 11506 concepts respectively. We then aligned
the Dutch part of the geographic part (953 concepts with a common ancestor “Netherlands”) to the Dutch part of GeoNames using a basic label-match algorithm. This re-
sulted in 143 unambiguous matches. We performed an informal evaluation by manually assessing a random sample of the mappings, which indicated a high quality of the matches (90%+ precision). The AM concepts for which no match was found in-
clude Amsterdam street names or even physical storage locations of art objects, which
are obviously not in GeoNames.
The non-geographic part of AM was aligned with the AATNed using the same basic
label match algorithm. Here, 3820 AM concepts were mapped. We then split the map-
ping in an unambiguous (one source is mapped to one target) and an ambiguous part
(one-to-many or many-to-one). The unambiguous part was evaluated as having a high
precision, the ambiguous mappings could be further disambiguated but still have a good
precision of about 75%. The coverage for the non-geographic part is about 33%.
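The basic label match and the unambiguous/ambiguous split described above can be sketched as follows; this is an illustration of the idea, not Amalgame's implementation, and for simplicity it only detects ambiguity on the source side (one-to-many matches):

from collections import defaultdict

def label_match(source_labels, target_labels):
    """Align two vocabularies by exact match on normalised labels.
    Both arguments map concept URIs to lists of label strings."""
    index = defaultdict(set)
    for uri, labels in target_labels.items():
        for label in labels:
            index[label.strip().lower()].add(uri)

    candidates = {}
    for uri, labels in source_labels.items():
        targets = set()
        for label in labels:
            targets |= index.get(label.strip().lower(), set())
        if targets:
            candidates[uri] = targets

    unambiguous = {s: next(iter(t)) for s, t in candidates.items() if len(t) == 1}
    ambiguous = {s: t for s, t in candidates.items() if len(t) > 1}
    return unambiguous, ambiguous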
Person Authority File. The person authority file was aligned to a subset of DBpedia11
containing persons using only the target skos:prefLabels. This resulted in 34 high qual-
ity mappings. The unmapped concepts were then aligned using the skos:altLabels as
well, and then split in 453 unambiguous and 897 ambiguous matches, with estimated
9 http://www.aat-ned.nl
10 http://www.GeoNames.org
11 http://dbpedia.org
precisions of 25% and 10% respectively. These could also be further filtered by hand.
The people database was also aligned with Getty Union List of Artist Names (ULAN)12 ,
resulting in 1078 unambiguous matches with a high precision (∼100%) and an additional
348 ambiguous matches, with a slightly lower estimated precision. Although ULAN it-
self is not in the Linked Data Cloud, it is incorporated in VIAF, which is in the Linked
Data cloud.
The main reason for the low coverage is that a very large number of AM-persons
are not listed in ULAN as they are relatively unknown local artists, depicted or related
persons, museum employees, or even organizations. Mapping to alternative sources can
increase coverage here.
The Amsterdam Museum data, consisting of the converted datasets, the schema mapping files and the high-quality mapping files, is served as Linked Open Data on the Europeana
Semantic Layer (ESL)13 . The ESL is a running instance of ClioPatria that houses other
datasets that have been mapped to EDM. More information, including how to access or
download the data can be found at http://semanticweb.cs.vu.nl/lod/am.
7 Related Work
In this paper, we presented a methodology for transforming legacy data into RDF for-
mat, restructuring the RDF, establishing links and presenting it as Linked Data. A set
of tools developed at the Free University Berlin provides similar functionalities. Their
D2R server is a tool for publishing relational databases on the Semantic Web, by al-
lowing data providers to construct a wrapper around the database [2]. This makes the
database content browsable for both RDF and HTML browsers as well as queryable by
SPARQL. The R2R tool can be used to restructure the RDF and finally the Silk tool
12 http://www.getty.edu/research/tools/vocabularies/ulan
13 http://semanticweb.cs.vu.nl/europeana
is used to generate links to other data sources. One difference between D2R and our
approach described here is that we assume an XML output of the original data, whereas
D2R acts directly as a wrapper on the data. In the cultural heritage domain, many insti-
tutes already publish their data as XML using the OAI-PMH protocol14 as part of their
normal workflow. This XML can therefore be considered their ’outward image’ of the
internal database and is an ideal starting place for our Linked Data conversion. A second
difference is that we explicitly provide tools for interactive conversion and alignment
of the data. XMLRDF is part of the ClioPatria RDF production platform, allowing for
rapid assessment and evaluation of intermediary RDF.
Other tools that can be used to produce RDF include tools based on XSL transforma-
tions (XSLT). An example of such a tool is the OAI2LOD Server, which also starts from
an OAI-PMH input, converts this to RDF using XSLT and provides RDF-browser and
SPARQL access to the data [4]. Another example is the OAI-PMH RDFizer which is
one of Simile’s RDF transformation tools [8]. Such tools make use of the fact that RDF
can be serialized as XML and do the conversion by restructuring the XML tree. A lot of
cultural heritage institutions have relatively complex data structures and will therefore
need more complex operations in parts of the conversion [6]. Even though XSLT as a
Turing-complete language has the same level of expressivity as Prolog, a number of
common rewriting operations are better supported by Prolog and its rewriting rule lan-
guage. Specifically, the XMLRDF rules can use Prolog and ClioPatria’s RDF reasoning
ability, taking existing triples into account when converting new triples.
14 http://www.openarchives.org/OAI/openarchivesprotocol.html
conversion themselves, to showcase the tools and present a number of re-usable XML-
RDF conversion recipes. In the context of Europeana, the XMLRDF tool has been used
by the authors, as well as by external parties to convert archival, museum and library
data. A number of these converted datasets are presented in the ESL. The Amalgame
tool has also been used by external parties and we are currently in the process of having
alignments done by actual collection managers.
References
1. Berners-Lee, T.: Linked data - design issues (2006),
http://www.w3.org/DesignIssues/LinkedData.html
2. Bizer, C., Cyganiak, R.: D2R Server – Publishing Relational Databases on the Semantic Web
(poster). In: International Semantic Web Conference (2003)
3. Frühwirth, T.: Introducing simplification rules. Tech. Rep. ECRC-LP-63, European
Computer-Industry Research Centre, Munchen, Germany (October 1991); Presented at
the Workshop Logisches Programmieren, Goosen/Berlin, Germany, and the Workshop on
Rewriting and Constraints, Dagstuhl, Germany (October 1991)
4. Haslhofer, B., Schandl, B.: Interweaving OAI-PMH data sources with the Linked Data cloud.
Int. J. Metadata, Semantics and Ontologies 1(5), 17–31 (2010),
http://eprints.cs.univie.ac.at/73/
5. Isaac, A.: Europeana data model primer (2010),
http://version1.europeana.eu/web/
europeana-project/technicaldocuments/
6. Omelayenko, B.: Porting cultural repositories to the semantic web. In: Proceedings of the
First Workshop on Semantic Interoperability in the European Digital Library (SIEDL 2008),
pp. 14–25 (2008)
7. van Ossenbruggen, J., Hildebrand, M., de Boer, V.: Interactive Vocabulary Alignment. In:
Gradmann, S., Borri, F., Meghini, C., Schuldt, H. (eds.) TPDL 2011. LNCS, vol. 6966, pp.
296–307. Springer, Heidelberg (2011)
8. The SIMILE Project: OAI-PMH RDFizer (2012),
http://simile.mit.edu/wiki/OAI-PMH_RDFizer (retrieved December 2011)
9. Tordai, A., Omelayenko, B., Schreiber, G.: Semantic Excavation of the City of Books. In:
Proceedings of the Semantic Authoring, Annotation and Knowledge Markup Workshop
(SAAKM 2007), pp. 39–46. CEUR-WS (2007),
http://ftp.informatik.rwth-aachen.de/
Publications/CEUR-WS/Vol-289/
10. Wielemaker, J., Hildebrand, M., van Ossenbruggen, J., Schreiber, G.: Thesaurus-Based
Search in Large Heterogeneous Collections. In: Sheth, A.P., Staab, S., Dean, M., Paolucci,
M., Maynard, D., Finin, T., Thirunarayan, K. (eds.) ISWC 2008. LNCS, vol. 5318, pp. 695–
708. Springer, Heidelberg (2008),
http://dx.doi.org/10.1007/978-3-540-88564-1_44
11. Wielemaker, J., de Boer, V., Isaac, A., van Ossenbruggen, J., Hildebrand, M., Schreiber,
G., Hennicke, S.: Semantic workflow tool available. EuropeanaConnect Deliverable 1.3.1
(2011),
http://www.europeanaconnect.eu/documents/D1.3.1 eConnect
Workflow automation method implementation v1.0.pdf
Curate and Storyspace: An Ontology and Web-Based
Environment for Describing Curatorial Narratives
1 Introduction
Current museum metadata schemes and content management systems focus on the
description and management of the individual heritage objects that the museum holds
in its collection. An important responsibility for museums, as well as preserving the
collection, is to communicate to the public. One key form of communication is
through the development of curatorial narratives. These curatorial narratives may take
the form of physical museum exhibitions (possibly supplemented by other materials
such as audio guides and booklets) or online presentations. Curatorial narratives
express meaning across a number of heritage objects. The meaning of the narrative
cannot be expressed or derived purely from the metadata of the heritage objects that it
contains. Currently, there is therefore no support for the description and search of
museum narratives based on their meaning rather than the objects that they contain.
This work is being conducted as part of DECIPHER, an EU Framework
Programme 7 project in the area of Digital Libraries and Digital Preservation. A key
aim of DECIPHER is to allow users interactively to assemble, visualize and explore,
not just collections of heritage objects, but the knowledge structures that connect and
give them meaning. As part of this work, the curate ontology has been developed in
order that we can understand and describe the reasoning behind a curatorial narrative.
This can then be used to describe and search narratives based on their meaning rather than just the heritage objects that they contain. The ontology will also be used
to drive computational assistance for the human construction of narratives.
The rest of this paper is structured as follows. The next section describes related
research in the formal description of events and its use in providing navigation across
heritage objects. Section 3 describes the curate ontology and how it can be used to
describe curatorial narratives. It draws on structuralist theories that distinguish the
narrative presentation from the conceptualization of the story and plot. Section 4
describes storyspace, an API and web interface to the ontology. Section 5 describes
the use of storyspace to model curatorial narratives on the conceptual, story level.
Section 6 summarises the findings of a structured interview with two members of
museum curatorial staff who used storyspace to model curatorial narratives over a two-
month period. Section 7 presents conclusions and ongoing work.
2 Related Work
Although there has not been any previous attempt to develop an ontology of curatorial
narrative, some previous research has used metadata to generate or describe
presentations that include multiple heritage objects. These have made use of event-
based ontologies and metadata schemes to conceptually interconnect heritage objects.
Bletchley Park Text [1, 2] uses historical interviews described according to CIDOC
CRM [3] event-based metadata to assemble an online newspaper in response to a
query. Interviews are grouped according to the common people, places and objects
mentioned in their constituent events. Hyvönen et al. [4, 5] used event-based metadata
to assemble further heritage objects around another that acted as a hub or backbone to
the presentation. In one case a movie about the ceramics process was represented as
events and linked to other resources related to concepts (e.g. people, objects) featured
in the events [5]. In the other, events were used to generate links within a poem and to
external resources giving additional information [5].
Wang et al. [6, 7] use content metadata and user preferences to suggest related
heritage objects of interest. van Hage et al. [8] combine this with a real-time routing
system to provide a personalized museum tour guide creating a conceptual path across
a number of heritage objects. The personalized tour guide developed by Lim and
Aylett [9] associated heritage objects with a metadata structure they termed a story
element that comprised events, people, objects, museum location and causal
relationships to other story elements. Recommendations were made based on causal
relationships and shared items contained in story elements.
Finally, van Erp et al. [10] describe a prototype system for event-driven browsing. The
system suggests related heritage objects based on their associated events. By selecting
related heritage objects the user can create a pathway through the heritage objects.
Research related to the interconnection of heritage objects based on event metadata
has made use of a number of ways of formally representing events. CIDOC CRM is
an upper level ontology for the cultural heritage sector [3]. CIDOC CRM affords an
event-based representation of metadata. This provides a way of representing the
changing properties of a heritage object over time (for example the changing
ownership of a painting). Another particular advantage of the CIDOC CRM ontology
is that it facilitates interoperability among museum metadata schemes. Other
approaches to the formal representation of events have been proposed such as LODE
[11] and SEM [12]. These aim to limit ontological commitment in order to broaden
the range of events that can be represented and are not focused specifically on the
heritage domain.
Other work has looked at separating the interpretation of events from the
representation of the events themselves. This simplifies the properties of the event and
allows multiple (possibly conflicting) interpretations of the same event to be modeled,
for example alternative perspectives on the cause-effect relationship between events.
The Descriptions and Situations (DnS) ontology design pattern applied to events [13]
supports this by distinguishing a situation (e.g. two events) from its description (e.g.
cause-effect relationship between them).
Fig. 1. Overview of the curate ontology, relating the Narrative, Story and Plot classes, their components and descriptions (dul:Description, Situation), and the event classes crm:Event, dul:Event and lode:Event. The areas A to E referred to in the text mark its five main components.
Second, we hypothesized that curatorial narratives are not only presentations but the
product of a process of inquiry in which the heritage objects provide a source of
evidence. Narrative inquiry [19] is a methodology in which research can be conducted
by selecting or constructing a story of events, interpreting these by proposing and testing
a plot and then presenting this as a narrative to the research community. Narrative
inquiry can be contrasted with the scientific method as a research methodology. In
narrative inquiry the plot can be thought of as essentially a hypothesis that is tested
against the story, being the data of the experiment. Story, plot and narrative therefore
constitute a process rather than only associated types of description.
These hypotheses, in combination with an iterative design process in participation
with the two museums, led to the construction of the curate ontology1. An overview is
shown in figure 1. CIDOC CRM [3] and DOLCE+DnS Ultralite (DUL)2 are used as
upper level ontologies for curate. There are five main components to the ontology,
indicated by the areas A to E in figure 1. These will be described in the following five
subsections.
Part A of the ontology describes the concepts of story, plot and narrative and how
they are related (see figure 1). A narrative presents both a story and a plot. A story is
interpreted by a plot. A number of plots may be created for the same story. In some
cases a narrative may present a story but have no associated plot. This would indicate
that the narrative is recounting a chronology of events (i.e. a chronicle) but offers no
interpretation of them. Figure 2 shows the relationship between a narrative of Gabriel Metsu and an associated plot and story. The story itself contains events. The events of
a story will be considered further in section 3.3.
Fig. 2. The Gabriel Metsu narrative narrates a plot (narratesPlot) and a story (narratesStory); the plot plots the story (plotsStory); the story contains events such as ‘Gabriel Metsu was born’ (containsEvent)
The relationship between heritage objects and the story, plot and narrative is
illustrated in part B of figure 1. Discussions with museum partners made clear that we
needed to distinguish two types of narrative. A heritage object narrative tells a story
1 http://decipher.open.ac.uk/curate
2 http://ontologydesignpatterns.org/ont/dul/DUL.owl
about a heritage object. A heritage object may have multiple heritage object
narratives. These heritage object narratives may draw on different aspects of the
heritage object such as how the object was created, some insight it gives about the life
of the artist, what is depicted in the heritage object or who has owned it. A curatorial
narrative threads across a number of heritage object narratives. It makes conceptual
relationships across a set of exhibits, yielding more complex insights than could be
made from the exhibits individually.
This approach to modeling has two advantages. First, it allows us to distinguish
alternative stories of the same heritage object. Second it allows us to model, through
the heritage object story, what contribution a heritage object brings to a curatorial
story. The relationship between a heritage object and an event, mediated by the
heritage object story, plays the role of the illustrate property in the LODE ontology
[11] that associates an object with an event. The mediating role of the heritage object
story though allows us to represent through which story the event is associated with
the object.
Figure 3 shows two heritage object stories of the painting “A Woman Reading a
Letter”. One is concerned with how the painting illustrates the brush technique of the
artist. The other is concerned with a more recent incident in which the painting was
stolen and recovered. The story about brush technique is relevant to the curatorial story.
The curatorial story shows how Metsu’s technique changed over time, drawing on a
number of heritage object stories illustrating technique at a particular point in time.
Fig. 3. Modeling curatorial stories, heritage object stories and heritage objects
Fig. 4. Story, Facet, Event (crm:Event) and Event Description elements
The emplotment of a story (i.e. its association with a plot) is represented in part D of
figure 1. The approach taken to modeling plot makes use of the Descriptions and
Situations (DnS) pattern applied to events [13]. The ontology supports the definition of
plot relationships across events, story components or both. For example, a plot
relationship may define that one event causes another. In practice, the relationships
found between events tend to be subtler, for example the specification of an influence
between events. This is not only a feature of curatorial narratives. For example, in an
analysis of novels, Chatman [20] highlights “happenings” that have no cause within the
narrative. Similarly, plot relationships may be specified between story components (e.g.
this area in space and time is more peaceful than another) or between both events and
story components (e.g. this event was pivotal between two areas in space and time).
Figure 5 shows an example in which one event (Metsu drew ‘Sketch of a female
figure’) is classified as being preparatory for another event. As in the DnS ontology
design pattern, a justification can be added in support of the defined relationship. Here,
visual similarity is used to justify the proposed plot relationship.
Fig. 5. A plot relationship between two events: one event is classified as preparatory for another, with visual similarity as the justification
The narrative presentation of a story and plot is represented in part E of figure 1. This
also makes use of the Descriptions and Situations (DnS) pattern to specify structural
relationships between components of the narrative. A curatorial narrative within a
physical museum space may vary considerably from the underlying story due to
different types of physical constraint. First, differences may be due to the fixed
structure of the museum space. For example, the exhibition space at IMMA is made
up of a number of relatively small rooms and interconnecting doors and corridors.
This can result in a story component spanning a number of physical spaces, with the
organization of heritage objects and interpretation panels across those spaces being as
much determined by aesthetic and size constraints as the conceptual organization of
the story.
Fig. 6. Narrative components separated for preservation reasons: the event relation in the narrative description carries a preservation justification
Some differences between story and narrative organization may result from
preservation constraints of the exhibits. For example, pencil sketches need to be
displayed in darker conditions than are used for displaying paintings, and therefore need to be separated in a physical museum space though not in the conceptual space of the
story. Figure 6 represents this example in which a narrative has been broken down in
order to separate components for preservation reasons.
4 Storyspace
Storyspace is an API to the curate ontology and currently a web interface to the story
and heritage object components of the ontology (i.e. sections B and C of figure 1).
The decision was taken to develop a web interface in order that museum participants
in the design process could try to model aspects of curatorial narratives for
themselves, understand the implications of the ontology and provide feedback on both
the ontology and web interface. The web interface was developed using the Drupal
CMS3. In Drupal, pieces of content (which may be rendered as a whole or part of a
web page) are represented as nodes. A Drupal node is of a node type that defines the
content fields of the node. For example, a node type for representing film reviews
may have fields for the film title, a textual review of the film and an integer representing
a rating of the film. Corlosquet et al. [21] drew on the parallel between Drupal content
type, fields and nodes and the classes, properties and individuals of an ontology and
knowledge base. They developed support for Drupal content to be published
semantically according to this mapping.
Building on this idea we developed a set of Drupal content types for representing
the story and heritage object parts of the curate ontology. The content types developed
were: story, event, facet, heritage object, data and reference. The story content type is
used to represent both heritage object and curatorial stories. The reference content
type does not map to the curate ontology and represents a bibliographic source for a
curatorial or heritage object story. This was added at the request of the museums and
can be represented formally using existing bibliographic ontologies such as BIBO4.
The data content type represents additional metadata associated with an individual of
the curate ontology represented in storyspace. For example this is used to represent
additional metadata associated with an event imported into storyspace. An
rdfs:seeAlso property is defined between the entity (e.g. the event) and the additional
metadata.
Storyspace required a slightly more flexible mapping to the ontology than
demonstrated by Corlosquet [21] as, for example, heritage object and curatorial
stories and their components are all represented by the same Drupal content type but
map to different classes in the ontology. A Drupal module was developed in
storyspace to allow the rdf:type of a node to be defined according to any combination
of fields and values of the Drupal node. A formal description of the storyspace
3 http://drupal.org
4 http://bibliontology.com
content is held in a Sesame5 triple store using the ARC2 library6. The ontology API is
tied to the creation, update and deletion functions of the Drupal nodes and can be
triggered programmatically or via the web-based user interface. Also, similar to
Corlosquet et al. [21], appending rdfxml to a Drupal path presents the metadata of the
node. Storyspace also makes use of Simile Exhibit7 to allow the user to visualize
events of the story according to selected facets. The next section describes how
storyspace has been used by curatorial staff at the museums to model curatorial
stories. For this exercise the curatorial staff took the two exhibitions that had been
studied previously (The Moderns and Gabriel Metsu) and modeled the underlying
story of the exhibitions and related resources. The longer term intended use for
storyspace is to model a future exhibition and use it in working toward the narrative
presentation. However, the reverse engineering of the stories from the final narratives
provided a good test case and access to a number of existing resources that could be
managed in storyspace.
Although storyspace is being used to model a number of classes within the curate ontology (e.g. story, heritage object, facet, event, event description), in the interface we chose to emphasise the story and subordinate the other components, presenting them as elements of a story. Curatorial and heritage object stories are therefore represented in the primary menu (the dark band toward the top of the screen in figure 7), and other entities are accessible either through the stories in which they are contained or through the Resources menu.
In figure 7 a curatorial story has been selected entitled “Our collection: IMMA
publication of art packs for children aged six to twelve years old”. In the left hand
menu the Drupal fields (i.e. ontological properties) of the story can be accessed.
These correspond to components of the story, the events it contains, its facets,
references and also Simile Exhibit visualisations that can be used to visualize the
story’s events according to its defined facets.
Figure 7 shows the heritage object stories of a curatorial story developed to communicate an exhibition to schoolchildren. The heritage object stories contain a view of the heritage object comprising its title in bold, a thumbnail and standard collection information. A heritage object may participate in multiple heritage object
stories. Each of these heritage object stories may themselves be used in multiple
curatorial stories.
Figure 8 shows a list of the heritage object stories of a particular heritage object.
This list is accessed by selecting Heritage Objects from the Resources menu, and then
selecting the heritage object, in this case Sounion by Cecil King, and then selecting
the Object stories for that heritage object from the left hand menu.
5 http://www.openrdf.org
6 http://arc.semsol.org
7 http://www.simile-widgets.org/exhibit
Fig. 7. Heritage object stories within a curatorial story aimed at children
When an event is included in a story it can be described according to the facets of the story. In figure 9 the event of Gabriel Metsu painting a self-portrait is being viewed. Values have been specified for the three facets of the story. In storyspace, facets can be defined as accepting manually entered values or mapped to a property. In this second case, object values from the event metadata triples that have the event as a subject for that property are displayed. Simile Exhibit visualisations can be generated for a story from a forms-based interface in which the facets to be visualised, colour-coded and included in the lens are specified. Figure 10 shows a visualisation of events from the Gabriel Metsu story.
Currently, the events of a story can be either entered manually or imported from Freebase8. Text associated with a story can be selected to display a list of events using the Freebase search API. In figure 11, George Bernard Shaw has been selected from the story text (left) and a list of events related to this text string is displayed. These events can be viewed and added into the story. Metadata related to a selected event is imported as a related data node. Facets associated with Freebase properties (e.g. /time/event/start_date) can be used to automatically add facet values to the story.
8 http://www.freebase.com
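A rough sketch of the kind of triples this yields (all property and class names here are hypothetical, since the curate ontology's facet-value encoding is not shown in this excerpt; only the Freebase property name comes from the text above):

@prefix ss:     <http://example.org/storyspace/> .   # hypothetical storyspace namespace
@prefix curate: <http://example.org/curate#> .       # hypothetical curate ontology namespace

# an event imported from Freebase
ss:event-42 a curate:Event ;
    ss:facet_when "1925" .   # hypothetical facet property; the value was filled in automatically
                             # from the Freebase property /time/event/start_date mapped to this facet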
Fig. 11. Adding events from Freebase to a story
6 Structured Interview
A structured interview was conducted with the two members of curatorial staff who had contributed the most to storyspace over a two month period. One was from each of the participating museums. Questions covered the main concepts of curate and how they are navigated and authored in storyspace. Overall they reported finding it easy to navigate storyspace and add the main types of content. This was evidenced by the content that had been added related to the two exhibitions. The modelling of heritage objects, heritage object stories and curatorial stories was found to be relatively straightforward. It was also found to be useful to navigate the different heritage object
stories that had been told related to a heritage object and the curatorial stories in
which they had been included.
The representation of events within a story was found to be more difficult with
contemporary art as found in The Moderns exhibition. For the stories in the Gabriel
Metsu exhibition and even the earlier parts of The Moderns exhibition, the events in
the story were easier to identify. For the more recent works (e.g. from the 1960s or
1970s) the events were less clear. For living artists and in situations where many sources of evidence are not in the public domain, a historical perspective is harder to establish and therefore key events of the story are harder to identify.
For both exhibitions the most problematic concept was the facet. Facets could be used to model various themes of the events but it was not always clear how best to do this.
For example, a single theme facet could be defined with a number of possible values or
it could be broken down into a number of facets each representing a sub-theme.
A further issue discussed in the interviews was alternative nomenclature for some
of the concepts, such as heritage object stories and facets, when used in storyspace by
specific audiences. Although the concepts were found to enable the successful
representation of museum storytelling there are no existing, generally used terms in
museum practice that can be mapped to them. The option of specialized labels in
storyspace for particular museum groups was discussed as a possible extension. This
will be considered further during wider trials with museum staff.
References
1. Collins, T., Mulholland, P., Zdrahal, Z.: Semantic Browsing of Digital Collections. In: Gil,
Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) ISWC 2005. LNCS, vol. 3729, pp.
127–141. Springer, Heidelberg (2005)
2. Mulholland, P., Collins, T., Zdrahal, Z.: Bletchley Park Text. Journal of Interactive Media
in Education (2005), http://jime.open.ac.uk/2005/24
3. Crofts, N., Doerr, M., Gill, T., Stead, S., Stiff, M. (eds.): Definition of the CIDOC
Conceptual Reference Model (2010),
http://www.cidoc-crm.org/official_release_cidoc.html
4. Hyvönen, E., Mäkelä, E., Kauppinen, T., Alm, O., Kurki, J., Ruotsalo, T., Seppälä, K.,
Takala, J., Puputti, K., Kuittinen, H., Viljanen, K., Tuominen, J., Palonen, T., Frosterus,
M., Sinkkilä, R., Paakkarinen, P., Laitio, J., Nyberg, K.: CultureSampo: A National
Publication System of Cultural Heritage on the Semantic Web 2.0. In: Aroyo, L., Traverso,
P., Ciravegna, F., Cimiano, P., Heath, T., Hyvönen, E., Mizoguchi, R., Oren, E., Sabou,
M., Simperl, E. (eds.) ESWC 2009. LNCS, vol. 5554, pp. 851–856. Springer, Heidelberg
(2009)
5. Hyvönen, E., Palonen, T., Takala, J.: Narrative semantic web - Case National Finnish Epic
Kalevala. In: Poster papers, Extended Semantic Web Conference, Heraklion, Greece (2010)
6. Wang, Y., Aroyo, L.M., Stash, N., Rutledge, L.: Interactive User Modeling for
Personalized Access to Museum Collections: The Rijksmuseum Case Study. In: Conati,
C., McCoy, K., Paliouras, G. (eds.) UM 2007. LNCS (LNAI), vol. 4511, pp. 385–389.
Springer, Heidelberg (2007)
7. Wang, Y., Aroyo, L., Stash, N., et al.: Cultivating Personalized Museum Tours Online and
On-site. Interdisciplinary Science Reviews 32(2), 141–156 (2009)
8. van Hage, W.R., Stash, N., Wang, Y., Aroyo, L.: Finding Your Way through the
Rijksmuseum with an Adaptive Mobile Museum Guide. In: Aroyo, L., Antoniou, G.,
Hyvönen, E., ten Teije, A., Stuckenschmidt, H., Cabral, L., Tudorache, T. (eds.) ESWC
2010. LNCS, vol. 6088, pp. 46–59. Springer, Heidelberg (2010)
9. Lim, M.Y., Aylett, R.: Narrative Construction in a Mobile Tour Guide. In: Cavazza, M.,
Donikian, S. (eds.) ICVS-VirtStory 2007. LNCS, vol. 4871, pp. 51–62. Springer,
Heidelberg (2007)
10. van Erp, M., Oomen, J., Segers, R., van den Akker, C., et al.: Automatic Heritage Metadata
Enrichment with Historic Events. In: Museums and the Web, Philadelphia, PA (2011)
11. Shaw, R., Troncy, R., Hardman, L.: LODE: Linking Open Descriptions of Events. In:
Asian Semantic Web Conference, pp. 153–167 (2009)
12. van Hage, W.R., Malaise, V., Segers, R., Hollink, L., Schreiber, G.: Design and use of the
Simple Event Model (SEM). Journal of Web Semantics 9(2) (2011)
13. Scherp, A., Franz, T., Saathoff, C., Staab, S.: F—A Model of Events based on the
Foundational Ontology DOLCE+DnSUltralite. In: International Conference on Knowledge
Capture, pp. 137–144 (2009)
14. Waiboer, A.E.: Gabriel Metsu, Rediscovered Master of the Dutch Golden Age. National
Gallery of Ireland (2010)
15. Arnold, B., Cass, B., Dorgan, T., et al.: The Moderns - The Arts in Ireland from the 1900s
to the 1970s. Irish Museum of Modern Art (2011)
16. Maguire, M.: Museum practices report. Public DECIPHER Deliverable D2.1.1 (2011),
http://www.decipher-research.eu
17. Mulholland, P., Wolff, A., Collins, T., Zdrahal, Z.: An event-based approach to describing
and understanding museum narratives. In: Detection, Representation, and Exploitation of
Events in the Semantic Web. Workshop in Conjunction with the International Semantic
Web Conference (2011)
18. Hazel, P.: Narrative and New Media. In: Narrative in Interactive Learning Environments,
Edinburgh, UK (2008)
19. Polkinghorne, D.: Narrative knowing and the human sciences. State Univ. NY Press (1988)
20. Chatman, S.: Story and Discourse: Narrative structure in fiction and film. Cornell U. (1980)
21. Corlosquet, S., Delbru, R., Clark, T., Polleres, A., Decker, S.: Produce and Consume
Linked Data with Drupal! In: Bernstein, A., Karger, D.R., Heath, T., Feigenbaum, L.,
Maynard, D., Motta, E., Thirunarayan, K. (eds.) ISWC 2009. LNCS, vol. 5823,
pp. 763–778. Springer, Heidelberg (2009)
Bringing Mathematics to the Web of Data: The Case of the Mathematics Subject Classification
Congress Subject Headings (LCSH; cf. sec. 7), as well as domain-specific ones
such as ACM’s Computing Classification System (CCS [23]) and the Physics
and Astronomy Classification Scheme (PACS [17]). The Mathematics Subject
Classification (MSC [14]) is the most common point of reference in mathematics.
It has been used to classify mathematical documents of all types, ranging from
lecture notes to journal articles and books. The MSC is maintained for the
mathematical community by Mathematical Reviews (henceforth abbreviated as
MR) and Zentralblatt Math (henceforth abbreviated as Zbl). The MSC is a three-
layer scheme using alphanumeric codes, for example: 53 is the classification for
Differential Geometry, 53A for Classical Differential Geometry, and 53A45 for
Vector and Tensor Analysis.
The present version MSC2010, released in January 2010, is now in production
use at MR and Zbl. Fixes of simple factual and conceptual errors are still possible,
whereas larger changes will be deferred to the next major revision to guarantee
a period of stability to developers of applications and services.
Current Usage. All major mathematical journals and digital libraries make use
of the MSC. Examples range from the services of the AMS Mathematical Re-
views, online as MathSciNet, and FIZ Karlsruhe’s ZBMATH1 , through almost
all the publishers of mathematics (Elsevier, Springer, etc.), to the arXiv.org
pre-print server and the PlanetMath free encyclopedia [18]. The MSC is mainly
used as a means of structuring mathematics literature in libraries and for the
purposes of retrieving information by topic. For example, a recent analysis of
the PlanetMath server logs2 shows that accesses of PlanetMath’s “browse by
subject” pages3 , whose structure corresponds to the MSC2000, constitute 5 to 6
percent of all accesses of PlanetMath pages. Taking, furthermore, into account
that these pages are much less often linked to from external sites (such as Wikipedia) and change less frequently, one can assume that they are less frequently visited by users coming from a search engine’s results page and thus constitute a significant fraction of PlanetMath’s “intra-site” traffic.
When an author, an editor, or a librarian classifies a publication, he or she
typically identifies the right class(es) by consulting a human-readable version
of the MSC. Web forms for creating a mathematical publication or uploading
an existing one to a digital library typically require manual input of the MSC
classes; the same holds for the search forms of, e.g., MR or Zbl. Assistance is provided neither to authors, who could particularly benefit from an automatic
suggestion of appropriate MSC classes based on the contents of an article, nor
to users searching for articles, who could, e.g., benefit from the ability to select
an MSC class without knowing its alphanumeric code, and from an automatic
suggestion of related classes.
Maintenance and Revision So Far. The master source of the MSC has until
recently been maintained in one plain TEX file, using a set of custom macros
which were developed around 1984 and have had no major changes since then.
1 http://www.ams.org/mathscinet/ and http://www.zentralblatt-math.org/zbmath/
2 Personal communication with Joseph Corneli from PlanetMath, 2011-10-31.
3 http://planetmath.org/browse/objects/
The marked-up source of the example given above looks as follows:
\MajorSub 53-++\SubText Differential geometry
\SeeFor{For differential topology, see \SbjNo 57Rxx.
For foundational questions of differentiable manifolds, see \SbjNo 58Axx}
...
\SecndLvl 53Axx\SubText Classical differential geometry
...
\ThirdLvl 53A45\SubText Vector and tensor analysis
It is obvious that the TEX code is not useful for web-scale machine processing and
linking. A new approach for web applications seems necessary for several reasons.
Specialized subject classification schemes tend to be maintained by only a few ex-
perts in the arcane ways of the art, and the intellectual capital of a classification
scheme does not necessarily become more obvious from merely reimplementing
in a more standard format. However, the additional possibilities for accessing the
scheme and producing different views tailored to specific audiences or purposes,
which standard formats enable, may lead not only to wider adoption but even
to better quality control. This is because more opportunities arise to examine the scheme from new angles, uncovering issues that the few expert maintainers alone may not have identified.
The remainder of this section reviews the recent maintenance of the MSC
implementation and points out problems we encountered. From 2006 to 2009
the MSC2000 version, then in current use, underwent a general revision, done
publicly by the editorial staffs of MR and Zbl. This revision included additions, changes, and corrections of known errors, and resulted in MSC2010. The editors took into consideration comments and suggestions from the mathematical public, of which there were on the order of a thousand, recorded in a MySQL database. The revision was carried out in a standard installation of MediaWiki which was, and still is, viewable by all, but was only editable by about 50 staff members.
Each change from the previous version can be clearly seen (additions in green,
deletions in red, on a yellow background)4 .
Once the intellectual content had been finalized in this process, the new
MSC2010 TEX master file in the format described above had to be produced, as
well as derived and ancillary documents in various formats. These included: a
table of changes, a KWIC index, PDF files for printing, as well as further variant
forms useful to MR and Zbl. Furthermore, a TiddlyWiki edition (a single-user
wiki in one HTML file; cf. http://www.tiddlywiki.com) was provided to enable
users to download a personal copy of the MSC2010, which they could browse and
annotate. The TEX master was obtained from the MediaWiki using a custom
Python script; most of the derived files were constructed from the TEX master
using custom Perl scripts. Obviously, all of these scripts were specific to the cus-
tom MSC TEX format; therefore, it would neither have been possible to reuse
existing scripts from the maintenance of other subject classification schemes, nor
will it be possible to use our scripts for a scheme other than the MSC.
4 see, for example, http://msc2010.org/mscwiki/index.php?title=13-XX
Subject Headings (cf. sec. 7). Moreover, we knew we could rely on existing best-
practice recommendations for modeling classification systems in SKOS, such as [16].
The choice of appropriate URIs for the concepts required some further consideration and is therefore covered separately on p. 771.
Notations (SKOS Core). The 5-character class number (e.g. 53A45) could
be represented as a notation5 (skos:notation), for which, for the purpose of
enabling MSC-specific validation, we implemented our own datatype
mscvocab:MSCNotation.
5 “a string of characters [...] used to uniquely identify a concept within the scope of a given concept scheme [which is] different from a lexical label in that a notation is not normally recognizable as a word or sequence of words in any natural language” [13]
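A minimal sketch of the resulting triples (the msc2010: and mscvocab: prefixes match those used in the label example later in this section, but the namespace IRIs shown here are placeholders, not the dataset's actual ones):

@prefix skos:     <http://www.w3.org/2004/02/skos/core#> .
@prefix msc2010:  <http://example.org/msc2010/> .    # placeholder for the MSC2010 namespace
@prefix mscvocab: <http://example.org/mscvocab#> .   # placeholder for the MSC vocabulary namespace

# the class number becomes a notation typed with the custom, validatable datatype
msc2010:53A45 a skos:Concept ;
    skos:notation "53A45"^^mscvocab:MSCNotation .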
Multilingual Labels (SKOS Core). For each concept, the TEX source provided as further information a descriptive English text (\SubText in the TEX source), which could be represented as a preferred label, except that mathematical content requires separate treatment (see p. 768). Choosing SKOS allowed us to go beyond just representing the information given in the TEX sources.
Independently from the TEX source, several trusted sources had contributed
translations of the descriptive texts to further languages: Chinese, Italian, and
Russian6 . SKOS, thanks to its RDF foundation, not only facilitates handling
accented characters (which also occur in the English-language descriptions) and
non-Latin alphabets, but allows for multilingual labels, for example:
msc2010:53A45 skos:prefLabel "Vector and tensor analysis"@en, "向量与张量分析"@zh .
6 The sources were: Tsinghua University for Chinese, the Russian Academy of Sciences for Russian, and Alberto Marinari for Italian (MSC2000 only).
7 Earlier SKOS versions included such properties in a “SKOS Extensions Vocabulary” (http://www.w3.org/2004/02/skos/extensions/spec/2004-10-18.html), which, however, has not been adopted as a standard so far.
8 The actual MSC code is 53-XX; it is encoded as 53-++ for historical reasons.
9 http://www.w3.org/2011/01/rdf-wg-charter
Linking Across MSC Versions and Other Concept Schemes (SKOS Core). While
our main focus was on implementing the MSC2010 in SKOS, we also applied
our TEX→SKOS translation script (see sect. 4) to the older versions MSC2000
and MSC1991. The MSC2000 in particular is still widely used; therefore, making explicit how closely classes match across MSC versions will aid automated migration of existing digital libraries, or at least assist semi-automatic migration. SKOS offers a set of different properties to express the closeness of
matching across subject classification schemes. Frequently occurring cases in
the MSC include concepts unchanged across versions (⇒ skos:exactMatch),
reclassifications within an area, e.g. 05E40 “Combinatorial aspects of com-
mutative algebra” partly replacing the MSC2000 classes 05E20 and 05E25 (⇒
skos:relatedMatch), and diversification of areas, e.g. within the area 97-XX
“Mathematics education”, which had 49 concepts in 2000 and 160 concepts in
2010 (⇒ skos:broadMatch). While we have so far only used these mapping proper-
ties across MSC versions, SKOS implementations of further subject classification
schemes in related domains are to be expected soon (cf. sec. 8). In this setting,
these properties can be applied analogously.
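For illustration, the first two cases mentioned above might be encoded as follows (a sketch only; the msc2010: and msc2000: prefixes stand in for the actual per-version namespaces, which are not shown here):

@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix msc2010: <http://example.org/msc2010/> .   # placeholder namespaces
@prefix msc2000: <http://example.org/msc2000/> .

# a concept unchanged across versions
msc2010:53A45 skos:exactMatch msc2000:53A45 .

# 05E40 partly replaces the MSC2000 classes 05E20 and 05E25
msc2010:05E40 skos:relatedMatch msc2000:05E20 , msc2000:05E25 .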
In this listing, the DDC concept appears as a blank node; in any case we refrained from assigning URIs in the DDC namespace to it. While the URI scheme for the deeper levels has already been decided upon (with URIs such as http://dewey.info/class/515.63)10, these URIs are not currently dereferenceable.
Collections of Concepts Besides the Main Hierarchy (SKOS Core). Some of the
links within the MSC do not have single classes as their targets, but groups of
classes, which do not have a common superconcept that one could instead link
to. The most frequently used group of such concepts is the group of all subclasses
covering historical works related to an area. In the numeric scheme, these subclasses end in -03; for instance, 53-03 is the class of historical works about differential geometry. We have grouped them as skos:members of a skos:Collection, a semantically weaker notion than skos:Concept – but that choice demands awareness of the fact that SKOS keeps collections and concepts disjoint. Similar groupings include general reference works (-00), instructional expositions (-01), and works on computational methods (-08).
10 http://oclc.org/developer/documentation/dewey-web-services/using-api
msc:HistoricalTopics a skos:Collection ;
skos:prefLabel "Historical topics"@en ;
skos:member msc:01-XX, ..., msc:03-03, ..., msc:97-03 .
URI Syntax. Deploying a Linked Dataset requires thinking about a URI syn-
tax [8]. In the SKOS implementation described so far, the MSC2010 dataset
has around 92,000 triples (in the expanded version; see below); the RDF/XML
serialization is around 7 MB. We expect that typical Linked Data scenarios, such as looking up information about an MSC-classified resource, will require information about only a few MSC classes at a time. Publications in paper-based and digital libraries are typically classified with two MSC classes; in addition to these, the superclasses may be of interest. As such applications should not be burdened with a 7 MB download, a “hash” namespace does not make sense. Conversely, applications that require full access to the MSC, such as annotation services that suggest MSC classes whose labels match a given text (as shown in fig. 1), or browser frontends to digital libraries, would rather benefit from querying a SPARQL endpoint, or their developers would preload them with a downloaded copy of the MSC dataset anyway – a possibility that is independent of the choice of namespace URI. Thus, we chose a “slash” namespace.
skos:prefLabel with mathematical markup for the same MSC class, marked up
as shown in (2), would count as two skos:prefLabels without an (RDF) language
tag. While that would not explicitly violate SKOS integrity condition S14, which
demands that “a resource has no more than one value of skos:prefLabel per lan-
guage tag”, it would contradict convention (1), leaving little hope for tool sup-
port. In 2005, Carroll and Phillips [4] proposed an extension to the RDF semantics that would allow indicating the language of an XML literal, but that idea has never been adopted. Therefore, our current SKOS implementation of
the MSC2010 leaves this problem unsolved for now. Note that separating the
mathematical expressions in labels from the surrounding text would not qualify
as a workaround, as (1) expressions can be scattered over multiple places in a
text sentence, e.g. in the role of an adjective qualifying a noun, and as (2) the
structure and presentation of mathematical expressions may vary depending on
the language – not in the concrete case of the MSC labels but in general.
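To make the conflict concrete, here is a minimal sketch (the msc2010: prefix is a placeholder and the mathematical markup is elided): a plain-text label can carry a language tag, whereas a label typed as rdf:XMLLiteral cannot, so the two labels below would both be preferred labels for the same concept while only one of them is language-tagged.

@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix msc2010: <http://example.org/msc2010/> .   # placeholder namespace

# plain-text label: a language tag is possible
msc2010:53A45 skos:prefLabel "Vector and tensor analysis"@en .

# label with embedded mathematical markup (MathML content elided here);
# a typed XML literal cannot additionally carry a language tag
msc2010:53A45 skos:prefLabel "Vector and tensor analysis ..."^^rdf:XMLLiteral .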
7 Related Work
Besides following the best practices established for SKOSifying the DDC [16],
our work was inspired by the LCSH dataset [22]. Both are comprehensive,
general-purpose classification schemes in contrast to the domain-specific MSC.
The LCSH was converted from a MARCXML representation to SKOS, using
custom scripts similar to ours. Similar to our approach, the authors developed
custom extensions to SKOS (e.g. structured change descriptions similar to Panzer’s and Zeng’s, which we reused), and finally evolved them into the MADS/RDF data model, which can be thought of as a superset of SKOS “designed specifically to
support authority data as used by and needed in the LIS [library and information
science] community and its technology systems” [12]. They also experienced lim-
itations of SKOS, concretely concerning the representation of “pre-coordinated
concepts”, i.e. subject headings combined from other headings. While our Web
frontend to the MSC is in an early stage, the LCSH dataset is served via a comprehensive frontend that offers each record for download in different formats (including full MADS/RDF vs. plain SKOS), a graph visualization, and a form for reporting errors. Limitations of SKOS and the possibility of extending SKOS have also been reported for domain-specific classification schemes; see, e.g., van Assem’s case studies with three different thesauri [1]. In domains closely related to mathematics, we are not aware of completed SKOS implementations, but of work in progress, for example on the ACM CCS [23] for computer science11.
Fig. 1. Annotating the scientific fields of a course on the new AUTH School of Mathematics site, using MSC/SKOS and Drupal 7 semantic mappings (cf. [6])
conceptual modeling approach helped to uncover new issues in the MSC con-
ceptualization. While this paper focuses on preserving all information from the
previous TEX master sources (plus translated labels), we have also, in previous
work [20], identified directions for enhancing the conceptual model by precise
definitions of the MSC classes, adding index terms to classes (for which Panzer
and Zeng provide a SKOS design pattern [16]) and a faceted structure (which
the collections introduced on p. 770 only partly address).
The MSC/SKOS dataset is also one of the first Linked Datasets in mathe-
matics. Our previous work has laid the conceptual and technical foundations for
integrating mathematics into the Web of Data [11]; we believe that the avail-
ability of the central classification scheme of this domain as LOD will encourage
further progress. Deploying the MSC as LOD makes it more easily reusable and
enables classification of smaller resources of mathematical knowledge (e.g. blog
posts, or figures or formulas in larger publications), instead of the traditional ap-
proach of assigning few MSC classes to a whole article. For a closer integration
of mathematical resources with those from related domains, we plan to establish
links from and to the ACM CCS [23], once available in SKOS, and with the
PACS [17], which we expect to reimplement ourselves. As further deployment
targets, we envision the European Digital Math Library [7], whose developers
are starting to work on Linked Data publishing, as well as the PlanetMath en-
cyclopedia, which is being reimplemented using the Planetary social semantic
web portal [10]. With this deployment strategy and the increased ability to clas-
sify fine-grained mathematical resources over the Web, we also believe that the
MSC/SKOS dataset may support a democratization of scientific publishing, and,
by taking away some of the control from the big publishing companies and giving
it back to the authors, encourage the rise of networked science that depends on
collaborative intelligences [15].
References
[1] van Assem, M.F.J.: Converting and Integrating Vocabularies for the Semantic
Web. PhD thesis, Vrije Universiteit Amsterdam (2010),
http://hdl.handle.net/1871/16148
[2] Berners-Lee, T.: Cwm. A general purpose data processor for the semantic web
(2009), http://www.w3.org/2000/10/swap/doc/cwm.html
[3] Mathematical Markup Language (MathML) 3.0. W3C Recommendation (2010),
http://www.w3.org/TR/MathML3
[4] Carroll, J.J., Phillips, A.: Multilingual RDF and OWL. In: Gómez-Pérez, A., Eu-
zenat, J. (eds.) ESWC 2005. LNCS, vol. 3532, pp. 108–122. Springer, Heidelberg
(2005)
[5] Carroll, J.J., et al.: Named Graphs, Provenance and Trust. In: WWW, pp. 613–
622. ACM (2005)
[6] Corlosquet, S., Delbru, R., Clark, T., Polleres, A., Decker, S.: Produce and Con-
sume Linked Data with Drupal! In: Bernstein, A., Karger, D.R., Heath, T., Feigen-
baum, L., Maynard, D., Motta, E., Thirunarayan, K. (eds.) ISWC 2009. LNCS,
vol. 5823, pp. 763–778. Springer, Heidelberg (2009)
[7] EuDML – European Digital Mathematics Library, http://eudml.eu
[8] Heath, T., Bizer, C.: Linked Data: Evolving the Web into a Global Data Space.
Morgan & Claypool (2011), http://linkeddatabook.com
[9] Information technology – Metadata registries (MDR) – Part 1: Framework. Tech.
Rep. 11179-1, ISO/IEC (2004)
[10] Kohlhase, M., et al.: The Planetary System: Web 3.0 & Active Documents for
STEM. Procedia Computer Science 4, 598–607 (2011),
https://svn.mathweb.org/repos/planetary/doc/epc11/paper.pdf
[11] Lange, C.: Enabling Collaboration on Semiformal Mathematical Knowledge by
Semantic Web Integration. Studies on the Semantic Web, vol. 11. IOS Press,
Amsterdam (2011)
[12] MADS/RDF primer. Status: Final Public Review Document (2011),
http://www.loc.gov/standards/mads/rdf/
[13] Miles, A., Bechhofer, S.: SKOS Simple Knowledge Organization System Reference.
W3C Recommendation (2009), http://www.w3.org/TR/skos-reference
[14] MSC 2010 (2010), http://msc2010.org
[15] Nielsen, M.: Reinventing Discovery: The New Era of Networked Science. Princeton
University Press (2011)
[16] Panzer, M., Zeng, M.L.: Modeling Classification Systems in SKOS: Some Chal-
lenges and Best-Practice Recommendations. In: International Conference on
Dublin Core and Metadata Applications (2009),
http://dcpapers.dublincore.org/index.php/pubs/article/view/974/0
[17] Physics and Astronomy Classification Scheme, PACS (2010),
http://aip.org/pacs/
[18] PlanetMath.org, http://planetmath.org
[19] Solomou, G., Papatheodorou, T.: The Use of SKOS Vocabularies in Digital Repos-
itories: The DSpace Case. In: Semantic Computing (ICSC), pp. 542–547. IEEE
(2010)
[20] Sperber, W., Ion, P.: Content analysis and classification in mathematics. In: Clas-
sification & Ontology, Intern. UDC Seminar, pp. 129–144
[21] Summers, E.: Following your nose to the Web of Data. Information Standards
Quarterly 20(1) (2008),
http://inkdroid.org/journal/following-your-nose-to-the-web-of-data
[22] Summers, E., et al.: LCSH, SKOS and Linked Data. In: Dublin Core (2008),
arXiv:0805.2855v3 [cs.DL]
[23] The 1998 ACM Computing Classification System (1998),
http://www.acm.org/about/class/ccs98
A Publishing Pipeline for Linked Government Data
1 Introduction
Open data is an important part of the recent open government movement which
aims towards more openness, transparency and efficiency in government. Govern-
ment data catalogues, such as data.gov and data.gov.uk, constitute a corner
stone in this movement as they serve as central one-stop portals where datasets
can be found and accessed. However, working with this data can still be a chal-
lenge; often it is provided in a haphazard way, driven by practicalities within
the producing government agency, and not by the needs of the information user.
Formats are often inconvenient (e.g. numerical tables as PDFs), there is little
consistency across datasets, and documentation is often poor [6].
Linked Government Data (LGD) [2] is a promising technique to enable more
efficient access to government data. LGD makes the data part of the web where it
can be interlinked to other data that provides documentation, additional context
or necessary background information. However, realizing this potential is costly.
The pioneering LGD efforts in the U.S. and U.K. have shown that creating high-
quality Linked Data from raw data files requires considerable investment into
reverse-engineering, documenting data elements, data clean-up, schema map-
ping, and instance matching [8,16]. When data.gov started publishing RDF,
large numbers of datasets were converted using a simple automatic algorithm,
without much curation effort, which limits the practical value of the resulting
RDF. In the U.K., RDF datasets published around data.gov.uk are carefully
curated and of high quality, but due to limited availability of trained staff and
contractors, only selected high-value datasets have been subjected to the Linked
Data treatment, while most data remains in raw form. In general, the Semantic
Web standards are mature and powerful, but there is still a lack of practical
approaches and patterns for the publishing of government data [16].
In previous work, we presented a contribution towards supporting the pro-
duction of high-quality LGD, the “self-service” approach [6]. It shifts the burden
of Linked Data conversion towards the data consumer. We pursued this work to
refine the self-service approach, fill in the missing pieces and realize the vision
via a working implementation.
Flexibility: the provided solution should not enforce a rigid workflow on the user. Components, tools and models should be independent from each other, yet work well together to fit in a specific workflow adopted by the user.
Decentralization: there should be no requirement to register in a centralized repository, to use a single service or to coordinate with others.
Results sharing: it should be possible to easily share results with others to avoid duplicating work and efforts.
In this paper, we describe how we addressed these requirements through the
“LGD Publishing Pipeline”. Furthermore, we report on a case study in which
the pipeline was applied to publish the content of a local government catalogue
in Ireland as Linked Data.
The contributions of this paper are:
1. An end-to-end publishing pipeline implementing the self-service approach.
The publishing pipeline, centred around Google Refine1 , enables convert-
ing raw data available on government catalogues into interlinked RDF (sec-
tion 2). The pipeline also enables sharing the results along with their prove-
nance description on CKAN.net, a popular open data registry (section 2.5).
2. A formal machine-readable representation of full provenance information
associated with the publishing pipeline. The LGD Publishing Pipeline is
capable of capturing the provenance information, formally representing it
according to the Open Provenance Model Vocabulary (OPMV)2 and sharing
it along with the data on CKAN.net (section 2.5).
3. A case study applying the publishing pipeline to a local government cat-
alogue in Ireland. The resulting RDF, published as linked data as part of
data-gov.ie, is linked to existing data in the LOD cloud. A number of
widely-used vocabularies in the Linked Data community — such as VoiD3 ,
OPMV and Data Cube Vocabulary4 — were utilised in the data represen-
tation. The combination of these vocabularies enriches the data and enables
powerful scenarios (section 3).
Fig. 1. Linked Data publishing pipeline (pertinent tool is shown next to each step)
All the involved functionalities are available through a single workbench which
not only supports transforming raw data into RDF but also enables interlinking the data, and capturing and formally representing all the applied operations (i.e.
provenance information). The steps involved are independent from each other,
yet seamlessly integrated from the user point of view. In the following subsec-
tions, we describe the involved steps outlined in figure 1.
6 http://www.w3.org/egov/wiki/Data_Catalog_Vocabulary
7 Full documentation of Google Refine is available at: http://code.google.com/p/google-refine/wiki/DocumentationForUsers
We developed the RDF Extension for Google Refine8 to enable modelling and
exporting tabular data in RDF format. The conversion of tabular data into
RDF is guided through a template graph defined by the user. The template
graph nodes represent resources, literals or blank nodes while edges are RDF
properties (see figure 3). Node values are either constants or expressions based on the cell contents. Every row in the dataset generates a subgraph according to the template graph, and the whole RDF graph produced is the result of merging the subgraphs of all rows. Expressions that produce errors or evaluate to empty strings
are ignored.
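As a rough illustration of the per-row output (this is not the extension's actual template syntax, which is configured graphically; the column names, namespace and properties below are hypothetical), a template applied to a table with columns "name" and "town" might yield, for one row, a subgraph such as:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/fingal/> .   # hypothetical namespace

# the row (name = "Main Street Library", town = "Swords") expands into one subgraph;
# the full export is the merge of the subgraphs produced by all rows
<http://example.org/fingal/facility/main-street-library>
    a ex:Facility ;
    rdfs:label "Main Street Library" ;
    ex:town "Swords" .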
The main features of the extension are highlighted below (interested readers
are encouraged to check [12]):
2.4 Interlinking
Linking across dataset boundaries turns the Web of Linked Data from a col-
lection of data silos into a global data space [5]. RDF Links are established by
using the same URIs across multiple datasets.
Google Refine supports data reconciliation, i.e. matching a project’s data against some external reference dataset. It comes with built-in support for reconciling data against Freebase. Additional reconciliation services can be added by implementing a standard interface10. We extended Google Refine to reconcile
against any RDF data available through a SPARQL endpoint or as a dump file.
Reconciling against an RDF dataset makes URIs defined in that dataset usable
in the RDF export process. As a result, interlinking is integrated as part of the
publishing pipeline and enabled with a few clicks.
For example, to reconcile country names listed as part of tabular data against DBpedia, all that is needed is to provide Google Refine with the DBpedia SPARQL endpoint URL. The reconciliation capability of the RDF Extension will then match the country names against labels in DBpedia. Restricting matching by type and ad-
jacent properties (i.e. RDF graph neighbourhood) is also supported. In [14] we
provided the full details and evaluated different matching approaches.
10 http://code.google.com/p/google-refine/wiki/ReconciliationServiceApi
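The practical effect of reconciliation on the exported RDF can be sketched as follows (the ex: namespace and property are hypothetical; the DBpedia resource URI is real): instead of exporting the cell value as a plain string, the reconciled cell contributes a shared URI, which is what creates the cross-dataset link.

@prefix ex: <http://example.org/fingal/> .   # hypothetical namespace and property

# before reconciliation: the country is only an opaque string
ex:record-7 ex:country "Ireland" .

# after reconciliation against DBpedia: the same cell yields a shared, linkable URI
ex:record-7 ex:country <http://dbpedia.org/resource/Ireland> .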
2.5 Sharing
The last step in the LGD Publishing Pipeline is sharing the RDF data so that
others can reuse it. However, the authoritative nature of government data in-
creases the importance of sharing a clear description of all the operations applied
to the data. Ideally, provenance information is shared in a machine-readable for-
mat with a well-defined semantics to enable not only human users but also
programs to access the information, process and utilise it.
We developed “CKAN Extension for Google Refine”11 that captures the op-
erations applied to the data, represents them according to the Open Provenance
Model Vocabulary (OPMV) and enables sharing the data and its provenance on
CKAN.net.
OPMV is a lightweight provenance vocabulary based on OPM [18]. It is used
by data.gov.uk to track provenance of data published by the U.K. government.
The core ontology of OPMV can be extended by defining supplementary mod-
ules. We defined an OPMV extension module to describe Google Refine workflow
provenance in a machine-readable format. The extension is based on another
OPMV extension developed by Jeni Tennison12 . It is available and documented
online at its namespace: http://vocab.deri.ie/grefine#
Google Refine logs all the operations applied to the data. It explicitly repre-
sents these operations in JSON and enables extracting and (re)applying them.
The RDF-related operations added to Google Refine are no exception. Both the RDF modelling and reconciling are recorded and saved in the project history. The JSON representation of the history in Google Refine is a full record of the information provenance. The OPMV extension module enables linking together the RDF data, the source data and the Google Refine operation history. Figure 4 shows an example representation of the provenance of RDF data exported using the Google Refine RDF Extension. In the figure, ex:rdf_file is an RDF file derived from ex:csv_file by applying the operations represented in ex:json_history_file.
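In Turtle, the provenance pattern just described might look roughly as follows (a sketch only: the ex: resources mirror those in figure 4, the grefine: prefix stands for the extension module at http://vocab.deri.ie/grefine#, and the modelling of the Refine run and history file as opmv:Process and opmv:used is an assumption, not a quotation of figure 4):

@prefix opmv:    <http://purl.org/net/opmv/ns#> .
@prefix grefine: <http://vocab.deri.ie/grefine#> .   # extension module namespace (see above)
@prefix ex:      <http://example.org/> .             # hypothetical resource namespace

# the exported RDF file was derived from the source CSV file ...
ex:rdf_file a opmv:Artifact ;
    opmv:wasDerivedFrom ex:csv_file ;
    opmv:wasGeneratedBy ex:refine_run .

# ... by a Google Refine run whose applied operations are recorded in the JSON history
ex:refine_run a opmv:Process ;
    opmv:used ex:csv_file , ex:json_history_file .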
Lastly, we enabled sharing the data on CKAN.net from within Google Re-
fine with a few clicks. CKAN.net is an “open data hub” i.e. a registry where
people can publicly share datasets by registering them along with their meta-
data and access information. CKAN.net can be seen as a platform for crowd-
sourcing a comprehensive list of available datasets. It enjoys an active community
that is constantly improving and maintaining dataset descriptions. CKAN Stor-
age13 , a recent extension of CKAN, allows files to be uploaded to and hosted by
CKAN.net.
A typical workflow for a CKAN contributor who wants to share the results
of transforming data into RDF using Google Refine might be: (i) exporting the data from Google Refine in CSV and in RDF; (ii) extracting and saving the Google Refine operation history; (iii) preparing the provenance description; (iv) uploading the files to CKAN Storage and keeping track of the files’ URLs; (v)
11 http://lab.linkeddata.deri.ie/2011/grefine-ckan
12 http://purl.org/net/opmv/types/google-refine#
13 http://ckan.org/2011/05/16/storage-extension-for-ckan/
The catalogue provides a fairly rich description of its datasets. Each dataset is categorized under one or more domains and described with a number of tags.
Additionally, metadata describing spatial and temporal coverage, publisher and
date of last update are also provided. Table 1 shows a quick summary of Fingal
Catalogue at the time of writing.
Refine with RDF Extension we converted the CSV data into RDF data adhering
to the Dcat model.
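A sketch of what a catalogue entry might look like under the Dcat model (illustrative values; the fingal: namespace, the URL and the exact property set are assumptions, though dcat:distribution, dcat:accessURL and dct:format also appear in Listing 1.1 below, and the dcat namespace shown is the W3C one rather than necessarily the binding used in the paper):

@prefix dcat:   <http://www.w3.org/ns/dcat#> .
@prefix dct:    <http://purl.org/dc/terms/> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix fingal: <http://example.org/fingal/> .   # hypothetical namespace for catalogue entries

fingal:dataset-population a dcat:Dataset ;
    dct:title "Population by electoral division" ;
    dcat:keyword "census" , "population" ;
    dcat:distribution [
        a dcat:Distribution ;
        dcat:accessURL <http://example.org/population.csv> ;   # illustrative URL
        dct:format [ rdfs:label "text/csv" ]
    ] .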
Most catalogues organize their datasets by classifying them under one or more domains [13]. Dcat recommends using some standardised scheme for classification so that datasets from multiple catalogues can be related to each other. We used the Integrated Public Sector Vocabulary (IPSV) available from the UK government. An RDF representation of IPSV (which uses SKOS) is made available by the esd-toolkit as a dump file18. We used this file to define a reconciliation service in Google
Refine and reconcile Fingal Catalogue domains against it.
Google Refine capabilities were very helpful with data cleaning. For example,
Google Refine Expression Language (GREL) was intensively used to properly
format dates and numbers to adhere to XML Schema datatype syntax.
3.3 Interlinking
Electoral divisions are prevalent in the catalogue datasets especially those con-
taining statistical information. There are no URIs defined for these electoral
divisions, so we had to define new ones under data-gov.ie. We converted an
authoritative list of electoral divisions available from Fingal County Council
into RDF. The result was used to define a reconciliation service using Google
Refine RDF Extension. This means that in each dataset containing electoral di-
visions, moving from textual names of the divisions to the URIs crafted under
data-gov.ie is only a few clicks away. A similar reconciliation was applied for
councillor names. It is worth mentioning that names were sometimes spelled in
different ways across datasets. For instance, Matt vs. Mathew and Robbie vs.
Robert. Reconciling to URIs eliminates such mismatches.
RDF Extension for Google Refine also enabled reconciling councillor names
against DBpedia and electoral divisions against Geonames.
3.4 RDF-izing
Google Refine clustering and facets were effective in giving a general understanding of the data. This is essential to anticipate and decide on appropriate RDF models for the data. Since most of the datasets in the catalogue contain statistical information, we decided to use the Data Cube Vocabulary to represent this data. The Data Cube model is compatible with SDMX – an ISO standard for sharing and exchanging statistical data and metadata. It extends SCOVO [10] with the ability to explicitly describe the structure of the data and distinguishes between dimensions, attributes and measures. Whenever applicable, we also used terms from SDMX extensions19 which augment the Data Cube Vocabulary by defining URIs for common dimensions, attributes and measures.
18 http://doc.esd.org.uk/IPSV/2.00.html
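A minimal sketch of a resulting observation (the fingal: identifiers, the data-gov.ie URI pattern and the value are illustrative, and the choice of sdmx-dimension:refArea and sdmx-measure:obsValue is an assumption about which SDMX terms would apply):

@prefix qb:             <http://purl.org/linked-data/cube#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix sdmx-measure:   <http://purl.org/linked-data/sdmx/2009/measure#> .
@prefix fingal:         <http://example.org/fingal/> .   # hypothetical namespace

# one statistical fact: a value observed for one electoral division in one dataset
fingal:obs-howth-1 a qb:Observation ;
    qb:dataSet fingal:population-dataset ;
    sdmx-dimension:refArea <http://data-gov.ie/electoral-division/howth> ;   # illustrative URI pattern
    sdmx-measure:obsValue 12345 .   # illustrative value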
For other datasets, we reused existing vocabularies whenever possible and
defined small domain ontologies otherwise. We deployed new custom terms on-
line using vocab.deri.ie, a web-based vocabulary management tool facilitating vocabulary creation and deployment. As a result, all new terms
are documented and dereferenceable. Newly defined terms can be checked at
http://vocab.deri.ie/fingal#.
3.5 Sharing
With the CKAN Extension, each RDF dataset published is linked to its source
file and annotated with provenance information using the OPMV extension. By
linking the RDF data to its source and to Google Refine operations history,
a determined user is able to examine and (automatically) reproduce all these
operations starting from the original data and ending with an exact copy of the
published converted data.
In total, 60 datasets were published in RDF resulting in about 300K triples20
(a list of all datasets that were converted and the vocabularies used is available
in [12]). By utilising reconciliation, the published RDF data used the same URIs
for common entities (i.e. no URI aliases) and were linked to DBpedia and Geon-
ames. Based on our previous experience in converting legacy data into RDF,
we found that the pipeline significantly lowers the required time and effort. It also helps reduce errors usually introduced inadvertently when using manual conversion or custom scripts. However, issues related to URI construction,
RDF data modelling and vocabulary selection are not supported and need to be
tackled based on previous experience or external services.
The RDF data were then loaded into a SPARQL endpoint. We used Fuseki to run the endpoint. We used the Linked Data Pages framework21 to make the
data available in RDF and HTML based on content negotiation22. Resolving the URI of an electoral division, such as the one for the city of Howth, gives
all the facts about Howth which were previously scattered across multiple CSV
files.
The combination of Dcat, VoiD and Data Cube vocabularies helped provide a fine-grained description of the datasets and each data item. Figure 6 shows
how these vocabularies were used together. Listing 1.1 shows a SPARQL query that, given the URI of a data item (a.k.a. a fact), locates the source CSV file from which the fact was extracted. This query enables a user who doubts a particular fact in the RDF data to download the original authoritative CSV file in which the fact was originally stated.
19 http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/vocab/
20 The conversion required approximately two weeks effort of one of the authors.
21 https://github.com/csarven/linked-data-pages
22 The data is available online as part of http://data-gov.ie
Fig. 6. The combination of Dcat, VoiD and Data Cube vocabularies to describe Fingal
data
Listing 1.1. Getting the source CSV file for a particular fact (given as ex:obs)
SELECT ?dcat_ds ?csv_file
WHERE {
  ex:obs qb:dataSet ?qb_ds .
  ?qb_ds dct:source ?dcat_ds .
  ?dcat_ds dcat:distribution ?dist .
  ?dist dcat:accessURL ?csv_file ;
        dct:format ?f .
  ?f rdfs:label 'text/csv' .
}
Thanks to RDF’s flexibility, the data can now also be organised and sliced in ways not possible with the previous rigid table formats.
4 Related Work
A number of tools for converting tabular data into RDF exist, most notably
XLWrap [11] and RDF123 [9]. Both support rich conversion and full control
over the shape of the produced RDF data. These tools focus only on the RDF
conversion and do not support a full publishing process. Nevertheless, they can
be integrated in a bigger publishing framework. Both RDF123 and XLWrap
use RDF to describe the conversion process without providing a graphical user interface, which makes them difficult to use for non-expert users.
Methodological guidelines for publishing Linked Government Data are pre-
sented in [17]. Similar to our work, a set of tools and guidelines were recom-
mended. However, the tools described are not integrated into a single workbench
and do not incorporate provenance description. The data-gov Wiki23 adopts a
wiki-based approach to enhance automatically-generated LGD. Their work and
ours both tackle LGD creation with a crowd-sourcing approach, though in significantly different ways.
23 http://data-gov.tw.rpi.edu/wiki
Acknowledgments. The work presented in this paper has been funded in part
by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and
the European Union under Grant No. 238900 (Rural Inclusion).
References
1. Alani, H., Dupplaw, D., Sheridan, J., O’Hara, K., Darlington, J., Shadbolt, N.,
Tullo, C.: Unlocking the Potential of Public Sector Information with Semantic
Web Technology. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I.,
Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G.,
Cudré-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 708–
721. Springer, Heidelberg (2007)
2. Berners-Lee, T.: Putting Government Data Online. WWW Design Issues (2009)
3. Berrueta, D., Phipps, J.: Best Practice Recipes for Publishing RDF Vocabularies.
World Wide Web Consortium, Note (August 2008)
4. Bizer, C., Cyganiak, R., Heath, T.: How to Publish Linked Data on the Web. Web
page (2007) (revised 2008)
5. Bizer, C., Heath, T., Berners-Lee, T.: Linked Data - The Story So Far. International
Journal on Semantic Web and Information Systems (IJSWIS) (2009)
6. Cyganiak, R., Maali, F., Peristeras, V.: Self-service Linked Government Data with
dcat and Gridworks. In: Proceedings of the 6th International Conference on Se-
mantic Systems, I-SEMANTICS 2010. ACM (2010)
7. de León, A., Saquicela, V., Vilches, L.M., Villazón-Terrazas, B., Priyatna, F., Cor-
cho, O.: Geographical Linked Data: a Spanish Use Case. In: Proceedings of the 6th
International Conference on Semantic Systems, I-SEMANTICS 2010. ACM (2010)
8. Ding, L., Lebo, T., Erickson, J.S., DiFranzo, D., Williams, G.T., Li, X., Michaelis,
J., Graves, A., Zheng, J.G., Shangguan, Z., Flores, J., McGuinness, D.L., Hendler,
J.: TWC LOGD: A Portal for Linked Open Government Data Ecosystems. Web
Semantics: Science, Services and Agents on the World Wide Web (2011)
9. Han, L., Finin, T., Parr, C., Sachs, J., Joshi, A.: RDF123: From Spreadsheets
to RDF. In: Sheth, A.P., Staab, S., Dean, M., Paolucci, M., Maynard, D., Finin,
T., Thirunarayan, K. (eds.) ISWC 2008. LNCS, vol. 5318, pp. 451–466. Springer,
Heidelberg (2008)
10. Hausenblas, M., Halb, W., Raimond, Y., Feigenbaum, L., Ayers, D.: SCOVO: Using
Statistics on the Web of Data. In: Aroyo, L., Traverso, P., Ciravegna, F., Cimiano,
P., Heath, T., Hyvönen, E., Mizoguchi, R., Oren, E., Sabou, M., Simperl, E. (eds.)
ESWC 2009. LNCS, vol. 5554, pp. 708–722. Springer, Heidelberg (2009)
11. Langegger, A., Wöß, W.: XLWrap – Querying and Integrating Arbitrary Spread-
sheets with SPARQL. In: Bernstein, A., Karger, D.R., Heath, T., Feigenbaum, L.,
Maynard, D., Motta, E., Thirunarayan, K. (eds.) ISWC 2009. LNCS, vol. 5823,
pp. 359–374. Springer, Heidelberg (2009)
12. Maali, F.: Getting to the Five-Star: From Raw Data to Linked Government Data.
Master’s thesis, National University of Ireland, Galway, Ireland (2011)
13. Maali, F., Cyganiak, R., Peristeras, V.: Enabling Interoperability of Government
Data Catalogues. In: Wimmer, M.A., Chappelet, J.-L., Janssen, M., Scholl, H.J.
(eds.) EGOV 2010. LNCS, vol. 6228, pp. 339–350. Springer, Heidelberg (2010),
http://dx.doi.org/10.1007/978-3-642-14799-9_29
14. Maali, F., Cyganiak, R., Peristeras, V.: Re-using Cool URIs: Entity Reconciliation
Against LOD Hubs. In: Proceedings of the Linked Data on the Web Workshop
2011, LDOW 2011 (March 2011)
15. Nam, T.: The Wisdom of Crowds in Government 2.0: Information Paradigm Evo-
lution toward Wiki-Government. In: AMCIS 2010 Proceedings (2010)
16. Sheridan, J., Tennison, J.: Linking UK Government Data. In: Proceedings of the
WWW 2010 Workshop on Linked Data on the Web (LDOW 2010) (2010)
17. Villazón-Terrazas, B., Vilches-Blázquez, L.M., Corcho, O., Gómez-Pérez, A.:
Methodological Guidelines for Publishing Government Linked Data. In: Wood,
D. (ed.) Linking Government Data, ch. 2. Springer (2011)
18. Zhao, J.: The Open Provenance Model Vocabulary Specification. Technical Report,
University of Oxford (2010)
Achieving Interoperability through Semantic Technologies in the Public Administration
Abstract. In this paper we report the experience of using semantic based tools
and technologies for (collaboratively) modeling administrative procedures and
their related documents, organizational roles, and services, in the Italian Public
Administration (PA), focusing in particular on the interoperability issues faced during the modelling process. This experience, together with the lessons learned and the next steps identified, highlights the potential and the critical aspects of using web 2.0 semantic technologies and tools to enhance participatory knowledge sharing, interoperability, and collaboration in the modeling of complex domains in the PA.
1 Introduction
In the last few years, the Public Administrations (PA) of several countries around the
world have invested effort and resources into modernizing their services in order to
improve labor productivity, as well as PA efficiency and transparency. The recent con-
tributions and developments in ICT (Information and Communication Technology) can
boost this modernization process, as shown by the support the ICT can provide to the
replacement of paper-based procedures with electronic-based ones (dematerialization
of documents) within the PA. An important contribution of the ICT, in supporting the
dematerialization of documents, is the production of proper and precise models of the
administrative procedures of the PA and of the specific “entities” related to these proce-
dures, such as the documents involved in the procedures, the organizational roles per-
forming the activities, and the services needed to manage the electronic documents in an
archival system. In fact, by following a model-driven approach [8,15], the availability
of these models is a key factor towards both (1) the re-design and re-engineering of the
administrative procedures, in order to replace paper-based documents with electronic-
based ones, and (2) the definition of an appropriate archival system able to safely store,
catalogue, manage, and retrieve the electronic documents produced within the PA. The
definition of these models, which can act as “reference models” at the national level
and enhance interoperability as described in [15], is often made complex, among other problems, by the heterogeneity of procedures, document typologies, organiza-
tional structures, terminologies, and so on, present at regional or local level, due for
instance to different regional laws or traditions.
In this paper, we report the experience of using semantic based technologies and
a wiki-based modeling tool, MoKi [10], in the context of the ProDe Italian national
project, in order to build national “reference models” for the management of electronic
2 Related Works
Several works have focused on the application of Semantic Web technologies in the PA
domain. We recall a few of them which have some commonalities with the work pre-
sented in this paper. In [17], the authors present a web-based knowledge management
system that, by providing an up-to-date and accurate legal framework, supports (i) civil
servants in the composition of administrative acts and (ii) civil servants, citizens and
businesses in reasoning and substantiating administrative acts by means of precedents
and opinions. In the context of the SAKE EU project, [19] proposes an ontology-based
approach for the systematic management of changes in knowledge resources in pub-
lic administrations. Successful applications of semantic wiki based technologies in the
eGovernment domain have been reported in [11,20], to favour the management and
sharing of information and knowledge.
Some interoperability frameworks have been defined to ensure interoperability between different systems in the context of complex infrastructures. The Levels of Infor-
mation Systems Interoperability (LISI) [1] initiative of the US Department of Defense
aims to identify the stages through which systems should logically progress, or “ma-
ture”, in order to improve their capabilities to interoperate. LISI considers five increas-
ing levels of sophistication regarding system interaction and the ability of the system
to exchange and share information and services. Each higher level represents a demon-
strable increase in capabilities over the previous level of system-to-system interaction.
A more recent framework, adopted by the European Commission, is the European In-
teroperability Framework for European public services (EIF) [4]. It defines a set of
recommendations to support the delivery of European public services, by classifying
the interoperability aspects to be addressed according to different interoperability lev-
els (legal, organizational, semantic and technical).
Following the definition provided in [15], the approach taken in the ProDe project
can be classified as a model-driven approach, where models have been systematically
used as primary artifact for the definition of common procedures within different re-
gions and for the engineering of the document management system. Of the three mod-
eling sub-categories defined in [15], the models developed in ProDe cover the first and
the second, that is, the specification of (domain) data - provided by means of OWL
ontologies - and the specification of processes - provided by means of BPMN represen-
tations. The approach taken in ProDe, and the conceptual model developed to represent data, bear some relation to the effort carried out in the UK Government Common Information Model [13], where a reference model is defined to support the elicitation and setting out of the requirements specifications for e-service development. Unlike [13], where the models are centered around the notion of e-service, the data models of ProDe are centered around the notion of document. Given the importance of
documents within the project, data have been described in terms of the MoReq meta-
data standard [2], which in turn can be represented in terms of Dublin Core Metadata
[5] as specified in [2]. Concerning the modeling of process knowledge, [15] classifies
the efforts of the PA in two different families: (i) process modeling, and (ii) service
modeling. In ProDe, the objective was to model general processes that are common to a
large number of PAs, and the effort can therefore be classified as a process modeling effort. In that respect, the approach follows the one adopted by SAP in encoding generic process models for different fields in its Solution Maps [16].
The ProDe project was conceived bearing in mind the technological framework realized in ICAR1, a national project addressing the establishment of the Italian Public Connectivity and Cooperation System (SPC).
1 http://www.progettoicar.it/
their administrative procedures. Indeed, it often happens that each region has its own
name for designating documents reporting the same information, thus severely hinder-
ing comprehension of the process across regions.
Formal Language Interoperability. The management of document dematerialization requires dealing with different entities and artifacts: for instance, (i) the nature and properties of the documents to be dematerialized, (ii) the procedures and activities to store, catalogue, manage, and retrieve these documents, and (iii) the actors involved in these activities. These entities have diverse intrinsic natures and are commonly formally represented with different modeling languages: for instance, documents and actors can be suitably modeled with declarative formalisms (e.g., ontologies), while business process formalisms are more appropriate for correctly representing procedures and activities.
Organizational and Technological Interoperability. In modeling the processes of a
complex organization like a PA, it is common to identify at least two conceptual levels
at which these processes take place: the organizational layer, comprising the activities,
roles, processes, and organizational structure of the PA, and the technological layer,
managing the set of information systems and software solutions that the PA uses to
perform (part of) its activities. Although the conceptual connection between these two
layers is rather evident, making this connection explicit and formalized in an integrated architecture makes it possible to offer complex organizations additional value-added services, such as (i) verifying that the information systems supporting the organization are compliant with its processes, (ii) monitoring the execution of the organization’s processes, and (iii) checking (and possibly improving) the organization’s efficiency.
Placing ProDe Interoperabilities within the EIF. The interoperability issues encoun-
tered and identified in the ProDe project do not perfectly map to the four interoperability
layers proposed by the EIF [4]. This is mainly due to the different goals of the project and the framework: the EIF is a set of recommendations that specify how European administrations should communicate with one another within the EU and across Member States’ borders in order to provide services; the ProDe project, instead, aims at defining national “reference models” of PAs’ procedures and domain entities, starting from the existing local ones. This means that, for example, procedures carried out locally are aligned at an abstract level, leaving regions the freedom to detail them according to their needs, so that the abstract version of a process model developed by one region can be used as a basis for the specificities of other regions.
Nevertheless, the interoperability aspects that emerged within ProDe are explicitly or implicitly related to the EIF interoperability layers. In detail:
– lexicon interoperability and formal language interoperability are related to the EIF SEMANTIC INTEROPERABILITY level. Both interoperability aspects, in fact, deal with language heterogeneity which, by hampering a common understanding, requires a “precise meaning” to be associated either with a shared vocabulary or with the relationships existing among different formal languages.
– procedures interoperability lies between the EIF LEGAL and ORGANIZATIONAL INTEROPERABILITY layers. This aspect deals, in fact, with the
In order to address the interoperability issues described in the previous section, and to cre-
ate a common reference model shared by all the regions belonging to the project, a com-
mon conceptual schema was proposed to the experts of the different task-teams to guide
the modeling of their administrative procedures, the related documents, and the services
to be provided by the document management system. This conceptual schema, whose
simplified version is graphically depicted in Figure 1 using an Entity-Relationship no-
tation, was developed by the experts in archival, computer, and organizational sciences
working in the central tasks of the ProDe project, and it represents an extension of the
one presented in [3]3 . In detail, the new entity Service is used to describe the functionalities required of the document management system by a given task in order to handle the documents managed within that task.
The second, and more important, contribution towards the achievement of interop-
erability was the customization and usage of a platform based on MoKi [9], a tool for
collaborative modeling of integrated processes and ontologies, in order to obtain mod-
els following the conceptual schema presented in Figure 1. The platform developed for
the ProDe project (hereafter referred to as the ProDeMoKi Platform) provides a set
of MoKi installations: one installation for each of the peripheral tasks, hereafter named PT1, . . . , PT7, and a single installation CT for all the central tasks, where each MoKi installation PT1, . . . , PT7 was connected with the one for the central tasks, CT. The
main idea of this platform is that, by using CT , the central tasks are able to create and
manage entities (e.g., metadata for the description of documents) that are subsequently, and automatically, made available to PT1, . . . , PT7 (e.g., to describe their documents),
thus favoring convergence and re-use.
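To make this propagation mechanism concrete, the following minimal Python sketch mimics the central/peripheral linking described above. The classes, method names, and the sample metadata entity are illustrative assumptions and do not reproduce the actual MediaWiki-based MoKi implementation.

# Illustrative sketch only: the real ProDeMoKi Platform is built on MediaWiki/MoKi;
# all names below are hypothetical and chosen for clarity.

class CentralInstallation:
    """Models the CT installation, where shared entities (e.g. metadata) are defined."""
    def __init__(self):
        self.shared_entities = {}   # entity name -> definition
        self.subscribers = []       # peripheral installations PT1..PT7

    def register(self, peripheral):
        self.subscribers.append(peripheral)

    def publish(self, name, definition):
        # Creating or updating an entity in CT pushes it to every peripheral MoKi.
        self.shared_entities[name] = definition
        for pt in self.subscribers:
            pt.receive_shared_entity(name, definition)

class PeripheralInstallation:
    """Models a PTi installation that reuses shared entities to describe its documents."""
    def __init__(self, task_name):
        self.task_name = task_name
        self.available_entities = {}

    def receive_shared_entity(self, name, definition):
        self.available_entities[name] = definition

# The central task-team defines a MoReq-based metadata entity, which immediately
# becomes available to all peripheral task-teams.
ct = CentralInstallation()
peripherals = [PeripheralInstallation(f"PT{i}") for i in range(1, 8)]
for pt in peripherals:
    ct.register(pt)
ct.publish("dc:creator", {"label": "Creator", "source": "MoReq / Dublin Core"})
print(sorted(peripherals[0].available_entities))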
Next we show in detail the general architecture of the MoKi tool.
MoKi4 [9] is a collaborative MediaWiki-based [12] tool for modeling ontological and
procedural knowledge. The main idea behind MoKi is to associate a wiki page, con-
taining both unstructured and structured information, to each entity of the ontology and
process model. From a high level perspective, the main features of MoKi are:
• an unstructured access mode (for all users) to view/edit the unstructured con-
tent;
• a fully-structured access mode (for knowledge engineers) to view/edit the com-
plete structured content; and
• a lightly-structured access mode (for domain experts) to view/edit (part of) the
structured content in a simplified way, e.g. via light forms.
These features have proved extremely important in the context of the ProDe
project. In fact, the scenario addressed in the project required the modeling of adminis-
trative procedures, usually better described using a business process modeling notation,
enriched with knowledge which typically resides in an ontology, such as the classifica-
tion of document types, organizational roles, and so on. Moreover, the modeling team was composed of a heterogeneous group of domain experts and knowledge engineers located in different Italian geographical regions.
Indeed, many of the modeling actors involved in the ProDe project were not familiar with ontology modeling. Therefore, we facilitated the usage of MoKi by providing a personalized lightly-structured access mode for each
typology of entities that the users had to model (the ones in the Document management
component, and Organizational structure component in Figure 1). An example of one
of these personalized views is reported in Figure 2. The figure shows the template used
for defining metadata entities.
Hereafter, we will refer to this version of MoKi providing personalized
lightly-structured access mode as ProDeMoKi.
The ProDeMoKi Platform has been extensively used by 2 central task-teams and 6 peripheral task-teams6 for the last 12 months. Overall, 2255 wiki pages have been created, 6809 revisions made, 710 pages deleted, and 71 pages renamed by both peripheral and
central task users. Moreover, as comprehensively presented in [3], ProDeMoKi Plat-
form users have been interviewed, by means of an on-line questionnaire, about the ease
of use and the usefulness of the ProDeMoKi tool, in order to collect their subjective
impressions.
The analysis of the huge amount of usage data (collected analyzing the MediaWiki
database and the server log files), the users’ subjective perception, and the concrete
experience in the field gained during the project, allowed us to derive interesting obser-
vations about the support provided by semantic technologies to the different interoper-
ability aspects demanded by ProDe (described in Section 3). In the following we report,
for each of these interoperability aspects, the way in which it has been addressed by the
ProDeMoKi Platform, the lessons we learned from the project experience and from
the ProDeMoKi Platform usage, and some challenging ideas for future steps. Finally,
we summarize some further related lessons learned.
The users involved in the ProDe project have different backgrounds, depending on their working area within the PA. Indeed, the domain experts belonging to each of the peripheral task-teams are specialized in specific topics, like healthcare, human resources, and financial resources. The MoKi platform provides a web-accessible knowledge sharing system that permits all users, both within the same team and across different teams, to cooperate and to provide feedback about the modeled processes and about how documents are described in the platform.
and tasks) and among the most numerous (4 modelers) task-teams, thus confirming its usefulness in the case of large models and of collaborative work. These results show that the ProDeMoKi tool is particularly useful in situations in which users work in teams on the same models.
Furthermore, the ProDeMoKi Platform is able to support the collaborative work of
users with different backgrounds, by providing simplified views according to the roles
and the specific competencies of the involved actors. Indeed, besides offering simplified
views customized on the basis of the specific domain (i.e., administrative procedures)
to non-technical experts, the platform also tailors its interface to the specific actors’
needs, e.g., offering different functionalities to the central and the peripheral task users.
The extensive use of these views (the lightly-structured access mode of documents
and the fully-structured access mode of processes have been accessed respectively 931
and 2533 times by the peripheral task users, while the central task-teams accessed the
lightly-structured access mode of metadata and services 127 and 166 times, respec-
tively) confirmed the usefulness of these simplified and customized views.
Next steps. We plan to better investigate with controlled experiments the collaboration
mechanisms occurring among the different actors involved in the creation of interoper-
able models. The results and the feedback obtained will allow us to exploit the semantic
web technologies to further support ProDeMoKi users in their modeling activities.
Lexicon interoperability is one of the crucial issues of the ProDe project. As explained in Section 3, in order to avoid ambiguity problems it is critical that the domain experts are able to use a common lexicon for describing both the document properties and the atomic activities used in each process.
The effort spent during the project for this purpose was mainly devoted to: (i) the adoption of pre-existing (standard) terminologies and metadata (e.g., MoReq), when available; and (ii) the creation of a shared vocabulary, agreed among the different regions, for the definition of services.
The architecture of the ProDeMoKi Platform provides a mechanism to link the CT installation and the PT1, . . . , PT7 ones in order to ensure the semantic interoperability of the dictionaries used, thus supporting regions in both these activities. Indeed, on the one hand, this linking functionality gives the tool the capability to enable the definition of a common set of shared objects, which allowed the 4 central tasks to define a common set of metadata, services, functionalities, and indicators to be used by all the peripheral tasks. On the other hand, this functionality supports task-teams in reconciling synonyms and in mapping specific terms to more general and shared ones. This way, it allows them to come up with a common dictionary, based on MoReq, to be used by the peripheral tasks for describing both the document properties (metadata) and the services invoked by each atomic task.
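As an illustration of the reconciliation step supported by this linking functionality, the short Python sketch below maps regional labels onto a shared dictionary. The terms and synonym mappings are invented for the example and are not taken from the actual ProDe models.

# Hypothetical terms: a shared, MoReq-based dictionary agreed by the central tasks,
# plus region-specific synonyms collected from the peripheral task-teams.
shared_dictionary = {"registrazione di protocollo", "classificazione", "fascicolazione"}

synonym_map = {
    "protocollazione": "registrazione di protocollo",
    "registrazione in ingresso": "registrazione di protocollo",
    "inserimento nel fascicolo": "fascicolazione",
}

def reconcile(label: str) -> str:
    """Return the shared term for a regional label, or flag it for manual alignment."""
    label = label.strip().lower()
    if label in shared_dictionary:
        return label
    if label in synonym_map:
        return synonym_map[label]
    return f"UNMAPPED:{label}"   # candidate for discussion among the task-teams

for regional_label in ["Protocollazione", "classificazione", "archiviazione ottica"]:
    print(regional_label, "->", reconcile(regional_label))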
Lessons Learned. The results of the effort spent on converging to a common dictionary clearly appear in the reduction of ambiguity in document and activity names across successive versions of the models created. For example, the number of different activities modeled dropped from more than 170, in the first version of the “Modello di riferimento”, to 22 in the last version, a relative reduction of about 87%.
Next steps. The definition of high-level models commonly agreed upon by all the participating regions, as well as the use of a shared set of metadata and of a common dictionary, represents the first step towards the possibility of (semi-)automatically verifying the compliance of the high-level models with both national and regional laws, which is of primary importance for the PA. Moreover, the shared vocabulary of terms, by fostering the definition of a mapping between the specific regional procedures and the commonly agreed models, could make it possible to verify the compliance of regional models with both national and local norms, and to support their adaptation to changes in the regulations.
The complexity of PA procedures demands the modeling and integration of different entities and artifacts. Each ProDeMoKi in the ProDeMoKi Platform allows modeling the ontology of the documents, the processes in which they are used, and the roles of the users involved in each process. To ensure the interoperability of the formal languages used to describe these different conceptual models, the platform allows building integrated models in which the entities defined in different formal languages can be semantically related, in order to better represent the PA procedures. An example is reported in Figure 3, in which the BPMN diagram shows how document entities are connected with processes.
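A minimal sketch of how such an integrated model could be expressed, assuming an RDF encoding built with the rdflib library; the namespace, class, and property names are illustrative and do not reproduce the ProDeMoKi data model.

# Hypothetical vocabulary: a BPMN task is semantically related to the ontological
# entities (document type, organizational role) it involves.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/prode/")
g = Graph()

# Ontological entities: a document type and an organizational role.
g.add((EX.ProtocolRegister, RDF.type, EX.Document))
g.add((EX.Registrar, RDF.type, EX.OrganizationalRole))

# Process entity: a BPMN task, linked to the document it uses and the role performing it.
g.add((EX.RegisterIncomingMail, RDF.type, EX.BPMNTask))
g.add((EX.RegisterIncomingMail, RDFS.label, Literal("Register incoming mail")))
g.add((EX.RegisterIncomingMail, EX.usesDocument, EX.ProtocolRegister))
g.add((EX.RegisterIncomingMail, EX.performedBy, EX.Registrar))

print(g.serialize(format="turtle"))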
Lessons Learned. The importance of the interoperability among the formal languages
used for describing the different conceptual models of the PA procedures can be grasped
also from the data related to the ProDeMoKi Platform usage. For example, 3.32 doc-
uments have been used, on average, in each process diagram, with peaks of about 20
different documents in a process (as shown in the boxplot in Figure 4a). Moreover, the
same document has been used on average by 0.82 processes, including cases in which
the same document has been used by 4/5 different processes (Figure 4b). Users them-
selves are aware of the importance of such a facility: in fact, 45% of them judged such
a capability of ProDeMoKi one of its major strengths.
Lessons Learned. The conceptual difference between the organizational and techno-
logical layers is rather evident. However, we learned that the interoperability between
these two conceptual layers is hard to achieve in the context of the ProDe project. This is mainly due to the many differences in the regions’ organizations, which make the creation of a clear mapping between services, roles, and procedures a challenging task.
Next steps. After having identified and modeled these two layers, the next step will
be to draw formal relations between them. Being able to link these layers, for example by determining when a certain technological component accomplishes a certain step in the
business process, could allow us to monitor the PA organizational process by monitoring
the progress of the corresponding process at the software layer. More importantly, when
faced with a change in the organizational process we could automatically modify the
technological process to reflect this change.
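A toy Python sketch of the envisaged linking between the two layers: each organizational step is mapped to the technological component assumed to realize it, so that observing software-layer events reveals the progress of the PA process. All step and component names are hypothetical.

# Hypothetical mapping between organizational steps and technological components.
step_to_component = {
    "receive document":  "mail-gateway",
    "register document": "protocol-service",
    "archive document":  "archival-system",
}

def completed_steps(observed_events):
    """Return the process steps whose realizing component has emitted an event
    (dicts keep insertion order in Python 3.7+, i.e. the modeled step order)."""
    return [step for step, component in step_to_component.items()
            if component in observed_events]

software_layer_events = {"mail-gateway", "protocol-service"}
print(completed_steps(software_layer_events))   # ['receive document', 'register document']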
the PA employees have been trained in a one-day learning session, in which all the features of ProDeMoKi were illustrated and hands-on exercises were proposed. After this session, the time spent on learning was very limited (on average, 1-2 days) and the learning process did not require the involvement of ProDeMoKi developers (autonomous training was the preferred approach).
The proliferation of Semantic Web technologies that we can envisage for the near future could allow the quick and easy adoption of platforms like the ProDeMoKi Platform, as well as the growth of communities around these technologies. A further lesson we learned from the ProDe project is the importance of actively involving users in the development process of the project, making them collaborate as part of a community. In the context of ProDe, the active participation and collaboration of PA employees from different regions made it possible to develop a common lexicon (by sharing knowledge and discussing) and to refine models and procedures (by comparing them with those of other regions), and could help, in the future, to keep attention alive on maintaining and evolving the jointly built models as the PA procedures change.
6 Conclusions
The paper reports our experience in the construction and usage of solutions based on
Semantic Web technologies in the context of ProDe, a national project involving Italian
Public Administrations. In particular, it presents how these technologies enabled the
collaborative modeling of administrative procedures and their related documents, orga-
nizational roles, and services, and contributed to dealing with the interoperability issues that emerged in the context of the project. More specifically, the features provided by the MoKi tool, and its customizations to meet the specific needs of the project, made it possible to promote interoperability among users, PA procedures, terminologies, conceptual models, and the different conceptual layers required by the project.
Taking advantage of the experience and of the lessons learned during the project,
we plan, for the future, to better investigate and support the collaboration and inter-
operability mechanisms among users with different competencies and roles, as well as
to explore techniques and approaches for (i) enabling the compliant evolution of PA
procedures and laws; and (ii) monitoring the execution of PA procedures to check their
compliance with the models.
References
1. Levels of Information Systems Interoperability (LISI) (1998),
http://www.eng.auburn.edu/~hamilton/security/DODAF/LISI.pdf
2. MoReq2 specification: Model requirements for the management of electronic records (2008),
http://ec.europa.eu/transparency/archival_policy/moreq/doc/moreq2_spec.pdf
3. Casagni, C., Di Francescomarino, C., Dragoni, M., Fiorentini, L., Franci, L., Gerosa, M.,
Ghidini, C., Rizzoli, F., Rospocher, M., Rovella, A., Serafini, L., Sparaco, S., Tabarroni,
A.: Wiki-Based Conceptual Modeling: An Experience with the Public Administration. In:
Aroyo, L., Welty, C., Alani, H., Taylor, J., Bernstein, A., Kagal, L., Noy, N., Blomqvist, E.
(eds.) ISWC 2011, Part II. LNCS, vol. 7032, pp. 17–32. Springer, Heidelberg (2011)
4. European Commission. European Interoperability Framework (EIF) for European public ser-
vices (2010),
http://ec.europa.eu/isa/documents/isa_annex_ii_eif_en.pdf
5. DCMI. Dublin core metadata initiative (2007), http://dublincore.org/
6. Decker, G., Overdick, H., Weske, M.: Oryx – An Open Modeling Platform for the BPM
Community. In: Dumas, M., Reichert, M., Shan, M.-C. (eds.) BPM 2008. LNCS, vol. 5240,
pp. 382–385. Springer, Heidelberg (2008)
7. Di Francescomarino, C., Ghidini, C., Rospocher, M., Serafini, L., Tonella, P.: Semantically-
Aided Business Process Modeling. In: Bernstein, A., Karger, D.R., Heath, T., Feigenbaum,
L., Maynard, D., Motta, E., Thirunarayan, K. (eds.) ISWC 2009. LNCS, vol. 5823, pp. 114–
129. Springer, Heidelberg (2009)
8. France, R., Rumpe, B.: Model-driven development of complex software: A research
roadmap. In: 2007 Future of Software Engineering, FOSE 2007, pp. 37–54. IEEE Computer
Society, Washington, DC (2007)
9. Ghidini, C., Rospocher, M., Serafini, L.: Moki: a wiki-based conceptual modeling tool. In:
ISWC 2010 Posters & Demonstrations Track: Collected Abstracts, Shanghai, China. CEUR
Workshop Proceedings (CEUR-WS.org), vol. 658, pp. 77–80 (2010)
10. Ghidini, C., Rospocher, M., Serafini, L.: Conceptual modeling in wikis: a reference archi-
tecture and a tool. In: eKNOW 2012, The Fourth International Conference on Information,
Process, and Knowledge Management, pp. 128–135 (2012)
11. Krabina, B.: A semantic wiki on cooperation in public administration in europe. Journal of
Systemics, Cybernetics and Informatics 8, 42–45 (2010)
12. Wikimedia Foundation. Mediawiki, http://www.mediawiki.org
13. Cabinet Office, Office of the e-Envoy: e-Services Development Framework Primer v1.0b (2002), http://www.dcc.uchile.cl/~cgutierr/e-gov/eSDFprimer.pdf
14. OMG. BPMN, v1.1, www.omg.org/spec/BPMN/1.1/PDF
15. Peristeras, V., Tarabanis, K., Goudos, S.K.: Model-driven egovernment interoperability: A
review of the state of the art. Computer Standards & Interfaces 31(4), 613–628 (2009)
16. SAP: Solution Maps, http://www1.sap.com/solutions/businessmaps/solutionmaps/index.epx
17. Savvas, I., Bassiliades, N.: A process-oriented ontology-based knowledge management sys-
tem for facilitating operational procedures in public administration. Expert Systems with
Applications 36(3, pt. 1), 4467–4478 (2009)
18. Smith, M.K., Welty, C., McGuinness, D.L.: Owl web ontology language guide. W3C Rec-
ommendation, February 10 (2004)
19. Stojanovic, N., Apostolou, D., Ntioudis, S., Mentzas, G.: A semantics-based software frame-
work for ensuring consistent access to up-to-date knowledge resources in public administra-
tions. In: Metadata and Semantics, pp. 319–328. Springer, US (2009)
20. Wagner, C., Cheung, K.S.K., Ip, R.K.F., Bottcher, S.: Building semantic webs for e-
government with wiki technology. Electronic Government, an International Journal 3(1),
36–55 (2005)
Tackling Incompleteness in Information Extraction –
A Complementarity Approach
Christina Feilmayr
Johannes Kepler University Linz, Altenberger Straße 69, 4040 Linz, Austria
cfeilmayr@faw.jku.at
1 Motivation
2 Proposed Approach
While tackling the above-mentioned problems primarily necessitates a reassessment
of contextual information, it also requires strategies for predicting missing values and
generating suggestions for missing slot values. Consequently, methods that obtain
new, additional information, such as text and data mining, are needed. Further, a
means of exploiting available context information to establish meaningful constraints,
conditions, and thresholds for value selection, class assignment, and the matching and
mapping procedures is required. Finally, a procedure must be devised to combine the
information obtained, evaluate and estimate its reliability, and incorporate the
available contextual information.
[Figure: the traditional IE pipeline (NLP pre-processor, named entity recognition, co-reference resolution, template element production, template relation production, scenario template production, yielding filled templates), extended with the proposed novel components (interpretation/typification, semantic assessment, answer validation, attribute-value assessment, evaluation, fusion, yielding refined information). Legend: (a) selection of the data mining method, (b) integration of existing data mining sources, (c) evaluation of utility and ranking of relevant answers; the choice depends on the incompleteness type.]
After the type of incompleteness has been identified, the appropriate data mining
methods must be selected. As already mentioned, the thrust of the new IE approach is
to integrate text and data mining into IE. Possible integration approaches are: (i) iden-
tification of collocations and co-occurrences, which can be used to resolve contradic-
tions, perform word sense disambiguation, and validate semantic relations; (ii) con-
straint-based mining, which can, for example, learn different kinds of constraints
(e.g., data type constraints); (iii) identification of frequent item sets (associations),
which can also be used as constraints for ST and template merging; (iv) prediction
provides additional evidence for missing slot values. Approaches (i)-(iii) are applied to assess semantics, and approach (iv) to assess attribute values. New information in
the form of constraints, conditions, or even suggestions for slots is generated and used
to improve IE results. The results are evaluated for a selected set of features (using
standard measures of the respective data mining method, e.g., accuracy, and/or se-
lected interestingness measures [4]) that might impact the construction of answers in
the presence of incompleteness. These features are combined in a flexible utility func-
tion (adapted from [7]) that expresses the overall value of information to a user.
Determining utility means that first the utility is calculated for each answer, and
second, if fusion is performed, the utility of the new value must be calculated in order
to determine whether it is more appropriate than the available alternatives. Conse-
quently, the utility value allows us to (i) define a meaningful ranking of candidates for
filling incomplete templates and (ii) discover the best fusion.
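For concreteness, the following Python sketch shows one way such a utility-based ranking and fusion decision could look. The feature names, weights, and candidate values are assumptions made for illustration and are not the parameters actually used in this work.

# Hypothetical features per candidate answer: accuracy of the data mining method,
# an interestingness measure, and source reliability, combined by a weighted sum.
WEIGHTS = {"accuracy": 0.5, "interestingness": 0.3, "reliability": 0.2}

def utility(features):
    """Weighted sum expressing the overall value of a candidate answer."""
    return sum(WEIGHTS[name] * value for name, value in features.items())

candidates = {
    "value_a": {"accuracy": 0.9, "interestingness": 0.4, "reliability": 0.8},
    "value_b": {"accuracy": 0.7, "interestingness": 0.8, "reliability": 0.6},
}

# (i) rank the candidates for filling the incomplete slot ...
ranking = sorted(candidates, key=lambda c: utility(candidates[c]), reverse=True)

# (ii) ... and accept a fused value only if it beats the best available alternative.
fused_value = {"accuracy": 0.85, "interestingness": 0.7, "reliability": 0.7}
use_fused = utility(fused_value) > max(utility(f) for f in candidates.values())

print(ranking, use_fused)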
A possible single point of failure is the automatic determination of the incomplete-
ness type. Thus, the processable incompleteness types must be selected with care.
Another possible limitation occurs if measures of interestingness are insufficiently
meaningful: This renders the determined utility value useless and leads (depending
on the ranking and filtering methods used) to too much additional information being
selected for fusion and even more contradictory, uncertain, and (semantically) impre-
cise information being produced.
3 Background
In the early PhD phase, the literature review focused mainly on the general idea of
integrating data mining into IE. In [3], the author discussed the role of IE in text
mining applications and summarized initial research work on the integration of data
mining into IE. These first initiatives have been successful, but they discuss relatively
simple problems. Most importantly, the projects [6], [8], and [10] demonstrate that the
information extracted by such an integrated approach is of high quality (in terms of
correctness, completeness, and level of interest).
To the best of the author’s knowledge, there is no other research activity ongoing
that deals with the integration of data mining into IE to overcome the specific prob-
lem of incompleteness. There are some well-established approaches based on com-
plementarity in the knowledge fusion community. A general overview of knowledge
fusion is given in [1]. Nikolov [9] outlined a knowledge fusion system that makes
decisions depending on the type of problem and the amount of domain information
available. Zeng et al. [11] implemented a classifier to acquire context knowledge
about data sources and built an aggregation system capable of explaining incomplete
data. Ciravegna et al. [2] proposed an approach based on a combination of informa-
tion extraction, information integration, and machine learning techniques. There,
methodologies of information integration are used to corroborate the newly acquired
information, for instance by using evidence from multiple different sources. How to exploit redundancy for IE and for question answering/answer validation is described in [2] and [5], respectively.
Evaluation of Test Scenarios. The research work will conclude with a three-part
analysis demonstrating the improvements in IE domain analysis: (i) the first part is a
non-optimized information extraction process that provides the baseline; (ii) the
second part integrates a gold standard for a specific problem in order to highlight the
seriousness of the incompleteness problem; (iii) the third part is an information
extraction process using complementarity in order to overcome the incompleteness
problem. In comparison to (i) and (ii), the third part of the evaluation should highlight the improvements in reducing incompleteness. Moreover, an expert evaluation is planned, which will assess the various outcomes of the complementarity modules.
References
1. Bloch, I., Hunter, A., et al.: Fusion: General Concepts and Characteristics. International
Journal of Intelligent Systems, Special Issue: Data and Knowledge Fusion 16(10), 1107–
1134 (2001)
2. Ciravegna, F., Chapman, S., Dingli, A., Wilks, Y.: Learning to Harvest Information for the
Semantic Web. In: Bussler, C.J., Davies, J., Fensel, D., Studer, R. (eds.) ESWS 2004.
LNCS, vol. 3053, pp. 312–326. Springer, Heidelberg (2004)
3. Feilmayr, C.: Text Mining Supported Information Extraction - An Extended Methodology
for Developing Information Extraction Systems. In: Proceedings of 22nd International
Workshop on Database and Expert Systems Applications (DEXA 2011), pp. 217–221
(2011)
4. Geng, L., Hamilton, H.J.: Interestingness Measures for Data Mining: A Survey. ACM
Computing Surveys 38(3), Article 9 (2006)
5. Magnini, B., Negri, M., Prevete, R., Tanev, H.: Is It the Right Answer? Exploiting Web
Redundancy for Answer Validation. In: Proceedings of the 40th Annual Meeting on Asso-
ciation for Computational Linguistics, pp. 425–432 (2002)
6. McCallum, A., Jensen, D.: A Note on the Unification of Information Extraction and Data
Mining using Conditional-Probability, Relational Models. In: Proceedings of the IJCAI
Workshop on Learning Statistical Models from Relational Data (2003)
7. Motro, A., Anokhin, P., Acar, A.C.: Utility-based Resolution of Data Inconsistencies. In:
Proceedings of the International Workshop on Information Quality in Information Sys-
tems, pp. 35–43 (2004)
8. Nahm, U.Y., Mooney, R.J.: Using Soft-Matching Mined Rules to Improve Information
Extraction. In: Proceedings of the AAAI 2004 Workshop on Adaptive Text Extraction and
Mining, pp. 27–32 (2004)
9. Nikolov, A.: Fusing Automatically Extracted Annotations for the Semantic Web, PhD
Thesis, Knowledge Media Institute, The Open University (2009)
10. Wong, T.-L., Lam, W.: An Unsupervised Method for Joint Information Extraction and
Feature Mining Across Different Web Sites. Data & Knowledge Engineering 68(1), 107–
125 (2009)
11. Zeng, H., Fikes, R.: Explaining Data Incompleteness in Knowledge Aggregation, Technic-
al Report, Knowledge Systems, AI Laboratory, KSL-05-04 (2005)
A Framework for Ontology Usage Analysis
Jamshaid Ashraf
Fig. 1. Ontology Usage analysis Framework and Ontology Lifecycle with feedback loop
To understand the nature of the structured (RDF) data published on the web,
the domain ontologies used and their co-usability factor, the use of semantic web
technologies and plausible reasoning (how much implicit knowledge is inferable),
in [6] we considered the e-commerce dataset (called GRDS) by crawling the web.
The latest version of the dataset comprises 27 million triples collected from
approximately 215 data sources. In [6], we performed empirical analysis on the
dataset to analyse the data and knowledge patterns available and found that a
small set of concepts (lite ontology) of the original model is, in fact, used by
a large number of publishers. We also learnt that, with current instance data,
there is not much for RDFS reasoner(s) to infer implicit knowledge due to the
invariant data and knowledge patterns in the knowledge base. Based on the insights obtained, in [7] we proposed a framework and metrics to measure concept and property usage in the dataset, keeping in view their richness
within the ontological model. In [8], the ontology usage framework is used to
extract the web schema, based on the ontology instantiation and co-usability
in the database. The web schema represents the prevailing schema, providing
the structure of data useful for accessing information from the knowledge base
and building data-driven applications. Based on the research done so far and
the initial results obtained, we are confident of the benefits that can be realized
with the implementation of OUSAF, which include: (a) assisting in building Semantic Web applications that offer rich data services by exploiting the available schema-level information, and providing an improved context-driven user interface and exploratory search [9], based on automatic discovery of explicit and
implicit knowledge; (b) enabling client applications to make expressive queries
to the Web by exploiting the schema patterns evolving through the use of ontolo-
gies; and (c) empirical analysis of domain ontology usage, as shown in Figure 1,
that provides the feedback loop to the ontology development life cycle. Knowing
the sub-model of the original ontology, which provides information about usage
and adoption, will assist the ontology developer to pragmatically refine, update
and evolve the conceptual model of ontology.
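The kind of usage metrics referred to above, i.e. how often each class is instantiated and each property is used, can be illustrated with a short, simplified Python sketch based on rdflib and SPARQL aggregates. The file name is a placeholder and the code is not the OUSAF implementation.

from rdflib import Graph

g = Graph()
g.parse("grds-sample.nt", format="nt")   # hypothetical local dump of the crawled data

# How often is each class instantiated?
class_usage = {
    str(row.c): int(row.n)
    for row in g.query(
        "SELECT ?c (COUNT(?s) AS ?n) WHERE { ?s a ?c } GROUP BY ?c ORDER BY DESC(?n)"
    )
}

# How often is each property used?
property_usage = {
    str(row.p): int(row.n)
    for row in g.query(
        "SELECT ?p (COUNT(*) AS ?n) WHERE { ?s ?p ?o } GROUP BY ?p ORDER BY DESC(?n)"
    )
}

print(list(class_usage.items())[:5])
print(list(property_usage.items())[:5])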
References
1. Tao, J., Ding, L., McGuinness, D.L.: Instance data evaluation for semantic web-
based knowledge management systems. In: HICSS, pp. 1–10. IEEE Computer Soci-
ety (2009)
2. Jain, P., Hitzler, P., Yeh, P.Z., Verma, K., Sheth, A.P.: Linked data is merely more
data. In: AAAI Spring Symposium: Linked Data Meets Artificial Intelligence. AAAI
(2010)
3. Ding, L., Zhou, L., Finin, T., Joshi, A.: How the semantic web is being used: An
analysis of foaf documents. In: Proceedings of the 38th Annual Hawaii International
Conference on System Sciences - Track 4, vol. 4, pp. 113–120. IEEE Computer
Society, Washington, DC (2005)
4. Hogan, A., Harth, A., Passant, A., Decker, S., Polleres, A.: Weaving the pedantic
web. In: Linked Data on the Web Workshop (LDOW 2010) at WWW 2010 (2010)
5. Borgatti, S.P., Everett, M.G.: Network analysis of 2-mode data. Social Net-
works 19(3), 243–269 (1997)
6. Ashraf, J., Cyganiak, R., O'Riain, S., Hadzic, M.: Open eBusiness ontology usage: Investigating community implementation of GoodRelations. In: Linked Data on the
Web Workshop (LDOW 2011) at WWW 2011, Hyderabad, India, March 29 (2011)
7. Ashraf, J., Hadzic, M.: Domain ontology usage analysis framework. In: SKG, pp.
75–82. IEEE (2011)
8. Ashraf, J., Hadzic, M.: Web schema construction based on web ontology usage
analysis. In: JIST. Springer (2011)
9. Tvarožek, M.: Exploratory search in the adaptive social semantic web. Information
Sciences and Technologies Bulletin of the ACM Slovakia 3(1), 42–51 (2011)
Formal Specification of Ontology Networks
Edelweis Rohrer
Resources domains appears since, in this case study, web resources are assessed according to some quality criteria. It is therefore important to explicitly specify not only the semantics of each domain, but also to add knowledge about how
1 Tutored by Regina Motz of Instituto de Computación, Facultad de Ingeniería, Universidad de la República, Uruguay, and Alicia Díaz of LIFIA, Facultad de Informática, Universidad Nacional de La Plata, Argentina.
these domains are related. The main motivation of this thesis is the identification and formal definition of the different relationships among the ontologies used to describe a particular application, while keeping logical consistency. That is, no axiom of one ontology should cause contradictory results over another ontology in the ontology network. In a real application, checking the consistency of the ontology network could be computationally hard; hence, the trade-off between preserving consistency and taking care of the computational properties is one of the main issues of this work. The main contribution will be to support developers in the design of ontology networks, making explicit how ontologies can be linked while keeping them as independent components. In the remainder of
this paper: Section 2 gives a background overview, Section 3 explains the PhD
approach, Section 4 introduces methodology issues and Section 5 presents the
work already done.
4 Research Methodology
This work is being carried out following an iterative process. I started with the
analysis of case studies to identify ontology relationships. This led me to investigate the way other authors have addressed this issue, reviewing theoretical foundations of DL and computational complexity when necessary. As a result, a set of relationship definitions is obtained, which is validated in a case study; from the weaknesses found, a new iteration starts, refining the previous definitions.
Regarding the evaluation of the approach, an application to design ontology networks is being implemented. It will make it possible to validate several important aspects: (i) the usability of the approach for defining different relationships, reaching the adequate abstraction level, and (ii) user satisfaction when the ontology network evolves. Here, it is important to understand the restrictions imposed to ensure the consistency of the ontology network: whether they help when changes are introduced or whether they hinder the task in practice.
5 Results
I have formalized four ontology relationships, introduced in Section 3. A first
formalization and its use to describe a web recommender system was presented
in [11]. In the following, I present the usesSymbolsOf relationship.
First, I define a relationship between two ontologies O and O′ w.r.t. a query
language QL as a set of axioms A_r, called relationship axioms, such that:
A_r ⊆ {α ∈ QL | sig(α) ⊆ sig(O) ∪ sig(O′) ∪ S_r}, where
S_r ⊆ {X | X ∈ N_C ∪ N_R ∪ N_I} is called the relationship signature, with
N_C the set of all concept names, N_R the set of all role names, N_I the set
of all individual names, and S_r ∩ sig(O ∪ O′) = ∅.
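As a toy illustration (invented for this summary, not taken from the case study), the following LaTeX fragment instantiates the definition with two small ontologies and a relationship signature containing a single fresh role:

% Toy example: the fresh role hasQuality occurs in neither ontology,
% and the single relationship axiom only uses symbols from sig(O), sig(O') and S_r.
\begin{align*}
  \mathrm{sig}(O)  &= \{\mathit{WebResource}\}, \qquad
  \mathrm{sig}(O') = \{\mathit{QualityCriterion}\},\\
  S_r &= \{\mathit{hasQuality}\}, \qquad S_r \cap \mathrm{sig}(O \cup O') = \emptyset,\\
  A_r &= \{\, \mathit{WebResource} \sqsubseteq \exists \mathit{hasQuality}.\mathit{QualityCriterion} \,\}.
\end{align*}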
References
1. Allocca, C., D’Aquin, M., Motta, E.: DOOR - Towards a Formalization of Ontology
Relations. In: Dietz, J.L.G. (ed.) KEOD, pp. 13–20. INSTICC Press (2009)
2. Cuenca Grau, B., Parsia, B., Sirin, E.: Combining OWL Ontologies Using E-Connections. Web Semantics: Science, Services and Agents on the World Wide
Web 4, 40–59 (2006)
3. Cuenca Grau, B., Horrocks, I., Kazakov, Y., Sattler, U.: Modular Reuse of Ontolo-
gies: Theory and Practice. J. Artif. Intell. Res (JAIR) 31, 273–318 (2008)
4. Konev, B., Lutz, C., Walther, D., Wolter, F.: Formal Properties of Modularisation.
In: Stuckenschmidt, H., Parent, C., Spaccapietra, S. (eds.) Modular Ontologies.
LNCS, vol. 5445, pp. 25–66. Springer, Heidelberg (2009)
5. Borgida, A., Serafini, L.: Distributed Description Logics: Assimilating Information
from Peer Sources. Journal of Data Semantics 1, 153–184 (2003)
6. Giunchiglia, F., Walsh, T.: A Theory of Abstraction. Journal Artificial Intelli-
gence 56, 323–390 (1992)
7. Ehrig, M.: Ontology Alignment. Bridging the Semantic Gap. Springer Sci-
ence+Business Media, LLC (2007)
8. Suchanek, F.M., Abiteboul, S., Senellart, P.: Ontology Alignment at the Instance
and Schema Level. Technical report, Institut National de Recherche en Informa-
tique et en Automatique (2011)
9. Baader, F., Nutt, W.: Basic Description Logics. In: Baader, et al. [12], pp. 47–95
10. Donini, F.M.: Complexity of Reasoning. In: Baader, et al. [12], pp. 101–138
11. Díaz, A., Motz, R., Rohrer, E.: Making Ontology Relationships Explicit in an Ontology Network. In: The V Alberto Mendelzon International Workshop on Foun-
dations of Data Management (May 2011)
12. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F.
(eds.): The Description Logic Handbook: Theory, Implementation, and Applica-
tions. Cambridge University Press (2003)
Leveraging Linked Data Analysis for Semantic
Recommender Systems
Andreas Thalhammer
1 Motivation
Relevance. LOD is well known for providing a vast amount of detailed and
structured information. We believe that the information richness of LOD
combined with user preferences or usage data can help to understand items
and users in a more detailed way. In particular, LOD data can be the basis
for an accurate profile which can be useful for recommendation in various
domains. As information about items in the user profile is often unstructured
and contains only little background knowledge, this information needs to
be linked to external sources for structured data such as DBpedia.2 Also,
product and service providers need to link their offers accordingly. Methods
for this two-way alignment have to be specified and evaluated.
Diversity. According to [5], recent developments in semantic search focus on
contextualization and personalization. However, approaches that semantically enable diverse recommendations for users, also in relation to users’ profiles, remain barely explored. Of course, this constitutes a way of recommendation complementary to the common one, which is often based only on ranking by relevance. Consider the example of a news aggregation Web site which ranks articles by
popularity. Popular articles are placed at the main page. On the same topic,
there are hundreds of additional articles from other news sites and blogs
indexed, but not visible on the main page. Of course, these articles get much
1 Linking Open Data - http://ow.ly/8mPMW
2 DBpedia - http://dbpedia.org/
less clicks than the ones from the main page. This results in the most popular
sites gaining even more popularity. Our goal is to break up such feedback
loops by introducing diverse recommendations.
Non-existence. During recent years the LOD cloud3 has been growing to a
huge amount of interconnected triples. Approaches like Freebase4 and the
Wikidata project5 focus on collecting information through direct input in
order to establish an encyclopedic corpus. We believe that these systems
are likely to face the same issue as Wikipedia:6 since 2006, its growth has been decreasing [16]. This problem of Wikipedia has gained research attention, and different article recommendation systems have been explored [4,8]. These
systems point users to articles they might want to edit. This idea can be
extended in order to work for Freebase or Wikidata: given the corresponding
user profiles, it becomes feasible to point users to missing facts (e.g. the
mayor of a city).
2 Related Work
During the last decade, several approaches that aim to link semantic technologies
and recommender systems have been introduced. [15] introduces a framework
that enables semantically-enhanced recommendations in the cultural heritage
domain. Recommendation as well as personalization in this work rely on the
CHIP ontology which is designed specifically for the cultural heritage domain.
The core of the recommendation strategy is based on discovering domain-specific links between artworks and topics (e.g., the same creator, creation site, or material medium). In the outlook section of this work, the author emphasizes LD as a core technology to enhance personalization and recommendation. The work
presented in [7] describes an approach that utilizes LD in order to enhance recommender systems. Just like our focus on linking items in the user profile to LOD
items, Heitmann and Hayes utilize LOD links in order to enhance background
information for recommendation corpora. The recommendation approach is col-
laborative filtering (cf. [1]) as “the inconsistent use of these semantic features
makes the cost of exploiting them high” [7]. In my thesis, I try to leverage
exactly these semantic features for recommendation. [2] introduces a semantic
news recommender system called “News@hand” that makes use of ontology-
based knowledge representation in order to mitigate the problem of ambiguity
and to leverage reasoning for mediating between fine and coarse-grained fea-
ture representations. The system supports content-based as well as collaborative
recommendation models. Similar to our approach, the items and the user pro-
files are represented as a set of weighted features. The weights of the item fea-
tures are computed with a TF-IDF technique which does not involve additional
knowledge.
3 LOD cloud - http://lod-cloud.net/
4 Freebase - http://www.freebase.com/
5 Wikidata - http://meta.wikimedia.org/wiki/Wikidata
6 Wikipedia - http://wikipedia.org/
3 Proposed Approach
Our approach is based on identifying distinctive item features with the help of
usage or rating data. As with all recommender systems, the main goal is to help
users to find information that is important to them. On a different level, the
macro goals are to identify and match information that is important about users
with information that is important about items. Accordingly, the first part of
the process can be broken down to the following steps:
These processing steps can be performed offline. For the representation and
storage of the results an appropriate vocabulary needs to be selected. After these
steps, we have established a situation where we know what is important about
specific items as well as users. Afterwards, in the second part, we investigate different match-making techniques that help us to recommend items to the
user that relate to relevant, diverse, or missing information. Accordingly, the
following techniques may serve as starting points:
4 Methodology
For first tests and results, we chose the HetRec2011 MovieLens2k dataset [3] that
has been linked to Freebase data. The rating data stems from the MovieLens10M
dataset8 that contains anonymous user profiles.
The evaluation of recommender systems is usually based on precision and re-
call which can also be applied in this case. In this field, a couple of approaches
already exist that can serve as base lines. A measure which ranks diversity is
introduced in [9]. The recommendation of missing content for Web 2.0 collec-
tions can be evaluated by comparing the number of edits with and without the
recommendation approach.
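To make the intended evaluation concrete, here is a small Python sketch of one possible match-making step (cosine similarity between weighted feature profiles), scored with precision and recall at k. The features, items, and relevance judgements are invented and the code is not the thesis' actual pipeline.

import math

def cosine(u, v):
    """Cosine similarity between two weighted feature dictionaries."""
    dot = sum(u.get(f, 0.0) * w for f, w in v.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical weighted features (e.g. derived from Freebase links and rating data).
user_profile = {"genre:SciFi": 0.9, "director:Nolan": 0.7, "decade:2000s": 0.4}
items = {
    "Inception":    {"genre:SciFi": 1.0, "director:Nolan": 1.0, "decade:2000s": 1.0},
    "The Notebook": {"genre:Romance": 1.0, "decade:2000s": 1.0},
}
ranking = sorted(items, key=lambda i: cosine(user_profile, items[i]), reverse=True)

def precision_recall_at_k(ranked, relevant, k):
    hits = len(set(ranked[:k]) & relevant)
    return hits / k, (hits / len(relevant) if relevant else 0.0)

print(ranking, precision_recall_at_k(ranking, {"Inception"}, k=1))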
7 Netflix Prize - http://www.netflixprize.com/
8 MovieLens10M - http://www.grouplens.org
References
1. Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender sys-
tems: a survey of the state-of-the-art and possible extensions. IEEE Transactions
on Knowledge and Data Engineering 17, 734–749 (2005)
2. Cantador, I., Bellogín, A., Castells, P.: Ontology-based personalised and
context-aware recommendations of news items. In: Proceedings of the 2008
IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent
Agent Technology, WI-IAT 2008, vol. 1, pp. 562–565. IEEE Computer Society,
Washington, DC (2008)
3. Cantador, I., Brusilovsky, P., Kuflik, T.: 2nd ws. on information heterogeneity and
fusion in recommender systems (hetrec 2011). In: Proc. of the 5th ACM Conf. on
Recommender Systems, RecSys 2011. ACM, New York (2011)
4. Cosley, D., Frankowski, D., Terveen, L., Riedl, J.: SuggestBot: Using Intelligent
Task Routing to Help People Find Work in Wikipedia. In: Human-Computer In-
teraction (2007)
5. Dengel, A.: Semantische suche. In: Dengel, A. (ed.) Semantische Technologien, pp.
231–256. Spektrum Akademischer Verlag (2012)
6. Fernández-Tobías, I., Cantador, I., Kaminskas, M., Ricci, F.: A generic semantic-
based framework for cross-domain recommendation. In: Proceedings of the 2nd
International Workshop on Information Heterogeneity and Fusion in Recommender
Systems, HetRec 2011 (2011)
7. Heitmann, B., Hayes, C.: Using Linked Data to Build Open, Collaborative Recom-
mender Systems. Artificial Intelligence (2010)
8. Huang, E., Kim, H.J.: Task Recommendation on Wikipedia. Data Processing
(2010)
9. Murakami, T., Mori, K., Orihara, R.: Metrics for Evaluating the Serendipity of
Recommendation Lists. In: Satoh, K., Inokuchi, A., Nagao, K., Kawamura, T.
(eds.) JSAI 2007. LNCS (LNAI), vol. 4914, pp. 40–46. Springer, Heidelberg (2008)
10. Ng, A.Y., Zheng, A.X., Jordan, M.I.: Stable Algorithms for Link Analysis. Machine
Learning, 267–275 (2001)
11. Oren, E., Gerke, S., Decker, S.: Simple Algorithms for Predicate Suggestions Using
Similarity and Co-occurrence. In: Franconi, E., Kifer, M., May, W. (eds.) ESWC
2007. LNCS, vol. 4519, pp. 160–174. Springer, Heidelberg (2007)
12. Pariser, E.: The filter bubble: what the Internet is hiding from you. Viking, London
(2011)
13. Passant, A.: dbrec — Music Recommendations Using DBpedia. In: Patel-Schneider,
P.F., Pan, Y., Hitzler, P., Mika, P., Zhang, L., Pan, J.Z., Horrocks, I., Glimm, B.
(eds.) ISWC 2010, Part II. LNCS, vol. 6497, pp. 209–224. Springer, Heidelberg
(2010)
14. Sarwar, B., Karypis, G., Konstan, J., Reidl, J.: Item-based collaborative filtering
recommendation algorithms. In: Proceedings of the 10th International Conference
on World Wide Web, WWW 2001, pp. 285–295. ACM, New York (2001)
15. Wang, Y.: Semantically-Enhanced Recommendations in Cultural Heritage. PhD
thesis, Technische Universiteit Eindhoven (2011)
16. Wikipedia. Modelling wikipedia’s growth,
http://en.wikipedia.org/wiki/Wikipedia:Modelling_Wikipedia’s_growth
(online accessed March 12, 2012)
17. Zhang, H., Fu, L., Wang, H., Zhu, H., Wang, Y., Yu, Y.: Eachwiki: Suggest to be
an easy-to-edit wiki interface for everyone. In: Semantic Web Challenge (2007)
Sharing Statistics for SPARQL Federation
Optimization, with Emphasis
on Benchmark Quality
Kjetil Kjernsmo
1 Motivation
The prior art in database theory is extensive, but I intend to focus on aspects that set SPARQL apart from, e.g., SQL, such as the quad model, the fact that it is a Web technology, or that data are commonly very heterogeneous.
I have not yet started to explore the scientific literature around SPARQL
Federation in any depth as I am still in an early phase of my work. I am currently
focusing my efforts on benchmarking. The long-term goal of my work is SPARQL
Federation, but that is a minor concern in this paper. Overall, the expected
contributions are:
I take the state of the art in technology to be represented by the current basic
SPARQL 1.1 Federated Query Working Draft2 . In addition, many have imple-
mented federation that doesn’t require explicit references to service endpoints,
e.g. [8]. A recent scientific treatment of the current specification is in [1]. In that
paper, the authors also show an optimization strategy based on execution order
of so-called well-designed patterns.
A recent review of the state of the art is in [4]. In addition, [8] proposes
bound joins and proves they can dramatically reduce the number of requests to
federation members that are needed, as well as the implementation of FedX.
It has been my intention to focus on the two problems listed in Section 3.3.1 of [4], i.e., striking a balance between accuracy and index size, and updating statistics as data change. Notably, histogram approaches generally suffer from the
problem that they grow too large or become an insufficiently accurate digest,
especially in the face of very heterogeneous data. [5] introduced QTrees, which
may alleviate the problem of histogram size, but which may not solve it.
Therefore, the core problem is: How do we compute and expose a digest that
is of optimal size for the query performance problem?
2 http://www.w3.org/TR/2011/WD-sparql11-federated-query-20111117/
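As a minimal sketch of the kind of digest a federation member could expose, the following Python snippet (using SPARQLWrapper, with a placeholder endpoint URL) collects per-predicate triple counts, VoID-style; how to bound the size of such a digest while keeping it accurate enough for optimization is exactly the open question stated above.

from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://example.org/sparql")   # placeholder endpoint
endpoint.setQuery("""
    SELECT ?p (COUNT(*) AS ?triples)
    WHERE { ?s ?p ?o }
    GROUP BY ?p
    ORDER BY DESC(?triples)
""")
endpoint.setReturnFormat(JSON)

# Per-predicate triple counts: a crude digest a federated optimizer could use
# for source selection and join ordering.
digest = {
    row["p"]["value"]: int(row["triples"]["value"])
    for row in endpoint.query().convert()["results"]["bindings"]
}
print(sorted(digest.items(), key=lambda kv: -kv[1])[:10])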
2.2 In Benchmarking
Numerous benchmarks have been developed for SPARQL, but [2] showed that
currently most benchmarks poorly represent the typical data and queries that are
used on the Semantic Web. Most recently, [6] addressed some of these problems
by using real data and real queries from DBpedia. [7] has developed a benchmark
for the federated case.
Current common practice in benchmarking SPARQL-enabled systems is to
use or synthesize a certain dataset, then formulate a number of queries seen as
representative of SPARQL use in some way. These queries are then executed, and
some characteristic of performance is measured, for example the time it takes
for the engine to return the full result. Since there is a certain randomness in
query times, this process is repeated a number of times and an average response
time is found. Different engines can be compared based on these averages.
In many cases, this is sufficient. Sometimes, one engine can execute a query an order of magnitude faster than another. If this happens systematically for many different queries, there is hardly reasonable doubt as to which is faster. In most cases, however, the query response times differ little. Small differences
may seem unimportant but may become important if they are systematic. Even
if one engine is dramatically better than another in one case, small deficiencies
may add up to make the other a better choice for most applications anyway.
In this case, we must consider the possibility that random noise influences the
conclusions. Whatever metric is used, it should be treated as a stochastic variable.
This opens new methodological possibilities, first and foremost using
well-established statistical hypothesis testing or ranking rather than just
comparing averages.
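A minimal sketch of this idea, assuming paired response-time samples for the same query mix on two engines and using a Wilcoxon signed-rank test from SciPy; the numbers are synthetic:

```python
# Sketch: treat response time as a stochastic variable and test a hypothesis
# instead of comparing raw averages.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(42)
engine_a = rng.lognormal(mean=2.0, sigma=0.3, size=50)            # seconds
engine_b = engine_a * rng.normal(loc=0.95, scale=0.05, size=50)   # ~5% faster

stat, p_value = wilcoxon(engine_a, engine_b)
print(f"mean A = {engine_a.mean():.2f}s, mean B = {engine_b.mean():.2f}s, p = {p_value:.4f}")
# A small but systematic difference can be significant even when the averages
# look nearly identical; conversely, a large gap in averages may not be.
```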
Furthermore, the current approach presents merely anecdotal evidence that
one engine performs better than another. It may be that while the benchmarking
queries seem to favor one engine, other queries that have not been tried would
reveal strong adverse effects that were not anticipated. A more systematic
approach is needed to provide a comprehensive and objective basis for comparing
SPARQL engines.
In physical science and engineering, conventional wisdom has been that you
should vary only one variable at a time to study the effects of that one variable.
In medical science, this approach was abandoned several decades ago, thanks to
advances in statistics. When, for example, a researcher administers different
treatments to terminally ill patients, some of which may be painful or shorten
their lives, experimental economy is extremely important.
Using techniques from statistical experimental design, I propose to design an
experiment (i.e. a benchmark) that covers all realistic cases and with which we
can justify why the remaining corner cases are unlikely to influence the result.
For further elaboration, see Section 3.2.
So far, the benchmarking problem has been seen as a software testing problem,
but as stated in the introduction this is not the only objective: we may also
investigate whether benchmark data can be exposed, along with a statistical
digest, to help federated query optimizers.
The problems addressed by existing benchmarks such as the ones cited above
are almost orthogonal to the problems considered by my proposed project. While
I have seen some cases that compare performance based on box plots3, this does
not seem to be common practice. Furthermore, I have to date not seen any work
on using methods like factorial designs to evaluate the performance of software,
but there may be a limit in terms of complexity to where this is feasible,
and I will restrict myself to SPARQL for this thesis.
3.2 In Benchmarking
Already in 1926, Ronald Fisher noted that complex experiments can be much
more efficient than simple ones4, starting the field of experimental design. One of
the simpler designs is the "fractional factorial design", in which several "factors" are
studied. In terms of SPARQL execution, the SPARQL engine is clearly a factor,
but also, for example, the nestedness of OPTIONALs can be a factor, or the
number of triples in a basic graph pattern, etc. These factors are varied over
different "levels". The key to understanding why this can be efficient is that each
variation need not occur in a separate experiment. Therefore, for the SPARQL
language, many combinations of factors can be studied by carefully designing
queries to cover different factors, and a formalism called "resolution" has been
developed to classify how well this has been achieved, partly answering the question
of evaluation methodology for this part of the thesis. We should also validate
by comparing this benchmark with the conclusions from existing benchmarks.
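As a small, hedged illustration with invented factor names and levels, the following sketch generates a two-level 2^(3-1) fractional factorial design in which the third factor is confounded with the interaction of the first two, so half of the full factorial suffices:

```python
# Sketch: a 2^(3-1) resolution III design (defining relation I = ABC).
from itertools import product

levels = {
    -1: {"engine": "A", "optional_nesting": "flat",   "bgp_size": 2},
    +1: {"engine": "B", "optional_nesting": "nested", "bgp_size": 8},
}

runs = []
for a, b in product((-1, +1), repeat=2):
    c = a * b                                   # third factor set by I = ABC
    runs.append({"engine": levels[a]["engine"],
                 "optional_nesting": levels[b]["optional_nesting"],
                 "bgp_size": levels[c]["bgp_size"]})

for run in runs:
    print(run)
# 4 runs instead of 8 cover all three main effects; being resolution III, the
# main effects are aliased with two-factor interactions.
```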
3 See http://shootout.alioth.debian.org/ for an example.
4 Cited in http://en.wikipedia.org/wiki/Factorial_experiment
References
1. Buil-Aranda, C., Arenas, M., Corcho, O.: Semantics and Optimization of the
SPARQL 1.1 Federation Extension. In: Antoniou, G., Grobelnik, M., Simperl, E.,
Parsia, B., Plexousakis, D., De Leenheer, P., Pan, J. (eds.) ESWC 2011, Part II.
LNCS, vol. 6644, pp. 1–15. Springer, Heidelberg (2011)
2. Duan, S., Kementsietsidis, A., Srinivas, K., Udrea, O.: Apples and oranges: a com-
parison of RDF benchmarks and real RDF datasets. In: Proc. of the 2011 Int. Conf.
on Management of Data, SIGMOD 2011, pp. 145–156. ACM (2011)
3. Getoor, L., Taskar, B., Koller, D.: Selectivity estimation using probabilistic models.
In: Proc. of the 2001 ACM SIGMOD Int. Conf. on Management of Data, SIGMOD
2001, pp. 461–472. ACM (2001)
4. Görlitz, O., Staab, S.: Federated Data Management and Query Optimization for
Linked Open Data. In: Vakali, A., Jain, L.C. (eds.) New Directions in Web Data
Management 1. SCI, vol. 331, pp. 109–137. Springer, Heidelberg (2011)
5. Harth, A., Hose, K., Karnstedt, M., Polleres, A., Sattler, K.-U., Umbrich, J.: Data
summaries for on-demand queries over linked data. In: Proc. of the 19th Int. Conf.
on World Wide Web, WWW 2010, pp. 411–420. ACM (2010)
6. Morsey, M., Lehmann, J., Auer, S., Ngomo, A.-C.N.: DBpedia SPARQL Benchmark
– Performance Assessment with Real Queries on Real Data. In: Aroyo, L., Welty, C.,
Alani, H., Taylor, J., Bernstein, A., Kagal, L., Noy, N., Blomqvist, E. (eds.) ISWC
2011, Part I. LNCS, vol. 7031, pp. 454–469. Springer, Heidelberg (2011)
7. Schmidt, M., Görlitz, O., Haase, P., Ladwig, G., Schwarte, A., Tran, T.: FedBench:
A Benchmark Suite for Federated Semantic Data Query Processing. In: Aroyo, L.,
Welty, C., Alani, H., Taylor, J., Bernstein, A., Kagal, L., Noy, N., Blomqvist, E.
(eds.) ISWC 2011, Part I. LNCS, vol. 7031, pp. 585–600. Springer, Heidelberg (2011)
8. Schwarte, A., Haase, P., Hose, K., Schenkel, R., Schmidt, M.: FedX: Optimization
Techniques for Federated Query Processing on Linked Data. In: Aroyo, L., Welty,
C., Alani, H., Taylor, J., Bernstein, A., Kagal, L., Noy, N., Blomqvist, E. (eds.)
ISWC 2011, Part I. LNCS, vol. 7031, pp. 601–616. Springer, Heidelberg (2011)
A Reuse-Based Lightweight Method for Developing
Linked Data Ontologies and Vocabularies
María Poveda-Villalón
Abstract. The uptake of Linked Data (LD) has promoted the proliferation of
datasets and their associated ontologies for describing different domains. Particular
LD development characteristics, such as agility and web-based architecture,
necessitate the revision, adaptation, and lightening of existing methodologies
for ontology development. This thesis proposes a lightweight method for ontology
development in an LD context which will be based on data-driven agile
development, the reuse of existing resources, and the evaluation of the obtained
products considering both classical ontological engineering principles and LD
characteristics.
The Linked Data (LD) initiative enables the easy exposure, sharing, and connecting of
data on the Web. Datasets in different domains are being increasingly published ac-
cording to LD principles1. In order to realize the notion of LD, not only must the data
be available in a standard format, but concepts and relationships among datasets must
be defined by means of ontologies or vocabularies2.
New vocabularies to model data to be exposed as Linked Data should be created
and published when the existing and broadly used ontologies do not cover all the data
intended for publication. Based on the guidelines for developing and publishing LD
[5], LD practitioners should describe their data (a) by reusing as many terms as possible
from the vocabularies already published and (b) by creating new terms when the
available vocabularies do not model all the data that must be represented. During
this apparently simple process, several questions may arise for a data publisher.
This PhD thesis proposal aims to develop a lightweight method to guide LD
1 http://www.w3.org/DesignIssues/LinkedData.html
2 At this moment there is no clear division between what is referred to as "vocabularies" and
"ontologies" (http://www.w3.org/standards/semanticweb/ontology). For this reason, we use
both terms interchangeably in this paper.
practitioners through the process of creating a vocabulary to represent their data. The
ambition is to maintain the advantages whilst offering solutions to cover the
insufficiencies. The proposed method will be mainly based on reusing widely deployed
vocabularies, describing data by means of answering the following questions:
3 Proposed Approach
This PhD thesis proposal investigates how traditional and heavy methodologies for
the development of ontologies and ontology networks could be lightened and adapted
to an LD context by considering its particular requirements. A lightweight method for
[Figure 1 depicts the proposed workflow – term extraction/search, selection, integration, use and evaluation, and publication – annotated with the technological contributions (access to and reuse of ontologies and vocabularies already used in the LD cloud; the OOPS! – OntOlogy Pitfall Scanner!, http://www.oeg-upm.net/oops) and the methodological contributions (vocabulary assessment based on LD principles and knowledge modelling criteria; guidelines for reuse; guidelines for development according to LD principles and avoiding mistakes) attached to each step.]
Fig. 1. Lightweight method for building Linked Data ontologies and vocabularies
3 http://www.oeg-upm.net/
5 Conclusion
Describing data by means of vocabularies or ontologies is crucial for the Semantic
Web and LD realization. LD development characteristics such as agility and web-
based architecture force the revision and lightening of existing methodologies for
ontology development. This paper briefly presents the motivation and the proposed
approach of the thesis, the main goal of which is to propose a lightweight method for
ontology development in an LD context following a data-driven approach. Such a
method will be developed together with technological support to ease its application,
and will be based on agile development and the evaluation of the obtained products,
considering both classical ontological engineering principles and LD characteristics.
The next steps consist of analyzing particular characteristics of LD developments
and proposing a first prototype both for the method and its technological support.
Following this, the obtained results will be evaluated in order to improve them in an
iterative way.
Acknowledgments. This work has been supported by the Spanish project BabelData
(TIN2010-17550). I would also like to thank Asunción Gómez-Pérez and Mari Car-
men Suárez-Figueroa for their advice and dedication and Edward Beamer for his
comments and revisions.
References
1. Auer, S.: RapidOWL - an Agile Knowledge Engineering Methodology. In: STICA 2006,
Manchester, UK (2006)
2. Fernández-López, M., Gómez-Pérez, A., Juristo, N.: METHONTOLOGY: From Ontolog-
ical Art Towards Ontological Engineering. In: Spring Symposium on Ontological Engi-
neering of AAAI, pp. 33–40. Stanford University, California (1997)
3. Gómez-Pérez, A., Fernández-López, M., Corcho, O.: Ontological Engineering. Advanced
Information and Knowledge Processing. Springer (November 2003) ISBN 1-85233-551-3
4. Gruninger, M., Fox, M.S.: The role of competency questions in enterprise engineering. In:
Proceedings of the IFIP WG5.7 Workshop on Benchmarking - Theory and Practice,
Trondheim, Norway (1994)
5. Heath, T., Bizer, C.: Linked data: Evolving the Web into a global data space, 1st edn.
Morgan & Claypool (2011)
6. Hristozova, M., Sterling, L.: An eXtreme Method for Developing Lightweight Ontologies.
CEUR Workshop Series (2002)
7. Pinto, H.S., Tempich, C., Staab, S.: DILIGENT: Towards a fine-grained methodology for
DIstributed, Loosely-controlled and evolvInG Engineering of oNTologies. In: de Manta-
ras, R.L., Saitta, L. (eds.) Proceedings of the ECAI 2004, August 22-27, pp. 393–397. IOS
Press, Valencia (2004) ISBN: 1-58603-452-9, ISSN: 0922-6389
8. Presutti, V., Daga, E., Gangemi, A., Blomqvist, E.: eXtreme Design with Content Ontolo-
gy Design Patterns. In: WOP 2009 (2009)
9. Staab, S., Schnurr, H.P., Studer, R., Sure, Y.: Knowledge Processes and Ontologies. IEEE
Intelligent Systems 16(1), 26–34 (2001)
10. Suárez-Figueroa, M.C.: Doctoral Thesis: NeOn Methodology for Building Ontology Net-
works: Specification, Scheduling and Reuse. Universidad Politécnica de Madrid, Spain
(2010)
Optimising XML–RDF Data Integration
A Formal Approach to Improve XSPARQL Efficiency
Stefan Bischof
Efficiency is important when transforming data between XML and RDF. Evaluating
complex XSPARQL queries, i.e., queries containing nested graph patterns,
shows slow query response times with the current prototype.1 The performance
impact can be explained by the architecture, which rewrites XSPARQL queries to
XQuery queries containing external calls to a SPARQL engine. The main advantages
of such an implementation are the reuse of state-of-the-art query optimisation
as well as access to standard XML databases and triple stores. But for complex
XSPARQL queries a large number of (mutually similar) SPARQL queries
is generated, resulting in a major performance impact [3, 4, 5]. Listing 1 contains
a nested SparqlForClause in line 3, which depends on the specific customer ID
($id). The implementation issues a separate SPARQL query for each customer.
Simple join reordering [3, 4] improves query answering performance, but there is
still a performance gap between simple flat queries and complex nested queries;
thus optimisations on a more fundamental level are needed.
The formalism of the XQuery semantics specification [8], i.e., natural semantics, is
not well suited for concisely expressing query equivalences. As opposed to Relational
Algebra, which serves as the basis for query languages like SPARQL, natural
semantics uses calculus-like rules to specify type inference and evaluation
semantics. Since the XSPARQL semantics reuses the formalism of XQuery, a
concise description of possible optimisations is inhibited by this formalism.
To find and express new optimisations and prove their correctness, we need
a more suitable formalism. Finding such a formalism is the first goal of the
presented PhD topic. Like other formalisms used for query optimisation, we
restrict ourselves to a core fragment of the query language. We thus propose an
integrated formal model of an XSPARQL core fragment called XS.
Section 2 gives an overview of related work and describes open problems. Sec-
tion 3 explains the approach we propose which involves creating a core formal-
ism to express different kinds of optimisations by query equivalence and rules
for cost-based optimisations. Lastly Sect. 4 outlines the research methodology
for addressing the efficiency problem by a formal approach.
2 Related Work
Some published approaches to combine XML and RDF data use either XML or
RDF query languages or a combination. But none of these approaches is advanced
enough to address “cross-format” optimisation of such transformations. In general
such optimisations could be implemented by translating queries completely to a
1 An online demo and the source code are available at http://xsparql.deri.org/
single existing query language, such as XQuery, and shift optimisation and query
evaluation to an XQuery engine. Another approach to implement a combined
query language is to build an integrated evaluation engine from scratch.
4 Research Methodology
Finding novel optimisations for querying across formats requires several tasks:
literature search, formalisation, theoretical verification, prototyping, practical
evaluation and implementation of a relevant use case. These tasks are not
executed strictly in sequence, but rather in several iterative refinement steps.
Literature Search. For clearly defining the research problem and ensuring nov-
elty of the approach, we are currently performing an extensive literature
search. Data integration and data exchange are recent topics attracting re-
searchers from different domains. Integrating data from different representations,
however, has not received broad attention. A second topic currently under
investigation is finding an appropriate formalisation for XS.
Core Fragment Isolation and Optimisation. We will isolate an XSPARQL
core fragment which allows expressing practically relevant queries and holds
optimisation potential. Optimisations are defined by query equivalences.
Theoretical Evaluation. The core fragment allows theoretical correctness veri-
fication. Using XS we will prove theoretical properties such as complexity
bounds and query/expression equivalences.
Prototype. A prototype implementation is needed to devise practical evalu-
ations and to implement a demonstration use case. The prototype will be
built using state-of-the-art technology but should still be kept simple enough
to allow quick adaptation. We also plan to publish concrete use case solutions
to show the feasibility and relevance of our approach in practice.
Practical Evaluation. In previous work we proposed the benchmark suite
XMarkRDF [4] to measure the performance of RDF to XML data trans-
formation. XMarkRDF is derived from the XQuery benchmark suite XMark.
References
1. Beeri, C., Tzaban, Y.: SAL: An Algebra for Semistructured Data and XML. In:
WebDB (Informal Proceedings) 1999, pp. 37–42 (June 1999)
2. Benedikt, M., Koch, C.: From XQuery to Relational Logics. ACM Trans. Database
Syst. 34(4), 25:1–25:48 (2009)
3. Bischof, S., Decker, S., Krennwallner, T., Lopes, N., Polleres, A.: Mapping
between RDF and XML with XSPARQL. Tech. rep., DERI (March 2011),
http://www.deri.ie/fileadmin/documents/DERI-TR-2011-04-04.pdf
4. Bischof, S., Decker, S., Krennwallner, T., Lopes, N., Polleres, A.: Mapping between
RDF and XML with XSPARQL (2011) (under submission)
5. Bischof, S., Lopes, N., Polleres, A.: Improve Efficiency of Mapping Data between
XML and RDF with XSPARQL. In: Rudolph, S., Gutierrez, C. (eds.) RR 2011.
LNCS, vol. 6902, pp. 232–237. Springer, Heidelberg (2011)
6. ten Cate, B., Lutz, C.: The Complexity of Query Containment in Expressive Frag-
ments of XPath 2.0. J. ACM 56(6), 31:1–31:48 (2009)
7. ten Cate, B., Marx, M.: Axiomatizing the Logical Core of XPath 2.0. Theor. Comp.
Sys. 44(4), 561–589 (2009)
8. Draper, D., Fankhauser, P., Fernández, M., Malhotra, A., Rose, K., Rys, M., Siméon,
J., Wadler, P.: XQuery 1.0 and XPath 2.0 Formal Semantics, 2nd edn. W3C Re-
commendation, http://www.w3.org/TR/2010/REC-xquery-semantics-20101214/
9. Fischer, P.M., Florescu, D., Kaufmann, M., Kossmann, D.: Translating SPARQL
and SQL to XQuery. In: XML Prague 2011, pp. 81–98 (March 2011)
10. Gottlob, G., Koch, C., Pichler, R.: Efficient Algorithms for Processing XPath Quer-
ies. ACM Trans. Database Syst. 30, 444–491 (2005)
11. Gottlob, G., Koch, C., Pichler, R., Segoufin, L.: The Complexity of XPath Query
Evaluation and XML Typing. J. ACM 52(2), 284–335 (2005)
12. Groppe, S., Groppe, J., Linnemann, V., Kukulenz, D., Hoeller, N., Reinke, C.: Em-
bedding SPARQL into XQuery/XSLT. In: SAC 2008, pp. 2271–2278. ACM (2008)
13. Jagadish, H.V., Lakshmanan, L.V.S., Srivastava, D., Thompson, K.: TAX: A Tree
Algebra for XML. In: Ghelli, G., Grahne, G. (eds.) DBPL 2001. LNCS, vol. 2397,
pp. 149–164. Springer, Heidelberg (2002)
14. Zhang, X., Pielech, B., Rundensteiner, E.A.: Honey, I Shrunk the XQuery!: an XML
Algebra Optimization Approach. In: WIDM 2002, pp. 15–22. ACM (2002)
Software Architectures for Scalable Ontology
Networks
Alessandro Adamou1,2
1 Alma Mater Studiorum Università di Bologna, Italy
2 ISTC, National Research Council, Italy
1 Introduction
The discipline and methodologies of ontology engineering are gradually shifting
away from the monolithic notion of ontologies. Current practices and empirical
sciences see reuse as a key criterion for constructing knowledge models today,
so that networking-related concepts are being applied to interconnectivity be-
tween ontologies. As a consequence, the notion of ontology network has begun to
surface. An ontology network is created either at design time (e.g. the engineer
adds OWL import statements and reuses imported entities) or at a later stage
(e.g. someone discovers alignments between multiple ontologies agnostic to each
other, creates an ontology with alignment statements and sets up a top ontology
that imports all of them). Our research concentrates on exploiting the latter
scenario while still accommodating the former.
Establishing ontology networks can have a number of advantages, the most
apparent ones being related to redundancy minimisation and reasoning. Interconnecting
ontology modules avoids re-defining basic or shared entities, can
augment knowledge on the given entities, and produces even more inferred knowledge
when reasoners are run on the network. This is, however, where reasoning
performance issues kick in. A Description Logic (DL) reasoner classifies a whole knowledge
structure non-selectively, and whether the process is incremental depends on
the reasoner implementation. However, if a reasoner is being run for a specific
task, portions of this knowledge structure could be unnecessary and inflate the
computation to produce results that are of no use to the task at hand. Let us
consider the social network domain for an example. If the task is to infer a trust
network of users based on their activity, there is hardly any point in including
the GeoSpecies ontology that classifies life form species1 , which could however
be part of the knowledge base. If, however, another task is to infer an affinity
network, GeoSpecies could be considered for assessing affinity between zoologists.
There are also considerations to be made on the network substructure. The
OWL language relies upon import statements to enforce dependency resolution
to a certain extent, and as a de facto standard mechanism it has to be respected.
However, import statements are static artifacts that imply strict dependencies
identified at design time; therefore, the dynamic usage of such OWL constructs
has to be delegated to a software platform that serves or aggregates ontologies.
We argue it should be combined with an OWL 2-compliant versioning scheme.
RQ1. How can a single software framework create ontology networks on top of
a common knowledge base, in order to serve them for different processing tasks?
We call these ontology networks virtual, in that it is the framework that can
create, rewrite and break ontology linkage at runtime, as required by a specific
instance-reasoning task, without affecting the logical axioms in the ontologies.
Instance data may vary across simultaneous users of a system, but also with
time (e.g. only a daily snapshot of data feeds, or the whole history of an ABox
for a domain). This variability implies multiple simultaneous virtual networks. A
TBox, on the other hand, can comprise consolidated schemas and vocabularies,
which it would be preposterous to copy across ontology networks. However,
these are generally the ones with the greater impact on reasoner performance, and
users should be able to exclude unneeded parts of them from their networks.
We must also make sure that (i) changes in a user-restricted ABox do not affect
another user’s ABox, and (ii) the memory footprint of ontology networks in the
framework is negligible compared to the combined size of the knowledge base.
RQ3. Can the constructs and limits of standard ontology languages be followed
and exploited to this end, without altering the logical structure of an ontology?
Our goal with RQ3 is to make sure that, whatever objects the framework intro-
duces, they can always be exported to legal OWL 2 artifacts; that they do not
require yet another annotation schema; that they do not force the addition of
logical axioms or the interpretation of existing ones in the original ontologies.
1 GeoSpecies knowledge base, http://lod.geospecies.org
3 Related Work
the corresponding RDF graph into the same session day-by-day. A session
takes ownership of any non-versioned ontologies loaded into it. Although not
intended for persistence, it can last across multiple HTTP sessions over time.
– Scope. A "realm" for all the ontologies that model a given domain or context.
For instance, a "social networks" scope can reference FOAF, SIOC,
alignments between these, and other related custom models. One or more
scopes can be attached to one or more sessions at once, thereby realizing the
virtual network model. Each scope is divided into two spaces. The core space
contains the ontologies that provide immutable knowledge, such as foundational
ontologies. The custom space extends the core space with additional
knowledge such as alignments with controlled vocabularies. Note that one
occurrence of the same ontology can be shared across multiple spaces.
When these artifacts are exported as OWL ontologies, linkage relations are ma-
terialised as owl:imports statements forged for each ontology network, while
owl:versionIRIs are used by these artifacts to either “claim ownership” of the
ontologies they manage or share them (e.g. TBoxes) across networks.
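The sketch below, written with rdflib and invented IRIs rather than Stanbol's actual naming scheme, illustrates how such an export could materialise linkage as owl:imports statements and claim ownership through an owl:versionIRI:

```python
# Sketch: export a virtual ontology network as a legal OWL 2 artifact.
from rdflib import Graph, URIRef, Namespace
from rdflib.namespace import RDF

OWL = Namespace("http://www.w3.org/2002/07/owl#")

def export_network(network_iri, version_iri, member_iris):
    g = Graph()
    net = URIRef(network_iri)
    g.add((net, RDF.type, OWL.Ontology))
    g.add((net, OWL.versionIRI, URIRef(version_iri)))   # "ownership" per network
    for member in member_iris:
        g.add((net, OWL.imports, URIRef(member)))       # linkage materialised as imports
    return g

network = export_network(
    "http://example.org/scope/social-networks",
    "http://example.org/scope/social-networks/session-42",
    ["http://xmlns.com/foaf/0.1/", "http://rdfs.org/sioc/ns#"],
)
print(network.serialize(format="turtle"))
```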
This ontology network management architecture has been implemented as
part of the Apache Stanbol service platform for semantic content management3.
Since Stanbol is an extensible framework, any plugin can set up and
manage its own ontology scopes and sessions. We contributed several components
to the Stanbol ontology manager package.
At the time of writing, Stanbol is undergoing its release candidate phase. Our
contribution also comes as part of the Interactive Knowledge Stack project4 .
5 Validation
With the reference implementation of our framework in place, we are moving
on to the phase of validating it on use cases from the content management do-
main. Small and medium enterprises have committed to adopt Apache Stanbol.
The main use case is provided by a company working in content curation for
3 Apache Stanbol, http://incubator.apache.org/stanbol
4 Interactive Knowledge Stack (IKS), http://code.google.com/p/iks-project/
the visual arts domain. There, a Fedora Commons digital object repository5 is
used to maintain image metadata, which are interconnected with a SKOS-based
representation of the Getty ULAN repository6 . OntoNet and reasoners will be
employed to produce knowledge enrichments over the standard Fedora metadata
vocabulary and present different user interface views on them based on different
ontology network configurations. In another use case, OntoNet will be used to
manage simultaneous content hierarchies from a shared repository. Scopes are
selected depending on the knowledge domains that each user decides to activate,
while sessions contain the metadata of user-owned and shared content items.
At the same time, we are developing benchmarking tools to evaluate the
computational efficiency of the system as a whole. Benchmarking will consider
(i) the amount and complexity of different networks that can be setup at the
same time on the same knowledge base, their memory usage and the minimum
Java VM size required; (ii) the overhead given by loading the same ontology
into multiple scopes or sessions; (iii) the duration of DL classification runs on
an ontology network large enough to deliver the expected inferences, compared
against standard DL reasoners called over the non-pruned knowledge base. The
possible size of the knowledge base per se will not be measured as it is strictly
bound to the storage mechanism employed. To date, system effectiveness has
been assessed through unit testing, and stress tests show that we can set up a
scope on a ∼200 MiB ontology using a VM ∼1.2 times that size.
5 Fedora Commons, http://fedora-commons.org/
6 Union List of Artist Names, http://www.getty.edu/research/tools/vocabularies/ulan/
Identifying Complex Semantic Matches
Brian Walshe
1 Motivation
When organizations wish to work together, integrating their data is usually a signifi-
cant challenge. As each organization’s data resources have often been developed in-
dependently, they are subject to heterogeneity – primarily semantic, syntactic and
structural. The use of ontologies has long been identified as an effective way of facili-
tating the integration process, seeing use in diverse fields, including for example, bio-
medicine and high-energy physics [1].
Suitable techniques for discovering and describing matches between ontologies are
still very much an open problem [2]. Matching ontologies is a demanding and
error-prone task. Not only does it require expert knowledge of the matching process, but
also an in-depth understanding of the subject the ontologies describe, and typically
knowledge of the principles used to construct the ontologies. The ontologies can differ
in scope, granularity and coverage; they can use different paradigms to represent
similar concepts, or use different modeling conventions [3]. Therefore, producing
high-quality correspondences is typically semi-automated – an expert user approves and
refines match candidates produced by an automatic semantic matcher tool [4].
Semantic matcher tools typically focus on detecting schema-level equivalence rela-
tions. However elementary correspondences between named ontology elements are
often not sufficient for tasks such as query rewriting, instance translation, or instance
1 http://blog.dbpedia.org/2011/09/11/dbpedia-37-released-including-15-localized-editions/
The INRIA Ontology Alignment API [6] provides a comprehensive set of automated
match generators, drawn from the methods described by Shvaiko and Euzenat [8]. It
also provides a format for describing matches, so that they may be exchanged.
Current semantic matching tools focus on discovering equivalence (≡) relation-
ships, and less commonly less general (⊑), more general (⊒) and disjointness (≢)
relationships. Schemes exist for classifying more detailed forms of relationship;
these include Correspondence Patterns [5] and the THALIA framework [10]. The
EDOAL [6] language provides a method for describing more complicated correspondences,
using an OWL-like syntax, and was developed in conjunction with correspondence
patterns. As yet, there are few matchers capable of generating the complex
matches that EDOAL allows [6]. Ritze et al. [11] describe a first attempt at a process for
detecting complex matches, and Sváb-Zamazal et al. [12] provide a set of pattern-based
tools for describing and managing these matches by relating them to ontology
design principles.
The field of Machine Learning provides many tools which could be of use in ana-
lyzing matches between the less structured elements found in Linked Data. The Weka
suite [13] provides a range of these tools as both an API and as part of a GUI. These
include attribute selection tools such as Information Gain, and regression analysis
tools.
Many of the techniques that will be investigated rely on analyzing instance
data contained in the ontologies, and one possible limitation is that these techniques
will not be applicable if the ontologies being mapped do not have suitable instance
data. Because of this, there will be a focus on Linked Data, as Linked Data sources
contain large amounts of instance data and often overlap to some extent. A further
obstacle to this research will be measuring the effectiveness of the matching
techniques against a standard test set such as those used by the OAEI [14].
The methodology of this PhD consists of a literature review to establish the current
state of the art in semantic matching tools, methods for their evaluation, and formats
available for describing alignments. Following from this, techniques for detecting
restriction-type Correspondence Patterns such as Class by Attribute Value and Class
by Attribute Existence, and for converting them to complex alignments, will be developed.
Techniques for analyzing Property Value Transformation correspondence patterns by
regression or other means shall also be developed. A suitable test suite to evaluate
these techniques will be required, which should be made public to allow comparison
with other matchers capable of producing complex matches.
To date a tool has been developed which is capable of refining class matches to in-
clude declarative restrictions. An experiment was carried out to evaluate the ability of
this system, and the results are described in the following section.
4 Current Work
An evaluation was carried out of the ability of a system to refine elementary corres-
pondences between the class Person in DBpedia [7] and several classes of occupation
in YAGO [15] to produce complex correspondences. Using matched instances of
these classes that had been identified in both sources, attribute selection was used to
test if restriction type matches were occurring between the classes, and determine
which attribute these restrictions were dependent on. This evaluation demonstrated
that the Information Gain measure is a suitable scoring function for finding the
attribute and attribute value to condition matches of type Class by Attribute Value and
Class by Attribute Existence. This Information Gain function allowed us to reliably
select a gold-standard correspondence pattern – consisting of an attribute and a value to
condition on – as the top result in two of the four mappings tested, and to reliably return
the best correspondence pattern within the top five results for all four test cases. In the
cases where the search algorithm did not return the best correspondence pattern as the
top result, this was because there were several patterns that could be considered valid
and selecting the “best” among these was difficult. The IG score requires that a suita-
ble training set of instances be used, and the evaluation demonstrated that a random
sample of 30 instances was sufficient to rank the gold standard attribute and attribute-
value pair in the top 5 results at least 89% of the time. The results of this experiment
have been accepted for publication at the DANMS 2012 workshop.
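For illustration only, the following sketch uses synthetic instances, not the actual evaluation data, to show the kind of Information Gain scoring used to rank (attribute, value) tests for Class by Attribute Value candidates:

```python
# Sketch: score how well the test "instance[attribute] == value" separates
# instances of the target class from the rest.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(instances, labels, attribute, value):
    split = [inst.get(attribute) == value for inst in instances]
    gain = entropy(labels)
    for flag in (True, False):
        subset = [lab for lab, s in zip(labels, split) if s == flag]
        if subset:
            gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

instances = [{"occupation": "scientist"}, {"occupation": "scientist"},
             {"occupation": "actor"}, {"occupation": "actor"}]
labels = ["Scientist", "Scientist", "Other", "Other"]   # class membership in the other ontology
print(information_gain(instances, labels, "occupation", "scientist"))   # 1.0: a perfect split
```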
References
1. Weidman, S., Arrison, T.: Steps toward large-scale data integration in the sciences: sum-
mary of a workshop. National Academy of Sciences, 13 (2010)
2. Shvaiko, P., Euzenat, J.: Ontology matching: state of the art and future challenges. IEEE
Transactions on Knowledge and Data Engineering (preprint, 2012)
3. Klein, M.: Combining and relating ontologies: an analysis of problems and solutions. In:
Proc. of Workshop on Ontologies and Information Sharing, IJCAI, pp. 53–62 (2001)
4. O’Sullivan, D.: The OISIN framework: ontology interoperability in support of semantic in-
teroperability. PhD thesis, Trinity College Dublin (2006)
5. Scharffe, F.: Correspondence patterns representation. PhD thesis, University of Innsbruck
(2009)
6. David, J., Euzenat, J., Scharffe, F., Trojahn dos Santos, C.: The alignment API 4.0. Seman-
tic Web 2(1), 3–10 (2011)
7. Bizer, C., et al.: DBpedia - A crystallization point for the Web of Data. Web Semantics:
Science, Services and Agents on the World Wide Web (7), 154–165 (2009)
8. Shvaiko, P., Euzenat, J.: A Survey of Schema-Based Matching Approaches. In: Spaccapie-
tra, S. (ed.) Journal on Data Semantics IV. LNCS, vol. 3730, pp. 146–171. Springer, Hei-
delberg (2005)
9. Falconer, S.M., Storey, M.-A.: A Cognitive Support Framework for Ontology Mapping.
In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J.,
Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ASWC
2007 and ISWC 2007. LNCS, vol. 4825, pp. 114–127. Springer, Heidelberg (2007)
10. Hammer, J., Stonebraker, M., Topsakal, O.: THALIA: Test Harness for the Assessment of
Legacy Information Integration Approaches. In: Proc. of the Int. Conference on Data En-
gineering (ICDE), pp. 485–486 (2005)
11. Ritze, D., Meilicke, C., Sváb-Zamazal, O., Stuckenschmidt, H.: A pattern-based ontology
matching approach for detecting complex correspondences. In: Proc. of Int. Workshop on
Ontology Matching, OM (2009)
12. Sváb-Zamazal, O., Daga, E., Dudás, M.: Tools for Pattern-Based Transformation of OWL
Ontologies. In: Proc. of Int. Semantic Web Conference (2011)
13. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA
data mining software: an update. ACM SIGKDD Explorations Newsletter 11(1), 10–18
(2009)
14. Euzenat, J., et al.: First results of the Ontology Alignment Evaluation Initiative 2011. In:
Proc. of 6th Int. Workshop on Ontology Matching, OM 2011 (2011)
15. Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: A large ontology from wikipedia and
wordnet. Web Semantics: Science, Services and Agents on the World Wide Web 6(3),
203–217 (2008)
Data Linking with Ontology Alignment
Zhengjie Fan
Nowadays, countless linked data sets are published on the Web. They are written
with respect to different ontologies. Linking resources in these various data sets
is the key to achieving a web of data. However, it is impossible for people to
interlink them manually. Thus, many methods have been proposed to link these data
sets together. Here, I propose a data linking method based on ontology matching,
which can automatically link data sets from different domains. Formally, data
linking is an operation whose input is two collections of data and whose output is
a collection of links between entities of the two collections, where each link
expresses a binary relation between entities that correspond semantically to each other [4]. So
the research problem tackled here is: given two RDF data sets,
find all possible "owl:sameAs" links between them automatically and
correctly. My work is part of the Datalift1 project, which aims to build a platform
for data publishing. It is made of several modules, such as vocabulary selection,
format conversion, interconnection, and the infrastructure to host linked data
sets. My work is to build the interconnection module, the last step of Datalift,
that is, linking RDF data sets.
The paper is organized as follows. First, the state of the art on data linking
is briefly analyzed in Section 2. Then in Section 3, a data linking method is
introduced. Finally, Section 4 outlines the planned research methodology.
1 http://datalift.org/
efficiently help comparing data? How far can ontology matching enhance
linking speed and accuracy? What is its main advantage over other techniques?
What are its limitations? In which cases can it not efficiently help link RDF
data sets?
3 Proposed Approach
As a well-known data linking tool, SILK [2] is designed to execute the data linking
process, following manually written scripts that specify which class and
property values should be compared, as well as which comparison method is
used. I therefore plan to realize the data linking process by transforming alignments
into SILK scripts; the data linking process can then be executed by SILK.
The data linking method proposed here is illustrated in Fig. 1. Suppose there
are two data sets to be linked. First, their vocabularies, namespaces or ontology
URIs are sent to an Alignment Server, which is an alignment storage [3],
in order to check whether an ontology alignment is available. EDOAL2 [10] is an
expressive language for expressing correspondences between entities from different
ontologies: it can express not only correspondences between classes and attributes,
but also complex correspondences with restrictions. If there is an alignment and it is
written in EDOAL, I directly produce a SILK script. If it is not written in EDOAL,
then keys are computed for each concept, either with the TANE algorithm [12],
which finds functional and approximate dependencies between the properties of data
sets, or from the coverage and discriminability rate of the properties. If the Alignment
Server does not contain any alignment, the vocabularies are retrieved and an ontology
matcher is used to generate an alignment between the data sets' ontologies. The
linking method introduced here therefore has limitations when no correspondences or
vocabularies are available: extra time is needed to compute the data sets' ontologies
and the ontology alignment.
At this early stage of my PhD research, I simplify the data linking method to
"extracting correspondence information from an alignment written in EDOAL to
generate a SILK script". Once this is successfully done, keys will be taken into
consideration to complete the picture.
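As a rough sketch of the coverage/discriminability option mentioned above (toy data; property names are illustrative and the real module would read RDF), candidate key properties can be ranked as follows:

```python
# Sketch: rank candidate key properties by coverage (how many instances carry
# the property) and discriminability (how many distinct values it takes).
def coverage(instances, prop):
    return sum(1 for inst in instances if prop in inst) / len(instances)

def discriminability(instances, prop):
    values = [inst[prop] for inst in instances if prop in inst]
    return len(set(values)) / len(values) if values else 0.0

instances = [{"name": "Paris", "country": "France"},
             {"name": "Lyon",  "country": "France"},
             {"name": "Nice"}]
for prop in ("name", "country"):
    print(prop, round(coverage(instances, prop), 2),
          round(discriminability(instances, prop), 2))
# "name" covers every instance and is fully discriminating, so it is the better
# key for comparing resources across the two data sets.
```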
Fig. 1. Workflow of the proposed data linking method: query the Alignment Server for an alignment; if an EDOAL alignment with linking information exists, generate a SILK script directly; otherwise compute keys for each class of both data sets (in increasing order, by coverage/discriminability rate or the TANE algorithm); if no alignment exists, retrieve or build the ontologies (e.g. via SPARQL queries such as SELECT DISTINCT ?p WHERE {?s ?p ?c} and SELECT DISTINCT ?c WHERE {?s rdf:type ?c}) and match them; finally, optimize the results by choosing the best script
References
1. Araújo, S., Hidders, J., Schwabe, D., de Vries, A.P.: SERIMI - Resource Description
Similarity, RDF Instance Matching and Interlinking. CoRR. abs/1107.1104 (2011)
2. Bizer, C., Volz, J., Kobilarov, G., Gaedke, M.: Silk - A Link Discovery Framework
for the Web of Data. CEUR Workshop Proceedings, vol. 538, pp. 1–6 (2009)
3. David, J., Euzenat, J., Scharffe, F., Trojahn dos Santos, C.: The Alignment API
4.0. Semantic Web Journal 2(1), 3–10 (2011)
4. Ferrara, A., Nikolov, A., Scharffe, F.: Data linking for the Semantic Web. Interna-
tional Journal of Semantic Web in Information Systems 7(3), 46–76 (2011)
5. Hu, W., Chen, J., Qu, Y.: A Self-Training Approach for Resolving Object Coreference
on the Semantic Web. In: Proceedings of WWW 2011, pp. 87–96. ACM (2011)
6. Isele, R., Bizer, C.: Learning Linkage Rules using Genetic Programming. In: OM
2011. CEUR Workshop Proceedings, vol. 814 (2011)
7. Ngonga Ngomo, A.-C., Lehmann, J., Auer, S., Höffner, K.: RAVEN - Active Learning
of Link Specifications. In: OM 2011. CEUR Workshop Proceedings, vol. 814 (2011)
8. Nikolov, A., Uren, V.S., Motta, E., Roeck, A.N.D.: Handling Instance Coreferenc-
ing in the KnoFuss Architecture. In: IRSW 2008. CEUR Workshop Proceedings,
vol. 422, pp. 265–274 (2008)
9. Raimond, Y., Sutton, C., Sandler, M.: Automatic Interlinking of Music Datasets
on the Semantic Web. In: LDOW 2008. CEUR Workshop Proceedings, vol. 369
(2008)
10. Scharffe, F.: Correspondence Patterns Representation. PhD thesis, University of
Innsbruck (2009)
11. Song, D., Heflin, J.: Automatically Generating Data Linkages Using a Domain-
Independent Candidate Selection Approach. In: Aroyo, L., Welty, C., Alani, H.,
Taylor, J., Bernstein, A., Kagal, L., Noy, N., Blomqvist, E. (eds.) ISWC 2011, Part
I. LNCS, vol. 7031, pp. 649–664. Springer, Heidelberg (2011)
12. Huhtala, Y., Kärkkäinen, J., Porkka, P., Toivonen, H.: TANE: An Efficient Algo-
rithm for Discovering Functional and Approximate Dependencies. The Computer
Journal 42(2), 100–111 (1999)
A Semantic Policy Sharing Infrastructure
for Pervasive Communities
Vikash Kumar
can be in the form of policies that the resident of an apartment creates for energy
saving and/or better user experience in his/her respective home.
In the m-commerce use case, we aim to use telecommunications (telco) specific
information like identity, location, etc. in providing personalized advertisements
(ads) to users based on their own preferences set in the form of policies. The
proposed infrastructure would enable them to get better recommendations
by sharing efficient policies among acquaintances in a Web 3.0 environment.
The main research question to be investigated in this thesis is: For the same
application, can a semantic policy created for one set of physical, environmental
and contextual conditions and settings be effectively translated and applied to
a different set of settings while preserving its core idea? Along the way,
this work will also show how to translate the benefits achieved (in terms of
cost/energy savings, etc.) by application of a policy from one environment to
another.
2 Related Work
There has been considerable interest in the areas of rule interchange and profile
matching in the Semantic Web community. RuleML [6] (Rule Markup Language)
was the first initiative aimed at creating a unifying family of XML-serialized
rule languages covering all Web rules. The REWERSE I1 Rule Markup
Language (R2ML) furthered this cause by proposing a comprehensive XML rule
format integrating languages like OCL, SWRL and RuleML [10].
W3C launched the Rule Interchange Format (RIF) [5] working group in 2005,
tasked with producing a core rule language with which rules can be represented
across all systems. The RIF framework for rule-based languages consists of a set
of dialects which formally describe the syntax, semantics and
XML serialization of a language. A semantics-enabled layered policy architecture
has been proposed in [3] as an extension of W3C’s Semantic Web architecture
aimed at facilitating the exchange and management of policies created in multiple
languages across the web.
Several projects have tried to utilize the benefits of smart meters and build
applications and services around data collected by them [9,11,12]. Several others
have suggested the use of mobile-specific enablers for providing privacy-aware
services based on semantic rules for mobile users [13,14].
While this thesis takes inspiration from existing work on rule interchange
and sharing, the unique feature of the proposed infrastructure
will be the translation of a policy created in one set of conditions into one for
another set of conditions for the "same" application, in a privacy-aware manner.
The focus will therefore not be on application or language independence as
proposed in other approaches.
(possibly RIF-BLD [5]) which could easily be adapted from one array of set-
tings to another. Thereafter, for the translation, we need to first identify the
static and dynamic parameters of a rule and then replace the dynamic ones with
those matching the new set of conditions. For example, user X ’s policy: “Send
me all offers for iPad3 from Saturn electronic store” will be translated for user
Y as “Send me all offers for Samsung Galaxy tab from Amazon” where Y is
interested in electronic ads but his/her profile shows an inclination for Samsung
products (rather than Apple) and the shopping behavior shows transactions
mostly from Amazon (rather than Saturn). The italicized and underlined parts
represent the static and dynamic contents of the rule, respectively. Other
inferences affecting this translation, obtained by reasoning over the user profile and
domain ontology, the user's online behavior and system policies, may be facts such as:
iPad3 is a device in the Tablet category, while Saturn is a shop in the Electronic
Store category. A suitable technique for such a matching of ontology concepts needs
to be developed. Another ontology containing the meta-policy information and
other rules governing policy translation would be a part of this infrastructure.
A central repository (Figure 1) will collect all the user policies annotated with
their respective metadata containing information about their static and dynamic
parameters, the perceived quantifiable advantage(s) achieved by their applica-
tion, etc. It will also contain user and environment data like profile, preferences,
temperature, etc. that will substitute the dynamic parameters of a rule.
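A minimal sketch of the static/dynamic split described above; the template string, profile fields and the plain-string representation are invented for illustration, whereas the real infrastructure would operate on semantic policies and reason over ontologies:

```python
# Sketch: keep the static rule skeleton and re-bind the dynamic slots from the
# target user's inferred preferences.
POLICY_TEMPLATE = "Send me all offers for {product} from {shop}"   # static skeleton

user_x_bindings = {"product": "iPad3", "shop": "Saturn"}            # original dynamic slots

def translate(template, target_profile):
    """Re-bind the dynamic parameters for another user."""
    return template.format(product=target_profile["preferred_product"],
                           shop=target_profile["preferred_shop"])

user_y_profile = {"preferred_product": "Samsung Galaxy Tab",        # inferred from profile
                  "preferred_shop": "Amazon"}                       # inferred from behavior
print(POLICY_TEMPLATE.format(**user_x_bindings))    # user X's policy
print(translate(POLICY_TEMPLATE, user_y_profile))   # translated for user Y
```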
4 Preliminary Results
At the end of this thesis, we expect to have a working prototype of the policy
translation infrastructure for the mentioned use cases. The focus of the smart home
use case would be to allow sharing of effective energy-saving policies in a resident
community, and the evaluations would investigate the semantic similarity
of the translated policy along with the precision of the calculated savings data. The
m-commerce use case will primarily aim at using sharable policies as a tool for
preserving user privacy while still being able to infer useful information from
users' publicly shared data for ad recommendations. Evaluations will be based
on the relevance of the ads served under the original and the translated policies.
5 Conclusions
In this paper, we introduced the idea of a semantic policy translation infrastruc-
ture and described some related work which would form a starting point of the
research carried out in this thesis. Thereafter, we also mentioned some prelimi-
nary results of the work so far. According to the methodology shown in Section
3, the next major steps are to complete the policy translation engine, context
management system as well as the policy recommendation tool. Finally, we in-
tend to test our hypothesis through extensive user tests of the infrastructure
proposed in this paper.
References
1. Toninelli, A., Jeffrey, B.M., Kagal, L., Montanari, R.: Rule-based and Ontology-
based Policies: Toward a hybrid approach to control agents in pervasive environ-
ments. In: Semantic Web and Policy Workshop (2005)
2. Grosof, B.N.: Representing E-Business Rules for the Semantic Web: Situated Cour-
teous Logic Programs in RuleML. In: Workshop on Information Technologies and
Systems, WITS 2001 (2001)
3. Hu, Y.J., Boley, H.: SemPIF: A Semantic meta-Policy Interchange Format for Mul-
tiple Web Policies. In: Proc. of the 2010 IEEE/WIC/ACM International Conference
on Web Intelligence and Intelligent Agent Technology (WI-IAT 2010), vol. 1, pp.
302–307. IEEE Computer Society, Washington, DC (2010)
4. The SESAME-S Project, http://sesame-s.ftw.at/
5. Rule Interchange Format, http://www.w3.org/2005/rules/
6. RuleML, http://www.ruleml.org/
7. Zhdanova, A.V., Zeiß, J., Dantcheva, A., Gabner, R., Bessler, S.: A Semantic Policy
Management Environment for End-Users and Its Empirical Study. In: Schaffert, S.,
Tochtermann, K., Auer, S., Pellegrini, T. (eds.) Networked Knowledge - Networked
Media. SCI, vol. 221, pp. 249–267. Springer, Heidelberg (2009)
8. The APSINT Project, http://www.apsint.ftw.at/
9. Kumar, V., Tomic, S., Pellegrini, T., Fensel, A., Mayrhofer, R.: User Created
Machine-Readable Policies for Energy Efficiency in Smart Homes. In: Proc. of the
Ubiquitous Computing for Sustainable Energy (UCSE 2010) Workshop at the 12th
ACM International Conference on Ubiquitous Computing, UbiComp 2010 (2010)
10. Bădică, C., Giurca, A., Wagner, G.: Using Rules and R2ML for Modeling Nego-
tiation Mechanisms in E-Commerce Agent Systems. In: Draheim, D., Weber, G.
(eds.) TEAA 2006. LNCS, vol. 4473, pp. 84–99. Springer, Heidelberg (2007)
11. Möller, S., Krebber, J., Raake, E., Smeele, P., Rajman, M., Melichar, M., Pallotta,
V., Tsakou, G., Kladis, B., Vovos, A., Hoonhout, J., Schuchardt, D., Fakotakis,
N., Ganchev, T., Potamitis, I.: INSPIRE: Evaluation of a Smart-Home System
for Infotainment Management and Device Control. In: Proc. of the LREC 2004
International Conference, Lisbon, Portugal, pp. 1603–1606 (2004)
12. Kamilaris, A., Trifa, V., Pitsillides, A.: HomeWeb: An application framework for
Web-based smart homes. In: 18th International Conference on Telecommunications
(ICT), pp. 134–139 (2011)
13. Gandon, F.L., Sadeh, N.M.: Semantic web technologies to reconcile privacy and
context awareness. Journal of Web Semantics 1, 241–260 (2004)
14. Toninelli, A., Montanari, R., Kagal, L., Lassila, O.: A Semantic Context-Aware
Access Control Framework for Secure Collaborations in Pervasive Computing En-
vironments. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika,
P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 473–486.
Springer, Heidelberg (2006)
Involving Domain Experts in Ontology Construction:
A Template Based Approach
1 Introduction
The availability of knowledge encoded in computer-readable forms, and expressed ac-
cording to precise conceptual models and formal languages such as the ones provided
by ontology languages, is an important pillar for the provision of flexible and better
integrated ways of handling content, production processes, and knowledge capital of
organisations and enterprises.
In spite of the efforts and progress made in the area of knowledge elicitation and
modelling, methodologies and tools available nowadays are mainly tailored to Knowl-
edge Engineers [1], that is, people who know how to create formal conceptualisations of
a domain, but do not know the domain to be modelled. These tools are instead scarcely
usable by Domain Experts, that is, the people who know the organisation’s capital,
but often do not have any skills in formal model creation. As a result the interaction
between Knowledge Engineers and Domain Experts is regulated by rigid iterative wa-
terfall paradigms which make the process of producing and revising good quality on-
tologies too complex and expensive for the needs of business enterprises.
The work of the MoKi [2] project1 aims at producing a Web2.0 tool able to support
an active and agile collaboration between Knowledge Engineers and Domain Experts,
as well as the mining of the organisation’s content, to facilitate the production of good
quality formal models. The work of this thesis aims at improving MoKi in supporting a
more agile construction of good quality ontologies by encouraging Domain Experts to
actively participate in the construction of the formal part of the model. More in detail
it aims at exploring: (i) how templates, based on precise characterisations provided
by Top Level or specific Domain ontologies, can be used to describe knowledge at a
(semi-)formal level, and (ii) how Ontology Design Patterns (ODPs) can be used to reuse
existing knowledge without having to know the details of the underlying languages nor
the ODPs in all their detail. We explore and apply our ideas to the construction of
Enterprise Models [3], which provide the use case for this thesis.
Enterprise Modelling Approaches: The work in [4] presents a methodology called
Enterprise Knowledge Development Method (EKD), a participative approach to enterprise
modelling. In this approach, stakeholders meet in modelling sessions to create models
collaboratively and document them on large plastic sheets using paper cards. It requires
the extra effort of transferring the models from plastic walls to computer-based models
by modelling technicians. The work in [2] proposed a collaborative framework and a
tool called MoKi that supports the creation of articulated enterprise models through
structured wiki pages. MoKi enables heterogeneous teams of experts, with different
knowledge engineering skills, to actively collaborate in a modelling process. A comparison
of MoKi and some other wiki-based tools is contained in [2]; due to lack of
space we do not elaborate further on the comparison.
Ontology Engineering Approaches: There has been some work on capturing organisational aspects through the use of an enterprise ontology. In [5], a preliminary proposal of an ontology of organisations based on the DOLCE ontology [6] was presented. The authors propose that the ontological analysis of an organisation is the first fundamental step to build a precise and rigorous enterprise model. DOLCE is a top-level ontology describing very general concepts which are independent of a particular problem or domain, so some support is needed for Domain Experts to work with them; such support is missing in [5]. A tutoring methodology called TMEO, based on the ontological distinctions embedded in the OWL-Lite version of DOLCE, was proposed in [7] to guide humans in the elicitation of ontological knowledge. In [8], a controlled-natural-language-based approach was introduced to involve Domain Experts in the ontology construction process. This approach requires that the Domain Experts learn the controlled language before starting to model, and it only works for certain languages (e.g., English). An approach that attempts to enrich ontologies by finding partial instantiations of Content ODPs and refining the ontology using the axiomatised knowledge contained in the ODPs is found in [9]. This approach assumes the availability of an extended kind of ODPs containing additional lexical information, and it is more oriented toward Ontology Engineers.
Fig. 1. (a) Foundational ontology based Template generation. (b) ODP based Template generation.
characterise the main entities with DL axioms, provide a mechanism that transforms the logical characterisation into templates (for the specification of sub-concepts or individuals), and guide the user through the different fields of the templates by providing lists of possible entities suitable to fill them in. Figure 2 shows a portion of the Organization template and its instantiation, to illustrate how these templates can be used to model real-world entities. As shown in Figure 2, the arrows represent the instantiation of the Organization template to a particular organization, "FIAT", along with all its structural information and properties, such as "Audit Team" being one of the teams in "FIAT" and "John Elkann" being the director of the organization.
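To make the template idea concrete, the sketch below shows how such a template could be represented and instantiated in code; the data structures and field names (hasTeam, hasDirector) are invented for illustration and do not reflect MoKi's actual data model, which is based on structured wiki pages.

# Illustrative sketch (not MoKi's implementation): a template derived from the
# logical characterisation of "Organization", with typed fields that a Domain
# Expert fills in to obtain a (semi-)formal description of a concrete entity.
from dataclasses import dataclass, field

@dataclass
class TemplateField:
    name: str            # e.g. "hasDirector"
    expected_type: str   # category of entities allowed to fill the field
    values: list = field(default_factory=list)

@dataclass
class Template:
    concept: str   # the ontology concept the template describes
    fields: dict   # field name -> TemplateField

    def instantiate(self, individual, **assignments):
        # Copy the empty fields and fill them with the values chosen by the user.
        filled = {n: TemplateField(f.name, f.expected_type, list(f.values))
                  for n, f in self.fields.items()}
        for name, value in assignments.items():
            if name not in filled:
                raise KeyError("Unknown template field: " + name)
            filled[name].values.append(value)
        return {"individual": individual, "concept": self.concept, "fields": filled}

# Organization template with two of its structural fields (names are made up).
organization = Template(
    concept="Organization",
    fields={
        "hasTeam": TemplateField("hasTeam", expected_type="Team"),
        "hasDirector": TemplateField("hasDirector", expected_type="Person"),
    },
)

# The instantiation sketched in Figure 2: FIAT with its audit team and director.
fiat = organization.instantiate("FIAT", hasTeam="Audit Team", hasDirector="John Elkann")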
Concerning the second question, we plan to use a Natural Language Question Answering (QA) system like TMEO [7] to help the user in selecting the correct category (and therefore template) to which the entity belongs. As shown in Figure 1a, the Domain Expert asks the system for assistance in selecting the right template for the concept. The QA system asks the Domain Expert a list of questions and, depending on the answers (Yes/No) given by the Domain Expert, the system proposes a template.
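A minimal sketch of such a dialogue is shown below, assuming a hard-coded tree of yes/no questions; both the questions and the resulting categories are invented for illustration and are not TMEO's actual questions.

# Sketch of a TMEO-style yes/no dialogue (questions and categories are invented):
# each answer narrows down the ontological category, and the selected category
# determines which template is proposed to the Domain Expert.
DECISION_TREE = {
    "question": "Can the entity you want to model take part in events over time?",
    "yes": {
        "question": "Is it composed of people playing roles (members, teams, ...)?",
        "yes": "Organization",   # propose the Organization template
        "no": "Person",
    },
    "no": {
        "question": "Is it something that happens, with a beginning and an end?",
        "yes": "Event",
        "no": "InformationObject",
    },
}

def propose_template(node, answer_fn):
    # Walk the yes/no tree until a leaf (a category, hence a template) is reached.
    while isinstance(node, dict):
        answer = answer_fn(node["question"])  # expects "yes" or "no"
        node = node["yes" if answer == "yes" else "no"]
    return node

# Example run with canned answers instead of an interactive Domain Expert.
canned = iter(["yes", "yes"])
print(propose_template(DECISION_TREE, lambda q: next(canned)))  # -> Organization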
Helping Domain Experts via Ontology Design Patterns: Our goal in this part is to dis-
cover whether some piece of knowledge that the Domain Expert needs to model can be
efficiently modelled through the use of ontology design patterns (ODPs). ODPs provide
repositories of already formalised knowledge and can be used as building blocks in on-
tology design, as shown in [10]. The initial line of research will be to explore if ODPs
can be detected directly from competency questions. Through this approach, we intend
to answer the following questions:
1. How can Content ODPs help Domain Experts to model complex distinctions in an
ontology?
We aim at answering this question by providing a mechanism for Domain Experts to specify their requirements as competency questions and to select the Content ODPs covering those competency questions. In this direction, we have conducted a small experiment [11] on the detection of patterns in ontologies. However, that approach was fairly naive, considering only concepts in the matching process. The concepts in patterns are abstract enough to match multiple competency questions, so the matching process will require more than matching concepts: it will also match relations
because they are the major indicators of pattern instantiation. As shown in Figure 1b, the process of identifying ODPs starts by taking input from Domain Experts in the form of competency questions and matching it against the Content ODP repository, to see whether there exists a complete match for this input or whether it partially instantiates some ODP. In the latter case, a recommendation is given to the user suggesting to add the missing concepts and properties.
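The sketch below illustrates the intended matching step in a strongly simplified form: the pattern signatures, the extracted terms, and the coverage score are all invented for illustration, and a real implementation would rely on NLP and ontology matching rather than plain set overlap.

# Simplified sketch of matching a competency question against an ODP repository:
# each pattern is reduced to a "signature" of concepts and relations, coverage is
# computed against the terms extracted from the question, and a partial match
# yields a recommendation listing the missing elements.
ODP_REPOSITORY = {
    # pattern name -> (concepts, relations); toy signatures for illustration
    "ParticipantRole": ({"Agent", "Role", "Event"}, {"hasParticipant", "hasRole"}),
    "PartOf": ({"Entity"}, {"hasPart", "isPartOf"}),
}

def match_competency_question(concepts, relations):
    # Return, per pattern, the coverage score and the missing concepts/relations.
    results = []
    for name, (p_concepts, p_relations) in ODP_REPOSITORY.items():
        covered = len(concepts & p_concepts) + len(relations & p_relations)
        total = len(p_concepts) + len(p_relations)
        results.append({
            "pattern": name,
            "coverage": covered / total,
            "missing": (p_concepts - concepts) | (p_relations - relations),
        })
    return sorted(results, key=lambda r: r["coverage"], reverse=True)

# Terms (manually) extracted from "Which agents participate in which events?"
best = match_competency_question({"Agent", "Event"}, {"hasParticipant"})[0]
if best["coverage"] == 1.0:
    print("Complete match:", best["pattern"])
else:
    print("Partial match:", best["pattern"], "- consider adding", best["missing"])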
References
1. Noy, N.F., McGuinness, D.L.: Ontology development 101: A guide to creating your first
ontology (2001)
2. Ghidini, C., Rospocher, M., Serafini, L.: Conceptual modeling in wikis: a reference architec-
ture and a tool. In: eKNOW, pp. 128–135 (2012)
3. Fox, M.S., Grüninger, M.: Enterprise modeling. AI Magazine 19(3), 109–121 (1998)
4. Janis Bubenko Jr., J.S., Persson, A.: User guide of the knowledge management approach using enterprise knowledge patterns (2001), http://www.dsv.su.se/~js/ekd_user_guide.html/
5. Bottazzi, E., Ferrario, R.: Preliminaries to a DOLCE ontology of organisations. Int. J. Business Process Integration and Management 4(4), 225–238 (2009)
6. Masolo, C., Borgo, S., Gangemi, A., Guarino, N., Oltramari, A.: WonderWeb deliverable
D18 ontology library (final). Tech. Rep. (2003)
7. Oltramari, A., Vetere, G., Lenzerini, M., Gangemi, A., Guarino, N.: Senso comune. In: Pro-
ceedings of the LREC 2010, Valletta, Malta (May 2010)
8. Dimitrova, V., Denaux, R., Hart, G., Dolbear, C., Holt, I., Cohn, A.G.: Involving Domain
Experts in Authoring OWL Ontologies. In: Sheth, A.P., Staab, S., Dean, M., Paolucci, M.,
Maynard, D., Finin, T., Thirunarayan, K. (eds.) ISWC 2008. LNCS, vol. 5318, pp. 1–16.
Springer, Heidelberg (2008)
9. Nikitina, N., Rudolph, S., Blohm, S.: Refining ontologies by pattern-based completion. In:
Proceedings of the Workshop on Ontology Patterns (WOP 2009), vol. 516 (2009)
10. Gangemi, A.: Ontology Design Patterns for Semantic Web Content. In: Gil, Y., Motta, E.,
Benjamins, V.R., Musen, M.A. (eds.) ISWC 2005. LNCS, vol. 3729, pp. 262–276. Springer,
Heidelberg (2005)
11. Khan, M.T., Blomqvist, E.: Ontology design pattern detection - initial method and usage
scenarios. In: SEMAPRO, Florence, Italy (2010)
Quality Assurance in Collaboratively Created
Web Vocabularies
Christian Mader
1 Motivation
Most institutions that build and publish controlled vocabularies create them for search and retrieval purposes, with specific functionalities in mind such as query expansion or faceted search [9]. However, shortcomings in the vocabulary creation process impinge upon these functionalities, affecting the effectiveness of the systems backed by these vocabularies, e.g., in terms of recall and precision. Among the problems arising in that context are missing relations between concepts, ambiguous labeling, or lack of documentation. Furthermore, duplicate or abandoned entries, or logical contradictions, might also be introduced in the vocabulary creation process.
However, when publishing a vocabulary on the Web, additional requirements have to be taken into consideration. Contributing to the Linked Data cloud involves providing references to other data sources in order “to connect disparate data into a single global data space” [6]. With the increasing availability of vocabularies expressed in SKOS, finding and utilizing a well-accepted and well-maintained vocabulary becomes even more challenging. Furthermore, as a consequence of the ever-changing nature of the Web, resources might also become unavailable, introducing the problem of “broken links”.
1 Simple Knowledge Organization System, http://www.w3.org/2004/02/skos/
All of these issues can be subsumed under the term “controlled vocabulary quality”. It is important for various reasons: as mentioned above, quality assurance measures primarily aim to improve search and retrieval use cases, since this has traditionally been a very common motivation for the creation of controlled vocabularies. However, they can also serve to enhance the usage experience for human users who directly interact with the vocabulary itself, e.g., to get an overview of the covered domain or to incorporate changes.
Especially in open linked environments, vocabulary quality is crucial for ac-
ceptance of a vocabulary by others, which in turn is a key concept of the Linked
Data principles. Once published as Linked Data, controlled vocabularies can
and should be referenced, enhanced, and reused, and with the “building blocks”
being of high quality, this is expected to happen to a much greater extent.
Research questions addressed in the proposed approach encompass: (i) what
does “vocabulary quality” mean in open, collaboratively maintained environments
and how can it be measured? (ii) how can quality assessment be integrated with col-
laborative vocabulary development environments? (iii) how does vocabulary qual-
ity assessment affect the quality of collaboratively created vocabularies?
2 Related Work
Existing standards for thesaurus construction [2,10] and manuals [3,7] propose
guidelines and best practices for testing and evaluating controlled vocabularies.
Many of them are hardly suitable for automatic assessment because additional
knowledge about the creation process, target user group or intended usage would
be required. [1] mentions vaguely formulated guidelines such as the inclusion of “all needed facets” or adherence of the term form to “common usage”, whereas others, like “both BT and RT relationships occur between the same pair of terms” [3], are more precise and better suited for algorithmic evaluation. However, these guidelines are not specific to a concrete representation (e.g., SKOS) or form of publication (e.g., Linked Data, relational database, hardcopy).
In [8], Kless & Milton provide a list of measurement constructs for the intrinsic quality of thesauri, examining a thesaurus as an artifact in itself, i.e., in isolation from any application scenario. As stated by the authors themselves, the constructs (e.g., “Conceptual clarity” or “Syntactical correctness”) are “solely based on theoretical analysis”, and their application to existing thesauri is subject to future work. Although undeniably useful for assessment by humans, the constructs are not accompanied by algorithmic methods to measure them. Furthermore, multilinguality and, since the paper focuses on intrinsic quality metrics, collaborative aspects of the creation process were not taken into consideration.
In the field of ontology engineering, metrics have been developed to evaluate and validate ontologies [5,11]. Common to these metrics is the fact that they are designed to be applied to general ontologies and instance data. As a consequence, they do not address the specific requirements of controlled vocabulary development, and the applicability of these metrics for measuring vocabulary quality is still unclear.
3 Proposed Approach
In recent years, SKOS has been adopted by many organizations as a technol-
ogy for expressing vocabularies on the Web in a machine-readable format. As
a consequence, our approach focuses on processing vocabularies represented in
the SKOS language.
The overall goal of the approach is to ensure iterative improvement of a controlled vocabulary's quality in a collaborative development process. The “View” in Figure 1 represents the contributors taking part in the collaborative process. At the core of the work is the “Quality Controller” component, which is based on a catalog of quality criteria (cf. Table 1) and acts as a proxy between view and model, managing quality assessment, user notification, and concurrency issues. Upon instantiation, the quality controller is parameterized with a vocabulary (the model). Every contributor has to register at the quality controller by providing contact information and gets her own “working copy” of the vocabulary. The quality of this working copy is analyzed on every relevant modification. Based on this analysis, notification messages containing information about quality issues are created. These messages are then disseminated to the contributor, who can then decide to fix the issues or keep the current state of the vocabulary. Eventually the changes of the contributor are synchronized with the model.
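The sketch below outlines this workflow; class and method names are illustrative only (they do not correspond to the actual implementation), and concurrency handling, which the Quality Controller is also responsible for, is omitted.

# Rough sketch of the Quality Controller workflow described above: contributors
# register, edit a private working copy, are notified about quality issues found
# by the criteria catalog, and eventually merge their changes back into the model.
import copy

class QualityController:
    def __init__(self, vocabulary, criteria):
        self.model = vocabulary    # the shared vocabulary (the model)
        self.criteria = criteria   # catalog of quality checks (callables)
        self.working_copies = {}

    def register(self, contributor):
        # Give the contributor her own working copy of the vocabulary.
        self.working_copies[contributor] = copy.deepcopy(self.model)
        return self.working_copies[contributor]

    def on_modification(self, contributor):
        # Analyze the working copy and return notifications about quality issues.
        working_copy = self.working_copies[contributor]
        issues = [check(working_copy) for check in self.criteria]
        return [issue for issue in issues if issue]   # drop checks that passed

    def synchronize(self, contributor):
        # Naively merge the contributor's changes back into the shared model.
        self.model = copy.deepcopy(self.working_copies[contributor])

# Example: a toy criterion flagging concepts without a preferred label.
def missing_pref_labels(vocab):
    missing = [c for c, labels in vocab.items() if not labels.get("prefLabel")]
    return "Concepts without prefLabel: %s" % missing if missing else None

qc = QualityController({"ex:Concept1": {}}, criteria=[missing_pref_labels])
qc.register("contributor@example.org")
print(qc.on_modification("contributor@example.org"))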
That a vocabulary exhibiting issues for each criterion could be found on the Web indicates the practical relevance of this catalog.
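To illustrate how criteria of this kind can be evaluated programmatically, the following sketch checks two simple criteria over a SKOS file with rdflib; it is only an illustration of the idea (the file name is a placeholder and the checks are deliberately naive), not the API of an existing library.

# Illustrative check of two simple quality criteria over a SKOS vocabulary using
# rdflib: concepts without a skos:prefLabel, and "orphan" concepts that have no
# outgoing broader/narrower/related link. The file name is a placeholder.
from rdflib import Graph
from rdflib.namespace import RDF, SKOS

RELATION_PROPS = [SKOS.broader, SKOS.narrower, SKOS.related]

def check_vocabulary(path):
    g = Graph()
    g.parse(path, format="turtle")

    unlabeled, orphans = [], []
    for concept in g.subjects(RDF.type, SKOS.Concept):
        if g.value(concept, SKOS.prefLabel) is None:
            unlabeled.append(concept)
        if not any(True for p in RELATION_PROPS for _ in g.objects(concept, p)):
            orphans.append(concept)
    return {"missing prefLabel": unlabeled, "orphan concepts": orphans}

if __name__ == "__main__":
    for criterion, concepts in check_vocabulary("vocabulary.ttl").items():
        print(criterion + ":", len(concepts), "concept(s)")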
4 Research Methodology
In a first step, as a problem definition and state-of-the-art survey, existing publications targeting data quality, vocabulary and thesaurus development, as well as ontology building principles, will be reviewed. It is important to find out to what extent quality criteria exist in these areas and to elaborate on their importance to controlled vocabularies. Based on the findings of the first step, we propose a set of general quality criteria for controlled vocabularies. The result of this step is a list of criteria together with algorithms that allow for programmatic evaluation. After that, implementation of the tools will be started, i.e., a library (API) that computes metrics based on the quality criteria. In the course of an analysis step, existing vocabularies available on the Web will be evaluated against the quality criteria. Community feedback collected in this step might lead to adjusting and reformulating the quality criteria, which targets research question (i). To
References
1. Proceedings of ACM/IEEE 2003 Joint Conf. on Digital Libraries (JCDL 2003),
Houston, Texas, USA, May 27-31. IEEE Computer Society (2003)
2. Information and documentation – thesauri and interoperability with other vocab-
ularies – part 1: Thesauri for information retrieval. Norm (Draft) ISO 25964-1, Int.
Org. for Standardization, Geneva, Switzerland (2011)
3. Aitchison, J., Gilchrist, A., Bawden, D.: Thesaurus construction and use: a prac-
tical manual. Aslib IMI (2000)
4. de Coronado, S., et al.: The NCI Thesaurus quality assurance life cycle. Journal of Biomedical Informatics 42(3), 530–539 (2009)
5. Gangemi, A., Catenacci, C., Ciaramita, M., Lehmann, J.: Ontology evaluation and
validation: an integrated formal model for the quality diagnostic task. Tech. rep.,
Lab. of Applied Ontologies – CNR, Rome, Italy (2005)
6. Heath, T., Bizer, C.: Linked Data: Evolving the Web into a Global Data Space,
1st edn. Morgan & Claypool (2011), http://linkeddatabook.com/
7. Hedden, H.: The accidental taxonomist. Information Today (2010)
8. Kless, D., Milton, S.: Towards Quality Measures for Evaluating Thesauri. In:
Sánchez-Alonso, S., Athanasiadis, I.N. (eds.) MTSR 2010. CCIS, vol. 108, pp.
312–319. Springer, Heidelberg (2010)
9. Nagy, H., Pellegrini, T., Mader, C.: Exploring structural differences in thesauri for SKOS-based applications. In: I-Semantics 2011, pp. 187–190. ACM (2011)
10. NISO: ANSI/NISO Z39.19 - Guidelines for the Construction, Format, and Man-
agement of Monolingual Controlled Vocabularies (2005)
11. Tartir, S., Arpinar, I.B.: Ontology evaluation and ranking using OntoQA. In: ICSC 2007, pp. 185–192 (2007)
3 The source code can be downloaded from https://github.com/cmader/qSKOS/
4 https://github.com/cmader/qSKOS4rb/raw/master/results/qskos_results.ods
Author Index