Decomposing Federated Queries in presence of
Replicated Fragments
Gabriela Montoya, Hala Skaf-Molli, Pascal Molli, Maria-Esther Vidal
To cite this version:
Gabriela Montoya, Hala Skaf-Molli, Pascal Molli, Maria-Esther Vidal. Decomposing Federated Queries
in presence of Replicated Fragments. Web Semantics: Science, Services and Agents on the World Wide
Web, Elsevier, 2017. hal-01663116
HAL Id: hal-01663116
https://hal.archives-ouvertes.fr/hal-01663116
Submitted on 13 Dec 2017
Decomposing Federated Queries in presence of Replicated Fragments

Gabriela Montoya a,c,*, Hala Skaf-Molli a, Pascal Molli a, Maria-Esther Vidal b,d

a LINA – UFR de Sciences et Techniques, Nantes University, 2 rue de la Houssinière, 44322 Nantes Cedex 3, France
b Universidad Simón Bolívar, Baruta, Edo. Miranda, Apartado 89000 Cable Unibolivar, Caracas, Venezuela
c Department of Computer Science, Aalborg University, Selma Lagerlöfsvej 300, DK-9220 Aalborg Øst, Denmark
d Fraunhofer IAIS, Schloss Birlinghoven, 53757 Sankt Augustin, Germany
Abstract
Federated query engines allow for linked data consumption using SPARQL endpoints. Replicating data fragments
from different sources enables data re-organization and provides the basis for more effective and efficient federated query
processing. However, existing federated query engines are not designed to support replication. In this paper, we propose
a replication-aware framework named LILAC, sparqL query decomposItion against federations of repLicAted data
sourCes, that relies on replicated fragment descriptions to accurately identify sources that provide replicated data. We
define the query decomposition problem with fragment replication (QDP-FR). QDP-FR corresponds to the problem
of finding the sub-queries to be sent to the endpoints that allow the federated query engine to compute the query
answer, while the number of tuples to be transferred from endpoints to the federated query engine is minimized. An
approximation of QDP-FR is implemented by the LILAC replication-aware query decomposition algorithm. Further,
LILAC techniques have been included in the state-of-the-art federated query engines FedX and ANAPSID to evaluate
the benefits of the proposed source selection and query decomposition techniques in different engines. Experimental
results suggest that LILAC efficiently solves QDP-FR and is able to reduce the number of transferred tuples and the
execution time of the studied engines.
Keywords: Linked Data, Federated Query Processing, Query Decomposition, Fragment Replication
1. Introduction
Billions of RDF triples have been made accessible
through SPARQL endpoints by data providers.1 Recent studies reveal the unreliability and unavailability of existing public SPARQL endpoints [4]. According to the SPARQLES monitoring system [32], fewer than a third of the 545 studied public endpoints exhibit an availability rate of 99–100% (values for November 2015).
Traditionally in distributed databases, fragmentation and replication techniques have been used to improve data availability [24]. Distributed database administrators are able to design the fragmentation and replication schema according to the applications and the expected workload. The Linking Open Data (LOD) cloud [7] datasets, in contrast, are published by autonomous data providers; hence, a fragmentation and replication schema cannot be designed. Clearly, any data provider can partially or totally replicate datasets from other data providers. The LOD Cloud Cache SPARQL endpoint2 is an example of an endpoint that provides access to total replicas of several datasets. DBpedia live3 allows a third party to replicate DBpedia live changes in almost real-time. Data consumers may also replicate RDF datasets for efficient and reliable execution of their applications. However, given the size of the LOD cloud datasets, data consumers may replicate just subsets of RDF datasets, or replicated fragments, in a way that their applications can be efficiently executed. Partial replication allows for speeding up query execution time. Partial replication can be facilitated by data providers, e.g., DBpedia 2016-044 consists of over seventy dump files, each of them providing different fragments of the same dataset, or can be facilitated by third-party systems. Publish-Subscribe systems such as sparqlPuSH [25] or the iRap RDF Update Propagation Framework [11] allow partial replication of datasets. Additionally, data consumers are also autonomous and can declare federations composed of any set of SPARQL endpoints to execute their federated queries.

* Corresponding author.
Email addresses: gabriela.montoya@univ-nantes.fr (Gabriela Montoya), hala.skaf@univ-nantes.fr (Hala Skaf-Molli), pascal.molli@univ-nantes.fr (Pascal Molli), mvidal@ldc.usb.ve (Maria-Esther Vidal)
1 http://stats.lod2.eu
2 http://lod2.openlinksw.com/sparql
3 http://live.dbpedia.org/
4 http://downloads.dbpedia.org/2016-04/core-i18n/en/
Consequently, partial or total replication can exist in
federations of SPARQL endpoints; replicated data can
be at different levels of consistency [17]; and a federated
query engine has to be aware of the replication at runtime
in order to efficiently produce correct answers. On one
hand, if a federated query engine is unaware of data
replication, the engine performance may be negatively
impacted whenever replicated data is collected from the
available sources [22, 28]. On the other hand, if a federated
query engine is aware of data replication, sources can
be efficiently selected [16, 28] and data localities created
by replication can be exploited [22] to significantly speed
up federated query processing. These data localities,
created by endpoints with replicated fragments from
different datasets but relevant for federated queries, are
not attainable by data providers.
Exploiting replicas can be beneficial. However, replicating data has the intrinsic problem of ensuring data consistency. Traditionally in distributed databases, replicas
can be strongly consistent thanks to distributed transactions [24]. However, in the case of the Web, there is
no mechanism to ensure that all the available data are
strongly consistent. Regarding consistency of replicas, we
have identified three main scenarios.
• First, if specific dataset versions are replicated, then the replicas are always perfectly synchronized, e.g., a replica of DBpedia 3.8 is always perfectly synchronized with DBpedia 3.8. This case is especially pertinent when a federated query is defined on a particular version of the available datasets in order to ensure reproducible results.

• Second, if the replicated data is locally modified, the local data is no longer a replica. For instance, if processes of data quality assessment are performed by a data consumer on a replica of DBpedia 3.8, changes to the replica need to be evaluated based on the trustworthiness of the data consumer. Clearly, data consumers have to change their federation according to which sources they trust.

• Third, if the most up-to-date dataset versions are used, replicas may be unsynchronized during query execution because strong consistency cannot be ensured in the context of LOD cloud datasets. Therefore, it is possible that some of the replicas used to evaluate the query have not integrated all the latest changes before queries are executed. This third scenario can be handled by measuring the replica divergence and the divergence incurred by the sources used to evaluate the query; out-of-date replicas can be pruned during source selection as proposed in [21].

In this paper, for the sake of simplicity, we work under the assumption that replicas are perfectly synchronized as in the first scenario, and focus on query processing optimization under this assumption.

Query processing against sources with replicated data has been addressed in [22], while the related problem of query processing against sources with duplicated data has been tackled in [16, 28]. These three approaches prune redundant sources at source selection time. This selection may prevent the decomposer from assigning joins between a group of triple patterns to the same endpoint(s), even if this choice produces the most selective sub-query. To illustrate, consider a BGP with three triple patterns tp1, tp2, and tp3. Suppose a SPARQL endpoint C1 is relevant for tp1 and tp3, while C2 is relevant for tp1 and tp2. The source selection strategies implemented by these approaches will prevent assigning tp1.tp3 to C1 and tp1.tp2 to C2, even if these sub-queries generate fewer intermediate results. Consequently, as we show in Section 2, managing replication only at source selection time may impede a query decomposer from generating the most selective sub-queries.

In this paper, we exploit the replication knowledge during query decomposition to generate query decompositions that overcome the limitations of existing replication-aware source selection approaches. Furthermore, we formalize the query decomposition problem with fragment replication (QDP-FR). QDP-FR corresponds to the problem of finding the sub-queries to be sent to the endpoints that allow the federated query engine to compute the query answer, while the number of tuples to be transferred from the endpoints is minimized. We also propose an approximate solution to QDP-FR, called LILAC, sparqL query decomposItion against federations of repLicAted data sourCes, that decomposes SPARQL queries and ensures complete and sound query answers, while reducing the number of tuples transferred from the endpoints.

Specifically, the contributions of this paper are:

• We outline the limitations of solving the source selection problem independently of the query decomposition problem in the context of replicated and fragmented data. We propose an approach in which these two federated query processing tasks are interleaved, to support engines in finding better execution plans.

• Based on the replication-aware framework introduced in [22], we propose a query decomposition strategy that relies on this framework to exploit fragments replicated by various endpoints.

• We formalize the query decomposition problem with fragment replication (QDP-FR).

• We propose a sound and complete algorithm to solve the QDP-FR problem.

• We reduce the QDP-FR problem to the set covering problem and use existing set covering heuristics to
(a) Fragment description:
fd(f): <dbpedia, ?film dbo:director ?director>
triples(f): { dbr:A_Knight's_Tale dbo:director dbr:Brian_Helgeland,
dbr:A_Thousand_Clowns dbo:director dbr:Fred_Coe,
dbr:Alfie_(1966_film) dbo:director dbr:Lewis_Gilbert,
dbr:A_Moody_Christmas dbo:director dbr:Trent_O'Donnell,
dbr:A_Movie dbo:director dbr:Bruce_Conner, · · · }

(b) Replicated fragments exposed by the endpoints:
C1: f2, f4, f6
C2: f2, f3, f5, f7
C3: f1, f4, f5
fd(f1): <linkedmdb, ?director dbo:nationality dbr:United_Kingdom>
fd(f2): <dbpedia, ?film dbo:director ?director>
fd(f3): <linkedmdb, ?movie owl:sameAs ?film>
fd(f4): <linkedmdb, ?movie linkedmdb:genre ?genre>
fd(f5): <linkedmdb, ?genre linkedmdb:film_genre_name ?name>
fd(f6): <dbpedia, ?director dbo:nationality dbr:United_States>
fd(f7): <linkedmdb, ?movie linkedmdb:genre film_genre:14>

(c) SPARQL query Q1:
select distinct * where {
?director dbo:nationality ?nat .
?film dbo:director ?director .
?movie owl:sameAs ?film .
?movie linkedmdb:genre ?genre .
?genre linkedmdb:film_genre_name ?gname
}
and its relevant fragments and endpoints per triple pattern:
tp1: f1 (C3), f6 (C1); tp2: f2 (C1, C2); tp3: f3 (C2); tp4: f4 (C1, C3), f7 (C2); tp5: f5 (C2, C3)

Figure 1: Querying Federations with Replicated Fragments; (a) A fragment f is associated with a fragment description fd(f) = <u, tp>, i.e., an authoritative endpoint u and a triple pattern tp; triples(f) corresponds to the RDF triples of f that satisfy fd(f).tp in the dataset accessible through fd(f).u. (b) Three endpoints have replicated seven fragments from DBpedia and LinkedMDB. (c) A SPARQL query, Q1, with five triple patterns; the relevant endpoints of this query access RDF triples that match these five triple patterns.
produce good solutions to the QDP-FR problem.

• We extend the federated query engines FedX and ANAPSID to perform LILAC query decomposition, i.e., we extend the engines and create the new engines LILAC+FedX and LILAC+ANAPSID. We study the performance of these engines and compare them with the existing engines FedX, DAW+FedX, Fedra+FedX, ANAPSID, DAW+ANAPSID, and Fedra+ANAPSID. Results suggest that the query decompositions produced by LILAC contribute to reducing the number of transferred tuples and the query execution time.

The paper is organized as follows. Section 2 provides background and motivation. Section 3 defines replicated fragments and presents the query decomposition problem for fragment replication. Section 4 presents the LILAC source selection and query decomposition algorithm. Section 5 reports our experimental results. Section 6 summarizes related works. Finally, conclusions and future work are outlined in Section 7.

2. Background and Motivation

In this section, we illustrate the impact that exploiting metadata about replicated fragments has on federated query processing. First, we assume that data consumers replicate fragments composed of RDF triples that satisfy a given triple pattern; URIs in the original RDF dataset are kept for all the replicated resources. In Figure 1a, a fragment of DBpedia is illustrated. This fragment comprises RDF triples that satisfy the triple pattern ?film dbo:director ?director; the triples included in Figure 1a correspond to a copy of the DBpedia RDF triples where URIs are preserved. Fragments are described using a 2-tuple fd that indicates the authoritative source of the fragment, e.g., DBpedia; the triple pattern that is met by the triples in the fragment is also included in fd, e.g., ?film dbo:director ?director.

Figure 1b depicts a federation of three SPARQL endpoints: C1, C2, and C3; these endpoints expose seven replicated fragments from DBpedia and LinkedMDB. Replicated fragments correspond to copies of the RDF triples in DBpedia and LinkedMDB. Query processing against C1, C2, and C3 can be more efficient in terms of execution time, while query completeness can be ensured under certain conditions. To explain, consider the SPARQL query Q1 presented in Figure 1c;5 this query comprises five triple patterns: tp1, tp2, tp3, tp4, and tp5. Based on the distribution and replication of the seven fragments in Figure 1b, relevant SPARQL endpoints for these triple patterns are selected, i.e., C1 is relevant for tp1, tp2, tp4; C2 is relevant for tp2, tp3, tp4, tp5; and C3 is relevant for tp1, tp4, tp5. However, because fragments are replicated, only one endpoint could be selected to execute triple patterns tp2-tp5, e.g., C2 for tp2, tp3, and C3 for tp4, tp5.

Existing federated SPARQL engines [1, 6, 12, 26, 30] are not tailored for Linked Data replication. In consequence, none of these engines prevents redundant data from being retrieved or merged, even in the presence of federations such as the one presented in Figure 1b. The relevant endpoints for each query triple pattern of the federation in Figure 1b are given in Figure 2a. All of these relevant endpoints are included in the logical plan produced

5 The federation of SPARQL endpoints and query depicted in Figures 1b and 1c extend the example given in [22].
(a) Relevant endpoints and known containments:
tp1: f1 (C3), f6 (C1); tp2: f2 (C1, C2); tp3: f3 (C2); tp4: f4 (C1, C3), f7 (C2); tp5: f5 (C2, C3)
Known containments: triples(f7) ⊆ triples(f4)

(b) Number of transferred tuples and execution time:
Engine (query): from C1; from C2; from C3; Total; Exec. time (secs)
FedX (Q1): 113,702; 108,739; 26,364; 248,805; 549.90
Fedra+FedX (Q1): 1; 1,000,001; 158; 1,000,160; ≥1,800
FedX (Q1D): 162; 9,890; 225; 10,277; 16.90
ANAPSID (Q1): 1,036; 64,066; 41,176; 106,278; 233.57
Fedra+ANAPSID (Q1): 20,998; 10,005; 45,824; 76,827; 30.25
ANAPSID (Q1D): 1,036; 9,889; 45,824; 56,749; 20.67

(c)–(g): logical plans (join trees omitted in this text rendering)

Figure 2: Execution of Q1 (Figure 1c) on the federation defined in Figure 1b. (a) Endpoints in the federation of Figure 1b that are relevant for Q1. (b) Number of transferred tuples and query execution time (secs) for Q1; bold values correspond to the lowest number of transferred tuples and execution time for each engine. (c) FedX left-linear plan; (d) the replication-aware Fedra+FedX plan; (e) ANAPSID bushy-tree plan; (f) the replication-aware Fedra+ANAPSID plan. Both the Fedra+FedX and Fedra+ANAPSID plans delegate join execution to the endpoints and assign as many triple patterns as possible to the same endpoint. (g) Q1D is a possible decomposition of Q1.
by FedX [30] and given in Figure 2c. Some of these endpoints are pruned by the ANAPSID heuristics [23] and are not included in the logical plan given in Figure 2e. Because of the lack of descriptions of the replicated data accessible through the endpoints, FedX is not able to detect that more than one triple pattern can be exclusively evaluated by one single endpoint, while the ANAPSID heuristics allow for delegating the evaluation of just one join to the endpoints. In consequence, the query decompositions include many sub-queries with only one triple pattern. The resulting plans lead to retrieving redundant data from the endpoints. Figure 2b reports on the results of executing the FedX and ANAPSID logical plans; as expected, they generate many intermediate results that negatively impact their performance.
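The per-triple-pattern relevance used throughout this section can be derived mechanically from the fragment localities of Figure 1b. The sketch below is our own encoding of the example (the dictionary names are ours), not engine code:

```python
# Fragment localities from Figure 1b: endpoint -> replicated fragments
localities = {"C1": {"f2", "f4", "f6"},
              "C2": {"f2", "f3", "f5", "f7"},
              "C3": {"f1", "f4", "f5"}}
# Relevant fragments per triple pattern of Q1 (Figure 1c)
relevant_fragments = {"tp1": {"f1", "f6"}, "tp2": {"f2"}, "tp3": {"f3"},
                      "tp4": {"f4", "f7"}, "tp5": {"f5"}}

def relevant_endpoints(tp):
    """Endpoints exposing at least one fragment relevant to tp."""
    return {e for e, frags in localities.items()
            if frags & relevant_fragments[tp]}

for tp in sorted(relevant_fragments):
    print(tp, sorted(relevant_endpoints(tp)))
```

Running this reproduces the relevance relation of Figure 2a, e.g., tp4 is answerable at C1 and C3 (via f4) and at C2 (via f7).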
Figures 2d and 2f show the logical plans that FedX
and ANAPSID produce when the replication-aware source
selection Fedra [22] is used. The replication-aware
source selection approaches aim to prune redundant
sources for triple patterns. In this example, tp2 can be
executed on C1 and C2 and Fedra assigns tp2 to C2,
forcing the query decomposer to evaluate tp2 on C2 only.
Results reported in Figure 2b demonstrate that Fedra reduces the number of transferred tuples and improves the execution time compared to ANAPSID. Unfortunately, this is not the case for FedX: the Fedra source selection leads FedX to include sub-queries with Cartesian products that negatively impact the number of transferred tuples and the execution time.

However, for this example, there exists another decomposition of query Q1 (Figure 1c), Q1D, presented in Figure 2g. FedX and ANAPSID use different heuristics to optimize and generate physical plans for this decomposition. For decomposition Q1D, FedX transfers fewer tuples and exhibits a lower execution time than FedX or Fedra+FedX for Q1 (Figure 2b). We observe the same behavior with ANAPSID.

Q1D is a better decomposition because endpoints receive more selective sub-queries. Unlike all other decompositions, tp2 is assigned to two endpoints: to C1 in conjunction with tp1, and to C2 in conjunction with tp3 (dashed line in Figure 2g). We clearly notice the effect of sending more selective sub-queries on intermediate results and execution times.
Unfortunately, this decomposition cannot be computed by existing decomposers after solving the source selection problem with replicated fragments. Replication-aware source selection strategies prune redundant sources for a triple pattern, i.e., a triple pattern assigned to one endpoint cannot be reassigned to another. For example, in Figure 2d, Fedra assigned tp2 to C2, so it cannot assign tp2 to C1. In order to find decompositions such as Q1D, source selection and query decomposition have to be interleaved. In this paper, we propose a source selection and query decomposition strategy, called LILAC, able to produce query decompositions such as Q1D (Figure 2g).
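As stated in the contributions (Section 1), LILAC reduces QDP-FR to the set covering problem and reuses existing set covering heuristics. For background, the classical greedy heuristic can be sketched as follows; the toy instance and names below are ours and purely illustrative, not LILAC's implementation:

```python
def greedy_set_cover(universe, subsets):
    """Classical greedy heuristic: repeatedly pick the subset covering the
    most still-uncovered elements (a logarithmic-factor approximation)."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        # subset that gains the largest number of uncovered elements
        best = max(subsets, key=lambda s: len(uncovered & set(s)))
        gained = uncovered & set(best)
        if not gained:
            raise ValueError("universe cannot be covered by the given subsets")
        cover.append(best)
        uncovered -= gained
    return cover

# Toy instance: triple patterns covered by candidate endpoint-local sub-queries
universe = {"tp1", "tp2", "tp3", "tp4", "tp5"}
subsets = [{"tp1", "tp2"}, {"tp2", "tp3"}, {"tp4", "tp5"}, {"tp1"}]
print(greedy_set_cover(universe, subsets))
```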
3. Definitions and Problem Description
This section introduces definitions and the query decomposition problem with fragment replication (QDP-FR).

3.1. Definitions

Fragments of datasets are replicated preserving the URIs of the resources. A fragment is composed of a set of RDF triples; these triples may be described using the authoritative endpoint from which the triples have been replicated and the triple pattern that they match. Without loss of generality, fragments are limited to fragments that can be described using one triple pattern, as in [17, 33].

Definition 1 (Fragment Description). Given a fragment identifier f and a set of RDF triples triples(f), the description of f is a tuple fd(f) = <u, tp>, where:
• u is the non-null URI of the authoritative endpoint where the triples triples(f) are accessible;
• tp is a triple pattern matched by all the RDF triples in triples(f), and no other triple accessible through u matches it.

We assume that there is just one level of replication, i.e., one endpoint can provide access to fragments replicated from one or several authoritative endpoints, but these endpoints cannot be used as authoritative endpoints of any fragment. Additionally, our approach relies on the following further assumptions: (i) fragments are read-only and perfectly synchronized;6 (ii) SPARQL endpoints are described in terms of the fragments whose RDF triples can be accessed through the endpoint.

Assumption (i) allows us to focus on the first scenario of replicating a precise dataset version (discussed in Section 1), whose solutions can be combined with divergence metrics for scenarios where the latest dataset version is required. An initial approach to integrate divergence metrics during source selection has been proposed in [21]; it performs the source selection using the sources under a given threshold of allowed divergence.

Albeit restrictive, this replication strategy allows for accessing RDF data replicated from different authoritative endpoints using one single endpoint. Re-locating data in this way opens different data management opportunities, and provides the basis for query optimization techniques that can considerably reduce data transfer delays and query execution time, as will be observed in the experimental study results.

To illustrate the proposed replication strategy, consider the federation given in Figure 1b and the two triple patterns of query Q1 (Figure 1c) that share one variable and have the predicates dbo:director and owl:sameAs. Further, assume that the endpoint C2 is selected to process this query instead of the authoritative endpoints, DBpedia and LinkedMDB. Because the execution of the join can be delegated to C2, the number of transferred tuples can be reduced whenever not all the triples with predicate dbo:director in DBpedia have a joinable triple with predicate owl:sameAs in LinkedMDB.

The federated query engine catalog includes the descriptions of replicated fragments and the endpoints that provide access to them, i.e., the fragment localities. The catalog for the federation in Figure 1b is shown in Figure 3. The catalog is computed during the configuration/definition of the federation, before the execution of any query. Each endpoint exposes the descriptions of the fragments it has replicated; the catalog is locally created by contacting the endpoints in the federation to retrieve the descriptions of the replicated fragments. These descriptions and the fragment containment are used to identify fragments that have the same data and to obtain the fragment localities. For example, if the descriptions of fragments fa and fb are obtained from endpoints C1 and C2, and fa and fb differ only in variable names, they are identified as the same fragment; therefore, only one fragment and two endpoints are included in the catalog.

Several fragments present in the catalog may be relevant for a given triple pattern. However, not all of them need to be accessed to evaluate the triple pattern: among the relevant fragments, only the fragments that are not contained in another relevant fragment need to be accessed. For example, consider query Q1 (Figure 1c), the federation given in Figure 1b, and its catalog given in Figure 3. There are two fragments that have triples matching the triple pattern ?movie linkedmdb:genre ?genre: f4 and f7. But as all the triples in f7 are also in f4, f7 ⊆ f4 and consequently only f4 needs to be accessed.

We adapt the query containment and equivalence definition given in [14] to the case of a triple pattern query.

Definition 2 (Triple Pattern Query Containment). Let TP(D) denote the result of executing query TP over an RDF dataset D. Let TP1 and TP2 be two triple pattern queries. We say that TP1 is contained in TP2, denoted by TP1 ⊆ TP2, if for any RDF dataset D, TP1(D) ⊆ TP2(D). We say that TP1 is equivalent to TP2, denoted TP1 ≡ TP2, iff TP1 ⊆ TP2 and TP2 ⊆ TP1.

Testing the containment of two triple pattern queries7 amounts to finding a substitution of the variables in the triple patterns:8 TP1 ⊆ TP2 iff there is a substitution θ such that applying θ to TP2 returns the triple pattern query TP1. Solving the decision problem of triple pattern query containment between TP1 and TP2 requires checking whether TP1 imposes at least the same restrictions as TP2 on the subject, predicate, and object positions, i.e., TP1 should have at most the same number of unbounded variables as TP2. Therefore, testing triple pattern containment has a complexity of O(1).

6 The fragment synchronization problem has been studied in [11, 17].
7 Containment testing is adapted from [13].
8 The substitution operator preserves URIs and literals, i.e., only variables are substituted.
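The O(1) containment test of Definition 2, and the rule that only fragments not contained in another relevant fragment need to be accessed, can be sketched as follows. Triple patterns are modeled as 3-tuples with variables marked by a leading `?`; this is our own encoding for illustration, not LILAC's implementation:

```python
def is_var(term):
    return term.startswith("?")

def contained_in(tp1, tp2):
    """TP1 ⊆ TP2 iff some substitution θ of TP2's variables yields TP1:
    position-wise, TP2 carries a variable (mapped consistently) or an
    identical constant. Constant time in the size of a triple pattern."""
    theta = {}
    for t1, t2 in zip(tp1, tp2):
        if is_var(t2):
            if theta.setdefault(t2, t1) != t1:  # inconsistent substitution
                return False
        elif t1 != t2:  # constants must match exactly
            return False
    return True

def fragments_to_access(relevant):
    """Keep only relevant fragments not strictly contained in another
    relevant one; `relevant` maps fragment ids to description patterns."""
    return {f for f, tp in relevant.items()
            if not any(g != f and contained_in(tp, relevant[g])
                       and not contained_in(relevant[g], tp)
                       for g in relevant)}

# f4 and f7 from Figure 1b: f7 ⊆ f4, so only f4 must be accessed for tp4
relevant_tp4 = {"f4": ("?movie", "linkedmdb:genre", "?genre"),
                "f7": ("?movie", "linkedmdb:genre", "film_genre:14")}
print(fragments_to_access(relevant_tp4))  # {'f4'}
```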
Fragment Descriptions + Fragment Localities = Catalog

Fragment descriptions (as exposed by the endpoints):
fd(f1): <linkedmdb, ?director dbo:nationality dbr:United_Kingdom>
fd(f2): <dbpedia, ?film dbo:director ?director>
fd(f3): <linkedmdb, ?movie owl:sameAs ?film>
fd(f4): <linkedmdb, ?movie linkedmdb:genre ?genre>
fd(f5): <linkedmdb, ?genre linkedmdb:film_genre_name ?name> (also exposed with variable ?gname by another endpoint)
fd(f6): <dbpedia, ?director dbo:nationality dbr:United_States>
fd(f7): <linkedmdb, ?movie linkedmdb:genre film_genre:14>

Fragment localities:
f1: C3; f2: C1, C2; f3: C2; f4: C1, C3; f5: C2, C3; f6: C1; f7: C2

Figure 3: Data consumers expose the fragment descriptions of their replicated fragments; LILAC computes the fragment localities using the descriptions provided by several endpoints and stores this information in its catalog.
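The merging step shown in Figure 3 (identifying descriptions that differ only in variable names, such as f5's ?name vs. ?gname) amounts to canonicalizing variables. The sketch below uses our own encoding and helper names, purely as an illustration:

```python
def canonical(fd):
    """Canonicalize a fragment description (authority, triple pattern) by
    renaming variables to ?v0, ?v1, ... in order of appearance."""
    authority, tp = fd
    renaming, canon = {}, []
    for term in tp:
        if term.startswith("?"):
            renaming.setdefault(term, "?v%d" % len(renaming))
            canon.append(renaming[term])
        else:
            canon.append(term)
    return (authority, tuple(canon))

def build_catalog(descriptions):
    """Merge per-endpoint (endpoint, description) pairs into localities."""
    catalog = {}
    for endpoint, fd in descriptions:
        catalog.setdefault(canonical(fd), set()).add(endpoint)
    return catalog

# f5 as exposed by C2 and C3 with different variable names (Figures 1b and 3)
d1 = ("C2", ("linkedmdb", ("?genre", "linkedmdb:film_genre_name", "?name")))
d2 = ("C3", ("linkedmdb", ("?genre", "linkedmdb:film_genre_name", "?gname")))
print(build_catalog([d1, d2]))  # one fragment, located at C2 and C3
```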
Moreover, we overload the operator ⊆ and use it to relate triple patterns if and only if the triple pattern queries composed of such triple patterns are related.

The decision problem of fragment containment of two fragments fi and fj is solved based on the containment of the triple patterns that describe these fragments, i.e., fd(fi).tp ⊆ fd(fj).tp. Restricting the description of fragments to a single triple pattern allows for the reduction of the complexity of this problem. For the federation in Figure 1b, f7 ⊆ f4 because f4 and f7 share the same authoritative endpoint and there is a substitution θ defined as { (?genre, film_genre:14), (?movie, ?movie) } such that applying θ to f4 returns f7.

We can rely on both the fragment descriptions and the containment relation to identify a query's relevant fragments. A relevant fragment contains RDF triples that match at least one triple pattern of the query.

Definition 3 (Fragment relevance). Let f be a fragment identifier and TP be a triple pattern of a query Q. f is relevant to Q if TP ⊆ fd(f).tp or fd(f).tp ⊆ TP.

Table 1: Relevant fragments for the federation in Figure 1b and query Q1 (Figure 1c)
tp1: f1, f6; tp2: f2; tp3: f3; tp4: f4, f7; tp5: f5

Table 1 shows the fragments relevant to the triple patterns in query Q1 (Figure 1c) and the federation in Figure 1b. For example, the triple pattern tp1 has two relevant fragments, f1 and f6; the triple pattern tp2 has one relevant fragment, f2; the triple pattern tp3 has one relevant fragment, f3; the triple pattern tp4 has two relevant fragments, f4 and f7; and the triple pattern tp5 has one relevant fragment, f5. Because f7 ⊆ f4, a complete answer for tp4 can be produced from the RDF triples triples(f4), i.e., by accessing only a single endpoint the complete set [[tp4]]LinkedMDB is collected. In contrast, both fragments f1 and f6 are required to answer tp1, and two endpoints need to be contacted to collect the complete answer for tp1. This example illustrates the impact of using triple pattern containment during source selection to reduce the number of selected endpoints.

Definition 4 (Federation with Replicated Fragments). A SPARQL federation with replicated fragments, Fed, is defined as a 6-tuple <E, F, endpoints(), relevantFragments(), DS, ⊆>. E is a set of endpoints that compose the federation Fed. F is a set of identifiers of the fragments accessible through endpoints in E. DS is a set of triples accessible through the federation Fed. Function endpoints(f) maps a fragment identifier f to the set of endpoints in E that access the fragment. Function relevantFragments(tp) maps a triple pattern tp to a subset F' of F that corresponds to the relevant fragments of tp (Definition 3). The operator ⊆ relates triple pattern queries according to Definition 2.

3.2. Query Decomposition Problem with Fragment Replication (QDP-FR)

Given a SPARQL query Q and a federation Fed = <E, F, endpoints(), relevantFragments(), ⊆, DS>, the Query Decomposition Problem with Fragment Replication (QDP-FR) decomposes each basic graph pattern (BGP) of Q (bgp) into sub-queries, i.e., sets of triple patterns that can be posed to the same endpoint. These sub-queries can be combined using joins and unions to obtain an execution plan for bgp that provides the same answers as the evaluation of bgp against Fed, while the number of tuples transferred during the execution of these sub-queries is minimal.

Once the query has been parsed, for each BGP, sources are selected and the BGP is decomposed following LILAC strategies. For the sake of clarity, we formalize the properties of a query decomposition and give the algorithms for the decomposition at the BGP level. Naturally, any SPARQL query that includes UNIONs, OPTIONALs, blocks, or FILTERs can be decomposed starting at the BGP level and combining the BGP decompositions according to the query structure.

A query decomposition Q' is represented as a set S composed of sets of sets of pairs (endpoint, set of triple patterns), where endpoint represents a single endpoint from the federation and set of triple patterns is a subset of the BGP in Q'. S is built from Q', where each element jo of S corresponds to a join operand in the corresponding
Triple pattern | Relevant fragment | Relevant endpoints
tp1            | f1                | C3
tp1            | f6                | C1
tp2            | f2                | C1, C2
tp3            | f3                | C2
tp4            | f4                | C1, C3
tp4            | f7                | C2
tp5            | f5                | C2, C3
Sub-queries are as large as possible:
(∀ sq, e, jo : jo ∈ S ∧ (e, sq) ∈ jo : ¬(∃ t : t ∈ bgp − sq : (∃ f : f ∈ relevantFragments(t) ∧ e ∈ endpoints(f)) ∧ (∃ t′ : t′ ∈ sq : Vars(t) ∩ Vars(t′) ≠ ∅)))
No redundant sub-queries:
(∀ jo : jo ∈ S : ¬(∃ t : t ∈ bgp ∧ (∀ e, sq : (e, sq) ∈ jo : t ∈ sq) : (∃ jo′ : jo′ ∈ S ∧ jo′ ≠ jo : (∀ e′, sq′ : (e′, sq′) ∈ jo′ : t ∈ sq′))))
(a) Selected endpoints
[Figure 4b join plan: (tp2 . tp3)@C2 ⋈ (tp4 . tp5)@C3 ⋈ ((tp1 . tp2)@C1 UNION tp1@C3)]
We illustrate QDP-FR on the running query Q1 of Figure 1c and the federation in Figure 1b. Figure 4a presents the
relevant fragments and relevant endpoints for each triple
pattern. The decompositions generated by FedX, ANAPSID, Fedra+FedX, Fedra+ANAPSID, and LILAC (Figures 2c, 2e, 2d, 2f, and 2g) retrieve all the relevant data
in the federation and only relevant data, i.e., they ensure
properties 1-2. Even if the decomposition generated by Fedra+ANAPSID (Figure 2f) significantly reduces the number of transferred tuples with respect to the decomposition
generated by ANAPSID (Figure 2e), only the decomposition generated by LILAC (Figure 2g) minimizes the transferred data. LILAC includes more selective sub-queries,
and these sub-queries contribute to reducing the number of
transferred tuples and the query execution time; e.g., instead of a sub-query with just tp1, as Fedra+ANAPSID
and ANAPSID include in their decompositions, LILAC
includes the more selective sub-query composed of tp1 and
tp2.
A source selection approach in combination with existing query decomposition strategies may produce decompositions such as the ones in Figures 2d and 2f. Furthermore,
the number of transferred tuples may be significantly reduced whenever source selection is interleaved with query
decomposition and information about the replicated fragments is available.
(b) Decomposition and model:
S2 = { { (C2, { tp2, tp3 }) }, { (C3, { tp4, tp5 }) }, { (C1, { tp1, tp2 }), (C3, { tp1 }) } }
Figure 4: Selected endpoints are included in the LILAC query
decomposition of Q1 (Figure 1c) for the federation in Figure 1b; the
decomposition is modeled with a set of sets of pairs (endpoint, set
of triple patterns)
That is, elements jo in S correspond to sub-queries
connected by joins in the final plan. Further, each pair
uo=(e, ts) in a join operand jo indicates an endpoint e
where the set of triple patterns ts will be posed; elements
uo in a join operand jo are connected by unions in the final
plan. Selected endpoints (Figure 4a) are used to produce
the LILAC query decomposition presented in Figure 4b
for the SPARQL query Q1 in Figure 1c and federation in
Figure 1b. The set based representation S of this LILAC
decomposition is also presented.
Given a query decomposition Q′ that corresponds
to a non-empty solution of QDP-FR, the set-based
representation S of Q′ meets the following properties:
1. Answer soundness: evaluating Q′ produces only
answers that would be produced by the evaluation
of Q over the set of all triples available in the
federation. Formally this property can be written
as:
eval(Q′) ⊆ Q(DS)
4. LILAC: An Algorithm for QDP-FR
The goal of LILAC is to reduce data transfer by taking advantage of the replication of relevant fragments for
several triple patterns on the same endpoint. Function
Decompose solves a two-fold problem: first, relevant fragments are selected; then, sub-queries and relevant endpoints are chosen simultaneously, e.g., two triple patterns
are assigned to the same sub-query if they share a variable
and can be evaluated at the same endpoint.
Algorithm 1 proceeds in four steps:
2. Answer completeness: evaluating Q′ produces
all the answers that would be produced by the
evaluation of Q over the set of all triples available
in the federation.
Formally this property can be written as:
eval(Q′) ⊇ Q(DS)
3. Data transfer minimization: executing Q′ minimizes the number of tuples transferred from the selected endpoints, i.e., each sub-query is as large as
possible without Cartesian products and there are
no redundant sub-queries. Formally, this property
can be written as:
No Cartesian products:
(∀ sq, e, jo : jo ∈ S ∧ (e, sq) ∈ jo : (∀ sq′ : |sq′| ≥ 1 ∧ sq′ ⊂ sq : (∃ t, t′ : t ∈ sq′ ∧ t′ ∈ (sq − sq′) : Vars(t) ∩ Vars(t′) ≠ ∅)))
I. Selection of the non-redundant fragments for each
triple pattern (line 2).
II. Identification of the candidate endpoints, reducing the
unions if possible (line 3).
III. Generation of the largest Cartesian-product-free sub-queries to evaluate the BGP triple patterns with just
one relevant fragment (line 4).
∈ relevantFragments(tp4); because f4 contains f7, i.e.,
f7 ⊆ f4, f7 is not included in fragments for tp4.
IV. Reduction of the non selective sub-queries by merging
joinable sub-queries that can be executed against the
same endpoint, i.e., sub-queries that share at least
one variable and are assigned to the same endpoint
(line 5).
Algorithm 2 ReduceUnions Algorithm
Require: A set selectedFragments, composed of pairs (tp, fss), with tp a triple pattern and fss a set of sets of fragments
Ensure: A set candidateEndpoints, composed of pairs (tp, ess), with tp a triple pattern and ess a set of sets of endpoints
8: function ReduceUnions(selectedFragments)
9:   candidateEndpoints ← ∅
10:  for (tp, fs) ∈ selectedFragments do
11:    if cardinality(fs) > 1 then
12:      ce ← ∩( endpoints(f) | f ∈ fs )
13:    end if
14:    if ce = ∅ ∨ cardinality(fs) = 1 then
15:      ce ← { endpoints(f) : f ∈ fs }
16:    end if
17:    candidateEndpoints ← candidateEndpoints ∪ { (tp, ce) }
18:  end for
19:  return candidateEndpoints
20: end function
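Algorithm 2 can be transliterated almost line by line. The following Python sketch is ours, not the paper's code: the ENDPOINTS catalog mirrors the fragment localities of Figure 5a, and flattening the fragment groups before intersecting their endpoint sets is our reading of line 12.

```python
# Sketch of ReduceUnions (Algorithm 2). selected_fragments maps each
# triple pattern to its groups of relevant fragments; ENDPOINTS is the
# fragment locality catalog of the running example.
ENDPOINTS = {"f1": {"C3"}, "f2": {"C1", "C2"}, "f3": {"C2"},
             "f4": {"C1", "C3"}, "f5": {"C2", "C3"},
             "f6": {"C1"}, "f7": {"C2"}}

def reduce_unions(selected_fragments):
    candidate_endpoints = {}
    for tp, fss in selected_fragments.items():
        frags = [f for fs in fss for f in fs]      # flatten fragment groups
        ce = None
        if len(frags) > 1:
            # endpoints that host all relevant fragments simultaneously
            common = set.intersection(*(ENDPOINTS[f] for f in frags))
            if common:
                ce = {frozenset(common)}
        if ce is None:                              # no common endpoint
            ce = {frozenset(ENDPOINTS[f]) for f in frags}
        candidate_endpoints[tp] = ce
    return candidate_endpoints

ce = reduce_unions({"tp1": [{"f1"}, {"f6"}], "tp4": [{"f4"}]})
# tp1: no endpoint hosts both f1 and f6, so two union alternatives remain
# tp4: a single fragment, hosted at C1 and C3
```

On the running example this reproduces Figure 5d: (tp1, {{C3}, {C1}}) and (tp4, {{C1, C3}}).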
Notice that the first two steps are also performed by
the Fedra source selection strategy. Even if both Fedra
and LILAC use a reduction to the set covering problem to
refine the set of selected sources for the triple patterns in
a BGP in the third step, the reduction used by LILAC
meets the requirement of delegating join execution to
the endpoints, and ensures that the largest sub-queries
obtained from this third step are effectively Cartesian
product free sub-queries. Finally, LILAC implements the
fourth step, being able to place the same triple pattern in
more than one sub-query. This is a unique characteristic
of the decompositions generated by LILAC.
Second, the function ReduceUnions (Algorithm 2)
localizes fragments on endpoints, i.e., performs an initial
endpoint selection based on the endpoints that provide
access to the fragments in selectedFragments; the function
endpoints() provides this information. Additionally,
for triple patterns that have several relevant fragments,
only the endpoints that simultaneously access all relevant
fragments are included in this initial endpoint selection if
they exist.
For our running example, function ReduceUnions
(Algorithm 2) uses the fragments (Figure 5c) to produce
the candidateEndpoints (Figure 5d). Endpoints C1 and
C3 are candidate endpoints for tp1, i.e., (tp1, { { C3 }, {
C1 } }) ∈ candidateEndpoints; this is because (tp1, { { f1 },
{ f6 } }) ∈ selectedFragments, f1 is accessed by endpoint C3,
and f6 is accessed by endpoint C1. Further, even if tp1 has
two relevant sets of fragments, as they are not accessed
simultaneously by any endpoint, their relevant endpoints
are included as two sets of endpoints, i.e., { C3 }, { C1 }.
After selecting candidate endpoints for the BGP triple
patterns, the function ReduceBGPs (Algorithm 3) conducts
the following steps for the triple patterns with only one
relevant fragment, i.e., (tp, ess) ∈ candidateEndpoints and
cardinality(ess)=1:
Algorithm 1 Decompose Algorithm
Require: A Basic Graph Pattern bgp
A Federation Fed = ⟨E, F, endpoints(), relevantFragments(), ⊆, DS⟩
Ensure: A decomposition representation decomposition
1: function Decompose(bgp, Fed)
2:   fragments ← SelectNonRedundantFragments(bgp, Fed)
3:   endpoints ← ReduceUnions(fragments)
4:   (subqueries, ljtps) ← ReduceBGPs(endpoints)
5:   decomposition ← IncreaseSelectivity(endpoints, subqueries, ljtps)
6:   return decomposition
7: end function
Next, we illustrate how Algorithm 1 works on our
running example of query Q1 (Figure 1c) and the federation
in Figure 1b.⁹
First, for each triple pattern, LILAC computes the
non-redundant set of relevant fragments; two or more
fragments are grouped together if they share relevant
data for the triple pattern. Formally, the output of
the function SelectNonRedundantFragments satisfies the
following properties: fragments is a set of pairs (tp, rfs)
where tp is a triple pattern in BGP and rfs is a set
of sets of relevant fragments for tp. The set rfs is
composed of sets fs1, ..., fsm of fragments such that
each fsi comprises fragments that share relevant data for
tp. A singleton fsi = { fi } includes a fragment such
that fd(fi).tp is contained by the triple pattern tp, i.e.,
fd(fi).tp ⊆ tp; additionally, there is no other fragment fk
such that fd(fk).tp ⊆ tp and fd(fk).tp contains fd(fi).tp.
A non-singleton fsi includes all the fragments fi such
that fd(fi).tp subsumes the triple pattern tp, i.e., tp ⊆
fd(fi).tp.
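The singleton case above can be sketched with the containment relations made explicit. This is an illustrative simplification, not the paper's code: real containment is decided on the triple patterns themselves, whereas here the relations CONTAINED_IN_TP and CONTAINS are hypothetical helpers pre-computed for the running example, and non-singleton groups (subsuming fragments) are omitted.

```python
# Sketch of the singleton-group selection in
# SelectNonRedundantFragments, with containment given explicitly.
CONTAINED_IN_TP = {"tp4": {"f4", "f7"}}   # fd(f).tp ⊆ tp4 for f4 and f7
CONTAINS = {("f4", "f7")}                  # fd(f7).tp ⊆ fd(f4).tp

def singleton_groups(tp):
    frags = CONTAINED_IN_TP[tp]
    # keep f only if no other contained fragment strictly contains it
    return [{f} for f in frags
            if not any((g, f) in CONTAINS for g in frags if g != f)]

singleton_groups("tp4")   # [{'f4'}]: f7 is redundant, only f4 is kept
```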
For the federation in our running example, the relevant
fragments (Figure 5b) are used by function
SelectNonRedundantFragments to produce the fragments
(Figure 5c). The pair (tp4, {{f4}}) is part of the set fragments; note
that even if f7 is a relevant fragment for tp4, i.e., f7
1. For each endpoint e, the computation of the maximal
sets of joinable triple patterns that e can
answer, largest, i.e., (e, largest) ∈ ljtps.
2. Reduction to and resolution of the set covering instance
to select the minimal number of sets of joinable triple
patterns that cover the set of triple patterns in ljtps
with only one relevant fragment.
Computing the maximal sets of joinable triple
patterns may be done in a loop that starts with singleton
sets and, in each iteration, merges two sets
of triple patterns if they are joinable, until a fixed point is
reached.
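This fixed-point computation can be written directly. A minimal Python sketch follows (our code, not the paper's; the variable sets per triple pattern come from query Q1, and the input is the single-fragment triple patterns assigned to endpoint C2 in the running example):

```python
# Fixed-point computation of maximal joinable sets of triple patterns:
# start from singletons and merge two sets whenever they are joinable,
# until no merge applies.
VARS = {"tp2": {"?film", "?director"}, "tp3": {"?movie", "?film"},
        "tp5": {"?genre", "?gname"}}

def joinable(a, b):
    return any(VARS[t] & VARS[u] for t in a for u in b)

def maximal_joinable_sets(tps):
    sets = [frozenset({t}) for t in tps]
    changed = True
    while changed:                                  # fixed point
        changed = False
        for i in range(len(sets)):
            for j in range(i + 1, len(sets)):
                if joinable(sets[i], sets[j]):
                    merged = sets[i] | sets[j]
                    sets = [s for s in sets if s not in (sets[i], sets[j])]
                    sets.append(merged)
                    changed = True
                    break
            if changed:
                break
    return set(sets)

# At C2 the single-fragment triple patterns are tp2, tp3, tp5;
# tp2 and tp3 merge (shared ?film), tp5 stays alone.
maximal_joinable_sets(["tp2", "tp3", "tp5"])
```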
The BGP reduction has been reduced to an existing optimization problem (the set covering problem). This
9 Notice that only triples in fragments f 1 and f 6 are accessible in
this federation to retrieve data for tp1, therefore it will not produce
all the answers that would be produced using DBpedia.
Fragment Descriptions:
fd(f1): ⟨linkedmdb, ?director dbo:nationality dbr:United_Kingdom⟩
fd(f2): ⟨dbpedia, ?film dbo:director ?director⟩
fd(f3): ⟨linkedmdb, ?movie owl:sameAs ?film⟩
fd(f4): ⟨linkedmdb, ?movie linkedmdb:genre ?genre⟩
fd(f5): ⟨linkedmdb, ?genre linkedmdb:film_genre_name ?gname⟩
fd(f6): ⟨dbpedia, ?director dbo:nationality dbr:United_States⟩
fd(f7): ⟨linkedmdb, ?movie linkedmdb:genre film_genre:14⟩
Fragment Localities:
f1: C3; f2: C1, C2; f3: C2; f4: C1, C3; f5: C2, C3; f6: C1; f7: C2
(a) Catalog
Relevant Fragments:
tp1: f1, f6; tp2: f2; tp3: f3; tp4: f4, f7; tp5: f5
(b) Relevant Fragments
fragments = { (tp1, {{f1}, {f6}}), (tp2, {{f2}}), (tp3, {{f3}}), (tp4, {{f4}}), (tp5, {{f5}}) }
(c) fragments
candidateEndpoints = { (tp1, {{C3}, {C1}}), (tp2, {{C1, C2}}), (tp3, {{C2}}), (tp4, {{C1, C3}}), (tp5, {{C2, C3}}) }
(d) candidateEndpoints
Figure 5: Catalog's information and relevant fragments are used to compute fragments in function SelectNonRedundantFragments and
candidateEndpoints in function ReduceUnions (Algorithm 2) for the federation in Figure 1b, and query Q1 (Figure 1c)
reduction allows for relying on existing algorithms for solving the set covering problem [18], and for implementing the function SetCoveringSubqueries in
Algorithm 3 (line 48) efficiently and effectively. However, even if using a heuristic,
such as the one presented in [18], allows for identifying
solutions quickly, it does not guarantee the generation of
an optimal solution, and it limits LILAC to producing only
an approximate solution.
Our approximate solution to the set covering problem is computed over the triple patterns tps associated with each endpoint e in pairs (e, tps) ∈ ljtps.
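A standard greedy heuristic for set covering, in the spirit of [18] though not necessarily the paper's exact implementation, picks at each step the candidate that covers the most uncovered triple patterns. A sketch over the running example's ljtps:

```python
# Greedy set covering sketch. Candidates are the (endpoint, triple
# patterns) pairs in ljtps; the universe is the set of triple patterns
# with a single relevant fragment.
def greedy_cover(ljtps, universe):
    uncovered, cover = set(universe), []
    while uncovered:
        e, tps = max(ljtps, key=lambda p: len(p[1] & uncovered))
        if not tps & uncovered:
            break                      # universe cannot be covered
        cover.append((e, tps))
        uncovered -= tps
    return cover

ljtps = [("C1", {"tp2"}), ("C1", {"tp4"}), ("C2", {"tp2", "tp3"}),
         ("C2", {"tp5"}), ("C3", {"tp4", "tp5"})]
# Covers tp2-tp5 with (C2, {tp2, tp3}) and then (C3, {tp4, tp5}),
# as in the SetCoveringSubqueries output described below.
greedy_cover(ljtps, {"tp2", "tp3", "tp4", "tp5"})
```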
For our running example, function ReduceBGPs (Algorithm 3) uses the candidateEndpoints (Figure 6a) and the
joins in the BGP (Figure 6b) to produce the values of ljtps
and subqueries given in Figures 6c and 6d. Since (tp2, {{
C1, C2 }}), (tp3, {{ C2 }}), (tp4, {{ C1, C3 }}), (tp5, {{ C2,
C3 }}) are part of candidateEndpoints, the variable triples is
composed of the following pairs: (C1, { tp2, tp4 }), (C2, {
tp2, tp3, tp5 }), (C3, { tp4, tp5 }). Additionally, the pairs
(C1, { tp2 }), (C1, { tp4 }), (C2, { tp2, tp3 }), (C2, { tp5
}), (C3, { tp4, tp5 }) are part of ljtps. Note that because
tp2 and tp4 do not share any variable, tp2 and tp4 have
been included in different elements of ljtps with C1. However, tp2 and tp3 are included in the same set of triple
patterns because they share the variable ?film. The output of evaluating the function SetCoveringSubqueries is
the set { (C2, { tp2, tp3 }), (C3, { tp4, tp5 }) } because all the
triple patterns tp2, tp3, tp4, and tp5 are covered by this
set of two elements.
Finally, sub-queries involving triple patterns with more
than one relevant fragment are computed, and their selectivity is increased by combining these triple patterns with
joinable triple patterns assigned to the same endpoint.
Endpoints able to answer the largest joinable sub-query
are selected, and joinable sub-queries are included in ljtps.
The endpoints and sub-queries selected for each relevant
fragment are combined and included as one element of
Algorithm 3 ReduceBGPs Algorithm
Require: A set candidateEndpoints, composed of pairs (tp, ess), with tp a triple pattern and ess a set of sets of endpoints
Ensure: A pair (subqueries, ljtps), with subqueries and ljtps sets of pairs (e, tps), e an endpoint and tps a set of triple patterns
21: function ReduceBGPs(candidateEndpoints)
22:   es ← ∅
23:   for (tp, ess) ∈ candidateEndpoints do
24:     for es′ ∈ ess do
25:       for e ∈ es′ do
26:         es ← es ∪ { e }
27:       end for
28:     end for
29:   end for
30:   ljtps ← ∅
31:   triples ← ∅
32:   for e ∈ es do
33:     tps ← ∅
34:     for (tp, ess) ∈ candidateEndpoints do
35:       if cardinality(ess) = 1 then
36:         let es′ be the element of ess, i.e., ess = { es′ }
37:         if e ∈ es′ then tps ← tps ∪ { tp } end if
38:       end if
39:     end for
40:     triples ← triples ∪ { (e, tps) }
41:   end for
42:   for (e, tps) ∈ triples do
43:     for tpi ∈ tps do
44:       largest ← the largest joinable sub-set of tps that includes tpi
45:       ljtps ← ljtps ∪ { (e, largest) }
46:     end for
47:   end for
48:   subqueries ← SetCoveringSubqueries(ljtps)
49:   return (subqueries, ljtps)
50: end function
(a) candidateEndpoints = { (tp1, {{C3}, {C1}}), (tp2, {{C1, C2}}), (tp3, {{C2}}), (tp4, {{C1, C3}}), (tp5, {{C2, C3}}) }
(b) Join variables:
?director: tp1, tp2; ?film: tp2, tp3; ?movie: tp3, tp4; ?genre: tp4, tp5
(c) ljtps = { (C1, { tp2 }), (C1, { tp4 }), (C2, { tp2, tp3 }), (C2, { tp5 }), (C3, { tp4, tp5 }) }
(d) subqueries = { (C2, { tp2, tp3 }), (C3, { tp4, tp5 }) }
Figure 6: candidateEndpoints and joins in the query are used by function ReduceBGPs (Algorithm 3) to produce the set of largest joinable
triple patterns (ljtps) and the set of non-redundant subqueries (subqueries) for the federation in Figure 1b and query Q1 (Figure 1c)
Algorithm 4 IncreaseSelectivity Algorithm
Require: A set subqueries, composed of pairs (e, tps), with e an endpoint and tps a set of triple patterns
A set candidateEndpoints, composed of pairs (tp, ess), with tp a triple pattern and ess a set of sets of endpoints
A set ljtps, composed of pairs (e, tps), with e an endpoint and tps a set of triple patterns
Ensure: A decomposition representation decomposition
51: function IncreaseSelectivity(subqueries, endpoints, ljtps)
52:   decomposition ← ∅
53:   for sq ∈ subqueries do
54:     decomposition ← decomposition ∪ { { sq } }
55:   end for
56:   for (tp, ess) ∈ endpoints do
57:     if cardinality(ess) > 1 then
58:       sq ← ∅
59:       for es ∈ ess do
60:         select (e, ts) ∈ ljtps such that e ∈ es ∧ ts is the largest sub-query joinable with tp
61:         sq ← sq ∪ { (e, ts ∪ { tp }) }
62:       end for
63:       decomposition ← decomposition ∪ { sq }
64:     end if
65:   end for
66:   return decomposition
67: end function

select distinct *
where {
  SERVICE <C2> { tp2 . tp3 } .
  SERVICE <C3> { tp4 . tp5 } .
  { SERVICE <C1> { tp1 . tp2 } }
  UNION
  { SERVICE <C3> { tp1 } }
}

Figure 7: Decomposition obtained with Algorithm 1 for the federation
in Figure 1b, and query Q1 (Figure 1c)
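Algorithm 4's behavior on the running example can be sketched in Python. The data structures and the helper joinable_with are illustrative assumptions of ours (not the paper's code); they mirror subqueries, candidateEndpoints, and ljtps from Figures 5 and 6.

```python
# Sketch of IncreaseSelectivity (Algorithm 4) on the running example:
# tp1 has two fragment groups, so it yields one join operand whose
# elements are union operands, each at a different endpoint.
subqueries = [("C2", {"tp2", "tp3"}), ("C3", {"tp4", "tp5"})]
candidate_endpoints = {"tp1": [{"C3"}, {"C1"}]}   # two union alternatives
ljtps = [("C1", {"tp2"}), ("C1", {"tp4"}), ("C2", {"tp2", "tp3"}),
         ("C2", {"tp5"}), ("C3", {"tp4", "tp5"})]
VARS = {"tp1": {"?director"}, "tp2": {"?film", "?director"},
        "tp3": {"?movie", "?film"}, "tp4": {"?movie", "?genre"},
        "tp5": {"?genre", "?gname"}}

def joinable_with(tp, ts):
    # a sub-query is joinable with tp if some triple pattern shares a variable
    return any(VARS[tp] & VARS[t] for t in ts)

decomposition = [[sq] for sq in subqueries]       # single-fragment sub-queries
for tp, ess in candidate_endpoints.items():
    if len(ess) > 1:
        jo = []                                   # one join operand, many unions
        for es in ess:
            # largest sub-query in ljtps at one of these endpoints joinable with tp
            candidates = [(e, ts) for e, ts in ljtps
                          if e in es and joinable_with(tp, ts)]
            if candidates:
                e, ts = max(candidates, key=lambda p: len(p[1]))
            else:
                e, ts = next(iter(es)), set()     # tp alone at this endpoint
            jo.append((e, ts | {tp}))
        decomposition.append(jo)
```

The last join operand is [(C3, {tp1}), (C1, {tp1, tp2})], matching the UNION of Figure 7: at C1 the sub-query {tp1, tp2} is more selective, while at C3 no joinable sub-query exists and tp1 is evaluated alone.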
O(n³m²k), with n the number of triple patterns in the
query, m the number of fragments, and k the number of
endpoints.
Details about time complexity are available in Appendix A.
Theorem 1 If all the RDF data accessible through the
federation endpoints are described as replicated fragments,
LILAC query decomposition produces a solution to the
QDP-FR problem, i.e., LILAC generates a query decomposition that satisfies the answer soundness, answer completeness, and data transfer minimization properties stated
in Section 3.2.
the decomposition representation, i.e., the set jo is a join
operand and comprises pairs (e, ts) in the query decomposition. Notice that each element of jo corresponds to a
union operand in the query decomposition. Sub-queries
already included in subqueries, i.e., sub-queries with just
one relevant fragment, are represented as singleton sets in the query decomposition.
For our running example, Figure 7 presents the decomposition computed by function IncreaseSelectivity
(Algorithm 4). Sub-queries { tp2 . tp3 } and { tp4 . tp5 }
are evaluated at endpoints C2 and C3, respectively. Further, since the pair (C1, { tp2 }) belongs to ljtps, the pair
(tp1, { { C1 }, { C3 } }) is part of candidateEndpoints, and
the triple patterns tp1 and tp2 share the variable ?director, the sub-query { tp1, tp2 } is evaluated at C1, the
sub-query { tp1 } is evaluated at C3, and the union of their
resulting mappings is produced by the federated query engine. As proposed in Section 2, a decomposition with selective sub-queries has been produced by Algorithm 1.
The proof of Theorem 1 is available in Appendix B.
5. Experimental Study
The goal of the experimental study is to evaluate
the effectiveness of LILAC. We compare the performance of the federated query engines FedX and
ANAPSID with respect to their extensions using DAW,
Fedra, and LILAC, i.e., the evaluation compares FedX,
DAW+FedX, Fedra+FedX, and LILAC+FedX, as well
as ANAPSID, DAW+ANAPSID, Fedra+ANAPSID,
and LILAC+ANAPSID. The aim of extending both
FedX and ANAPSID with LILAC is to show that
LILAC is not tailored for one federated query engine,
Proposition 1. Algorithm 1 has a time complexity of
Table 2: Dataset characteristics: version, number of different triples
(# DT), predicates (# P), and coherence [10]

Dataset     | Version date | # DT       | # P | Coherence
Diseasome   | 19/10/2012   | 72,445     | 19  | 0.32
SWDF        | 08/11/2012   | 198,797    | 147 | 0.23
DBpedia Geo | 06/2012      | 1,900,004  | 4   | 1.00
LinkedMDB   | 18/05/2009   | 3,579,610  | 148 | 0.81
WatDiv1     | —            | 104,532    | 86  | 0.44
WatDiv100   | —            | 10,934,518 | 86  | 0.38
i.e., the LILAC strategy works well with different engines.
Moreover, it should be noted that our experimental
study does not aim to compare the federated query engines
FedX and ANAPSID, because their source selection
strategies offer very different guarantees, i.e., FedX source
selection theoretically ensures answer completeness, while
ANAPSID may prune relevant sources that are required
to produce a complete answer in order to reduce the
execution time. We expect LILAC to improve
the performance of these federated query engines in terms of
number of transferred tuples, number of selected sources,
and execution time. Also, we hypothesize that LILAC
does not significantly degrade the performance of federated query engines in terms of decomposition time and
completeness. Moreover, LILAC+ANAPSID theoretically
produces complete answers, i.e., LILAC may increase the
query answer completeness of ANAPSID.
Datasets: We use four real datasets: Diseasome, Semantic Web Dog Food (SWDF), Linked Movie Data Base
(LinkedMDB), and DBpedia Geo-coordinates (Geo). Further, we consider two instances of the Waterloo SPARQL
Diversity Test Suite synthetic dataset [2, 3] with 10^5 and
10^7 triples (WatDiv1 and WatDiv100). Table 2 shows
the characteristics of these datasets. Their sizes vary
between 72 thousand and almost 11 million
triples, and their number of predicates ranges from just four
for the highly structured dataset Geo to 148 predicates
for the less rigidly structured dataset LinkedMDB. We used
the coherence metric [10] to measure the structuredness
of the datasets. The structuredness indicates how well
instances conform to their types; coherence values range
between 0.0 and 1.0, e.g., a dataset where all the type
instances have the same attributes has a coherence value
of 1.0, and a dataset where all the type instances have
different attributes has a coherence value of 0.0.
Queries: We generate 50,000 queries from 500 templates
for the WatDiv datasets. These queries have between
1 and 14 triple patterns, PATH, STAR, and HYBRID
(called Snowflake in [2, 3]) shapes. For the real datasets,
we generate PATH and STAR shaped templates with two
to eight triple patterns, and they are instantiated with
random values from the datasets to generate over 10,000
queries per dataset. Figure 8 presents the characteristics
of the 100 randomly selected queries to be executed for
each of the six datasets.
All the queries have the DISTINCT modifier, i.e.,
duplicated variable mappings are not included in the query
answers. In this way, the results of evaluating these queries
Figure 8: Characteristics of the randomly selected queries evaluated
in the experimental study. The number of triples in the queries goes
from 1 up to 14, their shapes are PATH, STAR, and HYBRID, and
their number of answers goes from 1 up to 45,129. X-axis is presented
using logarithmic scale
will not be affected by not collecting replicated data from
the relevant sources.
Fragment Replication: For each dataset, we set up a
federation with ten SPARQL endpoints. Each endpoint
can access data from replicated fragments that are relevant
for 100 random queries; the same fragment is replicated
in at most three endpoints. Fragments are created from
triple patterns in these 100 randomly selected queries.
SPARQL construct queries are executed to populate these
fragments; data is collected using the LDF client¹⁰ against
a local LDF server that hosts the corresponding datasets.
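For instance, each fragment can be materialized with a CONSTRUCT query built from its defining triple pattern. The helper below is a hypothetical sketch of ours; the paper does not specify the exact template, and PREFIX declarations are omitted.

```python
# Hypothetical template for the SPARQL CONSTRUCT queries used to
# populate a fragment from its defining triple pattern.
def construct_query(triple_pattern):
    return "CONSTRUCT { %s } WHERE { %s }" % (triple_pattern, triple_pattern)

construct_query("?movie linkedmdb:genre ?genre")
# CONSTRUCT { ?movie linkedmdb:genre ?genre } WHERE { ?movie linkedmdb:genre ?genre }
```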
Implementations: The source selection and query decomposition components of FedX 3.1¹¹ and ANAPSID¹² have
been modified. First, Fedra and DAW [28] replaced the
FedX and ANAPSID source selection strategies; we
name these new engines Fedra+FedX, DAW+FedX,
Fedra+ANAPSID, and DAW+ANAPSID,
respectively. Likewise, these engines were also modified to execute the LILAC source selection and query
decomposition strategies; LILAC+FedX and LILAC+ANAPSID
correspond to these new versions of the
engines. Moreover, FedX was modified to include a
heuristic that produces plans free of Cartesian products
whenever possible. Cartesian-product-free plans
are positively impacted by good query decomposition
strategies, as we have argued in Section 2. Because FedX
is implemented in Java and ANAPSID is implemented in
Python, LILAC, Fedra, and DAW¹³ are implemented
in both Java 1.7 and Python 2.7.3. Thus, LILAC,
Fedra, and DAW are integrated in FedX and ANAPSID,
reducing the performance impact of including these
strategies. The code and experimental setup are available
at https://github.com/gmontoya/lilac.
10 https://github.com/LinkedDataFragments, March 2015.
11 http://www.fluidops.com/fedx/, September 2014.
12 https://github.com/anapsid/anapsid, September 2014.
13 We had to implement DAW as its code is not available.
Hardware and configuration details: The Grid'5000
testbed¹⁴ is used to run the experiments. In total, 11
machines with Intel Xeon E5520 2.27 GHz processors and 24GB of RAM
are used. Ten machines host the SPARQL
endpoints, and one runs the federated query engines.
Federated query engines use up to 7GB of RAM. SPARQL
endpoints are deployed using Virtuoso 7.2.1¹⁵. The Virtuoso
parameters number of buffers and maximum number of
dirty buffers are set to 1,360,000 and 1,000,000,
respectively. The Virtuoso maximum number of result rows is
set to 100,000, and the maximum query execution time
and maximum query cost estimation are set to 600
seconds.
Evaluation Metrics: i) Number of Selected Sources
(NSS): the sum of the number of sources that have
been selected per query triple pattern. ii) Query Decomposition and Source Selection Time (DST): the
elapsed time from when the query is posed until the query
decomposition is produced. iii) Execution Time (ET):
the elapsed time from when the query is posed until the
complete answer is produced. We used a timeout of 1,800
seconds. iv) Completeness (C): the ratio between the
number of query answers produced by the engine and
the number of answers produced by the evaluation of
the query over the set of all the triples available in the
federation; values range between 0.0 and 1.0. v) Number
of Transferred Tuples (NTT): the number of tuples
transferred from all the endpoints to the engine during a
query evaluation. Endpoints are accessed through proxies
that count the number of transferred tuples.¹⁶ The presented
values correspond to the average over the execution of 100
random queries for each of the six federations.
Statistical Analysis: The Wilcoxon signed rank test [36]
for paired non-uniform data is used to study the significance of the performance improvements obtained when the
LILAC decomposition is used.¹⁷
Figure 9: Number of Selected Sources for LILAC+ANAPSID,
Fedra+ANAPSID, DAW+ANAPSID, and ANAPSID
Figure 10: Number of Selected Sources for LILAC+FedX,
Fedra+FedX, DAW+FedX, and FedX
5.1.1. Number of Selected Sources
Figures 9 and 10 show the performance of the
engines in terms of the number of selected sources.
LILAC+FedX and LILAC+ANAPSID select significantly fewer sources than FedX, DAW+FedX, ANAPSID,
and DAW+ANAPSID, respectively. These results are
natural because LILAC uses the knowledge present
in the catalog to precisely identify multiple endpoints
that access the same fragments and fragments that only
provide redundant data. Contrary to LILAC, the other
engines are not aware of the replicated fragments, and
consequently, they may incur redundant data accesses.
Even a duplicate-aware technique like DAW is not capable of achieving a reduction as great as the one achieved
by LILAC. The behavior exhibited by DAW is the
consequence of its duplicate detection technique, which
is based on the overlap between summaries and cannot
provide a precise description of the replicated fragments. However, LILAC+FedX and LILAC+ANAPSID
select slightly more sources than Fedra+FedX and
Fedra+ANAPSID, respectively. LILAC may assign one
triple pattern to more than one endpoint to increase the
selectivity of the sub-queries in the decomposition. On
the other hand, Fedra selects endpoints to facilitate
the query decomposition of the federated query engines.
Thus, Fedra chooses the smallest number of endpoints
to evaluate each triple pattern and enables FedX and
ANAPSID to build decompositions that include exclusive
and star-shaped groups, respectively.
To confirm that LILAC+FedX and LILAC+ANAPSID
5.1. Source Selection
Experiments were conducted against the federations
with replicated fragments and the 600 queries described in
the previous section, i.e., 100 random queries for each of
the six federations. The aim of this section is to report on
the results of the query decomposition and source selection
time (DST), as well as on the number of selected sources
(NSS). Although the proposed approach LILAC requires
more time for source selection and query decomposition,
i.e., LILAC exhibits a higher DST, it allows for reducing
the number of selected sources considerably, i.e., the NSS
is lower for LILAC. Moreover, the achieved reduction has
a great impact on the query execution time (ET).
14 https://www.grid5000.fr
15 https://github.com/openlink/virtuoso-opensource/releases/tag/v7.2.1, June 2015.
16 Proxies are implemented in Java 1.7 using the Apache HttpComponents Client library 4.3.5 (https://hc.apache.org/).
17 The Wilcoxon signed rank test was computed using the R project (http://www.r-project.org/).
Table 3: Number of queries that timed out and that aborted
execution (WatDiv100)

Engine        | # timeouts | # aborted
LILAC+FedX    | 0 | 0
Fedra+FedX    | 0 | 2
DAW+FedX      | 0 | 2
FedX          | 5 | 4
LILAC+ANAPSID | 0 | 0
Fedra+ANAPSID | 0 | 5
DAW+ANAPSID   | 0 | 0
ANAPSID       | 0 | 0

Figure 11: Query Decomposition and Source Selection Time
(in secs) for LILAC+ANAPSID, Fedra+ANAPSID,
DAW+ANAPSID, and ANAPSID
DAW, the source selection takes a considerable amount
of time because, during the source selection phase, FedX
contacts all the endpoints for each triple pattern to
check whether the endpoints access relevant data for the triple
pattern. Similarly, DAW conducts an exhaustive search
and computes the overlapping data of the endpoints
to prune endpoints that only provide redundant data.
In contrast, LILAC and Fedra exploit metadata stored
in their catalogs to speed up source selection, while
ANAPSID uses heuristics to rapidly prune sources that
are not likely to provide relevant data for the query, at the
cost of answer completeness.
To confirm that the DST for LILAC+FedX and
LILAC+ANAPSID is lower than for FedX, DAW+FedX,
and DAW+ANAPSID, respectively, Wilcoxon signed rank
tests were run with the hypotheses:
H0: LILAC+FedX's (LILAC+ANAPSID's) DST is the
same as for FedX and DAW+FedX (DAW+ANAPSID).
H1: LILAC+FedX's (LILAC+ANAPSID's) DST is lower
than for FedX and DAW+FedX (DAW+ANAPSID).
Test p-values are presented in Table C.7, Appendix C.
For all the federations and engines, p-values are below
0.05. These low p-values allow for rejecting the null hypothesis that DAW, FedX, and LILAC query decomposition and source selection times are similar. Additionally,
they support the acceptance of the alternative hypothesis that the LILAC source selection and query decomposition
strategies are faster than the DAW and FedX source selection
techniques.
Figure 12: Query Decomposition and Source Selection Time (in
secs) for LILAC+FedX, Fedra+FedX, DAW+FedX, and FedX
select fewer sources than FedX, DAW+FedX, ANAPSID,
and DAW+ANAPSID, respectively, Wilcoxon signed rank
tests were run with the hypotheses:
H0: LILAC+FedX (LILAC+ANAPSID) selects the
same number of sources as FedX and DAW+FedX (resp.,
ANAPSID and DAW+ANAPSID) do.
H1: LILAC+FedX (LILAC+ANAPSID) selects fewer
sources than FedX and DAW+FedX (resp., ANAPSID
and DAW+ANAPSID) do.
Test p-values are presented in Tables C.5 and C.6, Appendix C. For all the federations and engines, p-values are
below 0.05. These low p-values allow for rejecting the
null hypothesis that LILAC+FedX and LILAC+ANAPSID
selected the same number of sources as the other studied engines. Additionally, they support the acceptance
of the alternative hypothesis that LILAC+FedX and
LILAC+ANAPSID reduce the number of sources selected
by FedX and ANAPSID, respectively; the reductions
achieved by LILAC+FedX and LILAC+ANAPSID
are greater than the ones achieved by DAW+FedX and
DAW+ANAPSID, respectively.
5.2. Query Execution
We consider the same federations and queries from the
previous section and study their execution performance
in terms of execution time (ET), answer completeness
(C), and number of transferred tuples (NTT). As the
query executions that aborted or timed out are not meaningful
in terms of the studied metrics, these
executions have been removed from the results; Table 3
summarizes the removed queries. Using LILAC in
combination with the federated query engines reduces the
number of queries that time out or finish with an error.
This is mostly a natural consequence of the reduction
of the number of transferred tuples (NTT) achieved by
LILAC.
5.1.2. Source Selection and Query Decomposition Time
Figures 11 and 12 show the performance of the engines
in terms of query decomposition and source selection time
(DST). DST for LILAC+FedX is significantly better than
DST for FedX and DAW+FedX, while DST for ANAPSID
is significantly better than DST for LILAC+ANAPSID,
Fedra+ANAPSID, and DAW +ANAPSID. For FedX and
Figure 13: Execution Time (in secs) for LILAC+ANAPSID, Fedra+ANAPSID, DAW+ANAPSID and ANAPSID

Figure 14: Execution Time (in secs) for LILAC+FedX, Fedra+FedX, DAW+FedX and FedX
5.2.1. Execution Time
Figures 13 and 14 show the performance of the engines
in terms of execution time (ET). Execution time (ET)
for LILAC+FedX is better for all the datasets except
GeoCoordinates. Moreover, execution time (ET) for
LILAC+ANAPSID is better for all the datasets except
for GeoCoordinates and WatDiv1. The GeoCoordinates dataset is characterized by high structuredness, i.e., the coherence for GeoCoordinates is 1.00 (Table 2); additionally, GeoCoordinates queries are very selective star-shaped queries. Moreover, most of the executed queries over
GeoCoordinates and WatDiv1 have an execution time
of around 1 second. In this type of scenario, FedX and ANAPSID are already capable of generating execution plans that reduce the number of transferred tuples and intermediate results. Consequently, when executing “easy” queries it may not be worthwhile to include (more) expensive source selection and decomposition strategies such as the one provided by LILAC. To test this conjecture, we consider only the queries for which the FedX and ANAPSID executions transfer at least 100 tuples. The number of considered
queries is 49 for ANAPSID and WatDiv1 federation,
seven for ANAPSID and GeoCoordinates federation, and
zero for FedX and GeoCoordinates federation. Seven
queries are not enough to draw a conclusion about the
significance of the execution time reduction by LILAC
on GeoCoordinates queries, but 49 queries are enough to
compare the execution time (ET) achieved by ANAPSID
and its extensions with DAW, Fedra, and LILAC.
Figure 15 presents the execution time (ET) for these 49
queries. For queries with at least 100 transferred tuples,
LILAC+ANAPSID reduces considerably the execution
time with respect to ANAPSID; the reductions achieved
by LILAC+ANAPSID are greater than the ones achieved
by DAW+ANAPSID and Fedra+ANAPSID.
Furthermore, the greatest reduction in execution time
is observed in SWDF, WatDiv1 and WatDiv100 for
FedX, and GeoCoordinates for ANAPSID for non-selective
queries. These combinations of federations and queries
are precisely those where the engines transfer the largest
number of tuples. These observations in combination
with the previous statement about query “easiness”,
provide evidence of a correlation between the number of
transferred tuples and the execution time. Moreover, the benefits that can be obtained from using LILAC may ultimately depend on the number of transferred tuples incurred by the engines.

Figure 15: Execution Time (in secs) for LILAC+ANAPSID, Fedra+ANAPSID, DAW+ANAPSID and ANAPSID for queries with at least 100 intermediate results
To confirm that LILAC query decomposition techniques significantly reduce the execution time (ET) of the
studied engines, a Wilcoxon signed rank test was run with
the hypotheses:
H0: using LILAC query decomposition does not change
the engine query execution time.
H1: LILAC query decompositions lead to query executions faster than those of the unmodified engines.
Test p-values are presented in Table C.8, Appendix C. With the exception of GeoCoordinates with LILAC+FedX and LILAC+ANAPSID, and WatDiv1 with LILAC+ANAPSID, all the p-values are below 0.05. These low p-values allow for rejecting the null
hypothesis that the execution time of LILAC+FedX
(resp., LILAC+ANAPSID) and FedX (resp., ANAPSID)
are the same. Additionally, they support the acceptance
of the alternative hypothesis that LILAC+FedX and
LILAC+ANAPSID have lower execution time (ET).
Furthermore, we repeated the test for the WatDiv1 federation and the ANAPSID based engines using only the 49
queries that transfer at least 100 tuples (execution times
given in Figure 15); a p-value of 0.01454 was obtained.
This low p-value allows for rejecting the null hypothesis
and accepting the alternative hypothesis that LILAC
enhances ANAPSID and allows for a significant reduction
on the execution time for the WatDiv1 federation for queries with at least 100 intermediate results.

Figure 16: Completeness for LILAC+ANAPSID, Fedra+ANAPSID, DAW+ANAPSID and ANAPSID

Figure 17: Completeness for LILAC+FedX, Fedra+FedX, DAW+FedX and FedX

Figure 18: Number of Transferred Tuples during execution of LILAC+ANAPSID, Fedra+ANAPSID, DAW+ANAPSID and ANAPSID

Figure 19: Number of Transferred Tuples during execution of LILAC+FedX, Fedra+FedX, DAW+FedX and FedX
To confirm that the engines enhanced with LILAC
achieve a greater reduction in execution time than the
engines enhanced with DAW, a Wilcoxon signed rank test
was run with the hypotheses:
H0: using LILAC or DAW in combination with the
engines leads to similar execution times.
H1: LILAC decompositions lead to query execution times faster than those obtained with DAW source selection.
Test p-values are presented in Table C.9, Appendix C. For all the federations and engines, p-values are below 0.05. These low p-values allow for rejecting the null hypothesis that the execution times of the engines enhanced with the LILAC query decomposition strategy and of the engines enhanced with DAW are similar, and accepting the alternative hypothesis that the reduction in execution time achieved by LILAC is greater. As expected, the fragment descriptions used by LILAC pave the way for a better characterization of the replicated fragments accessible through the endpoints, and consequently, LILAC selects significantly fewer sources than DAW. Furthermore, LILAC identifies endpoints that are able to execute larger sub-queries and reduces both the number of transferred tuples (NTT) and the execution time (ET).
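For concreteness, the one-sided Wilcoxon signed rank tests used throughout this section can be sketched in pure Python via the normal approximation of the signed-rank statistic; the per-query measurements below are made-up illustrative values, not the measurements reported in this paper:

```python
import math

def wilcoxon_less(x, y):
    """One-sided Wilcoxon signed rank test (H1: x < y), normal approximation.
    Returns (W+, p-value), where W+ is the rank sum of positive differences."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    # Rank |d| with average ranks for ties (ranks are 1-based)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j + 2) / 2.0
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    p = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # small W+ gives a small p
    return w_plus, p

# Hypothetical per-query execution times (secs): LILAC+FedX vs. FedX
lilac = [1.2, 0.8, 2.1, 0.5, 1.0, 0.9, 1.4, 0.7]
fedx = [2.5, 1.9, 4.2, 0.9, 2.2, 1.8, 2.6, 1.1]
w, p = wilcoxon_less(lilac, fedx)
# All paired differences are negative, so W+ = 0 and p < 0.05: reject H0
```

With real measurements, the same procedure is available in standard statistical packages (e.g., a paired signed rank test with a one-sided alternative).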
5.2.2. Completeness
Figures 16 and 17 show the performance of the engines in terms of completeness. LILAC+FedX and LILAC+ANAPSID produce complete answers in all cases except for one query against WatDiv100 with LILAC+ANAPSID. In this particular query, ANAPSID selects a physical operator that transfers all the mappings matching each sub-query. Because one triple pattern with the general predicate rdf:type has 136,012 mappings, the execution of this sub-query exceeds the maximal number of tuples returned by the endpoints (100,000), and LILAC+ANAPSID thus fails to produce a complete answer for this query. It should be noticed that in a federation where endpoints have a maximum for the number of results that can be transferred, even approaches such as FedX that theoretically always provide complete answers are no longer able to ensure completeness in practice. For example, FedX has completeness 0.0 for two queries, as shown in Figure 17.

5.2.3. Number of Transferred Tuples
Figures 18 and 19 show the performance of the engines in terms of the number of transferred tuples (NTT). LILAC+FedX and LILAC+ANAPSID transfer fewer tuples than the other approaches; similar behavior is observed for Fedra+FedX and Fedra+ANAPSID in the federations with data from the Diseasome, GeoCoordinates, and LinkedMDB datasets. These real datasets have little in common, i.e., their sizes, numbers of predicates, and coherence values are very different. However, the random queries executed in these federations share some common characteristics: they are mainly selective star-shaped queries (cf. Figure 8). To confirm that LILAC reduces the number of tuples transferred by the engines
and that LILAC reduction is greater than the reduction
achieved by DAW, Wilcoxon signed rank tests were run
with the hypotheses:
H0: FedX, ANAPSID, DAW+FedX, DAW+ANAPSID,
LILAC+FedX, and LILAC+ANAPSID transfer the
same number of tuples.
H1: LILAC+FedX and LILAC+ANAPSID transfer fewer tuples than FedX, ANAPSID, DAW+FedX, and DAW+ANAPSID.
Test p-values are presented in Table C.11, Appendix C. With the exception of GeoCoordinates with LILAC+FedX and DAW+FedX, p-values are below 0.05. These low p-values allow for rejecting the null hypothesis that the engines enhanced with LILAC and the engines enhanced with DAW transfer a similar number of tuples. Additionally, they support the acceptance of the alternative hypothesis that the LILAC reduction in the number of transferred tuples is greater for all the combinations of federations and engines, except for the GeoCoordinates federation and FedX. GeoCoordinates queries are star-shaped queries with a low number of triple patterns and a very limited number of answers and transferred results when executed with FedX (cf. Figures 8 and 19). It is only in this “easy” setup that LILAC does not outperform DAW in terms of number of transferred tuples. We also
performed a Wilcoxon signed rank test to check if the
reduction achieved by DAW is greater than the reduction
achieved by LILAC in combination with FedX and a
p-value of 0.9772 was obtained for this test. Therefore, for the GeoCoordinates federation and the FedX engine, both DAW and LILAC achieved reductions in the number of transferred tuples, but it cannot be stated that either of them is significantly better than the other.
The observed results and p-values confirm that LILAC does enhance both FedX and ANAPSID, and that the new federated query engines LILAC+FedX and LILAC+ANAPSID are capable of finding better execution plans that transfer significantly fewer tuples than FedX and ANAPSID. Knowledge about the replicated fragments in the LILAC catalog paves the way for a reduction in the number of selected sources and allows for delegating join execution to the endpoints whenever possible.
5.3. Discussion
Obtained results confirm that LILAC provides a better solution to QDP-FR than existing approaches. Table 4 shows the limitations exhibited by LILAC and existing approaches and the metrics impacted by these limitations. Because FedX, ANAPSID, and DAW are unaware of the replicated fragments, they select sources that provide redundant data, and the Number of Selected Sources (NSS), the Number of Transferred Tuples (NTT), and the Execution Time (ET) are negatively impacted. Because DAW computes the similarity among data exposed by different endpoints, the Query Decomposition and Source Selection Time (DST) is negatively impacted, and the DST of DAW+FedX is higher than that of the other FedX-based approaches. Because FedX and DAW do not take into account how the triple patterns are connected in the query, the Number of Transferred Tuples (NTT) is negatively impacted: FedX plans may include sub-queries with Cartesian products, and DAW may select different sources to evaluate triple patterns in a join that could be locally evaluated in one source. Because ANAPSID implements a heuristic to select the source most likely to have relevant data, the Completeness (C) is negatively impacted, and ANAPSID may fail to produce some of the query answers. Because Fedra performs a replication-aware source selection and is tailored to work with existing query decomposers, the Number of Transferred Tuples (NTT) is negatively impacted: Fedra has to select just one source per triple pattern, if possible, to allow for delegating join evaluation to the endpoints, and doing so may prevent the generation of plans with a lower number of transferred tuples. Because DAW, Fedra, and LILAC have more knowledge about the data distribution in the federation, using this knowledge to produce better source selections that reduce the transfer of redundant data negatively impacts the Query Decomposition and Source Selection Time (DST). Moreover, for “easy” queries where the DST dominates the Execution Time (ET), i.e., queries that transfer very few tuples from endpoints to the engine for any source selection, the ET is also negatively impacted because only negligible reductions are attainable. Because DAW relies on estimations based on data summaries, and Fedra and LILAC rely on heuristic solutions to the set covering problem, the Number of Transferred Tuples (NTT) and the Execution Time (ET) are negatively impacted: these approaches only produce approximate solutions, and therefore risk retrieving redundant data or delegating fewer joins to the endpoints than is actually possible, and consequently exhibiting a higher NTT or ET than necessary. Finally, because all the approaches rely on endpoints providing all the results for the evaluated sub-queries, and endpoints only send a limited number of results, the Completeness (C) is negatively impacted and the engines may fail to produce some of the answers.

6. Related Work
6.1. Distributed Databases and Data Fragmentation and
Replication
A distributed database corresponds to a logically unified database that is physically partitioned into multiple portions or fragments distributed across a computer network [24]. Fragments can be replicated over a distributed database to improve data availability and to enhance the scalability and performance of the data management system that provides access to the distributed database. Queries and the frequency of general data management operations are known beforehand. Thus, a distributed database can be physically designed to speed up
its expected work load, i.e., fragments and replication are tailored for a particular application.

Table 4: Limitations of existing approaches and impact on obtained results for Number of Selected Sources (NSS), Query Decomposition and Source Selection Time (DST), Execution Time (ET), Completeness (C), and Number of Transferred Tuples (NTT). ✓ indicates that the approach exhibits the limitation and ✗ indicates that the approach is free of the limitation

Limitations | FedX | ANAPSID | DAW | Fedra | LILAC | Impacted Metrics
Has no knowledge about replication | ✓ | ✓ | ✓ | ✗ | ✗ | NSS, NTT, ET
Performs very expensive computations | ✗ | ✗ | ✓ | ✗ | ✗ | DST
Is unaware of BGPs or joins | ✓ | ✓ | ✓ | ✗ | ✗ | NTT
Uses completeness risking heuristics | ✗ | ✓ | ✗ | ✗ | ✗ | C
Performs uninformed pruning | ✗ | ✗ | ✗ | ✓ | ✗ | NTT
Is too expensive for “easy” queries | ✗ | ✗ | ✓ | ✓ | ✓ | DST, ET
Is an approximate solution | ✗ | ✗ | ✓ | ✓ | ✓ | ET, NTT
Relies on endpoints for completeness | ✓ | ✓ | ✓ | ✓ | ✓ | C

The Linking Open Data (LOD) cloud [7] is a logically unified dataset composed of Linked Data sets publicly available on the Web and accessible through specific Web access interfaces, e.g., SPARQL endpoints or Triple Pattern Fragments (TPFs). Despite logical similarities, the LOD cloud cannot be considered a distributed database. LOD datasets are published by autonomous participants, and Linked Data availability cannot always be ensured [4]. Different approaches have been defined to empower data consumers to access federations of LOD datasets [27], e.g., federated query engines. However, there is no known set of queries to be executed over the LOD cloud; in consequence, a general physical Linked Data distribution designed to enhance the performance of a particular Linked Data application cannot be achieved.

Despite the differences between distributed databases and the LOD cloud, data fragmentation and replication techniques could be reused to enhance the performance of LOD cloud applications. However, because of the lack of information about the cloud work load, traditional data fragmentation and replication design [24] and distributed query processing approaches [19] need to be redefined to ensure Linked Data availability [31].

LILAC relies on a data distribution strategy tailored for Linked Data. This distribution relies on data consumers that, having interest in a certain federation of LOD datasets, deploy new endpoints to provide access to replicated fragments of these LOD datasets. Thus, even if the LOD datasets become unavailable, Linked Data availability can still be ensured by the data consumers. A replicated fragment corresponds to a set of RDF triples that are part of one dataset in the LOD cloud and satisfy a given triple pattern. Descriptions of the replicated fragments are kept in the LILAC catalog, and the fragment containment relation is computed to determine fragment overlap, thus reducing the retrieval of redundant data and enhancing the performance of Linked Data applications.
6.2. Linked Data Fragmentation
The problem of accessing fragments of Linked Data has been recently addressed in the Semantic Web community, and approaches to support fragment-aware data management have been defined [15, 17, 33].

Verborgh et al. [15, 33] proposed client-side query processing techniques, named Linked Data Fragments (LDF), to opportunistically exploit fragments that materialize sets of triples of a LOD dataset. Fragments are accessible through specific Web access interfaces named Triple Pattern Fragments (TPFs) [33], and TPF servers make the fragments available on the Web. LDF improves Linked Data availability by moving query execution load from servers to clients. During query processing, TPF clients download the fragments required to execute queries from TPF servers through simple HTTP requests, and then locally execute the queries. This strategy allows clients to cache fragments locally and decreases the load on the TPF server. LDF chooses a clear tradeoff by shifting query processing to clients, at the cost of slower query execution [33].

LDF restricts the operations that TPF servers perform and relies on TPF clients to perform SPARQL operations. Additionally, fragments are accessed using the TPF Web access interfaces and have not been designed for compatibility with the SPARQL protocol. Contrarily, LILAC relies on data consumer endpoints that implement the SPARQL protocol; thus, federated query engines can be used for federated query processing. Both approaches propose strategies to improve Linked Data availability, but while in LDF the cost of query execution is individually assumed by one client, in LILAC the client shares this cost across the data consumer endpoints.

Ibáñez et al. [17] also assume that RDF data from LOD datasets is fragmented and replicated by data consumers; however, fragments can be updated, i.e., data consumers and providers can edit the fragments. Thus, Ibáñez et al. proposed data synchronization techniques, named Col-Graph, that allow for the synchronization of changes in the fragments on the consumer side, i.e., fragments are kept up to date in the presence of concurrent editions.

LILAC assumes that all replicated fragments are replicas of portions of LOD datasets, i.e., all the replicas of a given fragment comprise the same RDF data. This assumption can be unrealistic for LOD datasets that change regularly, e.g., DBpedia Live. In consequence, techniques such as the ones proposed by Ibáñez et al. in Col-Graph can be used to enhance LILAC for managing scenarios where the replicated fragment triples are not the same as in the LOD dataset, i.e., replicated fragments with divergence. For instance, potential source selection criteria in this scenario would be choosing endpoints that access fragments with the lowest divergence, or pruning fragments with divergence greater than a given threshold. A preliminary solution to this problem is outlined in [21]. However, managing divergent replicated fragments is out of the scope of LILAC.

6.3. Source Selection in Federations of SPARQL Endpoints
Different techniques for source selection have been implemented in existing federated query engines. Some of them are based on runtime contacts to the endpoints, i.e., the execution of SPARQL ASK queries [1, 26, 30]. Other approaches rely on indexes, precomputed or built at runtime, that provide some statistics about the triples accessible through the endpoints [1, 12, 26]. DARQ [26] relies on both an index-based structure on source descriptions and basic statistics, e.g., the number of triples per predicate, to identify the sources that can be used to evaluate triple patterns with a given predicate. For each query triple pattern, FedX resorts to the evaluation of SPARQL ASK queries to determine the endpoints relevant for the triple pattern, i.e., endpoints that access at least one triple that matches the triple pattern. SPLENDID [12] combines lookups in precomputed VoID indexes with runtime endpoint contacts. Endpoints are initially selected according to the bindings in the triple patterns. For instance, triple patterns with a bound predicate, or triple patterns with the bound predicate rdf:type and a bound object, are looked up in the index; however, all the possible endpoints are selected for triple patterns with an unbound predicate. For triple patterns with a bound subject or object, this initial source selection is refined by contacting the selected endpoints; thus, the set of selected sources may be reduced. ANAPSID [1] maintains, for each endpoint in a federation, the predicates that can be evaluated by that endpoint, i.e., endpoints are described in terms of predicates. These endpoint descriptions are used to perform an initial source selection. Additionally, heuristics [23] (see footnote 18) are followed to reduce the initial set of selected sources, and relevant endpoints are pruned according to the bindings of the triple patterns. For example, if the predicate bindings share the same namespace with predicates associated with an endpoint description, then this endpoint is selected.

Saleem et al. [29] proposed a join-aware source selection strategy, named HiBISCuS. HiBISCuS is based on pruning sources that cannot contribute to producing query answers. It uses an index with the predicates and the authority URIs of the resources described in the triples accessible through the federation endpoints. The authorities are used to prune sources that either do not match any of the query triple patterns, or that, when considering the query joins, would produce an empty join evaluation. HiBISCuS has been combined with existing federated query engines such as FedX and SPLENDID, and it has been shown to successfully reduce the number of selected sources.

These existing source selection strategies have no knowledge about the replication of fragments in the federation of endpoints. Hence, they have to select all the relevant sources in order to produce complete answers. We recognize that existing source selection techniques may be used to enhance LILAC source selection. However, the main goal of this paper is to show how, by exploiting knowledge about replicated fragments in concise source descriptions such as fragment containment relations, good query decompositions and source selections can be identified.
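ASK-based selection of the kind performed by FedX can be sketched as follows; the function names, the stub federation, and the probe below are illustrative assumptions, not FedX's actual implementation:

```python
def ask_query(tp):
    """Build a SPARQL ASK query for a triple pattern given as (s, p, o) strings."""
    return "ASK { %s %s %s }" % tp

def select_sources(triple_patterns, endpoints, ask):
    """Per-triple-pattern selection: an endpoint is relevant for a pattern if it
    holds at least one matching triple. `ask(endpoint, query)` stands for a
    SPARQL-protocol request returning the ASK result as a boolean."""
    return {tp: [e for e in endpoints if ask(e, ask_query(tp))]
            for tp in triple_patterns}

# Stub federation: endpoint URL -> predicates it can answer (made-up data)
data = {"http://ep1/sparql": {"dbo:director"}, "http://ep2/sparql": {"rdf:type"}}
probe = lambda ep, q: any(pred in q for pred in data[ep])

tps = [("?f", "dbo:director", "?d"), ("?f", "rdf:type", "dbo:Film")]
relevant = select_sources(tps, list(data), probe)
# relevant maps each triple pattern to the endpoints that can answer it
```

In a real deployment, `probe` would issue an HTTP request per endpoint, which is exactly the runtime cost that index-based approaches such as DARQ and SPLENDID try to avoid.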
6.4. Source Selection in Federations with Duplicated and
Replicated Data
Recently, source selection strategies aware of triple
duplication have been proposed [16, 28]. These approaches
use data summaries or sketches to estimate the overlapping
among sources. Benefit-Based Query Routing (BBQ) [16] extends ASK queries with Bloom filters [8] that provide a summary of the results, in order to prune sources that provide a low number of triple pattern mappings, i.e., endpoints with low benefit. DAW [28] uses a combination of Min-Wise Independent Permutations (MIPs) [9] and triple selectivity information to estimate the overlap between the results of different sources. Based on how many new query results are expected to be found, sources that are below a predefined benefit threshold are discarded and not selected.
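The min-wise hashing idea behind DAW's MIPs can be illustrated with a simplified bottom-k sketch; this is not DAW's actual implementation, and the triple identifiers are made up:

```python
import hashlib

def minhash(items, k=50):
    """Bottom-k min-wise hashing sketch: keep the k smallest hash values of a
    set. A simplified stand-in for Min-Wise Independent Permutations (MIPs)."""
    return sorted(int(hashlib.sha1(s.encode()).hexdigest(), 16) for s in items)[:k]

def jaccard_estimate(sk_a, sk_b, k=50):
    """Estimate the Jaccard similarity of two sets from their bottom-k sketches:
    the fraction of the k smallest hashes of the union present in both sketches."""
    union = sorted(set(sk_a) | set(sk_b))[:k]
    both = set(sk_a) & set(sk_b)
    return sum(1 for h in union if h in both) / len(union)

# Hypothetical triple identifiers exposed by two endpoints
a = minhash("triple-%d" % i for i in range(100))
b = minhash("triple-%d" % i for i in range(50, 150))
# 50 of the 150 distinct triples are shared, so the estimate is near 1/3
```

Such sketches let an engine compare endpoint contents without exchanging the triples themselves, at the cost of approximation error.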
BBQ [16] and DAW [28] address the closely related issue of data duplication; however, data replication is not considered by these approaches. In data duplication, triples are copied without following a replication schema. Thus, more
detailed information about the triples accessible through
the endpoints is needed to be able to assess overlapping
among sets of triples and to prune sources that do not
provide access to any further triple. The accuracy of these approaches relies on the precise description of the metadata about their accessible triples and on overlapping estimations. Because keeping all this information is not always feasible, these approaches may be too expensive and not accurate enough in the context of replicated fragments of LOD datasets. Additionally, these approaches perform source selection at the triple pattern level, i.e., sources are selected for each triple pattern without taking into account that the triple pattern belongs to a BGP. Contrarily, LILAC exploits knowledge encoded in the fragment description, e.g., the fragment containment relation, and
[Footnote 18: In practice, these heuristics have been shown to be accurate for most queries, but they could prune relevant sources needed to provide a complete answer.]
precisely identifies sources that have replicated the same fragments. This knowledge is used to choose, among these replicas, the ones that can evaluate more query joins in order to reduce the number of transferred tuples, i.e., the source selection is performed at the BGP level. Lastly, these approaches prune sources with duplicated data during source selection and may prevent the query decomposer from producing the most selective sub-queries. Contrarily, LILAC interleaves replication-aware source selection and query decomposition. This allows the query decomposer to take advantage of opportunities to delegate more selective sub-queries to the endpoints.
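The fragment containment relation mentioned above can be checked syntactically on the triple patterns that define the fragments. A minimal sketch, assuming a simplified representation of fragments as (subject, predicate, object) string triples with '?'-prefixed variables:

```python
def contains(general, specific):
    """True when every RDF triple matching `specific` also matches `general`,
    i.e., each position of `general` is a variable ('?'-prefixed) or equals
    the corresponding term of `specific`."""
    return all(g.startswith("?") or g == s for g, s in zip(general, specific))

# Fragments reduced to their defining triple patterns (illustrative data)
f_all_types = ("?s", "rdf:type", "?class")
f_films = ("?s", "rdf:type", "dbo:Film")
redundant = contains(f_all_types, f_films)
# True: every triple of f_films is also in f_all_types, so a source replicating
# only f_films is redundant once a source replicating f_all_types is selected
```

This is only the syntactic core of the idea; the LILAC catalog stores such containment relations precomputed over the fragment descriptions.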
In previous work, we defined Fedra [22], a source selection approach that also addresses the problem of querying federations of endpoints with replicated fragments. In Fedra, fragment descriptions are used to prune sources that provide only redundant data, and BGPs are taken into account to delegate join execution to the endpoints, if possible; consequently, Fedra reduces the number of transferred tuples. FedX and ANAPSID were extended with Fedra source selection, and experimental studies show that, in general, Fedra+FedX and Fedra+ANAPSID transfer fewer tuples than FedX and ANAPSID, respectively. Because Fedra performs only source selection, it has to choose among multiple endpoints that access the same fragment before knowing how the query is decomposed into sub-queries. This prevents the query decomposer from producing the most selective sub-queries. Contrarily, LILAC interleaves replication-aware source selection and query decomposition, and can be used in combination with existing federated query engines to produce execution plans that are unreachable by the existing engines.
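The replication-aware pruning performed by Fedra and LILAC can be cast as a set covering problem: select few endpoints whose replicated fragments jointly cover all the fragments relevant to the query. A minimal greedy sketch of this kind of heuristic, over hypothetical endpoint descriptions:

```python
def greedy_cover(universe, sources):
    """Greedy set covering: repeatedly select the source covering the most
    still-uncovered elements; a standard ln(n)-approximation."""
    uncovered, chosen = set(universe), []
    while uncovered:
        best = max(sources, key=lambda s: len(sources[s] & uncovered))
        if not sources[best] & uncovered:
            raise ValueError("some fragments are not covered by any source")
        chosen.append(best)
        uncovered -= sources[best]
    return chosen

# Hypothetical endpoints and the fragments they replicate
sources = {"ep1": {"f1", "f2"}, "ep2": {"f2"}, "ep3": {"f3"}}
picked = greedy_cover({"f1", "f2", "f3"}, sources)
# "ep2" is pruned: its fragment f2 is already covered by ep1
```

Because greedy set covering is approximate, the selected sources may be slightly larger than the optimum, which is the limitation listed for Fedra and LILAC in Table 4 ("Is an approximate solution").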
6.5. Query Decomposition in Federations of SPARQL Endpoints
FedX [30] decomposes queries into exclusive groups. All the triple patterns with the same unique relevant endpoint are evaluated together in the same sub-query in order to reduce the number of calls that the engine makes during query execution. The triple patterns with more than one relevant endpoint are each evaluated as a sub-query with just one triple pattern in all the relevant endpoints. Then, the results obtained from the different endpoints are combined. It is important to notice that FedX may evaluate in the same sub-query triple patterns that do not share any variable, i.e., FedX may generate sub-queries with Cartesian products, if they can be exclusively evaluated by one endpoint. This decision may negatively impact query execution time, because pushing down Cartesian products to the endpoints can greatly increase the number of tuples to transfer.

ANAPSID [1] decomposes queries into star-shaped groups. After the ANAPSID heuristics [23] are used to select the relevant endpoints for each triple pattern, sub-queries are built according to the number of relevant endpoints selected for the triple patterns. Triple patterns with more than one relevant endpoint are evaluated using sub-queries with just one triple pattern in all the selected sources, and their results are combined in the federated query engine. The other triple patterns are grouped into star-shaped groups, i.e., sets of triple patterns with just one variable in common, and are evaluated in the same sub-queries. The use of star-shaped groups may reduce the size of intermediate results and the number of tuples transferred from endpoints to the query engine, as suggested by Vidal et al. [35].

Recently, Vidal et al. [34] have proposed a source selection and query decomposition strategy, named Fed-DSATUR. Fed-DSATUR is based on a reduction to the vertex coloring problem and extends an existing approximate solution for the vertex coloring problem, named DSATUR, to produce query decompositions that maximize answer completeness and minimize execution time. Fed-DSATUR can find optimal solutions for certain queries, e.g., if the query is reduced to a bipartite graph. The minimization of the number of colors corresponds to minimizing the number of sub-queries in the query decomposition; while this minimization may have a positive impact on the execution time, it may also have a negative impact on the query completeness.

Similarly to ANAPSID and Fed-DSATUR, LILAC decomposes queries into Cartesian-product-free sub-queries. Nevertheless, differently from existing approaches, the LILAC catalog includes the description of the replicated fragments accessible through the federation endpoints. Knowledge encoded in the fragment descriptions is exploited by LILAC to safely prune relevant sources that share the same relevant data. Moreover, LILAC may include one triple pattern in more than one sub-query in order to increase the selectivity of these sub-queries, if they are evaluated by endpoints accessing all the relevant triples in the federation.
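The star-shaped grouping used by ANAPSID can be sketched as grouping triple patterns by their subject variable; the helper below is an illustrative simplification, not ANAPSID's actual decomposer:

```python
from collections import defaultdict

def star_groups(triple_patterns):
    """Group triple patterns into star-shaped sub-queries: patterns sharing the
    same subject term form one group and can be evaluated together."""
    groups = defaultdict(list)
    for s, p, o in triple_patterns:
        groups[s].append((s, p, o))
    return list(groups.values())

tps = [("?f", "dbo:director", "?d"), ("?f", "rdf:type", "dbo:Film"),
       ("?d", "dbo:birthPlace", "?c")]
subqueries = star_groups(tps)
# Two stars: one around ?f (two patterns) and one around ?d (one pattern)
```

Evaluating each star as a single sub-query lets the endpoint perform the star's joins locally, which is the effect on intermediate results suggested by Vidal et al. [35].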
7. Conclusions and Future Work
In this paper, we illustrated how replicating fragments allows for data re-organization to better fit federated query needs. We have overcome intrinsic limitations of replication-aware source selection strategies and have enabled federated query engines to take advantage of replication during query decomposition to produce sub-queries that are more selective and transfer fewer tuples.
Concise description of the replicated data using fragment
descriptions paves the way for safely pruning sources
with replicated data, and generating sub-queries that
push joins down to the endpoints. We have formalized
the problem of source selection and query decomposition
for federations with replicated fragments (QDP-FR).
Moreover, we proposed a replication-aware federated
query decomposition algorithm LILAC that approximates QDP-FR and ensures soundness and completeness
of query answers. Federated query engines ANAPSID
and FedX were extended with LILAC source selection
and query decomposition strategies. Experimental results demonstrate that LILAC achieves a significant reduction in the number of transferred tuples. Additionally, for queries not “easily” executable by the studied engines, the execution time was also significantly reduced.
This work opens several perspectives. First, we made the hypothesis that replicated fragments are perfectly synchronized and cannot be updated. We can relax this hypothesis and address the problem of federated query processing with divergence. For instance, in [21], the divergence incurred by the endpoints in the federation is measured, and endpoints that exceed a divergence threshold are pruned from the relevant sources.
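Such threshold-based pruning can be sketched as a simple filter over endpoint metadata; the endpoint names, divergence values, and threshold below are hypothetical, in the spirit of the strategy described in [21]:

```python
# Hypothetical metadata: measured divergence of each endpoint's copy of a
# fragment (e.g., fraction of out-of-date triples). Endpoints whose
# divergence exceeds a user-chosen threshold are dropped from the
# relevant sources before query decomposition.
def prune_divergent(relevant_endpoints, divergence, threshold):
    return [e for e in relevant_endpoints
            if divergence.get(e, 0.0) <= threshold]

divergence = {"e1": 0.00, "e2": 0.12, "e3": 0.03}
print(prune_divergent(["e1", "e2", "e3"], divergence, threshold=0.05))
# ['e1', 'e3']
```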
Several variants of QDP-FR can also be developed. QDP-FR does not distinguish between endpoints, i.e., the cost of accessing each endpoint is considered the same. For example, QDP-FR and LILAC can be modified to compute a source selection and query decomposition that take into account user preferences or other types of data quality values, and to exploit such meta-data to minimize the number of relevant endpoints. Another perspective is to take further advantage of replicated data: the presence of the same replicated fragment at several endpoints may be exploited by a federated query engine, which can use these endpoints to share the evaluation of sub-queries.
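One simple way to share evaluation among endpoints that replicate the same fragment is round-robin assignment of sub-queries; this is a hypothetical scheme sketching the perspective above, not a mechanism from the paper:

```python
from itertools import cycle

# Distribute sub-queries over the endpoints that replicate the fragment
# they access, so no single endpoint evaluates all of them.
def share_subqueries(subqueries, replicating_endpoints):
    assignment = {}
    endpoints = cycle(replicating_endpoints)  # round-robin over replicas
    for sq in subqueries:
        assignment[sq] = next(endpoints)
    return assignment

plan = share_subqueries(["sq1", "sq2", "sq3"], ["e1", "e2"])
print(plan)  # {'sq1': 'e1', 'sq2': 'e2', 'sq3': 'e1'}
```

A load-aware variant could weight the assignment by endpoint capabilities, which connects this perspective to the cost-aware variants of QDP-FR mentioned above.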
Acknowledgements
Experiments were carried out using the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr). This work is partially supported by the French National Research Agency (ANR) through the SocioPlug project (code: ANR-13-INFR-0003). The first author was part of the Unit UMR6241 of the Centre National de la Recherche Scientifique (CNRS) while implementing and evaluating the presented approach.
References
[1] M. Acosta, M. Vidal, T. Lampo, J. Castillo, and E. Ruckhaus. ANAPSID: An Adaptive Query Processing Engine for SPARQL Endpoints. In Aroyo et al. [5], pages 18–34.
[2] G. Aluç, O. Hartig, M. T. Özsu, and K. Daudjee. Diversified Stress Testing of RDF Data Management Systems. In Mika et al. [20], pages 197–212.
[3] G. Aluç, M. T. Özsu, K. Daudjee, and O. Hartig. chameleon-db: a Workload-Aware Robust RDF Data Management System. University of Waterloo, Tech. Rep. CS-2013-10, 2013.
[4] C. B. Aranda, A. Hogan, J. Umbrich, and P. Vandenbussche. SPARQL Web-Querying Infrastructure: Ready for Action? In H. Alani et al., editors, ISWC 2013, Part II, volume 8219 of LNCS, pages 277–293. Springer, 2013.
[5] L. Aroyo, C. Welty, H. Alani, J. Taylor, A. Bernstein, L. Kagal, N. F. Noy, and E. Blomqvist, editors. The Semantic Web - ISWC 2011 - 10th International Semantic Web Conference, Bonn, Germany, October 23-27, 2011, Proceedings, Part I, volume 7031 of Lecture Notes in Computer Science. Springer, 2011.
[6] C. Basca and A. Bernstein. Avalanche: Putting the Spirit of the Web back into Semantic Web Querying. In A. Polleres and H. Chen, editors, ISWC Posters & Demos, volume 658 of CEUR Workshop Proceedings. CEUR-WS.org, 2010.
[7] C. Bizer, T. Heath, and T. Berners-Lee. Linked Data - The Story So Far. Int. J. Semantic Web Inf. Syst., 5(3):1–22, 2009.
[8] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, July 1970.
[9] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-Wise Independent Permutations. J. Comput. Syst. Sci., 60(3):630–659, 2000.
[10] S. Duan, A. Kementsietsidis, K. Srinivas, and O. Udrea. Apples and oranges: a comparison of RDF benchmarks and real RDF datasets. In T. K. Sellis, R. J. Miller, A. Kementsietsidis, and Y. Velegrakis, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12-16, 2011, pages 145–156. ACM, 2011.
[11] K. M. Endris, S. Faisal, F. Orlandi, S. Auer, and S. Scerri. Interest-based RDF update propagation. In M. Arenas, Ó. Corcho, E. Simperl, M. Strohmaier, M. d'Aquin, K. Srinivas, P. T. Groth, M. Dumontier, J. Heflin, K. Thirunarayan, and S. Staab, editors, The Semantic Web - ISWC 2015 - 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part I, volume 9366 of Lecture Notes in Computer Science, pages 513–529. Springer, 2015.
[12] O. Görlitz and S. Staab. SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions. In O. Hartig, A. Harth, and J. Sequeda, editors, COLD, 2011.
[13] C. Gutierrez, C. A. Hurtado, A. O. Mendelzon, and J. Pérez. Foundations of Semantic Web databases. J. Comput. Syst. Sci., 77(3):520–541, 2011.
[14] A. Y. Halevy. Answering queries using views: A survey. VLDB J., 10(4):270–294, 2001.
[15] J. V. Herwegen, R. Verborgh, E. Mannens, and R. V. de Walle. Query execution optimization for clients of triple pattern fragments. In F. Gandon, M. Sabou, H. Sack, C. d'Amato, P. Cudré-Mauroux, and A. Zimmermann, editors, The Semantic Web. Latest Advances and New Domains - 12th European Semantic Web Conference, ESWC 2015, Portoroz, Slovenia, May 31 - June 4, 2015. Proceedings, volume 9088 of Lecture Notes in Computer Science, pages 302–318. Springer, 2015.
[16] K. Hose and R. Schenkel. Towards benefit-based RDF source selection for SPARQL queries. In R. D. Virgilio, F. Giunchiglia, and L. Tanca, editors, SWIM, page 2. ACM, 2012.
[17] L. D. Ibáñez, H. Skaf-Molli, P. Molli, and O. Corby. Col-Graph: Towards Writable and Scalable Linked Open Data. In Mika et al. [20], pages 325–340.
[18] D. S. Johnson. Approximation Algorithms for Combinatorial Problems. In A. V. Aho et al., editors, ACM Symposium on Theory of Computing, pages 38–49. ACM, 1973.
[19] D. Kossmann. The state of the art in distributed query processing. ACM Computing Surveys, 32(4):422–469, 2000.
[20] P. Mika, T. Tudorache, A. Bernstein, C. Welty, C. A. Knoblock, D. Vrandecic, P. T. Groth, N. F. Noy, K. Janowicz, and C. A. Goble, editors. The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part I, volume 8796 of Lecture Notes in Computer Science. Springer, 2014.
[21] G. Montoya, H. Skaf-Molli, P. Molli, and M.-E. Vidal. Fedra: Query Processing for SPARQL Federations with Divergence. Technical report, Université de Nantes, May 2014.
[22] G. Montoya, H. Skaf-Molli, P. Molli, and M.-E. Vidal. Federated SPARQL Queries Processing with Replicated Fragments. In The Semantic Web - ISWC 2015 - 14th International Semantic Web Conference, pages 36–51, Bethlehem, United States, Oct. 2015.
[23] G. Montoya, M.-E. Vidal, and M. Acosta. A heuristic-based approach for planning federated SPARQL queries. In COLD, 2012.
[24] M. T. Özsu and P. Valduriez. Principles of distributed database systems. Springer, 2011.
[25] A. Passant and P. N. Mendes. sparqlPuSH: Proactive notification of data updates in RDF stores using PubSubHubbub. In SFSW, 2010.
[26] B. Quilitz and U. Leser. Querying Distributed RDF Data Sources with SPARQL. In S. Bechhofer et al., editors, ESWC 2008, volume 5021 of LNCS, pages 524–538. Springer, 2008.
[27] N. A. Rakhmawati, J. Umbrich, M. Karnstedt, A. Hasnain, and M. Hausenblas. A comparison of federation over SPARQL endpoints frameworks. In P. Klinov and D. Mouromtsev, editors, Knowledge Engineering and the Semantic Web - 4th International Conference, KESW 2013, St. Petersburg, Russia, October 7-9, 2013. Proceedings, volume 394 of Communications in Computer and Information Science, pages 132–146. Springer, 2013.
[28] M. Saleem, A.-C. N. Ngomo, J. X. Parreira, H. F. Deus, and M. Hauswirth. DAW: Duplicate-AWare Federated Query Processing over the Web of Data. In H. Alani et al., editors, ISWC 2013, Part I, volume 8218 of LNCS, pages 574–590. Springer, 2013.
[29] M. Saleem and A. N. Ngomo. HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation. In V. Presutti et al., editors, ESWC 2014, volume 8465 of LNCS, pages 176–191. Springer, 2014.
[30] A. Schwarte, P. Haase, K. Hose, R. Schenkel, and M. Schmidt. FedX: Optimization Techniques for Federated Query Processing on Linked Data. In Aroyo et al. [5], pages 601–616.
[31] J. Umbrich, K. Hose, M. Karnstedt, A. Harth, and A. Polleres. Comparing data summaries for processing live queries over linked data. World Wide Web, 14(5-6):495–544, 2011.
[32] P.-Y. Vandenbussche, J. Umbrich, A. Hogan, and C. Buil-Aranda. SPARQLES: Monitoring Public SPARQL Endpoints. Semantic Web, 2016. To appear.
[33] R. Verborgh, O. Hartig, B. D. Meester, G. Haesendonck, L. D. Vocht, M. V. Sande, R. Cyganiak, P. Colpaert, E. Mannens, and R. V. de Walle. Querying Datasets on the Web with High Availability. In Mika et al. [20], pages 180–196.
[34] M.-E. Vidal, S. Castillo, M. Acosta, G. Montoya, and G. Palma. On the Selection of SPARQL Endpoints to Efficiently Execute Federated SPARQL Queries. Transactions on Large-Scale Data- and Knowledge-Centered Systems, 2015.
[35] M.-E. Vidal, E. Ruckhaus, T. Lampo, A. Martínez, J. Sierra, and A. Polleres. Efficiently joining group patterns in SPARQL queries. In L. Aroyo, G. Antoniou, E. Hyvönen, A. ten Teije, H. Stuckenschmidt, L. Cabral, and T. Tudorache, editors, ESWC (1), volume 6088 of Lecture Notes in Computer Science, pages 228–242. Springer, 2010.
[36] F. Wilcoxon. Individual comparisons by ranking methods. In Breakthroughs in Statistics, pages 196–202. Springer, 1992.
Appendix A. Computation of the complexity given in Proposition 1
Let n be the number of triple patterns in the query, m the number of fragments, and k the number of endpoints. The time complexity of Algorithm 1, T(n, m, k), can be computed as:
T(n, m, k) = n·U(m) + V(n, m, k) + W(n, k) + X(n, m, k)
with U(m) the complexity of finding the non-redundant relevant fragments of a triple pattern, V(n, m, k) the complexity of ReduceUnions, W(n, k) the complexity of ReduceBGPs, and X(n, m, k) the complexity of IncreaseSelectivity. U(m) = O(m²), as each fragment should be considered once and checked for containment against the other fragments. V(n, m, k) = O(n·m·k), as potentially, for each triple pattern, all the fragments should be considered for each endpoint. W(n, k) = O(n³·k), as potentially, for each endpoint, all the pairs of sub-queries should be considered to find the two that share a variable; in the worst case, all triple patterns can be evaluated in the endpoints, and in each iteration the number of sub-queries is reduced by one. X(n, m, k) = O(n²·m·k), as each triple pattern may be associated with as many endpoint subsets as the number of fragments, each fragment may be accessed by all the endpoints, and each endpoint may be associated with as many subsets as the number of triple patterns in ljtps. Then T(n, m, k) can be written as:
T(n, m, k) = max(O(n·m²), O(n³·k), O(n²·m·k))
And it can be more concisely written as:
T(n, m, k) = O(n³·m²·k)
Notice that the execution of runtime queries to confirm fragment relevance has been omitted from the complexity computation. If such queries are considered, the complexity could become linear in the number of triples stored by the endpoints, if no efficient indexing structures are provided by the data store.
Appendix B. Proof of Theorem 1
Theorem 1 states that Algorithm 1 produces a query decomposition that satisfies Properties 1-3 (Section 3.2).
Property 1. Proof by contradiction. Suppose Algorithm 1 produces an answer that is not sound; then at least one triple pattern has been assigned to an endpoint that does not access any of the triple pattern's relevant fragments, or a join between two triple patterns in Q has not been included in the output of Algorithm 1. All triple patterns have been included in sub-queries either in line 54 or in line 63. Moreover, these sub-queries assign triple patterns to endpoints that access relevant fragments for the triple patterns, which have been obtained by functions SelectNonRedundantFragments and endpoints(). Furthermore, Algorithm 1 preserves the existing joins in Q, i.e., all the triple patterns in Q have been included in sub-queries either in line 54 or in line 63, and shared variables among triple patterns remain unchanged in the output of Algorithm 1. Therefore, it cannot be the case that a triple pattern has been assigned to an endpoint with no relevant fragment, or that a join has been removed from the query, and Algorithm 1 produces only sound answers.
Property 2. Proof by contradiction. Suppose Algorithm 1 fails to produce a valid answer to query Q. In this case, data for at least one triple pattern is missing,
or there are additional joins imposed by the query decomposition. For one triple pattern to be missing some data, a fragment that provided non-redundant data should have been pruned in line 2; but in that line only redundant fragments are pruned, so this cannot be the case. Only in line 61 are additional joins included, but only triple patterns that can be completely provided by the endpoint belong to ljtps; hence, tuples filtered by these joins would anyway be pruned before returning the query answer. In consequence, no tuple that should be present in the query answer is removed by Algorithm 1, ensuring its completeness.
Property 3. To satisfy this property, every sub-query should be Cartesian product free, as large as possible, and non-redundant. Proof by contradiction: suppose the output of Algorithm 1 includes a sub-query sq that has a Cartesian product, is not as large as possible, or is redundant. Proof by cases. Case a) sq includes a Cartesian product. However, sub-queries are only built in lines 44 and 60, and these sub-queries are by construction Cartesian product free. Case b) sq is not as large as possible. However, sub-queries are only built in lines 44 and 60, and these sub-queries are by construction as large as possible. Case c) sq is redundant. Sub-queries are only built in lines 44 and 60. However, sub-queries built in line 44 are pruned in line 48 using a set covering solution, i.e., triple patterns are covered using as few sub-queries as possible, and no redundant sub-queries can remain in the output of function SetCoveringSubqueries. Moreover, sub-queries built in line 60 access different relevant fragments for triple pattern tp, with no known containment relationships among them; therefore, if all the replicated fragments have been described, none of these sub-queries is redundant either. Hence, Algorithm 1 produces an output with Cartesian-product-free, as large as possible, and non-redundant sub-queries, and minimizes the data transfer.
Appendix C. P-values of the Wilcoxon signed rank tests performed
Table C.5: Wilcoxon signed rank test p-values for testing if LILAC and the engines select the same number of sources or if LILAC selects fewer sources. Bold p-values allow accepting that LILAC selects fewer sources than the engines.

Federation       ANAPSID      FedX
Diseasome        < 2.2e-16    2.292e-07
SWDF             < 2.2e-16    3.357e-12
LinkedMDB        < 2.2e-16    2.792e-11
Geocoordinates   < 2.2e-16    1.301e-05
WatDiv1          3.782e-13    2.361e-05
WatDiv100        4.402e-14    0.0009134

Table C.6: Wilcoxon signed rank test p-values for testing if LILAC and DAW select the same number of sources or if LILAC selects fewer sources. Bold p-values allow accepting that LILAC selects fewer sources than DAW.

Federation       ANAPSID      FedX
Diseasome        < 2.2e-16    < 2.2e-16
SWDF             < 2.2e-16    < 2.2e-16
LinkedMDB        < 2.2e-16    3.714e-14
Geocoordinates   < 2.2e-16    6.653e-13
WatDiv1          < 2.2e-16    2.368e-16
WatDiv100        1.267e-11    9.179e-16

Table C.7: Wilcoxon signed rank test p-values for testing if LILAC and DAW/FedX DST is the same or if LILAC source selection is faster. Bold p-values allow accepting that LILAC DST is faster than DAW/FedX.

Federation       DAW+FedX     FedX
Diseasome        < 2.2e-16    7.728e-13
SWDF             < 2.2e-16    < 2.2e-16
LinkedMDB        < 2.2e-16    1.118e-14
Geocoordinates   < 2.2e-16    1.326e-11
WatDiv1          < 2.2e-16    3.76e-16
WatDiv100        < 2.2e-16    4.235e-14

Table C.8: Wilcoxon signed rank test p-values for testing if LILAC in combination with the engines and the engines alone have the same execution time or if LILAC reduces the engines' execution time. Bold p-values allow accepting that LILAC reduces the engines' execution time.

Federation       ANAPSID      FedX
Diseasome        6.421e-13    < 2.2e-16
SWDF             0.0069       < 2.2e-16
LinkedMDB        1.37e-06     6.442e-10
Geocoordinates   1            1
WatDiv1          0.9995       < 2.2e-16
WatDiv100        0.002381     9.497e-06

Table C.9: Wilcoxon signed rank test p-values for testing if LILAC in combination with the engines and the engines alone have the same execution time or if LILAC reduces the engines' execution time. Bold p-values allow accepting that the engines enhanced with LILAC achieve a greater reduction in execution time than the engines enhanced with DAW.

Federation       ANAPSID      FedX
Diseasome        < 2.2e-16    < 2.2e-16
SWDF             < 2.2e-16    < 2.2e-16
LinkedMDB        2.626e-16    < 2.2e-16
Geocoordinates   < 2.2e-16    < 2.2e-16
WatDiv1          5.258e-16    < 2.2e-16
WatDiv100        < 2.2e-16    1.267e-15

Table C.10: Wilcoxon signed rank test p-values for testing if LILAC with the engines and the engines alone transfer the same number of tuples or if LILAC reduces the number of transferred tuples. Bold p-values allow accepting that LILAC reduces the number of tuples transferred by the engines.

Federation       ANAPSID      FedX
Diseasome        < 2.2e-16    < 2.2e-16
SWDF             4.688e-14    < 2.2e-16
LinkedMDB        < 2.2e-16    < 2.2e-16
Geocoordinates   < 2.2e-16    < 2.2e-16
WatDiv1          3.029e-16    < 2.2e-16
WatDiv100        7.689e-16    2.432e-15

Table C.11: Wilcoxon signed rank test p-values for testing if LILAC with the engines and the engines alone transfer the same number of tuples or if LILAC's reduction in the number of transferred tuples is greater. Bold p-values allow accepting that LILAC's reduction in the number of transferred tuples is greater.

Federation       ANAPSID      FedX
Diseasome        2.905e-16    1.017e-06
SWDF             6.147e-09    2.365e-06
LinkedMDB        4.161e-15    3.421e-05
Geocoordinates   1.29e-12     0.5
WatDiv1          3.301e-07    3.074e-09
WatDiv100        1.205e-07    8.762e-08
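Paired comparisons like the ones above can be reproduced with a one-sided Wilcoxon signed rank test. The sketch below uses a normal approximation without tie or continuity corrections, and the paired measurements are made-up numbers, not the paper's data; in practice, a library routine such as SciPy's wilcoxon would be used instead:

```python
import math

def wilcoxon_signed_rank(x, y):
    """One-sided Wilcoxon signed rank test (normal approximation).
    Tests the alternative that x tends to be smaller than y
    (e.g., fewer tuples transferred). Returns (W, p_value)."""
    # Signed differences; discard zeros, as in the standard procedure.
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    # Rank the absolute differences, averaging ranks for ties.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of positions i..j
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    # W: sum of the ranks of the positive differences (where x > y).
    w = sum(r for r, di in zip(ranks, d) if di > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    # A small W supports the alternative; one-sided p via the left tail.
    z = (w - mean) / sd
    p = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return w, p

# Hypothetical paired measurements: tuples transferred per query.
lilac  = [120, 80, 45, 200, 60, 75, 90, 150, 30, 55]
engine = [900, 750, 400, 1800, 500, 620, 710, 1300, 280, 470]
w, p = wilcoxon_signed_rank(lilac, engine)
print(w, p)  # W = 0: LILAC transfers fewer tuples on every pair
```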