The BAY-HIST Prediction Model for RDF Documents
Edna Ruckhaus and Marı́a-Esther Vidal
Universidad Simón Bolı́var
Caracas, Venezuela
{ruckhaus, mvidal}@ldc.usb.ve
Abstract. In real-world RDF documents, property subject and object values are
often correlated. The identification of these relationships is of significant relevance to many applications, e.g., query evaluation planning and linking analysis. In this paper we present the BAY-HIST Prediction Model, a combination of
Bayesian networks and multidimensional histograms which is able to identify the
probability of these dependencies. In general, Bayesian networks assume a small
number of discrete values for each of the variables considered in the network.
However, in the context of the Semantic Web, variables that represent the concepts in large-sized RDF documents may contain a very large number of values;
thus, BAY-HIST implements multidimensional histograms in order to aggregate
the data associated with each node in the network. We illustrate the benefits of
applying BAY-HIST to the problem of query selectivity estimation as part of costbased query optimization. We report initial experimental results on the predictive capability of this model and the effectiveness of our optimization techniques
when used together with BAY-HIST. The results suggest that the quality of the
optimal evaluation plan has improved over the plan identified by existing cost
models that assume independence and uniform distribution of the data values.
1
Introduction
The number of controlled vocabularies and annotated data sources in the Web has exploded in the last few years. Individually, many of these documents contain a large
number of concepts and instances, and additionally their growth rate is very high. Thus,
in order to be capable of scaling up, Web architectures have to be tailored for query
processing on large number of resources and instances. We apply BAY-HIST to the
problem of query selectivity estimation as part of cost-based query optimization.
The Prediction Model BAY-HIST is a framework that combines Bayesian networks
and multidimensional histograms with the purpose of determining dependencies between properties in RDF documents and the distribution of their values. Bayesian Networks are probabilistic models that allow a compact representation of the joint distribution of the concepts defined in an RDF document. In general, Bayesian networks
assume a small number of discrete values for each of the variables considered in the
network. However, in the context of RDF documents in the Semantic Web, variables
that represent the concepts in large-sized RDF documents may contain a very large
number of values; thus, BAY-HIST implements multidimensional histograms in order
to aggregate the data associated with each node in the Bayesian network that represents
the RDF document.
BAY-HIST has been included as a component of the OneQL System, an Ontology System that provides optimization and query evaluation techniques that scale up
to large RDF/RDF(S) documents [4, 10]. We report initial experimental results on the
predictive capability of this model and the effectiveness of our optimization techniques
when used together with BAY-HIST. The results suggest that the quality of the optimal
evaluation plan has improved compared to the plan identified by existing cost models
that assume independence and uniform distribution of the data values, by up to two
orders of magnitude.
The structure of this paper is as follows: first, we will give a motivating example.
Following this, we will present the syntax and semantics of BAY-HIST. Next, we will
explain the architecture of the BAY-HIST Prediction Model and its application to costbased query optimization. Then, the experimental study will be described, and finally,
the conclusions and future work will be presented.
2
A Motivating Example
The example that follows shows a query to the RDF repository published at
http://www.govtrack.us/. In this example, besides information concerning the U.S.
congress bills voting process, we consider information of the census such as religion
and gender, and political information such as the party and the state that is represented
by each representative that participates in the voting process. Consider the relationships
between party, gender, religion, state and the way a representative votes. To discover if
there is any correlation among the values of these five properties, we will try to determine if for different instantiations of the following query, different number of tuples are
obtained: Names of all the representatives of state ?S , that belong to party ?P, are of gender ?G,
are of religion ?R and have voted for the winning option in the voting process of Bill ?B. The
SPARQL representation of this query is illustrated in Figure 1.
PREFIX pol:<tag:http://www.rdfabout.com/rdf/schema/politico/>
PREFIX vote:<tag:http://www.rdfabout.com/rdf/schema/vote/>
PREFIX foaf:<tag:http://xmlns.com/foaf/0.1/>
SELECT ?X
FROM <tag:http://www.examples.org/votesdataset/>
WHERE
{?X pol:forOffice ?S . ?X pol:party ?P . ?Z pol:hasRole ?X . ?Z foaf:gender ?G .
?Z foaf:religion ?R . ?O vote:votedBy ?X . ?B vote:winner ?O}
Fig. 1. A SPARQL query
This query may have different subject and object instantiations (constants). For instance, we may want to explore for a certain Bill, the different combinations of instantations for party, religion, gender and state. While for a certain set of instantiations the
query has 18 answers, for another one it has no answers. This behavior is due to the lack
of uniformity in the property value distribution and the dependency between properties.
For example, the probability that a representative has voted for the winning option in
the voting process of Bill 1998-173 if he is Catholic, male, belongs to the Democratic
party and represents the state of Massachussets is much higher than the probability that
a representative has voted for the winning option in the voting process of Bill 1998-173
if he is Jewish, male, Republican and represents Oklahoma. The identification of these
relationships is of significant relevance to many applications. For instance, in query
evaluation planning, this information may provide the basis for the optimizer to discriminate between bad or good query plans.
3
The BAY-HIST Prediction Model
Consider the RDF repository presented in the previous example. Let us assume that
there are certain causal relationships between the subjects and objects of properties that
are represented as an RDF Bayesian Network (RBN), as shown in Figure 2. In this
o-foroffice
o-party
o-gender
o-religion
o-s-hasroleparty
o-s-hasroleforoffice
s-s-hasrolegender
s-s-forofficeparty
s-s-genderreligion
s-s-hasrolereligion
Fig. 2. RBN Votes
RBN, there are nodes that represent property subjects or objects. For example, node
o-religion represents the values (objects) of property religion. We also represent the
event of a combination between subjects or objects of related properties. Such is the
case of node s-s-foroffice-party that represents the event that a subject that is representing a certain state, belongs to a certain party. The arcs in this network represent
dependencies between nodes. In this network we model that the combination of voter
and gender is conditioned not only by the gender itself, but also by the state he represents and the party to which he belongs to; thus, the probability that a person’s gender is
‘male’, the state is ‘Oklahoma’ and that he belongs to the ‘Republican’ party is 0.033.
This probability is related to the probabilities of all the rest of combinations of gender,
state and party. Tables 1(a) and 1(b) show a portion of the conditional probability tables
(CPT) of this RBN. An RBN represents all the conditional dependencies among prop-
Table 1. CPT’s Votes
(a) CPT o-party
o-party
prob(o-party)
Democratic
0.51
Independent
0.007
Republican
0.47
(b) CPT s-s-foroffice-party
s-s-foroffice-party
true
false
true
false
true
false
true
false
...
o-foroffice
Democratic
Democratic
Independent
Independent
Republican
Republican
Democratic
Democratic
...
o-party prob(s-s-foroffice-party)
ak
0
ak
1
ak
0
ak
1
ak
0.03
ak
0.97
ma
0.038
ma
0.962
...
...
erty subjects and objects in an RDF document. Next, we will formally define an RDF
Bayesian Network:
Definition 1 (RDF Bayesian Network) Given an RDF directed graph
OR = (VR , ER ) where VR and ER are the nodes and arcs in the RDF graph. An RDF
Bayesian Network RB for OR , is a pair RB = hOB , CPTB i, where OB = (VB , E B ) is a
DAG. VB are the nodes in OB and E B are the arcs in OB . CPTB are the Conditional
Probability Tables for each node. The homomorphism f : P(ER ) → P(VB ) establishes
mappings between OR and OB :
f ({(sub, pro, ob j)}) = {s-pro, o-pro}
(Mapping 1)
f ({(sub1 , pro1 , ob j), (sub2 , pro2 , ob j)}) = {o-o-pro1 -pro2 , o-o-pro2 -pro1 }
(Mapping 2)
f ({(sub, pro1 , ob j1 ), (sub, pro2 , ob j2 )}) = {s-s-pro1 -pro2 , s-s-pro2 -pro1 }
(Mapping 3)
(Mapping 4)
f ({(sub, pro1 , ob j1 ), (sub2 , pro2 , sub)}) = {s-o-pro1 -pro2 , o-s-pro2 -pro1 }
VC ⊆ VB , where VC is the union of the sets of nodes established by mappings 2 to
4, and it is comprised of all the nodes that represent property combinations.
E B ⊆ VB × VC is the set of arcs. An arc (v1 , v2 ) ∈ E B iff there exist two sets of nodes
in the RBN, V1 ⊆ VB and V2 ⊆ VC such that, v1 ∈ V1 and v2 ∈ V2 and when f −1 is
applied to these sets, a subset of arcs in the RDF graph is obtained.
CPT B is the probability Pr(v/predecessors(v)) for each node v ∈ VB , i.e., the distribution on the values of v for each possible value assignment of its predecessors. The
CPT B are multidimensional histograms ordered by value. If a node v is a source node,
the histogram will be one-dimensional, because in this case the CPT B only represents
the distribution of values taken up by the variable represented by the node. For each
node v, according to the properties of the distribution of the values of v, CPT B can be
represented as an equi-width histogram or as an equi-height histogram.
Example 1 Next, we illustrate the use of the homomorphism f . Figure 3 shows a portion of an RDF graph (OR ) and its corresponding RBN graph (OB ). Mapping 1 is applied to the sets of RDF arcs {(rep1,foroffice,va)} and {(rep2,party,democratic)}:
f ({(rep1,foroffice,va)})={s-foroffice,o-foroffice}
f ({(rep2,party,democratic)})={s-party,o-party}
politico:foroffice
politico:rep1
va
s-foroffice
o-foroffice
s-party
o-party
politico:foroffice
politico:rep2
ma
politico:party
s-s-foroffice-party
Democrat
OR
s-s-party-foroffice
OB
Fig. 3. Example Mapping RDF Graph - RBN Graph
Then, Mapping 3 is applied to the set of RDF arcs {(rep2,foroffice,ma),
(rep2,party,democratic)}
f ({(rep2,foroffice,ma),(rep2,party,democratic)})={s-s-foroffice-party,s-s-party-foroffice}
The arc (o-foroffice,s-s-foroffice-party) belongs to E B because the arcs obtained by applying the inverse of f are subsets of ER :
f −1 ({s-foroffice,o-foroffice})∪ f −1 ({s-s-foroffice-party,s-s-party-foroffice})=
{(rep1,foroffice,va),(rep2,foroffice,ma),(rep2,party,democratic)}
Intuitively, an RBN is semantically valid if its arcs have been established between nodes
that map to properties whose subjects and objects are of the same type, i.e., have some
type of matching instantiations, subject-subject, subject-object or object-object. For example, an arc from node o-s-hasrole-party to node s-s-gender-religion is semantically valid because there are matching subject-subject instantiations between triples
of property hasrole and triples of religion, i.e., both are “persons”.
Given the symmetry property of the combinations between triple patterns, the set
VB may contain only one of the nodes in the sets defined with mappings 2, 3 and 4 in
Definition 1; thus, the resulting RBN is minimal:
Definition 2 (Minimal RBN) Given an RBN RB = hOB , CPT B i. RB is a Minimal RBN
if the set VB contains exactly one node in sets {s-s-pro1 -pro2 , s-s-pro2 -pro1 }, {s-o-pro1 pro2 , o-s-pro2 -pro1 } and {o-o-pro1 -pro2 , o-o-pro2 -pro1 }.
4
Architecture
Figure 4 shows the architecture of the BAY-HIST Prediction Model System. BAY-HIST
has two main components that generate and query the RBN: the RBN Analyzer and the
RBN Inference Engine. Both components make use of the SamIam Bayesian Inference
Tool [1].
The analyzer receives an RDF document and creates the RBN structure using the
mappings presented in Definition 1 to establish the correspondence between the RDF
graph and the nodes and arcs of the RBN structure. Once the RBN structure has been defined, the RDF data is loaded into relational tables, and a multi-dimensional histogram
is generated for each node in the RBN structure through the stored procedures and the
histogram option implemented by the DBMS Oracle [8]. Both, the RBN structure and
CPT’s are fed to the SamIam network editor, and a Bayesian network is generated in
one of the internal formats recognized by the SamIam tool.
When a query is received, the RBN Inference Engine constructs the corresponding
probability query (e.g., marginal probability and posterior marginal probability) and
passes this query on to the SamIam inference engine which then returns an answer.
RDF
Documents
RDF
Relational
Database
DBMS
Multi
Dimensional
Histogram
CPT
RDF Graph
RBN Structure
RBN Analyzer
Bayesian
Network
Editor
SAMIAM
Bayesian
Inference Tool
Query
Answer
RDF Bayesian
Network
Bayesian
Inference
Engine
RBN Query Engine
Fig. 4. Architecture of the BAY-HIST System
5
Application of BAY-HIST to Query Optimization
The BAY-HIST Prediction Model is applied to query selectivity estimation. These estimates are used within the cost model of a cost-based query optimizer as part of the
formulas that compute the cost and cardinality of query sub-plans. We have developed
a randomized optimization strategy based on the Simulated Annealing algorithm [7].
This algorithm explores execution plans of any shape (bushy trees) in contrast to other
optimization algorithms that explore a smaller portion, e.g., left-linear plans. Random
walks are performed in stages that consist of an initial random plan generation step followed by one or more plan transformation steps. An equilibrium condition or a number
of iterations determines the number of transformation steps in each stage.
The probability of transforming a current plan p into a new plan p′ is specified by
an acceptance probability function P(p, p′ , T ) that depends on a global time-varying
parameter T called the temperature which reflects the number of stages to be executed.
Function P may be nonzero when cost(p′ ) > cost(p), meaning that the optimizer can
produce a new plan even when it has a higher cost than the current one. This feature
prevents the optimizer from becoming stuck in a local minimum. Temperature T is decreased during each stage, and the optimizer concludes when T = 0. Transformations
applied to the plan during the random walks correspond to SPARQL axioms, e.g., commutativity and associativity of the ‘.’ operator. The optimizer is able to identify near
optimal solutions because of the precision of estimates that take into account correlations of values and non uniform distribution.
Using BAY-HIST, the selectivity of an RDF query execution plan that joins A and
B over join arguments J (A ⋊
⋉J B) is expressed in terms of a probability query against
the corresponding RBN:
Y
fs(A⋊
⋉J B) =
Pr(JoinEvent J /(JoinEvidJ A ∧ JoinEvidJ B ∧ instEvidI A ∧ instEvidI B ))
J∈J
This is a posterior marginal probability query, i.e., the probability that two pattern instantiations are combined, given the evidence of the instantiations and the joins in its
left and right sub-trees.
The probability queries associated with an RDF pattern (the base case) correspond
to marginal probabilities, i.e., to the probability that the value of subjects or objects
of the property in the pattern is equal to the instantiation in the pattern: Pr(o-pro=obj),
Pr(s-pro=sub) or Pr(s-pro=sub ∧ o-pro=obj).
An estimate of the selectivity of an RDF pattern A, carried out by using a probability
query on the RBN is more precise than an estimate carried out by using the traditional
cost model. The traditional cost model defines the following selectivity formula:
fs(A, J) =
Y
1/nKeys(A, J)
(1)
J∈J
where nKeys(A, J) is the number of different values taken up by J in pattern A. Likewise, an estimate of the selectivity of a sub-plan A ⋊
⋉J B carried out through a probability query on the RBN is more precise than an estimate carried out through the traditional
cost model. The selectivity formula in the traditional cost model is as follows:
Y
fs(A, B, J) =
1/max(nKeys(A, J), nKeys(B, J))
(2)
J∈J
These traditional formulas do not compute a precise estimate of the query evaluation
costs because they are based on the following assumptions: (a) the values of the subjects
and objects in a triple pattern are uniformly distributed, (b) the values of the subjects
and objects in a pattern are independent, and (c) the values of the subjects and objects
in properties of the patterns that are combined in a query, are independent.
The example that follows shows the motivating example query with two different
sets of instantiations:
– Names of all the male representatives of the state of Massachussets that belong to the Democratic party, are Catholic and have voted for the winning option in the voting process of Bill
1998-173.
– Names of all the male representatives of the state of Oklahoma that belong to the Republican
party, are Jewish and have voted for the winning option in the voting process of Bill 1998173.
PREFIX pol:<tag:http://www.rdfabout.com/politico/>
PREFIX vote:<tag:http://www.rdfabout.com/vote/>
PREFIX foaf:<tag:http://xmlns.com/foaf/0.1/>
SELECT ?X
FROM <tag:http://www.examples.org/votesdataset/>
WHERE
{?X pol:forOffice senate:ma .
?X pol:party ’Democratic’ .
?Z foaf:gender ’male’ .
?Z pol:hasRole ?X .
?Z foaf:religion ’Catholic’ .
?O vote:votedBy ?X .
’1998-173’ vote:winner ?O}
PREFIX pol:<tag:http://www.rdfabout.com/politico/>
PREFIX vote:<tag:http://www.rdfabout.com/vote/>
PREFIX foaf:<tag:http://xmlns.com/foaf/0.1/>
SELECT ?X
FROM <tag:http://www.examples.org/votesdataset/>
WHERE
{?X pol:forOffice senate:ok .
?X pol:party ’Republican’ .
?Z foaf:gender ’male’ .
?Z pol:hasRole ?X .
?Z foaf:religion ’Jewish’ .
?O vote:votedBy ?X .
’1998-173’ vote:winner ?O}
(a) SPARQL Query 1
(b) SPARQL Query 2
Fig. 5. Two Queries with Different Instantiations
The SPARQL representation of these two queries is illustrated in Figure 5. Query 1 and
Query 2 differ in their subject and object instantiations (constants), and their answers
are different: while the first query has 18 answers, the second one has no answers.
This behavior is due to the lack of uniformity in the property value distribution and
the dependencies between properties. Based on this observation, we use an RBN to
differentiate the selectivity of the sub-plans of each query execution plan taking into
account the existing correlation between the various RDF properties. To estimate the
selectivity of the sub-plan shown in Figure 6(a), a posterior marginal probability query
is carried out in the RBN and the result of this probability query is 0.0275.
{?X pol:forOffice senate:ma .
?X pol:party ’Democratic’ .
?Z foaf:gender ’male’ .
?Z pol:hasRole ?X .
?Z foaf:religion ’Catholic’}
Pr(s-s-hasrole-gender=true /\
o-s-hasrole-foroffice=true /\ 3.182.941
o-s-hasrole-party=true /
o-gender="male" /\
_Z,X
s-s-for-office-party=true /\
o-for-office="ma"
o-party="Democratic")
164.720
Pr(s-s-hasrole-religion=true /
s-s-hasrole-gender=true /\
o-s-hasrole-foroffice=true /\
o-s-hasrole-party=true /\
175.061
s-s-for-office-party=true /\
o-gender="male" /\
_Z
o-for-office="ma" /\
o-party="Democratic" /\
o-religion="Catholic")
2
religion
Pr(o-religion="Catholic")
1.104
hasrole
(a) Query Sub-Plan
2.059
Pr(s-s-foroffice-party=true /
o-foroffice="ma" /\
o-party="Democratic")
62
Pr(o-foroffice="ma")
foroffice
80
_X
gender
Pr(o-gender="male")
854
party
Pr(o-party="Democratic")
(b) Sub-Plan Tree
Fig. 6. Probability Queries on an Execution Sub-plan (http://www.govtrack.us/)
For the corresponding sub-plan in the second query, i.e., the same sub-plan with
different instantiations, the result of the inference on the RBN is 0, which is consistent
with the expectation that the cardinality of the first query is higher than the cardinality
of the second query. Figure 6(b) shows the tree representation of the sub-plan in Figure
6(a). Each node is annotated with the probability query corresponding to the sub-plan
(sub-tree) selectivity estimate, and with its cardinality. The cost estimate of the subplan, is equivalent to the total number of intermediate results that must be estimated to
obtain the answer:
cost(P) = 62 + 854 + 2.059 + 80 + 164.720 + 1.104 + 3.182.941 + 2 = 3.351.822
6
Related Work
In [6], Bayesian networks are applied to the problem of imprecise estimates of the selectivity of a relational query; this framework is known as the Probabilistic Relational
Model (PRM). This imprecision stems from the assumption of uniform distribution of
values for attributes in a table, attribute independence in one table, and attribute independence in tables that are semantically related. The proposed solution uses a probabilistic model to represent the distribution of values of each attribute and the correlations between attributes. Thus, instead of computing the query selectivity in terms of
the number of different values of each attribute in the select condition of the query,
the selectivity is computed using the result of a probability query to the model. In [5],
Statistical Relational Models (SRM) were developed. They are different from PRM
because they represent a statistical model of a particular database state instead of representing any state. Thus, Conditional Probability Table (CPT) construction in SRMs
is done through queries to the database whereas the structure and CPT construction in
PRMs is conducted by using machine-learning techniques.
The difference between the solution proposed by Getoor, et. al. [5, 6] and the solution presented in our paper, is the scalability to large-sized RDF repositories by means
of multidimensional histograms. The SRM, developed in [5] assume a low number of
values for each variable in the model. On the other hand, although in our work, an
RDF document is modeled similarly to an SRM, its nodes and arcs have a particular
semantics based on the RDF graph semantics, i.e., subject, property and object triples.
Besides this, in our proposed RBN model, there are also Join variables, but restricted
to the possible combinations between subjects and objects. Additionally, the purpose of
the Bayesian network proposed by Getoor, et. al., is the estimation of query selectivity.
In our work, Bayesian networks are applied to RDF documents in order to estimate the
selectivity of query evaluation plans and sub-plans.
The work described in [9, 11, 12] extends the Ontology Web Language (OWL) with
constructs that allow the annotation of an ontology with probabilities and causal relationships. These annotations are done with the purpose of reasoning on uncertainty in
ontologies. Once an ontology is annotated, it is translated to a Bayesian network, and
Bayesian inference queries may be answered. The main difference between these models and our research is that since the information on subject an object values are kept
in an aggregated form, our combinated approach of Bayesian networks and multidimensional histograms scales up to large RDF documents. Besides this, in our work we
define random variables that represent the event that a property may be combined (Join)
with another property; these type of variables are not considered in these approaches.
7
Experimental Study
The goal of the experimental study was to analyze the benefits of the proposed predictive model when applied to the problem of query optimization. First, the predictive capacity of the model was studied and then, the quality of the optimal query was
compared to the original query and to the optimal plan identified by a cost model that
assumes independence between properties and uniform distribution of values.
We used the real-world dataset on the US Congress bills voting process for the years
1998, 1999 and 2000 published at http://www.govtrack.us/. Besides the election results, we also consider census information about representatives such as religion and
gender, and political information such as the party and their state. The number of triples
in the dataset for years 1998, 1998-1999 and 1998-1999-2000 is 50, 860, 94, 590 and
128, 852, respectively.
The query benchmark is comprised of 112 queries with five instantiated patterns.
The properties in the patterns and their ordering are the same for all queries, but the instantiations are different. Previously, we determined that these properties are correlated
and thus, queries with different instantiations will have different selectivity.
We use the Bayesian inference tool, SamIam [1], to build the RBN based on the
graph structure, and the CPT which is represented as a multidimensional histogram.
Currently, the graph structure is built by hand, but this could be done semi-automatically.
The graph in the RBN was built according to the properties represented in the ontology. Then, the CPT were developed using multidimensional histograms to aggregate
the node values. The structure of this RBN was illustrated before as Figure 2. Each
CPT for a target node is a multidimensional histogram, where the first dimension corresponds to a node itself, and the rest of the dimensions correspond to the predecessors
of the node. The algorithm for multidimensional histogram generation constructs a histogram for the first dimension, and then for each bucket, it generates a histogram for
the second dimension, and so on, until all dimensions are completed. These histograms
were generated through the histogram options provided by the Oracle DBMS [8]. The
default histogram option generates equal-width or equal-height histograms according
to the number of different values of an attribute and its distribution.
In order to exploit the DBMS histogram mechanisms, we loaded a relational table
for each property in the ontology. For each target node, we created a relational table
that is a combination of the subject or object of the property that is represented by
the node, with the subjects or objects of all its predecessors. We used methods in the
Oracle package DBMS STATS to generate an histogram on the column that represents
the target node in the “combination” table. Then, for each bucket we created a table and
again used DBMS STATS to generate an histogram on the second dimension, and so
on until all the dimensions had been covered. The histogram was completed with the
computation of the frequency of each value of the target node given the different sets of
values of its predecessors.
Bayesian inference queries are posed to the network through the SamIam tool in
order to estimate the selectivity of each query based on the instantiations of its patterns. We use one of the algorithms implemented by SamIam, the Shenoy-Shafer exact
inference algorithm [2]. Each query was also evaluated and we obtained the number
of results. Thus, we compared the estimate of the selectivity with the actual number
of answers. The correlation value is 0.95. This result indicates that there exists a linear relationship between the estimates and the actual values, so we may assert that the
BAY-HIST model is capable of predicting the selectivity of a query plan or sub-plan,
and therefore, we can have a precise estimate of this plan’s evaluation cost.
The purpose of our next experiment was to study the effectiveness of our optimization techniques when used with the BAY-HIST prediction model. Given that the
BAY-HIST model is capable of considering dependencies between properties and its
distribution of values, the quality of the optimal plan identified by the optimizer using
BAY-HIST should be better than the quality of a plan identified by an optimizer that
uses a cost model that does not consider dependencies between properties and nonuniform distribution of values. We report on runtime performance, which corresponds
to the user time produced by the time command of the Unix operation system.
We used the same dataset and RBN as the previous experiment. We also used the
same query benchmark, but we shuffled the queries, evaluated them and chose the 21
queries that had the worst evaluation time. The experiment was performed using these
21 queries. The Simulated Annealing optimization algorithm was configured with an
initial temperature of 700, and 20 iterations in the initial stage.
We compared the performance of the original query, the optimal plan identified by
the optimizer with the model that assumes property independence and uniform distribution, and the optimal plan identified by the optimizer with the BAY-HIST model. These
plans were evaluated with and without index structures1 .
The average evaluation time is reported in Figure 7. We can observe that the performance of the optimal plans without index structures exceeds the performance of the
original queries by up to one order of magnitude. The improvement with the use of the
index strucures with respect to the original plans is up to two orders of magnitude, but
the improvement is even greater when the optimizer uses the BAY-HIST model. We
also observed that this difference is proportional to the incremental size of the datasets.
These results indicate that the quality of the plan identified by the optimizer and
the BAY-HIST model, is better than the quality of the optimal plan identified by the
optimizer with the traditional prediction model and the benefits are even greater when
index structures are used.
8
Conclusions and Future Work
We present the BAY-HIST Prediction Model, a combination of Bayesian networks and
multidimensional histograms, which is able to estimate correlations between data values
in an RDF document as well as their distribution. We study the benefits of applying
BAY-HIST to the problem of query selectivity estimation as part of cost-based query
optimization; also, we report initial experimental results that suggest that the quality of
the optimal evaluation plans can be improved when selectivity is estimated using the
BAY-HIST Prediction Model.
In the future we plan to use BAY-HIST on the RDF(S) and OWL formalisms; also,
we will study the benefits of this prediction model when it is used to discover links be1
Denoted as Bhyper according to the hypergraph RDF model that these index structures implement [3].
Evalua/on /me in ms. (log. scale)
10
Original
Original w/Bhyper
Op3mal
1
votes 98
votes 98‐99
votes 98‐99‐00
Op3mal w/Bhyper
Op3mal w/RBN
0.1
0.01
Op3mal w/Bhyper and RBN
RDF Document
Fig. 7. Quality of the Optimal Plan
tween data terms. Currently, the optimization algorithm queries the RBN for the selectivity of all the sub-plans in each execution plan. Future work will also include keeping
track of probability queries posed against an RBN in each execution plan, in order to
improve the efficiency of the cost model.
References
1. SamIam - Sensitivity Analysis Modeling Inference and More. Automated Reasoning Group,
University of California, Los Angeles. http://reasoning.cs.ucla.edu/samiam/.
2. Darwiche A. Modeling and Reasoning with Bayesian Networks. Cambridge University
Press, 2009.
3. Martinez A. and Vidal M. A Directed Hypergraph Model for RDF. In KWEPSY, 2007.
4. Ruckhaus E., Ruiz E., and Vidal M. Query evaluation and optimization in the semantic web.
Theory and Practice of Logic Programming - TPLP, 8(3):393–409, 2008.
5. Getoor L. Learning statistical models from relational data, 2001.
6. Getoor L., Taskar B., and Koller D. Selectivity estimation using probabilistic models. In
SIGMOD Conference, pages 461–472, 2001.
7. Vidal M., Ruckhaus E., Lampo T., Martinez A., Sierra J., and Polleres A. Efficiently joining
group patterns in SPARQL queries. In Proceedings ESWC, 2010.
8. ORACLE. Oracle Database Management System. http://www.oracle.com/.
9. Da Costa P., Laskey K., and Laskey K. PR-OWL: A bayesian ontology language for the
semantic web. In ISWC-URSW, pages 23–33, 2005.
10. Lampo T., Ruckhaus E., Sierra J., Vidal M., and Martinez A. OnEQL: An Ontology-based
Architecture to Efficiently Query Resources on the Semantic web. In Proceedings of SSWS,
collocated with ISWC, 2009.
11. Yi Yang and Jacques Calmet. Ontobayes: An ontology-driven uncertainty model. In Proceedings CIMCA ’05, 2005.
12. Ding Z., Peng Y., and Pan R. A Bayesian Approach to Uncertainty Modeling in OWL
Ontology. In Proceedings of the International Conference on Advances in Intelligent Systems
- Theory and Applications, 2004.