List of Tables

Table 1.1. Cost of the evaluation of a given query using different orderings
Table 3.1. Typical Experimental Results for a Ternary Predicate for SICStus Prolog
Table 3.2. Typical Experimental Results for a Ternary Predicate for SB-Prolog
Table 3.3. Number of times that the WAM instructions are executed
Table 3.4. Number of times that the WAM instructions are executed (simplified version)
Table 3.5. Approximate Theoretical Values for a Ternary Predicate
Table 3.6. Average cost error introduced by our approximation
Table 3.7. The book titles database
Table 3.8. Orderings ranked by their costs
Table 3.9. The books database profile
Table 3.10. The extended books database
Table 3.11. Predictions for all predicates
Table 3.12. Predictions for the intensional database predicate
Table 5.1. The linear region
Table 5.2. The intermediate region
Table 5.3. Percentages of the maximum value for nb = 1.6 m
Table 5.4. Percentages of the maximum value for some factors
Table 5.5. Comparison between the formula and the experimental results
Table 5.6. The exponential region
Table 5.7. Estimating the cardinality of a recursive predicate
Table 6.1. Number of visited tuples for ordering #1
Table 6.2. Number of visited tuples for ordering #2
Table 6.3. Expected number of visited tuples
Table 6.4. Comparison between the predicted and experimental values
Table 6.5. The performers database predicates
Table 6.6. Predicted values of two cost contributors for the non-recursive query
Table 6.7. Experimental results for the non-recursive query (rankings in square brackets)
Table 6.8. The modified performers database profile
Table 6.9. Experimental results for the recursive predicate
Table 6.10. Efficiency of the transitive closure for different calling patterns
Table 6.11. The extensional database predicates
Table 6.12. Different orderings for the query under consideration
Table 6.13. Experimental results for the three most efficient orderings
Table 6.14. Cost metrics for all predicates
Table A2.1. Values of the Traversal Factor for the Ternary Predicate Example
Table A4.1. Valid orderings for the query pkg_uses/2
Table A4.2. The extensional database predicates
Table A4.3. Debray's domain for all predicates
Table A4.4. Cost domain for the extensional predicates
Table A4.5. Cost domain for the intensional predicate and the main query
Table A4.6. Cost metrics for all predicates
Table A4.7. Cost metrics for the intensional predicate
Table A4.8. Theoretical and Experimental Values for the Packages Example
List of Figures

Figure 1.1. Three representations of a given database tuple
Figure 1.2. A graph representation of a rule
Figure 1.3. A graph representation of a GraphLog relation
Figure 1.4. A query as a series of successive operations
Figure 1.5. The cost of a general predicate is the sum of the cost of its individual rules
Figure 1.6. Two general alternatives for a cost model framework
Figure 2.1. Sets of lists of arguments for two evaluable predicates that ensure safety
Figure 3.1. Partial translation of a fact
Figure 3.2. (a) An extract from one of the databases that were used and (b) typical subgoals which retrieve these facts
Figure 3.3. Debray's lattice for mode analysis
Figure 3.4. Abstract interpretation applied to Prolog unification given two terms t1 and t2
Figure 4.1. Frequency diagram of an attribute that may be approximated by a discrete normal distribution
Figure 4.2. Two ternary predicates s1 and s2
Figure 4.3. Join of predicates s1 and s2
Figure 4.4. Selection after the join of predicates s1 and s2
Figure 4.5. Final projection of arguments 3 and 5
Figure 4.6. Cost contributors are estimated for each subgoal
Figure 5.1. Region for small values
Figure 5.2. Region for large values
Figure 5.3. GraphLog program
Figure 5.4. GraphLog program for the recursive program
Figure 5.5. Graphical representation of base predicates up and down
Figure 6.1. The GraphLog database
Figure 6.2. The 1984 United States Congressional Voting Records Database
Figure 6.3. Two orderings that we wish to compare
Figure 6.4. Abstract black boxes for Example 1
Figure 6.5. Interconnection of the black boxes for Example 1
Figure 6.6. Experimental results for both orderings
Figure 6.7. Six orderings that we wish to compare
Figure 6.8. Six orderings that we wish to compare
Figure 6.9. Abstract black boxes for Example 2
Figure 6.10. Interconnection of two black boxes in Example 1
Figure 6.11. Sample tuples from the performers database
Figure 6.12. Abstract black boxes for the non-recursive query
Figure 6.13. Expected values for the cost contributors for a specific ordering
Figure 6.14. Abstract black boxes for the recursive query
Figure 6.15. Abstract representation of the different orderings
Figure 6.16. Abstract black boxes for some predicates in the packages example
Figure 6.17. Abstract black boxes for predicate pa-f
Figure 6.18. Abstract black boxes for predicate cycle
Figure 6.19. Impact of the underlying database on the performance of the call
Figure A1.1. Markov chain for the single solution case
Figure A1.2. Markov chain for the all-solutions case
Figure A3.1. General method to measure CPU execution times
I would like to thank Dr. Horspool for his patience and encouragement; IBM Toronto Laboratory for suggesting the topic, providing a Ph.D. fellowship and hosting a work term; Dr. Wadge and Dr. Ryman, who offered a number of insights; and, finally, last but certainly not least, my parents and brother, for their long-standing devotion and support.
To:
Jan Dournen (Delean)
Horacio Franco
qoercf Mdlender
Bruno Cornec
Gwenael Faucher
Bogislav Rarrschrrt
Federico Marincola
Daive Lumpson
Gryow Cargci
Ken-ichi Murata
Shel Ritter
Gustav Leonhardt
Sigiswald Kuijken
Grupo Cinco Siglos
Chapter 1.
Introduction. Query Optimization in GraphLog

In this dissertation, we propose a cost model for GraphLog, a query language that is based on a graph representation of both databases and queries. Specifically, GraphLog is the query language used by 4Thought, a software engineering tool aimed at helping engineers understand and solve a class of software engineering problems that involve large sets of objects and complex relationships amongst them [Consens92] [Ryman92] [Ryman93]. GraphLog queries ask for patterns that must be present or absent in the database graph. Our framework is able to estimate the relative cost of execution of different orderings of semantically equivalent GraphLog queries, thus allowing us to reject those query orderings whose execution may be more inefficient. Our model assumes a top-down evaluation strategy [Ceri90].
Given the fact that one of the distinguishing characteristics of GraphLog is the capability to express queries with recursion or closures, and since no previous cost model has addressed the cost estimation of recursion and closures for a GraphLog-like language, our original solution to this problem is of particular interest. Our methodology has been evaluated on several real-life databases with encouraging results.

In this chapter, we analyze some general issues relevant to query optimization in general, and query reordering in particular. We also introduce the language that our work will be applied to. Finally, we give an overview of what we have accomplished.
1.1 Query Optimization
Query optimization [Jarke84] is directly concerned with the efficient execution of database queries. Its main goal is to minimize the resources needed to evaluate a query that retrieves information from a given database. A query optimizer normally generates and analyzes different alternatives to determine an efficient plan of execution. Optimizing a query can reduce processing time by a factor whose value depends on the sizes of the database definitions†. The choice of plan is often based on cost models that capture the contributions due to different factors such as the sizes of the relations under consideration or the expected number of tuples retrieved by an intermediate operation.

If, for instance, a user poses the query "find all Japanese collectors who own a Stradivarius violin", the query optimizer would usually need some information about the statistical profile of the database (how many Japanese collectors are stored in the database, how many individuals are expected to own a Stradivarius violin, and more). Given these premises, the optimizer may establish a suitable plan to solve the problem efficiently. A plan of execution has to take into account several different factors, including the order of operations, the searching algorithms that are used and the database structure itself.
Some of the most common strategies adopted in query optimization include:

1. Selection of the most efficient overall evaluation method (i.e., the computational model that derives all the solutions to the query). The algorithm that is used to search for the answers clearly has an influence on the efficiency of execution of the query. No evaluation method is intrinsically superior to the others. In fact, the performance of different evaluation methods depends on the nature of the problem. Typical evaluation methods include bottom-up evaluation, top-down evaluation, and combinations of both. Here, the optimization (i.e., the decision as to which evaluation method is the most suitable for the given query) is performed during the evaluation process itself.
2. Determination of the best syntactic rearrangement of the query subgoals. Given that the order of execution of the subgoals can substantially influence the time that is required to retrieve the answers to the query, it is usually advantageous to find the goal ordering that is the least expensive to execute. Unfortunately, since the number of combinations increases geometrically with the number of subgoals in the query, an exhaustive search through all possible combinations may become prohibitive. A practical cost model is needed to compare the performance of different orderings and select a suitable (efficient) ordering.

3. Transformation of the original user query into an equivalent one which can be executed more efficiently. In some cases, standard simplifications may be applied to the new query, whereas they may not have been applicable to the initial query. However, this process of query rewriting does not guarantee that a more efficient query will be found. In some cases, a loss in efficiency may occur.

†For instance, we will show a simple example in which a reduction factor of 2,000 is achieved.
If the evaluation is performed by a specific "machine", we will be more interested in the last two approaches to query optimization (a fixed evaluation strategy is the usual case for many query languages).

Our work will address the issue of selecting the best syntactic rearrangement of the query subgoals for a specific query language, namely GraphLog [Consens89]. We will refer to this problem as query reordering.
1.2 Datalog
There has been extensive work directed towards tackling the traditional database programming paradigm. However, with a recent trend towards integrating the database and logic programming paradigms, new requirements and challenges demand a different approach to the special problems raised by the logic programming paradigm.

This dissertation is specifically focused on GraphLog, a language that incorporates the two above-mentioned programming paradigms. Since GraphLog is closely related to Datalog, a relatively well-known logic query language, we proceed to give a brief overview of this language.

Datalog [Ullman88] is a language that applies the principles of logic programming to the field of databases. Datalog was specifically designed for interacting with large databases. The language is based on first-order Horn clauses without structures as arguments, i.e., only constants and variables are allowed. Constant arguments are also referred to as ground atoms. Most underlying Datalog concepts are similar to those in Logic Programming [Ceri90]. In fact, the design of Datalog has been noticeably influenced by one of the most popular logic programming languages, Prolog [Clocksin81]. We proceed to give a brief description of the language. A more detailed coverage of the language can be found in the literature [Ullman88] [Gardarin91] [Ceri90].
A Datalog program consists of a finite set of logic clauses often referred to as facts and rules. Facts are assertions that define true statements about some objects and their relationships. Typical facts are "Felix is a man" or "The square of 5 is 25". The Datalog notation for these facts is:

male(felix).
square(5, 25).

The atomic symbol that names the relationship is said to be the predicate. In the example, male and square are predicate symbols. The objects that are affected by the relationships are named the arguments or data objects. In our example, these are the constant values felix, 5 and 25. As a notational convention, both predicate symbols and constant arguments are written with an initial lower-case letter. The collection of facts is usually referred to as the database.
Rules are collections of statements that establish some general properties of the objects and their relationships. Broadly speaking, rules permit the derivation of facts from other facts. A Datalog rule is expressed in the form of Horn clauses [Horn51], that is, clauses having the general form:

P if Q1 and Q2 and ... and Qn

or, in Datalog notation,

p :- q1, q2, ..., qn.

p being the head of the rule and the conjunctive part being the body of the rule. Each qi is named a subgoal of the rule.
Rules usually make use of variables to represent general objects rather than specific ones. Variables are represented by identifiers that must commence with a capital letter. For example, the predicate

son(X, Y) :- male(X), parent(Y, X).

can be interpreted as "X is a son of Y if X is male and Y is a parent of X". The predicates male and parent should be defined elsewhere, either as facts or as rules.
The user may request information from the database by entering queries. These are Horn clauses which lack a head and can be evaluated or verified against the facts and rules in the program. For example, the query

:- patient(Name, Disease), tropical(Disease).

may be used to retrieve the names of those patients that have suffered a tropical disease according to their clinical history. The answer to this query is given by the set of all tuples that satisfy the query†.

†For practical reasons, some systems have the option of retrieving just a subset of the whole answer (by reporting the first instances of the solution that are derived).
1.3 GraphLog
A related language is GraphLog [Consens89]. GraphLog is a graphical database query language based on Datalog, and enriched by some additional features (specifically, the formulation of path regular expressions). One of its original aims was to facilitate programming via a graphical representation of the programmer's designs and intentions. The main idea is that a relational database can be represented as a graph, and graphs are a very natural representation for data in many application domains (for instance, transportation networks, project scheduling, parts hierarchies, family trees, concept hierarchies and Hypertext) [Consens89] [Consens90] [Fukar91] [Consens92] [Ryman92] [Ryman93].
Each node in the graph is labelled by a tuple of values; they correspond to the attribute values in the database. Each edge in the graph is labelled by the name of a relation and an optional tuple of values. The set of values in both the edge label and the nodes connected by the edge, together with the name of the relation in the edge, correspond to one tuple in the database. Figure 1.1 shows three equivalent graph representations of a given database fact.

Figure 1.1 Three representations of a given database tuple
General relations (rules) and queries may also be represented by graphs. Every edge in the graph represents a relation amongst data objects as represented in the nodes connected by the edge (and optionally in the edge). These data objects are the predicate arguments and they can be either variables or constants. The rule itself is represented by a special edge (called the distinguished edge) that also connects a pair of nodes. For instance, Figure 1.2 shows a graph representation of the rule:

son(X, Y) :- male(X), parent(Y, X).

Figure 1.2 A graph representation of a rule
Another example of a GraphLog relation is given in Figure 1.3, where a further rule is defined. This example shows that the graph does not have to be a connected graph. Note also that the arguments are ordered as follows†: (a) first, those appearing in the "starting" node; (b) those shown in the "ending" node; and (c) those specified in the edge.

Figure 1.3 A graph representation of a GraphLog relation
GraphLog is a language that represents database facts, rules and queries as graphs as described above. A formal definition of this query language can be found in [Consens89]. It is shown that a GraphLog program has an equivalent Datalog program associated with it. Of particular relevance is the fact that GraphLog allows programmers to express recursive relations, thus providing a greater expressive power than that of traditional relational algebra.
1.4 The Importance of Query Reordering
The efficiency with which a logic programming language§ executes a query is critically dependent on the order in which goals are expressed in a conjunction [Warren81]. Query reordering is an important query optimization technique for finding more efficient evaluation orders for the predicates. The main goal of this technique is to reduce the number of alternatives to be explored.

†In fact, arguments may be specified in prefix, postfix or infix notation. 4Thought favours the infix convention.
§It is assumed that the specific resolution technique used is SLD-resolution [Ceri90].
To determine more efficient ways of evaluating a given set of subgoals, it is convenient to have some information about the actual (extensional) database. Knowledge of some parametric values of the database can help determine an approximate execution cost that is to be associated with every subgoal. Query reordering usually requires at least three different processes: (a) gathering a database profile or some general knowledge of the characteristics of the database tuples; (b) estimating costs for different orderings (in the ideal case, for all possible valid orderings)†; and (c) determining the best order. In this dissertation, we concentrate on the second issue, i.e., trying to predict the (relative) cost of evaluating a query (any query) for a given database.
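The following Prolog sketch is only meant to illustrate how steps (b) and (c) fit together; safe_ordering/1 and estimated_cost/2 are placeholders (given trivial dummy definitions here so the fragment runs), not code from this dissertation.

:- use_module(library(lists)).  % for permutation/2

% Rank the valid permutations of a list of subgoals by their estimated cost
% and return the cheapest one.
best_ordering(Subgoals, Best) :-
    findall(Cost-Ordering,
            ( permutation(Subgoals, Ordering),
              safe_ordering(Ordering),           % placeholder safety test
              estimated_cost(Ordering, Cost) ),  % placeholder cost estimator
            Candidates),
    keysort(Candidates, [_-Best|_]).

% Dummy stand-ins so the sketch is self-contained; a real system would plug in
% the safety rules of Section 2.1.3 and the cost model of later chapters.
safe_ordering(_).
estimated_cost(Ordering, Cost) :- length(Ordering, Cost).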
1.4.1 Effect of Query Reordering
To illustrate the effect that query reordering may have on the performance of a query, we use the following example that describes a Prolog database‡.

Example. Consider a database that consists of three predicates:

book(Title, Publisher_Name, Author_Name). A collection of book titles along with their publishers and authors.

publisher(Publisher_Name, City). A list of different cities where book publishers have an authorized distributor.

author(Author_Name, Nationality). A group of facts that relate authors to their respective nationalities.

Suppose that we wish to retrieve a list of tuples <Title, Publisher_Name, City, Author_Name> of those publications whose author has Dutch nationality.

†Although the database profile may be used to estimate the cost of some simple subgoals (for instance, facts), the cost of more complex (derived) subgoals requires some additional computational work.
‡These results also apply to GraphLog, especially since GraphLog queries are usually translated into Prolog under current implementations of the language.
Since this query involves all three predicates, there are 3! different ways to express it:

:- book(T, P, A), publisher(P, C), author(A, dutch).
:- book(T, P, A), author(A, dutch), publisher(P, C).
:- publisher(P, C), book(T, P, A), author(A, dutch).
:- publisher(P, C), author(A, dutch), book(T, P, A).
:- author(A, dutch), publisher(P, C), book(T, P, A).
:- author(A, dutch), book(T, P, A), publisher(P, C).
The answer will be the same, regardless of the chosen order. However, depending on the characteristics of the underlying database, the timings of the queries will not be the same. For example, we applied all six orderings to a particular database with 3,000 book titles, 20 different publishers, 450 authors, 30 nationalities and 380 cities worldwide, and observed the costs shown in Table 1.1. The figures were obtained using SICStus Prolog version 1.2 and Stony Brook Prolog (SB-Prolog) version 3.0 measured on a Sun SPARCstation SLC. All execution times are estimated, according to the implementation manuals, in "artificial" units. The database under consideration comprised 3,000 facts for the book predicate, 1,766 facts for the publisher predicate and 450 facts for the author predicate.
ordering                     cost using SICStus Prolog    cost using SB-Prolog
publisher-author-book        3434745                      3152360
author-publisher-book        3438660                      3125060
publisher-book-author        260040                       443900
author-book-publisher        2690                         2810

Table 1.1 Cost of the evaluation of a given query using different orderings
It is clear from this example that the order of the subgoals substantially affects the performance of the Prolog query. It is also evident that the particular Prolog implementation may affect the choice of the best ordering as well.
1.5 Our Dissertation
A cost model of a particular implementation of the language GraphLog (in which Prolog is the target language) is proposed in this dissertation. In particular, we address the issue of ranking different (syntactically-equivalent) arrangements of a given query in order to select the (potentially) most efficient ordering. One major feature of our methodology is the ability to estimate the cost of recursive queries and transitive closures.
1.5.1 The Problem Solved
Essentially, we have derived a methodology that allows us to choose a potentially less expensive ordering amongst a group of valid subgoal orderings. In other words, our proposed framework is able to rank different orderings according to their expected execution cost. Rather than assigning absolute values (i.e., exact execution times) to the different orderings under consideration, we are only interested in predicting their expected relative cost. Execution time is used as the determining factor in the analysis.

We may state the general problem as follows. Given a GraphLog query q of the form

:- q1, q2, ..., qm.

we are to estimate the relative cost of any given ordering of the subgoals.
Our methodology only ranks different orderings. It does not select potentially good candidates from the whole spectrum of valid orderings. It is the responsibility of a preprocessor to select a subset of potentially cheap orderings to start with (especially if the number of permutations of orderings would make an exhaustive analysis prohibitive). In fact, since we are interested in finding a permutation of the subgoals that yields a more efficient plan of execution, there are at most m! possible orderings (some of them may be invalid as they may not comply with the safety rules of the query language), so that it is not always feasible to test them all individually. A practical approach is to select a subset of the orderings, namely those that are potentially less expensive to execute. Then, we can estimate the cost of execution of each ordering in the subset to determine a good ordering. There are several methods to select subsets of potentially efficient orderings, amongst them Sheridan's algorithm [Sheridan91] and simulated-annealing-based algorithms [Ioannidis90].
1.5.2 Overview of Our Cost Model
In general, we have assumed that some information about the underlying database† is available. Sheridan's algorithm [Sheridan91] is the framework of choice when no information regarding the databases can be obtained.

For any given ordering, a mode analysis [Debray88] is performed to determine the degree of instantiation of the subgoal arguments. For the case of the previously-mentioned Prolog implementation of GraphLog, our model takes into account the specific evaluation strategy of this language under a particular implementation (namely, the WAM [Ait91]).

We have chosen to consider what we call the average behaviour for queries. Given all possible valid queries that the user may pose for a particular calling pattern (cf. Debray's framework), we estimate an average value of all their expected execution timings and use this value as the expected cost of the given query.‡ The framework in its present state does not produce any additional information such as measures of the dispersion of the values with respect to the average value, or corresponding upper and lower bounds.††

Furthermore, rather than a detailed and expensive exact solution, our model considers the process of solving a query as a set of general actions only.

We have determined that a convenient way to obtain a suitable ranking for the orderings under study is to consider the existence of what we have called cost contributors, which we proceed to explain in the following subsection.

†For instance, we assume that the number of tuples for each database fact and the number of distinct values for each argument position are available.
‡Thus, we are assuming that all queries have an equal probability of being posed, which is a major assumption.
††In fact, we decided not to use intervals to characterize the results, based on the fact that for a transitive closure the resulting intervals were normally too wide to be of practical use.
Additionally, we have developed a methodology to estimate the average number of solutions associated with the query, this being an implementation-independent quantity. In fact, Debray and Lin's related work [Debray93], which derives a cost model of logic programs, is mainly concerned with this sole issue. Our model is more general as it handles recursive and closure predicates.

One major consideration that was regarded as essential since the inception of this dissertation was to produce a framework that is as simple as possible while still producing acceptable results. We strongly believe that our model is simple, both conceptually and from the point of view of a practical implementation. We have tested our methodology on several real-life (large) databases. Some detailed case studies are given in Chapter 6.
Cost Contributors
Rather than analyzing the nature of the exact machine code that is generated (for instance, in the form of machine cycles that are required to execute the instructions), a simpler analysis is often desirable, although at the expense of a potential loss in precision. The general idea is to determine some generic activities or groups of operations that are directly related to the cost of execution of the query and then estimate the individual costs associated with such components. Therefore, we wish to single out some "cost contributors" that influence the efficiency of the code execution. Some typical cost contributors are (1) the number of tuples in the database that are visited to find the global solution, (2) the number of matching (unification) attempts that take place during the resolution process, and (3) the number of solutions or answers to the query that are gathered and displayed (we also have to consider any associated backtracking that may occur when new solutions are attempted). Some contributors may have a greater impact on the query performance than others. For instance, it has been reported that a Prolog program may spend 55-70% of its time unifying and 15-35% of its time backtracking [Woo~~].†

†This behaviour is specially relevant to our work, since the current implementation of the GraphLog interpreter generates Prolog code as the target language. For this reason, the number of visited tuples is a relevant cost contributor (if not the most relevant).
Unfortunately, many of these quantities are both model- and machine-dependent. For example, if the model uses clause indexing to narrow down the number of clauses to be explored, fewer tuple visits and unifications will be performed. Similarly, if specialized code optimizations are incorporated, this may have an impact on various cost contributors (for instance, tail recursion optimization [Knise87] may reduce the cost associated with backtracking). The only cost contributor that is independent of the execution model seems to be the total number of solutions to the query, but, in the case of GraphLog, this number is also independent of whatever ordering of the subgoals is selected!

In our model, one initial task consists of defining which cost contributors are more relevant. By eliminating some cost contributors, the process of cost estimation will be simplified at the expense of some loss in precision. As we will argue later, many real-life examples can be characterized by only a handful of cost contributors (in some cases, only one may suffice).
Database Profiling
Once a selected set of cost contributors is determined, a simple way to determine the expected value of these quantities must be found. This is usually done by using a database profile rather than the exact values in the database. Traditional statistical profiles are specified by means of four categories of quantitative descriptors [Mannino88]: (1) descriptors of central tendency; (2) descriptors of dispersion; (3) descriptors of size; and (4) descriptors of frequency distribution. Usually, the more precise the descriptors, the more accurate the predictions. There are many widely-used "standard" descriptors: mode, mean, median; variance, standard deviation; cardinality of the relations; normality, uniformity, to mention only a few. Many real-life databases can be characterized by these common descriptors with the advantage of a simpler, more general cost analysis, normally at the expense of some loss in accuracy. In fact, many frequency distributions have been extensively studied in the area of statistics [Mannino88].†

†Given an arbitrary database, it is not always easy to establish which "standard" set of descriptors approximates the data best. Sets of tests have been developed for some of the most popular approximation functions in the literature.
However, derived relations and complex queries do not deal with simple distribution functions, but rather with combinations (specifically, joins, semijoins, selections and projections) of distributions that require a more complex analysis. Most of the research work† has been devoted to just a few distribution functions (uniform, Pearson, normal and Zipf), and not all basic database operators have been studied with the same degree of depth or success. A substantial part of the work has concentrated on the estimation of the number of output tuples of a query.‡ Given these deficiencies, it is not unusual that query optimizers automatically assume a distribution function that is simple and well understood (typically the uniform distribution). An additional problem occurs when the actual distribution function is not known (databases are constantly changing and it is not always possible to keep track of the changes in the shape of the distribution) or only known in a non-parametric form (usually histograms). Our model will normally assume a uniform distribution of attribute values in compliance with the standard trend.
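As a purely illustrative sketch (the predicate names and format below are our own, not the profile representation actually used in this dissertation), the kind of database profile assumed for the book database of Section 1.4.1 could be recorded as a handful of Prolog facts giving the cardinality of each extensional predicate and the number of distinct values per argument position:

% cardinality(Predicate, NumberOfTuples)
cardinality(book, 3000).
cardinality(publisher, 1766).
cardinality(author, 450).

% distinct_values(Predicate, ArgumentPosition, NumberOfDistinctValues)
distinct_values(book, 2, 20).        % 20 different publishers
distinct_values(book, 3, 450).       % 450 authors
distinct_values(author, 2, 30).      % 30 nationalities
distinct_values(publisher, 2, 380).  % 380 cities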
Given a certain degree of instantiation of the arguments of a GraphLog subgoal, our claim is that it is feasible to estimate an expected value for the selected set of cost contributors. As is always the case with abstract interpretation techniques [Cousot77] [Cousot92], the more information we have about the subgoal, the more accurate the estimates can be.
For the case of extensional database predicates, in our model, such an estimate is obtained by simple statistical considerations††. In the ideal case, if we know the exact values of the database tuples as well as the exact subgoal (query retrieval) under consideration, the expected value of a cost contributor can be calculated accurately. If our knowledge is more limited, we have to introduce some assumptions (as mentioned before, we will normally assume a uniform distribution of independent attribute values), yet still achieve acceptable results.

†See [Mannino88] for a thorough (although slightly out-of-date) survey on the topic.
‡After all, in traditional database query planning the sizes of intermediate relations are usually regarded as important (if not the most important) contributors to the total execution cost of a query.
††The estimation of a simple fact retrieval (i.e., direct extensional database searches) is mostly a statistical problem, since the distribution followed by its arguments is assumed to be known in advance or can be somehow determined.
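As a hypothetical illustration of such a statistical estimate (the numbers come from the book database of Section 1.4.1 and the reasoning is just the standard uniformity argument, not a formula quoted from a later chapter): for the subgoal author(A, dutch), the author predicate holds 450 tuples spread over 30 distinct nationalities, so under a uniform, independent distribution of attribute values the expected number of solutions is 450 / 30 = 15, while, in the absence of indexing, all 450 tuples would still be visited to find them.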
For the case of intensional database predicates, the estimation of the expected value of a cost contributor requires a more elaborate process, which we proceed to sketch.
Cost of a General Query
Given a query whose cost we wish to estimate, we propose to decompose the query into simpler components. To simplify the problem, we assume that queries are independent of each other†. The simplest choice consists of defining a GraphLog subgoal as the primitive entity to be analyzed. A subgoal is then treated as a "black box": given some inputs (such as the degree of instantiation of the arguments, the number of times that the subgoal is expected to be invoked, the average number of solutions that are expected to be returned by the subgoal, etc.), the expected values of the cost contributors may be estimated (as the outputs of the black box) and used by successive blocks as their respective inputs. The subgoal itself has to provide some information about internal characteristics such as the distribution of attribute values or the correlation amongst arguments (see Figure 1.4 as an example of this idea; note that average values are obtained, since the actual values of the ground terms are not taken into consideration: a uniform distribution of attribute values is assumed instead).
The total cost of the query is then estimated as the sum of the individual costs of the subgoals. Again, standard abstract interpretation techniques are used to determine the degree of instantiation of the arguments and propagate the intermediate results through all successive query components. This instantiation information may also be used to reject unsafe orderings [cf. Section 2.1.3].
The estimation of the cost of a general predicate call can be obtained as the sum of the costs associated with each individual rule (Figure 1.5). This holds largely true as long as rules are independent of each other (i.e., they do not have common solutions). However, it is quite common that two or more rules provide common solutions.

†We will see that a more complex framework is required to deal with dependencies amongst components.
nation(canada).
nation(belgium).
nation(uk).

language(canada, french).
language(canada, english).
language(belgium, dutch).
language(belgium, french).
language(belgium, german).
language(uk, english).

There are 3 nations in the database: 3 tuples are visited and 3 tuples are retrieved. There are 6 language tuples in the database; the language predicate has 3 distinct values for argument #1 and 4 distinct values for argument #2, so of the total of 12 possible combinations of these values only 6 will produce an answer: there is a rate of success of 1/2 (this value is a constant average value). For each nation retrieved in the previous step, 6 tuples are visited (assuming no indexing), and the solution will contain 3 x 4 x (1/2) = 6 tuples.

Figure 1.4 A query as a series of successive operations
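Reading Figure 1.4 as the two-subgoal query :- nation(N), language(N, L) (our reconstruction of the query the figure describes), the cost contributors introduced earlier take concrete values: 3 tuples are visited for the first subgoal, 3 x 6 = 18 for the second (one full scan of the language facts per retrieved nation, assuming no indexing), for a total of 21 visited tuples, and the query returns the 6 solutions predicted by the 1/2 success rate.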
A mutual exclusion analysis may help, but the general problem of duplication resulting from independent rules seems to be difficult to solve. Our cost model does not take this source of duplication of tuples into account.‡

‡We must distinguish between the cost of finding all answers (i.e., the sum of the costs of the individual rules) and the cost of finding all distinct solutions (whose estimation has to take into account the process of elimination of duplicates).
predicate :- subgoal_1,1, subgoal_1,2, ..., subgoal_1,n.
predicate :- subgoal_2,1, subgoal_2,2, ..., subgoal_2,n.
...
predicate :- subgoal_m,1, subgoal_m,2, ..., subgoal_m,n.

For each predicate rule: estimate the cost of each rule body, add the cost of head unification, and consider the process of projection and elimination of duplicates.

Figure 1.5 The cost of a general predicate is the sum of the cost of its individual rules
When we are dealing with general predicate calls, we have to consider some additional issues, such as (a) head unification, (b) clause indexing, (c) independence of subgoals and (d) the fact that the distribution of the tuples may be difficult to predict. Head unification and clause indexing are implementation-specific issues and they are taken into account in our model by assigning to each rule in the predicate a probability of success, (usually) given the degree of instantiation of the arguments involved. Each rule is then weighted based on this probability factor.
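As a hypothetical illustration of this weighting (the predicate below is invented for the example and does not appear in the dissertation):

% A predicate defined by three rules whose heads start with distinct constants.
road(victoria, nanaimo).
rail(victoria, courtenay).
flight(victoria, vancouver).

route(car, X, Y)   :- road(X, Y).
route(train, X, Y) :- rail(X, Y).
route(plane, X, Y) :- flight(X, Y).

% For a call such as route(train, victoria, Where), head unification (helped by
% first-argument indexing) gives the second rule a probability of success close
% to 1 and the other two rules a probability close to 0, so only the cost of
% the rail/2 body receives significant weight in the cost of the call.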
In some instances, the output of a subgoal is affected by the nature of other subgoals. Consider, for instance, a sequence of subgoals p(X, T), q(T, Y), and suppose that the set of values that the first subgoal derives for variable T are such that they do not form part of the domain for the first argument in predicate q. Unless we keep track of all intermediate values for variable T (which is normally contrary to abstract interpretation principles), we have no easy way to determine that predicate q will fail for all its inputs. By the same token, since we will not know the exact values of the variables involved, we have no direct method to estimate the shape of the distribution of attribute values for general predicates. In our cost model, we will ignore the issues of independence of subgoals and distribution for intermediate results.
Once the determination of the outputs of the subgoals has been solved (that is, the equivalent of the selection operation of relational algebra), we need to couple different black boxes (i.e., tackle the analogue of the join and projection operations of relational algebra). Several hurdles arise at this point, but the two most problematic are the duplication of solutions after a projection of arguments (noted before) and the correlation between the arguments of two or more different subgoals (interdependence amongst subgoals). Our model in its present form does not tackle these issues.
Our model also handles recursive queries which, in the specific case of GraphLog, are in the form of a predicate closure. Specifically, our methodology estimates the expected average number of solutions of a recursive predicate. The basic idea is that any linearly recursive query can be expressed as a transitive closure (possibly preceded and followed by some non-recursive predicates) [Jagadish87]. Therefore, we estimate the number of solutions of the recursive predicate by estimating the number of solutions of an equivalent query expressed in terms of transitive closure. Thus, we propose a method to estimate the average number of solutions of a transitive closure. An entire chapter will be devoted to explaining how our framework deals with recursive queries.
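For instance (an illustrative Prolog fragment of our own, not an example taken from the dissertation), the classic linearly recursive ancestor relation is exactly the transitive closure of its base relation, which is the form our estimation method works with:

parent(ann, bob).
parent(bob, carol).
parent(carol, dave).

% Linearly recursive definition ...
ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).

% ... whose solution set is the transitive closure of parent/2: for this chain
% of three parent facts it contains 3 + 2 + 1 = 6 tuples, and it is this kind
% of count that the method estimates on average.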
Other issues not currently considered by our cost model include (a) aliasing or sharing of a common variable within the same subgoal, (b) consideration of invalid inputs, and (c) more complex forms of recursion.
As we will see in a subsequent chapter, more accurate results may be achieved when the methodology is tailored to the specific abstract machine and the particular characteristics of the system used to execute the queries. If we wish to obtain more accurate results, we would also require specific knowledge of the evaluation methods that are used (which is crucial when dealing with recursive queries) and the special optimization techniques that are implemented. Note that, under this scheme, a new analysis would be required for each different system. As can be seen, this process may become quite tedious. An alternative, more general solution would require making rough assumptions and concentrating on more "high-level" cost contributors. Thus, given a general evaluation strategy (top-down evaluation in our case [cf. Section 2.1.2]), we are able to estimate the cost of a given GraphLog query without specific knowledge of the particular abstract machine that is being used by the GraphLog system under consideration. Our framework addresses both approaches, so that we propose a model tailored to a specific machine, the WAM [Ait91], as well as a model based on more "high-level" cost contributors and relatively independent of the underlying abstract machine (Figure 1.6).
Approach 1: Model tailored to a specific machine
- the evaluation method is known
- the optimizations are also known
- more accurate
- we may estimate execution times
- only valid for that particular machine

Approach 2: Model based on "high-level" cost contributors
- the specific evaluation method and optimizations used are unknown
- less accurate
- we only estimate values of the cost contributors and not expected times
- more general

Figure 1.6 Two general alternatives for a cost model framework
Chapter 2.
Cost Modeling
A cost model may be visualized as an abstraction that attempts to estimate the efficiency of the actual execution of some piece of code (in our case, a GraphLog query). Different parameters may be used to measure the degree of efficiency. The most commonly used metrics are the time or memory required to answer the entire query. It can be argued that, as memory continues to become cheaper, emphasis should be given to estimating time efficiency rather than memory efficiency.
Different orderings of the same group of subgoals in a GraphLog query will usually result in a different degree of efficiency of execution. Such a difference is due to many factors, ranging from some that are rather predictable (such as the size and nature of the machine code that is generated, or the series of systematic code optimization techniques that are performed) to those that are shaped by the current environment in which the program is executed (such as the current system load, or the number of processes competing for common resources). The latter considerations are hard to take into account and are normally ignored.
In this chapter, we start with an overview of some issues related to query reordering in Datalog (which also apply to GraphLog). We also give a brief account of some related work in the area of query reordering.
2.1 Evaluation Methods for Datalog
Given a Datalog program, a computational model that derives all the facts satisfying the user's query is required. Normally, the chosen evaluation method computes solutions according to the so-called least fixpoint model [Ceri91].

Although pure Logic Programming does not include built-in predicates such as arithmetic or comparison operators, most implementations permit the use of such predicates. An additional useful construct not available in pure Datalog is the use of negation. Negation is often handled by using the closed world assumption, a mechanism of negation as failure which states that the negation of a fact that cannot be logically derived from the Datalog program is considered to be valid.
Several evaluation methods have been proposed for solving Datalog queries, i.e., for determining whether a user's query is valid given the collection of rules and facts that are formulated in the program. We can categorize these methods into two major groups according to the general evaluation strategy, namely bottom-up and top-down evaluations [Ceri91].
2.1.1 Bottom-up Evaluation
Bottom-up evaluation methods apply the principle of matching rules (usually called intensional database predicates) against the facts (also called extensional database predicates) to obtain valid values for the variables involved in the corresponding rules. Those rules whose head variables acquire ground values are then considered in a similar manner to extensional database predicates, and the process is repeated until all necessary facts have been derived. Most bottom-up evaluation methods have been borrowed or adapted from well-known algorithms originally developed to solve systems of equations in Numerical Analysis (for example, the Jacobi algorithm for finding least fixpoints). Most extensions of the basic algorithms are aimed at avoiding duplication in the evaluation of intermediate solutions. Bottom-up evaluation is the natural method for set-oriented languages like Datalog.
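A small worked trace of this idea (our own illustration, not an example from the text): given the facts edge(a, b), edge(b, c) and edge(c, d), and the rules path(X, Y) :- edge(X, Y) and path(X, Y) :- edge(X, Z), path(Z, Y), a naive bottom-up (Jacobi-style) iteration derives path(a, b), path(b, c) and path(c, d) in the first pass, path(a, c) and path(b, d) in the second, path(a, d) in the third, and nothing new in the fourth, at which point the least fixpoint has been reached; semi-naive refinements avoid re-deriving in each pass the tuples already obtained earlier.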
2.1.2 Top-down Evaluation
Top-down evaluation methods use the principle of unification between a given subgoal and the intensional or extensional database predicates. This process of unification provides a set of valid bindings that are then propagated to the other subgoals that constitute the query. A so-called derivation tree is generated. A fairly well-known method that is based on this resolution principle is the SLD-resolution procedure and its several extensions (which constitute the evaluation method of choice for the language Prolog). Top-down evaluation is well-suited for solving simple transitive closure problems when the extensional database relation has no cycles, or when just one answer to the query is needed.
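As a small illustration of this last point about cycles (our own sketch, not part of any GraphLog translation scheme):

edge(a, b).
edge(b, c).

path(X, Y) :- edge(X, Y).
path(X, Y) :- edge(X, Z), path(Z, Y).

% ?- path(a, Y).
% With the acyclic base relation above, top-down (SLD) evaluation enumerates
% Y = b and Y = c and then terminates. If the cyclic fact edge(c, a) were
% added, the same query would keep producing (duplicate) answers and never
% terminate when all solutions are requested, which is why top-down evaluation
% is best suited to acyclic relations or to retrieving a single answer.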
In one of the current implementations, due to Fukar [Fukar91], the query language GraphLog is translated into Prolog. Thus, the GraphLog database can be viewed as a Prolog database, and the executable program as a Prolog program. As a result, under this particular implementation, GraphLog is evaluated using a top-down strategy. For this very reason, all cost models that we propose in this dissertation are tailored to a top-down evaluation strategy.
2.1.3 Safety Considerations
Safety is an important issue related to the evaluation strategy that is chosen. Generally speaking, a query is safe to evaluate if it has a finite number of answers and the computation that is performed to find them terminates, i.e., all the answers are obtained after a finite number of computations. For this reason, query safety plays a very important role when a plan of execution is selected. The issue of the safety of rules has been extensively studied in the literature and safety conditions have been derived for different logic programming languages, and Datalog is not an exception [Bancilhon86].

2.1.4 Query Reordering in Datalog
In pure logic programming, both rules and subgoals can be reordered at will without changing the meaning of the program. In practice, some orderings may yield more efficient executions of the program. However, we have already seen that some orderings may lead to non-terminating computations.

A distinction exists between inherently non-terminating queries and queries whose computation does not terminate for just some orderings. In this latter case, the reordering algorithm must reject such unsafe orderings.
The two principal causes of non-terminating computations for otherwise safe queries are:

Evaluable predicates, i.e., predicates that require that some of their arguments have a ground value prior to the predicate invocation. This is a consequence of the fact that built-in predicates usually deal with infinite relations. In general, if the predicate arguments do not have ground values before the call, the evaluable predicate will produce an infinite number of answers. Typical examples of evaluable predicates are arithmetic expressions and comparison operators. For instance, consider the evaluable predicate plus(X, Y, Z), which represents the arithmetic expression X + Y = Z. This predicate is unsafe if two or more arguments are not integer constants. Thus, a query such as :- plus(5, Y, Z) would yield an infinite number of answers.

Negation, which is normally handled under the so-called Closed World Assumption, considers anything that cannot be logically derived from the rules and facts to be false. The Datalog fixpoint evaluation procedure handles negation by computing the complement of the relation that is being negated. If the domain of such a relation happens to be infinite, the complement may be infinite too. For this reason, the negation of a predicate with at least one variable argument is a potential source of an infinite computation.
Safety rules for GraphLog have been formulated by Fukar [Fukar91]. It is shown that, when GraphLog is translated into Prolog, safety is achieved when the following order of the subgoals is observed: (1) positive (i.e., non-negated) database predicates first; (2) evaluable predicates next; and (3) negated predicates last. However, this specification is harshly restrictive, since evaluable predicates and negations of predicates are only unsafe under certain circumstances.
A less limiting condition restricts evaluable and negated predicates to positions where they are guaranteed to be safe. For the case of evaluable predicates, we have to define a set of lists of arguments that are required to be ground in order to be safe (i.e., to yield a finite number of answers). Figure 2.1 shows two examples of such sets of lists. In the case of negation of predicates, we must guarantee that all arguments become ground prior to the evaluation of the predicate.
% built-in predicate >
% >(A, B) :- true if A is greater than B
% A, B: integer values
This evaluable predicate is safe when both arguments are ground; otherwise it is not safe.
Set of lists of ground arguments that guarantees safety: {[A, B]}.

% built-in predicate -
% -(A, B, C) :- true if C = A minus B
% A, B, C: integer values
This evaluable predicate is safe whenever two or more arguments are ground; not safe otherwise.
Set of lists of required ground arguments that guarantees safety: {[A, B], [A, C], [B, C], [A, B, C]}.

Figure 2.1 Sets of lists of arguments for two evaluable predicates that ensure safety
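The following Prolog sketch (our own; the predicate names and the list-based encoding of argument sets are assumptions, not the thesis's implementation) shows how the sets of Figure 2.1 could be used to decide whether an evaluable subgoal is safe at a given position in an ordering:

% arg_safe(BoundArgs, RequiredSets): true if at least one required set of
% ground arguments is already covered by the arguments bound so far.
arg_safe(BoundArgs, RequiredSets) :-
    elem(Required, RequiredSets),
    covered(Required, BoundArgs).

covered([], _).
covered([A|As], Bound) :- elem(A, Bound), covered(As, Bound).

elem(X, [X|_]).
elem(X, [_|Xs]) :- elem(X, Xs).

% ?- arg_safe([a, b], [[a, b]]).                     succeeds: A > B with both
%                                                    arguments bound is safe
% ?- arg_safe([c], [[a,b], [a,c], [b,c], [a,b,c]]).  fails: the subtraction
%                                                    with only C bound is unsafe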
2.2 Some Recent Work on Query Reordering
Several cost models for logic programming languages have been proposed in the past. McCarthy [McCarthy81] proposed the use of graph-colouring algorithms to mimic the evaluation process of a conjunction of literals. Gooley and Wah [Gooley89] suggested a heuristic method for reordering Prolog clauses using Markov chains and probabilities of success and failure. McEnery and Nikolopoulos [McEnery90] described a reordering system that rearranges non-recursive Prolog clauses by applying both static and dynamic reorderings; the dynamic reordering uses statistical information from previous executions. Sheridan [Sheridan91] designed a "bound-is-easier" heuristic algorithm for reordering conjunctions of literals by selecting subgoals containing ground arguments to be placed before other subgoals. Wang, Yoo and Cheatham [Wang93] developed a heuristic reordering system for C-Prolog based on the probability of success or failure as estimated by a statistical profiler. Finally, Debray and Lin [Debray93] developed a method for the cost analysis of Prolog programs based on knowledge about "size" relationships between the arguments of predicates, this being specially aimed at handling recursion (although some common cases of recursion, such as transitive closure and chain recursion, are not solved at all).
2.2.1 Efficient Reordering of Prolog Programs by Using Markov Chains
Gooley and Wah's work [Gooley89] proposed a model that approximates the evaluation strategy of Prolog programs by means of a Markov process. The cost is measured as the number of predicate calls or unifications that take place. The method needs to know in advance the probability of success and the cost of execution of each predicate.

Gooley and Wah's reordering method takes into account the fact that different levels of instantiation (modes) for the arguments in the subgoals lead to different values of probabilities and costs. A Markov chain is proposed for each valid calling mode. The values of the costs and the probabilities of success are to be provided by the user (at least in the case of the base predicates). To avoid exploring all permutations of the subgoals, Gooley and Wah propose the use of a best-first search.

The method also considers that there are some orderings that must be rejected because of safety conditions. However, no practical solution is given for recursive predicates. The results for the simple Prolog programs that are presented show some acceptable ratios of improvement, although the method seems to be quite expensive to implement. Appendix A1.1 gives a more detailed view of this method.
2.2.2 A Meta-Interpreter for Prolog Query Optimization
McEnery and Nikolopoulos [McEnery90] describe a meta-interpreter for Prolog which reorders clauses and predicates. It has two components: (a) a static component in charge of rearranging the clauses "a priori", and (b) a dynamic component that reorders the clauses according to probabilistic profiles built from previously answered queries.
This method's static reordering phase consists of rearranging the clauses that define a predicate in such a way that the most successful clauses are tried first, and the subgoals within a clause are reordered in descending order of success likelihood.
Subgoal reordering is performed by using a generalization of a heuristic due to D.H.D. Warren [Warren81]. Warren proposed a formula for the cost c of a simple query q given by c_q = s / a, where s is the size in tuples (i.e., the number of solutions) of the subgoal and a is the product of the sizes of the domains of each instantiated argument. The generalized formula proposed by McEnery and Nikolopoulos extends this with p, the probability of success of the clause under analysis, s and a being defined as in Warren's formula.
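As an illustration only (not code from [Warren81] or [McEnery90]), the estimate can be computed directly from a subgoal's size and the domain sizes of its instantiated arguments; the predicate name warren_cost/3 is ours.

    % warren_cost(+Size, +BoundDomainSizes, -Cost): Size is the number of
    % tuples of the subgoal's predicate; BoundDomainSizes lists the domain
    % sizes of the arguments that are instantiated at call time.
    warren_cost(Size, BoundDomainSizes, Cost) :-
        product(BoundDomainSizes, A),
        Cost is Size / A.

    product([], 1).
    product([S | Ss], P) :-
        product(Ss, P0),
        P is P0 * S.

For example, a subgoal over a 3,000-tuple predicate with two bound arguments drawn from domains of sizes 30 and 10 would be estimated at 3000 / (30 × 10) = 10, while the same subgoal with no bound arguments keeps its full size of 3,000.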
The method does not handle recursive queries and it explores all permutations of possible reorderings, which may be very expensive for large queries. For a more in-depth view of this method, the reader is referred to Appendix A1.2.
2.2.3 Efficient Reordering of C-Prolog
Wang, Yoo and Cheatham [Wang93] have implemented a reordering mechanism for Prolog programs which assumes that the cost of evaluating a subgoal is a constant that can be estimated by means of cumulative statistics. A profiler collects the number of subgoals that are invoked for a given predicate p, as well as the number of times that the call fails. The average value of these metrics over the total number of calls to predicate p is used as a measure of the cost of evaluating such a predicate.
The probabilities of success and failure collected during statistical profiling are then used to determine a suitable ordering. In fact, the system only accumulates the number of calls to a predicate and the number of times a failure occurs. The probability of failure of a conjunction of subgoals s_1, ..., s_n is then calculated as the product of the individual probabilities of failure of the subgoals:
failure_rate_i = (number of failures of s_i) / (number of calls to s_i)

probability of failure = Π_i failure_rate_i
An evident advantage of this method is that handling recursion is not a major problem, since we are only interested in the number of calls and failures, without paying attention to whether the calls are recursive or not. An obvious disadvantage of the method is that the degree of instantiation of the subgoals is totally ignored and, therefore, there is no distinction between different calling modes of the same predicate, even though these usually yield different execution costs. Another drawback is that safety conditions are not incorporated, and it is the responsibility of the user to inform the system about which predicates are not suitable for reordering.
2.2.4 On Reordering Conjunctions of Literals: A Simple, Fast Algorithm
Sheridan [Sheridan91] has formulated a good heuristic algorithm for reordering conjunctions of subgoals in Prolog programs. This method differs from many others in that it does not require profile information about the underlying database. Although the method is simple, it yields surprisingly good results. The method exploits the notion of "ground is better", i.e., the fact that the more instantiated the arguments in a subgoal are, the less expensive its execution is. The goal of the method is to maximize the so-called sideways information passing [Ullman88] from left to right.
Sheridan's algorithm distinguishes three groups of subgoals: (a) positive built-in literals, (b) negative literals and (c) other positive literals. This classification of the subgoals has to do with safety considerations. For instance, a built-in predicate may require that some of its arguments have instantiated values before the predicate call (an enabling list of arguments). For example, consider the predicate sum(A, B, C) that evaluates the operation A = B + C. Typical enabling lists (i.e., lists of arguments that guarantee that the given predicate is immediately evaluable) for this arithmetic predicate are [A, B, C], [A, C], [A, B] and [B, C]. In other words, the predicate is safe whenever two or three of the arguments are instantiated to an integer value. By the same token, a negative literal is safe if all its arguments are constant values, so that, given non-built-in predicates q and p, some orderings of a conjunction involving a negated literal are safe whereas others are not.
Note that the algorithm exploits the property of Datalog-like programs whereby each argument is guaranteed to have a constant value after any call. Thus, given this specific property, any occurrence of a variable other than the first one is guaranteed to have a constant value.
The algorithm nondeterministically selects subgoals according to the following criteria (in descending order of priority):
1. non-negative non-built-in subgoals with at least one ground argument (either an explicit constant or a variable that is known to be instantiated to a constant value by virtue of having appeared in a previously selected subgoal);
2. non-negative built-in subgoals that are safe (i.e., at least one of their enabling lists is entirely composed of ground arguments);
3. negative subgoals that are safe (i.e., all their arguments are ground);
4. non-negative non-built-in subgoals with no ground arguments.
The algorithm can use an additional heuristic rule which gives preference to subgoals with a larger number of bound arguments within each criterion group.
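The selection criteria above can be sketched in Prolog as follows. This is our rough rendering, not Sheridan's published code; is_builtin/1 and enabling_list/2 are assumed helper predicates describing the built-in predicates and their enabling lists, and only >/2 is declared here for illustration.

    :- use_module(library(lists)).          % member/2

    % Illustrative declarations for one built-in predicate.
    is_builtin(_ > _).
    enabling_list(A > B, [A, B]).

    % priority(+Goal, +BoundVars, -P): P is the criterion group (1 is best)
    % that Goal falls into, given the variables already bound by previously
    % selected subgoals.
    priority(not(G), Bound, 3) :-                 % safe negative subgoal
        term_variables(G, Vs),
        forall(member(V, Vs), member_eq(V, Bound)).
    priority(G, Bound, 2) :-                      % safe built-in subgoal
        is_builtin(G),
        enabling_list(G, Args),
        forall(member(A, Args), ground_arg(A, Bound)).
    priority(G, Bound, 1) :-                      % at least one ground argument
        ordinary(G),
        G =.. [_ | Args],
        once(( member(A, Args), ground_arg(A, Bound) )).
    priority(G, Bound, 4) :-                      % no ground arguments
        ordinary(G),
        G =.. [_ | Args],
        \+ ( member(A, Args), ground_arg(A, Bound) ).

    ordinary(G) :- G \= not(_), \+ is_builtin(G).

    ground_arg(A, _)     :- atomic(A).
    ground_arg(A, Bound) :- var(A), member_eq(A, Bound).

    member_eq(X, [Y | _]) :- X == Y.
    member_eq(X, [_ | T]) :- member_eq(X, T).

A reordering loop would repeatedly pick a remaining subgoal with the smallest priority value and add its variables to the bound set before selecting the next one.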
An important feature of this algorithm is that no knowledge of the underlying database is required. The main advantage of this is that no database profile has to be obtained, which may represent a substantial saving. An obvious restriction of Sheridan's algorithm is that no distinction is made between a predicate that retrieves a huge number of tuples and one that is associated with a very small set of tuples; as a result, the expensive predicate may be given priority over a possibly better choice.
2.2.5
Cost Analysis of Logic Programs
Debray and Lin [Debray93] have proposed a more general framework to analyze the cost of logic programs, including simple forms of recursion. In particular, the method estimates the number of solutions of a logic program based on the sizes of the predicate arguments. The method derives size relationships amongst predicate arguments. This size information is then used to compute the number of solutions generated by each predicate.
The methodology is applicable to all non-recursive predicates and to those recursive predicates that have an argument whose size is reduced at every recursive step, until a base-case value is obtained. Unfortunately, this leaves out some interesting cases of recursion (such as transitive closure or chain recursion). This method is described in more detail in Appendix A1.3.
Chapter 3. A Machine-Dependent Cost Model
We now proceed to study our cost model for a specific abstract machine. Since the current version of the GraphLog interpreter generates Prolog code [Fukar91], our analysis will be focused on this particular target language. Furthermore, we have chosen a particular execution model for Prolog, namely the WAM abstract machine [Ait91], because it is widely used for the Prolog language. (Prolog is the most widely used logic programming language.)
When dealing with databases, it is usual to separate logic predicates into two categories: extensional database predicates, which comprise a finite set of positive ground facts, and intensional database predicates, which include all other predicates. We will devote the initial part of this chapter to deriving a framework for extensional database predicates, and tackle the case of intensional database predicates thereafter.
3.1
Cost model, Initial Assumptions
We start from two assumptions. First, we suppose that some parametric values of the database are known in advance (such as the number of distinct values for every argument position of all database facts, a model for the distribution that is followed by these attribute values, etc.). Furthermore, we assume that the model of Prolog's execution closely follows the design of the Warren Abstract Machine (WAM) model [Ait91].
We normally consider three different costs that can be estimated for a given subgoal: (a) the cost of retrieving all solutions to the subgoal; (b) the cost of finding the first answer to the subgoal; and (c) the cost of obtaining the next valid answer for a given state. We will concentrate on the all-solutions case, since this is the usual scenario for standard database queries.
3.2 Fact Retrieval, All Solutions
The simplest possible Prolog subgoal is one that only retrieves facts from the extensional database. In this section we find the cost associated with finding all solutions to such a subgoal.
Consider a subgoal p of arity n of the form:

p(P_1, P_2, ..., P_n)

where P_1, P_2, ..., P_n are the arguments of the subgoal. The evaluation of this subgoal may require the execution of a specific set of WAM instructions, such as: predicate calls, allocation and deallocation of stack frames, unification operations, attempts to examine the different unifiable clauses, variable unwinding (in case of unification failure and backtracking), etc. One straightforward way of estimating the cost of evaluating the subgoal is to deduce the exact sequence of machine instructions that is executed. If we know the costs of the individual WAM instructions, a total cost for the fact retrieval operation may be calculated.†
For instance, the WAM defines several term manipulation instructions to handle unification. Their behaviour depends on the mode set by a get_structure instruction. If read mode is set, the unification algorithm is applied to both the instruction operand and the current heap cell (the WAM stores new terms onto a memory area called the heap). If, instead, write mode is specified, a new cell is allocated on the heap. A typical translation of a fact is shown in Figure 3.1 (the WAM instructions are shown to the left).
Note that the number and nature of the arguments will determine the set of instructions that corresponds to the WAM translation. The existence of two modes (read and write) has to be considered as well.
However, a simpler approach can be proposed instead. We can neglect or disregard those instructions that either are executed regardless of the position of the subgoal in a
†See [Gorlick87] for an attempt to use this approach. The proposed model only considers very simple clauses without disjunctions (therefore leaving out clause indexing), and does not address the issue of the degree of instantiation (or "modes") of the predicate arguments either.
predicate/3:
    get_variable X0              % predicate( V,
    get_structure m/2, X1        %            m(
    unify_variable X5            %               X5,
    unify_variable X6            %               W ),
    get_structure n/2, X2        %            n( V, W ))
    unify_value X0               %
    unify_value X6               %
    get_list X5                  % X5 = [ a | X4 ]
    unify_constant a             %
    unify_variable X4            %
    get_list X4                  % X4 = [ b | X3 ]
    unify_constant b             %
    unify_variable X3            %
    get_list X3                  % X3 = [ c | [] ]
    unify_constant c             %
    unify_constant []            %

Figure 3.1 Partial translation of the fact predicate(V, m([a,b,c], W), n(V, W))
conjunctive clause (as in the case of the predicate call) or that do not incur a significant cost (such as, for example, the WAM's switch instructions, which support argument indexing). This latter group of instructions can be safely neglected when we are dealing with fairly large databases, where other operations (variable unwinding, tuple visiting) dominate the execution performance.
We have found experimentally that three groups of WAM instructions are usually responsible for the major part of the time spent evaluating a subgoal. These are: (a) instructions that are used to manipulate choice points (try_me_else, retry_me_else and trust_me); (b) instructions that perform the unification algorithm for terms; and (c) instructions that restore a previous state when a new solution is required (since a process of backtracking is launched). Our general cost function is based on these observations.
For ease of analysis, we will usually assume a uniform distribution of independent attribute values, a commonly used assumption in the database field [Mannino88].
3.2.1 Choice Point Manipulation
The first group of WAM instructions that is heavily used during fact retrieval is concerned with physical access to the tuples. In our model, we propose to write the cost due to choice point traversal as:

cost_traversal = n_chp × T_chp

where
n_chp is the total number of choice points that are "visited", and
T_chp is the expected cost of executing the instructions associated with a single choice point.
T_chp is assumed to be a constant that depends on the Prolog system in use, and its value may be determined experimentally. The number of choice points, i.e., the number of alternatives that must be explored during an all-solutions retrieval, can be estimated from the database profile. Given the instantiations of the arguments and the scheme of clause indexing that is used, we may estimate the number of tuples whose unification will be attempted. Appendix I gives a formula that holds when a uniform distribution of independent attribute values is being used.
3.2.2
Unification Operations
This second group of WAM instructions is concerned with unification applied to terms (get_constant instructions in our case, since we are considering simple facts). To simplify our analysis, let us consider the two simplest cases of term unification: ground (constant) and non-ground (variable) unifications†. A variable unification is always guaranteed to succeed, whereas a ground unification can fail. In our model, we can estimate the cost associated with the unification operation as follows:

cost_unification = n_uc × T_uc + n_uuc × T_uuc + n_vu × T_vu

†In fact, constants and variables are the only two terms that are allowed in GraphLog.
where
n_uc is the number of successful constant unifications that take place;
n_uuc is the number of unsuccessful constant unifications that take place;
n_vu is the number of (successful) variable unifications that take place;
T_uc is the expected cost of performing one successful constant unification;
T_uuc is the expected cost of performing one unsuccessful constant unification; and
T_vu is the expected cost of performing one (successful) variable unification.
The three numbers can be derived from the database profile (the instantiation of the arguments and the distribution of attribute values may be used for this purpose); the three cost factors may be determined experimentally.
Consider again a subgoal p of arity n of the form:

p(P_1, P_2, ..., P_n)

To estimate the values of n_uc, n_uuc and n_vu, two quantities have to be determined for every argument position: (a) the number of unification attempts and (b) the number of successful unifications. Clearly, the number of unification attempts that take place for position k has exactly the same value as the number of successful unifications that occurred for position k-1 (k > 1), assuming that arguments are unified from left to right.
The number of successful unifications at a given argument position is a fraction of the total number of unification attempts that are made. We propose the following formula for n_succ(k), the number of successful unifications at position k:

n_succ(k) = R_f(k) × n_att(k)

where
R_f(k) is a reduction factor for argument position k, which also represents the savings due to clause indexing (if implemented); and
n_att(k) is the number of unification attempts at argument position k.
Appendix I shows some formulae that apply to the special case of a uniform distribution of independent attribute values.
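As a concrete (and only plausible) illustration of the kind of formula involved, rather than a quotation of Appendix I: under a uniform distribution of independent attribute values and without indexing on position k, one would expect

    R_f(k) ≈ 1/S_k   if argument k is ground
    R_f(k) = 1       if argument k is free

where S_k is the number of distinct values at position k; a free argument lets every attempted tuple through, while a ground one passes, on average, only the matching fraction.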
3.2.3 Backtracking Operations

The contribution to the cost of retrieving all solutions to a fact depends for the most part on the total number of solutions that can be retrieved, and is represented by the operations that restore prior states during backtracking and that are not part of the "choice point manipulation" previously considered†. We propose the following formula:

cost_backtracking = n_sol × T_back

where
T_back is the expected time associated with the process of restoring a previous state when a new solution is searched for, and can be determined experimentally; and
n_sol is the expected number of solutions to the query.
Consider a subgoal p of arity n of the form:

p(P_1, P_2, ..., P_n)

For a uniform distribution of attribute values, the total number of solutions is given by:

n_sol = n_succ(n)

n_succ(n) being the number of successful unifications that occur for the last argument P_n (Section 3.2.2).
†Another action that is directly related to the number of solutions has to do with the actual display of the results.
In general, the total number of solutions may be derived from the database profile. Much work has been published on this subject [Mannino88].
3.2.4 General Formula

A global formula simply takes all the above considerations into account. Given that we have decided to restrict our scope to the three previously mentioned cost contributors, our final formula is as follows:

total_cost = cost_traversal + cost_unification + cost_backtracking
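As an illustration of how the three contributors combine, the following sketch (ours, not a program from the thesis) evaluates the formula once the counts and the elementary constants are supplied by the caller.

    % total_cost(+Counts, +Constants, -Cost)
    % Counts    = counts(NChp, NUc, NUuc, NVu, NSol)
    % Constants = constants(Tchp, Tuc, Tuuc, Tvu, Tback)
    total_cost(counts(NChp, NUc, NUuc, NVu, NSol),
               constants(Tchp, Tuc, Tuuc, Tvu, Tback),
               Cost) :-
        Traversal    is NChp * Tchp,
        Unification  is NUc * Tuc + NUuc * Tuuc + NVu * Tvu,
        Backtracking is NSol * Tback,
        Cost is Traversal + Unification + Backtracking.

For example, with the constants derived experimentally in Section 3.3 (T_chp = 0.020, T_vu = 0.007, T_back = 0.048, and the constant-unification terms dropped), the query ?- total_cost(counts(3000, 0, 0, 12000, 3000), constants(0.020, 0.0, 0.0, 0.007, 0.048), C). gives C = 288.0; the number is purely illustrative of how the values are plugged in.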
3.3 Experimental Values for the Elementary Constants

Here we explain how to obtain empirical values for the constants T_chp, T_uc, T_uuc, T_vu and T_back for the particular case of a uniform distribution of attribute values. It must be emphasized that these values are heavily dependent on the actual implementation that is used.
A good strategy to determine the values of the above-mentioned constants consists of building several perfectly uniform databases, and then measuring the execution time for different types of queries involving both ground and variable arguments. A ternary predicate seems to be a convenient choice because it contains the most important variants without having to deal with huge databases (Figure 3.2). A binary predicate may work as well, but less accurate results can be expected.
(a) An extract from one of the databases that were used:

    pred1(ba,ba,ba).   pred1(ba,ba,aa).   pred1(ba,aa,ba).   pred1(ba,aa,aa).
    pred1(aa,ba,ba).   pred1(aa,ba,aa).   pred1(aa,aa,ba).   pred1(aa,aa,aa).

(b) Typical subgoals which retrieve these facts:

    pred1(ba,aa,ba).   pred1(ba,aa,Z).    pred1(ba,Y,ba).    pred1(ba,Y,Z).
    pred1(X,aa,ba).    pred1(X,aa,Z).     pred1(X,Y,ba).     pred1(X,Y,Z).

Figure 3.2 (a) An extract from one of the databases that were used and (b) typical subgoals which retrieve these facts
The basic idea consists of building a kind of database for which we can theoretically predict the number of WAM instructions that get executed for our different queries (a symmetric database is a suitable choice given its predictability with regard to the number of WAM operations that are expected to be executed). Thus, we can derive theoretical formulae based upon some parametric variables for all contributors that we consider relevant. Then, we experimentally obtain the costs of executing the queries and relate these costs to the parametric variables. In the case of a perfectly uniform database, all contributors may be expressed as functions of the sizes S_i of the argument domains and their respective products. Therefore, we may propose a general formula of the form:
cost = Σ_i c_i × G_i(S_1, S_2, ..., S_N)

where each G_i(S_1, S_2, ..., S_N) = Π_{k ∈ P_i} S_k and each P_i ⊆ {1, 2, ..., N},

N being the number of arguments in the subgoal. For the ternary case, we have:

cost = c_7·S_1S_2S_3 + c_6·S_2S_3 + c_5·S_1S_3 + c_4·S_1S_2 + c_3·S_3 + c_2·S_2 + c_1·S_1 + c_0

where S_n represents the domain size of argument n, and the c_i are constants related to the weight or influence of the corresponding term in the total cost (a zero value would mean no contribution whatsoever due to that particular term). In fact, when several independent experiments are launched, only a few constants show both measurably "large" and consistent values in repeated experiments, and these are obvious candidates to be considered significant.
All experimental results mentioned in this section were obtained on both SICStus Prolog, version 2.1, and SB-Prolog, version 3.0, executing on a SUN IPC SPARCstation. The experimental values were measured using the profiling routines provided by SICStus Prolog and SB-Prolog; all execution times are estimated, according to the implementation manuals, in "artificial" units.
Approximately 1,000 different databases were built with sizes ranging from 10 to about 25,000 different tuples. For every database, all possible combinations of ground and variable arguments in the query were tried (see Figure 3.2). Using the least-squares method for curve fitting, the values of the constants c_i (i.e., the dependency of the execution times upon the parametric values of the database) were obtained. Initial experiments showed that all these dependencies were approximately linear.
Table 3.1 and Table 3.2 summarize some actual results for a complete experiment. S_k stands for the number of distinct values for argument position k. Those cells in the table containing values that are clearly distinct from zero (and that may indicate that the term under consideration contributes to the total cost) have been marked in bold font. A decision was made to consider as few constants as possible, for instance disregarding some values for the variables S_1, S_2 and S_3 (as well as the independent term), which will normally hold smaller values than their products. Some variables may have values clearly distinct from zero after one experiment, but no consistent values from experiment to experiment; we decided to ignore these constants as well†. The eight different cases of ground and not-ground combinations are abbreviated using the letters g (for ground) and f (for free, i.e., not ground, variable).
At the same time, the corresponding WAM instructions and the number of times that they had been executed were calculated. We assumed the first-argument indexing characteristic of SICStus Prolog. A rough estimate of the number of times that the WAM instructions were expected to be executed is shown in Table 3.3.
The fact that for the all-ground-argument case (i.e., ggg) there was no clear dependence on the value of variable S_3 (constant c_3 is negligible), and for the first-not-ground-the-rest-ground case (i.e., fgg) no appreciable dependency on the variable (S_1 S_3) was observed (incidentally, the expressions in which these terms appear are highlighted by a light shading in Table 3.3), suggests that the contribution of constant unifications (i.e., T_uc and T_uuc) may be neglected. Thus, a simplified table (Table 3.4) is obtained.
†We observed that some apparently significant negative quantities showed no consistent values from experiment to experiment, and most of the time their values were close to zero; for instance, the value -0.14 in the first row of Table 3.1.
Table 3.1 Typical Experimental Results for a Ternary Predicate for SICStus Prolog. (For each instantiation case, ggg through fff, the table lists the fitted constants c_7, ..., c_0 associated with S_1S_2S_3, the pairwise products of S_1, S_2 and S_3, the single sizes S_3, S_2, S_1 and the independent term.)

Table 3.2 Typical Experimental Results for a Ternary Predicate for SB-Prolog. (Same layout as Table 3.1.)
Thus, if we decide to consider only the remaining three constants, T_chp (directly related to "retry_me_else" operations), T_vu (associated with successful "get_variable" instructions) and T_back (connected to the number of solutions of the retrieval), then we proceed to establish which products of our S variables are expected to contribute to the cost of the retrieval. For instance, the product S_2×S_3 is significant for the "ggg" case, and the products S_1×S_2×S_3 and S_1×S_3 are significant for the "fgf" case. We may build a table
Table 3.3 Number of times that the WAM instructions are executed. (For each case the table counts: calls to the predicate, switch-on-term and switch-on-constant instructions, retry_me_else and trust_me instructions, successful get_variable instructions, successful and unsuccessful get_constant instructions, and the total number of solutions.)

Table 3.4 Number of times that the WAM instructions are executed (simplified version). (For each case: retry_me_else instructions, successful get_variable instructions and the total number of solutions.)
showing such dependencies (Table 3.5) and then proceed to connect these theoretical values with the experimental values.
To derive the final values for our constants T_chp, T_vu and T_back, we must solve a system of simultaneous equations. For instance, for the SICStus Prolog single experiment of Table 3.1, we would consider the following system of approximate equations:
Table 3.5 Approximate Theoretical Values for a Ternary Predicate. (For each case the table gives the expected dependence of the three retained contributors on the products S_1S_2S_3, S_2S_3, S_1S_3, S_1S_2 and the single sizes.)
Note that some equations are redundant. Sometimes the same terms are equated to slightly dissimilar values, serving to remind us that our results are only approximate. There is no unique method to solve such an overdetermined† system of equations. A simple method described in [Froberg85] solves the system by using a maximum norm. We have computed the following approximate values for our particular environment when using our particular version of SICStus Prolog:

T_chp = 0.020, T_vu = 0.007, T_back = 0.048.
†An overdetermined (or inconsistent) system has more equations than unknowns.
Table 3.6 summarizes the deviation between the experimental values (i.e., the actual execution times in artificial units as obtained during the experiments) and the proposed theoretical values when using these constants for SICStus Prolog (i.e., applying them to the formula described in Section 3.2.4)†. The greatest discrepancies occur for the all-ground-argument case, due in part to the fact that constant unifications play a major role there, and this is ignored by our approximation (i.e., values for T_uc and T_uuc were not derived).
Table 3.6 Average cost error introduced by our approximation (average deviation between theoretical and experimental values over 1,000 different databases)
3.4 Conjunction of Simple Queries, All Solutions
We now proceed to study the case of a conjunction of facts. Again, we are interested in the all-solutions case. Consider a conjunction of simple queries of the form

p_1/a_1, p_2/a_2, ..., p_n/a_n

where the notation p/a means that predicate p has arity a.
†Since the databases in the experiments were forced to have a uniform distribution of independent attribute values, for each database we can easily estimate the values of n_chp, n_uc, n_uuc, n_vu and n_sol required by the formula.
If, at every point in the evaluation of this query, we know the instantiation of the arguments of every subgoal, we can determine the cost of evaluating each subgoal by using the formula described in Section 3.2. The following formula can then be applied to estimate the global cost of finding all solutions (i.e., the total cost of evaluating a conjunction of subgoals):

total_cost = Σ_{i=1..n} ( Π_{j=1..i-1} n_sol(j) ) × c_all(i)

where
c_all(i) is the cost associated with finding all solutions to subgoal p_i; and
n_sol(i) is the estimated (average-case) number of solutions to subgoal p_i.
Note that each successive subgoal is called once for every combination of solutions retrieved by the subgoals that precede it, i.e., Π_{j<i} n_sol(j) times.
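A compact way to evaluate this formula is to fold over the subgoals in their chosen order. The following sketch is ours, not the thesis's implementation; each element of the list is a pair Cost-NSol holding the all-solutions cost and the expected number of solutions of one subgoal.

    % conjunction_cost(+Subgoals, -Total)
    % Subgoals is a list of Cost-NSol pairs, in evaluation order.
    conjunction_cost([], 0).
    conjunction_cost([Cost-NSol | Rest], Total) :-
        conjunction_cost(Rest, RestTotal),
        Total is Cost + NSol * RestTotal.

Unfolding the recursion gives Total = c_all(1) + n_sol(1)×c_all(2) + n_sol(1)×n_sol(2)×c_all(3) + ..., which is exactly the summation above.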
Since we are dealing with subgoals that retrieve tuples from a database, we may determine in advance the actual instantiation of every argument in the conjunction. Thus, every variable that appears for the first time in a subgoal will be uninstantiated, whereas any variable that has appeared before in another subgoal will be instantiated at that point. To find the least overall cost, all possible orders of subgoals must be considered.
Table 3.7 shows a comparison between (a) the experimental costs for the book database example and (b) the costs predicted when only using the primitive constants in our formula and assuming a uniform distribution. Since the book database, like most real databases, does not follow a uniform distribution, significant differences can be observed. However, in this particular case, the uniform distribution model can still be used to predict a general trend, i.e., we may still obtain the most efficient evaluation order, but there is no guarantee that this will be the case in a general situation. The model was also tested against (c) another (artificially generated) book database which was designed to follow a strictly uniform distribution, and, not surprisingly, our theoretical values predicted those costs more accurately. All values in Table 3.7 are reported in SICStus Prolog's artificial units.
Table 3.7 The book titles database. (For each subgoal ordering the table compares (a) the experimental cost on the real database using SICStus Prolog, (b) the theoretical value assuming a uniform distribution, and (c) the experimental cost, using SICStus Prolog, on a book database generated to follow a uniform distribution, together with the difference between (b) and (c); all values are in SICStus Prolog's artificial units.)
Normally, we will pay more attention to the relative cost amongst different orderings than to the "exact" cost values. Table 3.8 shows that we were able to predict the correct order of the costs of the different orderings.
In all three rankings, (a) theoretical predictions, (b) SICStus Prolog on a uniform database, and (c) actual SICStus Prolog measurements, the orderings appear in the same order:

1. book-author-publisher
2. author-book-publisher
3. book-publisher-author
4. publisher-book-author
5=. publisher-author-book
5=. author-publisher-book

Table 3.8 Orderings ranked by their costs
a. Since these last two orderings are within 0.1% of each other (given actual measurements), a similar rank is shown.
3.5
Intensional Database Predicates
As mentioned before, it is common practice to separate logic predicates into two categories: extensional database predicates and intensional database predicates. One advantage of this division is that extensional predicates typically have large numbers of clauses (they can be seen as the database itself), whereas intensional predicates normally have a small number of clauses. Additionally, one can infer some properties for extensional predicates, such as a distribution for the attribute values or correlation factors amongst them, that characterize the database under consideration. Normally, these properties are constant in time (cf. previous sections). In other words, one can predict, within certain parameters, how a query will behave when applied to that database.
On the other hand, intensional predicates are less predictable. They require a more complex analysis framework, whose predictions are normally less accurate. A standard approach for analyzing the execution behaviour of a program is to use abstract interpretation techniques [Cousot77, Cousot92], which transfer the problem to a different, easier to handle domain at the expense of some loss of precision. In the specific case of logic programs, for example, instead of keeping track of the exact values that every variable holds during program execution, one may want to consider a simpler, more general property. One such property is the mode of the variable, that is, its degree of instantiation [Mellish85] [Debray89]. We will not know the exact value, but at least we can ascertain that the variable under consideration is an uninstantiated variable, or a ground constant, or a term with a combination of both (i.e., a partially grounded structure). We can infer these attributes by performing a static mode analysis.
3.6
Mode analysis
In general, Prolog programs are undirected, that is, there is no distinction between input and output parameters for a given predicate. This notion of bi-directionality presents a major challenge to the production of efficient code, since the depth-first search strategy with chronological backtracking that Prolog uses to implement non-determinism is itself a very inefficient strategy [Mellish85]. However, Prolog predicates are typically written with one sole direction in mind and, as a result, some parameters are meant to be exclusively input or output. Knowledge of such directionality can be expressed using the notion of modes, a concept which was introduced by D.H.D. Warren (and refined by Mellish) to classify the ways in which a Prolog predicate is used during the execution of a program. If the programmer provides such clues to help the compiler identify directionality, the generated code can be dramatically improved. A possible alternative is to infer the mode information by performing a global analysis of the program [Debray88].
The standard approach for determining the mode information of a logic program statically uses abstract interpretation [Cousot77], [Cousot92]. This is a general technique where the standard semantics of a program are projected onto a different (and simpler) domain. Several solutions to the problem of finding the modes of a Prolog program have been proposed; a quite extensive survey is given in the introduction of [Debray88]. In this section, the mode inference algorithm of Debray [Debray89] is described, since this framework is the basis of the determinacy analysis of our work.
The mode of a predicate in a Prolog program specifies which arguments are input arguments and which are output arguments, taking into account all possible calls that can occur during the execution of a program. Depending on the nature of the problem, a set of modes must be defined to characterize the modes of the arguments in a Prolog predicate. Debray proposed the family of modes {c, d, e, f, nv}, where c denotes the set of fully-instantiated (ground) terms, d (don't know) the universal set of all terms, e the empty set, f (free variable) the set of uninstantiated variables, and nv the set of non-variable terms (that is, structured terms which are not fully instantiated). The set forms a complete lattice under the inclusion operator (Figure 3.3).

Figure 3.3 Debray's lattice for mode analysis
Given a set of terms T, its instantiation is defined to be the element of the lattice that best characterizes it. Thus, the least upper bound [Birkhoff40] of all terms in T is chosen.
Prolog's unification operation can be understood in the mode domain (called the abstract domain) as an operation that, given the instantiations of the arguments in a call, refines them according to the nature of the head arguments. Debray [Debray89] defined the lattice in such a way that, given two term instantiations t_1 and t_2, their unification is chosen to be the least upper bound of their instantiations under the following partial ordering:

f ⊑ d ⊑ nv ⊑ c ⊑ e

The unification of terms is modelled by applying the join operation to two elements of the lattice, a and b, which returns the least upper bound of a and b. The join operator for the ordering under consideration is written as ∨.
Some examples are shown in Figure 3.4. Note that, since some information is not taken into account in the abstract domain, the results are usually less accurate than in the concrete (and more complex) world.†
3.6.2
General Mode Analysis Method
Debray's method uses the procedural view of Prolog, which recognizes the existence of mechanisms such as procedure call, success, failure, backtracking, etc. Debray's static inference of Prolog modes is based on keeping track of individual variable instantiations throughout the execution of a program. Such information is propagated in the usual way: from caller to callee at any predicate invocation, and from callee to caller at the time of the predicate's completion. Thus, at any point during program execution, an instantiation state is defined, which contains instantiation information for every variable in the program.
The notion of an instantiation state can be extended to any arbitrary non-variable term. A constant term will have ground instantiation (c) and an empty dependency set.
†In particular, unification in Debray's model can never fail.
Figure 3.4 Abstract interpretation applied to Prolog unification given two terms t1 and t2 (for example, unifying f(Y) with f(6) yields f(6) in the concrete domain, while the corresponding abstract unification of nv and c yields c)
A structured term will have ground instantiation (c) if all its arguments are ground, and non-variable instantiation (nv) otherwise.
In order to facilitate the propagation of mode information, Debray's method defines instantiation patterns for every procedure call. An instantiation pattern contains, for every procedure argument, some information related to its instantiation.
3.6.3
Abstract Domains
In essence, we can characterize every predicate call by a previous state (in which the instantiations of the arguments are grouped in a so-called calling pattern) and by a resulting state, which differs from the original state in the same degree as the arguments do (the new set of modes for the arguments is referred to as the success pattern). In the framework proposed by Debray [Debray89], we have knowledge about the degree of instantiation of every term at any execution point. Although not explicitly formulated by Debray, we may characterize his abstract domain as pairs of <calling pattern, success pattern> for each predicate call in the program. More formally:

{ <Pred_r, cpat_m, spat_{m,n}> | 1 ≤ m ≤ ncpat(Pred_r), 1 ≤ n ≤ nspat(Pred_r, cpat_m), 1 ≤ r ≤ number of distinct predicates }

where
Pred_r represents the r-th predicate in the database;
cpat_m is a feasible calling pattern for predicate Pred_r: a calling pattern is an ordered k-tuple (where k is the number of arguments of predicate Pred_r) in which the k-th element represents the current mode of the k-th argument;
spat_{m,n} represents the n-th valid success pattern for the given calling pattern cpat_m: a success pattern is identical in form to a calling pattern and differs from it in that the modes of the arguments are updated as a result of the predicate call (by using Debray's unification rules in the abstract domain);
ncpat(Pred) is the number of distinct calling patterns that Pred can be invoked with (as determined by a static analysis); and
nspat(Pred, cpat) is the number of different success patterns that result from a call to Pred given a calling pattern cpat.
We start with Debray's approach and propose enriching the domain with probabilities of occurrence for the various success patterns, as well as quantities related to the cost of each particular execution path. We tailor the analysis to the all-solutions case. Our analysis will be restricted to non-recursive programs. We could easily extend it to allow recursion controlled by an argument that reduces its size at every recursive step, as Debray does [Debray93], but this type of recursion does not occur in pure Datalog programs.
To illustrate how cost contributors are incorporated into our abstract domain, consider a variant of the books database introduced in Section 2.1.4, and a query whose subgoals are evaluated in the order book, publisher, author. Assume that the database attributes follow a uniform distribution of values. The database profile is shown in Table 3.9.
Predicate name    number of tuples    distinct values    distinct values    distinct values    distinct values
                                      in argument 1      in argument 2      in argument 3      in argument 4
book/4            3,000               3,000              30                 10
publisher/2       7,600               20                 330
author/2          450                 450                30

Table 3.9 The books database profile
For our particular query, the book predicate will be invoked with all arguments unbound (i.e., a calling pattern [f, f, f, f])†. The publisher predicate will be called with a calling pattern [g, f], since its first argument will have a constant value after the call to predicate book has been completed. Finally, the author predicate will have both arguments bound to a constant value (i.e., a calling pattern [g, g]).
†As before, we use "g" to denote a ground term and "f" to indicate a free variable.
In this particular case, we have the following instances that characterize the execution of the query:
(a) book predicate:
In Debray's domain, the following instance is generated:
<book, [f, f, f, f], [g, g, g, g]>,
where the third element of the tuple is the success pattern that results from a successful call. In our domain, we wish to include some cost contributors, namely the number of tuples that are visited, n_t, the number of variable unifications that take place, n_v, and the expected number of solutions, n_s. Thus, we would generate the following instance for the book predicate:
<book, [f, f, f, f], [g, g, g, g], 3000, 12000, 3000>,
where the three numerical values represent the cost metrics n_t, n_v and n_s, respectively.
(b) publisher predicate:
Debray's instance for this predicate would have the form:
<publisher, [g, f], [g, g]>.
Our enriched domain would be:
<publisher, [g, f], [g, g], 380, 380, 380>.
(c) author predicate:
Using Debray's domain, we would obtain:
<author, [g, g], [g, g]>,
while our domain would provide additional information:
<author, [g, g], [g, g], 1, 0, 0.0333>†.
More formally, our enriched domain is defined as follows:

Cost Abstract Domain = { <Pred_r, Clause_q, cpat_m, metrics_{r,q,m}> | 1 ≤ m ≤ ncpat(Pred_r), 1 ≤ r ≤ number of distinct predicates, 1 ≤ q ≤ number of distinct clauses in predicate Pred_r }

†Note that, in this case, the rate of success is given by 450/450/30 (i.e., 1/30).
where
Pred_r, cpat_m, ncpat(Pred) and nspat(Pred, cpat) are the same as before;
Clause_q identifies the q-th clause of the predicate under consideration; for extensional predicates, all clauses may be collapsed into a single tuple template, making this element irrelevant;
cpat_m is the m-th feasible calling pattern for predicate Pred_r; and
metrics contains a list <v_1, v_2, ..., v_n> of values for n different cost contributors that we have decided beforehand are relevant to producing an estimate of the total cost associated with calling pattern cpat_m. For instance, if we have decided that our cost function will be based on the number of successful unifications (say, n_succ_unif), we may have a list of the form <n_succ_unif, num_sol>, where num_sol is the number of solutions that result from calling predicate Pred. In other words, we will record all those quantities that are required by our cost formula. Thus, for the formula derived for the WAM in Section 3.2, our list of metrics would most probably be of the form <n_chp, n_vu, n_back, num_sol>. Note that the number of solutions associated with a predicate call is normally required in order to calculate costs for conjunctions of queries (see Section 3.4).
In other words, given a predicate clause, we are mainly interested in obtaining some metrics related to the cost estimation for any viable calling pattern. We have omitted from the domain a place for the success patterns that are obtained. The reason is that such information is implicitly used by successive predicates in their respective calling patterns†.
†Note that, in the case of GraphLog, only one success pattern is obtained after any predicate call.
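One possible concrete representation of these enriched instances, using the figures quoted above for the books database, is a small set of Prolog facts; the predicate name cost_entry/3 and the metrics/3 wrapper are our notation, not the thesis's data structure.

    % cost_entry(Predicate, CallingPattern,
    %            metrics(TuplesVisited, VariableUnifications, Solutions)).
    cost_entry(book,      [f,f,f,f], metrics(3000, 12000, 3000)).
    cost_entry(publisher, [g,f],     metrics(380,  380,   380)).
    cost_entry(author,    [g,g],     metrics(1,    0,     0.0333)).

A cost analyser can then look up the metrics for each <subgoal, calling pattern> pair it encounters instead of recomputing them.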
3.7
Cost Function
In this section, we explain the domain element metrics. Its purpose is to keep track of all relevant parametric values that are used to estimate the cost of a predicate clause. Normally, it will include values such as the average number of solutions that are expected, the number of tuples that are visited, the number of variable or constant unifications that take place, the number of times that the state must be restored, and so on.
In the general case, our parametric values will differ for different calling patterns. For instance, a call of the form p(X) with calling pattern <f> (i.e., a free variable) will retrieve all facts in the database, whereas a call of the form p(c) with calling pattern <g> (i.e., a ground term) will retrieve at most one fact. Therefore, we have to keep track of the feasible abstract paths, i.e., information regarding the calling patterns with which the subgoals may be invoked during the evaluation of that particular clause. Given a conjunction of subgoals:
s_1, s_2, ..., s_n,

we define an abstract path as a list of tuples:

abstract path = [ (s_1, cpat_1), ..., (s_n, cpat_n) ]

where cpat_i represents a feasible calling pattern for subgoal s_i.
Note that an abstract path is defined for any GraphLog query or rule, and its unique value is determined statically via a simple mode analysis. Although no success pattern appears in the definition of an abstract path, it should be clear that successive calling patterns are built from the success patterns obtained from previous subgoals.
3.7.1
Cost Function from the Perspective of Head Unifications
For intensional predicates, we will estimate the evaluation costs at the clause level, that is, we will obtain the contribution to the cost due to each one of the clauses of a given predicate. In this section, we propose a methodology that can be used to estimate the average cost of evaluating a predicate given a particular calling pattern. We will concentrate on the probability associated with the process of head unification. In the following section we will consider how to estimate the cost of evaluating a complete query or clause body. For this reason we start from the assumption that the average cost of every body is available before evaluation time.
We estimate the total cost of a single clause as follows (Eq. 3.1):

cost(Pred_r, Clause_q | cpat) = cost(h_unif(Pred_r, Clause_q) | cpat) + P(h_unif(Pred_r, Clause_q) | cpat) × cost(body(Pred_r, Clause_q) | h(cpat))

where
cost(Pred_r, Clause_q | cpat) is the cost that results from the evaluation of the q-th clause of the r-th predicate, given an initial calling pattern cpat;
cost(h_unif(Pred_r, Clause_q) | cpat) is the cost due to the process of head unification for the q-th clause of the r-th predicate, given a specific calling pattern cpat;
P(h_unif(Pred_r, Clause_q) | cpat) is the probability that the process of head unification for the q-th clause of the r-th predicate is successful for the given calling pattern cpat;
h(cpat) is the modified pattern that results after a successful head unification given a calling pattern cpat, which in turn is the initial calling pattern for the body of the clause; and
cost(body(Pred_r, Clause_q) | h(cpat)) is the cost associated with the evaluation of the body of the q-th clause of the r-th predicate given a calling pattern h(cpat); this value will be analyzed in a following section.
Besides the cost of the body, there are two unknown quantities at this point: the cost due to the process of head unification and the probability that the head of the clause successfully unifies with the arguments of the call. The estimation of the number of primitive operations that take place during a successful head unification is quite straightforward: roughly speaking, one tuple is visited, only one restoration process would be necessary in case of backtracking, and the number of variable and constant unifications can easily be determined from the calling pattern and the internal structure of the head.
Additionally, given a calling pattern, we wish to determine the probability that the head of a clause is successfully unified. Since we do not know the exact values that can appear at every argument position, nor the frequency with which these values appear, we are forced to make assumptions. A rough but simple assumption would consider that all ground values follow a uniform distribution. Our universe of ground values may be defined such that it comprises exactly those values that appear in the heads of the clauses. Alternatively, we may obtain the distribution and universe of attribute values from another source. If we want to attach probability values to each clause we are forced to define or select a universe of values that, at least, includes all constant arguments that occur in the heads.
The probability that a given calling pattern successfully unifies with a clause head is estimated as follows (Eq. 3.2):

P(h_unif(Pred_r, Clause_q) | cpat) = Π_{k=1..numarg} P(succ_unif(a_k) | cpat[k])

where
P(h_unif(Pred_r, Clause_q) | cpat) is the probability that the process of head unification is successful for the q-th clause of the r-th predicate, given a calling pattern cpat;
P(succ_unif(a_k) | cpat[k]) is the probability that an argument with instantiation cpat[k] can be successfully unified with the k-th argument a_k of the actual head;
cpat[k] is the current instantiation of the k-th argument in the calling pattern cpat; and
numarg is the number of arguments in the head.
In general, the value of P(succ_unif(a_k) | cpat[k]) can be estimated as follows (Eq. 3.3):

P(succ_unif(a_k) | cpat[k]) = 1, if a_k = f or cpat[k] = f; κ_k otherwise.
It is important to realize that real-life predicates do not necessarily have independent probabilities for their argument unifications. In other words, our proposal assumes an ideal case: that there is no correlation amongst arguments.
The value of κ_k should be determined from whatever abstraction we use to characterize the distribution function of attribute values, and in our framework it represents the average probability that a ground or partially ground term cpat[k] can be successfully unified with the actual argument a_k. If no distribution function is known, or if our abstraction does not keep track of the actual values of the constants, we may simply assume a uniform distribution of values. With this crude assumption, the probability that a ground argument can be unified with another ground argument is given by 1/C(a_k), where C(a_k) is the cardinality of the universe of values for argument position k. The probability that a ground term can be unified with a non-compatible argument (for example, a structure with a different arity) is zero.
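A small sketch of this estimate under the uniform-distribution assumption (our illustration; head_unif_prob/4 is not a predicate of the thesis): each head argument contributes a factor of 1 when either side is a free variable, and 1/Card otherwise, where Card is the cardinality assumed for that argument position.

    % head_unif_prob(+HeadArgs, +CallingPattern, +Cardinalities, -P)
    head_unif_prob([], [], [], 1.0).
    head_unif_prob([H | Hs], [Mode | Ms], [Card | Cs], P) :-
        head_unif_prob(Hs, Ms, Cs, P0),
        (   ( var(H) ; Mode == f )
        ->  P = P0                 % a variable on either side always unifies
        ;   P is P0 / Card         % two ground values match with prob. 1/Card
        ).

For instance, ?- head_unif_prob([a, f(b), X], [g, g, g], [10, 5, 3], P). gives P = 1/10 × 1/5 × 1 = 0.02, the hypothetical cardinalities 10, 5 and 3 being chosen only for this example.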
For example, consider the predicate u/3. We may decide to estimate the universes of values as {3, 4, 5, 6} for the first argument, {t, v, w, x} for the second argument, and {g, i} for the third argument. Suppose that our abstraction for these attribute values consists of the number of distinct values for each attribute.
If we assume independence amongst attributes, the total probability that a head can be unified with any calling pattern is given by the product of the individual probabilities associated with each argument (one in the case of variables, a fraction over the number of distinct values for constant arguments). Thus, in our example, the probability that the first clause succeeds would be given by the product of the corresponding fractions for its constant arguments. Note that this probability is the same for any call in which all three arguments are constants (i.e., for all of the initial four rules of predicate u/3).
Similarly, the probability that the fifth and sixth clauses succeed is estimated from the probability that the call unifies with their constant argument. Finally, the probability that the clause u(M, P, N) succeeds would be one. Note that, although our universe sets are arbitrary and underestimates of the true argument domains may occur, there is still a notion of which clauses are more likely to succeed.
3.7.2
Cost Function from the Perspective of Body Evaluations
Once we are able to assign a probabilistic value to each clause head given a calling pattern, we can estimate the cost of evaluating the corresponding bodies. First, we derive a formula that permits estimation of the cost of evaluating a single subgoal s_i. Since predicate Pred_r, the predicate invoked by the subgoal, contains q different clauses, the following formula can be used to determine an average cost associated with the whole predicate:

cost(Pred_r | cpat) = Σ_q cost(Pred_r, Clause_q | cpat)

where
cost(Pred_r, Clause_q | cpat) is the cost that results from the evaluation of the q-th clause of the r-th predicate given an initial calling pattern cpat, as analyzed in the previous section.
Now, given a conjunction of subgoals:

s_1, s_2, ..., s_q,

the cost of the compound sequence of subgoals is decomposed into the individual costs due to each abstract path, as follows (Eq. 3.7):

cost(sequence of subgoals) = Σ_{k=1..npaths} cost(abstract path_k)
where
npaths is the number of distinct abstract paths that a given clause of a predicate can yield when a calling pattern is initially used†; and
cost(path_k) is the cost that results from the evaluation of the complete k-th abstract path. This function can be expressed as follows (Eq. 3.8):

cost(path_k) = Σ_{i=1..n} ( Π_{j=1..i-1} nsol_j ) × cost(s_i | cpat_i)

where
nsol_k is the average number of solutions that subgoal s_k produces when invoked with a calling pattern cpat_k, as recorded in the domain element metrics.
†In pure Datalog, only one abstract path is actually derived.
Example. Consider an extended version of the books database introduced in Section 2.1.4.

book(Title, Publisher-Name, Subject, Author-Name). A collection of book titles along with their publishers, the subjects of the publications and their authors.
publisher(Publisher-Name, City). A list of the different cities where book publishers have an authorized distributor.
author(Author-Name, Nationality). A group of facts that relate authors to their respective nationalities.
skilled(Author-Name, Subject). A list of the two most prominent authors on every possible subject.
forte(Publisher-Name, Subject). A list of the two top publishing companies for every given subject.

Suppose that we wish to retrieve an exhaustive list of tuples of the general form <Title, Publisher-Name, City, Author-Name> for those "worthwhile" publications whose author has a certain nationality.
Database profile:
(a) Extensional DB predicates. We assume that the extensional database predicates follow a strict uniform distribution of attribute values. The corresponding database profile is given in Table 3.10.
Table 3.10 The extended books database. (For each predicate the table gives the number of tuples and the number of distinct values in each argument position.)
(b) Intensional DB predicate. Suppose that our (only) intensional database predicate is defined as follows†:

% worthwhile/3: worthwhile(Publisher, Author, Subject). Tells us if a book
% is worth buying.
worthwhile(publisher_1, _, _).
worthwhile(publisher_5, _, _).
worthwhile(publisher_10, _, _).
worthwhile(_, author_23, _).
worthwhile(_, author_7, _).
worthwhile(_, author_13, _).
worthwhile(Publisher, _, Subject) :- forte(Publisher, Subject).
worthwhile(_, Author, Subject) :- skilled(Author, Subject).

(c) Query. We consider a query whose subgoals are evaluated in the order book, worthwhile, publisher, author, the author's nationality being the only ground argument.
Table 3.11 and Table 3.12 show the results of applying our framework to this example.
<clause, cpat>           n_chp    n_vu       n_sol    n_chp×T_chp + n_vu×T_vu + n_sol×T_back
<book/4, [f,f,f,f]>      3,000    4×3,000    3,000    276.0

Table 3.11 Predictions for all predicates (only the book/4 row is reproduced here)
†Strictly speaking, these predicate definitions are not safe. All anonymous variables should be explicitly constrained by direct references to the book, publisher and author predicates. For instance, the first definition should be written as:
worthwhile(publisher_1, A, S) :- book(_, _, S, A).
However, we omit these additional predicates to keep the example simple.
Table 3.12 Predictions for the intensional database predicate (one row per clause of worthwhile/3 for the calling pattern [g,g,g], giving the corresponding cost metrics and probabilities of successful head unification)
Experimental result:
average cost for all possible queries†: 30,834.6
Theoretical result:
cost = cost(book | [f,f,f,f])
       + n_sol(book | [f,f,f,f]) × ( cost(worthwhile | [g,g,g])
       + n_sol(worthwhile | [g,g,g]) × ( cost(publisher | [g,f])
       + n_sol(publisher | [g,f]) × cost(author | [g,g]) ) )

cost = 276.0 + 3000.0 × (0.232 + 0.261 × (28.120 + 380.0 × 0.022)) = 29,535.6

theoretical cost: 29,535.6 (an error of approximately 3.6%).
†The experiment was repeated for all different author nationalities (the only ground argument in the query), and the average value is reported here.
Empirical constants used:
Appendix II shows another example of our methodology applied to a different Prolog system.
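For what it is worth, the nested expression above has the same shape as the conjunction formula of Section 3.4, so the conjunction_cost/2 sketch given there reproduces it when fed the rounded figures quoted above (the small difference from 29,535.6 comes only from that rounding):

    ?- conjunction_cost([276.0-3000.0, 0.232-0.261, 28.120-380.0,
                         0.022-0.0333], Cost).
    % Cost is approximately 29535.8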
3.8
Overview of the Model
In this chapter we first derived a crude method to estimate the cost of evaluating GraphLog queries when applied to extensional predicates. A top-down model of computation is assumed. The method requires an empirical estimation of various constants associated with the evaluation time of primitive operations. A profile of the underlying database is also required. Using this database profile, formulae to estimate the expected number of primitive operations that will occur during query evaluation must be derived. We have considered the case where database values are distributed uniformly and independently. Although real databases seldom conform to a uniform distribution model, we may still be able to predict which evaluation orders give the best execution times.†
Additionally. we have managrd to rxplain why some heuristic techniques for query
reordering that are based upon a "bound-is-easier" heuristic procedure work
[Sheridan9 1 1. These non-detrrministic algori thms usually srlrc t subgoals containing
ground arguments to be placed before other subgoals. We have obssrved that the occurrence of a ground argument reduces the number of primitive operations that take place
with respect to the case where that argument is not bound. For cxample, fewer tuples
have to be visited, fewer variable unifications take place (the constant unifications that
take place instead are, by far. less expensive operations), and. since fewer solutions are
expected to occur. fewer state restorations will occur. Thus. a fact retrieval with ground
arguments will be less expensive to evaluate than its non-ground counterpart.
+The closer the distribution resembles a unifom distribution the better the results wiIl be.
We then have proposed a general framework to sstimate the performance of
GraphLog (Prolog) queries based on abstract interpretation techniques and mode analysis. Again. a top-down execution is assurned. The method is applicable to the all-solutions case. The basic idea is to associate probabilities with the process of head unification. while considering the expected costs of the subgoals in the body of the clauses. Cost
metrics and the average number of solutions are propagated throughout the bodies of the
clauses in the usual manner.
Typically. the expressions for the number of solutions and primitive operations for
a given tuple ~ t i b g o o lcpal>
.
will normally be rxpressed in terrns of the numbrr of solutions of othçr queries represented by the tuples (subgoalA7cpatk>. This impliss that an
order of rvaluation of the analysis rquations should be found. If we restnct the queriss
to be non-circzilar. in the sense that they do not contain recursive calls (direct or indirect), it is always possible to tind an order of evaluation which is paranteed to terminate.
Chapter 4.
A qualitative model
So far. we have studied how to obtain a cost model for a specific abstract machine. Unfominately. if another abstract machine is used. we rnay not be able to apply Our specific
model. Furthemore, even if the same abstract machine is being used. but substantial additions or optimizations have been incorporated. the model may produce poor results.
Wc now procrcd to analyze how to derive a more general model in which underlying implementations are lrss relevant.
4.1
Fundamental Database Operations Revisited
We have already mentioned that any DatalogiGraphLog que- can be expressrd in t e m s
of fundamental database opentions (Le.. selections. joins and projections). and therefore
the methodology for estimating the value of the cost contributors may be focused on
these fundamental operations. Although. strictly speaking. this operational modrl usually assumes a bottom-up cornputation. we may also borrow some of the concepts and
apply them to a top-down model.
As an example. consider a database of articles sold at a given store. Simplified relations would include base relations for ( 1) products, Say. article(Artic1e-Name. Price. Department.
Distributor-Name). (?)
interna1 bookkeeping.
Say. taxation(Departrnent,
Applicable-Tax), (3) personnel grouped by department. Say, personnel(Department. Name),
and (4) distributor information, Say, distributor(DistributorIName, Distributor-Data). A typical query to retrieve information to calculate the cost of an item (ziven a specific distributor) afier taxes would have the f o m :
:-article(peaches, Price, Dept, starninaAc), taxation(Dept, Tax).
If we know how many distributors are registered for the article peaches.t the calculation of the cost contributon is quite straightforward. In this case. the cost of executing
the second subgoal is independent of the specific value of the department ( ~ e p tthat
) is
retrieved by the first subgoal (assuming that every department has at most one tax rate
in place). If the exact number of distributors that se11 peaches to the store is not known.
but an average of distributon prr article or similar information is available. this average
value can be used to estimate an "expected" number whose accuracy will depend on h o a
-average.*the article is.
Now. suppose that we pose a query to retneve the names of the clerks that belong
to the store department that sells peaches:
:-article(peaches.P. Dept. D). personnel(Dept.Clerk).
In this case. the value of the cost contributors associated with the second subgoal
will be influenced bby the actual department (Dept) that is retrieved by the fint subgoal.
assuming that different departments have different number of employees. If wr do not
know the depanment that will be the output of the fint subgoal. accunte knowledge of
the distribution hnction for the personnel relation will not be of much hrlp (sincc that
department can be an? of the valid departments in the store).
This is a vrry cornmon situation. and a compromise is needed (unless we want to
execute the code to detemiine the exact value of the department! ). One possible solution
would be to ' ~ e i g h t "al1 different departments by using a measure related to rheir probability of appsaring in the query (some departments are more likely to be invoked) and
take their weighted anthmetic mean as the value for the "average" department. If al1 departments have the same probability of being selected, a simple average may be used.
The utilization of central tendency values seems to be more appropriate than the use of
extreme (skewed) values. at least in the long mn. It is obvious that the absence of an exact value for the department anribute inevitably produces a loss in the accuracy of the
sstimate.
fWe also know that any article has only one pricc and forms part of one department exclusively.
Selection
The selection operation is the easiest of the three basic relational algebra operations to
deal with. Several researchers [Selinger79. Christodoulakis83. Fedorowicz83] have proposed diverse fomulae for different distribution functions. A srraightforward application of these formulae applied to the information indicated in the profile is al1 we need
to detennine the expected cardinality (i-e.. the average number of output tuples) of the
result. and the estimated values of other cost contributors. such as the expected number
of visited tuples or the expected nurnber of t e m unifications. can also be derived from
the forrnulae.
For instance. consider a database base relation r(integer. String) whose first argument
is known to follow an integer normal distribution with mean p and standard deviation G.
and that we also know that the number of tuples is. say, :V. If a qurry of the forni r(X. Y )
is used. where X is bound to a constant integer value while variable Y is a fiee variable.
we may estimate the number of tuples that are expected to be retrieved by using our
knowledge of how a normal distribution behaves. and by selecting appropriate ranges for
our analysis (sincr w r must "discretize" Our representation to accommodate integer values exclusively). However. we must be aware that the database profile is ofien just a sirnpie approximation of the real problem. and a real-life database will normally differ from
the -'ideal" case. Figure 4.1 shows ri typical example of an attribute that follows a discrete
version of a normal distribution. Note that its general shape is the one wr rxpçct for a
normal distribution, but individual values have some deviations from the ideal representation.
An interesting problem occurs when the same variable is attached to two or more
argument positions within a predicate. For instance. a subgoal such as a(X.X) establishes
an additional restriction: that both arguments have the same value. Even for simpler distribution functions. this seemingly harmless restriction poses a difficult challenge that
would require some additional information (for instance. the correlation amongst attributes) to be solved properly.
Figure 4.1 Frequency diagram of an attribute that may be approximated by a discrete
normal distribution
-4s mentioned before. if we do not know the exact constant value imrolved in the
seiection. wr may not br able to use the database profile. since this is otien given as a
fünction of the input value. For instance. in the example of Figure 1 . 1 . if the constant value of the argument is known to have a value. Say. X = 38 1. wc may expcct a cardinality
of approximately 20 tuples (from the normal distribution). But if the value of X is unknown (prrhaps because Our abstract interpretation analysis did not keep track of constant values). al1 we can do is çither propose a ronge of values (i-r..from O to 22. for the
ideal curve) or calculate an average value (and we must establish a finite range of :'Y'
values to do so). For exarnple, if we decide to estimate the cardinality of the selection a s
a simple average. for the attnbute depicted in Figure 4.1 we rnay choose a range of "X'
values from
-
-3
value of .V 8.0
x
o to p -+ 3
x
o. in which case we would have an approximate average
. If we choose a range of values that varies from
p-2
x
a to p + 2
x
o.
the average value will be approximately Y = 1 i .j . If we consider a range from p - 0 to
p+o.
we will have Y = 16.7 .
Join
Generally speaking. the join operation can be viewed as a Cartesian product of the two
relations involved. It is used to combine tuples from two or more relations [Mishra91].
In the case of a Datalop/GraphLo_equery. a join of the form
... si (A1.....AN), s2 (61.....BM). ...
c m be analyzed (assuming independence of subgoals and a top-down evaluation strate-
-w )as two separate selections (for s i and s2. respectively) and thtn. realizing that the
d
second subgoal will be invoked as many times as solutions the first subgoal provides. a
panicular cosr conuibutor may be calculated as (Eq. 4.1 ):
cost conuibutor ( s 1 join sZ ) = cost conmbutor ( s 1 ) + solutions ( s I
) x
cost conuibutor ( s 2 )
Naturally. a simple ("mode") analysis must give information as to which arguments
will hold constant values in s i and s2. For instance. in the sequsncr
predicate p will be invoked with two variable arguments. predicates q and r will be callrd
with a first argument constant and a second argument variable and prrdicate s will have
both arguments ground. Note that the simplicity of this analysis is due to the fact that
Datalog-like languages guarantee that al1 arguments are bound to some constant value
afier any predicatr call. Note that this analysis dors not keep track of the actual constant
values: only the fact that the argument is constant is established. Using similar notation
to the one used in the previous chapter. Our formula is as follows (Eq. 1.2):
ç o s t ( p ( . - l . B ) . y ( B . O . r ( C . D ) . s ( . . l . D )=) cost(p(.-l.Bl!
+
where the calling patterns are abbreviated as "c" for constant values and "j" for free (or
unbound) variables.
Note that. the formula for a reordering of the subgoals should nomally have a
slightly different aspect. since the "calling" patterns will usually Vary. Consider the following reordenng of subgoals:
In this case. our formula becomes (Eq. 4.3):
Projection
The most problematic of the basic relational algebra operations is the projection operation. The main challenge has to do with handling duplication of output tuples aficr the
projection [Kwast94]. Again. statistical considentions regarding the distnbution of attribute values may be used to tackle the problem [GelenbeSZ. AstrahanSj].
To illustrate this idea. consider again the example in Figure 4.1 . Note that there are
1000 different tuples of the f o m r(integer. String). However. if a projection is performed
over the fint argument (thus. rliminating the second argument). it becomrs clear that w r
will obtain from O to 25 duplicates for each "..Y" value. Standard Datalog does not discriminate amonot duplicates. and only one value is reported. For this reason. if the first
argument is known to be constant. we can establish that at most one valid answer will be
derived. If the "X' value lies within a particular region of the distribution curve. we may
assign a probability of that value producing such a valid answer. For instance. in Our normal distnbution for the example in Figure 4.1, if the 'Y' value is contained within the
region from
p -I
x O
to p + I
x
a. we may estirnate that the probability that the projection
of that first argument will have cardinality 1 is approximately 95.44% (normal distribution). Note that. if we perform the projection over a first argument that is a free variable,
the cardinality of the result will be given by the nurnber of distinct attribute values for
the first argument in the relation.
A more complicated scenano takes place when the projection involves more than
one relation. A cornrnon example occurs when two subgoals in the same qurry share a
common variable. such as in:
In this case. the projection will be a new relation. say sld. having the union of al1
arguments from s l and s l (Le.. the join of both relations). but with only one instantiation
of the cornmon arguments (the projection proper):
As an example. consider the two base relations in Figure 1.2.:
L
% predicate s l
%predicate s2
SI (ba,ba,ba).
sl(ba,ba,aa).
sl (aa,ba, aa) .
SI (aa,aa,bal .
sl (aa,aa, aa) .
s 2 (ba,ba, b a l .
s 2 (ba,aa,ba)
s2 (ba,aa,aa)
.
.
s 2 (aa,ba,ba).
s 2 (aa,aa,aa).
7
(a)
(b)
Figure 4.2 Two ternary predicates sl and s2
Suppose that we have to estimate the cardinality of query ~ ( C . D )where:
.
The join would yield the intermediate relation sa shown in Figure 4.3.
Then, a selection is performed such that the first argument of predicate si is q u a 1
to the first argument of predicate s l . The resultinz relation sh 1s shown in Figure 1.4.Finally, we must project arguments 3 and 5 to obtain the final result. as shown in Figure
4.5. Note that fiom the 12 tuples that are obtained for relation q, only 4 will be in the final
answer (the remainder are discarded as duplicates). Fortunately, in this case we already
know the upper bound for the cardinality of relation q, which is the product of the sizes
of the domains of arguments 3 and 5. i.e. 2 x 2 = 4. Thus, if the cardinality of the selection
% predicate s1 j o i n s 2
,ba,ba,ba,ba,ba).
,ba,ba,ba,aa,ba).
,balba, ba ,aa, aa) .
,ba,ba,aa,ba,ba).
,ba,ba,aa,aa,aa).
,ba, aa, ba ,b a lba) .
Figure 4.3 Join of predicates SI and s2
2lec tion a £ ter join
Figure 4.4 Selection after the join of prec licates sl and s2
afier join has a value that exceeds this upper bound. we must automatically reducr the
estimate to have a value that docs not exceed the upper bound. The lower bound is almost
always a cardinality of zero.
The whole picture
As has been mrntioned before. it is not unusual to obtain formulae for combinrd relational algebra operations that occur relatively fiequently (for instance. a selection afier
projection). The main advantage of this idea is that some sources of inaccuracy are elim-
Figure 4.5 Final projection of arguments 3 and 5
inated (mainly. the fact that we simply do not know the shape of the distribution for intermediate results). not to mention that the estimation procrss requires Iess effort.
In traditional databasc que.
modelling. given a sequencc of subgoals. we usually
decompose it into primitive relational algebra operations. apply fomulae derived for the
speci fic charactenstics of each partici patine relation to each of such componrnts. and
successiveiy continue applying the formular to the intermediate relations that result until
the entirr sequence is analyzed. Note that this approach is only valid when dealing with
non-recursiiu qurries.
In Our model. we use an analogous approach. The estimation of the cost of an extensional database predicate call may simply apply fomulae alrcady derived in standard
database rrsearch. We specifically definr a simple formula to estimate the cost of a conjunction of subgoals (Eq. 4.1 ) that is applicable to the top-dom model of execution.
As a final note. practically al1 methods assume that the constant values indicated in
the quex-y are valid ones. i-r.. values defined in the domain of the respective attribute.
This validation can be done for base relations without any major complication (for instance, a simple check for quenes to predicate personnel(Department, Name) in which the
first argument is a constant may determine whether this is a valid Department), but is not
an easy task for intermediate (virtual) relations (unless we keep track of the entire inter-
mediate results instead of a simpler abstraction). For intermediate relations. the validity
of constant values is usually autornatically assumrd for ease of analysis.
4.2
Recapitulation. Cost Estimation and Query Reordering
In this section we repeat some of the ideas that have bern mentioned before in ordcr to
produce a clear picture of al1 the important issues that must bc addrrssed.
Given a query of the form
whose cost we wish to estimate. we propose to decompose it into simpler components
that are assumed to be independent from each other. The simplrst choice consists of detining a subgoal as the primitive entity to be analyzed. A subroal is then treated as a
"black box": givrn some inputs (degree of instantiation of the arguments. numbrr of
timss that the subgoal is expected to be invoked. etc.). the expected values of the cost
contributors may be estimated (as the outputs of the black box) and used by successive
blocks as thrir respective inputs (Sre Figure 1 A). The subgoal itself has to provide some
information about interna1 c haractenstics such as distribution of attribute values or correlation amongst arguments. The total cost of the query is obtained as the sum of the individual costs of the subgoals. Standard abstract intctrpretation techniques may be used
to determine the degee of instantiation of the arguments and propagate the intermediate
results through al1 successive query components.
When a subquery is known to have at leasr one constant argument (whose exact val-
ue rnay bc unknown at the analysis time). we are forced to choose a way to account for
di fferent possible scenarios that result from the selection (since di fferent constant values
will produce different values for the cost contributors). A simple compromise is to consider "average queries" that represent either the most typical query that is cxpected to occur or an amalgamation of al1 distinct possibilities in which a (weighted) average is calculated.
There are two general groups of subgoals that are treated separately: simple fact retnevais (i.e., extensional database predicates) and general predicate calls (i.e., intension-
al database predicates). The estimation of the cost o f a simple fact retrieval can be reduced to a statistical problern since we know (or rnay detemine) the distribution followed by the arguments. General predicate calls are more complex. Specifically. we
have to deal with the following issues. amongst others: (a) head unification. ( b ) clause
indexing. (c) independence of subgoals and (d) the fact that the distribution of internediate results rnay be difficult to predict. Head unification and clause indexing rnay be taken into account by assigning to each rule in the predicate a probability of success given
the d e g r e of instantiation of the arguments involved. Each rule is then weighted based
on this factor.
The problem that two or more rules rnay provide comrnon solutions is not a trivial
one. Given two rules rl and Q. that provide set of answers .A1 and -4 ,. respectively. we
wish to find a new set -4 12 that is the (set-)union of sets -4 and -4,. Unfortunately. our
analysis cannot provide enough information to solvr this problem. since we do not know
the nature of answers -4 1 and d 7 . A mutual exclusion analysis may help. in the sense that
if wr determine that .dl and .A 7 have no answers in common then we know that the cardinality of . A I 7 is the sum of the cardinalities of d l and A I . But the genenl problem of
duplication resulting frorn independent rules is complex to solve.
We also have to handle recursive queries which. in the case of Datalog-like query
languages. occur in the f o m of a predicate closure. In Our schemr. a recursive query is
also treated as a black box, although the estimation of cost contributors (outputs) has to
be solved quite differently. The values of many of the cost contributors are totally meth-
od-dependent. Apparently. we rnay obtain good estimates of the number of tuples that
result from the closure (which is a crucial value required by successive black boxes).
1.3
Our Proposed Framework
In this section we delineate how to determine the expected values of the cost contributors
for a given subgoal. As a first step. the set of relevant cost contriburors thai we are going
to work with must be selected. Unfominately. unless we have a history of performance
of the form of query under analysis. it is not straightforward
tributors are important and which ones rnay be disregarded.*
COdecide
which cost con-
Once the relevant cost contributors have been selected, we have to calculate their
average values for each subgoal (represented as a black-box). We will estimate the rx-
pected average value for each cost contributor given some information. such as the actual
calling pattern or a database profile. The estimation of such average values will normally
require the application of formulae denved for the different basic operations of relational
algebra previously-mentioned (selcction, join. projection). W s rnay use formulae described in the literature (if we happen to be working with a specific database distribution
that has been previously studied). or simply consider a simple distribution (a uniform
distribution of independent attribute values is the usual choice).
to a crrAs has been mentioned before, the rxpected average nurnher ofsoli<~ions
tain subgoal has to be estimated. since it wil1 be used whenever a join operation occurs.
Once the expected average values of al1 relevant contributors have been estimated
for each separate subgoal (by using formulae for the sdecrion operation given a certain
calling pattern). the join operation is considered. We observe that when the al 1-solutions
case is considered (as is the case in a standard GraphLog query). any subgoal will be artemptrd as many times as solutions the subgoal to the left has providçdt (See Figure 1.6).
Thus. the values of the cost contributors are scaled by a factor given by the number of
solutions of the previous subgoal.
In other words. the value of a cost contributor of a subgoal is estimated as (Eq. 4.4):
value ( cos[-conun, ) = num-solnl
,
x
average-value ( cost-contrnr)
The calculation of the nurnber of solutions to the whole que- uses the value of the
number of answers to the last subgoal (scaled by the values of the number of answers to
al1 previous subgoals) as an upper bound.:
nm-solquq
= num_sol, x nu-sol,
- x ... x num_solm
iAccurnulritivc profriing is often the best aid to this end
+For the case of the lefi-most subgoai. a factor of 1 must be considered
...( Eq. 4.5)
a34
repeat
v
repeat
ns2 times
cost
contributor
cos1
contributor
values
values
ns1
answers
ns2
answers
Figure 4.6 Cost contributors are estimated for each subgoal
This value does not consider duplicates. and therefore. we rnay obtain an overestimate if we use this value directly. To avoid this. our framework wouid require a way to
take into consideration the removal of dupiicate tuples fiom the final solution.
The calculation of the average value of a cost contributor for the whole que- is accomplished by adding al1 individual values of the cost contributors for the differrnt subgoals in the query under consideration (Eq. 4.6):
value ( cost~connnqucn
) = value ( costcontrl ) + value ( cost-contr, J + ... + value ( cost-conun, )
-
Finally. once we know the values of al1 cost connibutors for the whole que-.
wr
are in a position to determine the total cost of the query. If we know the rxpected average
cost of each contributor per se (as a primitive operation). the problsm is reduced to what
wr have already discussed in Chapter 3. However. if the empirical values of these primitive operations are unknown. wr are forced to attach some weights to each of them. or
give pnority to some of them. The simplest straiegy is to select one single cost contnbutor and base Our rankings on this sole parameter. Othenvise. we face the problem of assigning specific wrights to the cost contributors.
:Unfomuiately. errors in the estimation of the number of answers to the whole query may increase exponentially with the number of components [Ioannidis95].
Chapter 5. Handling Recursive Queries
So far. we have characterized the cost of a (non-recursive) predicate by means of simple
cost measures. such as the number of visitrd tuples. the number of successful unification
operations or the number of solutions that are obtained. The only compiication we havr
encountered bas to do with having to consider different variants of clause indexing. depending on the actual implrmentation of the qurry evaluator.
Extending those results <O recursive predicates poses a real challenge. Undecidability of geneni recursion is well-known. and so is the potential occurrence of infinite computations. It does not corne as a surprise that most resrarchers have concrntrated on very
specific cases of recunion. For instance. Debray and Lin [Debny93] havr devrloped a
method for cost analysis of PmIog programs based on knowledge about "sizr" rrlationships betwren arguments of predicates. which is only applicable to recursivs definitions
in which an argument decreases in size at each new recursive invocation.
5.1
Erecution Cost of a Recursive Q u e l
Besides recursion with decreasing sizr functions over new recursive steps. there are other cases of recunion that may be handled by our cost model. The most important of these
is linear recursion over a database domain. In fact. one of the greatrst advantages of query langages derived fiom Datalog is that every (database) query produces a finite number of answrrs. and infinite loops are therefore avoided by choosing an appropriate evaluation method.
One immediate consequence of the selection of a specific evaluation method is that
the actual cost of evaluating a recursive predicate will depend on the chosen method.
There are many different evaluation methods that deal with recursive queries [Ceri90].
In general. cost measures such as the nurnber of visited tuples or the number of unification attempts are algorithrn-dependent, and there are additional factors that add to the
evaluation costs (for instance. bookkeeping of structures or validation of certain conditions).
However, there is one cost measure that is totally independent of the evaluation
method: the nwnber of solutions to the query. Furthemore. the number of solutions is a
quantity that is propagated to other subgoals in the query. since it affects the number of
times that the successive subgoals will be invoked.
For these reasons. it is relevant to devise a method to estimate the number of solutions that is associated with a recursive query.
5.2
Formulation of a Recursive Q u e l in Terms of Transitive Closure
Jagadish and Agrawal [Jagadish87] have shown that rvery linrarly recunive query can
be rxpressed as a transitive closure possibly preceded and followed by the usual openton of standard relational algebra (joins. projections. selections. etc.). A recursive rule
is Iinear if there is exactly one occurrence of the recursive literal in the body. BanciIhon
and Ramaknshnan have conjectured that most recursive queries are linear
[Bancilhon86]. The significance of this result is that it is potentially feasible to predict
the number of solutions of every linearly recursive query if we denve a general method
that is able to detennine the number of sdutions of the transitive closure case.
Thus. we suggest the following methodolog to atirnatr the number of solutions of
a recursive predicate:
1. Transfoml the linearly recunive predicats into its equivalent form that involvrs
transitive closure;
2. Estimate the cost of the transformed predicate in terms of its constituents (i.e..
normal non-recunive predicates and the transitive closure itsel f).
Thus. it becomes clear that we need to devise a rnethod to estimate the cardinality of a
transitive closure.
5.3
Predicting the Average Number of Solutions of a Transitive Closure
One of the most cornmon uses of recursion in GraphLog is simple transitive closure. exemplified by the following two rules:
where tc defines the result of the transitive closure and b is the relation (or base predicatr)
over which the closure is performed. Note that only two (sets of) arguments are involved
in the closure relation.
Our goal is to find the cardinality of tc (i-e.. the number of tuples n, that are obtained as a result of applying the transitive closure operator) given some information
about predicatr h. It is evident that the nature of predicate h has a substantial impact on
the cardinality of its transitive closure: if we represent a predicate by its rquivalent gaph
in which a fact is represented by a directed edge between the two values (nodes) of the
closure arguments. the transitive closure of a tree-like structure will produce fewer niples
than. for instance. that of a heavily comected structure with the same number of facts.
Bv the same token. a predicate with a highsr number of facts will nonnally producr more
tuplrs afisr the application of the transitive closure operator than a similarly-structured
predicate with fewer facts.
The simplrst possible study of transitive closure is one that only considen the cardinality of h (that is. the number of tuples nh that are associated with predicate h). disregarding any interna1 relationships between the arguments.
Suppose that we are interested in deterrnining the number of tuplrs n, that result
from applying transitive closure to predicate h. If we assume that the number of unique
mples nh associated with predicate b is known, and so is the number of distinct attribute
values for the relation. n,,. some upper- and lower bounds rnay be established. By propenies of transitive closure. we know that n, < n,'. < ( nb )
nb
.- .( ~ q5 .-1) .
Furthemore. for
ndr2- ni! + 2 ...(Eq. 5 . 2 ) - the limit value ( nd,) is obtained. Unfominatrly. these fron-
tier values are not that helpful for large values of n,,.
For even the simplest possible case. that of a uniform distribution of independent
attributes. denving exact fomulae proves to be a hard task. For sxample. wr derived the
following formula for the average expected value in the trivial case when n, = 2 :
The complrxity of the exact formulae increases as the value of nh docs. and cach
formula has to be obtained separately. which produces an impractical situation.
Estimating the Average Cardinaüty of Transitive Closure
5.1
As mentioned before. the cardinality of a transitive closure may Vary fiom a value in
which no tuples are added as a result of the closure to a maximum value given by the
square of the original number of tuples (in a gmph representation. this would correspond
to a bbcomplete"p p h for the involved "input" nodes). Since this range of values may
produce a vast intemal. a compromise is to work with central tendency mcasurements.
such as the anthmetic rnean.
For this purpose. we have generated randomly distributed tuples for our base predicate h and obtained results for srveral values of n,, (the number of distinct attnbute values for the relation) and nb (the number of unique tuples for predicatr b)'. After sevrral
experimrnts. it appears that the average number of tuples of the transitive closure can be
charactenzed by means of three different regions: (a) a simple ( linear) beha~iourfor
small values of nh: ( b ) a non-linear region for intermediate values of ilh; and (c) an exponrntial region for higher values of nh (Figure 5.1 and Figure 5-21. We proceed to characterize these thres regions.
5.4.1
Region of Small Values for the Number of Tuples in the Base Predicate
For small values of nb, a surpnsingly simple linear formula was empincally derived. If
we express nb in terms of n,, in the forrn
n, = n,,
the corresponding transitive closure can
, the average number of tuples of
be expressed (approximately) as:
+In our espeiiments. nb tuples were randomly selected fkom the (na,)' possible ruples that c m be
fomed with n,, distinct artribute values at each argument position.
input density nb
Figure 5-1 Region for srnaII values
-
n
-
=n
. -
)
q
5 . 4 . For
-
instance. if
.-=
i 5.
-
n,.
= nJI 4
.
or if
--i= 4.
= " J I 3 . or if 4 = I . nIL z nd, (Table 5.1 ). and so on. The formula seems to work
wtll for .-I2 i .?I . although accuracy starts to debmde sharply in the nri_rhbourhood of
*,L
this value. Furthermore. the predicted value is more precise for higher values of n,,.
Table 5.1 The linear region
average output densit! n,
P
large values
tn,,
211u1
input ciensit! nh
Figure 5.2 Region for large values
5.1.2
Region of Intermediate Values for the Number of Tuples in the Base Predicate
Our linear formuia begins to fail when the constant .4 Stans getting closer to one (Table
5.2). In fact. the standard deviation of the recorded values also becomes bigger. We have
bsen unable to derive a simple gencral formula for this range. Fonunately. this "internediate" region is represented by a relatively narrow interval of values that range from approximately
nb 2 O.Xnd,
to nh 5 i .hd,
. As a practical solution. we have obtainrd sorne
approximats formulae for different values of n ,
n, = 0.9nyr
n, - nu,
. we have applied the approximation
.the formula
<=
nJt
I .I
2
in this region. For instance. for
-
nt'. = J x nJt
(
3
x ( -4
-I))
. and
for
gives a good approximation of the cardinality of
the transitive closure. Note that. since this region is relatively small. the denvation of approximate formulae for different values o f
n,
represents a feasible strategy.'
tNaturally. our proposed formulae represent just simple approximations. We decided not to
spend too much time in deriving more exact formulae. since these approximations satis- Our
needs rather adequately.
I
I
I
1 lOO(8801
1 . 3
1
I
I
4119.821
UOO
Table 5.2 The intermediate region
5.43
Region of Large Values for the Number of Tuples in the Base Predicate
An important observation is that for values
sponding maximum value
( n,)
R,
.
2 1 . 2 ~ the
~ ~
prrcrnrage
of the corre-
= that is obtained afier the closure can be considered al-
most a constant (as seen in Table 5.3). The values of some percentages are depictrd in
Table 5.3.
Furthermore. we observe that some of these percentages may be represented by
very simple fractions. For instance. for
R,
= 1.4nd,
n, = 1.6nd,
. the associated fraction is
. the fraction is 1/25: for
n, = i.3na,
1/1;for n , =
n, = 1.75n,,
.
the fraction is 1%;
1-jn,, , the
for
fraction is 113: for
the fraction is l i 2 .
Table 5.3 Percentages of the maximum value for n,
nb=I .Z%,
nb=1.6%,
nb= 1 4 , nb= 1.4%, nb=1.5%,
0.09
0.17
0.25
nb=l .7%, nb=1.8%,
0-41
0.33
= 1.6 m
nb= 1.911,~~
0.59
0.53
0.47
Table 5.4 Percentages of the maximum value for some factors
This particular behaviour may be approximated by an analytical formula. In fact.
the values that are obtained strongly suggest that we may use an exponential formula to
mode1 this region. Thus. we have used the following simple general formula to characterize this subregioni:
7
n,
l'
'JI
\
1 1
)-
nIL= nd,-l 1 - enpl - - + p
'
7
'\
JEq.
5.5)
,,
For instance. the values that are obtained when using this formula when p =
0.9
are
shown in Table 5.5.
nb =
nb =
I
l
I
nb =
l
I
nb =
nb =
I.-t%, 1 . 5
!.6n,,
nb =
l
nb =
1
nh =
I.Sn.,
.
deri\.ed iformula) \dues
0.03Y2 0.0801
0.1179 0.22 12 0 . 3 0 3 0.3871 0.4727 0.555 1
sxperimental \dues
0.06
0.17
-
-
0.09
0.25
0.33
0.31
0.47
0
.
nh=
l
0.632 1
0.59
-
Table 5.5 Comparison between the formula and the experirnental results
Also. for the range n, 2 Zn", . we have used the following simple formula:
+Again this formula just represents a reasonable approximation of the vaIues under consideration. and "better" formulae may be proposed as well if more accuracy is needed.
"b
where the value a = 3nu, - 7
- ssems to give satisfactory results (Table 5.6).
5.5
Recursion Revisited
Once a formula that predicts the cardinality of transitive closure has been obtained. we
may use it to predict the cardinality of a recursive predicate.
As an example. let us consider a generalized version of the same generation exarnple proposed by Bancilhon and Ramaknshnan [Bancilhon86]:
whcre jlat. ~ r pand down are extensional database predicates. and p is the recursive (dcrived) predicate.
This predicate can be exprrssed in terms of transitive closure as follows:
where ~ c p h i \ n t indicates
c
the transitive closure of predicate updoitn:
updowntc(X.YU.XU.Y) :- u ~ ~ o w ~ ( x . Y u . x u . Y ) + .
We will analyze the GraphLog program for the generalized version of the sams generation problrm as s h o w in Figure 5.3.
The visual representation of this program is shown in Figure 5.4.
We wish to rstimate the number of ~ p k associated
s
with recursive predicate p2.
Since duplication of tuples must be considered (rhe second rule may produce tuples
that are already pan of the base predicate.jlor). the cardinality np2ofp.? will be:
where n ~ is ~the ,number of tuples o f predicateflat (which is known fiom the database
profile) and nupdo,,, is the number of tuples of the transitive closure of predicate tcpdown.
-- - - -
--
-
O/o~pdown(X.YU.XU.Y)
:-u p ( ~ . ~ l J ) . d o w n ( ~ u . ~ ) node( g9, n21, [v('X')] ).
node( 99, n22, [ ~ ( ' y )).]
node( 99. n23, [v('XU')] ) node( 99, n24. [ ~ ( 'U')]
y }.
node( 39, n25, [v('X').v('YU')] ).
node( 99. n26. [v('XU').v('Y')] 1.
edge( 99. n21, n23, up ).
edge( 99. n24, n22. down ).
disLedge( 99. n25. n26. updown ).
%extensional DB predicates
db-schema(up, 2).
db-scherna(down. 2).
db-schema( flat. 2).
:-flat(X.Y).
O/~~(X.Y)
node( 94. 179. [v('W] ).
node( 94. n 10. [v('Y)l ) edge( 94. n9. n 10. flat ).
dist-edge( 94. n9. n10. p).
%p(X,Y) :-up(X.XU),p(XU.YU),down(YU.Y)
node( 93. n5, [v('X')] ).
node( 93. n6. [v('XU')] ).
node( 93. n7. [v('YU1)]).
node( 93, n8, [vi'Y')] ).
edge( 93. n5. n6. up).
edge( g3, n6. n7. pl.
edge( 93, n7. n8. down).
3ist-edge( 93. n5. n8. p).
Youptc(X.Y) :-up(X,Y)+.
iode( g6. n 14. [v('X')]).
iode( g6. n 15. [v('Y')] ).
sdge( g6, n 14. n15, up:+: ).
jistedge( g6. n 14. n15. uptc).
O/oupdowntc(X,YU,XU,Y):-updown(X.YU.XU.Y)+.
node( g10. n27, [v('X1).v('YU')]).
node( g 10, n28, [v('XU').v('Y')] ).
edge( gl0, n27. n28. updown:+: ).
distedge( g10, n27. n28. updowntc ).
%p2(X.Y) :-flat(XU.YU), updowntc(X.YU.XU.Y)node( 912, n29, [v('X').v("fU')] 1.
node( g 12. n30. [v('XU'),v('Y')] ).
node( 912, n35, [v('X')] ).
node( 912. n36. [v('XU')] ).
node( 912. n37. [v('YU')] ).
node( g12. n38. [v('Y')] 1.
edge( g 12, n36. n37, flat).
edge( g 12, n29, n30. updowntc).
distedge( g12. n35. n38, pz).
%downtc(X.Y) :-down(X.Y)+.
iode( 98, n 19, [v('X')] ).
iode( 98. n20. [v('Y')] ).
idge( g8, n 19. n20. down:+: ).
jist-edge( 98. n19. nSO, downtc).
Figure 5.3 GraphLog program
updown(X,YU.XU.Y) :-up(X,XU).down(YU,Y).
The cardinality of predicate updow.n may be inferred by using the normal rnrthod
for rstimating the cost of non-recursive predicates. in this specific case, since the subgoals do not share any variable. we have that
nupdomn
-
nup
niiown
If we know the number N of distinct attribute values common to relations up and
down. we might be trmpted to use our formulae for cardinality of transitive closure.
-eiven
-9
n,, =
.v and
nb = nUp x n doWC This would be perfectly valid if there were no
restrictions whatsoever regarding how the tuples are distributed. i.e.. if we have a random
distribution. Unfortunately. it seems not to be the case in real-life databases. We
uptc
downtc
673
down'
( a ) original program
updowntc
(c) program in tems of transitive closure
Figure 5.4 GraphLog program for the recursive program
consistently obsewed that our formulae produced sorne overesrimates for higher values
of
nu,
. Ernpirical results
nupd,,,, < N2. (ix..
have also s h o w that our predictions are adequate when
in the region for "small values" of n,pdo,,n). Furthermore. we have
observed that our estimates may be improvrd for the other rwo regions: for the region of
"higher values" of nupdo,,, the cardinality of the transitive closure consistently seems to
be directly related to the product of the cardinalities of the individual transitive closures
of cip and doitx: for the region of intermediate values, the cardinality of the transitive
Table 5.6 The exponential region
89
closure seems to be related to the products n,
x
n u,,c
and nUpx n
,,,,, . Table 5.7 shows
some typical results for different values.
Table 5.7 Estimating the cardinality of a transitive closure
Let us study a very simple example and try to explain why the distribution of the
relation that the transitive closure is applied to is not uniform.
Esample. Consider a database with
.V = 10
distinct attnbute values and the follow-
h g randomly generated extensional database predicates:
up(1.2)*
up(1S).
up(3.2).
up(3.1O).
upiW.
up(6.1).
up(7.8).
up(7.lO).
up(9.9).
up(10.5).
A graphical representation of the two predicates is shown in Figure 5.5.
w
down
Figure 5.5 Graphical representation of base predicates up and down
In this example.
nap = 1 0 and nJ0,,., = 8 .
The transitive closure of predicate irp follows
the following behaviour:
Note that the tuples in the closure follow a recurrent pattern (al1 paths of lrngth 6
are also paths of length 3: al1 paths of length 7 are paihs of length 4 as well: and so on).
Similarly. predicate h r r - n has the following closure:
predicate "dowm"
1
7
3
3
5
6
7
8
paths of length 1
CI.!>
c1.D
<32>
<3.10>
c7.P
<S.6>
<9.5>
<10.10>
paths of length 2
<1 . I >
< 1.6>
< I .S>
- 3 . 1 O>
<7.2>
<7. i O>
< 10.1O>
pathsoflength3
<i.l>
<l.6>
c1.D
<3.10>
cl,10>
clO.lO>
pathsoflength4
cl.]>
c1.6>
<I.S>
clIo>
cl.10>
<10.10>
Predicate updoitn is the Cartesian product of both relations:
(9.9) (9.1.9.1 1
(9.1.9.3)
(9.3.9.2)
(9.3.9.10)
(9.7.9.3)
(9.3.9.6)
(10.5) (10.1.5.1)
(IOl.5.)
(10.3.5.2)
(l0.3.5.10)
(10.7.5.3)
(
10.i1.5.6)
(9.4.9.5)
~9.10.9.l0~
I O . .
(lO.lO.5.lO1
From this table. it should be evident that the distribution of attribute values of pairs
([X. YU]. [XU. Y]) is not randorn at all. Not only are many [XL. Y] pairs shared by some
[X. YU] pairs. but also the XU value is totally determined by the X value. For instance.
if X has a value of 1. the value of XU is either 2 or 5. and therefore. although thsre arc:
1000 possible candidates ( [ l . YU]. [XU. Y]). only 200 of them comply with the restnction ([l. YU], [Z. Y]) or the restriction ( [ l . YU]. [ 5 . Y]).
In other words. although we have derived some formulae to estimatr the cardinality
of the transitive closure of a uniform distribution of attributs values, they may not be accurate for -'realWpredicates. as different distribution hnctions rnay be encountered.
In summary. to estimate the cardinality of the transitive closure of a canesian product we propose to consider three different regions: (a) m a l 1 values of n,. the cardinality
of the canesian product. with respect to n,';
(b) intermediate values of n,; and ( c )higher
values of n,. Experimental results have indicated that we rnay use our formulae when n,
c n,'.
in the region that we have called "small values of n," with some accuracy. Once
more. it is the intermediate region which poses the major challenge: a good estimate of
the cardinality of the transitive closure can be obtained by using the products
and nu, x n,,_,,
n d 0 , x nuprc
,either the midpoint or some value in between. Finally. for the region of
"higher values of n,". the cardinality of the transitive closure seems to br directly related
to the (product of the) cardinalities of the individual transitive closures for up and do\%-n.
As a final note. it should be mentioned that o u study of transitive closure was restncted to the estimation of its cardinality. Once we are able to determine the expected
number of solutions that result from the transitive closure of a predicate (that is equivalent to the original recursive predicate). we may propagate this value to the othrr '-black
boxes" in our modelt. However. nothing has been said about the actual cost of executing
the recursion or closure. In fact. thsre is not much to be said, other than it will bs totalty
dependent on the actual implementation. The svaluation method that is used by the systern to handle recursive qurries will determine the cost of solving the recursion. Bancilhon and Ramakrishnan [Bancilhon86] have derived some formulae for several cornrnonly used evaluation methods for some elementary foms of recursion. Ln fact. the design
and evaluation of the performance of transitive closure algorithms has been an active
area of research [Agrawal90].
We must add that some systrms rnay decide to compute the transitive closure of a
predicate only once and store the results for fbture use (instead of computing the closure
any time that a user query requests it) [cheiney94].: Additionally, it is not uncommon to
use a pre-proccssor that transforrns the original form of closure to another that is equivalent and more efficient to execute [Lu93].
5.6 Algorithm to Estimate the Cost of a GraphLog Query
We now formulate our genenl algorithm to estimate the cost of a given GraphLog query.
Input: a query y of the t o m
and a calling pattern q a t q .
+Recall that the only inpur quantity that a "black box" requires is the number of tupIes retrieved
by the previous black box.
Explicit storage of the transitive closure results minirnizes access tirne and only requires a single
(usualIy expensive) computation o f the transitive closure.
Output: a cost estimate. cos$. and the expected number of solutions to the que-.
Geneml Algorithm:
algorithm estirnate-cost(q, cpab, cosb, num-solq) ;
I' assume that there are m subgoals in the query */
begin
perform-mode-analysis(q, cpatq, calling~attern(s,). ...,
callingpattern(s,)) ;
for i:=l t o m do
begin
e~timate~number~of~solutions(s~,
callinclpattern(si), num-soli);
estimate~relevant~co~t~metric(s~,
COS^^);
end;
num-solo:= 1 ; total-cost:=O: solutions:=l ;
for j:=l to m do
begin
tota-ost:= num_s0Ij., x COStj + tota-ost;
solutions:= solutions x num-sol,;
end;
costq:= total-cost; num-soiq:= solutions;
end;
procedure peiform-mode-analysis(q, cpatq, callgat(sl ) ,....callgat(s,));
begin
This analysis determines the calling pattern of each subgoal Si in the
query'
end;
procedure estimate-nu mber~of~solutions(s,
callinggattern(s), numsol);
begin
if s is an extensional DB predicate then use the database profile;:
if s is an intensional non-recursive OB predicate then
estimate~numsol~nonrecursive(s,
callingpattern(s), num-sol);
if s is an intensional recursive DB predicate then
estimate~numso~cursive(s,
callinclpattern(s), num-sol) ;
end;
+See Section 3.6
tSee Sections 3.2 and 4.1
procedure estimate~relevant~cost~metric(s,
callinggattern(s), cost);
begin
if s is an extensional DB predicate then use the database profile;
if s is an intensional non-recursive DB predicate then
estimate~cost~metric~nonrecursive(s,
callinc~pattern(s),cost);
if s is an intensional recursive DB predicate then
estimate~cost~metric~recursive(s,
callinuattern(s), cost) ;
end
cpat(s), num-sol);
procedure estimate~numsol~nonrecursive(s,
/' assume that predicate s has n different clauses '1
begin
for k:=1 to n do
estimate-cost (body(clausek), cost-bodyk, numsolk, h+(cpat(s)));
estirnate~number~of~solutions(num~sol,
num-sol,, .... nurnsol,);+
end:
callingpattern(s), num-sol) ;
procedure estimate~numsol~recursive(s,
begin
transform the recursive predicate to an equivalent transitive closure;tt
estimate the number of solutions by applying properties of transitive
c~osure;**
end;
procedure estimate-cost-metric-non
recursive(s, cost);
begin
1' assume that predicate s has n different clauses */
begin
for k:=l to n do
begin
estimate~cost(body(clausek),costbodyk, num-solk, h(cpat(s)));
costk:= cost-h unifttt(ciausek, cpat(s))+
?h(cpat)is the calling partem that is obtained after a successfid head unification and is detcrmined
by a simple mode analysis
:Sec Sections 3.5 and 4.3
+tSee Section 5.2
t:See Section 5.3
pf(hunif(ciausek, cpat(s))) x costbodyk;
end;
totalcost :=O;
for k:=l to n do
total-cost:= totahost + costk;
cost:= total-cost;
end;
procedure estimate~cost~metric~recursive(s,
cost);
begin
estimate the cost based on knowledae about the recursive algorithm
that is used;
end;
Most procedures can be pertormed mechanically once the abstraction of the database profile has been chosen. In general. we have assumed a uniform distribution of attributs values. but that does not have co be the case.
There are two procedures that pose some di fficulties regarding thrir automation.
One of them. the automatic transformation of a recursive quety into an cquivalent fonn
of transitive closure. has just recently been addressed by researchers in the tield
[~onsens89]:. The other. the estimation of the cost contributors when a rrcurçive predicate is solved, would imply a thorough analysis of the recursive algorithm in place. W r
will partially address rhis issue in the following chaptrr.
tttcost-hunif is the cost due to the process of head unification [Section 3.7.11.
t P ( hunif(clausek.cpat(s))is h e probabilil that the head unification is successful [Section 3.7.1 1.
ZSee Section 5.2
Chapter 6. Some Case Studies
In this chapter. we will apply Our framework to some typical databases. In particular. we
are intrrested in showing how somr real-life issues such as high correlation amongst at-
tributes or duplication of tuples may affect the accuracy of the results. We also compare
our results with one of the best algorithms for query reordering. narnely Shendan's algo-
ri thrn [Sheridan9 11.
6.1
The congressional voting records database
This publicly available database contains the votes for each of the US. House of Representatives Congressmen on several key votes in the 1984 session (the so-called 1984
United States Congressional Voting Records Database). A vrry simple protile of this da-
tabase is shown in Figure 6.1 .' An rxtract of the corresponding GraphLog database is
shown in Figure 6.2.
-
paW(1Jep) projectl(1.n).
project2(l.y).
project3(1,n).
project4(l ,y).
project5(l .y).
project6(1.y).
project7(1,n).
project8(l,n).
projectg(1,n).
projectl O(1.y).
project11(1.a).
project12(l.y).
projectl3(1,y).
projectl4(l.y).
projectl5(1,n).
projectl6(1.y).
party(2.dem).
project1(2.n).
project2(2,y).
project3i2.n).
project4(2.y).
project5(2,y).
project6(2,y).
project7(2,n).
project8(2.n).
project9(2.n).
projectl O(2.n).
projectl l(2.n).
projectl2(2,y).
projectl3(2,y).
projectl4(2.y).
projectl5(2,n).
projectl6(2,a).
m..
-
party(435,rep).
projectl(435.n).
project2(435.y).
project3(435,n).
project4(435,y).
projectS(435.y).
project6(435.y).
project7(435.n).
project8(435,n).
project9(435.n).
projectl O(435.y).
projectl 1(435,n).
project 12(43S,y).
projectl3(435,y).
project14(435,y).
projectl5pl35.a).
projectl6(435,n).
--
Figure 6.2 The GraphLog database
f ln fact. there are three attribute values for each vote. The third attribute value (besides '-es" and
"no" votes) can be regarded as an abstention.
Number of Instances: 435 (267 democrats. 168 republicans)
Number of Attributes: 16 + class name = 17 (al1 Boolean valued: y = yes: n= no: a = abstention)
Attribute Information:
1. Class Name: 2 attribute values (dernocrat, republican)
2. handicapped-infants: 2 attribute values (y.n)
3. water-project-cost-sharing: 2 attribute values (y,n)
4. adoption-of-the-budget-resolution:2 attribute values (y,n)
5. physician-fee-freeze: 2 attribute values (y.n)
6. el-salvador-aid:2 attribute values (y,n)
7. religious-groups-in-schools: 2 attribute values (y.n)
8. anti-satellite-test-ban: 2 attribute values (y,n)
9.aid-to-nicaraguan-contras:2 attribute values (y.n)
10. mx-missile: 2 attribute values (y,n)
11. immigration: 2 attribute values (y,n)
12. synfuels-corporation-cutback: 2 attribute values (y.n)
13. education-spending: 2 attribute values (y.n)
14. superfund-right-to-sue: 2 attribute values (y,n)
15.crime: 2 attribute values (y,n)
16. duty-free-exports:2 attribute values (y,n)
17. export-administration-act-south-africa:2 attribute values (y,n)
Figure 6.1 The 1984 United States Congressional Voting Records Database
Enample 1
Suppose that we wish to compare two ordenngs for a query that retritves those individuals who voted "yes" on issue # 4 and "no" on issues K 3 and
+ 5 . These two orderings
are shown in Figure 6.3.
1,
orderl (ld.Party) :-project4(ld,y). party(ld.Party), proiect3(ld.n). project5(ld.n).
ordeR(td.Party) :-party(ld.Party), project4(ld.y). project3(Id.n). projectS(1d.n).
Figure 6.3
Two orderings that we wish to compare
We will assume that the GraphLog translater generates code for a system that uses
first-argument indexing. Additionally. consider that we are mostly interested in making
our decision based on the number of visited tuples onlyt.
+As it tiappens. for this particular example. the number of visited tupIes (Le.. the nurnber of subgoal unification attempts that take place) is indeed the most relevant conmbutor to the cost of
executing this query
Let us consider ordering # I first:
orderl (1d.Party) :- project4(ld.y), party(ld.Party),project3(ld.n), project5(1d.n).
A very simple analysis will determine the calling patterns for the different subgoals
in this que- as:
orderl([f. fl) :- project4((f.g]). party([g, fl), project3([g, g]), project5([g. gj).
(g stands for '-gound argument": j'represents a "free variable)". We proceed to sstimats
the cost (as the expected number of visited tuples) of each subgoal. Successively, ws ob-
tain:
pro]ect4([f.g])
This predicate cal1 will visit al1 335 instances of this particular project. Only a fraction will actually succeed. In the absence of any additional information. we are forced to
assume a particular distribution for the three attributes in the second argument. namely
1.
("yes"), n ("no") and a ("abstention"). For instance. we may decide to use a uniform
distribution of independent values (so that we are expecting a fairly high number of abstentions!). Under this cmde consideration. we would retneve 435i3 (i.e.. 135) tuples.
ParMg. fl)
Now. we will visit as many pu-
tuples as solutions we got from the previous sub-
goal. Our estimation would indicate that 145 tuplrs had to be visited (recall that first-argument indexing is assumed). The second argument poses no restriction whatsoevver. so
that al1 145 tuples are expectrd to succeed.
project3([g,g]), project5([g,g])
The final two subgoals are very similar from the point of view of our analysis. Since
first-argument indexing occurs. a first argument ground will result in that. for each of the
135 tuples obtained from the previous phase, only one project3 tuple has to be visited. with
s
Our uniform assumption for the attnbute).
a 113 rate of success - roughly 48 ~ p k (recall
Similarly, for each successfdly retrieved project3 tupie, one projectb tuple will be visited
with a 113 rate of success as per Our assumptions (approximately 16 tuples).
Let us turn our attention to ordenng Ff 2:
ordeQ(ld.Party) :-party(ld,Party).project4(ld,y),project3(ld,n).projectS(ld.n}.
Again, a simple analysis will determine the cailing patterns for the different subaoals in this que- as:
5
order2([f. fl) :-party([f,fl), project4([g, g]), project3([g, gj). project5([g, g]).
and we proceed to estimate the expected number of visited tuples for each subgoal on the
right hand side.
pafly([f.fl)
This predicate call will visit al1 435 instances of predicate
Party.
Since both argu-
ments are variable. al1 435 instances will be retrieved as a result of the call.
proiect4([!& gI)
Now. we will visit as many project4 niples as solutions we got from the previous subuoal (note that first-argument indexing is assumed). Our estimation would indicatr that.
3
from thosr 435 tuplcs. only a fraction will actually comply with the restriction posed by
the second arprnent. If a uniform distribution of independent attributr values is assumed. roughly 435'3 (Le.. 145) tuples will be successfully retrirvrd.
project3([g,g]), project5([g,g])
The analysis of the final two subgoals is identical to that for the alternative ordering. Since first-argument indexing occurs, a ground first argument will result in that, for each of the 145 tuples obtained from the previous phase, only one project3 tuple has to be visited, with a 1/3 rate of success (roughly 48 tuples), and, similarly, for each successfully retrieved project3 tuple, one project5 tuple will be visited with a 1/3 rate of success as per our assumptions (approximately 16 tuples).
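The arithmetic behind these estimates can be written down directly. The following Python lines are only an illustration of the reasoning above; the relation size (435 tuples per predicate), the uniform 1/3 selectivity on the vote attribute and the effect of first-argument indexing are the assumptions stated in the text, not properties of the generated code.

# Step-by-step estimates for ordering #1 of Example 1, following the text.
# Assumed: 435 tuples in party/2 and in every project_i/2, a uniform 1/3
# selectivity for each vote value, and first-argument indexing.

TUPLES = 435
SEL = 1 / 3

# project4([f,g]): scan the whole relation, keep about one third of it
visited_project4 = TUPLES                 # 435
solutions_project4 = TUPLES * SEL         # about 145

# party([g,f]): one indexed lookup per incoming solution, no filtering
visited_party = solutions_project4        # about 145
solutions_party = visited_party           # about 145

# project3([g,g]) and project5([g,g]): one indexed lookup per incoming
# solution, each succeeding with probability about 1/3
visited_project3 = solutions_party
solutions_project3 = visited_project3 * SEL      # about 48
visited_project5 = solutions_project3
solutions_project5 = visited_project5 * SEL      # about 16

total_visited_order1 = (visited_project4 + visited_party +
                        visited_project3 + visited_project5)
print(round(total_visited_order1))               # about 773 visited tuples

The analogous computation for ordering #2 (party first, then the three indexed project lookups) yields roughly 1063 visited tuples, which is consistent with the 38% difference reported below.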
Figure 6.4 shows the abstract values for the different subgoals, represented by "black boxes". Figure 6.5 shows how these "black boxes" are interconnected to obtain the global values for the whole query. The results of both analyses are sketched in Table 6.1 and Table 6.2.
[Figure 6.4 Abstract black boxes for Example 1: one box per subgoal and calling pattern, annotated with the number of calls, the number of visited tuples and the expected number of solutions (e.g., party/2 with both arguments free: 1 call, 435 visited tuples, 435 solutions; a project_i with a ground first argument: 1 visited tuple and 0.33 expected solutions per call).]
Table 6.1 Number of visited tuples for ordering #1 (per-subgoal expected number of solutions and number of visited tuples, with the total)
[Figure 6.5 Interconnection of the black boxes for Example 1]
Table 6.2 Number of visited tuples for ordering #2 (per-subgoal expected number of solutions and number of visited tuples, with a total of approximately 1063 visited tuples)
We may conclude that we expect ordering #1 to be a better option with respect to ordering #2 (which has a 38% additional cost). Experimental results for this query confirm our prediction. These results, using SICStus Prolog version 2.1 on a Sun SPARCstation SLC, are shown in Figure 6.6. From the figures, ordering #2 is 43% more expensive than ordering #1. Cost measurements are given in Prolog's artificial units.
order1(Id,Party) :- project4(Id,y), party(Id,Party), project3(Id,n), project5(Id,n).
order2(Id,Party) :- party(Id,Party), project4(Id,y), project3(Id,n), project5(Id,n).
[Table of results: ordering, average cost (1000 experiments), number of solutions]
Figure 6.6 Experimental results for both orderings
Although our assumption of a uniform distribution is clearly inaccurate, we still can predict which ordering will be less expensive to execute (from the point of view of visited tuples). Our estimate of the number of solutions is obviously poor, but only a more detailed profile would yield better results.
As an additional note, we must mention that this particular database has a very high correlation factor amongst attributes. In other words, "republicans" are expected to vote as a block on some (if not most) issues, and so are "democrats". If we pose a query requesting those democrats that voted some particular way, we should not be surprised to find that our estimates regarding the number of successful tuples that are retrieved are even less accurate.
Example 2
Suppose that we wish to compare all orderings for a query that retrieves those individuals that voted "yes" on issue 16 and "no" on issue 6. These orderings are shown in Figure 6.7.
order1(Id,Party) :- party(Id,Party), project16(Id,y), project6(Id,n).
order2(Id,Party) :- party(Id,Party), project6(Id,n), project16(Id,y).
order3(Id,Party) :- project16(Id,y), party(Id,Party), project6(Id,n).
order4(Id,Party) :- project16(Id,y), project6(Id,n), party(Id,Party).
order5(Id,Party) :- project6(Id,n), party(Id,Party), project16(Id,y).
order6(Id,Party) :- project6(Id,n), project16(Id,y), party(Id,Party).
Figure 6.7 Six orderings that we wish to compare
A very simple static analysis determines the corresponding calling patterns, which are shown in Figure 6.8.
order1[f,f] :- party[f,f], project16[g,g], project6[g,g].
order2[f,f] :- party[f,f], project6[g,g], project16[g,g].
order3[f,f] :- project16[f,g], party[g,f], project6[g,g].
order4[f,f] :- project16[f,g], project6[g,g], party[g,f].
order5[f,f] :- project6[f,g], party[g,f], project16[g,g].
order6[f,f] :- project6[f,g], project16[g,g], party[g,f].
Figure 6.8 Calling patterns for the six orderings
We wish to estimate the number of visited tuples associated with each predicate/calling pattern pair that appears in the different orderings. Following a reasoning similar to that in Example 1, we are able to deduce the values shown in Figure 6.9, in which we may use "black boxes" to identify the abstract values that we estimate. Note that first-argument indexing is assumed. These "black boxes" are then interconnected as shown in Figure 6.10 to obtain the global values of an entire query.
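The interconnection step can be pictured as a fold over the chain of black boxes: each box contributes its visited tuples multiplied by the number of times it is called, and its expected solutions determine how many times the next box is called. The Python sketch below is only an illustrative rendering of that idea; the per-box figures in the example call (435 visited tuples and 145 expected solutions for project16[f,g], one visited tuple per call for the indexed lookups) are the values that follow from the uniform assumption of Example 1, not measured data.

# Illustrative interconnection of abstract "black boxes".
# Each box is summarized as (visited tuples per call, expected solutions per call).

def interconnect(boxes):
    """Chain the boxes of one ordering and return the total number of
    visited tuples together with the expected number of solutions."""
    calls, total_visited = 1.0, 0.0
    for visited_per_call, solutions_per_call in boxes:
        total_visited += calls * visited_per_call
        calls *= solutions_per_call          # propagated to the next box
    return total_visited, calls

# ordering #3 of Example 2: project16[f,g], party[g,f], project6[g,g]
order3 = [(435, 145), (1, 1), (1, 0.33)]
print(interconnect(order3))                  # total visited tuples, expected solutions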
We are now in a position to estimate the total number of tuples expected for each ordering. We summarize the (analytical) results in Table 6.3. Experimental results for SICStus Prolog are shown in Table 6.4, where they are compared to our estimated values.
We were able to detect the two least efficient orderings. However, our predictions regarding the other four orderings are not very accurate. This is due to the fact that we are assuming a distribution that behaves in a certain way, whereas the real database follows a different pattern. Predicate project16 clearly has a different behaviour from that of predicate project6 (in fact, the latter predicate gets executed more efficiently than the former), and our framework cannot make this distinction in the absence of a more detailed database profile.
[Figure 6.9 Abstract black boxes for Example 2: one box per predicate/calling-pattern pair, annotated with the number of visited tuples, the calling pattern and the expected number of solutions.]
But our framework is able to detect that if a "project" predicate is selected first, it is more efficient to place the other "project" predicate as the next one, leaving the party predicate in the last position. Similarly, our framework determines that it is not convenient to place the party predicate as the first subgoal in the clause.
6.2 The Performers Database
We proceed to study another real database. Our second database contains detailed information on 888 classical-music compact discs. The information available for each compact disc includes a list of individual tracks, a list of individual and collective performers, as well as some technical data regarding the production of the recording. For the purposes of our case study, we will consider the portion of the database that is related to the musicians and their instruments.
[Figure 6.10 Interconnection of two black boxes in Example 2]
Table 6.3 Expected number of visited tuples (estimated values for each of the six orderings of Example 2)
Table 6.4 Comparison between the predicted and experimental values (ordering, experimental value, experimental ranking, theoretical ranking)
6.2.1 Primitive Entities
The main entities to be considered are: (1) compact disc numbers, (2) artists or musicians, (3) instruments used by the musicians and (4) compact disc labels. To produce a more interesting example, we will introduce an additional entity, namely: (5) the overseas distributors for the compact discs.
Compact Disc Numbers
Every compact disc can be identified by a manufacturer's number, which is an alphanumeric code. Each compact disc number is then assigned an internal code to be used by other relations:
recording(CD_Code, Manufacturer_Number).
Artists
Each individual performer or musical ensemble in the database is identified by an internal code.
Instruments
Similarly, every instrument description has a unique code assigned to it. Different descriptions for the same instrument are treated as different entities.
instrument(Instrument_Code, Instrument_Name).
Companies
We also find entries for the different companies that produce the compact discs stored in the database. Again, a specific code is provided for each label.
6.2.2 The Extensional Database
Once we have introduced the main entities in the database, we proceed to explain the set of facts that constitute the performers database. We will consider the following relations, available as extensional DB predicates:
Performers
For each compact disc represented in the database, a list of performers is available. For a given compact disc code, there is one entry for each performer listed for that production. If the same artist utilizes more than one instrument, there is one separate entry for each instrument used by that performer.
performer(CD_Code, Performer_Code, Instrument_Code).
Labels
This relation gives the internal code for the company that has produced the compact disc.
label(CD_Code, Label_Code).
Distributors
Finally, each label may have one or more "overseas distributors", that is, independent companies that import and distribute the compact discs in different parts of the world.
distributor(Label_Code, Distributor).
Typical sample tuples of the different relations in the performers database are sketched in Figure 6.11.
distributor(k1, qualiton).
distributor(k7, pelleas).
distributor(k2, allegro).
distributor(k2, sri).
distributor(l1, analekta).
distributor(l3, polygram).
distributor(l4, sri).
distributor(l4, hmusa).
distributor(l7, polygram).
distributor(l8, koch).
distributor(l2, allegro).
distributor(l2, fusion).
distributor(m2, ebs).
recording(886, 'SYMPHONIA SY 91S06').
recording(887, 'TELDEC 4509-90798-2').
recording(888, 'CRD 3311').
artist(b516, 'Bohumil Benicek').
artist(b517, 'Ensemble Tempo Barocco').
artist(b518, 'Ars Musicae Barcelona').
Figure 6.11 Sample tuples from the performers database
We start by analyzing a simple non-recursive query. For instance, we may be interested in knowing the codes of the performers of some particular instrument that are available from a given overseas distributor. The following Datalog predicate defines such a relationship:
Suppose that we are interested in the following family of queries:
In other words, we will study a query with a calling pattern [f, g, g]: the first argument is a variable, whereas the second and third arguments are constants. There are six different orderings in which the three subgoals may be arranged (calling patterns are shown in square brackets):
performer [f,f,g], label [g,f], distributor [g,g]
performer [f,f,g], distributor [f,g], label [g,g]
label [f,f], performer [g,f,g], distributor [g,g]
label [f,f], distributor [g,g], performer [g,f,g]
distributor [f,g], performer [f,f,g], label [g,g]
distributor [f,g], label [f,g], performer [g,f,g]
Our goal consists of selecting the most efficient ordering of all six. All we know about the user's query is the calling pattern: instrumentists_available_from_a_distributor [f, g, g].
Typical queries are shown as follows:
:- instrumentists_available_from_a_distributor(A, sri, ten).
(tenors on a label distributed by Scandinavian Record Imports);
:- instrumentists_available_from_a_distributor(A, qualiton, sop).
(sopranos on a label distributed by Qualiton Imports);
:- instrumentists_available_from_a_distributor(A, allegro, obo).
(oboists on a label distributed by Allegro Imports).
The database profile of the performers database is shown in Table 6.5.
Predicate name    number of tuples    distinct values in argument 1    distinct values in argument 2    distinct values in argument 3
performer/3       12,851              888                              3,727                            710
label/2           888                 888                              92                               -
distributor/2     177                 111                              23                               -
Table 6.5 The performers database predicates
In the absence of further information, we will treat the database as a uniform and independent distribution of attribute values (although we know that this is probably not the case). We will try to assign cost values to all six different possible orderings for the subgoals in the predicate definition. We will consider a couple of cost contributors in our analysis: the number of visited tuples and the number of variable unifications that are performed. Since we use SICStus Prolog to execute the code generated by the GraphLog translator, we have to assume that first-argument indexing is used. Our abstract "black boxes" for all (statically detected) combinations of calling patterns for the three predicates are shown in Figure 6.12.
The expected average number of visited tuples is either the total number of tuples for that predicate, if the first argument is a variable, or this value divided by the number of distinct values in the first argument position otherwise. The total number of variable unifications may be calculated with the aid of the formula shown in Appendix 2 or, if a simpler approximation is acceptable, by multiplying the number of expected visited tuples by the number of variable arguments in the predicate call. Finally, the expected average number of solutions is calculated by dividing the total number of tuples by the number of distinct values at each ground argument position.
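As an illustration only, the three abstract values just described can be computed from a profile entry as in the following Python sketch; the profile numbers are those of Table 6.5, and the simpler approximation for variable unifications (visited tuples times the number of variable arguments) is used instead of the formula of Appendix 2.

# Illustrative construction of a "black box" from a database profile entry.
# A profile entry gives the total number of tuples and the number of
# distinct values per argument position.

def black_box(n_tuples, distinct, pattern):
    """pattern is a string such as 'ffg'; returns the expected number of
    visited tuples, variable unifications and solutions per call."""
    # first-argument indexing: a ground first argument restricts the scan
    visited = n_tuples if pattern[0] == 'f' else n_tuples / distinct[0]
    # simpler approximation: one variable unification per free argument
    unifications = visited * pattern.count('f')
    # each ground argument position divides the number of solutions
    solutions = n_tuples
    for d, mode in zip(distinct, pattern):
        if mode == 'g':
            solutions /= d
    return visited, unifications, solutions

# performer/3 called with calling pattern [f,f,g] (Table 6.5 profile)
print(black_box(12851, (888, 3727, 710), 'ffg'))

For performer/3 with calling pattern [f,f,g] this reproduces the 12,851 visited tuples, 25,702 variable unifications and 18.10 expected solutions that appear in Figure 6.12.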
Once we have determined the expected values for our cost contributors when a single call is considered, we proceed to "interconnect" all three predicates, for each ordering under consideration. This is illustrated in Figure 6.13 for only one of the specific orderings.
Table 6.6 shows the values that are estimated for both cost contributors given all six different orderings. Table 6.7 summarizes the corresponding experimental results for different sets of ground terms. From these tables, we observe that we accurately predict the best ordering, as well as the two worst orderings. Interestingly enough, the ordering that we expect to be the second most efficient one (i.e., ordering #6) is only so for a few of the experiments. In fact, for some distributors that carry many labels (sri, qualiton, allegro), orderings #1 and #3 seem to be more efficient. We must realize that our predictions are based on uniform distributions of independent attribute values, and the database does not follow this type of distribution. However, we are still able to select efficient orderings and reject those that perform poorly.
[Figure 6.12 Abstract black boxes for the non-recursive query: for each predicate/calling-pattern pair, the expected number of visited tuples, variable unifications and solutions per call (e.g., performer/3 [f,f,g]: 12,851 visited tuples, 25,702 variable unifications, 18.10 solutions; label/2 [f,g]: 888 visited tuples, 9.65 solutions; distributor/2 [f,g]: 177 visited tuples, 7.70 solutions; distributor/2 [g,g]: 0.07 solutions).]
[Figure 6.13 Expected values for the cost contributors for a specific ordering. For the ordering performer, label, distributor: visited_tuples = 12851 + 18.10 + 28.96 = 12898.06 and variable_unifications = 25702 + 18.10 + 0 = 25720.1.]
6.2.4 An Example Involving a Closure
We will study an example that includes a closure predicate, which requires special treatment in our framework. For practical reasons, we had to use a smaller database, because the current implementation of the transitive closure algorithm is inefficient. The modified database profile for the performers predicate is shown in Table 6.8.
ordering                            expected number of     ranking     expected number of            ranking
                                    visited tuples n_vt    for n_vt    variable unifications n_vu    for n_vu
performer, label, distributor       12,898                 [3]         25,720                        [4]
performer, distributor, label       16,194                 [5]         28,906                        [5]
label, performer, distributor       15,764                 [4]         16,623                        [3]
distributor, performer, label       99,269                 [6]         197,913                       [6]
distributor, label, performer       8,095                  [2]         7,926                         [2]
label, distributor, performer       3,208                  [1]         2,676                         [1]
Table 6.6 Predicted values of two cost contributors for the non-recursive query
Table 6.7 Experimental results for the non-recursive query (average cost of each ordering for several sets of ground terms; rankings in square brackets)
Predicate name    number of tuples    distinct values in argument 1    distinct values in argument 2    distinct values in argument 3
performer/3       204                 9                                132                              45
Table 6.8 The modified performers database profile
We now proceed to define some additional relations to be applied to the performers database. We are interested in the definition of a predicate that uses some form of recursion or closure.
Let us define a predicate colleague that is true when two musicians participate in the same recording production†:
colleague(A,B) :- performer(X,A,_), performer(X,B,_).
†Strictly speaking, we should also impose the restriction that A and B are different musicians. We will omit this additional constraint here and in future definitions to simplify the analysis.
We also define that musician A is an "indirect" colleague of a musician B if both have recorded at least one compact disc with a mutual "colleague" as defined before:
and therefore a transitive closure is used.
Further suppose that we wish to define a more specific type of "indirect" colleague, in which musician A has participated in a recording project performing the same instrument as musician B, both having recorded with a mutual "colleague" of the same instrument. This could be done by defining a predicate that includes the additional restriction of the musicians performing on the same instrument:
and then taking the closure over this predicate. However, we will use a different set of predicates for illustrative purposes (i.e., showing how closure predicates are handled by our framework).
Thus, we define a "same-instrument" indirect colleague A as a musician who has participated in a recording project performing the same instrument as another musician B, both having recorded with a mutual "colleague of a colleague" (as defined above), as:
where the last subgoal is a transitive closure.
We observe that there are six different orderings for the right-hand side of the predicate. If, say, the calling pattern for the predicate same_instrument_colleague_of_a_colleague is known to be [g,f], the calling patterns of the predicate subgoals for all six orderings are†:
performer [f,g,f], performer [f,f,g], colleague_of_a_colleague [g,g]
performer [f,g,f], colleague_of_a_colleague [g,f], performer [f,g,g]
performer [f,f,f], performer [f,g,g], colleague_of_a_colleague [g,g]
performer [f,f,f], colleague_of_a_colleague [g,g], performer [f,g,g]
colleague_of_a_colleague [g,f], performer [f,g,f], performer [f,g,g]
colleague_of_a_colleague [g,f], performer [f,g,f], performer [f,g,g]
As mentioned before, predicates that involve closures or recursion have to be treated as special black boxes: unless we know the exact algorithm that is being used to solve the closure or recursion, nothing can be said about the values of the cost contributors for the predicate, except for the expected number of solutions (a value that is propagated to other black boxes when we interconnect them to find the global values of the cost contributors).
Since the base predicate for this closure is given by the following two subgoals:
we may estimate its average number of tuples as 204 x 204 / 9 = 4624, since there are 204 performer facts and 9 label facts (i.e., recordings) in the database.
The average number of solutions of the closure predicate is estimated by noting that the number of distinct attribute values for the relation is n_dv = 132 x 132 = 17,424 (132 is the number of performers), and that the number of unique tuples for the base predicate is estimated as n_t = 4624. We then determine the region that corresponds to these values: their ratio is calculated as A = n_dv / n_t = 3.77. This corresponds to the "region of small values for the number of tuples in the base predicate" explained in Section 5.4.1. Applying the formula mentioned in that section, we estimate the number of solutions of the closure as n_c = 17424 / (3.77 - 1) = 6290.
This is the expected number of tuples for the whole closure. If one argument is ground, only a fraction of the tuples will form part of the solution. We already know that the tuples produced by a transitive closure do not follow a uniform distribution (see Section 5.5), but we may produce a rough estimate by making this assumption, thus dividing the total number of tuples by the number of distinct values for that particular ground argument (i.e., the number of performers, in our example), or n_c[g,f] = 6290 / 132 = 47.7. Similarly, if both arguments are ground, our estimate would be n_c[g,g] = 47.7 / 132 = 0.36. Thus, our black boxes for this example are shown in Figure 6.14.
†Note that the two last orderings are equivalent in the abstract domain.
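The cardinality estimate just derived can be packaged as a small helper. The following Python sketch merely restates the arithmetic of this section; the formula n_dv / (A - 1) for the "region of small values" and the uniform split across ground arguments are the assumptions introduced in Chapter 5, and the concrete numbers (132 performers, 4624 base tuples) are the ones used above.

# Illustrative estimate of the cardinality of a transitive closure,
# following the "region of small values" formula.

def closure_solutions(n_base_tuples, n_distinct_per_arg):
    """Expected number of tuples produced by the closure of a binary base
    relation, plus the per-call estimates when arguments are ground."""
    n_dv = n_distinct_per_arg[0] * n_distinct_per_arg[1]  # distinct value pairs
    ratio = n_dv / n_base_tuples                          # the ratio A
    total = n_dv / (ratio - 1)                            # whole closure
    one_ground = total / n_distinct_per_arg[0]            # e.g. pattern [g,f]
    both_ground = one_ground / n_distinct_per_arg[1]      # pattern [g,g]
    return total, one_ground, both_ground

# colleague_of_a_colleague: 4624 base tuples, 132 performers on each side
print(closure_solutions(4624, (132, 132)))   # about (6290, 47.7, 0.36)

The same helper applied to the packages example of Section 6.3 (4075 base tuples, 1640 parts on each side) gives roughly 4081 tuples for the whole closure, 2.49 expected solutions for the pattern [g,f] and 0.0015 for [g,g], the values used later in this chapter.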
[Figure 6.14 Abstract black boxes for the recursive query: for each performer/3 calling pattern, the number of visited tuples, variable unifications and expected solutions per call (204 solutions for [f,f,f], 4.53 for [f,f,g], 1.55 for [f,g,f], 0.034 for [f,g,g]), together with the closure boxes for colleague_of_a_colleague (47.7 expected solutions for [g,f] and 0.36 for [g,g]).]
Suppose that we wish to use the number of visited tuples as the relevant cost contributor. The abstract representation of our six orderings would be as shown in Figure 6.15. We have used the names N_gf and N_gg to denote the unknown number of visited tuples associated with the closure predicate with calling patterns [g,f] and [g,g], respectively.
Thus, the expected numbers of visited tuples for the different orderings are obtained as:
performer [f,g,f], performer [f,f,g], colleague_of_a_colleague [g,g]:
    visited tuples = 520.2 + 7.02 N_gg
performer [f,g,f], colleague_of_a_colleague [g,f], performer [f,g,g]:
    visited tuples = 15286.7 + 1.55 N_gf
performer [f,f,f], performer [f,g,g], colleague_of_a_colleague [g,g]:
    visited tuples = 41820 + 6.94 N_gg
performer [f,f,f], colleague_of_a_colleague [g,g], performer [f,g,g]:
    visited tuples = 15186.8 + 204 N_gg
colleague_of_a_colleague [g,f], performer [f,g,f], performer [f,g,g]:
    visited tuples = 24813.5 + N_gf
colleague_of_a_colleague [g,f], performer [f,g,f], performer [f,g,g]:
    visited tuples = 24813.5 + N_gf
At this point, we need estimates of the magnitudes of N_gf and N_gg. Unless we have a detailed performance analysis of the algorithm that is used to execute the transitive closure, we are forced to propose some suitable value. For instance, we know that simple algorithms to solve the transitive closure problem [Warren75] are computed in time at most proportional to the cube of n, the number of distinct values in the relation, or O(n^3) [Baase88].
We also know that typical transitive closure algorithms compute practically the same instructions for calling patterns [g,g] and [g,f] [Fukar91]. This is because the algorithm first obtains all pairs that are reachable from the first argument, and only then the nature of the second argument is taken into account. In fact, many transitive closure algorithms have similar behaviour even for the calling pattern [f,g], since the computation is started from the second argument rather than from the unbound first argument.
[Figure 6.15 Abstract representation of the different orderings: each ordering drawn as a chain of black boxes annotated with the number of visited tuples and the number of solutions.]
One educated guess as to the values of N_gf and N_gg would be to use the cube of the number of tuples in the base relation (n_t^3 = 4624^3, roughly 98 x 10^9, in our example).† Given this value, our estimates become (rankings shown in square brackets):
Typical experimental results for this family of queries when using SICStus Prolog as the target language are shown in Table 6.9 (again, rankings are shown in square brackets). Note that our first choice for the most efficient query ordering is close to the one that is experimentally best (in fact, there is a virtual "tie" amongst the three orderings with best experimental performance). We are also able to discover those orderings that are more inefficient and therefore should be discarded. Not surprisingly, the most significant term is the one that relates to the closure predicate.
[Table 6.9 Experimental results for the recursive predicate: average cost of each ordering, with rankings in square brackets; the three cheapest orderings are roughly tied at about 43,000 units, while the most expensive ordering costs about 7,500,000 units.]
Incidentally, the ratio between the most and least efficient orderings in the experiments is about 174.3; the corresponding ratio in our predictions is 20000/99 = 202.0. Our "educated guess" turned out to be reasonably accurate.
†As mentioned before, both calling patterns require visits to a similar number of tuples during the computation of the closure; the only difference is that the calling pattern [g,g] will result in fewer tuples being kept in the final answer.
We also launched a series of experiments to determine the performance of the transitive closure algorithm used by the GraphLog translator. The results when SICStus Prolog is used are shown in Table 6.10 (data values are expressed in "artificial units").
calling pattern                         execution time
colleague_of_a_colleague [f,f]          5,769,549.0
colleague_of_a_colleague [g,f]          41,943.2
colleague_of_a_colleague [f,g]          62,922.5
colleague_of_a_colleague [g,g]          42,143.3
Table 6.10 Efficiency of the transitive closure for different calling patterns
Not surprisingly, the efficiency of a totally unbound transitive closure is very poor as compared to the case when one or both arguments are ground. There is a factor of almost 138 between the most and least efficient calling patterns. Also, as we had predicted, there is no visible difference between the efficiencies of calling patterns [g,f] and [g,g]. The fact that the cost of calling pattern [f,g] is approximately 1.5 times the cost of the other two calling patterns that involve a ground term suggests that this particular transitive closure algorithm is not symmetric.
6.3 The Packages Example
Our final case study will be based on the "Packages Example" described in Appendix 4. Our database profile is shown in Table 6.11.
Predicate name    number of tuples    distinct values in argument 1    distinct values in argument 2    distinct values in argument 3
part/3            1640                136                              16                               1610
uses/2            3075                1203                             1288                             -
Table 6.11 The extensional database predicates
We introduce a predicate that computes packages that are in a cycle:
cycle(X) :- pkg_uses(X,X)+.
It defines a unary predicate cycle(X) to be true when there is a path of one or more arcs labeled pkg_uses from X to itself, that is, when X is related to itself by the transitive closure of pkg_uses [Consens92].
Query to be analyzed
Suppose that we wish to determine the best ordering of the following arbitrary query†:
and consider the case when the Y argument is ground prior to the call. We will proceed to estimate the cost of all different orderings. They are shown, with their respective calling patterns, in Table 6.12. Note that only 10 of the orderings are unique.
It becomes clear that we must determine the abstract properties of predicates part_of and cycle. In Appendix 4, we have already obtained the information related to two intensional database predicates (namely part_of and pkg_uses). The properties of predicate cycle are related to those of predicate pkg_uses, so we have to deduce the properties of this intensional predicate. We know the specific ordering that is used to implement the pkg_uses predicate.‡ Once we select this ordering we can draw the "black boxes" for the relevant calling patterns for the subgoals and then derive the "black box" for the pkg_uses predicate. As before, we interconnect the "black boxes" that correspond to the selected ordering and then obtain the expected values of our cost contributors once the expected values for the number of solutions are propagated. This is depicted in Figure 6.16 for one of the many possible cost contributors: the number of visited tuples.
Regarding the number of solutions of predicate pkg_uses, we must recall that a system predicate was ignored in the analysis of Appendix 4. The actual number of solutions once the system predicate is considered is approximately ten times smaller.
Now we are able to come up with a "black box" for the cyclic predicate, i.e., the transitive closure of the pkg_uses predicate. Disregarding the additional call and head unification due to the indirect call to the transitive closure via the cycle predicate, we apply our formulas for transitive closure to the pkg_uses predicate.
†Although this query has no special intent, it is still an interesting example, given the fact that it contains two transitive closures.
‡Although we already know what the best ordering for this predicate would be, sometimes we are not able to modify (and recompile) the already existing code.
[Figure 6.16 Abstract black boxes for some predicates in the packages example: for the selected implementation of pkg_uses, the expected number of visited tuples, variable unifications and solutions per calling pattern (about 4075 solutions when both arguments are free, with more than 13.4 x 10^6 visited tuples and variable unifications).]
In this example, the number of distinct attribute values for the relation is n_dv = 1640 x 1640 = 2,689,600 (since 1640 is the number of parts) and the number of unique tuples for the base predicate is estimated as n_t = 4075. With these values, we proceed to determine the region that corresponds to their ratio A = n_dv / n_t = 660.0. This value lies in the "region of small values for the number of tuples in the base predicate" as explained in Section 5.3.1. Applying the formula derived in that section,
[Table 6.12 Different orderings for the query under consideration: the ten unique orderings with their respective calling patterns.]
we estimate the average number of solutions that results after the computation of the closure as n_c = 2,689,600 / (660.0 - 1) = 4081.2 (the expected number of tuples for the entire closure). If one argument is ground, only a fraction of the tuples will be in the solution. We already know that the transitive closure does not follow a uniform distribution (see Section 5.5), but we may produce a rough estimate by making this assumption, thus dividing the total number of tuples by the number of distinct values for that particular ground argument (i.e., the number of parts, in our example), or n_c[g,f] = 4081.2 / 1640 = 2.49. Similarly, if both arguments are ground, our estimate would be n_c[g,g] = 2.49 / 1640 = 0.0015. With these values, we are able to propose the "black boxes" for predicate cycle as shown in Figure 6.17.† We also need the "black boxes" that correspond to predicate part_of (Figure 6.18).
[Figure 6.18 Abstract black boxes for predicate part_of: among the values shown, 1640, 12.06 and 0.0073 expected solutions for the different calling patterns.]
As in the previous case study, we propose that the number of visited tuples in the computation of the closure is in the order of N_gg = (n_t)^3 = 4075^3 = 67.7 x 10^9 when the first argument is ground. In the previous case study we have already seen that the value of N_ff is expected to be some 138 times the value of N_gf or N_gg, whereas the value of N_fg is just about 1.5 times that of N_gg; this gives N_ff = 138 x 4075^3 = 9.34 x 10^12. If we use these values as approximations for the number of visited tuples in the closures, we are finally able to estimate the cost of the ten different orderings. Initially, we will study the case where all arguments are initially uninstantiated, i.e., the case when we wish to retrieve all solutions to the general query.
†Strictly speaking, we should only have to derive the "black boxes" that correspond to calling patterns [f,f] and [g,g].
[Figure 6.17 Abstract black boxes for predicate cycle: the closure of pkg_uses, with about 4075 expected solutions for the all-free calling pattern, 0.034 when one argument is ground and 0.0015 when both are ground, and unknown numbers of visited tuples N_ff, N_fg, N_gf and N_gg.]
As before, our cost contributor will be the expected number of visited tuples. Then, we will consider the case when the user specifies a ground term, i.e., we wish to retrieve only a fraction of the tuples in the general solution.
Example 1
For the case when the query has the form given above with all arguments free, the estimates become, after value substitution:
n_vt,#1 = 1640 + 1640 x (1 + 1 x (67.7 x 10^9 + 0.0015 x 67.7 x 10^9)) = 111 x 10^12
n_vt,#2 = 1640 + 1640 x (67.7 x 10^9 + 0.0015 x (1 + 1 x 67.7 x 10^9)) = 111 x 10^12
[the substituted expressions for orderings #3 through #9 are analogous]
n_vt,#10 = 9.34 x 10^12 + 4075 x (9.34 x 10^12 + 4075 x (1640 + 12.06 x 1)) = 38 x 10^15
and we conclude that it is likely that the first three orderings are the most efficient ones from the viewpoint of the number of visited tuples.
Example 2
For the case when the query has the form given above with a ground argument, our estimates are modified as follows (again, n_vt stands for "number of visited tuples"); after value substitution:
n_vt,#1 = 1 + 1 x (1 + 1 x (67.7 x 10^9 + 0.0015 x 67.7 x 10^9)) = 67.8 x 10^9
and, again, we conclude that the first three orderings are the most likely to be more efficient from the perspective of the number of visited tuples. There is a factor of almost 162 between the three least expensive orderings and the next most efficient ordering.
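The two head estimates quoted above can be checked with a few lines of arithmetic. The sketch below simply evaluates those expressions in Python; the figures 1640, 0.0015 and 67.7 x 10^9 are the assumed black-box values introduced earlier in this section, not measured data.

# Re-evaluating the quoted estimates for ordering #1 of the packages query.

N_GG = 4075 ** 3                 # guessed closure cost, about 67.7e9
SOL_GG = 0.0015                  # expected solutions of the closure with ground arguments

# all arguments free (Example 1)
free_case = 1640 + 1640 * (1 + 1 * (N_GG + SOL_GG * N_GG))
# the Y argument ground prior to the call (Example 2)
ground_case = 1 + 1 * (1 + 1 * (N_GG + SOL_GG * N_GG))

print(f"all free : {free_case:.3g} visited tuples")    # about 1.1e14
print(f"Y ground : {ground_case:.3g} visited tuples")  # about 6.8e10

Binding the Y argument removes the initial scan over the 1640 parts, which is why the estimate drops by a factor of roughly 1640.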
We obtained some experimental values (only for the most efficient orderings), which are summarized in Table 6.13.†
[Table 6.13 Experimental results for the three most efficient orderings: calling patterns, ordering number, theoretical ranking, experimental result and experimental ranking. For the ordering part_of[f,f], part_of[g,f], cycle[g], cycle[g] (ordering #1) the experimental result is 1,340,538, ranked 1= both theoretically and experimentally.]
6.4 Comparison to Sheridan's Algorithm
In this section we will compare our results to those obtained by the application of Sheridan's algorithm [see Section 2.2.4], one of the most successful reordering algorithms in the literature.
6.4.1 Why is Sheridan's algorithm so successful?
The main idea exploited by Sheridan's algorithm is that the more instantiated the arguments in a predicate call are, the less expensive that call will be. For instance, in the results shown in Table 6.14 (borrowed from Appendix 4), we notice that having an additional ground argument always guarantees a lower cost of execution.
[Table 6.14 Cost metrics for all predicates: estimated cost for each <clause, calling pattern> pair.]
†A single experiment for the most efficient orderings (orderings #1, 2, 3) required several hours of computation. The computation of a single experiment for the next group of orderings (orderings #4, 6, 8) would require close to one month of uninterrupted execution!
6.4.2 Our framework versus Sheridan's
When comparing our results with those obtained by using Sheridan's algorithm, our framework gives consistently better results than Sheridan's for at least two groups of situations: (a) when the position of the ground arguments within a predicate call is crucial, and (b) when the performance of two syntactically similar predicate calls varies considerably because of noticeably different sizes of their underlying database definitions.
A simple glance at Table 6.14 will reveal that, for some predicates, the exact locations of the ground arguments have an impact on the performance of the predicate call. For example, the predicate call uses with a first argument constant and a second argument variable (i.e., calling pattern [g,f]) will perform far better than the call with a second argument constant and a first argument variable (i.e., calling pattern [f,g]). Sheridan's algorithm has no way to determine such a difference, and which of the two will be given preference is a matter of chance.
By a similar token, due to the fact that Sheridan's algorithm does not take into account any information regarding the underlying database, there is no obvious way to distinguish between a potentially very expensive predicate and a less expensive one based only on the nature (i.e., the mode) of the arguments. Consider the very simple case that is shown in Figure 6.19.
Here we have two database predicates with a remarkably different number of facts. Sheridan's algorithm would consider both orderings shown in the figure as equally expensive: the first predicate is executed with both arguments variable; the second predicate is executed with one ground argument and one variable argument. However, since it is dramatically more expensive to retrieve all several thousand tuples of predicate p as opposed to only one tuple for predicate q, it is better to place the call to predicate q before the call to predicate p. Again, since our framework uses a profile of the database in use, we are able to make this kind of prediction, whereas Sheridan's algorithm is not.
A third situation in which our method has a potential advantage over Sheridan's algorithm can be observed when the query contains recursive predicates or path regular expressions. Sheridan's algorithm does not treat these predicates as special cases, and thus a potentially expensive recursive call may be chosen as one of the subgoals to be executed first. Our methodology permits us to estimate the sizes of those special predicates and their repercussions on succeeding predicates.
order1(A,B,C) :- p(A,B), q(B,C).
order2(A,B,C) :- q(B,C), p(A,B).
[Figure 6.19 Impact of the underlying database on the performance of the call: the number of tuples of predicates p and q in two experiments, and the average cost of each ordering in each experiment.]
Chapter 7. Conclusions and Future Work
In this chapter, we summarize our results and address the limitations of our framework. We also propose some additional work that is required.
7.1 Contributions of this Dissertation
We have proposed a new methodology to estimate the cost of a general GraphLog query. It is based on the assumption that a profile of the underlying database is known. We have been able to predict with good accuracy the costs of conjunctions of queries when the program is translated into a WAM-based version of Prolog. This is done by obtaining empirical values for the diverse primitive operations into which the query is decomposed, in combination with predictions regarding the number of times these primitive operations will be invoked. We have also explained how to predict the costs when different abstract machines are used, by defining relevant cost contributors whose values are propagated in conjunction with the expected number of solutions for each subgoal in the query. Our predictions are normally able to detect the most efficient reorderings as well as the most expensive ones. The accuracy of the results is noticeably influenced by the nature of the underlying database, and our predictions are best for databases whose distribution resembles a uniform distribution of attribute values.
Our predictions are usually superior to those obtained when using Sheridan's algorithm, because we make use of more information and are also able to consider some special forms of subgoals.
A major contribution of our work is the ability to handle recursive queries and closures. We have shown that the key factor is the estimation of the output density after applying the transitive closure. We have provided some guidelines and formulae to estimate such output density which have not appeared elsewhere in the literature. Our results may be of special interest given that SQL, the de facto standard for relational databases, has finally included recursive constructs [Dar93], and recursive query languages will soon become the norm [Ahad93].
Our results should be directly applicable to pure Datalog and any other function-free database language.
7.2 Limitations of Our Framework
The main deficiency of our methodology is that we have assumed "ideal" databases. We have not addressed some important issues such as duplication of tuples after a projection or correlation amongst attributes. Many of our claims may be applicable to uniform distributions of attribute values only.
Another issue that has not been addressed by our framework is the impact that ground arguments have on the cost of the unification algorithm. It stands to reason that the unification of long strings of characters consumes more time and resources than the unification of an integer or an atom with a short name. In fact, some systems transform the real attributes to shorter and easier to handle equivalent codes [Graefe91], as is the case in the performers database in Chapter 6.
7.3 Future Work
Our current framework does not consider the special case of aliasing of variables (especially when the same variable is used several times within the same predicate). A domain analysis of the arguments that are involved may establish some upper bounds for the number of tuples that are retrieved.
We also propose to address the inclusion of correlation factors and the analysis of other path regular expressions besides transitive closure. Our framework would also be more complete if built-in predicates were also to be included in the analysis.
It is likely that additional information regarding the cardinality of the base relations may be used to refine our results. Further study of this idea could be fruitful.
If we wish to use our framework in a practical situation, we require a pre-process that determines a set of suitable query orderings to be analyzed. Some methods have been proposed to this effect (randomized algorithms [Ioannidis90] or even Sheridan's algorithm may be adapted to that purpose). Naturally, we need to incorporate a phase that determines the calling patterns of the different subgoals, but this has already been solved elsewhere [Debray88].
So far, we have not mentioned what to do with built-in predicates, a common extension to pure GraphLog/Datalog. In fact we consider that the inclusion of built-in predicates fits quite naturally in our framework. Estimating the number of solutions of a built-in predicate becomes trivial when the domain of the attributes is known in advance, and so do the values of the different cost contributors we might be interested in.
Several extensions have been proposed for GraphLog [Consens89]. These include the definition of aggregates and the option of using functional arguments. Aggregates are constructs that are used to summarize data (typical aggregates are the average, maximum, minimum, sum or count of an attribute). We believe that our framework can be easily extended to handle these constructs once we determine what additional operations take place. In general, aggregates have to perform an action over all the tuples that satisfy a condition. Our framework already estimates the cost of retrieval of the tuples, and we only need to add the cost due to the aggregate action (which will usually require visiting each and every tuple in the solution). Handling functional arguments is a more complex matter. We would require a richer abstract domain to distinguish partially instantiated arguments, and would have to modify the rules of the corresponding abstract unification.
It is not unusual to use a cache in order to reduce the cost of processing a query by preventing multiple evaluations of the same predicate call [Sellis87]. This is specially useful for inherently expensive queries (such as transitive closures). A practical cost model would have to take into consideration this and other implementation issues.
References
[Ahad93] Ahad, R., and Yao, B. RQL: A Recursive Query Language. IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 3, June 1993, pp. 451-461.
[Agrawal90] Agrawal, R., Dar, S., and Jagadish, H.V. Direct transitive closure algorithms: Design and performance evaluation. ACM Transactions on Database Systems, Vol. 15, No. 3, September 1990, pp. 427-458.
[AitKaci91] Aït-Kaci, H. Warren's Abstract Machine: A Tutorial Reconstruction. MIT Press, Cambridge, Mass., 1991.
[Baase88] Baase, S. Computer Algorithms: Introduction to Design and Analysis. Addison-Wesley, 2nd ed., 1988.
[Bancilhon86] Bancilhon, F., and Ramakrishnan, R. An amateur's introduction to recursive query processing strategies. Proceedings of the 1986 ACM-SIGMOD Conference, 1986, pp. 16-52.
[Birkhoff40] Birkhoff, G. Lattice Theory. American Mathematical Society Colloquium Publications, Vol. 25, New York, 1940.
[Ceri90] Ceri, S., Gottlob, G., and Tanca, L. Logic Programming and Databases. Springer-Verlag, Berlin, 1990.
[Ceri91] Ceri, S., Gottlob, G., and Tanca, L. Datalog: A Self-Contained Tutorial (Part 1). Programmirovanie, No. 4, July-August 1991, pp. 20-38.
[Cheiney94] Cheiney, J.-P., and Huang, Y.-N. Efficient maintenance of explicit transitive closure with set-oriented update propagation and parallel processing. Data and Knowledge Engineering, Vol. 13, No. 3, October 1994, pp. 197-226.
[Clocksin81] Clocksin, W.F., and Mellish, C.S. Programming in Prolog. Springer-Verlag, New York, 1981.
[Consens89] Consens, M.P. GraphLog: "Real Life" Recursive Queries Using Graphs. MSc thesis, Department of Computer Science, University of Toronto, January 1989.
[Consens90] Consens, M., and Mendelzon, A. GraphLog: a Visual Formalism for Real Life Recursion. Proceedings of the 9th ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, 1990, pp. 404-416.
[Consens92] Consens, M., Mendelzon, A., and Ryman, A. Visualizing and Querying Software Structures. Proceedings of the 14th International Conference on Software Engineering, Melbourne, Australia, May 1992.
[Cousot77] Cousot, P., and Cousot, R. Abstract Interpretation: a Unified Framework for Static Analysis of Programs by Construction of Approximation of Fixpoints. Proceedings of the 4th ACM Conference on Principles of Programming Languages, ACM, New York, N.Y., 1977, pp. 238-252.
[Cousot92] Cousot, P., and Cousot, R. Abstract Interpretation and Application to Logic Programs. Laboratoire d'Informatique de l'École Normale Supérieure, Research Report LIENS-92-12, June 1992.
[Dar93] Dar, S., and Agrawal, R. Extending SQL with Generalized Transitive Closure Functionality. IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 5, October 1993, pp. 799-812.
[Debray89] Debray, S.K. Static Inference of Modes and Data Dependencies in Logic Programs. ACM Transactions on Programming Languages and Systems, Vol. 11, No. 3, July 1989, pp. 418-450.
[Debray93] Debray, S.K., and Lin, N. Cost Analysis of Logic Programs. ACM Transactions on Programming Languages and Systems, Vol. 15, No. 5, November 1993, pp. 826-875.
[Debray88] Debray, S.K., and Warren, D.S. Automatic Mode Inference for Logic Programs. The Journal of Logic Programming, Vol. 5, No. 3, September 1988, pp. 207-229.
[Froberg85] Fröberg, C.-E. Numerical Mathematics: Theory and Computer Applications. The Benjamin/Cummings Publishing Co., 1985.
[Fukar91] Fukar, M. Translating GraphLog into Prolog. Technical Report TR 74.080, Centre for Advanced Studies, IBM Canada Laboratory, 1991.
[Gardarin89] Gardarin, G., and Valduriez, P. Relational Databases and Knowledge Bases. Addison-Wesley, 1989.
[Gooley89] Gooley, M.M., and Wah, B.W. Efficient Reordering of PROLOG Programs. IEEE Transactions on Knowledge and Data Engineering, Vol. 1, No. 4, 1989, pp. 470-482.
[Gorlick87] Gorlick, M.M., and Kesselman, C.F. Timing Prolog Programs without Clocks. Proceedings of the 1987 Symposium on Logic Programming, San Francisco, CA, pp. 426-432.
[Graefe91] Graefe, G., and Shapiro, L.D. Data compression and database performance. Proceedings of the 1991 ACM/IEEE-CS Symposium on Applied Computing, 1991.
[Horn51] Horn, A. On Sentences Which are True of Direct Unions of Algebras. Journal of Symbolic Logic, Vol. 16, pp. 14-21.
[Ioannidis90] Ioannidis, Y.E., and Kang, Y.C. Randomized Algorithms for Optimizing Large Join Queries. Proceedings of the 1990 ACM SIGMOD Conference on the Management of Data, Atlantic City, NJ, USA, May 1990, pp. 312-321.
[Ioannidis95] Ioannidis, Y.E., and Poosala, V. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, CA, pp. 233-244.
[Jagadish87] Jagadish, H.V., and Agrawal, R. A Study of Transitive Closure as a Recursion Mechanism. Proceedings of the 1987 ACM SIGMOD Annual Conference (SIGMOD Record, Vol. 16, No. 3, December 1987), San Francisco, CA, USA, May 27-29, 1987, pp. 331-344.
[Jarke84] Jarke, M., and Koch, J. Query Optimization in Database Systems. ACM Computing Surveys, Vol. 16, No. 2, June 1984, pp. 111-152.
[Kruse87] Kruse, R.L. Data Structures & Program Design. Prentice-Hall Software Series, New Jersey, USA, 1987.
[Kwast94] Kwast, K.L., and van Denneheuvel, S.J. Duplicates in SQL. Data and Knowledge Engineering, Vol. 13, No. 1, August 1994, pp. 31-66.
[Lu93] Lu, W., and Lee, D.L. Characterization and Processing of Simple Prefixed-Chain Recursion. Information Sciences, Vol. 68, No. 3, March 1993.
[Mannino88] Mannino, M.V., et al. Statistical Profile Estimation in Database Systems. ACM Computing Surveys, Vol. 20, No. 3, September 1988, pp. 191-221.
[McCarthy82] McCarthy, J. Coloring Maps and the Kowalski Doctrine. Report STAN-CS-82-903, Department of Computer Science, Stanford University, 1982.
[McEnery90] McEnery, A., and Nikolopoulos, C. A Meta-Interpreter for Prolog Query Optimization. Proceedings of the IASTED International Symposium on Expert Systems Theory and Applications, 1990, pp. 75-77.
[Mellish85] Mellish, C.S. Some Global Optimizations for a Prolog Compiler. The Journal of Logic Programming, Vol. 2, No. 1, April 1985, pp. 43-66.
[Mishra92] Mishra, P., and Eich, M.H. Join Processing in Relational Databases. ACM Computing Surveys, Vol. 24, No. 1, March 1992, pp. 63-113.
[Ryman92] Ryman, A. Foundations of 4Thought. Proceedings of CASCON '92, 1992, pp. 133-155.
[Ryman93] Ryman, A. Illuminating Software Specifications. Proceedings of CASCON '93, Vol. 1, 1993, pp. 412-428.
[Ryman93a] Ryman, A. Constructing Software Design Theories and Models. In Studies in Software Design, LNCS 1078, Springer, 1993, pp. 103-114.
[Sellis87] Sellis, T.K. Efficiently supporting procedures in relational database systems. Proceedings of the 1987 ACM SIGMOD Conference, San Francisco, CA, pp. 278-291.
[Sheridan91] Sheridan, P.B. On Reordering Conjunctions of Literals: a Simple, Fast Algorithm. Proceedings of the 1991 Symposium on Applied Computing, Kansas City, MO, USA, April 1991, IEEE Computer Society, pp. 73-79.
[Ullman85] Ullman, J.D. Implementation of Logical Query Languages for Databases. ACM Transactions on Database Systems, Vol. 10, No. 3, 1985, pp. 289-321.
[Ullman88] Ullman, J.D. Principles of Database and Knowledge-Base Systems, Vols. I and II. Computer Science Press, 1988-89.
[Wang93] Wang, J., Yoo, J., and Cheatham, T. Efficient Reordering of C-PROLOG. Proceedings of the 21st ACM Computer Science Conference, NY, USA, 1993, pp. 151-155.
[Warren75] Warren, H.S. A Modification of Warshall's Algorithm for the Transitive Closure of Binary Relations. Communications of the ACM, Vol. 18, No. 4, April 1975, pp. 218-220.
[Warren81] Warren, D.H.D. Efficient Processing of Interactive Relational Database Queries Expressed in Logic. Proceedings of the 7th International Conference on Very Large Data Bases, IEEE, 1981, pp. 272-281.
[Woo85] Woo, N.S. A Hardware Unification Unit: Design and Analysis. Proceedings of the 12th International Symposium on Computer Architecture, Boston, MA, 1985, pp. 198-205.
Appendix 1. A Detailed View of Other Approaches to Query Reordering
A1.1 Efficient Reordering of Prolog Programs by Using Markov Chains
Gooley and Wah's work [Gooley89] has proposed a model that approximates the evaluation strategy of Prolog programs by means of a Markov process. The cost is measured as the number of predicate calls or unifications that take place. The method needs to know in advance the probability of success and the cost of execution of each predicate. With this initial information, the cost of a particular ordering for the subgoals within a single clause is calculated as follows.
Consider a predicate clause p with subgoals s_1, ..., s_n. If q_i is the probability that subgoal s_i fails, and if c_i is the cost associated with executing subgoal s_i, the cost of a failure is given by the formula:
C_fail = q_1 c_1 + (1 - q_1) q_2 (c_1 + c_2) + (1 - q_1)(1 - q_2) q_3 (c_1 + c_2 + c_3) + ... + (1 - q_1)(1 - q_2) ... (1 - q_{n-1}) q_n (c_1 + ... + c_n)
or, in closed form,
C_fail = \sum_{i=1}^{n} \Big( \prod_{j=1}^{i-1} (1 - q_j) \Big) \, q_i \sum_{k=1}^{i} c_k .
The goal of the method is to make failing clauses fail earlier and thus reduce backtracking. In other words, goals that are "more likely to fail" (and inexpensive to evaluate) are placed near the head of the clause. This is usually accomplished by ordering the subgoals in decreasing order of their ratios q_i / c_i.
A very similar approach is used to estimate a suitable ordering for the clauses in a given predicate.
Consider a predicate p defined by clauses k_1, ..., k_m, each of the form k_i: p :- s_1, ..., s_n. If p_i is the probability that clause k_i succeeds, and if d_i is the cost associated with executing clause k_i, the cost of a single success is given by the formula:
C_succ = p_1 d_1 + (1 - p_1) p_2 (d_1 + d_2) + ... + (1 - p_1)(1 - p_2) ... (1 - p_{m-1}) p_m (d_1 + ... + d_m)
or, in closed form,
C_succ = \sum_{i=1}^{m} \Big( \prod_{j=1}^{i-1} (1 - p_j) \Big) \, p_i \sum_{k=1}^{i} d_k .
The goal in this case is to get an initial answer as quickly and inexpensively as possible. In other words, clauses that are "more likely to succeed" (and inexpensive to evaluate) are placed near the beginning of the predicate. This is intuitively accomplished by ordering the clauses in decreasing order of their ratios p_j / d_j.
Note that these formulae assume that the costs and success/failure probabilities of the subgoals in a clause are independent of each other, which is usually not the case in Prolog or GraphLog. Thus, the model is just a coarse approximation, although the behaviour of the clauses may still be predicted with some accuracy.
The success/failure probability and cost of executing the body of a clause (that does not involve recursion) are both calculated once the subgoals s_1, ..., s_n are modelled by either the Markov chain in Figure A1.1, if we are interested in the first solution only, or the Markov chain in Figure A1.2, if we want the cost of finding all solutions.
Note that each subgoal s_i corresponds to a distinct state. From each subgoal state there are two transition arcs, which are labelled with the probabilities of success (p_i) and failure (1 - p_i).
[Figure A1.1 Markov chain for the single-solution case: states s_1, ..., s_n with success probabilities p_1, ..., p_n]
[Figure A1.2 Markov chain for the all-solutions case: states s_1, ..., s_n and the clause body]
The success transition arc connects the subgoal state with the next subgoal state, while the failure transition connects it with the previous one. For the special case of the first subgoal, the failure transition arc goes to an absorbing state labelled F (clause failure). Similarly, the success transition arc for the last subgoal reaches another absorbing state labelled S (clause success).
For GraphLog queries, we are often interested in deriving all solutions rather than just the first one. Notice that when the all-solutions case is under consideration, the S state is no longer an absorbing state and it has a failure transition arc of probability one assigned to it (which mimics backtracking).
Gooley and Wah [Gooley89] explain that these Markov chains can be represented mathematically by means of r x r matrices, where r is the number of states in the chain. The transition matrix for the single-solution case has a banded structure [Revuz81], and a similar matrix is obtained for the all-solutions case. Several cost metrics (such as the number of visits to the success state S and the probabilities and costs for the clause body) can be obtained after some mathematical manipulation of these matrices; for instance, the expected cost of a solution in the all-solutions case can be derived in this way.
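The kind of manipulation involved can be illustrated numerically. The Python sketch below builds the transition matrix of the single-solution chain for three hypothetical subgoals and uses the standard fundamental matrix of an absorbing Markov chain to obtain expected state visits; the success probabilities are invented for illustration, and the construction is a generic one rather than Gooley and Wah's exact formulation.

import numpy as np

# Single-solution Markov chain for a clause with subgoals s1, s2, s3.
# Transient states: s1, s2, s3; absorbing states: F (failure), S (success).
p = [0.8, 0.6, 0.9]          # hypothetical success probabilities

n = len(p)
Q = np.zeros((n, n))         # transient-to-transient transitions
R = np.zeros((n, 2))         # transient-to-absorbing transitions (F, S)
for i, pi in enumerate(p):
    if i + 1 < n:
        Q[i, i + 1] = pi     # success moves on to the next subgoal
    else:
        R[i, 1] = pi         # success of the last subgoal reaches S
    if i > 0:
        Q[i, i - 1] = 1 - pi # failure backtracks to the previous subgoal
    else:
        R[i, 0] = 1 - pi     # failure of the first subgoal reaches F

N = np.linalg.inv(np.eye(n) - Q)   # fundamental matrix: expected visits
print(N[0])                        # expected visits to s1, s2, s3 starting at s1
print(N @ R)                       # absorption probabilities into F and S

Multiplying the expected visit counts by the per-subgoal costs gives the kind of expected-cost figure that the formulas above summarize.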
A1.2 A Meta-Interpreter for Prolog Query Optimization
McEnery and Nikolopoulos [McEnery90] describe a meta-interpreter for Prolog which reorders clauses and predicates. It has two components: (a) a static component in charge of rearranging the clauses "a priori", and (b) a dynamic component that reorders the clauses according to probabilistic profiles built from previously answered queries.
This method's static reordering phase consists of rearranging the clauses that define a predicate in such a way that the most successful clauses are tried first, and the subgoals within a clause are reordered in descending order of success likelihood.
Subgoal reordering is performed by using a generalization of a heuristic due to D.H.D. Warren [Warren81]. Warren proposed a formula for the cost c of a simple query q as given by c_q = s / a, where s is the size in tuples (i.e., the number of solutions) of the subgoal, and a is the product of the sizes of the domains of each instantiated argument.
For example, given the following Prolog database:

nation(canada).
nation(belgium).
nation(uk).

language(canada, french).
language(canada, english).
language(belgium, dutch).
language(belgium, french).
language(belgium, german).
language(uk, english).
language(quebec, french).
language(texas, english).

and the following predicate definition:

french_speaking_nation(N) :- nation(N), language(N, french).

the cost of the query french_speaking_nation(N) would be obtained as follows.
The cost associated with the execution of predicate nation with an unbound argument is given by

    c_nation = s/a = 3/1 = 3

(there are three nations, and thus s = 3; there are no instantiated arguments, therefore a = 1). Similarly, the cost of subgoal language with both arguments bound is estimated as

    c_language = s/a = 8/(5 x 4) = 0.4

(there are eight tuples for the predicate, and thus s = 8; the value of a is derived from the fact that there are five regions and four different languages). Thus, the cost of the whole query would be given by

    c_french_speaking_nation = c_nation + c_language = 3 + 0.4 = 3.4

if the textual ordering is to be applied.

If the alternative order

    french_speaking_nation(N) :- language(N, french), nation(N).
is to be used instead, our estimates will change accordingly. We calculate the cost of subgoal language when the first argument is unbound and the second argument is ground,

    c_language = s/a = 8/4 = 2,

and the cost of subgoal nation with a bound argument,

    c_nation = s/a = 3/3 = 1,

and we obtain the cost of the alternative order as

    c_french_speaking_nation = c_language + c_nation = 2 + 1 = 3.

In other words, this second ordering is estimated to be more efficient than the original one.
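The arithmetic above can be reproduced mechanically. The following is a rough sketch of ours (assuming SWI-Prolog and the nation/1 and language/2 facts listed earlier; the domain sizes of five regions and four languages are taken from the text):

    % s: number of tuples (solutions) of a goal.
    n_tuples(Goal, S) :- findall(x, Goal, Xs), length(Xs, S).

    % Cost of nation(N) with N unbound: a = 1 (no instantiated arguments).
    cost_nation_free(C) :- n_tuples(nation(_), S), C is S / 1.

    % Cost of language(N, french) with N already bound by nation/1:
    % both arguments are instantiated, so a = 5 * 4.
    cost_language_bound(C) :- n_tuples(language(_, _), S), C is S / (5 * 4).

    % Cost of language(N, french) with N unbound: only the second argument
    % is instantiated, so a = 4.
    cost_language_half(C) :- n_tuples(language(_, _), S), C is S / 4.

    % Cost of nation(N) with N bound: a = 3 (three nations).
    cost_nation_bound(C) :- n_tuples(nation(_), S), C is S / 3.

    % ?- cost_nation_free(C1), cost_language_bound(C2), T is C1 + C2.   % T = 3.4
    % ?- cost_language_half(C1), cost_nation_bound(C2), T is C1 + C2.   % T = 3

The two conjunctions at the end reproduce the estimates of 3.4 for the textual ordering and 3 for the alternative one.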
The generalized formula proposed by McEnery and Nikolopoulos is given by:

where s and a are defined as in Warren's formula, and p is the probability of success of the clause under analysis. The authors propose a dynamic evaluation of the value of p, according to the accumulated history of the predicate. Note that the higher the value of p, the lower the cost. This success rate is physically stored in the database as an ordered tuple. Every time a clause succeeds, its success rate is increased by one. The probabilistic profile is given by the ratio of the success rate and the overall sample space.
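The formula itself is not reproduced above. The sketch below assumes the form c = s/(a * p), which matches the description (s and a as in Warren's formula, with a higher success probability p lowering the cost), but that exact form, and all predicate names, are our assumptions rather than code from [McEnery90].

    % generalized_cost(+S, +A, +P, -C): assumed form of the estimate.
    generalized_cost(S, A, P, C) :-
        P > 0,
        C is S / (A * P).

    % The accumulated success counts could be kept as dynamic facts:
    :- dynamic success_count/2.          % success_count(ClauseId, Count)

    record_success(ClauseId) :-
        (   retract(success_count(ClauseId, N))
        ->  N1 is N + 1
        ;   N1 = 1
        ),
        assertz(success_count(ClauseId, N1)).

The probabilistic profile p of a clause would then be its success count divided by the overall sample size.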
A1.3  Cost Analysis of Logic Programs

Debray and Lin [Debray93] have proposed a framework to analyze the cost of logic programs, including programs with simple recursion. The method estimates the number of solutions of a logic program based on the sizes of the diverse predicate arguments. Various measures are included under the generic name "size", such as integer-value, list-length, term-depth, or term-size. Thus, some type information must be inferred and propagated for each argument via a static program analysis.

The method derives size relationships amongst predicate arguments. This size information is propagated to compute the number of solutions generated by each predicate. The size properties of predicate arguments are described by means of two functions: (a) size(arg), which provides the actual size of argument arg, and (b) diff(arg1, arg2), which calculates the size difference between two arguments arg1 and arg2. Each of these functions has a different definition depending on the measure under consideration. For example, the definition of size(arg) for the particular measure "list-length" is as follows:†
    size(t) =
        0,               if t is the empty list,
        1 + size(t1),    if t is of the form [_|t1] for some term t1,
        undefined,       otherwise.

†We use the standard Prolog notation for lists: [H|T] refers to a list whose initial element is H (the head of the list) and whose remainder is another list T (the tail of the list). An underscore '_' represents an anonymous variable, i.e., a variable whose exact binding is irrelevant for our purposes.
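For concreteness, the "list-length" measure corresponds directly to the following executable predicate (a small sketch of ours; the analysis in [Debray93] is performed statically rather than by running such code):

    % list_length_size(+T, -Size): the "list-length" size measure;
    % fails (size undefined) when T is not a proper list.
    list_length_size([], 0).
    list_length_size([_|T1], Size) :-
        list_length_size(T1, S1),
        Size is S1 + 1.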
Note that when the value of a particular size (or of a difference between sizes) cannot be determined from the context, the functions return the special value "undefined".

A distinction between "output" and "input" arguments is also made. The size of an input argument is always calculated from the sizes of previous occurrences of the variables appearing in that argument (the so-called predecessors of the input position). Consider the following predicate clause, in which the measure under consideration is "list-length":

nrev([H|L], R) :- nrev(L, R1), append(R1, [H], R).

in which the call has the following input/output argument positions:

nrev(<input>, <output>),

and, from this calling pattern, we may derive the calling patterns of the subgoals on the right-hand side as nrev(<input>, <output>) and append(<input>, <input>, <output>). For pedagogical reasons, we number the argument positions (and literals) as follows:

nrev(<input1>, <output1>) :- nrev(<input2>, <output2>), append(<input3>, <input4>, <output3>).
The size of <input4> (i.e., [H]) is simply calculated as the size of a unitary list, as given by the function size(arg), i.e., size(<input4>) = 1. The size of <input2> (i.e., L) is obtained by applying the function diff(arg1, arg2) to <input1> ([H|L], the predecessor of <input2>) and <input2> itself, which gives size(<input2>) = size(<input1>) - 1. Finally, the size of <input3> (i.e., R1) is expressed in terms of that argument's predecessor, <output2> (whose value must be calculated elsewhere), as size(<input3>) = size(<output2>).

Size relationships for output argument positions are derived as functions expressed in terms of the sizes of the different input arguments, symbolically written Sz(S, Arg, size(input_1), ..., size(input_n)), where input_1, ..., input_n are the input arguments of subgoal S and Arg is the argument position whose size is being calculated.

In our example, the size of <output2> is expressed in terms of the input argument as size(<output2>) = Sz(nrev,2,size(<input2>)), and the size of <output3> is derived from the sizes of its two input arguments as size(<output3>) = Sz(append,3,size(<input3>),size(<input4>)). Similarly, the size of <output1> is obtained as a function of <output3>, the argument that originates the output value of the whole clause; in this case, size(<output1>) = size(<output3>).

The set of size relations that is obtained for a given predicate clause is then expressed in terms of head input arguments only, a process that is called normalization. For instance, if we already know that Sz(nrev,2,size(x)) = size(x) for a given input x, then size(<input3>) is transformed successively into

    size(<input3>) = size(<output2>) = Sz(nrev,2,size(<input2>)) = Sz(nrev,2,size(<input1>) - 1) = size(<input1>) - 1,

which is expressed in terms of the head input argument exclusively. Once normalization has been performed, a system of difference equations is obtained. To get closed-form expressions, these difference equations need to be solved. Unfortunately, solving difference equations automatically is a difficult problem, although automatic solutions for a wide variety of them have been proposed in the literature.
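As a quick empirical illustration of the relation Sz(nrev,2,n) = n used above (our own check, not part of the analysis), naive reverse can simply be run on a list of fresh variables; the base case nrev([], []) is added for completeness:

    nrev([], []).
    nrev([H|L], R) :- nrev(L, R1), append(R1, [H], R).

    % check_nrev_size(+N): the output of nrev/2 has the same list-length
    % as its input, for an input list of length N.
    check_nrev_size(N) :-
        length(In, N),        % a list of N fresh variables
        nrev(In, Out),
        length(Out, N).

    % ?- check_nrev_size(7).   % succeeds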
The number of solutions generated by a predicate is estimated from the size relationships by counting the number of possible values that every variable in the clause may be bound to. The method exploits two properties of unification that hold in many logic programming languages. One of these properties is that if a variable appears n times in the subgoals of a clause, the total number of distinct bindings that such a variable may have is at most the minimum of the set {b_1, ..., b_n}, where b_i is the number of possible bindings for the variable at each argument position, computed independently. For instance, consider a predicate clause in which a variable Y appears in three subgoals o, p, and q, which always return a bound term for argument positions 1, 2, and 1, respectively. It follows that variable Y will be bound only to those ground values that are common to all three predicates o, p, and q. If the numbers of distinct values at those argument positions are b_o, b_p, and b_q, respectively, the number of distinct bindings for variable Y is at most min{b_o, b_p, b_q}.

Another useful property of unification is that, for subgoals that contain more than one variable, an upper bound on the number of bindings for the subgoal is given by the product of the numbers of bindings of each of its variables. For example, given a subgoal s(Y, X, Y), if b_Y is the number of bindings that variable Y can take and b_X is the number of bindings for variable X, the maximum number of distinct tuples that can be obtained for the subgoal is given by b_Y x b_X. Suppose that variable Y can be bound to the values a and b, whereas variable X can be unified with the values b, c and d. Then the 2 x 3 distinct tuples that can be obtained are s(a,b,a), s(a,c,a), s(a,d,a), s(b,b,b), s(b,c,b) and s(b,d,b). These tuples are called the instances of the subgoal. A function called instance(T) is defined to compute the number of instances of a term T.
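The 2 x 3 instances of s(Y, X, Y) can be enumerated directly; the sketch below is merely our illustration of the counting argument, assuming SWI-Prolog:

    % instances_of_s(-Instances): enumerate the instances of the subgoal
    % s(Y, X, Y) when Y ranges over {a, b} and X ranges over {b, c, d}.
    instances_of_s(Instances) :-
        findall(s(Y, X, Y),
                ( member(Y, [a, b]),
                  member(X, [b, c, d]) ),
                Instances).

    % ?- instances_of_s(L), length(L, N).   % N = 6, the product 2 x 3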
Two additional quantities (functions) are defined for each predicate p in the program: Rel_p, which represents the size of the relation defined by p (i.e., the number of tuples that the predicate generates when all its arguments are uninstantiated), and Sol_p, the solution size for p (i.e., the number of tuples that are obtained for a particular input, as specified by the size values of the (instantiated) input arguments). The function instance(S) is used to calculate both quantities, the main difference being that only output variables are considered for the derivation of Sol_p, whereas all variables (input and output) are used for the calculation of Rel_p.

To obtain values for Sol_p and Rel_p, we need to determine the number of bindings that are possible for the variables contained in the predicate clause under consideration. Two cases are considered: the number of variable bindings for input arguments and the number of variable bindings for output arguments. The value for input arguments is derived by using the above-mentioned properties of unification (i.e., the function instance(S)). The value for output arguments, on the other hand, is bounded by the product of the number of solutions that are expected from the predicate for a particular input size (as obtained from the size relationships for predicate arguments) and the value for the input arguments itself. Again, if recursive predicates are present in the program, a set of difference equations will be obtained.

The upper bounds for these two quantities can be substantially improved if special cases are also considered. Such special cases include: distinct variables that are bound by the same literal, output variables that are instantiated according to the bindings of the input variables, etc. Similarly, the detection of mutually exclusive clauses can produce more precise results.
To demonstrate how the method is applied, consider the following recursive program, which permutes a list of elements:†

perm([], []).
perm(X, [L|L1]) :- select(L, X, Y), perm(Y, L1).

select(H, [H|T], T).
select(X, [H|T1], [H|T2]) :- select(X, T1, T2).

Suppose that we are using "list-length" as the relevant measure, as well as the following input/output mapping:

perm(<input>, <output>),
select(<output>, <input>, <output>),

or, in an alternative, expanded form:

perm(<input1>, <output1>).
perm(<input2>, <output2>) :- select(<output3>, <input3>, <output4>), perm(<input4>, <output5>).
select(<output6>, <input5>, <output7>).
select(<output8>, <input6>, <output9>) :- select(<output10>, <input7>, <output11>).

†The symbol [] denotes an empty list, i.e., a list with no elements at all.
We will concentrate on the simpler predicate select. We try to derive the sizes of both of its output argument positions (the first and the third argument).† The size relations for the first output argument of predicate select are computed as:

    size(<output6>) = undefined,                                         (rule select1)
    size(<output8>) = size(<output10>) = Sz(select,1,size(<input7>)).    (rule select2)

This system of equations yields:

    Sz(select,1,size(<input5>)) = size(<output6>) = undefined.

The size relations for the other output argument of select (the third argument position) are as follows:

    size(<output7>) = size(<input5>) - 1,        (rule select1)
    size(<output9>) = size(<output11>) + 1,      (rule select2)

where

    size(<output11>) = Sz(select,3,size(<input7>)),

or, in normalized form,

    size(<output11>) = Sz(select,3,size(<input6>) - 1).

Combining this set of conditions, we finally obtain:

    Sz(select,3,size(<input5>)) = size(<output7>) = size(<input5>) - 1,                     (rule select1)
    Sz(select,3,size(<input6>)) = size(<output9>) = Sz(select,3,size(<input6>) - 1) + 1.    (rule select2)

This results in a system of difference equations of the form

    f(x) = x - 1,              (rule select1)
    f(x) = f(x - 1) + 1,       (rule select2)

where f(x) stands for Sz(select,3,x). In this case, the solution is straightforward (which is not always the case):

    f(x) = x - 1,

or, after variable substitution,

    Sz(select,3,n) = n - 1.

†For each output argument position we obtain one equation for each rule, expressed in terms of its respective input: <input5> for rule select1, and <input6> for rule select2.
The next step is to estimate the number of solutions that predicate select is expected to generate. Consider first the recursive subgoal select(<output10>, <input7>, <output11>) in clause select2. The number of bindings for the input variable T1 (<input7>) is equal to 1 (since every input variable has an initial, unique value). We know that the number of bindings of both output variables, X and T2, is bounded by

    1 x Sol(select,size(<input7>)) = 1 x Sol(select,size(<input6>) - 1).

In this case, the total number of solutions for clause select2 equals the number of bindings for the output arguments and is expected to be:

    Sol(select,size(<input6>)) = Sol(select,size(<input6>) - 1).

Consider now the other predicate clause, select1. Again, the number of bindings for the input variables H and T (<input5>) is equal to 1. Since they are also the only output variables, we get:

    Sol(select,size(<input5>)) = 1.

In other words, the following equations are obtained:

    Sol(select,size(<input5>)) = 1,                                   (rule select1)
    Sol(select,size(<input6>)) = Sol(select,size(<input6>) - 1),      (rule select2)

which can be combined into

    f(x) = f(x - 1) + 1,

since both clauses are mutually exclusive.

Using boundary conditions (namely, that f(0) = 0 must hold), the final answer is obtained by solving the difference equation†:

    Sol(select,size(<input6>)) = size(<input6>),

and we conclude that predicate select will generate at most n solutions for an input of size n.

†This particular difference equation has a trivial solution. However, one major challenge faced by Debray and Lin's framework is that many real-life difference equations cannot be solved automatically.
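The prediction is easy to check empirically. The sketch below is ours, not code from [Debray93]; the predicate is renamed select_mine/3 to avoid clashing with the library select/3 of SWI-Prolog.

    select_mine(H, [H|T], T).
    select_mine(X, [H|T1], [H|T2]) :- select_mine(X, T1, T2).

    % count_solutions(+N, -Count): number of solutions of select for an
    % input list of N (fresh) elements.
    count_solutions(N, Count) :-
        length(List, N),
        findall(X-Rest, select_mine(X, List, Rest), Solutions),
        length(Solutions, Count).

    % ?- count_solutions(5, Count).   % Count = 5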
Appendix 2
Primitive Constants in a Uniform Distribution
For the special case of a uniform and independent distribution of attribute values, n_choicepoints is given by:

where n_tuples is the number of tuples in the database for predicate p(P_1, P_2, ..., P_n), and K_k is a reduction factor for argument P_k:

    K_k = 1,                                                  if no indexing is applied to argument position k,
    K_k = number of distinct values for argument position k,  if argument indexing is involved and the argument is ground

(note that the value of K_k is the same regardless of whether hash-table collisions occur or not). Note that this formula assumes a uniform distribution.
    K(k) = 1,                                                  if the argument is a variable or if the argument position is subjected to indexing,
    K(k) = number of distinct values for argument position k,  if argument indexing is not involved.

Thus, if we define

    δ_idx(k) = 1 if argument position k is indexed, and 0 otherwise,

and

    δ_var(k) = 1 if P_k is a variable, and 0 otherwise,

we can derive the following final equations:
For the trivial case of a uniform distribution, the following formula can be used to calculate F̄_p, the expected number of tuples that, on average, have to be visited in order to find the first solution (n represents the arity of the subgoal):

    0,                                  if P_k is a ground term and P_{k-1} is also a ground term,
    ( Π_{m=k}^{n} Φ(m) - 1 ) / 2,       if P_k is a ground term and P_{k-1} is not ground,
    0,                                  if P_k is not ground and k = 1,
    0,                                  if P_k is not ground and P_{k-1} is not ground,

where

    Φ(m) = 1,                                                  if m is an indexed position and P_m is ground,
    Φ(m) = number of distinct values for argument position m,  otherwise.
Table A2.1 shows the corresponding values of F̄ for the ternary case, assuming that no position is indexed. Once again, S_k stands for the number of distinct values for argument position k.

Table A2.1  Values of the traversal factor for the ternary predicate example
Appendix 3
Method of Measurement
The general method to measure the CPU time required to execute a given Prolog query consists of repeating the execution of the query a certain number of times and taking the average of these measurements. The general scheme is shown in Figure A3.3, where main represents the query under consideration. Note that we must discard the contribution to the execution time due to the loop itself.

Figure A3.3  General method to measure CPU execution times
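The exact code of Figure A3.3 is not reproduced here; the following is a minimal sketch of such a harness in the same spirit (our own, assuming SWI- or QUINTUS-style statistics/2; n_repetitions and the placeholder main/0 are illustrative):

    main :- true.                  % placeholder: replace with the query under study

    n_repetitions(1000).           % illustrative constant

    % Force all solutions of main/0 through backtracking, then succeed.
    all_solutions :- main, fail.
    all_solutions.

    % An empty loop with the same structure, used to measure (and later
    % subtract) the overhead of the loop itself.
    empty_loop(0) :- !.
    empty_loop(N) :- N1 is N - 1, empty_loop(N1).

    query_loop(0) :- !.
    query_loop(N) :- all_solutions, N1 is N - 1, query_loop(N1).

    % measure(-AvgMs): average CPU time (in milliseconds) per execution of
    % the query, with the loop overhead discarded.
    measure(AvgMs) :-
        n_repetitions(R),
        statistics(runtime, _),
        empty_loop(R),
        statistics(runtime, [_, LoopMs]),
        query_loop(R),
        statistics(runtime, [_, TotalMs]),
        AvgMs is (TotalMs - LoopMs) / R.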
Appendix 4
A Performance Model for QUINTUS Prolog

In this section, the performance model is applied to QUINTUS Prolog for the packages example in [Consens92]†. Essentially, the database contains the following extensional DB predicates:

part/2: an enumeration of the 1,640 parts in the system;
uses/2: a set of 4,075 facts which establish which part uses another part.
An intensional DB predicate that can be used to determine the relation "A is part of package B" is as follows:

part_of(A,B) :- part(B,A).

The following Prolog query may be used to determine all the packages X such that X contains a part A that uses a part B which is in turn contained in a package Y different from X:

pkg_uses(X,Y) :- uses(A,B), part_of(A,X), part_of(B,Y), \+(X=Y).

This conjunction of four subgoals can be rearranged into several different orders without affecting the accuracy of the result. Some orderings are forbidden due to the fact that the subgoal \+(X=Y) involves negation, thus requiring that both of its arguments have a bound value before the predicate is evaluated. The following table (Table A4.1) shows all valid orderings for the query.

†Consens, M., Mendelzon, A., and Ryman, A. Visualizing and Querying Software Structures. Proceedings of the 14th International Conference on Software Engineering, Melbourne, Australia, May 1992.
ordering | subgoal #1 | subgoal #2 | subgoal #3 | subgoal #4

Table A4.1  Valid orderings for the query pkg_uses/2

A4.1  Database profile

The packages example can be viewed as a set of extensional and intensional predicates along with the safe queries that are to be applied to the database.

(a) Extensional predicates. We will assume that they follow a strict uniform distribution of attribute values.
predicate name | number of tuples | distinct values in argument 1 | distinct values in argument 2

Table A4.2  The extensional database predicates

(b) Intensional predicate. There is one intensional predicate, part_of/2, which is defined upon one of the extensional predicates, a feature that simplifies the analysis.
We consider all eight valid orderings for the packages example:

:- uses(A,B), part_of(A,X), part_of(B,Y), \+(X=Y).      ordering #1
:- uses(A,B), part_of(B,Y), part_of(A,X), \+(X=Y).      ordering #2
:- part_of(A,X), uses(A,B), part_of(B,Y), \+(X=Y).      ordering #3
:- part_of(A,X), part_of(B,Y), uses(A,B), \+(X=Y).      ordering #4
:- part_of(A,X), part_of(B,Y), \+(X=Y), uses(A,B).      ordering #5
:- part_of(B,Y), uses(A,B), part_of(A,X), \+(X=Y).      ordering #6
:- part_of(B,Y), part_of(A,X), uses(A,B), \+(X=Y).      ordering #7
:- part_of(B,Y), part_of(A,X), \+(X=Y), uses(A,B).      ordering #8

Note that the built-in predicate \+(X=Y) has been left out of the analysis, since the performance model is not applicable to system predicates.
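As a side note, all of the safe orderings compute the same answer set; only their cost differs. A quick sanity check of ours (assuming the uses/2 and part_of/2 predicates above are loaded; pkg_uses_1/2 and pkg_uses_6/2 are hypothetical names for orderings #1 and #6):

    pkg_uses_1(X, Y) :- uses(A, B), part_of(A, X), part_of(B, Y), \+ X = Y.
    pkg_uses_6(X, Y) :- part_of(B, Y), uses(A, B), part_of(A, X), \+ X = Y.

    % same_answers: both orderings produce exactly the same set of X-Y pairs.
    same_answers :-
        setof(X-Y, pkg_uses_1(X, Y), Answers),
        setof(X-Y, pkg_uses_6(X, Y), Answers).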
A4.2  Abstract Domains

Table A4.3 describes Debray's domain for this particular query. Table A4.4 indicates the cost domain that applies to the extensional predicates, while Table A4.5 gives the cost domain for the intensional predicate and the different queries.

Table A4.3  Debray's domain for all predicates

Table A4.4  Cost domain for the extensional predicates
Table A4.5 summarizes the specific values of the cost domain for both the intensional predicate and the different queries. (In fact, once the built-in predicate is removed from the queries, the eight valid orderings yield only four distinct queries: (a) ordering #1 = ordering #2; (b) ordering #3; (c) ordering #4 = ordering #5 = ordering #7 = ordering #8; and (d) ordering #6.)

A4.3  Cost metrics

Some cost metrics are summarized in this subsection. The values of the basic constants are specific to QUINTUS Prolog under AIX.

Head unification probabilities: prob_1 = 1 (it always unifies).

Empirical constants used:

Clause cost metrics: Table A4.6 shows a summary of the cost metrics for all predicates, whereas Table A4.7 provides them for the intensional predicate.
Table A4.5  Cost domain for the intensional predicate and the main query

Table A4.6  Cost metrics for all predicates

<clause, call pattern>   p(hunif)   n_hp        n_u
<part_of/2, [g,f]>       1          1 + 1640    1 + 2 x 1640
<part_of/2, [f,f]>       1          1 + 1640    2 + 3 x 1640

Table A4.7  Cost metrics for the intensional predicate
A4.4  Query cost formulae

cost(ordering #6) = cost(<part_of/2, [f,f]>)
                    + n_sol(<part_of/2, [f,f]>) x ( cost(<uses/2, [f,g]>)
                    + n_sol(<uses/2, [f,g]>) x cost(<part_of/2, [g,f]>) )

A4.5  Comparison between the Model Prediction and the Experimental Results
Table A4.8 presents a summary of the values predicted by the performance model, compared with the values that were obtained experimentally. It should be noted that the performance model was able to predict the correct order of performance of the queries.

Ordering #   Theoretical value   Experimental value   Error
1            93021               107054               15.08%

Table A4.8  Theoretical and Experimental Values for the Packages Example