List of Tables

Table 1.1. Cost of the evaluation of a given query using different orderings
Table 3.1. Typical Experimental Results for a Ternary Predicate for SICStus Prolog
Table 3.2. Typical Experimental Results for a Ternary Predicate for SB-Prolog
Table 3.3. Number of times that the WAM instructions are executed
Table 3.4. Number of times that the WAM instructions are executed (simplified version)
Table 3.5. Approximate Theoretical Values for a Ternary Predicate
Table 3.6. Average cost error introduced by our approximation
Table 3.7. The book titles database
Table 3.8. Orderings ranked by their costs
Table 3.9. The books database profile
Table 3.10. The extended books database
Table 3.11. Predictions for all predicates
Table 3.12. Predictions for the intensional database predicate
Table 5.1. The linear region
Table 5.2. The intermediate region
Table 5.3. Percentages of the maximum value for nb = 1.6 m
Table 5.4. Percentages of the maximum value for some factors
Table 5.5. Comparison between the formula and the experimental results
Table 5.6. The exponential region
Table 5.7. Estimating the cardinality of a recursive predicate
Table 6.1. Number of visited tuples for ordering #1
Table 6.2. Number of visited tuples for ordering #2
Table 6.3. Expected number of visited tuples
Table 6.4. Comparison between the predicted and experimental values
Table 6.5. The performers database predicates
Table 6.6. Predicted values of two cost contributors for the non-recursive query
Table 6.7. Experimental results for the non-recursive query (rankings in square brackets)
Table 6.8. The modified performers database profile
Table 6.9. Experimental results for the recursive predicate
Table 6.10. Efficiency of the transitive closure for different calling patterns
Table 6.11. The extensional database predicates
Table 6.12. Different orderings for the query under consideration
Table 6.13. Experimental results for the three most efficient orderings
Table 6.14. Cost metrics for all predicates
Table A2.1. Values of the Traversal Factor for the Ternary Predicate Example
Table A4.1. Valid orderings for the query pkg_uses/2
Table A4.2. The extensional database predicates
Table A4.3. Debray's domain for all predicates
Table A4.4. Cost domain for the extensional predicates
Table A4.5. Cost domain for the intensional predicate and the main query
Table A4.6. Cost metrics for all predicates
Table A4.7. Cost metrics for the intensional predicate
Table A4.8. Theoretical and Experimental Values for the Packages Example
List of Figures

Figure 1.1. Three representations of a given database tuple
Figure 1.2. A graph representation of a rule
Figure 1.3. A graph representation of a GraphLog relation
Figure 1.4. A query as a series of successive operations
Figure 1.5. The cost of a general predicate is the sum of the cost of its individual rules
Figure 1.6. Two general alternatives for a cost model framework
Figure 2.1. Sets of lists of arguments for two evaluable predicates that ensure safety
Figure 3.1. Partial translation of a fact
Figure 3.2. (a) An extract from one of the databases that were used and (b) typical subgoals which retrieve these facts
Figure 3.3. Debray's lattice for mode analysis
Figure 3.4. Abstract interpretation applied to Prolog unification given two terms t1 and t2
Figure 4.1. Frequency diagram of an attribute that may be approximated by a discrete normal distribution
Figure 4.2. Two ternary predicates s1 and s2
Figure 4.3. Join of predicates s1 and s2
Figure 4.4. Selection after the join of predicates s1 and s2
Figure 4.5. Final projection of arguments 3 and 5
Figure 4.6. Cost contributors are estimated for each subgoal
Figure 5.1. Region for small values
Figure 5.2. Region for large values
Figure 5.3. GraphLog program
Figure 5.4. GraphLog program for the recursive program
Figure 5.5. Graphical representation of base predicates up and down
Figure 6.1. The GraphLog database
Figure 6.2. The 1984 United States Congressional Voting Records Database
Figure 6.3. Two orderings that we wish to compare
Figure 6.4. Abstract black boxes for Example 1
Figure 6.5. Interconnection of the black boxes for Example 1
Figure 6.6. Experimental results for both orderings
Figure 6.7. Six orderings that we wish to compare
Figure 6.8. Six orderings that we wish to compare
Figure 6.9. Abstract black boxes for Example 2
Figure 6.10. Interconnection of two black boxes in Example 1
Figure 6.11. Sample tuples from the performers database
Figure 6.12. Abstract black boxes for the non-recursive query
Figure 6.13. Expected values for the cost contributors for a specific ordering
Figure 6.14. Abstract black boxes for the recursive query
Figure 6.15. Abstract representation of the different orderings
Figure 6.16. Abstract black boxes for some predicates in the packages example
Figure 6.17. Abstract black boxes for predicate pa-f
Figure 6.18. Abstract black boxes for predicate cycle
Figure 6.19. Impact of the underlying database on the performance of the call
Figure A1.1. Markov chain for the single solution case
Figure A1.2. Markov chain for the all-solutions case
Figure A3.1. General method to measure CPU execution times
I would like to thank Dr. Horspool for his patience and encouragement; IBM Toronto Laboratory for suggesting the topic, providing a Ph.D. fellowship and hosting a work term; Dr. Wadge and Dr. Ryman, who offered a number of insights; and, finally, last but certainly not least, my parents and brother, for their long-standing devotion and support.
To:
Jan Dournen (Delean)
Horacio Franco
qoercf Mdlender
Bruno Cornec
Gwenael Faucher
Bogislav Rarrschrrt
Federico Marincola
Daive Lumpson
Gryow Cargci
Ken-ichi Murata
Shel Ritter
Gustav Leonhardt
Sigiswald Kuijken
Grupo Cinco Siglos
Chapter 1.
Introduction. Query Optimization in GraphLog

In this dissertation, we propose a cost model for GraphLog, a query language that is based on a graph representation of both databases and queries. Specifically, GraphLog is the query language used by 4Thought, a software engineering tool aimed at helping engineers understand and solve a class of software engineering problems that involve large sets of objects and complex relationships amongst them [Consens92] [Ryman92] [Ryman93]. GraphLog queries ask for patterns that must be present or absent in the database graph. Our framework is able to estimate the relative cost of execution of different orderings of semantically equivalent GraphLog queries, thus allowing us to reject those query orderings whose execution may be more inefficient. Our model assumes a top-down evaluation strategy [Ceri90].
Given the fact that one of the distinguishing characteristics of GraphLog is the capability to express queries with recursion or closures, and since no previous cost model has addressed the cost estimation of recursion and closures for a GraphLog-like language, our original solution to this problem is of particular interest. Our methodology has been evaluated on several real-life databases with encouraging results.

In this chapter, we analyze some general issues relevant to query optimization in general, and query reordering in particular. We also introduce the language that our work will be applied to. Finally, we give an overview of what we have accomplished.
1.1 Query Optimization
Query optimization [Jarke84] is directly concerned with the efficient execution of database queries. Its main goal is to minimize the resources needed to evaluate a query that retrieves information from a given database. A query optimizer normally generates and analyzes different alternatives to determine an efficient plan of execution. Optimizing a query can reduce processing time by a factor whose value depends on the sizes of the database definitions†. The choice of plan is often based on cost models that capture the contributions due to different factors such as the sizes of the relations under consideration or the expected number of tuples retrieved by an intermediate operation.

If, for instance, a user poses the query "find all Japanese collectors who own a Stradivarius violin", the query optimizer would usually need some information about the statistical profile of the database (how many Japanese collectors are stored in the database, how many individuals are expected to own a Stradivarius violin, and more). Given these premises, the optimizer may establish a suitable plan to solve the problem efficiently. A plan of execution has to take into account several different factors, including the order of operations, the searching algorithms that are used and the database structure itself.
Some of the most common strategies adopted in query optimization include:

1. Selection of the most efficient overall evaluation method (i.e., the computational model that derives all the solutions to the query). The algorithm that is used to search for the answers clearly has an influence on the efficiency of execution of the query. No evaluation method is intrinsically superior to the others. In fact, the performance of different evaluation methods depends on the nature of the problem. Typical evaluation methods include bottom-up evaluation, top-down evaluation, and combinations of both. Here, the optimization (i.e., the decision as to which evaluation method is the most suitable for the given query) is performed during the evaluation process itself.
2. Determination of the best syntactic rearrangement of the query subgoals. Given that the order of execution of the subgoals can substantially influence the time that is required to retrieve the answers to the query, it is usually advantageous to find the goal ordering that is the least expensive to execute. Unfortunately, since the number of combinations increases geometrically with the number of subgoals in the query, an exhaustive search through all possible combinations may become prohibitive. A practical cost model is needed to compare the performance of different orderings and select a suitable (efficient) ordering.

3. Transformation of the original user query into an equivalent one which can be executed more efficiently. In some cases, standard simplifications may be applied to the new query, whereas they may not have been applicable to the initial query. However, this process of query rewriting does not guarantee that a more efficient query will be found. In some cases, a loss in efficiency may occur.

†For instance, we will show a simple example in which a reduction factor of 2,000 is achieved.
If the evaluation is performed by a specific "machine", we will be more interested in the last two approaches to query optimization (a fixed evaluation strategy is the usual case for many query languages).

Our work will address the issue of selecting the best syntactic rearrangement of the query subgoals for a specific query language, namely GraphLog [Consens89]. We will refer to this problem as query reordering.
1.2 Datalog
There has been extensive work directed towards tackling the traditional database programming paradigm. However, with a recent trend towards integrating the database and logic programming paradigms, new requirements and challenges demand a different approach to the special problems raised by the logic programming paradigm.

This dissertation is specifically focused on GraphLog, a language that incorporates the two above-mentioned programming paradigms. Since GraphLog is closely related to Datalog, a relatively well-known logic query language, we proceed to give a brief overview of this language.

Datalog [Ullman88] is a language that applies the principles of logic programming to the field of databases. Datalog was specifically designed for interacting with large databases. The language is based on first-order Horn clauses without structures as arguments, i.e., only constants and variables are allowed. Constant arguments are also referred to as ground atoms. Most underlying Datalog concepts are similar to those in Logic Programming [Ceri90]. In fact, the design of Datalog has been noticeably influenced by one of the most popular logic programming languages, Prolog [Clocksin81]. We proceed to give a brief description of the language. A more detailed coverage of the language can be found in the literature [Ullman88] [Gardarin91] [Ceri90].
A Datalog program consists of a finite set of logic clauses often referred to as facts and rules. Facts are assertions that define true statements about some objects and their relationships. Typical facts are "Felix is a man" or "The square of 5 is 25". The Datalog notation for these facts is:

male(felix).
square(5, 25).

The atomic symbol that names the relationship is said to be the predicate. In the example, male and square are predicate symbols. The objects that are affected by the relationships are named the arguments or data objects. In our example, these are the constant values felix, 5 and 25. As a notational convention, both predicate symbols and constant arguments are written with an initial lower-case letter. The collection of facts is usually referred to as the database.
Rules are collections of statements that establish some general properties of the objects and their relationships. Broadly speaking, rules permit the derivation of facts from other facts. A Datalog rule is expressed in the form of Horn clauses [Horn51], that is, clauses having the general form:

P if Q1 and Q2 and ... and Qn

or, in Datalog notation,

p :- q1, q2, ..., qn.

p being the head of the rule and the conjunctive part being the body of the rule. Each qi is named a subgoal of the rule.
Rules usually make use of variables to represent general objects rather than specific ones. Variables are represented by identifiers that must commence with a capital letter. For example, the predicate

son(X, Y) :- male(X), parent(Y, X).

can be interpreted as "X is a son of Y if X is male and Y is a parent of X". The predicates male and parent should be defined elsewhere, either as facts or as rules.
The user may request information from the database by entering queries. These are Horn clauses which lack a head and can be evaluated or verified against the facts and rules in the program. For example, the query

:- patient(Name, Disease), tropical(Disease).

may be used to retrieve the names of those patients that have suffered a tropical disease according to their clinical history. The answer to this query is given by the set of all tuples that satisfy the query†.

†For practical reasons, some systems have the option of retrieving just a subset of the whole answer (by reporting the first instances of the solution that are derived).
1.3 GraphLog
A related language is GraphLog [Consens89]. GraphLog is a graphical database query language based on Datalog, and enriched by some additional features (specifically, the formulation of path regular expressions). One of its original aims was to facilitate programming via a graphical representation of the programmer's designs and intentions. The main idea is that a relational database can be represented as a graph, and graphs are a very natural representation for data in many application domains (for instance, transportation networks, project scheduling, parts hierarchies, family trees, concept hierarchies and Hypertext) [Consens89] [Consens90] [Fukar91] [Consens92] [Ryman92] [Ryman93].
Each node in the graph is labelled by a tuple of values; they correspond to the attribute values in the database. Each edge in the graph is labelled by the name of a relation and an optional tuple of values. The set of values in both the edge label and the nodes connected by the edge, together with the name of the relation in the edge, correspond to one tuple in the database. Figure 1.1 shows three equivalent graph representations of a given database fact.

Figure 1.1 Three representations of a given database tuple
General relations (rules) and queries may also be represented by graphs. Every edge in the graph represents a relation amongst data objects as represented in the nodes connected by the edge (and optionally in the edge). These data objects are the predicate arguments and they can be either variables or constants. The rule itself is represented by a special edge (called the distinguished edge) that also connects a pair of nodes. For instance, Figure 1.2 shows a graph representation of the rule:

son(X, Y) :- male(X), parent(Y, X).

Figure 1.2 A graph representation of a rule
Another example of a GraphLog relation is given in Figure 1.3, where a further rule is defined. This example shows that the graph does not have to be a connected graph. Note also that the arguments are ordered as follows†: (a) first, those appearing in the "starting" node; (b) those shown in the "ending" node; and (c) those specified in the edge.

Figure 1.3 A graph representation of a GraphLog relation
GraphLog is a language that represents database facts, rules and queries as graphs as described above. A formal definition of this query language can be found in [Consens89]. It is shown that a GraphLog program has an equivalent Datalog program associated with it. Of particular relevance is the fact that GraphLog allows programmers to express recursive relations, thus providing a greater expressive power than that of traditional relational algebra.
1.4 The Importance of Query Reordering
The efficiency with which a logic programming language§ executes a query is critically dependent on the order in which goals are expressed in a conjunction [Warren81]. Query reordering is an important query optimization technique for finding more efficient evaluation orders for the predicates. The main goal of this technique is to reduce the number of alternatives to be explored.

†In fact, arguments may be specified in prefix, postfix or infix notation. 4Thought favours the infix convention.
§It is assumed that the specific resolution technique used is SLD-resolution [Ceri90].
To determine more efficient ways of evaluating a given set of subgoals, it is convenient to have some information about the actual (extensional) database. Knowledge of some parametric values of the database can help determine an approximate execution cost that is to be associated with every subgoal. Query reordering usually requires at least three different processes: (a) gathering a database profile or some general knowledge of the characteristics of the database tuples; (b) estimating costs for different orderings (in the ideal case, for all possible valid orderings)†; and (c) determining the best order. In this dissertation, we concentrate on the second issue, i.e., trying to predict the (relative) cost of evaluating a query (any query) for a given database.
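The following Prolog sketch is only meant to illustrate how steps (b) and (c) fit together; safe_ordering/1 and estimated_cost/2 are placeholders (given trivial dummy definitions here so the fragment runs), not code from this dissertation.

:- use_module(library(lists)).  % for permutation/2

% Rank the valid permutations of a list of subgoals by their estimated cost
% and return the cheapest one.
best_ordering(Subgoals, Best) :-
    findall(Cost-Ordering,
            ( permutation(Subgoals, Ordering),
              safe_ordering(Ordering),           % placeholder safety test
              estimated_cost(Ordering, Cost) ),  % placeholder cost estimator
            Candidates),
    keysort(Candidates, [_-Best|_]).

% Dummy stand-ins so the sketch is self-contained; a real system would plug in
% the safety rules of Section 2.1.3 and the cost model of later chapters.
safe_ordering(_).
estimated_cost(Ordering, Cost) :- length(Ordering, Cost).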
1.4.1 Effect of Query Reordering
To illustrate the effect that query reordering may have on the performance of a query, we use the following example that describes a Prolog database‡.

Example. Consider a database that consists of three predicates:

book(Title, Publisher_Name, Author_Name). A collection of book titles along with their publishers and authors.

publisher(Publisher_Name, City). A list of different cities where book publishers have an authorized distributor.

author(Author_Name, Nationality). A group of facts that relate authors to their respective nationalities.

Suppose that we wish to retrieve a list of tuples <Title, Publisher_Name, City, Author_Name> of those publications whose author has Dutch nationality.

†Although the database profile may be used to estimate the cost of some simple subgoals (for instance, facts), the cost of more complex (derived) subgoals requires some additional computational work.
‡These results also apply to GraphLog, especially since GraphLog queries are usually translated into Prolog under current implementations of the language.
Since this query involves all three predicates, there are 3! different ways to express it:

:- book(T, P, A), publisher(P, C), author(A, dutch).
:- book(T, P, A), author(A, dutch), publisher(P, C).
:- publisher(P, C), book(T, P, A), author(A, dutch).
:- publisher(P, C), author(A, dutch), book(T, P, A).
:- author(A, dutch), publisher(P, C), book(T, P, A).
:- author(A, dutch), book(T, P, A), publisher(P, C).
The answer will be the same, regardless of the chosen order. However, depending on the characteristics of the underlying database, the timings of the queries will not be the same. For example, we applied all six orderings to a particular database with 3,000 book titles, 20 different publishers, 450 authors, 30 nationalities and 380 cities worldwide, and observed the costs shown in Table 1.1. The figures were obtained using SICStus Prolog version 1.2 and Stony Brook Prolog (SB-Prolog) version 3.0 measured on a Sun SPARCstation SLC. All execution times are estimated, according to the implementation manuals, in "artificial" units. The database under consideration comprised 3,000 facts for the book predicate, 1,766 facts for the publisher predicate and 450 facts for the author predicate.
ordering                     cost using SICStus Prolog    cost using SB-Prolog
publisher-author-book        3434745                      3152360
author-publisher-book        3438660                      3125060
publisher-book-author        260040                       443900
author-book-publisher        2690                         2810

Table 1.1 Cost of the evaluation of a given query using different orderings
It is clear from this example that the order of the subgoals substantially affects the performance of the Prolog query. It is also evident that the particular Prolog implementation may affect the choice of the best ordering as well.
1.5 Our Dissertation
A cost model of a particular implementation of the language GraphLog (in which Prolog is the target language) is proposed in this dissertation. In particular, we address the issue of ranking different (syntactically-equivalent) arrangements of a given query in order to select the (potentially) most efficient ordering. One major feature of our methodology is the ability to estimate the cost of recursive queries and transitive closures.
1.5.1 The Problem Solved
Essentially, we have derived a methodology that allows us to choose a potentially less expensive ordering amongst a group of valid subgoal orderings. In other words, our proposed framework is able to rank different orderings according to their expected execution cost. Rather than assigning absolute values (i.e., exact execution times) to the different orderings under consideration, we are only interested in predicting their expected relative cost. Execution time is used as the determining factor in the analysis.

We may state the general problem as follows. Given a GraphLog query q of the form

:- q1, q2, ..., qm.

we are to estimate the relative cost of any given ordering of the subgoals.
Our methodology only ranks different orderings. It does not select potentially good candidates from the whole spectrum of valid orderings. It is the responsibility of a preprocessor to select a subset of potentially cheap orderings to start with (especially if the number of permutations of orderings would make an exhaustive analysis prohibitive). In fact, since we are interested in finding a permutation of the subgoals that yields a more efficient plan of execution, there are at most m! possible orderings (some of them may be invalid as they may not comply with the safety rules of the query language), so that it is not always feasible to test them all individually. A practical approach is to select a subset of the orderings, namely those that are potentially less expensive to execute. Then, we can estimate the cost of execution of each ordering in the subset to determine a good ordering. There are several methods to select subsets of potentially efficient orderings, amongst them Sheridan's algorithm [Sheridan91] and simulated-annealing-based algorithms [Ioannidis90].
1.5.2 Overview of Our Cost Model
In general, we have assumed that some information about the underlying database† is available. Sheridan's algorithm [Sheridan91] is the framework of choice when no information regarding the databases can be obtained.

For any given ordering, a mode analysis [Debray88] is performed to determine the degree of instantiation of the subgoal arguments. For the case of the previously-mentioned Prolog implementation of GraphLog, our model takes into account the specific evaluation strategy of this language under a particular implementation (namely, the WAM [Ait91]).

We have chosen to consider what we call the average behaviour for queries. Given all possible valid queries that the user may pose for a particular calling pattern (cf. Debray's framework), we estimate an average value of all their expected execution timings and use this value as the expected cost of the given query.‡ The framework in its present state does not produce any additional information such as measures of the dispersion of the values with respect to the average value, or corresponding upper and lower bounds.††

Furthermore, rather than a detailed and expensive exact solution, our model considers the process of solving a query as a set of general actions only.

We have determined that a convenient way to obtain a suitable ranking for the orderings under study is to consider the existence of what we have called cost contributors, which we proceed to explain in the following subsection.

†For instance, we assume that the number of tuples for each database fact and the number of distinct values for each argument position are available.
‡Thus, we are assuming that all queries have an equal probability of being posed, which is a major assumption.
††In fact, we decided not to use intervals to characterize the results, based on the fact that for a transitive closure the resulting intervals were normally too wide to be of practical use.
Additionally, we have developed a methodology to estimate the average number of solutions associated with the query, this being an implementation-independent quantity. In fact, Debray and Lin's related work [Debray93], which derives a cost model of logic programs, is mainly concerned with this sole issue. Our model is more general as it handles recursive and closure predicates.

One major consideration that was regarded as essential since the inception of this dissertation was to produce a framework that is as simple as possible while still producing acceptable results. We strongly believe that our model is simple, both conceptually and from the point of view of a practical implementation. We have tested our methodology on several real-life (large) databases. Some detailed case studies are given in Chapter 6.
Cost Contributors
Rather than analyzing the nature of the exact machine code that is generated (for instance, in the form of machine cycles that are required to execute the instructions), a simpler analysis is often desirable, although at the expense of a potential loss in precision. The general idea is to determine some generic activities or groups of operations that are directly related to the cost of execution of the query and then estimate the individual costs associated with such components. Therefore, we wish to single out some "cost contributors" that influence the efficiency of the code execution. Some typical cost contributors are (1) the number of tuples in the database that are visited to find the global solution, (2) the number of matching (unification) attempts that take place during the resolution process, and (3) the number of solutions or answers to the query that are gathered and displayed (we also have to consider any associated backtracking that may occur when new solutions are attempted). Some contributors may have a greater impact on the query performance than others. For instance, it has been reported that a Prolog program may spend 55-70% of its time unifying and 15-35% of its time backtracking [Woo~~].†

†This behaviour is specially relevant to our work, since the current implementation of the GraphLog interpreter generates Prolog code as the target language. For this reason, the number of visited tuples is a relevant cost contributor (if not the most relevant).
Unfortunately, many of these quantities are both model- and machine-dependent. For example, if the model uses clause indexing to narrow down the number of clauses to be explored, fewer tuple visits and unifications will be performed. Similarly, if specialized code optimizations are incorporated, this may have an impact on various cost contributors (for instance, tail recursion optimization [Knise87] may reduce the cost associated with backtracking). The only cost contributor that is independent of the execution model seems to be the total number of solutions to the query, but, in the case of GraphLog, this number is also independent of whatever ordering of the subgoals is selected!

In our model, one initial task consists of defining which cost contributors are more relevant. By eliminating some cost contributors, the process of cost estimation will be simplified at the expense of some loss in precision. As we will argue later, many real-life examples can be characterized by only a handful of cost contributors (in some cases, only one may suffice).
Database Profiling
Once a selected set of cost contributors is determined, a simple way to determine the expected value of these quantities must be found. This is usually done by using a database profile rather than the exact values in the database. Traditional statistical profiles are specified by means of four categories of quantitative descriptors [Mannino88]: (1) descriptors of central tendency; (2) descriptors of dispersion; (3) descriptors of size; and (4) descriptors of frequency distribution. Usually, the more precise the descriptors, the more accurate the predictions. There are many widely-used "standard" descriptors: mode, mean, median; variance, standard deviation; cardinality of the relations; normality, uniformity, to mention only a few. Many real-life databases can be characterized by these common descriptors with the advantage of a simpler, more general cost analysis, normally at the expense of some loss in accuracy. In fact, many frequency distributions have been extensively studied in the area of statistics [Mannino88].†

†Given an arbitrary database, it is not always easy to establish which "standard" set of descriptors approximates the data best. Sets of tests have been developed for some of the most popular approximation functions in the literature.
However, derived relations and complex queries do not deal with simple distribution functions, but rather with combinations (specifically, joins, semijoins, selections and projections) of distributions that require a more complex analysis. Most of the research work† has been devoted to just a few distribution functions (uniform, Pearson, normal and Zipf), and not all basic database operators have been studied with the same degree of depth or success. A substantial part of the work has concentrated on the estimation of the number of output tuples of a query.‡ Given these deficiencies, it is not unusual that query optimizers automatically assume a distribution function that is simple and well understood (typically the uniform distribution). An additional problem occurs when the actual distribution function is not known (databases are constantly changing and it is not always possible to keep track of the changes in the shape of the distribution) or only known in a non-parametric form (usually histograms). Our model will normally assume a uniform distribution of attribute values in compliance with the standard trend.
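As a purely illustrative sketch (the predicate names and format below are our own, not the profile representation actually used in this dissertation), the kind of database profile assumed for the book database of Section 1.4.1 could be recorded as a handful of Prolog facts giving the cardinality of each extensional predicate and the number of distinct values per argument position:

% cardinality(Predicate, NumberOfTuples)
cardinality(book, 3000).
cardinality(publisher, 1766).
cardinality(author, 450).

% distinct_values(Predicate, ArgumentPosition, NumberOfDistinctValues)
distinct_values(book, 2, 20).        % 20 different publishers
distinct_values(book, 3, 450).       % 450 authors
distinct_values(author, 2, 30).      % 30 nationalities
distinct_values(publisher, 2, 380).  % 380 cities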
Given a certain degree of instantiation of the arguments of a GraphLog subgoal, our claim is that it is feasible to estimate an expected value for the selected set of cost contributors. As is always the case with abstract interpretation techniques [Cousot77] [Cousot92], the more information we have about the subgoal, the more accurate the estimates can be.
For the case of extensional database predicates, in our model, such an estimate is obtained by simple statistical considerations††. In the ideal case, if we know the exact values of the database tuples as well as the exact subgoal (query retrieval) under consideration, the expected value of a cost contributor can be calculated accurately. If our knowledge is more limited, we have to introduce some assumptions (as mentioned before, we will normally assume a uniform distribution of independent attribute values), yet still achieve acceptable results.

†See [Mannino88] for a thorough (although slightly out-of-date) survey on the topic.
‡After all, in traditional database query planning the sizes of intermediate relations are usually regarded as important (if not the most important) contributors to the total execution cost of a query.
††The estimation of a simple fact retrieval (i.e., direct extensional database searches) is mostly a statistical problem, since the distribution followed by its arguments is assumed to be known in advance or can be somehow determined.
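As a hypothetical illustration of such a statistical estimate (the numbers come from the book database of Section 1.4.1 and the reasoning is just the standard uniformity argument, not a formula quoted from a later chapter): for the subgoal author(A, dutch), the author predicate holds 450 tuples spread over 30 distinct nationalities, so under a uniform, independent distribution of attribute values the expected number of solutions is 450 / 30 = 15, while, in the absence of indexing, all 450 tuples would still be visited to find them.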
For the case of intensional database predicates, the estimation of the expected value of a cost contributor requires a more elaborate process, which we proceed to sketch.
Cost of a General Query
Given a query whose cost we wish to estimate, we propose to decompose the query into simpler components. To simplify the problem, we assume that queries are independent of each other†. The simplest choice consists of defining a GraphLog subgoal as the primitive entity to be analyzed. A subgoal is then treated as a "black box": given some inputs (such as the degree of instantiation of the arguments, the number of times that the subgoal is expected to be invoked, the average number of solutions that are expected to be returned by the subgoal, etc.), the expected values of the cost contributors may be estimated (as the outputs of the black box) and used by successive blocks as their respective inputs. The subgoal itself has to provide some information about internal characteristics such as the distribution of attribute values or the correlation amongst arguments (see Figure 1.4 as an example of this idea; note that average values are obtained, since the actual values of the ground terms are not taken into consideration: a uniform distribution of attribute values is assumed instead).
The total cost of the query is then estimated as the sum of the individual costs of the subgoals. Again, standard abstract interpretation techniques are used to determine the degree of instantiation of the arguments and propagate the intermediate results through all successive query components. This instantiation information may also be used to reject unsafe orderings [cf. Section 2.1.3].
The estimation of the cost of a general predicate call can be obtained as the sum of the costs associated with each individual rule (Figure 1.5). This holds largely true as long as rules are independent of each other (i.e., they do not have common solutions). However, it is quite common that two or more rules provide common solutions.

†We will see that a more complex framework is required to deal with dependencies amongst components.
nation(canada).
nation(belgium).
nation(uk).

language(canada, french).
language(canada, english).
language(belgium, dutch).
language(belgium, french).
language(belgium, german).
language(uk, english).

There are 3 nations in the database: 3 tuples are visited and 3 tuples are retrieved. There are 6 language tuples in the database; the language predicate has 3 distinct values for argument #1 and 4 distinct values for argument #2, so of the total of 12 possible combinations of these values only 6 will produce an answer: there is a rate of success of 1/2 (this value is a constant average value). For each nation retrieved in the previous step, 6 tuples are visited (assuming no indexing), and the solution will contain 3 x 4 x (1/2) = 6 tuples.

Figure 1.4 A query as a series of successive operations
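Reading Figure 1.4 as the two-subgoal query :- nation(N), language(N, L) (our reconstruction of the query the figure describes), the cost contributors introduced earlier take concrete values: 3 tuples are visited for the first subgoal, 3 x 6 = 18 for the second (one full scan of the language facts per retrieved nation, assuming no indexing), for a total of 21 visited tuples, and the query returns the 6 solutions predicted by the 1/2 success rate.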
A mutual exclusion analysis may help, but the general problem of duplication resulting from independent rules seems to be difficult to solve. Our cost model does not take this source of duplication of tuples into account.‡

‡We must distinguish between the cost of finding all answers (i.e., the sum of the costs of the individual rules) and the cost of finding all distinct solutions (whose estimation has to take into account the process of elimination of duplicates).
predicate :- subgoal_1,1, subgoal_1,2, ..., subgoal_1,n.
predicate :- subgoal_2,1, subgoal_2,2, ..., subgoal_2,n.
...
predicate :- subgoal_m,1, subgoal_m,2, ..., subgoal_m,n.

For each predicate rule: estimate the cost of each rule body, add the cost of head unification, and consider the process of projection and elimination of duplicates.

Figure 1.5 The cost of a general predicate is the sum of the cost of its individual rules
When we are dealing with general predicate calls, we have to consider some additional issues, such as (a) head unification, (b) clause indexing, (c) independence of subgoals and (d) the fact that the distribution of the tuples may be difficult to predict. Head unification and clause indexing are implementation-specific issues and they are taken into account in our model by assigning to each rule in the predicate a probability of success, (usually) given the degree of instantiation of the arguments involved. Each rule is then weighted based on this probability factor.
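As a hypothetical illustration of this weighting (the predicate below is invented for the example and does not appear in the dissertation):

% A predicate defined by three rules whose heads start with distinct constants.
road(victoria, nanaimo).
rail(victoria, courtenay).
flight(victoria, vancouver).

route(car, X, Y)   :- road(X, Y).
route(train, X, Y) :- rail(X, Y).
route(plane, X, Y) :- flight(X, Y).

% For a call such as route(train, victoria, Where), head unification (helped by
% first-argument indexing) gives the second rule a probability of success close
% to 1 and the other two rules a probability close to 0, so only the cost of
% the rail/2 body receives significant weight in the cost of the call.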
In some instances, the output of a subgoal is affected by the nature of other subgoals. Consider, for instance, a sequence of subgoals p(X, T), q(T, Y), and suppose that the set of values that the first subgoal derives for variable T are such that they do not form part of the domain for the first argument in predicate q. Unless we keep track of all intermediate values for variable T (which is normally contrary to abstract interpretation principles), we have no easy way to determine that predicate q will fail for all its inputs. By the same token, since we will not know the exact values of the variables involved, we have no direct method to estimate the shape of the distribution of attribute values for general predicates. In our cost model, we will ignore the issues of independence of subgoals and distribution for intermediate results.
Once the determination of the outputs of the subgoals has been solved (that is, the equivalent of the selection operation of relational algebra), we need to couple different black boxes (i.e., tackle the analogue of the join and projection operations of relational algebra). Several hurdles arise at this point, but the two most problematic are the duplication of solutions after a projection of arguments (noted before) and the correlation between the arguments of two or more different subgoals (interdependence amongst subgoals). Our model in its present form does not tackle these issues.
Our model also handles recursive queries which, in the specific case of GraphLog, are in the form of a predicate closure. Specifically, our methodology estimates the expected average number of solutions of a recursive predicate. The basic idea is that any linearly recursive query can be expressed as a transitive closure (possibly preceded and followed by some non-recursive predicates) [Jagadish87]. Therefore, we estimate the number of solutions of the recursive predicate by estimating the number of solutions of an equivalent query expressed in terms of transitive closure. Thus, we propose a method to estimate the average number of solutions of a transitive closure. An entire chapter will be devoted to explaining how our framework deals with recursive queries.
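For instance (an illustrative Prolog fragment of our own, not an example taken from the dissertation), the classic linearly recursive ancestor relation is exactly the transitive closure of its base relation, which is the form our estimation method works with:

parent(ann, bob).
parent(bob, carol).
parent(carol, dave).

% Linearly recursive definition ...
ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).

% ... whose solution set is the transitive closure of parent/2: for this chain
% of three parent facts it contains 3 + 2 + 1 = 6 tuples, and it is this kind
% of count that the method estimates on average.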
Other issues not currently considered by our cost model include (a) aliasing or sharing of a common variable within the same subgoal, (b) consideration of invalid inputs, and (c) more complex forms of recursion.
As we will see in a subsequent chapter, more accurate results may be achieved when the methodology is tailored to the specific abstract machine and the particular characteristics of the system used to execute the queries. If we wish to obtain more accurate results, we would also require specific knowledge of the evaluation methods that are used (which is crucial when dealing with recursive queries) and the special optimization techniques that are implemented. Note that, under this scheme, a new analysis would be required for each different system. As can be seen, this process may become quite tedious. An alternative, more general solution would require making rough assumptions and concentrating on more "high-level" cost contributors. Thus, given a general evaluation strategy (top-down evaluation in our case [cf. Section 2.1.2]), we are able to estimate the cost of a given GraphLog query without specific knowledge of the particular abstract machine that is being used by the GraphLog system under consideration. Our framework addresses both approaches, so that we propose a model tailored to a specific machine, the WAM [Ait91], as well as a model based on more "high-level" cost contributors and relatively independent of the underlying abstract machine (Figure 1.6).
Approach 1: Model tailored to a specific machine
- the evaluation method is known
- the optimizations are also known
- more accurate
- we may estimate execution times
- only valid for that particular machine

Approach 2: Model based on "high-level" cost contributors
- the specific evaluation method and optimizations used are unknown
- less accurate
- we only estimate values of the cost contributors and not expected times
- more general

Figure 1.6 Two general alternatives for a cost model framework
Chapter 2.
Cost Modeling
A cost model may be visualized as an abstraction that attempts to estimate the efficiency of the actual execution of some piece of code (in our case, a GraphLog query). Different parameters may be used to measure the degree of efficiency. The most commonly used metrics are the time or memory required to answer the entire query. It can be argued that, as memory continues to become cheaper, emphasis should be given to estimating time efficiency rather than memory efficiency.
Different orderings of the same group of subgoals in a GraphLog query will usually result in a different degree of efficiency of execution. Such a difference is due to many factors, ranging from some that are rather predictable (such as the size and nature of the machine code that is generated, or the series of systematic code optimization techniques that are performed) to those that are shaped by the current environment in which the program is executed (such as the current system load, or the number of processes competing for common resources). The latter considerations are hard to take into account and are normally ignored.
In this chapter, we start with an overview of some issues related to query reordering in Datalog (which also apply to GraphLog). We also give a brief account of some related work in the area of query reordering.
2.1 Evaluation Methods for Datalog
Given a Datalog program, a computational model that derives all the facts satisfying the user's query is required. Normally, the chosen evaluation method computes solutions according to the so-called least fixpoint model [Ceri91].

Although pure Logic Programming does not include built-in predicates such as arithmetic or comparison operators, most implementations permit the use of such predicates. An additional useful construct not available in pure Datalog is the use of negation. Negation is often handled by using the closed world assumption, a mechanism of negation as failure which states that the negation of a fact that cannot be logically derived from the Datalog program is considered to be valid.
Several evaluation methods have been proposed for solving Datalog queries, i.e., for determining whether a user's query is valid given the collection of rules and facts that are formulated in the program. We can categorize these methods into two major groups according to the general evaluation strategy, namely bottom-up and top-down evaluations [Ceri91].
2.1.1 Bottom-up Evaluation
Bottom-up evaluation methods apply the principle of matching rules (usually called intensional database predicates) against the facts (also called extensional database predicates) to obtain valid values for the variables involved in the corresponding rules. Those rules whose head variables acquire ground values are then considered in a similar manner to extensional database predicates, and the process is repeated until all necessary facts have been derived. Most bottom-up evaluation methods have been borrowed or adapted from well-known algorithms originally developed to solve systems of equations in Numerical Analysis (for example, the Jacobi algorithm for finding least fixpoints). Most extensions of the basic algorithms are aimed at avoiding duplication in the evaluation of intermediate solutions. Bottom-up evaluation is the natural method for set-oriented languages like Datalog.
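A small worked trace of this idea (our own illustration, not an example from the text): given the facts edge(a, b), edge(b, c) and edge(c, d), and the rules path(X, Y) :- edge(X, Y) and path(X, Y) :- edge(X, Z), path(Z, Y), a naive bottom-up (Jacobi-style) iteration derives path(a, b), path(b, c) and path(c, d) in the first pass, path(a, c) and path(b, d) in the second, path(a, d) in the third, and nothing new in the fourth, at which point the least fixpoint has been reached; semi-naive refinements avoid re-deriving in each pass the tuples already obtained earlier.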
2.1.2 Top-down Evaluation
Top-down evaluation methods use the principle of unification between a given subgoal and the intensional or extensional database predicates. This process of unification provides a set of valid bindings that are then propagated to the other subgoals that constitute the query. A so-called derivation tree is generated. A fairly well-known method that is based on this resolution principle is the SLD-resolution procedure and its several extensions (which constitute the evaluation method of choice for the language Prolog). Top-down evaluation is well-suited for solving simple transitive closure problems when the extensional database relation has no cycles, or when just one answer to the query is needed.
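As a small illustration of this last point about cycles (our own sketch, not part of any GraphLog translation scheme):

edge(a, b).
edge(b, c).

path(X, Y) :- edge(X, Y).
path(X, Y) :- edge(X, Z), path(Z, Y).

% ?- path(a, Y).
% With the acyclic base relation above, top-down (SLD) evaluation enumerates
% Y = b and Y = c and then terminates. If the cyclic fact edge(c, a) were
% added, the same query would keep producing (duplicate) answers and never
% terminate when all solutions are requested, which is why top-down evaluation
% is best suited to acyclic relations or to retrieving a single answer.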
In one of the current implementations, due to Fukar [Fukar91], the query language GraphLog is translated into Prolog. Thus, the GraphLog database can be viewed as a Prolog database, and the executable program as a Prolog program. As a result, under this particular implementation, GraphLog is evaluated using a top-down strategy. For this very reason, all cost models that we propose in this dissertation are tailored to a top-down evaluation strategy.
2.1.3 Safety Considerations
Safety is an important issue related to the evaluation strategy that is chosen. Generally speaking, a query is safe to evaluate if it has a finite number of answers and the computation that is performed to find them terminates, i.e., all the answers are obtained after a finite number of computations. For this reason, query safety plays a very important role when a plan of execution is selected. The issue of the safety of rules has been extensively studied in the literature and safety conditions have been derived for different logic programming languages, and Datalog is not an exception [Bancilhon86].

2.1.4 Query Reordering in Datalog
In pure logic programming, both rules and subgoals can be reordered at will without changing the meaning of the program. In practice, some orderings may yield more efficient executions of the program. However, we have already seen that some orderings may lead to non-terminating computations.

A distinction exists between inherently non-terminating queries and queries whose computation does not terminate for just some orderings. In this latter case, the reordering algorithm must reject such unsafe orderings.
The two principal causes of non-terminating computations for otherwise safe queries are:

Evaluable predicates, i.e., predicates that require that some of their arguments have a ground value prior to the predicate invocation. This is a consequence of the fact that built-in predicates usually deal with infinite relations. In general, if the predicate arguments do not have ground values before the call, the evaluable predicate will produce an infinite number of answers. Typical examples of evaluable predicates are arithmetic expressions and comparison operators. For instance, consider the evaluable predicate plus(X, Y, Z), which represents the arithmetic expression X + Y = Z. This predicate is unsafe if two or more arguments are not integer constants. Thus, a query such as :- plus(5, Y, Z) would yield an infinite number of answers.

Negation, which is normally handled under the so-called Closed World Assumption, considers anything that cannot be logically derived from the rules and facts to be false. The Datalog fixpoint evaluation procedure handles negation by computing the complement of the relation that is being negated. If the domain of such a relation happens to be infinite, the complement may be infinite too. For this reason, the negation of a predicate with at least one variable argument is a potential source of an infinite computation.
Safety rules for GraphLog have been formulated by Fukar [Fukar91]. It is shown that, when GraphLog is translated into Prolog, safety is achieved when the following order of the subgoals is observed: (1) positive (i.e., non-negated) database predicates first; (2) evaluable predicates next; and (3) negated predicates last. However, this specification is harshly restrictive, since evaluable predicates and negations of predicates are only unsafe under certain circumstances.
A less limiting condition restricts evaluable and negated predicates to positions where they are guaranteed to be safe. For the case of evaluable predicates, we have to define a set of lists of arguments that are required to be ground in order to be safe (i.e., to yield a finite number of answers). Figure 2.1 shows two examples of such sets of lists. In the case of negation of predicates, we must guarantee that all arguments become ground prior to the evaluation of the predicate.
% built-in predicate >
% >(A, B) :- true if A is greater than B
% A, B: integer values
This evaluable predicate is safe when both arguments are ground; otherwise it is not safe.
Set of lists of ground arguments that guarantees safety: {[A, B]}.

% built-in predicate -
% -(A, B, C) :- true if C = A minus B
% A, B, C: integer values
This evaluable predicate is safe whenever two or more arguments are ground; not safe otherwise.
Set of lists of required ground arguments that guarantees safety: {[A, B], [A, C], [B, C], [A, B, C]}.

Figure 2.1 Sets of lists of arguments for two evaluable predicates that ensure safety
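The following Prolog sketch (our own; the predicate names and the list-based encoding of argument sets are assumptions, not the thesis's implementation) shows how the sets of Figure 2.1 could be used to decide whether an evaluable subgoal is safe at a given position in an ordering:

% arg_safe(BoundArgs, RequiredSets): true if at least one required set of
% ground arguments is already covered by the arguments bound so far.
arg_safe(BoundArgs, RequiredSets) :-
    elem(Required, RequiredSets),
    covered(Required, BoundArgs).

covered([], _).
covered([A|As], Bound) :- elem(A, Bound), covered(As, Bound).

elem(X, [X|_]).
elem(X, [_|Xs]) :- elem(X, Xs).

% ?- arg_safe([a, b], [[a, b]]).                     succeeds: A > B with both
%                                                    arguments bound is safe
% ?- arg_safe([c], [[a,b], [a,c], [b,c], [a,b,c]]).  fails: the subtraction
%                                                    with only C bound is unsafe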
2.2 Some Recent Work on Query Reordering
Several cost models for logic programming languages have been proposed in the past. McCarthy [McCarthy81] proposed the use of graph-colouring algorithms to mimic the evaluation process of a conjunction of literals. Gooley and Wah [Gooley89] suggested a heuristic method for reordering Prolog clauses using Markov chains and probabilities of success and failure. McEnery and Nikolopoulos [McEnery90] described a reordering system that rearranges non-recursive Prolog clauses by applying both static and dynamic reorderings; the dynamic reordering uses statistical information from previous executions. Sheridan [Sheridan91] designed a "bound-is-easier" heuristic algorithm for reordering conjunctions of literals by selecting subgoals containing ground arguments to be placed before other subgoals. Wang, Yoo and Cheatham [Wang93] developed a heuristic reordering system for C-Prolog based on the probability of success or failure as estimated by a statistical profiler. Finally, Debray and Lin [Debray93] developed a method for the cost analysis of Prolog programs based on knowledge about "size" relationships between the arguments of predicates, this being specially aimed at handling recursion (although some common cases of recursion, such as transitive closure and chain recursion, are not solved at all).
2.2.1 Efficient Reordering of Prolog Programs by Using Markov Chains
Gooley and Wah's work [Gooley89] proposed a model that approximates the evaluation strategy of Prolog programs by means of a Markov process. The cost is measured as the number of predicate calls or unifications that take place. The method needs to know in advance the probability of success and the cost of execution of each predicate.

Gooley and Wah's reordering method takes into account the fact that different levels of instantiation (modes) for the arguments in the subgoals lead to different values of probabilities and costs. A Markov chain is proposed for each valid calling mode. The values of the costs and the probabilities of success are to be provided by the user (at least in the case of the base predicates). To avoid exploring all permutations of the subgoals, Gooley and Wah propose the use of a best-first search.

The method also considers that there are some orderings that must be rejected because of safety conditions. However, no practical solution is given for recursive predicates. The results for the simple Prolog programs that are presented show some acceptable ratios of improvement, although the method seems to be quite expensive to implement. Appendix A1.1 gives a more detailed view of this method.
2.2.2 A Meta-Interpreter for Prolog Query Optimization
McEnery and Nikolopoulos [McEnery90] describe a meta-interpreter for Prolog which reorders clauses and predicates. It has two components: (a) a static component in charge of rearranging the clauses "a priori", and (b) a dynamic component that reorders the clauses according to probabilistic profiles built from previously answered queries.
This method's static reordering phase consists of rearranging the clauses that define a predicate in such a way that the most successful clauses are tried first, and the subgoals within a clause are reordered in descending order of success likelihood.
Subgoal reordering is performed by using a generalization of a heuristic due to D.H.D. Warren [Warren81]. Warren proposed a formula for the cost c of a simple query q given by c_q = s / a, where s is the size in tuples (i.e., the number of solutions) of the subgoal and a is the product of the sizes of the domains of each instantiated argument. The generalized formula proposed by McEnery and Nikolopoulos extends this with p, the probability of success of the clause under analysis, s and a being defined as in Warren's formula.
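As an illustration only (not code from [Warren81] or [McEnery90]), the estimate can be computed directly from a subgoal's size and the domain sizes of its instantiated arguments; the predicate name warren_cost/3 is ours.

    % warren_cost(+Size, +BoundDomainSizes, -Cost): Size is the number of
    % tuples of the subgoal's predicate; BoundDomainSizes lists the domain
    % sizes of the arguments that are instantiated at call time.
    warren_cost(Size, BoundDomainSizes, Cost) :-
        product(BoundDomainSizes, A),
        Cost is Size / A.

    product([], 1).
    product([S | Ss], P) :-
        product(Ss, P0),
        P is P0 * S.

For example, a subgoal over a 3,000-tuple predicate with two bound arguments drawn from domains of sizes 30 and 10 would be estimated at 3000 / (30 × 10) = 10, while the same subgoal with no bound arguments keeps its full size of 3,000.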
The method does not handle recursive queries and it explores all permutations of possible reorderings, which may be very expensive for large queries. For a more in-depth view of this method, the reader is referred to Appendix A1.2.
2.2.3 Efficient Reordering of C-Prolog
Wang, Yoo and Cheatham [Wang93] have implemented a reordering mechanism for Prolog programs which assumes that the cost of evaluating a subgoal is a constant that can be estimated by means of cumulative statistics. A profiler collects the number of subgoals that are invoked for a given predicate p, as well as the number of times that the call fails. The average value of these metrics over the total number of calls to predicate p is used as a measure of the cost of evaluating such a predicate.
The probabilities of success and failure collected during statistical profiling are then used to determine a suitable ordering. In fact, the system only accumulates the number of calls to a predicate and the number of times a failure occurs. The probability of failure of a conjunction of subgoals s_1, ..., s_n is then calculated as the product of the individual probabilities of failure of the subgoals:
failure_rate_i = (number of failures of s_i) / (number of calls to s_i)

probability of failure = Π_i failure_rate_i
An evident advantage of this method is that handling recursion is not a major problem, since we are only interested in the number of calls and failures, without paying attention to whether the calls are recursive or not. An obvious disadvantage of the method is that the degree of instantiation of the subgoals is totally ignored and, therefore, there is no distinction between different calling modes of the same predicate, even though these usually yield different execution costs. Another drawback is that safety conditions are not incorporated, and it is the responsibility of the user to inform the system about which predicates are not suitable for reordering.
2.2.4 On Reordering Conjunctions of Literals: A Simple, Fast Algorithm
Sheridan [Sheridan91] has formulated a good heuristic algorithm for reordering conjunctions of subgoals in Prolog programs. This method differs from many others in that it does not require profile information about the underlying database. Although the method is simple, it yields surprisingly good results. The method exploits the notion of "ground is better", i.e., the fact that the more instantiated the arguments in a subgoal are, the less expensive its execution is. The goal of the method is to maximize the so-called sideways information passing [Ullman88] from left to right.
Sheridan's algorithm distinguishes three groups of subgoals: (a) positive built-in literals, (b) negative literals and (c) other positive literals. This classification of the subgoals has to do with safety considerations. For instance, a built-in predicate may require that some of its arguments have instantiated values before the predicate call (an enabling list of arguments). For example, consider the predicate sum(A, B, C) that evaluates the operation A = B + C. Typical enabling lists (i.e., lists of arguments that guarantee that the given predicate is immediately evaluable) for this arithmetic predicate are [A, B, C], [A, C], [A, B] and [B, C]. In other words, the predicate is safe whenever two or three of the arguments are instantiated to an integer value. By the same token, a negative literal is safe if all its arguments are constant values, so that, given non-built-in predicates q and p, some orderings of a conjunction involving a negated literal are safe whereas others are not.
Note that the algorithm exploits the property of Datalog-like programs whereby each argument is guaranteed to have a constant value after any call. Thus, given this specific property, any occurrence of a variable other than the first one is guaranteed to have a constant value.
The algorithm nondeterministically selects subgoals according to the following criteria (in descending order of priority):
1. non-negative non-built-in subgoals with at least one ground argument (either an explicit constant or a variable that is known to be instantiated to a constant value by virtue of having appeared in a previously selected subgoal);
2. non-negative built-in subgoals that are safe (i.e., at least one of their enabling lists is entirely composed of ground arguments);
3. negative subgoals that are safe (i.e., all their arguments are ground);
4. non-negative non-built-in subgoals with no ground arguments.
The algorithm can use an additional heuristic rule which gives preference to subgoals with a larger number of bound arguments within each criterion group.
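The selection criteria above can be sketched in Prolog as follows. This is our rough rendering, not Sheridan's published code; is_builtin/1 and enabling_list/2 are assumed helper predicates describing the built-in predicates and their enabling lists, and only >/2 is declared here for illustration.

    :- use_module(library(lists)).          % member/2

    % Illustrative declarations for one built-in predicate.
    is_builtin(_ > _).
    enabling_list(A > B, [A, B]).

    % priority(+Goal, +BoundVars, -P): P is the criterion group (1 is best)
    % that Goal falls into, given the variables already bound by previously
    % selected subgoals.
    priority(not(G), Bound, 3) :-                 % safe negative subgoal
        term_variables(G, Vs),
        forall(member(V, Vs), member_eq(V, Bound)).
    priority(G, Bound, 2) :-                      % safe built-in subgoal
        is_builtin(G),
        enabling_list(G, Args),
        forall(member(A, Args), ground_arg(A, Bound)).
    priority(G, Bound, 1) :-                      % at least one ground argument
        ordinary(G),
        G =.. [_ | Args],
        once(( member(A, Args), ground_arg(A, Bound) )).
    priority(G, Bound, 4) :-                      % no ground arguments
        ordinary(G),
        G =.. [_ | Args],
        \+ ( member(A, Args), ground_arg(A, Bound) ).

    ordinary(G) :- G \= not(_), \+ is_builtin(G).

    ground_arg(A, _)     :- atomic(A).
    ground_arg(A, Bound) :- var(A), member_eq(A, Bound).

    member_eq(X, [Y | _]) :- X == Y.
    member_eq(X, [_ | T]) :- member_eq(X, T).

A reordering loop would repeatedly pick a remaining subgoal with the smallest priority value and add its variables to the bound set before selecting the next one.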
An important feature of this algorithm is that no knowledge of the underlying database is required. The main advantage of this is that no database profile has to be obtained, which may represent a substantial saving. An obvious restriction of Sheridan's algorithm is that no distinction is made between a predicate that retrieves a huge number of tuples and one that is associated with a very small set of tuples; as a result, the expensive predicate may be given priority over a possibly better choice.
2.2.5
Cost Analysis of Logic Programs
Debray and Lin [Debray93] have proposed a more general framework to analyze the cost of logic programs, including simple forms of recursion. In particular, the method estimates the number of solutions of a logic program based on the sizes of the predicate arguments. The method derives size relationships amongst predicate arguments. This size information is then used to compute the number of solutions generated by each predicate.
The methodology is applicable to all non-recursive predicates and to those recursive predicates that have an argument whose size is reduced at every recursive step, until a base-case value is obtained. Unfortunately, this leaves out some interesting cases of recursion (such as transitive closure or chain recursion). This method is described in more detail in Appendix A1.3.
Chapter 3. A Machine-Dependent Cost Model
We now proceed to study our cost model for a specific abstract machine. Since the current version of the GraphLog interpreter generates Prolog code [Fukar91], our analysis will be focused on this particular target language. Furthermore, we have chosen a particular execution model for Prolog, namely the WAM abstract machine [Ait91], because it is widely used for the Prolog language. (Prolog is the most widely used logic programming language.)
When dealing with databases, it is usual to separate logic predicates into two categories: extensional database predicates, which comprise a finite set of positive ground facts, and intensional database predicates, which include all other predicates. We will devote the initial part of this chapter to deriving a framework for extensional database predicates, and tackle the case of intensional database predicates thereafter.
3.1
Cost model, Initial Assumptions
We start from two assumptions. First, we suppose that some parametric values of the database are known in advance (such as the number of distinct values for every argument position of all database facts, a model for the distribution that is followed by these attribute values, etc.). Furthermore, we assume that the model of Prolog's execution closely follows the design of the Warren Abstract Machine (WAM) model [Ait91].
We normally consider three different costs that can be estimated for a given subgoal: (a) the cost of retrieving all solutions to the subgoal; (b) the cost of finding the first answer to the subgoal; and (c) the cost of obtaining the next valid answer for a given state. We will concentrate on the all-solutions case, since this is the usual scenario for standard database queries.
3.2 Fact Retrieval, All Solutions
The simplest possible Prolog subgoal is one that only retrieves facts from the extensional database. In this section we find the cost associated with finding all solutions to such a subgoal.
Consider a subgoal p of arity n of the form:

p(P_1, P_2, ..., P_n)

where P_1, P_2, ..., P_n are the arguments of the subgoal. The evaluation of this subgoal may require the execution of a specific set of WAM instructions, such as: predicate calls, allocation and deallocation of stack frames, unification operations, attempts to examine the different unifiable clauses, variable unwinding (in case of unification failure and backtracking), etc. One straightforward way of estimating the cost of evaluating the subgoal is to deduce the exact sequence of machine instructions that is executed. If we know the costs of the individual WAM instructions, a total cost for the fact retrieval operation may be calculated.†
For instance, the WAM defines several term manipulation instructions to handle unification. Their behaviour depends on the mode set by a get_structure instruction. If read mode is set, the unification algorithm is applied to both the instruction operand and the current heap cell (the WAM stores new terms onto a memory area called the heap). If, instead, write mode is specified, a new cell is allocated on the heap. A typical translation of a fact is shown in Figure 3.1 (the WAM instructions are shown to the left).
Note that the number and nature of the arguments will determine the set of instructions that corresponds to the WAM translation. The existence of two modes (read and write) has to be considered as well.
However, a simpler approach can be proposed instead. We can neglect or disregard those instructions that either are executed regardless of the position of the subgoal in a
†See [Gorlick87] for an attempt to use this approach. The proposed model only considers very simple clauses without disjunctions (therefore leaving out clause indexing), and does not address the issue of the degree of instantiation (or "modes") of the predicate arguments either.
predicate/3:
    get_variable X0              % predicate( V,
    get_structure m/2, X1        %            m(
    unify_variable X5            %               X5,
    unify_variable X6            %               W ),
    get_structure n/2, X2        %            n( V, W ))
    unify_value X0               %
    unify_value X6               %
    get_list X5                  % X5 = [ a | X4 ]
    unify_constant a             %
    unify_variable X4            %
    get_list X4                  % X4 = [ b | X3 ]
    unify_constant b             %
    unify_variable X3            %
    get_list X3                  % X3 = [ c | [] ]
    unify_constant c             %
    unify_constant []            %

Figure 3.1 Partial translation of the fact predicate(V, m([a,b,c], W), n(V, W))
conjunctive clause (as in the case of the predicate call) or that do not incur a significant cost (such as, for example, the WAM's switch instructions, which support argument indexing). This latter group of instructions can be safely neglected when we are dealing with fairly large databases, where other operations (variable unwinding, tuple visiting) dominate the execution performance.
We have found experimentally that three groups of WAM instructions are usually responsible for the major part of the time spent evaluating a subgoal. These are: (a) instructions that are used to manipulate choice points (try_me_else, retry_me_else and trust_me); (b) instructions that perform the unification algorithm for terms; and (c) instructions that restore a previous state when a new solution is required (since a process of backtracking is launched). Our general cost function is based on these observations.
For ease of analysis, we will usually assume a uniform distribution of independent attribute values, a commonly used assumption in the database field [Mannino88].
3.2.1 Choice Point Manipulation
The first group of WAM instructions that is heavily used during fact retrieval is concerned with physical access to the tuples. In our model, we propose to write the cost due to choice point traversal as:

cost_traversal = n_chp × T_chp

where
n_chp is the total number of choice points that are "visited", and
T_chp is the expected cost of executing the instructions associated with a single choice point.
T_chp is assumed to be a constant that depends on the Prolog system in use, and its value may be determined experimentally. The number of choice points, i.e., the number of alternatives that must be explored during an all-solutions retrieval, can be estimated from the database profile. Given the instantiations of the arguments and the scheme of clause indexing that is used, we may estimate the number of tuples whose unification will be attempted. Appendix I gives a formula that holds when a uniform distribution of independent attribute values is being used.
3.2.2
Unification Operations
This second group of WAM instructions is concerned with unification applied to terms (get_constant instructions in our case, since we are considering simple facts). To simplify our analysis, let us consider the two simplest cases of term unification: ground (constant) and non-ground (variable) unifications†. A variable unification is always guaranteed to succeed, whereas a ground unification can fail. In our model, we can estimate the cost associated with the unification operation as follows:

cost_unification = n_uc × T_uc + n_uuc × T_uuc + n_vu × T_vu

†In fact, constants and variables are the only two terms that are allowed in GraphLog.
where
n_uc is the number of successful constant unifications that take place;
n_uuc is the number of unsuccessful constant unifications that take place;
n_vu is the number of (successful) variable unifications that take place;
T_uc is the expected cost of performing one successful constant unification;
T_uuc is the expected cost of performing one unsuccessful constant unification; and
T_vu is the expected cost of performing one (successful) variable unification.
The three numbers can be derived from the database profile (the instantiation of the arguments and the distribution of attribute values may be used for this purpose); the three cost factors may be determined experimentally.
Consider again a subgoal p of arity n of the form:

p(P_1, P_2, ..., P_n)

To estimate the values of n_uc, n_uuc and n_vu, two quantities have to be determined for every argument position: (a) the number of unification attempts and (b) the number of successful unifications. Clearly, the number of unification attempts that take place for position k has exactly the same value as the number of successful unifications that occurred for position k-1 (k > 1), assuming that arguments are unified from left to right.
The number of successful unifications at a given argument position is a fraction of the total number of unification attempts that are made. We propose the following formula for n_succ(k), the number of successful unifications at position k:

n_succ(k) = R_f(k) × n_att(k)

where
R_f(k) is a reduction factor for argument position k, which also represents the savings due to clause indexing (if implemented); and
n_att(k) is the number of unification attempts at argument position k.
Appendix I shows some formulae that apply to the special case of a uniform distribution of independent attribute values.
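As a concrete (and only plausible) illustration of the kind of formula involved, rather than a quotation of Appendix I: under a uniform distribution of independent attribute values and without indexing on position k, one would expect

    R_f(k) ≈ 1/S_k   if argument k is ground
    R_f(k) = 1       if argument k is free

where S_k is the number of distinct values at position k; a free argument lets every attempted tuple through, while a ground one passes, on average, only the matching fraction.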
3.2.3 Backtracking Operations

The contribution to the cost of retrieving all solutions to a fact depends for the most part on the total number of solutions that can be retrieved, and is represented by the operations that restore prior states during backtracking and that are not part of the "choice point manipulation" previously considered†. We propose the following formula:

cost_backtracking = n_sol × T_back

where
T_back is the expected time associated with the process of restoring a previous state when a new solution is searched for, and can be determined experimentally; and
n_sol is the expected number of solutions to the query.
Consider a subgoal p of arity n of the form:

p(P_1, P_2, ..., P_n)

For a uniform distribution of attribute values, the total number of solutions is given by:

n_sol = n_succ(n)

n_succ(n) being the number of successful unifications that occur for the last argument P_n (Section 3.2.2).
†Another action that is directly related to the number of solutions has to do with the actual display of the results.
In general, the total number of solutions may be derived from the database profile. Much work has been published on this subject [Mannino88].
3.2.4 General Formula

A global formula simply takes all the above considerations into account. Given that we have decided to restrict our scope to the three previously mentioned cost contributors, our final formula is as follows:

total_cost = cost_traversal + cost_unification + cost_backtracking
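As an illustration of how the three contributors combine, the following sketch (ours, not a program from the thesis) evaluates the formula once the counts and the elementary constants are supplied by the caller.

    % total_cost(+Counts, +Constants, -Cost)
    % Counts    = counts(NChp, NUc, NUuc, NVu, NSol)
    % Constants = constants(Tchp, Tuc, Tuuc, Tvu, Tback)
    total_cost(counts(NChp, NUc, NUuc, NVu, NSol),
               constants(Tchp, Tuc, Tuuc, Tvu, Tback),
               Cost) :-
        Traversal    is NChp * Tchp,
        Unification  is NUc * Tuc + NUuc * Tuuc + NVu * Tvu,
        Backtracking is NSol * Tback,
        Cost is Traversal + Unification + Backtracking.

For example, with the constants derived experimentally in Section 3.3 (T_chp = 0.020, T_vu = 0.007, T_back = 0.048, and the constant-unification terms dropped), the query ?- total_cost(counts(3000, 0, 0, 12000, 3000), constants(0.020, 0.0, 0.0, 0.007, 0.048), C). gives C = 288.0; the number is purely illustrative of how the values are plugged in.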
3.3 Experimental Values for the Elementary Constants

Here we explain how to obtain empirical values for the constants T_chp, T_uc, T_uuc, T_vu and T_back for the particular case of a uniform distribution of attribute values. It must be emphasized that these values are heavily dependent on the actual implementation that is used.
A good strategy to determine the values of the above-mentioned constants consists of building several perfectly uniform databases, and then measuring the execution time for different types of queries involving both ground and variable arguments. A ternary predicate seems to be a convenient choice because it contains the most important variants without having to deal with huge databases (Figure 3.2). A binary predicate may work as well, but less accurate results can be expected.
(a) An extract from one of the databases that were used:

    pred1(ba,ba,ba).   pred1(ba,ba,aa).   pred1(ba,aa,ba).   pred1(ba,aa,aa).
    pred1(aa,ba,ba).   pred1(aa,ba,aa).   pred1(aa,aa,ba).   pred1(aa,aa,aa).

(b) Typical subgoals which retrieve these facts:

    pred1(ba,aa,ba).   pred1(ba,aa,Z).    pred1(ba,Y,ba).    pred1(ba,Y,Z).
    pred1(X,aa,ba).    pred1(X,aa,Z).     pred1(X,Y,ba).     pred1(X,Y,Z).

Figure 3.2 (a) An extract from one of the databases that were used and (b) typical subgoals which retrieve these facts
The basic idea consists of building a kind of database for which we can theoretically predict the number of WAM instructions that get executed for our different queries (a symmetric database is a suitable choice given its predictability with regard to the number of WAM operations that are expected to be executed). Thus, we can derive theoretical formulae based upon some parametric variables for all contributors that we consider relevant. Then, we experimentally obtain the costs of executing the queries and relate these costs to the parametric variables. In the case of a perfectly uniform database, all contributors may be expressed as functions of the sizes S_i of the argument domains and their respective products. Therefore, we may propose a general formula of the form:
cost = Σ_i c_i × G_i(S_1, S_2, ..., S_N)

where each G_i(S_1, S_2, ..., S_N) = Π_{k ∈ P_i} S_k and each P_i ⊆ {1, 2, ..., N},

N being the number of arguments in the subgoal. For the ternary case, we have:

cost = c_7·S_1S_2S_3 + c_6·S_2S_3 + c_5·S_1S_3 + c_4·S_1S_2 + c_3·S_3 + c_2·S_2 + c_1·S_1 + c_0

where S_n represents the domain size of argument n, and the c_i are constants related to the weight or influence of the corresponding term in the total cost (a zero value would mean no contribution whatsoever due to that particular term). In fact, when several independent experiments are launched, only a few constants show both measurably "large" and consistent values in repeated experiments, and these are obvious candidates to be considered significant.
All experimental results mentioned in this section were obtained on both SICStus Prolog, version 2.1, and SB-Prolog, version 3.0, executing on a SUN IPC SPARCstation. The experimental values were measured using the profiling routines provided by SICStus Prolog and SB-Prolog; all execution times are estimated, according to the implementation manuals, in "artificial" units.
Approximately 1,000 different databases were built with sizes ranging from 10 to about 25,000 different tuples. For every database, all possible combinations of ground and variable arguments in the query were tried (see Figure 3.2). Using the least-squares method for curve fitting, the values of the constants c_i (i.e., the dependency of the execution times upon the parametric values of the database) were obtained. Initial experiments showed that all these dependencies were approximately linear.
Table 3.1 and Table 3.2 summarize some actual results for a complete experiment. S_k stands for the number of distinct values for argument position k. Those cells in the table containing values that are clearly distinct from zero (and that may indicate that the term under consideration contributes to the total cost) have been marked in bold font. A decision was made to consider as few constants as possible, for instance disregarding some values for the variables S_1, S_2 and S_3 (as well as the independent term), which will normally hold smaller values than their products. Some variables may have values clearly distinct from zero after one experiment, but no consistent values from experiment to experiment; we decided to ignore these constants as well†. The eight different cases of ground and not-ground combinations are abbreviated using the letters g (for ground) and f (for free, i.e., not ground, variable).
At the same time, the corresponding WAM instructions and the number of times that they had been executed were calculated. We assumed the first-argument indexing characteristic of SICStus Prolog. A rough estimate of the number of times that the WAM instructions were expected to be executed is shown in Table 3.3.
The fact that for the all-ground-argument case (i.e., ggg) there was no clear dependence on the value of variable S_3 (constant c_3 is negligible), and for the first-not-ground-the-rest-ground case (i.e., fgg) no appreciable dependency on the variable (S_1 S_3) was observed (incidentally, the expressions in which these terms appear are highlighted by a light shading in Table 3.3), suggests that the contribution of constant unifications (i.e., T_uc and T_uuc) may be neglected. Thus, a simplified table (Table 3.4) is obtained.
†We observed that some apparently significant negative quantities showed no consistent values from experiment to experiment, and most of the time their values were close to zero; for instance, the value -0.14 in the first row of Table 3.1.
Table 3.1 Typical Experimental Results for a Ternary Predicate for SICStus Prolog. (For each instantiation case, ggg through fff, the table lists the fitted constants c_7, ..., c_0 associated with S_1S_2S_3, the pairwise products of S_1, S_2 and S_3, the single sizes S_3, S_2, S_1 and the independent term.)

Table 3.2 Typical Experimental Results for a Ternary Predicate for SB-Prolog. (Same layout as Table 3.1.)
Thus, if we decide to consider only the remaining three constants, T_chp (directly related to "retry_me_else" operations), T_vu (associated with successful "get_variable" instructions) and T_back (connected to the number of solutions of the retrieval), then we proceed to establish which products of our S variables are expected to contribute to the cost of the retrieval. For instance, the product S_2×S_3 is significant for the "ggg" case, and the products S_1×S_2×S_3 and S_1×S_3 are significant for the "fgf" case. We may build a table
Table 3.3 Number of times that the WAM instructions are executed. (For each case the table counts: calls to the predicate, switch-on-term and switch-on-constant instructions, retry_me_else and trust_me instructions, successful get_variable instructions, successful and unsuccessful get_constant instructions, and the total number of solutions.)

Table 3.4 Number of times that the WAM instructions are executed (simplified version). (For each case: retry_me_else instructions, successful get_variable instructions and the total number of solutions.)
showing such dependencies (Table 3.5) and then proceed to connect these theoretical values with the experimental values.
To derive the final values for our constants T_chp, T_vu and T_back, we must solve a system of simultaneous equations. For instance, for the SICStus Prolog single experiment of Table 3.1, we would consider the following system of approximate equations:
Table 3.5 Approximate Theoretical Values for a Ternary Predicate. (For each case the table gives the expected dependence of the three retained contributors on the products S_1S_2S_3, S_2S_3, S_1S_3, S_1S_2 and the single sizes.)
Note that some equations are redundant. Sometimes the same terms are equated to slightly dissimilar values, serving to remind us that our results are only approximate. There is no unique method to solve such an overdetermined† system of equations. A simple method described in [Froberg85] solves the system by using a maximum norm. We have computed the following approximate values for our particular environment when using our particular version of SICStus Prolog:

T_chp = 0.020, T_vu = 0.007, T_back = 0.048.
†An overdetermined (or inconsistent) system has more equations than unknowns.
Table 3.6 summarizes the deviation between the experimental values (i.e., the actual execution times in artificial units as obtained during the experiments) and the proposed theoretical values when using these constants for SICStus Prolog (i.e., applying them to the formula described in Section 3.2.4)†. The greatest discrepancies occur for the all-ground-argument case, due in part to the fact that constant unifications play a major role there, and this is ignored by our approximation (i.e., values for T_uc and T_uuc were not derived).
Table 3.6 Average cost error introduced by our approximation (average deviation between theoretical and experimental values over 1,000 different databases)
3.4 Conjunction of Simple Queries, All Solutions
We now proceed to study the case of a conjunction of facts. Again, we are interested in the all-solutions case. Consider a conjunction of simple queries of the form

p_1/a_1, p_2/a_2, ..., p_n/a_n

where the notation p/a means that predicate p has arity a.
†Since the databases in the experiments were forced to have a uniform distribution of independent attribute values, for each database we can easily estimate the values of n_chp, n_uc, n_uuc, n_vu and n_sol required by the formula.
If, at every point in the evaluation of this query, we know the instantiation of the arguments of every subgoal, we can determine the cost of evaluating each subgoal by using the formula described in Section 3.2. The following formula can then be applied to estimate the global cost of finding all solutions (i.e., the total cost of evaluating a conjunction of subgoals):

total_cost = Σ_{i=1..n} ( Π_{j=1..i-1} n_sol(j) ) × c_all(i)

where
c_all(i) is the cost associated with finding all solutions to subgoal p_i; and
n_sol(i) is the estimated (average-case) number of solutions to subgoal p_i.
Note that each successive subgoal is called once for every combination of solutions retrieved by the subgoals that precede it, i.e., Π_{j<i} n_sol(j) times.
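A compact way to evaluate this formula is to fold over the subgoals in their chosen order. The following sketch is ours, not the thesis's implementation; each element of the list is a pair Cost-NSol holding the all-solutions cost and the expected number of solutions of one subgoal.

    % conjunction_cost(+Subgoals, -Total)
    % Subgoals is a list of Cost-NSol pairs, in evaluation order.
    conjunction_cost([], 0).
    conjunction_cost([Cost-NSol | Rest], Total) :-
        conjunction_cost(Rest, RestTotal),
        Total is Cost + NSol * RestTotal.

Unfolding the recursion gives Total = c_all(1) + n_sol(1)×c_all(2) + n_sol(1)×n_sol(2)×c_all(3) + ..., which is exactly the summation above.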
Since we are dealing with subgoals that retrieve tuples from a database, we may determine in advance the actual instantiation of every argument in the conjunction. Thus, every variable that appears for the first time in a subgoal will be uninstantiated, whereas any variable that has appeared before in another subgoal will be instantiated at that point. To find the least overall cost, all possible orders of subgoals must be considered.
Table 3.7 shows a comparison between (a) the experimental costs for the book database example and (b) the costs predicted when only using the primitive constants in our formula and assuming a uniform distribution. Since the book database, like most real databases, does not follow a uniform distribution, significant differences can be observed. However, in this particular case, the uniform distribution model can still be used to predict a general trend, i.e., we may still obtain the most efficient evaluation order, but there is no guarantee that this will be the case in a general situation. The model was also tested against (c) another (artificially generated) book database which was designed to follow a strictly uniform distribution, and, not surprisingly, our theoretical values predicted those costs more accurately. All values in Table 3.7 are reported in SICStus Prolog's artificial units.
Table 3.7 The book titles database. (For each subgoal ordering the table compares (a) the experimental cost on the real database using SICStus Prolog, (b) the theoretical value assuming a uniform distribution, and (c) the experimental cost, using SICStus Prolog, on a book database generated to follow a uniform distribution, together with the difference between (b) and (c); all values are in SICStus Prolog's artificial units.)
Normally, we will pay more attention to the relative cost amongst different orderings than to the "exact" cost values. Table 3.8 shows that we were able to predict the correct order of the costs of the different orderings.
In all three rankings, (a) theoretical predictions, (b) SICStus Prolog on a uniform database, and (c) actual SICStus Prolog measurements, the orderings appear in the same order:

1. book-author-publisher
2. author-book-publisher
3. book-publisher-author
4. publisher-book-author
5=. publisher-author-book
5=. author-publisher-book

Table 3.8 Orderings ranked by their costs
a. Since these last two orderings are within 0.1% of each other (given actual measurements), a similar rank is shown.
3.5
Intensional Database Predicates
As mentioned before, it is common practice to separate logic predicates into two categories: extensional database predicates and intensional database predicates. One advantage of this division is that extensional predicates typically have large numbers of clauses (they can be seen as the database itself), whereas intensional predicates normally have a small number of clauses. Additionally, one can infer some properties for extensional predicates, such as a distribution for the attribute values or correlation factors amongst them, that characterize the database under consideration. Normally, these properties are constant in time (cf. previous sections). In other words, one can predict, within certain parameters, how a query will behave when applied to that database.
On the other hand, intensional predicates are less predictable. They require a more complex analysis framework, whose predictions are normally less accurate. A standard approach for analyzing the execution behaviour of a program is to use abstract interpretation techniques [Cousot77, Cousot92], which transfer the problem to a different, easier to handle domain at the expense of some loss of precision. In the specific case of logic programs, for example, instead of keeping track of the exact values that every variable holds during program execution, one may want to consider a simpler, more general property. One such property is the mode of the variable, that is, its degree of instantiation [Mellish85] [Debray89]. We will not know the exact value, but at least we can ascertain that the variable under consideration is an uninstantiated variable, or a ground constant, or a term with a combination of both (i.e., a partially grounded structure). We can infer these attributes by performing a static mode analysis.
3.6
Mode analysis
In general, Prolog programs are undirected, that is, there is no distinction between input and output parameters for a given predicate. This notion of bi-directionality presents a major challenge to the production of efficient code, since the depth-first search strategy with chronological backtracking that Prolog uses to implement non-determinism is itself a very inefficient strategy [Mellish85]. However, Prolog predicates are typically written with one sole direction in mind and, as a result, some parameters are meant to be exclusively input or output. Knowledge of such directionality can be expressed using the notion of modes, a concept which was introduced by D.H.D. Warren (and refined by Mellish) to classify the ways in which a Prolog predicate is used during the execution of a program. If the programmer provides such clues to help the compiler identify directionality, the generated code can be dramatically improved. A possible alternative is to infer the mode information by performing a global analysis of the program [Debray88].
The standard approach for determining the mode information of a logic program statically uses abstract interpretation [Cousot77], [Cousot92]. This is a general technique where the standard semantics of a program are projected onto a different (and simpler) domain. Several solutions to the problem of finding the modes of a Prolog program have been proposed; a quite extensive survey is given in the introduction of [Debray88]. In this section, the mode inference algorithm of Debray [Debray89] is described, since this framework is the basis of the determinacy analysis of our work.
The mode of a predicate in a Prolog program specifies which arguments are input arguments and which are output arguments, taking into account all possible calls that can occur during the execution of a program. Depending on the nature of the problem, a set of modes must be defined to characterize the modes of the arguments in a Prolog predicate. Debray proposed the family of modes {c, d, e, f, nv}, where c denotes the set of fully-instantiated (ground) terms, d (don't know) the universal set of all terms, e the empty set, f (free variable) the set of uninstantiated variables, and nv the set of non-variable terms (that is, structured terms which are not fully instantiated). The set forms a complete lattice under the inclusion operator (Figure 3.3).

Figure 3.3 Debray's lattice for mode analysis
Given a set of terms T, its instantiation is defined to be the element of the lattice that best characterizes it. Thus, the least upper bound [Birkhoff40] of all terms in T is chosen.
Prolog's unification operation can be understood in the mode domain (called the abstract domain) as an operation that, given the instantiations of the arguments in a call, refines them according to the nature of the head arguments. Debray [Debray89] defined the lattice in such a way that, given two term instantiations t_1 and t_2, their unification is chosen to be the least upper bound of their instantiations under the following partial ordering:

f ⊑ d ⊑ nv ⊑ c ⊑ e

The unification of terms is modelled by applying the join operation to two elements of the lattice, a and b, which returns the least upper bound of a and b. The join operator for the ordering under consideration is written as ∨.
Some examples are shown in Figure 3.4. Note that, since some information is not taken into account in the abstract domain, the results are usually less accurate than in the concrete (and more complex) world.†
3.6.2
General Mode Analysis Method
Debray's method uses the procedural view of Prolog, which recognizes the existence of mechanisms such as procedure call, success, failure, backtracking, etc. Debray's static inference of Prolog modes is based on keeping track of individual variable instantiations throughout the execution of a program. Such information is propagated in the usual way: from caller to callee at any predicate invocation, and from callee to caller at the time of the predicate's completion. Thus, at any point during program execution, an instantiation state is defined, which contains instantiation information for every variable in the program.
The notion of an instantiation state can be extended to any arbitrary non-variable term. A constant term will have ground instantiation (c) and an empty dependency set.
†In particular, unification in Debray's model can never fail.
Figure 3.4 Abstract interpretation applied to Prolog unification given two terms t1 and t2 (for example, unifying f(Y) with f(6) yields f(6) in the concrete domain, while the corresponding abstract unification of nv and c yields c)
A structured term will have ground instantiation (c) if all its arguments are ground, and non-variable instantiation (nv) otherwise.
In order to facilitate the propagation of mode information, Debray's method defines instantiation patterns for every procedure call. An instantiation pattern contains, for every procedure argument, some information related to its instantiation.
3.6.3
Abstract Domains
In essence, we can characterize every predicate call by a previous state (in which the instantiations of the arguments are grouped in a so-called calling pattern) and by a resulting state, which differs from the original state in the same degree as the arguments do (the new set of modes for the arguments is referred to as the success pattern). In the framework proposed by Debray [Debray89], we have knowledge about the degree of instantiation of every term at any execution point. Although not explicitly formulated by Debray, we may characterize his abstract domain as pairs of <calling pattern, success pattern> for each predicate call in the program. More formally:

{ <Pred_r, cpat_m, spat_{m,n}> | 1 ≤ m ≤ ncpat(Pred_r), 1 ≤ n ≤ nspat(Pred_r, cpat_m), 1 ≤ r ≤ number of distinct predicates }

where
Pred_r represents the r-th predicate in the database;
cpat_m is a feasible calling pattern for predicate Pred_r: a calling pattern is an ordered k-tuple (where k is the number of arguments of predicate Pred_r) in which the k-th element represents the current mode of the k-th argument;
spat_{m,n} represents the n-th valid success pattern for the given calling pattern cpat_m: a success pattern is identical in form to a calling pattern and differs from it in that the modes of the arguments are updated as a result of the predicate call (by using Debray's unification rules in the abstract domain);
ncpat(Pred) is the number of distinct calling patterns that Pred can be invoked with (as determined by a static analysis); and
nspat(Pred, cpat) is the number of different success patterns that result from a call to Pred given a calling pattern cpat.
We start with Debray's approach and propose enriching the domain with probabilities of occurrence for the various success patterns, as well as quantities related to the cost of each particular execution path. We tailor the analysis to the all-solutions case. Our analysis will be restricted to non-recursive programs. We could easily extend it to allow recursion controlled by an argument that reduces its size at every recursive step, as Debray does [Debray93], but this type of recursion does not occur in pure Datalog programs.
To illustrate how cost contributors are incorporated into our abstract domain, consider a variant of the books database introduced in Section 2.1.4, and a query whose subgoals are evaluated in the order book, publisher, author. Assume that the database attributes follow a uniform distribution of values. The database profile is shown in Table 3.9.
Predicate name    number of tuples    distinct values    distinct values    distinct values    distinct values
                                      in argument 1      in argument 2      in argument 3      in argument 4
book/4            3,000               3,000              30                 10
publisher/2       7,600               20                 330
author/2          450                 450                30

Table 3.9 The books database profile
For our particular query, the book predicate will be invoked with all arguments unbound (i.e., a calling pattern [f, f, f, f])†. The publisher predicate will be called with a calling pattern [g, f], since its first argument will have a constant value after the call to predicate book has been completed. Finally, the author predicate will have both arguments bound to a constant value (i.e., a calling pattern [g, g]).
†As before, we use "g" to denote a ground term and "f" to indicate a free variable.
In this particular case, we have the following instances that characterize the execution of the query:
(a) book predicate:
In Debray's domain, the following instance is generated:
<book, [f, f, f, f], [g, g, g, g]>,
where the third element of the tuple is the success pattern that results from a successful call. In our domain, we wish to include some cost contributors, namely the number of tuples that are visited, n_t, the number of variable unifications that take place, n_v, and the expected number of solutions, n_s. Thus, we would generate the following instance for the book predicate:
<book, [f, f, f, f], [g, g, g, g], 3000, 12000, 3000>,
where the three numerical values represent the cost metrics n_t, n_v and n_s, respectively.
(b) publisher predicate:
Debray's instance for this predicate would have the form:
<publisher, [g, f], [g, g]>.
Our enriched domain would be:
<publisher, [g, f], [g, g], 380, 380, 380>.
(c) author predicate:
Using Debray's domain, we would obtain:
<author, [g, g], [g, g]>,
while our domain would provide additional information:
<author, [g, g], [g, g], 1, 0, 0.0333>†.
More formally, our enriched domain is defined as follows:

Cost Abstract Domain = { <Pred_r, Clause_q, cpat_m, metrics_{r,q,m}> | 1 ≤ m ≤ ncpat(Pred_r), 1 ≤ r ≤ number of distinct predicates, 1 ≤ q ≤ number of distinct clauses in predicate Pred_r }

†Note that, in this case, the rate of success is given by 450/450/30 (i.e., 1/30).
where
Pred_r, cpat_m, ncpat(Pred) and nspat(Pred, cpat) are the same as before;
Clause_q identifies the q-th clause of the predicate under consideration; for extensional predicates, all clauses may be collapsed into a single tuple template, making this element irrelevant;
cpat_m is the m-th feasible calling pattern for predicate Pred_r; and
metrics contains a list <v_1, v_2, ..., v_n> of values for n different cost contributors that we have decided beforehand are relevant to producing an estimate of the total cost associated with calling pattern cpat_m. For instance, if we have decided that our cost function will be based on the number of successful unifications (say, n_succ_unif), we may have a list of the form <n_succ_unif, num_sol>, where num_sol is the number of solutions that result from calling predicate Pred. In other words, we will record all those quantities that are required by our cost formula. Thus, for the formula derived for the WAM in Section 3.2, our list of metrics would most probably be of the form <n_chp, n_vu, n_back, num_sol>. Note that the number of solutions associated with a predicate call is normally required in order to calculate costs for conjunctions of queries (see Section 3.4).
In other words, given a predicate clause, we are mainly interested in obtaining some metrics related to the cost estimation for any viable calling pattern. We have omitted from the domain a place for the success patterns that are obtained. The reason is that such information is implicitly used by successive predicates in their respective calling patterns†.
†Note that, in the case of GraphLog, only one success pattern is obtained after any predicate call.
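One possible concrete representation of these enriched instances, using the figures quoted above for the books database, is a small set of Prolog facts; the predicate name cost_entry/3 and the metrics/3 wrapper are our notation, not the thesis's data structure.

    % cost_entry(Predicate, CallingPattern,
    %            metrics(TuplesVisited, VariableUnifications, Solutions)).
    cost_entry(book,      [f,f,f,f], metrics(3000, 12000, 3000)).
    cost_entry(publisher, [g,f],     metrics(380,  380,   380)).
    cost_entry(author,    [g,g],     metrics(1,    0,     0.0333)).

A cost analyser can then look up the metrics for each <subgoal, calling pattern> pair it encounters instead of recomputing them.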
3.7
Cost Function
In this section, we explain the domain element metrics. Its purpose is to keep track of all relevant parametric values that are used to estimate the cost of a predicate clause. Normally, it will include values such as the average number of solutions that are expected, the number of tuples that are visited, the number of variable or constant unifications that take place, the number of times that the state must be restored, and so on.
In the general case, our parametric values will differ for different calling patterns. For instance, a call of the form p(X) with calling pattern <f> (i.e., a free variable) will retrieve all facts in the database, whereas a call of the form p(c) with calling pattern <g> (i.e., a ground term) will retrieve at most one fact. Therefore, we have to keep track of the feasible abstract paths, i.e., information regarding the calling patterns with which the subgoals may be invoked during the evaluation of that particular clause. Given a conjunction of subgoals:
s_1, s_2, ..., s_n,

we define an abstract path as a list of tuples:

abstract path = [ (s_1, cpat_1), ..., (s_n, cpat_n) ]

where cpat_i represents a feasible calling pattern for subgoal s_i.
Note that an abstract path is defined for any GraphLog query or rule, and its unique value is determined statically via a simple mode analysis. Although no success pattern appears in the definition of an abstract path, it should be clear that successive calling patterns are built from the success patterns obtained from previous subgoals.
3.7.1
Cost Function from the Perspective of Head Unifications
For intensional predicates, we will estimate the evaluation costs at the clause level, that is, we will obtain the contribution to the cost due to each one of the clauses of a given predicate. In this section, we propose a methodology that can be used to estimate the average cost of evaluating a predicate given a particular calling pattern. We will concentrate on the probability associated with the process of head unification. In the following section we will consider how to estimate the cost of evaluating a complete query or clause body. For this reason we start from the assumption that the average cost of every body is available before evaluation time.
We estimate the total cost of a single clause as follows (Eq. 3.1):

cost(Pred_r, Clause_q | cpat) = cost(h_unif(Pred_r, Clause_q) | cpat) + P(h_unif(Pred_r, Clause_q) | cpat) × cost(body(Pred_r, Clause_q) | h(cpat))

where
cost(Pred_r, Clause_q | cpat) is the cost that results from the evaluation of the q-th clause of the r-th predicate, given an initial calling pattern cpat;
cost(h_unif(Pred_r, Clause_q) | cpat) is the cost due to the process of head unification for the q-th clause of the r-th predicate, given a specific calling pattern cpat;
P(h_unif(Pred_r, Clause_q) | cpat) is the probability that the process of head unification for the q-th clause of the r-th predicate is successful for the given calling pattern cpat;
h(cpat) is the modified pattern that results after a successful head unification given a calling pattern cpat, which in turn is the initial calling pattern for the body of the clause; and
cost(body(Pred_r, Clause_q) | h(cpat)) is the cost associated with the evaluation of the body of the q-th clause of the r-th predicate given a calling pattern h(cpat); this value will be analyzed in a following section.
Besides the cost of the body, there are two unknown quantities at this point: the cost due to the process of head unification and the probability that the head of the clause successfully unifies with the arguments of the call. The estimation of the number of primitive operations that take place during a successful head unification is quite straightforward: roughly speaking, one tuple is visited, only one restoration process would be necessary in case of backtracking, and the number of variable and constant unifications can easily be determined from the calling pattern and the internal structure of the head.
Additionally, given a calling pattern, we wish to determine the probability that the head of a clause is successfully unified. Since we do not know the exact values that can appear at every argument position, nor the frequency with which these values appear, we are forced to make assumptions. A rough but simple assumption would consider that all ground values follow a uniform distribution. Our universe of ground values may be defined such that it comprises exactly those values that appear in the heads of the clauses. Alternatively, we may obtain the distribution and universe of attribute values from another source. If we want to attach probability values to each clause we are forced to define or select a universe of values that, at least, includes all constant arguments that occur in the heads.
The probability that a given calling pattern successfully unifies with a clause head is estimated as follows (Eq. 3.2):

P(h_unif(Pred_r, Clause_q) | cpat) = Π_{k=1..numarg} P(succ_unif(a_k) | cpat[k])

where
P(h_unif(Pred_r, Clause_q) | cpat) is the probability that the process of head unification is successful for the q-th clause of the r-th predicate, given a calling pattern cpat;
P(succ_unif(a_k) | cpat[k]) is the probability that an argument with instantiation cpat[k] can be successfully unified with the k-th argument a_k of the actual head;
cpat[k] is the current instantiation of the k-th argument in the calling pattern cpat; and
numarg is the number of arguments in the head.
In general, the value of P(succ_unif(a_k) | cpat[k]) can be estimated as follows (Eq. 3.3):

P(succ_unif(a_k) | cpat[k]) = 1, if a_k = f or cpat[k] = f; κ_k otherwise.
It is important to realize that real-life predicates do not necessarily have independent probabilities for their argument unifications. In other words, our proposal assumes an ideal case: that there is no correlation amongst arguments.
The value of κ_k should be determined from whatever abstraction we use to characterize the distribution function of attribute values, and in our framework it represents the average probability that a ground or partially ground term cpat[k] can be successfully unified with the actual argument a_k. If no distribution function is known, or if our abstraction does not keep track of the actual values of the constants, we may simply assume a uniform distribution of values. With this crude assumption, the probability that a ground argument can be unified with another ground argument is given by 1/C(a_k), where C(a_k) is the cardinality of the universe of values for argument position k. The probability that a ground term can be unified with a non-compatible argument (for example, a structure with a different arity) is zero.
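A small sketch of this estimate under the uniform-distribution assumption (our illustration; head_unif_prob/4 is not a predicate of the thesis): each head argument contributes a factor of 1 when either side is a free variable, and 1/Card otherwise, where Card is the cardinality assumed for that argument position.

    % head_unif_prob(+HeadArgs, +CallingPattern, +Cardinalities, -P)
    head_unif_prob([], [], [], 1.0).
    head_unif_prob([H | Hs], [Mode | Ms], [Card | Cs], P) :-
        head_unif_prob(Hs, Ms, Cs, P0),
        (   ( var(H) ; Mode == f )
        ->  P = P0                 % a variable on either side always unifies
        ;   P is P0 / Card         % two ground values match with prob. 1/Card
        ).

For instance, ?- head_unif_prob([a, f(b), X], [g, g, g], [10, 5, 3], P). gives P = 1/10 × 1/5 × 1 = 0.02, the hypothetical cardinalities 10, 5 and 3 being chosen only for this example.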
For example, consider the predicate u/3. We may decide to estimate the universes of values as {3, 4, 5, 6} for the first argument, {t, v, w, x} for the second argument, and {g, i} for the third argument. Suppose that our abstraction for these attribute values consists of the number of distinct values for each attribute.
If we assume independence amongst attributes, the total probability that a head can be unified with any calling pattern is given by the product of the individual probabilities associated with each argument (one in the case of variables, a fraction over the number of distinct values for constant arguments). Thus, in our example, the probability that the first clause succeeds would be given by the product of the corresponding fractions for its constant arguments. Note that this probability is the same for any call in which all three arguments are constants (i.e., for all of the initial four rules of predicate u/3).
Similarly, the probability that the fifth and sixth clauses succeed is estimated from the probability that the call unifies with their constant argument. Finally, the probability that the clause u(M, P, N) succeeds would be one. Note that, although our universe sets are arbitrary and underestimates of the true argument domains may occur, there is still a notion of which clauses are more likely to succeed.
3.7.2
Cost Function from the Perspective of Body Evaluations
Once we are able to assign a probabilistic value to each clause head given a calling pattern, we can estimate the cost of evaluating the corresponding bodies. First, we derive a formula that permits estimation of the cost of evaluating a single subgoal s_i. Since predicate Pred_r, the predicate invoked by the subgoal, contains q different clauses, the following formula can be used to determine an average cost associated with the whole predicate:

cost(Pred_r | cpat) = Σ_q cost(Pred_r, Clause_q | cpat)

where
cost(Pred_r, Clause_q | cpat) is the cost that results from the evaluation of the q-th clause of the r-th predicate given an initial calling pattern cpat, as analyzed in the previous section.
Now, given a conjunction of subgoals:

s_1, s_2, ..., s_q,

the cost of the compound sequence of subgoals is decomposed into the individual costs due to each abstract path, as follows (Eq. 3.7):

cost(sequence of subgoals) = Σ_{k=1..npaths} cost(abstract path_k)
where
npaths is the number of distinct abstract paths that a given clause of a predicate can yield when a calling pattern is initially used†; and
cost(path_k) is the cost that results from the evaluation of the complete k-th abstract path. This function can be expressed as follows (Eq. 3.8):

cost(path_k) = Σ_{i=1..n} ( Π_{j=1..i-1} nsol_j ) × cost(s_i | cpat_i)

where
nsol_k is the average number of solutions that subgoal s_k produces when invoked with a calling pattern cpat_k, as recorded in the domain element metrics.
†In pure Datalog, only one abstract path is actually derived.
Example. Consider an extended version of the books database introduced in Section 2.1.4.

book(Title, Publisher-Name, Subject, Author-Name). A collection of book titles along with their publishers, the subjects of the publications and their authors.
publisher(Publisher-Name, City). A list of the different cities where book publishers have an authorized distributor.
author(Author-Name, Nationality). A group of facts that relate authors to their respective nationalities.
skilled(Author-Name, Subject). A list of the two most prominent authors on every possible subject.
forte(Publisher-Name, Subject). A list of the two top publishing companies for every given subject.

Suppose that we wish to retrieve an exhaustive list of tuples of the general form <Title, Publisher-Name, City, Author-Name> for those "worthwhile" publications whose author has a certain nationality.
Database profile:
(a) Extensional DB predicates. We assume that the extensional database predicates follow a strict uniform distribution of attribute values. The corresponding database profile is given in Table 3.10.
Table 3.10 The extended books database. (For each predicate the table gives the number of tuples and the number of distinct values in each argument position.)
(b) Intensional DB predicate. Suppose that our (only) intensional database predicate is defined as follows†:

% worthwhile/3: worthwhile(Publisher, Author, Subject). Tells us if a book
% is worth buying.
worthwhile(publisher_1, _, _).
worthwhile(publisher_5, _, _).
worthwhile(publisher_10, _, _).
worthwhile(_, author_23, _).
worthwhile(_, author_7, _).
worthwhile(_, author_13, _).
worthwhile(Publisher, _, Subject) :- forte(Publisher, Subject).
worthwhile(_, Author, Subject) :- skilled(Author, Subject).

(c) Query. We consider a query whose subgoals are evaluated in the order book, worthwhile, publisher, author, the author's nationality being the only ground argument.
Table 3.11 and Table 3.12 show the results of applying our framework to this example.
<clause, cpat>           n_chp    n_vu       n_sol    n_chp×T_chp + n_vu×T_vu + n_sol×T_back
<book/4, [f,f,f,f]>      3,000    4×3,000    3,000    276.0

Table 3.11 Predictions for all predicates (only the book/4 row is reproduced here)
†Strictly speaking, these predicate definitions are not safe. All anonymous variables should be explicitly constrained by direct references to the book, publisher and author predicates. For instance, the first definition should be written as:
worthwhile(publisher_1, A, S) :- book(_, _, S, A).
However, we omit these additional predicates to keep the example simple.
Table 3.12 Predictions for the intensional database predicate (one row per clause of worthwhile/3 for the calling pattern [g,g,g], giving the corresponding cost metrics and probabilities of successful head unification)
Experimental result:
average cost for all possible queries†: 30,834.6
Theoretical result:
cost = cost(book | [f,f,f,f])
       + n_sol(book | [f,f,f,f]) × ( cost(worthwhile | [g,g,g])
       + n_sol(worthwhile | [g,g,g]) × ( cost(publisher | [g,f])
       + n_sol(publisher | [g,f]) × cost(author | [g,g]) ) )

cost = 276.0 + 3000.0 × (0.232 + 0.261 × (28.120 + 380.0 × 0.022)) = 29,535.6

theoretical cost: 29,535.6 (an error of approximately 3.6%).
†The experiment was repeated for all different author nationalities (the only ground argument in the query), and the average value is reported here.
Empirical constants used:
Appendix II shows another example of our methodology applied to a different Prolog system.
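For what it is worth, the nested expression above has the same shape as the conjunction formula of Section 3.4, so the conjunction_cost/2 sketch given there reproduces it when fed the rounded figures quoted above (the small difference from 29,535.6 comes only from that rounding):

    ?- conjunction_cost([276.0-3000.0, 0.232-0.261, 28.120-380.0,
                         0.022-0.0333], Cost).
    % Cost is approximately 29535.8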
3.8
Overview of the Model
In this chapter we first derived a crude method to estimate the cost of evaluating GraphLog queries when applied to extensional predicates. A top-down model of computation is assumed. The method requires an empirical estimation of various constants associated with the evaluation time of primitive operations. A profile of the underlying database is also required. Using this database profile, formulae to estimate the expected number of primitive operations that will occur during query evaluation must be derived. We have considered the case where database values are distributed uniformly and independently. Although real databases seldom conform to a uniform distribution model, we may still be able to predict which evaluation orders give the best execution times.†
Additionally. we have managrd to rxplain why some heuristic techniques for query
reordering that are based upon a "bound-is-easier" heuristic procedure work
[Sheridan9 1 1. These non-detrrministic algori thms usually srlrc t subgoals containing
ground arguments to be placed before other subgoals. We have obssrved that the occurrence of a ground argument reduces the number of primitive operations that take place
with respect to the case where that argument is not bound. For cxample, fewer tuples
have to be visited, fewer variable unifications take place (the constant unifications that
take place instead are, by far. less expensive operations), and. since fewer solutions are
expected to occur. fewer state restorations will occur. Thus. a fact retrieval with ground
arguments will be less expensive to evaluate than its non-ground counterpart.
+The closer the distribution resembles a unifom distribution the better the results wiIl be.
We then have proposed a general framework to sstimate the performance of
GraphLog (Prolog) queries based on abstract interpretation techniques and mode analysis. Again. a top-down execution is assurned. The method is applicable to the all-solutions case. The basic idea is to associate probabilities with the process of head unification. while considering the expected costs of the subgoals in the body of the clauses. Cost
metrics and the average number of solutions are propagated throughout the bodies of the
clauses in the usual manner.
Typically. the expressions for the number of solutions and primitive operations for
a given tuple ~ t i b g o o lcpal>
.
will normally be rxpressed in terrns of the numbrr of solutions of othçr queries represented by the tuples (subgoalA7cpatk>. This impliss that an
order of rvaluation of the analysis rquations should be found. If we restnct the queriss
to be non-circzilar. in the sense that they do not contain recursive calls (direct or indirect), it is always possible to tind an order of evaluation which is paranteed to terminate.
Chapter 4.
A qualitative model
So far. we have studied how to obtain a cost model for a specific abstract machine. Unfominately. if another abstract machine is used. we rnay not be able to apply Our specific
model. Furthemore, even if the same abstract machine is being used. but substantial additions or optimizations have been incorporated. the model may produce poor results.
Wc now procrcd to analyze how to derive a more general model in which underlying implementations are lrss relevant.
4.1
Fundamental Database Operations Revisited
We have already mentioned that any DatalogiGraphLog que- can be expressrd in t e m s
of fundamental database opentions (Le.. selections. joins and projections). and therefore
the methodology for estimating the value of the cost contributors may be focused on
these fundamental operations. Although. strictly speaking. this operational modrl usually assumes a bottom-up cornputation. we may also borrow some of the concepts and
apply them to a top-down model.
As an example. consider a database of articles sold at a given store. Simplified relations would include base relations for ( 1) products, Say. article(Artic1e-Name. Price. Department.
Distributor-Name). (?)
interna1 bookkeeping.
Say. taxation(Departrnent,
Applicable-Tax), (3) personnel grouped by department. Say, personnel(Department. Name),
and (4) distributor information, Say, distributor(DistributorIName, Distributor-Data). A typical query to retrieve information to calculate the cost of an item (ziven a specific distributor) afier taxes would have the f o m :
:-article(peaches, Price, Dept, starninaAc), taxation(Dept, Tax).
If we know how many distributors are registered for the article peaches.t the calculation of the cost contributon is quite straightforward. In this case. the cost of executing
the second subgoal is independent of the specific value of the department ( ~ e p tthat
) is
retrieved by the first subgoal (assuming that every department has at most one tax rate
in place). If the exact number of distributors that se11 peaches to the store is not known.
but an average of distributon prr article or similar information is available. this average
value can be used to estimate an "expected" number whose accuracy will depend on h o a
-average.*the article is.
Now. suppose that we pose a query to retneve the names of the clerks that belong
to the store department that sells peaches:
:-article(peaches.P. Dept. D). personnel(Dept.Clerk).
In this case. the value of the cost contributors associated with the second subgoal
will be influenced bby the actual department (Dept) that is retrieved by the fint subgoal.
assuming that different departments have different number of employees. If wr do not
know the depanment that will be the output of the fint subgoal. accunte knowledge of
the distribution hnction for the personnel relation will not be of much hrlp (sincc that
department can be an? of the valid departments in the store).
This is a vrry cornmon situation. and a compromise is needed (unless we want to
execute the code to detemiine the exact value of the department! ). One possible solution
would be to ' ~ e i g h t "al1 different departments by using a measure related to rheir probability of appsaring in the query (some departments are more likely to be invoked) and
take their weighted anthmetic mean as the value for the "average" department. If al1 departments have the same probability of being selected, a simple average may be used.
The utilization of central tendency values seems to be more appropriate than the use of
extreme (skewed) values. at least in the long mn. It is obvious that the absence of an exact value for the department anribute inevitably produces a loss in the accuracy of the
sstimate.
fWe also know that any article has only one pricc and forms part of one department exclusively.
Selection
The selection operation is the easiest of the three basic relational algebra operations to
deal with. Several researchers [Selinger79. Christodoulakis83. Fedorowicz83] have proposed diverse fomulae for different distribution functions. A srraightforward application of these formulae applied to the information indicated in the profile is al1 we need
to detennine the expected cardinality (i-e.. the average number of output tuples) of the
result. and the estimated values of other cost contributors. such as the expected number
of visited tuples or the expected nurnber of t e m unifications. can also be derived from
the forrnulae.
For instance. consider a database base relation r(integer. String) whose first argument
is known to follow an integer normal distribution with mean p and standard deviation G.
and that we also know that the number of tuples is. say, :V. If a qurry of the forni r(X. Y )
is used. where X is bound to a constant integer value while variable Y is a fiee variable.
we may estimate the number of tuples that are expected to be retrieved by using our
knowledge of how a normal distribution behaves. and by selecting appropriate ranges for
our analysis (sincr w r must "discretize" Our representation to accommodate integer values exclusively). However. we must be aware that the database profile is ofien just a sirnpie approximation of the real problem. and a real-life database will normally differ from
the -'ideal" case. Figure 4.1 shows ri typical example of an attribute that follows a discrete
version of a normal distribution. Note that its general shape is the one wr rxpçct for a
normal distribution, but individual values have some deviations from the ideal representation.
An interesting problem occurs when the same variable is attached to two or more
argument positions within a predicate. For instance. a subgoal such as a(X.X) establishes
an additional restriction: that both arguments have the same value. Even for simpler distribution functions. this seemingly harmless restriction poses a difficult challenge that
would require some additional information (for instance. the correlation amongst attributes) to be solved properly.
Figure 4.1 Frequency diagram of an attribute that may be approximated by a discrete
normal distribution
-4s mentioned before. if we do not know the exact constant value imrolved in the
seiection. wr may not br able to use the database profile. since this is otien given as a
fünction of the input value. For instance. in the example of Figure 1 . 1 . if the constant value of the argument is known to have a value. Say. X = 38 1. wc may expcct a cardinality
of approximately 20 tuples (from the normal distribution). But if the value of X is unknown (prrhaps because Our abstract interpretation analysis did not keep track of constant values). al1 we can do is çither propose a ronge of values (i-r..from O to 22. for the
ideal curve) or calculate an average value (and we must establish a finite range of :'Y'
values to do so). For exarnple, if we decide to estimate the cardinality of the selection a s
a simple average. for the attnbute depicted in Figure 4.1 we rnay choose a range of "X'
values from
-
-3
value of .V 8.0
x
o to p -+ 3
x
o. in which case we would have an approximate average
. If we choose a range of values that varies from
p-2
x
a to p + 2
x
o.
the average value will be approximately Y = 1 i .j . If we consider a range from p - 0 to
p+o.
we will have Y = 16.7 .
Join
Generally speaking. the join operation can be viewed as a Cartesian product of the two
relations involved. It is used to combine tuples from two or more relations [Mishra91].
In the case of a Datalop/GraphLo_equery. a join of the form
... si (A1.....AN), s2 (61.....BM). ...
c m be analyzed (assuming independence of subgoals and a top-down evaluation strate-
-w )as two separate selections (for s i and s2. respectively) and thtn. realizing that the
d
second subgoal will be invoked as many times as solutions the first subgoal provides. a
panicular cosr conuibutor may be calculated as (Eq. 4.1 ):
cost conuibutor ( s 1 join sZ ) = cost conmbutor ( s 1 ) + solutions ( s I
) x
cost conuibutor ( s 2 )
Naturally. a simple ("mode") analysis must give information as to which arguments
will hold constant values in s i and s2. For instance. in the sequsncr
predicate p will be invoked with two variable arguments. predicates q and r will be callrd
with a first argument constant and a second argument variable and prrdicate s will have
both arguments ground. Note that the simplicity of this analysis is due to the fact that
Datalog-like languages guarantee that al1 arguments are bound to some constant value
afier any predicatr call. Note that this analysis dors not keep track of the actual constant
values: only the fact that the argument is constant is established. Using similar notation
to the one used in the previous chapter. Our formula is as follows (Eq. 1.2):
ç o s t ( p ( . - l . B ) . y ( B . O . r ( C . D ) . s ( . . l . D )=) cost(p(.-l.Bl!
+
where the calling patterns are abbreviated as "c" for constant values and "j" for free (or
unbound) variables.
Note that. the formula for a reordering of the subgoals should nomally have a
slightly different aspect. since the "calling" patterns will usually Vary. Consider the following reordenng of subgoals:
In this case. our formula becomes (Eq. 4.3):
Projection
The most problematic of the basic relational algebra operations is the projection operation. The main challenge has to do with handling duplication of output tuples aficr the
projection [Kwast94]. Again. statistical considentions regarding the distnbution of attribute values may be used to tackle the problem [GelenbeSZ. AstrahanSj].
To illustrate this idea. consider again the example in Figure 4.1 . Note that there are
1000 different tuples of the f o m r(integer. String). However. if a projection is performed
over the fint argument (thus. rliminating the second argument). it becomrs clear that w r
will obtain from O to 25 duplicates for each "..Y" value. Standard Datalog does not discriminate amonot duplicates. and only one value is reported. For this reason. if the first
argument is known to be constant. we can establish that at most one valid answer will be
derived. If the "X' value lies within a particular region of the distribution curve. we may
assign a probability of that value producing such a valid answer. For instance. in Our normal distnbution for the example in Figure 4.1, if the 'Y' value is contained within the
region from
p -I
x O
to p + I
x
a. we may estirnate that the probability that the projection
of that first argument will have cardinality 1 is approximately 95.44% (normal distribution). Note that. if we perform the projection over a first argument that is a free variable,
the cardinality of the result will be given by the nurnber of distinct attribute values for
the first argument in the relation.
A more complicated scenano takes place when the projection involves more than
one relation. A cornrnon example occurs when two subgoals in the same qurry share a
common variable. such as in:
In this case. the projection will be a new relation. say sld. having the union of al1
arguments from s l and s l (Le.. the join of both relations). but with only one instantiation
of the cornmon arguments (the projection proper):
As an example. consider the two base relations in Figure 1.2.:
L
% predicate s l
%predicate s2
SI (ba,ba,ba).
sl(ba,ba,aa).
sl (aa,ba, aa) .
SI (aa,aa,bal .
sl (aa,aa, aa) .
s 2 (ba,ba, b a l .
s 2 (ba,aa,ba)
s2 (ba,aa,aa)
.
.
s 2 (aa,ba,ba).
s 2 (aa,aa,aa).
7
(a)
(b)
Figure 4.2 Two ternary predicates sl and s2
Suppose that we have to estimate the cardinality of query ~ ( C . D )where:
.
The join would yield the intermediate relation sa shown in Figure 4.3.
Then, a selection is performed such that the first argument of predicate si is q u a 1
to the first argument of predicate s l . The resultinz relation sh 1s shown in Figure 1.4.Finally, we must project arguments 3 and 5 to obtain the final result. as shown in Figure
4.5. Note that fiom the 12 tuples that are obtained for relation q, only 4 will be in the final
answer (the remainder are discarded as duplicates). Fortunately, in this case we already
know the upper bound for the cardinality of relation q, which is the product of the sizes
of the domains of arguments 3 and 5. i.e. 2 x 2 = 4. Thus, if the cardinality of the selection
% predicate s1 j o i n s 2
,ba,ba,ba,ba,ba).
,ba,ba,ba,aa,ba).
,balba, ba ,aa, aa) .
,ba,ba,aa,ba,ba).
,ba,ba,aa,aa,aa).
,ba, aa, ba ,b a lba) .
Figure 4.3 Join of predicates SI and s2
2lec tion a £ ter join
Figure 4.4 Selection after the join of prec licates sl and s2
afier join has a value that exceeds this upper bound. we must automatically reducr the
estimate to have a value that docs not exceed the upper bound. The lower bound is almost
always a cardinality of zero.
The whole picture
As has been mrntioned before. it is not unusual to obtain formulae for combinrd relational algebra operations that occur relatively fiequently (for instance. a selection afier
projection). The main advantage of this idea is that some sources of inaccuracy are elim-
Figure 4.5 Final projection of arguments 3 and 5
inated (mainly. the fact that we simply do not know the shape of the distribution for intermediate results). not to mention that the estimation procrss requires Iess effort.
In traditional databasc que.
modelling. given a sequencc of subgoals. we usually
decompose it into primitive relational algebra operations. apply fomulae derived for the
speci fic charactenstics of each partici patine relation to each of such componrnts. and
successiveiy continue applying the formular to the intermediate relations that result until
the entirr sequence is analyzed. Note that this approach is only valid when dealing with
non-recursiiu qurries.
In Our model. we use an analogous approach. The estimation of the cost of an extensional database predicate call may simply apply fomulae alrcady derived in standard
database rrsearch. We specifically definr a simple formula to estimate the cost of a conjunction of subgoals (Eq. 4.1 ) that is applicable to the top-dom model of execution.
As a final note. practically al1 methods assume that the constant values indicated in
the quex-y are valid ones. i-r.. values defined in the domain of the respective attribute.
This validation can be done for base relations without any major complication (for instance, a simple check for quenes to predicate personnel(Department, Name) in which the
first argument is a constant may determine whether this is a valid Department), but is not
an easy task for intermediate (virtual) relations (unless we keep track of the entire inter-
mediate results instead of a simpler abstraction). For intermediate relations. the validity
of constant values is usually autornatically assumrd for ease of analysis.
4.2
Recapitulation. Cost Estimation and Query Reordering
In this section we repeat some of the ideas that have bern mentioned before in ordcr to
produce a clear picture of al1 the important issues that must bc addrrssed.
Given a query of the form
whose cost we wish to estimate. we propose to decompose it into simpler components
that are assumed to be independent from each other. The simplrst choice consists of detining a subgoal as the primitive entity to be analyzed. A subroal is then treated as a
"black box": givrn some inputs (degree of instantiation of the arguments. numbrr of
timss that the subgoal is expected to be invoked. etc.). the expected values of the cost
contributors may be estimated (as the outputs of the black box) and used by successive
blocks as thrir respective inputs (Sre Figure 1 A). The subgoal itself has to provide some
information about interna1 c haractenstics such as distribution of attribute values or correlation amongst arguments. The total cost of the query is obtained as the sum of the individual costs of the subgoals. Standard abstract intctrpretation techniques may be used
to determine the degee of instantiation of the arguments and propagate the intermediate
results through al1 successive query components.
When a subquery is known to have at leasr one constant argument (whose exact val-
ue rnay bc unknown at the analysis time). we are forced to choose a way to account for
di fferent possible scenarios that result from the selection (since di fferent constant values
will produce different values for the cost contributors). A simple compromise is to consider "average queries" that represent either the most typical query that is cxpected to occur or an amalgamation of al1 distinct possibilities in which a (weighted) average is calculated.
There are two general groups of subgoals that are treated separately: simple fact retnevais (i.e., extensional database predicates) and general predicate calls (i.e., intension-
al database predicates). The estimation of the cost o f a simple fact retrieval can be reduced to a statistical problern since we know (or rnay detemine) the distribution followed by the arguments. General predicate calls are more complex. Specifically. we
have to deal with the following issues. amongst others: (a) head unification. ( b ) clause
indexing. (c) independence of subgoals and (d) the fact that the distribution of internediate results rnay be difficult to predict. Head unification and clause indexing rnay be taken into account by assigning to each rule in the predicate a probability of success given
the d e g r e of instantiation of the arguments involved. Each rule is then weighted based
on this factor.
The problem that two or more rules rnay provide comrnon solutions is not a trivial
one. Given two rules rl and Q. that provide set of answers .A1 and -4 ,. respectively. we
wish to find a new set -4 12 that is the (set-)union of sets -4 and -4,. Unfortunately. our
analysis cannot provide enough information to solvr this problem. since we do not know
the nature of answers -4 1 and d 7 . A mutual exclusion analysis may help. in the sense that
if wr determine that .dl and .A 7 have no answers in common then we know that the cardinality of . A I 7 is the sum of the cardinalities of d l and A I . But the genenl problem of
duplication resulting frorn independent rules is complex to solve.
We also have to handle recursive queries which. in the case of Datalog-like query
languages. occur in the f o m of a predicate closure. In Our schemr. a recursive query is
also treated as a black box, although the estimation of cost contributors (outputs) has to
be solved quite differently. The values of many of the cost contributors are totally meth-
od-dependent. Apparently. we rnay obtain good estimates of the number of tuples that
result from the closure (which is a crucial value required by successive black boxes).
1.3
Our Proposed Framework
In this section we delineate how to determine the expected values of the cost contributors
for a given subgoal. As a first step. the set of relevant cost contriburors thai we are going
to work with must be selected. Unfominately. unless we have a history of performance
of the form of query under analysis. it is not straightforward
tributors are important and which ones rnay be disregarded.*
COdecide
which cost con-
Once the relevant cost contributors have been selected, we have to calculate their
average values for each subgoal (represented as a black-box). We will estimate the rx-
pected average value for each cost contributor given some information. such as the actual
calling pattern or a database profile. The estimation of such average values will normally
require the application of formulae denved for the different basic operations of relational
algebra previously-mentioned (selcction, join. projection). W s rnay use formulae described in the literature (if we happen to be working with a specific database distribution
that has been previously studied). or simply consider a simple distribution (a uniform
distribution of independent attribute values is the usual choice).
to a crrAs has been mentioned before, the rxpected average nurnher ofsoli<~ions
tain subgoal has to be estimated. since it wil1 be used whenever a join operation occurs.
Once the expected average values of al1 relevant contributors have been estimated
for each separate subgoal (by using formulae for the sdecrion operation given a certain
calling pattern). the join operation is considered. We observe that when the al 1-solutions
case is considered (as is the case in a standard GraphLog query). any subgoal will be artemptrd as many times as solutions the subgoal to the left has providçdt (See Figure 1.6).
Thus. the values of the cost contributors are scaled by a factor given by the number of
solutions of the previous subgoal.
In other words. the value of a cost contributor of a subgoal is estimated as (Eq. 4.4):
value ( cos[-conun, ) = num-solnl
,
x
average-value ( cost-contrnr)
The calculation of the nurnber of solutions to the whole que- uses the value of the
number of answers to the last subgoal (scaled by the values of the number of answers to
al1 previous subgoals) as an upper bound.:
nm-solquq
= num_sol, x nu-sol,
- x ... x num_solm
iAccurnulritivc profriing is often the best aid to this end
+For the case of the lefi-most subgoai. a factor of 1 must be considered
...( Eq. 4.5)
a34
repeat
v
repeat
ns2 times
cost
contributor
cos1
contributor
values
values
ns1
answers
ns2
answers
Figure 4.6 Cost contributors are estimated for each subgoal
This value does not consider duplicates. and therefore. we rnay obtain an overestimate if we use this value directly. To avoid this. our framework wouid require a way to
take into consideration the removal of dupiicate tuples fiom the final solution.
The calculation of the average value of a cost contributor for the whole que- is accomplished by adding al1 individual values of the cost contributors for the differrnt subgoals in the query under consideration (Eq. 4.6):
value ( cost~connnqucn
) = value ( costcontrl ) + value ( cost-contr, J + ... + value ( cost-conun, )
-
Finally. once we know the values of al1 cost connibutors for the whole que-.
wr
are in a position to determine the total cost of the query. If we know the rxpected average
cost of each contributor per se (as a primitive operation). the problsm is reduced to what
wr have already discussed in Chapter 3. However. if the empirical values of these primitive operations are unknown. wr are forced to attach some weights to each of them. or
give pnority to some of them. The simplest straiegy is to select one single cost contnbutor and base Our rankings on this sole parameter. Othenvise. we face the problem of assigning specific wrights to the cost contributors.
:Unfomuiately. errors in the estimation of the number of answers to the whole query may increase exponentially with the number of components [Ioannidis95].
Chapter 5. Handling Recursive Queries
So far. we have characterized the cost of a (non-recursive) predicate by means of simple
cost measures. such as the number of visitrd tuples. the number of successful unification
operations or the number of solutions that are obtained. The only compiication we havr
encountered bas to do with having to consider different variants of clause indexing. depending on the actual implrmentation of the qurry evaluator.
Extending those results <O recursive predicates poses a real challenge. Undecidability of geneni recursion is well-known. and so is the potential occurrence of infinite computations. It does not corne as a surprise that most resrarchers have concrntrated on very
specific cases of recunion. For instance. Debray and Lin [Debny93] havr devrloped a
method for cost analysis of PmIog programs based on knowledge about "sizr" rrlationships betwren arguments of predicates. which is only applicable to recursivs definitions
in which an argument decreases in size at each new recursive invocation.
5.1
Erecution Cost of a Recursive Q u e l
Besides recursion with decreasing sizr functions over new recursive steps. there are other cases of recunion that may be handled by our cost model. The most important of these
is linear recursion over a database domain. In fact. one of the greatrst advantages of query langages derived fiom Datalog is that every (database) query produces a finite number of answrrs. and infinite loops are therefore avoided by choosing an appropriate evaluation method.
One immediate consequence of the selection of a specific evaluation method is that
the actual cost of evaluating a recursive predicate will depend on the chosen method.
There are many different evaluation methods that deal with recursive queries [Ceri90].
In general. cost measures such as the nurnber of visited tuples or the number of unification attempts are algorithrn-dependent, and there are additional factors that add to the
evaluation costs (for instance. bookkeeping of structures or validation of certain conditions).
However, there is one cost measure that is totally independent of the evaluation
method: the nwnber of solutions to the query. Furthemore. the number of solutions is a
quantity that is propagated to other subgoals in the query. since it affects the number of
times that the successive subgoals will be invoked.
For these reasons. it is relevant to devise a method to estimate the number of solutions that is associated with a recursive query.
5.2
Formulation of a Recursive Q u e l in Terms of Transitive Closure
Jagadish and Agrawal [Jagadish87] have shown that rvery linrarly recunive query can
be rxpressed as a transitive closure possibly preceded and followed by the usual openton of standard relational algebra (joins. projections. selections. etc.). A recursive rule
is Iinear if there is exactly one occurrence of the recursive literal in the body. BanciIhon
and Ramaknshnan have conjectured that most recursive queries are linear
[Bancilhon86]. The significance of this result is that it is potentially feasible to predict
the number of solutions of every linearly recursive query if we denve a general method
that is able to detennine the number of sdutions of the transitive closure case.
Thus. we suggest the following methodolog to atirnatr the number of solutions of
a recursive predicate:
1. Transfoml the linearly recunive predicats into its equivalent form that involvrs
transitive closure;
2. Estimate the cost of the transformed predicate in terms of its constituents (i.e..
normal non-recunive predicates and the transitive closure itsel f).
Thus. it becomes clear that we need to devise a rnethod to estimate the cardinality of a
transitive closure.
5.3
Predicting the Average Number of Solutions of a Transitive Closure
One of the most cornmon uses of recursion in GraphLog is simple transitive closure. exemplified by the following two rules:
where tc defines the result of the transitive closure and b is the relation (or base predicatr)
over which the closure is performed. Note that only two (sets of) arguments are involved
in the closure relation.
Our goal is to find the cardinality of tc (i-e.. the number of tuples n, that are obtained as a result of applying the transitive closure operator) given some information
about predicatr h. It is evident that the nature of predicate h has a substantial impact on
the cardinality of its transitive closure: if we represent a predicate by its rquivalent gaph
in which a fact is represented by a directed edge between the two values (nodes) of the
closure arguments. the transitive closure of a tree-like structure will produce fewer niples
than. for instance. that of a heavily comected structure with the same number of facts.
Bv the same token. a predicate with a highsr number of facts will nonnally producr more
tuplrs afisr the application of the transitive closure operator than a similarly-structured
predicate with fewer facts.
The simplrst possible study of transitive closure is one that only considen the cardinality of h (that is. the number of tuples nh that are associated with predicate h). disregarding any interna1 relationships between the arguments.
Suppose that we are interested in deterrnining the number of tuplrs n, that result
from applying transitive closure to predicate h. If we assume that the number of unique
mples nh associated with predicate b is known, and so is the number of distinct attribute
values for the relation. n,,. some upper- and lower bounds rnay be established. By propenies of transitive closure. we know that n, < n,'. < ( nb )
nb
.- .( ~ q5 .-1) .
Furthemore. for
ndr2- ni! + 2 ...(Eq. 5 . 2 ) - the limit value ( nd,) is obtained. Unfominatrly. these fron-
tier values are not that helpful for large values of n,,.
For even the simplest possible case. that of a uniform distribution of independent
attributes. denving exact fomulae proves to be a hard task. For sxample. wr derived the
following formula for the average expected value in the trivial case when n, = 2 :
The complrxity of the exact formulae increases as the value of nh docs. and cach
formula has to be obtained separately. which produces an impractical situation.
Estimating the Average Cardinaüty of Transitive Closure
5.1
As mentioned before. the cardinality of a transitive closure may Vary fiom a value in
which no tuples are added as a result of the closure to a maximum value given by the
square of the original number of tuples (in a gmph representation. this would correspond
to a bbcomplete"p p h for the involved "input" nodes). Since this range of values may
produce a vast intemal. a compromise is to work with central tendency mcasurements.
such as the anthmetic rnean.
For this purpose. we have generated randomly distributed tuples for our base predicate h and obtained results for srveral values of n,, (the number of distinct attnbute values for the relation) and nb (the number of unique tuples for predicatr b)'. After sevrral
experimrnts. it appears that the average number of tuples of the transitive closure can be
charactenzed by means of three different regions: (a) a simple ( linear) beha~iourfor
small values of nh: ( b ) a non-linear region for intermediate values of ilh; and (c) an exponrntial region for higher values of nh (Figure 5.1 and Figure 5-21. We proceed to characterize these thres regions.
5.4.1
Region of Small Values for the Number of Tuples in the Base Predicate
For small values of nb, a surpnsingly simple linear formula was empincally derived. If
we express nb in terms of n,, in the forrn
n, = n,,
the corresponding transitive closure can
, the average number of tuples of
be expressed (approximately) as:
+In our espeiiments. nb tuples were randomly selected fkom the (na,)' possible ruples that c m be
fomed with n,, distinct artribute values at each argument position.
input density nb
Figure 5-1 Region for srnaII values
-
n
-
=n
. -
)
q
5 . 4 . For
-
instance. if
.-=
i 5.
-
n,.
= nJI 4
.
or if
--i= 4.
= " J I 3 . or if 4 = I . nIL z nd, (Table 5.1 ). and so on. The formula seems to work
wtll for .-I2 i .?I . although accuracy starts to debmde sharply in the nri_rhbourhood of
*,L
this value. Furthermore. the predicted value is more precise for higher values of n,,.
Table 5.1 The linear region
average output densit! n,
P
large values
tn,,
211u1
input ciensit! nh
Figure 5.2 Region for large values
5.1.2
Region of Intermediate Values for the Number of Tuples in the Base Predicate
Our linear formuia begins to fail when the constant .4 Stans getting closer to one (Table
5.2). In fact. the standard deviation of the recorded values also becomes bigger. We have
bsen unable to derive a simple gencral formula for this range. Fonunately. this "internediate" region is represented by a relatively narrow interval of values that range from approximately
nb 2 O.Xnd,
to nh 5 i .hd,
. As a practical solution. we have obtainrd sorne
approximats formulae for different values of n ,
n, = 0.9nyr
n, - nu,
. we have applied the approximation
.the formula
<=
nJt
I .I
2
in this region. For instance. for
-
nt'. = J x nJt
(
3
x ( -4
-I))
. and
for
gives a good approximation of the cardinality of
the transitive closure. Note that. since this region is relatively small. the denvation of approximate formulae for different values o f
n,
represents a feasible strategy.'
tNaturally. our proposed formulae represent just simple approximations. We decided not to
spend too much time in deriving more exact formulae. since these approximations satis- Our
needs rather adequately.
I
I
I
1 lOO(8801
1 . 3
1
I
I
4119.821
UOO
Table 5.2 The intermediate region
5.43
Region of Large Values for the Number of Tuples in the Base Predicate
An important observation is that for values
sponding maximum value
( n,)
R,
.
2 1 . 2 ~ the
~ ~
prrcrnrage
of the corre-
= that is obtained afier the closure can be considered al-
most a constant (as seen in Table 5.3). The values of some percentages are depictrd in
Table 5.3.
Furthermore. we observe that some of these percentages may be represented by
very simple fractions. For instance. for
R,
= 1.4nd,
n, = 1.6nd,
. the associated fraction is
. the fraction is 1/25: for
n, = i.3na,
1/1;for n , =
n, = 1.75n,,
.
the fraction is 1%;
1-jn,, , the
for
fraction is 113: for
the fraction is l i 2 .
Table 5.3 Percentages of the maximum value for n,
nb=I .Z%,
nb=1.6%,
nb= 1 4 , nb= 1.4%, nb=1.5%,
0.09
0.17
0.25
nb=l .7%, nb=1.8%,
0-41
0.33
= 1.6 m
nb= 1.911,~~
0.59
0.53
0.47
Table 5.4 Percentages of the maximum value for some factors
This particular behaviour may be approximated by an analytical formula. In fact.
the values that are obtained strongly suggest that we may use an exponential formula to
mode1 this region. Thus. we have used the following simple general formula to characterize this subregioni:
7
n,
l'
'JI
\
1 1
)-
nIL= nd,-l 1 - enpl - - + p
'
7
'\
JEq.
5.5)
,,
For instance. the values that are obtained when using this formula when p =
0.9
are
shown in Table 5.5.
nb =
nb =
I
l
I
nb =
l
I
nb =
nb =
I.-t%, 1 . 5
!.6n,,
nb =
l
nb =
1
nh =
I.Sn.,
.
deri\.ed iformula) \dues
0.03Y2 0.0801
0.1179 0.22 12 0 . 3 0 3 0.3871 0.4727 0.555 1
sxperimental \dues
0.06
0.17
-
-
0.09
0.25
0.33
0.31
0.47
0
.
nh=
l
0.632 1
0.59
-
Table 5.5 Comparison between the formula and the experirnental results
Also. for the range n, 2 Zn", . we have used the following simple formula:
+Again this formula just represents a reasonable approximation of the vaIues under consideration. and "better" formulae may be proposed as well if more accuracy is needed.
"b
where the value a = 3nu, - 7
- ssems to give satisfactory results (Table 5.6).
5.5
Recursion Revisited
Once a formula that predicts the cardinality of transitive closure has been obtained. we
may use it to predict the cardinality of a recursive predicate.
As an example. let us consider a generalized version of the same generation exarnple proposed by Bancilhon and Ramaknshnan [Bancilhon86]:
whcre jlat. ~ r pand down are extensional database predicates. and p is the recursive (dcrived) predicate.
This predicate can be exprrssed in terms of transitive closure as follows:
where ~ c p h i \ n t indicates
c
the transitive closure of predicate updoitn:
updowntc(X.YU.XU.Y) :- u ~ ~ o w ~ ( x . Y u . x u . Y ) + .
We will analyze the GraphLog program for the generalized version of the sams generation problrm as s h o w in Figure 5.3.
The visual representation of this program is shown in Figure 5.4.
We wish to rstimate the number of ~ p k associated
s
with recursive predicate p2.
Since duplication of tuples must be considered (rhe second rule may produce tuples
that are already pan of the base predicate.jlor). the cardinality np2ofp.? will be:
where n ~ is ~the ,number of tuples o f predicateflat (which is known fiom the database
profile) and nupdo,,, is the number of tuples of the transitive closure of predicate tcpdown.
-- - - -
--
-
O/o~pdown(X.YU.XU.Y)
:-u p ( ~ . ~ l J ) . d o w n ( ~ u . ~ ) node( g9, n21, [v('X')] ).
node( 99, n22, [ ~ ( ' y )).]
node( 99. n23, [v('XU')] ) node( 99, n24. [ ~ ( 'U')]
y }.
node( 39, n25, [v('X').v('YU')] ).
node( 99. n26. [v('XU').v('Y')] 1.
edge( 99. n21, n23, up ).
edge( 99. n24, n22. down ).
disLedge( 99. n25. n26. updown ).
%extensional DB predicates
db-schema(up, 2).
db-scherna(down. 2).
db-schema( flat. 2).
:-flat(X.Y).
O/~~(X.Y)
node( 94. 179. [v('W] ).
node( 94. n 10. [v('Y)l ) edge( 94. n9. n 10. flat ).
dist-edge( 94. n9. n10. p).
%p(X,Y) :-up(X.XU),p(XU.YU),down(YU.Y)
node( 93. n5, [v('X')] ).
node( 93. n6. [v('XU')] ).
node( 93. n7. [v('YU1)]).
node( 93, n8, [vi'Y')] ).
edge( 93. n5. n6. up).
edge( g3, n6. n7. pl.
edge( 93, n7. n8. down).
3ist-edge( 93. n5. n8. p).
Youptc(X.Y) :-up(X,Y)+.
iode( g6. n 14. [v('X')]).
iode( g6. n 15. [v('Y')] ).
sdge( g6, n 14. n15, up:+: ).
jistedge( g6. n 14. n15. uptc).
O/oupdowntc(X,YU,XU,Y):-updown(X.YU.XU.Y)+.
node( g10. n27, [v('X1).v('YU')]).
node( g 10, n28, [v('XU').v('Y')] ).
edge( gl0, n27. n28. updown:+: ).
distedge( g10, n27. n28. updowntc ).
%p2(X.Y) :-flat(XU.YU), updowntc(X.YU.XU.Y)node( 912, n29, [v('X').v("fU')] 1.
node( g 12. n30. [v('XU'),v('Y')] ).
node( 912, n35, [v('X')] ).
node( 912. n36. [v('XU')] ).
node( 912. n37. [v('YU')] ).
node( g12. n38. [v('Y')] 1.
edge( g 12, n36. n37, flat).
edge( g 12, n29, n30. updowntc).
distedge( g12. n35. n38, pz).
%downtc(X.Y) :-down(X.Y)+.
iode( 98, n 19, [v('X')] ).
iode( 98. n20. [v('Y')] ).
idge( g8, n 19. n20. down:+: ).
jist-edge( 98. n19. nSO, downtc).
Figure 5.3 GraphLog program
updown(X,YU.XU.Y) :-up(X,XU).down(YU,Y).
The cardinality of predicate updow.n may be inferred by using the normal rnrthod
for rstimating the cost of non-recursive predicates. in this specific case, since the subgoals do not share any variable. we have that
nupdomn
-
nup
niiown
If we know the number N of distinct attribute values common to relations up and
down. we might be trmpted to use our formulae for cardinality of transitive closure.
-eiven
-9
n,, =
.v and
nb = nUp x n doWC This would be perfectly valid if there were no
restrictions whatsoever regarding how the tuples are distributed. i.e.. if we have a random
distribution. Unfortunately. it seems not to be the case in real-life databases. We
uptc
downtc
673
down'
( a ) original program
updowntc
(c) program in tems of transitive closure
Figure 5.4 GraphLog program for the recursive program
consistently obsewed that our formulae produced sorne overesrimates for higher values
of
nu,
. Ernpirical results
nupd,,,, < N2. (ix..
have also s h o w that our predictions are adequate when
in the region for "small values" of n,pdo,,n). Furthermore. we have
observed that our estimates may be improvrd for the other rwo regions: for the region of
"higher values" of nupdo,,, the cardinality of the transitive closure consistently seems to
be directly related to the product of the cardinalities of the individual transitive closures
of cip and doitx: for the region of intermediate values, the cardinality of the transitive
Table 5.6 The exponential region
89
closure seems to be related to the products n,
x
n u,,c
and nUpx n
,,,,, . Table 5.7 shows
some typical results for different values.
Table 5.7 Estimating the cardinality of a transitive closure
Let us study a very simple example and try to explain why the distribution of the
relation that the transitive closure is applied to is not uniform.
Esample. Consider a database with
.V = 10
distinct attnbute values and the follow-
h g randomly generated extensional database predicates:
up(1.2)*
up(1S).
up(3.2).
up(3.1O).
upiW.
up(6.1).
up(7.8).
up(7.lO).
up(9.9).
up(10.5).
A graphical representation of the two predicates is shown in Figure 5.5.
w
down
Figure 5.5 Graphical representation of base predicates up and down
In this example.
nap = 1 0 and nJ0,,., = 8 .
The transitive closure of predicate irp follows
the following behaviour:
Note that the tuples in the closure follow a recurrent pattern (al1 paths of lrngth 6
are also paths of length 3: al1 paths of length 7 are paihs of length 4 as well: and so on).
Similarly. predicate h r r - n has the following closure:
predicate "dowm"
1
7
3
3
5
6
7
8
paths of length 1
CI.!>
c1.D
<32>
<3.10>
c7.P
<S.6>
<9.5>
<10.10>
paths of length 2
<1 . I >
< 1.6>
< I .S>
- 3 . 1 O>
<7.2>
<7. i O>
< 10.1O>
pathsoflength3
<i.l>
<l.6>
c1.D
<3.10>
cl,10>
clO.lO>
pathsoflength4
cl.]>
c1.6>
<I.S>
clIo>
cl.10>
<10.10>
Predicate updoitn is the Cartesian product of both relations:
(9.9) (9.1.9.1 1
(9.1.9.3)
(9.3.9.2)
(9.3.9.10)
(9.7.9.3)
(9.3.9.6)
(10.5) (10.1.5.1)
(IOl.5.)
(10.3.5.2)
(l0.3.5.10)
(10.7.5.3)
(
10.i1.5.6)
(9.4.9.5)
~9.10.9.l0~
I O . .
(lO.lO.5.lO1
From this table. it should be evident that the distribution of attribute values of pairs
([X. YU]. [XU. Y]) is not randorn at all. Not only are many [XL. Y] pairs shared by some
[X. YU] pairs. but also the XU value is totally determined by the X value. For instance.
if X has a value of 1. the value of XU is either 2 or 5. and therefore. although thsre arc:
1000 possible candidates ( [ l . YU]. [XU. Y]). only 200 of them comply with the restnction ([l. YU], [Z. Y]) or the restriction ( [ l . YU]. [ 5 . Y]).
In other words. although we have derived some formulae to estimatr the cardinality
of the transitive closure of a uniform distribution of attributs values, they may not be accurate for -'realWpredicates. as different distribution hnctions rnay be encountered.
In summary. to estimate the cardinality of the transitive closure of a canesian product we propose to consider three different regions: (a) m a l 1 values of n,. the cardinality
of the canesian product. with respect to n,';
(b) intermediate values of n,; and ( c )higher
values of n,. Experimental results have indicated that we rnay use our formulae when n,
c n,'.
in the region that we have called "small values of n," with some accuracy. Once
more. it is the intermediate region which poses the major challenge: a good estimate of
the cardinality of the transitive closure can be obtained by using the products
and nu, x n,,_,,
n d 0 , x nuprc
,either the midpoint or some value in between. Finally. for the region of
"higher values of n,". the cardinality of the transitive closure seems to br directly related
to the (product of the) cardinalities of the individual transitive closures for up and do\%-n.
As a final note. it should be mentioned that o u study of transitive closure was restncted to the estimation of its cardinality. Once we are able to determine the expected
number of solutions that result from the transitive closure of a predicate (that is equivalent to the original recursive predicate). we may propagate this value to the othrr '-black
boxes" in our modelt. However. nothing has been said about the actual cost of executing
the recursion or closure. In fact. thsre is not much to be said, other than it will bs totalty
dependent on the actual implementation. The svaluation method that is used by the systern to handle recursive qurries will determine the cost of solving the recursion. Bancilhon and Ramakrishnan [Bancilhon86] have derived some formulae for several cornrnonly used evaluation methods for some elementary foms of recursion. Ln fact. the design
and evaluation of the performance of transitive closure algorithms has been an active
area of research [Agrawal90].
We must add that some systrms rnay decide to compute the transitive closure of a
predicate only once and store the results for fbture use (instead of computing the closure
any time that a user query requests it) [cheiney94].: Additionally, it is not uncommon to
use a pre-proccssor that transforrns the original form of closure to another that is equivalent and more efficient to execute [Lu93].
5.6 Algorithm to Estimate the Cost of a GraphLog Query
We now formulate our genenl algorithm to estimate the cost of a given GraphLog query.
Input: a query y of the t o m
and a calling pattern q a t q .
+Recall that the only inpur quantity that a "black box" requires is the number of tupIes retrieved
by the previous black box.
Explicit storage of the transitive closure results minirnizes access tirne and only requires a single
(usualIy expensive) computation o f the transitive closure.
Output: a cost estimate. cos$. and the expected number of solutions to the que-.
Geneml Algorithm:
algorithm estirnate-cost(q, cpab, cosb, num-solq) ;
I' assume that there are m subgoals in the query */
begin
perform-mode-analysis(q, cpatq, calling~attern(s,). ...,
callingpattern(s,)) ;
for i:=l t o m do
begin
e~timate~number~of~solutions(s~,
callinclpattern(si), num-soli);
estimate~relevant~co~t~metric(s~,
COS^^);
end;
num-solo:= 1 ; total-cost:=O: solutions:=l ;
for j:=l to m do
begin
tota-ost:= num_s0Ij., x COStj + tota-ost;
solutions:= solutions x num-sol,;
end;
costq:= total-cost; num-soiq:= solutions;
end;
procedure peiform-mode-analysis(q, cpatq, callgat(sl ) ,....callgat(s,));
begin
This analysis determines the calling pattern of each subgoal Si in the
query'
end;
procedure estimate-nu mber~of~solutions(s,
callinggattern(s), numsol);
begin
if s is an extensional DB predicate then use the database profile;:
if s is an intensional non-recursive OB predicate then
estimate~numsol~nonrecursive(s,
callingpattern(s), num-sol);
if s is an intensional recursive DB predicate then
estimate~numso~cursive(s,
callinclpattern(s), num-sol) ;
end;
+See Section 3.6
tSee Sections 3.2 and 4.1
procedure estimate~relevant~cost~metric(s,
callinggattern(s), cost);
begin
if s is an extensional DB predicate then use the database profile;
if s is an intensional non-recursive DB predicate then
estimate~cost~metric~nonrecursive(s,
callinc~pattern(s),cost);
if s is an intensional recursive DB predicate then
estimate~cost~metric~recursive(s,
callinuattern(s), cost) ;
end
cpat(s), num-sol);
procedure estimate~numsol~nonrecursive(s,
/' assume that predicate s has n different clauses '1
begin
for k:=1 to n do
estimate-cost (body(clausek), cost-bodyk, numsolk, h+(cpat(s)));
estirnate~number~of~solutions(num~sol,
num-sol,, .... nurnsol,);+
end:
callingpattern(s), num-sol) ;
procedure estimate~numsol~recursive(s,
begin
transform the recursive predicate to an equivalent transitive closure;tt
estimate the number of solutions by applying properties of transitive
c~osure;**
end;
procedure estimate-cost-metric-non
recursive(s, cost);
begin
1' assume that predicate s has n different clauses */
begin
for k:=l to n do
begin
estimate~cost(body(clausek),costbodyk, num-solk, h(cpat(s)));
costk:= cost-h unifttt(ciausek, cpat(s))+
?h(cpat)is the calling partem that is obtained after a successfid head unification and is detcrmined
by a simple mode analysis
:Sec Sections 3.5 and 4.3
+tSee Section 5.2
t:See Section 5.3
pf(hunif(ciausek, cpat(s))) x costbodyk;
end;
totalcost :=O;
for k:=l to n do
total-cost:= totahost + costk;
cost:= total-cost;
end;
procedure estimate~cost~metric~recursive(s,
cost);
begin
estimate the cost based on knowledae about the recursive algorithm
that is used;
end;
Most procedures can be pertormed mechanically once the abstraction of the database profile has been chosen. In general. we have assumed a uniform distribution of attributs values. but that does not have co be the case.
There are two procedures that pose some di fficulties regarding thrir automation.
One of them. the automatic transformation of a recursive quety into an cquivalent fonn
of transitive closure. has just recently been addressed by researchers in the tield
[~onsens89]:. The other. the estimation of the cost contributors when a rrcurçive predicate is solved, would imply a thorough analysis of the recursive algorithm in place. W r
will partially address rhis issue in the following chaptrr.
tttcost-hunif is the cost due to the process of head unification [Section 3.7.11.
t P ( hunif(clausek.cpat(s))is h e probabilil that the head unification is successful [Section 3.7.1 1.
ZSee Section 5.2
Chapter 6. Some Case Studies
In this chapter. we will apply Our framework to some typical databases. In particular. we
are intrrested in showing how somr real-life issues such as high correlation amongst at-
tributes or duplication of tuples may affect the accuracy of the results. We also compare
our results with one of the best algorithms for query reordering. narnely Shendan's algo-
ri thrn [Sheridan9 11.
6.1
The congressional voting records database
This publicly available database contains the votes for each of the US. House of Representatives Congressmen on several key votes in the 1984 session (the so-called 1984
United States Congressional Voting Records Database). A vrry simple protile of this da-
tabase is shown in Figure 6.1 .' An rxtract of the corresponding GraphLog database is
shown in Figure 6.2.
-
paW(1Jep) projectl(1.n).
project2(l.y).
project3(1,n).
project4(l ,y).
project5(l .y).
project6(1.y).
project7(1,n).
project8(l,n).
projectg(1,n).
projectl O(1.y).
project11(1.a).
project12(l.y).
projectl3(1,y).
projectl4(l.y).
projectl5(1,n).
projectl6(1.y).
party(2.dem).
project1(2.n).
project2(2,y).
project3i2.n).
project4(2.y).
project5(2,y).
project6(2,y).
project7(2,n).
project8(2.n).
project9(2.n).
projectl O(2.n).
projectl l(2.n).
projectl2(2,y).
projectl3(2,y).
projectl4(2.y).
projectl5(2,n).
projectl6(2,a).
m..
-
party(435,rep).
projectl(435.n).
project2(435.y).
project3(435,n).
project4(435,y).
projectS(435.y).
project6(435.y).
project7(435.n).
project8(435,n).
project9(435.n).
projectl O(435.y).
projectl 1(435,n).
project 12(43S,y).
projectl3(435,y).
project14(435,y).
projectl5pl35.a).
projectl6(435,n).
--
Figure 6.2 The GraphLog database
f ln fact. there are three attribute values for each vote. The third attribute value (besides '-es" and
"no" votes) can be regarded as an abstention.
Number of Instances: 435 (267 democrats. 168 republicans)
Number of Attributes: 16 + class name = 17 (al1 Boolean valued: y = yes: n= no: a = abstention)
Attribute Information:
1. Class Name: 2 attribute values (dernocrat, republican)
2. handicapped-infants: 2 attribute values (y.n)
3. water-project-cost-sharing: 2 attribute values (y,n)
4. adoption-of-the-budget-resolution:2 attribute values (y,n)
5. physician-fee-freeze: 2 attribute values (y.n)
6. el-salvador-aid:2 attribute values (y,n)
7. religious-groups-in-schools: 2 attribute values (y.n)
8. anti-satellite-test-ban: 2 attribute values (y,n)
9.aid-to-nicaraguan-contras:2 attribute values (y.n)
10. mx-missile: 2 attribute values (y,n)
11. immigration: 2 attribute values (y,n)
12. synfuels-corporation-cutback: 2 attribute values (y.n)
13. education-spending: 2 attribute values (y.n)
14. superfund-right-to-sue: 2 attribute values (y,n)
15.crime: 2 attribute values (y,n)
16. duty-free-exports:2 attribute values (y,n)
17. export-administration-act-south-africa:2 attribute values (y,n)
Figure 6.1 The 1984 United States Congressional Voting Records Database
Enample 1
Suppose that we wish to compare two ordenngs for a query that retritves those individuals who voted "yes" on issue # 4 and "no" on issues K 3 and
+ 5 . These two orderings
are shown in Figure 6.3.
1,
orderl (ld.Party) :-project4(ld,y). party(ld.Party), proiect3(ld.n). project5(ld.n).
ordeR(td.Party) :-party(ld.Party), project4(ld.y). project3(Id.n). projectS(1d.n).
Figure 6.3
Two orderings that we wish to compare
We will assume that the GraphLog translater generates code for a system that uses
first-argument indexing. Additionally. consider that we are mostly interested in making
our decision based on the number of visited tuples onlyt.
+As it tiappens. for this particular example. the number of visited tupIes (Le.. the nurnber of subgoal unification attempts that take place) is indeed the most relevant conmbutor to the cost of
executing this query
Let us consider ordering # I first:
orderl (1d.Party) :- project4(ld.y), party(ld.Party),project3(ld.n), project5(1d.n).
A very simple analysis will determine the calling patterns for the different subgoals
in this que- as:
orderl([f. fl) :- project4((f.g]). party([g, fl), project3([g, g]), project5([g. gj).
(g stands for '-gound argument": j'represents a "free variable)". We proceed to sstimats
the cost (as the expected number of visited tuples) of each subgoal. Successively, ws ob-
tain:
pro]ect4([f.g])
This predicate cal1 will visit al1 335 instances of this particular project. Only a fraction will actually succeed. In the absence of any additional information. we are forced to
assume a particular distribution for the three attributes in the second argument. namely
1.
("yes"), n ("no") and a ("abstention"). For instance. we may decide to use a uniform
distribution of independent values (so that we are expecting a fairly high number of abstentions!). Under this cmde consideration. we would retneve 435i3 (i.e.. 135) tuples.
ParMg. fl)
Now. we will visit as many pu-
tuples as solutions we got from the previous sub-
goal. Our estimation would indicate that 145 tuplrs had to be visited (recall that first-argument indexing is assumed). The second argument poses no restriction whatsoevver. so
that al1 145 tuples are expectrd to succeed.
project3([g,g]), project5([g,g])
The final two subgoals are very similar from the point of view of our analysis. Since
first-argument indexing occurs. a first argument ground will result in that. for each of the
135 tuples obtained from the previous phase, only one project3 tuple has to be visited. with
s
Our uniform assumption for the attnbute).
a 113 rate of success - roughly 48 ~ p k (recall
Similarly, for each successfdly retrieved project3 tupie, one projectb tuple will be visited
with a 113 rate of success as per Our assumptions (approximately 16 tuples).
Let us turn our attention to ordenng Ff 2:
ordeQ(ld.Party) :-party(ld,Party).project4(ld,y),project3(ld,n).projectS(ld.n}.
Again, a simple analysis will determine the cailing patterns for the different subaoals in this que- as:
5
order2([f. fl) :-party([f,fl), project4([g, g]), project3([g, gj). project5([g, g]).
and we proceed to estimate the expected number of visited tuples for each subgoal on the
right hand side.
pafly([f.fl)
This predicate call will visit al1 435 instances of predicate
Party.
Since both argu-
ments are variable. al1 435 instances will be retrieved as a result of the call.
proiect4([!& gI)
Now. we will visit as many project4 niples as solutions we got from the previous subuoal (note that first-argument indexing is assumed). Our estimation would indicatr that.
3
from thosr 435 tuplcs. only a fraction will actually comply with the restriction posed by
the second arprnent. If a uniform distribution of independent attributr values is assumed. roughly 435'3 (Le.. 145) tuples will be successfully retrirvrd.
project3([g,g]), project5([g,g])
The analysis of the final two subgoals is identical to that for the alternative ordering. Since first-argument indexing occurs, a ground first argument will result in that, for each of the 145 tuples obtained from the previous phase, only one project3 tuple has to be visited, with a 1/3 rate of success (roughly 48 tuples), and, similarly, for each successfully retrieved project3 tuple, one project5 tuple will be visited with a 1/3 rate of success as per our assumptions (approximately 16 tuples).
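The arithmetic behind these estimates can be written down directly. The following Python lines are only an illustration of the reasoning above; the relation size (435 tuples per predicate), the uniform 1/3 selectivity on the vote attribute and the effect of first-argument indexing are the assumptions stated in the text, not properties of the generated code.

# Step-by-step estimates for ordering #1 of Example 1, following the text.
# Assumed: 435 tuples in party/2 and in every project_i/2, a uniform 1/3
# selectivity for each vote value, and first-argument indexing.

TUPLES = 435
SEL = 1 / 3

# project4([f,g]): scan the whole relation, keep about one third of it
visited_project4 = TUPLES                 # 435
solutions_project4 = TUPLES * SEL         # about 145

# party([g,f]): one indexed lookup per incoming solution, no filtering
visited_party = solutions_project4        # about 145
solutions_party = visited_party           # about 145

# project3([g,g]) and project5([g,g]): one indexed lookup per incoming
# solution, each succeeding with probability about 1/3
visited_project3 = solutions_party
solutions_project3 = visited_project3 * SEL      # about 48
visited_project5 = solutions_project3
solutions_project5 = visited_project5 * SEL      # about 16

total_visited_order1 = (visited_project4 + visited_party +
                        visited_project3 + visited_project5)
print(round(total_visited_order1))               # about 773 visited tuples

The analogous computation for ordering #2 (party first, then the three indexed project lookups) yields roughly 1063 visited tuples, which is consistent with the 38% difference reported below.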
Figure 6.4 shows the abstract values for the different subgoals, represented by "black boxes". Figure 6.5 shows how these "black boxes" are interconnected to obtain the global values for the whole query. The results of both analyses are sketched in Table 6.1 and Table 6.2.
[Figure 6.4 Abstract black boxes for Example 1: one box per subgoal and calling pattern, annotated with the number of calls, the number of visited tuples and the expected number of solutions (e.g., party/2 with both arguments free: 1 call, 435 visited tuples, 435 solutions; a project_i with a ground first argument: 1 visited tuple and 0.33 expected solutions per call).]
Table 6.1 Number of visited tuples for ordering #1 (per-subgoal expected number of solutions and number of visited tuples, with the total)
[Figure 6.5 Interconnection of the black boxes for Example 1]
Table 6.2 Number of visited tuples for ordering #2 (per-subgoal expected number of solutions and number of visited tuples, with a total of approximately 1063 visited tuples)
We may conclude that we expect ordering #1 to be a better option with respect to ordering #2 (which has a 38% additional cost). Experimental results for this query confirm our prediction. These results, using SICStus Prolog version 2.1 on a Sun SPARCstation SLC, are shown in Figure 6.6. From the figures, ordering #2 is 43% more expensive than ordering #1. Cost measurements are given in Prolog's artificial units.
order1(Id,Party) :- project4(Id,y), party(Id,Party), project3(Id,n), project5(Id,n).
order2(Id,Party) :- party(Id,Party), project4(Id,y), project3(Id,n), project5(Id,n).
[Table of results: ordering, average cost (1000 experiments), number of solutions]
Figure 6.6 Experimental results for both orderings
Although our assumption of a uniform distribution is clearly inaccurate, we still can predict which ordering will be less expensive to execute (from the point of view of visited tuples). Our estimate of the number of solutions is obviously poor, but only a more detailed profile would yield better results.
As an additional note, we must mention that this particular database has a very high correlation factor amongst attributes. In other words, "republicans" are expected to vote as a block on some (if not most) issues, and so are "democrats". If we pose a query requesting those democrats that voted some particular way, we should not be surprised to find that our estimates regarding the number of successful tuples that are retrieved are even less accurate.
Example 2
Suppose that we wish to compare all orderings for a query that retrieves those individuals that voted "yes" on issue 16 and "no" on issue 6. These orderings are shown in Figure 6.7.
order1(Id,Party) :- party(Id,Party), project16(Id,y), project6(Id,n).
order2(Id,Party) :- party(Id,Party), project6(Id,n), project16(Id,y).
order3(Id,Party) :- project16(Id,y), party(Id,Party), project6(Id,n).
order4(Id,Party) :- project16(Id,y), project6(Id,n), party(Id,Party).
order5(Id,Party) :- project6(Id,n), party(Id,Party), project16(Id,y).
order6(Id,Party) :- project6(Id,n), project16(Id,y), party(Id,Party).
Figure 6.7 Six orderings that we wish to compare
A very simple static analysis determines the corresponding calling patterns, which are shown in Figure 6.8.
order1[f,f] :- party[f,f], project16[g,g], project6[g,g].
order2[f,f] :- party[f,f], project6[g,g], project16[g,g].
order3[f,f] :- project16[f,g], party[g,f], project6[g,g].
order4[f,f] :- project16[f,g], project6[g,g], party[g,f].
order5[f,f] :- project6[f,g], party[g,f], project16[g,g].
order6[f,f] :- project6[f,g], project16[g,g], party[g,f].
Figure 6.8 Calling patterns for the six orderings
We wish to estimate the number of visited tuples associated with each predicate/calling pattern pair that appears in the different orderings. Following a reasoning similar to that in Example 1, we are able to deduce the values shown in Figure 6.9, in which we may use "black boxes" to identify the abstract values that we estimate. Note that first-argument indexing is assumed. These "black boxes" are then interconnected as shown in Figure 6.10 to obtain the global values of an entire query.
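The interconnection step can be pictured as a fold over the chain of black boxes: each box contributes its visited tuples multiplied by the number of times it is called, and its expected solutions determine how many times the next box is called. The Python sketch below is only an illustrative rendering of that idea; the per-box figures in the example call (435 visited tuples and 145 expected solutions for project16[f,g], one visited tuple per call for the indexed lookups) are the values that follow from the uniform assumption of Example 1, not measured data.

# Illustrative interconnection of abstract "black boxes".
# Each box is summarized as (visited tuples per call, expected solutions per call).

def interconnect(boxes):
    """Chain the boxes of one ordering and return the total number of
    visited tuples together with the expected number of solutions."""
    calls, total_visited = 1.0, 0.0
    for visited_per_call, solutions_per_call in boxes:
        total_visited += calls * visited_per_call
        calls *= solutions_per_call          # propagated to the next box
    return total_visited, calls

# ordering #3 of Example 2: project16[f,g], party[g,f], project6[g,g]
order3 = [(435, 145), (1, 1), (1, 0.33)]
print(interconnect(order3))                  # total visited tuples, expected solutions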
We are now in a position to estimate the total number of tuples expected for each ordering. We summarize the (analytical) results in Table 6.3. Experimental results for SICStus Prolog are shown in Table 6.4, where they are compared to our estimated values.
We were able to detect the two least efficient orderings. However, our predictions regarding the other four orderings are not very accurate. This is due to the fact that we are assuming a distribution that behaves in a certain way, whereas the real database follows a different pattern. Predicate project16 clearly has a different behaviour from that of predicate project6 (in fact, the latter predicate gets executed more efficiently than the former), and our framework cannot make this distinction in the absence of a more detailed database profile.
[Figure 6.9 Abstract black boxes for Example 2: one box per predicate/calling-pattern pair, annotated with the number of visited tuples, the calling pattern and the expected number of solutions.]
But our framework is able to detect that if a "project" predicate is selected first, it is more efficient to place the other "project" predicate as the next one, leaving the party predicate in the last position. Similarly, our framework determines that it is not convenient to place the party predicate as the first subgoal in the clause.
6.2 The Performers Database
We proceed to study another real database. Our second database contains detailed information on 888 classical-music compact discs. The information available for each compact disc includes a list of individual tracks, a list of individual and collective performers, as well as some technical data regarding the production of the recording. For the purposes of our case study, we will consider the portion of the database that is related to the musicians and their instruments.
[Figure 6.10 Interconnection of two black boxes in Example 2]
Table 6.3 Expected number of visited tuples (estimated values for each of the six orderings of Example 2)
Table 6.4 Comparison between the predicted and experimental values (ordering, experimental value, experimental ranking, theoretical ranking)
6.2.1 Primitive Entities
The main entities to be considered are: (1) compact disc numbers, (2) artists or musicians, (3) instruments used by the musicians and (4) compact disc labels. To produce a more interesting example, we will introduce an additional entity, namely: (5) the overseas distributors for the compact discs.
Compact Disc Numbers
Every compact disc can be identified by a manufacturer's number, which is an alphanumeric code. Each compact disc number is then assigned an internal code to be used by other relations:
recording(CD_Code, Manufacturer_Number).
Artists
Each individual performer or musical ensemble in the database is identified by an internal code.
Instruments
Similarly, every instrument description has a unique code assigned to it. Different descriptions for the same instrument are treated as different entities.
instrument(Instrument_Code, Instrument_Name).
Companies
We also find entries for the different companies that produce the compact discs stored in the database. Again, a specific code is provided for each label.
6.2.2 The Extensional Database
Once we have introduced the main entities in the database, we proceed to explain the set of facts that constitute the performers database. We will consider the following relations, available as extensional DB predicates:
Performers
For each compact disc represented in the database, a list of performers is available. For a given compact disc code, there is one entry for each performer listed for that production. If the same artist utilizes more than one instrument, there is one separate entry for each instrument used by that performer.
performer(CD_Code, Performer_Code, Instrument_Code).
Labels
This relation gives the internal code for the company that has produced the compact disc.
label(CD_Code, Label_Code).
Distributors
Finally, each label may have one or more "overseas distributors", that is, independent companies that import and distribute the compact discs in different parts of the world.
distributor(Label_Code, Distributor).
Typical sample tuples of the different relations in the performers database are sketched in Figure 6.11.
distributor(k1, qualiton).
distributor(k7, pelleas).
distributor(k2, allegro).
distributor(k2, sri).
distributor(l1, analekta).
distributor(l3, polygram).
distributor(l4, sri).
distributor(l4, hmusa).
distributor(l7, polygram).
distributor(l8, koch).
distributor(l2, allegro).
distributor(l2, fusion).
distributor(m2, ebs).
recording(886, 'SYMPHONIA SY 91S06').
recording(887, 'TELDEC 4509-90798-2').
recording(888, 'CRD 3311').
artist(b516, 'Bohumil Benicek').
artist(b517, 'Ensemble Tempo Barocco').
artist(b518, 'Ars Musicae Barcelona').
Figure 6.11 Sample tuples from the performers database
We start by analyzing a simple non-recursive query. For instance, we may be interested in knowing the codes of the performers of some particular instrument that are available from a given overseas distributor. The following Datalog predicate defines such a relationship:
Suppose that we are interested in the following family of queries:
In other words, we will study a query with a calling pattern [f, g, g]: the first argument is a variable, whereas the second and third arguments are constants. There are six different orderings in which the three subgoals may be arranged (calling patterns are shown in square brackets):
performer [f,f,g], label [g,f], distributor [g,g]
performer [f,f,g], distributor [f,g], label [g,g]
label [f,f], performer [g,f,g], distributor [g,g]
label [f,f], distributor [g,g], performer [g,f,g]
distributor [f,g], performer [f,f,g], label [g,g]
distributor [f,g], label [f,g], performer [g,f,g]
Our goal consists of selecting the most efficient ordering of all six. All we know about the user's query is the calling pattern: instrumentists_available_from_a_distributor [f, g, g].
Typical queries are shown as follows:
:- instrumentists_available_from_a_distributor(A, sri, ten).
(tenors on a label distributed by Scandinavian Record Imports);
:- instrumentists_available_from_a_distributor(A, qualiton, sop).
(sopranos on a label distributed by Qualiton Imports);
:- instrumentists_available_from_a_distributor(A, allegro, obo).
(oboists on a label distributed by Allegro Imports).
The database profile of the performers database is shown in Table 6.5.
Predicate name    number of tuples    distinct values in argument 1    distinct values in argument 2    distinct values in argument 3
performer/3       12,851              888                              3,727                            710
label/2           888                 888                              92                               -
distributor/2     177                 111                              23                               -
Table 6.5 The performers database predicates
In the absence of further information, we will treat the database as a uniform and independent distribution of attribute values (although we know that this is probably not the case). We will try to assign cost values to all six different possible orderings for the subgoals in the predicate definition. We will consider a couple of cost contributors in our analysis: the number of visited tuples and the number of variable unifications that are performed. Since we use SICStus Prolog to execute the code generated by the GraphLog translator, we have to assume that first-argument indexing is used. Our abstract "black boxes" for all (statically detected) combinations of calling patterns for the three predicates are shown in Figure 6.12.
The expected average number of visited tuples is either the total number of tuples for that predicate, if the first argument is a variable, or this value divided by the number of distinct values in the first argument position otherwise. The total number of variable unifications may be calculated with the aid of the formula shown in Appendix 2 or, if a simpler approximation is acceptable, by multiplying the number of expected visited tuples by the number of variable arguments in the predicate call. Finally, the expected average number of solutions is calculated by dividing the total number of tuples by the number of distinct values at each ground argument position.
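As an illustration only, the three abstract values just described can be computed from a profile entry as in the following Python sketch; the profile numbers are those of Table 6.5, and the simpler approximation for variable unifications (visited tuples times the number of variable arguments) is used instead of the formula of Appendix 2.

# Illustrative construction of a "black box" from a database profile entry.
# A profile entry gives the total number of tuples and the number of
# distinct values per argument position.

def black_box(n_tuples, distinct, pattern):
    """pattern is a string such as 'ffg'; returns the expected number of
    visited tuples, variable unifications and solutions per call."""
    # first-argument indexing: a ground first argument restricts the scan
    visited = n_tuples if pattern[0] == 'f' else n_tuples / distinct[0]
    # simpler approximation: one variable unification per free argument
    unifications = visited * pattern.count('f')
    # each ground argument position divides the number of solutions
    solutions = n_tuples
    for d, mode in zip(distinct, pattern):
        if mode == 'g':
            solutions /= d
    return visited, unifications, solutions

# performer/3 called with calling pattern [f,f,g] (Table 6.5 profile)
print(black_box(12851, (888, 3727, 710), 'ffg'))

For performer/3 with calling pattern [f,f,g] this reproduces the 12,851 visited tuples, 25,702 variable unifications and 18.10 expected solutions that appear in Figure 6.12.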
Once we have determined the expected values for our cost contributors when a single call is considered, we proceed to "interconnect" all three predicates, for each ordering under consideration. This is illustrated in Figure 6.13 for only one of the specific orderings.
Table 6.6 shows the values that are estimated for both cost contributors given all six different orderings. Table 6.7 summarizes the corresponding experimental results for different sets of ground terms. From these tables, we observe that we accurately predict the best ordering, as well as the two worst orderings. Interestingly enough, the ordering that we expect to be the second most efficient one (i.e., ordering #6) is only so for a few of the experiments. In fact, for some distributors that carry many labels (sri, qualiton, allegro), orderings #1 and #3 seem to be more efficient. We must realize that our predictions are based on uniform distributions of independent attribute values, and the database does not follow this type of distribution. However, we are still able to select efficient orderings and reject those that perform poorly.
[Figure 6.12 Abstract black boxes for the non-recursive query: for each predicate/calling-pattern pair, the expected number of visited tuples, variable unifications and solutions per call (e.g., performer/3 [f,f,g]: 12,851 visited tuples, 25,702 variable unifications, 18.10 solutions; label/2 [f,g]: 888 visited tuples, 9.65 solutions; distributor/2 [f,g]: 177 visited tuples, 7.70 solutions; distributor/2 [g,g]: 0.07 solutions).]
[Figure 6.13 Expected values for the cost contributors for a specific ordering. For the ordering performer, label, distributor: visited_tuples = 12851 + 18.10 + 28.96 = 12898.06 and variable_unifications = 25702 + 18.10 + 0 = 25720.1.]
6.2.4 An Example Involving a Closure
We will study an example that includes a closure predicate, which requires special treatment in our framework. For practical reasons, we had to use a smaller database, because the current implementation of the transitive closure algorithm is inefficient. The modified database profile for the performers predicate is shown in Table 6.8.
ordering                            expected number of     ranking     expected number of            ranking
                                    visited tuples n_vt    for n_vt    variable unifications n_vu    for n_vu
performer, label, distributor       12,898                 [3]         25,720                        [4]
performer, distributor, label       16,194                 [5]         28,906                        [5]
label, performer, distributor       15,764                 [4]         16,623                        [3]
distributor, performer, label       99,269                 [6]         197,913                       [6]
distributor, label, performer       8,095                  [2]         7,926                         [2]
label, distributor, performer       3,208                  [1]         2,676                         [1]
Table 6.6 Predicted values of two cost contributors for the non-recursive query
Table 6.7 Experimental results for the non-recursive query (average cost of each ordering for several sets of ground terms; rankings in square brackets)
Predicate name    number of tuples    distinct values in argument 1    distinct values in argument 2    distinct values in argument 3
performer/3       204                 9                                132                              45
Table 6.8 The modified performers database profile
We now proceed to define some additional relations to be applied to the performers database. We are interested in the definition of a predicate that uses some form of recursion or closure.
Let us define a predicate colleague that is true when two musicians participate in the same recording production†:
colleague(A,B) :- performer(X,A,_), performer(X,B,_).
†Strictly speaking, we should also impose the restriction that A and B are different musicians. We will omit this additional constraint here and in future definitions to simplify the analysis.
We also define that musician A is an "indirect" colleague of a musician B if both have recorded at least one compact disc with a mutual "colleague" as defined before:
and therefore a transitive closure is used.
Further suppose that we wish to define a more specific type of "indirect" colleague, in which musician A has participated in a recording project performing the same instrument as musician B, both having recorded with a mutual "colleague" of the same instrument. This could be done by defining a predicate that includes the additional restriction of the musicians performing on the same instrument:
and then taking the closure over this predicate. However, we will use a different set of predicates for illustrative purposes (i.e., showing how closure predicates are handled by our framework).
Thus, we define a "same-instrument" indirect colleague A as a musician who has participated in a recording project performing the same instrument as another musician B, both having recorded with a mutual "colleague of a colleague" (as defined above), as:
where the last subgoal is a transitive closure.
We observe that there are six different orderings for the right-hand side of the predicate. If, say, the calling pattern for the predicate same_instrument_colleague_of_a_colleague is known to be [g,f], the calling patterns of the predicate subgoals for all six orderings are†:
performer [f,g,f], performer [f,f,g], colleague_of_a_colleague [g,g]
performer [f,g,f], colleague_of_a_colleague [g,f], performer [f,g,g]
performer [f,f,f], performer [f,g,g], colleague_of_a_colleague [g,g]
performer [f,f,f], colleague_of_a_colleague [g,g], performer [f,g,g]
colleague_of_a_colleague [g,f], performer [f,g,f], performer [f,g,g]
colleague_of_a_colleague [g,f], performer [f,g,f], performer [f,g,g]
As mentioned before, predicates that involve closures or recursion have to be treated as special black boxes: unless we know the exact algorithm that is being used to solve the closure or recursion, nothing can be said about the values of the cost contributors for the predicate, except for the expected number of solutions (a value that is propagated to other black boxes when we interconnect them to find the global values of the cost contributors).
Since the base predicate for this closure is given by the following two subgoals:
we may estimate its average number of tuples as 204 x 204 / 9 = 4624, since there are 204 performer facts and 9 label facts (i.e., recordings) in the database.
The average number of solutions of the closure predicate is estimated by noting that the number of distinct attribute values for the relation is n_dv = 132 x 132 = 17,424 (132 is the number of performers), and that the number of unique tuples for the base predicate is estimated as n_t = 4624. We then determine the region that corresponds to these values: their ratio is calculated as A = n_dv / n_t = 3.77. This corresponds to the "region of small values for the number of tuples in the base predicate" explained in Section 5.4.1. Applying the formula mentioned in that section, we estimate the number of solutions of the closure as n_c = 17424 / (3.77 - 1) = 6290.
This is the expected number of tuples for the whole closure. If one argument is ground, only a fraction of the tuples will form part of the solution. We already know that the tuples produced by a transitive closure do not follow a uniform distribution (see Section 5.5), but we may produce a rough estimate by making this assumption, thus dividing the total number of tuples by the number of distinct values for that particular ground argument (i.e., the number of performers, in our example), or n_c[g,f] = 6290 / 132 = 47.7. Similarly, if both arguments are ground, our estimate would be n_c[g,g] = 47.7 / 132 = 0.36. Thus, our black boxes for this example are shown in Figure 6.14.
†Note that the two last orderings are equivalent in the abstract domain.
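The cardinality estimate just derived can be packaged as a small helper. The following Python sketch merely restates the arithmetic of this section; the formula n_dv / (A - 1) for the "region of small values" and the uniform split across ground arguments are the assumptions introduced in Chapter 5, and the concrete numbers (132 performers, 4624 base tuples) are the ones used above.

# Illustrative estimate of the cardinality of a transitive closure,
# following the "region of small values" formula.

def closure_solutions(n_base_tuples, n_distinct_per_arg):
    """Expected number of tuples produced by the closure of a binary base
    relation, plus the per-call estimates when arguments are ground."""
    n_dv = n_distinct_per_arg[0] * n_distinct_per_arg[1]  # distinct value pairs
    ratio = n_dv / n_base_tuples                          # the ratio A
    total = n_dv / (ratio - 1)                            # whole closure
    one_ground = total / n_distinct_per_arg[0]            # e.g. pattern [g,f]
    both_ground = one_ground / n_distinct_per_arg[1]      # pattern [g,g]
    return total, one_ground, both_ground

# colleague_of_a_colleague: 4624 base tuples, 132 performers on each side
print(closure_solutions(4624, (132, 132)))   # about (6290, 47.7, 0.36)

The same helper applied to the packages example of Section 6.3 (4075 base tuples, 1640 parts on each side) gives roughly 4081 tuples for the whole closure, 2.49 expected solutions for the pattern [g,f] and 0.0015 for [g,g], the values used later in this chapter.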
[Figure 6.14 Abstract black boxes for the recursive query: for each performer/3 calling pattern, the number of visited tuples, variable unifications and expected solutions per call (204 solutions for [f,f,f], 4.53 for [f,f,g], 1.55 for [f,g,f], 0.034 for [f,g,g]), together with the closure boxes for colleague_of_a_colleague (47.7 expected solutions for [g,f] and 0.36 for [g,g]).]
Suppose that we wish to use the number of visited tuples as the relevant cost contributor. The abstract representation of our six orderings would be as shown in Figure 6.15. We have used the names N_gf and N_gg to denote the unknown number of visited tuples associated with the closure predicate with calling patterns [g,f] and [g,g], respectively.
Thus, the expected numbers of visited tuples for the different orderings are obtained as:
performer [f,g,f], performer [f,f,g], colleague_of_a_colleague [g,g]:
    visited tuples = 520.2 + 7.02 N_gg
performer [f,g,f], colleague_of_a_colleague [g,f], performer [f,g,g]:
    visited tuples = 15286.7 + 1.55 N_gf
performer [f,f,f], performer [f,g,g], colleague_of_a_colleague [g,g]:
    visited tuples = 41820 + 6.94 N_gg
performer [f,f,f], colleague_of_a_colleague [g,g], performer [f,g,g]:
    visited tuples = 15186.8 + 204 N_gg
colleague_of_a_colleague [g,f], performer [f,g,f], performer [f,g,g]:
    visited tuples = 24813.5 + N_gf
colleague_of_a_colleague [g,f], performer [f,g,f], performer [f,g,g]:
    visited tuples = 24813.5 + N_gf
At this point, we need estimates of the magnitudes of N_gf and N_gg. Unless we have a detailed performance analysis of the algorithm that is used to execute the transitive closure, we are forced to propose some suitable value. For instance, we know that simple algorithms to solve the transitive closure problem [Warren75] are computed in time at most proportional to the cube of n, the number of distinct values in the relation, or O(n^3) [Baase88].
We also know that typical transitive closure algorithms compute practically the same instructions for calling patterns [g,g] and [g,f] [Fukar91]. This is because the algorithm first obtains all pairs that are reachable from the first argument, and only then the nature of the second argument is taken into account. In fact, many transitive closure algorithms have similar behaviour even for the calling pattern [f,g], since the computation is started from the second argument rather than from the unbound first argument.
[Figure 6.15 Abstract representation of the different orderings: each ordering drawn as a chain of black boxes annotated with the number of visited tuples and the number of solutions.]
One educated guess as to the values of N_gf and N_gg would be to use the cube of the number of tuples in the base relation (n_t^3 = 4624^3, roughly 98 x 10^9, in our example).† Given this value, our estimates become (rankings shown in square brackets):
Typical experimental results for this family of queries when using SICStus Prolog as the target language are shown in Table 6.9 (again, rankings are shown in square brackets). Note that our first choice for the most efficient query ordering is close to the one that is experimentally best (in fact, there is a virtual "tie" amongst the three orderings with best experimental performance). We are also able to discover those orderings that are more inefficient and therefore should be discarded. Not surprisingly, the most significant term is the one that relates to the closure predicate.
[Table 6.9 Experimental results for the recursive predicate: average cost of each ordering, with rankings in square brackets; the three cheapest orderings are roughly tied at about 43,000 units, while the most expensive ordering costs about 7,500,000 units.]
Incidentally, the ratio between the most and least efficient orderings in the experiments is about 174.3; the corresponding ratio in our predictions is 20000/99 = 202.0. Our "educated guess" turned out to be reasonably accurate.
†As mentioned before, both calling patterns require visits to a similar number of tuples during the computation of the closure; the only difference is that the calling pattern [g,g] will result in fewer tuples being kept in the final answer.
We also launched a series of experiments to determine the performance of the transitive closure algorithm used by the GraphLog translator. The results when SICStus Prolog is used are shown in Table 6.10 (data values are expressed in "artificial units").
calling pattern                         execution time
colleague_of_a_colleague [f,f]          5,769,549.0
colleague_of_a_colleague [g,f]          41,943.2
colleague_of_a_colleague [f,g]          62,922.5
colleague_of_a_colleague [g,g]          42,143.3
Table 6.10 Efficiency of the transitive closure for different calling patterns
Not surprisingly, the efficiency of a totally unbound transitive closure is very poor as compared to the case when one or both arguments are ground. There is a factor of almost 138 between the most and least efficient calling patterns. Also, as we had predicted, there is no visible difference between the efficiencies of calling patterns [g,f] and [g,g]. The fact that the cost of calling pattern [f,g] is approximately 1.5 times the cost of the other two calling patterns that involve a ground term suggests that this particular transitive closure algorithm is not symmetric.
6.3 The Packages Example
Our final case study will be based on the "Packages Example" described in Appendix 4. Our database profile is shown in Table 6.11.
Predicate name    number of tuples    distinct values in argument 1    distinct values in argument 2    distinct values in argument 3
part/3            1640                136                              16                               1610
uses/2            3075                1203                             1288                             -
Table 6.11 The extensional database predicates
We introduce a predicate that computes packages that are in a cycle:
cycle(X) :- pkg_uses(X,X)+.
It defines a unary predicate cycle(X) to be true when there is a path of one or more arcs labeled pkg_uses from X to itself, that is, when X is related to itself by the transitive closure of pkg_uses [Consens92].
Query to be analyzed
Suppose that we wish to determine the best ordering of the following arbitrary query†:
and consider the case when the Y argument is ground prior to the call. We will proceed to estimate the cost of all different orderings. They are shown, with their respective calling patterns, in Table 6.12. Note that only 10 of the orderings are unique.
It becomes clear that we must determine the abstract properties of predicates part_of and cycle. In Appendix 4, we have already obtained the information related to two intensional database predicates (namely part_of and pkg_uses). The properties of predicate cycle are related to those of predicate pkg_uses, so we have to deduce the properties of this intensional predicate. We know the specific ordering that is used to implement the pkg_uses predicate.‡ Once we select this ordering we can draw the "black boxes" for the relevant calling patterns for the subgoals and then derive the "black box" for the pkg_uses predicate. As before, we interconnect the "black boxes" that correspond to the selected ordering and then obtain the expected values of our cost contributors once the expected values for the number of solutions are propagated. This is depicted in Figure 6.16 for one of the many possible cost contributors: the number of visited tuples.
Regarding the number of solutions of predicate pkg_uses, we must recall that a system predicate was ignored in the analysis of Appendix 4. The actual number of solutions once the system predicate is considered is approximately ten times smaller.
Now we are able to come up with a "black box" for the cyclic predicate, i.e., the transitive closure of the pkg_uses predicate. Disregarding the additional call and head unification due to the indirect call to the transitive closure via the cycle predicate, we apply our formulas for transitive closure to the pkg_uses predicate.
†Although this query has no special intent, it is still an interesting example, given the fact that it contains two transitive closures.
‡Although we already know what the best ordering for this predicate would be, sometimes we are not able to modify (and recompile) the already existing code.
[Figure 6.16 Abstract black boxes for some predicates in the packages example: for the selected implementation of pkg_uses, the expected number of visited tuples, variable unifications and solutions per calling pattern (about 4075 solutions when both arguments are free, with more than 13.4 x 10^6 visited tuples and variable unifications).]
In this example, the number of distinct attribute values for the relation is n_dv = 1640 x 1640 = 2,689,600 (since 1640 is the number of parts) and the number of unique tuples for the base predicate is estimated as n_t = 4075. With these values, we proceed to determine the region that corresponds to their ratio A = n_dv / n_t = 660.0. This value lies in the "region of small values for the number of tuples in the base predicate" as explained in Section 5.3.1. Applying the formula derived in that section,
[Table 6.12 Different orderings for the query under consideration: the ten unique orderings with their respective calling patterns.]
we estimate the average number of solutions that results after the computation of the closure as n_c = 2,689,600 / (660.0 - 1) = 4081.2 (the expected number of tuples for the entire closure). If one argument is ground, only a fraction of the tuples will be in the solution. We already know that the transitive closure does not follow a uniform distribution (see Section 5.5), but we may produce a rough estimate by making this assumption, thus dividing the total number of tuples by the number of distinct values for that particular ground argument (i.e., the number of parts, in our example), or n_c[g,f] = 4081.2 / 1640 = 2.49. Similarly, if both arguments are ground, our estimate would be n_c[g,g] = 2.49 / 1640 = 0.0015. With these values, we are able to propose the "black boxes" for predicate cycle as shown in Figure 6.17.† We also need the "black boxes" that correspond to predicate part_of (Figure 6.18).
[Figure 6.18 Abstract black boxes for predicate part_of: among the values shown, 1640, 12.06 and 0.0073 expected solutions for the different calling patterns.]
As in the previous case study, we propose that the number of visited tuples in the computation of the closure is in the order of N_gg = (n_t)^3 = 4075^3 = 67.7 x 10^9 when the first argument is ground. In the previous case study we have already seen that the value of N_ff is expected to be some 138 times the value of N_gf or N_gg, whereas the value of N_fg is just about 1.5 times that of N_gg; this gives N_ff = 138 x 4075^3 = 9.34 x 10^12. If we use these values as approximations for the number of visited tuples in the closures, we are finally able to estimate the cost of the ten different orderings. Initially, we will study the case where all arguments are initially uninstantiated, i.e., the case when we wish to retrieve all solutions to the general query.
†Strictly speaking, we should only have to derive the "black boxes" that correspond to calling patterns [f,f] and [g,g].
[Figure 6.17 Abstract black boxes for predicate cycle: the closure of pkg_uses, with about 4075 expected solutions for the all-free calling pattern, 0.034 when one argument is ground and 0.0015 when both are ground, and unknown numbers of visited tuples N_ff, N_fg, N_gf and N_gg.]
As before, our cost contributor will be the expected number of visited tuples. Then, we will consider the case when the user specifies a ground term, i.e., we wish to retrieve only a fraction of the tuples in the general solution.
Example 1
For the case when the query has the form given above with all arguments free, the estimates become, after value substitution:
n_vt,#1 = 1640 + 1640 x (1 + 1 x (67.7 x 10^9 + 0.0015 x 67.7 x 10^9)) = 111 x 10^12
n_vt,#2 = 1640 + 1640 x (67.7 x 10^9 + 0.0015 x (1 + 1 x 67.7 x 10^9)) = 111 x 10^12
[the substituted expressions for orderings #3 through #9 are analogous]
n_vt,#10 = 9.34 x 10^12 + 4075 x (9.34 x 10^12 + 4075 x (1640 + 12.06 x 1)) = 38 x 10^15
and we conclude that it is likely that the first three orderings are the most efficient ones from the viewpoint of the number of visited tuples.
Example 2
For the case when the query has the form given above with a ground argument, our estimates are modified as follows (again, n_vt stands for "number of visited tuples"); after value substitution:
n_vt,#1 = 1 + 1 x (1 + 1 x (67.7 x 10^9 + 0.0015 x 67.7 x 10^9)) = 67.8 x 10^9
and, again, we conclude that the first three orderings are the most likely to be more efficient from the perspective of the number of visited tuples. There is a factor of almost 162 between the three least expensive orderings and the next most efficient ordering.
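The two head estimates quoted above can be checked with a few lines of arithmetic. The sketch below simply evaluates those expressions in Python; the figures 1640, 0.0015 and 67.7 x 10^9 are the assumed black-box values introduced earlier in this section, not measured data.

# Re-evaluating the quoted estimates for ordering #1 of the packages query.

N_GG = 4075 ** 3                 # guessed closure cost, about 67.7e9
SOL_GG = 0.0015                  # expected solutions of the closure with ground arguments

# all arguments free (Example 1)
free_case = 1640 + 1640 * (1 + 1 * (N_GG + SOL_GG * N_GG))
# the Y argument ground prior to the call (Example 2)
ground_case = 1 + 1 * (1 + 1 * (N_GG + SOL_GG * N_GG))

print(f"all free : {free_case:.3g} visited tuples")    # about 1.1e14
print(f"Y ground : {ground_case:.3g} visited tuples")  # about 6.8e10

Binding the Y argument removes the initial scan over the 1640 parts, which is why the estimate drops by a factor of roughly 1640.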
We obtained some experimental values (only for the most efficient orderings), which are summarized in Table 6.13.†
[Table 6.13 Experimental results for the three most efficient orderings: calling patterns, ordering number, theoretical ranking, experimental result and experimental ranking. For the ordering part_of[f,f], part_of[g,f], cycle[g], cycle[g] (ordering #1) the experimental result is 1,340,538, ranked 1= both theoretically and experimentally.]
6.4 Comparison to Sheridan's Algorithm
In this section we will compare our results to those obtained by the application of Sheridan's algorithm [see Section 2.2.4], one of the most successful reordering algorithms in the literature.
6.4.1 Why is Sheridan's algorithm so successful?
The main idea exploited by Sheridan's algorithm is that the more instantiated the arguments in a predicate call are, the less expensive that call will be. For instance, in the results shown in Table 6.14 (borrowed from Appendix 4), we notice that having an additional ground argument always guarantees a lower cost of execution.
[Table 6.14 Cost metrics for all predicates: estimated cost for each <clause, calling pattern> pair.]
†A single experiment for the most efficient orderings (orderings #1, 2, 3) required several hours of computation. The computation of a single experiment for the next group of orderings (orderings #4, 6, 8) would require close to one month of uninterrupted execution!
6.4.2 Our framework versus Sheridan's
When comparing our results with those obtained by using Sheridan's algorithm, our framework gives consistently better results than Sheridan's for at least two groups of situations: (a) when the position of the ground arguments within a predicate call is crucial, and (b) when the performance of two syntactically similar predicate calls varies considerably because of noticeably different sizes of their underlying database definitions.
A simple glance at Table 6.14 will reveal that, for some predicates, the exact locations of the ground arguments have an impact on the performance of the predicate call. For example, the predicate call uses with a first argument constant and a second argument variable (i.e., calling pattern [g,f]) will perform far better than the call with a second argument constant and a first argument variable (i.e., calling pattern [f,g]). Sheridan's algorithm has no way to determine such a difference, and which of the two will be given preference is a matter of chance.
By a similar token, due to the fact that Sheridan's algorithm does not take into account any information regarding the underlying database, there is no obvious way to distinguish between a potentially very expensive predicate and a less expensive one based only on the nature (i.e., the mode) of the arguments. Consider the very simple case that is shown in Figure 6.19.
Here we have two database predicates with a remarkably different number of facts. Sheridan's algorithm would consider both orderings shown in the figure as equally expensive: the first predicate is executed with both arguments variable; the second predicate is executed with one ground argument and one variable argument. However, since it is dramatically more expensive to retrieve all several thousand tuples of predicate p as opposed to only one tuple for predicate q, it is better to place the call to predicate q before the call to predicate p. Again, since our framework uses a profile of the database in use, we are able to make this kind of prediction, whereas Sheridan's algorithm is not.
A third situation in which our method has a potential advantage over Sheridan's algorithm can be observed when the query contains recursive predicates or path regular expressions. Sheridan's algorithm does not treat these predicates as special cases, and thus a potentially expensive recursive call may be chosen as one of the subgoals to be executed first. Our methodology permits us to estimate the sizes of those special predicates and their repercussions on succeeding predicates.
order1(A,B,C) :- p(A,B), q(B,C).
order2(A,B,C) :- q(B,C), p(A,B).
[Figure 6.19 Impact of the underlying database on the performance of the call: the number of tuples of predicates p and q in two experiments, and the average cost of each ordering in each experiment.]
Chapter 7. Conclusions and Future Work
In this chapter, we summarize our results and address the limitations of our framework. We also propose some additional work that is required.
7.1 Contributions of this Dissertation
We have proposed a new methodology to estimate the cost of a general GraphLog query. It is based on the assumption that a profile of the underlying database is known. We have been able to predict with good accuracy the costs of conjunctions of queries when the program is translated into a WAM-based version of Prolog. This is done by obtaining empirical values for the diverse primitive operations into which the query is decomposed, in combination with predictions regarding the number of times these primitive operations will be invoked. We have also explained how to predict the costs when different abstract machines are used, by defining relevant cost contributors whose values are propagated in conjunction with the expected number of solutions for each subgoal in the query. Our predictions are normally able to detect the most efficient reorderings as well as the most expensive ones. The accuracy of the results is noticeably influenced by the nature of the underlying database, and our predictions are best for databases whose distribution resembles a uniform distribution of attribute values.
Our predictions are usually superior to those obtained when using Sheridan's algorithm, because we make use of more information and are also able to consider some special forms of subgoals.
A major contribution of our work is the ability to handle recursive queries and closures. We have shown that the key factor is the estimation of the output density after applying the transitive closure. We have provided some guidelines and formulae to estimate such output density which have not appeared elsewhere in the literature. Our results may be of special interest given that SQL, the de facto standard for relational databases, has finally included recursive constructs [Dar93], and recursive query languages will soon become the norm [Ahad93].
Our results should be directly applicable to pure Datalog and any other function-free database language.
7.2 Limitations of Our Framework
The main deficiency of our methodology is that we have assumed "ideal" databases. We have not addressed some important issues such as duplication of tuples after a projection or correlation amongst attributes. Many of our claims may be applicable to uniform distributions of attribute values only.
Another issue that has not been addressed by our framework is the impact that ground arguments have on the cost of the unification algorithm. It stands to reason that the unification of long strings of characters consumes more time and resources than the unification of an integer or an atom with a short name. In fact, some systems transform the real attributes to shorter and easier to handle equivalent codes [Graefe91], as is the case in the performers database in Chapter 6.
7.3 Future Work
Our current framework does not consider the special case of aliasing of variables (especially when the same variable is used several times within the same predicate). A domain analysis of the arguments that are involved may establish some upper bounds for the number of tuples that are retrieved.
We also propose to address the inclusion of correlation factors and the analysis of other path regular expressions besides transitive closure. Our framework would also be more complete if built-in predicates were also to be included in the analysis.
It is likely that additional information regarding the cardinality of the base relations may be used to refine our results. Further study of this idea could be fruitful.
If we wish to use our framework in a practical situation, we require a pre-process that determines a set of suitable query orderings to be analyzed. Some methods have been proposed to this effect (randomized algorithms [Ioannidis90] or even Sheridan's algorithm may be adapted to that purpose). Naturally, we need to incorporate a phase that determines the calling patterns of the different subgoals, but this has already been solved elsewhere [Debray88].
So far, we have not mentioned what to do with built-in predicates, a common extension to pure GraphLog/Datalog. In fact we consider that the inclusion of built-in predicates fits quite naturally in our framework. Estimating the number of solutions of a built-in predicate becomes trivial when the domain of the attributes is known in advance, and so do the values of the different cost contributors we might be interested in.
Several extensions have been proposed for GraphLog [Consens89]. These include the definition of aggregates and the option of using functional arguments. Aggregates are constructs that are used to summarize data (typical aggregates are the average, maximum, minimum, sum or count of an attribute). We believe that our framework can be easily extended to handle these constructs once we determine what additional operations take place. In general, aggregates have to perform an action over all the tuples that satisfy a condition. Our framework already estimates the cost of retrieval of the tuples, and we only need to add the cost due to the aggregate action (which will usually require visiting each and every tuple in the solution). Handling functional arguments is a more complex matter. We would require a richer abstract domain to distinguish partially instantiated arguments, and would have to modify the rules of the corresponding abstract unification.
It is not unusual to use a cache in order to reduce the cost of processing a query by preventing multiple evaluations of the same predicate call [Sellis87]. This is specially useful for inherently expensive queries (such as transitive closures). A practical cost model would have to take into consideration this and other implementation issues.
References
[Ahad93] Ahad, R., and Yao, B. RQL: A Recursive Query Language. IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 3, June 1993, pp. 451-461.
[Agrawal90] Agrawal, R., Dar, S., and Jagadish, H.V. Direct transitive closure algorithms: Design and performance evaluation. ACM Transactions on Database Systems, Vol. 15, No. 3, September 1990, pp. 427-458.
[AitKaci91] Aït-Kaci, H. Warren's Abstract Machine: A Tutorial Reconstruction. MIT Press, Cambridge, Mass., 1991.
[Baase88] Baase, S. Computer Algorithms: Introduction to Design and Analysis. Addison-Wesley, 2nd ed., 1988.
[Bancilhon86] Bancilhon, F., and Ramakrishnan, R. An amateur's introduction to recursive query processing strategies. Proceedings of the 1986 ACM-SIGMOD Conference, 1986, pp. 16-52.
[Birkhoff40] Birkhoff, G. Lattice Theory. American Mathematical Society Colloquium Publications, Vol. 25, New York, 1940.
[Ceri90] Ceri, S., Gottlob, G., and Tanca, L. Logic Programming and Databases. Springer-Verlag, Berlin, 1990.
[Ceri91] Ceri, S., Gottlob, G., and Tanca, L. Datalog: A Self-Contained Tutorial (Part 1). Programmirovanie, No. 4, July-August 1991, pp. 20-38.
[Cheiney94] Cheiney, J.-P., and Huang, Y.-N. Efficient maintenance of explicit transitive closure with set-oriented update propagation and parallel processing. Data and Knowledge Engineering, Vol. 13, No. 3, October 1994, pp. 197-226.
[Clocksin81] Clocksin, W.F., and Mellish, C.S. Programming in Prolog. Springer-Verlag, New York, 1981.
[Consens89] Consens, M.P. GraphLog: "Real Life" Recursive Queries Using Graphs. MSc thesis, Department of Computer Science, University of Toronto, January 1989.
[Consens90] Consens, M., and Mendelzon, A. GraphLog: a Visual Formalism for Real Life Recursion. Proceedings of the 9th ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, 1990, pp. 404-416.
[Consens92] Consens, M., Mendelzon, A., and Ryman, A. Visualizing and Querying Software Structures. Proceedings of the 14th International Conference on Software Engineering, Melbourne, Australia, May 1992.
[Cousot77] Cousot, P., and Cousot, R. Abstract Interpretation: a Unified Framework for Static Analysis of Programs by Construction of Approximation of Fixpoints. Proceedings of the 4th ACM Conference on Principles of Programming Languages, ACM, New York, N.Y., 1977, pp. 238-252.
[Cousot92] Cousot, P., and Cousot, R. Abstract Interpretation and Application to Logic Programs. Laboratoire d'Informatique de l'École Normale Supérieure, Research Report LIENS-92-12, June 1992.
[Dar93] Dar, S., and Agrawal, R. Extending SQL with Generalized Transitive Closure Functionality. IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 5, October 1993, pp. 799-812.
[Debray89] Debray, S.K. Static Inference of Modes and Data Dependencies in Logic Programs. ACM Transactions on Programming Languages and Systems, Vol. 11, No. 3, July 1989, pp. 418-450.
[Debray93] Debray, S.K., and Lin, N. Cost Analysis of Logic Programs. ACM Transactions on Programming Languages and Systems, Vol. 15, No. 5, November 1993, pp. 826-875.
[Debray88] Debray, S.K., and Warren, D.S. Automatic Mode Inference for Logic Programs. The Journal of Logic Programming, Vol. 5, No. 3, September 1988, pp. 207-229.
[Froberg85] Fröberg, C.-E. Numerical Mathematics: Theory and Computer Applications. The Benjamin/Cummings Publishing Co., 1985.
[Fukar91] Fukar, M. Translating GraphLog into Prolog. Technical Report TR 74.080, Centre for Advanced Studies, IBM Canada Laboratory, 1991.
[Gardarin89] Gardarin, G., and Valduriez, P. Relational Databases and Knowledge Bases. Addison-Wesley, 1989.
[Gooley89] Gooley, M.M., and Wah, B.W. Efficient Reordering of PROLOG Programs. IEEE Transactions on Knowledge and Data Engineering, Vol. 1, No. 4, 1989, pp. 470-482.
[Gorlick87] Gorlick, M.M., and Kesselman, C.F. Timing Prolog Programs without Clocks. Proceedings of the 1987 Symposium on Logic Programming, San Francisco, CA, pp. 426-432.
[Graefe91] Graefe, G., and Shapiro, L.D. Data compression and database performance. Proceedings of the 1991 ACM/IEEE-CS Symposium on Applied Computing, 1991.
[Horn51] Horn, A. On Sentences Which are True of Direct Unions of Algebras. Journal of Symbolic Logic, Vol. 16, pp. 14-21.
[Ioannidis90] Ioannidis, Y.E., and Kang, Y.C. Randomized Algorithms for Optimizing Large Join Queries. Proceedings of the 1990 ACM SIGMOD Conference on the Management of Data, Atlantic City, NJ, USA, May 1990, pp. 312-321.
[Ioannidis95] Ioannidis, Y.E., and Poosala, V. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, CA, pp. 233-244.
[Jagadish87] Jagadish, H.V., and Agrawal, R. A Study of Transitive Closure as a Recursion Mechanism. Proceedings of the 1987 ACM SIGMOD Annual Conference (SIGMOD Record, Vol. 16, No. 3, December 1987), San Francisco, CA, USA, May 27-29, 1987, pp. 331-344.
[Jarke84] Jarke, M., and Koch, J. Query Optimization in Database Systems. ACM Computing Surveys, Vol. 16, No. 2, June 1984, pp. 111-152.
[Kruse87] Kruse, R.L. Data Structures & Program Design. Prentice-Hall Software Series, New Jersey, USA, 1987.
[Kwast94] Kwast, K.L., and van Denneheuvel, S.J. Duplicates in SQL. Data and Knowledge Engineering, Vol. 13, No. 1, August 1994, pp. 31-66.
[Lu93] Lu, W., and Lee, D.L. Characterization and Processing of Simple Prefixed-Chain Recursion. Information Sciences, Vol. 68, No. 3, March 1993.
[Mannino88] Mannino, M.V., et al. Statistical Profile Estimation in Database Systems. ACM Computing Surveys, Vol. 20, No. 3, September 1988, pp. 191-221.
[McCarthy82] McCarthy, J. Coloring Maps and the Kowalski Doctrine. Report STAN-CS-82-903, Department of Computer Science, Stanford University, 1982.
[McEnery90] McEnery, A., and Nikolopoulos, C. A Meta-Interpreter for Prolog Query Optimization. Proceedings of the IASTED International Symposium on Expert Systems Theory and Applications, 1990, pp. 75-77.
[Mellish85] Mellish, C.S. Some Global Optimizations for a Prolog Compiler. The Journal of Logic Programming, Vol. 2, No. 1, April 1985, pp. 43-66.
[Mishra92] Mishra, P., and Eich, M.H. Join Processing in Relational Databases. ACM Computing Surveys, Vol. 24, No. 1, March 1992, pp. 63-113.
[Ryman92] Ryman, A. Foundations of 4Thought. Proceedings of CASCON '92, 1992, pp. 133-155.
[Ryman93] Ryman, A. Illuminating Software Specifications. Proceedings of CASCON '93, Vol. 1, 1993, pp. 412-428.
[Ryman93a] Ryman, A. Constructing Software Design Theories and Models. In Studies in Software Design, LNCS 1078, Springer, 1993, pp. 103-114.
[Sellis87] Sellis, T.K. Efficiently supporting procedures in relational database systems. Proceedings of the 1987 ACM SIGMOD Conference, San Francisco, CA, pp. 278-291.
[Sheridan91] Sheridan, P.B. On Reordering Conjunctions of Literals: a Simple, Fast Algorithm. Proceedings of the 1991 Symposium on Applied Computing, Kansas City, MO, USA, April 1991, IEEE Computer Society, pp. 73-79.
[Ullman85] Ullman, J.D. Implementation of Logical Query Languages for Databases. ACM Transactions on Database Systems, Vol. 10, No. 3, 1985, pp. 289-321.
[Ullman88] Ullman, J.D. Principles of Database and Knowledge-Base Systems, Vols. I and II. Computer Science Press, 1988-89.
[Wang93] Wang, J., Yoo, J., and Cheatham, T. Efficient Reordering of C-PROLOG. Proceedings of the 21st ACM Computer Science Conference, NY, USA, 1993, pp. 151-155.
[Warren75] Warren, H.S. A Modification of Warshall's Algorithm for the Transitive Closure of Binary Relations. Communications of the ACM, Vol. 18, No. 4, April 1975, pp. 218-220.
[Warren81] Warren, D.H.D. Efficient Processing of Interactive Relational Database Queries Expressed in Logic. Proceedings of the 7th International Conference on Very Large Data Bases, IEEE, 1981, pp. 272-281.
[Woo85] Woo, N.S. A Hardware Unification Unit: Design and Analysis. Proceedings of the 12th International Symposium on Computer Architecture, Boston, MA, 1985, pp. 198-205.
Appendix 1. A Detailed View of Other Approaches to Query Reordering
A1.1 Efficient Reordering of Prolog Programs by Using Markov Chains
Gooley and Wah's work [Gooley89] has proposed a model that approximates the evaluation strategy of Prolog programs by means of a Markov process. The cost is measured as the number of predicate calls or unifications that take place. The method needs to know in advance the probability of success and the cost of execution of each predicate. With this initial information, the cost of a particular ordering for the subgoals within a single clause is calculated as follows.
Consider a predicate clause p with subgoals s_1, ..., s_n. If q_i is the probability that subgoal s_i fails, and if c_i is the cost associated with executing subgoal s_i, the cost of a failure is given by the formula:
C_fail = q_1 c_1 + (1 - q_1) q_2 (c_1 + c_2) + (1 - q_1)(1 - q_2) q_3 (c_1 + c_2 + c_3) + ... + (1 - q_1)(1 - q_2) ... (1 - q_{n-1}) q_n (c_1 + ... + c_n)
or, in closed form,
C_fail = \sum_{i=1}^{n} \Big( \prod_{j=1}^{i-1} (1 - q_j) \Big) \, q_i \sum_{k=1}^{i} c_k .
The goal of the method is to make failing clauses fail earlier and thus reduce backtracking. In other words, goals that are "more likely to fail" (and inexpensive to evaluate) are placed near the head of the clause. This is usually accomplished by ordering the subgoals in decreasing order of their ratios q_i / c_i.
A very similar approach is used to estimate a suitable ordering for the clauses in a given predicate.
Consider a predicate p defined by clauses k_1, ..., k_m, each of the form k_i: p :- s_1, ..., s_n. If p_i is the probability that clause k_i succeeds, and if d_i is the cost associated with executing clause k_i, the cost of a single success is given by the formula:
C_succ = p_1 d_1 + (1 - p_1) p_2 (d_1 + d_2) + ... + (1 - p_1)(1 - p_2) ... (1 - p_{m-1}) p_m (d_1 + ... + d_m)
or, in closed form,
C_succ = \sum_{i=1}^{m} \Big( \prod_{j=1}^{i-1} (1 - p_j) \Big) \, p_i \sum_{k=1}^{i} d_k .
The goal in this case is to get an initial answer as quickly and inexpensively as possible. In other words, clauses that are "more likely to succeed" (and inexpensive to evaluate) are placed near the beginning of the predicate. This is intuitively accomplished by ordering the clauses in decreasing order of their ratios p_j / d_j.
Note that these formulae assume that the costs and success/failure probabilities of the subgoals in a clause are independent of each other, which is usually not the case in Prolog or GraphLog. Thus, the model is just a coarse approximation, although the behaviour of the clauses may still be predicted with some accuracy.
The success/failure probability and cost of executing the body of a clause (that does not involve recursion) are both calculated once the subgoals s_1, ..., s_n are modelled by either the Markov chain in Figure A1.1, if we are interested in the first solution only, or the Markov chain in Figure A1.2, if we want the cost of finding all solutions.
Note that each subgoal s_i corresponds to a distinct state. From each subgoal state there are two transition arcs, which are labelled with the probabilities of success (p_i) and failure (1 - p_i).
[Figure A1.1 Markov chain for the single-solution case: states s_1, ..., s_n with success probabilities p_1, ..., p_n]
[Figure A1.2 Markov chain for the all-solutions case: states s_1, ..., s_n and the clause body]
The success transition arc connects the subgoal state with the next subgoal state, while the failure transition connects it with the previous one. For the special case of the first subgoal, the failure transition arc goes to an absorbing state labelled F (clause failure). Similarly, the success transition arc for the last subgoal reaches another absorbing state labelled S (clause success).
For GraphLog queries, we are often interested in deriving all solutions rather than just the first one. Notice that when the all-solutions case is under consideration, the S state is no longer an absorbing state and it has a failure transition arc of probability one assigned to it (which mimics backtracking).
Gooley and Wah [Gooley89] explain that these Markov chains can be represented mathematically by means of r x r matrices, where r is the number of states in the chain. The transition matrix for the single-solution case has a banded structure [Revuz81], and a similar matrix is obtained for the all-solutions case. Several cost metrics (such as the number of visits to the success state S and the probabilities and costs for the clause body) can be obtained after some mathematical manipulation of these matrices; for instance, the expected cost of a solution in the all-solutions case can be derived in this way.
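The kind of manipulation involved can be illustrated numerically. The Python sketch below builds the transition matrix of the single-solution chain for three hypothetical subgoals and uses the standard fundamental matrix of an absorbing Markov chain to obtain expected state visits; the success probabilities are invented for illustration, and the construction is a generic one rather than Gooley and Wah's exact formulation.

import numpy as np

# Single-solution Markov chain for a clause with subgoals s1, s2, s3.
# Transient states: s1, s2, s3; absorbing states: F (failure), S (success).
p = [0.8, 0.6, 0.9]          # hypothetical success probabilities

n = len(p)
Q = np.zeros((n, n))         # transient-to-transient transitions
R = np.zeros((n, 2))         # transient-to-absorbing transitions (F, S)
for i, pi in enumerate(p):
    if i + 1 < n:
        Q[i, i + 1] = pi     # success moves on to the next subgoal
    else:
        R[i, 1] = pi         # success of the last subgoal reaches S
    if i > 0:
        Q[i, i - 1] = 1 - pi # failure backtracks to the previous subgoal
    else:
        R[i, 0] = 1 - pi     # failure of the first subgoal reaches F

N = np.linalg.inv(np.eye(n) - Q)   # fundamental matrix: expected visits
print(N[0])                        # expected visits to s1, s2, s3 starting at s1
print(N @ R)                       # absorption probabilities into F and S

Multiplying the expected visit counts by the per-subgoal costs gives the kind of expected-cost figure that the formulas above summarize.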
A1.2 A Meta-Interpreter for Prolog Query Optimization
McEnery and Nikolopoulos [McEnery90] describe a meta-interpreter for Prolog which reorders clauses and predicates. It has two components: (a) a static component in charge of rearranging the clauses "a priori", and (b) a dynamic component that reorders the clauses according to probabilistic profiles built from previously answered queries.
This method's static reordering phase consists of rearranging the clauses that define a predicate in such a way that the most successful clauses are tried first, and the subgoals within a clause are reordered in descending order of success likelihood.
Subgoal reordering is performed by using a generalization of a heuristic due to D.H.D. Warren [Warren81]. Warren proposed a formula for the cost c of a simple query q as given by c_q = s / a, where s is the size in tuples (i.e., the number of solutions) of the subgoal, and a is the product of the sizes of the domains of each instantiated argument.
For example, given the following Prolog database:

nation(canada).
nation(belgium).
nation(uk).

language(canada, french).
language(canada, english).
language(belgium, dutch).
language(belgium, french).
language(belgium, german).
language(uk, english).
language(quebec, french).
language(texas, english).

and the following predicate definition:

french_speaking_nation(N) :- nation(N), language(N, french).

the cost of the query french_speaking_nation(N) would be obtained as follows.
The cost associated with the execution of predicate nation with an unbound argument is given by

    c_nation = s/a = 3/1 = 3

(there are three nations, and thus s = 3; there are no instantiated arguments, therefore a = 1). Similarly, the cost of subgoal language with both arguments bound is estimated as

    c_language = s/a = 8/(5 x 4) = 0.4

(there are eight tuples for the predicate, and thus s = 8; the value of a is derived from the fact that there are five regions and four different languages). Thus, the cost of the whole query would be given by

    c_french_speaking_nation = c_nation + c_language = 3 + 0.4 = 3.4

if the textual ordering is to be applied.

If the alternative order

    french_speaking_nation(N) :- language(N, french), nation(N).
is to be used instead, our estimates will change accordingly. We calculate the cost of subgoal language when the first argument is unbound and the second argument is ground,

    c_language = s/a = 8/4 = 2,

and the cost of subgoal nation with a bound argument,

    c_nation = s/a = 3/3 = 1,

and we obtain the cost of the alternative order as

    c_french_speaking_nation = c_language + c_nation = 2 + 1 = 3.

In other words, this second ordering is estimated to be more efficient than the original one.
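The arithmetic above can be reproduced mechanically. The following is a rough sketch of ours (assuming SWI-Prolog and the nation/1 and language/2 facts listed earlier; the domain sizes of five regions and four languages are taken from the text):

    % s: number of tuples (solutions) of a goal.
    n_tuples(Goal, S) :- findall(x, Goal, Xs), length(Xs, S).

    % Cost of nation(N) with N unbound: a = 1 (no instantiated arguments).
    cost_nation_free(C) :- n_tuples(nation(_), S), C is S / 1.

    % Cost of language(N, french) with N already bound by nation/1:
    % both arguments are instantiated, so a = 5 * 4.
    cost_language_bound(C) :- n_tuples(language(_, _), S), C is S / (5 * 4).

    % Cost of language(N, french) with N unbound: only the second argument
    % is instantiated, so a = 4.
    cost_language_half(C) :- n_tuples(language(_, _), S), C is S / 4.

    % Cost of nation(N) with N bound: a = 3 (three nations).
    cost_nation_bound(C) :- n_tuples(nation(_), S), C is S / 3.

    % ?- cost_nation_free(C1), cost_language_bound(C2), T is C1 + C2.   % T = 3.4
    % ?- cost_language_half(C1), cost_nation_bound(C2), T is C1 + C2.   % T = 3

The two conjunctions at the end reproduce the estimates of 3.4 for the textual ordering and 3 for the alternative one.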
The generalized formula proposed by McEnery and Nikolopoulos is given by:

where s and a are defined as in Warren's formula, and p is the probability of success of the clause under analysis. The authors propose a dynamic evaluation of the value of p, according to the accumulated history of the predicate. Note that the higher the value of p, the lower the cost. This success rate is physically stored in the database as an ordered tuple. Every time a clause succeeds, its success rate is increased by one. The probabilistic profile is given by the ratio of the success rate and the overall sample space.
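The formula itself is not reproduced above. The sketch below assumes the form c = s/(a * p), which matches the description (s and a as in Warren's formula, with a higher success probability p lowering the cost), but that exact form, and all predicate names, are our assumptions rather than code from [McEnery90].

    % generalized_cost(+S, +A, +P, -C): assumed form of the estimate.
    generalized_cost(S, A, P, C) :-
        P > 0,
        C is S / (A * P).

    % The accumulated success counts could be kept as dynamic facts:
    :- dynamic success_count/2.          % success_count(ClauseId, Count)

    record_success(ClauseId) :-
        (   retract(success_count(ClauseId, N))
        ->  N1 is N + 1
        ;   N1 = 1
        ),
        assertz(success_count(ClauseId, N1)).

The probabilistic profile p of a clause would then be its success count divided by the overall sample size.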
A1.3  Cost Analysis of Logic Programs

Debray and Lin [Debray93] have proposed a framework to analyze the cost of logic programs, including programs with simple recursion. The method estimates the number of solutions of a logic program based on the sizes of the diverse predicate arguments. Various measures are included under the generic name "size", such as integer-value, list-length, term-depth, or term-size. Thus, some type information must be inferred and propagated for each argument via a static program analysis.

The method derives size relationships amongst predicate arguments. This size information is propagated to compute the number of solutions generated by each predicate. The size properties of predicate arguments are described by means of two functions: (a) size(arg), which provides the actual size of argument arg, and (b) diff(arg1, arg2), which calculates the size difference between two arguments arg1 and arg2. Each of these functions has a different definition depending on the measure under consideration. For example, the definition of size(arg) for the particular measure "list-length" is as follows:†
    size(t) =
        0,               if t is the empty list,
        1 + size(t1),    if t is of the form [_|t1] for some term t1,
        undefined,       otherwise.

†We use the standard Prolog notation for lists: [H|T] refers to a list whose initial element is H (the head of the list) and whose remainder is another list T (the tail of the list). An underscore '_' represents an anonymous variable, i.e., a variable whose exact binding is irrelevant for our purposes.
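For concreteness, the "list-length" measure corresponds directly to the following executable predicate (a small sketch of ours; the analysis in [Debray93] is performed statically rather than by running such code):

    % list_length_size(+T, -Size): the "list-length" size measure;
    % fails (size undefined) when T is not a proper list.
    list_length_size([], 0).
    list_length_size([_|T1], Size) :-
        list_length_size(T1, S1),
        Size is S1 + 1.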
Note that when the value of a particular size (or of a difference between sizes) cannot be determined from the context, the functions return the special value "undefined".

A distinction between "output" and "input" arguments is also made. The size of an input argument is always calculated from the sizes of previous occurrences of the variables appearing in that argument (the so-called predecessors of the input position). Consider the following predicate clause, in which the measure under consideration is "list-length":

nrev([H|L], R) :- nrev(L, R1), append(R1, [H], R).

in which the call has the following input/output argument positions:

nrev(<input>, <output>),

and, from this calling pattern, we may derive the calling patterns of the subgoals on the right-hand side as nrev(<input>, <output>) and append(<input>, <input>, <output>). For pedagogical reasons, we number the argument positions (and literals) as follows:

nrev(<input1>, <output1>) :- nrev(<input2>, <output2>), append(<input3>, <input4>, <output3>).
The size of <input4> (i.e., [H]) is simply calculated as the size of a unitary list, as given by the function size(arg), i.e., size(<input4>) = 1. The size of <input2> (i.e., L) is obtained by applying the function diff(arg1, arg2) to <input1> ([H|L], the predecessor of <input2>) and <input2> itself, which gives size(<input2>) = size(<input1>) - 1. Finally, the size of <input3> (i.e., R1) is expressed in terms of that argument's predecessor, <output2> (whose value must be calculated elsewhere), as size(<input3>) = size(<output2>).

Size relationships for output argument positions are derived as functions expressed in terms of the sizes of the different input arguments, symbolically written Sz(S, Arg, size(input_1), ..., size(input_n)), where input_1, ..., input_n are the input arguments of subgoal S and Arg is the argument position whose size is being calculated.

In our example, the size of <output2> is expressed in terms of the input argument as size(<output2>) = Sz(nrev,2,size(<input2>)), and the size of <output3> is derived from the sizes of its two input arguments as size(<output3>) = Sz(append,3,size(<input3>),size(<input4>)). Similarly, the size of <output1> is obtained as a function of <output3>, the argument that originates the output value of the whole clause; in this case, size(<output1>) = size(<output3>).

The set of size relations that is obtained for a given predicate clause is then expressed in terms of head input arguments only, a process that is called normalization. For instance, if we already know that Sz(nrev,2,size(x)) = size(x) for a given input x, then size(<input3>) is transformed successively into

    size(<input3>) = size(<output2>) = Sz(nrev,2,size(<input2>)) = Sz(nrev,2,size(<input1>) - 1) = size(<input1>) - 1,

which is expressed in terms of the head input argument exclusively. Once normalization has been performed, a system of difference equations is obtained. To get closed-form expressions, these difference equations need to be solved. Unfortunately, solving difference equations automatically is a difficult problem, although automatic solutions for a wide variety of them have been proposed in the literature.
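As a quick empirical illustration of the relation Sz(nrev,2,n) = n used above (our own check, not part of the analysis), naive reverse can simply be run on a list of fresh variables; the base case nrev([], []) is added for completeness:

    nrev([], []).
    nrev([H|L], R) :- nrev(L, R1), append(R1, [H], R).

    % check_nrev_size(+N): the output of nrev/2 has the same list-length
    % as its input, for an input list of length N.
    check_nrev_size(N) :-
        length(In, N),        % a list of N fresh variables
        nrev(In, Out),
        length(Out, N).

    % ?- check_nrev_size(7).   % succeeds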
The number of solutions generated by a predicate is estimated from the size relationships by counting the number of possible values that every variable in the clause may be bound to. The method exploits two properties of unification that hold in many logic programming languages. One of these properties is that if a variable appears n times in the subgoals of a clause, the total number of distinct bindings that such a variable may have is at most the minimum of the set {b_1, ..., b_n}, where b_i is the number of possible bindings for the variable at each argument position, computed independently. For instance, consider a predicate clause in which a variable Y appears in three subgoals o, p, and q, which always return a bound term for argument positions 1, 2, and 1, respectively. It follows that variable Y will be bound only to those ground values that are common to all three predicates o, p, and q. If the numbers of distinct values at those argument positions are b_o, b_p, and b_q, respectively, the number of distinct bindings for variable Y is at most min{b_o, b_p, b_q}.

Another useful property of unification is that, for subgoals that contain more than one variable, an upper bound on the number of bindings for the subgoal is given by the product of the numbers of bindings of each of its variables. For example, given a subgoal s(Y, X, Y), if b_Y is the number of bindings that variable Y can take and b_X is the number of bindings for variable X, the maximum number of distinct tuples that can be obtained for the subgoal is given by b_Y x b_X. Suppose that variable Y can be bound to the values a and b, whereas variable X can be unified with the values b, c and d. Then the 2 x 3 distinct tuples that can be obtained are s(a,b,a), s(a,c,a), s(a,d,a), s(b,b,b), s(b,c,b) and s(b,d,b). These tuples are called the instances of the subgoal. A function called instance(T) is defined to compute the number of instances of a term T.
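The 2 x 3 instances of s(Y, X, Y) can be enumerated directly; the sketch below is merely our illustration of the counting argument, assuming SWI-Prolog:

    % instances_of_s(-Instances): enumerate the instances of the subgoal
    % s(Y, X, Y) when Y ranges over {a, b} and X ranges over {b, c, d}.
    instances_of_s(Instances) :-
        findall(s(Y, X, Y),
                ( member(Y, [a, b]),
                  member(X, [b, c, d]) ),
                Instances).

    % ?- instances_of_s(L), length(L, N).   % N = 6, the product 2 x 3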
Two additional quantities (functions) are defined for each predicate p in the program: Rel_p, which represents the size of the relation defined by p (i.e., the number of tuples that the predicate generates when all its arguments are uninstantiated), and Sol_p, the solution size for p (i.e., the number of tuples that are obtained for a particular input, as specified by the size values of the (instantiated) input arguments). The function instance(S) is used to calculate both quantities, the main difference being that only output variables are considered for the derivation of Sol_p, whereas all variables (input and output) are used for the calculation of Rel_p.

To obtain values for Sol_p and Rel_p, we need to determine the number of bindings that are possible for the variables contained in the predicate clause under consideration. Two cases are considered: the number of variable bindings for input arguments and the number of variable bindings for output arguments. The value for input arguments is derived by using the above-mentioned properties of unification (i.e., the function instance(S)). The value for output arguments, on the other hand, is bounded by the product of the number of solutions that are expected from the predicate for a particular input size (as obtained from the size relationships for predicate arguments) and the value for the input arguments itself. Again, if recursive predicates are present in the program, a set of difference equations will be obtained.

The upper bounds for these two quantities can be substantially improved if special cases are also considered. Such special cases include: distinct variables that are bound by the same literal, output variables that are instantiated according to the bindings of the input variables, etc. Similarly, the detection of mutually exclusive clauses can produce more precise results.
To demonstrate how the method is applied, consider the following recursive program, which permutes a list of elements:†

perm([], []).
perm(X, [L|L1]) :- select(L, X, Y), perm(Y, L1).

select(H, [H|T], T).
select(X, [H|T1], [H|T2]) :- select(X, T1, T2).

Suppose that we are using "list-length" as the relevant measure, as well as the following input/output mapping:

perm(<input>, <output>),
select(<output>, <input>, <output>),

or, in an alternative, expanded form:

perm(<input1>, <output1>).
perm(<input2>, <output2>) :- select(<output3>, <input3>, <output4>), perm(<input4>, <output5>).
select(<output6>, <input5>, <output7>).
select(<output8>, <input6>, <output9>) :- select(<output10>, <input7>, <output11>).

†The symbol [] denotes an empty list, i.e., a list with no elements at all.
We will concentrate on the simpler predicate select. We try to derive the sizes of both of its output argument positions (the first and the third argument).† The size relations for the first output argument of predicate select are computed as:

    size(<output6>) = undefined,                                         (rule select1)
    size(<output8>) = size(<output10>) = Sz(select,1,size(<input7>)).    (rule select2)

This system of equations yields:

    Sz(select,1,size(<input5>)) = size(<output6>) = undefined.

The size relations for the other output argument of select (the third argument position) are as follows:

    size(<output7>) = size(<input5>) - 1,        (rule select1)
    size(<output9>) = size(<output11>) + 1,      (rule select2)

where

    size(<output11>) = Sz(select,3,size(<input7>)),

or, in normalized form,

    size(<output11>) = Sz(select,3,size(<input6>) - 1).

Combining this set of conditions, we finally obtain:

    Sz(select,3,size(<input5>)) = size(<output7>) = size(<input5>) - 1,                     (rule select1)
    Sz(select,3,size(<input6>)) = size(<output9>) = Sz(select,3,size(<input6>) - 1) + 1.    (rule select2)

This results in a system of difference equations of the form

    f(x) = x - 1,              (rule select1)
    f(x) = f(x - 1) + 1,       (rule select2)

where f(x) stands for Sz(select,3,x). In this case, the solution is straightforward (which is not always the case):

    f(x) = x - 1,

or, after variable substitution,

    Sz(select,3,n) = n - 1.

†For each output argument position we obtain one equation for each rule, expressed in terms of its respective input: <input5> for rule select1, and <input6> for rule select2.
The next step is to estimate the number of solutions that predicate select is expected to generate. Consider first the recursive subgoal select(<output10>, <input7>, <output11>) in clause select2. The number of bindings for the input variable T1 (<input7>) is equal to 1 (since every input variable has an initial, unique value). We know that the number of bindings of both output variables, X and T2, is bounded by

    1 x Sol(select,size(<input7>)) = 1 x Sol(select,size(<input6>) - 1).

In this case, the total number of solutions for clause select2 equals the number of bindings for the output arguments and is expected to be:

    Sol(select,size(<input6>)) = Sol(select,size(<input6>) - 1).

Consider now the other predicate clause, select1. Again, the number of bindings for the input variables H and T (<input5>) is equal to 1. Since they are also the only output variables, we get:

    Sol(select,size(<input5>)) = 1.

In other words, the following equations are obtained:

    Sol(select,size(<input5>)) = 1,                                   (rule select1)
    Sol(select,size(<input6>)) = Sol(select,size(<input6>) - 1),      (rule select2)

which can be combined into

    f(x) = f(x - 1) + 1,

since both clauses are mutually exclusive.

Using boundary conditions (namely, that f(0) = 0 must hold), the final answer is obtained by solving the difference equation†:

    Sol(select,size(<input6>)) = size(<input6>),

and we conclude that predicate select will generate at most n solutions for an input of size n.

†This particular difference equation has a trivial solution. However, one major challenge faced by Debray and Lin's framework is that many real-life difference equations cannot be solved automatically.
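The prediction is easy to check empirically. The sketch below is ours, not code from [Debray93]; the predicate is renamed select_mine/3 to avoid clashing with the library select/3 of SWI-Prolog.

    select_mine(H, [H|T], T).
    select_mine(X, [H|T1], [H|T2]) :- select_mine(X, T1, T2).

    % count_solutions(+N, -Count): number of solutions of select for an
    % input list of N (fresh) elements.
    count_solutions(N, Count) :-
        length(List, N),
        findall(X-Rest, select_mine(X, List, Rest), Solutions),
        length(Solutions, Count).

    % ?- count_solutions(5, Count).   % Count = 5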
Appendix 2
Primitive Constants in a Uniform Distribution
For the special case of a uniform and independent distribution of attribute values, n_choicepoints is given by:

where n_tuples is the number of tuples in the database for predicate p(P_1, P_2, ..., P_n), and K_k is a reduction factor for argument P_k:

    K_k = 1,                                                  if no indexing is applied to argument position k,
    K_k = number of distinct values for argument position k,  if argument indexing is involved and the argument is ground

(note that the value of K_k is the same regardless of whether hash-table collisions occur or not). Note that this formula assumes a uniform distribution.
    K(k) = 1,                                                  if the argument is a variable or if the argument position is subjected to indexing,
    K(k) = number of distinct values for argument position k,  if argument indexing is not involved.

Thus, if we define

    δ_idx(k) = 1 if argument position k is indexed, and 0 otherwise,

and

    δ_var(k) = 1 if P_k is a variable, and 0 otherwise,

we can derive the following final equations:
For the trivial case of a uniform distribution, the following formula can be used to calculate F̄_p, the expected number of tuples that, on average, have to be visited in order to find the first solution (n represents the arity of the subgoal):

    0,                                  if P_k is a ground term and P_{k-1} is also a ground term,
    ( Π_{m=k}^{n} Φ(m) - 1 ) / 2,       if P_k is a ground term and P_{k-1} is not ground,
    0,                                  if P_k is not ground and k = 1,
    0,                                  if P_k is not ground and P_{k-1} is not ground,

where

    Φ(m) = 1,                                                  if m is an indexed position and P_m is ground,
    Φ(m) = number of distinct values for argument position m,  otherwise.
Table A2.1 shows the corresponding values of F̄ for the ternary case, assuming that no position is indexed. Once again, S_k stands for the number of distinct values for argument position k.

Table A2.1  Values of the traversal factor for the ternary predicate example
Appendix 3
Method of Measurement
The general method to measure the CPU time required to execute a given Prolog query consists of repeating the execution of the query a certain number of times and taking the average of these measurements. The general scheme is shown in Figure A3.3, where main represents the query under consideration. Note that we must discard the contribution to the execution time due to the loop itself.

Figure A3.3  General method to measure CPU execution times
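The exact code of Figure A3.3 is not reproduced here; the following is a minimal sketch of such a harness in the same spirit (our own, assuming SWI- or QUINTUS-style statistics/2; n_repetitions and the placeholder main/0 are illustrative):

    main :- true.                  % placeholder: replace with the query under study

    n_repetitions(1000).           % illustrative constant

    % Force all solutions of main/0 through backtracking, then succeed.
    all_solutions :- main, fail.
    all_solutions.

    % An empty loop with the same structure, used to measure (and later
    % subtract) the overhead of the loop itself.
    empty_loop(0) :- !.
    empty_loop(N) :- N1 is N - 1, empty_loop(N1).

    query_loop(0) :- !.
    query_loop(N) :- all_solutions, N1 is N - 1, query_loop(N1).

    % measure(-AvgMs): average CPU time (in milliseconds) per execution of
    % the query, with the loop overhead discarded.
    measure(AvgMs) :-
        n_repetitions(R),
        statistics(runtime, _),
        empty_loop(R),
        statistics(runtime, [_, LoopMs]),
        query_loop(R),
        statistics(runtime, [_, TotalMs]),
        AvgMs is (TotalMs - LoopMs) / R.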
Appendix 4
A Performance Model for QUINTUS Prolog

In this section, the performance model is applied to QUINTUS Prolog for the packages example in [Consens92]†. Essentially, the database contains the following extensional DB predicates:

part/2: an enumeration of the 1,640 parts in the system;
uses/2: a set of 4,075 facts which establish which part uses another part.
An intensional DB predicate that can be used to determine the relation "A is part of package B" is as follows:

part_of(A,B) :- part(B,A).

The following Prolog query may be used to determine all the packages X such that X contains a part A that uses a part B which is in turn contained in a package Y different from X:

pkg_uses(X,Y) :- uses(A,B), part_of(A,X), part_of(B,Y), \+(X=Y).

This conjunction of four subgoals can be rearranged into several different orders without affecting the accuracy of the result. Some orderings are forbidden due to the fact that the subgoal \+(X=Y) involves negation, thus requiring that both of its arguments have a bound value before the predicate is evaluated. The following table (Table A4.1) shows all valid orderings for the query.

†Consens, M., Mendelzon, A., and Ryman, A. Visualizing and Querying Software Structures. Proceedings of the 14th International Conference on Software Engineering, Melbourne, Australia, May 1992.
ordering | subgoal #1 | subgoal #2 | subgoal #3 | subgoal #4

Table A4.1  Valid orderings for the query pkg_uses/2

A4.1  Database profile

The packages example can be viewed as a set of extensional and intensional predicates along with the safe queries that are to be applied to the database.

(a) Extensional predicates. We will assume that they follow a strict uniform distribution of attribute values.
predicate name | number of tuples | distinct values in argument 1 | distinct values in argument 2

Table A4.2  The extensional database predicates

(b) Intensional predicate. There is one intensional predicate, part_of/2, which is defined upon one of the extensional predicates, a feature that simplifies the analysis.
We consider all eight valid orderings for the packages example:

:- uses(A,B), part_of(A,X), part_of(B,Y), \+(X=Y).      ordering #1
:- uses(A,B), part_of(B,Y), part_of(A,X), \+(X=Y).      ordering #2
:- part_of(A,X), uses(A,B), part_of(B,Y), \+(X=Y).      ordering #3
:- part_of(A,X), part_of(B,Y), uses(A,B), \+(X=Y).      ordering #4
:- part_of(A,X), part_of(B,Y), \+(X=Y), uses(A,B).      ordering #5
:- part_of(B,Y), uses(A,B), part_of(A,X), \+(X=Y).      ordering #6
:- part_of(B,Y), part_of(A,X), uses(A,B), \+(X=Y).      ordering #7
:- part_of(B,Y), part_of(A,X), \+(X=Y), uses(A,B).      ordering #8

Note that the built-in predicate \+(X=Y) has been left out of the analysis, since the performance model is not applicable to system predicates.
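As a side note, all of the safe orderings compute the same answer set; only their cost differs. A quick sanity check of ours (assuming the uses/2 and part_of/2 predicates above are loaded; pkg_uses_1/2 and pkg_uses_6/2 are hypothetical names for orderings #1 and #6):

    pkg_uses_1(X, Y) :- uses(A, B), part_of(A, X), part_of(B, Y), \+ X = Y.
    pkg_uses_6(X, Y) :- part_of(B, Y), uses(A, B), part_of(A, X), \+ X = Y.

    % same_answers: both orderings produce exactly the same set of X-Y pairs.
    same_answers :-
        setof(X-Y, pkg_uses_1(X, Y), Answers),
        setof(X-Y, pkg_uses_6(X, Y), Answers).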
A4.2  Abstract Domains

Table A4.3 describes Debray's domain for this particular query. Table A4.4 indicates the cost domain that applies to the extensional predicates, while Table A4.5 gives the cost domain for the intensional predicate and the different queries.

Table A4.3  Debray's domain for all predicates

Table A4.4  Cost domain for the extensional predicates
Table A4.5 summarizes the specific values of the cost domain for both the intensional predicate and the different queries. (In fact, once the built-in predicate is removed from the queries, the eight valid orderings yield only four distinct queries: (a) ordering #1 = ordering #2; (b) ordering #3; (c) ordering #4 = ordering #5 = ordering #7 = ordering #8; and (d) ordering #6.)

A4.3  Cost metrics

Some cost metrics are summarized in this subsection. The values of the basic constants are specific to QUINTUS Prolog under AIX.

Head unification probabilities: prob_1 = 1 (it always unifies).

Empirical constants used:

Clause cost metrics: Table A4.6 shows a summary of the cost metrics for all predicates, whereas Table A4.7 provides them for the intensional predicate.
Table A4.5  Cost domain for the intensional predicate and the main query

Table A4.6  Cost metrics for all predicates

<clause, call pattern>   p(hunif)   n_hp        n_u
<part_of/2, [g,f]>       1          1 + 1640    1 + 2 x 1640
<part_of/2, [f,f]>       1          1 + 1640    2 + 3 x 1640

Table A4.7  Cost metrics for the intensional predicate
A4.4  Query cost formulae

cost(ordering #6) = cost(<part_of/2, [f,f]>)
                    + n_sol(<part_of/2, [f,f]>) x ( cost(<uses/2, [f,g]>)
                    + n_sol(<uses/2, [f,g]>) x cost(<part_of/2, [g,f]>) )

A4.5  Comparison between the Model Prediction and the Experimental Results
Table A4.8 presents a summary of the values predicted by the performance model, compared with the values that were obtained experimentally. It should be noted that the performance model was able to predict the correct order of performance of the queries.

Ordering #   Theoretical value   Experimental value   Error
1            93021               107054               15.08%

Table A4.8  Theoretical and Experimental Values for the Packages Example