Distributed Cost Model

At a glance
The key takeaways are that distributed query optimization is more complex than centralized optimization due to additional decisions around query execution plans like bushy vs linear trees and shipping relations between sites. Heuristics and restricting the search space are important to deal with the large number of possible plans.

Some of the main issues in distributed query optimization are deciding on linear vs bushy trees, what and where to ship relations between sites, and when to use semi-joins instead of joins. Additional statistics also need to be gathered about the relations across sites.

The two main strategies for scanning the search space in query optimization are deterministic and randomized. Deterministic builds plans methodically while randomized searches around a starting point.

Distributed Database Systems

Fall 2012

Distributed Query Optimization


SL05

Basic Concepts

Distributed Cost Model

Database Statistics

Joins and Semijoins

Query Optimization Algorithms

DDBS12, SL05

1/52

M. Böhlen

Basic Concepts/1

- Query optimization: the process of producing an optimal (or close to optimal) query execution plan, which represents an execution strategy
  - The main task in query optimization is to consider different orderings of the operations
- Centralized query optimization:
  - Find (the best) query execution plan in the space of equivalent query trees
  - Minimize an objective cost function
  - Gather statistics about relations
- Distributed query optimization brings additional issues
  - Linear query trees are not necessarily a good choice
  - Bushy query trees are not necessarily a bad choice
  - What and where to ship the relations
  - How to ship relations (ship as a whole, ship as needed)
  - When to use semi-joins instead of joins

Basic Concepts/2

- Search space: the set of alternative query execution plans (query trees)
  - Typically very large
  - The main issue is to optimize joins
  - For N relations, there are $O(N!)$ equivalent join trees that can be obtained by applying commutativity and associativity rules
- Example: 3 equivalent query trees (join trees) of the joins in the following query

    SELECT ENAME, RESP
    FROM   EMP, ASG, PROJ
    WHERE  EMP.ENO = ASG.ENO AND ASG.PNO = PROJ.PNO
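As a rough illustration of how quickly the search space grows, the following Python sketch enumerates all linear join orders of the three relations in the example query and marks those that would need a Cartesian product. The join graph (EMP and ASG on ENO, ASG and PROJ on PNO) is read off the WHERE clause; the function names are illustrative only, and bushy trees are not enumerated.

```python
from itertools import permutations

# Join predicates read off the WHERE clause of the example query.
JOIN_EDGES = {frozenset(("EMP", "ASG")), frozenset(("ASG", "PROJ"))}


def needs_cartesian_product(order):
    """True if some relation in a linear order joins with nothing placed before it."""
    for i in range(1, len(order)):
        if not any(frozenset((order[i], prev)) in JOIN_EDGES for prev in order[:i]):
            return True
    return False


def linear_join_orders(relations):
    """All N! linear join orders, each paired with a Cartesian-product flag."""
    return [(order, needs_cartesian_product(order)) for order in permutations(relations)]


if __name__ == "__main__":
    for order, cp in linear_join_orders(["EMP", "ASG", "PROJ"]):
        print(" JOIN ".join(order), "(needs Cartesian product)" if cp else "")
```

For three relations this yields 3! = 6 linear orders, of which the two that place EMP and PROJ next to each other at the start require a Cartesian product.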

Basic Concepts/3

- Reduction of the search space
  - Restrict by means of heuristics
    - Perform unary operations before binary operations, etc.
  - Restrict the shape of the join tree
    - Consider the type of trees (linear trees vs. bushy trees)

[Figures: Linear Join Tree; Bushy Join Tree]

Basic Concepts/4

- There are two main strategies to scan the search space
  - Deterministic
  - Randomized
- Deterministic scan of the search space
  - Start from base relations and build plans by adding one relation at each step
  - Breadth-first strategy (BFS): build all possible plans before choosing the best plan (dynamic programming approach)
  - Depth-first strategy (DFS): build only one plan (greedy approach)

Basic Concepts/5

- Randomized scan of the search space
  - Search for optimal solutions around a particular starting point
  - E.g., iterative improvement or simulated annealing techniques
  - Trades optimization time for execution time
    - Does not guarantee that the best solution is obtained, but avoids the high cost of optimization
  - The strategy is better when more than 5-6 relations are involved

Distributed Cost Model/1

- Two different types of cost functions can be used
  - Reduce total time
    - Reduce each cost component (in terms of time) individually, i.e., do as little for each cost component as possible
    - Optimize the utilization of the resources (i.e., increase system throughput)
  - Reduce response time
    - Do as many things in parallel as possible
    - May increase total time because of increased total activity

Distributed Cost Model/2

- Total time: sum of the time of all individual components
  - Local processing time: CPU time + I/O time
  - Communication time: fixed time to initiate a message + time to transmit the data

  $$\text{Total time} = T_{CPU} \cdot \#\text{instructions} + T_{I/O} \cdot \#\text{I/Os} + T_{MSG} \cdot \#\text{messages} + T_{TR} \cdot \#\text{bytes}$$

- The individual components of the total cost have different weights:
  - Wide area network
    - Message initiation and transmission costs are high
    - Local processing cost is low (fast mainframes or minicomputers)
    - Ratio of communication to I/O costs is 20:1
  - Local area networks
    - Communication and local processing costs are more or less equal
    - Ratio of communication to I/O costs is 1:1.6 (10MB/s network)

Distributed Cost Model/3

- Response time: elapsed time between the initiation and the completion of a query

  $$\text{Response time} = T_{CPU} \cdot \#\text{seq instructions} + T_{I/O} \cdot \#\text{seq I/Os} + T_{MSG} \cdot \#\text{seq messages} + T_{TR} \cdot \#\text{seq bytes}$$

  where #seq x (x in instructions, I/Os, messages, bytes) is the maximum number of x which must be done sequentially.
- Any processing and communication done in parallel is ignored

Distributed Cost Model/4

- Example: Query at site 3 with data from sites 1 and 2, where x and y denote the amounts of data shipped from site 1 and site 2 to site 3.
  - Assume that only the communication cost is considered
  - $\text{Total time} = T_{MSG} \cdot 2 + T_{TR} \cdot (x + y)$
  - $\text{Response time} = \max\{T_{MSG} + T_{TR} \cdot x,\ T_{MSG} + T_{TR} \cdot y\}$
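The two communication-only cost functions of this example can be sketched in a few lines of Python. The function and parameter names are illustrative assumptions; `transfers` is a list with one entry per site that ships data to the query site, each entry giving the amount of data shipped.

```python
def total_time(t_msg, t_tr, transfers):
    """Total time: every message and every transferred unit is charged."""
    return t_msg * len(transfers) + t_tr * sum(transfers)


def response_time(t_msg, t_tr, transfers):
    """Response time: transfers run in parallel, so only the slowest one counts."""
    return max(t_msg + t_tr * units for units in transfers)


if __name__ == "__main__":
    # Hypothetical values: T_MSG = 2, T_TR = 1, x = 10, y = 20.
    print(total_time(2, 1, [10, 20]))     # 2*2 + 1*(10 + 20)
    print(response_time(2, 1, [10, 20]))  # max(2 + 10, 2 + 20)
```

With these made-up numbers the total time is 34 while the response time is only 22, since the two transfers overlap.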

Database Statistics/1

- The primary cost factor is the size of intermediate relations
  - that are produced during the execution and
  - must be transmitted over the network, if a subsequent operation is located on a different site
- It is costly to compute the size of the intermediate relations precisely.
- Instead, global statistics of relations and fragments are computed and used to provide approximations

Database Statistics/2

- Let $R(A_1, A_2, \ldots, A_k)$ be a relation fragmented into $R_1, R_2, \ldots, R_r$.
- Relation statistics
  - min and max values of each attribute: $\min\{A_i\}$, $\max\{A_i\}$
  - length of each attribute: $length(A_i)$
  - number of distinct values in each domain: $card(dom(A_i))$
- Fragment statistics
  - cardinality of the fragment: $card(R_i)$
  - cardinality of each attribute of each fragment: $card(\Pi_{A_i}(R_j))$, $card(A_i)$

Database Statistics/3

- Selectivity factor of an operation: the proportion of tuples of an operand relation that participate in the result of that operation
- Assumption: independent attributes and uniform distribution of attribute values
- Selectivity factor of selection

  $$SF_\sigma(A = value) = \frac{1}{card(\Pi_A(R))}$$

  $$SF_\sigma(A > value) = \frac{\max(A) - value}{\max(A) - \min(A)}$$

  $$SF_\sigma(A < value) = \frac{value - \min(A)}{\max(A) - \min(A)}$$

Database Statistics/4

- Properties of the selectivity factor of the selection

  $$SF_\sigma(p(A_i) \wedge p(A_j)) = SF_\sigma(p(A_i)) \cdot SF_\sigma(p(A_j))$$

  $$SF_\sigma(p(A_i) \vee p(A_j)) = SF_\sigma(p(A_i)) + SF_\sigma(p(A_j)) - SF_\sigma(p(A_i)) \cdot SF_\sigma(p(A_j))$$

  $$SF_\sigma(A \in \{values\}) = SF_\sigma(A = value) \cdot card(\{values\})$$
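The selectivity formulas for selections translate directly into small helper functions. This is a minimal sketch under the uniform-distribution and independence assumptions stated in the slides; the function names are made up for illustration.

```python
def sf_equals(distinct_values):
    """SF(A = value) = 1 / card(Pi_A(R))."""
    return 1 / distinct_values


def sf_greater(value, min_a, max_a):
    """SF(A > value) = (max(A) - value) / (max(A) - min(A))."""
    return (max_a - value) / (max_a - min_a)


def sf_less(value, min_a, max_a):
    """SF(A < value) = (value - min(A)) / (max(A) - min(A))."""
    return (value - min_a) / (max_a - min_a)


def sf_and(sf_i, sf_j):
    """Conjunction of predicates on independent attributes."""
    return sf_i * sf_j


def sf_or(sf_i, sf_j):
    """Disjunction: add both factors, subtract the overlap."""
    return sf_i + sf_j - sf_i * sf_j


def sf_in(distinct_values, n_values):
    """SF(A in {values}) = SF(A = value) * card({values})."""
    return sf_equals(distinct_values) * n_values
```

For example, with an attribute ranging over 0..100, `sf_greater(75, 0, 100)` estimates that a quarter of the tuples satisfy `A > 75`.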

Database Statistics/5

- Cardinality of intermediate results
  - Selection

    $$card(\sigma_P(R)) = SF_\sigma(P) \cdot card(R)$$

  - Projection
    - More difficult: correlations between projected attributes are unknown
    - Simple if the projected attribute is a key

      $$card(\Pi_A(R)) = card(R)$$

  - Cartesian Product

    $$card(R \times S) = card(R) \cdot card(S)$$

  - Union
    - upper bound: $card(R \cup S) \le card(R) + card(S)$
    - lower bound: $card(R \cup S) \ge \max\{card(R), card(S)\}$
  - Set Difference
    - upper bound: $card(R - S) = card(R)$
    - lower bound: 0

Database Statistics/6

- Selectivity factor for joins

  $$SF_\bowtie = \frac{card(R \bowtie S)}{card(R) \cdot card(S)}$$

- Cardinality of joins
  - Upper bound: cardinality of Cartesian Product

    $$card(R \bowtie S) \le card(R) \cdot card(S)$$

  - General case (if SF is given):

    $$card(R \bowtie S) = SF_\bowtie \cdot card(R) \cdot card(S)$$

  - Special case: R.A is a key of R and S.A is a foreign key of S; each S-tuple matches with at most one tuple of R

    $$card(R \bowtie_{R.A = S.A} S) = card(S)$$

Database Statistics/7

- Selectivity factor for semijoins: fraction of R-tuples that join with S-tuples
  - An approximation is the selectivity of A in S

    $$SF_\ltimes(R \ltimes_A S) = SF_\ltimes(S.A) = \frac{card(\Pi_A(S))}{card(dom[A])}$$

- Cardinality of semijoin (general case):

  $$card(R \ltimes_A S) = SF_\ltimes(S.A) \cdot card(R)$$

- Example: R.A is a foreign key referencing S (S.A is a primary key)
  - Then SF = 1 and the result size corresponds to the size of R
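The cardinality estimates for intermediate results can be collected into a handful of one-line functions. The sketch below simply mirrors the formulas from the statistics slides; all function names are illustrative assumptions.

```python
def card_selection(sf, card_r):
    """card(sigma_P(R)) = SF(P) * card(R)."""
    return sf * card_r


def card_cartesian(card_r, card_s):
    """card(R x S) = card(R) * card(S)."""
    return card_r * card_s


def card_union_bounds(card_r, card_s):
    """(lower bound, upper bound) for card(R union S)."""
    return max(card_r, card_s), card_r + card_s


def card_join(sf_join, card_r, card_s):
    """General case: card(R join S) = SF_join * card(R) * card(S)."""
    return sf_join * card_r * card_s


def sf_semijoin(distinct_a_in_s, domain_size):
    """Approximation: SF(S.A) = card(Pi_A(S)) / card(dom[A])."""
    return distinct_a_in_s / domain_size


def card_semijoin(sf, card_r):
    """card(R semijoin_A S) = SF(S.A) * card(R)."""
    return sf * card_r
```

A usage example with made-up statistics: if S holds 50 of the 200 possible values of A, then `card_semijoin(sf_semijoin(50, 200), 1000)` estimates that 250 of the 1000 R-tuples survive the semijoin.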

Join Ordering in Fragment Queries/1

- Join ordering is an important aspect in centralized DBMS, and it is even more important in a DDBMS, since joins between fragments that are stored at different sites may increase the communication time.
- Two approaches exist:
  - Optimize the ordering of joins directly
    - INGRES and distributed INGRES
    - System R and System R*
  - Replace joins by combinations of semijoins in order to minimize the communication costs
    - Hill Climbing and SDD-1

Join Ordering in Fragment Queries/2

- Direct join ordering of two relations/fragments located at different sites
  - Move the smaller relation to the other site
  - We have to estimate the size of R and S

Join Ordering in Fragment Queries/3

- Direct join ordering of queries involving more than two relations is substantially more complex
- Example: Consider the following query and the respective join graph, where we also make assumptions about the locations of the three relations/fragments

  $$PROJ \bowtie_{PNO} ASG \bowtie_{ENO} EMP$$

  [Figure: join graph of the query with the site of each relation]

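Join Ordering in Fragment Queries/4

- Example (contd.): The query can be evaluated in at least 5 different ways.
  - Plan 1:
    - $EMP \rightarrow$ Site 2
    - Site 2: $EMP' = EMP \bowtie ASG$
    - $EMP' \rightarrow$ Site 3
    - Site 3: $EMP' \bowtie PROJ$
  - Plan 2:
    - $ASG \rightarrow$ Site 1
    - Site 1: $EMP' = EMP \bowtie ASG$
    - $EMP' \rightarrow$ Site 3
    - Site 3: $EMP' \bowtie PROJ$
  - Plan 3:
    - $ASG \rightarrow$ Site 3
    - Site 3: $ASG' = ASG \bowtie PROJ$
    - $ASG' \rightarrow$ Site 1
    - Site 1: $ASG' \bowtie EMP$
  - Plan 4:
    - $PROJ \rightarrow$ Site 2
    - Site 2: $PROJ' = PROJ \bowtie ASG$
    - $PROJ' \rightarrow$ Site 1
    - Site 1: $PROJ' \bowtie EMP$
  - Plan 5:
    - $EMP \rightarrow$ Site 2
    - $PROJ \rightarrow$ Site 2
    - Site 2: $EMP \bowtie PROJ \bowtie ASG$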

Join Ordering in Fragment Queries/5

- To select a plan, a lot of information is needed, including
  - $size(EMP)$, $size(ASG)$, $size(PROJ)$
  - $size(EMP \bowtie ASG)$, $size(ASG \bowtie PROJ)$
  - Possibilities of parallel execution if response time is used
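To make the comparison of the five plans concrete, the following sketch sums the amount of data each plan ships between sites, given the sizes listed above. It considers total communication volume only (no local processing, no parallelism, no shipping of the final result), and the sizes in the usage example are hypothetical numbers chosen purely for illustration.

```python
def plan_costs(size):
    """Units shipped by each of the five plans.

    `size` maps a relation or join result to its size; the keys are
    illustrative names, not part of the original slides.
    """
    return {
        "Plan 1": size["EMP"] + size["EMP JOIN ASG"],
        "Plan 2": size["ASG"] + size["EMP JOIN ASG"],
        "Plan 3": size["ASG"] + size["ASG JOIN PROJ"],
        "Plan 4": size["PROJ"] + size["ASG JOIN PROJ"],
        "Plan 5": size["EMP"] + size["PROJ"],
    }


if __name__ == "__main__":
    # Hypothetical sizes, for illustration only.
    sizes = {"EMP": 400, "ASG": 1000, "PROJ": 50,
             "EMP JOIN ASG": 1000, "ASG JOIN PROJ": 1000}
    costs = plan_costs(sizes)
    print(min(costs, key=costs.get), costs)
```

With these particular numbers Plan 5 ships the least data; different size estimates can make any of the other plans preferable, which is why the statistics matter.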

Semijoin Based Algorithms/1

- Semijoins can be used to efficiently implement joins
  - The semijoin acts as a size reducer (similar to a selection) such that smaller relations need to be transferred
- Consider two relations: R located at site 1 and S located at site 2
  - Solution with semijoins: Replace one or both operand relations/fragments by a semijoin, using the following rules:

    $$R \bowtie_A S \iff (R \ltimes_A S) \bowtie_A S$$

    $$\iff R \bowtie_A (S \ltimes_A R)$$

    $$\iff (R \ltimes_A S) \bowtie_A (S \ltimes_A R)$$

- The semijoin is beneficial if the cost to produce and send it to the other site is less than the cost of sending the whole operand relation and doing the actual join.

Semijoin Based Algorithms/2

- Cost analysis $R \bowtie_A S$ vs. $(R \ltimes_A S) \bowtie S$, assuming that $size(R) < size(S)$
  - Perform the join $R \bowtie S$:
    - $R \rightarrow$ Site 2
    - Site 2 computes $R \bowtie S$
  - Perform the semijoins $(R \ltimes S) \bowtie S$:
    - $S' = \Pi_A(S)$
    - $S' \rightarrow$ Site 1
    - Site 1 computes $R' = R \ltimes S'$
    - $R' \rightarrow$ Site 2
    - Site 2 computes $R' \bowtie S$
- Semijoin is better if: $size(\Pi_A(S)) + size(R \ltimes S) < size(R)$
- The semijoin approach is better if the semijoin acts as a sufficient reducer (i.e., a few tuples of R participate in the join)
- The join approach is better if almost all tuples of R participate in the join
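The decision rule from this cost analysis fits in a few lines. The sketch below compares only the amount of data shipped, as the slide does; the function names and the sample sizes are assumptions for illustration.

```python
def join_transfer(size_r):
    """Plain join: ship all of R to the site of S."""
    return size_r


def semijoin_transfer(size_proj_a_s, size_r_semijoin_s):
    """Semijoin: ship Pi_A(S) to site 1, then ship the reduced R back to site 2."""
    return size_proj_a_s + size_r_semijoin_s


def semijoin_is_better(size_r, size_proj_a_s, size_r_semijoin_s):
    """True if size(Pi_A(S)) + size(R semijoin S) < size(R)."""
    return semijoin_transfer(size_proj_a_s, size_r_semijoin_s) < join_transfer(size_r)


if __name__ == "__main__":
    print(semijoin_is_better(1000, 100, 200))  # strong reducer
    print(semijoin_is_better(1000, 100, 950))  # almost all R-tuples participate
```

In the first call only a fifth of R survives the semijoin, so the extra round trip pays off; in the second call nearly all of R must be shipped anyway and the plain join is cheaper.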

INGRES Algorithm/1

- INGRES uses a dynamic query optimization algorithm that recursively breaks a query into smaller pieces. It is based on the following ideas:
  - An n-relation query q is decomposed into n subqueries $q_1 \rightarrow q_2 \rightarrow \cdots \rightarrow q_n$
    - Each $q_i$ is a mono-relation (mono-variable) query
    - The output of $q_i$ is consumed by $q_{i+1}$
  - For the decomposition two basic techniques are used: detachment and substitution
  - There is a processor that can efficiently process mono-relation queries
    - Optimizes each query independently for the access to a single relation

INGRES Algorithm/2

- Detachment: Break a query q into $q' \rightarrow q''$, based on a common relation that is the result of $q'$, i.e.
  - The query

      q:   SELECT R2.A2, ..., Rn.An
           FROM   R1, R2, ..., Rn
           WHERE  P1(R1.A1')
           AND    P2(R1.A1, ..., Rn.An)

    is decomposed by detachment of the common relation R1 into

      q':  SELECT R1.A1
           INTO   R1'
           FROM   R1
           WHERE  P1(R1.A1')

      q'': SELECT R2.A2, ..., Rn.An
           FROM   R1', R2, ..., Rn
           WHERE  P2(R1'.A1, ..., Rn.An)

- Detachment reduces the size of the relation on which the query q'' is defined.

INGRES Algorithm/3

- Example: Consider query q1: "Names of employees working on the CAD/CAM project"

      q1:  SELECT EMP.ENAME
           FROM   EMP, ASG, PROJ
           WHERE  EMP.ENO = ASG.ENO
           AND    ASG.PNO = PROJ.PNO
           AND    PROJ.PNAME = "CAD/CAM"

- Decompose q1 into $q_{11} \rightarrow q'$:

      q11: SELECT PROJ.PNO
           INTO   JVAR
           FROM   PROJ
           WHERE  PROJ.PNAME = "CAD/CAM"

      q':  SELECT EMP.ENAME
           FROM   EMP, ASG, JVAR
           WHERE  EMP.ENO = ASG.ENO
           AND    ASG.PNO = JVAR.PNO

INGRES Algorithm/4

- Example (contd.): The successive detachments may transform q' into $q_{12} \rightarrow q_{13}$:

      q':  SELECT EMP.ENAME
           FROM   EMP, ASG, JVAR
           WHERE  EMP.ENO = ASG.ENO
           AND    ASG.PNO = JVAR.PNO

      q12: SELECT ASG.ENO
           INTO   GVAR
           FROM   ASG, JVAR
           WHERE  ASG.PNO = JVAR.PNO

      q13: SELECT EMP.ENAME
           FROM   EMP, GVAR
           WHERE  EMP.ENO = GVAR.ENO

- q1 is now decomposed by detachment into $q_{11} \rightarrow q_{12} \rightarrow q_{13}$
  - q11 is a mono-relation query
  - q12 and q13 are multi-relation queries, which cannot be further detached; also called irreducible

INGRES Algorithm/5

- Tuple substitution converts an irreducible query q into mono-relation queries.
  - Choose a relation R1 in q for tuple substitution
  - For each tuple in R1, replace the R1-attributes referred to in q by their actual values, thereby generating a set of subqueries q' with n - 1 relations, i.e.,
    $q(R_1, R_2, \ldots, R_n)$ is replaced by $\{q'(t_{1i}, R_2, \ldots, R_n),\ t_{1i} \in R_1\}$
- Example (contd.): Assume GVAR consists only of the tuples {E1, E2}. Then q13 is rewritten with tuple substitution in the following way

      q13:  SELECT EMP.ENAME
            FROM   EMP, GVAR
            WHERE  EMP.ENO = GVAR.ENO

      q131: SELECT EMP.ENAME
            FROM   EMP
            WHERE  EMP.ENO = "E1"

INGRES Algorithm/6

- Example (contd.):

      q132: SELECT EMP.ENAME
            FROM   EMP
            WHERE  EMP.ENO = "E2"

- q131 and q132 are mono-relation queries
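Tuple substitution for q13 can be mimicked with a tiny Python sketch that produces one mono-relation query per tuple of GVAR. This is only an illustration of the idea, not how INGRES implements it: the template string and function name are made up, and plain string formatting like this would not be safe for untrusted values in a real system.

```python
# Hypothetical template for q13 after GVAR.ENO has been replaced by a constant.
Q13_TEMPLATE = 'SELECT EMP.ENAME FROM EMP WHERE EMP.ENO = "{value}"'


def tuple_substitution(template, values):
    """Return one mono-relation query per value of the substituted relation."""
    return [template.format(value=v) for v in values]


if __name__ == "__main__":
    # GVAR = {E1, E2} as assumed in the example.
    for query in tuple_substitution(Q13_TEMPLATE, ["E1", "E2"]):
        print(query)
```

The two generated strings correspond to q131 and q132; each one references only EMP and can be handed to the mono-relation query processor.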

Distributed INGRES Algorithm

- The distributed INGRES query optimization algorithm is very similar to the centralized INGRES algorithm.
  - In addition to the centralized INGRES, the distributed one should break up each query $q_i$ into sub-queries that operate on fragments; only horizontal fragmentation is handled.
  - Optimization with respect to a combination of communication cost and response time

System R Algorithm/1

- The System R (centralized) query optimization algorithm
  - Performs static query optimization based on exhaustive search of the solution space and a cost function (I/O cost + CPU cost)
    - Input: relational algebra tree
    - Output: optimal relational algebra tree
    - Dynamic programming technique is applied to reduce the number of alternative plans
  - The optimization algorithm consists of two steps
    1. Predict the best access method to each individual relation (mono-relation query)
       - Consider using index, file scan, etc.
    2. For each relation R, estimate the best join ordering
       - R is first accessed using its best single-relation access method
       - Efficient access to inner relation is crucial
  - Considers two different join strategies
    - (Indexed-) nested loop join
    - Sort-merge join

System R Algorithm/2

- Example: Consider query q1: "Names of employees working on the CAD/CAM project"

  $$PROJ \bowtie_{PNO} ASG \bowtie_{ENO} EMP$$

  - Join graph

    [Figure: join graph of the query]

  - Indexes
    - EMP has an index on ENO
    - ASG has an index on PNO
    - PROJ has an index on PNO and an index on PNAME

System R Algorithm/3

- Example (contd.): Step 1: Select the best single-relation access paths
  - EMP: sequential scan (because there is no selection on EMP)
  - ASG: sequential scan (because there is no selection on ASG)
  - PROJ: index on PNAME (because there is a selection on PROJ based on PNAME)

System R Algorithm/4

- Example (contd.): Step 2: Select the best join ordering for each relation
  - $(EMP \times PROJ)$ and $(PROJ \times EMP)$ are pruned because they are Cartesian products (CPs)
  - $(ASG \bowtie PROJ)$ is pruned because (we assume) it has a higher cost than $(PROJ \bowtie ASG)$; similar for $(ASG \bowtie EMP)$
  - Best total join order is $((PROJ \bowtie ASG) \bowtie EMP)$, since it makes the best use of the indexes
    - Select PROJ using index on PNAME
    - Join with ASG using index on PNO
    - Join with EMP using index on ENO
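The dynamic-programming idea behind the join-ordering step can be sketched as follows. This is a simplified illustration, not System R's actual algorithm: it only builds left-deep orders, prunes Cartesian products, and uses the sum of estimated intermediate result sizes as a stand-in for the I/O + CPU cost function. The cardinalities and selectivities in the usage example are hypothetical (PROJ is given cardinality 1 to model the selection on PNAME).

```python
from itertools import combinations


def best_left_deep_order(card, sel):
    """Cheapest left-deep join order under a toy cost model.

    card: {relation: cardinality}
    sel:  {frozenset({a, b}): join selectivity} for each join predicate
    Returns (cost, result_size, order), or None if the join graph is disconnected.
    """
    rels = sorted(card)
    # best[S] holds the cheapest way found so far to join the set S.
    best = {frozenset([r]): (0.0, float(card[r]), (r,)) for r in rels}
    for k in range(2, len(rels) + 1):
        for subset in combinations(rels, k):
            s = frozenset(subset)
            for r in subset:  # r is the relation joined last
                rest = s - {r}
                if rest not in best:
                    continue
                preds = [sel[frozenset((r, o))] for o in rest if frozenset((r, o)) in sel]
                if not preds:
                    continue  # no join predicate: Cartesian product, pruned
                cost_rest, size_rest, order = best[rest]
                size = size_rest * card[r]
                for p in preds:
                    size *= p
                cost = cost_rest + size
                if s not in best or cost < best[s][0]:
                    best[s] = (cost, size, order + (r,))
    return best.get(frozenset(rels))


if __name__ == "__main__":
    card = {"EMP": 400, "ASG": 1000, "PROJ": 1}
    sel = {frozenset(("EMP", "ASG")): 1 / 400, frozenset(("ASG", "PROJ")): 1 / 100}
    print(best_left_deep_order(card, sel))
```

Under these assumed statistics the sketch picks the order PROJ, ASG, EMP, which agrees with the join order chosen in the example; only the best plan per subset of relations is kept, which is what keeps the number of alternatives manageable.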

Distributed System R Algorithm/1

- The System R* query optimization algorithm is an extension of the System R query optimization algorithm with the following main characteristics:
  - Only whole relations can be distributed, i.e., fragmentation and replication are not considered
  - Query compilation is a distributed task, coordinated by a master site, where the query is initiated
  - Master site makes all inter-site decisions, e.g., selection of the execution sites, join ordering, method of data transfer, ...
  - The local sites do the intra-site (local) optimizations, e.g., local joins, access paths
- Join ordering and data transfer between different sites are the most critical issues to be considered by the master site

Distributed System R Algorithm/2

- Two methods for inter-site data transfer
  - Ship whole: The entire relation is shipped to the join site and stored in a temporary relation
    - Larger data transfer
    - Smaller number of messages
    - Better if relations are small
  - Fetch as needed: The outer relation is sequentially scanned, and for each tuple the join value is sent to the site of the inner relation and the matching inner tuples are sent back (i.e., semijoin)
    - Number of messages = O(cardinality of outer relation)
    - Data transfer per message is minimal
    - Better if relations are large and the selectivity is good

Distributed System R Algorithm/3

- Four main join strategies for $R \bowtie S$:
  - R is outer relation
  - S is inner relation
- Notation:
  - LT denotes local processing time
  - CT denotes communication time
  - s denotes the average number of S-tuples that match an R-tuple
- Strategy 1: Ship the entire outer relation to the site of the inner relation, i.e.,
  - Retrieve outer tuples
  - Send them to the inner relation site
  - Join them as they arrive

  $$\text{Total cost} = LT(\text{retrieve } card(R) \text{ tuples from } R) + CT(size(R)) + LT(\text{retrieve } s \text{ tuples from } S) \cdot card(R)$$

Distributed System R Algorithm/4

- Strategy 2: Ship the entire inner relation to the site of the outer relation. We cannot join the tuples as they arrive; they need to be stored.
  - The inner relation S needs to be stored in a temporary relation T

  $$\text{Total cost} = LT(\text{retrieve } card(S) \text{ tuples from } S) + CT(size(S)) + LT(\text{store } card(S) \text{ tuples in } T) + LT(\text{retrieve } card(R) \text{ tuples from } R) + LT(\text{retrieve } s \text{ tuples from } T) \cdot card(R)$$

Distributed System R Algorithm/5

- Strategy 3: Fetch tuples of the inner relation as needed for each tuple of the outer relation.
  - For each R-tuple, the join attribute A is sent to the site of S
  - The s matching S-tuples are retrieved and sent to the site of R

  $$\text{Total cost} = LT(\text{retrieve } card(R) \text{ tuples from } R) + CT(length(A)) \cdot card(R) + LT(\text{retrieve } s \text{ tuples from } S) \cdot card(R) + CT(s \cdot length(S)) \cdot card(R)$$


Distributed System R Algorithm/6

- Strategy 4: Move both relations to a third site and compute the join there.
  - The inner relation S is first moved to a third site and stored in a temporary relation T.
  - Then the outer relation is moved to the third site and its tuples are joined as they arrive.

  $$\text{Total cost} = LT(\text{retrieve } card(S) \text{ tuples from } S) + CT(size(S)) + LT(\text{store } card(S) \text{ tuples in } T) + LT(\text{retrieve } card(R) \text{ tuples from } R) + CT(size(R)) + LT(\text{retrieve } s \text{ tuples from } T) \cdot card(R)$$
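The four total-cost formulas can be compared side by side with a small sketch. The slides leave LT and CT abstract, so the linear models below are assumptions made only for this illustration: local processing costs a fixed amount per tuple, and each communication costs a fixed message overhead plus a per-byte charge. All names and default constants are hypothetical.

```python
def lt(n_tuples, lt_per_tuple=1):
    """Assumed local processing time: linear in the number of tuples touched."""
    return lt_per_tuple * n_tuples


def ct(n_bytes, c_msg=5, c_tr=1):
    """Assumed communication time: one message overhead plus per-byte cost."""
    return c_msg + c_tr * n_bytes


def strategy_costs(card_r, card_s, len_r, len_s, len_a, s):
    """Total cost of the four join strategies for outer R and inner S.

    len_r, len_s: tuple lengths in bytes; len_a: length of the join attribute;
    s: average number of S-tuples matching one R-tuple.
    """
    size_r, size_s = card_r * len_r, card_s * len_s
    probe = lt(s) * card_r  # fetch the s matching inner tuples for every outer tuple
    return {
        1: lt(card_r) + ct(size_r) + probe,
        2: lt(card_s) + ct(size_s) + lt(card_s) + lt(card_r) + probe,
        3: lt(card_r) + ct(len_a) * card_r + probe + ct(s * len_s) * card_r,
        4: lt(card_s) + ct(size_s) + lt(card_s) + lt(card_r) + ct(size_r) + probe,
    }


if __name__ == "__main__":
    print(strategy_costs(card_r=100, card_s=1000, len_r=10, len_s=20, len_a=4, s=2))
```

With a small outer relation and a large inner relation, as in the usage example, shipping the outer relation (strategy 1) comes out cheapest, while fetch-as-needed (strategy 3) pays the message overhead once per outer tuple in each direction.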

Hill-Climbing Algorithm/1

- Hill-Climbing query optimization algorithm
  - Refinements of an initial feasible solution are recursively computed until no more cost improvements can be made
  - Semijoins, data replication, and fragmentation are not used
  - Devised for wide area point-to-point networks
  - The first distributed query processing algorithm

Hill-Climbing Algorithm/2

- The hill-climbing algorithm proceeds as follows
  1. Select initial feasible execution strategy ES0
     - i.e., a global execution schedule that includes all intersite communication
     - Determine the candidate result sites, where a relation referenced in the query exists
     - Compute the cost of transferring all the other referenced relations to each candidate site
     - ES0 = candidate site with minimum cost
  2. Split ES0 into two strategies: ES1 followed by ES2
     - ES1: send one of the relations involved in the join to the other relation's site
     - ES2: send the join result to the final result site
  3. Replace ES0 with the split schedule which gives
     cost(ES1) + cost(local join) + cost(ES2) < cost(ES0)
  4. Recursively apply steps 2 and 3 on ES1 and ES2 until no more benefit can be gained
  5. Check for redundant transmissions in the final plan and eliminate them

Hill-Climbing Algorithm/3

- Example: What are the salaries of engineers who work on the CAD/CAM project?

  $$\Pi_{SAL}(PAY \bowtie_{TITLE} EMP \bowtie_{ENO} (ASG \bowtie_{PNO} (\sigma_{PNAME = \text{"CAD/CAM"}}(PROJ))))$$

- Schemas: EMP(ENO, ENAME, TITLE), ASG(ENO, PNO, RESP, DUR), PROJ(PNO, PNAME, BUDGET, LOC), PAY(TITLE, SAL)
- Statistics

      Relation  Size  Site
      EMP       8     1
      PAY       4     2
      PROJ      1     3
      ASG       10    4

- Assumptions:
  - Size of relations is defined as their cardinality
  - Minimize total cost
  - Transmission cost between two sites is 1
  - Ignore local processing cost
  - $size(EMP \bowtie PAY) = 8$, $size(PROJ \bowtie ASG) = 2$, $size(ASG \bowtie EMP) = 10$

Hill-Climbing Algorithm/4

- Example (contd.): Determine initial feasible execution strategy
  - Alternative 1: Resulting site is site 1
    Total cost = cost(PAY $\rightarrow$ Site 1) + cost(ASG $\rightarrow$ Site 1) + cost(PROJ $\rightarrow$ Site 1) = 4 + 10 + 1 = 15
  - Alternative 2: Resulting site is site 2
    Total cost = 8 + 10 + 1 = 19
  - Alternative 3: Resulting site is site 3
    Total cost = 8 + 4 + 10 = 22
  - Alternative 4: Resulting site is site 4
    Total cost = 8 + 4 + 1 = 13
  - Therefore ES0 = {EMP $\rightarrow$ Site 4; PAY $\rightarrow$ Site 4; PROJ $\rightarrow$ Site 4}
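The choice of the initial feasible strategy in this example can be reproduced with a short sketch. It uses the sizes and sites from the statistics table and the stated assumptions (unit transmission cost, no local processing cost); the function and variable names are illustrative.

```python
# Statistics from the example: relation sizes and the site holding each relation.
SIZE = {"EMP": 8, "PAY": 4, "PROJ": 1, "ASG": 10}
SITE = {"EMP": 1, "PAY": 2, "PROJ": 3, "ASG": 4}


def initial_strategy(size, site):
    """Return (best result site, cost per candidate site).

    The cost of a candidate site is the total size of all relations
    that are stored elsewhere and must be shipped to it.
    """
    costs = {}
    for candidate in sorted(set(site.values())):
        costs[candidate] = sum(size[r] for r in size if site[r] != candidate)
    best = min(costs, key=costs.get)
    return best, costs


if __name__ == "__main__":
    best, costs = initial_strategy(SIZE, SITE)
    print(best, costs)
```

The computed costs are 15, 19, 22, and 13 for sites 1 to 4, so site 4 is selected as the result site of ES0.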

Hill-Climbing Algorithm/5

- Example (contd.): Candidate split
  - Alternative 1: {ES1, ES2, ES3}
    - ES1: EMP $\rightarrow$ Site 2
    - ES2: $(EMP \bowtie PAY) \rightarrow$ Site 4
    - ES3: PROJ $\rightarrow$ Site 4

    Total cost = cost(EMP $\rightarrow$ Site 2) + cost($(EMP \bowtie PAY) \rightarrow$ Site 4) + cost(PROJ $\rightarrow$ Site 4) = 8 + 8 + 1 = 17
  - Alternative 2: {ES1, ES2, ES3}
    - ES1: PAY $\rightarrow$ Site 1
    - ES2: $(PAY \bowtie EMP) \rightarrow$ Site 4
    - ES3: PROJ $\rightarrow$ Site 4

    Total cost = cost(PAY $\rightarrow$ Site 1) + cost($(PAY \bowtie EMP) \rightarrow$ Site 4) + cost(PROJ $\rightarrow$ Site 4) = 4 + 8 + 1 = 13
- Neither alternative is better than ES0, so keep ES0 (or take alternative 2, which has the same cost)

Hill-Climbing Algorithm/6

- Problems
  - Greedy algorithm determines an initial feasible solution and iteratively improves it
  - If there are local minima, it may not find the global minimum
  - An optimal schedule with a high initial cost would not be found, since it won't be chosen as the initial feasible solution
- Example: A better schedule is
  - PROJ $\rightarrow$ Site 4
  - $ASG' = (PROJ \bowtie ASG) \rightarrow$ Site 1
  - $(ASG' \bowtie EMP) \rightarrow$ Site 2
  - Total cost = 1 + 2 + 2 = 5
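Under the example's assumptions (unit transmission cost, no local cost), the cost of a schedule is just the sum of the sizes it ships. The sketch below compares ES0 with the better schedule; the tuple layout `(what, destination site, size)` is an illustrative choice, and the sizes are the ones given in the example.

```python
def schedule_cost(transfers):
    """Sum the sizes of all shipped relations (unit transmission cost, no local cost)."""
    return sum(size for _what, _to_site, size in transfers)


# ES0: ship everything to site 4.
ES0 = [("EMP", 4, 8), ("PAY", 4, 4), ("PROJ", 4, 1)]

# The better schedule from the example, with the shipped sizes it states.
BETTER = [("PROJ", 4, 1), ("PROJ JOIN ASG", 1, 2), ("ASG' JOIN EMP", 2, 2)]

if __name__ == "__main__":
    print(schedule_cost(ES0), schedule_cost(BETTER))
```

The better schedule costs 5 instead of 13, yet hill climbing never reaches it because it is not a refinement of the initial solution ES0.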

SDD-1

- The SDD-1 algorithm extends the hill climbing algorithm with semijoins and has the following properties:
  - Considers semijoins
    - $cost(R \ltimes_A S) = C_{MSG} + size(\Pi_A(S)) \cdot C_{TR}$
    - $benefit(R \ltimes_A S) = (1 - SF_\ltimes(S.A)) \cdot size(R) \cdot C_{TR}$
  - Does not consider replication and fragmentation
  - Cost of transferring the result to the user site from the final result site is not considered
  - Can minimize either total time or response time
- The SDD-1 algorithm works with and updates a database profile:

      R    size(R)
      R1   1500
      R2   3000
      R3   2000

      A      SF(semijoin)   size(Pi_A)
      R1.A   0.3            36
      R2.A   0.8            320
      R2.B   1.0            400
      R3.B   0.4            80
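The cost and benefit formulas can be applied to the database profile above to see which semijoins are worth considering. Several things in this sketch are assumptions rather than part of the slides: C_MSG = 0 and C_TR = 1 are chosen for simplicity, and the candidate semijoins (R1 and R2 joined on A, R2 and R3 joined on B) are inferred from the attribute names in the profile.

```python
C_MSG, C_TR = 0, 1  # assumed cost constants, not given in the slides

SIZE = {"R1": 1500, "R2": 3000, "R3": 2000}
# attribute -> (semijoin selectivity factor, size of the projection on that attribute)
PROFILE = {"R1.A": (0.3, 36), "R2.A": (0.8, 320), "R2.B": (1.0, 400), "R3.B": (0.4, 80)}
# (reduced relation R, reducing relation S, join attribute); assumed join graph
CANDIDATES = [("R2", "R1", "A"), ("R1", "R2", "A"), ("R2", "R3", "B"), ("R3", "R2", "B")]


def cost(s, attr):
    """cost(R semijoin_A S) = C_MSG + size(Pi_A(S)) * C_TR."""
    return C_MSG + PROFILE[f"{s}.{attr}"][1] * C_TR


def benefit(r, s, attr):
    """benefit(R semijoin_A S) = (1 - SF(S.A)) * size(R) * C_TR."""
    return (1 - PROFILE[f"{s}.{attr}"][0]) * SIZE[r] * C_TR


def beneficial_semijoins():
    """Semijoins with cost < benefit, most profitable first."""
    found = [c for c in CANDIDATES if cost(c[1], c[2]) < benefit(*c)]
    return sorted(found, key=lambda c: benefit(*c) - cost(c[1], c[2]), reverse=True)


if __name__ == "__main__":
    for r, s, a in beneficial_semijoins():
        print(f"{r} semijoin_{a} {s}: cost={cost(s, a)}, benefit={benefit(r, s, a):.0f}")
```

Under these assumptions only the two semijoins that reduce R2 are beneficial, and reducing R2 by R1 is the more profitable of the two, so it would be the first one appended to the execution strategy.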

SDD-1 Algorithm

Step 1 Include all local processing in the execution strategy ES.
Step 2 Update database profile with effects of local processing.
Step 3 Determine the beneficial semijoins $\ltimes_i$, i.e., those with $cost(\ltimes_i) < benefit(\ltimes_i)$.
Step 4 Remove the most beneficial semijoin from the set of beneficial semijoins and append it to ES.
Step 5 Update the database profile.
Step 6 Update the set of beneficial semijoins; possibly include new ones.
Step 7 If there are beneficial semijoins go back to Step 4.
Step 8 Find the site where the largest amount of data resides and select it as the result site.
Step 9 For each $R_i$ at the result site, remove semijoins of the form $R_i \ltimes R_j$ where the total cost of ES without this semijoin is smaller than the cost with it.
Step 10 Permute the order of semijoins if doing so would improve the total cost of ES.

Conclusion

- Distributed query optimization is more complex than centralized query processing, since
  - bushy query trees are not necessarily a bad choice
  - one needs to decide what, where, and how to ship the relations between the sites
- Query optimization searches for the optimal query plan (tree)
- For N relations, there are $O(N!)$ equivalent join trees. To cope with the complexity, heuristics and/or restricted types of trees are considered.
- There are two main strategies in query optimization: randomized and deterministic.
- Semi-joins can be used to implement a join. The semi-joins require more operations to perform, but the amount of data transferred is reduced.
- INGRES, System R and Hill Climbing are distributed query optimization algorithms.

Course Project

- Hand in of project: December 23, 2012
- Report
  - problem definition
  - running example
  - description of solution
  - evaluation
  - strengths, weaknesses, limitations
- Report (5 pages) and implementation (source code, data, steps to install and run) as zip/tar file
- Send by email to boehlen@ifi.uzh.ch and cafagna@ifi.uzh.ch

Course Exam

- Exam date: 16.01.2013
- Exam time: 12:15 - 12:45
- Exam location: BIN 2.E.13
- Exam form and procedure
  - oral, 20 minutes
  - 10 minutes about project (demo, code, algorithm)
  - 10 minutes about a topic of the course
- During exam: present solutions on examples
- Prepare suitable examples beforehand
