Chapter 4 Query Optimization
Transformation Rules
What is Query Processing?
– The steps required to transform a high-level SQL query into a correct
and “efficient” strategy for execution and retrieval.
What is Query Optimization?
– The activity of choosing a single “efficient” execution strategy
(from hundreds) as determined by database catalog statistics.
– Which relational algebra expression, equivalent to the given
query, will lead to the most efficient solution plan?
– For each algebraic operator, what algorithm (of several
available) do we use to compute that operator?
– How do operations pass data (main memory buffer, disk
buffer,…)?
– Will this plan minimize resource usage? (CPU/Response
Time/Disk)
Example:
Identify all managers who work in a London branch
SELECT * FROM Staff s, Branch b
WHERE s.branchNo = b.branchNo AND s.position = 'Manager' AND
b.city = 'London';
Query 2 (Better)
σ_{(position='Manager') ∧ (city='London')} (Staff ⋈_{Staff.branchNo = Branch.branchNo} Branch)
– Again requires (1000+50) disk accesses to read Staff and Branch
– Joining Staff and Branch on branchNo yields 1000 tuples
(1 employee : 1 branch)
– Requires 1000 disk accesses to write the joined relation and 1000 more to read it back and check the predicate
Total Work = (1000+50) + 2*(1000) =
3050 I/O operations
3300% Improvement over Query 1
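The arithmetic for Query 2 can be checked with a few lines of Python. This is only a sketch: the function name join_then_select_cost is hypothetical, the figures (1000 Staff tuples, 50 Branch tuples, a 1000-tuple join result) are the ones quoted above, and it assumes, as the slide does, one disk access per tuple.

```python
def join_then_select_cost(n_staff, n_branch, n_joined):
    """Disk accesses for the plan: join both relations, then filter the result."""
    read_inputs = n_staff + n_branch      # read Staff and Branch once each
    write_and_reread = 2 * n_joined       # write the join result, then read it back
    return read_inputs + write_and_reread


query2_cost = join_then_select_cost(n_staff=1000, n_branch=50, n_joined=1000)
print(query2_cost)  # 3050
```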
Query 3 (Best)
[ σ_{position='Manager'} (Staff) ] ⋈_{Staff.branchNo = Branch.branchNo} [ σ_{city='London'} (Branch) ]
Query Processing Steps
Example: Consider relations r(AB) and s(CD). We require r × s (the Cartesian product).
Method 1:
a. Load the next record of r into RAM.
b. Load all records of s, one at a time, and concatenate each with the current r record.
c. All records of r processed?
NO: goto a.
YES: exit (the result is in RAM or on disk).
Performance: Too many accesses, because the whole of s is read once for every record of r.
Method 2: Improvement
a. Load as many blocks of r as possible, leaving room for one block of s.
b. Run through the s file completely, one block at a time.
Performance: Reduces the number of times s blocks are loaded by a factor
equal to the number of r records that can fit in main memory.
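Method 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: relations are modelled as lists of blocks, each block is a list of record tuples, and the function name cross_product_blocked and the buffer_blocks parameter are hypothetical. The counter records how many block loads each buffer size causes.

```python
def cross_product_blocked(r_blocks, s_blocks, buffer_blocks):
    """Cross product of r and s, keeping (buffer_blocks - 1) blocks of r in memory.

    Returns the concatenated records and the number of block loads performed.
    """
    chunk = buffer_blocks - 1             # one buffer block is reserved for s
    if chunk < 1:
        raise ValueError("need at least two buffer blocks")
    result, loads = [], 0
    for start in range(0, len(r_blocks), chunk):
        r_chunk = r_blocks[start:start + chunk]
        loads += len(r_chunk)             # load this group of r blocks
        for s_block in s_blocks:          # one full pass over s per group
            loads += 1
            for r_block in r_chunk:
                for r_rec in r_block:
                    for s_rec in s_block:
                        result.append(r_rec + s_rec)
    return result, loads


# Hypothetical data: 4 blocks of r and 3 blocks of s, two records per block.
r = [[(i, "a"), (i + 1, "b")] for i in range(0, 8, 2)]
s = [[(j, "c"), (j + 1, "d")] for j in range(0, 6, 2)]

pairs_small, loads_small = cross_product_blocked(r, s, buffer_blocks=2)
pairs_large, loads_large = cross_product_blocked(r, s, buffer_blocks=5)
print(loads_small, loads_large)
```

With room for only one block of r, s is scanned once per r block; with room for all four r blocks, s is scanned once in total, while the result is the same set of concatenated records.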
Considerations during Query Optimization:
– Narrow down intermediate result sets quickly: apply SELECT and
PROJECT before JOIN.
– Use access structures (indexes).
Approaches to Query Optimization : Heuristics and Cost Function
A. Heuristics Approach
• The heuristic approach uses knowledge of the characteristics of the relational
algebra operations and of the relationships between the operators to optimize the
query.
• Thus the heuristic approach of optimization will make use of:
– Properties of individual operators
– Association between operators
– Query Tree: a graphical representation of the operators, relations, attributes
and predicates and processing sequence during query processing.
• A query tree is composed of three main parts:
– The Leaves: the base relations used for processing the query/
extracting the required information
– The Root: the final result/relation produced as output by the
operations on the relations used for query processing
– Nodes: intermediate results or relations before reaching the final
result.
• Execution of the operations in a query tree starts from the
leaves, continues through the intermediate nodes, and ends at the root.
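The leaf-to-root execution order can be illustrated with a small tree structure. This is a minimal sketch: the TreeNode class and execution_order function are hypothetical names, and the tree encodes the Staff/Branch query from the earlier example with the join applied before the selection.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    label: str
    children: list = field(default_factory=list)


def execution_order(node):
    """Post-order walk: inputs (children) first, then the operator itself."""
    order = []
    for child in node.children:
        order.extend(execution_order(child))
    order.append(node.label)
    return order


# Leaves are base relations, the inner node is an intermediate result,
# and the root is the final result.
staff = TreeNode("Staff")
branch = TreeNode("Branch")
join = TreeNode("JOIN on branchNo", [staff, branch])
root = TreeNode("SELECT position='Manager' AND city='London'", [join])

order = execution_order(root)
print(order)
```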
A. Using Heuristics in Query Optimization
Process for heuristics optimization
1. The parser of a high-level query generates an initial internal
representation.
2. Heuristic rules are applied to optimize the internal representation.
3. A query execution plan is generated to execute groups of
operations based on the access paths available on the files
involved in the query.
The main heuristic is to apply first the operations that reduce the size
of intermediate results.
– E.g. Apply SELECT and PROJECT operations before applying
the JOIN or other binary operations.
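The effect of this heuristic on intermediate result size can be seen on toy data. The sketch below is illustrative only: the staff and branch rows are invented, the attribute name sBranch and the helper join are hypothetical, and relations are plain lists of dictionaries.

```python
def join(left, right):
    """Nested-loop equijoin on staff.sBranch = branch.branchNo."""
    return [{**l, **r} for l in left for r in right if l["sBranch"] == r["branchNo"]]


# Invented data: 100 staff spread over 5 branches, 5 of them managers.
staff = [
    {"staffNo": i,
     "position": "Manager" if i % 20 == 0 else "Assistant",
     "sBranch": (i // 20) % 5}
    for i in range(100)
]
branch = [{"branchNo": b, "city": "London" if b == 0 else "Glasgow"} for b in range(5)]

# Plan A: JOIN first, then SELECT on the large intermediate result.
join_first_intermediate = join(staff, branch)
plan_a = [t for t in join_first_intermediate
          if t["position"] == "Manager" and t["city"] == "London"]

# Plan B: SELECT on each input first, then JOIN the small results.
managers = [t for t in staff if t["position"] == "Manager"]
london = [t for t in branch if t["city"] == "London"]
plan_b = join(managers, london)

print(len(join_first_intermediate), len(managers), len(london), len(plan_b))
```

Both plans return the same answer, but Plan A materializes 100 joined tuples before filtering, whereas Plan B joins only 5 managers with 1 London branch.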
Query block: The basic unit that can be translated into the algebraic
operators and optimized.
A query block contains a single SELECT-FROM-WHERE expression, as
well as GROUP BY and HAVING clauses if these are part of the block.
Nested queries within a query are identified as separate query blocks.
• Query tree:
– A tree data structure that corresponds to a relational algebra expression. It
represents the input relations of the query as leaf nodes of the tree, and
represents the relational algebra operations as internal nodes.
• An execution of the query tree consists of executing an internal node operation
whenever its operands are available and then replacing that internal node by
the relation that results from executing the operation.
• Query graph:
– A graph data structure that corresponds to a relational calculus expression.
It does not indicate the order in which operations are performed. There is
only a single graph corresponding to each query.
Example:
– For every project located in ‘Stafford’, retrieve the project number, the controlling
department number and the department manager’s last name, address and birthdate.
Relational algebra:
– π_{PNUMBER, DNUM, LNAME, ADDRESS, BDATE} (((σ_{PLOCATION='STAFFORD'} (PROJECT))
⋈_{DNUM=DNUMBER} (DEPARTMENT)) ⋈_{MGRSSN=SSN} (EMPLOYEE))
SQL query:
Q2: SELECT P.PNUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE
FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E
WHERE P.DNUM=D.DNUMBER AND
D.MGRSSN=E.SSN AND
P.PLOCATION='STAFFORD';
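To see Q2 run end to end, the query can be tried against an in-memory SQLite database using Python's standard sqlite3 module. This is a sketch under assumptions: the table definitions are reduced to the columns Q2 needs, every row is invented for illustration, and the project-number column is taken to be PNUMBER, as in the relational algebra expression.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PROJECT (PNUMBER INTEGER, PLOCATION TEXT, DNUM INTEGER);
CREATE TABLE DEPARTMENT (DNUMBER INTEGER, MGRSSN TEXT);
CREATE TABLE EMPLOYEE (SSN TEXT, LNAME TEXT, ADDRESS TEXT, BDATE TEXT);
-- Invented sample rows, for illustration only.
INSERT INTO PROJECT VALUES (10, 'STAFFORD', 4), (20, 'HOUSTON', 1);
INSERT INTO DEPARTMENT VALUES (4, '111'), (1, '222');
INSERT INTO EMPLOYEE VALUES ('111', 'Lee', '1 Oak St', '1950-01-01'),
                            ('222', 'Kim', '2 Elm St', '1960-02-02');
""")

rows = conn.execute("""
SELECT P.PNUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE
FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E
WHERE P.DNUM = D.DNUMBER AND D.MGRSSN = E.SSN AND P.PLOCATION = 'STAFFORD'
""").fetchall()
conn.close()

print(rows)
```

Only the Stafford project survives the selection, and the two join conditions attach its controlling department's manager to it.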
• Heuristic Optimization of Query Trees:
– The same query could correspond to many different relational
algebra expressions — and hence many different query trees.
– The task of heuristic optimization of query trees is to find a final
query tree that is efficient to execute.
• Example:
Q: SELECT LNAME
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE PNAME = 'AQUARIUS' AND
PNUMBER=PNO AND ESSN=SSN
AND BDATE > '1957-12-31';
Summary of Heuristics for Algebraic Optimization:
3. The select and join operations that are most restrictive should
be executed before other similar operations. (This is done by
reordering the leaf nodes of the tree among themselves and
adjusting the rest of the tree appropriately.)
Using Selectivity and Cost Estimates in Query Optimization
B. Cost Estimation Approach to Query Optimization
• The main idea is to minimize the cost of processing a query. The cost
function comprises:
• I/O cost + CPU processing cost + communication cost + storage cost
• These components might have different weights in different
processing environments
• The DBMS will use information stored in the system catalogue for the
purpose of estimating cost.
• The main target of query optimization is to minimize the size of the
intermediate relations. Their size affects the cost of:
– Disk Access
– Data Transportation
– Storage space in the Primary Memory
– Writing on Disk
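The weighted cost function described above can be sketched as follows. Everything numeric here is an assumption made for illustration: the two candidate plans, their component costs, and the weight vectors are invented, and the function name estimated_cost is hypothetical.

```python
def estimated_cost(plan, weights):
    """Weighted sum: I/O + CPU + communication + storage."""
    return sum(weights[component] * plan[component] for component in weights)


def cheapest(plans, weights):
    """Name of the plan with the lowest estimated cost under the given weights."""
    return min(plans, key=lambda name: estimated_cost(plans[name], weights))


# Invented component costs (arbitrary units) for two candidate plans.
plans = {
    "A": {"io": 40, "cpu": 30, "communication": 60, "storage": 10},
    "B": {"io": 100, "cpu": 20, "communication": 5, "storage": 10},
}

# Invented weights for two processing environments.
disk_bound = {"io": 5.0, "cpu": 1.0, "communication": 1.0, "storage": 1.0}
network_bound = {"io": 1.0, "cpu": 1.0, "communication": 5.0, "storage": 1.0}

print(cheapest(plans, disk_bound), cheapest(plans, network_bound))
```

The same pair of plans is ranked differently once the weights change, which is why the components are weighted per environment rather than simply added.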
1. Access Cost of Secondary Storage
• A query needs some part of the data stored in the database, and that data
must be accessed from secondary storage. The disk access cost
can be broken down into the cost of:
– Searching
– Reading, and
– Writing the data blocks used to store some portion of a relation.
• Remark: The disk access cost will vary depending on
– The file organization used and the access method implemented for the
file organization.
– Whether the data is stored contiguously or scattered across
the disk.
2. Storage Cost
• A query is composed of many database operations, so processing it may
produce one or more intermediate results before the final output is reached.
These intermediate results should be stored in primary memory for
further processing. The bigger the intermediate relation, the larger the memory
requirement, which puts pressure on the limited available space. This is
counted as the cost of storage.
3. Query Execution Plans
3. Computation Cost
• A query is composed of many operations. The operations could be
database operations like reading and writing to a disk, or
mathematical and other operations like:
– Searching
– Sorting
– Merging
– Computation on field values
4. Communication Cost
• In most database systems the database resides at one station and
queries originate from different terminals. This affects the performance
of the system and adds to the cost of query processing. Thus,
the cost of transporting data between the database site and the terminal
from which the query originates should be analyzed.
Query Decomposition
• ANALYSIS
<Query>
• RELATIONAL ALGEBRA TREE
– Root : The desired result of query
– Leaf : Base relations of query
– Non-Leaf : Intermediate relation created from relational algebra operation
• NORMALIZATION
– Convert WHERE clause into more easily manipulated form
– Conjunctive Normal Form (CNF): (a ∨ b) ∧ [(c ∨ d) ∧ e] ∧ f (more efficient)
– Disjunctive Normal Form(DNF) :
Heuristic Optimization
GOAL:
– Use relational algebra equivalence rules to improve the
expected performance of a given query tree.
Consider the example given earlier:
– Join followed by Selection (~ 3050 disk reads)
– Selection followed by Join (~ 1160 disk reads)
Relational Algebra Transformations
Cascade of Selection
– (1) σ_{p ∧ q ∧ r} (R) = σ_p(σ_q(σ_r(R)))
[Figure: two query trees over R and S, one rooted at × and one rooted at ⋈_p, shown as equal]
Note: The above is an incomplete list! For a complete list see the text.
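The cascade-of-selection rule (1) can be checked on a small relation. This is a sketch with invented data: the relation R, its attributes, and the three predicates p, q and r are hypothetical, and select is a plain list filter standing in for σ.

```python
def select(pred, relation):
    """sigma_pred(relation): keep the tuples that satisfy pred."""
    return [t for t in relation if pred(t)]


# Invented relation with eight tuples.
R = [
    {"dno": d, "salary": s, "sex": x}
    for d in (4, 5)
    for s in (20000, 40000)
    for x in ("F", "M")
]

p = lambda t: t["dno"] == 5
q = lambda t: t["salary"] > 30000
r = lambda t: t["sex"] == "F"

# One selection with a conjunctive condition ...
combined = select(lambda t: p(t) and q(t) and r(t), R)
# ... versus a cascade of three simple selections.
cascaded = select(p, select(q, select(r, R)))

print(combined == cascaded, combined)
```

Splitting the conjunction this way is what later lets each simple selection be moved down the tree independently.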
More Relational Algebra Transformations
Join and Cartesian Product Operations are Commutative and
Associative
(6) R × S = S × R
(7) R × (S × T) = (R × S) × T
(8) R ⋈_p S = S ⋈_p R
(9) (R ⋈_p S) ⋈_q T = R ⋈_p (S ⋈_q T)
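Rules (8) and (9) can be tried out on toy relations. The sketch below makes several assumptions: tuples are dictionaries with distinct attribute names, relations are compared as sets so attribute and tuple order do not matter, all data is invented, and the helper names theta_join and as_set are hypothetical. For the associativity check, the condition q refers only to attributes of S and T, which is the case in which the rule applies directly.

```python
def theta_join(left, right, pred):
    """Nested-loop theta join: keep merged tuples that satisfy pred."""
    return [{**l, **r} for l in left for r in right if pred({**l, **r})]


def as_set(relation):
    """Order-insensitive view of a relation, for comparing results."""
    return {frozenset(t.items()) for t in relation}


# Invented relations with distinct attribute names.
R = [{"a": 1}, {"a": 2}]
S = [{"b": 1, "c": 10}, {"b": 2, "c": 20}]
T = [{"d": 10}, {"d": 30}]

p = lambda t: t["a"] == t["b"]   # condition between R and S
q = lambda t: t["c"] == t["d"]   # condition between S and T only

# Rule (8): commutativity.
commutes = as_set(theta_join(R, S, p)) == as_set(theta_join(S, R, p))

# Rule (9): associativity.
left_deep = theta_join(theta_join(R, S, p), T, q)
right_deep = theta_join(R, theta_join(S, T, q), p)
associates = as_set(left_deep) == as_set(right_deep)

print(commutes, associates)
```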
Optimization Uses The Following Heuristics
Break apart conjunctive selections into a sequence of simpler
selections (preparatory step for next heuristic).
Move σ down the query tree for the earliest possible execution
(reduce number of tuples processed).
Break apart lists of projection attributes and move them as far down
the tree as possible, creating new projections where possible
(reduce tuple widths early).
Algorithms for SELECT and JOIN Operations
• Examples:
– (OP1): σ_{SSN='123456789'} (EMPLOYEE)
– (OP2): σ_{DNUMBER>5} (DEPARTMENT)
– (OP3): σ_{DNO=5} (EMPLOYEE)
– (OP4): σ_{DNO=5 AND SALARY>30000 AND SEX=F} (EMPLOYEE)
– (OP5): σ_{ESSN=123456789 AND PNO=10} (WORKS_ON)
• Search Methods for Simple Selection:
– S1 Linear search (brute force):
• Retrieve every record in the file, and test whether its attribute
values satisfy the selection condition.
– S2 Binary search:
• If the selection condition involves an equality comparison on a
key attribute on which the file is ordered, binary search (which
is more efficient than linear search) can be used
– S3 Using a primary index to retrieve a single record:
• If the selection condition involves an equality comparison on a
key attribute with a primary index use the primary index to
retrieve the record.
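The three search methods can be sketched on an in-memory stand-in for a file. The assumptions are worth stating: the employees list is invented and already ordered on ssn, a Python dictionary plays the role of the primary index, and the function names are hypothetical.

```python
def linear_search(records, attr, value):
    """S1: test every record against the selection condition."""
    return [rec for rec in records if rec[attr] == value]


def binary_search(sorted_records, key_attr, value):
    """S2: equality on the key attribute the file is ordered on."""
    lo, hi = 0, len(sorted_records) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        key = sorted_records[mid][key_attr]
        if key == value:
            return sorted_records[mid]
        if key < value:
            lo = mid + 1
        else:
            hi = mid - 1
    return None


# Invented file of ten records, ordered on the key attribute ssn.
employees = [{"ssn": str(123456780 + i), "dno": i % 3} for i in range(10)]

# S3: a dictionary standing in for a primary index (key -> record position).
primary_index = {rec["ssn"]: pos for pos, rec in enumerate(employees)}

by_scan = linear_search(employees, "dno", 2)                  # non-key condition
by_binary = binary_search(employees, "ssn", "123456789")      # like OP1
by_index = employees[primary_index["123456789"]]              # like OP1, via index

print(len(by_scan), by_binary, by_index)
```

Linear search works for any condition, while the other two rely on the equality being on the ordering key or on an indexed key.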
Implementing the JOIN Operation:
– Join
• two-way join: a join on two files
• e.g. R ⋈_{A=B} S
• multi-way join: a join involving more than two files
• e.g. R ⋈_{A=B} S ⋈_{C=D} T
Examples
– (OP6): EMPLOYEE ⋈_{DNO=DNUMBER} DEPARTMENT
– (OP7): DEPARTMENT ⋈_{MGRSSN=SSN} EMPLOYEE
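OP6, OP7 and a multi-way join can be sketched with a simple nested-loop join. This is one possible join method chosen for brevity, not necessarily the one the text goes on to recommend; the rows are invented, the PROJECT relation is added only to show a three-file join, and nested_loop_join is a hypothetical helper.

```python
def nested_loop_join(left, right, left_attr, right_attr):
    """Two-way equijoin: compare every left tuple with every right tuple."""
    return [{**l, **r} for l in left for r in right if l[left_attr] == r[right_attr]]


# Invented sample rows.
EMPLOYEE = [
    {"SSN": "1", "LNAME": "Lee", "DNO": 5},
    {"SSN": "2", "LNAME": "Kim", "DNO": 4},
    {"SSN": "3", "LNAME": "Roy", "DNO": 5},
]
DEPARTMENT = [
    {"DNUMBER": 5, "DNAME": "Research", "MGRSSN": "1"},
    {"DNUMBER": 4, "DNAME": "Admin", "MGRSSN": "2"},
]
PROJECT = [{"PNUMBER": 10, "DNUM": 5}]

op6 = nested_loop_join(EMPLOYEE, DEPARTMENT, "DNO", "DNUMBER")
op7 = nested_loop_join(DEPARTMENT, EMPLOYEE, "MGRSSN", "SSN")

# A multi-way join expressed as a sequence of two-way joins.
multi = nested_loop_join(
    nested_loop_join(PROJECT, DEPARTMENT, "DNUM", "DNUMBER"),
    EMPLOYEE, "MGRSSN", "SSN",
)

print(len(op6), [t["LNAME"] for t in op7], multi[0]["LNAME"])
```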