2 Chapter 3 Query Optimization
2 Chapter 3 Query Optimization
22
2 5/ QUERY OPTIMIZATION 1
/
10
OUTLINE
I. Query Processing and Optimization: Why?
Semantics based
22
2 5/ QUERY OPTIMIZATION 2
/
10
INTRODUCTION TO QUERY OPTIMISATION
With declarative languages such as SQL, the user specifies what data is required
E.g. SELECT name FROM Staff
WHERE gender = ‘F’;
rather than how it is to be retrieved
This relieves the user of knowing what constitutes good execution strategy.
22
2 5/ QUERY OPTIMIZATION 3
/
10
What is Query Processing?
Steps required to transform high level SQL query into a correct and
“efficient” strategy for execution and retrieval.
What is Query Optimization?
The activity of choosing a single “efficient” execution strategy
(from hundreds) as determined by database catalog statistics.
Which relational algebra expression, equivalent to the given query,
will lead to the most efficient solution plan?
How do operations pass data (main memory buffer, disk buffer,…)?
Will this plan minimize resource usage? (CPU/Response
Time/Disk)
22
2 5/ QUERY OPTIMIZATION 4
/
10
Example:
Identify all managers who work in a London branch
SELECT * FROM Staff s, Branch b
WHERE s.branchNo = b.branchNo AND s.position = ‘Manager’ AND
b.city = ‘london’;
Query 2 (Better)
(position=‘Manager’)^(city=‘London’) (Staff Staff.branchNo = Branch.branchNo Branch)
– Again requires (1000+50) disk accesses to read from Staff and Branch
– Joins Staff and Branch on branchNo with 1000 tuples
(1 employee : 1 branch )
– Requires (1000) disk access to read in joined relation and check predicate
Total Work = (1000+50) + 2*(1000) =
3050 I/O operations
22
3300% Improvement over Query 1
2 5/ QUERY OPTIMIZATION 6
/
10
QUERY 3 (BEST)
[ (position=‘Manager’) (Staff) ] Staff.branchNo = Branch.branchNo [ (city=‘London’) (Branch) ]
22
2 5/ QUERY OPTIMIZATION 7
/
10
QUERY PROCESSING
STEPS
22
2 5/ QUERY OPTIMIZATION 10
/
10
2. Query Optimization
Everyone wants the performance of their database to be optimal. In
particular, there is often a requirement for a specific query or object
that is query based, to run faster.
Problem of query optimization is to find the sequence of steps that
produces the answer to user request in the most efficient manner,
given the database structure.
The performance of a query is affected by the tables or queries that
underlies the query and by the complexity of the query.
Given a request for data manipulation or retrieval, an optimizer
will choose an optimal plan for evaluating the request from among
the manifold alternative strategies. i.e. there are many ways (access
paths) for accessing desired file/record.
hence ,DBMS is responsible to pick the best execution strategy
based on various considerations( Least amount of I/O and CPU resources. )
22
2 5/ QUERY OPTIMIZATION 11
/
10
22
2 5/ QUERY OPTIMIZATION 12
/
10
22
2 5/ QUERY OPTIMIZATION 13
/
10
22
2 5/ QUERY OPTIMIZATION 14
/
10
22
2 5/ QUERY OPTIMIZATION 15
/
10
A .USING HEURISTICS IN QUERY OPTIMIZATION
Query block: The basic unit that can be translated into the algebraic
operators and optimized.
A query block contains a single SELECT-FROM-WHERE expression,
as well as GROUP BY and HAVING clause if these are part of the block.
Nested queries within a query are identified as separate query blocks.
SQL query:
Q2: SELECT P.NUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE
FROM PROJECT AS P ,DEPARTMENT AS D, EMPLOYEE AS E
WHERE P.DNUM=D.DNUMBER AND
/2
2 D.MGRSSN=E.SSN AND
17
10/2
5 P.PLOCATION=‘STAFFORD’; QUERY OPTIMIZATION
22
2 5/ QUERY OPTIMIZATION 18
/
10
Q2: SELECT P.NUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE
FROM PROJECT AS P ,DEPARTMENT AS D, EMPLOYEE AS E
WHERE P.DNUM=D.DNUMBER AND
D.MGRSSN=E.SSN AND
P.PLOCATION=‘STAFFORD’;
22
2 5/ QUERY OPTIMIZATION 19
/
10
22
2 5/ QUERY OPTIMIZATION 20
/
10
Heuristic Optimization of Query Trees:
The same query could correspond to many different relational algebra
expressions — and hence many different query trees.
The task of heuristic optimization of query trees is to find a final
query tree that is efficient to execute.
Example:
Q: SELECT LNAME
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE PNAME = ‘AQUARIUS’ AND
PNMUBER=PNO AND ESSN=SSN
AND BDATE > ‘1957-12-31’;
22
2 5/ QUERY OPTIMIZATION 21
/
10
22
2 5/ QUERY OPTIMIZATION 22
/
10
22
2 5/ QUERY OPTIMIZATION 23
/
10
Which tree is better?
brName brName
Branch
s.name = ‘Mary’ Branch
s.name = ‘Mary’
Staff
Staff
22
2 5/ QUERY OPTIMIZATION 24
/
10
Summary of Heuristics for Algebraic Optimization:
1. The main heuristic is to apply first the operations that reduce
the size of intermediate results.
2. Perform select operations as early as possible to reduce the
number of tuples and perform project operations as early as
possible to reduce the number of attributes. (This is done by
moving select and project operations as far down the tree as possible.)
3. The select and join operations that are most restrictive should
be executed before other similar operations. (This is done by
reordering the leaf nodes of the tree among themselves and
adjusting the rest of the tree appropriately.)
22
2 5/ QUERY OPTIMIZATION 25
/
10
B. Cost Estimation Approach to Query Optimization
The main idea is to minimize the cost of processing a query. The cost
function is comprised of:
I/O cost + CPU processing cost + communication cost + Storage cost
These components might have different weights in different
processing environments
The DBMs will use information stored in the system catalogue for
the purpose of estimating cost.
The main target of query optimization is to minimize the size of the
intermediate relation. The size will have effect in the cost of:
Disk Access
Data Transportation
Storage space in the Primary Memory
Writing on Disk
22
2 5/ QUERY OPTIMIZATION 26
/
10
1. Access Cost of Secondary Storage
Data is going to be accessed from secondary storage, as a query will be
needing some part of the data stored in the database. The disk access
cost can again be analyzed in terms of:
Searching
Reading, and
Writing, data blocks used to store some portion of a relation.
Remark: The disk access cost will vary depending on
The file organization used and the access method implemented for the file
organization.
whether the data is stored contiguously or in scattered manner, will affect the
disk access cost.
2. Storage Cost
•While processing a query, as any query would be composed of many database
operations, there could be one or more intermediate results before reaching the
final output. These intermediate results should be stored in primary memory for
further processing. The bigger the intermediate relation, the larger the memory
requirement, which will have impact on the limited available space.
22
2 5/ QUERY OPTIMIZATION 27
/
10
3. Computation Cost
Query is composed of many operations. The operations could be
database operations like reading and writing to a disk, or
mathematical and other operations like:
Searching
Sorting
Merging
Computation on field values
4. Communication Cost
o In most database systems the database resides in one station and
various queries originate from different terminals. This will have impact
on the performance of the system adding cost for query processing. Thus,
the cost of transporting data between the database site and the terminal
from where the query originate should be analyzed.
22
2 5/ QUERY OPTIMIZATION 28
/
10
3. Query Execution Plans
– An execution plan for a relational algebra query consists of a
combination of the relational algebra query tree and information
about the access methods to be used for each relation as well as the
methods to be used in computing the relational operators stored in
the tree.
22
2 5/ QUERY OPTIMIZATION 29
/
10