
Chapter 4 Query Optimization


Chapter 4

Query Processing and Optimization


Outline
I. Query Processing and Optimization: Why?

II. Steps of Processing

III. Methods of Optimization
 – Heuristic (Logical Transformations)
   • Transformation Rules
   • Heuristic Optimization Guidelines
 – Cost Based (Physical Execution Costs)
   • Data Storage/Access Refresher

• What is Query Processing?
– The steps required to transform a high-level SQL query into a correct
and “efficient” strategy for execution and retrieval.
• What is Query Optimization?
– The activity of choosing a single “efficient” execution strategy
(from possibly hundreds of candidates), guided by database catalog statistics.
– Which relational algebra expression, equivalent to the given
query, will lead to the most efficient solution plan?
– For each algebraic operator, which algorithm (of several
available) do we use to compute that operator?
– How do operations pass data to one another (main memory buffer, disk
buffer, …)?
– Will this plan minimize resource usage (CPU, response
time, disk)?

Example:
Identify all managers who work in a London branch
SELECT * FROM Staff s, Branch b
WHERE s.branchNo = b.branchNo AND s.position = 'Manager' AND
b.city = 'London';

This query corresponds to these equivalent relational algebra expressions:

(1) σ_{(position='Manager') ∧ (city='London') ∧ (Staff.branchNo=Branch.branchNo)} (Staff × Branch)
(2) σ_{(position='Manager') ∧ (city='London')} (Staff ⋈_{Staff.branchNo=Branch.branchNo} Branch)
(3) [σ_{position='Manager'} (Staff)] ⋈_{Staff.branchNo=Branch.branchNo} [σ_{city='London'} (Branch)]
Assume:
– 1000 tuples in Staff
– 50 Managers
– 50 tuples in Branch
– 5 London branches
– No indexes or sort keys
– All temporary results are written back to disk (memory is small)
– Tuples are accessed one at a time (not in blocks)
Query 1 (Bad)
σ_{(position='Manager') ∧ (city='London') ∧ (Staff.branchNo=Branch.branchNo)} (Staff × Branch)
– Requires (1000+50) disk accesses to read the Staff and Branch
relations
– Creates a temporary relation holding the Cartesian product (1000*50 tuples),
which takes (1000*50) disk accesses to write
– Requires (1000*50) disk accesses to read the temporary relation back and test the
predicate
Total Work = (1000+50) + 2*(1000*50) =
101,050 I/O operations

Query 2 (Better)
σ_{(position='Manager') ∧ (city='London')} (Staff ⋈_{Staff.branchNo=Branch.branchNo} Branch)

– Again requires (1000+50) disk accesses to read Staff and Branch
– Joins Staff and Branch on branchNo, producing 1000 tuples
(each employee belongs to exactly one branch), which are written to disk
– Requires 1000 disk accesses to read the joined relation back and check the predicate
Total Work = (1000+50) + 2*(1000) =
3050 I/O operations
About 33 times fewer I/O operations than Query 1
Query 3 (Best)
[σ_{position='Manager'} (Staff)] ⋈_{Staff.branchNo=Branch.branchNo} [σ_{city='London'} (Branch)]

– Read the Staff relation to find the managers (1000 reads)
• Create a 50-tuple relation (50 writes)
– Read the Branch relation to find the London branches (50 reads)
• Create a 5-tuple relation (5 writes)
– Join the reduced relations and check the predicate (50 + 5 reads)

Total Work = 1000 + 2*(50) + 5 + (50 + 5) =
1160 I/O operations

About 87 times fewer I/O operations than Query 1
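
The I/O totals for the three strategies can be recomputed with a short script. This is only a sketch of the arithmetic under the assumptions listed in the example (tuple-at-a-time access, every temporary result written to disk and read back); the function and constant names are chosen here for illustration.

```python
# Figures assumed in the example above.
STAFF = 1000     # tuples in Staff
BRANCH = 50      # tuples in Branch
MANAGERS = 50    # Staff tuples with position = 'Manager'
LONDON = 5       # Branch tuples with city = 'London'


def cost_query1():
    """Selection applied to the Cartesian product."""
    read_base = STAFF + BRANCH
    product = STAFF * BRANCH            # size of the temporary relation
    return read_base + 2 * product      # write the product, then read it back


def cost_query2():
    """Selection applied to the join."""
    read_base = STAFF + BRANCH
    joined = STAFF                      # one matching branch per staff tuple
    return read_base + 2 * joined       # write the join result, then read it back


def cost_query3():
    """Selections first, then the join."""
    select_staff = STAFF + MANAGERS     # read Staff, write the 50 managers
    select_branch = BRANCH + LONDON     # read Branch, write the 5 London branches
    join = MANAGERS + LONDON            # read both reduced relations
    return select_staff + select_branch + join


costs = {
    "Query 1": cost_query1(),
    "Query 2": cost_query2(),
    "Query 3": cost_query3(),
}
for name, cost in costs.items():
    ratio = costs["Query 1"] / cost
    print(f"{name}: {cost} I/Os (Query 1 cost / this cost = {ratio:.1f})")
```

The ratios printed for Query 2 and Query 3 are roughly 33 and 87, which is where the improvement figures quoted above come from.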

Query Processing Steps

• Query processing can be divided into four phases: decomposition, optimization,
code generation, and execution.
1. Query Decomposition
• The process of transforming a high-level query into a relational algebra
query and checking that the query is syntactically and semantically correct. It
consists of parsing and validation.
Typical stages in query decomposition are:
i. Analysis: lexical and syntactic analysis of the query for correctness (attribute
names, data types, and so on). A query tree is built for the query, with a
leaf node for each base relation, one or more non-leaf nodes for the relations
produced by relational algebra operations, and a root node for the result of the
query. Operations are executed from the leaves to the root. Example query:
SELECT * FROM Catalog c, Author a WHERE a.authorid = c.authorid AND
c.price > 200 AND a.country = 'USA'
ii. Normalization: convert the query into a normalized form. The WHERE
predicate is converted to conjunctive (∧) or disjunctive (∨) normal
form.
iii. Semantic Analysis: reject normalized queries that are incorrectly
formulated or contradictory. A query is incorrectly formulated if some of its
components do not contribute to the result. It is contradictory if the predicate
cannot be satisfied by any tuple, for example (Catalog = 'BS' ∧ Catalog = 'CS'),
since a given book can be classified in only one category at a time.
iv. Simplification: detect redundant qualifications, eliminate common
subexpressions, and transform the query into a semantically equivalent but more
easily and efficiently computed form. Access restrictions are also checked: if a
user does not have the necessary access to all of the objects in the query, the
query is rejected.
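
The contradiction check in the semantic analysis stage can be illustrated for the simplest case. The sketch below assumes the predicate has already been normalized into a conjunction of equality tests of the form attribute = constant; the function name and the tuple representation are hypothetical, and a real DBMS handles far more general predicates.

```python
def is_contradictory(conjuncts):
    """Return True if a conjunction of (attribute, constant) equality
    tests can never be satisfied by a single tuple.

    conjuncts: list of (attribute, value) pairs, all joined by AND.
    """
    required = {}  # attribute -> the constant it must equal
    for attribute, value in conjuncts:
        if attribute in required and required[attribute] != value:
            return True            # same attribute forced to two values
        required[attribute] = value
    return False


# Catalog = 'BS' AND Catalog = 'CS' can never hold for one tuple.
print(is_contradictory([("Catalog", "BS"), ("Catalog", "CS")]))
# Equality tests on different attributes do not conflict.
print(is_contradictory([("Catalog", "BS"), ("country", "USA")]))
```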
2. Query Optimization
• Everyone wants the performance of their database to be optimal.
In particular, there is often a requirement for a specific query, or an
object based on a query, to run faster.
• The problem of query optimization is to find the sequence of steps
that produces the answer to the user's request in the most efficient
manner, given the database structure.
• The performance of a query is affected by the tables or queries
that underlie it and by the complexity of the query itself.
• Given a request for data manipulation or retrieval, an optimizer
chooses a plan for evaluating the request from among
the many alternative strategies, i.e. there are many ways
(access paths) of reaching the desired file or record.
• Hence, the DBMS is responsible for picking the best execution strategy
based on various considerations (for example, the least amount of I/O and CPU resources).

• Example: Consider relations r(A, B) and s(C, D). We require r × s.
• Method 1:
a. Load the next record of r into RAM.
b. Load all records of s, one at a time, and concatenate each with the r record.
c. Have all records of r been concatenated?
– NO: go to a.
– YES: exit (the result is in RAM or on disk).
• Performance: too many accesses, since s is read once for every record of r.
• Method 2 (improvement):
a. Load as many blocks of r as possible, leaving room for one block of s.
b. Run through the s file completely, one block at a time.
• Performance: reduces the number of times the s blocks are loaded by a factor
equal to the number of r records that can fit in main memory.
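
The difference between the two methods can be made concrete by counting how often s has to be loaded. The sketch below is only an estimate under assumed sizes (record counts, records per block, and buffer size are illustrative, not taken from the slides), and the function names are hypothetical.

```python
import math


def method1_s_record_loads(r_records, s_records):
    """Method 1: s is scanned record by record once per record of r."""
    return r_records * s_records


def method2_s_block_loads(r_blocks, s_blocks, buffer_blocks):
    """Method 2: keep (buffer_blocks - 1) blocks of r in memory and
    stream s through the one remaining buffer block."""
    if buffer_blocks < 2:
        raise ValueError("need at least one block for r and one for s")
    r_chunk = buffer_blocks - 1
    passes_over_s = math.ceil(r_blocks / r_chunk)
    return passes_over_s * s_blocks


# Illustrative sizes only.
R_RECORDS, S_RECORDS = 1000, 500
RECORDS_PER_BLOCK = 10
BUFFER_BLOCKS = 11

r_blocks = math.ceil(R_RECORDS / RECORDS_PER_BLOCK)
s_blocks = math.ceil(S_RECORDS / RECORDS_PER_BLOCK)

print(method1_s_record_loads(R_RECORDS, S_RECORDS))              # 500000 record loads
print(method2_s_block_loads(r_blocks, s_blocks, BUFFER_BLOCKS))  # 500 block loads
```

With these sizes, Method 2 scans s 10 times instead of 1000 times, because 100 records of r (10 blocks) are held in memory during each pass.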
• Considerations during query optimization:
– Narrow down intermediate result sets quickly: apply SELECT and
PROJECT before JOIN.
– Use access structures (indexes).
Approaches to Query Optimization: Heuristics and Cost Function
A. Heuristics Approach
• The heuristics approach uses knowledge of the characteristics of the relational
algebra operations, and of the relationships between the operators, to optimize the
query.
• Thus the heuristic approach to optimization makes use of:
– Properties of individual operators
– Association between operators
– Query tree: a graphical representation of the operators, relations, attributes
and predicates, and of the processing sequence during query processing.
• A query tree is composed of three main parts:
– The leaves: the base relations used for processing the query and
extracting the required information
– The root: the final result (relation) produced as output by the
operations on the relations used for query processing
– Nodes: the intermediate results or relations produced before reaching the final
result.
• Execution of the operations in a query tree starts from the
leaves, continues through the intermediate nodes, and ends at the root.
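
A query tree and its leaf-to-root execution order can be sketched as a small recursive data structure. This is a minimal illustration, not how any particular DBMS represents plans: the class names, the `evaluate` function, and the sample rows (including the `staffNo` attribute) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Relation:
    """Leaf node: a base relation."""
    name: str
    rows: list


@dataclass
class Select:
    """Internal node: keeps the rows of its child that satisfy the predicate."""
    predicate: Callable[[dict], bool]
    child: object


@dataclass
class Join:
    """Internal node: combines rows of both children that satisfy the predicate."""
    predicate: Callable[[dict, dict], bool]
    left: object
    right: object


def evaluate(node):
    """Evaluate children first, so execution runs from the leaves to the root."""
    if isinstance(node, Relation):
        return node.rows
    if isinstance(node, Select):
        return [row for row in evaluate(node.child) if node.predicate(row)]
    if isinstance(node, Join):
        left_rows = evaluate(node.left)
        right_rows = evaluate(node.right)
        return [{**l, **r} for l in left_rows for r in right_rows
                if node.predicate(l, r)]
    raise TypeError(f"unknown node type: {type(node).__name__}")


# Made-up sample data for the Staff/Branch example.
staff = Relation("Staff", [
    {"staffNo": "S1", "position": "Manager", "branchNo": "B1"},
    {"staffNo": "S2", "position": "Assistant", "branchNo": "B1"},
    {"staffNo": "S3", "position": "Manager", "branchNo": "B2"},
])
branch = Relation("Branch", [
    {"branchNo": "B1", "city": "London"},
    {"branchNo": "B2", "city": "Glasgow"},
])

# Root: join of the two selections (managers, London branches).
root = Join(
    lambda s, b: s["branchNo"] == b["branchNo"],
    Select(lambda s: s["position"] == "Manager", staff),
    Select(lambda b: b["city"] == "London", branch),
)
result = evaluate(root)
print(result)
```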
Using Heuristics in Query Optimization
• Process for heuristic optimization:
1. The parser of a high-level query generates an initial internal
representation.
2. Heuristic rules are applied to optimize the internal representation.
3. A query execution plan is generated to execute groups of
operations based on the access paths available on the files
involved in the query.
• The main heuristic is to apply first the operations that reduce the size
of intermediate results.
– E.g. apply SELECT and PROJECT operations before applying
the JOIN or other binary operations.
• Query block: the basic unit that can be translated into the algebraic
operators and optimized.
• A query block contains a single SELECT-FROM-WHERE expression, as
well as GROUP BY and HAVING clauses if these are part of the block.
• Nested queries within a query are identified as separate query blocks.
• Query tree:
– A tree data structure that corresponds to a relational algebra expression. It
represents the input relations of the query as leaf nodes of the tree, and
represents the relational algebra operations as internal nodes.
• An execution of the query tree consists of executing an internal node operation
whenever its operands are available and then replacing that internal node by
the relation that results from executing the operation.
• Query graph:
– A graph data structure that corresponds to a relational calculus expression.
It does not indicate the order in which operations are performed. There is
only a single graph corresponding to each query.

• Example:
– For every project located in 'Stafford', retrieve the project number, the controlling
department number and the department manager's last name, address and birthdate.
• Relational algebra:
– Π_{PNUMBER, DNUM, LNAME, ADDRESS, BDATE} (((σ_{PLOCATION='STAFFORD'} (PROJECT))
⋈_{DNUM=DNUMBER} (DEPARTMENT)) ⋈_{MGRSSN=SSN} (EMPLOYEE))
• SQL query:
Q2: SELECT P.PNUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE
FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E
WHERE P.DNUM=D.DNUMBER AND
D.MGRSSN=E.SSN AND
P.PLOCATION='STAFFORD';
• Heuristic Optimization of Query Trees:
– The same query could correspond to many different relational
algebra expressions, and hence many different query trees.
– The task of heuristic optimization of query trees is to find a final
query tree that is efficient to execute.
• Example:
Q: SELECT LNAME
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE PNAME = 'AQUARIUS' AND
PNUMBER=PNO AND ESSN=SSN
AND BDATE > '1957-12-31';
Summary of Heuristics for Algebraic Optimization:

1. The main heuristic is to apply first the operations that reduce
the size of intermediate results.

2. Perform select operations as early as possible to reduce the
number of tuples and perform project operations as early as
possible to reduce the number of attributes. (This is done by
moving select and project operations as far down the tree as possible.)

3. The select and join operations that are most restrictive should
be executed before other similar operations. (This is done by
reordering the leaf nodes of the tree among themselves and
adjusting the rest of the tree appropriately.)
Using Selectivity and Cost Estimates in Query Optimization

• Cost-based query optimization:
– Estimate and compare the costs of executing a query using
different execution strategies and choose the strategy with the
lowest cost estimate. (Compare this with heuristic query optimization.)
• Issues
– Cost function
– Number of execution strategies to be considered
• Cost Components for Query Execution
1. Access cost to secondary storage
2. Storage cost
3. Computation cost
4. Memory usage cost
5. Communication cost

B. Cost Estimation Approach to Query Optimization
• The main idea is to minimize the cost of processing a query. The cost
function comprises:
• I/O cost + CPU processing cost + communication cost + storage cost
• These components may carry different weights in different
processing environments.
• The DBMS uses information stored in the system catalogue to
estimate costs.
• The main target of query optimization is to minimize the size of the
intermediate relations. Their size affects the cost of:
– Disk access
– Data transportation
– Storage space in primary memory
– Writing to disk
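
A weighted cost function of this kind, and the choice of the cheapest strategy, might be sketched as follows. The weights and the CPU and storage figures are invented purely for illustration; only the I/O counts (101,050, 3050, 1160) come from the Staff/Branch example earlier in the chapter. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PlanEstimate:
    name: str
    io: float             # estimated disk accesses
    cpu: float            # estimated CPU work (arbitrary units)
    communication: float  # estimated data shipped between sites
    storage: float        # estimated size of intermediate results


# Assumed weights: a disk-bound, single-site environment.
WEIGHTS = {"io": 1.0, "cpu": 0.01, "communication": 0.0, "storage": 0.1}


def total_cost(plan, weights=WEIGHTS):
    """Weighted sum of the cost components of one execution strategy."""
    return (weights["io"] * plan.io
            + weights["cpu"] * plan.cpu
            + weights["communication"] * plan.communication
            + weights["storage"] * plan.storage)


def cheapest(plans, weights=WEIGHTS):
    """Pick the strategy with the lowest estimated cost."""
    return min(plans, key=lambda plan: total_cost(plan, weights))


plans = [
    PlanEstimate("selection over product", io=101050, cpu=50000, communication=0, storage=50000),
    PlanEstimate("selection over join", io=3050, cpu=1000, communication=0, storage=1000),
    PlanEstimate("selections then join", io=1160, cpu=55, communication=0, storage=55),
]
best = cheapest(plans)
print(best.name, total_cost(best))
```

Changing the weights models a different processing environment; in a distributed setting, for instance, the communication weight would no longer be zero.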

1. Access Cost of Secondary Storage
• A query needs some part of the data stored in the database, and that data
must be fetched from secondary storage. The disk access cost
can be analyzed in terms of:
– Searching
– Reading, and
– Writing the data blocks used to store some portion of a relation.
• Remark: The disk access cost will vary depending on
– The file organization used and the access method implemented for the
file organization.
– Whether the data is stored contiguously or scattered across the disk.
2. Storage Cost
• A query is composed of many database operations, so there may be one or more
intermediate results before reaching the final output. These intermediate results
must be held in primary memory for further processing. The bigger the intermediate
relation, the larger the memory requirement, which puts pressure on the limited
space available. This is counted as the storage cost.
Query Execution Plans
– An execution plan for a relational algebra query consists of a
combination of the relational algebra query tree and information
about the access methods to be used for each relation, as well as the
methods to be used in computing the relational operators stored in
the tree.

3. Computation Cost
• A query is composed of many operations. These may be
database operations such as reading from and writing to disk, or
mathematical and other operations such as:
– Searching
– Sorting
– Merging
– Computation on field values

4. Communication Cost
• In most database systems the database resides at one station while
queries originate from different terminals. This affects the performance
of the system and adds to the cost of query processing. Thus,
the cost of transporting data between the database site and the terminal
from which the query originates should be analyzed.

Query Decomposition
• ANALYSIS
• Lexical: Is it even valid SQL?
• Syntactic: Do the relations/attributes exist and are the
operations valid?
• Result is an internal tree representation of the SQL query (parse tree)

[Figure: parse tree with root <Query>, expanded as SELECT select_list FROM <from_list>, with an <attribute> node]

• RELATIONAL ALGEBRA TREE
– Root : The desired result of query
– Leaf : Base relations of query
– Non-Leaf : Intermediate relation created from relational algebra operation
• NORMALIZATION
– Convert WHERE clause into more easily manipulated form
– Conjunctive Normal Form (CNF): (a ∨ b) ∧ [(c ∨ d) ∧ e] ∧ f (more efficient)
– Disjunctive Normal Form (DNF): a disjunction (∨) of conjunctions (∧)

Heuristic Optimization
GOAL:
– Use relational algebra equivalence rules to improve the
expected performance of a given query tree.
Consider the example given earlier:
– Join followed by Selection (~ 3050 disk reads)
– Selection followed by Join (~ 1160 disk reads)

Relational Algebra Transformations
Cascade of Selection
– (1) σ_{p ∧ q ∧ r}(R) = σ_p(σ_q(σ_r(R)))

Commutativity of Selection Operations
– (2) σ_p(σ_q(R)) = σ_q(σ_p(R))

In a sequence of projections only the last is required
– (3) Π_L Π_M … Π_N(R) = Π_L(R)

Selections can be combined with Cartesian Products and Joins
– (4) σ_p(R × S) = R ⋈_p S
– (5) σ_p(R ⋈_q S) = R ⋈_{q ∧ p} S

[Figure: visual of rule (4): a σ_p node above R × S is equivalent to a single ⋈_p node over R and S]
Note: The above is an incomplete list! For a complete list see the text.
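
Rules (1), (2), (4) and (5) can be checked on small in-memory relations. The sketch below models a relation as a list of dictionaries; the helper names (`select`, `cartesian`, `theta_join`) and the sample data are assumptions made for this illustration, and passing on one example is of course not a proof of the rules.

```python
from itertools import product


def select(pred, rows):
    """Selection: keep the rows that satisfy pred."""
    return [row for row in rows if pred(row)]


def cartesian(r, s):
    """Cartesian product: every row of r paired with every row of s."""
    return [{**a, **b} for a, b in product(r, s)]


def theta_join(pred, r, s):
    """Theta join: pairs of rows whose combination satisfies pred."""
    return [{**a, **b} for a in r for b in s if pred({**a, **b})]


# Made-up sample relations R(a, b) and S(c, d).
R = [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}]
S = [{"c": 1, "d": "x"}, {"c": 3, "d": "y"}, {"c": 3, "d": "z"}]

p = lambda t: t["a"] > 1       # selection predicate on R
q = lambda t: t["b"] < 30      # selection predicate on R
j = lambda t: t["a"] == t["c"] # join predicate
k = lambda t: t["d"] != "y"    # selection predicate applied after the join

# (1) cascade of selection
rule1 = select(lambda t: p(t) and q(t), R) == select(p, select(q, R))
# (2) commutativity of selection
rule2 = select(p, select(q, R)) == select(q, select(p, R))
# (4) selection over a Cartesian product is a join
rule4 = select(j, cartesian(R, S)) == theta_join(j, R, S)
# (5) selection over a join folds into the join predicate
rule5 = select(k, theta_join(j, R, S)) == theta_join(lambda t: j(t) and k(t), R, S)

print(rule1, rule2, rule4, rule5)
```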
More Relational Algebra Transformations
Join and Cartesian Product Operations are Commutative and
Associative
(6) R × S = S × R
(7) R × (S × T) = (R × S) × T
(8) R ⋈_p S = S ⋈_p R
(9) (R ⋈_p S) ⋈_q T = R ⋈_p (S ⋈_q T)

Selection Distributes over Joins
– If predicate p involves attributes of R only:
(10) σ_p(R ⋈_q S) = σ_p(R) ⋈_q S
– If predicate p involves only attributes of R and q involves only
attributes of S:
(11) σ_{p ∧ q}(R ⋈_r S) = σ_p(R) ⋈_r σ_q(S)
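
Rules (8), (10) and (11) can be illustrated with the Staff/Branch example. In this sketch relations are lists of dictionaries and results are compared as unordered sets of tuples, since a relation has no row order; the helper names, the `staffNo` attribute, and the sample rows are assumptions made for the illustration.

```python
def select(pred, rows):
    """Selection: keep the rows that satisfy pred."""
    return [row for row in rows if pred(row)]


def join(r, s, pred):
    """Theta join with a predicate over one row from each input."""
    return [{**a, **b} for a in r for b in s if pred(a, b)]


def canon(rows):
    """Order-independent form, so results can be compared as relations."""
    return sorted(tuple(sorted(row.items())) for row in rows)


# Made-up sample data.
staff = [
    {"staffNo": "S1", "position": "Manager", "branchNo": "B1"},
    {"staffNo": "S2", "position": "Assistant", "branchNo": "B1"},
    {"staffNo": "S3", "position": "Manager", "branchNo": "B2"},
    {"staffNo": "S4", "position": "Manager", "branchNo": "B3"},
]
branch = [
    {"branchNo": "B1", "city": "London"},
    {"branchNo": "B2", "city": "Glasgow"},
    {"branchNo": "B3", "city": "London"},
]

same_branch = lambda s, b: s["branchNo"] == b["branchNo"]
is_manager = lambda t: t["position"] == "Manager"   # attributes of Staff only
in_london = lambda t: t["city"] == "London"         # attributes of Branch only

joined = join(staff, branch, same_branch)

# (8) join is commutative
rule8 = canon(joined) == canon(join(branch, staff, lambda b, s: same_branch(s, b)))

# (10) a selection on Staff attributes can be applied before the join
rule10 = canon(select(is_manager, joined)) == canon(
    join(select(is_manager, staff), branch, same_branch))

# (11) each selection is pushed to the relation it refers to
lhs11 = select(lambda t: is_manager(t) and in_london(t), joined)
rhs11 = join(select(is_manager, staff), select(in_london, branch), same_branch)
rule11 = canon(lhs11) == canon(rhs11)

print(rule8, rule10, rule11)
```

The right-hand side of rule (11) is the shape of Query 3 from the opening example: both inputs are reduced before they are joined.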

Optimization Uses The Following Heuristics
Break apart conjunctive selections into a sequence of simpler
selections (preparatory step for next heuristic).

Move σ down the query tree for the earliest possible execution
(reduce number of tuples processed).

Replace σ-× pairs by ⋈ (avoid large intermediate results).

Break apart and move as far down the tree as possible lists of
projection attributes, create new projections where possible
(reduce tuple widths early).

Perform the joins with the smallest expected result first
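
The first three heuristics can be sketched as a simple classification of the conjuncts in a WHERE clause: a conjunct that mentions one relation becomes a selection directly above that leaf, and a conjunct that mentions two relations turns a Cartesian product into a join. The representation below (predicate text plus the set of relation names it references) and the function name are hypothetical simplifications; projections and join ordering are not handled.

```python
def plan_selections(conjuncts):
    """Split a conjunctive WHERE clause into pushed-down selections and
    join predicates.

    conjuncts: list of (predicate_text, relations), where relations is the
    set of relation names the predicate refers to (assumed non-empty).
    """
    pushed = {}           # relation -> selections placed directly above its leaf
    join_predicates = []  # predicates that replace a sigma-over-product pair
    for text, relations in conjuncts:
        if not relations:
            raise ValueError(f"predicate references no relation: {text}")
        if len(relations) == 1:
            (relation,) = relations
            pushed.setdefault(relation, []).append(text)
        else:
            join_predicates.append(text)
    return pushed, join_predicates


# WHERE clause of the Staff/Branch example, already broken into conjuncts.
where = [
    ("Staff.branchNo = Branch.branchNo", {"Staff", "Branch"}),
    ("position = 'Manager'", {"Staff"}),
    ("city = 'London'", {"Branch"}),
]
pushed, join_predicates = plan_selections(where)
print(pushed)
print(join_predicates)
```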

Algorithms for SELECT and JOIN Operations
• Examples:
– (OP1): σ_{SSN='123456789'} (EMPLOYEE)
– (OP2): σ_{DNUMBER>5} (DEPARTMENT)
– (OP3): σ_{DNO=5} (EMPLOYEE)
– (OP4): σ_{DNO=5 AND SALARY>30000 AND SEX='F'} (EMPLOYEE)
– (OP5): σ_{ESSN='123456789' AND PNO=10} (WORKS_ON)
• Search Methods for Simple Selection:
– S1 Linear search (brute force):
• Retrieve every record in the file, and test whether its attribute
values satisfy the selection condition.
– S2 Binary search:
• If the selection condition involves an equality comparison on a
key attribute on which the file is ordered, binary search (which
is more efficient than linear search) can be used
– S3 Using a primary index to retrieve a single record:
• If the selection condition involves an equality comparison on a
key attribute with a primary index use the primary index to
retrieve the record.
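
The three search methods can be sketched on a small in-memory EMPLOYEE file. This is an illustration only: the sample records are made up, the file is assumed to be ordered on SSN, and a Python dictionary stands in for the primary index (a real primary index is an ordered, disk-based structure).

```python
from bisect import bisect_left

# Made-up EMPLOYEE records, kept ordered on the key attribute SSN.
employees = [
    {"SSN": "111111111", "DNO": 5},
    {"SSN": "123456789", "DNO": 5},
    {"SSN": "333445555", "DNO": 4},
    {"SSN": "987654321", "DNO": 1},
]


def linear_search(records, predicate):
    """S1: test every record against the selection condition."""
    return [record for record in records if predicate(record)]


# Ordered list of key values, built once; it mirrors the ordering of the file.
ssn_keys = [record["SSN"] for record in employees]


def binary_search(records, keys, value):
    """S2: equality on the ordering key, using binary search."""
    position = bisect_left(keys, value)
    if position < len(keys) and keys[position] == value:
        return records[position]
    return None


# Stand-in for a primary index: key value -> position of the record.
primary_index = {record["SSN"]: i for i, record in enumerate(employees)}


def index_lookup(records, index, value):
    """S3: equality on a key attribute, using the index."""
    position = index.get(value)
    return records[position] if position is not None else None


print(linear_search(employees, lambda r: r["DNO"] == 5))     # OP3
print(binary_search(employees, ssn_keys, "123456789"))       # OP1
print(index_lookup(employees, primary_index, "123456789"))   # OP1
```

Linear search works for any condition (OP3 here), while the other two apply only to an equality comparison on the key, as in OP1.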
• Implementing the JOIN Operation:
– Join
• two-way join: a join on two files
• e.g. R ⋈_{A=B} S
• multi-way joins: joins involving more than two files.
• e.g. R ⋈_{A=B} S ⋈_{C=D} T
• Examples
– (OP6): EMPLOYEE ⋈_{DNO=DNUMBER} DEPARTMENT
– (OP7): DEPARTMENT ⋈_{MGRSSN=SSN} EMPLOYEE

– J1 Nested-loop join (brute force):
• For each record t in R (outer loop), retrieve every record s from S
(inner loop) and test whether the two records satisfy the join condition
t[A] = s[B].
– J2 Single-loop join (using an access structure to retrieve the matching records):
• If an index (or hash key) exists for one of the two join attributes,
say B of S, retrieve each record t in R, one at a time, and then use
the access structure to retrieve directly all matching records s from S
that satisfy s[B] = t[A].
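
Both join methods can be sketched for OP6 on in-memory records. The sample rows and the `DNAME` attribute are made up for the illustration, and the dictionary built inside `single_loop_join` only stands in for an index that J2 assumes already exists on the join attribute of S.

```python
from collections import defaultdict


def nested_loop_join(R, S, a, b):
    """J1: compare every record t of R with every record s of S."""
    return [{**t, **s} for t in R for s in S if t[a] == s[b]]


def single_loop_join(R, S, a, b):
    """J2: one pass over R, probing an access structure on S.b."""
    index = defaultdict(list)      # stand-in for an existing index on S.b
    for s in S:
        index[s[b]].append(s)
    return [{**t, **s} for t in R for s in index.get(t[a], [])]


# Made-up sample data for OP6: EMPLOYEE joined with DEPARTMENT on DNO = DNUMBER.
employee = [
    {"SSN": "123456789", "DNO": 5},
    {"SSN": "987654321", "DNO": 4},
    {"SSN": "333445555", "DNO": 5},
]
department = [
    {"DNUMBER": 5, "DNAME": "Research"},
    {"DNUMBER": 4, "DNAME": "Administration"},
    {"DNUMBER": 1, "DNAME": "Headquarters"},
]

j1 = nested_loop_join(employee, department, "DNO", "DNUMBER")
j2 = single_loop_join(employee, department, "DNO", "DNUMBER")
print(j1 == j2, len(j1))
```

J1 performs |R| × |S| comparisons, whereas J2 performs one probe per record of R, which is why an available index on the join attribute is worth using.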