Chapter - 2 Query Processing
Chapter Two
Overview of Query Processing
Query processing: the activities involved in parsing, validating, optimizing, and executing a query.
Steps of Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation
[Figure: Query → Parser & Translator → Relational Algebra Expression → Optimizer (using Statistics About Data) → Execution Plan → Evaluation Engine → Output]
Evaluation: general guidelines to evaluate a query.
Query optimization:
The activity of choosing an efficient execution strategy for
processing a query.
Task: Find an efficient physical query plan (aka execution plan) for
an SQL query
Goal: Minimize the evaluation time for the query, i.e., compute
query result as fast as possible
Cost Factors: Disk accesses, read/write operations, [I/O, page
transfer] (CPU time is typically ignored)
Optimization: a query can usually be evaluated in more than one way, so the
optimizer looks for the most efficient evaluation plan.
Examples:
Note: All examples in this slide are done based on these tables.
Find all Managers who work at a London branch.
SELECT * FROM Staff s, Branch b WHERE s.branchNo =
b.branchNo AND (s.position = 'Manager' AND b.city = 'London');
Different Strategies
Assume:
Cost Comparison
Costs (in disk accesses) are:
Phases of query processing
Query Processing has four main phases.
1. Decomposition.
• Analysis.
• Normalization.
• Semantic Analysis.
• Simplification.
• Restructuring.
2. Optimization.
• Heuristics.
• Comparing costs.
3. Code Generation.
4. Execution.
Query Decomposition
Transform a high-level query into a relational algebra (RA) query.
Stages:
Analysis,
Normalization,
Semantic analysis,
Simplification,
Query restructuring.
Analysis
Analyze query lexically and syntactically using compiler
techniques.
Verify relations and attributes exist.
Verify operations are appropriate for object type.
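Example
SELECT staff_no FROM Staff WHERE position > 10;
This query is rejected if staff_no is not an attribute of Staff, and in any
case because the comparison > 10 is not valid for position, which holds
character strings such as 'Manager'.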
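The two verification steps can be sketched as lookups against a catalog. This is only an illustration under stated assumptions: the CATALOG dictionary, its attribute names and types, and the check_query helper are hypothetical and do not come from the slides or from any particular DBMS.

```python
# Hypothetical catalog: relation name -> {attribute name: accepted Python types}.
CATALOG = {
    "Staff": {
        "staffNo": str,
        "position": str,
        "salary": (int, float),
        "branchNo": str,
    },
}

def check_query(relation, projected, comparisons):
    """Return a list of analysis errors for a single-relation query.

    comparisons is a list of (attribute, operator, constant) triples.
    """
    if relation not in CATALOG:
        return [f"unknown relation {relation}"]
    attrs = CATALOG[relation]
    errors = []
    # Verify that every projected attribute exists.
    for name in projected:
        if name not in attrs:
            errors.append(f"unknown attribute {name}")
    # Verify that each comparison constant fits the attribute's type.
    for name, op, const in comparisons:
        if name not in attrs:
            errors.append(f"unknown attribute {name}")
        elif not isinstance(const, attrs[name]):
            errors.append(f"type mismatch: {name} {op} {const!r}")
    return errors

# The example query: SELECT staff_no FROM Staff WHERE position > 10;
errors = check_query("Staff", ["staff_no"], [("position", ">", 10)])
print(errors)
```

Under the assumed catalog, both problems are reported: the misspelled attribute and the numeric comparison against a string attribute.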
Finally, the query is transformed into a query tree, constructed as follows:
Leaf node for each base relation.
Non-leaf node for each intermediate relation produced by RA
operation.
Root of tree represents query result.
Sequence is directed from leaves to root.
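The construction rules can be illustrated with a small tree structure. This is a minimal sketch: the Node class and the helper functions are hypothetical names, and the tree shown is just one possible tree for the London-managers query used earlier in this chapter.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                                    # base relation or RA operation
    children: list = field(default_factory=list)

    def is_leaf(self):
        return not self.children

def leaves(node):
    """Base relations of the tree, left to right."""
    if node.is_leaf():
        return [node.label]
    return [name for child in node.children for name in leaves(child)]

def evaluation_order(node):
    """Post-order walk: operations listed from the leaves up to the root."""
    order = []
    for child in node.children:
        order.extend(evaluation_order(child))
    if not node.is_leaf():
        order.append(node.label)
    return order

# Root = query result; inner nodes = intermediate relations; leaves = base relations.
tree = Node("join s.branchNo = b.branchNo", [
    Node("select position = 'Manager'", [Node("Staff")]),
    Node("select city = 'London'", [Node("Branch")]),
])
print(leaves(tree))
print(evaluation_order(tree))
```

The post-order walk reflects the rule that the sequence is directed from the leaves to the root: both selections are listed before the join that consumes their results.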
Normalization
(position = 'Manager' ∧ branchNo = 'B003') ∨ (salary > 20000 ∧ branchNo = 'B003')
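Normalization usually targets one of two standard forms. As a hedged illustration, assuming the slide's predicate is the disjunctive normal form (DNF) shown on the second line, the equivalent conjunctive normal form (CNF) follows from distributivity of ∧ over ∨:

```latex
\begin{align*}
\text{CNF:}\quad & (\text{position} = \text{'Manager'} \lor \text{salary} > 20000)
                   \land \text{branchNo} = \text{'B003'} \\
\text{DNF:}\quad & (\text{position} = \text{'Manager'} \land \text{branchNo} = \text{'B003'})
                   \lor (\text{salary} > 20000 \land \text{branchNo} = \text{'B003'})
\end{align*}
```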
Semantic Analysis
Rejects normalized queries that are incorrectly formulated or
contradictory.
Query is incorrectly formulated if components do not contribute
to generation of result.
Query is contradictory if its predicate cannot be satisfied by any
tuple.
Algorithms to determine correctness exist only for queries that
do not contain disjunction and negation.
Semantically incorrect
Components do not contribute in any way to the
generation of the result
Only a subset of relational calculus queries can be tested
for correctness
● Those that do not contain disjunction and negation
● To detect
➠ connection graph (query graph)
➠ join graph
Relation connection graph
a. Create node for each relation and node for result.
b. Create edges between two nodes that represent a join.
c. Create edges between nodes that represent projection.
If not connected, query is incorrectly formulated.
Example: SELECT p.propertyNo, p.street FROM Client c, Viewing v,
PropertyForRent p WHERE c.clientNo = v.clientNo AND c.maxRent >= 500
AND c.prefType = 'Flat' AND p.ownerNo = 'CO93';
The relation connection graph is not connected: the join condition
(v.propertyNo = p.propertyNo) has been omitted.
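Steps a to c and the connectivity test can be sketched as a graph traversal. This is a minimal illustration, assuming a hypothetical is_connected helper; the nodes and edges are read off the example query (one join edge, one projection edge).

```python
def is_connected(nodes, edges):
    """Return True if every node is reachable from the first one."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, stack = set(), [nodes[0]]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(neighbours[node] - seen)
    return seen == set(nodes)

# a. One node per relation plus one node for the result.
nodes = ["Client", "Viewing", "PropertyForRent", "result"]
# b. Join edge from c.clientNo = v.clientNo.
# c. Projection edge: the SELECT list takes its columns from PropertyForRent.
edges = [("Client", "Viewing"), ("PropertyForRent", "result")]

as_written = is_connected(nodes, edges)
# Adding the join v.propertyNo = p.propertyNo connects the graph.
repaired = is_connected(nodes, edges + [("Viewing", "PropertyForRent")])
print(as_written, repaired)
```

The query as written leaves Client and Viewing cut off from the result, so it is incorrectly formulated; adding the missing join condition makes the graph connected.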
Example 2
SELECT Ename, Resp FROM Emp, Works, Project WHERE
Emp.Eno = Works.Eno AND Works.Pno = Project.Pno AND
Pname = 'CAD/CAM' AND Dur > 36 AND Title = 'Programmer'
Simplification
1. Detects redundant qualifications,
Example
SELECT TITLE FROM E WHERE (NOT (TITLE = 'Programmer')
AND (TITLE = 'Programmer' OR TITLE = 'Electrical Eng.')
AND NOT (TITLE = 'Electrical Eng.')) OR ENAME = 'J.Doe';
is equivalent to
SELECT TITLE FROM E WHERE ENAME = 'J.Doe';
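The equivalence can be checked mechanically by treating each comparison as a Boolean variable and enumerating every truth assignment. This is a minimal sketch; the names p1, p2, p3 and the two functions are introduced here for illustration only.

```python
from itertools import product

def original(p1, p2, p3):
    # p1: TITLE = 'Programmer'
    # p2: TITLE = 'Electrical Eng.'
    # p3: ENAME = 'J.Doe'
    return (not p1 and (p1 or p2) and not p2) or p3

def simplified(p1, p2, p3):
    return p3

# Compare the two predicates on all 2^3 truth assignments.
equivalent = all(
    original(*values) == simplified(*values)
    for values in product([False, True], repeat=3)
)
print(equivalent)
```

The bracketed part is a contradiction (it needs p2 to be both true and false once p1 is false), so only the ENAME condition survives.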
Restructuring
Convert SQL to relational algebra.
Make use of query trees.
Example: SELECT Ename FROM Emp, Works, Project
WHERE Emp.Eno = Works.Eno AND Works.Pno = Project.Pno
AND Ename <> 'J. Doe' AND Pname = 'CAD/CAM'
AND (Dur = 12 OR Dur = 24)
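One possible relational algebra form of this query is sketched below. It is an unoptimized starting point, with every non-join predicate kept in a single selection above the joins; many equivalent expressions exist, and the join order shown is an arbitrary choice.

```latex
\pi_{\text{Ename}}\Big(
  \sigma_{\text{Ename} \neq \text{'J. Doe'} \,\land\, \text{Pname} = \text{'CAD/CAM'}
          \,\land\, (\text{Dur} = 12 \,\lor\, \text{Dur} = 24)}
  \big( (\text{Emp} \bowtie_{\text{Emp.Eno} = \text{Works.Eno}} \text{Works})
        \bowtie_{\text{Works.Pno} = \text{Project.Pno}} \text{Project} \big)
\Big)
```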
Query tree:
A tree data structure that corresponds to a relational algebra
expression.
It represents the input relations of the query as leaf nodes of the tree,
and represents the relational algebra operations as internal nodes.
Query graph:
A graph data structure that corresponds to a relational calculus
expression.
It does not indicate the order in which operations are performed.
There is only a single graph corresponding to each query.
Transformation Rules for RA Operations
1. Conjunctive Selection operations can cascade into individual
Selection operations (and vice versa).
2. Commutativity of Selection.
3. In a sequence of Projection operations, only the last in the
sequence is required.
5. Commutativity of Theta join (and Cartesian product).
6. Commutativity of Selection and Theta join (or Cartesian product)
If selection predicate involves only attributes of one of join
relations, Selection and Join (or Cartesian product) operations
commute:
7. Commutativity of Projection and Theta join (or Cartesian product)
8. Commutativity of Union and Intersection (but not set difference)
R ∪ S = S ∪ R
R ∩ S = S ∩ R
9. Commutativity of Selection and set operations (Union,
Intersection, and Set difference).
σ_p(R ∪ S) = σ_p(R) ∪ σ_p(S)
σ_p(R ∩ S) = σ_p(R) ∩ σ_p(S)
σ_p(R - S) = σ_p(R) - σ_p(S)
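Several of these rules can be spot-checked on small relations modelled as Python sets. This is a sketch on hypothetical data, not a proof: R, S, the predicates p and q, and the select helper are all introduced for illustration. For set difference the operands keep their order, that is, σ_p(R - S) = σ_p(R) - σ_p(S).

```python
def select(pred, relation):
    """Selection: keep the tuples of a relation (a set) that satisfy pred."""
    return {t for t in relation if pred(t)}

# Hypothetical union-compatible relations of (id, salary) tuples.
R = {(1, 10), (2, 25), (3, 40)}
S = {(3, 40), (4, 5), (5, 30)}
p = lambda t: t[1] > 20        # salary > 20
q = lambda t: t[0] % 2 == 1    # odd id

# Rule 1: a conjunctive selection cascades into individual selections.
rule1 = select(lambda t: p(t) and q(t), R) == select(p, select(q, R))
# Rule 2: selections commute.
rule2 = select(p, select(q, R)) == select(q, select(p, R))
# Rule 8: union and intersection commute; set difference does not.
rule8 = (R | S == S | R) and (R & S == S & R) and (R - S != S - R)
# Rule 9: selection distributes over union, intersection and difference.
rule9 = (
    select(p, R | S) == select(p, R) | select(p, S)
    and select(p, R & S) == select(p, R) & select(p, S)
    and select(p, R - S) == select(p, R) - select(p, S)
)
print(rule1, rule2, rule8, rule9)
```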
2. Query Optimization
DBMS has algorithms to implement relational algebra
expressions
SQL is a declarative high-level language: it specifies what is
wanted, not how it is obtained
Optimization – not necessarily “optimal”, but reasonably
efficient
Techniques:
Heuristic rules
Query tree (relational algebra) optimization
Query graph optimization
Cost estimation (comparing costs of different plans)
Cost-based (physical) optimization
2.a. Heuristic-Based Processing Strategies
► Perform Selection operations as early as possible.
► Keep predicates on the same relation together.
► Combine a Cartesian product with a subsequent Selection whose predicate
represents a join condition into a Join operation.
► Use associativity of binary operations to rearrange leaf nodes so that leaf
nodes with the most restrictive Selection operations are executed first.
► Perform Projection as early as possible.
► Keep projection attributes on the same relation together.
► Compute common expressions once.
► If a common expression appears more than once, and its result is not too
large, store the result and reuse it when required.
Examples
What are the names of customers living on Elm Street who have
checked out “Terminator”?
SQL query:
SELECT Name FROM Customer CU, CheckedOut CH, Film F WHERE
Title = 'Terminator' AND F.FilmID = CH.FilmID AND CU.CustomerID =
CH.CustomerID AND CU.Street = 'Elm'
Apply Selections Early
Apply More Restrictive Selections Early
Form Joins
Apply Projections Early
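The effect of these heuristics on the Elm Street query can be tried on toy data. This is a minimal sketch under loud assumptions: the rows and column layouts for Customer, CheckedOut and Film are hypothetical (the slides do not give them), and Python lists stand in for relations. It compares the canonical plan (Cartesian product, then one selection) with a plan that selects first and then joins.

```python
from itertools import product

# Hypothetical toy data; column layouts are assumptions for illustration.
customers = [(1, "Ann", "Elm"), (2, "Bob", "Oak"), (3, "Cy", "Elm")]  # (CustomerID, Name, Street)
films = [(10, "Terminator"), (11, "Alien")]                           # (FilmID, Title)
checked_out = [(1, 10), (2, 10), (3, 11)]                             # (CustomerID, FilmID)

def matches(cu, ch, f):
    """The full WHERE clause of the query."""
    return (f[1] == "Terminator" and f[0] == ch[1]
            and cu[0] == ch[0] and cu[2] == "Elm")

# Canonical plan: Cartesian product first, then one big selection.
product_rows = list(product(customers, checked_out, films))
late = sorted(cu[1] for cu, ch, f in product_rows if matches(cu, ch, f))

# Heuristic plan: apply selections early, then join the reduced inputs.
elm = [cu for cu in customers if cu[2] == "Elm"]
terminator = [f for f in films if f[1] == "Terminator"]
joined = [(cu, ch, f)
          for f in terminator
          for ch in checked_out if ch[1] == f[0]
          for cu in elm if cu[0] == ch[0]]
early = sorted(cu[1] for cu, ch, f in joined)   # projection on Name

print(late, early, len(product_rows), len(joined))
```

Both plans return the same names, but the canonical plan builds every combination of rows before filtering, while the heuristic plan only ever materializes the matching rows.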
Cost-Based Optimization
Use transformations to generate multiple candidate query trees from
the canonical query tree.
Statistics on the inputs to each operator are needed.
Statistics on leaf relations are stored in the system catalog.
Statistics on intermediate relations must be estimated; most
important is the relations' cardinalities.
Cost formulas estimate the cost of executing each operation in each
candidate query tree.
Dependent on the specific algorithm used by the operator.
Cost can be CPU time, I/O time, communication time, main
memory usage, or a combination.
The candidate query tree with the least total cost is
selected for execution.
Example: Cost Estimation
Operation 3: σ followed by a π
Measures of Query Cost
There are many possible ways to estimate cost, e.g., based on
disk accesses, CPU time, or communication overhead.
Using Selectivity and Cost Estimates in
Query Optimization (3)
Catalog Information Used in Cost Functions
Information about the size of a file
number of records (tuples) (r),
record size (R),
number of blocks (b)
blocking factor (bfr)
Information about indexes and indexing attributes of a file
Number of levels (x) of each multilevel index
Number of first-level index blocks (bI1)
Number of distinct values (d) of an attribute
Selectivity (sl) of an attribute
Selection cardinality (s) of an attribute. (s = sl * r)
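The catalog quantities are related by simple arithmetic. This is a minimal sketch with hypothetical numbers; the FileStats class and selection_cardinality helper are illustrative names, and the estimate sl = 1/d assumes an equality condition on an attribute whose d distinct values are uniformly distributed.

```python
import math
from dataclasses import dataclass

@dataclass
class FileStats:
    r: int      # number of records (tuples)
    bfr: int    # blocking factor: records per block

    @property
    def b(self):
        """Number of blocks, assuming records do not span blocks."""
        return math.ceil(self.r / self.bfr)

def selection_cardinality(stats, d):
    """s = sl * r with sl = 1/d (equality condition, uniformity assumed)."""
    return stats.r / d

stats = FileStats(r=3000, bfr=30)          # hypothetical file
print(stats.b, selection_cardinality(stats, d=60))
```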
Selection Operation
S1 - Linear search
cost(S1) = bR, the number of blocks of R
Query Evaluation
Cost of Operations
Cost depends on
Types of query conditions
bR: # pages in R
Simple Selection
Simple selection: σ_{A op a}(R)
A is a single attribute, a is a constant, and op is one of =, ≠, <, ≤, >, ≥.
The ≠ case is not discussed further because it requires a sequential scan of
the table.
How many tuples will be selected?
Selectivity factor SF_{A op a}(R): the fraction of tuples of R satisfying
"A op a"
0 ≤ SF_{A op a}(R) ≤ 1
Number of tuples selected: NS = nR × SF_{A op a}(R)
Options of Simple Selection
Sequential (linear) Scan
General condition: cost = bR
Equality on key: average cost = bR / 2
Binary Search
Records are stored in sorted order
Equality on key: cost = log2(bR)
Equality on non-key (duplicates allowed)
cost = log2(bR) + NS/bfR - 1
= cost of the binary search + pages holding the selected tuples - the first of
those pages (already read by the search)
Example: Cost of Selection
Relation: R(A, B, C)
nR = 10000 tuples
bfR = 20 tuples/page
dist(A) = 50, dist(B) = 500
B+ tree clustering index on A with order 25 (p=25)
B+ tree secondary index on B w/ order 25
Query:
select * from R where A = a1 and B = b1
Relational Algebra: σ_{A=a1 ∧ B=b1}(R)
Example: Cost of Selection (cont.)
Option 1: Sequential Scan
Have to go thru the entire relation
Cost = bR = 10000/20 = 500
Option 2: Binary Search using A = a1
R is sorted on A (why? because the index on A is a clustering index)
NS = 10000/50 = 200
assuming a uniform distribution of values
Cost = log2(bR) + NS/bfR - 1
= log2(500) + 200/20 - 1 ≈ 9 + 10 - 1 = 18
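The two options can be recomputed and compared in a few lines, which is the core of cost-based plan choice: estimate each candidate and keep the cheapest. This is a minimal sketch using the example's numbers; the function names are hypothetical, and rounding log2 up to whole page reads is an assumption of this sketch.

```python
import math

n_R, bf_R, dist_A = 10000, 20, 50     # values from the example
b_R = math.ceil(n_R / bf_R)           # pages in R
N_S = n_R // dist_A                   # tuples with A = a1, uniform distribution

def sequential_scan_cost():
    # Option 1: every page of R is read.
    return b_R

def binary_search_cost():
    # Option 2: binary search to the first match, then read the matching pages.
    # Rounding log2 up to whole page reads is an assumption of this sketch.
    return math.ceil(math.log2(b_R)) + math.ceil(N_S / bf_R) - 1

costs = {
    "sequential scan": sequential_scan_cost(),
    "binary search on A": binary_search_cost(),
}
best = min(costs, key=costs.get)
print(costs, best)
```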
Cost of Join
Estimate Size of Join Result
NJ = min( nR × nS / dist(R.A), nR × nS / dist(S.B) )
Estimate Size of Join Result (cont.)
How wide is a tuple in join result?
Natural join: W = W(R) + W(S) - W(S ∩ R), where W(S ∩ R) is the width of the
attributes common to both relations
Theta join: W = W(R) + W(S)
What is blocking factor of join result?
bfJoin = block size / W
How many blocks does join result have?
bJoin = NJ / bfJoin
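The join-size formulas chain together: cardinality, then tuple width, then blocking factor, then number of blocks. This is a minimal sketch with hypothetical statistics (none are taken from the chapter); the function names are illustrative, and rounding the blocking factor down and the block count up is an assumption for whole tuples and whole blocks.

```python
import math

def join_cardinality(n_R, n_S, dist_RA, dist_SB):
    """NJ = min(nR*nS/dist(R.A), nR*nS/dist(S.B))."""
    return min(n_R * n_S / dist_RA, n_R * n_S / dist_SB)

def join_blocks(n_J, width_R, width_S, block_size, common_width=0):
    """Blocks in the join result; common_width > 0 models a natural join."""
    width = width_R + width_S - common_width   # W
    bf_join = block_size // width              # whole tuples per block
    return math.ceil(n_J / bf_join)            # bJoin

# Hypothetical statistics for a theta join (no shared attributes dropped).
n_J = join_cardinality(n_R=1000, n_S=500, dist_RA=100, dist_SB=50)
b_J = join_blocks(n_J, width_R=40, width_S=60, block_size=1000)
print(n_J, b_J)
```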
Query Execution Plans
An execution plan for a relational algebra query consists of a
combination of the relational algebra query tree and information
about the access methods to be used for each relation as well as
the methods to be used in computing the relational operators
stored in the tree.
Materialized evaluation: the result of an operation is stored as a
temporary relation.
Pipelined evaluation: as the result of an operator is produced, it
is forwarded to the next operator in sequence
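The difference between the two evaluation styles can be sketched with lists (materialized) and generators (pipelined). This is an illustration only: the rows and the operator functions are hypothetical, and a real engine pipelines pages or tuples through iterator interfaces rather than Python generators.

```python
rows = [{"name": "Ann", "salary": 30}, {"name": "Bob", "salary": 10},
        {"name": "Cy", "salary": 50}]            # hypothetical input relation

# Materialized: each operator builds a complete temporary relation.
def select_materialized(rel, pred):
    return [t for t in rel if pred(t)]

def project_materialized(rel, attrs):
    return [{a: t[a] for a in attrs} for t in rel]

temp = select_materialized(rows, lambda t: t["salary"] > 20)   # stored in full
materialized = project_materialized(temp, ["name"])

# Pipelined: each tuple is handed to the next operator as it is produced.
def select_pipelined(rel, pred):
    for t in rel:
        if pred(t):
            yield t

def project_pipelined(rel, attrs):
    for t in rel:
        yield {a: t[a] for a in attrs}

pipeline = project_pipelined(
    select_pipelined(rows, lambda t: t["salary"] > 20), ["name"])
first = next(pipeline)               # available before the scan has finished
pipelined = [first] + list(pipeline)
print(materialized == pipelined)
```

Both styles produce the same result; the pipelined version never stores the intermediate selection and can deliver its first tuple early.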
Query Tuning
Monitoring or revising a query to increase throughput or
to lower response time for time-critical applications.
Having to tune queries is a fact of life.
Query tuning has a localized effect and is thus relatively
attractive.
It is a time-consuming and specialized task.
It makes the queries harder to understand.
However, it is often a necessity.
This is not likely to change any time soon.
Assignment one
Using the heuristic algorithm, optimize the following SQL query.
SELECT LNAME FROM EMPLOYEE, WORKS_ON, PROJECT