Query Optimization
How are queries processed in SQL?
[Figure: Users/Programmers -> DBMS software (query processing software and programs) -> Database and Database Definition]
Agenda
I. Query Processing and Optimization: Why?
[Figure: stages of relational query processing]
Query
-> Scanning, Parsing, Validating
-> Intermediate form of query (query tree)
-> Query Optimizer (consults the Catalog)
-> Execution Plan
-> Query Code Generator
-> Compiled query code (executable code)
-> Execution in the runtime database processor
1. Query Recognition
Scanning is the process of identifying
the tokens in the query.
The tokenized representation is suitable for
processing by the parser.
Examples of tokens are SQL keywords,
attribute names, table names, …
This representation may be in a tree form.
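The scanning step can be sketched in a few lines of Python. This is only an illustration: the keyword set, the token class names and the `scan` function are assumptions made for the example, not part of any particular DBMS.

```python
import re

# Hypothetical, deliberately tiny keyword set; a real SQL scanner knows many more.
KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR"}

# One alternative per token class; strings and numbers are tried before names.
TOKEN_RE = re.compile(r"""
    (?P<STRING>'[^']*')          # single-quoted string literal
  | (?P<NUMBER>\d+)              # integer literal
  | (?P<NAME>[A-Za-z_]\w*)       # keyword, table name or attribute name
  | (?P<SYMBOL>[=,.*()<>])       # punctuation and comparison symbols
  | (?P<SPACE>\s+)               # whitespace, skipped
""", re.VERBOSE)


def scan(query):
    """Return a list of (token_class, text) pairs for the query string."""
    tokens = []
    pos = 0
    while pos < len(query):
        match = TOKEN_RE.match(query, pos)
        if match is None:
            raise ValueError(f"unexpected character {query[pos]!r} at position {pos}")
        pos = match.end()
        kind, text = match.lastgroup, match.group()
        if kind == "SPACE":
            continue
        if kind == "NAME" and text.upper() in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, text))
    return tokens


tokens = scan("SELECT B, D FROM R, S WHERE R.C = S.C AND S.E = 2")
print(tokens[:4])
```

The parser would then consume this token list to build the query tree.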
2. Query Optimization
The goal of the query optimizer
is to find an efficient strategy for
executing the query using the
access routines.
3. Query Code Generator
Once the query optimizer has
determined the execution plan (the
specific ordering of access routines),
the code generator writes out the
actual access routines to be executed.
With an interactive session, the query
code is interpreted and passed directly
to the runtime database processor for
execution.
It is also possible to compile the access
routines and store them for later
execution.
Access Routines
Access routines are algorithms that are used to access
and aggregate data in a database.
An RDBMS may have a collection of
general access routines that can be
combined to implement a query
execution plan.
We are interested in access routines
for selection, projection, join and set
operations such as union, intersection,
set difference, Cartesian product, etc.
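As a rough sketch of how general access routines can be combined into a plan, the Python generators below implement selection, projection and a nested-loop join. The function names and the dictionary row format are assumptions chosen for the example; real systems use far more elaborate routines.

```python
def select(rows, predicate):
    """Selection: yield only the rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row


def project(rows, attributes):
    """Projection: keep the named attributes and drop duplicate results."""
    seen = set()
    for row in rows:
        out = tuple(row[a] for a in attributes)
        if out not in seen:
            seen.add(out)
            yield dict(zip(attributes, out))


def nested_loop_join(left, right, attribute):
    """Equi-join on one shared attribute using the simplest (nested-loop) strategy."""
    right = list(right)  # the inner input is scanned once per outer row
    for lrow in left:
        for rrow in right:
            if lrow[attribute] == rrow[attribute]:
                yield {**lrow, **rrow}


# Small illustrative relations R(A, B, C) and S(C, D, E).
R = [dict(zip("ABC", t)) for t in
     [("a", 1, 10), ("b", 1, 20), ("c", 2, 10), ("d", 2, 35), ("e", 3, 45)]]
S = [dict(zip("CDE", t)) for t in
     [(10, "x", 2), (20, "y", 2), (30, "z", 2), (40, "x", 1), (50, "y", 3)]]

# An execution plan is a specific ordering of access routines.
plan = project(
    nested_loop_join(
        select(R, lambda r: r["A"] == "c"),
        select(S, lambda s: s["E"] == 2),
        "C",
    ),
    ["B", "D"],
)
result = list(plan)
print(result)
```

Because each routine is a generator, rows flow through the plan one at a time instead of being materialized between steps.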
Relational Query Processing
4. Execution in the runtime
database processor
At this point, the query has been
scanned, parsed, planned and
(possibly) compiled.
The runtime database processor then
executes the access routines against
the database.
The results are returned to the
application that made the query in
the first place.
Any runtime errors are also returned.
Query Processing &
Optimization
What is Query Processing?
The steps required to transform a high-level
SQL query into a correct and “efficient”
strategy for execution and retrieval.
R(A,B,C)
S(C,D,E)
SELECT B, D
FROM R, S
WHERE R.C=S.C AND
R.A = 'c' AND
S.E = 2
R              S
A  B  C        C   D  E
a  1  10       10  x  2
b  1  20       20  y  2
c  2  10       30  z  2
d  2  35       40  x  1
e  3  45       50  y  3

Answer
B  D
2  x
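The example can be reproduced with Python's built-in `sqlite3` module. SQLite is used here only as a convenient stand-in for "some DBMS"; the column types are assumptions, since the slides give only attribute names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A TEXT, B INTEGER, C INTEGER)")
conn.execute("CREATE TABLE S (C INTEGER, D TEXT, E INTEGER)")
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?)",
    [("a", 1, 10), ("b", 1, 20), ("c", 2, 10), ("d", 2, 35), ("e", 3, 45)],
)
conn.executemany(
    "INSERT INTO S VALUES (?, ?, ?)",
    [(10, "x", 2), (20, "y", 2), (30, "z", 2), (40, "x", 1), (50, "y", 3)],
)

query = "SELECT B, D FROM R, S WHERE R.C = S.C AND R.A = 'c' AND S.E = 2"
rows = conn.execute(query).fetchall()
print(rows)

# Ask the optimizer which strategy it chose; the wording of these rows
# depends on the SQLite version, so no particular output is assumed here.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
for step in plan:
    print(step)
```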
But this is how you would find the answer intelligently.
• How should the DBMS execute the query?
Plan II
Π B,D ( σ A='c' (R) natural join σ E=2 (S) )

σ A='c' (R)       σ E=2 (S)
A  B  C           C   D  E
c  2  10          10  x  2
                  20  y  2
                  30  z  2
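To see why applying the selections first pays off, the sketch below counts intermediate tuples for Plan II and for a naive baseline that forms the full Cartesian product before filtering. The baseline is an assumption introduced here purely for comparison.

```python
from itertools import product

R = [("a", 1, 10), ("b", 1, 20), ("c", 2, 10), ("d", 2, 35), ("e", 3, 45)]  # (A, B, C)
S = [(10, "x", 2), (20, "y", 2), (30, "z", 2), (40, "x", 1), (50, "y", 3)]  # (C, D, E)

# Naive baseline (assumed for comparison): Cartesian product, then filter.
cross = list(product(R, S))
naive = [(r[1], s[1]) for r, s in cross
         if r[2] == s[0] and r[0] == "c" and s[2] == 2]

# Plan II: apply both selections first, then join the much smaller inputs.
r_sel = [r for r in R if r[0] == "c"]
s_sel = [s for s in S if s[2] == 2]
plan2 = [(r[1], s[1]) for r in r_sel for s in s_sel if r[2] == s[0]]

print("pairs examined, naive:  ", len(cross))
print("pairs examined, Plan II:", len(r_sel) * len(s_sel))
print(naive, plan2)
```

Both strategies return the same answer; they differ only in how many tuple pairs are examined along the way.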
Physical Query Plan:
[Figure: tuples of R and S streamed through the physical operators; output: <2,x>; next tuple: <c,7,15>]
Physical operators
[Figure: overview of query optimization]
SQL query -> parse -> parse tree -> convert -> logical query plan (l.q.p.)
-> apply laws -> “improved” l.q.p.
-> estimate result sizes (using statistics) -> l.q.p. + sizes
-> consider physical plans -> {P1,P2,…..}
-> estimate costs -> {(P1,C1), ...}
-> pick best -> Pi -> execute -> answer
Processing Steps
Three Major Steps of
Processing
(1) Query Decomposition
Analysis
Derive Relational Algebra Tree
Normalization
Query Decomposition (cont…)
50 tuples in Branch.
~ 5 London branches
Requires (1000) disk accesses to read in the joined relation and check the predicate
Consider what happens if the Staff and Branch relations were 10x the size. 100x?
Heuristic Optimization
GOAL:
Use relational algebra equivalence rules to
improve the expected performance of a given
query tree.
(5) σ p (R x S) = R ⋈ p S
[Figure: visual of rule 4 as query trees: σ q above R ⋈ p S is equivalent to R ⋈ q^p S]
Note: The above is an incomplete list! For a complete list, see the text.
More Relational Algebra
Transformations
Join and Cartesian Product Operations are
Commutative and Associative
(6) R x S = S x R
(7) R x (S x T) = (R x S) x T
(8) R ⋈ p S = S ⋈ p R
(9) (R ⋈ p S) ⋈ q T = R ⋈ p (S ⋈ q T)
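These equivalences can be checked mechanically on small relations. In the sketch below, the helper names `rel`, `join` and `select` and the sample data are assumptions made for illustration. Rows are stored as sets of (attribute, value) pairs so that column order does not matter, which is what makes commutativity hold exactly.

```python
def rel(attrs, rows):
    """Build a relation: a set of rows, each a frozenset of (attribute, value) pairs."""
    return {frozenset(zip(attrs, row)) for row in rows}


def join(left, right, pred=lambda row: True):
    """Theta-join; with the default predicate it is the Cartesian product."""
    return {l | r for l in left for r in right if pred(dict(l | r))}


def select(rows, pred):
    """Selection: keep the rows for which the predicate holds."""
    return {row for row in rows if pred(dict(row))}


R = rel(("A", "B"), [(1, "a"), (2, "b"), (3, "c")])
S = rel(("C", "D"), [(1, "x"), (3, "y"), (3, "z")])
T = rel(("E", "F"), [(3, "t"), (4, "u")])

p = lambda row: row["A"] == row["C"]   # join condition between R and S
q = lambda row: row["C"] == row["E"]   # join condition between S and T

checks = {
    "(5)": select(join(R, S), p) == join(R, S, p),
    "(6)": join(R, S) == join(S, R),
    "(7)": join(R, join(S, T)) == join(join(R, S), T),
    "(8)": join(R, S, p) == join(S, R, p),
    "(9)": join(join(R, S, p), T, q) == join(R, join(S, T, q), p),
}
print(checks)
print(len(join(R, S)), "tuples in R x S versus", len(join(R, S, p)), "in the join")
```

The last line hints at why rule (5) matters: the join produces the same rows as selection over the product, but a real system never has to build the full product.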
Move σ down the query tree for the earliest possible execution
(reduce number of tuples processed).
Break apart lists of projection attributes and move them as far down
the tree as possible, creating new projections where possible
(reduce tuple widths early).
SELECT p.ticketno
FROM Flight f, Passenger p, Crew c
WHERE f.flightNo = p.flightNo AND
f.flightNo = c.flightNo AND
f.date = '01-01-06' AND
f.to = 'FRA' AND
p.name = c.name AND
c.job = 'Pilot'
Canonical Relational Algebra Expression
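For reference, the canonical (unoptimized) form of this query can be written as a single selection over the Cartesian product of the three relations, followed by the projection. The rendering below is a reconstruction from the SQL text, using the aliases f, p and c for Flight, Passenger and Crew; it is the usual starting point for heuristic rewriting.

```latex
\[
\Pi_{p.\mathrm{ticketno}}\Big(
  \sigma_{\substack{
      f.\mathrm{flightNo} = p.\mathrm{flightNo} \;\land\; f.\mathrm{flightNo} = c.\mathrm{flightNo} \;\land \\
      f.\mathrm{date} = \text{'01-01-06'} \;\land\; f.\mathrm{to} = \text{'FRA'} \;\land \\
      p.\mathrm{name} = c.\mathrm{name} \;\land\; c.\mathrm{job} = \text{'Pilot'}
  }}
  \big( \mathrm{Flight} \times \mathrm{Passenger} \times \mathrm{Crew} \big)
\Big)
\]
```

The heuristic steps then split the conjunctive selection, push each part toward the relation it mentions, and turn products with equality conditions into joins.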
Heuristic Optimization (Steps 1 to 6)
[Figures: query tree after each step]
Physical Execution Plan
Identified “optimal” Logical Query Plans
Not every heuristic is always the “best” transform
Heuristic Analysis reduces search space for cost
evaluation but does not necessarily reduce costs
Types of Records:
Variable Length
Fixed Length
Record Separation
Fixed-length records don’t need it
If needed, mark record boundaries with a special
marker, or store record lengths or offsets
Record Separation
Unspanned
Records must stay within a block
Simpler, but wastes space
Spanned
Records may span multiple blocks
Requires a pointer at the end of the
block to the next block holding the rest
of that record
Essential if record size > block size
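The space trade-off between the two layouts can be estimated with simple arithmetic. The sketch below is a simplification: the function names are invented for the example, the 8-byte next-block pointer is an assumed size, and per-record headers are ignored.

```python
def unspanned_layout(block_size, record_size, num_records):
    """Blocks needed and bytes wasted per block when records cannot cross blocks."""
    if record_size > block_size:
        raise ValueError("unspanned storage needs record_size <= block_size")
    per_block = block_size // record_size           # records that fit in one block
    blocks = -(-num_records // per_block)           # ceiling division
    wasted_per_block = block_size - per_block * record_size
    return blocks, wasted_per_block


def spanned_layout(block_size, record_size, num_records, pointer_size=8):
    """Blocks needed when a record may continue in the next block.

    Each block reserves pointer_size bytes (an assumed value) for the
    pointer to the block that holds the rest of a split record.
    """
    usable = block_size - pointer_size
    total_bytes = record_size * num_records
    return -(-total_bytes // usable)                # ceiling division


print(unspanned_layout(4096, 1500, 1000))
print(spanned_layout(4096, 1500, 1000))
```

With 1500-byte records in 4096-byte blocks, the unspanned layout leaves over a quarter of every block empty, which is the wasted space the slide refers to.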
Record Separation
Mixed Record Types – Clustering
Different record types within the same
block
Why cluster? Records that are frequently accessed
together are in the same block
Has performance downsides if there are
many frequently run queries with
different orderings
Split Records
Put the fixed-length part of a record in one block and
the variable-length part in another block
Record Separation
Sequencing
Order records in sequential blocks
based on a key
Indirection
Record address is a combination of
various physical identifiers or an
arbitrary bit string
Very flexible but can be costly
Accessing Data
What is an index?
Data structure that allows the
DBMS to quickly locate particular
records or tuples that meet specific
conditions
Types of indices:
Primary Index
Secondary Index
Dense Index
Sparse Index/Clustering Index
Multilevel Indices
Accessing Data
Primary Index
Index on the attribute that
determines the sequencing of the
table
Guarantees that the index is unique
Secondary Index
An index on any other attribute
Does not guarantee unique index
Accessing Data
Dense Index
Every value of the indexed attribute
appears in the index
Can tell if record exists without
accessing files
Better access to overflow records
Clustering Index
Each index entry can correspond to many
records
Dense Index
[Figure: dense index. Index entries 10, 20, 30, …, 120 each point to the record with that key in the sequential data file (records 10 to 100 shown).]
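A dense index can be sketched as a sorted list with one entry per record. The two-records-per-block layout, the record contents and the function name below are assumptions made for the illustration.

```python
from bisect import bisect_left

# Data file: records sorted by key, two records per block (assumed layout).
keys = list(range(10, 101, 10))
blocks = [[(k, f"record-{k}") for k in keys[i:i + 2]] for i in range(0, len(keys), 2)]

# Dense index: one (key, block number, slot) entry for every record.
dense = [(key, b, s)
         for b, block in enumerate(blocks)
         for s, (key, _) in enumerate(block)]
dense_keys = [key for key, _, _ in dense]


def dense_lookup(key):
    """Return the record for key, or None if it does not exist."""
    i = bisect_left(dense_keys, key)
    if i == len(dense_keys) or dense_keys[i] != key:
        return None          # decided from the index alone, no data block is read
    _, b, s = dense[i]
    return blocks[b][s][1]   # exactly one data block is read


print(dense_lookup(40), dense_lookup(45))
```

The early return shows the point made above: existence can be decided without touching the data file.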
Accessing Data
Sparse Index
Many values of the indexed
attribute don’t appear
Less index space per record
Can keep more of index in memory
Better for insertions
Multilevel Indices
Build an index on an index
Level 2 Index -> Level 1 Index ->
Data File
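Sparse Index
[Figure: sparse index. Index entries 10, 30, 50, …, 230 each point to the first record of a data block; the blocks shown hold (10, 20), (30, 40), (50, 60), (70, 80), (90, 100).]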
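A sparse index and a second index level built on top of it can be sketched as follows. The block layout, the fan-out of 2 and the function names are assumptions chosen to keep the example small.

```python
from bisect import bisect_right

# Data file: records sorted by key, two records per block (assumed layout).
keys = list(range(10, 101, 10))
blocks = [[(k, f"record-{k}") for k in keys[i:i + 2]] for i in range(0, len(keys), 2)]

# Sparse (level 1) index: only the first key of each block appears.
level1 = [block[0][0] for block in blocks]      # [10, 30, 50, 70, 90]

# Level 2 index: a sparse index on the level 1 index itself.
FANOUT = 2                                      # assumed entries per index block
level2 = level1[::FANOUT]                       # [10, 50, 90]


def search_block(i, key):
    """Read data block i and look for key; the block must be read to decide."""
    for k, record in blocks[i]:
        if k == key:
            return record
    return None


def sparse_lookup(key):
    i = bisect_right(level1, key) - 1           # last block whose first key <= key
    return None if i < 0 else search_block(i, key)


def multilevel_lookup(key):
    g = bisect_right(level2, key) - 1           # which group of level 1 entries
    if g < 0:
        return None
    group = level1[g * FANOUT:(g + 1) * FANOUT]
    i = g * FANOUT + bisect_right(group, key) - 1
    return search_block(i, key)


print(sparse_lookup(40), sparse_lookup(45), multilevel_lookup(100))
```

Unlike the dense case, a missing key such as 45 is only discovered after the candidate data block has been read.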
B+ Tree
Use a tree model to hold data or
indices
Maintain balanced tree and aim
for a “bushy” shallow tree
[Figure: example B+ tree]
Root: (100)
Internal nodes: (30) and (120, 150, 180)
Leaves: (3, 5, 11) (30, 35) (100, 101, 110) (120, 130) (150, 156, 179) (180, 200)
B+ Tree
Rules:
If root is not a leaf, it must have at
least two children
For a tree of order n, each node (except the root)
must have between n/2 and n
pointers (children), with n/2 rounded up
For a tree of order n, the number of
key values in a leaf node must be
between (n-1)/2 and (n-1), with (n-1)/2 rounded up
B+ Tree (cont…)
Rules:
The number of key values
contained in a non-leaf node is 1
less than the number of pointers
The tree must always be balanced;
that is, every path from the root
node to a leaf must have the same
length
Leaf nodes are linked in order of
key values
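The rules can be illustrated with a small hand-built tree of order 4 that reuses the key values from the example figure. The sketch covers search and a range scan over the linked leaves only; insertion and deletion with node splitting are left out. The `Node` class and function names are assumptions made for the example, as is the convention that a key equal to a separator is found in the right subtree.

```python
from bisect import bisect_right


class Node:
    def __init__(self, keys, children=None):
        self.keys = keys            # sorted key values
        self.children = children    # list of child nodes, or None for a leaf
        self.next = None            # next leaf in key order (leaves only)


def find_leaf(node, key):
    """Descend from the root to the leaf that would hold key."""
    while node.children is not None:
        node = node.children[bisect_right(node.keys, key)]
    return node


def search(root, key):
    return key in find_leaf(root, key).keys


def range_scan(root, low, high):
    """Collect the keys in [low, high] by walking the linked leaves."""
    leaf, out = find_leaf(root, low), []
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return out
            if k >= low:
                out.append(k)
        leaf = leaf.next
    return out


def leaf_depths(node, depth=0):
    """Set of leaf depths; a balanced tree yields exactly one value."""
    if node.children is None:
        return {depth}
    return set().union(*(leaf_depths(c, depth + 1) for c in node.children))


leaves = [Node(k) for k in ([3, 5, 11], [30, 35], [100, 101, 110],
                            [120, 130], [150, 156, 179], [180, 200])]
for a, b in zip(leaves, leaves[1:]):
    a.next = b
root = Node([100], [Node([30], leaves[:2]), Node([120, 150, 180], leaves[2:])])

print(search(root, 156), search(root, 157))
print(range_scan(root, 100, 150))
print(leaf_depths(root))
```

The range scan shows why the leaves are linked: after one descent from the root, neighboring keys are reached without revisiting internal nodes.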
Hashing
Calculates the address of the page in
which the record is to be stored based
on one or more fields
Each hash points to a bucket
Hash function should evenly distribute
the records throughout the file
A good hash function assigns roughly the same
number of keys to each bucket
Keep keys sorted within buckets
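A minimal sketch of a static hash file follows. The `HashFile` class, the choice of `key % num_buckets` as the hash function and the integer keys are all assumptions made for illustration.

```python
from bisect import insort


class HashFile:
    def __init__(self, num_buckets):
        self.buckets = [[] for _ in range(num_buckets)]

    def _address(self, key):
        # Assumed hash function: works for integer keys only.
        return key % len(self.buckets)

    def insert(self, key, record):
        # insort keeps each bucket sorted by key.
        insort(self.buckets[self._address(key)], (key, record))

    def lookup(self, key):
        # Only one bucket is examined, however large the file is.
        return [rec for k, rec in self.buckets[self._address(key)] if k == key]


table = HashFile(num_buckets=10)
for key in range(100):
    table.insert(key, f"record-{key}")

sizes = [len(bucket) for bucket in table.buckets]
print(sizes)
print(table.lookup(42))
```

With these evenly spread keys every bucket receives the same number of records; a skewed key set with the same hash function would not distribute this well.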
Hashing
[Figure: key → h(key) → bucket of records]
Hashing
Types of hashing:
Extensible Hashing
Pro:
Handle growing files
Less wasted space
No full reorganizations
Con:
Uses indirection
Directory doubles in size
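The directory indirection and the doubling step can be seen in a compact sketch. This is one common textbook formulation under stated assumptions: it uses the low-order bits of Python's built-in `hash`, stores non-negative integer keys, and omits deletion. The class and method names are invented for the example.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth      # local depth: number of hash bits this bucket uses
        self.keys = []


class ExtensibleHash:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.global_depth = 0
        self.directory = [Bucket(0)]    # the indirection: slots point to buckets

    def _slot(self, key):
        return hash(key) & ((1 << self.global_depth) - 1)   # low-order bits

    def contains(self, key):
        return key in self.directory[self._slot(key)].keys

    def insert(self, key):
        while True:
            bucket = self.directory[self._slot(key)]
            if key in bucket.keys:
                return
            if len(bucket.keys) < self.capacity:
                bucket.keys.append(key)
                return
            self._split(bucket)         # only the full bucket is reorganized

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            self.directory = self.directory + self.directory   # directory doubles
            self.global_depth += 1
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        new_bit = 1 << (bucket.depth - 1)
        # Slots whose newly used bit is 1 are redirected to the sibling bucket.
        for i, b in enumerate(self.directory):
            if b is bucket and i & new_bit:
                self.directory[i] = sibling
        keys, bucket.keys = bucket.keys, []
        for k in keys:
            self.directory[self._slot(k)].keys.append(k)


table = ExtensibleHash(capacity=2)
for key in range(20):
    table.insert(key)
print(len(table.directory), "directory slots at global depth", table.global_depth)
```

A split touches one bucket and possibly the directory, never the whole file, which is the "no full reorganizations" advantage listed above.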
Hashing
Types of hashing:
Linear Hashing
Pro:
Handle growing files
Less wasted space
No full reorganizations
No indirection like extensible hashing
Con:
Still have overflow chains
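For comparison, here is a sketch of linear hashing, which grows one bucket at a time without a directory. The load threshold of 0.8, the nominal bucket capacity and the class name are assumptions; a Python list that grows past `capacity` stands in for an overflow chain.

```python
class LinearHash:
    def __init__(self, initial_buckets=2, capacity=2, max_load=0.8):
        self.n0 = initial_buckets
        self.level = 0              # number of completed doubling rounds
        self.next = 0               # next bucket to split in this round
        self.capacity = capacity    # nominal records per bucket
        self.max_load = max_load
        self.buckets = [[] for _ in range(initial_buckets)]
        self.count = 0

    def _address(self, key):
        a = hash(key) % (self.n0 * 2 ** self.level)
        if a < self.next:           # that bucket was already split this round
            a = hash(key) % (self.n0 * 2 ** (self.level + 1))
        return a

    def contains(self, key):
        return key in self.buckets[self._address(key)]

    def insert(self, key):
        # A bucket may exceed its capacity: that excess is the overflow chain.
        self.buckets[self._address(key)].append(key)
        self.count += 1
        if self.count / (len(self.buckets) * self.capacity) > self.max_load:
            self._split()

    def _split(self):
        # Split the bucket at `next`, which is not necessarily the full one.
        old = self.buckets[self.next]
        self.buckets[self.next] = []
        self.buckets.append([])
        self.next += 1
        if self.next == self.n0 * 2 ** self.level:
            self.level += 1
            self.next = 0
        for k in old:
            self.buckets[self._address(k)].append(k)


table = LinearHash()
for key in range(50):
    table.insert(key)
print(len(table.buckets), "buckets, longest holds", max(len(b) for b in table.buckets))
```

Because buckets are split in a fixed order rather than when they overflow, some buckets can temporarily hold more than their capacity, which is the remaining overflow-chain cost noted above.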
Indexing vs. Hashing
Hashing is good for:
Probes given specific key
SELECT * FROM R WHERE R.A = 5
[Figure: selection using a clustering index on attribute A]
Identify all tuples of R and S whose
Y value is y
- output all possible joins of the matching tuples
r ∈ R and s ∈ S
Example: Join of R and S sorted on Y
R(X, Y) S(Y, Z)
[Figure: sorted blocks of R and S are read into main memory and merged on Y]
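The merge phase of a sort-based join can be sketched in memory as follows. The sample tuples are illustrative values loosely modeled on the example and should be treated as assumptions; the sketch also ignores block-at-a-time I/O, which the cost analysis is about.

```python
def merge_join(r_sorted, s_sorted):
    """Join R(X, Y) with S(Y, Z); both inputs must already be sorted on Y."""
    out = []
    i = j = 0
    while i < len(r_sorted) and j < len(s_sorted):
        y_r, y_s = r_sorted[i][1], s_sorted[j][0]
        if y_r < y_s:
            i += 1                      # no partner in S for this R tuple
        elif y_r > y_s:
            j += 1                      # no partner in R for this S tuple
        else:
            y = y_r
            # Find the full group of tuples with Y value y on each side.
            i_end = i
            while i_end < len(r_sorted) and r_sorted[i_end][1] == y:
                i_end += 1
            j_end = j
            while j_end < len(s_sorted) and s_sorted[j_end][0] == y:
                j_end += 1
            # Output all possible joins of the matching tuples.
            for x, _ in r_sorted[i:i_end]:
                for _, z in s_sorted[j:j_end]:
                    out.append((x, y, z))
            i, j = i_end, j_end
    return out


R = [(1, "a"), (2, "c"), (3, "c"), (4, "d"), (5, "e")]      # (X, Y), sorted on Y
S = [("b", 1), ("c", 2), ("c", 3), ("c", 4), ("e", 5)]      # (Y, Z), sorted on Y
result = merge_join(R, S)
print(result)
```

Each input is scanned once, so after sorting the merge itself costs a single pass over both relations.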
Analysis of Sort-Based Two-Phase Join
Cost of scanning R:
B(R) block reads if R is clustered; T(R), if not
(B(R) = number of blocks of R, T(R) = number of tuples of R)
Questions? Comments?