
Chapter - 2 Query Processing


Advanced Database Systems (CoSc2072)

Chapter Two


Query Processing and Optimization: Outline
 Operator Evaluation Strategies
 Query processing in general
 Selection
 Join
 Query Optimization
 Heuristic query optimization
 Cost-based query optimization
 Query Tuning

Overview of Query Processing
 Query processing: The activities involved in parsing, validating,
optimizing, and executing a query.

 Aims

 To transform a query written in a high-level language, typically SQL,
into a correct and efficient execution strategy expressed in a low-level
language (implementing the relational algebra), and
 To execute the strategy to retrieve the required data.

Steps of Query Processing

1. Parsing and translation
2. Optimization
3. Evaluation

[Figure: a Query passes through the Parser to become a Relational Algebra Expression; optimization, using Statistics About Data, turns it into an Execution Plan; the Evaluation Engine runs the plan against the Data.]

The Query Execution Plan describes the steps and the order used to access or
modify data in the database.
Three Steps of Query Processing
 Parsing and translation
− Translate the query into its internal form (parse tree), which is then
translated into an expression of the relational algebra.
− The parser checks syntax and validates relations, attributes, and
access permissions.

 Evaluation: general guidelines for evaluating a query.
− The query-execution engine takes a query-evaluation plan, executes
that plan, and returns the answers to the query.
− An annotated expression specifying a detailed evaluation strategy is
called the execution plan (it includes the indexes to use, the join
algorithms, . . . ).

Query optimization:
The activity of choosing an efficient execution strategy for
processing a query.
 Task: Find an efficient physical query plan (aka execution plan) for
an SQL query
 Goal: Minimize the evaluation time for the query, i.e., compute the
query result as fast as possible.
 Cost factors: disk accesses and read/write operations (I/O, page
transfers); CPU time is typically ignored.
 Optimization: because a query can usually be evaluated in more than one
way, find the most efficient evaluation plan.

Note: All examples in this slide are done based on these tables.

Find all Managers who work at a London branch.
SELECT * FROM Staff s, Branch b WHERE s.branchNo = b.branchNo
AND (s.position = 'Manager' AND b.city = 'London');

The equivalent relational algebra queries corresponding to this

SQL statement are:

Different Strategies
 Assume:

 1000 tuples in staff; 50 tuples in branch;

 50 managers; 5 London branches;

 No indexes or sort keys;

 Results of any intermediate operations stored on disk;

 Cost of the final write is ignored;

 Tuples are accessed one at a time.

Cost Comparison
 Costs (in disk accesses) are:

(1) (1000 + 50) + 2*(1000 * 50) = 101 050

(2) 2*1000 + (1000 + 50) = 3 050

(3) 1000 + 2*50 + 5 + (50 + 5) = 1 160

Cartesian product and join operations are much more expensive than
selection, and the third option significantly reduces the size of the
relations being joined together.
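
The three totals can be reproduced with a short calculation. This is only a sketch: it assumes the three strategies are (1) a Selection over the Cartesian product, (2) a Selection over the join, and (3) a join of the two individually selected relations. The function and parameter names are illustrative, and the per-term reading in the comments is one interpretation of the formulas above.

```python
def strategy_costs(n_staff=1000, n_branch=50, n_managers=50, n_london=5):
    """Disk accesses for the three strategies, one tuple per access."""
    # (1) Read both relations, write the Cartesian product to disk,
    #     then read it back to apply the Selection.
    product_size = n_staff * n_branch
    cost1 = (n_staff + n_branch) + 2 * product_size

    # (2) Read both relations for the join, write the joined tuples
    #     (one per staff member), then read them back to select.
    cost2 = (n_staff + n_branch) + 2 * n_staff

    # (3) Select managers (read Staff, write result), select London
    #     branches (read Branch, write result), then read both small
    #     intermediate results to join them.
    cost3 = ((n_staff + n_managers)
             + (n_branch + n_london)
             + (n_managers + n_london))
    return cost1, cost2, cost3


print(strategy_costs())  # (101050, 3050, 1160)
```

Changing the default arguments shows how quickly strategy (1) grows with relation size, while strategy (3) grows only with the size of the filtered inputs.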

Phases of query processing

 Query Processing has four main phases.
1. Decomposition.
• Analysis.
• Normalization.
• Semantic Analysis.
• Simplification.
• Restructuring.
2. Optimization.
• Heuristics.
• Comparing costs.
3. Code Generation.
4. Execution.

 Query Decomposition
 Transform high-level query into RA query.

 Check that query is syntactically and semantically correct.

 Typical stages are:

 Analysis,

 Normalization,

 Semantic analysis,

 Simplification,

 Query restructuring.

 Analysis
 Analyze query lexically and syntactically using compiler techniques.
 Verify that relations and attributes exist.
 Verify that operations are appropriate for the object type.
SELECT staff_no FROM Staff WHERE position > 10;

 This query would be rejected on two grounds:
staff_no is not defined for the Staff relation (should be staffNo).
The comparison '>10' is incompatible with the type of position, which is a
variable character string.
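
The two checks can be sketched with a toy catalog lookup. Everything here is hypothetical: the `CATALOG` dictionary, the `analyze` function, and the way predicates are passed as (attribute, operator, constant) triples are assumptions made for illustration, not how any particular DBMS stores its schema.

```python
# Hypothetical system catalog: relation -> {attribute: accepted Python type(s)}
CATALOG = {
    "Staff": {"staffNo": str, "position": str, "salary": (int, float)},
}


def analyze(relation, select_list, predicates):
    """Return the list of analysis errors for a single-relation query."""
    if relation not in CATALOG:
        return [f"relation {relation} does not exist"]
    schema = CATALOG[relation]

    # Every attribute in the SELECT list must exist in the relation.
    errors = [f"{attr} is not defined for {relation}"
              for attr in select_list if attr not in schema]

    # Every comparison must use an existing attribute and a compatible constant.
    for attr, op, constant in predicates:
        if attr not in schema:
            errors.append(f"{attr} is not defined for {relation}")
        elif not isinstance(constant, schema[attr]):
            errors.append(
                f"comparison '{op} {constant}' is incompatible with type of {attr}")
    return errors


# SELECT staff_no FROM Staff WHERE position > 10;
errors = analyze("Staff", ["staff_no"], [("position", ">", 10)])
for message in errors:
    print(message)
```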

 Finally, the query is transformed into a query tree constructed as follows:
Leaf node for each base relation.
Non-leaf node for each intermediate relation produced by an RA operation.
Root of tree represents query result.
 Sequence is directed from leaves to root.


 Normalization
 Converts query into a normalized form for easier manipulation.

 Predicate can be converted into one of two forms:

 Conjunctive normal form:

(position = 'Manager' ∨ salary > 20000) ∧ (branchNo = 'B003')

 Disjunctive normal form:

(position = 'Manager' ∧ branchNo = 'B003') ∨ (salary > 20000 ∧ branchNo = 'B003')
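
The two normal forms of a predicate are logically equivalent; for the predicate on position, salary, and branchNo, the disjunctive form follows by distributing AND over OR. A brute-force truth table is a simple way to confirm this. The sketch below treats each atomic comparison as a Boolean variable; the function names are illustrative.

```python
from itertools import product


def cnf(is_manager, high_salary, is_b003):
    # (position = 'Manager' OR salary > 20000) AND branchNo = 'B003'
    return (is_manager or high_salary) and is_b003


def dnf(is_manager, high_salary, is_b003):
    # (position = 'Manager' AND branchNo = 'B003')
    #   OR (salary > 20000 AND branchNo = 'B003')
    return (is_manager and is_b003) or (high_salary and is_b003)


# Compare both forms on all 2^3 truth assignments.
equivalent = all(cnf(*values) == dnf(*values)
                 for values in product([False, True], repeat=3))
print(equivalent)  # True
```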


Semantic Analysis
 Rejects normalized queries that are incorrectly formulated or
contradictory.
 Query is incorrectly formulated if components do not contribute
to generation of result.
 Query is contradictory if its predicate cannot be satisfied by any
tuple.
 Algorithms to determine correctness exist only for queries that
do not contain disjunction and negation.

Semantically incorrect
 Components do not contribute in any way to the
generation of the result
 Only a subset of relational calculus queries can be tested
for correctness
● Those that do not contain disjunction and negation
● To detect
➠ connection graph (query graph)
➠ join graph

Relation connection graph
a. Create node for each relation and node for result.
b. Create edges between two nodes that represent a join.
c. Create edges between nodes that represent projection.
 If not connected, query is incorrectly formulated.
Example: SELECT p.propertyNo, p.street FROM Client c, Viewing v,
PropertyForRent p WHERE c.clientNo = v.clientNo AND c.maxRent >= 500
AND c.prefType = 'Flat' AND p.ownerNo = 'CO93';

 Relation connection graph is not fully connected, so the query is not
correctly formulated.
 The join condition (v.propertyNo = p.propertyNo) has been omitted.
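
The connectivity test for this example can be sketched as a breadth-first search over the relation connection graph. The helper `is_connected` and the edge lists are illustrative: one join edge comes from c.clientNo = v.clientNo, and one projection edge links PropertyForRent to the result node.

```python
from collections import deque


def is_connected(nodes, edges):
    """True if every node is reachable from the first one (undirected)."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    start = nodes[0]
    seen, queue = {start}, deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return seen == set(nodes)


nodes = ["Client", "Viewing", "PropertyForRent", "Result"]
edges = [("Client", "Viewing"),            # join: c.clientNo = v.clientNo
         ("PropertyForRent", "Result")]    # projection source

print(is_connected(nodes, edges))  # False: query is incorrectly formulated

# Adding the omitted join v.propertyNo = p.propertyNo connects the graph.
fixed_edges = edges + [("Viewing", "PropertyForRent")]
print(is_connected(nodes, fixed_edges))  # True
```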
Example 2
SELECT Ename, Resp FROM Emp, Works, Project WHERE
Emp.Eno = Works.Eno AND Works.Pno = Project.Pno AND
Pname = 'CAD/CAM' AND Dur > 36 AND Title = 'Programmer'

If the query graph is connected, the query is semantically correct.

 Simplification
1. Detects redundant qualifications,
2. Eliminates common sub-expressions,
3. Transforms query to a semantically equivalent but more easily and
efficiently computed form.

 Apply well-known transformation rules of Boolean algebra.

… "Programmer") AND (TITLE = "Programmer" OR
TITLE = "Electrical Eng.") AND NOT (TITLE = "Electrical Eng.")) OR
ENAME = "J.Doe"; is

equivalent to

 Query restructuring
 Convert SQL to relational algebra
 Make use of query trees
 Example: SELECT Ename FROM Emp, Works, Project
WHERE Emp.Eno = Works.Eno AND Works.Pno = Project.Pno
AND Ename <> 'J. Doe' AND Pname = 'CAD/CAM'
AND (Dur = 12 OR Dur = 24)

 Query tree:
 A tree data structure that corresponds to a relational algebra
expression.
 It represents the input relations of the query as leaf nodes of the tree,
and represents the relational algebra operations as internal nodes.
 Query graph:
 A graph data structure that corresponds to a relational calculus
expression.
 It does not indicate the order in which operations are performed.
 There is only a single graph corresponding to each query.

Transformation Rules for RA Operations
1. Conjunctive Selection operations can cascade into individual
Selection operations (and vice versa).

 Sometimes referred to as cascade of Selection.

2. Commutativity of Selection.

3. In a sequence of Projection operations, only the last in the
sequence is required.

4. Commutativity of Selection and Projection.

 If predicate p involves only attributes in the projection list L, Selection
and Projection operations commute:
ΠL(σp(R)) = σp(ΠL(R))

5. Commutativity of Theta join (and Cartesian product).

Rule also applies to Equijoin and Natural join.


6. Commutativity of Selection and Theta join (or Cartesian product)
 If selection predicate involves only attributes of one of the join
relations, Selection and Join (or Cartesian product) operations
commute (here p involves only attributes of R):
σp(R ⋈r S) = (σp(R)) ⋈r S
σp(R × S) = (σp(R)) × S

 If selection predicate is a conjunctive predicate having the form (p ∧ q),
where p only involves attributes of R, and q only attributes of S,
Selection and Theta join operations commute as:
σp∧q(R ⋈r S) = (σp(R)) ⋈r (σq(S))

7. Commutativity of Projection & Theta join (or Cartesian product)

8. Commutativity of Union & Intersection (but not Set difference)
9. Commutativity of Selection and set operations (Union,
Intersection, and Set difference).
σp(R ∪ S) = σp(S) ∪ σp(R)
σp(R ∩ S) = σp(S) ∩ σp(R)
σp(R − S) = σp(R) − σp(S)

10. Commutativity of Projection and Union.

ΠL(R ∪ S) = ΠL(S) ∪ ΠL(R)

11. Associativity of Union & Intersection (but not Set difference).

(R ∪ S) ∪ T = S ∪ (R ∪ T), (R ∩ S) ∩ T = S ∩ (R ∩ T)
12. Associativity of Theta join (and Cartesian product).

 Cartesian product and Natural join are always associative.
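
Several of these rules can be spot-checked on toy data. The sketch below models relations as Python sets of tuples and checks rules 1, 2, 6, and 9 on two small, made-up relations; the helper names `select` and `theta_join` are illustrative. A check on one data set is not a proof, only a sanity test.

```python
def select(pred, rel):
    """Selection: keep the tuples of rel that satisfy pred."""
    return {t for t in rel if pred(t)}


def theta_join(r, s, cond):
    """Theta join: pairs (t, u) from r x s that satisfy cond."""
    return {(t, u) for t in r for u in s if cond(t, u)}


# Hypothetical relations with schema (number, letter).
R = {(1, "x"), (2, "y"), (3, "x")}
S = {(2, "y"), (3, "x"), (4, "z")}
p = lambda t: t[0] > 1          # predicate on the number
q = lambda t: t[1] == "x"       # predicate on the letter
same_number = lambda t, u: t[0] == u[0]

checks = {
    # Rule 1: a conjunctive Selection cascades into individual Selections.
    "cascade": select(lambda t: p(t) and q(t), R) == select(p, select(q, R)),
    # Rule 2: Selections commute.
    "commute": select(p, select(q, R)) == select(q, select(p, R)),
    # Rule 6: push p to R and q to S below the join.
    "push_below_join": (
        {(t, u) for t, u in theta_join(R, S, same_number) if p(t) and q(u)}
        == theta_join(select(p, R), select(q, S), same_number)),
    # Rule 9: Selection distributes over Union, Intersection, Set difference.
    "union": select(p, R | S) == select(p, R) | select(p, S),
    "intersection": select(p, R & S) == select(p, R) & select(p, S),
    "difference": select(p, R - S) == select(p, R) - select(p, S),
}
print(checks)

# Set difference itself is not commutative, so operand order matters.
difference_commutes = (R - S) == (S - R)
print(difference_commutes)  # False
```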

2. Query Optimization
 DBMS has algorithms to implement relational algebra operations
 SQL is a different kind of high-level language; it specifies what is
wanted, not how it is obtained
 Optimization: not necessarily "optimal", but reasonably efficient
 Techniques:
 Heuristic rules
 Query tree (relational algebra) optimization
 Query graph optimization
 Cost estimation (comparing costs of different plans)
 Cost-based (physical) optimization
2.a. Heuristic based Processing Strategies
► Perform Selection operations as early as possible.
►Keep predicates on same relation together.
►Combine Cartesian product with subsequent Selection whose predicate
represents join condition into a Join operation.
►Use associativity of binary operations to rearrange leaf nodes so leaf
nodes with most restrictive Selection operations executed first.
►Perform Projection as early as possible.
►Keep projection attributes on same relation together.
►Compute common expressions once.
►If common expression appears more than once, and result not too
large, store result and reuse it when required.

 What are the names of customers living on Elm Street who have
checked out “Terminator”?
 SQL query:
SELECT Name FROM Customer CU, CheckedOut CH, Film F WHERE
Title = 'Terminator' AND F.FilmId = CH.FilmID AND CU.CustomerID =
CH.CustomerID AND CU.Street = 'Elm'

Apply Selections Early

Apply More Restrictive Selections Early

Form Joins

Apply Projections Early
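
The effect of these steps on the "Terminator" query can be sketched on toy data by comparing the largest intermediate result of two plans. The rows below are made up, and both functions are illustrative simplifications (plain Python loops rather than real operators); the point is only that selecting first keeps intermediate results small.

```python
from itertools import product

# Hypothetical rows: Customer(CustomerID, Name, Street),
# CheckedOut(CustomerID, FilmID), Film(FilmID, Title).
customers = [(1, "Ann", "Elm"), (2, "Bob", "Oak"), (3, "Cy", "Elm")]
checked_out = [(1, 10), (2, 10), (3, 20)]
films = [(10, "Terminator"), (20, "Alien")]


def canonical_plan():
    """Cartesian product of all three relations, then one big Selection."""
    rows = list(product(customers, checked_out, films))
    names = {cu[1] for cu, ch, f in rows
             if f[1] == "Terminator" and f[0] == ch[1]
             and cu[0] == ch[0] and cu[2] == "Elm"}
    return names, len(rows)


def heuristic_plan():
    """Selections and projections first, then the joins."""
    terminator_ids = {fid for fid, title in films if title == "Terminator"}
    elm_names = {cid: name for cid, name, street in customers
                 if street == "Elm"}
    matches = [(cid, fid) for cid, fid in checked_out
               if fid in terminator_ids and cid in elm_names]
    largest = max(len(terminator_ids), len(elm_names), len(matches))
    return {elm_names[cid] for cid, _ in matches}, largest


print(canonical_plan())  # ({'Ann'}, 18)
print(heuristic_plan())  # ({'Ann'}, 2)
```

Both plans return the same answer; the second never builds an intermediate result larger than two rows.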

Cost-Based Optimization
 Use transformations to generate multiple candidate query trees from
the canonical query tree.
 Statistics on the inputs to each operator are needed.
 Statistics on leaf relations are stored in the system catalog.
 Statistics on intermediate relations must be estimated; the most
important are the relations' cardinalities.
 Cost formulas estimate the cost of executing each operation in each
candidate query tree.
 Dependent on the specific algorithm used by the operator.
 Cost can be CPU time, I/O time, communication time, main
memory usage, or a combination.
 The candidate query tree with the least total cost is
selected for execution.

Example: Cost Estimation

Operation 3: σ followed by a π

Measures of Query Cost
 There are many possible ways to estimate cost, e.g., based on
disk accesses, CPU time, or communication overhead.
 Disk access is the predominant cost (in terms of time) and is relatively
easy to estimate; therefore, the number of block transfers from/to disk
is typically used as the measure.
 Simplifying assumption: each block transfer has the same cost.
 Cost of an algorithm (e.g., for join or selection) depends on the database
buffer size; more memory for the DB buffer reduces disk accesses.
 Thus DB buffer size is a parameter for estimating cost.
 We refer to the cost estimate of algorithm S as cost(S).
 We do not consider the cost of writing output to disk.
Using Selectivity and Cost Estimates in
Query Optimization (2)
 Cost Components for Query Execution
1. Access cost to secondary storage
2. Storage cost
3. Computation cost
4. Memory usage cost
5. Communication cost

 Note: Different database systems may focus on different

cost components.

Using Selectivity and Cost Estimates in
Query Optimization (3)
 Catalog Information Used in Cost Functions
 Information about the size of a file
 number of records (tuples) (r),
 record size (R),
 number of blocks (b)
 blocking factor (bfr)
 Information about indexes and indexing attributes of a file
 Number of levels (x) of each multilevel index
 Number of first-level index blocks (bI1)
 Number of distinct values (d) of an attribute
 Selectivity (sl) of an attribute
 Selection cardinality (s) of an attribute. (s = sl * r)
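
How these catalog values relate can be sketched numerically. The block size is an extra parameter not listed above, and taking the selectivity of an equality condition as 1/d assumes values are uniformly distributed; both are assumptions of this sketch, as are the function and parameter names.

```python
import math


def file_stats(r, record_size, block_size, d):
    """Derive bfr, b, sl and s from basic catalog values.

    r: number of records, record_size: R in bytes,
    block_size: bytes per block (assumed parameter),
    d: number of distinct values of the attribute.
    """
    bfr = block_size // record_size   # records that fit in one block
    b = math.ceil(r / bfr)            # blocks needed for the file
    sl = 1 / d                        # equality selectivity, uniform assumption
    s = sl * r                        # selection cardinality: s = sl * r
    return bfr, b, sl, s


bfr, b, sl, s = file_stats(r=10000, record_size=100, block_size=2000, d=50)
print(bfr, b, sl, round(s))
```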

Selection Operation

σA=a(R) where a is a constant value, A an attribute of R

File scan: search algorithms that locate and retrieve records that satisfy
a selection condition

S1 - Linear search
cost(S1) = bR (the number of blocks of R)

S2 - Binary search, i.e., the file is ordered on attribute A
(primary index)


Query Evaluation

 How to evaluate individual relational operation?

Selection: find a subset of rows in a table

Join: connecting tuples from two tables

Other operations: union, projection, …

 How to estimate cost of individual operation?

 How does available buffer affect the cost?

 How to evaluate a relational algebraic expression?

Cost of Operations

 Cost = I/O cost + CPU cost

 I/O cost: # pages (reads & writes) or # operations (multiple pages)

 CPU cost: # comparisons or # tuples processed

 I/O cost dominates (for large databases)

 Cost depends on
 Types of query conditions

 Availability of fast access paths

 DBMSs keep statistics for cost estimation


Notation
 Used to describe the cost of operations.

 Relations: R, S

 nR: # tuples in R, nS: # tuples in S

 bR: # pages in R

 dist(R.A) : # distinct values in R.A

 min(R.A) : smallest value in R.A

 max(R.A) : largest value in R.A

 HI: # index pages accessed (B+ tree height?)

Simple Selection
 Simple selection: σA op a(R)
 A is a single attribute, a is a constant, op is one of =, ≠, <, ≤, >, ≥.
 ≠ is not discussed further because it requires a sequential scan of
the relation.
How many tuples will be selected?
 Selectivity Factor (SFA op a(R)): Fraction of tuples of R satisfying
"A op a"
 0 ≤ SFA op a(R) ≤ 1
# tuples selected: NS = nR × SFA op a(R)

Options of Simple Selection
Sequential (linear) Scan
 General condition: cost = bR
 Equality on key: average cost = bR / 2
Binary Search
 Records are stored in sorted order
 Equality on key: cost = log2(bR)
 Equality on non-key (duplicates allowed)
cost = log2(bR) + NS/bfR - 1
= sorted search time + selected – first one

Example: Cost of Selection
Relation: R(A, B, C)
nR = 10000 tuples
bfR = 20 tuples/page
dist(A) = 50, dist(B) = 500
B+ tree clustering index on A with order 25 (p=25)
B+ tree secondary index on B w/ order 25
 select * from R where A = a1 and B = b1
Relational Algebra: σA=a1 ∧ B=b1 (R)

Example: Cost of Selection (cont.)
Option 1: Sequential Scan
 Have to go thru the entire relation
 Cost = bR = 10000/20 = 500
Option 2: Binary Search using A = a1
 It is sorted on A (why?)
 NS = 10000/50 = 200
 assuming equal distribution
 Cost = log2(bR) + NS/bfR - 1
= log2(500) + 200/20 - 1 ≈ 9 + 10 - 1 = 18
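
The two options can be reproduced with small cost functions. This is a sketch: the function names are illustrative, and rounding log2(bR) and NS/bfR up to whole pages is an assumption made here so that the result is a page count.

```python
import math


def linear_scan_cost(b_r, key_equality=False):
    """Pages read by a sequential scan; half on average for equality on a key."""
    return b_r / 2 if key_equality else b_r


def binary_search_cost(b_r, n_selected=1, bf_r=1):
    """Pages read on a sorted file: search cost plus the extra pages of matches."""
    search = math.ceil(math.log2(b_r))           # find the first match
    match_pages = math.ceil(n_selected / bf_r)   # pages holding all matches
    return search + match_pages - 1              # first page already counted


n_r, bf_r, dist_a = 10000, 20, 50
b_r = math.ceil(n_r / bf_r)    # 500 pages
n_s = n_r // dist_a            # 200 tuples, assuming equal distribution

print(linear_scan_cost(b_r))               # 500
print(binary_search_cost(b_r, n_s, bf_r))  # 18
```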

Cost of Join

Cost = # I/O reading R & S + # I/O writing result
Additional notation:
 M: # buffer pages available to join operation
 LB: # leaf blocks in B+ tree index
Limitation of cost estimation
 Ignoring CPU costs
 Ignoring timing
 Ignoring double buffering requirements

Estimate Size of Join Result

How many tuples in join result?

 Cross product (special case of join)
NJ = nR × nS
 R.A is a foreign key referencing S.B
NJ = nR (assume no null value)
 S.B is a foreign key referencing R.A
NJ = nS (assume no null value)
 Both R.A & S.B are non-key

NJ = min(nR × nS / dist(R.A), nR × nS / dist(S.B))
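
The cases above translate directly into a few small estimators. The function names and the sample statistics are hypothetical; the formulas are the ones listed for cross product, foreign-key join, and non-key join.

```python
def cross_product_size(n_r, n_s):
    """Every tuple of R pairs with every tuple of S."""
    return n_r * n_s


def foreign_key_join_size(n_referencing):
    """Each referencing tuple matches exactly one tuple (no nulls assumed)."""
    return n_referencing


def non_key_join_size(n_r, n_s, dist_ra, dist_sb):
    """Both join attributes are non-key: take the smaller of the two estimates."""
    return min(n_r * n_s / dist_ra, n_r * n_s / dist_sb)


# Hypothetical statistics.
n_r, n_s = 10000, 2000
print(cross_product_size(n_r, n_s))           # 20000000
print(foreign_key_join_size(n_r))             # 10000
print(non_key_join_size(n_r, n_s, 50, 100))   # 200000.0
```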
Estimate Size of Join Result (cont.)
How wide is a tuple in join result?
 Natural join: W = W(R) + W(S) - W(S ∩ R)
 Theta join: W = W(R) + W(S)
What is blocking factor of join result?
 bfJoin = block size / W
How many blocks does join result have?
 bJoin = NJ / bfJoin

Query Execution Plans
 An execution plan for a relational algebra query consists of a
combination of the relational algebra query tree and information
about the access methods to be used for each relation as well as
the methods to be used in computing the relational operators
stored in the tree.
 Materialized evaluation: the result of an operation is stored as a
temporary relation.
 Pipelined evaluation: as the result of an operator is produced, it
is forwarded to the next operator in sequence
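
The difference between the two evaluation styles can be sketched with Python lists and generators: a list plays the role of a temporary relation, while chained generators pass each tuple to the next operator as soon as it is produced. The operator functions and the sample rows are illustrative only.

```python
def scan(rows):
    """Leaf operator: yield the tuples of a stored relation one by one."""
    for row in rows:
        yield row


def select(pred, rows):
    """Selection: pass on only the tuples that satisfy pred."""
    for row in rows:
        if pred(row):
            yield row


def project(columns, rows):
    """Projection: keep only the listed column positions."""
    for row in rows:
        yield tuple(row[c] for c in columns)


# Hypothetical Staff rows: (staffNo, position, city).
staff = [("S1", "Manager", "London"),
         ("S2", "Assistant", "London"),
         ("S3", "Manager", "Glasgow")]
is_manager = lambda row: row[1] == "Manager"

# Materialized: each operator's full result is stored before the next runs.
temp1 = list(select(is_manager, scan(staff)))
materialized = list(project([0], temp1))

# Pipelined: operators are chained; no intermediate relation is stored.
pipelined = list(project([0], select(is_manager, scan(staff))))

print(materialized == pipelined)  # True
```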

Query Tuning
 Monitoring or revising the query to increase throughput or
to lower response time for time-critical applications.
 Having to tune queries is a fact of life.
 Query tuning has a localized effect and is thus relatively
 It is a time-consuming and specialized task.
 It makes the queries harder to understand.
 However, it is often a necessity.
 This is not likely to change any time soon.

Assignment one
 Using the heuristic algorithm, optimize the following SQL query.


AND BDATE > '1957-12-31';

