Chapter 1 Query Processing
Objectives
At the end of this chapter you will be able to:
Define query processing and query optimization.
Apply transformation rules, heuristics, and cost estimation rules to improve the efficiency of a query.
12/23/2023 2
1.1. Overview of Query Processing
Query processing is the set of activities involved in parsing, validating, optimizing, and executing a query.
Aims
To transform a query written in a high-level language, typically SQL, into a correct and efficient execution strategy
expressed in a low-level language (implementing the relational algebra), and to execute that strategy to retrieve the required data.
Example: select * from customer where custid > 101 and custid < 300;
Query Processing steps
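[Figure: the query enters the parser and translator, which produce a relational algebra expression; the optimizer, using statistics about the data, produces an execution plan; the evaluation engine runs the plan against the data and returns the query output.]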
Three Steps of Query Processing
1) Parsing and translation
The system first translates the query into its internal form, which is then translated into relational algebra; it also
verifies that the relations named in the query exist.
The parser checks the syntax of a statement such as select * from customer having salary > 1000; the translator
checks schema elements such as attributes and relations, and also converts the SQL into a relational algebra (RA) expression.
2) Optimization
The optimizer finds the most efficient evaluation plan for a query, because
there is usually more than one way to execute it.
3) Evaluation
The query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query.
Phases of query processing
1.2. Translating SQL Queries into Relational Algebra and Other Operators
• SQL
• Query language used in most RDBMSs
• Query decomposed into query blocks
• Basic units that can be translated into the algebraic operators
• Contains single SELECT-FROM-WHERE expression
• May contain GROUP BY and HAVING clauses
Translating SQL Queries (cont’d...)
• Example:
• Inner block
• Outer block
Translating SQL Queries (cont’d.)
• Example (cont’d.)
• Inner block translated into:
1.3. Query Decomposition
Aim
• transform high-level query into RA query.
• check that query is syntactically and semantically correct.
Typical stages are:
a. Analysis,
b. Normalization,
c. Semantic analysis,
d. Simplification,
e. Query restructuring.
1.a. Analysis
Analyze query lexically and syntactically using compiler techniques.
Verify relations and attributes exist.
Verify that operations are appropriate for the object type.
• Example
SELECT staff_no FROM Staff WHERE position > 10;
• This query would be rejected on two grounds:
• staff_no is not defined for the Staff relation (it should be staffNo).
• The comparison '> 10' is incompatible with the type of position, which is a variable-length character string.
1.a. Analysis(cont’d...)
Finally, the query is transformed into a query tree, constructed as follows:
Leaf node for each base relation.
Non-leaf node for each intermediate relation produced by RA operation.
Root of tree represents query result.
• Sequence is directed from leaves to root.
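The construction rules for a query tree can be sketched in a few lines of Python. This is only an illustration: the `Node` class, the `evaluation_order` helper, and the sample query (a projection over a selection on Staff) are assumptions made for the example and are not part of the chapter.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a query tree: a base relation (leaf) or an RA operation."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def evaluation_order(node: Node) -> List[str]:
    """Post-order walk: children before parents, i.e. from leaves to root."""
    order: List[str] = []
    for child in node.children:
        order.extend(evaluation_order(child))
    order.append(node.label)
    return order


# Hypothetical tree for: SELECT staffNo FROM Staff WHERE position = 'Manager'
staff = Node("Staff")                           # leaf: base relation
select = Node("σ position='Manager'", [staff])  # intermediate relation
root = Node("π staffNo", [select])              # root: query result

print(evaluation_order(root))
```

The post-order walk lists the base relation first and the root last, which matches the direction of the sequence from leaves to root.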
1.b. Normalization
Converts query into a normalized form for easier manipulation.
A predicate can be converted into one of two forms:
Conjunctive normal form:
(position = 'Manager' ∨ salary > 20000) ∧ (branchNo = 'B003')
Disjunctive normal form:
(position = 'Manager' ∧ branchNo = 'B003') ∨ (salary > 20000 ∧ branchNo = 'B003')
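A quick way to see that the two normal forms express the same predicate is to compare them over every truth assignment. The sketch below assumes the conjunctive form reads (position = 'Manager' OR salary > 20000) AND branchNo = 'B003', and the disjunctive form reads (position = 'Manager' AND branchNo = 'B003') OR (salary > 20000 AND branchNo = 'B003'); the function and parameter names are illustrative.

```python
from itertools import product


def cnf(is_manager: bool, high_salary: bool, in_b003: bool) -> bool:
    """Conjunctive normal form: a conjunction of disjunctions."""
    return (is_manager or high_salary) and in_b003


def dnf(is_manager: bool, high_salary: bool, in_b003: bool) -> bool:
    """Disjunctive normal form: a disjunction of conjunctions."""
    return (is_manager and in_b003) or (high_salary and in_b003)


def equivalent() -> bool:
    """Compare both forms on all 2^3 combinations of the three conditions."""
    return all(cnf(*values) == dnf(*values)
               for values in product([False, True], repeat=3))


print(equivalent())
```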
1.c. Semantic Analysis
Rejects normalized queries that are incorrectly formulated or contradictory.
Algorithms to determine correctness exist only for queries that do not contain disjunction and
negation.
For these queries (no disjunction and no negation), two graphs can be constructed:
1. A relation connection graph.
2. Normalized attribute connection graph.
1.4 Optimization Process
Query optimization:
The activity of choosing an efficient execution strategy for processing a query.
Aim
To choose the execution strategy that minimizes resource usage.
Generally, we try to reduce the total execution time of the query, which is the sum of
the execution times of all individual operations that make up the query.
Disk access tends to be the dominant cost in query processing for a centralized DBMS.
Query Optimization (QO)
Generally
SQL is a declarative high-level language: it specifies what is wanted, not how it is obtained.
a. Heuristic rules
► Combine a Cartesian product with a subsequent Selection whose predicate represents a join condition into a Join operation.
► Use associativity of binary operations to rearrange leaf nodes so that the leaf nodes with the most restrictive Selection operations are executed first.
Example Query
• select e.lname, e.fname, w.pno, w.hours
from employee e, works_on w
where e.ssn = w.essn and w.hours > 20;
Find the relational algebra query tree for the above SQL query.
[Query tree: root π e.lname, e.fname, w.pno, w.hours; leaf nodes employee e and works_on w]
Cont...
• Apply the heuristic that breaks a conjunctive selection into a cascade of selections.
[Query tree: leaf nodes employee e and works_on w]
Cont...
• Apply the heuristic that combines a selection and a cross product into a join.
[Query tree: leaf nodes employee e and works_on w]
Cont...
Apply the heuristic that pushes projections through the join.
[Query tree: root π e.lname, e.fname, w.pno, w.hours]
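One plausible sequence of trees for the example query, written as relational algebra expressions, is sketched here. It is a reconstruction under assumptions: L abbreviates the projection list e.lname, e.fname, w.pno, w.hours, and the step that moves the selection on w.hours down to works_on is folded into the join step.

```latex
\begin{align*}
T_0 &= \pi_{L}\bigl(\sigma_{e.ssn = w.essn \,\land\, w.hours > 20}(employee\ e \times works\_on\ w)\bigr)
  && \text{initial tree} \\
T_1 &= \pi_{L}\bigl(\sigma_{e.ssn = w.essn}(\sigma_{w.hours > 20}(employee\ e \times works\_on\ w))\bigr)
  && \text{cascade of selections} \\
T_2 &= \pi_{L}\bigl(employee\ e \bowtie_{e.ssn = w.essn} \sigma_{w.hours > 20}(works\_on\ w)\bigr)
  && \text{selection and product combined into a join} \\
T_3 &= \pi_{L}\bigl(\pi_{lname,\,fname,\,ssn}(employee\ e) \bowtie_{e.ssn = w.essn}
       \pi_{essn,\,pno,\,hours}(\sigma_{w.hours > 20}(works\_on\ w))\bigr)
  && \text{projections pushed through the join}
\end{align*}
```

Each step keeps the result unchanged while shrinking the intermediate relations: the selection on hours removes tuples before the join, and the inner projections drop attributes that neither the join condition nor the final projection needs.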
Using Heuristics in Query Optimization
1. The main heuristic is to apply first the operations that reduce the size of intermediate results.
2. Perform select operations as early as possible to reduce the number of tuples and perform project operations as early as possible to reduce the number of attributes. (This is done by moving select and project operations as far down the tree as possible.)
3. The select and join operations that are most restrictive should be executed before other similar operations. (This is done by reordering the leaf nodes of the tree among themselves and adjusting the rest of the tree appropriately.)
2.b. Cost Estimation for Relational Algebra Operations
There are many different ways of implementing RA operations.
An Example
Query:
Select B, D From R, S
Where R.A = 'c' and S.E = 2 and R.C = S.C;
Find the answer (B, D).
R
A  B  C
a  1  10
b  1  20
c  2  25
d  2  10
e  3  26
S
C   D  E
15  x  2
25  y  2
32  y  3
10  z  1
Answer
B  D
2  y
An Example (cont.)
Plan 1
• Cross product of R & S
• Select tuples using WHERE conditions
• Project on B & D
Algebra expression (query tree): π B,D (σ R.A = 'c' ∧ S.E = 2 ∧ R.C = S.C (R × S))
An Example (cont.)
Plan 2
• Select R tuples with R.A=“c”
• Select S tuples with S.E=2
• Natural join
• Project B & D
Algebra expression (query tree): π B,D ((σ R.A = 'c' (R)) ⋈ (σ S.E = 2 (S)))
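The two plans can be compared with a small Python sketch that runs both on the example data. The tuples are the ones read from the example tables for R(A, B, C) and S(C, D, E); the function names and the choice to report the largest intermediate result are assumptions made for illustration.

```python
from itertools import product

# Tuples as read from the example tables: R(A, B, C) and S(C, D, E).
R = [("a", 1, 10), ("b", 1, 20), ("c", 2, 25), ("d", 2, 10), ("e", 3, 26)]
S = [(15, "x", 2), (25, "y", 2), (32, "y", 3), (10, "z", 1)]


def plan1():
    """Cross product first, then selection, then projection on B and D."""
    cross = list(product(R, S))
    selected = [(r, s) for r, s in cross
                if r[0] == "c" and s[2] == 2 and r[2] == s[0]]
    answer = [(r[1], s[1]) for r, s in selected]
    return answer, len(cross)  # largest intermediate result: the product


def plan2():
    """Selections first, then natural join on C, then projection."""
    r_sel = [r for r in R if r[0] == "c"]
    s_sel = [s for s in S if s[2] == 2]
    joined = [(r, s) for r in r_sel for s in s_sel if r[2] == s[0]]
    answer = [(r[1], s[1]) for r, s in joined]
    return answer, max(len(r_sel), len(s_sel), len(joined))


print(plan1())  # ([(2, 'y')], 20)
print(plan2())  # ([(2, 'y')], 2)
```

Both plans return the same answer, but Plan 1 materializes 20 tuples while the largest intermediate result in Plan 2 has 2, which is why the selections are pushed below the join.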
Measures of Query Cost
Disk access
CPU cycles
• Transit time in the network (distributed systems)
• CPU cost is difficult to calculate and is insignificant compared with disk access cost.
• Consider only disk access
Disk Access Cost:
• No. of seeks (N); random I/O costs more than sequential I/O
• Cost = N * avg seek time
• No. of blocks read (N)
• Cost = N * avg block read cost
• No. of blocks written (N)
• Cost = N * avg block write cost
• Cost of writing >> cost of reading, because once a block is written it has to be read back to check that it was written correctly.
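The three components of disk access cost can be combined into one estimate. The sketch below assumes the total is the plain sum of the three products listed above; the function name and the timings (in microseconds) are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def disk_access_cost(seeks: int, blocks_read: int, blocks_written: int,
                     avg_seek: int, avg_read: int, avg_write: int) -> int:
    """Sum of seek, read, and write costs (all timings in the same unit)."""
    return (seeks * avg_seek
            + blocks_read * avg_read
            + blocks_written * avg_write)


# Hypothetical timings in microseconds: 4000 per seek, 100 per block read,
# 200 per block write (writes cost more because of the read-back check).
scan_cost = disk_access_cost(seeks=10, blocks_read=500, blocks_written=0,
                             avg_seek=4000, avg_read=100, avg_write=200)
print(scan_cost)  # 90000
```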
Query Evaluation Process
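[Figure: inside the DBMS, the query passes through the scanner and parser to produce an internal representation; the optimizer chooses among execution strategies and produces an execution plan; the code generator turns the plan into code, which the runtime database processor executes against the data to return the answer.]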
Query Evaluation
Cost of Operations
Cost = I/O cost + CPU cost
• I/O cost: # pages (reads & writes) or # operations (multiple pages)
• CPU cost: # comparisons or # tuples processed
• I/O cost dominates (for large databases)
Cost depends on
• Types of query conditions
• Availability of fast access paths
DBMSs keep statistics for cost estimation.
Notations
Used to describe the cost of operations.
Relations: R, S
nR: # tuples in R,
nS: # tuples in S
bR: # pages in R
Options of Simple Selection
Sequential (linear) Scan
• General condition: cost = bR
• Equality on key: average cost = bR / 2
Binary Search
• Records are stored in sorted order
• Equality on key: cost = log2(bR)
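• Equality on non-key (duplicates allowed):
cost = log2(bR) + NS/bfR - 1
= sorted search time + pages holding the selected tuples - the first page (already counted in the search)
where NS = # selected tuples and bfR = # tuples per page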
Example: Cost of Selection
Relation: R(A, B, C)
nR = 10,000 tuples
bfR = 20 tuples/page
dist(A) = 50, dist(B) = 500
B+ tree clustering index on A with order 25 (p=25)
B+ tree secondary index on B with order 25
Query:
select * from R where A = a and B = b1
Find the relational algebra expression.
Relational algebra expression: σ A = a ∧ B = b1 (R)
Example: Cost of Selection (cont’d...)
• Option 1: Sequential (linear) scan
• Have to go through the entire relation
• Cost = bR = nR/bfR = 10,000/20 = 500
• Option 2: Binary search using A = a
• R is sorted on A
• NS = nR/dist(A) = 10,000/50 = 200
• assuming a uniform distribution of values
• Cost = log2(bR) + NS/bfR - 1
= log2(500) + 200/20 - 1 ≈ 9 + 10 - 1 = 18
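The arithmetic for Options 1 and 2 can be checked with a short Python sketch. It assumes that page counts and the binary-search depth are rounded up to whole pages, which is how log2(500) ≈ 8.97 becomes 9; the variable names mirror the notation of the example.

```python
import math

n_R = 10_000   # tuples in R
bf_R = 20      # tuples per page
dist_A = 50    # distinct values of A

# Option 1: linear scan reads every page of R.
b_R = math.ceil(n_R / bf_R)
linear_cost = b_R

# Option 2: binary search on the sorted attribute A.
NS = n_R // dist_A  # selected tuples, assuming a uniform distribution
binary_cost = math.ceil(math.log2(b_R)) + math.ceil(NS / bf_R) - 1

print(linear_cost)  # 500
print(binary_cost)  # 18
```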
Example: Cost of Selection (cont.)
• Option 3: Use the index on R.A:
• Average order of the B+ tree
= (P + 0.5P)/2 = 18.75 ≈ 19
• Leaf nodes have 18 entries, internal nodes have 19 pointers
• # leaf nodes = 50/18 ≈ 3 (rounded up)
• # nodes next level = 1
• HI = 2
• Clustering index: Cost = HI + NS/bfR
= 2 + 200/20 = 12
Example: Cost of Selection (cont.)
• Option 4: Use the index on R.B
• Average order = 19
• NS = 10,000/500 = 20
• Use Option I (allow duplicate keys)
• # nodes 1st level = 10,000/18 ≈ 556 (leaf)
• # nodes 2nd level = 556/19 ≈ 30 (internal, rounded up)
• # nodes 3rd level = 30/19 ≈ 2 (internal)
• # nodes 4th level = 1
• HI = 4
• Secondary index: Cost = HI + NS
= 4 + 20 = 24
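The index heights used in Options 3 and 4 can be derived mechanically. The sketch below is an illustration under assumptions: `btree_height` is a hypothetical helper that rounds every level up to whole nodes, the clustering index on A is taken to hold one entry per distinct value (50), and the secondary index on B is taken to hold one entry per tuple (10,000).

```python
import math


def btree_height(n_entries: int, leaf_capacity: int, fanout: int) -> int:
    """Number of levels needed when every level is rounded up to whole nodes."""
    nodes = math.ceil(n_entries / leaf_capacity)  # leaf level
    height = 1
    while nodes > 1:
        nodes = math.ceil(nodes / fanout)  # one level closer to the root
        height += 1
    return height


n_R, bf_R = 10_000, 20
dist_A, dist_B = 50, 500
leaf_entries, pointers = 18, 19  # average occupancy for order 25

# Option 3: clustering index on A, matching tuples are stored contiguously.
hi_A = btree_height(dist_A, leaf_entries, pointers)
cost_clustering = hi_A + math.ceil((n_R // dist_A) / bf_R)

# Option 4: secondary index on B, one page access per matching tuple.
hi_B = btree_height(n_R, leaf_entries, pointers)
cost_secondary = hi_B + n_R // dist_B

print(hi_A, cost_clustering)  # 2 12
print(hi_B, cost_secondary)   # 4 24
```

The clustering index wins even though the condition on A selects ten times as many tuples as the condition on B, because its matching tuples share pages.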
Estimate Size of Join Result
• How many tuples are in the join result?
• Cross product (special case of join)
NJ = nR × nS
• R.A is a foreign key referencing S.B
NJ = nR (assuming no null values)
• S.B is a foreign key referencing R.A
NJ = nS (assuming no null values)
• Both R.A and S.B are non-key
NJ = min(nR × nS / dist(R.A), nR × nS / dist(S.B))
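The join-size estimates can be written as three small functions. This sketch reads the non-key case as NJ = min(nR × nS / dist(R.A), nR × nS / dist(S.B)); the function names and the sample sizes are hypothetical.

```python
def cross_product_size(n_r: int, n_s: int) -> int:
    """Every tuple of R pairs with every tuple of S."""
    return n_r * n_s


def foreign_key_join_size(n_referencing: int) -> int:
    """Each referencing tuple matches exactly one tuple (no null values)."""
    return n_referencing


def non_key_join_size(n_r: int, n_s: int, dist_ra: int, dist_sb: int) -> float:
    """Take the smaller of the two estimates based on distinct values."""
    return min(n_r * n_s / dist_ra, n_r * n_s / dist_sb)


# Hypothetical sizes: 10,000 tuples in R, 2,000 in S.
print(cross_product_size(10_000, 2_000))          # 20000000
print(foreign_key_join_size(10_000))              # 10000
print(non_key_join_size(10_000, 2_000, 50, 100))  # 200000.0
```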
Basic Algorithms for Executing Query Operations
Consider only single table queries
• Three categories:
1. simple SELECT: one condition, no AND or OR
2. conjunctive SELECT : multiple conditions, connected by AND
3. disjunctive SELECT : multiple conditions, connected by OR
1. Simple SELECT or methods for simple selection
1.1. Linear search (brute force algorithm):
• algorithm:
• Retrieve every record in the file
• test whether its attribute values satisfy the selection condition
• works when:
• always works
• best on small files
• only choice when no indexes or ordering
• cost
• average case: b/2
• worst case: b
where b = # blocks in file, s = # selected records
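A minimal sketch of linear search shows where the two cost figures come from. It models a file as a list of blocks, each block a list of records; the `linear_search` function and its `unique` flag (stop at the first match, as for an equality test on a key) are assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def linear_search(blocks: List[list], predicate: Callable[[object], bool],
                  unique: bool = False) -> Tuple[list, int]:
    """Scan blocks in order; return matching records and blocks read."""
    matches = []
    blocks_read = 0
    for block in blocks:
        blocks_read += 1  # one disk access per block
        for record in block:
            if predicate(record):
                matches.append(record)
                if unique:  # key attribute: no further match is possible
                    return matches, blocks_read
    return matches, blocks_read


blocks = [[1, 2], [3, 4], [5, 6], [7, 8]]  # b = 4 blocks

key_hit = linear_search(blocks, lambda r: r == 3, unique=True)
all_even = linear_search(blocks, lambda r: r % 2 == 0)

print(key_hit)   # ([3], 2)
print(all_even)  # ([2, 4, 6, 8], 4)
```

A search on a key can stop early, which gives b/2 block reads on average; any other condition has to read all b blocks.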
Simple SELECT(cont’d...)
1.3. Primary index to retrieve a single record:
• algorithm:
• look up record using primary index
• works when:
• selection condition is equality test on key attribute with primary index
• cost:
x + 1 , where x = # index levels or height of the tree
1.4. Clustering index to retrieve multiple records:
• algorithm:
• use the clustering index to retrieve all the records satisfying the selection condition
• works when:
• selection condition is an equality comparison on a non-key attribute with a clustering index
• cost:
3. Disjunctive SELECT (logical OR)
• Disjunctive selects are much harder to optimize
• no single condition can be used to 'pre-filter' the results
• the result is the union of the results of the individual conditions
• the best you can do is to optimize each individual condition, then compute the union
example:
select * from EMPLOYEE
where DNO = 3 or SALARY > 80000 or SEX = 'F';
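The union strategy for a disjunctive select can be sketched as follows. Each condition is evaluated on its own (in a real system, possibly through a different access path) and the sets of matching record identifiers are united, so a record that satisfies several conditions appears once. The sample EMPLOYEE rows and the `select_ids` helper are hypothetical.

```python
from typing import Callable, Dict, List, Set

# Hypothetical EMPLOYEE rows, identified by ssn.
employees: List[Dict] = [
    {"ssn": 1, "dno": 3, "salary": 50_000, "sex": "M"},
    {"ssn": 2, "dno": 5, "salary": 90_000, "sex": "M"},
    {"ssn": 3, "dno": 4, "salary": 40_000, "sex": "F"},
    {"ssn": 4, "dno": 5, "salary": 60_000, "sex": "M"},
    {"ssn": 5, "dno": 3, "salary": 85_000, "sex": "F"},
]


def select_ids(rows: List[Dict], predicate: Callable[[Dict], bool]) -> Set[int]:
    """Evaluate one condition and return the ids of the matching records."""
    return {row["ssn"] for row in rows if predicate(row)}


# One select per condition, then the union of the three results.
by_dno = select_ids(employees, lambda e: e["dno"] == 3)
by_salary = select_ids(employees, lambda e: e["salary"] > 80_000)
by_sex = select_ids(employees, lambda e: e["sex"] == "F")
result = by_dno | by_salary | by_sex

print(sorted(result))  # [1, 2, 3, 5]
```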
Join Algorithms
• We’ll consider joins such as R ⋈ A=B S.
• This extends to joins like R ⋈ A=B and C=D S
by considering <A,C> and <B,D> as single attributes.
Semantic Query Optimization
Semantic query optimization uses constraints specified on the database schema to transform one query into another query that is
more efficient to execute.
Consider the following SQL query:
select e.lname, m.lname from employee e, employee m
where e.superssn = m.ssn and e.salary > m.salary
Explanation:
Suppose that we had a constraint on the database schema that stated that no employee can earn more than
his or her direct supervisor. If the semantic query optimizer checks for the existence of this constraint, it
need not execute the query at all because it knows that the result of the query will be empty. Techniques
known as theorem proving can be used for this purpose.
1.6 Transformation Rules
Transformation Rules for RA Operations(cont’d...)
If join condition q involves attributes only from S and T, then Theta
join is associative:
Example 1.3
Use of Transformation Rules
For prospective renters of flats, find the properties that match their requirements and are owned
by owner CO93.
SELECT p.propertyNo, p.street FROM Client c, Viewing v, PropertyForRent p
WHERE c.prefType = 'Flat' AND c.clientNo = v.clientNo AND
v.propertyNo = p.propertyNo AND c.maxRent >= p.rent AND
c.prefType = p.type AND p.ownerNo = 'CO93';
Find the RA tree for the above SQL query.
Algorithms for PROJECT and Set Operations
• Set operations
• UNION
• INTERSECTION
• SET DIFFERENCE
• CARTESIAN PRODUCT
• Set operations are sometimes expensive to implement
• Sort-merge technique
• Hashing
• Use of anti-join for SET DIFFERENCE
• EXCEPT or MINUS in SQL
• Example: Find which departments have no employees becomes
Implementing Aggregate Operations and Different Types of JOINs
• Aggregate operators
• MIN, MAX, COUNT, AVERAGE, SUM
• Can be computed by a table scan or using an appropriate index
• Example: