

Advanced Database Systems

Chapter 2: Query Processing and Optimization

Outline

1. Translating SQL Queries into Relational Algebra
2. Basic Algorithms for Executing Query Operations
3. Semantic Query Optimization
4. Using Heuristics in Query Optimization
5. Using Selectivity and Cost Estimates in Query Optimization

• What is Query Processing?
– The steps required to transform a high-level SQL query into a correct
and “efficient” strategy for execution and retrieval.
• What is Query Optimization?
– The activity of choosing a single “efficient” execution strategy
(from hundreds) as determined by database catalog statistics.
– Which relational algebra expression, equivalent to the given
query, will lead to the most efficient solution plan?
– How do operations pass data (main memory buffer, disk buffer,
…)?
– Will this plan minimize resource usage? (CPU/Response
Time/Disk)

• Relational Algebra in DBMS
– Relational Algebra is a procedural query language.
Relational algebra mainly provides a theoretical foundation
for relational databases and SQL.
– The main purpose of using Relational Algebra is to define
operators that transform one or more input relations into an
output relation.
– Given that these operators accept relations as input and
produce relations as output, they can be combined and used
to express potentially complex queries that transform
potentially many input relations (whose data are stored in
the database) into a single output relation (the query
results).

Fundamental Operators
These are the basic/fundamental operators used
in Relational Algebra.
– Selection (σ)
– Projection (π)
– Union (∪)
– Set Difference (−)
– Set Intersection (∩)
– Rename (ρ)
– Cartesian Product (×)

1. Selection (σ): selects the tuples of a relation
that satisfy a given condition. Example:
For a relation R with an attribute c, σ(c>3)(R) selects
the tuples whose value of c is greater than 3.
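
A minimal sketch of selection, assuming a relation is modelled as a list of dicts (one dict per tuple). The relation R and its values are made up for illustration, since the original table is not reproduced here.

```python
# Hypothetical relation R: each dict is one tuple with attributes a, b, c.
R = [
    {"a": 1, "b": 2, "c": 4},
    {"a": 2, "b": 2, "c": 3},
    {"a": 3, "b": 2, "c": 5},
]

def select(relation, predicate):
    """Selection (sigma): keep only the tuples for which predicate is true."""
    return [t for t in relation if predicate(t)]

# sigma(c>3)(R)
selected = select(R, lambda t: t["c"] > 3)
print(selected)
```

Selection never changes the attributes of a relation; it only reduces the number of tuples.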

2. Projection (π): projects the required
columns (attributes) of a relation.
Example: Consider Table 1. Suppose we want
columns B and C from relation R.
π(B,C)(R) returns only those two columns.

Note: By default, projection removes duplicate rows.
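
A small sketch of projection with duplicate removal, under the same assumption that a relation is a list of dicts. The contents of R are illustrative and are not the values of Table 1.

```python
# Hypothetical relation R with attributes A, B, C.
R = [
    {"A": 1, "B": 2, "C": 4},
    {"A": 2, "B": 2, "C": 3},
    {"A": 3, "B": 2, "C": 4},
]

def project(relation, attributes):
    """Projection (pi): keep the named attributes and drop duplicate rows."""
    seen = set()
    result = []
    for t in relation:
        row = tuple(t[a] for a in attributes)
        if row not in seen:  # set semantics: each row appears only once
            seen.add(row)
            result.append(dict(zip(attributes, row)))
    return result

# pi(B,C)(R): the first and third tuples collapse into one row.
projected = project(R, ["B", "C"])
print(projected)
```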

3. Union (∪): The union operation in relational algebra
is the same as the union operation in set theory. Both
operands must have the same attributes (be union-compatible).
Example: Consider the tables FRENCH and GERMAN of
students taking different optional subjects in their course.
π(Student_Name)(FRENCH) ∪
π(Student_Name)(GERMAN)

4. Set Difference (−): Set difference in relational
algebra is the same operation as set difference
in set theory: it returns the tuples of the first
relation that do not appear in the second.
• Example: For the FRENCH and GERMAN tables,
set difference is used as follows:
π(Student_Name)(FRENCH) −
π(Student_Name)(GERMAN)

5. Set Intersection (∩): Set intersection in
relational algebra is the same operation as set
intersection in set theory. Strictly speaking it can be
derived from set difference, since R ∩ S = R − (R − S).
• Example: For the FRENCH and GERMAN tables,
set intersection is used as follows:
π(Student_Name)(FRENCH) ∩
π(Student_Name)(GERMAN)
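
The three set operators can be sketched with Python sets. The FRENCH and GERMAN rows below are hypothetical stand-ins, since the original tables are not reproduced here.

```python
# Hypothetical contents of the FRENCH and GERMAN tables.
FRENCH = [
    {"Student_Name": "Ram", "Student_Number": 1},
    {"Student_Name": "Mohan", "Student_Number": 2},
    {"Student_Name": "Vivek", "Student_Number": 13},
]
GERMAN = [
    {"Student_Name": "Vivek", "Student_Number": 13},
    {"Student_Name": "Geeta", "Student_Number": 17},
]

def names(relation):
    """pi(Student_Name): a Python set, so duplicates disappear automatically."""
    return {t["Student_Name"] for t in relation}

union = names(FRENCH) | names(GERMAN)
difference = names(FRENCH) - names(GERMAN)
intersection = names(FRENCH) & names(GERMAN)

# Intersection expressed through difference only: R ∩ S = R − (R − S)
derived = names(FRENCH) - (names(FRENCH) - names(GERMAN))

print(sorted(union), sorted(difference), sorted(intersection))
```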

6. Rename (ρ): Rename is a unary operation
used for renaming the attributes of a relation.
• ρ(a/b)(R) renames the attribute 'b' of
relation R to 'a'.
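
A short sketch of rename on a list-of-dicts relation. The relation R is hypothetical.

```python
def rename(relation, old, new):
    """Rename (rho): replace attribute `old` with `new` in every tuple."""
    return [{(new if k == old else k): v for k, v in t.items()} for t in relation]

# Hypothetical relation R with attributes b and c.
R = [{"b": 1, "c": 2}, {"b": 3, "c": 4}]

# rho(a/b)(R): attribute b becomes a; the values are untouched.
renamed = rename(R, "b", "a")
print(renamed)
```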

7. Cross Product (×): The cross product of two
relations, say A and B, written A × B, contains all
the attributes of A followed by all the attributes
of B. Each record of A is paired with every
record of B.
• Example:
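
As the example figure is not reproduced here, the following sketch uses two hypothetical relations A and B. It assumes their attribute names do not clash.

```python
from itertools import product

# Hypothetical relations; attribute names are assumed to be disjoint.
A = [{"Name": "Ram", "Age": 14}, {"Name": "Sona", "Age": 15}]
B = [{"ID": 1, "Course": "DS"}, {"ID": 2, "Course": "DBMS"}]

def cross(r, s):
    """Cross product: pair every tuple of r with every tuple of s."""
    return [{**t, **u} for t, u in product(r, s)]

pairs = cross(A, B)
for row in pairs:
    print(row)
```

The result has len(A) * len(B) tuples, which is why optimizers try to avoid materializing a raw cross product.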

Derived Operators
• These are some of the derived operators,
which are derived from the fundamental
operators.
–Natural Join(⋈)
–Conditional Join

1. Natural Join (⋈): Natural join is a binary
operator. The natural join of two relations is
the set of all combinations of tuples that have
equal values on their common attributes.
• Natural Join (⋈) Example
Natural join between EMP and DEPT with condition:
EMP.Dept_Name = DEPT.Dept_Name
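
A minimal sketch of this natural join. The rows of EMP and DEPT are hypothetical, and the attribute names other than Dept_Name are assumptions made for illustration.

```python
# Hypothetical EMP and DEPT relations sharing the attribute Dept_Name.
EMP = [
    {"Name": "Ali", "Dept_Name": "HR"},
    {"Name": "Sara", "Dept_Name": "IT"},
    {"Name": "Tom", "Dept_Name": "Sales"},
]
DEPT = [
    {"Dept_Name": "HR", "Manager": "Mona"},
    {"Dept_Name": "IT", "Manager": "Omar"},
]

def natural_join(r, s):
    """Combine tuples that agree on every attribute the relations share."""
    common = set(r[0]) & set(s[0]) if r and s else set()
    return [
        {**t, **u}
        for t in r
        for u in s
        if all(t[a] == u[a] for a in common)
    ]

joined = natural_join(EMP, DEPT)
print(joined)
```

Tom is dropped because no DEPT tuple has Dept_Name equal to "Sales", and the common attribute appears only once in each result tuple.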

2. Conditional Join: A conditional join (also called a
theta join) works similarly to a natural join.
• In a natural join, the condition is by default equality
between the common attributes, while in a
conditional join we can specify any condition,
such as greater than, less than, or not equal.

• Conditional Join Example
Join between R and S with condition R.marks >= S.marks
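
A sketch of the conditional join R.marks >= S.marks. The rows are hypothetical, and attributes are prefixed with the relation name in the output (an assumption made here) so that the two marks columns do not overwrite each other.

```python
def theta_join(r, s, condition, r_name="R", s_name="S"):
    """Conditional join: keep every pair (t, u) for which condition(t, u) holds."""
    result = []
    for t in r:
        for u in s:
            if condition(t, u):
                row = {f"{r_name}.{k}": v for k, v in t.items()}
                row.update({f"{s_name}.{k}": v for k, v in u.items()})
                result.append(row)
    return result

# Hypothetical relations R and S, both with attributes id and marks.
R = [{"id": 1, "marks": 70}, {"id": 2, "marks": 50}]
S = [{"id": 7, "marks": 60}, {"id": 8, "marks": 40}]

joined = theta_join(R, S, lambda t, u: t["marks"] >= u["marks"])
for row in joined:
    print(row)
```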

Query Optimization
• The activity of choosing an efficient execution
strategy for processing a query.
• As there are many equivalent
transformations of the same high-level query,
the aim of QO is to choose the one that minimizes
resource usage.
• Generally, the goal is to reduce the total execution
time of the query; it may also be to reduce the
response time of the query.
• The problem is computationally intractable with a
large number of relations, so the strategy
adopted is reduced to finding a near-optimum solution.
An Example (Branch and Staff Relations)

Example:
Identify all managers who work in a London branch.
SELECT * FROM Staff s, Branch b
WHERE s.branchNo = b.branchNo AND
(s.position = 'Manager' AND b.city = 'London');
This results in these equivalent relational algebra statements:

(1) σ((position='Manager') ∧ (city='London') ∧ (Staff.branchNo=Branch.branchNo))(Staff × Branch)

(2) σ((position='Manager') ∧ (city='London'))(Staff ⋈(Staff.branchNo=Branch.branchNo) Branch)

(3) (σ(position='Manager')(Staff)) ⋈(Staff.branchNo=Branch.branchNo) (σ(city='London')(Branch))
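
The equivalence of the three expressions can be checked on a toy instance. The rows below are hypothetical and much smaller than the relations assumed in the cost analysis; only the attribute names come from the query.

```python
# Hypothetical Staff and Branch instances.
staff = [
    {"staffNo": "S1", "position": "Manager", "branchNo": "B1"},
    {"staffNo": "S2", "position": "Assistant", "branchNo": "B1"},
    {"staffNo": "S3", "position": "Manager", "branchNo": "B2"},
]
branch = [
    {"branchNo": "B1", "city": "London"},
    {"branchNo": "B2", "city": "Glasgow"},
]

def pair(s, b):
    """Compact representation of one result tuple."""
    return (s["staffNo"], b["branchNo"])

# (1) Selection over the full Cartesian product.
plan1 = [
    pair(s, b) for s in staff for b in branch
    if s["position"] == "Manager" and b["city"] == "London"
    and s["branchNo"] == b["branchNo"]
]

# (2) Join first, then select on the joined tuples.
joined = [(s, b) for s in staff for b in branch if s["branchNo"] == b["branchNo"]]
plan2 = [pair(s, b) for s, b in joined
         if s["position"] == "Manager" and b["city"] == "London"]

# (3) Select on each relation first, then join the reduced relations.
managers = [s for s in staff if s["position"] == "Manager"]
london = [b for b in branch if b["city"] == "London"]
plan3 = [pair(s, b) for s in managers for b in london
         if s["branchNo"] == b["branchNo"]]

print(plan1, plan2, plan3)
```

All three plans return the same answer; they differ only in how many intermediate tuples they build along the way.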


Assume:
– 1000 tuples in Staff.
– 50 Managers
– 50 tuples in Branch.
– 5 London branches
– No indexes or sort keys
– Results of any intermediate operations are stored on disk.
Query 1 (Bad)
σ((position='Manager') ∧ (city='London') ∧ (Staff.branchNo=Branch.branchNo))(Staff × Branch)
– Requires (1000+50) disk accesses to read the Staff and Branch
relations
– Creates a temporary relation, the Cartesian product, of (1000*50) tuples
– Requires (1000*50) disk accesses to read in the temporary relation and test
the predicate
Total Work = (1000+50) + 2*(1000*50)
= 101,050 I/O operations
(The factor 2 covers writing the temporary relation and reading it back.)
Query 2 (Better)
σ((position='Manager') ∧ (city='London'))(Staff ⋈(Staff.branchNo=Branch.branchNo) Branch)
– Again requires (1000+50) disk accesses to read Staff and Branch
– Joins Staff and Branch on branchNo, giving 1000 tuples
(each employee works at exactly one branch)
– Requires (1000) disk accesses to read in the joined relation and check the predicate
Total Work = (1000+50) + 2*(1000)
= 3,050 I/O operations

Query 3 (Best)
(σ(position='Manager')(Staff)) ⋈(Staff.branchNo=Branch.branchNo) (σ(city='London')(Branch))
– Read the Staff relation to determine the 'Managers' (1000 reads)
• Create a 50-tuple relation (50 writes)
– Read the Branch relation to determine the 'London' branches (50 reads)
• Create a 5-tuple relation (5 writes)
– Join the reduced relations and check the predicate (50 + 5 reads)
Total Work = 1000 + 2*(50) + 5 + (50 + 5)
= 1,160 I/O operations
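
The three totals can be reproduced with a few lines of arithmetic. This is only a restatement of the slide's cost model (one disk access per tuple read or written); the variable names are assumptions.

```python
# Cardinalities assumed in the example.
STAFF = 1000     # tuples in Staff
BRANCH = 50      # tuples in Branch
MANAGERS = 50    # Staff tuples with position = 'Manager'
LONDON = 5       # Branch tuples with city = 'London'

# Query 1: read both relations, write the Cartesian product, read it back.
cost1 = (STAFF + BRANCH) + 2 * (STAFF * BRANCH)

# Query 2: read both relations, write the 1000-tuple join, read it back.
cost2 = (STAFF + BRANCH) + 2 * STAFF

# Query 3: read Staff, write managers, read Branch, write London branches,
# then read both reduced relations to join them.
cost3 = STAFF + MANAGERS + BRANCH + LONDON + (MANAGERS + LONDON)

print(cost1, cost2, cost3)
print(round(cost1 / cost3))  # how many times cheaper Query 3 is than Query 1
```

With these numbers, the third strategy needs roughly 87 times fewer disk accesses than the first.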

Dynamic versus Static Optimization
• There are two choices for when the first three phases of QP can be
carried out:
1. Dynamically, every time the query is run.
• The advantages of dynamic QO arise from the fact that the information is
up-to-date.
• The disadvantages are that the performance of the query is affected, and
time may limit finding the optimum strategy.
2. Statically, when the query is first submitted.
• The advantages of static QO are the removal of the runtime overhead and more
time to find the optimum strategy.
• The disadvantages arise from the fact that the chosen execution strategy may
no longer be optimal when the query is run.
• A hybrid approach could be used to overcome this.

Query Processing Steps

• Processing can be divided into: Decomposition, Optimization, and Code
generation & Execution.
1. Query Decomposition
• It is the process of transforming a high-level query into a
relational algebra query, and checking that the query is
syntactically and semantically correct.
• Typical stages in query decomposition are:
– Analysis
– Normalization
– Semantic Analysis
– Simplification
– Query Restructuring

1. Query Decomposition (Analysis)
Analyze the query lexically and syntactically using compiler
techniques.
• Verify that relations and attributes exist.
• Verify that operations are appropriate for the object type.
• Example: SELECT staff_no FROM Staff
WHERE position > 10;
• This query would be rejected on two grounds:
– staff_no is not defined for the Staff relation (should be
StaffNo).
– The comparison '>10' is incompatible with the type of position,
which is a variable-length character string.
1. Query Decomposition (Analysis)
• Finally, the query is transformed into some internal representation
more suitable for processing.
• Some kind of query tree is typically chosen, constructed as
follows:
– A leaf node is created for each base relation.
– A non-leaf node is created for each intermediate relation
produced by an RA operation.
– The root of the tree represents the query result.
– The sequence of operations is directed from the leaves to the root.
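
A minimal sketch of such a query tree for the managers-in-London example. The Node class and the post-order walk are illustrative assumptions, not a prescribed internal representation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                                    # relation name or RA operation
    children: list = field(default_factory=list)  # empty for base relations

def post_order(node):
    """Visit children before their parent, i.e. from the leaves to the root."""
    order = []
    for child in node.children:
        order.extend(post_order(child))
    order.append(node.label)
    return order

# Leaves are base relations; inner nodes are intermediate relations.
staff = Node("Staff")
branch = Node("Branch")
sel_staff = Node("σ position='Manager'", [staff])
sel_branch = Node("σ city='London'", [branch])
root = Node("⋈ Staff.branchNo=Branch.branchNo", [sel_staff, sel_branch])

order = post_order(root)
print(order)
```

The root is visited last, matching the rule that the root represents the query result.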

1. Query Decomposition (Relational Algebra Tree)

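1. Query Decomposition (Normalization)
• Converts the query into a normalized form for easier
manipulation.
• The predicate can be converted into one of two forms:
• Conjunctive normal form: a sequence of conjuncts connected with ∧ (AND), where each conjunct contains one or more terms connected by ∨ (OR).
• Disjunctive normal form: a sequence of disjuncts connected with ∨ (OR), where each disjunct contains one or more terms connected by ∧ (AND).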
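
The slide's own example predicates are not reproduced here, so the sketch below uses a hypothetical predicate with three terms: p (position = 'Manager'), q (salary > 20000) and r (branchNo = 'B003'). It checks by truth table that the two normal forms describe the same condition.

```python
from itertools import product

def cnf(p, q, r):
    """Conjunctive normal form: (p OR q) AND r."""
    return (p or q) and r

def dnf(p, q, r):
    """Disjunctive normal form: (p AND r) OR (q AND r)."""
    return (p and r) or (q and r)

# Compare the two forms on all 8 combinations of truth values.
equivalent = all(
    cnf(p, q, r) == dnf(p, q, r)
    for p, q, r in product([False, True], repeat=3)
)
print(equivalent)
```

The DNF version follows from the CNF version by distributing AND over OR.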

1. Query Decomposition (Semantic Analysis)
• Rejects normalized queries that are incorrectly formulated
or contradictory.
• A query is incorrectly formulated if components do not
contribute to the generation of the result.
• A query is contradictory if its predicate cannot be satisfied
by any tuple.
• Algorithms to determine correctness exist only for queries
that do not contain disjunction and negation.

1. Query Decomposition (Semantic Analysis)
For these queries, we could construct:
• A relation connection graph.
• A normalized attribute connection graph.
Relation connection graph:
• Create a node for each relation and a node for the result.
• Create edges between two nodes that represent a join, and edges
between nodes that represent a projection.
• If the graph is not connected, the query is incorrectly formulated.
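
The connectivity test can be sketched as a simple graph traversal. The relation names and edges below are hypothetical: the first query forgets the join condition that would link Client to the other relations.

```python
def is_connected(nodes, edges):
    """Return True if every node can be reached from the first one."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        for neighbour in adjacency[stack.pop()]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen == set(nodes)

# One node per relation plus a node for the result.
nodes = ["Staff", "Branch", "Client", "result"]

# Join Staff-Branch and projection Staff-result; Client is left isolated.
bad_edges = [("Staff", "Branch"), ("Staff", "result")]
# Adding the missing join Branch-Client connects the graph.
good_edges = bad_edges + [("Branch", "Client")]

print(is_connected(nodes, bad_edges), is_connected(nodes, good_edges))
```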

1. Query Decomposition (Simplification)
• Simplification strategy:
– Detects redundant qualifications,
– Eliminates common sub-expressions,
– Transforms the query to a semantically equivalent but more
easily and efficiently computed form.
• Typically, access restrictions, view definitions, and
integrity constraints are considered.
• Assuming the user has appropriate access privileges, first
apply the well-known idempotency rules of Boolean algebra.
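
The slide does not list the rules themselves, so the following is a sketch of the standard Boolean identities usually meant by this step, each verified by truth table. The RULES table and its labels are assumptions made for illustration.

```python
from itertools import product

# (rule name, left-hand side, right-hand side); p and q are Boolean terms.
RULES = [
    ("p AND p = p", lambda p, q: p and p, lambda p, q: p),
    ("p OR p = p", lambda p, q: p or p, lambda p, q: p),
    ("p AND false = false", lambda p, q: p and False, lambda p, q: False),
    ("p OR true = true", lambda p, q: p or True, lambda p, q: True),
    ("p AND (p OR q) = p", lambda p, q: p and (p or q), lambda p, q: p),
    ("p OR (p AND q) = p", lambda p, q: p or (p and q), lambda p, q: p),
    ("p AND NOT p = false", lambda p, q: p and (not p), lambda p, q: False),
    ("p OR NOT p = true", lambda p, q: p or (not p), lambda p, q: True),
]

def holds(lhs, rhs):
    """Check that both sides agree for every combination of truth values."""
    return all(lhs(p, q) == rhs(p, q) for p, q in product([False, True], repeat=2))

checked = {name: holds(lhs, rhs) for name, lhs, rhs in RULES}
for name, ok in checked.items():
    print(name, ok)
```

A predicate such as position = 'Manager' AND position = 'Manager' therefore collapses to a single test.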

2. Query Optimization
• Everyone wants the performance of their database to be optimal. In particular,
there is often a requirement for a specific query, or an object that is query-based, to
run faster.
• The problem of query optimization is to find the sequence of steps that produces
the answer to the user request in the most efficient manner, given the database
structure.
• The performance of a query is affected by the tables or queries that underlie
the query and by the complexity of the query.
• Given a request for data manipulation or retrieval, an optimizer will choose an
optimal plan for evaluating the request from among the many alternative
strategies, i.e. there are many ways (access paths) of accessing the desired
file/record.
• Hence, the DBMS is responsible for picking the best execution strategy based on
various considerations (e.g. the least amount of I/O and CPU resources).

A. Using Heuristics in Query Optimization

• Query block: The basic unit that can be translated into the algebraic
operators and optimized.
• A query block contains a single SELECT-FROM-WHERE expression,
as well as GROUP BY and HAVING clauses if these are part of the block.
• Nested queries within a query are identified as separate query blocks.

• Process for heuristics optimization
1. The parser of a high-level query generates an initial internal
representation.
2. Apply heuristic rules to optimize the internal representation.
3. A query execution plan is generated to execute groups of
operations based on the access paths available on the files
involved in the query.
• The main heuristic is to apply first the operations that reduce the size
of intermediate results.
– E.g. apply SELECT and PROJECT operations before applying
the JOIN or other binary operations.
Summary of Heuristics for Algebraic Optimization:
1. The main heuristic is to apply first the operations that reduce
the size of intermediate results.
2. Perform select operations as early as possible to reduce the
number of tuples and perform project operations as early as
possible to reduce the number of attributes. (This is done by
moving select and project operations as far down the tree as
possible.)
3. The select and join operations that are most restrictive should
be executed before other similar operations. (This is done by
reordering the leaf nodes of the tree among themselves and
adjusting the rest of the tree appropriately.)
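
Moving selections down the tree can be sketched as a tree rewrite. The Rel, Join and Select classes are hypothetical, each selection is assumed to test a single attribute, and an attribute found on both sides of a join is pushed to the left side only.

```python
from dataclasses import dataclass

@dataclass
class Rel:
    name: str
    attrs: frozenset      # attribute names of the base relation

@dataclass
class Join:
    left: object
    right: object

@dataclass
class Select:
    attr: str             # attribute the predicate refers to
    pred: str             # predicate text, kept only for display
    child: object

def attrs_of(node):
    """Attributes available in the output of a subtree."""
    if isinstance(node, Rel):
        return node.attrs
    if isinstance(node, Select):
        return attrs_of(node.child)
    return attrs_of(node.left) | attrs_of(node.right)

def push_down(node):
    """Move each selection below any join whose one side supplies its attribute."""
    if isinstance(node, Select):
        child = push_down(node.child)
        if isinstance(child, Join):
            if node.attr in attrs_of(child.left):
                return Join(push_down(Select(node.attr, node.pred, child.left)), child.right)
            if node.attr in attrs_of(child.right):
                return Join(child.left, push_down(Select(node.attr, node.pred, child.right)))
        return Select(node.attr, node.pred, child)
    if isinstance(node, Join):
        return Join(push_down(node.left), push_down(node.right))
    return node

staff = Rel("Staff", frozenset({"staffNo", "position", "branchNo"}))
branch = Rel("Branch", frozenset({"branchNo", "city"}))

# Initial tree: both selections sit above the join.
initial = Select("position", "position='Manager'",
                 Select("city", "city='London'", Join(staff, branch)))
optimized = push_down(initial)
print(optimized)
```

After the rewrite each selection sits directly above its base relation, so the join only sees the reduced inputs.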

B. Cost Estimation Approach to Query Optimization
• The main idea is to minimize the cost of processing a query. The cost
function comprises:
• I/O cost + CPU processing cost + communication cost + storage cost
• These components might have different weights in different
processing environments.
• The DBMS will use information stored in the system catalogue for the
purpose of estimating cost.
• The main target of query optimization is to minimize the size of the
intermediate relations. Their size affects the cost of:
– Disk Access
– Data Transportation
– Storage space in the Primary Memory
– Writing on Disk
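
The weighted cost function can be sketched as follows. The component values and the weights are invented for illustration; a real optimizer would estimate them from catalogue statistics.

```python
def total_cost(io, cpu, communication, storage, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four cost components named above."""
    w_io, w_cpu, w_comm, w_store = weights
    return w_io * io + w_cpu * cpu + w_comm * communication + w_store * storage

# Hypothetical cost estimates for one candidate plan.
plan = {"io": 100, "cpu": 20, "communication": 50, "storage": 10}

# Centralized system: communication is ignored (weight 0).
centralized = total_cost(**plan, weights=(1.0, 0.5, 0.0, 0.5))
# Distributed system: communication is weighted heavily.
distributed = total_cost(**plan, weights=(1.0, 0.5, 2.0, 0.5))

print(centralized, distributed)
```

The same plan can therefore rank differently depending on the processing environment.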
1. Access Cost of Secondary Storage
• Data is accessed from secondary storage, since a query needs
some part of the data stored in the database. The disk access cost
can be analyzed in terms of:
– Searching
– Reading, and
– Writing the data blocks used to store some portion of a relation.
• Remark: The disk access cost will vary depending on
– The file organization used and the access method implemented for the
file organization.
– Whether the data is stored contiguously or in a scattered manner.
2. Storage Cost
• While processing a query, since any query is composed of many database
operations, there may be one or more intermediate results before the
final output is reached. These intermediate results must be held in primary memory for
further processing. The bigger the intermediate relation, the larger the memory
requirement, which puts pressure on the limited available space.

3. Computation Cost
• A query is composed of many operations. The operations could be
database operations like reading from and writing to a disk, or
mathematical and other operations like:
– Searching
– Sorting
– Merging
– Computation on field values

4. Communication Cost
• In most database systems the database resides at one station and
queries originate from different terminals. This affects
the performance of the system by adding cost to query processing. Thus,
the cost of transporting data between the database site and the terminal
from which the query originates should be analyzed.
3. Query Generation and Execution Plans
– An execution plan for a relational algebra query consists of a
combination of the relational algebra query and information about
the access methods to be used for each relation as well as the
methods to be used in computing the relational operators.
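
One way to picture such a plan is a tree whose nodes pair a relational algebra operation with the method chosen to compute it. The PlanNode class and the method names below are hypothetical; linear scans are used because the earlier example assumes no indexes.

```python
from dataclasses import dataclass, field

@dataclass
class PlanNode:
    operation: str                               # relational algebra operation
    method: str                                  # access method or algorithm chosen
    inputs: list = field(default_factory=list)   # child plan nodes

def describe(node, depth=0):
    """Return the plan as indented lines, one line per operation."""
    lines = ["  " * depth + f"{node.operation} [{node.method}]"]
    for child in node.inputs:
        lines.extend(describe(child, depth + 1))
    return lines

plan = PlanNode("⋈ Staff.branchNo=Branch.branchNo", "nested-loop join", [
    PlanNode("σ position='Manager' (Staff)", "linear scan"),
    PlanNode("σ city='London' (Branch)", "linear scan"),
])

print("\n".join(describe(plan)))
```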
