
CHAPTER 5
QUERY PROCESSING AND OPTIMIZATION
Query optimization techniques are used to choose an efficient execution plan that minimizes the runtime as well as many other resources, such as the number of disk I/Os, CPU time, and so on.

Query processing is the procedure of transforming a high-level query (such as SQL) into a correct and efficient execution plan expressed in a low-level language.

When a database system receives a query for the update or retrieval of information, it goes through a series of compilation steps that produce an execution plan.
Query processing goes through several phases:
• In the first phase, syntax checking, the system parses the query and checks whether it follows the syntax rules.
• It then matches the objects in the query syntax against the views, tables, and columns listed in the system tables.
• In the second phase, the SQL query is translated into an algebraic expression using various rules.
• The process of transforming a high-level SQL query into relational algebraic form is called query decomposition.
• The relational algebraic expression then passes to the query optimizer.
• In the third phase, optimization is performed by substituting equivalent expressions.
So query processing includes three main steps:

1. Parsing and translation

2. Optimization

3. Evaluation
The query optimization module works with the join manager module to improve the order in which joins are performed. At this stage the cost model and several other estimation formulas are used to rewrite the query.

• The modified query is rewritten to utilize system resources in a way that brings optimal performance.

• The action plans are converted into query code that is finally executed by the run-time database processor.

• The run-time database processor estimates the cost of each action plan and chooses the optimal one for execution.
Query Analysis

The query is syntactically analyzed using a programming-language compiler (parser).
Example: SELECT emp_Fname FROM EMPLOYEE WHERE emp_Lname > 100
This query will be rejected because the comparison "> 100" is incompatible with the data type of emp_Lname, which is a character string.
At the end of the query analysis phase, the high-level query (SQL) is transformed into an internal representation that is more suitable for processing.
This internal representation is typically a kind of query tree.
Algorithms for External Sorting
External sorting is the sorting of relations that do not fit in memory because their size is larger than the memory size.
Sorting is one of the primary algorithms used in query processing.
For example, whenever an SQL query specifies an ORDER BY clause, the query result must be sorted.
Sorting is also a key component of the sort-merge algorithms used for JOIN and other operations (such as UNION and INTERSECTION).
The typical external sorting algorithm uses a sort-merge strategy, which starts by sorting small subfiles of the main file, called runs, and then merges the sorted runs, creating larger sorted subfiles that are merged in turn.
It consists of two phases: the sorting phase and the merging phase.
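The two phases can be sketched in Python. This is only a sketch of the idea, not a DBMS implementation: a text file of one integer per line stands in for the relation, and the hypothetical `run_size` parameter plays the role of the available buffer space.

```python
import heapq
import os
import tempfile

def external_sort(input_path, output_path, run_size=1000):
    """Sort a file of one integer per line that may not fit in memory.
    Sorting phase: sort run_size records at a time into sorted runs.
    Merging phase: k-way merge of the sorted runs into the output file."""
    run_paths = []
    # Sorting phase: read one run's worth of records, sort it, write it out.
    with open(input_path) as src:
        while True:
            run = [int(line) for _, line in zip(range(run_size), src)]
            if not run:
                break
            run.sort()                      # in-memory sort of one run
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as out:
                out.writelines(f"{x}\n" for x in run)
            run_paths.append(path)
    # Merging phase: lazily merge all sorted runs into one sorted file.
    runs = [open(p) for p in run_paths]
    with open(output_path, "w") as out:
        for x in heapq.merge(*(map(int, r) for r in runs)):
            out.write(f"{x}\n")
    for r in runs:
        r.close()
    for p in run_paths:
        os.remove(p)
```

A real DBMS merges in multiple passes when the number of runs exceeds the available buffers; `heapq.merge` here performs a single k-way merge for simplicity.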
Algorithms for SELECT and JOIN
Operations
Indexing is a data structure technique to efficiently retrieve records from the
database files based on some attributes on which the indexing has been done.
Indexing in database systems is similar to what we see in books.
Indexing is defined based on its indexing attributes. Indexing can be of the following types:
Primary Index − A primary index is defined on an ordered data file. The data file is ordered on a key field, generally the primary key of the relation.
Secondary Index − A secondary index may be generated from a field that is a candidate key and has a unique value in every record, or from a non-key field with duplicate values.
Clustering Index − A clustering index is defined on an ordered data file. The data file is ordered on a non-key field.
Cont’d
Search Methods for Simple Selection. A number of search algorithms are possible for selecting records from a file. These are also known as file scans. If the search algorithm involves the use of an index, the index search is called an index scan.
Linear search (brute force algorithm): retrieve every record in the file and test whether its attribute values satisfy the selection condition.
Binary search: applicable if the selection condition involves an equality comparison on a key attribute on which the file is ordered.
Using a primary index: applicable if the selection condition involves an equality comparison on a key attribute with a primary index.
Eg. σSsn = ‘123456789’ (EMPLOYEE)
Cont’d
Using a hash key: applicable if the selection condition involves an equality comparison on a key attribute with a hash key.
Using a primary index to retrieve multiple records: applicable if the comparison condition is >, >=, <, or <= on a key field with a primary index.
For example, for Dnumber > 5 we can use the index to find the record satisfying the corresponding equality condition (Dnumber = 5), then retrieve all subsequent records in the (ordered) file.
Using a clustering index to retrieve multiple records: applicable if the selection condition involves an equality comparison on a non-key attribute with a clustering index.
Eg. σDno = 5 (EMPLOYEE): use the index to retrieve all the records satisfying the condition.
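The first two search methods can be illustrated with a small Python sketch. This is an assumption-laden stand-in, not a DBMS API: an in-memory list of dicts plays the role of a disk file, and the function names are hypothetical.

```python
from bisect import bisect_left

def linear_search(file_records, attr, value):
    """Linear search (brute force): test every record; cost ~ b block accesses."""
    return [r for r in file_records if r[attr] == value]

def binary_search_select(file_records, attr, value):
    """Binary search: equality on the key attribute the file is ordered on;
    cost ~ log2(b) block accesses to locate the match."""
    keys = [r[attr] for r in file_records]   # file is ordered on attr
    i = bisect_left(keys, value)
    if i < len(file_records) and file_records[i][attr] == value:
        return [file_records[i]]             # key attribute: at most one match
    return []

# Toy EMPLOYEE file, ordered on the key Ssn (as a primary index assumes).
employees = sorted(
    [{"Ssn": "333445555", "name": "Bob"}, {"Ssn": "123456789", "name": "Ann"}],
    key=lambda r: r["Ssn"],
)
```

For the selection σSsn = ‘123456789’ (EMPLOYEE), both methods return the same record, but the binary search inspects only logarithmically many positions.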
Reading Assignment
Detailed reading on indexing, hashing, and file structures:
Fundamentals of DB Systems, 6th edition, page 583
Cont’d
Search Methods for Complex Selection.

If the condition of a SELECT operation is a conjunctive condition, that is, if it is made up of several simple conditions connected with the AND logical connective, the following methods apply:

Conjunctive selection using an individual index.

Conjunctive selection using a composite index.

Conjunctive selection by intersection of record pointers.

Algorithms for PROJECT and Set
Operations
Algorithm for PROJECT operations
π <attribute list>(R)
1. If <attribute list> includes a key of relation R, extract all tuples from R with only the values for the attributes in <attribute list>.
2. If <attribute list> does NOT include a key of relation R, duplicate tuples must be removed from the results. We can use sorting as well as hashing.
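Case 2 can be sketched with hash-based duplicate elimination (the sorting-based variant would sort the projected tuples and skip adjacent duplicates). A sketch only; `project` is a hypothetical name and in-memory dicts stand in for tuples on disk.

```python
def project(relation, attr_list):
    """pi<attr_list>(R): project each tuple onto attr_list; duplicates are
    eliminated by hashing the projected tuple (needed when attr_list has
    no key of R)."""
    seen = set()
    result = []
    for row in relation:
        t = tuple(row[a] for a in attr_list)
        if t not in seen:           # hash lookup detects a duplicate
            seen.add(t)
            result.append(dict(zip(attr_list, t)))
    return result
```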

Of all the operations, the CARTESIAN PRODUCT operation is very expensive and should be avoided if possible.
Cont’d
1. UNION
Sort the two relations on the same attributes.
Scan and merge both sorted files concurrently; whenever the same tuple exists in both relations, only one is kept in the merged result.
2. INTERSECTION
Sort the two relations on the same attributes.
Scan and merge both sorted files concurrently; keep in the merged result only those tuples that appear in both relations.
3. SET DIFFERENCE R − S
Keep in the merged result only those tuples that appear in relation R but not in relation S.
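The three merge-based set operations share one concurrent scan, sketched below. Assumptions: both inputs are already sorted and duplicate-free lists (standing in for the sorted files), and `merge_set_op` is a hypothetical name.

```python
def merge_set_op(r, s, op):
    """Scan two sorted, duplicate-free relations concurrently.
    op: 'union', 'intersection', or 'difference' (computes R - S)."""
    i = j = 0
    out = []
    while i < len(r) and j < len(s):
        if r[i] == s[j]:
            # Tuple appears in both relations: keep once for union/intersection.
            if op in ("union", "intersection"):
                out.append(r[i])
            i += 1; j += 1
        elif r[i] < s[j]:
            # Tuple only in R: keep for union and difference.
            if op in ("union", "difference"):
                out.append(r[i])
            i += 1
        else:
            # Tuple only in S: keep for union only.
            if op == "union":
                out.append(s[j])
            j += 1
    if op in ("union", "difference"):
        out.extend(r[i:])           # leftover tuples of R
    if op == "union":
        out.extend(s[j:])           # leftover tuples of S
    return out
```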
Combining Operations using Pipelining

A query is mapped into a sequence of operations.

Each execution of an operation produces a temporary result.

Generating and saving temporary files on disk is time consuming and expensive.

Alternative:

Avoid constructing temporary results as much as possible.

Pipeline the data through multiple operations: pass the result of a previous operator to the next without waiting for the previous operation to complete.
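Pipelining can be sketched with Python generators, where each operator yields tuples to the next one as they are produced instead of materializing a temporary result. The operator names and the toy relation are assumptions for illustration.

```python
def scan(relation):
    """Leaf operator: produce tuples one at a time."""
    yield from relation

def select_op(rows, pred):
    """Selection: pass each qualifying tuple straight to the next
    operator instead of writing it to a temporary file."""
    for row in rows:
        if pred(row):
            yield row

def project_op(rows, attrs):
    """Projection: project each tuple as it arrives from upstream."""
    for row in rows:
        yield {a: row[a] for a in attrs}

employees = [{"name": "Ann", "dno": 5}, {"name": "Bob", "dno": 4}]
# The three operators run interleaved, tuple by tuple:
pipeline = project_op(select_op(scan(employees), lambda r: r["dno"] == 5), ["name"])
result = list(pipeline)
```

No operator here ever holds the full intermediate result; consuming the pipeline pulls one tuple at a time through all three operators.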
Parallel query processing

Parallel query processing designates the transformation of high-level queries into execution plans that can be efficiently executed in parallel on a multiprocessor computer.

Parallel query processing can improve the performance of the following types of queries: SELECT statements that scan large numbers of pages but return relatively few rows, such as table scans or clustered index scans with grouped or ungrouped aggregates.
Query Trees and Heuristics for
Query Optimization
A query tree is a tree data structure that corresponds to a relational algebra expression. A query tree is also called a relational algebra tree.

Leaf nodes of the tree represent the base input relations of the query.

Internal nodes represent the result of applying an operation of the algebra.

The root of the tree represents the result of the query.

A relational algebra expression may have many equivalent expressions.

Eg. σbalance<2500(πbalance(account)) ≡ πbalance(σbalance<2500(account))
Example
SELECT P.proj_no, P.dept_no, E.name, E.add, E.dob
FROM PROJECT P, DEPARTMENT D, EMPLOYEE E
WHERE P.dept_no = D.d_no AND D.mgr_id = E.emp_id AND P.proj_loc = ‘Mumbai’;
Query optimization: among all equivalent evaluation plans, choose the one with the lowest cost.
Cost is estimated using statistical information from the database catalog, e.g. the number of tuples in each relation, the size of tuples, etc.
SQL queries are decomposed into query blocks, which form the basic units that can be translated into algebraic operators and optimized.
• A query block contains a single SELECT-FROM-WHERE expression, as well as GROUP BY and HAVING clauses if these are part of the block.
Hence, nested queries within a query are identified as separate query blocks.
An overall rule for heuristic query optimization is to perform as many select and project operations as possible before doing any joins.
The heuristic query optimizer transforms the initial query tree into a final query tree using equivalence transformation rules.
This final query tree is efficient to execute.
General Transformation Rules for Relational Algebra Operations:

1. Cascade of σ: A conjunctive selection condition can be broken up into a cascade (sequence) of individual σ operations:
σc1 AND c2 AND ... AND cn(R) ≡ σc1(σc2(...(σcn(R))...))

2. Commutativity of σ: The σ operation is commutative:
σc1(σc2(R)) ≡ σc2(σc1(R))

3. Cascade of π: In a cascade (sequence) of π operations, all but the last one can be ignored:
πList1(πList2(...(πListn(R))...)) ≡ πList1(R)

4. Commuting σ with π: If the selection condition c involves only the attributes A1, ..., An in the projection list, the two operations can be commuted:
πA1, ..., An(σc(R)) ≡ σc(πA1, ..., An(R))

5. Commutativity of ⋈ (and ×): The ⋈ operation is commutative, as is the × operation:
R ⋈c S ≡ S ⋈c R;  R × S ≡ S × R

6. Commuting σ with ⋈ (or ×): If all the attributes in the selection condition c involve only the attributes of one of the relations being joined, say R, the two operations can be commuted as follows:
σc(R ⋈ S) ≡ (σc(R)) ⋈ S

7. Commuting π with ⋈ (or ×): Suppose that the projection list is L = {A1, ..., An, B1, ..., Bm}, where A1, ..., An are attributes of R and B1, ..., Bm are attributes of S. If the join condition c involves only attributes in L, the two operations can be commuted as follows:
πL(R ⋈c S) ≡ (πA1, ..., An(R)) ⋈c (πB1, ..., Bm(S))

8. Commutativity of set operations: The set operations ∪ and ∩ are commutative, but − is not.

9. Associativity of ⋈, ×, ∪, and ∩: These four operations are individually associative; that is, if θ stands for any one of these four operations (throughout the expression), we have:
(R θ S) θ T ≡ R θ (S θ T)

10. Commuting σ with set operations: The σ operation commutes with ∪, ∩, and −. If θ stands for any one of these three operations, we have:
σc(R θ S) ≡ (σc(R)) θ (σc(S))
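Two of these rules can be checked numerically on toy data. The sketch below verifies rule 1 (cascade of σ) and rule 4 (commuting σ with π) using in-memory relations; `sigma`, `pi`, and the `account` relation are illustrative assumptions, not part of any DBMS.

```python
def sigma(pred, rows):
    """Selection on an in-memory relation (a list of dicts)."""
    return [r for r in rows if pred(r)]

def pi(attrs, rows):
    """Projection (duplicates kept, for simplicity of the sketch)."""
    return [{a: r[a] for a in attrs} for r in rows]

account = [{"acc_no": 1, "balance": 1200}, {"acc_no": 2, "balance": 3000}]

# Rule 1 (cascade of sigma): sigma_{c1 AND c2}(R) == sigma_{c1}(sigma_{c2}(R))
conjunctive = sigma(lambda r: r["balance"] < 2500 and r["acc_no"] == 1, account)
cascaded = sigma(lambda r: r["balance"] < 2500,
                 sigma(lambda r: r["acc_no"] == 1, account))

# Rule 4 (commuting sigma with pi): c uses only the projected attribute balance
lhs = pi(["balance"], sigma(lambda r: r["balance"] < 2500, account))
rhs = sigma(lambda r: r["balance"] < 2500, pi(["balance"], account))
```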
Choice of Query Execution Plans

Query Processing Components:

1. Query language that is used
SQL

2. Query execution methodology
The steps that one goes through in executing high-level user queries.

3. Query optimization, which answers the question: how do we determine a good execution plan?
A. Rule based
B. Cost based
Choice of Query Execution Plans

A. Rule based: the RBO uses a predefined set of precedence rules (“golden rules”) to figure out the optimal path.
• These rules are used to choose one index over another and to decide when to do a full table scan.
• Oracle 9i has 20 “golden rules” for the optimal execution path.
B. Cost based optimization: collect statistics about tables, clusters, and indexes, and store those statistics in the data dictionary.
Query processing methodology
Query normalization: lexical and syntactic analysis, validity checking, checking of attributes and relations, and type checking on the qualification.
The query is put into a normal form,
e.g. conjunctive normal form:
(p11 ∨ p12 ∨ ... ∨ p1n) ∧ ... ∧ (pm1 ∨ pm2 ∨ ... ∨ pmn)
or disjunctive normal form:
(p11 ∧ p12 ∧ ... ∧ p1n) ∨ ... ∨ (pm1 ∧ pm2 ∧ ... ∧ pmn)

ORs are mapped into unions.

ANDs are mapped into joins or selections.

Query analysis: the query is type incorrect if any of its attribute or relation names are not defined; it may also be semantically incorrect.
Query simplification: use appropriate transformation rules and integrity rules.
Query restructuring: convert SQL to relational algebra and make use of query trees.
Query optimization:
A. Exhaustive search: cost based and optimal.
B. Heuristics: not optimal; perform selection and projection as early as possible.
Cost Based Optimizer (CBO)
Uses several database initialization parameters.
Uses statistics about the objects and the system (CPU, disk, etc.).
Uses this information to make decisions on the “best way” to generate an execution plan.
The cost-based query optimizer (CBO)...
– Uses data from a variety of sources
– Estimates the costs of several execution plans
– Chooses the plan it estimates to be the least expensive
Characteristics
Adapts to changing circumstances.
It is the only query optimizer supported by Oracle Corporation from release 10g onward.
Data Points Collected by the Cost
Based Optimizer
Table Statistics
Column Statistics
Index Statistics
System Statistics
Table Statistics
Statistics collected for tables appear in the data dictionary views
*_TABLES
Number of rows (NUM_ROWS)
Number of data blocks below the high water mark (BLOCKS)
Number of data blocks allocated to the table that have never been used
(EMPTY_BLOCKS)
Average available free space in each data block in bytes (AVG_SPACE)
Number of chained rows (CHAIN_CNT)
Average row length, including the row's overhead, in bytes
(AVG_ROW_LEN)
Column Statistics
Statistics collected for columns appear in the data dictionary
views *_TAB_COLUMNS and
*_TAB_COL_STATISTICS.
Number of distinct column values (NUM_DISTINCT)
Lowest value in a column (LOW_VALUE)
Highest value in a column (HIGH_VALUE)
Selectivity estimate for column (DENSITY)
Number of null values in a column (NUM_NULLS)
Number of histogram buckets (NUM_BUCKETS)
Average column length (AVG_COL_LEN)
Index Statistics

Statistics collected for indexes appear in the data dictionary


views *_INDEXES.
Depth of the index from its root block to its leaf blocks
(BLEVEL)
Number of leaf blocks (LEAF_BLOCKS)
Number of distinct indexed values (DISTINCT_KEYS)
Average number of leaf blocks in which each distinct value in the index
appears (AVG_LEAF_BLOCKS_PER_KEY)
Average number of data blocks in the table that are pointed to by a
distinct value in the index (AVG_DATA_BLOCKS_PER_KEY)
Histogram
System Statistics
Selectivity and Cost Estimates in Query
Optimization
Cost Components for Query Execution
1. Access cost to secondary storage
2. Storage cost
3. Computation cost
4. Memory usage cost
5. Communication cost
Note: Different database systems may focus on different
cost components.
Cont’d
1. Access cost to secondary storage.
•This is the cost of transferring (reading and writing) data
blocks between secondary disk storage and main memory
buffers.
•This is also known as disk I/O (input/output) cost.
•The cost of searching for records in a disk file depends on
the type of access structures on that file, such as ordering,
hashing, and primary or secondary indexes.
Cont’d
2. Disk storage cost.
•This is the cost of storing on disk any intermediate files that are generated by an execution strategy for the query.
3. Computation cost.
•This is the cost of performing in-memory operations on the
records within the data buffers during query execution.
•Such operations include searching for and sorting records,
merging records for a join or a sort operation, and performing
computations on field values.
•This is also known as CPU (central processing unit) cost.
Cont’d
4. Memory usage cost.
•This is the cost pertaining to the number of main memory
buffers needed during query execution.
5. Communication cost.
•This is the cost of shipping the query and its results from the
database site to the site or terminal where the query
originated.
•In distributed databases, it would also include the cost of
transferring tables and results among various computers
during query evaluation
Catalog Information Used in Cost
Functions

Information about the size of a file: number of records (tuples) (r), record size (R), number of blocks (b), blocking factor (bfr).
Information about indexes and indexing attributes of a file: number of levels (x) of each multilevel index, number of first-level index blocks (bI1), number of distinct values (d) of an attribute, and selectivity (sl) of an attribute, which is the fraction of records satisfying an equality condition on the attribute.
Selection cardinality (s) of an attribute: s = sl * r.
Note: For a key attribute, d = r, sl = 1/r, and s = 1.
For a non-key attribute, by assuming that the d distinct values are uniformly distributed among the records, we estimate sl = 1/d and so s = r/d.
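These estimates are simple enough to write down directly; the function name below is a hypothetical illustration of the formulas, under the stated uniform-distribution assumption.

```python
def equality_selectivity(r, d, is_key=False):
    """Estimate selectivity sl and selection cardinality s = sl * r for an
    equality condition on an attribute with d distinct values in a relation
    of r records. Key attribute: sl = 1/r, so s = 1. Non-key attribute:
    assume the d values are uniformly distributed, so sl = 1/d and s = r/d."""
    sl = 1 / r if is_key else 1 / d
    return sl, sl * r
```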
Cont’d
For example, suppose that a company has 5 departments numbered 1 through 5, and 200 employees who are distributed among the departments as follows:
(1, 5), (2, 25), (3, 70), (4, 40), (5, 60).
In such cases, the optimizer can store a histogram that reflects the distribution of employee records over the departments in a table with the two attributes (Dno, Selectivity), which would contain the following values for our example:
(1, 0.025), (2, 0.125), (3, 0.35), (4, 0.2), (5, 0.3).
The selectivity values stored in the histogram can also be estimates if the EMPLOYEE table changes frequently.
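The histogram above is just each department's employee count divided by the total; a two-line sketch reproduces the stated selectivities from the counts:

```python
# Employee counts per department, (Dno, count), from the example above.
counts = {1: 5, 2: 25, 3: 70, 4: 40, 5: 60}
total = sum(counts.values())                      # 200 employees in all
# Selectivity of Dno = k is count(k) / total.
histogram = {dno: n / total for dno, n in counts.items()}
```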
Cost Estimation for SELECT
Operations
For a given SELECT query, the DBMS has a number of possible execution strategies.
We can use: linear search (brute force), binary search, or a primary index / hash key to retrieve a single record.
Assumptions:
Query: SELECT * FROM EMPLOYEE WHERE EmpID = 125
Number of EMPLOYEE records (r): 100,000
Number of disk blocks (b): 10,000
Blocking factor (bfr, records per block): 10
Cont’d
Linear Search
The DBMS must use a linear search when no index exists on the selection attribute (e.g. EmpID). This is precisely the type of operation that we want to prevent with proper indexing.
Given our assumptions, the cost of a linear search for this query would be:
C = b/2 on average if the record exists
C = b if the record does not exist
So, for the query above, C = b/2 = 5,000 block accesses.
Cont’d
Binary Search
The cost of performing a binary search on the ordered file is the same as that of a binary search anywhere else, namely C = log2 b. For the query above, C = log2 b = log2 10,000 ≈ 14 block accesses.

Primary Index
If the column is unique (like a primary key, for example), then the database index can be implemented as a hash table. Such an index on the EmpID column would allow the DBMS to hash directly to the correct employee record in constant time.
In general, static and linear hashing have a cost of C = 1.
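The three cost figures for this example follow directly from the assumptions; a short sketch of the arithmetic:

```python
import math

b = 10_000                            # number of disk blocks in EMPLOYEE

linear_hit = b // 2                   # linear search, record exists (average)
linear_miss = b                       # linear search, record does not exist
binary = math.ceil(math.log2(b))      # binary search on the ordering key
hash_lookup = 1                       # static/linear hashing on a unique key
```

So the choice of access path changes the cost from 5,000 block accesses (linear) to 14 (binary) to 1 (hash), which is why the optimizer's choice matters so much.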
Oracle query optimizer
The Oracle query optimizer determines the most efficient execution plan for each SQL statement based on the structure of the query, the available statistical information about the underlying objects, and all the relevant optimizer and execution features.

Adaptive Query Optimization is a set of capabilities that enables the optimizer to make run-time adjustments to execution plans and to discover additional information that can lead to better statistics.
