
Chapter 1 Query Processing and Optimization

The document discusses query processing and optimization. Query processing involves transforming a high-level query into efficient low-level operations to retrieve data from a database. Query optimization aims to choose the most efficient execution strategy by estimating costs and minimizing resource usage and time.

Uploaded by

MEK SO G

Query Processing and Optimization
Query Processing
The activities involved in retrieving data from the database.
• Query processing takes a query through several steps before the data is fetched from the database.

• Aims of QP:
• transform a query written in a high-level language (e.g. SQL) into a correct and efficient execution strategy expressed in a low-level language (implementing RA);
• execute the strategy to retrieve the required data.
Query Processing

high-level user query → query processor → low-level data manipulation commands
Query Processing

• Query processing: the process by which the results of a high-level query (written in a language such as SQL or OQL) are retrieved.
• Query optimization: the process of choosing a suitable execution strategy for processing a query.
• Two internal representations of a query:
  • Query tree
  • Query graph
Query Optimization

The activity of choosing an efficient execution strategy for processing a query.
• There are many equivalent transformations of the same high-level query.
The aims of QO are:
• to choose the transformation that minimizes resource usage;
• to reduce the total execution time of the query;
• to reduce the response time of the query;
• to find a near-optimal solution for the query.
Query Optimization Algorithm
• Compute alternative plans
• Compute the estimated cost of each plan
  • Compute the number of I/Os
  • Compute the CPU cost
• Choose the plan with the lowest cost
• This is called cost-based optimization
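The algorithm above can be sketched in a few lines of Python. This is only an illustration: the `Plan` class, the plan names, and all cost figures are hypothetical, and the cost model (I/O count plus a CPU term expressed in the same unit) is an assumed simplification rather than what any particular DBMS does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A candidate execution plan with its estimated cost components."""
    name: str
    io_count: int    # estimated number of disk I/Os
    cpu_cost: float  # estimated CPU cost, expressed in units of one I/O


def estimated_cost(plan: Plan) -> float:
    # Assumed cost model: total cost = I/O cost + CPU cost.
    return plan.io_count + plan.cpu_cost


def choose_plan(plans):
    # Cost-based optimization: pick the plan with the lowest estimate.
    return min(plans, key=estimated_cost)


# Hypothetical alternative plans for one query.
plans = [
    Plan("cartesian product then select", io_count=101_050, cpu_cost=500.0),
    Plan("join then select", io_count=3_050, cpu_cost=60.0),
    Plan("select then join", io_count=1_160, cpu_cost=20.0),
]

best = choose_plan(plans)
print(best.name, estimated_cost(best))
```

A real optimizer would generate the alternative plans itself and derive the estimates from database statistics; here both are supplied by hand.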
Query Optimization
• Conducted by a query optimizer in a DBMS
• Goal: select the best available strategy for executing the query
  • Based on the information available
• Most RDBMSs use a tree as the internal representation of a query
Phases of Query Processing
• QP for a centralized database has four main phases:
• decomposition (consisting of parsing and validation);
• optimization;
• code generation;
• execution.
Phases of Query Processing (centralized database)

Query in a high-level language
→ Scanning, Parsing, Validating
→ Intermediate form of query
→ Query Optimizer
→ Execution plan
→ Query Code Generator
→ Code to execute the query
→ Runtime Database Processor
→ Result of query

Query Processing

A query expressed in a high-level query language such as SQL must first be scanned, parsed, and validated.
• Scanner: identifies the language tokens, such as SQL keywords, attribute names, and relation names, in the text of the query.
• Parser: checks the query syntax to determine whether it is formulated according to the syntax rules of the query language. A query with no syntax errors is converted into an internal form (a parse tree), which is then translated into relational algebra.
• Validation: checks that all attribute and relation names are valid and semantically meaningful names in the schema of the particular database being queried.
Query Processing
Parsing: First, the query is parsed, which also checks whether its syntax is correct. The query is then converted into a parse tree such as the one below.
E.g. SELECT first_name, last_name FROM ninjas WHERE question_solved > 50;
SELECT
|
+-- first_name
|
+-- last_name
FROM
|
+-- ninjas
WHERE
|
+-- question_solved
|
+-- >
|
+-- 50
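Before a parse tree like this can be built, the scanner has to split the query text into tokens. The following is a minimal sketch of such a scanner in Python; the token categories and the tiny keyword set are assumptions made for this example and cover only the query shown, not full SQL.

```python
import re

# Hypothetical keyword set, just large enough for the example query.
KEYWORDS = {"SELECT", "FROM", "WHERE"}

# One named group per token category; whitespace is matched and skipped.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+)
  | (?P<NAME>[A-Za-z_][A-Za-z_0-9]*)
  | (?P<OP>>=|<=|<>|=|>|<)
  | (?P<PUNCT>[,;*()])
  | (?P<SKIP>\s+)
""", re.VERBOSE)


def scan(query):
    """Return a list of (kind, text) tokens, or raise on an unknown character."""
    tokens = []
    pos = 0
    while pos < len(query):
        match = TOKEN_RE.match(query, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {query[pos]!r} at position {pos}")
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind == "SKIP":
            continue
        if kind == "NAME" and text.upper() in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, text))
    return tokens


tokens = scan("SELECT first_name, last_name FROM ninjas WHERE question_solved > 50;")
for kind, text in tokens:
    print(kind, text)
```

The parser would then consume this token list and check it against the grammar; that step is not shown here.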
Example
• In SQL, a user wants to fetch the names of the employees whose salary is greater than 10000. The following query does this:
• select emp_name from Employee where salary>10000;
• For the system to process the user query, it must be translated into relational algebra. The query can be written in relational algebra as:
• πemp_name (σsalary>10000 (Employee))
• πemp_name (σsalary>10000 (πemp_name, salary (Employee)))
• After the query is translated, each relational algebra operation can be executed using different algorithms. This is how query processing begins.
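As a rough illustration of how the query `select emp_name from Employee where salary>10000` maps onto relational algebra operations, the sketch below implements selection and projection over an in-memory list of rows. The sample rows and the helper names `select` and `project` are assumptions made for this example.

```python
def select(predicate, relation):
    """σ: keep only the tuples that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(attributes, relation):
    """π: keep only the listed attributes and drop duplicate tuples."""
    result = []
    for row in relation:
        reduced = {a: row[a] for a in attributes}
        if reduced not in result:
            result.append(reduced)
    return result


# Hypothetical contents of the Employee relation.
employee = [
    {"emp_name": "Abel", "salary": 9000},
    {"emp_name": "Bina", "salary": 12000},
    {"emp_name": "Chala", "salary": 10000},
    {"emp_name": "Dawit", "salary": 15000},
]

# Select first, then project onto emp_name.
plan_a = project(["emp_name"],
                 select(lambda r: r["salary"] > 10000, employee))

# Equivalent plan: project onto the needed attributes first,
# then select, then project onto emp_name.
plan_b = project(["emp_name"],
                 select(lambda r: r["salary"] > 10000,
                        project(["emp_name", "salary"], employee)))

print(plan_a)
print(plan_a == plan_b)
```

Both plans return the same relation, which is what makes them candidates for the optimizer to choose between.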
Query Processing

• Query Optimization: the process of choosing a suitable (efficient) execution strategy for processing a query. It is important for:
• producing an execution plan;
• planning a good execution strategy;
• finding the most efficient way to execute the given query;
• telling the DBMS what the best execution plan is;
• retrieving the required data with minimal cost in terms of resources and time.
• Query Code Generator: generates the code to execute the plan.
• Runtime Database Processor: runs the query code, whether in compiled or interpreted mode. If a runtime error results, the runtime database processor generates an error message.
Query Optimization
• The database system generates an efficient query evaluation plan, one that minimizes cost. This task is performed by the database system and is known as query optimization.
• To optimize a query, the query optimizer needs an estimated cost for each operation, because the overall cost depends on the memory allocated to the operations, their execution costs, and so on.
• After selecting an evaluation plan, the system evaluates the query and produces its output.
• The query execution engine is responsible for generating the output of the query: it takes the query execution plan, executes it, and returns the result to the user.
Query Optimization
• The major reasons for SQL query optimization are:
• Enhancing performance: The main reason for SQL query optimization is to reduce response time and improve the performance of the query. The time between request and response needs to be minimized for a better user experience.
• Reduced execution time: SQL query optimization reduces CPU time, so results are obtained faster.
• Enhanced efficiency: Query optimization reduces the time spent on hardware, so servers run efficiently with lower power and memory consumption.
Query Optimization

• Best Practices for SQL Query Optimization

• 1. Use WHERE instead of HAVING where possible: WHERE filters rows before grouping, whereas HAVING filters after the groups are formed, so a condition that does not depend on an aggregate is usually cheaper in the WHERE clause.
• 2. Avoid queries inside a loop: Running a query on every iteration of a loop slows execution considerably. Where possible, issue a single query outside the loop instead.
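Both practices can be tried out with Python's built-in `sqlite3` module. The sketch below uses a hypothetical `employee` table and invented rows; it only shows that the rewritten queries return the same results, not how much faster they are on a real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(1, "Abel", "HR", 9000), (2, "Bina", "IT", 12000),
     (3, "Chala", "IT", 10000), (4, "Dawit", "HR", 15000)],
)

# Practice 1: a non-aggregate condition gives the same answer in WHERE and in
# HAVING, but WHERE discards rows before the groups are built.
with_where = conn.execute(
    "SELECT dept, SUM(salary) FROM employee WHERE dept = 'IT' GROUP BY dept"
).fetchall()
with_having = conn.execute(
    "SELECT dept, SUM(salary) FROM employee GROUP BY dept HAVING dept = 'IT'"
).fetchall()

# Practice 2: one query per loop iteration ...
wanted_ids = [1, 3, 4]
looped = []
for emp_id in wanted_ids:
    row = conn.execute("SELECT name FROM employee WHERE id = ?", (emp_id,)).fetchone()
    looped.append(row[0])

# ... versus a single query issued outside the loop.
placeholders = ",".join("?" * len(wanted_ids))
batched = [
    row[0]
    for row in conn.execute(
        f"SELECT name FROM employee WHERE id IN ({placeholders}) ORDER BY id",
        wanted_ids,
    )
]

print(with_where, with_having)
print(looped, batched)
```

With a client/server DBMS the looped version also pays one network round trip per iteration, which is where most of the saving usually comes from.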
Query Processing

Evaluation: The DBMS executes the optimized query, retrieves the results from the database, and returns the first_name and last_name of all ninjas whose question_solved is greater than 50.
Query Representation

• An internal (intermediate) representation of the query, either a query tree or a query graph, is created after scanning, parsing, and validating, and before the query is optimized.
One of two data structures is used:
• Query tree: a tree data structure that corresponds to a relational algebra expression. It represents the input relations of the query as leaf nodes and the relational algebra operations as internal nodes.
• Query graph: a graph data structure that corresponds to a relational calculus expression. It does not indicate the order in which operations are performed, and there is only a single graph corresponding to each query.
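A query tree can be modelled with a small recursive data structure. The sketch below is one possible layout, not a standard one: the `Node` class, its field names, and the bracket notation printed by `to_algebra` are all assumptions, and the sample tree uses hypothetical relations EMP and ASG.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a query tree."""
    op: str        # "relation", "select", "project" or "join"
    arg: str = ""  # relation name, condition, or attribute list
    children: List["Node"] = field(default_factory=list)


def leaves(node):
    """Input relations of the query: the leaf nodes, left to right."""
    if not node.children:
        return [node.arg]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def to_algebra(node):
    """Render the tree as a relational algebra expression."""
    if node.op == "relation":
        return node.arg
    if node.op == "select":
        return f"σ[{node.arg}]({to_algebra(node.children[0])})"
    if node.op == "project":
        return f"π[{node.arg}]({to_algebra(node.children[0])})"
    if node.op == "join":
        left, right = node.children
        return f"({to_algebra(left)} ⋈[{node.arg}] {to_algebra(right)})"
    raise ValueError(f"unknown operation {node.op!r}")


# Leaves are base relations; internal nodes are algebra operations.
tree = Node("project", "ENAME", [
    Node("select", "RESP='Manager'", [
        Node("join", "EMP.ENO=ASG.ENO", [
            Node("relation", "EMP"),
            Node("relation", "ASG"),
        ]),
    ]),
])

print(leaves(tree))
print(to_algebra(tree))
```

Because the tree fixes which operation sits above which, it imposes an execution order, which is the property a query graph does not have.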
Relational Algebra
• Relational algebra is a procedural query language.
• It takes relation instances as input and returns relation instances as output.
• It performs queries with the help of operators, which may be unary or binary.
• Operations:
• select: σ
• project: π
• union: ∪
• difference: −
• product: ×
• join: ⋈
Translating SQL Queries into Relational Algebra
• Relational algebra consists of several groups of operations
• Unary relational operations
  • SELECT (symbol: σ (sigma))
  • PROJECT (symbol: π (pi))
  • RENAME (symbol: ρ (rho))
• Relational algebra operations from set theory
  • UNION (∪), INTERSECTION (∩), DIFFERENCE (or MINUS, −)
  • CARTESIAN PRODUCT (×)
• Binary relational operations
  • JOIN (several variations of JOIN exist)
  • DIVISION
• Additional relational operations
  • OUTER JOINS, OUTER UNION
  • AGGREGATE FUNCTIONS (these compute summaries of information: for example, SUM, COUNT, AVG, MIN, MAX)

Translating SQL Queries into Relational Algebra
• An SQL query is first translated into an equivalent extended relational algebra expression (represented as a query tree), which is then optimized.
• Relational algebra is the set of basic operations for the relational model.
• These operations enable a user to specify basic retrieval requests (queries).
• The result of an operation is a new relation, which may have been formed from one or more input relations.
• This property makes the algebra "closed" (all objects in relational algebra are relations).

Translating SQL Queries into Relational Algebra

The SELECT operation (denoted by σ (sigma)) is used to select a subset of the tuples from a relation based on a selection condition.
• The selection condition acts as a filter
• Keeps only those tuples that satisfy the qualifying condition
• Tuples satisfying the condition are selected, whereas the other tuples are discarded (filtered out)
• SELECT creates a horizontal partitioning of the relation
• Examples:
• Select the EMPLOYEE tuples whose department number is 4:
  σ DNO = 4 (EMPLOYEE)
• Select the employee tuples whose salary is greater than $30,000:
  σ SALARY > 30,000 (EMPLOYEE)
Translating SQL Queries into Relational Algebra
• In general, the select operation is denoted by σ<selection condition>(R), where
  • the symbol σ (sigma) is used to denote the select operator
  • the selection condition is a Boolean (conditional) expression specified on the attributes of relation R
  • tuples that make the condition true are selected and appear in the result of the operation
  • tuples that make the condition false are filtered out and discarded from the result of the operation
Translating SQL Queries into Relational Algebra
Five operations for demonstration:
– (OP1): σSSN='123456789' (EMPLOYEE)
– (OP2): σDNUMBER>5 (DEPARTMENT)
– (OP3): σDNO=5 (EMPLOYEE)
– (OP4): σDNO=5 AND SALARY>30000 AND SEX='F' (EMPLOYEE)
– (OP5): σESSN='123456789' AND PNO=10 (WORKS_ON)
Translating SQL Queries into Relational Algebra
• SELECT Operation Properties
• The SELECT operation σ<selection condition>(R) produces a relation S that has the same schema (same attributes) as R
• SELECT σ is commutative:
  σ<condition1>(σ<condition2>(R)) = σ<condition2>(σ<condition1>(R))
• Because of the commutativity property, a cascade (sequence) of SELECT operations may be applied in any order:
  σ<cond1>(σ<cond2>(σ<cond3>(R))) = σ<cond2>(σ<cond3>(σ<cond1>(R)))
• A cascade of SELECT operations may be replaced by a single selection with a conjunction of all the conditions:
  σ<cond1>(σ<cond2>(σ<cond3>(R))) = σ<cond1> AND <cond2> AND <cond3>(R)
• The number of tuples in the result of a SELECT is less than (or equal to) the number of tuples in the input relation R
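These properties are easy to check on a small example. The sketch below applies three selection conditions to a hypothetical EMPLOYEE relation in two different orders and as a single conjunction; the rows and the helper `select` are invented for the illustration.

```python
def select(predicate, relation):
    """σ: keep only the tuples that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


# Hypothetical EMPLOYEE tuples.
employee = [
    {"ssn": 1, "dno": 5, "salary": 40000, "sex": "F"},
    {"ssn": 2, "dno": 5, "salary": 25000, "sex": "F"},
    {"ssn": 3, "dno": 4, "salary": 43000, "sex": "F"},
    {"ssn": 4, "dno": 5, "salary": 38000, "sex": "M"},
]


def cond1(row):
    return row["dno"] == 5


def cond2(row):
    return row["salary"] > 30000


def cond3(row):
    return row["sex"] == "F"


# Cascade applied in two different orders (commutativity).
order_a = select(cond1, select(cond2, select(cond3, employee)))
order_b = select(cond2, select(cond3, select(cond1, employee)))

# The same cascade as one selection with a conjunction of the conditions.
conjunction = select(lambda r: cond1(r) and cond2(r) and cond3(r), employee)

print(order_a)
print(order_a == order_b == conjunction)
print(len(conjunction) <= len(employee))
```

An optimizer relies on exactly these equivalences when it splits a conjunctive condition and moves the pieces to different places in the query tree.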
Translating SQL Queries into Relational Algebra
• The PROJECT operation is denoted by π (pi)
• This operation keeps certain columns (attributes) from a relation and discards the other columns.
• PROJECT creates a vertical partitioning
  • The list of specified columns (attributes) is kept in each tuple
  • The other attributes in each tuple are discarded
• Example: To list each employee's first and last name and salary, the following is used:
  πLNAME, FNAME, SALARY(EMPLOYEE)
Translating SQL Queries into Relational Algebra
• The general form of the project operation is:
  π<attribute list>(R)
• π (pi) is the symbol used to represent the project operation
• <attribute list> is the desired list of attributes from relation R.
• The project operation removes any duplicate tuples
  • This is because the result of the project operation must be a set of tuples
  • Mathematical sets do not allow duplicate elements.
Translating SQL Queries into Relational Algebra
• PROJECT Operation Properties
• The number of tuples in the result of the projection π<list>(R) is always less than or equal to the number of tuples in R
• If the list of attributes includes a key of R, then the number of tuples in the result of PROJECT is equal to the number of tuples in R
• PROJECT is not commutative
• π<list1>(π<list2>(R)) = π<list1>(R) as long as <list2> contains the attributes in <list1>
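The PROJECT properties can be checked the same way. In this sketch `project` removes duplicates explicitly, as the set semantics require; the relation contents are hypothetical, with `ssn` assumed to be the key.

```python
def project(attributes, relation):
    """π: keep only the listed attributes and remove duplicate tuples."""
    seen = []
    for row in relation:
        reduced = tuple((a, row[a]) for a in attributes)
        if reduced not in seen:
            seen.append(reduced)
    return [dict(t) for t in seen]


# Hypothetical EMPLOYEE tuples; ssn is assumed to be the key.
employee = [
    {"ssn": 1, "lname": "Smith", "dno": 5},
    {"ssn": 2, "lname": "Wong", "dno": 5},
    {"ssn": 3, "lname": "Smith", "dno": 4},
]

# Duplicates are removed, so the result can be smaller than R.
by_lname = project(["lname"], employee)

# Including the key keeps every tuple distinct.
with_key = project(["ssn", "lname"], employee)

# π<list1>(π<list2>(R)) = π<list1>(R) when <list2> contains <list1>.
nested = project(["lname"], project(["lname", "dno"], employee))

print(by_lname)
print(len(with_key) == len(employee))
print(nested == by_lname)
```

Swapping the two attribute lists in the nested call would fail with a missing attribute, which is a concrete way to see that PROJECT is not commutative.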
Translating SQL Queries into Relational Algebra
• Query block:
  • The basic unit that can be translated into the algebraic operators and optimized.
  • A query block contains a single SELECT-FROM-WHERE expression, as well as GROUP BY and HAVING clauses if these are part of the block.
• Nested queries within a query are identified as separate query blocks.
• Aggregate operators in SQL must be included in the extended algebra.
Translating SQL Queries into Relational Algebra
SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > ( SELECT MAX (SALARY)
                 FROM EMPLOYEE
                 WHERE DNO = 5);

Outer block (C stands for the result of the inner block):
SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > C
Translated into: πLNAME, FNAME (σSALARY>C (EMPLOYEE))

Inner block:
SELECT MAX (SALARY)
FROM EMPLOYEE
WHERE DNO = 5
Translated into: ℱMAX SALARY (σDNO=5 (EMPLOYEE))
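The two query blocks can be evaluated one after the other: the inner block produces a single value, which the outer block then uses as a constant. The sketch below does this over a hypothetical EMPLOYEE relation; the rows are invented and the variable `c` plays the role of the inner block's result.

```python
# Hypothetical EMPLOYEE tuples.
employee = [
    {"lname": "Smith", "fname": "John", "salary": 30000, "dno": 5},
    {"lname": "Wong", "fname": "Franklin", "salary": 40000, "dno": 5},
    {"lname": "Wallace", "fname": "Jennifer", "salary": 43000, "dno": 4},
    {"lname": "Borg", "fname": "James", "salary": 55000, "dno": 1},
    {"lname": "Zelaya", "fname": "Alicia", "salary": 25000, "dno": 4},
]

# Inner block: maximum salary among employees of department 5.
c = max(row["salary"] for row in employee if row["dno"] == 5)

# Outer block: names of employees whose salary exceeds that value.
outer = [(row["lname"], row["fname"]) for row in employee if row["salary"] > c]

print(c)
print(outer)
```

Because the inner block does not refer to the outer one, it needs to be evaluated only once.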


Translating SQL Queries
• Each query block (inner and outer) is translated into relational algebra separately.
• The query optimizer chooses an execution plan for each query block.
SQL Query
Example: consider the following subset of the engineering database schema
EMP(ENO, ENAME, TITLE)
ASG(ENO, PNO, RESP, DUR)
"Find the names of employees who are managing a project"

SELECT ENAME
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO
AND RESP = 'Manager'

Translating SQL Queries to RA
One direct translation of this query is:
πENAME (σRESP='Manager' ∧ EMP.ENO=ASG.ENO (EMP × ASG))
JOIN Operation
– Implementing the JOIN operation on two tables
• Two operations for demonstration:
- EMPLOYEE ⋈DNO=DNUMBER DEPARTMENT
- DEPARTMENT ⋈MGRSSN=SSN EMPLOYEE
Relational Algebra

• SELECT * FROM student WHERE name='Paul'
• σname='Paul'(student)
• πname(σcid<00112235(student))
• πname(σcoursename='Advanced DBs'((student ⋈cid takes) ⋈courseid course))

student
cid       name
00112233  Paul
00112238  Rob
00112235  Matt

takes
cid       courseid
00112233  312
00112233  395
00112235  312

course
courseid  coursename
312       Advanced DBs
395       Machine Learning
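The expressions above can be evaluated directly on the three sample tables. The sketch below is a naive in-memory version; the helper `natural_join` is an assumption made for this example and simply pairs tuples that agree on one named attribute.

```python
student = [
    {"cid": "00112233", "name": "Paul"},
    {"cid": "00112238", "name": "Rob"},
    {"cid": "00112235", "name": "Matt"},
]
takes = [
    {"cid": "00112233", "courseid": 312},
    {"cid": "00112233", "courseid": 395},
    {"cid": "00112235", "courseid": 312},
]
course = [
    {"courseid": 312, "coursename": "Advanced DBs"},
    {"courseid": 395, "coursename": "Machine Learning"},
]


def natural_join(left, right, attribute):
    """Pair every left tuple with every right tuple that agrees on attribute."""
    return [{**l, **r} for l in left for r in right if l[attribute] == r[attribute]]


# σ name='Paul' (student)
paul = [row for row in student if row["name"] == "Paul"]

# π name (σ cid<00112235 (student)); the ids are fixed-width strings,
# so string comparison matches numeric order.
early = sorted({row["name"] for row in student if row["cid"] < "00112235"})

# π name (σ coursename='Advanced DBs' ((student ⋈ takes) ⋈ course))
joined = natural_join(natural_join(student, takes, "cid"), course, "courseid")
advanced = sorted({row["name"] for row in joined if row["coursename"] == "Advanced DBs"})

print(paul)
print(early)
print(advanced)
```

Rob takes no course, so he disappears in the first join; this is the usual behaviour of an inner join.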

38
Complexity of Relational Operations
The simplest way of defining complexity is in terms of relation cardinalities, independent of physical implementation details such as fragmentation and storage.

Operation                                          Complexity
Select, Project (without duplicate elimination)    O(n)
Project (with duplicate elimination), Group        O(n log n)
Join, Semi-join, Division, Set operations          O(n log n)
Cartesian product                                  O(n²)
SQL Query Optimization

• Consider using an IN predicate when querying an indexed column
• Use EXISTS instead of DISTINCT when using table joins
• Avoid including a HAVING clause in SELECT statements
• Use Column Names Instead of * in a SELECT Statement
• Eliminate Unnecessary DISTINCT Conditions
• Un-nest sub queries
• Avoid using OR in join conditions
• Try to use UNION ALL in place of UNION
• Remove any redundant mathematics
• Avoid functions on the right-hand side of the operator
Query optimization techniques

There are two main techniques for implementing query optimization.

1. Heuristic rules: rules for ordering the operations in a query. Apply SELECT (to reduce rows) and PROJECT (to reduce columns) before JOIN, to reduce the size of the files to be joined.
• Heuristic optimization algorithms aim to find a good (near-optimal) solution at low computational cost (i.e., time or memory space).
2. Cost estimation: systematically estimating the cost of different execution strategies, comparing them on relative cost, and choosing the strategy with the lowest cost estimate, that is, the one that minimizes resource usage.
Heuristics Query Optimization
• Heuristics: make a sequence of choices for the query based on heuristic rules.
• The result is not guaranteed to be optimal (it is usually near optimal)
• regroup common sub-expressions
• perform selection and projection first
• replace a join by a series of semi-joins
• reorder operations to reduce intermediate relation size
• optimize individual operations
• Only one plan is generated
• Drawback: the approach is not cost based, so it can pick expensive plans

Some of the common heuristic rules are:
• Perform select and project operations before join operations. This is done by moving the select and project operations down the query tree, which reduces the number of tuples available for the join.
• Perform the most restrictive select/project operations first, before the other operations.
• Avoid cross-product operations, since they result in very large intermediate tables.
Heuristic optimization algorithms

The common advantages of heuristic optimization algorithms are as follows:
• Fast: They can find a "near-optimal" solution in a short time.
• Small: They can work in a relatively small memory space.
The common disadvantages of heuristic optimization algorithms are as follows:
• No guarantee of optimality: Heuristic algorithms cannot guarantee that they find the optimal solution.
• Uncertainty: The time required to find a "near-optimal" solution can be large in an unlucky case.
Using Heuristics in Query Optimization

• Heuristic rules are used to modify the internal representation (query tree) of a query.
• One of the main heuristic rules is to apply SELECT and PROJECT operations before applying JOIN or other binary operations.
• The SELECT and PROJECT operations reduce the size of a file and hence should be applied first.
• Heuristic-based optimization uses rule-based approaches for query optimization.
• These algorithms have polynomial time and space complexity, which is lower than the exponential complexity of exhaustive search-based algorithms.
• However, these algorithms do not necessarily produce the best query plan.
Heuristic Optimization of Query Trees
• The query parser typically generates a standard initial query tree that corresponds to the SQL query, without doing any optimization.
• The heuristic query optimizer transforms this initial (inefficient) query tree into a final query tree that is efficient to execute.
Steps in converting a query tree during heuristic optimization:
Step 1: Initial (canonical) query tree for SQL query Q.
Step 2: Moving SELECT operations down the query tree.
Step 3: Applying the more restrictive SELECT operation first.
Step 4: Replacing CARTESIAN PRODUCT and SELECT with JOIN operations.
Step 5: Moving PROJECT operations down the query tree.
Heuristic Optimization of Query Trees

Example of transforming an SQL query ⇒ initial query tree ⇒ optimized query tree using the heuristic algorithm.
SELECT lname FROM employee, works_on, project
WHERE pname='Aquarius' AND pnumber=pno AND
essn=ssn AND bdate > '1957-12-31';
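The effect of pushing selections down can be seen by counting intermediate tuples. The sketch below evaluates this query twice over tiny hypothetical tables: once as the canonical plan (Cartesian product of all three relations, then one big selection) and once with the selections applied first and the products replaced by joins. All rows and helper names are invented for the illustration.

```python
# Hypothetical sample data for the three relations.
employee = [
    {"ssn": 1, "lname": "Smith", "bdate": "1965-01-09"},
    {"ssn": 2, "lname": "Wong", "bdate": "1955-12-08"},
    {"ssn": 3, "lname": "English", "bdate": "1972-07-31"},
]
works_on = [
    {"essn": 1, "pno": 10},
    {"essn": 2, "pno": 10},
    {"essn": 3, "pno": 20},
    {"essn": 3, "pno": 10},
]
project_rel = [
    {"pnumber": 10, "pname": "Aquarius"},
    {"pnumber": 20, "pname": "ProductX"},
]


def product(left, right):
    """Cartesian product of two relations."""
    return [{**l, **r} for l in left for r in right]


def select(predicate, relation):
    return [row for row in relation if predicate(row)]


# Canonical plan: product of everything, then one selection.
cartesian = product(product(employee, works_on), project_rel)
canonical = select(
    lambda r: r["pname"] == "Aquarius" and r["pnumber"] == r["pno"]
    and r["essn"] == r["ssn"] and r["bdate"] > "1957-12-31",
    cartesian,
)
canonical_names = sorted(r["lname"] for r in canonical)

# Heuristic plan: selections first, then joins instead of products.
young = select(lambda r: r["bdate"] > "1957-12-31", employee)
aquarius = select(lambda r: r["pname"] == "Aquarius", project_rel)
step1 = [{**p, **w} for p in aquarius for w in works_on if p["pnumber"] == w["pno"]]
step2 = [{**x, **e} for x in step1 for e in young if x["essn"] == e["ssn"]]
pushed_names = sorted(r["lname"] for r in step2)

largest_canonical = len(cartesian)
largest_pushed = max(len(young), len(aquarius), len(step1), len(step2))

print(canonical_names, pushed_names)
print(largest_canonical, largest_pushed)
```

Both plans return the same names, but the largest intermediate result shrinks from 24 tuples to 3 even on this toy data; the gap grows with the size of the relations.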

46
Heuristic Optimization of Query Trees
[Figures omitted: initial query tree ⇒ query tree after pushing down the selection operations.]
Using Selectivity and Cost Estimates in Query Optimization

• A query optimizer should not depend solely on heuristic rules; it should also estimate and compare the costs of executing a query using different execution strategies and choose the strategy with the lowest cost estimate.
• We need a cost function that estimates the cost of executing a query.
Cost Components for Query Execution

• The cost of executing a query includes the following components:
1. Access cost to secondary storage: The cost of searching for, reading, and writing data blocks that reside on secondary storage.
2. Storage cost: The cost of storing temporary files generated by an execution strategy for the query.
3. Computation cost: The cost of performing in-memory operations on the data buffers during query execution.
4. Memory usage cost: The cost pertaining to the number of memory buffers needed during query execution.
5. Communication cost: The cost of shipping the query and its results from the database site to the site or terminal where the query originated.
Cost Components for Query Execution
• Different applications place different emphasis on the individual cost components. For example:
– For large databases, the main emphasis is on minimizing the access cost to secondary storage.
– For smaller databases, the emphasis is on minimizing computation cost, because most of the data in the files involved in the query can be stored completely in memory.
– For distributed databases, communication cost must be minimized regardless of other factors.
Estimating Cost
• What needs to be considered:
  • Disk I/Os
    • sequential
    • random
  • CPU time
  • Network communication
• What we are going to consider:
  • Disk I/Os
    • page reads/writes
  • Ignoring the cost of writing the final output
Estimating Cost
In a distributed database system, the total cost to be minimized includes
I/O cost + CPU cost + communication cost
• These might have different weights in different distributed environments
• Wide area networks
  • communication cost will dominate
    • low bandwidth
    • low speed
    • high protocol overhead
• Local area networks
  • communication cost is not that dominant
  • the total cost function should be considered
Selectivity Cost-Based Optimization

• Cost components for query execution


• Access cost to secondary storage
• Disk storage cost
• Computation cost
• Memory usage cost
• Communication cost

Slide 19- 57
Using Selectivity and Cost Estimates in Query Optimization

Find all Managers who work at a London branch.

SELECT *
FROM Staff s, Branch b
WHERE s.branchNo = b.branchNo AND
(s.position = 'Manager' AND b.city = 'London');
Different Strategies
• Three equivalent RA queries are:
(1) σ(position='Manager') ∧ (city='London') ∧ (Staff.branchNo=Branch.branchNo) (Staff × Branch)
(2) σ(position='Manager') ∧ (city='London') (Staff ⋈Staff.branchNo=Branch.branchNo Branch)
(3) (σposition='Manager'(Staff)) ⋈Staff.branchNo=Branch.branchNo (σcity='London'(Branch))
Different Strategies
• Assume:
• 1000 tuples in Staff; 50 tuples in Branch;
• 50 Managers; 5 London branches;
• no indexes or sort keys;
• results of any intermediate operations stored on disk;
• cost of the final write is ignored;
• tuples are accessed one at a time.

60
Cost Comparison
• Costs (in disk accesses) are:

(1) (1000 + 50) + 2*(1000 * 50) = 101 050
(2) 2*1000 + (1000 + 50) = 3 050
(3) 1000 + 2*50 + 5 + (50 + 5) = 1 160

• Cartesian product and join operations are much more expensive than selection, and the third option significantly reduces the size of the relations being joined together.
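The arithmetic behind the three figures can be reproduced in a few lines. The sketch below plugs the stated assumptions (1000 Staff tuples, 50 Branch tuples, 50 Managers, 5 London branches) into the three cost expressions; the comments give one reading of what each term counts, and the variable names are chosen for this example.

```python
n_staff = 1000    # tuples in Staff
n_branch = 50     # tuples in Branch
n_managers = 50   # Staff tuples with position = 'Manager'
n_london = 5      # Branch tuples with city = 'London'

# (1) Read both relations, then write the Cartesian product to disk
#     and read it back to apply the selection.
cost1 = (n_staff + n_branch) + 2 * (n_staff * n_branch)

# (2) Read both relations for the join; the join result (one tuple per
#     staff member) is written and read back for the selection.
cost2 = 2 * n_staff + (n_staff + n_branch)

# (3) Read Staff, write and re-read the 50 managers, write the 5 London
#     branches, and read Branch plus the London branches.
cost3 = n_staff + 2 * n_managers + n_london + (n_branch + n_london)

print(cost1, cost2, cost3)
print(round(cost1 / cost3))  # how many times cheaper strategy (3) is than (1)
```

The ratio between the first and third strategies is roughly 87 to 1, which is why the order of selections and joins matters so much.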

61
Exhaustive Search Optimization
• Input language: relational calculus or relational algebra
• In these techniques, all possible query plans for a query are generated first, and then the best plan is selected
• Exhaustive search is cost based, which is good, and it finds the optimal plan
• However, it is too expensive:
  • the search space is very large
  • the complexity is combinatorial in the number of relations
• Exhaustive search techniques are therefore suitable only for queries with a few relations
Semantic Query Optimization

• The process of transforming one query into another equivalent one, using semantic knowledge expressed as integrity constraints, so that it yields the same answer as the original query regardless of the state of the database.
• Integrity constraints are properties that a database must satisfy.
Semantic Query Optimization

• Semantic query optimization is the process of using integrity constraints and other semantic knowledge to transform a query into another equivalent one.
• Two queries are semantically equivalent if they return the same answer for any state of the database that satisfies the integrity constraints.
• Semantic query optimization determines the set of semantic transformations that result in a semantically equivalent query with a lower execution cost.
• Goal: modify one query into another that is more efficient to execute.
Dynamic versus Static Optimization
• QP can be carried out:
• dynamically, every time the query is run;
• statically, when the query is first submitted.
• The advantage of dynamic QO is that the information used to choose a strategy is up to date.
• The disadvantages are that the performance of the query is affected, and that limited time may restrict the search for an optimum strategy.
© Pearson Education Limited 1995, 2005
Dynamic versus Static Optimization
• The advantages of static QO are the removal of runtime overhead and more time to find an optimum strategy.
• The disadvantage is that the chosen execution strategy may no longer be optimal when the query is run.
• A hybrid approach could be used to overcome this.
Optimization Timing
• Static
  • optimization is done at compilation time, prior to execution
  • it is difficult to estimate the size of the intermediate results, which leads to error propagation
  • the optimization cost can be amortized over many executions
  • E.g. R*
• Dynamic
  • run-time optimization
  • exact information on the intermediate relation sizes
  • has to reoptimize for multiple executions
  • E.g. Distributed INGRES
• Hybrid
  • compile using a static algorithm
  • if the error in the estimated sizes > threshold, reoptimize at run time
  • E.g. MERMAID
Database Statistics
• The success of estimation depends on the amount and currency of the statistical information the DBMS holds.
• Keeping statistics current can be problematic.
• If statistics were updated every time a tuple is changed, this would impact performance.
• The DBMS could update statistics on a periodic basis, for example nightly, or whenever the system is idle.
Database Statistics of Optimization
• Relation
  • cardinality
  • size of a tuple
  • fraction of tuples participating in a join with another relation
• Attribute
  • cardinality of domain
  • actual number of distinct values
• Common assumptions
  • independence between different attribute values
  • uniform distribution of attribute values within their domain
Typical Statistics for Relation R

nTuples(R) - number of tuples in R.
bFactor(R) - blocking factor of R.
nBlocks(R) - number of blocks required to store R:
nBlocks(R) = ⌈nTuples(R)/bFactor(R)⌉
Typical Statistics for Attribute A of Relation R

nDistinctA(R) - number of distinct values that appear for attribute A in R.
minA(R), maxA(R) - minimum and maximum possible values for attribute A in R.
SCA(R) - selection cardinality of attribute A in R: the average number of tuples that satisfy an equality condition on attribute A.
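A short sketch of how such statistics might be computed. The number of blocks is the tuple count divided by the blocking factor, rounded up. For the selection cardinality the sketch assumes a non-key attribute whose values are uniformly distributed, in which case the average is nTuples(R) / nDistinctA(R); that formula, the function names, and the sample figures are assumptions made for this illustration.

```python
import math


def n_blocks(n_tuples: int, b_factor: int) -> int:
    """Blocks needed to store R: tuples per relation / tuples per block, rounded up."""
    return math.ceil(n_tuples / b_factor)


def selection_cardinality(n_tuples: int, n_distinct: int) -> float:
    """Average tuples matching an equality condition on a non-key attribute,
    assuming values are uniformly distributed over the distinct values."""
    return n_tuples / n_distinct


# Hypothetical figures: 1000 tuples, 30 tuples per block,
# 50 distinct values of the attribute.
blocks = n_blocks(1000, 30)
sc = selection_cardinality(1000, 50)

print(blocks)  # 1000 / 30 = 33.3..., rounded up
print(sc)
```

If the attribute were a key, every equality condition would match at most one tuple, so the selection cardinality would simply be 1.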

Layers of Distributed database Query
Processing

73
Step 1 Query Decomposition
Decomposes a calculus query into an algebraic query on distributed relations.
• Input: calculus query on global relations.
• Output: relational algebra query on global relations.
• Query decomposition can be viewed as four successive steps:

1. Normalization (transform the query into a normalized form for further processing)
  • Transform the SQL query, using the query quantifiers and the query qualification (the WHERE clause), by applying logical operator priority.
2. Analysis
  • Analyze the SQL query semantically so that incorrect queries are detected and rejected as early as possible.
3. Simplification
  • Eliminate redundant predicates.
4. Restructuring
  • The calculus query is restructured into an algebraic query.
  • More than one translation is possible.
  • Use transformation rules.
Normalization
• Lexical and syntactic analysis
  • check validity (similar to compilers)
  • check for attributes and relations
  • type checking on the qualification
• There are two possible normal forms for the predicate, one giving precedence to AND (∧) and the other to OR (∨).
• Put into normal form
  • Conjunctive normal form: a series of ORs connected by ANDs. For example, (A OR B) AND (C OR D) is in CNF.
    (p11 ∨ p12 ∨ … ∨ p1n) ∧ … ∧ (pm1 ∨ pm2 ∨ … ∨ pmn)
  • Disjunctive normal form: a series of ANDs connected by ORs.
    (p11 ∧ p12 ∧ … ∧ p1n) ∨ … ∨ (pm1 ∧ pm2 ∧ … ∧ pmn)
  • ORs are mapped into union
  • ANDs are mapped into join or selection

The idea is to transform the Boolean expression into an equivalent form that is easier to analyze or work with.
Normalization
• Converts the query into a normalized form that is easier to manipulate and analyze.
• The predicate can be converted into one of two forms:

Conjunctive normal form:
(position = 'Manager' ∨ salary > 20000) ∧ (branchNo = 'B003')

Disjunctive normal form:
(position = 'Manager' ∧ branchNo = 'B003') ∨
(salary > 20000 ∧ branchNo = 'B003')
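The two forms are logically equivalent, reading the conjunctive form as (position = 'Manager' OR salary > 20000) AND branchNo = 'B003' and the disjunctive form as (position = 'Manager' AND branchNo = 'B003') OR (salary > 20000 AND branchNo = 'B003'). A brute-force truth table confirms this; the sketch treats each simple predicate as an independent Boolean, and the function and parameter names are chosen for this example.

```python
from itertools import product


def cnf(is_manager: bool, high_salary: bool, in_b003: bool) -> bool:
    # (position = 'Manager' OR salary > 20000) AND branchNo = 'B003'
    return (is_manager or high_salary) and in_b003


def dnf(is_manager: bool, high_salary: bool, in_b003: bool) -> bool:
    # (position = 'Manager' AND branchNo = 'B003')
    #   OR (salary > 20000 AND branchNo = 'B003')
    return (is_manager and in_b003) or (high_salary and in_b003)


# Compare the two forms on all 8 combinations of truth values.
equivalent = all(
    cnf(*values) == dnf(*values)
    for values in product([False, True], repeat=3)
)
print(equivalent)
```

The equivalence is just the distributive law of AND over OR, which is the rule a normalizer applies repeatedly.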

77
Analysis
• Analyze the query lexically and syntactically
using compiler techniques.
• Verify relations and attributes exist.
• Verify operations are appropriate for object type.

Semantic Analysis
• For these queries, we could construct:
• a relation connection graph;
• a normalized attribute connection graph.

Relation connection graph
• Create a node for each relation and a node for the result.
• Create edges between two nodes that represent a join, and edges between nodes that represent a projection.
• If the graph is not connected, the query is incorrectly formulated.
Semantic Analysis
• Rejects normalized queries that are incorrectly
formulated or contradictory.
• Query is incorrectly formulated if components
do not contribute to generation of result.
• Query is contradictory if its predicate cannot be
satisfied by any tuple.
• Algorithms to determine correctness exist only
for queries that do not contain disjunction and
negation.

Analysis
• Remove incorrect queries.
• Type-incorrect query:
  – Any of its attribute or relation names is not defined in the global schema.
  – Operations are applied to attributes of the wrong type.
• Semantically (meaningfully) incorrect query:
  – Its components do not contribute in any way to the generation of the result.
  – Semantic correctness cannot be tested for general queries, only for a subset of relational calculus queries.
  – It can be tested for a large class of relational queries: those that do not contain disjunction and negation.
• Techniques to detect semantically incorrect queries:
  – Connection graph (query graph), which represents the semantics of the query.
  – Join graph, the subgraph of the query graph that considers only the joins.

Analysis
• Finally, query transformed into some internal
representation more suitable for processing.
• Some kind of query tree is typically chosen,
constructed as follows:
• Leaf node created for each base relation.
• Non-leaf node created for each intermediate relation
produced by RA operation.
• Root of tree represents query result.
• Sequence is directed from leaves to root.

Analysis – Example
"Find the names and responsibilities of programmers who have been working on the CAD/CAM project for more than 3 years."
SELECT ENAME, RESP
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO
AND PNAME = "CAD/CAM"
AND DUR >= 36
AND TITLE = "Programmer"
The query graph is connected, so the query is semantically correct.

Analysis
• If the query graph is not connected, the query is semantically incorrect. In the query below, the join predicate between ASG and PROJ is missing:
SELECT ENAME, RESP, PNAME
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND PNAME = "CAD/CAM"
AND DUR >= 36
AND TITLE = "Programmer"
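The connectivity test can be sketched in a few lines of Python. This sketch works on the join graph only (relations as nodes, join predicates as edges), which is an assumption made here to keep the check unambiguous; the helper name `is_connected` is hypothetical.

```python
from collections import defaultdict

def is_connected(nodes, edges):
    """Return True if the undirected graph (nodes, edges) is connected."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start = nodes[0]
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        # Visit every neighbour that has not been reached yet.
        for neighbour in adjacency[current] - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    return seen == set(nodes)

relations = ["EMP", "ASG", "PROJ"]

# First query: EMP.ENO = ASG.ENO and ASG.PNO = PROJ.PNO
joins_first = [("EMP", "ASG"), ("ASG", "PROJ")]
# Second query: only EMP.ENO = ASG.ENO, so PROJ is isolated
joins_second = [("EMP", "ASG")]

print(is_connected(relations, joins_first))   # True
print(is_connected(relations, joins_second))  # False
```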

Analysis
• There are basically three solutions to the problem:
1) Reject the query.
2) Assume that there is an implicit Cartesian product between relations ASG and PROJ.
3) Infer (conclude) from the schema the missing join predicate ASG.PNO = PROJ.PNO, which makes the query graph connected, as in the previous example.
• As written, the relation connection graph is not fully connected, so the query is not correctly formulated semantically.

Simplification
• Detects redundant qualifications,
• eliminates common sub-expressions,
• transforms query to semantically equivalent but
more easily and efficiently computed form.

Simplification
Elimination of Redundancy
• A query qualification may contain redundant predicates. Such redundancy, and the redundant work it causes, can be eliminated by simplifying the qualification with the well-known idempotency and equivalence rules for the logical operators (AND, OR, and NOT).

Simplification – Example
SELECT TITLE
FROM EMP
WHERE EMP.ENAME = "J. Doe"
OR (NOT (EMP.TITLE = "Programmer")
AND (EMP.TITLE = "Programmer"
OR EMP.TITLE = "Elect. Eng.")
AND NOT (EMP.TITLE = "Elect. Eng."))

Let p1 be EMP.TITLE = "Programmer"
Let p2 be EMP.TITLE = "Elect. Eng."
Let p3 be EMP.ENAME = "J. Doe"
The qualification is then p3 ∨ (¬p1 ∧ (p1 ∨ p2) ∧ ¬p2).

Simplification – Example
The original query:
SELECT TITLE
FROM EMP
WHERE EMP.ENAME = "J. Doe"
OR (NOT (EMP.TITLE = "Programmer")
AND (EMP.TITLE = "Programmer"
OR EMP.TITLE = "Elect. Eng.")
AND NOT (EMP.TITLE = "Elect. Eng."))
The simplified query:
SELECT TITLE
FROM EMP
WHERE EMP.ENAME = "J. Doe"
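The simplification can be checked mechanically. Writing p1 for TITLE = "Programmer", p2 for TITLE = "Elect. Eng." and p3 for ENAME = "J. Doe", the part ¬p1 ∧ (p1 ∨ p2) ∧ ¬p2 expands to (¬p1 ∧ p1 ∧ ¬p2) ∨ (¬p1 ∧ p2 ∧ ¬p2), which is always false, so only p3 remains. The Python sketch below confirms this over every truth assignment; the function name `qualification` is an illustrative assumption.

```python
from itertools import product

def qualification(p1, p2, p3):
    # p3 OR (NOT p1 AND (p1 OR p2) AND NOT p2)
    return p3 or ((not p1) and (p1 or p2) and (not p2))

# The original qualification agrees with p3 alone on all 8 assignments.
same_as_p3 = all(qualification(p1, p2, p3) == p3
                 for p1, p2, p3 in product([False, True], repeat=3))
print(same_as_p3)  # True
```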

Restructuring (Rewriting)
• The last step of query decomposition rewrites the query in relational algebra.
• For clarity, the relational algebra query is usually represented graphically by an operator tree.
• An operator tree is a tree in which a leaf node is a relation stored in the database, and a non-leaf node is an intermediate relation produced by a relational algebra operator.
• The sequence of operations is directed from the leaves to the root, which represents the answer to the query.

Restructuring (Rewriting)
How to draw a query tree
• First, a leaf is created for each relation in the query. In SQL, the leaves are immediately available in the FROM clause.
• Second, the root node is created as a project operation involving the result attributes. These are found in the SELECT clause in SQL.
• Third, the qualification (SQL WHERE clause) is translated into the appropriate sequence of relational operations (select, join, union, etc.) going from the leaves to the root. Each intermediate node passes its result up toward the root.
• The sequence can be given directly by the order of appearance of the predicates and operators.

Restructuring (Rewriting)
• Convert relational calculus to relational algebra.
• Make use of query trees.

Example
Find the names of employees other than J. Doe who worked on the CAD/CAM project for either 12 or 24 months.

SELECT ENAME
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO
AND ENAME <> "J. Doe"
AND PNAME = "CAD/CAM"
AND (DUR = 12 OR DUR = 24)
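One possible operator tree for this query can be written down as a small Python structure. This is only a sketch: the `Node` class, the `leaves` and `show` helpers, and the chosen order of selections and joins are assumptions for illustration, and many equivalent trees exist.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def leaves(node):
    """Collect the base relations (leaf labels) from left to right."""
    if not node.children:
        return [node.label]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found

def show(node, depth=0):
    """Return the tree as indented lines, root first."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(show(child, depth + 1))
    return lines

# Leaves come from the FROM clause.
emp, asg, proj = Node("EMP"), Node("ASG"), Node("PROJ")
# Joins and selections come from the WHERE clause.
join_eno = Node("join ENO", [asg, emp])
join_pno = Node("join PNO", [proj, join_eno])
sel_name = Node('select ENAME <> "J. Doe"', [join_pno])
sel_proj = Node('select PNAME = "CAD/CAM"', [sel_name])
sel_dur = Node("select DUR = 12 OR DUR = 24", [sel_proj])
# The root comes from the SELECT clause.
root = Node("project ENAME", [sel_dur])

print("\n".join(show(root)))
print(leaves(root))  # ['PROJ', 'ASG', 'EMP']
```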
Restructuring (Rewriting)
• By applying transformation rules, many different trees may be found that are equivalent to the one produced by the method described above.
• The six most useful equivalence rules concern the basic relational algebra operators.

Restructuring – Transformation Rules
• Commutativity of binary operations
  – R × S ⇔ S × R
  – R join S ⇔ S join R
  – R ∪ S ⇔ S ∪ R
• Associativity of binary operations
  – (R × S) × T ⇔ R × (S × T)
  – (R join S) join T ⇔ R join (S join T)
• Idempotence of unary operations
  – ΠA'(ΠA''(R)) ⇔ ΠA'(R)
  – σp1(A1)(σp2(A2)(R)) ⇔ σp1(A1) ∧ p2(A2)(R)
  where R[A] and A' ⊆ A, A'' ⊆ A and A' ⊆ A''
• Commuting selection with projection
  – ΠA1,...,An(σp(Ap)(R)) ⇔ ΠA1,...,An(σp(Ap)(ΠA1,...,An,Ap(R)))
Restructuring – Transformation Rules
• Commuting selection with binary operations
  – σp(A)(R × S) ⇔ (σp(A)(R)) × S
  – σp(Ai)(R join(Aj,Bk) S) ⇔ (σp(Ai)(R)) join(Aj,Bk) S
  – σp(Ai)(R ∪ T) ⇔ σp(Ai)(R) ∪ σp(Ai)(T)
  where Ai belongs to R and T
• Commuting projection with binary operations
  – ΠC(R × S) ⇔ ΠA'(R) × ΠB'(S)
  – ΠC(R join(Aj,Bk) S) ⇔ ΠA'(R) join(Aj,Bk) ΠB'(S)
  – ΠC(R ∪ S) ⇔ ΠC(R) ∪ ΠC(S)
  where R[A] and S[B]; C = A' ∪ B' where A' ⊆ A, B' ⊆ B
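The rule for commuting selection with a Cartesian product can be tried out on tiny relations. In this Python sketch, relations are lists of dictionaries and the sample tuples are invented for illustration; it shows that pushing the selection below the product gives the same result while building a smaller intermediate relation.

```python
def cartesian(r, s):
    """R × S for relations with disjoint attribute names."""
    return [{**x, **y} for x in r for y in s]

def select(pred, r):
    """σ_pred(R)."""
    return [t for t in r if pred(t)]

def as_set(r):
    """Order-independent view of a relation, for comparing results."""
    return {tuple(sorted(t.items())) for t in r}

# Hypothetical sample data.
EMP = [{"ENO": "E1", "TITLE": "Programmer"},
       {"ENO": "E2", "TITLE": "Elect. Eng."}]
PROJ = [{"PNO": "P1", "PNAME": "CAD/CAM"},
        {"PNO": "P2", "PNAME": "Maintenance"}]

def is_programmer(t):
    # The predicate only uses an attribute of EMP.
    return t["TITLE"] == "Programmer"

late = select(is_programmer, cartesian(EMP, PROJ))   # σp(R × S)
early = cartesian(select(is_programmer, EMP), PROJ)  # σp(R) × S

print(as_set(late) == as_set(early))  # True
print(len(cartesian(EMP, PROJ)), len(select(is_programmer, EMP)))  # 4 1
```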

Example
[Figure: operator tree for the example query above.]

Equivalent Query
[Figure: an equivalent operator tree.]

Restructuring
[Figure: the restructured operator tree, which includes the selection σDUR=12 ∨ DUR=24.]

Step 2 – Data Localization
• The localization layer translates an algebraic query on global relations into an algebraic query expressed on physical fragments.
• Input: algebraic query on the conceptual schema.
• Goal: localize the query's data using information stored in the fragment schema.
• Global relations are fragmented and the fragments are stored at different sites.
• Fragmentation is defined through fragmentation rules, which can be expressed as relational queries.

Step 2 – Data Localization …
• Objective: to localize the query's data using data distribution information in the fragment schema.
• It identifies which fragments are involved in the query and transforms the distributed query into a fragment query.
• It is done in two steps:
1. The distributed query is mapped into a fragment query by substituting each distributed relation with its reconstruction program.
2. The fragment query is simplified and restructured to produce another "good" query.

• Assume
  – EMP is fragmented into EMP1, EMP2, and EMP3 as follows:
      EMP1 = σENO≤"E3"(EMP)
      EMP2 = σ"E3"<ENO≤"E6"(EMP)
      EMP3 = σENO>"E6"(EMP)
  – ASG is fragmented into ASG1 and ASG2 as follows:
      ASG1 = σENO≤"E3"(ASG)
      ASG2 = σENO>"E3"(ASG)
• The localization program for a horizontally fragmented relation is the union of the fragments.
• Replace EMP by (EMP1 ∪ EMP2 ∪ EMP3) and ASG by (ASG1 ∪ ASG2) in any query.
• The result is called the generic query.
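The fragmentation above and its localization program can be mimicked with plain Python lists. The sample EMP tuples are invented, and employee numbers are limited to a single digit so that string comparison matches the intended ordering; both are assumptions of this sketch.

```python
# Hypothetical EMP relation with employee numbers E1..E8.
EMP = [{"ENO": f"E{i}", "ENAME": f"emp{i}"} for i in range(1, 9)]

# Horizontal fragments defined by selections on ENO.
EMP1 = [t for t in EMP if t["ENO"] <= "E3"]
EMP2 = [t for t in EMP if "E3" < t["ENO"] <= "E6"]
EMP3 = [t for t in EMP if t["ENO"] > "E6"]

# Localization program: the union of the fragments rebuilds EMP.
reconstructed = EMP1 + EMP2 + EMP3

print(len(EMP1), len(EMP2), len(EMP3))  # 3 3 2
print(sorted(t["ENO"] for t in reconstructed) ==
      sorted(t["ENO"] for t in EMP))    # True
```

Because the three predicates do not overlap, every tuple lands in exactly one fragment, so the union contains no duplicates.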


Primary Horizontal Fragmentation
Definition:
• A primary horizontal fragmentation is defined by a selection operation on the owner relations of a database schema:
    Rj = σFj(R), 1 ≤ j ≤ w
  where Fj is a selection formula, which is (preferably) a minterm predicate.
• A horizontal fragment Ri of relation R consists of all the tuples of R which satisfy a minterm predicate mi.
• Given a set of minterm predicates M, there are as many horizontal fragments of relation R as there are minterm predicates.
• The set of horizontal fragments is also referred to as the set of minterm fragments.
Reduction for Primary Horizontal Fragmentation
• With EMP and ASG fragmented as above, the localized form of any query specified on EMP is obtained by replacing EMP by (EMP1 ∪ EMP2 ∪ EMP3), and similarly ASG by (ASG1 ∪ ASG2).
• In general, the generic query is inefficient and can be simplified further.
• The reduction of queries on horizontally fragmented relations consists primarily of determining, after restructuring the subtrees, those that will produce empty relations, and removing them.
• Horizontal fragmentation can be exploited to simplify both selection and join operations.
Reduction for PHF
• Reduction with selection
  – Given relation R and FR = {R1, R2, …, Rw} where Rj = σpj(R):
      σpi(Rj) = ∅ if ∀x in R: ¬(pi(x) ∧ pj(x))
• Example, with
      EMP1 = σENO≤"E3"(EMP)
      EMP2 = σ"E3"<ENO≤"E6"(EMP)
      EMP3 = σENO>"E6"(EMP)
  and the query
      SELECT *
      FROM EMP
      WHERE ENO = "E5"
  only EMP2 can contain matching tuples, so the selection is applied to EMP2 alone.
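A minimal Python sketch of reduction with selection follows: for an equality predicate on ENO, a fragment is kept only if its defining predicate accepts the requested value. The dictionary `FRAGMENT_PREDICATES` and the function `relevant_fragments` are hypothetical names, and single-digit employee numbers are assumed so that string comparison behaves as intended.

```python
# Defining predicate of each horizontal fragment of EMP.
FRAGMENT_PREDICATES = {
    "EMP1": lambda eno: eno <= "E3",
    "EMP2": lambda eno: "E3" < eno <= "E6",
    "EMP3": lambda eno: eno > "E6",
}

def relevant_fragments(eno_value):
    """Fragments that can hold a tuple with ENO = eno_value."""
    return [name for name, accepts in FRAGMENT_PREDICATES.items()
            if accepts(eno_value)]

# SELECT * FROM EMP WHERE ENO = "E5"
print(relevant_fragments("E5"))  # ['EMP2']
```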

Reduction for PHF
• Reduction with join
  – Possible if fragmentation is done on the join attribute.
  – Distribute the join over the union:
      (R1 ∪ R2) join S ⇔ (R1 join S) ∪ (R2 join S)
  – Given Ri = σpi(R) and Rj = σpj(R):
      Ri join Rj = ∅ if ∀x in Ri, ∀y in Rj: ¬(pi(x) ∧ pj(y))

Reduction for PHF
• Reduction with join – Example
  – Assume EMP is fragmented into three fragments and ASG into two:
      EMP1 = σENO≤"E3"(EMP)
      EMP2 = σ"E3"<ENO≤"E6"(EMP)
      EMP3 = σENO>"E6"(EMP)
      ASG1 = σENO≤"E3"(ASG)
      ASG2 = σENO>"E3"(ASG)
  – Consider the query
      SELECT * FROM EMP, ASG
      WHERE EMP.ENO = ASG.ENO

Reduction for PHF
• Reduction with join
  – The query, reduced by distributing joins over unions and applying rule 2 (the join reduction rule), can be implemented as a union of three partial joins that can be done in parallel:
      (EMP1 join ASG1) ∪ (EMP2 join ASG2) ∪ (EMP3 join ASG2)
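The three surviving partial joins can be found mechanically. In the Python sketch below, each fragment is described by the half-open interval (low, high] of employee numbers it covers, written as plain numbers for simplicity; this encoding and the names `overlaps` and `partial_joins` are assumptions made for illustration. A pair of fragments is kept only if the two intervals share at least one value.

```python
INF = float("inf")

# Each fragment covers employee numbers in the interval (low, high].
EMP_FRAGMENTS = {"EMP1": (-INF, 3), "EMP2": (3, 6), "EMP3": (6, INF)}
ASG_FRAGMENTS = {"ASG1": (-INF, 3), "ASG2": (3, INF)}

def overlaps(a, b):
    """True if half-open intervals (lo, hi] a and b intersect."""
    return max(a[0], b[0]) < min(a[1], b[1])

def partial_joins(left, right):
    """Pairs of fragments whose join on ENO can be non-empty."""
    return [(l_name, r_name)
            for l_name, l_range in left.items()
            for r_name, r_range in right.items()
            if overlaps(l_range, r_range)]

print(partial_joins(EMP_FRAGMENTS, ASG_FRAGMENTS))
# [('EMP1', 'ASG1'), ('EMP2', 'ASG2'), ('EMP3', 'ASG2')]
```

Of the six possible fragment pairs, three are eliminated, and the remaining joins are independent of one another.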

Reduction for VF
• Find useless (rather than empty) intermediate relations.
• Relation R defined over attributes A = {A1, ..., An} is vertically fragmented as Ri = ΠA'(R) where A' ⊆ A:
    ΠD,K(Ri) is useless if the set of projection attributes D is not in A'.
• Example: EMP1 = ΠENO,ENAME(EMP); EMP2 = ΠENO,TITLE(EMP)
    SELECT ENAME
    FROM EMP
  – By commuting the projection with the join (i.e., projecting on ENO, ENAME), we can see that the projection on EMP2 is useless because ENAME is not in EMP2.
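The "useless fragment" test can be sketched in Python by comparing attribute sets. The structure `VERTICAL_FRAGMENTS` and the function `useful_fragments` are hypothetical; the sketch treats a fragment as useful only if it supplies at least one requested attribute other than the key.

```python
# Attributes stored in each vertical fragment of EMP; ENO is the key.
VERTICAL_FRAGMENTS = {
    "EMP1": {"ENO", "ENAME"},
    "EMP2": {"ENO", "TITLE"},
}

def useful_fragments(projected_attrs, key="ENO"):
    """Fragments that supply at least one requested non-key attribute."""
    return [name for name, attrs in VERTICAL_FRAGMENTS.items()
            if projected_attrs & (attrs - {key})]

# SELECT ENAME FROM EMP
print(useful_fragments({"ENAME"}))  # ['EMP1']
```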

Reduction for DHF
• Rule:
  – Distribute joins over unions.
  – Apply the join reduction for horizontal fragmentation.

Example
EMP1 = σTITLE="Programmer"(EMP)
EMP2 = σTITLE<>"Programmer"(EMP)
ASG1 = ASG ⋉ENO EMP1 (semijoin)
ASG2 = ASG ⋉ENO EMP2 (semijoin)

Query
SELECT *
FROM EMP, ASG
WHERE ASG.ENO = EMP.ENO
AND EMP.TITLE = "Mech. Eng."
Reduction for DHF
• Step 1: distribute joins over unions. [Figure: query tree after distributing joins over unions.]
• Step 2: eliminate the empty intermediate relations (left sub-tree). [Figure: reduced query tree.]
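The effect of the reduction can be checked on sample data. In this Python sketch, all tuples are invented, ASG is fragmented by a semijoin with the EMP fragments, and the helper names (`join`, `semijoin`, `mech`) are illustrative assumptions. The reduced query touches only EMP2 and ASG2 yet returns the same tuples as the query on the global relations.

```python
def join(emp, asg):
    """Equijoin of EMP and ASG on ENO."""
    return [{**e, **a} for e in emp for a in asg if e["ENO"] == a["ENO"]]

def semijoin(asg, emp):
    """Tuples of ASG that have a matching ENO in EMP."""
    enos = {e["ENO"] for e in emp}
    return [a for a in asg if a["ENO"] in enos]

def mech(emp):
    """σ TITLE = "Mech. Eng." """
    return [e for e in emp if e["TITLE"] == "Mech. Eng."]

# Hypothetical sample data.
EMP = [{"ENO": "E1", "TITLE": "Programmer"},
       {"ENO": "E2", "TITLE": "Mech. Eng."},
       {"ENO": "E3", "TITLE": "Elect. Eng."},
       {"ENO": "E4", "TITLE": "Mech. Eng."}]
ASG = [{"ENO": "E1", "PNO": "P1"}, {"ENO": "E2", "PNO": "P1"},
       {"ENO": "E2", "PNO": "P2"}, {"ENO": "E3", "PNO": "P3"},
       {"ENO": "E4", "PNO": "P2"}]

EMP1 = [e for e in EMP if e["TITLE"] == "Programmer"]
EMP2 = [e for e in EMP if e["TITLE"] != "Programmer"]
ASG1, ASG2 = semijoin(ASG, EMP1), semijoin(ASG, EMP2)

full = join(mech(EMP), ASG)       # query on the global relations
reduced = join(mech(EMP2), ASG2)  # only EMP2 and ASG2 are needed

print(mech(EMP1))       # [] : the selection on EMP1 is empty
print(full == reduced)  # True
```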

Reduction for Hybrid Fragmentation
• Combine the rules already specified:
  – Remove empty relations generated by contradicting selections on horizontal fragments.
  – Remove useless relations generated by projections on vertical fragments.
  – Distribute joins over unions in order to isolate and remove useless joins.

Reduction for Hybrid Fragmentation
Example
Consider the following hybrid fragmentation:
EMP1 = σENO≤"E4"(ΠENO,ENAME(EMP))
EMP2 = σENO>"E4"(ΠENO,ENAME(EMP))
EMP3 = ΠENO,TITLE(EMP)
and the query
SELECT ENAME
FROM EMP
WHERE ENO = "E5"
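Both reduction rules can be combined in one small Python check. Each fragment is described by its attribute set and its horizontal predicate on ENO; the structure `HYBRID_FRAGMENTS` and the function `fragments_needed` are hypothetical, and single-digit employee numbers are assumed so that string comparison works.

```python
# Each fragment: the attributes it stores and its predicate on ENO.
HYBRID_FRAGMENTS = {
    "EMP1": {"attrs": {"ENO", "ENAME"}, "accepts": lambda eno: eno <= "E4"},
    "EMP2": {"attrs": {"ENO", "ENAME"}, "accepts": lambda eno: eno > "E4"},
    "EMP3": {"attrs": {"ENO", "TITLE"}, "accepts": lambda eno: True},
}

def fragments_needed(projected_attrs, eno_value, key="ENO"):
    """Keep fragments that are neither useless nor empty for the query."""
    needed = []
    for name, frag in HYBRID_FRAGMENTS.items():
        useful = bool(projected_attrs & (frag["attrs"] - {key}))
        non_empty = frag["accepts"](eno_value)
        if useful and non_empty:
            needed.append(name)
    return needed

# SELECT ENAME FROM EMP WHERE ENO = "E5"
print(fragments_needed({"ENAME"}, "E5"))  # ['EMP2']
```

EMP1 is dropped because its selection contradicts ENO = "E5", and EMP3 is dropped because it does not hold ENAME.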

Step 3. Global Query Optimization
• Input: algebraic fragment query.
• Goal: to determine an execution strategy that is close to optimal with respect to a cost function.
• Find the best (not necessarily optimal) global schedule.
  – Minimize a cost function.
• Distributed join processing
  – Which relation to ship where?
  – Ship-whole vs. ship-as-needed.
• Decide on the use of semijoins.
  – A semijoin saves on communication at the expense of more local processing.
• Join methods
  – Nested loop vs. ordered joins (merge join or hash join).
Cost-Based Optimization
• Solution space
  – The set of equivalent algebra expressions (query trees).
• Cost function (in terms of time)
  – I/O cost + CPU cost + communication cost.
  – These might have different weights in different distributed environments (LAN vs. WAN).
  – Can also maximize throughput.
• Search algorithm
  – How do we move inside the solution space?
  – Exhaustive search, heuristic algorithms (iterative improvement, simulated annealing, genetic, …).
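The role of the weights in the cost function can be illustrated with a toy Python example. All numbers, plan names, and weights below are invented for illustration; the point is only that the same two candidate strategies can rank differently when communication becomes more expensive.

```python
# Hypothetical resource estimates for two candidate strategies.
PLANS = {
    "ship-whole": {"io": 100, "cpu": 50, "comm": 1000},
    "semijoin":   {"io": 180, "cpu": 120, "comm": 200},
}

# Hypothetical weights: communication is cheap on a LAN, costly on a WAN.
LAN_WEIGHTS = {"io": 1.0, "cpu": 0.1, "comm": 0.05}
WAN_WEIGHTS = {"io": 1.0, "cpu": 0.1, "comm": 1.0}

def total_cost(plan, weights):
    """Weighted sum: I/O cost + CPU cost + communication cost."""
    return sum(weights[k] * plan[k] for k in ("io", "cpu", "comm"))

def best_plan(plans, weights):
    """Exhaustive search over the (tiny) solution space."""
    return min(plans, key=lambda name: total_cost(plans[name], weights))

print(best_plan(PLANS, LAN_WEIGHTS))  # ship-whole
print(best_plan(PLANS, WAN_WEIGHTS))  # semijoin
```

With only two candidates an exhaustive search is trivial; for realistic solution spaces the heuristic algorithms listed above are used instead.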

Step 4. Local Optimization
• Input: optimized fragment algebraic query.
• Output: optimized local algebraic queries.
• Each subquery that executes at one site is called a local query.
• Each local query is optimized using the local schema of its site and then executed.