Ch-2 Query Processing and Optimization
Aims of query processing (QP):
Transform a query written in a high-level language (e.g. SQL) into a correct and
efficient execution strategy expressed in a low-level language (implementing relational algebra, RA).
Execute the strategy to retrieve the required data.
The major steps involved in query processing are depicted in the figure below.
A query written in SQL is given as input to the query processor. For our case, let us consider the
SQL query written above.
Step 1: Parsing
In this step, the parser of the query processor module checks the syntax of the query, the user’s
privileges to execute the query, the table names and attribute names, etc. The correct table names,
attribute names and the privilege of the users can be taken from the system catalog (data
dictionary).
Step 2: Translation
If we have written a valid query, then it is converted from high level language SQL to low level
instruction in Relational Algebra.
For example, our SQL query can be converted into a Relational Algebra equivalent as follows;
πEname(σDOP>10 ∧ Employee.Eno=Proj_Assigned.Eno(Employee × Proj_Assigned))
Step 3: Optimizer
The optimizer uses the statistical data stored as part of the data dictionary. The statistical data are
information about the size of the table, the length of records, the indexes created on the table, etc.
The optimizer also checks the conditions and conditional attributes which are parts of the query.
Step 4: Execution Plan
A query can be expressed in many ways. At this stage, the query processor module uses the
information collected in step 3 to find different relational algebra expressions that are equivalent
to, and return the same result as, the one we have already written.
For our example, the query written in Relational algebra can also be written as the one given below;
πEname(Employee ⋈Eno (σDOP>10 (Proj_Assigned)))
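To make the equivalence of the two expressions concrete, the following is a minimal Python sketch that evaluates both over a few made-up rows. Only the column names Eno, Ename and DOP come from the example; the sample rows, the Pno column and the function names are assumptions for illustration.

```python
# Hypothetical sample data; only Eno, Ename and DOP appear in the example query.
employee = [
    {"Eno": 1, "Ename": "Abebe"},
    {"Eno": 2, "Ename": "Sara"},
    {"Eno": 3, "Ename": "Kebede"},
]
proj_assigned = [
    {"Eno": 1, "Pno": "P1", "DOP": 12},
    {"Eno": 2, "Pno": "P2", "DOP": 6},
    {"Eno": 3, "Pno": "P1", "DOP": 15},
]

def plan_product_then_select():
    """Cartesian product first, then selection, then projection on Ename."""
    product = [(e, p) for e in employee for p in proj_assigned]
    selected = [(e, p) for e, p in product
                if p["DOP"] > 10 and e["Eno"] == p["Eno"]]
    return {e["Ename"] for e, _ in selected}, len(product)

def plan_select_then_join():
    """Selection on Proj_Assigned first, then join on Eno, then projection."""
    filtered = [p for p in proj_assigned if p["DOP"] > 10]
    joined = [(e, p) for e in employee for p in filtered
              if e["Eno"] == p["Eno"]]
    return {e["Ename"] for e, _ in joined}, len(filtered)

names_a, size_a = plan_product_then_select()
names_b, size_b = plan_select_then_join()
print(names_a == names_b)  # the two plans return the same employee names
print(size_a, size_b)      # tuples in the first intermediate result of each plan
```

Both plans return the same names, but the second one builds a much smaller first intermediate result, which is why an optimizer would prefer it.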
Query decomposition is the first phase of query processing. Its aims are to transform a
high-level query into a relational algebra query and to check whether that query is syntactically
and semantically correct. Thus, the query decomposition phase starts with a high-level query and
transforms it into a query graph of low-level operations (algebraic expressions) that satisfies the
query. In practice, SQL (a relational calculus language) is used as the high-level query language
in most commercial RDBMSs. The SQL query is then decomposed into query blocks (low-level
operations), which form the basic units. The query decomposer goes through five stages of
processing to decompose the query into low-level operations and to accomplish the translation into
algebraic expressions. Fig. 2.2 shows the five stages of the query decomposer. The five stages of
query decomposition are:
Query analysis.
Query normalization.
Semantic analysis.
Query simplifier.
Query restructuring.
2.2.1 Query Analysis
During the query analysis phase, the query is lexically and syntactically analyzed using the
techniques of programming language compilers (parsers), in the same way as a conventional
program, to find any syntax errors. A syntactically legal query is then validated, using the system
catalogues, to ensure that all database objects (relations and attributes) referred to by the query
are defined in the database. It is also verified whether the relationships of the attributes and
relations mentioned in the query are correct as per the system catalogue. The type specification of
the query qualifiers and result is also checked at this stage.
At the end of the query analysis phase, the high-level query (SQL) is transformed into some internal
representation that is more suitable for processing. This internal representation is typically a kind
of query tree. A query tree is a tree data structure that corresponds to a relational algebra
expression. A query tree is also called a relational algebra tree. The query tree has the following
components:
4 | P a g e Advanced database pre by Jilo D
Leaf nodes of the tree, representing the base input relations of the query.
Internal (non-leaf) nodes of the tree, representing an intermediate relation which is the
result of applying an operation in the algebra.
Root of the tree, representing the result of the query.
The sequence of operations (or data flow) is directed from leaves to the root.
The query tree is executed by executing an internal node operation wherever its operands are
available. The internal node is then replaced by the relation that results from executing the
operation. The execution terminates when the root node is executed and produces the result
relation for the query.
Let us consider a SQL query in which it is required to list the project number (PROJ-NO.), the
controlling department number (DEPT-NO.), and the department manager’s name (MGR-
NAME), address (MGR-ADD) and date of birth (MGR-DOB) for every project located in
‘Mumbai’. The SQL query can be written as follows:
FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E
In the above SQL query, the join condition DEPT-NO = D-NUM relates a project to its
controlling department, whereas the join condition MGR-ID = EMP-ID relates the controlling
department to the employee who manages that department. The equivalent relational algebra
expression for the above SQL query can be written as:
Or
Fig. 2.3 shows an example of a query tree for the above SQL statement and relational algebra
expression. This type of query tree is also referred to as a relational algebra tree.
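The bottom-up execution of a query tree can be sketched in Python as follows. This is an illustrative sketch only: the node classes, the sample rows and some attribute names (for example PROJ-LOC and EMP-NAME) are assumptions, not definitions taken from Fig. 2.3.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Leaf:
    """Leaf node: a base input relation."""
    rows: List[dict]
    def evaluate(self):
        return self.rows

@dataclass
class Select:
    """Internal node: keeps the tuples that satisfy the predicate."""
    predicate: Callable[[dict], bool]
    child: object
    def evaluate(self):
        return [r for r in self.child.evaluate() if self.predicate(r)]

@dataclass
class Join:
    """Internal node: combines tuples of both operands that satisfy the condition."""
    condition: Callable[[dict, dict], bool]
    left: object
    right: object
    def evaluate(self):
        left_rows = self.left.evaluate()    # operands are evaluated first
        right_rows = self.right.evaluate()
        return [{**l, **r} for l in left_rows for r in right_rows
                if self.condition(l, r)]

@dataclass
class Project:
    """Internal node: keeps only the listed attributes."""
    attributes: List[str]
    child: object
    def evaluate(self):
        return [{a: r[a] for a in self.attributes} for r in self.child.evaluate()]

# Hypothetical rows for the three base relations.
project = [{"PROJ-NO": 10, "DEPT-NO": 1, "PROJ-LOC": "Mumbai"},
           {"PROJ-NO": 20, "DEPT-NO": 2, "PROJ-LOC": "Delhi"}]
department = [{"D-NUM": 1, "MGR-ID": 100}, {"D-NUM": 2, "MGR-ID": 200}]
employee = [{"EMP-ID": 100, "EMP-NAME": "Asha"},
            {"EMP-ID": 200, "EMP-NAME": "Ravi"}]

# The root is the projection; data flows from the leaves up to the root.
tree = Project(
    ["PROJ-NO", "DEPT-NO", "EMP-NAME"],
    Join(lambda d, e: d["MGR-ID"] == e["EMP-ID"],
         Join(lambda p, d: p["DEPT-NO"] == d["D-NUM"],
              Select(lambda p: p["PROJ-LOC"] == "Mumbai", Leaf(project)),
              Leaf(department)),
         Leaf(employee)))

result = tree.evaluate()
print(result)
```

Calling evaluate() on the root recursively evaluates the operands of each internal node before the node itself, which mirrors the rule that an operation runs once its operands are available.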
A query graph is sometimes also used to represent a query, as shown in Fig. 2.4. In the query
graph representation, the relations in the query (PROJECT, DEPARTMENT and EMPLOYEE in our
example) are represented by relation nodes, displayed as single circles. The constant values
from the query selection (project location = ‘Mumbai’ in our example) are represented by
constant nodes, displayed as double circles. The selection and join conditions are represented
by the graph edges, for example, P.DEPT-NO = D.DEPT-NUM and D.MGR-ID = E.EMP-ID, as shown
in Fig. 2.4. Finally, the attributes to be retrieved from each relation are displayed in square
brackets above each relation, for example [P.PROJ-NUM, P.DEPT-NO] and [E.EMP-NAME, E.EMP-ADD,
E.EMP-DOB], as shown in Fig. 2.4. A query graph representation corresponds to a relational
calculus expression.
2.2.2 Query Normalization
The primary goal of the normalization phase is to avoid redundancy. The normalization phase
converts the query into a normalized form that can be more easily manipulated. In the
normalization phase, a set of equivalency rules is applied so that the projection and selection
operations included in the query are simplified to avoid redundancy.
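1. p1 ∧ p2 ⇔ p2 ∧ p1
2. p1 ∨ p2 ⇔ p2 ∨ p1
3. p1 ∧ (p2 ∧ p3) ⇔ (p1 ∧ p2) ∧ p3
4. p1 ∨ (p2 ∨ p3) ⇔ (p1 ∨ p2) ∨ p3
5. p1 ∧ (p2 ∨ p3) ⇔ (p1 ∧ p2) ∨ (p1 ∧ p3)
6. p1 ∨ (p2 ∧ p3) ⇔ (p1 ∨ p2) ∧ (p1 ∨ p3)
7. ¬(p1 ∧ p2) ⇔ ¬p1 ∨ ¬p2
8. ¬(p1 ∨ p2) ⇔ ¬p1 ∧ ¬p2
9. ¬(¬p) ⇔ p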
By applying these equivalency rules, the normalization phase rewrites the query into a normal
form which can be readily manipulated in later steps. The predicate is converted into one of the
following two normal forms:
Disjunctive normal form is a sequence of disjuncts that are connected with the ‘OR’ (‘∨’)
operator. Each disjunct contains one or more terms connected by the ‘AND’ (‘∧’) operator. A
disjunctive selection contains those tuples formed by the union of all tuples that satisfy the
disjuncts. An example of disjunctive normal form can be given as:
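One way to see how a predicate reaches disjunctive normal form is the following minimal Python sketch. It pushes negations inward with De Morgan's rules and double negation (rules 7 to 9 above) and then distributes AND over OR. The nested-tuple encoding and the predicate names p1, p2, p3 are assumptions made for this example.

```python
def push_not(expr, negate=False):
    """Move NOT down to the atoms using De Morgan's rules and double negation."""
    if isinstance(expr, str):                 # an atomic predicate such as "p1"
        return ("not", expr) if negate else expr
    op = expr[0]
    if op == "not":                           # NOT flips the pending negation
        return push_not(expr[1], not negate)
    # Under a pending negation, AND becomes OR and OR becomes AND.
    flipped = {"and": "or", "or": "and"}[op] if negate else op
    return (flipped, push_not(expr[1], negate), push_not(expr[2], negate))

def to_dnf(expr):
    """Return a list of disjuncts; each disjunct is a list of AND-ed literals."""
    if isinstance(expr, str) or expr[0] == "not":
        return [[expr]]
    left, right = to_dnf(expr[1]), to_dnf(expr[2])
    if expr[0] == "or":
        return left + right
    # AND distributes over OR: (a OR b) AND c == (a AND c) OR (b AND c)
    return [l + r for l in left for r in right]

def dnf(expr):
    return to_dnf(push_not(expr))

# p1 AND NOT (p2 AND p3)
expr = ("and", "p1", ("not", ("and", "p2", "p3")))
print(dnf(expr))
```

For the sample predicate the result reads as (p1 ∧ ¬p2) ∨ (p1 ∧ ¬p3), that is, a sequence of disjuncts whose terms are connected by AND.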
2.2.3 Semantic Analysis
The objective of the semantic analyzer phase of query processing is to reduce the number of
predicates that must be evaluated by refuting incorrect or contradictory queries or qualifications.
The semantic analyzer rejects normalized queries that are incorrectly formulated or
contradictory. A query is incorrectly formulated if components do not contribute to the
generation of the result. This happens in case of a missing join specification. A query is
contradictory if its predicate cannot be satisfied by any tuple in the relation. The semantic analyzer
examines the relational calculus query (SQL) to make sure it contains only data objects (that is,
tables, columns, views, indexes) that are defined in the database catalogue. It makes sure that
each object in the query is referenced correctly according to its data type.
For example, let us consider the following predicate:
(EMP-DESIG=‘Programmer’∧ EMP-DESIG=‘Analyst’)
As an employee cannot be both ‘Programmer’ and ‘Analyst’ simultaneously, the above predicate
on the EMPLOYEE relation is contradictory.
That means the query is not correctly formulated. In this graph, the join condition (V.PROJ-NO =
P.PROJ-NO) has been omitted.
This graph has a cycle between the nodes D.MAX-BUDGET and 0 with a negative valuation
sum. Thus, it indicates that the query is contradictory. Clearly, we cannot have a department with
a maximum budget that is both greater than INR 85,000 and less than INR 50,000.
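The negative-cycle test can be sketched in Python with a Bellman-Ford style relaxation. The edge encoding below (a constraint x - y <= c becomes an edge from y to x with weight c) is one common convention and is an assumption here, as are the node labels and the use of whole-number amounts to turn strict comparisons into non-strict ones.

```python
def has_negative_cycle(nodes, edges):
    """Bellman-Ford check; edges are (source, target, weight) triples."""
    dist = {n: 0 for n in nodes}      # every node starts at distance 0
    for _ in range(len(nodes) - 1):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    # If any edge can still be relaxed, the graph contains a negative cycle.
    return any(dist[u] + w < dist[v] for u, v, w in edges)

# MAX-BUDGET > 85000  is encoded as  0 - MAX-BUDGET <= -85001
# MAX-BUDGET < 50000  is encoded as  MAX-BUDGET - 0 <= 49999
contradictory_edges = [("MAX-BUDGET", "0", -85001), ("0", "MAX-BUDGET", 49999)]

# A satisfiable variant for comparison: MAX-BUDGET > 50000 and MAX-BUDGET < 85000
satisfiable_edges = [("MAX-BUDGET", "0", -50001), ("0", "MAX-BUDGET", 84999)]

nodes = ["0", "MAX-BUDGET"]
print(has_negative_cycle(nodes, contradictory_edges))
print(has_negative_cycle(nodes, satisfiable_edges))
```

In the first case the cycle between MAX-BUDGET and 0 has weight 49999 - 85001, which is negative, so no value can satisfy both conditions; in the second case the cycle weight is positive and the predicate is satisfiable.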
2.2.4 Query Simplifier
The objectives of a query simplifier are to detect redundant qualifications, eliminate common
sub-expressions and transform sub-graphs (queries) into semantically equivalent but more easily
and efficiently computed forms. Commonly, integrity constraints, view definitions and access
restrictions are introduced into the graph at this stage of analysis so that the query can be
simplified as much as possible. Integrity constraints define constraints which must hold for all
states of the database, so any query that contradicts an integrity constraint must be void and can
be rejected without accessing the database. If the user does not have the appropriate access to all
the components of the query, the query must be rejected. Queries expressed in terms of views
can be simplified by substituting the view definition, since this will avoid having to materialize
the view.
Now, the above part of the query can be represented using the idempotency rules of Boolean
algebra as follows:
(PRED1 AND (NOT (PRED1))) AND NOT (PRED3) = (P1 ∧ (~P1)) ∧ ~(P3)
The query normalizer now applies rule 4 of the idempotency rules (Table 2.1) of the query simplifier
phase and obtains the following form:
Thus, in the above example, the original query contained many redundant predicates, which were
eliminated without changing the semantics of the query.
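The effect of such rules can be illustrated with a small Python sketch. It applies only two well-known Boolean identities, p ∧ p ⇔ p and p ∧ ¬p ⇔ false, to a flat conjunction; the string encoding with a leading "~" for negation and the function name are assumptions for this example, and the numbering of Table 2.1 is not reproduced.

```python
def simplify_conjunction(literals):
    """Simplify p1 AND p2 AND ... for literals written as "P1" or "~P1".

    Returns False when the conjunction is contradictory, otherwise the
    distinct literals in first-seen order.
    """
    seen = []
    for lit in literals:
        negated = lit[1:] if lit.startswith("~") else "~" + lit
        if negated in seen:
            return False          # p AND NOT p is always false
        if lit not in seen:       # p AND p is just p
            seen.append(lit)
    return seen

print(simplify_conjunction(["P1", "P1", "P2"]))
print(simplify_conjunction(["P1", "~P1", "~P3"]))
```

A conjunction that simplifies to false selects no tuples, so that part of the predicate can be dropped without touching the database.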
2.2.5 Query Restructuring
In the final stage of query decomposition, the query can be restructured to give a more efficient
implementation. Transformation rules are used to convert one relational algebra expression into
an equivalent form that is more efficient. The query can now be regarded as a relational algebra
program, consisting of a series of operations on relations.
The primary goal of the query optimizer is to choose an efficient execution strategy for processing
a query. The query optimizer attempts to minimize the use of certain resources (mainly the
number of I/Os and CPU time) by choosing the best of a set of alternative query access plans.
Query optimization starts during the validation phase, when the system checks whether the user
has appropriate privileges. Existing statistics for the tables and columns are located, such as how
many rows (tuples) exist in the table, and relevant indexes are found with their own applicable
statistics. An access plan is then generated to perform the query. The access plan is put into
effect with the execution plan generated during the query processing phase, wherein the indexes
and tables are accessed and the answer to the query is derived from the data.
Fig. 2.5 shows a detailed block diagram of the query optimizer. The following four main inputs are used
in the query optimizer module:
Relational algebra query trees generated by the query simplifier module of query
decomposer.
Estimation formulas used to determine the cardinality of the intermediate result tables.
A cost model.
Statistical data from the database catalogue.
Fig. 2.5. Detailed block diagram of query optimizer
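As an illustration of the kind of estimation formula the optimizer relies on, the sketch below uses two common textbook estimates under a uniform-distribution assumption. They are assumptions for this example, not formulas taken from Fig. 2.5, and the function names and statistics are hypothetical.

```python
def selection_cardinality(n_tuples, n_distinct):
    """Expected tuples matching attribute = constant, assuming values are
    spread uniformly over the distinct values of the attribute."""
    return n_tuples / n_distinct

def join_cardinality(n_r, n_s, distinct_r, distinct_s):
    """Expected tuples in an equi-join of R and S on one attribute, using the
    larger number of distinct join values as the divisor."""
    return n_r * n_s / max(distinct_r, distinct_s)

# Hypothetical catalogue statistics.
print(selection_cardinality(10000, 50))
print(join_cardinality(1000, 5000, 1000, 1000))
```

Estimates like these are fed into the cost model to compare alternative plans before any data is read.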
The term query optimization does not mean that the execution plan is always an optimal (best)
strategy. It is just a reasonably efficient strategy for execution of the query. The
decomposed query blocks of SQL are translated into an equivalent extended relational algebra
expression (or operators) and then optimized. There are two main techniques for implementing
query optimization.
2 Transformation rule
Now, let us consider a query in the above database to find the names of employees born after
1970 who work on a project named ‘Growth’. This SQL query can be written as follows:
SELECT EMP-NAME
Fig. 2.6 shows the improved query tree for the above SQL query. It can be observed that
executing the initial query tree directly creates a very large file containing the CARTESIAN
PRODUCT (×) of the entire EMPLOYEE, WORKS_ON and PROJECT files. But the query
needs only one tuple (record) from the PROJECT relation for the ‘Growth’ project and only the
EMPLOYEE records for those whose date of birth is after ‘31-12-1970’.
The improvement in the query tree can be achieved by keeping only the attributes needed by the
subsequent operations in the intermediate relations, by including PROJECT (∏) operations in the
query tree, as shown in Fig. 2.6. This reduces the attributes (columns or fields) of the
intermediate relations, whereas the SELECT operations reduce the number of tuples (rows or
records).
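The difference between the initial and the improved tree can be sketched in Python by counting the stored values (rows times columns) of the intermediate results. The schema, the sample rows and the use of a birth year instead of a full date are assumptions for illustration only.

```python
from itertools import product

# Hypothetical rows; EMP-DOB holds only the birth year to keep the sketch short.
employee = [
    {"EMP-ID": 1, "EMP-NAME": "Asha", "EMP-DOB": 1968, "EMP-ADD": "Pune"},
    {"EMP-ID": 2, "EMP-NAME": "Ravi", "EMP-DOB": 1975, "EMP-ADD": "Delhi"},
    {"EMP-ID": 3, "EMP-NAME": "Meena", "EMP-DOB": 1982, "EMP-ADD": "Mumbai"},
]
works_on = [
    {"W-EMP-ID": 1, "W-PROJ-NO": 10},
    {"W-EMP-ID": 2, "W-PROJ-NO": 10},
    {"W-EMP-ID": 3, "W-PROJ-NO": 20},
]
project = [{"PROJ-NO": 10, "PROJ-NAME": "Growth"},
           {"PROJ-NO": 20, "PROJ-NAME": "Audit"}]

def cells(rows):
    """Number of stored values in a relation: rows times columns."""
    return sum(len(r) for r in rows)

# Initial tree: Cartesian product of all three relations, then select and project.
full = [{**e, **w, **p} for e, w, p in product(employee, works_on, project)]
initial = {r["EMP-NAME"] for r in full
           if r["PROJ-NAME"] == "Growth"
           and r["PROJ-NO"] == r["W-PROJ-NO"]
           and r["W-EMP-ID"] == r["EMP-ID"]
           and r["EMP-DOB"] > 1970}

# Improved tree: apply SELECT and PROJECT as early as possible, then join.
p_small = [{"PROJ-NO": p["PROJ-NO"]} for p in project
           if p["PROJ-NAME"] == "Growth"]
e_small = [{"EMP-ID": e["EMP-ID"], "EMP-NAME": e["EMP-NAME"]}
           for e in employee if e["EMP-DOB"] > 1970]
w_small = [w for w in works_on for p in p_small
           if w["W-PROJ-NO"] == p["PROJ-NO"]]
improved = {e["EMP-NAME"] for e in e_small for w in w_small
            if e["EMP-ID"] == w["W-EMP-ID"]}

improved_cells = cells(p_small) + cells(e_small) + cells(w_small)
print(initial == improved)
print(cells(full), improved_cells)
```

Both trees produce the same answer, while the early selections cut the number of rows and the early projections cut the number of columns carried through the plan.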
To summarize, we can conclude from the preceding example that a query tree can
be transformed step by step into another more efficient executable query tree. But,
one must ensure that the transformation steps always lead to an equivalent query
tree and the desired output is achieved.
Transformation rules are used by the query optimizer to transform one relational algebra
expression into an equivalent expression that is more efficient to execute. Two relations are
considered equivalent if they have the same set of attributes, possibly in a different order, and
represent the same information. These transformation rules are used to
restructure the initial (canonical) relational algebra query tree generated during query
decomposition. Let us consider three relations R, S and T, with R defined over the attributes A =
{A1, A2, ..., An} and S defined over B = {B1, B2, ..., Bn}. Let c = {c1, c2, ..., cn} denote
predicates and L, L1, L2, M, M1, M2, N denote sets of attributes.
Rule 3: Cascade of Projection (∏)
∏L (∏M (... (∏N (R)) ...)) ≡ ∏L (R)
Rule 5: Commutativity of Join (⋈) and Cartesian product (×)
R ⋈c S ≡ S ⋈c R
R × S ≡ S × R
Rule 6: Commutativity of Selection (σ) and Join (⋈) or Cartesian product (×)
If the selection condition c involves only the attributes of R, then:
σc (R ⋈ S) ≡ (σc (R)) ⋈ S
σc (R × S) ≡ (σc (R)) × S
Alternatively, if the selection predicate is a conjunctive predicate of the form (c1 AND c2,
or c1 ∧ c2), where condition c1 involves only the attributes of R and condition c2 involves only
the attributes of S, the selection and join operations commute as follows:
σc1 ∧ c2 (R ⋈ S) ≡ (σc1 (R)) ⋈ (σc2 (S))
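Rule 7: Commutativity of Projection (∏) and Join (⋈) or Cartesian product (×)
If L = L1 ∪ L2, where L1 contains only attributes of R and L2 contains only attributes of S, and
the join condition c involves only attributes in L, then:
∏L (R ⋈c S) ≡ (∏L1 (R)) ⋈c (∏L2 (S))
Rule 8: Commutativity of Union (∪) and Intersection (∩)
R ∪ S ≡ S ∪ R
R ∩ S ≡ S ∩ R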
Rule 9: Commutativity of Selection (σ) and set operations such as Union (∪), Intersection (∩) and set difference (–)
σc (R ∪ S) ≡ (σc (R)) ∪ (σc (S))
σc (R ∩ S) ≡ (σc (R)) ∩ (σc (S))
σc (R – S) ≡ (σc (R)) – (σc (S))
If θ stands for any of the set operations Union (∪), Intersection (∩) or set difference (–),
then the above expressions can be written as:
σc (R θ S) ≡ (σc (R)) θ (σc (S))
Rule 11: Associativity of Join (⋈) and Cartesian product (×)
(R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T)
If the join condition c2 involves only attributes from the relations S and T, then join is associative
in the following manner:
(R ⋈c1 S) ⋈c2 T ≡ R ⋈c1 (S ⋈c2 T)
Rule 12: Associativity of Union (∪) and Intersection (∩)
(R ∪ S) ∪ T ≡ S ∪ (R ∪ T)
(R ∩ S) ∩ T ≡ S ∩ (R ∩ T)
If θ stands for any of the operations Join (⋈), Union (∪), Intersection (∩) or
Cartesian product (×), then the above expressions can be written as:
(R θ S) θ T ≡ R θ (S θ T)
Rule 13: Converting a Selection and Cartesian Product (σ, ×) sequence into Join (⋈)
σc (R × S) ≡ (R ⋈c S)
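A few of these equivalences can be checked on small relations with the following Python sketch. The helper functions, the relations R and S and their attribute names are assumptions for illustration; the check confirms the rules on sample data and is not a proof.

```python
def select(pred, rel):
    """Selection: keep the tuples that satisfy the predicate."""
    return [t for t in rel if pred(t)]

def product(r, s):
    """Cartesian product of two relations with disjoint attribute names."""
    return [{**a, **b} for a in r for b in s]

def join(cond, r, s):
    """Join: product restricted to the tuples that satisfy the condition."""
    return [{**a, **b} for a in r for b in s if cond({**a, **b})]

def same(x, y):
    """Compare two relations, ignoring the order of rows and of attributes."""
    canon = lambda rel: sorted(sorted(t.items()) for t in rel)
    return canon(x) == canon(y)

# Hypothetical relations R(A, B) and S(C, D).
R = [{"A": 1, "B": 10}, {"A": 2, "B": 20}]
S = [{"C": 1, "D": 5}, {"C": 2, "D": 7}, {"C": 3, "D": 9}]

c_join = lambda t: t["A"] == t["C"]   # join condition over R and S
c_r = lambda t: t["B"] > 10           # condition over attributes of R only

# Commutativity of join: R join S == S join R
rule5 = same(join(c_join, R, S), join(c_join, S, R))
# Selection on R commutes with the join
rule6 = same(select(c_r, join(c_join, R, S)), join(c_join, select(c_r, R), S))
# Selection over a Cartesian product equals a join
rule13 = same(select(c_join, product(R, S)), join(c_join, R, S))
print(rule5, rule6, rule13)
```

Comparing canonical, sorted forms reflects the point that two relations count as equivalent even when their attributes appear in a different order.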
2.4 Pipelining
When a query is composed of several relational algebra operators, the result of one operator is
sometimes pipelined to another operator without creating a temporary relation to hold the
intermediate result. When the input relation to a unary operation (for example, selection or
projection) is pipelined into it, it is sometimes said that the operation is applied on-the-
fly. Pipelining (or on-the-fly processing) is sometimes used to improve the performance of the
queries. Otherwise, the results of intermediate algebra operations are written temporarily to
secondary storage (disk). If the output of an operation is
saved in a temporary relation for processing by the next operator, it is said that the tuples
are materialized. Thus, this process of temporarily writing the results of intermediate algebra
operations is called materialization.
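The contrast between materialization and pipelining can be sketched with Python lists and generators. The relation, its attributes and the operator functions below are assumptions for illustration; a real DBMS pipelines operators inside its execution engine rather than with generators.

```python
def scan(rows):
    """Produce the tuples of a base relation one at a time."""
    for row in rows:
        yield row

def select(pred, rows):
    """Selection applied on the fly to each incoming tuple."""
    for row in rows:
        if pred(row):
            yield row

def project(attrs, rows):
    """Projection applied on the fly to each incoming tuple."""
    for row in rows:
        yield {a: row[a] for a in attrs}

# Hypothetical relation.
employees = [
    {"name": "Asha", "salary": 62000},
    {"name": "Ravi", "salary": 48000},
    {"name": "Meena", "salary": 75000},
]

# Materialization: the selection result is stored as a temporary relation
# before the projection reads it.
temp = [r for r in employees if r["salary"] > 50000]
materialized = [{"name": r["name"]} for r in temp]

# Pipelining: each tuple flows through selection and projection without
# an intermediate relation being built.
pipeline = project(["name"], select(lambda r: r["salary"] > 50000, scan(employees)))
pipelined = list(pipeline)

print(materialized == pipelined)
```

Both approaches return the same tuples; the pipelined version simply never holds the full intermediate result, which is the saving that pipelining aims for.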