Query Processing
SQL makes operations on a DBMS much easier, more structured, and more systematic. Queries are
not only easier for users to understand; with a basic grasp of the language, they become
increasingly intuitive to write. In fact, SQL-based DBMSs have evolved a great deal: the system
even optimizes queries for us and finds the best way to execute them. Suppose a user wants the
names of all employees earning more than 10,000:
SELECT
emp_name
FROM
employee
WHERE
salary>10000;
The catch is that the underlying system of the DBMS does not understand this statement directly.
SQL (Structured Query Language), being a high-level language, makes it easy for users to query
data based on their needs and bridges the communication gap with the DBMS, which does not
understand human language. But the underlying engine cannot execute SQL queries as written: for
a query to be understood and executed, it must first be converted to a low-level language.
SQL queries therefore pass through a processing unit that converts them into a low-level form via
relational algebra. Since relational algebra queries are more complex to write than SQL
queries, the DBMS expects the user to write only SQL; it then processes the query
before evaluating it.
Query processing can be divided into compile-time and run-time phases. The compile-time
phase includes parsing and translation, optimization, and evaluation-plan generation.
In the run-time phase, the database engine is primarily responsible for interpreting and
executing the generated plan with physical operators and delivering the query output.
Note that as soon as any of these stages encounters an error, it simply throws the
error and returns without going any further; warnings are not fatal, so they do not stop
processing.
The first step in query processing is parsing and translation. An incoming query undergoes
lexical, syntactic, and semantic analysis. First, the query is broken down into
tokens, and white space is removed along with any comments (Lexical Analysis).
Next, the query is checked for correctness, both syntactically and semantically.
The query processor first checks whether the rules of SQL have been correctly followed
(Syntactic Analysis).
Finally, the query processor checks whether the meaning of the query is valid: are the
table(s) mentioned in the query present in the database? Do the column(s) referenced
actually exist in those tables? (Semantic Analysis)
Once the above-mentioned checks pass, the flow moves on to converting all the tokens into
relational expressions, graphs, and trees. This makes the query easier for the later
stages to process.
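As a rough sketch of the lexical-analysis step, the query can be split into tokens while white space and comments are discarded. This is an illustration only, not a production SQL lexer:

```python
import re

# Minimal lexical-analysis sketch: the alternation tries comments first,
# then white space, and only the third alternative (group 1) is kept.
TOKEN_RE = re.compile(r"--[^\n]*|\s+|(\w+|[<>=!]+|[(),;*])")

def tokenize(sql):
    tokens = []
    for match in TOKEN_RE.finditer(sql):
        if match.group(1):          # keep real tokens, skip spaces/comments
            tokens.append(match.group(1))
    return tokens

query = "SELECT emp_name FROM employee WHERE salary > 10000; -- comment"
print(tokenize(query))
# → ['SELECT', 'emp_name', 'FROM', 'employee', 'WHERE', 'salary', '>', '10000', ';']
```

Note that the comment and all white space vanish; only the token stream moves on to syntactic analysis.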
Let's consider the same query (mentioned below as well) as an example and see how the flow
works.
Query:
SELECT
emp_name
FROM
employee
WHERE
salary>10000;
The name of the queried table is looked up in the data dictionary table.
The names of the columns mentioned in the tokens (emp_name and salary) are
validated for existence.
The column(s) being compared must be of compatible types (salary and the
value 10000 should have the same data type).
The next step is to translate the generated set of tokens into a relational algebra query. These
are easier for the optimizer to handle in the steps that follow.
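For the sample query, the translation step produces, roughly, the relational algebra expression:

```latex
\Pi_{emp\_name}\big(\sigma_{salary > 10000}(employee)\big)
```

that is, select the employee tuples whose salary exceeds 10000, then project the emp_name column.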
Query Evaluation
Once the query processor has the above-mentioned relational forms with it, the next step is to
apply certain rules and algorithms to generate a few other powerful and efficient data
structures. These data structures help in constructing the query evaluation plans. For example,
if the relational graph was constructed, there could be multiple paths from source to
destination. A query execution plan will be generated for each of the paths.
For this query, one possible plan performs projection first and then selection; another
performs selection first and then projection. The sample query is kept simple and
straightforward to ensure better comprehension, but in the case of joins and views, many
more such paths (evaluation plans) start to open up. The evaluation plans may also include
annotations referring to the algorithm(s) to be used for each operation. A relational
algebra expression with annotations of this sort is known as an evaluation primitive. As
you may have figured out by now, these evaluation primitives are essential, as they define
the sequence of operations to be performed for a given plan.
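In practice you can ask an engine for the plan it settled on. Below is a small sketch using SQLite from Python; the table and rows are made up for illustration, and the exact plan text varies between SQLite versions:

```python
import sqlite3

# Build a throwaway in-memory table mirroring the sample query's schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employee (emp_name TEXT, salary INTEGER)")
con.executemany("INSERT INTO employee VALUES (?, ?)",
                [("Ryan", 12000), ("Justin", 9000)])

# EXPLAIN QUERY PLAN reports the chosen strategy without running the query.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT emp_name FROM employee WHERE salary > 10000"
).fetchall()
for row in plan:
    print(row)   # with no index available, SQLite reports a SCAN of employee
```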
Query Optimization
In the next step, the DBMS picks the most efficient evaluation plan based on the cost of each
plan. The aim here is to minimize query evaluation time. The optimizer also evaluates
the usage of any indexes present on the table and the columns being used, and it finds the best
order in which to execute subqueries, so that only the best of the plans gets executed.
Simply put, for any query there are multiple evaluation plans, and choosing the one
that costs the least is called Query Optimization. Some of the factors weighed by the
optimizer to calculate the cost of a query evaluation plan are:
CPU time
Number of tuples to be scanned
Disk access time
Number of operations
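A toy illustration of how such factors might be combined into a single number per plan; the weights and figures below are invented purely for demonstration, not taken from any real optimizer:

```python
# Toy cost model: weight each factor and sum, then keep the cheapest plan.
def plan_cost(tuples_scanned, disk_accesses, operations,
              cpu_per_tuple=0.001, disk_time=0.01, op_time=0.002):
    return (tuples_scanned * cpu_per_tuple      # CPU time
            + disk_accesses * disk_time          # disk access time
            + operations * op_time)              # number of operations

plans = {
    "full_scan":  plan_cost(tuples_scanned=10_000, disk_accesses=100, operations=1),
    "index_scan": plan_cost(tuples_scanned=200,    disk_accesses=15,  operations=2),
}
best = min(plans, key=plans.get)
print(best)  # → index_scan
```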
Query Optimization in DBMS
The query optimizer (also known simply as the optimizer) is database software that identifies
the most efficient way (for example, by reducing time) for a SQL statement to access data.
The process of selecting an efficient execution plan for processing a query is known as query
optimization.
After parsing, the parsed query is delivered to the query optimizer, which works out the
different ways in which the query can run, generates the corresponding execution plans,
and selects the plan with the lowest estimated cost. The catalog manager assists the
optimizer in selecting the optimal plan by supplying the information used to estimate the
cost of each plan.
Query optimization is used to access and modify the database in the most efficient way
possible. It is the art of obtaining necessary information in a predictable, reliable, and timely
manner. Query optimization is formally described as the process of transforming a query into
an equivalent form that may be evaluated more efficiently. The goal of query optimization is
to find an execution plan that reduces the time required to process a query. We must complete
two major tasks to attain this optimization target.
The first is to determine the optimal plan to access the database, and the second is to reduce
the time required to execute the query plan.
The optimizer tries to come up with the best execution plan possible for a SQL statement.
Among all the candidate plans reviewed, the optimizer chooses the plan with the lowest cost.
The optimizer computes costs based on available facts. The cost computation takes into
account query execution factors such as I/O, CPU, and communication for a certain query in
a given context.
For example, there is a query that requests information about students who are in leadership
roles, such as being a class representative. If the optimizer statistics show that 50% of
students are in positions of leadership, the optimizer may decide that a full table search is the
most efficient. However, if data show that just a small number of students are in positions of
leadership, reading an index followed by table access by row id may be more efficient than a
full table scan.
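This choice is easy to observe in SQLite; the schema and names below are invented for the sketch. With an index available on the filtered column, SQLite's planner reports an index search rather than a full scan (a cost-based optimizer with statistics would weigh this against selectivity, as described above):

```python
import sqlite3

# Hypothetical student table with an index on the selective column.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE student (name TEXT, role TEXT)")
con.execute("CREATE INDEX idx_role ON student(role)")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM student WHERE role = 'class_rep'"
).fetchall()
print(plan[0][3])  # e.g. "SEARCH student USING INDEX idx_role (role=?)"
```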
Because the database has so many internal statistics and tools at its disposal, the optimizer is
frequently in a better position than the user to decide the best way to execute a statement. As
a result, the optimizer is used by all SQL statements.
Query optimization is the process of selecting the most efficient way to execute a SQL
statement. Because SQL is a nonprocedural language, the optimizer can merge, restructure,
and process data in any sequence.
The Optimizer assigns a numerical cost to each step of every feasible plan for a given
query and environment, and then combines these values to obtain a cost estimate for the
plan or possible strategy. After evaluating the costs of all feasible plans, the Optimizer
aims to find the plan with the lowest cost estimate. As a result, the Optimizer is
sometimes known as the Cost-Based Optimizer.
Execution Plans:
The plan describes the steps taken by Oracle Database to execute a SQL statement. Each step
physically retrieves or prepares rows of data from the database for the statement's user.
An execution plan shows the total cost of the plan, which is stated on line 0, as well as the
cost of each individual operation. A cost is an internal unit that appears solely in the
execution plan to allow for plan comparisons. As a result, the cost value cannot be fine-tuned
or adjusted.
From the bottom up, the database optimizes query blocks separately. As a result, the database
optimizes the innermost query block first, generating a sub-plan for it, before generating the
outer query block, which represents the full query.
The number of query block plans is proportional to the number of items in the FROM clause,
and it climbs exponentially as the number of objects rises. The possibilities for a join
of five tables, for example, are far greater than those for a join of two tables.
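To get a feel for the growth, even counting only the possible orderings of the tables in a join (ignoring join methods and access paths), the number of candidate plans grows factorially with the number of FROM-clause items:

```python
from math import factorial

# Each permutation of the joined tables is a distinct candidate join order.
for n in (2, 3, 5, 8):
    print(n, "tables ->", factorial(n), "join orders")
# 5 tables already allow 120 orders; 8 tables allow 40320.
```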
A biker wishes to find the most efficient bicycle path from point A to point B. A query is
analogous to the phrase "I need the quickest route from point A to point B" or "I need the
quickest route from point A to point B via point C". To choose the most efficient route, the
trip advisor employs an internal algorithm that takes into account factors such as speed and
difficulty. The biker can sway the trip advisor's judgment by saying things like "I want to
arrive as quickly as possible" or "I want the simplest route possible."
In this example, an execution plan is a possible route generated by the trip advisor.
Internally, the advisor may divide the overall route into multiple subroutes (subplans) and
compute the efficiency of each subroute separately. For example, the trip advisor may
estimate one subroute to take 15 minutes and be of medium difficulty, another subroute to
take 22 minutes and be of low difficulty, and so on.
Based on the user-specified goals and accessible facts about roads and traffic conditions, the
advisor selects the most efficient (lowest cost) overall route. The better the guidance, the
more accurate the statistics. For example, if the advisor is not kept up to date on traffic
delays, road closures, and poor road conditions, the proposed route may prove inefficient
(high cost).
Heuristics are used to reduce the number of choices that must be made in a cost-based
approach.
Rules
Heuristic optimization transforms the expression-tree by using a set of rules which improve
the performance. These rules are as follows −
Perform the SELECTION operations as early as possible in the query. This should be the
first action for any SQL table. By doing so, we can decrease the number of
records processed by the query, rather than dragging every row of every table through it.
Perform all the projections as early as possible in the query. Much like
selection, this helps by decreasing the number of columns in the
query.
Perform the most restrictive joins and selection operations first. That is,
select only those sets of tables and/or views which will result in a
relatively small number of records and are strictly necessary for the query.
Obviously, any query will execute better when tables with few records are
joined.
Some systems use only heuristics and the others combine heuristics with partial cost-based
optimization.
Let’s see the steps involved in heuristic optimization, which are explained below −
Deconstruct the conjunctive selections into a sequence of single selection
operations.
Move the selection operations down the query tree for the earliest possible
execution.
First execute those selection and join operations which will produce the
smallest relations.
Replace a Cartesian product operation followed by a selection operation with
a join operation.
Deconstruct the projection operations and move them down the query tree as far as possible.
Identify those subtrees whose operations are pipelined.
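The "move selections down" step can be sketched as a tree rewrite. The tree encoding and relation names below are invented for illustration:

```python
# Query trees as nested tuples: ("select", predicate_attrs, child),
# ("product", left, right), ("relation", name, attrs).

def attrs_of(tree):
    """Attributes produced by a subtree."""
    if tree[0] == "relation":
        return tree[2]
    if tree[0] == "product":
        return attrs_of(tree[1]) | attrs_of(tree[2])
    return attrs_of(tree[2])   # a select node passes its child's attributes up

def push_selection(tree):
    """Push a selection below a product when its predicate touches one side only."""
    if tree[0] == "select":
        _, pred_attrs, child = tree
        if child[0] == "product":
            _, left, right = child
            if pred_attrs <= attrs_of(left):      # predicate needs only the left side
                return ("product", ("select", pred_attrs, left), right)
            if pred_attrs <= attrs_of(right):
                return ("product", left, ("select", pred_attrs, right))
    return tree

emp  = ("relation", "employee",   {"emp_id", "emp_name", "salary"})
dept = ("relation", "department", {"dept_id", "dept_name"})
tree = ("select", {"salary"}, ("product", emp, dept))
print(push_selection(tree))
# the selection now sits directly above employee, filtering rows before the product
```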
Have a look at the Employee table below. It contains attributes as column values, namely
1. Employee_Id
2. Employee_Name
3. Employee_Department
4. Salary
Employee Table
Employee_Id Employee_Name Employee_Department
1 Ryan Mechanical
2 Justin Biotechnology
Now that we are clear on the terms related to functional dependency, let's discuss what a
functional dependency is.
For example, consider the relation R (ABCD) with the functional dependencies:
1. A → BCD
2. B → CD
William Armstrong in 1974 suggested a few rules related to functional dependency. They are
called RAT rules.
1. Reflexivity: If A is a set of attributes and B is a subset of A, then the functional
dependency A → B holds true.
o For example, { Employee_Id, Name } → Name is valid.
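Armstrong's rules can be applied mechanically to compute the closure of an attribute set, i.e., everything that set functionally determines. A minimal sketch (the function name is my own):

```python
# Compute the closure of an attribute set under a list of FDs,
# each FD given as a (lhs_set, rhs_set) pair.
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:                       # keep applying FDs until nothing new appears
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# R (ABCD) with A -> BCD and B -> CD, as in the example above:
fds = [({"A"}, {"B", "C", "D"}), ({"B"}, {"C", "D"})]
print(closure({"A"}, fds))  # A determines every attribute of R
print(closure({"C"}, fds))  # C determines only itself
```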
Decomposition in DBMS
Whenever we decompose a relation, certain properties must be satisfied to ensure no
information is lost in the process. These properties are lossless join decomposition and
dependency preservation.
We can follow certain rules to ensure that the decomposition is a lossless join
decomposition. Let’s say we have a relation R and we decompose it into R1 and R2; then the
rules are:
1. The union of attributes of both the sub relations R1 and R2 must contain all the
attributes of original relation R.
R1 ∪ R2 = R
2. The intersection of attributes of both the sub relations R1 and R2 must not be null,
i.e., there should be some attributes that are present in both R1 and R2.
R1 ∩ R2 ≠ ∅
3. The intersection of attributes of both the sub relations R1 and R2 must be the
superkey of R1 or R2, or both R1 and R2.
R1 ∩ R2 = Super key of R1 or R2
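The three rules can be checked mechanically given the attribute sets and the candidate keys of the sub-relations. A small sketch, with attribute names borrowed from the employee/project example:

```python
# Check the three lossless-join conditions for decomposing R into R1 and R2.
# keys_r1 / keys_r2 are lists of candidate keys (attribute sets) of each sub-relation.
def is_lossless(r, r1, r2, keys_r1, keys_r2):
    common = r1 & r2
    return (r1 | r2 == r                        # rule 1: no attribute is lost
            and bool(common)                    # rule 2: shared attributes exist
            and (any(k <= common for k in keys_r1)
                 or any(k <= common for k in keys_r2)))  # rule 3: common part is a key

R  = {"Emp_ID", "Emp_Name", "Project_ID", "Project_Name"}
R1 = {"Emp_ID", "Emp_Name", "Project_ID"}       # EmployeeProject
R2 = {"Project_ID", "Project_Name"}             # ProjectDetail
print(is_lossless(R, R1, R2, keys_r1=[{"Emp_ID"}], keys_r2=[{"Project_ID"}]))
# → True: Project_ID is shared and is a key of ProjectDetail
```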
Let’s see an example of a lossless join decomposition. Suppose we have the following
relation EmployeeProjectDetail as:
<EmployeeProjectDetail>
<EmployeeProject>
<ProjectDetail>
Project_ID Project_Name
P03 Project103
P01 Project101
P04 Project104
P02 Project102
Now, let’s see if this is a lossless join decomposition by evaluating the rules discussed above:
<EmployeeProject ∪ ProjectDetail>
<EmployeeProject ∩ ProjectDetail>
Project_ID
P03
P01
P04
P02
As we can see, this is not null, so the second condition holds as well. Also,
EmployeeProject ∩ ProjectDetail = Project_ID, which is a super key of the ProjectDetail
relation, so the third condition holds as well.
Now, since all three conditions hold for our decomposition, this is a lossless join
decomposition.
Dependency Preserving
The second property of a valid decomposition is dependency preservation, which says that
after decomposing a relation R into R1 and R2, all dependencies of the original relation R
must be present either in R1 or in R2, or they must be derivable using a combination of the
functional dependencies present in R1 and R2.
<EmployeeProjectDetail>
Now, after decomposing the relation into EmployeeProject and ProjectDetail as:
<EmployeeProject>
<ProjectDetail>
Project_ID Project_Name
P03 Project103
P01 Project101
P04 Project104
P02 Project102
As we can see, all FDs in EmployeeProjectDetail are part of either EmployeeProject
or ProjectDetail, so this decomposition is dependency preserving.
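A simplified check for direct preservation can be sketched as below; it ignores FDs that are only derivable from the projected dependencies, so it is a sufficient test rather than a complete one. The names follow the example above:

```python
# An FD (lhs, rhs) is directly preserved if all of its attributes fall
# within a single sub-relation of the decomposition.
def directly_preserved(fd, subrelations):
    lhs, rhs = fd
    return any(lhs | rhs <= sub for sub in subrelations)

subs = [{"Emp_ID", "Emp_Name", "Project_ID"},   # EmployeeProject
        {"Project_ID", "Project_Name"}]          # ProjectDetail
fds = [({"Emp_ID"}, {"Emp_Name"}),
       ({"Project_ID"}, {"Project_Name"})]
print(all(directly_preserved(fd, subs) for fd in fds))  # → True
```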
Distributed Databases
A distributed database is a database that is not limited to one computer system. It is
like a database that consists of two or more files located on different computers or
sites, either on the same network or on entirely different networks.
These sites do not share any physical components. Distributed databases are needed
when particular data in the database must be accessed by various users globally.
The data needs to be handled in such a way that, to a user, it always looks like one
single database.
By contrast, a Centralized database consists of a single database file located at one
site using a single network.
Though there are many distributed databases to choose from, some examples of
distributed databases include Apache Ignite, Apache Cassandra, Apache
HBase, Amazon SimpleDB, Clusterpoint, and FoundationDB.