The document discusses query optimization techniques used by DBMS to process and execute high-level queries, including scanning, parsing, and validating SQL queries. It explains the creation of query trees, the evaluation of execution strategies, and the importance of heuristic optimization to improve performance by reducing intermediate results. Additionally, it covers the conversion of query trees into execution plans, detailing access methods and evaluation approaches such as materialized and pipelined evaluations.
Query Optimization
• Query processing and optimization refers to the techniques used by a DBMS to process, optimize and execute high-level queries.
• A query expressed in a high-level language such as SQL must first be scanned, parsed and validated.
• The scanner identifies the language tokens, such as SQL keywords, attribute names and relation names.
• The parser checks the query syntax to determine whether it is formulated according to the syntax rules of the query language.
• The query must also be validated by checking that all attribute and relation names are valid and semantically meaningful names in the schema.
• An internal representation of the query is then created, usually as a tree data structure called a query tree. It is also possible to represent the query using a graph data structure called a query graph.
• The DBMS must then devise an execution strategy for retrieving the result of the query from the database files. A query typically has many possible execution strategies, and the process of choosing a suitable one is known as query optimization.
• An RDBMS (and an ODBMS) must systematically evaluate alternative query execution strategies and choose a reasonably efficient or optimal strategy.
• Each DBMS has general database access algorithms that implement relational operations such as SELECT or JOIN, or combinations of these operations.

Translating SQL Queries into Relational Algebra
• An SQL query is first translated into an equivalent extended relational algebra expression, represented as a query tree data structure, that is then optimized.
• SQL queries are decomposed into query blocks, which form the basic units that can be translated into the algebraic operators and optimized.
• A query block contains a single SELECT-FROM-WHERE expression, as well as GROUP BY and HAVING clauses if these are part of the block.
• Nested queries within a query are identified as separate query blocks.
• Because SQL includes aggregate operators such as MAX, MIN, SUM and COUNT, these operators must also be included in the extended algebra.
[Figure: example SQL query]
• This query includes a nested subquery and hence would be decomposed into two blocks.
• The inner block is: [Figure: inner block]
• The outer block is: [Figure: outer block]
• The query optimizer would then choose an execution plan for each block.
• In the example above, the inner block needs to be evaluated only once to produce the maximum salary, which is then used as the constant c by the outer block.

Basic Algorithms for Executing Query Operations
• For each operation or combination of operations, one or more algorithms would typically be available to execute the operation(s).
• An algorithm may apply only to particular storage structures and access paths; if so, it can only be used if the files involved in the operation include these access paths.
• External sorting is at the heart of many relational operations that use sort-merge strategies.
• Access algorithms for implementing SELECT, JOIN, PROJECT and set operations (UNION, INTERSECTION, SET DIFFERENCE), and aggregate operations (MIN, COUNT, AVERAGE, SUM), are also important in query optimization.

External Sorting
• Sorting is one of the primary algorithms used in query processing. For example, whenever an SQL query specifies an ORDER BY clause, the query result must be sorted.
• Sorting is also a key component in sort-merge algorithms used for JOIN and other operations such as UNION and INTERSECTION, and in duplicate elimination algorithms for the PROJECT operation (when an SQL query specifies the DISTINCT option in the SELECT clause).
• External sorting refers to sorting algorithms that are suitable for large files of records stored on disk that do not fit entirely in main memory, such as database files.
• The typical external sorting algorithm uses a sort-merge strategy, which starts by sorting small subfiles, called runs, of the main file and then merges the sorted runs, creating larger sorted subfiles that are merged in turn.
• The sort-merge algorithm, like other database algorithms, requires buffer space in main memory, where the actual sorting and merging of the runs is performed.
• The basic algorithm consists of two phases:
  i. Sorting phase
  ii. Merging phase

Sorting phase
• Runs (portions) of the file that can fit in the available buffer space are read into main memory, sorted using an internal sorting algorithm, and written back to disk as temporary sorted subfiles (runs).

Merging phase
• The sorted runs are merged during one or more passes. The degree of merging is the number of runs that can be merged together in each pass. In each pass, one buffer block is needed to hold one block from each of the runs being merged, and one further block is needed to hold one block of the merge result.

Combining Operations Using Pipelining
• A query specified in SQL will typically be translated into a relational algebra expression that is a sequence of relational operations.
• For example, rather than being implemented separately, a JOIN can be combined with two SELECT operations on the input files and a final PROJECT operation on the resulting file, all implemented by one algorithm with two input files and a single output file.
• Heuristic relational algebra optimization can group operations together for execution.
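The combination described above (two SELECTs, a JOIN and a final PROJECT executed by one algorithm over two inputs) can be sketched roughly as follows. This is a minimal illustration, not a DBMS implementation: the function name `select_join_project`, the in-memory dictionaries standing in for file records, and the sample attribute names and values are all assumptions made for the example. The filtered inner input is buffered in memory here purely for simplicity.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]

def select_join_project(
    left: Iterable[Row],
    right: Iterable[Row],
    left_pred: Callable[[Row], bool],
    right_pred: Callable[[Row], bool],
    join_pred: Callable[[Row, Row], bool],
    out_attrs: List[str],
) -> Iterator[Row]:
    """Nested-loop join with both selections and the projection folded in."""
    # Apply the SELECT on the inner input once, so it can be rescanned per outer row.
    inner = [r for r in right if right_pred(r)]
    for l in left:
        if not left_pred(l):          # SELECT on the outer input
            continue
        for r in inner:
            if join_pred(l, r):       # JOIN condition
                merged = {**l, **r}
                # PROJECT: emit only the requested attributes, one tuple at a time.
                yield {a: merged[a] for a in out_attrs}

# Hypothetical sample data (not from the slides).
employees = [
    {"LNAME": "Otieno", "SALARY": 50000, "DNO": 1},
    {"LNAME": "Wanjiru", "SALARY": 30000, "DNO": 2},
    {"LNAME": "Mwangi", "SALARY": 45000, "DNO": 2},
]
departments = [
    {"DNUMBER": 1, "DNAME": "Research"},
    {"DNUMBER": 2, "DNAME": "Admin"},
]

result = list(select_join_project(
    employees, departments,
    left_pred=lambda e: e["SALARY"] > 40000,
    right_pred=lambda d: d["DNAME"] != "Research",
    join_pred=lambda e, d: e["DNO"] == d["DNUMBER"],
    out_attrs=["LNAME", "DNAME"],
))
print(result)
```

No intermediate relation is built for the selections or for the join result; each output tuple is produced as soon as a matching pair is found.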
• This is called pipelining or stream-based processing.

Using Heuristics in Query Optimization
• Heuristic rules are applied to modify the internal representation of a query in order to improve its performance. One of the main heuristic rules is to apply SELECT and PROJECT operations before applying the JOIN or other binary operations.
• This is because the size of the file resulting from a binary operation such as a JOIN is usually a multiplicative function of the sizes of the input files. The SELECT and PROJECT operations reduce the size of a file and hence should be applied before a JOIN or other binary operation.

Notation for query trees and query graphs
• A query tree is a tree data structure that corresponds to a relational algebra expression. It represents the input relations of the query as leaf nodes of the tree, and represents the relational algebra operations as internal nodes.
• An execution of the query tree consists of executing an internal node operation whenever its operands are available and then replacing that internal node by the relation that results from executing the operation.
• The execution terminates when the root node is executed and produces the result relation for the query.
• Example: For every project located in ‘Nairobi’, retrieve the project number, the controlling department number, and the department manager's last name, address and birthdate.
This query is specified on the relational schema below and corresponds to the following algebra expression:
[Figure: relational schema and relational algebra expression]

Query tree corresponding to the relational algebra expression
• The three relations PROJECT, DEPARTMENT and EMPLOYEE are represented by leaf nodes P, D and E, while the relational algebra operations of the expression are represented by internal tree nodes.
• When this query tree is executed, the node marked (1) must begin execution before node (2), because some resulting tuples of operation (1) must be available before we can begin executing operation (2). Similarly, node (2) must begin executing and producing results before node (3) can start execution, and so on.

[Figure: initial (canonical) query tree for the query]

Query graph for the query
• Relations in the query are represented by relation nodes, which are displayed as single circles. Constant values, typically from the query selection conditions, are represented by constant nodes, which are displayed as double circles.
• Selection and join conditions are represented by the graph edges. The attributes to be retrieved from each relation are displayed in square brackets above each relation.
• The query graph representation does not indicate an order in which to perform the operations. There is only a single graph corresponding to each query.
• Query trees are preferred because the query optimizer needs to show the order of operations for query execution, which is not possible in query graphs.

Heuristic Optimization Of Query Trees
• In general, many different relational algebra expressions (and hence many different query trees) can be equivalent; that is, they can correspond to the same query.
• The query parser will typically generate a standard initial query tree to correspond to an SQL query, without doing any optimization. In the above example, the canonical form is that initial tree.
• The CARTESIAN PRODUCT of the relations specified in the FROM clause is first applied; then the selection and join conditions of the WHERE clause are applied, followed by the projection on the SELECT clause attributes.
• Such a canonical query tree represents a relational algebra expression that is very inefficient if executed directly, because of the CARTESIAN PRODUCT (×) operations.
• For example, if the PROJECT, DEPARTMENT and EMPLOYEE relations had record sizes of 100, 50 and 150 bytes and contained 100, 20 and 5000 tuples respectively, the result of the CARTESIAN PRODUCT would contain 10 million tuples of record size 300 bytes each.
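The figures in this example can be checked with a few lines of arithmetic. The sketch below uses only the numbers given above; the variable names are illustrative. Cardinalities multiply under a CARTESIAN PRODUCT, while record widths add.

```python
# (number of tuples, record size in bytes) for each relation in the example
relations = {
    "PROJECT": (100, 100),
    "DEPARTMENT": (20, 50),
    "EMPLOYEE": (5000, 150),
}

tuples = 1
record_bytes = 0
for count, size in relations.values():
    tuples *= count        # cardinalities multiply
    record_bytes += size   # record widths add

total_bytes = tuples * record_bytes
print(tuples, record_bytes, total_bytes)  # 10000000 300 3000000000
```

At 300 bytes per tuple, the unoptimized intermediate result would occupy about 3 GB, which is why the canonical tree is not executed directly.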
• It is now the job of the heuristic query optimizer to transform this initial query tree into a final query tree that is efficient to execute.
• The optimizer must include rules for equivalence among relational algebra expressions that can be applied to the initial tree. The heuristic query optimization rules then use these equivalences to transform the initial tree into the final optimised query tree.
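To make the idea of equivalence concrete, the sketch below checks two standard equivalences on a tiny data set: a SELECT whose condition involves only one relation can be applied before the CARTESIAN PRODUCT, and a CARTESIAN PRODUCT followed by a SELECT on a join condition gives the same result as a JOIN. The helper functions, the relation contents and the attribute names are assumptions made for illustration; a real optimizer rewrites the tree symbolically rather than comparing results.

```python
from itertools import product

def cartesian(r, s):
    return [{**a, **b} for a, b in product(r, s)]

def select(rel, pred):
    return [t for t in rel if pred(t)]

def join(r, s, pred):
    return [{**a, **b} for a in r for b in s if pred({**a, **b})]

def as_set(rel):
    # Order-independent view of a relation, so results can be compared.
    return {tuple(sorted(t.items())) for t in rel}

# Hypothetical sample relations.
employee = [
    {"LNAME": "A", "BYEAR": 1950, "DNO": 1},
    {"LNAME": "B", "BYEAR": 1960, "DNO": 1},
    {"LNAME": "C", "BYEAR": 1965, "DNO": 2},
]
department = [{"DNUMBER": 1}, {"DNUMBER": 2}]

born_after_1957 = lambda t: t["BYEAR"] > 1957
same_dept = lambda t: t["DNO"] == t["DNUMBER"]

# Canonical form: product first, then all selections.
canonical = select(select(cartesian(employee, department), born_after_1957), same_dept)
# SELECT moved below the product.
pushed = select(cartesian(select(employee, born_after_1957), department), same_dept)
# Product plus join condition replaced by a JOIN.
joined = join(select(employee, born_after_1957), department, same_dept)

print(as_set(canonical) == as_set(pushed) == as_set(joined))
print(len(cartesian(employee, department)),
      len(cartesian(select(employee, born_after_1957), department)))
```

All three forms return the same tuples, but the intermediate product shrinks from 6 tuples to 4 once the SELECT is applied first.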
• Example of transforming a tree:
• Find the last names of employees born after 1957 who work on a project named ‘sensors’.
• This query can be specified in SQL as:
[Figure: SQL query]

[Figure: initial (canonical) query tree for the SQL query]

Moving SELECT operations down the query tree
• This is an improved query tree that first applies the SELECT operations to reduce the number of tuples that appear in the CARTESIAN PRODUCT.

Applying the more restrictive SELECT operation first
• A further improvement is achieved by switching the positions of the EMPLOYEE and PROJECT relations in the figure above. This uses the information that PNUMBER is a key attribute of the PROJECT relation, and hence the SELECT operation on the PROJECT relation will retrieve a single record only.

Replacing CARTESIAN PRODUCT and SELECT with JOIN operations
• In the figure above, improvement is achieved by replacing any CARTESIAN PRODUCT operation that is followed by a join condition with a JOIN operation.

Moving PROJECT operations down the query tree
• Here, improvement is achieved by keeping only the attributes needed by subsequent operations in the intermediate relations, by including PROJECT operations as early as possible in the query tree. This reduces the number of attributes in the intermediate relations, whereas the SELECT operations reduce the number of tuples.

General transformation Rules for Relational Algebra Operations
• This example demonstrates that a query tree can be transformed step by step into another query tree that is more efficient to execute. However, we must be sure that the transformation steps always lead to an equivalent query tree.
• To do this, the query optimizer must know which transformation rules preserve this equivalence.
• Take Home CAT 2
  1. Complete the list of rules up to rule number 12 and identify how the rules have been implemented in the example given earlier in figures b to e.
  2. Discuss the methods for implementing SELECTION, JOIN, PROJECT, SET and AGGREGATE operations (hint: for SELECTION, linear search and binary search; for JOIN, nested-loop join and sort-merge join; for SET operations, hashing; etc.).
• The main heuristic is to apply first the operations that reduce the size of intermediate results. This includes performing SELECT operations as early as possible to reduce the number of tuples, and PROJECT operations as early as possible to reduce the number of attributes.
• This is done by moving SELECT and PROJECT operations as far down the tree as possible.
• In addition, the SELECT and JOIN operations that are most restrictive, that is, those that result in relations with the fewest tuples or with the smallest absolute size, should be executed before other similar operations.
• This is done by reordering the leaf nodes of the tree among themselves while avoiding CARTESIAN PRODUCTS, and adjusting the rest of the tree appropriately.

Converting Query Trees into Execution Plans
• An execution plan for a relational algebra expression represented as a query tree includes information about the access methods available for each relation, as well as the algorithms to be used in computing the relational operators represented in the tree.
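One way to picture such a plan is as a query tree in which every node carries the access method or algorithm chosen for it. The sketch below is purely illustrative: the `PlanNode` class, its field names and the sample plan are hypothetical and do not correspond to any particular DBMS. The `execution_order` helper lists nodes bottom-up, reflecting the rule that an operation can run only once its operands are available.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PlanNode:
    """One node of a query tree, annotated with how it will be executed."""
    operation: str   # relation name or relational operation
    method: str      # access method or algorithm chosen for this node
    children: List["PlanNode"] = field(default_factory=list)

def execution_order(node: PlanNode) -> List[str]:
    """List nodes bottom-up: operands come before the operation that uses them."""
    steps: List[str] = []
    for child in node.children:
        steps.extend(execution_order(child))
    steps.append(f"{node.operation} using {node.method}")
    return steps

# Hypothetical plan; relation names and method choices are illustrative only.
plan = PlanNode("PROJECT", "scan of join result", [
    PlanNode("JOIN", "nested-loop join", [
        PlanNode("SELECT on DEPARTMENT", "index search"),
        PlanNode("EMPLOYEE", "table scan"),
    ]),
])

steps = execution_order(plan)
for step in steps:
    print(step)
```

Keeping the chosen method on the node separates the logical operation (what is computed) from the physical choice (how it is computed), which is the distinction between a query tree and an execution plan.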
• Consider the query tree above: to convert this into an execution plan, the optimizer might choose an index search for the SELECT operation (assuming one exists), a table scan as the access method for EMPLOYEE, a nested-loop join algorithm for the join, and a scan of the JOIN result for the PROJECT operator.
• In addition, the approach taken for executing the query may specify a materialised or a pipelined evaluation. With a materialised evaluation, the result of an operation is stored as a temporary relation (that is, the result is physically materialised).
• For instance, the join operation can be computed and the entire result stored as a temporary relation, which is then read as input by the algorithm that computes the PROJECT operation, which would produce the query result table.
• On the other hand, with a pipelined evaluation, as the resulting tuples of an operation are produced, they are forwarded directly to the next operation in the query sequence.
• The advantage of pipelining is the cost saving in not having to write the intermediate results to disk and not having to read them back for the next operation.
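The difference between the two evaluation approaches can be sketched with Python generators. This is a simplified illustration under stated assumptions: an in-memory list stands in for the temporary relation that a DBMS would write to disk, and the function names, sample rows and the `log` parameter (used only to observe when join tuples are produced) are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]

def join_rows(emps: Iterable[Row], depts: List[Row], log: List[str]) -> Iterator[Row]:
    """Nested-loop join on DNO = DNUMBER; logs each join tuple as it is produced."""
    for e in emps:
        for d in depts:
            if e["DNO"] == d["DNUMBER"]:
                log.append(str(e["LNAME"]))
                yield {**e, **d}

def project(rows: Iterable[Row], attrs: List[str]) -> Iterator[Row]:
    for row in rows:
        yield {a: row[a] for a in attrs}

def materialised(emps, depts, attrs, log) -> List[Row]:
    temp = list(join_rows(emps, depts, log))   # entire join result stored first
    return list(project(temp, attrs))          # then read back by PROJECT

def pipelined(emps, depts, attrs, log) -> Iterator[Row]:
    # Each join tuple is handed straight to PROJECT; nothing is stored in between.
    return project(join_rows(emps, depts, log), attrs)

# Hypothetical sample data.
employees = [
    {"LNAME": "Otieno", "DNO": 1},
    {"LNAME": "Wanjiru", "DNO": 2},
    {"LNAME": "Mwangi", "DNO": 2},
]
departments = [
    {"DNUMBER": 1, "DNAME": "Research"},
    {"DNUMBER": 2, "DNAME": "Admin"},
]

mat_log: List[str] = []
mat_result = materialised(employees, departments, ["LNAME", "DNAME"], mat_log)

pipe_log: List[str] = []
stream = pipelined(employees, departments, ["LNAME", "DNAME"], pipe_log)
first = next(stream)
produced_before_first_output = len(pipe_log)
pipe_result = [first] + list(stream)

print(produced_before_first_output, pipe_result == mat_result)
```

Both approaches return the same rows, but the pipelined version delivers its first output tuple after producing a single join tuple, whereas the materialised version must finish the whole join before PROJECT can start.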