Spark SQL Optimization

1) Query optimization involves generating the most efficient execution plan for a given query. The query optimizer analyzes the query and applies logical and physical optimizations to improve performance.
2) Logical optimizations rewrite or reorder the logical operations in the query so as to reduce the size of intermediate results. Physical optimizations select the specific low-level physical operations used to execute the query, based on statistics and cost estimates.
3) Spark SQL's Catalyst query optimizer applies rule-based logical optimizations such as predicate pushdown and projection pruning. It then uses a cost-based approach, with statistics such as selectivity, to select the most efficient physical execution plan, e.g. choosing a broadcast join over a shuffle join.


Spark SQL – Optimization

pm jat @ daiict
Query Optimization in Spark-SQL?

• What do you understand by “Query Optimization”?



Query Execution and Optimization
• A query can be expressed in different ways:
(1) π_{fname, dname, salary}(σ_{salary ≥ 30000 ∧ employee.dno = department.dno}(employee × department))
(2) π_{fname, dname, salary}(σ_{salary ≥ 30000}(employee) ⋈ department)
(3) σ_{salary ≥ 30000}(π_{fname, dname, salary}(employee ⋈ department))
(here × is the cross product and ⋈ is the natural join, written "*" in some textbooks)
• Which query is better, and why?
• Should the programmer have to take care of this?
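As a concrete sketch in Spark (assuming a SparkSession named spark and registered employee and department tables), the programmer writes the declarative query once, and the optimizer is expected to pick an efficient ordering:

    val result = spark.sql("""
      SELECT e.fname, d.dname, e.salary
      FROM employee e JOIN department d ON e.dno = d.dno
      WHERE e.salary >= 30000
    """)
    result.explain(true)  // prints the plan the optimizer actually chooses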



Concept of Query Optimization?
• Let the user write a query that is mathematically correct (not necessarily the most efficient in terms of execution)
• DBMS systems provide a run-time module called the "query optimizer"
• The query optimizer reads the query and generates the most "efficient plan" for executing it.
• What do we mean by a plan?



Query Execution Steps
• Query Parsing
• Query Optimization
  – Logical: the query is rewritten by reordering relational operations
  – Physical plan creation: identification of the actual "physical operations", in algorithmic form
• Query Execution
  – The generated physical plan is actually executed!
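These steps can be observed directly in Spark; a minimal sketch, assuming a SparkSession named spark with implicits imported:

    import spark.implicits._
    val df = spark.range(100).filter($"id" > 10).select(($"id" * 2).as("x"))
    df.explain(true)  // prints the Parsed Logical Plan, Analyzed Logical Plan,
                      // Optimized Logical Plan, and Physical Plan, in that order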



Logical Plan and Physical Plan in
“Relational World”
• Logical Plan:
– A Relation Algebra Tree representing the user query
• Physical Plan?
– A sequence of lower-level (physical) operations on data files that executes the user query
– Examples of lower-level operations: sequential scan of a file, index traversal, sequential scan of the leaf nodes of a B+-tree index file, sort-merge, hash join, etc.
– Our SQL expressions must ultimately be executed in terms of these operations



How does Logical Optimization work?
• The following should be easy to explain:
  – If possible, selection should be executed early in the order
    --> it reduces the size of the intermediate "operand relations" in subsequent operations
    --> and thereby the overall cost of execution
  – By the same logic, early projection also leads to "faster execution" of the query
  – If a user has submitted a query in which a JOIN is expressed in terms of a CROSS PRODUCT, it should be rewritten using the JOIN

Do you remember: R ⋈_c S ≡ σ_c(R × S) ?
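Catalyst applies exactly this equivalence; a minimal sketch (the column names are made up) in which a filter over a cross product is rewritten into an inner join in the optimized plan:

    import spark.implicits._
    val r = Seq((1, "a"), (2, "b")).toDF("rk", "rv")
    val s = Seq((1, "x"), (3, "y")).toDF("sk", "sv")
    val q = r.crossJoin(s).filter($"rk" === $"sk")
    q.explain(true)  // the optimized plan shows an inner join on rk = sk,
                     // not a Cartesian product followed by a filter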



How does Logical Optimization work?
• The said "preferred approach" can be defined in terms of rules.
• If the query optimizer finds that the input query does not comply with these rules, the query is re-expressed along these lines.
• That means the query is "rewritten" (this is called "query rewriting").
• In other words, the evaluation tree parsed from the input query is transformed into a "better one" according to the said rules.
• So, how is it done?
  – "Operation reordering"
  – Pushing "predicates" down, pushing "projections" down
  – Combining or splitting operations



Physical Optimization
• A physical plan is basically a sequence of physical operations that are actually performed to execute the query
• A typical set of physical operations: table scan, index scan, hashing, sort-merge, hash join, and so forth!
• For a given logical plan we often have multiple options for producing a physical plan; the choice depends on the "data file organization" and data statistics
• The query optimizer uses the concept of "COST" for choosing the optimal query plan.
• A cost function typically gives some estimate of the time taken by the query. The plan with minimum cost is chosen.
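For instance, one such size-based choice in Spark is controlled by a configuration threshold (a hedged sketch; the values shown are illustrative):

    // Tables smaller than this threshold (default 10 MB) are broadcast to all
    // executors for a join instead of being shuffled:
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
    // Setting the threshold to -1 disables automatic broadcast joins.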



Cost-based Optimization
• For example, the following is a simple estimate of the cost of a "single-loop join"; the same idea is used in our broadcast join:
    b_R + (|R| × (h_S + 1)) + (js × |R| × |S|) / bfr_RS
  where b_R is the number of blocks of R, |R| and |S| are the cardinalities of R and S, h_S is the number of block accesses needed to retrieve the matching records of S (via its hash or index structure), js is the join selectivity, and bfr_RS is the blocking factor of the result file.
• The cost of other join approaches is estimated similarly; the optimizer chooses the plan that has minimum cost.
• For details, you can refer to any database textbook; the formula here comes from Elmasri/Navathe.
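To make the formula concrete, here is a minimal sketch (the function name and the numbers are made up for illustration):

    // cost = b_R + |R| * (h_S + 1) + (js * |R| * |S|) / bfr_RS
    def singleLoopJoinCost(bR: Long, cardR: Long, cardS: Long,
                           hS: Int, js: Double, bfrRS: Double): Double =
      bR + cardR * (hS + 1) + (js * cardR * cardS) / bfrRS

    // e.g. R stored in 2,000 blocks, |R| = 50,000, |S| = 100,000, one block
    // access per lookup of S, join selectivity 1/|S|, result blocking factor 10:
    val cost = singleLoopJoinCost(2000, 50000, 100000, 1, 1.0 / 100000, 10)
    // = 2000 + 100000 + 5000 = 107000 estimated block accesses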



Cost-based Optimization (factors)
• File organization:
  – Whether records are sorted, and if so, on what attribute
  – Whether indexes are available; if yes, on what attributes, and whether the index method is "B+-tree" based or "hash based"
• Metadata:
  – Record size, block size, cardinality, selectivity (ratio of distinct values for an attribute), join selectivity



"explain" of SQL (RDBMS)
• SQL's EXPLAIN shows you the final "optimized physical" plan of query execution! The snapshot here is from "PostgreSQL".



Spark SQL Optimizer Catalyst [1]
• All statements are represented as an Abstract Syntax Tree (AST)
• Lazy evaluation of the AST enables optimization of the expressed operations
• The diagram here depicts the optimization pipeline, which ends in a DAG of RDDs



Catalyst - Analysis Phase
• SQL/DataFrame → AST
• Resolves attributes: checks that they are valid, resolves name ambiguity, etc.
• Performs "type coercion"
• Takes meta-information from the "Catalog"
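Because analysis is eager, resolution failures surface immediately; a small sketch:

    val df = spark.range(10).toDF("id")
    df.select("no_such_column")  // fails right here, in the analysis phase, with
                                 // org.apache.spark.sql.AnalysisException --
                                 // before any Spark job is run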



An example of a "query optimization"

[Figures illustrating the example, from the book Learning Spark [4]]
Logical Optimization in SparkSQL-Catalyst
• Logical operations: the Spark SQL operations expressed in SQL or the DataFrame API
• The logical optimization phase applies standard rule-based optimizations to the logical plan. The article [1] reports the following rule-based techniques that help produce a better query plan (see the sketch after this list):
  – Constant folding: constant propagation
  – Predicate pushdown: move the predicate as early as possible
  – Projection pruning: drop columns unnecessary for query execution
  – Null propagation
  – Boolean expression simplification, and other rules
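A hedged illustration of these rules against a columnar source (the Parquet path and column names are made up):

    import spark.implicits._
    val people = spark.read.parquet("/data/people.parquet")
    val q = people
      .filter($"age" > 20 + 1)  // constant folding: 20 + 1 becomes the literal 21
      .select("name")           // projection pruning: only the needed columns are read
    q.explain(true)
    // The physical plan's scan node shows the pushed-down predicate, e.g.
    // PushedFilters: [GreaterThan(age,21)], and a pruned ReadSchema.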



Cost-Based Optimization (Motivating Example #1)

• Here is an example: a simplified version of Q11 of the TPC-DS benchmark.
• Join order makes a difference, and the most optimal order cannot be determined unless we have some estimate of the sizes of the intermediate results.
• This requires some additional information.

Cost-Based Optimizer in Apache Spark 2.2 – Ron Hu & Sameer Agarwal, Spark Summit 2017: https://www.youtube.com/watch?v=qS_aS99TjCM
Physical Optimization in SparkSQL-Catalyst
• The physical plan is basically an RDD DAG.
• In the physical planning phase, Spark SQL takes a logical plan and generates one or more physical plans.
• It then selects a plan using a cost model (Cost-Based Optimization, CBO).
• An example of a cost comparison might be choosing how to perform a given join by looking at the physical attributes of a given table (how big the table is, or how big its partitions are).
  – Say, which join approach to use: "broadcast join" or "shuffle join" (sort-merge join) – see the sketch below
• The physical planner also performs rule-based physical optimizations, such as "pipelining projections or filters" into one Spark map operation.
• In addition, it can push operations from the logical plan into data sources that support predicate or projection pushdown.
• The final phase of query optimization involves generating Java bytecode to run on each machine.
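The join choice can also be steered explicitly; a short sketch with made-up DataFrame names (broadcast is a real hint in org.apache.spark.sql.functions):

    import org.apache.spark.sql.functions.broadcast
    // Hint that smallDf fits in memory, so the planner picks a broadcast
    // (map-side) hash join instead of a shuffle-based sort-merge join:
    val joined = largeDf.join(broadcast(smallDf), "key")
    joined.explain()  // physical plan shows BroadcastHashJoin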
Cost-Based Optimization (Motivating Example #2)

• The rule says: the smaller of the tables R and L is to be hashed!
• Using only the rule (without considering the sizes of intermediate results) can lead to a poor plan.
• Estimating the size of an intermediate result requires some more information, called statistical information.

https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
Statistical Information in CBO [2]

• Uses the notions of "filter selectivity" and "join selectivity"
• Selectivities are often estimated from histograms of distinct values, the cardinalities of the operand relations, etc.

https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
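In Spark, these statistics are collected with ANALYZE TABLE and the CBO is switched on by configuration; a sketch (the table and column names are made up):

    spark.conf.set("spark.sql.cbo.enabled", "true")
    spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")  // table-level: row count, size
    spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS cust_id, amount")
    // column-level: distinct counts, min/max, null counts -- used for selectivity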
Cost-Based Optimization (Example #1)

• Here we see how changing the join order, based on the estimated sizes of intermediate results (computed from join selectivity and filter selectivity), turns out to be faster!

https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
.explain (example)

[Figures showing .explain output, from the book Learning Spark [4]]
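Since the book's screenshots are not reproduced here, a minimal stand-in sketch (same made-up tables as before):

    import spark.implicits._
    val q = spark.table("employee")
      .join(spark.table("department"), "dno")
      .filter($"salary" >= 30000)
      .select("fname", "dname", "salary")
    q.explain()             // physical plan only
    q.explain("formatted")  // Spark 3.x: operator tree plus per-operator details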
SparkSQL-Catalyst Features [1]
• Supports both "rule-based" and "cost-based" optimization.
• Catalyst is extensible.
• Its extensibility is said to serve the following two purposes:
  – Different types of optimization rules for the different problems associated with "big data" (e.g., semi-structured data and advanced analytics).
  – The ability to add data-source-specific rules that can push filtering or aggregation into external storage systems.
• More features:
  – Schema inference
  – Query federation to external databases



SparkSQL-Catalyst "Extensibility"
• The article says: "In general, we have found it extremely simple to add rules for a wide variety of situations."
  – For example: aggregate operations on fixed-precision decimals are done by converting them to 64-bit integers and finally converting the result back to decimals.
  – Executing a simple SQL LIKE pattern through "String.startsWith" or "String.contains" calls (instead of a full pattern match) makes a difference.
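A minimal sketch of this extensibility (a deliberately trivial rule; Spark's built-in BooleanSimplification already covers this rewrite), registered through the experimental hook:

    import org.apache.spark.sql.catalyst.expressions.Literal
    import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
    import org.apache.spark.sql.catalyst.rules.Rule

    // Remove filters whose condition is the literal TRUE (a no-op):
    object RemoveTrueFilters extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        case Filter(Literal(true, _), child) => child
      }
    }
    spark.experimental.extraOptimizations ++= Seq(RemoveTrueFilters)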



SparkSQL Optimization - API Notes

• df.explain
• In Scala you can also call df.queryExecution.logical or df.queryExecution.optimizedPlan
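A short sketch of these calls from the spark-shell:

    import spark.implicits._
    val df = spark.range(100).filter($"id" > 50)
    println(df.queryExecution.logical)        // the plan as parsed
    println(df.queryExecution.optimizedPlan)  // after Catalyst's logical optimization
    println(df.queryExecution.executedPlan)   // the final physical plan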



References/Further Reading
[1] Armbrust, Michael, et al. "Spark SQL: Relational Data Processing in Spark." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015.
[2] Ron Hu, Zhenhua Wang, Wenchen Fan and Sameer Agarwal. "Cost Based Optimizer in Apache Spark 2.2." Databricks blog, August 2017. https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html. Video talk: https://www.youtube.com/watch?v=qS_aS99TjCM
[3] Baldacci, L., and Golfarelli, M. "A Cost Model for SPARK SQL." IEEE Transactions on Knowledge and Data Engineering. 2019;31(5):819-832. doi:10.1109/TKDE.2018.2850339
[4] (Book) Damji, Jules S., et al. Learning Spark: Lightning-Fast Big Data Analytics. O'Reilly Media, 2020.
[5] Armbrust, M., et al. "Deep Dive into Spark SQL's Catalyst Optimizer." Databricks blog, 2015. https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

