Spark SQL Optimization

1) Query optimization involves generating the most efficient execution plan for a given query. The query optimizer analyzes the query and applies logical and physical optimizations to improve performance.
2) Logical optimizations rewrite or reorder the logical operations in the query so as to reduce the size of intermediate results. Physical optimizations select the specific low-level physical operations used to execute the query, based on statistics and cost estimates.
3) Spark SQL's Catalyst query optimizer applies rule-based logical optimizations such as predicate pushdown and projection pruning. It then uses a cost-based approach, with statistics such as selectivity, to select the most efficient physical execution plan, e.g. choosing a broadcast join over a shuffle join.


Spark SQL – Optimization

pm jat @ daiict
Query Optimization in Spark-SQL?

• What do you understand by “Query Optimization”?



Query Execution and Optimization
• A query can be expressed in different ways:
(1) π_{fname, dname, salary}(σ_{salary ≥ 30000 ∧ employee.dno = department.dno}(employee × department))
(2) π_{fname, dname, salary}(σ_{salary ≥ 30000}(employee) ⋈ department)
(3) σ_{salary ≥ 30000}(π_{fname, dname, salary}(employee ⋈ department))
(here × is the cross product and ⋈ is the natural join, written "*" in some textbooks)
• Which query is better, and why?
• Should the programmer have to take care of this?
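As a concrete sketch in Spark (assuming a SparkSession named spark and registered employee and department tables), the programmer writes the declarative query once, and the optimizer is expected to pick an efficient ordering:

    val result = spark.sql("""
      SELECT e.fname, d.dname, e.salary
      FROM employee e JOIN department d ON e.dno = d.dno
      WHERE e.salary >= 30000
    """)
    result.explain(true)  // prints the plan the optimizer actually chooses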



Concept of Query Optimization?
• Let the user write a query that is mathematically correct (not necessarily the most efficient in terms of execution)
• DBMS systems provide a run-time module called the "query optimizer"
• The query optimizer reads the query and generates the most "efficient plan" for executing it.
• What do we mean by a plan?



Query Execution Steps
• Query Parsing
• Query Optimization
  – Logical: the query is rewritten by reordering relational operations
  – Physical plan creation: identification of the actual "physical operations", in algorithmic form
• Query Execution
  – The generated physical plan is actually executed!
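These steps can be observed directly in Spark; a minimal sketch, assuming a SparkSession named spark with implicits imported:

    import spark.implicits._
    val df = spark.range(100).filter($"id" > 10).select(($"id" * 2).as("x"))
    df.explain(true)  // prints the Parsed Logical Plan, Analyzed Logical Plan,
                      // Optimized Logical Plan, and Physical Plan, in that order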



Logical Plan and Physical Plan in
“Relational World”
• Logical Plan:
– A Relation Algebra Tree representing the user query
• Physical Plan?
– A sequence of lower-level (physical) operations on data files that executes the user query
– Examples of lower-level operations: sequential scan of a file, index traversal, sequential scan of the leaf nodes of a B+-tree index file, sort-merge, hash join, etc.
– Our SQL expressions must ultimately be executed in terms of these operations



How does Logical Optimization work?
• The following should be easy to explain:
  – If possible, selection should be executed early in the order
    --> it reduces the size of the intermediate "operand relations" in subsequent operations
    --> and thereby the overall cost of execution
  – By the same logic, early projection also leads to "faster execution" of the query
  – If a user has submitted a query in which a JOIN is expressed in terms of a CROSS PRODUCT, it should be rewritten using the JOIN

Do you remember: R ⋈_c S ≡ σ_c(R × S) ?
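Catalyst applies exactly this equivalence; a minimal sketch (the column names are made up) in which a filter over a cross product is rewritten into an inner join in the optimized plan:

    import spark.implicits._
    val r = Seq((1, "a"), (2, "b")).toDF("rk", "rv")
    val s = Seq((1, "x"), (3, "y")).toDF("sk", "sv")
    val q = r.crossJoin(s).filter($"rk" === $"sk")
    q.explain(true)  // the optimized plan shows an inner join on rk = sk,
                     // not a Cartesian product followed by a filter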



How does Logical Optimization work?
• The said "preferred approach" can be defined in terms of rules.
• If the query optimizer finds that the input query does not comply with these rules, the query is re-expressed along these lines.
• That means the query is "rewritten" (this is called "query rewriting").
• In other words, the evaluation tree parsed from the input query is transformed into a "better one" according to the said rules.
• So, how is it done?
  – "Operation reordering"
  – Pushing "predicates" down, pushing "projections" down
  – Combining or splitting operations



Physical Optimization
• A physical plan is basically a sequence of physical operations that are actually performed to execute the query
• A typical set of physical operations: table scan, index scan, hashing, sort-merge, hash join, and so forth!
• For a given logical plan we often have multiple options for producing a physical plan; the choice depends on the "data file organization" and data statistics
• The query optimizer uses the concept of "COST" for choosing the optimal query plan.
• A cost function typically gives some estimate of the time taken by the query. The plan with minimum cost is chosen.
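For instance, one such size-based choice in Spark is controlled by a configuration threshold (a hedged sketch; the values shown are illustrative):

    // Tables smaller than this threshold (default 10 MB) are broadcast to all
    // executors for a join instead of being shuffled:
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
    // Setting the threshold to -1 disables automatic broadcast joins.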



Cost-based Optimization
• For example, the following is a simple estimate of the cost of a "single-loop join"; the same idea is used in our broadcast join:
    b_R + (|R| × (h_S + 1)) + (js × |R| × |S|) / bfr_RS
  where b_R is the number of blocks of R, |R| and |S| are the cardinalities of R and S, h_S is the number of block accesses needed to retrieve the matching records of S (via its hash or index structure), js is the join selectivity, and bfr_RS is the blocking factor of the result file.
• The cost of other join approaches is estimated similarly; the optimizer chooses the plan that has minimum cost.
• For details, you can refer to any database textbook; the formula here comes from Elmasri/Navathe.
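To make the formula concrete, here is a minimal sketch (the function name and the numbers are made up for illustration):

    // cost = b_R + |R| * (h_S + 1) + (js * |R| * |S|) / bfr_RS
    def singleLoopJoinCost(bR: Long, cardR: Long, cardS: Long,
                           hS: Int, js: Double, bfrRS: Double): Double =
      bR + cardR * (hS + 1) + (js * cardR * cardS) / bfrRS

    // e.g. R stored in 2,000 blocks, |R| = 50,000, |S| = 100,000, one block
    // access per lookup of S, join selectivity 1/|S|, result blocking factor 10:
    val cost = singleLoopJoinCost(2000, 50000, 100000, 1, 1.0 / 100000, 10)
    // = 2000 + 100000 + 5000 = 107000 estimated block accesses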



Cost-based Optimization (factors)
• File organization:
  – Whether records are sorted, and if so, on what attribute
  – Whether indexes are available; if yes, on what attributes, and whether the index method is "B+-tree" based or "hash based"
• Metadata:
  – Record size, block size, cardinality, selectivity (ratio of distinct values for an attribute), join selectivity



"explain" of SQL (RDBMS)
• SQL's EXPLAIN shows you the final "optimized physical" plan of query execution! The snapshot here is from "PostgreSQL".



Spark SQL Optimizer Catalyst [1]
• All statements are represented as an Abstract Syntax Tree (AST)
• Lazy evaluation of the AST enables optimization of the expressed operations
• The diagram here depicts the optimization pipeline, which ends in a DAG of RDDs



Catalyst - Analysis Phase
• SQL/DataFrame → AST
• Resolves attributes: checks that they are valid, resolves name ambiguity, etc.
• Performs "type coercion"
• Takes meta-information from the "Catalog"
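Because analysis is eager, resolution failures surface immediately; a small sketch:

    val df = spark.range(10).toDF("id")
    df.select("no_such_column")  // fails right here, in the analysis phase, with
                                 // org.apache.spark.sql.AnalysisException --
                                 // before any Spark job is run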



An example of a "query optimization"

[Figures illustrating the example, from the book Learning Spark [4]]
Logical Optimization in SparkSQL-Catalyst
• Logical operations: the Spark SQL operations expressed in SQL or the DataFrame API
• The logical optimization phase applies standard rule-based optimizations to the logical plan. The article [1] reports the following rule-based techniques that help produce a better query plan (see the sketch after this list):
  – Constant folding: constant propagation
  – Predicate pushdown: move the predicate as early as possible
  – Projection pruning: drop columns unnecessary for query execution
  – Null propagation
  – Boolean expression simplification, and other rules
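A hedged illustration of these rules against a columnar source (the Parquet path and column names are made up):

    import spark.implicits._
    val people = spark.read.parquet("/data/people.parquet")
    val q = people
      .filter($"age" > 20 + 1)  // constant folding: 20 + 1 becomes the literal 21
      .select("name")           // projection pruning: only the needed columns are read
    q.explain(true)
    // The physical plan's scan node shows the pushed-down predicate, e.g.
    // PushedFilters: [GreaterThan(age,21)], and a pruned ReadSchema.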



Cost-Based Optimization (Motivating Example #1)

• Here is an example: a simplified version of Q11 of the TPC-DS benchmark.
• Join order makes a difference, and the most optimal order cannot be determined unless we have some estimate of the sizes of the intermediate results.
• This requires some additional information.

Cost-Based Optimizer in Apache Spark 2.2 – Ron Hu & Sameer Agarwal, Spark Summit 2017: https://www.youtube.com/watch?v=qS_aS99TjCM
Physical Optimization in SparkSQL-Catalyst
• The physical plan is basically an RDD DAG.
• In the physical planning phase, Spark SQL takes a logical plan and generates one or more physical plans.
• It then selects a plan using a cost model (Cost-Based Optimization, CBO).
• An example of a cost comparison might be choosing how to perform a given join by looking at the physical attributes of a given table (how big the table is, or how big its partitions are).
  – Say, which join approach to use: "broadcast join" or "shuffle join" (sort-merge join) – see the sketch below
• The physical planner also performs rule-based physical optimizations, such as "pipelining projections or filters" into one Spark map operation.
• In addition, it can push operations from the logical plan into data sources that support predicate or projection pushdown.
• The final phase of query optimization involves generating Java bytecode to run on each machine.
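The join choice can also be steered explicitly; a short sketch with made-up DataFrame names (broadcast is a real hint in org.apache.spark.sql.functions):

    import org.apache.spark.sql.functions.broadcast
    // Hint that smallDf fits in memory, so the planner picks a broadcast
    // (map-side) hash join instead of a shuffle-based sort-merge join:
    val joined = largeDf.join(broadcast(smallDf), "key")
    joined.explain()  // physical plan shows BroadcastHashJoin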
Cost-Based Optimization (Motivating Example #2)

• The rule says: the smaller of the tables R and L is to be hashed!
• Using only the rule (without considering the sizes of intermediate results) can lead to a poor plan.
• Estimating the size of an intermediate result requires some more information, called statistical information.

https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
Statistical Information in CBO [2]

• Uses the notions of "filter selectivity" and "join selectivity"
• Selectivities are often estimated from histograms of distinct values, the cardinalities of the operand relations, etc.

https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
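In Spark, these statistics are collected with ANALYZE TABLE and the CBO is switched on by configuration; a sketch (the table and column names are made up):

    spark.conf.set("spark.sql.cbo.enabled", "true")
    spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")  // table-level: row count, size
    spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS cust_id, amount")
    // column-level: distinct counts, min/max, null counts -- used for selectivity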
Cost-Based Optimization (Example #1)

• Here we see how changing the join order, based on the estimated sizes of intermediate results (computed from join selectivity and filter selectivity), turns out to be faster!

https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
.explain (example)

[Figures showing .explain output, from the book Learning Spark [4]]
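Since the book's screenshots are not reproduced here, a minimal stand-in sketch (same made-up tables as before):

    import spark.implicits._
    val q = spark.table("employee")
      .join(spark.table("department"), "dno")
      .filter($"salary" >= 30000)
      .select("fname", "dname", "salary")
    q.explain()             // physical plan only
    q.explain("formatted")  // Spark 3.x: operator tree plus per-operator details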
SparkSQL-Catalyst Features [1]
• Supports both "rule-based" and "cost-based" optimization.
• Catalyst is extensible.
• Its extensibility is said to serve the following two purposes:
  – Different types of optimization rules for the different problems associated with "big data" (e.g., semi-structured data and advanced analytics).
  – The ability to add data-source-specific rules that can push filtering or aggregation into external storage systems.
• More features:
  – Schema inference
  – Query federation to external databases



SparkSQL-Catalyst "Extensibility"
• The article says: "In general, we have found it extremely simple to add rules for a wide variety of situations."
  – For example: aggregate operations on fixed-precision decimals are done by converting them to 64-bit integers and finally converting the result back to decimals.
  – Executing a simple SQL LIKE pattern through "String.startsWith" or "String.contains" calls (instead of a full pattern match) makes a difference.
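A minimal sketch of this extensibility (a deliberately trivial rule; Spark's built-in BooleanSimplification already covers this rewrite), registered through the experimental hook:

    import org.apache.spark.sql.catalyst.expressions.Literal
    import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
    import org.apache.spark.sql.catalyst.rules.Rule

    // Remove filters whose condition is the literal TRUE (a no-op):
    object RemoveTrueFilters extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        case Filter(Literal(true, _), child) => child
      }
    }
    spark.experimental.extraOptimizations ++= Seq(RemoveTrueFilters)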



SparkSQL Optimization - API Notes

• df.explain
• In Scala you can also call df.queryExecution.logical or df.queryExecution.optimizedPlan
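A short sketch of these calls from the spark-shell:

    import spark.implicits._
    val df = spark.range(100).filter($"id" > 50)
    println(df.queryExecution.logical)        // the plan as parsed
    println(df.queryExecution.optimizedPlan)  // after Catalyst's logical optimization
    println(df.queryExecution.executedPlan)   // the final physical plan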



References/Further Reading
[1] Armbrust, Michael, et al. "Spark SQL: Relational Data Processing in Spark." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015.
[2] Ron Hu, Zhenhua Wang, Wenchen Fan and Sameer Agarwal. "Cost Based Optimizer in Apache Spark 2.2." Databricks blog, August 2017. https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html. Video talk: https://www.youtube.com/watch?v=qS_aS99TjCM
[3] Baldacci, L., and Golfarelli, M. "A Cost Model for SPARK SQL." IEEE Transactions on Knowledge and Data Engineering. 2019;31(5):819-832. doi:10.1109/TKDE.2018.2850339
[4] (Book) Damji, Jules S., et al. Learning Spark: Lightning-Fast Big Data Analytics. O'Reilly Media, 2020.
[5] Armbrust, M., et al. "Deep Dive into Spark SQL's Catalyst Optimizer." Databricks blog, 2015. https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

