Azure Databricks Best Practices
Azure Databricks (ADB) has the power to process terabytes of data, while simultaneously running
heavy data science workloads. Over time, as data input and workloads increase, job performance
decreases. As an ADB developer, optimizing your platform enables you to work faster and save
hours of effort for you and your team. Below are the best practices you need to optimize your
ADB environment.
1. Call count() or write() immediately after cache()
You should call count() or write() immediately after calling cache() so that the entire
DataFrame is processed and cached in memory. If you only cache part of the DataFrame,
the entire DataFrame may be recomputed when a subsequent action is performed on the
DataFrame.
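As a sketch (PySpark in a Databricks notebook; the path and DataFrame name are hypothetical):

```python
# Hypothetical source path; cache() only marks the DataFrame for caching.
events_df = spark.read.format("delta").load("/mnt/raw/events")
events_df.cache()

# Trigger an action immediately so the whole DataFrame is materialized in
# memory; otherwise a later action may recompute the uncached remainder.
events_df.count()
```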
Ravi Credits: databricks.com, azure.com Youtube: techlake
2. Create partitions on every table; for fact tables, partition on a key join column such as
country_code, city, or market_code
Delta tables in ADB support partitioning, which enhances performance. You can partition by
a column if you expect data in that partition to be at least 1 GB. If column cardinality is high,
do not use that column for partitioning. For example, if you partition by user ID and there
are 1M distinct user IDs, partitioning would increase table load time. Syntax example:
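A sketch of the syntax, using a hypothetical fact table partitioned on a low-cardinality join key:

```sql
-- country_code has only a few hundred distinct values, so each partition
-- can plausibly hold at least 1 GB of data (hypothetical table).
CREATE TABLE sales_fact (
  sale_id BIGINT,
  amount DOUBLE,
  country_code STRING
)
USING DELTA
PARTITIONED BY (country_code)
```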
Avoid the high listing cost of large directories, such as deep hierarchical folder structures
Z-Ordering is a technique to colocate related information in the same set of files. This
co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms to
dramatically reduce the amount of data that needs to be read. To Z-Order data, you specify
the columns to order on in the ZORDER BY clause:
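For example (hypothetical table and column names):

```sql
-- Colocate rows with similar eventType values in the same files so
-- data skipping can prune files on eventType filters.
OPTIMIZE events
ZORDER BY (eventType)
```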
Auto Optimize consists of two complementary features: Optimized Writes and Auto
Compaction. You must explicitly enable both using one of the following methods, for
example as table properties at creation time:
CREATE TABLE student (id INT, name STRING, age INT)
TBLPROPERTIES (
  delta.autoOptimize.optimizeWrite = true,
  delta.autoOptimize.autoCompact = true
)
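Alternatively (a sketch), both features can be enabled for writes in the current session via Spark configuration:

```sql
SET spark.databricks.delta.optimizeWrite.enabled = true;
SET spark.databricks.delta.autoCompact.enabled = true;
```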
NOTE:
6. Decide the partition size (the default block size is 128 MB); this determines the number of
files created for the table.
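One related knob (an assumption about the setting the author means) is the maximum bytes packed into a single partition when reading files, which defaults to 128 MB:

```sql
-- Default is 128 MB (134217728 bytes); raising it yields fewer, larger
-- partitions (and hence fewer output files), and vice versa.
SET spark.sql.files.maxPartitionBytes = 134217728;
```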
Join hints
Join hints allow you to suggest the join strategy that Databricks Runtime should use. When
different join strategy hints are specified on both sides of a join, Databricks Runtime
prioritizes hints in the following order: BROADCAST over MERGE over SHUFFLE_HASH
over SHUFFLE_REPLICATE_NL. When both sides are specified with the BROADCAST
hint or the SHUFFLE_HASH hint,
Databricks Runtime picks the build side based on the join type and the sizes of the
relations. Since a given strategy may not support all join types, Databricks Runtime is not
guaranteed to use the join strategy suggested by the hint.
BROADCAST
Use broadcast join. The join side with the hint is broadcast regardless
of autoBroadcastJoinThreshold. If both sides of the join have the broadcast hints, the
one with the smaller size (based on stats) is broadcast. The aliases
for BROADCAST are BROADCASTJOIN and MAPJOIN.
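A sketch of the hint syntax (table names hypothetical):

```sql
-- Broadcast the small dimension table to every executor, avoiding a
-- shuffle of the large fact table.
SELECT /*+ BROADCAST(d) */ f.sale_id, d.country_name
FROM sales_fact f
JOIN dim_country d
  ON f.country_code = d.country_code
```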
MERGE
Use shuffle sort merge join. The aliases for MERGE are SHUFFLE_MERGE and MERGEJOIN.
SHUFFLE_HASH
Use shuffle hash join. If both sides have the shuffle hash hints, Databricks Runtime
chooses the smaller side (based on stats) as the build side.
SHUFFLE_REPLICATE_NL
Use shuffle-and-replicate nested loop join.
COALESCE
Reduce the number of partitions to the specified number of partitions. It takes a partition
number as a parameter.
REPARTITION
Repartition to the specified number of partitions using the specified partitioning
expressions. It takes a partition number, column names, or both as parameters.
REPARTITION_BY_RANGE
Repartition to the specified number of partitions using the specified partitioning
expressions. It takes column names and an optional partition number as parameters.
REBALANCE
The REBALANCE hint can be used to rebalance the query result output partitions, so
that every partition is of a reasonable size (not too small and not too big). It can take
column names as parameters, and try its best to partition the query result by these
columns. This is a best-effort: if there are skews, Spark will split the skewed partitions,
to make these partitions not too big. This hint is useful when you need to write the result
of this query to a table, to avoid too small/big files. This hint is ignored if AQE is not
enabled.
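A sketch of the partitioning-hint syntax (table and column names hypothetical):

```sql
SELECT /*+ COALESCE(10) */ * FROM sales_fact;            -- shrink to 10 partitions
SELECT /*+ REPARTITION(100, country_code) */ * FROM sales_fact;
SELECT /*+ REPARTITION_BY_RANGE(100, sale_id) */ * FROM sales_fact;
SELECT /*+ REBALANCE(country_code) */ * FROM sales_fact; -- requires AQE
```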
Delete temporary tables that were created as intermediate tables during notebook
execution. Deleting tables saves storage, especially if the notebook is scheduled daily.
ADB clusters retain table metadata even after DROP statements. Before recreating
temporary tables, use dbutils.fs.rm() to permanently remove the old files and metadata;
otherwise an error will appear stating that the table already exists. To avoid this error in
daily refreshes, you must use dbutils.fs.rm().
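A sketch for a notebook cell (path and table name hypothetical):

```python
# Drop the temporary table, then remove its underlying files so the next
# daily run can recreate it without a "table already exists" error.
spark.sql("DROP TABLE IF EXISTS tmp_daily_stage")
dbutils.fs.rm("/mnt/datalake/tmp/tmp_daily_stage", recurse=True)
```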
11. Use Lower() or Upper() when comparing strings or common filter conditions to avoid
losing data
String comparisons in ADB are case-sensitive, so rows whose values differ only in casing
will not match and data can be silently lost. To avoid losing data, apply the case-conversion
functions Lower() or Upper() to both sides of the comparison. Example:
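For example (hypothetical table and values):

```sql
-- Without lower(), 'US', 'us' and 'Us' would not all match.
SELECT *
FROM customers
WHERE lower(country_code) = lower('US')
```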
If your calculation requires multiple steps, you can save time by creating a one-step
custom function. ADB offers a variety of built-in SQL functions; however, to create custom
functions, known as user-defined functions (UDFs), use Scala. Once you have a custom
function, you can call it every time you need to perform that specific calculation.
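A sketch in PySpark (the document recommends Scala, but the idea is the same; all names are hypothetical):

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def net_price(price, tax_rate, discount):
    # Collapse a multi-step calculation into one reusable function.
    return (price - discount) * (1.0 + tax_rate)

net_price_udf = udf(net_price, DoubleType())
orders_df = orders_df.withColumn(
    "net_price", net_price_udf("price", "tax_rate", "discount"))
```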
In ADB, Hive tables do not support UPDATE and MERGE statements or NOT NULL and
CHECK constraints. Delta tables do support these commands; however, running them over
large amounts of data decreases query performance. To avoid degrading performance,
store table versions.
If you need intermediate tables, use views instead to minimize storage usage and save
costs. Temporary views are session-scoped and are cleaned up automatically when the
session ends, so nothing persists in storage. For optimal query performance, do not use
joins or subqueries in views.
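For example, a session-scoped temporary view (names hypothetical):

```sql
-- Exists only for this session; no storage cleanup is needed afterwards.
CREATE OR REPLACE TEMP VIEW daily_stage AS
SELECT sale_id, amount, country_code
FROM sales_fact
WHERE sale_date = current_date()
```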
AQE improves large query performance. By default, AQE is disabled in ADB. To enable it,
use: set spark.sql.adaptive.enabled = true;
Enabling AQE
AQE can be enabled by setting SQL config spark.sql.adaptive.enabled to true (default false
in Spark 3.0), and applies if the query meets the following criteria:
It contains at least one exchange (usually when there’s a join, aggregate or window
operator) or one subquery
1. Optimizing Shuffles
2. Choosing Join Strategies
3. Handling Skew Joins
4. Understand AQE Query Plans
5. The AdaptiveSparkPlan Node
6. The CustomShuffleReader Node
7. Detecting Join Strategy Change
When creating mount points to Azure Data Lake Storage (ADLS), use a key vault client ID
and client secret to enhance security.
If you need the data from parquet files, do not extract it into ADB as intermediate tables.
Instead, query the parquet files directly to save time and storage. Example:
SELECT ColumnName FROM parquet.`Location of the file`
18. Choose the cluster mode for individual job execution and common job execution.
For groups of jobs, or multiple jobs that load dependent tables in parallel or in sequence,
choose High Concurrency mode.
1. Deploy a shared cluster instead of letting each user create their own cluster.
2. Create the shared cluster in High Concurrency mode instead of Standard mode.
3. Configure security on the shared High Concurrency cluster, using one of the following
options:
o Turn on AAD Credential Passthrough if you’re using ADLS
o Turn on Table Access Control for all other stores
1. Minimizing Cost: By forcing users to share an autoscaling cluster you have configured
with a maximum node count, rather than, say, asking them to create a new one each time
they log in, you can control the total cost easily. The maximum cost of a shared cluster can
be calculated by assuming it runs X hours at maximum size with the particular VMs. This is
difficult to achieve if each user is given free rein to create clusters of arbitrary size and
VMs.
2. Optimizing for Latency: Only High Concurrency clusters have features that allow queries
from different users to share cluster resources in a fair, secure manner. HC clusters come
with Query Watchdog, a process which keeps disruptive queries in check by automatically
pre-empting rogue queries, limiting the maximum size of output rows returned, etc.
3. Security: The Table Access Control feature is only available in High Concurrency mode and
needs to be turned on so that users can limit access to their database objects (tables,
views, functions, etc.) created on the shared cluster. In case of ADLS, we recommend
restricting access using the AAD Credential Passthrough feature instead of Table Access
Controls.
It is impossible to predict the correct cluster size without developing the application because
Spark and Azure Databricks use numerous techniques to improve cluster utilization. The broad
approach you should follow for sizing is:
1. Develop on a medium sized cluster of 2-8 nodes, with VMs matched to workload class
as explained earlier.
2. After meeting functional requirements, run end to end test on larger representative data
while measuring CPU, memory and I/O used by the cluster at an aggregate level.
3. Optimize cluster to remove bottlenecks found in step 2
o CPU bound: add more cores by adding more nodes
o Network bound: use fewer, bigger SSD backed machines to reduce network
size and improve remote read performance
o Disk I/O bound: if jobs are spilling to disk, use VMs with more memory.
Repeat steps 2 and 3 by adding nodes and/or evaluating different VMs until all obvious
bottlenecks have been addressed.
Performing these steps will help you arrive at a baseline cluster size which can meet the SLA
on a subset of data. In theory, Spark jobs, like jobs on other data-intensive frameworks (e.g.,
Hadoop), exhibit linear scaling. For example, if it takes 5 nodes to meet the SLA on a 100 TB
dataset, and the production data is around 1 PB, then the production cluster is likely to be
around 50 nodes in size. You can use this back-of-the-envelope calculation as a first guess
for capacity planning.
However, there are scenarios where Spark jobs don’t scale linearly. In some cases this is due
to large amounts of shuffle adding an exponential synchronization cost (explained next), but
there could be other reasons as well. Hence, to refine the first estimate and arrive at a more
accurate node count we recommend repeating this process 3-4 times on increasingly larger
data set sizes, say 5%, 10%, 15%, 30%, etc. The overall accuracy of the process depends on
how closely the test data matches the live workload both in type and size.
20. Specify distribution when publishing data to Azure Data Warehouse (ADW)
Use hash distribution for fact tables or large tables, round-robin for dimension tables, and
replicated for small dimension tables; pass the distribution through the tableOptions write
option. Example:
df.write \
  .format("com.databricks.spark.sqldw") \
  .option("url", "jdbc:sqlserver://") \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("dbTable", "my_table_in_dw_copy") \
  .option("tableOptions", "table_options") \
  .save()
Example 1: If you run a Premium tier cluster for 100 hours in East US 2 with 10 DS13v2
instances, the All-purpose Compute billing is the VM cost plus the DBU cost at the
All-purpose rate.
Example 2: If you run the same cluster for 100 hours as a Jobs Compute workload, the
billing is the VM cost plus the DBU cost at the lower Jobs Compute rate.
In addition to VM and DBU charges, there will be additional charges for managed disks,
public IP address, bandwidth, or any other resource such as Azure Storage, Azure Cosmos
DB depending on your application.
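The arithmetic can be sketched as follows; the DBU rating and the hourly rates below are assumptions for illustration only, not current prices:

```python
# Back-of-the-envelope Azure Databricks bill: VM cost + DBU cost.
# All rates are illustrative assumptions; check current Azure pricing.
instances = 10
hours = 100
dbu_per_instance_hour = 2.0   # assumed DBU rating of a DS13v2
vm_rate = 0.598               # assumed $/hour per DS13v2 VM
dbu_rate_all_purpose = 0.55   # assumed $/DBU, Premium All-purpose
dbu_rate_jobs = 0.30          # assumed $/DBU, Premium Jobs Compute

total_dbus = instances * hours * dbu_per_instance_hour
vm_cost = instances * hours * vm_rate
all_purpose_total = vm_cost + total_dbus * dbu_rate_all_purpose
jobs_total = vm_cost + total_dbus * dbu_rate_jobs
print(total_dbus, all_purpose_total, jobs_total)
```

With these assumed rates, the Jobs Compute run is noticeably cheaper even though the VM cost is identical, because only the DBU rate differs.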
Terminating inactive clusters saves costs. ADB automatically terminates clusters based on
a default down time. As different projects have different needs, it’s important to customize
the down time to avoid premature or delayed termination. For example: set a longer down
time for development environments, as work is continuous.
ADB offers cluster autoscaling, which is disabled by default. Enable this feature to enhance
job performance. Instead of providing a fixed number of worker nodes during cluster
creation, you should provide a minimum and maximum. ADB then automatically reallocates
the worker nodes based on job characteristics.
22
Ravi Credits: databricks.com, azure.com Youtube: techlake
24. Use Azure Data Factory (ADF) to run ADB notebook jobs
If you run numerous notebooks daily, the ADB job scheduler is not efficient: it cannot
express dependencies between notebooks, so you would have to store all notebooks
in one master notebook, which is difficult to debug. Instead, schedule jobs through Azure
Data Factory, which enables you to set dependencies and debug easily if anything fails.
Processing notebooks in ADB through ADF can overload the cluster, causing notebooks to
fail. If failure occurs, the entire job should not stop. To continue work from the point of
failure, set ADF to retry two to three times with five-minute intervals. As a result, the
processing should continue from the set time, saving you time and effort.
Your business’s data has never been more valuable. Additional security is a worthwhile
investment. ADB Premium includes 5-level access control.