
Azure Databricks Best Practice Guide

Azure Databricks (ADB) has the power to process terabytes of data while simultaneously running
heavy data science workloads. Over time, as data volumes and workloads increase, job performance
degrades. As an ADB developer, optimizing your platform enables you to work faster and saves
hours of effort for you and your team. Below are the best practices you need to optimize your
ADB environment.

1. Use cache() on tables/DataFrames that are reused, such as conformed dimensions.

cache() is an Apache Spark transformation that can be used on a DataFrame, Dataset, or
RDD when you want to perform more than one action. cache() caches the specified
DataFrame, Dataset, or RDD in the memory of your cluster’s workers. Since cache() is
lazy, the caching operation takes place only when a Spark action (for example,
count(), show(), take(), or write()) is also run on the same DataFrame, Dataset, or RDD.

You should call count() or write() immediately after calling cache() so that the entire
DataFrame is processed and cached in memory. If you only cache part of the DataFrame,
the entire DataFrame may be recomputed when a subsequent action is performed on the
DataFrame.
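As an illustration, a minimal PySpark sketch of this pattern in a Databricks notebook (the table and column names are hypothetical):

# Cache a DataFrame that several downstream actions will reuse
df = spark.table("sales_raw").filter("country_code = 'US'")
df.cache()        # lazy: only marks the DataFrame for caching
df.count()        # action: materializes and caches the entire DataFrame
df.groupBy("market_code").count().show()   # served from the cache
df.unpersist()    # release the cached blocks when no longer needed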


2. Create partitions on every table; for fact tables, partition on a key join column such as
country_code, city, or market_code.

Delta tables in ADB support partitioning, which enhances performance. Partition by a
column only if you expect the data in each partition to be at least 1 GB. If a column's cardinality is high,
do not use it for partitioning: for example, if you partition by user ID and there
are 1M distinct user IDs, partitioning would increase table load time. Syntax example:

CREATE TABLE events (
  date DATE,
  eventId STRING,
  eventType STRING,
  data STRING
) USING delta PARTITIONED BY (date)


3. Land data in Blob Storage/ADLS partitioned into separate directories

Land incoming data into a hierarchical, partitioned folder structure (for example, by date) to avoid the high cost of listing very large flat directories.
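For example, a hedged PySpark sketch of landing raw data into date-partitioned folders in ADLS (the container, storage account, path, and column name are hypothetical):

raw_df.write \
  .format("parquet") \
  .partitionBy("ingest_date") \
  .mode("append") \
  .save("abfss://landing@<storage-account>.dfs.core.windows.net/events/")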


4. Use Delta Lake performance features like OPTIMIZE with ZORDER

Z-Ordering (multi-dimensional clustering)

Z-Ordering is a technique to colocate related information in the same set of files. This
co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms to
dramatically reduce the amount of data that needs to be read. To Z-Order data, you specify
the columns to order on in the ZORDER BY clause:
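For example (hypothetical table and column name), run OPTIMIZE with a ZORDER BY clause:

spark.sql("OPTIMIZE events ZORDER BY (eventType)")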

5. Enable the Auto Optimize option for all staging tables.

Enable Auto Optimize

Auto Optimize consists of two complementary features: Optimized Writes and Auto
Compaction. You must explicitly enable them using one of the following methods (a sketch
for enabling them on existing tables follows the notes below):

 New table: set the table properties delta.autoOptimize.optimizeWrite = true
and delta.autoOptimize.autoCompact = true in the CREATE TABLE command:

CREATE TABLE student (id INT, name STRING, age INT) TBLPROPERTIES
(delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true)

NOTE:

 Databricks does not support Z-Ordering with Auto Compaction, as Z-Ordering is
significantly more expensive than just compaction.
 Auto Compaction generates smaller files (128 MB) than OPTIMIZE (1 GB).
 Auto Compaction greedily chooses a limited set of partitions that would best leverage
compaction. The number of partitions selected will vary depending on the size of cluster
it is launched on. If your cluster has more CPUs, more partitions can be optimized.
 To control the output file size, set the Spark
configuration spark.databricks.delta.autoCompact.maxFileSize. The default value
is 134217728, which sets the size to 128 MB. Specifying the value 104857600 sets the
file size to 100MB.
 spark.sql("set spark.databricks.delta.autoCompact.enabled = true")


6. Decide on the partition (file) size; the default block size is 128 MB, and this determines
the number of files created for each table.

7. Use hints for improving query performance like BROADCAST.

Join hints

Join hints allow you to suggest the join strategy that Databricks Runtime should use. When
different join strategy hints are specified on both sides of a join, Databricks Runtime
prioritizes hints in the following
order: BROADCAST over MERGE over SHUFFLE_HASH over SHUFFLE_REPLICATE_NL.
When both sides are specified with the BROADCAST hint or the SHUFFLE_HASH hint,
Databricks Runtime picks the build side based on the join type and the sizes of the
relations. Since a given strategy may not support all join types, Databricks Runtime is not
guaranteed to use the join strategy suggested by the hint.

Join hint types

 BROADCAST
Use broadcast join. The join side with the hint is broadcast regardless
of autoBroadcastJoinThreshold. If both sides of the join have the broadcast hints, the
one with the smaller size (based on stats) is broadcast. The aliases
for BROADCAST are BROADCASTJOIN and MAPJOIN.

 MERGE

Use shuffle sort merge join. The aliases for MERGE are SHUFFLE_MERGE and MERGEJOIN.

 SHUFFLE_HASH

Use shuffle hash join. If both sides have the shuffle hash hints, Databricks Runtime
chooses the smaller side (based on stats) as the build side.

 SHUFFLE_REPLICATE_NL

Use shuffle-and-replicate nested loop join.
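As an illustration of these hints, a minimal PySpark sketch (the table names are hypothetical); the same hint can also be applied through the DataFrame API with broadcast():

from pyspark.sql.functions import broadcast

# SQL join hint: broadcast the smaller dimension table
spark.sql("""
  SELECT /*+ BROADCAST(c) */ f.order_id, c.country_name
  FROM fact_orders f
  JOIN dim_country c ON f.country_code = c.country_code
""")

# Equivalent DataFrame form
fact_df = spark.table("fact_orders")
dim_df = spark.table("dim_country")
fact_df.join(broadcast(dim_df), "country_code")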

8. Use Repartition hints for balancing partitions.

Partitioning hint types

 COALESCE

Reduce the number of partitions to the specified number of partitions. It takes a partition
number as a parameter.

 REPARTITION

Repartition to the specified number of partitions using the specified partitioning
expressions. It takes a partition number, column names, or both as parameters.

 REPARTITION_BY_RANGE

Repartition to the specified number of partitions using the specified partitioning
expressions. It takes column names and an optional partition number as parameters.

 REBALANCE

The REBALANCE hint can be used to rebalance the query result output partitions so
that every partition is of a reasonable size (not too small and not too big). It can take
column names as parameters and tries its best to partition the query result by these
columns. This is best-effort: if there are skews, Spark will split the skewed partitions
so that they are not too big. The hint is useful when you need to write the result
of the query to a table and want to avoid files that are too small or too big. It is
ignored if AQE is not enabled.
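For example (hypothetical table and column names); the REBALANCE hint requires AQE to be enabled:

spark.sql("SELECT /*+ REPARTITION(200, country_code) */ * FROM fact_orders")
spark.sql("SELECT /*+ COALESCE(8) */ * FROM fact_orders")
spark.sql("SELECT /*+ REBALANCE(country_code) */ * FROM fact_orders")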

9. Delete temporary tables after notebook execution

Delete temporary tables that were created as intermediate tables during notebook
execution. Deleting tables saves storage, especially if the notebook is scheduled daily.

10. Use dbutils.fs.rm() to permanently delete temporary table metadata

ADB clusters retain table metadata even if you use DROP statements to delete the tables. Before
creating temporary tables, use dbutils.fs.rm() to permanently delete the old metadata and
underlying files. If you do not, an error message will appear stating that the table already exists. To
avoid this error in daily refreshes, you must use dbutils.fs.rm().
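A minimal sketch of this cleanup in a notebook (the table name and path are hypothetical):

# Drop the table definition, then remove the underlying files recursively
spark.sql("DROP TABLE IF EXISTS temp_daily_sales")
dbutils.fs.rm("/mnt/datalake/temp/temp_daily_sales", True)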

11. Use LOWER() or UPPER() when comparing strings in joins or common filter conditions to avoid
losing data

String comparisons in ADB are case-sensitive, so values that differ only in casing do not
match. To avoid losing data, use the case conversion functions LOWER() or UPPER(). Example:

SELECT 'MAQSoftware' = 'maqsoftware' AS WithOutLowerOrUpper
  ,LOWER('MAQSoftware') = 'maqsoftware' AS WithLower
  ,UPPER('MAQSoftware') = 'MAQSOFTWARE' AS WithUpper

12. Use custom functions to simplify complex calculations

If your calculation requires multiple steps, you can save time by creating a one-step
custom function. ADB offers a variety of built-in SQL functions; to create custom
functions, known as user-defined functions (UDFs), you can use Scala. Once you have a custom
function, you can call it every time you need to perform that specific calculation.
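As an illustration, a minimal sketch of a UDF registered from a notebook; the guide mentions Scala, but Python and SQL UDFs are also supported, and the example is shown in PySpark for consistency with the other snippets here (the function name, logic, and table are hypothetical):

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def net_price(price, discount_pct, tax_rate):
    # hypothetical multi-step calculation collapsed into one reusable function
    discounted = price * (1 - discount_pct / 100.0)
    return discounted * (1 + tax_rate)

net_price_udf = udf(net_price, DoubleType())
spark.udf.register("net_price", net_price, DoubleType())   # also callable from SQL
df = spark.table("orders").withColumn(
    "net_price", net_price_udf("price", "discount_pct", "tax_rate"))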

13. Use Delta tables for DML commands

In ADB, Hive tables do not support UPDATE and MERGE statements or NOT NULL and
CHECK constraints. Delta tables do support these commands; however, running them over
large amounts of data can decrease query performance. To keep performance from
degrading, store table versions.
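For example, a hedged sketch of a Delta MERGE (the table and column names are hypothetical):

spark.sql("""
  MERGE INTO sales_delta AS t
  USING sales_updates AS s
    ON t.order_id = s.order_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")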

14. Use views when creating intermediate tables

If you need to create intermediate tables, use views to minimize storage usage and save
costs. Views are session-scoped and are removed automatically after query execution, so
nothing is persisted to storage. For optimal query performance, do not use joins or
subqueries in views.
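For example (hypothetical view, table, and column names):

spark.sql("""
  CREATE OR REPLACE TEMP VIEW active_customers AS
  SELECT customer_id, country_code
  FROM customers
  WHERE is_active = true
""")
# The view exists only for the current session and writes nothing to storage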

15. Enable adaptive query execution (AQE)

AQE improves large query performance. By default, AQE is disabled in ADB. To enable it,
use: set spark.sql.adaptive.enabled = true;

Enabling AQE

AQE can be enabled by setting SQL config spark.sql.adaptive.enabled to true (default false
in Spark 3.0), and applies if the query meets the following criteria:

It is not a streaming query

It contains at least one exchange (usually when there’s a join, aggregate or window
operator) or one subquery

Once enabled, AQE covers the following areas, and the corresponding query plan nodes help you verify that it is working:

1. Optimizing shuffles
2. Choosing join strategies
3. Handling skew joins
4. Understanding AQE query plans
5. The AdaptiveSparkPlan node
6. The CustomShuffleReader node
7. Detecting join strategy changes

16. Use key vault credentials when creating mount points

When creating mount points to Azure Data Lake Storage (ADLS), use a service principal client ID
and client secret retrieved from Azure Key Vault (through a Databricks secret scope) rather than
hard-coding credentials, to enhance security.
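A hedged sketch of such a mount, assuming a hypothetical secret scope ("kv-scope"), secret names, tenant, storage account, and container:

configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type":
      "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": dbutils.secrets.get("kv-scope", "sp-client-id"),
  "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("kv-scope", "sp-client-secret"),
  "fs.azure.account.oauth2.client.endpoint":
      "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
dbutils.fs.mount(
  source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
  mount_point="/mnt/datalake",
  extra_configs=configs)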


17. Query directly on parquet files from ADLS

If you need to use data from parquet files, do not extract it into ADB as intermediate tables.
Instead, query the parquet files directly to save time and storage. Example:

SELECT ColumnName FROM parquet.`Location of the file`


18. Choose the cluster mode based on whether jobs run individually or as a group.

For individual job execution, use a Standard mode cluster.

For groups of jobs, or multiple jobs with dependent tables loaded in parallel or in sequence,
choose High Concurrency mode.

1. Deploy a shared cluster instead of letting each user create their own cluster.
2. Create the shared cluster in High Concurrency mode instead of Standard mode.
3. Configure security on the shared High Concurrency cluster, using one of the following
options:
o Turn on AAD Credential Passthrough if you’re using ADLS
o Turn on Table Access Control for all other stores


1. Minimizing cost: by forcing users to share an autoscaling cluster that you have configured
with a maximum node count, rather than asking them to create a new cluster each time they
log in, you can control the total cost easily. The maximum cost of a shared cluster can be
calculated by assuming it runs X hours at maximum size with the chosen VMs. This is
difficult to achieve if each user is given free rein to create clusters of arbitrary size and
VM type.

2. Optimizing for latency: only High Concurrency clusters have features that allow queries
from different users to share cluster resources in a fair, secure manner. HC clusters
come with Query Watchdog, a process that keeps disruptive queries in check by
automatically preempting rogue queries, limiting the maximum size of output rows
returned, and so on.

3. Security: the Table Access Control feature is available only in High Concurrency mode and
needs to be turned on so that users can limit access to the database objects (tables,
views, functions, etc.) they create on the shared cluster. For ADLS, we recommend
restricting access using the AAD Credential Passthrough feature instead of Table Access
Control.


19. Arrive at the Correct Cluster Size by Iterative Performance Testing

It is impossible to predict the correct cluster size without developing the application because
Spark and Azure Databricks use numerous techniques to improve cluster utilization. The broad
approach you should follow for sizing is:

1. Develop on a medium-sized cluster of 2-8 nodes, with VMs matched to the workload class
as explained earlier.
2. After meeting functional requirements, run an end-to-end test on larger, representative data
while measuring CPU, memory, and I/O used by the cluster at an aggregate level.
3. Optimize the cluster to remove the bottlenecks found in step 2:
o CPU bound: add more cores by adding more nodes
o Network bound: use fewer, bigger SSD-backed machines to reduce network traffic
and improve remote read performance
o Disk I/O bound: if jobs are spilling to disk, use VMs with more memory.

Repeat steps 2 and 3 by adding nodes and/or evaluating different VMs until all obvious
bottlenecks have been addressed.

Performing these steps will help you to arrive at a baseline cluster size which can meet SLA on
a subset of data. In theory, Spark jobs, like jobs on other data-intensive frameworks such as
Hadoop, exhibit linear scaling. For example, if it takes 5 nodes to meet the SLA on a 100 TB
dataset and the production data is around 1 PB, the production cluster is likely to be around
50 nodes in size. You can use this back-of-the-envelope calculation as a first guess for capacity planning.
However, there are scenarios where Spark jobs don’t scale linearly. In some cases this is due
to large amounts of shuffle adding an exponential synchronization cost (explained next), but
there could be other reasons as well. Hence, to refine the first estimate and arrive at a more
accurate node count we recommend repeating this process 3-4 times on increasingly larger
data set sizes, say 5%, 10%, 15%, 30%, etc. The overall accuracy of the process depends on
how closely the test data matches the live workload both in type and size.


 Fewer big instances > more small instances
o Reduce network shuffle; Databricks has 1 executor per machine
o Applies mainly to batch ETL (for streaming, one could start with smaller instances
depending on the complexity of the transformation)
o Not set in stone, and the reverse would make sense in many cases, so the sizing exercise
matters
 Size based on the number of tasks initially, then tweak later
o Run the job on a small cluster to get an idea of the number of tasks (use 2-3x tasks per core
for base sizing)
 Choose based on workload (Probably start with F-series or DSv2):
o ETL with full file scans and no data reuse - F / DSv2
o ML workload with data caching - DSv2 / F
o Data Analysis - L
o Streaming - F

20. Specify distribution when publishing data to Azure Data Warehouse (ADW)

Use hash distribution for fact tables and other large tables, round-robin distribution for
dimension tables, and replicated distribution for small dimension tables. Example:
# tempDir: Blob Storage staging area required by the SQL DW connector
# tableOptions: sets the distribution used when the DW table is created
df.write \
  .format("com.databricks.spark.sqldw") \
  .option("url", "jdbc:sqlserver://<connection-string>") \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("dbTable", "my_table_in_dw_copy") \
  .option("tempDir", "wasbs://<container>@<storage-account>.blob.core.windows.net/<temp-dir>") \
  .option("tableOptions", "CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH(<distribution-column>)") \
  .save()

21. Understand Azure Databricks pricing for the individual components, such as compute (DBUs),
VMs, storage, and bandwidth.

Azure Databricks billing includes separate charges for each of the following services and
resources, each with its own pricing page:

 DBUs
 VMs
 Public IP addresses
 Blob Storage
 Managed Disks
 Bandwidth


Example 1: If you run a Premium tier cluster for 100 hours in East US 2 with 10 DS13v2
instances, the billing for the All-purpose Compute workload would be the following:

 VM cost for 10 DS13v2 instances: 100 hours x 10 instances x $0.598/hour = $598
 DBU cost for the All-purpose Compute workload for 10 DS13v2 instances: 100 hours x
10 instances x 2 DBU per node x $0.55/DBU = $1,100
 The total cost would therefore be $598 (VM cost) + $1,100 (DBU cost) = $1,698.

Example 2: If you run a Premium tier cluster for 100 hours in East US 2 with 10 DS13v2
instances, the billing for the Jobs Compute workload would be the following:

 VM cost for 10 DS13v2 instances: 100 hours x 10 instances x $0.598/hour = $598
 DBU cost for the Jobs Compute workload for 10 DS13v2 instances: 100 hours x 10
instances x 2 DBU per node x $0.30/DBU = $600
 The total cost would therefore be $598 (VM cost) + $600 (DBU cost) = $1,198.

In addition to VM and DBU charges, there will be additional charges for managed disks,
public IP addresses, bandwidth, and any other resources you use, such as Azure Storage or
Azure Cosmos DB, depending on your application.
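The arithmetic above can be captured in a small helper; the rates below are just the example figures quoted in this section:

def cluster_cost(hours, instances, vm_rate, dbu_per_node, dbu_rate):
    vm_cost = hours * instances * vm_rate
    dbu_cost = hours * instances * dbu_per_node * dbu_rate
    return vm_cost + dbu_cost

print(cluster_cost(100, 10, 0.598, 2, 0.55))   # Example 1 (All-purpose): ~$1,698
print(cluster_cost(100, 10, 0.598, 2, 0.30))   # Example 2 (Jobs): ~$1,198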

22. Customize cluster termination time

Terminating inactive clusters saves costs. ADB automatically terminates clusters after a
default idle time. Because different projects have different needs, it is important to customize
the auto-termination time to avoid premature or delayed termination. For example, set a longer
idle timeout for development environments, where work is continuous.

23. Enable cluster autoscaling

ADB offers cluster autoscaling, which is disabled by default. Enable this feature to enhance
job performance. Instead of providing a fixed number of worker nodes during cluster
creation, provide a minimum and a maximum; ADB then automatically scales the number of
workers based on the characteristics of the job.
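A hedged sketch of how an autoscaling range might be expressed in a cluster specification submitted through the Databricks Clusters API (all field values are illustrative):

cluster_spec = {
    "cluster_name": "shared-etl",
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "Standard_DS13_v2",
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,   # see also tip 22 (cluster termination time)
}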


24. Use Azure Data Factory (ADF) to run ADB notebook jobs

If you run numerous notebooks daily, the ADB job scheduler will not be efficient. The ADB
job scheduler cannot set notebook dependencies, so you would have to chain all notebooks
in one master notebook, which is difficult to debug. Instead, schedule jobs through Azure Data
Factory, which enables you to set dependencies and debug easily if anything fails.

25. Use the retry feature in ADF when scheduling jobs

Processing notebooks in ADB through ADF can overload the cluster, causing notebooks to
fail. If a failure occurs, the entire job should not stop. To continue work from the point of
failure, configure ADF to retry two to three times with five-minute intervals. As a result,
processing resumes after the retry interval, saving you time and effort.

26. Consider upgrading to ADB Premium

Your business’s data has never been more valuable. Additional security is a worthwhile
investment. ADB Premium includes 5-level access control.
