Databricks Apache Spark Certified Developer Master Cheat Sheet
Index
1. GENERAL IMP LINKS
2. POINTS TO CONSIDER
3. COURSE TOPICS
o a. Spark Concept
o b. WEB UI / Spark UI
o c. RDD + DataFrame + DataSets + SparkSQL
o d. Streaming
o e. SparkMLLib
o f. GraphLib
4. NOTES FROM THE BOOKS / GUIDES.
o 4.1 Learning Spark: Lightning-Fast Big Data Analysis
o 4.2 High Performance Spark - Holden Karau and Rachel Warren
o 4.3 Machine Learning with Spark: Nick Pentreath
o 4.4 https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/
o 4.5 Programming Guides from http://spark.apache.org/docs/latest/
5. SPARKSESSION & PYSPARK.SQL.FUNCTIONS f
databricks - free 6GB cluster with preinstalled Spark and relevant dependencies for notebooks
zepl - limited-resource, non-distributed Spark notebooks
colab - from Google
Kaggle Kernels (in a Kaggle kernel, turn Internet On, then !pip install pyspark)
# Colab setup: Spark 2.3 needs Java 8; the download URL below is the Apache release archive (assumed here)
!apt-get install -y -qq openjdk-8-jdk-headless
!wget -q https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
!tar xf spark-2.3.1-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.1-bin-hadoop2.7"
import findspark
findspark.init()  # put the Spark install on sys.path so pyspark can be imported
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
References:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/
https://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
https://pages.databricks.com/rs/094-YMS-629/images/7-steps-for-a-developer-to-learn-apache-spark.pdf
https://docs.databricks.com/spark/latest/gentle-introduction/index.html
http://www.bigdatatrunk.com/developer-certification-for-apache-spark-databricks/
2. POINTS TO CONSIDER
40 questions, 90 minutes
70% programming (Scala, Python and Java), 30% theory.
O'Reilly Learning Spark: Chapters 3, 4 and 6 cover ~50%; Chapters 8, 9 (important) and 10 cover ~30%.
Programming languages: certifications are offered in Scala or Python.
Some experience developing Spark apps in production is already expected.
Developers must be able to recognize code that is more parallel and less memory-constrained. They must know how to apply best practices to avoid runtime issues and performance bottlenecks.
3. COURSE TOPICS
a. Spark Concept
http://spark.apache.org/
https://databricks.gitbooks.io/databricks-spark-reference-applications/content/index.html
https://thachtranerc.wordpress.com/2017/07/10/databricks-developer-certifcation-for-apache-spark-finally-i-made-it/
Videos:
https://www.youtube.com/watch?v=7ooZ4S7Ay6Y
https://www.youtube.com/watch?v=tFRPeU5HemU
o https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
o scales the number of executors registered with this application up and
down based on the workload.
o spark.dynamicAllocation.enabled
o spark.speculation
o If set to "true", one or more tasks that are running slowly in a stage will be re-launched (speculative execution).
o spark.locality.wait
o How long to wait to launch a data-local task before giving up and
launching it on a less-local node.
o The same wait will be used to step through multiple locality levels
(process-local, node-local, rack-local and then any).
o It is also possible to customize the waiting time for each level by setting
spark.locality.wait.node, etc.
o You should increase this setting if your tasks are long and see poor
locality, but the default usually works well.
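A sketch of setting these properties when building a SparkSession (the values are illustrative; the same keys can go in spark-defaults.conf or be passed via spark-submit --conf):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tuning-demo")
         # scale executors with the workload (needs the external shuffle service on YARN)
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")
         # re-launch suspiciously slow tasks speculatively
         .config("spark.speculation", "true")
         # how long to wait for a data-local slot before falling back to a less-local one
         .config("spark.locality.wait", "3s")
         .getOrCreate())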
http://spark.apache.org/docs/latest/tuning.html
Data Serialization:
Memory Tuning:
Three considerations in tuning memory usage:
o the amount of memory used by your objects (you may want your entire dataset to fit in memory),
o the cost of accessing those objects,
o the overhead of garbage collection (if you have high turnover in terms of objects).
To estimate an RDD's memory consumption, create the RDD, put it into the cache, and look at the "Storage" page in the web UI.
With cache(), you only get the default storage level MEMORY_ONLY; with persist(), you can specify which storage level you want (see the sketch after this list):
o MEMORY_ONLY
o MEMORY_ONLY_SER
o MEMORY_AND_DISK
o MEMORY_AND_DISK_SER
o DISK_ONLY
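A minimal PySpark sketch of the cache()/persist() difference (the RDD contents are arbitrary):
from pyspark import StorageLevel

rdd = spark.sparkContext.parallelize(range(100000))
rdd.cache()                                # always the default level, MEMORY_ONLY
rdd.unpersist()                            # the level can only be changed after unpersisting
rdd.persist(StorageLevel.MEMORY_AND_DISK)  # persist() lets you choose the level explicitly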
o avoid the Java features that add overhead, such as pointer-based data
structures and wrapper objects.
o prefer arrays of objects, and primitive types, instead of the standard
Java or Scala collection classes
o Avoid nested structures with a lot of small objects and pointers when
possible.
o Consider using numeric IDs or enumeration objects instead of strings
for keys.
o When your objects are still too large to store efficiently despite this tuning, a much simpler way to reduce memory usage is to store them in serialized format.
o The downside is a performance hit, as it adds the overhead of deserialization every time the data is read.
Level of Parallelism
The simplest fix here is to increase the level of parallelism, so that each task’s
input set is smaller
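A few illustrative ways to raise parallelism (200 is an arbitrary count; pairs is assumed to be a pair RDD):
rdd_wide = rdd.repartition(200)                        # reshuffle an RDD into more partitions
sums = pairs.reduceByKey(lambda a, b: a + b, 200)      # pass numPartitions to shuffle operations
spark.conf.set("spark.sql.shuffle.partitions", "200")  # partitions used by DataFrame/SQL shuffles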
Data Locality
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-data-locality.html
If data and the code that operates on it are together then computation tends
to be fast
Typically it is faster to ship serialized code from place to place than a chunk of data, because code size is much smaller than data.
Spark builds its scheduling around this general principle of data locality.
Spark prefers to schedule all tasks at the best locality level, but this is not
always possible.
In situations where there is no unprocessed data on any idle executor, Spark
switches to lower locality levels.
There are two options:
- a) wait until a busy CPU frees up to start a task on data on the same server,
or
- b) immediately start a new task in a farther away place that requires moving
data there.
What Spark typically does is wait a bit in the hopes that a busy CPU frees up.
Once that timeout expires, it starts moving the data from far away to the free
CPU.
You should increase these settings if your tasks are long and see poor locality,
but the default usually works well.
The best means of checking whether a task ran locally is to inspect a given
stage in the Spark UI.
In the Stages tab of the Spark UI, the Locality Level column displays which locality a given task ran with.
Locality Level : PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, or ANY
Kryo serialization
https://spark.apache.org/docs/latest/tuning.html#data-serialization
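A sketch of enabling Kryo when building the session (the registered class name is a placeholder; registering classes is optional but recommended):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .config("spark.kryo.classesToRegister", "com.example.MyRecord")  # placeholder class
         .config("spark.kryoserializer.buffer.max", "128m")
         .getOrCreate())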
http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
http://spark.apache.org/docs/latest/security.html
http://spark.apache.org/docs/latest/hardware-provisioning.html
a.14 Shuffles
http://hydronitrogen.com/apache-spark-shuffles-explained-in-depth.html
a.15 Partitioning
https://medium.com/parrot-prediction/partitioning-in-apache-spark-8134ad840b0
https://techmagie.wordpress.com/2015/12/19/understanding-spark-partitioning/
https://www.talend.com/blog/2018/03/05/intro-apache-spark-partitioning-need-know/
o Every node in a Spark cluster contains one or more partitions.
o too few partitions (causing less concurrency, data skew & improper resource utilization)
o too many partitions (causing task scheduling to take more time than the actual execution time)
o By default, the number of partitions is set to the total number of cores on all the executor nodes.
o Partitions in Spark do not span multiple machines.
o Tuples in the same partition are guaranteed to be on the same
machine.
o Spark assigns one task per partition and each worker can process one
task at a time.
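Illustrative calls for inspecting and adjusting partitioning (the DataFrame and partition counts are placeholders):
df = spark.range(0, 1000000)
df.rdd.getNumPartitions()      # how many partitions the data currently has
df8 = df.repartition(8, "id")  # full shuffle into 8 partitions, hashed on the "id" column
df2 = df.coalesce(2)           # reduce the partition count without a full shuffle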
b. WEB UI / Spark UI
spark web ui
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/operation_spark_applications.html
i. JOBS tab : The Jobs tab consists of two pages, i.e. All Jobs and Details
for Job pages.
o Stages tab in web UI shows the current state of 'all stages of all jobs' in
a Spark application (i.e. a SparkContext)
o two optional pages for the tasks and statistics for a stage (when a stage
is selected) and pool details (when the application works in FAIR
scheduling mode).
o Summary Metrics for Completed Tasks in Stage: the summary metrics table shows the metrics for the tasks in a given stage that have already finished with SUCCESS status and metrics available.
The table consists of the following columns: Metric, Min, 25th percentile, Median, 75th percentile, Max.
iv. ENVIRONMENT tab: shows Spark configuration properties, JVM/system properties and classpath entries.
v. EXECUTORS tab: shows per-executor details like total tasks, Input, Shuffle read & write, etc.
vi. SQL tab: SQL tab in web UI shows SQLMetrics per physical operator in
a structured query physical plan.
Types of RDD
Types are named for how the RDD was made:
HadoopRDD, FilterRDD, MapRDD, ShuffleRDD, S3RDD, etc.
d. Streaming
https://spark.apache.org/docs/latest/streaming-programming-guide.html
e. SparkMLLib
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-mllib/spark-mllib.html
f. GraphLib
https://spark.apache.org/docs/latest/graphx-programming-guide.html
RDDs containing key-value pairs are called pair RDDs.
Transformations on one pair RDD: reduceByKey / foldByKey, combineByKey, countByValue, groupByKey, mapValues, flatMapValues, keys, values, sortByKey
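A minimal sketch of a few of these transformations on a pair RDD (the data is arbitrary; result order may vary):
pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
pairs.reduceByKey(lambda x, y: x + y).collect()  # [('a', 4), ('b', 2)]
pairs.mapValues(lambda v: v * 10).collect()      # [('a', 10), ('b', 20), ('a', 30)]
pairs.groupByKey().mapValues(list).collect()     # [('a', [1, 3]), ('b', [2])]
pairs.sortByKey().keys().collect()               # ['a', 'a', 'b']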
Running on a Cluster
The driver runs in its own Java process and each executor is a Java process.
A driver and its executors are together termed a Spark application.
A Spark application is launched on a set of machines using an external service
called a cluster manager.
Driver program main duties:
o a. converting the user program into tasks
o b. scheduling tasks on executors
Executor main duties:
o a. running the tasks
o b. providing in-memory storage for RDDs
Spark's Driver & Executors vs. YARN's Master & Workers
o For instance, Apache YARN runs a master daemon (called the Resource Manager) and several worker daemons (called Node Managers).
o Spark runs both drivers and executors on YARN worker nodes.
spark2-submit option types:
o The first is the location of the cluster manager along with an amount of
resources you’d like to request for your job (as shown above).
o The second is information about the runtime dependencies of your
application, such as libraries or files you want to be present on all
worker machines.
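A sketch of a spark2-submit call showing both kinds of options (all resource values and file names are placeholders):
# cluster manager location and requested resources, plus dependencies shipped to every worker
spark2-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-memory 4G \
  --executor-cores 2 \
  --py-files deps.zip \
  my_app.py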
Then, using this series of steps called the execution plan, the scheduler
computes the missing partitions for each stage until it computes the whole
RDD.
The five main properties that define an RDD:
partitions() - the set of partition objects that make up the RDD
iterator(p, parentIters) - computes the elements of partition p, given iterators for its parent partitions
dependencies() - the RDD's dependencies on parent RDDs
partitioner() - the partitioner (e.g. hash or range), if the RDD has a known one
preferredLocations(p) - locality hints for computing partition p (e.g. HDFS block locations)
static allocation - a fixed set of executors is requested up front and held for the lifetime of the application
dynamic allocation - executors are added and removed based on the workload (spark.dynamicAllocation.enabled)
jobs
o highest element of Spark’s execution hierarchy.
o Each Spark job corresponds to one action
stages
o As mentioned above, a job is defined by calling an action.
o The job may include several transformations, which Spark breaks down into stages (typically at shuffle boundaries).
Spark SQL's column operators are defined on the Column class, so a filter containing the expression 0 >= df.col("friends") will not compile, since Scala will use the >= defined on 0. Instead you would write df.col("friends") <= 0 or convert 0 to a column literal with lit.
Transformations : types
o filters
o sql standard functions
o 'when' - for if/then/else logic (see the sketch after this list)
o Specialized DataFrame Transformations for Missing & Noisy Data
o Beyond Row-by-Row Transformations
o Aggregates and groupBy - agg API
o windowing
o sorting - orderBy
o Multi DataFrame Transformations
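A small sketch of the 'when' transformation from the list above (the column and labels are made up):
from pyspark.sql import functions as f

df2 = df.withColumn("age_group", f.when(f.col("age") < 18, "minor").otherwise("adult"))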
Tungsten
Query Optimizer
In order to join data, Spark needs the data that is to be joined to live on the
same partition.
The default implementation of join in Spark is a shuffled hash join.
A shuffle could be avoided if:
o a. Both RDDs have a known partitioner.
o b. One of the datasets is small enough to fit in memory, in which case it can be broadcast to every node (a broadcast hash join).
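For the second case, a minimal sketch of hinting a broadcast join with the DataFrame API (the DataFrames and key name are placeholders):
from pyspark.sql.functions import broadcast

# the small side is shipped to every executor, so the large side is never shuffled
joined = large_df.join(broadcast(small_df), "key")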
.saveAsTable("table1"): for file-based data sources (e.g. text, parquet, json), you can specify a custom table path via the path option. When the table is dropped, the custom table path will not be removed and the table data is still there.
sdf.write.parquet("DIR_LOCATION")
sdf.write.save("FILE_LOCATION.parquet")
df.write \
    .partitionBy("favorite_color") \
    .bucketBy(42, "name") \
    .saveAsTable("people_partitioned_bucketed")
Schema Merging
Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution.
Users can start with a simple schema, and gradually add more columns to the
schema as needed.
In this way, users may end up with multiple Parquet files with different but
mutually compatible schemas.
The Parquet data source is now able to automatically detect this case and
merge schemas of all these files.
spark.read.option("mergeSchema", "true").parquet("FOLDER_LOCATION")
Parquet Files
HIVE vs Parquet
Pandas in spark
import pandas as pd
from pyspark.sql.functions import pandas_udf, col
from pyspark.sql.types import LongType

x = pd.Series([1, 2, 3, 4])
multi_fun = lambda a, b: a * b  # assumed here: a simple element-wise multiply
multi = pandas_udf(multi_fun, returnType=LongType())
sdf = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))
sdf.select(multi(col("x"), col("x"))).show()
Grouped map Pandas UDFs are used with groupBy().apply() which implements
the “split-apply-combine” pattern.
Split-apply-combine consists of three steps:
o Split the data into groups by using DataFrame.groupBy.
o Apply a function on each group. The input data contains all the rows
and columns for each group.
o Combine the results into a new DataFrame.
sdf_grp = spark.createDataFrame([(1,10),(2,10),(3,30)],("id","v"))
sdf_grp.groupBy("id").apply(fun_1).show()
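fun_1 used above is not defined in these notes; a minimal sketch, assuming a grouped-map UDF that mean-centres v within each group:
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def fun_1(pdf):
    # pdf is a pandas DataFrame holding all rows of one group
    return pdf.assign(v=pdf.v - pdf.v.mean())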
NaN
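No notes were captured for this topic; a brief sketch of the usual null/NaN handling calls (column names are placeholders):
from pyspark.sql import functions as f

df.na.drop()                       # drop rows containing nulls
df.na.fill(0, subset=["col1"])     # replace nulls in col1 with 0
df.filter(~f.isnan("col1"))        # filter out NaN values (floating-point columns)
df.filter(f.col("col1").isNull())  # rows where col1 is null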
http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html
lit()
Creates a Column of literal value
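A small illustration (the column name and value are made up; f is pyspark.sql.functions):
df.withColumn("source", f.lit("batch"))  # add a constant-valued column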
monotonically_increasing_id()
A column that generates monotonically increasing 64-bit integers: the values are increasing and unique, but not consecutive.
df.withColumn("new_id", f.monotonically_increasing_id())
SparkSession.table()
Returns the specified table as a DataFrame.
expr()
Parses the expression string into the column that it represents
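Small illustrations of both (the table and column names are made up; f is pyspark.sql.functions):
people = spark.table("people")                  # an existing table/view as a DataFrame
df.select(f.expr("col1 + 1 AS col1_plus_one"))  # SQL expression string parsed into a Column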
JOIN
http://www.learnbymarketing.com/1100/pyspark-joins-by-example/
https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html
https://spark.apache.org/docs/2.3.0/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.join
df_res = df_one.join(df_two, df_one.col1 == df_two.col1, "left")
df_res = df_one.join(other=df_two, on=["col1"], how="left")
df_res = df_one.alias("a").join(df_two.alias("b"), col("a.col1") == col("b.col1"), "left")
distinct()
https://stackoverflow.com/questions/30959955/how-does-distinct-function-work-in-spark
dataFrame.checkpoint
https://dzone.com/articles/what-are-spark-checkpoints-on-dataframes
https://stackoverflow.com/questions/35127720/what-is-the-difference-between-spark-checkpoint-and-persist-to-a-disk
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # required before checkpointing; the path is an example
df_chk = df.checkpoint()  # truncates the lineage and returns a new DataFrame
pyspark.sql.window
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
from pyspark.sql import Window
from pyspark.sql import functions as f

# define the partitioning/ordering first, then the frame (rowsBetween or rangeBetween, not both)
window_condn = Window \
    .partitionBy(df.col_date) \
    .orderBy(df.col_date) \
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df_new = df.withColumn('col1', f.sum('col2').over(window_condn))
pivot (called on GroupedData, i.e. df.groupBy(...).pivot(...))
https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
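A minimal sketch (the column names follow the linked blog post and are assumptions here):
# long-to-wide reshape: one output column per distinct "course" value
df.groupBy("year").pivot("course").sum("earnings")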
f.explode(col)
Returns a new row for each element in the given array or map.
new_df = df.select(f.explode(df.col1))
df.groupBy(*cols)
Groups the DataFrame using the specified columns, so we can run aggregation on
them. See GroupedData for all the available aggregate functions.
df.groupBy(df.col1).count().collect()
df.agg(*expr)
Aggregate on the entire DataFrame without groups (shorthand for df.groupBy.agg()).
df.agg(f.max(df.col1))  # aggregate over the whole DataFrame
df.groupBy(df.col1).agg(f.max(df.col2))  # or per group
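The collect_set aggregate gathers the distinct values of a column into an array per group; a call like the following (assumed here, with f as pyspark.sql.functions) produces output of the shape shown below:
df.groupBy("col1").agg(f.collect_set("col2")).show()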
+----+-----------------+
|col1|collect_set(col2)|
+----+-----------------+
|   1|           [2, 3]|
|   5|           [7, 9]|
+----+-----------------+
pyspark.sql.types.ArrayType
df.filter()
df_new = df.filter(col("col1").isNotNull() & (col("col1") > 100))
f.first(col)
Aggregate function: returns the first value in a group.
df.distinct()
df.select("abc").distinct()
sdf.drop("new_col")
eqNullSafe
sdf_1.join(sdf_2,sdf_1.col_1.eqNullSafe(sdf_2.col_1))
Disclaimer: All data and information provided on this site is for informational
purposes only. This site makes no representations as to accuracy, completeness,
correctness, suitability, or validity of any information on this site & will not be
liable for any errors, omissions, or delays in this information or any losses,
injuries, or damages arising from its display or use. All information is provided on
an as-is basis.