Matthew Powers
This book is for sale at http://leanpub.com/beautiful-spark
This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing
process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and
many iterations to get reader feedback, pivot until you have the right book and build traction once
you do.
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Typical painful workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Productionalizing advanced analytics models is hard . . . . . . . . . . . . . . . . . . . . . . . 2
Why Scala? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Who should read this book? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Is this book for data engineers or data scientists? . . . . . . . . . . . . . . . . . . . . . . . . . 3
Beautiful Spark philosophy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
DataFrames vs. RDDs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Spark streaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
The “coalesce test” for evaluating learning resources . . . . . . . . . . . . . . . . . . . . . . . 4
Will we cover the entire Spark SQL API? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
How this book is organized . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Spark programming levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Note about Spark versions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Databricks Community . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Creating a notebook and cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Running some code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Next steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Introduction to DataFrames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Creating DataFrames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Adding columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Filtering rows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
More on schemas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Creating DataFrames with createDataFrame() . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Next Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Column Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
A simple example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Instantiating Column objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
gt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
substr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
+ operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
lit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
isNull . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
isNotNull . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
when / otherwise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Next steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
singleSpace() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
removeAllWhitespace() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Why Scala?
Spark offers Scala, Python, Java, and R APIs. This book covers only the Scala API.
The best practices for each language are quite different. Entire chapters of this book are irrelevant
for SparkR and PySpark users.
The best Spark API for an organization depends on the team’s makeup - a group with lots of Python
experience should probably use the PySpark API.
Email me if you’d like a book on writing beautiful PySpark or SparkR code and I’ll take it into
consideration.
Scala is great for Spark for a variety of reasons.
DataFrames vs. RDDs
You shouldn’t use RDD API unless you have a specific optimization that requires you to operate at
a lower level (or if you’re forced to work with Spark 1). Most users will never need to use the RDD
API.
It’s best to master the DataFrame API before thinking about RDDs.
Spark streaming
Lots of analyses can be performed in batch mode, so streaming isn’t relevant for all Spark users.
While Spark streaming is important for users that need to perform analyses in real time, it’s
important to learn the material in this book before diving into the streaming API. Streaming is
complex. Testing streaming applications is hard. You’ll struggle with streaming if you don’t have a
solid understanding of the basics.
Accordingly, this book does not cover streaming.
Machine learning
Advanced Analytics with Spark¹ is a great book on building Spark machine learning models with
the Scala API.
You should read this book first and then read Advanced Analytics with Spark if you’re interested in
building machine learning models with Spark.
• Spark fundamentals
• Building libraries and applications
• Practical job tuning
Running Spark Locally
Download Spark² and run the spark-shell command to start the Spark console:
1 bash ~/Documents/spark/spark-2.3.0-bin-hadoop2.7/bin/spark-shell
1 scala> 2 + 3
2 res0: Int = 5
The “Spark console” is really just a Scala console that preloads all of the Spark libraries.
²https://spark.apache.org/downloads.html
1 val df = spark.read.option("header", "true").csv("/Users/powers/Documents/tmp/data/silly_file.csv")
1 df.show()
2 // +-------+--------------+
3 // | person| silly_level|
4 // +-------+--------------+
5 // | a| 10 |
6 // | b| 5 |
7 // +-------+--------------+
Console commands
The :quit command stops the console.
The :paste command lets the user add multiple lines of code at once. Here's an example:
1 scala> :paste
2 // Entering paste mode (ctrl-D to finish)
3
4 val y = 5
5 val x = 10
6 x + y
7
8 // Exiting paste mode, now interpreting.
9
10 y: Int = 5
11 x: Int = 10
12 res8: Int = 15
Always use the :paste command when copying examples from this book into your console!
The :help command lists all the available console commands. Here’s a full list of all the console
commands:
1 scala> :help
2 All commands can be abbreviated, e.g., :he instead of :help.
3 :edit <id>|<line> edit history
4 :help [command] print this summary or command-specific help
5 :history [num] show the history (optional num is commands to show)
6 :h? <string> search the history
7 :imports [name name ...] show import history, identifying sources of names
8 :implicits [-v] show the implicits in scope
9 :javap <path|class> disassemble a file or class name
10 :line <id>|<line> place line(s) at the end of history
11 :load <path> interpret lines in a file
12 :paste [-raw] [path] enter paste mode or paste a file
13 :power enable power user mode
14 :quit exit the interpreter
15 :replay [options] reset the repl and replay all previous commands
16 :require <path> add a jar to the classpath
17 :reset [options] reset the repl to its initial state, forgetting all session\
18 entries
19 :save <path> save replayable session to a file
20 :sh <command line> run a shell command (result is implicitly => List[String])
21 :settings <options> update compiler options, if possible; see reset
22 :silent disable/enable automatic printing of results
23 :type [-v] <expr> display the type of an expression without evaluating it
24 :kind [-v] <expr> display the kind of expression's type
25 :warnings show the suppressed warnings from the most recent line whic\
26 h had any
This Stackoverflow answer³ contains a good description of the available console commands.
³https://stackoverflow.com/a/32808382/1125159
Databricks Community
Databricks provides a wonderful browser-based interface for running Spark code. You can skip this
chapter if you’re happy running Spark code locally in your console, but I recommend trying out
both workflows (the Spark console and Databricks) and seeing which one you prefer.
Databricks Sign in
Click Shared
Create cluster
Write a name for the cluster and then click the Create Cluster button.
Attach cluster
Now let’s demonstrate that we can access the SparkSession via the spark variable.
Next steps
You’re now able to run Spark code in the browser.
Let’s start writing some real code!
Introduction to DataFrames
Spark DataFrames are similar to tables in relational databases. They store data in columns and rows
and support a variety of operations to manipulate the data.
Here’s an example of a DataFrame that contains information about cities.
This chapter will discuss creating DataFrames, defining schemas, adding columns, and filtering rows.
Creating DataFrames
You can import the spark implicits library and create a DataFrame with the toDF() method.
1 import spark.implicits._
2
3 val df = Seq(
4 ("Boston", "USA", 0.67),
5 ("Dubai", "UAE", 3.1),
6 ("Cordoba", "Argentina", 1.39)
7 ).toDF("city", "country", "population")
Run this code in the Spark console by running the :paste command, pasting the code snippet, and
then pressing ctrl-D.
Run this code in the Databricks browser notebook by pasting the code in a cell and clicking run cell.
You can view the contents of a DataFrame with the show() method.
1 df.show()
1 +-------+---------+----------+
2 | city| country|population|
3 +-------+---------+----------+
4 | Boston| USA| 0.67|
5 | Dubai| UAE| 3.1|
6 |Cordoba|Argentina| 1.39|
7 +-------+---------+----------+
Each DataFrame column has name, dataType and nullable properties. The column can contain null
values if the nullable property is set to true.
The printSchema() method provides an easily readable view of the DataFrame schema.
1 df.printSchema()
1 root
2 |-- city: string (nullable = true)
3 |-- country: string (nullable = true)
4 |-- population: double (nullable = false)
Adding columns
Columns can be added to a DataFrame with the withColumn() method.
Let’s add an is_big_city column to the DataFrame that returns true if the city contains more than
one million people.
1 import org.apache.spark.sql.functions.col
2
3 val df2 = df.withColumn("is_big_city", col("population") > 1)
4 df2.show()
1 +-------+---------+----------+-----------+
2 | city| country|population|is_big_city|
3 +-------+---------+----------+-----------+
4 | Boston| USA| 0.67| false|
5 | Dubai| UAE| 3.1| true|
6 |Cordoba|Argentina| 1.39| true|
7 +-------+---------+----------+-----------+
DataFrames are immutable, so the withColumn() method returns a new DataFrame. withColumn()
does not mutate the original DataFrame. Let’s confirm that df is still the same with df.show().
1 +-------+---------+----------+
2 | city| country|population|
3 +-------+---------+----------+
4 | Boston| USA| 0.67|
5 | Dubai| UAE| 3.1|
6 |Cordoba|Argentina| 1.39|
7 +-------+---------+----------+
df does not contain the is_big_city column, so we’ve confirmed that withColumn() did not mutate
df.
Filtering rows
The filter() method removes rows from a DataFrame.
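The one-line version of this filter looks like this (matching the multi-line version shown below):

df.filter(col("population") > 1).show()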
1 +-------+---------+----------+
2 | city| country|population|
3 +-------+---------+----------+
4 | Dubai| UAE| 3.1|
5 |Cordoba|Argentina| 1.39|
6 +-------+---------+----------+
It’s a little hard to read code with multiple method calls on the same line, so let’s break this code up
on multiple lines.
1 df
2 .filter(col("population") > 1)
3 .show()
We can also assign the filtered DataFrame to a separate variable rather than chaining method calls.
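For example (with a hypothetical bigCities variable):

val bigCities = df.filter(col("population") > 1)
bigCities.show()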
More on schemas
Once again, the DataFrame schema can be pretty printed to the console with the printSchema()
method. The schema method returns a code representation of the DataFrame schema.
1 df.schema
1 StructType(
2 StructField(city, StringType, true),
3 StructField(country, StringType, true),
4 StructField(population, DoubleType, false)
5 )
Each column of a Spark DataFrame is modeled as a StructField object with name, dataType, and
nullable properties. The entire DataFrame schema is modeled as a StructType, which is a collection
of StructField objects.
Let’s create a schema for a DataFrame that has first_name and age columns.
1 import org.apache.spark.sql.types._
2
3 StructType(
4 Seq(
5 StructField("first_name", StringType, true),
6 StructField("age", DoubleType, true)
7 )
8 )
Spark’s programming interface makes it easy to define the exact schema you’d like for your
DataFrames.
1 import org.apache.spark.sql.types._
2 import org.apache.spark.sql.Row
3
4 val animalData = Seq(
5 Row(30, "bat"),
6 Row(2, "mouse"),
7 Row(25, "horse")
8 )
9
10 val animalSchema = List(
11 StructField("average_lifespan", IntegerType, true),
12 StructField("animal_type", StringType, true)
13 )
14
15 val animalDF = spark.createDataFrame(
16 spark.sparkContext.parallelize(animalData),
17 StructType(animalSchema)
18 )
19
20 animalDF.show()
1 +----------------+-----------+
2 |average_lifespan|animal_type|
3 +----------------+-----------+
4 | 30| bat|
5 | 2| mouse|
6 | 25| horse|
7 +----------------+-----------+
We can use the animalDF.printSchema() method to confirm that the schema was created as
specified.
1 root
2 |-- average_lifespan: integer (nullable = true)
3 |-- animal_type: string (nullable = true)
Next Steps
DataFrames are the fundamental building blocks of Spark. All machine learning and streaming
analyses are built on top of the DataFrame API.
Now let’s look at how to build functions to manipulate DataFrames.
Working with CSV files
CSV files are great for learning Spark.
When building big data systems, you'll generally want to use a more sophisticated file format like
Parquet or Avro, but we'll use CSVs in this book because they're human readable.
Once you learn how to use CSV files, it’s easy to use other file formats.
Later chapters in the book will cover CSV and other file formats in more detail.
1 cat_name,cat_age
2 fluffy,4
3 spot,3
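Suppose this data is saved as a CSV file. Here's a sketch of reading it into a DataFrame, assuming a hypothetical local path and treating the first row as a header:

val df = spark.read
  .option("header", "true")
  .csv("/Users/powers/Documents/cat_data.csv") // hypothetical path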
1 df.show()
2
3 +--------+-------+
4 |cat_name|cat_age|
5 +--------+-------+
6 | fluffy| 4|
7 | spot| 3|
8 +--------+-------+
1 df.printSchema()
2
3 root
4 |-- cat_name: string (nullable = true)
5 |-- cat_age: string (nullable = true)
1 import org.apache.spark.sql.functions.lit
2
3 df
4 .withColumn("speak", lit("meow"))
5 .write
6 .csv("/Users/powers/Documents/cat_output1")
The cat_output1 folder contains the following files after the data is written:
1 cat_output1/
2 _SUCCESS
3 part-00000-db62f6a7-4efe-4396-9fbb-4caa6aced93e-c000.csv
In this small example, Spark wrote only one file. Spark typically writes out many files in parallel.
We’ll revisit writing files in detail after the chapter on memory partitioning.
Once you upload the file, Databricks will show you the file path that can be used to access the data.
Let’s read this uploaded CSV file into a DataFrame and then display the contents.
Just Enough Scala for Spark Programmers
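Here's a plain Scala function that adds two numbers (a minimal sketch matching the usage shown below):

def sum(num1: Int, num2: Int): Int = {
  num1 + num2
}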
1 sum(10, 5) // returns 15
Let’s write a Spark SQL function that adds two numbers together:
1 import org.apache.spark.sql.Column
2
3 def sumColumns(num1: Column, num2: Column): Column = {
4 num1 + num2
5 }
Let’s create a DataFrame in the Spark shell and run the sumColumns() function.
1 +--------+-----------+-------+
2 |some_num|another_num|the_sum|
3 +--------+-----------+-------+
4 | 10| 4| 14|
5 | 3| 4| 7|
6 | 8| 4| 12|
7 +--------+-----------+-------+
Spark SQL functions take org.apache.spark.sql.Column arguments whereas vanilla Scala functions
take native Scala data type arguments like Int or String.
Currying functions
Scala allows for functions to take multiple parameter lists, which is formally known as currying. This
section explains how to use currying with vanilla Scala functions and why currying is important for
Spark programmers.
Spark has a Dataset#transform() method that makes it easy to chain DataFrame transformations.
Here’s an example of a DataFrame transformation function:
1 import org.apache.spark.sql.DataFrame
2
3 def withCat(name: String)(df: DataFrame): DataFrame = {
4 df.withColumn("cat", lit(s"$name meow"))
5 }
DataFrame transformation functions can take an arbitrary number of arguments in the first
parameter list and must take a single DataFrame argument in the second parameter list.
Let’s create a DataFrame in the Spark shell and run the withCat() function.
1 +-----+----------+
2 |thing| cat|
3 +-----+----------+
4 |chair|darla meow|
5 | hair|darla meow|
6 | bear|darla meow|
7 +-----+----------+
Most Spark code can be organized as Spark SQL functions or as custom DataFrame transformations.
object
Spark functions can be stored in objects.
Let’s create a SomethingWeird object that defines a vanilla Scala function, a Spark SQL function, and
a custom DataFrame transformation.
1 import org.apache.spark.sql.functions._
2 import org.apache.spark.sql.{Column, DataFrame}
3
4 object SomethingWeird {
5
6 // vanilla Scala function
7 def hi(): String = {
8 "welcome to planet earth"
9 }
10
11 // Spark SQL function
12 def trimUpper(col: Column) = {
13 trim(upper(col))
14 }
15
16 // custom DataFrame transformation
17 def withScary()(df: DataFrame): DataFrame = {
18 df.withColumn("scary", lit("boo!"))
19 }
20
21 }
Let’s create a DataFrame in the Spark shell and run the trimUpper() and withScary() functions.
1 +-----+---------------+-----+
2 | word|trim_upper_word|scary|
3 +-----+---------------+-----+
4 | niCE| NICE| boo!|
5 | CaR| CAR| boo!|
6 |BAR | BAR| boo!|
7 +-----+---------------+-----+
trait
Traits can be mixed into objects to add commonly used methods or values. We can define
a SparkSessionWrapper trait that defines a spark variable to give objects easy access to the
SparkSession object.
1 import org.apache.spark.sql.SparkSession
2
3 trait SparkSessionWrapper extends Serializable {
4
5 lazy val spark: SparkSession = {
6 SparkSession.builder().master("local").appName("spark session").getOrCreate()
7 }
8
9 }
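For example, a hypothetical object can mix in the trait and use spark directly:

object Transformations extends SparkSessionWrapper {

  def numbersDF() = {
    import spark.implicits._ // the spark value comes from the mixed-in trait
    Seq(1, 2, 3).toDF("number")
  }

}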
package
Packages are used to namespace Scala code. Per the Databricks Scala style guide⁵, packages should
follow Java naming conventions.
For example, the Databricks spark-redshift⁶ project uses the com.databricks.spark.redshift
namespace.
The Spark project uses the org.apache.spark namespace. spark-daria⁷ uses the com.github.mrpowers.spark.daria namespace.
Here's an example of code that's defined in a package in spark-daria:
1 package com.github.mrpowers.spark.daria.sql
2
3 import org.apache.spark.sql.Column
4 import org.apache.spark.sql.functions._
5
6 object functions {
7
8 def singleSpace(col: Column): Column = {
9 trim(regexp_replace(col, " +", " "))
10 }
11
12 }
The package structure should mimic the file structure of the project.
Implicit classes
Implicit classes can be used to extend Spark core classes with additional methods.
Let’s add a lower() method to the Column class that converts all the strings in a column to lower
case.
⁵https://github.com/databricks/scala-style-guide#naming-convention
⁶https://github.com/databricks/spark-redshift
⁷https://github.com/MrPowers/spark-daria
1 package com.github.mrpowers.spark.daria.sql
2
3 import org.apache.spark.sql.Column
4
5 object FunctionsAsColumnExt {
6
7 implicit class ColumnMethods(col: Column) {
8
9 def lower(): Column = {
10 org.apache.spark.sql.functions.lower(col)
11 }
12
13 }
14
15 }
1 col("some_string").lower()
Implicit classes should be avoided in general. I only monkey patch core classes in the spark-daria⁸
project. Feel free to send pull requests if you have any good ideas for other extensions.
Next steps
There are a couple of other Scala features that are useful when writing Spark code, but this chapter
covers 90%+ of common use cases.
You don’t need to understand functional programming or advanced Scala language features to be a
productive Spark programmer.
In fact, staying away from UDFs and native Scala code is a best practice.
Focus on mastering the native Spark API and you’ll be a productive big data engineer in no time!
⁸https://github.com/MrPowers/spark-daria/
Column Methods
The Spark Column class⁹ defines a variety of column methods for manipulating DataFrames.
This chapter demonstrates how to instantiate Column objects and how to use the most important
Column methods.
A simple example
Let’s create a DataFrame with superheros and their city of origin.
1 val df = Seq(
2 ("thor", "new york"),
3 ("aquaman", "atlantis"),
4 ("wolverine", "new york")
5 ).toDF("superhero", "city")
Let’s use the startsWith() column method to identify all cities that start with the word new:
1 df
2 .withColumn("city_starts_with_new", $"city".startsWith("new"))
3 .show()
1 +---------+--------+--------------------+
2 |superhero| city|city_starts_with_new|
3 +---------+--------+--------------------+
4 | thor|new york| true|
5 | aquaman|atlantis| false|
6 |wolverine|new york| true|
7 +---------+--------+--------------------+
The $"city" part of the code creates a Column object. Let’s look at all the different ways to create
Column objects.
⁹http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column
1. $"city"
2. df("city")
3. col("city") (must run import org.apache.spark.sql.functions.col first)
Column objects are commonly passed as arguments to SQL functions (e.g. upper($"city")).
We will create column objects in all the examples that follow.
gt
Let’s create a DataFrame with an integer column so we can run some numerical column methods.
1 val df = Seq(
2 (10, "cat"),
3 (4, "dog"),
4 (7, null)
5 ).toDF("num", "word")
Let’s use the gt() method (stands for greater than) to identify all rows with a num greater than five.
1 df
2 .withColumn("num_gt_5", col("num").gt(5))
3 .show()
1 +---+----+--------+
2 |num|word|num_gt_5|
3 +---+----+--------+
4 | 10| cat| true|
5 | 4| dog| false|
6 | 7|null| true|
7 +---+----+--------+
Scala methods can be invoked without dot notation, so this code works as well:
1 df
2 .withColumn("num_gt_5", col("num") gt 5)
3 .show()
We can also use the > operator to perform “greater than” comparisons:
1 df
2 .withColumn("num_gt_5", col("num") > 5)
3 .show()
substr
Let’s use the substr() method to create a new column with the first two letters of the word column.
1 df
2 .withColumn("word_first_two", col("word").substr(0, 2))
3 .show()
1 +---+----+--------------+
2 |num|word|word_first_two|
3 +---+----+--------------+
4 | 10| cat| ca|
5 | 4| dog| do|
6 | 7|null| null|
7 +---+----+--------------+
Notice that the substr() method returns null when it’s supplied null as input. All other Column
methods and SQL functions behave similarly (i.e. they return null when the input is null).
Your functions should handle null input gracefully and return null when they’re supplied null as
input.
+ operator
Let’s use the + operator to add five to the num column.
1 df
2 .withColumn("num_plus_five", col("num").+(5))
3 .show()
1 +---+----+-------------+
2 |num|word|num_plus_five|
3 +---+----+-------------+
4 | 10| cat| 15|
5 | 4| dog| 9|
6 | 7|null| 12|
7 +---+----+-------------+
We can also skip the dot notation when invoking the function.
1 df
2 .withColumn("num_plus_five", col("num") + 5)
3 .show()
The syntactic sugar makes it harder to see that + is a method defined in the Column class. Take a
look at the docs¹⁰ to convince yourself that the + operator is defined in the Column class!
lit
Let’s use the / method to take two divided by the num column.
1 df
2 .withColumn("two_divided_by_num", lit(2) / col("num"))
3 .show()
1 +---+----+------------------+
2 |num|word|two_divided_by_num|
3 +---+----+------------------+
4 | 10| cat| 0.2|
5 | 4| dog| 0.5|
6 | 7|null|0.2857142857142857|
7 +---+----+------------------+
Notice that the lit() function must be used to convert two into a Column object before the division
can take place.
¹⁰http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column
If we skip the lit() call, the code won't compile:
1 df
2 .withColumn("two_divided_by_num", 2 / col("num"))
3 .show()
The / method is defined in both the Scala Int and Spark Column classes. We need to convert the
number to a Column object, so the compiler knows to use the / method defined in the Spark Column
class. Upon analyzing the error message, we can see that the compiler is mistakenly trying to use
the / operator defined in the Scala Int class.
isNull
Let’s use the isNull method to identify when the word column is null.
1 df
2 .withColumn("word_is_null", col("word").isNull)
3 .show()
1 +---+----+------------+
2 |num|word|word_is_null|
3 +---+----+------------+
4 | 10| cat| false|
5 | 4| dog| false|
6 | 7|null| true|
7 +---+----+------------+
isNotNull
Let’s use the isNotNull method to filter out all rows with a word of null.
1 df
2 .where(col("word").isNotNull)
3 .show()
1 +---+----+
2 |num|word|
3 +---+----+
4 | 10| cat|
5 | 4| dog|
6 +---+----+
when / otherwise
Let’s create a final DataFrame with word1 and word2 columns, so we can play around with the ===,
when(), and otherwise() methods.
1 val df = Seq(
2 ("bat", "bat"),
3 ("snake", "rat"),
4 ("cup", "phone"),
5 ("key", null)
6 ).toDF("word1", "word2")
Let’s write a little word comparison algorithm that analyzes the differences between the two words.
1 import org.apache.spark.sql.functions._
2
3 df
4 .withColumn(
5 "word_comparison",
6 when($"word1" === $"word2", "same words")
7 .when(length($"word1") > length($"word2"), "word1 is longer")
8 .otherwise("i am confused")
9 ).show()
1 +-----+-----+---------------+
2 |word1|word2|word_comparison|
3 +-----+-----+---------------+
4 | bat| bat| same words|
5 |snake| rat|word1 is longer|
6 | cup|phone| i am confused|
7 | key| null| i am confused|
8 +-----+-----+---------------+
when() and otherwise() are how to write if / else if / else logic in Spark.
Next steps
You will use Column methods all the time when writing Spark code.
If you don't have a solid object-oriented programming background, it can be hard to identify which methods are defined in the Column class and which methods are defined in the org.apache.spark.sql.functions package.
Scala lets you skip dot notation when invoking methods, which makes it extra difficult to spot which methods are Column methods.
In later chapters, we’ll discuss chaining column methods and extending the Column class.
Column methods will be used extensively throughout the rest of this book.
Introduction to Spark SQL functions
This chapter shows you how to use Spark SQL functions and how to build your own SQL functions.
Spark SQL functions are key for almost all analyses.
1 import org.apache.spark.sql.functions._
2
3 val df = Seq(2, 3, 4).toDF("number")
4
5 df
6 .withColumn("number_factorial", factorial(col("number")))
7 .show()
1 +------+----------------+
2 |number|number_factorial|
3 +------+----------------+
4 | 2| 2|
5 | 3| 6|
6 | 4| 24|
7 +------+----------------+
The factorial() function takes a single Column argument. The col() function, also defined in the
org.apache.spark.sql.functions object, returns a Column object based on the column name.
If Spark implicits are imported (i.e. you’ve run import spark.implicits._), then you can also create
a Column object with the $ operator. This code also works.
¹¹http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
1 import org.apache.spark.sql.functions._
2 import spark.implicits._
3
4 val df = Seq(2, 3, 4).toDF("number")
5
6 df
7 .withColumn("number_factorial", factorial($"number"))
8 .show()
The rest of this chapter focuses on the most important SQL functions that’ll be used in most analyses.
lit() function
The lit() function creates a Column object out of a literal value. Let’s create a DataFrame and use
the lit() function to append a spanish_hi column to the DataFrame.
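The code is presumably along these lines:

import org.apache.spark.sql.functions.lit

val df = Seq(
  "sophia",
  "sol",
  "perro"
).toDF("word")

df
  .withColumn("spanish_hi", lit("hola"))
  .show()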
1 +------+----------+
2 | word|spanish_hi|
3 +------+----------+
4 |sophia| hola|
5 | sol| hola|
6 | perro| hola|
7 +------+----------+
Let’s create a DataFrame of countries and use some when() statements to append a country column.
1 +------------+-------------+
2 | word| continent|
3 +------------+-------------+
4 | china| asia|
5 | canada|north america|
6 | italy| europe|
7 |tralfamadore| not sure|
8 +------------+-------------+
Spark sometimes lets you skip the lit() method calls and express code more compactly.
1 df
2 .withColumn(
3 "continent",
4 when(col("word") === "china", "asia")
5 .when(col("word") === "canada", "north america")
6 .when(col("word") === "italy", "europe")
7 .otherwise("not sure")
8 )
9 .show()
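The same pattern can bucket numbers. Here's a sketch that builds an age DataFrame and appends a life_stage column (matching the output below):

val df = Seq(10, 15, 25).toDF("age")

df
  .withColumn(
    "life_stage",
    when(col("age") < 13, "child")
      .when(col("age") >= 13 && col("age") <= 18, "teenager")
      .otherwise("adult")
  )
  .show()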
1 +---+----------+
2 |age|life_stage|
3 +---+----------+
4 | 10| child|
5 | 15| teenager|
6 | 25| adult|
7 +---+----------+
The when method is defined in both the Column class and the functions object. Whenever you see
when() that's not preceded with a dot, it's the when from the functions object. .when() comes from
the Column class.
1 import org.apache.spark.sql.Column
2
3 def lifeStage(col: Column): Column = {
4 when(col < 13, "child")
5 .when(col >= 13 && col <= 18, "teenager")
6 .when(col > 18, "adult")
7 }
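Applying the function produces the same result (reusing the age DataFrame from above):

df
  .withColumn("life_stage", lifeStage(col("age")))
  .show()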
1 +---+----------+
2 |age|life_stage|
3 +---+----------+
4 | 10| child|
5 | 15| teenager|
6 | 25| adult|
7 +---+----------+
Let’s create another function that trims all whitespace and capitalizes all of the characters in a string.
1 import org.apache.spark.sql.Column
2
3 def trimUpper(col: Column): Column = {
4 trim(upper(col))
5 }
1 val df = Seq(
2 " some stuff",
3 "like CHEESE "
4 ).toDF("weird")
5
6 df
7 .withColumn(
8 "cleaned",
9 trimUpper(col("weird"))
10 )
11 .show()
1 +----------------+-----------+
2 | weird| cleaned|
3 +----------------+-----------+
4 | some stuff| SOME STUFF|
5 |like CHEESE |LIKE CHEESE|
6 +----------------+-----------+
Custom SQL functions can typically be used instead of UDFs. Avoiding UDFs is a great way to write
better Spark code.
Next steps
Spark SQL functions are preferable to UDFs because they handle the null case gracefully (without
a lot of code) and because they are not a black box¹².
Most Spark analyses can be run by leveraging the standard library and reverting to custom SQL
functions when necessary. Avoid UDFs at all costs!
¹²https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs-blackbox.html
User Defined Functions (UDFs)
Spark let’s you define custom SQL functions called user defined functions (UDFs). UDFs are great
when built-in SQL functions aren’t sufficient, but should be used sparingly because they’re not
performant.
This chapter will demonstrate how to define UDFs and will show how to avoid UDFs, when possible,
by leveraging native Spark functions.
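The example below assumes a UDF and a source DataFrame roughly like this (a sketch; the exact strings are hypothetical):

import org.apache.spark.sql.functions.{col, udf}

// lowercases a string and strips all of its whitespace
def lowerRemoveAllWhitespace(s: String): String = {
  s.toLowerCase().replaceAll("\\s", "")
}

val lowerRemoveAllWhitespaceUDF = udf[String, String](lowerRemoveAllWhitespace)

val sourceDF = Seq(
  "  HI THERE     ",
  " GivE mE PresenTS     "
).toDF("aaa")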
1 sourceDF.select(
2 lowerRemoveAllWhitespaceUDF(col("aaa")).as("clean_aaa")
3 ).show()
4
5 +--------------+
6 | clean_aaa|
7 +--------------+
8 | hithere|
9 |givemepresents|
10 +--------------+
This code will unfortunately error out if the DataFrame column contains a null value.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 3.0 failed 1 times,
most recent failure: Lost task 2.0 in stage 3.0 (TID 7, localhost, executor driver): org.apache.spark.SparkException:
Failed to execute user defined function(anonfun$2: (string) ⇒ string)
Caused by: java.lang.NullPointerException
Cause: org.apache.spark.SparkException: Failed to execute user defined function(anonfun$2: (string)
⇒ string)
Cause: java.lang.NullPointerException
Let’s write a lowerRemoveAllWhitespaceUDF function that won’t error out when the DataFrame
contains null values.
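A sketch of a null-safe version, along with a DataFrame that contains a null value:

// returns None (which Spark renders as null) when the input is null
def betterLowerRemoveAllWhitespace(s: String): Option[String] = {
  Option(s).map(_.toLowerCase().replaceAll("\\s", ""))
}

val betterLowerRemoveAllWhitespaceUDF =
  udf[Option[String], String](betterLowerRemoveAllWhitespace)

val anotherDF = Seq(
  "   BOO   ",
  "  HOO    ",
  null
).toDF("cry")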
1 anotherDF.select(
2 betterLowerRemoveAllWhitespaceUDF(col("cry")).as("clean_cry")
3 ).show()
4
5 +---------+
6 |clean_cry|
7 +---------+
8 | boo|
9 | hoo|
10 | null|
11 +---------+
We can use the explain() method to demonstrate that UDFs are a black box for the Spark engine.
== Physical Plan ==
*Project [UDF(cry#15) AS clean_cry#24]
+- Scan ExistingRDD[cry#15]
Spark doesn’t know how to convert the UDF into native Spark instructions. Let’s use the native
Spark library to refactor this code and help Spark generate a physical plan that can be optimized.
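A Column-based version looks something like this (a sketch consistent with the physical plan shown below):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lower, regexp_replace}

def bestLowerRemoveAllWhitespace()(col: Column): Column = {
  lower(regexp_replace(col, "\\s+", ""))
}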
1 anotherDF.select(
2 bestLowerRemoveAllWhitespace()(col("cry")).as("clean_cry")
3 ).show()
4
5 +---------+
6 |clean_cry|
7 +---------+
8 | boo|
9 | hoo|
10 | null|
11 +---------+
Notice that the bestLowerRemoveAllWhitespace function elegantly handles the null case and does not require
us to add any special null logic.
1 anotherDF.select(
2 bestLowerRemoveAllWhitespace()(col("cry")).as("clean_cry")
3 ).explain()
== Physical Plan ==
*Project [lower(regexp_replace(cry#29, \s+, )) AS clean_cry#38]
+- Scan ExistingRDD[cry#29]
Spark can view the internals of the bestLowerRemoveAllWhitespace function and optimize the
physical plan accordingly. UDFs are a black box for the Spark engine whereas functions that take a
Column argument and return a Column are not a black box for Spark.
Conclusion
Spark UDFs should be avoided whenever possible. If you need to write a UDF, make sure to handle
the null case as this is a common cause of errors.
Chaining Custom DataFrame Transformations in Spark
This chapter explains how to write DataFrame transformations and how to chain multiple transformations with the Dataset#transform method.
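The transformations used in this chapter presumably look something like this (a sketch matching the output below):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

def withGreeting(df: DataFrame): DataFrame = {
  df.withColumn("greeting", lit("hello world"))
}

def withFarewell(df: DataFrame): DataFrame = {
  df.withColumn("farewell", lit("goodbye"))
}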
We can use the transform method to run the withGreeting() and withFarewell() methods.
1 val df = Seq(
2 "funny",
3 "person"
4 ).toDF("something")
5
6 val weirdDf = df
7 .transform(withGreeting)
8 .transform(withFarewell)
1 weirdDf.show()
2
3 +---------+-----------+--------+
4 |something| greeting|farewell|
5 +---------+-----------+--------+
6 | funny|hello world| goodbye|
7 | person|hello world| goodbye|
8 +---------+-----------+--------+
The transform method can easily be chained with built-in Spark DataFrame methods, like select.
1 df
2 .select("something")
3 .transform(withGreeting)
4 .transform(withFarewell)
The transform method helps us write easy-to-follow code by avoiding nested method calls. Without
transform, the above code becomes less readable:
1 withFarewell(withGreeting(df))
2
3 // even worse
4 withFarewell(withGreeting(df)).select("something")
We can use the transform method to run the withGreeting() and withCat() methods.
1 val df = Seq(
2 "funny",
3 "person"
4 ).toDF("something")
5
6 val niceDf = df
7 .transform(withGreeting)
8 .transform(withCat("puffy"))
1 niceDf.show()
2
3 +---------+-----------+----------+
4 |something| greeting| cats|
5 +---------+-----------+----------+
6 | funny|hello world|puffy meow|
7 | person|hello world|puffy meow|
8 +---------+-----------+----------+
Whitespace data munging with Spark
Spark SQL provides a variety of methods to manipulate whitespace in your DataFrame StringType
columns.
The spark-daria¹³ library provides additional methods that are useful for whitespace data munging.
Learning about whitespace data munging is useful, but the more important lesson in this chapter is
learning how to build reusable custom SQL functions.
We’re laying the foundation to teach you how to build reusable code libraries.
¹³https://github.com/MrPowers/spark-daria/
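Let's start with trim(), which removes leading and trailing whitespace. Here's a sketch of a source DataFrame and the trim() call that produce output like the table below (the quotes in the table are just there to make the whitespace visible, and the strings themselves are hypothetical):

import org.apache.spark.sql.functions._

val sourceDF = Seq(
  "  a  ",
  "b  ",
  " c",
  null
).toDF("word")

val actualDF = sourceDF.withColumn("trimmed_word", trim(col("word")))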
1 actualDF.show()
2
3 +----------+------------+
4 | word|trimmed_word|
5 +----------+------------+
6 |" a "| "a"|
7 | "b "| "b"|
8 | " c"| "c"|
9 | null| null|
10 +----------+------------+
Let’s use the same sourceDF and demonstrate how the ltrim() method removes the leading
whitespace.
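A sketch of the ltrim() call:

val actualDF = sourceDF.withColumn("ltrimmed_word", ltrim(col("word")))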
1 actualDF.show()
2
3 +----------+-------------+
4 | word|ltrimmed_word|
5 +----------+-------------+
6 |" a "| "a "|
7 | "b "| "b "|
8 | " c"| "c"|
9 | null| null|
10 +----------+-------------+
The rtrim() method removes all trailing whitespace from a string - you can easily figure that one
out by yourself.
singleSpace()
The spark-daria project defines a singleSpace() method that removes all leading and trailing
whitespace and replaces all inner whitespace with a single space.
Here’s how the singleSpace() function is defined in the spark-daria source code.
1 import org.apache.spark.sql.Column
2
3 def singleSpace(col: Column): Column = {
4 trim(regexp_replace(col, " +", " "))
5 }
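Here's a sketch of a source DataFrame (with extra inner whitespace) and the singleSpace() call that produce output like the table below:

val sourceDF = Seq(
  "i  like     cheese",
  "    the dog runs   ",
  null
).toDF("words")

val actualDF = sourceDF.withColumn("single_spaced", singleSpace(col("words")))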
1 actualDF.show()
2
3 +-------------------+---------------+
4 | words| single_spaced|
5 +-------------------+---------------+
6 |"i like cheese"|"i like cheese"|
7 |" the dog runs "| "the dog runs"|
8 | null| null|
9 +-------------------+---------------+
removeAllWhitespace()
spark-daria defines a removeAllWhitespace() method that removes all whitespace from a string as
shown in the following example.
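A sketch of the call, assuming the spark-daria functions object is imported and reusing the sourceDF from above:

import com.github.mrpowers.spark.daria.sql.functions.removeAllWhitespace

val actualDF = sourceDF.withColumn(
  "no_whitespace",
  removeAllWhitespace(col("words"))
)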
1 actualDF.show()
2
3 +-------------------+-------------+
4 | words|no_whitespace|
5 +-------------------+-------------+
6 |"i like cheese"|"ilikecheese"|
7 |" the dog runs "| "thedogruns"|
8 | null| null|
9 +-------------------+-------------+
Notice how the removeAllWhitespace function takes a Column argument and returns a Column.
Custom SQL functions typically use this method signature.
Conclusion
Spark SQL offers a bunch of great functions for whitespace data munging.
spark-daria adds some additional custom SQL functions for more advanced whitespace data
munging.
Study the method signatures of the spark-daria functions. You’ll want to make generic cleaning
functions like these for your messy data too!
Defining DataFrame Schemas with StructField and StructType
Spark DataFrame schemas are defined as a collection of typed columns. The entire schema is stored
as a StructType and individual columns are stored as StructFields.
This chapter explains how to create and modify Spark schemas via the StructType and StructField
classes.
We’ll show how to work with IntegerType, StringType, and LongType columns.
Complex column types like ArrayType, MapType and StructType will be covered in later chapters.
Mastering Spark schemas is necessary for debugging code and writing tests.
1 import org.apache.spark.sql.types._
2
3 val data = Seq(
4 Row(8, "bat"),
5 Row(64, "mouse"),
6 Row(-27, "horse")
7 )
8
9 val schema = StructType(
10 List(
11 StructField("number", IntegerType, true),
12 StructField("word", StringType, true)
13 )
14 )
15
16 val df = spark.createDataFrame(
17 spark.sparkContext.parallelize(data),
18 schema
19 )
1 df.show()
2
3 +------+-----+
4 |number| word|
5 +------+-----+
6 | 8| bat|
7 | 64|mouse|
8 | -27|horse|
9 +------+-----+
1 df.schema
2
3 StructType(
4 StructField(number,IntegerType,true),
5 StructField(word,StringType,true)
6 )
StructField
StructField objects are created with the name, dataType, and nullable properties. Here’s an
example:
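A minimal sketch:

StructField("word", StringType, true)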
The StructField above sets the name field to "word", the dataType field to StringType, and the
nullable field to true.
StringType means that the column can only take string values like "hello" - it cannot take other
values like 34 or false.
When the nullable field is set to true, the column can accept null values.
The :: operator makes it easy to construct lists in Scala. We can also use :: to make a list of numbers.
1 5 :: 4 :: Nil
Notice that the last element always has to be Nil or the code will error out.
add() is an overloaded method and there are several different ways to invoke it - this will work too:
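For example, here's a sketch that uses two of the add() overloads to build up a schema:

val schema = StructType(Nil)
  .add(StructField("word", StringType, true))
  .add("number", IntegerType, true)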
Check the StructType documentation¹⁵ for all the different ways add() can be used.
Common errors
The data only contains two columns, but the schema contains three StructField columns.
Type mismatch
The following code incorrectly characterizes a string column as an integer column and will error out
with this message: java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException:
java.lang.String is not a valid external type for schema of int.
val data = Seq(Row(8, "bat"), Row(64, "mouse"), Row(-27, "horse"))
val schema = StructType(List(
  StructField("number", IntegerType, true),
  StructField("word", IntegerType, true) // wrong: this column actually holds strings
))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.show()
The first column of data (8, 64, and -27) can be characterized as IntegerType data.
The second column of data ("bat", "mouse", and "horse") cannot be characterized as an IntegerType
column - this code would work if this column were recharacterized as StringType.
LongType
Long values are suitable for bigger integers. You can create a long value in Scala by appending L to
an integer - e.g. 4L or -60L.
Let’s create a DataFrame with a LongType column.
1 df.show()
2
3 +--------+-----+
4 |long_num| word|
5 +--------+-----+
6 | 5| bat|
7 | -10|mouse|
8 | 4|horse|
9 +--------+-----+
You’ll get the following error message if you try to add integers to a LongType column: java.lang.RuntimeException:
Error while encoding: java.lang.RuntimeException: java.lang.Integer is not a valid
external type for schema of bigint
Next steps
You’ll be defining a lot of schemas in your test suites so make sure to master all the concepts covered
in this chapter.
Different approaches to manually create Spark DataFrames
This chapter shows how to manually create DataFrames with the Spark and spark-daria helper
methods.
We’ll demonstrate why the createDF() method defined in spark-daria is better than the toDF() and
createDataFrame() methods from the Spark source code.
toDF
Up until now, we’ve been using toDF to create DataFrames.
toDF() provides a concise syntax for creating DataFrames and can be accessed after importing Spark
implicits.
1 import spark.implicits._
2
3 // The toDF() method can be called on a sequence object to create a DataFrame.
4 val someDF = Seq(
5 (8, "bat"),
6 (64, "mouse"),
7 (-27, "horse")
8 ).toDF("number", "word")
createDataFrame
The createDataFrame() method addresses the limitations of the toDF() method and allows for full
schema customization and good Scala coding practices.
Here is how to create someDF with createDataFrame().
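A sketch that builds the same someDF with an explicit schema:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val someData = Seq(
  Row(8, "bat"),
  Row(64, "mouse"),
  Row(-27, "horse")
)

val someSchema = List(
  StructField("number", IntegerType, true),
  StructField("word", StringType, true)
)

val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)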
createDataFrame() provides the functionality we need, but the syntax is verbose. Our test files will
become cluttered and difficult to read if createDataFrame() is used frequently.
createDF
createDF() is defined in spark-daria and allows for the following terse syntax.
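A sketch of the createDF() syntax (the import path for the spark-daria implicits is an assumption and may vary by version):

import com.github.mrpowers.spark.daria.sql.SparkSessionExt._ // spark-daria implicits (assumed path)

val someDF = spark.createDF(
  List(
    (8, "bat"),
    (64, "mouse"),
    (-27, "horse")
  ), List(
    ("number", IntegerType, true),
    ("word", StringType, true)
  )
)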
createDF() creates readable code like toDF() and allows for full schema customization like createDataFrame(). It's the best of both worlds.
Dealing with null in Spark
What is null?
In SQL databases, “null means that some value is unknown, missing, or irrelevant¹⁶.” The SQL
concept of null is different than null in programming languages like JavaScript or Scala. Spark
DataFrame best practices are aligned with SQL best practices, so DataFrames should use null for
values that are unknown, missing or irrelevant.
1 name,country,zip_code
2 joe,usa,89013
3 ravi,india,
4 "",,12389
All the blank values and empty strings are read into a DataFrame as null.
¹⁶https://www.itprotoday.com/sql-server/sql-design-reason-null
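The CSV file above can be read with something like this (the path is hypothetical):

val peopleDf = spark.read
  .option("header", "true")
  .csv("/Users/powers/Documents/tmp/people.csv") // hypothetical path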
1 peopleDf.show()
2
3 +----+-------+--------+
4 |name|country|zip_code|
5 +----+-------+--------+
6 | joe| usa| 89013|
7 |ravi| india| null|
8 |null| null| 12389|
9 +----+-------+--------+
The Spark csv() method demonstrates that null is used for values that are unknown or missing
when files are read into DataFrames.
nullable Columns
Let’s create a DataFrame with a name column that isn’t nullable and an age column that is nullable.
The name column cannot take null values, but the age column can take null values. The nullable
property is the third argument when instantiating a StructField.
If we try to create a DataFrame with a null value in the name column, the code will blow up with
this error: “Error while encoding: java.lang.RuntimeException: The 0th field ‘name’ of input row
cannot be null”.
Here’s some code that would cause the error to be thrown:
Make sure to recreate the error on your machine! It’s a hard error message to understand unless
you’re used to it.
You can keep null values out of certain columns by setting nullable to false.
You won’t be able to set nullable to false for all columns in a DataFrame and pretend like null
values don’t exist. For example, when joining DataFrames, the join column will return null when
a match cannot be made.
Now let’s add a column that returns true if the number is even, false if the number is odd, and
null otherwise.
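The examples below assume a numbersDF with a nullable IntegerType column, something like:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val numbersDF = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(1), Row(8), Row(12), Row(null))),
  StructType(List(StructField("number", IntegerType, true)))
)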
1 numbersDF
2 .withColumn("is_even", $"number" % 2 === 0)
3 .show()
1 +------+-------+
2 |number|is_even|
3 +------+-------+
4 | 1| false|
5 | 8| true|
6 | 12| true|
7 | null| null|
8 +------+-------+
The Spark % method returns null when the input is null. Actually all Spark functions return null
when the input is null.
You should follow this example in your code - your Spark functions should return null when the
input is null too!
1 +------+
2 |number|
3 +------+
4 | 1|
5 | 8|
6 | 12|
7 | null|
8 +------+
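The simple UDF presumably looks something like this (a sketch; note that it doesn't guard against null input):

import org.apache.spark.sql.functions.udf

def isEvenSimple(n: Integer): Boolean = {
  n % 2 == 0 // blows up with a NullPointerException when n is null
}

val isEvenSimpleUdf = udf[Boolean, Integer](isEvenSimple)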
Our UDF does not handle null input values. Let’s run the code and observe the error.
1 numbersDF.withColumn(
2 "is_even",
3 isEvenSimpleUdf(col("number"))
4 )
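This errors out when it hits the null row. One workaround is to only run the UDF when the column isn't null (a sketch):

import org.apache.spark.sql.functions.when

val actualDF = numbersDF.withColumn(
  "is_even",
  when(col("number").isNotNull, isEvenSimpleUdf(col("number")))
)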
1 actualDF.show()
2
3 +------+-------+
4 |number|is_even|
5 +------+-------+
6 | 1| false|
7 | 8| true|
8 | 12| true|
9 | null| null|
10 +------+-------+
It’s better to write user defined functions that gracefully deal with null values and don’t rely on the
isNotNull work around - let’s try again.
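Here's a sketch of a UDF (call it isEvenBad) that handles null by returning false, which produces the output below:

def isEvenBad(n: Integer): Boolean = {
  if (n == null) false else n % 2 == 0
}

val isEvenBadUdf = udf[Boolean, Integer](isEvenBad)

val actualDF = numbersDF.withColumn("is_even", isEvenBadUdf(col("number")))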
1 actualDF.show()
2
3 +------+-------+
4 |number|is_even|
5 +------+-------+
6 | 1| false|
7 | 8| true|
8 | 12| true|
9 | null| false|
10 +------+-------+
This code works, but it's bad because it returns false for both odd numbers and null values. Remember
that null should be used for values that are unknown, missing, or irrelevant. null is neither even nor odd -
returning false for null numbers implies that null is odd!
Let’s refactor this code and correctly return null when number is null.
The isEvenBetter method returns an Option[Boolean]. When the input is null, isEvenBetter
returns None, which is converted to null in DataFrames.
Let’s run the isEvenBetterUdf on the same numbersDF as earlier and verify that null values are
correctly added when the number column is null.
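A sketch of the call:

val actualDF = numbersDF.withColumn("is_even", isEvenBetterUdf(col("number")))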
1 actualDF.show()
2
3 +------+-------+
4 |number|is_even|
5 +------+-------+
6 | 1| false|
7 | 8| true|
8 | 12| true|
9 | null| null|
10 +------+-------+
The isEvenBetter function is still directly referring to null. Let’s do a final refactoring to fully
remove null from the user defined function.
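A sketch of the final version:

def isEvenOption(n: Integer): Option[Boolean] = {
  Option(n).map(_ % 2 == 0) // Option(null) is None, so null input maps to None
}

val isEvenOptionUdf = udf[Option[Boolean], Integer](isEvenOption)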
The isEvenOption function converts the integer to an Option value and returns None if the conversion
cannot take place. This code does not use null and follows the purist advice: “Ban null from any of
your code. Period.”
This solution is less performant than directly referring to null, so a refactoring should be considered
if performance becomes a bottleneck.
1 numbersDF.withColumn(
2 "is_even",
3 col("number") / lit(2) === lit(0)
4 )
• Scala code should deal with null values gracefully and shouldn’t error out if there are null
values.
• Scala code should return None (or null) for values that are unknown, missing, or irrelevant.
DataFrames should also use null for values that are unknown, missing, or irrelevant.
• Use Option in Scala code and fall back on null if Option becomes a performance bottleneck.
Using JAR Files Locally
This chapter explains how to attach spark-daria to a Spark console session and to a Databricks
cluster.
We’ll need the spark-daria createDF method to easily make DataFrames because the createDataFrame
method is too verbose.
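One way to load the library is to pass the JAR to spark-shell with the --jars flag (the JAR path is hypothetical):

bash /Users/powers/spark-2.4.0-bin-hadoop2.7/bin/spark-shell \
  --jars /Users/powers/Downloads/spark-daria-0.35.2.jar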
Let’s access a class that’s defined in spark-daria to make sure the code was successfully loaded in
the console.
1 scala> com.github.mrpowers.spark.daria.sql.EtlDefinition
2 res0: com.github.mrpowers.spark.daria.sql.EtlDefinition.type = EtlDefinition
Quit the terminal session with the :quit command. It’ll look like this when typed into the console.
1 scala> :quit
¹⁹https://github.com/MrPowers/spark-daria/releases/tag/v0.35.2
1 bash /Users/powers/spark-2.4.0-bin-hadoop2.7/bin/spark-shell
1 scala> com.github.mrpowers.spark.daria.sql.EtlDefinition
2 <console>:24: error: object mrpowers is not a member of package com.github
3 com.github.mrpowers.spark.daria.sql.EtlDefinition
Let’s add spark-daria JAR to the console we just started with the :require command.
1 scala> com.github.mrpowers.spark.daria.sql.EtlDefinition
2 res1: com.github.mrpowers.spark.daria.sql.EtlDefinition.type = EtlDefinition
Create Library
Create a notebook, attach it to your cluster, and verify you can access the spark-daria EtlDefinition
class.
Review
This chapter showed you how to attach the spark-daria JAR file to console sessions and Databricks
notebooks.
You can use this workflow to attach any JAR files to your Spark analyses.
Notice how the :require command was used to add spark-daria to the classpath of an existing
Spark console. Starting up a Databricks cluster and then attaching spark-daria to the cluster classpath
is similar. Running Spark code locally helps you understand how the code works in a cluster
environment.
Working with Spark ArrayType columns
Spark DataFrame columns support arrays, which are great for data sets that have an arbitrary length.
This chapter will demonstrate Spark methods that return ArrayType columns, describe how to create
your own ArrayType columns, and explain when to use arrays in your analyses.
Scala collections
Scala has different types of collections: lists, sequences, and arrays. Let’s quickly review the different
types of Scala collections before jumping into Spark ArrayType columns.
Let’s create and sort a collection of numbers.
List, Seq, and Array differ slightly, but generally work the same. Most Spark programmers don’t
need to know about how these collections differ.
Spark uses arrays for ArrayType columns, so we’ll mainly use arrays in our code snippets.
1 actualDF.show()
2
3 +-------+----------------+
4 | name| hit_songs|
5 +-------+----------------+
6 |beatles|[help, hey jude]|
7 | romeo| [eres mia]|
8 +-------+----------------+
1 actualDF.printSchema()
2
3 root
4 |-- name: string (nullable = true)
5 |-- hit_songs: array (nullable = true)
6 | |-- element: string (containsNull = true)
An ArrayType column is suitable in this example because a singer can have an arbitrary amount of
hit songs. We don’t want to create a DataFrame with hit_song1, hit_song2, …, hit_songN columns.
1 singersDF.show()
2
3 +------+-------------+
4 | name| hit_songs|
5 +------+-------------+
6 |bieber|[baby, sorry]|
7 | ozuna| [criminal]|
8 +------+-------------+
1 singersDF.printSchema()
2
3 root
4 |-- name: string (nullable = true)
5 |-- hit_songs: array (nullable = true)
6 | |-- element: string (containsNull = true)
The ArrayType case class is instantiated with an elementType and a containsNull flag. In ArrayType(StringType,
true), StringType is the elementType and true is the containsNull flag.
array_contains
The Spark functions²¹ object provides helper methods for working with ArrayType columns. The
array_contains method returns true if the array contains a specified element.
Let’s create an array with people and their favorite colors. Then let’s use array_contains to append
a likes_red column that returns true if the person likes red.
²⁰http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.ArrayType
²¹http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
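A sketch of the DataFrame and the array_contains() call that produce the output below:

import org.apache.spark.sql.functions.{array_contains, col}

val peopleDF = Seq(
  ("bob", Array("red", "blue")),
  ("maria", Array("green", "red")),
  ("sue", Array("black"))
).toDF("name", "favorite_colors")

val actualDF = peopleDF.withColumn(
  "likes_red",
  array_contains(col("favorite_colors"), "red")
)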
1 actualDF.show()
2
3 +-----+---------------+---------+
4 | name|favorite_colors|likes_red|
5 +-----+---------------+---------+
6 | bob| [red, blue]| true|
7 |maria| [green, red]| true|
8 | sue| [black]| false|
9 +-----+---------------+---------+
explode
Let’s use the same DataFrame and the explode() method to create a new row for every element in
each array.
1 val df = peopleDF.select(
2 col("name"),
3 explode(col("favorite_colors")).as("color")
4 )
1 df.show()
2
3 +-----+-----+
4 | name|color|
5 +-----+-----+
6 | bob| red|
7 | bob| blue|
8 |maria|green|
9 |maria| red|
10 | sue|black|
11 +-----+-----+
peopleDF has 3 rows and the exploded DataFrame has 5 rows. The explode() method adds rows to
a DataFrame.
collect_list
The collect_list method collapses a DataFrame into fewer rows and stores the collapsed data in
an ArrayType column.
Let’s create a DataFrame with letter1, letter2, and number1 columns.
1 val df = Seq(
2 ("a", "b", 1),
3 ("a", "b", 2),
4 ("a", "b", 3),
5 ("z", "b", 4),
6 ("a", "x", 5)
7 ).toDF("letter1", "letter2", "number1")
8
9 df.show()
1 +-------+-------+-------+
2 |letter1|letter2|number1|
3 +-------+-------+-------+
4 | a| b| 1|
5 | a| b| 2|
6 | a| b| 3|
7 | z| b| 4|
8 | a| x| 5|
9 +-------+-------+-------+
Let’s use the collect_list() method to eliminate all the rows with duplicate letter1 and letter2
rows in the DataFrame and collect all the number1 entries as a list.
1 df
2 .groupBy("letter1", "letter2")
3 .agg(collect_list("number1") as "number1s")
4 .show()
1 +-------+-------+---------+
2 |letter1|letter2| number1s|
3 +-------+-------+---------+
4 | a| x| [5]|
5 | z| b| [4]|
6 | a| b|[1, 2, 3]|
7 +-------+-------+---------+
1 df.printSchema
2
3 root
4 |-- letter1: string (nullable = true)
5 |-- letter2: string (nullable = true)
6 |-- number1s: array (nullable = true)
7 | |-- element: integer (containsNull = true)
²²https://databricks.com/blog/2018/11/16/introducing-new-built-in-functions-and-higher-order-functions-for-complex-data-types-in-apache-spark.html
1 val df = spark.createDF(
2 List(
3 (Array(1, 2)),
4 (Array(1, 2, 3, 1)),
5 (null)
6 ), List(
7 ("nums", ArrayType(IntegerType, true), true)
8 )
9 )
1 df.show()
2
3 +------------+
4 | nums|
5 +------------+
6 | [1, 2]|
7 |[1, 2, 3, 1]|
8 | null|
9 +------------+
Let’s use the array_distinct() method to remove all of the duplicate array elements in the nums
column.
1 df
2 .withColumn("nums_distinct", array_distinct($"nums"))
3 .show()
4
5 +------------+-------------+
6 | nums|nums_distinct|
7 +------------+-------------+
8 | [1, 2]| [1, 2]|
9 |[1, 2, 3, 1]| [1, 2, 3]|
10 | null| null|
11 +------------+-------------+
Let’s use array_join() to create a pipe delimited string of all elements in the arrays.
1 df
2 .withColumn("nums_joined", array_join($"nums", "|"))
3 .show()
4
5 +------------+-----------+
6 | nums|nums_joined|
7 +------------+-----------+
8 | [1, 2]| 1|2|
9 |[1, 2, 3, 1]| 1|2|3|1|
10 | null| null|
11 +------------+-----------+
Let’s use the printSchema method to verify that nums_joined is a StringType column.
1 df
2 .withColumn("nums_joined", array_join($"nums", "|"))
3 .printSchema()
4
5 root
6 |-- nums: array (nullable = true)
7 | |-- element: integer (containsNull = true)
8 |-- nums_joined: string (nullable = true)
Let’s use array_max to grab the maximum value from the arrays.
1 df
2 .withColumn("nums_max", array_max($"nums"))
3 .show()
4
5 +------------+--------+
6 | nums|nums_max|
7 +------------+--------+
8 | [1, 2]| 2|
9 |[1, 2, 3, 1]| 3|
10 | null| null|
11 +------------+--------+
Let’s use array_min to grab the minimum value from the arrays.
1 df
2 .withColumn("nums_min", array_min($"nums"))
3 .show()
4
5 +------------+--------+
6 | nums|nums_min|
7 +------------+--------+
8 | [1, 2]| 1|
9 |[1, 2, 3, 1]| 1|
10 | null| null|
11 +------------+--------+
Let’s use the array_remove method to remove all the 1s from each of the arrays.
1 df
2 .withColumn("nums_sans_1", array_remove($"nums", 1))
3 .show()
4
5 +------------+-----------+
6 | nums|nums_sans_1|
7 +------------+-----------+
8 | [1, 2]| [2]|
9 |[1, 2, 3, 1]| [2, 3]|
10 | null| null|
11 +------------+-----------+
1 df
2 .withColumn("nums_sorted", array_sort($"nums"))
3 .show()
4
5 +------------+------------+
6 | nums| nums_sorted|
7 +------------+------------+
8 | [1, 2]| [1, 2]|
9 |[1, 2, 3, 1]|[1, 1, 2, 3]|
10 | null| null|
11 +------------+------------+
You can use the spark-daria²³ forall() method to run this computation on a Spark DataFrame with
an ArrayType column.
1 import com.github.mrpowers.spark.daria.sql.functions._
2
3 val df = spark.createDF(
4 List(
5 (Array("cream", "cookies")),
6 (Array("taco", "clam"))
7 ), List(
8 ("words", ArrayType(StringType, true), true)
9 )
10 )
11
12 df.withColumn(
13 "all_words_begin_with_c",
14 forall[String]((x: String) => x.startsWith("c")).apply(col("words"))
15 ).show()
1 +----------------+----------------------+
2 | words|all_words_begin_with_c|
3 +----------------+----------------------+
4 |[cream, cookies]| true|
5 | [taco, clam]| false|
6 +----------------+----------------------+
The native Spark API doesn’t provide access to all the helpful collection methods provided by Scala.
spark-daria²⁴ uses User Defined Functions to define forall and exists methods.
Spark will add higher level array functions to the API when Scala 3 is released.
²³https://github.com/MrPowers/spark-daria
²⁴https://github.com/MrPowers/spark-daria
Let’s use array_intersect to get the elements present in both the arrays without any duplication.
1 numbersDF
2 .withColumn("nums_intersection", array_intersect($"nums1", $"nums2"))
3 .show()
4
5 +------------+---------+-----------------+
6 | nums1| nums2|nums_intersection|
7 +------------+---------+-----------------+
8 | [1, 2]|[4, 5, 6]| []|
9 |[1, 2, 3, 1]|[2, 3, 4]| [2, 3]|
10 | null| [6, 7]| null|
11 +------------+---------+-----------------+
Let’s use array_union to get the elements in either array, without duplication.
1 numbersDF
2 .withColumn("nums_union", array_union($"nums1", $"nums2"))
3 .show()
1 +------------+---------+---------------+
2 | nums1| nums2| nums_union|
3 +------------+---------+---------------+
4 | [1, 2]|[4, 5, 6]|[1, 2, 4, 5, 6]|
5 |[1, 2, 3, 1]|[2, 3, 4]| [1, 2, 3, 4]|
6 | null| [6, 7]| null|
7 +------------+---------+---------------+
Let’s use array_except to get the elements that are in num1 and not in num2 without any duplication.
1 numbersDF
2 .withColumn("nums1_nums2_except", array_except($"nums1", $"nums2"))
3 .show()
4
5 +------------+---------+------------------+
6 | nums1| nums2|nums1_nums2_except|
7 +------------+---------+------------------+
8 | [1, 2]|[4, 5, 6]| [1, 2]|
9 |[1, 2, 3, 1]|[2, 3, 4]| [1]|
10 | null| [6, 7]| null|
11 +------------+---------+------------------+
1 val df = spark.createDF(
2 List(
3 (Array("a", "b", "c")),
4 (Array("d", "e", "f")),
5 (null)
6 ), List(
7 ("letters", ArrayType(StringType, true), true)
8 )
9 )
1 df.show()
2
3 +---------+
4 | letters|
5 +---------+
6 |[a, b, c]|
7 |[d, e, f]|
8 | null|
9 +---------+
1 df
2 .select(
3 $"letters".getItem(0).as("col1"),
4 $"letters".getItem(1).as("col2"),
5 $"letters".getItem(2).as("col3")
6 )
7 .show()
8
9 +----+----+----+
10 |col1|col2|col3|
11 +----+----+----+
12 | a| b| c|
13 | d| e| f|
14 |null|null|null|
15 +----+----+----+
1 df
2 .select(
3 (0 until 3).map(i => $"letters".getItem(i).as(s"col$i")): _*
4 )
5 .show()
6
7 +----+----+----+
8 |col0|col1|col2|
9 +----+----+----+
10 | a| b| c|
²⁵https://stackoverflow.com/questions/39255973/split-1-column-into-3-columns-in-spark-scala
11 | d| e| f|
12 |null|null|null|
13 +----+----+----+
Our code snippet above is a little ugly because the 3 is hardcoded. We can calculate the size of every
array in the column, take the max size, and use that rather than hardcoding.
1 val numCols = df
2 .withColumn("letters_size", size($"letters"))
3 .agg(max($"letters_size"))
4 .head()
5 .getInt(0)
6
7 df
8 .select(
9 (0 until numCols).map(i => $"letters".getItem(i).as(s"col$i")): _*
10 )
11 .show()
12
13 +----+----+----+
14 |col0|col1|col2|
15 +----+----+----+
16 | a| b| c|
17 | d| e| f|
18 |null|null|null|
19 +----+----+----+
Closing thoughts
Spark ArrayType columns make it easy to work with collections at scale.
Master the content covered in this chapter to add a powerful skill to your toolset.
For more examples, see this Databricks notebook²⁶ that covers even more Array / Map functions.
²⁶https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/142158605138935/3773509768457258/
7497868276316206/latest.html
Working with Spark MapType
Columns
Spark DataFrame columns support maps, which are great for storing an arbitrary number of key / value pairs.
This chapter describes how to create MapType columns, demonstrates built-in functions to manipulate MapType columns, and explains when to use maps in your analyses.
Scala maps
Let’s begin with a little refresher on Scala maps.
Create a Scala map that connects some English and Spanish words.
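A minimal sketch of such a map (the "dog" entry matches the lookup below; the other entries are illustrative):

val wordMapping = Map(
  "dog" -> "perro",
  "cat" -> "gato",
  "water" -> "agua"
)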
1 wordMapping("dog") // "perro"
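Spark can store key / value pairs like these in a MapType column. Here's a minimal sketch of a singersDF that matches the output below:

import spark.implicits._

val singersDF = Seq(
  ("sublime", Map("good_song" -> "santeria", "bad_song" -> "doesn't exist")),
  ("prince_royce", Map("good_song" -> "darte un beso", "bad_song" -> "back it up"))
).toDF("name", "songs")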
1 singersDF.show(false)
2
3 +------------+----------------------------------------------------+
4 |name |songs |
5 +------------+----------------------------------------------------+
6 |sublime |[good_song -> santeria, bad_song -> doesn't exist] |
7 |prince_royce|[good_song -> darte un beso, bad_song -> back it up]|
8 +------------+----------------------------------------------------+
Let’s examine the DataFrame schema and verify that the songs column has a MapType:
1 singersDF.printSchema()
2
3 root
4 |-- name: string (nullable = true)
5 |-- songs: map (nullable = true)
6 | |-- key: string
7 | |-- value: string (valueContainsNull = true)
1 singersDF
2 .withColumn("song_to_love", element_at(col("songs"), "good_song"))
3 .show(false)
1 +------------+----------------------------------------------------+-------------+
2 |name |songs |song_to_love |
3 +------------+----------------------------------------------------+-------------+
4 |sublime |[good_song -> santeria, bad_song -> doesn't exist] |santeria |
5 |prince_royce|[good_song -> darte un beso, bad_song -> back it up]|darte un beso|
6 +------------+----------------------------------------------------+-------------+
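One way to build a countriesDF like the one shown below is with the map() function, which assembles a MapType column from existing columns (a sketch):

import org.apache.spark.sql.functions.{col, map}

val countriesDF = Seq(
  ("costa_rica", "sloth"),
  ("nepal", "red_panda")
).toDF("country_name", "cute_animal")
  .withColumn("some_map", map(col("country_name"), col("cute_animal")))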
1 countriesDF.show(false)
2
3 +------------+-----------+---------------------+
4 |country_name|cute_animal|some_map |
5 +------------+-----------+---------------------+
6 |costa_rica |sloth |[costa_rica -> sloth]|
7 |nepal |red_panda |[nepal -> red_panda] |
8 +------------+-----------+---------------------+
1 countriesDF.printSchema()
2
3 root
4 |-- country_name: string (nullable = true)
5 |-- cute_animal: string (nullable = true)
6 |-- some_map: map (nullable = false)
7 | |-- key: string
8 | |-- value: string (valueContainsNull = true)
1 val df = spark.createDF(
2 List(
3 (Array("a", "b"), Array(1, 2)),
4 (Array("x", "y"), Array(33, 44))
5 ), List(
6 ("letters", ArrayType(StringType, true), true),
7 ("numbers", ArrayType(IntegerType, true), true)
8 )
9 ).withColumn(
10 "strange_map",
11 map_from_arrays(col("letters"), col("numbers"))
12 )
1 df.show(false)
2
3 +-------+--------+------------------+
4 |letters|numbers |strange_map |
5 +-------+--------+------------------+
6 |[a, b] |[1, 2] |[a -> 1, b -> 2] |
7 |[x, y] |[33, 44]|[x -> 33, y -> 44]|
8 +-------+--------+------------------+
Let’s take a look at the df schema and verify strange_map is a MapType column:
1 df.printSchema()
2
3 |-- letters: array (nullable = true)
4 | |-- element: string (containsNull = true)
5 |-- numbers: array (nullable = true)
6 | |-- element: integer (containsNull = true)
7 |-- strange_map: map (nullable = true)
8 | |-- key: string
9 | |-- value: integer (valueContainsNull = true)
The Spark way of converting two arrays to a map is different from the “regular Scala” way of converting two arrays to a map.
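For comparison, here's how plain Scala zips two collections into a map (an illustrative snippet):

val letters = Array("a", "b")
val numbers = Array(1, 2)

letters.zip(numbers).toMap // Map(a -> 1, b -> 2)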
We could wrap this code in a User Defined Function and define our own map_from_arrays function
if we wanted.
In general, it’s best to rely on the standard Spark library instead of defining our own UDFs.
The key takeaway is that the Spark way of solving a problem is often different from the Scala way.
Read the API docs and always try to solve your problems the Spark way.
1 val df = spark.createDF(
2 List(
3 (Map("a" -> "aaa", "b" -> "bbb"), Map("c" -> "ccc", "d" -> "ddd"))
4 ), List(
5 ("some_data", MapType(StringType, StringType, true), true),
6 ("more_data", MapType(StringType, StringType, true), true)
7 )
8 )
9
10 df
11 .withColumn("all_data", map_concat(col("some_data"), col("more_data")))
12 .show(false)
1 +--------------------+--------------------+----------------------------------------+
2 |some_data |more_data |all_data |
3 +--------------------+--------------------+----------------------------------------+
4 |[a -> aaa, b -> bbb]|[c -> ccc, d -> ddd]|[a -> aaa, b -> bbb, c -> ccc, d -> ddd]|
5 +--------------------+--------------------+----------------------------------------+
1 +------+--------------------------------+
2 |name |stature |
3 +------+--------------------------------+
4 |lebron|[height -> 6.67, units -> feet] |
5 |messi |[height -> 1.7, units -> meters]|
6 +------+--------------------------------+
1 athletesDF.printSchema()
2
3 root
4 |-- name: string (nullable = true)
5 |-- stature: map (nullable = true)
6 | |-- key: string
7 | |-- value: string (valueContainsNull = true)
stature is a MapType column, but we can also store stature as a StructType column.
19 )
20 )
21
22 val athletesDF = spark.createDataFrame(
23 spark.sparkContext.parallelize(data),
24 schema
25 )
1 athletesDF.show(false)
2
3 +-----------+-------------+
4 |player_name|stature |
5 +-----------+-------------+
6 |lebron |[6.67, feet] |
7 |messi |[1.7, meters]|
8 +-----------+-------------+
1 athletesDF.printSchema()
2
3 root
4 |-- player_name: string (nullable = true)
5 |-- stature: struct (nullable = true)
6 | |-- height: string (nullable = true)
7 | |-- unit: string (nullable = true)
Sometimes both StructType and MapType columns can solve the same problem and you can choose
between the two.
1 writing to disk
2 - cannot write maps to disk with the CSV format *** FAILED ***
3 org.apache.spark.sql.AnalysisException: CSV data source does not support map<strin\
4 g,string> data type.;
5 at org.apache.spark.sql.execution.datasources.DataSourceUtils$$anonfun$verifySchem\
6 a$1.apply(DataSourceUtils.scala:69)
7 at org.apache.spark.sql.execution.datasources.DataSourceUtils$$anonfun$verifySchem\
8 a$1.apply(DataSourceUtils.scala:67)
9 at scala.collection.Iterator$class.foreach(Iterator.scala:891)
10 at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
11 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
12 at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
13 at org.apache.spark.sql.execution.datasources.DataSourceUtils$.verifySchema(DataSo\
14 urceUtils.scala:67)
15 at org.apache.spark.sql.execution.datasources.DataSourceUtils$.verifyWriteSchema(D\
16 ataSourceUtils.scala:34)
17 at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWr\
18 iter.scala:100)
19 at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.ru\
20 n(InsertIntoHadoopFsRelationCommand.scala:159)
MapType columns can be written out with the Parquet file format. This code runs just fine:
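A minimal sketch, with a hypothetical output path:

df
  .write
  .mode("overwrite")
  .parquet("/tmp/map_example") // hypothetical path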
Conclusion
MapType columns are a great way to store key / value pairs of arbitrary lengths in a DataFrame
column.
Spark 2.4 added a lot of native functions that make it easier to work with MapType columns. Prior
to Spark 2.4, developers were overly reliant on UDFs for manipulating MapType columns.
StructType columns can often be used instead of a MapType column. Study both of these column
types closely so you can understand the tradeoffs and intelligently select the best column type for
your analysis.
Adding StructType columns to
DataFrames
StructType objects define the schema of DataFrames. StructType objects contain a list of StructField
objects that define the name, type, and nullable flag for each column in a DataFrame.
Let’s start with an overview of StructType objects and then demonstrate how StructType columns
can be added to DataFrame schemas (essentially creating a nested schema).
StructType columns are a great way to eliminate order dependencies from Spark code.
StructType overview
The StructType case class can be used to define a DataFrame schema as follows.
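Here's a sketch that produces the df shown below:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val data = Seq(
  Row(1, "a"),
  Row(5, "z")
)

val schema = StructType(
  List(
    StructField("num", IntegerType, true),
    StructField("letter", StringType, true)
  )
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)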
1 df.show()
2
3 +---+------+
4 |num|letter|
5 +---+------+
6 | 1| a|
7 | 5| z|
8 +---+------+
1 print(df.schema)
2
3 StructType(
4 StructField(num, IntegerType, true),
5 StructField(letter, StringType, true)
6 )
Let’s look at another example to see how StructType columns can be appended to DataFrames.
18
19 val actualDF = df.withColumn(
20 "animal_interpretation",
21 struct(
22 (col("weight") > 5).as("is_large_animal"),
23 col("animal_type").isin("rat", "cat", "dog").as("is_mammal")
24 )
25 )
1 actualDF.show(truncate = false)
2
3 +------+-----------+---------------------+
4 |weight|animal_type|animal_interpretation|
5 +------+-----------+---------------------+
6 |20.0 |dog |[true,true] |
7 |3.5 |cat |[false,true] |
8 |6.0E-6|ant |[false,false] |
9 +------+-----------+---------------------+
1 print(actualDF.schema)
2
3 StructType(
4 StructField(weight,DoubleType,true),
5 StructField(animal_type,StringType,true),
6 StructField(animal_interpretation, StructType(
7 StructField(is_large_animal,BooleanType,true),
8 StructField(is_mammal,BooleanType,true)
9 ), false)
10 )
The animal_interpretation column has a StructType type, so this DataFrame has a nested schema.
It’s easier to view the schema with the printSchema method.
1 actualDF.printSchema()
2
3 root
4 |-- weight: double (nullable = true)
5 |-- animal_type: string (nullable = true)
6 |-- animal_interpretation: struct (nullable = false)
7 | |-- is_large_animal: boolean (nullable = true)
8 | |-- is_mammal: boolean (nullable = true)
1 actualDF.select(
2 col("animal_type"),
3 col("animal_interpretation")("is_large_animal").as("is_large_animal"),
4 col("animal_interpretation")("is_mammal").as("is_mammal")
5 ).show(truncate = false)
1 +-----------+---------------+---------+
2 |animal_type|is_large_animal|is_mammal|
3 +-----------+---------------+---------+
4 |dog |true |true |
5 |cat |false |true |
6 |ant |false |false |
7 +-----------+---------------+---------+
Notice that both the withIsTeenager and withHasPositiveMood transformations must be run before the withWhatToDo transformation. The functions have an order dependency: they must be run in a certain order for the code to work.
Let’s build a DataFrame and execute the functions in the right order so the code will run.
17 )
18
19 df
20 .transform(withIsTeenager())
21 .transform(withHasPositiveMood())
22 .transform(withWhatToDo())
23 .show()
1 +---+-----+-----------+-----------------+-----------+
2 |age| mood|is_teenager|has_positive_mood| what_to_do|
3 +---+-----+-----------+-----------------+-----------+
4 | 30|happy| false| true| null|
5 | 13| sad| true| false| null|
6 | 18| glad| true| true|have a chat|
7 +---+-----+-----------+-----------------+-----------+
Let’s use the struct function to append a StructType column to the DataFrame and remove the order
depenencies from this code.
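Here's a hedged sketch of that refactor - the age and mood rules are inferred from the output below, and everything is computed inside a single struct so there are no intermediate columns and no ordering constraints:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def withBestAction()(df: DataFrame): DataFrame = {
  val isTeenager = col("age").between(13, 19)
  val hasPositiveMood = col("mood").isin("happy", "glad")
  df.withColumn(
    "best_action",
    struct(
      isTeenager.as("is_teenager"),
      hasPositiveMood.as("has_positive_mood"),
      when(isTeenager && hasPositiveMood, "have a chat").as("what_to_do")
    )
  )
}

df.transform(withBestAction()).show(truncate = false)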
1 +---+-----+-----------------------+
2 |age|mood |best_action |
3 +---+-----+-----------------------+
4 |30 |happy|[false,true,null] |
5 |13 |sad |[true,false,null] |
6 |18 |glad |[true,true,have a chat]|
7 +---+-----+-----------------------+
1 import java.sql.Date
2 import org.apache.spark.sql.types.{DateType, IntegerType}
3
4 val sourceDF = spark.createDF(
5 List(
6 (1, Date.valueOf("2016-09-30")),
7 (2, Date.valueOf("2016-12-14"))
8 ), List(
9 ("person_id", IntegerType, true),
10 ("birth_date", DateType, true)
11 )
12 )
1 sourceDF.show()
2
3 +---------+----------+
4 |person_id|birth_date|
5 +---------+----------+
6 | 1|2016-09-30|
7 | 2|2016-12-14|
8 +---------+----------+
9
10 sourceDF.printSchema()
11
12 root
13 |-- person_id: integer (nullable = true)
14 |-- birth_date: date (nullable = true)
The cast() method can create a DateType column by converting a StringType column into a date.
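For example, a sketch with illustrative data:

import org.apache.spark.sql.functions.col

val sourceDF = Seq(
  (1, "2013-01-30"),
  (2, "2012-01-01")
).toDF("person_id", "birth_date")
  .withColumn("birth_date", col("birth_date").cast("date"))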
1 sourceDF.show()
2
3 +---------+----------+
4 |person_id|birth_date|
5 +---------+----------+
6 | 1|2013-01-30|
7 | 2|2012-01-01|
8 +---------+----------+
9
10 sourceDF.printSchema()
11
12 root
13 |-- person_id: integer (nullable = true)
14 |-- birth_date: date (nullable = true)
1 +---------+----------+----------+-----------+---------+
2 |person_id|birth_date|birth_year|birth_month|birth_day|
3 +---------+----------+----------+-----------+---------+
4 | 1|2016-09-30| 2016| 9| 30|
5 | 2|2016-12-14| 2016| 12| 14|
6 +---------+----------+----------+-----------+---------+
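The birth_year, birth_month, and birth_day columns above can be derived with the year(), month(), and dayofmonth() functions. A sketch, assuming a sourceDF with a birth_date DateType column:

import org.apache.spark.sql.functions.{col, year, month, dayofmonth}

sourceDF
  .withColumn("birth_year", year(col("birth_date")))
  .withColumn("birth_month", month(col("birth_date")))
  .withColumn("birth_day", dayofmonth(col("birth_date")))
  .show()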
The org.apache.spark.sql.functions package has a lot of functions that make it easy to work with dates in Spark.
minute(), second()
Let’s create a DataFrame with a TimestampType column and use built in Spark functions to extract
the minute and second from the timestamp.
1 import java.sql.Timestamp
2
3 val sourceDF = spark.createDF(
4 List(
5 (1, Timestamp.valueOf("2017-12-02 03:04:00")),
6 (2, Timestamp.valueOf("1999-01-01 01:45:20"))
7 ), List(
8 ("person_id", IntegerType, true),
9 ("fun_time", TimestampType, true)
10 )
11 )
12
13 sourceDF.withColumn(
14 "fun_minute",
15 minute(col("fun_time"))
16 ).withColumn(
17 "fun_second",
18 second(col("fun_time"))
19 ).show()
1 +---------+-------------------+----------+----------+
2 |person_id| fun_time|fun_minute|fun_second|
3 +---------+-------------------+----------+----------+
4 | 1|2017-12-02 03:04:00| 4| 0|
5 | 2|1999-01-01 01:45:20| 45| 20|
6 +---------+-------------------+----------+----------+
datediff()
The datediff() and current_date() functions can be used to calculate the number of days between
today and a date in a DateType column. Let’s use these functions to calculate someone’s age in days.
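A sketch, assuming a sourceDF with a birth_date DateType column:

import org.apache.spark.sql.functions.{col, current_date, datediff}

sourceDF
  .withColumn("age_in_days", datediff(current_date(), col("birth_date")))
  .show()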
1 +---------+----------+-----------+
2 |person_id|birth_date|age_in_days|
3 +---------+----------+-----------+
4 | 1|1990-09-30| 9946|
5 | 2|2001-12-14| 5853|
6 +---------+----------+-----------+
date_add()
The date_add() function can be used to add days to a date. Let’s add 15 days to a date column.
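A sketch:

import org.apache.spark.sql.functions.{col, date_add}

sourceDF
  .withColumn("15_days_old", date_add(col("birth_date"), 15))
  .show()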
1 +---------+----------+-----------+
2 |person_id|birth_date|15_days_old|
3 +---------+----------+-----------+
4 | 1|1990-09-30| 1990-10-15|
5 | 2|2001-12-14| 2001-12-29|
6 +---------+----------+-----------+
Next steps
Look at the Spark SQL functions²⁷ for the full list of methods available for working with dates and
times in Spark.
²⁷http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
Performing operations on multiple
columns with foldLeft
The Scala foldLeft method can be used to iterate over a data structure and perform multiple
operations on a Spark DataFrame.
For example, foldLeft can be used to eliminate all whitespace in multiple columns or convert all the
column names in a DataFrame to snake_case.
foldLeft is great when you want to perform similar operations on multiple columns. Let’s dive in!
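Here's a minimal foldLeft sketch of the kind described below:

val odds = List(1, 5, 7)

val total = odds.foldLeft(0) { (memo, num) =>
  memo + num
}

println(total) // 13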
The sum of 1, 5, and 7 is 13 and that’s what the code snippet above will print.
The foldLeft function is initialized with a starting value of zero and the running sum is accumulated
in the memo variable. This code sums all the numbers in the odds list.
1 actualDF.show()
2
3 +------+--------+
4 | name| country|
5 +------+--------+
6 | pablo|Paraguay|
7 |Neymar| Brasil|
8 +------+--------+
We can improve this code by using the DataFrame#columns method and the removeAllWhitespace
method defined in spark-daria.
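A sketch of that approach, assuming spark-daria's removeAllWhitespace(col: Column) function and a sourceDF with whitespace-polluted columns:

import org.apache.spark.sql.functions.col
import com.github.mrpowers.spark.daria.sql.functions.removeAllWhitespace

val actualDF = sourceDF.columns.foldLeft(sourceDF) { (memo, colName) =>
  memo.withColumn(colName, removeAllWhitespace(col(colName)))
}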
The snakeCaseColumns custom transformation can now be reused for any DataFrame. This
transformation is already defined in spark-daria by the way.
Next steps
If you’re still uncomfortable with the foldLeft method, try the Scala collections CodeQuizzes. You
should understand foldLeft in Scala before trying to apply foldLeft in Spark.
Whenever you’re applying a similar operation to multiple columns in a Spark DataFrame, try to use
foldLeft. It will reduce the redundancy in your code and decrease your code complexity. Try to wrap
your foldLeft calls in custom transformations to make beautiful functions that are reusable!
Equality Operators
Spark has a standard equality operator and a null safe equality operator.
This chapter explains how the equality operators differ and when each operator should be used.
===
Let’s create a DataFrame with word1 and word1 columns and compare the equality with the ===
operator.
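The chapter is unfinished, but here's an illustrative sketch with hypothetical data:

import spark.implicits._

val df = Seq(
  ("bat", "bat"),
  ("snake", "rat"),
  (null, "cup")
).toDF("word1", "word2")

df
  .withColumn("words_equal", $"word1" === $"word2")
  .show()

// "bat" === "bat"   => true
// "snake" === "rat" => false
// null === "cup"    => null (the null safe operator <=> would return false here)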
TODO - finish chapter
Introduction to Spark Broadcast Joins
Spark broadcast joins are perfect for joining a large DataFrame with a small DataFrame.
Broadcast joins cannot be used when joining two large DataFrames.
This post explains how to do a simple broadcast join and how the broadcast() function helps Spark
optimize the execution plan.
Conceptual overview
Spark splits up data on different nodes in a cluster so multiple computers can process data in parallel.
Traditional joins are hard with Spark because the data is split on multiple machines.
Broadcast joins are easier to run on a cluster. Spark can “broadcast” a small DataFrame by sending all
the data in that small DataFrame to all nodes in the cluster. After the small DataFrame is broadcasted,
Spark can perform a join without shuffling any of the data in the large DataFrame.
Simple example
Let’s create a DataFrame with information about people and another DataFrame with information
about cities. In this example, both DataFrames will be small, but let’s pretend that the peopleDF is
huge and the citiesDF is tiny.
1 +----------+---------+
2 |first_name| city|
3 +----------+---------+
4 | andrea| medellin|
5 | rodolfo| medellin|
6 | abdul|bangalore|
7 +----------+---------+
1 +---------+--------+----------+
2 | city| country|population|
3 +---------+--------+----------+
4 | medellin|colombia| 2.5|
5 |bangalore| india| 12.3|
6 +---------+--------+----------+
1 peopleDF.join(
2 broadcast(citiesDF),
3 peopleDF("city") <=> citiesDF("city")
4 ).show()
1 +----------+---------+---------+--------+----------+
2 |first_name| city| city| country|population|
3 +----------+---------+---------+--------+----------+
4 | andrea| medellin| medellin|colombia| 2.5|
5 | rodolfo| medellin| medellin|colombia| 2.5|
6 | abdul|bangalore|bangalore| india| 12.3|
7 +----------+---------+---------+--------+----------+
The Spark null safe equality operator (<=>) is used to perform this join.
1 peopleDF.join(
2 broadcast(citiesDF),
3 peopleDF("city") <=> citiesDF("city")
4 ).explain()
== Physical Plan ==
BroadcastHashJoin [coalesce(city#6, )], [coalesce(city#21, )], Inner, BuildRight, (city#6 <=> city#21)
:- LocalTableScan [first_name#5, city#6]
+- BroadcastExchange HashedRelationBroadcastMode(List(coalesce(input[0, string, true], )))
+- LocalTableScan [city#21, country#22, population#23]
In this example, Spark is smart enough to return the same physical plan, even when the broadcast()
method isn’t used.
1 peopleDF.join(
2 citiesDF,
3 peopleDF("city") <=> citiesDF("city")
4 ).explain()
== Physical Plan ==
BroadcastHashJoin [coalesce(city#6, )], [coalesce(city#21, )], Inner, BuildRight, (city#6 <=> city#21)
:- LocalTableScan [first_name#5, city#6]
+- BroadcastExchange HashedRelationBroadcastMode(List(coalesce(input[0, string, true], )))
+- LocalTableScan [city#21, country#22, population#23]
Spark isn’t always smart about optimally broadcasting DataFrames when the code is complex, so
it’s best to use the broadcast() method explicitly and inspect the physical plan to make sure the
join is executed properly.
1 peopleDF.join(
2 broadcast(citiesDF),
3 Seq("city")
4 ).show()
1 +---------+----------+--------+----------+
2 | city|first_name| country|population|
3 +---------+----------+--------+----------+
4 | medellin| andrea|colombia| 2.5|
5 | medellin| rodolfo|colombia| 2.5|
6 |bangalore| abdul| india| 12.3|
7 +---------+----------+--------+----------+
1 peopleDF.join(
2 broadcast(citiesDF),
3 Seq("city")
4 ).explain()
== Physical Plan ==
Code that returns the same result without relying on the sequence join generates an entirely different
physical plan.
1 peopleDF.join(
2 broadcast(citiesDF),
3 peopleDF("city") <=> citiesDF("city")
4 )
5 .drop(citiesDF("city"))
6 .explain()
== Physical Plan ==
Project [first_name#5, city#6, country#22, population#23]
+- BroadcastHashJoin [coalesce(city#6, )], [coalesce(city#21, )], Inner, BuildRight, (city#6 <=> city#21)
:- LocalTableScan [first_name#5, city#6]
+- BroadcastExchange HashedRelationBroadcastMode(List(coalesce(input[0, string, true], )))
+- LocalTableScan [city#21, country#22, population#23]
It’s best to avoid the shortcut join syntax so your physical plans stay as simple as possible.
1 peopleDF.join(
2 broadcast(citiesDF),
3 peopleDF("city") <=> citiesDF("city")
4 )
5 .drop(citiesDF("city"))
6 .explain(true)
+- ResolvedHint isBroadcastable=true
+- Project [_1#17 AS city#21, _2#18 AS country#22, _3#19 AS population#23]
+- LocalRelation [_1#17, _2#18, _3#19]
+- ResolvedHint isBroadcastable=true
+- Project [_1#17 AS city#21, _2#18 AS country#22, _3#19 AS population#23]
+- LocalRelation [_1#17, _2#18, _3#19]
== Physical Plan ==
Project [first_name#5, city#6, country#22, population#23]
+- BroadcastHashJoin [coalesce(city#6, )], [coalesce(city#21, )], Inner, BuildRight, (city#6 <=> city#21)
:- LocalTableScan [first_name#5, city#6]
+- BroadcastExchange HashedRelationBroadcastMode(List(coalesce(input[0, string, true], )))
+- LocalTableScan [city#21, country#22, population#23]
Notice how the parsed, analyzed, and optimized logical plans all contain ResolvedHint isBroadcastable=true
because the broadcast() function was used. This hint isn’t included when the broadcast() function
isn’t used.
1 peopleDF.join(
2 citiesDF,
3 peopleDF("city") <=> citiesDF("city")
4 )
5 .drop(citiesDF("city"))
6 .explain(true)
Next steps
Broadcast joins are a great way to append data stored in relatively small single-source-of-truth data files to large DataFrames. DataFrames up to 2GB can be broadcasted, so a data file with tens or even hundreds of thousands of rows is a broadcast candidate.
Partitioning Data in Memory
Spark splits data into partitions and executes computations on the partitions in parallel. You should
understand how data is partitioned and when you need to manually adjust the partitioning to keep
your Spark computations running efficiently.
Intro to partitions
Let’s create a DataFrame of numbers to illustrate how data is partitioned:
1 val x = (1 to 10).toList
2 val numbersDF = x.toDF("number")
1 numbersDF.rdd.partitions.size // => 4
1 numbersDF.write.csv("/Users/powers/Desktop/spark_output/numbers")
1 Partition A: 1, 2
2 Partition B: 3, 4, 5
3 Partition C: 6, 7
4 Partition D: 8, 9, 10
coalesce
The coalesce method reduces the number of partitions in a DataFrame. Here’s how to consolidate
the data from four partitions to two partitions:
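A minimal sketch of that call:

val numbersDF2 = numbersDF.coalesce(2)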
We can verify coalesce has created a new DataFrame with only two partitions:
1 numbersDF2.rdd.partitions.size // => 2
1 numbersDF2.write.csv("/Users/powers/Desktop/spark_output/numbers2")
1 Partition A: 1, 2, 3, 4, 5
2 Partition C: 6, 7, 8, 9, 10
The coalesce algorithm moved the data from Partition B to Partition A and moved the data from
Partition D to Partition C. The data in Partition A and Partition C does not move with the coalesce
algorithm. This algorithm is fast in certain situations because it minimizes data movement.
Increasing partitions
You can try to increase the number of partitions with coalesce, but it won’t work!
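A sketch:

val numbersDF3 = numbersDF.coalesce(6)

numbersDF3.rdd.partitions.size // => 4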
numbersDF3 keeps four partitions even though we attempted to create 6 partitions with coalesce(6).
The coalesce algorithm changes the number of partitions by moving data from some partitions to existing partitions. This algorithm obviously cannot increase the number of partitions.
repartition
The repartition method can be used to either increase or decrease the number of partitions.
Let’s create a homerDF from the numbersDF with two partitions.
1 Partition ABC: 1, 3, 5, 6, 8, 10
2 Partition XYZ: 2, 4, 7, 9
Partition ABC contains data from Partition A, Partition B, Partition C, and Partition D. Partition XYZ
also contains data from each original partition. The repartition algorithm does a full data shuffle and
equally distributes the data among the partitions. It does not attempt to minimize data movement
like the coalesce algorithm.
These results will be different when run on your machine. You’ll also note that the data might not
be evenly split because this data set is so tiny.
Increasing partitions
The repartition method can be used to increase the number of partitions as well.
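A sketch:

val bartDF = numbersDF.repartition(6)

bartDF.rdd.partitions.size // => 6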
Here’s how the data is split up amongst the partitions in the bartDF.
1 Partition 00000: 5, 7
2 Partition 00001: 1
3 Partition 00002: 2
4 Partition 00003: 8
5 Partition 00004: 3, 9
6 Partition 00005: 4, 6, 10
The repartition method does a full shuffle of the data, so the number of partitions can be increased.
repartition by column
Let’s use the following data to examine how a DataFrame can be repartitioned by a particular
column.
1 +-----+-------+
2 | age | color |
3 +-----+-------+
4 | 10 | blue |
5 | 13 | red |
6 | 15 | blue |
7 | 99 | red |
8 | 67 | blue |
9 +-----+-------+
1 val colorDF = peopleDF.repartition($"color")
When partitioning by a column, Spark creates 200 memory partitions by default (controlled by the spark.sql.shuffle.partitions setting). This example will have two partitions with data and 198 empty partitions.
1 Partition 00091
2 13,red
3 99,red
4 Partition 00168
5 10,blue
6 15,blue
7 67,blue
The colorDF contains different partitions for each color and is optimized for extracts by color.
Partitioning by a column is similar to indexing a column in a relational database. A later chapter on
partitioning data on disk will explain this concept more completely.
Spark doesn’t adjust the number of partitions when a large DataFrame is filtered, so the dataPuddle
will also have 13,000 partitions. The dataPuddle only contains 2,000 rows of data, so a lot of the
partitions will be empty. It’s not efficient to read or write thousands of empty text files to disk - we
should improve this code by repartitioning.
partitionBy() is a DataFrameWriter method that specifies whether the data should be written to disk in folders. By default, Spark does not write data to disk in nested folders.
Memory partitioning is often important, independent of disk partitioning. But in order to write data
on disk properly, you’ll almost always need to repartition the data in memory first.
Simple example
Suppose we have the following CSV file with first_name, last_name, and country columns:
1 first_name,last_name,country
2 Ernesto,Guevara,Argentina
3 Vladimir,Putin,Russia
4 Maria,Sharapova,Russia
5 Bruce,Lee,China
6 Jack,Ma,China
Let’s partition this data on disk with country as the partition key. Let’s create one file per partition.
1 partitioned_lake1/
2 country=Argentina/
3 part-00044-cf737804-90ea-4c37-94f8-9aa016f6953a.c000.snappy.parquet
4 country=China/
5 part-00059-cf737804-90ea-4c37-94f8-9aa016f6953a.c000.snappy.parquet
6 country=Russia/
7 part-00002-cf737804-90ea-4c37-94f8-9aa016f6953a.c000.snappy.parquet
Creating one file per disk partition is not going to work for production-sized datasets. We won't want to write the China partition out as a single file if it contains 100GB of data.
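A hedged sketch that spreads the data across five memory partitions before writing (consistent with the file counts below):

df
  .repartition(5)
  .write
  .partitionBy("country")
  .parquet("partitioned_lake2")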
1 partitioned_lake2/
2 country=Argentina/
3 part-00003-c2d1b76a-aa61-437f-affc-a6b322f1cf42.c000.snappy.parquet
4 country=China/
5 part-00000-c2d1b76a-aa61-437f-affc-a6b322f1cf42.c000.snappy.parquet
6 part-00004-c2d1b76a-aa61-437f-affc-a6b322f1cf42.c000.snappy.parquet
7 country=Russia/
8 part-00001-c2d1b76a-aa61-437f-affc-a6b322f1cf42.c000.snappy.parquet
9 part-00002-c2d1b76a-aa61-437f-affc-a6b322f1cf42.c000.snappy.parquet
The partitionBy writer will write out files to disk for each memory partition. The maximum number
of files written out by partitionBy is the number of unique countries multiplied by the number of
memory partitions.
In this example, we have 3 unique countries * 5 memory partitions, so 15 files could get written
out (if each memory partition had one Argentinian, one Chinese, and one Russian person). We only
have 5 rows of data, so only 5 files are written in this example.
1 partitioned_lake3/
2 country=Argentina/
3 part-00000-bc6ce757-d39f-489e-9677-0a7105b29e66.c000.snappy.parquet
4 country=China/
5 part-00000-bc6ce757-d39f-489e-9677-0a7105b29e66.c000.snappy.parquet
6 country=Russia/
7 part-00000-bc6ce757-d39f-489e-9677-0a7105b29e66.c000.snappy.parquet
1 person_name,person_country
2 a,China
3 b,China
4 c,China
5 ...77 more China rows
6 a,France
7 b,France
8 c,France
9 ...12 more France rows
10 a,Cuba
11 b,Cuba
12 c,Cuba
13 ...2 more Cuba rows
Let’s create 8 memory partitions and scatter the data randomly across the memory partitions (we’ll
write out the data to disk, so we can inspect the contents of a memory partition).
1 p,China
2 f1,China
3 n1,China
4 a2,China
5 b2,China
6 d2,China
7 e2,China
8 f,France
9 c,Cuba
This technique helps us set a maximum number of files per partition when creating a partitioned
lake. Let’s write out the data to disk and observe the output.
1 partitioned_lake4/
2 person_country=China/
3 part-00000-0887fbd2-4d9f-454a-bd2a-de42cf7e7d9e.c000.csv
4 part-00001-0887fbd2-4d9f-454a-bd2a-de42cf7e7d9e.c000.csv
5 ... 6 more files
6 person_country=Cuba/
7 part-00002-0887fbd2-4d9f-454a-bd2a-de42cf7e7d9e.c000.csv
8 part-00003-0887fbd2-4d9f-454a-bd2a-de42cf7e7d9e.c000.csv
9 ... 2 more files
10 person_country=France/
11 part-00000-0887fbd2-4d9f-454a-bd2a-de42cf7e7d9e.c000.csv
12 part-00001-0887fbd2-4d9f-454a-bd2a-de42cf7e7d9e.c000.csv
13 ... 5 more files
Each disk partition will have up to 8 files. The data is split randomly in the 8 memory partitions and
there won’t be any output files for a given disk partition if the memory partition doesn’t have any
data for the country.
This is better, but still not ideal. We only want one file for Cuba (currently have 4) and two files for
France (currently have 7), so too many small files are being created.
Let’s review the contents of our memory partition from earlier:
1 p,China
2 f1,China
3 n1,China
4 a2,China
5 b2,China
6 d2,China
7 e2,China
8 f,France
9 c,Cuba
partitionBy will split up this particular memory partition into three files: one China file with 7
rows of data, one France file with one row of data, and one Cuba file with one row of data.
This technique is particularly important for partition keys that are highly skewed. The number of inhabitants by country is a good example of a partition key with high skew. For example, Jamaica has 3 million people and China has 1.4 billion people - we'll want ∼467 times more files in the China partition than the Jamaica partition.
We calculate the total number of records per partition key and then create a my_secret_partition_-
key column rather than relying on a fixed number of partitions.
You should choose the desiredRowsPerPartition based on what will give you ∼1 GB files. If you
have a 500 GB dataset with 750 million rows, set desiredRowsPerPartition to 1,500,000.
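Here's one hedged sketch of that approach (the rand seed and output path are illustrative):

import org.apache.spark.sql.functions.{col, rand}

val desiredRowsPerPartition = 1500000

df
  .join(df.groupBy("person_country").count(), Seq("person_country"))
  .withColumn(
    "my_secret_partition_key",
    (rand(10) * col("count") / desiredRowsPerPartition).cast("int")
  )
  .repartition(col("person_country"), col("my_secret_partition_key"))
  .drop("count", "my_secret_partition_key")
  .write
  .partitionBy("person_country")
  .csv("partitioned_lake5") // hypothetical path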
Conclusion
Partitioned data lakes can be much faster to query (when filtering on the partition keys). Partitioned
data lakes can allow for a massive amount of data skipping.
Creating and maintaining partitioned data lakes is challenging, but the performance gains make
them a worthwhile effort.
Fast Filtering with Spark
PartitionFilters and PushedFilters
Spark can use the disk partitioning of files to greatly speed up certain filtering operations.
This post explains the difference between memory and disk partitioning, describes how to analyze physical plans to see when filters are applied, and gives a conceptual overview of why this design pattern can provide massive performance gains.
1 first_name,last_name,country
2 Ernesto,Guevara,Argentina
3 Vladimir,Putin,Russia
4 Maria,Sharapova,Russia
5 Bruce,Lee,China
6 Jack,Ma,China
1 val df = spark
2 .read
3 .option("header", "true")
4 .csv("/Users/powers/Documents/tmp/blog_data/people.csv")
Let’s write a query to fetch all the Russians in the CSV file with a first_name that starts with M.
1 df
2 .where($"country" === "Russia" && $"first_name".startsWith("M"))
3 .show()
1 +----------+---------+-------+
2 |first_name|last_name|country|
3 +----------+---------+-------+
4 | Maria|Sharapova| Russia|
5 +----------+---------+-------+
1 df
2 .where($"country" === "Russia" && $"first_name".startsWith("M"))
3 .explain()
== Physical Plan ==
Project [first_name#12, last_name#13, country#14]
+- Filter (((isnotnull(country#14) && isnotnull(first_name#12)) && (country#14 = Russia)) &&
StartsWith(first_name#12, M))
+- FileScan csv [first_name#12,last_name#13,country#14]
Batched: false,
Format: CSV,
Location: InMemoryFileIndex[file:/Users/powers/Documents/tmp/blog_data/people.csv],
PartitionFilters: [],
PushedFilters: [IsNotNull(country), IsNotNull(first_name), EqualTo(country,Russia), StringStartsWith(first_-
name,M)],
ReadSchema: struct<first_name:string,last_name:string,country:string>
Take note that there are no PartitionFilters in the physical plan.
partitionBy()
The repartition() method partitions the data in memory and the partitionBy() method partitions
data in folders when it’s written out to disk.
Let’s write out the data in partitioned CSV files.
1 df
2 .repartition($"country")
3 .write
4 .option("header", "true")
5 .partitionBy("country")
6 .csv("/Users/powers/Documents/tmp/blog_data/partitioned_lake")
1 partitioned_lake/
2 country=Argentina/
3 part-00044-c5d2f540-e89b-40c1-869d-f9871b48c617.c000.csv
4 country=China/
5 part-00059-c5d2f540-e89b-40c1-869d-f9871b48c617.c000.csv
6 country=Russia/
7 part-00002-c5d2f540-e89b-40c1-869d-f9871b48c617.c000.csv
Here are the contents of the CSV file in the country=Russia directory.
1 first_name,last_name
2 Vladimir,Putin
3 Maria,Sharapova
Notice that the country column is not included in the CSV file anymore. Spark has moved the column from the CSV data into the directory name.
PartitionFilters
Let’s read from the partitioned data folder, run the same filters, and see how the physical plan
changes.
Let’s run the same filter as before, but on the partitioned lake, and examine the physical plan.
== Physical Plan ==
Project [first_name#74, last_name#75, country#76]
+- Filter (isnotnull(first_name#74) && StartsWith(first_name#74, M))
+- FileScan csv [first_name#74,last_name#75,country#76]
Batched: false,
Format: CSV,
Location: InMemoryFileIndex[file:/Users/powers/Documents/tmp/blog_data/partitioned_lake],
PartitionCount: 1,
PartitionFilters: [isnotnull(country#76), (country#76 = Russia)],
PushedFilters: [IsNotNull(first_name), StringStartsWith(first_name,M)],
ReadSchema: struct<first_name:string,last_name:string>
You need to examine the physical plans carefully to identify the differences.
When filtering on df we have PartitionFilters: [] whereas when filtering on partitionedDF we
have PartitionFilters: [isnotnull(country#76), (country#76 = Russia)].
Spark only grabs data from certain partitions and skips all of the irrelevant partitions. Data skipping
allows for a big performance boost.
PushedFilters
When we filter off of df, the pushed filters are [IsNotNull(country), IsNotNull(first_name),
EqualTo(country,Russia), StringStartsWith(first_name,M)].
When we filter off of partitionedDf, the pushed filters are [IsNotNull(first_name), StringStartsWith(first_-
name,M)].
Spark doesn’t need to push the country filter when working off of partitionedDF because it can
use a partition filter that is a lot faster.
partitionBy() changes how data is partitioned when it’s written out to disk.
Use repartition() before writing out partitioned data to disk with partitionBy() because it’ll
execute a lot faster and write out fewer files.
Partitioning in memory and partitioning on disk are related, but completely different concepts that expert Spark programmers must master.
1 df
2 .repartition($"country")
3 .write
4 .option("header", "true")
5 .partitionBy("country")
6 .csv("/Users/powers/Documents/tmp/blog_data/partitioned_lake")
We don’t our data lake to contain some massive files because that’ll make Spark reads / writes
unnecessarily slow.
If we don’t do any in memory reparitioning, Spark will write out a ton of files for each partition and
our data lake will contain way too many small files.
1 df
2 .write
3 .option("header", "true")
4 .partitionBy("country")
5 .csv("/Users/powers/Documents/tmp/blog_data/partitioned_lake")
This answer²⁹ explains how to intelligently repartition in memory before writing out to disk with
partitionBy().
²⁹https://stackoverflow.com/questions/53037124/partitioning-a-large-skewed-dataset-in-s3-with-sparks-partitionby-method
1 import org.apache.spark.sql.functions.rand
2
3 df
4 .repartition(100, $"country", rand)
5 .write
6 .option("header", "true")
7 .partitionBy("country")
8 .csv("/Users/powers/Documents/tmp/blog_data/partitioned_lake")
Next steps
Effective disk partitioning can greatly speed up filter operations.
Scala Text Editing
It is easier to develop Scala with an Integrated Development Environment (IDE) or an IDE-like setup.
IntelliJ³⁰ is a great Scala IDE with a free community edition.
Scala Metals³¹ adds IDE-like features to “regular” text editors like Visual Studio Code, Atom, Vim,
Sublime Text, and Emacs.
Text editors provide a wide range of features that make it a lot easier to develop code. Databricks
notebooks only offer a tiny fraction of the text editing features that are available in IDEs.
Let’s take a look at some common Scala IDE features that’ll help when you’re writing Spark code.
Syntax highlighting
Let’s look a little chunk of code to create a DataFrame with the spark-daria createDF method.
Databricks doesn’t do much in the way of syntax highlighting.
IntelliJ clearly differentiates between the string, integer, and boolean types.
³⁰https://www.jetbrains.com/idea/
³¹https://scalameta.org/metals/
Import reminders
Databricks won’t complain about code that’s not imported until you run the code.
If there are missing imports, IntelliJ will complain, even before the code is run.
Import hints
IntelliJ smartly assumes that your code might be missing the org.apache.spark.sql.types.StructField import.
You can click a button and IntelliJ will add the import statement for you.
Databricks only provides the type mismatch error after the code is run.
When you hover the mouse over the incorrect argument type, IntelliJ provides a helpful type hint.
If you hover over the greyed out import, IntelliJ will provide an “Unused import statement” warning.
Applications
Introduction to SBT
SBT is an interactive build tool that is used to run tests and package your projects as JAR files.
SBT lets you package projects created in text editors, so you can run the code in a cloud cluster
computing environment (like Databricks).
SBT has a comprehensive Getting started guide³², but let’s be honest - who wants to read a book on
a build tool?
This chapter teaches Spark programmers what they need to know about SBT and skips all the other
details!
Sample code
I recommend cloning the spark-daria³³ project on your local machine, so you can run the SBT
commands as you read this post.
build.sbt
The SBT build definition is specified in the build.sbt file.
This is where you’ll add code to specify the project dependencies, the Scala version, how to build
your JAR files, how to manage memory, etc.
One of the only things that’s not specified in the build.sbt file is the SBT version itself. The SBT
version is specified in the project/build.properties file, for example:
³²https://www.scala-sbt.org/1.x/docs/Getting-Started.html
³³https://github.com/MrPowers/spark-daria
1 sbt.version=1.2.8
libraryDependencies
You can specify libraryDependencies in your build.sbt file to fetch libraries from Maven or
JitPack³⁴.
Here’s how to add Spark SQL and Spark ML to a project:
SBT provides shortcut syntax so you can clean up your build.sbt file a bit.
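The exact shortcut isn't reproduced here; a common way to tidy up repeated dependencies looks like this:

val sparkVersion = "2.4.3"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
)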
"provided" dependencies are already included in the environment where we run our code.
Here’s an example of some test dependencies that are only used when we run our test suite:
The chapter on building JAR files provides a more detailed discussion on provided and test
dependencies.
sbt test
You can run your test suite with the sbt test command.
You can set environment variables in your test suite by adding this line to your build.sbt file:
envVars in Test := Map("PROJECT_ENV" -> "test"). Refer to the environment specific config
chapter for more details about this design pattern.
You can run a single test file when using Scalatest with this command:
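For example, to run only the HelloWorldSpec file (the spec name is illustrative):

sbt "testOnly *HelloWorldSpec"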
³⁴https://jitpack.io/
Complicated SBT commands are generally easier to run from the SBT shell, so you don’t need to
think about proper quoting.
sbt doc
The sbt doc command generates HTML documentation for your project.
You can open the documentation on your local machine with open target/scala-2.11/api/index.html
after it’s been generated.
Codebases are easier to understand when the public API is clearly defined, and you should focus on
marking anything that’s not part of the public interface with the private keyword. Private methods
aren’t included in the API documentation.
sbt console
The sbt console command starts the Scala interpreter with easy access to all your project files.
Let’s run sbt console in the spark-daria project and then invoke the StringHelpers.snakify()
method.
Running sbt console is similar to running the Spark shell with the spark-daria JAR file attached.
Here’s how to start the Spark shell with the spark-daria JAR file attached.
The same code from before also works in the Spark shell:
The sbt console is sometimes useful for playing around with code, but the test suite is usually
better.
Don’t “test” your code in the console and neglect writing real tests.
You should be comfortable with developing Spark code in a text editor, packaging your project as a
JAR file, and attaching your JAR file to a cloud cluster for production analyses.
sbt clean
The sbt clean command deletes all of the generated files in the target/ directory.
This command will delete the documentation generated by sbt doc and will delete the JAR files
generated by sbt package and sbt assembly.
It’s good to run sbt clean frequently, so you don’t accumlate a lot of legacy clutter in the target/
directory.
Next steps
SBT is a great build tool for Spark projects.
It lets you easily run tests, generate documentation, and package code as JAR files.
Managing the SparkSession, The
DataFrame Entry Point
The SparkSession is used to create and read DataFrames.
This post explains how to create a SparkSession and share it throughout your program.
Spark errors out if you try to create multiple SparkSessions, so it's important that you share one SparkSession throughout your program.
Some environments (e.g. Databricks) create a SparkSession for you and in those cases, you’ll want
to reuse the SparkSession that already exists rather than create your own.
The SparkSession is used to access the SparkContext, which has a parallelize method that converts a sequence into an RDD.
RDDs aren’t used much now that the DataFrame API has been released, but they’re still useful when
creating DataFrames.
Creating a DataFrame
The SparkSession is used twice when manually creating a DataFrame:
1 import org.apache.spark.sql.Row
2 import org.apache.spark.sql.types._
3
4 val rdd = spark.sparkContext.parallelize(
5 Seq(
6 Row("bob", 55)
7 )
8 )
9
10 val schema = StructType(
11 Seq(
12 StructField("name", StringType, true),
13 StructField("age", IntegerType, true)
14 )
15 )
16
17 val df = spark.createDataFrame(rdd, schema)
1 df.show()
2
3 +----+---+
4 |name|age|
5 +----+---+
6 | bob| 55|
7 +----+---+
You will frequently use the SparkSession to create DataFrames when testing your code.
Reading a DataFrame
The SparkSession is also used to read CSV, JSON, and Parquet files.
Here are some examples.
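A few sketches, with hypothetical paths:

val csvDF = spark.read.option("header", "true").csv("/some/path/people.csv")
val jsonDF = spark.read.json("/some/path/people.json")
val parquetDF = spark.read.parquet("/some/path/people.parquet")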
There are separate posts on CSV, JSON, and Parquet files that do deep dives into the intricacies of each file format.
Creating a SparkSession
You can create a SparkSession in your applications with the getOrCreate method:
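A minimal sketch:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .master("local")
  .appName("my cool app")
  .getOrCreate()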
You don’t need to manually create a SparkSession in programming environments that already define
the variable (e.g. the Spark shell or a Databricks notebook). Creating your own SparkSession becomes
vital when you write Spark code in a text editor.
Wrapping the spark variable in a trait is the best way to share it across different classes and objects
in your codebase.
1 import org.apache.spark.sql.SparkSession
2
3 trait SparkSessionWrapper extends Serializable {
4
5 lazy val spark: SparkSession = {
6 SparkSession.builder().master("local").appName("my cool app").getOrCreate()
7 }
8
9 }
The getOrCreate() method will create a new SparkSession if one does not exist and reuse an existing SparkSession if one does.
Here’s how getOrCreate() works in different environments:
• In the Databricks environment, getOrCreate will always use the SparkSession created by
Databricks and will never create a SparkSession
• In the Spark console, getOrCreate will use the SparkSession created by the console
• In the test environment, getOrCreate will create a SparkSession the first time it encounters the
spark variable and will then reuse that SparkSession
Your production environment will probably already define the spark variable, so getOrCreate() won't ever bother creating a SparkSession and will simply use the SparkSession already created by the environment.
Here is how the SparkSessionWrapper can be used in some example objects.
1 import utest._
2
3 object TransformsTest extends TestSuite with SparkSessionWrapper with ColumnComparer\
4 {
5
6 val tests = Tests {
7
8 'withSomeDatamart - {
9
10 val coolDF = spark.createDF(
11 List(
12
13 ), List(
14
15 )
16 )
17
18 val df = spark.createDF(
19 List(
20
21 ), List(
22
23 )
24 ).transform(transformations.withSomeDatamart())
25
26 }
27
28 }
29
30 }
The test leverages the createDF method, which is a SparkSession extension defined in spark-daria. createDF is similar to createDataFrame, but more concise.
SparkContext
The SparkSession encapsulates the SparkConf, SparkContext, and SQLContext.
Prior to Spark 2.0, developers needed to explicitly create SparkConf, SparkContext, and SQLContext objects. Now Spark developers can just create a SparkSession and access the other objects as needed.
The following code snippet uses the SparkSession to access the sparkContext, so the parallelize
method can be used to create a DataFrame.
1 spark.sparkContext.parallelize(
2 Seq(
3 Row("bob", 55)
4 )
5 )
You shouldn’t have to access the sparkContext much - pretty much only when manually creating
DataFrames. See the spark-daria³⁵ createDF() method, so you don’t even need to explicitly call
sparkContext when you want to create a DataFrame.
Conclusion
You’ll need a SparkSession in your programs to create DataFrames.
Reusing the SparkSession in your application is critical for good code organization. Reusing the
SparkSession in your test suite is vital to make your tests execute as quickly as possible.
³⁵https://github.com/MrPowers/spark-daria
³⁶https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html
Testing Spark Applications
Testing Spark applications allows for a rapid development workflow and gives you confidence that
your code will work in production.
Most Spark users spin up clusters with sample data sets to develop code - this is slow (clusters are
slow to start) and costly (you need to pay for computing resources).
An automated test suite lets you develop code on your local machine free of charge. Test files should
run in under a minute, so it’s easy to rapidly iterate.
The test suite documents how the code functions, reduces bugs, and makes it easier to add new
features without breaking existing code.
We’ll talk about more benefits of testing later. Let’s start with some simple examples!
1 package com.github.mrpowers.spark.test.example
2
3 import org.apache.spark.sql.DataFrame
4 import org.apache.spark.sql.functions._
5
6 object HelloWorld {
7
8 def withGreeting()(df: DataFrame): DataFrame = {
9 df.withColumn("greeting", lit("hello world"))
10 }
11
12 }
1 +------+
2 | name|
3 +------+
4 |miguel|
5 | luisa|
6 +------+
When we run the HelloWorld.withGreeting() method, we should get a new DataFrame that looks
like this:
1 +------+-----------+
2 | name| greeting|
3 +------+-----------+
4 |miguel|hello world|
5 | luisa|hello world|
6 +------+-----------+
Add a SparkSessionTestWrapper trait in the test directory so we can create DataFrames in our test
suite via the SparkSession.
1 package com.github.mrpowers.spark.test.example
2
3 import org.apache.spark.sql.SparkSession
4
5 trait SparkSessionTestWrapper {
6
7 lazy val spark: SparkSession = {
8 SparkSession
9 .builder()
10 .master("local")
11 .appName("spark test example")
12 .getOrCreate()
13 }
14
15 }
Let’s write a test that creates a DataFrame, runs the withGreeting() method, and confirms that the
greeting column has been properly appended to the DataFrame.
1 package com.github.mrpowers.spark.test.example
2
3 import com.github.mrpowers.spark.fast.tests.DataFrameComparer
4 import org.apache.spark.sql.Row
5 import org.apache.spark.sql.types._
6 import org.scalatest.FunSpec
7
8 class HelloWorldSpec
9 extends FunSpec
10 with DataFrameComparer
11 with SparkSessionTestWrapper {
12
13 import spark.implicits._
14
15 it("appends a greeting column to a Dataframe") {
16
17 val sourceDF = Seq(
18 ("miguel"),
19 ("luisa")
20 ).toDF("name")
21
22 val actualDF = sourceDF.transform(HelloWorld.withGreeting())
23
24 val expectedSchema = List(
25 StructField("name", StringType, true),
26 StructField("greeting", StringType, false)
27 )
28
29 val expectedData = Seq(
30 Row("miguel", "hello world"),
31 Row("luisa", "hello world")
32 )
33
34 val expectedDF = spark.createDataFrame(
35 spark.sparkContext.parallelize(expectedData),
36 StructType(expectedSchema)
37 )
38
39 assertSmallDataFrameEquality(actualDF, expectedDF)
40
41 }
42
43 }
1 package com.github.mrpowers.spark.test.example
2
3 import org.apache.spark.sql.functions._
4
5 object NumberFun {
6
7 def isEven(n: Integer): Boolean = {
8 n % 2 == 0
9 }
10
11 val isEvenUDF = udf[Boolean, Integer](isEven)
12
13 }
The test isn’t too complicated, but prepare yourself for a wall of code.
1 package com.github.mrpowers.spark.test.example
2
3 import org.scalatest.FunSpec
4 import org.apache.spark.sql.types._
5 import org.apache.spark.sql.functions._
6 import org.apache.spark.sql.Row
7 import com.github.mrpowers.spark.fast.tests.DataFrameComparer
8
9 class NumberFunSpec
10 extends FunSpec
11 with DataFrameComparer
12 with SparkSessionTestWrapper {
13
14 import spark.implicits._
15
16 it("appends an is_even column to a Dataframe") {
17
18 val sourceDF = Seq(
19 (1),
20 (8),
21 (12)
22 ).toDF("number")
23
24 val actualDF = sourceDF
25 .withColumn("is_even", NumberFun.isEvenUDF(col("number")))
26
27 val expectedSchema = List(
28 StructField("number", IntegerType, false),
29 StructField("is_even", BooleanType, true)
30 )
31
32 val expectedData = Seq(
33 Row(1, false),
34 Row(8, true),
35 Row(12, true)
36 )
37
38 val expectedDF = spark.createDataFrame(
39 spark.sparkContext.parallelize(expectedData),
40 StructType(expectedSchema)
41 )
42
43 assertSmallDataFrameEquality(actualDF, expectedDF)
44
45 }
46 }
We create a DataFrame, run the NumberFun.isEvenUDF() function, create another expected DataFrame,
and compare the actual result with our expectations using assertSmallDataFrameEquality() from
spark-fast-tests.
We can improve this by testing isEven() on a standalone basis and covering the edge cases. Here are some
tests we might like to add.
1 describe(".isEven") {
2 it("returns true for even numbers") {
3 assert(NumberFun.isEven(4) === true)
4 }
5
6 it("returns false for odd numbers") {
7 assert(NumberFun.isEven(3) === false)
8 }
9
10 it("returns false for null values") {
11 assert(NumberFun.isEven(null) === false)
12 }
13 }
The first two tests pass with our existing code, but the third one causes the code to error out with a
NullPointerException. If we could assume the user defined function would never be called on columns that
contain null values, we might be able to get away with ignoring them, but it's probably safer to account
for null values and refactor the code accordingly.
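Here's one rough sketch of a null-safe refactor that keeps the tests above passing (treating null as "not even" is a design choice, not the only option):

object NumberFun {

  def isEven(n: Integer): Boolean = {
    // Guard against null so the UDF never throws a NullPointerException
    if (n == null) false else n % 2 == 0
  }

  val isEvenUDF = udf[Boolean, Integer](isEven)

}

Another reasonable design is to return an Option[Boolean], so null inputs produce null results instead of false.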
A Real Test
Let’s write a test for a function that converts all the column names of a DataFrame to snake_case.
This will make it a lot easier to run SQL queries off of the DataFrame.
1 package com.github.mrpowers.spark.test.example
2
3 import org.apache.spark.sql.DataFrame
4
5 object Converter {
6
7 def snakecaseify(s: String): String = {
8 s.toLowerCase().replace(" ", "_")
9 }
10
11 def snakeCaseColumns(df: DataFrame): DataFrame = {
12 df.columns.foldLeft(df) { (acc, cn) =>
13 acc.withColumnRenamed(cn, snakecaseify(cn))
14 }
15 }
16
17 }
snakecaseify is a pure function and will be tested using the ScalaTest assert() method. We'll compare
the equality of two DataFrames to test the snakeCaseColumns method.
1 package com.github.mrpowers.spark.test.example
2
3 import com.github.mrpowers.spark.fast.tests.DataFrameComparer
4 import org.scalatest.FunSpec
5
6 class ConverterSpec
7 extends FunSpec
8 with DataFrameComparer
9 with SparkSessionTestWrapper {
10
11 import spark.implicits._
12
13 describe(".snakecaseify") {
14
15 it("downcases uppercase letters") {
16 assert(Converter.snakecaseify("HeLlO") === "hello")
17 }
18
19 it("converts spaces to underscores") {
20 assert(Converter.snakecaseify("Hi There") === "hi_there")
21 }
22
23 }
24
25 describe(".snakeCaseColumns") {
26
27 it("snake_cases the column names of a DataFrame") {
28
29 val sourceDF = Seq(
30 ("funny", "joke")
31 ).toDF("A b C", "de F")
32
33 val actualDF = Converter.snakeCaseColumns(sourceDF)
34
35 val expectedDF = Seq(
36 ("funny", "joke")
37 ).toDF("a_b_c", "de_f")
38
39 assertSmallDataFrameEquality(actualDF, expectedDF)
40
41 }
42
43 }
44
45 }
This test file uses the describe method to group the tests associated with the snakecaseify() and
snakeCaseColumns() methods. This separates the code in the test file and makes the console output
clearer when the tests are run.
Finding Bugs
When writing user defined functions or DataFrame transformations that will process billions of rows
of data, you will likely encounter bad data. There will be strange characters, null values, and other
inconsistencies. Testing encourages you to proactively deal with edge cases. If your code breaks with
a production anomaly, you can add another test to make sure the edge case doesn’t catch you again.
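For example, if ragged column names ever showed up in production, a hypothetical regression test could pin down exactly how snakecaseify() treats the extra whitespace:

describe(".snakecaseify") {

  it("documents how leading and trailing whitespace is handled") {
    // "  First Name " lowercases to "  first name " and every space becomes an underscore
    assert(Converter.snakecaseify("  First Name ") === "__first_name_")
  }

}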
Supplying Documentation
It is often easier to understand code by reading the tests! When I need to grok some new code, I start
with the tests and then progress to the source code.
API documentation can sometimes fall out of sync with the actual code. A developer may update
the code and forget to update the API documentation.
The test suite won’t fall out of sync with the code. If the code changes and the tests start failing, the
developer will remember to update the test suite.
Environment Specific Config in Spark Scala Projects
Spark projects typically need different configuration values in the test and production environments - for example, a path to a file in the GitHub repository when testing and an S3 path in production. One approach is a Config object that stores environment specific values in test and production maps and exposes a get() method that picks the right map based on a PROJECT_ENV environment variable:

package com.github.mrpowers.spark.spec.sql

object Config {

  var test: Map[String, String] = {
    Map(
      "libsvmData" -> new java.io.File("./src/test/resources/sample_libsvm_data.txt").getCanonicalPath,
      "somethingElse" -> "hi"
    )
  }

  var production: Map[String, String] = {
    Map(
      "libsvmData" -> "s3a://my-cool-bucket/fun-data/libsvm.txt",
      "somethingElse" -> "whatever"
    )
  }

  var environment = sys.env.getOrElse("PROJECT_ENV", "production")

  def get(key: String): String = {
    if (environment == "test") {
      test(key)
    } else {
      production(key)
    }
  }

}
The Config.get() method grabs values from the test or production map depending on the PROJECT_ENV value. Since environment variables are read when the JVM starts, you need to restart the SBT console after changing PROJECT_ENV to run the same code in the production environment.
Here is how the Config object can be used to fetch a file from your GitHub repository in the test environment and a file from S3 in the production environment.
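A rough sketch of that kind of function (readLibsvmData is a hypothetical helper; it assumes a spark SparkSession is in scope and that the spark-mllib libsvm data source is available):

def readLibsvmData(): DataFrame = {
  // Resolves to ./src/test/resources/sample_libsvm_data.txt in the test environment
  // and to the s3a:// path in the production environment
  spark.read.format("libsvm").load(Config.get("libsvmData"))
}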
This solution is elegant and does not clutter our application code with environment logic.
You should never write code with different execution paths in the production and test environments
because then your test suite won’t really be testing the actual code that’s run in production.
Overriding config
The Config.test and Config.production maps are defined as variables (with the var keyword), so
they can be overridden.
Giving users the ability to swap out config on the fly makes your codebase more flexible for a variety
of use cases.
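For example, a test could swap in its own value on the fly (the tiny_sample.txt path is hypothetical):

Config.test = Config.test ++ Map(
  "libsvmData" -> new java.io.File("./src/test/resources/tiny_sample.txt").getCanonicalPath
)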
You can update your build.sbt file as follows to set PROJECT_ENV to test whenever the test suite is
run.
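A sketch of that setting; sbt only applies envVars to forked JVMs, so the test run also needs to be forked:

// build.sbt
fork in Test := true
envVars in Test := Map("PROJECT_ENV" -> "test")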
Big thanks to the StackOverflow community for helping me figure this out³⁸.
Other implementations
This StackOverflow thread³⁹ discusses other solutions.
One answer relies on an external library, one is in Java, and one doesn’t allow for overrides.
Next steps
Feel free to extend this solution to account for other environments. For example, you might want to
add a staging environment that uses different paths to test code before it’s run in production.
Just remember to follow best practices and avoid environment specific execution paths - the anti-pattern that litters your codebase and reduces the protection offered by your test suite.
Adding Config objects to your functions adds a dependency you might not want. In a future chapter,
we'll discuss how dependency injection can abstract these Config dependencies and how the Config
object can be leveraged to access smart defaults - the best of both worlds!
³⁸https://stackoverflow.com/questions/39902049/setting-environment-variables-when-running-scala-sbt-test-suite?rq=1
³⁹https://stackoverflow.com/questions/21607745/specific-config-by-environment-in-scala
Building Spark JAR Files with SBT
Spark JAR files let you package a project into a single file so it can be run on a Spark cluster.
A lot of developers write Spark code in browser-based notebooks because they're unfamiliar with
JAR files. Scala is a difficult language, and it's especially challenging when you can't leverage the
development tools provided by an IDE like IntelliJ.
This chapter will demonstrate how to build JAR files with the SBT package and assembly commands
and how to customize the code that's included in JAR files. Hopefully it will help you make the leap
and start writing Spark code in SBT projects with a powerful IDE by your side!
A video walkthrough of this material is available at https://www.youtube.com/embed/0yyw2gD0SrY.
If you run sbt package, SBT will build a thin JAR file that only includes your project files. The thin
JAR file will not include the uJson files.
If you run sbt assembly, SBT will build a fat JAR file that includes both your project files and the
uJson files.
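A minimal build.sbt sketch for that kind of project (names and versions are illustrative):

// build.sbt
name := "spark-sbt-example"
scalaVersion := "2.11.12"

// compile-scope dependency: bundled by sbt assembly, skipped by sbt package
libraryDependencies += "com.lihaoyi" %% "ujson" % "0.6.6"

// provided dependency: supplied by the Spark runtime, excluded from both JAR files
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0" % "provided"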
Let’s dig into the gruesome details!
⁴⁰https://en.wikipedia.org/wiki/JAR_(file_format)
⁴¹https://github.com/sbt/sbt-assembly
Important take-aways:
⁴²https://github.com/MrPowers/spark-daria
⁴³https://github.com/MrPowers/spark-style-guide#jar-files
The sbt-assembly plugin needs to be added to build fat JAR files that include the project’s
dependencies.
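The plugin is wired up in project/plugins.sbt, roughly like this (plugin version is illustrative):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")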
⁴⁴https://github.com/MrPowers/spark-slack
Important observations:
Let’s build the JAR file with sbt assembly and then inspect the content.
For example, the slack-webhook library declares gson as a dependency in its pom.xml⁴⁵:
1 <dependencies>
2 <dependency>
3 <groupId>com.google.code.gson</groupId>
4 <artifactId>gson</artifactId>
5 <version>${gson.version}</version>
6 </dependency>
7 </dependencies>
You'll want to be very careful to minimize your project dependencies. You'll also want to rely on
external libraries that have minimal dependencies themselves, because a library's dependencies
become your dependencies as soon as you add the library to your project.
⁴⁵https://github.com/gpedro/slack-webhook/blob/master/pom.xml#L159-L165
Next Steps
Make sure to always mark your libraryDependencies with “provided” or “test” whenever possible
to keep your JAR files as thin as possible.
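A sketch of what those markers look like in build.sbt (coordinates and versions are illustrative):

// only needed at compile time - the Spark runtime provides it on the cluster
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0" % "provided"

// only needed when running the test suite
libraryDependencies += "com.github.mrpowers" %% "spark-fast-tests" % "0.17.1" % "test"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.5" % "test"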
Only add dependencies when it’s absolutely required and try to avoid libraries that depend on a lot
of other libraries.
It’s very easy to find yourself in dependency hell⁴⁶ with Scala and you should proactively avoid this
uncomfortable situation.
Your Spark runtime environment should generally provide the Scala and Spark dependencies and
you shouldn’t include these in your JAR files.
I fought long and hard to develop the build.sbt strategies outlined in this chapter. Hopefully it
will save you from some headaches!
⁴⁶https://en.wikipedia.org/wiki/Dependency_hell
Shading Dependencies in Spark
Projects with SBT
sbt-assembly makes it easy to shade dependencies in your Spark projects when you create fat JAR
files. This chapter explains why it’s useful to shade dependencies and will teach you how to shade
dependencies in your own projects.
The sbt assembly command will create a JAR file that includes spark-daria and all of the
spark-pika code. The JAR file won’t include the libraryDependencies that are flagged with
“provided” or “test” (i.e. spark-sql, spark-fast-tests, and scalatest won’t be included in the JAR
file). Let's verify the contents of the JAR file with the jar tvf target/scala-2.11/spark-pika_2.11-2.3.1_0.0.1.jar command.
⁴⁷https://github.com/MrPowers/spark-pika
If the spark-pika fat JAR file is attached to a cluster, users will be able to access both the
com.github.mrpowers.spark.daria and com.github.mrpowers.spark.pika namespaces.
We don’t want to provide access to the com.github.mrpowers.spark.daria namespace when
spark-pika is attached to a cluster for two reasons:
1. It just feels wrong. When users attach the spark-pika JAR file to their Spark cluster, they
should only be able to access the spark-pika namespace. Adding additional namespaces to the
classpath is unexpected.
2. It prevents users from accessing a different spark-daria version than what's specified in the
spark-pika build.sbt file. In this example, users are forced to use spark-daria version 2.3.1_0.24.0.
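With sbt-assembly, a shade rule along these lines renames the spark-daria classes inside the fat JAR (the shadedSparkDariaForSparkPika prefix matches the namespace discussed below):

// build.sbt
assemblyShadeRules in assembly := Seq(
  ShadeRule
    .rename("com.github.mrpowers.spark.daria.**" -> "shadedSparkDariaForSparkPika.@1")
    .inAll
)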
Let’s run sbt clean and then rebuild the spark-pika JAR file with sbt assembly. Let’s examine the
contents of the new JAR file with jar tvf target/scala-2.11/spark-pika_2.11-2.3.1_0.0.1.jar.
The JAR file used to contain the com.github.mrpowers.spark.daria namespace and that’s now been
replaced with a shadedSparkDariaForSparkPika namespace.
All the spark-pika references to spark-daria will use the shadedSparkDariaForSparkPika namespace.
Users can attach both spark-daria and spark-pika to the same Spark cluster now and there won’t
be a com.github.mrpowers.spark.daria namespace collision anymore.
Conclusion
When creating Spark libraries, make sure to shade dependencies that are included in the fat JAR file,
so your library users can specify different versions of those dependencies at will. Try your best to
design your libraries to only add a single namespace to the classpath when the JAR file is attached
to a cluster.
Dependency Injection with Spark
Dependency injection is a design pattern that lets you write Spark code that's more flexible and
easier to test.
This chapter shows code with a path dependency and demonstrates how to inject the path
dependency in a backwards compatible manner. It also shows how to inject an entire DataFrame as
a dependency.
1 object Config {
2
3 val test: Map[String, String] = {
4 Map(
5 "stateMappingsPath" -> new java.io.File(s"./src/test/resources/state_mappings.\
6 csv").getCanonicalPath
7 )
8 }
9
10 val production: Map[String, String] = {
11 Map(
12 "stateMappingsPath" -> "s3a://some-fake-bucket/state_mappings.csv"
13 )
14 }
15
16 var environment = sys.env.getOrElse("PROJECT_ENV", "production")
17
18 def get(key: String): String = {
19 if (environment == "test") {
20 test(key)
21 } else {
22 production(key)
23 }
24 }
25
26 }
The chapter on environment specific configuration covers this design pattern in more detail.
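Here's a rough sketch of what a withStateFullName transformation with a hard-wired Config dependency might look like (it assumes a spark SparkSession is in scope and mirrors the injected variants shown later in this chapter):

def withStateFullName()(df: DataFrame): DataFrame = {
  // The path dependency is buried inside the function - this is what we'll inject later
  val stateMappingsDF = spark
    .read
    .option("header", true)
    .csv(Config.get("stateMappingsPath"))
  df
    .join(
      broadcast(stateMappingsDF),
      df("state") <=> stateMappingsDF("state_abbreviation"),
      "left_outer"
    )
    .drop("state_abbreviation")
}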
Let’s create a src/test/resources/state_mappings.csv file, so we can run the withStateFullName
method on some sample data.
1 state_name,state_abbreviation
2 Tennessee,TN
3 New York,NY
4 Mississippi,MS
1 val df = Seq(
2 ("john", 23, "TN"),
3 ("sally", 48, "NY")
4 ).toDF("first_name", "age", "state")
5
6 df
7 .transform(withStateFullName())
8 .show()
9
10 +----------+---+-----+----------+
11 |first_name|age|state|state_name|
12 +----------+---+-----+----------+
13 | john| 23| TN| Tennessee|
14 | sally| 48| NY| New York|
15 +----------+---+-----+----------+
Let’s refactor the withStateFullName so it does not depend on the Config object. In other words,
let’s remove the Config dependency from withStateFullName with the dependency injection design
pattern.
Injecting a path
Let’s create a withStateFullNameInjectPath method that takes the path to the state mappings data
as an argument.
1 def withStateFullNameInjectPath(
2 stateMappingsPath: String = Config.get("stateMappingsPath")
3 )(df: DataFrame): DataFrame = {
4 val stateMappingsDF = spark
5 .read
6 .option("header", true)
7 .csv(stateMappingsPath)
8 df
9 .join(
10 broadcast(stateMappingsDF),
11 df("state") <=> stateMappingsDF("state_abbreviation"),
12 "left_outer"
13 )
14 .drop("state_abbreviation")
15 }
The stateMappingsPath leverages a smart default, so users can easily use the function without
explicitly referring to the path. This code is more flexible because it allows users to override the
smart default and use any stateMappingsPath when running the function.
Let’s rely on the smart default and run this code.
1 val df = Seq(
2 ("john", 23, "TN"),
3 ("sally", 48, "NY")
4 ).toDF("first_name", "age", "state")
5
6 df
7 .transform(withStateFullNameInjectPath())
8 .show()
9
10 +----------+---+-----+----------+
11 |first_name|age|state|state_name|
12 +----------+---+-----+----------+
13 | john| 23| TN| Tennessee|
14 | sally| 48| NY| New York|
15 +----------+---+-----+----------+
Injecting a DataFrame
We can take the injection a step further with a withStateFullNameInjectDF method that takes the entire state mappings DataFrame as an argument.
1 def withStateFullNameInjectDF(
2 stateMappingsDF: DataFrame = spark
3 .read
4 .option("header", true)
5 .csv(Config.get("stateMappingsPath"))
6 )(df: DataFrame): DataFrame = {
7 df
8 .join(
9 broadcast(stateMappingsDF),
10 df("state") <=> stateMappingsDF("state_abbreviation"),
11 "left_outer"
12 )
13 .drop("state_abbreviation")
14 }
This code provides the same functionality and is even more flexible. We can now run the function
with any DataFrame. We can read a Parquet file and run this code or create a DataFrame with toDF
in our test suite.
Let’s override the smart default and run this code in our test suite:
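A sketch of how that might look, building the mappings DataFrame with toDF instead of reading a file:

val stateMappingsDF = Seq(
  ("Tennessee", "TN"),
  ("New York", "NY")
).toDF("state_name", "state_abbreviation")

val df = Seq(
  ("john", 23, "TN"),
  ("sally", 48, "NY")
).toDF("first_name", "age", "state")

df
  .transform(withStateFullNameInjectDF(stateMappingsDF))
  .show()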
Injecting the entire DataFrame as a dependency allows us to test our code without reading from a
file. Avoiding file I/O in your test suite is a great way to make your tests run faster.
This design pattern also makes your tests more readable. Your coworkers won’t need to open up
random CSV files to understand the tests.
Conclusion
Dependency injection can be used to make code that’s more flexible and easier to test.
We went from having code that relied on a CSV file stored in a certain path to code that’s flexible
enough to be run with any DataFrame.
Before productionalizing this code, it’d be a good idea to run some DataFrame validations (on both
the underlying DataFrame and the injected DataFrame) and make the code even more flexible by
making it schema independent.
Make sure to leverage this design pattern so you don’t need to read from CSV / Parquet files in your
test suite anymore!
Broadcasting Maps
Spark makes it easy to broadcast maps and perform hash lookups in a cluster computing environment.
This chapter explains how to broadcast maps and how to use these broadcasted variables in analyses.
Simple example
Suppose you have an ArrayType column with a bunch of first names. You’d like to use a nickname
map to standardize all of the first names.
Here’s how we’d write this code for a single Scala array.
1 import scala.util.Try
2
3 val firstNames = Array("Matt", "Fred", "Nick")
4 val nicknames = Map("Matt" -> "Matthew", "Nick" -> "Nicholas")
5 val res = firstNames.map { (n: String) =>
6 Try { nicknames(n) }.getOrElse(n)
7 }
8 res // equals Array("Matthew", "Fred", "Nicholas")
Let’s create a DataFrame with an ArrayType column that contains a list of first names and then
append a standardized_names column that runs all the names through a Map.
1 import scala.util.Try
2
3 val nicknames = Map("Matt" -> "Matthew", "Nick" -> "Nicholas")
4 val n = spark.sparkContext.broadcast(nicknames)
5
6 val df = spark.createDF(
7 List(
8 (Array("Matt", "John")),
9 (Array("Fred", "Nick")),
10 (null)
11 ), List(
12 ("names", ArrayType(StringType, true), true)
13 )
14 ).withColumn(
15 "standardized_names",
16 array_map((name: String) => Try { n.value(name) }.getOrElse(name))
17 .apply(col("names"))
18 )
19
20 df.show(false)
21
22 +------------+------------------+
23 |names |standardized_names|
24 +------------+------------------+
25 |[Matt, John]|[Matthew, John] |
26 |[Fred, Nick]|[Fred, Nicholas] |
27 |null |null |
28 +------------+------------------+
We use the spark.sparkContext.broadcast() method to broadcast the nicknames map to all nodes
in the cluster.
Spark 2.4 added a transform method that’s similar to the Scala Array.map() method, but this isn’t
easily accessible via the Scala API yet, so we map through all the array elements with the spark-
daria⁴⁸ array_map method.
Note that we need to call n.value() to access the broadcasted value. This is slightly different than
what’s needed when writing vanilla Scala code.
We have some code that works which is a great start. Let’s clean this code up with some good Spark
coding practices.
Refactored code
Let’s wrap the withColumn code in a Spark custom transformation⁴⁹, so it’s more modular and easier
to test.
⁴⁸https://github.com/MrPowers/spark-daria/
⁴⁹https://medium.com/@mrpowers/chaining-custom-dataframe-transformations-in-spark-a39e315f903c
Let's store the nickname to first name mappings in a CSV file:
1 nickname,firstname
2 Matt,Matthew
3 Nick,Nicholas
Now let’s refactor our code to read the CSV into a DataFrame and convert it to a Map before
broadcasting it.
1 import com.github.mrpowers.spark.daria.sql.DataFrameHelpers
2
3 val nicknamesPath = new java.io.File(s"./src/test/resources/nicknames.csv").getCanon\
4 icalPath
5
6 val nicknamesDF = spark
7 .read
8 .option("header", "true")
9 .option("charset", "UTF8")
10 .csv(nicknamesPath)
11
12 val nicknames = DataFrameHelpers.twoColumnsToMap[String, String](
13 nicknamesDF,
14 "nickname",
15 "firstname"
16 )
17
18 val n = spark.sparkContext.broadcast(nicknames)
19
20 def withStandardizedNames(n: org.apache.spark.broadcast.Broadcast[Map[String, String\
21 ]])(df: DataFrame) = {
22 df.withColumn(
23 "standardized_names",
24 array_map((name: String) => Try { n.value(name) }.getOrElse(name))
25 .apply(col("names"))
26 )
27 }
28
29 val df = spark.createDF(
30 List(
31 (Array("Matt", "John")),
32 (Array("Fred", "Nick")),
33 (null)
34 ), List(
35 ("names", ArrayType(StringType, true), true)
36 )
37 ).transform(withStandardizedNames(n))
38
39 df.show(false)
40
41 +------------+------------------+
42 |names |standardized_names|
43 +------------+------------------+
44 |[Matt, John]|[Matthew, John] |
45 |[Fred, Nick]|[Fred, Nicholas] |
46 |null |null |
47 +------------+------------------+
Conclusion
You’ll often want to broadcast small Spark DataFrames when making broadcast joins⁵¹.
This chapter illustrates how broadcasting Spark maps is another powerful design pattern when writing
code that executes on a cluster.
Feel free to broadcast any small lookup variable to all the nodes in the cluster. Broadcasting can yield
big performance gains when the same data is reused by tasks running in parallel on various nodes.
⁵⁰https://github.com/MrPowers/spark-daria
⁵¹https://mungingdata.com/apache-spark/broadcast-joins/
Validating Spark DataFrame Schemas
This chapter demonstrates how to explicitly validate the schema of a DataFrame in custom transformations
so your code is easier to read and provides better error messages.
Spark's lazy evaluation and execution plan optimizations yield amazingly fast results, but they can also
produce cryptic error messages. Schema validations create code that's easier to read, maintain, and debug.
Suppose we have the following DataFrame:
1 +------+---+
2 | name|age|
3 +------+---+
4 |miguel| 80|
5 | liz| 10|
6 +------+---+
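A withIsSeniorCitizen transformation along these lines appends the flag (the age threshold of 65 is an assumption):

def withIsSeniorCitizen()(df: DataFrame): DataFrame = {
  // Relies on the DataFrame having an integer age column
  df.withColumn("is_senior_citizen", df("age") >= 65)
}

Running it on the DataFrame above yields: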
1 +------+---+-----------------+
2 | name|age|is_senior_citizen|
3 +------+---+-----------------+
4 |miguel| 80| true|
5 | liz| 10| false|
6 +------+---+-----------------+
withIsSeniorCitizen assumes that the DataFrame has an age column with the IntegerType. In
this case, the withIsSeniorCitizen transformation’s assumption was correct and the code worked
perfectly ;)
Now suppose a withFullName transformation concatenates the first_name and last_name columns, and we accidentally run it on an animalDF DataFrame that has neither column:
1 +---+
2 |pet|
3 +---+
4 |cat|
5 |dog|
6 +---+
1 animalDF.transform(withFullName())
Without an explicit validation, this code fails with an error message that doesn't clearly state which columns are missing. Let's add a validation with spark-daria's validatePresenceOfColumns so the transformation fails fast with a descriptive message.
1 import com.github.mrpowers.spark.daria.sql.DataFrameValidator
2
3 def withFullName()(df: DataFrame): DataFrame = {
4 validatePresenceOfColumns(df, Seq("first_name", "last_name"))
5 df.withColumn(
6 "full_name",
7 concat_ws(" ", col("first_name"), col("last_name"))
8 )
9 }
1 animalDF.transform(withFullName())
This time the error message explicitly lists the columns that are missing from the DataFrame, which is much easier to debug.
When the num1 and num2 columns contain numerical data, the withSum transformation works as
expected.
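withSum is presumably a one-liner along these lines:

def withSum()(df: DataFrame): DataFrame = {
  df.withColumn("sum", col("num1") + col("num2"))
}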
1 numsDF.transform(withSum()).show()
2
3 +----+----+---+
4 |num1|num2|sum|
5 +----+----+---+
6 | 1| 3| 4|
7 | 7| 8| 15|
8 +----+----+---+
withSum doesn’t work well when the num1 and num2 columns contain strings.
1 wordsDF.transform(withSum()).show()
2 +-----+-----+----+
3 | num1| num2| sum|
4 +-----+-----+----+
5 | one|three|null|
6 |seven|eight|null|
7 +-----+-----+----+
withSum should error out if the num1 and num2 columns aren’t numeric. Let’s refactor the function
to error out with a descriptive error message.
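One rough sketch of the refactor uses spark-daria's validateSchema helper to fail fast when the columns aren't the expected type (the IntegerType requirement is an assumption):

import com.github.mrpowers.spark.daria.sql.DataFrameValidator
// assumes the enclosing object mixes in DataFrameValidator so validateSchema is in scope

def withSum()(df: DataFrame): DataFrame = {
  val requiredSchema = StructType(
    List(
      StructField("num1", IntegerType, true),
      StructField("num2", IntegerType, true)
    )
  )
  // Throws a descriptive error if the DataFrame schema doesn't match
  validateSchema(df, requiredSchema)
  df.withColumn("sum", col("num1") + col("num2"))
}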
1 wordsDF.transform(withSum()).show()
Now the transformation errors out immediately with a message describing the schema it requires, instead of silently producing a null column. Schema validations become even more valuable when several transformations are chained together:
1 val resultDF = df
2 .transform(myFirstTransform()) // one set of assumptions
3 .transform(mySecondTransform()) // more assumptions
4 .transform(myThirdTransform()) // even more assumptions
Debugging order dependent transformations, each with a different set of assumptions, is a nightmare!
Don't torture yourself!
Conclusion
DataFrame schema assumptions should be explicitly documented in the code with validations.
Code that makes its assumptions explicit is easier to read, easier to maintain, and returns more
descriptive error messages.
spark-daria contains the DataFrame validation functions you'll need in your projects. Follow the
spark-daria setup instructions and write DataFrame transformations like this:
1 import com.github.mrpowers.spark.daria.sql.DataFrameValidator
2
3 object MyTransformations extends DataFrameValidator {
4
5 def withStandardizedPersonInfo(df: DataFrame): DataFrame = {
6 val requiredColNames = Seq("name", "age")
7 validatePresenceOfColumns(df, requiredColNames)
8 // some transformation code
9 }
10
11 }
Applications with proper DataFrame schema validations are significantly easier to debug, especially
when complex transformations are chained.