
Writing Beautiful Apache Spark Code

Processing massive datasets with ease

Matthew Powers
This book is for sale at http://leanpub.com/beautiful-spark

This version was published on 2020-02-02

This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing
process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and
many iterations to get reader feedback, pivot until you have the right book and build traction once
you do.

© 2019 - 2020 Matthew Powers


Contents

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Typical painful workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Productionalizing advanced analytics models is hard . . . . . . . . . . . . . . . . . . . . . . . 2
Why Scala? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Who should read this book? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Is this book for data engineers or data scientists? . . . . . . . . . . . . . . . . . . . . . . . . . 3
Beautiful Spark philosophy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
DataFrames vs. RDDs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Spark streaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
The “coalesce test” for evaluating learning resources . . . . . . . . . . . . . . . . . . . . . . . 4
Will we cover the entire Spark SQL API? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
How this book is organized . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Spark programming levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Note about Spark versions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

Running Spark Locally . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6


Starting the console . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Running Scala code in the console . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Accessing the SparkSession in the console . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Console commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

Databricks Community . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Creating a notebook and cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Running some code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Next steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

Introduction to DataFrames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Creating DataFrames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Adding columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Filtering rows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
More on schemas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Creating DataFrames with createDataFrame() . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Next Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

Working with CSV files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19


Reading a CSV file into a DataFrame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Writing a DataFrame to disk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Reading CSV files in Databricks Notebooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

Just Enough Scala for Spark Programmers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24


Scala function basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Currying functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
trait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Implicit classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Next steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

Column Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
A simple example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Instantiating Column objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
gt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
substr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
+ operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
lit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
isNull . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
isNotNull . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
when / otherwise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Next steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

Introduction to Spark SQL functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38


High level review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
lit() function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
when() and otherwise() functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Writing your own SQL function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Next steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

User Defined Functions (UDFs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44


Simple UDF example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Using Column Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

Chaining Custom DataFrame Transformations in Spark . . . . . . . . . . . . . . . . . . . . . . 48


Dataset Transform Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Transform Method with Arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

Whitespace data munging with Spark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51


trim(), ltrim(), and rtrim() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

singleSpace() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
removeAllWhitespace() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

Defining DataFrame Schemas with StructField and StructType . . . . . . . . . . . . . . . . . 56


Defining a schema to create a DataFrame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
StructField . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Defining schemas with the :: operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Defining schemas with the add() method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Common errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
LongType . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Next steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

Different approaches to manually create Spark DataFrames . . . . . . . . . . . . . . . . . . . 63


toDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
createDataFrame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
createDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
How we’ll create DataFrames in this book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

Dealing with null in Spark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66


What is null? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Spark uses null by default sometimes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
nullable Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Native Spark code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Scala null Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
User Defined Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Spark Rules for Dealing with null . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

Using JAR Files Locally . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75


Starting the console with a JAR file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Adding JAR file to an existing console session . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Attaching JARs to Databricks clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

Working with Spark ArrayType columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81


Scala collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Splitting a string into an ArrayType column . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Directly creating an ArrayType column . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
array_contains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
explode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
collect_list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Single column array functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Generic single column array functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Multiple column array functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

Split array column into multiple columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92


Closing thoughts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

Working with Spark MapType Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95


Scala maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Creating MapType columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Fetching values from maps with element_at() . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Appending MapType columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Creating MapType columns from two ArrayType columns . . . . . . . . . . . . . . . . . . . 98
Converting Arrays to Maps with Scala . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Merging maps with map_concat() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Using StructType columns instead of MapType columns . . . . . . . . . . . . . . . . . . . . . 100
Writing MapType columns to disk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

Adding StructType columns to DataFrames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105


StructType overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Appending StructType columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Using StructTypes to eliminate order dependencies . . . . . . . . . . . . . . . . . . . . . . . . 108
Order dependencies can be a big problem in large Spark codebases . . . . . . . . . . . . . . 111

Working with dates and times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112


Creating DateType columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
year(), month(), dayofmonth() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
minute(), second() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
datediff() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
date_add() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Next steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

Performing operations on multiple columns with foldLeft . . . . . . . . . . . . . . . . . . . . 118


foldLeft review in Scala . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Eliminating whitespace from multiple columns . . . . . . . . . . . . . . . . . . . . . . . . . . 118
snake_case all columns in a DataFrame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Wrapping foldLeft operations in custom transformations . . . . . . . . . . . . . . . . . . . . 120
Next steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

Equality Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122


=== . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

Introduction to Spark Broadcast Joins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123


Conceptual overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Simple example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Analyzing physical plans of joins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Eliminating the duplicate city column . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

Diving deeper into explain() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127


Next steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

Partitioning Data in Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130


Intro to partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
coalesce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Increasing partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
repartition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Differences between coalesce and repartition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Real World Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

Partitioning on Disk with partitionBy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136


Memory partitioning vs. disk partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Simple example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
partitionBy with repartition(5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
partitionBy with repartition(1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Partitioning datasets with a max number of files per partition . . . . . . . . . . . . . . . . . 139
Partitioning dataset with max rows per file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Partitioning dataset with max rows per file pre Spark 2.2 . . . . . . . . . . . . . . . . . . . . 141
Small file problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

Fast Filtering with Spark PartitionFilters and PushedFilters . . . . . . . . . . . . . . . . . . . 143


Normal DataFrame filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
partitionBy() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
PartitionFilters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
PushedFilters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Partitioning in memory vs. partitioning on disk . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Disk partitioning with skewed columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Next steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

Scala Text Editing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149


Syntax highlighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Import reminders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Import hints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Argument type checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Flagging unnecessary imports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
When to use text editors and Databricks notebooks? . . . . . . . . . . . . . . . . . . . . . . . 153

Structuring Spark Projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154


Project name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Package naming convention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Typical library structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

Introduction to SBT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155


Sample code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Running SBT commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
build.sbt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
libraryDependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
sbt test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
sbt doc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
sbt console . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
sbt package / sbt assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
sbt clean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Next steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

Managing the SparkSession, The DataFrame Entry Point . . . . . . . . . . . . . . . . . . . . . 159


Accessing the SparkSession . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Example of using the SparkSession . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Creating a DataFrame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Reading a DataFrame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Creating a SparkSession . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Reusing the SparkSession in the test suite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
SparkContext . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

Testing Spark Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165


Hello World Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Testing a User Defined Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
A Real Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
How Testing Improves Your Codebase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Running a Single Test File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

Environment Specific Config in Spark Scala Projects . . . . . . . . . . . . . . . . . . . . . . . . 175


Basic use case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Environment specific code antipattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Overriding config . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Setting the PROJECT_ENV variable for test runs . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Other implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Next steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

Building Spark JAR Files with SBT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179


JAR File Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Building a Thin JAR File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
Building a Fat JAR File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Next Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

Shading Dependencies in Spark Projects with SBT . . . . . . . . . . . . . . . . . . . . . . . . . 185



When shading is useful . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185


How to shade the spark-daria dependency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

Dependency Injection with Spark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188


Code with a dependency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Injecting a path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
Injecting an entire DataFrame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192

Broadcasting Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193


Simple example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Refactored code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Building Maps from data files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197

Validating Spark DataFrame Schemas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198


Custom Transformations Refresher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
A Custom Transformation Making a Bad Assumption . . . . . . . . . . . . . . . . . . . . . . 199
Column Presence Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
Full Schema Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
Documenting DataFrame Assumptions is Especially Important for Chained DataFrame
Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Introduction
It’s easy to follow internet tutorials and write basic Spark code in browser editors, but it’s hard to
write Spark code that’s readable, maintainable, debuggable, and testable.
Spark error messages can be extremely difficult to decipher. You can spend hours tracking down
bugs in Spark codebases, especially if your code is messy.
You might also write jobs that run for hours and then fail for unknown reasons. Getting jobs like
these to execute successfully can take days of trial and error.
The practices outlined in this book will save you a lot of time:

• Avoiding Spark design patterns that can cause errors
• Reusing functions across your organization
• Identifying bottlenecks before running production jobs
• Catching bugs in the testing environment

Typical painful workflow


Suppose you’d like to build a machine learning model on top of a messy dataset. You use Spark to
take a sample of the data and write data cleaning code. Then you write machine learning code that’ll
run on the clean sample data set.
Your preliminary model results look good, so you run your model on a production-sized dataset.
The job blows up quickly with a NullPointerException. Looks like there are some input values in
the production data that weren’t in the sample dataset. Because Spark error messages are hard to
decipher, you spend a long time figuring out what part of the data cleaning code is actually erroring
out.
You kick off the job again. This time the job errors out with an “Out of Memory” exception. You
don’t really know why your job causes a memory exception — the cluster RAM is greater than the
dataset size — but you try resizing the cluster to use bigger nodes, and that seems to help.
You kick off the job a third time. Now things seem to be running fine — or are they? You thought
it’d execute in a few hours, but it’s still running after 5 hours. Your workday is done and you decide
to keep the job running overnight rather than destroy all your progress.
You come to work the next day and, to your surprise, your job is still running after 21 hours! Worse
yet, you have absolutely no idea how to identify the code bottleneck.
You begin a multi-day process of trying to productionalize the model. You tweak some code, rerun
the model, and run a new version every day. If you’re lucky, you might be able to pull the right
levers and get the model to run after a few iterations. Maybe you’ll throw your arms up in disgust
after a week of failure.

Productionalizing advanced analytics models is hard


Building models on big datasets is difficult. The bigger the data, the harder the challenge.
The principles outlined in this book will make it easier to build big data models. These best practices
will also save you from the “silly bugs,” so you can jump right to the difficult optimizations without
wasting any iterations.

Why Scala?
Spark offers Scala, Python, Java, and R APIs. This book covers only the Scala API.
The best practices for each language are quite different. Entire chapters of this book are irrelevant
for SparkR and PySpark users.
The best Spark API for an organization depends on the team’s makeup - a group with lots of Python
experience should probably use the PySpark API.
Email me if you’d like a book on writing beautiful PySpark or SparkR code and I’ll take it into
consideration.
Scala is great for Spark for a variety of reasons:

• Spark is written in Scala
• The Scala Dataset#transform method makes it easy to chain custom transformations
• Lots of examples and Stack Overflow questions are in Scala

Who should read this book?


Spark newbies and experienced Spark programmers will both find this book useful.
Noobs will learn how to write Spark code properly right off the bat and avoid wasting time chasing
spaghetti code bugs.
Experienced Spark coders will learn how to use best practices to write better code, publish internal
libraries, and become the Spark superstar at their company.
Some of your coworkers might be copy-pasting code snippets from one Databricks notebook to
another. You’ll be their savior when you publish a JAR of helper functions that’s easily accessible
by the whole company.

Is this book for data engineers or data scientists?


Data scientists, data engineers, and less technical Spark users will all find this book useful.
All Spark users should know the basics, the parts of the Spark API to avoid, how libraries should be
structured, and how to properly structure production jobs.
Data engineers should build internal libraries and hand over well documented code to data scientists,
so the data scientists can focus on modelling.
Data scientists might want to pass this book along to data engineers and ask them to follow the
design principles and start building some great libraries. Data scientists need to understand how
great Spark libraries are structured to ask for them ;)

Beautiful Spark philosophy


Spark code should generally be organized as custom DataFrame transformations or column functions.
Spark functions shouldn’t depend explicitly on external data files and shouldn’t perform file I/O.
In functional programming terminology, Spark functions should be “pure functions” void of “side
effects.”
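For example, a custom DataFrame transformation that depends only on its input DataFrame is easy to reuse and test. Here’s a minimal sketch (the withGreeting name and column are purely illustrative; custom transformations are covered in depth later in the book):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// a "pure" Spark function: no file I/O, and the output depends only on the input DataFrame
def withGreeting()(df: DataFrame): DataFrame = {
  df.withColumn("greeting", lit("hello"))
}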
Spark codebases should make minimal use of advanced Scala programming features. Scala is a
complex language that can be used as a functional programming language or an object-oriented
programming language. Spark codebases shouldn’t use either main Scala programming style - they
should use a minimal subset of the Scala programming language.
In fact, a lot of typical Scala anti-patterns for “high-quality” Scala codebases are perfectly fine in a
Spark codebase.
Spark codebases are often worked on by folks who aren’t Scala experts. Lots of Spark programmers
use Scala only because it’s the language Spark itself is written in.
Spark code gets complicated quickly enough, even when only simple language features are used.
Cluster computing and machine learning are sufficiently complex without any fancy language
features.
Organizations should develop libraries with helper functions that are useful for a variety of analyses.
Spark applications should depend on libraries. A production job run typically entails invoking library
functions plus some application-specific logic.

DataFrames vs. RDDs


This book covers only the DataFrame API. The DataFrame API is generally faster and easier to work
with than the low-level RDD API.

You shouldn’t use the RDD API unless you have a specific optimization that requires you to operate at
a lower level (or if you’re forced to work with Spark 1). Most users will never need to use the RDD
API.
It’s best to master the DataFrame API before thinking about RDDs.

Spark streaming
Lots of analyses can be performed in batch mode, so streaming isn’t relevant for all Spark users.
While Spark streaming is important for users that need to perform analyses in real time, it’s
important to learn the material in this book before diving into the streaming API. Streaming is
complex. Testing streaming applications is hard. You’ll struggle with streaming if you don’t have a
solid understanding of the basics.
Accordingly, this book does not cover streaming.

Machine learning
Advanced Analytics with Spark¹ is a great book on building Spark machine learning models with
the Scala API.
You should read this book first and then read Advanced Analytics with Spark if you’re interested in
building machine learning models with Spark.

The “coalesce test” for evaluating learning resources


The coalesce method reduces the number of memory partitions in a DataFrame without performing a
full shuffle. It’s especially useful after filtering a DataFrame or when compacting small files on disk.
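For example, after filtering a large DataFrame down to a small result you might coalesce before writing, so Spark doesn’t write out lots of tiny files. Here’s a minimal sketch (df and the output path are illustrative):

import org.apache.spark.sql.functions.col

// df is assumed to be some large DataFrame that has already been loaded
val smallResultDF = df.filter(col("country") === "Argentina")

// collapse the many mostly-empty memory partitions before writing
smallResultDF
  .coalesce(1)
  .write
  .csv("/some/output/path")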
Most Spark training materials talk about the coalesce method, but don’t provide any context. They
say “the coalesce method takes one argument that’s an integer…”. They provide a narrative that sits
on top of the API documentation.
The “coalesce test” checks whether a learning resource provides context when discussing the coalesce
method or merely restates the API documentation as a narrative.
Learning resources that provide important context will make you a stronger programmer. Reading
API documentation helps you understand the available methods, but doesn’t let you know why
certain methods exist.
I hope this book passes the coalesce test - I want you to understand the concepts that are critical for
writing great Spark code.
¹https://www.amazon.com/Advanced-Analytics-Spark-Patterns-Learning/dp/1491972955

Will we cover the entire Spark SQL API?


The Spark SQL API contains hundreds of methods, and most users will work with only a small
subset of the API.
Some users will spend lots of time manipulating dates whereas other users will spend most of their
time with mathematical functions.
This book focuses on the subset of the API that all Spark users need to master. After reading the
book, you’ll be good enough at Spark to read the API docs and figure out what methods you need
for your specific use case.

How this book is organized


The book is structured in three high-level parts:

• Spark fundamentals
• Building libraries and applications
• Practical job tuning

Spark programming levels


Spark programmers progress through these development stages:

• Level 1: writing notebook queries
• Level 2: writing Spark code in a text editor and packaging JAR files
• Level 3: following best practices and testing code
• Level 4: creating high quality private and public libraries
• Level 5: deep understanding of Spark fundamentals
• Level 6: Spark open source contributor
• Level 7: Spark core contributor

This book focuses on Levels 1-4.


Some books jump to Level 5 WAY too fast and leave the readers feeling overwhelmed.
Most books skip Levels 2, 3, and 4 completely and don’t give readers the practical skills to build
Spark libraries.
This book is sequential and doesn’t make any conceptual leaps.

Note about Spark versions


This book is written with Spark 2.
Running Spark Locally
You’ll need to have a good workflow for running Spark locally to get through the examples in this
book. This chapter explains how to download Spark and run commands in your Terminal.
We’ll talk about the best way to run Spark code locally in later chapters. For now, focus on getting
Spark running on your local machine.

Starting the console


Download Spark² and run the spark-shell executable command to start the Spark console.
I store my Spark versions in the ~/Documents/spark directory, so I can start my Spark shell with
this command.

1 bash ~/Documents/spark/spark-2.3.0-bin-hadoop2.7/bin/spark-shell

Running Scala code in the console


Start by running some simple Scala commands in the console:

1 scala> 2 + 3
2 res0: Int = 5

Let’s perform some string concatenation:

1 scala> val name = "Matthew"
2 name: String = Matthew
3
4 scala> "My name is " + name
5 res1: String = My name is Matthew

The “Spark console” is really just a Scala console that preloads all of the Spark libraries.
²https://spark.apache.org/downloads.html

Accessing the SparkSession in the console


On startup, your console session will initialize a global spark variable, which you can use to access
the SparkSession.
The SparkSession enables many features. For example, you can load data from a CSV file on your
local machine into a DataFrame (more on DataFrames later):

1 val df = spark.read.csv("/Users/powers/Documents/tmp/data/silly_file.csv")

You can then view the contents of the DataFrame:

1 df.show()
2 // +-------+--------------+
3 // | person|   silly_level|
4 // +-------+--------------+
5 // |      a|            10|
6 // |      b|             5|
7 // +-------+--------------+

Console commands
The :quit command stops the console.
The :paste command lets you enter multiple lines of code at once. Here’s an example:

1 scala> :paste
2 // Entering paste mode (ctrl-D to finish)
3
4 val y = 5
5 val x = 10
6 x + y
7
8 // Exiting paste mode, now interpreting.
9
10 y: Int = 5
11 x: Int = 10
12 res8: Int = 15

Always use the :paste command when copying examples from this book into your console!
The :help command lists all the available console commands. Here’s a full list of all the console
commands:

1 scala> :help
2 All commands can be abbreviated, e.g., :he instead of :help.
3 :edit <id>|<line> edit history
4 :help [command] print this summary or command-specific help
5 :history [num] show the history (optional num is commands to show)
6 :h? <string> search the history
7 :imports [name name ...] show import history, identifying sources of names
8 :implicits [-v] show the implicits in scope
9 :javap <path|class> disassemble a file or class name
10 :line <id>|<line> place line(s) at the end of history
11 :load <path> interpret lines in a file
12 :paste [-raw] [path] enter paste mode or paste a file
13 :power enable power user mode
14 :quit exit the interpreter
15 :replay [options] reset the repl and replay all previous commands
16 :require <path> add a jar to the classpath
17 :reset [options] reset the repl to its initial state, forgetting all session\
18 entries
19 :save <path> save replayable session to a file
20 :sh <command line> run a shell command (result is implicitly => List[String])
21 :settings <options> update compiler options, if possible; see reset
22 :silent disable/enable automatic printing of results
23 :type [-v] <expr> display the type of an expression without evaluating it
24 :kind [-v] <expr> display the kind of expression's type
25 :warnings show the suppressed warnings from the most recent line whic\
26 h had any

This Stackoverflow answer³ contains a good description of the available console commands.
³https://stackoverflow.com/a/32808382/1125159
Databricks Community
Databricks provides a wonderful browser-based interface for running Spark code. You can skip this
chapter if you’re happy running Spark code locally in your console, but I recommend trying out
both workflows (the Spark console and Databricks) and seeing which one you prefer.

Creating a notebook and cluster


This link⁴ describes how to create a free Databricks community account and run Spark code in your
browser.
Sign in with your username and password when you create an account:

Databricks Sign in

Click the Workspace button on the left:


⁴https://databricks.com/product/faq/community-edition

Click workspace button

Click Shared and press the workbook button:

Click Shared

Create a Scala notebook:



Create Scala notebook

Once you have a notebook created, click Create a cluster:

Create cluster

Give the cluster a name and then click the Create Cluster button.

Create cluster button

Go back to your notebook and attach the cluster.

Attach cluster

Running some code


Let’s add 3 and 2:

Run some code

Now let’s demonstrate that we can access the SparkSession via the spark variable.

Access the SparkSession

Next steps
You’re now able to run Spark code in the browser.
Let’s start writing some real code!
Introduction to DataFrames
Spark DataFrames are similar to tables in relational databases. They store data in columns and rows
and support a variety of operations to manipulate the data.
Here’s an example of a DataFrame that contains information about cities.

city      country     population
Boston    USA         0.67
Dubai     UAE         3.1
Cordoba   Argentina   1.39

This chapter will discuss creating DataFrames, defining schemas, adding columns, and filtering rows.

Creating DataFrames
You can import the spark implicits library and create a DataFrame with the toDF() method.

1 import spark.implicits._
2
3 val df = Seq(
4 ("Boston", "USA", 0.67),
5 ("Dubai", "UAE", 3.1),
6 ("Cordoba", "Argentina", 1.39)
7 ).toDF("city", "country", "population")

Run this code in the Spark console by running the :paste command, pasting the code snippet, and
then pressing ctrl-D.
Run this code in the Databricks browser notebook by pasting the code in a cell and clicking run cell.
You can view the contents of a DataFrame with the show() method.

1 df.show()

1 +-------+---------+----------+
2 | city| country|population|
3 +-------+---------+----------+
4 | Boston| USA| 0.67|
5 | Dubai| UAE| 3.1|
6 |Cordoba|Argentina| 1.39|
7 +-------+---------+----------+

Each DataFrame column has name, dataType and nullable properties. The column can contain null
values if the nullable property is set to true.
The printSchema() method provides an easily readable view of the DataFrame schema.

1 df.printSchema()

1 root
2 |-- city: string (nullable = true)
3 |-- country: string (nullable = true)
4 |-- population: double (nullable = false)

Adding columns
Columns can be added to a DataFrame with the withColumn() method.
Let’s add an is_big_city column to the DataFrame that returns true if the city contains more than
one million people.

1 import org.apache.spark.sql.functions.col
2
3 val df2 = df.withColumn("is_big_city", col("population") > 1)
4 df2.show()

1 +-------+---------+----------+-----------+
2 | city| country|population|is_big_city|
3 +-------+---------+----------+-----------+
4 | Boston| USA| 0.67| false|
5 | Dubai| UAE| 3.1| true|
6 |Cordoba|Argentina| 1.39| true|
7 +-------+---------+----------+-----------+

DataFrames are immutable, so the withColumn() method returns a new DataFrame. withColumn()
does not mutate the original DataFrame. Let’s confirm that df is still the same with df.show().

1 +-------+---------+----------+
2 | city| country|population|
3 +-------+---------+----------+
4 | Boston| USA| 0.67|
5 | Dubai| UAE| 3.1|
6 |Cordoba|Argentina| 1.39|
7 +-------+---------+----------+

df does not contain the is_big_city column, so we’ve confirmed that withColumn() did not mutate
df.

Filtering rows
The filter() method removes rows from a DataFrame.

1 df.filter(col("population") > 1).show()

1 +-------+---------+----------+
2 | city| country|population|
3 +-------+---------+----------+
4 | Dubai| UAE| 3.1|
5 |Cordoba|Argentina| 1.39|
6 +-------+---------+----------+

It’s a little hard to read code with multiple method calls on the same line, so let’s break this code up
across multiple lines.

1 df
2 .filter(col("population") > 1)
3 .show()

We can also assign the filtered DataFrame to a separate variable rather than chaining method calls.

1 val filteredDF = df.filter(col("population") > 1)
2 filteredDF.show()

More on schemas
Once again, the DataFrame schema can be pretty printed to the console with the printSchema()
method. The schema method returns a code representation of the DataFrame schema.

1 df.schema

1 StructType(
2 StructField(city, StringType, true),
3 StructField(country, StringType, true),
4 StructField(population, DoubleType, false)
5 )

Each column of a Spark DataFrame is modeled as a StructField object with name, dataType, and
nullable properties. The entire DataFrame schema is modeled as a StructType, which is a collection
of StructField objects.
Let’s create a schema for a DataFrame that has first_name and age columns.

1 import org.apache.spark.sql.types._
2
3 StructType(
4 Seq(
5 StructField("first_name", StringType, true),
6 StructField("age", DoubleType, true)
7 )
8 )

Spark’s programming interface makes it easy to define the exact schema you’d like for your
DataFrames.

Creating DataFrames with createDataFrame()


The toDF() method for creating Spark DataFrames is quick, but it’s limited because it doesn’t let you
define your schema (it infers the schema for you). The createDataFrame() method lets you define
your DataFrame schema.

1 import org.apache.spark.sql.types._
2 import org.apache.spark.sql.Row
3
4 val animalData = Seq(
5 Row(30, "bat"),
6 Row(2, "mouse"),
7 Row(25, "horse")
8 )
9
10 val animalSchema = List(
11 StructField("average_lifespan", IntegerType, true),
12 StructField("animal_type", StringType, true)
13 )
14
15 val animalDF = spark.createDataFrame(
16 spark.sparkContext.parallelize(animalData),
17 StructType(animalSchema)
18 )
19
20 animalDF.show()

1 +----------------+-----------+
2 |average_lifespan|animal_type|
3 +----------------+-----------+
4 | 30| bat|
5 | 2| mouse|
6 | 25| horse|
7 +----------------+-----------+

We can use the animalDF.printSchema() method to confirm that the schema was created as
specified.

1 root
2 |-- average_lifespan: integer (nullable = true)
3 |-- animal_type: string (nullable = true)

Next Steps
DataFrames are the fundamental building blocks of Spark. All machine learning and streaming
analyses are built on top of the DataFrame API.
Now let’s look at how to build functions to manipulate DataFrames.
Working with CSV files
CSV files are great for learning Spark.
When building big data systems, you’ll generally want to use a more sophisticated file format like
Parquet or Avro, but we’ll stick with CSVs in this book because they’re human readable.
Once you learn how to use CSV files, it’s easy to use other file formats.
Later chapters in the book will cover CSV and other file formats in more detail.

Reading a CSV file into a DataFrame


Let’s create a CSV file with this path: ~/Documents/cat_data/file1.txt.
The file should contain this data:

1 cat_name,cat_age
2 fluffy,4
3 spot,3

Let’s read this file into a DataFrame:

1 val path = "/Users/powers/Documents/cat_data/file1.txt"
2 val df = spark.read.option("header", "true").csv(path)

Let’s print the contents of the DataFrame:

1 df.show()
2
3 +--------+-------+
4 |cat_name|cat_age|
5 +--------+-------+
6 | fluffy| 4|
7 | spot| 3|
8 +--------+-------+

Let’s also inspect the schema of the DataFrame:



1 df.printSchema()
2
3 root
4 |-- cat_name: string (nullable = true)
5 |-- cat_age: string (nullable = true)

Spark infers that the columns are strings.


You can also manually set the schema of a CSV when loading it into a DataFrame.
In later chapters, we’ll explain how to instruct Spark to load in the cat_age column as an integer.
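Here’s a quick preview of what that looks like (a minimal sketch; defining schemas with StructType and StructField is covered properly in a later chapter):

import org.apache.spark.sql.types._

val catSchema = StructType(
  Seq(
    StructField("cat_name", StringType, true),
    StructField("cat_age", IntegerType, true)
  )
)

val typedDF = spark.read
  .option("header", "true")
  .schema(catSchema)
  .csv(path)

typedDF.printSchema()

// root
//  |-- cat_name: string (nullable = true)
//  |-- cat_age: integer (nullable = true)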

Writing a DataFrame to disk


Let’s add a speak column to the DataFrame and write the data to disk.

1 import org.apache.spark.sql.functions.lit
2
3 df
4 .withColumn("speak", lit("meow"))
5 .write
6 .csv("/Users/powers/Documents/cat_output1")

The cat_output1 folder contains the following files after the data is written:

1 cat_output1/
2 _SUCCESS
3 part-00000-db62f6a7-4efe-4396-9fbb-4caa6aced93e-c000.csv

In this small example, Spark wrote only one file. Spark typically writes out many files in parallel.
We’ll revisit writing files in detail after the chapter on memory partitioning.
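If you’d like to see several files written in parallel, you can increase the number of memory partitions before writing. Here’s a minimal sketch that continues the example above (the repartition call and the cat_output2 path are illustrative; memory partitioning gets its own chapter):

df
  .withColumn("speak", lit("meow"))
  .repartition(3)
  .write
  .csv("/Users/powers/Documents/cat_output2")

// the output folder will contain one part file per non-empty partition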

Reading CSV files in Databricks Notebooks


We can also upload the CSV file to Databricks and read the file into a browser notebook.
Sign in to Databricks and click the Data tab so you can upload a file:

Upload file blank form

Once you upload the file, Databricks will show you the file path that can be used to access the data.

Uploaded file blank

Let’s read this uploaded CSV file into a DataFrame and then display the contents.

Read CSV file


Just Enough Scala for Spark Programmers
Spark programmers need to know only a small subset of the Scala language to be productive.
Scala has a reputation for being a difficult language to learn and that scares some developers away
from Spark. This guide covers the Scala language features needed for Spark programmers.
Spark programmers need to know how to write Scala functions, encapsulate functions in objects,
and namespace objects in packages. It’s not a lot to learn - I promise!

Scala function basics


This section describes how to write vanilla Scala functions and Spark SQL functions.
Here is a Scala function that adds two numbers:

1 def sum(num1: Int, num2: Int): Int = {
2 num1 + num2
3 }

We can invoke this function as follows:

1 sum(10, 5) // returns 15

Let’s write a Spark SQL function that adds two numbers together:

1 import org.apache.spark.sql.Column
2
3 def sumColumns(num1: Column, num2: Column): Column = {
4 num1 + num2
5 }

Let’s create a DataFrame in the Spark shell and run the sumColumns() function.

1 val numbersDF = Seq(
2 (10, 4),
3 (3, 4),
4 (8, 4)
5 ).toDF("some_num", "another_num")
6
7 numbersDF
8 .withColumn(
9 "the_sum",
10 sumColumns(col("some_num"), col("another_num"))
11 )
12 .show()

1 +--------+-----------+-------+
2 |some_num|another_num|the_sum|
3 +--------+-----------+-------+
4 | 10| 4| 14|
5 | 3| 4| 7|
6 | 8| 4| 12|
7 +--------+-----------+-------+

Spark SQL functions take org.apache.spark.sql.Column arguments whereas vanilla Scala functions
take native Scala data type arguments like Int or String.

Currying functions
Scala allows for functions to take multiple parameter lists, which is formally known as currying. This
section explains how to use currying with vanilla Scala functions and why currying is important for
Spark programmers.

1 def myConcat(word1: String)(word2: String): String = {
2 word1 + word2
3 }

Here’s how to invoke the myConcat() function.

1 myConcat("beautiful ")("picture") // returns "beautiful picture"

myConcat() is invoked with two sets of arguments.

Spark has a Dataset#transform() method that makes it easy to chain DataFrame transformations.
Here’s an example of a DataFrame transformation function:

1 import org.apache.spark.sql.DataFrame
2
3 def withCat(name: String)(df: DataFrame): DataFrame = {
4 df.withColumn("cat", lit(s"$name meow"))
5 }

DataFrame transformation functions can take an arbitrary number of arguments in the first
parameter list and must take a single DataFrame argument in the second parameter list.
Let’s create a DataFrame in the Spark shell and run the withCat() function.

1 val stuffDF = Seq(
2 ("chair"),
3 ("hair"),
4 ("bear")
5 ).toDF("thing")
6
7 stuffDF
8 .transform(withCat("darla"))
9 .show()

1 +-----+----------+
2 |thing| cat|
3 +-----+----------+
4 |chair|darla meow|
5 | hair|darla meow|
6 | bear|darla meow|
7 +-----+----------+

Most Spark code can be organized as Spark SQL functions or as custom DataFrame transformations.

object
Spark functions can be stored in objects.
Let’s create a SomethingWeird object that defines a vanilla Scala function, a Spark SQL function, and
a custom DataFrame transformation.

1 import org.apache.spark.sql.functions._
2 import org.apache.spark.sql.{Column, DataFrame}
3
4 object SomethingWeird {
5
6 // vanilla Scala function
7 def hi(): String = {
8 "welcome to planet earth"
9 }
10
11 // Spark SQL function
12 def trimUpper(col: Column) = {
13 trim(upper(col))
14 }
15
16 // custom DataFrame transformation
17 def withScary()(df: DataFrame): DataFrame = {
18 df.withColumn("scary", lit("boo!"))
19 }
20
21 }

Let’s create a DataFrame in the Spark shell and run the trimUpper() and withScary() functions.

1 val wordsDF = Seq(
2 ("niCE"),
3 (" CaR"),
4 ("BAR ")
5 ).toDF("word")
6
7 wordsDF
8 .withColumn("trim_upper_word", SomethingWeird.trimUpper(col("word")))
9 .transform(SomethingWeird.withScary())
10 .show()

1 +-----+---------------+-----+
2 | word|trim_upper_word|scary|
3 +-----+---------------+-----+
4 | niCE| NICE| boo!|
5 | CaR| CAR| boo!|
6 |BAR | BAR| boo!|
7 +-----+---------------+-----+

Objects are useful for grouping related Spark functions.

trait
Traits can be mixed into objects to add commonly used methods or values. We can define
a SparkSessionWrapper trait that defines a spark variable to give objects easy access to the
SparkSession object.

1 import org.apache.spark.sql.SparkSession
2
3 trait SparkSessionWrapper extends Serializable {
4
5 lazy val spark: SparkSession = {
6 SparkSession.builder().master("local").appName("spark session").getOrCreate()
7 }
8
9 }

The Serializable trait is mixed into the SparkSessionWrapper trait.


Let’s create a SpecialDataLake object that mixes in the SparkSessionWrapper trait to provide easy
access to a data lake.

1 object SpecialDataLake extends SparkSessionWrapper {
2
3 def dataLake(): DataFrame = {
4 spark.read.parquet("some_secret_s3_path")
5 }
6
7 }
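Any object that mixes in the trait gets the same lazily instantiated SparkSession. Here’s a hypothetical usage sketch (the EtlJob name and input path are illustrative):

object EtlJob extends SparkSessionWrapper {

  def run(): Unit = {
    // spark is provided by the SparkSessionWrapper trait
    val df = spark.read.parquet("some_input_path")
    df.show()
  }

}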

package

Packages are used to namespace Scala code. Per the Databricks Scala style guide⁵, packages should
follow Java naming conventions.
For example, the Databricks spark-redshift⁶ project uses the com.databricks.spark.redshift
namespace.
The Spark project uses the org.apache.spark namespace. spark-daria⁷ uses the com.github.mrpowers.spark.daria
namespace.
Here’s an example of code that’s defined in a package in spark-daria:

1 package com.github.mrpowers.spark.daria.sql
2
3 import org.apache.spark.sql.Column
4 import org.apache.spark.sql.functions._
5
6 object functions {
7
8 def singleSpace(col: Column): Column = {
9 trim(regexp_replace(col, " +", " "))
10 }
11
12 }

The package structure should mimic the file structure of the project.
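For the functions object above, that means a source tree roughly like this (a sketch of the conventional sbt layout; only the relevant path is shown):

src/
  main/
    scala/
      com/
        github/
          mrpowers/
            spark/
              daria/
                sql/
                  functions.scala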

Implicit classes
Implicit classes can be used to extend Spark core classes with additional methods.
Let’s add a lower() method to the Column class that converts all the strings in a column to lower
case.

⁵https://github.com/databricks/scala-style-guide#naming-convention
⁶https://github.com/databricks/spark-redshift
⁷https://github.com/MrPowers/spark-daria

1 package com.github.mrpowers.spark.daria.sql
2
3 import org.apache.spark.sql.Column
4
5 object FunctionsAsColumnExt {
6
7 implicit class ColumnMethods(col: Column) {
8
9 def lower(): Column = {
10 org.apache.spark.sql.functions.lower(col)
11 }
12
13 }
14
15 }

After running import com.github.mrpowers.spark.daria.sql.FunctionsAsColumnExt._, you can


run the lower() method directly on column objects.

1 col("some_string").lower()

Implicit classes should be avoided in general. I only monkey patch core classes in the spark-daria⁸
project. Feel free to send pull requests if you have any good ideas for other extensions.

Next steps
There are a couple of other Scala features that are useful when writing Spark code, but this chapter
covers 90%+ of common use cases.
You don’t need to understand functional programming or advanced Scala language features to be a
productive Spark programmer.
In fact, staying away from UDFs and native Scala code is a best practice.
Focus on mastering the native Spark API and you’ll be a productive big data engineer in no time!
⁸https://github.com/MrPowers/spark-daria/
Column Methods
The Spark Column class⁹ defines a variety of column methods for manipulating DataFrames.
This chapter demonstrates how to instantiate Column objects and how to use the most important
Column methods.

A simple example
Let’s create a DataFrame with superheros and their city of origin.

1 val df = Seq(
2 ("thor", "new york"),
3 ("aquaman", "atlantis"),
4 ("wolverine", "new york")
5 ).toDF("superhero", "city")

Let’s use the startsWith() column method to identify all cities that start with the word new:

1 df
2 .withColumn("city_starts_with_new", $"city".startsWith("new"))
3 .show()

1 +---------+--------+--------------------+
2 |superhero| city|city_starts_with_new|
3 +---------+--------+--------------------+
4 | thor|new york| true|
5 | aquaman|atlantis| false|
6 |wolverine|new york| true|
7 +---------+--------+--------------------+

The $"city" part of the code creates a Column object. Let’s look at all the different ways to create
Column objects.
⁹http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column

Instantiating Column objects


Column objects must be created to run Column methods.
A Column object corresponding with the city column can be created using the following three
syntaxes:

1. $"city"
2. df("city")
3. col("city") (must run import org.apache.spark.sql.functions.col first)

Column objects are commonly passed as arguments to SQL functions (e.g. upper($"city")).
We will create column objects in all the examples that follow.
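Here's a quick sketch showing that the three syntaxes are interchangeable (this assumes the df defined above and that col has been imported from org.apache.spark.sql.functions):

df.select($"city").show()
df.select(df("city")).show()
df.select(col("city")).show()
// all three select the same city column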

gt
Let’s create a DataFrame with an integer column so we can run some numerical column methods.

1 val df = Seq(
2 (10, "cat"),
3 (4, "dog"),
4 (7, null)
5 ).toDF("num", "word")

Let’s use the gt() method (stands for greater than) to identify all rows with a num greater than five.

1 df
2 .withColumn("num_gt_5", col("num").gt(5))
3 .show()

1 +---+----+--------+
2 |num|word|num_gt_5|
3 +---+----+--------+
4 | 10| cat| true|
5 | 4| dog| false|
6 | 7|null| true|
7 +---+----+--------+

Scala methods can be invoked without dot notation, so this code works as well:

1 df
2 .withColumn("num_gt_5", col("num") gt 5)
3 .show()

We can also use the > operator to perform “greater than” comparisons:

1 df
2 .withColumn("num_gt_5", col("num") > 5)
3 .show()

substr
Let’s use the substr() method to create a new column with the first two letters of the word column.

1 df
2 .withColumn("word_first_two", col("word").substr(0, 2))
3 .show()

1 +---+----+--------------+
2 |num|word|word_first_two|
3 +---+----+--------------+
4 | 10| cat| ca|
5 | 4| dog| do|
6 | 7|null| null|
7 +---+----+--------------+

Notice that the substr() method returns null when it’s supplied null as input. All other Column
methods and SQL functions behave similarly (i.e. they return null when the input is null).
Your functions should handle null input gracefully and return null when they’re supplied null as
input.
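For example, upper() also passes null straight through. Here's a quick sketch that reuses the same df (assuming upper and col are imported from org.apache.spark.sql.functions):

df
  .withColumn("word_upper", upper(col("word")))
  .show()
// the row with a null word gets a null word_upper, just like with substr()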

+ operator
Let’s use the + operator to add five to the num column.

1 df
2 .withColumn("num_plus_five", col("num").+(5))
3 .show()

1 +---+----+-------------+
2 |num|word|num_plus_five|
3 +---+----+-------------+
4 | 10| cat| 15|
5 | 4| dog| 9|
6 | 7|null| 12|
7 +---+----+-------------+

We can also skip the dot notation when invoking the function.

1 df
2 .withColumn("num_plus_five", col("num") + 5)
3 .show()

The syntactic sugar makes it harder to see that + is a method defined in the Column class. Take a
look at the docs¹⁰ to convince yourself that the + operator is defined in the Column class!

lit
Let’s use the / method to take two divided by the num column.

1 df
2 .withColumn("two_divided_by_num", lit(2) / col("num"))
3 .show()

1 +---+----+------------------+
2 |num|word|two_divided_by_num|
3 +---+----+------------------+
4 | 10| cat| 0.2|
5 | 4| dog| 0.5|
6 | 7|null|0.2857142857142857|
7 +---+----+------------------+

Notice that the lit() function must be used to convert two into a Column object before the division
can take place. Here's what happens if we skip the lit() call:
¹⁰http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column

1 df
2 .withColumn("two_divided_by_num", 2 / col("num"))
3 .show()

Here is the error message:

1 notebook:2: error: overloaded method value / with alternatives:


2 (x: Double)Double <and>
3 (x: Float)Float <and>
4 (x: Long)Long <and>
5 (x: Int)Int <and>
6 (x: Char)Int <and>
7 (x: Short)Int <and>
8 (x: Byte)Int
9 cannot be applied to (org.apache.spark.sql.Column)
10 .withColumn("two_divided_by_num", 2 / col("num"))

The / method is defined in both the Scala Int and Spark Column classes. We need to convert the
number to a Column object, so the compiler knows to use the / method defined in the Spark Column
class. Upon analyzing the error message, we can see that the compiler is mistakenly trying to use
the / operator defined in the Scala Int class.

isNull
Let’s use the isNull method to identify when the word column is null.

1 df
2 .withColumn("word_is_null", col("word").isNull)
3 .show()

1 +---+----+------------+
2 |num|word|word_is_null|
3 +---+----+------------+
4 | 10| cat| false|
5 | 4| dog| false|
6 | 7|null| true|
7 +---+----+------------+

isNotNull
Let’s use the isNotNull method to filter out all rows with a word of null.

1 df
2 .where(col("word").isNotNull)
3 .show()

1 +---+----+
2 |num|word|
3 +---+----+
4 | 10| cat|
5 | 4| dog|
6 +---+----+

when / otherwise
Let’s create a final DataFrame with word1 and word2 columns, so we can play around with the ===,
when(), and otherwise() methods.

1 val df = Seq(
2 ("bat", "bat"),
3 ("snake", "rat"),
4 ("cup", "phone"),
5 ("key", null)
6 ).toDF("word1", "word2")

Let’s write a little word comparison algorithm that analyzes the differences between the two words.

1 import org.apache.spark.sql.functions._
2
3 df
4 .withColumn(
5 "word_comparison",
6 when($"word1" === $"word2", "same words")
7 .when(length($"word1") > length($"word2"), "word1 is longer")
8 .otherwise("i am confused")
9 ).show()

1 +-----+-----+---------------+
2 |word1|word2|word_comparison|
3 +-----+-----+---------------+
4 | bat| bat| same words|
5 |snake| rat|word1 is longer|
6 | cup|phone| i am confused|
7 | key| null| i am confused|
8 +-----+-----+---------------+

when() and otherwise() are how to write if / else if / else logic in Spark.

Next steps
You will use Column methods all the time when writing Spark code.
If you don't have a solid object-oriented programming background, it can be hard to identify
which methods are defined in the Column class and which methods are defined in the
org.apache.spark.sql.functions package.

Scala lets you skip dot notation when invoking methods which makes it extra difficult to spot which
methods are Column methods.
In later chapters, we’ll discuss chaining column methods and extending the Column class.
Column methods will be used extensively throughout the rest of this book.
Introduction to Spark SQL functions
This chapter shows you how to use Spark SQL functions and how to build your own SQL functions.
Spark SQL functions are key for almost all analyses.

High level review


Spark SQL functions are defined in the org.apache.spark.sql.functions object. There are a ton of
functions!
The documentation page¹¹ lists all of the built-in SQL functions.
Most SQL functions take Column argument(s) and return Column objects.
Let’s demonstrate how to use a SQL function. Create a DataFrame with a number column and use
the factorial function to append a number_factorial column.

1 import org.apache.spark.sql.functions._
2
3 val df = Seq(2, 3, 4).toDF("number")
4
5 df
6 .withColumn("number_factorial", factorial(col("number")))
7 .show()

1 +------+----------------+
2 |number|number_factorial|
3 +------+----------------+
4 | 2| 2|
5 | 3| 6|
6 | 4| 24|
7 +------+----------------+

The factorial() function takes a single Column argument. The col() function, also defined in the
org.apache.spark.sql.functions object, returns a Column object based on the column name.

If Spark implicits are imported (i.e. you’ve run import spark.implicits._), then you can also create
a Column object with the $ operator. This code also works.
¹¹http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

1 import org.apache.spark.sql.functions._
2 import spark.implicits._
3
4 val df = Seq(2, 3, 4).toDF("number")
5
6 df
7 .withColumn("number_factorial", factorial($"number"))
8 .show()

The rest of this chapter focuses on the most important SQL functions that’ll be used in most analyses.

lit() function
The lit() function creates a Column object out of a literal value. Let’s create a DataFrame and use
the lit() function to append a spanish_hi column to the DataFrame.

1 val df = Seq("sophia", "sol", "perro").toDF("word")


2
3 df
4 .withColumn("spanish_hi", lit("hola"))
5 .show()

1 +------+----------+
2 | word|spanish_hi|
3 +------+----------+
4 |sophia| hola|
5 | sol| hola|
6 | perro| hola|
7 +------+----------+

The lit() function is especially useful when making boolean comparisons.
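For example, here's a small sketch that compares the word column to a literal string (reusing the df defined above):

df
  .withColumn("is_sol", col("word") === lit("sol"))
  .show()
// is_sol is true for the "sol" row and false for the other rows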

when() and otherwise() functions


The when() and otherwise() functions are used for control flow in Spark SQL, similar to if and
else in other programming languages.

Let’s create a DataFrame of countries and use some when() statements to append a country column.

1 val df = Seq("china", "canada", "italy", "tralfamadore").toDF("word")


2
3 df
4 .withColumn(
5 "continent",
6 when(col("word") === lit("china"), lit("asia"))
7 .when(col("word") === lit("canada"), lit("north america"))
8 .when(col("word") === lit("italy"), lit("europe"))
9 .otherwise("not sure")
10 )
11 .show()

1 +------------+-------------+
2 | word| continent|
3 +------------+-------------+
4 | china| asia|
5 | canada|north america|
6 | italy| europe|
7 |tralfamadore| not sure|
8 +------------+-------------+

Spark sometimes lets you omit the lit() method calls so the code can be expressed more compactly.

1 df
2 .withColumn(
3 "continent",
4 when(col("word") === "china", "asia")
5 .when(col("word") === "canada", "north america")
6 .when(col("word") === "italy", "europe")
7 .otherwise("not sure")
8 )
9 .show()

Here’s another example of using when() to manage control flow.



1 val df = Seq(10, 15, 25).toDF("age")


2
3 df
4 .withColumn(
5 "life_stage",
6 when(col("age") < 13, "child")
7 .when(col("age") >= 13 && col("age") <= 18, "teenager")
8 .when(col("age") > 18, "adult")
9 )
10 .show()

1 +---+----------+
2 |age|life_stage|
3 +---+----------+
4 | 10| child|
5 | 15| teenager|
6 | 25| adult|
7 +---+----------+

The when method is defined in both the Column class and the functions object. Whenever you see
when() that's not preceded with a dot, it's the when from the functions object. .when() comes from
the Column class.

Writing your own SQL function


You can easily build your own SQL functions. Lots of new Spark developers build user defined
functions when it’d be a lot easier to simply build a custom SQL function. Avoid user defined
functions whenever possible!
Let’s create a lifeStage() function that takes an age argument and returns “child”, “teenager” or
“adult”.

1 import org.apache.spark.sql.Column
2
3 def lifeStage(col: Column): Column = {
4 when(col < 13, "child")
5 .when(col >= 13 && col <= 18, "teenager")
6 .when(col > 18, "adult")
7 }

Here’s how to use the lifeStage() function:



1 val df = Seq(10, 15, 25).toDF("age")


2
3 df
4 .withColumn(
5 "life_stage",
6 lifeStage(col("age"))
7 )
8 .show()

1 +---+----------+
2 |age|life_stage|
3 +---+----------+
4 | 10| child|
5 | 15| teenager|
6 | 25| adult|
7 +---+----------+

Let’s create another function that trims all whitespace and capitalizes all of the characters in a string.

1 import org.apache.spark.sql.Column
2
3 def trimUpper(col: Column): Column = {
4 trim(upper(col))
5 }

Let’s run trimUpper() on a sample data set.

1 val df = Seq(
2 " some stuff",
3 "like CHEESE "
4 ).toDF("weird")
5
6 df
7 .withColumn(
8 "cleaned",
9 trimUpper(col("weird"))
10 )
11 .show()

1 +----------------+-----------+
2 | weird| cleaned|
3 +----------------+-----------+
4 | some stuff| SOME STUFF|
5 |like CHEESE |LIKE CHEESE|
6 +----------------+-----------+

Custom SQL functions can typically be used instead of UDFs. Avoiding UDFs is a great way to write
better Spark code.

Next steps
Spark SQL functions are preferable to UDFs because they handle the null case gracefully (without
a lot of code) and because they are not a black box¹².
Most Spark analyses can be run by leveraging the standard library and reverting to custom SQL
functions when necessary. Avoid UDFs at all costs!
¹²https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs-blackbox.html
User Defined Functions (UDFs)
Spark lets you define custom SQL functions called user defined functions (UDFs). UDFs are great
when built-in SQL functions aren’t sufficient, but should be used sparingly because they’re not
performant.
This chapter will demonstrate how to define UDFs and will show how to avoid UDFs, when possible,
by leveraging native Spark functions.

Simple UDF example


Let’s define a UDF that removes all the whitespace and lowercases all the characters in a string.

1 def lowerRemoveAllWhitespace(s: String): String = {


2 s.toLowerCase().replaceAll("\\s", "")
3 }
4
5 val lowerRemoveAllWhitespaceUDF = udf[String, String](lowerRemoveAllWhitespace)
6
7 val sourceDF = spark.createDF(
8 List(
9 (" HI THERE "),
10 (" GivE mE PresenTS ")
11 ), List(
12 ("aaa", StringType, true)
13 )
14 )

1 sourceDF.select(
2 lowerRemoveAllWhitespaceUDF(col("aaa")).as("clean_aaa")
3 ).show()
4
5 +--------------+
6 | clean_aaa|
7 +--------------+
8 | hithere|
9 |givemepresents|
10 +--------------+

This code will unfortunately error out if the DataFrame column contains a null value.

1 val anotherDF = spark.createDF(


2 List(
3 (" BOO "),
4 (" HOO "),
5 (null)
6 ), List(
7 ("cry", StringType, true)
8 )
9 )
10
11 anotherDF.select(
12 lowerRemoveAllWhitespaceUDF(col("cry")).as("clean_cry")
13 ).show()

org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 3.0 failed 1 times,
most recent failure: Lost task 2.0 in stage 3.0 (TID 7, localhost, executor driver): org.apache.spark.SparkException:
Failed to execute user defined function(anonfun$2: (string) ⇒ string)
Caused by: java.lang.NullPointerException
Cause: org.apache.spark.SparkException: Failed to execute user defined function(anonfun$2: (string)
⇒ string)
Cause: java.lang.NullPointerException
Let’s write a lowerRemoveAllWhitespaceUDF function that won’t error out when the DataFrame
contains null values.

1 def betterLowerRemoveAllWhitespace(s: String): Option[String] = {


2 val str = Option(s).getOrElse(return None)
3 Some(str.toLowerCase().replaceAll("\\s", ""))
4 }
5
6 val betterLowerRemoveAllWhitespaceUDF =
7 udf[Option[String], String](betterLowerRemoveAllWhitespace)
8 val anotherDF = spark.createDF(
9 List(
10 (" BOO "),
11 (" HOO "),
12 (null)
13 ), List(
14 ("cry", StringType, true)
15 )
16 )

1 anotherDF.select(
2 betterLowerRemoveAllWhitespaceUDF(col("cry")).as("clean_cry")
3 ).show()
4
5 +---------+
6 |clean_cry|
7 +---------+
8 | boo|
9 | hoo|
10 | null|
11 +---------+

We can use the explain() method to demonstrate that UDFs are a black box for the Spark engine.
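Here's the call that generates the physical plan below (a quick sketch that reuses anotherDF and the UDF defined above):

anotherDF.select(
  betterLowerRemoveAllWhitespaceUDF(col("cry")).as("clean_cry")
).explain()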
== Physical Plan ==
*Project [UDF(cry#15) AS clean_cry#24]
+- Scan ExistingRDD[cry#15]
Spark doesn’t know how to convert the UDF into native Spark instructions. Let’s use the native
Spark library to refactor this code and help Spark generate a physical plan that can be optimized.

Using Column Functions


Let’s define a function that takes a Column argument, returns a Column, and leverages native Spark
functions to lowercase and remove all whitespace from a string.

1 def bestLowerRemoveAllWhitespace()(col: Column): Column = {


2 lower(regexp_replace(col, "\\s+", ""))
3 }
4
5 val anotherDF = spark.createDF(
6 List(
7 (" BOO "),
8 (" HOO "),
9 (null)
10 ), List(
11 ("cry", StringType, true)
12 )
13 )

1 anotherDF.select(
2 bestLowerRemoveAllWhitespace()(col("cry")).as("clean_cry")
3 ).show()
4
5 +---------+
6 |clean_cry|
7 +---------+
8 | boo|
9 | hoo|
10 | null|
11 +---------+

Notice that the bestLowerRemoveAllWhitespace elegantly handles the null case and does not require
us to add any special null logic.

1 anotherDF.select(
2 bestLowerRemoveAllWhitespace()(col("cry")).as("clean_cry")
3 ).explain()

== Physical Plan ==
*Project [lower(regexp_replace(cry#29, \s+, )) AS clean_cry#38]
+- Scan ExistingRDD[cry#29]
Spark can view the internals of the bestLowerRemoveAllWhitespace function and optimize the
physical plan accordingly. UDFs are a black box for the Spark engine whereas functions that take a
Column argument and return a Column are not a black box for Spark.

Conclusion
Spark UDFs should be avoided whenever possible. If you need to write a UDF, make sure to handle
the null case as this is a common cause of errors.
Chaining Custom DataFrame
Transformations in Spark
This chapter explains how to write DataFrame transformations and how to chain multiple
transformations with the Dataset#transform method.

Dataset Transform Method


The Dataset transform method provides a “concise syntax for chaining custom transformations.”
Suppose we have a withGreeting() method that appends a greeting column to a DataFrame and a
withFarewell() method that appends a farewell column to a DataFrame.

1 def withGreeting(df: DataFrame): DataFrame = {


2 df.withColumn("greeting", lit("hello world"))
3 }
4
5 def withFarewell(df: DataFrame): DataFrame = {
6 df.withColumn("farewell", lit("goodbye"))
7 }

We can use the transform method to run the withGreeting() and withFarewell() methods.

1 val df = Seq(
2 "funny",
3 "person"
4 ).toDF("something")
5
6 val weirdDf = df
7 .transform(withGreeting)
8 .transform(withFarewell)

1 weirdDf.show()
2
3 +---------+-----------+--------+
4 |something| greeting|farewell|
5 +---------+-----------+--------+
6 | funny|hello world| goodbye|
7 | person|hello world| goodbye|
8 +---------+-----------+--------+

The transform method can easily be chained with built-in Spark DataFrame methods, like select.

1 df
2 .select("something")
3 .transform(withGreeting)
4 .transform(withFarewell)

The transform method helps us write easy-to-follow code by avoiding nested method calls. Without
transform, the above code becomes less readable:

1 withFarewell(withGreeting(df))
2
3 // even worse
4 withFarewell(withGreeting(df)).select("something")

Transform Method with Arguments


Our example transforms (withFarewell and withGreeting) modify DataFrames in a standard way:
that is, they will always append a column named farewell and greeting, each with hardcoded
values (“goodbye” and “hello world”, respectively).
We can also create custom DataFrame transformations by defining transforms that take arguments.
Now we can begin to leverage currying and multiple parameter lists in Scala.
To illustrate the difference, let’s use the same withGreeting() method from earlier and add a
withCat() method that takes a string as an argument.

1 def withGreeting(df: DataFrame): DataFrame = {


2 df.withColumn("greeting", lit("hello world"))
3 }
4
5 def withCat(name: String)(df: DataFrame): DataFrame = {
6 df.withColumn("cats", lit(s"$name meow"))
7 }

We can use the transform method to run the withGreeting() and withCat() methods.

1 val df = Seq(
2 "funny",
3 "person"
4 ).toDF("something")
5
6 val niceDf = df
7 .transform(withGreeting)
8 .transform(withCat("puffy"))

1 niceDf.show()
2
3 +---------+-----------+----------+
4 |something| greeting| cats|
5 +---------+-----------+----------+
6 | funny|hello world|puffy meow|
7 | person|hello world|puffy meow|
8 +---------+-----------+----------+
Whitespace data munging with Spark
Spark SQL provides a variety of methods to manipulate whitespace in your DataFrame StringType
columns.
The spark-daria¹³ library provides additional methods that are useful for whitespace data munging.
Learning about whitespace data munging is useful, but the more important lesson in this chapter is
learning how to build reusable custom SQL functions.
We’re laying the foundation to teach you how to build reusable code libraries.

trim(), ltrim(), and rtrim()


Spark provides functions to eliminate leading and trailing whitespace. The trim() function removes
both leading and trailing whitespace as shown in the following example.

1 val sourceDF = Seq(


2 (" a "),
3 ("b "),
4 (" c"),
5 (null)
6 ).toDF("word")
7
8 val actualDF = sourceDF.withColumn(
9 "trimmed_word",
10 trim(col("word"))
11 )

¹³https://github.com/MrPowers/spark-daria/

1 actualDF.show()
2
3 +----------+------------+
4 | word|trimmed_word|
5 +----------+------------+
6 |" a "| "a"|
7 | "b "| "b"|
8 | " c"| "c"|
9 | null| null|
10 +----------+------------+

Let’s use the same sourceDF and demonstrate how the ltrim() method removes the leading
whitespace.

1 val sourceDF = Seq(


2 (" a "),
3 ("b "),
4 (" c"),
5 (null)
6 ).toDF("word")
7
8 val actualDF = sourceDF.withColumn(
9 "ltrimmed_word",
10 ltrim(col("word"))
11 )

1 actualDF.show()
2
3 +----------+-------------+
4 | word|ltrimmed_word|
5 +----------+-------------+
6 |" a "| "a "|
7 | "b "| "b "|
8 | " c"| "c"|
9 | null| null|
10 +----------+-------------+

The rtrim() method removes all trailing whitespace from a string - you can easily figure that one
out by yourself!
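If you'd rather not take my word for it, here's a minimal sketch that reuses the same sourceDF (rtrim comes from org.apache.spark.sql.functions, just like trim and ltrim):

val actualDF = sourceDF.withColumn(
  "rtrimmed_word",
  rtrim(col("word"))
)

actualDF.show()
// only the trailing whitespace is removed; the null row stays null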

singleSpace()
The spark-daria project defines a singleSpace() method that removes all leading and trailing
whitespace and replaces all inner whitespace with a single space.
Here’s how the singleSpace() function is defined in the spark-daria source code.

1 import org.apache.spark.sql.Column
2
3 def singleSpace(col: Column): Column = {
4 trim(regexp_replace(col, " +", " "))
5 }

Let’s run the function:

1 val sourceDF = Seq(


2 ("i like cheese"),
3 (" the dog runs "),
4 (null)
5 ).toDF("words")
6
7 val actualDF = sourceDF.withColumn(
8 "single_spaced",
9 singleSpace(col("words"))
10 )

1 actualDF.show()
2
3 +-------------------+---------------+
4 | words| single_spaced|
5 +-------------------+---------------+
6 |"i like cheese"|"i like cheese"|
7 |" the dog runs "| "the dog runs"|
8 | null| null|
9 +-------------------+---------------+

Copying and pasting code from spark-daria should usually be avoided.


In later chapters, we'll learn how to set up a project with IntelliJ, add spark-daria as a dependency,
and import spark-daria functions.
For now, focus on studying how functions are defined in spark-daria. Studying reusable functions with
good abstractions is a good way for you to learn how to make your own custom SQL functions.

removeAllWhitespace()
spark-daria defines a removeAllWhitespace() method that removes all whitespace from a string as
shown in the following example.

1 def removeAllWhitespace(col: Column): Column = {
2 regexp_replace(col, "\\s+", "")
3 }

Here's how to use removeAllWhitespace():

1 val sourceDF = Seq(
2 ("i like cheese"),
3 (" the dog runs "),
4 (null)
5 ).toDF("words")
6
7 val actualDF = sourceDF.withColumn(
8 "no_whitespace",
9 removeAllWhitespace(col("words"))
10 )

1 actualDF.show()
2
3 +-------------------+-------------+
4 | words|no_whitespace|
5 +-------------------+-------------+
6 |"i like cheese"|"ilikecheese"|
7 |" the dog runs "| "thedogruns"|
8 | null| null|
9 +-------------------+-------------+

Notice how the removeAllWhitespace function takes a Column argument and returns a Column.
Custom SQL functions typically use this method signature.

Conclusion
Spark SQL offers a bunch of great functions for whitespace data munging.

spark-daria adds some additional custom SQL functions for more advanced whitespace data
munging.
Study the method signatures of the spark-daria functions. You’ll want to make generic cleaning
functions like these for your messy data too!
Defining DataFrame Schemas with
StructField and StructType
Spark DataFrame schemas are defined as a collection of typed columns. The entire schema is stored
as a StructType and individual columns are stored as StructFields.
This chapter explains how to create and modify Spark schemas via the StructType and StructField
classes.
We’ll show how to work with IntegerType, StringType, and LongType columns.
Complex column types like ArrayType, MapType and StructType will be covered in later chapters.
Mastering Spark schemas is necessary for debugging code and writing tests.

Defining a schema to create a DataFrame


Let’s invent some sample data, define a schema, and create a DataFrame.

1 import org.apache.spark.sql.types._
2
3 val data = Seq(
4 Row(8, "bat"),
5 Row(64, "mouse"),
6 Row(-27, "horse")
7 )
8
9 val schema = StructType(
10 List(
11 StructField("number", IntegerType, true),
12 StructField("word", StringType, true)
13 )
14 )
15
16 val df = spark.createDataFrame(
17 spark.sparkContext.parallelize(data),
18 schema
19 )

1 df.show()
2
3 +------+-----+
4 |number| word|
5 +------+-----+
6 | 8| bat|
7 | 64|mouse|
8 | -27|horse|
9 +------+-----+

StructType objects are instantiated with a List of StructField objects.

The org.apache.spark.sql.types package must be imported to access StructType, StructField,


IntegerType, and StringType.

The createDataFrame() method takes two arguments:

1. RDD of the data


2. The DataFrame schema (a StructType object)

The schema() method returns a StructType object:

1 df.schema
2
3 StructType(
4 StructField(number,IntegerType,true),
5 StructField(word,StringType,true)
6 )

StructField

StructFields model each column in a DataFrame.

StructField objects are created with the name, dataType, and nullable properties. Here’s an
example:

1 StructField("word", StringType, true)

The StructField above sets the name field to "word", the dataType field to StringType, and the
nullable field to true.

"word" is the name of the column in the DataFrame.

StringType means that the column can only take string values like "hello" - it cannot take other
values like 34 or false.
When the nullable field is set to true, the column can accept null values.

Defining schemas with the :: operator


We can also define a schema with the :: operator, like the examples in the StructType
documentation¹⁴.

1 val schema = StructType(


2 StructField("number", IntegerType, true) ::
3 StructField("word", StringType, true) :: Nil
4 )

The :: operator makes it easy to construct lists in Scala. We can also use :: to make a list of numbers.

1 5 :: 4 :: Nil

Notice that the last element always has to be Nil or the code will error out.
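Here's a small sketch of what happens with and without the trailing Nil:

5 :: 4 :: Nil // List(5, 4)
// 5 :: 4     // won't compile: value :: is not a member of Int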

Defining schemas with the add() method


We can use the StructType#add() method to define schemas.

1 val schema = StructType(Seq(StructField("number", IntegerType, true)))


2 .add(StructField("word", StringType, true))

add() is an overloaded method and there are several different ways to invoke it - this will work too:

1 val schema = StructType(Seq(StructField("number", IntegerType, true)))


2 .add("word", StringType, true)

Check the StructType documentation¹⁵ for all the different ways add() can be used.

Common errors

Extra column defined in Schema


The following code has an extra column defined in the schema and will error out with this message:
java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException:
2.
¹⁴http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.StructType
¹⁵http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.StructType

1 val data = Seq(


2 Row(8, "bat"),
3 Row(64, "mouse"),
4 Row(-27, "horse")
5 )
6
7 val schema = StructType(
8 List(
9 StructField("number", IntegerType, true),
10 StructField("word", StringType, true),
11 StructField("num2", IntegerType, true)
12 )
13 )
14
15 val df = spark.createDataFrame(
16 spark.sparkContext.parallelize(data),
17 schema
18 )

The data only contains two columns, but the schema contains three StructField columns.

Type mismatch
The following code incorrectly characterizes a string column as an integer column and will error out
with this message: java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException:
java.lang.String is not a valid external type for schema of int.

1 val data = Seq(


2 Row(8, "bat"),
3 Row(64, "mouse"),
4 Row(-27, "horse")
5 )
6
7 val schema = StructType(
8 List(
9 StructField("num1", IntegerType, true),
10 StructField("num2", IntegerType, true)
11 )
12 )
13
14 val df = spark.createDataFrame(
15 spark.sparkContext.parallelize(data),
16 schema
17 )
18
19 df.show()

The first column of data (8, 64, and -27) can be characterized as IntegerType data.
The second column of data ("bat", "mouse", and "horse") cannot be characterized as an IntegerType
column - this code would work if this column was recharacterized as StringType.

Nullable property exception


The following code incorrectly tries to add null to a column with a nullable property set to
false and will error out with this message: java.lang.RuntimeException: Error while encoding:
java.lang.RuntimeException: The 0th field 'word1' of input row cannot be null.

1 val data = Seq(


2 Row("hi", "bat"),
3 Row("bye", "mouse"),
4 Row(null, "horse")
5 )
6
7 val schema = StructType(
8 List(
9 StructField("word1", StringType, false),
10 StructField("word2", StringType, true)
11 )
12 )
13
14 val df = spark.createDataFrame(
15 spark.sparkContext.parallelize(data),
16 schema
17 )
18
19 df.show()

LongType

Integers use 32 bits whereas long values use 64 bits.


Integers can hold values between about -2 billion and 2 billion (-scala.math.pow(2, 31) to
scala.math.pow(2, 31) - 1 to be exact).

Long values are suitable for bigger integers. You can create a long value in Scala by appending L to
an integer - e.g. 4L or -60L.
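Here's a quick sketch of the exact ranges and the L suffix (values you can verify in the Scala REPL):

Int.MaxValue  // 2147483647
Long.MaxValue // 9223372036854775807
val bigNum = 4000000000L // too big for an Int, but fine as a Long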
Let’s create a DataFrame with a LongType column.

1 val data = Seq(


2 Row(5L, "bat"),
3 Row(-10L, "mouse"),
4 Row(4L, "horse")
5 )
6
7 val schema = StructType(
8 List(
9 StructField("long_num", LongType, true),
10 StructField("word", StringType, true)
11 )
12 )
13
14 val df = spark.createDataFrame(
15 spark.sparkContext.parallelize(data),
16 schema
17 )

1 df.show()
2
3 +--------+-----+
4 |long_num| word|
5 +--------+-----+
6 | 5| bat|
7 | -10|mouse|
8 | 4|horse|
9 +--------+-----+

You’ll get the following error message if you try to add integers to a LongType column: java.lang.RuntimeException:
Error while encoding: java.lang.RuntimeException: java.lang.Integer is not a valid
external type for schema of bigint

Here’s an example of the erroneous code:



1 val data = Seq(


2 Row(45, "bat"),
3 Row(2, "mouse"),
4 Row(3, "horse")
5 )
6
7 val schema = StructType(
8 List(
9 StructField("long_num", LongType, true),
10 StructField("word", StringType, true)
11 )
12 )
13
14 val df = spark.createDataFrame(
15 spark.sparkContext.parallelize(data),
16 schema
17 )
18
19 df.show()

Next steps
You’ll be defining a lot of schemas in your test suites so make sure to master all the concepts covered
in this chapter.
Different approaches to manually
create Spark DataFrames
This chapter shows how to manually create DataFrames with the Spark and spark-daria helper
methods.
We’ll demonstrate why the createDF() method defined in spark-daria is better than the toDF() and
createDataFrame() methods from the Spark source code.

toDF
Up until now, we’ve been using toDF to create DataFrames.
toDF() provides a concise syntax for creating DataFrames and can be accessed after importing Spark
implicits.

1 import spark.implicits._
2
3 // The toDF() method can be called on a sequence object to create a DataFrame.
4 val someDF = Seq(
5 (8, "bat"),
6 (64, "mouse"),
7 (-27, "horse")
8 ).toDF("number", "word")

someDF has the following schema.


root
 |-- number: integer (nullable = false)
 |-- word: string (nullable = true)
toDF() is limited because the column type and nullable flag cannot be customized. In this example,
the number column is not nullable and the word column is nullable.
The import spark.implicits._ statement can only be run inside of class definitions when the Spark
Session is available. All imports should be at the top of the file before the class definition, so toDF()
encourages bad Scala coding practices.
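Here's a rough sketch of why the import placement gets awkward - spark.implicits._ can only be imported once a SparkSession value named spark is in scope, typically inside a class or method body (illustrative code, not from the book's projects):

import org.apache.spark.sql.SparkSession

class ExampleTransformations(val spark: SparkSession) {
  // the import has to live here, after the spark value is in scope
  import spark.implicits._

  def someDF() = Seq((8, "bat"), (64, "mouse")).toDF("number", "word")
}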
toDF() is suitable for local testing, but production grade code that’s checked into master should use
a better solution.

createDataFrame
The createDataFrame() method addresses the limitations of the toDF() method and allows for full
schema customization and good Scala coding practices.
Here is how to create someDF with createDataFrame().

1 val someData = Seq(


2 Row(8, "bat"),
3 Row(64, "mouse"),
4 Row(-27, "horse")
5 )
6
7 val someSchema = List(
8 StructField("number", IntegerType, true),
9 StructField("word", StringType, true)
10 )
11
12 val someDF = spark.createDataFrame(
13 spark.sparkContext.parallelize(someData),
14 StructType(someSchema)
15 )

createDataFrame() provides the functionality we need, but the syntax is verbose. Our test files will
become cluttered and difficult to read if createDataFrame() is used frequently.

createDF
createDF() is defined in spark-daria and allows for the following terse syntax.

1 val someDF = spark.createDF(


2 List(
3 (8, "bat"),
4 (64, "mouse"),
5 (-27, "horse")
6 ), List(
7 ("number", IntegerType, true),
8 ("word", StringType, true)
9 )
10 )

createDF() creates readable code like toDF() and allows for full schema customization like
createDataFrame(). It's the best of both worlds.

How we’ll create DataFrames in this book


We’ll generally use toDF to create DataFrames and will only use createDataFrame when extra
schema control is needed.
We won’t use createDF because we don’t want to make it hard to copy and paste the code snippets
in this book.
Once you’re a more experienced Spark programmer and you have a solid workflow established in
the IntelliJ text editor, you’ll want to use the createDF method.
Dealing with null in Spark
Spark DataFrames are filled with null values and you should write code that gracefully handles these
null values.
You don’t want to write code that throws NullPointerExceptions - yuck!
This chapter outlines when null should be used, how native Spark functions handle null input, and
how to simplify null logic by avoiding user defined functions.

What is null?
In SQL databases, “null means that some value is unknown, missing, or irrelevant¹⁶.” The SQL
concept of null is different than null in programming languages like JavaScript or Scala. Spark
DataFrame best practices are aligned with SQL best practices, so DataFrames should use null for
values that are unknown, missing or irrelevant.

Spark uses null by default sometimes


Let’s look at the following file as an example of how Spark considers blank and empty CSV fields
as null values.

1 name,country,zip_code
2 joe,usa,89013
3 ravi,india,
4 "",,12389

All the blank values and empty strings are read into a DataFrame as null.

1 val peopleDf = spark.read.option("header", "true").csv(path)

¹⁶https://www.itprotoday.com/sql-server/sql-design-reason-null

1 peopleDf.show()
2
3 +----+-------+--------+
4 |name|country|zip_code|
5 +----+-------+--------+
6 | joe| usa| 89013|
7 |ravi| india| null|
8 |null| null| 12389|
9 +----+-------+--------+

The Spark csv() method demonstrates that null is used for values that are unknown or missing
when files are read into DataFrames.

nullable Columns
Let’s create a DataFrame with a name column that isn’t nullable and an age column that is nullable.
The name column cannot take null values, but the age column can take null values. The nullable
property is the third argument when instantiating a StructField.

1 val schema = List(


2 StructField("name", StringType, false),
3 StructField("age", IntegerType, true)
4 )
5
6 val data = Seq(
7 Row("miguel", null),
8 Row("luisa", 21)
9 )
10
11 val df = spark.createDataFrame(
12 spark.sparkContext.parallelize(data),
13 StructType(schema)
14 )

If we try to create a DataFrame with a null value in the name column, the code will blow up with
this error: “Error while encoding: java.lang.RuntimeException: The 0th field ‘name’ of input row
cannot be null”.
Here’s some code that would cause the error to be thrown:

1 val data = Seq(


2 Row("phil", 44),
3 Row(null, 21)
4 )

Make sure to recreate the error on your machine! It’s a hard error message to understand unless
you’re used to it.
You can keep null values out of certain columns by setting nullable to false.
You won’t be able to set nullable to false for all columns in a DataFrame and pretend like null
values don’t exist. For example, when joining DataFrames, the join column will return null when
a match cannot be made.
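Here's a small sketch of that join behavior with some hypothetical data:

val people = Seq(("joe", "usa"), ("ravi", "india")).toDF("name", "country")
val capitals = Seq(("usa", "washington"), ("peru", "lima")).toDF("country", "capital")

people.join(capitals, Seq("country"), "left_outer").show()
// ravi's capital is null because "india" has no match in capitals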

Native Spark code


Native Spark code handles null gracefully.
Let’s create a DataFrame with numbers so we have some data to play with.

1 val schema = List(


2 StructField("number", IntegerType, true)
3 )
4
5 val data = Seq(
6 Row(1),
7 Row(8),
8 Row(12),
9 Row(null)
10 )
11
12 val numbersDF = spark.createDataFrame(
13 spark.sparkContext.parallelize(data),
14 StructType(schema)
15 )

Now let’s add a column that returns true if the number is even, false if the number is odd, and
null otherwise.

1 numbersDF
2 .withColumn("is_even", $"number" % 2 === 0)
3 .show()

1 +------+-------+
2 |number|is_even|
3 +------+-------+
4 | 1| false|
5 | 8| true|
6 | 12| true|
7 | null| null|
8 +------+-------+

The Spark % method returns null when the input is null. Actually all Spark functions return null
when the input is null.
You should follow this example in your code - your Spark functions should return null when the
input is null too!

Scala null Conventions


Native Spark code cannot always be used and sometimes you’ll need to fall back on Scala code
and User Defined Functions. The Scala best practices for null are different than the Spark null best
practices.
David Pollak, the author of Beginning Scala, stated “Ban null from any of your code. Period.” Alvin
Alexander, a prominent Scala blogger and author, explains why Option is better than null in this
blog post¹⁷. The Scala community clearly prefers Option to avoid the pesky null pointer exceptions
that have burned them in Java.
Some developers erroneously interpret these Scala best practices to infer that null should be banned
from DataFrames as well! Remember that DataFrames are akin to SQL tables and should generally
follow SQL best practices. Scala best practices are completely different.
The Databricks Scala style guide¹⁸ does not agree that null should always be banned from Scala code
and says: “For performance sensitive code, prefer null over Option, in order to avoid virtual method
calls and boxing.”
The Spark source code uses the Option keyword 821 times, but it also refers to null directly in code
like if (ids != null). Spark may be taking a hybrid approach of using Option when possible and
falling back to null when necessary for performance reasons.
Let’s dig into some code and see how null and Option can be used in Spark user defined functions.

User Defined Functions


Let’s create a user defined function that returns true if a number is even and false if a number is
odd.
¹⁷https://alvinalexander.com/scala/using-scala-option-some-none-idiom-function-java-null
¹⁸https://github.com/databricks/scala-style-guide#perf-option

1 def isEvenSimple(n: Integer): Boolean = {


2 n % 2 == 0
3 }
4
5 val isEvenSimpleUdf = udf[Boolean, Integer](isEvenSimple)

Suppose we have the following numbersDF DataFrame:

1 +------+
2 |number|
3 +------+
4 | 1|
5 | 8|
6 | 12|
7 | null|
8 +------+

Our UDF does not handle null input values. Let’s run the code and observe the error.

1 numbersDF.withColumn(
2 "is_even",
3 isEvenSimpleUdf(col("number"))
4 )

Here is the error message:


SparkException: Job aborted due to stage failure: Task 2 in stage 16.0 failed 1 times, most recent
failure: Lost task 2.0 in stage 16.0 (TID 41, localhost, executor driver): org.apache.spark.SparkException:
Failed to execute user defined function($anonfun$1: (int) ⇒ boolean)
Caused by: java.lang.NullPointerException
We can use the isNotNull method to work around the NullPointerException that’s caused when
isEvenSimpleUdf is invoked.

1 val actualDF = numbersDF.withColumn(


2 "is_even",
3 when(
4 col("number").isNotNull,
5 isEvenSimpleUdf(col("number"))
6 ).otherwise(lit(null))
7 )

1 actualDF.show()
2
3 +------+-------+
4 |number|is_even|
5 +------+-------+
6 | 1| false|
7 | 8| true|
8 | 12| true|
9 | null| null|
10 +------+-------+

It’s better to write user defined functions that gracefully deal with null values and don’t rely on the
isNotNull work around - let’s try again.

Dealing with null badly


Let’s refactor the user defined function so it doesn’t error out when it encounters a null value.

1 def isEvenBad(n: Integer): Boolean = {


2 if (n == null) {
3 false
4 } else {
5 n % 2 == 0
6 }
7 }
8
9 val isEvenBadUdf = udf[Boolean, Integer](isEvenBad)

We can run the isEvenBadUdf on the same numbersDF as earlier.

1 val actualDF = numbersDF.withColumn(


2 "is_even",
3 isEvenBadUdf(col("number"))
4 )

1 actualDF.show()
2
3 +------+-------+
4 |number|is_even|
5 +------+-------+
6 | 1| false|
7 | 8| true|
8 | 12| true|
9 | null| false|
10 +------+-------+

This code works, but is terrible because it returns false for odd numbers and null numbers. Remember
that null should be used for values that are irrelevant. null is not even or odd - returning false for null
numbers implies that null is odd!
Let’s refactor this code and correctly return null when number is null.

Dealing with null better


The isEvenBetterUdf returns true / false for numeric values and null otherwise.

1 def isEvenBetter(n: Integer): Option[Boolean] = {


2 if (n == null) {
3 None
4 } else {
5 Some(n % 2 == 0)
6 }
7 }
8
9 val isEvenBetterUdf = udf[Option[Boolean], Integer](isEvenBetter)

The isEvenBetter method returns an Option[Boolean]. When the input is null, isEvenBetter
returns None, which is converted to null in DataFrames.
Let’s run the isEvenBetterUdf on the same numbersDF as earlier and verify that null values are
correctly added when the number column is null.

1 val actualDF = numbersDF.withColumn(


2 "is_even",
3 isEvenBetterUdf(col("number"))
4 )

1 actualDF.show()
2
3 +------+-------+
4 |number|is_even|
5 +------+-------+
6 | 1| false|
7 | 8| true|
8 | 12| true|
9 | null| null|
10 +------+-------+

The isEvenBetter function is still directly referring to null. Let’s do a final refactoring to fully
remove null from the user defined function.

Best Scala Style Solution (What about performance?)


We’ll use Option to get rid of null once and for all!

1 def isEvenOption(n: Integer): Option[Boolean] = {


2 val num = Option(n).getOrElse(return None)
3 Some(num % 2 == 0)
4 }
5
6 val isEvenOptionUdf = udf[Option[Boolean], Integer](isEvenOption)

The isEvenOption function converts the integer to an Option value and returns None if the conversion
cannot take place. This code does not use null and follows the purist advice: “Ban null from any of
your code. Period.”
This solution is less performant than directly referring to null, so a refactoring should be considered
if performance becomes a bottleneck.

User Defined Functions Cannot Take Options as Params


User defined functions surprisingly cannot take an Option value as a parameter, so this code won’t
work:

1 def isEvenBroke(n: Option[Integer]): Option[Boolean] = {


2 val num = n.getOrElse(return None)
3 Some(num % 2 == 0)
4 }
5
6 val isEvenBrokeUdf = udf[Option[Boolean], Option[Integer]](isEvenBroke)

If you run this code, you’ll get the following error:

1 org.apache.spark.SparkException: Failed to execute user defined function


2
3 Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to
4 scala.Option

Avoiding UDFs is the best when possible


In this example, we can avoid UDFs completely and get the desired result:

1 numbersDF.withColumn(
2 "is_even",
3 col("number") / lit(2) === lit(0)
4 )

Spark Rules for Dealing with null


Use native Spark code whenever possible to avoid writing null edge case logic.
If UDFs are needed, follow these rules:

• Scala code should deal with null values gracefully and shouldn’t error out if there are null
values.
• Scala code should return None (or null) for values that are unknown, missing, or irrelevant.
DataFrames should also use null for values that are unknown, missing, or irrelevant.
• Use Option in Scala code and fall back on null if Option becomes a performance bottleneck.
Using JAR Files Locally
This chapter explains how to attach spark-daria to a Spark console session and to a Databricks
cluster.
We’ll need the spark-daria createDF method to easily make DataFrames because the createDataFrame
method is too verbose.

Starting the console with a JAR file


You can download the spark-daria JAR file on this release page¹⁹.
The JAR file is downloaded to /Users/powers/Downloads/spark-daria-0.35.2.jar on my machine.
I downloaded Spark and saved it in the /Users/powers/spark-2.4.0-bin-hadoop2.7 directory.
The Spark console can be initiated with spark-daria on my machine with this command:

1 bash /Users/powers/spark-2.4.0-bin-hadoop2.7/bin/spark-shell \
2 --jars /Users/powers/Downloads/spark-daria-0.35.2.jar

Let’s access a class that’s defined in spark-daria to make sure the code was successfully loaded in
the console.

1 scala> com.github.mrpowers.spark.daria.sql.EtlDefinition
2 res0: com.github.mrpowers.spark.daria.sql.EtlDefinition.type = EtlDefinition

Quit the terminal session with the :quit command. It’ll look like this when typed into the console.

1 scala> :quit

Adding JAR file to an existing console session


You can add a JAR file to an existing console session with the :require command.
Shut down your current console session and start a new one (don’t attach the spark-daria JAR this
time):

¹⁹https://github.com/MrPowers/spark-daria/releases/tag/v0.35.2

1 bash /Users/powers/spark-2.4.0-bin-hadoop2.7/bin/spark-shell

Let’s verify that we cannot access the spark-daria EtlDefinition class.

1 scala> com.github.mrpowers.spark.daria.sql.EtlDefinition
2 <console>:24: error: object mrpowers is not a member of package com.github
3 com.github.mrpowers.spark.daria.sql.EtlDefinition

Let’s add spark-daria JAR to the console we just started with the :require command.

1 scala> :require /Users/powers/Downloads/spark-daria-0.35.2.jar


2 Added '/Users/powers/Downloads/spark-daria-0.35.2.jar' to classpath.

Let’s verify that we can access the EtlDefinition class now.

1 scala> com.github.mrpowers.spark.daria.sql.EtlDefinition
2 res1: com.github.mrpowers.spark.daria.sql.EtlDefinition.type = EtlDefinition

Attaching JARs to Databricks clusters


We can also attach spark-daria to Databricks notebooks.
Create a Libraries folder in your Databricks account and click “Create Library”.

Create Library

Click the “Drop Jar here” link.



Drop Jar here

Attach the JAR file and then click “Create”:



Upload JAR to Databricks account

Create a cluster as we’ve already discussed.


Once the cluster is running, click on the spark-daria JAR file you uploaded.
Attach the JAR file to your cluster.

Attach JAR to cluster

Create a notebook, attach it to your cluster, and verify you can access the spark-daria EtlDefinition
class.

Accessing spark-daria code in Databricks

Review
This chapter showed you how to attach the spark-daria JAR file to console sessions and Databricks
notebooks.
You can use this workflow to attach any JAR files to your Spark analyses.
Notice how the :require command was used to add spark-daria to the classpath of an existing
Spark console. Starting up a Databricks cluster and then attaching spark-daria to the cluster classpath
is similar. Running Spark code locally helps you understand how the code works in a cluster
environment.
Working with Spark ArrayType
columns
Spark DataFrame columns support arrays, which are great for data sets that have an arbitrary length.
This chapter will demonstrate Spark methods that return ArrayType columns, describe how to create
your own ArrayType columns, and explain when to use arrays in your analyses.

Scala collections
Scala has different types of collections: lists, sequences, and arrays. Let’s quickly review the different
types of Scala collections before jumping into Spark ArrayType columns.
Let’s create and sort a collection of numbers.

1 List(10, 2, 3).sorted // List[Int] = List(2, 3, 10)


2 Seq(10, 2, 3).sorted // Seq[Int] = List(2, 3, 10)
3 Array(10, 2, 3).sorted // Array[Int] = Array(2, 3, 10)

List, Seq, and Array differ slightly, but generally work the same. Most Spark programmers don’t
need to know about how these collections differ.
Spark uses arrays for ArrayType columns, so we’ll mainly use arrays in our code snippets.

Splitting a string into an ArrayType column


Let’s create a DataFrame with a name column and a hit_songs pipe delimited string. Then let’s use
the split() method to convert hit_songs into an array of strings.

1 val singersDF = Seq(


2 ("beatles", "help|hey jude"),
3 ("romeo", "eres mia")
4 ).toDF("name", "hit_songs")
5
6 val actualDF = singersDF.withColumn(
7 "hit_songs",
8 split(col("hit_songs"), "\\|")
9 )

1 actualDF.show()
2
3 +-------+----------------+
4 | name| hit_songs|
5 +-------+----------------+
6 |beatles|[help, hey jude]|
7 | romeo| [eres mia]|
8 +-------+----------------+

1 actualDF.printSchema()
2
3 root
4 |-- name: string (nullable = true)
5 |-- hit_songs: array (nullable = true)
6 | |-- element: string (containsNull = true)

An ArrayType column is suitable in this example because a singer can have an arbitrary amount of
hit songs. We don’t want to create a DataFrame with hit_song1, hit_song2, …, hit_songN columns.

Directly creating an ArrayType column


Let’s use the spark-daria createDF method to create a DataFrame with an ArrayType column
directly.
Let’s create another singersDF with some different artists.

1 val singersDF = spark.createDF(


2 List(
3 ("bieber", Array("baby", "sorry")),
4 ("ozuna", Array("criminal"))
5 ), List(
6 ("name", StringType, true),
7 ("hit_songs", ArrayType(StringType, true), true)
8 )
9 )

1 singersDF.show()
2
3 +------+-------------+
4 | name| hit_songs|
5 +------+-------------+
6 |bieber|[baby, sorry]|
7 | ozuna| [criminal]|
8 +------+-------------+

1 singersDF.printSchema()
2
3 root
4 |-- name: string (nullable = true)
5 |-- hit_songs: array (nullable = true)
6 | |-- element: string (containsNull = true)

The ArrayType case class is instantiated with an elementType and a containsNull flag. In ArrayType(StringType,
true), StringType is the elementType and true is the containsNull flag.

Here’s the documentation for the ArrayType class²⁰.
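If you prefer the verbose createDataFrame() approach, the same schema can be written with StructType directly (a quick sketch that assumes org.apache.spark.sql.types._ is imported):

val singersSchema = StructType(List(
  StructField("name", StringType, true),
  StructField("hit_songs", ArrayType(StringType, true), true)
))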

array_contains

The Spark functions²¹ object provides helper methods for working with ArrayType columns. The
array_contains method returns true if the array contains a specified element.

Let’s create an array with people and their favorite colors. Then let’s use array_contains to append
a likes_red column that returns true if the person likes red.

²⁰http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.ArrayType
²¹http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

1 val peopleDF = spark.createDF(


2 List(
3 ("bob", Array("red", "blue")),
4 ("maria", Array("green", "red")),
5 ("sue", Array("black"))
6 ), List(
7 ("name", StringType, true),
8 ("favorite_colors", ArrayType(StringType, true), true)
9 )
10 )
11
12 val actualDF = peopleDF.withColumn(
13 "likes_red",
14 array_contains(col("favorite_colors"), "red")
15 )

1 actualDF.show()
2
3 +-----+---------------+---------+
4 | name|favorite_colors|likes_red|
5 +-----+---------------+---------+
6 | bob| [red, blue]| true|
7 |maria| [green, red]| true|
8 | sue| [black]| false|
9 +-----+---------------+---------+

explode

Let’s use the same DataFrame and the explode() method to create a new row for every element in
each array.

1 val df = peopleDF.select(
2 col("name"),
3 explode(col("favorite_colors")).as("color")
4 )

1 df.show()
2
3 +-----+-----+
4 | name|color|
5 +-----+-----+
6 | bob| red|
7 | bob| blue|
8 |maria|green|
9 |maria| red|
10 | sue|black|
11 +-----+-----+

peopleDF has 3 rows and the exploded DataFrame has 5 rows. The explode() method adds rows to
a DataFrame.

collect_list

The collect_list method collapses a DataFrame into fewer rows and stores the collapsed data in
an ArrayType column.
Let’s create a DataFrame with letter1, letter2, and number1 columns.

1 val df = Seq(
2 ("a", "b", 1),
3 ("a", "b", 2),
4 ("a", "b", 3),
5 ("z", "b", 4),
6 ("a", "x", 5)
7 ).toDF("letter1", "letter2", "number1")
8
9 df.show()

1 +-------+-------+-------+
2 |letter1|letter2|number1|
3 +-------+-------+-------+
4 | a| b| 1|
5 | a| b| 2|
6 | a| b| 3|
7 | z| b| 4|
8 | a| x| 5|
9 +-------+-------+-------+

Let’s use the collect_list() method to eliminate all the rows with duplicate letter1 and letter2
rows in the DataFrame and collect all the number1 entries as a list.

1 df
2 .groupBy("letter1", "letter2")
3 .agg(collect_list("number1") as "number1s")
4 .show()

1 +-------+-------+---------+
2 |letter1|letter2| number1s|
3 +-------+-------+---------+
4 | a| x| [5]|
5 | z| b| [4]|
6 | a| b|[1, 2, 3]|
7 +-------+-------+---------+

We can see that number1s is an ArrayType column.

1 df.printSchema
2
3 root
4 |-- letter1: string (nullable = true)
5 |-- letter2: string (nullable = true)
6 |-- number1s: array (nullable = true)
7 | |-- element: integer (containsNull = true)

Single column array functions


Spark added a ton of useful array functions in the 2.4 release²².
We will start with the functions for a single ArrayType column and then move on to the functions
for multiple ArrayType columns.
Let’s start by creating a DataFrame with an ArrayType column.

²²https://databricks.com/blog/2018/11/16/introducing-new-built-in-functions-and-higher-order-functions-for-complex-data-types-in-apache-spark.html

1 val df = spark.createDF(
2 List(
3 (Array(1, 2)),
4 (Array(1, 2, 3, 1)),
5 (null)
6 ), List(
7 ("nums", ArrayType(IntegerType, true), true)
8 )
9 )

1 df.show()
2
3 +------------+
4 | nums|
5 +------------+
6 | [1, 2]|
7 |[1, 2, 3, 1]|
8 | null|
9 +------------+

Let’s use the array_distinct() method to remove all of the duplicate array elements in the nums
column.

1 df
2 .withColumn("nums_distinct", array_distinct($"nums"))
3 .show()
4
5 +------------+-------------+
6 | nums|nums_distinct|
7 +------------+-------------+
8 | [1, 2]| [1, 2]|
9 |[1, 2, 3, 1]| [1, 2, 3]|
10 | null| null|
11 +------------+-------------+

Let’s use array_join() to create a pipe delimited string of all elements in the arrays.

1 df
2 .withColumn("nums_joined", array_join($"nums", "|"))
3 .show()
4
5 +------------+-----------+
6 | nums|nums_joined|
7 +------------+-----------+
8 | [1, 2]| 1|2|
9 |[1, 2, 3, 1]| 1|2|3|1|
10 | null| null|
11 +------------+-----------+

Let’s use the printSchema method to verify that nums_joined is a StringType column.

1 df
2 .withColumn("nums_joined", array_join($"nums", "|"))
3 .printSchema()
4
5 root
6 |-- nums: array (nullable = true)
7 | |-- element: integer (containsNull = true)
8 |-- nums_joined: string (nullable = true)

Let’s use array_max to grab the maximum value from the arrays.

1 df
2 .withColumn("nums_max", array_max($"nums"))
3 .show()
4
5 +------------+--------+
6 | nums|nums_max|
7 +------------+--------+
8 | [1, 2]| 2|
9 |[1, 2, 3, 1]| 3|
10 | null| null|
11 +------------+--------+

Let’s use array_min to grab the minimum value from the arrays.

1 df
2 .withColumn("nums_min", array_min($"nums"))
3 .show()
4
5 +------------+--------+
6 | nums|nums_min|
7 +------------+--------+
8 | [1, 2]| 1|
9 |[1, 2, 3, 1]| 1|
10 | null| null|
11 +------------+--------+

Let’s use the array_remove method to remove all the 1s from each of the arrays.

1 df
2 .withColumn("nums_sans_1", array_remove($"nums", 1))
3 .show()
4
5 +------------+-----------+
6 | nums|nums_sans_1|
7 +------------+-----------+
8 | [1, 2]| [2]|
9 |[1, 2, 3, 1]| [2, 3]|
10 | null| null|
11 +------------+-----------+

Let’s use array_sort to sort all of the arrays in ascending order.

1 df
2 .withColumn("nums_sorted", array_sort($"nums"))
3 .show()
4
5 +------------+------------+
6 | nums| nums_sorted|
7 +------------+------------+
8 | [1, 2]| [1, 2]|
9 |[1, 2, 3, 1]|[1, 1, 2, 3]|
10 | null| null|
11 +------------+------------+

Generic single column array functions


Suppose you have an array of strings and would like to see if all elements in the array begin with
the letter c. Here’s how you can run this check on a Scala array:

1 Array("cream", "cookies").forall(_.startsWith("c")) // true


2 Array("taco", "clam").forall(_.startsWith("c")) // false

You can use the spark-daria²³ forall() method to run this computation on a Spark DataFrame with
an ArrayType column.

1 import com.github.mrpowers.spark.daria.sql.functions._
2
3 val df = spark.createDF(
4 List(
5 (Array("cream", "cookies")),
6 (Array("taco", "clam"))
7 ), List(
8 ("words", ArrayType(StringType, true), true)
9 )
10 )
11
12 df.withColumn(
13 "all_words_begin_with_c",
14 forall[String]((x: String) => x.startsWith("c")).apply(col("words"))
15 ).show()

1 +----------------+----------------------+
2 | words|all_words_begin_with_c|
3 +----------------+----------------------+
4 |[cream, cookies]| true|
5 | [taco, clam]| false|
6 +----------------+----------------------+

The native Spark API doesn’t provide access to all the helpful collection methods provided by Scala.
spark-daria²⁴ uses User Defined Functions to define forall and exists methods.
Spark will add higher level array functions to the API when Scala 3 is released.
²³https://github.com/MrPowers/spark-daria
²⁴https://github.com/MrPowers/spark-daria
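If you'd rather not add a dependency, here's a minimal sketch of the same idea with a plain UDF (allStartWithC is just an illustrative name; it runs against the words DataFrame defined above):

import org.apache.spark.sql.functions.{col, udf}

// Returning an Option keeps the result null when the input array is null
val allStartWithC = udf((words: Seq[String]) =>
  Option(words).map(_.forall(_.startsWith("c")))
)

df.withColumn("all_words_begin_with_c", allStartWithC(col("words"))).show()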

Multiple column array functions


Let’s create a DataFrame with two ArrayType columns so we can try out the built-in Spark array
functions that take multiple ArrayType columns as input.

1 val numbersDF = spark.createDF(


2 List(
3 (Array(1, 2), Array(4, 5, 6)),
4 (Array(1, 2, 3, 1), Array(2, 3, 4)),
5 (null, Array(6, 7))
6 ), List(
7 ("nums1", ArrayType(IntegerType, true), true),
8 ("nums2", ArrayType(IntegerType, true), true)
9 )
10 )

Let’s use array_intersect to get the elements present in both the arrays without any duplication.

1 numbersDF
2 .withColumn("nums_intersection", array_intersect($"nums1", $"nums2"))
3 .show()
4
5 +------------+---------+-----------------+
6 | nums1| nums2|nums_intersection|
7 +------------+---------+-----------------+
8 | [1, 2]|[4, 5, 6]| []|
9 |[1, 2, 3, 1]|[2, 3, 4]| [2, 3]|
10 | null| [6, 7]| null|
11 +------------+---------+-----------------+

Let’s use array_union to get the elements in either array, without duplication.

1 numbersDF
2 .withColumn("nums_union", array_union($"nums1", $"nums2"))
3 .show()

1 +------------+---------+---------------+
2 | nums1| nums2| nums_union|
3 +------------+---------+---------------+
4 | [1, 2]|[4, 5, 6]|[1, 2, 4, 5, 6]|
5 |[1, 2, 3, 1]|[2, 3, 4]| [1, 2, 3, 4]|
6 | null| [6, 7]| null|
7 +------------+---------+---------------+

Let’s use array_except to get the elements that are in num1 and not in num2 without any duplication.

1 numbersDF
2 .withColumn("nums1_nums2_except", array_except($"nums1", $"nums2"))
3 .show()
4
5 +------------+---------+------------------+
6 | nums1| nums2|nums1_nums2_except|
7 +------------+---------+------------------+
8 | [1, 2]|[4, 5, 6]| [1, 2]|
9 |[1, 2, 3, 1]|[2, 3, 4]| [1]|
10 | null| [6, 7]| null|
11 +------------+---------+------------------+

Split array column into multiple columns


We can split an array column into multiple columns with getItem. Let's create a DataFrame with a
letters column and demonstrate how this single ArrayType column can be split into a DataFrame
with three StringType columns.

1 val df = spark.createDF(
2 List(
3 (Array("a", "b", "c")),
4 (Array("d", "e", "f")),
5 (null)
6 ), List(
7 ("letters", ArrayType(StringType, true), true)
8 )
9 )

1 df.show()
2
3 +---------+
4 | letters|
5 +---------+
6 |[a, b, c]|
7 |[d, e, f]|
8 | null|
9 +---------+

This example uses the same data as this Stackoverflow question²⁵.


Let’s use getItem to break out the array into col1, col2, and col3.

1 df
2 .select(
3 $"letters".getItem(0).as("col1"),
4 $"letters".getItem(1).as("col2"),
5 $"letters".getItem(2).as("col3")
6 )
7 .show()
8
9 +----+----+----+
10 |col1|col2|col3|
11 +----+----+----+
12 | a| b| c|
13 | d| e| f|
14 |null|null|null|
15 +----+----+----+

Here’s how we can use getItem with a loop.

1 df
2 .select(
3 (0 until 3).map(i => $"letters".getItem(i).as(s"col$i")): _*
4 )
5 .show()
6
7 +----+----+----+
8 |col0|col1|col2|
9 +----+----+----+
10 | a| b| c|
11 | d| e| f|
12 |null|null|null|
13 +----+----+----+
²⁵https://stackoverflow.com/questions/39255973/split-1-column-into-3-columns-in-spark-scala

Our code snippet above is a little ugly because the 3 is hardcoded. We can calculate the size of every
array in the column, take the max size, and use that rather than hardcoding.

1 val numCols = df
2 .withColumn("letters_size", size($"letters"))
3 .agg(max($"letters_size"))
4 .head()
5 .getInt(0)
6
7 df
8 .select(
9 (0 until numCols).map(i => $"letters".getItem(i).as(s"col$i")): _*
10 )
11 .show()
12
13 +----+----+----+
14 |col0|col1|col2|
15 +----+----+----+
16 | a| b| c|
17 | d| e| f|
18 |null|null|null|
19 +----+----+----+

Closing thoughts
Spark ArrayType columns make it easy to work with collections at scale.
Master the content covered in this chapter to add a powerful skill to your toolset.
For more examples, see this Databricks notebook²⁶ that covers even more Array / Map functions.
²⁶https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/142158605138935/3773509768457258/7497868276316206/latest.html
Working with Spark MapType Columns
Spark DataFrame columns support maps, which are great for key / value pairs with an arbitrary
length.
This chapter describes how to create MapType columns, demonstrates built-in functions to manipulate
MapType columns, and explains when to use maps in your analyses.

Scala maps
Let’s begin with a little refresher on Scala maps.
Create a Scala map that connects some English and Spanish words.

1 val wordMapping = Map("one" -> "uno", "dog" -> "perro")

Fetch the value associated with the dog key:

1 wordMapping("dog") // "perro"

Creating MapType columns


Let’s create a DataFrame with a MapType column.

1 val singersDF = spark.createDF(


2 List(
3 ("sublime", Map(
4 "good_song" -> "santeria",
5 "bad_song" -> "doesn't exist")
6 ),
7 ("prince_royce", Map(
8 "good_song" -> "darte un beso",
9 "bad_song" -> "back it up")
10 )
11 ), List(
12 ("name", StringType, true),
13 ("songs", MapType(StringType, StringType, true), true)
14 )
15 )

1 singersDF.show(false)
2
3 +------------+----------------------------------------------------+
4 |name |songs |
5 +------------+----------------------------------------------------+
6 |sublime |[good_song -> santeria, bad_song -> doesn't exist] |
7 |prince_royce|[good_song -> darte un beso, bad_song -> back it up]|
8 +------------+----------------------------------------------------+

Let’s examine the DataFrame schema and verify that the songs column has a MapType:

1 singersDF.printSchema()
2
3 root
4 |-- name: string (nullable = true)
5 |-- songs: map (nullable = true)
6 | |-- key: string
7 | |-- value: string (valueContainsNull = true)

We can see that songs is a MapType column.


Let’s explore some built-in Spark methods that make it easy to work with MapType columns.

Fetching values from maps with element_at()


Let’s use the singersDF DataFrame and append song_to_love as a column.

1 singersDF
2 .withColumn("song_to_love", element_at(col("songs"), "good_song"))
3 .show(false)

1 +------------+----------------------------------------------------+-------------+
2 |name |songs |song_to_love |
3 +------------+----------------------------------------------------+-------------+
4 |sublime |[good_song -> santeria, bad_song -> doesn't exist] |santeria |
5 |prince_royce|[good_song -> darte un beso, bad_song -> back it up]|darte un beso|
6 +------------+----------------------------------------------------+-------------+

The element_at() function fetches a value from a MapType column.
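You can get the same result with getItem(), which looks up a map value by key (the apply syntax col("songs")("good_song") also works). A quick sketch:

singersDF
  .withColumn("song_to_love", col("songs").getItem("good_song"))
  .show(false)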

Appending MapType columns


We can use the map() method defined in org.apache.spark.sql.functions to append a MapType
column to a DataFrame.

1 val countriesDF = spark.createDF(


2 List(
3 ("costa_rica", "sloth"),
4 ("nepal", "red_panda")
5 ), List(
6 ("country_name", StringType, true),
7 ("cute_animal", StringType, true)
8 )
9 ).withColumn(
10 "some_map",
11 map(col("country_name"), col("cute_animal"))
12 )

1 countriesDF.show(false)
2
3 +------------+-----------+---------------------+
4 |country_name|cute_animal|some_map |
5 +------------+-----------+---------------------+
6 |costa_rica |sloth |[costa_rica -> sloth]|
7 |nepal |red_panda |[nepal -> red_panda] |
8 +------------+-----------+---------------------+

Let’s verify that some_map is a MapType column:



1 countriesDF.printSchema()
2
3 root
4 |-- country_name: string (nullable = true)
5 |-- cute_animal: string (nullable = true)
6 |-- some_map: map (nullable = false)
7 | |-- key: string
8 | |-- value: string (valueContainsNull = true)

Creating MapType columns from two ArrayType columns

We can create a MapType column from two ArrayType columns.

1 val df = spark.createDF(
2 List(
3 (Array("a", "b"), Array(1, 2)),
4 (Array("x", "y"), Array(33, 44))
5 ), List(
6 ("letters", ArrayType(StringType, true), true),
7 ("numbers", ArrayType(IntegerType, true), true)
8 )
9 ).withColumn(
10 "strange_map",
11 map_from_arrays(col("letters"), col("numbers"))
12 )

1 df.show(false)
2
3 +-------+--------+------------------+
4 |letters|numbers |strange_map |
5 +-------+--------+------------------+
6 |[a, b] |[1, 2] |[a -> 1, b -> 2] |
7 |[x, y] |[33, 44]|[x -> 33, y -> 44]|
8 +-------+--------+------------------+

Let’s take a look at the df schema and verify strange_map is a MapType column:

1 df.printSchema()
2
3 |-- letters: array (nullable = true)
4 | |-- element: string (containsNull = true)
5 |-- numbers: array (nullable = true)
6 | |-- element: integer (containsNull = true)
7 |-- strange_map: map (nullable = true)
8 | |-- key: string
9 | |-- value: integer (valueContainsNull = true)

The Spark way of converting two arrays to a map is different from the "regular Scala" way of converting
two arrays to a map.

Converting Arrays to Maps with Scala


Here’s how you’d convert two collections to a map with Scala.

1 val list1 = List("a", "b")


2 val list2 = List(1, 2)
3
4 list1.zip(list2).toMap // Map(a -> 1, b -> 2)

We could wrap this code in a User Defined Function and define our own map_from_arrays function
if we wanted.
In general, it’s best to rely on the standard Spark library instead of defining our own UDFs.
The key takeaway is that the Spark way of solving a problem is often different from the Scala way.
Read the API docs and always try to solve your problems the Spark way.
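For illustration only, here's a rough sketch of what such a UDF could look like (zipToMap is a hypothetical name; as noted above, prefer the built-in map_from_arrays):

import org.apache.spark.sql.functions.udf

// Zip two array columns into a map, returning null if either input array is null
val zipToMap = udf((keys: Seq[String], values: Seq[Int]) =>
  if (keys == null || values == null) null else keys.zip(values).toMap
)

df.withColumn("strange_map", zipToMap(col("letters"), col("numbers")))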

Merging maps with map_concat()


map_concat() can be used to combine multiple MapType columns to a single MapType column.

1 val df = spark.createDF(
2 List(
3 (Map("a" -> "aaa", "b" -> "bbb"), Map("c" -> "ccc", "d" -> "ddd"))
4 ), List(
5 ("some_data", MapType(StringType, StringType, true), true),
6 ("more_data", MapType(StringType, StringType, true), true)
7 )
8 )
9
10 df
11 .withColumn("all_data", map_concat(col("some_data"), col("more_data")))
12 .show(false)

1 +--------------------+--------------------+----------------------------------------+
2 |some_data |more_data |all_data |
3 +--------------------+--------------------+----------------------------------------+
4 |[a -> aaa, b -> bbb]|[c -> ccc, d -> ddd]|[a -> aaa, b -> bbb, c -> ccc, d -> ddd]|
5 +--------------------+--------------------+----------------------------------------+

Using StructType columns instead of MapType columns

Let's create a DataFrame that stores information about athletes.

1 val athletesDF = spark.createDF(


2 List(
3 ("lebron",
4 Map(
5 "height" -> "6.67",
6 "units" -> "feet"
7 )
8 ),
9 ("messi",
10 Map(
11 "height" -> "1.7",
12 "units" -> "meters"
13 )
14 )
15 ), List(
16 ("name", StringType, true),
17 ("stature", MapType(StringType, StringType, true), true)
18 )
19 )
20
21 athletesDF.show(false)

1 +------+--------------------------------+
2 |name |stature |
3 +------+--------------------------------+
4 |lebron|[height -> 6.67, units -> feet] |
5 |messi |[height -> 1.7, units -> meters]|
6 +------+--------------------------------+

1 athletesDF.printSchema()
2
3 root
4 |-- name: string (nullable = true)
5 |-- stature: map (nullable = true)
6 | |-- key: string
7 | |-- value: string (valueContainsNull = true)

stature is a MapType column, but we can also store stature as a StructType column.

1 val data = Seq(


2 Row("lebron", Row("6.67", "feet")),
3 Row("messi", Row("1.7", "meters"))
4 )
5
6 val schema = StructType(
7 List(
8 StructField("player_name", StringType, true),
9 StructField(
10 "stature",
11 StructType(
12 List(
13 StructField("height", StringType, true),
14 StructField("unit", StringType, true)
15 )
16 ),
17 true
18 )

19 )
20 )
21
22 val athletesDF = spark.createDataFrame(
23 spark.sparkContext.parallelize(data),
24 schema
25 )

1 athletesDF.show(false)
2
3 +-----------+-------------+
4 |player_name|stature |
5 +-----------+-------------+
6 |lebron |[6.67, feet] |
7 |messi |[1.7, meters]|
8 +-----------+-------------+

1 athletesDF.printSchema()
2
3 root
4 |-- player_name: string (nullable = true)
5 |-- stature: struct (nullable = true)
6 | |-- height: string (nullable = true)
7 | |-- unit: string (nullable = true)

Sometimes both StructType and MapType columns can solve the same problem and you can choose
between the two.
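The field access syntax is the main day-to-day difference. Here's a sketch, using hypothetical names for the two variants of athletesDF built above (mapAthletesDF for the MapType version and structAthletesDF for the StructType version):

mapAthletesDF.select(element_at($"stature", "height").as("height"))   // map: look up by key
structAthletesDF.select($"stature.height")                            // struct: access by field name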

Writing MapType columns to disk


The CSV file format cannot handle MapType columns.
This code will error out.

1 val outputPath = new java.io.File("./tmp/csv_with_map/").getCanonicalPath


2
3 spark.createDF(
4 List(
5 (Map("a" -> "aaa", "b" -> "bbb"))
6 ), List(
7 ("some_data", MapType(StringType, StringType, true), true)
8 )
9 ).write.csv(outputPath)

Here’s the error message:

1 writing to disk
2 - cannot write maps to disk with the CSV format *** FAILED ***
3 org.apache.spark.sql.AnalysisException: CSV data source does not support map<strin\
4 g,string> data type.;
5 at org.apache.spark.sql.execution.datasources.DataSourceUtils$$anonfun$verifySchem\
6 a$1.apply(DataSourceUtils.scala:69)
7 at org.apache.spark.sql.execution.datasources.DataSourceUtils$$anonfun$verifySchem\
8 a$1.apply(DataSourceUtils.scala:67)
9 at scala.collection.Iterator$class.foreach(Iterator.scala:891)
10 at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
11 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
12 at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
13 at org.apache.spark.sql.execution.datasources.DataSourceUtils$.verifySchema(DataSo\
14 urceUtils.scala:67)
15 at org.apache.spark.sql.execution.datasources.DataSourceUtils$.verifyWriteSchema(D\
16 ataSourceUtils.scala:34)
17 at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWr\
18 iter.scala:100)
19 at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.ru\
20 n(InsertIntoHadoopFsRelationCommand.scala:159)

MapType columns can be written out with the Parquet file format. This code runs just fine:

1 val outputPath = new java.io.File("./tmp/csv_with_map/").getCanonicalPath


2
3 spark.createDF(
4 List(
5 (Map("a" -> "aaa", "b" -> "bbb"))
6 ), List(
7 ("some_data", MapType(StringType, StringType, true), true)
8 )
9 ).write.parquet(outputPath)

Conclusion
MapType columns are a great way to store key / value pairs of arbitrary lengths in a DataFrame
column.
Spark 2.4 added a lot of native functions that make it easier to work with MapType columns. Prior
to Spark 2.4, developers were overly reliant on UDFs for manipulating MapType columns.
StructType columns can often be used instead of a MapType column. Study both of these column
types closely so you can understand the tradeoffs and intelligently select the best column type for
your analysis.
Adding StructType columns to DataFrames
StructType objects define the schema of DataFrames. StructType objects contain a list of StructField
objects that define the name, type, and nullable flag for each column in a DataFrame.
Let’s start with an overview of StructType objects and then demonstrate how StructType columns
can be added to DataFrame schemas (essentially creating a nested schema).
StructType columns are a great way to eliminate order dependencies from Spark code.

StructType overview
The StructType case class can be used to define a DataFrame schema as follows.

1 val data = Seq(


2 Row(1, "a"),
3 Row(5, "z")
4 )
5
6 val schema = StructType(
7 List(
8 StructField("num", IntegerType, true),
9 StructField("letter", StringType, true)
10 )
11 )
12
13 val df = spark.createDataFrame(
14 spark.sparkContext.parallelize(data),
15 schema
16 )

1 df.show()
2
3 +---+------+
4 |num|letter|
5 +---+------+
6 | 1| a|
7 | 5| z|
8 +---+------+

The DataFrame schema method returns a StructType object.

1 print(df.schema)
2
3 StructType(
4 StructField(num, IntegerType, true),
5 StructField(letter, StringType, true)
6 )

Let’s look at another example to see how StructType columns can be appended to DataFrames.

Appending StructType columns


Let’s use the struct() function to append a StructType column to a DataFrame.

1 val data = Seq(


2 Row(20.0, "dog"),
3 Row(3.5, "cat"),
4 Row(0.000006, "ant")
5 )
6
7 val schema = StructType(
8 List(
9 StructField("weight", DoubleType, true),
10 StructField("animal_type", StringType, true)
11 )
12 )
13
14 val df = spark.createDataFrame(
15 spark.sparkContext.parallelize(data),
16 schema
17 )

18
19 val actualDF = df.withColumn(
20 "animal_interpretation",
21 struct(
22 (col("weight") > 5).as("is_large_animal"),
23 col("animal_type").isin("rat", "cat", "dog").as("is_mammal")
24 )
25 )

1 actualDF.show(truncate = false)
2
3 +------+-----------+---------------------+
4 |weight|animal_type|animal_interpretation|
5 +------+-----------+---------------------+
6 |20.0 |dog |[true,true] |
7 |3.5 |cat |[false,true] |
8 |6.0E-6|ant |[false,false] |
9 +------+-----------+---------------------+

Let’s take a look at the schema.

1 print(actualDF.schema)
2
3 StructType(
4 StructField(weight,DoubleType,true),
5 StructField(animal_type,StringType,true),
6 StructField(animal_interpretation, StructType(
7 StructField(is_large_animal,BooleanType,true),
8 StructField(is_mammal,BooleanType,true)
9 ), false)
10 )

The animal_interpretation column has a StructType type, so this DataFrame has a nested schema.
It’s easier to view the schema with the printSchema method.

1 actualDF.printSchema()
2
3 root
4 |-- weight: double (nullable = true)
5 |-- animal_type: string (nullable = true)
6 |-- animal_interpretation: struct (nullable = false)
7 | |-- is_large_animal: boolean (nullable = true)
8 | |-- is_mammal: boolean (nullable = true)

We can flatten the DataFrame as follows.

1 actualDF.select(
2 col("animal_type"),
3 col("animal_interpretation")("is_large_animal").as("is_large_animal"),
4 col("animal_interpretation")("is_mammal").as("is_mammal")
5 ).show(truncate = false)

1 +-----------+---------------+---------+
2 |animal_type|is_large_animal|is_mammal|
3 +-----------+---------------+---------+
4 |dog |true |true |
5 |cat |false |true |
6 |ant |false |false |
7 +-----------+---------------+---------+
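A shorthand for pulling out every field of a struct at once is the star syntax. A quick sketch with the same actualDF:

actualDF.select("animal_type", "animal_interpretation.*").show(truncate = false)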

Using StructTypes to eliminate order dependencies


Let’s demonstrate some order dependent code and then use a StructType column to eliminate the
order dependencies.
Let’s consider three custom transformations that add is_teenager, has_positive_mood, and what_-
to_do columns to a DataFrame.

1 def withIsTeenager()(df: DataFrame): DataFrame = {


2 df.withColumn("is_teenager", col("age").between(13, 19))
3 }
4
5 def withHasPositiveMood()(df: DataFrame): DataFrame = {
6 df.withColumn(
7 "has_positive_mood",
8 col("mood").isin("happy", "glad")
9 )
10 }
11
12 def withWhatToDo()(df: DataFrame) = {
13 df.withColumn(
14 "what_to_do",
15 when(
16 col("is_teenager") && col("has_positive_mood"),
17 "have a chat"
18 )
19 )
20 }

Notice that both the withIsTeenager and withHasPositiveMood transformations must be run before
the withWhatToDo transformation can be run. The functions have an order dependency because
they must be run in a certain order for the code to work.
Let’s build a DataFrame and execute the functions in the right order so the code will run.

1 val data = Seq(


2 Row(30, "happy"),
3 Row(13, "sad"),
4 Row(18, "glad")
5 )
6
7 val schema = StructType(
8 List(
9 StructField("age", IntegerType, true),
10 StructField("mood", StringType, true)
11 )
12 )
13
14 val df = spark.createDataFrame(
15 spark.sparkContext.parallelize(data),
16 schema

17 )
18
19 df
20 .transform(withIsTeenager())
21 .transform(withHasPositiveMood())
22 .transform(withWhatToDo())
23 .show()

1 +---+-----+-----------+-----------------+-----------+
2 |age| mood|is_teenager|has_positive_mood| what_to_do|
3 +---+-----+-----------+-----------------+-----------+
4 | 30|happy| false| true| null|
5 | 13| sad| true| false| null|
6 | 18| glad| true| true|have a chat|
7 +---+-----+-----------+-----------------+-----------+

Let’s use the struct function to append a StructType column to the DataFrame and remove the order
depenencies from this code.

1 val isTeenager = col("age").between(13, 19)


2 val hasPositiveMood = col("mood").isin("happy", "glad")
3
4 df.withColumn(
5 "best_action",
6 struct(
7 isTeenager.as("is_teenager"),
8 hasPositiveMood.as("has_positive_mood"),
9 when(
10 isTeenager && hasPositiveMood,
11 "have a chat"
12 ).as("what_to_do")
13 )
14 ).show(truncate = false)

1 +---+-----+-----------------------+
2 |age|mood |best_action |
3 +---+-----+-----------------------+
4 |30 |happy|[false,true,null] |
5 |13 |sad |[true,false,null] |
6 |18 |glad |[true,true,have a chat]|
7 +---+-----+-----------------------+

Order dependencies can be a big problem in large Spark codebases

If your code is organized as DataFrame transformations, order dependencies can become a big
problem.
You might need to figure out how to call 20 functions in exactly the right order to get the desired
result.
StructType columns are one way to eliminate order dependencies from your code.
Working with dates and times
Spark supports DateType and TimestampType columns and defines a rich API of functions to make
working with dates and times easy. This chapter demonstrates how to make DataFrames with
DateType / TimestampType columns and how to leverage Spark’s functions for working with these
columns.

Creating DateType columns


Import the java.sql.Date library to create a DataFrame with a DateType column.

1 import java.sql.Date
2 import org.apache.spark.sql.types.{DateType, IntegerType}
3
4 val sourceDF = spark.createDF(
5 List(
6 (1, Date.valueOf("2016-09-30")),
7 (2, Date.valueOf("2016-12-14"))
8 ), List(
9 ("person_id", IntegerType, true),
10 ("birth_date", DateType, true)
11 )
12 )

1 sourceDF.show()
2
3 +---------+----------+
4 |person_id|birth_date|
5 +---------+----------+
6 | 1|2016-09-30|
7 | 2|2016-12-14|
8 +---------+----------+
9
10 sourceDF.printSchema()
11
12 root
13 |-- person_id: integer (nullable = true)
14 |-- birth_date: date (nullable = true)

The cast() method can create a DateType column by converting a StringType column into a date.

1 val sourceDF = spark.createDF(


2 List(
3 (1, "2013-01-30"),
4 (2, "2012-01-01")
5 ), List(
6 ("person_id", IntegerType, true),
7 ("birth_date", StringType, true)
8 )
9 ).withColumn(
10 "birth_date",
11 col("birth_date").cast("date")
12 )

1 sourceDF.show()
2
3 +---------+----------+
4 |person_id|birth_date|
5 +---------+----------+
6 | 1|2013-01-30|
7 | 2|2012-01-01|
8 +---------+----------+
9
10 sourceDF.printSchema()
11
12 root
13 |-- person_id: integer (nullable = true)
14 |-- birth_date: date (nullable = true)

year(), month(), dayofmonth()


Let’s create a DataFrame with a DateType column and use built in Spark functions to extract the
year, month, and day from the date.

1 val sourceDF = spark.createDF(


2 List(
3 (1, Date.valueOf("2016-09-30")),
4 (2, Date.valueOf("2016-12-14"))
5 ), List(
6 ("person_id", IntegerType, true),
7 ("birth_date", DateType, true)
8 )
9 )
10
11 sourceDF.withColumn(
12 "birth_year",
13 year(col("birth_date"))
14 ).withColumn(
15 "birth_month",
16 month(col("birth_date"))
17 ).withColumn(
18 "birth_day",
19 dayofmonth(col("birth_date"))
20 ).show()

1 +---------+----------+----------+-----------+---------+
2 |person_id|birth_date|birth_year|birth_month|birth_day|
3 +---------+----------+----------+-----------+---------+
4 | 1|2016-09-30| 2016| 9| 30|
5 | 2|2016-12-14| 2016| 12| 14|
6 +---------+----------+----------+-----------+---------+

The org.apache.spark.sql.functions package has a lot of functions that make it easy to work
with dates in Spark.

minute(), second()
Let’s create a DataFrame with a TimestampType column and use built in Spark functions to extract
the minute and second from the timestamp.

1 import java.sql.Timestamp
2
3 val sourceDF = spark.createDF(
4 List(
5 (1, Timestamp.valueOf("2017-12-02 03:04:00")),
6 (2, Timestamp.valueOf("1999-01-01 01:45:20"))
7 ), List(
8 ("person_id", IntegerType, true),
9 ("fun_time", TimestampType, true)
10 )
11 )
12
13 sourceDF.withColumn(
14 "fun_minute",
15 minute(col("fun_time"))
16 ).withColumn(
17 "fun_second",
18 second(col("fun_time"))
19 ).show()

1 +---------+-------------------+----------+----------+
2 |person_id| fun_time|fun_minute|fun_second|
3 +---------+-------------------+----------+----------+
4 | 1|2017-12-02 03:04:00| 4| 0|
5 | 2|1999-01-01 01:45:20| 45| 20|
6 +---------+-------------------+----------+----------+

datediff()
The datediff() and current_date() functions can be used to calculate the number of days between
today and a date in a DateType column. Let’s use these functions to calculate someone’s age in days.

1 val sourceDF = spark.createDF(


2 List(
3 (1, Date.valueOf("1990-09-30")),
4 (2, Date.valueOf("2001-12-14"))
5 ), List(
6 ("person_id", IntegerType, true),
7 ("birth_date", DateType, true)
8 )
9 )
10
11 sourceDF.withColumn(
12 "age_in_days",
13 datediff(current_date(), col("birth_date"))
14 ).show()

1 +---------+----------+-----------+
2 |person_id|birth_date|age_in_days|
3 +---------+----------+-----------+
4 | 1|1990-09-30| 9946|
5 | 2|2001-12-14| 5853|
6 +---------+----------+-----------+

date_add()
The date_add() function can be used to add days to a date. Let’s add 15 days to a date column.

1 val sourceDF = spark.createDF(


2 List(
3 (1, Date.valueOf("1990-09-30")),
4 (2, Date.valueOf("2001-12-14"))
5 ), List(
6 ("person_id", IntegerType, true),
7 ("birth_date", DateType, true)
8 )
9 )
10
11 sourceDF.withColumn(
12 "15_days_old",
13 date_add(col("birth_date"), 15)
14 ).show()

1 +---------+----------+-----------+
2 |person_id|birth_date|15_days_old|
3 +---------+----------+-----------+
4 | 1|1990-09-30| 1990-10-15|
5 | 2|2001-12-14| 2001-12-29|
6 +---------+----------+-----------+

Next steps
Look at the Spark SQL functions²⁷ for the full list of methods available for working with dates and
times in Spark.
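For example, date_format() renders a date as a formatted string and add_months() shifts a date by a number of months. A quick sketch against the last sourceDF (the new column names are just illustrative):

sourceDF
  .withColumn("birth_month_name", date_format(col("birth_date"), "MMMM"))
  .withColumn("first_birthday", add_months(col("birth_date"), 12))
  .show()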
²⁷http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
Performing operations on multiple columns with foldLeft
The Scala foldLeft method can be used to iterate over a data structure and perform multiple
operations on a Spark DataFrame.
For example, foldLeft can be used to eliminate all whitespace in multiple columns or convert all the
column names in a DataFrame to snake_case.
foldLeft is great when you want to perform similar operations on multiple columns. Let’s dive in!

foldLeft review in Scala


Suppose you have a list of three odd numbers and would like to calculate the sum of all the numbers
in the list.
The foldLeft method can iterate over every element in the list and keep track of a running sum.

1 val odds = List(1, 5, 7)


2 println {
3 odds.foldLeft(0) { (memo: Int, num: Int) =>
4 memo + num
5 }
6 }

The sum of 1, 5, and 7 is 13 and that’s what the code snippet above will print.
The foldLeft function is initialized with a starting value of zero and the running sum is accumulated
in the memo variable. This code sums all the numbers in the odds list.

Eliminating whitespace from multiple columns


Let’s create a DataFrame and then write a function to remove all the whitespace from all the columns.

1 val sourceDF = Seq(


2 (" p a b l o", "Paraguay"),
3 ("Neymar", "B r asil")
4 ).toDF("name", "country")
5
6 val actualDF = Seq(
7 "name",
8 "country"
9 ).foldLeft(sourceDF) { (memoDF, colName) =>
10 memoDF.withColumn(
11 colName,
12 regexp_replace(col(colName), "\\s+", "")
13 )
14 }

1 actualDF.show()
2
3 +------+--------+
4 | name| country|
5 +------+--------+
6 | pablo|Paraguay|
7 |Neymar| Brasil|
8 +------+--------+

We can improve this code by using the DataFrame#columns method and the removeAllWhitespace
method defined in spark-daria.

1 val actualDF = sourceDF


2 .columns
3 .foldLeft(sourceDF) { (memoDF, colName) =>
4 memoDF.withColumn(
5 colName,
6 removeAllWhitespace(col(colName))
7 )
8 }

snake_case all columns in a DataFrame


It’s easier to work with DataFrames when all the column names are in snake_case, especially when
writing SQL. Let’s used foldLeft to convert all the columns in a DataFrame to snake_case.

1 val sourceDF = Seq(


2 ("funny", "joke")
3 ).toDF("A b C", "de F")
4
5 sourceDF.show()
6
7 +-----+----+
8 |A b C|de F|
9 +-----+----+
10 |funny|joke|
11 +-----+----+
12
13 val actualDF = sourceDF
14 .columns
15 .foldLeft(sourceDF) { (memoDF, colName) =>
16 memoDF
17 .withColumnRenamed(
18 colName,
19 colName.toLowerCase().replace(" ", "_")
20 )
21 }
22
23 actualDF.show()
24
25 +-----+----+
26 |a_b_c|de_f|
27 +-----+----+
28 |funny|joke|
29 +-----+----+

Wrapping foldLeft operations in custom transformations

We can wrap foldLeft operations in custom transformations to make them easily reusable. Let's
create a custom transformation for the code that converts all DataFrame columns to snake_case.

1 def toSnakeCase(str: String): String = {


2 str.toLowerCase().replace(" ", "_")
3 }
4
5 def snakeCaseColumns(df: DataFrame): DataFrame = {
6 df.columns.foldLeft(df) { (memoDF, colName) =>
7 memoDF.withColumnRenamed(colName, toSnakeCase(colName))
8 }
9 }
10
11 val sourceDF = Seq(
12 ("funny", "joke")
13 ).toDF("A b C", "de F")
14
15 val actualDF = sourceDF.transform(snakeCaseColumns)
16
17 actualDF.show()
18
19 +-----+----+
20 |a_b_c|de_f|
21 +-----+----+
22 |funny|joke|
23 +-----+----+

The snakeCaseColumns custom transformation can now be reused for any DataFrame. This
transformation is already defined in spark-daria by the way.

Next steps
If you’re still uncomfortable with the foldLeft method, try the Scala collections CodeQuizzes. You
should understand foldLeft in Scala before trying to apply foldLeft in Spark.
Whenever you’re applying a similar operation to multiple columns in a Spark DataFrame, try to use
foldLeft. It will reduce the redundancy in your code and decrease your code complexity. Try to wrap
your foldLeft calls in custom transformations to make beautiful functions that are reusable!
Equality Operators
Spark has a standard equality operator and a null safe equality operator.
This chapter explains how the equality operators differ and when each operator should be used.

===
Let’s create a DataFrame with word1 and word1 columns and compare the equality with the ===
operator.
TODO - finish chapter
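Here's a minimal sketch of the difference: === returns null whenever either side is null, while the null safe operator <=> treats two nulls as equal and always returns true or false.

val wordsDF = Seq(
  ("cat", "cat"),
  ("cat", null),
  (null, null)
).toDF("word1", "word2")

wordsDF
  .withColumn("are_equal", $"word1" === $"word2")           // true, null, null
  .withColumn("are_equal_null_safe", $"word1" <=> $"word2") // true, false, true
  .show()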
Introduction to Spark Broadcast Joins
Spark broadcast joins are perfect for joining a large DataFrame with a small DataFrame.
Broadcast joins cannot be used when joining two large DataFrames.
This chapter explains how to do a simple broadcast join and how the broadcast() function helps Spark
optimize the execution plan.

Conceptual overview
Spark splits up data on different nodes in a cluster so multiple computers can process data in parallel.
Traditional joins are hard with Spark because the data is split on multiple machines.
Broadcast joins are easier to run on a cluster. Spark can “broadcast” a small DataFrame by sending all
the data in that small DataFrame to all nodes in the cluster. After the small DataFrame is broadcasted,
Spark can perform a join without shuffling any of the data in the large DataFrame.

Simple example
Let’s create a DataFrame with information about people and another DataFrame with information
about cities. In this example, both DataFrames will be small, but let’s pretend that the peopleDF is
huge and the citiesDF is tiny.

1 val peopleDF = Seq(


2 ("andrea", "medellin"),
3 ("rodolfo", "medellin"),
4 ("abdul", "bangalore")
5 ).toDF("first_name", "city")
6
7 peopleDF.show()

1 +----------+---------+
2 |first_name| city|
3 +----------+---------+
4 | andrea| medellin|
5 | rodolfo| medellin|
6 | abdul|bangalore|
7 +----------+---------+

1 val citiesDF = Seq(


2 ("medellin", "colombia", 2.5),
3 ("bangalore", "india", 12.3)
4 ).toDF("city", "country", "population")
5
6 citiesDF.show()

1 +---------+--------+----------+
2 | city| country|population|
3 +---------+--------+----------+
4 | medellin|colombia| 2.5|
5 |bangalore| india| 12.3|
6 +---------+--------+----------+

Let’s broadcast the citiesDF and join it with the peopleDF.

1 peopleDF.join(
2 broadcast(citiesDF),
3 peopleDF("city") <=> citiesDF("city")
4 ).show()

1 +----------+---------+---------+--------+----------+
2 |first_name| city| city| country|population|
3 +----------+---------+---------+--------+----------+
4 | andrea| medellin| medellin|colombia| 2.5|
5 | rodolfo| medellin| medellin|colombia| 2.5|
6 | abdul|bangalore|bangalore| india| 12.3|
7 +----------+---------+---------+--------+----------+

The Spark null safe equality operator (<=>) is used to perform this join.

Analyzing physical plans of joins


Let’s use the explain() method to analyze the physical plan of the broadcast join.

1 peopleDF.join(
2 broadcast(citiesDF),
3 peopleDF("city") <=> citiesDF("city")
4 ).explain()

== Physical Plan ==
BroadcastHashJoin [coalesce(city#6, )], [coalesce(city#21, )], Inner, BuildRight, (city#6 <=> city#21)
:- LocalTableScan [first_name#5, city#6]
+- BroadcastExchange HashedRelationBroadcastMode(List(coalesce(input[0, string, true], )))
   +- LocalTableScan [city#21, country#22, population#23]
In this example, Spark is smart enough to return the same physical plan, even when the broadcast()
method isn’t used.

1 peopleDF.join(
2 citiesDF,
3 peopleDF("city") <=> citiesDF("city")
4 ).explain()

== Physical Plan ==
BroadcastHashJoin [coalesce(city#6, )], [coalesce(city#21, )], Inner, BuildRight, (city#6 <=> city#21)
:- LocalTableScan [first_name#5, city#6]
+- BroadcastExchange HashedRelationBroadcastMode(List(coalesce(input[0, string, true], )))
   +- LocalTableScan [city#21, country#22, population#23]
Spark isn’t always smart about optimally broadcasting DataFrames when the code is complex, so
it’s best to use the broadcast() method explicitly and inspect the physical plan to make sure the
join is executed properly.

Eliminating the duplicate city column


We can pass a sequence of columns with the shortcut join syntax to automatically delete the
duplicate column.

1 peopleDF.join(
2 broadcast(citiesDF),
3 Seq("city")
4 ).show()

1 +---------+----------+--------+----------+
2 | city|first_name| country|population|
3 +---------+----------+--------+----------+
4 | medellin| andrea|colombia| 2.5|
5 | medellin| rodolfo|colombia| 2.5|
6 |bangalore| abdul| india| 12.3|
7 +---------+----------+--------+----------+

Let’s look at the physical plan that’s generated by this code.

1 peopleDF.join(
2 broadcast(citiesDF),
3 Seq("city")
4 ).explain()

== Physical Plan ==
Project [city#6, first_name#5, country#22, population#23]
+- BroadcastHashJoin [city#6], [city#21], Inner, BuildRight
   :- Project [_1#2 AS first_name#5, _2#3 AS city#6]
   :  +- Filter isnotnull(_2#3)
   :     +- LocalTableScan [_1#2, _2#3]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
      +- Project [_1#17 AS city#21, _2#18 AS country#22, _3#19 AS population#23]
         +- Filter isnotnull(_1#17)
            +- LocalTableScan [_1#17, _2#18, _3#19]

Code that returns the same result without relying on the sequence join generates an entirely different
physical plan.

1 peopleDF.join(
2 broadcast(citiesDF),
3 peopleDF("city") <=> citiesDF("city")
4 )
5 .drop(citiesDF("city"))
6 .explain()

== Physical Plan ==
Project [first_name#5, city#6, country#22, population#23]
+- BroadcastHashJoin [coalesce(city#6, )], [coalesce(city#21, )], Inner, BuildRight, (city#6 <=> city#21)
   :- LocalTableScan [first_name#5, city#6]
   +- BroadcastExchange HashedRelationBroadcastMode(List(coalesce(input[0, string, true], )))
      +- LocalTableScan [city#21, country#22, population#23]

It’s best to avoid the shortcut join syntax so your physical plans stay as simple as possible.

Diving deeper into explain()


You can pass the explain() method a true argument to see the parsed logical plan, analyzed logical
plan, and optimized logical plan in addition to the physical plan.

1 peopleDF.join(
2 broadcast(citiesDF),
3 peopleDF("city") <=> citiesDF("city")
4 )
5 .drop(citiesDF("city"))
6 .explain(true)

== Parsed Logical Plan ==
Project [first_name#5, city#6, country#22, population#23]
+- Join Inner, (city#6 <=> city#21)
   :- Project [_1#2 AS first_name#5, _2#3 AS city#6]
   :  +- LocalRelation [_1#2, _2#3]
   +- ResolvedHint isBroadcastable=true
      +- Project [_1#17 AS city#21, _2#18 AS country#22, _3#19 AS population#23]
         +- LocalRelation [_1#17, _2#18, _3#19]

== Analyzed Logical Plan ==
first_name: string, city: string, country: string, population: double
Project [first_name#5, city#6, country#22, population#23]
+- Join Inner, (city#6 <=> city#21)
   :- Project [_1#2 AS first_name#5, _2#3 AS city#6]
   :  +- LocalRelation [_1#2, _2#3]
   +- ResolvedHint isBroadcastable=true
      +- Project [_1#17 AS city#21, _2#18 AS country#22, _3#19 AS population#23]
         +- LocalRelation [_1#17, _2#18, _3#19]

== Optimized Logical Plan ==
Project [first_name#5, city#6, country#22, population#23]
+- Join Inner, (city#6 <=> city#21)
   :- LocalRelation [first_name#5, city#6]
   +- ResolvedHint isBroadcastable=true
      +- LocalRelation [city#21, country#22, population#23]

== Physical Plan ==
Project [first_name#5, city#6, country#22, population#23]
+- BroadcastHashJoin [coalesce(city#6, )], [coalesce(city#21, )], Inner, BuildRight, (city#6 <=> city#21)
   :- LocalTableScan [first_name#5, city#6]
   +- BroadcastExchange HashedRelationBroadcastMode(List(coalesce(input[0, string, true], )))
      +- LocalTableScan [city#21, country#22, population#23]

Notice how the parsed, analyzed, and optimized logical plans all contain ResolvedHint isBroadcastable=true
because the broadcast() function was used. This hint isn’t included when the broadcast() function
isn’t used.

1 peopleDF.join(
2 citiesDF,
3 peopleDF("city") <=> citiesDF("city")
4 )
5 .drop(citiesDF("city"))
6 .explain(true)

== Parsed Logical Plan ==
Project [first_name#5, city#6, country#22, population#23]
+- Join Inner, (city#6 <=> city#21)
   :- Project [_1#2 AS first_name#5, _2#3 AS city#6]
   :  +- LocalRelation [_1#2, _2#3]
   +- Project [_1#17 AS city#21, _2#18 AS country#22, _3#19 AS population#23]
      +- LocalRelation [_1#17, _2#18, _3#19]

== Analyzed Logical Plan ==
first_name: string, city: string, country: string, population: double
Project [first_name#5, city#6, country#22, population#23]
+- Join Inner, (city#6 <=> city#21)
   :- Project [_1#2 AS first_name#5, _2#3 AS city#6]
   :  +- LocalRelation [_1#2, _2#3]
   +- Project [_1#17 AS city#21, _2#18 AS country#22, _3#19 AS population#23]
      +- LocalRelation [_1#17, _2#18, _3#19]

== Optimized Logical Plan ==
Project [first_name#5, city#6, country#22, population#23]
+- Join Inner, (city#6 <=> city#21)
   :- LocalRelation [first_name#5, city#6]
   +- LocalRelation [city#21, country#22, population#23]

== Physical Plan ==
Project [first_name#5, city#6, country#22, population#23]
+- BroadcastHashJoin [coalesce(city#6, )], [coalesce(city#21, )], Inner, BuildRight, (city#6 <=> city#21)
   :- LocalTableScan [first_name#5, city#6]
   +- BroadcastExchange HashedRelationBroadcastMode(List(coalesce(input[0, string, true], )))
      +- LocalTableScan [city#21, country#22, population#23]

Next steps
Broadcast joins are a great way to append data stored in relatively small single source of truth data
files to large DataFrames. DataFrames up to 2GB can be broadcasted so a data file with tens or even
hundreds of thousands of rows is a broadcast candidate.
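Spark also broadcasts small tables automatically when they're below the spark.sql.autoBroadcastJoinThreshold setting (10MB by default). Here's a sketch of raising the threshold for a session (set it to -1 to disable automatic broadcasting):

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100 * 1024 * 1024)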
Partitioning Data in Memory
Spark splits data into partitions and executes computations on the partitions in parallel. You should
understand how data is partitioned and when you need to manually adjust the partitioning to keep
your Spark computations running efficiently.

Intro to partitions
Let’s create a DataFrame of numbers to illustrate how data is partitioned:

1 val x = (1 to 10).toList
2 val numbersDF = x.toDF("number")

On my machine, the numbersDF is split into four partitions:

1 numbersDF.rdd.partitions.size // => 4

When writing to disk, each partition is outputted as a separate CSV file.

1 numbersDF.write.csv("/Users/powers/Desktop/spark_output/numbers")

Here is how the data is separated on the different partitions.

1 Partition A: 1, 2
2 Partition B: 3, 4, 5
3 Partition C: 6, 7
4 Partition D: 8, 9, 10

coalesce
The coalesce method reduces the number of partitions in a DataFrame. Here’s how to consolidate
the data from four partitions to two partitions:

1 val numbersDF2 = numbersDF.coalesce(2)

We can verify coalesce has created a new DataFrame with only two partitions:

1 numbersDF2.rdd.partitions.size // => 2

numbersDF2 will be written out to disk as two text files:

1 numbersDF2.write.csv("/Users/powers/Desktop/spark_output/numbers2")

The partitions in numbersDF2 have the following data:

1 Partition A: 1, 2, 3, 4, 5
2 Partition C: 6, 7, 8, 9, 10

The coalesce algorithm moved the data from Partition B to Partition A and moved the data from
Partition D to Partition C. The data in Partition A and Partition C does not move with the coalesce
algorithm. This algorithm is fast in certain situations because it minimizes data movement.

Increasing partitions
You can try to increase the number of partitions with coalesce, but it won’t work!

1 val numbersDF3 = numbersDF.coalesce(6)


2 numbersDF3.rdd.partitions.size // => 4

numbersDF3 keeps four partitions even though we attempted to create 6 partitions with coalesce(6).
The coalesce algorithm changes the number of partitions by moving data from some partitions into existing
partitions. This algorithm obviously cannot increase the number of partitions.

repartition
The repartition method can be used to either increase or decrease the number of partitions.
Let’s create a homerDF from the numbersDF with two partitions.

1 val homerDF = numbersDF.repartition(2)


2 homerDF.rdd.partitions.size // => 2

Let’s examine the data on each partition in homerDF:



1 Partition ABC: 1, 3, 5, 6, 8, 10
2 Partition XYZ: 2, 4, 7, 9

Partition ABC contains data from Partition A, Partition B, Partition C, and Partition D. Partition XYZ
also contains data from each original partition. The repartition algorithm does a full data shuffle and
equally distributes the data among the partitions. It does not attempt to minimize data movement
like the coalesce algorithm.
These results will be different when run on your machine. You’ll also note that the data might not
be evenly split because this data set is so tiny.

Increasing partitions
The repartition method can be used to increase the number of partitions as well.

1 val bartDF = numbersDF.repartition(6)


2 bartDF.rdd.partitions.size // => 6

Here’s how the data is split up amongst the partitions in the bartDF.

1 Partition 00000: 5, 7
2 Partition 00001: 1
3 Partition 00002: 2
4 Partition 00003: 8
5 Partition 00004: 3, 9
6 Partition 00005: 4, 6, 10

The repartition method does a full shuffle of the data, so the number of partitions can be increased.

Differences between coalesce and repartition


The repartition algorithm does a full shuffle of the data and creates equal sized partitions of data.
coalesce combines existing partitions to avoid a full shuffle.
repartition can increase the number of partitions whereas coalesce only can decrease the number of
partitions.

repartition by column
Let’s use the following data to examine how a DataFrame can be repartitioned by a particular
column.

1 +-----+-------+
2 | age | color |
3 +-----+-------+
4 | 10 | blue |
5 | 13 | red |
6 | 15 | blue |
7 | 99 | red |
8 | 67 | blue |
9 +-----+-------+

We’ll start by creating the DataFrame:

1 val people = List(


2 (10, "blue"),
3 (13, "red"),
4 (15, "blue"),
5 (99, "red"),
6 (67, "blue")
7 )
8
9 val peopleDF = people.toDF("age", "color")

Let’s repartition the DataFrame by the color column:

1 val colorDF = peopleDF.repartition($"color")

When partitioning by a column, Spark creates 200 partitions by default (the value of the
spark.sql.shuffle.partitions setting). This example will have two partitions with data and 198 empty partitions.

1 Partition 00091
2 13,red
3 99,red
4 Partition 00168
5 10,blue
6 15,blue
7 67,blue

The colorDF contains different partitions for each color and is optimized for extracts by color.
Partitioning by a column is similar to indexing a column in a relational database. A later chapter on
partitioning data on disk will explain this concept more completely.
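If 200 partitions is overkill for a small dataset, you can pass an explicit partition count alongside the column, or lower the default that column-based repartitions fall back on. A sketch (8 is an arbitrary choice):

peopleDF.repartition(8, $"color")

spark.conf.set("spark.sql.shuffle.partitions", "8")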

Real World Example


Suppose you have a data lake that contains 2 billion rows of data (1TB) split in 13,000 partitions.
You’d like to create a data puddle that’s a random sampling of one millionth of the data lake. The
data puddle will be used in development and the data lake will be reserved for production grade
code. You’d like to write the data puddle out to S3 for easy access.
Here’s how you’d structure the code:

1 val dataPuddle = dataLake.sample(true, 0.000001)


2 dataPuddle.write.parquet("my_bucket/puddle/")

Spark doesn’t adjust the number of partitions when a large DataFrame is filtered, so the dataPuddle
will also have 13,000 partitions. The dataPuddle only contains 2,000 rows of data, so a lot of the
partitions will be empty. It’s not efficient to read or write thousands of empty text files to disk - we
should improve this code by repartitioning.

1 val dataPuddle = dataLake.sample(true, 0.000001)


2 val goodPuddle = dataPuddle.repartition(4)
3 goodPuddle.write.parquet("my_bucket/puddle/")

Why did we choose 4 partitions for the data puddle?


The data is a million times smaller, so we reduce the number of partitions by a million and keep the
same amount of data per partition. 13,000 partitions / 1,000,000 = 1 partition (rounded up). We used
4 partitions so the data puddle can leverage the parallelism of Spark.
In general, you can determine the number of partitions by multiplying the number of CPUs in the
cluster by 2, 3, or 4.
number_of_partitions = number_of_cpus * 4
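Here's a sketch of that rule of thumb in code (defaultParallelism roughly tracks the total number of cores in the cluster):

val numPartitions = spark.sparkContext.defaultParallelism * 4
dataPuddle.repartition(numPartitions).write.parquet("my_bucket/puddle/")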
If you’re writing the data out to a file system, you can choose a partition size that will create
reasonable sized files (1 GB). Spark will optimize the number of partitions based on the number
of clusters when the data is read.

Why did we use the repartition method instead of coalesce?


A full data shuffle is an expensive operation for large data sets, but our data puddle is only 2,000
rows. The repartition method returns equal sized text files, which are more efficient for downstream
consumers.

Actual performance improvement


It took 241 seconds to count the rows in the data puddle when the data wasn't repartitioned (on a 5
node cluster). It only took 2 seconds to count the data puddle when the data was repartitioned - that's
roughly a 120x speed improvement!

You probably need to think about partitions


The partitioning of DataFrames seems like a low level implementation detail that should be managed
by the framework, but it’s not. When filtering large DataFrames into smaller ones, you should almost
always repartition the data.
You’ll probably be filtering large DataFrames into smaller ones frequently, so get used to reparti-
tioning. Grok it!
Partitioning on Disk with partitionBy
Spark writers allow for data to be partitioned on disk with partitionBy. Some queries can run 50 to
100 times faster on a partitioned data lake, so partitioning is vital for certain queries.
Creating and maintaining partitioned data lakes is hard.
This chapter discusses how to use partitionBy and explains the challenges of partitioning production-
sized datasets on disk. We’ll discuss different memory partitioning tactics that let partitionBy
operate more efficiently.
You’ll need to master the concepts covered in this chapter to create partitioned data lakes on large
datasets, especially if you’re dealing with a high-cardinality partition key.

Memory partitioning vs. disk partitioning


coalesce() and repartition() change the memory partitions for a DataFrame.

partitionBy() is a DataFrameWriter method that specifies if the data should be written to disk in
folders. By default, Spark does not write data to disk in nested folders.
Memory partitioning is often important, independent of disk partitioning. But in order to write data
on disk properly, you’ll almost always need to repartition the data in memory first.

Simple example
Suppose we have the following CSV file with first_name, last_name, and country columns:

1 first_name,last_name,country
2 Ernesto,Guevara,Argentina
3 Vladimir,Putin,Russia
4 Maria,Sharapova,Russia
5 Bruce,Lee,China
6 Jack,Ma,China

Let’s partition this data on disk with country as the partition key. Let’s create one file per partition.

1 val path = new java.io.File("./src/main/resources/ss_europe/").getCanonicalPath


2 val df = spark
3 .read
4 .option("header", "true")
5 .option("charset", "UTF8")
6 .csv(path)
7
8 val outputPath = new java.io.File("./tmp/partitioned_lake1/").getCanonicalPath
9 df
10 .repartition(col("country"))
11 .write
12 .partitionBy("country")
13 .parquet(outputPath)

Here’s what the data will look like on disk:

1 partitioned_lake1/
2 country=Argentina/
3 part-00044-cf737804-90ea-4c37-94f8-9aa016f6953a.c000.snappy.parquet
4 country=China/
5 part-00059-cf737804-90ea-4c37-94f8-9aa016f6953a.c000.snappy.parquet
6 country=Russia/
7 part-00002-cf737804-90ea-4c37-94f8-9aa016f6953a.c000.snappy.parquet

Creating one file per disk partition is not going to work for production sized datasets. We won’t
want to write the China partition out as a single file if it contains 100GB of data.

partitionBy with repartition(5)


Let’s run repartition(5) to create five memory partitions before running partitionBy and see how
that impacts how the files get written out on disk.

1 val outputPath = new java.io.File("./tmp/partitioned_lake2/").getCanonicalPath


2 df
3 .repartition(5)
4 .write
5 .partitionBy("country")
6 .parquet(outputPath)

Here’s what the files look like on disk:



1 partitioned_lake2/
2 country=Argentina/
3 part-00003-c2d1b76a-aa61-437f-affc-a6b322f1cf42.c000.snappy.parquet
4 country=China/
5 part-00000-c2d1b76a-aa61-437f-affc-a6b322f1cf42.c000.snappy.parquet
6 part-00004-c2d1b76a-aa61-437f-affc-a6b322f1cf42.c000.snappy.parquet
7 country=Russia/
8 part-00001-c2d1b76a-aa61-437f-affc-a6b322f1cf42.c000.snappy.parquet
9 part-00002-c2d1b76a-aa61-437f-affc-a6b322f1cf42.c000.snappy.parquet

The partitionBy writer will write out files to disk for each memory partition. The maximum number
of files written out by partitionBy is the number of unique countries multiplied by the number of
memory partitions.
In this example, we have 3 unique countries * 5 memory partitions, so 15 files could get written
out (if each memory partition had one Argentinian, one Chinese, and one Russian person). We only
have 5 rows of data, so only 5 files are written in this example.

partitionBy with repartition(1)


If we repartition the data to one memory partition before partitioning on disk with partitionBy,
then we’ll write out a maximum of three files. numMemoryPartitions * numUniqueCountries =
maxNumFiles. 1 * 3 = 3.
Let’s take a look at the code.

1 val outputPath = new java.io.File("./tmp/partitioned_lake3/").getCanonicalPath


2 df
3 .repartition(1)
4 .write
5 .partitionBy("country")
6 .parquet(outputPath)

Here’s what the files look like on disk:



1 partitioned_lake3/
2 country=Argentina/
3 part-00000-bc6ce757-d39f-489e-9677-0a7105b29e66.c000.snappy.parquet
4 country=China/
5 part-00000-bc6ce757-d39f-489e-9677-0a7105b29e66.c000.snappy.parquet
6 country=Russia/
7 part-00000-bc6ce757-d39f-489e-9677-0a7105b29e66.c000.snappy.parquet

Partitioning datasets with a max number of files per partition

Let's use a dataset with 80 people from China, 15 people from France, and 5 people from Cuba.
Here’s a link to the data²⁸.
Here’s what the data looks like:

1 person_name,person_country
2 a,China
3 b,China
4 c,China
5 ...77 more China rows
6 a,France
7 b,France
8 c,France
9 ...12 more France rows
10 a,Cuba
11 b,Cuba
12 c,Cuba
13 ...2 more Cuba rows

Let’s create 8 memory partitions and scatter the data randomly across the memory partitions (we’ll
write out the data to disk, so we can inspect the contents of a memory partition).

val outputPath = new java.io.File("./tmp/repartition_for_lake4/").getCanonicalPath

df
  .repartition(8, col("person_country"), rand)
  .write
  .csv(outputPath)

Let’s look at one of the CSV files that is outputted:


²⁸https://gist.github.com/MrPowers/95a8e160c37fffa9ffec2f9acfbee51e

1 p,China
2 f1,China
3 n1,China
4 a2,China
5 b2,China
6 d2,China
7 e2,China
8 f,France
9 c,Cuba

This technique helps us set a maximum number of files per partition when creating a partitioned
lake. Let’s write out the data to disk and observe the output.

val outputPath = new java.io.File("./tmp/partitioned_lake4/").getCanonicalPath

df
  .repartition(8, col("person_country"), rand)
  .write
  .partitionBy("person_country")
  .csv(outputPath)

Here’s what the files look like on disk:

1 partitioned_lake4/
2 person_country=China/
3 part-00000-0887fbd2-4d9f-454a-bd2a-de42cf7e7d9e.c000.csv
4 part-00001-0887fbd2-4d9f-454a-bd2a-de42cf7e7d9e.c000.csv
5 ... 6 more files
6 person_country=Cuba/
7 part-00002-0887fbd2-4d9f-454a-bd2a-de42cf7e7d9e.c000.csv
8 part-00003-0887fbd2-4d9f-454a-bd2a-de42cf7e7d9e.c000.csv
9 ... 2 more files
10 person_country=France/
11 part-00000-0887fbd2-4d9f-454a-bd2a-de42cf7e7d9e.c000.csv
12 part-00001-0887fbd2-4d9f-454a-bd2a-de42cf7e7d9e.c000.csv
13 ... 5 more files

Each disk partition will have up to 8 files. The data is split randomly across the 8 memory partitions, and a memory partition won't produce an output file for a given disk partition if it doesn't contain any data for that country.
This is better, but still not ideal. We only want one file for Cuba (we currently have 4) and two files for France (we currently have 7), so too many small files are being created.
Let’s review the contents of our memory partition from earlier:

1 p,China
2 f1,China
3 n1,China
4 a2,China
5 b2,China
6 d2,China
7 e2,China
8 f,France
9 c,Cuba

partitionBy will split up this particular memory partition into three files: one China file with 7
rows of data, one France file with one row of data, and one Cuba file with one row of data.

Partitioning dataset with max rows per file


Let’s write some code that’ll create partitions with approximately ten rows of data per file. We’d
like our data to be stored in 8 files for China, one file for Cuba, and two files for France.
We can use the maxRecordsPerFile option that’ll make sure the China and France partitions aren’t
created with files that are too huge.

val outputPath = new java.io.File("./tmp/partitioned_lake5/").getCanonicalPath

df
  .repartition(col("person_country"))
  .write
  .option("maxRecordsPerFile", 10)
  .partitionBy("person_country")
  .csv(outputPath)

This technique is particularly important for partition keys that are highly skewed. The number of inhabitants by country is a good example of a partition key with high skew. For example, Jamaica has 3 million people and China has 1.4 billion people - we'll want ∼467 times more files in the China partition than the Jamaica partition.

Partitioning dataset with max rows per file pre Spark 2.2

The maxRecordsPerFile option was added in Spark 2.2, so you’ll need to write your own custom
solution if you’re using an earlier version of Spark.

import org.apache.spark.sql.functions.{col, rand}
import org.apache.spark.sql.types.IntegerType

val countDF = df.groupBy("person_country").count()

val desiredRowsPerPartition = 10

val joinedDF = df
  .join(countDF, Seq("person_country"))
  .withColumn(
    "my_secret_partition_key",
    (rand(10) * col("count") / desiredRowsPerPartition).cast(IntegerType)
  )

val outputPath = new java.io.File("./tmp/partitioned_lake6/").getCanonicalPath

joinedDF
  .repartition(col("person_country"), col("my_secret_partition_key"))
  .drop("count", "my_secret_partition_key")
  .write
  .partitionBy("person_country")
  .csv(outputPath)

We calculate the total number of records per partition key and then create a my_secret_partition_key column rather than relying on a fixed number of partitions.

You should choose the desiredRowsPerPartition based on what will give you ∼1 GB files. If you have a 500 GB dataset with 750 million rows, you'd want roughly 500 output files, so set desiredRowsPerPartition to 750,000,000 / 500 = 1,500,000.

Small file problem


Partitioned data lakes can quickly develop a small file problem when they’re updated incrementally.
It’s hard to compact partitioned data lakes. As we’ve seen, it’s hard to even make a partitioned data
lake!
Use the tactics outlined in this chapter to build your partitioned data lakes and start them off without
the small file problem!

Conclusion
Partitioned data lakes can be much faster to query (when filtering on the partition keys). Partitioned
data lakes can allow for a massive amount of data skipping.
Creating and maintaining partitioned data lakes is challenging, but the performance gains make
them a worthwhile effort.
Fast Filtering with Spark
PartitionFilters and PushedFilters
Spark can use the disk partitioning of files to greatly speed up certain filtering operations.
This chapter explains the difference between memory and disk partitioning, describes how to analyze physical plans to see when filters are applied, and gives a conceptual overview of why this design pattern can provide massive performance gains.

Normal DataFrame filter


Let’s create a CSV file (/Users/powers/Documents/tmp/blog_data/people.csv) with the following
data:

1 first_name,last_name,country
2 Ernesto,Guevara,Argentina
3 Vladimir,Putin,Russia
4 Maria,Sharapova,Russia
5 Bruce,Lee,China
6 Jack,Ma,China

Let’s read in the CSV data into a DataFrame:

1 val df = spark
2 .read
3 .option("header", "true")
4 .csv("/Users/powers/Documents/tmp/blog_data/people.csv")

Let’s write a query to fetch all the Russians in the CSV file with a first_name that starts with M.

1 df
2 .where($"country" === "Russia" && $"first_name".startsWith("M"))
3 .show()

1 +----------+---------+-------+
2 |first_name|last_name|country|
3 +----------+---------+-------+
4 | Maria|Sharapova| Russia|
5 +----------+---------+-------+

Let’s use explain() to see how the query is executed.

1 df
2 .where($"country" === "Russia" && $"first_name".startsWith("M"))
3 .explain()

== Physical Plan ==
Project [first_name#12, last_name#13, country#14]
+- Filter (((isnotnull(country#14) && isnotnull(first_name#12)) && (country#14 = Russia)) && StartsWith(first_name#12, M))
   +- FileScan csv [first_name#12,last_name#13,country#14]
      Batched: false,
      Format: CSV,
      Location: InMemoryFileIndex[file:/Users/powers/Documents/tmp/blog_data/people.csv],
      PartitionFilters: [],
      PushedFilters: [IsNotNull(country), IsNotNull(first_name), EqualTo(country,Russia), StringStartsWith(first_name,M)],
      ReadSchema: struct<first_name:string,last_name:string,country:string>
Take note that there are no PartitionFilters in the physical plan.

partitionBy()

The repartition() method partitions the data in memory and the partitionBy() method partitions
data in folders when it’s written out to disk.
Let’s write out the data in partitioned CSV files.

1 df
2 .repartition($"country")
3 .write
4 .option("header", "true")
5 .partitionBy("country")
6 .csv("/Users/powers/Documents/tmp/blog_data/partitioned_lake")

Here’s what the directory structure looks like:

1 partitioned_lake/
2 country=Argentina/
3 part-00044-c5d2f540-e89b-40c1-869d-f9871b48c617.c000.csv
4 country=China/
5 part-00059-c5d2f540-e89b-40c1-869d-f9871b48c617.c000.csv
6 country=Russia/
7 part-00002-c5d2f540-e89b-40c1-869d-f9871b48c617.c000.csv

Here are the contents of the CSV file in the country=Russia directory.

1 first_name,last_name
2 Vladimir,Putin
3 Maria,Sharapova

Notice that the country column is not included in the CSV file anymore. Spark has abstracted a
column from the CSV file to the directory name.
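If we read the partitioned lake back into a DataFrame, Spark rebuilds the country column from the directory names. Here's a quick sketch that confirms this (it reuses the same path as above):

val restoredDF = spark
  .read
  .option("header", "true")
  .csv("/Users/powers/Documents/tmp/blog_data/partitioned_lake")

restoredDF.printSchema() // the schema includes first_name, last_name, and country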

PartitionFilters
Let’s read from the partitioned data folder, run the same filters, and see how the physical plan
changes.
Let’s run the same filter as before, but on the partitioned lake, and examine the physical plan.

val partitionedDF = spark
  .read
  .option("header", "true")
  .csv("/Users/powers/Documents/tmp/blog_data/partitioned_lake")

partitionedDF
  .where($"country" === "Russia" && $"first_name".startsWith("M"))
  .explain()

== Physical Plan ==
Project [first_name#74, last_name#75, country#76]
+- Filter (isnotnull(first_name#74) && StartsWith(first_name#74, M))
   +- FileScan csv [first_name#74,last_name#75,country#76]
      Batched: false,
      Format: CSV,
      Location: InMemoryFileIndex[file:/Users/powers/Documents/tmp/blog_data/partitioned_lake],
      PartitionCount: 1,
      PartitionFilters: [isnotnull(country#76), (country#76 = Russia)],
      PushedFilters: [IsNotNull(first_name), StringStartsWith(first_name,M)],
      ReadSchema: struct<first_name:string,last_name:string>
You need to examine the physical plans carefully to identify the differences.
When filtering on df we have PartitionFilters: [] whereas when filtering on partitionedDF we
have PartitionFilters: [isnotnull(country#76), (country#76 = Russia)].
Spark only grabs data from certain partitions and skips all of the irrelevant partitions. Data skipping
allows for a big performance boost.

PushedFilters
When we filter off of df, the pushed filters are [IsNotNull(country), IsNotNull(first_name), EqualTo(country,Russia), StringStartsWith(first_name,M)].
When we filter off of partitionedDF, the pushed filters are [IsNotNull(first_name), StringStartsWith(first_name,M)].
Spark doesn’t need to push the country filter when working off of partitionedDF because it can
use a partition filter that is a lot faster.

Partitioning in memory vs. partitioning on disk


repartition() and coalesce() change how data is partitioned in memory.

partitionBy() changes how data is partitioned when it’s written out to disk.

Use repartition() before writing out partitioned data to disk with partitionBy() because it’ll
execute a lot faster and write out fewer files.
Partitioning in memory and partitioning on disk are related, but completely different concepts that expert Spark programmers must master.
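Here's a minimal sketch that contrasts the two concepts (the output path is hypothetical):

// in-memory partitioning: controls how many partitions (and tasks) Spark uses
val tenMemoryPartitions = df.repartition(10)
val oneMemoryPartition = df.coalesce(1)

// on-disk partitioning: controls the folder structure of the files that get written out
df
  .repartition($"country") // repartition in memory first, so fewer files are written
  .write
  .partitionBy("country")  // creates country=XXX/ folders on disk
  .csv("/some/hypothetical/output/path")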

Disk partitioning with skewed columns


Suppose you have a data lake with information on all 7.6 billion people in the world. The country column is skewed because a lot of people live in countries like China and India and comparatively few people live in countries like Montenegro.
This code is problematic because it will write out the data in each partition as a single file.

1 df
2 .repartition($"country")
3 .write
4 .option("header", "true")
5 .partitionBy("country")
6 .csv("/Users/powers/Documents/tmp/blog_data/partitioned_lake")

We don’t our data lake to contain some massive files because that’ll make Spark reads / writes
unnecessarily slow.
If we don’t do any in memory reparitioning, Spark will write out a ton of files for each partition and
our data lake will contain way too many small files.

1 df
2 .write
3 .option("header", "true")
4 .partitionBy("country")
5 .csv("/Users/powers/Documents/tmp/blog_data/partitioned_lake")

This answer²⁹ explains how to intelligently repartition in memory before writing out to disk with
partitionBy().

Here’s how we can limit each partition to a maximum of 100 files.

²⁹https://stackoverflow.com/questions/53037124/partitioning-a-large-skewed-dataset-in-s3-with-sparks-partitionby-method

1 import org.apache.spark.sql.functions.rand
2
3 df
4 .repartition(100, $"country", rand)
5 .write
6 .option("header", "true")
7 .partitionBy("country")
8 .csv("/Users/powers/Documents/tmp/blog_data/partitioned_lake")

Next steps
Effective disk partitioning can greatly speed up filter operations.
Scala Text Editing
It is easier to develop Scala with an Integrated Development Environment (IDE) or an IDE-like setup.
IntelliJ³⁰ is a great Scala IDE with a free community edition.
Scala Metals³¹ adds IDE-like features to “regular” text editors like Visual Studio Code, Atom, Vim,
Sublime Text, and Emacs.
Text editors provide a wide range of features that make it a lot easier to develop code. Databricks
notebooks only offer a tiny fraction of the text editing features that are available in IDEs.
Let’s take a look at some common Scala IDE features that’ll help when you’re writing Spark code.

Syntax highlighting
Let’s look a little chunk of code to create a DataFrame with the spark-daria createDF method.
Databricks doesn’t do much in the way of syntax highlighting.

Databricks minimal syntax highlighting

IntelliJ clearly differentiates between the string, integer, and boolean types.
³⁰https://www.jetbrains.com/idea/
³¹https://scalameta.org/metals/

IntelliJ extensive syntax highlighting

Import reminders
Databricks won’t complain about code that’s not imported until you run the code.

Databricks doesn’t complain about imports

If there are missing imports, IntelliJ will complain, even before the code is run.

IntelliJ complains loudly

Import hints
IntelliJ smartly assumes that your code might be missing the org.apache.spark.sql.types.StructField import.

IntelliJ import hint

You can click a button and IntelliJ will add the import statement for you.

Argument type checking


Text editors will check argument types and provide helpful error messages when the arguments
supplied don’t match the method signature.
The withColumn() method expects one string argument and another Column argument. Let's incorrectly supply an integer as the second argument to the withColumn() method in a Databricks notebook.

Databricks doesn’t give warning

Databricks only provides the type mismatch error after the code is run.

Databricks type mismatch error

IntelliJ will underline arguments that aren’t an acceptable type.

IntelliJ underlines incorrect argument type

When you hover the mouse over the incorrect argument type, IntelliJ provides a helpful type hint.

IntelliJ type hint

Flagging unnecessary imports


IntelliJ will flag unused imports by greying them out.

IntelliJ greys out unused import statement

If you hover over the greyed out import, IntelliJ will provide an "Unused import statement" warning.

IntelliJ Unused import statement warning

When to use text editors and Databricks notebooks?


Databricks notebooks are great for exploring datasets or running complex logic that’s been defined
in a text editor.
You should shy away from writing complex logic directly in a Databricks notebook.
Scala is a difficult programming language and you should use all the help you can get! Leverage the
IntelliJ IDE to get great syntax highlighting, type hints, and help with your imports.
Structuring Spark Projects
Project name

Package naming convention

Typical library structure

Applications
Introduction to SBT
SBT is an interactive build tool that is used to run tests and package your projects as JAR files.
SBT lets you package projects created in text editors, so you can run the code in a cloud cluster
computing environment (like Databricks).
SBT has a comprehensive Getting started guide³², but let’s be honest - who wants to read a book on
a build tool?
This chapter teaches Spark programmers what they need to know about SBT and skips all the other
details!

Sample code
I recommend cloning the spark-daria³³ project on your local machine, so you can run the SBT commands as you read this chapter.

Running SBT commands


SBT commands can be run from the command line or from the SBT shell.
For example, here’s how to run the test suite from Bash: sbt test.
Alternatively, we can open the SBT shell by running sbt in Bash and then simply run test.
Run exit to leave the SBT shell.

build.sbt
The SBT build definition is specified in the build.sbt file.
This is where you’ll add code to specify the project dependencies, the Scala version, how to build
your JAR files, how to manage memory, etc.
One of the only things that’s not specified in the build.sbt file is the SBT version itself. The SBT
version is specified in the project/build.properties file, for example:

³²https://www.scala-sbt.org/1.x/docs/Getting-Started.html
³³https://github.com/MrPowers/spark-daria

1 sbt.version=1.2.8

libraryDependencies

You can specify libraryDependencies in your build.sbt file to fetch libraries from Maven or
JitPack³⁴.
Here’s how to add Spark SQL and Spark ML to a project:

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.4.0" % "provided"

SBT provides shortcut syntax so you can clean up your build.sbt file a bit.

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.4.0" % "provided"
)

"provided" dependencies are already included in the environment where we run our code.

Here’s an example of some test dependencies that are only used when we run our test suite:

libraryDependencies += "com.lihaoyi" %% "utest" % "0.6.3" % "test"
libraryDependencies += "MrPowers" % "spark-fast-tests" % "0.17.1-s_2.11" % "test"

The chapter on building JAR files provides a more detailed discussion on provided and test
dependencies.

sbt test
You can run your test suite with the sbt test command.
You can set environment variables in your test suite by adding this line to your build.sbt file:
envVars in Test := Map("PROJECT_ENV" -> "test"). Refer to the environment specific config
chapter for more details about this design pattern.
You can run a single test file when using Scalatest with this command:

³⁴https://jitpack.io/

1 sbt "test:testOnly *LoginServiceSpec"

This command is easier to run from the SBT shell:

1 > testOnly *LoginServiceSpec

Complicated SBT commands are generally easier to run from the SBT shell, so you don’t need to
think about proper quoting.

sbt doc
The sbt doc command generates HTML documentation for your project.
You can open the documentation on your local machine with open target/scala-2.11/api/index.html
after it’s been generated.
Codebases are easier to understand when the public API is clearly defined, and you should focus on
marking anything that’s not part of the public interface with the private keyword. Private methods
aren’t included in the API documentation.

sbt console
The sbt console command starts the Scala interpreter with easy access to all your project files.
Let’s run sbt console in the spark-daria project and then invoke the StringHelpers.snakify()
method.

scala> com.github.mrpowers.spark.daria.utils.StringHelpers.snakify("FunStuff") // fun_stuff

Running sbt console is similar to running the Spark shell with the spark-daria JAR file attached.
Here’s how to start the Spark shell with the spark-daria JAR file attached.

./spark-shell --jars ~/Documents/code/my_apps/spark-daria/target/scala-2.11/spark-daria-assembly-0.28.0.jar

The same code from before also works in the Spark shell:

scala> com.github.mrpowers.spark.daria.utils.StringHelpers.snakify("FunStuff") // fun_stuff

The sbt console is sometimes useful for playing around with code, but the test suite is usually
better.
Don’t “test” your code in the console and neglect writing real tests.

sbt package / sbt assembly


sbt package builds a thin JAR file (only includes the project files). For spark-daria, the sbt package
command builds the target/scala-2.11/spark-daria-0.28.0.jar file.
sbt assembly builds a fat JAR file (includes all the project and dependency files). For spark-daria,
the sbt assembly command builds the target/scala-2.11/spark-daria-assembly-0.28.0.jar file.
Read the chapter on building Spark JAR files for a detailed discussion on how sbt package and sbt
assembly differ.

You should be comfortable with developing Spark code in a text editor, packaging your project as a
JAR file, and attaching your JAR file to a cloud cluster for production analyses.

sbt clean
The sbt clean command deletes all of the generated files in the target/ directory.
This command will delete the documentation generated by sbt doc and will delete the JAR files
generated by sbt package and sbt assembly.
It’s good to run sbt clean frequently, so you don’t accumlate a lot of legacy clutter in the target/
directory.

Next steps
SBT is a great build tool for Spark projects.
It lets you easily run tests, generate documentation, and package code as JAR files.
Managing the SparkSession, The
DataFrame Entry Point
The SparkSession is used to create and read DataFrames.
This chapter explains how to create a SparkSession and share it throughout your program.
Spark errors out if you try to create multiple SparkSessions, so it's important that you share one SparkSession throughout your program.
Some environments (e.g. Databricks) create a SparkSession for you and in those cases, you’ll want
to reuse the SparkSession that already exists rather than create your own.

Accessing the SparkSession


A SparkSession is automatically created and stored in the spark variable whenever you start a Spark
console or open a Databricks notebook.
Your program should reuse the same SparkSession and you should avoid any code that creates and
uses a different SparkSession.

Example of using the SparkSession


Let’s open the Spark console and use the spark variable to create a RDD from a sequence. This is a
simple example of how to use a SparkSession.
Notice that the message Spark session available as 'spark' is printed when you start the Spark
shell.

val data = Seq(2, 4, 6)
val myRDD = spark.sparkContext.parallelize(data)

The SparkSession is used to access the SparkContext, which has a parallelize method that converts a sequence into an RDD.
RDDs aren't used much now that the DataFrame API has been released, but they're still useful when creating DataFrames.

Creating a DataFrame
The SparkSession is used twice when manually creating a DataFrame:

1. Converts a sequence into an RDD
2. Converts an RDD into a DataFrame

1 import org.apache.spark.sql.Row
2 import org.apache.spark.sql.types._
3
4 val rdd = spark.sparkContext.parallelize(
5 Seq(
6 Row("bob", 55)
7 )
8 )
9
10 val schema = StructType(
11 Seq(
12 StructField("name", StringType, true),
13 StructField("age", IntegerType, true)
14 )
15 )
16
17 val df = spark.createDataFrame(rdd, schema)

1 df.show()
2
3 +----+---+
4 |name|age|
5 +----+---+
6 | bob| 55|
7 +----+---+

You will frequently use the SparkSession to create DataFrames when testing your code.

Reading a DataFrame
The SparkSession is also used to read CSV, JSON, and Parquet files.
Here are some examples.

val df1 = spark.read.csv("/mnt/my-bucket/csv-data")
val df2 = spark.read.json("/mnt/my-bucket/json-data")
val df3 = spark.read.parquet("/mnt/my-bucket/parquet-data")

There are separate chapters on CSV, JSON, and Parquet files that do deep dives into the intricacies of each file format.

Creating a SparkSession
You can create a SparkSession in your applications with the getOrCreate method:

val spark = SparkSession.builder().master("local").appName("my cool app").getOrCreate()

You don’t need to manually create a SparkSession in programming environments that already define
the variable (e.g. the Spark shell or a Databricks notebook). Creating your own SparkSession becomes
vital when you write Spark code in a text editor.
Wrapping the spark variable in a trait is the best way to share it across different classes and objects
in your codebase.

1 import org.apache.spark.sql.SparkSession
2
3 trait SparkSessionWrapper extends Serializable {
4
5 lazy val spark: SparkSession = {
6 SparkSession.builder().master("local").appName("my cool app").getOrCreate()
7 }
8
9 }

The getOrCreate() method will create a new SparkSession if one does not exist and reuse an existing SparkSession if it exists.
Here's how getOrCreate() works in different environments:

• In the Databricks environment, getOrCreate will always use the SparkSession created by
Databricks and will never create a SparkSession
• In the Spark console, getOrCreate will use the SparkSession created by the console
• In the test environment, getOrCreate will create a SparkSession the first time it encounters the
spark variable and will then reuse that SparkSession

Your production environment will probably already define the spark variable, so getOrCreate() won't ever bother creating a SparkSession and will simply use the SparkSession already created by the environment.
Here is how the SparkSessionWrapper can be used in some example objects.

object transformations extends SparkSessionWrapper {

  def withSomeDatamart(
    coolDF: DataFrame = spark.read.parquet("/mnt/my-bucket/cool-data")
  )(df: DataFrame): DataFrame = {
    df.join(
      broadcast(coolDF),
      df("some_id") <=> coolDF("some_id")
    )
  }

}

The transformations.withSomeDatamart() method injects coolDF, so the code can easily be tested and will intelligently grab the right file by default when run in production.
Notice how the spark variable is used to set our smart default.
We will use the SparkSessionWrapper trait and the spark variable again when testing the withSomeDatamart method.

import utest._

object TransformsTest extends TestSuite with SparkSessionWrapper with ColumnComparer {

  val tests = Tests {

    'withSomeDatamart - {

      val coolDF = spark.createDF(
        List(

        ), List(

        )
      )

      val df = spark.createDF(
        List(

        ), List(

        )
      ).transform(transformations.withSomeDatamart())

    }

  }

}

The test leverages the createDF method, which is a SparkSession extension defined in spark-daria. createDF is similar to createDataFrame, but more concise.
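Here's a rough sketch of the difference (the import path and exact createDF signature may vary slightly between spark-daria versions, so double-check against the version you're using):

import com.github.mrpowers.spark.daria.sql.SparkSessionExt._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

// createDataFrame needs an RDD[Row] and a StructType
val df1 = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("bob", 55))),
  StructType(Seq(
    StructField("name", StringType, true),
    StructField("age", IntegerType, true)
  ))
)

// createDF expresses the same thing with less ceremony
val df2 = spark.createDF(
  List(("bob", 55)),
  List(("name", StringType, true), ("age", IntegerType, true))
)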

Reusing the SparkSession in the test suite


Starting and stopping the SparkSession is slow, so you want to reuse the same SparkSession
throughout your test suite. Don’t restart the SparkSession for every test file that is run - Spark
tests run slowly enough as is and shouldn’t be made any slower.
The SparkSessionWrapper can be reused in your application code and the test suite.

SparkContext
The SparkSession encapsulates the SparkConf, SparkContext, and SQLContext.
Prior to Spark 2.0, developers needed to explicitly create SparkConf, SparkContext, and SQLContext objects. Now Spark developers can just create a SparkSession and access the other objects as needed.
The following code snippet uses the SparkSession to access the sparkContext, so the parallelize method can be used to create an RDD of Rows (which can then be converted into a DataFrame).

1 spark.sparkContext.parallelize(
2 Seq(
3 Row("bob", 55)
4 )
5 )

You shouldn’t have to access the sparkContext much - pretty much only when manually creating
DataFrames. See the spark-daria³⁵ createDF() method, so you don’t even need to explicitly call
sparkContext when you want to create a DataFrame.

Read this blog post³⁶ for more information.

Conclusion
You’ll need a SparkSession in your programs to create DataFrames.
Reusing the SparkSession in your application is critical for good code organization. Reusing the
SparkSession in your test suite is vital to make your tests execute as quickly as possible.
³⁵https://github.com/MrPowers/spark-daria
³⁶https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html
Testing Spark Applications
Testing Spark applications allows for a rapid development workflow and gives you confidence that
your code will work in production.
Most Spark users spin up clusters with sample data sets to develop code - this is slow (clusters are
slow to start) and costly (you need to pay for computing resources).
An automated test suite lets you develop code on your local machine free of charge. Test files should
run in under a minute, so it’s easy to rapidly iterate.
The test suite documents how the code functions, reduces bugs, and makes it easier to add new
features without breaking existing code.
We’ll talk about more benefits of testing later. Let’s start with some simple examples!

Hello World Example


The spark-test-examples repository contains all the code snippets covered in this tutorial! The spark-
fast-tests library is used to make DataFrame comparisons.
The following HelloWorld object contains a withGreeting method that appends a greeting column
to a DataFrame.

1 package com.github.mrpowers.spark.test.example
2
3 import org.apache.spark.sql.DataFrame
4 import org.apache.spark.sql.functions._
5
6 object HelloWorld {
7
8 def withGreeting()(df: DataFrame): DataFrame = {
9 df.withColumn("greeting", lit("hello world"))
10 }
11
12 }

Suppose we start with a DataFrame that looks like this:



1 +------+
2 | name|
3 +------+
4 |miguel|
5 | luisa|
6 +------+

When we run the HelloWorld.withGreeting() method, we should get a new DataFrame that looks
like this:

1 +------+-----------+
2 | name| greeting|
3 +------+-----------+
4 |miguel|hello world|
5 | luisa|hello world|
6 +------+-----------+

Add a SparkSessionTestWrapper trait in the test directory so we can create DataFrames in our test
suite via the SparkSession.

1 package com.github.mrpowers.spark.test.example
2
3 import org.apache.spark.sql.SparkSession
4
5 trait SparkSessionTestWrapper {
6
7 lazy val spark: SparkSession = {
8 SparkSession
9 .builder()
10 .master("local")
11 .appName("spark test example")
12 .getOrCreate()
13 }
14
15 }

Let’s write a test that creates a DataFrame, runs the withGreeting() method, and confirms that the
greeting column has been properly appended to the DataFrame.

1 package com.github.mrpowers.spark.test.example
2
3 import com.github.mrpowers.spark.fast.tests.DataFrameComparer
4 import org.apache.spark.sql.Row
5 import org.apache.spark.sql.types._
6 import org.scalatest.FunSpec
7
8 class HelloWorldSpec
9 extends FunSpec
10 with DataFrameComparer
11 with SparkSessionTestWrapper {
12
13 import spark.implicits._
14
15 it("appends a greeting column to a Dataframe") {
16
17 val sourceDF = Seq(
18 ("miguel"),
19 ("luisa")
20 ).toDF("name")
21
22 val actualDF = sourceDF.transform(HelloWorld.withGreeting())
23
24 val expectedSchema = List(
25 StructField("name", StringType, true),
26 StructField("greeting", StringType, false)
27 )
28
29 val expectedData = Seq(
30 Row("miguel", "hello world"),
31 Row("luisa", "hello world")
32 )
33
34 val expectedDF = spark.createDataFrame(
35 spark.sparkContext.parallelize(expectedData),
36 StructType(expectedSchema)
37 )
38
39 assertSmallDataFrameEquality(actualDF, expectedDF)
40
41 }
42
43 }

The test file is pretty verbose… welcome to Scala!

Some notable points in the test file:

• We need to run import spark.implicits._ to access the toDF helper method that creates sourceDF.
• The expectedDF cannot be created with the toDF helper method. toDF allows the greeting column to be null - see the third argument in StructField("greeting", StringType, true). We need the greeting column to be StructField("greeting", StringType, false).
• The assertSmallDataFrameEquality() function compares the equality of two DataFrames. We need to include the DataFrameComparer trait in the test class definition and set up the project with spark-fast-tests to access this method.

The HelloWorld and HelloWorldSpec files are checked into GitHub if you'd like to clone the repo and play with the examples yourself.

Deeper Dive into StructField


StructField takes three arguments:

• The column name
• The column type (notice that these are imported from org.apache.spark.sql.types)
• A boolean value that indicates if the column is nullable. If this argument is set to true, then the column can contain null values.

Testing a User Defined Function


Let’s create a user defined function that returns true if a number is even and false otherwise.
The code is quite simple.

1 package com.github.mrpowers.spark.test.example
2
3 import org.apache.spark.sql.functions._
4
5 object NumberFun {
6
7 def isEven(n: Integer): Boolean = {
8 n % 2 == 0
9 }
10
11 val isEvenUDF = udf[Boolean, Integer](isEven)
12
13 }

The test isn’t too complicated, but prepare yourself for a wall of code.

1 package com.github.mrpowers.spark.test.example
2
3 import org.scalatest.FunSpec
4 import org.apache.spark.sql.types._
5 import org.apache.spark.sql.functions._
6 import org.apache.spark.sql.Row
7 import com.github.mrpowers.spark.fast.tests.DataFrameComparer
8
9 class NumberFunSpec
10 extends FunSpec
11 with DataFrameComparer
12 with SparkSessionTestWrapper {
13
14 import spark.implicits._
15
16 it("appends an is_even column to a Dataframe") {
17
18 val sourceDF = Seq(
19 (1),
20 (8),
21 (12)
22 ).toDF("number")
23
24 val actualDF = sourceDF
25 .withColumn("is_even", NumberFun.isEvenUDF(col("number")))
26
27 val expectedSchema = List(
28 StructField("number", IntegerType, false),
29 StructField("is_even", BooleanType, true)
30 )
31
32 val expectedData = Seq(
33 Row(1, false),
34 Row(8, true),
35 Row(12, true)
36 )
37
38 val expectedDF = spark.createDataFrame(
39 spark.sparkContext.parallelize(expectedData),
40 StructType(expectedSchema)
41 )
42
43 assertSmallDataFrameEquality(actualDF, expectedDF)
44
45 }
46 }

We create a DataFrame, run the NumberFun.isEvenUDF() function, create another expected DataFrame,
and compare the actual result with our expectations using assertSmallDataFrameEquality() from
spark-fast-tests.
We can improve by testing isEven() on a standalone basis and covering the edge cases. Here are some tests we might like to add.

1 describe(".isEven") {
2 it("returns true for even numbers") {
3 assert(NumberFun.isEven(4) === true)
4 }
5
6 it("returns false for odd numbers") {
7 assert(NumberFun.isEven(3) === false)
8 }
9
10 it("returns false for null values") {
11 assert(NumberFun.isEven(null) === false)
12 }
13 }

The first two tests pass with our existing code, but the third one causes the code to error out with a NullPointerException. If we'd like our user defined function to assume that this function will never be called on columns that are nullable, we might be able to get away with ignoring null values.
It's probably safer to account for null values and refactor the code accordingly.

A Real Test
Let’s write a test for a function that converts all the column names of a DataFrame to snake_case.
This will make it a lot easier to run SQL queries off of the DataFrame.

1 package com.github.mrpowers.spark.test.example
2
3 import org.apache.spark.sql.DataFrame
4
5 object Converter {
6
7 def snakecaseify(s: String): String = {
8 s.toLowerCase().replace(" ", "_")
9 }
10
11 def snakeCaseColumns(df: DataFrame): DataFrame = {
12 df.columns.foldLeft(df) { (acc, cn) =>
13 acc.withColumnRenamed(cn, snakecaseify(cn))
14 }
15 }
16
17 }

snakecaseify is a pure function and will be tested using the Scalatest assert() method. We'll compare the equality of two DataFrames to test the snakeCaseColumns method.

1 package com.github.mrpowers.spark.test.example
2
3 import com.github.mrpowers.spark.fast.tests.DataFrameComparer
4 import org.scalatest.FunSpec
5
6 class ConverterSpec
7 extends FunSpec
8 with DataFrameComparer
9 with SparkSessionTestWrapper {
10
11 import spark.implicits._
12
13 describe(".snakecaseify") {
14
15 it("downcases uppercase letters") {
16 assert(Converter.snakecaseify("HeLlO") === "hello")
17 }
18
19 it("converts spaces to underscores") {
20 assert(Converter.snakecaseify("Hi There") === "hi_there")
21 }
22
23 }
24
25 describe(".snakeCaseColumns") {
26
27 it("snake_cases the column names of a DataFrame") {
28
29 val sourceDF = Seq(
30 ("funny", "joke")
31 ).toDF("A b C", "de F")
32
33 val actualDF = Converter.snakeCaseColumns(sourceDF)
34
35 val expectedDF = Seq(
36 ("funny", "joke")
37 ).toDF("a_b_c", "de_f")
38
39 assertSmallDataFrameEquality(actualDF, expectedDF)
40
41 }
42
43 }
44
45 }

This test file uses the describe method to group tests associated with the snakecaseify() and snakeCaseColumns() methods. This separates code in the test file and makes the console output more clear when the tests are run.

How Testing Improves Your Codebase


Sandi Metz lists some benefits of testing in her book Practical Object Oriented Design with Ruby.
Let’s see how her list applies to Spark applications.

Finding Bugs
When writing user defined functions or DataFrame transformations that will process billions of rows
of data, you will likely encounter bad data. There will be strange characters, null values, and other
inconsistencies. Testing encourages you to proactively deal with edge cases. If your code breaks with
a production anomaly, you can add another test to make sure the edge case doesn’t catch you again.

Supplying Documentation
It is often easier to understand code by reading the tests! When I need to grok some new code, I start
with the tests and then progress to the source code.
API documentation can sometimes fall out of sync with the actual code. A developer may update
the code and forget to update the API documentation.
The test suite won’t fall out of sync with the code. If the code changes and the tests start failing, the
developer will remember to update the test suite.

Exposing Design Flaws


Poorly designed code is difficult to test. If it’s hard to write a test, you’ll be forced to refactor the
code. Automated tests incentivize well written code.

Advice from the Trenches


The spark-test-example GitHub repository contains all the examples that were covered in this
chapter. Clone the repo, run sbt test, and play around with the code to get your hands dirty.
The FunSpec trait is included in the test suites to make the code more readable. Following the Ruby convention of grouping tests for individual functions in a describe block and giving each spec a descriptive title makes it easier to read the test output.
Use a continuous integration tool to build your project every time it's merged with master.
When the test suite fails, it should be broadcasted loudly and the bugs should be fixed immediately.

Running a Single Test File


Create a workflow that enables you to run a single test file, so the tests run quicker. The IntelliJ text editor makes it easy to run a single test file, or you can use the following command in your Terminal.

1 sbt "test-only *HelloWorldSpec"

Spend the time to develop a fluid development workflow, so testing is a delight.

Debugging with show()


The show() method can be called in the test suite to output the DataFrame contents in the console.
Use actualDF.show() and expectedDF.show() whenever you’re debugging a failing spec with a
cryptic error message.
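For example, a quick debugging sketch that reuses the names from the HelloWorld spec above:

val actualDF = sourceDF.transform(HelloWorld.withGreeting())

// temporarily print both DataFrames while debugging a failing comparison
actualDF.show()
expectedDF.show()

assertSmallDataFrameEquality(actualDF, expectedDF)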

Mocking is Limited in Scala!


ScalaMock3 can only mock traits and no-args classes - Paul Butcher, author of ScalaMock
ScalaMock3 doesn’t support objects or classes with arguments, arguably the most common language
constructs you’d like to mock.
ScalaMock4 will be significantly more usable, but it doesn't look like the project is making forward progress. ScalaMock4 was initially blocked on the release of scala.meta, but scala.meta has since been released and it doesn't look like development of ScalaMock4 has progressed.
Let’s hope mocking in Scala gets better soon!

Should You Test Your Spark Projects?


I started using Spark in the browser-based Databricks notebooks and the development workflow
was painful. Editing text in the browser is slow and buggy. I was manually testing my code and
other developers were reusing my code by copying and pasting the functions.
Developing Spark code in tested SBT projects is much better!
Other programming languages make testing easier. For example, with Ruby, it is easier to stub /
mock, the testing frameworks have more features, and the community encourages testing. Scala
only has one book on testing and it doesn’t get great reviews.
Environment Specific Config in Spark
Scala Projects
Environment config files return different values for the test, development, staging, and production
environments.
In Spark projects, you will often want a variable to point to a local CSV file in the test environment
and a CSV file in S3 in the production environment.
This episode will demonstrate how to add environment config to your projects and how to set
environment variables to change the environment.
A video version of this chapter is available at https://www.youtube.com/embed/aRbxcLgs7YA.

Basic use case


Let’s create a Config object with one Map[String, String] with test configuration and another
Map[String, String] with production config.

package com.github.mrpowers.spark.spec.sql

object Config {

  var test: Map[String, String] = {
    Map(
      "libsvmData" -> new java.io.File("./src/test/resources/sample_libsvm_data.txt").getCanonicalPath,
      "somethingElse" -> "hi"
    )
  }

  var production: Map[String, String] = {
    Map(
      "libsvmData" -> "s3a://my-cool-bucket/fun-data/libsvm.txt",
      "somethingElse" -> "whatever"
    )
  }

  var environment = sys.env.getOrElse("PROJECT_ENV", "production")

  def get(key: String): String = {
    if (environment == "test") {
      test(key)
    } else {
      production(key)
    }
  }

}

The Config.get() method will grab values from the test or production map depending on the
PROJECT_ENV value.

Let’s use the sbt console command to demonstrate this.

$ PROJECT_ENV=test sbt console

scala> com.github.mrpowers.spark.spec.sql.Config.get("somethingElse")
res0: String = hi

Let’s restart the SBT console and run the same code in the production environment.

$ PROJECT_ENV=production sbt console

scala> com.github.mrpowers.spark.spec.sql.Config.get("somethingElse")
res0: String = whatever

Here is how the Config object can be used to fetch a file in your GitHub repository in the test
environment and also fetch a file from S3 in the production environment.

val training = spark
  .read
  .format("libsvm")
  .load(Config.get("libsvmData"))

This solution is elegant and does not clutter our application code with environment logic.

Environment specific code antipattern


Here is an example of how you should not add environment paths to your code.

var environment = sys.env.getOrElse("PROJECT_ENV", "production")

val training = if (environment == "test") {
  spark
    .read
    .format("libsvm")
    .load(new java.io.File("./src/test/resources/sample_libsvm_data.txt").getCanonicalPath)
} else {
  spark
    .read
    .format("libsvm")
    .load("s3a://my-cool-bucket/fun-data/libsvm.txt")
}

An anti-pattern is a common response to a recurring problem that is usually ineffective and risks being highly counterproductive. - source³⁷

You should never write code with different execution paths in the production and test environments
because then your test suite won’t really be testing the actual code that’s run in production.

Overriding config
The Config.test and Config.production maps are defined as variables (with the var keyword), so
they can be overridden.

scala> import com.github.mrpowers.spark.spec.sql.Config

scala> Config.get("somethingElse")
res1: String = hi

scala> Config.test = Config.test ++ Map("somethingElse" -> "give me clean air")

scala> Config.get("somethingElse")
res2: String = give me clean air

Giving users the ability to swap out config on the fly makes your codebase more flexible for a variety
of use cases.

Setting the PROJECT_ENV variable for test runs


The Config object uses the production environment by default. You’re not going to want to have to
remember to set the PROJECT_ENV to test every time you run your test suite (e.g. you don't want to
type PROJECT_ENV=test sbt test).
³⁷https://en.wikipedia.org/wiki/Anti-pattern

You can update your build.sbt file as follows to set PROJECT_ENV to test whenever the test suite is
run.

fork in Test := true
envVars in Test := Map("PROJECT_ENV" -> "test")

Big thanks to the StackOverflow community for helping me figure this out³⁸.

Other implementations
This StackOverflow thread³⁹ discusses other solutions.
One answer relies on an external library, one is in Java, and one doesn’t allow for overrides.

Next steps
Feel free to extend this solution to account for other environments. For example, you might want to
add a staging environment that uses different paths to test code before it’s run in production.
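Here's a rough sketch of what adding a staging environment might look like (the staging bucket path is hypothetical):

object Config {

  var test: Map[String, String] = Map(
    "libsvmData" -> new java.io.File("./src/test/resources/sample_libsvm_data.txt").getCanonicalPath
  )

  // hypothetical staging bucket
  var staging: Map[String, String] = Map(
    "libsvmData" -> "s3a://my-cool-bucket-staging/fun-data/libsvm.txt"
  )

  var production: Map[String, String] = Map(
    "libsvmData" -> "s3a://my-cool-bucket/fun-data/libsvm.txt"
  )

  var environment = sys.env.getOrElse("PROJECT_ENV", "production")

  def get(key: String): String = environment match {
    case "test"    => test(key)
    case "staging" => staging(key)
    case _         => production(key)
  }

}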
Just remember to follow best practices and avoid the config anti-pattern that can litter your codebase
and reduce the protection offered by your test suite.
Adding Config objects to your functions adds a dependency you might not want. In a future chapter,
we’ll discuss how dependency injection can abstract these Config depencencies and how the Config
object can be leveraged to access smart defaults - the best of both worlds!
³⁸https://stackoverflow.com/questions/39902049/setting-environment-variables-when-running-scala-sbt-test-suite?rq=1
³⁹https://stackoverflow.com/questions/21607745/specific-config-by-environment-in-scala
Building Spark JAR Files with SBT
Spark JAR files let you package a project into a single file so it can be run on a Spark cluster.
A lot of developers develop Spark code in browser-based notebooks because they're unfamiliar with
JAR files. Scala is a difficult language and it’s especially challenging when you can’t leverage the
development tools provided by an IDE like IntelliJ.
This episode will demonstrate how to build JAR files with the SBT package and assembly commands
and how to customize the code that’s included in JAR files. Hopefully it will help you make the leap
and start writing Spark code in SBT projects with a powerful IDE by your side!
A video version of this chapter is available at https://www.youtube.com/embed/0yyw2gD0SrY.

JAR File Basics


A JAR (Java ARchive) is a package file format typically used to aggregate many Java
class files and associated metadata and resources (text, images, etc.) into one file for
distribution. - Wikipedia⁴⁰

JAR files can be attached to Databricks clusters or launched via spark-submit.


You can build a “thin” JAR file with the sbt package command. Thin JAR files only include the
project’s classes / objects / traits and don’t include any of the project dependencies.
You can build "fat" JAR files by adding sbt-assembly⁴¹ to your project. Fat JAR files include all the code from your project and all the code from the dependencies.
Let’s say you add the uJson library to your build.sbt file as a library dependency.

1 libraryDependencies += "com.lihaoyi" %% "ujson" % "0.6.5"

If you run sbt package, SBT will build a thin JAR file that only includes your project files. The thin
JAR file will not include the uJson files.
If you run sbt assembly, SBT will build a fat JAR file that includes both your project files and the
uJson files.
Let’s dig into the gruesome details!
⁴⁰https://en.wikipedia.org/wiki/JAR_(file_format)
⁴¹https://github.com/sbt/sbt-assembly

Building a Thin JAR File


As discussed, the sbt package command builds a thin JAR file of your project.
spark-daria⁴² is a good example of an open source project that is distributed as a thin JAR file. This
is an excerpt of the spark-daria build.sbt file:

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0" % "provided"

libraryDependencies += "com.github.mrpowers" % "spark-fast-tests" % "2.3.0_0.8.0" % "test"
libraryDependencies += "com.lihaoyi" %% "utest" % "0.6.3" % "test"
testFrameworks += new TestFramework("utest.runner.Framework")

artifactName := { (sv: ScalaVersion, module: ModuleID, artifact: Artifact) =>
  artifact.name + "_" + sv.binary + "-" + sparkVersion + "_" + module.revision + "." + artifact.extension
}

Important take-aways:

• The "provided" string at the end of the libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0" % "provided" line indicates that the spark-sql dependency should be provided by the runtime environment that uses this JAR file.
• The "test" string at the end of the libraryDependencies += "com.github.mrpowers" % "spark-fast-tests" % "2.3.0_0.8.0" % "test" line indicates that the spark-fast-tests dependency is only for the test suite. The application code does not rely on spark-fast-tests, but spark-fast-tests is needed when running sbt test.
• The artifactName := ... line customizes the name of the JAR file created with the sbt package command. As discussed in the spark-style-guide⁴³, it's best to include the Scala version, Spark version, and project version in the JAR file name, so it's easier for your users to select the right JAR file for their projects.

The sbt package command creates the target/scala-2.11/spark-daria_2.11-2.3.0_0.19.0.jar JAR file. We can use the jar tvf command to inspect the contents of the JAR file.

⁴²https://github.com/MrPowers/spark-daria
⁴³https://github.com/MrPowers/spark-style-guide#jar-files

$ jar tvf target/scala-2.11/spark-daria_2.11-2.3.0_0.19.0.jar

  255 Wed May 02 20:50:14 COT 2018 META-INF/MANIFEST.MF
    0 Wed May 02 20:50:14 COT 2018 com/
    0 Wed May 02 20:50:14 COT 2018 com/github/
    0 Wed May 02 20:50:14 COT 2018 com/github/mrpowers/
    0 Wed May 02 20:50:14 COT 2018 com/github/mrpowers/spark/
    0 Wed May 02 20:50:14 COT 2018 com/github/mrpowers/spark/daria/
    0 Wed May 02 20:50:14 COT 2018 com/github/mrpowers/spark/daria/sql/
    0 Wed May 02 20:50:14 COT 2018 com/github/mrpowers/spark/daria/utils/
 3166 Wed May 02 20:50:12 COT 2018 com/github/mrpowers/spark/daria/sql/DataFrameHelpers.class
 1643 Wed May 02 20:50:12 COT 2018 com/github/mrpowers/spark/daria/sql/DataFrameHelpers$$anonfun$twoColumnsToMap$1.class
  876 Wed May 02 20:50:12 COT 2018 com/github/mrpowers/spark/daria/sql/ColumnExt$.class
 1687 Wed May 02 20:50:12 COT 2018 com/github/mrpowers/spark/daria/sql/transformations$$anonfun$1.class
 3278 Wed May 02 20:50:12 COT 2018 com/github/mrpowers/spark/daria/sql/DataFrameColumnsChecker.class
 3607 Wed May 02 20:50:12 COT 2018 com/github/mrpowers/spark/daria/sql/functions$$typecreator3$1.class
 1920 Wed May 02 20:50:12 COT 2018 com/github/mrpowers/spark/daria/sql/DataFrameColumnsException$.class
...
...

The sbt-assembly plugin needs to be added to build fat JAR files that include the project’s
dependencies.

Building a Fat JAR File


spark-slack⁴⁴ is a good example of a project that’s distributed as a fat JAR file. The spark-
slack JAR file includes all of the spark-slack code and all of the code in two external libraries
(net.gpedro.integrations.slack.slack-webhook and org.json4s.json4s-native).
Let’s take a snippet from the spark-slack build.sbt file:

⁴⁴https://github.com/MrPowers/spark-slack

libraryDependencies ++= Seq(
  "net.gpedro.integrations.slack" % "slack-webhook" % "1.2.1",
  "org.json4s" %% "json4s-native" % "3.3.0"
)

libraryDependencies += "com.github.mrpowers" % "spark-daria" % "v2.3.0_0.18.0" % "test"
libraryDependencies += "com.github.mrpowers" % "spark-fast-tests" % "v2.3.0_0.7.0" % "test"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test"

assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
assemblyJarName in assembly := s"${name.value}_${scalaBinaryVersion.value}-${sparkVersion.value}_${version.value}.jar"

Important observations:

• "net.gpedro.integrations.slack" % "slack-webhook" % "1.2.1" and "org.json4s" %% "json4s-native" % "3.3.0" aren't flagged as "provided" or "test" dependencies, so they will be included in the JAR file when sbt assembly is run.
• spark-daria, spark-fast-tests, and scalatest are all flagged as "test" dependencies, so they won't be included in the JAR file.
• assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false) means that the Scala code itself should not be included in the JAR file. Your Spark runtime environment should already have Scala set up.
• assemblyJarName in assembly := s"${name.value}_${scalaBinaryVersion.value}-${sparkVersion.value}_${version.value}.jar" customizes the JAR file name that's created by sbt assembly. Notice that sbt package and sbt assembly require different code to customize the JAR file name.

Let’s build the JAR file with sbt assembly and then inspect the content.

$ jar tvf target/scala-2.11/spark-slack_2.11-2.3.0_0.0.1.jar

    0 Wed May 02 21:09:18 COT 2018 com/
    0 Wed May 02 21:09:18 COT 2018 com/github/
    0 Wed May 02 21:09:18 COT 2018 com/github/mrpowers/
    0 Wed May 02 21:09:18 COT 2018 com/github/mrpowers/spark/
    0 Wed May 02 21:09:18 COT 2018 com/github/mrpowers/spark/slack/
    0 Wed May 02 21:09:18 COT 2018 com/github/mrpowers/spark/slack/slash_commands/
    0 Wed May 02 21:09:18 COT 2018 com/google/
    0 Wed May 02 21:09:18 COT 2018 com/google/gson/
    0 Wed May 02 21:09:18 COT 2018 com/google/gson/annotations/
    0 Wed May 02 21:09:18 COT 2018 com/google/gson/internal/
    0 Wed May 02 21:09:18 COT 2018 com/google/gson/internal/bind/
    0 Wed May 02 21:09:18 COT 2018 com/google/gson/reflect/
    0 Wed May 02 21:09:18 COT 2018 com/google/gson/stream/
    0 Wed May 02 21:09:18 COT 2018 com/thoughtworks/
    0 Wed May 02 21:09:18 COT 2018 com/thoughtworks/paranamer/
    0 Wed May 02 21:09:18 COT 2018 net/
    0 Wed May 02 21:09:18 COT 2018 net/gpedro/
    0 Wed May 02 21:09:18 COT 2018 net/gpedro/integrations/
    0 Wed May 02 21:09:18 COT 2018 net/gpedro/integrations/slack/
    0 Wed May 02 21:09:18 COT 2018 org/
    0 Wed May 02 21:09:18 COT 2018 org/json4s/
    0 Wed May 02 21:09:18 COT 2018 org/json4s/native/
    0 Wed May 02 21:09:18 COT 2018 org/json4s/prefs/
    0 Wed May 02 21:09:18 COT 2018 org/json4s/reflect/
    0 Wed May 02 21:09:18 COT 2018 org/json4s/scalap/
    0 Wed May 02 21:09:18 COT 2018 org/json4s/scalap/scalasig/
 1879 Wed May 02 21:09:14 COT 2018 com/github/mrpowers/spark/slack/Notifier.class
 1115 Wed May 02 21:09:14 COT 2018 com/github/mrpowers/spark/slack/SparkSessionWrapper$class.class
  683 Wed May 02 21:09:14 COT 2018 com/github/mrpowers/spark/slack/SparkSessionWrapper.class
 2861 Wed May 02 21:09:14 COT 2018 com/github/mrpowers/spark/slack/slash_commands/SlashParser.class
...
...

sbt assembly provides us with the com/github/mrpowers/spark/slack, net/gpedro/, and org/json4s/ code as expected. But why does our fat JAR file include com/google/gson/ code as well?
If we look at the net.gpedro pom.xml file⁴⁵, we can see that net.gpedro relies on com.google.code.gson:

1 <dependencies>
2 <dependency>
3 <groupId>com.google.code.gson</groupId>
4 <artifactId>gson</artifactId>
5 <version>${gson.version}</version>
6 </dependency>
7 </dependencies>

You’ll want to be very careful to minimize your project dependencies. You’ll also want to rely on
⁴⁵https://github.com/gpedro/slack-webhook/blob/master/pom.xml#L159-L165
Building Spark JAR Files with SBT 184

external libraries that have minimal dependencies themselves as the dependies of a library quickly
become your dependencies as soon as you add the library to your project.

Next Steps
Make sure to always mark your libraryDependencies with “provided” or “test” whenever possible
to keep your JAR files as thin as possible.
Only add dependencies when it’s absolutely required and try to avoid libraries that depend on a lot
of other libraries.
It’s very easy to find yourself in dependency hell⁴⁶ with Scala and you should proactively avoid this
uncomfortable situation.
Your Spark runtime environment should generally provide the Scala and Spark dependencies and
you shouldn’t include these in your JAR files.
I fought long and hard to develop the build.sbt strategies outlined in this chapter. Hopefully this will save you from some headaches!
⁴⁶https://en.wikipedia.org/wiki/Dependency_hell
Shading Dependencies in Spark
Projects with SBT
sbt-assembly makes it easy to shade dependencies in your Spark projects when you create fat JAR
files. This chapter explains why it’s useful to shade dependencies and will teach you how to shade
dependencies in your own projects.

When shading is useful


Let’s look at a snippet from the spark-pika⁴⁷ build.sbt file and examine the JAR file that’s
constructed by sbt assembly.

1 libraryDependencies += "mrpowers" % "spark-daria" % "2.3.1_0.24.0"


2 libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1" % "provided"
3 libraryDependencies += "MrPowers" % "spark-fast-tests" % "2.3.1_0.15.0" % "test"
4 libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test"
5
6 assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala =\
7 false)
8 assemblyJarName in assembly := s"${name.value}_2.11-${sparkVersion.value}_${version.\
9 value}.jar"

The sbt assembly command will create a JAR file that includes spark-daria and all of the
spark-pika code. The JAR file won’t include the libraryDependencies that are flagged with
“provided” or “test” (i.e. spark-sql, spark-fast-tests, and scalatest won’t be included in the JAR
file). Let’s verify the contents of the JAR file with the jar tvf target/scala-2.11/spark-pika_-
2.11-2.3.1_0.0.1.jar command.

⁴⁷https://github.com/MrPowers/spark-pika

1 0 Sun Sep 23 23:04:00 COT 2018 com/


2 0 Sun Sep 23 23:04:00 COT 2018 com/github/
3 0 Sun Sep 23 23:04:00 COT 2018 com/github/mrpowers/
4 0 Sun Sep 23 23:04:00 COT 2018 com/github/mrpowers/spark/
5 0 Sun Sep 23 23:04:00 COT 2018 com/github/mrpowers/spark/daria/
6 0 Sun Sep 23 23:04:00 COT 2018 com/github/mrpowers/spark/daria/ml/
7 0 Sun Sep 23 23:04:00 COT 2018 com/github/mrpowers/spark/daria/sql/
8 0 Sun Sep 23 23:04:00 COT 2018 com/github/mrpowers/spark/daria/sql/types/
9 0 Sun Sep 23 23:04:00 COT 2018 com/github/mrpowers/spark/daria/utils/
10 0 Sun Sep 23 23:04:00 COT 2018 com/github/mrpowers/spark/pika/

If the spark-pika fat JAR file is attached to a cluster, users will be able to access the com.github.mrpowers.spark.daria and com.github.mrpowers.spark.pika namespaces.
We don’t want to provide access to the com.github.mrpowers.spark.daria namespace when
spark-pika is attached to a cluster for two reasons:

1. It just feels wrong. When users attach the spark-pika JAR file to their Spark cluster, they
should only be able to access the spark-pika namespace. Adding additional namespaces to the
classpath is unexpected.
2. It prevents users from accessing a different spark-daria version than what’s specified in the
spark-pika build.sbt file. In this example, users are forced to use spark-daria version 2.3.1_0.24.0.

How to shade the spark-daria dependency


We can use SBT to change the spark-daria namespace for all the code that’s used by spark-pika.
spark-daria will still be in the fat JAR file, but the namespace will be different, so users can still
attach their own version of spark-daria to the cluster.
Here is the code to shade the spark-daria dependency in spark-pika.

1 assemblyShadeRules in assembly := Seq(


2 ShadeRule.rename("com.github.mrpowers.spark.daria.**" -> "shadedSparkDariaForSpark\
3 Pika.@1").inAll
4 )

Let’s run sbt clean and then rebuild the spark-pika JAR file with sbt assembly. Let’s examine the
contents of the new JAR file with jar tvf target/scala-2.11/spark-pika_2.11-2.3.1_0.0.1.jar.

1 0 Sun Sep 23 23:29:32 COT 2018 com/


2 0 Sun Sep 23 23:29:32 COT 2018 com/github/
3 0 Sun Sep 23 23:29:32 COT 2018 com/github/mrpowers/
4 0 Sun Sep 23 23:29:32 COT 2018 com/github/mrpowers/spark/
5 0 Sun Sep 23 23:29:32 COT 2018 com/github/mrpowers/spark/pika/
6 0 Sun Sep 23 23:29:32 COT 2018 shadedSparkDariaForSparkPika/
7 0 Sun Sep 23 23:29:32 COT 2018 shadedSparkDariaForSparkPika/ml/
8 0 Sun Sep 23 23:29:32 COT 2018 shadedSparkDariaForSparkPika/sql/
9 0 Sun Sep 23 23:29:32 COT 2018 shadedSparkDariaForSparkPika/sql/types/
10 0 Sun Sep 23 23:29:32 COT 2018 shadedSparkDariaForSparkPika/utils/

The JAR file used to contain the com.github.mrpowers.spark.daria namespace and that’s now been
replaced with a shadedSparkDariaForSparkPika namespace.
All the spark-pika references to spark-daria will use the shadedSparkDariaForSparkPika namespace.
Users can attach both spark-daria and spark-pika to the same Spark cluster now and there won’t
be a com.github.mrpowers.spark.daria namespace collision anymore.
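The .inAll scope rewrites matching classes wherever they appear in the fat JAR. If you ever need finer control, sbt-assembly also lets you scope a rule to specific artifacts with inLibrary and inProject. Here's a sketch of the same rename scoped to just the spark-daria artifact, plus our own compiled references to it:

assemblyShadeRules in assembly := Seq(
  // Only rewrite classes that ship in the spark-daria JAR, plus the
  // references to them in the spark-pika code itself
  ShadeRule.rename("com.github.mrpowers.spark.daria.**" -> "shadedSparkDariaForSparkPika.@1")
    .inLibrary("mrpowers" % "spark-daria" % "2.3.1_0.24.0")
    .inProject
)

Either approach produces the shadedSparkDariaForSparkPika namespace shown above.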

Conclusion
When creating Spark libraries, make sure to shade dependencies that are included in the fat JAR file,
so your library users can specify different versions for dependencies at will. Try your best to design
your libraries to only add a single namespace to the classpath when the JAR file is attached to a cluster.
Dependency Injection with Spark
Dependency injection is a design pattern that lets you write Spark code that's more flexible and easier to test.
This chapter shows code with a path dependency and demonstrates how to inject the path
dependency in a backwards compatible manner. It also shows how to inject an entire DataFrame as
a dependency.

Code with a dependency


Let’s create a withStateFullName method that appends a state_name column to a DataFrame.

1 def withStateFullName()(df: DataFrame): DataFrame = {


2 val stateMappingsDF = spark
3 .read
4 .option("header", true)
5 .csv(Config.get("stateMappingsPath"))
6 df
7 .join(
8 broadcast(stateMappingsDF),
9 df("state") <=> stateMappingsDF("state_abbreviation"),
10 "left_outer"
11 )
12 .drop("state_abbreviation")
13 }

withStateFullName appends the state_name column with a broadcast join.


withStateFullName depends on the Config object. withStateFullName “has a dependency”. This is
the dependency that’ll be “injected”.
The Config object is defined as follows:

1 object Config {
2
3 val test: Map[String, String] = {
4 Map(
5 "stateMappingsPath" -> new java.io.File(s"./src/test/resources/state_mappings.\
6 csv").getCanonicalPath
7 )
8 }
9
10 val production: Map[String, String] = {
11 Map(
12 "stateMappingsPath" -> "s3a://some-fake-bucket/state_mappings.csv"
13 )
14 }
15
16 var environment = sys.env.getOrElse("PROJECT_ENV", "production")
17
18 def get(key: String): String = {
19 if (environment == "test") {
20 test(key)
21 } else {
22 production(key)
23 }
24 }
25
26 }

The chapter on environment specific configuration will cover this design pattern in more detail.
Let’s create a src/test/resources/state_mappings.csv file, so we can run the withStateFullName
method on some sample data.

1 state_name,state_abbreviation
2 Tennessee,TN
3 New York,NY
4 Mississippi,MS

Run the withStateFullName method.



1 val df = Seq(
2 ("john", 23, "TN"),
3 ("sally", 48, "NY")
4 ).toDF("first_name", "age", "state")
5
6 df
7 .transform(withStateFullName())
8 .show()
9
10 +----------+---+-----+----------+
11 |first_name|age|state|state_name|
12 +----------+---+-----+----------+
13 | john| 23| TN| Tennessee|
14 | sally| 48| NY| New York|
15 +----------+---+-----+----------+

Let’s refactor the withStateFullName so it does not depend on the Config object. In other words,
let’s remove the Config dependency from withStateFullName with the dependency injection design
pattern.

Injecting a path
Let’s create a withStateFullNameInjectPath method that takes the path to the state mappings data
as an argument.

1 def withStateFullNameInjectPath(
2 stateMappingsPath: String = Config.get("stateMappingsPath")
3 )(df: DataFrame): DataFrame = {
4 val stateMappingsDF = spark
5 .read
6 .option("header", true)
7 .csv(stateMappingsPath)
8 df
9 .join(
10 broadcast(stateMappingsDF),
11 df("state") <=> stateMappingsDF("state_abbreviation"),
12 "left_outer"
13 )
14 .drop("state_abbreviation")
15 }

The stateMappingsPath leverages a smart default, so users can easily use the function without
explicitly referring to the path. This code is more flexible because it allows users to override the
smart default and use any stateMappingsPath when running the function.
Let’s rely on the smart default and run this code.

1 val df = Seq(
2 ("john", 23, "TN"),
3 ("sally", 48, "NY")
4 ).toDF("first_name", "age", "state")
5
6 df
7 .transform(withStateFullNameInjectPath())
8 .show()
9
10 +----------+---+-----+----------+
11 |first_name|age|state|state_name|
12 +----------+---+-----+----------+
13 | john| 23| TN| Tennessee|
14 | sally| 48| NY| New York|
15 +----------+---+-----+----------+

The withStateFullNameInjectPath method does not depend on the Config object.
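We can also override the smart default and point the function at any other mappings file. The path below is hypothetical - any CSV file with state_name and state_abbreviation columns will work:

val customStateMappingsPath = new java.io.File(
  "./src/test/resources/other_state_mappings.csv"
).getCanonicalPath

df
  .transform(withStateFullNameInjectPath(customStateMappingsPath))
  .show()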

Injecting an entire DataFrame


Let’s refactor the code again to inject the entire DataFrame as an argument, again with a smart
default.

1 def withStateFullNameInjectDF(
2 stateMappingsDF: DataFrame = spark
3 .read
4 .option("header", true)
5 .csv(Config.get("stateMappingsPath"))
6 )(df: DataFrame): DataFrame = {
7 df
8 .join(
9 broadcast(stateMappingsDF),
10 df("state") <=> stateMappingsDF("state_abbreviation"),
11 "left_outer"
12 )
13 .drop("state_abbreviation")
14 }

This code provides the same functionality and is even more flexible. We can now run the function
with any DataFrame. We can read a Parquet file and run this code or create a DataFrame with toDF
in our test suite.
Let’s override the smart default and run this code in our test suite:

1 val stateMappingsDF = Seq(


2 ("Tennessee", "TN"),
3 ("New York", "NY")
4 ).toDF("state_full_name", "state_abbreviation")
5
6 val df = Seq(
7 ("john", 23, "TN"),
8 ("sally", 48, "NY")
9 ).toDF("first_name", "age", "state")
10
11 df
12 .transform(withStateFullNameInjectDF(stateMappingsDF))
13 .show()
14
15 +----------+---+-----+---------------+
16 |first_name|age|state|state_full_name|
17 +----------+---+-----+---------------+
18 | john| 23| TN| Tennessee|
19 | sally| 48| NY| New York|
20 +----------+---+-----+---------------+

Injecting the entire DataFrame as a dependency allows us to test our code without reading from a
file. Avoiding file I/O in your test suite is a great way to make your tests run faster.
This design pattern also makes your tests more readable. Your coworkers won’t need to open up
random CSV files to understand the tests.
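Here's a sketch of what such a test can look like with spark-fast-tests. It assumes a FunSpec-style test class that mixes in the DataFrameComparer trait and has a SparkSession with spark.implicits._ imported:

it("appends the full state name") {
  val stateMappingsDF = Seq(
    ("Tennessee", "TN")
  ).toDF("state_full_name", "state_abbreviation")

  val df = Seq(
    ("john", 23, "TN")
  ).toDF("first_name", "age", "state")

  val actualDF = df.transform(withStateFullNameInjectDF(stateMappingsDF))

  val expectedDF = Seq(
    ("john", 23, "TN", "Tennessee")
  ).toDF("first_name", "age", "state", "state_full_name")

  assertSmallDataFrameEquality(actualDF, expectedDF)
}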

Conclusion
Dependency injection can be used to make code that’s more flexible and easier to test.
We went from having code that relied on a CSV file stored in a certain path to code that’s flexible
enough to be run with any DataFrame.
Before productionalizing this code, it’d be a good idea to run some DataFrame validations (on both
the underlying DataFrame and the injected DataFrame) and make the code even more flexible by
making it schema independent.
Make sure to leverage this design pattern so you don’t need to read from CSV / Parquet files in your
test suite anymore!
Broadcasting Maps
Spark makes it easy to broadcast maps and perform hash lookups in a cluster computing environment.
This chapter explains how to broadcast maps and how to use these broadcasted variables in analyses.

Simple example
Suppose you have an ArrayType column with a bunch of first names. You’d like to use a nickname
map to standardize all of the first names.
Here’s how we’d write this code for a single Scala array.

1 import scala.util.Try
2
3 val firstNames = Array("Matt", "Fred", "Nick")
4 val nicknames = Map("Matt" -> "Matthew", "Nick" -> "Nicholas")
5 val res = firstNames.map { (n: String) =>
6 Try { nicknames(n) }.getOrElse(n)
7 }
8 res // equals Array("Matthew", "Fred", "Nicholas")

Let’s create a DataFrame with an ArrayType column that contains a list of first names and then
append a standardized_names column that runs all the names through a Map.

1 import scala.util.Try
2
3 val nicknames = Map("Matt" -> "Matthew", "Nick" -> "Nicholas")
4 val n = spark.sparkContext.broadcast(nicknames)
5
6 val df = spark.createDF(
7 List(
8 (Array("Matt", "John")),
9 (Array("Fred", "Nick")),
10 (null)
11 ), List(
12 ("names", ArrayType(StringType, true), true)
13 )

14 ).withColumn(
15 "standardized_names",
16 array_map((name: String) => Try { n.value(name) }.getOrElse(name))
17 .apply(col("names"))
18 )
19
20 df.show(false)
21
22 +------------+------------------+
23 |names |standardized_names|
24 +------------+------------------+
25 |[Matt, John]|[Matthew, John] |
26 |[Fred, Nick]|[Fred, Nicholas] |
27 |null |null |
28 +------------+------------------+

We use the spark.sparkContext.broadcast() method to broadcast the nicknames map to all nodes
in the cluster.
Spark 2.4 added a transform method that’s similar to the Scala Array.map() method, but this isn’t
easily accessible via the Scala API yet, so we map through all the array elements with the spark-
daria⁴⁸ array_map method.
Note that we need to call n.value to access the broadcasted Map. This is slightly different from what's needed when writing vanilla Scala code.
We have some code that works which is a great start. Let’s clean this code up with some good Spark
coding practices.

Refactored code
Let’s wrap the withColumn code in a Spark custom transformation⁴⁹, so it’s more modular and easier
to test.

⁴⁸https://github.com/MrPowers/spark-daria/
⁴⁹https://medium.com/@mrpowers/chaining-custom-dataframe-transformations-in-spark-a39e315f903c

1 val nicknames = Map("Matt" -> "Matthew", "Nick" -> "Nicholas")


2 val n = spark.sparkContext.broadcast(nicknames)
3
4 def withStandardizedNames(n: org.apache.spark.broadcast.Broadcast[Map[String, String\
5 ]])(df: DataFrame) = {
6 df.withColumn(
7 "standardized_names",
8 array_map((name: String) => Try { n.value(name) }.getOrElse(name))
9 .apply(col("names"))
10 )
11 }
12
13 val df = spark.createDF(
14 List(
15 (Array("Matt", "John")),
16 (Array("Fred", "Nick")),
17 (null)
18 ), List(
19 ("names", ArrayType(StringType, true), true)
20 )
21 ).transform(withStandardizedNames(n))
22
23 df.show(false)
24
25 +------------+------------------+
26 |names |standardized_names|
27 +------------+------------------+
28 |[Matt, John]|[Matthew, John] |
29 |[Fred, Nick]|[Fred, Nicholas] |
30 |null |null |
31 +------------+------------------+

The withStandardizedNames() transformation takes an org.apache.spark.broadcast.Broadcast[Map[String, String]] as an argument. We can pass our broadcasted Map around as an argument to functions. Scala is awesome.

Building Maps from data files


You can also store the nickname mappings in a CSV file, read the file into a Map, and then broadcast the Map. It's typically best to store data in a CSV file instead of hardcoding a Map in your codebase.
Let’s create a little CSV file with our nickname to firstname mappings.

1 nickname,firstname
2 Matt,Matthew
3 Nick,Nicholas

Now let’s refactor our code to read the CSV into a DataFrame and convert it to a Map before
broadcasting it.

1 import com.github.mrpowers.spark.daria.sql.DataFrameHelpers
2
3 val nicknamesPath = new java.io.File(s"./src/test/resources/nicknames.csv").getCanon\
4 icalPath
5
6 val nicknamesDF = spark
7 .read
8 .option("header", "true")
9 .option("charset", "UTF8")
10 .csv(nicknamesPath)
11
12 val nicknames = DataFrameHelpers.twoColumnsToMap[String, String](
13 nicknamesDF,
14 "nickname",
15 "firstname"
16 )
17
18 val n = spark.sparkContext.broadcast(nicknames)
19
20 def withStandardizedNames(n: org.apache.spark.broadcast.Broadcast[Map[String, String\
21 ]])(df: DataFrame) = {
22 df.withColumn(
23 "standardized_names",
24 array_map((name: String) => Try { n.value(name) }.getOrElse(name))
25 .apply(col("names"))
26 )
27 }
28
29 val df = spark.createDF(
30 List(
31 (Array("Matt", "John")),
32 (Array("Fred", "Nick")),
33 (null)
34 ), List(
35 ("names", ArrayType(StringType, true), true)
36 )

37 ).transform(withStandardizedNames(n))
38
39 df.show(false)
40
41 +------------+------------------+
42 |names |standardized_names|
43 +------------+------------------+
44 |[Matt, John]|[Matthew, John] |
45 |[Fred, Nick]|[Fred, Nicholas] |
46 |null |null |
47 +------------+------------------+

This code uses the spark-daria⁵⁰ DataFrameHelpers.twoColumnsToMap() method to convert the


DataFrame to a Map. Use spark-daria whenever possible for these utility-type operations, so you
don’t need to reinvent the wheel.

Conclusion
You’ll often want to broadcast small Spark DataFrames when making broadcast joins⁵¹.
This chapter illustrates how broadcasting Spark Maps is another powerful design pattern when writing code that executes on a cluster.
Broadcasting ships a read-only copy of a small variable to every node once, instead of serializing it with every task, so lookup-heavy code that runs in parallel across the cluster can see big performance gains.
⁵⁰https://github.com/MrPowers/spark-daria
⁵¹https://mungingdata.com/apache-spark/broadcast-joins/
Validating Spark DataFrame Schemas
This chapter demonstrates how to explicitly validate the schema of a DataFrame in custom transformations so your code is easier to read and provides better error messages.
Spark’s lazy evaluation and execution plan optimizations yield amazingly fast results, but can also
create cryptic error messages.
This chapter will demonstrate how schema validations create code that's easier to read, maintain, and
debug.

Custom Transformations Refresher


A custom transformation is a function that takes a DataFrame as an argument and returns a
DataFrame.
Let’s look at an example of a custom transformation that makes an assumption.
The following transformation appends an is_senior_citizen column to a DataFrame.

1 def withIsSeniorCitizen()(df: DataFrame): DataFrame = {


2 df.withColumn("is_senior_citizen", df("age") >= 65)
3 }

Suppose we have the following peopleDF:

1 +------+---+
2 | name|age|
3 +------+---+
4 |miguel| 80|
5 | liz| 10|
6 +------+---+

Let’s run the withIsSeniorCitizen transformation.

1 val actualDF = peopleDF.transform(withIsSeniorCitizen())

actualDF will have the following data.



1 +------+---+-----------------+
2 | name|age|is_senior_citizen|
3 +------+---+-----------------+
4 |miguel| 80| true|
5 | liz| 10| false|
6 +------+---+-----------------+

withIsSeniorCitizen assumes that the DataFrame has an age column with the IntegerType. In
this case, the withIsSeniorCitizen transformation’s assumption was correct and the code worked
perfectly ;)

A Custom Transformation Making a Bad Assumption


Let’s use the following withFullName transformation to illustrate how making incorrect assump-
tions yields bad error messages.

1 def withFullName()(df: DataFrame): DataFrame = {


2 df.withColumn(
3 "full_name",
4 concat_ws(" ", col("first_name"), col("last_name"))
5 )
6 }

Suppose we have the following animalDF.

1 +---+
2 |pet|
3 +---+
4 |cat|
5 |dog|
6 +---+

Let’s run the withFullName transformation.

1 animalDF.transform(withFullName())

The code will error out with this message.


org.apache.spark.sql.AnalysisException: cannot resolve ‘first_name’ given input columns: [pet]
The withFullName transformation assumes the DataFrame has first_name and last_name columns.
The assumption isn’t met, so the code errors out.
The default error message isn’t terrible, but it’s not complete. We would like to have an error message
that specifies both the first_name and last_name columns are required to run the withFullName
transformation.

Column Presence Validation


Let’s use the spark-daria DataFrameValidator to specify the column assumptions within the
withFullName transformation.

1 import com.github.mrpowers.spark.daria.sql.DataFrameValidator
2
3 def withFullName()(df: DataFrame): DataFrame = {
4 validatePresenceOfColumns(df, Seq("first_name", "last_name"))
5 df.withColumn(
6 "full_name",
7 concat_ws(" ", col("first_name"), col("last_name"))
8 )
9 }

Let’s run the code again.

1 animalDF.transform(withFullName())

This is the new error message.


com.github.mrpowers.spark.daria.sql.MissingDataFrameColumnsException: The [first_name, last_name] columns are not included in the DataFrame with the following columns [pet]

validatePresenceOfColumns makes the withFullName transformation better in two important ways:

1. withFullName will be easier to maintain and use because the transformation requirements are explicitly documented in the code.
2. When the withFullName assumptions aren't met, the error message is more descriptive.
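Conceptually, the validation is just a diff between the required column names and df.columns. Here's a sketch of the idea - requireColumns is a hypothetical helper, not the actual spark-daria implementation:

import org.apache.spark.sql.DataFrame

def requireColumns(df: DataFrame, requiredColNames: Seq[String]): Unit = {
  val missingColNames = requiredColNames.diff(df.columns.toSeq)
  if (missingColNames.nonEmpty) {
    throw new IllegalArgumentException(
      s"The [${missingColNames.mkString(", ")}] columns are not included in the " +
        s"DataFrame with the following columns [${df.columns.mkString(", ")}]"
    )
  }
}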

Full Schema Validation


We can also use the spark-daria DataFrameValidator to validate the presence of StructFields in
DataFrames (i.e. validate the presence of the name, data type, and nullable property for each column
that’s required).
Let’s look at a withSum transformation that adds the num1 and num2 columns in a DataFrame.

1 def withSum()(df: DataFrame): DataFrame = {


2 df.withColumn(
3 "sum",
4 col("num1") + col("num2")
5 )
6 }

When the num1 and num2 columns contain numerical data, the withSum transformation works as
expected.

1 val numsDF = Seq(


2 (1, 3),
3 (7, 8)
4 ).toDF("num1", "num2")

1 numsDF.transform(withSum()).show()
2
3 +----+----+---+
4 |num1|num2|sum|
5 +----+----+---+
6 | 1| 3| 4|
7 | 7| 8| 15|
8 +----+----+---+

withSum doesn’t work well when the num1 and num2 columns contain strings.

1 val wordsDF = Seq(


2 ("one", "three"),
3 ("seven", "eight")
4 ).toDF("num1", "num2")

1 wordsDF.transform(withSum()).show()
2 +-----+-----+----+
3 | num1| num2| sum|
4 +-----+-----+----+
5 | one|three|null|
6 |seven|eight|null|
7 +-----+-----+----+

withSum should error out if the num1 and num2 columns aren’t numeric. Let’s refactor the function
to error out with a descriptive error message.

1 def withSum()(df: DataFrame): DataFrame = {


2 val requiredSchema = StructType(
3 List(
4 StructField("num1", IntegerType, true),
5 StructField("num2", IntegerType, true)
6 )
7 )
8 validateSchema(df, requiredSchema)
9 df.withColumn(
10 "sum",
11 col("num1") + col("num2")
12 )
13 }

Let’s run the code again.

1 wordsDF.transform(withSum()).show()

Now we get a more descriptive error message.


com.github.mrpowers.spark.daria.sql.InvalidDataFrameSchemaException: The [StructField(num1,IntegerType,true), StructField(num2,IntegerType,true)] StructFields are not included in the DataFrame with the following StructFields [StructType(StructField(num1,StringType,true), StructField(num2,StringType,true))]
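The same idea extends from column names to full StructFields. Since df.schema is itself a StructType (effectively a Seq[StructField]), a field-level check boils down to a contains test that compares the name, data type, and nullable flag together. A sketch using the wordsDF from above:

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val requiredSchema = StructType(
  List(
    StructField("num1", IntegerType, true),
    StructField("num2", IntegerType, true)
  )
)

// Both fields are "missing" here because wordsDF stores num1 and num2 as strings
val missingFields = requiredSchema.fields.filterNot(field => wordsDF.schema.contains(field))

This is roughly the comparison a schema validation needs to make before building a descriptive error message.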

Documenting DataFrame Assumptions is Especially Important for Chained DataFrame Transformations
Production applications will define several standalone transformations and chain them together for
the final result.

1 val resultDF = df
2 .transform(myFirstTransform()) // one set of assumptions
3 .transform(mySecondTransform()) // more assumptions
4 .transform(myThirdTransform()) // even more assumptions

Debugging order dependent transformations, each with a different set of assumptions, is a nightmare! Don't torture yourself!

Conclusion
DataFrame schema assumptions should be explicitly documented in the code with validations.
Code that doesn’t make assumptions is easier to read, better to maintain, and returns more
descriptive error message.
spark-daria contains the DataFrame validation functions you’ll need in your projects. Follow these
setup instructions and write DataFrame transformations like this:

1 import com.github.mrpowers.spark.daria.sql.DataFrameValidator
2
3 object MyTransformations extends DataFrameValidator {
4
5 def withStandardizedPersonInfo(df: DataFrame): DataFrame = {
6 val requiredColNames = Seq("name", "age")
7 validatePresenceOfColumns(df, requiredColNames)
8 // some transformation code
9 }
10
11 }

Applications with proper DataFrame schema validations are significantly easier to debug, especially
when complex transformations are chained.

You might also like