
1

Using pySpark with Google Colab
& Spark 3.0 preview
Milan, 11 Dec 2019
Speaker: Mario Cartia

2

Using pySpark with Google Colab & Spark 3.0 preview

3

Using pySpark with Google Colab & Spark 3.0 preview

4

Apache Spark is a unified computing engine and a set of
libraries for parallel data processing on computer clusters
As of the time of this writing, Spark is the most actively
developed open source engine for this task, making it the
de facto tool for any developer or data scientist interested
in big data
Spark supports multiple widely used programming
languages including Python, Java, Scala and R
Introducing Apache Spark
Using pySpark with Google Colab & Spark 3.0 preview

5

Introducing Apache Spark
Using pySpark with Google Colab & Spark 3.0 preview
Spark includes libraries for diverse tasks ranging from SQL to
streaming and machine learning, and runs anywhere from a
laptop to a cluster of thousands of servers

6

Introducing Apache Spark
Using pySpark with Google Colab & Spark 3.0 preview
Unified
Spark’s key driving goal is to offer a unified platform for writing big
data applications
What do we mean by unified? Spark is designed to support a wide
range of data analytics tasks, ranging from simple data loading and
SQL queries to machine learning and streaming computation, over
the same computing engine and with a consistent set of APIs
Spark’s APIs are also designed to enable high performance by
optimizing across the different libraries and functions composed
together in a user program

7

Introducing Apache Spark
Using pySpark with Google Colab & Spark 3.0 preview
For example, if you load data using a SQL query and then
evaluate a machine learning model over it using Spark’s
ML library, the engine can combine these steps into one
scan over the data
The combination of general APIs and high-performance
execution no matter how you combine them makes Spark
a powerful platform for interactive and production
applications

8

Introducing Apache Spark
Using pySpark with Google Colab & Spark 3.0 preview
At the same time that Spark strives for unification, Spark
carefully limits its scope to a computing engine
By this, we mean that Spark only handles loading data from
storage systems and performing computation on it, not
permanent storage as the end itself
Spark can be used with a wide variety of persistent storage
systems, including cloud storage systems such as Azure
Storage and Amazon S3, distributed file systems such as HDFS,
key-value stores such as Apache Cassandra, and message
buses such as Apache Kafka

9

Introducing Apache Spark
Using pySpark with Google Colab & Spark 3.0 preview
Spark’s focus on computation makes it different from
earlier big data software platforms such as Apache
Hadoop
Hadoop included both a storage system (the Hadoop file
system, designed for low-cost storage over clusters of
commodity servers) and a computing system (MapReduce),
which were closely integrated together
While Spark runs well on Hadoop, it is also possible to use it
in the context of different architectures

10

Introducing Apache Spark
Using pySpark with Google Colab & Spark 3.0 preview
Libraries
Spark’s final component is its libraries, which build on its design as
a unified engine to provide a unified API for common data analysis
tasks
Spark includes libraries for SQL and structured data (Spark SQL),
machine learning (MLlib), stream processing (Spark Streaming and
the newer Structured Streaming), and graph analytics (GraphX).
Beyond these libraries, there are hundreds of open source external
libraries ranging from connectors for various storage systems to
machine learning algorithms

11

Spark’s Architecture
Using pySpark with Google Colab & Spark 3.0 preview
Typically when you think of a “computer” you think about one
machine sitting on your desk at home or at work
This machine works perfectly well for watching movies or
working with spreadsheet software. However, as many users
likely experience at some point, there are some things that
your computer is not powerful enough to perform. One
particularly challenging area is data processing
Single machines do not have enough power (CPU) and
resources (RAM) to perform computations on huge amounts
of information

12

Spark’s Architecture
Using pySpark with Google Colab & Spark 3.0 preview
A cluster, or group of machines, pools the resources of many
machines together allowing us to use all the cumulative resources
as if they were one
A group of machines alone is not powerful enough; you need a
framework to coordinate work across them. Spark is a tool for just
that: managing and coordinating the execution of tasks on data
across a cluster of computers

13

Spark’s Architecture
Using pySpark with Google Colab & Spark 3.0 preview
The cluster of machines that Spark
will leverage to execute tasks will
be managed by a cluster manager
like YARN, Mesos, Kubernetes or
Spark’s Standalone cluster
manager
We then submit Spark
Applications to these cluster
managers which will grant
resources to our application so that
we can complete our work

14

Spark Applications
Using pySpark with Google Colab & Spark 3.0 preview
Spark Applications consist of a driver process and a set
of executor processes
The driver process runs your main() function, sits on a
node in the cluster, and is responsible for three things:
1. maintaining information about the Spark Application;
2. responding to a user’s program or input;
3. analyzing, distributing, and scheduling work across
the executors (defined momentarily)

15

Spark Applications
Using pySpark with Google Colab & Spark 3.0 preview
The executors are responsible for actually executing the
work (tasks) that the driver assigns them
This means, each executor is responsible for only two
things:
1. executing code assigned to it by the driver
2. reporting the state of the computation, on that
executor, back to the driver node

16

Spark Applications
Using pySpark with Google Colab & Spark 3.0 preview
The cluster manager (YARN,
Mesos, K8s or Spark itself)
controls physical machines
and allocates resources to
Spark Applications
This means that there can be
multiple Spark Applications
running on a cluster at the
same time

17

Note
Using pySpark with Google Colab & Spark 3.0 preview
Spark, in addition to its cluster mode, also has a local
mode
The driver and executors are simply processes, which means
that they can live on the same machine or on different
machines
In local mode, these both run (as threads) on your
individual computer instead of a cluster

18

Spark’s Language APIs
Using pySpark with Google Colab & Spark 3.0 preview
Spark is written in Scala but language APIs allow you to
run Spark code from other languages
For the most part, Spark presents some core “concepts”
in every language and these concepts are translated into
Spark code that runs on the cluster of machines
When using Spark from Python (or R), the user never
writes explicit JVM instructions, but instead writes
Python (or R) code that Spark will translate into code that
Spark can then run on the executor JVMs

19

Spark’s Language APIs
Using pySpark with Google Colab & Spark 3.0 preview

20

Spark’s API
Using pySpark with Google Colab & Spark 3.0 preview
While Spark is available from a variety of languages,
what Spark makes available in those languages is worth
mentioning
Spark has two fundamental sets of APIs:
● The low level “Unstructured” APIs
● The higher level Structured APIs
We’ll discuss both!

21

RDD API
Using pySpark with Google Colab & Spark 3.0 preview
The main abstraction Spark provides is a resilient distributed
dataset (RDD), which is a collection of elements partitioned
across the nodes of the cluster that can be operated on in
parallel
RDDs are created by starting with data in a supported distributed
storage system, or with an existing collection in the driver program,
and then transforming it (or parallelizing the collection)
Users may also ask Spark to persist an RDD in memory, allowing
it to be reused efficiently across parallel operations. Finally, RDDs
automatically recover from node failures

22

RDD API
Using pySpark with Google Colab & Spark 3.0 preview
RDDs support two types of operations: transformations, which create a
new dataset from an existing one, and actions, which return a value to
the driver program after running a computation
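A minimal sketch of the two kinds of operations, assuming an existing
SparkContext sc and a hypothetical text file:
lines = sc.textFile("/path/to/data.txt")        # base RDD
lengths = lines.map(lambda line: len(line))     # transformation: builds a new RDD
total = lengths.reduce(lambda a, b: a + b)      # action: runs the computation and
print(total)                                    # returns a value to the driver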

23

RDD API
Using pySpark with Google Colab & Spark 3.0 preview
There are two ways to create RDDs:
● parallelizing an existing collection in your driver program
● referencing a dataset in an external storage system, such
as a shared filesystem, HDFS, HBase, or any data source
offering a Hadoop InputFormat
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
...
rdd = sc.textFile("/path/to/data.txt")

24

Spark’s API
Using pySpark with Google Colab & Spark 3.0 preview
All transformations in Spark are lazy, in that they do not
compute their results right away. Instead, they just
remember the transformations applied to some base
dataset (e.g. a file)
The transformations are only computed when an action
requires a result to be returned to the driver program
This design enables Spark to run very efficiently!
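A small illustration of laziness, assuming an existing SparkContext sc and a
hypothetical log file: the filter below is only a recipe; Spark does no work
until the first() action is called, and then only scans as much data as needed
lines = sc.textFile("/path/to/log.txt")
errors = lines.filter(lambda l: "ERROR" in l)   # lazy: nothing is read yet
print(errors.first())                           # action: triggers the (partial) scan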

25

Spark’s API
Using pySpark with Google Colab & Spark 3.0 preview
By default, each transformed RDD may be recomputed
each time you run an action on it
However, you may also persist an RDD in memory using
the persist (or cache) method, in which case Spark will
keep the elements around on the cluster for much faster
access the next time you query it
There is also support for persisting RDDs on disk, or
replicated across multiple nodes

26

RDD API
Using pySpark with Google Colab & Spark 3.0 preview
Suppose we read a file in Spark: its entire content is
partitioned into multiple smaller chunks. Any RDD, for that matter, is
partitioned by Spark into multiple partitions
When we apply a transformation to an RDD, the transformation is
applied to each of its partitions. Spark spawns a single task per
partition, which runs inside an executor JVM
Each stage contains as many tasks as there are partitions of the RDD
and performs the transformations (map, filter, etc.) pipelined in that stage
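A quick way to inspect and control partitioning, assuming an existing
SparkContext sc (the file path and partition counts are arbitrary):
rdd = sc.textFile("/path/to/data.txt", minPartitions=8)
print(rdd.getNumPartitions())       # number of partitions = number of tasks per stage
rdd16 = rdd.repartition(16)         # change the partition count (causes a shuffle)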

27

RDD API
Using pySpark with Google Colab & Spark 3.0 preview

28

RDD API
Using pySpark with Google Colab & Spark 3.0 preview
https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations

29

RDD API
Using pySpark with Google Colab & Spark 3.0 preview
https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions

30

Shuffle operations
Using pySpark with Google Colab & Spark 3.0 preview
Certain operations (wide transformations) within Spark
trigger an event known as the shuffle
The shuffle is Spark’s mechanism for re-distributing data
so that it’s grouped differently across partitions
This typically involves moving data across executors and
machines (network), making the shuffle a complex and
costly operation
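For example, reduceByKey is a wide transformation: Spark must bring all values
for the same key to the same partition, which triggers a shuffle (sketch assumes
an existing SparkContext sc):
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
counts = pairs.reduceByKey(lambda a, b: a + b)    # wide transformation: shuffle
print(counts.collect())                           # e.g. [('a', 2), ('b', 1)]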

31

Shuffle operations
Using pySpark with Google Colab & Spark 3.0 preview

32

Shuffle operations
Using pySpark with Google Colab & Spark 3.0 preview

33

RDD Persistence
Using pySpark with Google Colab & Spark 3.0 preview
One of the most important capabilities in Spark is persisting (or
caching) a dataset in memory across operations.
When you persist an RDD, each node stores any partitions of it that
it computes in memory and reuses them in other actions on that
dataset (or datasets derived from it).
This allows future actions to be much faster (often by more than
10x). Caching is a key tool for iterative algorithms and fast interactive
use.

34

RDD Persistence
Using pySpark with Google Colab & Spark 3.0 preview
You can mark an RDD to be persisted using the persist() or
cache() methods on it.
The first time it is computed in an action, it will be kept in
memory on the nodes.
Spark’s cache is fault-tolerant – if any partition of an RDD is
lost, it will automatically be recomputed using the
transformations that originally created it.
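A minimal persistence sketch, assuming an existing SparkContext sc and a
hypothetical file path:
from pyspark import StorageLevel
rdd = sc.textFile("/path/to/data.txt").map(lambda line: line.split(","))
rdd.persist(StorageLevel.MEMORY_ONLY)   # or simply rdd.cache()
rdd.count()   # first action: computes the partitions and caches them in memory
rdd.count()   # second action: served from the cache, no recomputation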

35

Shared Variables
Using pySpark with Google Colab & Spark 3.0 preview
Normally, when a function passed to a Spark operation (such as
map or reduce) is executed on a remote cluster node, it works on
separate copies of all the variables used in the function.
These variables are copied to each machine, and no updates to
the variables on the remote machine are propagated back to the
driver program. Spark does provide two limited types of shared
variables for two common usage patterns: broadcast variables
and accumulators.

36

Broadcast Variables
Using pySpark with Google Colab & Spark 3.0 preview
Spark actions are executed through a set of stages, separated by
distributed “shuffle” operations. Spark automatically broadcasts the
common data needed by tasks within each stage.
The data broadcasted this way is cached in serialized form and
deserialized before running each task. This means that explicitly
creating broadcast variables is only useful when tasks across
multiple stages need the same data or when caching the data in
deserialized form is important.

37

Broadcast Variables
Using pySpark with Google Colab & Spark 3.0 preview
Broadcast variables are created from a variable v by calling
SparkContext.broadcast(v). The broadcast variable is a wrapper
around v, and its value can be accessed by calling the value
method. The code below shows this:
>>> broadcastVar = sc.broadcast([1, 2, 3])
<pyspark.broadcast.Broadcast object at 0x102789f10>
>>> broadcastVar.value
[1, 2, 3]

38

Broadcast Variables
Using pySpark with Google Colab & Spark 3.0 preview
After the broadcast variable is created, it should be used
instead of the value v in any functions run on the cluster so
that v is not shipped to the nodes more than once.
In addition, the object v should not be modified after it is
broadcast in order to ensure that all nodes get the same
value of the broadcast variable (e.g. if the variable is
shipped to a new node later).
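A small usage sketch, assuming an existing SparkContext sc and a hypothetical
lookup table:
lookup = sc.broadcast({"IT": "Italy", "FR": "France"})
codes = sc.parallelize(["IT", "FR", "IT"])
names = codes.map(lambda c: lookup.value.get(c, "unknown"))  # use .value, not the dict
print(names.collect())   # ['Italy', 'France', 'Italy']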

39

Accumulators
Using pySpark with Google Colab & Spark 3.0 preview
Accumulators are variables that are only “added” to
through an associative and commutative operation and
can therefore be efficiently supported in parallel.
They can be used to implement counters (as in
MapReduce) or sums.
Spark natively supports accumulators of numeric types,
and programmers can add support for new types.

40

Accumulators
Using pySpark with Google Colab & Spark 3.0 preview
An accumulator is created by calling SparkContext.accumulator(v).
Tasks running on a cluster can then add to it using the add method or
the += operator. Only the driver program can read the accumulator’s
value, using its value method.
>>> accum = sc.accumulator(0)
>>> accum
Accumulator<id=0, value=0>
>>> sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s
>>> accum.value
10

41

Accumulators
Using pySpark with Google Colab & Spark 3.0 preview
A named accumulator will display in the web UI for the stage that modifies
that accumulator. Spark displays the value for each accumulator modified
by a task in the “Tasks” table (N.B. not supported in Python)

42

SQL API
Using pySpark with Google Colab & Spark 3.0 preview
A DataFrame is the most common Structured API and simply
represents a table of data with rows and columns
The list of columns and the types in those columns is called the schema.
A simple analogy would be a spreadsheet with named columns
The fundamental difference is that while a spreadsheet sits on
one computer in one specific location, a Spark DataFrame can
span thousands of computers

43

SQL API
Using pySpark with Google Colab & Spark 3.0 preview

44

SQL API
Using pySpark with Google Colab & Spark 3.0 preview
A DataFrame is the most common Structured API and simply
represents a table of data with rows and columns
The list of columns and the types in those columns is called the schema.
A simple analogy would be a spreadsheet with named columns
The fundamental difference is that while a spreadsheet sits on
one computer in one specific location, a Spark DataFrame can
span thousands of computers

45

SQL API
Using pySpark with Google Colab & Spark 3.0 preview
DataFrames can be constructed from a wide array of sources such
as: structured data files, tables in Hive, external databases, or
existing RDDs
A DataFrame can be operated on using relational transformations
and can also be used to create a temporary view
Registering a DataFrame as a temporary view allows you to run SQL
queries over its data

46

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("HelloSpark").getOrCreate()
sc = spark.sparkContext
df = spark.read.format("csv").load("/path/to/people.csv", header='true',
inferSchema='true')
df.createOrReplaceTempView("people")
spark.sql("SELECT person_ID, first, last, email FROMpeople").limit(10).show()
+---------+-------+-------+-----------------+
|person_ID| first| last| email|
+---------+-------+-------+-----------------+
| 3130|Rosella| Burks| BurksR@univ.edu|
| 3297| Damien| Avila| AvilaD@univ.edu|
| 3547| Robin| Olsen| OlsenR@univ.edu|
Spark SQL
Using pySpark with Google Colab & Spark 3.0 preview

47

Data Sources
Using pySpark with Google Colab & Spark 3.0 preview
Let’s look at the general methods for loading and saving data using the
Spark Data Sources API, and then at the specific options available for the
built-in data sources
In its simplest form, Parquet (unless otherwise configured via
spark.sql.sources.default) is used for all operations
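A generic load/save sketch, assuming an existing SparkSession spark and
hypothetical file paths and column names:
df = spark.read.load("/path/to/users.parquet")          # default source: parquet
df.select("name").write.save("/path/to/names.parquet")
csv_df = spark.read.format("csv").option("header", "true").load("/path/to/people.csv")
csv_df.write.format("json").mode("overwrite").save("/path/to/people_json")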

48

Spark 3 Preview Highlights
Using pySpark with Google Colab & Spark 3.0 preview
Spark Graph: Cypher Script & Property Graph
The popular graph query language Cypher has been
added to Spark 3.0, coupled with the Property
Graph model (a directed multigraph). Graph
queries will follow similar principles to Spark SQL, with
their own Catalyst support providing full support for
DataFrames.

49

Spark 3 Preview Highlights
Using pySpark with Google Colab & Spark 3.0 preview
Python 3, Scala 2.12 and JDK 11
● Spark 3.0 is expected to move completely to Python 3
● Scala version upgraded to 2.12
● It will fully support JDK 11

50

Spark 3 Preview Highlights
Using pySpark with Google Colab & Spark 3.0 preview
Deep Learning: Adds GPU Support
It supports heterogeneous GPUs such as AMD, Intel, and
Nvidia. For Kubernetes, it offers GPU isolation at the
executor pod level. In addition to this we get (see the
configuration sketch below):
● GPU acceleration for Pandas UDFs
● You can specify the number of GPUs in your RDD
operations
● To make it easy to set up a Deep Learning environment,
there is YARN and Docker support for launching Spark 3.0
with GPUs
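A hedged configuration sketch of accelerator-aware scheduling in Spark 3.0; the
property names follow the Spark 3.0 resource scheduling documentation, while the
discovery script path is hypothetical:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("GpuJob")
         .config("spark.executor.resource.gpu.amount", "1")   # GPUs per executor
         .config("spark.task.resource.gpu.amount", "1")       # GPUs per task
         .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
         .getOrCreate())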

51

Spark 3 Preview Highlights
Using pySpark with Google Colab & Spark 3.0 preview
Binary Files
You can use binary files as the data source of your Spark
DataFrame; however, as of now, write operations for the binary
format are not allowed. We can expect them in future
releases
val df = spark.read.format("binaryFile").load("/path/to/dir")
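For reference, a pySpark equivalent using the built-in "binaryFile" source
(the directory path is hypothetical):
df = spark.read.format("binaryFile").load("/path/to/images")
df.select("path", "length").show()   # also exposes modificationTime and content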

52

Spark 3 Preview Highlights
Using pySpark with Google Colab & Spark 3.0 preview
Koalas: Spark scale to Pandas
Koalas is a pandas API on Apache Spark which makes
data engineers and scientists more productive when
interacting with big data
With the 3.0 feature release, Koalas can now scale to the
distributed environment without the need to read a Spark 3.0
DataFrame separately, in contrast to the single-node
environment previously
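A minimal Koalas sketch, assuming the separate koalas package is installed
(e.g. pip install koalas); the sample data is made up:
import pandas as pd
import databricks.koalas as ks
pdf = pd.DataFrame({"code": ["IT", "FR"], "value": [1, 2]})
kdf = ks.from_pandas(pdf)      # distributed Koalas DataFrame backed by Spark
print(kdf["value"].sum())      # pandas-like syntax, executed on the cluster
sdf = kdf.to_spark()           # convert back to a regular Spark DataFrame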

53

Spark 3 Preview Highlights
Using pySpark with Google Colab & Spark 3.0 preview
Koalas: Spark scale to Pandas

54

Spark 3 Preview Highlights
Using pySpark with Google Colab & Spark 3.0 preview
Kafka Streaming: includeHeaders
You can now read Kafka record headers in streaming queries by
enabling the includeHeaders option
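A hedged sketch of a streaming read with headers enabled (broker address and
topic are hypothetical):
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "events")
      .option("includeHeaders", "true")
      .load())
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")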

55

Spark 3 Preview Highlights
Using pySpark with Google Colab & Spark 3.0 preview
Dynamic Partition Pruning
It offers optimized execution at runtime by reusing the
dimension table broadcast results in hash joins. This helps
Spark 3.0 work more efficiently, specifically with queries based
on a star schema, thereby removing the need to ETL the
denormalized tables
Databricks’ session on Dynamic Partition Pruning in Apache
Spark:
https://databricks.com/session_eu19/dynamic-partition-pruning-in-apache-spark
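A hedged sketch of a star-schema query that can benefit from dynamic partition
pruning; the feature is controlled by the configuration key shown below, and the
table and column names are hypothetical:
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
spark.sql("""
  SELECT f.*
  FROM fact_sales f
  JOIN dim_date d ON f.date_id = d.date_id
  WHERE d.year = 2019
""").explain()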

56

Now
Get Your Hands Dirty!

57

Thank you for your attention!
Follow me
