
1

Using pySpark with Google Colab
& Spark 3.0 preview
Milan, 11 Dec 2019
Speaker: Mario Cartia

2

Using pySpark with Google Colab & Spark 3.0 preview

3

Using pySpark with Google Colab & Spark 3.0 preview

4

Apache Spark is a unified computing engine and a set of
libraries for parallel data processing on computer clusters
As of the time of this writing, Spark is the most actively
developed open source engine for this task, making it the
de facto tool for any developer or data scientist interested
in big data
Spark supports multiple widely used programming
languages including Python, Java, Scala and R
Introducing Apache Spark
Using pySpark with Google Colab & Spark 3.0 preview

5

Introducing Apache Spark
Using pySpark with Google Colab & Spark 3.0 preview
Spark includes libraries for diverse tasks ranging from SQL to
streaming and machine learning, and runs anywhere from a
laptop to a cluster of thousands of servers

6

Introducing Apache Spark
Using pySpark with Google Colab & Spark 3.0 preview
Unified
Spark’s key driving goal is to offer a unified platform for writing big
data applications
What do we mean by unified? Spark is designed to support a wide
range of data analytics tasks, ranging from simple data loading and
SQL queries to machine learning and streaming computation, over
the same computing engine and with a consistent set of APIs
Spark’s APIs are also designed to enable high performance by
optimizing across the different libraries and functions composed
together in a user program

7

Introducing Apache Spark
Using pySpark with Google Colab & Spark 3.0 preview
For example, if you load data using a SQL query and then
evaluate a machine learning model over it using Spark’s
ML library, the engine can combine these steps into one
scan over the data
The combination of general APIs and high-performance
execution no matter how you combine them makes Spark
a powerful platform for interactive and production
applications

8

Introducing Apache Spark
Using pySpark with Google Colab & Spark 3.0 preview
At the same time that Spark strives for unification, Spark
carefully limits its scope to a computing engine
By this, we mean that Spark only handles loading data from
storage systems and performing computation on it, not
permanent storage as the end itself
Spark can be used with a wide variety of persistent storage
systems, including cloud storage systems such as Azure
Storage and Amazon S3, distributed file systems such as HDFS,
key-value stores such as Apache Cassandra, and message
buses such as Apache Kafka

9

Introducing Apache Spark
Using pySpark with Google Colab & Spark 3.0 preview
Spark’s focus on computation makes it different from
earlier big data software platforms such as Apache
Hadoop
Hadoop included both a storage system (the Hadoop file
system, designed for low-cost storage over clusters of
commodity servers) and a computing system (MapReduce),
which were closely integrated together
While Spark runs well on Hadoop, it is also possible to use it
in the context of different architectures

10

Introducing Apache Spark
Using pySpark with Google Colab & Spark 3.0 preview
Libraries
Spark’s final component is its libraries, which build on its design as
a unified engine to provide a unified API for common data analysis
tasks
Spark includes libraries for SQL and structured data (Spark SQL),
machine learning (MLlib), stream processing (Spark Streaming and
the newer Structured Streaming), and graph analytics (GraphX).
Beyond these libraries, there are hundreds of open source external
libraries ranging from connectors for various storage systems to
machine learning algorithms

11

Spark’s Architecture
Using pySpark with Google Colab & Spark 3.0 preview
Typically when you think of a “computer” you think about one
machine sitting on your desk at home or at work
This machine works perfectly well for watching movies or
working with spreadsheet software. However, as many users
likely experience at some point, there are some things that
your computer is not powerful enough to perform. One
particularly challenging area is data processing
Single machines do not have enough power (CPU) and
resources (RAM) to perform computations on huge amounts
of information

12

Spark’s Architecture
Using pySpark with Google Colab & Spark 3.0 preview
A cluster, or group of machines, pools the resources of many
machines together allowing us to use all the cumulative resources
as if they were one
A group of machines alone is not powerful enough; you need a
framework to coordinate work across them. Spark is a tool for just
that: managing and coordinating the execution of tasks on data
across a cluster of computers

13

Spark’s Architecture
Using pySpark with Google Colab & Spark 3.0 preview
The cluster of machines that Spark
will leverage to execute tasks will
be managed by a cluster manager
like YARN, Mesos, Kubernetes or
Spark’s Standalone cluster
manager
We then submit Spark
Applications to these cluster
managers which will grant
resources to our application so that
we can complete our work

14

Spark Applications
Using pySpark with Google Colab & Spark 3.0 preview
Spark Applications consist of a driver process and a set
of executor processes
The driver process runs your main() function, sits on a
node in the cluster, and is responsible for three things:
1. maintaining information about the Spark Application;
2. responding to a user’s program or input;
3. analyzing, distributing, and scheduling work across
the executors (defined momentarily)

15

Spark Applications
Using pySpark with Google Colab & Spark 3.0 preview
The executors are responsible for actually executing the
work (tasks) that the driver assigns them
This means, each executor is responsible for only two
things:
1. executing code assigned to it by the driver
2. reporting the state of the computation, on that
executor, back to the driver node

16

Spark Applications
Using pySpark with Google Colab & Spark 3.0 preview
The cluster manager (YARN,
Mesos, K8s or Spark itself)
controls physical machines
and allocates resources to
Spark Applications
This means that there can be
multiple Spark Applications
running on a cluster at the
same time

17

Note
Using pySpark with Google Colab & Spark 3.0 preview
Spark, in addition to its cluster mode, also has a local
mode
The driver and executors are simply processes, which means
that they can live on the same machine or on different
machines
In local mode, these both run (as threads) on your
individual computer instead of a cluster

18

Spark’s Language APIs
Using pySpark with Google Colab & Spark 3.0 preview
Spark is written in Scala but language APIs allow you to
run Spark code from other languages
For the most part, Spark presents some core “concepts”
in every language and these concepts are translated into
Spark code that runs on the cluster of machines
When using Spark from Python (or R), the user never
writes explicit JVM instructions, but instead writes
Python (or R) code that Spark will translate into code that
Spark can then run on the executor JVMs

19

Spark’s Language APIs
Using pySpark with Google Colab & Spark 3.0 preview

20

Spark’s API
Using pySpark with Google Colab & Spark 3.0 preview
While Spark is available from a variety of languages,
what Spark makes available in those languages is worth
mentioning
Spark has two fundamental sets of APIs:
● The low level “Unstructured” APIs
● The higher level Structured APIs
We’ll discuss both!

21

RDD API
Using pySpark with Google Colab & Spark 3.0 preview
The main abstraction Spark provides is a resilient distributed
dataset (RDD), which is a collection of elements partitioned
across the nodes of the cluster that can be operated on in
parallel
RDDs are created by starting with data in a supported distributed
storage system, or with an existing collection in the driver program,
and then transforming it (or parallelizing the collection)
Users may also ask Spark to persist an RDD in memory, allowing
it to be reused efficiently across parallel operations. Finally, RDDs
automatically recover from node failures

22

RDD API
Using pySpark with Google Colab & Spark 3.0 preview
RDDs support two types of operations: transformations, which create a
new dataset from an existing one, and actions, which return a value to
the driver program after running a computation
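A minimal sketch of the two kinds of operations, assuming an existing
SparkContext sc and a hypothetical text file:
lines = sc.textFile("/path/to/data.txt")        # base RDD
lengths = lines.map(lambda line: len(line))     # transformation: builds a new RDD
total = lengths.reduce(lambda a, b: a + b)      # action: runs the computation and
print(total)                                    # returns a value to the driver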

23

RDD API
Using pySpark with Google Colab & Spark 3.0 preview
There are two ways to create RDDs:
● parallelizing an existing collection in your driver program
● referencing a dataset in an external storage system, such
as a shared filesystem, HDFS, HBase, or any data source
offering a Hadoop InputFormat
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
...
rdd = sc.textFile("/path/to/data.txt")

24

Spark’s API
Using pySpark with Google Colab & Spark 3.0 preview
All transformations in Spark are lazy, in that they do not
compute their results right away. Instead, they just
remember the transformations applied to some base
dataset (e.g. a file)
The transformations are only computed when an action
requires a result to be returned to the driver program
This design enables Spark to run very efficiently!
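A small illustration of laziness, assuming an existing SparkContext sc and a
hypothetical log file: the filter below is only a recipe; Spark does no work
until the first() action is called, and then only scans as much data as needed
lines = sc.textFile("/path/to/log.txt")
errors = lines.filter(lambda l: "ERROR" in l)   # lazy: nothing is read yet
print(errors.first())                           # action: triggers the (partial) scan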

25

Spark’s API
Using pySpark with Google Colab & Spark 3.0 preview
By default, each transformed RDD may be recomputed
each time you run an action on it
However, you may also persist an RDD in memory using
the persist (or cache) method, in which case Spark will
keep the elements around on the cluster for much faster
access the next time you query it
There is also support for persisting RDDs on disk, or
replicated across multiple nodes

26

RDD API
Using pySpark with Google Colab & Spark 3.0 preview
Suppose we read a file in Spark: its entire content is
partitioned into multiple smaller chunks. Any RDD, for that matter, is
partitioned by Spark into multiple partitions
When we apply a transformation to an RDD, the transformation is
applied to each of its partitions. Spark spawns a single task per
partition, which runs inside an executor JVM
Each stage contains as many tasks as there are partitions of the RDD
and performs the transformations (map, filter, etc.) pipelined in that stage
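A quick way to inspect and control partitioning, assuming an existing
SparkContext sc (the file path and partition counts are arbitrary):
rdd = sc.textFile("/path/to/data.txt", minPartitions=8)
print(rdd.getNumPartitions())       # number of partitions = number of tasks per stage
rdd16 = rdd.repartition(16)         # change the partition count (causes a shuffle)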

27

RDD API
Using pySpark with Google Colab & Spark 3.0 preview

28

RDD API
Using pySpark with Google Colab & Spark 3.0 preview
https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations

29

RDD API
Using pySpark with Google Colab & Spark 3.0 preview
https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions

30

Shuffle operations
Using pySpark with Google Colab & Spark 3.0 preview
Certain operations (wide transformations) within Spark
trigger an event known as the shuffle
The shuffle is Spark’s mechanism for re-distributing data
so that it’s grouped differently across partitions
This typically involves moving data across executors and
machines (network), making the shuffle a complex and
costly operation
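For example, reduceByKey is a wide transformation: Spark must bring all values
for the same key to the same partition, which triggers a shuffle (sketch assumes
an existing SparkContext sc):
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
counts = pairs.reduceByKey(lambda a, b: a + b)    # wide transformation: shuffle
print(counts.collect())                           # e.g. [('a', 2), ('b', 1)]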

31

Shuffle operations
Using pySpark with Google Colab & Spark 3.0 preview

32

Shuffle operations
Using pySpark with Google Colab & Spark 3.0 preview

33

RDD Persistence
Using pySpark with Google Colab & Spark 3.0 preview
One of the most important capabilities in Spark is persisting (or
caching) a dataset in memory across operations.
When you persist an RDD, each node stores any partitions of it that
it computes in memory and reuses them in other actions on that
dataset (or datasets derived from it).
This allows future actions to be much faster (often by more than
10x). Caching is a key tool for iterative algorithms and fast interactive
use.

34

RDD Persistence
Using pySpark with Google Colab & Spark 3.0 preview
You can mark an RDD to be persisted using the persist() or
cache() methods on it.
The first time it is computed in an action, it will be kept in
memory on the nodes.
Spark’s cache is fault-tolerant – if any partition of an RDD is
lost, it will automatically be recomputed using the
transformations that originally created it.
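A minimal persistence sketch, assuming an existing SparkContext sc and a
hypothetical file path:
from pyspark import StorageLevel
rdd = sc.textFile("/path/to/data.txt").map(lambda line: line.split(","))
rdd.persist(StorageLevel.MEMORY_ONLY)   # or simply rdd.cache()
rdd.count()   # first action: computes the partitions and caches them in memory
rdd.count()   # second action: served from the cache, no recomputation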

35

Shared Variables
Using pySpark with Google Colab & Spark 3.0 preview
Normally, when a function passed to a Spark operation (such as
map or reduce) is executed on a remote cluster node, it works on
separate copies of all the variables used in the function.
These variables are copied to each machine, and no updates to
the variables on the remote machine are propagated back to the
driver program. Spark does provide two limited types of shared
variables for two common usage patterns: broadcast variables
and accumulators.

36

Broadcast Variables
Using pySpark with Google Colab & Spark 3.0 preview
Spark actions are executed through a set of stages, separated by
distributed “shuffle” operations. Spark automatically broadcasts the
common data needed by tasks within each stage.
The data broadcasted this way is cached in serialized form and
deserialized before running each task. This means that explicitly
creating broadcast variables is only useful when tasks across
multiple stages need the same data or when caching the data in
deserialized form is important.

37

Broadcast Variables
Using pySpark with Google Colab & Spark 3.0 preview
Broadcast variables are created from a variable v by calling
SparkContext.broadcast(v). The broadcast variable is a wrapper
around v, and its value can be accessed by calling the value
method. The code below shows this:
>>> broadcastVar = sc.broadcast([1, 2, 3])
<pyspark.broadcast.Broadcast object at 0x102789f10>
>>> broadcastVar.value
[1, 2, 3]

38

Broadcast Variables
Using pySpark with Google Colab & Spark 3.0 preview
After the broadcast variable is created, it should be used
instead of the value v in any functions run on the cluster so
that v is not shipped to the nodes more than once.
In addition, the object v should not be modified after it is
broadcast in order to ensure that all nodes get the same
value of the broadcast variable (e.g. if the variable is
shipped to a new node later).
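A small usage sketch, assuming an existing SparkContext sc and a hypothetical
lookup table:
lookup = sc.broadcast({"IT": "Italy", "FR": "France"})
codes = sc.parallelize(["IT", "FR", "IT"])
names = codes.map(lambda c: lookup.value.get(c, "unknown"))  # use .value, not the dict
print(names.collect())   # ['Italy', 'France', 'Italy']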

39

Accumulators
Using pySpark with Google Colab & Spark 3.0 preview
Accumulators are variables that are only “added” to
through an associative and commutative operation and
can therefore be efficiently supported in parallel.
They can be used to implement counters (as in
MapReduce) or sums.
Spark natively supports accumulators of numeric types,
and programmers can add support for new types.

40

Accumulators
Using pySpark with Google Colab & Spark 3.0 preview
An accumulator is created by calling SparkContext.accumulator(v).
Tasks running on a cluster can then add to it using the add method or
the += operator. Only the driver program can read the accumulator’s
value, using its value method.
>>> accum = sc.accumulator(0)
>>> accum
Accumulator<id=0, value=0>
>>> sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s
>>> accum.value
10

41

Accumulators
Using pySpark with Google Colab & Spark 3.0 preview
A named accumulator will display in the web UI for the stage that modifies
that accumulator. Spark displays the value for each accumulator modified
by a task in the “Tasks” table (N.B. not supported in Python)

42

SQL API
Using pySpark with Google Colab & Spark 3.0 preview
A DataFrame is the most common Structured API and simply
represents a table of data with rows and columns
The list of columns and the types in those columns is called the schema.
A simple analogy would be a spreadsheet with named columns
The fundamental difference is that while a spreadsheet sits on
one computer in one specific location, a Spark DataFrame can
span thousands of computers

43

SQL API
Using pySpark with Google Colab & Spark 3.0 preview

44

SQL API
Using pySpark with Google Colab & Spark 3.0 preview
A DataFrame is the most common Structured API and simply
represents a table of data with rows and columns
The list of columns and the types in those columns is called the schema.
A simple analogy would be a spreadsheet with named columns
The fundamental difference is that while a spreadsheet sits on
one computer in one specific location, a Spark DataFrame can
span thousands of computers

45

SQL API
Using pySpark with Google Colab & Spark 3.0 preview
DataFrames can be constructed from a wide array of sources such
as: structured data files, tables in Hive, external databases, or
existing RDDs
A DataFrame can be operated on using relational transformations
and can also be used to create a temporary view
Registering a DataFrame as a temporary view allows you to run SQL
queries over its data

46

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("HelloSpark").getOrCreate()
sc = spark.sparkContext
df = spark.read.format("csv").load("/path/to/people.csv", header='true',
inferSchema='true')
df.createOrReplaceTempView("people")
spark.sql("SELECT person_ID, first, last, email FROMpeople").limit(10).show()
+---------+-------+-------+-----------------+
|person_ID| first| last| email|
+---------+-------+-------+-----------------+
| 3130|Rosella| Burks| BurksR@univ.edu|
| 3297| Damien| Avila| AvilaD@univ.edu|
| 3547| Robin| Olsen| OlsenR@univ.edu|
Spark SQL
Using pySpark with Google Colab & Spark 3.0 preview

47

Data Sources
Using pySpark with Google Colab & Spark 3.0 preview
Let’s look at the general methods for loading and saving data using the
Spark Data Sources API, and then at the specific options available for the
built-in data sources
In its simplest form, Parquet (unless otherwise configured via
spark.sql.sources.default) is used for all operations
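A generic load/save sketch, assuming an existing SparkSession spark and
hypothetical file paths and column names:
df = spark.read.load("/path/to/users.parquet")          # default source: parquet
df.select("name").write.save("/path/to/names.parquet")
csv_df = spark.read.format("csv").option("header", "true").load("/path/to/people.csv")
csv_df.write.format("json").mode("overwrite").save("/path/to/people_json")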

48

Spark 3 Preview Highlights
Using pySpark with Google Colab & Spark 3.0 preview
Spark Graph: Cypher Script & Property Graph
The popular graph query language Cypher has been
added to Spark 3.0, coupled with the Property
Graph model (a directed multigraph). Graph
queries will follow similar principles to Spark SQL, with
their own Catalyst support providing full support for
DataFrames.

49

Spark 3 Preview Highlights
Using pySpark with Google Colab & Spark 3.0 preview
Python 3, Scala 2.12 and JDK 11
● Spark 3.0 is expected to move completely to Python 3
● Scala version upgraded to 2.12
● It will fully support JDK 11

50

Spark 3 Preview Highlights
Using pySpark with Google Colab & Spark 3.0 preview
Deep Learning: Adds GPU Support
It supports heterogeneous GPUs such as AMD, Intel, and
Nvidia. For Kubernetes, it offers GPU isolation at the
executor pod level. In addition to this we get (see the
configuration sketch below):
● GPU acceleration for Pandas UDFs
● You can specify the number of GPUs in your RDD
operations
● To make it easy to set up a Deep Learning environment,
there is YARN and Docker support for launching Spark 3.0
with GPUs
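A hedged configuration sketch of accelerator-aware scheduling in Spark 3.0; the
property names follow the Spark 3.0 resource scheduling documentation, while the
discovery script path is hypothetical:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("GpuJob")
         .config("spark.executor.resource.gpu.amount", "1")   # GPUs per executor
         .config("spark.task.resource.gpu.amount", "1")       # GPUs per task
         .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
         .getOrCreate())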

51

Spark 3 Preview Highlights
Using pySpark with Google Colab & Spark 3.0 preview
Binary Files
You can use binary files as the data source of your Spark
DataFrame; however, as of now, write operations for the binary
format are not allowed. We can expect them in future
releases
val df = spark.read.format("binaryFile").load("/path/to/dir")
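For reference, a pySpark equivalent using the built-in "binaryFile" source
(the directory path is hypothetical):
df = spark.read.format("binaryFile").load("/path/to/images")
df.select("path", "length").show()   # also exposes modificationTime and content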

52

Spark 3 Preview Highlights
Using pySpark with Google Colab & Spark 3.0 preview
Koalas: Spark scale to Pandas
Koalas is a pandas API on Apache Spark which makes
data engineers and scientists more productive when
interacting with big data
With the 3.0 feature release, Koalas can now scale to the
distributed environment without the need to read a Spark 3.0
DataFrame separately, in contrast to the single-node
environment previously
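A minimal Koalas sketch, assuming the separate koalas package is installed
(e.g. pip install koalas); the sample data is made up:
import pandas as pd
import databricks.koalas as ks
pdf = pd.DataFrame({"code": ["IT", "FR"], "value": [1, 2]})
kdf = ks.from_pandas(pdf)      # distributed Koalas DataFrame backed by Spark
print(kdf["value"].sum())      # pandas-like syntax, executed on the cluster
sdf = kdf.to_spark()           # convert back to a regular Spark DataFrame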

53

Spark 3 Preview Highlights
Using pySpark with Google Colab & Spark 3.0 preview
Koalas: Spark scale to Pandas

54

Spark 3 Preview Highlights
Using pySpark with Google Colab & Spark 3.0 preview
Kafka Streaming: includeHeaders
You can now read Kafka record headers in streaming queries by
enabling the includeHeaders option
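A hedged sketch of a streaming read with headers enabled (broker address and
topic are hypothetical):
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "events")
      .option("includeHeaders", "true")
      .load())
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")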

55

Spark 3 Preview Highlights
Using pySpark with Google Colab & Spark 3.0 preview
Dynamic Partition Pruning
It offers optimized execution at runtime by reusing the
dimension table broadcast results in hash joins. This helps
Spark 3.0 work more efficiently, specifically with queries based
on a star schema, thereby removing the need to ETL the
denormalized tables
Databricks’ session on Dynamic Partition Pruning in Apache
Spark:
https://databricks.com/session_eu19/dynamic-partition-pruning-in-apache-spark
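A hedged sketch of a star-schema query that can benefit from dynamic partition
pruning; the feature is controlled by the configuration key shown below, and the
table and column names are hypothetical:
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
spark.sql("""
  SELECT f.*
  FROM fact_sales f
  JOIN dim_date d ON f.date_id = d.date_id
  WHERE d.year = 2019
""").explain()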

56

Now
Get Your Hands Dirty!

57

Thank you for your attention!
Follow me
