Apache Spark Fundamentals

Eren Avşaroğulları

Data Science and Engineering Club Meetup
Dublin - December 9, 2017
Agenda

• What is Apache Spark?
• Spark Ecosystem & Terminology
• RDDs & Operation Types (Transformations & Actions)
• RDD Lineage
• Job Lifecycle
• RDD Evolution (DataFrames and DataSets)
• Persistency
• Clustering / Spark on YARN

[code icon] shows code samples
Bio

• B.Sc. & M.Sc. in Electronics & Control Engineering
• Apache Spark Contributor since v2.0.0
• Sr. Software Engineer @
• Currently works on Data Analytics: Data Transformations / Cleaning

erenavsarogullari
What is Apache Spark?

• Distributed Compute Engine
• Project started in 2009 at UC Berkeley
• First version (v0.5) released in June 2012
• Moved to the Apache Software Foundation in 2013
• Supported Languages: Java, Scala, Python and R
• 1,100+ contributors / 14K+ forks on GitHub
• spark-packages.org => ~380 Extensions
Spark Ecosystem

[Stack diagram] Libraries: Spark SQL, Spark Streaming, MLlib and GraphX, all built on the Spark Core Engine.
Deployment: Cluster Mode (Standalone, YARN, Mesos) or Local Mode (Local).
Terminology

• RDD: Resilient Distributed Dataset; immutable, resilient and partitioned.
• DAG: Directed Acyclic Graph; the execution plan of a job (a.k.a. RDD dependency graph).
• Application: An instance of SparkContext. Single per JVM.
• Job: An action operator triggering computation.
• Driver: The program/process running the job over the Spark engine.
• Executor: The process executing a task.
• Worker: The node running executors.
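A tiny sketch of how these terms relate at runtime. It assumes an already-created SparkContext named sc (for example, the one built in the RDD creation sketch later in the deck); the variable names are illustrative:

```scala
// One SparkContext instance = one application (single per JVM).
// Transformations only extend the DAG; each action submits one job,
// which the driver splits into stages and tasks that run on executors.
val numbers = sc.parallelize(1 to 100, numSlices = 4) // 4 partitions => up to 4 parallel tasks
val evens   = numbers.filter(_ % 2 == 0)              // transformation: no computation yet
val count   = evens.count()                           // action => Job 0
val sum     = evens.reduce(_ + _)                     // action => Job 1
```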
  
	
  
How to create an RDD?

• Collection parallelize
• By loading a file
• Transformations

Let's see the sample => Application-1 (a minimal sketch follows below)
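A minimal sketch of the three creation paths above, run against a local SparkContext; the file path and variable names are illustrative, not taken from the talk's Application-1:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Build a local SparkContext (in spark-shell it is already available as sc)
val conf = new SparkConf().setAppName("rdd-creation").setMaster("local[*]")
val sc   = new SparkContext(conf)

// 1. Collection parallelize: distribute a local collection across partitions
val numbersRdd = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. By loading a file: each line becomes one element (path is illustrative)
val linesRdd = sc.textFile("data/people.txt")

// 3. Transformations: derive a new RDD from an existing one
val doubledRdd = numbersRdd.map(_ * 2)

println(doubledRdd.collect().mkString(", ")) // 2, 4, 6, 8, 10
```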
  
RDD Operation Types

Two types of Spark operations on RDDs:

• Transformations: lazily evaluated (not computed immediately)
• Actions: trigger the computation and return a value

[Flow diagram] Data => Transformations => RDD => ... => RDD => Actions => Value
Transformations

• map(func)
• flatMap(func)
• filter(func)
• union(dataset)
• join(dataset, usingColumns: Seq[String])
• intersect(dataset)
• coalesce(numPartitions)
• repartition(numPartitions)

Full list: https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
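A small sketch of chaining a few of these transformations, continuing with the sc from the creation sketch above; because transformations are lazy, none of these lines trigger any computation:

```scala
val numbers  = sc.parallelize(1 to 10)
val evens    = numbers.filter(_ % 2 == 0)   // filter(func)
val scaled   = evens.map(_ * 10)            // map(func)
val combined = scaled.union(numbers)        // union(dataset)
val twoParts = combined.coalesce(2)         // coalesce(numPartitions)

// At this point Spark has only recorded a DAG of transformations;
// no data has been read or processed yet.
```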
  
	
  
Actions

• first()
• take(n)
• collect()
• count()
• saveAsTextFile(path)

Full list: https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions

Let's see the sample => Application-2 (a short sketch follows below)
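A short sketch of actions forcing evaluation, continuing with the twoParts RDD from the transformations sketch above (the output path is illustrative):

```scala
println(twoParts.first())                  // first(): returns one element
println(twoParts.take(3).mkString(", "))   // take(n): up to n elements
val all   = twoParts.collect()             // collect(): whole RDD on the driver (use with care)
val total = twoParts.count()               // count(): number of elements

// Writes one part file per partition under the given directory
twoParts.saveAsTextFile("output/numbers")
```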
  
	
  
RDD Dependencies (Lineage)

[Lineage diagram: RDDs 1-7 linked by map, union, sort and join, grouped into Stage 0, Stage 1 and Stage 3. Narrow transformations keep RDDs within a stage; wide transformations require shuffles and mark stage boundaries.]
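The lineage can be inspected at runtime with toDebugString. A minimal sketch, reusing the sc from the earlier examples, that builds a small lineage containing both narrow and wide dependencies:

```scala
val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))   // narrow: parallelize
val doubled = pairs.mapValues(_ * 2)                              // narrow: mapValues
val grouped = doubled.groupByKey()                                // wide: introduces a shuffle

// Prints the RDD dependency graph; indented sections correspond to separate stages
println(grouped.toDebugString)
```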
  
Job Lifecycle

[Job lifecycle diagram]
RDD Evolution

RDD (v1.0, 2011) => DataFrame (v1.3, 2013) => DataSet (v1.6, 2015)

• RDD: low-level data structure of Java objects; to work with unstructured data.
• DataFrame: untyped API; schema-based, tabular; SQL support; to work with semi-structured (csv, json) and structured (jdbc) data.
• DataSet: typed API [T]; tabular; SQL support; to work with semi-structured (csv, json) and structured (jdbc) data.

Project Tungsten + Catalyst Optimizer => two-tier optimizations.
How to create a DataFrame?

• By loading a file: spark.read.format("csv").load()
• SparkSession.createDataFrame(RDD, schema)

Let's see the code – Application-3 (a minimal sketch follows below)
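A minimal sketch of both creation paths, assuming a local SparkSession; the file path, column names and schema are illustrative, not taken from Application-3:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder()
  .appName("dataframe-creation")
  .master("local[*]")
  .getOrCreate()

// 1. By loading a file
val csvDf = spark.read
  .format("csv")
  .option("header", "true")
  .load("data/people.csv")

// 2. From an RDD of Rows plus an explicit schema
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))
val rowsRdd  = spark.sparkContext.parallelize(Seq(Row("Jane", 30), Row("John", 25)))
val peopleDf = spark.createDataFrame(rowsRdd, schema)

peopleDf.printSchema()
peopleDf.show()
```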
  
How to create a DataSet?

• By loading a file: spark.read.format("csv").load()
• SparkSession.createDataset(collection or RDD)

Let's see the code – Application-4-1 and Application-4-2 (a minimal sketch follows below)
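A minimal sketch, reusing the spark session from the DataFrame sketch above; Person is a hypothetical case class used to obtain a typed Dataset[Person], and the csv path/options are illustrative:

```scala
import org.apache.spark.sql.Dataset
import spark.implicits._   // encoders for case classes and primitive types

case class Person(name: String, age: Int)

// 1. By loading a file and converting it to a typed Dataset
val peopleDs: Dataset[Person] = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data/people.csv")
  .as[Person]

// 2. From a local collection (or an RDD) via createDataset
val inlineDs: Dataset[Person] =
  spark.createDataset(Seq(Person("Jane", 30), Person("John", 25)))

inlineDs.filter(_.age > 26).show()
```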
  
	
  
Persistency

Storage Mode                       | Details
MEMORY_ONLY                        | Store RDD as deserialized Java objects in the JVM.
MEMORY_AND_DISK                    | Store RDD as deserialized Java objects in the JVM; partitions that do not fit in memory are spilled to disk.
MEMORY_ONLY_SER                    | Store RDD as serialized Java objects (Kryo serialization can be considered).
MEMORY_AND_DISK_SER                | Similar to MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk.
DISK_ONLY                          | Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2   | Same as the levels above, but replicate each partition on two cluster nodes.

• RDD / DF.persist(newStorageLevel: StorageLevel)
• RDD.unpersist() => unpersists the RDD from memory and disk

Unpersist should be called explicitly in long-lived applications to use executor memory efficiently.
Note: when cached data exceeds storage memory, Spark evicts cached blocks using a Least Recently Used (LRU) policy by default.
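A minimal sketch of persist/unpersist, reusing the peopleDf from the DataFrame sketch above; the storage level choice is illustrative:

```scala
import org.apache.spark.storage.StorageLevel

// Cache as serialized records in memory, spilling to disk when memory is tight
peopleDf.persist(StorageLevel.MEMORY_AND_DISK_SER)

peopleDf.count()   // the first action materializes and caches the data
peopleDf.show()    // later actions read from the cache

// Release the cached blocks explicitly once they are no longer needed;
// otherwise eviction only happens under memory pressure (LRU by default)
peopleDf.unpersist()
```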
  
Clustering / Spark on YARN

[Diagram: YARN Client Mode]

Q & A

Thanks
References

• https://spark.apache.org/docs/latest/
• https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
• https://jaceklaskowski.gitbooks.io/mastering-apache-spark
• https://stackoverflow.com/questions/36215672/spark-yarn-architecture
• High Performance Spark by Holden Karau & Rachel Warren
