The document provides an overview of Apache Spark fundamentals: what Spark is, its ecosystem and terminology, how to create RDDs and use the two operation types (transformations and actions), RDD lineage, and the evolution from RDDs to DataFrames and DataSets. It also covers the job lifecycle, persistence, and running Spark on a YARN cluster, with code samples demonstrating the different features. The presenter has a computer engineering background and currently works on data analytics and transformations using Spark.
Apache Spark Fundamentals Meetup Talk
1. Apache Spark Fundamentals
Eren Avşaroğulları
Data Science and Engineering Club Meetup
Dublin - December 9, 2017
3. Bio
• B.Sc. & M.Sc. in Electronics & Control Engineering
• Apache Spark Contributor since v2.0.0
• Sr. Software Engineer @
• Currently working on Data Analytics and Data Transformations / Cleaning
erenavsarogullari
4. What is Apache Spark?
• Distributed compute engine
• Project started in 2009 at UC Berkeley
• First version (v0.5) was released in June 2012
• Moved to the Apache Software Foundation in 2013
• Supported languages: Java, Scala, Python and R
• 1100+ contributors / 14K+ forks on GitHub
• spark-packages.org => ~380 extensions
5. Spark Ecosystem
• Libraries: Spark SQL, Spark Streaming, MLlib, GraphX
• Spark Core Engine
• Resource managers: Standalone, YARN, Mesos, Local
• Deployment: Cluster Mode / Local Mode
6. Terminology
• RDD: Resilient Distributed Dataset; immutable, resilient and partitioned.
• DAG: Directed Acyclic Graph; the execution plan of a job (a.k.a. the RDD dependency graph).
• Application: An instance of SparkContext; single per JVM.
• Job: The computation triggered by an action operator.
• Driver: The program/process that runs the job on the Spark engine.
• Executor: The process that executes tasks.
• Worker: The node that runs executors.
7. How to create an RDD?
• Collection parallelize
• By loading a file
• Transformations
Let's see the sample => Application-1
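The talk's Application-1 code is not included in this transcript; the following is a minimal Scala sketch of the three creation routes (the application name and the file path data/sample.txt are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object RddCreationSample {
  def main(args: Array[String]): Unit = {
    // Local SparkSession for demonstration purposes
    val spark = SparkSession.builder()
      .appName("rdd-creation-sample")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // 1) From a collection, via parallelize
    val numbersRdd = sc.parallelize(1 to 10)

    // 2) By loading a file (placeholder path)
    val linesRdd = sc.textFile("data/sample.txt")

    // 3) From an existing RDD, via a transformation
    val squaredRdd = numbersRdd.map(n => n * n)

    println(squaredRdd.collect().mkString(", "))
    spark.stop()
  }
}
```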
8. Operation Types
Two types of Spark operations on RDDs:
• Transformations: lazily evaluated (not computed immediately); each one produces a new RDD.
• Actions: trigger the computation and return a value.
Flow: Data => RDD => Transformations => RDD => Actions => Value
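Not the talk's code, but a short sketch (reusing the sc from the sample above) showing that transformations only build the lineage lazily while an action triggers the actual job:

```scala
// Transformations are lazy: no computation happens on these lines.
val evensRdd = sc.parallelize(1 to 100)
  .map(_ * 2)          // transformation: RDD => RDD
  .filter(_ % 4 == 0)  // transformation: RDD => RDD

// Actions trigger the computation and return a value to the driver.
val howMany = evensRdd.count()       // action: RDD => Long
val total   = evensRdd.reduce(_ + _) // action: RDD => Int
```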
13. RDD Evolution
• RDD, v1.0 (2011): low-level data structure of Java objects; for working with unstructured data.
• DataFrame, v1.3 (2013): untyped API; schema-based, tabular; SQL support; for working with semi-structured (csv, json) and structured (jdbc) data.
• DataSet, v1.6 (2015): typed API [T]; tabular; same SQL support and data sources as DataFrame.
• DataFrames and DataSets get two tiers of optimization: Project Tungsten and the Catalyst Optimizer.
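A small illustrative sketch (names and values are made up) of the practical difference between the untyped DataFrame API and the typed Dataset[T] API:

```scala
import org.apache.spark.sql.SparkSession

case class User(name: String, age: Int)

val spark = SparkSession.builder().appName("api-evolution").master("local[*]").getOrCreate()
import spark.implicits._

// DataFrame: untyped API, columns are resolved at runtime
val usersDf = Seq(User("Ada", 36), User("Linus", 28)).toDF()
usersDf.filter($"age" > 30).show()   // a typo in "age" would only fail at runtime

// Dataset[T]: typed API, fields are checked at compile time
val usersDs = Seq(User("Ada", 36), User("Linus", 28)).toDS()
usersDs.filter(_.age > 30).show()    // a typo in .age would fail at compile time
```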
14. How to create a DataFrame?
• By loading a file (spark.read.format("csv").load())
• SparkSession.createDataFrame(RDD, schema)
Let's see the code – Application-3
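Application-3 itself is not in the transcript; this is a sketch of both creation routes (the file path, column names and sample rows are placeholders):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("dataframe-sample").master("local[*]").getOrCreate()

// 1) By loading a file (placeholder path and options)
val csvDf = spark.read
  .format("csv")
  .option("header", "true")
  .load("data/users.csv")

// 2) From an RDD[Row] plus an explicit schema
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false)))
val rowsRdd = spark.sparkContext.parallelize(Seq(Row("Ada", 36), Row("Linus", 28)))
val usersDf = spark.createDataFrame(rowsRdd, schema)
usersDf.show()
```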
15. How to create a DataSet?
• By loading a file (spark.read.format("csv").load())
• SparkSession.createDataset(collection or RDD)
Let's see the code – Application-4-1 / Application-4-2
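Again, not the talk's Application-4 code, just a sketch of both routes; the file path and the User case class are assumptions:

```scala
import org.apache.spark.sql.SparkSession

case class User(name: String, age: Int)

val spark = SparkSession.builder().appName("dataset-sample").master("local[*]").getOrCreate()
import spark.implicits._   // provides the Encoder[User] needed below

// 1) By loading a file and mapping it to a case class (placeholder path)
val csvDs = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data/users.csv")
  .as[User]

// 2) From a local collection via createDataset (an RDD works the same way)
val usersDs = spark.createDataset(Seq(User("Ada", 36), User("Linus", 28)))
usersDs.show()
```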
16. Persistence

Storage Level: Details
MEMORY_ONLY: Store the RDD as deserialized Java objects in the JVM.
MEMORY_AND_DISK: Store the RDD as deserialized Java objects in the JVM; partitions that do not fit in memory are stored on disk.
MEMORY_ONLY_SER: Store the RDD as serialized Java objects (Kryo serialization can be considered).
MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk.
DISK_ONLY: Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2: Same as the levels above, but replicate each partition on two cluster nodes.

• RDD / DataFrame: persist(newStorageLevel: StorageLevel)
• RDD.unpersist() => unpersists the RDD from memory and disk
Unpersist should be called explicitly in long-running applications to use executor memory efficiently.
Note: when cached data exceeds storage memory, Spark evicts entries with a Least Recently Used (LRU) policy by default.
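A short sketch (placeholder file path, reusing the spark session from the samples above) of persisting an RDD that is reused by more than one action and releasing it explicitly afterwards:

```scala
import org.apache.spark.storage.StorageLevel

val wordsRdd = spark.sparkContext
  .textFile("data/sample.txt")   // placeholder path
  .flatMap(_.split("\\s+"))

// Pick a storage level explicitly; the default for RDD.cache() is MEMORY_ONLY
wordsRdd.persist(StorageLevel.MEMORY_AND_DISK_SER)

val wordCount     = wordsRdd.count()            // first action: computes and caches
val distinctCount = wordsRdd.distinct().count() // second action: reuses the cached partitions

wordsRdd.unpersist()   // free executor memory once the RDD is no longer needed
```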