Parallel Processing in Spark
Chapter 14
201509
Course Chapters
Distributed Data Processing with Spark
10 Spark Basics
11 Working with RDDs in Spark
12 Aggregating Data with Pair RDDs
13 Writing and Deploying Spark Applications
14 Parallel Processing in Spark
15 Spark RDD Persistence
16 Common Patterns in Spark Data Processing
17 Spark SQL and DataFrames
Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.
Parallel Programming with Spark
Chapter Topics
Spark Cluster Review
$ spark-submit --master yarn-client --class MyClass --num-executors 3 MyApp.jar
[Diagram: driver program with Spark context; three executors, each in a container on a worker (slave) node; cluster master node; HDFS master node]
RDDs on a Cluster
[Diagram: an executor holding the partition labeled rdd_1_2]
File Partitioning: Single Files
Default minimum number of partitions is 2
More partitions = more parallelization
[Diagram: myfile divided into partitions held by two executors]
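The effect of the partition count can be modelled without a cluster. The sketch below is plain Python, not Spark: `split_into_partitions` is a hypothetical helper that divides a list of lines into a requested number of contiguous chunks, defaulting to 2 to mirror the slide. Spark itself splits files by byte ranges, so real partition boundaries will differ.

```python
def split_into_partitions(lines, min_partitions=2):
    """Divide lines into min_partitions contiguous, near-equal chunks."""
    if min_partitions < 1:
        raise ValueError("min_partitions must be at least 1")
    base, extra = divmod(len(lines), min_partitions)
    partitions, start = [], 0
    for i in range(min_partitions):
        # The first `extra` partitions take one additional line each.
        size = base + (1 if i < extra else 0)
        partitions.append(lines[start:start + size])
        start += size
    return partitions


lines = ["line %d" % n for n in range(7)]
default_parts = split_into_partitions(lines)    # two partitions
three_parts = split_into_partitions(lines, 3)   # three partitions
print([len(p) for p in default_parts])
print([len(p) for p in three_parts])
```

Each partition can be processed by a separate task, so asking for more partitions gives the cluster more units of work to run in parallel.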
File Partitioning: Multiple Files
sc.textFile("mydir/*")
Each file becomes (at least) one partition
sc.wholeTextFiles("mydir")
For many small files
Creates a key-value PairRDD
key = file name
value = file contents
[Diagram: file1 and other files loaded into RDD partitions on the executors]
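The key-value shape described for sc.wholeTextFiles can be illustrated locally. This is a plain-Python model under stated assumptions, not Spark code: `whole_text_files` is a hypothetical function, and the file names and contents are made up for the example.

```python
import os
import tempfile


def whole_text_files(directory):
    """Return (file path, file contents) pairs, one pair per file."""
    pairs = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path) as handle:
                pairs.append((path, handle.read()))
    return pairs


with tempfile.TemporaryDirectory() as mydir:
    # Two small illustrative files.
    for name, text in [("file1.txt", "a\nb\n"), ("file2.txt", "c\n")]:
        with open(os.path.join(mydir, name), "w") as handle:
            handle.write(text)
    pairs = whole_text_files(mydir)
    names = [os.path.basename(path) for path, _ in pairs]
    contents = [text for _, text in pairs]
```

Each file stays whole as a single value, which is why this form suits many small files rather than a few large ones.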
Operating on Partitions
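Only the title of this slide survived, so the following is an assumption about what it covered. Spark's RDD API includes mapPartitions, which calls a function once per partition with an iterator over that partition's records. The plain-Python model below imitates that calling convention; `map_partitions` and `count_records` are hypothetical names.

```python
def map_partitions(partitions, func):
    """Apply func once per partition; func takes and returns an iterator."""
    return [list(func(iter(part))) for part in partitions]


calls = []


def count_records(records):
    # Runs once per partition, so any per-partition setup cost is paid once.
    calls.append(1)
    yield sum(1 for _ in records)


partitions = [["a", "b", "c"], ["d"], []]
counts = map_partitions(partitions, count_records)
```

The function is invoked three times here, once per partition, rather than once per record.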
HDFS and Data Locality
sc.textFile("hdfs://mydata").collect()
An action triggers execution: tasks on executors load data from blocks into partitions
[Diagram: the HDFS file mydata stored as HDFS blocks 1, 2, and 3; driver program with Spark context; executors running a task for each block and holding the resulting RDD partitions]
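The point that nothing is loaded until an action runs can be modelled in a few lines. This is a plain-Python sketch, not Spark: `LazyDataset` and `load_blocks` are hypothetical names, and the three lists stand in for three HDFS blocks.

```python
class LazyDataset:
    """Records transformations; nothing runs until an action is called."""

    def __init__(self, load_partitions, steps=()):
        self._load_partitions = load_partitions  # callable returning a list of partitions
        self._steps = tuple(steps)

    def map(self, func):
        # Transformation: returns a new dataset and does no work yet.
        return LazyDataset(self._load_partitions, self._steps + (func,))

    def collect(self):
        # Action: each block is loaded and every recorded step is applied to it.
        results = []
        for block in self._load_partitions():
            for record in block:
                for step in self._steps:
                    record = step(record)
                results.append(record)
        return results


loads = []


def load_blocks():
    loads.append("loaded")
    return [["a", "b"], ["c"], ["d", "e"]]  # three blocks, three partitions


data = LazyDataset(load_blocks).map(str.upper)
loads_before_action = len(loads)  # still 0: no data has been read
collected = data.collect()
```

In this model the loop over blocks runs sequentially; on a cluster each block would be handled by its own task.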
Parallel Operations on Partitions
Example: Average Word Length by Letter
[Diagram: successive RDDs built from the HDFS file mydata]
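The code for this example did not survive in the slides, so the steps below are an assumption about how such a pipeline is usually written: split lines into words, key each word by its first letter, group by key, then average. The sketch is plain Python with comments naming the Spark transformation each step would correspond to; `average_word_length_by_letter` is a hypothetical name.

```python
from collections import defaultdict


def average_word_length_by_letter(lines):
    # flatMap step: split each line into words.
    words = [word for line in lines for word in line.split()]
    # map step: key each word by its first letter, value is its length.
    pairs = [(word[0], len(word)) for word in words]
    # groupByKey step: gather all lengths that share a letter.
    grouped = defaultdict(list)
    for letter, length in pairs:
        grouped[letter].append(length)
    # final map step: average the lengths for each letter.
    return {letter: sum(lengths) / len(lengths)
            for letter, lengths in grouped.items()}


avglens = average_word_length_by_letter(
    ["the cat sat", "on the mat", "a tall tree"])
```

The grouping step is the only one that needs records from different partitions to meet, which is what makes it a shuffle on a cluster.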
Stages
Operations that can run on the same partition are executed in stages
Tasks within a stage are pipelined together
Developers should be aware of stages to improve performance
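One way to picture how operations fall into stages is to cut the chain of operations wherever a shuffle is needed. The sketch below is a plain-Python model, not Spark's scheduler: `split_into_stages` is a hypothetical helper, and the lineage is an assumed one for the average word length example.

```python
def split_into_stages(operations):
    """Group (name, needs_shuffle) operations into stages.

    An operation that needs a shuffle starts a new stage; all others are
    pipelined into the current stage.
    """
    stages = [[]]
    for name, needs_shuffle in operations:
        if needs_shuffle and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return stages


lineage = [
    ("textFile", False),
    ("flatMap", False),
    ("map", False),
    ("groupByKey", True),   # records with the same key must be brought together
    ("map", False),
]
stages = split_into_stages(lineage)
```

Under these assumptions the lineage splits into two stages, with the shuffle marking the boundary between them.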
Spark Execution: Stages
> avglens.saveAsTextFile("avglen-output")
[Diagram: a chain of five RDDs divided into Stage 1 and Stage 2; Stage 1 runs Tasks 1 to 4 and Stage 2 runs Tasks 5 and 6]
Summary of Spark Terminology
Stage
How Spark Calculates Stages
Viewing the Stages using toDebugString (Scala)
> avglens.toDebugString
Indents indicate stages (shuffle boundaries)
Viewing the Stages using toDebugString (Python)
Indents indicate stages (shuffle boundaries)
Spark Task Execution
[Diagram: the driver program's Spark context on the client; four executors holding HDFS blocks 1 to 4; Tasks 1 to 4 run on the executors, one per block; shuffle data then appears at each executor; Tasks 5 and 6 run next and produce the output files part-00000 and part-00001]
Spark Task Execution (alternate ending)
[Diagram: the same cluster, with Tasks 5 and 6 running and no output part files shown]
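The placement shown in the task execution diagrams, where each task runs on the executor that holds its block, can be sketched as a simple assignment rule. This is a deliberately simplified plain-Python model and not Spark's scheduler, which also weighs locality levels and waiting time; `assign_tasks` and the host names are hypothetical.

```python
def assign_tasks(block_hosts, executor_hosts):
    """Map each block to an executor on the same host when one exists."""
    assignments = {}
    for i, (block, host) in enumerate(sorted(block_hosts.items())):
        if host in executor_hosts:
            # Node-local: the task reads its block without crossing the network.
            assignments[block] = host
        else:
            # Fall back to any executor, round-robin; the block is read remotely.
            assignments[block] = executor_hosts[i % len(executor_hosts)]
    return assignments


block_hosts = {"block1": "node1", "block2": "node2",
               "block3": "node3", "block4": "node4"}
executor_hosts = ["node1", "node2", "node3"]
assignments = assign_tasks(block_hosts, executor_hosts)
```

With only three executors, the fourth block has no local executor and its task has to read the data over the network.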
Controlling the Level of Parallelism
spark.default.parallelism 10
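The partition count matters because a shuffle distributes records across that many partitions by key. The sketch below models that distribution in plain Python under stated assumptions: `partition_by_key` is a hypothetical helper, and crc32 stands in for whatever hash the real partitioner uses.

```python
import zlib


def partition_by_key(pairs, num_partitions):
    """Place each (key, value) pair in the partition chosen by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # crc32 is used only because it is stable across runs; Spark's own
        # partitioner uses a different hash function.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return partitions


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
ten = partition_by_key(pairs, 10)   # mirrors the slide's value of 10
two = partition_by_key(pairs, 2)
```

Whatever the count, all pairs with the same key land in the same partition, and each partition becomes one task in the stage after the shuffle.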
Viewing Stages in the Spark Application UI
Essential Points
Homework: View Jobs and Stages in the Spark Application UI