
1

Fault Tolerance in Spark:
Lessons Learned
from Production
José Soltren
Cloudera

2

Who am I?
• Software Engineer at Cloudera focused on
Apache Spark
– …also an Apache Spark contributor
• Previous hardware, kernel, and driver hacking
experience

3

So… why does Cloudera care about
Fault Tolerance?
• Cloudera supports big customers running big
applications on big hardware.
• How big?
– >$1B/yr,
– core business logic
– 1000+ node clusters.
• Outages are really expensive.
– …about as expensive as flying a small jet.
• Customer’s problems are our problems.

4

Apache Spark
Fault Tolerance Basics

5

Fault Tolerance Basics
https://0x0fff.com/spark-architecture-shuffle/

6

Fault Tolerance Basics
• RDDs – Resilient Distributed Datasets: multiple
pieces, multiple copies
• Lineage: not versions, but a way of re-creating data (see the sketch below)
• HDFS (or HBase or other external store)
• Scheduler: Blacklist
• Scheduler/Storage: Duplication and Locality
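To make the lineage idea concrete, here is a minimal Scala sketch (not from the talk): Spark records the transformations that produced an RDD rather than the data itself, so a lost partition is recomputed from its parent instead of being restored from a saved copy. toDebugString prints that recorded lineage.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lineage-sketch").setMaster("local[2]"))
    val nums    = sc.parallelize(1 to 1000000, numSlices = 8)
    val squares = nums.map(x => x.toLong * x)  // recorded as a lineage step, not materialized
    println(squares.toDebugString)             // prints the recorded lineage graph
    println(squares.reduce(_ + _))             // a lost partition of squares is rebuilt by re-running map()
    sc.stop()
  }
}
```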

7


8

In
production,
at scale.

9

Case Study:
Application Outage
[SPARK-8425]

10

2016-04-22: An Application Outage
• Customer reports an application failure.
– Spark cluster with hundreds of nodes and 8 disks per node.
– Application runs on customer’s Spark cluster.
– High Availability is critical for this application.
• Immediate cause was a disk failure:
FileNotFoundException.
• “HDFS and YARN responded to the disk failure
appropriately but Spark continued to access the failed
disk.”
– Yikes.

11

Disk Failures and Scheduling
• One node has one bad disk.
– …tasks sometimes fail on this node.
– ...tasks consistently fail if they hit this disk.
• Tasks succeed if they are scheduled on other nodes!
• Tasks fail if they are scheduled on the same node.
– ...which is likely due to locality preferences.
• The scheduler will kill the whole job after some number
of failures.
• Can’t tell YARN (Mesos?) we have a bad resource.

12

Failure Recap
• There is already support for fault tolerance present –
don’t panic!
• Spark could handle an unreachable node.
• Spark could handle a node with no usable disks.
• We hit an edge case.
• Failure modes are binary, and not expressive enough.

13

Yuck.
:(

14

Short Term:
Workaround on Spark 1.6 and 2.0
• spark.scheduler.executorTaskBlacklistTime
– Set a value that is much longer than the duration of the longest task.
– Tells the scheduler to “blacklist” an (executor, task) combination for some
amount of time.
• spark.task.maxFailures
– Set a value that is larger than the maximum number of executors on a node.
– Determines the number of failures before the whole application is killed.
• spark.speculation
– Defaults to false; we did not recommend enabling it. (A combined sketch of these settings follows.)
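A minimal sketch of wiring these three workaround settings together, e.g. in spark-shell on Spark 1.6/2.0. The numeric values are illustrative assumptions, not recommendations from the talk; tune them to your own longest task and executors-per-node count.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Blacklist a failing (executor, task) pair for longer than any task runs
  // (here: one hour, in milliseconds -- an assumed upper bound):
  .set("spark.scheduler.executorTaskBlacklistTime", "3600000")
  // Allow more task failures than there are executors on any one node
  // (assuming fewer than 16 executors per node):
  .set("spark.task.maxFailures", "16")
  // Speculation stays off, per the recommendation above:
  .set("spark.speculation", "false")
```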

15

Long Term:
Overhaul The Blacklist
• The Scheduler is critical core code! Bug whack-a-mole.
• The driver is multi-threaded, with asynchronous requests.
• Many scenarios considered in Design Doc http://bit.do/bklist
– Large Cluster, Large Stages, One Bad Disk (our scenario)
– Failing Job, Very Small Cluster
– Large Cluster, Very Small Stages
– Long Lived Application, Occasional Failed Tasks
– Bad Node leads to widespread Shuffle-Fetch Failures
– Bad Node, One Executor, Dynamic Allocation
– Application programming errors!

16

org.apache.spark.scheduler.
BlacklistTracker
/**
* BlacklistTracker is designed to track problematic executors and nodes. It supports blacklisting
* executors and nodes across an entire application (with a periodic expiry). TaskSetManagers add
* additional blacklisting of executors and nodes for individual tasks and stages which works in
* concert with the blacklisting here.
*
* The tracker needs to deal with a variety of workloads, eg.:
*
* * bad user code -- this may lead to many task failures, but that should not count against
* individual executors
* * many small stages -- this may prevent a bad executor for having many failures within one
* stage, but still many failures over the entire application
* * "flaky" executors -- they don't fail every task, but are still faulty enough to merit
* blacklisting
* * See the design doc on SPARK-8425 for a more in-depth discussion.
*
* THREADING: As with most helpers of TaskSchedulerImpl, this is not thread-safe. Though it is
* called by multiple threads, callers must already have a lock on the TaskSchedulerImpl. The
* one exception is [[nodeBlacklist()]], which can be called without holding a lock.
*/

17

Long Term:
Scheduler Improvements
• New in Spark 2.2 (under development June – December 2016)
• spark.blacklist.enabled (Default: false)
• spark.blacklist.task.maxTaskAttemptsPerExecutor (Default: 1)
• spark.blacklist.task.maxTaskAttemptsPerNode (Default: 2)
• spark.blacklist.task.maxFailedTasksPerExecutor (Default: 2)
• spark.blacklist.task.maxFailedExecutorsPerNode (Default: 2)
– http://spark.apache.org/docs/latest/configuration.html
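On Spark 2.2 the workaround reduces to a single switch plus the per-task knobs listed above, set the same way as before; a minimal sketch follows, with the per-task values shown at the defaults from the slide:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")                       // off by default
  .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1")  // default shown above
  .set("spark.blacklist.task.maxTaskAttemptsPerNode", "2")      // default shown above
```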

18

Blacklist
Web UI

19

Case Study:
Shuffle Fetch Failures
[SPARK-4105]

20

SPARK-4105: The Symptom
• Non-deterministic
FAILED_TO_UNCOMPRESS(5) errors during
shuffle read.
• Difficult to reproduce.
• “Smells” like stream corruption.
– Some users saw similar issues
with LZF compression.
• Not related to spilling.
https://0x0fff.com/spark-architecture-shuffle/

21

SPARK-4105: Fixed in Spark 2.2
• https://github.com/apache/spark/pull/15923
• “It seems that it's very likely the corruption is
introduced by some weird machine/hardware, also
the checksum (16 bits) in TCP is not strong enough
to identify all the corruption.”
• Try to decompress blocks as they come in and check for IOExceptions (sketched below).
• Works for now, maybe we can do better.
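A minimal sketch of the idea behind the fix (not the actual patch in the PR above): eagerly decode the fetched bytes with the compression codec and treat an IOException as corruption, so the block can be re-fetched instead of failing the task with FAILED_TO_UNCOMPRESS. GZIPInputStream stands in for Spark's own CompressionCodec here.

```scala
import java.io.{ByteArrayInputStream, IOException, InputStream}
import java.util.zip.GZIPInputStream  // stand-in for Spark's own CompressionCodec

def looksCorrupt(compressed: Array[Byte]): Boolean = {
  var in: InputStream = null
  try {
    in = new GZIPInputStream(new ByteArrayInputStream(compressed))
    val buf = new Array[Byte](8192)
    while (in.read(buf) != -1) {}  // decode everything; output is discarded
    false                          // stream decoded cleanly
  } catch {
    case _: IOException => true    // corruption detected; caller can re-fetch the block
  } finally {
    if (in != null) in.close()
  }
}
```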

22

Closing thoughts.

23

What did we learn?
• The Spark Scheduler is responsible for assigning units of
work to compute resources.
• The scheduler is where the rubber meets the road when it
comes to fault tolerance.
• There are a few knobs to tweak, but hopefully that is not
necessary.
• Other things can fail besides the scheduler, too.
• Many classical distributed systems problems are still present
(even though Spark does a great job of abstracting most of
them away).

24

Recommendations for
Application Developers
• Gather and read logs, early and often.
– Issues may occur in smaller environments.
• Start small: one executor, one host.
• Grow slowly.
• Use “pen and paper” to determine expectations for job times.
• Watch out for stragglers, outliers, and crashes.
• Don’t start a critical job on a huge cluster and expect perfect performance the first time.

25

Thank You.
José Soltren
jose@cloudera.com
