
1

Fault Tolerance in Spark:
Lessons Learned
from Production
José Soltren
Cloudera

2

Who am I?
• Software Engineer at Cloudera focused on
Apache Spark
– …also an Apache Spark contributor
• Previous hardware, kernel, and driver hacking
experience

3

So… why does Cloudera care about
Fault Tolerance?
• Cloudera supports big customers running big
applications on big hardware.
• How big?
– >$1B/yr,
– core business logic
– 1000+ node clusters.
• Outages are really expensive.
– …about as expensive as flying a small jet.
• Customer’s problems are our problems.

4

Apache Spark
Fault Tolerance Basics

5

Fault Tolerance Basics
https://0x0fff.com/spark-architecture-shuffle/

6

Fault Tolerance Basics
• RDDs – Resilient Distributed Datasets: multiple
pieces, multiple copies
• Lineage: not versions, but a way of re-creating data (see the sketch below)
• HDFS (or HBase or other external store)
• Scheduler: Blacklist
• Scheduler/Storage: Duplication and Locality
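To make the lineage idea concrete, here is a minimal Scala sketch (not from the talk): Spark records the transformations that produced an RDD rather than the data itself, so a lost partition is recomputed from its parent instead of being restored from a saved copy. toDebugString prints that recorded lineage.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lineage-sketch").setMaster("local[2]"))
    val nums    = sc.parallelize(1 to 1000000, numSlices = 8)
    val squares = nums.map(x => x.toLong * x)  // recorded as a lineage step, not materialized
    println(squares.toDebugString)             // prints the recorded lineage graph
    println(squares.reduce(_ + _))             // a lost partition of squares is rebuilt by re-running map()
    sc.stop()
  }
}
```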

7


8

In
production,
at scale.

9

Case Study:
Application Outage
[SPARK-8425]

10

2016-04-22: An Application Outage
• Customer reports an application failure.
– Spark cluster with hundreds of nodes and 8 disks per node.
– Application runs on customer’s Spark cluster.
– High Availability is critical for this application.
• Immediate cause was a disk failure:
FileNotFoundException.
• “HDFS and YARN responded to the disk failure
appropriately but Spark continued to access the failed
disk.”
– Yikes.

11

Disk Failures and Scheduling
• One node has one bad disk.
– …tasks sometimes fail on this node.
– ...tasks consistently fail if they hit this disk.
• Tasks succeed if they are scheduled on other nodes!
• Tasks fail if they are scheduled on the same node.
– ...which is likely due to locality preferences.
• The scheduler will kill the whole job after some number
of failures.
• Can’t tell YARN (Mesos?) we have a bad resource.

12

Failure Recap
• There is already support for fault tolerance present –
don’t panic!
• Spark could handle an unreachable node.
• Spark could handle a node with no usable disks.
• We hit an edge case.
• Failure modes are binary, and not expressive enough.

13

Yuck.
:(

14

Short Term:
Workaround on Spark 1.6 and 2.0
• spark.scheduler.executorTaskBlacklistTime
– Set a value that is much longer than the duration of the longest task.
– Tells the scheduler to “blacklist” an (executor, task) combination for some
amount of time.
• spark.task.maxFailures
– Set a value that is larger than the maximum number of executors on a node.
– Determines the number of failures before the whole application is killed.
• spark.speculation
– Defaults to false; we did not recommend enabling it. (A combined sketch of these settings follows.)
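A minimal sketch of wiring these three workaround settings together, e.g. in spark-shell on Spark 1.6/2.0. The numeric values are illustrative assumptions, not recommendations from the talk; tune them to your own longest task and executors-per-node count.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Blacklist a failing (executor, task) pair for longer than any task runs
  // (here: one hour, in milliseconds -- an assumed upper bound):
  .set("spark.scheduler.executorTaskBlacklistTime", "3600000")
  // Allow more task failures than there are executors on any one node
  // (assuming fewer than 16 executors per node):
  .set("spark.task.maxFailures", "16")
  // Speculation stays off, per the recommendation above:
  .set("spark.speculation", "false")
```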

15

Long Term:
Overhaul The Blacklist
• The Scheduler is critical core code! Bug whack-a-mole.
• The driver is multi-threaded, with asynchronous requests.
• Many scenarios considered in Design Doc http://bit.do/bklist
– Large Cluster, Large Stages, One Bad Disk (our scenario)
– Failing Job, Very Small Cluster
– Large Cluster, Very Small Stages
– Long Lived Application, Occasional Failed Tasks
– Bad Node leads to widespread Shuffle-Fetch Failures
– Bad Node, One Executor, Dynamic Allocation
– Application programming errors!

16

org.apache.spark.scheduler.
BlacklistTracker
/**
* BlacklistTracker is designed to track problematic executors and nodes. It supports blacklisting
* executors and nodes across an entire application (with a periodic expiry). TaskSetManagers add
* additional blacklisting of executors and nodes for individual tasks and stages which works in
* concert with the blacklisting here.
*
* The tracker needs to deal with a variety of workloads, eg.:
*
* * bad user code -- this may lead to many task failures, but that should not count against
* individual executors
* * many small stages -- this may prevent a bad executor for having many failures within one
* stage, but still many failures over the entire application
* * "flaky" executors -- they don't fail every task, but are still faulty enough to merit
* blacklisting
* * See the design doc on SPARK-8425 for a more in-depth discussion.
*
* THREADING: As with most helpers of TaskSchedulerImpl, this is not thread-safe. Though it is
* called by multiple threads, callers must already have a lock on the TaskSchedulerImpl. The
* one exception is [[nodeBlacklist()]], which can be called without holding a lock.
*/

17

Long Term:
Scheduler Improvements
• New in Spark 2.2 (under development June – December 2016)
• spark.blacklist.enabled (Default: false)
• spark.blacklist.task.maxTaskAttemptsPerExecutor (Default: 1)
• spark.blacklist.task.maxTaskAttemptsPerNode (Default: 2)
• spark.blacklist.task.maxFailedTasksPerExecutor (Default: 2)
• spark.blacklist.task.maxFailedExecutorsPerNode (Default: 2)
– http://spark.apache.org/docs/latest/configuration.html
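On Spark 2.2 the workaround reduces to a single switch plus the per-task knobs listed above, set the same way as before; a minimal sketch follows, with the per-task values shown at the defaults from the slide:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")                       // off by default
  .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1")  // default shown above
  .set("spark.blacklist.task.maxTaskAttemptsPerNode", "2")      // default shown above
```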

18

Blacklist
Web UI

19

Case Study:
Shuffle Fetch Failures
[SPARK-4105]

20

SPARK-4105: The Symptom
• Non-deterministic
FAILED_TO_UNCOMPRESS(5) errors during
shuffle read.
• Difficult to reproduce.
• “Smells” like stream corruption.
– Some users saw similar issues
with LZF compression.
• Not related to spilling.
https://0x0fff.com/spark-architecture-shuffle/

21

SPARK-4105: Fixed in Spark 2.2
• https://github.com/apache/spark/pull/15923
• “It seems that it's very likely the corruption is
introduced by some weird machine/hardware, also
the checksum (16 bits) in TCP is not strong enough
to identify all the corruption.”
• Try to decompress blocks as they come in and check for IOExceptions (sketched below).
• Works for now, maybe we can do better.
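A minimal sketch of the idea behind the fix (not the actual patch in the PR above): eagerly decode the fetched bytes with the compression codec and treat an IOException as corruption, so the block can be re-fetched instead of failing the task with FAILED_TO_UNCOMPRESS. GZIPInputStream stands in for Spark's own CompressionCodec here.

```scala
import java.io.{ByteArrayInputStream, IOException, InputStream}
import java.util.zip.GZIPInputStream  // stand-in for Spark's own CompressionCodec

def looksCorrupt(compressed: Array[Byte]): Boolean = {
  var in: InputStream = null
  try {
    in = new GZIPInputStream(new ByteArrayInputStream(compressed))
    val buf = new Array[Byte](8192)
    while (in.read(buf) != -1) {}  // decode everything; output is discarded
    false                          // stream decoded cleanly
  } catch {
    case _: IOException => true    // corruption detected; caller can re-fetch the block
  } finally {
    if (in != null) in.close()
  }
}
```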

22

Closing thoughts.

23

What did we learn?
• The Spark Scheduler is responsible for assigning units of
work to compute resources.
• The scheduler is where the rubber meets the road when it
comes to fault tolerance.
• There are a few knobs to tweak, but hopefully that is not
necessary.
• Other things can fail besides the scheduler, too.
• Many classical distributed systems problems are still present
(even though Spark does a great job of abstracting most of
them away).

24

Recommendations for
Application Developers
• Gather and read logs, early and often.
– Issues may occur in smaller environments.
• Start small: one executor, one host.
• Grow slowly.
• Use “pen and paper” to determine expectations for job times.
• Watch out for stragglers, outliers, and crashes.
• Don’t start a critical job on a huge cluster and expect perfect performance the first time.

25

Thank You.
José Soltren
jose@cloudera.com
