Apache storm vs. Spark Streaming

Apache Storm and Spark
Streaming Compared
P. Taylor Goetz, Hortonworks
@ptgoetz

Honestly...
• I know a lot more about Apache Storm than I do
Apache Spark Streaming.
• I've been involved with Apache Storm, in one
way or another, since it was open-sourced.
• I'm admittedly biased.

But...
• A number of articles/papers comparing Apache
Storm and Spark Streaming are inaccurate in
terms of Storm’s features and performance
characteristics.
• Code and conﬁguration for those studies is not
available, so independent veriﬁcation is
impossible.
• Claims don't match real-world observations.

But...
• There is an inherent “Home Team Advantage” in
any benchmark comparison.
• Without open source code, any benchmark
claims are essentially marketing ﬂuff, and should
be taken with a grain or two of NaCl.
• Any benchmark claim should be independently
veriﬁable.

Spark Streaming Paper
• Compares Spark Streaming (Micro-Batch) to
Core Storm (One-at-a-Time)
• A more appropriate comparison would have
been with Storm’s Trident (Micro-Batch) API
• Trident mentioned only in passing (on pages 3
and 12)
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf

• Benchmark code/conﬁguration not publicly
available
• Performance claims not independently veriﬁable

• Granted, the Spark Streaming paper is almost 2
years old and written at a time when Trident was
relatively new.
• However, that paper is often cited when
comparing Apache Storm and Spark Streaming,
particularly in terms of performance.
• A lot can change in 2 years.

Streaming and batch
processing are
fundamentally different.

Batch vs. Streaming
• Storm is a stream processing framework that
also does micro-batching (Trident). 
• Spark is a batch processing framework that also
does micro-batching (Spark Streaming).

Batch vs. Streaming
Batch Streaming

Batch vs. Streaming
Batch Streaming
Micro-Batch

Apache Storm: Two
Streaming APIs
Core Storm (Spouts and Bolts)!
• One at a Time
• Lower Latency
• Operates on Tuple Streams
Trident (Streams and Operations)!
• Micro-Batch
• Higher Throughput
• Operates on Streams of Tuple Batches and Partitions

Language Options
Core Storm Storm Trident Spark Streaming
• Java
• Clojure
• Scala
• Python
• Ruby
• others*
• Java
• Clojure
• Scala
• Java
• Scala
• Python
*Storm’s Multi-Lang feature allows the use of virtually any programming language.

Reliability Models
Core Storm Storm Trident
Spark
Streaming
At Most Once Yes Yes No
At Least Once Yes Yes No*
Exactly Once No Yes Yes*
*In some node failure scenarios, Spark Streaming
falls back to at-least-once processing or data loss.

Programing Model
Core Storm Storm Trident Spark Streaming
Stream Primitive Tuple
Tuple, Tuple
Batch, Partition
DStream
Stream Source Spouts
Spouts, Trident
Spouts
HDFS, Network
Computation/
Transformation
Bolts
Filters,
Functions,
Aggregations,
Joins
Transformation,
Window
Operations
Stateful
Operations
No
(roll your own)
Yes Yes
Output/
Persistence
Bolts State, MapState foreachRDD

Production Deployments
Apache Storm Spark Streaming
• Too many to list 
 
http://
storm.incubator.apache.org/
documentation/Powered-
By.html
• Sharethrough 
 
http://
engineering.sharethrough.com/blog/
2014/06/27/sharethrough-at-spark-
summit-2014-spark-streaming-for-
realtime-auctions/

Support
Apache Storm Spark
Spark
Streaming
Hadoop Distro
Hortonworks,
MapR
Cloudera,
MapR,
Hortonworks
(preview)
Hortonworks,
Cloudera,
MapR
Resource
Management
YARN, Mesos YARN, Mesos YARN*, Mesos
Provisioning/
Monitoring
Apache
Ambari
Cloudera
Manager
?
*With issues: http://spark-summit.org/wp-content/uploads/2014/07/
Productionizing-a-247-Spark-Streaming-Service-on-YARN-Ooyala.pdf

Worker Failure:
Spark Streaming
"So if a worker node fails, then the system can recompute
the lost from the the left over copy of the input data.
However, if the worker node where a network receiver was
running fails, then a tiny bit of data may be lost, that is, the
data received by the system but not yet replicated to other
node(s)."
Only HDFS-backed data sources are fully fault tolerant.
https://spark.apache.org/docs/latest/streaming-programming-
guide.html#fault-tolerance-properties

Worker Failure:
Spark Streaming
https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-
zero-data-loss-in-spark-streaming.html
Solution?: Write Ahead Logs (SPARK-3129)
• Enabling WAL requires DFS (HDFS, S3) — no such
requirement with Storm
• Incurs a performance penalty that adds to overall latency
• Full fault tolerance still requires a data source that can
replay data (e.g. Kafka)!
• Architectural band aid?

Worker Failure:
Apache Storm
• If a supervisor node fails, Nimbus will reassign that node's
tasks to other nodes in the cluster.
• Any tuples sent to a failed node will time out and be
replayed (In Trident, any batches will be replayed).
• Delivery guarantees dependent on a reliable data source.

Data Source Reliability
• A data source is considered unreliable if there is no means
to replay a previously-received message.
• A data source is considered reliable if it can somehow replay
a message if processing fails at any point.
• A data source is considered durable if it can replay any
message or set of messages given the necessary selection
criteria.
!
(These are my terms.)

Reliability Limitations:
Apache Storm
• Exactly once processing requires a durable data source.
• At least once processing requires a reliable data source.
• An unreliable data source can be wrapped to provide
additional guarantees.
• With durable and reliable sources, Storm will not drop data.
• Common pattern: Back unreliable data sources with
Apache Kafka (minor latency hit traded for 100% durability).

Apache Storm Spouts
Durable!
Kafka 
 
 
 
 
 
Reliable!
JMS
RabbitMQ /
AMQP
Kestrel
Amazon SQS
Amazon Kinesis
Unreliable!
Twitter
Scribe
MongoDB

Apache Storm Output
(Bolts, Trident State)
• Cassandra
• HBase
• HDFS
• Kafka
• Redis
• Memcached
• R
• JMS
• MongoDB
• RDBMS

Apache Storm + Kafka
Apache Kafka is an ideal source for Storm topologies. It
provides everything necessary for:
• At most once processing
• At least once processing
• Exactly once processing
Apache Storm includes Kafka spout implementations for all
levels of reliability.
Kafka Supports a wide variety of languages and integration
points for both producers and consumers.

Reliability Limitations:
Spark Streaming
• Fault tolerance and reliability guarantees require
HDFS-backed data source.
• Moving data to HDFS prior to stream processing
introduces additional latency.
• Network data sources (Kafka, etc.) are
vulnerable to data loss in the event of a worker
node failure.
https://spark.apache.org/docs/latest/streaming-programming-
guide.html#fault-tolerance-properties

Performance
“The main reason cited by Tathagata for Spark's
performance gain over Storm is the aggregation of
small records that occurs through the mechanics
of RDDs.”
http://www.cs.duke.edu/~kmoses/cps516/dstream.html
In other words: Micro-Batching

Performance
http://www.cs.duke.edu/~kmoses/cps516/dstream.html
Storm capped at 10k msgs/sec/node?
Spark Streaming 40x faster than Storm?
Others may disagree…

https://twitter.com/
nathanmarz/status/
207989068519317505
http://www.slideshare.net/
JamesSirota/cisco-opensoc

Netty Transport
• Introduced in Apache Storm
0.9.0
• Faster, pure Java alternative
for 0MQ
• Yahoo! Engineering
announcement: 
http://yahooeng.tumblr.com/post/
64758709722/making-storm-ﬂy-
with-netty
• Performance Test Code: 
https://github.com/yahoo/storm-
perf-test
Netty
0mq

STORM-297
• Introduced in Apache Storm
0.9.2-incubating
• Big performance boost,
especially for small messages
• JIRA Discussion: 
https://issues.apache.org/jira/
browse/STORM-297
• Performance Test Code: 
https://github.com/yahoo/storm-
perf-test

Benchmarking Storm
• 5 nodes on AWS (m1.large - not very powerful)
• 1 ZooKeeper, 1 Nimbus, 3 Supervisors
• Storm Core API and Trident API benchmarks
• Is Trident API slower than Core API?
https://github.com/ptgoetz/storm-benchmark

Is Trident API slower than
Core API?
• On low-power hardware with 3 supervisor nodes…
• Core API:
~150k msg./sec. with ~80 ms. latency
• Trident API:
~300k msg./sec. with ~250 ms. latency
• Higher throughput possible with increased latency
• Better performance with bigger hardware

Is Spark + Spark Streaming a
"Lambda Architecture in a Box?"
• No!
• Lambda is a lot more than batch + streaming.
• Lambda is powerful when applied correctly, but is
not right for every use case.
• Spark and Spark Streaming have overlapping
programming models for batch and micro-batch.
• The rest is up to you (as it is with Storm).

Final Thoughts
In general (not speciﬁc to Spark Streaming):!
• Beware any claim that A is X times faster than B.
• Performance is a matter of proper tuning for the
use case at hand.
• Any system can be hobbled to look bad in a
benchmark.

Recommendation
• It is up to you, and your speciﬁc use case.
• Consider fault tolerance. Is data loss
acceptable?
• Consider all facets and make informed
decisions.
• Rely on your own benchmarks

Apache storm vs. Spark Streaming

More Related Content

Apache storm vs. Spark Streaming