Storm
Distributed and fault-tolerant realtime computation




                                          Nathan Marz
                                            Twitter
Basic info
• Open sourced September 19th, 2011
• Implementation is 12,000 lines of code
• Used by over 25 companies
• >2280 watchers on GitHub (most watched
  JVM project)
• Very active mailing list
   • >1700 messages
   • >520 members
Hadoop


              Batch computation




Distributed                       Fault-tolerant
Storm


         Realtime computation




Distributed                 Fault-tolerant
Hadoop


• Large, finite jobs
• Process a lot of data at once
• High latency
Storm


• Infinite computations called topologies
• Process infinite streams of data
• Tuple-at-a-time computational model
• Low latency
Before Storm



Queues        Workers
Example




 (simplified)
Example




Workers schemify tweets
 and append to Hadoop
Example




Workers update statistics on URLs by
incrementing counters in Cassandra
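The queues-and-workers pattern above can be sketched as a minimal Python simulation; here a plain dict stands in for Cassandra's counters and `queue.Queue` stands in for a Kestrel queue (all names are illustrative, not Storm APIs):

```python
from queue import Queue

def url_counter_worker(queue, counters):
    """Pull tweets off the queue and increment a per-URL counter.

    `counters` (a plain dict) stands in for Cassandra's counter columns;
    the Queue stands in for a Kestrel queue feeding the worker.
    """
    while not queue.empty():
        tweet = queue.get()
        for url in tweet.get("urls", []):
            counters[url] = counters.get(url, 0) + 1

q = Queue()
q.put({"user": "alice", "urls": ["http://a.com"]})
q.put({"user": "bob", "urls": ["http://a.com", "http://b.com"]})

counters = {}
url_counter_worker(q, counters)
```

Scaling this means manually partitioning queues across workers and redeploying both sides — the pain points the next slide lists.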
Problems

• Scaling is painful
• Poor fault-tolerance
• Coding is tedious
What we want
• Guaranteed data processing
• Horizontal scalability
• Fault-tolerance
• No intermediate message brokers!
• Higher level abstraction than message passing
• “Just works”
Storm
Guaranteed data processing
Horizontal scalability
Fault-tolerance
No intermediate message brokers!
Higher level abstraction than message passing
“Just works”
Use cases



  Stream      Distributed   Continuous
processing       RPC        computation
Storm Cluster
Storm Cluster




Master node (“Nimbus”, similar to the Hadoop JobTracker)
Storm Cluster




ZooKeeper, used for cluster coordination
Storm Cluster




Worker nodes (“Supervisors”) run worker processes
Starting a topology
Killing a topology
Concepts

• Streams
• Spouts
• Bolts
• Topologies
Streams


Tuple   Tuple   Tuple   Tuple   Tuple   Tuple   Tuple




          Unbounded sequence of tuples
Spouts




Source of streams
Spout examples


• Read from Kestrel queue
• Read from Twitter streaming API
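A spout can be pictured as anything that turns an external source into a stream of tuples — a hedged sketch, using a Python generator over a list that stands in for a Kestrel queue or the Twitter streaming API:

```python
def sentence_spout(sentences):
    """A toy 'spout': turns an external source (here a list, standing in
    for a Kestrel queue or the Twitter streaming API) into a stream of
    tuples, one field per tuple."""
    for s in sentences:
        yield (s,)   # each tuple carries a single field: the sentence

stream = list(sentence_spout(["the cow", "the dog"]))
```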
Bolts




Processes input streams and produces new streams
Bolts
• Functions
• Filters
• Aggregation
• Joins
• Talk to databases
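The bolt roles above can be sketched as stream transformers — an illustrative simulation (not the Storm bolt API), chaining a filter bolt into a function bolt:

```python
def filter_bolt(stream, predicate):
    """Filter bolt: pass through only the tuples matching the predicate."""
    return (tup for tup in stream if predicate(tup))

def function_bolt(stream, fn):
    """Function bolt: transform each input tuple into a new tuple."""
    return (fn(tup) for tup in stream)

# chain bolts like a topology: words -> filter -> uppercase
words = [("the",), ("cow",), ("jumped",)]
long_words = filter_bolt(words, lambda t: len(t[0]) > 3)
shouted = list(function_bolt(long_words, lambda t: (t[0].upper(),)))
```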
Topology




Network of spouts and bolts
Tasks




Spouts and bolts execute as
many tasks across the cluster
Task execution




Tasks are spread across the cluster
Task execution




Tasks are spread across the cluster
Stream grouping




When a tuple is emitted, which task does it go to?
Stream grouping

• Shuffle grouping: pick a random task
• Fields grouping: mod hashing on a
  subset of tuple fields
• All grouping: send to all tasks
• Global grouping: pick task with lowest id
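The four groupings above reduce to small routing functions from a tuple to a task id — a minimal sketch (Python's built-in `hash` stands in for Storm's hashing):

```python
import random

def shuffle_grouping(num_tasks):
    # shuffle grouping: pick a random task
    return lambda tup: random.randrange(num_tasks)

def fields_grouping(num_tasks, *field_idxs):
    # fields grouping: mod-hash a subset of the tuple's fields, so equal
    # field values always route to the same task
    return lambda tup: hash(tuple(tup[i] for i in field_idxs)) % num_tasks

def global_grouping(num_tasks):
    # global grouping: always the task with the lowest id
    return lambda tup: 0

route = fields_grouping(4, 0)           # group on field 0 (e.g. the word)
same_task = route(("apple", 1)) == route(("apple", 2))
```

Fields grouping is what makes stateful bolts (like a word counter) correct: every copy of “apple” lands on the same task.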
Topology
(Diagram: a topology whose edges are labeled with their groupings —
shuffle, fields grouping on [“id1”, “id2”], fields grouping on [“url”],
and all grouping.)
Streaming word count




TopologyBuilder is used to construct topologies in Java
Streaming word count




Define a spout in the topology with parallelism of 5 tasks
Streaming word count




Split sentences into words with parallelism of 8 tasks
Streaming word count



Consumer decides what data it receives and how it gets grouped




Split sentences into words with parallelism of 8 tasks
Streaming word count




   Create a word count stream
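The word-count slides reference Java code shown as images; the same pipeline can be sketched as a self-contained Python simulation (spout → split bolt → count bolt, with the per-task dict safe because a fields grouping on the word would route all copies to one task):

```python
from collections import defaultdict

def split_sentence(tup):
    # the "split" bolt: one sentence tuple in, one tuple per word out
    for word in tup[0].split():
        yield (word,)

def word_count(sentence_stream):
    # the "count" bolt: fields grouping on the word guarantees every copy
    # of a word reaches the same counting task, so a local dict is safe
    counts = defaultdict(int)
    for sentence in sentence_stream:
        for (word,) in split_sentence(sentence):
            counts[word] += 1
            yield (word, counts[word])

out = list(word_count([("the cow jumped",), ("the man",)]))
```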
Streaming word count




      splitsentence.py
Streaming word count
Streaming word count




  Submitting topology to a cluster
Streaming word count




  Running topology in local mode
Demo
Distributed RPC




Data flow for Distributed RPC
DRPC Example


Computing “reach” of a URL on the fly
Reach


Reach is the number of unique people
    exposed to a URL on Twitter
Computing reach
(Diagram of the reach pipeline:)

URL → Tweeters → Followers → Distinct followers → Count → Reach
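The pipeline above boils down to a distinct-union over follower sets — a sketch where plain dicts stand in for the database lookups the topology's bolts would perform:

```python
def reach(url, tweeters_of, followers_of):
    """Reach = number of distinct followers across everyone who tweeted
    the URL. `tweeters_of` and `followers_of` are dicts standing in for
    the database lookups the GetTweeters/GetFollowers bolts would do."""
    distinct = set()
    for tweeter in tweeters_of.get(url, []):
        distinct.update(followers_of.get(tweeter, []))
    return len(distinct)

tweeters = {"http://x.com": ["alice", "bob"]}
followers = {"alice": ["carol", "dan"], "bob": ["dan", "erin"]}
r = reach("http://x.com", tweeters, followers)   # carol, dan, erin
```

In the real topology this set is partitioned across many tasks (each keeps a partial set per request id and emits a partial count), then summed.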
Reach topology
Reach topology
Reach topology
Reach topology


          Keep set of followers for
         each request id in memory
Reach topology




          Update followers set when
           receive a new follower
Reach topology


            Emit partial count after
          receiving all followers for a
                   request id
Demo
Guaranteeing message processing




       “Tuple tree”
Guaranteeing message processing

• A spout tuple is not fully processed until all
  tuples in the tree have been completed
Guaranteeing message processing

• If the tuple tree is not completed within a
  specified timeout, the spout tuple is replayed
Guaranteeing message processing




      Reliability API
Guaranteeing message processing




“Anchoring” creates a new edge in the tuple tree
Guaranteeing message processing




“Acking” marks a single node in the tree as complete
Guaranteeing message processing

• Storm tracks tuple trees for you in an
  extremely efficient way
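The “extremely efficient” tracking is an XOR trick: an acker keeps one 64-bit value per spout tuple, XORs in each tuple id when it is anchored and again when it is acked, so the value returns to zero exactly when every node in the tree is complete — constant memory per spout tuple regardless of tree size. A sketch (fixed ids here for determinism; Storm uses random 64-bit ids):

```python
class Acker:
    """Sketch of Storm's XOR acking trick: a whole tuple tree is tracked
    with a single integer. Each tuple id is XORed in once when anchored
    and once when acked; when every id has been XORed twice the value is
    back to 0 and the spout tuple is fully processed."""
    def __init__(self):
        self.ack_vals = {}

    def anchor(self, root, tuple_id):
        self.ack_vals[root] = self.ack_vals.get(root, 0) ^ tuple_id

    def ack(self, root, tuple_id):
        self.ack_vals[root] ^= tuple_id
        return self.ack_vals[root] == 0   # True => tree fully processed

acker = Acker()
ids = [5, 9, 17]                          # tuple ids in one tree
for i in ids:
    acker.anchor("spout-tuple-1", i)
done = [acker.ack("spout-tuple-1", i) for i in ids]
# only the final ack completes the tree
```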
Transactional topologies


How do you do idempotent counting with an
     at-least-once delivery guarantee?
Transactional topologies


      Won’t you overcount?
Transactional topologies


 Transactional topologies solve this problem
Transactional topologies


Built completely on top of Storm’s primitives
        of streams, spouts, and bolts
Transactional topologies


Batch 1   Batch 2   Batch 3




          Process small batches of tuples
Transactional topologies


Batch 1     Batch 2   Batch 3




          If a batch fails, replay the whole batch
Transactional topologies


Batch 1   Batch 2   Batch 3




   Once a batch is completed, commit the batch
Transactional topologies


Batch 1   Batch 2   Batch 3




Bolts can optionally implement “commit” method
Transactional topologies

Commit 1   Commit 1   Commit 2   Commit 3   Commit 4   Commit 4




   Commits are ordered. If there’s a failure during
    commit, the whole batch + commit is retried
Example
Example

       New instance of this object
      for every transaction attempt
Example



          Aggregate the count for
                this batch
Example


      Only update database if
       transaction ids differ
Example


     This enables idempotency since
          commits are ordered
Example


      (Credit goes to Kafka guys
       for figuring out this trick)
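The store-the-txid trick from the preceding slides can be sketched directly — a plain dict stands in for Cassandra, and because commits are strongly ordered, finding the current transaction id already stored means this exact batch was applied before and can be skipped:

```python
class TransactionalCounter:
    """Idempotent counting under at-least-once delivery: store the
    transaction id alongside each count. Since commits are ordered, a
    matching stored txid proves this batch was already applied."""
    def __init__(self):
        self.db = {}   # stands in for Cassandra: key -> (count, txid)

    def commit(self, key, batch_count, txid):
        count, stored_txid = self.db.get(key, (0, None))
        if stored_txid == txid:
            return count                  # replayed batch: do nothing
        count += batch_count
        self.db[key] = (count, txid)
        return count

c = TransactionalCounter()
first = c.commit("apple", 3, txid=1)
replay = c.commit("apple", 3, txid=1)     # same batch replayed after a failure
second = c.commit("apple", 2, txid=2)
```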
Transactional topologies


Multiple batches can be processed in parallel,
but commits are guaranteed to be ordered
Transactional topologies

• Will be available in next version of Storm
  (0.7.0)
• Requires a source queue that can replay
  identical batches of messages
• Aiming for first TransactionalSpout
  implementation to use Kafka
Storm UI
Storm on EC2


https://github.com/nathanmarz/storm-deploy




          One-click deploy tool
Starter code


https://github.com/nathanmarz/storm-starter




            Example topologies
Documentation
Ecosystem

• Scala, JRuby, and Clojure DSLs
• Kestrel, AMQP, JMS, and other spout adapters
• Serializers
• Multilang adapters
• Cassandra, MongoDB integration
Questions?


http://github.com/nathanmarz/storm
