Introduction to Kafka and Zookeeper
June Hadoop Meetup
Rahul Jain
@rahuldausa
Who am I?
• Software Engineer
• Member of Core Technology @ IVY Comptech, Hyderabad, India
• 6 years of programming experience
• Areas of expertise/interest
– High-traffic web applications
– Java/J2EE
– Big data, NoSQL
– Information retrieval, machine learning
Agenda
• Overview
• Zookeeper
• Messaging System (Basic Concepts)
• Kafka
• Q&A
Apache Zookeeper™
What is a Distributed System
“A distributed system consists of multiple computers that communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal.”
– Wikipedia
What is Zookeeper
• An open-source, high-performance coordination service for distributed applications
• Centralized service for
– Configuration management
– Locks and synchronization, providing coordination between distributed systems
– Naming service (registry)
– Group membership
• Features
– Hierarchical namespace
– Watches on znodes (notification when a znode changes)
– Allows a cluster of nodes to be formed
• Supports a large volume of data retrieval and update requests
• http://zookeeper.apache.org/
Source : http://zookeeper.apache.org
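The configuration-management and watch features above can be illustrated with a small in-memory sketch. This is not the Zookeeper client API: `TinyConfigStore`, its methods, and the `/app/config/db_url` path are hypothetical names invented for illustration. The one behavior it borrows from Zookeeper is that a standard watch is a one-time trigger, so a client must re-register after each notification.

```python
from typing import Callable, Dict, List, Optional


class TinyConfigStore:
    """In-memory stand-in for a coordination service (hypothetical, not Zookeeper)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self._watches: Dict[str, List[Callable[[str], None]]] = {}

    def set(self, path: str, value: bytes) -> None:
        self._data[path] = value
        # One-shot semantics: fire every registered callback once, then drop it.
        for callback in self._watches.pop(path, []):
            callback(path)

    def get(self, path: str, watch: Optional[Callable[[str], None]] = None) -> bytes:
        # Reading a value can optionally leave a watch behind on that path.
        if watch is not None:
            self._watches.setdefault(path, []).append(watch)
        return self._data[path]


store = TinyConfigStore()
store.set("/app/config/db_url", b"db-1:5432")

events: List[str] = []
value = store.get("/app/config/db_url", watch=events.append)

store.set("/app/config/db_url", b"db-2:5432")  # the watch fires here
store.set("/app/config/db_url", b"db-3:5432")  # no watch is left, nothing fires
```

After the two updates, `events` holds a single entry, which is why a client that wants continuous notifications re-reads the value and sets a new watch inside its callback.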
Zookeeper Use cases
• Configuration management
– Cluster member nodes bootstrap their configuration from a central source
• Distributed cluster management
– Node join/leave
– Node status in real time
• Naming service, e.g. DNS
• Distributed synchronization: locks, barriers
• Leader election
• Centralized and highly reliable registry
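Leader election from the list above is commonly built on ephemeral sequential znodes: each candidate creates a node that receives an increasing sequence number, the candidate with the lowest number leads, and a node that leaves (or whose session dies) loses its entry. The following is a minimal in-memory sketch of that idea only; `ElectionSim` and its method names are hypothetical and no Zookeeper calls are made.

```python
from typing import Dict, Optional


class ElectionSim:
    """Simulates lowest-sequence-number leader election (illustrative only)."""

    def __init__(self) -> None:
        self._next_seq = 0
        self._candidates: Dict[int, str] = {}  # sequence number -> node name

    def join(self, node: str) -> int:
        # Stand-in for creating an ephemeral sequential znode.
        seq = self._next_seq
        self._next_seq += 1
        self._candidates[seq] = node
        return seq

    def leave(self, seq: int) -> None:
        # Stand-in for the ephemeral znode vanishing when the session ends.
        self._candidates.pop(seq, None)

    def leader(self) -> Optional[str]:
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]


election = ElectionSim()
seq_a = election.join("node-a")
seq_b = election.join("node-b")
seq_c = election.join("node-c")
first_leader = election.leader()      # node-a holds the lowest sequence number
election.leave(seq_a)                 # node-a goes away
second_leader = election.leader()     # leadership passes to node-b
```

Sequence numbers are never reused, so a node that rejoins queues up behind the current leader instead of displacing it.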
Zookeeper Data Model
• Hierarchical namespace
• Each node is called a “znode”
• Each znode holds data (stored as a byte[] array) and can have children
• Each znode
– Maintains a “Stat” structure with version numbers for data changes and ACL changes, plus timestamps
– The version number increases with each change
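The data model can be sketched as a small in-memory tree. The classes below are illustrative, not the Zookeeper client library. The field names `version`, `aversion` and `mtime` follow Zookeeper's Stat structure, and the convention that an expected version of -1 matches any version mirrors Zookeeper's conditional update, but everything else (`ZnodeTree`, `BadVersion`, the paths) is an assumption made for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Stat:
    version: int = 0    # incremented on every data change
    aversion: int = 0   # incremented on every ACL change (unused in this sketch)
    mtime: float = field(default_factory=time.time)  # last modification time


@dataclass
class Znode:
    data: bytes = b""
    stat: Stat = field(default_factory=Stat)
    children: Dict[str, "Znode"] = field(default_factory=dict)


class BadVersion(Exception):
    """Raised when a conditional update names a stale version."""


class ZnodeTree:
    def __init__(self) -> None:
        self.root = Znode()

    def _find(self, path: str) -> Znode:
        node = self.root
        for part in filter(None, path.split("/")):
            node = node.children[part]
        return node

    def create(self, path: str, data: bytes = b"") -> None:
        parent_path, _, name = path.rpartition("/")
        parent = self._find(parent_path)
        if name in parent.children:
            raise KeyError(f"{path} already exists")
        parent.children[name] = Znode(data=data)

    def set_data(self, path: str, data: bytes, expected_version: int = -1) -> int:
        node = self._find(path)
        # -1 means "any version"; otherwise the caller must have seen the latest.
        if expected_version != -1 and expected_version != node.stat.version:
            raise BadVersion(f"{path} is at version {node.stat.version}")
        node.data = data
        node.stat.version += 1
        node.stat.mtime = time.time()
        return node.stat.version


tree = ZnodeTree()
tree.create("/app")
tree.create("/app/config", b"v1")
new_version = tree.set_data("/app/config", b"v2")  # unconditional update
```

The version check is what makes optimistic concurrency possible: two writers that both read version 1 cannot both succeed with `expected_version=1`.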
Let’s recall the basic concepts of a messaging system
Point-to-Point Messaging (Queue)
Credit: http://fusesource.com/docs/broker/5.3/getting_started/FuseMBStartedKeyJMS.html
Publish-Subscribe Messaging (Topic)
Credit: http://fusesource.com/docs/broker/5.3/getting_started/FuseMBStartedKeyJMS.html
Apache Kafka
Overview
• An Apache project initially developed at LinkedIn
• Distributed publish-subscribe messaging system
• Designed for processing real-time activity stream data, e.g. logs and metrics collection
• Written in Scala
• Does not follow the JMS standard and does not use the JMS APIs
• Features
– Persistent messaging
– High throughput
– Supports both queue and topic semantics
– Uses Zookeeper to form a cluster of nodes (producer/consumer/broker)
– and many more…
• http://kafka.apache.org/
How it works
Credit : http://kafka.apache.org/design.html
Real-time transfer
[Figure: Producer, two Kafka Brokers, Zookeeper, Consumer1 and Consumer2 (Group1), Consumer3 and Consumer4 (Group2). Labels: “Update consumed message offset”, “Queue topology”, “Topic topology”.]
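The consumer groups in the diagram are how one system can offer both queue and topic semantics: every group receives every message (topic behavior across groups), while inside a group each message goes to only one member (queue behavior). The sketch below is a simplified in-memory model. The `deliver` function is hypothetical, and the per-message round-robin is an assumption standing in for Kafka's actual partition-to-consumer assignment.

```python
from collections import defaultdict
from typing import Dict, List


def deliver(messages: List[str], groups: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return a mapping of consumer name -> messages that consumer received."""
    received: Dict[str, List[str]] = defaultdict(list)
    for index, message in enumerate(messages):
        # Topic semantics: each group gets its own copy of the message.
        for members in groups.values():
            # Queue semantics: exactly one member of the group handles it.
            consumer = members[index % len(members)]
            received[consumer].append(message)
    return dict(received)


groups = {
    "Group1": ["Consumer1", "Consumer2"],
    "Group2": ["Consumer3", "Consumer4"],
}
result = deliver(["m0", "m1", "m2", "m3"], groups)
```

Putting all consumers into a single group gives a pure queue; giving each consumer its own group gives pure publish-subscribe.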
Design Elements
• Uses the filesystem cache
• Zero-copy transfer of messages
• Batching of messages
• Batch compression
• Automatic producer load balancing
• Brokers do not push messages to consumers; consumers pull (poll) messages from brokers
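Batching and batch compression work together: compressing many similar messages as one unit removes redundancy across messages, which per-message compression cannot do. A minimal sketch of the idea follows, using only the Python standard library. The JSON framing and the `encode_batch`/`decode_batch` helpers are assumptions for illustration; Kafka uses its own binary message format.

```python
import gzip
import json
from typing import List


def encode_batch(messages: List[str]) -> bytes:
    """Serialize a list of messages and compress them as a single unit."""
    payload = json.dumps(messages).encode("utf-8")
    return gzip.compress(payload)


def decode_batch(blob: bytes) -> List[str]:
    """Reverse encode_batch: decompress, then split back into messages."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Log-like messages share most of their content, as activity streams often do.
messages = [
    f'{{"level":"INFO","service":"web","msg":"request {i} ok"}}' for i in range(100)
]

batch_size = len(encode_batch(messages))
individual_size = sum(len(gzip.compress(m.encode("utf-8"))) for m in messages)
```

Comparing `batch_size` with `individual_size` shows the saving; the gap grows with how repetitive the messages are and how large the batch is, at the cost of some added latency while a batch fills.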
Design Elements (Contd.)
• Brokers and consumers form a cluster using Zookeeper
– More consumers and brokers can be added on the fly; the resulting cluster rebalancing is handled through Zookeeper
• Data is persisted in the broker
– Messages are not removed on consumption (they stay until the retention period expires), so if a consumer fails while consuming, the same message can be re-consumed later from the broker
• Simplified storage mechanism for messages
– Nothing is stored for each message per consumer
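Retention-based storage can be sketched as an append-only log plus one offset per consumer. Reading does not delete anything, so a consumer that fails before recording its progress simply reads the same messages again, and the only per-consumer state is a single number. The `RetainedLog` class is hypothetical; in particular, keeping the `committed` offsets inside the log object is a simplification of the diagram's arrangement, where consumers update their offsets in Zookeeper.

```python
from typing import Dict, List, Tuple


class RetainedLog:
    """Append-only log with time-based retention (illustrative only)."""

    def __init__(self, retention_seconds: float) -> None:
        self.retention_seconds = retention_seconds
        self._entries: List[Tuple[int, float, str]] = []  # (offset, timestamp, message)
        self._next_offset = 0
        self.committed: Dict[str, int] = {}  # consumer -> next offset to read

    def append(self, message: str, now: float) -> int:
        offset = self._next_offset
        self._entries.append((offset, now, message))
        self._next_offset += 1
        return offset

    def read(self, from_offset: int, max_messages: int) -> List[Tuple[int, str]]:
        # Reading never removes messages from the log.
        batch = [(o, m) for o, _, m in self._entries if o >= from_offset]
        return batch[:max_messages]

    def commit(self, consumer: str, next_offset: int) -> None:
        self.committed[consumer] = next_offset

    def expire(self, now: float) -> None:
        # Only age removes messages, regardless of who has consumed them.
        self._entries = [
            e for e in self._entries if now - e[1] < self.retention_seconds
        ]


log = RetainedLog(retention_seconds=60)
log.append("a", now=0)
log.append("b", now=10)
log.append("c", now=20)

first_try = log.read(log.committed.get("c1", 0), max_messages=2)
# Suppose consumer c1 crashes here, before committing its progress.
second_try = log.read(log.committed.get("c1", 0), max_messages=2)  # same messages
log.commit("c1", second_try[-1][0] + 1)  # record progress after processing
```

Because the broker side keeps no record of which consumer has seen which message, adding another consumer group costs only one more offset.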
Performance Numbers
Credit : http://research.microsoft.com/en-us/UM/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
[Figures: Producer Performance, Consumer Performance]
Questions?
@rahuldausa on Twitter and SlideShare
http://www.linkedin.com/in/rahuldausa
