0% found this document useful (0 votes)

239 views

Apache Kafka Description

Apache Kafka is a distributed publish-subscribe messaging system that can be used for building real-time data pipelines and streaming applications. It is designed to be fast, scalable, and durable. Kafka maintains feeds of messages in categories called topics that can be published to by producers and consumed from by subscribers. It uses a distributed commit log that provides ordering and replication to ensure reliability.

Uploaded by

Roy Antonius

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

239 views

Apache Kafka Description

Uploaded by

Roy Antonius

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 36

Apache Kafka

A high throughput distributed messaging system

What is Kakfa?

 Kafka is a distributed publish-subscribe messaging system rethought as a distributed

commit log.
 It’s designed to be
 Fast
 Scalable
 Durable
 When used in the right way and for the right use case, Kafka has unique attributes
that make it a highly attractive option for data integration.
Publish subscribe messaging system

 Kafka maintains feeds of messages in categories called topics

 Producers are processes that publish messages to one or more topics
 Consumers are processes that subscribe to topics and process the feed of
published messages

Message Subscriber

Publisher Message Topic Message Subscriber

Message Subscriber
Kafka cluster

 Since Kafka is distributed in nature, Kafka is run as a cluster.

 A cluster is typically comprised multiple servers; each of which is called a broker.
 Communication between the clients and the servers takes place over TCP protocol

Producer Consumer
Kafka cluster
Broker 1
Topic 1 Topic 2
Producer Consumer
Broker 2
Topic 1 Topic 2

Zookeeper
Producer Consumer
Deep dive into high level abstractions
Topic
 To balance load, a topic is divided into
multiple partitions and replicated
across brokers.
 Partitions are ordered, immutable
sequences of messages that’s
continually appended i.e. a commit log.
 The messages in the partitions are each
assigned a sequential id number called
the offset that uniquely identifies each
message within the partition.
Distribution and partitions

 Partitions allow a topic’s log to scale beyond a size that will fit on a single server (i.e. a
broker) and act as the unit of parallelism
 The partitions of a topic are distributed over the brokers in the Kafka cluster where each
broker handles data and requests for a share of the partitions.
 For fault tolerance, each partition is replicated across a configurable number of brokers.
Distribution and fault tolerance

 Each partition has one server which acts as the "leader" and zero or more servers
which act as "followers".
 The leader handles all read and write requests for the partition while the followers
passively replicate the leader.
 If the leader fails, one of the followers will automatically become the new leader.
 Each server acts as a leader for some of its partitions and a follower for others so load
is well balanced within the cluster.
Retention

 The Kafka cluster retains all published messages—whether or not they have been
consumed—for a configurable period of time; after which it will be discarded to
free up space.
 Metadata retained on a per-consumer basis is the position of the consumer in the
log, called the offset; which is controlled by consumer.
 Normally a consumer will advance its offset linearly as it reads messages, but it can
consume messages in any order it likes.
 Kafka consumers can come and go without much impact on the cluster or on other
consumers.
Producers

 Producers publish data to the topics by assigning messages to a partition within the
topic either in a round-robin fashion or according to some semantic partition function
(say based on some key in the message).
Consumers

 Kafka offers a single consumer abstraction called consumer group that generalises
both queue and topic.
 Consumers label themselves with a consumer group name.
 Each message published to a topic is delivered to one consumer instance within each
subscribing consumer group.
 If all the consumer instances have the same consumer group, then this works just like
a traditional queue balancing load over the consumers.
 If all the consumer instances have different consumer groups, then this works like
publish-subscribe and all messages are broadcast to all consumers.
Consumer groups

 Topics have a small number of consumer groups, one for each logical subscriber.
 Each group is composed of many consumer instances for scalability and fault tolerance.
Ordering guarantees

 Kafka assigns partitions in a topic to consumers in a consumer group so, each partition is
consumed by exactly one consumer in the group.
 Limitation: there cannot be more consumer instances in a consumer group than partitions.
 Provides a total order over messages within a partition, not between different partitions in
a topic.
Comaprison

Kafka JMS message broker; Rabbit MQ

A fire hose of events arriving at rate of Messages arriving at a rate of 20k+/sec
approximately 100k+/sec
‘At least once‘ processed as data is read Exactly once processed by consumers
with an offset within a partition.
Producer-centric. Doesn't have message Broker-centric. Uses the broker itself to
acknowledgements as consumers track maintain state of what's consumed (via
messages consumed. message acknowledgements)
Supports both online and batch Consumers are mostly online, and any
consumers that may be online or offline. It messages "in wait" (persistent or not) are
also supports producer message batching held opaquely.
- it's designed for holding and distributing
large volumes of messages at a very low
latency.
Comaprison

Kafka JMS message broker; Rabbit MQ

Provides a rudimentary routing. It uses Provides rich routing capabilities with
topic for exchanges. Advanced Message Queuing Protocol’s
(AMQP) exchange, binding and queuing
model.
Makes distributed cluster explicit, by Makes the distributed cluster transparent,
forcing the producer to know it is as if it were a virtual broker
partitioning a topic's messages across
several nodes.
Preserves ordered delivery within a Almost always unordered delivery. AMQP
partition model says "one producer channel, one
exchange, one queue, one consumer
channel" is required for in-order delivery
Throttling is un-necessary

 The whole job of Kafka is to provide a "shock absorber" between the flood of
events and those who want to consume them in their own way.
Performance benchmark

 500,000 messages published per second

 22,000 messages consumed per second
 on a 2-node cluster
 with 6-disk RAID 10.
 See research.microsoft.com/en-
us/um/people/srikanth/netdb11/netdb11papers/net
db11-final12.pdf
Key benefits

 Horizontally scalable
 It’s a distributed system can be elastically and transparently expanded with no downtime
 High throughput
 High throughput is provided for both publishing and subscribing, due to disk structures
that provide constant performance even with many terabytes of stored messages
 Reliable delivery
 Persists messages on disk, and provides intra-cluster replication
 Supports large number of subscribers and automatically balances consumers in case of
failure.
Use cases

 Common use cases include

1. Stream processing, Event sourcing or a replacement for a more traditional message
broker
2. Website activity tracking - original use case for Kafka
3. Metrics collection and monitoring - centralized feeds of operational data
4. Log aggregation
Getting practical
Download and extract Kafka

 Download the archive from

kafka.apache.org/downloads.html and
extract it
Kafka uses ZooKeeper for cluster coordination
 Kafka uses ZooKeeper; which enables
highly reliable distributed coordination so,
one needs to first start a ZooKeeper server.
 Kafka bundles a single-node ZooKeeper
instance. Single node zookeeper cluster
does NOT run a leader and a follower.
 Typically exchanged metadata include
 Kafka broker addresses
 Consumed messages offset
Common Challenges faced by distributed
system
 Outages
 Co-ordination of tasks
 Reduction of operational complexity
 Consistency and ordering guarantees

Zookeeper to rescue
Apache Zookeeper: Definition

 Centralised service for

 Maintaining configuration information
 Naming
 Distributed synchronisation and
 providing group services.
Apache Zookeeper: Features

 Distributed consistent data store which favours consistency over everything else.
 High availability - Tolerates a minority of an ensemble members being unavailable and
continues to function correctly.
 In an ensemble of n members where n is an odd number, the loss of (n-1)/2 members can be
tolerated.
 High performance - All the data is stored in memory and benchmarked at 50k ops/sec but
the numbers really depend on your servers and network
 Tuned for read heavy write light work load. Reads are served from the node to which a client a
connected.
 Provides strictly ordered access for data.
 Atomic write guarantees in the order they're sent to zookeeper.
 Writes are acknowledged and changes are also seen in the order they occurred.
Apache Zookeeper: Operation basics

 A cluster is an ensemble with a leaders and several followers

 Read requests are serviced from each server using its local replica but write request are forwarded to a
leader. When the leader receives a write request, it calculates what the state of the system is when the
write is to be applied and transforms this into a transaction that captures this new state.
 When zookeeper starts, it goes through a loading algorithm where by one node of the cluster is
elected to act as the leader. At any given point in time, only one node acts as a leader.
Apache Zookeeper: Operation basics

 Clients create a state full session (i.e. with heartbeats through an open socket) when they
connect to a node of an ensemble. Number of open sockets available on a zookeeper node will
limit the number of clients that connect to it.
 When a cluster member dies, clients notice a disconnect event and thus reconnect themselves
to another member of the quorum.
 Session (i.e. state of the client connected to node of an ensemble) stay alive when the client
goes down, as the session events go through the leader and gets replicated in a cluster onto
another node.
 When the leader goes down, remaining members of the cluster will re-elect a new leader using
a atomic broadcast consensus algorithm. Cluster remains unavailable only when it re-elects a
new leader.
1. Start ZooKeeper
2. Set-up a cluster with 3 brokers
2. Adjust broker
configuration
files
3. Start
kafka server
1
3. Start
kafka server
2
3. Start
kafka server
3
3. Servers 1, 2 and 3 are started and running
4. Create a kakfa topic, list topics and describe one

5. Start a producer and publish some messages

6. Start a consumer and process messages

Log directories for each of the broker instances

Understanding Apache Kafka White Paper
No ratings yet
Understanding Apache Kafka White Paper
7 pages
Exploring Hadoop Ecosystem (Volume 2): Stream Processing
From Everand
Exploring Hadoop Ecosystem (Volume 2): Stream Processing
Wei Liu
No ratings yet
Kafka Low Level Architecture
No ratings yet
Kafka Low Level Architecture
52 pages
Apache Kafka
No ratings yet
Apache Kafka
9 pages
Kafka Sparkstreaming
No ratings yet
Kafka Sparkstreaming
75 pages
Configuring Kafka For High Throughput
No ratings yet
Configuring Kafka For High Throughput
11 pages
Getting To Know Kafka: Ola Is The First Course in The Series of Courses Covering All The Aspects of Kafka
No ratings yet
Getting To Know Kafka: Ola Is The First Course in The Series of Courses Covering All The Aspects of Kafka
23 pages
Apache Kafka Tutorial
No ratings yet
Apache Kafka Tutorial
6 pages
Apache Kafka Key Concepts
100% (1)
Apache Kafka Key Concepts
8 pages
Apache Kafka 101
No ratings yet
Apache Kafka 101
25 pages
Kafka My Kafka Note v67
No ratings yet
Kafka My Kafka Note v67
55 pages
Top Answers To Kafka Interview Questions
No ratings yet
Top Answers To Kafka Interview Questions
3 pages
Chapter 1 - Introduction To KAFKA: Objectives
No ratings yet
Chapter 1 - Introduction To KAFKA: Objectives
17 pages
Documentation
No ratings yet
Documentation
105 pages
Mastering Kafka Streams: From Basics to Expert Proficiency
From Everand
Mastering Kafka Streams: From Basics to Expert Proficiency
William Smith
No ratings yet
Kafka Up and Running for Network DevOps: Set Your Network Data in Motion
From Everand
Kafka Up and Running for Network DevOps: Set Your Network Data in Motion
Eric Chou
No ratings yet
Kafka PDF
No ratings yet
Kafka PDF
106 pages
Stream Processing Using Kafka
No ratings yet
Stream Processing Using Kafka
46 pages
Apache Kafka
No ratings yet
Apache Kafka
6 pages
KAFKA
No ratings yet
KAFKA
22 pages
Kafka
No ratings yet
Kafka
50 pages
Kafka and Spark Streaming
No ratings yet
Kafka and Spark Streaming
45 pages
Apache Spark Streaming Presentation
100% (1)
Apache Spark Streaming Presentation
28 pages
Learning Apache Kafka - Second Edition - Sample Chapter
No ratings yet
Learning Apache Kafka - Second Edition - Sample Chapter
12 pages
Cloudurable Kafka Tutorial v1 PDF
No ratings yet
Cloudurable Kafka Tutorial v1 PDF
79 pages
Chandan Prakash's Blog
No ratings yet
Chandan Prakash's Blog
4 pages
Kafka Cloudera Documentation
100% (1)
Kafka Cloudera Documentation
175 pages
Top 10 Kafka Problems
No ratings yet
Top 10 Kafka Problems
3 pages
Apache Kafka - Basic Operations
No ratings yet
Apache Kafka - Basic Operations
6 pages
Apache Kafka Cookbook - Sample Chapter
100% (1)
Apache Kafka Cookbook - Sample Chapter
14 pages
Hibernate
No ratings yet
Hibernate
161 pages
Message Queues (ActiveMQs and Kafka)
No ratings yet
Message Queues (ActiveMQs and Kafka)
7 pages
Kafka Internals
No ratings yet
Kafka Internals
30 pages
Apache Kafka
No ratings yet
Apache Kafka
130 pages
Micro Services
No ratings yet
Micro Services
92 pages
5 Kafka Producer Advanced
No ratings yet
5 Kafka Producer Advanced
152 pages
Apache Kafka
No ratings yet
Apache Kafka
245 pages
Apache Kafka Interview Questions
No ratings yet
Apache Kafka Interview Questions
5 pages
Integrating Apache Nifi and Apache Kafka
No ratings yet
Integrating Apache Nifi and Apache Kafka
5 pages
Apache Cassandra Certification
No ratings yet
Apache Cassandra Certification
0 pages
Apache Kafka
No ratings yet
Apache Kafka
17 pages
Kafka Interview Questions
No ratings yet
Kafka Interview Questions
11 pages
Slide 5-6 Kafka
No ratings yet
Slide 5-6 Kafka
111 pages
Handle Large Messages in Apache Kafka
No ratings yet
Handle Large Messages in Apache Kafka
59 pages
Bigquery: Introducing Powerful New Enterprise Data Warehousing Features
No ratings yet
Bigquery: Introducing Powerful New Enterprise Data Warehousing Features
6 pages
Complex Event Processing With Apache Flink Presentation
No ratings yet
Complex Event Processing With Apache Flink Presentation
49 pages
Kafka Producer Internals: Find Answers On The Fly, or Master Something New. Subscribe Today
No ratings yet
Kafka Producer Internals: Find Answers On The Fly, or Master Something New. Subscribe Today
1 page
Apache Kafka
100% (2)
Apache Kafka
33 pages
Apache Kafka
No ratings yet
Apache Kafka
32 pages
Key Features: General-Purpose Fast Cluster Computing Platform
No ratings yet
Key Features: General-Purpose Fast Cluster Computing Platform
16 pages
Certification
No ratings yet
Certification
16 pages
Spark Tuning
No ratings yet
Spark Tuning
26 pages
Apache Kafka Tutorial
No ratings yet
Apache Kafka Tutorial
3 pages
Cassandra Tutorial
No ratings yet
Cassandra Tutorial
27 pages
MongoDB Datatypes
No ratings yet
MongoDB Datatypes
14 pages
Camel Microservices With Spring Boot and Kubernetes
No ratings yet
Camel Microservices With Spring Boot and Kubernetes
67 pages
Hibernate Notes
No ratings yet
Hibernate Notes
4 pages
A Visual Introduction To Apache Kafka PDF
No ratings yet
A Visual Introduction To Apache Kafka PDF
84 pages
CIS Amazon Web Services Three-Tier Web Architecture Benchmark v1.0.0
No ratings yet
CIS Amazon Web Services Three-Tier Web Architecture Benchmark v1.0.0
215 pages
Kubernetes A Complete Guide
From Everand
Kubernetes A Complete Guide
Gerardus Blokdyk
No ratings yet
Safety Data Sheet Sds #: Ninjaflex Semiflex: 1. Product and Company Identification
No ratings yet
Safety Data Sheet Sds #: Ninjaflex Semiflex: 1. Product and Company Identification
5 pages
Introduction
No ratings yet
Introduction
1 page
MYP Self Study Questionnaire
No ratings yet
MYP Self Study Questionnaire
41 pages
IBM Tivoli Storage Manager For Mail Version 6.3.0
No ratings yet
IBM Tivoli Storage Manager For Mail Version 6.3.0
86 pages
Main Relief Valve and Waste Cone - : Submittal Drawing
No ratings yet
Main Relief Valve and Waste Cone - : Submittal Drawing
2 pages
Advanced Fired Boilers: Oil and Gas
No ratings yet
Advanced Fired Boilers: Oil and Gas
12 pages
HIRADC JAYA EXCAVATOR JADI Icha
No ratings yet
HIRADC JAYA EXCAVATOR JADI Icha
15 pages
Choong-Sik Chung - Developing Digital Governance - South Korea As A Global Digital Government Leader-Routledge (2020) - Removed
No ratings yet
Choong-Sik Chung - Developing Digital Governance - South Korea As A Global Digital Government Leader-Routledge (2020) - Removed
13 pages
Personality Types Predicting Social Media Behavior Spredfast Smart Social Report
No ratings yet
Personality Types Predicting Social Media Behavior Spredfast Smart Social Report
15 pages
Comparision Old and New IRC SP 84
No ratings yet
Comparision Old and New IRC SP 84
2 pages
RESEARCH-FINAL
No ratings yet
RESEARCH-FINAL
35 pages
Block Poster
No ratings yet
Block Poster
2 pages
The Ceecec Handbook
No ratings yet
The Ceecec Handbook
533 pages
Dynamics of Rigid Bodies Part 1 (Edited)
No ratings yet
Dynamics of Rigid Bodies Part 1 (Edited)
1 page
5 Month LAC Plan
No ratings yet
5 Month LAC Plan
6 pages
Marine Biologist
No ratings yet
Marine Biologist
12 pages
Exploded View
No ratings yet
Exploded View
8 pages
Financial Institutions & Markets
No ratings yet
Financial Institutions & Markets
2 pages
sda final
No ratings yet
sda final
7 pages
VOS3000 Details Pricing
No ratings yet
VOS3000 Details Pricing
13 pages
AMUL
No ratings yet
AMUL
19 pages
Uson vs. Del Rosario, G.R. No. L-4963 January 29, 1953 Facts
No ratings yet
Uson vs. Del Rosario, G.R. No. L-4963 January 29, 1953 Facts
1 page
Dev Resme
No ratings yet
Dev Resme
3 pages
A SAFE APPROACH TO HOISTING
No ratings yet
A SAFE APPROACH TO HOISTING
14 pages
Employee Warning Notice Form in DOC
No ratings yet
Employee Warning Notice Form in DOC
1 page
Puh 641 Regular Exam 2017
No ratings yet
Puh 641 Regular Exam 2017
4 pages
Practice of Urban Aerial Ropeways: Work Report No.1
No ratings yet
Practice of Urban Aerial Ropeways: Work Report No.1
79 pages
Ies Oradea
No ratings yet
Ies Oradea
50 pages
Business Combinations-Conso at DOA Pt2
No ratings yet
Business Combinations-Conso at DOA Pt2
1 page
Bks MaiSL 11uu mx00 Xxaann
No ratings yet
Bks MaiSL 11uu mx00 Xxaann
7 pages