Getting To Know Kafka: Ola Is The First Course in The Series of Courses Covering All The Aspects of Kafka
In recent years, businesses have had an increasing need to handle real-time data feeds, i.e.
to analyze and process data and events as they happen.
A messaging system is a medium that allows data transfer from one application to another
so that the applications can focus on data without worrying about how to share it.
In a point-to-point (queue) messaging system, however, a particular message can be consumed by at most
one receiver.
Once the receiver consumes the message, it disappears from the queue.
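The one-receiver restriction described above can be sketched with a plain in-memory queue. This is only an illustration in Python, not Kafka code; the names `queue` and `receive` are hypothetical.

```python
from collections import deque

# Toy point-to-point queue: each message is handed to exactly one receiver.
queue = deque()
queue.append("order-1")
queue.append("order-2")


def receive(q):
    """Remove and return the oldest message, or None if the queue is empty."""
    return q.popleft() if q else None


got_by_receiver_a = receive(queue)  # receiver A takes the first message
got_by_receiver_b = receive(queue)  # receiver B can only get the next one
nothing_left = receive(queue)       # consumed messages are gone
```

Once both messages have been received, the queue is empty, so a third receiver gets nothing.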
What is Kafka?
Apache Kafka is a distributed publish-subscribe messaging system used for collecting and
delivering high volumes of data with low latency, similar to a traditional message broker.
Apache Kafka originated at LinkedIn and became an open source project in 2011. Kafka is
written in Scala and Java.
Benefits of Kafka
Reliability: Kafka's distributed design, topic partitioning, and data replication over
servers make it reliable.
Scalability: A Kafka system runs as a cluster of brokers. The number of brokers can
grow over time as data volume increases. The failure of an individual broker in a cluster
is handled by the system, so service continues uninterrupted.
Durability: Disk-based data retention makes Kafka durable. Messages remain on
disk according to the retention rule configured on a per-topic basis. Even if a consumer
falls behind for any reason, the data continues to reside on the broker until the retention
period expires and is not lost.
Kafka works well as a replacement for a traditional message broker.
Compared to other messaging systems, Kafka
has better performance, built-in partitioning, fault
tolerance, and replication.
Website Activity Tracking
Log Aggregation
This includes collecting log files from the server
and saving them in a central file system.
Kafka reads that data, abstracts it into
a stream of messages, and makes it
available in a standard format for consumers.
Stream Processing
Add the newly created user kafka to the sudo group to provide all the privileges
required to install Kafka's dependencies. You can add kafka to the sudo group using
the adduser command:
adduser kafka sudo
su - kafka
Apache Kafka needs a Java runtime environment. Install the default-jre package using the apt-get
command. Type the following command:
Installing Prerequisites...
Step 3 — Install ZooKeeper
Install the ZooKeeper package, which is available in Ubuntu's default repositories. Type the
following command:
At the Telnet prompt, type in ruok and press ENTER. If everything's fine, ZooKeeper will
say imok and end the Telnet session.
mkdir -p ~/Downloads
Download the Kafka binaries using wget.
wget "http://www-eu.apache.org/dist/kafka/0.11.0.1/kafka_2.11-0.11.0.1.tgz" -O ~/Downloads/kafka.tgz
Create the base directory kafka for Kafka installation and change to this directory.
Extract the archive you have downloaded using the tar command.
vi ~/kafka/config/server.properties
The default configuration of Kafka doesn't allow topic deletion. We need to configure Kafka for
that. You can add the following line of code at the end of the
file ~/kafka/config/server.properties
delete.topic.enable = true
Wait for a few seconds. Once the server starts, you will see the following messages
in ~/kafka/kafka.log:
[2015-07-29 06:02:41,736] INFO New leader is 0
(kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
Now you have a Kafka server which is listening on port 9092. You may use
Fundamental Components
Before going deeper into Kafka, we need to understand some of the frequently
used terms in Kafka, which are as follows:
Topic
A Kafka topic is a category or feed name under which messages are stored.
A Kafka producer publishes messages to a topic, which may be subscribed to by zero or
more consumers.
As shown in the figure, the Kafka cluster maintains a partitioned log for each topic.
Each of the partitions contains messages or records in an immutable ordered
sequence.
Partitions
A topic partition is a structured commit log to which the records are continually appended.
For each topic, Kafka keeps a minimum of one partition.
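The append-only behaviour of a partition can be pictured with a small in-memory model. This is a sketch in Python under the assumption that a list stands in for the on-disk log; `PartitionLog` is a hypothetical name, not part of any Kafka client.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PartitionLog:
    """Toy stand-in for one topic partition (not Kafka's real storage format)."""
    records: List[str] = field(default_factory=list)

    def append(self, value: str) -> int:
        """Append a record at the end and return the offset assigned to it."""
        self.records.append(value)
        return len(self.records) - 1

    def read(self, offset: int) -> str:
        """Return the record stored at the given offset."""
        return self.records[offset]


log = PartitionLog()
first_offset = log.append("page_view")   # first record gets offset 0
second_offset = log.append("click")      # next record gets offset 1
```

Records are only ever added at the end, so an offset, once assigned, always refers to the same record.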
What is a Producer?
Kafka producers publish messages to one or more Kafka topics.
Every time a producer sends a message to a broker, the broker appends it to the
corresponding topic's partition. Producers can also send messages to a partition of
their choice.
Producers write to the leader of a partition. Because partition leaders are spread
across brokers, writes are served by different brokers, which helps in load balancing.
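One way to picture how a producer decides which partition a message goes to is sketched below. This is a simplified illustration, not the Kafka client's partitioner: the function `choose_partition` is hypothetical, CRC32 is used as the hash only because it is in the Python standard library, and the round-robin fallback for messages without a key is an assumption.

```python
import zlib
from typing import Optional


def choose_partition(key: Optional[bytes],
                     num_partitions: int,
                     explicit_partition: Optional[int] = None,
                     round_robin_counter: int = 0) -> int:
    """Pick a partition for a message (toy logic, not Kafka's own partitioner)."""
    # 1. The producer named a partition itself: honour that choice.
    if explicit_partition is not None:
        if not 0 <= explicit_partition < num_partitions:
            raise ValueError("partition out of range")
        return explicit_partition
    # 2. A key is present: hash it so the same key always maps to the same partition.
    if key is not None:
        return zlib.crc32(key) % num_partitions
    # 3. No key: spread messages across partitions in turn (assumed fallback).
    return round_robin_counter % num_partitions


p_first = choose_partition(b"user-42", 4)
p_again = choose_partition(b"user-42", 4)
p_explicit = choose_partition(b"user-42", 4, explicit_partition=2)
```

Because the hash of a key is stable, all messages with the same key end up in the same partition and therefore keep their relative order.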
As the leader for partition 0, the broker replicates that write to the available followers, broker 2 and broker 3.
Once every replica has acknowledged the message, the replicas are in sync.
Each message published on a topic will be delivered to one consumer instance within each
subscribed consumer group. These consumer instances may either be in separate processes
or on separate machines.
If all the consumer instances are within the same consumer group, then the records
will be load balanced over the instances.
If all the consumer instances are within different consumer groups, then each record
will be broadcast to all the consumer processes.
Partitions 0 and 3 are kept in server 1 and partitions 1 and 2 are kept in server 2.
There are two consumer groups - A and B. A is composed of two consumers, and B of four
consumers.
Consumer Group A consists of two consumers each reading two partitions and
together reading all the four partitions of the topic.
On the other hand, Consumer Group B has the same number of consumers and
partitions, each reading exactly one partition from the topic.
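The split described for Consumer Groups A and B can be reproduced with a small assignment function. This is a sketch in Python that assumes a simple round-robin strategy; the function name `assign_partitions` and the consumer names are hypothetical, and Kafka's own assignors may distribute partitions differently.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Give every partition to exactly one consumer of the group, round-robin."""
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


topic_partitions = [0, 1, 2, 3]
group_a = assign_partitions(topic_partitions, ["A1", "A2"])
group_b = assign_partitions(topic_partitions, ["B1", "B2", "B3", "B4"])
```

With two consumers each one reads two partitions; with four consumers each one reads exactly one. In both cases the group as a whole covers all four partitions.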
Consumer Offset
The Offset or position of a consumer in the partition log is the only metadata
retained for that consumer.
When a consumer reads records, the offset advances linearly along the partition log.
The consumer can read data from any position in the partition log - it can move back to
an older offset to re-read older data or jump ahead to the latest record and start
consuming from there.
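The way an offset advances, and how a consumer can move it, can be sketched with a toy consumer over a list. This is an illustration in Python only; `ToyConsumer` and its methods are hypothetical and merely mimic the idea of reading and seeking in a partition log.

```python
from typing import List


class ToyConsumer:
    """Reads from an in-memory partition log and tracks its own offset."""

    def __init__(self, log: List[str]) -> None:
        self.log = log
        self.offset = 0  # position of the next record to read

    def poll(self, max_records: int = 1) -> List[str]:
        """Return up to max_records records and advance the offset past them."""
        batch = self.log[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch

    def seek(self, offset: int) -> None:
        """Move to any position in the log, e.g. back to re-read older data."""
        if not 0 <= offset <= len(self.log):
            raise ValueError("offset out of range")
        self.offset = offset

    def seek_to_end(self) -> None:
        """Skip ahead so that only records appended from now on are read."""
        self.offset = len(self.log)


consumer = ToyConsumer(["a", "b", "c", "d"])
first_batch = consumer.poll(2)   # offset moves from 0 to 2
consumer.seek(0)                 # rewind to re-read older data
replayed = consumer.poll(1)
consumer.seek_to_end()           # jump ahead to the latest position
after_skip = consumer.poll(1)
```

The log itself is never modified by reading; only the consumer's position changes.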
Concept Of Consumer
How Consumer reads data from Topic Partitions
Consumer Groups
Consumer Offset
Kafka Broker
Being a distributed system, Kafka runs in a cluster of machines, where each node in
the cluster is called a Kafka broker.
Each broker may hold zero or more partitions of a topic. For example, if you have a
topic with 24 partitions and a cluster with 3 Kafka brokers, each one will hold 8
partitions of the topic.
Kafka and Zookeeper will handle the load distributions among these partitions and
redistribute them correctly when any broker goes down.
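The 24-partition example above works out as follows. This is a sketch in Python that assumes a plain round-robin placement and ignores replication; `spread_partitions` and the broker names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def spread_partitions(num_partitions: int, brokers: List[str]) -> Dict[int, str]:
    """Place each partition on one broker, cycling through the broker list."""
    return {p: brokers[p % len(brokers)] for p in range(num_partitions)}


placement = spread_partitions(24, ["broker-1", "broker-2", "broker-3"])
per_broker = Counter(placement.values())

# If broker-2 goes down, the same partitions are spread over the remaining brokers.
after_failure = Counter(spread_partitions(24, ["broker-1", "broker-3"]).values())
```

With three brokers each holds 8 of the 24 partitions; with two remaining brokers each would hold 12.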
A leader is the node that handles all read and write requests for a given partition.
It updates the followers or replicas with new data.
If a leader fails, a follower takes over as the new leader.
Assume we have a Kafka cluster with three brokers, and topic partitions replicated over them.
As shown in the figure, for partition 0, broker 1 acts as the leader and brokers 2 & 3 are
followers (replicas).
The read and write requests for a partition are handled by the leader and the followers
replicate the leader across the nodes of the cluster.
Each broker in the cluster will be a leader for some of its partitions and a follower for
others, to maintain proper load balancing.
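The failover behaviour for partition 0 can be sketched as picking the first replica that is still alive. This is a deliberate simplification in Python: in a real cluster the controller chooses the new leader from the in-sync replicas, and the function name `elect_leader` is hypothetical.

```python
from typing import List, Optional, Set


def elect_leader(replicas: List[int], alive_brokers: Set[int]) -> Optional[int]:
    """Return the first replica (in preference order) that is still alive."""
    for broker_id in replicas:
        if broker_id in alive_brokers:
            return broker_id
    return None  # no replica is available, so the partition has no leader


replicas_p0 = [1, 2, 3]  # broker 1 is the preferred leader for partition 0
leader_before = elect_leader(replicas_p0, {1, 2, 3})
leader_after = elect_leader(replicas_p0, {2, 3})      # broker 1 has failed
leader_none = elect_leader(replicas_p0, set())        # every replica is down
```

While broker 1 is up it leads partition 0; when it fails, one of the followers takes over.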
What is Zookeeper ?
Zookeeper is a distributed centralized service that coordinates/manages large sets of
hosts.
1. Electing a controller :
The controller is one among the many brokers responsible for maintaining the
leader/follower relationship for all the partitions.
When a node crashes or shuts down, the controller tells other replicas to become
partition leaders, replacing the leaders on the node that is going away.
Zookeeper elects a controller, makes sure there is only one, and elects a new one if
it crashes.
2. Cluster membership:
Zookeeper monitors which brokers are alive and part of the cluster.
3. Topic configuration:
Zookeeper keeps track of topics, their partitions and replicas, which broker is the preferred leader,
and what configuration overrides are set for each topic.
4. Quotas:
Zookeeper tracks how much data each client is allowed to read and write.
5. ACLs:
Zookeeper tracks the following: Who is allowed to read and write to which topic, What
are the consumer groups which exist, Who are their respective members and What is
the latest offset each group received from each partition.
If all consumers belong to different consumer groups, then all the Consumer Groups will
consume messages (This is a Publish-Subscribe model ).
If all consumers belong to the same consumer group, then the partitions will be evenly
distributed among consumers in the consumer group. (This is a Queuing Model ).
2. Kafka broker stores all messages to the partitions configured for that topic, such that the
messages are equally divided among the partitions of the topic.
4. On subscription, Kafka will send the current offset of the topic to the consumer. It then
saves a copy of the current offset in Zookeeper ensemble.
5. The consumer then requests Kafka for new messages at regular intervals.
6. Once received from the producer, Kafka forwards the message to the consumer.
9. On receiving the acknowledgment, Kafka broker changes offset to the new value and
updates it in Zookeeper.
10. The above flow goes on repeating until the consumer stops the request.
11. At any time, the consumer can rewind/skip to the desired offset and get subsequent
messages.
2. Broker stores those messages to the partitions configured for the topic "Topic-01", such
that the messages are equally divided among the partitions of the topic.
3. A consumer with GroupId "Group-01" subscribes to the topic "Topic-01".
4. Kafka communicates with the consumer using the same steps as in the Publish-Subscribe
messaging system.
6. Now, the data is shared between the two consumers in the consumer group, such that
each topic partition is read by a consumer in the consumer group.
8. Now, if another consumer with the same GroupId Group-01 subscribes to the same
topic Topic-01, it has to wait until one of the other consumers within the consumer
group unsubscribes.
In Kafka, the consumer group divides processing of messages among its consumer
instances, similar to a queue. Again, Kafka broadcasts messages to all subscribing consumer
groups, as with Publish-Subscribe.
Thus, Kafka combines the strength of both these message models, enabling it to
easily scale.
Kafka assigns topic partitions to each consumer within the consumer group in such a
way that each partition is consumed by only one consumer in the group.
This guarantees that the consumer is the sole reader of that partition, consuming the
data in order.
For example, if retention period is set as three days, a record will be available for
consumption for three days after it is published. It will be discarded after the retention period.
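The three-day example can be expressed as a simple time-based filter. This is a sketch in Python, not how the broker actually deletes data; `apply_retention` and the sample records are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Record = Tuple[datetime, str]  # (publish time, value)


def apply_retention(records: List[Record], now: datetime, retention: timedelta) -> List[Record]:
    """Keep only the records published within the retention period."""
    cutoff = now - retention
    return [(published, value) for published, value in records if published >= cutoff]


now = datetime(2024, 1, 10, 12, 0)
records = [
    (datetime(2024, 1, 6, 12, 0), "old"),     # published 4 days ago
    (datetime(2024, 1, 8, 12, 0), "recent"),  # published 2 days ago
]
kept = apply_retention(records, now, timedelta(days=3))
```

Whether a record has already been consumed plays no role here; only its age relative to the retention period matters.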
Kafka in a nutshell
APIs
Kafka includes five core APIs:
Kafka exposes all its functionality over a language-independent protocol which has clients
available in many programming languages. However, only the Java clients are maintained as part
of the main Kafka project; the others are available as independent open source projects. A list of
non-Java clients is available here.
For example, an application might take in input streams of data and perform computations for
handling out-of-order data, reprocess input as code changes etc. and then output a stream of
transformed data.
The Streams API uses the producer and consumer APIs for input.
It uses Kafka for stateful storage.
For fault tolerance among stream processor instances, it uses the same consumer group mechanism.
Key Concepts
1. Stream - Primary abstraction in Kafka Streams, it represents an unbounded and
continuously updating data set. A stream contains a sequence of immutable data records
which are ordered and fault tolerant.
2. Stream processing application - Program using the Kafka Streams Library to implement
its computational logic through one or more processor topologies.
3. Processor topology - Graph of stream processors (nodes) that are connected by streams
(edges).
4. Stream processor - Node in the processor topology. It denotes a processing step to operate
on input stream data by receiving an input record at a time from its upstream processors in
the topology, applying transformations, and consequently producing output records to its
downstream processors.
1. Source Processor - Does not have any upstream processors. It produces an input
stream to its topology by consuming records from one or more Kafka topics and
forwards it to downstream processors.
2. Sink Processor - Does not have downstream processors. It sends any received
records from its upstream processors to a specified Kafka topic.
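The relationship between source, intermediate and sink processors can be sketched as a chain of generators. This is an illustration in plain Python, not the Kafka Streams library (which is a Java library); the function names and the lists standing in for topics are hypothetical.

```python
from typing import Iterable, Iterator, List


def source_processor(input_topic: Iterable[str]) -> Iterator[str]:
    """Source: no upstream processor, reads records from an input topic."""
    for record in input_topic:
        yield record


def uppercase_processor(upstream: Iterable[str]) -> Iterator[str]:
    """Stream processor: transforms one record at a time."""
    for record in upstream:
        yield record.upper()


def sink_processor(upstream: Iterable[str], output_topic: List[str]) -> None:
    """Sink: no downstream processor, writes records to an output topic."""
    for record in upstream:
        output_topic.append(record)


input_topic = ["login", "click"]   # stands in for a Kafka input topic
output_topic: List[str] = []       # stands in for a Kafka output topic

# Topology: source -> uppercase -> sink (nodes connected by streams).
sink_processor(uppercase_processor(source_processor(input_topic)), output_topic)
```

Each record flows through the graph one at a time, from the source at the top to the sink at the bottom.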
Quick Fact
Simple, lightweight client library.
No external dependencies on systems other than
Kafka.
Supports exactly-once processing semantics.
Supports event-time based windowing operations.
Course Summary
In this course, you have learned the fundamental concepts of Kafka like Topics, Partitions,
Producers & Consumers and how they work in sync.
The course also explained how Kafka can be used as a messaging, storage and stream
processing tool.
Quiz :
1. In Kafka, the communication between the clients and servers is done with ----- Protocol. TCP
2. Which one functions as a messaging system? KAFKA
3. Based on the classification of messages Kafka categorizes messages into TOPIC
4. __________ is the subset of the replicas list that is currently alive and caught-up to the leader.
IN-SYNC REPLICAS
5. Which of the following is incorrect ? SINGLE LINE
6. Each record in the partition is assigned a sequential id called as the _______. offset
7. Which service monitors Cluster membership ? ZOOKEEPER
8. Which is the node responsible for all reads and writes for the given partition ? leader
9. A _______ is a structured commit log to which records are appended continually. TOPIC
PARTITION
10. Which concept of Kafka helps scale processing and multi-subscription. CONSUMER GROUP
11. How are messages stored in topic partitions ? IMMUTABLE
12. Kafka is run as a cluster comprised of one or more servers each of which is called : BROKER
13. The _________ is one of the brokers and is responsible for maintaining the leader/follower relationship
for all the partitions. CONTROLLER
14. Kafka supports both 'queuing model' and 'publish-subscribe' model? T
15. If the retention policy is set to two days, then for the two days after a record is published, it is available
for consumption, after which it will be discarded to free up space. T
16. If multiple consumers subscribed for a topic and each belong to different Consumer Group, the
messages of the topic will be consumed by all consumers. This is: PUBLISH SUBSCRIBE
17. Kafka combines the strength of both queuing and publish-subscribe models, enabling it to scale easily.
T
18. When a consumer subscribe to a topic, kafka provides the current offset of the topic to : BOTH
19. A configurable ________ can be set to retain all published records in Kafka irrespective of whether they
have been consumed or not. RETENTION PERIOD
20. If only one 'Consumer Group' subscribed for a topic and there are lots of consumers in this Consumer
Group, messages of the topic will be evenly load balanced between consumers of the consumer group.
This is : QUEUING MODEL
21. Kafka also provides better ordering guarantees than a traditional messaging system using ________.
TOPIC PARTITION
22. Source Processor does not have any downstream processors. F
23. Which processor sends any received records from its upstream processors to a specified Kafka topic ?
SINK
24. A ________ is a logical abstraction for stream processing code. PROCESSOR TOPOLOGY
25. Kafka Streams supports exactly-once processing semantics. T
26. A _________ in Kafka reads streams of data from input topics, processes this data and produces
continual streams of data to output topics. STREAM PROCESSOR
27. Graph of stream processors (nodes) that are connected by streams (edges) is called ? PROCESSOR
TOPOLOGY
28. A ________ is a logical abstraction for stream processing code Processor Topology
29. A Kafka distribution with more than one broker is called as a Kafka cluster T
30. _______ are servers that replicate a partition log regardless of their role as leader or follower Replicas
31. In a _______, a pool of consumers may read from a server and each record goes to one of them; in
_________ the record is broadcast to all consumers. queue, publish-subscribe
32. Kafka Streams employs one-record-at-a-time processing to achieve millisecond processing latency
T
33. A _______ subscribes to a topic and consumes published messages by pulling data from the brokers.
consumer
34. The _______ allows an application to act as a stream processor, consuming an input stream from one
or more topics and producing an output stream to one or more output topics. Streams API
35. Point out the wrong statement The Kafka cluster does not retain all published messages.
36. Kafka can be used for which of the following All the options
37. The __________ property is the unique and permanent name of each node in the cluster
broker.id
38. Kafka cluster can enforce quota on requests to control the broker resources used by clients T
39. A consumer cannot reset to an older offset to reprocess data from the past or skip ahead to the most
recent record and start consuming from now F
40. Each record published to a topic is delivered to _______ consumer instance within each subscribing
consumer group. one
41. Sink processor does not have any upstream processors. FALSE
42. Each message published to a topic is delivered to _______ within each subscribing consumer group.
one consumer instance
43. The ________ allows building and running reusable producers or consumers that connect Kafka topics
to existing applications or data systems. Connector API
44. Kafka provides better ordering guarantees than a traditional messaging system using topic partitions
TRUE
45. Which processor consumes records from one or more Kafka topics and forwards it to downstream
processors ? Source
46. The ________ allows an application to publish a stream of records to one or more Kafka topics.
Producer API
47. If all the consumer instances have different consumer groups, then each record will be broadcast to all
the consumer processes. T
48. Producers can also send messages to a partition of their choice. T
49. Engineers at _________ developed and open sourced Kafka, LinkedIn
50. It is possible to delete a Kafka topic. TRUE
51. Which acknowledgement number shows that the leader should wait for the full set of in-sync replicas
to acknowledge the record. -1 (all)
52. In a cluster Kafka can work without Zookeeper F
53. "isr" is the set of "in-sync" replicas. T
54. Producers write to a single leader, so that each write is serviced by a separate broker. T
55. Kafka only provides a total order over records within a partition, not between different partitions in a
topic. T
56. _________ data retention makes Kafka a durable system Disk-based
57. Which method of Kafka Consumer class is used to manually assign a list of partitions to a consumer
assign()
58. Which is the correct order of steps to create a simple messaging system in Kafka i.Step1. Start the
ZooKeeper server. ii.Step 2: Start the kafka server. iii.Step 3:Run the producer and send some
messages. iv. Step 4: Create a topic. v. Step 5: Start the consumer and see the messages send from
producer.
59. __________keeps track of topics, its partitions and replicas, who is the preferred leader and what
configuration overrides are set for each topic Zookeeper
60. Which API supports managing and inspecting topics, brokers and other kafka objects AdminClient
API
61. A hashing-based Partitioner takes ___ and generates a hash to locate which partition the message
should go to Key
62. Which messaging semantics do Kafka use to handle failure to any broker in cluster? retries
63. The __________ allows an application to subscribe to one or more topics and process the stream of
records produced to them Consumer API
64. What is the default retention period for a Kafka topic 7 Days
65. Which configuration in Producer API controls the criteria under which requests are considered
complete? Acks
66. Which of the following statement is incorrect ? In Queue based messaging system message ordering is
lost during parallel processing.
67. Which one below is not a parameter to the Kafka cluster.ProducerRecord class constructor ? Offset
68. For stream processing, Kafka provides which of the following Streams API
69. A _____ is the primary abstraction in Kafka Streams, and represents an unbounded and continuously
updating data set. Stream
70. Kafka Streams has no external dependencies on systems other than Apache Kafka itself T
71. Kafka has push based consumer where data is pushed from broker to consumer F
72. Each record consists of a key, a value, and a ________ timestamp
73. Kafka Streams supports both stateful and stateless operations T
74. The only metadata retained on a per-consumer basis is the position of the consumer in the log, called :
offset
75. Kafka stores metadata of basic information about Topics, Brokers and consumer offsets in :
Zookeeper ensemble
76. Banking industry can leverage Kafka Streams for detecting fraudulent transactions. T