Message partitions
Suppose that we have a purchase table and we want to read the records for items that belong to a certain category, say, electronics. In the normal course of events, we would simply filter out the other records, but what if we partitioned the table in such a way that we could read the records of our choice quickly?
This is exactly what happens when topics are broken into partitions, the units of parallelism in Kafka: the greater the number of partitions, the greater the throughput. This does not mean that we should choose a huge number of partitions, though; we will talk about the pros and cons of increasing the number of partitions shortly.
While creating a topic, you can always specify the number of partitions that you require for it. Each message is appended to one of the partitions and assigned a sequential number called an offset. Kafka makes sure that messages with the same key always go to the same partition: it calculates the hash of the message key and uses it to pick the partition to append the message to. Time ordering of messages is not guaranteed across a topic, but within a partition it is always guaranteed; a message that arrives later is always appended to the end of the partition.
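As a quick illustration, here is a minimal producer sketch using the Kafka Java client; the broker address, topic name, and record contents are hypothetical. Both records carry the key electronics, so they are appended to the same partition and receive consecutive offsets there:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key, same partition: these two records keep their relative order.
            producer.send(new ProducerRecord<>("purchases", "electronics", "tv"));
            producer.send(new ProducerRecord<>("purchases", "electronics", "laptop"));
        }
    }
}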
Partitions are fault-tolerant; they are replicated across the Kafka brokers. Each partition has a leader that serves messages to consumers wanting to read from the partition. If the leader fails, a new leader is elected and continues to serve messages to the consumers. This is how Kafka achieves high availability.
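One way to observe this layout is to ask the cluster which broker currently leads each partition. The following is a minimal sketch using the Java AdminClient; the broker address and topic name are hypothetical:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class ShowLeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singletonList("purchases"))
                    .all().get().get("purchases");
            for (TopicPartitionInfo p : desc.partitions()) {
                // Each partition has one leader broker and zero or more follower replicas.
                System.out.printf("partition=%d leader=%s replicas=%s%n",
                        p.partition(), p.leader(), p.replicas());
            }
        }
    }
}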
High throughput: Partitions are a way to achieve parallelism in Kafka. Write operations on different partitions happen in parallel, so time-consuming operations are spread out and the hardware is utilized to the maximum. On the consumer side, one partition is assigned to at most one consumer within a consumer group, which means that consumers in different groups can read from the same partition, but two consumers from the same consumer group cannot, as illustrated in the sketch below.
So, the degree of parallelism in a single consumer group depends on the number of partitions it is
reading from. A large number of partitions results in high throughput.
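To see the group behavior in practice, here is a minimal consumer sketch using the Java client; the broker address, group.id, and topic name are hypothetical. Kafka assigns each partition of the topic to exactly one member of the billing group:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.put("group.id", "billing");                 // members of this group share the partitions
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("purchases"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}

Starting a second copy of this program with the same group.id triggers a rebalance that splits the partitions between the two instances; a copy started with a different group.id receives all partitions independently.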
Choosing the number of partitions depends on how much throughput you want to achieve; we will talk about this in detail later. Throughput on the producer side also depends on many other factors, such as batch size, compression type, replication factor, the type of acknowledgement, and some other configurations, which we will see in detail in Chapter 3, Deep Dive into Kafka Producers.
However, we should be very careful about modifying the number of partitions. The mapping of messages to partitions depends entirely on the hash code generated from the message key, which guarantees that messages with the same key are always written to the same partition and, therefore, that consumers receive them in the order in which they were stored. If we change the number of partitions, the distribution of messages across partitions changes, and this ordering is no longer guaranteed for consumers who depended on it. Throughput for the producer and the consumer can be increased or decreased through different configurations, which we will discuss in detail in upcoming chapters.
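To make the remapping concrete, the sketch below reproduces the hashing formula that the Java client's default partitioner applies to keyed messages (Utils.murmur2 and Utils.toPositive ship with the client library); the key and the partition counts are hypothetical:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class PartitionMapping {
    // Same formula the default partitioner uses for records that have a key.
    static int partitionFor(byte[] keyBytes, int numPartitions) {
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        byte[] key = "electronics".getBytes(StandardCharsets.UTF_8);
        // With 4 partitions the key maps to one partition; after expanding the
        // topic to 6 partitions the same key may map to a different partition,
        // so new messages with this key no longer follow the old ones.
        System.out.println(partitionFor(key, 4));
        System.out.println(partitionFor(key, 6));
    }
}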
Increases producer memory: You must be wondering how increasing the number of partitions forces us to increase producer memory. A producer does some internal work before flushing data to the broker and asking it to store the data in a partition: it buffers incoming messages per partition. Once the upper bound or the configured time is reached, the producer sends the buffered messages to the broker and removes them from the buffer. If we increase the number of partitions, the memory allocated for buffering may be exhausted within a very short interval of time, and the producer will then block further sends until the buffered data has been delivered to the broker. This may result in lower throughput. To overcome this, we need to configure more buffer memory on the producer side.
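The relevant knobs are ordinary producer configuration properties that extend the earlier producer sketch; the values below are illustrative, not recommendations:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
props.put("buffer.memory", "67108864");           // total buffering memory; the default is 32 MB
props.put("batch.size", "16384");                 // per-partition batch size in bytes
props.put("linger.ms", "5");                      // max wait before flushing a partial batch
props.put("max.block.ms", "60000");               // how long send() may block once the buffer is full
// Pass props to new KafkaProducer<>(...) together with the serializer settings.

Because batches are held per partition, the worst-case buffer demand grows roughly with batch.size multiplied by the number of partitions being written to.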
High availability issue: Kafka is known as a high-availability, high-throughput, distributed messaging system, and its brokers store thousands of partitions of different topics. Reading from and writing to a partition happens through the leader of that partition. Generally, if a leader fails, electing a new leader takes only a few milliseconds. Failures are detected by the controller, which is simply one of the brokers. The new leader then serves the requests from producers and consumers, but before serving those requests, it reads the metadata of the partition from ZooKeeper. For a normal, expected failure, this window is very small and takes only a few milliseconds. In the case of an unexpected failure, such as a broker being killed unintentionally, the delay may be a few seconds, depending on the number of partitions. The general formula is:
Delay time = (number of partitions / replication factor) * (time to read the metadata for a single partition)
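For example, with purely hypothetical numbers: if a failed broker hosted 10,000 partitions with a replication factor of 2, and reading the metadata for one partition takes 2 milliseconds, the delay comes to (10,000 / 2) * 2 ms = 10 seconds.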
The other possibility is that the failed broker is the controller. In that case, the controller replacement time also depends on the number of partitions: the new controller reads the metadata of each partition, so the time to start the new controller increases with the number of partitions.
We should take care while choosing the number of partitions; we will talk about this, and about how we can make the best use of Kafka's capabilities, in upcoming chapters.