Apache Kafka 0.8 basic training
Michael G. Noll, Verisign
mnoll@verisign.com / @miguno
July 2014
Update 2015-08-01:
Shameless plug! Since publishing this Kafka training deck about a year ago
I joined Confluent Inc. as their Developer Evangelist.
http://www.confluent.io/training
I can say with confidence that these are the best and most effective Apache
Kafka trainings available on the market. But you don't have to take my word
for it: feel free to take a look yourself, and reach out to us if you're interested.
Verisign Public
Kafka?
Part 1: Introducing Kafka
Why should I stay awake for the full duration of this workshop?
Part 2: Kafka core concepts
Topics, partitions, replicas, producers, consumers, brokers
Part 3: Operating Kafka
Architecture, hardware specs, deploying, monitoring, P&S tuning
Part 4: Developing Kafka apps
Writing to Kafka, reading from Kafka, testing, serialization, compression, example apps
Part 5: Playing with Kafka using Wirbelsturm
Wrapping up
Part 1: Introducing Kafka
Overview of Part 1: Introducing Kafka
Kafka?
Kafka adoption and use cases in the wild
At LinkedIn
At other companies
How fast is Kafka, and why?
Kafka + X for processing
Storm, Samza, Spark Streaming, custom apps
Kafka?
http://kafka.apache.org/
Originated at LinkedIn, open sourced in early 2011
Implemented in Scala, some Java
9 core committers, plus ~ 20 contributors
https://kafka.apache.org/committers.html
https://github.com/apache/kafka/graphs/contributors
Kafka?
LinkedIn's motivation for Kafka was:
A unified platform for handling all the real-time data feeds a large company might have.
Must haves
High throughput to support high volume event feeds.
Support real-time processing of these feeds to create new, derived feeds.
Support large data backlogs to handle periodic ingestion from offline systems.
Support low-latency delivery to handle more traditional messaging use cases.
Guarantee fault-tolerance in the presence of machine failures.
http://kafka.apache.org/documentation.html#majordesignelements
Kafka @ LinkedIn, 2014
https://twitter.com/SalesforceEng/status/466033231800713216/photo/1
http://www.hakkalabs.co/articles/site-reliability-engineering-linkedin-kafka-service
Data architecture @ LinkedIn, Feb 2013
http://gigaom.com/2013/12/09/netflix-open-sources-its-data-traffic-cop-suro/
Kafka @ LinkedIn, 2014
Multiple data centers, multiple clusters
Mirroring between clusters / data centers
http://www.hakkalabs.co/articles/site-reliability-engineering-linkedin-kafka-service
http://www.slideshare.net/JayKreps1/i-32858698
http://search-hadoop.com/m/4TaT4qAFQW1
Kafka @ LinkedIn, 2014
15 brokers
15,500 partitions (replication factor 2)
400,000 msg/s inbound
70 MB/s inbound
400 MB/s outbound
https://kafka.apache.org/documentation.html#java
Staffing: Kafka team @ LinkedIn
Team of 8+ engineers
Site reliability engineers (Ops): at least 3
Developers: at least 5
SREs as well as DEVs are on call 24x7
https://kafka.apache.org/committers.html
http://www.hakkalabs.co/articles/site-reliability-engineering-linkedin-kafka-service
Kafka adoption and use cases
LinkedIn: activity streams, operational metrics, data bus
400 nodes, 18k topics, 220B msg/day (peak 3.2M msg/s), May 2014
Netflix: real-time monitoring and event processing
Twitter: as part of their Storm real-time data pipelines
Spotify: log delivery (from 4h down to 10s), Hadoop
Loggly: log collection and processing
Mozilla: telemetry data
Airbnb, Cisco, Gnip, InfoChimps, Ooyala, Square, Uber, …
https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
Kafka @ Spotify
How fast is Kafka?
Up to 2 million writes/sec on 3 cheap machines
Using 3 producers on 3 different machines, 3x async replication
Only 1 producer/machine because NIC already saturated
Sustained throughput as stored data grows
Slightly different test config than 2M writes/sec above.
Test setup
Kafka trunk as of April 2013, but 0.8.1+ should be similar.
3 machines: 6-core Intel Xeon 2.5 GHz, 32GB RAM, 6x 7200rpm SATA, 1GigE
http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
Why is Kafka so fast?
Fast writes:
While Kafka persists all data to disk, essentially all writes go to the
page cache of OS, i.e. RAM.
Cf. hardware specs and OS tuning (we cover this later)
Fast reads:
Very efficient to transfer data from page cache to a network socket
Linux: sendfile() system call
http://kafka.apache.org/documentation.html#persistence
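The sendfile() trick can be illustrated with a small Python sketch (a toy illustration, not Kafka code, assuming a Linux-like OS where os.sendfile can write to a local socket): the kernel moves bytes from the page cache straight into the socket buffer, with no copy through user space.

```python
import os
import socket
import tempfile

# Prepare a "log segment" file; after writing, its contents sit in the OS page cache.
segment = tempfile.NamedTemporaryFile(delete=False)
segment.write(b"kafka-message " * 100)
segment.flush()

# A connected socket pair stands in for a consumer's network connection.
consumer_conn, broker_conn = socket.socketpair()

# sendfile() asks the kernel to move bytes from the file (page cache)
# directly into the socket buffer, skipping user space entirely.
with open(segment.name, "rb") as f:
    sent = os.sendfile(broker_conn.fileno(), f.fileno(), 0, 1400)
broker_conn.close()

# Drain what the "consumer" received.
received = b""
while True:
    chunk = consumer_conn.recv(4096)
    if not chunk:
        break
    received += chunk
consumer_conn.close()
os.unlink(segment.name)

print(sent, len(received))  # -> 1400 1400
```

The real broker does the same thing at scale, which is why serving consumers that read recent data rarely touches the disk at all.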
Why is Kafka so fast?
Example: Loggly.com, who run Kafka & Co. on Amazon AWS
99.99999% of the time our data is coming from disk cache and RAM; only
very rarely do we hit the disk.
One of our consumer groups (8 threads) which maps a log to a customer
can process about 200,000 events per second draining from 192 partitions
spread across 3 brokers.
Brokers run on m2.xlarge Amazon EC2 instances backed by provisioned IOPS
http://www.developer-tech.com/news/2014/jun/10/why-loggly-loves-apache-kafka-how-unbreakable-infinitely-scalable-messaging-makes-log-management-better/
Kafka + X for processing the data?
Kafka + Storm often used in combination, e.g. Twitter
Kafka + custom
Normal Java multi-threaded setups
Akka actors with Scala or Java, e.g. Ooyala
Recent additions:
Samza (since Aug '13), also by LinkedIn
Spark Streaming, part of Spark (since Feb '13)
https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
Part 2: Kafka core concepts
Overview of Part 2: Kafka core concepts
A first look
Topics, partitions, replicas, offsets
Producers, brokers, consumers
Putting it all together
A first look
The who is who
Producers write data to brokers.
Consumers read data from brokers.
All this is distributed.
The data
Data is stored in topics.
Topics are split into partitions, which are replicated.
A first look
http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/
Topics
Topic: feed name to which messages are published
Example: zerg.hydra
Kafka prunes the head based on age, max size, or key
(Diagram: producers A1…An append new messages to a Kafka topic on the broker(s); older messages sit at the head, newer messages at the tail.)
Topics
Creating a topic
CLI
$ kafka-topics.sh --zookeeper zookeeper1:2181 --create --topic zerg.hydra \
--partitions 3 --replication-factor 2 \
--config x=y
API
https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/storm/KafkaStormDemo.scala
Auto-create via auto.create.topics.enable = true
Modifying a topic
https://kafka.apache.org/documentation.html#basic_ops_modify_topic
Partitions
#partitions of a topic is configurable
#partitions determines max consumer (group) parallelism
Cf. parallelism of Storm's KafkaSpout via builder.setSpout(..., ..., N)
Replicas of a partition
Replicas: backups of a partition
They exist solely to prevent data loss.
Replicas are never read from, never written to.
They do NOT help to increase producer or consumer parallelism!
Kafka tolerates (numReplicas - 1) dead brokers before losing data
LinkedIn: numReplicas == 2 => 1 broker can die
Topics vs. Partitions vs. Replicas
http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/
Inspecting the current state of a topic
--describe the topic
$ kafka-topics.sh --zookeeper zookeeper1:2181 --describe --topic zerg.hydra
Topic:zerg2.hydra PartitionCount:3 ReplicationFactor:2 Configs:
Topic: zerg2.hydra Partition: 0 Leader: 1 Replicas: 1,0 Isr: 1,0
Topic: zerg2.hydra Partition: 1 Leader: 0 Replicas: 0,1 Isr: 0,1
Topic: zerg2.hydra Partition: 2 Leader: 1 Replicas: 1,0 Isr: 1,0
In this example:
Broker 0 is leader for partition 1.
Broker 1 is leader for partitions 0 and 2.
All replicas are in-sync with their respective leader partitions.
Let's recap
The who is who
Producers write data to brokers.
Consumers read data from brokers.
All this is distributed.
The data
Data is stored in topics.
Topics are split into partitions which are replicated.
Putting it all together
http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/
Side note (opinion)
Drawing a conceptual line from Kafka to Clojure's core.async
Part 3: Operating Kafka
Overview of Part 3: Operating Kafka
Kafka architecture
Kafka hardware specs
Deploying Kafka
Monitoring Kafka
Kafka apps
Kafka itself
ZooKeeper
"Auditing" Kafka (not: security audit)
P&S tuning
Ops-related Kafka references
Kafka architecture
Kafka brokers
You can run clusters with 1+ brokers.
Each broker in a cluster must have
a unique broker.id.
Kafka architecture
Kafka requires ZooKeeper
LinkedIn runs (old) ZK 3.3.4,
but latest 3.4.5 works, too.
ZooKeeper
v0.8: used by brokers and consumers, but not by producers.
Brokers: general state information, leader election, etc.
Consumers: primarily for tracking message offsets (cf. later)
v0.9: used by brokers only
Consumers will use special Kafka topics instead of ZooKeeper
Will substantially reduce the load on ZooKeeper for large deployments
Kafka broker hardware specs @ LinkedIn
Solely dedicated to running Kafka, run nothing else.
1 Kafka broker instance per machine
2x 4-core Intel Xeon (info outdated?)
64 GB RAM (up from 24 GB)
Only 4 GB used for Kafka broker, remaining 60 GB for page cache
Page cache is what makes Kafka fast
RAID10 with 14 spindles
More spindles = higher disk throughput
Cache on RAID, with battery backup
Before H/W upgrade: 8x SATA drives (7200rpm), not sure about RAID
1 GigE (?) NICs
Puppet module
https://github.com/miguno/puppet-kafka
Hiera-compatible, rspec tests, Travis CI setup (e.g. to test against multiple
versions of Puppet and Ruby, Puppet style checker/lint, etc.)
Operating Kafka
Typical operations tasks include:
Adding or removing brokers
Example: ensure a newly added broker actually receives data, which
requires moving partitions from existing brokers to the new broker
Kafka provides helper scripts (cf. below) but manual work is still involved
Balancing data/partitions to ensure best performance
Add new topics, re-configure topics
Example: Increasing #partitions of a topic to increase max parallelism
Apps management: new producers, new consumers
Lessons learned from operating Kafka at LinkedIn
Biggest challenge has been to manage hyper growth
Growth of Kafka adoption: more producers, more consumers, …
Growth of data: more LinkedIn.com users, more user activity, …
http://www.hakkalabs.co/articles/site-reliability-engineering-linkedin-kafka-service
Kafka security
Original design was not created with security in mind.
Discussion started in June 2014 to add security features.
Covers transport layer security, data encryption at rest, non-repudiation, A&A, …
See the "[DISCUSS] Kafka Security Specific Features" mailing list thread
Monitoring Kafka
Monitoring Kafka
Nothing fancy built into Kafka (e.g. no UI) but see:
https://cwiki.apache.org/confluence/display/KAFKA/System+Tools
https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem
Monitoring Kafka
Use of standard monitoring tools recommended
Graphite
Puppet module: https://github.com/miguno/puppet-graphite
Java API, also used by Kafka: http://metrics.codahale.com/
JMX
https://kafka.apache.org/documentation.html#monitoring
Monitoring Kafka apps
Almost all problems are due to:
1. Consumer lag
2. Rebalancing <<< we cover this later in part 4
Monitoring Kafka apps: consumer lag
Lag = how far your consumer is behind the producers
(Diagram: producers A1…An append new messages on the broker(s) while consumer group C1 reads; lag is the distance between the newest messages and C1's current read position.)
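Conceptually, per-partition lag is just the broker's log-end offset minus the consumer's last committed offset. A hypothetical helper (not part of Kafka's own tooling) could compute it like this:

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: how far the consumer trails the newest message."""
    return {
        partition: log_end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in log_end_offsets
    }

# Example: broker-side newest offsets vs. what the consumer group has committed.
log_end = {0: 1050, 1: 980, 2: 2000}
committed = {0: 1000, 1: 980, 2: 1500}

lag = consumer_lag(log_end, committed)
print(lag)                 # {0: 50, 1: 0, 2: 500}
print(sum(lag.values()))   # total lag across partitions: 550
```

A lag that keeps growing means the consumer cannot keep up with the producers, which is exactly the condition you want an alert on.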
Monitoring Kafka itself (1 of 3)
Under-replicated partitions
For example, because a broker is down.
Means cluster runs in degraded state.
FYI: LinkedIn runs with replication factor of 2 => 1 broker can die.
Offline partitions
Even worse than under-replicated partitions!
Serious problem (data loss) if anything but 0 offline partitions.
Monitoring Kafka itself (2 of 3)
Data size on disk
Should be balanced across disks/brokers
Data balance even more important than partition balance
FYI: New script in v0.8.1 to balance data/partitions across brokers
Monitoring Kafka itself (3 of 3)
Leader partition count
Should be balanced across brokers so that each broker gets the same
amount of load
Only 1 broker is ever the leader of a given partition, and only this broker is
going to talk to producers + consumers for that partition
Non-leader replicas are used solely as safeguards against data loss
Feature in v0.8.1 to auto-rebalance the leaders and partitions in case a
broker dies, but it does not work that well yet (SREs still have to do this
manually at this point).
Network utilization
A maxed-out network is one reason for under-replicated partitions
LinkedIn don't run anything but Kafka on the brokers, so network max is
due to Kafka. Hence, when they max the network, they need to add more
capacity across the board.
Monitoring ZooKeeper
Ensemble (= cluster) availability
LinkedIn run 5-node ensembles = tolerates 2 dead
Twitter run 13-node ensembles = tolerates 6 dead
Latency of requests
Metric target is 0 ms when using SSDs in ZooKeeper machines.
Why? Because SSDs are so fast they typically bring latency down below ZK's
metric granularity (which is per-ms).
Outstanding requests
Metric target is 0.
Why? Because ZK processes all incoming requests serially. Non-zero
values mean that requests are backing up.
"Auditing" Kafka
LinkedIn's way to detect data loss etc.
Auditing Kafka
LinkedIn's way to detect data loss etc. in Kafka
Not part of open source stack yet. May come in the future.
In short: custom producer+consumer app that is hooked into monitoring.
Value proposition
Monitor whether you're losing messages/data.
Monitor whether your pipelines can handle the incoming data load.
http://www.hakkalabs.co/articles/site-reliability-engineering-linkedin-kafka-service
LinkedIn's Audit UI: a first look
Example 1: Count discrepancy
Caused by messages failing to
reach a downstream Kafka
cluster
Auditing Kafka
Every producer is also writing messages into a special topic about
how many messages it produced, every 10mins.
Example: "Over the last 10mins, I sent N messages to topic X."
This metadata gets mirrored like any other Kafka data.
Audit consumer
1 audit consumer per Kafka cluster
Reads every single message out of its Kafka cluster. It then calculates
counts for each topic, and writes those counts back into the same special
topic, every 10mins.
Example: "I saw M messages in the last 10mins for topic X in THIS cluster."
And the next audit consumer in the next, downstream cluster does the
same thing.
Auditing Kafka
Monitoring audit consumers
Completeness check
"#msgs according to producer == #msgs seen by audit consumer?"
Lag
"Can the audit consumers keep up with the incoming data rate?"
If audit consumers fall behind, then all your tracking data falls behind
as well, and you don't know how many messages got produced.
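The audit bookkeeping described above can be sketched as a toy count comparison (window keys and topic names are made up for illustration; LinkedIn's actual audit app is not open source):

```python
from collections import Counter

# Producers report, per 10-minute window: "I sent N messages to topic X."
producer_counts = Counter({("2014-07-01T10:00", "topic-x"): 1000,
                           ("2014-07-01T10:00", "topic-y"): 500})

# The audit consumer reads every message in its cluster and reports what it saw.
audit_counts = Counter({("2014-07-01T10:00", "topic-x"): 997,
                        ("2014-07-01T10:00", "topic-y"): 500})

def completeness(produced, consumed):
    """Percentage of produced messages the audit consumer actually saw."""
    return {key: 100.0 * consumed.get(key, 0) / produced[key] for key in produced}

report = completeness(producer_counts, audit_counts)
for key, pct in sorted(report.items()):
    status = "OK" if pct >= 100.0 else "ALERT: possible data loss"
    print(key, f"{pct:.1f}%", status)
```

A percentage below 100% for any (window, topic) pair is exactly the condition that triggers the SRE/DEV emails mentioned below.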
Auditing Kafka
Audit UI
Only reads data from that special "metrics/monitoring" topic, but
this data is read from every Kafka cluster at LinkedIn.
What the producers said they wrote.
What the audit consumers said they saw.
Shows correlation graphs (producers vs. audit consumers)
For each tier, it shows how many messages there were in each topic
over any given period of time.
Percentage of how much data got through (from cluster to cluster).
If the percentage drops below 100%, then emails are sent to Kafka
SRE+DEV as well as their Hadoop ETL team because that stops the
Hadoop pipelines from functioning properly.
LinkedIn's Audit UI: a closing look
Example 1: Count discrepancy
Caused by messages failing to
reach a downstream Kafka
cluster
Kafka performance tuning
OS tuning
Kernel tuning
Don't swap! vm.swappiness = 0 (RHEL 6.5 onwards: 1)
Allow more dirty pages but less dirty cache.
LinkedIn have lots of RAM in their servers, most of it used for the page cache
(60 of 64 GB). They let dirty pages build up, but the cache should be available,
as Kafka does lots of disk and network I/O.
See vm.dirty_*_ratio & friends
Disk throughput
Longer commit interval on mount points. (ext3 or ext4?)
Normal interval for ext3 mount point is 30s (?) between flushes;
LinkedIn: 120s. They can tolerate losing 2 mins worth of data (because
of partition replicas), so they prefer higher throughput here.
More spindles (RAID10 w/ 14 disks)
Java/JVM tuning
Biggest issue: garbage collection
And, most of the time, the only issue
Java garbage collection in Kafka @ Spotify
https://www.jfokus.se/jfokus14/preso/Reliable-real-time-processing-with-Kafka-and-Storm.pdf
Java/JVM tuning
Good news: use JDK7u51 or later and have a quiet life!
LinkedIn: Oracle JDK, not OpenJDK
$ java -Xms4g -Xmx4g -XX:PermSize=48m -XX:MaxPermSize=48m \
    -XX:+UseG1GC \
    -XX:MaxGCPauseMillis=20 \
    -XX:InitiatingHeapOccupancyPercent=35
Kafka configuration tuning
Often not much to do beyond using the defaults, yay.
Kafka usage tuning lessons learned from others
Don't break things up into separate topics unless the data in them is
truly independent.
Consumer behavior can (and will) be extremely variable; don't assume you
will always be consuming as fast as you are producing.
http://grokbase.com/t/kafka/users/145qtx4z1c/topic-partitioning-strategy-for-large-data
Ops-related references
Kafka FAQ
https://cwiki.apache.org/confluence/display/KAFKA/FAQ
Kafka operations
https://kafka.apache.org/documentation.html#operations
Overview of Part 4: Developing Kafka apps
Writing data to Kafka with producers
Example producer
Producer types (async, sync)
Message acking and batching of messages
Write operations behind the scenes: caveats ahead!
Reading data from Kafka with consumers
High-level consumer API and simple consumer API
Consumer groups
Rebalancing
Testing Kafka
Serialization in Kafka
Data compression in Kafka
Example Kafka applications
Dev-related Kafka references
Writing data to Kafka
Writing data to Kafka
You use Kafka producers to write data to Kafka brokers.
Available for JVM (Java, Scala), C/C++, Python, Ruby, etc.
The Kafka project only provides the JVM implementation.
Has risk that a new Kafka release will break non-JVM clients.
Producers
The Java producer API is very simple.
We'll talk about the slightly confusing details next.
Producers
Two types of producers: async and sync
Sync producers
Straightforward, so I won't cover sync producers here
Please go to https://kafka.apache.org/documentation.html
Async producer
Sends messages in background = no blocking in client.
Provides more powerful batching of messages (see later).
Wraps a sync producer, or rather a pool of them.
Communication from async->sync producer happens via a queue.
Which explains why you may see kafka.producer.async.QueueFullException
Each sync producer gets a copy of the original async producer config,
including the request.required.acks setting (see later).
Implementation details: Producer, async.AsyncProducer,
async.ProducerSendThread, ProducerPool, async.DefaultEventHandler#send()
Async producer
Caveats
Async producer may drop messages if its queue is full.
Solution 1: Don't push data to the producer faster than it is able to send to the brokers.
Solution 2: Queue full == need more brokers, add them now! Use this solution in
favor of solution 3 particularly if your producer cannot block (async producers).
Solution 3: Set queue.enqueue.timeout.ms to -1 (default). Now the producer
will block indefinitely and will never willingly drop a message.
Solution 4: Increase queue.buffering.max.messages (default: 10,000).
In 0.8 an async producer does not have a callback for send() to register
error handlers. Callbacks will be available in 0.9.
Producers
Two aspects worth mentioning because they significantly influence
Kafka performance:
1. Message acking
2. Batching of messages
1) Message acking
Background:
In Kafka, a message is considered committed when "any required" ISR (in-sync
replicas) for that partition have applied it to their data log.
Message acking is about conveying this Yes, committed! information back
from the brokers to the producer client.
Exact meaning of "any required" is defined by request.required.acks.
-1: producer gets an ack after all ISR have received the data.
Gives the best durability, as Kafka guarantees that no data will be lost as long as at least
one ISR remains.
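The commit-and-ack rule can be modeled in a few lines (a toy model of the request.required.acks semantics, not Kafka code; offsets and replica lists are made up for illustration):

```python
def is_committed(message_offset, isr_applied_offsets):
    """A message is committed once every in-sync replica has applied it."""
    return all(applied >= message_offset for applied in isr_applied_offsets)

def ack(message_offset, leader_applied, isr_applied_offsets, required_acks):
    """Toy model of request.required.acks for one partition:
     0 -> fire and forget: ack immediately, no durability check
     1 -> ack once the leader has applied the message
    -1 -> ack only once all ISR have applied it (best durability)
    """
    if required_acks == 0:
        return True
    if required_acks == 1:
        return leader_applied >= message_offset
    return is_committed(message_offset, isr_applied_offsets)

# Leader has applied offset 42; one follower still lags at offset 40.
print(ack(42, leader_applied=42, isr_applied_offsets=[42, 40], required_acks=1))   # True
print(ack(42, leader_applied=42, isr_applied_offsets=[42, 40], required_acks=-1))  # False
```

This makes the durability/latency trade-off concrete: acks=-1 delays the ack until the slowest in-sync replica has caught up.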
Sync producer: will send this list (batch) of messages right now. Blocks!
Async producer: will send this list of messages in background as usual, i.e.
according to batch-related configuration settings. Does not block!
2) Batching of messages
Option 1: How send(listOfMessages) works behind the scenes
(Diagram: partitioner.class maps each message in the list to a partition, e.g. p6, p1, p4, and so on.)
2) Batching of messages
Option 2: Async producer
Standard behavior is to batch messages
Semantics are controlled via producer configuration settings
batch.num.messages
queue.buffering.max.ms + queue.buffering.max.messages
queue.enqueue.timeout.ms
And more, see producer configuration docs.
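The interplay of the size and time triggers can be sketched as a toy buffer (attribute names merely mirror the real settings; the real async producer runs this logic in a background send thread):

```python
class ToyAsyncBuffer:
    """Toy sketch of async-producer batching: flush when the batch reaches
    batch_num_messages, or when buffering_max_ms has elapsed, whichever
    comes first. (Names mirror batch.num.messages / queue.buffering.max.ms.)"""

    def __init__(self, batch_num_messages=3, buffering_max_ms=100):
        self.batch_num_messages = batch_num_messages
        self.buffering_max_ms = buffering_max_ms
        self.buffer = []
        self.oldest = None
        self.flushed_batches = []

    def send(self, message, now_ms):
        if not self.buffer:
            self.oldest = now_ms
        self.buffer.append(message)
        self._maybe_flush(now_ms)

    def tick(self, now_ms):
        """Called periodically by the background send thread (toy stand-in)."""
        self._maybe_flush(now_ms)

    def _maybe_flush(self, now_ms):
        too_big = len(self.buffer) >= self.batch_num_messages
        too_old = bool(self.buffer) and now_ms - self.oldest >= self.buffering_max_ms
        if too_big or too_old:
            self.flushed_batches.append(list(self.buffer))
            self.buffer.clear()

buf = ToyAsyncBuffer()
buf.send("m1", now_ms=0)
buf.send("m2", now_ms=10)
buf.send("m3", now_ms=20)  # size trigger: batch of 3 flushed
buf.send("m4", now_ms=30)
buf.tick(now_ms=200)       # time trigger: lone m4 flushed
print(buf.flushed_batches)  # [['m1', 'm2', 'm3'], ['m4']]
```

Either trigger alone would hurt: size-only batching can hold a lone message forever, and time-only batching caps throughput.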
FYI: upcoming producer configuration changes
Write operations behind the scenes
When writing to a topic in Kafka, producers write directly to the
partition leaders (brokers) of that topic
Remember: Writes always go to the leader ISR of a partition!
But there's one catch with line 2 (i.e. no key) in Kafka 0.8.
Keyed vs. non-keyed messages in Kafka 0.8
If a key is not specified:
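The 0.8 default behavior can be sketched as follows (a toy model, not Kafka's partitioner code): the keyed path hashes the key, while the non-keyed path picks a random partition and sticks with it until the next metadata refresh (cf. topic.metadata.refresh.interval.ms). That stickiness is the catch: over short windows, non-keyed traffic skews toward one partition.

```python
import random

NUM_PARTITIONS = 3
REFRESH_INTERVAL_MS = 600_000  # mirrors topic.metadata.refresh.interval.ms (10 min)

_sticky = {"partition": None, "chosen_at_ms": None}

def choose_partition(key, now_ms):
    """Toy model of Kafka 0.8's default partitioning behavior."""
    if key is not None:
        # Keyed message: same key always lands in the same partition.
        return hash(key) % NUM_PARTITIONS
    # Non-keyed: pick a random partition, stick with it until the next refresh.
    if (_sticky["partition"] is None
            or now_ms - _sticky["chosen_at_ms"] >= REFRESH_INTERVAL_MS):
        _sticky["partition"] = random.randrange(NUM_PARTITIONS)
        _sticky["chosen_at_ms"] = now_ms
    return _sticky["partition"]

# Same key -> same partition, always.
p1 = choose_partition("user-42", now_ms=0)
p2 = choose_partition("user-42", now_ms=50_000)
print(p1 == p2)  # True

# No key -> one sticky partition for the whole refresh interval.
sticky = {choose_partition(None, now_ms=t) for t in range(0, 600_000, 60_000)}
print(len(sticky))  # 1
```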
2) How to know the current leader of a partition?
Producers: broker discovery aka bootstrapping
Producers don't talk to ZooKeeper, so it's not through ZK.
Broker discovery is achieved by providing producers with a bootstrapping
broker list, cf. metadata.broker.list
These brokers inform the producer about all alive brokers and where to find
current partition leaders. The bootstrap brokers do use ZK for that.
Bootstrapping in Kafka 0.8
Scenario: N=5 brokers total, 2 of which are for bootstrap
Reading data from Kafka
Reading data from Kafka
You use Kafka consumers to read data from Kafka brokers.
Available for JVM (Java, Scala), C/C++, Python, Ruby, etc.
The Kafka project only provides the JVM implementation.
Has risk that a new Kafka release will break non-JVM clients.
Reading data from Kafka
Important consumer configuration settings
group.id assigns an individual consumer to a group
zookeeper.connect to discover brokers/topics/etc., and to store consumer
state (e.g. when using the high-level consumer API)
fetch.message.max.bytes number of message bytes to (attempt to) fetch for each
partition; must be >= broker's message.max.bytes
Reading data from Kafka
Consumer groups
Allows multi-threaded and/or multi-machine consumption from Kafka topics.
Consumers join a group by using the same group.id
Kafka guarantees a message is only ever read by a single consumer in a group.
Kafka assigns the partitions of a topic to the consumers in a group so that each partition is
consumed by exactly one consumer in the group.
Maximum parallelism of a consumer group: #consumers (in the group) <= #partitions
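A toy round-robin-style assignment makes the parallelism limit concrete (not Kafka's actual algorithm, which also tries to minimize the number of broker connections per consumer):

```python
def assign_partitions(partitions, consumers):
    """Toy round-robin assignment: divide a topic's partitions across the
    consumers in a group; each partition goes to exactly one consumer."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2, 3, 4, 5]

# 3 consumers, 6 partitions: each consumer gets 2 partitions.
print(assign_partitions(partitions, ["c1", "c2", "c3"]))
# {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}

# 8 consumers, 6 partitions: max parallelism is #partitions, so 2 consumers idle.
a = assign_partitions(partitions, [f"c{i}" for i in range(1, 9)])
idle = [c for c, ps in a.items() if not ps]
print(idle)  # ['c7', 'c8']
```

This is why you size #partitions for the consumer parallelism you expect to need, not for what you need today.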
Guarantees when reading data from Kafka
A message is only ever read by a single consumer in a group.
A consumer sees messages in the order they were stored in the log.
The order of messages is only guaranteed within a partition.
No order guarantee across partitions, which includes no order guarantee per-topic.
If total order (per topic) is required you can consider, for instance:
Use #partition = 1. Good: total order. Bad: Only 1 consumer process at a time.
Add total ordering in your consumer application, e.g. a Storm topology.
Some gotchas:
If you have multiple partitions per thread there is NO guarantee about the order you
receive messages, other than that within the partition the offsets will be sequential.
Example: You may receive 5 messages from partition 10 and 6 from partition 11, then 5
more from partition 10 followed by 5 more from partition 10, even if partition 11 has data
available.
Adding more processes/threads will cause Kafka to rebalance, possibly changing
the assignment of a partition to a thread (whoops).
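A small simulation of an interleaved delivery stream shows both guarantees at once: offsets are strictly sequential within each partition, while the global interleaving across partitions carries no order (partition numbers and offsets below are made up for illustration):

```python
from collections import defaultdict

# Messages as (partition, offset) pairs, delivered to one consumer thread in
# an interleaved order -- e.g. a burst from partition 10, then partition 11.
delivered = [(10, 0), (10, 1), (11, 0), (10, 2), (11, 1), (10, 3)]

seen = defaultdict(list)
for partition, offset in delivered:
    seen[partition].append(offset)

# Within each partition, offsets arrive strictly in order...
for partition, offsets in seen.items():
    assert offsets == sorted(offsets)
print(dict(seen))  # {10: [0, 1, 2, 3], 11: [0, 1]}

# ...but the global interleaving across partitions is not sorted.
print(delivered != sorted(delivered))  # True
```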
Rebalancing: how consumers meet brokers
Remember?
Rebalancing: how consumers meet brokers
Why dynamic at run-time?
Machines can die, be added, …
Consumer apps may die, be re-configured, added, …
Rebalancing: how consumers meet brokers
Rebalancing?
Consumers in a group come into consensus on which consumer is
consuming which partitions; this is required for distributed consumption
Divides broker partitions evenly across consumers, tries to reduce the
number of broker nodes each consumer has to connect to
When does it happen? Each time:
a consumer joins or leaves a consumer group, OR
a broker joins or leaves, OR
a topic joins/leaves via a filter, cf. createMessageStreamsByFilter()
Examples:
If a consumer or broker fails to heartbeat to ZK => rebalance!
createMessageStreams() registers consumers for a topic, which results
in a rebalance of the consumer-broker assignment.
Alternatives to Bijection:
e.g. https://github.com/miguno/kafka-avro-codec
Data compression in Kafka
Will run unit tests plus end-to-end tests of Kafka, Storm, and
Kafka-Storm integration.
KafkaConsumerApp
https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/kafka/KafkaConsumerApp.scala