Getting Started With Apache Kafka in Python
In this post, I am going to discuss Apache Kafka and how Python programmers can use it
for building distributed systems.
According to Wikipedia:
Apache Kafka is an open-source stream-processing software platform developed by the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
Think of it as a big commit log where data is stored in sequence as it happens. The users of this log can access and use it as per their requirements.
Kafka Use Cases
Kafka has many uses. Here are a few use cases that could help you figure out where it fits.
Activity Monitoring:- Kafka can be used for activity monitoring. The activity could belong to a website or to physical sensors and devices. Producers can publish raw data from data sources that can later be used to find trends and patterns.
Messaging:- Kafka can be used as a message broker among services. If you are implementing a microservice architecture, you can have one microservice act as a producer and another as a consumer. For instance, you could have a microservice that is responsible for creating new accounts and another for sending emails to users about account creation.
Log Aggregation:- You can use Kafka to collect logs from different systems and store them in a centralized system for further processing.
ETL:- Kafka offers near real-time streaming, so you can build an ETL pipeline based on your needs.
Database:- Based on the things I mentioned above, you may say that Kafka also acts as a database. Not a typical database that has the feature of querying the data as per need; what I mean is that you can keep data in Kafka as long as you want without consuming it, as sketched below.
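For example, retention on a stock install is controlled through properties in config/server.properties. These are standard Kafka settings; the values here are just illustrative:

# keep messages for 7 days (168 hours is also the Kafka default)
log.retention.hours=168
# to keep messages indefinitely, disable time-based retention instead:
# log.retention.ms=-1
# size-based retention; -1 means unlimited
log.retention.bytes=-1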
Kafka Concepts
Let’s discuss core Kafka concepts.
Topics
Every message that is fed into the system must be part of some topic. A topic is nothing but a stream of records. Messages are stored in key-value format. Each message is assigned a sequence number, called an Offset. The output of one message could be the input of another for further processing.
Producers
Producers are the apps responsible for publishing data into the Kafka system. They publish data on the topic of their choice.
Consumers
The messages published into topics are then utilized by consumer apps. A consumer subscribes to the topic of its choice and consumes data.
Broker
Every instance of Kafka that is responsible for message exchange is called a Broker. Kafka can be used as a stand-alone machine or as part of a cluster.
Let me try to explain the whole thing with a simple example: there is a warehouse or godown of a restaurant where all the raw material is dumped, like rice, vegetables and so on. The restaurant serves different kinds of dishes: Chinese, Desi, Italian etc. The chefs of each cuisine can refer to the warehouse, pick the desired things and make dishes out of them. There is a possibility that the stuff made from that raw material is later used by the chefs of all departments, for instance, some secret sauce that is used in ALL kinds of dishes. Here, the warehouse is a broker, the vendors of goods are the producers, the goods and the secret sauce made by chefs are topics, while the chefs are consumers. My analogy might sound funny and inaccurate, but at least it'd have helped you to understand the entire thing :-)
Kafka is available in two different flavors: one by the Apache Foundation and the other from Confluent as a package. For this tutorial, I will go with the one provided by the Apache Foundation. By the way, Confluent was founded by the original developers of Kafka.
Starting Zookeeper
Kafka relies on Zookeeper; in order to make Kafka run, we will have to run Zookeeper first.
bin/zookeeper-server-start.sh config/zookeeper.properties
It will display lots of text on the screen; if the startup log finishes without errors, it means Zookeeper is up properly.
Starting Kafka Server
Next, we have to start Kafka broker server:
bin/kafka-server-start.sh config/server.properties
And if the broker prints its startup messages on the console without errors, it means it's up.
Create Topics
Messages are published in topics. Use this command to create a new topic.
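With the kafka_2.11-1.1.0 distribution used later in this post, topic creation goes through ZooKeeper. A typical invocation (single partition and a replication factor of 1, matching this single-broker setup; the topic name test is assumed from the producer section below) looks like this:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test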
You can also list all available topics by running the following command.
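Assuming the same local setup:

bin/kafka-topics.sh --list --zookeeper localhost:2181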
Sending Messages
Next, we have to send messages; producers are used for that purpose. Let's initiate a producer.
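The stock console producer would be started along these lines (broker address and topic name assumed from the surrounding text):

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test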
This starts the console-based producer interface, which connects to the broker running on port 9092 by default. --topic allows you to set the topic to which the messages will be published. In our case the topic is test.
It shows you a > prompt and you can input whatever you want.
Messages are stored locally on your disk. You can learn the path by checking the value of log.dirs in the config/server.properties file. By default it is set to /tmp/kafka-logs/
If you list this folder you will find a folder named test-0. Upon listing it you will find 3 files: 00000000000000000000.index, 00000000000000000000.log and 00000000000000000000.timeindex
^@^@^@^@^@^@^@^@^@^@^@=^@^@^@^@^BÐØR^V^@^@^@^@^@^@^@^@^Acça<9a>o^@^@^
Acça<9a>oÿÿÿÿÿÿÿÿÿÿÿÿÿÿ^@^@^@^A^V^@^@^@^A
Hello^@^@^@^@^@^@^@^@^A^@^@^@=^@^@^@^@^BÉJ^B-
^@^@^@^@^@^@^@^@^Acça<9f>^?^@^@^Acça<9f>^?
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿ^@^@^@^A^V^@^@^@^A
World^@
~
It looks like encoded or delimiter-separated data; I am not sure. If someone knows this format then do let me know.
Anyways, Kafka provides a utility that lets you examine each incoming message.
➜ kafka_2.11-1.1.0 bin/kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration --print-data-log --files /tmp/kafka-logs/test-0/00000000000000000000.log
Dumping /tmp/kafka-logs/test-0/00000000000000000000.log
Starting offset: 0
offset: 0 position: 0 CreateTime: 1528595323503 isvalid: true keysize: -1 valuesize: 5 magic: 2 compresscodec: NONE producerId: -1 producerEpoch: -1 sequence: -1 isTransactional: false headerKeys: [] payload: Hello
offset: 1 position: 73 CreateTime: 1528595324799 isvalid: true keysize: -1 valuesize: 5 magic: 2 compresscodec: NONE producerId: -1 producerEpoch: -1 sequence: -1 isTransactional: false headerKeys: [] payload: World
You can see each message along with details like offset, position and CreateTime.
Consuming Messages
Messages that are stored should be consumed too. Let's start a console-based consumer.
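The stock console consumer ships with Kafka as well; an invocation matching this setup (bootstrap server and topic name assumed from the earlier sections) would be:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning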
If you run it, it will dump all the messages from the beginning till now. If you are just interested in consuming the messages published after the consumer starts, you can omit the --from-beginning switch and run it again. The reason it does not show the old messages is that the offset is updated once the consumer sends an ACK to the Kafka broker about processed messages. A minimal sketch of that offset behavior in Python follows.
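Here is a rough illustration of that offset/acknowledgement flow with the kafka-python library; the group id is made up for the example:

from kafka import KafkaConsumer

# Consumers in the same group share committed offsets: on restart, the
# consumer resumes after the last committed message instead of replaying all.
consumer = KafkaConsumer(
    'test',
    bootstrap_servers=['localhost:9092'],
    group_id='demo-group',          # hypothetical consumer group
    enable_auto_commit=True,        # commit processed offsets automatically
    auto_offset_reset='earliest',   # used only when no committed offset exists
    consumer_timeout_ms=5000,       # stop iterating after 5s of inactivity
)
for message in consumer:
    print(message.offset, message.value)
consumer.close()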
Accessing Kafka in Python
There are multiple Python libraries available for use:
Kafka-Python — An open-source, community-based library.
PyKafka — This library is maintained by Parse.ly and it's claimed to be a Pythonic API. Unlike Kafka-Python, you can't create dynamic topics.
Confluent Python Kafka:- This library is offered by Confluent as a thin wrapper around librdkafka, hence its performance is better than the other two.
In the last post, about Elasticsearch, I scraped Allrecipes data. In this post, I am going to use the same scraper as a data source. The system we are going to build is an alert system which will send a notification about a recipe if it meets a certain calories threshold. There will be two topics:
raw_recipes:- It will store the raw HTML of each recipe. The idea is to use this topic as the main source of our data that can later be processed and transformed as per need.
parsed_recipes:- As the name suggests, this will be the parsed data of each recipe in JSON format. (Commands to create both topics are shown below.)
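Both topics can be created with the same kafka-topics.sh script shown earlier (again assuming the single-broker setup):

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic raw_recipes
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic parsed_recipes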
import requests
from time import sleep
from bs4 import BeautifulSoup


def fetch_raw(recipe_url):
    # Download the raw HTML of a single recipe page.
    # `headers` is defined in the __main__ block below.
    html = None
    print('Processing..{}'.format(recipe_url))
    try:
        r = requests.get(recipe_url, headers=headers)
        if r.status_code == 200:
            html = r.text
    except Exception as ex:
        print('Exception while accessing raw html')
        print(str(ex))
    finally:
        return html.strip() if html else None


def get_recipes():
    # Collect the raw HTML of the first few recipes on the salad listing page.
    recipes = []
    url = 'https://www.allrecipes.com/recipes/96/salad/'
    print('Accessing list')
    try:
        r = requests.get(url, headers=headers)
        if r.status_code == 200:
            html = r.text
            soup = BeautifulSoup(html, 'lxml')
            links = soup.select('.fixed-recipe-card__h3 a')
            idx = 0
            for link in links:
                sleep(2)  # be polite to the server between requests
                recipe = fetch_raw(link['href'])
                if recipe:
                    recipes.append(recipe)
                idx += 1
                if idx > 2:
                    break
    except Exception as ex:
        print('Exception in get_recipes')
        print(str(ex))
    finally:
        return recipes
This code snippet extracts the markup of each recipe and returns the results as a list.
Next, we need to create a producer object. Before we proceed further, we will make a change in the config/server.properties file. We have to set advertised.listeners to PLAINTEXT://localhost:9092, otherwise the producer may fail to connect to the broker and raise a connection error.
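That is a one-line change in config/server.properties:

advertised.listeners=PLAINTEXT://localhost:9092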
We will now add two methods: connect_kafka_producer(), which will give you an instance of a Kafka producer, and publish_message(), which will just dump the raw HTML of individual recipes.
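These two helpers are the same ones that reappear in the parser script later in this post; they are reproduced here so the producer script is complete:

from kafka import KafkaProducer


def connect_kafka_producer():
    # Returns a KafkaProducer connected to the local broker, or None on failure.
    _producer = None
    try:
        _producer = KafkaProducer(bootstrap_servers=['localhost:9092'], api_version=(0, 10))
    except Exception as ex:
        print('Exception while connecting Kafka')
        print(str(ex))
    finally:
        return _producer


def publish_message(producer_instance, topic_name, key, value):
    # Encodes the key and value as UTF-8 bytes and publishes them to the topic.
    try:
        key_bytes = bytes(key, encoding='utf-8')
        value_bytes = bytes(value, encoding='utf-8')
        producer_instance.send(topic_name, key=key_bytes, value=value_bytes)
        producer_instance.flush()
        print('Message published successfully.')
    except Exception as ex:
        print('Exception in publishing message')
        print(str(ex))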
if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/66.0.3359.181 Safari/537.36',
        'Pragma': 'no-cache'
    }

    all_recipes = get_recipes()
    if len(all_recipes) > 0:
        kafka_producer = connect_kafka_producer()
        for recipe in all_recipes:
            publish_message(kafka_producer, 'raw_recipes', 'raw', recipe.strip())
        if kafka_producer is not None:
            kafka_producer.close()
/anaconda3/anaconda/bin/python /Development/DataScience/Kafka/kafka-recipie-alert/producer-raw-recipies.py
Accessing list
Processing..https://www.allrecipes.com/recipe/20762/california-coleslaw/
Processing..https://www.allrecipes.com/recipe/8584/holiday-chicken-salad/
Processing..https://www.allrecipes.com/recipe/80867/cran-broccoli-salad/
Message published successfully.
Message published successfully.
Message published successfully.
I am using a GUI tool named Kafka Tool to browse recently published messages. It is available for OSX, Windows and Linux.
Kafka Tool in action
Recipe Parser
The next script we are going to write will serve as both consumer and producer. First it will consume data from the raw_recipes topic, parse and transform the data into JSON, and then publish it to the parsed_recipes topic. Below is the code that fetches HTML data from the raw_recipes topic, parses it and then feeds it into the parsed_recipes topic.
import json
from time import sleep

from bs4 import BeautifulSoup
from kafka import KafkaConsumer, KafkaProducer


def publish_message(producer_instance, topic_name, key, value):
    try:
        key_bytes = bytes(key, encoding='utf-8')
        value_bytes = bytes(value, encoding='utf-8')
        producer_instance.send(topic_name, key=key_bytes, value=value_bytes)
        producer_instance.flush()
        print('Message published successfully.')
    except Exception as ex:
        print('Exception in publishing message')
        print(str(ex))


def connect_kafka_producer():
    _producer = None
    try:
        _producer = KafkaProducer(bootstrap_servers=['localhost:9092'], api_version=(0, 10))
    except Exception as ex:
        print('Exception while connecting Kafka')
        print(str(ex))
    finally:
        return _producer


def parse(markup):
    title = '-'
    submit_by = '-'
    description = '-'
    calories = 0
    ingredients = []
    rec = {}

    try:
        soup = BeautifulSoup(markup, 'lxml')
        # title
        title_section = soup.select('.recipe-summary__h1')
        # submitter
        submitter_section = soup.select('.submitter__name')
        # description
        description_section = soup.select('.submitter__description')
        # ingredients
        ingredients_section = soup.select('.recipe-ingred_txt')

        # calories
        calories_section = soup.select('.calorie-count')
        if calories_section:
            calories = calories_section[0].text.replace('cals', '').strip()

        if ingredients_section:
            for ingredient in ingredients_section:
                ingredient_text = ingredient.text.strip()
                if 'Add all ingredients to list' not in ingredient_text and ingredient_text != '':
                    ingredients.append({'step': ingredient_text})

        if description_section:
            description = description_section[0].text.strip().replace('"', '')

        if submitter_section:
            submit_by = submitter_section[0].text.strip()

        if title_section:
            title = title_section[0].text

        rec = {'title': title, 'submitter': submit_by, 'description': description,
               'calories': calories, 'ingredients': ingredients}

    except Exception as ex:
        print('Exception while parsing')
        print(str(ex))
    finally:
        return json.dumps(rec)


if __name__ == '__main__':
    print('Running Consumer..')
    parsed_records = []
    topic_name = 'raw_recipes'
    parsed_topic_name = 'parsed_recipes'

    consumer = KafkaConsumer(topic_name, auto_offset_reset='earliest',
                             bootstrap_servers=['localhost:9092'], api_version=(0, 10),
                             consumer_timeout_ms=1000)  # timeout value assumed; original was truncated
    for msg in consumer:
        html = msg.value
        result = parse(html)
        parsed_records.append(result)
    consumer.close()
    sleep(5)

    if len(parsed_records) > 0:
        print('Publishing records..')
        producer = connect_kafka_producer()
        for rec in parsed_records:
            publish_message(producer, parsed_topic_name, 'parsed', rec)
KafkaConsumer accepts a few parameters besides the topic name and host address. By providing auto_offset_reset='earliest' you are telling Kafka to return messages from the beginning. The parameter consumer_timeout_ms helps the consumer disconnect after a certain period of inactivity. Once disconnected, you can close the consumer stream by calling consumer.close()
After this, I am using the same routines to connect a producer and publish the parsed data to the new topic. The Kafka Tool browser gives glad tidings about the newly stored messages.
So far so good. We stored recipes in both raw and JSON format for later use. Next, we have to write a consumer that will connect to the parsed_recipes topic and generate an alert if a certain calories criterion is met.
import json
from time import sleep

from kafka import KafkaConsumer

if __name__ == '__main__':
    parsed_topic_name = 'parsed_recipes'
    # Notify if a recipe has more than 200 calories
    calories_threshold = 200

    consumer = KafkaConsumer(parsed_topic_name, auto_offset_reset='earliest',
                             bootstrap_servers=['localhost:9092'], api_version=(0, 10),
                             consumer_timeout_ms=1000)  # timeout value assumed; original was truncated
    for msg in consumer:
        record = json.loads(msg.value)
        calories = int(record['calories'])
        title = record['title']

        if calories > calories_threshold:
            print('Alert: {} calories count is {}'.format(title, calories))
        sleep(3)

    if consumer is not None:
        consumer.close()
The JSON is decoded and then the calories count is checked; a notification is issued once the criterion is met.
Conclusion
Kafka is a scalable, fault-tolerant, publish-subscribe messaging system that enables you to build distributed applications. Due to its high performance and efficiency, it's getting popular among companies that produce loads of data from various external sources and want to provide real-time findings from it. I have just covered the gist of it. Do explore the docs and existing implementations; they will help you understand how it could be the best fit for your next system.