1 Python Multithreading and Multiprocessing Tutorial
A process can have multiple threads. They execute the same code belonging to the parent process. Ideally, they run in parallel, but not necessarily.
A process is an executing instance of an application. What does that mean? Well, for example, when you double-click the Microsoft Word icon, you start a process that runs Word. A thread is a path of execution within a process. Also, a process can contain multiple threads. When you start Word, the operating system creates a process and begins executing the primary thread of that process.
It’s important to note that a thread can do anything a process can do. But since a process can
consist of multiple threads, a thread could be considered a ‘lightweight’ process.
Threads within the same process share the same address space, whereas different processes do
not. This allows threads to read from and write to the same data structures and variables, and also
facilitates communication between threads.
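As a minimal sketch of that sharing (the names here are illustrative): two threads write into the same list, because they live in the same address space.

import threading

shared = []  # one list, visible to every thread in this process

def append_item(item):
    shared.append(item)

threads = [threading.Thread(target=append_item, args=(i,)) for i in range(2)]
[t.start() for t in threads]
[t.join() for t in threads]
print(shared)  # both threads wrote into the same list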
Threads, of course, allow for multithreading. A common example of the advantage of multithreading is the fact that you can have a word processor that prints a document using a background thread, but at the same time another thread is running that accepts user input, so that you can type up a new document.
If we were dealing with an application that uses only one thread, then the application would only be able to do one thing at a time, so printing and responding to user input at the same time would not be possible in a single-threaded application.
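A minimal sketch of that idea, with a hypothetical background_print_job standing in for the print task:

import threading
import time

def background_print_job():
    time.sleep(2)  # stand-in for a slow print job
    print("document printed")

# The "print job" runs in a background thread...
threading.Thread(target=background_print_job).start()
# ...while the main thread stays responsive to user input
text = input("keep typing while the document prints: ")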
Sections of code that modify data structures shared by multiple threads are called critical sections. When a critical section is running in one thread, it's extremely important that no other thread be allowed into that critical section.
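A common way to enforce this in Python is a threading.Lock around the critical section; a minimal sketch with an illustrative counter:

import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:  # only one thread at a time may enter the critical section
            counter += 1

threads = [threading.Thread(target=increment, args=(100000,)) for _ in range(4)]
[t.start() for t in threads]
[t.join() for t in threads]
print(counter)  # reliably 400000 with the lock in place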
Concurrent:
A --- -- ---
B --- -- ---
Parallel:
A ---------------
B ---------------
In [2]: import os
        import time
        import threading
        import multiprocessing

        NUM_WORKERS = 4

        def only_sleep():
            """Do nothing, wait for a timer to expire"""
            print("PID: {}, Process Name: {}, Thread Name: {}".format(
                os.getpid(),
                multiprocessing.current_process().name,
                threading.current_thread().name))
            time.sleep(1)

        def crunch_numbers():
            """Do some computations"""
            print("PID: {}, Process Name: {}, Thread Name: {}".format(
                os.getpid(),
                multiprocessing.current_process().name,
                threading.current_thread().name))
            x = 0
            while x < 10000000:
                x += 1
We have created two tasks. Both of them are long-running, but only crunch_numbers actively performs computations.
Let's run only_sleep serially, multithreaded, and using multiple processes, and compare the results.
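Here's a minimal sketch of the three timing runs, reusing only_sleep and NUM_WORKERS from the cell above; the final print below reports the process-based run:

# Serial: run the task NUM_WORKERS times, one after the other
start_time = time.time()
for _ in range(NUM_WORKERS):
    only_sleep()
end_time = time.time()
print("Serial time= {}".format(end_time - start_time))

# Threads: one thread per task, started together, then joined
start_time = time.time()
threads = [threading.Thread(target=only_sleep) for _ in range(NUM_WORKERS)]
[thread.start() for thread in threads]
[thread.join() for thread in threads]
end_time = time.time()
print("Threads time= {}".format(end_time - start_time))

# Processes: one process per task, started together, then joined
start_time = time.time()
processes = [multiprocessing.Process(target=only_sleep) for _ in range(NUM_WORKERS)]
[process.start() for process in processes]
[process.join() for process in processes]
end_time = time.time()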
print("Parallel time= {}".format(end_time - start_time))
In the case of the serial approach, things are pretty obvious. We’re running the tasks one after
the other. All four runs are executed by the same thread of the same process.
Using processes we cut the execution time down to a quarter of the original time, simply
because the tasks are executed in parallel. Notice how each task is performed in a different process
and on the MainThread of that process.
Using threads we take advantage of the fact that the tasks can be executed concurrently. The
execution time is also cut down to a quarter, even though nothing is running in parallel. Here’s
how that goes: we spawn the first thread and it starts waiting for the timer to expire. We pause its
execution, letting it wait for the timer to expire, and in this time we spawn the second thread. We
repeat this for all the threads. At one moment the timer of the first thread expires so we switch
execution to it and we terminate it. The algorithm is repeated for the second and for all the other
threads. At the end, the result is as if things were run in parallel. You’ll also notice that the four
different threads branch out from and live inside the same process: MainProcess.
Now let's run crunch_numbers the same way, first with threads and then with processes:

start_time = time.time()
threads = [threading.Thread(target=crunch_numbers) for _ in range(NUM_WORKERS)]
[thread.start() for thread in threads]
[thread.join() for thread in threads]
end_time = time.time()
print("Threads time= {}".format(end_time - start_time))

start_time = time.time()
processes = [multiprocessing.Process(target=crunch_numbers) for _ in range(NUM_WORKERS)]
[process.start() for process in processes]
[process.join() for process in processes]
end_time = time.time()
print("Processes time= {}".format(end_time - start_time))
The main difference here is in the result of the multithreaded approach. This time it performs very similarly to the serial approach, and here's why: since the task performs computations and CPython's Global Interpreter Lock (GIL) prevents threads from executing Python bytecode in parallel, the threads are basically running one after the other, yielding execution to one another until they all finish.
Now consider a real-world application: a service that monitors website uptime.
• The application goes frequently over a list of website URLs and checks whether those websites are up.
• Every website should be checked every 5-10 minutes so that the downtime is not significant.
• Instead of performing a classic HTTP GET request, it performs a HEAD request so that it does not affect your traffic significantly.
• If the HTTP status is in the danger ranges (400+, 500+), the owner is notified.
• The owner is notified by email, text message, or push notification.
As the list of websites grows, going through the list serially won't guarantee that every website is checked every five minutes or so. A website could be down for hours before the owner is notified.
In [6]: import time
        import logging
        import requests

        class WebsiteDownException(Exception):
            pass

        def ping_website(address, timeout=20):
            '''
            Check if a website is down. A website is considered down
            if either the status_code >= 400 or the timeout expires.
            Raise a WebsiteDownException in either case.
            '''
            try:
                response = requests.head(address, timeout=timeout)
                if response.status_code >= 400:
                    logging.warning("Website {} returned status_code={}".format(
                        address, response.status_code))
                    raise WebsiteDownException()
            except requests.exceptions.RequestException:
                logging.warning("Timeout expired for website {}".format(address))
                raise WebsiteDownException()

        def notify_owner(address):
            '''
            Send the owner of the address a notification.
            For now, we're going to sleep for 0.5 seconds.
            '''
            logging.info("Notifying the owner of {} website".format(address))
            time.sleep(0.5)

        def check_website(address):
            '''
            Utility function: check if website is down
            '''
            try:
                ping_website(address)
            except WebsiteDownException:
                notify_owner(address)
In [7]: WEBSITE_LIST = [
'http://envato.com',
'http://amazon.co.uk',
'http://amazon.com',
'http://facebook.com',
'http://google.com',
'http://google.fr',
'http://google.es',
'http://google.co.uk',
'http://internet.org',
'http://gmail.com',
'http://stackoverflow.com',
'http://github.com',
'http://heroku.com',
'http://really-cool-available-domain.com',
'http://djangoproject.com',
'http://rubyonrails.org',
'http://basecamp.com',
'http://trello.com',
'http://yiiframework.com',
'http://shopify.com',
'http://another-really-interesting-domain.co',
'http://airbnb.com',
'http://instagram.com',
'http://snapchat.com',
'http://youtube.com',
'http://baidu.com',
'http://yahoo.com',
'http://live.com',
'http://linkedin.com',
'http://yandex.ru',
'http://netflix.com',
'http://wordpress.com',
'http://bing.com',
]
In [8]: import time

        start_time = time.time()
        # Check each website in the list, one after the other
        for address in WEBSITE_LIST:
            check_website(address)
        end_time = time.time()
        print("Time for Serial: {} secs".format(end_time - start_time))
WARNING:root:Timeout expired for website http://really-cool-available-domain.com
WARNING:root:Timeout expired for website http://another-really-interesting-domain.co
WARNING:root:Website http://live.com returned status_code=405
WARNING:root:Website http://netflix.com returned status_code=405
WARNING:root:Website http://bing.com returned status_code=405
Time for Serial: 27.111411809921265 secs
1.1.1 1. threading
To speed this up, we'll spawn NUM_WORKERS worker threads that pull addresses from a shared queue:

from queue import Queue

NUM_WORKERS = 4
task_queue = Queue()

def worker():
    # Constantly check the queue for addresses
    while True:
        address = task_queue.get()
        check_website(address)
        # Mark the processed task as done
        task_queue.task_done()

start_time = time.time()

# Create the worker threads; daemon=True lets the program exit
# even though the worker loop never returns
threads = [threading.Thread(target=worker, daemon=True)
           for _ in range(NUM_WORKERS)]

# Fill the queue, start the workers, and wait for every task to be processed
[task_queue.put(item) for item in WEBSITE_LIST]
[thread.start() for thread in threads]
task_queue.join()

end_time = time.time()
print("Time for Thread: {} secs".format(end_time - start_time))
Time for Thread: 11.231960535049438 secs
• join() in Threading
For example, when join() is invoked from the main thread, the main thread waits until the child thread on which join() was invoked exits. The significance of the join() method is that, if join() is not invoked, the main thread may exit before the child thread, which will result in undetermined behaviour of programs and affect program invariants and the integrity of the data on which the program operates.
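A minimal sketch of the difference (worker is an illustrative name):

import threading
import time

def worker():
    time.sleep(1)
    print("child thread done")

t = threading.Thread(target=worker)
t.start()
t.join()  # without this line, "main thread done" could print first
print("main thread done")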
1.1.2 2. concurrent.futures
concurrent.futures is a high-level API for using threads. We will use a ThreadPoolExecutor. We’re
going to submit tasks to the pool and get back the futures, which are results that will be available
to us in the future. Of course, we can wait for all futures to become actual results.
import concurrent.futures

NUM_WORKERS = 4
start_time = time.time()

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_WORKERS) as executor:
    futures = {executor.submit(check_website, address)
               for address in WEBSITE_LIST}
    concurrent.futures.wait(futures)

end_time = time.time()
print("Time for Future: {}".format(end_time - start_time))
1.1.3 3. multiprocessing
multiprocessing sidesteps the GIL entirely by running the tasks in separate worker processes instead of threads:

import multiprocessing

NUM_WORKERS = 4
start_time = time.time()
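The body of the run, as a minimal sketch: a process pool maps check_website over WEBSITE_LIST and waits for the results.

with multiprocessing.Pool(processes=NUM_WORKERS) as pool:
    # Distribute the checks across the worker processes
    results = pool.map_async(check_website, WEBSITE_LIST)
    # Block until every check has finished
    results.wait()

end_time = time.time()
print("Time for MultiProcessing: {} secs".format(end_time - start_time))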
1.1.4 Gevent
Gevent is a popular alternative for achieving massive concurrency. A few things to know:
• You need to monkey-patch the standard functions so that they cooperate with gevent. Normally a socket operation is blocking: we wait for the operation to finish. In a multithreaded environment, the scheduler would simply switch to another thread while one is waiting for I/O. Since we are not in a multithreaded environment, gevent patches the standard functions so that they become non-blocking and return control to the gevent scheduler.
from gevent import monkey
from gevent.pool import Pool

# Patch the standard socket functions so they cooperate with gevent
monkey.patch_socket()

NUM_WORKERS = 4
start_time = time.time()

pool = Pool(NUM_WORKERS)
for address in WEBSITE_LIST:
    pool.spawn(check_website, address)
# Wait for all the spawned greenlets to finish
pool.join()

end_time = time.time()
print("Time for Gevent: {} secs".format(end_time - start_time))
1.1.5 Celery
Celery takes an approach that differs from everything we've seen so far. It is battle-tested in very complex and high-performance environments. Setting up Celery will require a bit more tinkering than all of the above solutions.
First, we'll need to install Celery:

pip install celery
Tasks are the central concept within the Celery project. Everything that you'll want to run inside Celery needs to be a task.
Celery offers great flexibility for running tasks:
you can run them synchronously or asynchronously, real-time or scheduled, on the same machine or on
multiple machines, and using threads, processes, Eventlet, or gevent.
Celery uses other services for sending and receiving messages. These messages are usually
tasks or results from tasks. We’re going to use Redis in this tutorial for this purpose.
Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache and
message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries,
bitmaps, hyperloglogs, geospatial indexes with radius queries and streams.
Install Redis by following the Redis Quickstart guide. You'll also need to install the redis Python library:

pip install redis

Then start the Redis server:

redis-server
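As a quick sanity check that the server is reachable (assuming the default localhost:6379), the redis client can round-trip a value:

import redis

# Connect to the local Redis server (default host/port/db)
r = redis.Redis(host='localhost', port=6379, db=0)
r.set('greeting', 'hello')
print(r.get('greeting'))  # b'hello'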
To get started building stuff with Celery, we'll first need to create a Celery application. After that, Celery needs to know what kinds of tasks it might execute. To achieve that, we need to register tasks to the Celery application. We'll do this using the app.task decorator.
import time
import logging
import requests

from celery import Celery
from celery.result import ResultSet

class WebsiteDownException(Exception):
    pass

def ping_website(address, timeout=20):
    '''
    Check if a website is down. A website is considered down
    if either the status_code >= 400 or the timeout expires.
    Raise a WebsiteDownException in either case.
    '''
    try:
        response = requests.head(address, timeout=timeout)
        if response.status_code >= 400:
            logging.warning("Website {} returned status_code={}".format(
                address, response.status_code))
            raise WebsiteDownException()
    except requests.exceptions.RequestException:
        logging.warning("Timeout expired for website {}".format(address))
        raise WebsiteDownException()

def notify_owner(address):
    '''
    Send the owner of the address a notification.
    For now, we're going to sleep for 0.5 seconds.
    '''
    logging.info("Notifying the owner of {} website".format(address))
    time.sleep(0.5)

def check_website(address):
    '''
    Utility function: check if website is down
    '''
    try:
        ping_website(address)
    except WebsiteDownException:
        notify_owner(address)
WEBSITE_LIST = [
'http://envato.com',
'http://amazon.co.uk',
'http://amazon.com',
'http://facebook.com',
'http://google.com',
'http://google.fr',
'http://google.es',
'http://google.co.uk',
'http://internet.org',
'http://gmail.com',
'http://stackoverflow.com',
'http://github.com',
'http://heroku.com',
'http://really-cool-available-domain.com',
'http://djangoproject.com',
'http://rubyonrails.org',
'http://basecamp.com',
'http://trello.com',
'http://yiiframework.com',
'http://shopify.com',
'http://another-really-interesting-domain.co',
'http://airbnb.com',
'http://instagram.com',
'http://snapchat.com',
'http://youtube.com',
'http://baidu.com',
'http://yahoo.com',
'http://live.com',
'http://linkedin.com',
'http://yandex.ru',
'http://netflix.com',
'http://wordpress.com',
'http://bing.com',
]
app = Celery('selery',
             broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/0')
@app.task
def check_website_task(address):
    return check_website(address)

if __name__ == "__main__":
    start_time = time.time()
    # delay() submits each check to the Celery workers asynchronously;
    # ResultSet.get() blocks until every task has finished
    rs = ResultSet([check_website_task.delay(address)
                    for address in WEBSITE_LIST])
    rs.get()
    end_time = time.time()
    print("Time for Celery: {} secs".format(end_time - start_time))
Then, start a Celery worker pointed at the module that defines the app (here tasks stands in for whatever the module is actually called) and run the script:

celery -A tasks worker --loglevel=info --concurrency=4
One thing to pay attention to: notice how we passed the Redis address to our Celery application twice. The broker parameter specifies where the tasks are passed to Celery, and backend is where Celery puts the results so that we can use them in our app. If we don't specify a result backend, there's no way for us to know when the task was processed and what the result was.
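With the backend in place, the AsyncResult returned by delay() can be polled or waited on. A small sketch:

# Submit one task and inspect it through the result backend
res = check_website_task.delay('http://google.com')
print(res.ready())          # False until a worker finishes the task
print(res.get(timeout=10))  # blocks until the result lands in Redis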