
Python Multithreading and Multiprocessing

December 19, 2018

1 Python Multithreading and Multiprocessing Tutorial


• Why is parallelism tricky in Python? (Hint: it's because of the GIL - the Global Interpreter Lock)
• Threads vs Processes: different ways of achieving parallelism. When to use one over the other?
• Parallel vs Concurrent: why in some cases we can settle for concurrency rather than parallelism
• Building a simple but practical example using the various techniques discussed.

1.0.1 Global Interpreter Lock


The Global Interpreter Lock (GIL) is one of the most controversial subjects in the Python world.
In CPython, the most popular implementation of Python, the GIL is a mutex that makes things
thread-safe.
Thread-safe: an implementation is guaranteed to be free of race conditions when accessed by
multiple threads simultaneously.
A race condition occurs when two or more threads can access shared data and they try to
change it at the same time. Because the thread scheduling algorithm can swap between threads at
any time, you don’t know the order in which the threads will attempt to access the shared data.
Therefore, the result of the change in data is dependent on the thread scheduling algorithm, i.e.
both threads are "racing" to access/change the data.
Problems often occur when one thread does a "check-then-act" (e.g. "check" if the value is X,
then "act" to do something that depends on the value being X) and another thread does something
to the value in between the "check" and the "act".
In order to prevent race conditions from occurring, you would typically put a lock around the
shared data to ensure only one thread can access the data at a time.
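As a quick illustrative sketch (the counter, thread count, and iteration count here are made up, not part of the original example), incrementing a shared counter from several threads is exactly such a check-then-act situation, and a threading.Lock around the shared data fixes it:

import threading

counter = 0
counter_lock = threading.Lock()

def safe_increment(n):
    # Each += is really a read-modify-write; the lock makes sure
    # no other thread can interleave between the read and the write.
    global counter
    for _ in range(n):
        with counter_lock:
            counter += 1

threads = [threading.Thread(target=safe_increment, args=(100000,)) for _ in range(4)]
[t.start() for t in threads]
[t.join() for t in threads]
print(counter)  # 400000; without the lock, some updates could be lost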
The GIL makes it easy to integrate with external libraries that are not thread-safe, and it makes
non-parallel code faster.
Due to the GIL, we can’t achieve true parallelism via multithreading.
But stuff that happens outside the GIL realm is free to be parallel.
In this category fall long-running tasks like I/O and, fortunately, libraries like numpy.
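For example (a hypothetical sketch, not part of the original notebook, and it assumes numpy is installed), a large matrix product releases the GIL while the underlying C code runs, so several threads doing this kind of work can actually occupy multiple cores:

import threading
import numpy as np

a = np.random.rand(1500, 1500)
b = np.random.rand(1500, 1500)

def multiply():
    # numpy releases the GIL during the dot product, so these
    # threads are not serialized the way pure-Python loops would be.
    np.dot(a, b)

threads = [threading.Thread(target=multiply) for _ in range(4)]
[t.start() for t in threads]
[t.join() for t in threads]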

1.0.2 Threads vs Processes


A process is a program in execution - in other words, code that is running. Multiple processes
are always running on a computer, and they may be executing in parallel.

A process can have multiple threads. They execute the same code belonging to the parent
process. Ideally, they run in parallel, but not necessarily.
A process is an executing instance of an application. What does that mean? Well, for example,
when you double-click the Microsoft Word icon, you start a process that runs Word. A thread is
a path of execution within a process. Also, a process can contain multiple threads. When you
start Word, the operating system creates a process and begins executing the primary thread of that
process.
It’s important to note that a thread can do anything a process can do. But since a process can
consist of multiple threads, a thread could be considered a ‘lightweight’ process.
Threads within the same process share the same address space, whereas different processes do
not. This allows threads to read from and write to the same data structures and variables, and also
facilitates communication between threads.
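A minimal sketch of this difference (the list and function names are made up for illustration): a thread mutating a module-level list changes the parent's data, while a separate process only changes its own copy of that data:

import threading
import multiprocessing

data = []

def append_item():
    data.append("hello")

if __name__ == "__main__":
    # The thread shares the parent's address space, so the change is visible here.
    t = threading.Thread(target=append_item)
    t.start()
    t.join()
    print(data)   # ['hello']

    # The child process works on its own copy, so the parent's list is unchanged.
    p = multiprocessing.Process(target=append_item)
    p.start()
    p.join()
    print(data)   # still ['hello']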
Threads, of course, allow for multi-threading. A common example of the advantage of mul-
tithreading is the fact that you can have a word processor that prints a document using a back-
ground thread, but at the same time another thread is running that accepts user input, so that you
can type up a new document.
If we were dealing with an application that uses only one thread, then the application would
only be able to do one thing at a time – so printing and responding to user input at the same time
would not be possible in a single threaded application.
Sections of code that modify data structures shared by multiple threads are called critical sec-
tions. When a critical section is running in one thread it’s extremely important that no other thread
be allowed into that critical section.

1.0.3 Parallel vs Concurrent


Concurrency implies scheduling independent pieces of code to be executed in a cooperative manner:
taking advantage of the fact that a piece of code is waiting on I/O operations, and during that time
running a different but independent part of the code.
Processes A & B

Concurrent (execution interleaves, only one runs at any instant):

A ---   ---   ---
B    ---   ---   ---

Parallel (both run at the same time):

A ----------------
B ----------------

In [2]: import os
        import time
        import threading
        import multiprocessing

        NUM_WORKERS = 4

        def only_sleep():
            """ Do nothing, wait for a timer to expire """
            print("PID: {}, Process Name: {}, Thread Name: {}".format(
                os.getpid(),
                multiprocessing.current_process().name,
                threading.current_thread().name))
            time.sleep(1)

        def crunch_numbers():
            """ Do some computations """
            print("PID: {}, Process Name: {}, Thread Name: {}".format(
                os.getpid(),
                multiprocessing.current_process().name,
                threading.current_thread().name))
            x = 0
            while x < 10000000:
                x += 1

We have created two tasks. Both of them are long-running, but only crunch_numbers actively
performs computations.
Let's run only_sleep serially, multithreaded, and using multiple processes, and compare the
results.

In [3]: # Run tasks serially
        start_time = time.time()
        for _ in range(NUM_WORKERS):
            only_sleep()
        end_time = time.time()

        print("Serial time= {} \n".format(end_time - start_time))

        # Run tasks using threads
        start_time = time.time()
        threads = [threading.Thread(target=only_sleep) for _ in range(NUM_WORKERS)]
        print([thread for thread in threads])
        [thread.start() for thread in threads]
        [thread.join() for thread in threads]
        end_time = time.time()

        print("Threads time= {} \n".format(end_time - start_time))

        # Run tasks using processes
        start_time = time.time()
        processes = [multiprocessing.Process(target=only_sleep) for _ in range(NUM_WORKERS)]
        print([process for process in processes])
        [process.start() for process in processes]
        [process.join() for process in processes]
        end_time = time.time()

        print("Parallel time= {}".format(end_time - start_time))

PID: 17451, Process Name: MainProcess, Thread Name: MainThread


PID: 17451, Process Name: MainProcess, Thread Name: MainThread
PID: 17451, Process Name: MainProcess, Thread Name: MainThread
PID: 17451, Process Name: MainProcess, Thread Name: MainThread
Serial time= 4.005737543106079

[<Thread(Thread-4, initial)>, <Thread(Thread-5, initial)>, <Thread(Thread-6, initial)>, <Thread(Thread-7, initial)>]

PID: 17451, Process Name: MainProcess, Thread Name: Thread-5
PID: 17451, Process Name: MainProcess, Thread Name: Thread-6
PID: 17451, Process Name: MainProcess, Thread Name: Thread-4
PID: 17451, Process Name: MainProcess, Thread Name: Thread-7

Threads time= 1.0192456245422363

[<Process(Process-1, initial)>, <Process(Process-2, initial)>, <Process(Process-3, initial)>, <Process(Process-4, initial)>]


PID: 18100, Process Name: Process-1, Thread Name: MainThread
PID: 18102, Process Name: Process-2, Thread Name: MainThread
PID: 18106, Process Name: Process-3, Thread Name: MainThread
PID: 18109, Process Name: Process-4, Thread Name: MainThread
Parallel time= 1.048060655593872

In the case of the serial approach, things are pretty obvious. We’re running the tasks one after
the other. All four runs are executed by the same thread of the same process.
Using processes we cut the execution time down to a quarter of the original time, simply
because the tasks are executed in parallel. Notice how each task is performed in a different process
and on the MainThread of that process.
Using threads we take advantage of the fact that the tasks can be executed concurrently. The
execution time is also cut down to a quarter, even though nothing is running in parallel. Here’s
how that goes: we spawn the first thread and it starts waiting for the timer to expire. We pause its
execution, letting it wait for the timer to expire, and in this time we spawn the second thread. We
repeat this for all the threads. At one moment the timer of the first thread expires so we switch
execution to it and we terminate it. The algorithm is repeated for the second and for all the other
threads. At the end, the result is as if things were run in parallel. You’ll also notice that the four
different threads branch out from and live inside the same process: MainProcess.

In [5]: start_time = time.time()
        for _ in range(NUM_WORKERS):
            crunch_numbers()
        end_time = time.time()

        print("Serial time=", end_time - start_time)

        start_time = time.time()
        threads = [threading.Thread(target=crunch_numbers) for _ in range(NUM_WORKERS)]
        [thread.start() for thread in threads]
        [thread.join() for thread in threads]
        end_time = time.time()

        print("Threads time=", end_time - start_time)

        start_time = time.time()
        processes = [multiprocessing.Process(target=crunch_numbers) for _ in range(NUM_WORKERS)]
        [process.start() for process in processes]
        [process.join() for process in processes]
        end_time = time.time()

        print("Parallel time=", end_time - start_time)

PID: 17451, Process Name: MainProcess, Thread Name: MainThread


PID: 17451, Process Name: MainProcess, Thread Name: MainThread
PID: 17451, Process Name: MainProcess, Thread Name: MainThread
PID: 17451, Process Name: MainProcess, Thread Name: MainThread
Serial time= 2.1661345958709717
PID: 17451, Process Name: MainProcess, Thread Name: Thread-12

PID: 17451, Process Name: MainProcess, Thread Name: Thread-13


PID: 17451, Process Name: MainProcess, Thread Name: Thread-14
PID: 17451, Process Name: MainProcess, Thread Name: Thread-15
Threads time= 2.9878089427948
PID: 18132, Process Name: Process-9, Thread Name: MainThread
PID: 18133, Process Name: Process-10, Thread Name: MainThread
PID: 18141, Process Name: Process-12, Thread Name: MainThread
PID: 18138, Process Name: Process-11, Thread Name: MainThread
Parallel time= 1.2481975555419922

The main difference here is in the result of the multithreaded approach. This time it performs
very similarly to the serial approach, and here's why: since the task performs computations and,
because of the GIL, Python threads can't run bytecode in parallel, the threads are basically running
one after the other, yielding execution to one another until they all finish.

1.1 Building a Practical Application


Build an application that checks the uptime of websites.

• The application goes frequently over a list of website URLs and checks if those websites are up
• Every website should be checked every 5-10 minutes so that the downtime is not significant
• Instead of performing a classic HTTP GET request, it performs a HEAD request so that it
does not affect your traffic significantly
• If the HTTP status is in the danger ranges (400+, 500+), the owner is notified
• The owner is notified by email, text message, or push notification

Why is it essential to take a parallel/concurrent approach to the problem?

As the list of websites grows, going through the list serially won't guarantee us that every
website is checked every five minutes or so. The website could be down for hours, and the owner
won’t be notified.
In [6]: import time
        import logging
        import requests

        class WebsiteDownException(Exception):
            pass

        def ping_website(address, timeout=20):
            '''
            Check if the website is down: if status_code >= 400
            or if the timeout expires.
            Throw a WebsiteDownException if any of the website
            down conditions are met.
            '''
            try:
                response = requests.head(address, timeout=timeout)
                if response.status_code >= 400:
                    logging.warning("Website {} returned status_code={}".format(
                        address, response.status_code))
                    raise WebsiteDownException()
            except requests.exceptions.RequestException:
                logging.warning("Timeout expired for website {}".format(address))
                raise WebsiteDownException()

        def notify_owner(address):
            '''
            Send the owner of the address a notification.
            For now, we're going to sleep for 0.5 seconds.
            '''
            logging.info("Notifying the owner of {} website".format(address))
            time.sleep(0.5)

        def check_website(address):
            '''
            Utility function: check if website is down
            '''
            try:
                ping_website(address)
            except WebsiteDownException:
                notify_owner(address)
In [7]: WEBSITE_LIST = [
'http://envato.com',
'http://amazon.co.uk',
'http://amazon.com',
'http://facebook.com',
'http://google.com',
'http://google.fr',
'http://google.es',
'http://google.co.uk',
'http://internet.org',
'http://gmail.com',
'http://stackoverflow.com',
'http://github.com',
'http://heroku.com',
'http://really-cool-available-domain.com',
'http://djangoproject.com',
'http://rubyonrails.org',
'http://basecamp.com',
'http://trello.com',
'http://yiiframework.com',
'http://shopify.com',
'http://another-really-interesting-domain.co',
'http://airbnb.com',
'http://instagram.com',
'http://snapchat.com',
'http://youtube.com',
'http://baidu.com',
'http://yahoo.com',
'http://live.com',
'http://linkedin.com',
'http://yandex.ru',
'http://netflix.com',
'http://wordpress.com',
'http://bing.com',
]
In [8]: import time

        start_time = time.time()
        for address in WEBSITE_LIST:
            check_website(address)
        end_time = time.time()

        print("Time for Serial: {} secs".format(end_time - start_time))
WARNING:root:Timeout expired for website http://really-cool-available-domain.com
WARNING:root:Timeout expired for website http://another-really-interesting-domain.co
WARNING:root:Website http://live.com returned status_code=405
WARNING:root:Website http://netflix.com returned status_code=405
WARNING:root:Website http://bing.com returned status_code=405

Time for Serial: 27.111411809921265 secs

1.1.1 1. Threading Approach


Use a queue to put the addresses in and create worker threads to get them out of the queue and
process them. We are going to wait for the queue to be empty.

In [9]: import time

        from queue import Queue
        from threading import Thread

        NUM_WORKERS = 4
        task_queue = Queue()

        def worker():
            # Constantly check the queue for addresses
            while True:
                address = task_queue.get()
                check_website(address)

                # Mark the processed task as done
                task_queue.task_done()

        start_time = time.time()

        # Create the worker threads
        threads = [Thread(target=worker) for _ in range(NUM_WORKERS)]

        # Add the websites to the task queue
        [task_queue.put(item) for item in WEBSITE_LIST]

        # Start all the workers
        [thread.start() for thread in threads]

        # Wait for all the tasks in the queue to be processed
        task_queue.join()

        end_time = time.time()

        print("Time for Thread: {} secs".format(end_time - start_time))

WARNING:root:Timeout expired for website http://another-really-interesting-domain.co


WARNING:root:Timeout expired for website http://really-cool-available-domain.com
WARNING:root:Website http://live.com returned status_code=405
WARNING:root:Website http://bing.com returned status_code=405
WARNING:root:Website http://netflix.com returned status_code=405

Time for Thread: 11.231960535049438 secs

• join() in threading
When join() is invoked from the main thread, the main thread waits until the child thread
on which join() was invoked exits. The significance of the join() method is that, if join() is not
invoked, the main thread may finish before the child thread, which can result in undetermined
program behaviour and affect program invariants and the integrity of the data on which the
program operates.
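A minimal sketch of that behaviour (the function name is made up for illustration): with join(), the main thread's final print only runs after the child has finished; without it, the last line would be printed before the child's work is done:

import threading
import time

def child():
    time.sleep(1)
    print("child finished")

t = threading.Thread(target=child)
t.start()
t.join()                     # comment this out and "main exiting" prints first
print("main exiting")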

1.1.2 2. concurrent.futures
concurrent.futures is a high-level API for using threads. We will use a ThreadPoolExecutor. We’re
going to submit tasks to the pool and get back the futures, which are results that will be available
to us in the future. Of course, we can wait for all futures to become actual results.

In [11]: import time
         import concurrent.futures

         NUM_WORKERS = 4
         start_time = time.time()

         with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_WORKERS) as executor:
             futures = {executor.submit(check_website, address)
                        for address in WEBSITE_LIST}
             concurrent.futures.wait(futures)

         end_time = time.time()
         print("Time for Future: {}".format(end_time - start_time))

WARNING:root:Timeout expired for website http://really-cool-available-domain.com


WARNING:root:Timeout expired for website http://another-really-interesting-domain.co
WARNING:root:Website http://live.com returned status_code=405
WARNING:root:Website http://bing.com returned status_code=405
WARNING:root:Website http://netflix.com returned status_code=405

Time for Future: 14.673298120498657

1.1.3 3. The Multiprocessing Approach


The multiprocessing library provides an almost drop-in replacement API for the threading
library. In this case, we're going to take an approach similar to the concurrent.futures one,
submitting tasks by mapping a function over the list of addresses (think of the classic Python
map function).

In [12]: import time
         import socket
         import multiprocessing

         NUM_WORKERS = 4
         start_time = time.time()

         with multiprocessing.Pool(processes=NUM_WORKERS) as pool:
             results = pool.map_async(check_website, WEBSITE_LIST)
             results.wait()

         end_time = time.time()
         print("Time for MultiProcessing: {}".format(end_time - start_time))

WARNING:root:Timeout expired for website http://really-cool-available-domain.com


WARNING:root:Timeout expired for website http://another-really-interesting-domain.co
WARNING:root:Website http://live.com returned status_code=405
WARNING:root:Website http://netflix.com returned status_code=405
WARNING:root:Website http://bing.com returned status_code=405

Time for MultiProcessing: 12.136277437210083

1.1.4 Gevent
Gevent is a popular alternative for achieving massive concurrency. A few things to know:

• Code performed concurrently by greenlets is deterministic. As opposed to the other presented
alternatives, this paradigm guarantees that for any two identical runs, you'll get the same
results in the same order.

• You need to monkey-patch the standard functions so that they cooperate with gevent. What
this means is that normally a socket operation is blocking: we're waiting for the operation
to finish. In a multithreaded environment, the scheduler would simply switch to another
thread while the other one is waiting for I/O. Since we are not in a multithreaded
environment, gevent patches the standard functions so that they become non-blocking and
return control to the gevent scheduler.

In [14]: import time

         from gevent.pool import Pool
         from gevent import monkey

         NUM_WORKERS = 4

         # Monkey-patch the socket module for HTTP requests
         monkey.patch_socket()

         start_time = time.time()

         pool = Pool(NUM_WORKERS)
         for address in WEBSITE_LIST:
             pool.spawn(check_website, address)

         # Wait for stuff to finish
         pool.join()

         end_time = time.time()

         print("Time for Monkey {}".format(end_time - start_time))

WARNING:root:Timeout expired for website http://really-cool-available-domain.com


WARNING:root:Timeout expired for website http://another-really-interesting-domain.co
WARNING:root:Website http://live.com returned status_code=405
WARNING:root:Website http://netflix.com returned status_code=405
WARNING:root:Website http://bing.com returned status_code=405

Time for Monkey 15.155159950256348

1.1.5 Celery
Celery is an approach that mostly differs from what we've seen so far. It is battle-tested in the
context of very complex and high-performance environments. Setting up Celery will require a bit
more tinkering than all the above solutions.
First, we’ll need to install Celery:

pip install celery

Tasks are the central concept within the Celery project. Everything that you'll want to run
inside Celery needs to be a task.
Celery offers great flexibility for running tasks:
you can run them synchronously or asynchronously, real-time or scheduled, on the same machine or on
multiple machines, and using threads, processes, Eventlet, or gevent.
Celery uses other services for sending and receiving messages. These messages are usually
tasks or results from tasks. We’re going to use Redis in this tutorial for this purpose.
Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache and
message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries,
bitmaps, hyperloglogs, geospatial indexes with radius queries and streams.
Install Redis by following the Redis Quickstart guide.
Also, install the redis Python library:

pip install redis

And the bundle necessary for using Redis and Celery:

pip install celery[redis]

Start the Redis server by:

redis-server

To get started building stuff with Celery, we’ll first need to create a Celery application. After
that, Celery needs to know what kind of tasks it might execute. To achieve that, we need to register
tasks to the Celery application. We'll do this using the app.task decorator.

In [19]: # Make file selery.py

         import time
         import logging
         import requests
         from celery import Celery
         from celery.result import ResultSet

         class WebsiteDownException(Exception):
             pass

         def ping_website(address, timeout=20):
             '''
             Check if the website is down: if status_code >= 400
             or if the timeout expires.
             Throw a WebsiteDownException if any of the website
             down conditions are met.
             '''
             try:
                 response = requests.head(address, timeout=timeout)
                 if response.status_code >= 400:
                     logging.warning("Website {} returned status_code={}".format(
                         address, response.status_code))
                     raise WebsiteDownException()
             except requests.exceptions.RequestException:
                 logging.warning("Timeout expired for website {}".format(address))
                 raise WebsiteDownException()

         def notify_owner(address):
             '''
             Send the owner of the address a notification.
             For now, we're going to sleep for 0.5 seconds.
             '''
             logging.info("Notifying the owner of {} website".format(address))
             time.sleep(0.5)

         def check_website(address):
             '''
             Utility function: check if website is down
             '''
             try:
                 ping_website(address)
             except WebsiteDownException:
                 notify_owner(address)

         WEBSITE_LIST = [
             'http://envato.com',
             'http://amazon.co.uk',
             'http://amazon.com',
             'http://facebook.com',
             'http://google.com',
             'http://google.fr',
             'http://google.es',
             'http://google.co.uk',
             'http://internet.org',
             'http://gmail.com',
             'http://stackoverflow.com',
             'http://github.com',
             'http://heroku.com',
             'http://really-cool-available-domain.com',
             'http://djangoproject.com',
             'http://rubyonrails.org',
             'http://basecamp.com',
             'http://trello.com',
             'http://yiiframework.com',
             'http://shopify.com',
             'http://another-really-interesting-domain.co',
             'http://airbnb.com',
             'http://instagram.com',
             'http://snapchat.com',
             'http://youtube.com',
             'http://baidu.com',
             'http://yahoo.com',
             'http://live.com',
             'http://linkedin.com',
             'http://yandex.ru',
             'http://netflix.com',
             'http://wordpress.com',
             'http://bing.com',
         ]

         app = Celery('selery',
                      broker='redis://localhost:6379/0',
                      backend='redis://localhost:6379/0')

         @app.task
         def check_website_task(address):
             return check_website(address)

         if __name__ == "__main__":
             start_time = time.time()

             # Using `delay` runs the task async
             rs = ResultSet([check_website_task.delay(address) for address in WEBSITE_LIST])

             # Wait for the tasks to finish
             rs.get()

             end_time = time.time()

             print("Celery:", end_time - start_time)

In the same folder where our Python file resides, start the Celery worker:

>> celery worker -A selery --loglevel=INFO --concurrency=4

Then,

>> python selery.py


Celery: 4.989539623260498

One thing to pay attention to: notice how we passed the Redis address to our Celery application
twice. The broker parameter specifies where the tasks are passed to Celery, and backend is where
Celery puts the results so that we can use them in our app. If we don't specify a result backend,
there's no way for us to know when the task was processed and what the result was.
