Python Data Visualization Cookbook - Second Edition - Sample Chapter
Python Data Visualization Cookbook, Second Edition starts by showing you how to set up matplotlib and
related libraries. It also includes explanations of how to incorporate matplotlib into different environments,
such as a writing system or LaTeX, and how to create Gantt charts using Python.
Giuseppe Vettigli
$ 44.99 US
28.99 UK
P U B L I S H I N G
Python Data Visualization Cookbook, Second Edition will take the reader from installing and setting up a
Python environment for data manipulation and visualization, all the way through to 3D animations using
Python libraries. It contains over 70 precise and reproducible recipes that will guide the reader towards a
better understanding of data concepts.
Igor Milovanović
Dimitry Foures
Giuseppe Vettigli
Giuseppe Vettigli is a data scientist who has worked in the research industry and
academia for many years. His work is focused on the development of machine learning
models and applications to use information from structured and unstructured data.
He also writes about scientific computing and data visualization in Python on his blog
at http://glowingpython.blogspot.com.
Preface
The best data is the data that we can see and understand. As developers and data scientists,
we want to create and build the most comprehensive and understandable visualizations.
It is not always simple; we need to find the data, read it, clean it, filter it, and then use the
right tool to visualize it. This book explains how to read, clean, and visualize data, turning it
into information, with simple (and sometimes not so simple) recipes. It covers how to read
local data, remote data, CSV, JSON, and data from relational databases.
Some simple plots can be plotted with one simple line in Python using matplotlib, but
performing more advanced charting requires knowledge of more than just Python. We need
to understand information theory and human perception aesthetics to produce the most
appealing visualizations.
This book will explain some practices behind plotting with matplotlib in Python, statistics used,
and usage examples for different charting features that we should use in an optimal way.
Chapter 6, Plotting Charts with Images and Maps, deals with image processing, projecting
data onto maps, and creating CAPTCHA test images.
Chapter 7, Using Right Plots to Understand Data, covers explanations and recipes on some
more advanced plotting techniques such as spectrograms and correlations.
Chapter 8, More on matplotlib Gems, covers a set of charts such as Gantt charts, box plots,
and whisker plots, and it also explains how to use LaTeX for rendering text in matplotlib.
Chapter 9, Visualizations on the Clouds with Plot.ly, introduces how to use Plot.ly to create
and share your visualizations on its cloud environment.
Preparing Your
Working Environment
In this chapter, you will cover the following recipes:
Introduction
This chapter introduces the essential tools and explains how to install and configure them.
This is necessary work and a common base for the rest of the book. If you have never used
Python for data and image processing and visualization, we advise you not to skip this
chapter. Even if you do skip it, you can always return to it when you need to install a
supporting tool or verify which version a particular solution requires.
Getting ready
We assume that you already have Linux (preferably Debian/Ubuntu or RedHat/SciLinux)
installed, with Python on it. Python is usually preinstalled on these distributions; if not,
it is easy to install through standard means. We assume that Python 2.7 or later is
installed on your workstation.
Almost all of the code should work with Python 3.3 and later, but since most
operating systems still ship Python 2.7 (some even Python 2.6), we decided
to write the code for Python 2.7. The differences are small, mainly in
package versions and a few constructs (for example, xrange should be
replaced with range in Python 3.3+).
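As a minimal sketch of the xrange/range difference mentioned above, the following snippet is written to run unchanged on both Python 2.7 and Python 3. The name lazy_range is our own choice for illustration and is not part of any library.

```python
# Pick the lazy range implementation available in the running interpreter.
try:
    lazy_range = xrange  # Python 2: xrange is the lazy variant
except NameError:
    lazy_range = range   # Python 3: range is already lazy

# Use it exactly as you would use xrange or range.
squares = [i * i for i in lazy_range(5)]
print(squares)
```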
We also assume that you know how to use your OS package manager in order to install
software packages and know how to use a terminal.
The build requirements must be satisfied before matplotlib can be built.
matplotlib requires NumPy, libpng, and freetype as build dependencies, so to build
matplotlib from source we must first install NumPy. Here's how to do it:
Install NumPy (1.5+ if you want to use it with Python 3) from http://www.numpy.org/
NumPy provides data structures and mathematical functions for working with large
datasets. Python's default data structures, such as tuples, lists, and dictionaries, are great
for insertions, deletions, and concatenation. NumPy's data structures support "vectorized"
operations and are efficient both to use and to execute. They are implemented with big
data in mind and rely on C implementations that keep execution time low.
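To illustrate what a "vectorized" operation looks like in practice, here is a small sketch that doubles a sequence of numbers twice: once with a plain Python list and an explicit loop, and once with a single NumPy array expression. The variable names are ours, and the example assumes NumPy is already installed.

```python
import numpy as np

values = list(range(1, 6))

# Pure Python: an explicit loop over the list elements.
doubled_list = [2 * v for v in values]

# NumPy: one expression applied to the whole array at once.
arr = np.array(values)
doubled_arr = 2 * arr

print(doubled_list)
print(doubled_arr.tolist())
```

Both approaches give the same numbers; the NumPy version pushes the loop down into compiled code, which is where the speed advantage on large datasets comes from.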
SciPy, building on top of NumPy, is the de facto standard scientific and
numeric toolkit for Python. It comprises a great selection of special functions
and algorithms, most of them implemented in C and Fortran and coming
from the well-known Netlib repository (http://www.netlib.org).
Chapter 1
If you are using RedHat or a variation of this distribution (Fedora, SciLinux, or CentOS),
you can use yum to perform the same installation:
$ su -c 'yum-builddep python-matplotlib'
How to do it...
There are many ways one can install matplotlib and its dependencies: from source,
precompiled binaries, OS package manager, and with prepackaged Python distributions
with built-in matplotlib.
Most probably the easiest way is to use your distribution's package manager. For Ubuntu
that should be:
# in your terminal, type:
$ sudo apt-get install python-numpy python-matplotlib python-scipy
If you want to be on the bleeding edge, the best option is to install from source. This path
comprises a few steps: get the source code, satisfy the build requirements, and then
configure, compile, and install.
Download the latest source from the code host SourceForge, and then build and install it, by following these steps:
$ cd ~/Downloads/
$ wget https://downloads.sourceforge.net/project/matplotlib/matplotlib/matplotlib-1.4.3/matplotlib-1.4.3.tar.gz
$ tar xzf matplotlib-1.4.3.tar.gz
$ cd matplotlib-1.4.3
$ python setup.py build
$ sudo python setup.py install
How it works...
We use standard Python Distribution Utilities, known as Distutils, to install matplotlib from
the source code. This procedure requires us to previously install dependencies, as we already
explained in the Getting ready section of this recipe. The dependencies are installed using the
standard Linux packaging tools.
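Once the installation has finished, it can be useful to confirm which packages Python can actually import. The following is a minimal sketch, not part of the original recipe; the helper name check_package is hypothetical, and it relies only on the standard importlib module and on the common (but not universal) convention that packages expose a __version__ attribute.

```python
import importlib


def check_package(name):
    """Return (installed, version) for a module name; version may be None."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return (False, None)
    # Not every package defines __version__, so fall back to None.
    return (True, getattr(module, "__version__", None))


for name in ("numpy", "scipy", "matplotlib"):
    installed, version = check_package(name)
    print("%s: %s" % (name, version if installed else "not installed"))
```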
There's more...
There are more optional packages that you might want to install depending on what your data
visualization projects are about.
No matter what project you are working on, we recommend installing IPython, an interactive
Python shell where you already have matplotlib and related packages, such as NumPy and
SciPy, imported and ready to play with. Please refer to IPython's official site on how to install it
and use it; it is, though, very straightforward.
Getting ready
To install virtualenv, you must have a working installation of Python and pip. Pip is a tool
for installing and managing Python packages, and it is a replacement for easy_install.
We will use pip through most of this book for package management. Pip is easily installed;
as root, execute the following line in your terminal:
# easy_install pip
virtualenv by itself is really useful, but with the help of virtualenvwrapper, creating and
organizing many virtual environments becomes easy. See all the features at
http://virtualenvwrapper.readthedocs.org/en/latest/#features.
How to do it...
By performing the following steps, you can install the virtualenv and virtualenvwrapper tools:
1. Install virtualenv and virtualenvwrapper:
$ sudo pip install virtualenv
$ sudo pip install virtualenvwrapper
# Create folder to hold all our virtual environments and export the path to it.
$ export VIRTENV=~/.virtualenvs
$ mkdir -p $VIRTENV
# We source (i.e. execute) the shell script to activate the wrappers
$ source /usr/local/bin/virtualenvwrapper.sh
# And create our first virtual environment
$ mkvirtualenv virt1
3. You will probably want to add the following line to your ~/.bashrc file:
source /usr/local/bin/virtualenvwrapper.sh
mkvirtualenv ENV: This creates a virtual environment with the name ENV
and activates it
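After creating and activating an environment, you may want to confirm from within Python that the interpreter really is running inside it. The sketch below is one hedged way to do this; the function name in_virtualenv is ours. It assumes that older virtualenv releases set sys.real_prefix, while Python 3.3+ environments expose sys.base_prefix, which differs from sys.prefix when an environment is active.

```python
import sys


def in_virtualenv():
    """Best-effort check for an active virtual environment."""
    # Set by older virtualenv releases.
    real_prefix = getattr(sys, "real_prefix", None)
    # Available on Python 3.3+; equals sys.prefix outside an environment.
    base_prefix = getattr(sys, "base_prefix", None)
    if real_prefix is not None:
        return True
    return base_prefix is not None and base_prefix != sys.prefix


print("Inside a virtual environment: %s" % in_virtualenv())
print("Interpreter prefix: %s" % sys.prefix)
```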
In this case, we see that even though we simply installed matplotlib, many other packages
are also installed. Apart from wsgiref, which is used by pip itself, these are required
dependencies of matplotlib which have been automatically installed.
When transferring a project from an environment (possibly a virtual environment) to another,
the receiving environment needs to have all the necessary packages installed (in the same
version as in the original environment) in order to be sure that the code can be properly run.
This can be problematic as two different environments might not contain the same packages,
and, worse, might contain different versions of the same package. This can lead to conflicts
or unexpected behaviors in the execution of the program.
To avoid this problem, pip freeze can be used to save a copy of the current
environment configuration. The following command saves its output to the file
requirements.txt:
$ pip freeze > requirements.txt
In a new environment, this file can be used to install all the required libraries. Simply run:
$ pip install -r requirements.txt
All the necessary packages will automatically be installed in their specified versions. That way,
we ensure that the environment where the code is used is always the same. It is good
practice to have a virtual environment and a requirements.txt file for every project you
are developing. Therefore, before installing the required packages, it is advised that you first
create a new virtual environment to avoid conflicts with other projects.
On machine 1:
$ mkvirtualenv env1
(env1)$ pip install matplotlib
(env1)$ pip freeze > requirements.txt
On machine 2:
$ mkvirtualenv env2
(env2)$ pip install -r requirements.txt
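The file written by pip freeze is plain text with one name==version pin per line, so it is easy to inspect from Python, for example to compare two environments. The following is a minimal sketch under that assumption; the function parse_requirements and the package versions in the sample text are hypothetical and only serve as illustration.

```python
def parse_requirements(text):
    """Map package names to pinned versions from 'name==version' lines."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an exact pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


# Hypothetical contents of a requirements.txt file.
sample = """\
matplotlib==1.4.3
numpy==1.9.2
# a comment line
six==1.9.0
"""

pins = parse_requirements(sample)
for name in sorted(pins):
    print("%s is pinned to %s" % (name, pins[name]))
```

Comparing the dictionaries produced from two machines shows at a glance which packages differ in version or are missing altogether.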
Getting ready
We will use the Homebrew project (you could also use MacPorts in the same way), which eases
the installation of all the software that Apple did not install on your OS, including Python and
matplotlib. Under the hood, Homebrew is a set of Ruby scripts, managed with Git, that
automate download and installation. Following these instructions should get the installation
working. First, we will install Homebrew, and then Python, followed by tools such as
virtualenv, then the dependencies for matplotlib (NumPy and SciPy), and finally matplotlib.
Hold on, here we go.
How to do it...
1. In your terminal, paste and execute the following command:
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
After the command finishes, try running brew update or brew doctor to verify that the
installation is working properly.
3. You will need to restart the terminal so that it picks up the new path. Installing Python is as
easy as firing up another one-liner:
brew install python --framework --universal
5. To verify that the installation has worked, type python --version in the command
line; you should see 2.7.3 as the version number in the response.
6. You should have pip installed by now. In case it is not installed, use easy_install
to add pip:
$ easy_install pip
7. Now, it's easy to install any required package; for example, virtualenv and
virtualenvwrapper are useful:
pip install virtualenv
pip install virtualenvwrapper
9. Verify that everything is working. Call Python and execute the following commands:
import numpy
print numpy.__version__
import scipy
print scipy.__version__
quit()
Getting ready
There are two ways of installing matplotlib on Windows. The easiest way is by installing
prepackaged Python environments, such as EPD, Anaconda, SageMath, and Python(x,y).
This is the suggested way to install Python, especially for beginners.
The second way is to install everything using binaries of precompiled matplotlib and required
dependencies. This is more difficult as you have to be careful about the versions of NumPy
and SciPy you are installing, as not every version is compatible with the latest version of
matplotlib binaries. The advantage in this is that you can even compile your particular
versions of matplotlib or any library to have the latest features, even if they are not provided
by authors.
How to do it...
Installing a free or commercial Python scientific distribution is as easy as following the
steps provided on the project's website.
If you just want to start using matplotlib and don't want to be bothered with Python versions
and dependencies, you may want to consider using the Enthought Python Distribution (EPD).
EPD contains prepackaged libraries required to work with matplotlib and all the required
dependencies (SciPy, NumPy, IPython, and more).
As usual, we download the Windows installer (*.exe), which installs all the code we need to
start using matplotlib and run all the recipes from this book.
There is also a free scientific project, Python(x,y) (http://python-xy.github.io), for
32-bit Windows systems. It comes with all dependencies resolved and is an easy (and free!)
way of installing matplotlib on Windows. Since Python(x,y) is compatible with Python module
installers, it can be easily extended with other Python libraries. No Python installation should
be present on the system before installing Python(x,y).
There's more...
Note that many examples are not included in the Windows installer. If you want to try the
demos, download the matplotlib source and look in the examples subdirectory.
How to do it...
The easiest and most recommended way is to use your platform's package managers. For
Debian and Ubuntu use the following commands:
$ sudo apt-get build-dep python-imaging
$ sudo pip install http://effbot.org/downloads/Imaging-1.1.7.tar.gz
How it works...
This way, we satisfy all build dependencies using the apt-get system while also installing
the latest stable release of PIL. Some older versions of Ubuntu don't provide the
latest releases.
On RedHat and SciLinux systems, run the following commands:
# yum install python-imaging
# yum install freetype-devel
# pip install PIL
There's more...
There is a good online handbook specifically for PIL. You can read it at
http://www.pythonware.com/library/pil/handbook/index.htm or download the PDF version
from http://www.pythonware.com/media/data/pil-handbook.pdf.
There is also a PIL fork, Pillow, whose main aim is to fix installation issues. Pillow can be found
at http://pypi.python.org/pypi/Pillow and it is easy to install (at the time of writing,
Pillow is the only choice if you are using OS X).
On Windows, PIL can also be installed using a binary installation file. Install PIL in your Python
site-packages by executing .exe from http://www.pythonware.com/products/pil/.
Now, if you want PIL used in a virtual environment, manually copy the PIL.pth file and the
PIL directory at C:\Python27\Lib\site-packages to your virtualenv site-packages
directory.
How to do it...
Using pip is the best way to install requests. Use the following command to do so:
$ pip install requests
That's it. This can also be done inside your virtualenv, if you don't need requests for every
project or want to support different requests versions for each project.
Just to get you ahead quickly, here's a small example on how to use requests:
import requests
r = requests.get('http://github.com/timeline.json')
print r.content
How it works...
We send an HTTP GET request to a URI at www.github.com that returns a JSON-formatted
timeline of activity on GitHub (you can see the HTML version of that timeline at
https://github.com/timeline). After the response is successfully read, the r object contains
the content and other properties of the response (response code, cookies set, header
metadata, and even the request we sent in order to get this response).
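To make the properties of the response object more concrete, here is a hedged sketch that collects a few of them into a dictionary. The helper fetch_summary and the FakeResponse class are hypothetical names introduced only for this illustration; the attributes status_code, headers, and content are the ones a requests response provides. The fake response lets the sketch run without network access, and requests is imported only when a real request is made.

```python
def fetch_summary(url, get=None):
    """Fetch url and return a small dict describing the response."""
    if get is None:
        # Imported lazily so the sketch can also be exercised offline.
        import requests
        get = requests.get
    r = get(url)
    return {
        "status": r.status_code,
        "content_type": r.headers.get("Content-Type"),
        "length": len(r.content),
    }


class FakeResponse(object):
    """Stand-in exposing the same attributes we read from a real response."""
    status_code = 200
    headers = {"Content-Type": "application/json"}
    content = b"[]"


# Offline demonstration; drop the get argument to perform a real request.
summary = fetch_summary("http://example.invalid/", get=lambda url: FakeResponse())
print(summary)
```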
Getting ready
As we already said, matplotlib configuration is read from a configuration file. This file provides
a place to set up permanent default values for certain matplotlib properties, well, for almost
everything in matplotlib.
How to do it...
There are two ways to change parameters during code execution: using the dictionary of
parameters (rcParams) or calling the matplotlib.rc() function. The former enables
us to load an already existing dictionary into rcParams, while the latter takes the name of
a settings group followed by keyword arguments.
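A minimal sketch of the two approaches, assuming matplotlib is installed, could look as follows. The line width of 2 and the color red are illustrative values, and the calls to matplotlib.rcdefaults() are there only so that each approach starts from the built-in defaults.

```python
import matplotlib as mpl

# Way 1: write directly into the rcParams dictionary.
mpl.rcParams['lines.linewidth'] = 2
mpl.rcParams['lines.color'] = 'r'
via_dict = (mpl.rcParams['lines.linewidth'], mpl.rcParams['lines.color'])

# Start again from the built-in defaults.
mpl.rcdefaults()

# Way 2: call rc() with the group name and keyword arguments.
mpl.rc('lines', linewidth=2, color='r')
via_rc = (mpl.rcParams['lines.linewidth'], mpl.rcParams['lines.color'])

# Leave the global configuration as we found it.
mpl.rcdefaults()

print(via_dict == via_rc)
```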
Both examples are semantically the same. In the second sample, we define that all
subsequent plots will have lines with line width of 2 points. The last statement of the
previous code defines that the color of every line following this statement will be red,
unless we override it by local settings. See the following example:
import matplotlib.pyplot as plt
import numpy as np
# sample points in the interval [0, 1)
t = np.arange(0.0, 1.0, 0.01)
s = np.sin(2 * np.pi * t)
# make line red
plt.rcParams['lines.color'] = 'r'
plt.plot(t,s)
c = np.cos(2 * np.pi * t)
# make line thick
plt.rcParams['lines.linewidth'] = '3'
plt.plot(t,c)
plt.show()
How it works
First, we import matplotlib.pyplot and NumPy to allow us to draw sine and cosine
graphs. Before plotting the first graph, we explicitly set the line color to red using the
plt.rcParams['lines.color'] = 'r' command.
Next, we go to the second graph (cosine function) and explicitly set the line width to three
points using the plt.rcParams['lines.linewidth'] = '3' command.
If we want to restore all settings to their default values, we should call matplotlib.rcdefaults().
In this recipe, we have seen how to customize the style of a matplotlib chart by dynamically
changing its configuration parameters. The matplotlib.rcParams object is the interface
that we used to modify the parameters. It is global to the matplotlib package, and any change
that we apply to it affects all the charts that we draw afterwards.
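Because rcParams is global, it is worth knowing how to undo a change. The sketch below, which assumes matplotlib is installed, shows two options: putting back a single saved value by hand, and calling matplotlib.rcdefaults(), which restores every setting to matplotlib's built-in defaults. The variable names are ours.

```python
import matplotlib as mpl

# Remember one setting, change it, then put the old value back by hand.
saved_width = mpl.rcParams['lines.linewidth']
mpl.rcParams['lines.linewidth'] = 3
changed_width = mpl.rcParams['lines.linewidth']
mpl.rcParams['lines.linewidth'] = saved_width

# Change it again, then restore everything to the built-in defaults.
mpl.rcParams['lines.linewidth'] = 3
mpl.rcdefaults()
restored_width = mpl.rcParams['lines.linewidth']

print(changed_width)
print(restored_width == mpl.rcParamsDefault['lines.linewidth'])
```

Restoring a single value is the gentler choice when other code in the same process has made its own adjustments that should be kept.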
Getting ready
If you don't want to configure matplotlib as the first step in your code every time you use
it (as we did in the previous recipe), this recipe will explain how to have different default
configurations of matplotlib for different projects. This way your code will not be cluttered
with configuration data and, moreover, you can easily share configuration templates with
your co-workers or even among other projects.
How to do it...
If you have a working project that always uses the same settings for certain parameters
in matplotlib, you probably don't want to set them every time you add new plotting
code. Instead, what you want is a permanent file, outside of your code, which sets defaults
for matplotlib parameters.
matplotlib supports this via its matplotlibrc configuration file that contains most of the
changeable properties of matplotlib.
How it works...
There are three different places where this file can reside and its location defines its usage.
They are:
Current working directory: This is where your code runs from. This is the place to
customize matplotlib just for your current directory that might contain your current
project code. The file is named matplotlibrc.
The following one liner will print the location of your configuration directory and can be run
from shell:
$ python -c 'import matplotlib as mpl; print mpl.get_configdir()'
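The same information can be obtained from a script. As a small sketch, assuming matplotlib is installed, matplotlib.get_configdir() returns the configuration directory and matplotlib.matplotlib_fname() returns the path of the matplotlibrc file that is actually in effect for the current process. The parenthesized print calls are used so that the snippet also runs on Python 3.

```python
import matplotlib as mpl

# Directory where per-user configuration is stored.
config_dir = mpl.get_configdir()

# Path of the matplotlibrc file currently in effect.
active_file = mpl.matplotlib_fname()

print("Configuration directory: %s" % config_dir)
print("Active matplotlibrc: %s" % active_file)
```

Running this from different project directories is a quick way to check which configuration file each project picks up.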
axes: This deals with face and edge color, tick sizes, and grid display.
figure: This deals with dpi, edge color, figure size, and subplot settings.
font: This looks at font families, font size, and style settings.
legend: This specifies how legends and text inside will be displayed.
lines: This covers line (color, style, width, and so on) and marker settings.
patch: Patches are graphical objects that fill 2D space, such as polygons
and circles; this sets linewidth, color, antialiasing, and so on.
savefig: There are separate settings for saved figures, for example, to
render files with a white background.
text: This looks for text color, how to interpret text (plain versus latex markup)
and similar.
verbose: This controls how much information matplotlib gives during runtime: silent,
helpful, debug, and debug-annoying.
xtick and ytick: These set the color, size, direction, and label size for major and
minor ticks for the x and y axes.
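To see which groups and settings your installed matplotlib version actually provides, you can inspect the keys of rcParams, which follow a group.setting naming pattern. The following is a minimal sketch assuming matplotlib is installed; the variable names are ours, and the exact list of groups depends on the matplotlib version.

```python
import matplotlib as mpl

# Every key looks like 'group.setting'; collect the distinct group names.
groups = sorted(set(key.split('.')[0] for key in mpl.rcParams))
print(groups)

# List all settings that belong to one group, here 'lines'.
lines_settings = sorted(key for key in mpl.rcParams if key.startswith('lines.'))
for key in lines_settings:
    print("%s = %r" % (key, mpl.rcParams[key]))
```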
There's more...
If you are interested in more details for every mentioned setting (and some that we did not
mention here), the best place to go is the website of the matplotlib project where there is
up-to-date API documentation. If it doesn't help, user and development lists are always
good places to leave questions. See the back of this book for useful online resources.
www.PacktPub.com