Data Mining with Python (draft)
Contents

List of Tables

1 Introduction
  1.1 Other introductions to Python?
  1.2 Why Python for data mining?
  1.3 Why not Python for data mining?
  1.4 Components of the Python language and software
  1.5 Developing and running Python
      1.5.1 Python, pypy, IPython
      1.5.2 Jupyter Notebook
      1.5.3 Python 2 vs. Python 3
      1.5.4 Editing
      1.5.5 Python in the cloud
      1.5.6 Running Python in the browser

2 Python
  2.1 Basics
  2.2 Datatypes
      2.2.1 Booleans (bool)
      2.2.2 Numbers (int, float, complex and Decimal)
      2.2.3 Strings (str)
      2.2.4 Dictionaries (dict)
      2.2.5 Dates and times
      2.2.6 Enumeration
      2.2.7 Other container classes
  2.3 Functions and arguments
      2.3.1 Anonymous functions with lambdas
      2.3.2 Optional function arguments
  2.4 Object-oriented programming
      2.4.1 Objects as functions
  2.5 Modules and import
      2.5.1 Submodules
      2.5.2 Globbing import
      2.5.3 Coping with Python 2/3 incompatibility
  2.6 Persistency
      2.6.1 Pickle and JSON
      2.6.2 SQL
      2.6.3 NoSQL
  2.7 Documentation
  2.8 Testing
      2.8.1 Testing for type
      2.8.2 Zero-one-some testing
      2.8.3 Test layout and test discovery
      2.8.4 Test coverage
      2.8.5 Testing in different environments
  2.9 Profiling
  2.10 Coding style
      2.10.1 Where is private and public?
  2.11 Command-line interface scripting
      2.11.1 Distinguishing between module and script
      2.11.2 Argument parsing
      2.11.3 Exit status
  2.12 Debugging
      2.12.1 Logging
  2.13 Advice

5 Case: Pima data set
  5.1 Problem description and objectives
  5.2 Descriptive statistics and plotting
  5.3 Statistical tests
  5.4 Predicting diabetes type

Bibliography

Index
Preface
Python has grown to become one of the central languages in data mining, offering both a general programming language and libraries specifically targeted at numerical computations.

This book is continuously being written and grew out of a course given at the Technical University of Denmark.
List of Figures

2.1 Overview of methods and attributes in the common Python 2 built-in data types plotted as a
    formal concept analysis lattice graph. Only a small subset of methods and attributes is shown.
List of Tables
Chapter 1

Introduction
To see how well Python with its modern data mining packages compares with R, take a look at Carl J. V.'s blog posts on Will it Python?² and his GitHub repository, where he reproduces in Python the R data analyses from the book Machine Learning for Hackers.
5. Python with its BSD license falls into the group of free and open source software. Although some large Python development environments may have an associated license cost for commercial use, the basic Python development environment can be set up and run with no licensing cost. Indeed on some systems, e.g., many Linux distributions, basic Python comes readily installed. The Python Package Index provides a large set of packages that are also free software.
6. Large community. Python has a large community and has become more popular. Several indicators testify to this. The Popularity of Language Index (PYPL) bases its programming language ranking on Google search volume provided by Google Trends and puts Python in the third position after Java and PHP. According to PYPL the popularity of Python has grown since 2004. TIOBE constructs another indicator, putting Python in 6th rank. This indicator is "based on the number of skilled engineers world-wide, courses and third party vendors".³ Python is also among the leading programming languages in terms of StackOverflow tags and GitHub projects.⁴ Furthermore, in 2014 Python was the most popular programming language at top-ranked United States universities for teaching introductory programming [9].
7. Quality: The Coverity company finds errors among Python's 400,000 lines of code, but the error rate is very low compared to other open source software projects: 0.005 defects per KLoC [10].
8. Jupyter Notebook: With the browser-based interactive notebook, where code, textual and plotting results and documentation may be interleaved in a cell-based environment, the Jupyter Notebook represents an interesting approach that you will typically not find for many other programming languages. Exceptions are the commercial systems Maple and Mathematica, which have notebook interfaces. The Jupyter Notebook runs locally in a web browser. The notebook files are JSON files that can easily be shared and rendered on the Web.

The obvious advantages of the Jupyter Notebook have led other languages to use the environment. The Jupyter Notebook can be changed to use, e.g., the Julia language as the computational backend, i.e., instead of writing Python code in the code cells of the notebook you write Julia code. With appropriate extensions the Jupyter Notebook can intermix R code.
1. Not well-suited to mobile phones and other portable devices. Although Python surely can run on mobile phones and there exists at least one (dated) book on 'Mobile Python' [11], Python has not caught on for development of mobile apps. There exist several mobile app development frameworks, with Kivy mentioned as a leading contender. Developers can also use Python in mobile contexts for the backend of a web-based system and for data mining data collected at the backend.
2. Does not run 'natively' in the browser. Javascript entirely dominates as the language in web browsers. Various ways exist to mix Python and web browser programming.⁵ The Pyjamas project with its Python-to-Javascript compiler allows you to write web browser client code in Python and compile it to Javascript, which the web browser then runs. There are several other of these stand-alone Javascript compilers in 'various states of development', as it is called: PythonJS, Pyjaco, Py2JS. Other frameworks use in-browser implementations, one of them being Brython, which enables the front-end engineer to write Python code in an HTML script tag if the page includes the brython.js Javascript library via the HTML script tag. It supports core Python modules and has access to the DOM API, but not, e.g., the scientific Python libraries written in C. Brython scripts unfortunately run considerably slower than scripts implemented directly in Javascript or executed in an ordinary Python implementation [12].

² http://slendermeans.org/pages/will-it-python.html
³ http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
⁴ http://www.dataists.com/2010/12/ranking-the-popularity-of-programming-langauges/
⁵ See https://wiki.python.org/moin/WebBrowserProgramming
3. Concurrent programming. Standard Python has no direct way of utilizing several CPUs in the language. Multithreading capabilities can be obtained with the threading package, but the individual threads will not run concurrently on different CPUs in the standard Python implementation. This implementation has the so-called 'Global Interpreter Lock' (GIL), which only allows a single thread at a time to execute Python bytecode. This is to ensure the integrity of the data. A way to get around the GIL is by spawning new processes with the multiprocessing package or just the subprocess module, as in the sketch below.
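A minimal sketch of the multiprocessing route (the slow_square function and the numbers are invented for the example; the with-statement form of Pool requires Python 3.3 or later):

from multiprocessing import Pool

def slow_square(n):
    """A stand-in CPU-bound function."""
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    # Each worker process has its own interpreter and its own GIL,
    # so the calls can run on several CPUs at once.
    with Pool(processes=4) as pool:
        results = pool.map(slow_square, [10**5, 10**6, 10**7])
    print(results)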
4. Installation friction. You may run into problems when building, distributing and installing your software. There are various ways to bundle Python software, e.g., with the setuptools package. Based on a configuration file, setup.py, where you specify, e.g., the name, author and dependencies of your package, setuptools can build a file to distribute with the commands python setup.py bdist or python setup.py bdist_egg. The latter command will build a so-called Python Egg file containing all the Python files you specified. The user of your package can install your Python files based on the configuration and content of that file. It will still be necessary to download and install the dependencies you have specified in the setup.py file before the user of your software can use your code. If your user does not have Python, the installation tools and a C compiler installed, it is likely that s/he will find it a considerable task to install your program.

Various tools exist to make the distribution easier by integrating the distributed files into one self-contained downloadable file. These tools include cx_Freeze, PyInstaller, py2exe (for Windows), py2app (for OS X) and pynsist.
5. Speed. Python will typically perform slower than a compiled language such as C++, and Python typically performs poorer than Julia, the programming language designed for technical computing. Various Python implementations and extensions, such as pypy, numba and Cython, can speed up the execution of Python code, but even then Julia can perform faster: Andrew Tulloch has reported performance ratios between 1.1 and 300 in Julia's favor for isotonic regression algorithms.⁶ The slowness of Python means that Python libraries tend to be developed in C, while, e.g., well-performing Julia libraries may be developed in Julia itself.⁷ Speeding up Python often means modifying Python code with, e.g., specialized decorators, but a proof-of-concept system, Bohrium, has shown that a Python extension may require only little change in 'standard' array-processing code to speed up Python considerably [13].

It may, however, be worth noting that a program's performance can vary as much or more between programmers as between Python, Java and C++ [7].
Figure 1.1: The Python hierarchy.
2. Built-in classes and functions. An ordinary implementation of Python makes a set of classes and functions available at program start without the need of module import. Examples include the function for opening files (open), classes for built-in data types (e.g., float and str) and data manipulation functions (e.g., sum, abs and zip). The builtins module makes these classes and functions available and you can see a listing of them with dir(__builtins__).⁸ You will find it non-trivial to get rid of the built-in functions, e.g., if you want to restrict the ability of untrusted code to call the open function, cf. sandboxing Python.
3. Built-in modules. Built-in modules contain extra classes and functions built into Python, but not immediately accessible. You will need to import these with import to use them. The sys built-in module contains a list of all the built-in modules: sys.builtin_module_names. Among the built-in modules are the system-specific parameters and functions module (sys), a module with mathematical functions (math), the garbage collection module (gc) and a module with many handy iterator functions good to be acquainted with (itertools).

The set of built-in modules varies between implementations of Python. In one of my installations I count 46 modules, which include the builtins module and the current working module __main__.
4. Python Standard Library (PSL). An ordinary installation of Python makes a large set of modules with classes and functions available to the programmer without the need for extra installation. The programmer only needs to write a one-line import statement to have access to exported classes, functions and constants in such a module.

You can see which Python (byte-compiled) source file associates with the import via the __file__ property of the module, e.g., after import os you can see the filename with os.__file__. Built-in modules do not have this property set in the standard implementation of Python. On a typical Linux system you might find the PSL modules in directories with names like /usr/lib/python3.2/.

One of my installations has just above 200 PSL modules.

⁸ There are some silly differences between __builtin__ and __builtins__. For Python 3 use builtins.
5. Python Package Index (PyPI), also known as the CheeseShop, is the central archive for Python packages, available from https://pypi.python.org.

The index reports that it contains over 42393 packages as of April 2014. They range from popular packages such as lxml and requests, over large web frameworks such as Django, to strange packages such as absolute, a package with the sole purpose of implementing a function that computes the absolute value of a number (this functionality is already built-in with the abs function).

You will often need to install the packages, unless you use one of the large development frameworks such as Enthought and Anaconda or the package is already installed via your system. If you have the pip program up and running, then installation of packages from PyPI is relatively easy: From the terminal (outside Python) you write pip install <packagename>, which will download, possibly compile, install and set up the package. Unsure of the package, you can write pip search <query> and pip will return a list of packages matching the query. Once you have installed the package you will be able to use the package in Python with >>> import <packagename>.

If parts of the software you are installing are written in C, then pip install will require a C compiler to build the library files. If a compiler is not readily available you can download and install a binary pre-compiled package, if one is available. Otherwise some systems, e.g., Ubuntu and Debian, distribute a large set of the most common packages from PyPI in pre-compiled versions, e.g., the Ubuntu/Debian versions of lxml and requests are called python-lxml and python-requests.

On a typical Linux system you will find the packages installed under directories such as /usr/lib/python2.7/dist-packages/
6. Other Python components. From time to time you will find that not all packages are available from the Python Package Index. Often these packages come with a setup.py that allows you to install the software.

If the bundle of Python files does not even have a setup.py file, you can download it and put it in your own self-selected directory. The python program will not be able to discover the path to the program, so you will need to tell it. In Linux and Windows you can set the environment variable PYTHONPATH to a colon- or semicolon-separated list of directories with the Python code. Windows users may also set the PYTHONPATH from the 'Advanced' system properties. Alternatively the Python developer can set the sys.path attribute from within Python. This variable contains the paths as strings in a list and the developer can append a new directory to it, as in the sketch below.
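A minimal sketch (the directory name is hypothetical):

import sys

# Make modules in a self-selected directory importable.
sys.path.append('/home/me/pythonmodules')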
GitHub user Vinta provides a good curated list of important Python frameworks, libraries and software at https://github.com/vinta/awesome-python.
performance than CPython but lags behind PyPy."⁹ Though interesting, these programs are not yet so relevant in data mining applications.
Some individuals and companies have assembled binary distributions of Python and many Python packages together with an integrated development environment (IDE). These systems may be particularly relevant for users without a compiler to compile C-based Python packages, e.g., many Windows users. Python(x,y) is a Windows- and scientific-oriented Python distribution with the Spyder integrated development environment. WinPython is a similar system. You will find many relevant data mining packages included in WinPython, e.g., pandas, IPython and numexpr, as well as a tool to install, uninstall and upgrade packages. Continuum Analytics distributes their Anaconda and Enthought their Enthought Canopy, both systems targeted at scientists, engineers and other data analysts. Available for the Windows, Linux and Mac platforms, they include what you would almost expect of such data mining environments, e.g., numpy, scipy, pandas, nltk and networkx. Enthought Canopy is only free for academic use. The basic Anaconda is 'completely free', while Continuum Analytics provides some 'add-ons' that are only free for academic use. Yet another prominent commercial-grade distribution of Python and Python packages is ActivePython. It seems less geared towards data mining work. Windows users not using these systems and who do not have the ability to compile C may take a look at Christoph Gohlke's large list of precompiled binaries assembled at http://www.lfd.uci.edu/~gohlke/pythonlibs/.
part of this approach uses the __future__ module, importing relevant features, e.g., __future__.division and __future__.print_function like:

from __future__ import division, print_function, unicode_literals

This scheme will change Python 2's division operator '/' from integer division to floating point division and the print from a keyword to a function.
Writing code that adheres to both Python 2 and 3 might be particularly inconvenient for string-based processing, but the module six provides further help on the issue. For testing whether a variable is a general string, in Python 2 you would test whether the variable is an instance of the basestring built-in type to capture both byte-based strings (Python 2 str type) and Unicode strings (Python 2 unicode type). However, Python 3 has no basestring by default. Instead you test with the Python 3 str class, which contains Unicode strings. A constant in the six module, six.string_types, captures this difference and is an example of how the six module can help writing portable code. The following code testing for string type for a variable will work in both Python 2 and 3:

if isinstance(my_variable, six.string_types):
    print('my_variable is a string')
else:
    print('my_variable is not a string')
1.5.4 Editing

For editing you should have an editor that understands the basic elements of the Python syntax, e.g., to help you make correct indentation, which is an essential part of the Python syntax. A large number of Python-aware editors exist,¹¹ e.g., Emacs and the editors in the Spyder and Eric IDEs. Commercial IDEs, such as PyCharm and Wing IDE, also have good Python editors.

For autocompletion Python has a jedi module, which various editors can use through a plugin. Programmers can also call it directly from a Python program. IPython and Spyder feature autocompletion as well.

For collaborative programming—pair programming or physically separated programming—it is worth noting that the collaborative document editor Gobby has support for Python syntax highlighting and Pythonic indentation. It features chat, but has no features beyond simple editing, e.g., you will not find support for direct execution, style checking or debugging that you will find in Spyder. The Rudel plugin for Emacs supports the Gobby protocol.
The company Runnable provides such a service through the URL http://runnable.com, where users may write Python code directly in the browser and let the system execute and return the result. The cloud service Wakari (https://wakari.io/) lets users work with and share cloud-based Jupyter Notebook sessions. It is a cloud version of Continuum Analytics' Anaconda.

The Skulpt implementation of Python runs in a browser, and a demonstration of it runs from its homepage http://www.skulpt.org/. It is used by several other websites, e.g., CodeSkulptor at http://www.codeskulptor.org. Codecademy is a webservice aimed at learning to code. Python features among the programming languages supported, and a series of interactive introductory tutorials run from the URL http://www.codecademy.com/tracks/python. The Online Python Tutor uses its interactive environment to demonstrate with program visualization how the variables in Python change as the program is executed [17]. This may serve novices learning Python well, but also more experienced programmers when they debug. pythonanywhere (https://www.pythonanywhere.com) also has coding in the browser.

Code Golf at http://codegolf.com/ invites users to compete by solving coding problems with the smallest number of characters. The contestants cannot see each other's contributions. Another Python code challenge website is Check IO, see http://www.checkio.org.

Such services have less relevance for data mining, e.g., Runnable will not allow you to import numpy, but they may be an alternative way to learn Python. CodeSkulptor, implementing a subset of Python 2, allows the programmer to import the modules numeric, simplegui, simplemap and simpleplot for rudimentary matrix computations and plotting numerical data. At Plotly (https://plot.ly) users can collaboratively construct plots, and Python coding with Numpy features as one of the methods to build the plots.
Chapter 2
Python
2.1 Basics
Two functions in Python are important to know: help and dir. help shows the documentation for the input argument, e.g., help(open) shows the documentation for the open built-in function, which reads and writes files. help works for most elements of Python: modules, classes, variables, methods, functions, ..., but not keywords. dir will show a list of methods, constants and attributes for a Python object, and since most elements in Python are objects (but not keywords) dir will work, e.g., dir(list) shows the methods associated with the built-in list datatype of Python. One of the methods in the list object is append. You can see its documentation with help(list.append).

Indentation is important in Python, actually essential: It is what determines the block structure, so indentation limits the scope of control structures as well as class and function definitions. Four spaces is the default indentation. Although Python's semantics will work with other numbers of spaces and tabs for indentation, you should generally stay with four spaces.
2.2 Datatypes
Table 2.1 displays Python’s basic data types together with the central data types of the Numpy and Pandas
modules. The data types in the first part of table are the built-in data types readily available when python
starts up. The data types in the second part are Numpy data types discussed in chapter 3, specifically in
section 3.1, while the data types in the third part of the table are from the Pandas package discussed in
section 3.3. An instance of a data type is converted to another type by instancing the other class, e.g., turn
the float 32.2 into a string ’32.2’ with str(32.2) or the string ’abc’ into the list [’a’, ’b’, ’c’] with
list(’abc’). Not all of the conversion combinations work, e.g., you cannot convert an integer to a list. It
results in a TypeError.
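In an interactive session these conversions look as follows:

>>> str(32.2)
'32.2'
>>> list('abc')
['a', 'b', 'c']
>>> list(1)
Traceback (most recent call last):
  ...
TypeError: 'int' object is not iterable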
Built-in type  Operator   Mutable  Example                  Description
bool                      No       True                     Boolean
bytearray                 Yes      bytearray(b'\x01\x04')   Array of bytes
bytes          b''        No       b'\x00\x17\x02'          Immutable array of bytes
complex                   No       (1+4j)                   Complex number
dict           {:}        Yes      {'a': True, 45: 'b'}     Dictionary, indexed by, e.g., strings
float                     No       3.1                      Floating point number
frozenset                 No       frozenset({1, 3, 4})     Immutable set
int                       No       17                       Integer
list           []         Yes      [1, 3, 'a']              List
set            {}         Yes      {1, 2}                   Set with unique elements
slice          :          No       slice(1, 10, 2)          Slice indices
str            "" or ''   No       "Hello"                  String
tuple          (,)        No       (1, 'Hello')             Tuple

Numpy type   Char     Mutable  Example                  Description
array                 Yes      np.array([1, 2])         One-, two-, or many-dimensional
matrix                Yes      np.matrix([[1, 2]])      Two-dimensional matrix
bool                  —        np.array([1], 'bool_')   Boolean, one byte long
int                   —        np.array([1])            Default integer, same as C's long
int8         b        —        np.array([1], 'b')       8-bit signed integer
int16        h        —        np.array([1], 'h')       16-bit signed integer
int32        i        —        np.array([1], 'i')       32-bit signed integer
int64        l, p, q  —        np.array([1], 'l')       64-bit signed integer
uint8        B        —        np.array([1], 'B')       8-bit unsigned integer
float                 —        np.array([1.])           Default float
float16      e        —        np.array([1], 'e')       16-bit half precision floating point
float32      f        —        np.array([1], 'f')       32-bit single precision floating point
float64      d        —        np.array([1.], 'd')      64-bit double precision floating point
float128     g        —        np.array([1], 'g')       128-bit floating point
complex               —                                 Same as complex128
complex64             —                                 Single precision complex number
complex128            —        np.array([1+1j])         Double precision complex number
complex256            —                                 256-bit complex number (two 128-bit floats)

Pandas type  Mutable  Example                 Description
Series       Yes      pd.Series([2, 3, 6])    One-dimensional (vector-like)
DataFrame    Yes      pd.DataFrame([[1, 2]])  Two-dimensional (matrix-like)
Panel        Yes      pd.Panel([[[1, 2]]])    Three-dimensional (tensor-like)
Panel4D      Yes      pd.Panel4D([[[[1]]]])   Four-dimensional

Table 2.1: Basic built-in, Numpy and Pandas datatypes. Here import numpy as np and import pandas
as pd. Note that Numpy has a few more datatypes, e.g., a time delta datatype.
The different packages of Python confusingly handle complex numbers differently. Consider three different implementations of the square root function in the math, numpy and scipy packages computing the square root of −1:

>>> import math, numpy, scipy
>>> math.sqrt(-1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: math domain error
>>> numpy.sqrt(-1)
__main__:1: RuntimeWarning: invalid value encountered in sqrt
nan
>>> scipy.sqrt(-1)
1j

Here math.sqrt raises an exception, numpy.sqrt returns a NaN for the float input, while scipy.sqrt returns the imaginary number. The numpy.sqrt function may also return the imaginary number if it is given a complex number instead of the float input number:

>>> numpy.sqrt(-1+0j)
1j
Python 2 has long, which is for long integers. In Python 2 int(12345678901234567890) will return a variable with the long datatype. In Python 3 long has been subsumed in int, so int in this version can represent arbitrarily long integers, while the long type has been removed. A workaround to define long in Python 3 is simply long = int.

The issue of multibyte Unicode and byte-strings yields complexity. Indeed Python 2 and Python 3 differ (unfortunately!) considerably in their definition of what is a Unicode string and what is a byte string.

Triple double quotes are by convention used for docstrings. When Python prints out a string it uses single quotes, unless the string itself contains a single quote.
>>> a
{('Friston', 'Worsley'): 2}

Dictionaries may also be created with dictionary comprehensions, here an example with a dictionary of lengths of method names for the float object:

>>> {name: len(name) for name in dir(float)}
{'__int__': 7, '__repr__': 8, '__str__': 7, 'conjugate': 9, ...

Iterations over the keys of the dictionary are immediately available via the object itself or via the dict.keys method. Values can be iterated with the dict.values method, and both keys and values can be iterated with the dict.items method.
Dictionary access shares some functionality with object attribute access. Indeed the attributes are accessible as a dictionary in the __dict__ attribute:

>>> class MyDict(dict):
...     def __init__(self):
...         self.a = None
>>> my_dict = MyDict()
>>> my_dict.a
>>> my_dict.a = 1
>>> my_dict.__dict__
{'a': 1}
>>> my_dict['a'] = 2
>>> my_dict
{'a': 2}
In the Pandas library (see section 3.3) columns in its pandas.DataFrame object can be accessed both as attributes and as keys, though only as attributes if the key name is a valid Python identifier, e.g., strings with spaces or other special characters cannot be attribute names. The addict package provides a similar functionality as in Pandas:

>>> from addict import Dict
>>> paper = Dict()
>>> paper.title = 'The functional anatomy of verbal initiation'
>>> paper.authors = 'Nathaniel-James, Fletcher, Frith'
>>> paper
{'authors': 'Nathaniel-James, Fletcher, Frith',
 'title': 'The functional anatomy of verbal initiation'}
>>> paper['authors']
'Nathaniel-James, Fletcher, Frith'
The advantage of accessing dictionary content as attributes is probably mostly related to ease of typing and
readability.
2.2.5 Dates and times

There are various means to handle dates and times in Python. Python provides the datetime module with the datetime.datetime class (the class is confusingly called the same as the module). The datetime.datetime class records date, hours, minutes, seconds, microseconds and time zone information, while datetime.date only handles dates. As an example consider computing the number of days from 15 January 2001 to 24 September 2014. datetime.date makes such a computation relatively straightforward:
>>> from datetime import date
>>> date(2014, 9, 24) - date(2001, 1, 15)
datetime.timedelta(5000)
>>> str(date(2014, 9, 24) - date(2001, 1, 15))
'5000 days, 0:00:00'

i.e., 5000 days from the one date to the other. A function in the dateutil module converts from dates and times represented as strings to datetime.datetime objects, e.g., dateutil.parser.parse('2014-09-18') returns datetime.datetime(2014, 9, 18, 0, 0).
Numpy also has a datatype to handle dates, enabling easy date computation on multiple time data, e.g., below we compute the number of days for two given dates given a starting date:

>>> import numpy as np
>>> start = np.array(['2014-09-01'], 'datetime64')
>>> dates = np.array(['2014-12-01', '2014-12-09'], 'datetime64')
>>> dates - start
array([91, 99], dtype='timedelta64[D]')

Here the computation defaults to representing the timing in days.
A datetime.datetime object can be turned into an ISO 8601 string format with the datetime.datetime.isoformat method, but simply using str may be easier:

>>> from datetime import datetime
>>> str(datetime.now())
'2015-02-13 12:21:22.758999'

To get rid of the part with microseconds use the replace method:

>>> str(datetime.now().replace(microsecond=0))
'2015-02-13 12:22:52'
2.2.6 Enumeration

Python 3.4 has an enumeration datatype (symbolic members) with the enum.Enum class. In previous versions of Python enumerations were just implemented as integers, e.g., in the re regular expression module you would have a flag such as re.IGNORECASE set to the integer value 2. For older versions of Python the enum34 pip package can be installed, which contains a Python 3.4-compatible enum module.

Below is a class called Grade derived from enum.Enum and used as a label for the quality of an apple, where there are three fixed options for the quality:
from enum import Enum
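# The rest of this listing is lost in the draft; the three member
# names below are invented for illustration.
class Grade(Enum):
    """Quality of an apple."""
    good = 1
    acceptable = 2
    bad = 3

An apple's quality is then a symbolic member such as Grade.good, with Grade.good.value being the integer 1.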
2.3 Functions and arguments

Functions are defined with the keyword def, and the return statement specifies which object the function should return, if any. The function can be specified to have multiple positional and keyword (named) input arguments, and optional input arguments with default values can also be specified. As with control structures, indentation marks the scope of the function definition.

Functions can be called recursively, but they are usually slower than their iterative counterparts, and there is by default a recursion depth limit of 1000.
f = polynomial(3, -2, -2)
f(3)

plot_dirac(2)
plt.hold(True)
plot_dirac(3, linewidth=3)
plot_dirac(-2, 'r--')
plt.axis((-4, 4, 0, 2))
plt.show()

In the first call to plot_dirac, args and kwargs will be empty, i.e., an empty tuple and an empty dictionary. In the second call print(kwargs) will show {'linewidth': 3}, and in the third call we get ('r--',) from the print(args) statement.
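The definition of plot_dirac is not preserved at this point in the draft; a sketch consistent with the description above might read:

import matplotlib.pyplot as plt

def plot_dirac(position, *args, **kwargs):
    """Plot a spike at 'position', forwarding extra arguments to plt.plot."""
    print(args)
    print(kwargs)
    plt.plot([position, position], [0, 1], *args, **kwargs)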
The above polynomial function can be changed to accept a variable number of positional arguments, so polynomials of any order can be returned from the polynomial construction function:

def polynomial(*args):
    expons = range(len(args))[::-1]
    return lambda x: sum([coef * x ** expon for coef, expon in zip(args, expons)])
Method        Operator       Description
__init__      ClassName()    Constructor, called when an instance of a class is made
__del__       del            Destructor
__call__      object_name()  The method called when the object is a function, i.e., 'callable'
__getitem__   []             Get element: a.__getitem__(2) is the same as a[2]
__setitem__   [] =           Set element: a.__setitem__(1, 3) is the same as a[1] = 3
__contains__  in             Determine if element is in container
__str__                      Method used for the print keyword/function
__abs__       abs()          Method used for absolute value
__len__       len()          Method called for the len (length) function
__add__       +              Add two objects, e.g., add two numbers or concatenate two strings
__iadd__      +=             Addition with assignment
__div__       /              Division (in Python 2 integer division for int by default)
__floordiv__  //             Integer division with floor rounding
__pow__       **             Power for numbers, e.g., 3 ** 4 = 81
__and__       &              Method called for the '&' operator
__eq__        ==             Test for equality
__lt__        <              Less than
__le__        <=             Less than or equal
__xor__       ^              Exclusive or. Works bitwise for integers and binary for Booleans
...

Attribute     Description
__class__     Class of object, e.g., <type 'list'> (Python 2), <class 'list'> (Python 3)
__doc__       The documentation string, e.g., used for help()

Table 2.2: Class methods and attributes. These names are available with the dir function, e.g.,
an_integer = 3; dir(an_integer).
f = polynomial(3, -2, -2)
f(3)   # Returned result is 19
f = polynomial(-2)
f(3)   # Returned result is -2
Figure 2.1: Overview of methods and attributes in the common Python 2 built-in data types plotted as a
formal concept analysis lattice graph. Only a small subset of methods and attributes is shown.
As an example, below a subclass of int is defined with a definition for the length method:

>>> class Integer(int):
...     def __len__(self):
...         return 1
>>> i = Integer(3)
>>> len(i)
1
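The definition of the WordsString class is lost at the page break; a minimal sketch matching the usage below, with __call__ returning the word at a given index, could be:

class WordsString(str):
    """A string that returns its n'th word when called."""
    def __call__(self, index):
        return self.split()[index]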
After instancing the WordsString class with a string we can call the object to let it return, e.g., the fifth word:

>>> s = WordsString("To suppose that the eye with all its inimitable contrivances")
>>> s(4)
'eye'

Alternatively we could have defined an ordinary method with a name such as word and called the object as s.word(4), a slightly longer notation, but perhaps more readable and intuitive for the user of the class compared to the surprising use of the call method.
The file associated with the module is available in the __file__ attribute; in the example that would be os.__file__. While standard Python 2 (CPython) does not make this attribute available for built-in modules, it is available in Python 3 and in this case links to the os.py file.
Individual classes, attributes and functions can be imported via the from keyword, e.g., if we only need
the os.listdir function from the os module we could write:
from os import listdir
This import variation will make the os.listdir function available as listdir.
If the package contains submodules then they can be imported via the dot notation, e.g., if we want
names from the tokenization part of the NLTK library we can include that submodule with:
import nltk.tokenize
The imported modules, classes and functions can be renamed with the as keyword. By convention several data mining modules are aliased to specific names:

import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

With these aliases Numpy's sin function will be available under the name np.sin.

Import statements should occur before the imported name is used. They are usually placed at the top of the file, but this is only a style convention. Import of names from the special __future__ module should be at the very top. The style checking tool flake8 will help on checking conventions for imports, e.g., it will complain about unused imports, i.e., if a module is imported but the names in it are never used in the importing module. The flake8-import-order flake8 extension even pedantically checks for the ordering of the imports.

¹ 6. Modules in The Python Tutorial.
² Unless built-in.
2.5.1 Submodules

If a package consists of a directory tree then subdirectories can be used as submodules. For older versions of Python it is necessary to have an __init__.py file in each subdirectory before Python recognizes the subdirectories as submodules. Here is an example of a module, imager, which contains three submodules in two subdirectories:
/imager
__init__.py
/io
__init__.py
jpg.py
/process
__init__.py
factorize.py
categorize.py
Provided that the module imager is available in the path (sys.path), the jpg module will now be available for import as

import imager.io.jpg
Relative imports can be used inside the package. Relative imports are specified with single or double dots in much the same way as directory navigation, e.g., a relative import of the categorize and jpg modules from the factorize.py file can read:

from . import categorize
from ..io import jpg

Some developers encourage the use of relative imports because it makes refactoring easier. On the other hand, relative imports can cause problems if circular import dependencies between the modules appear. In this latter case absolute imports work around the problem.
Name clashes can appear: In the above case the io directory shares its name with the io module of the
standard library. If the file imager/__init__.py writes ‘import io’ it is not immediately clear for the
novice programmer whether it is the standard library version of io or the imager module version that
Python imports. In Python 3 it is the standard library version. The same is the case in Python 2 if
the ‘from __future__ import absolute_import’ statement is used. To get the imager module version,
imager.io, a relative import can be used:
from . import io
2.5.2 Globbing import

In interactive data mining one sometimes imports everything from the pylab module with 'from pylab import *'. pylab is actually a part of Matplotlib (as matplotlib.pylab) and it imports a large number of functions and classes from the numerical and plotting packages of Python, i.e., numpy and matplotlib, so the definitions are readily available for use in the namespace without module prefix. Below is an example where a sinusoid is plotted with Numpy and Matplotlib functions:

from pylab import *

t = linspace(0, 10, 1000)
plot(t, sin(2 * pi * 3 * t))
show()

Some argue that the massive import of definitions with 'from pylab import *' pollutes the namespace and should not be used. Instead they argue you should use explicit imports, like:

from numpy import linspace, pi, sin
from matplotlib.pyplot import plot, show

t = linspace(0, 10, 1000)
plot(t, sin(2 * pi * 3 * t))
show()
Or alternatively you should use a prefix, here with an alias:

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 10, 1000)
plt.plot(t, np.sin(2 * np.pi * 3 * t))
plt.show()
This last example makes it clearer where the individual functions come from, probably making large Python code files more readable. With 'from pylab import *' it is not immediately clear where the load function comes from, in this case the numpy.lib.npyio module, whose load function reads pickle files. Similarly named functions in different modules can have different behavior. Jake Vanderplas pointed to this nasty example:

>>> start = -1
>>> sum(range(5), start)
9
>>> from numpy import *
>>> sum(range(5), start)
10

Here the built-in sum function behaves differently than numpy.sum as their interpretations of the second argument differ.
2.5.3 Coping with Python 2/3 incompatibility

A common way of coping with module renamings between Python 2 and 3 is to try several imports:

try:
    from cStringIO import StringIO
except ImportError:
    try:
        from StringIO import StringIO
    except ImportError:
        from io import StringIO

try:
    import cPickle as pickle
except ImportError:
    import pickle

try:
    # Python 2's ConfigParser module was renamed in Python 3.
    import configparser
except ImportError:
    import ConfigParser as configparser

After these imports you will, e.g., have the configuration parser module available as configparser.
2.6 Persistency

How do you store data between Python sessions? You could write your own file reading and writing functions, or perhaps better rely on the many different modules of Python: the Python Standard Library supports comma-separated values files (csv in the PSL, and csvkit, which will handle UTF-8 encoded data) and JSON (json). The PSL also has several XML modules, but developers may well prefer the faster lxml module, not only for XML, but also for HTML [18].

2.6.1 Pickle and JSON

The pickle module serializes Python objects to and from files and strings. In Python 2 the C-based cPickle module is commonly imported with a fallback (see the try/except import idiom in section 2.5.3), where the slow pure Python-based module is used as a fallback if the fast C-based version is not available. Python 3's pickle does this 'trick' automatically.
The open standard JSON (JavaScript Object Notation) has—as the name implies—its foundations in Javascript, but the format maps well to Python data types such as strings, numbers, lists and dictionaries. The JSON and pickle modules have similarly named functions: load, loads, dump and dumps. The load functions load objects from file-like objects into Python objects and the loads functions load from string objects, while the dump and dumps functions 'save' to file-like objects and strings, respectively.
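A small round-trip sketch (the data is arbitrary):

import json

data = {'title': 'example', 'values': [1, 2.5]}
s = json.dumps(data)        # Python object to JSON string
restored = json.loads(s)    # JSON string back to Python objects
assert restored == data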
There are several JSON I/O modules for Python. Jonas Tärnström's ujson may perform more than twice as fast as Bob Ippolito's conventional json/simplejson. Ivan Sagalaev's ijson module provides a streaming-based API for reading JSON files, enabling the reading of very large JSON files which do not fit in memory.
³ pickle-js, https://code.google.com/p/pickle-js/, is a Javascript implementation supporting a subset of primitive Python data types.
Note a few gotchas for the use of JSON in Python: while Python can use strings, Booleans, numbers, tuples and frozensets (i.e., hashable types) as keys in dictionaries, JSON can only handle strings. Python's json module converts numbers and Booleans used as keys to string representation in JSON, e.g., json.loads(json.dumps({1: 1})) returns the number used as key converted to a string: {u'1': 1}. A data type such as a tuple used as key will result in a TypeError when used to dump data to JSON. Numpy data types yield another JSON gotcha relevant in data mining. The json module does not support, e.g., Numpy 32-bit floats, and with the following code you end up with a TypeError:

import json, numpy
json.dumps(numpy.float32(1.23))
Individual numpy.float64 and numpy.int values work with the json module, but Numpy arrays are not directly supported. Converting the array to a list may help:

>>> json.dumps(list(numpy.array([1., 2.])))
'[1.0, 2.0]'

Rather than list it is better to use the numpy.ndarray.tolist method, which also works for arrays with dimensions larger than one:

>>> json.dumps(numpy.array([[1, 2], [3, 4]]).tolist())
'[[1, 2], [3, 4]]'
2.6.2 SQL

For interaction with SQL databases Python has specified a standard: the Python Database API Specification version 2 (DBAPI2) [20]. Several modules each implement the specification for individual database engines, e.g., SQLite (sqlite3), PostgreSQL (psycopg2) and MySQL (MySQLdb).

Instead of accessing the SQL databases directly through DBAPI2 you may use an object-relational mapping (ORM, aka object relation manager) encapsulating each SQL table with a Python class. Quite a number of ORM packages exist, e.g., sqlobject, sqlalchemy, peewee and storm. If you just want to read from an SQL database and perform data analysis on its content, then the pandas package provides a convenient SQL interface, where the pandas.io.sql.read_frame function will read the content of a table directly into a pandas.DataFrame, putting basic Pythonic statistical methods or plotting just one method call away; see the sketch below.
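A sketch of this route with the sqlite3 module (the table, column names and data are invented; newer versions of pandas expose the reader as pandas.read_sql):

import sqlite3
import pandas as pd

# DBAPI2: sqlite3 implements the standard connect/execute interface.
connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE persons (name TEXT, age INTEGER)')
connection.executemany('INSERT INTO persons VALUES (?, ?)',
                       [('Alice', 34), ('Bob', 56)])

# Read the table directly into a DataFrame for analysis.
df = pd.read_sql('SELECT * FROM persons', connection)
print(df['age'].mean())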
Greg Lamp’s neat module, db.py, works well for exploring databases in data analysis applications. It
comes with the Chinook SQLite demonstration database. Queries on the data yield pandas.DataFrame
objects (see section 3.3).
2.6.3 NoSQL

Python can access NoSQL databases through modules for, e.g., MongoDB (pymongo). Such systems typically provide means to store data in a 'document' or schema-less way with JSON objects or Python dictionaries. Note that an ordinary SQL RDBMS can also store document data, e.g., FriendFeed has been storing data as zlib-compressed Python pickle dictionaries in a MySQL BLOB column.⁴
2.7 Documentation

Documentation features as an integral part of Python. If you set up the documentation correctly the Python execution environment has access to the documentation and may make the documentation available to the programmer/user in a variety of ways. Python can even use parts of the documentation, e.g., to test the code or produce functionality that the programmer would otherwise put in the code; examples include specifying an example use and return argument for automated testing with the doctest package, or specifying a script input argument schema parseable with the docopt module.
⁴ http://backchannel.org/blog/friendfeed-schemaless-mysql
Concept          Description
Unit testing     Testing each part of a system separately
Doctesting       Testing with small test snippets included in the documentation
Test discovery   Method that a testing tool will use to find which part of the
                 code should be executed for testing
Zero-one-some    Test a list input argument with zero, one and several elements
Coverage         Lines of code tested compared to total number of lines of code
Programmers should not invent their own style of documentation but write to the standards of the Python documentation. PEP 257 documents the primary conventions for docstrings [21], and Vladimir Keleshev's pydocstyle tool (initially called pep257) will test if your documentation conforms to that standard. Numpy follows further docstring conventions which yield a standardized way to describe the input and return arguments, coding examples and description; a sketch follows below. It uses the reStructuredText text format. pydocstyle does not test for the Numpy convention.
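A sketch of a function documented in this style (the function is invented for illustration):

def scale(values, factor=2.0):
    """Scale a sequence of numbers by a constant factor.

    Parameters
    ----------
    values : array_like
        Numbers to scale.
    factor : float, optional
        Multiplicative factor.

    Returns
    -------
    list
        The scaled numbers.

    Examples
    --------
    >>> scale([1.0, 2.0])
    [2.0, 4.0]
    """
    return [factor * value for value in values]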
Once (or while) you have documented your code properly you can translate it into several different formats with one of the several Python documentation generator tools, e.g., to HTML for an online help system. The Python Standard Library features the pydoc module, while the Python Standard Library itself uses the popular Sphinx tool.
2.8 Testing

2.8.1 Testing for type

In data mining applications numerical list-like objects can have different types: lists of integers, lists of floats, lists of booleans, and Numpy arrays or Numpy matrices with different types of elements. Proper testing should cover all relevant input argument types. Below is an example where a mean_diff function is tested in the test_mean_diff function for both floats and integers:

from numpy import max, min

def mean_diff(a):
    """Compute the mean difference in a sequence.

    Parameters
    ----------
    a : array_like
    """
    return float((max(a) - min(a)) / (len(a) - 1))
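The test_mean_diff function itself is lost at the page break; a sketch covering float, integer and Numpy array input might read (under Python 2 the integer case fails because of integer division inside mean_diff, which is exactly the kind of error type-covering tests should catch):

import numpy as np

def test_mean_diff():
    assert mean_diff([1.0, 7.0, 3.0, 2.0, 5.0]) == 1.5       # floats
    assert mean_diff([1, 7, 3, 2, 5]) == 1.5                 # integers
    assert mean_diff(np.array([1., 7., 3., 2., 5.])) == 1.5  # Numpy array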
2.8.2 Zero-one-some testing

The testing pattern zero-one-some attempts to ensure coverage for variables which may have multiple elements. The pattern says you should test with zero elements, one element and 'some' (2 or more) elements. Consider this first attempt at a mean function (a sketch of a test_mean function with the three zero-one-some cases follows below):

def mean(x):
    return float(sum(x)) / len(x)
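The test_mean function is not preserved in this draft; a sketch of the three zero-one-some cases could read (the zero-element case raises a ZeroDivisionError in this first attempt, motivating the revised version below):

import numpy as np

def test_mean():
    assert np.isnan(mean([]))         # zero elements
    assert mean([4.2]) == 4.2         # one element
    assert mean([1, 4.3, 4]) == 3.1   # some elements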
A revised mean function handles the zero-element case by returning NaN, and documents the three cases as doctests:

import numpy as np

def mean(x):
    """Compute mean of list of numbers.

    Examples
    --------
    >>> np.isnan(mean([]))
    True
    >>> mean([4.2])
    4.2
    >>> mean([1, 4.3, 4])
    3.1
    """
    try:
        return float(sum(x)) / len(x)
    except ZeroDivisionError:
        return np.nan

If we call the file doctestmean.py we can then perform doctesting by invoking the doctest module on the file with python -m doctest doctestmean.py. This reports no output if no errors occur.
For one layout you put the test files in a separate directory outside the package:

setup.py
mypkg/
    __init__.py
    appmodule.py
    ...
tests/
    test_app.py
    ...
For the other layout, the 'inlining test directories', you put a test directory on the same level as the module:

setup.py    # your distutils/setuptools Python package metadata
mypkg/
    __init__.py
    appmodule.py
    ...
    test/
        test_app.py
        ...

This second method allows you to distribute the tests together with the implementation, letting other developers use your tests as part of the application. In this case you should also add an __init__.py file to the test directory. For both layouts, files, methods and functions should be prefixed with test_ for the test discovery, while test classes should be prefixed with Test.
In data mining, where you work with machine learning training and test sets, you should be careful not to name your ordinary (non-testing) functions with a pre- or postfix of 'test', as this may invoke testing when you run the tests from the package level.
2.8.4 Test coverage

The examples below measure test coverage for a small example file called numerics.py.
For some systems you will find that the coverage setup installs the central script as python-coverage. You can execute this script from the command line, first calling it with the run command argument and the filename of the Python source, and then with the report command argument:

> python-coverage run numerics.py
> python-coverage report -m
Name                                       Stmts   Miss  Cover   Missing
------------------------------------------------------------------------
/usr/share/pyshared/coverage/collector       132    127     4%   3-229, 236-244, 248-292
/usr/share/pyshared/coverage/control         236    235     1%   3-355, 358-624
/usr/share/pyshared/coverage/execfile         35     16    54%   3-17, 42-43, 48, 54-65
numerics                                       8      1    88%   4
------------------------------------------------------------------------
TOTAL                                        411    379     8%
The -m optional command line argument reports the line numbers missing in the test. It reports that the test did not cover line 4 in numerics. This is because we did not test for the case with x = 2, which would have executed the block within the if conditional.

With the coverage module installed, the nose package can also report the coverage. Here we use the command line script nosetests with an optional input argument:

> nosetests --with-cover numerics.py
.
Name       Stmts   Miss  Cover   Missing
----------------------------------------
numerics       8      1    88%   4
----------------------------------------
Ran 1 test in 0.003s

OK
A coverage plugin for py.test is also available, so the coverage for a module which contains test functions may be measured with the --cov option:

> py.test --cov themodule

The specific lines that the tests miss can be shown with an option to the report command:

> coverage report --show-missing
2.9 Profiling

Various functions in the PSL time module allow the programmer to measure the timing performance of the code. One relevant function, time.clock, times 'processor time' on Unix-like systems and elapsed wall-clock seconds since the first call to the function on Windows. Since version 3.3 Python has deprecated this function and instead encourages using the new functions time.perf_counter or time.process_time.

For short code snippets the time module may not yield sufficient timing resolution, and the PSL timeit module will help the developer profile such snippets by executing the code many times and timing the total wall-clock time with the timeit.timeit function.

The listing below applies timeit on code snippets taken from Google Python style guide example code [22]. The two different code snippets have similar functionality but implement it with for loops and list comprehension, respectively.
def function1():
    result = []
    for x in range(10):
        for y in range(5):
            if x * y > 10:
                result.append((x, y))
function1.name = "For loop version"

import timeit
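# The rest of the listing is not preserved in this draft; the list
# comprehension variant and the timing calls might read:

def function2():
    result = [(x, y) for x in range(10) for y in range(5) if x * y > 10]
function2.name = "List comprehension version"

for function in [function1, function2]:
    duration = timeit.timeit(function)
    print("{} = {:.2f}".format(function.name, duration))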
An execution of the code will show that the list comprehension version performs slightly faster:

$ python timeit_example.py
For loop version = 10.14
List comprehension version = 8.67

By default timeit.timeit will execute the input argument one million times and report the total duration in seconds, thus in this case each execution of each code snippet takes around 10 microseconds.

Note that such control structure-heavy code can run considerably faster with the pypy Python implementation as opposed to the standard Python python implementation: pypy can gain a factor of around five.

$ pypy timeit_example.py
For loop version = 2.26
List comprehension version = 1.93
time and timeit measure only a single number. If you want to measure timing performance for, e.g., each function called during an execution of a script, then use the profile or cProfile modules. profile, the pure Python implementation, has more overhead, so unless you find cProfile unavailable on your system, use cProfile. The associated pstats module has methods for displaying the results of the profiling from profile and cProfile. You can run these profiling tools both from within Python and by calling them from the shell command line. You will also find that some IDEs make the profiling functionality directly available, e.g., in Spyder profiling is just the F10 keyboard shortcut away. An example with profiling a module called dynamatplotlib.py with cProfile from the shell command line reads in one line:

$ python -m cProfile dynamatplotlib.py

It produces a list with timing information for each individual component of the program. If you wish to sort the list according to execution time of the individual parts then use the -s option:

$ python -m cProfile -s time dynamatplotlib.py

The report of the profiling may be long and difficult to get an overview of. The profiler can instead write the profiling report to a binary file which the pstats module can read and interact with. The following combines cProfile and pstats for showing statistics about the ten longest running lines in terms of cumulative time:

$ python -m cProfile -o dynamatplotlib.profile dynamatplotlib.py
$ python -m pstats dynamatplotlib.profile
dynamatplotlib.profile% sort cumulative
dynamatplotlib.profile% stats 10

The pstats module spawns a small command line utility where 'help', 'sort' and 'stats' are among the commands.
IPython has a magic function for quick and easy profiling of a function with timeit. The IPython code below uses the %timeit magic function to test how well the scalar sin function from the math module performs against the vectorized sin version in the numpy module for a scalar input argument:
In [1]: from math import sin

In [2]: %timeit sin(0.5)

In [3]: import numpy as np

In [4]: %timeit np.sin(0.5)
The software engineering principle 'Don't repeat yourself' (DRY) says the code should not have redundancy. A few tools exist for spotting duplicate code, e.g., clonedigger.
Some editors can set up the style checking tools to 'check as you type'.
Coding style checking can be set up to run as part of the ordinary testing, e.g., it can be included as a tox test environment in the tox.ini file so that multi-version Python testing and style checking are run for an entire module when the tox program is run.
@property
def weight(self):
    return self._weight

@weight.setter
def weight(self, value):
    if value < 0:
        # Ensure weight is non-negative
        value = 0.0
    self._weight = float(value)
With this framework, the weight property is no longer accessed as a method but rather as an attribute with
no calling parentheses.
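On a hypothetical Person class defining this property, assignment then goes through the setter:

>>> person = Person()      # hypothetical class with the property above
>>> person.weight = -3.2   # the setter clamps negative values to zero
>>> person.weight
0.0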
2.11 Command-line interface scripting
2.11.1 Distinguishing between module and script
A .py file may act as both a module and a script. To distinguish the code which python should execute when the file is regarded as a script rather than a module, one usually uses the __name__ == '__main__' 'trick': an expression that evaluates to true when the file is executed as a script and to false when it is imported. Code in the main namespace (i.e., code at the top level) gets executed when a module is imported, and to avoid having the (main part of the) script run at import time the above conditional helps. As little Python code as possible should usually appear in the main namespace of the finished script, so usually one calls a defined 'main' function right after the conditional:
def main():
    # Actual script code goes here
    pass

if __name__ == '__main__':
    main()
This pattern, encouraged by the Google Style Guide [22], allows one to use the script part of the file as a module by importing it with import script and calling the function with script.main(). It also allows you to test most of the script, gaining almost full test coverage using the usual Python unit testing frameworks. Otherwise, you could resort to the scripttest package to test your command-line script.
If a module contains a __main__.py file in the root directory, then this file will be executed when the module itself is executed. Consider the following directory structure and files:
/mymodule
__main__.py
__init__.py
mysubmodule.py
With python -m mymodule the __main__.py is executed.
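A minimal sketch of what such a file could contain (the content here is assumed for illustration):

# mymodule/__main__.py
print('mymodule executed as a script')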
import sys

def main(args):
    if len(args) == 2:  # The first value in args is the program name
        print(args[1])
    else:
        sys.exit(2)

if __name__ == '__main__':
    main(sys.argv)
Below is a Linux Bash shell session using the Python script, first with the correct number of input arguments and then with the wrong number of input arguments ('$?' is a Bash variable containing the exit code of the previous command):
$ python print_one_arg.py hello
hello
$ echo $?
0
$ python print_one_arg.py
$ echo $?
2
An alternative exit could raise the SystemExit exception with raise SystemExit(2). Both SystemExit and sys.exit may take a string as input argument, which is output from the script as an error message on stderr. The exit status is then 1, unless the code attribute is set to another value in the raised exception object.
2.12 Debugging
If an error is not immediately obvious and you consider beginning a debugging session, you might instead
want to run your script through one of the Python checker programs that can spot some common errors.
pylint, pyflakes and pychecker will all do that. pychecker executes your program, while pyflakes
does not do that, and thus considered ‘safer’, but checks for fewer issues. pylint and flake8, the lat-
ter wrapping pyflakes and pep8, will also perform code style checking, along with the code analysis.
These tools can be called from outside Python on the command-line, e.g., ‘pylint your code.py’. Yet
another tool in this domain is PySonar, which can do type analysis. It may be run with something like
‘python pysonarsq/scripts/run analyzer.py your code.py pysonar-output’.
The Python checker programs do not necessarily catch all errors. Consider Bob Ippolito's nonsense.py example:

from __future__ import print_function

def main():
    # Adding an integer and a string fails at run time
    print(1 + '1')

if __name__ == '__main__':
    main()
The program generates a TypeError, as the + operator sees both an integer and a string, which it cannot handle. Neither pep8 nor pyflakes reports the error, and pylint complains about the missing docstring, but not about the type error.
The simplest run-time debugging puts one or several print calls at the critical points in the code to examine the values of variables. print will usually not display a nested variable (e.g., a list of dicts of lists) in a particularly readable way, and here a function in the pprint module comes in handy: the pprint.pprint function will 'pretty print' nested variables with indentation.
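For example:

>>> from pprint import pprint
>>> data = [{'name': 'Alice', 'scores': [1, 2, 3]},
...         {'name': 'Bob', 'scores': [4, 5, 6]}]
>>> pprint(data, width=40)
[{'name': 'Alice', 'scores': [1, 2, 3]},
 {'name': 'Bob', 'scores': [4, 5, 6]}]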
The 'real' run-time debugging tool for Python programs is the command-line-based pdb and its graphical counterpart Winpdb. The name of the latter could trick you into believing that it is a Windows-only program, but it is platform independent and, once installed, available in, e.g., Linux as the winpdb command. In the simplest case of debugging with pdb you set a breakpoint in your code with
import pdb; pdb.set_trace()
and continue from there. You get a standard debugging prompt where you can examine and change variables, single-step through the lines of the code or just continue with execution. Integrated development environments, such as Spyder, have convenient built-in debugging functionality with pdb and winpdb and keyboard shortcuts such as F12 for inserting a breakpoint and Ctrl+F10 for single-stepping. Debugging with pdb can be combined with testing: py.test has an option that brings up the debugger command-line interface in case of an assertion error. The invocation of the testing could look like this: py.test --pdb.
A somewhat special tool is pyringe, a "Python debugger capable of attaching to processes". More Python debugging tools are listed at https://wiki.python.org/moin/PythonDebuggingTools.
2.12.1 Logging
print (or pprint) should probably not occur in the finished code as a means of logging. Instead you can use the PSL logging module. It provides methods for logging from multiple modules with multiple logging levels, user-definable formats, and a range of output options: standard out, file or network sockets. As one of the few examples of Python HOWTOs, you will find a Logging Cookbook with examples of, e.g., multiple-module logging, configuration and logging across a network. A related module is warnings with the warnings.warn function. The Logging HOWTO suggests how you should distinguish the use of the logging and warnings modules: use "warnings.warn() in library code if the issue is avoidable and the client application should be modified to eliminate the warning" and "logging.warning() if there is nothing the client application can do about the situation, but the event should still be noted."
The logging levels of logging are, in increasing order of severity: debug, info, warn/warning, error, critical/fatal. These are associated with constants defined in the module, e.g., logging.DEBUG. Each level has a logging function of the same name. By default only error and critical/fatal log messages are output:
>>> import logging
>>> logging.info('Timeout on connection')
>>> logging.error('Timeout on connection')
ERROR:root:Timeout on connection
For customization a logging.Logger object can be acquired. It has a method for changing the log level:
>>> import logging
>>> logger = logging.getLogger(__name__)
>>> logger.setLevel(logging.INFO)
>>> logger.info('Timeout on connection')
INFO:__main__:Timeout on connection
The format of the output text message can be controlled with logging.basicConfig, such that, e.g., time information is added to the log. The stack trace from a raised exception can be written to the logger via the exception method and function.
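A small sketch combining both, assuming nothing beyond the PSL logging module:

import logging

logging.basicConfig(format='%(asctime)s %(levelname)s %(name)s: %(message)s',
                    level=logging.INFO)

try:
    1 / 0
except ZeroDivisionError:
    # logging.exception logs at error level and appends the stack trace
    logging.exception('Computation failed')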
2.13 Advice
1. Structure your code into modules, classes and functions.
2. Run your code through a style checker to ensure that it conforms to the standard and possibly to catch errors.
3. Avoid redundant code. clonedigger can check your code for redundancy.
4. Document your code according to the standard. Check that it conforms to the standard with the pydocstyle tool. Follow the Numpy convention in reStructuredText format to document input arguments, returned output and other aspects of the functions and classes.
5. Test your code with one of the testing frameworks, such as py.test. If there are code examples in the documentation, run them through doctests.
6. Measure test coverage with the coverage package. If you did not reach 100% coverage, ask yourself why.
7. If you discover a bug, then write a test that tests for the specific bug before you change the code to fix it. Make sure that the test fails for the unpatched code, then fix the implementation and check that the test passes.
Chapter 3
3.1 Numpy
Numpy (numpy) dominates as the Python package for numerical computation. The deprecated numarray and numeric packages are now relevant only for legacy code. Other data mining packages, such as SciPy, Matplotlib and Pandas, build upon Numpy. Numpy itself relies on numerical libraries; which ones can be shown with the numpy.__config__.show function, and it should show BLAS and LAPACK.
While Numpy's primary container data type is the array, Numpy also contains the numpy.matrix class. The shape of a matrix is always two-dimensional (2D), meaning that any scalar indexing, such as A[1, :], will return a 2D structure. The matrix class defines the * operator as matrix multiplication, rather than elementwise multiplication as for numpy.array. Furthermore, the matrix class has the complex conjugate transpose (Hermitian) in the numpy.matrix.H property and the ordinary matrix inverse in the numpy.matrix.I property. The inversion will raise an exception in case the matrix is singular. For conversion back and forth between numpy.array and numpy.matrix, the matrix class has the property numpy.matrix.A, which converts to a 2D numpy.array.
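For example, a small illustration of these differences:

>>> import numpy as np
>>> A = np.matrix([[1, 2], [3, 4]])
>>> A * A                # matrix multiplication
matrix([[ 7, 10],
        [15, 22]])
>>> A[1, :]              # scalar indexing still returns a 2D structure
matrix([[3, 4]])
>>> A.A * A.A            # .A converts to numpy.array: * is elementwise
array([[ 1,  4],
       [ 9, 16]])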
3.2 Plotting
There is not quite a good all-embracing plotting package in Python. Several libraries exist, each with its advantages and disadvantages: Matplotlib, Matplotlib toolkits with mplot3d, ggplot, seaborn, mpltools, Cairo, mayavi, PIL, Pillow, Pygame, pyqtgraph, mpld3, Plotly and vincent.
The primary plotting library associated with Numpy is matplotlib. Developers familiar with Matlab will find many of the functions quite similar to Matlab's plotting functions. Matplotlib has a confusingly large number of different backends. Often you do not need to worry about the backend.
Perhaps the best way to get a quick idea of the visual capabilities of Matplotlib is to tour the Matplotlib galleries. The primary gallery is http://matplotlib.org/gallery.html, but there are also alternatives, e.g., https://github.com/rasbt/matplotlib-gallery and J.R. Johansson's matplotlib - 2D and 3D plotting in Python Jupyter Notebook.
The young statistical data visualization module seaborn brings stylish, aesthetic plots to Python, building on top of matplotlib. seaborn is pandas-aware, also requires scipy and furthermore recommends statsmodels. During import seaborn is usually aliased to sns or sb (e.g., import seaborn as sns). Among its capabilities you will find colorful annotated correlation plots (seaborn.corrplot) and regression plots (seaborn.regplot). A similar library is ggplot, which is heavily inspired by R's ggplot2 package. Other 'stylish' Matplotlib extensions, mpltools and prettyplotlib, adjust the style of plots, e.g., with respect to color, fonts or background.
3.2.1 3D plotting
3-dimensional (3D) plotting is unfortunately not implemented directly in the standard Matplotlib plotting functions, e.g., matplotlib.pyplot.plot will only take an x and a y coordinate, not a z coordinate, and generally Python's packages do not provide optimal functionality for 3D plotting.
Associated with Matplotlib is an extension called mplot3d, available as mpl_toolkits.mplot3d, that has a somewhat similar look and feel to Matplotlib and can be used as a 'remedy' for simple 3D visualization, such as a mesh plot. mplot3d has an Axes3D object with methods for plotting lines, 3D barplots and histograms, 3D contour plots, wireframes, scatter plots and triangle-based surfaces. mplot3d should not be used to render more complicated 3D models such as brain surfaces.
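A minimal wireframe example with Axes3D could look like this (the plotted surface is assumed for illustration):

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection

x, y = np.meshgrid(np.linspace(-2, 2, 30), np.linspace(-2, 2, 30))
z = np.exp(-x**2 - y**2)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z)
plt.show()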
For more elaborate 3D scientific visualization the plotting library Mayavi might be better. Mayavi uses the Visualization Toolkit for its rendering and will accept Numpy arrays as input [25, 26]. The mayavi.mlab submodule, with a functional interface inspired by matplotlib, provides a number of 3D plotting functions, e.g., 3D contours. Mayavi visualizations can be animated, and graphical user interface components can be set up to control the visualization. Mayavi also has a stand-alone program called mayavi2 for visualization. Mayavi relies on VTK (Visualization Toolkit). As of spring 2015 VTK has no support for Python 3, thus Mayavi does not work with Python 3.
A newer 3D plotting option under development is the OpenGL-based Vispy. It targets not only 3D visualization but also 2D plots. The developers behind Vispy have previously been involved in other Python visualization toolkits: pyqtgraph, which also features 3D visualization and has methods for volume rendering, Visvis, Glumpy and Galry. The vispy.mpl_plot submodule is an experimental OpenGL backend for matplotlib. Instead of import matplotlib.pyplot as plt one can write import vispy.mpl_plot as plt and get some of the same Matplotlib functionality from the functions in the plt alias. In the more low-level Vispy interface in vispy.gloo the developer defines the visual objects in a C-like language called GLSL.
def autoregressor(it):
    x = 0
    while True:
        x = 0.79 * x + 0.2 * next(it)
        yield x
This approach is not necessarily efficient. Profiling of the script will show that most of the time is spent with Matplotlib drawing. The matplotlib.animation submodule has specialized classes for real-time plotting and animations. Its class matplotlib.animation.FuncAnimation takes a figure handle and an animation step function. An initialization plotting function might also be necessary to define and submit to FuncAnimation. The flow of the plotting can be controlled by various parameters: number of frames, interval between plots, repetitions and repetition delays.
import matplotlib.pyplot as plt
from matplotlib import animation
import random
from collections import deque
def autoregressor(it):
    x = 0
    while True:
        x = 0.70 * x + 0.2 * next(it)
        yield x
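A minimal sketch of how FuncAnimation could be wired to such a generator (the figure setup and parameter choices are assumptions for illustration):

def noise():
    # Hypothetical input generator feeding the autoregressor
    while True:
        yield random.gauss(0, 1)

signal = autoregressor(noise())
values = deque(maxlen=100)

fig, ax = plt.subplots()
line, = ax.plot([], [])
ax.set_xlim(0, 100)
ax.set_ylim(-3, 3)

def update(frame):
    # Pull the next sample and redraw the line
    values.append(next(signal))
    line.set_data(range(len(values)), list(values))
    return line,

anim = animation.FuncAnimation(fig, update, interval=50)
plt.show()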
The plot is captured via the file-like StringIO string object. The binary data in the sio object can be read with the getvalue method and encoded to an ASCII string with the encode method in the 'base64' format.
Listing 3.1: Web plotting with Flask and an embedded image in an img tag.
from flask import Flask
import matplotlib.pyplot as plt
from StringIO import StringIO

app = Flask(__name__)

def plot_example():
    # Draw a small plot (example data assumed) and capture the PNG
    # image in a file-like string object
    plt.plot([1, 2, 3, 2, 4])
    sio = StringIO()
    plt.savefig(sio, format='png')
    return sio.getvalue().encode('base64')

@app.route('/')
def index():
    return """<html><body>
        <img src="data:image/png;base64,{0}">
        </body></html>""".format(plot_example())

if __name__ == '__main__':
    app.run()
When the web browser is pointed to the default Flask URL, http://127.0.0.1:5000/, a plot should appear in the browser. The Flask decorator, @app.route('/'), around the index function tells Flask to call that function when the web client makes a request for the http://127.0.0.1:5000/ address.
Note that in Python 3 the StringIO object has moved to the io module, so the import goes like from io import StringIO. A Python 2/3 compatible version would put a try and except block around the StringIO imports.
Whether it is a good idea to put image data in the HTML file may depend on the application. In the present case the simple HTML file results in a file of over 40 KB that the server has to send to the requesting client at each non-cached page request. A similar binary-coded PNG file is somewhat smaller, and the server can transmit it independently of the HTML. The Jupyter Notebook uses the img-tag encoding for its generated plots (which can be embedded in a web page), so in the saved IPython Notebook session files you will find large blocks of string-encoded image data, and all data (code, annotation and plots) fit neatly into one file with no need to keep track of separate image data files when you move the session file around!
# Vega scaffold modified from https://github.com/trifacta/vega/wiki/Runtime
import cherrypy
import vincent

HTML = """
<html>
  <head>
    <script src="http://trifacta.github.io/vega/lib/d3.v3.min.js"></script>
    <script src="http://trifacta.github.io/vega/lib/d3.geo.projection.min.js"></script>
    <script src="http://trifacta.github.io/vega/vega.js"></script>
  </head>
  <body><div id="vis"></div></body>
  <script type="text/javascript">
    // parse a spec and create a visualization view
    function parse(spec) {
      vg.parse.spec(spec, function(chart) { chart({el: "#vis"}).update(); });
    }
    parse("/plot");
  </script>
</html>
"""

class VegaExample:
    @cherrypy.expose
    def index(self):
        return HTML

    @cherrypy.expose
    def plot(self):
        bar = vincent.Bar([2, 4, 2, 6, 3])
        return bar.to_json()

# Launch the CherryPy web server
cherrypy.quickstart(VegaExample())
Plotly
Another web plotting approach uses the freemium cloud service Plotly, available from http://plot.ly. Users with an account created on the Plotly website may write Python plot commands on the Plotly website and render and share them online, but it is also possible to construct online plots from a local installation of Plotly. With the plotly Python package installed locally and the API key for Plotly available,1 plotting a sinusoid online requires only a few lines:
import plotly
import numpy as np

x = np.linspace(0, 10)
y = np.sin(x)
graph_url = plotly.plotly.plot([x, y])
By default plotly.plotly.plot spawns a web browser with the graph URL, which may be something like https://plot.ly/~fnielsen/7. The displayed web page shows an interactive plot (in this case of the sinusoid) where the web user may zoom, pan and scroll. By default plotly.plotly.plot creates a world-readable plot, i.e., public data files and a public plot. The web developer using the plot in
1 The API key may be found on https://plot.ly/python/getting-started/ or under the profile.
his/her web application can add the online plot as a frame on the web page via an HTML iframe tag: <iframe src="https://plot.ly/~fnielsen/7"/>. Plotly has a range of chart types, such as boxplots, bar charts, polar area charts and bubble charts, with good control over the style of the plot elements. It also has the capability to continuously update the plot with so-called streaming plots, as well as some simple statistics functionality such as polynomial fitting. There are various ways to set up the plot. The above code called the plot command with a data set. It is also possible to use standard Matplotlib to construct the plot and 'replot' the Matplotlib figure with the plotly function plotly.plotly.plot_mpl, with the Matplotlib figure handle as the first input argument.
Plotly appears fairly easy to deal with. The downside is the reliance on the cloud service as a freemium service. The basic gratis plan provides an unlimited number of public files (plots) and 20 private files.
D3
D3 is a JavaScript library for plotting on the web. There is a very large set of diverse visualization types possible with this library. There are also extensions to D3 for further refinement, e.g., NVD3. Python web services can use D3 in two ways: either by outputting data in a format that D3 can read and serving an HTML page with the D3 JavaScript included, or by using a Python package that takes care of the translation from Python plot commands to D3 JavaScript and possibly HTML. mpld3 is an example of the latter, and the code below shows a small compact example of how it can be used in connection with CherryPy.
import matplotlib.pyplot as plt, mpld3, cherrypy
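# A sketch of a CherryPy handler serving an mpld3-converted figure
# (the class name and the plotted data are assumptions):
class Plot:
    @cherrypy.expose
    def index(self):
        fig, ax = plt.subplots()
        ax.plot([1, 2, 3, 2, 4])
        return mpld3.fig_to_html(fig)  # figure as D3-backed HTML/JS

cherrypy.quickstart(Plot())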
Other visualizations
Various other visualization libraries exist, e.g., bokeh, glue and the OpenGL-based vispy.
from bokeh.plotting import *
import numpy as np
import os
import webbrowser

# ... (construction of a bokeh plot saved as 'rand.html' elided) ...

webbrowser.open(os.path.join(os.getcwd(), 'rand.html'))
3.3 Pandas
pandas, a relatively new Python package for data analysis, features an R-like data frame structure, annotated data series, hierarchical indices, methods for easy handling of times and dates, and pivoting, as well as a range of other useful functions and methods for handling data. Together with the statsmodels package it makes data loading, handling and ordinary linear statistical data analysis almost as simple as in R. The primary developer, Wes McKinney, has written the book Python for Data Analysis [27], explaining the package in detail. Here we will cover the most important elements of pandas. Often when the library is imported it is aliased to pd, as in "import pandas as pd".
     Index  a    b  c
       2    4    5  yes
A =    3    6.5  7  no                              (3.1)
       6    8    9  ok
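In pandas the A data frame can be constructed with, e.g.:

>>> import pandas as pd
>>> A = pd.DataFrame([[4, 5, 'yes'], [6.5, 7, 'no'], [8, 9, 'ok']],
...                  index=[2, 3, 6], columns=['a', 'b', 'c'])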
>>> A.loc[2, :]  # label-based
a      4
b      5
c    yes
Name: 2, dtype: object
>>> A.iloc[2, :]  # position integer (3rd row)
a     8
b     9
c    ok
Name: 6, dtype: object
>>> A.ix[2, :]  # label-based
a      4
b      5
c    yes
Name: 2, dtype: object
In all these cases the indexing methods return a pandas.Series. It is not necessary to index the columns with a colon: in a more concise notation we can write A.loc[2], A.iloc[2] or A.ix[2] and get the same rows returned. Note in the above example that the ix method uses label-based indexing, because the index contains integers. If instead the index contains non-integers, e.g., strings, the ix method falls back on position integer indexing, as seen in the example below (this ambiguity seems prone to bugs, so take care):
>>> B = pd.DataFrame([[4, 5, 'yes'], [6.5, 7, 'no'], [8, 9, 'ok']],
...                  index=['e', 'f', 'g'], columns=['a', 'b', 'c'])
>>> B.ix[2, :]
a     8
b     9
c    ok
Name: g, dtype: object
Trying B.iloc[’f’] to address the second row will result in a TypeError, while B.loc[’f’] and B.ix[’f’]
are ok.
The columns of the data frame may also be indexed. Here for the second column (‘b’) of the A matrix:
>>> A.loc[:, 'b']  # label-based
2    5
3    7
6    9
Name: b, dtype: int64
>>> A.iloc[:, 1]  # position-based
2    5
3    7
6    9
Name: b, dtype: int64
>>> A.ix[:, 1]  # fall back position-based
2    5
3    7
6    9
Name: b, dtype: int64
>>> A.ix[:, 'b']  # label-based
2    5
3    7
6    9
Name: b, dtype: int64
In all cases a pandas.Series is returned. The column may also be indexed directly as an item:

>>> A['b']
2    5
3    7
6    9
Name: b, dtype: int64
If we want to get multiple rows or columns from a pandas.DataFrame, the indices should be of the slice type or an iterable.
Combining position-based row indexing with label-based column indexing, e.g., getting all rows from the second row onwards together with the 'c' column of the A matrix, neither A[1:, 'c'] nor A.ix[1:, 'c'] works (the latter returns all rows). Instead you need something like:
>>> A['c'][1:]
3    no
6    ok
Name: c, dtype: object
or somewhat confusingly, but more explicitly:
>>> A.loc[:, 'c'].iloc[1:]
3    no
6    ok
Name: c, dtype: object
     Index  a  b           Index  a   c
A =    1    4  5    B =      1    8   9             (3.2)
       2    6  7             3   10  11
Note here that the two matrices/data frames have one overlapping row index (1) and two non-overlapping row indices (2 and 3), as well as one overlapping column name (a) and two non-overlapping column names (b and c). Also note that the one equivalent element in the two matrices (1, a) is inconsistent: 4 for A and 8 for B.
If we want to combine these two matrices into one, there are multiple ways to do it. We can append the rows of B after the rows of A, or we can match the columns of both matrices such that the a-column of A matches the a-column of B.
>>> import pandas as pd
>>> A = pd.DataFrame([[4, 5], [6, 7]], index=[1, 2], columns=['a', 'b'])
>>> B = pd.DataFrame([[8, 9], [10, 11]], index=[1, 3], columns=['a', 'c'])
>>> A.merge(B, how='inner', left_index=True, right_index=True)
   a_x  b  a_y  c
1    4  5    8  9
>>> import numpy
>>> import pandas
>>> x = [1, 2, 3]
>>> mean = numpy.mean(x)
>>> s_biased = numpy.sqrt(sum((x - mean)**2) / len(x))
>>> s_biased
0.81649658092772603
>>> s_unbiased = numpy.sqrt(sum((x - mean)**2) / (len(x) - 1))
>>> s_unbiased
1.0
>>> numpy.std(x)  # Biased
0.81649658092772603
>>> numpy.std(x, ddof=1)  # Unbiased
1.0
>>> numpy.array(x).std()  # Biased
0.81649658092772603
>>> pandas.Series(x).std()  # Unbiased
1.0
>>> pandas.Series(x).values.std()  # Biased
0.81649658092772603
>>> df = pandas.DataFrame(x)
>>> df['dummy'] = numpy.ones(3)
>>> df.groupby('dummy').agg(numpy.std)  # Unbiased!!!
       0
dummy
1      1
Numpy computes by default the biased version of the standard deviation, but if the optional argument ddof is set to 1 it computes the unbiased version. Conversely, Pandas computes by default the unbiased standard deviation.
Subpackage  Function examples                        Description
cluster     vq.kmeans, vq.vq, hierarchy.dendrogram   Clustering algorithms
fftpack     fft, ifft, fftfreq, convolve             Fast Fourier transform, etc.
optimize    fmin, fmin_cg, brent                     Function optimization
spatial     ConvexHull, Voronoi, distance.cityblock  Functions working with spatial data
stats       nanmean, chi2, kendalltau                Statistical functions

Table 3.2: Some SciPy subpackages with example functions.
Perhaps the most surprising of the above examples is the aggregation method (agg) of the DataFrameGroupBy object, which computes the unbiased estimate even when called with the numpy.std function! pandas.Series.values is a numpy.array, and thus its std method will by default use the biased version.
3.4 SciPy
SciPy (Scientific Python) contains a number of numerical algorithms that work seamlessly with Numpy. SciPy contains functions for linear algebra, optimization, sparse matrices, signal processing algorithms, statistical functions and special mathematical functions; see Table 3.2 for an overview of some of the subpackages and their functions. Many of the SciPy functions are made directly available by pylab with 'from pylab import *'.
A number of the functions in scipy also exist in numpy for backwards compatibility. For instance, the packages make eigenvalue decomposition available as both numpy.linalg.linalg.eig and scipy.linalg.eig, and the Fourier transform as numpy.fft.fft and scipy.fftpack.fft. Usually the scipy versions are preferable. They may be more flexible, e.g., able to make in-place modifications, and in some instances the scipy versions are also faster. Note that pylab.eig is the Numpy version.
3.4.1 scipy.linalg
The linear algebra part of SciPy (scipy.linalg) contains, e.g., singular value decomposition (scipy.linalg.svd) and eigenvalue decomposition (scipy.linalg.eig). These common linear algebra methods are also implemented in Numpy and available in the numpy.linalg module. scipy.linalg has quite a large number of specialized linear algebra methods, e.g., LU decomposition with scipy.linalg.lu_factor, which are not available in numpy.linalg.
Be careful with scipy.linalg.eigh. It will not check whether your matrix is symmetric, and it looks only at the lower triangular part:
>>> from scipy.linalg import eigh
>>> eigh([[1, 0.5], [0.5, 1]])
(array([ 0.5,  1.5]), array([[-0.70710678,  0.70710678],
       [ 0.70710678,  0.70710678]]))
>>> eigh([[1, 1000], [0.5, 1]])
(array([ 0.5,  1.5]), array([[-0.70710678,  0.70710678],
       [ 0.70710678,  0.70710678]]))
Numpy's and SciPy's Fourier transforms may be considerably slow if the input is of a size corresponding to a prime number. In these cases the chirp-z transform algorithm can considerably cut processing time. Outside scipy there exist wrappers for other efficient Fourier transform libraries beyond FFTPACK. pyFFTW wraps the FFTW (Fastest Fourier Transform in the West) library.3 A scipy.fftpack-lookalike API in the pyFFTW package is available in the subpackage pyfftw.interfaces.scipy_fftpack. If the purpose of the Fourier transform is just to produce a spectrum, then the welch function in the scipy.signal subpackage may be another option. It calls the scipy.fftpack.fft function repeatedly on windows of the signal to be transformed.
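A minimal sketch of spectrum estimation with welch (the signal and the parameters here are assumed for illustration):

import numpy as np
from scipy.signal import welch

fs = 100.0                         # assumed sampling frequency in Hz
t = np.arange(0, 10, 1 / fs)
x = np.sin(2 * np.pi * 5 * t) + np.random.randn(len(t))
frequencies, psd = welch(x, fs=fs, nperseg=256)  # power spectral density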
3.5 Statsmodels
The statsmodels package provides common statistical analysis methods for linear modeling and statistical tests. Part of the package is modeled after the R programming language, where statistical formulas are written with the tilde notation, enabling the specification of statistical tests in a quite compact format. With pandas and statsmodels imported, reading a data set from a comma-separated file with multiple variables represented in columns, specifying the relevant test with dependent and independent variables, testing and reporting can all be done with a single line of Python code.
Based on an example from the original statsmodels paper [4], we use a data set from James W. Longley that is included in the statsmodels package as one among over 25 reference data sets, accessible via the datasets submodule. The data is represented in the small comma-separated values file longley.csv placed in a statsmodels subdirectory. After adding an intercept column to the exogenous variables (i.e., the independent variables) we instantiate an object from the ordinary least squares (OLS) class of statsmodels with the endogenous variable (i.e., the dependent variable) as the first argument to the constructor. Calling the fit method of the object returns a result object which contains, e.g., the parameter estimates. Its summary method produces a verbose text report with the fitted parameters, P-values, confidence intervals and diagnostics.
>>> import statsmodels.api as sm
>>> data = sm.datasets.longley.load()
>>> longley_model = sm.OLS(data.endog, sm.add_constant(data.exog))
>>> longley_results = longley_model.fit()
>>> print(longley_results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.995
Model:                            OLS   Adj. R-squared:                  0.992
Method:                 Least Squares   F-statistic:                     330.3
Date:                Fri, 13 Feb 2015   Prob (F-statistic):           4.98e-10
Time:                        13:56:24   Log-Likelihood:                -109.62
No. Observations:                  16   AIC:                             233.2
Df Residuals:                       9   BIC:                             238.6
Df Model:                           6
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const      -3.482e+06    8.9e+05     -3.911      0.004      -5.5e+06 -1.47e+06
x1            15.0619     84.915      0.177      0.863      -177.029   207.153
x2            -0.0358      0.033     -1.070      0.313        -0.112     0.040
x3            -2.0202      0.488     -4.136      0.003        -3.125    -0.915
x4            -1.0332      0.214     -4.822      0.001        -1.518    -0.549
x5            -0.0511      0.226     -0.226      0.826        -0.563     0.460
x6          1829.1515    455.478      4.016      0.003       798.788  2859.515
==============================================================================
Omnibus:                        0.749   Durbin-Watson:                   2.559
Prob(Omnibus):                  0.688   Jarque-Bera (JB):                0.684
Skew:                           0.420   Prob(JB):                        0.710
Kurtosis:                       2.434   Cond. No.                     4.86e+09
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.86e+09. This might indicate that there are
strong multicollinearity or other numerical problems.

3 http://www.fftw.org/
The original CSV file contained a column heading with the variable names: Obs, TOTEMP, GNPDEFL, GNP, UNEMP, ARMED, POP, YEAR. These names were lost when we used the sm.datasets.longley.load() method, as it only returns unannotated Numpy arrays. In the summary of the result the variables are referred to by the generic names x1, x2, . . . With statsmodels' pandas integration we can maintain the variable names:
>>> data = sm.datasets.longley.load_pandas()
>>> longley_model = sm.OLS(data.endog, sm.add_constant(data.exog))
>>> longley_results = longley_model.fit()
>>> print(longley_results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 TOTEMP   R-squared:                       0.995
Model:                            OLS   Adj. R-squared:                  0.992
Method:                 Least Squares   F-statistic:                     330.3
Date:                Fri, 13 Feb 2015   Prob (F-statistic):           4.98e-10
Time:                        14:25:14   Log-Likelihood:                -109.62
No. Observations:                  16   AIC:                             233.2
Df Residuals:                       9   BIC:                             238.6
Df Model:                           6
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const      -3.482e+06    8.9e+05     -3.911      0.004      -5.5e+06 -1.47e+06
GNPDEFL       15.0619     84.915      0.177      0.863      -177.029   207.153
GNP           -0.0358      0.033     -1.070      0.313        -0.112     0.040
UNEMP         -2.0202      0.488     -4.136      0.003        -3.125    -0.915
ARMED         -1.0332      0.214     -4.822      0.001        -1.518    -0.549
POP           -0.0511      0.226     -0.226      0.826        -0.563     0.460
YEAR        1829.1515    455.478      4.016      0.003       798.788  2859.515
==============================================================================
Omnibus:                        0.749   Durbin-Watson:                   2.559
Prob(Omnibus):                  0.688   Jarque-Bera (JB):                0.684
Skew:                           0.420   Prob(JB):                        0.710
Kurtosis:                       2.434   Cond. No.                     4.86e+09
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.86e+09. This might indicate that there are
strong multicollinearity or other numerical problems.
Here we use the load_pandas method, which returns Pandas objects, so that data.endog is a pandas.Series and data.exog is a pandas.DataFrame, and now the result summary displays the variable names associated with the parameter estimates.
Using the R-like formula part of statsmodels, available in the statsmodels.formula.api module, the latter can also be written as:
>>> import statsmodels.api as sm
>>> import statsmodels.formula.api as smf
>>> data = sm.datasets.longley.load_pandas().data
>>> formula = 'TOTEMP ~ GNPDEFL + GNP + UNEMP + ARMED + POP + YEAR'
>>> longley_model = smf.ols(formula, data)
>>> longley_results = longley_model.fit()
>>> print(longley_results.summary())
Following the R convention, the variable on the left side of the tilde (in the formula variable) is the variable to be predicted (the dependent or endogenous variable), while the variables on the right side of the tilde are the independent variables. An intercept column is implicitly added to the independent variables. If the intercept column should not be added, then the formula should be appended with minus one:
formula = 'TOTEMP ~ GNPDEFL + GNP + UNEMP + ARMED + POP + YEAR - 1'
3.6 Sympy
sympy is the symbolic mathematics package for Python. Symbolic variables can be set up to define equations, and, e.g., functions may be differentiated and simplified. The code below sets up the function f = sin(2πx) exp(−x²), differentiates it twice, plots the function and its derivatives, and evaluates the second-order derivative at x = 0.75:
import sympy
x = sympy.symbols('x')
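# A sketch of the remaining steps (the exact plotting calls are assumed):
f = sympy.sin(2 * sympy.pi * x) * sympy.exp(-x ** 2)
f1 = f.diff(x)                      # first derivative
f2 = f.diff(x, 2)                   # second derivative
print(f2.subs(x, 0.75).evalf())     # second-order derivative at x = 0.75
sympy.plot(f, f1, f2, (x, -2, 2))   # plot the function and its derivatives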
Name                                       Module        KLoC  GS-cites  Reference
SciPy.linalg                               scipy.linalg
Statsmodels                                statsmodels     92       141  [4]
Scikit-learn                               sklearn        427     8,830  [28]
PyMVPA                                                    136  100 + 45  [29, 30]
Orange                                     orange         286        56  [31]
Mlpy: machine learning Python                              75         8  [32]
Modular toolkit for Data Processing (MDP)                  31        58  [33]
PyBrain                                    pybrain         36       128  [34]
Pylearn2                                                   61
Bob                                        bob             31
Gensim                                     gensim           9       699  [35]
Natural Language Toolkit (NLTK)            nltk           215     2,590  [36]
PyPR                                       pypr             ?
Caffe
...

Table 3.3: Some of the Python machine learning packages. KLoC denotes 1,000 lines of code and 'GS-cites' the number of citations as reported by Google Scholar (note: numbers not necessarily up to date).
Table 3.4: Scikit-learn methods. Whether a method is defined for a class depends on its algorithmic type, e.g., classifiers should have predict defined.

A number of machine learning packages are available for Python; see Table 3.3 for a list. Here we will cover the Scikit-learn package, which is probably the machine learning package with the largest momentum as of 2014, both according to the lines of code and the number of scientific citations to the primary research paper for the package [28], if we disregard NLTK as a machine learning package.
There are a number of other packages: the functions in scipy.linalg can be used to estimate linear models. The relevant methods are, e.g., the pseudo-inverse of a matrix, scipy.linalg.pinv, the least squares solution to the equation Ax = b with scipy.linalg.lstsq and, for square matrices, scipy.linalg.solve. For optimizing a machine learning cost function where the parameters enter in a non-linear fashion, the functions in scipy.optimize can be used. The primary general function in that module is scipy.optimize.minimize. NLTK is primarily a toolbox for processing text, see section 3.8, but it also contains some classifiers in the nltk.classify module. The newest version of the NLTK toolbox contains an interface to the Scikit-learn classifiers.
Type                               Class name                               Parameter examples
K-nearest neighbor                 sklearn.neighbors.KNeighborsClassifier
Linear discriminant analysis       sklearn.lda.LDA
Support vector machine             sklearn.svm.SVC                          kernel
Principal component analysis       sklearn.decomposition.PCA                Number of components
Non-negative matrix factorization  sklearn.decomposition.NMF
3.7.1 Scikit-learn
PyPI calls the package scikit-learn, while the main module is called sklearn and should be imported as such.
sklearn has a plethora of classifiers, decomposition and clustering algorithms and other machine learning algorithms, all implemented as individual Python classes. The methods in the classes follow a naming pattern, see Table 3.4. The method for parameter estimation or training is called fit, while the method for prediction is called predict, if prediction is relevant for the algorithm. The parameters of the model are available as class attributes with names that have a trailing underscore, e.g., coef_ or alpha_. The uniform interface to the classifiers means that it does not take much extra effort to use several classifiers compared to a single classifier.
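A small sketch of this interface (the data and parameter choices are just for illustration): a nearest-neighbor classifier is trained with fit and applied with predict:

from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]    # training data (illustrative)
y = [0, 0, 1, 1]                        # class labels
classifier = KNeighborsClassifier(n_neighbors=1)
classifier.fit(X, y)
print(classifier.predict([[0.9, 0.2]]))  # nearest neighbor is [1, 0]: class 1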
        Description                                         Example  Example matches
.       Any character                                       .        a, b, .
*       Zero or more of the preceding group                 .*       a, ab, abab, '' (empty string)
^       Beginning of string                                 ^b.*     b, baaaa
$       End of string                                       b.*b$    bb, baaaab, bab
[ ]     Match any one in a set of characters                [a-cz]   a, b, c, z
[^ ]    Match any character not in the set                  [^a]     b, c, 1, 2
( )     Captured subexpression                              (a.*)    a, abb
{m,n}   At least m and at most n of the preceding group     a{2,4}   aa, aaa, aaaa
|       Or, alternation, either one or the other            a|b      a, b
+       One or more of the preceding group                  a+       a, aa, aaa, aaaaaaaaa
?       Zero or one                                         a?       '' (empty string), a
\d      Digit                                               \d       1, 5, 0
\D      Non-digit                                           \D       a, b, )
\s      Whitespace
\S      Non-whitespace
\w      Word character
\W      Non-word character
\b      Word boundary

Table 3.6: Metacharacters and character classes of Python's regular expressions in the re module. The metacharacters in the first group are the POSIX metacharacters, while the second group are the extended POSIX metacharacters. The third group contains the Perl-like character classes.
any non-digit character, or [^0-9] in POSIX notation; see Table 3.6 for the list of character classes. Note that the so-called word character, referenced by \w (matching a letter, digit or the underscore), may match letters beyond ASCII's a-z, such as ø, æ, å and ß. This can be used to match the international letters that the character class [a-zA-Z] misses, by taking the complement of the complement of word characters while excluding digits and the underscore: [^\W_0-9]. This trick identifies words with international letters, here in Python 2: re.findall('[^\W_0-9]+', u'Årup Sø Straße 2800', flags=re.UNICODE), which returns a list with 'Årup', 'Sø' and 'Straße' while avoiding the number.
One application of regular expressions is tokenization: finding meaningful entities (e.g., words) in a text. Word tokenization in informal texts is not necessarily easy. Consider the following difficult invented micropost "@fnielsen Pråblemer!..Øreaftryk i Århus..:)", where there seem to be two ellipses and a smiley as well as international characters:
import re

text = u'@fnielsen Pråblemer!..Øreaftryk i Århus..:)'

# The ordinary string split() only splits at whitespace
text.split()

# Tokenize mentions, words, smileys and ellipses
re.findall(u'@\w+ | [^\W_\d]+ | :\) | \.\.+', text, re.UNICODE | re.VERBOSE)
The last regular expression catches the smiley, but it will not catch, e.g., :(, :-) or the full :)). In the above code re.VERBOSE makes the regular expression ignore the whitespace in its definition, making it more readable. re.UNICODE ensures that Python 2 handles Unicode characters; e.g., re.findall('\w+', text) without re.UNICODE will not work, as it splits the string at the Danish characters.
For more information about Python regular expressions see Python's regular expression HOWTO4 or chapter 7 of Dive into Python5. In some cases the manual page for Perl regular expressions (perlre) may also be of help, and the docstring of the re module, available with help(re), has a good overview of the special characters in regular expression patterns.
import requests
from lxml import etree

url = 'http://www.w3.org/TR/2009/REC-skos-reference-20090818/'
response = requests.get(url)
tree = etree.HTML(response.content)
4 https://docs.python.org/2/howto/regex.html
5 http://www.diveintopython.net/regular_expressions/
6 That is not in Unicode. In Python 2 it has the type ‘str’, while in Python 3 it has the type ‘bytes’.
# The title in the first-level header:
title = tree.xpath("//h1")[0].text

# The <dt> element labeling the editors; this XPath is an assumption
editor_tree = tree.xpath('//dt[contains(text(), "Editors")]')[0]

# Get the names from the text between the HTML tags
names = [name.strip() for name in editor_tree.getnext().itertext()]
W3C seems to have no consistent formatting of the editor names across its many specifications, so you will need to do further processing of the names list to extract real names. In this case the editors end up in a list split between given name and surname and containing affiliations as well: "['Alistair', '', 'Miles', ', STFC\n Rutherford Appleton Laboratory / University of Oxford', 'Sean', '', 'Bechhofer', ',\n University of Manchester']"
Note that Firefox has an 'Inspector' in the Web Developer tool (F12 keyboard shortcut) which helps navigate the tag hierarchy and identify suitable tags for the XPath specification.
Below is another implementation of the W3C technical report editor extraction with a low-level, rather 'dumb' use of the re module:
import re
import requests

url = 'http://www.w3.org/TR/2009/REC-skos-reference-20090818/'
response = requests.get(url)
editors = re.findall('Editors:(.*?)</dl>', response.text,
                     flags=re.UNICODE | re.DOTALL)[0]
editor_list = re.findall('<a .*?>(.+?)</a>', editors)
Here the editor_list variable contains a list with each element a name: "[u'Alistair Miles', u'Sean Bechhofer']", but this version unfortunately does not necessarily work with other W3C pages, e.g., it fails with http://www.w3.org/TR/2013/REC-css-style-attr-20131107/.
We may also use BeautifulSoup and its find_all and find_next methods:
from bs4 import BeautifulSoup
import re
import requests

url = 'http://www.w3.org/TR/2009/REC-skos-reference-20090818/'
response = requests.get(url)
soup = BeautifulSoup(response.content)
names = soup.find_all('dt', text=re.compile('Editors?:'))[0].find_next('dd').text
Here the result is returned in a string with both names and affiliations. A regular expression using the re module matches the text to find the dt HTML tag containing the word 'Editor' or 'Editors' followed by a colon. After BeautifulSoup has found the relevant dt HTML tag, it extracts the text of the following dd tag with the find_next method of the BeautifulSoup object.
3.8.3 NLTK
NLTK (Natural Language Toolkit) is one of the leading natural language processing packages for Python. It is described in depth by the authors of the package in the book Natural Language Processing with Python [36], available online. There are many submodules in NLTK, some of them displayed in Table 3.7. Associated with the package is a range of standard natural language processing corpora, which each and all can be downloaded with the interactive nltk.download function.
Name           Description                                          Example
nltk.app       Miscellaneous applications, e.g., a WordNet browser  nltk.app.wordnet
nltk.book      Example texts associated with the book [36]          nltk.book.sent7
nltk.corpus    Example texts, some of them annotated                nltk.corpus.shakespeare
nltk.text      Representation of texts
nltk.tokenize  Word and sentence segmentation                       nltk.tokenize.sent_tokenize

Table 3.7: Some NLTK submodules.

Once downloaded, the corpora are made available by functions in the nltk.corpus submodule.
[(u'To', u'TO'), (u'suppose', u'VB'), (u'that', u'IN'), (u'the', u'DT')]
>>> [token.orth_ for token in tokens if token.tag_ == 'NN']
[u'eye', u'focus', u'light', u'correction', u'aberration',
 u'selection', u'confess', u'absurd', u'degree', u'sun', u'world',
 u'round', u'sense', u'mankind', u'doctrine', u'false', u'saying',
 u'populi', u'philosopher', u'science']
Note the differences in POS tagging between NLTK and spaCy for words such as 'admitting' and 'light'. In its documentation spaCy claims both higher accuracy and much faster execution than NLTK's POS tagging.
Another language detector is the Chromium Compact Language Detector. The cld module makes a single function available:

>>> import cld
>>> cld.detect(u'Det er ikke godt håndværk.'.encode('utf-8'))
('DANISH', 'da', False, 30, [('DANISH', 'da', 63, 49.930651872399444),
 ('NORWEGIAN', 'nb', 37, 26.410564225690276)])
The language detection in this module has been using the Google Translate service for the detection. Although this seems to offer quite good results, any repeated use could presumably be blocked by Google.
The base sentiment analyzer uses an English word-based sentiment analyzer and processes the text so that it handles a few cases of negation. The textblob base sentiment analyzer comes from the pattern module. The interface in the pattern library is different:

>>> from pattern.en import sentiment
>>> sentiment('This is way worse than bad.')
(-0.5083333333333333, 0.6333333333333333)
Figure 3.2: Comorbidity for ICD-10 disease code (appendicitis).
The returned values are polarity and subjectivity, as for the textblob method.
Both pattern and textblob rely (in their default setup) on the en-sentiment.xml file containing over 2,900 English words, where a WordNet identifier, POS tag, polarity, subjectivity, intensity and confidence are encoded for each word. Numerous other wordlists for sentiment analysis exist, e.g., my AFINN wordlist [40]. Good wordlist-based sentiment analyzers often use multiple wordlists.
with diseases as nodes and comorbidity as edges. Finally we draw a part of the graph as the ‘ego-graph’
around the node K35, which is the disease code for appendicitis:
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
import requests
from StringIO import StringIO

# 'data' is a pandas.DataFrame with disease code pairs in the
# 'Code' and 'Code.1' columns; its loading belongs to the
# surrounding example
disease_graph = nx.DiGraph()
disease_graph.add_edges_from(data[['Code', 'Code.1']].itertuples(index=False))

nx.draw(nx.ego_graph(disease_graph, 'K35'))
plt.show()
Figure 3.2 displays the resulting plot after we call the matplotlib.pyplot.show function and after we have extracted the ego graph with networkx.ego_graph and plotted it with networkx.draw.
The first call to the attribute may take several hundred milliseconds to execute, while the second call takes on the order of microseconds. Note that the decorator changes the method Matrix.U() so that the computed values should no longer be accessed as a function (X.U()) but rather as an attribute (X.U) without calling parentheses. If we wanted access to the right singular vectors (Vh) we would need to implement a second method with computation of the singular vectors. In that case we would not take advantage of the fact that the same computation occurs for both Matrix.U and Matrix.Vh.
It is possible to move the computation to the constructor:
from numpy import matrix
from scipy.linalg import svd

class Matrix(matrix):
    def __init__(self, *args, **kwargs):
        matrix.__init__(self, *args, **kwargs)
        self._U, self._s, self._Vh = svd(self, full_matrices=False)

    @property
    def U(self):
        return self._U
In this case the singular vectors are computed only once, but it also means that the computation always happens, even if the attribute with the singular vectors is never accessed. However, we perform only a single computation for Matrix.U, and for Matrix.Vh if we implemented that attribute.
It is possible to compute the singular vectors only when they are needed, and to compute the singular value decomposition only once, by moving the computation to a separate method and having multiple attributes call it:
class Matrix(matrix):
    def __init__(self, *args, **kwargs):
        matrix.__init__(self, *args, **kwargs)
        self._U, self._s, self._Vh = None, None, None

    def svd(self):
        self._U, self._s, self._Vh = svd(self)
        return self._U, self._s, self._Vh

    @property
    def U(self):
        if self._U is None:
            self.svd()
        return self._U

    @property
    def s(self):
        if self._s is None:
            self.svd()
        return self._s

    @property
    def Vh(self):
        if self._Vh is None:
            self.svd()
        return self._Vh
Applying this class:
>>> import numpy.random as npr
>>> X = Matrix(npr.random((3000, 200)))  # Fast
>>> X.Vh  # Slow: computes the SVD
>>> X.Vh  # Fast
>>> X.U   # Fast
2. Numerical precision in the computation means that computed results differ from the expected 'exact' results.
3. For machine learning algorithms you may have no idea what the result should be; indeed, the task of machine learning is to develop an algorithm that performs well.
Numpy has developed a testing framework to deal with the first two issues, available in the numpy.testing module.
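A small sketch of how numpy.testing handles the numerical precision issue (the tested computation is just an example):

import numpy as np
from numpy.testing import assert_array_almost_equal

def test_inverse():
    # A.dot(inv(A)) equals the identity only up to floating point
    # round-off, so compare with a tolerance rather than exact equality
    A = np.array([[1.0, 2.0], [3.0, 4.0]])
    assert_array_almost_equal(A.dot(np.linalg.inv(A)), np.eye(2))

test_inverse()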
Chapter 4
import logging
from logging import NullHandler
def __getitem__(self, indices):
    """Get element in matrix.

    Examples
    --------
    >>> m = Matrix([[1, 2], [3, 4]])
    >>> m[0, 1]
    2

    """
    return self._matrix[indices[0]][indices[1]]
def __setitem__(self, indices, value):
    """Set element in matrix.

    Examples
    --------
    >>> m = Matrix([[1, 2], [3, 4]])
    >>> m[0, 1]
    2
    >>> m[0, 1] = 5
    >>> m[0, 1]
    5

    """
    self._matrix[indices[0]][indices[1]] = value
@property
def shape(self):
    """Return shape of matrix.

    Examples
    --------
    >>> m = Matrix([[1, 2], [3, 4], [5, 6]])
    >>> m.shape
    (3, 2)

    """
    rows = len(self._matrix)
    if rows == 0:
        rows = 1
        columns = 0
    else:
        columns = len(self._matrix[0])
    return (rows, columns)
def __abs__(self):
    """Return elementwise absolute value.

    Examples
    --------
    >>> m = Matrix([[1, -1]])
    >>> m_abs = abs(m)
    >>> m_abs[0, 1]
    1

    """
    result = Matrix([[abs(element) for element in row]
                     for row in self._matrix])
    return result
def __add__(self, other):
    """Add number or matrix to matrix.

    Parameters
    ----------
    other : integer or Matrix

    Returns
    -------
    m : Matrix
        Matrix of the same size as the original matrix

    Examples
    --------
    >>> m = Matrix([[1, 2], [3, 4]])
    >>> m = m + 1
    >>> m[0, 0]
    2

    """
    if isinstance(other, int) or isinstance(other, float):
        result = [[element + other for element in row]
                  for row in self._matrix]
    elif isinstance(other, Matrix):
        result = [[self[m, n] + other[m, n]
                   for n in range(self.shape[1])]
                  for m in range(self.shape[0])]
    else:
        raise TypeError
    return Matrix(result)
def __mul__(self, other):
    """Multiply matrix with number.

    Parameters
    ----------
    other : integer, float

    Returns
    -------
    m : Matrix
        Matrix with multiplication result

    Examples
    --------
    >>> m = Matrix([[1, 2], [3, 4]])
    >>> m = m * 2
    >>> m[0, 0]
    2

    """
    if isinstance(other, int) or isinstance(other, float):
        result = [[element * other for element in row]
                  for row in self._matrix]
    else:
        raise TypeError
    return Matrix(result)
def __pow__(self, other):
    """Raise matrix elements to a power.

    Parameters
    ----------
    other : integer, float

    Returns
    -------
    m : Matrix
        Matrix with exponentiation result

    Examples
    --------
    >>> m = Matrix([[1, 2], [3, 4]])
    >>> m = m ** 3
    >>> m[0, 1]
    8

    """
    if isinstance(other, int) or isinstance(other, float):
        result = [[element ** other for element in row]
                  for row in self._matrix]
    else:
        raise TypeError
    return Matrix(result)
def transpose(self):
    """Return transposed matrix.

    Examples
    --------
    >>> m = Matrix([[1, 2], [3, 4]])
    >>> m = m.transpose()
    >>> m[0, 1]
    3

    """
    log.debug("Transposing")
    # list is necessary for Python 3, where zip is a generator
    return Matrix(list(zip(*self._matrix)))
@property
def T(self):
    """Transposed of matrix.

    Returns
    -------
    m : Matrix
        Copy of matrix

    Examples
    --------
    >>> m = Matrix([[1, 2], [3, 4]])
    >>> m = m.T
    >>> m[0, 1]
    3

    """
    log.debug("Calling transpose()")
    return self.transpose()
Chapter 5
Now we have two files ready for reading with Python. An examination of the first four lines of the
pima.tr.csv file yields:
"" ," npreg " ," glu " ," bp " ," skin " ," bmi " ," ped " ," age " ," type "
"1" ,5 ,86 ,68 ,28 ,30.2 ,0.364 ,24 ," No "
"2" ,7 ,195 ,70 ,33 ,25.1 ,0.163 ,55 ," Yes "
"3" ,5 ,77 ,82 ,41 ,35.8 ,0.156 ,35 ," No "
Here the first column is the row index. Note that the last column ('type') does not contain numerical values in its cells, but rather a string for the categorical column; thus we cannot read it directly into a numpy.array.
1 http://diabetes.niddk.nih.gov/dm/pubs/pima/pathfind/pathfind.htm
This data set has left out one of the variables (the measurement of serum insulin) and has also excluded subjects with missing data for some of the variables. Furthermore, the data set is split into a training set and a test set, with 200 and 332 subjects, respectively. The full data set represents data from 768 female subjects.
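A sketch of the reading step (the file name follows the listing above):

import pandas as pd

pima_tr = pd.read_csv('pima.tr.csv', index_col=0)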
pandas.read_csv handles the column header by default, setting the columns variable of the returned object based on the first line in the CSV file, while for the row header (here the first column) we need to set it explicitly with the index_col=0 input argument. We then have access to, e.g., the number of pregnancies column in the training set with pima_tr.npreg or pima_tr['npreg'].
If we would like to read in the full data set, with the insulin serum measurement and with the subjects that have missing data, we can grab the data from the UCI repository:

url = ('http://ftp.ics.uci.edu/pub/machine-learning-databases/'
       'pima-indians-diabetes/pima-indians-diabetes.data')
pima = pd.read_csv(url, names=['npreg', 'glu', 'bp', 'skin',
                               'ins', 'bmi', 'ped', 'age', 'type'])
Note here that pandas.read_csv is able to download the data from the Internet, and that there is no header in this data set, which is why the columns are named explicitly with the names argument.
There are various ways to get an overview of the data. The Pandas data frame object has, e.g., mean, std and min methods. These can all be displayed with the describe data frame method:
>>> pima_tr.describe()
npreg glu bp skin bmi ped \
count 200.000000 200.000000 200.000000 200.000000 200.000000 200.000000
mean 3.570000 123.970000 71.260000 29.215000 32.310000 0.460765
std 3.366268 31.667225 11.479604 11.724594 6.130212 0.307225
min 0.000000 56.000000 38.000000 7.000000 18.200000 0.085000
25% 1.000000 100.000000 64.000000 20.750000 27.575000 0.253500
50% 2.000000 120.500000 70.000000 29.000000 32.800000 0.372500
75% 6.000000 144.000000 78.000000 36.000000 36.500000 0.616000
max 14.000000 199.000000 110.000000 99.000000 47.900000 2.288000
age
count 200.000000
mean 32.110000
std 10.975436
min 21.000000
25% 23.000000
50% 28.000000
75% 39.250000
max 63.000000
Here we see that the maximum number of pregnancies for a woman in the data set is 14 ('max' row and 'npreg' column), the maximum BMI is 47.9, and the age ranges between 21 and 63 with an average of 32.11. Note that the describe method ignores the categorical 'type' column.
With the grouping functionality, using the groupby method of the Pandas data frame, we can get the summary statistics based on the rows grouped into sets depending on the value of a specified column. When we use the 'type' column for the grouping operation, the two sets become 'No' and 'Yes':
>>> pima_tr.groupby('type').mean()
npreg glu bp skin bmi ped \
type
No 2.916667 113.106061 69.545455 27.204545 31.074242 0.415485
Yes 4.838235 145.058824 74.588235 33.117647 34.708824 0.548662
age
type
No 29.234848
Yes 37.691176
The groupby method returns a Pandas object called DataFrameGroupBy. Like the DataFrame object it has summary statistics methods, and the above listing showed an example with the mean method. It indicates that the women with a diagnosis of diabetes mellitus have on average a higher number of pregnancies (4.8 against 2.9), a higher BMI value and a higher age.
With standard Pandas functionality we can also get an overview of the correlation between the variables
with the corr method of the data frame:
>>> pima_tr.corr()
npreg glu bp skin bmi ped age
npreg 1.000000 0.170525 0.252061 0.109049 0.058336 -0.119473 0.598922
glu 0.170525 1.000000 0.269381 0.217597 0.216790 0.060710 0.343407
bp 0.252061 0.269381 1.000000 0.264963 0.238821 -0.047400 0.391073
skin 0.109049 0.217597 0.264963 1.000000 0.659036 0.095403 0.251926
bmi 0.058336 0.216790 0.238821 0.659036 1.000000 0.190551 0.131920
ped -0.119473 0.060710 -0.047400 0.095403 0.190551 1.000000 -0.071410
age 0.598922 0.343407 0.391073 0.251926 0.131920 -0.071410 1.000000
It shows that skin fold thickness and BMI are quite correlated, with a correlation coefficient of 0.659036, and that age and number of pregnancies are also fairly correlated, with a coefficient of 0.598922.
The seaborn package has a nice correlation plot function which works with Pandas, corresponding to the Pandas data frame corr method:
import seaborn as sns
import matplotlib.pyplot as plt
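The call producing the plot is not shown above; a minimal sketch, assuming the seaborn.corrplot function of seaborn versions contemporary with this draft (later seaborn releases removed it):
sns.corrplot(pima_tr)
plt.show()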
Seaborn produces a color-coded correlation plot which also displays the variable names, the numerical correlation coefficients and the result of statistical tests for the correlation coefficient, see Figure 5.1. The statsmodels package also has a correlation plot function, hidden as statsmodels.graphics.correlation.plot_corr, but it does not yield as informative a plot.
Figure 5.1: Seaborn correlation plot on the Pima data set constructed with the seaborn.corrplot function. The diagonal displays the variable names, the upper right triangle shows the numerical correlation coefficients together with stars indicating statistical significance, and the lower left triangle is color-coded according to correlation.
import statsmodels.api as sm
import statsmodels.formula.api as smf

model = smf.glm('type ~ npreg + glu + bp + skin + bmi + ped + age',
                data=pima_tr, family=sm.families.Binomial()).fit()
print(model.summary())
The last line will print out the result of the fitting of the model, with the fitted parameter values, their standard errors, their z-values, the two-sided P-values and the 95% confidence intervals:
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 9.7731 1.770 5.520 0.000 6.303 13.243
npreg -0.1032 0.065 -1.595 0.111 -0.230 0.024
glu -0.0321 0.007 -4.732 0.000 -0.045 -0.019
bp 0.0048 0.019 0.257 0.797 -0.032 0.041
skin 0.0019 0.022 0.085 0.932 -0.042 0.046
bmi -0.0836 0.043 -1.953 0.051 -0.168 0.000
ped -1.8204 0.666 -2.735 0.006 -3.125 -0.516
age -0.0412 0.022 -1.864 0.062 -0.084 0.002
==============================================================================
The fitted parameters which are displayed in the ‘coef’ column are also available in the model.params attribute. This variable has the pandas.Series data type, so we can, e.g., access the float value (actually numpy.float64) of the parameter for the intercept with model.params.Intercept. The other numerical data displayed with the print function are also available, e.g., the values in the ‘z’ column appear in the model.tvalues attribute.
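For instance, matching the ‘coef’ column in the summary above:
>>> round(model.params.Intercept, 4)
9.7731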
To evaluate the performance of the classifier we will need a function that compares the predicted values with the true values. Here we define an accuracy function that computes the fraction of correctly predicted labels:
def accuracy(truth, predicted):
    if len(truth) != len(predicted):
        raise ValueError("Lengths of truth and predicted differ")
    total = len(truth)
    if total == 0:
        return 0.0
    # Count elementwise agreements between truth and prediction
    hits = sum(x == y for x, y in zip(truth, predicted))
    return float(hits) / total
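A quick check of the function on a small made-up example:
>>> accuracy(['No', 'Yes', 'Yes'], ['No', 'Yes', 'No'])
0.6666666666666666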
    def to_labels(self, y):
        # Map negative predictions to 'No' and non-negative ones to 'Yes'
        # (asarray and where come from numpy)
        return pd.Series(asarray(where(y < 0, "No", "Yes")).flatten())
Chapter 6
Instantiating the DemoDB will read the data and set it up for access in the db object. Note that here we (somewhat confusingly) use the same name for the database object as for the module.
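A minimal sketch of this setup, assuming the db.py package is installed:
import db

# Load the bundled Chinook demonstration database
db = db.DemoDB()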
An overview of the tables is available with the tables attribute; here we have a shortened output:
>>> db.tables
+---------------+------------------------------------...
| Table         | Columns
+---------------+------------------------------------...
| Album         | AlbumId, Title, ArtistId
| Artist        | ArtistId, Name
| Customer      | CustomerId, FirstName, LastName, ...
| Employee      | EmployeeId, LastName, FirstName, Title, ...
...
Examining the schema of the individual tables is likewise straightforward, as the individual tables are accessible as attributes of the tables attribute:
>>> db.tables.Album
+-------------------------------------------------------------+
| Album                                                       |
+----------+---------------+-----------------+----------------+
| Column   | Type          | Foreign Keys    | Reference Keys |
+----------+---------------+-----------------+----------------+
| AlbumId  | INTEGER       |                 | Track.AlbumId  |
| Title    | NVARCHAR(160) |                 |                |
| ArtistId | INTEGER       | Artist.ArtistId |                |
+----------+---------------+-----------------+----------------+
For accessing the actual data in the database we can use the select, all, sample or head methods of the
db.db.Table object.
>>> db.tables.InvoiceLine.head()
InvoiceLineId InvoiceId TrackId UnitPrice Quantity
0 1 1 2 0.99 1
1 2 1 4 0.99 1
2 3 2 6 0.99 1
3 4 2 8 0.99 1
4 5 2 10 0.99 1
5 6 2 12 0.99 1
The returned object is a pandas.DataFrame, such that further data analysis with the functions and methods
of Pandas is straightforward.
Data as a Pandas data frame can also be obtained via the db.query function, where SQL statements can be formulated; e.g., the following two statements return the same data:
>>> db.tables.Album.head(3)
AlbumId Title ArtistId
0 1 For Those About To Rock We Salute You 1
1 2 Balls to the Wall 2
2 3 Restless and Wild 2
>>> db.query('select * from Album limit 3')
AlbumId Title ArtistId
0 1 For Those About To Rock We Salute You 1
1 2 Balls to the Wall 2
2 3 Restless and Wild 2
Figure 6.1: Database tables graph for the Chinook database, where nodes are tables and edges indicate foreign key connections.
# Construct graph
graph = nx.MultiDiGraph()
for table in db.tables:
    graph.add_node(table.name, number_of_rows=len(table.all()))
    for key in table.foreign_keys:
        graph.add_edge(table.name, key.table)
# The opening lines of this listing (computing node positions `pos` and
# node sizes `sizes`) are not shown; a node-drawing call is assumed here
nx.draw_networkx_nodes(graph, pos=pos, node_size=sizes)
nx.draw_networkx_labels(graph, pos=pos, font_color='k', font_size=8)
plt.show()
It appears that not all tracks in the database have been sold, so with the left outer join we end up with some tracks with no entry in the ‘Sold’ column. When the data is returned as a pandas.DataFrame, these missing entries have the value NaN. As they should be interpreted as zero, we replace NaN with zero using the fillna method of the data frame object.
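With the sold_per_track data frame used below, the replacement is a one-liner:
sold_per_track['Sold'] = sold_per_track['Sold'].fillna(0)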
We can get an overview of the number of items sold per track by plotting a histogram:
>>> import matplotlib.pyplot as plt
>>> sold_per_track['Sold'].hist()
>>> plt.xlabel('Number of items sold per track')
>>> plt.ylabel('Frequency')
>>> plt.show()
Here we should become suspicious about whether this part of Chinook is based on real-life data: following the idea of the long tail [45], we would expect a few hit tracks selling a large number of items, while most of the tracks should sell only a few.
Chapter 7
We read in all the words from the corpus and find the unique words in the lowercase version. After that we read in all the sentences into two data sets: one with the sentences labeled as ‘news’, the other with the sentences labeled with any of the other categories:
unique_words = set([word.lower() for word in brown.words()])
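The two sentence sets can be read with NLTK's corpus interface; a minimal sketch (the exact split used in the original listing is assumed):
from nltk.corpus import brown

news_sentences = list(brown.sents(categories='news'))
other_sentences = list(brown.sents(
    categories=[category for category in brown.categories()
                if category != 'news']))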
To train a classifier we will use one-gram word features, i.e., indicate with a Boolean variable whether a word
is present or not in each sentence.
def word_features(sentence):
    features = {word: False for word in unique_words}
    for word in sentence:
        if word.isalpha():
            features[word.lower()] = True
    return features
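The featuresets variable examined below pairs each sentence's features with a label; a sketch, where the label ‘other’ for the non-news sentences is an assumption:
featuresets = ([(word_features(sentence), 'news')
                for sentence in news_sentences]
               + [(word_features(sentence), 'other')
                  for sentence in other_sentences])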
We can make a spot check of whether the features are set up correctly: e.g., the word ‘county’ appears in the first sentence, so in the featuresets variable the associated value should be True, whereas the word ‘city’ occurs in the second sentence but not in the first:
>>> news_sentences[0][:8]
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an']
>>> featuresets[0][0]['county']
True
>>> featuresets[0][0]['city']
False
The actual training is done by instantiating an object of the nltk.classify.NaiveBayesClassifier class, giving the featuresets as input to its train class method:
classifier = NaiveBayesClassifier.train(featuresets)
The estimation takes some time, so when it finally finishes we save the trained classifier in the pickle format via the pickle module (note the binary file mode needed for pickling):
pickle.dump(classifier, open('news_classifier.pck', 'wb'))
The pickle file will allow us to load the classifier in another Python session, rather than training the classifier again. The pickle module will load the classifier with classifier = pickle.load(open('news_classifier.pck', 'rb')).
We can make a couple of random tests displaying the estimated probability of being a news sentence:
>>> news = 'senate tax overhaul gains steam as floor debate awaits'
>>> classifier.prob_classify(word_features(news.split())).prob('news')
0.7145304532017633
>>> other = 'when are they going to let you back in the usa'
>>> classifier.prob_classify(word_features(other.split())).prob('news')
3.5592820175844e-05
While the first two sentences yield a reasonable classification, the third sentence is classified poorly. When interpreting the probability, one should keep in mind that the data set is quite unbalanced, with only 8% of the sentences being news sentences:
>>> from __future__ import division
>>> len(news_sentences) / len(featuresets)
0.08062434600627834
Chapter 8
json_string = """[
    {"id": 1, "content": "hello"},
    {"id": 2, "content": "world"}]
"""
The print line displays the dictionary, as ijson.items returns a generator which can be iterated with the for loop. The second input argument to ijson.items specifies which part of the JSON object the function should yield at each iteration. In this case an object is yielded at every JSON list item, and the yielded variable has the data type dict.
Let's scale the simple 4-line example up to a 3.1 gigabyte compressed JSON file (20140721.json.gz) provided by the Wikidata project, currently available from http://dumps.wikimedia.org/other/wikidata/, and which contains over 15 million items in multiple languages. The gzip library will uncompress the file on the fly, with the gzip.open function returning a file-like handle that can be fed directly into the ijson.items function.
import collections
import gzip
import ijson
import os.path
In this case we print company names when an item in Wikidata is annotated as an instance of a company (https://www.wikidata.org/wiki/Q783794); in the present case the generated output starts with: EADS, Sako, SABMiller, Berliet, Aixam, The Walt Disney Company, ... The company names are printed with their Romanian (‘ro’) names, with fallback to German (‘de’) and English (‘en’) names.
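A sketch of such a filtering loop (the claim and label structure of the 2014-era Wikidata dump items is assumed here):
with gzip.open('20140721.json.gz') as fid:
    for item in ijson.items(fid, 'item'):
        # P31 is the Wikidata 'instance of' property
        for claim in item.get('claims', {}).get('P31', []):
            value = claim['mainsnak'].get('datavalue', {}).get('value', {})
            if value.get('numeric-id') == 783794:  # Q783794: 'company'
                labels = item.get('labels', {})
                # Romanian label with fallback to German and English
                for language in ['ro', 'de', 'en']:
                    if language in labels:
                        print(labels[language]['value'])
                        break
                break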
json_string = """{"id": 1, "content": "hello"}
{"id": 2, "content": "world"}
"""
# One JSON document per line
for line in json_string.splitlines():
    obj = json.loads(line)
    print(obj)
Bibliography
[1] Allen B. Downey. Think Python. O’Reilly Media, first edition, August 2012.
[2] Kevin Sheppard. Introduction to Python for econometrics, statistics and data analysis. Self-published,
University of Oxford, version 2.1 edition, February 2014.
[3] Stephen Marsland. Machine learning: An algorithmic perspective. Chapman & Hall/CRC, 2009.
[4] Skipper Seabold and Josef Perktold. Statsmodels: econometric and statistical modeling with Python.
In Proceedings of the 9th Python in Science Conference, 2010.
[5] Florian Krause and Oliver Lindemann. Expyriment: A Python library for cognitive and neuroscientific
experiments. Behavior Research Methods, 46(2):416–428, June 2013.
Annotation: Initial publication for the Expyriment Python package for stimulus presenta-
tion, response collection and recording in psychological experiments.
[6] Jeffrey M. Perkel. Programming: pick up Python. Nature, 518(7537):125–126, February 2015.
[7] Lutz Prechelt. An empirical comparison of C, C++, Java, Perl, Python, Rexx and Tcl. Computer,
33(10):23–29, October 2000.
[8] Sebastian Nanz and Carlo A. Furia. A comparative study of programming languages in Rosetta Code.
ArXiv, September 2014.
Annotation: Compares C, C#, F#, Go, Haskell, Java, Python and Ruby in terms of lines
of code, size of executable and running time.
[9] Philip Guo. Python is now the most popular introductory teaching language at top U.S. universities.
BLOG@CACM, July 2014.
[10] Coverity. Coverity finds Python sets new level of quality for open source software. Press release, August
2013.
[11] Jürgen Scheible and Ville Tuulos. Mobile Python: rapid prototyping of applications on the mobile
platform. Wiley, 1st edition, October 2007.
[12] Susan Tan. Python in the browser: Intro to Brython. YouTube, April 2014.
[13] Mads Ruben Burgdorff Kristensen, Simon Andreas Frimann Lund, Troels Blum, Kenneth Skovhede,
and Brian Vinter. Bohrium: unmodified NumPy code on CPU, GPU, and cluster. In Python for High
Performance and Scientific Computing, November 2013.
[14] Sue Gee. Python 2.7 to be maintained until 2020. I Programmer, April 2014.
[15] Kerrick Staley and Nick Coghlan. The “python” command on Unix-like systems. PEP 394, Python
Software Foundation, 2011.
[16] Dan Sanderson. Programming Google App Engine. O'Reilly, Sebastopol, California, USA, second
edition, October 2012.
[17] Philip J. Guo. Online Python Tutor: embeddable web-based program visualization for CS education.
In Proceedings of the 44th ACM technical symposium on Computer science education, pages 579–584,
New York, NY, USA, March 2013. Association for Computing Machinery.
[18] Ian Bicking. Python HTML parser performance. Ian Bicking: a blog, March 2008.
[19] Antoine Pitrou. Pickle protocol version 4. PEP 3154, Python Software Foundation, August 2011.
[20] Marc-André Lemburg. Python database API specification v2.0. PEP 249, Python Software Foundation,
Beaverton, Oregon, USA, November 2012.
[21] David Goodger and Guido van Rossum. Docstring conventions. PEP 257, Python Software Foundation,
Beaverton, Oregon, USA, June 2001.
[22] Amit Patel, Antoine Picard, Eugene Jhong, Gregory P. Smith, Jeremy Hylton, Matt Smart, Mike
Shields, and Shane Liebling. Google Python style guide, 2013.
[23] Thomas J. McCabe. A complexity measure. IEEE Transactions on Software Engineering, SE-2(4):308–
320, 1976.
[24] Guido van Rossum, Barry Warsaw, and Nick Coghlan. Style guide for Python code. PEP 8, Python
Software Foundation, August 2013.
[25] Prabhu Ramachandran and Gaël Varoquaux. Mayavi: making 3D data visualization reusable. In Gaël
Varoquaux, T. Vaught, and J. Millman, editors, Proceedings of the 7th Python in Science Conference
(SciPy 2008), pages 51–57, 2008.
[26] Prabhu Ramachandran and Gaël Varoquaux. Mayavi: 3D visualization of scientific data. Computing
in Science & Engineering, 13(2):40–50, March-April 2011.
[27] Wes McKinney. Python for data analysis. O’Reilly, Sebastopol, California, first edition, October 2012.
Annotation: Book on data analysis with Python introducing the Pandas library.
[28] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier
Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre
Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn:
machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[29] Michael Hanke, Yaroslav O. Halchenko, Per B. Sederberg, Stephen José Hanson, James V. Haxby,
and Stefan Pollmann. PyMVPA: a Python toolbox for multivariate pattern analysis of fMRI data.
Neuroinformatics, 7(1):37–53, March 2009.
[30] Michael Hanke, Yaroslav O. Halchenko, Per B. Sederberg, Emanuele Olivetti, Ingo Fründ, Jochem W.
Rieger, Christoph S. Herrmann, James V. Haxby, Stephen José Hanson, and Stefan Pollmann. PyMVPA:
a unifying approach to the analysis of neuroscientific data. Frontiers in Neuroinformatics, 3:3, 2009.
[31] Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina,
Matija Polajnar, Marko Toplak, Anže Starič, Miha Štajdohar, Lan Umek, Lan Žagar, Jure Žbontar,
Marinka Žitnik, and Blaž Zupan. Orange: data mining toolbox in Python. Journal of Machine Learning
Research, 14:2349–2353, August 2013.
[32] Davide Albanese, Roberto Visintainer, Stefano Merler, Samantha Riccadonna, Giuseppe Jurman, and
Cesare Furlanello. mlpy: machine learning Python. ArXiv, March 2012.
[33] T. Zito, N. Wilbert, L. Wiskott, and P. Berkes. Modular toolkit for data processing (MDP): a Python
data processing framework. Frontiers in Neuroinformatics, 2:8, 2008.
[34] Tom Schaul, Justin Bayer, Daan Wierstra, Yi Sun, Martin Felder, Frank Sehnke, Thomas Rückstieß, and
Jürgen Schmidhuber. PyBrain. Journal of Machine Learning Research, 11:743–746, February 2010.
[35] Radim Řehůřek and Petr Sojka. Software framework for topic modelling with large corpora. In Pro-
ceedings of LREC 2010 workshop New Challenges for NLP Frameworks, 2010.
[36] Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python. O’Reilly,
Sebastopol, California, June 2009.
Annotation: The canonical book for the NLTK package for natural language processing
in the Python programming language. Corpora, part-of-speech tagging and machine learning
classification are among the topics covered.
[37] Tom De Smedt and Walter Daelemans. Pattern for Python. Journal of Machine Learning Research,
13:2063–2067, 2012.
Annotation: Describes the Pattern module written in the Python programming language
for data, web, text and network mining.
[38] Brendan O’Connor, Michel Krieger, and David Ahn. TweetMotif: exploratory search and topic sum-
marization for Twitter. In Proceedings of the International AAAI Conference on Weblogs and Social
Media, 2010.
[39] Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein,
Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. Part-of-speech tagging for
Twitter: annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the
Association for Computational Linguistics: Human Language Technologies: Short Papers, volume 2,
pages 42–47. Association for Computational Linguistics, 2011.
[40] Finn Årup Nielsen. A new ANEW: evaluation of a word list for sentiment analysis in microblogs. In
Matthew Rowe, Milan Stankovic, Aba-Sah Dadzie, and Mariann Hardey, editors, Proceedings of the
ESWC2011 Workshop on ’Making Sense of Microposts’: Big things come in small packages, volume 718
of CEUR Workshop Proceedings, pages 93–98, May 2011.
Annotation: Initial description and evaluation of the AFINN word list for sentiment analysis.
[41] Aric Hagberg, Pieter Swart, and Daniel S. Chult. Exploring network structure, dynamics, and function
using NetworkX. In Gaël Varoquaux, Travis Vaught, and Jarrod Millman, editors, Proceedings of the
7th Python in Science Conference, pages 11–16, 2008.
[42] Peter H. Bennett, Thomas A. Burch, and Max Miller. Diabetes mellitus in American (Pima) Indians.
Lancet, 2(7716):125–128, July 1971.
[43] W. C. Knowler, P. H. Bennett, R. F. Hamman, and M. Miller. Diabetes incidence and prevalence in Pima
Indians: a 19-fold greater incidence than in Rochester, Minnesota. American Journal of Epidemiology,
108(6):497–505, December 1978.
[44] Jack W. Smith, J. E. Everhart, W. C. Dickson, W. C. Knowler, and R. S. Johannes. Using the ADAP
learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Annual Symposium
on Computer Application in Medical Care, pages 261–265. American Medical Informatics Association,
1988.
[45] Chris Anderson. The long tail. Wired, 12(10), October 2004.
[46] Johan Galtung and Mari Holmboe Ruge. The structure of foreign news: the presentation of the Congo,
Cuba and Cyprus crises in four Norwegian newspapers. Journal of Peace Research, 2(1):64–91, 1965.
[47] Bongwon Suh, Lichan Hong, Peter Pirolli, and Ed H. Chi. Want to be retweeted? Large scale analytics
on factors impacting retweet in Twitter network. In 2010 IEEE International Conference on Social
Computing (SocialCom10). IEEE, 2010.
[48] Lars Kai Hansen, Adam Arvidsson, Finn Årup Nielsen, Elanor Colleoni, and Michael Etter. Good
friends, bad news — affect and virality in Twitter. In James J. Park, Laurence T. Yang, and Changhoon
Lee, editors, Future Information Technology, volume 185 of Communications in Computer and Infor-
mation Science, pages 34–43, Berlin, 2011. Springer.
[49] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data. IEEE
Intelligent Systems, pages 8–12, March/April 2009.
[50] Frederick Jelinek. Some of my best friends are linguists. Talk at LREC 2004, May 2008.
[51] Chris Anderson. The end of theory: The data deluge makes the scientific method obsolete. Wired, June
2008.