Topic 7. Crash Course - Python
People are still crazy about Python after twenty-five years, which I find hard to believe.
—Michael Palin
All new employees at DataSciencester are required to go through new employee
orientation, the most interesting part of which is a crash course in Python.
This is not a comprehensive Python tutorial but instead is intended to highlight the
parts of the language that will be most important to us (some of which are often not
the focus of Python tutorials).
The Basics
Getting Python
You can download Python from python.org. But if you don’t already have Python, I
recommend instead installing the Anaconda distribution, which already includes
most of the libraries that you need to do data science.
As I write this, the latest version of Python is 3.4. At DataSciencester, however, we use
old, reliable Python 2.7. Python 3 is not backward-compatible with Python 2, and
many important libraries only work well with 2.7. The data science community is still
firmly stuck on 2.7, which means we will be, too. Make sure to get that version.
If you don’t get Anaconda, make sure to install pip, which is a Python package
manager that allows you to easily install third-party packages (some of which we’ll need).
It’s also worth getting IPython, which is a much nicer Python shell to work with.
(If you installed Anaconda then it should have come with pip and IPython.)
Just run:
pip install ipython
and then search the Internet for solutions to whatever cryptic error messages that
causes.
Whitespace Formatting
Many languages use curly braces to delimit blocks of code. Python uses indentation:
for i in [1, 2, 3, 4, 5]:
    print i                     # first line in "for i" block
    for j in [1, 2, 3, 4, 5]:
        print j                 # first line in "for j" block
        print i + j             # last line in "for j" block
    print i                     # last line in "for i" block
print "done looping"
This makes Python code very readable, but it also means that you have to be very
careful with your formatting. Whitespace is ignored inside parentheses and brackets,
which can be helpful for long-winded computations:
long_winded_computation = (1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12 +
                           13 + 14 + 15 + 16 + 17 + 18 + 19 + 20)
and for making code easier to read:
list_of_lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
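One consequence of whitespace formatting is that it can be hard to copy and paste
code into the ordinary Python shell. If the pasted code has a blank line inside a for
loop, for example, you will typically get an IndentationError,
because the interpreter thinks the blank line signals the end of the for loop’s block.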
IPython has a magic function %paste, which correctly pastes whatever is on your
clipboard, whitespace and all. This alone is a good reason to use IPython.
Modules
Certain features of Python are not loaded by default. These include both features
included as part of the language as well as third-party features that you download
yourself. In order to use these features, you’ll need to import the modules that
contain them.
One approach is to simply import the module itself:
import re
my_regex = re.compile("[0-9]+", re.I)
Here re is the module containing functions and constants for working with regular
expressions. After this type of import you can only access those functions by
prefixing them with re.
If you already had a different re in your code you could use an alias:
import re as regex
my_regex = regex.compile("[0-9]+", regex.I)
You might also do this if your module has an unwieldy name or if you’re going to be
typing it a lot. For example, when visualizing data with matplotlib, a standard
convention is:
import matplotlib.pyplot as plt
If you need a few specific values from a module, you can import them explicitly and
use them without qualification:
from collections import defaultdict, Counter
lookup = defaultdict(int)
my_counter = Counter()
If you were a bad person, you could import the entire contents of a module into your
namespace, which might inadvertently overwrite variables you’ve already defined:
match = 10
from re import * # uh oh, re has a match function
print match # "<function re.match>"
However, since you are not a bad person, you won’t ever do this.
Arithmetic
Python 2.7 uses integer division by default, so that 5 / 2 equals 2. Almost always this
is not what we want, so we will always start our files with:
from __future__ import division
after which 5 / 2 equals 2.5. Every code example in this book uses this new-style
division. In the handful of cases where we need integer division, we can get it with a
double slash: 5 // 2.
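As a small sketch of how the double slash behaves (the variable names here are illustrative), the following lines give the same results on Python 2.7 and Python 3, with or without the __future__ import:

```python
# Floor division always rounds toward negative infinity.
floor_quotient = 5 // 2               # 2
negative_floor = -5 // 2              # -3, not -2

# If either operand is a float, / is true division even in plain Python 2.7.
float_quotient = 5 / 2.0              # 2.5

# divmod returns the floor quotient and the remainder together.
quotient, remainder = divmod(5, 2)    # (2, 1)
```

Note that integer division is a floor, not a truncation, which matters when the operands can be negative.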
Functions
A function is a rule for taking zero or more inputs and returning a corresponding
output. In Python, we typically define functions using def:
def double(x):
    """this is where you put an optional docstring
    that explains what the function does.
    for example, this function multiplies its input by 2"""
    return x * 2
Python functions are first-class, which means that we can assign them to variables
and pass them into functions just like any other arguments:
def apply_to_one(f):
    """calls the function f with 1 as its argument"""
    return f(1)
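A brief usage sketch of passing functions around. The names my_double, x, and y are illustrative, and the lambda is just one way to create a short anonymous function to pass in:

```python
def double(x):
    """multiplies its input by 2"""
    return x * 2

def apply_to_one(f):
    """calls the function f with 1 as its argument"""
    return f(1)

my_double = double                    # another name for the same function object
x = apply_to_one(my_double)           # double(1), which is 2

# A lambda creates a small anonymous function inline.
y = apply_to_one(lambda v: v + 4)     # 1 + 4, which is 5
```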
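Function parameters can also be given default arguments, which only need to be
specified when you want a value other than the default. Arguments can also be
passed by name:
def subtract(a=0, b=0):
    return a - b
subtract(10, 5) # returns 5
subtract(0, 5) # returns -5
subtract(b=5) # same as previous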
We will be creating many, many functions.
Strings
Strings can be delimited by single or double quotation marks (but the quotes have to
match):
single_quoted_string = 'data science'
double_quoted_string = "data science"
Python uses backslashes to encode special characters. For example:
tab_string = "\t" # represents the tab character
len(tab_string) # is 1
If you want backslashes as backslashes (which you might in Windows directory
names or in regular expressions), you can create raw strings using r"":
not_tab_string = r"\t" # represents the characters '\' and 't'
len(not_tab_string) # is 2
You can create multiline strings using triple-[double-]-quotes:
multi_line_string = """This is the first line.
and this is the second line
and this is the third line"""
Exceptions
When something goes wrong, Python raises an exception. Unhandled, these will cause
your program to crash. You can handle them using try and except:
try:
    print 0 / 0
except ZeroDivisionError:
    print "cannot divide by zero"
Although in many languages exceptions are considered bad, in Python there is no
shame in using them to make your code cleaner, and we will occasionally do so.
Lists
Probably the most fundamental data structure in Python is the list. A list is simply
an ordered collection. (It is similar to what in other languages might be called an
array, but with some added functionality.)
integer_list = [1, 2, 3]
heterogeneous_list = ["string", 0.1, True]
list_of_lists = [ integer_list, heterogeneous_list, [] ]
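It is often convenient to unpack a list into separate variables when you know how
many elements it contains, although you will get a ValueError if you don’t have the
same number of elements on both sides of the assignment.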
It’s common to use an underscore for a value you’re going to throw away:
_, y = [1, 2] # now y == 2, didn't care about the first element
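A short sketch of list unpacking, including the failure case. The variable names are illustrative only:

```python
x, y = [1, 2]             # x is 1, y is 2
_, second = [1, 2]        # keep only the second element

# Unpacking three elements into two names raises ValueError.
try:
    a, b = [1, 2, 3]
    unpack_failed = False
except ValueError:
    unpack_failed = True
```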
Tuples
Tuples are lists’ immutable cousins. Pretty much anything you can do to a list that
doesn’t involve modifying it, you can do to a tuple. You specify a tuple by using
parentheses (or nothing) instead of square brackets:
my_list = [1, 2]
my_tuple = (1, 2)
other_tuple = 3, 4
my_list[1] = 3 # my_list is now [1, 3]
try:
    my_tuple[1] = 3
except TypeError:
    print "cannot modify a tuple"
Tuples are a convenient way to return multiple values from functions:
def sum_and_product(x, y):
    return (x + y),(x * y)
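A usage sketch for a function that returns a tuple. The call arguments and variable names are illustrative; the last two lines show that the same tuple mechanics allow multiple assignment:

```python
def sum_and_product(x, y):
    return (x + y), (x * y)

sp = sum_and_product(2, 3)        # the whole tuple, (5, 6)
s, p = sum_and_product(5, 10)     # unpacked: s is 15, p is 50

# Tuples (and lists) can also be used for multiple assignment.
x, y = 1, 2
x, y = y, x                       # swap: now x is 2, y is 1
```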
Dictionaries
Another fundamental data structure is a dictionary, which associates values with keys
and allows you to quickly retrieve the value corresponding to a given key:
empty_dict = {} # Pythonic
empty_dict2 = dict() # less Pythonic
grades = { "Joel" : 80, "Tim" : 95 } # dictionary literal
You can look up the value for a key using square brackets:
joels_grade = grades["Joel"] # equals 80
But you’ll get a KeyError if you ask for a key that’s not in the dictionary:
try:
    kates_grade = grades["Kate"]
except KeyError:
    print "no grade for Kate!"
Dictionaries have a get method that returns a default value (instead of raising an
exception) when you look up a key that’s not in the dictionary:
joels_grade = grades.get("Joel", 0) # equals 80
kates_grade = grades.get("Kate", 0) # equals 0
no_ones_grade = grades.get("No One") # default default is None
You assign key-value pairs using the same square brackets:
grades["Tim"] = 99 # replaces the old value
grades["Kate"] = 100 # adds a third entry
num_students = len(grades) # equals 3
We will frequently use dictionaries as a simple way to represent structured data:
tweet = {
    "user" : "joelgrus",
    "text" : "Data Science is Awesome",
    "retweet_count" : 100,
    "hashtags" : ["#data", "#science", "#datascience", "#awesome", "#yolo"]
}
Besides looking for specific keys we can look at all of them:
tweet_keys = tweet.keys() # list of keys
tweet_values = tweet.values() # list of values
tweet_items = tweet.items() # list of (key, value) tuples
Dictionary keys must be immutable; in particular, you cannot use lists as keys. If
you need a multipart key, you should use a tuple or figure out a way to turn the key
into a string.
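To illustrate the point about multipart keys, here is a small sketch using tuples as dictionary keys. The data (grid coordinates mapped to labels) is hypothetical:

```python
# Hypothetical example data: (row, column) coordinates mapped to labels.
labels = {}
labels[(0, 0)] = "origin"
labels[(1, 2)] = "somewhere else"
origin_label = labels[(0, 0)]         # "origin"

# A list is mutable and therefore unhashable, so it is rejected as a key.
try:
    labels[[0, 0]] = "origin"
    list_key_accepted = True
except TypeError:
    list_key_accepted = False
```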
defaultdict
Imagine that you’re trying to count the words in a document. An obvious approach is
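to create a dictionary in which the keys are words and the values are counts. As you
check each word, you can increment its count if it’s already in the dictionary and add
it to the dictionary if it’s not. You could do this with an explicit membership test, or
by catching the KeyError raised for a missing word.
A third approach is to use get, which behaves gracefully for missing keys: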
word_counts = {}
for word in document:
    previous_count = word_counts.get(word, 0)
    word_counts[word] = previous_count + 1
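A defaultdict is like a regular dictionary, except that when you try to look up a key
it doesn’t contain, it first adds a value for it using a zero-argument function you
provided when you created it; defaultdict(int), for example, gives missing keys the
value 0. defaultdicts can also be useful with list or dict or even your own functions: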
dd_list = defaultdict(list) # list() produces an empty list
dd_list[2].append(1) # now dd_list contains {2: [1]}
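Returning to the word-counting problem, a minimal sketch with defaultdict(int) might look like the following. It assumes document is a list of words; the sample list here is made up for illustration:

```python
from collections import defaultdict

document = ["data", "science", "data"]    # hypothetical sample input

word_counts = defaultdict(int)            # int() produces 0
for word in document:
    word_counts[word] += 1                # missing words start at 0

# Merely looking up a missing key also inserts it with the default value.
unseen = word_counts["python"]            # 0, and "python" is now a key
```

The last line shows a side effect worth knowing about: a lookup on a defaultdict adds the key, so use `in` or get if you only want to test for membership.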
Counter
A Counter turns a sequence of values into a defaultdict(int)-like object mapping
keys to counts. We will primarily use it to create histograms:
from collections import Counter
c = Counter([0, 1, 2, 0]) # c is (basically) { 0 : 2, 1 : 1, 2 : 1 }
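Counter also offers a compact way to do word counts. The sketch below assumes document is a list of words (the sample list is hypothetical) and uses the most_common method of Counter:

```python
from collections import Counter

# Hypothetical sample input.
document = ["data", "science", "data", "python", "data", "science"]

word_counts = Counter(document)

# most_common(n) returns the n most frequent items as (item, count) pairs.
top_two = word_counts.most_common(2)      # [("data", 3), ("science", 2)]
```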
Sets
Another data structure is set, which represents a collection of distinct elements:
s = set()
s.add(1) # s is now { 1 }
s.add(2) # s is now { 1, 2 }
s.add(2) # s is still { 1, 2 }
x = len(s) # equals 2
y = 2 in s # equals True
z = 3 in s # equals False
We’ll use sets for two main reasons. The first is that in is a very fast operation on sets.
If we have a large collection of items that we want to use for a membership test, a set
is more appropriate than a list:
stopwords_list = ["a","an","at"] + hundreds_of_other_words + ["yet", "you"]
stopwords_set = set(stopwords_list)
"zip" in stopwords_set # very fast to check
The second reason is to find the distinct items in a collection:
item_list = [1, 2, 3, 1, 2, 3]
num_items = len(item_list) # 6
item_set = set(item_list) # {1, 2, 3}
num_distinct_items = len(item_set) # 3
distinct_item_list = list(item_set) # [1, 2, 3]
We’ll use sets much less frequently than dicts and lists.
Inside a for or while loop, if you need more-complex logic, you can use continue and break:
for x in range(10):
    if x == 3:
        continue # go immediately to the next iteration
    if x == 5:
        break # quit the loop entirely
    print x
Truthiness
Booleans in Python work as in most other languages, except that they’re capitalized:
one_is_less_than_two = 1 < 2 # equals True
true_equals_false = True == False # equals False
Python uses the value None to indicate a nonexistent value. It is similar to other
languages’ null:
x = None
print x == None # prints True, but is not Pythonic
print x is None # prints True, and is Pythonic
Python lets you use any value where it expects a Boolean. The following are all
“Falsy”:
• False
• None
• [] (an empty list)
• {} (an empty dict)
• ""
• set()
• 0
• 0.0
Pretty much anything else gets treated as True. This allows you to easily use if
statements to test for empty lists or empty strings or empty dictionaries or so on. It also
sometimes causes tricky bugs if you’re not expecting this behavior:
s = some_function_that_returns_a_string()
if s:
    first_char = s[0]
else:
    first_char = ""
A simpler way of doing the same is:
first_char = s and s[0]
since and returns its second value when the first is “truthy,” the first value when it’s
not. Similarly, if x is either a number or possibly None:
safe_x = x or 0
is definitely a number.
Python has an all function, which takes a list and returns True precisely when every
element is truthy, and an any function, which returns True when at least one element
is truthy:
all([True, 1, { 3 }]) # True
all([True, 1, {}]) # False, {} is falsy
any([True, 1, {}]) # True, True is truthy
all([]) # True, no falsy elements in the list
any([]) # False, no truthy elements in the list
The Not-So-Basics
Here we’ll look at some more-advanced Python features that we’ll find useful for
working with data.
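Every Python list has a sort method that sorts it in place. If you don’t want to
modify your list, you can use the sorted function, which returns a new list.
By default, sort (and sorted) sort a list from smallest to largest based on naively
comparing the elements to one another.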
If you want elements sorted from largest to smallest, you can specify a reverse=True
parameter. And instead of comparing the elements themselves, you can compare the
results of a function that you specify with key:
# sort the list by absolute value from largest to smallest
x = sorted([-4,1,-2,3], key=abs, reverse=True) # is [-4,3,-2,1]
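The following sketch contrasts sorted with the in-place sort method and shows key used with a small function. The word_counts dictionary is hypothetical sample data:

```python
x = [4, 1, 2, 3]
y = sorted(x)                 # y is [1, 2, 3, 4]; x is unchanged
x_before_sort = list(x)       # copy of x, still [4, 1, 2, 3]
x.sort()                      # now x itself is [1, 2, 3, 4]

# Sort (word, count) pairs from highest count to lowest.
word_counts = {"a": 3, "b": 1, "c": 2}    # hypothetical counts
by_count = sorted(word_counts.items(),
                  key=lambda pair: pair[1],
                  reverse=True)           # [("a", 3), ("c", 2), ("b", 1)]
```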
List Comprehensions
Frequently, you’ll want to transform a list into another list, by choosing only certain
elements, or by transforming elements, or both. The Pythonic way of doing this is list
comprehensions:
even_numbers = [x for x in range(5) if x % 2 == 0] # [0, 2, 4]
squares = [x * x for x in range(5)] # [0, 1, 4, 9, 16]
even_squares = [x * x for x in even_numbers] # [0, 4, 16]
You can similarly turn lists into dictionaries or sets:
square_dict = { x : x * x for x in range(5) } # { 0:0, 1:1, 2:4, 3:9, 4:16 }
square_set = { x * x for x in [1, -1] } # { 1 }
If you don’t need the value from the list, it’s conventional to use an underscore as the
variable:
zeroes = [0 for _ in even_numbers] # has the same length as even_numbers
A list comprehension can include multiple fors, and later fors can use the results of
earlier ones:
increasing_pairs = [(x, y)                      # only pairs with x < y,
                    for x in range(10)          # range(lo, hi) equals
                    for y in range(x + 1, 10)]  # [lo, lo + 1, ..., hi - 1]
We will use list comprehensions a lot.
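A generator is something that you can iterate over (usually with for) but whose
values are produced only as needed (lazily). One way to create generators is with
functions and the yield operator. For a generator function such as lazy_range, the
following loop will consume the yielded values one at a time until none are left: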
for i in lazy_range(10):
    do_something_with(i)
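The definition of lazy_range does not appear in this section. A minimal sketch of what such a generator function might look like, assuming it should behave like a one-argument range, is:

```python
def lazy_range(n):
    """a lazy version of range: yields 0, 1, ..., n - 1 one at a time"""
    i = 0
    while i < n:
        yield i       # hand back one value, then pause here until asked again
        i += 1

# Consume the generator; each value is produced only when the loop asks for it.
consumed = []
for i in lazy_range(5):
    consumed.append(i)
```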
(Python actually comes with a lazy_range function called xrange, and in Python 3,
range itself is lazy.) This means you could even create an infinite sequence:
def natural_numbers():
    """returns 1, 2, 3, ..."""
    n = 1
    while True:
        yield n
        n += 1
although you probably shouldn’t iterate over it without using some kind of break
logic.
The flip side of laziness is that you can only iterate through a generator once.
If you need to iterate through something multiple times, you’ll need to either
recreate the generator each time or use a list.
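Both points can be seen in a short sketch. It uses itertools.islice as one possible form of "break logic" for an infinite generator, and a generator expression (a comprehension wrapped in parentheses) to show the iterate-once behavior. The variable names are illustrative:

```python
from itertools import islice

def natural_numbers():
    """yields 1, 2, 3, ... without end"""
    n = 1
    while True:
        yield n
        n += 1

# islice stops after five values, so the infinite generator is safe to use.
first_five = list(islice(natural_numbers(), 5))    # [1, 2, 3, 4, 5]

# A generator can be consumed only once.
squares = (x * x for x in range(3))
first_pass = list(squares)      # [0, 1, 4]
second_pass = list(squares)     # [], the generator is already exhausted
```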
Recall also that every dict has an items() method that returns a list of its key-value
pairs. More frequently we’ll use the iteritems() method, which lazily yields the
key-value pairs one at a time as we iterate over it.
Randomness
As we learn data science, we will frequently need to generate random numbers, which
we can do with the random module:
import random
The random module actually produces pseudorandom (that is, deterministic) numbers
based on an internal state that you can set with random.seed if you want to get
reproducible results:
random.seed(10) # set the seed to 10
print random.random() # 0.57140259469
random.seed(10) # reset the seed to 10
print random.random() # 0.57140259469 again
There are a few more methods that we’ll sometimes find convenient. random.shuffle
randomly reorders the elements of a list:
up_to_ten = range(10)
random.shuffle(up_to_ten)
print up_to_ten
# [2, 5, 1, 9, 7, 3, 8, 6, 4, 0] (your results will probably be different)
If you need to randomly pick one element from a list you can use random.choice:
my_best_friend = random.choice(["Alice", "Bob", "Charlie"]) # "Bob" for me
And if you need to randomly choose a sample of elements without replacement (i.e.,
with no duplicates), you can use random.sample:
lottery_numbers = range(60)
winning_numbers = random.sample(lottery_numbers, 6) # [16, 36, 10, 6, 25, 9]
To choose a sample of elements with replacement (i.e., allowing duplicates), you can
just make multiple calls to random.choice:
four_with_replacement = [random.choice(range(10))
                         for _ in range(4)]
# [9, 4, 4, 2]
Regular Expressions
Regular expressions provide a way of searching text. They are incredibly useful but
also fairly complicated, so much so that there are entire books written about them.
We will explain their details the few times we encounter them; here are a few
examples of how to use them in Python:
import re
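The examples themselves are not shown above. As an illustrative sketch, each of the following checks uses a standard function from the re module (match, search, split, sub), and every one of them evaluates to True:

```python
import re

checks = [
    re.match("a", "cat") is None,                    # "cat" doesn't start with "a"
    re.search("a", "cat") is not None,               # "cat" has an "a" in it
    re.search("c", "dog") is None,                   # "dog" has no "c" in it
    re.split("[ab]", "carbs") == ["c", "r", "s"],    # split on "a" or "b"
    re.sub("[0-9]", "-", "R2D2") == "R-D-",          # replace digits with dashes
]
all_checks_pass = all(checks)
```

The difference between the first two lines is the one to remember: match only looks at the beginning of the string, while search scans the whole string.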
Object-Oriented Programming
Like many languages, Python allows you to define classes that encapsulate data and
the functions that operate on them. We’ll use them sometimes to make our code
cleaner and simpler. It’s probably simplest to explain them by constructing a heavily
annotated example.
Imagine we didn’t have the built-in Python set. Then we might want to create our
own Set class.
What behavior should our class have? Given an instance of Set, we’ll need to be able
to add items to it, remove items from it, and check whether it contains a certain
value. We’ll create all of these as member functions, which means we’ll access them
with a dot after a Set object:
# by convention, we give classes PascalCase names
class Set:
    def __repr__(self):
        """this is the string representation of a Set object
        if you type it at the Python prompt or pass it to str()"""
        return "Set: " + str(self.dict.keys())
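Only the __repr__ method appears above, and it relies on a self.dict attribute that is never set up. The following is a sketch of how the rest of the class might be filled in, assuming the members are stored as the keys of an ordinary dict. The method names add, contains, and remove follow the behaviors listed in the text but are otherwise assumptions:

```python
class Set:
    def __init__(self, values=None):
        """constructor: s1 = Set() or s2 = Set([1, 2, 2, 3])"""
        self.dict = {}                  # each instance gets its own dict
        if values is not None:
            for value in values:
                self.add(value)

    def __repr__(self):
        """string representation of a Set object"""
        return "Set: " + str(list(self.dict.keys()))

    def add(self, value):
        """membership is represented by being a key in self.dict"""
        self.dict[value] = True

    def contains(self, value):
        return value in self.dict

    def remove(self, value):
        del self.dict[value]

s = Set([1, 2, 3])
s.add(4)
has_four = s.contains(4)      # True
s.remove(3)
has_three = s.contains(3)     # False
```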
Functional Tools
When passing functions around, sometimes we’ll want to partially apply (or curry)
functions to create new functions. As a simple example, imagine that we have a
function of two variables:
def exp(base, power):
return base ** power
and we want to use it to create a function of one variable two_to_the whose input is a
power and whose output is the result of exp(2, power).
We can, of course, do this with def, but this can sometimes get unwieldy:
def two_to_the(power):
    return exp(2, power)
A different approach is to use functools.partial:
from functools import partial
two_to_the = partial(exp, 2) # is now a function of one variable
print two_to_the(3) # 8
You can also use partial to fill in later arguments if you specify their names:
square_of = partial(exp, power=2)
print square_of(3) # 9
It starts to get messy if you curry arguments in the middle of the function, so we’ll try
to avoid doing that.
We will also occasionally use map, reduce, and filter, which provide functional
alternatives to list comprehensions:
def double(x):
    return 2 * x
xs = [1, 2, 3, 4]
twice_xs = [double(x) for x in xs] # [2, 4, 6, 8]
twice_xs = map(double, xs) # same as above
list_doubler = partial(map, double) # *function* that doubles a list
twice_xs = list_doubler(xs) # again [2, 4, 6, 8]
You can use map with multiple-argument functions if you provide multiple lists:
def multiply(x, y): return x * y
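A sketch of map over two lists using multiply, followed by filter, which is mentioned above but not demonstrated. The helper is_even and the variable names are illustrative. The list(...) wrappers are only there so the results are lists on Python 3 as well, where map and filter are lazy; on Python 2.7 they are harmless:

```python
def multiply(x, y): return x * y

# map pairs up corresponding elements of the two lists.
products = list(map(multiply, [1, 2], [4, 5]))    # [1 * 4, 2 * 5] = [4, 10]

def is_even(x):
    """True if x is even, False if x is odd"""
    return x % 2 == 0

xs = [1, 2, 3, 4]
x_evens = [x for x in xs if is_even(x)]           # [2, 4]
x_evens_filtered = list(filter(is_even, xs))      # same as above
```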
And reduce combines the first two elements of a list, then that result with the third,
that result with the fourth, and so on, producing a single result:
x_product = reduce(multiply, xs) # = 1 * 2 * 3 * 4 = 24
list_product = partial(reduce, multiply) # *function* that reduces a list
x_product = list_product(xs) # again = 24
enumerate
Not infrequently, you’ll want to iterate over a list and use both its elements and their
indexes:
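A minimal sketch of the built-in enumerate, which produces (index, element) tuples. The documents list is hypothetical sample data:

```python
documents = ["first", "second", "third"]    # hypothetical sample data

# Pythonic: enumerate yields (index, element) pairs.
indexed = []
for i, document in enumerate(documents):
    indexed.append((i, document))

# If only the indexes are needed, the element can be ignored with an underscore.
indexes_only = [i for i, _ in enumerate(documents)]
```

This avoids both indexing with range(len(documents)) and maintaining a separate counter by hand.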
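zip and Argument Unpacking
Often we will need to zip two or more lists together. zip transforms multiple lists
into a single list of tuples of corresponding elements.
If the lists are different lengths, zip stops as soon as the first list ends.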
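A short sketch of zip, including the different-lengths case. The list names are illustrative, and list(...) is used so the result is a list on Python 3 too, where zip is lazy:

```python
list1 = ['a', 'b', 'c']
list2 = [1, 2, 3]
zipped = list(zip(list1, list2))            # [('a', 1), ('b', 2), ('c', 3)]

# With unequal lengths, the result is as long as the shortest input.
truncated = list(zip([1, 2, 3], ['x']))     # [(1, 'x')]
```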
You can also “unzip” a list using a strange trick:
pairs = [('a', 1), ('b', 2), ('c', 3)]
letters, numbers = zip(*pairs)
The asterisk performs argument unpacking, which uses the elements of pairs as
individual arguments to zip. It ends up the same as if you’d called:
zip(('a', 1), ('b', 2), ('c', 3))
You can use argument unpacking with any function:
def add(a, b): return a + b
add(1, 2) # returns 3
add([1, 2]) # TypeError!
add(*[1, 2]) # returns 3
It is rare that we’ll find this useful, but when we do it’s a neat trick.
args and kwargs
Let’s say we want to create a higher-order function that takes as input some function f
and returns a new function that for any input returns twice the value of f:
def doubler(f):
    def g(x):
        return 2 * f(x)
    return g
This works in some cases:
def f1(x):
    return x + 1
g = doubler(f1)
print g(3) # 8 (== ( 3 + 1) * 2)
print g(-1) # 0 (== (-1 + 1) * 2)
However, it breaks down with functions that take more than a single argument:
def f2(x, y):
    return x + y
g = doubler(f2)
print g(1, 2) # TypeError: g() takes exactly 1 argument (2 given)
What we need is a way to specify a function that takes arbitrary arguments. We can
do this with argument unpacking and a little bit of magic:
def magic(*args, **kwargs):
    print "unnamed args:", args
    print "keyword args:", kwargs
magic(1, 2, key="word", key2="word2")
# prints
#  unnamed args: (1, 2)
#  keyword args: {'key2': 'word2', 'key': 'word'}
That is, when we define a function like this, args is a tuple of its unnamed arguments
and kwargs is a dict of its named arguments. It works the other way too, if you want
to use a list (or tuple) and dict to supply arguments to a function:
def other_way_magic(x, y, z):
    return x + y + z
x_y_list = [1, 2]
z_dict = { "z" : 3 }
print other_way_magic(*x_y_list, **z_dict) # 6
You could do all sorts of strange tricks with this; we will only use it to produce
higher-order functions whose inputs can accept arbitrary arguments:
g = doubler_correct(f2)
print g(1, 2) # 6
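The doubler_correct used above is not defined in this section. A minimal sketch of how it might be written, assuming it simply forwards whatever arguments it receives to f, is:

```python
def doubler_correct(f):
    """works no matter what kind of inputs f expects"""
    def g(*args, **kwargs):
        """whatever arguments g is supplied, pass them through to f"""
        return 2 * f(*args, **kwargs)
    return g

def f2(x, y):
    return x + y

g = doubler_correct(f2)
positional_result = g(1, 2)       # 2 * (1 + 2) = 6
keyword_result = g(x=1, y=2)      # keyword arguments are forwarded too
```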
Welcome to DataSciencester!
This concludes new-employee orientation. Oh, and also, try not to embezzle anything.