Clojure For Data Science - Sample Chapter
Henry Garner
He started his technical career at Britain's largest telecoms provider, BT, working with a traditional data warehouse infrastructure. As part of a small team for three years, he built sophisticated data models to derive insight from raw data and web applications to present the results. These applications were used internally by senior executives and operatives to track both business and systems performance.
He then went on to co-found Likely, a social media analytics start-up. As the
CTO, he set the technical direction, leading to the introduction of an event-based
append-only data pipeline modeled after the Lambda architecture. He adopted
Clojure in 2011 and led a hybrid team of programmers and data scientists, building
content recommendation engines based on collaborative filtering and clustering
techniques. He developed a syllabus and co-presented a series of evening classes from Likely's offices for professional developers who wanted to learn Clojure.
Henry now works with growing businesses, consulting in both a development and
technical leadership capacity. He presents regularly at seminars and Clojure meetups
in and around London.
Preface
"Statistical thinking will one day be as necessary for efficient citizenship as the
ability to read and write."
- H. G. Wells
"I have a great subject [statistics] to write upon, but feel keenly my
literary incapacity to make it easily intelligible without sacrificing
accuracy and thoroughness."
- Sir Francis Galton
A web search for "data science Venn diagram" returns numerous interpretations
of the skills required to be an effective data scientist (it appears that data science
commentators love Venn diagrams). Author and data scientist Drew Conway
produced the prototypical diagram back in 2010, putting data science at the
intersection of hacking skills, substantive expertise (that is, subject domain
understanding), and mathematics and statistics knowledge. Between hacking skills and substantive expertise (those practicing without strong mathematics and statistics knowledge) lies the "danger zone."
Five years on, as a growing number of developers seek to plug the data science skills shortage, there's more need than ever for statistical and mathematical education
to help developers out of this danger zone. So, when Packt Publishing invited me
to write a book on data science suitable for Clojure programmers, I gladly agreed.
In addition to appreciating the need for such a book, I saw it as an opportunity
to consolidate much of what I had learned as CTO of my own Clojure-based data
analytics company. The result is the book I wish I had been able to read before
starting out.
Clojure for Data Science aims to be much more than just a book of statistics for
Clojure programmers. A large reason for the spread of data science into so many
diverse areas is the enormous power of machine learning. Throughout the book,
I'll show how to use pure Clojure functions and third-party libraries to construct
machine learning models for the primary tasks of regression, classification,
clustering, and recommendation.
Approaches that scale to very large datasets, so-called "big data," are of particular
interest to data scientists, because they can reveal subtleties that are lost in smaller
samples. This book shows how Clojure can be used to concisely express jobs to
run on the Hadoop and Spark distributed computation frameworks, and how to
incorporate machine learning through the use of both dedicated external libraries
and general optimization techniques.
Above all, this book aims to foster an understanding not just of how to perform particular types of analysis, but of why such techniques work. In addition to providing
practical knowledge (almost every concept in this book is expressed as a runnable
example), I aim to explain the theory that will allow you to take a principle and
apply it to related problems. I hope that this approach will enable you to effectively
apply statistical thinking in diverse situations well into the future, whether or not
you decide to pursue a career in data science.
Statistics
"The people who cast the votes decide nothing. The people who count the votes
decide everything."
- Joseph Stalin
Over the course of the following ten chapters of Clojure for Data Science, we'll attempt
to discover a broadly linear path through the field of data science. In fact, we'll find
as we go that the path is not quite so linear, and the attentive reader ought to notice
many recurring themes along the way.
Descriptive statistics concern themselves with summarizing sequences of numbers
and they'll appear, to some extent, in every chapter in this book. In this chapter,
we'll build foundations for what's to come by implementing functions to calculate
the mean, median, variance, and standard deviation of numerical sequences
in Clojure. While doing so, we'll attempt to take the fear out of interpreting
mathematical formulae.
As soon as we have more than one number to analyze it becomes meaningful to ask
how those numbers are distributed. You've probably already heard expressions such
as "long tail" and the "80/20 rule". They concern the spread of numbers throughout
a range. We demonstrate the value of distributions in this chapter and introduce the
most useful of them all: the normal distribution.
The study of distributions is aided immensely by visualization, and for this we'll use
the Clojure library Incanter. We'll show how Incanter can be used to load, transform,
and visualize real data. We'll compare the results of two national elections (the 2010 United Kingdom general election and the 2011 Russian parliamentary election) and see how even basic analysis can provide evidence of potentially fraudulent activity.
Examples can be run from a REPL started with lein repl on the command line. By default, the REPL will open in the examples namespace. Alternatively, to run a specific numbered example, you can execute:

lein run --example 1.1
We only assume basic command-line familiarity throughout this book. The ability to
run Leiningen and shell scripts is all that's required.
If you become stuck at any point, refer to the book's wiki at
http://wiki.clojuredatascience.com. The wiki will
provide troubleshooting tips for known issues, including
advice for running examples on a variety of platforms.
In fact, shell scripts are only used for fetching data from remote locations
automatically. The book's wiki will also provide alternative instructions for
those not wishing or unable to execute the shell scripts.
Throughout this book we'll be making use of numerous datasets. Where possible, we've included the data with the example code. Where this hasn't been possible (either because of the size of the data or due to licensing constraints) we've included a script to download the data instead.
Chapter 1, Statistics is just such a chapter. If you've cloned the chapter's code and
intend to follow the examples, download the data now by executing the following
on the command line from within the project's directory:
script/download-data.sh
The script will download and decompress the sample data into the project's
data directory.
If you have any difficulty running the download script or would
like to follow manual instructions instead, visit the book's wiki at
http://wiki.clojuredatascience.com for assistance.
If you don't mind including more libraries than you need, you can simply include
the full Incanter distribution instead:
:dependencies [[incanter/incanter "1.5.5"]
...]
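For context, a complete minimal project.clj using the full distribution might look like the following sketch (the project name and Clojure version here are assumptions, not taken from the book's example code):

;; A minimal Leiningen project file; the project name and Clojure
;; version are illustrative assumptions.
(defproject cljds-ch1 "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter/incanter "1.5.5"]])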
Incanter offers several ways to create a dataset, depending on the source of our data:

If our data is a text file (a CSV or tab-delimited file), we can use the read-dataset function from the incanter-io library
If our data is an Excel file (for example, an .xls or .xlsx file), we can use the read-xls function from the incanter-excel library
For any other data source (an external database, website, and so on), as long as we can get our data into a Clojure data structure we can create a dataset with the dataset function in incanter-core (a short sketch follows this list)
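As a quick sketch of that last case (the column names and rows below are invented for illustration):

;; Building a dataset directly from Clojure data structures.
(i/dataset ["name" "votes"]
           [["Constituency A" 10000]
            ["Constituency B" 12000]])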
This chapter makes use of Excel data sources, so we'll be using read-xls. The function takes one required argument (the file to load) and an optional keyword argument specifying the sheet number or name. All of our examples have only one sheet, so we'll just provide the file argument as a string:
(ns cljds.ch1.data
  (:require [clojure.java.io :as io]
            [incanter.core :as i]
            [incanter.excel :as xls]))
In general, we will not reproduce the namespace declarations from the example code. This is both for brevity and because the required namespaces can usually be inferred by the symbol used to reference them. For example, throughout this book we will always refer to clojure.java.io as io, incanter.core as i, and incanter.excel as xls wherever they are used.
We'll be loading several data sources throughout this chapter, so we've created a
multimethod called load-data in the cljds.ch1.data namespace:
(defmulti load-data identity)

(defmethod load-data :uk [_]
  (-> (io/resource "UK2010.xls")
      (str)
      (xls/read-xls)))
As described in Running the examples earlier, functions beginning with ex- can be run on the command line with Leiningen like this:

lein run -e 1.1
The output of the preceding command should be the following Clojure vector:
["Press Association Reference" "Constituency Name" "Region" "Election
Year" "Electorate" "Votes" "AC" "AD" "AGS" "APNI" "APP" "AWL" "AWP"
"BB" "BCP" "Bean" "Best" "BGPV" "BIB" "BIC" "Blue" "BNP" "BP Elvis"
"C28" "Cam Soc" "CG" "Ch M" "Ch P" "CIP" "CITY" "CNPG" "Comm" "Comm
L" "Con" "Cor D" "CPA" "CSP" "CTDP" "CURE" "D Lab" "D Nat" "DDP"
"DUP" "ED" "EIP" "EPA" "FAWG" "FDP" "FFR" "Grn" "GSOT" "Hum" "ICHC"
"IEAC" "IFED" "ILEU" "Impact" "Ind1" "Ind2" "Ind3" "Ind4" "Ind5" "IPT"
"ISGB" "ISQM" "IUK" "IVH" "IZB" "JAC" "Joy" "JP" "Lab" "Land" "LD"
"Lib" "Libert" "LIND" "LLPB" "LTT" "MACI" "MCP" "MEDI" "MEP" "MIF"
"MK" "MPEA" "MRLP" "MRP" "Nat Lib" "NCDV" "ND" "New" "NF" "NFP" "NICF"
"Nobody" "NSPS" "PBP" "PC" "Pirate" "PNDP" "Poet" "PPBF" "PPE" "PPNV"
"Reform" "Respect" "Rest" "RRG" "RTBP" "SACL" "Sci" "SDLP" "SEP" "SF"
"SIG" "SJP" "SKGP" "SMA" "SMRA" "SNP" "Soc" "Soc Alt" "Soc Dem" "Soc
Lab" "South" "Speaker" "SSP" "TF" "TOC" "Trust" "TUSC" "TUV" "UCUNF"
"UKIP" "UPS" "UV" "VCCA" "Vote" "Wessex Reg" "WRP" "You" "Youth"
"YRDPL"]
This is a very wide dataset. The first six columns in the data file are described as
follows; subsequent columns break the number of votes down by party:
Press Association Reference: This is a number identifying the constituency
Constituency Name: This is the common name given to the voting district
Region: This is the geographic region of the UK in which the constituency lies
Election Year: This is the year in which the election was held
Electorate: This is the total number of people eligible to vote in the constituency
Votes: This is the total number of votes cast
Whenever we're confronted with new data, it's important to take time to understand
it. In the absence of detailed data definitions, one way we could do this is to begin by
validating our assumptions about the data. For example, we expect that this dataset
contains information about the 2010 election so let's review the contents of the
Election Year column.
Incanter provides the i/$ function (i, as before, signifying the incanter.core namespace) for selecting columns from a dataset. We'll encounter the function regularly throughout this chapter; it's Incanter's primary way of selecting columns from a variety of data representations and it provides several different arities. For now, we'll be providing just the name of the column we'd like to extract and the dataset from which to extract it:
(defn ex-1-2 []
  (i/$ "Election Year" (load-data :uk)))

;; (2010.0 2010.0 2010.0 2010.0 2010.0 ... 2010.0 2010.0 nil)
The years are returned as a single sequence of values. The output may be hard to
interpret since the dataset contains so many rows. As we'd like to know which
unique values the column contains, we can use the Clojure core function distinct.
One of the advantages of using Incanter is that its useful data manipulation functions
augment those that Clojure already provides as shown in the following example:
(defn ex-1-3 []
  (->> (load-data :uk)
       (i/$ "Election Year")
       (distinct)))

;; (2010.0 nil)
The year 2010 goes a long way to confirming our expectations that this data is
from 2010. The nil value is unexpected, though, and may indicate a problem
with our data.
We don't yet know how many nils exist in the dataset and determining this could help us decide what to do next. A simple way of counting values such as this is to use the core library function frequencies, which returns a map of values to counts:
(defn ex-1-4 []
  (->> (load-data :uk)
       (i/$ "Election Year")
       (frequencies)))

;; {2010.0 650, nil 1}
It wouldn't take us long to confirm that in 2010 the UK had 650 electoral districts,
known as constituencies. Domain knowledge such as this is invaluable when
sanity-checking new data. Thus, it's highly probable that the nil value is
extraneous and can be removed. We'll see how to do this in the next section.
Data scrubbing
It is a commonly repeated statistic that at least 80 percent of a data scientist's work is
data scrubbing. This is the process of detecting potentially corrupt or incorrect data
and either correcting or filtering it out.
Data scrubbing is one of the most important (and time-consuming)
aspects of working with data. It's a key step to ensuring that
subsequent analysis is performed on data that is valid, accurate,
and consistent.
The nil value at the end of the election year column may indicate dirty data that
ought to be removed. We've already seen that filtering columns of data can be
accomplished with Incanter's i/$ function. For filtering rows of data we can use
Incanter's i/query-dataset function.
We let Incanter know which rows we'd like it to filter by passing a Clojure map of
column names and predicates. Only rows for which all predicates return true will
be retained. For example, to select only the nil values from our dataset:
(-> (load-data :uk)
    (i/query-dataset {"Election Year" {:$eq nil}}))
If you know SQL, you'll notice this is very similar to a WHERE clause. In fact, Incanter
also provides the i/$where function, an alias to i/query-dataset that reverses the
order of the arguments.
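To illustrate, the following two expressions are equivalent; the argument order of i/$where suits Clojure's threading macros:

;; i/query-dataset takes the dataset first...
(i/query-dataset (load-data :uk) {"Election Year" {:$eq nil}})

;; ...while i/$where takes the query first.
(i/$where {"Election Year" {:$eq nil}} (load-data :uk))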
The query is a map of column names to predicates and each predicate is itself a map
of operator to operand. Complex queries can be constructed by specifying multiple
columns and multiple operators together. Query operators include:
:$gt greater than
:$lt less than
:$gte greater than or equal to
:$lte less than or equal to
:$eq equal to
:$ne not equal to
:$in a member of a given set
:$nin not a member of a given set
:$fn a predicate function that should return a true response for rows to keep
If none of the built-in operators suffice, the last operator provides the ability to pass a
custom function instead.
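For example, a compound query might combine a comparison operator with a custom predicate (the threshold below is invented for illustration):

;; Keep rows with an electorate over 70,000 whose region is a string.
(->> (load-data :uk-scrubbed)
     (i/$where {"Electorate" {:$gt 70000}
                "Region"     {:$fn string?}}))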
We'll continue to use Clojure's thread-last macro to make the code intention a little
clearer, and return the row as a map of keys and values using the i/to-map function:
(defn ex-1-5 []
  (->> (load-data :uk)
       (i/$where {"Election Year" {:$eq nil}})
       (i/to-map)))

;; {:ILEU nil, :TUSC nil, :Vote nil ... :IVH nil, :FFR nil}
Looking at the results carefully, it's apparent that all (but one) of the columns in this row are nil. In fact, a bit of further exploration confirms that the one non-nil value is a summary total and ought to be removed from the data. We can remove the problematic row by updating the predicate map to use the :$ne operator, returning only rows where the election year is not equal to nil:
(->> (load-data :uk)
     (i/$where {"Election Year" {:$ne nil}}))
The preceding function is one we'll almost always want to make sure we call in
advance of using the data. One way of doing this is to add another implementation
of our load-data multimethod, which also includes this filtering step:
(defmethod load-data :uk-scrubbed [_]
  (->> (load-data :uk)
       (i/$where {"Election Year" {:$ne nil}})))
Now, with any code we write, we can choose whether to refer to the :uk or :uk-scrubbed datasets.
By always loading the source file and performing our scrubbing on top, we're preserving an audit trail of the transformations we've applied. This makes it clear to us (and future readers of our code) what adjustments have been made to the source. It also means that, should we need to re-run our analysis with new source data, we may be able to just load the new file in place of the existing file.
Descriptive statistics
Descriptive statistics are numbers that are used to summarize and describe data.
In the next chapter, we'll turn our attention to a more sophisticated analysis,
the so-called inferential statistics, but for now we'll limit ourselves to simply
describing what we can observe about the data contained in the file.
To demonstrate what we mean, let's look at the Electorate column of the data.
This column lists the total number of registered voters in each constituency:
(defn ex-1-6 []
  (->> (load-data :uk-scrubbed)
       (i/$ "Electorate")
       (count)))

;; 650
We've filtered the nil field from the dataset; the preceding code should return a list
of 650 numbers corresponding to the electorate in each of the UK constituencies.
Descriptive statistics, also called summary statistics, are ways of measuring
attributes of sequences of numbers. They help characterize the sequence and can
act as a guide for further analysis. Let's start by calculating the two most basic
statistics that we can from a sequence of numbers: its mean and its variance.
The mean
The most common way of measuring the average of a data set is with the mean.
It's actually one of several ways of measuring the central tendency of the data. The
mean, or more precisely, the arithmetic mean, is a straightforward calculation (simply add up the values and divide by the count) but in spite of this it has a somewhat intimidating mathematical notation:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Finally, the fraction 1/n appears just before the sigma, indicating that the entire expression should be multiplied by 1 divided by n (also called the reciprocal of n). This can be simplified to just dividing by n.
| Name           | Mathematical symbol   | Clojure equivalent |
|----------------+-----------------------+--------------------|
| n              | $n$                   | (count xs)         |
| Sigma notation | $\sum_{i=1}^{n} x_i$  | (reduce + xs)      |
| Pi notation    | $\prod_{i=1}^{n} x_i$ | (reduce * xs)      |
Putting this all together, we get "add up the elements in the sequence from the first to
the last and divide by the count". In Clojure, this can be written as:
(defn mean [xs]
  (/ (reduce + xs)
     (count xs)))
Where xs stands for "the sequence of xs". We can use our new mean function to
calculate the mean of the UK electorate:
(defn ex-1-7 []
  (->> (load-data :uk-scrubbed)
       (i/$ "Electorate")
       (mean)))

;; 70149.94
The median
The median is another common descriptive statistic for measuring the central
tendency of a sequence. If you ordered all the data from lowest to highest, the
median is the middle value. If there is an even number of data points in the
sequence, the median is usually defined as the mean of the middle two values.
The median is often represented in formulae by $\tilde{x}$, pronounced x-tilde. It's one
of the deficiencies of mathematical notation that there's no particularly standard
way of expressing the formula for the median value, but nonetheless it's fairly
straightforward in Clojure:
(defn median [xs]
  (let [n   (count xs)
        mid (int (/ n 2))]
    (if (odd? n)
      (nth (sort xs) mid)
      (->> (sort xs)
           (drop (dec mid))
           (take 2)
           (mean)))))
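A quick check on a small sequence (the sequence is illustrative):

(median [10 11 15 21 22.5 28 30])
;; => 21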
Incanter also has a function for calculating the median value as s/median.
Variance
The mean and the median are two alternative ways of describing the middle value
of a sequence, but on their own they tell you very little about the values contained
within it. For example, if we know the mean of a sequence of ninety-nine values is
50, we can still say very little about what values the sequence contains.
It may contain all the integers from one to ninety-nine, or forty-nine zeros and
fifty ninety-nines. Maybe it contains negative one ninety-eight times and a single
five-thousand and forty-eight. Or perhaps all the values are exactly fifty.
The variance of a sequence is its "spread" about the mean, and each of the preceding
examples would have a different variance. In mathematical notation, the variance is
expressed as:
$$s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
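The variance implementation itself falls outside this extract; a minimal sketch consistent with the surrounding text (squaring each deviation from the mean with i/sq, then taking the mean of those squared deviations) might look like this:

(defn variance [xs]
  (let [x-bar (mean xs)
        square-deviation (fn [x]
                           (i/sq (- x x-bar)))]
    (mean (map square-deviation xs))))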
We're using Incanter's i/sq function to calculate the square of our expression.
Since we've squared the deviation before taking the mean, the units of variance are
also squared, so the units of the variance of the UK electorate are "people squared".
This is somewhat unnatural to reason about. We can make the units more natural by
taking the square root of the variance so the units are "people" again, and the result is
called the standard deviation:
(defn standard-deviation [xs]
  (i/sqrt (variance xs)))

(defn ex-1-9 []
  (->> (load-data :uk-scrubbed)
       (i/$ "Electorate")
       (standard-deviation)))

;; 7672.77
Quantiles
The median is one way to calculate the middle value from a list, and the variance
provides a way to measure the spread of the data about this midpoint. If the entire
spread of data were represented on a scale of zero to one, the median would be the
value at 0.5.
For example, consider the following sequence of numbers:
[10 11 15 21 22.5 28 30]
There are seven numbers in the sequence, so the median is the fourth, or 21. This
is also referred to as the 0.5 quantile. We can get a richer picture of a sequence of
numbers by looking at the 0.0, 0.25, 0.5, 0.75, and 1.0 quantiles. Taken together, these
numbers will not only show the median, but will also summarize the range of the
data and how the numbers are distributed within it. They're sometimes referred
to as the five-number summary.
One way to calculate the five-number summary for the UK electorate data is shown
as follows:
(defn quantile [q xs]
  (let [n (dec (count xs))
        i (-> (* n q)
              (+ 1/2)
              (int))]
    (nth (sort xs) i)))
(defn ex-1-10 []
  (let [xs (->> (load-data :uk-scrubbed)
                (i/$ "Electorate"))
        f  (fn [q]
             (quantile q xs))]
    (map f [0 1/4 1/2 3/4 1])))

;; (21780.0 66219.0 70991.0 75115.0 109922.0)
Quantiles can also be calculated in Incanter directly with the s/quantile function.
A sequence of desired quantiles is passed as the keyword argument :probs.
When quantiles split the range into four equal parts, as earlier, they are called quartiles. The difference between the lower and upper quartile is referred to as the
interquartile range, also often abbreviated to just IQR. Like the variance about the
mean, the IQR gives a measure of the spread of the data about the median.
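As an illustration (this helper isn't from the book's example code), the IQR falls directly out of the quantile function we defined earlier:

(defn iqr [xs]
  (- (quantile 3/4 xs)
     (quantile 1/4 xs)))

Using the quantiles computed in ex-1-10, this would be 75115.0 - 66219.0 = 8896.0 for the UK electorate data.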
Binning data
To develop an intuition for what these various calculations of variance are
measuring, we can employ a technique called binning. Where data is continuous,
using frequencies (as we did with the election data to count the nils) is not practical
since no two values may be the same. However, it's possible to get a broad sense of
the structure of the data by grouping the data into discrete intervals.
The process of binning is to divide the range of values into a number of consecutive,
equally-sized, smaller bins. Each value in the original series falls into exactly one
bin. By counting the number of points falling into each bin, we can get a sense of the
spread of the data:
The preceding illustration shows fifteen values of x split into five equally-sized bins.
By counting the number of points falling into each bin we can clearly see that most
points fall in the middle bin, with fewer points falling into the bins towards the
edges. We can achieve the same in Clojure with the following bin function:
(defn bin [n-bins xs]
  (let [min-x   (apply min xs)
        max-x   (apply max xs)
        range-x (- max-x min-x)
        bin-fn  (fn [x]
                  (-> x
                      (- min-x)
                      (/ range-x)
                      (* n-bins)
                      (int)
                      (min (dec n-bins))))]
    (map bin-fn xs)))
For example, we can bin the range 0-14 into 5 bins like so:
(bin 5 (range 15))
;; (0 0 0 1 1 1 2 2 2 3 3 3 4 4 4)
Once we've binned the values we can then use the frequencies function once again
to count the number of points in each bin. In the following code, we use the function
to split the UK electorate data into five bins:
(defn ex-1-11 []
  (->> (load-data :uk-scrubbed)
       (i/$ "Electorate")
       (bin 5)
       (frequencies)))

;; {1 26, 2 450, 3 171, 4 1, 0 2}
The count of points in the extremal bins (0 and 4) is much lower than the bins in the middle; the counts seem to rise up towards the median and then down again. In the next section, we'll visualize the shape of these counts.
Histograms
A histogram is one way to visualize the distribution of a single sequence of values.
Histograms simply take a continuous distribution, bin it, and plot the frequencies
of points falling into each bin as a bar. The height of each bar in the histogram
represents how many points in the data are contained in that bin.
We've already seen how to bin data ourselves, but incanter.charts contains a
histogram function that will bin the data and visualize it as a histogram in two
steps. We require incanter.charts as c in this chapter (and throughout the book).
(defn ex-1-12 []
  (->> (load-data :uk-scrubbed)
       (i/$ "Electorate")
       (c/histogram)
       (i/view)))
We can configure the number of bins the data is segmented into by passing the keyword argument :nbins as the second parameter to the histogram function. The following example does this via a small uk-electorate helper, assumed to be defined in the accompanying example code, which returns the Electorate column of the scrubbed dataset:

(defn ex-1-13 []
  (-> (uk-electorate)
      (c/histogram :nbins 200)
      (i/view)))
The preceding graph shows a single, high peak but expresses the shape of the data
quite crudely. The following graph shows fine detail, but the volume of the bars
obscures the shape of the distribution, particularly in the tails:
Choosing the number of bins to represent your data is a fine balance: too few bins and the shape of the data will only be crudely represented; too many and noisy features may obscure the underlying structure.
(defn ex-1-14 []
  (-> (i/$ "Electorate" (load-data :uk-scrubbed))
      (c/histogram :x-label "UK electorate"
                   :nbins 20)
      (i/view)))
This final chart containing 20 bins seems to be the best representation for this data
so far.
Along with the mean and the median, the mode is another way of measuring the average value of a sequence: it's defined as the most frequently occurring value in the sequence. The mode is strictly only defined for sequences with at least one
duplicated value; for many distributions, this is not the case and the mode is
undefined. Nonetheless, the peak of the histogram is often referred to as the mode,
since it corresponds to the most popular bin.
We can clearly see that the distribution is quite symmetrical about the mode, with
values falling sharply either side along shallow tails. This is data following an
approximately normal distribution.
Each bar of the histogram is approximately the same height, corresponding to the
equal probability of generating a number that falls into each bin. The bars aren't
exactly the same height since the uniform distribution describes the theoretical
output that our random sampling can't mirror precisely. Over the next several
chapters, we'll learn ways to precisely quantify the difference between theory and
practice to determine whether the differences are large enough to be concerned with.
In this case, they are not.
If instead we generate a histogram of the means of sequences of numbers, we'll end
up with a distribution that looks rather different.
(defn ex-1-16 []
  (let [xs (->> (repeatedly rand)
                (partition 10)
                (map mean)
                (take 10000))]
    (-> (c/histogram xs
                     :x-label "Distribution of means"
                     :nbins 20)
        (i/view))))
The preceding code will provide an output similar to the following histogram:
Although it's not impossible for the mean to be close to zero or one, it's exceedingly
improbable and grows less probable as both the number of averaged numbers and
the number of sampled averages grow. In fact, the output is exceedingly close to the
normal distribution.
This outcome, where the average effect of many small random fluctuations leads to the normal distribution, is called the central limit theorem, sometimes abbreviated to CLT, and goes a long way towards explaining why the normal distribution occurs so frequently in natural phenomena.
The central limit theorem wasn't named until the 20th century, although the effect
had been documented as early as 1733 by the French mathematician Abraham de
Moivre, who used the normal distribution to approximate the number of heads
resulting from tosses of a fair coin. The outcome of coin tosses is best modeled with
the binomial distribution, which we will introduce in Chapter 4, Classification. While
the central limit theorem provides a way to generate samples from an approximate
normal distribution, Incanter's distributions namespace provides functions for
generating samples efficiently from a variety of distributions, including the normal:
(defn ex-1-17 []
  (let [distribution (d/normal-distribution)
        xs (->> (repeatedly #(d/draw distribution))
                (take 10000))]
    (-> (c/histogram xs
                     :x-label "Normal distribution"
                     :nbins 20)
        (i/view))))
The d/draw function will return one sample from the supplied distribution.
The default mean and standard deviation from d/normal-distribution
are zero and one respectively.
Poincaré's baker
There's a story that, while almost certainly apocryphal, allows us to look in more
detail at the way in which the central limit theorem allows us to reason about how
distributions are formed. It concerns the celebrated nineteenth century French
polymath Henri Poincaré who, so the story goes, weighed his bread every day for a year.
Baking was a regulated profession, and Poincaré discovered that, while the weights of the bread followed a normal distribution, the peak was at 950g rather than the advertised 1kg. He reported his baker to the authorities and so the baker was fined.
The next year, Poincaré continued to weigh his bread from the same baker. He found the mean value was now 1kg, but that the distribution was no longer symmetrical around the mean. The distribution was skewed to the right, consistent with the baker giving Poincaré only the heaviest of his loaves. Poincaré reported his baker to the authorities once more and his baker was fined a second time.
Whether the story is true or not needn't concern us here; it's provided simply to illustrate a key point: the distribution of a sequence of numbers can tell us something important about the process that generated it.
Generating distributions
To develop our intuition about the normal distribution and variance, let's model an
honest and dishonest baker using Incanter's distribution functions. We can model the
honest baker as a normal distribution with a mean of 1,000, corresponding to a fair
loaf of 1kg. We'll assume a variance in the baking process that results in a standard
deviation of 30g.
(defn honest-baker [mean sd]
  (let [distribution (d/normal-distribution mean sd)]
    (repeatedly #(d/draw distribution))))

(defn ex-1-18 []
  (-> (take 10000 (honest-baker 1000 30))
      (c/histogram :x-label "Honest baker"
                   :nbins 25)
      (i/view)))
The preceding code will provide an output similar to the following histogram:
Now, let's model a baker who sells only the heaviest of his loaves. We partition the
sequence into groups of thirteen (a "baker's dozen") and pick the maximum value:
(defn dishonest-baker [mean sd]
  (let [distribution (d/normal-distribution mean sd)]
    (->> (repeatedly #(d/draw distribution))
         (partition 13)
         (map (partial apply max)))))

(defn ex-1-19 []
  (-> (take 10000 (dishonest-baker 950 30))
      (c/histogram :x-label "Dishonest baker"
                   :nbins 25)
      (i/view)))
It should be apparent that this histogram does not look quite like the others
we have seen. The mean value is still 1kg, but the spread of values around the
mean is no longer symmetrical. We say that this histogram indicates a skewed
normal distribution.
Skewness
Skewness is the name for the asymmetry of a distribution about its mode. Negative
skew, or left skew, indicates that the area under the graph is larger on the left side
of the mode. Positive skew, or right skew, indicates that the area under the graph is
larger on the right side of the mode.
Incanter has a built-in function for measuring skewness in the stats namespace:
(defn ex-1-20 []
  (let [weights (take 10000 (dishonest-baker 950 30))]
    {:mean (mean weights)
     :median (median weights)
     :skewness (s/skewness weights)}))
The preceding example shows that the skewness of the dishonest baker's output is
about 0.4, quantifying the skew evident in the histogram.
Quantile-quantile plots
We encountered quantiles as a means of describing the distribution of data earlier
in the chapter. Recall that the quantile function accepts a number between zero
and one and returns the value of the sequence at that point. 0.5 corresponds to the
median value.
Plotting the quantiles of your data against the quantiles of the normal distribution
allows us to see how our measured data compares against the theoretical
distribution. Plots such as this are called Q-Q plots and they provide a quick and
intuitive way of determining normality. For data corresponding closely to the normal
distribution, the Q-Q Plot is a straight line. Deviations from a straight line indicate
the manner in which the data deviates from the idealized normal distribution.
Let's plot Q-Q plots for both our honest and dishonest bakers side-by-side.
Incanter's c/qq-plot function accepts the list of data points and generates
a scatter chart of the sample quantiles plotted against the quantiles from the
theoretical normal distribution:
(defn ex-1-21 []
  (->> (honest-baker 1000 30)
       (take 10000)
       (c/qq-plot)
       (i/view))
  (->> (dishonest-baker 950 30)
       (take 10000)
       (c/qq-plot)
       (i/view)))
The Q-Q plot for the honest baker is shown earlier. The dishonest baker's plot is next:
The fact that the line is curved indicates that the data is positively skewed; a curve in
the other direction would indicate negative skew. In fact, Q-Q plots make it easier to
discern a wide variety of deviations from the standard normal distribution, as shown
in the following diagram:
Q-Q plots compare the distribution of the honest and dishonest baker against
the theoretical normal distribution. In the next section, we'll compare several
alternative ways of visually comparing two (or more) measured sequences
of values with each other.
Comparative visualizations
Q-Q plots provide a great way to compare a measured, empirical distribution to
a theoretical normal distribution. If we'd like to compare two or more empirical
distributions with each other, we can't use Incanter's Q-Q plot charts. We have a
variety of other options, though, as shown in the next two sections.
Box plots
Box plots, or box and whisker plots, are a way to visualize the descriptive statistics
of median and variance visually. We can generate them using the following code:
(defn ex-1-22 []
  (-> (c/box-plot (->> (honest-baker 1000 30)
                       (take 10000))
                  :legend true
                  :y-label "Loaf weight (g)"
                  :series-label "Honest baker")
      (c/add-box-plot (->> (dishonest-baker 950 30)
                           (take 10000))
                      :series-label "Dishonest baker")
      (i/view)))
The boxes in the center of the plot represent the interquartile range. The median
is the line across the middle of the box, and the mean is the large black dot. For the
honest baker, the median passes through the centre of the circle, indicating the mean
and median are about the same. For the dishonest baker, the mean is offset from the
median, indicating a skew.
The whiskers indicate the range of the data and outliers are represented by hollow
circles. In just one chart, we're more clearly able to see the difference between the two
distributions than we were on either the histograms or the Q-Q plots independently.
For a fair die, the probability I'll roll a five or lower is 5/6. Conversely, the probability I'll roll a one is only 1/6. Three or lower corresponds to even odds: a probability of 50 percent.
The CDF of die rolls follows the same pattern as all CDFsfor numbers at the lower
end of the range, the CDF is close to zero, corresponding to a low probability of
selecting numbers in this range or below. At the high end of the range, the CDF is
close to one, since most values drawn from the sequence will be lower.
The CDF and quantiles are closely related to each otherthe
CDF is the inverse of the quantile function. If the 0.5 quantile
corresponds to a value of 1,000, then the CDF for 1,000 is 0.5.
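To illustrate the relationship (a sketch, not from the book's example code), the empirical CDF of a sample evaluated at the sample's median should be approximately 0.5:

(let [xs   (take 10000 (honest-baker 1000 30))
      ecdf (s/cdf-empirical xs)]
  (ecdf (median xs)))
;; => approximately 0.5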
Let's plot the CDF of both the honest and dishonest bakers side by side. We can use Incanter's c/xy-plot for visualizing the CDF by plotting the source data (the samples from our honest and dishonest bakers) against the probabilities calculated from the empirical CDF. The c/xy-plot function expects the x values and the y values to be supplied as two separate sequences of values.
To plot both distributions on the same chart, we need to be able to provide multiple
series to our xy-plot. Incanter offers functions for many of its charts to add additional
series. In the case of an xy-plot, we can use the function c/add-lines, which accepts
the chart as the first argument, and the x series and the y series of data as the next two
arguments respectively. You can also pass an optional series label. We do this in the
following code so we can tell the two series apart on the finished chart:
(defn ex-1-23 []
  (let [sample-honest    (->> (honest-baker 1000 30)
                              (take 10000))
        sample-dishonest (->> (dishonest-baker 950 30)
                              (take 10000))
        ecdf-honest      (s/cdf-empirical sample-honest)
        ecdf-dishonest   (s/cdf-empirical sample-dishonest)]
    (-> (c/xy-plot sample-honest (map ecdf-honest sample-honest)
                   :x-label "Loaf Weight"
                   :y-label "Probability"
                   :legend true
                   :series-label "Honest baker")
        (c/add-lines sample-dishonest
                     (map ecdf-dishonest sample-dishonest)
                     :series-label "Dishonest baker")
        (i/view))))
Although it looks very different, this chart shows essentially the same information
as the box and whisker plot. We can see that the two lines cross at approximately the
median of 0.5, corresponding to 1,000g. The dishonest line is truncated at the lower
tail and longer on the upper tail, corresponding to a skewed distribution.
Datasets don't have to be contrived to reveal valuable insights when graphed. Take
for example this histogram of the marks earned by candidates in Poland's national
Matura exam in 2013:
Benford showed that the law applied to data as diverse as electricity bills, street
addresses, stock prices, population numbers, death rates, and lengths of rivers. The
law is so consistent for data sets covering large ranges of values that deviation from
it has been accepted as evidence in trials for financial fraud.
You can see from the proximity of the two lines to each other how closely this data
resembles normality, although a slight skew is evident. The skew is in the opposite
direction to the dishonest baker CDF we plotted previously, so our electorate data is
slightly skewed to the left.
As we're comparing our distribution against the theoretical normal distribution,
let's use a Q-Q plot, which will do this by default:
(defn ex-1-25 []
  (->> (load-data :uk-scrubbed)
       (i/$ "Electorate")
       (c/qq-plot)
       (i/view)))
The following Q-Q plot does an even better job of highlighting the left skew evident
in the data:
As we expected, the curve bows in the opposite direction to the dishonest baker
Q-Q plot earlier in the chapter. This indicates that there is a greater number of
constituencies that are smaller than we would expect if the data were more
closely normally distributed.
Adding columns
So far in this chapter, we've reduced the size of our dataset by filtering both rows and columns. Often we'll want to add columns to a dataset instead, and Incanter supports this in several ways.
Firstly, we can choose whether to replace an existing column within the dataset or
append an additional column to the dataset. Secondly, we can choose whether to
supply the new column values to replace the existing column values directly, or
whether to calculate the new values by applying a function to each row of the data.
The following chart lists our options and the corresponding Incanter function to use:
|              | By providing a sequence | By applying a function |
|--------------+-------------------------+------------------------|
| Replace data | i/replace-column        | i/transform-column     |
| Append data  | i/add-column            | i/add-derived-column   |
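As a brief illustration of the difference between the two appending functions (the :year column is invented; :turnout appears for real later in the chapter):

(def data (load-data :uk-scrubbed))

;; Appending a column by providing a sequence of values directly:
(i/add-column :year (repeat 650 2010.0) data)

;; Appending a column derived by applying a function to existing columns:
(i/add-derived-column :turnout [:Votes :Electorate] / data)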
Inspecting the types of values in the "LD" column reveals that, while most are numbers, some are strings:

(defn ex-1-26 []
  (->> (load-data :uk-scrubbed)
       (i/$ "LD")
       (map type)
       (frequencies)))

;; {java.lang.Double 631, java.lang.String 19}
Let's use the i/$where function we encountered earlier in the chapter to inspect just
these rows:
(defn ex-1-27 []
  (->> (load-data :uk-scrubbed)
       (i/$where #(not-any? number? [(% "Con") (% "LD")]))
       (i/$ [:Region :Electorate :Con :LD])))
;; |           Region | Electorate | Con | LD |
;; |------------------+------------+-----+----|
;; | Northern Ireland |    60204.0 |     |    |
;; | Northern Ireland |    73338.0 |     |    |
;; | Northern Ireland |    63054.0 |     |    |
;; ...
This bit of exploration should be enough to convince us that the reason for these
fields being blank is that candidates were not put forward in the corresponding
constituencies. Should they be filtered out or assumed to be zero? This is an
interesting question. Let's filter them out, since it wasn't even possible for voters
to choose a Liberal Democrat or Conservative candidate in these constituencies. If
instead we assumed a zero, we would artificially lower the mean number of people
who, given the choice, voted for either of these parties.
Now that we know how to filter the problematic rows, let's add the derived
columns for the victor and the victor's share of the vote, along with election
turnout. We filter the rows to show only those where both a Conservative
and Liberal Democrat candidate were put forward:
(defmethod load-data :uk-victors [_]
  (->> (load-data :uk-scrubbed)
       (i/$where {:Con {:$fn number?} :LD {:$fn number?}})
       (i/add-derived-column :victors [:Con :LD] +)
       (i/add-derived-column :victors-share [:victors :Votes] /)
       (i/add-derived-column :turnout [:Votes :Electorate] /)))
Referring back to the diagram of various Q-Q plot shapes earlier in the chapter
reveals that the victor's share of the vote has "light tails" compared to the normal
distribution. This means that more of the data is closer to the mean than we might
expect from truly normally distributed data.
Russia's data is available in two data files. Fortunately, the columns are the same in each, so the two files can be concatenated end-to-end. Incanter's i/conj-rows function exists for precisely this purpose:

(defmethod load-data :ru [_]
  (i/conj-rows (-> (io/resource "Russia2011_1of2.xls")
                   (str)
                   (xls/read-xls))
               (-> (io/resource "Russia2011_2of2.xls")
                   (str)
                   (xls/read-xls))))
The column names in the Russia dataset are very descriptive, but perhaps longer
than we want to type out. Also, it would be convenient if columns that represent
the same attributes as we've already seen in the UK election data (the victor's
share and turnout for example) were labeled the same in both datasets. Let's
rename them accordingly.
Along with a dataset, the i/rename-cols function expects to receive a map whose keys are the current column names with values corresponding to the desired new column name. If we combine this with the i/add-derived-column function we have already seen, we arrive at the following:
(defmethod load-data :ru-victors [_]
  (->> (load-data :ru)
       (i/rename-cols
        {"Number of voters included in voters list" :electorate
         "Number of valid ballots" :valid-ballots
         "United Russia" :victors})
       (i/add-derived-column :victors-share
                             [:victors :valid-ballots] i/safe-div)
       (i/add-derived-column :turnout
                             [:valid-ballots :electorate] /)))
The i/safe-div function is identical to / but will protect against division by zero.
Rather than raising an exception, it returns the value Infinity, which will be
ignored by Incanter's statistical and charting functions.
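For example (illustrative):

(i/safe-div 1 0)
;; => Infinity

(i/safe-div 4 2)
;; => 2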
This histogram doesn't look at all like the classic bell-shaped curves we've seen so far. There's a pronounced positive skew, and the voter turnout actually increases from 80 percent towards 100 percent, the opposite of what we would expect from normally-distributed data.
Given the expectations set by the UK data and by the central limit theorem, this is a
curious result. Let's visualize the data with a Q-Q plot instead:
(defn ex-1-31 []
  (->> (load-data :ru-victors)
       (i/$ :turnout)
       (c/qq-plot)
       (i/view)))
This Q-Q plot is neither a straight line nor a particularly S-shaped curve. In fact, the
Q-Q plot suggests a light tail at the top end of the distribution and a heavy tail at the
bottom. This is almost the opposite of what we see on the histogram, which clearly
indicates an extremely heavy right tail.
In fact, it's precisely because the tail is so heavy that the Q-Q plot is misleading: the
density of points between 0.5 and 1.0 on the histogram suggests that the peak should
be around 0.7 with a right tail continuing beyond 1.0. It's clearly illogical that we
would have a percentage exceeding 100 percent but the Q-Q plot doesn't account
for this (it doesn't know we're plotting percentages), so the sudden absence of data
beyond 1.0 is interpreted as a clipped right tail.
Given the central limit theorem, and what we've observed with the UK election data,
the tendency towards 100 percent voter turnout is curious. Let's compare the UK and
Russia datasets side-by-side.
Comparative visualizations
Let's suppose we'd like to compare the distributions of electorate data between the
UK and Russia. We've already seen in this chapter how to make use of CDFs and
box plots, so let's investigate an alternative that's similar to a histogram.
We could try and plot both datasets on a histogram but this would be a bad idea.
We wouldn't be able to interpret the results for two reasons:
The sizes of the voting districts, and therefore the means of the distributions, are very different
The number of voting districts differs greatly too, so the overall heights of the histograms would not be comparable
An alternative to the histogram that addresses both of these issues is the probability
mass function (PMF).
There are innumerable ways to normalize data, but one of the most basic is to ensure that each series is in the range zero to one. None of our values decrease below zero, so we can accomplish this normalization by simply dividing each bin's count by the total count, so the resulting probabilities sum to one:
(defn as-pmf [bins]
  (let [histogram (frequencies bins)
        total     (reduce + (vals histogram))]
    (->> histogram
         (map (fn [[k v]]
                [k (/ v total)]))
         (into {}))))
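A quick illustration on a small sequence of bin indices (invented for the example):

(as-pmf [0 0 1 1 1 2])
;; => {0 1/3, 1 1/2, 2 1/6}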
With the preceding function in place, we can normalize both the UK and Russia data
and plot it side by side on the same axes:
(defn ex-1-32 []
  (let [n-bins 40
        uk (->> (load-data :uk-victors)
                (i/$ :turnout)
                (bin n-bins)
                (as-pmf))
        ru (->> (load-data :ru-victors)
                (i/$ :turnout)
                (bin n-bins)
                (as-pmf))]
    (-> (c/xy-plot (keys uk) (vals uk)
                   :series-label "UK"
                   :legend true
                   :x-label "Turnout Bins"
                   :y-label "Probability")
        (c/add-lines (keys ru) (vals ru)
                     :series-label "Russia")
        (i/view))))
After normalization, the two distributions can be compared more readily. It's clearly apparent how, in spite of having a lower mean turnout than the UK, the Russia election had a massive uplift towards 100-percent turnout. Insofar as it represents the
combined effect of many independent choices, we would expect election results to
conform to the central limit theorem and be approximately normally distributed. In
fact, election results from around the world generally conform to this expectation.
Although not quite as high as the modal peak in the center of the distribution (corresponding to approximately 50 percent turnout), the Russian election data presents a very anomalous result. Researcher Peter Klimek and his colleagues at
the Medical University of Vienna have gone as far as to suggest that this is a clear
signature of ballot-rigging.
Scatter plots
We've observed the curious results for the turnout at the Russian election and
identified that it has a different signature from the UK election. Next, let's see how
the proportion of votes for the winning candidate is related to the turnout. After
all, if the unexpectedly high turnout really is a sign of foul play by the incumbent
government, we'd anticipate that they'll be voting for themselves rather than anyone
else. Thus we'd expect most, if not all, of these additional votes to be for the ultimate
election winners.
Chapter 3, Correlation, will cover the statistics behind correlating two variables
in much more detail, but for now it would be interesting simply to visualize the
relationship between turnout and the proportion of votes for the winning party.
The final visualization we'll introduce this chapter is the scatter plot. Scatter plots
are very useful for visualizing correlations between two variables: where a linear
correlation exists, it will be evident as a diagonal tendency in the scatter plot.
Incanter contains the c/scatter-plot function for this kind of chart with
arguments the same as for the c/xy-plot function.
(defn ex-1-33 []
  (let [data (load-data :uk-victors)]
    (-> (c/scatter-plot (i/$ :turnout data)
                        (i/$ :victors-share data)
                        :x-label "Turnout"
                        :y-label "Victor's Share")
        (i/view))))
Although the points are arranged broadly in a fuzzy ellipse, a diagonal tendency towards the top right of the scatter plot is clearly apparent. This indicates an interesting result: turnout is correlated with the proportion of votes for the ultimate
election winners. We might have expected the reverse: voter complacency leading to
a lower turnout where there was a clear victor in the running.
As mentioned earlier, the UK election of 2010 was far from ordinary,
resulting in a hung parliament and a coalition government. In fact,
the "winners" in this case represent two parties who had, up until
election day, been opponents. A vote for either counts as a vote for
the winners.
Next, we'll create the same scatter plot for the Russia election:
(defn ex-1-34 []
  (let [data (load-data :ru-victors)]
    (-> (c/scatter-plot (i/$ :turnout data)
                        (i/$ :victors-share data)
                        :x-label "Turnout"
                        :y-label "Victor's Share")
        (i/view))))
Although a diagonal tendency in the Russia data is clearly evident from the outline
of the points, the sheer volume of data obscures the internal structure. In the last
section of this chapter, we'll show a simple technique for extracting structure from a
chart such as the earlier one using opacity.
Scatter transparency
In situations such as the preceding one where a scatter plot is overwhelmed by the
volume of points, transparency can help to visualize the structure of the data. Since
translucent points that overlap will be more opaque, and areas with fewer points
will be more transparent, a scatter plot with semi-transparent points can show the
density of the data much better than solid points can.
We can set the alpha transparency of points plotted on an Incanter chart with the
c/set-alpha function. It accepts two arguments: the chart and a number between
zero and one. One signifies fully opaque and zero fully transparent.
(defn ex-1-35 []
  (let [data (-> (load-data :ru-victors)
                 (s/sample :size 10000))]
    (-> (c/scatter-plot (i/$ :turnout data)
                        (i/$ :victors-share data)
                        :x-label "Turnout"
                        :y-label "Victor Share")
        (c/set-alpha 0.05)
        (i/view))))
The preceding scatter plot shows the general tendency of the victor's share and the
turnout to vary together. We can see a correlation between the two values, and a
"hot spot" in the top right corner of the chart corresponding to close to 100-percent
turnout and 100-percent votes for the winning party. This in particular is the sign
that the researchers at the Medical University of Vienna have highlighted as being
the signature of electoral fraud. It's evident in the results of other disputed elections
around the world, such as those of the 2011 Ugandan presidential election, too.
The district-level results for many other elections around the world are
available at http://www.complex-systems.meduniwien.ac.at/
elections/election.html. Visit the site for links to the research
paper and to download other datasets on which to practice what you've
learned in this chapter about scrubbing and transforming real data.
We'll cover correlation in more detail in Chapter 3, Correlation, when we'll learn how
to quantify the strength of the relationship between two values and build a predictive
model based on it. We'll also revisit this data in Chapter 10, Visualization, when we
implement a custom two-dimensional histogram to visualize the relationship between
turnout and the winner's proportion of the vote even more clearly.
Summary
In this first chapter, we've learned about summary statistics and the value of
distributions. We've seen how even a simple analysis can provide evidence of
potentially fraudulent activity.
In particular, we've encountered the central limit theorem and seen why it goes such
a long way towards explaining the ubiquity of the normal distribution throughout
data science. An appropriate distribution can represent the essence of a large
sequence of numbers in just a few statistics and we've implemented several of them
using pure Clojure functions in this chapter. We've also introduced the Incanter
library and used it to load, transform, and visually compare several datasets. We
haven't been able to do much more than note a curious difference between two
distributions, however.
In the next chapter, we'll extend what we've learned about descriptive statistics to
cover inferential statistics. These will allow us to quantify a measured difference
between two or more distributions and decide whether a difference is statistically
significant. We'll also learn about hypothesis testing: a framework for conducting
robust experiments that allow us to draw conclusions from data.