Statistics and Data Visualisation With Python 2023
This book is intended to serve as a bridge in statistics for graduates and business practitioners interested
in using their skills in the area of data science and analytics as well as statistical analysis in general. On the
one hand, the book is intended to be a refresher for readers who have taken some courses in statistics, but
who have not necessarily used it in their day-to-day work. On the other hand, the material can be suitable
for readers interested in the subject as a first encounter with statistical work in Python. Statistics and
Data Visualisation with Python aims to build statistical knowledge from the ground up by enabling the
reader to understand the ideas behind inferential statistics and begin to formulate hypotheses that form
the foundations for the applications and algorithms in statistical analysis, business analytics, machine
learning, and applied machine learning. This book begins with the basics of programming in Python and
data analysis, to help construct a solid basis in statistical methods and hypothesis testing, which are useful in many modern applications.
Chapman & Hall/CRC
The Python Series
About the Series
Python has been ranked as the most popular programming language, and it is widely used in education and industry.
This book series will offer a wide range of books on Python for students and professionals. Titles in the series will
help users learn the language at an introductory and advanced level, and explore its many applications in data science, AI, and machine learning. Series titles can also be supplemented with Jupyter notebooks.
Python Packages
Tomas Beuzen and Tiffany-Anne Timbers
Jesús Rogel-Salazar
First edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume respon-
sibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the
copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify
in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any
form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and
recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com
or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that
are not available on CCC please contact mpkbookspermissions@tandf.co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification
and explanation without intent to infringe.
DOI: 10.1201/9781003160359
Typeset in URWPalladioL-Roman
by KnowledgeWorks Global Ltd.
Publisher’s note: This book has been prepared from camera-ready copy provided by the author.
To Luceli, Rosario and Gabriela
2.2.3 Strings 46
2.3.3 Tuples 61
2.3.4 Dictionaries 66
2.3.5 Sets 72
2.5 Functions 89
3 Snakes, Bears & Other Numerical Beasts: NumPy, SciPy & pandas 99
3.1 Numerical Python – NumPy 100
3.1.1 Matrices and Vectors 101
4.4.2 Splitting One’s Sides: Quantiles, Quartiles, Percentiles and More 166
Bibliography 501
Index 511
List of Figures
8.14 A stacked bar chart for the total population per country
for the cities contained in our dataset categorised by
city size. The plot was created with pandas. 444
8.15 A column bar for the total population per country for
the cities contained in our dataset categorised by city
size. The plot was created with Seaborn. 445
8.16 A stacked bar chart for the total population per country
for the cities contained in our dataset categorised by
city size. The plot was created with Pandas
Bokeh. 446
8.17 A stacked bar chart for the total population per country
for the cities contained in our dataset categorised by
city size. The plot was created with Plotly. 447
8.18 Top: A pie chart of the information shown in Table 8.2.
The segments are very similar in size and it is difficult
to distinguish them. Bottom: A bar chart of the same
data. 449
8.19 A donut chart of the data from Table 8.2 created with
pandas. 451
8.20 A histogram of the miles per gallon variable in the cars
dataset. The chart is created with matplotlib. 454
8.21 Histogram of the miles per gallon as a function of the
type of transmission. The chart is created with
pandas. 455
8.22 Histogram of the miles per gallon as a function of the
type of transmission. The chart is created with
Seaborn. 456
8.23 Histogram of the miles per gallon as a function of the
type of transmission. The chart is created with Pandas
Bokeh. 457
5.2 Special cases of the PDF and CDF for the Student’s
t-distribution with different degrees of freedom. 256
floppy disks or 1,498 CD-ROM discs to store just 1 TB worth of information. 1024 TB is one petabyte (PB) and this would take over 745 million floppy disks or 1.5 million CD-ROM discs. These references may only be meaningful to people of a certain age... ahem... If not, look it up!
• Volume – The sheer volume of data that is generated and captured; the size of the datasets at hand. It is the most visible characteristic of big data.

• Velocity – Not only do we need large quantities of data, but they also need to be made available at speed; velocity is the rate at which data is generated. High velocity requires suitable processing techniques not available with traditional methods.

• Variety – The data that is collected not only needs to come from different sources and data types, but also encompasses different formats and shows differences and variability. After all, if you just capture information from Star Trek followers, you will think that there is no richness in Sci-Fi.
• Veracity – This refers to the quality and trustworthiness of the data collected. It indicates the level of trust you can place in the datasets you have. Think of it – if you have a large quantity of noise, all you have is a high pile of rubbish, not big data by any means.
The other V I would like to talk about is that of value: data that does not hold value is a cost. At the risk of falling for the cliché of talking about data being the new oil, there is no question that data — well curated, maintained and secured data — holds value. And to follow the overused oil analogy, it must be said that for it to achieve its potential, it must be distilled. There are very few products that use crude oil in its raw form. The same is true when using data.
Think of it from your own experience. Consider the following example: you have just attended a fine social event with colleagues and are thinking of going back home. The weather in London is its usual self, it is pouring down with cold rain, and you have managed to miss the last tube home. You decide to use one of those ride-hail apps on your mobile. It does not matter how many clever algorithms – from neural to social network analysis – they have used: if the cab is not there in the x minutes they claimed it would take, the entire proposition has failed and you will think twice before using the service again. I bet you can transpose this to a sunnier setting, and I am sure you have experienced this at some point. Remember that any resemblance to actual events is purely coincidental :)
The median is another important basic measure that we all learn at school (some of these measures are discussed in Section 4.2). It was first described by Edward Wright, a cartographer and mathematician, to determine location for navigation with the help of a compass. Another important application of the median includes the work of Roger Joseph Boskovich, astronomer and physicist, who used it as a way to minimise the sum of absolute deviations. This is effectively a regression model¹⁹ based on the L1-norm or Manhattan distance. The actual word median was coined by Francis Galton in 1881.

¹⁹ Rogel-Salazar, J. (2017). Data Science and Analytics with Python. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. CRC Press.
Perhaps the Guinness brewers were happy with the initial reception of what could be a dry piece of mathematical writing without much fizz (I guess, similar to Guinness stout: not much fizz, but lots of flavour!). Except, that is, if you are Ronald Aylmer Fisher, who considered how certain a result would
Another distinction you may encounter from time to time regards the approach to making inferences: the frequentist and Bayesian approaches to statistics. These may even be thought of as the “philosophies” behind the approach: you can do frequentist inference or Bayesian inference. The difference between frequentist and Bayesian statistics is rooted in the interpretation of the concept of probability. In frequentist statistics, only repeatable random events have probabilities (think of the typical coin flipping experiment). These are equal to the long-term frequency of occurrence of the events in which we are interested. In the Bayesian approach, probabilities are used to represent the uncertainty of an event and as such it is possible to assign a probability value to non-repeatable events! We are then able to improve our probability estimate as we get more information about an event, narrowing our uncertainty about it. This is encapsulated in Bayes’ theorem.
the future. With our model under our arm, we are able to
apply it to tell us something about observations outside
our sample.
[Figure: the Data Project Workflow — Objective, Planning, Questions, Insights.]

Outcomes                            Outputs
Increase revenue                    Reports
Change in cost                      Spreadsheets
Rate of growth                      Dashboards
Increased customer satisfaction     Infographics
Retain more customers               Brochures
Etc.                                Etc.
Our visuals are not about us, but instead about two things:

2. making sure our audience gets our message in a clear and concise manner. Remember that the aim is to communicate, not obfuscate.
The Python programmer community has the advantage that we call ourselves Pythonistas.
Talking about Python versions, in this book we are working with version 3.x of the Python distribution. Remember that Python 2 was sunset on January 1, 2020 (Python 2 was released in 2000 and was supported for 20 years, not bad!). This means that there will be no more releases for that version and, should there be any security patches needed or bugs that need fixing, there will not be any support from the community. If you or your organisation, school, university, or local Pythonista club are using Python 2, consider upgrading to the next version as soon as you can.
We are assuming that the script has been saved in the local path; the command above is launched directly from the terminal, with no need for the iPython shell. In this way we can run the programme as many times as we want; we can also add instructions and extend our analysis.
Well, all that and more is possible with the Jupyter notebook. As the name indicates, a notebook provides us with a way to keep our programmes or scripts annotated, and we are able to run them interactively (Jupyter notebooks used to be called iPython notebooks). Code documentation can be
Table 2.1: Arithmetic operators in Python.

Operation        Operator
Addition         +
Subtraction      -
Multiplication   *
Division         /
Exponentiation   **
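As a quick illustration of these operators in action (the values here are arbitrary, not from the original listing):

> 7 + 3
10
> 7 - 3
4
> 7 * 3
21
> 7 / 3
2.3333333333333335
> 7 ** 3
343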
> magic = 3
> type(magic)

int

The command type lets us see the type of an object.

> type(trick)

float
In this case we can see that the type of the object trick is
float. Notice that since Python is dynamically typed we
can mix integers and floats to carry out operations and the
Python interpreter will make the appropriate conversion or
casting for us. For example:
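The example itself is not shown in this extract; a minimal sketch, assuming trick2 mixes the integer magic with a float (the actual values used in the book are not shown):

> trick2 = magic + 0.1415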
You can check that the result of the operation above results in a float:

> type(trick2)

float

The result has the expected type.

> print(trick2)
• int() creates an integer number from an integer literal, from a float literal by removing all decimals, or from a string literal (so long as the string represents a whole number), as illustrated below. A literal is a notation for representing a fixed value in source code.
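A quick illustration of these three cases (the values are illustrative, not from the original text):

> int(7)
7
> int(3.99)
3
> int('42')
42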
2.2.3 Strings
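The definitions of example1 and example2 are not shown in this extract; a minimal sketch consistent with the output below, assuming one string uses single quotes and the other double quotes:

> example1 = 'This is a string'
> example2 = "This is also a string"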
> print(example1)

'This is a string'

Strings in Python can be defined with single or double quotes.

> print(example2)

> type(example1)
at the beginning of this chapter. In the statement above we are assigning the value “Live Long” to the variable vulcan, whereas the variable salute has the value “Prosper”. In Python we can do multiple assignation of values in a single line. Converting the number 3 to str and concatenating it with vulcan results in 'Live Long3'.
There are a few other tricks that strings can do and we will cover some of those later on. One more thing to mention about strings is that they are immutable objects: strings in Python are immutable. This means
> z = 42+24j
> z

(42+24j)

The real and imaginary parts can then be printed with a format string, e.g. '...'.format(z.real, z.imag).
You may also have noticed that we have referred to the real and imaginary parts with .real and .imag. These are methods of the complex type. The method of an object can be invoked by following the name of the object with a dot (.) and the name of the method. This makes sense when we remember that each entity in Python is an object. Each
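The operands themselves are not shown in this extract; a minimal sketch consistent with all the results below (x = 1+2j and y = 3+4j reproduce every output, including the conjugate):

> x = 1+2j
> y = 3+4j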
print('Addition =', x + y)
print('Subtraction =', x - y)
print('Multiplication =', x * y)
print('Division =', x / y)
print('Conjugate =', x.conjugate())

Addition = (4+6j)
Subtraction = (-2-2j)
Multiplication = (-5+10j)
Division = (0.44+0.08j)
Conjugate = (1-2j)
2.3.1 Lists
numbers = [1, 2, 3, 4, 5]

> print(numbers[0])
1

> print(vulcan[2:4])
['rehkuh', 'kehkuh']

with Python it would start counting as: “0, 1, 2, 3, . . .”; indexing in Python starts at zero. In the second command above you can also see how we can refer to a sub-sequence in the list. This is often referred
this case the last item has the index −1, the next one −2 and
so on. We can see how the indices and the slicing operator
> L = 'StarTrek'
> print(L)
StarTrek

> L[0]
'S'

> L[1:4]
'tar'

> L[-4:-1]
'Tre'
Lists are mutable objects and this means that we can change individual items in the list. Remember, however, that strings are immutable! In this case we have a list of characters and that can be modified. Let us create a list to play with:

> scifi = ['S','t','a','r','T','r','e','k']
> scifi[4]
'T'

> scifi.append('.')
> print(scifi)
['S', 't', 'a', 'r', 'T', 'r', 'e', 'k', '.']

append lets us add elements to a list.
and this means that the + operator will not sum the
elements of the list. We will talk about arrays in the next
chapter.
> print(l1)

> l1.sort()

Note that since this is a method, we call it with the dot (.) notation. The sorting is done in place and this means that now our list has changed. We can see that when we print the list we get the Fibonacci numbers from our original list in ascending order. If we wanted them in reverse order we simply pass this as a parameter as follows:

> l1.sort(reverse=True)
> print(l1)
[55, 34, 21, 13, 8, 5, 3, 2, 1, 1, 0]

We can use reverse to order in reverse.
We can check that the original object has not been modified
by printing it:
> print(l1)
> print(m[2])
[3,4]
and continue drilling down to obtain the sub-elements in the lists that comprise the main list. The same slicing and dicing operations can be applied to them:

> m[2][1]
4

We can obtain the length of a list with the len function. For example:

> print(len(m))

The length of a list can be obtained with the len function.
> print(l2)
> print(l3)
[[0, 0], [3, 9], [1, 1], [1, 1], [2, 4]]
2.3.3 Tuples
numbers_tuple = (1, 2, 3, 4, 5)
vulcan_tuple = ('wuhkuh', 'dahkuh', 'rehkuh', 'kehkuh', 'kaukuh')
starships_tuple = (1701, 'Enterprise', 1278.40,

These tuples contain the same items as the lists created in the previous section.

> type(vulcan_tuple)
tuple

> numbers_tuple[3]
4

> vulcan_tuple[1:4]
('dahkuh', 'rehkuh', 'kehkuh')
Remember that tuples are immutable: they do not support item assignment.
> t2 = sorted(t1)
> print(t2)
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

The result of the sorted function on a tuple is a list.

> type(t2)
list

The result of the function is a list.

> t1.count(1)
2

> t1.index(1)
3

We can see that there are two items with the value 1 in the tuple t1, and that the first item with value 1 is in the position with index 3, i.e. the fourth element in the tuple. We can obtain the length of a tuple with the len function:
> len(vulcan_tuple)
5

> type(translate)
zip

As you can see, the result is an object of class zip. If you try to print the object directly, you get a strange result:
We can see how the association worked and we can use this
as a form of dictionary to get the name of the numbers in
vulcan. Perhaps though, we can use an actual dictionary for
these purposes. Let us take a look.
2.3.4 Dictionaries
'Security/Tactical Officer'] }

> enterprise['Spock']

We can query a dictionary by key.

> enterprise['Spock'][1]
'Science Officer'

We can use nested slicing and dicing as required.
> print(names)

dict_keys(['James T. Kirk', 'Spock', 'Nyota Uhura', 'Leonard McCoy'])

The keys() method creates a dynamic view that updates as the dictionary object changes.

> print(names)
> print(rank)

dict_values(['Captain', ['First Officer',
'Helmsman'])

The contents of the dynamic views created with keys(), values() and items() are kept up-to-date.

> print(pairs)
'Security/Tactical Officer']),

You can see that the view object that corresponds to the items in the dictionary is made of tuples containing the key and value for each entry.
4: ’kehkuh’, 5: ’kaukuh’}
We are taking each item in the zip object and assigning the
values to two variables: k for the key and v for the value.
We then use these variables to construct the entries of our
new dictionary as k:v. The result is a new dictionary that
works in the usual way:
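The comprehension described above is missing from this extract; a minimal sketch of it, assuming translate is the zip object pairing the numbers with their Vulcan names:

> vulcan_dict = {k: v for k, v in translate}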
> vulcan_dict[3]
’rehkuh’
With a zip object there is another, more succinct way to create the dictionary without a comprehension: we could have simply used the dict() function. Remember that zip objects can only be used once! You may need to re-create translate to make this work:

vulcan_dict = dict(translate)
2.3.5 Sets
significant_seven = {1, 2, 3, 4, 5,
                     6, 6, 6, 6, 6, 6,
                     8, 8}

Here we have several copies of each Cylon model.

> type(significant_seven)
set

> print(significant_seven)
{1, 2, 3, 4, 5, 6, 8}

Here is the set of the Significant Seven Cylons.
> final_five = set(['Saul Tigh', 'Galen Tyrol',
                    'Samuel T. Anders', 'Tory Foster',
                    'Ellen Tigh'])

Let us create a set with the Final Five Cylons.

Please note that the items that we use to create a set have to be immutable, otherwise Python will complain. For example, if you pass a list an error is returned:
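The book's example is not reproduced here; a minimal sketch of the kind of error you would see, using arbitrary values:

> {1, 2, [3, 4]}
TypeError: unhashable type: 'list'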
> noerror

Using tuples as input is fine.

> len(final_five)
5

> 'Laura Roslin' in final_five
False

The in keyword lets us check for membership.

> 'Galen Tyrol' in final_five
True
Phew! So we know that President Roslin is not a Cylon but (spoiler alert!!!)
Another important operation is the intersection of sets. The operation results in the items that exist simultaneously in each of the sets in question. We can use the intersection() method or the & operator. Let us see who are the Cylons in our officers set:
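The call is not shown in this extract; a sketch of what it would look like, assuming an officers set defined earlier in the chapter and the final_five set from above (the result described in the text is the Chief and Galactica's XO):

> officers & final_five
{'Galen Tyrol', 'Saul Tigh'}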
This will give us a set that contains those who are officers in Galactica and who are also Cylons. In this case, this corresponds to the Chief and Galactica's XO. Well, our Cylon detector needs to be better, right? We could detect the presence of two of the final five, but a copy of model number 8 has infiltrated Galactica. Let us try to rectify this situation.
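The code for this step is not shown in this extract; a hedged sketch of one way to do it, assuming we extend the set of known Cylons with the model number 8 copy ('Sharon Valerii') and intersect with the officers set again:

> cylons = final_five | {'Sharon Valerii'}
> officers & cylons
{'Galen Tyrol', 'Saul Tigh', 'Sharon Valerii'}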
There you go, we can now detect the presence of the Cylon
infiltrator!
that are in the first set but not in the second. The same operation can be done with the minus sign (-):

{1, 2, 3, 4, 5, 6, 7, 8,
'Ellen Tigh', 'Samuel T. Anders',

The set difference is not commutative.
> print(42 > 10)
True

> print(2001 <= 1138)
False

> type(42 > 10)
bool
> a = 75 != 75
True
Table 2.6: Comparison and logical operators in Python.

Operation                  Operator
Equal                      ==
Different                  !=
Greater than               >
Less than                  <
Greater or equal to        >=
Less or equal to           <=
Object identity            is
Negated object identity    is not
Logical AND                and
Logical OR                 or
Logical NOT                not
if expression1:
    block of code executed if expression1 is True
elif expression2:
    block of code executed if expression2 is True
...
elif expressionN:
    block of code executed if expressionN is True
else:
    block of code executed

The if... elif... else... construction lets us test various conditions and create branches for our code.
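As a concrete (and entirely made-up) illustration of this branching syntax, in the spirit of the examples that follow:

dilithium = 42

if dilithium > 40:
    print('Engage!')
elif dilithium > 0:
    print('Reserves only')
else:
    print('Out of dilithium')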
while condition:
block of code to be executed
You can see a couple of things that are similar to the syntax
we saw for conditional statements. First, we need a colon
after the condition. Second, the block of code that is
executed as long as the condition is True is indented. One
more example where whitespace is important in Python.
> dilithium = 42
> while dilithium > 0:
      print('Engage!')
      dilithium -= 10

Engage!
Engage!
Engage!
Engage!
Engage!

Note that dilithium -= 10 is a shorthand for dilithium = dilithium - 10.
for item in sequence:
    block of code to be executed

> for i in range(5):
      print('Engage!')

Engage!
Engage!
Engage!
Engage!
Engage!

range enables us to define a sequence of numbers as an object. This means that the values are generated as they are needed.
2.5 Functions
def picard_beverage():
    '''Returns Captain Picard's beverage request.'''
    temperature = 'Hot.'
    result = 'Tea. Earl Grey. ' + temperature
    return result

> picard_beverage()
'Tea. Earl Grey. Hot.'

In the syntax shown above you can see a line of code that starts with triple quotes. This is called a documentation string and it enables us to describe what our function does. A documentation string provides information about what a function does; make sure you use it! You can take a look at the contents as follows:

> print(picard_beverage.__doc__)
return area

> a = area_triangle(50)
> print(a)
25.0

100.0

The arguments are passed to the function in round brackets.
if scale == ’C’:
print(s)
print(s)
else:
return result
> sq(3)
def main():
t = input(’Give me a temperature: ’)
if s == ’C’:
print(’Try again!’)
main()
96 j. rogel-salazar
Now that we have saved the script we are able to execute it.
Open a terminal, navigate to the place where the script is
saved and type the following command:
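The command itself does not appear in this extract; assuming the script was saved under a hypothetical name such as temperature.py, it would be along the lines of:

python temperature.py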
Converting 451.0 F
451.0 F is 232.78 C
This is great, we can rerun this script over and over and obtain the results required (in this case we want to convert 451 F to C). We may envisage a situation where other functions to transform different units are developed, expanding our capabilities until we end up with enough of them to create a universal converter. Importing such a module can save the day, as it then has all the functionality of the helicopter or kungfu modules. Please note these two modules are purely fictional at this stage!
import math

def area_circ(radius):
    # Area of a circle: pi * r**2
    area = math.pi * radius**2
    return area

r = 7
ac = area_circ(r)
You may have seen in the code above that we need to tell Python that the constant π is part of the math module. This is done as math.pi. In the example above we are importing all the functions of the math module. This can be somewhat inefficient in cases where only specific functionality is needed. Instead we could have imported only the value of π as follows:

from math import pi

In some cases it may be more efficient to load only the needed functionality from a module.
A matrix is effectively a collection of row (or column) vectors:

$$A = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1} & a_{m,2} & \cdots & a_{m,n} \end{pmatrix} \qquad (3.1)$$
list_a = [0, 1, 1, 2, 3]
Lists are excellent Python objects, but their use is limited for
some of the operations we need to execute with numerical
arrays. Python is able to use these lists as a starting point to
build new objects with their own operations and functions
letting us do the mathematical manipulations we require.
In this case, modules such as NumPy and SciPy are already
available to us for the use of n-dimensional arrays (i.e.
ndarray) that can be used in mathematical, scientific, and
import numpy as np

A = np.array(list_a)
B = np.array(list_b)

We define a NumPy array with np.array, where np is a convenient alias used for the NumPy package.
> type(A)
numpy.ndarray

> C = A + B

• Vector addition: +
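The matrix definitions are not shown in this extract; a sketch consistent with the results that follow (the sum and the transpose below pin these values down, with M1 being the identity):

> M1 = np.matrix([[1, 0], [0, 1]])
> M2 = np.matrix([[0.5, 2], [-4, 2.5]])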
> type(M1)
numpy.matrix

The type of a matrix is, surprise, surprise, matrix.
> M1 + M2
matrix([[ 1.5,  2. ],
        [-4. ,  3.5]])

Matrix addition (and subtraction) works as expected.
or matrix multiplication:

> M1 * M2

> M2.transpose()
matrix([[ 0.5, -4. ],
        [ 2. ,  2.5]])

Other operations, such as transposition, are also available.
> A.shape
(5,)

> A.dtype
dtype('int64')

> M2.dtype
dtype('float64')

The elements inside matrices or arrays also have types.
> z1 = np.zeros((2,3))
> z1
array([[0., 0., 0.],
       [0., 0., 0.]])

zeros() creates a matrix whose elements are all zero.

> o1 = np.ones((3,3))

> a = np.arange(12)

[0 2 4 6 8]
containing the first 5 numbers and the second with the same information but divided by 2:

> np.array([np.arange(5), 0.5*np.arange(5)])
array([[0. , 1. , 2. , 3. , 4. ],
       [0. , 0.5, 1. , 1.5, 2. ]])

The results of each student are captured as rows in our array:

    [4.5, 6, 3.5],
    [8.5, 10, 9],
    [8, 6.5, 9.5],
> marks[2,1]
10.0

We are obtaining the element in row 2, column 1.

> marks[:,0]
array([8. , 4.5, 8.5, 8. , 9. ])

Here, we get all the elements in column 0.
(9.0, 4.5)

(7.6, 1.5937377450509227)

> marks.mean()
7.733333333333333

We can obtain the average mark across all subjects for all students.
> marks.mean(axis=0)
array([7.6, 8.3, 7.3])

Or for a specific subject, i.e. column, with slicing and dicing operations.

We can see that the average mark for Physics is 7.6 (which is the value we calculated above), for Spanish 8.3 and for History 7.3. Finally, we may be interested in looking at the average marks for each student. In this case we operate on each row. In other words, axis=1:
> marks.mean(axis=1)
array([8.0, 4.6666667, 9.1666667, 8.0, 8.8333333])

The average mark for each student is also easily calculated.
> np.unique(m)
array([ 3,  4,  5,  7,  9, 10, 12])

We can get unique elements with unique().

> np.unique(marks)

If we are interested in finding entire unique rows or columns, we need to specify the axis. For rows we use axis=0 and for columns axis=1. Unique rows and columns in an array are obtained by specifying the axis.
import numpy as np
from numpy import dot
from scipy import linalg   # assumed imports; the original import lines are not shown in this extract

We can now use these two arrays x and y to calculate the coefficients with the following expression: $\mathrm{coef} = (x^T x)^{-1} x^T y$. We can invert a matrix with the .inv method, and matrix multiplication with arrays can be done with the dot() function.

n = linalg.inv(dot(x.T, x))
k = dot(x.T, y)
coef = dot(n, k)
[Figure: scatter plot of the data points, y against x.]
[[-0.0955809 ]
[ 0.10612972]]
> linalg.det(a1)
-200.0

The determinant can be calculated with the .det function.
> l, v = linalg.eig(a1)
> print(l)

> print(v)
[[-0.90937671 -0.56576746]
 [ 0.41597356 -0.82456484]]
$$\int_0^{\infty} e^{-x^2}\,dx = \frac{\sqrt{\pi}}{2}. \qquad (3.3)$$
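The call producing the result below is not shown in this extract; a minimal sketch using SciPy's quad integration routine, which returns the value together with an error estimate:

import numpy as np
from scipy import integrate

# Numerical evaluation of the Gaussian integral in Equation (3.3)
result = integrate.quad(lambda x: np.exp(-x**2), 0, np.inf)
print(result)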
(0.8862269254527579, 7.101318390472462e-09)
array([0.78539816])
3.2.4 Statistics
s = np.random.normal(size=1500)
b = np.arange(-4, 5)
h = np.histogram(s, bins=b, density=True)[0]

We get normally distributed random numbers with random.normal.

> np.mean(s)
0.008360091147360452

> np.median(s)
0.030128236407879254

The mean and the median can easily be obtained with the help of NumPy as we saw in Section 3.1.
Ttest_indResult(statistic=-7.352146293288264,
pvalue=1.2290311506115089e-11)
The result comes in two parts. First, the so-called t-statistic value tells us about the significance of the difference between the processes. The second is the p-value: the probability of observing a difference at least as large as the one seen here if the two processes actually had the same mean. The closer the value is to 0, the more likely it is that the processes have different means. A t-test is used to determine if there is a significant difference between the means of two groups.
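The call producing the Ttest_indResult above is not shown in this extract; a sketch of the general pattern with made-up samples (not the book's data):

import numpy as np
from scipy import stats

# Two hypothetical samples standing in for the two processes
rng = np.random.default_rng(42)
process_a = rng.normal(loc=10.0, scale=1.0, size=100)
process_b = rng.normal(loc=11.0, scale=1.0, size=100)

print(stats.ttest_ind(process_a, process_b))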
The most basic type of data array in pandas is called a series, which is a 1-D array. A collection of series is called a dataframe. Each series in a dataframe has a data type and, as you can imagine, pandas builds these capabilities on top (We are using the spelling of pandas with lowercase as used in its documentation, except in cases where it is at the beginning of a sentence.)
import numpy as np
import pandas as pd

s1 = pd.Series(a1)

> type(s1)
pandas.core.series.Series
As you can see, the type of the object is a series, and each series (and dataframe) has a number of methods. The first thing to do is get to grips with the tabular data that we are able to manipulate with pandas. Let us look at some data published in 2016 by the Greater London Authority about population density⁵. In Table 3.2 we show the population and area (in square kilometres) for four global cities contained in the report.

⁵ Three Dragons, David Lock Associates, Traderisks, Opinion Research Services, and J. Coles (2016). Lessons from Higher Density Development. Report to the GLA.
We can load this data into Python by creating lists with the appropriate information about the two features describing the cities in the table:

8491079]

df = pd.DataFrame({'cities': names,
                   'population': population,
                   'area': area})

We can pass a dictionary to .DataFrame to create our table of data.

> type(df)
pandas.core.frame.DataFrame
> df.head(2)

cities  population  area

The .head() method lets us see the first few rows of a dataframe.

> df.tail(2)

cities  population  area

Similarly, .tail() will show the last few rows.

> df.shape
(4, 3)

The dimension of our dataframe can be seen with .shape.

> df.dtypes
cities        object
population     int64
area           int64
dtype: object
> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   cities      4 non-null      object

The .info() method gives us information about a dataframe such as the index and column dtypes, non-null values and memory usage.
> df.columns
Index(['cities', 'population', 'area'], dtype='object')

The columns method returns the names of the columns in a dataframe.
This means that we can use these names to refer to the data in each of the columns. For example, we can retrieve the data about the population of the cities in rows 2 and 3 as follows:

df['population'][2:4]

2    2229621
3    8491079

We can view the contents of a dataframe column by name, and the data can be sliced with the usual colon notation.
You may notice that there is an index running along the left-hand side of our dataframe. This is automatically generated by pandas and it starts counting the rows in our table from 0; pandas automatically assigns an index to our dataframe. We can use this index to refer to our data. For example, we can get the city name and area for the first row in our table:
> df[[’cities’,’area’]][0:1]
cities area
We may want to define our own unique index for the data we are analysing. In this case, this can be the name of the cities. We can change the index with the set_index() method as follows:

df.set_index('cities', inplace=True)
> df.loc['Tokyo']
population    9272565.0
area              627.0

We can locate data by index name or label with loc. If you need to locate data by integer index, use iloc instead.
> df.describe()
population area
dataframe:
> df
Table 3.4: Some of the input sources available to pandas.

Source       Command
Flat file    read_table()
             read_csv()
             read_fwf()
Excel file   read_excel()
             ExcelFile.parse()
JSON         read_json()
             json_normalize()
SQL          read_sql_table()
             read_sql_query()
             read_sql()
HTML         read_html()
import numpy as np
> gla_cities.head(3)
You can see that row 0 has NaN as entries. This means that
the row has missing data, and in this case, it turns out that
there is an empty row in the table. We can drop rows with
missing information as follows:
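The command itself is not shown in this extract; one way to do this with pandas' dropna:

gla_cities.dropna(inplace=True)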
> gla_cities.shape
(17, 11)
> gla_cities.columns
dtype=’object’)
> gla_cities['pop_density'] = \
    gla_cities['Population']/gla_cities['Area km2']

TypeError: unsupported operand type(s) for /:

We get an error because the data was read as strings, not as numbers. The reason for this is that the data for the Population, Area km2 and Dwellings columns has been read as strings and, as
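The conversion step is not shown in this extract; a sketch of one way to do it, assuming the column names referenced above and that the string values parse directly as numbers:

for col in ['Population', 'Area km2', 'Dwellings']:
    gla_cities[col] = pd.to_numeric(gla_cities[col])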
> gla_cities['pop_density'] = \
    gla_cities['Population']/gla_cities['Area km2']

> gla_cities['pop_density'].head(3)
1     5511.005089
2    10782.758621
3     4165.470494

Let's see what's out there... The operation can now be completed correctly!
This can make it easier for us and other humans to read the
information in the table. The important thing is to note that
pandas has effectively vectorised the operation and divided
each and every entry in the column by a million. If we need
to apply a more complex function, for instance one we have
created ourselves, it can also be applied to the dataframe.
First, let us create a function that categorises the cities
into small, medium, large and mega. Please note that this
categorisation is entirely done for demonstration purposes
and it is not necessarily a real demographic definition:
def city_size(x):
if x < 1.5:
s = ’Small’
s = ’Large’
else:
s = ’Mega’
return s
> gla_cities[['City', 'Population (M)']]\
    [gla_cities['Population (M)'] > 8]

We can combine Boolean filtering with column selection.

  City  Population (M)
> gla_grouped.size()
Medium    5
Mega      5
Small     3
dtype: int64

Medium    2.324692
Mega      7.625415
Small     0.604062
np.std])
[Figure: horizontal bar chart of the average population (M) per city-size group (Small, Medium, Large, Mega).]
• Mode
• Geometric mean
• Harmonic mean
4.3.1 Mode
• Physics - 8
• Spanish - 10
> stats.mode(marks)
ModeResult(mode=array([[ 8. , 10. ,  3.5]]), count=array([[2, 2, 1]]))

In this case Python tells us the mode for each column in the NumPy array marks and also gives us the frequency. For the History column, the mode reported is the smallest value.
0    10.4
1    15.2
2    19.2
3    21.0
4    21.4
5    22.8
6    30.4
> ct = cars.groupby([’am’])
1 30.4
Once again we have more than one value. For the automatic
cars we have three mode values, whereas for manual we
have two. Note that since mode is not a function that
reduces the dataset, like for example a sum of values,
pandas does not have a method for grouped dataframes.
some data. Think for instance of the marks for Spanish. The mode is 10 but it may not be a good choice to represent the rest of the marks for the class. Similarly, for the cars dataset, since we have a multi-modal distribution it may be difficult to discern a central value. Fear not, that is where other centrality measures come in handy; other measures of central tendency are available.
4.3.2 Median
In this case, the middle value is 8 and that is our median. Note that here we have an odd number of observations and determining the middle value is easy. However, when the number of observations is even we sum the two values in the middle and divide by two. Depending on whether we have an odd or even number of observations, our median is either the middle value or the average of the central ones.

array([8. , 9. , 7.5])
> ct['mpg'].median()
am
0    17.3
1    22.8

The method can also be used for grouped dataframes.
$$\frac{1}{5}(45 + 30 + 50 + 26 + 40) = 38.2. \qquad (4.3)$$

Battlestar Galactica seems to have a young crew!

$$\frac{1}{6}(45 + 30 + 50 + 26 + 40 + 2052) = 373.833. \qquad (4.4)$$

The presence of the Cylon has skewed the average age to the point that it does not represent the majority of the values in the dataset. The presence of a large number skews the average value; in this case an old Cylon turns our average into an impossible human age. The median of this dataset is 42.5 and that may be a better description for this group of officers (and
means:
> physics.mean()
7.6

> ct['mpg'].mean()
am
0    17.147368
1    24.392308

The average fuel consumption is different between our automatic and manual transmission cars in our dataset.
def g_mean(x):
    avglnx = np.log(x)
    return np.exp(avglnx.mean())

For the BSG officers list of ages (BSG = Battlestar Galactica) we can then use this function:
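The ages list itself is defined earlier in the book; for completeness, these are the five officer ages used in Equation (4.3):

bsg_ages = np.array([45, 30, 50, 26, 40])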
> g_mean(bsg_ages)
37.090904350447026
We can use this to calculate the geometric mean for the Physics marks:

> from scipy.stats import gmean
> gmean(physics)
7.389427365423367

Here we are using the gmean() function in SciPy.

We can also calculate the geometric mean for all the subjects as follows:

19.25006404155361

> ct['mpg'].apply(gmean)
am
0    16.721440
1    23.649266
With the arithmetic mean, the 2249 vintage seems the better wine. However, the fact that we have ratings with different scales means that large numbers will dominate the arithmetic mean. Let us use the geometric mean to see how things compare; the geometric mean lets us make a better comparison:

$$\text{2249 vintage} \to \sqrt{3.5 \times 80} = 16.73 \qquad (4.10)$$
$$\text{2286 vintage} \to \sqrt{4.5 \times 75} = 18.37 \qquad (4.11)$$
There is also a weighted version of the geometric mean:

$$\mathrm{WGM} = \exp\left(\frac{\sum_{i=1}^{n} w_i \ln x_i}{\sum_{i=1}^{n} w_i}\right). \qquad (4.12)$$
1. Take the reciprocal of each value.

2. Calculate their arithmetic mean.

3. Take the reciprocal of the result.

The harmonic mean uses the reciprocals of the data points.
In other words:

$$\mathrm{HM} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} = n \left( \sum_{i=1}^{n} x_i^{-1} \right)^{-1}. \qquad (4.13)$$

This is the formula to calculate the harmonic mean.
def h_mean(x):
    sum = 0
    for val in x:
        sum += 1/val
    return len(x)/sum

Beware that this implementation cannot handle zeroes.

> h_mean(bsg_ages)
35.96679987703658
Note that the harmonic mean is always the least of the three means we have discussed, with the arithmetic mean being the greatest. If we include the 2052-year-old Cylon, the harmonic mean is 43.
$$\mathrm{WHM} = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} w_i x_i^{-1}}. \qquad (4.14)$$

For the index mentioned above we have that the weighted harmonic mean is given by (the harmonic mean provides a fair comparison for the ratios):

$$PE_{\mathrm{WHM}} = \frac{0.2 + 0.8}{\frac{0.2}{24} + \frac{0.8}{13.33}} = 14.63. \qquad (4.15)$$
• After the successful rescue, due to damage sustained to the spore drive, USS Discovery made the trip back in 8 seconds. (A successful trip to rescue Commander Michael Burnham.)
4.4 Dispersion
> physics.max()
9.0

> physics.min()
4.5

> physics_range = physics.max() - physics.min()
> print(physics_range)
4.5
With it we can also calculate the range for the marks array in
one go:
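The command is not shown in this extract; presumably NumPy's peak-to-peak function along the columns, which matches the ranges quoted below:

> np.ptp(marks, axis=0)
array([4.5, 4. , 6. ])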
We can see from the results above that History (6) has a
wider range and this means that the variability in the marks
is higher than for Spanish (4).
For our pandas dataframe we can use pretty much the same
methods as above. Let us take a look at the maximum and
minimum values of the fuel consumption column we have
been analysing:
> print(mpg_range)
23.5

> np.ptp(mpg)
23.5

Or apply the peak-to-peak method instead.

> ct['mpg'].apply(np.ptp)
am
0    14.0
1    18.9

In this case we are using the apply method to calculate the range for our grouped dataframe.
We find our n − 1 quantiles by first ranking the data in order, and then cutting it into n − 1 equally spaced points on the interval, obtaining n groups. We need to order our data to obtain our quantiles. It is important to mention that the terms quartile, decile, percentile, etc. refer to the cut-off points and not to the groups obtained. The groups should be referred to as quarters, tenths, etc.
> print(mpg.describe())
25% 15.425000
50% 19.200000
75% 22.800000
max 33.900000
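The helper function md used below is not shown in this extract; a minimal sketch of a mean absolute deviation calculation consistent with the value reported for the Physics marks:

def md(x):
    # Mean absolute deviation: average absolute distance from the mean
    return np.mean(np.abs(x - np.mean(x)))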
> md(physics)
1.24
> mpg.mad()
4.714453125
> ct[’mpg’].mad()
1 5.237870
Variance

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2. \qquad (4.21)$$
Standard Deviation
> np.std(physics)
1.5937377450509227
1.7818529681205462
that the marks for History are spread wider than those for
Spanish.
> mpg.var(ddof=0)

> mpg.std(ddof=0)
5.932029552301219

> mpg.var()
36.32410282258065

> ct['mpg'].var()
am
0    14.699298
1    38.025769

> ct['mpg'].std()
am
0    3.833966
1    6.166504

The methods can be applied to grouped dataframes too.
5.1 Probability
Some important properties of probabilities:

1. In the sample space (Ω) for an experiment, the probability P(Ω) is 1.

P(A) + P(A′) = 1.
$$P(A|B) = \frac{|AB|}{|B|} = \frac{|AB|/|\Omega|}{|B|/|\Omega|}. \qquad (5.1)$$

$$P(B|A) = \frac{|BA|}{|A|}. \qquad (5.2)$$

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}, \qquad (5.3)$$

Bayes' theorem.
• SS
• SC
• CS
• CC
In the encounter with Mr Dent above, we have managed to create a random variable. A random variable is a variable that takes on different values determined by chance.
Table 5.1: Probability of our coin flipping experiment.

X    Probability P(X = x)    Cumulative Probability P(X ≤ x)
0    0.25                    0.25
1    0.50                    0.75
2    0.25                    1.00
We can now ask about the probability that the value of our random variable X falls within a specified range. This is called the cumulative probability: the probability that the value of a random variable falls within a specified range. For example, if we are interested in the probability of obtaining one or fewer S outcomes, we can calculate the probability of obtaining no S, plus the probability of getting one S:

P(X ≤ 1) = P(X = 0) + P(X = 1),
$$\int_{-\infty}^{\infty} p(x)\,dx = 1, \qquad (5.5)$$

The area under the PDF gives us the probability.
$$E(X) = \mu = \sum_i x_i p_i, \qquad (5.6)$$

where pᵢ is the probability associated with the outcome xᵢ. This is effectively a weighted average, as described in Section 4.3.4 (the expected value is a weighted average). For our experiment above we have that the mean value is:
$$\mathrm{Var}(X) = E[X^2] - 2E[X]E[X] + E[X]^2 = E[X^2] - E[X]^2. \qquad (5.8)$$

$$\mathrm{Var}(X) = E[X^2] - \mu^2 = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \left( \int_{-\infty}^{\infty} x f(x)\,dx \right)^2. \qquad (5.12)$$
import numpy as np
import random

def coinFlip(nflips=100):
    flips = np.zeros(nflips)
    for flip in range(nflips):
        flips[flip] = random.choice([0, 1])
    return flips

We can use this function to simulate repeated coin flips for our experiment.
> t1, t2, t3 = coinFlip(), coinFlip(), coinFlip()
> np.sum(t1), np.sum(t2), np.sum(t3)

Remember that the numbers are random, so your list may look different.
def coinExperiment(ntrials):
return results
r.append(coinExperiment(n))
> for i in r:
[Figure: histograms of the fraction of S faces obtained over repeated runs, for increasing numbers of experiments (including 1000 and 10000 experiments); occurrences vs. fraction of S faces.]
$$\mu = \sum_{i} x_i f(x_i) = \frac{1}{n} \sum_{i=0}^{n-1} i = \frac{1}{n}\,\frac{n(n-1)}{2} = \frac{n-1}{2}, \qquad (5.16)$$

The mean of the uniform distribution.

$$\sum_{i=0}^{n-1} i = \frac{n(n-1)}{2}. \qquad (5.17)$$

See Appendix B for more information.

$$\sigma^2 = \frac{1}{n} \sum_{i=0}^{n-1} i^2 - \left( \frac{n-1}{2} \right)^2. \qquad (5.19)$$

$$\sigma^2 = \frac{1}{n}\,\frac{n(n-1)(2n-1)}{6} - \frac{n^2 - 2n + 1}{4} = \frac{2n^2 - 3n + 1}{6} - \frac{n^2 - 2n + 1}{4} = \frac{4n^2 - 6n + 2 - 3n^2 + 6n - 3}{12} = \frac{n^2 - 1}{12}. \qquad (5.21)$$

The variance of the uniform distribution.
array([1, 2, 3, 4, 5, 6])
> print(pdf)
We obtain the PDF of the uniform
distribution with the pdf method.
see a depiction for our die rolls in the upper right-hand side
panel of Figure 5.2.
[Figure: the uniform distribution for our die rolls — PMF, CDF, PPF, and random variates (probability, cumulative probability, occurrences vs. outcome).]
> print(cdf)
1. ])
> x1 = x1.astype(int)
Let us see what the mean and population variance for the
random variables above are:
(3.5154, 2.9080536453645367)
2.9166666666666665
$$\sigma^2 = (p-1)^2 p + (p-0)^2 (1-p) = p - p^2 = p(1-p). \qquad (5.24)$$

The variance of the Bernoulli distribution.
array([0., 1.])
[Figure: the Bernoulli distribution — PMF, CDF, PPF, and random variates (probability, cumulative probability, occurrences vs. outcome).]
distribution as follows:

> print(bernoulli.mean(0.12))
0.12

> print(bernoulli.var(0.12))
0.1056
where n! stands for n factorial ($n! = n(n-1)(n-2)\cdots(2)(1)$), and we can read the

$$f(0, 3, 0.5) = \frac{3!}{3!\,0!} (0.5)^0 (1 - 0.5)^{3-0} = (1)(1)(0.5)^3 = 0.125. \qquad (5.27)$$

This is the probability of getting no scar faces in 3 flips of Mr Dent's coin.
$$= np \sum_{k=1}^{n} \binom{n-1}{k-1} p^{k-1} (1-p)^{n-k}. \qquad (5.30)$$

Here we have used the identity

$$k \binom{n}{k} = n \binom{n-1}{k-1}, \qquad (5.31)$$

See Appendix D for more information.
We have taken the factor n out of the sum, as it does not depend on the summation index. Now, we are going to do some re-writing and express n − k as (n − 1) − (k − 1):

$$\mu = np \sum_{k=1}^{n} \binom{n-1}{k-1} p^{k-1} (1-p)^{(n-1)-(k-1)}, \qquad (5.32)$$

$$= np \sum_{j=0}^{m} \binom{m}{j} p^{j} (1-p)^{m-j}, \qquad (5.33)$$
$$\sigma^2 = E[X^2] - \mu^2 = n \sum_{k=1}^{n} k \binom{n-1}{k-1} p^{k} (1-p)^{n-k} - (np)^2. \qquad (5.36)$$

$$k^2 \binom{n}{k} = k \cdot k \binom{n}{k} = k \cdot n \binom{n-1}{k-1}. \qquad (5.37)$$

See Appendix D.

$$\sigma^2 = np \sum_{k=1}^{n} k \binom{n-1}{k-1} p^{k-1} (1-p)^{n-k} - (np)^2, \qquad (5.38)$$

$$= np \sum_{k=1}^{n} k \binom{n-1}{k-1} p^{k-1} (1-p)^{(n-1)-(k-1)} - (np)^2,$$

$$= np \sum_{j=0}^{m} (j+1) \binom{m}{j} p^{j} (1-p)^{m-j} - (np)^2, \qquad (5.40)$$

The remaining sum is the expected value of Y + 1 for a binomial variable Y with m = n − 1 trials:

$$E[Y+1] = mp + 1, \qquad (5.45)$$
$$= (n-1)p + 1. \qquad (5.46)$$
Note that the binomial distribution for the case where n = 1 is actually a Bernoulli distribution; the case n = 1 recovers the Bernoulli distribution. You can check that the mean and variance of the Bernoulli distribution are recovered when only one trial is run.
import numpy as np
from scipy.stats import binom   # assumed import; not shown in this extract

x = np.arange(0, 100, 1)
pmf = binom.pmf(x, 100, 0.5)
cdf = binom.cdf(x, 100, 0.5)

The PMF and CDF of the binomial distribution are shown in the top panels of Figure 5.4.
The PPF for our experiment can be seen in the bottom left-
hand side of Figure 5.4 and is calculated as follows:
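The call is not shown in this extract; a sketch following the same pattern used for the other distributions in the chapter:

probs = np.arange(0, 1, 0.01)
ppf = binom.ppf(probs, 100, 0.5)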
We can see what the mean and variance are for our
experiment and compare with the sample mean and sample
variance above:
[Figure 5.4: the binomial distribution for our experiment — PMF, CDF, PPF, and random variates (probability, cumulative probability, occurrences vs. outcome).]
> print(binom.mean(100, 0.5))
50.0

> print(binom.var(100, 0.5))
25.0

We can calculate the mean and variance for a binomial process with Python's help.
get an ace in the first hand, the chances would be 4/51. The
game continues and, as you can see, the outcome of the next
trial depends on the outcome of the previous one, and one
of the key assumptions behind the binomial distribution is
broken.
$$f(k, N, K, n) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}, \qquad (5.48)$$

The PMF of the hypergeometric distribution.

$$f(2, 52, 4, 4) = \frac{\binom{4}{2}\binom{48}{2}}{\binom{52}{4}} = \frac{(6)(1128)}{270725} = 0.02499954. \qquad (5.49)$$

The probabilities of the card game with Starbuck are easily obtained with the PMF above.
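The SciPy call producing the value below is not shown in this extract; a minimal sketch (scipy.stats.hypergeom takes the population size, the number of successes in the population, and the sample size):

from scipy.stats import hypergeom

print(hypergeom.pmf(2, 52, 4, 4))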
0.024999538276849162
$$f(3, 1000, 425, 10) = \frac{\binom{425}{3}\binom{575}{7}}{\binom{1000}{10}} = 0.19170. \qquad (5.50)$$
The results are pretty close, and the idea behind this proximity is that as the population N grows larger and larger, although we do not replace the Poké Balls, the probability of the next event is nearly the same as before. This is indeed because we have a large population. We can now go and “catch ’em all”.
$$E[X^r] = \sum_{k=0}^{n} k^r \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}. \qquad (5.52)$$

We will use this expression to obtain the mean of the hypergeometric distribution.

$$k \binom{K}{k} = K \binom{K-1}{k-1}, \qquad (5.53)$$

See Appendix D.1 for more information.

$$\binom{N}{n} = \frac{N}{n} \binom{N-1}{n-1}. \qquad (5.54)$$

$$E[X^r] = \frac{nK}{N} \sum_{k=1}^{n} k^{r-1} \frac{\binom{K-1}{k-1}\binom{N-K}{n-k}}{\binom{N-1}{n-1}}. \qquad (5.55)$$
We are writing the sum from k = 1 as the case for 0 does not
contribute to the sum. If we define j = k − 1 and m = K − 1
$$E[X^r] = \frac{nK}{N} \sum_{j=0}^{n-1} (j+1)^{r-1} \frac{\binom{m}{j}\binom{(N-1)-m}{(n-1)-j}}{\binom{N-1}{n-1}}. \qquad (5.56)$$

$$E[X^r] = \frac{nK}{N}\, E[(Y+1)^{r-1}], \qquad (5.57)$$

$$\mu = \frac{nK}{N}. \qquad (5.58)$$

The mean of the hypergeometric distribution.

$$\sigma^2 = \frac{nK}{N}\left[ \frac{(n-1)(K-1)}{N-1} + 1 \right] - \left(\frac{nK}{N}\right)^2 = \frac{nK}{N}\left[ \frac{(n-1)(K-1)}{N-1} + 1 - \frac{nK}{N} \right] = \frac{nK}{N}\,\frac{N^2 - Nn - NK + nK}{N(N-1)} = \frac{nK}{N}\left(1 - \frac{K}{N}\right)\frac{N-n}{N-1}. \qquad (5.59)$$

The variance of the hypergeometric distribution.
The PPF for our game of bridge can be seen in Figure 5.5 in
the lower left-hand panel and here is the code for that:
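The code itself is missing from this extract; a sketch following the pattern used for the other distributions, assuming the bridge-hand parameters N = 52, K = 13 spades and n = 13 cards (these give the mean 3.25 and variance 1.864 quoted below):

probs = np.arange(0, 1, 0.01)
ppf = hypergeom.ppf(probs, 52, 13, 13)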
3.241 1.9808998998999
[Figure: the hypergeometric distribution for the bridge-hand example — PMF, CDF, PPF, and random variates (probability, cumulative probability, occurrences vs. outcome).]
3.25
1.8639705882352942
What is the probability for the case where there are Cylon Raptor sightings, i.e. n ≠ 0? We can break this into two parts as follows:

$$\frac{dP(n,t)}{dt} + \lambda P(n,t) = \lambda P(n-1,t). \qquad (5.64)$$
$$= e^{-\mu} \sum_{n=1}^{\infty} n \frac{\mu^n}{n!} = \mu e^{-\mu} \sum_{n=1}^{\infty} \frac{\mu^{n-1}}{(n-1)!} = \mu e^{-\mu} \sum_{m=0}^{\infty} \frac{\mu^m}{m!}. \qquad (5.66)$$

The n = 0 term is zero, hence the change in the sum. We relabelled the index so that m = n − 1.

$$= \mu^2 e^{-\mu} \sum_{m=0}^{\infty} \frac{\mu^m}{m!} = \mu^2 e^{-\mu} e^{\mu} = \mu^2. \qquad (5.68)$$

We used m = n − 2 to relabel the sum.
$$P(0,2) = \frac{e^{-2}}{0!}\, 2^0 = 0.1353,$$
$$P(1,2) = \frac{e^{-2}}{1!}\, 2^1 = 0.2707,$$
$$P(2,2) = \frac{e^{-2}}{2!}\, 2^2 = 0.2707,$$
$$P(3,2) = \frac{e^{-2}}{3!}\, 2^3 = 0.1805.$$

For µ = 2, we can get the probability of different numbers of Cylon Raptor sightings.
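The code producing the values below is not shown in this extract; a minimal sketch using SciPy's Poisson PMF with µ = 2:

from scipy.stats import poisson

for n in range(4):
    print('{0}: {1}'.format(n, poisson.pmf(n, 2)))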
0: 0.1353352832366127
1: 0.2706705664732254
2: 0.2706705664732254
3: 0.18044704431548356
$$P(X) = \lim_{n \to \infty} \frac{n!}{k!(n-k)!} \left( \frac{\mu}{n} \right)^k \left( 1 - \frac{\mu}{n} \right)^{n-k}, \qquad (5.70)$$
$$= \frac{\mu^k}{k!} e^{-\mu}. \qquad (5.71)$$

See Appendix F.2.
[Figure: the Poisson distribution for the Cylon Raptor sightings — PMF, CDF, PPF, and random variates (probability, cumulative probability, occurrences vs. outcome).]
[Figure: histogram of the kellicam measurements — frequency vs. kellicam measure.]
$$\frac{df(x)}{dx} = -k(x - \mu) f(x), \qquad (5.72)$$

where k is a positive constant. We can solve this equation as follows:

$$\int \frac{df(x)}{f(x)} = -k \int (x - \mu)\,dx,$$
$$f(x) = C \exp\left[ -\frac{k}{2}(x - \mu)^2 \right]. \qquad (5.73)$$

We can find the value of C by recalling that this is a probability distribution and therefore the area under the curve must be equal to 1 (we will deal with the value of k later on):

$$\int_{-\infty}^{\infty} C \exp\left[ -\frac{k}{2}(x - \mu)^2 \right] dx = 1. \qquad (5.74)$$

It can be shown that the constant $C = \sqrt{k/2\pi}$ (see Appendix G.1 for more information). Our probability distribution is thus far given by

$$f(x) = \sqrt{\frac{k}{2\pi}} \exp\left[ -\frac{k}{2}(x - \mu)^2 \right]. \qquad (5.75)$$
$$E[x] = E[\nu] + \mu = 0 + \mu = \mu. \qquad (5.77)$$

As expected, the expected value for the normal distribution is µ (pun definitely intended!).
$$\sigma^2 = \sqrt{\frac{k}{2\pi}} \left[ -\frac{w\, e^{-\frac{k}{2}w^2}}{k} \right]_{-\infty}^{\infty} + \frac{1}{k} \sqrt{\frac{k}{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{k}{2}w^2}\,dw. \qquad (5.79)$$

The first term is zero, and the second one contains the PDF of the normal distribution multiplied by 1/k (compare with Equation (5.75)), and therefore we have that:

$$\sigma^2 = \frac{1}{k},$$
sigma = 0.2
x = np.arange(1, 3, 0.01)
[Figure 5.8: the normal distribution for the kellicam measurements — PDF, CDF, PPF, and random variates (probability, cumulative probability, occurrences vs. outcome).]
The graph of the CDF for the normal distribution can be seen in the top right-hand panel of Figure 5.8. We can use the CDF to find the probability that a measurement is below 2 metres, for example:
0.6826894921370861
2.096088672876906 0.035225815987363816
2.099
0.04000000000000001
$$np > \sqrt{npq}, \qquad np + 2\sqrt{npq} < n,$$
$$np > 4q, \qquad 2\sqrt{npq} < n(1-p),$$
$$np \ge 5, \qquad 5 \le nq. \qquad (5.85)$$
[Figure: the normal curve with the proportions of values per number of standard deviations from the mean — 68.2% within one standard deviation, 13.6% between one and two, 2.1% between two and three, and 0.1% beyond three, on each side.]
$$m_1 = \frac{\mu_1}{\sigma} = \frac{E[(X-\mu)]}{E[(X-\mu)^2]^{1/2}} = \frac{\mu - \mu}{\left(E[(X-\mu)^2]\right)^{1/2}} = 0. \qquad (5.88)$$

The first moment of the normal distribution is m₁ = µ = 0.
$$\phi(t) = E[e^{tX}] = \begin{cases} \sum_{x} e^{tx} P(X = x) & \text{discrete case,} \\ \int_{-\infty}^{\infty} e^{tx} f(x)\,dx & \text{continuous case.} \end{cases} \qquad (5.90)$$

The moment generating function.
$$E(e^{tX}) = E\left[1 + tX + \frac{1}{2}t^2 X^2 + \frac{1}{3!}t^3 X^3 + \cdots\right] = 1 + tE[X] + \frac{1}{2}t^2 E[X^2] + \frac{1}{3!}t^3 E[X^3] + \cdots$$

The Taylor expansion of the moment generating function.

$$\frac{d}{dt} E\left[e^{tX}\right]\Big|_{t=0} = E[X], \qquad \frac{d^2}{dt^2} E\left[e^{tX}\right]\Big|_{t=0} = E[X^2].$$

We can obtain the moments from the Taylor expansion above.
$$tx - \frac{1}{2}x^2 = -\frac{1}{2}\left(x^2 - 2tx\right) = -\frac{1}{2}\left((x-t)^2 - t^2\right).$$
$$m_{n+2} = (n+1)\, m_n. \qquad (5.92)$$
[Figure: examples of positively and negatively skewed distributions.]
$$g_1 = m_3 = \frac{\mu_3}{\sigma^3}. \qquad (5.93)$$

The third moment or skewness.
For data that is normally distributed, the skewness must be close to zero. This is because the skewness for the standard normal is 0:

$$E[X^3] = m_3 = (1+1)\, m_1 = 0. \qquad (5.95)$$
skew(kellicam)
-0.7794228634063756

skew(kellicam, bias=False)
-0.8688392583242721
[Figure: mesokurtic, platykurtic, and leptokurtic distributions.]
$$g_2 = m_4 = \frac{\mu_4}{\sigma^4}. \qquad (5.96)$$

The fourth moment or kurtosis.
Let us see the value of the kurtosis for the standard normal:

$$E[X^4] = m_4 = (2+1)\, m_2 = 3. \qquad (5.97)$$
kurtosis(kellicam)
0.89999999999992

kurtosis(kellicam, fisher=False)
3.89999999999992
To show this, let us define $Y = \frac{X - \mu}{\sigma}$ and $Y_i = \frac{X_i - \mu}{\sigma}$. Then the $Y_i$ are independent, and distributed as Y with mean 0 and standard deviation 1, and $T_n = \sum_i Y_i / \sqrt{n}$. We aim to prove that as n tends to infinity, the moment generating function of $T_n$, i.e. $\phi_{T_n}(t)$, tends to the moment generating function of the standard normal. Let us take a look:
$$\phi_{T_n}(t) = E\left[e^{T_n t}\right] = E\left[e^{\frac{t}{\sqrt{n}} Y_1}\right] \times \cdots \times E\left[e^{\frac{t}{\sqrt{n}} Y_n}\right] = \left( E\left[e^{\frac{t}{\sqrt{n}} Y}\right] \right)^n,$$
$$= \left( 1 + \frac{t}{\sqrt{n}} E[Y] + \frac{t^2}{2n} E[Y^2] + \frac{t^3}{6n^{3/2}} E[Y^3] + \cdots \right)^n,$$
$$= \left( 1 + 0 + \frac{t^2}{2n} + \frac{t^3}{6n^{3/2}} E[Y^3] + \cdots \right)^n \simeq \left( 1 + \frac{t^2}{2n} \right)^n \to e^{t^2/2}. \qquad (5.99)$$

As $n \to \infty$, $\phi_{T_n}(t) \to e^{t^2/2}$.

As we showed in Equation (5.91), $\phi_Z(t) = e^{t^2/2}$ and therefore the limiting distribution is indeed the standard normal N(0, 1). Remember that $e^{t^2/2}$ is the moment generating function of the standard normal distribution.
What should we do? Well, it is only logical (how very Vulcan!) to look at what
print(CIupper - CIlower)
0.950004209703559
1.1936373019551489e-08
As we saw in Equation (5.102), we can define a t-score for the t-distribution, similar to the z-score for the standard normal. The t-score helps us to define the t-test described in Section 6.5.1.
Special cases of the PDF and CDF of the Student's t-distribution for different degrees of freedom ν:

ν = 1:  PDF: $\frac{1}{\pi(1+t^2)}$;  CDF: $\frac{1}{2} + \frac{1}{\pi}\arctan(t)$

ν = 2:  PDF: $\frac{1}{2\sqrt{2}\left(1+\frac{t^2}{2}\right)^{3/2}}$;  CDF: $\frac{1}{2} + \frac{t}{2\sqrt{2}\sqrt{1+\frac{t^2}{2}}}$

ν = 3:  PDF: $\frac{2}{\pi\sqrt{3}\left(1+\frac{t^2}{3}\right)^{2}}$;  CDF: $\frac{1}{2} + \frac{1}{\pi}\left[ \frac{t}{\sqrt{3}\left(1+\frac{t^2}{3}\right)} + \arctan\left(\frac{t}{\sqrt{3}}\right) \right]$

ν = 4:  PDF: $\frac{3}{8\left(1+\frac{t^2}{4}\right)^{5/2}}$;  CDF: $\frac{1}{2} + \frac{3}{8}\,\frac{t}{\sqrt{1+\frac{t^2}{4}}}\left[ 1 - \frac{t^2}{12\left(1+\frac{t^2}{4}\right)} \right]$

ν = 5:  PDF: $\frac{8}{3\pi\sqrt{5}\left(1+\frac{t^2}{5}\right)^{3}}$;  CDF: $\frac{1}{2} + \frac{1}{\pi}\left[ \frac{t}{\sqrt{5}\left(1+\frac{t^2}{5}\right)}\left(1 + \frac{2}{3\left(1+\frac{t^2}{5}\right)}\right) + \arctan\left(\frac{t}{\sqrt{5}}\right) \right]$

ν = ∞:  PDF: $\frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}$;  CDF: $\frac{1}{2}\left[ 1 + \mathrm{erf}\left(\frac{t}{\sqrt{2}}\right) \right]$
[Figure: probability density of the Student's t-distribution — probability density vs. standard deviations.]
nu = 3
[Figure: the Student's t-distribution — PDF, CDF, PPF, and random variates (probability, cumulative probability, occurrences vs. outcome).]
x1 = t.rvs(nu, size=1000)
$$\chi^2 = z^2 = \frac{(X - \mu)^2}{\sigma^2}. \qquad (5.108)$$
As you can see, the values are always positive. In any event, if the categories we are selecting things from are mutually independent, then the sum of individual χ² follows a chi-square distribution, i.e. Σ x² = χ². This means that we can find a chi-square for each of the categories and their sum will also be a chi-square. If that is not the case, the categories are not independent! If the distribution of observations follows a chi-square, the categories are independent.
[Figure: chi-squared distributions with k = 1, 5, and 10 degrees of freedom — frequency vs. chi-square.]
[Figure 5.15: the chi-squared distribution — PDF, CDF, PPF, and random variates (probability, cumulative probability, occurrences vs. outcome).]
k = 3
pdf = chi2.pdf(x, k)
We can see the shape of the CDF in the top right-hand panel
of Figure 5.15. The PPF can be seen in the bottom left-hand
panel of Figure 5.15 and obtained as follows: The CDF and PPF of the chi-
squared distribution is shown
in the top-right and bottom-left
probs = np.arange(0, 1, 0.01)
panels of Figure 5.15.
ppf = chi2.ppf(probs, k)
x1 = chi2.rvs(k, size=1000)
chi2.ppf(0.95,1 )
3.841458820694124
• we reject the null hypothesis and accept the alternative Not rejecting the null hypothesis
one as we have enough evidence in favour of Ha , or does not mean that H0 is
necessarily true!
• we do not reject the null hypothesis because we do not
have enough evidence in favour of the alternative.
• H0 : p = 0.05,
Note that the null hypothesis test
is for equality.
• Ha : p > 0.05.
[Figure 6.1: A schematic way to think about hypothesis testing. The critical value X splits the axis into a "do not reject H0" region and the critical region where we reject H0.]

5.3.2, there are only two outcomes for each trial: either the Model-5 toaster is defective or it is not. If the probability of defect is p = 0.05, in a sample of 100 toasters we expect to have 100 × 0.05 = 5 defective toasters. If we find 10 defective ones, we have strong evidence to reject the null hypothesis.

Our test results in a Bernoulli process as we have only two outcomes: defective or non-defective toaster.

As we can see, we have a Type I error when we have a false positive. In other words, we reject the null hypothesis when in fact it is true. If this was a courtroom case, this is equivalent to convicting a defendant when in fact they are innocent. The probability of committing a Type I error is called the level of significance, and we denote it with the letter α. The lower the value of α, the less likely we are to commit a Type I error. As discussed in Section 5.5 we generally choose values smaller than 0.05.

Type I errors correspond to false positives. The lower the value of α, the less likely we are to erroneously convict an innocent person.
> print(alpha)
0.028188294163416106
beta = 0
print(beta)
0.7219779808448744
X = 45
n = 550
p = 0.05
zs = (X-n*p)/np.sqrt(n*p*(1-p))
alpha = 1-norm.cdf(zs)

print(alpha)
0.0003087467117317555

We are reducing α, and thus our chances of committing a Type I error.

> print(norm.sf(zs))
0.0003087467117317515

We are using the survival function for the normal distribution.

[Figure: rejection regions for a right-tailed test (H0: θ = θ0, Ha: θ > θ0), a left-tailed test (Ha: θ < θ0) and a two-tailed test (Ha: θ ≠ θ0), with the rejection region(s) in the corresponding tail(s) around θ0.]

• Ha : θ < θ0

For the first step, we have that our hypothesis is that the manufacturer used a 48Ω resistor. Our null hypothesis corresponds to the claim made. If we reject the null hypothesis we are saying that there is enough evidence to reject the claim. Stating our hypotheses clearly lets us better interpret the results. When the alternative hypothesis is the

• Ha : µ ≠ 48

-1.9599639845400545

> n = 100
> mu = 48
> X = 46.5
> zs = (X-mu)/(sigma/np.sqrt(n))
> print(zs)
-1.5789473684210527

1.8856518702577993
Now that we have a better understanding of how hypothesis testing works, we will concentrate on more specific ways in which this general methodology can be applied to make sure that ugly facts are not hidden behind alluring arguments, and vice versa.

Hypothesis testing works along these lines for different tests. We will discuss this in the rest of this chapter.

The idea is that if both sets of quantiles come from the same distribution, the plot will show a straight line. The further the plot is from a straight line, the less likely it is that the two distributions are similar. Note that this is only a visual check and it is recommended to use a more formal test as discussed in the rest of this section.

A straight line indicates that the data follows a normal curve.

stats.probplot(df[’normal_example’], dist=’norm’,
    plot=plt)

The more the data points depart from the reference line in the Q-Q plot, the greater the chance the datasets come from different distributions. We can see the Q-Q plot for the skewed dataset by using the corresponding pandas series:
Now that we know how the test considers the null and alternative hypotheses, we can create a function that takes the sample data and returns the result of the test:

def shapiro_test(data, alpha=0.05):
    stat, p = shapiro(data)
    print('stat = {0:.4f}, p-value= {1:.4f}'.format(stat, p))
    if p > alpha:
        print('The data seems to be normally distributed.')
    else:
        print('The data does not seem to be normally distributed.')

> shapiro_test(df[’normal_example’])
stat = 0.9954, p-value= 0.6394

> shapiro_test(df[’skewed_example’])

We can test against the data we know is normally distributed.

following transformation:

Z1(g1) = δ asinh( g1 / (α √µ2) ),   (6.5)

where:

W² = √(2γ2 + 4) − 1,
δ = (ln W)^(−1/2),
α² = 2/(W² − 1).

See Appendix H for expressions for µ2 and γ2.

Y = ( D − 1/(2√π) ) √( 24πn / (12√3 − 37 + 2π) ).   (6.7)

def dagostino_test(data, alpha=0.05):
    k2, p = normaltest(data)
    print('stat = {0:.4f}, p-value= {1:.4f}'.format(k2, p))
    if p > alpha:
        print('The data seems to be normally distributed.')
    else:
        print('The data does not seem to be normally distributed.')

> dagostino_test(df[’normal_example’])
stat = 1.0525, p-value= 0.5908

> dagostino_test(df[’skewed_example’])

We can test against the data we know is normally distributed.

def ks_test(data, alpha=0.05):
    d, p = kstest(data, 'norm')
    print('stat = {0:.4f}, p-value= {1:.4f}'.format(d, p))
    if p > alpha:
        print('The data seems to be normally distributed.')
    else:
        print('The data does not seem to be normally distributed.')

> ks_test(df[’normal_example’])
stat = 0.0354, p-value= 0.8942

> ks_test(df[’skewed_example’])

We can test against the data we know is normally distributed.
• H0 :
X² = ∑ (O − E)²/E.   (6.9)
testing the new replicator and there is a vote for the chosen
Mexican food for the evening meal, with 60 colleagues
going for enchiladas, and 40 voting for tacos.
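The book's code for this vote is not shown here, but a small sketch of the goodness-of-fit calculation in SciPy, assuming an expected 50/50 split between the two dishes, could look like this:

from scipy.stats import chisquare

observed = [60, 40]   # enchiladas, tacos
expected = [50, 50]   # 50/50 split under the null hypothesis (assumed here)

stat, p = chisquare(f_obs=observed, f_exp=expected)
print('stat = {0:.4f}, p-value= {1:.4f}'.format(stat, p))
# X² = (60-50)²/50 + (40-50)²/50 = 4, with p ≈ 0.0455 for 1 degree of freedom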
6.3.2 Independence
In Python we can use the chi2_contingency function in SciPy stats, where we provide as an input an array in the form of a contingency table with the frequencies in each category.

We will use the chi2_contingency function in SciPy.

mexican = np.array(

def chi_indep(table, alpha=0.05):
    stat, p, dof, exp_vals = chi2_contingency(table)
    print(exp_vals)
    print('stat = {0:.4f}, p-value= {1:.4f}'.format(stat, p))
    if p > alpha:
        print('The categories seem to be independent.')
    else:
        print('The categories do not seem to be independent.')

> chi_indep(mexican)
stat = 1.6059, p-value= 0.4480

We can use our function on our contingency table.
Cov(x, y) = (1/n) ∑ᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ)
          = (1/n) ( ∑ᵢ₌₁ⁿ xᵢyᵢ − n x̄ȳ ).   (6.11)

Covariance between variables x and y.

ρ = (1/n) ∑ᵢ ( (xᵢ − x̄)/σₓ )( (yᵢ − ȳ)/σᵧ )
  = Cov(x, y) / (σₓ σᵧ).   (6.12)

The Pearson correlation.

ρ = a·b / (|a||b|).   (6.13)

The Pearson correlation in terms of vectors.
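As a quick illustration of Equation (6.13), not taken from the book, we can compute the Pearson correlation as the cosine between the two mean-centred vectors and check it against NumPy; the data values here are only examples:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Mean-centre the variables to obtain the vectors a and b
a = x - x.mean()
b = y - y.mean()

rho = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(rho, np.corrcoef(x, y)[0, 1])   # the two values agree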
import numpy as np
import pandas as pd
from scipy import stats

gla_cities = pd.read_csv(’GLA_World_Cities_2016.csv’)
cities = gla_cities.copy()
cities.dropna(inplace=True)

stats.pearsonr(cities[’Population’],
    cities[’Approx city radius km’])

(0.7913877430390618, 0.0002602542910062228)

We use the pearsonr module in SciPy.

n = cities.shape[0]

4.843827042161198

> print(p2t)
0.0002602542910061789

> 2*t.sf(tscore, n-2)
0.00026025429100622294

We can also use the survival function.
gla_cities = gla_cities.rename(
    columns={’Approx city radius km’: ’City_Radius’})

> results = smf.ols(’Population ~ City_Radius’,
    data=gla_cities).fit()
> results.params

City_Radius    4.762102e+05

This is the OLS model for the population as a function of the city radius.

results.summary()

No. Observations:    16
Df Residuals:        14
Df Model:             1
R-squared:        0.626
First we can see that the model is using the Population data as the dependent variable and that the model is the ordinary least squares we have discussed above. We have a total of n = 16 observations and we are using k = 1 predictive variable, denoted as Df Model. The Df residuals is the number of degrees of freedom in our model, calculated as n − k − 1.

Df residuals corresponds to the degrees of freedom.

R-squared is known as the coefficient of determination and is related to the Pearson correlation coefficient so that

R²_Adj = 1 − (1 − R²)(n − 1) / (n − k − 1).   (6.20)

See Section 6.4.1 for information on the Pearson correlation.
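As a quick check of Equation (6.20) with the numbers from this summary (R² = 0.626, n = 16, k = 1), the adjusted coefficient of determination is

R²_Adj = 1 − (1 − 0.626)(16 − 1)/(16 − 1 − 1) = 1 − 0.374 × 15/14 ≈ 0.599.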
F-statistic: 23.46
BIC: 510.3
                     coef      std err        t     P>|t|       [0.025        0.975]
Intercept    −1.797 × 10⁶   1.24 × 10⁶   −1.453     0.168   −4.45 × 10⁶   8.56 × 10⁵
City_Radius   4.762 × 10⁵   9.83 × 10⁴    4.844     0.000    2.65 × 10⁵   6.87 × 10⁵

There are a few other measures that are presented at the end of the summary:

• d = 2 indicates no autocorrelation

• d < 2 indicates positive autocorrelation

• d > 2 indicates negative autocorrelation

Values for d such that 1.5 ≤ d ≤ 2.5 are considered to be no cause for concern.
We have decided to get schwifty and join the famous Bear with me!
• H0 : The correlation between the variables is zero, i.e. r = 0

We have a two-tailed test!

where R(·) denotes the rank. If there are no values that can be assigned the same rank, we say that there are no ties in our data. We can simplify our expression by noting that

Remember that tied observations have the same value and thus it is not possible to assign them a unique rank.

Cov(R(x), R(y)) = σ² − (1/(2n)) ∑ d²,   (6.25)

r = 1 − ( 6/(n³ − n) ) ∑ d².   (6.26)

The Spearman correlation for no ties.
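To make Equation (6.26) concrete, here is a small sketch, not the book's code, that ranks two example variables with pandas, applies the no-ties formula and compares the result with SciPy:

import pandas as pd
from scipy import stats

x = pd.Series([86, 97, 99, 100, 101, 103, 106, 110, 112, 113])
y = pd.Series([2, 20, 28, 27, 50, 29, 7, 17, 6, 12])

d = x.rank() - y.rank()              # differences between the ranks
n = len(x)
r_formula = 1 - 6 * (d**2).sum() / (n**3 - n)

r_scipy, p = stats.spearmanr(x, y)
print(r_formula, r_scipy)            # both give the same coefficient (no ties here)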
SciPy called spearmanr. As with the Pearson correlation, the implementation will return the value of the coefficient and the p-value to help us test our hypothesis.

The Spearman correlation can be calculated with spearmanr.

schwifty = pd.DataFrame({

> r, p = stats.spearmanr(schwifty[’judge1’],

r=0.8148148148148148, p=0.004087718984741058

A typical case for the use of sample testing is to check the mean value of a normally distributed dataset against a reference value. Our test statistic is the Student t-distribution we encountered in Section 5.5.1, given by the following expression:

We covered the use of a z-test in Section 6.1.1.

t = (X̄ − µ) / (s/√n),   (6.27)

• Ha : µ > µ0,

• Ha : µ < µ0,

• Ha : µ ≠ µ0.
Let us revisit our test for the Model-5 toaster: The Caprica City manufacturer is still claiming that their toasters have a

See Section 6.1.

xbar = np.mean(data)
s = np.std(data, ddof=1)

> print(xbar)
49.58166666666668
> print(s)
9.433353624563248

mu = 48
n = len(data)

print(pval)
0.5730709487398256

Ttest_1sampResult(
    statistic=0.5808172016728843,
    pvalue=0.5730709487398256)
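The intermediate step that produces pval is not reproduced above; a minimal sketch of how the statistic and the p-value can be obtained, both with Equation (6.27) and directly with SciPy's ttest_1samp, is the following (it assumes the xbar, s, n, mu and data objects defined above):

import numpy as np
from scipy import stats

tstat = (xbar - mu) / (s / np.sqrt(n))        # Equation (6.27)
pval = 2 * stats.t.sf(abs(tstat), n - 1)      # two-sided p-value
print(tstat, pval)

stats.ttest_1samp(data, mu)                   # same result from the sample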
    format(stat, p))
    if p > alpha:
        print(‘‘Can’t Reject the null hypothesis. There
        is no evidence to suggest that the mean is
    else:

A helpful function to interpret a one-sample t test.
• Ha : p ≠ p0.

• nobs – The number of trials or observations, with the same length as count

The parameters used in proportions_ztest.

    nobs=100, value=0.36)

> print(’stat = {0:.4f}, p-value= {1:.4f}’.\
    format(stat, pval))

Let us look at the Cylon detector values.

> p0 = 0.75
> print(zscore)
-8.804591605141793

if p > alpha:
else:
    count/n))

> proportion_test(count, n, p)
stat = 8.8046, p-value= 0.0000

We reject the null hypothesis in this case.
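The definition of proportion_test is only partially visible above; a sketch of a helper with that interface, built on the proportions_ztest function from Statsmodels mentioned earlier (the wording of the printed messages is an assumption), might look like this:

from statsmodels.stats.proportion import proportions_ztest

def proportion_test(count, nobs, p0, alpha=0.05):
    # z-test for a single proportion against the reference value p0
    stat, pval = proportions_ztest(count=count, nobs=nobs, value=p0)
    print('stat = {0:.4f}, p-value= {1:.4f}'.format(stat, pval))
    if pval > alpha:
        print('We cannot reject the null hypothesis.')
    else:
        print('We reject the null hypothesis; the observed '
              'proportion is {0:.4f}.'.format(count / nobs))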
• Ha : m ≠ m0.

where,

ψᵢ = { 1  if Xᵢ − m0 > 0,
     { 0  if Xᵢ − m0 < 0.      (6.31)

E(W) = E(U) = ∑ᵢ₌₁ⁿ E(Uᵢ),
     = (1/2) ∑ᵢ₌₁ⁿ i = (1/2) · n(n + 1)/2,
     = n(n + 1)/4.   (6.32)

The expected value of W.

Var(Uᵢ) = (1/2)·0² + (1/2)·i² − (i/2)² = i²/4,

Var(W) = (1/4) · n(n + 1)(2n + 1)/6.   (6.34)

The variance of W.
            alpha=0.05):
    stat, p = wilcoxon(data-m0, alternative=alt)
    print(’stat ={0:.4f}, p-value= {1:.4f}’.
        format(stat, p))
    if p > alpha:
    else:
        {0}.’’.format(m0))

A function to help us interpret the results of a Wilcoxon test.
> lieutenants=np.array([
• Ha : µ1 − µ2 < c,

• Ha : µ1 − µ2 ≠ c.

t = ( (x̄1 − x̄2) − (µ1 − µ2) ) / ( s_p √(1/n1 + 1/n2) ),   (6.38)

A test statistic for two samples with pooled variance.

import pandas as pd
cars = pd.read_csv(DATA/’cars.csv’)

am
0    17.147368
1    24.392308

    alternative=’two-sided’,
    usevar=’pooled’)
Note that the values obtained are the same, but we get the
degrees of freedom with the Statsmodels implementation.
of group i:

where:

Zij = |Yij − Ȳi|,  where Ȳi is the mean of the i-th group (Levene test),
Zij = |Yij − Ỹi|,  where Ỹi is the median of the i-th group (Brown-Forsythe test),

Zi· = (1/Ni) ∑ⱼ₌₁^Ni Zij is the mean of the Zij for group i,

Z·· = (1/N) ∑ᵢ₌₁ᵏ ∑ⱼ₌₁^Ni Zij is the mean of all Zij.
Comparing two means when the datasets are not homoscedastic requires a different approach to the one discussed in Section 6.6.1. Using a pooled variance, as we did before, works well to detect evidence to reject H0 if the variances are equal. However, the results can lead to erroneous conclusions in cases where the population variances are not equal, i.e. we have heteroscedasticity. The test was proposed by Bernard Lewis Welch¹⁵ as an adaptation to the Student t-test.

We use Welch’s test when comparing means of two samples with different variances.

¹⁵ Welch, B. L. (1947). The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika 34(1-2), 28–35.

The approach is still the same for our null and alternative hypotheses and the main difference is the consideration of the variances. In this case, since we do not assume equality, we have that the test statistic is

t = ( (x̄1 − x̄2) − (µ1 − µ2) ) / √( s1²/n1 + s2²/n2 ),   (6.40)

where Vi = si²/ni.

    alternative=’two-sided’, equal_var=False)

> print(’stat = {0:.4f}, p-value= {1:.4f}’.
    format(tstat, p))

The Welch test is implemented in ttest_ind with the parameter equal_var=False.
U1 = n1 n2 + n1(n1 + 1)/2 − R1,   (6.42)

U2 = n1 n2 + n2(n2 + 1)/2 − R2,   (6.43)

The Mann-Whitney test statistic. We can select the larger of U1 and U2 too.

with Ri being the sum of the ranks for group i. Note that U1 + U2 = n1 n2. The test statistic U ranges between:

• 0: Complete separation of the two samples, and thus H0 most likely to be rejected, and

The U statistic ranges between 0 and n1 n2.

   Headezine        Kabezine
1      4         1      8
2      2         2      7
3      6         3      5
4      2         4     10
5      3         5      6
6      5         6      9
7      7         7      8
8      8

> np.median(headezine)
4.5
> np.median(kabezine)
8.0

The medians of the two treatments seem to be different. Is this statistically significant?

    alternative=’less’)

> print(’stat = {0:.4f}, p-value= {1:.4f}’.
    format(U, p))

To assert which sample has a higher median, we conduct a new (one-tailed) test.
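For completeness, here is a small sketch, not the book's exact code, that runs the test on the two treatment samples as read from the table above and also checks the identity U1 + U2 = n1 n2 from Equations (6.42)–(6.43):

import numpy as np
from scipy.stats import mannwhitneyu, rankdata

headezine = np.array([4, 2, 6, 2, 3, 5, 7, 8])
kabezine = np.array([8, 7, 5, 10, 6, 9, 8])

# One-tailed test: is the headezine median lower than the kabezine one?
U, p = mannwhitneyu(headezine, kabezine, alternative='less')
print('stat = {0:.4f}, p-value= {1:.4f}'.format(U, p))

# Equations (6.42)-(6.43) from the rank sums of the pooled sample
ranks = rankdata(np.concatenate([headezine, kabezine]))
n1, n2 = len(headezine), len(kabezine)
R1, R2 = ranks[:n1].sum(), ranks[n1:].sum()
U1 = n1 * n2 + n1 * (n1 + 1) / 2 - R1
U2 = n1 * n2 + n2 * (n2 + 1) / 2 - R2
print(U1 + U2 == n1 * n2)   # True: U1 + U2 = n1 * n2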
• Ha : µ1 − µ2 < c,

• Ha : µ1 − µ2 ≠ c.

t = (x̄_d − c) / (s_d/√n),   (6.44)

The test statistic for a paired sample t-test.

Let us see the mean and (sample) standard deviation for the differences:

> df[’difference’].describe()
count    20.000000      25%    0.300000

These are the descriptive statistics for our Kabezine study.

> shapiro_test(df[’pre’])
> shapiro_test(df[’post’])

    format(tstat, p))

on average, Kabezine does offer improvements to the headaches in Starfleet colleagues affected by the gas under investigation.

Kabezine does offer improvements to Starfleet officers.
The paired test follows the same logic as the single sample version we saw in Section 6.5.3. The null hypothesis is that the medians of the populations are equal. As before we can express this as H0: m1 − m2 = c. The alternative hypothesis can be any of the following:

The hypotheses for a Wilcoxon matched pairs test.

• Ha : m1 − m2 > c,

• Ha : m1 − m2 < c,

• Ha : m1 − m2 ≠ c.

where,

ψᵢ = { 1  if Xᵢ − Yᵢ > 0,
     { 0  if Xᵢ − Yᵢ < 0.      (6.46)

    ’post’: cerritos_post})

0.8999999999999999

    format(tstat, p))

difference is statistically significant. In this case we used a two-sided test as we left the alternative parameter out of the command, using the default value.

Commander T’Ana can now start thinking of a wider study including more subjects.
SST = ∑ᵢ ∑ⱼ xij² − T²/n,   (6.48)

The total sum of squares.

SSB = ∑ᵢ Tᵢ²/nᵢ − T²/n,   (6.49)

The sum of squares between samples.

MSB = SSB/(k − 1), mean square between samples,   (6.52)

MSW = SSW/(n − k), mean square within samples.   (6.53)

X = (S1/ν1) / (S2/ν2),   (6.55)

f(x, ν1, ν2) = [ x^(ν1/2 − 1) / B(ν1/2, ν2/2) ] (ν1/ν2)^(ν1/2) ( 1 + (ν1/ν2) x )^(−(ν1+ν2)/2),   (6.56)

σ² = 2ν2²(ν1 + ν2 − 2) / ( ν1(ν2 − 2)²(ν2 − 4) ),   for ν2 > 4.   (6.58)
• Type I: Sequential sum of squares: First assigns maximum variation to A: SS(A), then to B: SS(B|A), followed by the interaction SS(AB|B, A), and finally to the residuals. In this case the order of the variables makes a difference and in many situations this is not what we want.

Type I looks at the factors sequentially.

• Type II: Sum of squares no interaction: First assigns the

Type II ignores the interaction.

• Type III: Sum of squares with interaction: Here we look

Type III considers the interaction.

as Li and thus µi = µ + Li. The assertion above would indicate that the sum of Li is equal to 0. We can interpret Li as the mean effect of factor level i relative to the mean µ. With this information we are able to define the following model:

xij = µ + Li + εij,   (6.60)

Li is the mean effect of factor level i relative to the mean µ.

Within samples    SSW    n − k    MSW
Let us capture this data into NumPy arrays and look at the
means. Assuming we have already imported NumPy:
> five = np.array([14, 17, 12, 14, 22, 19, 16, 17,
eight.mean())
With an F statistic of 59.3571 we have a p-value lower than α = 0.05; at a 95% confidence level we can reject the null hypothesis. We conclude that at least one of the means is different from the others.

See Section 6.7.2 to see how Tukey’s range test can help find the sample with the difference.

we can resort to the fact that we can express our problem as a linear model and use Statsmodels to help us. The first thing to mention is that we may need to reorganise our data into a long form table. We will therefore create a pandas dataframe with the information from Table 6.9:

To obtain more information we use Statsmodels and recast our problem as a linear regression.

import pandas as pd

toasters = pd.DataFrame({
    ’eight’: eight})

toasters_melt = pd.melt(
    toasters.reset_index(),
    id_vars=[’index’],
    value_vars=toasters.columns,
    var_name=’toastermodel’,
    value_name=’excesshours’)

We then melt the dataframe to obtain a long form table.

import statsmodels.api as sm

> print(anova_toasters)

    sum_sq    df    F    PR(>F)

This starts looking like the result summary from Table 6.8.
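The construction of anova_toasters itself is not visible above; it presumably follows the same pattern used later in the chapter, an OLS fit on the long-form table followed by anova_lm, along these lines:

from statsmodels.formula.api import ols
import statsmodels.api as sm

# Fit the one-way ANOVA as a linear model on the melted table
model = ols('excesshours ~ C(toastermodel)', data=toasters_melt).fit()
anova_toasters = sm.stats.anova_lm(model, typ=2)
print(anova_toasters)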
This table still does not show the sum of squares and the totals. We can write a function to help us with that:

def anova_summary(aov):
    aov2 = aov.copy()
    aov2[’mean_sq’] = aov2[:][’sum_sq’]/aov2[:][’df’]
    cols = [’sum_sq’, ’df’, ’mean_sq’, ’F’, ’PR(>F)’]
    aov2.loc[’Total’] = [aov2[’sum_sq’].sum(),
    aov2 = aov2[cols]
    return aov2

A function to calculate the sum of squares and the totals.

> anova_summary(anova_toasters)

    sum_sq   df   mean_sq          F        PR(>F)
                          59.357143  1.306568e-10
                                NaN           NaN
                                NaN           NaN
Let us take a look and note that we are considering that the
cars dataset has been loaded as before:
> anova_summary(anova_mpg1)
sum_sq df mean_sq F
PR(>F)
0.000285
NaN
NaN
8 15.100000
> anova_summary(anova_mpg2)

               sum_sq     df     mean_sq          F        PR(>F)
C(cyl)     824.784590    2.0  412.392295  39.697515  4.978919e-09
Residual   301.262597   29.0   10.388365        NaN           NaN
Total     1126.047188   31.0         NaN        NaN           NaN

The ANOVA summary for the comparison of the mean fuel consumption per number of cylinders.

statsmodels.stats.multicomp:

from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey_test = pairwise_tukeyhsd(
    endog=toasters_melt[’excesshours’],
    groups=toasters_melt[’toastermodel’],
    alpha=0.05)

> print(tukey_test)

We can run a Tukey’s range test with the pairwise_tukeyhsd function in Statsmodels.

SS_M = n ∑ⱼᵏ (x̄·ⱼ − x̄··)²,   (6.64)

Sum of squares for our model.
    ’starfleetHeadacheTreatment.csv’)

Our data is in a neat table.

> headache.head(3)
Ibuprenine
5.7
4.4
5.9

    value_vars=headache.columns,
    var_name=’drug’, value_name=’health’)

We can get the tail of the dataframe to see how things look:

> h_melt.tail(4)

    index        drug   health
76     17  Ibuprenine      4.9
77     18  Ibuprenine      4.4
78     19  Ibuprenine      5.3
79     20  Ibuprenine      4.4

We now have a long format table.

> print(aovrm)

Anova
With Rij being the rank of the j-th observation in the i-th group, and R̄i· being the average rank of the observations in the i-th group, the H test statistic for the Kruskal-Wallis test is given by:

H = (n − 1) · [ ∑ᵢ₌₁ᵏ nᵢ (R̄i· − R̄)² ] / [ ∑ᵢ₌₁ᵏ ∑ⱼ₌₁^nᵢ (Rij − R̄)² ],   (6.65)

The Kruskal-Wallis test statistic.

where R̄ = (n + 1)/2, i.e. the average of all the Rij. When the data does not have ties, we can express the test statistic as:

H = ( 12/(n(n + 1)) ) ∑ᵢ₌₁ᵏ nᵢ R̄i·² − 3(n + 1).   (6.66)

See Appendix I.

    89.0, 90.7]
    78.2, 84.8]

project_book = [85.1, 95.5, 93.8, 100.0, 88.2,
    91.7, 93.2]

We capture the data from Table 6.10 in a pandas dataframe.

df = pd.DataFrame({’instructor’: instructor,
    ’holographic’: holographic,
    ’project_book’: project_book})

> df.median()

    df[’holographic’], df[’project_book’])

> print(’stat = {0:.4f}, p-value= {1:.4f}’.
    format(hstat, p))

Running a Kruskal-Wallis test indicated that the differences are indeed statistically significant.
[Table: general layout of the data for a two-way ANOVA. For each row level R1, …, Rr and column level C1, …, Cc the cell holds the n observations x_{RiCj1}, …, x_{RiCjn}; the cell totals are denoted T_{RiCj} and the row totals T_{Ri}.]

∑ T²_{RC} = T²_{R1C1} + T²_{R1C2} + ⋯ + T²_{RrCc}.   (6.69)

SST = ∑ᵣ ∑_c ∑ₖ x²_{rck} − T²/N,   (6.70)

Total sum of squares.

SSR = ∑ T²_R / (nc) − T²/N,   (6.71)

Sum of squares between rows.

SSC = ∑ T²_C / (nr) − T²/N,   (6.72)

Sum of squares between columns.

SSRC = ∑ T²_{RC} / n − T²/N − SSR − SSC,   (6.73)

Sum of squares of interactions.

Between columns    SSC     c − 1             MSC     MSC/MSE

Interaction        SSRC    (r − 1)(c − 1)    MSRC    MSRC/MSE

The logic is the same as for one-way ANOVA, and we look at the F distribution to assess our hypothesis. For two-way ANOVA we have the following set of null hypotheses:

Remember that the F statistic is the ratio of the mean squares in question. See Equation (6.55).

• The sample means for the first factor are all equal

• The sample means for the second factor are all equal

• There is no interaction between the two factors

The hypotheses of a two-way ANOVA.

η² = SS_effect / SST,   (6.75)
import pandas as pd
df = pd.read_csv(’Python_Study_Scores.csv’)

    index=[’Delivery’], columns=[’Revision’],
    aggfunc=np.mean)

> print(table)

Revision   Mock Exam   Weekly Quiz
Delivery

A pivot table can let us summarise the mean scores of delivery v revision.

> model = ols(formula, df).fit()
> aov_table = sm.stats.anova_lm(model, typ=2)

the help of a linear model and the anova_lm method.

Notice that in this case our model takes into account each of the factors and the interaction:

β_dr ∗ Delivery ∗ Revision.

> anova_summary(aov_table)

    sum_sq    df

def effect_sizes(aov):
    aov[’eta_sq’] = ’NaN’
    aov[’omega_sq’] = ’NaN’
    aov[’omega_sq’] = (aov[:-1][’sum_sq’]-
        (aov[:-1][’df’]*mse))/(sum(aov[’sum_sq’])+mse)
    return aov

    eta_sq   omega_sq
y = β0 + β1 x.   (6.77)

For the Pearson correlation, the null hypothesis corresponds to having a zero slope in our model, i.e. H0: β1 = 0. We use Statsmodels instead of scipy.stats.pearsonr(x, y):

The Pearson correlation corresponds to H0: β1 = 0 in our linear model.

The linear model will give you the slope, not the coefficient r. If you require the correlation coefficient, simply scale the data by the standard deviation.

def signed_rank(df):
    return np.sign(df) * df.abs().rank()

A function to calculate signed ranks.
y = β 0 + β 1 xi , (6.81)
y2 − y1 = β 0 , (6.83)
Et voilà!
7
Delightful Details – Data Visualisation
A table may be more appropriate in situations where all the information requires equal attention or when we want to enable viewers to pick the data of their interest. Similarly, when referring to certain data points in a written paragraph, a table can be of help. In contrast, a graphical representation will enable viewers to have a more general overview of the data. We can use the visual elements described in the previous section to help our viewers understand the results and draw their attention to specific aspects. Let us take a look at each of the methods mentioned above.

A table is appropriate when all information requires equal attention. A plot can let us get a general overview of the data.

A distinctive advantage of using tables is that we are able to present both quantitative and qualitative information. We are able to present data that may be difficult to show in a graph, for example by showing figures within a significant

A table lets us present quantitative and qualitative data.

> df = pd.read_csv(’jackalope.csv’)
> df.shape
(261, 3)

We have a dataset with 261 observations and 3 columns.

> df.describe()
> df.corr()
> df = pd.read_csv(’anscombe.csv’)
> df.head(3)

  Dataset   x     y
0       A  10  8.04
1       A   8  6.95
2       A  13  7.58

               x                y
            mean   std       mean   std
Dataset

The mean and standard deviations of the four groups are the same.

df.groupby(’Dataset’).corr()

                   x         y
Dataset
A        x  1.000000  0.816421
C        x  1.000000  0.816287
         y  0.816287  1.000000
D        x  1.000000  0.816521
         y  0.816521  1.000000

same line of best fit. You may think that is great, until you look at the graphic representation of these datasets as shown in Figure 7.2.

However, when we plot the data we see how different the sets are.
[Figure 7.2: The four Anscombe datasets A–D. Each panel has the same mean (7.50), standard deviation (2.03) and correlation (r = 0.82), yet the scatter plots look completely different.]

• Ordered data: Refers to the use of natural ordering of the data in question. For example t-shirt sizes: small, medium, large and extra large. Suitable operators include: =, ≠, <, >

Ordered data exploits natural ordering in the observations.
Part of that value comes from the fact that the data representation requires “visual integrity”. What Tufte talks about in this respect is almost a moral requirement, in the sense that the representation chosen should not distort the underlying data and it should not create a false impression or interpretation. For example, the dimensions used in an image should be dictated by the data itself. Any variations that happen in the representation should relate to the data

Visual integrity implies not distorting the underlying data. It avoids creating a false interpretation or impression.

Data-ink ratio = Data-ink / Total ink used.   (7.1)

Tufte’s data-ink ratio.

example in Figure 7.4, where we are asked to determine how many letters “t” there are in the sequence. By presenting the information in a way that appeals to our pre-attentive perception, we can more easily see that there are 7 letters “t” in panel B. We may be able to describe pre-attentiveness as being “seen” rather than “understood”.

Pre-attentive processing is the unconscious cognitive processing of a stimulus prior to attention being engaged.
of the bars. We can still tell that one is larger than the other
one, but crucially in this case we may be more easily able to
tell that the larger bar is about three and a half times longer
than the shorter bar.
There are some really good modules that support very nice visuals, such as Seaborn, or interactivity, such as Bokeh. The standard library used for plotting in Python is called matplotlib⁷ and if you are familiar with MATLAB there is an API called pyplot that uses similar syntax.

⁷ Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering 9(3), 90–95.

Matplotlib supports the object orientation used in Python, and offers a comprehensive set of tools to create static, interactive and animated visualisations. As usual, the first step is to load the module:

In a Jupyter notebook you can use the magic command %pylab inline to load NumPy and matplotlib.

import numpy as np
import matplotlib.pyplot as plt

values a and b.

plt.plot(x, y)
plt.show()

plt.xlabel(r’$x$’, fontsize=14)
plt.ylabel(r’$y$’, fontsize=14)
plt.title(r’Plot of $\sin(x)$’,
    fontsize=16)

Adding a title and axes labels.
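Putting these pieces together, a minimal self-contained sketch of the kind of figure produced here is shown below; the exact range of x is an assumption based on the plot shown:

import numpy as np
import matplotlib.pyplot as plt

# Assumed range: the plotted curve runs roughly from -pi to pi
x = np.linspace(-np.pi, np.pi, 200)
y = np.sin(x)

plt.plot(x, y)
plt.xlabel(r'$x$', fontsize=14)
plt.ylabel(r'$y$', fontsize=14)
plt.title(r'Plot of $\sin(x)$', fontsize=16)
plt.show()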
[Figure: plot of sin(x), with x on the horizontal axis and y on the vertical axis.]

In this case we have a grey dotted line of width 0.5. The end result of all the commands above can be seen in Figure 7.7.

    label=r’$\cos(x)$’)

7.6 Subplots

We have seen how to plot two line charts in the same axes. This is useful when a comparison between the plots needs to be highlighted. In other cases we may want to create separate figures for each plot. In this case we can exploit the fact that we are using subplots to generate our figure. The

We can use the subplots command to plot our data in separate figures.
Apart from specifying the size of the overall figure, the first command contains the following syntax: subplots(2, 2). This specifies that the window should be split into a 2 × 2 array. The object axs is now an array of axes, and we can refer to each one of them with relevant array notation. We can see that we have plotted the sin(x) and cos(x) functions in the first two axes, and then we added plots for x and x² to the other two.

We could request a figure with m rows and n columns with subplots(m, n).

Each of the axes in the figure has its own title, labels, legend, etc. We can add labels to the x axes for the bottom row of plots, and to the y axes of the first column of plots:
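The book's exact commands are not reproduced here; a minimal sketch of the whole arrangement, including the axis labels just described, could look like this (the range of x and the figure size are assumptions):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 200)
fig, axs = plt.subplots(2, 2, figsize=(10, 8))

# One curve per axis: sin(x), cos(x), x and x squared
axs[0, 0].plot(x, np.sin(x)); axs[0, 0].set_title(r'$\sin(x)$')
axs[0, 1].plot(x, np.cos(x)); axs[0, 1].set_title(r'$\cos(x)$')
axs[1, 0].plot(x, x);         axs[1, 0].set_title(r'$x$')
axs[1, 1].plot(x, x**2);      axs[1, 1].set_title(r'$x^2$')

# Labels on the x axes of the bottom row and the y axes of the first column
for ax in axs[1, :]:
    ax.set_xlabel(r'$x$')
for ax in axs[:, 0]:
    ax.set_ylabel(r'$y$')

plt.show()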
We can modify the range of the axes with the help of set_xlim() and set_ylim(). Here we will change the x

We can specify the axes limits with set_xlim() and set_ylim().

axs[1, 0].set_xlim(-3, 3)
axs[1, 1].set_xlim(-3, 3)
We can flatten the axs object to run a loop over it. In this
case we will add a title counting the number of axes:
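The loop itself is not shown above; a small sketch of what it could look like (the title wording is an assumption) is:

# Flatten the 2x2 array of axes and number each panel in its title
for i, ax in enumerate(axs.flatten()):
    ax.set_title('Axis {0}'.format(i + 1))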
Let us now add a colour bar to aid with reading the values plotted and include some labels. The result can be seen in Figure 7.10.

    location=’left’)

ax.set_xlabel(r’$x$’, fontsize=12)
ax.set_ylabel(r’$y$’, fontsize=12)
ax.set_zlabel(r’$z=\exp(-x^2-y^2)$’, fontsize=12)
plt.show()

We finally add a colour bar and some labels.

is the goal of the visualisation? That clarification can be sharpened by considering who the intended audience for the visualisation is. One thing that I find useful is to think about what the viewer is trying to achieve when presented with the graphic, and take into account the questions they may care about. It is sometimes useful to direct the query to them in the first place. In that way you are more likely to create a visualisation that addresses their needs.

Consider the questions a viewer may care about. That will make your visual more meaningful.
– Composition and structure: Use stacked bars or pie charts

Pie charts only if there are a small number of components.

1. Distribution: We are interested to use charts that show us how items are distributed to different parts. Some charts that are useful for this are line charts, histograms and scatter plots.

For distribution, use line charts, histograms or scatter plots.

2. Relationship: A typical use of statistics is to show the relationship between variables. Scatter plots or bubble charts are great choices for this.

For a relationship, use scatter plots or bubble charts.

3. Trend: Sometimes we need to indicate the direction or

For a trend, use line or bar charts.

5. Composition: Sometimes we are interested to find out how the whole is composed of different parts. A typical example of this is a pie chart. Area charts and stacked bar charts are also good for this aim.

For composition, use stacked bars or pie charts.

import numpy as np
import pandas as pd

Remember that the data may be different in your computer because we are using random numbers.

timeSeries = np.random.randn(1000, 2)

cm = df.cumsum()

Note that we are using the plot method for the pandas dataframe, and we are passing arguments to determine the style of each line. We also add a title to the plot. The styling follows the same syntax discussed in 7.4.2. The result can be seen in Figure 8.1.
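The step that builds df from timeSeries and the plot call itself are not visible above; given that the Bokeh example later uses columns named A and B, a plausible sketch is:

import numpy as np
import pandas as pd

timeSeries = np.random.randn(1000, 2)
# Column names A and B are assumed from the later Bokeh example
df = pd.DataFrame(timeSeries, columns=['A', 'B'])
cm = df.cumsum()

cm.plot(style=['-', '--'], title='Random time series (cumulative sum)')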
8.2.2 Seaborn

automatically maps data values to visual attributes such as style, colour and size. It also adds informative labels such as axis names and legends. To use Seaborn all you need to do is install the library, and import it as follows:

Seaborn understands statistical analysis.

import seaborn as sns

sns.set_theme(style=’whitegrid’)

    linewidth=2.5).\

    legend_label=’A’)
p.line(x=’idx’, y=’B’, source=cm,
    legend_label=’B’, line_color=’red’,
    line_dash=’dashed’)
p.legend.location = ’top_left’

Note that we are adding elements to the p object that holds our Bokeh figure.

you will see a menu that lets you interact with the chart by panning and zooming and even a way to export your chart.
8.2.4 Plotly
import plotly.express as px
find some controls to interact with the plot, letting you pan,
zoom, autoscale and export the chart.
fig, ax = plt.subplots(figsize=(10,8))
ax.scatter(x=gla_cities[’Population’],
(matplotlib)’)
Let us now create the plot with this new information using pandas with the help of the scatter method. In this case we are using the mapping above as the value passed to the c attribute, which manages the colour of the marker. The

We are printing the plots in black and white, but you should be able to see them in full colour on your computer.

gla_cities.plot.scatter(
    x=’Population’, y=’Approx city radius km’,
    s=50, c=colours)

A scatter plot created with pandas.

x = gla_cities[’Population’]

We can now pass these objects to the scatter method. Note that we are passing the array called s as the size of the markers, and using the colours mapping created for the pandas example. The result can be seen in Figure 8.8.

Bokeh offers a lot of flexibility, but it may take more code to get there.

p = figure(width=750, height=500)
p.xaxis.axis_label = ’Population’
    alpha=0.6)
show(p)

import pandas_bokeh
pandas_bokeh.output_notebook()

An alternative is to use Pandas Bokeh, which supports pandas dataframes.

We now create a new column to hold the sizes of the markers based on the People per dwelling column:

gla_cities[’bsize’]=10*\

    size=’bsize’, alpha=0.6,
    legend = ’bottom_right’)
Finally, let us create the scatter plot with the help of Plotly.
In this case we call the scatter method of Plotly Express.
The result is shown in Figure 8.10.
fig.show()
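The Plotly Express call itself does not appear above; a minimal sketch of what it might look like, reusing the same columns (the size mapping is an assumption), is:

import plotly.express as px

fig = px.scatter(gla_cities,
                 x='Population', y='Approx city radius km',
                 size='bsize')   # marker size column created earlier (assumed)
fig.show()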
df = pd.read_csv(’jackalope.csv’)

sns.jointplot(x=’x’, y=’y’, data=df,
    kind=’reg’, color=’black’);

We are creating a scatter plot with Seaborn and we are requesting a regression line to be added.

pop = gla_cities.groupby(’Country’)[[
    ’Population’]].sum()

ax.bar(x=pop.index, height=pop[’Population’],
    color=’gray’, edgecolor=’k’)
ax.set_xlabel(’Country’)
ax.set_ylabel(’Population (for the Cities in the dataset)’)

Creating a bar chart with matplotlib can easily be done.

The result can be seen in Figure 8.12. Note that we are using the index of the grouped dataframe for the horizontal axis, whereas the height is given by the population column of the grouped dataframe. We also specify the colour and the edge line for all the bars.

Let us re-create the same plot but this time using the methods of the pandas dataframe itself:
the dataset)’)
the dataset)’)
    ci=None, estimator=sum)

interval with ci=None. Note that we are not required to use pandas to group the data first. Finally, we can orient the plot by placing the categorical variable in either the x or y parameter. The result can be seen in Figure 8.15.

We do not need to group the data first when using Seaborn.

p_stacked_hbar = popf.plot_bokeh(
    x=’Country’, stacked=True, kind=’barh’,
    the dataset)’)

A horizontal bar chart created with Pandas Bokeh.

cat = list(’ABCDEFG’)

We now calculate a percentage of the total for each entry:

To create our pie chart, we need to calculate a percentage of the total.

df[’pct’]=df[’value’]/df[’value’].sum()

plt.axis(’equal’)
The segments are so similar in size that it is difficult to distinguish which may be larger or even which are actually the same. We may have to add legends to support our viewers, but this implies altering the data-ink ratio of our chart. If the segments shown are more obviously different in size from each other, a pie chart is great. However, as we have seen in the example above, we may be asking for extra cognitive power from our viewers. This gets much worse as we add more and more segments to the pie, making it impossible to read. These are some of the reasons why there are some detractors to the use of pie charts.

If the segments are similar in size, it may be difficult to draw a comparison. See Section 7.3 about the data-ink ratio.

df.set_index(’category’).plot.pie(y=’pct’,
    wedgeprops=dict(width=0.3),
    autopct="%1.0f%%", legend=False)

    x=’category’, y=’pct’ )

    names=’category’, hole=0.7)
fig.show()

Since the plots obtained with Bokeh and Plotly are very similar to the one obtained with matplotlib we are not showing them here. Please note that Seaborn relies on matplotlib to create pie charts so we are also skipping that. All in all, pie charts are recommended to be used only in cases where there is no confusion, and in general you are better off using a bar chart. One thing you should definitely avoid is to use 3D pie charts or exploding pies, even if for comedic effects!

The only 3D pies you should have (if at all) are edible ones.
8.7 Histogram
plt.ylabel(’Frequency’)
    color=’black’)
plt.legend()

We are using the default bins (i.e. 10) and we are also adding a suitable legend to our plot so that we can

    bins=15, kde=True)

Seaborn has a histplot function that creates beautiful histograms.

plt.ylabel(’Frequency’)
plt.xlabel(’Miles per Gallon’)
plt.legend(title=’Transmission’,
    loc=’upper right’,
    labels=[’Automatic’, ’Manual’])

df.plot_bokeh.hist(bins=10,
    ylabel=’Freq’, xlabel=’Miles Per Gallon’,
    line_color="black")

A histogram with Pandas Bokeh.

    nbins=16)
fig.update_layout(barmode=’overlay’)
fig.update_traces(opacity=0.75)
fig.show()

A histogram with Plotly.
past the whiskers of the boxplot. The body of the box plot is given by the first and third quartiles (Q1 and Q3 in the diagram), the line inside the box represents the median of the data and the whiskers show the maximum and minimum values.

Let us take a look at creating a box plot for the miles per gallon variable in our cars dataset.

plt.boxplot(cars[’mpg’])
plt.xlabel(’MPG’)

Figure 8.26: Anatomy of a boxplot.

Notice that the width of the box does not have meaning, but it is possible to use other representations. For example, in the case of the so-called violin plot, instead of using a rectangle for the body of the graphic, we use approximate density distribution curves. Let us use Seaborn to see the different things we can do to represent the same data:

We can use other representations apart from a box. Be mindful that the chart may be more difficult to interpret.

The code above will generate the chart shown in the left panel of Figure 8.29.

You may notice that in the violin plot, right in the middle of each density curve, there is a small box plot, showing a little box with the first and third quartiles and a central dot to show the median. Violin plots are sometimes difficult to read, so I would encourage you to use them sparingly.

There is a small box plot inside the violin plot!
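The Seaborn commands themselves are not reproduced above; a minimal sketch of a violin plot next to the equivalent box plot of the same variable, assuming the cars dataframe from earlier, could be:

import matplotlib.pyplot as plt
import seaborn as sns

fig, axs = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: violin plot; right panel: the equivalent box plot
sns.violinplot(y=cars['mpg'], ax=axs[0], color='gray')
sns.boxplot(y=cars['mpg'], ax=axs[1], color='gray')
plt.show()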
    points=’all’)
fig.show()

A box plot with Plotly.

Consider the data in Table 8.3 showing the number of recent first encounters made by notable Starfleet ships. We assume

For any fellow Star Trek fans, the data is totally made up.

plt.xlabel(’Star Date’)
plt.ylabel(’Encounters’)

    stacked=False)
plt.xticks(idxdf.index, xlabels)
plt.xlabel(’Star Date’)
plt.ylabel(’Encounters’)
p=df.plot_bokeh.area(x=’stardate’,
ylabel=’Encounters’,
xlabel=’Star Date’)
fig.show()
cars[’transmission’] = cars[’am’].map(wrd)

carcount = cars.groupby([’transmission’, ’cyl’]) \
    .count().rename(columns = {’Car’: ’cars’}) \
    .reset_index()

We first group our data to create a table to show the count of cars per transmission and number of cylinders.

    columns = ’transmission’)

> carheatmap = carheatmap[’cars’].copy()
> print(carheatmap)

cyl
4      8     3
6      3     4
8      2    12

A pivot table lets us read the count table in a straightforward way.

import matplotlib as mp

color_map = mp.cm.get_cmap(’binary’)
plt.pcolor(carheatmap, cmap = color_map)
xt=np.arange(0.5, len(carheatmap.columns), 1)
yt=np.arange(0.5, len(carheatmap.index), 1)
plt.xticks(xt, carheatmap.columns)
plt.yticks(yt, carheatmap.index)
plt.colorbar()

The pivot table can be used to create our heatmap with matplotlib.

carheatmap.style.background_gradient(
    cmap=’binary’, axis=None)\
    .set_properties(**{’font-size’: ’20px’})

In pandas we can modify the style to create a heatmap directly in our dataframe.

fig = px.imshow(carheatmap,
    text_auto=True)
fig.show()
transm=list(carcount[’transmission’].unique())
mapper = LinearColorMapper(palette=colors,
low=carcount.cars.min(),
high=carcount.cars.max())
TOOLS = ’hover,save,pan,box_zoom,reset,wheel_zoom’
tools=TOOLS, toolbar_location=’below’,
tooltips=[(’transmission’, ’@transmission’),
’transform’: mapper},
line_color=’gray’)
show(p)
A
Variance: Population v Sample
σ² = (1/n) ∑ᵢ₌₁ⁿ (xᵢ − µ)².   (A.1)

s² = ( n/(n − 1) ) σ²_b,   (A.3)

s² = ( 1/(n − 1) ) ∑ᵢ₌₁ⁿ (xᵢ − X̄)²,   (A.4)

Sn = 1 + 2 + 3 + ⋯ + n   (B.1)

Sn = n + (n − 1) + (n − 2) + ⋯ + 1   (B.2)

2Sn = (1 + n) + (2 + n − 1) + (3 + n − 2) + ⋯ + (n + 1)   (B.3)

    = (n + 1) + (n + 1) + (n + 1) + ⋯ + (n + 1)   (B.4)

n³ = 3 ∑ₖ₌₁ⁿ k² − 3 n(n + 1)/2 + n   (C.5)

3 ∑ₖ₌₁ⁿ k² = n³ + 3 n(n + 1)/2 − n   (C.6)

∑ₖ₌₁ⁿ k² = n³/3 + n(n + 1)/2 − n/3   (C.7)

∑ₖ₌₁ⁿ k² = n(n + 1)(2n + 1)/6.   (C.8)
D
The Binomial Coefficient
n(n − 1)(n − 2) … (n − k + 1) = n!/(n − k)! = P(n, k),   (D.1)

given by

C(n, k) = n!/( (n − k)! k! ) = (n choose k).   (D.2)

C(n − 1, k − 1) = (n − 1 choose k − 1),   (D.3)

                = (n − 1)! / ( ((n − 1) − (k − 1))! (k − 1)! ),   (D.4)

                = (n − 1)! / ( (n − k)! (k − 1)! ).   (D.5)

(n choose k) = n(n − 1)! / ( (n − k)! k (k − 1)! ),   (D.6)

             = (n/k) · (n − 1)! / ( (n − k)! (k − 1)! ),   (D.7)

(n choose k) = (n/k) (n − 1 choose k − 1).   (D.8)
E
The Hypergeometric Distribution
f(k, N, K, n) = C(K, k) C(N − K, n − k) / C(N, n),   (E.1)

and let us keep the ratio K/N = p fixed. We want to show that:

lim_{N→∞} f(k, N, K, n) = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ.   (E.2)

C(K, k) C(N − K, n − k) / C(N, n)
  = K!/( k!(K − k)! ) · (N − K)!/( (n − k)!(N − n − (K − k))! ) · n!(N − n)!/N!,   (E.3)

  = n!/( k!(n − k)! ) · K!/(K − k)! · (N − K)!/(N − n − (K − k))! · (N − n)!/N!,   (E.4)

  = C(n, k) · [ K!/(K − k)! ] / [ N!/(N − k)! ] · (N − K)!(N − n)! / ( (N − k)!(N − K − (n − k))! ),   (E.5)

  = C(n, k) · [ K!/(K − k)! ] / [ N!/(N − k)! ] · [ (N − K)!/(N − K − (n − k))! ] / [ (N − n + (n − k))!/(N − n)! ],   (E.6)

  = C(n, k) · ∏ᵢ₌₁ᵏ (K − k + i)/(N − k + i) · ∏ⱼ₌₁ⁿ⁻ᵏ (N − K − (n − k) + j)/(N − n + j),   (E.7)

and similarly:

lim_{N→∞} (N − K − (n − k) + j)/(N − n + j) = lim_{N→∞} (N − K)/N = 1 − p.   (E.9)

Hence:

lim_{N→∞} f(k, N, K, n) = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ.   (E.10)
F
The Poisson Distribution
dP(n, t)/dt + lP(n, t) = lP(n − 1, t).   (F.1)

ν(t) [ dP(n, t)/dt + lP(n, t) ] = d/dt [ ν(t) P(n, t) ].   (F.2)

d/dt [ e^{lt} P(n, t) ] = e^{lt} lP(n − 1, t).   (F.4)

d/dt [ e^{lt} P(1, t) ] = e^{lt} lP(0, t) = l e^{lt} e^{−lt} = l,   (F.5)

e^{lt} P(1, t) = ∫ l dt = lt + C₁.   (F.6)

P(X) = lim_{n→∞} C(n, k) (µ/n)ᵏ (1 − µ/n)ⁿ⁻ᵏ,   (F.8)

     = lim_{n→∞} n!/( k!(n − k)! ) (µ/n)ᵏ (1 − µ/n)ⁿ⁻ᵏ,   (F.9)

     = lim_{n→∞} [ n!/( (n − k)! nᵏ ) ] (µᵏ/k!) (1 − µ/n)ⁿ (1 − µ/n)⁻ᵏ.   (F.10)

Let us look at the ratio n!/(n − k)! for the case where the integer n is greater than k. This can be expressed as the successive product of n with (n − i) down to (n − (k − 1)):

n!/(n − k)! = n(n − 1)(n − 2) ⋯ (n − k + 1),   k < n,   (F.11)

lim_{n→∞} n(n − 1)(n − 2) ⋯ (n − k + 1) / nᵏ = 1.   (F.12)

lim_{n→∞} (1 − x/n)⁻ᵏ = 1,   (F.15)

P(X) = (µᵏ/k!) e^{−µ},   (F.16)
∫_{−∞}^{∞} f(x) dx = C ∫_{−∞}^{∞} exp[ −(k/2)(x − µ)² ] dx = 1.   (G.1)
Sadly, the integral above does not have a representation in
terms of elementary functions. However, there are some
things we can do to evaluate it. We can follow the steps of
Poisson himself.
∫_{−∞}^{∞} f(x) dx = C √(2/k) ∫_{−∞}^{∞} e^{−u²} du = 1.   (G.2)

J² = ∫_{−∞}^{∞} e^{−x₁²} dx₁ ∫_{−∞}^{∞} e^{−y₁²} dy₁,
   = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{−(x₁² + y₁²)} dx₁ dy₁.   (G.4)

   = [ −(1/2) e^{−r²} ]₀^∞ ∫₀^{2π} dθ,
   = (1/2) ∫₀^{2π} dθ,
   = (1/2) θ |₀^{2π},
   = π,

and therefore J = √π.

C = √( k/(2π) ).   (G.6)

df(x)/dx = −( 1/(σ³√(2π)) ) (x − µ) exp[ −(1/2)((x − µ)/σ)² ] = 0,   (G.7)

d²f(x)/dx² = −( 1/(σ³√(2π)) ) [ (x − µ)²/σ² − 1 ] e^{−(1/2)((x − µ)/σ)²} = 0.   (G.8)
µ₁(g₁) = 0,

µ₂(g₁) = 6(n − 2) / ( (n + 1)(n + 3) ),

γ₁(g₁) = µ₃(g₁)/µ₂(g₁)^(3/2) = 0,

γ₂(g₁) = µ₄(g₁)/µ₂(g₁)² − 3
       = 36(n − 7)(n² + 2n − 5) / ( (n − 2)(n + 5)(n + 7)(n + 9) ).

µ₁(g₂) = −6/(n + 1),

µ₂(g₂) = 24n(n − 2)(n − 3) / ( (n + 1)²(n + 3)(n + 5) ),

γ₁(g₂) = [ 6(n² − 5n + 2) / ( (n + 7)(n + 9) ) ] √( 6(n + 3)(n + 5) / ( n(n − 2)(n − 3) ) ),
∑ᵢ₌₁ᵏ ∑ⱼ₌₁^nᵢ ( Rij − R̄ )² = ∑ᵢ₌₁ᵏ ∑ⱼ₌₁^nᵢ ( Rij² − 2 Rij R̄ + R̄² ).   (I.2)

∑ᵢ₌₁ᵏ ∑ⱼ₌₁^nᵢ Rij = n(n + 1)/2.   (I.3)

∑ᵢ₌₁ᵏ ∑ⱼ₌₁^nᵢ Rij² − ∑ᵢ₌₁ᵏ ∑ⱼ₌₁^nᵢ 2 Rij R̄ + ∑ᵢ₌₁ᵏ ∑ⱼ₌₁^nᵢ R̄².   (I.4)

2R̄ · n(n + 1)/2 = n(n + 1)²/2.   (I.6)

Finally, the third term of Equation (I.4) shows that R̄² appears n times and thus we can express it as:

n R̄² = n(n + 1)²/4.   (I.7)

(n + 1)(n² − n)/12 = (n + 1)n(n − 1)/12.   (I.8)

∑ᵢ₌₁ᵏ nᵢ ( R̄i· − R̄ )² = ∑ᵢ₌₁ᵏ nᵢ ( R̄i·² − 2 R̄i· R̄ + R̄² ),
                      = ∑ᵢ₌₁ᵏ ( nᵢ R̄i·² − nᵢ 2 R̄i· R̄ + nᵢ R̄² ),
                      = ∑ᵢ₌₁ᵏ nᵢ R̄i·² − ∑ᵢ₌₁ᵏ nᵢ 2 R̄i· R̄ + ∑ᵢ₌₁ᵏ nᵢ R̄².   (I.9)

∑ᵢ₌₁ᵏ nᵢ 2 R̄i· R̄ = (n + 1) ∑ᵢ₌₁ᵏ ∑ⱼ₌₁^nᵢ Rij.   (I.10)

∑ᵢ₌₁ᵏ nᵢ R̄² = ( (n + 1)²/4 ) ∑ᵢ₌₁ᵏ nᵢ = n(n + 1)²/4.   (I.11)

Let us now plug back Equations (I.8), (I.9) and (I.10) as well as (I.11) into our original Expression (I.1):

H = [ 12(n − 1) / ( n(n + 1)(n − 1) ) ] ( ∑ᵢ₌₁ᵏ nᵢ R̄i·² − n(n + 1)²/2 + n(n + 1)²/4 ),

  = ( 12/(n(n + 1)) ) ∑ᵢ₌₁ᵏ nᵢ R̄i·² − ( 12/(n(n + 1)) ) · n(n + 1)²/4,

  = ( 12/(n(n + 1)) ) ∑ᵢ₌₁ᵏ nᵢ R̄i·² − 3(n + 1).   (I.12)