
Statistics and Data Visualisation with Python

This book is intended to serve as a bridge in statistics for graduates and business practitioners interested in using their skills in the area of data science and analytics as well as statistical analysis in general. On the one hand, the book is intended to be a refresher for readers who have taken some courses in statistics, but who have not necessarily used it in their day-to-day work. On the other hand, the material can be suitable for readers interested in the subject as a first encounter with statistical work in Python. Statistics and Data Visualisation with Python aims to build statistical knowledge from the ground up by enabling the reader to understand the ideas behind inferential statistics and begin to formulate hypotheses that form the foundations for the applications and algorithms in statistical analysis, business analytics, machine learning, and applied machine learning. This book begins with the basics of programming in Python and data analysis, to help construct a solid basis in statistical methods and hypothesis testing, which are useful in many modern applications.
Chapman & Hall/CRC
The Python Series

About the Series
Python has been ranked as the most popular programming language, and it is widely used in education and industry. This book series will offer a wide range of books on Python for students and professionals. Titles in the series will help users learn the language at an introductory and advanced level, and explore its many applications in data science, AI, and machine learning. Series titles can also be supplemented with Jupyter notebooks.

Image Processing and Acquisition using Python, Second Edition
Ravishankar Chityala, Sridevi Pudipeddi

Python Packages
Tomas Beuzen and Tiffany-Anne Timbers

Statistics and Data Visualisation with Python
Jesús Rogel-Salazar

For more information about this series please visit: https://www.crcpress.com/Chapman--HallCRC/book-series/PYTH
Statistics and Data Visualisation with Python

Jesús Rogel-Salazar
First edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

and by CRC Press


4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2023 Taylor & Francis Group, LLC

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact mpkbookspermissions@tandf.co.uk

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Names: Rogel-Salazar, Jesus, author.
Title: Statistics and data visualisation with Python / Dr. Jesús Rogel-Salazar.
Description: First edition. | Boca Raton, FL : CRC Press, 2023. | Series: Chapman & Hall/CRC Press the python series | Includes bibliographical references and index.
Identifiers: LCCN 2022026521 (print) | LCCN 2022026522 (ebook) | ISBN 9780367749361 (hbk) | ISBN 9780367744519 (pbk) | ISBN 9781003160359 (ebk)
Subjects: LCSH: Mathematical statistics--Data processing. | Python (Computer program language) | Information visualization.
Classification: LCC QA276.45.P98 R64 2023 (print) | LCC QA276.45.P98 (ebook) | DDC 519.50285/5133--dc23/eng20221026
LC record available at https://lccn.loc.gov/2022026521
LC ebook record available at https://lccn.loc.gov/2022026522

ISBN: 978-0-367-74936-1 (hbk)
ISBN: 978-0-367-74451-9 (pbk)
ISBN: 978-1-003-16035-9 (ebk)

DOI: 10.1201/9781003160359

Typeset in URWPalladioL-Roman
by KnowledgeWorks Global Ltd.

Publisher’s note: This book has been prepared from camera-ready copy provided by the author.
To Luceli, Rosario and Gabriela

Thanks and lots of love!


Contents

1 Data, Stats and Stories – An Introduction 1


1.1 From Small to Big Data 2

1.2 Numbers, Facts and Stats 10

1.3 A Sampled History of Statistics 14

1.4 Statistics Today 22

1.5 Asking Questions and Getting Answers 25

1.6 Presenting Answers Visually 30

2 Python Programming Primer 33


2.1 Talking to Python 35

2.1.1 Scripting and Interacting 38

2.1.2 Jupyter Notebook 41

2.2 Starting Up with Python 42

2.2.1 Types in Python 43



2.2.2 Numbers: Integers and Floats 43

2.2.3 Strings 46

2.2.4 Complex Numbers 49

2.3 Collections in Python 51


2.3.1 Lists 52

2.3.2 List Comprehension 60

2.3.3 Tuples 61

2.3.4 Dictionaries 66

2.3.5 Sets 72

2.4 The Beginning of Wisdom: Logic & Control Flow 80


2.4.1 Booleans and Logical Operators 80

2.4.2 Conditional Statements 82

2.4.3 While Loop 85

2.4.4 For Loop 87

2.5 Functions 89

2.6 Scripts and Modules 94

3 Snakes, Bears & Other Numerical Beasts: NumPy, SciPy & pandas 99
3.1 Numerical Python – NumPy 100
3.1.1 Matrices and Vectors 101

3.1.2 N-Dimensional Arrays 102



3.1.3 N-Dimensional Matrices 104

3.1.4 Indexing and Slicing 107

3.1.5 Descriptive Statistics 109

3.2 Scientific Python – SciPy 112

3.2.1 Matrix Algebra 114

3.2.2 Numerical Integration 116

3.2.3 Numerical Optimisation 117

3.2.4 Statistics 118

3.3 Panel Data = pandas 121

3.3.1 Series and Dataframes 122

3.3.2 Data Exploration with pandas 124

3.3.3 Pandas Data Types 125

3.3.4 Data Manipulation with pandas 126

3.3.5 Loading Data to pandas 130

3.3.6 Data Grouping 136

4 The Measure of All Things – Statistics 141


4.1 Descriptive Statistics 144

4.2 Measures of Central Tendency and Dispersion 145

4.3 Central Tendency 146

4.3.1 Mode 147

4.3.2 Median 150



4.3.3 Arithmetic Mean 152

4.3.4 Geometric Mean 155

4.3.5 Harmonic Mean 159

4.4 Dispersion 163

4.4.1 Setting the Boundaries: Range 163

4.4.2 Splitting One’s Sides: Quantiles, Quartiles, Percentiles and More 166

4.4.3 Mean Deviation 169

4.4.4 Variance and Standard Deviation 171

4.5 Data Description – Descriptive Statistics Revisited 176

5 Definitely Maybe: Probability and Distributions 179


5.1 Probability 180

5.2 Random Variables and Probability Distributions 182

5.2.1 Random Variables 183

5.2.2 Discrete and Continuous Distributions 185

5.2.3 Expected Value and Variance 186

5.3 Discrete Probability Distributions 191

5.3.1 Uniform Distribution 191

5.3.2 Bernoulli Distribution 197

5.3.3 Binomial Distribution 201

5.3.4 Hypergeometric Distribution 208

5.3.5 Poisson Distribution 216



5.4 Continuous Probability Distributions 223

5.4.1 Normal or Gaussian Distribution 224

5.4.2 Standard Normal Distribution Z 235

5.4.3 Shape and Moments of a Distribution 238

5.4.4 The Central Limit Theorem 245

5.5 Hypothesis and Confidence Intervals 247

5.5.1 Student’s t Distribution 253

5.5.2 Chi-squared Distribution 260

6 Alluring Arguments and Ugly Facts – Statistical Modelling and Hypothesis Testing 267
6.1 Hypothesis Testing 268

6.1.1 Tales and Tails: One- and Two-Tailed Tests 273

6.2 Normality Testing 279

6.2.1 Q-Q Plot 280

6.2.2 Shapiro-Wilk Test 282

6.2.3 D’Agostino K-squared Test 285

6.2.4 Kolmogorov-Smirnov Test 288

6.3 Chi-square Test 291

6.3.1 Goodness of Fit 291

6.3.2 Independence 293



6.4 Linear Correlation and Regression 296

6.4.1 Pearson Correlation 296

6.4.2 Linear Regression 301

6.4.3 Spearman Correlation 308

6.5 Hypothesis Testing with One Sample 312

6.5.1 One-Sample t-test for the Population Mean 312

6.5.2 One-Sample z-test for Proportions 316

6.5.3 Wilcoxon Signed Rank with One-Sample 320

6.6 Hypothesis Testing with Two Samples 324

6.6.1 Two-Sample t-test – Comparing Means, Same Variances 325

6.6.2 Levene’s Test – Testing Homoscedasticity 330

6.6.3 Welch’s t-test – Comparing Means, Different Variances 332

6.6.4 Mann-Whitney Test – Testing Non-normal Samples 334

6.6.5 Paired Sample t-test 338

6.6.6 Wilcoxon Matched Pairs 342

6.7 Analysis of Variance 345

6.7.1 One-factor or One-way ANOVA 347

6.7.2 Tukey’s Range Test 360

6.7.3 Repeated Measures ANOVA 361

6.7.4 Kruskal-Wallis – Non-parametric One-way ANOVA 365

6.7.5 Two-factor or Two-way ANOVA 369



6.8 Tests as Linear Models 376


6.8.1 Pearson and Spearman Correlations 377

6.8.2 One-sample t- and Wilcoxon Signed Rank Tests 378

6.8.3 Two-Sample t- and Mann-Whitney Tests 379

6.8.4 Paired Sample t- and Wilcoxon Matched Pairs Tests 380

6.8.5 One-way ANOVA and Kruskal-Wallis Test 380

7 Delightful Details – Data Visualisation 383


7.1 Presenting Statistical Quantities 384
7.1.1 Textual Presentation 385

7.1.2 Tabular Presentation 385

7.1.3 Graphical Presentation 386

7.2 Can You Draw Me a Picture? – Data Visualisation 387


7.3 Design and Visual Representation 394
7.4 Plotting and Visualising: Matplotlib 402
7.4.1 Keep It Simple: Plotting Functions 403

7.4.2 Line Styles and Colours 404

7.4.3 Titles and Labels 405

7.4.4 Grids 406

7.5 Multiple Plots 407


7.6 Subplots 407
7.7 Plotting Surfaces 410
7.8 Data Visualisation – Best Practices 414

8 Dazzling Data Designs – Creating Charts 417


8.1 What Is the Right Visualisation for Me? 417

8.2 Data Visualisation and Python 420

8.2.1 Data Visualisation with Pandas 421

8.2.2 Seaborn 423

8.2.3 Bokeh 425

8.2.4 Plotly 428

8.3 Scatter Plot 430

8.4 Line Chart 438

8.5 Bar Chart 440

8.6 Pie Chart 447

8.7 Histogram 452

8.8 Box Plot 459

8.9 Area Chart 464

8.10 Heatmap 468

A Variance: Population v Sample 477

B Sum of First n Integers 479

C Sum of Squares of the First n Integers 481



D The Binomial Coefficient 483


D.1 Some Useful Properties of the Binomial Coefficient 484

E The Hypergeometric Distribution 485


E.1 The Hypergeometric vs Binomial Distribution 485

F The Poisson Distribution 487


F.1 Derivation of the Poisson Distribution 487
F.2 The Poisson Distribution as a Limit of the Binomial Distribution 488

G The Normal Distribution 491


G.1 Integrating the PDF of the Normal Distribution 491
G.2 Maximum and Inflection Points of the Normal Distribution 493

H Skewness and Kurtosis 495

I Kruskal-Wallis Test – No Ties 497

Bibliography 501

Index 511
List of Figures

1 Do you know what type of data you have? Use this flow chart to guide you about the tests you may want to use. xxxiii

1.1 Diagrammatic representation of the data project workflow. 26
1.2 The power of questions on a scale. 29

2.1 Slicing and dicing a list. 54


2.2 Conditional control flow. 83
2.3 While loop control flow. 85
2.4 For loop control flow. 87

3.1 Scatter plot of synthetic data and a line of best fit obtained with simple matrix operations. 115
3.2 Average population in millions for the city classification
created for the GLA report data. 139

5.1 Probability distribution of the number of S faces in Two-Face’s fair coin flipped 100 times. 190
5.2 The uniform distribution. 195
5.3 The Bernoulli distribution. 200
5.4 The binomial distribution. 207

5.5 The hypergeometric distribution. 215


5.6 The Poisson distribution. 222
5.7 Measures of a standard imperial Klingon kellicam in
metres. 226
5.8 The normal or Gaussian distribution. 231
5.9 The empirical rule gives us approximations about the
percentage of data observations within a number of
standard deviations from the mean. 236
5.10 Positive and negative skewed distributions. 242
5.11 Kurtosis of different distributions. 244
5.12 Probability distributions for the Student’s t-distribution
for ν = 1, 5, 30. For comparison we show the normal
distribution as a dashed curve. 257
5.13 The Student’s t-distribution for ν = 3. 259
5.14 Probability density function for the chi-square
distribution for different degrees of freedom. 262
5.15 The chi-squared distribution for k = 3. 263

6.1 A schematic way to think about hypothesis testing. 269
6.2 One- v Two-Tail Tests. 275
6.3 Q-Q plots for a normally distributed dataset, and a
skewed dataset. 282
6.4 Two possible datasets to be analysed using analysis of
variance. We have three populations and their means.
Although the means are the same in both cases, the
variability in set 2) is greater. 345

7.1 A scatter plot for the jackalope.csv dataset. 389



7.2 Anscombe’s quartet. All datasets have the same summary statistics, but they have very different distributions. 392
7.3 Visual variables and their ease of perception. 397
7.4 How many letters “t” are there in the sequence? 398
7.5 Compare the area of the circles v compare the length of
the bars. 399
7.6 Combining visual variables can help our visualisations
be more effective. 400
7.7 Plot of the function y = sin( x ) generated with
matplotlib. 406
7.8 Plot of the functions sin( x ) and cos( x ) generated with
matplotlib. 408
7.9 Subplots can also be created with matplotlib. Each
subplot can be given its own labels, grids, titles, etc.
409
7.10 A surface plot obtained with the plot_surface
command. Please note that this requires the generation
of a grid with the command meshgrid. 412

8.1 Time series plot created with pandas. 422


8.2 Time series plot created with Bokeh. 425
8.3 Time series plot created with seaborn. 427
8.4 Time series plot created with Plotly. 429
8.5 Scatterplot of city population versus its approximate
radius size. The plot was created with
matplotlib. 431

8.6 Scatterplot of city population versus its approximate radius size, the colour is given by the city size category in the dataset. The plot was created with pandas. 432
8.7 Bubble chart of city population versus its approximate
radius size. The colour is given by the city size category
in the dataset, and the marker size by the people per
dwelling. The plot was created with Seaborn. 433
8.8 Bubble chart of city population versus its approximate
radius size. The colour is given by the city size category
in the dataset, and the marker size by the people per
dwelling. The plot was created with Bokeh. 434
8.9 Bubble chart of city population versus its approximate
radius size, the colour is given by the city size category
in the dataset, and the marker size by the people per
dwelling. The plot was created with Bokeh using the
Pandas Bokeh backend. 436
8.10 Bubble chart of city population versus its approximate
radius size. The colour is given by the city size category
in the dataset, and the marker size by the people per
dwelling. The plot was created with Plotly. 437
8.11 A scatter plot for the jackalope.csv dataset including a
regression line and marginal histograms created with
the jointplot method of Seaborn. 440
8.12 A bar chart for the total population per country for the
cities contained in our dataset. The plot was created
with matplotlib. 442
8.13 A horizontal bar chart for the total population per
country for the cities contained in our dataset. The plot
was created with pandas. 443

8.14 A stacked bar chart for the total population per country
for the cities contained in our dataset categorised by
city size. The plot was created with pandas. 444
8.15 A column bar for the total population per country for
the cities contained in our dataset categorised by city
size. The plot was created with Seaborn. 445
8.16 A stacked bar chart for the total population per country
for the cities contained in our dataset categorised by
city size. The plot was created with Pandas
Bokeh. 446
8.17 A stacked bar chart for the total population per country
for the cities contained in our dataset categorised by
city size. The plot was created with Plotly. 447
8.18 Top: A pie chart of the information shown in Table 8.2.
The segments are very similar in size and it is difficult
to distinguish them. Bottom: A bar chart of the same
data. 449
8.19 A donut chart of the data from Table 8.2 created with
pandas. 451
8.20 A histogram of the miles per gallon variable in the cars
dataset. The chart is created with matplotlib. 454
8.21 Histogram of the miles per gallon as a function of the
type of transmission. The chart is created with
pandas. 455
8.22 Histogram of the miles per gallon as a function of the
type of transmission. The chart is created with
Seaborn. 456
8.23 Histogram of the miles per gallon as a function of the
type of transmission. The chart is created with Pandas
Bokeh. 457

8.24 Histogram of the miles per gallon as a function of the type of transmission. The chart is created with Plotly. 458
8.25 Pairplot of the cars dataset showing the relationship
between miles per gallon and horse power per
transmission type. 459
8.26 Anatomy of a boxplot. 460
8.27 Box plot of the miles variable in the cars dataset. The
chart is created with matplotlib. 460
8.28 Box plot of the miles per gallon as a function of the
type of transmission. The chart is created with
pandas. 461
8.29 Left: Box plots of the miles per gallon as a function of
the type of transmission. Middle: Same information but
including a swarm plot. Right: Same information
represented by violin plots. Graphics created with
Seaborn. 462
8.30 Box plot of the miles per gallon as a function of the
type of transmission. The chart is created with
Plotly. 463
8.31 Area plot of the data from Table 8.3 created using
matplotlib. 465
8.32 Unstacked area plot of the data from Table 8.3 created
with pandas. 466
8.33 Area plot of the data from Table 8.3 created using
Pandas Bokeh. 467
8.34 Area plot of the data from Table 8.3 created using
Plotly. 468

8.35 Heatmap of the number of cars by transmission type and number of cylinders. Plot created using matplotlib. 470
8.36 Heatmap of the number of cars by transmission type
and number of cylinders in a pandas dataframe. 470
8.37 Heatmap of the number of cars by transmission type
and number of cylinders created with Seaborn. 471
8.38 Heatmap of the number of cars by transmission type
and number of cylinders created with Plotly. 472
8.39 Heatmap of the number of cars by transmission type
and number of cylinders created with Bokeh. 474
List of Tables

1.1 Common orders of magnitude for data 3


1.2 Ranking of companies by market capitalisation in
billions of U.S. dollars in 2020 7
1.3 Examples of outcomes and outputs. 28

2.1 Arithmetic operators in Python. 43


2.2 List methods in Python. 60
2.3 Tuple methods in Python. 64
2.4 Dictionary methods in Python. 72
2.5 Set methods in Python. 79
2.6 Comparison and logical operators in Python. 82

3.1 Student marks in three different subjects. 109


3.2 Population and area of some selected global
cities. 123
3.3 Pandas data types. 126
3.4 Some of the input sources available to pandas. 131

4.1 Ratings for some Château Picard wine. 158

5.1 Probability of our coin flipping experiment. 184



5.2 Special cases of the PDF and CDF for the Student’s
t-distribution with different degrees of freedom. 256

6.1 Types of errors when performing hypothesis testing. 270
6.2 Results from a random sample of 512 Starfleet officers
about their preference for Mexican food and their
rank. 294
6.3 Results from the regression analysis performed on the
brain and body dataset. 306
6.4 Results of the planetary Cromulon musical
competition. 311
6.5 Ratings for the perceived effect of two Starfleet pain
relief medications. 336
6.6 Pre- and post-treatment general health measures of
Starfleet volunteers in the Kabezine study. 339
6.7 Pre- and post-treatment general health measures of USS
Cerritos volunteers in the Kabezine study. 343
6.8 Table summarising the results of an analysis of variance
(ANOVA). 352
6.9 Performance of three Caprica City toasters in hours in
excess of 1500 hours of use. 353
6.10 Results of the different methods to learn Python for
Data Science at Starfleet Academy. 368
6.11 A typical data arrangement for a two-factor
ANOVA. 370
6.12 Table summarising the results of a two-way analysis of
variance (two-way ANOVA). 372

7.1 Effectiveness ranking of perceptual tasks for different visual variables. 400

7.2 Colours and line styles that can be used by matplotlib. 405
7.3 Names of colormaps available in matplotlib. 413

8.1 Given the question of interest, and the type of data provided, this table provides guidance on the most appropriate chart to use. 418
8.2 A table of values to create a pie chart and compare to a
bar chart. 448
8.3 First encounters by notable Starfleet ships. 464
Preface

“This is the last time” are the words I remember thinking after finishing the corrections of Advanced Data Science and Analytics with Python [1]. However, I do know myself and here we are again. Actually, this is exactly what happens after finishing running a half-marathon: After the race I think I would not sign up for another one, and then a few weeks later I am training again. The same has happened with this book. Although I thought I was not going to write another one, the niggling feeling was there and the end result is in your hands.

[1] Rogel-Salazar, J. (2020). Advanced Data Science and Analytics with Python. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. CRC Press

The motivation for the book has been the conversations with colleagues and students about the need to have a statistical bridge for graduates and business practitioners interested in using their skills in the area of data science and analytics. The book is also intended to be a refresher for readers that have taken some courses in statistics, but who have not necessarily used it in their day-to-day work. Having said that, the material covered can be suitable for readers interested in the subject as a first encounter with statistical work in Python.

Statistics and Data Visualisation with Python aims to build statistical knowledge from the ground up, enabling us to understand the ideas behind inferential statistics and formulate hypotheses that can serve as the basis for the applications and algorithms in business analytics, machine learning and applied machine learning. These statistical concepts underpin applications in data science and machine learning. The book starts with the basics of programming in Python and data analysis to construct a solid background in statistical methods and hypothesis testing useful in a variety of modern applications.

As with my previous books, Python is the chosen language to implement computations; we use Python 3 in this book. Unlike other books in statistics, where step-by-step manual calculations are shown, we concentrate on the use of programming to obtain statistical quantities of interest. To that end, we make use of a number of modules and packages that Pythonistas have created. We assume that you have access to a computer with Python 3.x installed and you are encouraged to use a Jupyter notebook. For reference, the versions of some of the packages used in the book are as follows:

Python - 3.8.x          pandas - 1.3.x
NumPy - 1.21.x          SciPy - 1.7.x
StatsModels - 0.13.x    Matplotlib - 3.5.x
Seaborn - 0.11.x        Plotly Express - 5.6.x
Bokeh - 2.4.x           pandas Bokeh - 0.5
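If you would like to confirm that your own environment matches these versions, a minimal sketch along the following lines can be used; the distribution names listed here are assumptions and may need adjusting to your installation:

# Print the installed version of each package used in the book.
from importlib.metadata import version, PackageNotFoundError

packages = ["pandas", "numpy", "scipy", "statsmodels",
            "matplotlib", "seaborn", "plotly", "bokeh"]

for name in packages:
    try:
        print(name, version(name))
    except PackageNotFoundError:
        print(name, "not installed")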

As before, I am using the Anaconda Python distribution [2] provided by Continuum Analytics. Remember that there are other ways of obtaining Python as well as other versions of the software: For instance, directly from the Python Software Foundation (https://www.python.org), as well as distributions from Enthought Canopy (https://www.enthought.com/products/epd/), or from package managers such as Homebrew (http://brew.sh).

[2] Anaconda (2016, November). Anaconda Software Distribution. Computer Software. V. 2-2.4.0. https://www.anaconda.com

We show computer code by enclosing it in a box as follows:

> 1 + 1  # Example of computer code

We use a diple (>) to denote the command line terminal prompt shown in the Python shell. Keeping to the look and feel of the previous books, we use margin notes, such as the one that appears to the right of this paragraph, to highlight certain areas or commands, as well as to provide some useful comments and remarks.

The book starts with an introduction to what statistics is and how it has evolved over the years from the administrative activities around a city and its population, to the powerful tool on which a lot of us come to rely on a day-to-day basis. The first chapter serves as a preamble to the rest of the book and can be read independently of the rest of the book.

Since we will be using Python throughout the book, in Chapter 2 we present a programming primer that provides the basics of Python, from assigning variables and managing collections like lists, tuples and dictionaries, to building programming logic using loops and conditionals. In Chapter 3 we build from the basics of Python to start using some powerful modules that let us delve into statistical analysis in an easy way. The chapter introduces NumPy, SciPy and pandas to manipulate data and start extracting information from it. If you are familiar with Python you can safely skip these two chapters and jump straight into Chapter 4.

I recommend you read Chapters 4 to 6 sequentially. The reason for this is that later chapters build on the content from previous ones. In Chapter 4 we discuss different measures that are important in descriptive statistics and provide the footing to consider datasets as a whole. In Chapter 5 we talk about random variables and probability, opening up the discussion to different probability distributions that let us start thinking of hypothesis testing. This is the subject of Chapter 6, where we look at various statistical tests, both parametric and non-parametric.

If you are interested in jumping to a particular section of the book, use the diagram shown in Figure 1, where you can follow the flow chart depending on the characteristics of the data you have to quickly decide the type of test you may need. Note that each test has some assumptions behind it; use the information in the rest of the book to support your analysis. Bear in mind that there is an element of randomness and uncertainty in any given data, and I hope that the discussions do not leave you singing random Pavarotti songs like good old Rex the Runt’s friend Vince (Ta-raa-raa-raa ♩ Raa-raa-raa RAA-RAA). The chapters mentioned above shall let us handle that randomness and uncertainty in a systematic way.

Remember that statistics is a discipline that studies methods of collecting, analysing and presenting data. With the latter in mind, the last two chapters of the book are dedicated to the discussion of data visualisation: from perception of visual aspects to best practices in Chapter 7, and examples on how to create common statistical visualisations with Python in Chapter 8.

Figure 1: Do you know what type of data you have? Use this flow chart to guide you about the tests you may want to use.

I sincerely hope that the contents of the book are useful to many of you. Good statistical thinking is a great tool to have in many walks of life and it is true to say that the current availability of data makes it even more important to have a good grasp of the concepts and ideas that underpin a sound statistical analysis, with great visuals to support our arguments. Stay in touch and, who knows, I may be saying “maybe one more” after having uttered yet again “this is the last time”!

London, UK, June 2022
Dr Jesús Rogel-Salazar
About the Author

Dr Jesús Rogel-Salazar is a lead data scientist working for companies such as Tympa Health Technologies, AKQA, IBM Data Science Studio, Dow Jones, Barclays, to name a few. He is a visiting researcher at the Department of Physics at Imperial College London, UK and a member of the School of Physics, Astronomy and Mathematics at the University of Hertfordshire, UK. He obtained his doctorate in Physics at Imperial College London for work on quantum atom optics and ultra-cold matter.

He has held a position as Associate Professor in mathematics, as well as a consultant and data scientist in a variety of industries including science, finance, marketing, people analytics and health, among others. He is the author of Data Science and Analytics with Python and Advanced Data Science and Analytics with Python, as well as Essential MATLAB® and Octave, published by CRC Press. His interests include mathematical modelling, data science and optimisation in a wide range of applications including optics, quantum mechanics, data journalism, finance and health tech.
Other Books by the Same Author

• Data Science and Analytics with Python
  CRC Press, 2018, ISBN 978-1-138-04317-6 (hardback), 978-1-4987-4209-2 (paperback)

  Data Science and Analytics with Python is designed for practitioners in data science and data analytics in both academic and business environments. The aim is to present the reader with the main concepts used in data science using tools developed in Python. The book discusses what data science and analytics are, from the point of view of the process and results obtained.

• Advanced Data Science and Analytics with Python
  CRC Press, 2020, ISBN 978-0-429-44661-0 (hardback), 978-1-138-31506-8 (paperback)

  Advanced Data Science and Analytics with Python enables data scientists to continue developing their skills and apply them in business as well as academic settings. The subjects discussed in this book are complementary and a follow-up to the topics discussed in Data Science and Analytics with Python. The aim is to cover important advanced areas in data science using tools developed in Python such as SciKit-learn, pandas, NumPy, Beautiful Soup, NLTK, NetworkX and others. The model development is supported by the use of frameworks such as Keras, TensorFlow and Core ML, as well as Swift for the development of iOS and MacOS applications.

• Essential MATLAB® and Octave
  CRC Press, 2014, ISBN 978-1-138-41311-5 (hardback), 978-1-4822-3463-3 (paperback)

  Widely used by scientists and engineers, well-established MATLAB® and open-source Octave provide excellent capabilities for data analysis, visualisation, and more. By means of straightforward explanations and examples from different areas in mathematics, engineering, finance, and physics, the book explains how MATLAB and Octave are powerful tools applicable to a variety of problems.
1 Data, Stats and Stories – An Introduction

Data is everywhere around us and we use it to our advantage: From wearing clothes appropriate for the weather outside, to finding our way to a new location with the aid of GPS and deciding what to buy in our weekly shopping. Furthermore, not only are we consumers of data but each and every one of us is a data creator. It has never been easier to create and generate data. Consider your daily use of technology such as your humble smartphone or tablet: How many emails do you read or write? How many pictures do you take? How many websites do you visit? And many other similar questions. Each of these interactions generates a bunch of data points.

There are even accounts about the amount of data that is being generated. For example, in 2013 SINTEF [1] reported that 90% of the world’s data had been created in the previous two years. And the pace will keep up and even accelerate given the number of new devices, sensors and ways to share data.

[1] SINTEF (2013). Big Data, for better or worse: 90% of world’s data generated over last two years. www.sciencedaily.com/releases/2013/05/130522085217.htm. Accessed: 2021-01-01



In this chapter we will look at how the current availability of data has given rise to the need for more and better data analysis techniques. We will see how these techniques are underpinned by strong foundations in statistics. We will also cover some of the historical developments that have made statistics a sought-after skill and will provide a framework to tackle data-driven enquiry. Let us get started.

1.1 From Small to Big Data

Big things start from small ones. Think of the vastness of a desert, and consider the smallness of each grain of sand. Or look at a single star and expand your view to the large number of them in a galaxy, and then to the number of galaxies, most of them seeded with their own black holes (we will come back to these black holes, bear with me). The same applies to data: Each data point on its own may not be all that powerful, but when combined with others the possibilities open up.

The interest in data may range from looking at historical trends that in turn may hold some clues as to what may happen in the future and even tell us what we should do. These are the different levels required for a robust analytics environment. Or put in different terms:

• Descriptive analytics - use data in aggregation to look at the past and answer questions such as “what happened?”

• Predictive analytics - use statistical models and forecasting techniques to answer questions such as “what will happen?”

• Prescriptive analytics - use optimisation and simulation techniques to look at possible outcomes and answer questions such as “what will happen if...?” and “what should we do?”

Table 1.1: Common orders of magnitude for data

Name            Value     Bytes                                Magnitude
Byte (B)        1         1                                    1
Kilobyte (KB)   1,024^1   1,024                                10^3
Megabyte (MB)   1,024^2   1,048,576                            10^6
Gigabyte (GB)   1,024^3   1,073,741,824                        10^9
Terabyte (TB)   1,024^4   1,099,511,627,776                    10^12
Petabyte (PB)   1,024^5   1,125,899,906,842,624                10^15
Exabyte (EB)    1,024^6   1,152,921,504,606,846,976            10^18
Zettabyte (ZB)  1,024^7   1,180,591,620,717,411,303,424        10^21
Yottabyte (YB)  1,024^8   1,208,925,819,614,629,174,706,176    10^24

The use of relevant data is central to each of these types of analytics. As such, the volume of data together with their variability and richness can power up various applications. These days, services underpinned by machine learning have become all the rage, and every successful company out there is actually a data company in one way or another. The Economist reported in 2017 that the data created and copied every year will reach 180 zettabytes in 2025; that is 180 × 10^21, or 180 followed by 21 zeros. Sending that information via an average broadband connection was calculated by The Economist to take 450 million years [2].

[2] Loupart, F. (2017). Data is giving rise to a new economy. The Economist - www.economist.com/briefing/2017/05/06/data-is-giving-rise-to-a-new-economy. Accessed: 2021-01-02

Just for reference, one single exabyte corresponds to 1,048,576 terabytes (TB). You can refer to Table 1.1 to look at the relationship between different common measures for data. You may want to know that it would take 728,177 floppy disks or 1,498 CD-ROM discs to store just 1 TB worth of information (these references may only be meaningful to people of a certain age... ahem... if not, look it up!). 1024 TB is one petabyte (PB) and this would take over 745 million floppy disks or 1.5 million CD-ROM discs.
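As a quick sanity check on these figures, the arithmetic can be reproduced in a few lines of Python. This is a minimal sketch that assumes a 1.44 MB floppy disk and a 700 MB CD-ROM, both measured in the binary units of Table 1.1:

# Reproduce the storage comparisons quoted above.
TB = 1024**4               # bytes in a terabyte
EB = 1024**6               # bytes in an exabyte
floppy = 1.44 * 1024**2    # assumed 1.44 MB floppy disk, in bytes
cdrom = 700 * 1024**2      # assumed 700 MB CD-ROM, in bytes

print(EB / TB)             # 1048576.0 terabytes in one exabyte
print(TB / floppy)         # ~728177.8, the 728,177 floppy disks quoted above
print(TB / cdrom)          # ~1498 CD-ROM discs per terabyte
print(1024 * TB / floppy)  # ~745.7 million floppy disks per petabyte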

Numbers like these sound fanciful, but consider that there are examples of companies resorting to the use of shipping containers pulled by trucks to transfer data. Given the different geographical locations where relevant data is generated and captured, it is not unusual to hear about physical transfer of data. A simple example is provided by the capture of the first image ever obtained of a black hole in 2019. With the use of eight radio telescopes located around the globe (including the “Gran Telescopio Milimétrico Alfonso Serrano” at INAOE, an institution where I spent some time as a young optics researcher), scientists were able to record an image of a black hole by improving upon a technique that allows for the imaging of far-away objects known as Very Long Baseline Interferometry, or VLBI [3]. Each telescope would capture 350 TB of data per day for one week. In this way, from each grain of sand, a great desert is created.

[3] The Event Horizon Telescope Collaboration (2019). First M87 Event Horizon Telescope Results. I. The Shadow of the Supermassive Black Hole. ApJL 875(L1), 1–17

The telescopes are not physically connected and the data is not networked. However, they synchronise the data with atomic clocks to time their observations in a precise manner. The data was stored on high-performance helium-filled hard drives and flown to highly specialised supercomputers — known as correlators — at the Max Planck Institute for Radio Astronomy and MIT Haystack Observatory to be combined: all five petabytes of observations. They were then painstakingly converted into an image using novel computational tools developed by the collaboration (they used Python, by the way; you can look at the image here: https://www.eso.org/public/news/eso1907/). All that for a single, magnificent image that lets us gaze into the abyss.

The availability of data in such quantities is sometimes described using the umbrella term “Big Data”. I am not a huge fan of the term, after all big is a comparative adjective — i.e. “big” compared to what and according to whom? Despite my dislike for the term, you are likely to have come across it. Big data is used to describe large volumes of data which we would not be able to process using a single machine or traditional methods.

In the definition provided above you may be able to see why I have a problem with this term. After all, if I am a corner shop owner with a spreadsheet large enough for my decades-old computer I already have big data; whereas this spreadsheet is peanuts for a server farm from a big corporation like Google or Amazon. Sometimes people refer to the 4 V’s of big data for a more precise definition. We have already mentioned one of them, but let us take a look at all of them:

• Volume – The sheer volume of data that is generated and captured, i.e. the size of the datasets at hand. It is the most visible characteristic of big data.

• Velocity – The speed at which data is generated. Not only do we need large quantities of data, but they also need to be made available at speed. High velocity requires suitable processing techniques not available with traditional methods.

• Variety – The data that is collected not only needs to come from different sources, but also encompasses different formats and shows differences and variability. After all, if you just capture information from StarTrek followers, you will think that there is no richness in Sci-Fi.

• Veracity – This refers to the quality of the data collected. This indicates the level of trustworthiness in the datasets you have. Think of it – if you have a large quantity of noise, all you have is a high pile of rubbish, not big data by any means.

I would like to add a couple more V’s to the mix, all for good measure. I mentioned visibility earlier on, and that is a very important V in big data. No, visibility is not one of the traditional V’s, but it is one to keep an eye on :) If the data that we have captured is sequestered in silos or simply not made available for analysis, once again we have a large storage facility and nothing else.

The other V I would like to talk about is that of value. At the risk of falling for the cliché of talking about data being the new oil, there is no question that data — well curated, maintained and secured data — holds value; data that does not hold value is a cost. And to follow the overused oil analogy, it must be said that for it to achieve its potential, it must be distilled. There are very few products that use crude oil in its raw form. The same is true when using data.

We can look at a proxy for the value of data by looking at the market capitalisation of some of the largest companies. By the end of 2020, according to Statista [4], 7 out of the 10 most valuable companies in the world are technology companies that rely on the use of data, including Microsoft, Apple and Alphabet. Two are multinational conglomerates or corporations whose component firms surely use data. The other company in the top 10, funny enough, is an oil company. See the ranking in Table 1.2.

[4] Statista (2020). The 100 largest companies in the world by market capitalization in 2020. www.statista.com/statistics/263264/top-companies-in-the-world-by-market-capitalization. Accessed: 2021-01-03

Table 1.2: Ranking of companies by market capitalisation in billions of U.S. dollars in 2020

Rank  Name
1     Saudi Arabian Oil Company (Saudi Aramco) (Saudi Arabia)
2     Microsoft (United States)
3     Apple (United States)
4     Amazon (United States)
5     Alphabet (United States)
6     Facebook (United States)
7     Alibaba (China)
8     Tencent Holdings (China)
9     Berkshire Hathaway (United States)
10    Johnson & Johnson (United States)

According to Gartner [5], more and more companies will be valued on their information portfolios and the data they have. Companies are using data to improve their businesses from operations to services, increase revenue, predict behaviour, etc. Using the different forms of analytics we mentioned earlier on, companies are able to look at what has happened in the recent past, and be able to predict what will happen and drive changes given different scenarios. This is an area that will continue developing and change is required. For example, current Generally Accepted Accounting Principles (GAAP) standards do not allow for intangible assets such as data to be capitalised.

[5] Gartner (2017). Gartner Says within Five Years, Organizations Will Be Valued on Their Information Portfolios. www.gartner.com/en/newsroom/press-releases/2017-02-08-gartner-says-within-five-years-organizations-will-be-valued-on-their-information-portfolios. Accessed: 2021-01-04

In any event, companies and businesses in general are in an environment where customers expect them to provide services that are personalised, and cater to their wants and needs with every transaction or interaction. This requires a panoramic view informed by the richness of data including the customer journey and user experience, sales, product usage, etc. Measuring the value of our data can be a hard task. James E. Short and Steve Todd tackle [6] the problem with a few different approaches, including monetisation of the raw data itself by analysing its value for their customer; the usage value that the data holds, in other words, how frequently the data is accessed, the transaction rate or application workload; and the expected value of the data by looking at the cash flow or income they generate or comparing to existing, tracked datasets.

[6] Short, J. E. and S. Todd (2017). What’s your data worth? MIT Sloan Management Review, sloanreview.mit.edu/article/whats-your-data-worth/. Accessed: 2021-01-08
comparing to existing, tracked datasets.

Independently of the valuation of the data that businesses may be able to undertake, there is no doubt that better data enables better decisions, which in turn lead to better outcomes. For instance, in 2014 the McKinsey Global Institute reported [7] on the use of customer analytics across different sectors, concluding that data-driven organisations are better equipped and can improve profits and growth. Companies that leverage their customer data are 23 times more likely to acquire customers, 6 times as likely to retain their customers, and 19 times as likely to be profitable as a result.

[7] McKinsey & Co. (2014). Using customer analytics to boost corporate performance.

All the benefits we have mentioned above are part of the appeal in developing internal data science teams and analytics squads that are able to harness the power of data. There is no doubt that this is a way forward, but it is important not to be distracted by dazzling artificial intelligence promises. That is not to say that A.I. or its cousin, machine learning, are not useful. However, in many cases we forget that strong foundations in data analysis and statistics underpin a lot of the successful stories out there.

For all the great outcomes provided by the use of data, privacy and ethics are important ingredients to include. There are some excellent resources you can look at, such as the work of Annika Richterich [8], or look at the cautionary tales edited by Bill Franks [9]. This is a vast area outside the scope of this book. However, it is paramount to remember that behind each data point there may be a corporation, an individual, a user whose information is being analysed. Bear in mind their privacy and consider the ethical implications of the decisions that are being made with the help of this data.

[8] Richterich, A. (2018). The Big Data Agenda: Data Ethics and Critical Data Studies. Critical, Digital and Social Media Studies. University of Westminster Press
[9] Franks, B. (2020). 97 Things About Ethics Everyone in Data Science Should Know. O’Reilly Media

In previous works [10] I have argued about the rise of the role of data science and analytics in modern business settings. You will find me referring to data science jackalopes to counter the need for magic unicorns that are able to cover the data workflow and produce results in no time, while still having a fabulous mane. Data science, as a team sport played by a team of data science jackalopes, may indeed use some deep learning or support vector machines; they may invent something new or adapt an existing approach. What I have found, though, is that if you can answer a question simply by aggregating, slicing and dicing — counting really — well-curated data and do so rigorously, then it is OK and you are definitely on the right track.

[10] Rogel-Salazar, J. (2017). Data Science and Analytics with Python. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. CRC Press; and Rogel-Salazar, J. (2020). Advanced Data Science and Analytics with Python. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. CRC Press

Think of it from your experience. Consider the following example (I bet you can transpose this to a more sunny setting; remember that any resemblance to actual events is purely coincidental): You have just attended a fine social event with colleagues and are thinking of going back home. The weather in London is its usual self, it is pouring down with cold rain and you have managed to miss the last tube home. You decide to use one of those ride-hail apps on your mobile. It does not matter how many clever algorithms – from neural to social network analysis – they have used: if the cab is not there in the x minutes they claimed it would take, the entire proposition has failed and you will think twice before using the service again. I am sure you have experienced this at some point.

Part of the success (or failure) in the example above is down to having a good user experience (UX) in designing the product and presenting correct and useful information. That information needs to rely on accurate data and its robust analysis. Perhaps a weighted average is a good measure for the time presented to the user, and may be quicker to calculate than using a more computationally expensive method that takes longer to update. Understanding uncertainty, probable outcomes, and interpreting the results rests on a solid foundation of statistical concepts. This is the main motivation behind this book.
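For illustration only, a minimal sketch of such a weighted average is shown below; the arrival-time estimates and the weights favouring the most recent observations are made-up values, not anything from a real app:

# Hypothetical recent ETA estimates in minutes (newest last)
etas = [7.0, 6.5, 5.0, 4.0]
# Hypothetical weights giving more importance to newer estimates
weights = [1, 2, 3, 4]

weighted_avg = sum(w * x for w, x in zip(weights, etas)) / sum(weights)
print(weighted_avg)  # 5.1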

1.2 Numbers, Facts and Stats

Thinking about statistics as the art of learning from data provides us with a powerful perspective: statistics enables the creation of new knowledge. Statistical knowledge helps us tackle tasks that deal with data in general: From collection, to analysis and even presentation of results. Not only is statistics an important component of the scientific process, but also of many everyday subjects from weather to insurance, health and sport. A good statistical foundation will help us understand a problem better and lead us to sound conclusions, as well as spotting areas that require further analysis.

Consider how many decisions, even opinions, are based on data as evidence. It stands to reason that having a good grasp of statistics can help us assess the quality of an analysis presented to us; perhaps even carry out the analysis ourselves. Statistics is therefore more than just numbers and facts (although that is the colloquial usage) such as 9 out of 10 dental professionals prefer this toothpaste, or the possibility of successfully navigating an asteroid field being approximately 3,720 to 1. The real power of statistics is to understand how they came up with those figures: How many dentists and hygienists were asked? What were they consulted on and what other toothpaste brands were presented to them? As for our protocol android, how recent is the data consulted? Is that the figure for freighters only? And what are the chances of a collision given that 5 of the main characters in the story are on board? I am sure C-3PO will be able to answer these and more.

As a matter of fact, a good statistical analysis will also take into account uncertainties and errors in the results presented. You will hear about conclusions being statistically significant (or not) and care is taken to ensure that the analysis is based on reliable data, that the data in question is analysed appropriately and that the conclusions are reasonable based on the analysis done. The current scientific and business environments have more and better data to be analysed and, given the benefits that statistics brings to the table, no wonder there has been a surge in the need for people with statistical skills. If on top of that you consider that quite a few of the tools used in statistics are a good foundation for applications of machine learning and artificial intelligence (some even call it statistical learning), then the requirement is even more compelling.

You may think that statistics is a recent human endeavour, particularly when looking at the prevalence of mathematics and even computational prowess that is emphasised these days. However, statistics has had a long journey. Our capabilities for storing and analysing data have been evolving gradually and areas such as data collection started much earlier than the computer revolution. One of the earliest examples of data collection is the Ishango Bone [11], dated 18,000 BCE and discovered in 1960 in present-day Uganda. The bone has some non-random markings thought to have been used for counting [12].

[11] de Heinzelin, J. (1962, Jun). Ishango. Scientific American (206:6), 105–116
[12] Pletser, V. (2012). Does the Ishango Bone Indicate Knowledge of the Base 12? An Interpretation of a Prehistoric Discovery, the First Mathematical Tool of Humankind. arXiv math.HO 1204.1019

Other tools and devices have been used to keep track of information or for performing calculations. For example, the abacus provided ancient cultures with a calculating device that is used to this day. Simply look at the soroban (算盤, そろばん - calculating tray) competitions in Japan. As for early examples of abaci, perhaps the Salamis Tablet [13] is one of the oldest counting boards, certainly a precursor. It was discovered in 1846 on the Greek island of Salamis, dates back to 300 BCE and was used by the Babylonians.

[13] Heffelfinger, T. and G. Flom (2004). Abacus: Mystery of the Bead. http://totton.idirect.com. Accessed: 2021-02-03

Different forms of counting have been developed by very different people at different times in various different places. For example, the Incas and other cultures in the Andes in South America would use knots on a string, known as a quipu or khipu [14], to record and communicate information. Pre-dating the height of the Incan Empire as far back as 2500 BCE, quipus would be able to record dates, statistics, accounts, and even represent, in abstract form, key episodes from traditional folk stories and poetry. The Mayans [15] in Mexico and Central America (Classic Period – 250-900 CE) developed a positional vigesimal numeral system requiring a zero as a place-holder. Its use in astronomy and calendar calculations was aided by their number system. A remarkable achievement!

[14] Buckmaster, D. (1974). The Incan Quipu and the Jacobsen hypothesis. Journal of Accounting Research 12(1), 178–181
[15] Díaz Díaz, R. (2006). Apuntes sobre la aritmética maya. Educere 10(35), 621–627

With the registering of information, it becomes possible to keep records of all sorts: Goods, taxes, assets, laws, etc. One of the earliest examples of the practice comes from Ancient Egypt, where specialised bookkeepers would maintain records of goods in papyrus as early as 4000 BCE. Papyrus was not only used for administrative accounting, but also for recording literature and even music [16]. In more recent times, paper-like material has been used by other cultures for record-keeping. For instance, the use of amate (from āmatl in Nahuatl) in pre-Hispanic civilisations such as the Aztecs. An example of economic records is shown in Section 2 of the Codex Mendoza [17], listing 39 provinces containing more than 400 towns that were required to pay tribute. The codex has been held at the Bodleian Library at the University of Oxford since 1659.

[16] Lichtheim, M. (2019). Ancient Egyptian Literature. University of California Press
[17] Hassig, R. (2013). El tributo en la economía prehispánica. Arqueología Mexicana 21(124), 32–39

An impressive achievement of ancient data storage is epitomised by the Library of Alexandria (300 BCE - 48 CE), with half a million scrolls covering a large proportion of the knowledge acquired by that time. We all know what happened when the Romans invaded... and it is indeed the Romans who we shall thank for the origin of the word statistics. The Romans have a lot to answer for. Let's take a look.

1.3 A Sampled History of Statistics

The word statistics is derived from the Latin statisticum collegium or "council of state", which could be understood to be an account of the state of affairs: statistics was an account of the activities of a state. This is also where the Italian word statista (politician/statesman) originates and where the relationship with the word state becomes clear. The word statistics was used to encapsulate the information resulting from the administrative activity of a state, in particular economic data and its population.

In the eighteenth century, in the book by Gottfried Achenwall [18] entitled Staatsverfassung der heutigen vornehmsten europäischen Reiche und Völker im Grundrisse, the term Statistik was first used to mean the comprehensive description of the sociopolitical and economic characteristics of a state. In the process, Achenwall provided a word for this description in various other languages. Over the following decades, the term was broadened to include the collection and analysis of wider types of data, and it started being used in conjunction with probability, ending up becoming what we now know as statistical inference (we will talk about statistical inference later on). The rest is history.

[18] van der Zande, J. (2010). Statistik and history in the German enlightenment. Journal of the History of Ideas 71(3), 411–432.

Today, the term has a much wider meaning. People talk about statistics to mean facts and figures, but the term is broader than that: it includes not only that colloquial use of the word, but also the branch of mathematics dealing with the collection, analysis, interpretation and representation of data, and the different types of methods and algorithms that enable us to analyse data. Many have been the contributions and contributors to the science of statistics, and here I would like to mention but a few.

Some fundamental concepts, such as the mean of two numbers or the mode, were well known to the ancient Greeks. Extensions to obtaining the mean (or average) of more than two numbers (something not known by the ancient Greeks!) were used in the late sixteenth century to estimate astronomical locations. In the 1750s Thomas Simpson showed, in his work on the theory of errors, that the arithmetic mean was better than a single observation.

The median is another important basic measure that we all learn at school (some of these measures are discussed in Section 4.2). It was first described by Edward Wright, a cartographer and mathematician, to determine location for navigation with the help of a compass. Another important application of the median includes the work of Roger Joseph Boskovich, astronomer and physicist, who used it as a way to minimise the sum of absolute deviations. This is effectively a regression model [19] based on the L1-norm or Manhattan distance. The actual word median was coined by Francis Galton in 1881.

[19] Rogel-Salazar, J. (2017). Data Science and Analytics with Python. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. CRC Press.

Now that we have mentioned regression, it is pertinent to talk about least squares. Apart from Boskovich's efforts mentioned earlier, Adrien-Marie Legendre's work in the area is perhaps better known. In 1805 Legendre made contributions on least squares in the context of astronomical calculations [20]. Some say that least squares is to statistics what calculus is to mathematics. It is a standard approach that many of us would have been introduced to in many courses of probability and statistics.

[20] Samueli, J.-J. (2010). Legendre et la méthode des moindres carrés. Bibnum. journals.openedition.org/bibnum/580. Accessed: 2021-02-14.

And talking about probability, it started as a way to understand games of chance among mathematicians such as Pierre de Fermat and Blaise Pascal in the mid-1600s. Some of the elements of early probability had been proposed a century earlier by Girolamo Cardano, although not published at the time. The history of probability is much broader than the few things mentioned here, but studying the outcomes of games of chance and gambling is definitely the start. Jacob Bernoulli applied these techniques to understand how, from given outcomes of a game of chance, it was possible to understand the properties of the game itself, starting us down the road towards inferential statistics (I promise we will talk about inference shortly). Abraham de Moivre considered the properties of samples from Bernoulli's binomial distribution as the sample size increased to very large numbers, ending up with a discrete version of the normal distribution.

Data was also used to quantify other phenomena, and statistics started to be used in a variety of problems. For instance, in 1662 John Graunt estimated the population of London using records on the number of funerals per year, the death rate and the average family size. Pierre Simon, Marquis de Laplace, used a similar method for the French population in 1802. Among the many contributions that Laplace made to the wider fields of mathematics and physics, he also played an important role in the establishment of the normal distribution, generalising the work of de Moivre.

In 1774, Laplace proposed an error curve [21], now imaginatively called the Laplace distribution, noting that an error in a measurement can be expressed as an exponential function of the absolute value of its magnitude. In 1810 his formulation of the central limit theorem confirmed the assumptions that Carl Friedrich Gauss used for his derivation of the normal distribution, also known as the Gaussian distribution, to model errors in astronomical observations. We associate Gauss with the normal distribution, but a lot of it is due to Laplace; we discuss the normal distribution in Section 5.4.1. In many applications, it is often the case that we treat the error of an observation as the result of many small, independent errors. This is a powerful thing that lets us apply the central limit theorem and treat the errors as normally distributed.

[21] Stahl, S. (2006). The evolution of the normal distribution. Mathematics Magazine 79(2), 96.

We mentioned Galton earlier on, and his work not only included giving a name to the median, or indeed advancing fingerprints as evidence in court, or even guessing the weight of a fat ox and the statistical implications of the estimates provided [22]. He also fitted normal curves to data and proposed the idea of regression to the mean by noticing the form by which a distribution is dispersed, looking at the ideas of what we now call variance and standard deviation. His 1886 work on the relative size/height of children and their parents provided us with the name for regression and the use of correlation.

[22] Rogel-Salazar, J. (2017). Data Science and Analytics with Python. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. CRC Press.

We cannot mention correlation without referring to Karl Pearson, whose work on the method of moments and correlation analysis provided a foundation for mathematical statistics (Pearson correlation is discussed in Section 6.4.1). His work is also encapsulated in the use of p-values and principal component analysis (see Section 5.5). He is also credited [23] with introducing the histogram as a tool to visualise data. His chi-square test of goodness of fit is one of the main ways to test for significance in statistics (see Section 6.3 for the chi-square test). There is no doubt that both Galton and Pearson made great contributions to the field, but it is also important to reflect on the impact that their views on eugenics have had in the world. It is great that University College London decided to rename three facilities that bore their names. UCL's president and provost, Prof Michael Arthur, is quoted saying [24]:

"This problematic history has, and continues, to cause significant concern for many in our community and has a profound impact on the sense of belonging that we want all of our staff and students to have. Although UCL is a very different place than it was in the 19th century, any suggestion that we celebrate these ideas or the figures behind them creates an unwelcoming environment for many in our community."

[23] Ioannidis, Y. (2003). The history of histograms (abridged). www.vldb.org/conf/2003/papers/S02P01.pdf. Accessed: 2021-02-14.
[24] PA Media (2020). UCL renames three facilities that honoured prominent eugenicists. www.theguardian.com/education/2020/jun/19/ucl-renames-three-facilities-that-honoured-prominent-eugenicists. Accessed: 2021-02-14.

In terms of statistical testing, we need to mention William Sealy Gosset, aka Student, who proposed the famous Student's t distribution in 1908 [25] (see Section 5.5.1 for the t distribution and Section 6.5.1 for the t-test). Gosset published his work under the pseudonym of Student to appease his employer, the Guinness Brewery in Dublin, who were reluctant for his work to be published in case competitors came to realise they were relying on t-tests to determine the quality of their raw materials.

[25] Student (1908). The probable error of a mean. Biometrika 6(1), 1–25.

Perhaps the Guinness brewers were happy with the initial reception of what could be a dry piece of mathematical writing without much fizz (similar, I guess, to Guinness stout: not much fizz, but lots of flavour!). Except, that is, if you are Ronald Aylmer Fisher, who considered how certain a result would be compared to chance, and if those chances are low we can consider the result to be significant. Fisher introduced techniques such as the analysis of variance and estimation theory. He also contributed to the development of better experimental design. Some even consider him to have founded modern statistics. Rothamsted Research [26] released in 2020 a statement condemning Fisher's involvement with eugenics and distancing themselves from his views. Similarly, Gonville and Caius College [27] of the University of Cambridge announced the removal of a stained-glass window commemorating his work.

[26] Rothamsted Research (2020). Statement on R. A. Fisher. www.rothamsted.ac.uk/news/statement-r-fisher. Accessed: 2021-02-14.
[27] Busby, M. (2020). Cambridge college to remove window commemorating eugenicist. www.theguardian.com/education/2020/jun/27/cambridge-gonville-caius-college-eugenicist-window-ronald-fisher. Accessed: 2021-02-14.
The traditional use of statistics up until Fisher had been mainly inferential, where hypothesis testing and p-values take centre stage. A different focus was brought in by John Tukey, who considered the importance of exploring the data at hand to see what it is telling us, and with that exploratory data analysis was born [28]. He is also the inventor of the box plot, contributed to the development of fast Fourier transform algorithms, and is even credited with coining the term "bit" as a contraction of "binary digit", now widely used in computer science and beyond. The influence that he has had in the field can be felt in the more recent world of data science. For an account of the trials and tribulations of the jackalope data scientist, take a look at Chapter 1 of Data Science and Analytics with Python (you will need to consult the book for more on jackalopes and data science), and for techniques and algorithms see the rest of the book and its counterpart Advanced Data Science and Analytics with Python [29].

[28] Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley Series in Behavioral Science. Addison-Wesley Publishing Company.
[29] Rogel-Salazar, J. (2020). Advanced Data Science and Analytics with Python. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. CRC Press.

Let us now turn our attention to the field of data visualisation. We have mentioned above the contributions by Pearson and Tukey in the form of histograms and box plots, but there are a couple more people that we would like to mention in this sampled history. The first is the Scottish engineer William Playfair, whose background ranges from engineering through economics, banking, and even espionage [30]. In this account of contributions he gets a mention for having invented different types of data representation that we use to this day. Examples include the line chart, the area plot and the bar chart, even the infamous pie chart (we will cover why infamous in Section 8.6). All that without the aid of modern plotting software!

[30] Berkowitz, B. D. (2018). Playfair: The True Story of the British Secret Agent Who Changed How We See the World. George Mason University Press.

Cartography has always been a rich ground for data depiction, and the use of maps overlaying other information has been with us for a long time. A pioneer of these sorts of maps is the French engineer Charles Joseph Minard. The flow maps he created depict the traffic of existing roads in the Dijon area in 1845 (Dijon in France, famous for its mustard of course); this may have helped with laying the railroad infrastructure in the area later on. Even more famous is his graphical representation of the Napoleonic Russian campaign of 1812, showing the deployment and eventual demise of Napoleon's army on their way to and back from Moscow. This is effectively a predecessor of what we now refer to as Sankey diagrams. They are named after Matthew Henry Phineas Riall Sankey, who used them to depict energy efficiency in a steam engine.

A notable entry is that of Florence Nightingale, who not only founded modern nursing [31] but was also a social reformer, statistician and data visualiser. Building on work from Playfair, she was able to use charts effectively in her publications to draw attention to important results. An excellent example is her use of polar area charts, or coxcombs, to elucidate the actual causes of death by month in the Crimean War. In 1859 she was elected the first female member of the Royal Statistical Society.

[31] Bostridge, M. (2015). Florence Nightingale: The Woman and Her Legend. Penguin Books Limited.

Around the same time that Nightingale was persuading Queen Victoria and Members of Parliament of the need to improve conditions in military hospitals, John Snow was mapping the locations of cholera cases [32], in particular deaths, showing a clear concentration around a now infamous water pump in Broad Street in Soho, London. He was able to use this information not only to stop the outbreak but also to change the way we understand the cause and spread of disease.

[32] Vinten-Johansen, P., H. Brody, N. Paneth, S. Rachman, M. Rip, and D. Zuck (2003). Cholera, Chloroform, and the Science of Medicine: A Life of John Snow. Oxford University Press.

Talking about maps, a well-deserved mention goes to Pierre Charles François Dupin [33] for the creation of the choropleth map to illustrate the illiteracy rate in France in 1826. This kind of thematic map uses shadings in proportion to the variable that represents an aggregate summary for each area.

[33] Bradley, M. Charles Dupin (1784-1873) and His Influence on France. Cambria Press.

There is much more we could cover in this sampled history. However, we will finish this section with a mention of Jacques Bertin and his 1967 book Sémiologie Graphique [34]. We can think of his work as being to data visualisation what Mendeleev's periodic table is to chemistry. In Chapter 7 we will cover some of this organisation of the visual and perceptual elements of graphics according to the features and relations in data. Can't wait!

[34] Bertin, J. and M. Barbut (1967). Sémiologie graphique: Les diagrammes, les réseaux, les cartes. Gauthier-Villars.

1.4 Statistics Today

As we have already seen, the field of statistics has an interesting and varied history. Today, the field has seen a renewed interest in light of the data availability that we discussed in Section 1.1; the renewed interest in statistics is fuelled by the more widespread availability of data. In a way, we can think of statistics as a way to describe the methods and ideas dealing with the collection, presentation, analysis and interpretation of data. All this with the purpose of using data to understand a variety of phenomena.

Broadly speaking, we can consider two different approaches to statistical analysis, and both can be used in conjunction with each other. We mentioned them in the previous section, and surely you have heard about descriptive and inferential statistics. Descriptive statistics, as the name suggests, is used to describe data from a group of observations, usually referred to as a population (in practice, usually a sample of that data). In that way, descriptive statistics deals with the aggregation and summarisation of data. The use of tables, charts, or graphical displays is the norm in this area, as well as the calculation of some measures that provide information about the bulk of the data and how spread out the observations are. Descriptive statistics enables us to see data in a meaningful way and provides us with clues not only about the shape of the data, but also about what to do next with it.

And some of those things include going beyond the aggregations and reaching conclusions from the evidence provided by the data, even though it may not be made explicit. This is the role of inferential statistics. The methods and estimations made with inferential statistics let us make inferences about the population from a sample (finally, we are here talking about inferential statistics; I hope it was worth the wait). In this way we can study relationships between different attributes or values, called variables, and make generalisations or predictions. It is then possible for us to test a hypothesis, in other words a proposed explanation for the phenomena we are seeing. Our aim is therefore to accept or reject the hypothesis based on the data at hand. Typically we would consider a quantity to be measured and propose that its value is zero. This is called the null hypothesis, and accepting or rejecting the null hypothesis is part and parcel of inferential statistics. We can then look at the results of two models, for example, and we deem our results to be statistically significant if the data would be unlikely to occur if the null hypothesis were true, given a significance level provided by a threshold probability.
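To make the idea concrete, here is a minimal sketch of such a test in Python, using the SciPy library that we rely on later in the book; the sample values and the proposed population mean are invented purely for illustration, and the details of the t-test are covered in Chapter 6.

# A minimal sketch of confronting a null hypothesis with data.
# The sample values below are made up for illustration only.
from scipy import stats

sample = [4.9, 5.1, 5.3, 4.8, 5.2, 5.0, 5.4, 4.7]

# Null hypothesis: the population mean is 5.0
result = stats.ttest_1samp(sample, popmean=5.0)
print(result.statistic, result.pvalue)

# Compare the p-value with a significance level of 0.05
if result.pvalue < 0.05:
    print('Reject the null hypothesis')
else:
    print('Not enough evidence to reject the null hypothesis')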

There are other classifications of statistics you may hear from time to time, for example whether it is parametric or nonparametric. In parametric methods, we work under the idea that there is a set of fixed parameters that describe our model and therefore its probability distribution (parametric methods try to determine parameters to describe the data, hence the name). Parametric statistical methods are typically used when we know that the population is near normal, or we can approximate such a distribution. The normal distribution has two parameters: the mean and the standard deviation. If you are able to determine those two parameters, you have a model.
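For reference, and anticipating the discussion in Section 5.4.1, the density of the normal distribution is fully specified by those two parameters, the mean μ and the standard deviation σ:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right).

Determine μ and σ and you have pinned down the whole model.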

What about nonparametric statistics? Well, these are methods that do not need to make assumptions on parameters to describe the population at hand; this is in contrast to the parametric methods. This means that the distribution is not fixed, and sometimes we refer to these models as distribution-free. These methods are usually applied when we have ordinal data, in other words, data that relies on ranking or a certain order. A good example of a simple descriptive nonparametric statistic is a histogram plot.

Another distinction you may encounter from time to time regards the approach to making inferences: the frequentist and Bayesian approaches to statistics. This may even be thought of as the "philosophies" behind the approach: you can do frequentist inference or Bayesian inference. The difference between frequentist and Bayesian statistics is rooted in the interpretation of the concept of probability. In frequentist statistics, only repeatable random events have probabilities (think of the typical coin flipping experiment). These are equal to the long-term frequency of occurrence of the events in which we are interested. In the Bayesian approach, probabilities are used to represent the uncertainty of an event, and as such it is possible to assign a probability value to non-repeatable events! We are then able to improve our probability estimate as we get more information about an event, narrowing our uncertainty about it. This is encapsulated in Bayes' theorem.
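For completeness, Bayes' theorem in its standard form relates the probability of an event A given evidence B to the probability of the evidence given the event:

P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}.

As we gather more evidence B, the prior probability P(A) is updated into the posterior P(A | B), narrowing our uncertainty.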

In a nutshell, the main goals of inferential statistics can be summarised as follows:

• Parameter estimation: If we are using parametric methods, we are interested in determining the distribution from which our samples came.

• Data prediction: Given the distribution obtained from the previous goal, we are now interested in predicting the future. With our model under our arm, we are able to apply it to tell us something about observations outside our sample.

• Model comparison: If we have a couple of models that can be applied, we are interested in determining which one explains the observed data better.

In general, statistics is interested in a set of observations describing a population and not about a single particular observation. In statistics we are interested in the uncertainty of a phenomenon, and we make a distinction between the uncertainties we can control and those we cannot; accounting for uncertainty is a big part of doing statistics. That uncertainty is framed in the context of probability. We need to be mindful not only about the methods that we use and the interpretation of the results obtained, but also about the reliability of the data collection, processing and storage.

1.5 Asking Questions and Getting Answers

As we have seen above, statistics is a general discipline that can be applied to a varied range of questions. It is a great tool to have under our belt to design laboratory and field experiments, surveys or other data collection tasks (what is not to like about stats?). It supports our work in planning inquiries and enables us to draw conclusions as well as making predictions.

The use of statistics can be seen in every branch of knowledge, from physics to biology, genetics to finance, marketing to medicine. Any businessperson, engineer, marketer or doctor can employ statistical methods in their work. A key to this use is the generation of suitable questions that can seek an answer based on data. There may be different approaches to tackle data-driven work, and a few steps are represented in the diagram shown in Figure 1.1.

[Figure 1.1: Diagrammatic representation of the data project workflow, a cycle through Objective, Planning Questions, Data and Metrics, Insights and Decisions.]

I always advocate [35] for strong foundations to support successful data-related projects. This includes the three pillar roles of:

• Project manager

• Lead statistician or data scientist

• Lead data architect

Variations of these titles are welcome!

[35] Rogel-Salazar, J. (2017). Data Science and Analytics with Python. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. CRC Press; and Rogel-Salazar, J. (2020). Advanced Data Science and Analytics with Python. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. CRC Press.

A very important step for the team above is to establish a clearly defined objective and explore how it informs the questions we might ask. A great starting point is to ask ourselves what it is we are trying to learn. In this way we are able to provide a direction and a goal to work towards. At the same time, a clear objective helps us and our team to stay the course and clear out any noise. Similarly, it lets us plan the best way to use our budget and resources.

Once we have clarity in the questions that need answering, it becomes much easier to determine what datasets may be useful in tackling those questions; the next step is to source data that can help answer the questions we formulated. I do despair when someone approaches me with a given dataset and asks me to tell them "something they don't know"... In any case, if you find yourselves in that situation, use the diagram above to take them through the workflow. Alongside the datasets, it is also important to determine what metrics we are interested in calculating and/or driving. In the first part of this book we will cover some of the statistical methods that will enable us to use data to answer questions that meet our objectives. We will use Python as the main programming language to carry out the analysis. In the next chapter we will give a concise introduction to the language, and in the chapters after that we will use it in the context of statistical analysis.

An important consideration at this point in our workflow is the distinction between outcomes and outputs (remember that outcomes ≠ outputs). These should be driven by the objectives we have set out. In other words, once we have a clear aim for what we want to learn, we need to develop the specific questions we will be answering with our analysis, and agree how we will assess the answers we obtain. For instance, our output may be related to the number of products we offer in the form of a report, whereas the outcome we expect is related to an increase in revenue and usage of those products. The distinction between outcomes and outputs may be a subtle one, but it is an important one to bear in mind, particularly during conversations with interested parties who need the results or the actions stemming from them. Some examples of outcomes and outputs are shown in Table 1.3.

Table 1.3: Examples of outcomes and outputs.

Outcomes                          Outputs
Increase revenue                  Reports
Change in cost                    Spreadsheets
Rate of growth                    Dashboards
Increased customer satisfaction   Infographics
Retain more customers             Brochures
Etc.                              Etc.

Think of it this way: Outputs tell the story of what we have produced, or indeed the activities of our team or organisation. Outputs themselves do not measure the impact of the work done; that is done by the outcomes. An outcome indicates the level of performance or achievement that occurred because of the activities carried out. In a way, the outcome should be contained in the output. Outcomes often focus on changing things, and they let us take a look at what may be the variables that impact our figures and provide us with insights. In turn, this leads us to making better decisions based on evidence. Finally, we can close the circuit by enabling the decisions made to plan for the next iteration in the workflow.

A key component of the workflow described above is the formulation of good data questions. A good data question is one that covers the following points (the marks of a good data question):

• It uses simple and direct language

• It is sharp and specific

• It asks technical teams to predict or estimate something

• It indicates what type of data is needed

• It involves data modelling or analysis

[Figure 1.2: The power of questions on a scale.]

Learning to ask better questions helps us make better requests to our teams and also enables us to clarify requirements. The way in which we formulate our questions can lead us to the answers we are looking for. Although all questions are important, in the workflow outlined above some question words are more powerful than others; use the scale presented in Figure 1.2 to help you. As we can see, "why" questions are more powerful than others. This is because they are more open-ended and allow for more exploration. At the opposite side of the spectrum we have question words such as "which, where, who". These tend to be more specific and hence more directive.

Do not be fooled into thinking that because they are marked as "less powerful" these questions are not important and should not be used; less powerful does not mean less important! On the contrary, since they are more specific they help us narrow in on specifics and may enable us to identify the data needed. The questions on the right-hand side of our diagram are more open and therefore can help us identify the objective of our data analysis.

Finally, I would like to mention the potential that we can tap into by realising what data we are missing. This is something that David Hand [36] has called dark data, in analogy to the dark matter that composes much of the universe. The data we have and measure does not necessarily capture all the information that is required to answer the questions we have. I would like to think that this is also reflected in the models we create with that data. Remember that there are no perfect models, just good enough ones. And we can extend that maxim by saying, "just good enough given the data we do have". Be mindful of this!

[36] Hand, D. J. (2020). Dark Data: Why What You Don't Know Matters. Princeton University Press.

1.6 Presenting Answers Visually

I would like to finish this chapter by talking about the use of visuals to present the answers we have obtained from the data analysis workflow we talked about in the previous section (we will talk about data visualisation in the last two chapters of this book). Many outputs we may generate, such as reports or dashboards, do end up making use of visual representations of the analysis done. This includes tables, charts, graphics and other visual cues that tell the story of the outcomes we have generated.

Representing data in tables can be a good way to provide a summary of the data. However, you would not like your audience to hunt for the patterns that drive your analysis in a long list of numbers; this would be the equivalent of trying to figure out where Waldo is in one of those super detailed illustrations. Instead, you may want to present a nice, elegant chart that is easy to understand and drives the point you want to make.

Our data and the statistical analysis we perform on it do tell a story. Presenting the answers to our questions should fit the audience that we are addressing. This can be made a much easier task by asking the following three questions when trying to visualise or present your data and results:

1. What is the purpose of generating a visual output for our data?

2. Who are we intending to communicate that purpose to?

3. How can we visualise that purpose in the best possible way?

The answers to those questions will start shaping up the visual outputs we need to design. However, that is just the beginning of the story. It is not uncommon to see some very cluttered visualisations, and this is because there may be a tendency to present anything and everything; whenever possible avoid cluttered visualisations, your audience will appreciate it. The end result is not only a busy visual but also one that misses the point of the message we are trying to create, leaving our audience trying to figure out what we are trying to convey at best, or utterly confused at worst.

Our visuals are not about us, but instead about two things:

1. the data analysis performed, and more importantly

2. making sure our audience gets our message in a clear and concise manner (remember that the aim is to communicate, not obfuscate).

That is why we should avoid the temptation of including


everything, and instead make sure that we are using the
visuals that are in line with the purpose of our
communication.

In the latter part of this book we will look at some aspects of


data visualisation that will help us make decisions about the
best way to present our results. Bear in mind that this part
of the statistical work has a firm overlap with design and as
such it may be a good opportunity to add your own creative
flair to the work you do. Having said that, be mindful of
adding extra design elements that may end up distorting or
adding ambiguity to the statistical message.

OK, let us get started with a bit of Python!


2
Python Programming Primer

Python programming is a lot of fun. No wonder


the language is becoming ever more popular and used
in academic and commercial settings. As programming
languages go, Python is versatile and with a very gentle
learning curve. The emphasis it puts on readability and
productivity does help its cause.

Python was created around 1989 by Guido van Rossum, who continued to be Python's Benevolent Dictator for Life (a title given to some open-source software development leaders who have the final say in disputes or arguments within the community) until his "permanent vacation" announcement in 2018. He named the language after the British comedy troupe Monty Python and it has nothing to do with the snake that bears the same name. In this chapter we will cover some of the most important aspects of programming with Python. As a modern language, Python exploits the paradigms of object orientation and a lot can be said about this. However, we are not going to cover those aspects of the language here; instead we will concentrate on the basics that enable the use of programming in an interactive way.

Python's attention to readability means that almost anyone can pick up a piece of code and understand what the programme is doing; readability is an important part of Python. We mentioned above that Python can be used interactively, which is enabled by the fact that Python is an interpreted language. This means Python executes code without our having to translate it into machine-language instructions first. In contrast, compiled languages such as C/C++, FORTRAN or Java are translated into machine code that gets executed by the computer's CPU. It can be argued that interpreted code may run less quickly than compiled code, but by the same token we can execute the same programme on multiple platforms without modification.

The Python programmer community (we call ourselves Pythonistas) has the advantage that the standard library of the language is quite powerful and comprehensive. Sometimes this is referred to as having "batteries included". Some of those batteries are things such as working with file system directories and reading files, manipulating strings, parsing JSON and XML, handling CSV files and database access, multiprocessing, mathematics, etc. Some of the more idiosyncratic features of the language include the use of indentation to mark blocks of code (an important feature of Python), the lack of punctuation to mark the end of code statements, and the use of idiomatic shorthand notation known as "pythonic style".

If you are a seasoned ninja Pythonista you may want to skip


this chapter. You may be interested in a refresher and if so
you are more than welcome to continue reading. If you are
a newcomer to the Pythonista community, you are welcome.
Please do take a look at this chapter and enjoy the party.

2.1 Talking to Python

Having a dialogue is an enriching experience, from talking to friends and family to interacting with colleagues and collaborators. Performing data analysis as a form of dialogue with our data is a more rewarding exercise that can tell us where to move next depending on the answers we are getting. The capabilities that Python, a very versatile general-purpose programming language, offers us in this regard are well suited. Not only are we able to run batch scripts to perform an established pattern of analysis and even produce reports programmatically, but we are also able to establish a dialogue with the data we are analysing thanks to the flexibility and interactivity provided by the Python ecosystem.

Python is an interpreted language. This means that the programmes we write are executed one statement at a time. In a simplified way, we are asking the computer to read the statements provided, evaluate them, provide us with an answer and move to the next statement. This cycle is sometimes called the REPL, and it is what enables interactivity:

1. Read the user input.

2. Evaluate the commands provided.

3. Print any results if available.

4. Loop back to step 1.

The REPL is one of the most attractive features of Python. One implementation that has gained the hearts of many data scientists and statisticians, as well as scientists in general, is the Jupyter notebook. The notebook gives us a fully fledged IDE (Integrated Development Environment) that lives in a plain JSON file and is rendered as a website. We can use the REPL to ask Python to extract, load and transform our data in an interactive manner, executing code in each of the cells in the notebook. Interactivity is not exclusive to the Jupyter notebook; indeed this can be achieved even with an interactive terminal such as the iPython shell, but it may not be an experience rated highly by all.

Another important feature that is often cited as an advantage of the Python ecosystem is the plethora of modules that are readily available to the Pythonista community; this is sometimes described as having "batteries included". We will be discussing the use of some of those modules or libraries in more detail, but as a starter we can mention a few of the more popular ones used for statistical analysis and data science. For example, NumPy (Numerical Python) is a module that lets us carry out numerical analysis work, extending Python to use multidimensional arrays and matrices. With this we can do computational linear algebra and other mathematical operations needed in statistical work. Further to that, we have the SciPy (Scientific Python) module, which builds on the functionality provided by NumPy, extending the library to include implementations of functions and algorithms used in many scientific applications.

As you can see, Python programmers are able to use the modularity of Python to extend existing libraries or modules. A great example of the power of that extensibility is the pandas module. It uses, among other modules, NumPy and SciPy to enable us to carry out panel data analysis (which gives pandas its name) with the use of dataframes, i.e. extensions to the arrays used in NumPy, for example. Within pandas we are able to do data manipulation and analysis, including plotting. This is thanks to the use of a library widely used for visualisation purposes in Python: Matplotlib. In terms of statistical analysis, there are some useful implementations of algorithms used in the area within NumPy and SciPy. More targeted approaches are provided by modules such as Statsmodels for statistical modelling and Seaborn for statistical data visualisation.
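As a small taste of how these modules fit together (a minimal sketch; the numbers are invented purely for illustration, and each library is covered properly later in the book):

# A minimal sketch of the scientific Python stack working together.
# The values below are made up for illustration only.
import numpy as np
import pandas as pd
from scipy import stats

values = np.array([1.2, 2.4, 3.1, 4.8, 5.0])  # a NumPy array
df = pd.DataFrame({'measurement': values})    # a pandas dataframe

print(df['measurement'].mean())  # a summary measure via pandas
print(stats.describe(values))    # summary statistics via SciPy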

We have mentioned that Python is an interpreted language, and this brings about another aspect that makes code written in Python flexible and portable. This means that, unlike compiled programmes, the scripts we develop in Python can be executed on machines with different architectures and operating systems. Python interpreters are available for Windows, Mac and Unix/Linux systems, and your work can be run on any of them, provided you have the correct version installed.

Talking about Python versions, in this book we are working with version 3.x. Remember that Python 2 was sunset on January 1, 2020 (it was released in 2000 and supported for 20 years, not bad!). This means that there will be no more releases for that version, and should there be any security patches needed or bugs that need fixing, there will not be any support from the community. If you or your organisation, school, university, or local Pythonista club are using Python 2, consider upgrading to the next version as soon as you can.

If you want an easy way to install the latest version of the language, I usually recommend looking at the Anaconda distribution [1], built by Continuum Analytics. It offers an easy way to install Python and a number of modules that can get you ready to go in no time. You may also want to look at Miniconda in case you want a streamlined version of the distribution. If you do use it, please remember that you may have to install individual modules separately. You can do that with the help of package managers such as pip, easy-install, homebrew or others. As you can imagine, there are other ways to obtain Python and any of them should be suitable for the explanations and code that will be used in the rest of the book.

[1] Anaconda (2016, November). Anaconda Software Distribution. Computer Software. V. 2-2.4.0. https://www.anaconda.com
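For instance, assuming you have pip or the conda package manager available in your terminal, installing an individual module looks something like this (the module names here are just examples):

> pip install pandas

> conda install seaborn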

2.1.1 Scripting and Interacting

We now understand that Python can be used in an interactive manner. This can be done, for instance, in a terminal or shell to establish a dialogue with Python. This is only one part of the story. We can actually use Python in a more traditional way by writing the entire sequence of commands we need to execute and running them in one go without interrupting the computer. Depending on the type of analysis, you have the flexibility of using one approach or the other; you do not have to choose one over the other, as Python is flexible enough to enable you to do both. In fact, you can start with an interactive session and, once you are happy with the results, you can save the commands in a stand-alone script.

We mentioned above using a terminal or shell to interact with Python. A shell lets us access the command line interface, where we can input text to let our computer know what to do. In the case of Python we may want to use something like an iPython shell to run Python commands (iPython is an interactive command shell for writing programmes in various languages, including Python). Within the shell we can see the result of the commands we issue and be able to take the output of the latest code executed as input for the next. This is very useful in prototyping and testing.

One disadvantage of using the shell directly is that, although we can interact with Python, we are not able to persist or save the programmes we write. In many cases you would like to save your programmes so that you can use them at a later point in time. In those cases you may want to use scripts. These are simple text files, with the extension .py, that can be executed with the Python interpreter. If you have saved your script with the name myprogramme.py, you can execute it from the terminal as follows (the command is launched directly from the terminal; there is no need for the iPython shell):

> python myprogramme.py

We are assuming that the script has been saved in the local path. In this way we can run the programme as many times as we want; we can also add instructions and extend our analysis.
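For illustration, the contents of a hypothetical myprogramme.py could be as simple as the lines below; saving them in a text file and running the command above executes them from top to bottom:

# myprogramme.py - a hypothetical example script
result = 365 - 323
print('The answer is', result)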

For the purposes of the explanations and implementations shown in this book, we will make the assumption that you are using an interactive shell such as iPython. This will let us break up the code as we go along and, when appropriate, we can see the output we are generating. You may decide to save the commands in a single script for reuse. The code will be presented with a starting diple (>) to indicate that a command has been entered in the shell. Here is an example:

> 365 - 323

42

As you can see, when running this command Python prints out the answer. Remember that we are using the REPL (see Section 2.1) and Python is printing the result of the operation for us.

In other cases we may need to show more than one command in our explanations. In those cases we will be presenting the code in a script format and we will not show the diple. The same will be done for the cases where the REPL is not expected to print the result; if the REPL has nothing to print, we will not show the shell prompt. For example, in the case of assigning a value to a variable, the interactive shell does not print anything. In this case the code will be presented as follows:

result = 365 - 323

The commands we issue to communicate with Python are enough for the interpreter to know what it is required to do. However, with the principle of readability in mind, human interpreters of our code are better off when we provide extra explanations of what we are trying to achieve. This is where comments come in handy. A comment in Python is created with a hash symbol, #. Here is an example:

> pi = 3.141592  # An approximation of pi

In the example above we have assigned the value 3.141592 to the variable called pi, and we have done so with the help of the = sign (see Section 2.6 for a better way of referring to π). Note that Python is ignoring everything after the hash symbol. The most typical way to enter comments is to start a line with a hash. Including comments in our programme makes it easier for other programmers to understand the code; please include comments in all your programmes, your future self will appreciate it! An added advantage of using comments is that we can remove some lines of code from execution without deleting them from our script. We simply comment them out, and if we need them again, we can always remove the hash.

2.1.2 Jupyter Notebook

Having interactive shells is great, as we can see the outcome of the code we have written as it executes. In many cases this is enough, but sometimes we would like a bit more than readability in a plain-looking terminal. Wouldn't it be great to be able to have that interactivity together with a way in which the comments we add are formatted nicely and even include pictures or plots, as well as being able to share the script not only with other programmers but with a more general audience? Interactivity is the name of the game with Jupyter.

Well, all that and more is possible with the Jupyter notebook (Jupyter notebooks used to be called iPython notebooks). As the name indicates, a notebook provides us with a way to keep our programmes or scripts annotated, and we are able to run them interactively. Code documentation can be done beyond the simple use of comments, as the notebook understands markdown and it is presented as part of a web-based interface.

A Jupyter notebook is effectively a JSON file, and it lets us include plain text, mathematical expressions and inline graphics as well as other rich media such as websites, images or video. The notebook format has the extension .ipynb, so you can easily recognise one. In many cases, though, it is not possible to open a notebook directly by double-clicking on the file. Instead, you open it via the Jupyter web interface. Portability is the name of the game with Jupyter notebooks, and this means that we can convert our files into plain Python scripts or even create presentations, export to PDF or HTML and even LaTeX files. Jupyter gets its name from Ju-lia, Py-thon and R, some of the languages supported by the interface; other kernels are available too, and the Jupyter project supports the use of other programming languages. We recommend that you use notebooks as you work through the content of the book. We will not show the Jupyter interface, to facilitate readability of the book.
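For example, assuming a notebook saved as analysis.ipynb (a hypothetical file name), the jupyter nbconvert tool can be used from the terminal to export it to some of those formats:

> jupyter nbconvert --to script analysis.ipynb

> jupyter nbconvert --to html analysis.ipynb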

2.2 Starting Up with Python

Up until now we have been talking about the benefits of using Python and how the interactive capabilities of an interpreted language are well suited to our needs in data analysis and statistics. Python can be used as a glorified calculator, but there is more to it than that! Furthermore, in the brief examples above we have seen how Python uses well-known notation to perform arithmetic operations such as subtraction (-), and I am sure you can see how other operators fit in within that syntax. These operators are shown in Table 2.1; perhaps the one operator that needs some attention is exponentiation, which is represented with a double asterisk, **. These operators are available to be used with numbers, and like many other programming languages Python has types. Let us take a look.

Table 2.1: Arithmetic operators in Python.

Operation        Operator
Addition         +
Subtraction      -
Multiplication   *
Division         /
Exponentiation   **
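For instance, we can try a couple of these operators directly in the shell (the numbers are arbitrary):

> 2 ** 10

1024

> 7 / 2

3.5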

2.2.1 Types in Python

In Python we do not need to declare variables before we use them and, furthermore, we do not have to tell the interpreter what type the variables or objects are; each variable we create is automatically a Python object (remember that Python is an object oriented language). Instead, Python is a dynamically typed language. This means that the type of an object is checked as the code runs and the type is allowed to change over its lifetime; we do not need to specify the variable type in advance. However, we still need to know some of the basic types supported in Python so that we can build other objects. Let us take a look.
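As a quick illustration of dynamic typing (using an arbitrary variable name), the same variable can hold objects of different types over its lifetime:

> answer = 42

> type(answer)

int

> answer = 'forty-two'

> type(answer)

str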

2.2.2 Numbers: Integers and Floats

As you can imagine, Python can be used as a calculator, and therefore supporting numbers is part and parcel of this use case. Python supports two basic number types: integers and floating point numbers, in Python called floats. In that way it is possible for us to assign an integer value to a variable as follows:

> magic = 3

Remember that assignation does not require Python to print anything as a result. It is possible for us to check the type of an object with the command type:

> type(magic)

int

Python will let us know what type of object we are dealing with; in this case the object magic is of type integer. Let us see an example for a floating point number:

> trick = 1.5

> type(trick)

float

In this case we can see that the type of the object trick is float. Notice that since Python is dynamically typed we can mix integers and floats to carry out operations and the Python interpreter will make the appropriate conversion or casting for us. For example:

> trick2 = magic / trick

> print(trick2)

2.0

You can check that the operation above results in a float, as expected:

> type(trick2)

float

We can cast the type of a variable too. For example, if we require the result stored in trick2 to be an integer, we can do the following:

> trick2 = int(trick2)

> type(trick2)

int

If we print the value assigned to trick2 after the casting operation above, we get an integer; we can see that as it has no decimal part:

> print(trick2)

2

We are using functions to do the casting for us; these include:

• int() creates an integer number from an integer literal, a float literal by removing all decimals, or a string literal (so long as the string represents a whole number). A literal is a notation for representing a fixed value in source code.

• float() creates a float number from an integer literal, a float literal or a string literal (so long as the string represents a float or an integer)

• str() creates a string from a wide variety of data types, including strings, integer literals and float literals

Take a look at these examples:

x = float(1)  # x will be 1.0
y = int('2')  # y will be 2 (an integer number)
z = str(6.3)  # z will be '6.3' (a string)

We mentioned strings above, but we have not said what they are. Not to worry, let us string along.

2.2.3 Strings

A sequence of characters is called a string. Strings in Python are defined by enclosing the characters in question in either single (' ') or double (" ") quotes. Here are a couple of examples:

> example1 = 'This is a string'

> example2 = "And this is also a string"

Strings can be used to print messages to the standard output with the help of the print function:

> print(example1)

This is a string

> print(example2)

And this is also a string

We can check the type of example1 as we have done before:

> type(example1)

str

As we can see, the type of the variable is reported to be str, or a string.

Python knows how to add two numbers, and should we use the + operator between two strings, Python knows what to do. Instead of giving an error, it overloads the operator to concatenate the two strings:

> vulcan, salute = 'Live Long', 'Prosper'

> print(vulcan + ' and ' + salute)

Live Long and Prosper

If you program in other languages, the statement above may look surprising. In Python it is possible to assign multiple values to multiple variables in one single line of code. This is part of the pythonic style of coding mentioned at the beginning of this chapter. In the statement above we are assigning the value 'Live Long' to the variable vulcan, whereas the variable salute has the value 'Prosper'. Furthermore, in the concatenation we have added the string ' and ' to get the traditional salute made famous by Mr. Spock.

It is important to note that the operator would not work if we were to mix strings with either floats or integers. Let us take a look:

> print(vulcan + magic)

...
TypeError: can only concatenate str (not "int") to str

We can see that Python is telling us in the error message that we cannot concatenate the two values. Instead, we can use casting to correct for this as follows:

> print(vulcan + str(magic))

Live Long3

Note that there is no space between the g and the 3


characters as we never asked for one!

Python also overloads the multiplication operator * to be used with strings, and in this case it replicates the string. For example:

> print(vulcan + ' and ' + 3*(salute + ' '))

Live Long and Prosper Prosper Prosper

Finally, it is possible to create strings over multiple lines with the use of triple single or double quotes. This is also useful to define docstrings for functions (more about docstrings in Section 2.5).

> space = """Space:
the final frontier.
These are the voyages of
the Starship, Enterprise.
"""

There are a few other tricks that strings can do and we will
cover some of those later on. One more thing to mention
about strings is that they are immutable objects. This means
that it is not possible to change individual elements of a
string. We shall discuss more about immutable objects in
the context of tuples in Section 2.3.3.

Strings in Python are immutable.
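As a quick sketch of this immutability, reusing the example1
string defined above, trying to reassign a single character
produces an error:

> example1[0] = ’t’

...
TypeError: ’str’ object does not support item assignment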

2.2.4 Complex Numbers

We have seen that Python understands the use of


numbers such as integers and floats. If you are interested in
doing mathematical operations, we may also be interested in
using complex numbers too, and Python has you covered.
In a very engineering style, Python denotes the imaginary
number i = √−1 as j, and so for a number m, mj is
interpreted as a complex number.

Python calls the imaginary number i as j.

Let us see an example: If we want to define the complex


number z = 42 + 24i we simply tell Python the following:

> z = 42+24j

> print(z) Complex numbers in Python are


not complicated!

(42+24j)

As any complex number, we can get its real and imaginary


parts. Let us take a look:

> print(’The real part is {0}, \

the imaginary part is {1}’ \

.format(z.real, z.imag) )

The real part is 42.0, the imaginary part is 24.0

You can see that Python is casting the integer numbers


used to define z as floats. The curious among you may have
noticed a few interesting things in the string above. First,
we have broken out our statement into several lines of code
with the use of a backslash (\). This helps with readability
of the code. Furthermore, we have used a string method
called format that helps format specified values and insert
them in the given placeholders denoted by the number
inside curly brackets. In that way, the value of z.real is
replaced in the {0} placeholder, for example.

The backslash allows us to break a line.
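As a brief aside, the placeholders accepted by format can also
carry a format specification; for instance, {0:.2f} limits a
value to two decimal places. A minimal sketch with the real
part of z:

> print(’The real part is {0:.2f}’.format(z.real))

The real part is 42.00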

You may also have noticed that we have referred to the
real and imaginary parts with .real and .imag. These are
methods of the complex type. This makes sense when we
remember that each entity in Python is an object. Each
object has a number of possible actions they are able to
perform, i.e. methods.

The method of an object can be invoked by following the name of
the object with a dot (.) and the name of the method.

There is an alternative way to define a complex number in


Python: the complex() function:

> x, y = complex(1, 2), complex(3, 4)

We can also create complex numbers with complex().

As you can imagine, Python knows how to do complex


arithmetics:

print(’Addition =’, x + y)

print(’Subtraction =’, x - y)

print(’Multiplication =’, x * y)

print(’Division =’, x / y)

print(’Conjugate =’, x.conjugate())

Complex arithmetics in Python.

Addition = (4+6j)

Subtraction = (-2-2j)

Multiplication = (-5+10j)

Division = (0.44+0.08j)

Conjugate = (1-2j)

Python has a couple of modules that extend its functionality


to define mathematical functions. For real numbers we can
use math and for complex numbers we have cmath.
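As a small, illustrative sketch of the latter, cmath happily
handles operations that math would reject for real numbers,
such as the square root of a negative number:

> import cmath

> cmath.sqrt(-1)

1j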

2.3 Collections in Python

An organised group of objects acquired and maintained


for entertainment, study, display or simply for fun is known

as a collection. I am sure you have one yourself, even if


you do not know it or want to admit it. Some of us collect
stamps, mugs, frogs or cats. Whatever you collect, each item
in your collection is important. The collection itself may get
bigger or smaller over time, and sometimes you would like
to know where each item is. Well, Python has a pretty good
idea of what collections are too and in fact they work in a
similar manner to the collection of ornamental plates you
keep in the attic.

Any similarities with actual collections you may have are
purely coincidental.

A collection in Python is effectively a container type that is


used to represent different data type items as a single unit.
Depending on the use case, different collection types have
ways to manipulate data and manage storage. Python has
four collection types: Lists, sets, tuples and dictionaries.
Let us take a look at each of them. Collections are said to be
iterables. In Python an iterable is an object that can be used
as a sequence. We are able to loop over an iterable and thus
access the individual items. This, as we shall see, is a very
useful feature of the language.

We talk about while loops in Section 2.4.3 and for loops in
Section 2.4.4.

2.3.1 Lists

A list is a useful collection that I am sure you already


know how to use. We all maintain a list of things to do, or
a shopping list, or a list of friends and foes. In Python a list
is a very versatile type that puts items in a sequence. The
items in a Python list can be of different types and they can
be accessed one by one, they can be added or deleted, and
you can sort them or loop through them.

Python lists can have items of different types.

A list in Python is defined with square brackets, [ ]. Lists


are mutable objects and therefore it is possible to change
individual elements in a list. Furthermore the items in a list
are ordered and we can have duplicates. Let us create a few
simple lists:

A list is defined with square brackets [ ].

numbers = [1, 2, 3, 4, 5]

vulcan = [‘‘wuhkuh’’, ‘‘dahkuh’’, ‘‘rehkuh’’,
          ‘‘kehkuh’’, ‘‘kaukuh’’]
starships = [1701, ‘‘Enterprise’’, 1278.40,
             1031, ‘‘Discovery’’, 1207.3]

Yes, those are the numbers 1 to 5 in Vulcan!

We mentioned above that you can look at the individual


items in our collection, and a list is no different. Since order
is important in a list, we can refer to each of the elements
in the list by their position. This is given by an index that is
part of the list. We can do this as follows:

> print(numbers[0])

1

We can refer to elements in a list with an index.

> print(vulcan[2:4])

[’rehkuh’, ’kehkuh’]

Note that we are referring to the first element in the list


numbers with the index 0; indexing in Python starts at zero.
If you were playing hide and seek
with Python it would start counting as: “0, 1, 2, 3, . . .”. In

the second command above you can also see how we can
refer to a sub-sequence in the list. This is often referred

to as slicing and dicing the sequence. We refer to the sub-


sequence with the help of a colon as start:end, where
start refers to the first element we want to include in

the sub-sequence and end is the last element we want to


consider in the slice.

Since Python counts from 0, the slicing notation means that


we are requesting for the last entry in our sub-sequence to
be the end-1 element of the original list. In our example, we
are asking for the sub-list to read item 2 and up to 4, but not
including 4. That is why only the third and fourth elements
of the vulcan list are returned.

Slicing refers to the subsetting of an array-like object such as
lists and tuples.

A more general way to obtain a slice of a list L sequence is


as follows:
L[start : end : step], (2.1)

which lets us traverse the list with an increment given by


step, which by default has the value 1. We can also omit the

start and/or end to let Python know that we want to start

at the beginning of the list and/or finish at the end of it.
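As a brief sketch of the step argument using the numbers list
defined above, we can take every other element, or even use a
negative step to traverse the list backwards:

> numbers[::2]

[1, 3, 5]

> numbers[::-1]

[5, 4, 3, 2, 1]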

Figure 2.1: Slicing and dicing a list.

We can use negative indices to slice our list from the end. In
this case the last item has the index −1, the next one −2 and
so on. We can see how the indices and the slicing operator

work in the picture shown in Figure 2.1. We can think of


each of the letters shown to be individual items in a list.
However, you may be surprised to find out that strings can
actually be sliced and diced in the same way as lists. Let us
look at slicing the word shown in Figure 2.1:

> L = ’StarTrek’

> print(L)

’StarTrek’

Strings are sequences of characters and we can slice them and
dice them too.
> L[0]

’S’

>L[1:4]

’tar’

> L[-4:-1]

’Tre’

Lists are mutable objects and this means that we can change
individual items in the list. Let us create a list to play with:

Remember, however, that strings are immutable! In this case we
have a list of characters and that can be modified.

> scifi = [’S’,’t’,’a’,’r’,’T’,’r’,’e’,’k’]

> scifi[4]

’T’

Let us change the name of our favourite Sci-Fi saga by


changing the last four items in our list:

> scifi[4:] = ’W’,’a’,’r’,’s’

> print(scifi)

Here we are changing the last 4 items in our list.

[’S’, ’t’, ’a’, ’r’, ’W’, ’a’, ’r’, ’s’]

We have changed the item with index 4 to be the letter “W”,


index 5 to “a” , 6 with “r” and 7 with “s”.

Mutability also means that we can add elements to a list.
We can do this with the append method:

> scifi.append(’.’)
> print(scifi)

append lets us add elements to a list.

[’S’, ’t’, ’a’, ’r’, ’W’, ’a’, ’r’, ’s’,’.’]

The new element (a period .) is added to the scifi list at


the end, increasing the length of the list by one element.

Concatenation of lists is easily achieved with the + operator.


Let us concatenate the numbers and vulcan lists:

> print(numbers + vulcan)

[1, 2, 3, 4, 5, ’wuhkuh’, ’dahkuh’,
’rehkuh’, ’kehkuh’, ’kaukuh’]

The + symbol lets us carry out list concatenation.

Since the order of elements in a list is important, the


concatenation of lists is not commutative. In other words
numbers + vulcan is not the same as vulcan + numbers.

Another important thing to note is that lists are not arrays



and this means that the + operator will not sum the
elements of the list. We will talk about arrays in the next
chapter.

We know that lists have order and as they are mutable we


are able to change that order. In particular, we can sort the
items in our collection. We do this with the sort method.
Let us take a look:

Order is important in a list. sort is used to order a list in
place.

> l1 = [21, 0, 3, 1, 34, 8, 13, 1, 55, 5, 2]

> print(l1)

[21, 0, 3, 1, 34, 8, 13, 1, 55, 5, 2]

We can now call the sort method as follows:

> l1.sort()

> print(l1)

Using sort orders a list in ascending order.

[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

Note that since this is a method, we call it with the dot (.)
notation. The sorting is done in place and this means that
now our list has changed. We can see that when we print
the list we get the Fibonacci numbers from our original list
in ascending order. If we wanted them in reverse order we
simply pass this as a parameter as follows:

> l1.sort(reverse=True)
> print(l1)

[55, 34, 21, 13, 8, 5, 3, 2, 1, 1, 0]

We can use reverse to order in reverse.

Python also has functions that can act on objects. In this


case there is a function for lists that can help us order the
items. This is the sorted function and the difference is that
the function will create a new object rather than changing
the original:

Lists have a sorted function.

> l1 = [21, 0, 3, 1, 34, 8, 13, 1, 55, 5, 2]

> print(sorted(l1))

A new object is created, the original list is not changed.

[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

We can check that the original object has not been modified
by printing it:

> print(l1)

[21, 0, 3, 1, 34, 8, 13, 1, 55, 5, 2]

In principle we could have assigned the result of applying


the sorted function to a new variable. In that way we can
refer to the newly created object at a later point in time.

As before, if we need the elements in descending order we


can simply pass the reverse parameter to the function:

> sorted(l1, reverse=True)

Once again we can use reverse to


obtain a list ordered in reverse.
[55, 34, 21, 13, 8, 5, 3, 2, 1, 1, 0]

The elements of a list can be other types of objects, for


example they can be other lists. Let us define a short list of
lists as follows:

The items in a list can be other collections, such as lists.

> m = [[1,2], [2,3], [3,4]]

We can select items in the list:

> print(m[2])

[3,4]

and continue drilling down to obtain the sub-elements in
the lists that comprise the main list:

> m[2][1]

4

The same slicing and dicing operations can be applied to them.

We can obtain the length of a list with the len function. For
example:

> print(len(m))

3

The length of a list can be obtained with the len function.

In this case we can see that the list m has 3 elements.

Please note that it is not possible to simply copy a list by


assignation. In other words list2 = list1 does not result
in a copy of list1. Instead it is a reference to it and any
changes made to list1 will be made to list2. If you
require a copy, you need to use the copy() method. Other
methods you can use with lists can be seen in Table 2.2.

We cannot copy a list by assignation. We use the copy() method
instead.

List method Description


append(item) Add item at the end of the list
clear() Remove all items
copy() Copy the list
count(value) Number of items with the given value
extend(iterable) Add items of an iterable at the end of the current list
index(value) Index of the first item with the given value
insert(position, item) Add item at the specified position
pop(position) Remove item at given position
remove(value) Remove item with given value
reverse() Reverses the order of the list
sort(reverse=True|False, Sort the list, func is a function to specify sorting
key=func)

Table 2.2: List methods in Python.
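As a short sketch of a few of the methods in Table 2.2, and
assuming the scifi list is as we left it above (ending in a
period), we can copy it, count repeated items and remove
elements:

> scifi2 = scifi.copy()

> scifi2.count(’r’)

2

> scifi2.pop(0)

’S’

Since scifi2 is a true copy, removing its first element leaves
the original scifi list untouched.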

2.3.2 List Comprehension

There is a very pythonic way of creating lists. This is


known as list comprehension and it lets us create a new list
based on the items of an existing one. For example, we can
take a list of numbers and create a new one whose elements
are the square of the original items:

List comprehension is a very pythonic way of creating lists.

> l2 = [x**2 for x in l1]

> print(l2)

[441, 0, 9, 1, 1156, 64, 169, 1, 3025, 25, 4]

In a more traditional way, the code above could be written


in an explicit for loop using a counter. As we shall see, we
do not need counters in Python when using an iterator and
this can be used to our advantage as we can include the for
loop inside the list definition as shown above.

We will cover the use of for loops in Section 2.4.4.

Moreover, the syntax can accommodate the use of a


condition too. The full syntax is as follows:

newlist = [ expression for item in iterable
            if condition == True ].        (2.2)

We will cover conditional statements in Section 2.4.2.

Let us look at an example:

> l3 = [ [x, x**2] for x in l1 if x <= 3]

> print(l3)

[[0, 0], [3, 9], [1, 1], [1, 1], [2, 4]]

In this case we are telling Python to create a list of lists.


Each element in our list l3 is a list whose elements are a
number and its square. We are reading the elements of list
l1 and only applying the operation if the item from l1 is
smaller or equal to 3. All that in a single line of pythonic
code!

List comprehension is a compact way to do things in a single
line of code.

2.3.3 Tuples

We have seen that a list is an ordered sequence that we


can change. What happens if we want to use a sequence
where order is important, but we also require it to be
immutable? Well, the answer is that we need to use a tuple.
The items of a tuple can also be of different types but
immutability means that we are not able to change the
items. Tuples support duplicate values too and we define a
tuple in Python with round brackets, ( ), and each element
in the tuple is separated with a comma.

Tuples are defined with round brackets ( ).

Let us take a look at some tuples:

numbers_tuple = (1, 2, 3, 4, 5)
vulcan_tuple = (‘‘wuhkuh’’, ‘‘dahkuh’’, ‘‘rehkuh’’,
                ‘‘kehkuh’’, ‘‘kaukuh’’)
starships_tuple = (1701, ‘‘Enterprise’’, 1278.40,
                   1031, ‘‘Discovery’’, 1207.3)

These tuples contain the same items as the lists created in the
previous section.

The items in the variables defined above are the same as


in the previous section. The main difference is that these
variables are tuples instead of lists. We can see this as they
have been defined with round brackets. We can ask for the
type to make sure:

> type(vulcan_tuple)

’tuple’

We know that tuples are iterables where order is important.


This means that we can use the colon notation we learned in
Section 2.3.1 to slice and dice our tuples:

Tuples can also be sliced with the help of an index.

> starships_tuple[1]

’Enterprise’

> numbers_tuple[3]

4

> vulcan_tuple[1:4]

(’dahkuh’, ’rehkuh’, ’kehkuh’)



Notice that in the example above, the result of the slicing


command returns a sub-tuple. You can also use negative
indices as we did for lists and the diagram shown in Figure
2.1.

As we have mentioned above, immutable objects cannot be


changed. In other words, we cannot add or remove elements
and thus, unlike lists, they cannot be modified in place.
Let us see what happens when we try to change one of the
elements of a tuple:

Tuples are immutable objects.

> starships_tuple[0] = ’NCC-1701’

...
TypeError: ’tuple’ object does not support item
assignment

We are not able to change elements of a tuple as they are
immutable objects.

Similarly we are not able to apply a method to sort a tuple


in place. However, we can create a new list whose elements
are the sorted items from the original tuple. We can do this
with the sorted function. Let us define a tuple as follows:

> t1 = (21, 0, 3, 1, 34, 8, 13, 1, 55, 5, 2)

We can now apply the sorted function to the tuple:

> t2 = sorted(t1)

> print(t2)
The result of the sorted function
on a tuple is a list.
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

We mentioned above that the result would be a list and


we can see that, as the object above is printed with square
brackets. We can check this with the type command:

> type(t2)
The result of the function is a list.

list

This behaviour is due to the immutability of tuples. Since


we cannot change the tuple directly we transform it into a
list. This also means that the number of methods available
for tuples is limited too, as shown in Table 2.3.

Tuple method Description


count(value) Number of items with the given value
index(value) Index of the first item with the given value
Table 2.3: Tuple methods in
Python.

Let us look at a couple of examples:

> t1.count(1)

2

> t1.index(1)

3

We can see that there are two items with the value 1 in
the tuple t1, and that the first item with value 1 is in the
position with index 3, i.e. the fourth element in the tuple.
We can obtain the length of a tuple with the len function:

> len(vulcan_tuple)

5

The immutability of tuples may be a strange thing if you


are coming from other programming languages. You may
be asking yourself, why do they exist? Well, the answer is
that their use indicates to other programmers, and indeed
to the Python interpreter, that the programme contains an
object that will never change, which means that they also
offer some performance optimisations.

These are some reasons to use tuples.

A useful way to create a tuple is via the zip() function. It


takes one or more iterables as input, and merges them item
by item based on their indices. The result is a tuple iterator.
An iterator is an object used to iterate over an iterable object.
The iterator that results from the application of zip() yields
tuples on demand and can be traversed only once. Let us
zip the list numbers with the tuple vulcan_tuple as follows:

> translate = zip(numbers, vulcan_tuple)

> type(translate)

The zip() function merges iterables by index.

’zip’

As you can see, the result is an object of class zip. If you try
to print the object directly, you get a strange result:

> print(translate)

<zip object at 0x7fd41092f380>

A zip object yields each item one at a time until the iterable
is exhausted.

This is because we are trying to print an iterator. Instead, we


need to ask Python to yield the items one by one until the
iterable is exhausted. We can do this with a for loop as we
will see in Section 2.4.4 or by passing the request to a list as
follows:

> list_translate = list(translate)
> print(list_translate)

[(1, ’wuhkuh’), (2, ’dahkuh’), (3, ’rehkuh’),
(4, ’kehkuh’), (5, ’kaukuh’)]

We can cast the zip object into a list if required.

We can see how the association worked and we can use this
as a form of dictionary to get the name of the numbers in
vulcan. Perhaps though, we can use an actual dictionary for
these purposes. Let us take a look.

2.3.4 Dictionaries

How do we go about counting from 1 to 3 in Vulcan?


Well, it is only logical to ask the Enterprise computer to give
you some information and what we will get is effectively
a dictionary. Like a human dictionary, a Python one lets
us find the information associated with a key. We do not
have to iterate over every entry in the dictionary to find the
information, we simply search for it by key. A dictionary in
Python is made out of key and value pairs.

Although the universal translator used by Starfleet command
will make this unnecessary.

We define a dictionary in Python with the use of curly


brackets, { }. Each key-value pair is constructed with a
colon (:) between the key and the value and pairs are

separated from each other by a comma. Let us take a look at


some examples:

> enterprise = { ’James T. Kirk’: ’Captain’,

’Spock’: [’First Officer’, ’Science Officer’],

’Nyota Uhura’: ’Communications Officer’,

’Pavel Chekov’: [’Navigator’,

’Security/Tactical Officer’] }

With the dictionary above we can check the rank of each of


the senior command officers on the Enterprise. The keys in
this case are strings, and that is the typical case. In general
the keys of a dictionary can be any immutable Python object
including numbers, strings and tuples. The value associated
with a particular key can be changed by reassigning the new
value to the dictionary item with the correct key. This means
that dictionaries are mutable and duplicate keys are not
allowed.

The keys can be any immutable object: Numbers, strings or
tuples, for example.
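For instance, here is a minimal, made-up sketch using tuples as
keys to label our favourite starships by registry:

> registry = {(’NCC’, 1701): ’Enterprise’,
              (’NCC’, 1031): ’Discovery’}

> registry[(’NCC’, 1701)]

’Enterprise’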

We can obtain the value for a given key as follows:

> enterprise[’Spock’]
We can query a dictionary by key.

[’First Officer’, ’Science Officer’]

In this case the result is a list. In general, the value


associated with a key can be any object. Since the result of
that particular entry is a list, we can ask for the n-th element
of the list:

> enterprise[’Spock’][1]
We can use nested slicing and
dicing as required.
’Science Officer’

If we attempt to search for a key that does not exist, Python


will let us know that there is an error.

> enterprise[’Montgomery Scott’]

...
KeyError: ’Montgomery Scott’

As expected, we cannot query for keys that do not exist.

However, as dictionaries are mutable we are able to add new


key-value pairs:

> enterprise[’Montgomery Scott’] = ’Engineer’

However, we can create new entries on the go.

Furthermore, we can change the value associated with a key


simply by reassigning the new value. We can, for example,
correct the rank of Scotty:

> enterprise[’Montgomery Scott’] = ’Chief Engineer’
> enterprise[’Montgomery Scott’]

’Chief Engineer’

The values in a dictionary can be modified.

You may be interested in getting to see all the keys in the


dictionary and there is a method that lets us do just that:
keys():

> names = enterprise.keys()

> print(names)

A list of dictionary keys can be obtained with the keys()
method.

dict_keys([’James T. Kirk’, ’Spock’, ’Nyota Uhura’,

’Pavel Chekov’, ’Montgomery Scott’])

The object created is a view object which provides a


dynamic view of the dictionary’s entries. They can be
iterated over and work in a similar way to a list. Let us
update our dictionary and print the content of names:

> enterprise[’Leonard McCoy’]=’Chief Medical’ \

’ Officer’

> print(names)

dict_keys([’James T. Kirk’, ’Spock’, ’Nyota Uhura’,
’Pavel Chekov’, ’Montgomery Scott’,
’Leonard McCoy’])

The keys() method creates a dynamic view that updates as the
dictionary object changes.

As we can see, the addition of Mr McCoy to the dictionary


was immediately picked up by the viewer object names, no
need to update it separately. The same behaviour happens
for two other methods, one to see all the values values()
and one for the key-value pairs items():

> enterprise[’Hikaru Sulu’] = ’Helmsman’
> rank = enterprise.values()
> pairs = enterprise.items()

The same behaviour is obtained with the values() and items()
methods.

In the code above we have added Mr Sulu to our dictionary


and created two new view objects, one with the values

associated with the keys in the dictionary and a second one


with the associated pairs. Let us look at the contents:

> print(names)

dict_keys([’James T. Kirk’, ’Spock’, ’Nyota Uhura’,

’Pavel Chekov’, ’Montgomery Scott’,

’Leonard McCoy’, ’Hikaru Sulu’])

> print(rank)

dict_values([’Captain’, [’First Officer’,
’Science Officer’], ’Communications Officer’,
[’Navigator’, ’Security/Tactical Officer’],
’Chief Engineer’, ’Chief Medical Officer’,
’Helmsman’])

The contents of the dynamic views created with keys(),
values() and items() are kept up-to-date.

> print(pairs)

dict_items([(’James T. Kirk’, ’Captain’),
(’Spock’, [’First Officer’, ’Science Officer’]),
(’Nyota Uhura’, ’Communications Officer’),
(’Pavel Chekov’, [’Navigator’,
’Security/Tactical Officer’]),
(’Montgomery Scott’, ’Chief Engineer’),
(’Leonard McCoy’, ’Chief Medical Officer’),
(’Hikaru Sulu’, ’Helmsman’)])

Note that the entries of the items() dynamic view are tuples
containing the keys and values.

You can see that the view object that corresponds to the
items in the dictionary is made of tuples containing the key
and value for each entry.

In Section 2.3.2 we saw how list comprehension can help us


create lists in a streamlined manner. We can apply the same
logic to the creation of new dictionaries in Python. In this
case we also use the pythonic style of multiple assignation.
Dictionary comprehension is also able to take a condition
if necessary. Let us create a dictionary out of the zipped
translate object we created in Section 2.3.3:

We can create dictionaries with some dictionary comprehension.

> translate = zip(numbers, vulcan)

> vulcan_dict = {k:v for (k, v) in translate}
> print(vulcan_dict)

{1: ’wuhkuh’, 2: ’dahkuh’, 3: ’rehkuh’,
4: ’kehkuh’, 5: ’kaukuh’}

We use a zipped object to create a dictionary by comprehension.

We are taking each item in the zip object and assigning the
values to two variables: k for the key and v for the value.
We then use these variables to construct the entries of our
new dictionary as k:v. The result is a new dictionary that
works in the usual way:

> vulcan_dict[3]

’rehkuh’

With a zip object there is another more succinct way to
create the dictionary without comprehension. We could
have simply used the dict() function:

Remember that zip objects can only be used once! You may need
to re-create translate to make this work.

vulcan_dict = dict(translate)

Dictionary method Description


clear() Remove all items
copy() Copy the dictionary
fromkeys(keys, val) Dictionary with the given keys. val is optional and is
applied to all keys
get(key) Value of given key
items() View of all key-value pairs
keys() View of all keys
pop(key) Remove entry with given key
popitem() Remove last inserted entry
setdefault(key, val) Returns the value of specified key. If key does not exist it
inserts it with the specified val
update(dict) Inserts dict into the dictionary
values() View of all the values
Table 2.4: Dictionary methods in Python.

In Table 2.4 we can see some of the dictionary methods we

have described along with others that can be useful in your


work.
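As a quick, illustrative sketch of a couple of these methods
using the enterprise dictionary built above: get() returns a
default instead of raising the KeyError we saw earlier, and
pop() removes an entry while returning its value. Here ’Q’ is
simply a key that does not exist in our dictionary:

> enterprise.get(’Q’, ’Not a crew member’)

’Not a crew member’

> enterprise.pop(’Hikaru Sulu’)

’Helmsman’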

2.3.5 Sets

A set is a collection that covers some aspects that are


not included in the collections we have discussed so far. If
you remember your set theory course from school, using
Venn diagrams and things like intersections, unions and
differences, then you are ready to go. No worries if that is
not the case, we shall cover this below.

A set is an unordered collection without duplicates.

In Python, a set is an unordered collection whose elements


are unique and they cannot be changed, although we can
add new ones. We can create a set with the use of curly
brackets, { }. We used them to create dictionaries, but in
this case we have single elements separated by commas
instead of key-value pairs. Let us create an initial set based
on Cylons:

We can create a set with curly brackets { }. Do not confuse this
with dictionaries!

significant_seven = {1, 2, 3, 4, 5,
                     6, 6, 6, 6, 6, 6,
                     8, 8}

Here we have several copies of each Cylon model.

> type(significant_seven)

set

We can see that the type of the object significant_seven


is a set. You can see that the definition above has multiple
copies of some of our favourite antagonist characters in Sci-Fi.
However, it does not matter how many copies we have of
each model, there are only 7 models in this set. So, there
may be a Sharon “Boomer” Valerii as a sleeper agent in
Galactica, or a Sharon “Athena” Agathon in Caprica, but
both are a Cylon 8 model. The same with the many copies
of Cylon model 6: Caprica Six talking to Baltar, or Gina
Inviere in Pegasus or Natalie leading the rebels versus Cylon
model 1. All of them are instances of the same model and
the set should only contain one entry for them. Let us see if
that is the case in our set:

No matter how many copies there are, there are only 7 significant
models. Other notable 6s include Shelly Godfrey, Lida and Sonja.

> print(significant_seven)
Here is the set of the Significant
Seven Cylons.
{1, 2, 3, 4, 5, 6, 8}

An alternative way to create a set is via the set() method.


All we have to do is pass an iterable with the elements that
need to be included in the set. Let us create some sets with

the rest of the Cylon models. First, there is of course the


extinct model 7 dubbed “Daniel”:

> extinct = set([7])

We can create sets with set().

Finally, we cannot forget to mention the final five:

> final_five = set([’Saul Tigh’, ’Galen Tyrol’,
                    ’Samuel T. Anders’, ’Tory Foster’,
                    ’Ellen Tigh’])

Let us create a set with the Final Five Cylons.

Please note that the items that we use to create a set have
to be immutable, otherwise Python will complain. For
example, if you pass a list an error is returned:

> error = {’end’, [’of’, ’line’]}

...
TypeError: unhashable type: ’list’

The input to create sets must be immutable objects. Otherwise
we get an error.

However, we are able to pass a tuple, because they are


immutable:

> noerror = {’end’, (’of’, ’line’)}

> noerror
Using tuples as input is fine.

{(’of’, ’line’), ’end’}

If we pass a string to the set() method we obtain a set of


characters in the string:

> species = ’cylon’

> letters = set(species)
> print(letters)

{’c’, ’l’, ’n’, ’o’, ’y’}

Creating a set with a string results in a set of characters.

However, items placed in curly brackets are left intact:

> {species}

{’cylon’}

Note that passing a string directly inside curly brackets leaves
the object intact.

We can have empty sets; we simply pass no arguments to
the set() method. Note that empty curly brackets create an
empty dictionary rather than an empty set:

> human_model = set()

Empty sets are easily created with set().

We can check the size of a set with the len() method:

> len(final_five)

5

We can check for membership using the in keyword:

> ’Laura Roslin’ in final_five

False
The in keyword lets us check for
membership.
>’Galen Tyrol’ in final_five

True

Phew! So we know that President Roslin is not a Cylon but
Senior Petty Officer Galen Tyrol is indeed a Cylon!

Spoiler alert!!!

A lot of the operations that we have used for other


collections are not possible for sets. For instance, they
cannot be indexed or sliced. Nonetheless, Python provides a
whole host of operations on set objects that generally mimic
the operations that are defined for mathematical sets.

One of the most important operations to carry out over a


set is the union. A union contains all the items of the sets in
question and any duplicates are excluded. We can create the
union of sets with the union method:

> humanoid_cylons = significant_seven.union(

final_five)

Another way to obtain the union of sets is with the pipe


operator |. Note that it lets us get the union of more than
two sets:

> cylons = significant_seven | final_five | extinct
> cylons

{1, 2, 3, 4, 5, 6, 7, 8,
’Ellen Tigh’, ’Galen Tyrol’,
’Samuel T. Anders’, ’Saul Tigh’,
’Tory Foster’}

We can obtain the union with .union() or |.

Let us create a new set containing some of the officers in


Galactica:

> officers = set([’William Adama’,

’Saul Tigh’, ’Kara Thrace’,

’Lee Adama’, ’Galen Tyrol’,

’Sharon Valerii’])

Another important operation is the intersection of sets. The
operation results in the items that exist simultaneously in
each of the sets in question. We can use the intersection()
method or the & operator. Let us see who are the Cylons in
our officers set:

The intersection gives us the items that exist simultaneously in
each of the sets in question.

> officers & cylons


We can obtain the intersection
with .intersection() or &.
{’Galen Tyrol’, ’Saul Tigh’}

This will give us a set that contains those who are officers in
Galactica and who are also Cylons. In this case, this
corresponds to Chief and Galactica’s XO. Well, our Cylon
detector needs to be better, right? We could detect the
presence of two of the final five, but a copy of model
number 8 has infiltrated Galactica. Let us try to rectify this
situation.

Our Cylon detector needs to be improved!

We can do this by adding elements to the set. Although


the elements in a set are immutable, sets themselves can be
modified. We can add elements one by one with add() or
update the set with the union of sets using update():

> cylons.update([’Sharon Valerii’,
                 ’Sharon Agathon’])

We can add elements to a set with .update().

Let us see if our Cylon detector is working a bit better. We


can run the intersection operation on the updated set and
officers:

> officers & cylons


We can now detect the Cylon
infiltrator!
{’Galen Tyrol’, ’Saul Tigh’, ’Sharon Valerii’}

There you go, we can now detect the presence of the Cylon
infiltrator!

It is possible to check for humans in the officers set by


taking out the Cylons. We can do this with the
difference() method, which will return the set of elements
that are in the first set but not in the second. The same
operation can be done with the minus sign (-):

The difference is the set of elements that are in the first set
but not in the second.

> officers - cylons


We can use .difference() or -.

{’Kara Thrace’, ’Lee Adama’, ’William Adama’}

With the difference operation we go through the sets from


left to right. This means that the operation is not
commutative. Let us take a look:

> cylons - officers

{1, 2, 3, 4, 5, 6, 7, 8,
’Ellen Tigh’, ’Samuel T. Anders’,
’Sharon Agathon’, ’Tory Foster’}

The set difference is not commutative.



We may be interested in getting all the elements that are in


either of two sets but not in both. In this case we can use the
symmetric_difference operation. This is effectively
equivalent to subtracting the intersection of the sets in
question. For more than two sets we can use the ^ operator
too. Let us take a look at the symmetric difference of
officers and cylons:

The symmetric difference gives us the elements that are in
either of two sets but not in both.

> officers ^ cylons

{1, 2, 3, 4, 5, 6, 7, 8, ’Ellen Tigh’,
’Kara Thrace’, ’Lee Adama’,
’Samuel T. Anders’, ’Sharon Agathon’,
’Tory Foster’, ’William Adama’}

We can use symmetric_difference or ^.

Set method Description


add(item) Add item to the set
clear() Remove all elements
copy() Copy the set
difference(set) Difference between sets
discard(item) Remove item from set. If item not in set return None
intersection(set) Intersection of sets
isdisjoint(set) Check if two sets have no elements in common
issubset(set) Check if another set contains this one
issuperset(set) Check if this set contains another one
pop(item) Return and remove item from set
remove(item) Remove item from set. If item not in set raise an error
symmetric_difference(set) Symmetric difference between sets
union(set) Union of sets
update(set) Update set with the union of this set and others
Table 2.5: Set methods in Python.

There are a number of other methods that we can use with


sets and some of them are shown in Table 2.5.
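As a brief sketch of a couple of the methods in Table 2.5, we
can confirm that the Final Five are contained in the full
cylons set and that they share no members with the extinct
model:

> final_five.issubset(cylons)

True

> final_five.isdisjoint(extinct)

True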

2.4 The Beginning of Wisdom: Logic & Control Flow

We have now a better understanding of how Python


works and some of the types and objects that are available
to us out of the box. However, as Mr. Spock has said, “Logic
is the beginning of wisdom, not the end,” and indeed we are
able to use some wisdom in Python thanks to the logical
operators available to us. With them we are able to control
the flow of a programme to construct algorithms.

algorithm = logic + control flow

In Python, line breaks and indentation define blocks of code.


This means that in Python, whitespace is important. Unlike
other programming languages, there are no curly brackets
to define blocks, or there is no need for other extra symbols
like semicolons. There is, however, the need to use a colon
to start classes, functions, methods, and loops as we shall
see. It is only logical to get started.

Whitespace is a meaningful character in Python.

2.4.1 Booleans and Logical Operators

We have discussed some of the types in Python including


integers, floats and strings. One type that we have not
mentioned yet is the Boolean type. Boolean values represent
one of two values: True or False. Please note that in Python
we write these two values with capital letters. We get these
values when we run comparisons between two values, for
example:

Boolean values are True or False.

> print(42>10)

True

Comparisons return Boolean values.

> print(2001<=1138)

False

As we can see Python returns True for the first comparison


and False for the second comparison. We can see that the
type of the result is a bool:

> type(42>10)

bool

In Python we can combine our comparisons to perform


logical operations such as and, or and not. In Table 2.6 we
can see the comparison and logical operators available to us.
We can use this to create more complex comparisons as in
the next examples:

> a = 75 != 75

> (51 > 33) and a

False

We can combine our comparisons to perform logical operations
such as and, or and not.

> (72 > 32) and ((94 == 87) or not a)

True

Table 2.6: Comparison and logical operators in Python.

Operation Operator
Equal ==
Different !=
Greater than >
Less than <
Greater or equal to >=
Less or equal to <=
Object identity is
Negated object identity is not
Logical AND and
Logical OR or
Logical NOT not
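The identity operators in the table deserve a short,
illustrative sketch: == compares values, whereas is checks
whether two names refer to the very same object:

> list_a = [1, 2, 3]
> list_b = [1, 2, 3]

> list_a == list_b

True

> list_a is list_b

False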

Boolean operators are a must when having to write


programmes in any language as they let us make decisions
about the flow of our analysis or algorithm execution. This
is what we call control flow. Let us look at controlling the
flow of our programmes in Python.

We can use the result of logical operations to control the flow
of our programmes.

2.4.2 Conditional Statements

Being able to decide what to do next depending on the


result of a previous operation is an important part of
programming. We can do this with the help of the Boolean
operations we discussed in Section 2.4.1. A conditional
statement lets us decide what to do if a condition is
fulfilled, otherwise we can take a different action or actions,
as we can add other conditions too. In Python we can do
this with the following syntax:

A conditional statement lets us decide what to do if a condition
is True or not.

if expression1:
    block of code executed
    if expression1 is True
elif expression2:
    block of code executed
    if expression2 is True
...
elif expressionN:
    block of code executed
    if expressionN is True
else:
    block of code executed
    if no conditions are True

The if... elif... else... lets us test various conditions and
create branches for our code.

We can see a diagram representing the control flow of a


conditional statement in Figure 2.2.

Figure 2.2: Conditional control


flow.

Please note that each block of actions after each test


expression is indented at the same level. The moment we go
back to the previous indentation level, the block is
automatically finished. Note that the colon (:) following
each expression is required.

Indentation is important in Python.

The expressions we can test can be simple comparisons


or more complex expressions that combine the logical
operators and, or and not. We can test as many conditions
as we require with the help of the “else if” keyword elif.
The conditions are tested in the order we have written
them and the moment one condition is met, the rest of the
conditions are not tested.

The conditions to test are logical expressions that evaluate to
True or False.

> president = ’Laura Roslin’

> if president in cylons:
      print(‘‘We have a problem!’’)
  elif president in officers:
      print(‘‘We need an election!’’)
  else:
      print(‘‘Phew! So say we all!’’)

Phew! So say we all!

Hurrah!

In this case our expression is given by checking set


membership. We are interested to know what to do if our
Cylon detector tells us whether President Roslin is a Cylon,
or whether she is part of the military and democracy has
been put in jeopardy. In this case we can breathe as things
are all good.

2.4.3 While Loop

Repeating a set of actions is also a typical thing to do


in programming. Sometimes we do not necessarily know
how many times we need to repeat the actions and we only
stop when a condition is met. In this case we need to apply
a while loop.

A while loop depends on the result of a condition.

while condition:
    block of code to be executed
    don’t forget to update the test variable

See Figure 2.3 for a diagrammatic representation of the


syntax above.

Figure 2.3: While loop control


flow.

You can see a couple of things that are similar to the syntax
we saw for conditional statements. First, we need a colon
after the condition. Second, the block of code that is
executed as long as the condition is True is indented. One
more example where whitespace is important in Python.

Please note that if the condition is not True at the beginning


of the loop, the block of code will never execute. Also, you
need to remember that the test variable in the condition
needs to be updated as part of the block of code. Otherwise
you end up with an infinite loop and Python will continue
repeating the block forever.

The while loop requires a logical test at the beginning of the
block.

We can see how this works for the Enterprise. So long as


there is enough dilithium, Captain Picard is able to ask the
crew to engage the engine:

> dilithium = 42

> while dilithium > 0:
      print(’Engage!’)
      dilithium -= 10

Note that dilithium -= 10 is a shorthand for dilithium =
dilithium - 10.

Engage!

Engage!

Engage!

Engage!

Engage!

As soon as the dilithium reserves are exhausted, we are not


able to continue warping around space, the final frontier.

2.4.4 For Loop

When we know how many times we need to repeat an


action, we are better off using a for loop. Either the number
of rounds is provided in advance or, as is typical in Python,
we traverse an iterable object such as a list, tuple or a string.
Since Python has access to the length of the iterable the
number of rounds is determined. In Figure 2.4 we can see
how this works. The syntax for a for loop is as follows:

In a for loop we know how many times our code needs to be
repeated.

for item in sequence:
    block of code to be executed

Figure 2.4: For loop control flow.



Once again we have the colon after defining the sequence


to loop over and the indentation to define the actions that
need repeating. Let us salute each of the Galactica officers
we know:

> for officer in officers:
      print(’Salute, {0}!’.format(officer))

This is the same basic structure used in list comprehension.
See Section 2.3.2.

Salute, Lee Adama!

Salute, Saul Tigh!

Salute, Kara Thrace!

Salute, William Adama!

Salute, Sharon Valerii!

Salute, Galen Tyrol!

Let us rewrite the code for Captain Picard and define a


range for the dilithium available to the Enterprise:

> for dilithium in range(42,0,-10):
      print(’Engage!’)

range enables us to define a sequence of numbers as an object.
This means that the values are generated as they are needed.

Engage!
Engage!

Engage!

Engage!

Engage!

In the example above we used the range(start, end,


step) function to generate a sequence of numbers from

start to end−1 in steps given by step.
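As a small sketch, we can make the sequence generated by range
explicit by casting it into a list; the values match the five
rounds of the loop above:

> list(range(42, 0, -10))

[42, 32, 22, 12, 2]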



2.5 Functions

That has been a warp! Now that we have a good


understanding of the fundamentals of Python, we are able
to continue expanding our code to achieve specific goals we
have in mind. Many times those goals can be attained by
applying the same set of instructions. For instance, we know
that Captain Picard prefers his beverage to be Tea. Earl Grey.
Hot so we may be able to put together the steps in a block of
code under the name picard_beverage to get the Captain’s
order right every time.

A function is a block of code which can be run any time it is
invoked.

We are talking about creating a function with a given name.


The function may have input parameters and may be able to
return a result. In Python, the syntax to build a function is
as follows:

def my_function(arg1, arg2=default2, ...,
                argn=defaultn):
    ’’’Docstring (optional) ’’’
    instructions to be executed
    when executing the function
    return result  # optional

The function definition starts with the word def.

We define a function with the keyword def at the beginning


of our block. As before, we need a colon at the end of the
first line and the block of code that forms the function is
indented. Here is what our picard_beverage function may
look like:

def picard_beverage():
    ’’’ Make Captain Picard’s beverage ’’’
    beverage = ’Tea. ’
    kind = ’Earl Grey. ’
    temperature = ’Hot.’
    result = beverage + kind + temperature
    return result

Not quite a replicator, but hey!

We can run the code by calling the function:

> picard_beverage()

Tea. Earl Grey. Hot.

In this case our function is not taking any arguments and it


returns a string as output. In general a function is able to
take arguments arg1, arg2,... , argn that can be used

in the block of code that makes up our function. It is


possible to define default values for some of these
parameters. Parameters with default values must be defined
last in the argument list.

In the syntax shown above you can see a line of code that
starts with triple quotes. This is called a documentation string
and it enables us to describe what our function does. You
can take a look at the contents as follows:

A documentation string enables us to provide information about
what a function does. Make sure you use it!

> print(picard_beverage.__doc__)

Make Captain Picard’s beverage



Let us define a function to calculate the area of a triangle


given its base and height:

def area_triangle(base, height=1.0):
    ’’’Calculate the area of a triangle’’’
    area = base * height / 2
    return area

We are defining a default value for the parameter height.

Notice that height has been given the default value of 1.


This means that we are able to call this function with one
parameter, which will be used for the base, and the function
will use the default value for height.

> a = area_triangle(50)
> print(a)

25.0

The arguments are passed to the function in round brackets.

In this case we are giving the function only one argument


such that the value of 50 is assigned to the base of the
triangle and the height is automatically assumed to be 1. We
can provide the second argument to override the default
value for height:

> a1 = area_triangle(50, 4)
> print(a1)

100.0

A function can be called simply using its name and any required
input parameters.

The functions we define can make use of the control flow


commands that we have discussed and make decisions

depending on the parameters provided by the user. For


example we can create a function to convert temperatures
from centigrade to Fahrenheit and vice versa. We can
convert from centigrade to Fahrenheit with the following
formula:
F = (9/5) C + 32,                                  (2.3)

and from Fahrenheit to centigrade with:

C = (F − 32) (5/9).                                (2.4)

This may be handy in determining the temperature of Captain
Picard’s beverage.

Let us take a look at creating a function to do the


conversion:

def conv_temp(temp, scale):
    s = ’Converting {0} {1}’.format(temp, scale)
    if scale == ’C’:
        print(s)
        result = (temp * 9/5) + 32
    elif scale == ’F’:
        print(s)
        result = (temp - 32) * 5/9
    else:
        print(’Please use C or F for the scale’)
        result = None  # avoid returning an undefined name
    return result

We are required to provide a temperature and the scale


used for that temperature. When we pass the scale C, the
temperature is measured in centigrade and we convert it
to Fahrenheit. However, if the scale is F, we are converting
from Fahrenheit to centigrade. Notice that if we use any

other character the function does not make any conversions


and instead it tells us what scales to use. Let us convert 451
Fahrenheit to centigrade:

> conv_temp(451, ’F’)


Converting 451 F
232.77777777777777

We can try to improve the presentation of the result... oh
well! We show how to do that later.

In some cases we may not need to define a full function as


before and instead create an anonymous function that has
a single expression. This is called a lambda function and
there is no need to name the function, we simply use the
following syntax:
lambda arg1, arg2, ... : expression

A lambda function in Python is an anonymous function created at
runtime.

where, as before, arg1, arg2,... are the input parameters


and expression is the code to be executed with the input
provided.

For example, if we require to calculate the square of a


number, we could try the following code:

sq = lambda n: n*n

In this case the object sq is a lambda function that can be
called as any other function in Python.

We can then use the object sq as a function. For instance let


us calculate the square of 3:

> sq(3)

9

What about if we had a list of numbers and we require to


obtain the square of the items in the list. Well, no problem,
we can use our lambda function in a list comprehension.

> num = [4, 14, 42, 451]
> [sq(n) for n in num]

[16, 196, 1764, 203401]

List comprehension with a lambda function, a pythonic way of
writing code!
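Lambda functions are also handy as the key argument of the sort
and sorted functions we met earlier. A small sketch, ordering
our num list by the last digit of each number:

> sorted(num, key=lambda n: n % 10)

[451, 42, 4, 14]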

With functions we are able to create reusable pieces of code


that can be used to carry out a specific task and make our
programmes more readable. As we create more functions to
do things for us we can put them together to create useful
scripts and modules.

2.6 Scripts and Modules

Writing commands to ask the computer to do things


for us is lots of fun. However, if we had to retype everything
every time we need to execute the commands, things
become rather tedious very quickly. Instead, we would
prefer to have the capability of writing our commands and
saving them for later use. We are able to do this with the
help of scripts in Python. These scripts or programmes are
plain text files with the extension .py. Furthermore, if we
use the interactivity provided by Jupyter, it is also possible
to save our notebooks in a JSON formatted notebook with
the extension .ipynb.

A script is a way to save our programmes to be run any time we
want.

Python has the advantage of being cross-platform. This


means that we can execute our script in any computer that
has a Python interpreter with the correct version. All we
need to do is open a terminal and invoke Python followed
by the name of the file that contains our commands. In
the example below we have a script that defines a function
called main(). It asks users for a temperature and scale, then
makes use of the function conv_temp we created earlier to
convert the temperature from one scale to the other one.
Finally, the main() function is executed in the same script.
Let us take a look:

We are not showing the conv_temp function in the code below.
Make sure that you save both in the same script.

def main():
    ’’’Convert temperature provided
    by the user ’’’
    t = input(’Give me a temperature: ’)
    t = float(t)
    s = input(’C or F? ’)
    res = conv_temp(t, s)
    if s == ’C’:
        print(’{0} C is {1:.2f} F’.format(t, res))
    elif s == ’F’:
        print(’{0} F is {1:.2f} C’.format(t, res))
    else:
        print(’Try again!’)

main()

We are defining a main function in this programme and calling
it with the command main().

Note how we are formatting the result to show only two decimal
places.

We can save this function together with the definition of


conv_temp in a file called convert_user_temperature.py.

In this case we are asking the user for a temperature t and


the scale (C or F). This is provided to us via the input()
command. We need to ensure that the value is cast as a float.
Finally, the conversion is made with the conv_temp function
and the result is printed.

input() stores values as strings. In this case we need to
ensure the values are floats.

Now that we have saved the script we are able to execute it.
Open a terminal, navigate to the place where the script is
saved and type the following command:

> python convert_user_temperature.py

Give me a temperature: 451
C or F? F
Converting 451.0 F
451.0 F is 232.78 C

In this case we want to convert 451 F to C. Here we have handled
the number of decimal places better than before!

This is great, we can rerun this script over and over and
obtain the results required. We may envisage a situation
where other functions to transform different units are
developed and expand our capabilities until we end up with
enough of them to create a universal converter. We may

decide to put all these functions under a single umbrella


and make it available to others. This is what a module in
Python is.

A module is a collection of scripts with related Python


functions and objects to complete a defined task. Modules
let us extend Python and there are thousands of them. All

we need to do is install them and off we go. Furthermore,


if you are using distributions such as Anaconda, many A module is a collection of scripts
with related Python functions to
modules are already available to you. All we need to do is
achieve a specific task.
import them whenever we need to use them. This is not
dissimilar to what Trinity and Neo do when they asked the
Operator to import functionality to fly a helicopter or Kung-
Fu fight! All the Operator would have to do is type import
helicopter or import kungfu and presto, Trinity and Neo

can save the day as they have now all the functionality of
the helicopter or kungfu modules. Please note these two
modules are purely fictional at this stage!

In an example closer to home, we may not want to Kung-


Fu fight, but we can try to do some trigonometry. We can
ask the Operator (in this case Python) to import the math
module. In that way, we can ask for the value of π and just
like that we can create a function to calculate the area of a
circle of radius r for example:

The math module contains some common mathematical functions.

import math

def area_circ(r):
    ’’’Area of a circle of radius r’’’
    area = math.pi * r**2
    return area

r = 7
ac = area_circ(r)
print(’The area of a circle with ’ \
      ’radius {0} is {1:.2f}’.format(r, ac))

We can get the value of π with math.pi.



Running the programme will result in the following output


in a terminal:

> python area_circ.py

The area of a circle with radius 7 is 153.94

We run the programme by typing this in a terminal.

You may have seen in the code above that we require to tell
Python that the constant π is part of the math module. This
is done as math.pi. In the example above we are importing
all the functions of the math module. This can be somewhat
inefficient in cases where only specific functionality is
needed. Instead we could have imported only the value of π
as follows:
In some cases it may be more
from math import pi efficient to load only the needed
functionality from a module.

With this syntax we can refer to π only as pi without the


dot notation. In many cases it is preferable to use the dot
notation to make it clear where functions, methods,
constants and other objects are coming from. This aids with
readability of our code and as you remember, this is an
important component of programming in Python.

We mentioned that there are many modules available to us. A large number of modules are available from the Python Standard Library, and more information can be found in https://docs.python.org/3/library/. We can create our own modules and these modules can be organised in packages. We will talk about some useful packages for statistics in the rest of the book.
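As a quick, hedged illustration of importing from the Standard Library, the snippet below uses the built-in statistics module (no installation needed) to compute a mean; the values are made up for this sketch.

import statistics

marks = [8, 4.5, 8.5, 8, 9]
print('Mean mark: {0:.2f}'.format(statistics.mean(marks)))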
3
Snakes, Bears & Other Numerical Beasts:
NumPy, SciPy & pandas

Programming offers endless possibilities and in


the previous chapter we covered some of the basic aspects
of programming with Python. We finished that chapter
by looking at ways to break tasks into smaller, separate
subtasks that can be imported into memory. It turns out
that when we have created a large number of functions
that support specific tasks in our workflow, it becomes
much easier to put them together. This is called a module or
package and there are many of them. In this chapter we are going to concentrate on a few that are useful in statistical analysis, namely NumPy, SciPy and pandas. For other
packages you may want to check the Python Package Index
(https://pypi.org). In the rest of the book we will deal
with a few of these modules and packages.

3.1 Numerical Python – NumPy

One of the most useful packages in the Python


ecosystem is NumPy1 . It stands for “numerical Python” and 1
Scientific Computing Tools
for Python (2013). NumPy.
it is widely used as an important component for many other http://www.numpy.org

packages, particularly for data analysis and scientific


computing. It provides functionality for multidimensional
arrays that lists cannot even dream of executing.

To get a feeling of the capabilities of NumPy, let us take a


look at some of the computations we would like to perform.
A typical case is the use of linear algebra, as matrix
calculations can facilitate the implementation of suitable
statistical analysis workflows and treatments.

In that manner, the data analysis workflow, from processing


to visualisation, benefits from the use of vectors and
matrices. In Python we can define these as arrays. An array is a data type used to store multiple values under a single identifier, and each element can be referenced by its index.
This is similar to the collections we discussed in the
previous chapter. However, in this case the elements are of
the same type.

NumPy arrays are used to store numerical data in the form


of vectors and matrices, and these numerical objects have a defined set of operations such as addition, subtraction, or multiplication, as well as other more specialised ones such as transposition, inversion, etc. A straightforward application of NumPy arrays is vector and matrix algebra.

3.1.1 Matrices and Vectors

A matrix is a rectangular array with m rows and n


columns. We say that such an array is an m × n matrix. When m = 1 we have a row vector and when n = 1 we have a column vector. We can represent a matrix A with elements $a_{m,n}$ as follows:

 
$$A = \begin{pmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\
\vdots  & \vdots  & \ddots & \vdots  \\
a_{m,1} & a_{m,2} & \cdots & a_{m,n}
\end{pmatrix} \qquad (3.1)$$

A matrix is effectively a collection of row (or column) vectors.

We know that lists can be used to group elements in what


may be row or column vectors. Consider the following two
lists and let us push them to their computational limit:

list_a = [0, 1, 1, 2, 3]

list_b = [5, 8, 13, 21, 34]

On the surface these two objects may resemble a pair of


vectors. After all, they are collections of numbers. However,
Python considers these objects as lists, and for the interpreter these have specific operations. If we tried to add these two lists with the + symbol, Python will interpret the command as concatenation (we covered list concatenation in Section 2.3.1), not as the element-wise addition of the arrays:

> list_a + list_b


Using the + symbol with lists
results in concatenation.
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

What happens if we try to carry out the multiplication of the


two lists? Let us see:

> list_a * list_b

TypeError: can't multiply sequence by
non-int of type 'list'

Using other arithmetic symbols with lists results in an error.

Lists are excellent Python objects, but their use is limited for
some of the operations we need to execute with numerical
arrays. Python is able to use these lists as a starting point to
build new objects with their own operations and functions
letting us do the mathematical manipulations we require.
In this case, modules such as NumPy and SciPy are already
available to us for the use of n-dimensional arrays (i.e.
ndarray) that can be used in mathematical, scientific, and

engineering applications including statistical analysis.

3.1.2 N-Dimensional Arrays

The possibility of creating objects that are suitable


for the operations mentioned above is open thanks to
NumPy. The module lets us define n-dimensional arrays for this purpose (NumPy extends the types in Python by including arrays). An n-dimensional array is a multidimensional
container whose elements are of the same type and size. In
Python, the type of an n-dimensional array is ndarray. The

number of dimensions in an ndarray is its shape, which is a


tuple of n integers specifying the sizes of each dimension. The shape of an array gives us its
dimensions.
The type of the elements in an array is specified by a
separate data-type object (called dtype) associated with each
array. For convenience we will refer to ndarray objects as
arrays in the rest of the text.

We can think of arrays as an enhancement on lists and you


would not be surprised to know that we can create arrays
with the help of lists:

import numpy as np
We define a NumPy array
with np.array, where np is a
A = np.array(list_a) convenient alias used for the
NumPy package.
B = np.array(list_b)

In the code above we are importing the NumPy package


and using the alias np to refer to the module. With the help
of the array command in NumPy we transform a list into
an array object. Let us check the type of one of these new
objects:

> type(A)

The type of an array is ndarray.


numpy.ndarray

As you can see, the type of object A is an n dimensional


array. Since these are arrays, the sum is a well-defined
operation now:

> C = A + B

> C The use of the + symbol with the


arrays defined above results in
their addition as expected.
array([5, 9, 14, 23, 37])

This time Python has added the arrays element by element


as expected for a vector operation. In this way the following
vector operations are defined for NumPy arrays:

• Vector addition: +

• Vector subtraction: - These are some of the vector


operations that are supported by
• Element-wise multiplication: * NumPy arrays.

• Scalar product: dot()

• Cross product: cross()
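Building on the list above, here is a minimal, hedged sketch of some of these operations, using the arrays A and B defined earlier plus two small three-element vectors (made up for this example) for the cross product:

print(A - B)             # vector subtraction
print(A * B)             # element-wise multiplication
print(np.dot(A, B))      # scalar (dot) product

u = np.array([1, 0, 0])  # illustrative 3-D vectors
v = np.array([0, 1, 0])
print(np.cross(u, v))    # cross product, gives [0 0 1]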

3.1.3 N-Dimensional Matrices

NumPy also defines matrix objects with np.matrix.


Let us create a couple of them:

M1 = np.matrix([[1, 0], [0, 1]])

M2 = np.matrix([[0.5, 2.0], [-4.0, 2.5]])

Notice that the type of these objects is numpy.matrix:

> type(M1)
The type of a matrix is, surprise,
surprise, matrix.
numpy.matrix

We can apply arithmetic operations as defined for


matrices. For example matrix addition:

> M1 + M2
Matrix addition (and subtraction)
works as expected.
matrix([[ 1.5, 2. ],

[-4. , 3.5]])

or matrix multiplication:

> M1 * M2

We can multiply NumPy matrices

matrix([[ 0.5, 2. ], with the usual multiplication


symbol.
[-4. , 2.5]])

In this case the result is the same as M2 because M1 is the 2 × 2


identity matrix. Note that we can also cast NumPy arrays as
matrices with the mat command.
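For instance, a minimal sketch of casting an array with np.mat (note that recent NumPy releases discourage the matrix class in favour of plain arrays, so treat this as illustrative only):

M3 = np.mat(np.array([[1, 2], [3, 4]]))
print(type(M3))   # numpy.matrix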

An operation that is specific to matrices is transposition. We


can achieve this with the help of the transpose method:

> M2.transpose()
Other operations, such as
transposition, are also available.
matrix([[ 0.5, -4. ],

[ 2. , 2.5]])

We can check the shape of the vectors or matrices with the


shape method as follows:

> A.shape

(5,)

> M2.shape Matrix dimensions can be


obtained with shape.
(2, 2)

Similarly, we can obtain the types of the elements stored in


arrays and matrices. In this case, we use the dtype method:

> A.dtype
The elements inside matrices or
dtype(’int64’)
arrays also have types.

> M2.dtype

dtype(’float64’)

As you can see, array A has integer elements whereas the


matrix M2 has float elements.

We can create an array with all elements initialised to 0


using the function zeros():

> z1 = np.zeros((2,3))

> z1
zeros() creates a matrix whose
elements are all zero.
array([[0., 0., 0.],

[0., 0., 0.]])

We can also create an array with all elements initialised to 1


using the function ones():

> o1 = np.ones((3,3))

> o1 ones() initialises a matrix with


elements all equal to one.

array([[1., 1., 1.],

[1., 1., 1.],

[1., 1., 1.]])

3.1.4 Indexing and Slicing

Array objects created with NumPy can also be indexed


as well as sliced and even iterated over. The same notation
we have used for lists and tuples is valid too. In other
words start:end:step will extract the appropriate elements
starting at start in steps given by step and until end−1. Let
us create an array of the first 12 numbers and store it in a
variable called a:

> a = np.arange(12)

> print(a[1:4]); print(a[0:10:2])


Arrays and matrices can be
indexed and sliced with the usual
[1 2 3] colon notation for lists and tuples.

[0 2 4 6 8]

In the example above we are first selecting the elements


from 1 and up to but not including 4. We then ask for the
elements from 0 through to 10 in steps of 2.

We can do the same for arrays with higher dimensions. For


example, let us first create an array with two rows, one

containing the first 5 numbers and the second with the same
information but divided by 2:

> b = np.array([np.arange(5), 0.5*np.arange(5)])
> b

array([[0. , 1. , 2. , 3. , 4. ],
       [0. , 0.5, 1. , 1.5, 2. ]])

We have created a 2 × 5 array first.

This array has shape 2 × 5 and in this case we can select


elements from row 1 as follows:

> print(b[1, :])


Remember that Python starts
counting from 0.
[0.  0.5 1.  1.5 2. ]

We can also get, for instance, the elements in column 2 of


the array:

> print(b[:, 2])


Slicing with : retrieves the entire
sequence.
[2. 1.]

This can be very useful when considering panel data that


can be put in the form of a table. Let us consider an
example where we have captured results for school tests in a
(very small) class of 5 students. See the results in Table 3.1

We can create a 5 × 3 array to capture the marks of the 5


students in each of the 3 subjects.

Table 3.1: Student marks in three different subjects.

Name      Physics   Spanish   History
Antonio   8         9         7
Ziggy     4.5       6         3.5
Bowman    8.5       10        9
Kirk      8         6.5       9.5
María     9         10        7.5

> marks = np.array([[8, 9 ,7],

[4.5, 6, 3.5],
The results of each student are
[8.5, 10, 9],
captured as rows in our array.
[8, 6.5, 9.5],

[9, 10, 7.5]])

If we wanted to obtain the marks that Bowman got for


Spanish, we need to obtain the element in row 2 for column
1. This can be done as follows:

> marks[2,1]
We are obtaining the element in
row 2, column 1.
10.0

Similarly we can get the marks for Physics by requesting the values of column 0 for all the students:

> marks[:,0]
Here, we get all the elements in column 0.
array([8. , 4.5, 8.5, 8. , 9. ])

3.1.5 Descriptive Statistics

NumPy also provides us with some functions to perform


operations on arrays such as descriptive statistics including:

maximum, minimum, sum, mean and standard deviation.


Let us see some of those for the Physics test marks:

> marks[:, 0].max(), marks[:, 0].min()

(9.0, 4.5)

Descriptive stats such as the


> marks[:, 0].sum() maximum, minimum, sum,

38.0 mean and standard deviation are


calculated easily with NumPy.

> marks[:, 0].mean(), marks[:, 0].std()

(7.6, 1.5937377450509227)

Furthermore, we can obtain these statistics for the entire


frame or along rows or columns. For example, the mean of
the marks across all subjects is:

> marks.mean()
We can obtain the average mark
across all subjects for all students.
7.733333333333333

The average mark for each subject can be obtained by telling


the function to operate on each column, this is axis=0:

> marks.mean(axis=0)
Or for a specific subject, i.e.
column, with slicing and dicing
array([7.6, 8.3, 7.3])
operations.

We can see that the average mark for Physics is 7.6 (which
is the value we calculated above), for Spanish 8.3 and for
History 7.3. Finally, we may be interested in looking at the
average marks for each student. In this case we operate on
each row. In other words axis=1:

> marks.mean(axis=1)
The average mark for each student
is also easily calculated.
array([8.0, 4.6666667, 9.1666667, 8.0, 8.8333333])

The student with the best average mark is Bowman with


9.16, whereas the student that needs some support for the
finals is Ziggy, with an average mark of 4.66.

It is possible to find unique elements in an array with the


unique method. Consider the following array:

> m = np.array([3, 5, 3, 4, 7, 9, 10, 12, 3, 4, 10])

We can get the unique values as follows:

> np.unique(m)
We can get unique elements with
unique().
array([ 3, 4, 5, 7, 9, 10, 12])

We can also obtain the frequency of each unique element


with the return_counts argument as follows:

> np.unique(m, return_counts=True)


It is also possible to get the
frequency for each unique
(array([ 3, 4, 5, 7, 9, 10, 12]), element.
array([3, 2, 1, 1, 1, 2, 1]))

In this way, we see that number 3 appears 3 times, number 4


twice, 5 once, and so on. The same also works for
multidimensional arrays. Let us take a look at our table of
student marks:

> np.unique(marks)

array([ 3.5, 4.5, 6. , 6.5, 7. , 7.5,

8. , 8.5, 9. , 9.5, 10. ])

If we are interested in finding entire unique rows or columns, we need to specify the axis: for rows we use axis=0 and for columns axis=1.
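A minimal sketch with the marks array from above (all rows and columns happen to be unique here, so the shapes are unchanged):

unique_rows = np.unique(marks, axis=0)   # unique rows
unique_cols = np.unique(marks, axis=1)   # unique columns
print(unique_rows.shape, unique_cols.shape)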

Now that we have access to arrays, we can start expanding


the type of operations we can do with these objects. A lot of
those operations and algorithms are already available to us
in the SciPy package.

3.2 Scientific Python – SciPy

The computational demands of scientific applications


require the support of dedicated libraries available to us.
In the case of Python, the SciPy package² provides us with implementations of algorithms and sub-modules that enable us to carry out calculations for statistics, use special functions, apply optimisation routines, and more.

² Jones, E., T. Oliphant, P. Peterson, et al. (2001–). SciPy: Open source scientific tools for Python. http://www.scipy.org/

In this section we will cover only some of the richness of SciPy and you can explore some of the sub-packages that make up the library depending on your needs:

• cluster for clustering algorithms

• constants including physical and mathematical constants

• fftpack for Fast Fourier Transform routines



• integrate covering integration and ordinary differential


equation solvers

• interpolate for interpolation and smoothing splines

• io for input and output

• linalg with linear algebra routines

• ndimage for N-dimensional image processing

• odr for orthogonal distance regression

• optimize with optimisation and root-finding routines

• signal for signal processing

• sparse to manage sparse matrices and associated routines

• spatial for spatial data structures and algorithms

• special with implementations of special functions

• stats for statistical analysis including distributions and functions

SciPy sub-packages have you covered for (almost) any scientific computing needs you have.

The SciPy package depends on NumPy for a lot of


functionality. The typical way to import the required
modules is to bring NumPy first, and then the required Importing SciPy and its sub-
packages is straightforward.
sub-module from SciPy. For instance, to use the stats
sub-module we can import the following:

import numpy as np

from scipy import stats
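As a quick, hedged taster of the stats sub-module imported above, scipy.stats offers a describe() helper that returns several summary statistics in one go; the input values below are made up for illustration:

sample = np.array([4.5, 8, 8, 8.5, 9])
print(stats.describe(sample))
# DescribeResult with nobs, minmax, mean, variance, skewness and kurtosis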



3.2.1 Matrix Algebra

Let us take a look at some things that we can do with


SciPy such as inverting a matrix. This is useful in solving
systems of linear equations, for example (linear algebra methods are included in linalg inside SciPy):

from numpy import array, dot

from scipy import linalg

x = array([[1, 10], [1, 21], [1, 29], [1, 35]])

y = array([[0.9], [2.2], [3.1], [3.5]])

We can now use these two arrays x and y to calculate the following expression: $coef = (x^{T}x)^{-1}x^{T}y$. This gives us the coefficients that define a line of best fit as given by a linear regression. We can break down the calculation into two parts. First $n = (x^{T}x)^{-1}$, where we need to calculate a transpose and apply the inverse. The second part is given by $k = x^{T}y$, such that $coef = nk$:

n = linalg.inv(dot(x.T, x))
k = dot(x.T, y)
coef = dot(n, k)

We can invert a matrix with the .inv method, and matrix multiplication with arrays can be done with the dot() function.

In Figure 3.1 we can see a scatter plot of the synthetic data


created and a line of best fit y = 0.10612x − 0.09558 given
by the coefficients we have calculated. In this case the
coefficients are stored in the coef object.

[Figure 3.1: Scatter plot of synthetic data and a line of best fit obtained with simple matrix operations.]

> print(coef)

[[-0.0955809 ]
 [ 0.10612972]]

We can use the linear algebra operations in SciPy to


calculate other important quantities, for instance the
determinant:

> a1 = np.array([[10, 30], [20, 40]])

> linalg.det(a1)

-200.0
The determinant can be calculated
with the .det function.


We can also obtain the eigenvalues and eigenvectors of a


matrix:

> l, v = linalg.eig(a1)

> print(l)

[-3.72281323+0.j 53.72281323+0.j] We obtain the eigenvalues and


eigenvectors with the .eig
function.
> print(v)

[[-0.90937671 -0.56576746]

[ 0.41597356 -0.82456484]]

3.2.2 Numerical Integration

SciPy lets us perform numerical integration of integrals of the following kind:

$$I = \int_a^b f(x)\,dx. \qquad (3.2)$$

A number of routines to calculate definite integrals with SciPy are in the scipy.integrate library.

One example is the quad() function, which takes the function f(x) as an argument, as well as the limits a and b. It returns a tuple with two values: one is the computed result, and the other is an estimation of the numerical error of that result. We can, for instance, calculate the area under the curve for half of a Gaussian curve:

$$\int_0^{\infty} e^{-x^2}\,dx = \frac{\sqrt{\pi}}{2}. \qquad (3.3)$$

> from scipy import integrate
> g = lambda x: np.exp(-x**2)
> integrate.quad(g, 0, np.inf)

(0.8862269254527579, 7.101318390472462e-09)

quad() lets us compute a definite integral.

3.2.3 Numerical Optimisation

We can also find roots of functions with the optimize


module in SciPy. Root finding consists in searching for a
value x such that

$$f(x) = 0. \qquad (3.4)$$

Finding roots numerically is easy with SciPy.

Let us find (one) solution for the following transcendental


equation:
$$\tan(x) = 1. \qquad (3.5)$$

We can rewrite our problem in the form $f(x) = \tan(x) - 1 = 0$. We can now find the root with the fsolve() function in
SciPy as follows:

> from scipy import optimize

> f = lambda x: np.tan(x)-1


We can find roots with fsolve().
> optimize.fsolve(f, 0)

array([0.78539816])

3.2.4 Statistics

An important module for us is stats, as it contains


statistical tools and probabilistic descriptions that can be
of great use in many statistical analyses. This, together with the random number generators from NumPy's random module, enables us to do a lot of things; the stats module in SciPy together with random from NumPy are a great combination for statistical work. For example, we can calculate a histogram for a set of observations of a random process:

s = np.random.normal(size=1500)
We get normally distributed
b = np.arange(-4, 5) random numbers with
h = np.histogram(s, bins=b, density=True)[0] random.normal.

In this case we are drawing samples from a normal


(Gaussian) distribution and calculating the histogram with bins given by b; we require the result to be the value of the probability density function at the bin (density=True). A histogram gives us an approximation of the data distribution.

The histogram is an estimator of the probability density


function (PDF) for the random process; the PDF describes the relative likelihood for a random variable to take on a given value. We can use the stats module to calculate the probability density function for a normal distribution as follows:

bins = 0.5*(b[1:] + b[:-1])
from scipy import stats
pdf = stats.norm.pdf(bins)

We talk about histograms in Section 8.7 and the normal probability distribution is addressed in Section 5.4.1.

We can estimate the parameters of the underlying


distribution for a random process. In the example above,

knowing that the samples belong to a normal distribution


we can fit a normal process to the observed data:

> loc, std = stats.norm.fit(s)


Remember that we used random
> print(loc, std)
samples, so the values shown here
may be different in your computer.
0.008360091147360452 0.98034848283625

In this case this is close to a distribution with 0 mean and


standard deviation equal to 1.

We know that we can use NumPy to calculate the mean and


median:

> np.mean(s)

0.008360091147360452
The mean and the median can
easily be obtained with the help of
NumPy as we saw in Section 3.1.
> np.median(s)

0.030128236407879254

Other percentiles may be calculated too. We know that


the median is effectively the percentile 50, as half of the
observations are below it. We can calculate the percentiles
with scoreatpercentile: The n-th percentile of a set of data
is the value at which n percent of
the data is below it. We talk more
> stats.scoreatpercentile(s, 50)
about them in Section 4.4.2.

0.030128236407879254

Other percentiles are easily obtained, for instance the


percentile 75:

> stats.scoreatpercentile(s, 75)


The 75-th percentile is also known
as the third quartile.
0.7027762265868024

We can also carry out some statistical testing. For example,


given two sets of observations, which are assumed to have See Section 6.5.1 for more
information on t-tests.
been generated from Gaussian processes, we can use a t-test
to decide if the means are significantly different:

> process1 = np.random.normal(0, 1, size=100)

> process2 = np.random.normal(1, 1, size=50)

> stats.ttest_ind(process1, process2)

Ttest_indResult(statistic=-7.352146293288264,

pvalue=1.2290311506115089e-11)

The result comes in two parts. First, the so-called t-statistic value tells us about the significance of the difference between the processes; a t-test is used to determine if there is a significant difference between the means of two groups. The second is the p-value, which is the probability that the two processes are identical. The closer the value is to 0, the more likely it is that the processes have different means.

We will explore some of the concepts discussed above in


greater detail later in the book (we start discussing hypothesis testing in Section 5.5). In the meantime, we can see that the arrays, matrices and vectors we have been playing
with all contain numerical data. In many cases, data comes
in different formats and we should not expect all of it to
have the same type. NumPy and SciPy are unrivalled for

numerical calculations, but we need other modules to be


able to deal with different data types coming from a variety NumPy and SciPy are great to
deal with numerical data.
of sources. This is where packages such as pandas come to
the rescue.

3.3 Panel Data = pandas

Pandas, more specifically, giant pandas are indeed


those charismatic mammals with distinctive black-and-white
markings that give part of its scientific name (Ailuropoda melanoleuca means “black and white cat footed animal”). Giant pandas, unlike most bears, do not have round pupils. Instead, they
have vertical slits, similar to those of cats! There is of course
the red panda, which may not be as famous, but it is equally
endangered. Here, we will be talking about a different kind
of panda, a numerical beast that is friends with Python, and
no, not the snake, but the programming language.

Pandas³ started life as a project by Wes McKinney in 2008 with the aim to make Python a practical statistical computing environment. Pandas is now used in a wider range of applications including machine learning and data science⁴. It is a powerful Python package that makes it easy for us to work with panel data. Panel data is typically encountered in economics, social science or epidemiology and the term is used to refer to datasets that have observations about different cross sections across time.

³ McKinney, W. (2012). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media

⁴ Rogel-Salazar, J. (2017). Data Science and Analytics with Python. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. CRC Press; and Rogel-Salazar, J. (2020). Advanced Data Science and Analytics with Python. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. CRC Press
With the help of pandas, we can read data from a variety of
sources and have it ready in a dataframe that can be easily
manipulated. Pandas supports indexed, structured data

with many columns and has functions to deal with missing


values, encoding, manage multiple dataframes, etc.

3.3.1 Series and Dataframes

The most basic type of data array in pandas is called a series, which is a 1-D array. A collection of series is called a dataframe. Each series in a dataframe has a data type and, as you can imagine, pandas builds these capabilities on top of NumPy and SciPy. (We are using the spelling of pandas with lowercase as used in its documentation, except in cases where it is at the beginning of a sentence.) A NumPy array can be used to create a pandas series as follows:

import numpy as np

import pandas as pd A typical alias for the pandas


library is pd.
a1 = np.array([14.75, 18.5, 72.9, 35.7])

s1 = pd.Series(a1)

We can check the type of the object s1 in the usual way:

> type(s1) The type of a pandas series is a


Series.

pandas.core.series.Series

As you can see the type of the object is a series, and each
series (and dataframe) has a number of methods. The first
thing to do is get to grips with the tabular data that we are
able to manipulate with pandas. Let us look at some data
published⁵ in 2016 by the Greater London Authority about population density. In Table 3.2 we show the population and area (in square kilometres) for four global cities contained in the report.

⁵ Three Dragons, David Lock Associates, Traderisks, Opinion Research Services, and J. Coles (2016). Lessons from Higher Density Development. Report to the GLA

Table 3.2: Population and area of some selected global cities.

City             Population   Area (sq km)
Greater London   8,663,300    1572
Tokyo            9,272,565    627
Paris            2,229,621    105
New York         8,491,079    784

We can load this data into Python by creating lists with the
appropriate information about the two features describing
the cities in the table:

names = [’Greater London’, ’Tokyo’, ’Paris’,


We can use lists to create a pandas
’New York’]
dataframe.
population = [8663300, 9272565, 2229621,

8491079]

area = [1572, 627, 105, 784]

Let us use the DataFrame method in pandas with a


dictionary as input. The keys will become our column
names and the values will be the lists we created above:

df = pd.DataFrame({’cities’: names,
We can pass a dictionary to
’population’: population, .DataFrame to create our table of

’area’: area}) data.

We can see that the type of the df object is a pandas


dataframe:

> type(df) The type of a pandas dataframe is


DataFrame.

pandas.core.frame.DataFrame

3.3.2 Data Exploration with pandas

Now that the data is loaded into a pandas dataframe


we can start exploring it. The first couple of entries in the
dataframe df can be seen with the command .head():

> df.head(2)
The .head() method lets us see
the first few rows of a dataframe.
cities population area

0 Greater London 8663300 1572

1 Tokyo 9272565 627

We can also see the last few rows with .tail():

> df.tail(2)
Similarly, .tail() will show the
last few rows.
cities population area

2 Paris 2229621 105

3 New York 8491079 784

We can get information about our dataframe with some


useful methods. For example, we can see the size of our
dataframe with .shape:

> df.shape
The dimension of our dataframe
can be seen with .shape.
(4, 3)

The result is a tuple that contains the dimensions of the


rectangular table. The first number corresponds to the
number of rows and the second the number of columns.

3.3.3 Pandas Data Types

We can also take a look at the type of data that is stored


in each of the columns of our dataframe:

> df.dtypes

We can easily check the types in


cities object our columns with .dtypes.
population int64

area int64

dtype: object

This tells us that the population and area columns have


integers, and the cities one is either made out of strings or
mixed data. We can see the overlap between pandas, Python
and NumPy data types in Table 3.3.

More information about the dataframe can be obtained with


the .info() method:

> df.info()

<class ’pandas.core.frame.DataFrame’>

RangeIndex: 4 entries, 0 to 3
The .info() method gives us
Data columns (total 3 columns): information about a dataframe
such as the index and column
# Column Non-Null Count Dtype
dtypes, non-null values and
--- ------ -------------- ----- memory usage.
0 cities 4 non-null object

1 population 4 non-null int64

2 area 4 non-null int64

dtypes: int64(2), object(1)



Table 3.3: Pandas data types.

Pandas dtype   Python type            NumPy type                                                        Description
object         string or mixed types  string_, unicode_, mixed                                          Text or mixed numeric and non-numeric values
int64          int                    int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64    Integer numbers
float64        float                  float_, float16, float32, float64                                 Floating point numbers
bool           bool                   bool_                                                             True/False values
datetime64     datetime               datetime64[ns]                                                    Date and time values
timedelta[ns]  NA                     NA                                                                Differences between two datetimes
category       NA                     NA                                                                Finite list of text values

We may have situations where we require to change the


type of the data held in a column. In those cases, we can use
the astype() function. Let us change the area column from
integers to floats: We can cast columns into other
types with .astype().

> df[’area’] = df[’area’].astype(’float’)

3.3.4 Data Manipulation with pandas

Let us now look at some easy data manipulation we


can perform with the help of pandas. You may recall that a
dictionary can be used to create a pandas dataframe. The keys become

the column names in our table. We can check the column


names as follows:

> df.columns
The columns method returns
the names of the columns in a
Index([’cities’, ’population’, ’area’], dataframe.

dtype=’object’)

This means that we can use these names to refer to the data
in each of the columns. For example, we can retrieve the
data about the population of the cities in rows 2 and 3 as
follows:

df[’population’][2:4]
We can view the contents of a
dataframe column by name, and
2 2229621 the data can be sliced with the
usual colon notation.
3 8491079

Notice that we have referred to the name of the column as a


string. Also, we have used slicing to select the data required.

You may notice that there is an index running along the left-
hand side of our dataframe. This is automatically generated
by pandas and it starts counting the rows in our table from Pandas automatically assigns an
index to our dataframe.
0. We can use this index to refer to our data. For example,
we can get the city name and area for the first row in our
table:

> df[[’cities’,’area’]][0:1]

cities area

0 Greater London 1572.0

We may want to define our own unique index for the data
we are analysing. In this case, this can be the name of the cities. We can change the index with the set_index()
method as follows:

df.set_index(’cities’, inplace=True)

In this way, we are able to locate data by the index name


with the help of the loc method:

> df.loc[’Tokyo’]
We can locate data by index name
or label with loc. If you need to
population 9272565.0 locate data by integer index use
area 627.0 iloc instead.

Name: Tokyo, dtype: float64

In the example above, we looked at all the entries for Tokyo.


We can also specify the columns we require. For instance,
we can search for the population of Greater London as
follows:

> df.loc[’Greater London’,’population’]


With loc we need to specify the
name of the rows and columns
that we want to filter out.
8663300

As we have seen, there are various useful commands in


pandas that facilitate different tasks to understand the
contents in a dataframe. A very handy one is the describe
method. It gives us a fast way to obtain descriptive statistics
for numerical columns in our dataframe. Instead of having
to calculate the count, mean, standard deviation, and
quartiles for our data, we can simply ask pandas to give us
a description, and in one line of code we get a wealth of
information:

> df.describe()

population area

count 4.000000e+00 4.000000 The describe method provides


mean 7.164141e+06 772.000000 us with descriptive statistics of
numerical data.
std 3.306720e+06 607.195191

min 2.229621e+06 105.000000

25% 6.925714e+06 496.500000

50% 8.577190e+06 705.500000

75% 8.815616e+06 981.000000

max 9.272565e+06 1572.000000

Note that using the describe method with categorical data


provides a count, number of unique entries, the top category
and frequency.
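For example, a minimal sketch with a small, made-up series of categories:

colours = pd.Series(['red', 'blue', 'red', 'green'])
print(colours.describe())
# count 4, unique 3, top 'red', freq 2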

It is very easy to add new columns to a dataframe. For


example, we can add a column that calculates the population density given the two existing columns in our dataframe (creating new columns out of existing ones is as easy as operating over the input columns):

> df[’pop_density’] = df[’population’]/df[’area’]

The operation above is applied to each entry of the columns


in question, and the result is added as a new column such
that the result of the operation is aligned with the inputs
used for the calculation. Let us take a look:

> df

population area pop_density And our new column is part of the


cities pandas dataframe.

Greater London 8663300 1572.0 5511.005089

Tokyo 9272565 627.0 14788.779904

Paris 2229621 105.0 21234.485714

New York 8491079 784.0 10830.457908

3.3.5 Loading Data to pandas

In the sections above we have entered our data by typing


the entries. However, in many real situations, the data that
we need to analyse has already been captured and stored in
a database, a spreadsheet or a comma-separated-value (CSV)
file. Since pandas was created to facilitate data analysis work, it is not surprising to hear that there are a number of ways we can import data into a pandas dataframe from a variety of sources. Table 3.4
lists some of the sources that we can use as data input.

Let us look at an example: the cities in Table 3.2 are only a


small part of the actual data from the Greater London

Table 3.4: Some of the input sources available to pandas.

Source       Command
Flat file    read_table(), read_csv(), read_fwf()
Excel file   read_excel(), ExcelFile.parse()
JSON         read_json(), json_normalize()
SQL          read_sql_table(), read_sql_query(), read_sql()
HTML         read_html()

Authority report. The full dataset is available⁶ at https://doi.org/10.6084/m9.figshare.14657391.v1 as a comma-separated value file with the name “GLA_World_Cities_2016.csv”.

⁶ Rogel-Salazar, J. (2021a, May). GLA World Cities 2016. https://doi.org/10.6084/m9.figshare.14657391.v1

Once you have downloaded the file and saved it in a known location in your computer, you can read the data into pandas as follows:

import numpy as np

import pandas as pd We are reading a CSV file with


read_csv()
gla_cities=pd.read_csv(’GLA_World_Cities_2016.csv’)

The read_csv() function can take a number of parameters,


such as sep for the separator or delimiter used in the file,
header to define the rows that should be used as column

names, or encoding to tell Python what encoding to use for


UTF when reading or writing text.
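For instance, a hedged sketch of how those parameters might be combined (the separator, header row and encoding shown are illustrative values rather than a prescription for the GLA file):

gla_cities = pd.read_csv('GLA_World_Cities_2016.csv',
                         sep=',',           # column delimiter
                         header=0,          # row to use for column names
                         encoding='utf-8')  # text encoding of the file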

Let us look at some of the rows in our dataset. For printing


purposes we will only show some limited information here:

> gla_cities.head(3)

   City            Country  ...  Constraint
0  NaN             NaN      ...  NaN
1  Greater London  England  ...  4%
2  Inner London    England  ...  5%

We are only showing a portion of the dataset. The ellipses (...) indicate columns not shown.

You can see that row 0 has NaN as entries. This means that
the row has missing data, and in this case, it turns out that
there is an empty row in the table. We can drop rows with
missing information as follows:

> gla_cities.dropna(how='all', inplace=True)

Dropping rows from the table is achieved with dropna().

The dropna method also has a number of parameters to use. In this case we have used how='all' to drop rows whose entries are all NaN values. The second parameter we used indicates to pandas that the dropping of rows should be done in the same dataframe object. We now have a table with 17 rows and 11 columns:

> gla_cities.shape

(17, 11)

For convenience we can rename the columns of our


dataframe. This is easily done with the help of a dictionary
whose keys are the existing column names and whose values are the new names:

> renamecols = {'Inland area in km2': 'Area km2'}
> gla_cities.rename(columns=renamecols, inplace=True)

Renaming columns is an important manipulation that can be done with rename().

Let us look at the column names we have after the change


above:

> gla_cities.columns

Index([’City’, ’Country’, ’Population’,

’Area km2’, ’Density in people per hectare’,

’Dwellings’, ’Density in dwellings per hectare’,

’People per dwelling’, ’Approx city radius km’,

’Main topographical constraint’, ’Constraint’],

dtype=’object’)

We will now create a new column to capture the population


density per square kilometre. If you do the following, you
will get an error:

> gla_cities[’pop_density’] = \

gla_cities[’Population’]/gla_cities[’Area km2’]
We get an error because the
data was read as strings, not as
TypeError: unsupported operand type(s) for /: numbers.

’str’ and ’str’

The reason for this is that the data for the Population, Area
Km2 and Dwellings columns has been read as strings and, as

we know, we are not allowed to divide strings. We can fix


this issue by changing the type of our columns. We will also

need to get rid of the commas used to separate thousands


in the file provided. All this can be done by chaining the
changes as follows:

gla_cities['Population'] = gla_cities['Population'].\
    str.replace(',', '').astype(float)

Recasting the data type solves the issue.

First we use the replace() method for strings to remove


the comma, and then we apply the astype() function to
transform the data into a float. We do the same for the other
two columns:

gla_cities['Area km2'] = gla_cities['Area km2'].\
    str.replace(',', '').astype(float)
gla_cities['Dwellings'] = gla_cities['Dwellings'].\
    str.replace(',', '').astype(float)

First we replace the commas in the strings and then we recast the column data as float.

We can now create our new column:

> gla_cities[’pop_density’] = \

gla_cities[’Population’]/gla_cities[’Area km2’]
Let’s see what’s out there... The
> gla_cities[’pop_density’].head(3) operation can now be completed
correctly!

1 5511.005089

2 10782.758621

3 4165.470494

It is also easy to make calculations directly with the


columns of our dataframe. For instance, we can create a
column that shows the population in units of millions:

> gla_cities['Population (M)'] = \
    gla_cities['Population']/1000000

As before, we can create new columns out of existing ones.

This can make it easier for us and other humans to read the
information in the table. The important thing is to note that
pandas has effectively vectorised the operation and divided
each and every entry in the column by a million. If we need
to apply a more complex function, for instance one we have
created ourselves, it can also be applied to the dataframe.
First, let us create a function that categorises the cities
into small, medium, large and mega. Please note that this
categorisation is entirely done for demonstration purposes
and it is not necessarily a real demographic definition:

def city_size(x):
    if x < 1.5:
        s = 'Small'
    elif 1.5 <= x < 3:
        s = 'Medium'
    elif 3 <= x < 5:
        s = 'Large'
    else:
        s = 'Mega'
    return s

Creating functions to manipulate our data is the same as discussed in Section 2.5.

We can now use the apply method of a series to categorise


our cities:

> gla_cities['City Size'] = \
    gla_cities['Population (M)'].apply(city_size)

We can then apply those functions to our dataframe.

Pandas enables us to filter the data by simply making


comparisons on the values of columns directly. For example,
we can ask for the rows in the table for the cities classified
as small:

> gla_cities[gla_cities['City Size']=='Small']

    City     ...  Population (M)  City Size
5   Lyon     ...  0.500715        Small
9   Sevilla  ...  0.693878        Small
12  Boston   ...  0.617594        Small

Boolean filtering is available directly on the dataframe.

We can also request a handful of columns instead of the


entire table. For example, let us filter the data for the city
name and population in millions for the cities with
populations greater than 8 million:

> gla_cities[[’City’, ’Population (M)’]] \

[gla_cities[’Population (M)’]>8]
We can combine Boolean filtering
with column selection.
City Population (M)

1 Greater London 8.663300

10 New York (City) 8.491079

16 Tokyo (Special Wards Area) 9.272565

3.3.6 Data Grouping

Having data organised in a table makes it possible


to ask for aggregations and groupings given the values of
some columns of interest. Pandas makes it easy for us to

create groups and apply aggregations to our dataframes


with functions such as .groupby() and .agg(); data grouping and aggregation is readily available with pandas. Let us look at counting the number of cities that are in each of the
classifications we created in the previous section. First, let
us group the cities by the City Size column:

> gla_grouped = gla_cities.groupby(’City Size’) Simply pass the column to group


by to the .groupby() function.

We will now use the size() function to look at the size of


the groups created:

> gla_grouped.size()

We can see the size of the groups


City Size
created.
Large 4

Medium 5

Mega 5

Small 3

dtype: int64

We can also apply aggregation functions. Let us calculate


the average population of the city classifications we created:

> gla_grouped[’Population (M)’].mean()

And get descriptive statistics for


City Size
those groups too.
Large 3.488562

Medium 2.324692

Mega 7.625415

Small 0.604062

Name: Population (M), dtype: float64



It is possible to apply multiple aggregations with the .agg()


function we mentioned above:

> gla_grouped[’Population (M)’].agg([np.mean,

np.std])

mean std We can also apply other


City Size aggregations with the .agg()
function.
Large 3.488562 0.313245

Medium 2.324692 0.453616

Mega 7.625415 1.705035

Small 0.604062 0.097290

In this case, we are getting the mean and standard deviation


with the help of NumPy.

There are many more things we can do with pandas. For


example, it is possible to create plots directly from a
dataframe. We can create a horizontal bar chart of the
average population in our city classification as follows:

gla_grouped['Population (M)'].mean().\
    sort_values(ascending=True).plot.barh()

Pandas lets us create data visualisations directly from the dataframe. We cover this in further detail in Section 8.2.1.
We are sorting the values before passing the data to the


plotting function. The result can be seen in Figure 3.2. We
will cover data visualisation in more detail in Chapters 7
and 8.

As you can see, NumPy, SciPy and pandas provide us with


a wide range of tools to analyse and manipulate data in

[Figure 3.2: Average population in millions for the city classification created for the GLA report data.]

a coherent ecosystem. Here, we have only covered some
of the capabilities of these libraries, and we will continue
using them in the rest of the book as we explore the world
of statistics with Python.
4
The Measure of All Things – Statistics

We mentioned in Chapter 1 that statistics is the art


of learning from data. Statistics enables the creation of
new knowledge and a good statistical foundation will
help us understand a problem better and lead us to sound
conclusions.

In that sense, statistics is not about simply looking at


numbers, but about understanding data. In other words,
statistics is all about numbers in context. We may take a
look at a list of numbers and be able to calculate things like
the mean or the standard deviation, but the context around
the numbers enables us to make interpretations about the
results we are seeing. Statistics lets us measure all sorts of
things and make sense of information around us and not be
overwhelmed or overloaded by it.

In order to do this we need to be methodical in the way


we acquire, organise and analyse available data. Given
a particular phenomenon of interest, we know that the

measurements we may record about it are varied. After all,


if all measurements were exactly identical we would live in a very boring universe (a universe not worth exploring for strange new worlds or to seek out new life and new civilisations). Think of the students whose grades we encountered in Section 3.1.4, wouldn't it be strange if all of the grades were exactly the same for all students in all subjects?

Being methodical about our analysis is supported by having


some definitions under our belt; population and sample are important concepts in statistics. We call a population the entire collection of things or subjects about which
information is obtained for our analysis. A sample is a subset
of the population, selected for study. Descriptive statistics
help us describe data from our population, and we can use
tables and graphs, as well as numerical aggregations to help
us with our description. The data about our population may
consist of a single or multiple characteristics, features or
variables. In the case of a single variable, we are talking
about a univariate dataset, and a multivariate one when we
have two or more variables.

The variables captured about our population can be


categorical or qualitative if the individual observations
correspond to categories or classes. If, however, the observations are numbers, the variables are said to be numerical or quantitative. In turn, numerical variables can be
of different kinds. If the numbers resulting from the
observation of our population are isolated points in the
number line we have discrete data. If the observations can
take any possible value along the number line we have
continuous data.

With a phenomenon under study, for a sample in a


population, we can set up an experiment, in other words set
up a study where one or more explanatory or independent variables or factors are considered to observe the effects on a response or dependent variable, that is, an experiment to observe the effects of explanatory variables on response ones. The former are the variables
whose values are controlled, whereas the latter are
measured as part of the experiment. An experimental
condition or treatment is the name we give to a particular
combination of values for the explanatory variables.

The first thing to do after having collected the appropriate


data is to explore it. Data exploration lets us build context around the data and in turn create suitable models. The
source of the data is an important component of the
exploration and whenever possible it is recommended to
include any information that the data owners can provide
about the data. For example, are the numbers we are seeing
actually referring to categories, or are the dates provided in a specific format? Information about the data origin is as important as the data itself. The type of exploration may depend on
the answers to some of those questions. Some of the typical
things we would like to achieve during the exploration
phase of a project include:

• Detecting erroneous data

• Determining how much missing data there is

• Understanding the structure of the data

• Identifying important variables in the data

• Sense-checking the validity of the data

• Determining any data refinement or cleaning required



Perhaps one of the most important outcomes of the data


exploration step is the information that descriptive statistics
gives us.

4.1 Descriptive Statistics

Think of your favourite place on Earth. Without


telling Mr Spock the name of the place or its location, can
you tell him about some of the most important things that distinguish this place? In other words, can you describe it? (Surely Mr Spock would use logic to determine where your favourite place on Earth is.) He can perhaps do the same about his home planet Vulcan.
With some characteristics such as being a rocky planet,
retaining a suitable atmosphere to sustain humanoid life, a
surface abundant in water and with an orbit in the habitable zone of its system, the United Federation of Planets would consider both Earth and Vulcan as Class M worlds. Initial descriptions can let appropriate survey teams plan for further expeditions. The same is true of datasets: suitable descriptions of data can let us plan for further explorations.

In the case of datasets, instead of using phrases for the


description, we use numbers to summarise and describe
the data. The descriptions provided by these summary Descriptive statistics do not
generalise but let us understand.
figures is simply a description, they do not aim to generalise
anything beyond the data points available and we refer to
them as descriptive statistics. As the name says, descriptive
statistics are just descriptive; their use does not generalise or
tries to tell us anything beyond the data at hand.

When we use descriptive statistics for our data, we are


interested in coming up with a number that is

representative of the observations we have. Typically, we are


interested to know where the data is located or “centred”
and thus this is called a measure of central tendency. Let us
take a look at some of them.

4.2 Measures of Central Tendency and Dispersion

Think of your favourite dataset. It may contain a


few dozens of points, or thousands, even millions. Unless
you are an Operator in the Matrix universe, you may not be
able to look at a cascade of green neon figures, and make decisions accordingly (determining where our data is centred is an important step in descriptive statistics). Instead, you will probably want to
summarise the data and a useful aggregation is one that
tells us something about the distribution of the data points.
For that matter, locating where the data is centred is very
useful. A second measure you will be interested in is how
spread your data is.

Consider the student marks we looked at in Chapter 3,


shown in Table 3.1. A student with a given score may want
to know how close they are to the maximum mark. They may also be interested in comparing their score to that of others. For example, María will be very pleased to have got 9/10 in Physics. Not only is it a mark close to the maximum, but it is also higher than the rest. However, her History mark of 7.5/10 is not as pleasing, but where is she compared to the rest of the class? This is where central tendency comes to the rescue: it can help us see where María stands compared to her classmates and to the class in general.

4.3 Central Tendency

Central tendency is the trend that a given set of data


points have to pile up around a single value. That single
value is called a measure of central tendency and lets us condense the dozens, thousands or millions of observations
into a single number. In the example above, it lets María
know how well she has performed compared to her
classmates, but also in comparison to other subjects, and as
a class in the school. Some common measures of central
tendency include:

• Mode

• Median
• Arithmetic mean

• Geometric mean

• Harmonic mean

We will cover these measures in the following sections. For


that purpose, we will use some data about car road tests¹ from 1973 to 1974. The dataset comprises fuel consumption and some information on automobile design and performance for 32 cars. The data can be obtained from² https://doi.org/10.6084/m9.figshare.3122005.v1 as a comma-separated value file with the name “cars.csv”.

¹ Henderson, H. V. and P. F. Velleman (1981). Building multiple regression models interactively. Biometrics 37(2), 391–411

² Rogel-Salazar, J. (2016, Mar). Motor Trend Car Road Tests. https://doi.org/10.6084/m9.figshare.3122005.v1

The variables included in this dataset are as follows:

1. mpg: Miles/(US) gallon

2. cyl: Number of cylinders



3. disp: Displacement (cu.in.)

4. hp: Gross horsepower

5. drat: Rear axle ratio

6. wt: Weight (1000 lbs)

7. qsec: 1/4 mile time

8. vs: Engine shape (0 = v-shaped, 1 = straight)

9. am: Transmission (0 = automatic, 1 = manual)

10. gear: Number of forward gears

11. carb: Number of carburettors

These are the features contained in the “cars.csv” dataset.

4.3.1 Mode

A good way to get a sense of the values in a set of


observations is to look for the value that is present in the
data the most times. This is called the mode and it corresponds to the most common value in the dataset, i.e. the most frequent value observed in a dataset feature. In
some cases we can even get more than one mode and this is
called a multi-modal distribution.

Let us look at an example. Consider the student marks we


saw in Table 3.1. The mode for each subject is:

• Physics - 8

• Spanish - 10

• History - multiple values



For History we have a multi-modal distribution. This is because each of the values appears the same number of times. We can use SciPy to help us with the calculation:

> import numpy as np

> from scipy import stats

> marks = np.array([[8, 9, 7],
      [4.5, 6, 3.5],
      [8.5, 10, 9],
      [8, 6.5, 9.5],
      [9, 10, 7.5]])

> stats.mode(marks)

ModeResult(mode=array([[ 8. , 10. , 3.5]]), count=array([[2, 2, 1]]))

We can obtain the mode with the mode function in SciPy's stats.

In this case Python tells us the mode for each column in the
NumPy array marks and also gives us the frequency. For the
History column, the mode reported is the smallest value.

Let us look at the cars dataset; in particular we want to see the mode of the fuel consumption in miles per gallon.

> import pandas as pd

> cars = pd.read_csv('cars.csv')

> cars.loc[:, 'mpg'].mode()

0    10.4
1    15.2
2    19.2
3    21.0
4    21.4
5    22.8
6    30.4

We can apply the same function to a pandas dataframe.

In this case we also have a multi-modal distribution. We can apply the same measure of central tendency to different groups. For example, we may want to see the mode of fuel consumption for each of the cars with automatic (0) and manual (1) transmissions.

> ct = cars.groupby(['am'])

> ct['mpg'].apply(lambda x: x.mode())

am
0  0    10.4
   1    15.2
   2    19.2
1  0    21.0
   1    30.4

The function can also be applied to a grouped dataframe. The result is provided for each of the groups. Note however that this needs to be done as a lambda function as mode does not reduce the dataset to a single number.

Once again we have more than one value. For the automatic cars we have three mode values, whereas for manual we have two. Note that since mode is not a function that reduces the dataset, like for example a sum of values, pandas does not have a method for grouped dataframes.

In a nutshell, to calculate the mode we need to tabulate how many times each of the values appears, i.e. we calculate their frequency. The entries with the highest frequency are the mode. The mode tells us what the most common value in a dataset is, but it may not be a good choice to summarise some data. Think for instance of the marks for Spanish. The mode is 10 but it may not be a good choice to represent the rest of the marks for the class. Similarly, for the cars dataset, since we have a multi-modal distribution it may be difficult to discern a central value. Fear not, that is where other centrality measures come in handy.
The mode may not be the best way to summarise our data; other measures of central tendency are available.
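To make the frequency-tabulation idea concrete, here is a minimal sketch (not from the book's own code) that counts occurrences with Python's collections.Counter and keeps every value with the highest count, which also handles multi-modal cases:

from collections import Counter

spanish = [9, 6, 10, 6.5, 10]

# Tabulate how many times each value appears
freq = Counter(spanish)

# Keep every value whose count equals the maximum frequency
top = max(freq.values())
modes = [value for value, count in freq.items() if count == top]

print(modes)    # [10]

Applied to the History marks, where every value appears exactly once, the list returned would contain all five values, flagging the multi-modal case.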

4.3.2 Median

Instead of looking at the most common values, we can summarise our data by looking at their position in our dataset. This is exactly what the median does; it corresponds to the middle-most value when our set of observations is arranged in ascending (or descending) order of magnitude. In the case of our student marks for Physics we have that the data would be sorted as follows:

Physics = [ 4.5, 8, 8, 8.5, 9 ]

In this case, the middle value is 8 and that is our median. Note that here we have an odd number of observations and determining the middle value is easy. However, when the number of observations is even we sum the two values in the middle and divide the result by 2.
The median is the middle-most value of an ordered dataset. Depending on whether we have an odd or even number of observations, our median is either the middle value or the average of the central ones.
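As a quick illustration of the even-case rule, here is a small sketch (the values are made up for the example) that sorts the data and averages the two central observations:

import numpy as np

values = [7, 3, 9, 5]           # hypothetical even-sized sample
ordered = sorted(values)        # [3, 5, 7, 9]

n = len(ordered)
manual_median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(manual_median)            # 6.0
print(np.median(values))        # 6.0, NumPy agrees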

We can calculate the median of the three subjects our


students have been studying with the help of NumPy as
follows:

> np.median(marks, axis=0)

array([8. , 9. , 7.5])

The median for Physics is 8 as calculated above, for Spanish


it is 9 and for History 7.5.

In a pandas dataframe we can use the .median() method as follows:

> cars.loc[:, 'mpg'].median()

19.2

Pandas dataframes have a median() method.

The median fuel consumption for the cars in our dataset is 19.2. We can see the median per transmission as follows:

> ct['mpg'].median()

am
0    17.3
1    22.8

The method can also be used for grouped dataframes.

In this case, pandas does have a method for us to apply directly. We can see that the median fuel consumption for automatic cars (17.3) is lower than that for manual ones (22.8). Also, note that the values are different from the median obtained for the entire dataset. One may start to wonder if there is a marked difference in fuel consumption for cars with different transmissions, and this may help us decide what car is best for us. We will answer this question in Section 6.6.1.

4.3.3 Arithmetic Mean

Finding the centre of a distribution, as we have seen, can be done from an actual position, and this has the great advantage of being unbiased by having large or small values in the dataset. There are other ways to find the centre, and one of the most widely used methods is the arithmetic mean, or simply the mean or the average.
The arithmetic mean is actually the average or mean value we all learn at school.

It is important to note that when we use the entire collection of data, i.e. the population, to calculate the mean, we refer to this as the population mean and denote it with the Greek letter µ. In contrast, when we use a sample we call it the sample mean and denote it as X̄.

The arithmetic mean takes into account all the observations in the dataset (population or sample) and summarises them in a single number. Given a set of n values in an array X, the average is found by adding up all the observations x1, x2, . . . , xn and dividing the result by the number of elements:

\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i. \qquad (4.1)

This is how to calculate the arithmetic mean, or average.

Unlike the median, the average is susceptible to the


presence of large values in the data. Say we are at the
Battlestar Galactica mess, and there are 5 human officers
there. If we ask their ages, they may report the following
values:

ages = [ 45, 30, 50, 26, 40 ]. (4.2)



The average age would be given by:

\frac{1}{5}(45 + 30 + 50 + 26 + 40) = 38.2. \qquad (4.3)

Battlestar Galactica seems to have a young crew!

Suddenly, a Cylon arrives. It turns out to be one of the Final Five, and their age is, say, 2052. What happens to the average age? Let's take a look:

\frac{1}{6}(45 + 30 + 50 + 26 + 40 + 2052) = 373.833. \qquad (4.4)

The presence of the Cylon has skewed the average age to the point that it does not represent the majority of the values in the dataset. The median of this dataset is 42.5 and that may be a better description for this group of officers (and Cylon). This is because the median disregards the values present and concentrates on the most central value (in terms of ordered position).
The presence of a large number skews the average value. In this case an old Cylon turns our average into an impossible human age.

Let us take a look at the student marks we have been using to demonstrate how we use Python to calculate these central measures. The average for Physics is:

> physics = marks[:, 0]

> print('Physics Average: ', np.mean(physics))

Physics Average:  7.6

We use the mean function in NumPy.

We are using the mean function defined in NumPy for this task, and as we can see the mean is 7.6. We can calculate the mean for all the subjects in one go by passing the array of values and the axis:

> np.mean(marks, axis=0)

array([7.6, 8.3, 7.3])

The function can be applied to each column by providing the axis.

Note that NumPy arrays also have a mean method. This means that we can calculate the mean by the following means (pardon the alliteration):

> physics.mean()

7.6

This is very similar to what we would do for a pandas dataframe. Let us see the average of the fuel consumption for our cars dataset:

> mpg = cars.loc[:, 'mpg']

> print('MPG average: ', mpg.mean())

MPG average:  20.090624999999996

Pandas dataframes have a mean() method too.

We know that we have two different types of transmission in our dataset. Let us see if the average fuel consumption is different between automatic and manual:

> ct['mpg'].mean()

am
0    17.147368
1    24.392308
Name: mpg, dtype: float64

The average fuel consumption is different between our automatic and manual transmission cars in our dataset.



The automatic cars have a mean fuel consumption of 17.15 miles per gallon, whereas the mean for manual is 24.4. Is this difference significant? Well, this is a question we shall tackle later on in the book (see Section 6.6.1).

There may be some applications where each data point is given a weight such that each data point contributes a different amount to the average. This is called a weighted arithmetic mean and is calculated as:

\bar{X} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}. \qquad (4.5)

A weighted arithmetic mean is used for instance to calculate the return for a portfolio of securities3.

3 Rogel-Salazar, J. (2014). Essential MATLAB and Octave. CRC Press
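As a quick sketch of the weighted mean in practice, NumPy's np.average accepts a weights argument; the portfolio numbers below are invented purely for illustration:

import numpy as np

returns = np.array([0.05, 0.02, 0.10])   # hypothetical returns per security
weights = np.array([0.5, 0.3, 0.2])      # fraction of the portfolio in each

# Weighted arithmetic mean as in Equation (4.5)
portfolio_return = np.average(returns, weights=weights)
print(portfolio_return)                   # 0.051

The same result can be obtained by hand as np.sum(weights * returns) / np.sum(weights).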

4.3.4 Geometric Mean

An alternative measure of central tendency is the geometric mean. In this case, instead of using the sum of the observations as in the arithmetic mean, we calculate their product. The geometric mean is therefore defined as the n-th root of the product of observations.

In other words, given a set of n values in an array X, with all elements x1, x2, . . . , xn being positive, the geometric mean is calculated as follows:

GM = \left( \prod_{i=1}^{n} x_i \right)^{1/n}. \qquad (4.6)

The geometric mean is suitable to be used for observations that are exponential in nature, for instance growth figures such as interest rates or populations. We can express Equation (4.6) as follows:

\left( \prod_{i=1}^{n} x_i \right)^{1/n} = \exp\left[ \frac{1}{n} \sum_{i=1}^{n} \ln x_i \right]. \qquad (4.7)

This is an alternative formulation of the geometric mean.

It is clear from the expression above that we can calculate the geometric mean by computing the arithmetic mean of the logarithm of the values, and use the exponential to return the computation to its original scale. This way of calculating the geometric mean is the typical implementation used in many computer languages.

To calculate the geometric mean using NumPy we can define a function that implements Equation (4.7):

import numpy as np

def g_mean(x):
    # Log of each value, then the arithmetic mean, then back via exp
    avglnx = np.log(x)
    return np.exp(avglnx.mean())

Here is the implementation of Equation (4.7).

For the BSG (Battlestar Galactica) officers list of ages we can then use this function to calculate the geometric mean:

> bsg_ages = [45, 30, 50, 26, 40]

> g_mean(bsg_ages)

37.090904350447026

Including the 2052-year-old Cylon will result in a value of 72.4. A more human-like value for an age. For a set of positive numbers, the geometric mean is smaller than or equal to the arithmetic mean.
statistics and data visualisation with python 157

An alternative to creating our own function is using the gmean() function implemented in SciPy. For example, we can use this to calculate the geometric mean for the Physics marks:

> from scipy.stats import gmean

> gmean(physics)

7.389427365423367

We can also calculate the geometric mean for all the subjects as follows:

> gmean(marks, axis=0)

array([7.38942737, 8.11075762, 6.90619271])

We can calculate the geometric mean for the columns too.

Finally, for a pandas dataframe, we can continue using the SciPy function. For our fuel consumption column we have:

> gmean(mpg)

19.25006404155361

And for the car data grouped by transmission we have:

> ct['mpg'].apply(gmean)

am
0    16.721440
1    23.649266
Name: mpg, dtype: float64

The function can also be used with pandas dataframes.



For datasets containing positive values with at least a pair of unequal ones, the geometric mean is lower than the arithmetic mean. Furthermore, due to its multiplicative nature, the geometric mean can be used to compare entries in different scales. Imagine that we have the ratings shown in Table 4.1 for some Château Picard wine:

Table 4.1: Ratings for some Château Picard wine.

Vintage   Rater                      Rate
2249      Commander Data             3.5
2249      Commander Raffi Musiker    80
2286      Commander Data             4.5
2286      Commander Raffi Musiker    75

In this case, Commander Data has provided his ratings in a scale from 1 to 5, whereas Commander Raffi Musiker has given her ratings in a scale of 100. If we were to take the arithmetic mean on the raw data we would have a biased view:

2249 vintage → (3.5 + 80)/2 = 41.75   (4.8)
2286 vintage → (4.5 + 75)/2 = 39.75   (4.9)

We have ratings for Château Picard wine in two different scales, and the arithmetic mean is dominated by the larger numbers.

With the arithmetic mean, the 2249 vintage seems the better wine. However, the fact that we have ratings with different scales means that large numbers will dominate the arithmetic mean. Let us use the geometric mean to see how things compare:

2249 vintage → \sqrt{3.5 \times 80} = 16.73   (4.10)
2286 vintage → \sqrt{4.5 \times 75} = 18.37   (4.11)

The geometric mean lets us make a better comparison.

The better rated wine with the geometric mean seems to be the 2286 vintage. The geometric mean can help with the varying proportions, but notice that we lose the scaling. In other words, we can compare the values but we cannot say that we have an average rating of 16.73 out of either 100 or 5. The other solution would be to normalise the data into a common scale first and then apply the calculations.
Although the geometric mean manages the proportions better, the scaling is lost.
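As a rough sketch of that normalisation route (the scale maxima of 5 and 100 are taken from the example above), we can map both raters onto a common 0–1 scale before averaging:

import numpy as np

# Ratings per vintage as (Data's mark out of 5, Raffi's mark out of 100)
ratings = {'2249': (3.5, 80), '2286': (4.5, 75)}

for vintage, (data_rate, raffi_rate) in ratings.items():
    # Rescale each rating to [0, 1] using its own scale
    common = np.array([data_rate / 5, raffi_rate / 100])
    print(vintage, common.mean())

# 2249 0.75
# 2286 0.825

On the common scale the 2286 vintage again comes out ahead, in agreement with the geometric mean comparison.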

We can also define a weighted geometric mean as follows:

WGM = \exp\left( \frac{\sum_{i=1}^{n} w_i \ln x_i}{\sum_{i=1}^{n} w_i} \right). \qquad (4.12)

This can be useful when we need to calculate a mean in a dataset with a large number of repeated values.
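A minimal sketch of Equation (4.12), with made-up values and weights (recent versions of SciPy's gmean also accept a weights argument, but the explicit form keeps the formula visible):

import numpy as np

def weighted_g_mean(x, w):
    # exp of the weighted average of the logarithms, Equation (4.12)
    x, w = np.asarray(x, dtype=float), np.asarray(w, dtype=float)
    return np.exp(np.sum(w * np.log(x)) / np.sum(w))

values = [2, 8, 4]          # hypothetical data
weights = [1, 1, 2]         # the value 4 counted twice

print(weighted_g_mean(values, weights))   # 4.0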

4.3.5 Harmonic Mean

Another alternative to summarising our data


with a single central number is the harmonic mean. In this
case, instead of using the sum or the multiplication of the
numbers like the arithmetic or the geometric means, we rely
on the sum of the reciprocals.

To calculate the harmonic mean we need to do the following:

1. Take the reciprocal of each value in our data.

2. Calculate the arithmetic mean of these reciprocals.

3. Take the reciprocal of the result.

The harmonic mean uses the sum of reciprocals of the data points.

In other words:

HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} = n \left( \sum_{i=1}^{n} x_i^{-1} \right)^{-1}. \qquad (4.13)

This is the formula to calculate the harmonic mean.

We can create a function in Python to obtain the harmonic mean as follows:

def h_mean(x):
    # Sum of reciprocals, then n divided by that sum, Equation (4.13)
    total = 0
    for val in x:
        total += 1/val
    return len(x)/total

Beware that this implementation cannot handle zeroes.

Note that the code above, as indeed shown in Equation (4.13), is not able to handle zeroes in the list of numbers x. For our age list of Battlestar Galactica officers we have that the harmonic mean is given by:

> bsg_ages = [45, 30, 50, 26, 40]

> h_mean(bsg_ages)

35.96679987703658
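SciPy also ships a ready-made harmonic mean, scipy.stats.hmean, which should agree with the hand-rolled function above; a quick check with the same list of ages:

from scipy.stats import hmean

bsg_ages = [45, 30, 50, 26, 40]
print(hmean(bsg_ages))    # should match h_mean above, roughly 35.97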

Note that the harmonic mean is always the least of the three means we have discussed, with the arithmetic mean being the greatest. If we include the 2052-year-old Cylon the harmonic mean is 43.

The harmonic mean is a useful centrality measure when we are interested in data that relates to ratios or rates, for instance in financial applications such as price-to-earnings (P/E) ratios in an index. Let us see an example.
The harmonic mean is useful to compare ratios or rates.

Say for instance that you have two companies, Weyland Corp and Cyberdyne Systems, with market capitalisations of $60 billion and $40 billion respectively. Their earnings are $2.5 billion and $3 billion each. With this information we have that Weyland Corp has a P/E ratio of 24, and Cyberdyne Systems of 13.33.
An example of the use of the harmonic mean for ratios.

Consider now that we have an investment that has 20% in Weyland and the rest in Cyberdyne. We can calculate the P/E ratio of the index with a weighted harmonic mean defined as follows:

WHM = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} w_i x_i^{-1}}. \qquad (4.14)

For the index mentioned above we have that the weighted harmonic mean is given by:

PE_{WHM} = \frac{0.2 + 0.8}{\frac{0.2}{24} + \frac{0.8}{13.33}} = 14.63. \qquad (4.15)

The harmonic mean provides a fair comparison for the ratios.

If we were to use the weighted arithmetic mean we would overestimate the P/E ratio:

PE_{wAM} = (0.2)(24) + (0.8)(13.33) = 15.47. \qquad (4.16)

This is because the arithmetic mean gives more weight to larger values.
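A small sketch of the P/E calculation above, writing the weighted harmonic mean of Equation (4.14) directly in NumPy (the 20/80 split and the two P/E ratios are the figures from the example):

import numpy as np

pe_ratios = np.array([24, 40/3])    # Weyland Corp, Cyberdyne Systems
weights = np.array([0.2, 0.8])      # portfolio split

# Weighted harmonic mean, Equation (4.14)
pe_whm = weights.sum() / np.sum(weights / pe_ratios)
print(round(pe_whm, 2))             # 14.63

# Weighted arithmetic mean for comparison
print(round(np.average(pe_ratios, weights=weights), 2))   # 15.47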

Another example where the harmonic mean is used is in the context of rates; think for example of speed. Consider a trip performed by Captain Saru of the USS Discovery to rescue Commander Michael Burnham:
An example of the use of the harmonic mean for rates.

• On the way to the rescue, USS Discovery travelled 180 Light Years in 2.6 seconds

• After the successful rescue, due to damage sustained to the spore drive, USS Discovery made the trip back in 8 seconds

• Captain Saru instructed Lieutenant Commander Stamets to use the same route both ways

A successful trip to rescue Commander Michael Burnham.

If we are interested in determining the average speed across the entire trip, back and forth, we can use the arithmetic mean and make the following calculation:

0.5 \left( \frac{180}{2.6} + \frac{180}{8} \right) = 45.86 \; \frac{\text{LY}}{\text{sec}}. \qquad (4.17)

The arithmetic mean will give us a result that implies that we travelled more slowly on the first leg of the rescue.
This would be a good estimate if we had travelled different distances at different speeds. However, given that we used the same route, the arithmetic mean would imply that we actually travelled more slowly on the way there. However, we know that the spore drive was in spotless condition to start with. Let us use instead the harmonic mean:

2 \left( \frac{2.6}{180} + \frac{8}{180} \right)^{-1} = 33.96 \; \frac{\text{LY}}{\text{sec}}. \qquad (4.18)

In this case we take into account that we covered the first 180 Light Years quicker and therefore spent less time travelling compared to the second leg. The harmonic mean does this for us at lightning speed.
The harmonic mean provides a more accurate picture as we travelled the first leg faster than the second one.
We should also note that the harmonic mean is used in the definition of the F1 score used in classification modelling4.

4 Rogel-Salazar, J. (2017). Data Science and Analytics with Python. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. CRC Press

It is effectively the harmonic mean of precision and recall,


both of which are actually rates, and as we have seen, the
harmonic mean is helpful when comparing these sorts of
quantities.

4.4 Dispersion

Knowing where our data is centred is of great help. However, that is not the end of the tale. There is much more we can tell from the tail, quite literally. If we know how far our values are from the centre of the distribution, we get a better picture of the data we have in front of us. This is what we mean by dispersion, and it helps us understand variability. If all the values in our data are very close to the central value, we have a very homogeneous dataset. The wider the distribution, the more heterogeneous our data is. Let us take a look at some useful dispersion measures.
Looking at how close or far the values are from the centre tells us something about our data.

4.4.1 Setting the Boundaries: Range

A very straightforward measure of dispersion is the width covered by the values in the dataset. This is called the range and it is simply the difference between the maximum and minimum values in the dataset.

The first step in calculating the range is to obtain the


boundaries. In other words, we need to find the maximum
and minimum values. For our Physics marks from Table 3.1
we can do this as follows:

> physics.max()

9.0

> physics.min()

4.5

We are using the max and min methods for NumPy arrays.

We can then simply calculate the difference:

> physics_range = physics.max() - physics.min()

> print(physics_range)

4.5

This is such a common thing to do that NumPy has a method for arrays called .ptp() or peak-to-peak:

> physics.ptp()

4.5

With it we can also calculate the range for the marks array in one go:

> marks.ptp(axis=0)

array([4.5, 4. , 6. ])

We can see from the results above that History (6) has a wider range and this means that the variability in the marks is higher than for Spanish (4). The marks for History are more spread out than for the other two subjects.

For our pandas dataframe we can use pretty much the same
methods as above. Let us take a look at the maximum and
minimum values of the fuel consumption column we have
been analysing:

> mpg.max(), mpg.min()

(33.9, 10.4)

For pandas dataframes we can use the max and min methods to calculate the range.

> mpg_range = mpg.max() - mpg.min()

> print(mpg_range)

23.5

We can actually apply the NumPy peak-to-peak function to the pandas series too:

> np.ptp(mpg)

23.5

For the car data grouped by transmission type we can calculate the range as follows:

> ct['mpg'].apply(np.ptp)

am
0    14.0
1    18.9

In this case we are using the apply method to calculate the range for our grouped dataframe.

4.4.2 Splitting One’s Sides: Quantiles, Quartiles, Percentiles and More

In many cases it is useful to group observations into several equal groups. This is effectively what we are doing with the median: It represents the middle position of our data when dividing it into two parts. Half of the data has values lower than the median and the other half is higher. Positions like the median are useful as they indicate where a specified proportion of our data lies. These landmark points are known as quantiles and they split our observations into equal groups.
The median splits a dataset into two halves; quantiles split our data into equal groups.

We may want to create not only two groups, but for example four groups, such that the values that split the data give 25% of the full set each. In this case these groups have a special name: quartiles. Note that if we partition our data into n groups, we have n − 1 quantiles. In the case of quartiles we have therefore 3 of them, with the middle one being equal to the median. Other quantiles with special names include deciles, which split the data into 10 groups, and percentiles, which split it into 100 groups. In this case the 50th percentile corresponds to the median.
Quartiles split the data into 4 groups; the median is therefore also called the second quartile. Percentiles partition the data into 100 equal groups, and the median is thus the 50th percentile.

We find our n − 1 quantiles by first ranking the data in order, and then cutting it at n − 1 equally spaced points on the interval, obtaining n groups. It is important to mention that the terms quartile, decile, percentile, etc. refer to the cut-off points and not to the groups obtained. The groups should be referred to as quarters, tenths, etc.
We need to order our data to obtain our quantiles.

As we mentioned before, quantiles are useful to specify the position of a set of data. In that way, given an unknown distribution of observations we may be able to compare its quantiles against the values of a known distribution. This can help us determine whether a model is a good fit for our data. A widely used method is to create a scatterplot known as the Q-Q plot or quantile-quantile plot (see Section 6.2.1 to learn more about Q-Q plots). If the result is (roughly) linear, the model is indeed a good fit.
We can compare the distribution of quantiles to known distributions.

Let us take a look at the quartiles for the Physics marks in Table 3.1. We will use the quantile method in NumPy, where we pass the array we need to analyse as a first argument, followed by the desired quantile. In the case of quartiles we want to partition our data into n = 4 groups and thus we require n − 1 = 3 quartiles, namely 0.25, 0.5 and 0.75:

print('Physics Q1: ', np.quantile(physics, 0.25))
print('Physics Q2: ', np.quantile(physics, 0.50))
print('Physics Q3: ', np.quantile(physics, 0.75))

Physics Q1:  8.0
Physics Q2:  8.0
Physics Q3:  8.5

We are using the quantile method for NumPy arrays.

The quantile method can take a sequence for the quantiles


required and we can indicate which axes we are interested
in analysing. In this way, we can obtain the quartiles for
each subject in a single line of code:

np.quantile(marks, [0.25, 0.5, 0.75], axis=0)

array([[ 8. ,  6.5,  7. ],
       [ 8. ,  9. ,  7.5],
       [ 8.5, 10. ,  9. ]])

The quantile method can take a list of values to calculate the appropriate quantiles.

In a pandas dataframe we can do pretty much the same as above as there is also a quantile method for dataframes:

mpg = cars.loc[:, 'mpg']

print('MPG Q1: ', mpg.quantile(0.25))
print('MPG Q2: ', mpg.quantile(0.50))
print('MPG Q3: ', mpg.quantile(0.75))

MPG Q1:  15.425
MPG Q2:  19.2
MPG Q3:  22.8

Pandas dataframes also have a quantile method.

The method can also take an array of quantile values like the NumPy one. Let us get, for example, the 10th and 100th percentiles for the miles per gallon column in our dataset:

> print(mpg.quantile([0.1, 1]))

0.1    14.34
1.0    33.90

Similarly, we can also provide a list of the desired quantiles for a pandas dataframe.

Note that we can use the describe method in a pandas


dataframe to obtain some measures such as the quartiles:

> print(mpg.describe())

count    32.000000
mean     20.090625
std       6.026948
min      10.400000
25%      15.425000
50%      19.200000
75%      22.800000
max      33.900000

The describe method of a pandas dataframe provides a list of useful descriptive statistics, including the quartiles.

We get the quartiles as calculated above and we also get


other information such as the maximum and minimum we
described in the previous section. We also get the count of
data and mean as defined in Section 4.3.3 and the standard
deviation (std), which we will discuss in Section 4.4.4.

4.4.3 Mean Deviation

Think of how you would describe the departure from an established route or course of action, even an accepted standard. The larger that gap, the bigger the departure or deviation. We can use this idea to consider the variation of all the observations of a dataset from a central value, for example the mean. If we calculate the mean of the deviations of all values from the arithmetic mean, we are obtaining the mean deviation of our dataset.
The mean deviation is the average departure of our data points from a central value.

By definition the chosen measure of central tendency is indeed central, and so we expect to have values larger and lower than it. In order for our mean deviation to avoid cancelling out positive and negative values, we take the absolute value of the deviations, in this case given by the difference of the point xi from the mean µ, in other words |xi − µ|. An alternative name is therefore mean absolute deviation. The mean deviation for n observations is given by:

MD = \frac{1}{n} \sum_{i=1}^{n} |x_i - \mu|. \qquad (4.19)

The mean deviation provides us with information about how far, on average, our observations are spread out from a central point, in this case the mean. We can create a function to calculate the mean deviation:

def md(x, axis=None):
    # Mean of the absolute deviations from the mean, Equation (4.19)
    avg = np.mean(x, axis)
    dev = np.absolute(x - avg)
    return np.mean(dev, axis)

Here is an implementation of Equation (4.19).

With that function in place, we can calculate the mean absolute deviation for our Physics marks:

> md(physics)

1.24

We can also apply it to our array of marks:

> md(marks, axis=0)

array([1.24, 1.64, 1.64])

Note that the mean deviation for the Spanish and History marks is the same.

Notice that the mean deviation for Spanish and History marks is the same. This means that on average, the marks for these two subjects are 1.64 units from their respective means: 8.3 for Spanish and 7.3 for History. Notice that the interpretation of the mean deviation requires, therefore, the context of the average value.

In pandas we have a dataframe method called mad that calculates the mean absolute deviation of the values in the dataframe. For our cars observations, the mean absolute deviation for fuel consumption is:

> mpg.mad()

4.714453125

In pandas the mean absolute deviation can be calculated with the mad method.

Similarly, the mean absolute deviation for fuel consumption by transmission is:

> ct['mpg'].mad()

am
0    3.044875
1    5.237870
Name: mpg, dtype: float64

The mad method can be used for grouped dataframes too.

4.4.4 Variance and Standard Deviation

We have seen that we can get a measure of the dispersion from a central value by calculating the difference from it. In the previous section we took the differences and used the absolute value to avoid the cancelling out of positive and negative values. Another possibility is to take the square of each difference. We ensure in that way that we obtain positive values, and have the advantage that we make the measure more sensitive. Let us take a look.
We can use the square of the differences.

Variance

This new measure of dispersion is called the variance and for a population with mean µ, the population variance is given by:

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2. \qquad (4.20)

The variance is also called the mean square deviation. For a sample, the variance is calculated as follows:

s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2. \qquad (4.21)

Notice that the only difference between the population and sample variances is the factor used in the calculation. We use n − 1 for the sample variance to obtain an unbiased estimate of the variability (see Appendix A for more information). This is because for a sample we would tend to obtain a lower value than the real variance for the population. The factor n − 1 makes the sample variance larger.

In many cases we are interested in calculating the sample variance to estimate the population variance, or we may be interested in saying something about the sampling error of the mean. If the sample size is n and the population mean µ is known, then the most appropriate estimation of the population variance is given by Equation (4.20). And we are done.
We estimate the population variance with the sample variance.

However, it is usually the case that we do not know the value of µ. Instead we can estimate it by using the sample mean X̄. With a fixed value of X̄ we need to know n − 1 of the elements in the sample to determine the remaining element. That element, therefore, cannot be freely assigned; and although we have used n elements to calculate the mean, we need to reduce the number by one to obtain the mean squared deviation. The number of values that are free to vary is called the degrees of freedom of a statistic.
The degrees of freedom are the values that can vary in a statistical analysis.

Standard Deviation

Calculating the square of the difference is helpful, but it leaves us with much larger units and in some cases units that are difficult to interpret as they would be the square of the unit. If we are interested in expressing the dispersion measurement in the same units as the original values, we need to simply take the square root of the variance. This measure of dispersion is called the standard deviation.

The population standard deviation is therefore given by:

\sigma = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 }. \qquad (4.22)

The sample standard deviation is calculated as:

s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2 }. \qquad (4.23)

We can calculate the population variance and standard


deviation with the help of NumPy. For our Physics marks
we can do the following:

> np.var(physics)

2.54

> np.std(physics)

1.5937377450509227

You can check that the variance is the square of the standard deviation.

If we need to calculate the sample variance and standard deviation, we simply use the delta degrees of freedom parameter ddof, and n - ddof is used in the calculation. For our Physics marks we have:

> np.var(physics, ddof=1)

3.175

> np.std(physics, ddof=1)

1.7818529681205462

ddof is the delta degrees of freedom.

We can apply the same functions to the array of marks too. For the population variance and standard deviation:

> np.var(marks, axis=0)

array([2.54, 2.96, 4.46])

> np.std(marks, axis=0)

array([1.59373775, 1.72046505, 2.11187121])

We can apply the functions to the columns of a NumPy array.

You may remember that when calculating the mean


deviation for our student marks we obtained the same value
for Spanish and History. With the standard deviation above
we obtain markedly different values for the variance and
standard deviation. From the values obtained we can see

that the marks for History are spread wider than those for
Spanish.

We can use the delta degrees of freedom parameter for the sample variance and standard deviation:

> np.var(marks, axis=0, ddof=1)

array([3.175, 3.7  , 5.575])

> np.std(marks, axis=0, ddof=1)

array([1.78185297, 1.92353841, 2.36114379])

The variance and standard deviation provide a better view of the dispersion of a dataset.

For a pandas dataframe we can use the .var and .std methods provided. However, please note that these two methods provide by default the sample variance and standard deviation measures. If we are interested in the population variance and standard deviation we need to pass the parameter ddof=0. Let us take a look. For the fuel consumption variable in our cars dataset we have:

> mpg.var(ddof=0)

35.188974609375

> mpg.std(ddof=0)

5.932029552301219

We need to pass ddof=0 to obtain the population values.

Since the default is ddof=1 for these methods, to obtain the sample variance and standard deviation we simply do the following:

> mpg.var()

36.32410282258065

> mpg.std()

6.026948052089105

The sample values are obtained by default.

For our grouped data by transmission we get the following sample variance and standard deviation measures:

> ct['mpg'].var()

am
0    14.699298
1    38.025769

> ct['mpg'].std()

am
0    3.833966
1    6.166504

The methods can be applied to grouped dataframes too.

4.5 Data Description – Descriptive Statistics Revisited

As we have seen, we can summarise a lot of information from a dataset using the descriptive statistics discussed so far. The mean and standard deviation are the most common measures that are quoted in many data description tasks. This is because they can provide a good view of what our data looks like and these measures, as we shall see, can be used to compare against well-known distributions.
The mean and standard deviation are the most widely used descriptive statistics.

When the data does not match a distribution, or it is skewed, it is sometimes preferable to use the median and centiles such as the 10th and 90th. We know that the .describe() method in pandas quotes the first and third quartiles. This is useful as it lets us define the interquartile range.
The median and percentiles are used when the distribution is skewed.
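As a quick sketch, the interquartile range can be obtained directly from the quartiles calculated earlier for the fuel consumption series mpg (the numbers follow from the describe() output above):

# Interquartile range: third quartile minus first quartile
q1, q3 = mpg.quantile(0.25), mpg.quantile(0.75)
iqr = q3 - q1
print(iqr)    # 22.8 - 15.425 = 7.375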

Although descriptive statistics tell us a lot about our data, they do not tell us everything. We may look at the mean or median value, but we cannot say anything about the outliers that happen to be in the data, for example. We may want to look at the dispersion too and even then, different distributions have very different shapes and still have the same mean and standard deviation.
Descriptive statistics are not the full story; we need to consider the overall distribution of the data.

If we know that the data matches a known distribution we can use this information to our advantage. If the mean and the standard deviation are known, we can for instance express the values in our dataset as deviations from the mean as a multiple of the standard deviation. This is called a Z-score or standard score:

Z = \frac{x - \mu}{\sigma}. \qquad (4.24)

Z-scores let us compare the data from a test to a normal population (in the statistical sense); see Section 5.4.1 for information about the normal distribution.

The advantage of using these scores is that the values are


effectively normalised, enabling us to make comparisons.
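For instance, SciPy provides a zscore function that applies Equation (4.24) to a whole array in one go; here it is used on the mpg series from the cars dataset, purely as an illustration:

from scipy.stats import zscore

# Standard scores for the fuel consumption values (population std by default)
mpg_z = zscore(mpg)
print(mpg_z[:3])

A value of, say, +1.5 would then read as one and a half standard deviations above the mean fuel consumption.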

Other times, we may have two distributions that have similar average values and thus we can compare the standard deviations. However, if the units are different, or the average value is different, we need a variability measure that is independent of the units used and takes into account the means. In this case we can use the following transformation:

CV = \frac{100\sigma}{\mu}, \qquad (4.25)

and it is called the coefficient of variation. It is expressed as a percentage, and is defined as the ratio of the standard deviation to the mean. It encapsulates the degree of variability in relation to the mean of the population.
The coefficient of variation measures the variability in relation to the mean.
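A one-line sketch for the fuel consumption column, using the population standard deviation from earlier:

# Coefficient of variation as a percentage, Equation (4.25)
cv = 100 * mpg.std(ddof=0) / mpg.mean()
print(round(cv, 2))    # roughly 29.53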

As you can imagine, we need other tools to continue making


sense of the data we are analysing, and having a better view
of different distributions is our next step.
5 Definitely Maybe: Probability and Distributions

The type of statistical analysis we have seen in the previous chapters has concentrated on describing our data. However, we are also interested in making predictions based on it. In order to do that, we rely on the probability that certain phenomena will occur. We can quantify that probability with a number between 0 and 1, where the former is equivalent to saying that the phenomenon is impossible, and the latter that the phenomenon will occur for certain.
Probability is a good companion of statistical analysis.

We can define probability in terms of relative frequency. In other words, from the number of times that an event happens, divided by the total number of trials in an actual experiment. Think of flipping a coin; the theoretical probability of getting a head is 1/2. If we flipped the coin 100 times we may not get exactly 50 heads, but a number close to it. Relative frequencies let us estimate the probability we are after based on the outcomes of an experiment or trial.
We can define probability based on the relative frequency of an event happening.

5.1 Probability

When we talk about uncertainty we tend to use the language of probability. Capturing data about the specific phenomena of interest provides us with important information to understand the randomness in the results obtained. This is known as empirical probability and it is obtained from experimental measurement. We call Ω the set of all possible events that may occur around the phenomenon we are interested in studying. There are some important properties that probabilities follow.
Empirical probability is measured by experimentation.

1. In the sample space (Ω) for an experiment, the probability P(Ω) is 1.

2. For an event A, the probability of the event happening is given by P(A) = |A|/|Ω| where | · | is the cardinality of the event.

3. Given the statement above, for any event A, the probability P(A) is between 0 and 1.

4. For any event A and its complement A′, P(A) + P(A′) = 1. Think of A′ as "not A".

5. If two events, A and B, are independent from each other, the probability of them occurring simultaneously is P(A ∩ B) = P(AB) = |AB|/|Ω| = P(A)P(B).

6. If two events, A and B, cannot occur at the same time, we say they form a disjoint event. The probability of A or B occurring is given by P(A ∪ B) = P(A) + P(B).

7. If two events, A and B, are not mutually exclusive, the probability of either occurring is P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

These are some important properties of probabilities.

We may also be interested in knowing the probability of an event A happening based on the occurrence of another event B. This is known as the conditional probability and is denoted as P(A|B). This is read as the probability of A given B and is calculated as:

P(A|B) = \frac{|AB|}{|B|} = \frac{|AB|/|\Omega|}{|B|/|\Omega|}. \qquad (5.1)

Conditional probability is the probability of an event happening based on the occurrence of another event.

It is possible to ask about the conditional probability of B given A and the expression for this is given by:

P(B|A) = \frac{|BA|}{|A|}. \qquad (5.2)

We can substitute Equation (5.2) into Equation (5.1) and we obtain the following expression:

P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}, \qquad (5.3)

and this is known as Bayes' theorem. It is named after the 18th-century English statistician and Presbyterian minister Thomas Bayes, based on a posthumous publication1 presented to the Royal Society in 1763. There is an alternative interpretation of the concept of probability known as Bayesian probability. Instead of using the relative frequency of an event, probability is thought of as a reasonable expectation representing a state of knowledge or as a quantification of a personal belief. This expectation can be updated from prior ones.

1 Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions 53, 370–418

The probability P(A|B) in Equation (5.3) is known as the posterior probability, and P(A) is the prior. P(B|A) is called the likelihood. Bayes' theorem can be thought of as a rule that enables us to update our belief about a hypothesis A in light of new evidence B, and so our posterior belief P(A|B) is updated by multiplying our prior belief P(A) by the likelihood P(B|A) that B will occur if A is actually true.
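As a tiny numeric sketch of the update rule (all the numbers here are invented for illustration), suppose a test detects a condition with likelihood P(B|A) = 0.9, the prior prevalence is P(A) = 0.01, and the overall probability of a positive test is P(B) = 0.05:

# Bayes' theorem, Equation (5.3): posterior from prior, likelihood and evidence
prior = 0.01        # P(A)
likelihood = 0.9    # P(B|A)
evidence = 0.05     # P(B)

posterior = likelihood * prior / evidence
print(posterior)    # 0.18

The posterior belief of 0.18 is much larger than the prior of 0.01, showing how the evidence B updates our belief in A.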

This rule has had a number of very successful applications, for example in machine learning classification problems2, where the Naïve Bayes Classifier is a favourite. It can also be applied to a variety of things, even in space, the final frontier. An example I enjoy of the use of this approach is the account of Laplace determining the mass of Saturn3. Let us now talk about the role that random variables have in the calculation of probabilities and their distribution.

2 Rogel-Salazar, J. (2017). Data Science and Analytics with Python. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series. CRC Press
3 Laplace, P. S. and A. Dale (2012). Pierre-Simon Laplace Philosophical Essay on Probabilities: Translated from the Fifth French Edition of 1825 with Notes by the Translator. Sources in the History of Mathematics and Physical Sciences. Springer New York
5.2 Random Variables and Probability Distributions

We have been referring to important concepts in statistics; for instance, in Chapter 4 we talked about population and sample. We know that the idea of taking samples is to draw conclusions about the population. We may be interested in understanding a given characteristic of the population, but we may not have the entire data. This can be solved by using a sample statistic to estimate a parameter that describes the population. The samples can be taken at random and thus we need to understand some ideas behind random variables.
A sample statistic is the value calculated from the sample data.

5.2.1 Random Variables

Let us consider the following example. You are faced with Harvey Dent, aka Two-Face, and Batman is nowhere to be seen. Mr Dent uses his coin to wield his weapon of choice: Decisions. Depending on whether the coin lands on the scarred or the clear face, we may end up with different outcomes, evil or less so. Say that the coin is flipped twice and it can land on either the scarred face (S) or the clear face (C). For this situation, we have the following possible outcomes:
Mr Dent could be using a normal coin too, with heads and tails instead.

• SS

• SC

• CS

• CC

In the encounter with Mr Dent above, we have managed to create a random variable. A random variable is a variable that takes on different values determined by chance.

If we are interested in the number of times that the coin leads Two-Face to act evilly, we need to count the number of times the coin lands on the S face. We will call our variable of interest X, and according to the information above, it can take the values 0, 1, or 2.

We can use this information to build a table that tells us how frequent the outcomes may be. In particular, we can look at the relative frequency of the outcomes, which in turn can be interpreted as the probability of a particular outcome to be realised. For example, there is only one time in four that no S occurs in our experiment. The relative frequency of this outcome is 1/4. This means that we have a 0.25 probability of getting this result.
The relative frequency gives us the probability of different events.

Table 5.1 shows the probabilities for all outcomes and it represents the probability distribution of our statistical experiment.

Table 5.1: Probability of our coin flipping experiment.

X    Probability P(X = x)    Cumulative Probability P(X ≤ x)
0    0.25                    0.25
1    0.50                    0.75
2    0.25                    1.00

We can now ask about the probability that the value of our random variable X falls within a specified range. This is called the cumulative probability. For example, if we are interested in the probability of obtaining one or fewer S outcomes, we can calculate the probability of obtaining no S, plus the probability of getting one S:

P(X ≤ 1) = P(X = 0) + P(X = 1) = 0.25 + 0.50 = 0.75. \qquad (5.4)

A cumulative probability is the probability that the value of a random variable falls within a specified range.

For our coin experiment with Two-Face, the cumulative


probability distribution for P( X ≤ x ) is in the last column of
Table 5.1.
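A quick way to reproduce Table 5.1 in code is to enumerate the four equally likely outcomes and tabulate the number of S faces; this is just a sketch of the counting argument above:

from itertools import product
from collections import Counter

# All outcomes of two flips; count the S faces in each
outcomes = [flips.count('S') for flips in product('SC', repeat=2)]
pmf = {x: count / len(outcomes) for x, count in sorted(Counter(outcomes).items())}
print(pmf)    # {0: 0.25, 1: 0.5, 2: 0.25}

# Cumulative probabilities P(X <= x)
cumulative, running = {}, 0
for x, p in pmf.items():
    running += p
    cumulative[x] = running
print(cumulative)    # {0: 0.25, 1: 0.75, 2: 1.0}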

5.2.2 Discrete and Continuous Distributions

The variable X in our experiment with Two-Face's coin can only take a few possible values and thus we refer to its probability distribution as a discrete probability distribution. The frequency function that describes our discrete random variable experiment is called the probability mass function or PMF.
A discrete probability distribution refers to the occurrence of discrete, individually countable outcomes.

If the random variable we are studying can take any value between two specified values, then the variable is called continuous and, imaginatively enough, its probability distribution is called a continuous probability distribution.
Continuous probability distributions can take any value in a given range.

The probability that a continuous random variable will take a particular value is zero and this means that we cannot express the distribution in the form of a table. We require instead a mathematical expression. We refer to this expression as the probability density function or PDF. The area under the curve described by the PDF is equal to 1 and it tells us the probability:

\int_{-\infty}^{\infty} p(x)\,dx = 1, \qquad (5.5)

and the probability that our random variable takes a value between two values a and b is given by the area under the curve bounded by a and b.
The area under the PDF gives us the probability.

Sometimes it is useful to go the other way around. In other


words, given a distribution, we may want to start with
a probability and compute the corresponding x for the

cumulative distribution. This can be done with the percent


point function (PPF). We can think of the PPF as the inverse
of the cumulative distribution function.
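As a short illustration (using SciPy's uniform distribution, which is introduced just below), the ppf method inverts the CDF: asking for the value below which half of the probability lies returns the midpoint of the interval:

from scipy.stats import uniform

# Uniform on [0, 6]: the 50% point is 3, and cdf(ppf(q)) recovers q
print(uniform.ppf(0.5, loc=0, scale=6))              # 3.0
print(uniform.cdf(uniform.ppf(0.25, 0, 6), 0, 6))    # 0.25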

5.2.3 Expected Value and Variance

Given the scenario above with Mr Dent, we can ask what value we should expect to obtain. To answer this question we need to find the expected value (also called the mean) of our random variable. For a discrete random variable the expected value E(X) is given by:

E(X) = \mu = \sum_i x_i p_i, \qquad (5.6)

where p_i is the probability associated with the outcome x_i. This is effectively a weighted average as described in Section 4.3.3. For our experiment above we have that the mean value is:

\mu = E[X] = 0(0.25) + 1(0.5) + 2(0.25) = 1. \qquad (5.7)

This means that if we flip a fair coin 2 times we expect 1 S face; or equivalently, the expected fraction of S faces for two flips is one half. But what about the variability in our experiment? Well, we are therefore interested in looking at the variance of our discrete random variable, which is given by:
Var(X) = \sigma^2 = E\left[ (X - E[X])^2 \right]
       = E\left[ X^2 - 2XE[X] + E[X]^2 \right]
       = E[X^2] - 2E[X]E[X] + E[X]^2
       = E[X^2] - E[X]^2. \qquad (5.8)

We can express the variance in terms of expected values.

In other words, the variance of X is equal to the mean of the square of X minus the square of the mean of X. For our coin experiment this can be written as:

\sigma^2 = \sum_i x_i^2 p_i - \mu^2. \qquad (5.9)

Substituting the values we have that:

\sigma^2 = 0^2(0.25) + 1^2(0.5) + 2^2(0.25) - 1^2 = 0.5. \qquad (5.10)
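These two numbers can be checked directly from the probability table with NumPy, treating Equation (5.6) and Equation (5.9) as weighted sums:

import numpy as np

x = np.array([0, 1, 2])          # number of S faces
p = np.array([0.25, 0.5, 0.25])  # probabilities from Table 5.1

mean = np.sum(x * p)                    # Equation (5.6)
variance = np.sum(x**2 * p) - mean**2   # Equation (5.9)
print(mean, variance)                   # 1.0 0.5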

In the case of a continuous variable with PDF f(x), we have that the expected value is defined as:

E(X) = \mu = \int_{-\infty}^{\infty} x f(x)\,dx, \qquad (5.11)

and the variance is given by:

Var(X) = E[X^2] - \mu^2 = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \left( \int_{-\infty}^{\infty} x f(x)\,dx \right)^2. \qquad (5.12)

The expected value for a continuous variable is given by the integral of its PDF, f(x), multiplied by x.

The results obtained in Equations (5.7) and (5.10) are suitable for two flips of Mr Dent's coin. However, we are now curious to know about the probability distribution for the fraction of S faces we get if we were to increase the number of flips to, say, 100. We can ask Python to help us.
We are now interested in repeated trials of our coin flip.

import numpy as np
import random

def coinFlip(nflips=100):
    # Record each flip: 1 for the scarred face, 0 for the clear face
    flips = np.zeros(nflips)
    for flip in range(nflips):
        flips[flip] = random.choice([0, 1])
    return flips

We can use this function to simulate repeated coin flips for our experiment.

With the help of the random.choice function in the random module in Python, we can choose values at random from the list provided, in this case between the discrete values 0 or 1. Let 1 be the S face we are interested in. We throw the fateful coin 100 times by default (nflips=100), and we store the results in an array called flips. Let us see the result of a few instances of the 100 flips:

> t1, t2, t3 = coinFlip(), coinFlip(), coinFlip()

> np.sum(t1), np.sum(t2), np.sum(t3)

(57.0, 46.0, 54.0)

Remember that the numbers are random, so your results may look different.

We have performed 100 flips on three different trials. In the first trial we obtained 57 scar faces, 46 in the second trial and 54 in the third. The fraction of scar faces is close to 50%, but to be able to assert the expected value we may need to do the experiment several more times. Perhaps 100, 250, 1000 or even 10000 times. Once again, let's get Python to help:

def coinExperiment(ntrials):
    # Number of S faces obtained in each trial of nflips=100 throws
    results = np.zeros(ntrials)
    for trial in range(ntrials):
        f = coinFlip()
        results[trial] = np.sum(f)
    return results

This function lets us simulate repeated trials of our coin flipping experiment. We are throwing the coin nflips=100 times in each trial.

Here, we store the number of S faces obtained in each trial in the array called results. We can now ask Python to throw the coins:

ntrials = [100, 250, 1000, 10000]

r = []
for n in ntrials:
    r.append(coinExperiment(n))

We repeat our experiment with an increasing number of trials.

We can look at the mean, variance and standard deviations obtained in each of the experiments:

> for i in r:
      print(i.mean()/100, i.var()/100, i.std()/10)

0.49369999999999997 0.183731 0.428638542364
0.49723999999999996 0.293358 0.541625553311
0.49988999999999995 0.262368 0.512219474444
0.50110199999999999 0.252276 0.502271400340

We are looking at the fraction of S faces obtained, hence the division by 100. Remember that the standard deviation is the square root of the variance.

We are looking at the fraction of S faces obtained and that is why we are dividing by the total number of flips, i.e. nflips = 100. We can look at the frequency of each result. This can be seen in the histogram shown in Figure 5.1, and we can see that as the number of experiments increases, we get a better defined distribution.

[Figure 5.1: Probability distribution of the number of S faces in Two-Face's fair coin flipped 100 times. Four histograms of occurrences against the fraction of S faces, for 100, 250, 1000 and 10000 experiments.]

The distribution that emerges as the number of experiments grows is called the normal distribution and this is the result of taking a random sample of size n from any distribution. For sufficiently large n, the sample mean X̄ follows approximately the normal distribution with mean µ and standard deviation σ/√n, where µ and σ are the mean and standard deviation of the population from which the sample was selected. This is the central limit theorem and we will look at it in more detail later on in Section 5.4.4. See Section 5.4.1 for more information about the normal distribution.

5.3 Discrete Probability Distributions

Now that we are better acquainted with what a


probability distribution is, we will take a look at a few of the
most widely used discrete probability distributions.

5.3.1 Uniform Distribution

Let us start with one of the most straightforward probability distributions, the uniform distribution. As the name implies, it describes the case where every possible outcome has an equal likelihood of happening. An example of this is the casting of a fair die: Each face is equally likely to be obtained with a probability of 1/6. Another example is drawing any card from a standard deck of cards, where each card has a probability of 1/52 to be drawn.
In the uniform distribution, all outcomes are equally likely. Examples include the casting of a die, or drawing a card from a deck.

The probability mass function (PMF) for the uniform distribution is given by:

f(x) = \frac{1}{n}, \qquad (5.13)

where n is the number of values in the range. It is also possible to use upper and lower bounds, denoted as U and L respectively. In this way we can express the number of values in the range as n = U − L + 1.

In terms of upper and lower bounds, the probability mass function for the uniform distribution can be written as:

f(x) = \begin{cases} \dfrac{1}{U - L + 1} & \text{for } L \le x \le U + 1, \\[6pt] 0 & \text{for } x < L \text{ or } x > U + 1. \end{cases} \qquad (5.14)

The cumulative distribution function for the uniform distribution is given by:

F(x) = 0                       for x < L,
F(x) = (x − L)/(U − L + 1)     for L ≤ x ≤ U + 1,
F(x) = 1                       for x > U + 1.    (5.15)

The mean of the uniform distribution can be calculated as


follows:

µ = ∑_{i=1}^{n} x_i f(x_i) = (1/n) ∑_{i=0}^{n−1} i,
  = (1/n) · n(n − 1)/2,
  = (n − 1)/2,    (5.16)

where we have used the following identity:

∑_{i=0}^{n−1} i = n(n − 1)/2.    (5.17)

(See Appendix B for more information.)

Equation (5.16) is the standard version of the mean for the


uniform distribution where the lower bound is 0. Recalling

that E[ X + c] = E[ X ] + c, where c is a constant, we can write


our mean as follows:
µ = (n − 1)/2 + L = (L + U)/2,    (5.18)

where the last expression is obtained by substituting n =


U − L + 1. This shows that the mean value of the uniform
distribution is the arithmetic mean of the upper and lower
bounds.

The variance of the uniform distribution is given by σ² = E[X²] − E[X]², and thus:

σ² = (1/n) ∑_{i=0}^{n−1} i² − ((n − 1)/2)².    (5.19)

The second term of the expression above is given by


Equation (5.16), whereas the first term can be calculated
with the identity:
∑_{i=0}^{n−1} i² = n(n − 1)(2n − 1)/6.    (5.20)

(See Appendix C for more information.)

Substituting Equation (5.20) into Equation (5.19) and


expanding:

σ² = (1/n) · n(n − 1)(2n − 1)/6 − (n² − 2n + 1)/4,
   = (2n² − 3n + 1)/6 − (n² − 2n + 1)/4,
   = (4n² − 6n + 2 − 3n² + 6n − 3)/12 = (n² − 1)/12.    (5.21)
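A quick way to convince ourselves of the closed forms in Equations (5.16) and (5.21) is to compute the mean and population variance of the integers 0, 1, . . . , n − 1 directly and compare them with the formulas; a small check for n = 6:

import numpy as np

n = 6
values = np.arange(n)                    # 0, 1, ..., n-1
print(values.mean(), (n - 1)/2)          # both give 2.5
print(values.var(), (n**2 - 1)/12)       # both give 2.9166...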

Let us go back to the idea of throwing a die. We can model


the PMF of the uniform distribution in Python with the help
of uniform from scipy.stats as follows:

> import numpy as np
> from scipy.stats import uniform
> x = np.arange(1,7)
> print(x)

array([1, 2, 3, 4, 5, 6])

The uniform distribution is implemented in SciPy's uniform.

The array x in the code above represents our random variable. Here, the values refer to the faces of our die. We can now call the pdf(x, loc, scale) method, where x is the random variable. By default the distribution is uniform on [0, 1]; if we use the parameters loc and scale, we obtain the uniform distribution on [loc, loc + scale]. Note that the implementation of the uniform distribution in SciPy is described as continuous, hence the use of PDF in the method name. We can see the depiction of the uniform distribution PDF in the upper left-hand panel of Figure 5.2.

> pdf = uniform.pdf(x, 0, 6)

> print(pdf)
We obtain the PDF of the uniform
distribution with the pdf method.

array([0.16666667, 0.16666667, 0.16666667,

0.16666667, 0.16666667, 0.16666667])

The cumulative distribution function can be obtained with


cdf(x, loc, scale) as shown in the code below. We can

see a depiction for our die rolls in the upper right-hand side
panel of Figure 5.2.

Figure 5.2: The uniform distribution. Panels: Uniform PDF, Uniform CDF, Uniform PPF and Uniform Random Variates.

> cdf = uniform.cdf(x, 0, 6)
> print(cdf)

array([0.16666667, 0.33333333, 0.5,
       0.66666667, 0.83333333, 1. ])

We get the CDF of the uniform distribution with the cdf method.

As we can see, the probabilities add up to 1 as expected.


Let us now look at the percent point function (PPF). We
start by postulating some probabilities and calculate the
corresponding random variables:

> probs = np.arange(0, 1.1, 0.1)
> ppf = uniform.ppf(probs, 0, 6)

We get the PPF of the uniform distribution with the ppf method.

We store the probabilities from 0 to 1 in steps of 0.1 in the probs array. We then calculate the PPF with the ppf method. The result is shown in the lower left-hand side panel of Figure 5.2.

Finally, let us look at generating random variates that follow the uniform distribution. We can do this with the help of rvs(loc, scale, size), where the last parameter specifies the shape of the arrays requested. Let us look at generating 10000 random variates. Please note that the entries will be of type float, so we will cast them as integers to get a better feel for the actual throw of a die:

> x1 = uniform.rvs(1, 6, 10000)

> x1 = x1.astype(int)

Since in our example we can only obtain integer numbers from our dice, we are casting the random variates given by rvs as integers. Remember that in SciPy the uniform distribution is implemented as continuous. A histogram of these random variates is shown in the lower right-hand side panel of Figure 5.2.

Let us see what the mean and sample variance for the random variates above are:

> x1.mean(), x1.var(ddof=1)

(3.5154, 2.9080536453645367)

Using the Equations (5.18) and (5.21) we can check the


values from the theoretical distribution:

> n, L = 6, 1
> u_mean = (n-1)/2 + L
> u_var = (n**2-1)/12
> print(u_mean)

3.5

> print(u_var)

2.9166666666666665

We can corroborate that the mean and variance of the variates generated correspond to the values we expect for a uniform distribution.

5.3.2 Bernoulli Distribution

Remember our encounter with Mr Dent and his fateful


coin? Provided that his coin is fair, for a single coin flip we And that fairness remains to be
proven. Where is Batman by the
have one of two outcomes each with a probability of 50%.
way?
This scenario can be described by the Bernoulli distribution.

The distribution bears the name of the 17-th century


mathematician Jacob Bernoulli who defined what we now
call a Bernoulli trial: An experiment with only two possible
outcomes. His posthumous book Ars Conjectandi4 is perhaps one of the first textbooks on formal probability, and in it the probabilistic folklore of trials with outcomes described as “success” or “failure” started getting cemented.

4 Bernoulli, J., J. Bernoulli, and E. D. Sylla (2006). The Art of Conjecturing, Together with Letter to a Friend on Sets in Court Tennis. Johns Hopkins University Press.

Coin flipping is a classic example of the Bernoulli


distribution as there are only two possible outcomes. In the
case of Mr Dent’s coin we have the scarred and the clear
faces. One could argue that for a fair coin the distribution could be a uniform distribution too (a Bernoulli experiment is one where there are only two possible outcomes). Similarly, the case of
rolling a die can be cast as a Bernoulli distribution if we fold
the outcomes into 2, for instance getting an odd or an even
number. The two outcomes of our Bernoulli trial are such
that a success (1) is obtained with probability p and a failure
(0) with probability q = 1 − p. This is therefore the PMF of
the Bernoulli distribution:

f(x) = p^x (1 − p)^(1−x),    for x ∈ {0, 1}.    (5.22)

For our coin p = q = 0.5, but the probabilities need not be


equal. The coin may be biased (perhaps Mr Dent's coin is indeed biased), or the Bernoulli trial may be the outcome of a fight between me and Batman... He is much more likely to win! My success may be 0.12 and his 0.88.

The mean of the Bernoulli distribution is:

µ = (1)( p) + 0(1 − p) = p. (5.23) The mean of the Bernoulli


distribution.

For our coin flipping experiment the mean is µ = 0.5. The


variance of the Bernoulli distribution is:

σ² = (p − 1)²p + (p − 0)²(1 − p),
   = p − p²,
   = p(1 − p).    (5.24)

The variance is the product of the probabilities of the two possible outcomes. For our coin flipping experiment the variance is σ² = 0.25.

Python has us covered with the bernoulli implementation


in SciPy. Let us create a distribution for the encounter
between me and Batman. The outcomes are failure (0) or
success (1) and p = 0.12 for my success (KA-POW!!!). The PMF can

be seen in the top left-hand panel of Figure 5.3 and we


calculate it with Python as follows:

> from scipy.stats import bernoulli
> x = np.array([0, 1])
> pmf = bernoulli.pmf(x, 0.12)
> print(pmf)

array([0.88, 0.12])

Probability distribution implementations in SciPy all have similar methods for calculating PMFs/PDFs, CDF, PPF and random variates. Here we use the bernoulli module.

The CDF for the Bernoulli distribution is easily obtained:

> cdf = bernoulli.cdf(x, 0.12)
> print(cdf)

array([0.88, 1. ])

See the CDF of the Bernoulli distribution in the top right-hand panel of Figure 5.3.

The PPF, as you expect, will return the outcome given


certain probabilities:

> probs = np.array([0.88, 1.0])
> ppf = bernoulli.ppf(probs, 0.12)
> print(ppf)

array([0., 1.])

See the PPF in the bottom left-hand panel of Figure 5.3.

Figure 5.3: The Bernoulli distribution. Panels: Bernoulli PMF, Bernoulli CDF, Bernoulli PPF and Bernoulli Random Variates.

We can obtain random variates that follow the Bernoulli distribution as follows:

> x1 = bernoulli.rvs(0.12, size=10000)
> print(x1.mean(), x1.var(ddof=1))

0.1144 0.10132277227722773

See a histogram of Bernoulli variates in the bottom right-hand panel of Figure 5.3.

Let us compare the result obtained versus theoretical values:



> print(bernoulli.mean(0.12))

> print(bernoulli.var(0.12))

0.12

0.1056

5.3.3 Binomial Distribution

Let us revisit our coin toss encounter with Two-Face,


where we know that flipping a coin one time has one of two
possible outcomes: Either the scarred face or the clear one.
What if we were to challenge our villainous friend (enemy? frenemy perhaps?) to make

his decision on 2 out of 3 flips, or 4 out of 5 or many more


throws, say k out of n flips. Well, we are facing a situation
similar to that encountered in Section 5.2.3 where we tested
with 100 trials and went up to 10000.

Each of the coin flips follows a Bernoulli distribution as


explained in Section 5.3.2. The repeated trials imply the sum
of these Bernoulli random variables. This sum will follow the binomial distribution, which is given by a sequence of n independent Bernoulli experiments. We require though that each of
the trials has the same probability of success (and therefore
of failure). Also, each trial is independent from any other
trial, in other words, the previous flip does not influence the
outcome of the next one.

Back to the challenge to Two-Face. We may be interested


in the way we obtain k results out of n items, irrespective

of their order. We can calculate this with the help of the binomial coefficient5:

C(n, k) = n! / ((n − k)! k!),    (5.25)

where n! stands for n factorial, i.e. n! = n(n − 1)(n − 2) . . . (2)(1), and C(n, k) can be read as “choose k out of n” or “n choose k”. Check Appendix D for a derivation of the binomial coefficient, and there are no prizes for guessing why this distribution is called binomial.

5 Scheinerman, E. A. (2012). Mathematics: A Discrete Introduction. Cengage Learning.

The probability mass function for the binomial distribution


is given by:

f(k, n, p) = C(n, k) p^k q^(n−k),    (5.26)

where p is the probability of success in each trial, q = 1 − p


is the probability of failure, n is the number of trials and k is
the number of successes that can occur among the n trials.
Remember that the values for p and q should remain the
same from one trial to the next, and they do not need to be
equal to each other.

We can use Equation (5.26) to determine the probability of


getting no (0) scar faces in 3 flips:

f(0, 3, 0.5) = (3!/(3! 0!)) (0.5)^0 (1 − 0.5)^(3−0),
             = (1)(1)(0.5)^3,
             = 0.125.    (5.27)

This is the probability of getting no scar faces in 3 flips of Mr Dent's coin.
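We can reproduce this arithmetic directly from Equation (5.26) using the binomial coefficient in Python's standard library; a small check:

from math import comb

k, n, p = 0, 3, 0.5
print(comb(n, k) * p**k * (1 - p)**(n - k))   # 0.125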

The mean of the binomial distribution can be obtained as


follows:
µ = ∑_{k=1}^{n} k · C(n, k) p^k q^(n−k),    (5.28)

  = n ∑_{k=1}^{n} C(n − 1, k − 1) p^k q^(n−k),    (5.29)

  = np ∑_{k=1}^{n} C(n − 1, k − 1) p^(k−1) (1 − p)^(n−k).    (5.30)

In Equation (5.28) we start the sum from 1 as the case k = 0


does not have a contribution. In Equation (5.29) we have
used the binomial property:

k · C(n, k) = n · C(n − 1, k − 1).    (5.31)

(See Appendix D for more information.)

We have taken the factor n out of the sum, as it does not depend on the summation index k. Now, we are going to do some re-writing and express n − k as (n − 1) − (k − 1):

µ = np ∑_{k=1}^{n} C(n − 1, k − 1) p^(k−1) (1 − p)^((n−1)−(k−1)),    (5.32)

  = np ∑_{j=0}^{m} C(m, j) p^j (1 − p)^(m−j),    (5.33)

  = np.    (5.34)

This is the mean of the binomial distribution.

Here, we have made some substitutions in Equation (5.33),


such that n − 1 = m, and k − 1 = j. The sum in Equation
(5.33) is equal to 1 as it is the sum of all probabilities for all
outcomes.

The variance of the binomial distribution is given by:

σ² = E[X²] − µ²,

   = ∑_{k=1}^{n} k² C(n, k) p^k (1 − p)^(n−k) − (np)²,    (5.35)

   = n ∑_{k=1}^{n} k C(n − 1, k − 1) p^k (1 − p)^(n−k) − (np)².    (5.36)

The sum in Equation (5.35) runs from 1 as the case k = 0


does not contribute to it. In Equation (5.36) we have used
the following property:

k² C(n, k) = k · k C(n, k) = k · n C(n − 1, k − 1).    (5.37)

(See Appendix D.)

Let us now use the identity p^k = p · p^(k−1):

σ² = np ∑_{k=1}^{n} k C(n − 1, k − 1) p^(k−1) (1 − p)^(n−k) − (np)²,    (5.38)

and we can simplify the expression above:

σ² = np ∑_{k=1}^{n} k C(n − 1, k − 1) p^(k−1) (1 − p)^((n−1)−(k−1)) − (np)²,    (5.39)

   = np ∑_{j=0}^{m} (j + 1) C(m, j) p^j (1 − p)^(m−j) − (np)²,    (5.40)

   = np((n − 1)p + 1) − (np)²,    (5.41)



σ² = (np)² + np(1 − p) − (np)²,    (5.42)

   = np(1 − p).    (5.43)

This is the variance of the binomial distribution.

In Equation (5.40) we have distributed the sum and


expressed it as two terms, the first being the expected value
of the binomial distribution and the second the sum of all
probabilities:
∑_{j=0}^{m} j C(m, j) p^j (1 − p)^(m−j) + ∑_{j=0}^{m} C(m, j) p^j (1 − p)^(m−j),    (5.44)

mp + 1,    (5.45)

(n − 1)p + 1.    (5.46)

Note that the binomial distribution for the case where n = 1 The case n = 1 recovers the
Bernoulli distribution.
is actually a Bernoulli distribution. You can check that the mean and variance of the Bernoulli distribution are recovered when only one trial is run.
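For instance, we can compare the binomial distribution with n = 1 against the Bernoulli distribution for the same probability of success:

from scipy.stats import binom, bernoulli

p = 0.12
for k in (0, 1):
    print(binom.pmf(k, 1, p), bernoulli.pmf(k, p))   # identical probabilities

print(binom.mean(1, p), bernoulli.mean(p))   # both give p
print(binom.var(1, p), bernoulli.var(p))     # both give p*(1-p)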

In Python, the binomial distribution can be calculated with


binom. For our several attempts with our coin, we can check

the probability obtained in Equation (5.27) as follows:

> from scipy.stats import binom
> binom.pmf(0, 3, 0.5)

0.125

The binomial distribution is implemented in the binom module in SciPy.

Let us calculate the probability mass function (PMF) and the


cumulative distribution function for obtaining 0, 1, ... , 100
scar faces in 100 attempts. The results can be seen in the top
panels of Figure 5.4:

import numpy as np

x = np.arange(0, 101, 1)
pmf = binom.pmf(x, 100, 0.5)
cdf = binom.cdf(x, 100, 0.5)

The PMF and CDF of the binomial distribution are shown in the top panels of Figure 5.4.

The PPF for our experiment can be seen in the bottom left-
hand side of Figure 5.4 and is calculated as follows:

probs = np.arange(0, 1, 0.01)
ppf = binom.ppf(probs, 100, 0.5)
ppf = binom.ppf(probs, 100, 0.5) left-hand side of Figure 5.4.

Finally, let us obtain 10000 random variates that follow the


binomial distribution as shown in the bottom right-hand
side of Figure 5.4:

> x1 = binom.rvs(n=100, p=0.5, size=10000)
> print(x1.mean(), x1.var(ddof=1))

49.9683 25.216416751675165

We can see what the mean and variance are for our
experiment and compare with the sample mean and sample
variance above:

Figure 5.4: The binomial distribution. Panels: Binomial PMF, Binomial CDF, Binomial PPF and Binomial Random Variates.
> print(binom.mean(100, 0.5))

50.0

> print(binom.var(100, 0.5))

25.0

We can calculate the mean and variance for a binomial process with Python's help.

5.3.4 Hypergeometric Distribution

Starbuck has been chasing Cylons that jumped well


into our quadrant and she is now stranded on our planet.
After recovering for a few days, to pass the time she talks
to us about an interesting card game popular among the
Twelve Colonies: Triad. In the game, hexagonal cards show a large star surrounded by a border with smaller stars. (Playing triad or poker with Starbuck may prove to be a futile way to try to earn money, but a good way to pass the time.) After trying to explain the rules to us, and realising that we do
not have a triad deck of cards, we teach her some of the card
games we play in this planet, like poker or Texas hold ’em.
Let us play: Starbuck starts with a deck of 52 cards, and she
is dealt 4 cards. If those 4 cards are returned to the deck, the
next time we deal the probabilities remain the same, but if
we hold those cards, the deck now has only 48 cards. What
happens then?

Let us go through a first deal with Starbuck by considering


the probability of getting an ace of any suit. There are 4
aces in a deck and for the first card we have 4/52, or 1/13,
chance of getting an ace. Starbuck is a lucky person and she
gets an ace! For the second card, since the ace obtained is no longer in the deck, the probability of getting an ace is now different. In this case, the outcome of the next trial depends on the outcome of the previous one, breaking one of the assumptions behind the binomial distribution. In the case where an ace is drawn in the first hand, the probability would be 3/51, but if Starbuck did not

get an ace in the first hand, the chances would be 4/51. The
game continues and, as you can see, the outcome of the next
trial depends on the outcome of the previous one, and one
of the key assumptions behind the binomial distribution is
broken.

For repeated trials where the probabilities remain constant,


or when replacement is enabled, the distribution is binomial.
For a single trial, we recover the Bernoulli distribution.
But what about a different type of trial, like the simple game with our Battlestar Galactica officer? The outcome may still be either success or failure, but without replacing the cards, the probabilities shift. This type of trial is called a hypergeometric experiment and its distribution is called a hypergeometric distribution. In a hypergeometric experiment, the probabilities from one trial to the next change depending on the outcome of previous trials.

It sounds like a grandiose name, and perhaps a little bit


scary, but there is a reason behind it. Consider first a
sequence where each succeeding term is produced by
multiplying the previous term by a fixed number, called the
common ratio. This type of sequence is called a geometric
progression and they have been studied for centuries.
Simply take a look at Euclid's Elements6. An example of a geometric progression with common ratio 1 − p is:

p, p(1 − p), p(1 − p)², . . . , p(1 − p)^n.    (5.47)

6 Heath, T. L. (2017). Euclid's Elements (The Thirteen Books). Digireads.com Publishing.

Generalisations of these types of progressions have been


used in a variety of applications from quantum mechanics
to engineering7 for the description of elliptic curves or solving differential equations. The idea behind the generalisation is that successive terms, say k and k + 1, get applied different proportions that vary relative to each other, all this determined by a rational function (a rational function is the quotient of two polynomials). Since these progressions go beyond the geometric one described above, they are referred to as hypergeometric.

7 Seaborn, J. B. (2013). Hypergeometric Functions and Their Applications. Texts in Applied Mathematics. Springer New York.

The probability mass function for the hypergeometric


distribution in a population of size N is given by the
number of samples that result in k successes divided by the
number of possible samples of size n:

f(k, N, K, n) = C(K, k) C(N − K, n − k) / C(N, n),    (5.48)

where K is the number of successes in the population. As mentioned above, the denominator is the number of samples of size n. The numerator is made out of two factors: the first one is the number of combinations of k successes observed out of the number of successes in the population (K). The second corresponds to the failures, where N − K is the total number of failures in the population. Remember that C(a, b) stands for the binomial coefficient; see Appendix D.

Let us go back to our card game with Starbuck, and we can


now consider the probability of getting 2 aces when 4 cards
are dealt. This means that k = 2, N = 52, K = 4 and n = 4:

f(2, 52, 4, 4) = C(4, 2) C(52 − 4, 4 − 2) / C(52, 4),
               = (6)(1128) / 270725,
               = 0.02499954.    (5.49)

The probabilities of the card game with Starbuck are easily obtained with the PMF above.

In other words, there is about 2.5% chance of Starbuck


getting 2 aces when dealt 4 cards. In Python, we can use
the SciPy implementation of the hypergeometric PMF as
follows:

> from scipy.stats import hypergeom
> hypergeom.pmf(2, 52, 4, 4)

0.024999538276849162

SciPy provides the hypergeom module.

For cases where the population N is large, the


hypergeometric distribution approximates the binomial (see Appendix E.1). Let

us go catch some Pokémon with Ash Ketchum and Pikachu.


We have a large container of Poké Balls (or to use their actual name, モンスターボール or Monster Ball); the container has two types of balls, 575 Poké Balls and 425 Beast Balls.
Pikachu will take 10 balls out of the container for us. What
is the probability that we will get 3 Beast Balls? Well, we can
use the hypergeometric PMF to get the result:

f(3, 1000, 425, 10) = C(425, 3) C(1000 − 425, 10 − 3) / C(1000, 10),
                    = (1.2704 × 10^7)(3.9748 × 10^15) / (2.6340 × 10^23),
                    = 0.19170.    (5.50)

Using Python we have that:

> from scipy.stats import hypergeom
> hypergeom.pmf(3, 1000, 425, 10)

0.19170588792109242

When the population N is large, the hypergeometric distribution approximates the binomial.

We can compare this to a binomial point estimate:


 
C(10, 3) (0.425)³ (0.575)⁷ = 0.19143.    (5.51)

The results are pretty close and the idea behind this
proximity is that as the population N grows to be larger and larger, although we do not replace the Poké Balls, the probability of the next event is nearly the same as before.
This is indeed because we have a large population. We can
now go and “catch ’em all”.
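We can see this convergence directly by comparing the two PMFs over a range of outcomes for the Poké Ball example; a minimal sketch:

import numpy as np
from scipy.stats import hypergeom, binom

N, K, n = 1000, 425, 10
k = np.arange(0, n + 1)
# Hypergeometric probabilities versus the binomial approximation with p = K/N
print(hypergeom.pmf(k, N, K, n))
print(binom.pmf(k, n, K/N))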

We are now interested in describing the mean and variance


for the hypergeometric distribution. Let us start with the
mean; let P[ X = k] = f (k, N, K, n) and we calculate the
expectation value for X^r:

E[X^r] = ∑_{k=0}^{n} k^r C(K, k) C(N − K, n − k) / C(N, n).    (5.52)

We will use this expression to obtain the mean of the hypergeometric distribution.

We can use some useful identities of the binomial


coefficient:

k C(K, k) = K C(K − 1, k − 1),    (5.53)

C(N, n) = (1/n) · n C(N, n) = (1/n) · N C(N − 1, n − 1).    (5.54)

(See Appendix D.1 for more information.)

Substituting Expressions (5.53) and (5.54) into Equation


(5.52) and writing k^r as k · k^(r−1):

E[X^r] = (nK/N) ∑_{k=1}^{n} k^(r−1) C(K − 1, k − 1) C(N − K, n − k) / C(N − 1, n − 1).    (5.55)

We are writing the sum from k = 1 as the case for 0 does not
contribute to the sum. If we define j = k − 1 and m = K − 1

we can rewrite the above expression as:

E[X^r] = (nK/N) ∑_{j=0}^{n−1} (j + 1)^(r−1) C(m, j) C((N − 1) − m, (n − 1) − j) / C(N − 1, n − 1).    (5.56)

The sum in Expression (5.56) can be seen as the expectation value of a random variable Y with parameters N − 1, K − 1 and n − 1, and we end up with the following recursion formula:

E[X^r] = (nK/N) E[(Y + 1)^(r−1)],    (5.57)

and therefore the mean is given by:

µ = nK/N.    (5.58)

This is the mean of the hypergeometric distribution.

For the variance, remember that Var[X] = E[X²] − E²[X]. We can use the recursion formula obtained above to get the first term in the variance and thus:

σ² = (nK/N) [ (n − 1)(K − 1)/(N − 1) + 1 ] − (nK/N)²,

   = (nK/N) [ (n − 1)(K − 1)/(N − 1) + 1 − nK/N ],

   = (nK/N) [ (N² − Nn − NK + nK) / (N(N − 1)) ],

   = (nK/N) (1 − K/N) ((N − n)/(N − 1)).    (5.59)

This is the variance of the hypergeometric distribution.
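For the bridge hand discussed below (N = 52, K = 13, n = 13) these closed forms are easy to evaluate and to compare against SciPy; a minimal sketch:

from scipy.stats import hypergeom

N, K, n = 52, 13, 13
mu = n*K/N
var = (n*K/N) * (1 - K/N) * (N - n)/(N - 1)
print(mu, var)                                   # 3.25 and about 1.864
print(hypergeom.mean(N, K, n), hypergeom.var(N, K, n))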

Let us play a different card game with Starbuck. This time it


is Bridge, where a hand is such that 13 cards are selected at
random and without replacement. We can use the

hypergeom implementation in SciPy to understand this game

better. Let us look at the distribution of hearts in a bridge


hand. Let us see what Starbuck may experience. First, let us
see the PMF and CDF for our experiment:

> x = np.arange(0, 14, 1)
> pmf = hypergeom.pmf(x, 52, 13, 13)
> cdf = hypergeom.cdf(x, 52, 13, 13)

hypergeom provides pmf and cdf methods.

The PMF can be seen in the upper left-hand panel of Figure


5.5, and the CDF in the upper right-hand one. Notice that there are 13 cards for each of the suits in the deck. If we
were to ask for the distribution of red cards, the distribution
will start looking more and more like a binomial.

The PPF for our game of bridge can be seen in Figure 5.5 in
the lower left-hand panel and here is the code for that:

> probs = np.arange(0, 1, 0.1)
> ppf = hypergeom.ppf(probs, 52, 26, 13)

Finally, let us create a set of 1000 observations following this


hypergeometric distribution:

> x1 = hypergeom.rvs(52, 13, 13, size=1000)
> print(x1.mean(), x1.var(ddof=1))

3.241 1.9808998998999

A histogram of these values can be seen in the lower right-


hand panel of Figure 5.5. We can compare these values with
the mean and variance for the hypergeometric distribution:

Figure 5.5: The hypergeometric distribution. Panels: Hypergeometric PMF, Hypergeometric CDF, Hypergeometric PPF and Hypergeometric Random Variates.

> print(hypergeom.mean(52, 13, 13))

3.25

> print(hypergeom.var(52, 13, 13))

1.8639705882352942

5.3.5 Poisson Distribution

After having played cards with Starbuck in the


previous section, we start talking about the number of
Cylon Raptors that she had experienced in the day prior to
landing on our planet. For some time, nothing seemed to
appear, and then two of them may show up in succession. We have all been there.

A bit like waiting for a bus at your local bus stop.

Events like these may be described, within their context,


as rare in the sense that they do not happen that often.
Some examples include: The release of an electron by the change of a neutron to a proton in radioactive beta decay8, the number of hits to your website over a period of time, the number of births in a hospital during a given day, the number of road accidents occurring at a particular interval, the number of single photons detected9 in an experiment, or as mentioned, the number of buses arriving to your local bus stop, and even the number of Cylon Raptors scouting for Battlestar Galactica.

8 Melissinos, A. C. and J. Napolitano (2003). Experiments in Modern Physics. Gulf Professional Publishing.
9 McCann, L. I. (2015). Introducing students to single photon detection with a reverse-biased LED in avalanche mode. In E. B. M. Eblen-Zayas and J. Kozminski (Eds.), BFY Proceedings. American Association of Physics Teachers.

These phenomena can be described with the Poisson


distribution. It is named after the French mathematician
Siméon Denis Poisson, and it was derived in 1837. It was
famously used by Bortkiewicz to describe the number of
deaths by horse kicking in the Prussian army10.

10 Good, I. J. (1986). Some statistical applications of Poisson's work. Statistical Science 1, 157–170.

A Poisson experiment is such that the following conditions are met:

• The number of successes can be counted

• The mean number of successes that occurs in a specific


interval is known

• Outcomes are independent of each other

• The probability that a success occurs is proportional to


the size of the interval

Consider the time interval t in which Starbuck is looking for


Cylon Raptors. We can break that time in smaller intervals,
δt, small enough so that no two events occur in the same
interval. We find an event with probability P(1, δt) = lδt and its complement, i.e. no event with probability P(0, δt) = 1 − lδt (our "success" event here is a Cylon Raptor sighting). We are interested in determining the probability of
getting n Cylon Raptor sightings in the interval t.

Let us assume that we know P(0, t). We can reasonably


ask about the probability of observing a Cylon Raptor
in the interval t + δt. Since the events are independent (see the probability properties described in Section 5.1), the probability is given by the product of the individual
probabilities:

P(0, t + δt) = P(0, t)(1 − lδt). (5.60)

We can rewrite the expression above as a rate of change as


follows:

[P(0, t + δt) − P(0, t)] / δt = dP(0, t)/dt = −lP(0, t).    (5.61)

(It becomes a differential in the limit δt → 0.)

The solution of this differential equation is P(0, t) = Ce^(−lt). At t = 0 we have no sightings and thus the probability P(0, 0) = 1, which means that C = 1:

P(0, t) = e^(−lt).    (5.62)



What is the probability for the case where there are Cylon
Raptor sightings, i.e. n ≠ 0? We can break this into two parts
as follows:

1. The probability of having all n sightings in the interval t and none in δt, and

2. Having n − 1 sightings in t and one in δt.

Let us take a look:

P(n, t + δt) = P(n, t)(1 − lδt) + P(n − 1, t)lδt.    (5.63)

This is the probability of n events at t + δt.

In the limit δt → 0, the expression above is given by the


following differential equation for P(n, t):

dP(n, t)/dt + lP(n, t) = lP(n − 1, t).    (5.64)

From Equation (5.64) it is possible to show that the


probability of finding n events in the interval t is given by the following expression (see Appendix F.1):

P(n, µ) = (µ^n / n!) e^(−µ),    (5.65)

where µ = lt. The expression above is the probability mass


function for the Poisson distribution. We can now turn our
attention to obtaining the mean and variance for the Poisson
distribution.

The mean or expected value of the Poisson distribution can


be calculated directly from the probability mass function

obtained in Equation (5.65):


E[X] = ∑_{n=0}^{∞} n P(n, µ) = ∑_{n=0}^{∞} n (µ^n / n!) e^(−µ),

     = e^(−µ) ∑_{n=1}^{∞} n µ^n / n! = µe^(−µ) ∑_{n=1}^{∞} µ^(n−1) / (n − 1)!,

     = µe^(−µ) ∑_{m=0}^{∞} µ^m / m!.    (5.66)

The n = 0 term is zero, hence the change in the sum; we relabelled the index so that m = n − 1.

We can recognise the Taylor expansion for e^µ in expression (5.66) and thus:

E[X] = µe^(−µ) e^µ = µ.    (5.67)

This is the mean of the Poisson distribution.

The parameter µ in the Poisson distribution is therefore


the mean number of successes that occur during a specific
interval.

For the variance, we know that Var [ X ] = E[ X 2 ] − E2 [ X ]. The


second term can be easily obtained from the mean value
above. For the first term we can use the following identity:
E[ X 2 ] = E[ X ( X − 1)] + E[ X ]. For this expression, the second
term is known, and we can use a similar approach to the
one used for calculating the mean to obtain the value of
E[ X ( X − 1)]:
E[X(X − 1)] = ∑_{n=0}^{∞} n(n − 1)(µ^n / n!) e^(−µ) = µ²e^(−µ) ∑_{n=2}^{∞} µ^(n−2) / (n − 2)!,

            = µ²e^(−µ) ∑_{m=0}^{∞} µ^m / m! = µ²e^(−µ) e^µ = µ².    (5.68)

The cases for n = 0 and n = 1 do not contribute to the sum; we used m = n − 2 to relabel the sum.

Hence the variance is given by:

Var[X] = µ² + µ − µ² = µ.    (5.69)

This is the variance of the Poisson distribution.

As we can see, the variance of the Poisson distribution is


equal to the mean value µ.
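SciPy lets us confirm that both moments coincide for a given rate; for instance, with µ = 2:

from scipy.stats import poisson

mean, var = poisson.stats(2, moments='mv')
print(mean, var)   # both are 2.0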

Let us say that Starbuck has experienced an average of


µ = 2 Cylon Raptors per hour. We can use Equation (5.65)
to determine the probability of getting 0, 1, 2 or 3 sightings:

P(0, 2) = (2^0) e^(−2) / 0! = 0.1353,
P(1, 2) = (2^1) e^(−2) / 1! = 0.2707,
P(2, 2) = (2^2) e^(−2) / 2! = 0.2707,
P(3, 2) = (2^3) e^(−2) / 3! = 0.1805.

For µ = 2, we can get the probability of different numbers of Cylon Raptor sightings.

Let us see what Python returns for the example above. We


can use the poisson implementation in SciPy and calculate
the PMF as follows:

> from scipy.stats import poisson
> print('The probability of seeing Cylon Raptors:')
> for r in range(0, 4):
      print('{0}: {1}'.format(r, poisson.pmf(r, 2)))

The probability of seeing Cylon Raptors:
0: 0.1353352832366127
1: 0.2706705664732254
2: 0.2706705664732254
3: 0.18044704431548356

We can use the poisson module in SciPy.

The Poisson distribution can be obtained as a limit of the


binomial distribution we discussed in Section 5.3.3. This
can be done by letting n be a large number of trials, in
other words n → ∞. We also fix the mean rate µ = np. Equivalently, we can make the probability p be very small such that p → 0. In doing so, we are effectively saying that
there is no longer a fixed number of events in the interval in
question. Instead, each new event does not depend on the
number of events already occurred, recovering the statistical
independence for the events. The PMF for the binomial
distribution given by Equation (5.26) can be written as
follows:

P(X) = lim_{n→∞} [n! / (k!(n − k)!)] (µ/n)^k (1 − µ/n)^(n−k),    (5.70)

     = (µ^k / k!) e^(−µ),    (5.71)

which is the Poisson PMF we obtained before. (See Appendix F.2.)
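This limiting behaviour is easy to see numerically: for a fixed mean µ = np, the binomial probabilities approach the Poisson ones as n grows. A minimal sketch:

import numpy as np
from scipy.stats import binom, poisson

mu = 2
k = np.arange(0, 6)
for n in (10, 100, 10000):
    # Largest discrepancy between the binomial with p = mu/n and the Poisson PMF
    print(n, np.max(np.abs(binom.pmf(k, n, mu/n) - poisson.pmf(k, mu))))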

Let us go back to the SciPy implementation of the Poisson


distribution. We know that the PMF can be calculated with
poisson.pmf. If the average number of Cylon Raptors were

4, the PMF will be given by the top left-hand panel of Figure


5.6 and can be calculated as follows:

mu = 4
x = np.arange(0, 14, 1)
pmf = poisson.pmf(x, mu)

The PMF of the Poisson distribution is shown in the top left-hand panel of Figure 5.6.

The PMF shows the probability of sighting between 0 and 13 Cylon Raptors in the same time interval.

Figure 5.6: The Poisson distribution. Panels: Poisson PMF, Poisson CDF, Poisson PPF and Poisson Random Variates.
The cumulative distribution function can be seen in the top
right-hand panel of Figure 5.6 and is calculated as follows:

cdf = poisson.cdf(x, mu)
right-hand panel of Figure 5.6.

We can calculate the occurrences for a given probability


with the PPF:

probs = np.arange(0, 1, 0.1)
ppf = poisson.ppf(probs, mu)

and the result is shown in the lower left-hand panel of


Figure 5.6.

Finally, we can generate random variates that follow the


Poisson distribution as follows:

> x1 = poisson.rvs(mu, size=1000)
> print(x1.mean(), x1.var(ddof=1))

4.004 4.054038038038038

Here, we started with µ = 4, which in the Poisson


distribution corresponds to the mean and variance. As we
can see, the values obtained from the 1000 random variates
we generated are close to that parameter. The histogram of
these values can be seen in the lower right-hand panel of
Figure 5.6.

5.4 Continuous Probability Distributions

The probability distributions we have discussed so


far have all been discrete. We now turn our attention to continuous distributions (see Section 5.2.2 for the distinction between discrete and continuous distributions). Remember that if the random
variable of interest can take any value between two specified
values, then the variable is called continuous, and
imaginatively enough, its probability distribution is called a
continuous probability distribution.

5.4.1 Normal or Gaussian Distribution

No self-respecting book on statistics would avoid


talking about the normal distribution. It can be argued that
it would be better to simply call it Gaussian as it seems that
normal is not the norm. Gaussian alludes to the description What is normal anyway?

of the famous bell-shaped curve by Carl Friedrich Gauss.


As a matter of fact, its original name was the law of errors, given that Gauss11 used it to model errors in astronomical observations in 1823, as we mentioned in Section 1.3.

11 Gauss, C. F. (1823). Theoria combinationis observationum erroribus minimis obnoxiae. Number V. 2 in Commentationes Societatis Regiae Scientiarum Gottingensis recentiores: Classis Mathemat. H. Dieterich.
normal distribution. As we will see, the distribution runs
from minus infinity to infinity, and never ever touches
or crosses the x-axis. If we were to use this to describe,
for example, the height of “normal” people, we would
not be surprised to find that Gulliver's discovery of tiny Lilliputians, or Jack's encounter with a bone-bread-making giant are real (values range in the (−∞, ∞) interval, enabling tiny Lilliputians and enormous giants). The term “normal” has become a moniker
to mean typical or common. The use became more popular
thanks to Karl Pearson who later clarified12 the use of “normal” in the sense of being typical, as opposed to being normative:

12 Pearson, K. (1920). Notes on the history of correlation. Biometrika 13(1), 25–45.
normative:

“Many years ago I called the Laplace–Gaussian curve the


normal curve, which name, while it avoids an international
question of priority, has the disadvantage of leading people
to believe that all other distributions of frequency are in
one sense or another ‘abnormal’. That belief is, of course,
not justifiable. It has led many writers to try and force all
frequency by aid of one or another process of distortion into
a ’normal’ curve.”

We mentioned that Gauss used the normal distribution to


model errors in astronomy. We can motivate our discussion
with a StarTrek example (space is still the final frontier): Consider the use of kellicams or qellqams by the Klingon empire to measure distance.

Lieutenant Commander Worf informs us that a kellicam


is roughly equal to two kilometers. But, how close is that
correspondence? As good lower decks, we take the mission
and make some measurements (we are obtaining empirical measures for a quantity of interest, go stats!). The first measurement is 1.947 metres. For the second we get 2.099 metres. So what is
go stats!
it? The first or the second? Well, let us measure it again and
then again and again. As it happens, each measure gives us
a different answer and we end up with the following list of
measures:

2.098  2.096  2.100
2.101  2.099  2.098
2.099  2.098  2.099
2.099  2.100  2.100
2.100  2.099  2.099

Different measures of a standard imperial Klingon kellicam in metres.

A histogram of this data, shown in Figure 5.7, indicates


that the “true” value for a kellicam in metres is 2.099 to the
nearest millimetre. We can think of the measurements above
in terms of errors: If there is a true value for the metres in a kellicam, as we measure the distance, we may incur some
errors, but we are likely to obtain a measure close, or on
target, for that value. As we get farther and farther from
the true value, our chances of obtaining it are reduced. In
other words, the rate at which the frequencies decrease is
proportional to the distance from the true value.

Figure 5.7: Measures of a standard imperial Klingon kellicam in metres (histogram of frequency vs. kellicam measure).

This is, however, not the end of the story. If that was the
only condition on the frequencies, we may end up with a
distribution that follows a parabolic curve. In that case, as
we move farther from the true value, the frequencies would become negative, and as Mr. Spock would remind us, that is highly illogical. We can get the frequencies to level off
is highly illogical. We can get the frequencies to level off
as they get closer and closer to zero by requiring that the
rate at which they fall off is proportional to the frequencies
themselves. We say that data is normally distributed if

the rate at which the frequencies observed fall off 1) is


proportional to the distance of the score from the central value, and 2) is proportional to the frequencies themselves. These are a couple of conditions for frequencies to follow a normal distribution. This can be expressed by the following differential equation:

df(x)/dx = −k(x − µ) f(x),    (5.72)
where k is a positive constant. We can solve this equation as
follows:

∫ df(x)/f(x) = −k ∫ (x − µ) dx,

f(x) = C exp(−(k/2)(x − µ)²).    (5.73)

We can find the value of C by recalling that this is a probability distribution and therefore the area under the curve must be equal to 1 (we will deal with the value of k later on):

∫_{−∞}^{∞} C exp(−(k/2)(x − µ)²) dx = 1.    (5.74)


It can be shown that the constant C = √(k/2π) (see Appendix G.1 for more information). Our probability distribution is thus far given by

f(x) = √(k/2π) exp(−(k/2)(x − µ)²).    (5.75)

We can use Equation (5.75) to calculate the expected value.


In order to simplify our calculation, let us consider transforming our variable such that ν = x − µ (this means that dx = dν), and therefore

E[ x ] = E[ν] + µ. The expected value of ν is:


E[ν] = √(k/2π) ∫_{−∞}^{∞} ν e^(−kν²/2) dν.    (5.76)

Let w = −(k/2)ν², and therefore dw = −kν dν. Substituting into (5.76):

E[ν] = √(1/(2πk)) ∫ e^w dw = 0,

E[x] = E[ν] + µ = 0 + µ = µ.    (5.77)

As expected (pun definitely intended!), the expected value for the normal distribution is µ.

For the variance we have:


σ² = ∫_{−∞}^{∞} (x − µ)² f(x) dx,

   = √(k/2π) ∫_{−∞}^{∞} (x − µ)² exp(−(k/2)(x − µ)²) dx.

Let us make the following change of variable: w = x − µ and dw = dx. The variance of the normal distribution can then be expressed as:

σ² = √(k/2π) ∫_{−∞}^{∞} w² e^(−kw²/2) dw.    (5.78)

We can integrate (5.78) by parts with:

u = w,       v = −(1/k) e^(−kw²/2),
du = dw,     dv = w e^(−kw²/2) dw.

Substituting into our integral, we have:

σ² = √(k/2π) [ −w e^(−kw²/2) / k ]_{−∞}^{∞} + (1/k) √(k/2π) ∫_{−∞}^{∞} e^(−kw²/2) dw.    (5.79)

The first term is zero, and the second one contains the PDF
of the normal distribution multiplied by 1/k (compare with Equation (5.75)), and therefore

we have that:

σ² = 1/k,

k = 1/σ².    (5.80)

This is because the area under the curve is equal to 1.

This gives meaning to the constant k in terms of the variance


of the normal distribution. As such, we are able to write our
PDF in terms of the mean µ and the standard deviation σ
as:

f(x) = (1/(σ√(2π))) exp(−(1/2)((x − µ)/σ)²).    (5.81)

This is the PDF of the normal distribution.
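We can confirm that this expression matches SciPy's implementation by evaluating both at a few points; a minimal sketch:

import numpy as np
from scipy.stats import norm

mu, sigma = 2.099, 0.2
x = np.array([1.9, 2.0, 2.099, 2.3])
manual = (1/(sigma*np.sqrt(2*np.pi))) * np.exp(-0.5*((x - mu)/sigma)**2)
print(np.allclose(manual, norm.pdf(x, mu, sigma)))   # True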

We can determine the maximum of the probability


distribution function of the normal distribution and show
that the peak corresponds to the mean value µ.
Furthermore, the points of inflection for the PDF are given by µ ± σ. In other words, the points where the PDF changes concavity lie one standard deviation above and below the mean. Given the importance of the parameters µ and σ to define the normal distribution, we sometimes use the notation N(µ, σ) to refer to a normal distribution with mean µ and standard deviation σ.

Let us go back to our measurements of the Klingon kellicam.


Our Starfleet colleagues have amassed an impressive collection of measurements that follow a normal distribution with mean µ = 2.099 and standard deviation
σ = 0.2. We can use this information to calculate the
probability distribution function with Python as follows:

from scipy.stats import norm

mu = 2.099
sigma = 0.2
x = np.arange(1, 3, 0.01)
pdf = norm.pdf(x, mu, sigma)

We can use the norm module in SciPy.

A graph of the probability distribution function can be seen


in the top-left hand panel of Figure 5.8.

As the normal distribution is continuous, obtaining the


cumulative distribution function is as easy as calculating the
following integral:

F(a) = ∫_{−∞}^{a} (1/(σ√(2π))) exp(−(1/2)((x − µ)/σ)²) dx.    (5.82)

The probability is given by the area under the normal curve.

Unfortunately, there is no simple closed formula for this


integral and the computation is best done numerically.
Python is well suited to help with this (we prefer to use a computer for this!). Another alternative
is to use probability tables. In Python we can obtain the
CDF as follows:

Figure 5.8: The normal or Gaussian distribution. Panels: Normal PDF, Normal CDF, Normal PPF and Normal Random Variates.
cdf = norm.cdf(x, mu, sigma)

The graph of the CDF for the normal distribution can be seen in the top right-hand panel of Figure 5.8. We can use
the CDF to find the probability that a measurement is below
2 metres for example:

> norm.cdf(2, mu, sigma)

0.3103000602544874

This is the probability of having a measurement below 2 metres for N(2.099, 0.2).
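Since Equation (5.82) has no simple closed form, it is reassuring to check this value by integrating the PDF numerically, for instance with scipy.integrate.quad:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma = 2.099, 0.2
area, err = quad(lambda x: norm.pdf(x, mu, sigma), -np.inf, 2)
print(area)   # approximately 0.3103, matching norm.cdf(2, mu, sigma)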

Similarly, we can ask for the probability that a measurement


is between one standard deviation above and below the
mean:

> lower = norm.cdf(mu-sigma, mu, sigma)
> upper = norm.cdf(mu+sigma, mu, sigma)
> p = upper - lower
> print(p)

0.6826894921370861

This is the probability of having a measurement within one standard deviation of the mean.

In other words, the probability of a kellicam measurement


being between 1.899 and 2.299 metres is:

P(1.899 ≤ X ≤ 2.299) = 0.6826,    (5.83)

or equivalently, 68.2% of the observations are within 1


standard deviation of the mean. We will revisit this fact in the next section when we talk about the empirical rule and the standard normal distribution Z.

Now, given a normal distribution with parameters µ and σ,


we can use Python to calculate the percent point function
(PPF):

probs = np.arange(0, 1, 0.01)
ppf = norm.ppf(probs, mu, sigma)

and the result is shown in the bottom left-hand panel of


Figure 5.8.

Another useful thing that we can do with Python is to create


random variates that follow a normal distribution with
parameters µ and σ.

x1 = norm.rvs(mu, sigma, size=1000)
print(x1.mean(), x1.var(ddof=1))

A histogram of these random variates is shown in the


bottom right-hand panel of Figure 5.8. Let us compare the
mean and standard deviations obtained with our randomly
generated values and the parameters of our Gaussian
function:

> print(x1.mean(), x1.var(ddof=1))

2.096088672876906 0.035225815987363816

> print(norm.mean(mu, sigma))
> print(norm.var(mu, sigma))

2.099
0.04000000000000001

Remember that the variance is the square of the standard deviation.

The Binomial Distribution Approximated by a Normal


Distribution

In Section 5.2.3 we looked at the repeated flipping of a


coin in a growing number of experiments. As suggested
by the last panel of Figure 5.1, the binomial distribution

that describes our experiment can be approximated by a


Gaussian curve with the same mean and standard deviation,

i.e. µ = np and σ = √(npq). This works well under the
following conditions:

1. the number of trials n is large,

2. the probability of success is sufficiently small, and

3. the mean µ = np is a finite number.

These are the conditions for the binomial distribution to approximate a normal curve.

What does “sufficiently small” probability mean though?


Well, to start with, if the underlying binomial distribution is
skewed, the approximation is not going to be that good, so
the closer the probability is to 0.5 the better. Let us consider
the bulk of the Gaussian distribution to be two standard
deviations from its mean and use this information to
determine some bounds for the number of trials. We can
express this in terms of the following inequalities:
µ − 2σ > 0,        µ + 2σ < n,    (5.84)

substituting the mean and standard deviation for the binomial we have that:

np > 2√(npq),      np + 2√(npq) < n,
np > 4q,           2√(npq) < n(1 − p),
np > 4 − 4p,       4 − 4q < nq,
np ≥ 5,            5 ≤ nq.    (5.85)

We consider the bulk of the normal distribution to be ±2σ away from the mean.

Since p and q are probabilities, their values must stay


between 0 and 1, and therefore the inequalities are satisfied
if np and nq are greater than 5.

Remember that the normal distribution is a continuous


probability distribution, however the binomial is discrete.
If we want to use the approximation we need to apply a
continuity correction. The simplest thing to do is to add or
subtract 0.5 to or from the discrete value.
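For example, with n = 100 and p = 0.5, the probability of obtaining at most 45 successes can be approximated by evaluating the normal CDF at 45.5; a minimal sketch:

import numpy as np
from scipy.stats import binom, norm

n, p = 100, 0.5
mu, sigma = n*p, np.sqrt(n*p*(1 - p))
print(binom.cdf(45, n, p))           # exact binomial probability
print(norm.cdf(45.5, mu, sigma))     # normal approximation with continuity correction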

5.4.2 Standard Normal Distribution Z

In the previous section, we motivated the discussion


of the normal distribution from measurements obtained
by our Starfleet colleagues to determine the length of a
Klingon kellicam in metres. This gave us information to
calculate the mean and standard deviation of the measures obtained. The general characteristics of a normal distribution are the same. Depending on the mean and standard deviation,
the shape of the probability distribution is different but
the characteristics are the same. For instance, the inflection
points will be given by µ ± σ, no matter what the individual
values of µ and σ are. This means that we can use these
characteristics to help us understand our data.

Consider for example a case where we need to compare


different distributions. If we are able to bring them under
the same footing, we have a better chance of comparing
them. This is called standardisation, and the standard
normal distribution is what we end up with. The standard normal is given by N(0, 1). Think of the standard normal distribution as a special case of the normal
distribution where the mean is 0 and the standard deviation
is 1. We can standardise our data by taking each raw
measurement, subtracting the mean from it and dividing the

result by the standard deviation. In other words:

z = (X − µ)/σ.    (5.86)

This is the standard or z-score.

Figure 5.9: The empirical rule gives us approximations about the percentage of data observations within a number of standard deviations from the mean (standard normal curve with bands at ±1, ±2 and ±3 standard deviations containing about 68.2%, 95.4% and 99.7% of observations).

This is usually referred to as the z-score and the values represent the number of standard deviations a specific

observation is away from the mean. For example, if a


z-score is −1.2 it means that the observation is 1.2 standard deviations below the
mean. This turns out to be a very effective way to
understand each observation relative to the full distribution.
It also helps with the comparison of distributions that have
different parameters.
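Computing z-scores in Python is straightforward, either directly from Equation (5.86) or with scipy.stats.zscore; a minimal sketch using the kellicam parameters:

import numpy as np
from scipy.stats import zscore

x = np.array([1.899, 2.0, 2.099, 2.299])
mu, sigma = 2.099, 0.2
print((x - mu)/sigma)       # z-scores relative to the given parameters
print(zscore(x, ddof=1))    # z-scores relative to the sample's own mean and standard deviation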

Furthermore, if our data is well described by a normal


curve, we can use the standard normal to tell us something
about the observations we have. In Equation (5.83) we calculated the probability that our Klingon kellicam measurement fell within a standard deviation of the mean, obtaining a probability of around 68.2%. This is the case for any normal curve and is part of what we call the Empirical Rule. On a normal distribution:

• about 68.2% of data will be within one standard deviation of the mean,

• about 95.4% will be within two standard deviations of the mean, and

• about 99.7% will be within three standard deviations of the mean.

This is shown schematically in Figure 5.9. This information


lets us make predictions about the probabilities of the
phenomena we are studying, provided that the data is
normally distributed. It is also a great way to identify outliers (we can use the Empirical Rule to look for potential data outliers). Since 99.7% of the observations are supposed to be
within three standard deviations of the mean, any value that
is outside can be considered as a potential outlier. We can
even use the Empirical Rule as a simplistic test for normality
of our data: If more than 0.3% of the data is outside the
three standard deviations from the mean, the data may not
follow a normal distribution.

In Python, using SciPy, the standard normal distribution


is obtained with the same methods used in the previous
section. The default values for the mean and standard

deviation are µ = 0 and σ = 1. We can get the probability


distribution for the standard normal as follows:

from scipy.stats import norm

x = np.arange(-5, 5, 0.01)
zdist = norm.pdf(x)

We do not need to specify the mean and standard deviation.

Notice that we do not need to specify the values for the


mean or the standard deviation. Alternatively, we could have written this as norm.pdf(x, 0, 1). Similarly, we can
use norm.cdf to obtain the cumulative distribution. Let us
see the values obtained for the empirical rule:

> sigma1 = 100*(norm.cdf(1)-norm.cdf(-1))
> sigma2 = 100*(norm.cdf(2)-norm.cdf(-2))
> sigma3 = 100*(norm.cdf(3)-norm.cdf(-3))
> print('{0:.2f}%, {1:.2f}%, {2:.2f}%'.
      format(sigma1, sigma2, sigma3))

68.27%, 95.45%, 99.73%

These are the Empirical Rule values obtained using Python.

5.4.3 Shape and Moments of a Distribution

An important feature in the description of the


empirical rule discussed above is the symmetry of the
normal distribution around the mean. The shape of the
distribution is a helpful measure of the normality of a
distribution, and a way to characterise the shape is through
the moments of the distribution.

The shape of a normal distribution is distinctive enough to
characterise how close data is to being normally distributed.

The k-th moment is

defined as the expectation of the k-th power of the random


variable X, i.e. mk = E[X^k]. As a matter of fact, we have
already dealt with some of the moments of a distribution.
The first moment is the expectation value or mean, whereas
the second central moment is the variance.

The first moment of a distribution is the mean. The second is the
variance.

The k-th moment of a distribution can be expressed as:


        mk = µk / σ^k = E[(X − µ)^k] / (E[(X − µ)^2])^(k/2).          (5.87)

The k-th moment of a distribution.

Thinking of our standard normal distribution, we can see


that the first moment is indeed µ = 0:

        m1 = µ1 / σ = E[(X − µ)] / (E[(X − µ)^2])^(1/2)
           = (µ − µ) / (E[(X − µ)^2])^(1/2) = 0.          (5.88)

The first moment of the normal distribution is m1 = µ = 0.

We said that the second moment of a distribution


corresponds to the variance, and for a standard normal this
has the value 1:
        m2 = µ2 / σ^2 = E[(X − µ)^2] / (E[(X − µ)^2])^(2/2) = 1.          (5.89)

The second moment is m2 = σ^2 = 1.

The mean tells us where our distribution is centred and the


variance tells us about the dispersion in our distribution.
It is clear that obtaining the moments of a distribution is a
useful task and there is an easier way to generate them. We
can do this with moment generating functions.

We can obtain the moments with the help of generating functions.

If X is a random variable, its moment generating function is


given by:

        φ(t) = E[e^(tX)]
             = ∑_x e^(tx) P(X = x)                     (discrete case)
             = ∫_{−∞}^{∞} e^(tx) f(x) dx               (continuous case).          (5.90)

The moment generating function.

The name of this mathematical object should be clear


enough to explain what it does. Nonetheless, let us take a
further look and write its Taylor expansion:

        E(e^(tX)) = E[1 + tX + (1/2)t^2 X^2 + (1/3!)t^3 X^3 + · · ·]
                  = 1 + tE[X] + (1/2)t^2 E[X^2] + (1/3!)t^3 E[X^3] + · · ·

The Taylor expansion of the moment generating function.

As we can see we can generate the k-th moment mk = E[X^k]

out of this series and we note that:

        (d/dt) E[e^(tX)] |_(t=0) = E[X],

        (d^2/dt^2) E[e^(tX)] |_(t=0) = E[X^2].

We can obtain the moments from the Taylor expansion above.

We can compute the moment generating function for a


standard normal random variable as follows:
        φ(t) = (1/√(2π)) ∫_{−∞}^{∞} e^(tx) e^(−x^2/2) dx
             = (1/√(2π)) e^(t^2/2) ∫_{−∞}^{∞} e^(−(x−t)^2/2) dx = e^(t^2/2),          (5.91)

The moment generating function of the normal distribution.

where we have used the following expression:

        tx − (1/2)x^2 = −(1/2)(−2tx + x^2) = −(1/2)((x − t)^2 − t^2).

We can obtain the moments of the normal distribution


carrying out an integration by parts. For the standard
normal we have that:
        ∫_{−∞}^{∞} x^n e^(−x^2/2) dx
            = [x^(n+1)/(n+1) · e^(−x^2/2)]_{−∞}^{∞}
              − ∫_{−∞}^{∞} (x^(n+1)/(n+1)) (−x e^(−x^2/2)) dx,
            = (1/(n+1)) ∫_{−∞}^{∞} x^(n+2) e^(−x^2/2) dx.

We can calculate the moments of the standard normal with this
expression.

In terms of moments, we therefore have the following


recurrence relation:

        m_(n+2) = (n + 1) m_n.          (5.92)
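As a quick numerical check, and assuming we only need the standard
normal, we can ask SciPy for the first few moments directly with
the moment method; since µ = 0 these coincide with the central
moments used here. A minimal sketch:

from scipy.stats import norm

# moments of the standard normal; they follow the recurrence in (5.92)
for n in range(1, 7):
    print(n, norm.moment(n))

The values come out as 0, 1, 0, 3, 0 and 15, exactly what the
recurrence predicts starting from m1 = 0 and m2 = 1.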

Now that we are able to obtain moments of a distribution in


a more expeditious way, let us return to our description of
the shape of our distributions: Skewness and kurtosis. Let
us look first at skewness. Consider the distributions shown
in Figure 5.10; as we can see, the symmetry shown by the
standard normal distribution is broken and the tails of each
distribution taper differently.

Skewness and kurtosis are measures that help us characterise
the shape of our distribution.

In a positive skewed distribution, the right tail is longer and


thus the mass of the distribution is concentrated on the left
of the curve. For a negative skewed distribution, the left tail
is longer and the mass is concentrated on the right of the
curve.

The skewness describes the shift of the centre of mass of the
distribution.
Figure 5.10: Positive and negative skewed distributions.
The third moment provides us with a way to measure the
skewness. We denote it as g1 and it is usually referred to as
the Fisher-Pearson coefficient of skewness:

        g1 = m3 = µ3 / σ^3.          (5.93)

The third moment or skewness.

To correct for statistical bias, the adjusted Fisher-Pearson


standardised moment coefficient is given by:
        G1 = g1 · √(n(n − 1)) / (n − 2).          (5.94)

For data that is normally distributed, the skewness must be
close to zero. This is because the skewness for the standard
normal is 0:

        E[X^3] = m3 = (1 + 1)m1 = 0.          (5.95)

Normally distributed data has skewness close or equal to 0.

If the Fisher-Pearson coefficient of skewness is greater than


zero, we have a positive skewed dataset; if the value is
smaller than zero we have a negative skewed dataset.

Positive skewed data has skewness greater than 0; negative skewed
data has skewness lower than 0.

We can calculate the Fisher-Pearson coefficient of skewness
in Python with the help of skew in the statistics module
of SciPy. Note that we can obtain the unbiased value with
the bias=False parameter. Let us see the skewness of our
kellicam measurements:

from scipy.stats import skew

kellicam = np.array([2.098, 2.101, 2.099, 2.099,
    2.1, 2.096, 2.099, 2.098, 2.1, 2.099, 2.1,
    2.098, 2.099, 2.1, 2.099])

skew(kellicam)

skew(kellicam, bias=False)

-0.7794228634063756

-0.8688392583242721

In this case our sample data is negatively skewed.
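As a side note, pandas also offers a skew method for Series which,
as far as we can tell, applies the same bias-corrected formula, so
it should agree with the second value above. A minimal sketch:

import pandas as pd

# pandas uses the adjusted (bias-corrected) Fisher-Pearson coefficient
print(pd.Series(kellicam).skew())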

Another measure about the shape of the normal distribution


is the kurtosis. It tells us how curved the distribution is and
whether it is heavy-tailed or light-tailed compared to the
normal. The normal distribution is said to be mesokurtic.
If a distribution has lighter tails than a normal curve it will
tend to give us fewer and less extreme outliers than the
normal. These distributions are called platykurtic. Finally,
if a distribution has heavier tails than a normal curve, we
have a leptokurtic distribution and we tend to have more
outliers than with the normal. The three different kurtoses
mentioned above are depicted in Figure 5.11.

Kurtosis comes from the Greek κυρτός, meaning curved or arched.
Leptokurtic comes from the Greek λεπτός, meaning thin or slender,
and platykurtic from the Greek πλατύς, meaning flat.
Figure 5.11: Kurtosis of different distributions (mesokurtic,
platykurtic and leptokurtic).

The kurtosis is measured with the fourth standard moment


and defined therefore as:

        g2 = m4 = µ4 / σ^4.          (5.96)

The fourth moment or kurtosis.

Let us see the value of the kurtosis for the standard normal:

        E[X^4] = m4 = (2 + 1)m2 = 3.          (5.97)

The kurtosis of a normal distribution is 3 and therefore


platykurtic distributions have values lower than 3, whereas
leptokurtic ones have values greater than 3. Sometimes it
is useful to subtract 3 from the value obtained and thus a
normal distribution has a kurtosis value of 0. This makes it
easier to compare. This is known as the Fisher definition of
kurtosis.

A normal distribution has kurtosis equal to 3.
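We can confirm these reference values for the standard normal
directly in SciPy; the stats method reports the mean, variance,
skewness and excess (Fisher) kurtosis:

from scipy.stats import norm

# mean, variance, skewness and excess kurtosis of the standard normal
print(norm.stats(moments='mvsk'))

The four values returned are 0, 1, 0 and 0, the last one being the
excess kurtosis under the Fisher definition.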

In Python we can use kurtosis in the statistics module of


SciPy. Note that we can obtain the unbiased value with
the bias=False parameter. By default, the implementation
uses Fisher's definition of kurtosis. If you need the value in
terms of the fourth moment, simply use the fisher=False
parameter. Let us see the kurtosis of our kellicam measures:

The SciPy implementation uses the Fisher definition of kurtosis,
where the normal kurtosis is 0.

from scipy.stats import kurtosis

kurtosis(kellicam)

kurtosis(kellicam, fisher=False)

0.89999999999992

3.89999999999992

5.4.4 The Central Limit Theorem

The expected value and variance are concepts we are


now familiar with. In particular, we looked at an experiment
with a fair coin performed multiple times. We observed
that the distribution starts resembling a normal distribution
as the number of trials increases. Let us consider this in
light of what we know now about the standard normal
distribution.

We covered them in Section 5.2.3. As the number of trials
increases, the distribution becomes normal.

What we did for our experiment consists of getting a


sequence of independent and identically distributed (i.i.d.)
random variables X1 , X2 , . . . Xn . Let us consider that each of
these observations has a finite mean such that E[Xi] = µ and
variance σ^2. The central limit theorem tells us that the
sample mean X̄ follows approximately the normal
distribution with mean µ and standard deviation σ/√n. In
other words X̄ ≈ N(µ, σ/√n).

The central limit theorem. We can even standardise the
values and use z = √n(x̄ − µ)/σ.

Let us look at this in terms of the moment generating


function. Let Xn be a random variable with moment
generating function φXn (t), and X be a random variable
with moment generating function φX (t). The central limit
theorem therefore tells us that as n increases:

        lim_(n→∞) φ_Xn(t) = φ_X(t),          (5.98)

and thus the CDF of Xn converges to the CDF of X.

The central limit theorem in terms of moment generating functions.

Let us use the information above to look at the central


limit theorem. Assume that X is a random variable with
mean µ and standard deviation σ. If, as in our experiment,
X1, X2, . . . , Xn are independent and identically distributed
following X, let Tn = (∑i Xi − nµ) / (σ√n). The central limit
theorem says that for every x, the probability P(Tn ≤ x) tends to
P(Z ≤ x) as n tends to infinity, where Z is a standard
normal random variable.

Having independent and identically distributed random
variables is important.

To show this, let us define Y = (X − µ)/σ and Yi = (Xi − µ)/σ. Then
the Yi are independent, and distributed as Y with mean 0 and
standard deviation 1, and Tn = ∑i Yi / √n. We aim to prove
that as n tends to infinity, the moment generating function
of Tn , i.e. φTn (t), tends to the moment generating function of
the standard normal. Let us take a look:
        φ_Tn(t) = E[e^(Tn t)],
                = E[e^((t/√n) Y1)] × · · · × E[e^((t/√n) Yn)],
                = (E[e^((t/√n) Y)])^n,
                = (1 + (t/√n)E[Y] + (t^2/2n)E[Y^2] + (t^3/6n^(3/2))E[Y^3] + · · ·)^n,
                = (1 + 0 + t^2/2n + (t^3/6n^(3/2))E[Y^3] + · · ·)^n,
                ≈ (1 + t^2/2n)^n,
                → e^(t^2/2).          (5.99)

As n → ∞, φ_Tn(t) → e^(t^2/2).
2 /2
As we showed in Equation (5.91), φZ (t) = et
2 /2
and Remember that et is the
moment generating function of the
therefore the limiting distribution is indeed the standard
standard normal distribution.
normal N (0, 1).
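A small simulation illustrates the theorem in practice. The sketch
below, with an arbitrary seed and a uniform distribution chosen
simply because it is clearly not normal, standardises the sample
means using the known values µ = 0.5 and σ^2 = 1/12 for the
uniform on (0, 1):

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(1701)
n, trials = 100, 10000

# means of repeated samples from a non-normal (uniform) distribution
sample_means = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)

# standardise with mu = 0.5 and sigma = sqrt(1/12) for the uniform
z = (sample_means - 0.5) / (np.sqrt(1/12) / np.sqrt(n))

print(z.mean(), z.std())        # should be close to 0 and 1
print(skew(z), kurtosis(z))     # should be close to 0 and 0 (excess kurtosis)

The standardised means look very much like draws from N(0, 1), as
the theorem leads us to expect.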

5.5 Hypothesis and Confidence Intervals

Our measurements for the length of a standard


Klingon kellicam have proved so useful that the Vulcan
Science Academy has asked us to take a look at an issue
that has been puzzling Starfleet for quite some time and we
are eager to help. It turns out that there has been a report
of a gas leakage whose properties seem to be similar to
hexafluorine gas, used in starship and sensor construction.

We are not about to decline a request by the Vulcan Science
Academy!
The possibilities of an alternative gas for anti-intruder
systems that is less lethal to non-Vulcans or non-Romulans
seem exciting. We have been tasked with finding more

about this gas, particularly after some reports have come


about the potential anti-gravity effects that it seems to have
on some Starfleet officers.

Since the gas is still under strict Federation embargo, we


are not able to get the full details of all the population that
has been exposed to the gas, but we have a sample of the
crew members that first encountered it. We select 256 crew
members that provide a good sample for what happened in
the starship, and although we may not be able to account
for all systematic variables, we are in a position to account
for the hexafluorine-like substance. All in all, if we do find
differences between crew that has been affected by this
gas and those who have not, we may have a good case to
continue with a more complete investigation. The mild
anti-gravity effect the gas has is reflected in the mass of the
person affected. We obtain a mean of 81.5 kg.

Selecting a suitable sample is a good first step. Having a
target variable for our analysis is required.

Measuring
the average weight of the crew seems to be around the right
track, but this may tell us nothing about the comparison
of the affected population with those who were not. For
instance, a difference may be due to having selected heavier,
non-affected crew members.

However, we are able to compare a score to a population


of scores, in this case the mean. In other words, we can
consider selecting every possible sample of size 256 of the
unaffected population and calculate each sample mean. This
will give us a distribution that can be used as a benchmark.

Having every possible sample of size n from the population is
a monumental task. We need a different approach.
We already have the sample from the affected starship and
we can now compare these two populations of sample
means. If we find a difference, we are now onto something.

Sadly, we currently only have the mean of our 256 gravity-


challenged colleagues!

What should we do? Well, it is only logical (how very Vulcan!) to look at what

we have been discussing in this chapter. We do not need to


go out of our way to obtain every possible sample of the
unaffected population. We know that for a large enough
sample we obtain a normal distribution. We can use this
information to test the hypothesis that the gas has an
anti-gravity effect and we need to ascertain a significance
level for this hypothesis, find the z-score, and obtain the
probability to make our decision.

We test our hypothesis based on what we know about the normal
distribution.

If we consider that we have a population with mean µ and


standard deviation σ, and provided that we have a large
number of observations for the central limit theorem to
apply, we can define a confidence interval as:
 
        x̄ ± (z critical value) · σ/√n.          (5.100)

Confidence interval.

A typical choice is to consider a 95% confidence
interval. In terms of z-scores, the probability of randomly
selecting a score between −1.96 and +1.96 standard
deviations from the mean is around 95%. Let us check this
with Python's help:

For the standard normal distribution, a 95% confidence
interval corresponds to z-scores between ±1.96.

CIlower = norm.cdf(-1.96)
CIupper = norm.cdf(1.96)

print(CIupper - CIlower)

0.950004209703559

We can use this information to say that if we have less than


a 5% chance of randomly selecting a raw score, we have
a statistically significant result. We call this the level of
significance and it is denoted as α.

The level of significance, α.

Back to our gravity-defying effect, let us say that the mean


of our sampling distribution is 83 kg with a standard
deviation 4.3. The 95% confidence interval will be:
        83 ± (1.96) · 4.3/√256 = (82.47325, 83.52675).          (5.101)

As we saw above, a 95% confidence interval requires z-scores
between ±1.96.
In other words, plausible values of the mean are between
82.4 and 83.5 kg.
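The same interval can be obtained in one line with SciPy's interval
method; a minimal sketch assuming the summary values quoted above:

import numpy as np
from scipy.stats import norm

xbar, sigma, n = 83, 4.3, 256

# 95% confidence interval for the mean, as in Equation (5.101)
print(norm.interval(0.95, loc=xbar, scale=sigma/np.sqrt(n)))

This reproduces the bounds in Equation (5.101) up to rounding.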

What about our measurement for the affected population?


Well, let us take a look: The score we are after is calculated
as:
        Z = (X̄ − µ) / (σ/√n) = (81.5 − 83) / (4.3/√256) = −5.5813.

Let us ask Python to give us the probability p of obtaining a


sample mean as low or even lower than 81.5 kgs:

> norm.cdf(-5.5813)

1.1936373019551489e-08

We call this the p-value.

This is well below a significance level of α = 0.05, or


similarly well below the bottom 5% of the unaffected
sampling distribution. We can conclude that a sample mean
of 81.5 kg is so uncommon in the unaffected Starfleet
population that the distribution obtained from our
colleagues who encountered this gas is not the same as the
benchmark, and hence the hypothesis that the
hexafluorine-like gas may indeed have some anti-gravity
effects is supported.

We compare the p-value to the significance level α to determine
whether we reject (or not) our hypothesis.

The hypothesis that proposes that there are no differences


between the characteristics of two populations of interest is
called the null hypothesis and is usually denoted as H0.
The counterpart of the null hypothesis is called the
alternative hypothesis. In our example above we can
therefore say that we accept or reject the null hypothesis
with a level of significance α = 0.05. Please note that there is
nothing special about the value chosen, but it is a common
value to use. In the next chapter we will expand our
discussions about hypothesis testing.

H0 is the null hypothesis and it typically proposes that there
is no difference between two populations. The significance value
is arbitrarily chosen.

The probability p mentioned above to help us reject (or


not) the null hypothesis is imaginatively called the p-value
and it tells us how likely it is that our data could have
occurred under the null hypothesis. The calculation of
a p-value depends on the statistical test we are using to
check our hypothesis. Sometimes we refer to statistical
significance and that is a way to say that the p-value for
a given statistical test is small enough to reject the null
hypothesis.

Statistical significance refers to having a p-value small enough
to reject the null hypothesis.

There are some criticisms about the misuse
and misinterpretation of p-values. The debate is a long
one, and part of the problem is the interpretation of the
meaning of these values. It is not unusual to think that the
p-value is the probability of the null hypothesis given the
data. Actually it is the probability of the data given the null
hypothesis.
The American Statistical Association released a statement13
on the use of statistical significance and p-values. The
statement addresses six principles around the use of p-
values:

13 Wasserstein, R. L. and N. A. Lazar (2016). The ASA statement
on p-values: Context, process, and purpose. The American
Statistician 70(2), 129–133.

1. p-values can indicate how incompatible the data are with


a specified statistical model.

2. p-values do not measure the probability that the studied


hypothesis is true, or the probability that the data were
produced by random chance alone.

3. Scientific conclusions and business or policy decisions
should not be based only on whether a p-value passes a
specific threshold.

4. Proper inference requires full reporting and transparency.

5. A p-value, or statistical significance, does not measure the
size of an effect or the importance of a result.

6. By itself, a p-value does not provide a good measure of
evidence regarding a model or hypothesis.

Six principles about the use of p-values.

We are entering the realm of statistical inference, where we


are interested in performing hypothesis tests. The
comparison of statistical models is said to be statistically
significant if, according to the threshold probability chosen,
the observations obtained would be unlikely to occur if the
null hypothesis were true. In order to do this, we have a
variety of tests and distributions that can be used. In this
section we are going to talk about two particular
distributions that are commonly used in statistical inference.

We will expand this discussion in Chapter 6.

5.5.1 Student’s t Distribution

Imagine you are working at your local brewery in


Dublin, Ireland and your employer is highly interested
in bringing some quality control to the brewing of your
famous stouts. As a connoisseur of beer, you already know
that brewing has an element of uncertainty, and you have
been tasked with eliminating that uncertainty as much
as possible. The ingredients that make your beer do have
an inherent variability: From the malted barley you buy
to the hops and yeast the brewer has cultivated over the
years.

From Guinness to Romulan Ale, brewing is both science and art.

Your task involves not only the assessment of these
ingredients but also doing so in a cost-effective manner.
Instead of wasting a lot of valuable ingredients in tests, you
are looking at carrying out experiments with small samples
to come to some conclusions.

Unlike the example of the fictitious hexafluorine-like gas,


this is an actual historical one. William Sealy Gosset was
indeed working at the Guinness brewery in Ireland, and his
work led to the effective use of small samples to show that
the distribution of the means deviated from the normal
distribution. This meant that he could not rely on
conventional statistical methods devised under the
assumption of normality to draw valid conclusions.

Remember Gosset next time you enjoy a Guinness beer!
We will cover those methods in Chapter 6.

In 1908 Gosset published14 a paper about his work under
the pseudonym of Student. In this paper, he shows how
having n observations, within which lies the mean of the
sample population, can be used to determine within what
limits the mean of the sampled population lay. He provides
values (from 4 to 10) which become the cumulative
distribution function of what now we know as the Student's
t-distribution. The t-distribution can be used to describe
data that follow a normal-like curve, but with the greatest
number of observations close to the mean and fewer data
points around the tails. As shown by Gosset, it is useful in
cases where we have smaller sample sizes, and where the
standard deviation of the data is not known. Instead, the
standard deviation is estimated based on the degrees of
freedom of the dataset, in other words, the total number of
observations minus one.

14 Student (1908). The probable error of a mean. Biometrika 6(1),
1–25.

The t-distribution is useful in cases where we have smaller sample
sizes, and where the standard deviation of the data is not known.

As you would expect, as the degrees of freedom increase,


the t-distribution converges to the normal distribution.
Anecdotally, this happens to be the case above 30 degrees
of freedom, at which point we can use the standard normal
distribution z.

Remember the central limit theorem.
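We can see this convergence numerically; a short sketch comparing
the two-tailed 95% critical values of the t-distribution with the
normal one:

from scipy.stats import t, norm

# two-tailed 95% critical values for increasing degrees of freedom
for nu in [1, 5, 30, 100]:
    print(nu, t.ppf(0.975, nu))

print('normal', norm.ppf(0.975))

The critical values shrink towards the normal value of about 1.96
as ν grows, and at ν = 30 they are already reasonably close.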

For a series of independently and identically distributed


random variables X1, X2, . . . , Xn, let X̄ = (1/n) ∑i Xi be the
sample mean and let S^2 = (1/(n−1)) ∑i (Xi − X̄)^2 be the sample
variance. The random variable Z = (X̄ − µ)/(σ/√n) has a standard
normal distribution, and the variable:

        t = (X̄ − µ) / (S/√n),          (5.102)

has a Student t-distribution with ν = n − 1 degrees of
freedom. The value of S may not be close to σ, particularly
for small n and this introduces more variability in our
model. The t-distribution tends to be more spread than the
standard normal.

This is called the t-score. We denote the degrees of freedom as ν;
for the Student t-distribution ν = n − 1.

The probability density function of the t-distribution is


given by:
 
        f(t) = [Γ((ν + 1)/2) / (√(νπ) Γ(ν/2))] · (1 + t^2/ν)^(−(ν+1)/2),          (5.103)

The PDF of the t-distribution.

where ν = n − 1 is the degrees of freedom and Γ(·) is


the gamma function. For a positive integer n, the gamma
function is related to the factorial:

        Γ(n) = (n − 1)!.          (5.104)

We can also define the gamma function15 for all complex
numbers, except non-positive integers:

        Γ(z) = ∫_0^∞ x^(z−1) e^(−x) dx.          (5.105)

15 Arfken, G., H. Weber, and F. Harris (2011). Mathematical
Methods for Physicists: A Comprehensive Guide. Elsevier Science.
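SciPy exposes the gamma function in its special module, so we can
verify the factorial relation numerically with a quick sketch:

import math
from scipy.special import gamma

# For a positive integer n, Gamma(n) = (n - 1)!
print(gamma(5), math.factorial(4))

Both calls return 24.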

Expressions for the cumulative distribution function can


be written in terms of the regularised incomplete beta
function16, I:

        F(t) = ∫_{−∞}^t f(u) du = 1 − (1/2) I_(x(t))(ν/2, 1/2),          (5.106)

where x(t) = ν/(t^2 + ν). For some values of the degrees of
freedom we have some simple forms for the PDF and CDF
of the Student's t-distribution as shown in Table 5.2.

16 Pearson, K. (1968). Tables of the Incomplete Beta-Function: With
a New Introduction. Cambridge University Press.

As we saw in Equation (5.102) we can define a t-score for
the t-distribution similar to the z-score for the standard
normal one. It represents the number of standard deviations
from the mean and we can use this to define upper and
lower bounds for our confidence interval and obtain p-
values for t-tests.

The t-score helps us to define the t-test described in Section 6.5.1.

Table 5.2: Special cases of the PDF and CDF for the Student's
t-distribution with different degrees of freedom.

ν = 1:   PDF: 1 / (π(1 + t^2))
         CDF: 1/2 + (1/π) arctan(t)

ν = 2:   PDF: 1 / (2√2 (1 + t^2/2)^(3/2))
         CDF: 1/2 + t / (2√2 √(1 + t^2/2))

ν = 3:   PDF: 2 / (π√3 (1 + t^2/3)^2)
         CDF: 1/2 + (1/π) [ t / (√3 (1 + t^2/3)) + arctan(t/√3) ]

ν = 4:   PDF: 3 / (8 (1 + t^2/4)^(5/2))
         CDF: 1/2 + (3/8) · t / √(1 + t^2/4) · [ 1 − t^2 / (12(1 + t^2/4)) ]

ν = 5:   PDF: 8 / (3π√5 (1 + t^2/5)^3)
         CDF: 1/2 + (1/π) [ t / (√5 (1 + t^2/5)) · (1 + 2 / (3(1 + t^2/5))) + arctan(t/√5) ]

ν = ∞:   PDF: (1/√(2π)) e^(−t^2/2)
         CDF: (1/2) [ 1 + erf(t/√2) ]

As we mentioned above, the Student t-distribution is useful


when the standard deviation is not known. If the sample
size is large, we can use the confidence interval definition
used for the normal distribution shown in Equation (5.100).
Otherwise, we need to rely on the sample data to estimate
the standard deviation and thus our confidence interval is
given by:

        x̄ ± (t critical value) · S/√n,          (5.107)

Confidence interval for the t-distribution.

where the t critical value is defined in terms of ν, the


degrees of freedom. Remember that if n is large, we can use
the normal distribution for our estimation. The confidence
interval above is suitable only for small n provided that the
population distribution is approximately normal. If that is
not the case, a different estimation method should be
considered. In Figure 5.12 we can see a comparison of the
probability density of the normal distribution versus
Student’s t-distributions for different values of the degrees
of freedom ν.
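As an illustration, and reusing the kellicam sample defined earlier
in this chapter, here is a minimal sketch of the interval in
Equation (5.107) using SciPy's t.interval:

import numpy as np
from scipy.stats import t

n = len(kellicam)
xbar = kellicam.mean()
s = kellicam.std(ddof=1)        # sample standard deviation S

# 95% confidence interval using the t critical value with nu = n - 1
print(t.interval(0.95, n-1, loc=xbar, scale=s/np.sqrt(n)))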

Figure 5.12: Probability distributions for the Student's
t-distribution for ν = 1, 5, 30. For comparison we show the normal
distribution as a dashed curve.

For ν = 1, the t-distribution is much flatter and has fatter
tails than the standard Gaussian curve. This is
advantageous in the case of smaller sample sizes as it

provides a more conservative estimate of the probability


density compared to using the normal distribution. As the
degrees of freedom increase, we can see that the curves
match quite well.

Python lets us calculate the Student’s t-distribution in


much the same way as we have seen for others. The PDF
can be calculated with the t.pdf method and we need to
provide the degrees of freedom as a required parameter.
Note that you are also able to provide a parameter loc to
shift the distribution and use the parameter scale to scale
it. Otherwise, Python will assume you are requesting a
standardised t-distribution. Let us calculate the PDF for
the standard t-distribution for the ν = 3 case which can be seen
in the top left-hand panel of Figure 5.13:

We use the t module in SciPy.

from scipy.stats import t


x = np.arange(-5, 5, 0.01)
nu = 3
pdf = t.pdf(x, nu)

Python will calculate a standard t-distribution by default.

The CDF for the Student’s t-distribution also requires us to


provide the degrees of freedom. The CDF for the case ν = 3
can be seen in the top right-hand panel of Figure 5.13:
cdf = t.cdf(x, nu)

The PDF and CDF of the t-distribution are shown in the
top panels of Figure 5.13.

As before, the PPF can easily be obtained too:


Figure 5.13: The Student's t-distribution for ν = 3, showing the
PDF, CDF, PPF and random variates.
probs = np.arange(0, 1, 0.01)

ppf = t.ppf(probs, nu)

and the result is shown in the bottom left-hand panel of


Figure 5.13.

Finally, as we have done with other distributions, we are


able to create random variates with the rvs method. In
this case we are creating random variates following the
Student’s t-distribution for ν = 3. The results are shown in

the bottom right-hand panel of Figure 5.13.

x1 = t.rvs(nu, size=1000)

5.5.2 Chi-squared Distribution

A new replicator has finally been delivered to our


ship, the USS Cerritos. We, along with Mariner, Rutherford,
Boimler and Tendi and 94 other Starfleet colleagues, are
trying to decide whether the enchiladas or the tacos are
the best choice of Mexican food. In the best democratic
style, a vote is required and we have 60 colleagues going
for enchiladas, whereas the rest, i.e. 40, are voting for the
tacos. The requirement here has been a tally of frequency
counts on categorical data, in this case enchiladas and tacos.
We can extend our sample to the rest of the ship and the
frequency data collected can be used to determine a test
for the hypothesis that there is a preference for replicator
enchiladas, hmmm!

Yes, you and me, making it a group of 100 Starfleet officers.
It seems that for the next squad meal we are going to have
enchiladas. Yay!

A way to analyse frequency data is the use of the


chi-squared distribution and we can use it to test if the
variables are independent. We arrange the data according to the
expectation that the variables are independent. If the data
does not fit the model, the likelihood is that the variables
are dependent and thus we reject the null hypothesis. We
start our analysis based on the z-score and we obtain our χ2
statistic simply by squaring z:

        χ2 = z^2 = (X − µ)^2 / σ^2.          (5.108)

χ-squared or even χ2. See Section 6.3 for more information on
the chi-square test.

As you can see, the values are always positive. In any event,
if the categories we are selecting things from are mutually
independent, then the sum of individual χ2 follows a chi-
square distribution, i.e. ∑ x2 = χ2 . This means that we
can find a chi-square for each of the categories and their
sum will also be a chi-square. If that is not the case, the
categories are not independent!

If the distribution of observations follows a chi-square, the
categories are independent.

The shape of the chi-square distribution depends on the


degrees of freedom, which in this case is the number of
categories minus one. For our enchiladas and tacos vote, we
have k = 2 − 1 = 1, in other words one degree of freedom.
The probability distribution function of the chi-squared
distribution is given by:
        f(x) = x^(k/2 − 1) e^(−x/2) / (2^(k/2) Γ(k/2)),  for x > 0;
        f(x) = 0,  otherwise.          (5.109)

The PDF of the chi-squared distribution.

where k corresponds to the degrees of freedom and Γ(·) is


the gamma function.

For each chi-squared variable of degree k, the mean is


actually k and its variance is 2k. Using the central limit
theorem, we can see that as k tends to infinity, the chi-
square distribution converges to a normal distribution of
mean k and standard deviation √(2k). In practical terms,
for k greater than 50 the normal distribution can safely be
used. We can see the probability density function of the
chi-square distribution for different degrees of freedom in
Figure 5.14.

As k → ∞, χ2(k) → N(k, √(2k)).
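We can confirm the mean and variance directly in SciPy with a
quick sketch:

from scipy.stats import chi2

# mean and variance of the chi-squared distribution for several k
for k in [1, 5, 10]:
    print(k, chi2.stats(k, moments='mv'))

Each line returns the pair (k, 2k).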
Figure 5.14: Probability density function for the chi-square
distribution for different degrees of freedom (k = 1, 5, 10).

When k = 1 we expect that, if the categories are indeed
independent, most of the results will be close to 0 as we can
see in Figure 5.14. As we add more and more degrees of
freedom we are effectively adding together independent
chi-square distributions each with k = 1. Although they
individually bunch close to 0, when added together the
centre of the distribution starts shifting. The more values we
add, the more symmetrical the distribution becomes, until
we converge to a normal distribution.

The cumulative distribution function is expressed in terms


of the gamma function, Γ(·), we encountered in the
previous section, and the lower incomplete gamma
function17 γ(·):

        F(x) = γ(k/2, x/2) / Γ(k/2).          (5.110)

17 Abramowitz, M. and I. Stegun (1965). Handbook of Mathematical
Functions: With Formulas, Graphs, and Mathematical Tables.
Applied Mathematics Series. Dover Publications.

Figure 5.15: The chi-squared distribution for k = 3, showing the
PDF, CDF, PPF and random variates.

We can obtain values for the PDF with Python with the help
of the chi2 implementation in SciPy:

from scipy.stats import chi2

x = np.arange(0, 10, 0.01)
k = 3
pdf = chi2.pdf(x, k)

We use the chi2 module in SciPy.

The distribution can be seen in the top left-hand panel of


Figure 5.15. You need to provide the degrees of freedom as
a parameter. As before the distribution can be shifted with
loc and scaled with scale.

The CDF is calculated in Python as follows:

cdf = chi2.cdf(x, k)

You know the drill by now.

We can see the shape of the CDF in the top right-hand panel
of Figure 5.15. The PPF can be seen in the bottom left-hand
panel of Figure 5.15 and obtained as follows:

probs = np.arange(0, 1, 0.01)
ppf = chi2.ppf(probs, k)

The CDF and PPF of the chi-squared distribution are shown
in the top-right and bottom-left panels of Figure 5.15.

Finally, we can get random variates following the


chi-squared distribution with k = 3 with rvs:

x1 = chi2.rvs(k, size=1000)

A histogram of the values obtained is shown in the bottom


right-hand panel of Figure 5.15.

Let us go back to our initial survey of preferred Mexican


food. From our data, we have that the probability for
enchiladas is pe = 0.6 and the probability for tacos is
pt = 0.4. We could explore whether obtaining counts of 72 and 28
is significantly different from the expected values of 60 and 40.
There are therefore two χ2 contributions we need to add up:

        χ2 = (72 − 60)^2/60 + (28 − 40)^2/40 = 6.          (5.111)

We can use the chi-square to devise a suitable test.

Let us find the critical value for a chi-squared distribution


for 95% confidence with k = 1. We can use the PPF to find
the value:

chi2.ppf(0.95, 1)

3.841458820694124

As the χ2 value we obtained is larger than this critical value,
we can conclude that there is indeed a strong preference for
enchiladas! In this case we may not need to resort to a test,
but as the number of categories increases, the chi-square is a
good friend to have.

See Section 6.3.1 for more on the goodness-of-fit test.
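In fact SciPy wraps this calculation in the chisquare function; a
minimal sketch with the observed and expected counts used above:

from scipy.stats import chisquare

# observed counts v expected counts under p_e = 0.6 and p_t = 0.4
print(chisquare(f_obs=[72, 28], f_exp=[60, 40]))

The statistic is the 6 we calculated in Equation (5.111), and the
associated p-value (about 0.014) sits below a significance level
of 0.05.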
Taylor & Francis
Taylor & Francis Group
http://taylorandfrancis.com
6
Alluring Arguments and Ugly Facts – Statistical
Modelling and Hypothesis Testing

We have come a long way in our journey; with the


knowledge we have now, we are in a position to start
interrogating our data a bit more. In this chapter we will
introduce the idea of statistical modelling as a way to
understand and interpret the observations we have gathered.
This will let us identify potential relationships between
variables and make predictions.

Statistical modelling lets us understand and interpret our data.

We will first expand on our discussion of hypothesis testing


from the previous chapter and look at the difference
between one-tailed and two-tailed tests. This will give us
the opportunity to talk about testing for normality, i.e. to
check if a set of data follow a normal distribution. We will
apply hypothesis testing to more specific tests that will help
us tease out ugly facts in alluring arguments, and
sometimes the other way around too.

Check Section 5.5.

6.1 Hypothesis Testing

In the previous chapter we talked about the


characteristics of various probability distributions and we
also looked at how these distributions can help us with
determining whether we should accept or reject a
hypothesis. We now understand that for us to absolutely
determine if a hypothesis is true or false, we need absolute
knowledge and hence we will have to examine a full
population or populations. This is clearly not possible in
many cases. Instead, we use a random sample, and use its
properties to see if there is enough evidence to support the
hypothesis or not.

We cannot have absolute knowledge about the population.

Let us recall from Section 5.5 that we have two hypotheses:

• H0 : the null hypothesis, and

• Ha : the alternative hypothesis.

We may want to test if the alternative hypothesis is likely to


be true and hence we have two possible results:

• we reject the null hypothesis and accept the alternative
one as we have enough evidence in favour of Ha, or

• we do not reject the null hypothesis because we do not
have enough evidence in favour of the alternative.

Not rejecting the null hypothesis does not mean that H0 is
necessarily true!

Note that not rejecting the null hypothesis does not


necessarily mean that H0 is true. It only means that we do
not have enough evidence to support Ha .

Let us consider an example. The toaster manufacturer


in Caprica City has issued a technical statement saying
that their Model-5 artificially intelligent toasters are really
good. However, they mentioned that their defective rate
is 5%. We would like to investigate if the true probability,
p, of a toaster being defective is indeed equal to the value
mentioned by the manufacturer, or actually greater than
that. Our hypothesis can be formulated as follows:

We know how good the A.I. in these toasters is, particularly in
the Model-5.

• H0: p = 0.05,

• Ha: p > 0.05.

Note that the null hypothesis test is for equality.

We take a sample of 100 Model-5 toasters and perform some


quality control tests to determine if the toasters are defective
or not. Let us denote the number of defective toasters by X;
this is our test statistic. We will reject the null hypothesis
if X ≥ 10, with 10 being our critical value. The region that
provides evidence about the rejection of the null hypothesis
is called the critical region. Look at Figure 6.1 to give you
an idea of what we are dealing with. You may be thinking,
why 10? Well, as we said, the value is arbitrary, but not
random. Let us take a look.

We chose the critical value arbitrarily, but not randomly.

Figure 6.1: A schematic way to think about hypothesis testing.
The critical value X separates the "do not reject H0" region from
the "reject H0" critical region.

The result of our quality testing on each Model-5 toaster


is effectively a Bernoulli process. As discussed in Section
5.3.2, there are only two outcomes for each trial, either the
Model-5 toaster is defective or it is not. If the probability of
defect is p = 0.05, in a sample of 100 toasters we expect to
have 100 × 0.05 = 5 defective toasters. If we find 10 defective
ones, we have strong evidence to reject the null hypothesis.

Our test results in a Bernoulli process as we have only two
outcomes: defective or non-defective toaster.

Taking measurements is not an exact science, and we need


to come to terms with the idea that we may make some
errors. If we reject the null hypothesis when the alternative
one is actually correct we are good! That is also the case
when we do not reject the null hypothesis when it is in fact
true. But what happens when we make an error? We can
use Table 6.1 to think of the possible situations.

Table 6.1: Types of errors when performing hypothesis testing.

                    H0 is true                 Ha is true
Do not reject H0    Correct decision           Type II error
                    True Negative (1 − α)      False Negative (β)
Reject H0           Type I error               Correct decision
                    False Positive (α)         True Positive (1 − β)

As we can see, we have a Type I error when we have a
false positive. In other words, we reject the null hypothesis
when in fact it is true. If this was a courtroom case, this is
equivalent to convicting a defendant when in fact they are
innocent. The probability of committing a Type I error is
called the level of significance, and we denote it with the
letter α. The lower the value of α, the less likely we are to
commit a Type I error. As discussed in Section 5.5 we
generally choose values smaller than 0.05.

Type I errors correspond to false positives. The lower the value
of α, the less likely we are to erroneously convict an innocent
person.

We know from Section 5.3.3 that repeated Bernoulli trials


follow a binomial distribution. We can therefore use the
probability mass function (PMF) of the binomial to
determine the level of significance for our test, i.e.
α = Pr(X ≥ 10) when p = 0.05. Let us use Python to do
this:

The distribution of our data helps us determine the values we
need to test for.

from scipy.stats import binom

alpha = 0
for n in range(10, 101):
    alpha += binom.pmf(n, 100, 0.05)

We are calculating the probability of 10 or more defective
toasters, hence the use of range(10, 101).

And we can see the significance value calculated:

> print(alpha)

0.028188294163416106

Let us now turn our attention to Type II errors. These come


about when we fail to reject the null hypothesis when
in fact the alternative hypothesis is true. This is a false
negative and the probability of committing a Type II error is
denoted as β. If we have a true positive when rejecting H0,
the probability is 1 − β and this is called the power of a test.

Type II errors correspond to false negatives.

In order to compute β, we need to look at specific


alternative hypotheses. In our example, since we do not
know the true value of p, all we can do is compute β for
specific cases, for example Ha : p = 0.08. In other words,
β = Pr(X < 10) when p = 0.08:

We calculate β for specific Ha's.

beta = 0
for n in range(0, 10):
    beta += binom.pmf(n, 100, 0.08)

print(beta)

0.7219779808448744
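The same two probabilities can be obtained without a loop, using
the binomial's cumulative functions; a quick cross-check:

from scipy.stats import binom

# P(X >= 10) when p = 0.05, i.e. alpha
print(binom.sf(9, 100, 0.05))

# P(X < 10) when p = 0.08, i.e. beta
print(binom.cdf(9, 100, 0.08))

Both values match the α and β computed above.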

Note that there is an interplay between α and β. If we


reduce the critical region, we reduce α but this will increase
β. The inverse is also true, we can reduce β by increasing
the critical region, but then α increases. There is a way to
reduce both α and β simultaneously: We need to increase
the sample size!

We can reduce α and β simultaneously by increasing the sample
size.

Hang on a minute! Increasing the sample size implies that


we can make use of the central limit theorem as discussed
in Section 5.4.4 and we can approximate our distribution
with a Gaussian curve. For our example with a binomial
distribution, we have that the mean is np and the standard
deviation is √(np(1 − p)). We can use this information to obtain
our z-score:

        z = (X − np) / √(np(1 − p)).          (6.1)

Increasing the sample size gives us the opportunity of using the
central limit theorem. We are using the mean and standard
deviation of the binomial distribution. See Section 5.3.3.

Let us say that we increase our sample to 550 toasters to test,


and our critical value to 45. We can use Python to calculate
α with the help of the CDF of the normal distribution for
our z-score:
X = 45
n = 550
p = 0.05
zs = (X-n*p)/np.sqrt(n*p*(1-p))

alpha = 1-norm.cdf(zs)
print(alpha)

0.0003087467117317555

We are reducing α, and thus our chances of committing a Type I
error.

With these values, we are very unlikely to commit a Type


I error! This is such a common thing that Python has a
shortcut for us in the form of the so-called survival function
sf=1-cdf:

> print(norm.sf(zs))

0.0003087467117317515

We are using the survival function for the normal distribution.

6.1.1 Tales and Tails: One- and Two-Tailed Tests

So, you have a hypothesis? Now what? Well, let us recap


a few of the things we mentioned in the previous section.
After having identified your null hypothesis H0, and your
alternative hypothesis Ha, you are interested in determining
if there is enough evidence to show that the null hypothesis
can be rejected.

So, you have a hypothesis? Let us test it!

• H0 assumes that there is no difference between a
parameter and a given value, or that there is no difference
between two parameters. (H0 assumes there is no difference.)

• Ha assumes that there is a difference between a
parameter and a given value, or that there is a difference
between two parameters. (Ha assumes there is a difference.)

• The critical region is the range of values that indicate that


a significant difference is found, and therefore the null
hypothesis should be rejected. (The critical region helps us
reject the null hypothesis.)

• The non-critical region is the range of values that


indicates that the difference is probably due to chance,
and we should not reject the null hypothesis

• The critical value separates the critical and non-critical


regions

• The significance level is the maximum probability of


committing a Type I error, or in other words getting a
false positive, i.e. P(Type I error | H0 is true) = α.

In the previous section, we considered an example where


the Model-5 toaster manufacturer in Caprica City claimed
that their defective rate was equal to a given value, θ0. The
alternative hypothesis in that example was to assume that
the probability was bigger than θ0. In other words:

We used p before, and we adopt θ0 here to avoid confusion with
p-values.

• H0: θ = θ0

• Ha: θ > θ0

The hypotheses of a right-tailed test.

In this case the critical region was only on the right-hand


side of the distributions. This is why we call tests like these
one-tailed tests; here we have a right-tailed test. We can see
a depiction of this situation in the top panel of Figure 6.2.
Figure 6.2: One- v Two-Tail Tests, showing the rejection regions
for a right-tailed, a left-tailed and a two-tailed test.


We could have formulated a one-tailed test where the tails is
located on the left-hand side.

• H0 : θ = θ0 The hypotheses of a left-tailed test.

• Ha : θ < θ0

This left-tailed test is depicted in the middle panel of Figure


6.2. Finally, we have the case where we check for
differences:

• H0: θ = θ0

• Ha: θ ≠ θ0

The hypotheses of a two-tailed test.

Here, we have a situation where the critical region is on


both sides of the distribution and hence we call this a two-
tailed test. Note that the significance level is split so that the
critical region covers α/2 on each side.

Let us go back to our Model-5 toaster manufacturer in


Caprica City for an example of a two-tailed test. This time
the statement is that their toasters have a resistor of 48Ω.
We take a sample of 100 toasters and look at the resistors,
obtaining a sample mean of 46.5Ω. Let us assume that
we have a known population standard deviation of 9.5,
following a normal distribution. The steps that we need
to complete in hypothesis testing can be summarised as
follows:

I assume a lecture on the futility of resistance would result in
the wrong SciFi pun.

1. State the hypotheses and identify the claim. This helps


formulate H0 and Ha.

2. Find the critical value or values.

3. Compute the test value.

4. Decide whether to reject the null hypothesis or not.

5. Explain the result.

Hypothesis testing steps.

For the first step, we have that our hypothesis is that the
manufacturer used a 48Ω resistor. Our null hypothesis
corresponds to the claim made. If we reject the null
hypothesis we are saying that there is enough evidence to
reject the claim. When the alternative hypothesis is the
claim, rejecting the null hypothesis means that there is
enough evidence to support the claim. In our Model-5
toaster case, our hypothesis test is given by:

Stating clearly our hypothesis lets us better interpret the
results. Note that we have a two-tailed hypothesis.

• H0: µ = 48

• Ha: µ ≠ 48

Since we have information about the population standard


deviation, we are going to use as our test statistic a z-test:

        z = (X̄ − µ0) / (σ/√n).          (6.2)

Note that we are assuming that we know the standard deviation
of the population.

Let us select a significance level α = 0.05. We can then


calculate a critical value, cv. Since this is a two-tailed test,
we need to check that the test statistic given by our z-score
value is between −cv and cv, otherwise we will reject the
null hypothesis:

In a two-tailed test, we split the significance level to have α/2
on each side.

alpha = 0.05
print(norm.ppf(alpha/2))

-1.9599639845400545

Remember the magic z-scores of ±1.96 for a 95% confidence level?
If not, take a look at page 249.

We can now use the information from our data to calculate


our test value:

Let us calculate our z-score.

> n = 100
> sigma = 9.5
> mu = 48
> X = 46.5
> zs = (X-mu)/(sigma/np.sqrt(n))

Let us take a look:

> print(zs)

-1.5789473684210527

Since the test value is between −1.96 and 1.96, we cannot


reject the null hypothesis and conclude that the sample data
support the claim that the resistors in the Model-5 toasters
are 48Ω at a 5% significance level. In this example we have
made use of a standard normal distribution, however, in
many cases a Student's t-distribution is used for this type of
test.

See Section 6.5 for running tests with the t-distribution.

In the example above we selected the significance value


first. We can also employ a p-value to test our hypothesis.
Remember that a p-value is the probability of getting a
sample statistic, for example the mean, or a more extreme
sample statistic pointing toward the alternative hypothesis
when the null hypothesis is considered to be true. In other
words, the p-value is the lowest level of significance at
which the observed value of a test statistic is significant.
This means that we are calculating the minimum probability
of a Type I error with which the null hypothesis can still
be rejected. Let us see what the p-value is for our Model-5
toasters:

See Section 5.5. The smaller the p-value, the stronger the
evidence against H0.

> pval = 2*(1-norm.cdf(abs(zs)))

> print(pval)

We can also use the survival function to obtain pval, for
instance as 2*norm.sf(abs(zs)).

The value obtained is approximately 0.114, which is above our
significance level of α = 0.05, so we are not able to reject the
null hypothesis. In this case the claim from the manufacturers of
the Model-5 toasters holds.

Now that we have a better understanding of how hypothesis
testing works, we will concentrate on more specific ways in
which this general methodology can be applied to make
sure that ugly facts are not hidden behind alluring
arguments, and vice versa.

Hypothesis testing works along these lines for different tests. We
will discuss this in the rest of this chapter.

6.2 Normality Testing

There is no doubt of the importance of the normal or


Gaussian distribution. It arises in many situations and
applications. Thanks to the central limit theorem many
variables will have distributions that approximate the
normal. Furthermore, given its mathematical closed form, it
is easy to manipulate and consequently it underpins many
statistical tests. Many statistical analyses start with some
exploratory data analysis as described in Chapter 4,
followed by some normality testing. This will let us decide
if other tests described in this chapter are suitable to be
applied.

Testing if our data follows a normal distribution is a typical
step in statistical analysis.

The Gaussian distribution can be fully determined by two


parameters: The mean and the standard deviation. In terms
of shape, we know that it is a symmetrical distribution
and the empirical rule indicates the percentage of data
observations within a number of standard deviations from
the mean.

Review Sections 5.4.1 and 5.4.2 for more information about the
normal distribution.

One way to assess the normality of our data is to
do it graphically. We shall see how to use Python to create a
histogram in Section 8.7 and a box plot in Section 8.8. Let us
start with a graphical test called the Q-Q plot and we will then
discuss some more formal tests.

Histograms and box plots are other ways to check for normality
in a graphical way.

6.2.1 Q-Q Plot

A Q-Q plot, or quantile-quantile plot, is a graphical


way to assess if a dataset resembles a given distribution
such as the normal. The idea is to look at the distribution of
data in each of the quantiles. In Section 4.4.2 we discussed
the grouping of observations into quantiles, and in this
graphical test we create a scatterplot that shows the
quantiles of two sets of data, the dataset in question and
that of a theoretical distribution such as the normal.

Q-Q is short for quantile-quantile. Nothing to do with the
extra-dimensional being with immeasurable power. See Section 8.3
for more information on scatterplots.

The idea is that if both sets of quantiles come from the same
distribution, the plot will show a straight line. The further
the plot is from a straight line, the less likely it is that the
two distributions are similar. Note that this is only a visual
check and it is recommended to use a more formal test as
discussed in the rest of this section.

A straight line indicates that the data follows a normal curve.

Let us take a look at some example data containing random


samples taken from a normal distribution, and compare
them to some that are skewed. The data can be obtained1
from https://doi.org/10.6084/m9.figshare.17306285.v1
as a comma-separated value file with the name
"normal_skewed.csv". Let us read the data into a pandas
dataframe:

1 Rogel-Salazar, J. (2021b, Dec). Normal and Skewed Example Data.
https://doi.org/10.6084/m9.figshare.17306285.v1

import pandas as pd

df = pd.read_csv('normal_skew.csv')

The data contains data for two series, one normally distributed
and one skewed.

We can now call the probplot method from the stats


module in SciPy. It lets us generate a Q-Q plot of sample
data against the quantiles of a specified theoretical
distribution. The default is the normal distribution and in
the example below we explicitly say so with dist='norm'.
Note that we request for a plot to be produced with the help
of matplotlib. Let us look at the normal example first:

from scipy import stats
import matplotlib.pyplot as plt

stats.probplot(df['normal_example'], dist='norm',
    plot=plt)

We create a Q-Q plot with the probplot method.
plot=plt)

We are directly passing the pandas series to the probplot


method and indicate the reference distribution via dist.
The result can be seen in the left-hand side of Figure 6.3.
The plot shows a reference diagonal line. If the data come
from the same distribution, the points should align with
this reference. For this example, we can see that the data in
question is normally distributed.

We provide the reference distribution via dist.

The more the data points depart from the reference line
in the Q-Q plot, the greater the chance the datasets come
from different distributions. We can see the Q-Q plot for the
skewed dataset by using the corresponding pandas series:
Figure 6.3: Q-Q plots for a normally distributed dataset, and
a skewed dataset.

stats.probplot(df['skewed_example'], dist='norm',
    plot=plt)

The result is shown in the right-hand side of Figure 6.3.


As we can see, the points deviate from the reference line
significantly. In this case, we can visually assess that the
dataset in question is not normally distributed.
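If we want to see what probplot is doing under the hood, we can
sketch a Q-Q plot by hand: sort the sample and plot it against
quantiles of the standard normal obtained with norm.ppf. This is
only a sketch; the plotting positions below are one common choice
and probplot uses a slightly different formula internally. We
reuse the 'normal_example' column from the dataset above.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Sort the sample and build approximate plotting positions
sample = np.sort(df['normal_example'].values)
n = len(sample)
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = norm.ppf(probs)

# Ordered data against theoretical quantiles: roughly a
# straight line for normally distributed data
plt.scatter(theoretical, sample, s=10)
plt.xlabel('Theoretical quantiles')
plt.ylabel('Ordered sample values')
plt.show()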

6.2.2 Shapiro-Wilk Test

Testing for normality in a visual way as done in the


previous section provides us with a quick way to assess
our data. However, in many cases this subjective approach
requires more robust ways to test our hypothesis; graphical
testing is subjective, and we need a more robust method. With that
in mind, one of the most widely used tests for normality is
the one proposed by Samuel Sanford Shapiro and Martin
Wilk in 1965². The Shapiro-Wilk test statistic is called W and
is given by:

W = ( ∑_i a_i x_(i) )² / ∑_i (x_i − x̄)²,   (6.3)

where x̄ is the sample mean and x_(i) is the i-th order
statistic, in other words the i-th smallest number in the
sample.

² Shapiro, S. S. and Wilk, M. B. (1965, 12). An analysis of variance
test for normality (complete samples). Biometrika 52(3-4), 591–611

In order to compare to the normal distribution, the test


takes samples from a normal distribution and calculates
expected values m_i of the order statistics of normally
distributed samples; we calculate the coefficients a_i with the
help of these expected values. We denote m = (m_1, . . . , m_n)^T the
vector of these expected values, and V the covariance matrix
of those normal order statistics. These mathematical objects
enable us to calculate the coefficients a_i in Equation (6.3):

(a_1, . . . , a_n) = m^T V^{−1} / |V^{−1} m|.   (6.4)

See Section 6.4.1 for a formal definition of the covariance.

The test makes the null hypothesis that a random sample


is independent and identically distributed (i.i.d) and came
from a normal distribution N(µ, σ) for some unknown µ
and σ > 0; that is, the null hypothesis is that the
data is normally distributed. If the p-value obtained from the test is less than
the chosen level of significance α, the null hypothesis is
rejected, indicating that the data is not normally distributed.
In Python we can use the implementation of the test in the
statistics module in SciPy. The function is called shapiro
and takes an array of sample data. It returns the test statistic
and a p-value. Let us import the function:
# We use the shapiro function in SciPy
from scipy.stats import shapiro

Now that we know how the test considers the null and
alternative hypotheses, we can create a function that takes
the sample data and returns the result of the test:

def shapiro_test(data, alpha=0.05):
    # A function that helps us interpret our results
    # for a given value of alpha
    stat, p = shapiro(data)
    print('stat = {0:.4f}, p-value= {1:.4f}'.format(stat, p))
    if p > alpha:
        print("Can't reject the null hypothesis. The "
              "data seems to be normally distributed.")
    else:
        print("Reject the null hypothesis. The data "
              "does not seem to be normally distributed.")

Let us take a look at applying the Shapiro-Wilk test to the


example data used in the previous section:

> shapiro_test(df['normal_example'])

stat = 0.9954, p-value= 0.6394
Can't reject the null hypothesis.
The data seems to be normally distributed.

We can test against the data we know is normally distributed.

For the skewed data we have the following:


> shapiro_test(df['skewed_example'])

stat = 0.6218, p-value= 0.0000
Reject the null hypothesis.
The data does not seem to be normally distributed.

As well as the skewed data.

6.2.3 D’Agostino K-squared Test

We are all familiar with the shape of the Gaussian


distribution, and in Section 5.4.3 we discussed how the
moments of the normal distribution can help us characterise
the normality of sample data. We know that a normal
distribution has kurtosis equal to g2 = 3, and skewness
g1 = 0 (or kurtosis g2 = 0 if we use the Fisher definition).
Looking at these measures for our sample data
can help us formulate a test. This is where the D'Agostino
proposal³ comes into play.
³ D'Agostino, R. B. (1970, 12). Transformation to normality of the
null distribution of g1. Biometrika 57(3), 679–681
Looking at the skewness, g1 , and kurtosis, g2 , of the sample
data is a good start. In fact, if the samples are drawn from
a population that is normally distributed, it is possible
to write down expressions of the skewness and kurtosis (see Appendix H)

in terms of their mean (µ1 ), variance (µ2 ), skewness (γ1 )


and kurtosis (γ2 ). Unfortunately, the convergence of these
expressions to the theoretical distribution is slow. This
means that we may require a large number of observations.
In order to remedy this, D’Agostino’s proposal is to use the
following transformation:

Z1(g1) = δ asinh( g1 / (α √µ2) ),   (6.5)

where:

W² = √(2γ2 + 4) − 1,
δ = (ln W)^{−1/2},
α² = 2/(W² − 1).

See Appendix H for expressions for µ2 and γ2.

A further proposal was explored by D'Agostino⁴ to use an
omnibus test that combines skewness and kurtosis. For this
we define a statistic D such that:

D = ∑_i [ i − (1/2)(n + 1) ] X_{i,n} / (n² S),   (6.6)

⁴ D'Agostino, R. B. (1971, 08). An omnibus test of normality for
moderate and large size samples. Biometrika 58(2), 341–348
where X_{i,n} are the ordered observations derived from a
random sample X_1, . . . , X_n, and S² = ∑(X_i − X̄)²/n. An
omnibus test is used to check if the explained variance in a set
of data is significantly greater than the unexplained variance,
overall. An approximate standardised variable with asymptotically
mean zero and variance one is:

Y = [ D − 1/(2√π) ] √( 24πn / (12√3 − 37 + 2π) ).   (6.7)

If a sample is drawn from a non-normal distribution, the


expected value of Y tends to be different from zero.

This approach and some empirical results⁵ to test for the
departure from the normal distribution have been used to
implement a statistical test that combines skewness and
kurtosis in SciPy. The null hypothesis is that the sample
comes from a normal distribution, and the implementation
in the statistics module in SciPy is called the normaltest.
The test is two-sided.

⁵ D'Agostino, R. and E. S. Pearson (1973). Tests for Departure from
Normality. Empirical Results for the Distributions of b2 and √b1.
Biometrika 60(3), 613–622

We will again create a function that makes it clearer when


we are able to reject the null hypothesis and interpret the
result:

# We first load the normaltest function
from scipy.stats import normaltest

def dagostino_test(data, alpha=0.05):
    # We then create a function to help us interpret
    # the results of the test
    k2, p = normaltest(data)
    print('stat = {0:.4f}, p-value= {1:.4f}'.format(k2, p))
    if p > alpha:
        print("Can't reject the null hypothesis. The "
              "data seems to be normally distributed.")
    else:
        print("Reject the null hypothesis. The data "
              "does not seem to be normally distributed.")

Let us take a look at applying the D'Agostino K-squared test to the
example data used in the previous section:

> dagostino_test(df['normal_example'])

stat = 1.0525, p-value= 0.5908
Can't reject the null hypothesis.
The data seems to be normally distributed.

We can test against the data we know is normally distributed.


For the skewed data we have the following:

> dagostino_test(df['skewed_example'])

stat = 253.0896, p-value= 0.00000
Reject the null hypothesis.
The data does not seem to be normally distributed.

As well as the skewed data.
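As a side note, the K-squared statistic reported by normaltest is
assembled from the z-scores of SciPy's separate skewness and
kurtosis tests. A minimal sketch of this, using the normal example
data from above:

from scipy.stats import skewtest, kurtosistest

zs, _ = skewtest(df['normal_example'])
zk, _ = kurtosistest(df['normal_example'])
# The sum of the squared z-scores should match the
# normaltest statistic reported above (about 1.05)
print(zs**2 + zk**2)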

6.2.4 Kolmogorov-Smirnov Test

Another method for testing for normality is the following: given a
set of random samples X_1, . . . , X_n, we check whether an
empirical distribution, F_n(x), is consistent with a theoretical
known distribution F(x). Note that the theoretical
distribution does not need to be a Gaussian and the test can
check for other distributions, not just the normal, but we assume
that the population mean and standard deviation are known.

Consider the order statistics y_1 < y_2 < · · · < y_n of the
observed random samples, with no tied observations (tied
observations are data points with the same value). The
empirical distribution is then defined as:

F_n(x) = 0,     for x < y_1,
F_n(x) = k/n,   for y_k ≤ x < y_{k+1}, k = 1, 2, . . . , n − 1,
F_n(x) = 1,     for x ≥ y_n.

This is the empirical distribution given a set of order statistics.

We describe our empirical distribution as the fraction of


sample values that are equal to, or less than x. We are
using a step function that jumps by 1/n in height at each
observation. When we have n_k tied observations the jump is
n_k/n.

With this information we can justify creating a test statistic


that helps us check whether a cumulative distribution
function Fn ( x ) equals a theoretical distribution function F ( x )
such that:
D_n = max_x [ |F_n(x) − F(x)| ].   (6.8)

The Kolmogorov-Smirnov test.

This is the Kolmogorov-Smirnov test, and the hypothesis


testing can be summarised as:

• H0: F_n(x) = F(x), i.e. the distributions are the same,

• Ha: F_n(x) ≠ F(x), i.e. the distributions are different.

These are the Kolmogorov-Smirnov hypotheses.

If the p-value obtained is lower than the significance level α,


we reject the null hypothesis and assume that the
distribution of our data is not normal. If, on the other hand,
the p-value is larger than the significance level, we fail to
reject the null hypothesis and assume that our data is
normally distributed.

The Kolmogorov-Smirnov test⁶ is a non-parametric test.
This means that we are not making any assumption about
the frequency distribution of the variables. Instead, we rely on
order statistics. The test relies on the fact that as the sample
size n tends to infinity, the empirical distribution function,
F_n(x), converges with probability 1 and uniformly in x, to
the theoretical distribution function F(x). If at any point
there is a large difference between the two, we take this as
evidence that the distributions are different.

⁶ Kolmogorov, A. (1933). Sulla determinazione empirica di una
legge di distribuzione. Inst. Ital. Attuari, Giorn. 4, 83–91; and
Smirnov, N. V. (1939). Estimate of deviation between empirical
distribution functions in two independent samples. Bulletin
Moscow University 2(2), 3–16
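Before turning to the SciPy implementation, here is a minimal
sketch of Equation (6.8) computed by hand against the standard
normal. It assumes the sample in the 'normal_example' column is
already roughly standardised, which is also what comparing against
'norm' below assumes.

import numpy as np
from scipy.stats import norm

x = np.sort(df['normal_example'].values)
n = len(x)
cdf = norm.cdf(x)
# The empirical distribution jumps by 1/n at each observation,
# so we check the gap just after and just before each jump
d_plus = np.max(np.arange(1, n + 1)/n - cdf)
d_minus = np.max(cdf - np.arange(0, n)/n)
# The largest gap should be close to the kstest statistic
print(max(d_plus, d_minus))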
In Python we are able to use the kstest implementation of
the Kolmogorov-Smirnov test in the stats module of SciPy.
It lets us perform both one- and two-sided tests with the
help of the alternative parameter. Notice that the default
is the two-sided test, and that is what we will be using here.
is the two-sided test, and that is what we will be using here.
Let us write a function to test our hypothesis. Remember
that we are using the two-sided test:

# We first load the kstest function
from scipy.stats import kstest

def ks_test(data, alpha=0.05):
    # A function to help us interpret the results of the test
    d, p = kstest(data, 'norm')
    print('stat = {0:.4f}, p-value= {1:.4f}'.format(d, p))
    if p > alpha:
        print("Can't reject the null hypothesis. The "
              "data seems to be normally distributed.")
    else:
        print("Reject the null hypothesis. The data "
              "does not seem to be normally distributed.")

We can apply this function to check the example normal


data we have used in the rest of this section:

> ks_test(df['normal_example'])

stat = 0.0354, p-value= 0.8942
Can't reject the null hypothesis.
The data seems to be normally distributed.

We can test against the data we know is normally distributed.


For the skewed example data we have that:

> ks_test(df['skewed_example'])

stat = 0.5420, p-value= 0.0000
Reject the null hypothesis.
The data does not seem to be normally distributed.

As well as the skewed data.

6.3 Chi-square Test

In Section 5.5.2 we introduced the chi-squared


distribution as a way to analyse frequency data. Now that
we have a better understanding of hypothesis testing, we
can see how to use this distribution to look at the
comparison of observed frequencies, O, with those of an
expected pattern, E. We were trying to decide whether
enchiladas or tacos were the favourite Mexican food in our
spaceship.

6.3.1 Goodness of Fit

We can obtain observed frequencies from collected


data, and we may have an expectation of what these
frequencies should be. As you may imagine, the expected
frequencies are those corresponding to situations where the
null hypothesis is true; that is, the expected frequencies form
our null hypothesis. In general, however, we can check if
the observed frequencies match any pattern of expected
frequencies we are interested in, and hence we call this a
“goodness of fit”. In this way, we can decide if the set of
observations is a good fit for our expected frequencies.

Let us then consider k different categories in our data and


our hypotheses are as follows:

• H0 :

– p1 - hypothesised proportion for category 1

– p2 - hypothesised proportion for category 2


– . . .

– p_k - hypothesised proportion for category k

• Ha: the null hypothesis is not true

These are the hypotheses of the goodness-of-fit test.

Our test statistic is:

X² = ∑ (O − E)²/E,   (6.9)

with degrees of freedom ν = k − 1. In the example used in


Section 5.5.2, we have that our colleagues in USS Cerritos are

testing the new replicator and there is a vote for the chosen
Mexican food for the evening meal, with 60 colleagues
going for enchiladas, and 40 voting for tacos.

If we expected the proportions to be equally distributed,


are the observed frequencies a good fit for our expected
pattern? Well, let us take a look with Python's help (you can
see why this test is called a goodness of fit, right?). We can
use the chisquare implementation in SciPy providing an
array for the observed frequencies, and one for the expected
ones:

> expected = [50, 50]

> observed = [60, 40]


We can now pass these arrays to our chisquare function. We use
the chisquare function in SciPy:

> from scipy.stats import chisquare
> chisq, p = chisquare(observed, expected)
> print('stat = {0:.4f}, p-value= {1:.4f}'.format(chisq, p))

stat = 4.0000, p-value= 0.0455

With p < 0.05 there is a significant difference between


the observed and expected frequencies and therefore the
enchiladas and tacos are not equally preferred. Enchiladas
for dinner it is!
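We can verify Equation (6.9) by hand for this vote; this is just a
quick check of the numbers reported by chisquare, using the
chi-squared survival function with k − 1 = 1 degree of freedom:

from scipy.stats import chi2

X2 = (60 - 50)**2/50 + (40 - 50)**2/50   # 4.0
pval = chi2.sf(X2, df=2 - 1)             # about 0.0455
print(X2, pval)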

6.3.2 Independence

We mentioned in Section 5.5.2 that independence


can be inferred from the chi-squared distribution and our
hypotheses are:

• H0: Two variables are independent,

• Ha: Two variables are not independent.

These are the hypotheses to test for independence.

Our test statistic is the same as before, given by Equation


(6.9).

In this case, it is convenient to organise the frequency table


in a contingency table with r rows and c columns, with the
intersection of a row and a column called a cell. A contingency
table, or crosstab, displays the frequency distribution of the
variables, facilitating the calculation of probabilities. We can
calculate the expected values for our table as:

Expected value of a cell = (row total × column total) / overall total.   (6.10)
Suppose that we want to know whether or not the rank
of a Starfleet officer is associated with their preference for
a particular Mexican dish created by the new replicator
delivered to the USS Cerritos. Is the rank of a Starfleet officer
associated with their choice of Mexican food? Table 6.2 has the
result of a simple random sample of 512 officers and their choice
for Mexican food for dinner.

Table 6.2: Results from a random sample of 512 Starfleet officers
about their preference for Mexican food and their rank.

                  Tacos   Enchiladas   Chilaquiles   Total
Senior Officers     125           93            38     256
Junior Officers     112           98            46     256
Total               237          191            84     512

In Python we can use the chi2_contingency function in
SciPy stats, where we provide as an input an array in the
form of a contingency table with the frequencies in each
category:

import numpy as np
from scipy.stats import chi2_contingency

mexican = np.array(
    [[125, 93, 38],
     [112, 98, 46]])
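As a quick check of Equation (6.10), we can compute the expected
frequencies directly from the row and column totals of this table
before calling chi2_contingency:

row_totals = mexican.sum(axis=1)
col_totals = mexican.sum(axis=0)
# Equation (6.10) applied to every cell at once
expected = np.outer(row_totals, col_totals) / mexican.sum()
print(expected)   # [[118.5  95.5  42.], [118.5  95.5  42.]]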

The result of the function comprises 4 objects:

• the test statistic,

• the p-value,
• the degrees of freedom, and

• an array with the expected frequencies.


Let us create a function that helps us interpret the results:

def chi_indep(data, alpha=0.05):
    # A function to help us interpret the result of our
    # independence test
    chisq, p, dof, exp_vals = chi2_contingency(data)
    print('stat = {0:.4f}, p-value= {1:.4f}'.format(chisq, p))
    print('Expected values are:')
    print(exp_vals)
    if p > alpha:
        print("Can't reject the null hypothesis. "
              "The samples seem to be independent.")
    else:
        print("Reject the null hypothesis. "
              "The samples are not independent.")

We can now apply our function to the contingency table


above:

> chi_indep(mexican)

stat = 1.6059, p-value= 0.4480
Expected values are:
[[118.5  95.5  42. ]
 [118.5  95.5  42. ]]
Can't reject the null hypothesis.
The samples seem to be independent.

We can use our function on our contingency table.

As we can see, with a p > 0.05 we fail to reject the null


hypothesis and we conclude that we do not have sufficient

evidence to say that there is an association between the rank


of the officers and their preference for Mexican food. After
all, who can resist the flavours of such fascinating cuisine.

6.4 Linear Correlation and Regression

Tall Vulkan parents seem to have tall Vulkan children,


and shorter Vulkan parents have shorter Vulkan children
(the same may apply to other humanoid species, not just
Vulkans). There seems to be a relationship there, but every so often
we find that some tall Vulkan parents have shorter kids and
vice versa. We are interested to know if, despite the counter
examples, there is a relationship. In other words, we would
like to see whether the height of a Vulkan parent correlates
to the height of their children.

We do not need to travel the galaxy to encounter a question


like this; as a matter of fact Francis Galton asked that
question about human parents and children and in the
process he gave us the term regression⁷. If we are able to
determine that there is a correlation between two variables,
then we shall be able to create a regression model. In
particular, if the relationship is linear we have a linear
regression model. Our first task is therefore the
quantification of the correlation. We can then take a look at
performing a regression analysis.

⁷ Rogel-Salazar, J. (2017). Data Science and Analytics with Python.
Chapman & Hall/CRC Data Mining and Knowledge Discovery Series.
CRC Press

6.4.1 Pearson Correlation

One way to start looking at the relationship between


two random variables is to calculate the covariance. This

will let us see if the change in one of the variables is


proportional to the change in the other one. For
observations x1 , x2 , . . . , xn and y1 , y2 , . . . , yn their covariance
is given by:

Cov(x, y) = (1/n) ∑_{i=1}^{n} (x_i − x̄)(y_i − ȳ)
          = (1/n) ( ∑_{i=1}^{n} x_i y_i − n x̄ ȳ ).   (6.11)

This is the covariance between variables x and y.

If the covariance is positive the two variables move in the


same direction, if it is negative they move in opposite
directions, and if it is 0 there is no linear relation between them.

Let us assume further that the data that we need to analyse


follows a normal distribution. In that case, we know that
we can standardise our values by subtracting the mean, and
dividing by the standard deviation. We have therefore the
following expression:

ρ = (1/n) ∑_i ( (x_i − x̄)/σ_x ) ( (y_i − ȳ)/σ_y )
  = Cov(x, y) / (σ_x σ_y).   (6.12)

This is the Pearson correlation.

This is known as the Pearson correlation coefficient. If


we express the Pearson correlation coefficient in terms of
vectors a = ( x1 − x̄, x2 − x̄, . . . , xn − x̄ ) and b = (y1 − ȳ, y2 −
ȳ, . . . , yn − ȳ) we can write it as:

ρ = (a · b) / (|a| |b|),   (6.13)

which is the Pearson correlation in terms of vectors.

and since a · b = |a||b| cos θ we can see that the Pearson



coefficient is bounded between −1 and 1, the former value


indicating perfect negative correlation and the latter perfect
positive correlation.
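A minimal sketch of Equations (6.11) and (6.12) with NumPy may
help make the definitions concrete. The two arrays below are
hypothetical values used only for illustration; np.corrcoef uses
the same definition and should agree with the manual calculation.

import numpy as np

x = np.array([1.2, 2.3, 3.1, 4.8, 5.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9])

# Equation (6.11): average product of deviations from the mean
cov = np.mean((x - x.mean()) * (y - y.mean()))
# Equation (6.12): covariance scaled by the standard deviations
rho = cov / (x.std() * y.std())
print(rho, np.corrcoef(x, y)[0, 1])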

We may want to look at testing the significance of the


coefficient calculated and make the following hypotheses
for correlation between two variables:

• H0: The correlation between the variables is zero, i.e.
ρ = 0

• Ha: The correlation between the variables is not zero, i.e.
ρ ≠ 0

Under the null hypothesis, the sample correlation is


approximately normally distributed with standard error⁸
√((1 − ρ²)/(n − 2)) and the t-statistic is therefore:

t = ρ √( (n − 2)/(1 − ρ²) ).   (6.14)

⁸ Bowley, A. L. (1928). The Standard Deviation of the Correlation
Coefficient. Journal of the American Statistical Association
23(161), 31–34

The p-value is calculated as the corresponding two-tailed


p-value for the t-distribution with n − 2 degrees of freedom.

It is time to look at some data and bring the ideas above


to life. Let us use the cities data we looked at in Chapter 3.
Remember that dataset is available at⁹ https://doi.org/
10.6084/m9.figshare.14657391.v1 as a comma-separated
value file with the name “GLA_World_Cities_2016.csv”. As
before we read it into pandas for convenience:

import numpy as np
import pandas as pd

gla_cities = pd.read_csv('GLA_World_Cities_2016.csv')

⁹ Rogel-Salazar, J. (2021a, May). GLA World Cities 2016.
https://doi.org/10.6084/m9.figshare.14657391.v1
We will assume here that the data is cleaned up as before
(refer to Section 3.3.5 for some required transformations on
this dataset). We will create a copy of the dataframe to ensure
we do not destroy the dataset. We drop the missing values as the
Pearson coefficient implementation in SciPy does not work
with them:

cities = gla_cities.copy()
cities.dropna(inplace=True)

Let us now calculate the Pearson correlation coefficient with


pearsonr from scipy.stats for the population of the cities

and their radius:

# We use the pearsonr function from scipy.stats
stats.pearsonr(cities['Population'],
               cities['Approx city radius km'])

(0.7913877430390618, 0.0002602542910062228)

The first number is the correlation coefficient and therefore


we have that ρ = 0.7913. The second number given by
Python is the p-value (the first number is the correlation
coefficient, the second one is the p-value). For a 95% confidence
level, in the example above we can reject the null hypothesis
as the value obtained is lower than 0.05 and say that the correlation
coefficient we have calculated is statistically significant.
Let us check that we can obtain the p-value above. We have
that n = 16 and ρ = 0.7913. We can therefore calculate our
test statistic as:

# We can obtain the p-value with the help of Equation (6.14)
n = cities.shape[0]
rho = 0.7913877430390618
tscore = rho*np.sqrt(n-2) / np.sqrt(1-rho**2)
Let us see the result:

> print(tscore)
4.843827042161198

We are not done yet.

The one-tailed p-value can be obtained from the cumulative


distribution function (CDF) of the t-distribution with n − 2
degrees of freedom. Since, we require the two-tailed value,
we simply multiply by 2:

# We use our t-score to obtain our p-value
from scipy.stats import t

p1t = 1 - t.cdf(tscore, n-2)
p2t = 2*p1t

> print(p2t)
0.0002602542910061789

This is effectively the same value reported by the pearsonr


implementation. An alternative method to obtain the one-
tailed p-value is to use the survival function, defined as
1-cdf:

> 2*t.sf(tscore, n-2)
0.00026025429100622294

We can also use the survival function.

Pandas is able to calculate the correlation coefficients too.


Furthermore, we are able to obtain a correlation matrix

between the columns provided, and we do not need to


worry about missing values. Let us calculate the correlation
coefficient of the population, radius of the city and the
people per dwelling in the original gla_cities dataframe.
All we need to do is use the corr method:

> gla_cities[['Population', 'Approx city radius km',
              'People per dwelling']].corr()

                        Population  Approx city   People per
                                      radius km     dwelling
Population                1.000000     0.791388    -0.029967
Approx city radius km     0.791388     1.000000     0.174474
People per dwelling      -0.029967     0.174474     1.000000

We can use the corr method in a pandas dataframe to calculate a
correlation matrix.

Remember that the correlation coefficient is a measure of the


existence of a linear relationship. It is possible that there is
a non-linear relationship between variables, in which case
the correlation coefficient is not a good measure (remember
that correlation does not imply causation). Another
important thing to mention is that although correlation
may be present, it does not necessarily imply that there is a
causal relationship between the variables.

6.4.2 Linear Regression

Now that we have a way to quantify the correlation we


may want to take a look at building a model to describe the
relationship. We mentioned that if the relationship is linear

we can build a linear regression model. Let us take a look at


doing exactly that.

Our model needs to capture the relationship between the


dependent variable y and the independent variables xi .
Since the relationship is assumed to be linear our model can
be written as:
y = Xβ + ε,   (6.15)   A linear regression model.

where β = ( β 0 , . . . , β i ) T are the regression coefficients and


we aim to find their value; ε are the residuals which are
random deviations and are assumed to be independent and
identically normally distributed.

One way to find the regression coefficients is to minimise


the sum of squared residuals¹⁰:

SSR = ε²,   (6.16)
    = |Y − Xβ|²,
    = (Y − Xβ)^T (Y − Xβ),
    = Y^T Y − β^T X^T Y − Y^T Xβ + β^T X^T Xβ.   (6.17)

¹⁰ Rogel-Salazar, J. (2017). Data Science and Analytics with Python.
Chapman & Hall/CRC Data Mining and Knowledge Discovery Series.
CRC Press

Since we require the minimum of the SSR quantity above,


we take its derivative with respect to each of the β i
parameters and set the result equal to zero. This leads us to the
following expression:

∂(SSR)/∂β_i = −X^T Y + (X^T X) β = 0.   (6.18)

We take the derivative to obtain the minimum.

The regression coefficients are therefore given by:

β = (X^T X)^{−1} X^T Y.   (6.19)

This is the solution to the linear model given in Equation (6.15).
We refer to Equation (6.19) as the normal equation
associated with the regression model, and this method is
known as Ordinary Least Squares or OLS for short.
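Before reaching for a library, here is a minimal sketch of
Equation (6.19) with NumPy, assuming the cleaned cities dataframe
created earlier in this section; the coefficients should agree
with the Statsmodels fit below.

import numpy as np

y = cities['Population'].values
# Design matrix with a column of ones for the intercept
X = np.column_stack((np.ones(len(y)),
                     cities['Approx city radius km'].values))
# The normal equation of Equation (6.19)
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta)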

To calculate the coefficients in Python we are able to use the


Statsmodels package that has an implementation for OLS.
Let us start by importing the module. We first import the
Statsmodels package, and we will use the formula.api method
to simplify our code:

import statsmodels.formula.api as smf

We will be using the formula notation that uses a tilde (~) to


show the dependency among variables. In this case we want
to show the relationship between the city population and
the radius of the city. Before we do that let us rename the
city radius variable so that we can use the formula:

> gla_cities.columns = gla_cities.columns.str.\
      replace('Approx city radius km',
              'City_Radius')

We are now able to fit our regression model:

# This is the OLS model for the population as a function
# of the city radius
> results = smf.ols('Population ~ City_Radius',
                    data=gla_cities).fit()

Let us look at the parameters we have obtained:

> results.params

Intercept     -1.797364e+06
City_Radius    4.762102e+05

These are the coefficients of our regression model.
This indicates that the intercept of the model is
β_0 = −1.797364 × 10⁶ and the slope of the line is
β_1 = 4.762102 × 10⁵, i.e. Population = β_0 + β_1 × City_Radius.
The results object has further information for us and you can
look at a summary with the following command:

results.summary()

We will select some of the entries in the summary report


and explain what they mean:

OLS Regression Results

=====================================

Dep. Variable: Population


A regression model summary.
Model: OLS

Method: Least Squares

No. Observations: 16

Df Residuals: 14

Df Model: 1

R-squared: 0.626

Adj. R-squared: 0.600

First we can see that the model is using the Population data
as the dependent variable and that the model is the ordinary
least squares we have discussed above. We have a total of
n = 16 observations and we are using k = 1 predictive
variable, denoted as Df Model (Df residuals corresponds to the
degrees of freedom). The Df residuals is the
number of degrees of freedom in our model, calculated as
n − k − 1.

R-squared is known as the coefficient of determination
and is related to the Pearson correlation coefficient (see
Section 6.4.1 for information on the Pearson correlation) so that

ρ2 = R2 . This is perhaps the most important measure in


the summary. In our case we have an R2 = 0.626 indicating
that around 62% of the total variance is explained with
our simple model (remember that the relationship does not mean
causation!). However, we need to be careful when
only looking at R2 . This is because it is known that when
adding more explanatory variables to our regression model,
this measure will be equal or higher. This may give the
impression that the model is getting more accurate with
more variables, but that may not be true.

In order to mitigate this risk, the adjusted R2 introduces a


penalty as extra variables are included in the model:

R²_Adj = 1 − (1 − R²)(n − 1) / (n − k − 1).   (6.20)
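A quick check of Equation (6.20) with the values from the summary
above, R² = 0.626, n = 16 and k = 1, reproduces the adjusted value
reported:

R2, n, k = 0.626, 16, 1
R2_adj = 1 - (1 - R2)*(n - 1)/(n - k - 1)
print(R2_adj)   # roughly 0.60, matching Adj. R-squared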

We are also interested in knowing if the global result of the


regression is significant or not. One way to do that is to test
the null hypothesis that all of the regression coefficients are
equal to zero. If so, this would mean that the model has
no predictive capabilities (see Section 6.7.1 for more
information on the F distribution and analysis of variance).
This can be done with the help
of an F test which compares our linear model to a model
where only the intercept is present. This lets us decide if our
coefficients are such that they make our model useful. The
OLS results provide some information about the F test:

F-statistic:         23.46
Prob (F-statistic):  0.000260
Log-Likelihood:      -252.37
AIC:                 508.7
BIC:                 510.3

The F test summary for a regression model.

In this case the p-value is 0.000260. For a value of α = 0.05


we can reject the null hypothesis. We also get a measure
of the log-likelihood, ln(L), which is a way to measure
the goodness-of-fit for a model. In general, the higher this
value the better the model fits the data provided. It ranges
from minus infinity to infinity though, and on its own
the measure is not very useful. Instead you can use the
value of the log-likelihood for a model and compare it to
others. Other ways to compare and select models is via the
comparison of the Akaike and Bayesian Information Criteria,
AIC and BIC:

AIC = 2k_p − 2 ln(L),   (6.21)

BIC = k_p ln(n) − 2 ln(L),   (6.22)

where k_p is the number of estimated parameters. These
information criteria can help us select models.
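We can check Equations (6.21) and (6.22) against the summary above
using the reported log-likelihood and k_p = 2 estimated parameters
(intercept and slope):

import numpy as np

logL, kp, n = -252.37, 2, 16
print(2*kp - 2*logL)           # about 508.7 (AIC)
print(kp*np.log(n) - 2*logL)   # about 510.3 (BIC)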

Table 6.3: Results from the regression analysis performed on
the cities dataset.

              coef           std err      t       P>|t|   [0.025        0.975]
Intercept     −1.797 × 10⁶   1.24 × 10⁶   −1.453  0.168   −4.45 × 10⁶   8.56 × 10⁵
City_Radius    4.762 × 10⁵   9.83 × 10⁴    4.844  0.000    2.65 × 10⁵   6.87 × 10⁵

The summary of the model also has a breakdown of the
variables that have been included, and it helps us determine
whether the components we have chosen have an impact in
the predictability of the model. In Table 6.3 we present the
results of the individual components in our model. We can
see the coefficients for the Intercept and City_Radius in
the column called “coef”.

The standard deviation of the coefficients is shown in the


“std err” column. The column “t” has the t-statistic, helping
us determine if our estimate is statistically significant (see
Sections 5.5.1 and 6.5.1 for information on t tests). This
is where the column “ P>|t|” comes in: It shows the p-
value for the estimate, and for a given significance value we
can use this to reject (or not) the null hypothesis that the
coefficient is equal to zero. Finally, the “[0.025” and “0.975]”
columns give us the upper and lower values for our 95%
confidence interval.

There are a few other measures that are presented at the end
of the summary:

==================================================
Omnibus:        9.560    Durbin-Watson:      2.192
Prob(Omnibus):  0.008    Jarque-Bera (JB):   6.136
Skew:           1.374    Prob(JB):           0.0465
Kurtosis:       4.284    Cond. No.           34.2

Other measures included in the linear regression model summary.

The omnibus measure describes a test of normality for the


distribution of the residuals based on a chi-squared test
using skewness and kurtosis, which are also shown in the
summary (please look at Section 6.3 for information on
chi-squared testing). The p-value for the omnibus test is shown as
Prob(omnibus). Another measure of normality is given by

the Jarque-Bera measurement and its p-value is given by


Prob(JB).

One of the main assumptions in regression modelling is that


there is no correlation between consecutive residuals; this
is called autocorrelation of the residuals, and the Durbin-
Watson measurement is used to detect autocorrelation in the
residuals of our model. The null hypothesis is that there is
no correlation among residuals, and the Durbin-Watson test
statistic is calculated as follows:

d = ∑_{t=2}^{T} (ε_t − ε_{t−1})² / ∑_{t=1}^{T} ε_t²,   (6.23)

where T is the total number of observations and ε t is the


t-th residual from the regression model. The statistic above
ranges between 0 and 4, and we can interpret the results as
follows:

• d = 2 indicates no autocorrelation

• d < 2 indicates positive autocorrelation

• d > 2 indicates negative autocorrelation

Values for d such that 1.5 ≤ d ≤ 2.5 are considered to be no
cause for concern.
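A minimal sketch of Equation (6.23), using the residuals stored in
the fitted results object (its resid attribute), should come close
to the Durbin-Watson value shown in the summary:

import numpy as np

resid = results.resid.values
d = np.sum(np.diff(resid)**2) / np.sum(resid**2)
print(d)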

Finally, another important assumption in linear regression


modelling is the fact that no multicollinearity should be
present. This refers to a situation where there is perfect
or exact relationship between the regression exploratory
variables. One way to measure for this is the use of the
condition number. For an object x, the condition number
is defined¹¹ as the norm of x times the norm of the inverse
of x. For a matrix, the condition number tells us whether
inverting the matrix is numerically unstable with finite-
precision numbers. If the condition number is above 30, the
regression may have severe multicollinearity.

¹¹ Strang, G. (2006). Linear Algebra and Its Applications.
Thomson, Brooks/Cole

6.4.3 Spearman Correlation

We have decided to get schwifty (bear with me!) and join the famous

planetary Cromulon musical competition. Our song is a



blast, and we think that we will win. However, there is


a question hanging in the air: Are the scores of two of
the head judges related? (Pun definitely intended for
Cromulon judges.) Since the scores provided are
ordinal values between 0 and 10, it is not possible to use
the Pearson correlation we described in the Section 6.4.1.
Instead, we need to use the ranks that the ordinal values
provide.

It is possible to find a correlation coefficient using the ranks


of the data, and it is commonly known as the Spearman
correlation coefficient, r_s. The Spearman correlation is a
non-parametric test. To calculate it, we need to rank the
scores from lowest to highest and we simply calculate a
Pearson correlation on the ranks rather than on the values.
Note that since the ranks do not come from a normal
distribution, we cannot use our standard normal, or for that
matter, any other parametric distributions. Instead, we call
this type of test non-parametric.

Similar to the Pearson correlation, the Spearman one varies


between −1 and 1, with 0 indicating no correlation. In this
way the hypothesis testing can be expressed as follows:

• H0: The correlation between the variables is zero, i.e.
r = 0

• Ha: The correlation between the variables is not zero, i.e.
r ≠ 0

We have a two-tailed test!

For two variables that need to be checked for correlation,


when using ranks they will have the same standard
deviation σx = σy = σ and so we can write the Spearman
correlation as:
r = Cov(R(x), R(y)) / σ²,   (6.24)
where R(·) denotes the rank. If there are no values that can
be assigned the same rank, we say that there are no ties in
our data (remember that tied observations have the same value
and thus it is not possible to assign them a unique rank). We
can simplify our expression by noting that

Cov(R(x), R(y)) = σ² − (1/2) ∑ d²,   (6.25)

where d is the difference between a subject’s ranks on the


two variables in question. This may be easier to see with an
example. Let us go back to our planetary musical
competition (I wonder if this is why they call it the X factor).
For Singer X, Judge 1 has given a mark that
corresponds to rank 7, whereas the mark from Judge 2
corresponds to rank 4. The difference d for Singer X is
d = 7 − 4 = 3.

When using ranks, we have that the standard deviation is


related to the number of observations n as σ2 = (n3 − n)/12.
Substituting this expression in Equation (6.25), for the case
of no ties, the Spearman correlation can be written as:

r = 1 − ( 6/(n³ − n) ) ∑ d².   (6.26)

This is the Spearman correlation for no ties.

Let us consider the results of the Cromulon competition


considering two judges as shown in Table 6.4. If we were to
calculate the Spearman correlation manually, we would have
to order the scores for each judge separately and assign
them a rank. This will let us calculate d. If there are no
ties we can use the Equation (6.26), otherwise we need to
calculate the covariances.

Instead, we are going to rely on Python and use the


implementation of the Spearman correlation coefficient in
Table 6.4: Results of the planetary Cromulon musical competition.

Singer    A   B   C   D   E   F   G   H   I   J
Judge 1   2  10   7   8   5   7   5   8   6   4
Judge 2   4   9   8   8   2   9   4   7   6   3

SciPy called spearmanr. As with the Pearson correlation, the
implementation will return the value of the coefficient and
the p-value to help us test our hypothesis.

Let us first create a pandas dataframe with the data from


Table 6.4:

schwifty = pd.DataFrame({
    'singer': ['A', 'B', 'C', 'D', 'E', 'F', 'G',
               'H', 'I', 'J'],
    'judge1': [2, 10, 7, 8, 5, 7, 5, 8, 6, 4],
    'judge2': [4, 9, 8, 8, 2, 9, 4, 7, 6, 3]})

We can now simply pass the values to the spearmanr


function:

# We simply pass the columns of interest to spearmanr
# to obtain the Spearman correlation
> r, p = stats.spearmanr(schwifty['judge1'],
                         schwifty['judge2'])
> print('r={0}, p={1}'.format(r, p))

r=0.8148148148148148, p=0.004087718984741058

We can see that the Spearman rank correlation is r (n −


2 = 8) = 0.8148 with a p-value smaller than 0.05. Given
this information, for a 95% confidence level, we reject the
null hypothesis and say that the values in the data are
monotonically correlated.
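As a quick check, the Spearman coefficient is simply a Pearson
correlation computed on the (average) ranks of the scores, which
is how ties are handled; the value should match the one returned
by spearmanr:

import numpy as np
from scipy.stats import rankdata

r1 = rankdata(schwifty['judge1'])
r2 = rankdata(schwifty['judge2'])
print(np.corrcoef(r1, r2)[0, 1])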

6.5 Hypothesis Testing with One Sample

We have a pretty good idea about how hypothesis


testing works, and in Section 6.1 we have covered a lot
of the things we need to consider when formulating a
hypothesis and testing it statistically. A good example for
using hypothesis testing is when we want to check if the
estimated parameter from a population is significantly
different from a hypothesised population value. We call this
type of test a one-sample test.

This is similar to the treatment we addressed in Section 6.1


where we were using the standard normal z-distribution.
For datasets where we know the population variance, or
the sample size, n, is big, we can use the z-distribution, but
in cases where the sample size is smaller, we are better off
using the t-distribution for our test. (For n > 30, we can use
the z-distribution, although many computational packages
implement a t-test.)

6.5.1 One-Sample t-test for the Population Mean

A typical case for the use of sample testing is to check
the mean value of a normally distributed dataset against a
reference value (we covered the use of a z-test in Section 6.1.1).
Our test statistic is the Student t-distribution we encountered
in Section 5.5.1, given by the following expression:

t = (X̄ − µ) / (s/√n),   (6.27)

with ν = n − 1 degrees of freedom.


Our null hypothesis is that the mean is equal to a
hypothesised value µ0, i.e. H0: µ = µ0. Our alternative
hypothesis may be any of the following (the hypotheses for the
one-sample t-test):

• Ha: µ > µ0,

• Ha: µ < µ0,

• Ha: µ ≠ µ0.

In the first two cases we have a one-tailed test, and in the


last one, a two-tailed test.

Let us revisit our test for the Model-5 toaster: The Caprica
City manufacturer is still claiming (see Section 6.1) that their toasters have a

resistor of 48Ω. This time we have a set of 12 measurements


for the resistor:

[ 54.79, 59.62, 57.33,


48.57, 51.14, 25.78,
53.50, 50.75, 43.34,
45.81, 43.88, 60.47 ]

Our null hypothesis is H0 : µ = 48Ω, the alternative


hypothesis is Ha : µ 6= 48. We can calculate the mean and
standard deviation of the sample with Python as follows:

# We need the sample mean and standard deviation
data = np.array([54.79, 59.62, 57.33, 48.57,
                 51.14, 25.78, 53.50, 50.75, 43.34, 45.81,
                 43.88, 60.47])

xbar = np.mean(data)
s = np.std(data, ddof=1)

Let us take a look at the values:

> print(xbar)

49.58166666666668

> print(s)

9.433353624563248

Our test statistic is therefore given by:

t = (49.58 − 48) / (9.43/√12) = 0.58081.   (6.28)

We use the values obtained to calculate our test statistic.

In Python we can calculate the statistic in one go, together


with the corresponding p-value, with the help of the
t-distribution in the stats module of SciPy:

from scipy.stats import t

mu = 48
n = len(data)
# We can obtain the p-value with the help of the survival function
ts = (xbar-mu)/(s/np.sqrt(n))
pval = 2*t.sf(ts, n-1)

print(pval)

0.5730709487398256
0.5730709487398256

We have used the survival function for the t-distribution in


the calculation above. We are multiplying the result by 2 as
this is a two-sided test.

This is such a common process that there is a function in the


stats module of SciPy to calculate the test statistic and its p-
value. All we need to do is to call the ttest_1samp function,
which takes an array of data and a popmean parameter for
the expected value in the null hypothesis.

# We can calculate the one-sample t test with ttest_1samp
> from scipy.stats import ttest_1samp

We can also ask if we are doing a one- or two-sided test


with the help of the alternative parameter which takes
values 'two-sided', 'less' or 'greater'. In our case we
have that:

> ttest_1samp(data, popmean=mu,
              alternative='two-sided')

Ttest_1sampResult(
    statistic=0.5808172016728843,
    pvalue=0.5730709487398256)

The values are the same that we obtained before.

The values obtained are the same as the ones we calculated


above using the Student’s t-distribution directly. Let us take
a look at the results we have obtained: For a significance
level of α = 0.05, given the p-value obtained, we are not
able to reject the null hypothesis (we can use the p-value to
interpret our results). We can therefore conclude
that there is no evidence in the data we have that the mean
resistance for the resistor in the Model-5 toaster is different
from 48Ω.

We can build the following function to help us with the


interpretation of the results for other t-tests:
def t_test(data, mu, alt='two-sided', alpha=0.05):
    # A helpful function to interpret a one-sample t test
    stat, p = ttest_1samp(data, mu, alternative=alt)
    print('stat = {0:.4f}, p-value= {1:.4f}'.format(stat, p))
    if p > alpha:
        print("Can't reject the null hypothesis. There "
              "is no evidence to suggest that the mean is "
              "different from {0}.".format(mu))
    else:
        print("Reject the null hypothesis. The mean "
              "is different from {0}.".format(mu))

We can use this function to solve our problem:

> t_test(data, mu)

stat = 0.5808, p-value= 0.5731
Can't reject the null hypothesis. There
is no evidence to suggest that the mean is
different from 48.

We see the results and interpretation in one go.

6.5.2 One-Sample z-test for Proportions

We have seen how we can use a one-sample test for the


mean of a population. Another useful thing for which we
can use the test is the proportion of successes in a single
population to a known or hypothesised proportion p0 (we can
use a one-sample test to check proportions). Let
us motivate the discussion with an example: We have been

assessing Dr Gaius Baltar’s Cylon detector techniques and


he claims that the proportion of Cylons in any one ship of
the convoy is 40%. We select a simple random sample of
the crew and people aboard the Battlestar Galactica and
test whether they are Cylon or not (we are not at liberty to
disclose the details of the Cylon detector technique; surely you
understand...). The proportion in the
sample population surely will be different from Dr Baltar’s
claim. The question is whether the difference is statistically
significant or not.

Our null hypothesis is that the population proportion is the


same as the hypothesised value p0 , i.e. H0 : p = p0 . The
alternative hypothesis can be any of the following:

• Ha: p > p0,

• Ha: p < p0,

• Ha: p ≠ p0.

These are the hypotheses for a one-sample proportion test.

As before, the first two cases refer to one-tailed tests, and


the third one is a two-tailed one.

Our one-sample proportion test statistic is given by:

z = (p − p0) / √( p0(1 − p0)/n ),   (6.29)

where p is the estimated proportion and n is the sample


size. Going back to our Cylon detection example, after
taking a look at our data, we find that the observed sample
proportion is 36 in a sample of size n = 100. We can use
the same logic for the testing done in Section 6.1.1. Instead,
we are going to look at the proportions_ztest function in
the Statsmodels module for Python. The function takes the
following parameters:

• count – The number of successes in the number of trials


or observations

• nobs – The number of trials or observations, with the
same length as count

• value – The value used for the null hypothesis

• alternative – Whether we require a two-sided or


one-sided test. Options are ’two-sided’, ’smaller’ or
’larger’

Let us take a look at our test:

# Let us look at the Cylon detector values
> from statsmodels.stats.proportion import proportions_ztest
> stat, pval = proportions_ztest(count=40,
                                 nobs=100, value=0.36)
> print('stat = {0:.4f}, p-value= {1:.4f}'.\
        format(stat, pval))

stat = 0.8165, p-value= 0.4142

We can see that our test statistic is 0.816 and the


corresponding p-value is 0.4142. Since this value is not
lower than α = 0.05, for a 95% confidence level we are not able to reject the null
hypothesis. In other words, we do not have enough
evidence to say that the proportion of Cylons is different
from 40%. We will find you out Baltar!
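As a quick sanity check, Equation (6.29) with the hypothesised
proportion in the standard error gives a statistic of the same
magnitude and the same p-value as the output above:

import numpy as np
from scipy.stats import norm

p, p0, n = 0.36, 0.40, 100
z = (p - p0) / np.sqrt(p0*(1 - p0)/n)
# Magnitude about 0.8165, two-sided p-value about 0.414
print(z, 2*norm.sf(abs(z)))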

Let us look at a second example. In this case we have


received a report that 75% of Starfleet Academy applicants
in general indicated that they have learnt Python in the past
year (great to see that Python has survived over centuries).
The recruiting officer wants to assess whether
applicants from Vulkan have also learned Python in the

same proportion. A sample of 256 Vulkan applicants are


surveyed, and the results indicate that 131 of them have
knowledge of Python. Is there a significant difference in the
Vulkan applicants compared to the overall population? Well,
let us take a look:

> p0 = 0.75
> n = 256
> p = 131/n
> count = n*p0
> zscore = (p-p0)/np.sqrt((p0*(1-p0)/n))
> print(zscore)

-8.804591605141793

We can use the z-score to assess our null hypothesis, or we


can use the proportions_ztest. In this case let us create a
function to help us with the interpretation:

def proportion_test(count, n, value, alpha=0.05):
    # A helpful function to interpret our results for proportions
    stat, p = proportions_ztest(count, n, value)
    print('stat = {0:.4f}, p-value= {1:.4f}'.format(stat, p))
    if p > alpha:
        print("Can't reject the null hypothesis. There "
              "is no evidence to suggest that the proportion "
              "is different from {0}".format(count/n))
    else:
        print("Reject the null hypothesis. The "
              "proportion is different from {0}.".format(count/n))
Let us take a look using the values calculated above:

> proportion_test(count, n, p)

stat = 8.8046, p-value= 0.0000
Reject the null hypothesis.
The proportion is different from 0.75

We reject the null hypothesis in this case.

At a 95% confidence level we reject the null hypothesis as


the p-value obtained is lower than 0.05. We can conclude
that there is a statistically significant difference in the
proportion of Vulkan applicants to Starfleet Academy that
learn Python compared to the proportion in the overall
population of applicants. This is a result that may defy logic,
particularly for a Vulkan!

6.5.3 Wilcoxon Signed Rank with One-Sample

Up until now our discussion has assumed that the data


we have is normally distributed and we have therefore used
parametric methods for our one-sample tests. If instead we
have data that is not normally distributed, we make use of
non-parametric methods and the equivalent of the t-test
discussed in Section 6.5.1 is the one-sample Wilcoxon test.

First proposed by Frank Wilcoxon¹² in 1945, the test does
not assume that the samples are normally distributed and is
typically used to test if the samples come from a
symmetric population with a specified median, m0. The test
is sometimes referred to as the Wilcoxon t-test.
¹² Wilcoxon, F. (1945). Individual comparisons by ranking methods.
Biometrics Bulletin 1(6), 80–83
The null hypothesis is therefore that the median is the
same as the hypothesised value m0, i.e. H0: m = m0. Note that
in this case we test for the median; this is because we have
non-normally distributed data. The alternative hypothesis can
be any of the following:

• Ha: m > m0,

• Ha: m < m0,

• Ha: m ≠ m0.

Consider a set of Xi samples for our test. We calculate


| Xi − m0 | and determine the rank Ri for each observation
in ascending order according to their magnitude. The
Wilcoxon signed-rank test statistic is calculated as follows:

W = ∑_{i=1}^{n} ψ_i R_i,   (6.30)

which is the Wilcoxon signed-rank test statistic, where

ψ_i = 1 if X_i − m0 > 0,
ψ_i = 0 if X_i − m0 < 0.   (6.31)

Let us stop for a moment to consider the distribution of


the sum of a set of numbers Ui such that U = ∑i Ui ; each
number has an equal chance of being included in the sum or
not and thus P(Ui = i) = P(Ui = 0) = 1/2. Our test statistic
W has the same distribution as U. For a large number of
samples, n, we have that the expected value is given by:

E(W) = E(U) = ∑_{i=1}^{n} E(Ui).
With the information above we can calculate the expected
value as follows:

E[W] = ∑_{i=1}^{n} [ 0·(1/2) + i·(1/2) ]
     = (1/2) ∑_{i=1}^{n} i = (1/2) · n(n + 1)/2
     = n(n + 1)/4.   (6.32)

This is the expected value of W.

Note that Var (W ) = Var (U ) = ∑i Var (Ui ); this is because


the Ui are independent from each other under the null
hypothesis. The variance can be calculated as follows:

Var(Ui) = E(Ui²) − E(Ui)²
        = 0²·(1/2) + i²·(1/2) − (i/2)²
        = i²/2 − i²/4 = i²/4.   (6.33)

We will use this information to calculate the variance of W.

We have therefore that:


n n
1
Var (W ) = ∑ Var(Ui ) = 4 ∑ i2 ,
i =1 i =1

1 n(n + 1)(2n + 1)
= . (6.34) The variance of W.
4 6

We can use this information to create our test statistic:

W′ = (W − E(W)) / √Var(W),   (6.35)

which is a standardised Wilcoxon test statistic that follows an
approximate standard normal distribution.



The implementation of the Wilcoxon test in scipy.stats is


called wilcoxon and it takes an array of data from which
we need to subtract the hypothesised median (we can use the
wilcoxon implementation in SciPy). It also can take
an alternative parameter to identify whether we need a
one-sided or a two-sided test. Note that the default is the
two-sided test. Let us create a function to help with the
interpretation:

from scipy.stats import wilcoxon

def wilcoxon1s(data, m0, alt='two-sided',
               alpha=0.05):
    # A function to help us interpret the results of a
    # Wilcoxon test
    stat, p = wilcoxon(data-m0, alternative=alt)
    print('stat = {0:.4f}, p-value= {1:.4f}'.format(stat, p))
    if p > alpha:
        print("Can't reject the null hypothesis. "
              "There is no evidence to suggest that the "
              "median is different from {0}.".format(m0))
    else:
        print("Reject the null hypothesis. "
              "The median is different from "
              "{0}.".format(m0))

Let us now see the Wilcoxon test in action using Python. We


are continuing our study of Starfleet Academy applicants
(a very young commanding force at Starfleet).
We have heard that the median age at which Starfleet
officers receive the rank of lieutenant is 41 years of age. We
have collated the following information for recently
promoted officers:
Ages at which Starfleet officers got promoted to the rank of
lieutenant:

[ 32.4, 55.2, 40.5, 47.9, 33.4,
  34.3, 28.1, 43.0, 34.8, 60.2,
  48.5, 52.2, 29.7, 29.9, 26.6,
  44.4, 43.6, 50.4, 47.3, 34.2,
  38.5, 61.0, 55.3, 45.2, 58.6 ].

Let us enter the data into Python:

> lieutenants=np.array([

32.4, 55.2, 40.5, 47.9, 33.4,

34.3, 28.1, 43.0, 34.8, 60.2,

48.5, 52.2, 29.7, 29.9, 26.6,

44.4, 43.6, 50.4, 47.3, 34.2,

38.5, 61.0, 55.3, 45.2, 58.6])

and let us check our hypothesis:

> wilcoxon1s(lieutenants, 41)

stat = 132.0000, p-value= 0.4261
Can't reject the null hypothesis.
There is no evidence to suggest that
the median is different from 41

We can use our function to test our hypothesis. It seems that we
do indeed have a young Starfleet force!
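For a feel of Equations (6.30) and (6.31), here is a minimal
sketch that ranks the absolute differences from the hypothesised
median and sums the ranks of the positive and negative
differences; for the two-sided test, scipy's wilcoxon reports the
smaller of the two sums:

import numpy as np
from scipy.stats import rankdata

d = lieutenants - 41
ranks = rankdata(np.abs(d))
w_plus = ranks[d > 0].sum()
w_minus = ranks[d < 0].sum()
# The smaller sum should match the statistic reported above
print(w_plus, w_minus, min(w_plus, w_minus))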

6.6 Hypothesis Testing with Two Samples

In the previous section we concentrated on testing


whether an estimated parameter is significantly different
from a hypothesised one. In many cases, we are interested
in comparing two different populations, for example by
estimating the difference between two population means or
determining whether this difference is statistically
significant. We can use hypothesis testing to compare two
populations. We therefore require notation that lets us
distinguish between the two populations.

If the parameter to be compared is the mean, instead of


simply using µ, we may need to refer to µ1 for one
population and µ2 for the other one. The hypotheses we
formulate can be recast in different manners. For example,
we can hypothesise that the population means are equal and
test µ1 = µ2 . This is equivalent to checking for the difference
to be equal to zero, i.e. µ1 − µ2 = 0. In the same manner we
can express the test that µ1 > µ2 as µ1 − µ2 > 0 and µ1 < µ2
as µ1 − µ2 < 0.

One important consideration that we need to make is how


the data for each population is sampled. We say that the
samples are independent if the selection for one of the
populations does not affect the other (independent populations
do not affect each other's selection). Another option is to
select the populations in a way that they are related. Think,
for example, of a study of a “before” and “after” treatment
(paired populations can be thought of as “before” and “after”
situations).

In this case the samples are said to be paired. In this section


we will take a look at how to run hypothesis testing for
independent and paired populations.

6.6.1 Two-Sample t-test – Comparing Means, Same Variances

For independent samples, we know that we can


estimate the population mean µi with the sample mean xi .

We can therefore use $\bar{x}_1 - \bar{x}_2$ as an estimate for $\mu_1 - \mu_2$. The
mean value of the sampling distribution of $\bar{x}_1 - \bar{x}_2$ is
$\mu_{\bar{x}_1 - \bar{x}_2} = \mu_{\bar{x}_1} - \mu_{\bar{x}_2} = \mu_1 - \mu_2$. Similarly, the variance
of $\bar{x}_1 - \bar{x}_2$ is given by
$\sigma^2_{\bar{x}_1 - \bar{x}_2} = \sigma^2_{\bar{x}_1} + \sigma^2_{\bar{x}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$.

We use the sample to estimate the population parameters.

A typical test to perform is the comparison of two


population means. The null hypothesis can be formulated as
H0 : µ1 − µ2 = c, where c is a hypothesised value. If the test
is to check that the two population means are the same, we
have that c = 0. Our alternative hypothesis may be any of
the following: The hypotheses for the
comparison of two-sample means.
• Ha : µ1 − µ2 > c,

• Ha : µ1 − µ2 < c,

• Ha : µ1 − µ2 ≠ c.

When the number of observations n1 and n2 for our samples


is large, their distributions are approximately normal, and
if we know the standard deviations, we can calculate a
z-distribution such that:

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}. \qquad (6.36)$$

The test statistic for our two-sample test.

However, in practice we may not know the standard


deviations of the populations and instead we use the Homoscedasticity describes a
situation where our random
standard deviations, si , for the samples. If we are confident
samples have the same variance.
that the population variances of the two samples are equal, When the opposite is true we use
we can use a weighted average of the samples to estimate the term heteroscedasticity.

the population variance. For group 1 with sample variance



$s_1^2$, and group 2 with sample variance $s_2^2$, the pooled
variance can be expressed as:

$$s_p^2 = \frac{s_1^2 (n_1 - 1) + s_2^2 (n_2 - 1)}{n_1 + n_2 - 2}. \qquad (6.37)$$

We can use this information to calculate a t-distribution


such that:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}, \qquad (6.38)$$

a test statistic for two samples with pooled variance, which approximates a t-distribution with $df = n_1 + n_2 - 2$ degrees of freedom.
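To make the link between Equations (6.37) and (6.38) concrete, here is a minimal sketch that computes the pooled variance and the t statistic directly with NumPy for two made-up samples (the arrays are illustrative only, not part of the running example) and checks the result against SciPy's ttest_ind:

import numpy as np
from scipy import stats

# Two illustrative samples (made-up numbers, for demonstration only)
x1 = np.array([17.2, 18.1, 16.5, 19.0, 17.8, 18.4])
x2 = np.array([20.3, 21.1, 19.8, 22.0, 20.6, 21.4])

n1, n2 = len(x1), len(x2)

# Pooled variance, Equation (6.37)
sp2 = ((n1 - 1)*x1.var(ddof=1) + (n2 - 1)*x2.var(ddof=1)) / (n1 + n2 - 2)

# Test statistic, Equation (6.38), testing mu1 - mu2 = 0
t = (x1.mean() - x2.mean()) / np.sqrt(sp2 * (1/n1 + 1/n2))

# Two-sided p-value from a t-distribution with n1 + n2 - 2 degrees of freedom
df = n1 + n2 - 2
p = 2 * stats.t.sf(abs(t), df)

print('manual: t = {0:.4f}, p = {1:.4f}'.format(t, p))
print('scipy :', stats.ttest_ind(x1, x2, equal_var=True))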

Let us look at an example using Python. In this case we


will revisit the cars dataset we used in Chapter 4 in Section Check Section 4.3 for more info on
the cars dataset used here.
4.3. We saw that the fuel consumption in miles per gallon
for cars in the dataset with automatic (0) and manual (1)
transmissions had different means:

import pandas as pd

cars = pd.read_csv(DATA/'cars.csv')
ct = cars.groupby(['am'])
ct['mpg'].mean()

am
0    17.147368
1    24.392308
Name: mpg, dtype: float64

We see a difference in the fuel consumption of manual and automatic cars in our dataset. Is that difference significant?

We are now better equipped to answer the question as to


whether the difference is statistically significant and we can

do this with a two-sample t-test! First let us create arrays for


our automatic and manual observations:

automatic = cars[cars['am']==0][['mpg']]
manual = cars[cars['am']==1][['mpg']]

We create arrays with the data for clarity on the analysis.

Let us check whether the data is normally distributed. We


can use the Shapiro-Wilk function described in Section 6.2.2
to test for normality.

> shapiro_test(automatic, alpha=0.05)

stat = 0.9768, p-value= 0.8987
Can't reject the null hypothesis.
The data seems to be normally distributed.

We can use the tests covered in Section 6.2 to test for normality.

> shapiro_test(manual, alpha=0.05)

stat = 0.9458, p-value= 0.5363

Can’t reject the null hypothesis.

The data seems to be normally distributed.

We would like to see if the means are different and thus


we use a two-sided test. Let us first take a look at the two-
sample t-test implemented in SciPy. The function is called In SciPy we have the ttest_ind
function for a two-sample t-test.
ttest_ind and it takes a parameter called alternative

with the default value 'two-sided'. We can also specify
whether the variances are assumed to be equal
(homoscedasticity) with the parameter equal_var. In this
case we use the value True.

import scipy.stats as sps

tstat, p = sps.ttest_ind(automatic, manual,
    alternative='two-sided', equal_var=True)

We need to pass each of the arrays to compare. We can also specify the type of test and whether the variances are assumed to be equal.

Let us look at the results:

> print('stat = {0:.4f}, p-value = {1:.4f}'.
    format(tstat, p))

stat = -4.1061, p-value = 0.0003

There is enough evidence to say that the difference in fuel consumption is statistically significant.

With a test statistic of −4.1061 we have a p-value less than


0.05. For a 95% confidence level we reject the null
hypothesis. We have sufficient evidence to say that the
difference in mean fuel consumption between automatic and
manual transmission cars in our dataset is statistically
significant.

We can also use the implementation of the two-sample t-test


in the Statsmodels package. The name of the function is the
same ttest_ind and we are distinguishing them with a
method notation. The implementation also takes an Refer to Chapter 2 about the
method notation in Python.
alternative parameter, and in this case the assumption of
equal variances is specified with the parameter
usevar='pooled':

import statsmodels.stats.weightstats as smw

tstat, p, df = smw.ttest_ind(automatic, manual,
    alternative='two-sided',
    usevar='pooled')

Let us look at the results:

> print('stat = {0:.4f}, p-value = {1:.4f}, '
    'df = {2}'.format(tstat, p, df))

stat = -4.1061, p-value = 0.0003, df = 30.0

The Statsmodels implementation provides the degrees of freedom.

Note that the values obtained are the same, but we get the
degrees of freedom with the Statsmodels implementation.

6.6.2 Levene’s Test – Testing Homoscedasticity

We mentioned in the previous section that one of the


assumptions used in the sample t-test is homoscedasticity. The Levene test checks for
homoscedasticity, i.e. equal
In other words, the implementation of the test relies on
variances.
the assumption that the variances of the two groups are
equal. It is therefore fair to consider a statistical test for this
situation and this is the main objective of the Levene test.
The test was introduced by Howard Levene13 in 1960 for the
mean and then extended by Morton B. Brown and Alan B.
Forsythe14 for the median and the trimmed mean.

13 Levene, H. (1960). Robust tests for equality of variances. In I. Olkin and H. Hotelling (Eds.), Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, pp. 278–292. Stanford University Press

14 Brown, M. B. and A. B. Forsythe (1974). Robust tests for the equality of variances. Journal of the American Statistical Association 69(346), 364–367

The hypotheses in the Levene test are:

• H0 : $\sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2$,

• Ha : $\sigma_i^2 \neq \sigma_j^2$, for at least one pair (i, j).

The test statistic W is defined for a variable Y with sample
size N divided into k groups, with $N_i$ being the sample size
of group i:

$$W = \frac{N - k}{k - 1} \cdot \frac{\sum_{i=1}^{k} N_i (Z_{i\cdot} - Z_{\cdot\cdot})^2}{\sum_{i=1}^{k} \sum_{j=1}^{N_i} (Z_{ij} - Z_{i\cdot})^2}, \qquad (6.39)$$

where:

$$Z_{ij} =
\begin{cases}
|Y_{ij} - \bar{Y}_i|, & \bar{Y}_i \text{ is the mean of the } i\text{-th group (Levene test)},\\
|Y_{ij} - \tilde{Y}_i|, & \tilde{Y}_i \text{ is the median of the } i\text{-th group (Brown-Forsythe test)},
\end{cases}$$

$$Z_{i\cdot} = \frac{1}{N_i} \sum_{j=1}^{N_i} Z_{ij} \text{ is the mean of the } Z_{ij} \text{ for group } i,$$

$$Z_{\cdot\cdot} = \frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{N_i} Z_{ij} \text{ is the mean of all } Z_{ij}.$$
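As a rough sketch of how Equation (6.39) comes together, the following computes W by hand for the Brown-Forsythe variant (deviations from the group medians) on two made-up groups and compares the value with scipy.stats.levene, which is described next; the data here is purely illustrative:

import numpy as np
from scipy import stats

# Made-up groups, for illustration only
groups = [np.array([12.1, 14.3, 13.7, 15.2, 12.9]),
          np.array([18.4, 16.2, 19.9, 17.1, 20.3])]

k = len(groups)
N = sum(len(g) for g in groups)

# Z_ij: absolute deviations from each group's median (Brown-Forsythe)
Z = [np.abs(g - np.median(g)) for g in groups]
Zbar_i = [z.mean() for z in Z]           # group means of the Z_ij
Zbar = np.concatenate(Z).mean()          # grand mean of all Z_ij

num = sum(len(z) * (zb - Zbar)**2 for z, zb in zip(Z, Zbar_i))
den = sum(((z - zb)**2).sum() for z, zb in zip(Z, Zbar_i))

W = (N - k)/(k - 1) * num / den
print('manual W = {0:.4f}'.format(W))
print('scipy   :', stats.levene(*groups, center='median'))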

In Python we can use the implementation in SciPy called


levene that takes a parameter called center which can be We can use the levene function in
SciPy.
any of ’median’, ’mean’, ’trimmed’, with the median
being the default.

We can take a look at the cars data we used before and


check if indeed our assumption of homoscedasticity was
correct:

> import scipy.stats as sps
> tstat, p = sps.levene(automatic, manual)
> print('stat = {0:.4f}, p-value= {1:.4f}'.
    format(tstat, p))

stat = 4.1876, p-value= 0.0496

The variances for manual and automatic cars are not equal. What do we do now? See the next section!

With α = 0.05, we can see that we reject the null hypothesis


in this case. Sadly, the variances in our data appear to be
different. So, what do we do in cases like this? Well, let us
see in the next section.

6.6.3 Welch's t-test – Comparing Means, Different Variances

Comparing two means when the datasets are not We use Welch’s test when
comparing means of two samples
homoscedastic requires a different approach to the one
with different variances.
discussed in Section 6.6.1. Using a pooled variance, as
we did before, works well to detect evidence to reject H0
if the variances are equal. However, the results can lead
to erroneous conclusions in cases where the population
variances are not equal, i.e. we have heteroscedasticity.
The test was proposed by Bernard Lewis Welch15 as an
adaptation to the Student t-test.

15 Welch, B. L. (1947). The generalization of 'Student's' problem when several different population variances are involved. Biometrika 34(1-2), 28–35
The approach is still the same for our null and alternative
hypotheses and the main difference is the consideration of
the variances. In this case, since we do not assume equality,
we have that the test statistic is

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}, \qquad (6.40)$$

which approximates a t-distribution with degrees of
freedom given by the Welch-Satterthwaite equation16:

$$\nu = \frac{(V_1 + V_2)^2}{\dfrac{V_1^2}{n_1 - 1} + \dfrac{V_2^2}{n_2 - 1}}, \qquad (6.41)$$

where $V_i = \dfrac{s_i^2}{n_i}$.

16 Satterthwaite, F. E. (1946). An approximate distribution of estimates of variance components. Biometrics Bulletin 2(6), 110–114
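A short sketch of Equations (6.40) and (6.41), using two made-up samples for illustration, shows how the Welch statistic and its degrees of freedom are obtained by hand:

import numpy as np

# Made-up samples, for illustration only
x1 = np.array([17.2, 18.1, 16.5, 19.0, 17.8, 18.4])
x2 = np.array([20.3, 21.1, 19.8, 22.0, 20.6, 21.4, 19.5, 22.3])

n1, n2 = len(x1), len(x2)
V1 = x1.var(ddof=1) / n1
V2 = x2.var(ddof=1) / n2

# Welch's t statistic, Equation (6.40), testing mu1 - mu2 = 0
t = (x1.mean() - x2.mean()) / np.sqrt(V1 + V2)

# Welch-Satterthwaite degrees of freedom, Equation (6.41)
nu = (V1 + V2)**2 / (V1**2/(n1 - 1) + V2**2/(n2 - 1))

print('t = {0:.4f}, df = {1:.4f}'.format(t, nu))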

One advantage of using this approach is that if the variances


are the same, the degrees of freedom obtained with Welch’s I recommend using the Welch
test as we do not have to concern
test are the same as the ones for the Student t-test. In a
ourselves with assumptions about
practical way, it is recommended to use the Welch test as the variance.
you do not have to concern yourself with assumptions about
the variance. Note that the degrees of freedom in Welch’s
t-test are not integers and they need to be rounded in case
you are using statistical tables. Fortunately we are using
Python.

In Python, the implementation of Welch’s test is done using


the same methods used in Section 6.6.1. In the SciPy
implementation we need to use the parameter
equal_var=False. Let us see the results for our cars data:

> import scipy.stats as sps
> tstat, p = sps.ttest_ind(automatic, manual,
    alternative='two-sided', equal_var=False)
> print('stat = {0:.4f}, p-value= {1:.4f}'.
    format(tstat, p))

stat = -3.7671, p-value= 0.0014

The Welch test is implemented in ttest_ind with the parameter equal_var=False.

In the case of the Statsmodels implementation we have to


use the parameter usevar=’unequal’ to apply Welch’s test
to our data:

import statsmodels.stats.weightstats as smw

tstat, p, df = smw.ttest_ind(automatic, manual,
    alternative='two-sided', usevar='unequal')

For the Statsmodels implementation we use usevar='unequal'.

Let us see the results:

> print('stat = {0:.4f}, p-value = {1:.4f}, '
    'df = {2:.4f}'.format(tstat, p, df))

stat = -3.7671, p-value = 0.0014, df = 18.3323

We still reject the null hypothesis for our manual v automatic car comparison.

As before, we have a p-value smaller than 0.05 and we reject


the null hypothesis. We have sufficient evidence to say that
the difference in mean fuel consumption between automatic
and manual transmission cars in our dataset is statistically
significant.

6.6.4 Mann-Whitney Test – Testing Non-normal Samples

When we are interested in comparing samples where


we have a small number of observations, or the samples
are not normally distributed, the tests described above are
not suitable. Instead, we need to resort to non-parametric
techniques. The Mann-Whitney test is a non-parametric test,
proposed in 194717, that helps us compare the differences
between two independent, non-normal, samples of ordinal
or continuous variables.

17 Mann, H. B. and D. R. Whitney (1947, Mar). On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Statist. 18(1), 50–60

The null hypothesis H0 is that the distributions of the two


populations are equal. The typical alternative hypothesis Ha The Mann-Whitney test lets us
compare non-normally distributed
is that they are not, in other words, we have a two-sided test.
samples. It is a non-parametric
The main idea for the test is based on the comparison of test.
the ranks of the data. If we are able to identify a systematic
difference between the samples, then most of the high ranks

will be attached to one of the samples, and most of the


lower ones to the other one. If the two samples are similar,
then the ranks will be distributed more evenly.

For samples with sizes n1 and n2 , the test involves pooling


In the test we pool the
the observations from the two samples while keeping track observations of both samples
of the sample from which each observation came. The ranks and rank them keeping track of
which sample the observation
for the pooled sample range from 1 to n1 + n2 . The Mann- came from.
Whitney test statistic is denoted as U and tells us about the
difference between the ranks, and usually it is the smaller of
U1 and U2

$$U_1 = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1, \qquad (6.42)$$

$$U_2 = n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - R_2, \qquad (6.43)$$

The Mann-Whitney test statistic. We can select the larger of U1 and U2 too.

with Ri being the sum of the ranks for group i. Note that
U1 + U2 = n1 n2 . The theoretical range of the test statistic U
ranges between:

• 0: Complete separation of the two samples, and thus H0 The U statistic ranges between 0
and n1 n2 .
most likely to be rejected, and

• n1 n2 : Suggesting evidence in support of Ha .
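To see how Equations (6.42) and (6.43) work on pooled ranks, here is a minimal sketch (the two arrays are made up for illustration) that ranks the pooled observations with scipy.stats.rankdata and recovers U1 and U2; it should agree with mannwhitneyu, described next, up to which of the two statistics is reported:

import numpy as np
from scipy.stats import rankdata, mannwhitneyu

# Made-up samples, for illustration only
x = np.array([12, 15, 9, 11, 14])
y = np.array([18, 20, 13, 17, 19])

n1, n2 = len(x), len(y)

# Rank the pooled observations (ties would receive average ranks)
ranks = rankdata(np.concatenate([x, y]))
R1 = ranks[:n1].sum()
R2 = ranks[n1:].sum()

U1 = n1*n2 + n1*(n1 + 1)/2 - R1   # Equation (6.42)
U2 = n1*n2 + n2*(n2 + 1)/2 - R2   # Equation (6.43)

print('U1 = {0}, U2 = {1}, U1 + U2 = {2}'.format(U1, U2, U1 + U2))
print(mannwhitneyu(x, y))   # reports the statistic for the first sample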

In Python we can use the mannwhitneyu implementation in


SciPy’s stats module. Note that this implementation always
reports the Ui statistic associated with the first sample. As SciPy provides the mannwhitneyu
function.
with other tests we can pass an alternative parameter to
select whether we require a ’two-sided’ (default), ’less’,
or ’greater’ test.

Let us look at an example: In Section 5.5 we tackled the


mystery posed to us by the Vulcan Academy regarding a
gas similar to hexafluorine with alleged anti-gravity effects
on some Starfleet officers. Although we found that the Although the anti-gravity effects
of hexafluorine-like gas did not
anti-gravity hypothesis was not supported, colleagues
have a lot of evidence, it turns
reported some headaches that turned out to be indeed out it does cause some headaches.
caused by the gas. This presented us with a great Never a dull day at Starfleet.

opportunity to compare the effectiveness of some new pain


killers: “Headezine” and “Kabezine”. We found 15
volunteers to trial the medication and have randomly
allocated them to one or the other treatment. We have asked
them to report their perceived effectiveness in a scale from 0
(no effect) to 10 (very effective). The results are shown in
Table 6.5.

Table 6.5: Ratings for the perceived effect of two Starfleet pain relief medications.

          Headezine               Kabezine
Volunteer   Rating     Volunteer    Rating
    1         4            1          8
    2         2            2          7
    3         6            3          5
    4         2            4         10
    5         3            5          6
    6         5            6          9
    7         7            7          8
    8         8

We would like to know if there is a significant difference in Is there a difference in the


perceived effectiveness of the
the perceived effectiveness of the two pain relief
treatments? We shall see!
medications. We can start by looking at the medians of the
ratings provided:

> headezine = np.array([4, 2, 6, 2, 3, 5, 7, 8])

> kabezine = np.array([8, 7, 5, 10, 6, 9, 8])

> np.median(headezine)

4.5

> np.median(kabezine)

8.0

The medians of the two treatments seem to be different. Is this statistically significant?

We can see that the medians are different, but we want to


know if the difference is statistically significant. We can use
the Mann-Whitney test for this purpose:

> from scipy.stats import mannwhitneyu
> U, p = mannwhitneyu(headezine, kabezine)
> print('stat = {0:.4f}, p-value= {1:.4f}'.
    format(U, p))

stat = 8.5000, p-value= 0.0268

We use the Mann-Whitney test to check our hypothesis.

With a p-value lower than α = 0.05, we can reject the null


hypothesis and conclude that the difference is statistically We can reject the null hypothesis
and conclude that there is indeed
significant. Note that since we carried out a two-tailed test
a difference.
we can only strictly conclude that there is a difference in the
perceived effectiveness. If we wanted to assert that Kabezine
is perceived to be more effective among the Starfleet
volunteers, we can use a one-tailed test. Here, our
alternative hypothesis is that Headezine has a lower
median than Kabezine:

> U, p = mannwhitneyu(headezine, kabezine,
    alternative='less')
> print('stat = {0:.4f}, p-value= {1:.4f}'.
    format(U, p))

stat = 8.5000, p-value= 0.0134

To assert which sample has a higher median, we conduct a new (one-tailed) test.

We can reject the null hypothesis and conclude that indeed


Kabezine is perceived to be more effective.

6.6.5 Paired Sample t-test

We have a follow up to the Starfleet challenge with the


hexafluorine-like gas and the headaches reported by some
of our colleagues. We are interested in the self-reported Since we have a before and after
treatment case, we use a paired
scores from volunteers taking “Kabezine” before and after
sample t-test. We assume the data
treatment. Since we have a pair of measurements for each is normally distributed.
subject, we should take this relationship into account. This
is a good scenario to use a paired sample t-test. Sometimes
this is referred as a dependent samples t-test.

The null hypothesis is that the population means are equal.


This can be expressed in terms of the difference as follows:
H0 : µ1 − µ2 = c, where c is the hypothesised value. A
typical case is for c = 0. The alternative hypothesis can be
any of the following: The hypotheses for a paired
sample t-test.
• Ha : µ1 − µ2 > c,

• Ha : µ1 − µ2 < c,

• Ha : µ1 − µ2 ≠ c.

Notice that we are interested in the difference between the


values of each subject in the data. With n sample differences,
our test statistic is therefore given by

$$t = \frac{\bar{x}_d - c}{s_d / \sqrt{n}}, \qquad (6.44)$$

The test statistic for a paired sample t-test.

where xd is the mean of the sample differences and sd


is their standard deviation. The degrees of freedom are
ν = n − 1.
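Equation (6.44) is straightforward to evaluate directly. A minimal sketch, using made-up before and after measurements purely for illustration, computes the statistic by hand and checks it against scipy.stats.ttest_rel:

import numpy as np
from scipy import stats

# Made-up before/after measurements, for illustration only
before = np.array([6.1, 7.4, 5.9, 6.8, 7.0, 6.3])
after = np.array([6.9, 7.8, 6.2, 7.5, 7.1, 7.0])

d = after - before
n = len(d)

# Test statistic, Equation (6.44), with c = 0
t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
p = 2 * stats.t.sf(abs(t), n - 1)

print('manual: t = {0:.4f}, p = {1:.4f}'.format(t, p))
print('scipy :', stats.ttest_rel(after, before))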

Table 6.6: Pre- and post-treatment general health measures of Starfleet volunteers in the Kabezine study.

Volunteer Pre-Treatment Post-Treatment Difference
1 6 7.3 1.3
2 7 8.3 1.3

3 5.4 5.7 0.3


4 7.3 8 0.7
5 6.8 5.3 −1.5
6 8 9.7 1.7
7 5.7 6.7 1
8 7 7.7 0.7
9 7.7 6.3 −1.3
10 6 6.7 0.7
11 4.7 5 0.3
12 4.1 5.1 1
13 5.2 6 0.8
14 6.3 8.8 2.5
15 6 6 0
16 6.7 8 1.3
17 4 6 2
18 7.3 8.3 1
19 5 6.4 1.4
20 5.7 5.7 0

Let us go back to our Kabezine study. This time we have


requested for volunteers affected by the hexafluorine-like

gas to have a medical examination with Dr McCoy before


and after treatment with Kabezine. The measures provided Dr McCoy has captured data in
a continuous scale from 1 to 10,
by the medical examination are numbers in the range 1-10
with the higher the number, the
in a continuous scale, with 10 being great general health. better the health of the volunteer.
The results are shown in Table 6.6. Let us look at entering
our data in Python:

pre = [6.0, 7.0, 5.4, 7.3, 6.8, 8.0,

5.7, 7.0, 7.7, 6.0, 4.7, 4.1, 5.2,

6.3, 6.0, 6.7, 4.0, 7.3, 5.0, 5.7]

post = [7.3, 8.3, 5.7, 8.0, 5.3, 9.7,

6.7, 7.7, 6.3, 6.7, 5.0, 5.1, 6.0,

8.8, 6.0, 8.0, 6.0, 8.3, 6.4, 5.7]

We can use our lists to create a pandas dataframe, which


will let us create a new column for the differences:

df = pd.DataFrame({'pre': pre, 'post': post})
df['difference'] = df['post'] - df['pre']

We create a dataframe with the data from Table 6.6.

Let us see the mean and (sample) standard deviation for the
differences:

> df['difference'].describe()

These are the descriptive statistics for our Kabezine study.
count 20.000000 25% 0.300000

mean 0.755000 50% 0.900000

std 0.980588 75% 1.300000

min -1.500000 max 2.500000

Let us look at whether the data seems to be normally


distributed. In this case we will use the Shapiro-Wilk test

function from Section 6.2.2. First, the pre-treatment


measurements:

> shapiro_test(df['pre'])

stat = 0.9719, p-value= 0.7944
Can't reject the null hypothesis.
The data seems to be normally distributed.

We are using the Shapiro-Wilk test to check for normality in our data.

And the post-treatment ones:

> shapiro_test(df['post'])

stat = 0.9453, p-value= 0.3019

Can’t reject the null hypothesis.

The data seems to be normally distributed.

So far so good. We have that the mean of the differences


is xd = 0.755 with a sample standard deviation equal to We can use the ttest_rel function
in SciPy to perform a paired
0.9805. We can use this to build our test statistic. In Python
sample t-test.
we can use the ttest_rel function in SciPy's stats module.
The default for alternative is 'two-sided'.

> from scipy.stats import ttest_rel
> tstat, p = ttest_rel(df['pre'], df['post'])
> print('stat = {0:.4f}, p-value= {1:.4f}'.
    format(tstat, p))

stat = -3.4433, p-value= 0.0027



With a p-value lower than 0.05 we can reject the null


For a 95% confidence level,
hypothesis and conclude that there is strong evidence that, we reject the null hypothesis.

on average, Kabezine does offer improvements to the Kabezine does offer improvements
to Starfleet officers.
headaches in Starfleet colleagues affected by the gas under
investigation.

6.6.6 Wilcoxon Matched Pairs

As you can imagine, we may have a situation when


a parametric test is not suitable with paired data, either The Wilcoxon matched pairs test is
non-parametric.
because we have a small dataset, or the data is not normally
distributed. A non-parametric alternative to the paired t-test
is the Wilcoxon matched pairs test.

The paired test follows the same logic as the single sample
version we saw in Section 6.5.3. The null hypothesis is that
the medians of the populations are equal. As before we can
express this as H0 : m1 − m2 = c. The alternative hypothesis
can be any of the following: The hypotheses for a Wilcoxon
matched pairs test.
• Ha : m1 − m2 > c,

• Ha : m1 − m2 < c,

• Ha : m1 − m2 ≠ c.

In this case we consider a set of paired samples $(X_i, Y_i)$.
For each observation we calculate $|X_i - Y_i|$ and determine the
rank, $R_i$, compared to the other observations. The Wilcoxon
signed-rank test statistic is calculated in the same way as
before:

$$W = \sum_{i=1}^{n} \psi_i R_i, \qquad (6.45)$$

The Wilcoxon test statistic is the same as in Section 6.5.3.

where,

$$\psi_i =
\begin{cases}
1 & \text{if } X_i - Y_i > 0,\\
0 & \text{if } X_i - Y_i < 0.
\end{cases} \qquad (6.46)$$

The normalised test statistic remains the same too:

$$W' = \frac{W - E(W)}{\sqrt{\mathrm{Var}(W)}}. \qquad (6.47)$$

The normalised test statistic for the Wilcoxon matched pairs test.
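As a rough sketch of Equations (6.45)–(6.47), the following ranks the absolute paired differences with scipy.stats.rankdata and sums the ranks of the positive differences. The data is made up for illustration, zero differences are ignored for simplicity, and the standard expressions E(W) = n(n+1)/4 and Var(W) = n(n+1)(2n+1)/24 are used for the normalisation:

import numpy as np
from scipy.stats import rankdata

# Made-up paired observations, for illustration only
x = np.array([8.1, 7.4, 6.9, 9.0, 7.7, 8.3])
y = np.array([7.2, 7.9, 6.1, 8.2, 8.5, 7.6])

d = x - y
ranks = rankdata(np.abs(d))          # R_i: ranks of |X_i - Y_i|
psi = (d > 0).astype(int)            # psi_i = 1 when X_i - Y_i > 0

W = (psi * ranks).sum()              # Equation (6.45)

# Normalisation, Equation (6.47)
n = len(d)
EW = n*(n + 1)/4
VarW = n*(n + 1)*(2*n + 1)/24
W_prime = (W - EW) / np.sqrt(VarW)

print('W = {0}, normalised W = {1:.4f}'.format(W, W_prime))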

Given the similarities with the one-sample test, in Python


we actually use the same wilcoxon function. The main
difference is that we pass two arrays instead of one. We
can continue using the alternative parameter to identify
whether we need a one-sided or a two-sided test.

Table 6.7: Pre- and post-treatment general health measures of USS Cerritos volunteers in the Kabezine study.

Volunteer Pre-Treatment Post-Treatment Difference
1 8.2 8.3 0.1
2 7.3 9.3 2.0

3 7.2 9.6 2.4


4 8.0 9.1 1.1
5 1.6 2.0 0.4
6 9.6 8.8 −0.8
7 4.0 9.9 5.9
8 4.7 4.0 −0.7
9 7.9 7.4 −0.5
10 2.9 3.5 0.6
11 5.0 8.5 3.5
12 2.0 5.8 3.8

Let us consider an example: It turns out that Commander


T’Ana of the USS Cerritos is interested in reproducing
the results of our study of Kabezine and enlists a limited
number of volunteers to assess the feasibility of a study.

With a small dataset, Commander T’Ana is better using a


nonparametric test. The results of the small study are as
shown in Table 6.7. Let us capture the data:

cerritos_pre = [8.2, 7.3, 7.2, 8, 1.6, 9.6,
                4, 4.7, 7.9, 2.9, 5, 2]
cerritos_post = [8.3, 9.3, 9.6, 9.1, 2, 8.8,
                 9.9, 4, 7.4, 3.6, 8.5, 5.8]

cerritos_df = pd.DataFrame({'pre': cerritos_pre,
                            'post': cerritos_post})

We create a pandas dataframe. Make sure you import pandas.

Let us calculate the difference between post- and


pre-treatment and look at the median:

> cerritos_df['difference'] = cerritos_df['post'] - cerritos_df['pre']
> cerritos_df['difference'].median()

0.8999999999999999

We can calculate the median difference between post- and pre-treatment.

The null hypothesis is that the median of the differences is


equal to 0. Let us test our hypothesis:

> tstat, p = wilcoxon(cerritos_df['pre'],
    cerritos_df['post'])
> print('stat = {0:.4f}, p-value= {1:.4f}'.
    format(tstat, p))

stat = 13.5000, p-value= 0.0425

We apply the wilcoxon function to test our hypothesis.

With a p-value of 0.0425 at α = 0.05 level of significance,


we can reject the null hypothesis and conclude that the

difference is statistically significant. In this case we used a Commander T’Ana can now
start thinking of a wider study
two-sided test as we left the alternative parameter out of
including more subjects.
the command, using the default value.

6.7 Analysis of Variance

Up until now we have looked at ways in which we


can test for situations where our null hypothesis has been
a comparison of two populations, i.e. H0 : µ1 = µ2 or We have seen how to compare two
populations, but what happens
checking a single population against a hypothesised value,
when we need to compare more?
i.e. H0 : µ1 = c. In many cases, we may need to carry out
a comparison of more than two populations and running
pairwise tests may not be the best option. Instead, we look
at estimating the “variation” among and between groups.

Figure 6.4: Two possible datasets to be analysed using analysis of variance. We have three populations and their means. Although the means are the same in both cases, the variability in set 2) is greater.

We may encounter some situations similar to those shown in


Figure 6.4. We have three populations depicted by open
circles, filled circles and plus signs. In each of the cases 1)

and 2) we have a different distribution of the observations


for each population. In both cases, however, we have that
the population means are located in the same positions in The samples in 1) are more clearly
separated. The three means are
our scale. In the first situation the samples of each
different.
population are more clearly separated and we may agree by
initial inspection the three means are different from each
other.

However, in the second case we can see that there is a wider


spread in the samples for each population and there is even The samples in 2) are more
spread. The means may be better
some overlap. The separation between the means may be
explained by the variability of the
attributed to the variability in the populations rather than data.
actual differences in the mean.

In case 1) shown in Figure 6.4 the within-sample variability


is small compared to the between-sample one, whereas in ANOVA is a way of looking at
different aspects that contribute to
case 2) the overall variability observed is due to higher
the overall variance of a variable.
variability within-sample. If we can account for the
differences between sample means by variability
within-sample, then there is no reason to reject the null
hypothesis that the means are all equal. Understanding the
variation can be done with the so-called analysis of variance,
or ANOVA for short. Not only did Ronald Fisher coin the
term “variance”18 , but also popularised the ANOVA 18
R. A. Fisher (1918). The
correlation between relatives
methodology in his book Statistical Methods for Research on the supposition of mendelian
inheritance. Philos. Trans. R. Soc.
Workers19 first published in 1925. We have never looked Edinb. 52, 399–433
back and analysis of variance is a useful tool to understand.
19
Fisher, R. (1963). Statistical
Methods for Research Workers.
Biological monographs and
manuals. Hafner Publishing
Let us consider a study where we need to compare two or
Company
more populations. Typically, we would be interested in a
particular variable for the comparison. This would be a

variable that enables us to distinguish among the samples.


We call that variable a factor under investigation.
Sometimes we talk about levels to refer to values of a factor.
Let us say that we are interested in how one factor affects A one-way ANOVA looks at the
effect of one factor on the response
our response variable; in this case we want to use a
variable. A two-way ANOVA
one-factor ANOVA, also known as one-way ANOVA. looks at the effect of two factors.
Similarly, we use a two-factor ANOVA, or two-way ANOVA,
to analyse how two factors affect a response variable. In this
situation, we are also interested in determining if there is an
interaction between the two factors in question, having an
impact on the response variable. Let us take a look.

6.7.1 One-factor or One-way ANOVA

A one-factor analysis of variance is used in the


comparison of k populations, or treatments, with means
µ1 , µ2 , . . . , µk and our aim is to test the following
hypotheses:

• H0 : µ1 = µ2 = · · · = µk , One-way ANOVA hypotheses.

• Ha : at least two of the means are different.

As with other statistical tests we have seen, there are a


number of assumptions that are made in order to apply a
one-way ANOVA. We assume that:

1. The observations are independent and selected randomly


from each of the k populations. Assumptions behind a one-way
ANOVA.

2. The observations for each of the k populations, or factor


levels, are approximately normally distributed.

3. The normal populations in question have a common


variance σ2 .

For a population with k levels such that each level has ni


observations we have that the total number of observations
is $n = \sum_i^k n_i$. Let us now zoom in on the levels themselves and
say that the j-th observation in the i-th level is $x_{ij}$ with $j =
1, 2, \dots, n_i$. The sum of the $n_i$ observations in level i is denoted
as $T_i = \sum_j x_{ij}$ and thus the sum of all n observations is
$T = \sum_i T_i = \sum_i \sum_j x_{ij}$.

This is important notation used in ANOVA.

As we did in Section 6.4.2, for a linear regression, we are


interested in minimising a sum of squares. In this case we
would like to compare the sum of squares between samples
and within samples. Let us take a look. The total sum of
squares is:

$$SS_T = \sum_i \sum_j x_{ij}^2 - \frac{T^2}{n}, \qquad (6.48)$$

similarly, the sum of squares between samples is:

$$SS_B = \sum_i \frac{T_i^2}{n_i} - \frac{T^2}{n}, \qquad (6.49)$$

and so we can calculate the sum of squares within samples
by subtracting the two quantities above:

$$SS_W = SS_T - SS_B. \qquad (6.50)$$

Let us recall from Section 4.4.4 that a sample variance can


be obtained by taking the sum of squares and dividing by

the population minus one as shown in Equation (4.21). This


means that we can obtain a mean square for the quantities
above as follows:

$$MS_T = \frac{SS_T}{n - 1}, \quad \text{total mean square}, \qquad (6.51)$$

$$MS_B = \frac{SS_B}{k - 1}, \quad \text{mean square between samples}, \qquad (6.52)$$

$$MS_W = \frac{SS_W}{n - k}, \quad \text{mean square within samples}. \qquad (6.53)$$

Note that the degrees of freedom $n - 1 = (k - 1) + (n - k)$.

When the null hypothesis is true, the between- and


within-sample mean squares are equal, i.e. MSB = MSW .
Otherwise, MSB > MSW , and as the difference among the
sample means is bigger, the greater MSB will be compared
to MSW . We can therefore think of creating a test statistic
that looks at the ratio of these two mean squares, or
effectively, variances. This is called the F statistic:

We encountered the F statistic in


MSB between sample variance
F= = , (6.54) Section 6.4.2 in relationship to
MSW within sample variance
regression coefficients.

The F distribution with degrees of freedom ν1 and ν2 is the


distribution described by:

$$X = \frac{S_1/\nu_1}{S_2/\nu_2}, \qquad (6.55)$$

where S1 and S2 are independent random variables with chi-


squared distributions with respective degrees of freedom ν1
and ν2 (see Section 5.5.2 about the chi-squared distribution).
Its probability distribution function is given by:

$$f(x, \nu_1, \nu_2) = \frac{1}{B\!\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right)} \left(\frac{\nu_1}{\nu_2}\right)^{\frac{\nu_1}{2}} x^{\frac{\nu_1}{2} - 1} \left(1 + \frac{\nu_1}{\nu_2} x\right)^{-\frac{\nu_1 + \nu_2}{2}}, \qquad (6.56)$$

for real x > 0, and where B is the beta function20. The
expected value for the F distribution is given by:

$$\mu = \frac{\nu_2}{\nu_2 - 2}, \quad \text{for } \nu_2 > 2, \qquad (6.57)$$

and its variance by:

$$\sigma^2 = \frac{2 \nu_2^2 (\nu_1 + \nu_2 - 2)}{\nu_1 (\nu_2 - 2)^2 (\nu_2 - 4)}, \quad \text{for } \nu_2 > 4. \qquad (6.58)$$

20 Abramowitz, M. and I. Stegun (1965). Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables. Applied Mathematics Series. Dover Publications
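In practice these quantities are readily available in SciPy: scipy.stats.f exposes the density of Equation (6.56), the moments in Equations (6.57) and (6.58), and critical values through the inverse of the cumulative distribution discussed next. A brief sketch, with illustrative degrees of freedom:

from scipy.stats import f

nu1, nu2 = 2, 27                 # illustrative degrees of freedom

print(f.pdf(1.5, nu1, nu2))      # density, Equation (6.56)
print(f.mean(nu1, nu2))          # expected value, Equation (6.57)
print(f.var(nu1, nu2))           # variance, Equation (6.58)
print(f.ppf(0.95, nu1, nu2))     # cutoff with 95% of the distribution below it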

The cumulative distribution function of the F distribution is


given by:
$$F(x, \nu_1, \nu_2) = I_{\frac{\nu_1 x}{\nu_1 x + \nu_2}}\!\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right), \qquad (6.59)$$

where I is the regularised incomplete beta function21.

21 Abramowitz, M. and I. Stegun (1965). Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables. Applied Mathematics Series. Dover Publications

It is important to note that for unbalanced data, there are
three different ways to determine the sums of squares
involved in ANOVA. If we consider a model that includes 2
factors, say A and B, we can consider their main effects as
well as their interaction. We may represent that model as
SS(A, B, AB). We may be interested in looking at the sum of
squares for the effects of A after those of the B factor and
ignore interactions. This will be represented as:
SS(A|B) = SS(A, B) − SS(B). Another example:
SS(AB|A, B) = SS(A, B, AB) − SS(A, B) is the sum of
squares for the interaction of A and B after the main
individual effects. This brings us to the types mentioned
above:

There are different ways to consider the effects of our factors.

• Type I: Sequential sum of squares: First assigns Type I looks at the factors
sequentially.
maximum variation to A: SS( A), then to B: SS( B| A),
followed by the interaction SS( AB| B, A), and finally to
the residuals. In this case the order of the variables makes
a difference and in many situations this is not what we
want.

• Type II: Sum of squares no interaction: First assigns the Type II ignores the interaction.

variation to factor A taking into account factor B, i.e.


SS( A| B). Then the other way around, i.e. SS( B| A). We
leave the interaction out.

• Type III: Sum of squares with interaction: Here we look Type III considers the interaction.

at the effect of factor A after the effect of B and the


interaction AB, i.e. SS( A| B, AB). Then the other way
around for the main effects: SS( B| A, AB). If the
interactions are not significant we should use Type II.
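In Statsmodels, the anova_lm function we use later in this section accepts a typ argument, so the three variants can be requested directly once a model has been fitted with ols. A minimal sketch, using a small made-up two-factor dataset purely for illustration:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# A small made-up dataset with two factors, for illustration only
df_example = pd.DataFrame({
    'A': ['a1', 'a1', 'a2', 'a2', 'a1', 'a1', 'a2', 'a2'],
    'B': ['b1', 'b2', 'b1', 'b2', 'b1', 'b2', 'b1', 'b2'],
    'y': [4.1, 5.3, 6.2, 8.4, 3.9, 5.8, 6.0, 9.1]})

model = ols('y ~ C(A) * C(B)', data=df_example).fit()

print(sm.stats.anova_lm(model, typ=1))   # Type I: sequential sums of squares
print(sm.stats.anova_lm(model, typ=2))   # Type II: main effects, interaction left out
print(sm.stats.anova_lm(model, typ=3))   # Type III: effects adjusted for the interaction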

Let us take a closer look at the assumptions we are making


about the model. First we are assuming that the
observations in our dataset are normally distributed, i.e.
xij ∼ N (µi , σ2 ) for j = 1, 2, . . . , k. We can centre the We exploit the normality of our
data to express our model in
distributions and thus xij − µi = ε ij ∼ N (0, σ2 ), where ε ij
terms of variations inherent to the
denotes the variation of xij about its mean µi and can be observations.
interpreted as random variations that are part of the
observations themselves.

Now, if the overall mean is µ = ∑i µi /k, then we can assert


that ∑i (µi − µ) = 0. Let us denote the quantity in parenthesis

as Li and thus µi = µ + Li . The assertion above would Li is the mean effect of factor level
i relative to the mean µ.
indicate that the sum of Li is equal to 0. We can interpret
Li as the mean effect of factor level i relative to the mean µ.
With this information we are able to define the following
model:
xij = µ + Li + ε ij , (6.60)

for i = 1, 2, . . . , k and j = 1, 2, . . . , ni . For the case where the


null hypothesis holds, we have that µi = µ for all i and this The null hypothesis in ANOVA
is equivalent to having Li = 0 for
is equivalent to having Li = 0 for all i, implying that there
all i. In other words, we have no
are no effects for any of the available levels. As you can see, effects.
we can consider ANOVA as a regression analysis problem
with categorical variables such that for a response variable
Y we are interested to see how the levels X1 , X2 , . . . , Xk
influence the response and so we have a model that can be
written as:

We can express ANOVA as a


Y = β 0 + β 1 X1 + β 2 X2 + · · · + β k Xk + ε. (6.61)
regression analysis.

Table 6.8: Table summarising the results of an analysis of variance (ANOVA).

Source of variation   Sum of Squares   Degrees of Freedom   Mean Square   F statistic
Between samples       SS_B             k − 1                MS_B          MS_B / MS_W
Within samples        SS_W             n − k                MS_W
Total                 SS_T             n − 1

Under the null hypothesis we have that β 1 = β 2 = · · · = 0,


versus the alternative hypothesis that β j ≠ 0 for any j.

The results of the ANOVA analysis are conveniently


presented as a summary in the form of a table as shown in
Table 6.8. As usual the p-value obtained can help with our
decisions to reject, or not, the null hypothesis.

Let us now consider an example to see ANOVA in action


with Python. We will go back to our toaster manufacturer in
Caprica City and we are interested in the performance of
three models of toaster, including Model 5. The Frakking toasters!

manufacturer has provided information of ours in excess of


1500 hours for three models as shown in Table 6.9.

Table 6.9: Performance of three Caprica City toasters in hours in excess of 1500 hours of use.

Model 5 Model Centurion Model 8
14 17 24
17 19 33
12 19 28
14 22 22
22 18 29
19 17 32
16 18 31
17 19 28
15 17 29
16 16 30

Let us capture this data into NumPy arrays and look at the
means. Assuming we have already imported NumPy:

> five = np.array([14, 17, 12, 14, 22, 19, 16, 17,
    15, 16])
> centurion = np.array([17, 19, 19, 22, 18, 17, 18,
    19, 17, 16])

We enter our data into NumPy arrays and use the mean method to obtain the averages.

> eight = np.array([24, 33, 28, 22, 29, 32, 31,

28, 29, 30])

> print(five.mean(), centurion.mean(),

eight.mean())

16.2 18.2 28.6

We can see that the mean performance for toaster model


8 is larger than for the other 2 models. The question is
whether this is a statistically significant difference. We can Is the average value difference
statistically significant?
use ANOVA to look at the null hypothesis, that all the
means are the same, and the alternative hypothesis, that
at least one of them is different. In Python we can use the
f_oneway implementation in SciPy:

> from scipy.stats import f_oneway
> fstat, p = f_oneway(five, centurion, eight)
> print('stat = {0:.4f}, p-value= {1:.4f}'.
    format(fstat, p))

stat = 59.3571, p-value= 0.0000

We use the f_oneway function in SciPy to carry out an ANOVA.

With an F statistic of 59.3571 we have a p-value lower than
α = 0.05; for a 95% confidence level we can reject the
null hypothesis. We conclude that at least one of the means is
different from the others.

See Section 6.7.2 to see how Tukey's range test can help find the sample with the difference.

Note that the SciPy implementation only gives us the F


statistic and the p-value, but none of the other information
shown in Table 6.8. Furthermore, the function only uses
the Type II sum of squares. If we require more information,

we can resort to the fact that we can express our problem To obtain more information we
use Statsmodels and recast our
as a linear model and use Statsmodels to help us. The first
problem as a linear regression.
thing to mention is that we may need to reorganise our data
into a long form table. We will therefore create a pandas
dataframe with the information from Table 6.9:

import pandas as pd

toasters = pd.DataFrame({
    'five': five,
    'centurion': centurion,
    'eight': eight})

We first create a dataframe with our observations.

To get a long form table, we melt the dataframe as follows:

toasters_melt = pd.melt(
    toasters.reset_index(),
    id_vars=['index'],
    value_vars=toasters.columns,
    var_name='toastermodel',
    value_name='excesshours')

We then melt the dataframe to obtain a long form table.

The result will be a long dataframe with a column


containing the name of the toaster model and another one
with the excess hours. We can use this to write a model
such that:

$$\text{excesshours} = \beta_0 + \beta_1 \cdot \text{toastermodel}, \qquad (6.62)$$

We can now recast our problem as a linear regression.

where the toastermodel variable contains three levels:


five, centurion, eight. We can now use the ordinary

least squares method we saw in Section 6.4.2 and apply



the anova_lm function to get more information about our


analysis of variance. Let us first import the modules:

import statsmodels.api as sm

from statsmodels.formula.api import ols

We can now implement our ordinary least squares for the


model in Equation (6.62):

model1 = ols('excesshours ~ C(toastermodel)',
    data=toasters_melt).fit()
anova_toasters = sm.stats.anova_lm(model1, typ=2)

After fitting the model, we apply the anova_lm function.

As you can see, we are creating a model that explains the


excess hours as a function of the toaster model. The C in
front of the variable toastermodel makes it clear to Python We need to tell Python to treat
our toastermodel variable as
that the variable should be treated as categorical. Note also
categorical.
that we are able to specify the type of sum of squares we
want to apply as a parameter to anova_lm. Let us look at the
results:

> print(anova_toasters)

                 sum_sq    df         F       PR(>F)
C(toastermodel)   886.4   2.0  59.35714  1.30656e-10
Residual          201.6  27.0       NaN          NaN

This starts looking like the result summary from Table 6.8.

This table still does not show the sum of squares and the
totals. We can write a function to help us with that:

def anova_summary(aov):
    # A function to calculate the mean squares and the totals
    aov2 = aov.copy()
    aov2['mean_sq'] = aov2['sum_sq'] / aov2['df']
    cols = ['sum_sq', 'df', 'mean_sq', 'F', 'PR(>F)']
    aov2.loc['Total'] = [aov2['sum_sq'].sum(),
        aov2['df'].sum(), np.NaN, np.NaN, np.NaN]
    aov2 = aov2[cols]
    return aov2

Let us take a look:

> anova_summary(anova_toasters)

sum_sq df mean_sq

C(toastermodel) 886.4 2.0 443.200000

Residual 201.6 27.0 7.466667 This is now a good ANOVA


summary similar to Table 6.8.
Total 1088.0 29.0 NaN

F PR(>F)

59.357143 1.306568e-10

NaN NaN

NaN NaN

The F statistic and p-value are the same as obtained before,


but we now have more information about the results of As mentioned before, check
Section 6.7.2.
our analysis of variance. However, we do not know which
groups are different from each other, we simply know that
there are differences.

Let us consider another example that may help put some of


the discussions about other tests we have learned into

perspective. Usually ANOVA is best employed in the


comparison of three or more groups. In the case of two ANOVA is better suited to
compare three or more groups.
groups we can use a two-sample t-test as we did for our
For two groups, use a t-test or
cars dataset in Section 6.6.1. We can, in principle, use remember that F = t2 .
ANOVA to make the comparison and simply remember the
following relationship for our test statistics in the case of
two groups: F = t2 .

Let us take a look and note that we are considering that the
cars dataset has been loaded as before:

> lm_mpg1 = ols('mpg ~ C(am)', data=cars).fit()

> anova_mpg1 = sm.stats.anova_lm(lm_mpg1, typ=2)

> anova_summary(anova_mpg1)

              sum_sq    df     mean_sq          F    PR(>F)
C(am)     405.150588   1.0  405.150588  16.860279  0.000285
Residual  720.896599  30.0   24.029887        NaN       NaN
Total    1126.047187  31.0         NaN        NaN       NaN

Using ANOVA to compare the fuel consumption of manual and automatic cars in the dataset from Section 6.6.1.

In this case the F statistic obtained is F (1, 30) = 16.8602 with


p < 0.05. In Section 6.6.1 we obtained a t statistic of −4.1061
and we can verify that F = t2 . The result of our ANOVA test (−4.1061)2 = 16.860.

indicates that we reject the null hypothesis and so the mpg


for automatic cars is different from that of manual ones.

Let us do one more thing with our cars dataset. We have a


variable cyl and so we can get the mean mpg per cyl:

> cars.groupby(['cyl'])['mpg'].mean()

cyl
4    26.663636
6    19.742857
8    15.100000

We have three levels in the cyl variable. We can use ANOVA to see if there are any differences in the mean fuel consumption per number of cylinders.

We have three groups and using ANOVA to determine if


these means are the same or not is now easy to implement:

> lm_mpg2 = ols('mpg ~ C(cyl)', data=cars).fit()

> anova_mpg2 = sm.stats.anova_lm(lm_mpg2, typ=2)

> anova_summary(anova_mpg2)

              sum_sq    df     mean_sq          F        PR(>F)
C(cyl)    824.784590   2.0  412.392295  39.697515  4.978919e-09
Residual  301.262597  29.0   10.388365        NaN           NaN
Total    1126.047188  31.0         NaN        NaN           NaN

The ANOVA summary for the comparison of the mean fuel consumption per number of cylinders.

The results obtained also indicate that the fuel consumption


for cars with different numbers of cylinders is different with
F (2, 29) = 39.6975, p < 0.05. But where is the difference?
Well, Tukey’s range test may be able to help.

6.7.2 Tukey’s Range Test

We have analysed the toaster data provided by the


Caprica City manufacturer and concluded that at least one
of the mean values is statistically different from the others.
But which one? One common post hoc test that can be Tukey’s test can be used to find
means that are significantly
applied is the Tukey’s range test, or simply Tukey’s test,
different from each other.
which compares the means of each group to every other
group pairwise, i.e. µi − µ j and identifies any difference
between two means that is greater than the expected
standard error. In Python we can use the
pairwise_tukeyhsd function in Statsmodels under

statsmodels.stats.multicomp:

> from statsmodels.stats.multicomp\
    import pairwise_tukeyhsd
> tukey_test = pairwise_tukeyhsd(
    endog=toasters_melt['excesshours'],
    groups=toasters_melt['toastermodel'],
    alpha=0.05)
> print(tukey_test)

We can run a Tukey's range test with the pairwise_tukeyhsd function in Statsmodels.

Multiple Comparison of Means-Tukey HSD, FWER=0.05

=================================================

group1 group2 meandiff p-adj lower upper

-------------------------------------------------

centurion eight 10.4 0.001 7.3707 13.4293

centurion five -2.0 0.2481 -5.0293 1.0293

eight five -12.4 0.001 -15.4293 -9.3707



The p-value for the comparison between the Centurion


toaster and Model Five is greater than 0.05 and thus, for a
95% confidence level, we cannot reject the null hypothesis.
In other words, there is no statistically significant difference According to the results, the
Model Eight toaster has the
between the means of the Centurion and Five models. The
different mean value.
p-values for the comparison between the Centurion toaster
and the Model Eight one, on the one hand, and Models
Eight and Five on the other hand, are below our significance
value. We reject the null hypothesis and can say that there
is a statistically significant difference between the means of
these groups.

6.7.3 Repeated Measures ANOVA

As we saw in Section 6.6.5, sometimes we have data


that corresponds to a given treatment of the same subject
in a repeated manner, for example a measurement that
takes into account a before and after situation such as a
treatment. We know that these are paired datasets, and We are interested in the variability
effects within subjects.
in the case of more than two groups, we are interested in
looking at the variability effects within subjects as opposed
to the between samples approach used in Section 6.7.1 for
ANOVA. With that in mind, we want to take a closer look
at the within-samples sum of squares, SSW , and break it
further, leaving the sum of squares between samples, SSB ,
out of the discussion. The hypotheses are:

• H0 : µ1 = µ2 = · · · = µk , i.e. there is no difference among The hypotheses for repeated


measures ANOVA.
the treatments

• Ha : at least one mean is different from the rest



The variation in this case may arise, in part, due to our


variables having different means. We want to account for
this in our model. Let us call this SS M , and the variance
that is not accounted for is called SSe , which we treat as an We want to take into account that
the variation may be due by our
error term. We are now left with the comparison between
variables having different means.
SS M and SSe . When SS M > SSe , then the variation within
subjects is mostly accounted for by our model, leading to a
large F statistic that is unlikely for the null hypothesis that
the means are all equal.

For a dataset with n subjects with k treatments, we denote


the observation for subject i on treatment j as xij . The mean Some important notation used in
repeated measures ANOVA.
for subject i over all treatments is denoted as $\bar{x}_{i\cdot}$, whereas the
mean for treatment j among all subjects is $\bar{x}_{\cdot j}$. Finally $\bar{x}_{\cdot\cdot}$ is
the grand mean for the dataset. The within-subjects sum of
squares is:

$$SS_W = \sum_{i}^{n} \sum_{j}^{k} (x_{ij} - \bar{x}_{i\cdot})^2, \qquad (6.63)$$

and the sum of squares for our model is:

$$SS_M = n \sum_{j}^{k} (\bar{x}_{\cdot j} - \bar{x}_{\cdot\cdot})^2, \qquad (6.64)$$

and thus SSe = SSW − SS M . The degrees of freedom


for our model is νM = k − 1, and νe = (k − 1)(n − 1).
As done in the previous section, the mean squares are
MS M = SS M /νM and MSe = SSe /νe and our F statistic is
given by F = MS M /MSe .
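A minimal sketch of Equations (6.63) and (6.64), using a small made-up matrix of n subjects by k treatments purely for illustration, computes the within-subject decomposition and the resulting F statistic:

import numpy as np
from scipy.stats import f

# Made-up scores: rows are subjects, columns are treatments (illustration only)
x = np.array([[7.1, 6.0, 5.2],
              [8.3, 7.7, 6.9],
              [6.5, 6.1, 5.8],
              [7.9, 7.0, 6.2],
              [8.8, 8.1, 7.4]])

n, k = x.shape
subject_means = x.mean(axis=1, keepdims=True)    # x_i.
treatment_means = x.mean(axis=0)                 # x_.j
grand_mean = x.mean()                            # x_..

SSW = ((x - subject_means)**2).sum()                    # Equation (6.63)
SSM = n * ((treatment_means - grand_mean)**2).sum()     # Equation (6.64)
SSe = SSW - SSM

MSM = SSM / (k - 1)
MSe = SSe / ((k - 1)*(n - 1))
F = MSM / MSe
p = f.sf(F, k - 1, (k - 1)*(n - 1))

print('F = {0:.4f}, p = {1:.4f}'.format(F, p))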

In Python, we can use the AnovaRM implementation in the


Statsmodels package. We need to specify a melted
dataframe containing our data, the dependent variable We can use the AnovaRM function
in Statsmodels.

(depvar), the column that contains the subject identifier


(subject), and a list with the within-subject factors. Let us
take a look at applying a repeated ANOVA to help of our
colleagues at the USS Cerritos. Commander T’Ana is looking
at the headache treatment study she started and has enlisted See Section 6.6.6 for Commander
T’Ana’s initial pilot study.
the help of 20 volunteers that have been given different
drugs: Kabezine, Headezine, Paramoline and Ibuprenine.
The overall health score of the volunteers is measured on
each of the four drugs.

The data can be obtained from22 https://doi.org/10.6084/m9.figshare.19089896.v1
as a comma-separated value file with the name "starfleetHeadacheTreatment.csv".

22 Rogel-Salazar, J. (2022d, Jan). Starfleet Headache Treatment - Example Data for Repeated ANOVA. https://doi.org/10.6084/m9.figshare.19089896.v1

Let us load the data into a pandas dataframe:
> headache = pd.read_csv(

    'starfleetHeadacheTreatment.csv')
Our data is in a neat table.
> headache.head(3)

   Volunteer  Kabezine  Headezine  Paramoline  Ibuprenine
0          1      10.0        3.4         1.7         5.7
1          2       7.7        3.9         4.3         4.4
2          3       7.6        7.6         2.5         5.9

As we can see, the data is organised as a neat table that


is great for a human to read. However, we know that the

AnovaRM function is expecting a long format table. Let us

melt the dataframe taking the Volunteer column as the IDs


and the values as the columns in our original dataframe.
We will rename the variable name as drug and the value
columns as health:

> h_melt = pd.melt(headache.reset_index(),
    id_vars=['Volunteer'],
    value_vars=headache.columns,
    var_name='drug', value_name='health')

We need to melt our dataframe to use it with AnovaRM.

We can get the tail of the dataframe to see how things look:

> h_melt.tail(4)
We now have a long format table.

Volunteer drug health

76 17 Ibuprenine 4.9

77 18 Ibuprenine 4.4

78 19 Ibuprenine 5.3

79 20 Ibuprenine 4.4

Let us now feed our melted dataframe to our repeated


measures function AnovaRM:

> from statsmodels.stats.anova import AnovaRM

> aovrm = AnovaRM(data=h_melt, depvar='health',
    subject='Volunteer', within=['drug']).fit()

The AnovaRM function can now be applied.

We can take a look at the results by printing the summary


provided by the model instance:

> print(aovrm)

Anova

==================================

F Value Num DF Den DF Pr > F

----------------------------------

drug 19.4528 3.0000 57.0000 0.0000

==================================

With a statistic F (3, 57) = 19.4528, p < 0.05 we can reject


the null hypothesis for a 95% confidence level, and conclude
that the type of drug used leads to statistically significant Surely Commander T’Ana will be
happily meowing with the results.
differences in overall health measures taken by Commander
T’Ana’s team.

6.7.4 Kruskal-Wallis – Non-parametric One-way ANOVA

By now we are well aware that when our data is not


normally distributed, we may need to revert to other
methods, such as non-parametric tests. This is also true for
carrying out an analysis of variance, and one way to
evaluate if the medians, mi , of two or more groups are equal
(or not) is the Kruskal-Wallis test. The test is named after
William Kruskal and W. Allen Wallis23 . The test statistic is 23
William H. Kruskal and W.
Allen Wallis (1952). Use of ranks
denoted with the letter H and it can be seen as an extension in one-criterion variance analysis.
Journal of the American Statistical
of the Mann-Whitney U test we discussed in Section 6.6.4, Association 47(260), 583–621
as we estimate the differences in the ranks of the
observations.

As mentioned before, we do not make any assumption


about the distribution of the groups except that they have

the same shapes. We evaluate the observations in either an


ordinal, ratio or interval scale and they are assumed to be
independent. As with the Mann-Whitney test, we first need
to rank the n observations in the k groups with n = ∑ik ni .
Our hypotheses are as follows:

• H0 : m1 = m2 = · · · = mk , The hypotheses of the Kruskal-


Wallis test.
• Ha : at least two of the medians are different.

With Rij being the rank of the j-th observation in the i-th
group, and Ri· being the average rank of the observations in
the i-th group, the H test statistic for the Kruskal-Wallis test
is given by:
$$H = (n - 1) \frac{\sum_{i=1}^{k} n_i \left( \bar{R}_{i\cdot} - \bar{R} \right)^2}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( R_{ij} - \bar{R} \right)^2}, \qquad (6.65)$$

where $\bar{R} = (n + 1)/2$, i.e. the average of all the $R_{ij}$. When the
data does not have ties, we can express the test statistic as:

$$H = \frac{12}{n(n + 1)} \sum_{i=1}^{k} n_i \bar{R}_{i\cdot}^2 - 3(n + 1). \qquad (6.66)$$

See Appendix I.
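Here is a rough sketch of Equation (6.66) for made-up groups without ties: we rank the pooled observations with scipy.stats.rankdata, apply the formula and compare with scipy.stats.kruskal, which we use next:

import numpy as np
from scipy.stats import rankdata, kruskal

# Made-up groups without ties, for illustration only
groups = [np.array([81., 90., 85., 99., 89.]),
          np.array([75., 100., 83., 80., 78.]),
          np.array([86., 95., 93., 97., 88.])]

sizes = [len(g) for g in groups]
n = sum(sizes)

# Rank all n observations pooled together
ranks = rankdata(np.concatenate(groups))
# Average rank per group
split_points = np.cumsum(sizes)[:-1]
mean_ranks = [r.mean() for r in np.split(ranks, split_points)]

# Equation (6.66), valid when there are no ties
H = 12/(n*(n + 1)) * sum(ni * rbar**2
                         for ni, rbar in zip(sizes, mean_ranks)) - 3*(n + 1)

print('manual H = {0:.4f}'.format(H))
print('scipy   :', kruskal(*groups))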

Now that we have our test statistic, to decide on whether we


reject the null hypothesis or not, we compare H to a critical
cutoff point determined by the chi-square distribution, this See Section 5.5.2 for more
information about the chi-squared
is due to the fact that this distribution is a good
distribution.
approximation for H for sample sizes greater than 5. The
degrees of freedom for the chi-squared distribution is ν = k − 1,
where k is the number of groups. When the H statistic is
greater than the cutoff, we reject the null hypothesis.

In Python we can use the kruskal function in the stats module of
SciPy. We simply need to provide arrays with the data in each of
the groups. Let us consider an example. We are following up the
learning of Python by new recruits to Starfleet Academy. We saw
in Section 6.5.2 that 75% of applicants learn some Python;
however, they still need to take a course in their first year.
New recruits are randomly assigned to one of 3 different methods
of learning:

• Instructor-led course with Commander Data in a face-to-face
  format

• Remote lectures via holographic communicator with Hologram
  Janeway

• Project-led course following "Data Science and Analytics with
  Python"24 and "Advanced Data Science and Analytics with
  Python"25

24 Rogel-Salazar, J. (2017). Data Science and Analytics with
Python. Chapman & Hall/CRC Data Mining and Knowledge Discovery
Series. CRC Press
25 Rogel-Salazar, J. (2020). Advanced Data Science and Analytics
with Python. Chapman & Hall/CRC Data Mining and Knowledge
Discovery Series. CRC Press

We are interested to know if the medians of the results are the
same or not, indicating that either the method of learning is not
important, or it actually makes a difference. Does the method of
learning Python make a difference for Starfleet recruits? Perhaps
doing practical work with quality material is better than having
Commander Data trying to simplify explanations, or better than
just simply having remote lectures without practicals. After
having sat the courses, the students are assessed and the results
are shown in Table 6.10.

We will use the Kruskal-Wallis test, but first let us capture


the data in Python:

Table 6.10: Results of the different methods to learn Python for
Data Science at Starfleet Academy.

Instructor   Holographic Communicator   Project & Book
   81.4               75.4                   85.1
   90.4              100.0                   95.5
   85.2               83.2                   93.8
  100.0               80.7                  100.0
   98.9               78.9                   88.2
   89.0               78.2                   91.7
   90.7               84.8                   93.2

instructor = [81.4, 90.4, 85.2, 100.0, 98.9, 89.0, 90.7]
holographic = [75.4, 100.0, 83.2, 80.7, 78.9, 78.2, 84.8]
project_book = [85.1, 95.5, 93.8, 100.0, 88.2, 91.7, 93.2]

df = pd.DataFrame({'instructor': instructor,
                   'holographic': holographic,
                   'project_book': project_book})

We capture the data from Table 6.10 in a pandas dataframe.

Let us look at the median for each method of learning:

> df.median()
instructor      90.4
holographic     80.7
project_book    93.2

The medians are different, but is the difference statistically
significant?

And now we run our statistical test:



> from scipy.stats import kruskal
> hstat, p = kruskal(df['instructor'],
      df['holographic'], df['project_book'])
> print('stat = {0:.4f}, p-value= {1:.4f}'.format(hstat, p))

stat = 6.7188, p-value= 0.0348

Running a Kruskal-Wallis test indicated that the differences are
indeed statistically significant.

With an H statistic of 6.7188 and a p-value lower than 0.05,


we reject the null hypothesis for a 95% confidence level and
conclude that there are differences in the medians for the
three different methods of learning.

6.7.5 Two-factor or Two-way ANOVA

We know how to assess the impact of one factor on a response
variable over a number of groups with a one-factor ANOVA (see
Section 6.7.1). We may also be interested in the impact of two
factors on the response variable and get a sense of whether there
is an interaction between the two factors in question. In those
cases we need to use a two-factor ANOVA.

When the interactions are important, we need to have multiple
measurements for each combination of the levels of the two
factors in our analysis. There are several other requirements to
run a two-way ANOVA: the samples at each factor level combination
must be normally distributed, and the samples must have a common
variance. We can organise our data in a tabular form such that we
have one factor in the columns and another one in the rows, as
shown in Table 6.11.

Table 6.11: A typical data arrangement for a two-factor ANOVA.
Each cell holds the n observations for one combination of the row
factor (Factor R, levels R1, ..., Rr) and the column factor
(Factor C, levels C1, ..., Cc), together with the cell, row and
column totals:

                        Factor C
Factor R     Level C1       Level C2      ...   Level Cc
Level R1     x_{R1 C1 1}    x_{R1 C2 1}   ...   x_{R1 Cc 1}
             ...            ...                 ...
             x_{R1 C1 n}    x_{R1 C2 n}   ...   x_{R1 Cc n}
             T_{R1 C1}      T_{R1 C2}     ...   T_{R1 Cc}     T_{R1}

Level R2     x_{R2 C1 1}    x_{R2 C2 1}   ...   x_{R2 Cc 1}
             ...            ...                 ...
             x_{R2 C1 n}    x_{R2 C2 n}   ...   x_{R2 Cc n}
             T_{R2 C1}      T_{R2 C2}     ...   T_{R2 Cc}     T_{R2}
...
Level Rr     x_{Rr C1 1}    x_{Rr C2 1}   ...   x_{Rr Cc 1}
             ...            ...                 ...
             x_{Rr C1 n}    x_{Rr C2 n}   ...   x_{Rr Cc n}
             T_{Rr C1}      T_{Rr C2}     ...   T_{Rr Cc}     T_{Rr}

             T_{C1}         T_{C2}        ...   T_{Cc}        T

We can denote the number of levels in the column factor as c, and
the number of levels in the row factor as r. We have N
observations, and the observation in the rck cell of our table is
$x_{rck}$, where $r = R_1, R_2, \ldots, R_r$, $c = C_1, C_2,
\ldots, C_c$ and $k = 1, 2, \ldots, n$. This is the notation used
in two-way ANOVA. With this in mind, we have that
$T = \sum_r \sum_c \sum_k x_{rck}$ is the sum of all $N = rcn$
observations. In the same manner we have the following sums:

$$\sum T_R^2 = T_{R_1}^2 + T_{R_2}^2 + \cdots + T_{R_r}^2, \qquad (6.67)$$

$$\sum T_C^2 = T_{C_1}^2 + T_{C_2}^2 + \cdots + T_{C_c}^2, \qquad (6.68)$$

$$\sum T_{RC}^2 = T_{R_1 C_1}^2 + T_{R_1 C_2}^2 + \cdots + T_{R_r C_c}^2. \qquad (6.69)$$

As with the one-factor ANOVA, we are interested in different sums
of squares, and for the two-factor case we have some extra ones.
Let us take a look:

$$SS_T = \sum_r \sum_c \sum_k x_{rck}^2 - \frac{T^2}{N},
\qquad (6.70) \quad \text{(total sum of squares)}$$

$$SS_R = \frac{\sum T_R^2}{nc} - \frac{T^2}{N},
\qquad (6.71) \quad \text{(sum of squares between rows)}$$

$$SS_C = \frac{\sum T_C^2}{nr} - \frac{T^2}{N},
\qquad (6.72) \quad \text{(sum of squares between columns)}$$

$$SS_{RC} = \frac{\sum T_{RC}^2}{n} - \frac{T^2}{N} - SS_R - SS_C,
\qquad (6.73) \quad \text{(sum of squares of interactions)}$$

$$SS_E = SS_T - SS_R - SS_C - SS_{RC}.
\qquad (6.74) \quad \text{(residual sum of squares)}$$

As we did in Section 6.7.1, we need to calculate mean square
quantities from the sums above, dividing by the corresponding
degrees of freedom. The degrees of freedom follow this
relationship:
$N - 1 = (r-1) + (c-1) + (r-1)(c-1) + rc(n-1)$.
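
As a quick numerical illustration of Equations (6.70)–(6.74), the
following sketch computes the sums of squares, mean squares and F
ratios (as summarised later in Table 6.12) with NumPy. The array
shape and values are assumptions made purely for demonstration,
not data from the book:

import numpy as np

# Synthetic observations arranged as (r rows, c columns, n replicates)
rng = np.random.default_rng(0)
data = rng.normal(70, 5, size=(2, 3, 4))   # r=2, c=3, n=4
r, c, n = data.shape
N = data.size
T = data.sum()

TR = data.sum(axis=(1, 2))    # row totals T_R
TC = data.sum(axis=(0, 2))    # column totals T_C
TRC = data.sum(axis=2)        # cell totals T_RC

SST = (data**2).sum() - T**2 / N                    # Eq. (6.70)
SSR = (TR**2).sum() / (n * c) - T**2 / N            # Eq. (6.71)
SSC = (TC**2).sum() / (n * r) - T**2 / N            # Eq. (6.72)
SSRC = (TRC**2).sum() / n - T**2 / N - SSR - SSC    # Eq. (6.73)
SSE = SST - SSR - SSC - SSRC                        # Eq. (6.74)

# Mean squares and F statistics
MSR, MSC = SSR / (r - 1), SSC / (c - 1)
MSRC = SSRC / ((r - 1) * (c - 1))
MSE = SSE / (r * c * (n - 1))
print(MSR / MSE, MSC / MSE, MSRC / MSE)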

Table 6.12: Table summarising the results of a two-way analysis
of variance (two-way ANOVA).

Source of        Sum of    Degrees of       Mean     F
Variation        Squares   Freedom          Square   Statistic
Between rows     SSR       r − 1            MSR      MSR/MSE
Between columns  SSC       c − 1            MSC      MSC/MSE
Interaction      SSRC      (r − 1)(c − 1)   MSRC     MSRC/MSE
Residual         SSE       rc(n − 1)        MSE
Total            SST       N − 1

The logic is the same as for one-way ANOVA, and we look at the F
distribution to assess our hypothesis. Remember that the F
statistic is the ratio of the mean squares in question; see
Equation (6.55). For a two-way ANOVA we have the following set of
null hypotheses:

• The sample means for the first factor are all equal

• The sample means for the second factor are all equal

• There is no interaction between the two factors

The first and second hypotheses are equivalent to performing a
one-way ANOVA for either the row or the column factor alone. We
can use a table to summarise the results of a two-factor ANOVA,
as shown in Table 6.12.

Sometimes it is useful to look at the degree of association
between an effect of a factor or an interaction, and the
dependent variable. This can be thought of as a form of
correlation between them, and the square of the measure can be
considered as the proportion of variance in the dependent
variable that is attributable to each effect; in other words, the
amount of variance that is explainable by the factor in question.
Two widely-used measures for the effect size are the eta-squared,
$\eta^2$, and omega-squared, $\omega^2$:

$$\eta^2 = \frac{SS_{effect}}{SS_T}, \qquad (6.75)$$

$$\omega^2 = \frac{SS_{effect} - \nu_{effect}\, MS_E}{MS_E + SS_T},
\qquad (6.76)$$

where $\nu_{effect}$ is the degrees of freedom for the effect.

Let us expand on our study of Python learning among Starfleet
Academy recruits. We know that the recruits are assigned to 3
delivery methods of learning as described in Section 6.7.4:
instructor, holographic communicator, and project. We are now
also encouraging them to revise, and they are assigned 2 revision
methods: mock exams and weekly quizzes. We would like to conduct
a two-way ANOVA that compares the mean score for each level of
delivery combined with each level of revision.

The dataset for this is available at26
https://doi.org/10.6084/m9.figshare.19208676.v1 as a
comma-separated value file with the name
"Python_Study_Scores.csv".

26 Rogel-Salazar, J. (2022c, Feb). Python Study Scores.
https://doi.org/10.6084/m9.figshare.19208676.v1

Let us read the data into a pandas dataframe:

import pandas as pd
import numpy as np

df = pd.read_csv('Python_Study_Scores.csv')

Let us look at the mean scores per delivery and revision


methods. We can do this easily via a pivot table with
pandas:

> table = pd.pivot_table(df, values='Score',
      index=['Delivery'], columns=['Revision'],
      aggfunc=np.mean)
> print(table)

Revision     Mock Exam  Weekly Quiz
Delivery
Holographic      90.85        74.89
Instructor       94.31        86.42
Project          96.28        87.33

A pivot table lets us summarise the mean scores of delivery v
revision.

Let us now run our two-way ANOVA. We do this with the help of a
linear model and the anova_lm method, as we did in Section 6.7.1:

> formula = ('Score ~ C(Delivery) + C(Revision) + '
             'C(Delivery):C(Revision)')
> model = ols(formula, df).fit()
> aov_table = sm.stats.anova_lm(model, typ=2)

Notice that in this case our model takes into account each of the
factors and the interaction:

$$\text{Score} = \beta_0 + \beta_d\,\text{Delivery} +
\beta_r\,\text{Revision} +
\beta_{dr}\,\text{Delivery}*\text{Revision}.$$

We can apply the same function we created in Section


6.7.1 to the resulting table from our code above, giving us
information about the mean squares:

> anova_summary(aov_table)

                              sum_sq    df
C(Delivery)               920.552333   2.0
C(Revision)              1793.066667   1.0
C(Delivery):C(Revision)   192.314333   2.0
Residual                 1046.096000  54.0
Total                    3952.029333  59.0

                             mean_sq            F        PR(>F)
C(Delivery)              460.2761667  23.75968649      3.96E-08
C(Revision)              1793.066667  92.55900032      2.64E-13
C(Delivery):C(Revision)  96.15716667  4.963681154   0.010497559
Residual                 19.37214815          NaN           NaN
Total                            NaN          NaN           NaN

This is the two-way ANOVA summary for our study of Python
learning among Starfleet recruits.

For a 95% confidence level, since the p-values for Delivery and
Revision are both less than 0.05, we can conclude that both
factors have a statistically significant effect on the score,
rejecting the null hypothesis that the means for each factor are
all equal. We can now turn our attention to the interaction
effect, with F(2, 54) = 4.9636 and p = 0.010, indicating that
there is a statistically significant interaction between the
delivery and revision methods on the score obtained.

If we are interested in the effect sizes, we can apply the
following function to our results table. It calculates η2 and ω2
for a two-way ANOVA; see Equations (6.75) and (6.76):

def effect_sizes(aov):
    # Eta-squared: SS_effect / SS_T, Equation (6.75)
    aov['eta_sq'] = aov[:-1]['sum_sq'] / sum(aov['sum_sq'])
    # Mean square of the residual (last row of the table)
    mse = aov['sum_sq'].iloc[-1] / aov['df'].iloc[-1]
    # Omega-squared, Equation (6.76)
    aov['omega_sq'] = (aov[:-1]['sum_sq'] - aov[:-1]['df'] * mse) \
        / (sum(aov['sum_sq']) + mse)
    return aov

Applying this function to the results above will add the
following columns:

                          eta_sq  omega_sq
C(Delivery)             0.232932  0.222040
C(Revision)             0.453708  0.446617
C(Delivery):C(Revision) 0.048662  0.038669
Residual                     NaN       NaN

6.8 Tests as Linear Models

We have been discussing an entire zoo of statistical tests. It is
useful to understand the different assumptions behind them and
familiarise ourselves with the names that others use for them so
that we can understand the results and interpretations for the
given data. A lot of these tests can be expressed in a more
familiar way: as linear models! Take for instance the ANOVA test
in Section 6.7, where we appealed to expressing the result in
terms of a linear model. It turns out that many of the tests we
have described can be expressed in terms of a linear model too.

6.8.1 Pearson and Spearman Correlations

We start with a linear model such that:

$$y = \beta_0 + \beta_1 x. \qquad (6.77)$$

For the Pearson correlation, the null hypothesis corresponds to
having a zero slope in our model, i.e. H0: β1 = 0. Instead of
using scipy.stats.pearsonr(x, y) we can use Statsmodels:

import statsmodels.formula.api as smf

smf.ols('y ~ 1 + x', data)

The linear model will give you the slope, not the correlation
coefficient r. If you require the correlation coefficient, simply
scale the data by the standard deviation.
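
For example, a minimal sketch with made-up data (the variable
names and values here are illustrative assumptions) showing that
the OLS slope on standardised variables recovers Pearson's r:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)
data = pd.DataFrame({'x': x, 'y': y})

r, _ = stats.pearsonr(data['x'], data['y'])
z = (data - data.mean()) / data.std()        # scale by the standard deviation
slope = smf.ols('y ~ 1 + x', z).fit().params['x']
print(r, slope)                               # the two values agree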

For the Spearman correlation we simply take the rank of the data,
and so the model is:

$$\text{rank}(y) = \beta_0 + \beta_1\,\text{rank}(x). \qquad (6.78)$$

Instead of using scipy.stats.spearmanr(x, y) we can use
Statsmodels on the ranked data:

smf.ols('y ~ 1 + x', data.rank())
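
Mirroring the check above, a short sketch on hypothetical data:
standardising the ranks makes the OLS slope match
scipy.stats.spearmanr (the data here is an assumption for
illustration):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=200)
data = pd.DataFrame({'x': x,
                     'y': np.exp(x) + rng.normal(scale=0.5, size=200)})

rho, _ = stats.spearmanr(data['x'], data['y'])
ranked = data.rank()
z = (ranked - ranked.mean()) / ranked.std()   # standardised ranks
slope = smf.ols('y ~ 1 + x', z).fit().params['x']
print(rho, slope)                              # essentially the same value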



6.8.2 One-sample t- and Wilcoxon Signed Rank Tests

In one-sample tests, we are interested in a single number. For a
one-sample t-test we have:

$$y = \beta_0, \qquad (6.79)$$

and our null hypothesis corresponds to H0: β0 = 0. Instead of
using scipy.stats.ttest_1samp(y, 0) we can use Statsmodels as
follows:

smf.ols('y ~ 1', data)

For the Wilcoxon signed rank test we simply take the signed rank
of the data:

$$\text{signed\_rank}(y) = \beta_0. \qquad (6.80)$$

This approximation gets better as the number of observations
grows, and is good enough for n > 14. Instead of using
scipy.stats.wilcoxon(y) we can use Statsmodels:

smf.ols('signed_rank(y) ~ 1', data)

where we have defined the following function to calculate the
signed rank:

def signed_rank(df):
    # Rank the absolute values and reapply the original signs
    return np.sign(df) * df.abs().rank()
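
A small sketch on simulated data (the values are assumptions for
illustration) comparing these linear-model versions with the
SciPy tests; the Wilcoxon correspondence is only an
approximation:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

def signed_rank(df):
    return np.sign(df) * df.abs().rank()

rng = np.random.default_rng(7)
data = pd.DataFrame({'y': rng.normal(0.3, 1, 40)})

t, p_t = stats.ttest_1samp(data['y'], 0)
fit = smf.ols('y ~ 1', data).fit()
print(t, fit.tvalues['Intercept'])          # identical t statistics

w, p_w = stats.wilcoxon(data['y'])
fit_w = smf.ols('signed_rank(y) ~ 1', data).fit()
print(p_w, fit_w.pvalues['Intercept'])      # approximately equal p-values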

6.8.3 Two-Sample t- and Mann-Whitney Tests

For two-sample tests, we are interested to know if a data point
belongs to one group (0) or the other (1). We can use a dummy
variable $x_i$ as an indicator for the group, and thus the linear
model can be expressed as:

$$y = \beta_0 + \beta_1 x_i. \qquad (6.81)$$

For the two-sample independent t-test our null hypothesis
corresponds to H0: β1 = 0. Instead of using
scipy.stats.ttest_ind(y1, y2) we can use Statsmodels:

smf.ols('y ~ 1 + group', data)

where group is a dummy variable for each of the two samples.
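
As a hedged sketch (group labels and values below are purely
illustrative), the t statistic for the group dummy in the linear
model matches scipy.stats.ttest_ind up to its sign:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(0)
y1 = rng.normal(10, 2, 30)
y2 = rng.normal(12, 2, 30)
data = pd.DataFrame({'y': np.concatenate([y1, y2]),
                     'group': [0] * 30 + [1] * 30})

tstat, p = stats.ttest_ind(y1, y2)
fit = smf.ols('y ~ 1 + group', data).fit()
print(tstat, fit.tvalues['group'])   # same magnitude t statistic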

For the Mann-Whitney test we simply take the signed rank of the
data:

$$\text{signed\_rank}(y) = \beta_0 + \beta_1 x_i. \qquad (6.82)$$

This approximation gets better as the number of observations
grows, and is good enough for n > 11. Instead of using
scipy.stats.mannwhitneyu(y1, y2) we can use Statsmodels to fit a
linear model:

smf.ols('y ~ 1 + group', signed_rank(data))



6.8.4 Paired Sample t- and Wilcoxon Matched Pairs Tests

When we have paired data, we are after a single number that
predicts pairwise differences. Our linear model can then be
written as:

$$y_2 - y_1 = \beta_0, \qquad (6.83)$$

and we can actually simplify this as y = y2 − y1. This means that
we can express this as a one-sample t-test on the pairwise
differences. Our null hypothesis corresponds to H0: β0 = 0.
Instead of using scipy.stats.ttest_rel(y1, y2) we can use
Statsmodels, provided we calculate diff_y2_y1 = y2 - y1:

smf.ols('diff_y2_y1 ~ 1', data)

For the Wilcoxon matched pairs test we simply take the signed
rank of the paired differences:

$$\text{signed\_rank}(y_2 - y_1) = \beta_0. \qquad (6.84)$$

This approximation gets better as the number of observations
grows, and is good enough for n > 14. Instead of using
scipy.stats.wilcoxon(y1, y2) we can use Statsmodels:

smf.ols('diff_y2_y1 ~ 1', signed_rank(data))
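
A hedged sketch of both paired tests on hypothetical measurements
(the data and column name diff_y2_y1 are assumptions for
illustration):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

def signed_rank(df):
    return np.sign(df) * df.abs().rank()

rng = np.random.default_rng(5)
y1 = rng.normal(50, 5, 30)
y2 = y1 + rng.normal(1, 2, 30)          # paired "after" measurements
data = pd.DataFrame({'diff_y2_y1': y2 - y1})

t, p = stats.ttest_rel(y2, y1)
fit = smf.ols('diff_y2_y1 ~ 1', data).fit()
print(t, fit.tvalues['Intercept'])       # identical t statistics

w, p_w = stats.wilcoxon(y2, y1)
fit_w = smf.ols('diff_y2_y1 ~ 1', signed_rank(data)).fit()
print(p_w, fit_w.pvalues['Intercept'])   # approximately equal p-values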

6.8.5 One-way ANOVA and Kruskal-Wallis Test

We have already explored the relationship between ANOVA and the
linear model before, but we can summarise it here as follows.
With one mean for each group we have the model:

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k, \qquad (6.85)$$

where the $x_i$ are group indicators (0, 1) created as dummy
variables. The null hypothesis is therefore H0: y = β0.

Notice that when we have two groups, the test reverts to the
equivalent of the independent two-sample t-test, and when there
is one group we recover the one-sample t-test. We saw in Section
6.7.1 how we can use Statsmodels instead of scipy.stats.f_oneway.

In the case of the Kruskal-Wallis test we need to take the rank
of the data:

$$\text{rank}(y) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k.
\qquad (6.86)$$
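
A final hedged sketch (with simulated groups that are purely
illustrative): a one-way ANOVA on the rank-transformed response
approximates the Kruskal-Wallis test.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(1)
data = pd.DataFrame({
    'y': np.concatenate([rng.normal(0.0, 1, 20),
                         rng.normal(0.5, 1, 20),
                         rng.normal(1.0, 1, 20)]),
    'group': np.repeat(['a', 'b', 'c'], 20)})

data['rank_y'] = data['y'].rank()
fit = smf.ols('rank_y ~ 1 + C(group)', data).fit()
h, p = stats.kruskal(*[g['y'] for _, g in data.groupby('group')])
print(fit.f_pvalue, p)   # the p-values are close but not identical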
to rank our data.

Et voilà!
7
Delightful Details – Data Visualisation

When looking at answering questions for research, business
decisions or simply to understand and explain things to
ourselves, the tools and techniques that we have described in the
previous chapters are invaluable. We are able to make sound
comparisons and draw conclusions based on evidence. On many
occasions, however, simply presenting a table of numbers, or the
result of a statistical test and its significance, is not enough.
Results are more impactful and more likely to prompt action when
we use visual tools to bring the message home.

We mentioned briefly in Section 1.6 that representing data in
tables may work well for some, but there is something to be said
for thinking critically about the design choices that help put
across the message that we wish to convey. We are talking about
telling a story, and we should consider at least the what, who
and how for that story to drive meaning. In this chapter we will
cover some background on the importance of visual representation
and design, particularly

in the context of presenting statistical quantities. We will


then look at how matplotlib helps us with the creation of
charts, and cover some best practices when creating data
visualisations.

7.1 Presenting Statistical Quantities

Presenting statistical information requires us not only to
consider the data and the analysis we have performed (or are
performing), but also our target audience. In some cases, the
creation of infographics is suitable, but in general, a
well-crafted visualisation with visual integrity and excellence
is far more powerful (more about visual integrity in Section
7.3). Generally speaking, data can be presented in three ways:
text, tabular or graphical form. The best method is, as always,
determined by the data itself, including its format and location.
We also need to consider the method of analysis used, and the
message we need to convey.

A table may be more appropriate in situations where all the
information requires equal attention or when we want to enable
viewers to pick the data of their interest. Similarly, when
referring to certain data points in a written paragraph, a table
can be of help. In contrast, a graphical representation will
enable viewers to have a more general overview of the data. We
can use the visual elements described in the previous section to
help our viewers understand the results and draw their attention
to specific aspects. Let us take a look at each of the methods
mentioned above.

7.1.1 Textual Presentation

This method of presentation is one that we are all familiar with.
It conveys information in writing, where information is presented
in sentences, paragraphs, sections, etc. Think of a report or an
essay where we are able to provide explanations and
interpretations. We need to have clarity in our writing, and this
is helped by organising our thoughts into meaningful sentences.

When the statistical information to be conveyed is a few numbers,
textual presentation may be suitable. Think for example of
summary statistics. However, as the numbers increase, it may be
more suitable to use a different kind of presentation; for
instance, if the message is about trends we may prefer to show a
graph.

7.1.2 Tabular Presentation

A table is a systematic summary of statistical information
organised into rows and columns. A good table is accompanied by a
reference number, a title and a description of the rows and
columns included. Tables are most appropriate to use when
individual pieces of information are required to be presented.

A distinctive advantage of using tables is that we are able to
present both quantitative and qualitative information. We are
able to present data that may be difficult to show in a graph,
for example by showing figures to a significant number of decimal
places. Tables also let us present data that may be expressed in
different units or dimensions in a way that summarises the data.
The data in a table must be organised in a way that lets the
viewer draw comparisons between different components, and
therefore a balance between its length and breadth is important.
In this way, anyone is able to understand the data presented.

Since the data in a table all receive the same importance, tables
must be avoided when appealing to the viewer to select
information or when emphasis on specific data is required. Using
coloured tables such as heatmaps may support the required
emphasis, but one may argue that heatmap tables fall into the
category of graphical presentation.

7.1.3 Graphical Presentation

The use of graphics in statistics is a combination that should be
up there together with great partnerships: R2-D2 and C3-PO,
Mulder and Scully, Thelma and Louise, Spock and Kirk, Batman and
Robin, or peanut butter and jelly (many other great couples can
be included here). Graphical representation lets us simplify and
summarise complex information through the use of visual elements.
Unlike tables, where individual pieces of information are
presented, graphical representation lets us take a look at the
bigger picture (no pun intended), letting the viewer capture
trends, patterns and relationships in a faster and more concise
way.

A graphical presentation for our data requires the choice of
visual elements that help us understand the information encoded
in them. In the next chapter we will explore a variety of graphs,
from line and bar charts to scatter plots and histograms; we will
cover these in more detail in Chapter 8. Before we get there,
there are a few things that we need to take into account to be
able to choose a chart that most appropriately presents the data
at hand, and the message we need to convey. Shall we take a look?

7.2 Can You Draw Me a Picture? – Data


Visualisation

They say that a picture is worth a thousand words, and if that
picture helps make better decisions it may be worth even more.
Think, for example, of a recent news article you have read: there
would have been one or two key numbers that are suitable to
understand the issue in the article, but whenever there is more
complex information to capture, our brain is better equipped at
understanding patterns, differences and relationships in a visual
way. Surely you can do that in words alone, but a visualisation
is able to present the key information in a more compact,
accessible and faster way. That does not mean that any old
picture would do. We have all seen bad visualisations. This is
why understanding some ideas behind design and representation
comes in handy when using pictures to help tell our data stories.

Data visualisation can be thought of as the representation of
data using charts, text, colours, infographics and even
animations. Its aim is to display simple, and complex,
information to understand relationships, differences and
patterns, and to reveal important details that help transform
information into knowledge. It is important to mention that

data visualisation is not the exclusive realm of designers, or
for that matter of data and statistical teams. On the contrary,
data visualisation can be used by anyone who needs to tell a
story with data, with the aim to generate thoughts, illustrate
ideas or discover meaning. Data visualisation is for everyone!

Let us look at an example. Consider the dataset available at1
https://doi.org/10.6084/m9.figshare.19221666.v3 as a
comma-separated value file with the name "jackalope.csv". Let us
load that into a pandas dataframe and look at some descriptive
statistics:

1 Rogel-Salazar, J. (2022b, Feb). Jackalope Dataset.
https://doi.org/10.6084/m9.figshare.19221666.v3

> import pandas as pd
> df = pd.read_csv('jackalope.csv')
> df.shape

(261, 3)

We can see that we have a dataset with 3 columns and 261
observations. We can ask pandas for a description:

> df.describe()

                x           y
count  261.000000  261.000000
mean    14.311111   16.172414
std      6.901846    6.751804
min      1.600000    1.000000
25%      9.150000   12.100000
50%     13.500000   18.100000
75%     19.900000   21.100000
max     28.300000   27.400000

These are some descriptive statistics for our jackalope dataset.



We obtain the mean and standard deviation for the dataset in a
matter of seconds. We can even ask for a correlation matrix:

> df.corr()

          x         y
x  1.000000 -0.034559
y -0.034559  1.000000

We seem to have a bit of a negative correlation between columns x
and y. It is perhaps not the strongest anticorrelation, and we
may want to check if it is statistically significant. However,
before we do any of that, let us look at a scatter plot of the
data. The result is shown in Figure 7.1.

Figure 7.1: A scatter plot for the jackalope.csv dataset.

What happened there? Well, although we are able to obtain
descriptive statistics and actually run statistical analysis on
the data, it turns out that in this case the dataset contains a
picture of a cute jackalope: our dataset encodes a picture, not
real data to be analysed! Had we not looked at the data before,
we could have started an extensive analysis on a dataset that
makes no sense.

Furthermore, it may be that we are facing datasets whose
descriptive statistics are very similar, but whose underlying
distributions are very different. This is a point that is best
described with the help of the so-called Anscombe's quartet. In
1973 Frank J. Anscombe presented2 four datasets consisting of
eleven points each. They all have identical descriptive
statistics, but the distributions are very different. Let us take
a look at the dataset, available at3
https://doi.org/10.6084/m9.figshare.19221720.v3 as a
comma-separated value file with the name "anscombe.csv". Let us
load the data into a pandas dataframe and take a look:

2 Anscombe, F. J. (1973). Graphs in statistical analysis. The
American Statistician 27(1), 17–21
3 Rogel-Salazar, J. (2022a, Feb). Anscombe's Quartet.
https://doi.org/10.6084/m9.figshare.19221720.v3

> df = pd.read_csv('anscombe.csv')
> df.head(3)

  Dataset   x     y
0       A  10  8.04
1       A   8  6.95
2       A  13  7.58

Having loaded the Anscombe quartet data into pandas, we can see
that it is organised in long form, with each of the four datasets
identified by a letter from A to D. Let us group the data and
look at the means and standard deviations:
data and look at the mean and standard deviations:

> df.groupby('Dataset').agg({'x': ['mean', 'std'],
      'y': ['mean', 'std']})

           x                   y
        mean       std      mean       std
Dataset
A        9.0  3.316625  7.500909  2.031568
B        9.0  3.316625  7.500909  2.031657
C        9.0  3.316625  7.500000  2.030424
D        9.0  3.316625  7.500909  2.030579

As we can see, the means and standard deviations of the four
groups are virtually the same. Let us see the correlation:

df.groupby('Dataset').corr()

                  x         y
Dataset
A       x  1.000000  0.816421
        y  0.816421  1.000000
B       x  1.000000  0.816237
        y  0.816237  1.000000
C       x  1.000000  0.816287
        y  0.816287  1.000000
D       x  1.000000  0.816521
        y  0.816521  1.000000

The correlation between the variables in each dataset is also the
same.

Once again, the correlation coefficients are effectively
indistinguishable. Should we use any of these datasets to perform
a regression analysis, we would end up with the same line of best
fit. You may think that is great, until you look at the graphical
representation of these datasets as shown in Figure 7.2: when we
plot the data we see how different the sets are.

Figure 7.2: Anscombe's quartet. Each panel shows one of the four
datasets (A, B, C and D); all share the same summary statistics
(mean 7.50, standard deviation 2.03, r = 0.82), yet the scatter
plots look very different.


Without looking at the data, we could have used a linear All datasets have the same
summary statistics, but they have
model to explain data that is clearly not linear, as shown for very different distributions.

example for dataset B; or used a regression for continuous


variables for something that may well be categorical, for
example dataset D.

It is quite unlikely that when collecting data we end up with a
pattern like the jackalope one shown in Figure 7.1. However, I
hope you get the idea behind the importance of visualising your
data. It helps us with many important tasks, such as:

• Record information: Answer questions, make decisions,
  contextualise data

• Support reasoning: Develop and assess hypotheses, detect
  anomalies, find patterns

• Communicate: Present arguments, persuade others, collaborate
  and find solutions.

Before we are able to run our analyses, and even before we
consider what charts are best for our purposes, it is good to
take a step back and consider the way our datasets are organised.
Not only are we interested in cleaning and curating our data, but
we may want to consider if the sets contain information that may
be chronological or temporal in nature, spatial or geographical,
qualitative or quantitative. It is useful to revise some types of
data, as this will help choose not just the best visualisation,
but also the most appropriate method of analysis.

• Nominal data: Refers to the use of names or labels in the data.
  For example, robot types: Astromech, Protocol droid, Cylon
  Centurion, Cylon "toaster", Soong-type synthetic intelligence
  android-lifeform, etc. A suitable operation for this kind of
  data is to check for equality using the = or ≠ operators

• Ordered data: Refers to the use of natural ordering of the data
  in question. For example, t-shirt sizes: small, medium, large
  and extra large. Suitable operators include: =, ≠, <, >

• Interval data: This is data that is measured along a scale on
  which each point is placed at an equal distance from any other
  point. The location of 0 may be arbitrary. We cannot divide or
  multiply this data, simply add or subtract it. For example, a
  temperature scale in centigrade, or dates.

• Ratio data: This data has an absolute 0 and, like the interval
  data above, it uses continuous intervals where a ratio between
  different data points is available so that we are able to
  measure proportions. For example, counts and amounts of money
  in Republic credits, Mexican Pesos, Euros or Pound Sterling.

With this in mind, we can start thinking of creating


impactful graphics that support the activities mentioned
above and consider the type of data we have at hand. Before
we do that, it is useful to consider some aspects of data
representation and design. Let us have a look.

7.3 Design and Visual Representation

Creating a visualisation based on data requires us to think not
only of the data available, but also the different visual
elements that help us represent the data in a meaningful way.
Design principles that help us determine

what a "good" representation is should be considered, and we may
learn a thing or two from Edward Tufte, a leading voice in the
area of information design and data visualisation. As a matter of
fact, the outline of this book follows some of his ideas4. We
will not do justice to the volume of work from Tufte, but we will
draw some general principles that may help us in our quest to
find good representations for our data and information.

4 Tufte, E. (2022). The Work of Edward Tufte and Graphics Press.
https://www.edwardtufte.com/tufte/. Accessed: 2022-23-02

Considering that we are interested in using visualisation to help
our communication, we can argue that a "good" representation is
one that is the most fit for our purpose for maximum benefit.
Unfortunately, this means that there is not a single
representation that fits all cases, and therefore we need to
consider different criteria to help us get there. Tufte talks
about "graphical excellence" in terms of usability. In other
words, it is the representation of data that provides the user
with the greatest number of ideas in the shortest time, using the
least amount of ink in the smallest space. Notice that this is
different from having a purely decorative image. For a data
representation to be of graphical excellence it needs to offer
value.

Part of that value comes from the fact that the data
representation requires "visual integrity". What Tufte talks
about in this respect is almost a moral requirement, in the sense
that the representation chosen should not distort the underlying
data and it should not create a false impression or
interpretation. For example, the dimensions used in an image
should be dictated by the data itself. Any variations that happen
in the representation should relate to the data

itself, and not to the creative choices that we may have in the
design. For instance, we should not exaggerate a rise or fall in
a curve, or use arbitrary starting points in our axes. If we use
legends or keys, these should be presented only when necessary
and should not be distorted or ambiguous. Be true to your data!

This brings us to the idea of "maximising the data-ink ratio",
which refers to removing superfluous elements from our data
representation. Extra elements such as backgrounds, 3D elements
or icons may simply distract from the information we are trying
to present. Data-ink is defined by Tufte as the ink that is used
for the presentation of the data; the data-ink ratio is thus:

$$\text{Data-ink ratio} = \frac{\text{Data-ink}}
{\text{Total ink used}}. \qquad (7.1)$$

If we remove the data-ink, our graphic loses its content. In
contrast, non-data-ink is the ink that can be removed while our
representation keeps its integrity. Our goal is to design a
representation with the highest possible data-ink ratio.

We also need to mention the idea of "graphical elegance". Tufte
considers this not in terms of the subjective beauty of the
representation, but instead in terms of the balance between the
complexity of the data and the simplicity of the design. This can
be thought of as the aim of making complex data accessible. That
accessibility may be best assessed by the way our brains are able
to more easily perceive visual cues. In 1967 Jacques Bertin
published his Sémiologie Graphique5, later released in an English
translation, where he identified visual variables that can be
manipulated to encode information. An interpretation based on
these visual variables is shown in Figure 7.3.

5 Bertin, J. and M. Barbut (1967). Sémiologie graphique: Les
diagrammes, les réseaux, les cartes. Gauthier-Villars

Figure 7.3: Visual variables and their ease of perception.


their ease of perception.
An important aspect to consider when looking at the visual
variables we may want to use is the fact that they are
processed by our brain pre-attentively. In other words, they
have an impact on us at an immediate level. We can see an

example in Figure 7.4, where we are asked to determine how many
letters "t" there are in the sequence. Pre-attentive processing
is the unconscious cognitive processing of a stimulus prior to
attention being engaged. By presenting the information in a way
that appeals to our pre-attentive perception, we can more easily
see that there are 7 letters "t" in panel B. We may be able to
describe pre-attentiveness as being "seen" rather than
"understood".

Figure 7.4: How many letters "t" are there in the sequence?

With this in mind, from Figure 7.3, we are able to "see" position
more immediately than size, and colour hue more immediately than
texture. Similarly, we perceive data points more easily than
lines, and we understand differences in length more immediately
than differences in area or volume. This is one of the reasons
why pie charts, in particular 3D pie charts, are much maligned
(we will cover pie charts in more detail in Section 8.6). We can
nonetheless look at comparing areas versus comparing lengths.
Take a look at Figure 7.5. We are presented with two circles and
are asked to compare the areas. We can tell that one circle is
larger than the other one, but it is not immediately obvious by
how much. Instead, when we are asked to compare the length

of the bars. We can still tell that one is larger than the other
one, but crucially in this case we may be more easily able to
tell that the larger bar is about three and a half times longer
than the shorter bar.

Figure 7.5: Compare the area of the circles v compare the length
of the bars.
Combining some of these visual variables in a way that supports
graphical excellence (in the Tufte sense) can help with the
communication and usability of our visualisation. Take a look at
the pictures shown in Figure 7.6. The use of different shapes
immediately helps with distinguishing some points from others.
Adding colour differences makes these points jump out of the
page.

An effectiveness ranking of some of these visual variables is
helpfully presented by Jock Mackinlay6. The rankings are given in
terms of the type of data that is encoded with the visual
variables. We present the rankings in Table 7.1, and we can see
that position is the highest-ranking visual variable for
quantitative, ordinal and nominal data. We can also see that
length consistently outranks angle, area and volume. In other
words, when presenting data, a viewer may be more easily
persuaded to notice a difference in length than a difference in
angle or area.

6 Mackinlay, J. (1986, Apr). Automating the Design of Graphical
Presentations of Relational Information. ACM Trans. Graph. 5(2),
110–141

Figure 7.6: Combining visual variables can help our
visualisations be more effective.

Table 7.1: Effectiveness ranking of perceptual tasks for
different visual variables.

Quantitative            Ordinal                 Nominal
Position                Position                Position
Length                  Density (Color Value)   Color Hue
Angle                   Color Saturation        Texture
Slope                   Color Hue               Connection
Area                    Texture                 Containment
Volume                  Connection              Density (Color Value)
Density (Color Value)   Containment             Color Saturation
Color Saturation        Length                  Shape
Color Hue               Angle                   Length
Texture                 Slope                   Angle
Connection              Area                    Slope
Containment             Volume                  Area
Shape                   Shape                   Volume

Another aspect that we have not mentioned before is the textual
information that accompanies a visual. The data-ink



ratio principle will tell us to minimise superfluous ink.
However, in many cases it is preferred, and even unavoidable, to
add text to a graphic. Consideration of a suitable typeface is an
important area that helps the viewer understand the graphic. We
must therefore choose a typeface that makes it easy for a viewer
to read short bursts of text such as a heading, bullets or other
textual components. Alongside the typeface, we also need to
consider the font size, text justification and even colour.

With a better understanding of the visual variables we can use to
encode information, we can turn our attention to the design
principles that we can use for our visualisation. Some useful
visual design principles that we need to consider include:

• Balance: How are the visual elements distributed in our
  visualisation? Are we using a symmetrical or an asymmetrical
  approach to organising the elements? Is it suitable to organise
  elements in a radial manner? Are the elements evenly
  distributed?

• Proximity: How close are the elements that need to be consumed
  together? Are the elements that are close to each other
  related?

• Alignment: Are the visual elements suitably aligned vertically,
  horizontally or diagonally? Are the more important visual
  elements at the top?

• Emphasis: Are there any visual elements that need to attract
  more attention? Can we use proximity to bring together or
  isolate important elements that need to be

emphasised? Is there a hierarchical order that we can


exploit?

• Contrast: Are there any differences that need to be


highlighted?

• Unity: While there are various visual elements in any


visualisation, does it feel like they all work together to
drive our message? Is there a unifying, harmonious
theme, for example by using repetition without becoming
monotonous? Are there similar colours, shapes or
textures?

As many of us know, design is not a set of ready-made formulae
that render perfect results. In many cases, what seems to be a
simple and elegant data visualisation may have taken several
iterations to be completed so as to distill the complexity into
an effective visual. Be prepared to sketch multiple times,
including with pen and paper. The results will pay off.

7.4 Plotting and Visualising: Matplotlib

If we want to apply the principles above in practice, we need to
find a way to help us create the graphical elements that are
required to bring our data to life via a chart. There are a great
number of excellent data visualisation tools, and they all have
their advantages and disadvantages; we will cover some of these
tools in Chapter 8. Given that we have been using Python for our
discussions in the implementation of statistical analysis, it
makes sense to use some of the modules that are available in
Python. There are some really good modules that support very nice
visuals, such as Seaborn, or interactivity, such as Bokeh. The
standard library used for plotting in Python is called
matplotlib7, and if you are familiar with MATLAB there is an API
called pyplot that uses similar syntax.

7 Hunter, J. D. (2007). Matplotlib: A 2D graphics environment.
Computing in Science & Engineering 9(3), 90–95

Matplotlib supports the object orientation used in Python, and
offers a comprehensive set of tools to create static, interactive
and animated visualisations. As usual, the first step is to load
the module:

import numpy as np
import matplotlib.pyplot as plt

In a Jupyter notebook you can use the magic command %pylab inline
to load NumPy and matplotlib.

7.4.1 Keep It Simple: Plotting Functions

Imagine you require a graph of a simple function such as
y = sin(x), and we are interested in creating the plot for
−π ≤ x ≤ π. The first thing we require is to get the necessary
data to create the plot. We can generate it with Python by
creating an equally spaced vector to hold the values for x, and
use that as an input to calculate y. An easy way to create the
vector is to use NumPy and its linspace(a, b, n) function, which
creates an n-element array between the values a and b:

> x = np.linspace(-np.pi, np.pi, 100)
> y = np.sin(x)

Finally the plot of the points can be obtained as follows:



fig, ax = plt.subplots(figsize=(10, 8))
plt.plot(x, y)
plt.show()

We are creating a couple of objects to hold the figure and axes
with the subplots method. As you can see, we can specify the size
of the figure with a tuple. The show command displays all open
figures. The result is shown in Figure 7.7, but we have not
finished yet: although the commands above create the plot as
required, there may be other information that we may want to
include, such as a different line style, labels and a title.

7.4.2 Line Styles and Colours

By default, the line plots created with matplotlib use solid
lines with a colour palette that starts with a pale blue colour.
If we wanted to specify a blue, solid line we could use the
following command:

plt.plot(x, y, 'b-')

This command creates a line plot with a blue, solid line. The
third argument is a string whose first character specifies the
colour, and the second the line style. The options for colours
and styles are shown in Table 7.2. We can also indicate the
thickness of the line to be used in the plot with the linewidth
argument (we can also use the abbreviation lw). For example, to
create a black, solid line of width 3 we issue the following
command:

plt.plot(x, y, 'k-', linewidth=3)



Table 7.2: Colours and line styles that can be used by matplotlib.

Colours          Styles
b  blue          o   circle
c  cyan          -.  dash dot
g  green         --  dashed
k  black         :   dotted
m  magenta       +   plus
r  red           .   point
w  white         -   solid
y  yellow        *   star
                 x   x mark

7.4.3 Titles and Labels

It is usually very good practice to add information that tells
the viewer what is being plotted, as well as include a
descriptive title to the figure (remember, however, the data-ink
ratio mentioned in Section 7.3). In matplotlib we can add labels
and a title with the following methods: xlabel, ylabel and title.
For our sinusoidal plot from Section 7.4.1 we want to label the
axes x and y and include a title that reads: "Plot of sin(x)". We
can do this with the following commands:

plt.xlabel(r'$x$', fontsize=14)
plt.ylabel(r'$y$', fontsize=14)
plt.title(r'Plot of $\sin(x)$', fontsize=16)

Please note that we are able to use LaTeX style commands in the
annotations, and with the help of fontsize we can dictate the
size of the font for each annotation.

Figure 7.7: Plot of the function y = sin(x) generated with
matplotlib.
7.4.4 Grids

In some cases the addition of grids may be helpful to guide the
eye of our viewers to some particular values in the plot. We can
add a grid in a similar way to the creation of the plot itself:

plt.grid(color='gray', ls='--', lw=0.5)

This command adds a grid to our plot. In this case we have a
grey, dashed line of width 0.5. The end result of all the
commands above can be seen in Figure 7.7.

7.5 Multiple Plots

Some graphics may require us to show multiple plots in the same
figure. All we need to do is tell matplotlib to use the same
axes. Let us create a figure that shows not only sin(x) but also
cos(x). We replace the code above with the following:

plt.plot(x, y, 'k-', linewidth=3, label=r'$\sin(x)$')
plt.plot(x, np.cos(x), 'k--', lw=2, label=r'$\cos(x)$')

You can see that we have added a new parameter to our commands:
label. This helps us add a label to each line so that we can
include a legend to distinguish the plots:

plt.legend()

This will create a text box that provides a list of line styles,
as they appeared in the plot command, followed by the label
provided. The end result can be seen in Figure 7.8.

7.6 Subplots

We have seen how to plot two line charts in the same axes. This
is useful when a comparison between the plots needs to be
highlighted. In other cases we may want to create separate
figures for each plot, and we can use the subplots command to do
so. In this case we can exploit the fact that we are using
subplots to generate our figure.

Figure 7.8: Plot of the functions sin(x) and cos(x) generated
with matplotlib.

The arrangement can be achieved with subplots(m, n), where m and
n define the plot array. For example, take a look at the
following code:

fig, axs = plt.subplots(2, 2, figsize=(10, 8))
axs[0, 0].plot(x, y, 'k-')
axs[0, 1].plot(x, np.cos(x), 'k-')
axs[1, 0].plot(x, x, 'k-')
axs[1, 1].plot(x, x**2, 'k-')

We are creating a 2 × 2 array of plots, and we select each axis
with the notation axs[m, n].


Apart from specifying the size of the overall figure, the first
command contains the following syntax: subplots(2, 2). This
specifies that the window should be split into a 2 × 2 array; we
could request a figure with m rows and n columns with
subplots(m, n). The object axs is now an array of axes, and we
can refer to each one of them with the relevant array notation.
We can see that we have plotted the sin(x) and cos(x) functions
in the first two axes, and then we added plots for x and x² to
the other two.

Figure 7.9: Subplots can also be created with matplotlib. Each
subplot can be given its own labels, grids, titles, etc.

Each of the axes in the figure has its own title, labels, legend,
etc. We can add labels to the x axes for the bottom row of plots,
and to the y axes of the first column of plots:

axs[1, 0].set_xlabel('$x$', fontsize=12)
axs[1, 1].set_xlabel('$x$', fontsize=12)
axs[0, 0].set_ylabel('$y$', fontsize=12)
axs[1, 0].set_ylabel('$y$', fontsize=12)

In this example we are using the object-oriented syntax for
matplotlib, and hence the difference in some commands used in the
previous section.

We can specify the axes limits with set_xlim() and set_ylim().
Here we will change the x limits for the bottom row of axes:

axs[1, 0].set_xlim(-3, 3)
axs[1, 1].set_xlim(-3, 3)

We can flatten the axs object to run a loop over it. In this case
we will set a title for each plot programmatically, counting the
number of axes:

for i, ax in enumerate(axs.flat):
    ax.set_title('Figure {0}'.format(i))

The end result can be seen in Figure 7.9.

7.7 Plotting Surfaces

We can also use matplotlib to create graphs in three dimensions.
We refer to these graphs as surfaces, and they are defined by a
function z = f(x, y). We need to specify the fact that we want to
plot this in a 3D projection as follows:

fig = plt.figure(figsize=(10, 8))
ax = plt.axes(projection='3d')

We also need to consider the range of values for the x and y
variables. In this case we are going to use an interval [−3, 3]
for both of them. We will use this information to create a grid
that contains each of the pairs given by the x and y values. Let
us consider a spacing of 0.25 between points; we can therefore
calculate x and y as:

X = np.arange(-3, 3, 0.25)
Y = np.arange(-3, 3, 0.25)

and our grid can be calculated with the NumPy command meshgrid:

X, Y = np.meshgrid(X, Y)

We can check that the new arrays generated are two-dimensional:

> X.shape, Y.shape

((24, 24), (24, 24))

We are now able to evaluate our surface. Let us plot a Gaussian
function in 3D:

$$f(x, y) = \exp\left(-x^2 - y^2\right). \qquad (7.2)$$

In Python we can calculate this as follows:

Z = np.exp(-X**2 - Y**2)

We create our surface with plot_surface, providing the arrays X,
Y and Z:

from matplotlib import cm

surf = ax.plot_surface(X, Y, Z, cmap=cm.jet,
                       linewidth=0, antialiased=False)
ax.set_zlim(0, 1.01)

Figure 7.10: A surface plot obtained with the plot_surface
command. Please note that this requires the generation of a grid
with the command meshgrid.

Notice that we are importing the cm module, which contains colour
maps that provide us with colour palettes for our graphs. In
Table 7.3 we have the names of the colour maps available in
matplotlib.

Let us now add a colour bar to aid with reading the values
plotted, and include some labels. The result can be seen in
Figure 7.10.

fig.colorbar(surf, shrink=0.5, aspect=7, location='left')
ax.set_xlabel(r'$x$', fontsize=12)
ax.set_ylabel(r'$y$', fontsize=12)
ax.set_zlabel(r'$z=\exp(-x^2-y^2)$', fontsize=12)
plt.show()

Table 7.3: Names of colormaps available in matplotlib.

Sequential   Sequential 2   Diverging   Qualitative   Miscellaneous   Cyclic
Greys        binary         PiYG        Pastel1       flag            twilight
Purples      gist_yarg      PRGn        Pastel2       prism           twilight_shifted
Blues        gist_gray      BrBG        Paired        ocean           hsv
Greens       gray           PuOr        Accent        gist_earth
Oranges      bone           RdGy        Dark2         terrain
Reds         pink           RdBu        Set1          gist_stern
YlOrBr       spring         RdYlBu      Set2          gnuplot
YlOrRd       summer         RdYlGn      Set3          gnuplot2
OrRd         autumn         Spectral    tab10         CMRmap
PuRd         winter         coolwarm    tab20         cubehelix
RdPu         cool           bwr         tab20b        brg
BuPu         Wistia         seismic     tab20c        gist_rainbow
GnBu         hot                                      rainbow
PuBu         afmhot                                   jet
YlGnBu       gist_heat                                turbo
PuBuGn       copper                                   nipy_spectral
BuGn                                                  gist_ncar

7.8 Data Visualisation – Best Practices

With the knowledge we now have regarding design, visual
variables, perception and even matplotlib, we are in a better
position to consider more specific points about widely used
charts and when they are best suited. We close this chapter with
some best practices that will make our work more impactful.

Visual cues are all around us, and visual communication permeates
every aspect of our lives. With so much information presented to
us, we may end up becoming blind to the messages encoded in
graphics, rendering them ineffective and even forgettable. Our
aim is to communicate effectively with the help of graphics, and
to make decisions that make our data visualisations simple and
effective carriers of information. The following areas can help
set the scene for this:

• If content is King, context is Queen: Assuming that you have identified a suitable source of reliable data, and it is ready to be used for analysis, you can embark on extracting useful information from it. Your content is ready, and without it there is no message. However, it is also important to take into account the overall background of that information, setting the arena for it to become knowledge and drive action. Context provides the audience with tools that make the content more understandable. Context is therefore Queen: an equal companion to content.

• Message clarity for the appropriate audience: We have mentioned that data visualisation communicates. If so, what is the message that you need to convey and what is the goal of the visualisation? That clarification can be sharpened by considering who the intended audience for the visualisation is. One thing that I find useful is to think about what the viewer is trying to achieve when presented with the graphic, and take into account the questions they may care about; that will make your visual more meaningful. It is sometimes useful to direct the query to them in the first place. In that way you are more likely to create a visualisation that addresses their needs.

• Choose a suitable visual: This is easier said than done, but in order to be effective in communicating with a visual, we need to make sure that we are selecting a display that is appropriate for the type of data we need to present. We also need to consider whether the audience may be familiar with the graphics we use, particularly if we use complex visualisations. We also need to ensure that the graphical representations we create have integrity in the sense discussed in Section 7.3. It is important that from the first moment the viewer looks at the visual they get an accurate portrayal of the data. Consider what it is that you are showing:

– Comparison: Use bar charts, column charts, line charts, tables or heat maps

– Relationship: Use a scatter plot or a bubble chart

– Composition and structure: Use stacked bars or pie charts (pie charts only if there are a small number of components)

– Distribution: Use histograms, box plots, scatter plots or area charts

• Simplicity is best: This is a case where the cliché of “less is more” is definitely true. A simple graphic is much more effective than one that has a lot of clutter. Just because you can add some visual elements, it does not mean you should: keep your visuals simple! Make sure that you include scales and labels on axes. However, you may not need labels on every single data point or every bar in a chart. Are you using a colour palette that is effective in communicating your message, and is it accessible to a wide range of viewers? Include only the necessary data and, if appropriate, add a succinct title.

• Iterate and sketch: A good, effective visualisation may require you to consider different designs, and you should be prepared to take an iterative approach to the task. Do not be afraid to go back to the literal drawing board and iterate. Sketching your ideas may help bring better, more polished designs that communicate better.
8
Dazzling Data Designs – Creating Charts

R2-D2 and C3-PO, Spock and Kirk, Eddie and Patsy, Tom and Jerry... I am sure you can think of many other famous couples, and in Chapter 7 we added to the list the amazing pairing of statistical analysis and data visualisation (there we included Spock and Kirk, Batman and Robin, and peanut butter and jelly, among others). They go so well with one another that it is almost impossible to think of any statistical analysis that does not rely on the use of graphics to explore, explain and exhort. In many cases, we are so used to certain charts that we need no further instruction on how to read the information in them. Nonetheless, when we are trying to paint a picture with our data, it is useful to remind ourselves of what kind of data we are able to present with each of them, but also to consider which chart may aid our data story better. Look at Section 7.8 for some best practices.

8.1 What Is the Right Visualisation for Me?

Table 8.1: Given the question of interest, and the type of data provided, this table provides guidance on the most appropriate chart to use.

Question                                          Aim                          Data Types                      Chart
How is the data distributed?                      Distribution                 1 Continuous                    Histogram, Box plot, Violin plot
How often does each value appear?                 Comparison                   1 Categorical                   Bar chart, Pie chart
How do the parts relate to the whole?             Composition                  1 Categorical                   Bar chart, Stacked bar chart
Is there a trend or pattern?                      Trend                        1 Continuous & 1 Ordinal        Line chart, Area chart
Is there a correlation between variables?         Relationship                 2 Continuous                    Scatterplot, Bubble chart
How often do certain values occur?                Comparison                   2 Categorical                   Heatmap
Are the distributions similar in each grouping?   Distribution & Comparison    1 Continuous & 1 Categorical    Small multiples of 1 continuous with a categorical encoding (e.g. colour)

Dazzling data designs can be created more easily when we consider the general aim of some of the charts we know and love. In a simplified manner, we can think of several aims for our visualisations. In Table 8.1 we show some of the questions and aims that we may have for our graphics, while taking into account the type of data needed for each. The following are typical use cases:
The following are typical use cases:

1. Distribution: We are interested in using charts that show us how items are distributed to different parts. Some charts that are useful for this are line charts, histograms and scatter plots.

2. Relationship: A typical use of statistics is to show the relationship between variables. Scatter plots or bubble charts are great choices for this.

3. Trend: Sometimes we need to indicate the direction or tendency of our data over a period of time or over an ordered progression. Line charts or bar charts, as well as area charts, can help.

4. Comparison: We can use visualisations to compare 2 or more variables. Items can be compared with the use of bar charts, heatmaps or pie charts. If we need to compare different partitions of our dataset we can use small multiples: a series of similar charts that use the same scale and axes, which lets us compare them easily.

5. Composition: Sometimes we are interested in finding out how the whole is composed of different parts. A typical example of this is a pie chart. Area charts and stacked bar charts are also good for this aim.

Following these general guidelines will help you create dazzling data designs that will inform, persuade and delight. They will help you choose the most appropriate visualisation to use, and save you a lot of time. For example, if you have data that changes over time, a line chart is a good solution. If you have more than one variable over the same period of time, you can present multiple lines in the same graphic. In the case where you need to emphasise how the values rise and fall, you may want to experiment with an area chart.

For comparisons, a bar chart works really well in the vast majority of cases. Not only is it a chart that needs no explanation, but it also uses length as the visual encoding for the comparison, making it really easy for your viewer to understand. Remember the design principles covered in Section 7.3. If the comparison is over groups of data you can use grouped or stacked bar charts.

To show parts of a whole or talk about proportions, consider using bar charts, stacked bar charts or even pie or donut charts. Remember that areas and angles are more difficult to discern than length, and if you have many parts to show, a pie chart may get too busy.

If what you want to show is a distribution, a histogram or a box plot may be the best choice. Scatter plots let us present a large number of values and display the trend. If you have three variables to encode, a bubble chart lets us encode the third variable in the size of the marker. We will cover each of these choices in more detail later on in this chapter. Before we do that, let us talk about the available Python tools that can help us in our task: matplotlib, pandas, Seaborn, Bokeh and Plotly.

8.2 Data Visualisation and Python

The next thing to consider for creating our data visualisation masterpieces is what tools we have at our disposal. There are several options and, since this is a book based on Python, we will mention a couple of tools you may want to explore. In the previous chapter we have already talked about matplotlib as the standard module for data visualisation; see Section 7.4 for more information about matplotlib. There are other options and we will briefly introduce them here. We will start with the visualisation capabilities of pandas, making this a really well-rounded module for end-to-end data analysis. We will also introduce a few widely used libraries: Seaborn, Bokeh and Plotly. Later in the chapter we will address the creation of the charts mentioned above using these tools.

8.2.1 Data Visualisation with Pandas

We know that pandas is an excellent companion to manage data structures and data analysis. In Section ?? we covered how pandas can help us make our data analysis much simpler by using series and dataframes. In turn, these objects let us carry out data exploration with the use of various methods. One important method that is part of these objects is plot, which is a wrapper on the plot methods from matplotlib's pyplot. In other words, we can apply this method directly to series or dataframes to create visualisations for the information contained in the objects.

This means that we can employ similar syntax to that of matplotlib, but apply it directly to the objects. Consider creating the following dataframe based on some random data generated with NumPy. We name the columns in our dataframe A and B and use the cumsum method to find the cumulative sum over each column. Note that running this on your machine may result in a different outcome, because we are using random numbers:

import numpy as np
import pandas as pd

timeSeries = np.random.randn(1000, 2)
df = pd.DataFrame(timeSeries, columns=['A', 'B'])
cm = df.cumsum()

We can now apply our knowledge of pandas to describe the data and look at some potentially useful information. In this case, we are interested in creating a line plot to visualise the cumulative sum for each column. We can easily do this as follows:

import matplotlib.pyplot as plt

cm.plot(style={'A': 'k-', 'B': 'k:'},
    title='Time Series Cumulative Sum')

Note that we are using the plot method for the pandas dataframe, and we are passing arguments to determine the style of each line. We also add a title to the plot. The styling follows the same syntax discussed in Section 7.4.2. The result can be seen in Figure 8.1.

Figure 8.1: Time series plot created with pandas.

As you can see, it is very straightforward to create a line plot. The same philosophy can be used for other charts, and all we need to do is to specify the kind of chart we are interested in with the kind parameter, including:

• Bar plots: 'bar' or 'barh'

• Histograms: 'hist'

• Boxplots: 'box'

• Area plots: 'area'

• Scatter plots: 'scatter'

• Pie charts: 'pie'

We will look at each of these separately in the following pages; a quick example of the kind parameter in action is shown below.
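As a taster, a minimal sketch reusing the df dataframe created above shows how the same plot method produces different charts simply by changing the kind parameter:

# Same syntax, different charts via the kind parameter
df.plot(kind='hist', bins=20, alpha=0.7,
    title='Distribution of the Raw Values')
df.plot(kind='box', title='Raw Values per Column')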

8.2.2 Seaborn

Creating plots with matplotlib, or pandas for that matter, seems to be enough, so why use other libraries? The answer to that may become clear when you try to create visualisations that are more than a pair of axes and a few lines or bars. Things can get very complex, very quickly. Michael Waskom introduced seaborn1 as a way to simplify some of those complexities via a high-level interface to matplotlib, with close integration with pandas data structures.

1 Waskom, M. L. (2021). Seaborn: Statistical data visualization. Journal of Open Source Software 6(60), 3021.

Seaborn understands statistical analysis and this makes it easier to create visualisations that need to communicate statistical results from scratch. Furthermore, the library automatically maps data values to visual attributes such as style, colour and size. It also adds informative labels such as axis names and legends. To use Seaborn all you need to do is install the library and import it as follows:

import seaborn as sns

You may find that the standard alias or abbreviation for the library is sns. It is said that the alias was proposed by Waskom as a tribute to the fictional character Samuel Norman Seaborn from the TV show The West Wing, played by Rob Lowe for four seasons. Great to see I am not alone in giving a nod to favourite fictional characters!

In Seaborn we can specify the dataframe where our data is stored, and operate on the entire object. We can use the columns in the dataset to provide encodings for different attributes such as colour, hue, size, etc. We can re-create the plot from the previous section as follows:

sns.set_theme(style='whitegrid')

sns.lineplot(data=cm, palette='dark',
    linewidth=2.5).\
    set(title='Time Series Cumulative Sum')

Note that we are specifying a theme for our plot and Seaborn gives us some beautiful styles for our plots. Here we are using one that automatically adds grids for our axes. The line plots are created with the lineplot function and we specify the data to be plotted with the data parameter; you can see that we are also adding a palette to be used and the line width for our graph. Finally, we use the set method to include a title. The end result can be seen in Figure 8.2.

Figure 8.2: Time series plot created with Seaborn.
8.2.3 Bokeh

Sometimes it is useful to provide our viewers not only with a chart, but also to enable them to interact with it. One tool that lets us do that is Bokeh. Bokeh2 is a powerful and versatile library with concise syntax to create interactive graphics over large or streaming datasets. The name of the library is taken from the term bokeh used in photography, in turn derived from the Japanese word 暈け/ボケ meaning “blur”. It refers to the aesthetic quality of the blur produced in images with parts that are out of focus.

2 Bokeh Development Team (2018). Bokeh: Python library for interactive visualization. https://bokeh.pydata.org/en/latest/

The rendering in Bokeh is done directly in HTML and JavaScript. This means that we can extend our creations by adding custom JavaScript. After installing Bokeh on your machine you can import the following:

from bokeh.plotting import figure
from bokeh.io import show, output_notebook

The code above assumes we are working on a Jupyter notebook. That is why we are importing the output_notebook function. If you are working on a script, you will probably be better off sending the output to a file and you will instead need to import the following:

from bokeh.io import output_file

output_file('myfilename.html')

Note that in the code above we are sending the output to an HTML file with the name myfilename.html.

For us to reproduce the cumulative sum plot from Section 8.2.1 we are adding a column to our cm dataframe to serve as the values for the x-axis:

cm['idx'] = cm.index

We are doing this so that we can use the dataframe as a source of data for Bokeh. This will become clearer when we start creating our chart.

The following code creates a Bokeh figure with a given title. We then add different elements to our chart, in this case two lines with data coming from our cm dataframe:

Figure 8.3: Time series plot created with Bokeh.
p = figure(title='Time Series Cumulative Sum')

p.line(x='idx', y='A', source=cm,
    legend_label='A')
p.line(x='idx', y='B', source=cm,
    legend_label='B', line_color='red',
    line_dash='dashed')

p.legend.location = 'top_left'

Note that we are adding elements to the p object that holds our Bokeh figure. We are passing the names of the columns inside the dataframe to the parameters for x and y, indicating the dataframe in the source parameter. In the third code line we are changing the colour and line style. Finally, we indicate to Bokeh where to place the legend with the information given to legend_label for each of the lines. Figure 8.3 shows the chart created. In your notebook or HTML page you will see a menu that lets you interact with the chart by panning and zooming, and even a way to export your chart. Bokeh offers interactivity, enjoy it!

8.2.4 Plotly

Another great tool to create interactive plots is Plotly3. The API is provided by Plotly, a company headquartered in Quebec, Canada, which offers a number of open-source and enterprise products. Their JavaScript library called Plotly.js is offered as open source and it powers Plotly.py for Python. They also support other programming languages such as R, MATLAB, Julia, etc.

3 Plotly Technologies Inc (2015). Collaborative data science. https://plot.ly

We will assume that you are using a Jupyter notebook to create your Plotly charts. We will be using the Plotly Express module, which provides functions that let us create entire figures in a simple way. Plotly Express is part of the full Plotly library, and is a good starting point to use Plotly.

After installing Plotly you can import Plotly Express as follows:

import plotly.express as px

We will be using the original dataframe for our cumulative sum. The code to re-create the chart from Section 8.2 in Plotly is very simple:

fig = px.line(cm)
fig.show()

The result will be an interactive chart similar to the one shown in Figure 8.4. In your Jupyter notebook you will also find some controls to interact with the plot, letting you pan, zoom, autoscale and export the chart.

Figure 8.4: Time series plot created with Plotly.

Note that Plotly has automatically added a legend and placed it outside the plot area. It also added labels to the axes and has rendered the plot in a nice-looking theme. You can also see that hovering on top of the line gives us information about that part of the plot.

8.3 Scatter Plot

A scatter plot displays the values of two different variables as individual points, using markers to represent values for two different numeric variables. The data for each point is represented along the horizontal and vertical axes of our chart. The main use of a scatter plot is to show the relationship of two numeric variables given by the pattern that emerges from the position of the individual data points.

In Section 6.4 we used some cities data to take a look at measuring correlation. That dataset is available at4 https://doi.org/10.6084/m9.figshare.14657391.v1 as a comma-separated value file with the name “GLA_World_Cities_2016.csv”. Let us start by creating a simple scatter plot of the population of a city versus its approximate radius. Let us create the plot first with matplotlib:

4 Rogel-Salazar, J. (2021a, May). GLA World Cities 2016. https://doi.org/10.6084/m9.figshare.14657391.v1

fig, ax = plt.subplots(figsize=(10,8))

ax.scatter(x=gla_cities['Population'],
    y=gla_cities['Approx city radius km'], s=50)

ax.set_xlabel('Population')
ax.set_ylabel('Approx city radius km')
ax.set_title('Population v City Radius (matplotlib)')

The result can be seen in Figure 8.5. The position of each dot on the horizontal and vertical axes indicates the value of each of the two variables in question. We can use the plot to visually assess the relationship between the variables.

We are using the attribute s as the size of the dots in the scatterplot.

Figure 8.5: Scatterplot of city population versus its approximate radius size. The plot was created with matplotlib.

We may want to encode other information in the scatterplot, for example, capture the classification for the size of the city, i.e. Mega, Large, Medium or Small. We can map the categories into a colour map and use this to present the information as the colour of the dot in the scatter plot:

c_map = {'Mega': 'r', 'Large': 'b', 'Medium': 'g', 'Small': 'black'}
colours = gla_cities['City Size'].map(c_map)

We are colouring Mega with red, Large with blue, Medium with green and Small with black.

Let us now create the plot with this new information using pandas with the help of the scatter method. In this case we are using the mapping above as the value passed to the c attribute, which manages the colour of the marker. The result can be seen in Figure 8.6. Note that we are printing the plots in black and white, but you will be able to see them in full colour on your computer.

gla_cities.plot.scatter(
    x='Population', y='Approx city radius km',
    title="Population v City Radius (pandas)",
    s=50, c=colours)

Figure 8.6: Scatterplot of city population versus its approximate radius size; the colour is given by the city size category in the dataset. The plot was created with pandas.

In the example above we have not added a legend to clarify what each colour means. This is not a good thing as we want our plot to communicate effectively. We can write some code to create the legend, as sketched below, or we can use other visualisation tools that simplify the work for us.
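For completeness, here is one way of adding such a legend by hand. This is a sketch only: it assumes the axes returned by the pandas scatter call have been captured, for example with ax = gla_cities.plot.scatter(...), and it reuses the c_map colour dictionary defined earlier.

from matplotlib.lines import Line2D

# One legend handle per city-size category, coloured with the c_map entries
handles = [Line2D([0], [0], marker='o', linestyle='',
                  color=colour, label=size)
           for size, colour in c_map.items()]
ax.legend(handles=handles, title='City Size')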
Let us take a look at embedding the category of the city in the colour of the marker, and furthermore embed yet another property in the size of the marker. This is sometimes referred to as a bubble chart. We will use Seaborn to create the plot. In this case we need to use the scatterplot method.

Figure 8.7: Bubble chart of city population versus its approximate radius size. The colour is given by the city size category in the dataset, and the marker size by the people per dwelling. The plot was created with Seaborn.

sns.scatterplot(data=gla_cities,
    x='Population', y='Approx city radius km',
    hue='City Size', alpha=0.8,
    size='People per dwelling', sizes=(20, 500))

Here, the colour of the markers is managed by the hue property, and we are using the column City Size for this purpose. In case the markers overlap, we are using a transparency of 0.8 given by the alpha parameter. Finally, the size of the bubbles is given by the People per dwelling column in the dataset and we pass a tuple that manages the minimum and maximum sizes for the bubbles. The syntax in Seaborn makes it easy to create complex plots. The end result can be seen in Figure 8.7. Please note that the legends are automatically generated for us.

Figure 8.8: Bubble chart of city population versus its approximate radius size. The colour is given by the city size category in the dataset, and the marker size by the people per dwelling. The plot was created with Bokeh.

In Bokeh the scatter plot can be reproduced as follows. To make the code a bit more readable, we first create some arrays with the information to be used in the plot:

x = gla_cities['Population']
y = gla_cities['Approx city radius km']
s = gla_cities['People per dwelling']

We can now pass these objects to the scatter method. Note that we are passing the array called s as the size of the markers, and using the colours mapping created for the pandas example. Bokeh offers a lot of flexibility, but it may take more code to get there. The result can be seen in Figure 8.8.

p = figure(width=750, height=500)
p.xaxis.axis_label = 'Population'
p.yaxis.axis_label = 'Approx city radius km'
p.scatter(x=x, y=y, size=10*s, color=colours,
    alpha=0.6)
show(p)

The solution above looks good, but it still leaves us with some work to do with the legends and the colour mapping. A great solution to this is the plotting backend provided by Pandas Bokeh5, providing support for pandas dataframes.

5 P. Hlobil (2021). Pandas Bokeh. https://github.com/PatrikHlobil/Pandas-Bokeh

Let us first import the library:

import pandas_bokeh

pandas_bokeh.output_notebook()

We now create a new column to hold the sizes of the markers based on the People per dwelling column:

gla_cities['bsize'] = 10*\
    gla_cities['People per dwelling']



We can now use the plot_bokeh method for the pandas dataframe to create our scatter plot.

Figure 8.9: Bubble chart of city population versus its approximate radius size; the colour is given by the city size category in the dataset, and the marker size by the people per dwelling. The plot was created with Bokeh using the Pandas Bokeh backend.

p_scatter = gla_cities.plot_bokeh.scatter(
    x='Population',
    y='Approx city radius km',
    category='City Size',
    size='bsize', alpha=0.6,
    legend='bottom_right')

As you can see, we have the best of both worlds: a simple way of creating a scatter plot and the interactivity provided by Bokeh.

Finally, let us create the scatter plot with the help of Plotly. In this case we call the scatter method of Plotly Express. The result is shown in Figure 8.10.

Figure 8.10: Bubble chart of city population versus its approximate radius size. The colour is given by the city size category in the dataset, and the marker size by the people per dwelling. The plot was created with Plotly.

fig = px.scatter(gla_cities, x='Population',
    y='Approx city radius km',
    color='City Size',
    size='People per dwelling')
fig.show()

8.4 Line Chart

A line chart is exactly what it sounds like: a plot that uses connected line segments, organised from left to right, to show changes in a given value. In other words, it displays a series of data points that are connected by lines. The variable shown in the horizontal axis is continuous, providing meaning to the connection between segments. A typical case is time. It is important to ensure that the intervals of the continuous variable in the horizontal axis are regular or equally distributed. The values in the vertical axis represent a metric of interest across the continuous progression.

Since a line chart emphasises the changes in the values of the variable shown in the vertical axis as a function of the variable shown in the horizontal one, it is good for showing a trend and/or relationship and can be used to find patterns in that change. A good example is the result of a regression analysis, where a line of best fit can be shown together with a scatter plot of the data in question.
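As a minimal sketch of that idea, with made-up arrays x and y standing in for real observations, a line of best fit can be computed with NumPy's polyfit and drawn on top of the scatter plot:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: replace x and y with your own observations
x = np.linspace(0, 10, 50)
y = 2.0*x + 1.0 + np.random.randn(50)

# Fit a straight line (degree-1 polynomial) and plot it over the scatter
slope, intercept = np.polyfit(x, y, 1)
plt.scatter(x, y, color='gray')
plt.plot(x, slope*x + intercept, color='black')
plt.show()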

The examples discussed in Section 8.2 when talking about the different visualisation packages used in this chapter are all line charts. We will therefore avoid repetition and simply encourage you to refer to that section to see how to create line charts. In summary, we have the following commands for creating line plots with the libraries we are exploring in this book:

• Matplotlib: Use the plot method of pyplot

• Pandas: Use the plot method of the DataFrame object

• Seaborn: Use the lineplot method of seaborn

• Bokeh: Use the line method of the figure object

• Plotly: Use the line method of plotly.express

However, we will show a nice feature of Seaborn to plot regression lines out of the box. For this matter we will use the “jackalope.csv” dataset from Section 7.2. We will be using the jointplot method that lets us combine different statistical plots in a simple manner. We need to provide the kind of plot to draw; the options include:

• 'scatter' for a scatter plot

• 'kde' for a kernel density estimate

• 'hist' for a histogram

• 'hex' for a hexbin plot

• 'reg' for a regression line

• 'resid' for a residuals plot

As mentioned above, we are interested in creating a regression line given the observations. We first read the data into a pandas dataframe and then create the plot:

df = pd.read_csv('jackalope.csv')

sns.jointplot(x='x', y='y', data=df,
    kind='reg', color='black');

Note that we are plotting the values of x and y in the dataset contained in the dataframe df and request a regression plot with the kind='reg' parameter.

The result contains a scatter plot plus marginal histograms and a regression line including a confidence interval, and is shown in Figure 8.11.

Figure 8.11: A scatter plot for the jackalope.csv dataset including a regression line and marginal histograms, created with the jointplot method of Seaborn.

8.5 Bar Chart

In many situations it is necessary to compare values for different categories. A bar chart shows numeric values for the levels of a categorical feature as bars. As we know from the discussion on perception and visual representation in Section 7.3, we are better at distinguishing differences in length than other encodings. In a bar chart the comparison between different categories uses the length of each bar, and this can be done either horizontally or vertically.

Let us continue working with the data from the “GLA_World_Cities_2016.csv” file. We are interested in comparing the total population of the cities contained in the data by country. We will create a bar chart of population by country; first we group our data, creating a dataframe with the information grouped by country:

pop = gla_cities.groupby('Country')[[
    'Population']].sum()

We can now use this dataframe, indexed by country, to create a bar chart with matplotlib as follows:

# Create the figure and axes first
fig, ax = plt.subplots()
ax.bar(x=pop.index, height=pop['Population'],
    color='gray', edgecolor='k')
ax.set_xlabel('Country')
ax.set_ylabel('Population (for the Cities in the dataset)')

The result can be seen in Figure 8.12. Note that we are using the index of the grouped dataframe for the horizontal axis, whereas the height is given by the population column of the grouped dataframe. We also specify the colour and the edge line for all the bars.

Let us re-create the same plot but this time using the methods of the pandas dataframe itself:

Figure 8.12: A bar chart for the total population per country for the cities contained in our dataset. The plot was created with matplotlib.

ax = pop.plot(y='Population', kind='barh',
    rot=45, color='gray', edgecolor='k')
ax.set_xlabel('Country')
ax.set_ylabel('Population (for the Cities in the dataset)')

With the help of the plot method of the dataframe we create a horizontal bar chart (kind='barh'), and we specify that the labels are rotated by 45 degrees. We can create a (vertical) bar chart with kind='bar'.

Figure 8.13: A horizontal bar chart for the total population per country for the cities contained in our dataset. The plot was created with pandas.

Sometimes it is important to represent multiple types of data in a single bar, breaking down a given category (shown in one of the axes) into different parts. In these situations we can use a stacked bar chart where each bar is segmented into the components of the subcategory shown in the graph.

For example, we may want to look at the population per country in the cities contained in the dataset as above, but we are interested in looking at how each of the city size categories contributes to each population. We can do this with the help of groupby in pandas:

popcitysize = gla_cities.groupby(['Country',
    'City Size'])[['Population']].sum().unstack()

In the code above, we are asking pandas to unstack the dataframe, so that we can use the full dataframe to create our stacked bar chart as follows:

Figure 8.14: A stacked bar chart for the total population per country for the cities contained in our dataset, categorised by city size. The plot was created with pandas.

fig, ax = plt.subplots()
popcitysize.plot(kind='bar', stacked=True,
    rot=45, ax=ax)
ax.set_xlabel('Country')
ax.set_ylabel('Population (for the Cities in the dataset)')

In this case we use the property stacked=True to create a stacked bar chart. The result of the code above is shown in Figure 8.14. We can create a column chart with stacked=False where the bars will be shown side by side, as shown below. This can be useful for comparisons within each category.
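For instance, reusing the grouped dataframe from above:

# Grouped (side-by-side) bars instead of stacked ones
popcitysize.plot(kind='bar', stacked=False, rot=45)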



Let us create a column chart with Seaborn:

sns.barplot(data=gla_cities, x='Population',
    y='Country', hue='City Size',
    ci=None, estimator=sum)

Figure 8.15: A column chart for the total population per country for the cities contained in our dataset, categorised by city size. The plot was created with Seaborn.

Since Seaborn is geared towards statistical plotting, it creates by default a plot for the mean of the values. We can change the default and instead get a total. This is why we use the estimator=sum parameter. We are removing the confidence interval with ci=None. Note that we are not required to use pandas to group the data first. Finally, we can orient the plot by placing the categorical variable in either the x or y parameter. The result can be seen in Figure 8.15.

It is also possible to flatten the grouped dataframe and use it to create our plot. Let us do that and use Pandas Bokeh to create a horizontal stacked bar chart.

Figure 8.16: A stacked bar chart for the total population per country for the cities contained in our dataset, categorised by city size. The plot was created with Pandas Bokeh.

popf = pd.DataFrame(popcitysize.to_records())
popf.columns = ['Country', 'Large', 'Medium',
    'Mega', 'Small']

p_stacked_hbar = popf.plot_bokeh(
    x='Country', stacked=True, kind='barh',
    xlabel='Population (for the Cities in the dataset)')

Finally, let us create a stacked bar chart with Plotly using the popf dataframe created above. The result is shown in Figure 8.17.

Figure 8.17: A stacked bar chart for the total population per country for the cities contained in our dataset, categorised by city size. The plot was created with Plotly.

fig = px.bar(popf, x='Country',
    y=['Large', 'Medium', 'Mega', 'Small'])
fig.show()

8.6 Pie Chart

An alternative to a bar chart to show comparison between values is the pie chart. A pie chart is a circular chart whose slices reflect the proportion, or relative size, of different data groups or categories. Instead of showing the actual value (as in the bar chart), a pie chart shows the percentage contribution as a part of the whole. Since this is a circular chart, there are no horizontal or vertical axes, and instead we show the proportion with the help of an angle.

As you may recall from our discussion on perception and visual representation in Section 7.3, we know that it is easier to visualise differences in length than differences in area. This is one of the reasons why sometimes bar charts are preferred to the use of pie charts, particularly on occasions where the segments in the pie chart are close to each other in size. Consider using a bar chart when thinking of pies; your viewers will thank you. Now consider the data shown in Table 8.2.

Table 8.2: A table of values to create a pie chart and compare to a bar chart.

Category  Value
A         10
B         11
C         12
D         13
E         11
F         12
G         14

Let us put the data above in a pandas dataframe:

cat = list('ABCDEFG')
val = [10, 11, 12, 13, 11, 12, 14]
df = pd.DataFrame({'category': cat, 'value': val})

To create our pie chart, we now calculate the proportion of the total for each entry:

df['pct'] = df['value']/df['value'].sum()

Figure 8.18: Top: A pie chart of the information shown in Table 8.2. The segments are very similar in size and it is difficult to distinguish them. Bottom: A bar chart of the same data.

Let us create a pie chart with this information:

plt.pie(df['pct'], labels=df['category'],
    startangle=90)
plt.axis('equal')

We are using the pie method in matplotlib to plot the percentage column calculated before and labelling the segments with the corresponding category. The startangle parameter determines where we start drawing the pie. The result can be seen in the top part of Figure 8.18.

The segments are so similar in size that it is difficult to distinguish which may be larger or even which are actually the same. We may have to add legends to support our viewers, but this implies altering the data-ink ratio of our chart (see Section 7.3 about the data-ink ratio). If the segments shown are more obviously different in size from each other, a pie chart is great. However, as we have seen in the example above, we may be asking for extra cognitive power from our viewers. This gets much worse as we add more and more segments to the pie, making it impossible to read. These are some of the reasons why there are some detractors to the use of pie charts.

Instead, let’s see what happens when we create a bar chart with the same data, as shown at the bottom of Figure 8.18. We can distinguish things better with a bar chart. First, we may be able to use the actual values rather than a percentage. Second, the length of the bars can easily be compared with each other, and it is immediately obvious to see which one is the longest.

A compromise between the compactness of a pie chart and the readability of a bar chart is a donut chart. Appropriately, it is a pie chart with its centre cut out. This is a simple change, but it helps avoid confusion with the area or angle perception issues of the pie chart. Let us create a donut chart of the data in Table 8.2 with pandas:

df.set_index('category').plot.pie(y='pct',
    wedgeprops=dict(width=0.3),
    autopct="%1.0f%%", legend=False)

Figure 8.19: A donut chart of the data from Table 8.2 created with pandas.

We are able to take the centre out with the wedgeprops property, where the width indicates the width of the segments in the donut. In this case we are also adding labels for the percentage value of each segment and indicate the format to be used. Finally, we take out the legend as it is not necessary. The plot can be seen in Figure 8.19.

With Pandas Bokeh we can recreate the pie chart as follows:

p_pie = df.plot_bokeh.pie(
    x='category', y='pct')

Whereas in Plotly we can draw the same graph with:

fig = px.pie(df, values='value',
    names='category', hole=0.7)
fig.show()

Since the plots obtained with Bokeh and Plotly are very similar to the one obtained with matplotlib, we are not showing them here. Please note that Seaborn relies on matplotlib to create pie charts, so we are also skipping that. All in all, pie charts are recommended to be used only in cases where there is no confusion, and in general you are better off using a bar chart. One thing you should definitely avoid is the use of 3D pie charts or exploding pies, even if for comedic effect! The only 3D pies you should have (if at all) are edible ones.

8.7 Histogram

We have been talking extensively about the distribution of data and how important it is when thinking about statistical tests. In Chapter 5 we cover a few of the most important probability distributions, and their shape is a distinctive feature for each of them. When thinking of data points, we often need to consider how they are distributed, and one way to represent this frequency distribution is through a histogram.

A histogram is similar to a bar chart in the sense that we use rectangular shapes to construct our representation. In a histogram we use the length of the rectangle to represent the frequency of an observation, or a set of observations, in an interval. The base of each rectangle represents the class interval or bin. Note that in a bar chart the width of the bar has no special meaning; in a histogram it does! The shape of the histogram depends on the number of class intervals or bins we use. The rectangles in a histogram are presented right next to each other. Throughout Chapter 5 we used histograms to show the shape of various probability distributions.

Let us start by looking at how to build a histogram with matplotlib. We are going to be using the cars dataset we encountered in Chapter 4, available6 at https://doi.org/10.6084/m9.figshare.3122005.v1. We will create a histogram of the miles per gallon variable:

6 Rogel-Salazar, J. (2016, Mar). Motor Trend Car Road Tests. https://doi.org/10.6084/m9.figshare.3122005.v1

plt.hist(cars['mpg'], bins=15, color='gray')
plt.xlabel('MPG')
plt.ylabel('Frequency')

We are using hist to create the histogram and we provide the number of bins with the bins parameter; here we have 15 bins. The histogram generated by the code above can be seen in Figure 8.20. You can try changing this number and see what happens to your histogram. The default is 10.

We may want to look at the distribution of our variable cut by a different feature. For instance, we may need to look at the histogram for automatic and manual transmission given by the variable am:

Figure 8.20: A histogram of the miles per gallon variable in the cars dataset. The chart is created with matplotlib.

aut = cars[cars['am']==0][['mpg']]
man = cars[cars['am']==1][['mpg']]

We can use these dataframes to create our histograms. We are going to be using the hist method of the pandas dataframes:

aut.hist(alpha=0.8, label='Automatic',
    color='black')
man.hist(alpha=0.8, label='Manual', color='gray')
plt.legend()

We are using the default bins (i.e. 10) and we are also adding a suitable legend to our plot so that we can distinguish the automatic from the manual data. The plot is shown in Figure 8.21.

Figure 8.21: Histogram of the miles per gallon as a function of the type of transmission. The chart is created with pandas.

Seaborn can make the task a bit easier with histplot, as we can pass the hue parameter to separate the automatic and manual histograms. Let us take a look:

sns.histplot(data=cars, x='mpg', hue='am',
    bins=15, kde=True)
plt.ylabel('Frequency')
plt.xlabel('Miles per Gallon')
plt.legend(title='Transmission',
    loc='upper right',
    labels=['Automatic', 'Manual'])

Figure 8.22: Histogram of the miles per gallon as a function of the type of transmission. The chart is created with Seaborn.

The result is shown in Figure 8.22. Notice that we are also requesting Seaborn to plot a kernel density estimate (KDE) with kde=True. This is a representation of the data using a continuous probability density curve. We are also improving the look of the legend a bit to make it more readable.

We can combine the individual series obtained before into a single dataframe and use it to create our histogram with Pandas Bokeh:

df = pd.concat([aut, man], axis=1)
df.columns = ['Automatic', 'Manual']

Let us use the dataframe df above to plot our histogram using Pandas Bokeh. The plot can be seen in Figure 8.23.

df.plot_bokeh.hist(bins=10,
    ylabel='Freq', xlabel='Miles Per Gallon',
    line_color="black")

Figure 8.23: Histogram of the miles per gallon as a function of the type of transmission. The chart is created with Pandas Bokeh.

Finally, let us create the histogram with Plotly as follows:

fig = px.histogram(cars, x='mpg', color='am',
    nbins=16)
fig.update_layout(barmode='overlay')
fig.update_traces(opacity=0.75)
fig.show()

The result is in Figure 8.24. Before we close this section, let us see the pairplot method in Seaborn. It shows pairwise relationships in a dataset: Seaborn's pairplot lets us plot multiple pairwise bivariate distributions. In this way we can look at the distribution of a variable in the diagonal of the grid chart, and scatter plots of the pairwise relations in the off-diagonal. Let us create one to see the relationship of the miles per gallon and the horse power per transmission:

Figure 8.24: Histogram of the miles per gallon as a function of the type of transmission. The chart is created with Plotly.

mycars = cars[['mpg', 'hp', 'am']]
sns.pairplot(mycars, diag_kind='kde',
    kind='scatter', hue='am')

The pairplot created is shown in Figure 8.25. Notice that we can choose to show a histogram in the diagonal by changing diag_kind to 'hist' and we could obtain regression lines by changing kind to 'reg'. Try that for yourself!
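As a starting point for that exercise, the suggested variant is a one-line change:

# Histograms on the diagonal and regression lines in the off-diagonal panels
sns.pairplot(mycars, diag_kind='hist', kind='reg', hue='am')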



Figure 8.25: Pairplot of the cars dataset showing the relationship between miles per gallon and horse power per transmission type.

8.8 Box Plot

We have seen how the distribution of a variable can be seen with the help of a histogram (see Section 8.7), but that is not the only way to visualise this information. An alternative way is the use of a box plot, sometimes referred to as a whisker plot.

Take a look at the box plot diagram shown in Figure 8.26. A box plot represents the quartiles of the data, the maximum and minimum values and even outliers, by showing them past the whiskers of the boxplot. See Section 4.4.2 for more information on quartiles. The body of the box plot is given by the first and third quartiles (Q1 and Q3 in the diagram), the line inside the box represents the median of the data, and the whiskers show the maximum and minimum values.
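To connect the picture with the numbers, the quantities summarised by the box can be computed directly with pandas; a quick sketch for the miles per gallon variable used below:

# First quartile, median and third quartile of the mpg variable
print(cars['mpg'].quantile([0.25, 0.5, 0.75]))

# The extreme values shown by the whiskers
print(cars['mpg'].min(), cars['mpg'].max())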

Let us take a look at creating a box plot for the miles per gallon variable in our cars dataset.

plt.boxplot(cars['mpg'])
plt.xlabel('MPG')

Figure 8.26: Anatomy of a boxplot.

Figure 8.27: Box plot of the miles per gallon variable in the cars dataset. The chart is created with matplotlib.

The box plot is shown in Figure 8.27. Notice how the box plot summarises the data in a neat single graphic.

In Section 6.6.1 we used the cars data as an example of the comparison of the mean of two groups. In that case we wanted to know if the difference in the mean value of the miles per gallon variable as a function of the transmission was statistically significant. We can depict that information with the help of a box plot. Let us use pandas to create the plot shown in Figure 8.28.

Figure 8.28: Box plot of the miles per gallon as a function of the type of transmission. The chart is created with pandas.

cars[['mpg', 'am']].boxplot(column='mpg',
    by='am')
plt.grid(axis='x')

Notice that the width of the box does not have meaning, but it is possible to use other representations. For example, in the case of the so-called violin plot, instead of using a rectangle for the body of the graphic, we use approximate density distribution curves. Be mindful that such a chart may be more difficult to interpret. Let us use Seaborn to see the different things we can do to represent the same data:

sns.boxplot(x='am', y='mpg',
    data=cars)

The code above will generate the chart shown in the left panel of Figure 8.29.

Figure 8.29: Left: Box plots of the miles per gallon as a function of the type of transmission. Middle: Same information but including a swarm plot. Right: Same information represented by violin plots. Graphics created with Seaborn.

Sometimes it is useful to superimpose the data points that make up the box plot, and we can do this in Seaborn by combining the code above with a swarm plot as follows:

sns.swarmplot(x='am', y='mpg', data=cars)

A swarm plot includes the data points in our box plot.
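To be explicit, reproducing the middle panel of Figure 8.29 means issuing both calls against the same axes; a minimal sketch (the black colour for the points is our choice, not a requirement):

# Draw the box plot first, then overlay the individual observations
ax = sns.boxplot(x='am', y='mpg', data=cars)
sns.swarmplot(x='am', y='mpg', data=cars, color='black', ax=ax)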

The result is the chart shown in the middle panel of Figure 8.29. Finally, Seaborn lets us create violin plots too; we can create the violin plot shown in the right panel of Figure 8.29 as follows:

sns.violinplot(x='am', y='mpg', data=cars)

You may notice that in the violin plot, right in the middle of each density curve, there is a small box plot, showing a little box with the first and third quartiles and a central dot to show the median. Violin plots are sometimes difficult to read, so I would encourage you to use them sparingly.

Figure 8.30: Box plot of the miles per gallon as a function of the type of transmission. The chart is created with Plotly.

Finally, let us re-create the box plot chart with the help of Plotly, as shown in Figure 8.30.

fig = px.box(cars, y='mpg', color='am',
    points='all')
fig.show()

Notice that we are using the points='all' parameter to show the data observations alongside the box plots. This is similar to the swarm plot created with Seaborn.

8.9 Area Chart

An area chart is a special form of a line graph. We can think of it as a combination of a line and a bar chart to show how the values of a variable change over the progression of a second (continuous) variable. The fact that the area below the line is filled with a solid colour lets us demonstrate how various data series rise and fall.

Table 8.3: First encounters by notable Starfleet ships.

Star date   Enterprise   Cerritos   Discovery
47457.1     22           9          2
47457.2     21           6          3
47457.3     15           9          5
47457.4     18           5          7
47457.5     13           7          7
47457.6     25           5          2
47457.7     13           6          1
47457.8     21           4          9
47457.9     11           8          4
47458.0     13           5          5

Consider the data in Table 8.3 showing the number of recent first encounters made by notable Starfleet ships (for any fellow Star Trek fans, the data is totally made up). We assume that we have created a pandas dataframe called df with this data, ready to plot some area charts; one way of building it is sketched below. In matplotlib we can use the stackplot method, see Figure 8.31.
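The dataframe itself is not listed in the text; one way to build it from Table 8.3, with column names chosen to match the plotting code that follows, is:

import pandas as pd

# Data from Table 8.3
df = pd.DataFrame({
    'stardate': [47457.1, 47457.2, 47457.3, 47457.4, 47457.5,
                 47457.6, 47457.7, 47457.8, 47457.9, 47458.0],
    'enterprise': [22, 21, 15, 18, 13, 25, 13, 21, 11, 13],
    'cerritos': [9, 6, 9, 5, 7, 5, 6, 4, 8, 5],
    'discovery': [2, 3, 5, 7, 7, 2, 1, 9, 4, 5]})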

Figure 8.31: Area plot of the data from Table 8.3 created using matplotlib.

An area chart lets us express how different components relate to the whole over time, enabling a comparison too. Here is code to create one with matplotlib:

plt.stackplot(df['stardate'], df['enterprise'],
    df['cerritos'], df['discovery'],
    labels=['Enterprise', 'Cerritos', 'Discovery'])
plt.legend()
plt.xticks(rotation=45)
xlabels = [f'{label:,}' for label in df['stardate']]
plt.xticks(df['stardate'].values, xlabels, rotation=45)
plt.xlabel('Star Date')
plt.ylabel('Encounters')

The values in the area chart are stacked; in other words, the count of the second series starts at the end of the first series, not at zero. Let us create an unstacked area chart with pandas and the area method of the dataframe. See Figure 8.32.

Figure 8.32: Unstacked area plot of the data from Table 8.3 created with pandas.

idxdf = df.set_index('stardate')
idxdf.plot.area(rot=45,
    stacked=False)
xlabels = [f'{label:,}' for label in idxdf.index]
plt.xticks(idxdf.index, xlabels)
plt.xlabel('Star Date')
plt.ylabel('Encounters')

The area chart can be reproduced in Pandas Bokeh with the area method. The result of the code below is shown in Figure 8.33:

p = df.plot_bokeh.area(x='stardate',
    stacked=True,
    ylabel='Encounters',
    xlabel='Star Date')

Figure 8.33: Area plot of the data from Table 8.3 created using Pandas Bokeh.

Finally, let us use Plotly to create the same graph, shown in Figure 8.34. We employ the area method as follows:

fig = px.area(df, x='stardate',
    y=['enterprise', 'cerritos', 'discovery'])
fig.show()

Figure 8.34: Area plot of the data from Table 8.3 created using Plotly.

8.10 Heatmap

A heatmap is a chart that depicts the values in a variable of interest using colours in a scale that ranges from smaller to higher values. As we discussed in Section 7.3, presenting information in a way that appeals to our pre-attentive perception makes it easier to pick out patterns. If we are interested in grabbing the attention of our viewers by highlighting important values, we make sure that the colours used in our heatmap do the work for us.

Let us prepare a small summary of the number of cars per transmission and number of cylinders:

We first group our data to create a table that shows the count of cars per transmission and number of cylinders:

wrd = {0: 'Manual', 1: 'Automatic'}
cars['transmission'] = cars['am'].map(wrd)

carcount = cars.groupby(['transmission', 'cyl']) \
    .count().rename(columns={'Car': 'cars'}) \
    .filter(['transmission', 'cyl', 'cars']) \
    .reset_index()

A pivot table then lets us summarise our information and read the count table in a straightforward way:

> carheatmap = carcount.pivot(index='cyl',
    columns='transmission')
> carheatmap = carheatmap['cars'].copy()
> print(carheatmap)

transmission  Automatic  Manual
cyl
4                     8       3
6                     3       4
8                     2      12

We can now use the pivot table to create our heatmap with matplotlib as follows:

import matplotlib as mp

color_map = mp.cm.get_cmap('binary')
plt.pcolor(carheatmap, cmap=color_map)
xt = np.arange(0.5, len(carheatmap.columns), 1)
yt = np.arange(0.5, len(carheatmap.index), 1)
plt.xticks(xt, carheatmap.columns)
plt.yticks(yt, carheatmap.index)
plt.colorbar()

The heatmap created with the code above is shown in Figure 8.35.

Figure 8.35: Heatmap of the number of cars by transmission type and number of cylinders. Plot created using matplotlib.

In pandas we can modify the styling of the dataframe to show the table cells with the appropriate gradient, creating a heatmap directly in our dataframe.

Figure 8.36: Heatmap of the number of cars by transmission type and number of cylinders in a pandas dataframe.

carheatmap.style.background_gradient(
    cmap='binary', axis=None)\
    .set_properties(**{'font-size': '20px'})

Notice that we are using the parameter axis=None to apply
the same gradient to the entire table. We are also changing
the font size of the cells so that they are easily read. The
result is shown in Figure 8.36.

Figure 8.36: Heatmap of the number of cars by transmission type and number of cylinders in a pandas dataframe.
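A small optional variation, not shown in Figure 8.36: passing
axis=0 shades each column on its own scale, which can help
when the columns cover very different ranges.

# Optional variation: apply the gradient column by column
carheatmap.style.background_gradient(cmap='binary', axis=0)\
    .set_properties(**{'font-size': '20px'})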

Let us see how Seaborn has us covered to create the same
heatmap. See Figure 8.37.

sns.heatmap(carheatmap, cmap='binary', annot=True)

Figure 8.37: Heatmap of the number of cars by transmission type and number of cylinders created with Seaborn.

As you can see, Seaborn provides a much easier way to
create the heatmap compared to matplotlib. With Plotly we
also have a simple way to create the heatmap:

fig = px.imshow(carheatmap, text_auto=True)
fig.show()

The result of the code above is shown in Figure 8.38.

Figure 8.38: Heatmap of the number of cars by transmission type and number of cylinders created with Plotly.

With Pandas Bokeh there is no heatmap API; at the time of
writing, Pandas Bokeh does not support heatmaps, so we can
use Bokeh directly instead. We will be working with the
carcount dataframe:

# imports needed for the heatmap snippets below
from bokeh.plotting import figure, show
from bokeh.palettes import Greys
from bokeh.models import (LinearColorMapper, ColorBar,
    BasicTicker, PrintfTickFormatter)

transm = list(carcount['transmission'].unique())

colors = list(Greys[5])
colors.reverse()

mapper = LinearColorMapper(palette=colors,
    low=carcount.cars.min(),
    high=carcount.cars.max())

TOOLS = 'hover,save,pan,box_zoom,reset,wheel_zoom'

In the code above we have created a few objects that will
help us create our heatmap. First we are using a black-and-
white colour palette given by Greys, and reversing the order
of the colours to match the look of our other heatmaps. We
need to map the range of values in the dataset to the palette
and that is the aim of mapper. Finally, the TOOLS string lets
Bokeh know what tools to show when rendering the chart.

Let us now create the heatmap. This requires a bit more
code than in the previous examples. First we create a figure
object with the range of the horizontal axis to cover the
transmission types. We also let Bokeh know what tools to
show and their location. The tooltips object lets us specify
what information will be shown when users hover on the
squares that form our heatmap grid. Here we show the
transmission type and the number of cars.

p = figure(title="Number of cars" , First we create a figure object to


x_range=transm, hold our chart.

tools=TOOLS, toolbar_location=’below’,

tooltips=[(’transmission’, ’@transmission’),

(’Num. Cars’, ’@cars’)])



We are now ready to attach the rectangles that will form
our heatmap. We do this with the rect method, where we
specify the information that goes in each of the axes as well
as the height and width of the rectangular area. We also
specify the colour of the rectangles and the lines.

p.rect(x='transmission', y='cyl', width=1,
    height=2, source=carcount,
    fill_color={'field': 'cars', 'transform': mapper},
    line_color='gray')

Finally, we add a colour bar that will help users determine
what the colours mean and how they relate to each other.
Here is where we use our mapper object and determine the
ticks to show as well as the format. We add the bar to the
layout and show our finished heatmap. The result is the
heatmap shown in Figure 8.39.

Figure 8.39: Heatmap of the number of cars by transmission type and number of cylinders created with Bokeh.

color_bar = ColorBar(color_mapper=mapper,
    ticker=BasicTicker(desired_num_ticks=len(colors)),
    formatter=PrintfTickFormatter(format='%d'))

p.add_layout(color_bar, 'right')
show(p)
Taylor & Francis
Taylor & Francis Group
http://taylorandfrancis.com
A Variance: Population v Sample

Let us start with the population variance where we take


the sum of the square differences from the mean and divide
by the size of the population, n:

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2.    (A.1)

For a sample, we are interested in estimating the population


variance. We can start from the definition above:
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{X})^2.    (A.2)

This is a biased estimate. To correct for this we need to


consider that we are only taking samples.

In this case, the random variable x_i deviates from the
sample mean X̄ with variance σ̂². In turn, the sample mean
X̄ also deviates from µ with variance σ²/n. This can easily be
seen by considering that the sample mean, X̄, is calculated
every time using different values from one sample to the
next. In other words, it is a random variable with mean µ
and variance σ²/n.

We are interested in estimating an unbiased value of σ²,
i.e. s², the sample variance. We have therefore a random
variable x_i that deviates from µ with a two-part variance:
s² = σ̂² + s²/n. This relationship can be expressed as:

s^2 = \frac{n}{n-1} \hat{\sigma}^2,    (A.3)

which tells us what our unbiased estimator should be.
Replacing this information in (A.2) we have that:

s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2,    (A.4)

which is the expression for the sample variance given in
Equation (4.21).
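As a quick numerical check of the n − 1 correction derived
above (a sketch, not part of the original derivation), we can
compare the two estimators with NumPy, where ddof=1
divides by n − 1:

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=0, scale=2, size=50)   # sample from a population with variance 4

print(np.var(x))           # biased estimator, divides by n
print(np.var(x, ddof=1))   # unbiased sample variance, divides by n - 1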
B Sum of First n Integers

We are interested in calculating the sum of the first
n integers S_n = 1 + 2 + · · · + n. Let us arrange the sum as
follows:

S_n = 1 + 2 + 3 + \cdots + n,    (B.1)

S_n = n + (n-1) + (n-2) + \cdots + 1.    (B.2)

We can calculate 2S_n by grouping the two expressions
above:

2S_n = (1 + n) + (2 + n - 1) + (3 + n - 2) + \cdots + (n + 1)    (B.3)

     = (n + 1) + (n + 1) + (n + 1) + \cdots + (n + 1).    (B.4)

The factor n + 1 is repeated n times and therefore we can
write that

2S_n = n(n + 1).    (B.5)

Therefore we have that:

S_n = \sum_{k=1}^{n} k = \frac{n(n+1)}{2}.    (B.6)
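A short check of the closed form (B.6), using nothing
beyond the standard library:

n = 100
print(sum(range(1, n + 1)))   # direct sum: 5050
print(n * (n + 1) // 2)       # closed form (B.6): 5050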
C Sum of Squares of the First n Integers

Let us start with the following binomial expansion:

(k-1)^3 = k^3 - 3k^2 + 3k - 1.    (C.1)

We can rearrange the terms as follows:

k^3 - (k-1)^3 = 3k^2 - 3k + 1.    (C.2)

Let us now sum both sides:

\sum_{k=1}^{n} \left( k^3 - (k-1)^3 \right) = 3 \sum_{k=1}^{n} k^2 - 3 \sum_{k=1}^{n} k + \sum_{k=1}^{n} 1.    (C.3)

It is useful at this point to consider the definition of a
telescoping sum. This refers to a finite sum where pairs of
consecutive terms cancel each other, leaving only the initial
and final terms. For example, if a_k is a sequence of numbers,
then:

\sum_{k=1}^{n} (a_k - a_{k-1}) = a_n - a_0.    (C.4)

The sum on the left-hand side of Equation (C.3) telescopes
and is equal to n^3. We have therefore the following:

n^3 = 3 \sum_{k=1}^{n} k^2 - 3 \frac{n(n+1)}{2} + n,    (C.5)

3 \sum_{k=1}^{n} k^2 = n^3 + 3 \frac{n(n+1)}{2} - n,    (C.6)

\sum_{k=1}^{n} k^2 = \frac{n^3}{3} + \frac{n(n+1)}{2} - \frac{n}{3},    (C.7)

\sum_{k=1}^{n} k^2 = \frac{n(n+1)(2n+1)}{6}.    (C.8)
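As before, a quick check of (C.8) in plain Python:

n = 100
print(sum(k**2 for k in range(1, n + 1)))   # direct sum of squares
print(n * (n + 1) * (2 * n + 1) // 6)       # closed form (C.8)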
D The Binomial Coefficient

Let us motivate the discussion of the binomial
coefficient by thinking of sequences. We can consider
making an ordered list of distinct elements of length k from
n elements. For the first element in our sequence there are n
ways to pick it. Once the first element is chosen, there are
n − 1 ways to pick the second, and n − 2 to pick the third.
This will continue until we get to the k-th element: There are
n − k + 1 ways to pick it. The number of sequences that can
be constructed is therefore given by:

n(n-1)(n-2) \ldots (n-k+1) = \frac{n!}{(n-k)!} = P(n, k),    (D.1)

where P(n, k) is called a permutation of k elements from n.

The permutation above is an ordered sequence. If instead
we are interested in unordered arrangements, we are
looking for a combination C(n, k). For each sequence, we
want to identify the k! permutations of its elements as being
the same sequence; this means that the combinations are
given by

C(n, k) = \frac{n!}{(n-k)! \, k!} = \binom{n}{k}.    (D.2)

D.1 Some Useful Properties of the Binomial Coefficient

We are going to start by deriving a useful expression for
C(n − 1, k − 1):

C(n-1, k-1) = \binom{n-1}{k-1},    (D.3)

            = \frac{(n-1)!}{((n-1)-(k-1))! \, (k-1)!},    (D.4)

            = \frac{(n-1)!}{(n-k)! \, (k-1)!}.    (D.5)

For the next useful property, let us recall that x! = x(x − 1)!.
This lets us express the binomial coefficient as:

\binom{n}{k} = \frac{n(n-1)!}{(n-k)! \, k(k-1)!},    (D.6)

             = \frac{n}{k} \cdot \frac{(n-1)!}{(n-k)! \, (k-1)!},    (D.7)

             = \frac{n}{k} \binom{n-1}{k-1}.    (D.8)
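The counts P(n, k) and C(n, k), as well as property (D.8),
can be verified with the standard library; math.perm and
math.comb are available from Python 3.8:

from math import comb, perm

n, k = 10, 4
print(perm(n, k))   # ordered sequences, n!/(n-k)! = 5040
print(comb(n, k))   # unordered selections, 210
print(comb(n, k) == n * comb(n - 1, k - 1) // k)   # property (D.8): True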
E The Hypergeometric Distribution

E.1 The Hypergeometric vs Binomial Distribution

Let us start with the hypergeometric distribution:

f(k, N, K, n) = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}},    (E.1)

and let us keep the ratio K/N = p fixed. We want to show
that:

\lim_{N \to \infty} f(k, N, K, n) = \binom{n}{k} p^k (1-p)^{n-k}.    (E.2)

Let us express Equation (E.1) in terms of factorials:

\frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}} = \frac{K!}{k!(K-k)!} \cdot \frac{(N-K)!}{(n-k)!(N-n-(K-k))!} \cdot \frac{n!(N-n)!}{N!}.    (E.3)

We now rearrange some of the terms so that we can recover
the binomial coefficient \binom{n}{k}:

\frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}} = \frac{n!}{k!(n-k)!} \cdot \frac{K!}{(K-k)!} \cdot \frac{(N-K)!}{(N-n-(K-k))!} \cdot \frac{(N-n)!}{N!},    (E.4)

 = \binom{n}{k} \cdot \frac{K!/(K-k)!}{N!/(N-k)!} \cdot \frac{(N-K)!(N-n)!}{(N-k)!(N-K-(n-k))!},    (E.5)

 = \binom{n}{k} \cdot \frac{K!/(K-k)!}{N!/(N-k)!} \cdot \frac{(N-K)!/(N-K-(n-k))!}{(N-n+(n-k))!/(N-n)!},    (E.6)

 = \binom{n}{k} \cdot \prod_{i=1}^{k} \frac{K-k+i}{N-k+i} \cdot \prod_{j=1}^{n-k} \frac{N-K-(n-k)+j}{N-n+j}.    (E.7)

Taking the limit for large N and fixed K/N, n and k, we
have:

\lim_{N \to \infty} \frac{K-k+i}{N-k+i} = \lim_{N \to \infty} \frac{K}{N} = p,    (E.8)

and similarly:

\lim_{N \to \infty} \frac{N-K-(n-k)+j}{N-n+j} = \lim_{N \to \infty} \frac{N-K}{N} = 1 - p.    (E.9)

Hence:

\lim_{N \to \infty} f(k, N, K, n) = \binom{n}{k} p^k (1-p)^{n-k}.    (E.10)
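The limit can be seen numerically with SciPy by letting N
grow while keeping K/N = p fixed; this is a sketch with
arbitrarily chosen values of p, n and k:

from scipy.stats import binom, hypergeom

p, n, k = 0.3, 10, 4
for N in (50, 500, 5000):
    K = int(p * N)
    # the hypergeometric pmf approaches the binomial pmf as N grows
    print(N, hypergeom.pmf(k, N, K, n), binom.pmf(k, n, p))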
F The Poisson Distribution

F.1 Derivation of the Poisson Distribution

In Section 5.3.5 we obtained the following differential
equation to find the probability of getting n successes in the
time interval t:

\frac{dP(n, t)}{dt} + l P(n, t) = l P(n-1, t).    (F.1)

The equation above can be solved by finding an integrating
factor, ν(t), such that we can get a total derivative for the
left-hand side when multiplying by ν(t), i.e.:

\nu(t) \left[ \frac{dP(n, t)}{dt} + l P(n, t) \right] = \frac{d}{dt} \left[ \nu(t) P(n, t) \right].    (F.2)

The integrating factor for the equation above is:

\nu(t) = \exp\left( \int l \, dt \right) = e^{lt},    (F.3)

and thus we have that:

\frac{d}{dt} \left[ e^{lt} P(n, t) \right] = e^{lt} l P(n-1, t).    (F.4)

Using Equation (F.4) we have that for n = 1:

\frac{d}{dt} \left[ e^{lt} P(1, t) \right] = e^{lt} l P(0, t) = l e^{lt} e^{-lt} = l,    (F.5)

where we have used Equation (5.62). We can now integrate
both sides of the result above:

e^{lt} P(1, t) = \int l \, dt = lt + C_1.    (F.6)

Since the probability of finding an event at t = 0 is 0, we
have that C_1 = 0. This result can be generalised to n by
induction, giving us the following expression for the Poisson
probability:

P(n, t) = \frac{(lt)^n}{n!} e^{-lt}.    (F.7)
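As a quick sanity check of (F.7), we can compare the derived
expression with SciPy's Poisson probability mass function
for an assumed rate and time interval:

import numpy as np
from math import factorial
from scipy.stats import poisson

rate, t = 2.5, 3.0        # assumed values of the rate l and the interval t
mu = rate * t
for n in range(5):
    direct = mu**n * np.exp(-mu) / factorial(n)   # Equation (F.7)
    print(n, direct, poisson.pmf(n, mu))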

F.2 The Poisson Distribution as a Limit of the Binomial Distribution

In Section 5.3.5 we mentioned that the Poisson
distribution can be obtained as a limit of the binomial
distribution. As we divide the time interval into more
subintervals, the binomial distribution starts behaving more
and more like the Poisson distribution. This is the case for a
large number of trials, i.e. n → ∞, with a fixed mean rate
µ = np. Equivalently, we can make the probability p be very
small, i.e. p → 0.

Let us start with the binomial distribution, and express the
probability as p = µ/n:

P(X) = \lim_{n \to \infty} \binom{n}{k} \left( \frac{\mu}{n} \right)^k \left( 1 - \frac{\mu}{n} \right)^{n-k},    (F.8)

     = \lim_{n \to \infty} \frac{n!}{k!(n-k)!} \left( \frac{\mu}{n} \right)^k \left( 1 - \frac{\mu}{n} \right)^{n-k},    (F.9)

     = \lim_{n \to \infty} \left( \frac{n!}{(n-k)! \, n^k} \right) \frac{\mu^k}{k!} \left( 1 - \frac{\mu}{n} \right)^{n} \left( 1 - \frac{\mu}{n} \right)^{-k}.    (F.10)

Let us look at the ratio n!/(n − k)! for the case where the
integer n is greater than k. This can be expressed as the
successive product of n with (n − i) down to (n − (k − 1)):

\frac{n!}{(n-k)!} = n(n-1)(n-2) \cdots (n-k+1), \quad k < n.    (F.11)

In the limit n → ∞ the expression above is a polynomial
whose leading term is n^k and thus:

\lim_{n \to \infty} \frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k} = 1.    (F.12)

Our expression for the probability mass function is therefore
written as:

P(X) = \frac{\mu^k}{k!} \lim_{n \to \infty} \left( 1 - \frac{\mu}{n} \right)^{n} \lim_{n \to \infty} \left( 1 - \frac{\mu}{n} \right)^{-k}.    (F.13)

We can use the following identities to simplify the
expression above:

\lim_{n \to \infty} \left( 1 - \frac{x}{n} \right)^{n} = e^{-x},    (F.14)

\lim_{n \to \infty} \left( 1 - \frac{x}{n} \right)^{-k} = 1,    (F.15)

and therefore the probability mass function is given by

P(X) = \frac{\mu^k}{k!} e^{-\mu},    (F.16)

which is the expression for the probability mass function of
the Poisson distribution.
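We can watch this limit numerically with SciPy, keeping
µ = np fixed as n grows:

from scipy.stats import binom, poisson

mu, k = 3.0, 2
for n in (10, 100, 10000):
    # binomial pmf with p = mu/n converges to the Poisson pmf
    print(n, binom.pmf(k, n, mu / n), poisson.pmf(k, mu))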
G The Normal Distribution

G.1 Integrating the PDF of the Normal Distribution

In Section 5.4.1 we mentioned that we can find the
proportionality constant C in Equation (5.74) for the
probability distribution function of the normal distribution.
Given that the area under the curve must be one, we have
that:

\int_{-\infty}^{\infty} C \exp\left[ -\frac{k}{2} (x - \mu)^2 \right] dx = 1.    (G.1)

Sadly, the integral above does not have a representation in
terms of elementary functions. However, there are some
things we can do to evaluate it. We can follow the steps of
Poisson himself.

Before we do that, let us make a couple of changes to the
expression above. First let us define the variable:

u = \sqrt{\frac{k}{2}} (x - \mu), \quad \text{and so} \quad du = \sqrt{\frac{k}{2}} \, dx.

In this way, we can express Equation (G.1) as:

C \sqrt{\frac{2}{k}} \int_{-\infty}^{\infty} e^{-u^2} du = 1.    (G.2)

Let us concentrate on the integral in expression (G.2):

J = \int_{-\infty}^{\infty} e^{-u^2} du.    (G.3)

Instead of tackling the integral as above, we are going to
look at squaring it and transform the problem into a double
integral:

J^2 = \left( \int_{-\infty}^{\infty} e^{-x_1^2} dx_1 \right) \left( \int_{-\infty}^{\infty} e^{-y_1^2} dy_1 \right) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x_1^2 + y_1^2)} dx_1 \, dy_1.    (G.4)

We can express the double integral in polar coordinates such
that

J^2 = \int_{0}^{2\pi} \int_{0}^{\infty} r e^{-r^2} dr \, d\theta.    (G.5)

The radial part of the integral can be done by letting u = r^2,
and thus du = 2r \, dr. Applying this to Equation (G.5):

J^2 = \frac{1}{2} \int_{0}^{2\pi} \int_{0}^{\infty} e^{-u} du \, d\theta = -\frac{1}{2} \int_{0}^{2\pi} \left[ e^{-u} \right]_{0}^{\infty} d\theta = \frac{1}{2} \int_{0}^{2\pi} (1) \, d\theta.

The angular part is readily solved as follows:

J^2 = \frac{1}{2} \int_{0}^{2\pi} d\theta = \frac{1}{2} \left[ \theta \right]_{0}^{2\pi} = \pi,

and therefore J = \sqrt{\pi}.

Substituting back into Equation (G.1), we have that:

C \sqrt{\frac{2}{k}} \sqrt{\pi} = 1,

C = \sqrt{\frac{k}{2\pi}},    (G.6)

and recall that k = 1/σ² as shown in Equation (5.80).
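We can confirm the value of C numerically with
scipy.integrate.quad; this sketch assumes µ = 0 and
k = 1/σ² with σ = 2:

import numpy as np
from scipy.integrate import quad

sigma = 2.0
k = 1 / sigma**2
C = np.sqrt(k / (2 * np.pi))

# area under C exp(-k x^2 / 2) over the whole real line
area, _ = quad(lambda x: C * np.exp(-0.5 * k * x**2), -np.inf, np.inf)
print(area)   # approximately 1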

G.2 Maximum and Inflection Points of the Normal Distribution

We are now interested in finding the maximum value
of the probability distribution function of the normal
distribution, as well as its inflection points. For the
maximum we need to solve df(x)/dx = 0:

\frac{df(x)}{dx} = -\frac{1}{\sigma^3 \sqrt{2\pi}} (x - \mu) \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right] = 0.    (G.7)

The expression above is only equal to zero when x − µ = 0
and hence x = µ. This tells us that the Gaussian function
has its maximum at the mean value µ.

Let us now look at the inflection points. In this case, we
need to solve d²f(x)/dx² = 0:

\frac{d^2 f(x)}{dx^2} = -\frac{1}{\sigma^3 \sqrt{2\pi}} \left[ 1 - \frac{(x - \mu)^2}{\sigma^2} \right] \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right] = 0.    (G.8)

The expression above is equal to zero when:

\left( \frac{x - \mu}{\sigma} \right)^2 = 1.    (G.9)

Solving for x we have that the inflection points of the
Gaussian distribution are at x = µ ± σ. This means that they
lie one standard deviation above and below the mean.
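A small numerical illustration of these two results, evaluating
the density on a grid for assumed values of µ and σ:

import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.0
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 8001)
f = norm.pdf(x, loc=mu, scale=sigma)

print(x[np.argmax(f)])   # the maximum sits at (approximately) mu

# the second derivative is negative at the mean and positive beyond
# mu + sigma, so it changes sign at the inflection points mu +/- sigma
d2 = np.gradient(np.gradient(f, x), x)
print(np.sign(d2[np.searchsorted(x, mu)]))
print(np.sign(d2[np.searchsorted(x, mu + 2 * sigma)]))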
H Skewness and Kurtosis

The expressions for the sample skewness and kurtosis in
terms of their mean (µ1), variance (µ2), skewness (γ1) and
kurtosis (γ2) are given below¹. With n being the sample size,
for the skewness we have that:

\mu_1(g_1) = 0,

\mu_2(g_1) = \frac{6(n-2)}{(n+1)(n+3)},

\gamma_1(g_1) = \frac{\mu_3(g_1)}{\mu_2(g_1)^{3/2}} = 0,

\gamma_2(g_1) = \frac{\mu_4(g_1)}{\mu_2(g_1)^2} - 3 = \frac{36(n-7)(n^2+2n-5)}{(n-2)(n+5)(n+7)(n+9)}.

For the kurtosis:

\mu_1(g_2) = \frac{6}{n+1},

\mu_2(g_2) = \frac{24n(n-2)(n-1)}{(n+1)^2(n+3)(n+5)},

\gamma_1(g_2) = \frac{6(n^2-5n+2)}{(n+7)(n+9)} \sqrt{\frac{(n+3)(n+5)}{n(n-2)(n-3)}},

\gamma_2(g_2) = \frac{36(15n^6 - 36n^5 - 628n^4 + 982n^3 + 5777n^2 - 6402n + 900)}{n(n-3)(n-2)(n+7)(n+9)(n+11)(n+13)}.

¹ Pearson, E. S. (1931, 05). I. Note on tests for normality. Biometrika 22(3-4), 423–424.
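These expressions can be checked by simulation; for instance,
the variance of the sample skewness g1 over many normal
samples should be close to 6(n − 2)/((n + 1)(n + 3)). A
minimal sketch, assuming that scipy.stats.skew with
bias=True corresponds to g1:

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
n, reps = 30, 20000
g1 = np.array([skew(rng.normal(size=n), bias=True) for _ in range(reps)])

print(g1.mean())                            # close to 0
print(g1.var())                             # simulated variance of g1
print(6 * (n - 2) / ((n + 1) * (n + 3)))    # mu_2(g1) from the formula above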
I Kruskal-Wallis Test – No Ties

Let us start with the following expression for the H
statistic:

H = (n-1) \, \frac{\sum_{i=1}^{k} n_i \left( \bar{R}_{i\cdot} - \bar{R} \right)^2}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( R_{ij} - \bar{R} \right)^2}.    (I.1)

Let us concentrate on the denominator and expand it:

\sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( R_{ij} - \bar{R} \right)^2 = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( R_{ij}^2 - 2 R_{ij} \bar{R} + \bar{R}^2 \right).    (I.2)

Let us recall that \bar{R} = (n+1)/2, and that it can also be written
as the sum \bar{R} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} R_{ij} / n. We can use these expressions
to write the following:

\frac{n(n+1)}{2} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} R_{ij}.    (I.3)

We can use the expression above to recast Equation (I.2) in
terms of a double sum:

\sum_{i=1}^{k} \sum_{j=1}^{n_i} R_{ij}^2 - \sum_{i=1}^{k} \sum_{j=1}^{n_i} 2 R_{ij} \bar{R} + \sum_{i=1}^{k} \sum_{j=1}^{n_i} \bar{R}^2.    (I.4)

With the assumption that there are no ties in the ranked
observations, the first term of Equation (I.4) is the sum
of squared ranks, i.e. 1^2 + 2^2 + \cdots + n^2, and thus can be
expressed as:

\frac{n(n+1)(2n+1)}{6}.    (I.5)

The second term of Equation (I.4) can be written as:

2 \bar{R} \, \frac{n(n+1)}{2} = \frac{n(n+1)^2}{2}.    (I.6)

Finally, the third term of Equation (I.4) shows that \bar{R}^2
appears n times and thus we can express it as:

n \bar{R}^2 = \frac{n(n+1)^2}{4}.    (I.7)

We can therefore express the denominator as follows:

\frac{n(n+1)(2n+1)}{6} - \frac{n(n+1)^2}{2} + \frac{n(n+1)^2}{4}

 = \frac{[2n(n+1)(2n+1)] - [6n(n+1)^2] + [3n(n+1)^2]}{12}

 = \frac{(n+1)[2n(2n+1) - 6n(n+1) + 3n(n+1)]}{12}

 = \frac{(n+1)(n^2 - n)}{12} = \frac{(n+1)n(n-1)}{12}.    (I.8)

Let us now look at the numerator and do the same sort of
expansion:

\sum_{i=1}^{k} n_i \left( \bar{R}_{i\cdot} - \bar{R} \right)^2 = \sum_{i=1}^{k} n_i \left( \bar{R}_{i\cdot}^2 - 2 \bar{R}_{i\cdot} \bar{R} + \bar{R}^2 \right)

 = \sum_{i=1}^{k} n_i \bar{R}_{i\cdot}^2 - \sum_{i=1}^{k} n_i 2 \bar{R}_{i\cdot} \bar{R} + \sum_{i=1}^{k} n_i \bar{R}^2.    (I.9)

Let us look at the second term of Equation (I.9). We know
that \bar{R}_{i\cdot} = \sum_{j=1}^{n_i} R_{ij} / n_i, and thus:

\sum_{i=1}^{k} n_i 2 \bar{R}_{i\cdot} \bar{R} = (n+1) \sum_{i=1}^{k} \sum_{j=1}^{n_i} R_{ij}.    (I.10)

The double sum above is the sum of ranks 1 + 2 + \cdots + n,
which is equal to n(n+1)/2 when there are no ties. The second
term is therefore equal to n(n+1)^2/2.

For the last term of Equation (I.9) we have that:

\sum_{i=1}^{k} n_i \bar{R}^2 = \frac{(n+1)^2}{4} \sum_{i=1}^{k} n_i = \frac{n(n+1)^2}{4}.    (I.11)

Let us now plug Equations (I.8), (I.9) and (I.10) as well as
(I.11) back into our original Expression (I.1):

H = \frac{12(n-1)}{n(n+1)(n-1)} \left( \sum_{i=1}^{k} n_i \bar{R}_{i\cdot}^2 - \frac{n(n+1)^2}{2} + \frac{n(n+1)^2}{4} \right)

  = \frac{12}{n(n+1)} \sum_{i=1}^{k} n_i \bar{R}_{i\cdot}^2 - \frac{12}{n(n+1)} \frac{n(n+1)^2}{4}

  = \frac{12}{n(n+1)} \sum_{i=1}^{k} n_i \bar{R}_{i\cdot}^2 - 3(n+1).    (I.12)
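Expression (I.12) can be checked against scipy.stats.kruskal
on some made-up samples with no tied values:

import numpy as np
from scipy.stats import kruskal, rankdata

groups = [np.array([1.2, 3.4, 5.1, 7.8]),
          np.array([2.3, 4.5, 6.7]),
          np.array([0.9, 8.2, 9.4, 10.1, 11.3])]

n = sum(len(g) for g in groups)
ranks = rankdata(np.concatenate(groups))            # joint ranks, no ties here
splits = np.cumsum([len(g) for g in groups])[:-1]
rank_groups = np.split(ranks, splits)

H = 12 / (n * (n + 1)) * sum(
    len(r) * r.mean()**2 for r in rank_groups) - 3 * (n + 1)

print(H)
print(kruskal(*groups).statistic)   # matches H when there are no ties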
Taylor & Francis
Taylor & Francis Group
http://taylorandfrancis.com
Bibliography

Abramowitz, M. and I. Stegun (1965). Handbook of


Mathematical Functions: With Formulas, Graphs, and
Mathematical Tables. Applied Mathematics Series. Dover
Publications.

Anaconda (2016, November). Anaconda Software


Distribution. Computer Software. V. 2-2.4.0. https:
//www.anaconda.com.

Anscombe, F. J. (1973). Graphs in statistical analysis. The


American Statistician 27(1), 17–21.

Arfken, G., H. Weber, and F. Harris (2011). Mathematical


Methods for Physicists: A Comprehensive Guide. Elsevier
Science.

Bayes, T. (1763). An essay towards solving a problem in the


doctrine of chances. Philosophical Transactions 53, 370–418.

Berkowitz, B. D. (2018). Playfair: The True Story of the British


Secret Agent Who Changed How We See the World. George
Mason University Press.

Bernoulli, J., J. Bernoulli, and E. D. Sylla (2006). The Art of



Conjecturing, Together with Letter to a Friend on Sets in Court


Tennis. Johns Hopkins University Press.

Bertin, J. and M. Barbut (1967). Sémiologie graphique: Les


diagrammes, les réseaux, les cartes. Gauthier-Villars.

Bokeh Development Team (2018). Bokeh:


Python library for interactive visualization.
https://bokeh.pydata.org/en/latest/.

Bostridge, M. (2015). Florence Nightingale: The Woman and Her


Legend. Penguin Books Limited.

Bowley, A. L. (1928). The Standard Deviation of the


Correlation Coefficient. Journal of the American Statistical
Association 23(161), 31–34.

Bradley, M. Charles Dupin (1784 - 1873) and His Influence on


France. Cambria Press.

Brown, M. B. and A. B. Forsythe (1974). Robust tests for


the equality of variances. Journal of the American Statistical
Association 69(346), 364–367.

Buckmaster, D. (1974). The Incan Quipu and the Jacobsen


hypothesis. Journal of Accounting Research 12(1), 178–181.

Busby, M. (2020). Cambridge college


to remove window commemorating
eugenicist. www.theguardian.com/education
/2020/jun/27/cambridge-gonville-caius-college -
eugenicist-window-ronald-fisher. Accessed: 2021-02-14.

D’Agostino, R. and E. S. Pearson (1973). Tests for Departure


from Normality. Empirical Results for the Distributions of

b2 and b1 . Biometrika 60(3), 613–622.

D’Agostino, R. B. (1971, 08). An omnibus test of normality for


moderate and large size samples. Biometrika 58(2), 341–348.

D’Agostino, R. B. (1970, 12). Transformation to normality of


the null distribution of g1 . Biometrika 57(3), 679–681.

de Heinzelin, J. (1962, Jun). Ishango. Scientific


American (206:6), 105–116.

Díaz Díaz, R. (2006). Apuntes sobre la aritmética maya.


Educere 10(35), 621–627.

Fisher, R. (1963). Statistical Methods for Research Workers.


Biological monographs and manuals. Hafner Publishing
Company.

Franks, B. (2020). 97 Things About Ethics Everyone in Data


Science Should Know. O’Reilly Media.

Gartner (2017). Gartner Says within Five Years, Organizations


Will Be Valued on Their Information Portfolios.
www.gartner.com/en/newsroom/press-releases/2017-
02-08-gartner-says-within-five-years-organizations-will-
be-valued-on-their-information-portfolios. Accessed:
2021-01-04.

Gauss, C. F. (1823). Theoria combinationis observationum


erroribus minimis obnoxiae. Number V. 2 in Commentationes
Societatis Regiae Scientiarum Gottingensis recentiores:
Classis Mathemat. H. Dieterich.

Good, I. J. (1986). Some statistical applications of Poisson’s


work. Statistical Science 1, 157–170.

Hand, D. J. (2020). Dark Data: Why What You Don’t Know


Matters. Princeton University Press.

Hassig, R. (2013). El tributo en la economía prehispánica.


Arqueología Mexicana 21(124), 32–39.

Heath, T. L. (2017). Euclid’s Elements (The Thirteen Books).


Digireads.com Publishing.

Heffelfinger, T. and G. Flom (2004). Abacus: Mystery of the


Bead. http://totton.idirect.com. Accessed: 2021-02-03.

Henderson, H. V. and P. F. Velleman (1981). Building multiple


regression models interactively. Biometrics 37(2), 391–411.

Howard, L. (1960). Robust tests for equality of variances.


In I. Olkin and H. Hotelling (Eds.), Contributions to
Probability and Statistics: Essays in Honor of Harold Hotelling,
pp. 278–292. Stanford University Press.

Hunter, J. D. (2007). Matplotlib: A 2D graphics environment.


Computing in Science & Engineering 9(3), 90–95.

Ioannidis, Y. (2003). The history of histograms (abridged).


www.vldb.org/conf/2003/ papers/S02P01.pdf.
Accessed: 2021-02-14.

Jones, E., T. Oliphant, P. Peterson, et al. (2001–). SciPy: Open


source scientific tools for Python. http://www.scipy.org/.

Kolmogorov, A. (1933). Sulla determinazione empirica di una


legge di distribuzione. Inst. Ital. Attuari, Giorn. 4, 83–91.

Laplace, P. S. and A. Dale (2012). Pierre-Simon Laplace


Philosophical Essay on Probabilities: Translated from the Fifth
French Edition of 1825 with Notes by the Translator. Sources
in the History of Mathematics and Physical Sciences.
Springer New York.

Lichtheim, M. (2019). Ancient Egyptian Literature. University


of California Press.

Loupart, F. (2017). Data is giving rise to a new economy.


The Economist - www.economist.com/briefing/2017/
05/06/data-is-giving-rise-to-a-new-economy. Accessed:
2021-01-02.

Mackinlay, J. (1986, Apr). Automating the Design of


Graphical Presentations of Relational Information. ACM
Trans. Graph. 5(2), 110–141.

Mann, H. B. and D. R. Whitney (1947, Mar). On a test of


whether one of two random variables is stochastically
larger than the other. Ann. Math. Statist. 18(1), 50–60.

McCann, L. I. (2015). Introducing students to single photon


detection with a reverse-biased LED in avalanche mode.
In E. B. M. Eblen-Zayas and J. Kozminski (Eds.), BFY
Proceedings. American Association of Physics Teachers.

McKinney, W. (2012). Python for Data Analysis: Data Wrangling


with Pandas, NumPy, and IPython. O’Reilly Media.

McKinsey & Co. (2014). Using customer analytics to boost


corporate performance.

Melissinos, A. C. and J. Napolitano (2003). Experiments in


Modern Physics. Gulf Professional Publishing.

P. Hlobil (2021). Pandas Bokeh.


https://github.com/PatrikHlobil/Pandas-Bokeh.

PA Media (2020). UCL renames three facilities that honoured


prominent eugenicists. www.theguardian.com/education/

2020/jun/19/ucl-renames-three-facilities -that-honoured-
prominent-eugenicists. Accessed: 2021-02-14.

Pearson, E. S. (1931, 05). I. Note on tests for normality.


Biometrika 22(3-4), 423–424.

Pearson, K. (1920). Notes on the history of correlation.


Biometrika 13(1), 25–45.

Pearson, K. (1968). Tables of the Incomplete Beta-Function: With


a New Introduction. Cambridge University Press.

Pletser, V. (2012). Does the Ishango Bone Indicate Knowledge


of the Base 12? An Interpretation of a Prehistoric
Discovery, the First Mathematical Tool of Humankind.
arXiv math.HO 1204.1019.

Plotly Technologies Inc (2015). Collaborative data science.


https://plot.ly.

R. A. Fisher (1918). The correlation between relatives on the


supposition of mendelian inheritance. Philos. Trans. R. Soc.
Edinb. 52, 399–433.

Richterich, A. (2018). The Big Data Agenda: Data Ethics and


Critical Data Studies. Critical, Digital and Social Media
Studies. University of Westminster Press.

Rogel-Salazar, J. (2014). Essential MATLAB and Octave. CRC


Press.

Rogel-Salazar, J. (2016, Mar). Motor Trend Car Road Tests.


https://doi.org/ 10.6084/m9.figshare.3122005.v1.

Rogel-Salazar, J. (2020). Advanced Data Science and Analytics


with Python. Chapman & Hall/CRC Data Mining and
Knowledge Discovery Series. CRC Press.

Rogel-Salazar, J. (2021a, May). GLA World Cities 2016.


https://doi.org/ 10.6084/m9.figshare.14657391.v1.

Rogel-Salazar, J. (2021b, Dec). Normal and Skewed Example


Data. https://doi.org/ 10.6084/m9.figshare.17306285.v1.

Rogel-Salazar, J. (2022a, Feb). Anscombe’s Quartet.


https://doi.org/ 10.6084/m9.figshare.19221720.v3.

Rogel-Salazar, J. (2022b, Feb). Jackalope Dataset.


https://doi.org/ 10.6084/m9.figshare.19221666.v3.

Rogel-Salazar, J. (2022c, Feb). Python Study Scores.


https://doi.org/ 10.6084/m9.figshare.19208676.v1.

Rogel-Salazar, J. (2022d, Jan). Starfleet Headache Treatment


- Example Data for Repeated ANOVA. https://doi.org/
10.6084/m9.figshare.19089896.v1.

Rogel-Salazar, J. (2017). Data Science and Analytics with Python.


Chapman & Hall/CRC Data Mining and Knowledge
Discovery Series. CRC Press.

Rothamsted Research (2020). Statement on R. A. Fisher.


www.rothamsted.ac.uk/news/statement-r-fisher. Accessed:
2021-02-14.

Samueli, J.-J. (2010). Legendre et la méthode des moindres


carrés. Bibnum journals.openedition.org/bibnum/580.
Accessed: 2021-02-14.

Satterthwaite, F. E. (1946). An approximate distribution of


estimates of variance components. Biometrics Bulletin 2(6),
110–114.

Scheinerman, E. A. (2012). Mathematics: A Discrete


Introduction. Cengage Learning.

Scientific Computing Tools for Python (2013). NumPy.


http://www.numpy.org.

Seaborn, J. B. (2013). Hypergeometric Functions and Their


Applications. Texts in Applied Mathematics. Springer New
York.

Shapiro, S. S. and Wilk, M. B. (1965, 12). An analysis


of variance test for normality (complete samples)†.
Biometrika 52(3-4), 591–611.

Short, J. E. and S. Todd (2017). What’s your


data worth? MIT Sloan Management Review,
sloanreview.mit.edu/article/whats-your-data-worth/.
Accessed: 2021-01-08.

SINTEF (2013). Big Data, for better or worse:


90% of world’s data generated over last two
years. www.sciencedaily.com/releases/2013/
05/130522085217.htm. Accessed: 2021-01-01.

Smirnov, N. V. (1939). Estimate of deviation between


empirical distribution functions in two independent
samples. Bulletin Moscow University 2(2), 3–16.

Stahl, S. (2006). The evolution of the normal distribution.


Mathematics Magazine 79(2), 96.

Statista (2020). The 100 largest companies in


the world by market capitalization in 2020.
www.statista.com/statistics/263264/top-companies-in-the-
world-by-market-capitalization. Accessed: 2021-01-03.

Strang, G. (2006). Linear Algebra and Its Applications. Thomson,


Brooks/Cole.

Student (1908). The probable error of a mean. Biometrika 6(1),


1–25.

The Event Horizon Telescope Collaboration (2019). First M87


Event Horizon Telescope Results. I. The Shadow of the
Supermassive Black Hole. ApJL 875(L1), 1–17.

Three Dragons, David Lock Associates, Traderisks, Opinion


Research Services, and J. Coles (2016). Lessons from
Higher Density Development. Report to the GLA.

Tufte, E. (2022). The Work of Edward Tufte and Graphics


Press. https://www.edwardtufte.com/tufte/. Accessed:
2022-23-02.

Tukey, J. W. (1977). Exploratory Data Analysis. Number v. 2


in Addison-Wesley Series in Behavioral Science. Addison-
Wesley Publishing Company.

van der Zande, J. (2010). Statistik and history in the German


enlightenment. Journal of the History of Ideas 71(3), 411–432.

Vinten-Johansen, P., H. Brody, N. Paneth, S. Rachman,


M. Rip, and D. Zuck (2003). Cholera, Chloroform, and the
Science of Medicine: A Life of John Snow. Oxford University
Press.

Waskom, M. L. (2021). Seaborn: Statistical data visualization.


Journal of Open Source Software 6(60), 3021.

Wasserstein, R. L. and N. A. Lazar (2016). The ASA statement


on p-values: Context, process, and purpose. The American
Statistician 70(2), 129–133.

Welch, B. L. (1947, 01). The generalization of ’Student’s’


problem when several different population variances are
involved. Biometrika 34(1-2), 28–35.

Wilcoxon, F. (1945). Individual comparisons by ranking


methods. Biometrics Bulletin 1(6), 80–83.

William H. Kruskal and W. Allen Wallis (1952). Use of ranks


in one-criterion variance analysis. Journal of the American
Statistical Association 47(260), 583–621.
Index

L1-norm, 15 Arithmetic mean, 152 Black hole, 4


p-value, 250, 251, 278 Arithmetic operators, 43 Bokeh, 425
t-statistic, 120 Artificial intelligence, 9 Boskovich, Roger Joseph, 15
t-test, 120, 255, 378–381 Atomic clock, 4 Box plot, 19
z-score, 177 Autocorrelation, 307
LATEX, 405 Average, 152
Cardano, Girolamo, 16
Aztecs, 13
Central limit theorem, 17, 191,
Achenwall, Gottfried, 14 245, 272, 279
Alexandria, Library of, 14 Backslash, 50 Central tendency, 145, 146
Alphabet, 6 Battlestar Galactica Arithmetic mean, 152
Amate, 13 Cylon, 73 Geometric mean, 155
Amazon, 5 Battlestar Galactica, 73 Harmonic mean, 159
Analysis of variance, 345 Caprica, 73 Median, 150
ANOVA Bayes’ theorem, 24, 181 Mode, 147
One-way, 347, 380 Bayes, Thomas, 181 Chi-square, 291
Repeated measures, 361 Bernoulli trial, 197 Goodness of fit, 291
Two-way, 369 Bernoulli, Jacob, 16 Independence, 293
F statistic, 349 Big data, 5 Choropleth map, 21
Factor, 347 Value, 6 Code readability, 50
Kruskal-Wallis, 365, 380 Variety, 6 Coefficient of variation, 178
Level, 347 Velocity, 5 Collection
Tukey’s test, 360 Veracity, 6 Comprehension, 60, 71
Ancient Egypt, 13 Visibility, 6 Dictionary, 66
Apple, 6 Volume, 5 List, 52

Set, 72 Matplotlib, 402 Eigenvalue, 116


Tuple, 61 Pairplot, 458 Eigenvector, 116
Collections, 51 pandas, 421 Ethical implications, 9
Condition number, 308 Pie chart, 20, 447 Expected value, 186
Confidence intervals, 247 Plot, 403
Control flow Plotly, 428 F distribution, see ANOVA, 305,
For loop, 87 Presentation, 384 349
If statement, 82 Python modules, 420 Fermat, Pierre de, 16
While loop, 85 Scatter plot, 430 Fisher, Ronald Aylmer, 18
Covariance, 283, 296 Seaborn, 403, 423 Flow map, 20
Critical value, 269 Small multiples, 419 Function, 89
Customer journey, 8 Statistical quantities, 384 Definition, 89
Swarm plot, 462

Data science, 19 Violin plot, 462 GAAP, 7


Data exploration, 143 Visual representation, 394 Galton, Francis, 15
Data presentation, 384 Deep learning, 9 Gauss, Carl Friedrich, 17, 224
Graphical, 386 Degrees of freedom, 173 Gaussian distribution, 17, 116
Tabular, 385 Descriptive statistics, 109, 144 Giant panda, 121
Textual, 385 max, 110 Google, 5
Data privacy, 9 mean, 110 Gosset, William Sealy, 18
Data visualisation, 383, 387, 417 min, 110 Graunt, John, 16
Area chart, 20, 464 standard deviation, 110 Guinness, 18
Bar chart, 20, 440 sum, 110

Horizontal, 442 Design, 394, 417 Harmonic mean, 159


Stacked, 443 Determinant, 116 Heteroscedasticity, 326
Best practices, 414 Dictionaries Histogram, 18, 118
Bokeh, 403, 425 items, 69 Homoscedasticity, 326, 330
Box plot, 459 keys, 68 Hypothesis, 23, 247
Bubble chart, 433 values, 69 Alternative, 251
Chart types, 417 Dispersion, 163 Null, 251
Design, 394 dot command, 114 Hypothesis testing, 267, 268
Donut chart, 450 dtype, 106 Linear models, 376
Heatmap, 468 Dupin, Charles, 21 One-sample, 312
Histogram, 452 Durbin-Watson test, 307 t-test for the mean, 312, 378
Line chart, 20, 438 z-test, 277

z-test for proportions, 316 Mesokurtic, 243 Astronomy, 4


Paired sample t-test, 338, Platykurtic, 243 Mayans, 13
380 Mean, 15, 152
Wilcoxon matched pairs, Lambda function, 93 Population, 152
342, 380 Laplace Mean deviation, 169
Wilcoxon signed rank, 320, Pierre Simon, 182 Mendoza Codex, 13
378 Laplace distribution, 17 Microsoft, 6
Two-sample, 324 Laplace, Pierre Simon Marquis Minard, Charles Joseph, 20
Independent, 325 de, 16 MIT Haystack Observatory, 4
Levene’s test, 330 Least squares, 15 Mode, 15
Mann-Whitney, 334, 379 Legendre, Adrien-Marie, 15 Multi-modal, 147
Paired, 325 Level of significance, 250 Modules, 96
Two-sample t-test, 325, 379 Level of significance, 270 Moivre, Abraham de, 16
Welch’s test, 332 Linear algebra, 100 Multicollinearity, 308
Linear correlation, 296 Multivariate dataset, 142
Immutable object Linear regression, 296, 301 Mutable object
Strings, 49 Normal equation, 303 List, 53
Tuple, 49, 64 List
INAOE, 4 append, 56 Nightingale, Florence, 20
Incas, 13 Comprehension, 60 Non-parametric statistics, 289,
Indentation, 80 Concatenation, 56, 101 309
Indexing, 54, 107 sort, 57 Normal distribution, 17, 190
Colon notation (:), 54 sorted, 58 Kurtosis, 238, 243, 285
Infinite loop, 86 Log-likelihood, 306 Moments, 238, 239
Information criterion Logical operators, 80 Skewness, 238, 285
Akaike, 306 Normality testing, 279
Bayesian, 306 Machine learning, 9 D’Agostino K-squared test,
Intangible assets, 7 Manhattan distance, 15 285
Ishango bone, 12 MATLAB, 403 Kolmogorov-Smirnov test, 288
Matplotlib, 402 Q-Q plot, 280
JSON, 95 Matrix, 101 Shapiro Wilk, 282
Inverse, 114 Numerical integration, 116
Klingon, 225 Matrix algebra, 114 Numerical optimisation, 117
Kurtosis Matrix calculus, 100 NumPy, 99, 100, 102, 104
Leptokurtic, 243 Max Planck Institute for Radio Arrays, 102

Matrix, 104 Partition values, 166 Chi-squared, 260


Transpose, 105 Pascal, Blaise, 16 Continuous, 185, 223
Pearson correlation, 296, 377 Discrete, 185, 191
One-tailed test, 273 Pearson, Karl, 17, 224 Hypergeometric, 208
Order statistic, 283 Percent point function, 186 Normal, 224
Ordinary least squares, 303 Percentiles, 119, 166 Poisson, 216
Playfair, William, 20 Standard Normal, 235

pandas, 99, 121, 421 Plot Student’s t, 253

agg(), 138 Colours, 404 Uniform, 191

astype(), 126, 134 Grid, 406 Z-score, 235

CSV, 130 Labels, 405 Probability mass function, 185

Data types, 125 Lines, 404 Programming, 33

DataFrame, 122, 123 Multiple, 407 Python, 33

describe, 129 Subplot, 407 Anaconda, 38

dropna(), 132 Surface, 410 Casting, 44

dtypes, 125 Title, 405 Commenting code, 40

Excel, 130 Plotly, 428 Control flow, 80

Group by, 136 Plotting, 402 easy-install, 38

groupby(), 137 Pooled standard deviation, 327 Homebrew, 38

groups, 137 Population, 142 Indentation, 34

head, 124 Power of a test, 271 Interactive shell, 39

iloc, 128 Probability, 179, 180, 182, 184 iPython notebook, 41

Index, 127, 128 Conditional, 181 Iterable, 52

info, 125 Cumulative, 184 Jupyter notebook, 36, 41

loc, 128 Empirical, 180 Matplotlib, 37

read_csv(), 131 Likelihood, 182 Methods, 50

rename(), 133 Posterior, 182 Miniconda, 38

replace, 134 Prior, 182 Modules, 94

Series, 122 Theoretical, 179 Monty Python, 33

size(), 137 Probability density function, 118, Multiple assignation, 47

tail, 124 185 NumPy, 36

Panel data, 121 Probability distribution, 179, 182, Object oriented, 50

Papyrus, 13 184, 185 pandas, 36

Parameter estimation, 118 Bernoulli, 197 pip, 38

Parametric statistics, 309 Binomial, 201, 233 Portability, 37



REPL, 35 Skewness Tukey, John, 19


SciPy, 36 Fisher-Pearson coefficient, 242 Tukey’s test, 360
Scripts, 94 Slicing, 54, 107 Tuple
Seaborn, 37 Colon notation (:), 54 zip, 65
shell, 38 Snow, John, 21 Two-tailed test, 273
Statsmodels, 37 Soroban, 12 type (command), 44
Strings, 46 Spearman correlation, 308, 377 Typeface, 401
View object, 69 Standard deviation, 17, 171, 173 Types, 43
StarTrek, 5 Boolean, 80
Quartiles, 120, 166 Statistical inference, 252 Complex numbers, 49
Queen Victoria, 21 Statistical learning, 12 Float, 43
Quipu (Khipu), 13 Statistical modelling, 267 Integer, 43
Statistical significance, 251 String, 46
Radio telescope, 4 Statistical test, 120
Random number, 118 Statistics, 1, 118, 141
Unique elements, 111
Random variable, 183 Bayesian, 24
Univariate dataset, 142
Range, 163 Descriptive, 22
University College London, 18
Red panda, 121 Frequentist, 24
User experience, 8, 10
Relative frequency, 179, 184 Inferential, 23
UX, 10
Root finding, 117 Introduction, 1
Nonparametric, 23
van Rossum, Guido, 33
Sample, 142 Parametric, 23
Variable, 23
Sankey diagram, 20 String
Categorical, 142
Sankey, Matthew H. P. R., 20 Concatenation, 47
Continuous, 142
SciPy, 99, 102, 112 Support vector machine, 9
Discrete, 142
Seaborn, 423 Survival function, 273, 314
Numerical, 142
Set, see Collection
Variance, 17, 171, 172
Difference, 78 Test statistic, 269
Vector, 101
Intersection, 77 The Matrix, 145
Visualisation, 383
Symmetric difference, 79 Tufte, Edward, 395
Vulcan, 144
Union, 76 Data-ink ratio, 396
Update, 77 Graphical elegance, 396
shape, 105 Graphical excellence, 395 Whitespace, 80
Simpson, Thomas, 15 Visual integrity, 395 Wright, Edward, 15
