The Elements of Data
Analytic Style
A guide for people who want to
analyze data.
Jeff Leek
This book is for sale at http://leanpub.com/datastyle
An @simplystats publication.
Thank you to Karl Broman and Alyssa Frazee for
constructive and really helpful feedback on the first draft of
this manuscript. Thanks to Roger Peng, Brian Caffo, and
Rafael Irizarry for helpful discussions about data analysis.
www.dbooks.org
Contents
1. Introduction
2. The data analytic question
3. Tidying the data
4. Checking the data
5. Exploratory analysis
6. Statistical modeling and inference
7. Prediction and machine learning
8. Causality
9. Written analyses
10. Creating figures
11. Presenting data
12. Reproducibility
13. A few matters of form
14. The data analysis checklist
15. Additional resources
1. Introduction
The dramatic change in the price and accessibility of data
demands a new focus on data analytic literacy. This book is
intended for use by people who perform regular data analyses.
It aims to give a brief summary of the key ideas, practices, and
pitfalls of modern data analysis. One goal is to summarize
in a succinct way the most common difficulties encountered
by practicing data analysts. It can also serve as a guide for
peer reviewers, who may refer to specific section numbers
when evaluating manuscripts. As will become apparent, it is
modeled loosely in format and aim on the Elements of Style
by William Strunk.
The book includes a basic checklist that may be useful as a
guide for beginning data analysts or as a rubric for evaluating
data analyses. It has been used in the author’s data analysis
class to evaluate student projects. Both the checklist and this
book cover a small fraction of the field of data analysis. But in
the author's experience, once these elements are mastered, data
analysts benefit most from hands-on experience in their own
discipline of application, since many principles may not
transfer beyond the basics.
If you want a more complete introduction to the analysis
of data one option is the free Johns Hopkins Data Science
Specialization¹.
As with rhetoric, it is true that the best data analysts sometimes
disregard the rules in their analyses. Experts usually do so
deliberately, with a clear understanding of which rule they are
breaking and why.
¹https://www.coursera.org/specialization/jhudatascience/1
2. The data analytic
question
2.1 Define the data analytic
question first
Data can be used to answer many questions, but not all of
them. One of the most innovative data scientists of all time,
John Tukey, said it best: “Far better an approximate answer
to the right question, which is often vague, than an exact
answer to the wrong question, which can always be made
precise.”
2.2 Descriptive
A descriptive data analysis seeks to summarize the
measurements in a single data set without further interpretation. An
example is the United States Census. The Census collects data
on the residence type, location, age, sex, and race of all people
in the United States at a fixed time. The Census is descriptive
because the goal is to summarize the measurements in this
fixed data set into population counts and describe how many
people live in different parts of the United States.
2.3 Exploratory
An exploratory data analysis builds on a descriptive analysis
by searching for discoveries, trends, correlations, or
relationships between the measurements of multiple variables to
generate ideas or hypotheses. An example is the discovery of a
four-planet solar system by amateur astronomers using public
astronomical data from the Kepler telescope. The data was
made available through the planethunters.org website, which
asked amateur astronomers to look for a characteristic pattern
of light indicating potential planets. An exploratory analysis
like this one seeks to make discoveries, but rarely can confirm
those discoveries. In the case of the amateur astronomers,
follow-up studies and additional data were needed to confirm
the existence of the four-planet system.
2.4 Inferential
An inferential data analysis goes beyond an exploratory
analysis by quantifying whether an observed pattern will likely
hold beyond the data set in hand. Inferential data analyses are
the most common statistical analysis in the formal scientific
literature. An example is a study of whether air pollution
correlates with life expectancy at the state level in the United
States. The goal is to identify the strength of the relationship
in both the specific data set and to determine whether that
relationship will hold in future data. In non-randomized
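A minimal sketch of this quantification, in Python with simulated data (the book's own examples are in R): a normal-approximation 95% confidence interval for a mean, the kind of uncertainty statement that lets an estimate generalize beyond the sample in hand.

```python
import random

# Simulated sample, e.g. life expectancies; the numbers are invented.
random.seed(5)
sample = [random.gauss(70, 10) for _ in range(400)]

n = len(sample)
mean = sum(sample) / n
sd = (sum((v - mean) ** 2 for v in sample) / (n - 1)) ** 0.5
se = sd / n ** 0.5                          # standard error of the mean
ci = (mean - 1.96 * se, mean + 1.96 * se)   # 95% normal-approximation CI
print(round(mean, 1), tuple(round(v, 1) for v in ci))
```

The interval quantifies how far the sample mean could plausibly sit from the population mean; a report would state both the estimate and the interval.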
2.5 Predictive
While an inferential data analysis quantifies the relationships
among measurements at population-scale, a predictive data
analysis uses a subset of measurements (the features) to
predict another measurement (the outcome) on a single person
or unit. An example is when organizations like
FiveThirtyEight.com use polling data to predict how people will vote
on election day. In some cases, the set of measurements
used to predict the outcome will be intuitive. There is an
obvious reason why polling data may be useful for predicting
voting behavior. But predictive data analyses only show that
you can predict one measurement from another; they don't
necessarily explain why that choice of prediction works.
2.6 Causal
A causal data analysis seeks to find out what happens to one
measurement if you make another measurement change. An
example is a randomized clinical trial to identify whether
fecal transplants reduce infections due to Clostridium
difficile. In this study, patients were randomized to receive a
fecal transplant plus standard care or simply standard care.
In the resulting data, the researchers identified a relationship
between transplants and infection outcomes. Because treatment
was randomly assigned, the researchers could attribute that
relationship to the transplant itself.
2.7 Mechanistic
Causal data analyses seek to identify average effects between
often noisy variables. For example, decades of data show
a clear causal relationship between smoking and cancer. If
you smoke, it is a sure thing that your risk of cancer will
increase. But it is not a sure thing that you will get cancer. The
causal effect is real, but it is an effect on your average risk. A
mechanistic data analysis seeks to demonstrate that changing
one measurement always and exclusively leads to a specific,
deterministic behavior in another. The goal is not only to
understand that there is an effect, but also how that effect operates.
An example of a mechanistic analysis is analyzing data on
how wing design changes air flow over a wing, leading
to decreased drag. Outside of engineering, mechanistic data
analysis is extremely challenging and rarely undertaken.
2.8.2 Overfitting
A common form of this mistake is interpreting an exploratory
analysis as predictive.
2.8.3 n of 1 analysis
Here the mistake is reporting a descriptive analysis of a single
sample as if it were an inferential analysis of a population.
3. Tidying the data
You know the raw data is in the right format if you ran no
software on the data, did not manipulate any of the numbers
in the data, did not remove any data from the data set, and
did not summarize the data in any way.
If you did any manipulation of the data at all, it is not the raw
form of the data. Reporting manipulated data as raw data is
a very common way to slow down the analysis process, since
the analyst will often have to do a forensic study of your data
to figure out why the raw data looks weird.
While these are the hard and fast rules, there are a number
of other things that will make your data set much easier to
handle.
An analyst will also want to know any other information about
how you did the data collection/study design. For example, are
these the first 20 patients that walked into the clinic? Are they
20 patients highly selected by some characteristic like age?
Were they randomized to treatments?
A common format for this document is a Word file. There
should be a section called “Study design” that has a thorough
description of how you collected the data. There is a section
called “Code book” that describes each variable and its units.
4. Checking the data
Data munging or processing is required for basically every
data set that you will have access to. Even when the data
are neatly formatted like you get from open data sources like
Data.gov¹, you’ll frequently need to do things that make it
slightly easier to analyze or use the data for modeling.
The first thing to do with any new data set is to understand
the quirks of the data set and potential errors. This is usually
done with a set of standard summary measures. The checks
should be performed on the rawest version of the data set you
have available. A useful approach is to think of every possible
thing that could go wrong and make a plot of the data to check
if it did.
Every data set is made up of a few standard types of
measurements, each of which needs its own checks:
• Continuous
• Ordinal
• Categorical
• Missing
• Censored
¹http://www.data.gov/
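As a concrete illustration, here is a minimal sketch in Python (the book's examples are in R, but the idea is language-agnostic) of first-pass checks on a hypothetical data set; the `check_column` helper and the column names are invented for this example.

```python
# First-pass data checks: count missing values, then either the range of a
# numeric column or the tabulated levels of a categorical one.

def check_column(rows, name):
    """Summarize one column of a list-of-dicts data set."""
    values = [r.get(name) for r in rows]
    missing = sum(v is None for v in values)
    present = [v for v in values if v is not None]
    summary = {"n": len(values), "missing": missing}
    if present and all(isinstance(v, (int, float)) for v in present):
        summary["min"] = min(present)   # spot impossible values (e.g. age < 0)
        summary["max"] = max(present)
    else:
        levels = {}                      # tabulate levels to catch typos
        for v in present:                # like "male" vs "Male"
            levels[v] = levels.get(v, 0) + 1
        summary["levels"] = levels
    return summary

rows = [{"age": 34, "sex": "F"}, {"age": -1, "sex": "F"}, {"age": None, "sex": "M"}]
print(check_column(rows, "age"))   # the min of -1 flags an impossible age
print(check_column(rows, "sex"))
```

Run on the rawest version of the data, a table of such summaries makes quirks like negative ages or misspelled category levels immediately visible.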
5. Exploratory analysis
Exploratory analysis is largely concerned with summarizing
and visualizing data before performing formal modeling.
Graphs are created during exploration to understand the
properties of the data, to find patterns, and to suggest modeling
strategies. Summary plots alone can mislead: two variables may
look similar when summarized, but if you overlay the actual
data points you can see that they have very different distributions.
Figure 5.3 Data sets with identical correlations and regression lines
²http://en.wikipedia.org/wiki/Anscombe%27s_quartet
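The identical-summaries phenomenon referenced above (Anscombe's quartet) can be checked directly. The sketch below, in Python for illustration, computes the correlation for two of the four quartet data sets; both come out near 0.816 even though the scatterplots look completely different.

```python
# Two of Anscombe's quartet data sets: nearly identical correlation and
# regression line, but very different shapes when plotted.
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def corr(a, b):
    """Pearson correlation, computed from first principles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = sum((u - ma) ** 2 for u in a) ** 0.5
    sb = sum((v - mb) ** 2 for v in b) ** 0.5
    return cov / (sa * sb)

print(round(corr(x, y1), 3), round(corr(x, y2), 3))  # both about 0.816
```

This is why the numeric summary alone is never enough: only a plot with the actual points reveals that one data set is linear with noise and the other is a smooth curve.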
A plot of studying versus score alone may be misleading,
but if you size the points by the skill of the student you see
that more skilled students don't study as much. So it is likely
that skill is confounding the relationship (Figure 5.6).
Figure 5.6 Studying versus score with point size by skill level
6. Statistical modeling and inference
Figure 6.2 The first step in inference is making a best estimate of what
is happening in the population
Young children tend to be less literate than older
people and also have smaller shoes. Age is related to both
literacy and shoe size and is a confounder for that relationship.
When you observe a correlation or relationship in a data set,
consider the potential confounders: variables associated with
both variables you are trying to relate.
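The shoe size example can be simulated. In the Python sketch below (all numbers invented), age drives both shoe size and literacy; the marginal correlation between shoe size and literacy is large, but nearly vanishes once age is held fixed.

```python
import random

# Age is the confounder: it drives both measurements.
random.seed(1)
ages = [random.randint(5, 15) for _ in range(2000)]
shoe = [a + random.gauss(0, 1) for a in ages]   # shoe size grows with age
lit  = [a + random.gauss(0, 1) for a in ages]   # literacy grows with age

def corr(a, b):
    """Pearson correlation, computed from first principles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = sum((u - ma) ** 2 for u in a) ** 0.5
    sb = sum((v - mb) ** 2 for v in b) ** 0.5
    return cov / (sa * sb)

marginal = corr(shoe, lit)                       # large: confounded by age
ten = [i for i, a in enumerate(ages) if a == 10] # hold age fixed
within = corr([shoe[i] for i in ten], [lit[i] for i in ten])
print(round(marginal, 2), round(within, 2))      # e.g. large vs near zero
```

Stratifying on (or otherwise adjusting for) the confounder is what makes the spurious relationship disappear.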
Figure 6.4 If you infer to the wrong population, bias will result.
7. Prediction and
machine learning
The central idea with prediction is to take a sample from
a population, as with inference, and create a training
data set. Some variables measured on the individuals in the
training set are called features and some are outcomes. The
goal of prediction is to build an algorithm or prediction
function that automatically takes the feature data from a new
individual and makes a best guess or estimate about the value
of the outcome variables (Figure 7.1).
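A minimal sketch of this workflow, in Python with simulated data: split the sample into training and validation sets, estimate a simple least-squares prediction function on the training data only, and measure its error on the held-out set.

```python
import random

# Simulated sample: one feature, one outcome with a linear relationship.
random.seed(2)
feature = [random.uniform(0, 10) for _ in range(200)]
outcome = [2 * x + 1 + random.gauss(0, 0.5) for x in feature]

train_x, valid_x = feature[:150], feature[150:]   # split before any fitting
train_y, valid_y = outcome[:150], outcome[150:]

# Least-squares fit on the training data only.
n = len(train_x)
mx, my = sum(train_x) / n, sum(train_y) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(train_x, train_y))
         / sum((x - mx) ** 2 for x in train_x))
intercept = my - slope * mx

def predict(x):
    """Prediction function: takes a new feature value, guesses the outcome."""
    return intercept + slope * x

mse = sum((predict(x) - y) ** 2
          for x, y in zip(valid_x, valid_y)) / len(valid_x)
print(round(slope, 2), round(mse, 2))  # slope near 2, small validation error
```

The key discipline is that the validation rows never touch the fitting step; their error estimates how the prediction function will do on genuinely new individuals.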
9. Written analyses
Data analysis is as much about communication as it is about
statistics. A written data analysis report should communicate
your message clearly and in a way that is readable to
non-technical audiences. The goal is to tell a clear, precise
and compelling story. Throughout your written analysis you
should focus on how each element (text, figures, equations,
and code) contributes to or detracts from the story you are
trying to tell. A written analysis should typically include:
• A title
• An introduction or motivation
• A description of the statistics or machine learning
models you used
• Results including measures of uncertainty
• Conclusions including potential problems
• References
10. Creating figures
Figure 10.5 Without logs, 99% of the data are in the lower left hand
corner of this figure.
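The effect the caption describes is easy to reproduce. In the Python sketch below (simulated, right-skewed data), most points sit in the bottom slice of the raw axis range, but spread out after a log transform.

```python
import math
import random

# Right-skewed (lognormal) data: a few huge values stretch the axis, so the
# bulk of the points get crammed into a corner of the plot.
random.seed(3)
raw = [math.exp(random.gauss(0, 2)) for _ in range(1000)]
logged = [math.log(v) for v in raw]

def frac_in_bottom_decile(values):
    """Fraction of points in the bottom 10% of the axis range."""
    lo, hi = min(values), max(values)
    cut = lo + 0.1 * (hi - lo)
    return sum(v <= cut for v in values) / len(values)

print(round(frac_in_bottom_decile(raw), 2))     # most points crammed low
print(round(frac_in_bottom_decile(logged), 2))  # spread out after logging
```

This is the usual argument for plotting heavily skewed measurements on a log scale.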
11. Presenting data
Presentations are a chance to:
• Meet people
• Get people excited about your ideas/software/results
• Help people understand your ideas/software/results
• https://speakerdeck.com/
• http://www.slideshare.net/
12. Reproducibility
Reproducibility involves being able to recalculate the exact
numbers in a data analysis using the code and raw data
provided by the analyst. Reproducibility is often difficult to
achieve and has slowed down the discovery of important data
analytic errors¹. Reproducibility should not be confused with
“correctness” of a data analysis. A data analysis can be fully
reproducible and recreate all numbers in an analysis and still
be misleading or incorrect.
• Data
– raw data
– processed data
• Figures
– Exploratory figures
– Final figures
• R code
– Raw or unused scripts
– Data processing scripts
– Analysis scripts
• Text
– README files explaining what all the components are
– Final data analysis products like presentations/writeups
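One way to set up this structure is to script it so every project starts the same way. Below is a sketch in Python; the folder names follow the list above, and the README contents are placeholders.

```python
import os
import tempfile

# Directory skeleton matching the organization described in the text.
SUBDIRS = [
    "data/raw", "data/processed",
    "figures/exploratory", "figures/final",
    "code/raw", "code/processing", "code/analysis",
    "text",
]

def make_project(root):
    """Create the project skeleton plus a README explaining the components."""
    for sub in SUBDIRS:
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    readme = os.path.join(root, "README.md")
    with open(readme, "w") as fh:
        fh.write("README: explain what each component of the project is.\n")
    return readme

root = tempfile.mkdtemp()   # a throwaway root for this demonstration
readme = make_project(root)
print(sorted(os.listdir(root)))
```

Starting every analysis from the same skeleton makes it much easier for someone else (or your future self) to find the raw data, the processing scripts, and the final products.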
13. A few matters of
form
• Report estimates followed by parentheses containing a
measure of uncertainty, for example: “The increase was 5.3
(95% CI: 3.1, 7.5).”
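A small helper can enforce this reporting convention consistently; the sketch below is Python, and the numbers are invented for illustration.

```python
# Format an estimate followed by its uncertainty in parentheses, matching
# the "estimate (95% CI: lo, hi)" convention described above.

def report(estimate, lo, hi, unit=""):
    """Return the estimate followed by its 95% confidence interval."""
    return f"{estimate:.1f}{unit} (95% CI: {lo:.1f}, {hi:.1f})"

print("The increase was " + report(5.3, 3.1, 7.5))
# The increase was 5.3 (95% CI: 3.1, 7.5)
```

Generating the string from the numbers, rather than typing it by hand, also removes one opportunity for transcription errors in the write-up.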
14. The data analysis
checklist
This checklist provides a condensed look at the information
in this book. It can be used as a guide during the process of a
data analysis, as a rubric for grading data analysis projects, or
as a way to evaluate the quality of a reported data analysis.
14.5 Inference
1. Did you identify the larger population you are trying
to describe?
14.6 Prediction
1. Did you identify in advance your error measure?
2. Did you immediately split your data into training and
validation?
3. Did you use cross-validation, resampling, or bootstrapping
only on the training data?
4. Did you create features using only the training data?
5. Did you estimate parameters only on the training data?
6. Did you fix all features, parameters, and models before
applying to the validation data?
7. Did you apply only one final model to the validation
data and report the error rate?
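Items 2, 3, 6, and 7 above can be sketched together: the validation set is split off immediately, cross-validation runs only on the training data to choose between two toy models, and the chosen model touches the validation data exactly once. Python, with simulated data.

```python
import random

# Simulated sample with a genuine linear signal.
random.seed(4)
xs = [random.uniform(0, 1) for _ in range(100)]
pairs = [(x, 3 * x + random.gauss(0, 0.3)) for x in xs]
train, valid = pairs[:80], pairs[80:]   # validation set aside immediately

def fit_mean(rows):
    """Baseline model: predict the training mean for everyone."""
    m = sum(y for _, y in rows) / len(rows)
    return lambda x: m

def fit_linear(rows):
    """Least-squares line fit."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    slope = (sum((x - mx) * (y - my) for x, y in rows)
             / sum((x - mx) ** 2 for x, _ in rows))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def cv_error(fit, rows, k=5):
    """K-fold cross-validation using only the rows passed in (training data)."""
    size = len(rows) // k
    err = 0.0
    for i in range(k):
        held = rows[i * size:(i + 1) * size]
        rest = rows[:i * size] + rows[(i + 1) * size:]
        model = fit(rest)
        err += sum((model(x) - y) ** 2 for x, y in held) / len(held)
    return err / k

# Choose the model on training data only, then fix it...
best_fit = min((fit_mean, fit_linear), key=lambda f: cv_error(f, train))
final = best_fit(train)
# ...and apply the one final model to the validation data exactly once.
valid_mse = sum((final(x) - y) ** 2 for x, y in valid) / len(valid)
print(best_fit is fit_linear, round(valid_mse, 2))
```

Because model selection never sees the validation rows, the reported validation error is an honest estimate of performance on new data.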
14.7 Causality
1. Did you identify whether your study was randomized?
2. Did you identify potential reasons that causality may
not be appropriate such as confounders, missing data,
non-ignorable dropout, or unblinded experiments?
3. If not, did you avoid using language that would imply
cause and effect?
14.9 Figures
1. Does each figure communicate an important piece of
information or address a question of interest?
2. Do all your figures include plain language axis labels?
3. Is the font size large enough to read?
4. Does every figure have a detailed caption that explains
all axes, legends, and trends in the figure?
14.10 Presentations
1. Did you lead with a brief statement of your problem
that everyone can understand?
2. Did you explain the data, measurement technology, and
experimental design before you explained your model?
3. Did you explain the features you will use to model data
before you explain the model?
4. Did you make sure all legends and axes were legible
from the back of the room?
14.11 Reproducibility
1. Did you avoid doing calculations manually?
2. Did you create a script that reproduces all your analyses?
3. Did you save the raw and processed versions of your
data?
4. Did you record all versions of the software you used to
process the data?
5. Did you try to have someone else run your analysis
code to confirm they got the same answers?
14.12 R packages
1. Did you make your package name “Googleable”?
2. Did you write unit tests for your functions?
3. Did you write help files for all functions?
4. Did you write a vignette?
5. Did you try to reduce dependencies to actively maintained
packages?
6. Have you eliminated all errors and warnings from R
CMD CHECK?
15. Additional
resources
15.1 Class lecture notes
• Johns Hopkins Data Science Specialization¹ and Additional
resources²
• Data wrangling, exploration, and analysis with R³
• Tools for Reproducible Research⁴
• Data carpentry⁵
15.2 Tutorials
• Git/github tutorial⁶
• Make tutorial⁷
• knitr in a knutshell⁸
• Writing an R package from scratch⁹
¹https://github.com/DataScienceSpecialization/courses
²http://datasciencespecialization.github.io/
³https://stat545-ubc.github.io/
⁴http://kbroman.org/Tools4RR/
⁵https://github.com/datacarpentry/datacarpentry
⁶http://kbroman.org/github_tutorial/
⁷http://kbroman.org/minimal_make/
⁸http://kbroman.org/knitr_knutshell/
⁹http://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/
15.4 Books
• An introduction to statistical learning¹³
• Advanced data analysis from an elementary point of
view¹⁴
• Advanced R programming¹⁵
• OpenIntro Statistics¹⁶
• Statistical inference for data science¹⁷
¹³http://www-bcf.usc.edu/~gareth/ISL/
¹⁴http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/
¹⁵http://adv-r.had.co.nz/
¹⁶https://www.openintro.org/stat/textbook.php
¹⁷https://leanpub.com/LittleInferenceBook