Exploratory Data Analysis - Komorowski PDF
All content following this page was uploaded by Matthieu Komorowski on 13 October 2016.
Learning Objectives
• Why is EDA important during the initial exploration of a dataset?
• What are the most essential tools of graphical and non-graphical EDA?
15.1 Introduction
Exploratory data analysis (EDA) is an essential step in any research analysis. The
primary aim of exploratory analysis is to examine the data for distribution,
outliers and anomalies in order to direct specific testing of your hypothesis. It also provides
tools for hypothesis generation by visualizing and understanding the data, usually
through graphical representation [1]. EDA aims to assist the analyst's natural
pattern recognition. Finally, feature selection techniques often fall under EDA.
Since the seminal work of Tukey in 1977, EDA has gained a large following as the
gold standard methodology to analyze a data set [2, 3]. According to Howard
Seltman (Carnegie Mellon University), “loosely speaking, any method of looking at
data that does not include formal statistical modeling and inference falls under the
term exploratory data analysis” [4].
EDA is a fundamental early step after data collection (see Chap. 11) and
pre-processing (see Chap. 12), in which the data is simply visualized, plotted and
manipulated, without any assumptions, in order to help assess the quality of the
data and to build models. “Most EDA techniques are graphical in nature with a few
quantitative techniques. The reason for the heavy reliance on graphics is that by its
very nature the main role of EDA is to explore, and graphics gives the analysts
unparalleled power to do so, while being ready to gain insight into the data. There
are many ways to categorize the many EDA techniques” [5].
The interested reader will find further information in the textbooks of Hill and
Lewicki [6] or the NIST/SEMATECH e-Handbook [1]. Relevant R packages are
available on the CRAN website [7].
The objectives of EDA can be summarized as follows:
1. Maximize insight into the database/understand the database structure;
2. Visualize potential relationships (direction and magnitude) between exposure
and outcome variables;
3. Detect outliers and anomalies (values that are significantly different from the
other observations);
4. Develop parsimonious models (a predictive or explanatory model that performs
with as few exposure variables as possible) or preliminary selection of appro-
priate models;
5. Extract and create clinically relevant variables.
EDA methods can be cross-classified as:
• Graphical or non-graphical methods
• Univariate (only one variable, exposure or outcome) or multivariate (several
exposure variables alone or with an outcome variable) methods.
Tables 15.1 and 15.2 suggest a few EDA techniques depending on the type of data
and the objective of the analysis.
These non-graphical methods will provide insight into the characteristics and the
distribution of the variable(s) of interest.
Univariate Non-graphical EDA
Tabulation of Categorical Data (Tabulation of the Frequency of Each Category)
Fig. 15.1 Symmetrical versus asymmetrical (skewed) distribution, showing mode, mean and
median
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \qquad (15.1)
The standard deviation is simply the square root of the variance. Therefore it has
the same units as the original data, which helps make it more interpretable.
The sample standard deviation is usually represented by the symbol s. For a
theoretical Gaussian distribution, mean plus or minus 1, 2 or 3 standard deviations
holds 68.3, 95.4 and 99.7 % of the probability density, respectively.
15.2 Part 1—Theoretical Concepts
\mathrm{IQR} = Q_3 - Q_1 \qquad (15.2)
In the same way that the median is more robust than the mean, the IQR is a more
robust measure of spread than variance and standard deviation and should therefore
be preferred for small or asymmetrical distributions.
Important rule:
• Symmetrical distribution (not necessarily normal) and N > 30: express results
as mean ± standard deviation.
• Asymmetrical distribution or N < 30 or evidence for outliers: use
median ± IQR, which are more robust.
Skewness/Kurtosis
Skewness is a measure of a distribution’s asymmetry. Kurtosis is a summary
statistic communicating information about the tails (the smallest and largest values)
of the distribution. Both quantities can be used to communicate information
about the distribution of the data when graphical methods cannot be used.
More information about these quantities can be found in [9].
Summary
We provide as a reference some of the common functions in R language for
generating summary statistics relating to measures of central tendency (Table 15.4).
Several non-graphical methods exist to assess the normality of a data set (whether it
was sampled from a normal distribution), such as the Shapiro-Wilk test.
Please refer to the function called “Distribution” in the GitHub repository for this
book (see code appendix at the end of this Chapter).
Table 15.4 Main R functions for basic measure of central tendencies and variability
Function Description
summary(x) General description of a vector
max(x) Maximum value
mean(x) Average or mean value
median(x) Median value
min(x) Smallest value
sd(x) Standard deviation
var(x) Variance, a measure of the spread or dispersion of the values
IQR(x) Interquartile range
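As an illustrative sketch, the functions in Table 15.4 can be applied to a simulated vector in base R (the data here is made up for demonstration):

```r
# Hypothetical vector of heart-rate measurements
set.seed(42)
hr <- rnorm(100, mean = 80, sd = 12)

summary(hr)    # general description: min, quartiles, median, mean, max
mean(hr)       # measures of central tendency
median(hr)
sd(hr)         # measures of spread
var(hr)
IQR(hr)

# Non-graphical normality assessment: Shapiro-Wilk test
shapiro.test(hr)
```

A small p-value from `shapiro.test` would suggest departure from normality; here the data is drawn from a normal distribution, so no such evidence is expected.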
Finding Outliers
Several statistical methods for outlier detection fall under EDA techniques, such as
Tukey’s method, Z-scores, studentized residuals, etc. [8]. Please refer to Chap. 14,
“Noise versus Outliers”, for more detail on this topic.
\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \qquad (15.3)

where x and y are the variables, n the number of data points in the sample, \bar{x} the
mean of the variable x and \bar{y} the mean of the variable y.
A positive covariance means the variables are positively related (they move
together in the same direction), while a negative covariance means the variables are
inversely related. A problem with covariance is that its value depends on the scale
of the values of the random variables: the larger the values of x and y, the larger the
covariance, which makes covariances difficult to compare across data sets. The
correlation coefficient removes this scale dependence by dividing the covariance by
the standard deviations of the two variables:

\mathrm{Cor}(x, y) = \frac{\mathrm{Cov}(x, y)}{s_x s_y} \qquad (15.4)

where \mathrm{Cov}(x, y) is the covariance between x and y, and s_x, s_y are the sample standard
deviations of x and y.
The significance of the correlation coefficient between two normally distributed
variables can be evaluated using Fisher’s z transformation (see the cor.test function
in R for more details). Other tests exist for measuring the non-parametric rela-
tionship between two variables, such as Spearman’s rho or Kendall’s tau.
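The quantities above can be sketched in base R on simulated data (the variables here are made up, with y constructed to be positively related to x):

```r
# Two hypothetical, positively related variables
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

cov(x, y)   # sample covariance, Eq. (15.3): positive here
cor(x, y)   # Pearson correlation, Eq. (15.4): scale-free, between -1 and 1

# The correlation is the covariance normalized by both standard deviations
all.equal(cor(x, y), cov(x, y) / (sd(x) * sd(y)))

cor.test(x, y)                        # significance via Fisher's z transformation
cor.test(x, y, method = "spearman")   # non-parametric alternative
```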
Histograms are among the most useful EDA techniques, and allow you to gain
insight into your data, including distribution, central tendency, spread, modality and
outliers.
Histograms are bar plots of counts versus subgroups of an exposure variable. Each
bar represents the frequency (count) or proportion (count divided by total count) of
cases for a range of values. The range of data for each bar is called a bin. Histograms
give an immediate impression of the shape of the distribution (symmetrical,
uni/multimodal, skewed, outliers…). The number of bins heavily influences the final
aspect of the histogram; a good practice is to try different values, generally from 10 to
50. Some examples of histograms are shown below as well as in the case studies.
Please refer to the function called “Density” in the GitHub repository for this book
(see code appendix at the end of this Chapter) (Figs. 15.3 and 15.4).
Histograms also make it possible to confirm that an operation on the data was
successful. For example, if you need to log-transform a data set, it is useful to plot
the histogram of the distribution of the data before and after the operation (Fig. 15.5).
Histograms are also useful for finding outliers. For example, in medical records,
pulse oximetry can be expressed either as a fraction (range between 0 and 1) or as a
percentage. Figure 15.6 is an example of a histogram showing the distribution of
pulse oximetry, clearly revealing the presence of outliers expressed as fractions
rather than as percentages.
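A minimal sketch of this before/after comparison in base R, using simulated right-skewed data (the variable and parameters are hypothetical):

```r
# Hypothetical right-skewed variable (e.g. length of stay, in days)
set.seed(7)
los <- rlnorm(500, meanlog = 1, sdlog = 0.8)

# Compare the distribution before and after a log transformation,
# trying a moderate number of bins (10-50 is a reasonable range)
par(mfrow = c(1, 2))
hist(los, breaks = 30, main = "Raw data", xlab = "Length of stay (days)")
hist(log(los), breaks = 30, main = "Log-transformed", xlab = "log(length of stay)")
par(mfrow = c(1, 1))
```

Varying the `breaks` argument is an easy way to check how sensitive the histogram's shape is to the number of bins.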
Fig. 15.5 Example of the effect of a log transformation on the distribution of the dataset
Stem Plots
Stem and leaf plots (also called stem plots) are a simple substitute for histograms.
They show all data values and the shape of the distribution. For an example, please
refer to the function called “Stem Plot” in the GitHub repository for this book (see
code appendix at the end of this Chapter) (Fig. 15.7).
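A quick sketch using the base R `stem` function on made-up data:

```r
# Hypothetical severity scores for 30 patients
set.seed(3)
scores <- round(rnorm(30, mean = 50, sd = 10))
stem(scores)   # prints every value while still showing the distribution's shape
```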
Boxplots
Boxplots are interesting for representing information about the central tendency,
symmetry, skew and outliers, but they can hide some aspects of the data such as
multimodality. Boxplots are an excellent EDA technique because they rely on
robust statistics like median and IQR.
Figure 15.8 shows an annotated boxplot which explains how it is constructed.
The central rectangle is limited by Q1 and Q3, with the middle line representing the
median of the data. The whiskers are drawn, in each direction, to the most extreme
point that is less than 1.5 IQR beyond the corresponding hinge. Values beyond 1.5
IQR are considered outliers.
The “outliers” identified by a boxplot, which could be called “boxplot outliers”
are defined as any points more than 1.5 IQRs above Q3 or more than 1.5 IQRs
below Q1. This does not by itself indicate a problem with those data points.
Boxplots are an exploratory technique, and you should consider designation as a
boxplot outlier as just a suggestion that the points might be mistakes or otherwise
unusual. Also, points not designated as boxplot outliers may also be mistakes. It is
also important to realize that the number of boxplot outliers depends strongly on the
size of the sample. In fact, for data that is perfectly normally distributed, we expect
0.70 % (about 1 in 140 cases) to be “boxplot outliers”, with approximately half in
either direction.
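The 1.5 IQR rule can be checked directly with base R's boxplot statistics; in this sketch one extreme value is deliberately injected into otherwise standard-normal data:

```r
# 100 standard-normal values plus one injected extreme value
set.seed(11)
x <- c(rnorm(100), 5)

b <- boxplot(x, plot = FALSE)   # compute boxplot statistics without drawing
b$stats                         # whisker ends, hinges and median
b$out                           # "boxplot outliers": > 1.5 IQR beyond the hinges
```

The injected value 5 appears in `b$out`, while most of the remaining points do not.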
2D Line Plot
2D line plots represent graphically the values of an array on the y-axis, at regular
intervals on the x-axis (Fig. 15.9).
Besides the probability plots, there are many quantitative (non-graphical) statistical
tests of normality, such as the Pearson chi-squared, Shapiro-Wilk, and
Kolmogorov-Smirnov tests.
Deviation of the observed distribution from normality renders many powerful
statistical tools inapplicable. Note that some data sets can be transformed to a more
normal distribution, in particular with log and square-root transformations. If a
data set is severely skewed, another option is to discretize its values into a finite set
of categories.
Representing several boxplots side by side allows easy comparison of the charac-
teristics of several groups of data (see Fig. 15.11). An example of such a boxplot is
shown in the case study.
Fig. 15.11 Side-by-side boxplot showing the cardiac index for five levels of Positive
end-expiratory pressure (PEEP)
Scatterplots
Scatterplots are built using two continuous, ordinal or discrete quantitative variables
(Fig. 15.12). Each data point’s coordinates correspond to the values of the two
variables. Scatterplots can be extended to up to five dimensions by using other
variables to vary the data points’ size, shape or color.
Scatterplots can also be used to represent high-dimensional data in 2 or 3D
(Fig. 15.13), using t-distributed stochastic neighbor embedding (t-SNE) or prin-
cipal component analysis (PCA). t-SNE and PCA are dimensionality reduction
techniques used to project a complex data set into two (t-SNE) or more (PCA) dimensions.
Fig. 15.12 Scatterplot showing an example of actual mortality per rate of predicted mortality
For binary variables (e.g. 28-day mortality vs. SOFA score), 2D scatterplots are
not very helpful (Fig. 15.14, left). By dividing the data set into groups (in our
example, one group per SOFA point) and plotting the average value of the outcome
in each group, scatterplots become a very powerful tool, capable, for example, of
identifying a relationship between a variable and an outcome (Fig. 15.14, right).
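The grouping trick described above can be sketched in base R on simulated data (the score, outcome and logistic relationship below are all hypothetical):

```r
# Hypothetical severity scores and a binary outcome generated from them
set.seed(9)
sofa  <- sample(0:15, 500, replace = TRUE)
death <- rbinom(500, 1, plogis(-4 + 0.3 * sofa))

# A raw scatterplot of a binary outcome shows only two horizontal bands;
# plotting the mean outcome per score group reveals the relationship
rate <- tapply(death, sofa, mean)
plot(as.numeric(names(rate)), rate,
     xlab = "SOFA score", ylab = "Observed mortality rate")
```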
Curve Fitting
Curve fitting is one way to quantify the relationship between two variables or the
change in values over time (Fig. 15.15). The most common method for curve fitting
relies on minimizing the sum of squared errors (SSE) between the data and the
fitted function. Please refer to the “Linear Fit” function to create linear regression
slopes in R.
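As a minimal sketch of least-squares curve fitting in base R (the data and coefficients below are simulated, not taken from the case study):

```r
# Hypothetical linear relationship with noise
set.seed(2)
x <- runif(100, 0, 10)
y <- 1.5 + 0.8 * x + rnorm(100)

fit <- lm(y ~ x)   # least-squares fit: minimizes the sum of squared errors (SSE)
coef(fit)          # estimated intercept and slope
plot(x, y)
abline(fit)        # draw the fitted regression line over the scatterplot
```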
More Complicated Relationships
Many real-life phenomena are not adequately explained by a straight-line
relationship. An ever-increasing set of methods and algorithms exists to deal
with this issue. Among the most common are:
• Adding transformed explanatory variables, for example, adding x2 or x3 to the
model.
• Using other algorithms to handle more complex relationships between variables
(e.g., generalized additive models, spline regression, support vector machines,
etc.).
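The first option, adding a transformed explanatory variable, can be sketched in base R on simulated data (the quadratic relationship here is hypothetical):

```r
# Hypothetical quadratic relationship
set.seed(4)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x^2 + rnorm(100, sd = 2)

fit_line <- lm(y ~ x)            # straight line: systematically misses the curvature
fit_quad <- lm(y ~ x + I(x^2))   # adding x^2 as a transformed explanatory variable
c(linear = summary(fit_line)$r.squared,
  quadratic = summary(fit_quad)$r.squared)
```

The quadratic model achieves a noticeably higher R-squared, reflecting the curvature that a straight line cannot capture.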
Heat maps are simply a 2D grid built from a 2D array, whose color depends on the
value of each cell. The data set must correspond to a 2D array whose cells contain
the values of the outcome variable. This technique is useful when you want to
represent the change of an outcome variable (e.g. length of stay) as a function of
two other variables (e.g. age and SOFA score).
The color mapping can be customized (e.g. rainbow or grayscale). Interestingly,
the Matlab function imagesc scales the data to the full colormap range. The 3D
equivalents of heat maps are mesh plots and surface plots (Fig. 15.16).
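In R, a comparable heat map and its 3D surface equivalent can be sketched with the base `image` and `persp` functions (the outcome grid below is entirely fabricated for illustration):

```r
# Hypothetical outcome (length of stay) over an age-by-SOFA grid
age  <- seq(20, 80, by = 10)
sofa <- 0:10
los  <- outer(age, sofa, function(a, s) 2 + 0.05 * a + 0.8 * s)

image(age, sofa, los, xlab = "Age", ylab = "SOFA score")   # 2D heat map
persp(age, sofa, los, theta = 30, phi = 25)                # 3D surface equivalent
```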
15.3 Part 2—Case Study
This case study refers to research that evaluated the effect of the placement of
indwelling arterial catheters (IACs) in hemodynamically stable patients with res-
piratory failure in intensive care, using the MIMIC-II database.
For this case study, several aspects of EDA were used:
• The categorical data was first tabulated.
• Summary statistics were then generated to describe the variables of interest.
• Graphical EDA was used to generate histograms to visualize the data of interest.
Tabulation
To analyze, visualize and test for association or independence of categorical vari-
ables, they must first be tabulated. When generating tables, any missing data will be
counted in a separate “NA” (“Not Available”) category. Please refer to Chap. 13,
“Missing Data”, for approaches to managing this problem. There are several
methods for creating frequency or contingency tables in R, for example
tabulating outcome variables for mortality, as demonstrated in the case study. Refer
to the “Tabulate” function found in the GitHub repository for this book (see code
appendix at the end of this Chapter) for details on how to compute frequencies of
outcomes for different variables.
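A minimal sketch of tabulation in base R, using fabricated categorical data with one missing value (the group labels are hypothetical):

```r
# Hypothetical categorical variables, including one missing value
iac  <- factor(c("IAC", "non-IAC", "IAC", "non-IAC", "IAC", NA))
died <- factor(c("yes", "no", "no", "no", "yes", "no"))

table(iac, useNA = "ifany")   # frequency table with a separate NA category
table(iac, died)              # 2x2 contingency table (NA row dropped by default)
```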
Statistical Tests
Multiple statistical tests are available in R, and we refer the reader to Chap. 16,
“Data Analysis”, for additional information on the use of relevant tests in R. For an
example of a simple Chi-squared test, please refer to the “Chi-squared” function
found in the GitHub repository for this book (see code appendix at the end of this
Chapter). In our example, the hypothesis of independence between death in the
ICU and IAC placement is not rejected (p > 0.05). By contrast, the hypothesis of
independence between day-28 mortality and IAC placement is rejected.
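A Chi-squared test of independence can be sketched in base R on a fabricated contingency table (the counts below are invented and do not come from the study data):

```r
# Hypothetical 2x2 contingency table of an outcome versus IAC placement
tab <- matrix(c(120, 380, 130, 370), nrow = 2,
              dimnames = list(outcome = c("died", "survived"),
                              group   = c("IAC", "non-IAC")))
chisq.test(tab)   # here p > 0.05, so independence is not rejected
```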
Summary Statistics
Summary statistics as described above include frequency, mean, median, mode,
range, interquartile range, and maximum and minimum values. An extract of the
summary statistics of patient demographics, vital signs, laboratory results and
comorbidities is shown in Table 15.5. Please refer to the function called “EDA Summary” in the
Table 15.5 Comparison between the two study cohorts, entire cohort N = 1776 (subsample of variables only)

Variables            Non-IAC            IAC                p-value
Size                 984 (55.4 %)       792 (44.6 %)       NA
Age (year)           51 (35–72)         56 (40–73)         0.009
Gender (female)      344 (43.5 %)       406 (41.3 %)       0.4
Weight (kg)          76 (65–90)         78 (67–90)         0.08
SOFA score           5 (4–6)            6 (5–8)            <0.0001
Co-morbidities
  CHF                97 (12.5 %)        116 (11.8 %)       0.7
  …                  …                  …                  …
Lab tests
  WBC                10.6 (7.8–14.3)    11.8 (8.5–15.9)    <0.0001
  Hemoglobin (g/dL)  13 (11.3–14.4)     12.6 (11–14.1)     0.003
  …                  …                  …                  …
GitHub repository for this book (see code appendix at the end of this Chapter)
(Table 15.5).
When separate cohorts are generated based on a common variable, in this case
the presence of an indwelling arterial catheter, summary statistics are presented for
each cohort.
It is important to identify any differences in subject baseline characteristics. The
benefits of this are two-fold: first it is useful to identify potentially confounding
variables that contribute to an outcome in addition to the predictor (exposure)
variable. For example, if mortality is the outcome variable then differences in
severity of illness between cohorts may wholly or partially account for any variance
in mortality. Identifying these variables is important, as it is possible to attempt to
control for them using adjustment methods such as multivariable logistic regression.
Secondly, it may allow the identification of variables that are associated with
the predictor variable, enriching our understanding of the phenomenon we are
observing.
The analytical extension of identifying differences using medians, means
and data visualization is to test for statistically significant differences in any given
subject characteristic, using for example the Wilcoxon rank-sum test. Refer to Chap. 16
for further details on hypothesis testing.
Histograms
Histograms are considered the backbone of EDA for continuous data. They can be
used to help the researcher understand continuous variables and provide key
information such as their distribution. As outlined in Chap. 14 on noise and outliers,
the histogram allows the researcher to visualize where the bulk of the data points lie
between the maximum and minimum values. Histograms also allow a visual
comparison of a variable between cohorts. For example, to compare severity of
illness between patient cohorts, histograms of the SOFA score can be plotted side by
side (Fig. 15.17). An example of this is given in the code for this chapter using the
“side-by-side histogram” function (see code appendix at the end of this Chapter).
Boxplot and ANOVA
Although outside the scope of this case study, the user may be interested in analysis of
variance. When performing EDA, an effective way to visualize this is through the
use of boxplots. For example, to explore differences in blood pressure based on
severity of illness, subjects could be categorized by severity of illness and their blood
pressure values at baseline plotted (Fig. 15.18). Please refer to the function called
“Box Plot” in the GitHub repository for this book (see code appendix at the end of
this Chapter).
The box plot shows a few outliers, which may be interesting to explore indi-
vidually, and shows that patients with a high SOFA score (>10) tend to have lower
blood pressure than those with a lower SOFA score.
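The side-by-side boxplot and one-way ANOVA described above can be sketched in base R on simulated data (the severity groups, means and sample sizes are all invented for illustration):

```r
# Hypothetical MAP values across three severity categories
set.seed(8)
severity <- factor(rep(c("low", "medium", "high"), each = 50),
                   levels = c("low", "medium", "high"))
map <- rnorm(150, mean = c(85, 80, 70)[as.integer(severity)], sd = 8)

boxplot(map ~ severity, ylab = "MAP (mmHg)")   # side-by-side group comparison
summary(aov(map ~ severity))                   # one-way analysis of variance
```

With group means this far apart relative to the within-group spread, the ANOVA F-test yields a small p-value, consistent with the visual impression from the boxplots.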
Fig. 15.18 Side-by-side boxplot of MAP for different levels of severity at admission
15.4 Conclusion
Open Access This chapter is distributed under the terms of the Creative Commons
Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/
4.0/), which permits any noncommercial use, duplication, adaptation, distribution and reproduction
in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, a link is provided to the Creative Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative
Commons license, unless indicated otherwise in the credit line; if such material is not included in
the work’s Creative Commons license and the respective action is not permitted by statutory
regulation, users will need to obtain permission from the license holder to duplicate, adapt or
reproduce the material.
Code Appendix
The code used in this chapter is available in the GitHub repository for this book:
https://github.com/MIT-LCP/critical-data-book. Further information on the code is
available from this website.
References