2 Stat EDA

Exploratory Data Analysis
Graphical Displays of Data
Measures of Central Tendency
Measures of Dispersion
Exploratory vs Confirmatory Data Analysis
 Exploratory Data Analysis (EDA)
Descriptive Statistics
Graphical
Data driven
 Confirmatory Data Analysis (CDA)
Inferential Statistics
EDA and theory driven
WHAT IS EDA?
 The analysis of datasets based on various numerical methods
and graphical tools.
 Exploring data for patterns, trends, underlying structure,
deviations from the trend, anomalies and strange structures.

 It facilitates discovering the unexpected as well as confirming
the expected.
 Another definition: An approach/philosophy for data analysis
that employs a variety of techniques (mostly graphical).

AIM OF THE EDA
 Maximize insight into a dataset
 Uncover underlying structure
 Extract important variables
 Detect outliers and anomalies
 Test underlying assumptions
 Develop valid models
 Determine optimal factor settings (Xs)

AIM OF THE EDA
 The goal of EDA is to explore data with an open mind.
 Tukey: EDA is detective work… Unless the detective finds the
clues, the judge or jury has nothing to consider.
 Here, the judge or jury is confirmatory data analysis.
 Tukey: Confirmatory data analysis goes further, assessing the
strength of the evidence.
 With EDA, we can examine the data and try to understand the
meaning of the variables, for example what their abbreviations stand for.
STEPS OF EDA
 Generate good research questions
 Data restructuring: You may need to make new variables from the
existing ones.
 Obtain rates or percentages from two variables instead of using them separately
 Creating dummy variables for categorical variables
 Based on the research questions, use appropriate graphical tools and
obtain descriptive statistics. Try to understand the data structure,
relationships, anomalies, unexpected behaviors.
 Try to identify confounding variables, interaction relations and
multicollinearity, if any.
 Handle missing observations
 Decide whether a transformation is needed (on the response and/or
explanatory variables).
 Formulate the hypotheses based on your research questions
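The data restructuring step above (rates from two variables, dummy variables for a categorical variable) can be sketched in plain Python. This is only an illustration: the records, the field names and the helper functions `add_rate` and `add_dummies` are hypothetical and do not come from the slides.

```python
# Hypothetical records: every field name and value is invented for illustration.
records = [
    {"region": "north", "cases": 12, "population": 400},
    {"region": "south", "cases": 30, "population": 1500},
    {"region": "east", "cases": 9, "population": 300},
]

def add_rate(rows, numerator, denominator, name):
    """Return new rows with a rate column computed from two existing columns."""
    return [{**row, name: row[numerator] / row[denominator]} for row in rows]

def add_dummies(rows, column):
    """Return new rows with one 0/1 indicator column per level of a categorical column."""
    levels = sorted({row[column] for row in rows})
    return [
        {**row, **{f"{column}_{level}": int(row[column] == level) for level in levels}}
        for row in rows
    ]

restructured = add_dummies(
    add_rate(records, "cases", "population", "case_rate"), "region"
)
for row in restructured:
    print(row)
```

When such indicator columns feed a regression model, one level is usually dropped and treated as the reference category.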
AFTER EDA
 Confirmatory Data Analysis: Test the hypotheses by
statistical analysis
 Draw conclusions and present your results clearly.
Classification of EDA*
 Exploratory data analysis is generally cross-classified in two ways.
First, each method is either non-graphical or graphical. And second,
each method is either univariate or multivariate (usually just
bivariate).
 Non-graphical methods generally involve calculation of summary
statistics, while graphical methods obviously summarize the data in
a diagrammatic or pictorial way.
 Univariate methods look at one variable (data column) at a time,
while multivariate methods look at two or more variables at a time
to explore relationships. Usually our multivariate EDA will be
bivariate (looking at exactly two variables), but occasionally it will
involve three or more variables.
 It is almost always a good idea to perform univariate EDA on each
of the components of a multivariate EDA before performing the
multivariate EDA.
*Seltman, H.J. (2015). Experimental Design and Analysis. http://www.stat.cmu.edu/~hseltman/309/Book/Book.pdf
EXAMPLE
In breast cancer research, the main questions of interest might be:
Does any treatment method result in a higher survival rate?
Can a particular treatment be suggested to a woman with
specific characteristics?
Is there any difference between patients in terms of survival
rates (e.g. are white women more likely to survive compared with
Black women if both are at the same stage of disease)?
 Most of the statistical information in newspapers, magazines,
company reports and other publications consists of data that
are summarized and presented in a form that is easy for the
reader to understand.
 Presentation of Qualitative Data
 A graphic display can reveal at a glance the main characteristics of a
data set.
 The presentation depends on the nature of the data, whether the data
are quantitative (e.g. income and CGPA) or qualitative (e.g. gender and
ethnic group).
 Three types of graphs used to display qualitative data:
 bar graph / column chart
 pie chart
 line chart
 Presentation of Qualitative Data
 Bar Chart
 A bar chart displays a frequency distribution in graphical form. It has
two orthogonal axes: one represents the categories of the observations
and the other represents their frequencies. The frequency of each
category is represented by a bar.
 Pie Chart
 A pie chart displays a frequency distribution as proportions. It is a
circle divided into sectors. Each sector represents a category, and the
area of the sector is proportional to the relative frequency of that
category.
 Line Chart
 A line chart displays the trend of the observations. It has two
orthogonal axes: one represents the observations (typically in order,
such as time periods) and the other represents their frequencies. The
plotted frequencies are joined by lines.
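The counts and proportions that a bar chart or pie chart displays can be tabulated first. The sketch below uses only the standard library; the observations and the helper `frequency_table` are hypothetical, and the text bars are just a rough stand-in for a real chart.

```python
from collections import Counter

# Hypothetical qualitative observations (invented for illustration).
observations = ["F", "M", "F", "F", "M", "F", "M", "F"]

def frequency_table(values):
    """Map each category to (frequency, relative frequency), most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {
        category: (count, count / total)
        for category, count in counts.most_common()
    }

table = frequency_table(observations)
for category, (count, share) in table.items():
    # Bar length plays the role of bar height; share is the pie-sector proportion.
    print(f"{category:<4}{'#' * count:<8}{count:>3}{share:>8.1%}")
```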
 Example: The table below shows the number of sandpipers recorded from
January 1989 to December 1989.
 Presentation of Quantitative Data
 Several graphs are available for the graphical presentation of
quantitative data.
 Frequency polygon
 Histogram
 Ogive
 Boxplot (Will be our focus in this chapter)
 Histogram
 A histogram looks like a bar chart except that the horizontal axis represents
quantitative data, so there are no gaps between the bars.
 Frequency Polygon
 A frequency polygon looks like a line chart except that the horizontal axis
represents the class marks (class midpoints) of quantitative data.
 Ogive
 An ogive is a line graph in which the horizontal axis represents the upper limits
of the class intervals and the vertical axis represents the cumulative frequencies.
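The quantities behind a histogram, a frequency polygon and an ogive all come from one grouped frequency table. A minimal sketch, assuming equal-width classes of the form [lower, upper); the data, the class settings and the helper `grouped_frequencies` are hypothetical.

```python
from itertools import accumulate

# Hypothetical quantitative data and class settings (invented for illustration).
scores = [52, 55, 61, 64, 67, 68, 70, 73, 75, 78, 81, 84, 88, 93]
lower, width, n_classes = 50, 10, 5

def grouped_frequencies(data, lower, width, n_classes):
    """Count observations in equal-width classes [lower, lower + width), and so on."""
    counts = [0] * n_classes
    for x in data:
        index = int((x - lower) // width)
        if 0 <= index < n_classes:
            counts[index] += 1
    return counts

counts = grouped_frequencies(scores, lower, width, n_classes)          # histogram bar heights
class_marks = [lower + width * (i + 0.5) for i in range(n_classes)]    # frequency polygon x-values
upper_limits = [lower + width * (i + 1) for i in range(n_classes)]     # ogive x-values
cumulative = list(accumulate(counts))                                  # ogive y-values

for mark, upper, count, cum in zip(class_marks, upper_limits, counts, cumulative):
    print(f"class mark {mark:>5}  upper limit {upper:>4}  frequency {count}  cumulative {cum}")
```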
 Boxplot
 The box plot (a.k.a. box and whisker diagram) is a standardized way of
displaying the distribution of data based on the five number summary:
minimum, first quartile, median, third quartile, and maximum.
 Boxplot
 Quartiles divide a data set into fourths, or four equal parts.
 Boxplot
 How to obtain Quartiles?
 Q2 – Median
 Q1 – Median of the values between the lowest value and Q2
 Q3 – Median of the values between Q2 and the largest value
 Examples
 Odd set of numbers
(1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27)
Q1 = 5, Q2 = 9, Q3 = 18
 Even set of numbers
(3, 5, 7, 8, 9), (11, 15, 16, 20, 21)
Q1 = 7, Q2 = (9 + 11)/2 = 10, Q3 = 16
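The quartile rule above (Q2 is the median, Q1 and Q3 are the medians of the lower and upper halves) can be checked with a short function. The name `quartiles` is hypothetical, and the sketch assumes that the middle value is left out of both halves when the number of observations is odd, as in the odd example. Other quartile conventions exist and can give slightly different values.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    When the number of observations is odd, the middle value is excluded
    from both halves.
    """
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower_half), median(ordered), median(upper_half)

odd_set = [1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27]
even_set = [3, 5, 7, 8, 9, 11, 15, 16, 20, 21]

print(quartiles(odd_set))   # (5, 9, 18)
print(quartiles(even_set))  # (7, 10.0, 16)
```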
 Boxplot
IQR = Q3 − Q1
Lower Fence = Q1 − 1.5(IQR)
Upper Fence = Q3 + 1.5(IQR)
Any value below the lower fence or above the upper fence
is considered an outlier
 Boxplot
 Outlier
 Extreme observations
 Can occur because of errors in measuring a variable, in data
entry, or in sampling.
 Boxplot
 Outlier
 Checking for outliers by using Quartiles
 Step 1: Determine the first and third quartiles of data.
 Step 2: Compute the interquartile range (IQR).
IQR = Q3 − Q1
 Step 3: Determine the fences. Fences serve as cut-off points for
determining outliers.
Lower Fence = Q1 − 1.5(IQR)
Upper Fence = Q3 + 1.5(IQR)
 Step 4: If a data value is less than the lower fence or greater than the
upper fence, it is considered an outlier.
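Steps 1 to 4 can be sketched as one function. The function name `iqr_outliers` is hypothetical, the quartiles use the median-of-halves rule, and the sample is the odd-numbered example set with its largest value changed to 60 purely to produce an outlier.

```python
from statistics import median

def iqr_outliers(data, k=1.5):
    """Return (lower_fence, upper_fence, outliers) following Steps 1 to 4."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    # Step 1: first and third quartiles (median of each half).
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])
    # Step 2: interquartile range.
    iqr = q3 - q1
    # Step 3: fences.
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Step 4: values outside the fences are flagged as outliers.
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers

# Hypothetical sample: the largest value was changed to 60 for illustration.
sample = [1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 60]
print(iqr_outliers(sample))  # (-14.5, 37.5, [60])
```

A flagged value is a candidate for checking, not automatically an error.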
How to: Boxplot Interpretation
The Basics
 A boxplot splits the data set into quartiles. It consists of a
minimum value, a box running from the first quartile (Q1) to the
third quartile (Q3) with the median marked inside it, and a maximum value
 Outliers are plotted separately as points on the chart
Interpreting Boxplot
 Things that can be described from a boxplot
 The five-number summary
 Range of the boxplot
 The IQR
 Shape of the data
 More than one boxplot
 Compare their shape and position
 The five-number summary
 Minimum = -25
 First Quartile = 300
 Second Quartile / Median = 400
 Third Quartile = 600
 Maximum = 1000
 Range
 In the boxplot above, data values ranged from about -700 (the smallest
outlier) to 1700 (the largest outlier), so the range is 2400. If you
ignore outliers, the range is illustrated by the distance between the
opposite ends of the whiskers - about 1000 in the boxplot above.
 Interquartile Range (IQR)
 In the boxplot above, the range between the quartiles is equal to
600 − 300 = 300
 Based on Q1, we know that 25% of the data has a value below 300 and
75% of the data has a value above 300
 Based on Q2, we know that half of the data has a value less than 400
 Based on Q3, we know that 75% of the data has a value below 600 and
25% of the data has a value above 600
 Shape of the data
 Boxplots often provide information about the shape of a data set. The
examples below show some common patterns
 Shape of the data
 For our case, the boxplot is skewed to the right

Interpreting More Than One Boxplot
 The second boxplot is comparatively short
 This suggests that the data in the second boxplot have a small
variance (most of the values are similar)
 The first and third boxplots are comparatively tall
 This suggests that the variance for these boxplots is high (the
values differ widely)
 The third boxplot is much higher than the fourth boxplot
 This could suggest a difference in values between the groups. As can
be seen, almost 75% of the data in the third boxplot have higher values
than the data in the fourth boxplot.
 There are obvious differences in variance between the first and
second boxplots, and between the second and third boxplots
 Same median, different distribution
 Look at the first, second and third boxplots. Their medians are all at
the same place. For all three boxplots, half of the data fall
below Q2, which is 287.5. However, they show differences in
variance.
Exercise
 Describe each boxplot

 Compare the boxplots, what can you say?
 A measure of central tendency is a summary statistic that is
used to summarize a set of observations.
 The common measures of central tendency are
 Mean
 Median
 Mode
 Mean
 The sample mean is denoted by $\bar{x}$.
 The mean of a sample is the sum of the measurements divided by the
number of measurements in the set:
$\bar{x} = \frac{\sum x}{n}$
 Example
 The mean for this case is
$\bar{x} = \frac{\sum x}{n} = \frac{390}{5} = 78$
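The same calculation can be sketched in code. The original measurements are not shown in the example, so the five values below are hypothetical, chosen only so that they reproduce the quoted sum of 390 with n = 5. The helper `sample_mean` is likewise an illustrative name.

```python
# Hypothetical measurements: invented so that the sum is 390 and n is 5,
# matching the totals quoted in the example. The original values are not shown.
measurements = [70, 75, 78, 82, 85]

def sample_mean(values):
    """Sum of the measurements divided by the number of measurements."""
    if not values:
        raise ValueError("mean of empty data is undefined")
    return sum(values) / len(values)

x_bar = sample_mean(measurements)
print(sum(measurements), len(measurements), x_bar)  # 390 5 78.0
```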
 Median
 Median is the middle value of a set of observations arranged in order
of magnitude and is normally denoted by $\tilde{x}$.
 The median depends on the number of observations in the data, $n$.
 If $n$ is odd, then the median is the $\left(\frac{n+1}{2}\right)$th observation of the
ordered observations.
 If $n$ is even, then the median is the arithmetic mean of the $\left(\frac{n}{2}\right)$th
observation and the $\left(\frac{n}{2}+1\right)$th observation.
 Example
 The median of the data (4, 6, 3, 1, 2, 5, 7, 3) is 3.5.
 Arranged in order of magnitude, the data become 1, 2, 3, 3, 4, 5, 6, 7.
 As n = 8 (even), the median is the mean of the 4th and 5th observations, that
is (3 + 4)/2 = 3.5.
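The odd and even rules translate directly into positions in the ordered data. A minimal sketch follows; the function name `median_by_position` is hypothetical, and the positions are shifted by one because Python lists are indexed from 0.

```python
def median_by_position(data):
    """Median via the positional rule: ((n + 1) / 2)-th value if n is odd,
    mean of the (n / 2)-th and (n / 2 + 1)-th values if n is even."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]          # 1-based position converted to 0-based index
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median_by_position([4, 6, 3, 1, 2, 5, 7, 3]))  # 3.5
```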
 Mode
 Mode of a set of observations is the observation with the highest
frequency and is usually denoted by x̂ . The mode can also
be used to describe qualitative data.
 The mode has the advantage that it is easy to find and is not
affected by extreme values.
 However, a mode may not exist, and even if it does exist, it may not be
unique.
 Mode
 If two measurements in a data set share the highest frequency,
both are taken as modes and the data are known as bimodal
data.
 If more than two measurements share the highest frequency,
the data are treated as having no mode.
 Example:
 The mode for the observations 4,6,3,1,2,5,7,3 is 3.
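The mode conventions used here (one mode, bimodal, or no mode when more than two values tie) can be sketched with `statistics.multimode`, which returns every value that shares the highest frequency. The function `describe_mode` is a hypothetical helper, and its three-way labelling follows the convention stated in these notes rather than a universal rule.

```python
from statistics import multimode

def describe_mode(data):
    """Label the data as unimodal, bimodal, or no mode, following the notes' convention."""
    modes = multimode(data)  # all values tied for the highest frequency
    if len(modes) == 1:
        return "unimodal", modes
    if len(modes) == 2:
        return "bimodal", modes
    return "no mode", []

print(describe_mode([4, 6, 3, 1, 2, 5, 7, 3]))   # ('unimodal', [3])
print(describe_mode([1, 1, 2, 2, 3]))            # ('bimodal', [1, 2])
print(describe_mode([1, 1, 2, 2, 3, 3]))         # ('no mode', [])
print(describe_mode(["F", "M", "F"]))            # works for qualitative data too
```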
 The measure of dispersion or spread is the degree to which a
set of data tends to spread around the average value.
 It shows whether the data set is concentrated around the mean or
scattered.
 The common measures of dispersion are variance and
standard deviation.
 The standard deviation is the square root of the
variance.
 The sample variance is denoted by $s^2$ and the sample standard
deviation is denoted by $s$.
 Range
 Range is the simplest measure of dispersion to calculate.
 Range = Largest value – Smallest value
 Example:
 Range = 267,277 – 49,651 = 217,626 square miles.

 Variance
 The variance of a sample (also known as the mean square) for raw (ungrouped) data
is denoted by $s^2$ and defined by:
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
 Example (using previous data in Range):
$\bar{x} = \frac{53182 + 49651 + 69903}{3} = 57578.6667$
$s^2 = \frac{(53182 - 57578.6667)^2 + \dots + (69903 - 57578.6667)^2}{3 - 1} = 117033884.3$
 Standard deviation
 It is simply the square root of the variance
 Example
$s = \sqrt{117033884.3} = 10818.22$ square miles
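The worked variance and standard deviation figures can be reproduced with a few lines of code. The three values are the ones used in the example; the variable name `areas` and the helper `sample_variance` are illustrative choices.

```python
from math import sqrt

# The three values used in the worked example (square miles).
areas = [53182, 49651, 69903]

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

s2 = sample_variance(areas)
s = sqrt(s2)
print(round(s2, 1), round(s, 2))  # 117033884.3 10818.22
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n - 1 divisor.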

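How to interpret Standard deviation?
Population variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
Sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
Population standard deviation: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
Sample standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$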
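One way to read a standard deviation is as a yardstick for how far a value sits from the mean. The sketch below compares the population (divide by N) and sample (divide by n - 1) versions and labels each value relative to one standard deviation from the mean. The heights are hypothetical, and the "too high / standard / too short" rule is only an assumed reading of this idea, not a rule stated in the notes.

```python
from statistics import mean, pstdev, stdev

# Hypothetical dog heights in cm (invented for illustration).
heights_cm = [20, 40, 40, 40, 50, 50, 70, 90]

mu = mean(heights_cm)
sigma = pstdev(heights_cm)  # population standard deviation (divides by N)
s = stdev(heights_cm)       # sample standard deviation (divides by n - 1)

def label(height, centre, spread):
    """Label a value by its position relative to one standard deviation from the mean."""
    if height > centre + spread:
        return "too high"
    if height < centre - spread:
        return "too short"
    return "standard"

labels = [label(h, mu, sigma) for h in heights_cm]
print(mu, sigma, round(s, 2))
print(list(zip(heights_cm, labels)))
```

The sample version is always a little larger than the population version for the same data, and the gap shrinks as the number of observations grows.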
[Figure: dogs labelled "Too High", "Dogs with standard height" and "Too Short"]
