2 Stat EDA

Exploratory Data Analysis
Graphical Displays of Data
Measures of Central Tendency
Measures of Dispersion
Exploratory vs Confirmatory Data Analysis
• Exploratory Data Analysis (EDA)
  Descriptive statistics
  Graphical
  Data driven
• Confirmatory Data Analysis (CDA)
  Inferential statistics
  EDA and theory driven
WHAT IS EDA?
• The analysis of datasets using various numerical methods and graphical tools.
• Exploring data for patterns, trends, underlying structure, deviations from the trend, anomalies and strange structures.
• It facilitates discovering the unexpected as well as confirming the expected.
• Another definition: an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical).
AIM OF THE EDA
• Maximize insight into a dataset
• Uncover underlying structure
• Extract important variables
• Detect outliers and anomalies
• Test underlying assumptions
• Develop valid models
• Determine optimal factor settings (Xs)
• The goal of EDA is to explore data with an open mind.
• Tukey: EDA is detective work… Unless the detective finds the clues, the judge or jury has nothing to consider.
• Here, the judge or jury is confirmatory data analysis.
• Tukey: Confirmatory data analysis goes further, assessing the strength of the evidence.
• With EDA, we can examine the data and try to understand the meaning of the variables, for example what the abbreviations stand for.
STEPS OF EDA
• Generate good research questions.
• Data restructuring: you may need to create new variables from the existing ones.
  • Instead of using two variables separately, compute a rate or percentage from them.
  • Create dummy variables for categorical variables.
• Based on the research questions, use appropriate graphical tools and obtain descriptive statistics. Try to understand the data structure, relationships, anomalies and unexpected behaviors.
• Try to identify confounding variables, interactions and multicollinearity, if any.
• Handle missing observations.
• Decide whether a transformation is needed (on the response and/or explanatory variables).
• Decide on the hypotheses based on your research questions.
AFTER EDA
• Confirmatory data analysis: verify the hypotheses by statistical analysis.
• Draw conclusions and present your results clearly.
Classification of EDA*
• Exploratory data analysis is generally cross-classified in two ways. First, each method is either non-graphical or graphical. Second, each method is either univariate or multivariate (usually just bivariate).
• Non-graphical methods generally involve calculation of summary statistics, while graphical methods summarize the data in a diagrammatic or pictorial way.
• Univariate methods look at one variable (data column) at a time, while multivariate methods look at two or more variables at a time to explore relationships. Usually our multivariate EDA will be bivariate (looking at exactly two variables), but occasionally it will involve three or more variables.
• It is almost always a good idea to perform univariate EDA on each of the components of a multivariate EDA before performing the multivariate EDA.
*Seltman, H.J. (2015). Experimental Design and Analysis. http://www.stat.cmu.edu/~hseltman/309/Book/Book.pdf
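As a rough illustration of the non-graphical side of this classification, the sketch below first summarizes each column on its own (univariate) and only then looks at the two columns together (bivariate). The column names, the data values and the helper names `univariate_summary` and `pearson_r` are all invented for this example; they do not come from Seltman's text.

```python
from statistics import mean, median, stdev

# Hypothetical two-column data set (values invented for illustration).
hours_studied = [2, 4, 5, 7, 9]
exam_score = [51, 60, 62, 75, 88]

def univariate_summary(values):
    """Non-graphical univariate EDA: summary statistics for one column."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
        "min": min(values),
        "max": max(values),
    }

def pearson_r(x, y):
    """Non-graphical bivariate EDA: sample Pearson correlation of two columns."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Univariate step first, one column at a time ...
for name, column in [("hours_studied", hours_studied), ("exam_score", exam_score)]:
    print(name, univariate_summary(column))

# ... then the bivariate step on the pair of columns.
print("correlation:", round(pearson_r(hours_studied, exam_score), 3))
```

Running the univariate step first makes it easier to spot a problem in a single column (an impossible minimum, say) before it distorts the bivariate result.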
EXAMPLE
In breast cancer research, the main questions of interest might be:
Does any treatment method result in a higher survival rate?
Can a particular treatment be suggested to a woman with specific characteristics?
Is there any difference between patients in terms of survival rates (e.g., are white women more likely to survive than black women if both are at the same stage of the disease)?
Graphical Displays of Data
 Most of the statistical information in newspapers, magazines,
company reports and other publications consists of data that
are summarized and presented in a form that is easy for the
reader to understand.
• Presentation of Qualitative Data
  • A graphic display can reveal at a glance the main characteristics of a data set.
  • The presentation depends on the nature of the data, that is, whether the data are quantitative (e.g., income and CGPA) or qualitative (e.g., gender and ethnic group).
  • Three types of graphs are used to display qualitative data:
    • bar graph / column chart
    • pie chart
    • line chart
• Bar Chart
  • A bar chart displays a frequency distribution in graphical form. It consists of two orthogonal axes: one axis represents the observations and the other represents the frequency of the observations. The frequency of each observation is represented by a bar.
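A minimal sketch of the idea behind a bar chart, assuming a small invented list of qualitative observations: count how often each observation occurs, then draw one bar per observation whose length equals its frequency. The data and the helper name `text_bar_chart` are hypothetical.

```python
from collections import Counter

# Hypothetical qualitative observations (invented for illustration).
ethnic_group = ["A", "B", "A", "C", "A", "B", "A", "C", "B", "A"]

# Frequency distribution: one count per distinct observation.
frequencies = Counter(ethnic_group)

def text_bar_chart(freq):
    """Return one line per category; the bar length equals the frequency."""
    return [f"{category:<3}| {'#' * count} ({count})"
            for category, count in sorted(freq.items())]

for line in text_bar_chart(frequencies):
    print(line)
```

A plotting library would draw the same counts as rectangles; the text version only shows how the frequencies map to bar lengths.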
• Pie Chart
  • A pie chart displays a frequency distribution as proportions of the whole. It is a circle divided into sectors. Each sector represents an observation, and the area of each sector is proportional to the frequency of that observation.
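The sector sizes of a pie chart follow directly from the frequencies. The sketch below, using an invented list of observations and a hypothetical helper `pie_sectors`, converts each frequency into a proportion and into the central angle of its sector (the proportion of 360 degrees).

```python
from collections import Counter

def pie_sectors(observations):
    """Map each category to (proportion, central angle in degrees)."""
    freq = Counter(observations)
    total = sum(freq.values())
    return {category: (count / total, 360 * count / total)
            for category, count in freq.items()}

# Hypothetical qualitative data (invented for illustration).
gender = ["F", "M", "F", "F", "M", "F", "F", "M"]
sectors = pie_sectors(gender)

for category, (proportion, angle) in sectors.items():
    print(f"{category}: {proportion:.1%} of the circle, {angle:.0f} degrees")
```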
• Line Chart
  • A line chart displays the trend of observations. It consists of two orthogonal axes: one axis represents the observations and the other represents the frequency of the observations. The frequencies of the observations are joined by lines.
  • Example: the table below shows the number of sandpipers recorded from January 1989 to December 1989.
• Presentation of Quantitative Data
  • Several graphs are available for the graphical presentation of quantitative data:
    • Frequency polygon
    • Histogram
    • Ogive
    • Boxplot (our focus in this chapter)
• Histogram
  • A histogram looks like a bar chart, except that the horizontal axis represents quantitative data. There are no gaps between the bars.
• Frequency Polygon
  • A frequency polygon looks like a line chart, except that the horizontal axis represents the class marks of quantitative data.
• Ogive
  • An ogive is a line graph in which the horizontal axis represents the upper limits of the class intervals and the vertical axis represents the cumulative frequencies.
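The histogram, frequency polygon and ogive are all drawn from the same grouped frequency table. The sketch below builds that table for an invented data set, assuming equal-width classes that include their lower limit and exclude their upper limit (that convention, the data and the variable names are assumptions for illustration).

```python
from itertools import accumulate

# Hypothetical quantitative data (invented for illustration).
scores = [52, 55, 61, 64, 67, 68, 70, 73, 75, 78, 81, 86, 90, 94]

# Assumed classes: [50, 60), [60, 70), [70, 80), [80, 90), [90, 100).
lower, width, n_classes = 50, 10, 5

# Histogram: frequency of each class interval.
counts = [0] * n_classes
for s in scores:
    counts[(s - lower) // width] += 1

# Frequency polygon: frequencies plotted against the class marks (midpoints).
class_marks = [lower + width * i + width / 2 for i in range(n_classes)]

# Ogive: cumulative frequencies plotted against the upper class limits.
upper_limits = [lower + width * (i + 1) for i in range(n_classes)]
cumulative = list(accumulate(counts))

for mark, upper, f, cf in zip(class_marks, upper_limits, counts, cumulative):
    print(f"class mark {mark:5.1f}  upper limit {upper:3d}  freq {f}  cum. freq {cf}")
```

The last cumulative frequency always equals the number of observations, which is a quick check that no value fell outside the classes.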
• Boxplot
  • The box plot (a.k.a. box-and-whisker diagram) is a standardized way of displaying the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum.
  • The quartiles divide the data set into fourths, that is, four equal parts.
  • How to obtain quartiles?
    • Q2: the median
    • Q1: the median of the values between the lowest value and Q2
    • Q3: the median of the values between Q2 and the largest value
  • Examples
    • Odd number of observations
      (1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27)
      Q1 = 5, Q2 = 9, Q3 = 18
    • Even number of observations
      (3, 5, 7, 8, 9), (11, 15, 16, 20, 21)
      Q1 = 7, Q2 = (9 + 11)/2 = 10, Q3 = 16
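The quartile rule above can be sketched in a few lines. The function name `quartiles` is made up for this example, and the code assumes the convention shown in the two examples: when the number of observations is odd, the median itself is left out of both halves. Statistical packages use several other quartile conventions, so their results can differ slightly from this rule.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    For an odd number of observations the median is excluded from both halves.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                                  # values below Q2
    upper = data[half + 1:] if n % 2 else data[half:]    # values above Q2
    return median(lower), median(data), median(upper)

# The two examples from the slide.
odd_set = [1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27]
even_set = [3, 5, 7, 8, 9, 11, 15, 16, 20, 21]

print(quartiles(odd_set))
print(quartiles(even_set))
```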
• Boxplot
  $\mathrm{IQR} = Q_3 - Q_1$
  $\text{Lower fence} = Q_1 - 1.5\,(\mathrm{IQR})$
  $\text{Upper fence} = Q_3 + 1.5\,(\mathrm{IQR})$
  • Values that fall below the lower fence or above the upper fence are considered outliers.
  • Outlier
    • An extreme observation
    • Can occur because of an error in the measurement of a variable, during data entry, or in sampling.
  • Checking for outliers by using quartiles
    • Step 1: Determine the first and third quartiles of the data.
    • Step 2: Compute the interquartile range (IQR).
      $\mathrm{IQR} = Q_3 - Q_1$
    • Step 3: Determine the fences. Fences serve as cut-off points for determining outliers.
      $\text{Lower fence} = Q_1 - 1.5\,(\mathrm{IQR})$
      $\text{Upper fence} = Q_3 + 1.5\,(\mathrm{IQR})$
    • Step 4: If a data value is less than the lower fence or greater than the upper fence, it is considered an outlier.
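Steps 2 to 4 can be sketched as follows. The helper names `fences` and `find_outliers`, the illustrative quartiles Q1 = 300 and Q3 = 600, and the list of values are all assumptions made for this example; Step 1 (finding the quartiles) is taken as already done.

```python
def fences(q1, q3, k=1.5):
    """Steps 2 and 3: compute the IQR, then the lower and upper fences."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3):
    """Step 4: keep the values that fall outside the fences."""
    low, high = fences(q1, q3)
    return [v for v in values if v < low or v > high]

# Illustrative quartiles and hypothetical data values.
q1, q3 = 300, 600
values = [-700, -25, 300, 400, 600, 1000, 1700]

print(fences(q1, q3))
print(find_outliers(values, q1, q3))
```

With these numbers the IQR is 300, so the fences sit at -150 and 1050, and only -700 and 1700 are flagged.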
How to: Boxplot Interpretation
The Basics
• A boxplot splits the data set into quartiles. It consists of a minimum value, a box from the first quartile (Q1) to the third quartile (Q3) with the median marked inside it, and a maximum value.
• Outliers are plotted separately as points on the chart.
Interpreting Boxplot
• Things that can be described from a boxplot
  • The five-number summary
  • Range of the boxplot
  • The IQR
  • Shape of the data
• More than one boxplot
  • Compare their shape and position
• The five-number summary
  • Minimum = -25
  • First Quartile = 300
  • Second Quartile / Median = 400
  • Third Quartile = 600
  • Maximum = 1000
• Range
  • In the boxplot above, the data values range from about -700 (the smallest outlier) to 1700 (the largest outlier), so the range is 2400. If you ignore outliers, the range is the distance between the opposite ends of the whiskers, about 1000 in the boxplot above.
• Interquartile Range (IQR)
  • In the boxplot above, the interquartile range is 600 - 300 = 300.
  • Based on Q1, 25% of the data have a value below 300 and 75% have a value above 300.
  • Based on Q2, half of the data have a value less than 400.
  • Based on Q3, 75% of the data have a value below 600 and 25% have a value above 600.
• Shape of the data
  • Boxplots often provide information about the shape of a data set. The examples below show some common patterns.
  • In our case, the boxplot is skewed to the right.
Interpreting More Than One Boxplot
• The second boxplot is comparatively short.
  • This suggests that the data in the second boxplot have a small variance (most of the values are similar).
• The first and third boxplots are comparatively tall.
  • This suggests that the variance of these boxplots is high (the values differ widely).
• The third boxplot is much higher than the fourth boxplot.
  • This could suggest a difference in values between the groups. As can be seen, almost 75% of the data in the third boxplot have higher values than the data in the fourth boxplot.
• There are obvious differences in variance between the first and second boxplots, and between the second and third boxplots.
• Same median, different distribution
  • Look at the first, second and third boxplots. Their medians are all at the same place. For all three boxplots, half of the data fall below Q2, which is 287.5. However, they differ in variance.
Exercise
• Describe each boxplot.
• Compare the boxplots. What can you say?
Measures of Central Tendency
• A measure of central tendency is a summary statistic used to summarize a set of observations.
• The common measures of central tendency are
  • Mean
  • Median
  • Mode
• Mean
  • The sample mean is denoted by $\bar{x}$.
  • The mean of a sample is the sum of the measurements divided by the number of measurements in the set:
    $$\bar{x} = \frac{\sum x}{n}$$
  • Example
    • The mean for this case is
      $$\bar{x} = \frac{\sum x}{n} = \frac{390}{5} = 78$$
• Median
  • The median is the middle value of a set of observations arranged in order of magnitude and is normally denoted by $\tilde{x}$.
  • The median depends on the number of observations in the data, $n$.
    - If $n$ is odd, the median is the $\left(\frac{n+1}{2}\right)$th observation of the ordered observations.
    - If $n$ is even, the median is the arithmetic mean of the $\left(\frac{n}{2}\right)$th observation and the $\left(\frac{n}{2}+1\right)$th observation.
  • Example
    • The median of the data (4, 6, 3, 1, 2, 5, 7, 3) is 3.5.
    • Arranged in order of magnitude, the data become 1, 2, 3, 3, 4, 5, 6, 7.
    • As $n = 8$ (even), the median is the mean of the 4th and 5th observations, that is, (3 + 4)/2 = 3.5.
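The odd/even rule for the median can be written out directly. The function name `median_by_rule` is invented for this sketch; the only subtlety is that the rule counts observations from 1 while Python lists are indexed from 0.

```python
from statistics import median

def median_by_rule(values):
    """Median via the rule above: sort, then pick or average the middle value(s)."""
    data = sorted(values)
    n = len(data)
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th observation (1-based), so subtract 1.
        return data[(n + 1) // 2 - 1]
    # Even n: mean of the (n / 2)-th and (n / 2 + 1)-th observations (1-based).
    return (data[n // 2 - 1] + data[n // 2]) / 2

example = [4, 6, 3, 1, 2, 5, 7, 3]
print(median_by_rule(example))

# The standard library agrees with the hand-written rule.
print(median(example))
```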
• Mode
  • The mode of a set of observations is the observation with the highest frequency and is usually denoted by $\hat{x}$. The mode can also be used to describe qualitative data.
  • The mode has the advantage that it is easy to calculate and is not affected by extreme values.
  • However, the mode may not exist, and even if it does exist, it may not be unique.
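A short sketch of how the most frequent value(s) can be found, including the non-unique case. It uses `statistics.multimode` (available from Python 3.8), which returns every value that shares the highest frequency. The second and third data sets are invented for illustration.

```python
from statistics import multimode

# Data set used elsewhere in these slides: 3 occurs twice, everything else once.
print(multimode([4, 6, 3, 1, 2, 5, 7, 3]))

# Hypothetical data with two values tied for the highest frequency.
print(multimode([1, 1, 2, 2, 3]))

# The mode also works for qualitative data (hypothetical values).
print(multimode(["red", "blue", "red", "green"]))
```

Note that when every value occurs exactly once, `multimode` returns all of them, so deciding that such data have no mode is left to the analyst.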
  • If two measurements in a data set share the highest frequency, both are taken as modes and the data are called bimodal.
  • If more than two measurements share the highest frequency, the data are regarded as having no mode.
 Example:
 The mode for the observations 4,6,3,1,2,5,7,3 is 3.
Measures of Dispersion
• A measure of dispersion (or spread) is the degree to which a set of data tends to spread around the average value.
• It shows whether the data set is concentrated around the mean or scattered.
• The common measures of dispersion are the variance and the standard deviation.
• The standard deviation is the square root of the variance.
• The sample variance is denoted by $s^2$ and the sample standard deviation is denoted by $s$.
• Range
  • The range is the simplest measure of dispersion to calculate.
  • Range = Largest value - Smallest value
  • Example:
    Range = 267,277 - 49,651 = 217,626 square miles.
• Variance
  • The variance of a sample (also known as the mean square) for raw (ungrouped) data is denoted by $s^2$ and defined by
    $$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
  • Example (using previous data in Range):
    $$\bar{x} = \frac{53182 + 49651 + 69903}{3} = 57578.6667$$
    $$s^2 = \frac{(53182 - 57578.6667)^2 + \dots + (69903 - 57578.6667)^2}{3 - 1} = 117033884.3$$
• Standard deviation
  • It is simply the square root of the variance.
  • Example
    $$s = \sqrt{117033884.3} = 10818.22 \text{ square miles}$$
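The variance and standard deviation figures above can be checked with a short script. It computes the sample variance term by term from the definition (divisor n - 1) and compares the result with Python's `statistics.variance` and `statistics.stdev`. The variable names are chosen for this sketch only.

```python
from statistics import mean, stdev, variance

# The three values used in the variance example (square miles).
areas = [53182, 49651, 69903]

x_bar = mean(areas)

# Sample variance from the definition: squared deviations summed, divided by n - 1.
s2 = sum((x - x_bar) ** 2 for x in areas) / (len(areas) - 1)

# Sample standard deviation: square root of the sample variance.
s = s2 ** 0.5

print(f"mean     = {x_bar:.4f}")
print(f"variance = {s2:.1f}")
print(f"std dev  = {s:.2f}")

# The standard library gives the same values.
print(variance(areas), stdev(areas))
```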


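How to interpret Standard deviation?

Population variance and sample variance:
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} \qquad s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

Population standard deviation and sample standard deviation:
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} \qquad s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$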
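To see the effect of the two divisors, the sketch below computes both versions for the same three values used in the variance example. Treating those three values as a complete population is purely an assumption for illustration. `pvariance` and `pstdev` divide by N, while `variance` and `stdev` divide by n - 1.

```python
from statistics import pstdev, pvariance, stdev, variance

data = [53182, 49651, 69903]
n = len(data)

pop_var = pvariance(data)     # divisor N
sample_var = variance(data)   # divisor n - 1

print(f"population variance = {pop_var:.1f}, std dev = {pstdev(data):.2f}")
print(f"sample variance     = {sample_var:.1f}, std dev = {stdev(data):.2f}")

# Both share the same sum of squared deviations, so they differ by the factor (n - 1) / N.
print(f"ratio = {pop_var / sample_var:.4f}  (expected {(n - 1) / n:.4f})")
```

The sample version is always the larger of the two, and the gap shrinks as the number of observations grows.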
[Figure: dog heights, with labels "Too High", "Dogs with standard height" and "Too Short"]