2 Stat EDA
Analysis
Graphical Displays of Data
Measures of Central Tendency
Measures of Dispersion
Exploratory vs Confirmatory Data Analysis
Exploratory Data Analysis (EDA)
Descriptive Statistics
Graphical
Data driven
Confirmatory Data Analysis (CDA)
Inferential Statistics
EDA and theory driven
WHAT IS EDA?
the expected.
Another definition: An approach/philosophy for data analysis
rates (e.g. Are white women more likely to survive compared with
black women if both are at the same stage of disease?)
Graphical Displays of Data
Most of the statistical information in newspapers, magazines,
company reports and other publications consists of data that
are summarized and presented in a form that is easy for the
reader to understand.
Presentation of Qualitative Data
A graphic display can reveal at a glance the main characteristics of a
data set.
The choice of presentation depends on the nature of the data, whether it
is quantitative (e.g. income and CGPA) or qualitative (e.g. gender and
ethnic group).
Three types of graphs are used to display qualitative data:
bar graph / column chart
pie chart
line chart
Bar Chart
A bar chart displays a frequency distribution in graphical form. It has
two orthogonal axes: one represents the observations (categories) and the
other represents their frequencies. The frequency of each observation is
represented by a bar.
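As a rough illustration of the description above, the following Python sketch builds a frequency distribution and prints one text bar per category. The `responses` list is made-up data, not taken from the slides, and the text rendering only stands in for a real plotting library.

```python
from collections import Counter

# Made-up qualitative data (an assumption, not from the slides).
responses = ["F", "M", "F", "F", "M", "F", "M", "F", "F", "M"]

# Frequency distribution: how often each category occurs.
freq = Counter(responses)


def text_bar_chart(freq):
    """Return one line per category; the bar length equals the frequency."""
    return [f"{cat} | {'#' * n} {n}" for cat, n in sorted(freq.items())]


for line in text_bar_chart(freq):
    print(line)
```

One axis lists the categories and the other is the bar length, which mirrors the two-axis layout described above.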
Pie Chart
A pie chart displays a frequency distribution as the ratio of each
observation to the whole. It is a circle divided into sectors. Each
sector represents an observation, and the area of the sector represents
the proportion of the total frequency belonging to that observation.
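To make the sector idea concrete, here is a minimal Python sketch that converts frequencies into proportions and central angles. The `groups` data are invented for illustration; the only fact relied on is that a full circle spans 360 degrees.

```python
from collections import Counter

# Made-up qualitative data (an assumption, not from the slides).
groups = ["A", "B", "A", "C", "A", "B", "A", "C"]

freq = Counter(groups)
total = sum(freq.values())

# Each sector's share of the circle equals the category's relative
# frequency, so its central angle is that proportion of 360 degrees.
sectors = {
    cat: {"proportion": n / total, "angle_deg": 360 * n / total}
    for cat, n in freq.items()
}

for cat, s in sorted(sectors.items()):
    print(f"{cat}: {s['proportion']:.0%} of the data, {s['angle_deg']:.0f} degrees")
```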
Line Chart
A line chart displays the trend of observations. It has two orthogonal
axes: one represents the observations and the other represents their
frequencies. The frequencies of the observations are joined by lines.
Example: The table below shows the number of sandpipers recorded between
January 1989 and December 1989.
Presentation of Quantitative Data
A few graphs are available for presenting quantitative data
graphically:
Frequency polygon
Histogram
Ogive
Boxplot (Will be our focus in this chapter)
Histogram
A histogram looks like a bar chart, except that the horizontal axis represents
quantitative data. There are no gaps between the bars.
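The sketch below shows one way the counts behind a histogram might be computed in Python. The `marks` data, the starting limit and the class width are all assumptions chosen for illustration, not values from the slides.

```python
# Made-up quantitative data (an assumption, not from the slides).
marks = [52, 55, 61, 64, 67, 68, 70, 73, 75, 79, 81, 88]

lower_limit = 50   # lower boundary of the first class (assumed)
class_width = 10   # every class has the same width (assumed)
n_classes = 4      # classes: [50, 60), [60, 70), [70, 80), [80, 90)

counts = [0] * n_classes
for x in marks:
    index = (x - lower_limit) // class_width  # which class x falls into
    counts[index] += 1

# Adjacent classes share a boundary, which is why histogram bars touch.
for i, count in enumerate(counts):
    start = lower_limit + i * class_width
    print(f"[{start}, {start + class_width}) {'#' * count} {count}")
```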
Frequency Polygon
A frequency polygon looks like a line chart, except that the horizontal axis
represents the class marks of quantitative data.
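As a small illustration, the class mark can be taken as the midpoint of each class interval. The intervals and frequencies in this Python sketch are assumed values, not data from the slides.

```python
# Assumed class intervals and frequencies (illustrative only).
classes = [(50, 60), (60, 70), (70, 80), (80, 90)]
frequencies = [2, 4, 4, 2]

# The class mark is the midpoint of a class interval.
class_marks = [(lo + hi) / 2 for lo, hi in classes]

# A frequency polygon joins the points (class mark, frequency) with lines.
points = list(zip(class_marks, frequencies))
print(points)
```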
Ogive
An ogive is a line graph in which the horizontal axis represents the upper limits
of the class intervals and the vertical axis represents the cumulative frequencies.
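The points of an ogive can be sketched in Python by accumulating class frequencies. The class intervals and frequencies here are assumed for illustration and do not come from the slides.

```python
from itertools import accumulate

# Assumed class intervals and frequencies (illustrative only).
classes = [(50, 60), (60, 70), (70, 80), (80, 90)]
frequencies = [2, 4, 4, 2]

# Running total of the frequencies, one value per class.
cumulative = list(accumulate(frequencies))

# An ogive plots each class's upper limit against the cumulative frequency.
upper_limits = [hi for _, hi in classes]
ogive_points = list(zip(upper_limits, cumulative))
print(ogive_points)
```

The last cumulative frequency always equals the total number of observations, which is a quick sanity check on the table.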
Boxplot
The box plot (a.k.a. box and whisker diagram) is a standardized way of
displaying the distribution of data based on the five number summary:
minimum, first quartile, median, third quartile, and maximum.
Quartiles divide a data set into fourths, or four equal parts.
How to obtain Quartiles?
Q2 – Median
Q1 – Median of the values between the lowest value and Q2
Q3 – Median of the values between Q2 and the largest value
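The rule above can be sketched in Python as a median-of-halves calculation. One assumption is made that the slides do not state explicitly: when the number of observations is odd, the median itself is left out of both halves. Other quartile conventions exist and can give slightly different values.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # Assumption: for odd n the middle value is excluded from both halves.
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return median(lower), median(xs), median(upper)


print(quartiles([4, 6, 3, 1, 2, 5, 7, 3]))  # even number of observations
print(quartiles([1, 2, 3, 4, 5, 6, 7]))     # odd number of observations
```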
Examples
Odd set of numbers
IQR = Q3 - Q1
Step 3: Determine the fences. Fences serve as cut-off points for
determining outliers.
Lower Fence = Q1 - 1.5(IQR)
Upper Fence = Q3 + 1.5(IQR)
Step 4: If a data value is less than the lower fence or greater than the
upper fence, it is considered an outlier.
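The fence rule can be sketched in Python as follows. The function names and the `values` list are illustrative assumptions, and Q1 and Q3 are passed in as given numbers rather than computed from the list.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) from the first and third quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values, q1, q3):
    """Return the values lying outside the fences."""
    lower, upper = fences(q1, q3)
    return [x for x in values if x < lower or x > upper]


# Illustrative values; Q1 = 300 and Q3 = 600 are assumed, not computed.
values = [-700, -25, 300, 400, 600, 1000, 1700]
lower, upper = fences(300, 600)
outliers = find_outliers(values, 300, 600)

print(lower, upper)  # -150.0 1050.0
print(outliers)      # [-700, 1700]
```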
How to:
Boxplot
Interpretation
The Basics
Minimum = -25
First Quartile = 300
Second Quartile / Median = 400
Third Quartile = 600
Maximum = 1000
Interpreting Boxplot
Range
In the boxplot above, data values ranged from about -700 (the smallest
outlier) to 1700 (the largest outlier), so the range is 2400. If you
ignore outliers, the range is illustrated by the distance between the
opposite ends of the whiskers - about 1000 in the boxplot above.
Interquartile Range (IQR)
In the boxplot above, the interquartile range is Q3 - Q1 = 600 - 300 = 300.
Based on Q1, 25% of the data have a value below 300, or equivalently
75% of the data have a value above 300.
Based on Q2, half of the data have a value less than 400.
Based on Q3, 75% of the data have a value below 600, or equivalently
25% of the data have a value above 600.
Shape of the data
Boxplots often provide information about the shape of a data set. The
examples below show some common patterns.
Mean
The sample mean is denoted by $\bar{x}$.
The mean of a sample is the sum of the measurements divided by the
number of measurements in the set, $n$:
$$\bar{x} = \frac{\sum x}{n}$$
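A minimal Python sketch of this definition follows. The data are the eight observations used in the median example in these slides; the function name `sample_mean` is an illustrative choice.

```python
def sample_mean(values):
    """Sum of the measurements divided by the number of measurements."""
    return sum(values) / len(values)


data = [4, 6, 3, 1, 2, 5, 7, 3]
mean = sample_mean(data)
print(mean)  # 31 / 8 = 3.875
```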
Measures of Central Tendency
Example
Median
Median is the middle value of a set of observations arranged in order
of magnitude and is normally denoted by $\tilde{x}$.
The median depends on the number of observations in the data, $n$.
- If $n$ is odd, the median is the $\frac{n+1}{2}$th observation of the
ordered observations.
- If $n$ is even, the median is the arithmetic mean of the $\frac{n}{2}$th
observation and the $\left(\frac{n}{2} + 1\right)$th observation.
Example
The median of the data 4, 6, 3, 1, 2, 5, 7, 3 is 3.5.
Rearranged in order of magnitude, the data become 1, 2, 3, 3, 4, 5, 6, 7.
As n = 8 (even), the median is the mean of the 4th and 5th observations, that
is (3 + 4)/2 = 3.5.
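The positional rule for the median can be checked with a short Python sketch. The function name is an illustrative choice, and the result is compared with the standard library's `statistics.median` as a cross-check.

```python
from statistics import median


def median_by_position(values):
    """Median from the ordered observations, using 1-based positions."""
    xs = sorted(values)
    n = len(xs)
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th observation (index shifted to 0-based).
        return xs[(n + 1) // 2 - 1]
    # Even n: mean of the (n / 2)th and (n / 2 + 1)th observations.
    return (xs[n // 2 - 1] + xs[n // 2]) / 2


data = [4, 6, 3, 1, 2, 5, 7, 3]
print(median_by_position(data))  # 3.5
print(median(data))              # 3.5, same answer from the standard library
```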
Mode
The mode of a set of observations is the observation with the highest
frequency and is usually denoted by x̂. The mode can also be used to
describe qualitative data.
The mode has the advantage that it is easy to calculate and is not
affected by extreme values.
However, a mode may not exist, and even if it does exist, it may not be
unique.
If two measurements in a data set share the highest frequency, both
are taken as modes and the data are known as bimodal.
If more than two measurements share the highest frequency, the data
can be regarded as having no mode.
Example:
The mode for the observations 4,6,3,1,2,5,7,3 is 3.
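The convention described above can be sketched in Python as follows. The function name `modes_by_slide_rule` is hypothetical, and treating more than two tied values as "no mode" follows these slides rather than a universal rule. The sketch assumes a non-empty data set.

```python
from collections import Counter


def modes_by_slide_rule(values):
    """Return the mode(s), or an empty list when more than two values tie."""
    counts = Counter(values)
    top = max(counts.values())
    tied = sorted(v for v, c in counts.items() if c == top)
    # One value at the top frequency: a single mode.
    # Two values: bimodal. More than two: treated as having no mode.
    return tied if len(tied) <= 2 else []


print(modes_by_slide_rule([4, 6, 3, 1, 2, 5, 7, 3]))  # [3]
print(modes_by_slide_rule([1, 1, 2, 2, 3]))           # [1, 2], bimodal
print(modes_by_slide_rule([1, 2, 3]))                 # [], no mode
```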
Measures of Dispersion
scattered.
The common measures of dispersion are variance and
standard deviation.
The standard deviation actually is the square root of the
variance.
The sample variance is denoted by $s^2$ and the sample standard
deviation is denoted by $s$.
Range
Range is the simplest measure of dispersion to calculate.
Range = Largest value – Smallest value
Example:
Variance
The variance of a sample (also known as the mean square) for raw (ungrouped) data
is denoted by $s^2$ and defined by:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
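A minimal Python sketch of the sample variance formula follows. Since the data for the slide's own example are not available here, it reuses the eight observations from the median example; the function name is an illustrative choice, and `statistics.variance` is printed as a cross-check.

```python
from statistics import variance


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


data = [4, 6, 3, 1, 2, 5, 7, 3]
s2 = sample_variance(data)
print(s2)              # 28.875 / 7 = 4.125
print(variance(data))  # the standard library uses the same n - 1 divisor
```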
Example (using previous data in Range):
Standard deviation
It is simply the square root of the variance.
Example
Population variance:
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
Sample variance:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
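To contrast the two divisors, the Python sketch below computes the population and sample versions side by side and takes square roots to get the standard deviations. The data are reused from the median example for illustration; treating them as a whole population in the first case is an assumption made only to show the arithmetic.

```python
from math import sqrt
from statistics import pstdev, stdev

data = [4, 6, 3, 1, 2, 5, 7, 3]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean, shared by both formulas.
squared_deviations = sum((x - mean) ** 2 for x in data)

population_variance = squared_deviations / n    # divisor N
sample_var = squared_deviations / (n - 1)       # divisor n - 1

# The standard deviation is the square root of the variance.
population_sd = sqrt(population_variance)
sample_sd = sqrt(sample_var)

print(population_variance, sample_var)
print(population_sd, pstdev(data))  # cross-check with the standard library
print(sample_sd, stdev(data))
```

The sample version is always the larger of the two, because the same sum is divided by a smaller number.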
How to interpret Standard deviation?