M2. Understanding A Data Set II
Presented by
Aung Kay Tu, MBBS, DTM&H, MCTM, PhD
Variance and Standard Deviation
Variance: a measure of how data points differ
from the mean
• Data Set 1: 3, 5, 7, 10, 10
Data Set 2: 7, 7, 7, 7, 7
Both data sets have the same mean (7), but we know that the two data sets are not identical! The variance shows how they
differ.
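The deviations from the mean always sum to zero, so averaging them does not help:
$$\frac{\sum (x - \bar{X})}{N} = 0$$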
• We could just drop the negative signs, which is mathematically the same as taking the
absolute value of each deviation. The average of these absolute deviations is known as the mean deviation.
• The average of the squared deviations about the mean is called the variance.
For population variance: $$\sigma^2 = \frac{\sum (x - \bar{X})^2}{N}$$
For sample variance: $$s^2 = \frac{\sum (x - \bar{X})^2}{n - 1}$$
Score     X     $X - \bar{X}$     $(X - \bar{X})^2$
1         3     3 - 7 = -4        16
2         5     5 - 7 = -2        4
3         7     7 - 7 = 0         0
4         10    10 - 7 = 3        9
5         10    10 - 7 = 3        9
Totals    35                      38
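Treating the five scores as the whole population (dividing by N):
$$\sigma^2 = \frac{\sum (x - \bar{X})^2}{N} = \frac{38}{5} = 7.6$$
Treating them as a sample (dividing by n - 1):
$$s^2 = \frac{\sum (x - \bar{X})^2}{n - 1} = \frac{38}{4} = 9.5$$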
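The worked table above can be checked with a short Python sketch. It is only an illustration using the standard `statistics` module; the variable names are assumptions made for this example and do not come from the slides. It shows both divisors (N and n - 1) so the two variance formulas can be compared.

```python
from statistics import mean, pvariance, variance

scores = [3, 5, 7, 10, 10]                 # Data Set 1 from the slides
x_bar = mean(scores)                       # mean of the scores

deviations = [x - x_bar for x in scores]   # -4, -2, 0, 3, 3
squared = [d ** 2 for d in deviations]     # 16, 4, 0, 9, 9

sum_of_deviations = sum(deviations)        # deviations cancel out
mean_deviation = sum(abs(d) for d in deviations) / len(scores)
sum_of_squares = sum(squared)              # total of the last column

population_variance = sum_of_squares / len(scores)      # divide by N
sample_variance = sum_of_squares / (len(scores) - 1)    # divide by n - 1

print("sum of deviations:", sum_of_deviations)
print("mean deviation:", mean_deviation)
print("sum of squares:", sum_of_squares)
print("population variance:", population_variance)
print("sample variance:", sample_variance)
```

The library functions `pvariance(scores)` and `variance(scores)` should give the same two variances.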
Example 2
mean 23 23
median 22 27
range 10 22
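• sample standard deviation: $s = \sqrt{\frac{\sum (x - \bar{X})^2}{n - 1}}$
• population standard deviation: $\sigma = \sqrt{\frac{\sum (x - \bar{X})^2}{N}}$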
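As a rough illustration, the standard deviation is the square root of the variance. The sketch below reuses Data Set 1 (3, 5, 7, 10, 10) and compares a hand calculation with Python's `statistics.stdev` (sample) and `statistics.pstdev` (population). The variable names are assumptions for this example.

```python
from math import sqrt
from statistics import pstdev, stdev

scores = [3, 5, 7, 10, 10]
n = len(scores)
x_bar = sum(scores) / n
sum_of_squares = sum((x - x_bar) ** 2 for x in scores)

sample_sd = sqrt(sum_of_squares / (n - 1))    # divides by n - 1
population_sd = sqrt(sum_of_squares / n)      # divides by N

print("sample SD:", round(sample_sd, 3))
print("population SD:", round(population_sd, 3))
print("library values:", round(stdev(scores), 3), round(pstdev(scores), 3))
```

The sample value is always a little larger than the population value because it divides by a smaller number.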
Find area
Length = 10 cm
Breadth = 3.5 cm
Length = 10 cm
Breadth = 10 cm
Find area under the curve
Bell shaped curve
• About 68% of all scores fall within 1 standard deviation of the mean
• About 95% of all scores fall within 2 standard deviations of the mean
• About 99.7% of all scores fall within 3 standard deviations of the mean
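The three percentages above are areas under the normal curve. A minimal sketch, assuming a standard normal curve (mean 0, standard deviation 1) and using `statistics.NormalDist` from the Python standard library, recovers them. The helper name `area_within` is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the standard normal curve between -k and +k SDs."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {area_within(k) * 100:.1f}%")
```

The same areas apply to any normal curve, because the mean and standard deviation only shift and stretch it.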
z Scores, and the Normal Curve
• Z-distribution – normal distribution of standardized scores
• So what are z-scores?
• The number of standard deviations a particular score lies away from the mean
• Can be positive or negative
• Positive = above mean
• Negative = below mean
$$z = \frac{X - \mu}{\sigma}$$
The z Distribution
Transforming Raw Scores to z Scores
• Step 1: Subtract the mean of the population from the raw score
• Step 2: Divide by the standard deviation of the population
1.96
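The two transformation steps can be sketched in Python as follows. The population mean of 100 and standard deviation of 15 are hypothetical values chosen only for illustration, and `z_score` is a made-up helper name. The last line shows where 1.96 comes from: it is the z score that bounds the central 95% of a normal curve, leaving 2.5% in each tail.

```python
from statistics import NormalDist

def z_score(raw_score, population_mean, population_sd):
    """Step 1: subtract the mean. Step 2: divide by the standard deviation."""
    return (raw_score - population_mean) / population_sd

# Hypothetical population: mean 100, standard deviation 15
z_above = z_score(130, 100, 15)   # positive: above the mean
z_below = z_score(85, 100, 15)    # negative: below the mean

# z score with 97.5% of the area to its left (2.5% in the upper tail)
z_95 = NormalDist().inv_cdf(0.975)

print(z_above, z_below, round(z_95, 2))
```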
$$Sk = \frac{\text{Mean} - \text{Mode}}{SD}$$
Since the mode does not exist in some cases, we can use the empirical relation Mean - Mode ≈ 3 (Mean - Median) and write:
$$Sk = \frac{3(\text{Mean} - \text{Median})}{SD}$$
(it ranges between -3 and +3)
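A small sketch of both skewness coefficients, applied to Data Set 1 (3, 5, 7, 10, 10). The slides do not say which standard deviation "SD" refers to, so the sample standard deviation (`statistics.stdev`) is an assumption here, and the function names are made up for this example.

```python
from statistics import mean, median, mode, stdev

def pearson_mode_skewness(data):
    """Sk = (Mean - Mode) / SD, using the sample SD (an assumption)."""
    return (mean(data) - mode(data)) / stdev(data)

def pearson_median_skewness(data):
    """Sk = 3 * (Mean - Median) / SD, using the sample SD (an assumption)."""
    return 3 * (mean(data) - median(data)) / stdev(data)

scores = [3, 5, 7, 10, 10]        # mean 7, median 7, mode 10
sk_mode = pearson_mode_skewness(scores)
sk_median = pearson_median_skewness(scores)

print("mode-based Sk:", round(sk_mode, 3))
print("median-based Sk:", round(sk_median, 3))
```

For this tiny data set the two coefficients disagree, which is a reminder that the relation between mean, median and mode is only approximate.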
Application
• If the data are skewed, the results of an analysis may be distorted.
• To use skewed data, we can apply a log transformation to the whole set of values
(which must all be positive) to reveal patterns in the data and make
it usable for the statistical model.
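The effect of a log transformation can be illustrated with a short sketch. The values in `raw` are hypothetical, strictly positive, right-skewed numbers invented for this example, and skewness is measured with the median-based coefficient 3 (Mean - Median) / SD using the sample standard deviation (an assumption).

```python
from math import log
from statistics import mean, median, stdev

def pearson_median_skewness(data):
    """Sk = 3 * (Mean - Median) / SD, using the sample SD (an assumption)."""
    return 3 * (mean(data) - median(data)) / stdev(data)

# Hypothetical right-skewed values; all must be positive to take logs
raw = [1, 2, 2, 3, 4, 5, 8, 15, 40, 100]
logged = [log(x) for x in raw]

sk_raw = pearson_median_skewness(raw)
sk_logged = pearson_median_skewness(logged)

print("skewness before log:", round(sk_raw, 2))
print("skewness after log:", round(sk_logged, 2))
```

In this example the transformed values are still somewhat skewed, but noticeably less so than the raw values.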
Karl Pearson (1857-1938)
Kurtosis
• Karl Pearson introduced the term kurtosis (literally the amount of hump)
for the degree of peakedness or flatness of a unimodal curve.
• When the kurtosis is greater than 3, the curve is more sharply peaked and has heavier tails
than the normal curve and is said to be leptokurtic.
• When it is less than 3, the curve has a flatter top and lighter tails than the
normal curve and is said to be platykurtic.
$$\text{Kurt} = \frac{Q.D}{P_{90} - P_{10}}$$
Where,
Q.D is the semi-interquartile range = (Q3 - Q1)/2
P90 = 90th percentile
P10 = 10th percentile
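The percentile formula above can be sketched with `statistics.quantiles` from the Python standard library. The function name `percentile_kurtosis` and the example data (the integers 1 to 100) are assumptions for illustration. Percentiles can be defined in several slightly different ways; this sketch uses the library's default method. For a normal curve this coefficient is about 0.263, which the last lines check using the theoretical normal percentiles.

```python
from statistics import NormalDist, quantiles

def percentile_kurtosis(data):
    """Kurt = Q.D / (P90 - P10), with Q.D = (Q3 - Q1) / 2."""
    q1, _, q3 = quantiles(data, n=4)       # quartiles
    deciles = quantiles(data, n=10)        # P10, P20, ..., P90
    p10, p90 = deciles[0], deciles[-1]
    return ((q3 - q1) / 2) / (p90 - p10)

# Hypothetical evenly spread data: the integers 1 to 100
kurt_uniform = percentile_kurtosis(list(range(1, 101)))

# Reference value from the theoretical normal curve
normal = NormalDist()
qd_normal = (normal.inv_cdf(0.75) - normal.inv_cdf(0.25)) / 2
kurt_normal = qd_normal / (normal.inv_cdf(0.90) - normal.inv_cdf(0.10))

print(round(kurt_uniform, 4), round(kurt_normal, 4))
```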
Application
• We can evaluate the normality of a variable by using skewness and
kurtosis.
• A kurtosis with an absolute value greater than 10 is considered problematic.
• Kurtosis is a useful measure of whether there is a problem with outliers in a data
set.
• Larger kurtosis indicates a more serious outlier problem, and alternative
statistical methods may be needed.
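To illustrate the link between kurtosis and outliers, the sketch below uses the moment coefficient of kurtosis (the fourth central moment divided by the squared variance), which equals 3 for a normal curve. Treating this as the measure behind the thresholds above is an assumption, and both data sets are hypothetical. The only difference between them is one extreme value.

```python
def moment_kurtosis(data):
    """Fourth central moment divided by the squared (population) variance."""
    n = len(data)
    m = sum(data) / n
    m2 = sum((x - m) ** 2 for x in data) / n
    m4 = sum((x - m) ** 4 for x in data) / n
    return m4 / m2 ** 2

clean = [4, 5, 5, 6, 6, 6, 7, 7, 8]            # hypothetical, no outlier
with_outlier = [4, 5, 5, 6, 6, 6, 7, 7, 30]    # last value replaced by 30

kurt_clean = moment_kurtosis(clean)
kurt_outlier = moment_kurtosis(with_outlier)

print(round(kurt_clean, 2), round(kurt_outlier, 2))
```

A single extreme score is enough to raise the kurtosis sharply, which is why a large value is a prompt to inspect the data for outliers.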