
M2. Understanding A Data Set II

This document discusses variance, standard deviation, and other statistical measures of variability and distribution of data. It defines variance and standard deviation, and how to calculate them. It explains how variance and standard deviation can quantify how dispersed a data set is compared to the mean. The document also discusses the normal distribution and z-scores, how to transform raw scores to z-scores, and what percentages of scores fall within certain standard deviations of the mean in a normal distribution. Finally, it defines skewness and kurtosis as other measures of the shape of a distribution, and how to calculate and interpret them.

Uploaded by

MYo Oo

M2. Understanding a data set (II)

Presented by
Aung Kay Tu, MBBS, DTM&H, MCTM, PhD
Variance and Standard Deviation
Variance: a measure of how data points differ from the mean
• Data Set 1: 3, 5, 7, 10, 10
• Data Set 2: 7, 7, 7, 7, 7

What are the mean and median of each data set?

Data Set 1: mean = 7, median = 7


Data Set 2: mean = 7, median = 7
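
The two summaries can be checked with a short sketch using Python's standard `statistics` module. The variable names are illustrative and not part of the slides.

```python
from statistics import mean, median

data_set_1 = [3, 5, 7, 10, 10]
data_set_2 = [7, 7, 7, 7, 7]

# Both data sets share the same centre, even though their spread differs.
for name, data in [("Data Set 1", data_set_1), ("Data Set 2", data_set_2)]:
    print(name, "mean =", mean(data), "median =", median(data))
```

Both lines report a mean of 7 and a median of 7, so neither statistic distinguishes the two sets.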

But we know that the two data sets are not identical! The variance shows how they differ.

We want a way to represent the difference between these two data sets numerically.


How to Calculate?
• If we conceptualize the spread of a distribution as the extent to which
the values in the distribution differ from the mean and from each other,
then a reasonable measure of spread might be the average deviation, or
difference, of the values from the mean.

( x  X )
N
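
A quick check on Data Set 1 suggests why this average cannot be used as it stands: the signed deviations cancel, so the result is 0 for any data set. The sketch below, with illustrative function names that are not from the slides, shows this and also the value obtained when absolute deviations are averaged instead.

```python
def average_deviation(values):
    """Average of the signed deviations (x - mean)."""
    m = sum(values) / len(values)
    return sum(x - m for x in values) / len(values)


def mean_absolute_deviation(values):
    """Average of the absolute deviations |x - mean|."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)


data_set_1 = [3, 5, 7, 10, 10]
print(average_deviation(data_set_1))        # 0.0
print(mean_absolute_deviation(data_set_1))  # 2.4
```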
• We could just drop the negative signs, which is mathematically the same as taking absolute values; the resulting average is known as the mean deviation.
• The average of the squared deviations about the mean is called the variance.

For population variance: $\sigma^2 = \frac{\sum (x - \bar{X})^2}{N}$

For sample variance: $s^2 = \frac{\sum (x - \bar{X})^2}{n - 1}$
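| Score | $X$ | $X - \bar{X}$ | $(X - \bar{X})^2$ |
|---|---|---|---|
| 1 | 3 | | |
| 2 | 5 | | |
| 3 | 7 | | |
| 4 | 10 | | |
| 5 | 10 | | |
| Totals | 35 | | |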

The mean is 35/5=7.


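| Score | $X$ | $X - \bar{X}$ | $(X - \bar{X})^2$ |
|---|---|---|---|
| 1 | 3 | 3 - 7 = -4 | |
| 2 | 5 | 5 - 7 = -2 | |
| 3 | 7 | 7 - 7 = 0 | |
| 4 | 10 | 10 - 7 = 3 | |
| 5 | 10 | 10 - 7 = 3 | |
| Totals | 35 | | |
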
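| Score | $X$ | $X - \bar{X}$ | $(X - \bar{X})^2$ |
|---|---|---|---|
| 1 | 3 | 3 - 7 = -4 | 16 |
| 2 | 5 | 5 - 7 = -2 | 4 |
| 3 | 7 | 7 - 7 = 0 | 0 |
| 4 | 10 | 10 - 7 = 3 | 9 |
| 5 | 10 | 10 - 7 = 3 | 9 |
| Totals | 35 | | 38 |
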

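Treating the five scores as the whole population (dividing by N = 5):

$\sigma^2 = \frac{\sum (x - \bar{X})^2}{N} = \frac{38}{5} = 7.6$

With the sample formula (dividing by n - 1 = 4): $s^2 = \frac{38}{4} = 9.5$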
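
The worked table can be reproduced in a few lines. This is a minimal sketch with illustrative variable names; it also cross-checks the hand calculation against `pvariance` and `variance` from Python's `statistics` module. Dividing the sum of squares by the number of scores gives 7.6, while dividing by n - 1 gives 9.5.

```python
from statistics import pvariance, variance

scores = [3, 5, 7, 10, 10]
n = len(scores)
mean_score = sum(scores) / n                      # 35 / 5 = 7.0

deviations = [x - mean_score for x in scores]     # X minus the mean
squared = [d ** 2 for d in deviations]            # squared deviations
sum_of_squares = sum(squared)                     # 38.0

population_variance = sum_of_squares / n          # divide by N
sample_variance = sum_of_squares / (n - 1)        # divide by n - 1

for x, d, sq in zip(scores, deviations, squared):
    print(f"{x:>3} {d:>6.1f} {sq:>6.1f}")
print("sum of squares:", sum_of_squares)
print("population variance:", population_variance, "library:", pvariance(scores))
print("sample variance:", sample_variance, "library:", variance(scores))
```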
Example 2

| No. | Data Set A | Data Set B |
|---|---|---|
| 1 | 28 | 27 |
| 2 | 22 | 27 |
| 3 | 21 | 28 |
| 4 | 26 | 6 |
| 5 | 18 | 27 |

Find the mean, median, mode and range.

| | Data Set A | Data Set B |
|---|---|---|
| mean | 23 | 23 |
| median | 22 | 27 |
| mode | none | 27 |
| range | 10 | 22 |
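
These summary values can be checked with a short sketch. The `summarize` helper is hypothetical, and it adopts the convention (an assumption, not stated on the slide) that a data set in which no value repeats has no mode.

```python
from collections import Counter
from statistics import mean, median

data_a = [28, 22, 21, 26, 18]
data_b = [27, 27, 28, 6, 27]


def summarize(values):
    """Return mean, median, modes and range for a list of numbers."""
    counts = Counter(values)
    top = max(counts.values())
    # Convention assumed here: if no value repeats, report no mode.
    modes = [] if top == 1 else [v for v, c in counts.items() if c == top]
    return {
        "mean": mean(values),
        "median": median(values),
        "modes": modes,
        "range": max(values) - min(values),
    }


print("A:", summarize(data_a))
print("B:", summarize(data_b))
```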

What can be said about this data?

Because of the outlier (6) in Data Set B, the median is more typical of overall performance than the mean.

Which data set is more consistent?


Standard deviation: a measure of variation of scores about the mean

• A higher standard deviation indicates higher spread, less consistency, and less clustering.

• sample standard deviation: $s = \sqrt{\frac{\sum (x - \bar{X})^2}{n - 1}}$

• population standard deviation: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
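
To see which of the two data sets from Example 2 is more consistent, the sketch below applies both formulas using `stdev` (sample) and `pstdev` (population) from Python's `statistics` module. The variable names are illustrative.

```python
from statistics import pstdev, stdev

data_a = [28, 22, 21, 26, 18]
data_b = [27, 27, 28, 6, 27]

for name, data in [("A", data_a), ("B", data_b)]:
    # stdev divides by n - 1, pstdev divides by N.
    print(f"Data Set {name}: sample SD = {stdev(data):.2f}, "
          f"population SD = {pstdev(data):.2f}")
```

Data Set A has a sample standard deviation of 4.0, while Data Set B comes out at roughly 9.5, so A is the more consistent set even though both have the same mean.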
Find area
Length = 10 cm
Breadth= 3.5 cm

Length = 10 cm
Breadth= 10 cm
Find area under the curve
Bell shaped curve
• ≈ 68% of all scores fall within 1 standard deviation of the mean
• ≈ 95% of all scores fall within 2 standard deviations of the mean
• ≈ 99.7% of all scores fall within 3 standard deviations of the mean
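
These percentages can be verified numerically. The sketch below uses `NormalDist` from Python's `statistics` module; the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def share_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
```

The printed proportions are 0.6827, 0.9545 and 0.9973, which round to the 68%, 95% and 99.7% quoted above.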
z Scores, and the Normal Curve
• Z-distribution – normal distribution of standardized scores
• So what are z-scores?
• The number of standard deviations a particular score lies from the mean
• Can be positive or negative
• Positive = above mean
• Negative = below mean

$z = \frac{X - \mu}{\sigma}$
The z Distribution
Transforming Raw Scores to z Scores
• Step 1: Subtract the mean of the population from the raw score
• Step 2: Divide by the standard deviation of the population
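
The two steps can be sketched as a small function. The population mean of 100 and standard deviation of 15 used below are hypothetical values chosen for illustration; they do not come from the slides.

```python
def z_score(raw, population_mean, population_sd):
    """Transform a raw score into a z score."""
    if population_sd <= 0:
        raise ValueError("population_sd must be positive")
    deviation = raw - population_mean   # Step 1: subtract the mean
    return deviation / population_sd    # Step 2: divide by the SD


# Hypothetical population: mean 100, standard deviation 15.
print(z_score(130, 100, 15))  # 2.0  (two SDs above the mean)
print(z_score(85, 100, 15))   # -1.0 (one SD below the mean)
```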
[Figure: normal curve; the value 1.96 is marked]

The normal distribution is symmetric.
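
Assuming the 1.96 refers to the z value that bounds the central 95% of a standard normal distribution (a reading of the slide, not something it states), the sketch below shows how symmetry yields matching cut-offs on both sides. It uses `NormalDist` from Python's `statistics` module.

```python
from statistics import NormalDist

std_normal = NormalDist()

# Leaving 2.5% in each tail gives cut-offs that mirror each other.
upper = std_normal.inv_cdf(0.975)
lower = std_normal.inv_cdf(0.025)
central = std_normal.cdf(1.96) - std_normal.cdf(-1.96)

print(round(upper, 2), round(lower, 2))  # 1.96 -1.96
print(round(central, 3))                 # 0.95
```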


Skewness and Kurtosis
Skewness
• Skewness describes how the sample differs in shape from a
symmetrical distribution.
• A normal distribution has a skewness of 0; a right-skewed distribution has a skewness greater than 0 and a left-skewed distribution has a skewness less than 0.
• In a normal distribution, where skewness is 0, the mean, median and mode are equal.
• In a negatively skewed distribution, mode > median > mean.
• Positively skewed distributions occur when most of the scores are toward the low end of the distribution.
• In a positively skewed distribution, mode < median < mean.
Skewness
When the distribution is symmetric, the value of skewness should be zero.
Karl Pearson defined the coefficient of skewness as:

$Sk = \frac{\text{Mean} - \text{Mode}}{SD}$
Since in some cases the mode does not exist, we can use the empirical relation Mean - Mode ≈ 3(Mean - Median) and write:

$Sk = \frac{3(\text{Mean} - \text{Median})}{SD}$

(it ranges between -3 and +3)
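
As an illustration, both coefficients can be computed for Data Set B from Example 2. The sketch uses the standard form 3(Mean - Median)/SD for the median-based coefficient, so that its sign agrees with (Mean - Mode)/SD. Using the sample standard deviation is an assumption, since the slides do not say which SD is meant.

```python
from statistics import mean, median, mode, stdev

data_b = [27, 27, 28, 6, 27]

m = mean(data_b)        # 23
med = median(data_b)    # 27
mo = mode(data_b)       # 27
sd = stdev(data_b)      # sample SD (assumed), about 9.51

sk_mode = (m - mo) / sd           # mode-based coefficient
sk_median = 3 * (m - med) / sd    # median-based coefficient

print(round(sk_mode, 2))    # -0.42
print(round(sk_median, 2))  # -1.26
```

Both values are negative, which matches the low outlier (6) pulling the mean below the median and mode.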
Application
• Skewed data can distort results, particularly for methods that assume a roughly normal distribution.
• To use right-skewed data, a common remedy is to apply a log transformation to the whole set of values (which must all be positive) to reveal patterns in the data and make it usable for the statistical model.
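
The effect of a log transformation can be sketched on a small, hypothetical right-skewed data set. The data and the `skewness` helper are illustrative only; the helper uses the moment-based definition (third central moment divided by the 1.5 power of the second), which is an assumption and differs from the Pearson coefficients above.

```python
import math


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (0 for symmetric data)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed data; all values are positive so logs exist.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
logged = [math.log(x) for x in raw]

print(f"skewness before log: {skewness(raw):.2f}")
print(f"skewness after log:  {skewness(logged):.2f}")
```

For this example the skewness after the transformation is noticeably smaller, although it does not reach zero.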
Karl Pearson (1857-1936)
Kurtosis
• Karl Pearson introduced the term Kurtosis (literally the amount of hump)
for the degree of peakedness or flatness of a unimodal curve.

When the peak of a curve becomes relatively high then that curve is called Leptokurtic.

When the curve is flat-topped, then it is called Platykurtic.

Since the normal curve is neither very peaked nor very flat-topped, it is taken as a basis for comparison.

The normal curve is called Mesokurtic.


Kurtosis
• For a normal distribution, kurtosis is equal to 3.

• When kurtosis is greater than 3, the curve is more sharply peaked and has heavier tails than the normal curve and is said to be leptokurtic.

• When it is less than 3, the curve has a flatter top and lighter tails than the normal curve and is said to be platykurtic.
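
The slides do not give a formula for this kurtosis value. The sketch below assumes the common moment-based definition, the fourth central moment divided by the square of the second, for which a normal distribution gives 3. It is applied to the two data sets from Example 2; with only five scores each, the results are purely illustrative.

```python
def kurtosis(values):
    """Moment-based kurtosis: m4 / m2**2 (equals 3 for a normal curve)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    return m4 / m2 ** 2


data_a = [28, 22, 21, 26, 18]
data_b = [27, 27, 28, 6, 27]   # contains the outlier 6

print(round(kurtosis(data_a), 2))  # 1.65
print(round(kurtosis(data_b), 2))  # 3.24
```

Data Set A falls below 3, while Data Set B, with its single extreme score, rises above 3.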
Kurtosis

Another measure of kurtosis, known as the percentile coefficient of kurtosis, is:

$\text{Kurt} = \frac{Q.D.}{P_{90} - P_{10}}$

where
Q.D. is the semi-interquartile range = (Q3 - Q1)/2
P90 = 90th percentile
P10 = 10th percentile
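
A minimal sketch of this coefficient follows. The helper name `percentile_kurtosis` is hypothetical, and the choice of the "inclusive" percentile method in `statistics.quantiles` is an assumption, since the slides do not say how percentiles are to be computed. The theoretical value for a normal curve is obtained from `NormalDist`.

```python
from statistics import NormalDist, quantiles


def percentile_kurtosis(values):
    """Semi-interquartile range divided by (P90 - P10)."""
    # 99 cut points: cuts[0] is P1, cuts[98] is P99.
    cuts = quantiles(values, n=100, method="inclusive")
    p10, q1, q3, p90 = cuts[9], cuts[24], cuts[74], cuts[89]
    return ((q3 - q1) / 2) / (p90 - p10)


# Theoretical value for a normal curve.
std_normal = NormalDist()
qd = (std_normal.inv_cdf(0.75) - std_normal.inv_cdf(0.25)) / 2
normal_value = qd / (std_normal.inv_cdf(0.90) - std_normal.inv_cdf(0.10))
print(round(normal_value, 3))  # 0.263

# Evenly spread (flat) data: the integers 1 to 100.
print(round(percentile_kurtosis(list(range(1, 101))), 4))  # 0.3125
```

The normal curve gives about 0.263, which can serve as the reference value when interpreting this coefficient.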
Application
• We can evaluate the normality of a variable by using skewness and kurtosis.
• As a rule of thumb, a kurtosis with an absolute value greater than 10 is problematic.
• Kurtosis is a useful measure of whether there is a problem with outliers in a data set.
• Larger kurtosis indicates a more serious outlier problem, and alternative statistical methods may be needed.
