TRISEM14-2021-22 BMT5113 TH VL2021220200049 Reference Material I 04-Aug-2021 Measures of Dispersion
Statistics is the study of how to gather, review, and analyze data and draw conclusions from it.
Its two major areas are descriptive and inferential statistics.
Descriptive statistics focus on the central tendency, variability, and distribution of sample data.
Central tendency describes a typical element of a sample or population and
includes the mean, median, and mode.
Variability refers to a set of statistics that show how much the elements of a
sample or population differ from one another and includes metrics such as range, variance, and standard deviation.
Distribution refers to the overall "shape" of the data, which can be depicted on a chart such as a
histogram or dot plot, and includes properties such as the probability distribution function, skewness,
and kurtosis.
Inferential statistics are tools used to draw conclusions about the characteristics of a population from
the characteristics of a sample and to decide how reliable those
conclusions are.
Based on the sample size and the distribution of the sample data, we can calculate the probability that
sample statistics give an accurate picture of the corresponding parameters of the population from
which the sample is drawn.
Inferential statistics are used to make generalizations about large groups, such as estimating average
demand for a product by surveying a sample of consumers' buying habits, or to attempt to predict
future events, such as projecting the future return of a security or asset class based on returns in a
sample period.
Regression analysis is a common method of statistical inference that attempts to determine the strength
and character of the relationship (or correlation) between one dependent variable (usually denoted
by Y) and a series of other variables (known as independent variables).
Measures of Dispersion
A measure of variability is a summary statistic that represents the amount of dispersion in a dataset.
How spread out are the values?
While a measure of central tendency describes the typical value, measures of variability define how far
away the data points tend to fall from the center.
A low dispersion indicates that the data points tend to be clustered tightly around the center.
High dispersion means they tend to fall further away.
In statistics, variability, dispersion, and spread are synonyms that denote the width of the
distribution.
Just as there are multiple measures of central tendency, there are several measures of variability.
Let us look at the following two data sets.
Let us plot the two data sets, I and II.
Two sets of ten measurements each center at the same value: they both have mean, median, and
mode 40.
In Data Set I the measurements vary only slightly from the center, while for Data Set II the
measurements vary greatly.
Let us associate to each data set numbers that measure quantitatively how the data either scatter
away from the center or cluster close to it.
These quantities are called measures of variability.
Low variability is ideal because it means that we can predict information about the population
based on sample data.
High variability means that the values are less consistent, so it’s harder to make predictions.
Understanding variability helps us to grasp the likelihood of unusual events.
Range
It is the most straightforward measure of variability to calculate and the simplest to understand.
The range of a dataset is the difference between the largest and smallest values in that dataset.
E.g., in the two datasets below, dataset 1 has a range of 38 – 20 = 18 and dataset 2 has a range of
52 – 11 = 41.
Dataset 2 has a broader range and more variability than dataset 1.
Coefficient of Range = (Large – Small)/(Large + Small)
The range is based on only the two most extreme values in the dataset, which makes it very susceptible to outliers.
If one of those numbers is unusually high or low, it affects the entire range even if it is atypical.
The size of the dataset affects the range.
In a small sample, we are less likely to observe extreme values.
However, as we increase the sample size, we have more opportunities to obtain these extreme values.
Consequently, when we draw random samples from the same population, the range tends to increase
as the sample size increases.
Therefore, use the range to compare variability only when the sample sizes are similar.
Range – Individual Observation
The following are the share prices of 5 companies:
Company Share Price
1 250
2 260
3 270
4 280
5 290
Largest value (L) = 290
Smallest value (S) = 250
Range = L – S = 290 – 250 = 40
Coefficient of Range = (L – S)/(L + S) = 40/540 = 0.074
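The share-price calculation can be sketched in a few lines of Python. The function names `value_range` and `coefficient_of_range` are chosen here for illustration only; they are not from any library.

```python
def value_range(values):
    """Return the range: largest value minus smallest value."""
    return max(values) - min(values)


def coefficient_of_range(values):
    """Return (L - S) / (L + S), a unit-free relative measure."""
    largest, smallest = max(values), min(values)
    return (largest - smallest) / (largest + smallest)


share_prices = [250, 260, 270, 280, 290]

print(value_range(share_prices))                     # 40
print(round(coefficient_of_range(share_prices), 3))  # 0.074
```

Because both functions look only at the largest and smallest values, changing any of the three middle prices leaves the output unchanged.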
Range = L – S
= 153 – 150 = 3
Coefficient of Range = (L – S)/(L + S)
= 3/303
= 0.0099
Range – Continuous Series
The following table shows the income distribution of 40 employees:
Income No. of Employees
5000-10000 3
10000-15000 10
15000-20000 7
20000-25000 20
L = 25000 (upper limit of the highest class), S = 5000 (lower limit of the lowest class)
Range = L – S = 25,000 – 5,000 = 20,000
Coefficient of Range = (L – S)/(L + S) = 20,000/30,000
= 2/3 = 0.67
Quartile Deviation
Interquartile Deviation
The interquartile range is the middle half of the data.
Note that median value splits the dataset in half.
Similarly, we can divide the data into quarters.
Statisticians refer to these quarters as quartiles and denote them from low to high as Q1, Q2, Q3, and Q4.
The lowest quartile (Q1) contains the quarter of the dataset with the smallest values.
The upper quartile (Q4) contains the quarter of the dataset with the highest values.
The interquartile range is the middle half of the data that is in between the upper and lower quartiles.
In other words, the interquartile range includes the 50% of data points that fall between Q1 and Q3.
The interquartile range is the third quartile (Q3) minus the first quartile (Q1).
This gives us the range of the middle half of a data set.
Quartile Deviation = (Q3-Q1)/2
Coefficient of Quartile Deviation = (Q3-Q1)/(Q3+Q1)
Individual Observation:
Number of policies sold by an insurance agent:
Month : 1 2 3 4 5 6 7
Policies sold : 14 11 18 15 13 12 16
Write the data in ascending order:
11 12 13 14 15 16 18
Q1 = ((N+1)/4)th observation = ((7+1)/4)th observation = 2nd observation = 12
Q3 = (3(N+1)/4)th observation = (3 × 2)th observation
= 6th observation = 16
Inter Quartile Range = Q3-Q1
= 16-12 = 4
Quartile Deviation = (Q3-Q1)/2
= 4/2 = 2
Coefficient of Quartile Deviation = (Q3-Q1)/(Q3+Q1)
= 4/28 = 0.143
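The position rule used above can be sketched in Python as follows. The helper `observation_at` is a hypothetical name; interpolating between neighbours when the position is fractional is an assumption added here, since the worked example only needs whole positions.

```python
def observation_at(sorted_values, position):
    """Value at a 1-based position; interpolates between neighbours
    when the position is fractional (an assumption, see lead-in)."""
    n = len(sorted_values)
    lower = int(position)
    if lower < 1:
        return sorted_values[0]
    if lower >= n:
        return sorted_values[-1]
    fraction = position - lower
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + fraction * (above - below)


policies = [14, 11, 18, 15, 13, 12, 16]
ordered = sorted(policies)
n = len(ordered)

q1 = observation_at(ordered, (n + 1) / 4)       # 2nd observation
q3 = observation_at(ordered, 3 * (n + 1) / 4)   # 6th observation

iqr = q3 - q1
quartile_deviation = iqr / 2
coefficient_qd = iqr / (q3 + q1)

print(q1, q3, iqr, quartile_deviation, round(coefficient_qd, 3))
```

Other quartile conventions exist (statistical software often offers several), so results on small samples can differ slightly from this rule.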
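Discrete Series:
Class : 5 6 7 8 9 10 11
Frequency : 8 13 16 23 15 12 6
Cumulative
Frequency (c.f.) : 8 21 37 60 75 87 93
Q1 = ((N+1)/4)th observation = ((93+1)/4)th = 23.5th observation, where N is the highest cumulative frequency
The first cumulative frequency that reaches 23.5 is 37, so Q1 = 7
Q3 = (3(N+1)/4)th observation = (3 × 23.5)th = 70.5th observation
The first cumulative frequency that reaches 70.5 is 75, so Q3 = 9
Interquartile Range = Q3-Q1 = 9 – 7 = 2
Quartile Deviation = (Q3-Q1)/2 = 2/2 = 1
Coefficient of Quartile Deviation
= (Q3-Q1)/(Q3+Q1) = 2/(9+7)
= 2/16 = 0.125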
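For a discrete series, the quartile is the class value whose cumulative frequency first reaches the quartile position, not the position itself. A minimal sketch of that lookup, with `value_at_position` as an illustrative helper name and N taken as the sum of the listed frequencies:

```python
from itertools import accumulate


def value_at_position(values, frequencies, position):
    """Return the first value whose cumulative frequency reaches position."""
    for value, cum_freq in zip(values, accumulate(frequencies)):
        if cum_freq >= position:
            return value
    return values[-1]


classes = [5, 6, 7, 8, 9, 10, 11]
frequencies = [8, 13, 16, 23, 15, 12, 6]
n = sum(frequencies)   # total number of observations

q1 = value_at_position(classes, frequencies, (n + 1) / 4)
q3 = value_at_position(classes, frequencies, 3 * (n + 1) / 4)

quartile_deviation = (q3 - q1) / 2
coefficient_qd = (q3 - q1) / (q3 + q1)

print(q1, q3, quartile_deviation, coefficient_qd)
```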
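Continuous Series:
Q1 = L + ((N/4 – c.f.)/f) × i
Here L is the lower limit of the quartile class, c.f. is the cumulative frequency of the class before it, f is the frequency of the quartile class, and i is the class width.
N/4 = 217/4 = 54.25
L = 20, f = 34, c.f. = 53, i = 10
Q1 = 20 + ((54.25 – 53)/34) × 10
= 20 + (1.25/34) × 10
= 20.368
Q3 = L + ((3(N/4) – c.f.)/f) × i
3(N/4) = 3 × 54.25 = 162.75
L = 40, f = 37, c.f. = 128, i = 10
Q3 = 40 + ((162.75 – 128)/37) × 10
= 40 + (34.75/37) × 10
= 49.392
I.Q.R = Q3 – Q1 = 49.392 – 20.368 = 29.024
Q.D = 29.024/2 = 14.512
Coefficient of Q.D = (Q3 – Q1)/(Q3 + Q1)
= 29.024/69.760 = 0.416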
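The grouped-data interpolation can be checked with a short Python sketch. The function name `grouped_quartile` and its parameter names are illustrative; the numbers are the ones quoted in the working above.

```python
def grouped_quartile(lower_limit, position, cum_freq_before, class_freq, width):
    """Interpolate inside the quartile class: L + ((position - c.f.) / f) * i."""
    return lower_limit + (position - cum_freq_before) / class_freq * width


n = 217  # total frequency

q1 = grouped_quartile(20, n / 4, 53, 34, 10)
q3 = grouped_quartile(40, 3 * n / 4, 128, 37, 10)

iqr = q3 - q1
quartile_deviation = iqr / 2
coefficient_qd = iqr / (q3 + q1)

print(round(q1, 3), round(q3, 3))
print(round(iqr, 3), round(quartile_deviation, 3), round(coefficient_qd, 3))
```

Carrying full precision through the calculation and rounding only at the end avoids small discrepancies in the third decimal place.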
Viewed as quarters, the interquartile range (IQR) extends from the low end of the second quarter (Q2) to the upper limit of the third quarter (Q3).
For this dataset, the IQR runs from 21 to 39.
We can also use other percentiles to determine the spread of different proportions.
For example, the range between the 97.5th percentile and the 2.5th percentile covers 95% of the data.
The broader these ranges, the higher the variability in our dataset.
Let us consider a dataset containing 8 data points:
72, 110, 134, 190, 238, 287, 305, 324.
To find the interquartile range first find the values at Q1 and Q3.
Multiply the number of values in the data set (8) by 0.25 for the 25th percentile (Q1) and by 0.75 for the
75th percentile (Q3).
Q1 position: 0.25 x 8 = 2 ; Q3 position: 0.75 x 8 = 6
Q1 is the value in the 2nd position, which is 110.
Q3 is the value in the 6th position, which is 287.
IQR = Q3 – Q1; IQR = 287 – 110 = 177
Like range, IQR uses only 2 values for calculation.
But the IQR is less affected by outliers:
the 2 values come from the middle half of the data set, so they are unlikely to be extreme scores.
The IQR gives a consistent measure of variability for skewed as well as normal distributions.
Five-number summary - Every distribution can be organized using a five-number summary:
Lowest value
Q1: 25th percentile
Q2: the median
Q3: 75th percentile
Highest value (Q4)
These five-number summaries can be easily visualized using box and whisker plots.
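As an illustration, the five-number summary of the eight-point dataset used earlier (72, 110, ..., 324) can be computed as below. The quartile positions follow the 0.25 × n and 0.75 × n rule from that example; rounding a fractional position up (the nearest-rank convention) is an assumption added here, and `five_number_summary` is an illustrative name.

```python
import math
import statistics


def five_number_summary(values):
    """Return (lowest, Q1, median, Q3, highest) using positions p * n."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = ordered[math.ceil(0.25 * n) - 1]   # value at position 0.25 * n
    q3 = ordered[math.ceil(0.75 * n) - 1]   # value at position 0.75 * n
    return ordered[0], q1, statistics.median(ordered), q3, ordered[-1]


data = [72, 110, 134, 190, 238, 287, 305, 324]
lowest, q1, median, q3, highest = five_number_summary(data)

print(lowest, q1, median, q3, highest)
print("IQR =", q3 - q1)   # IQR = 177
```

A box and whisker plot draws exactly these five values: the box spans Q1 to Q3, the line inside it marks the median, and the whiskers reach the lowest and highest values.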
Let us say we are investigating amounts of time spent on phones daily by different groups of people.
Using simple random samples, we collect data from 3 groups: Sample A: high school students, Sample B:
college students, Sample C: adult full-time employees.
All three samples have the same average phone use, at 195 minutes or 3 hours and 15 minutes.
This is the x-axis value where the curves peak.
Although the data in each sample follow a normal distribution, the samples have different spreads.
Sample A has the largest variability while Sample C has the smallest variability.
Mean Deviation
Individual observation
Find the mean deviation and its coefficient for the data:
Income: 3,000 5,000 5,200 5,400 4,600 4,800 5,800
In ascending order the values are 3,000 4,600 4,800 5,000 5,200 5,400 5,800. We have 7 values.
Median = ((N+1)/2)th observation = ((7+1)/2)th observation = 4th observation = 5,000
Deviations are taken from the median, so A = 5,000.
X |D| = |X – A|
3000 2000
5000 0
5200 200
5400 400
4600 400
4800 200
5800 800
------
4000
M.D = ∑|D|/N = 4000/7 = 571.429
Coefficient of M.D = M.D/Median = 571.429/5000 = 0.114
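A short Python sketch of the mean deviation about the median for the seven incomes; `mean_deviation_about_median` is an illustrative name, and the median comes from the standard library's `statistics.median`.

```python
import statistics


def mean_deviation_about_median(values):
    """Return (mean deviation, coefficient of mean deviation)."""
    median = statistics.median(values)
    md = sum(abs(x - median) for x in values) / len(values)
    return md, md / median


incomes = [3000, 5000, 5200, 5400, 4600, 4800, 5800]
md, coefficient = mean_deviation_about_median(incomes)

print(statistics.median(incomes))   # 5000
print(round(md, 3), round(coefficient, 3))
```

Taking absolute values is what keeps the positive and negative deviations from cancelling each other out.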
Discrete Series
The number of CFL lamps per house is given in the table below. Find the M.D and its coefficient for the data.
No. of CFL (x) Number of houses (f)
1 3
2 7
3 6
4 4
Here the assumed mean (A) is chosen as 2.
No. of CFL (x) No. of houses (f) |D| = |X – A| f|D| c.f.
1 3 1 3 3
2 7 0 0 10
3 6 1 6 16
4 4 2 8 20
---- -----
20 17
M.D = ∑f|D|/∑f = 17/20 = 0.85
Median = ((N+1)/2)th observation = ((20+1)/2)th = 10.5th observation; the first cumulative frequency that reaches 10.5 is 16, so Median = 3
Coefficient of M.D = M.D/Median = 0.85/3 = 0.283
Variance
A deviation from the mean is how far a score lies from the mean.
The variance is the average of squared deviations from the mean.
Variance is the square of the standard deviation.
Because it is expressed in squared units, its value is usually much larger than a typical value of the data set.
Variance reflects the degree of spread in the data set.
The more spread the data, the larger the variance is in relation to the mean.
Variance formula for populations
σ² = Σ(X – μ)² / N
σ² = population variance
Σ = sum of…
X = each value
μ = population mean
N = number of values in the population
Variance formula for samples
s² = Σ(X – x̄)² / (n – 1)
s² = sample variance
Σ = sum of…
X = each value
x̄ = sample mean
n = number of values in the sample
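The two definitions differ only in the divisor. As a small sketch, here they are applied to the five share prices from the range example; treating those prices first as a whole population and then as a sample is purely for illustration.

```python
def population_variance(values):
    """Average squared deviation from the mean, dividing by N."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the sample mean, dividing by n - 1."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)


share_prices = [250, 260, 270, 280, 290]

print(population_variance(share_prices))   # 200.0
print(sample_variance(share_prices))       # 250.0
```

Python's standard library offers the same pair as `statistics.pvariance` and `statistics.variance`.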
Biased versus unbiased estimates of variance
An unbiased estimate in statistics is one that doesn’t consistently give us either high values or low values
– it has no systematic bias.
If the sample variance formula divided by the sample size n, the sample variance would be biased towards
values lower than the population variance.
Dividing by n – 1 instead of n makes the variance slightly larger.
In this case, the bias is not only lowered but removed entirely.
The sample variance formula therefore gives unbiased estimates of the population variance.
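The effect of the divisor can be seen exactly on a tiny, made-up population. The sketch below lists every possible sample of size 2 (drawn with replacement) from the hypothetical population 2, 4, 6, 8 and averages the two versions of the sample variance; the population and the sample size are assumptions chosen only to keep the enumeration small.

```python
from itertools import product


def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by the given divisor."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / divisor


population = [2, 4, 6, 8]                      # hypothetical population
population_variance = variance(population, len(population))

n = 2
samples = list(product(population, repeat=n))  # all 16 ordered samples

average_dividing_by_n = sum(variance(s, n) for s in samples) / len(samples)
average_dividing_by_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print(population_variance)             # 5.0
print(average_dividing_by_n)           # 2.5
print(average_dividing_by_n_minus_1)   # 5.0
```

Averaged over all possible samples, dividing by n falls short of the population variance, while dividing by n – 1 matches it.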
What’s the best measure of variability?
The best measure of variability depends on your level of measurement and distribution.
Level of measurement
For data measured at an ordinal level, the range and interquartile range are the only appropriate measures of
variability.
For more complex interval and ratio levels, the standard deviation and variance are also applicable.
Distribution
For normal distributions, all measures can be used.
The standard deviation and variance are preferred because they take our whole data set into account, but this
also means that they are easily influenced by outliers.
For skewed distributions or data sets with outliers, the interquartile range is the best measure.
It’s least affected by extreme values because it focuses on the spread in the middle of the data set.
Example of calculating the sample variance
Let us work through an example using the formula for a sample on a dataset with 17 observations in the
table below.
The numbers in parentheses represent the corresponding table column number.
The procedure involves taking each observation (1), subtracting the sample mean (2) to calculate the
difference (3), and squaring that difference (4).
Then, we sum the squared differences at the bottom of the table.
Finally, we take the sum and divide by 16 because we are using the sample variance equation with 17
observations (17 – 1 = 16).
The variance for this dataset is 201.
Note: Various statistical tests use the variance in their calculations, for example the F-test and ANOVA.
Let us consider the Data Set II below and compute the variance:
Let us assume we are given the GPA scores of 10 students on a 4.0 scale as below.
x x – x̄ (x – x̄)²
24 -161/3 25921/9
58 -59/3 3481/9
61 -50/3 2500/9
67 -32/3 1024/9
71 -20/3 400/9
73 -14/3 196/9
76 -5/3 25/9
79 4/3 16/9
82 13/3 169/9
83 16/3 256/9
85 22/3 484/9
87 28/3 784/9
88 31/3 961/9
88 31/3 961/9
92 43/3 1849/9
93 46/3 2116/9
94 49/3 2401/9
97 58/3 3364/9
Sum 0 46908/9
Standard Deviation
The standard deviation is the standard or typical difference between each data point and the mean.
When the values in a dataset are grouped closer together, we have a smaller standard deviation.
But when the values are spread out more, the standard deviation is larger because the standard distance is
greater.
Standard deviation uses the original units of the data, which makes interpretation easier.
For example, in the pizza delivery example, a standard deviation of 5 indicates that the typical delivery
time is plus or minus 5 minutes from the mean.
It’s reported along with mean: 20 minutes (s.d. 5).
The standard deviation is just the square root of the variance.
Recall that the variance is in squared units.
Hence, the square root returns the value to the natural units.
The symbol for the standard deviation as a population parameter is σ while s represents it as a sample
estimate.
There are six steps for finding the standard deviation by hand:
1. List each score and find their mean.
2. Subtract the mean from each score to get the deviation from the mean.
3. Square each of these deviations.
4. Add up all of the squared deviations.
5. Divide the sum of the squared deviations by n – 1 (for a sample) or N (for a population).
6. Find the square root of the number we found.
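The six steps translate almost line for line into Python. The eight scores below are made up for illustration and are not taken from any table in these notes; the function name and the `sample` flag are likewise illustrative.

```python
import math


def standard_deviation(scores, sample=True):
    """Follow the six hand-calculation steps for the standard deviation."""
    mean = sum(scores) / len(scores)                       # step 1
    deviations = [x - mean for x in scores]                # step 2
    squared = [d ** 2 for d in deviations]                 # step 3
    total = sum(squared)                                   # step 4
    divisor = len(scores) - 1 if sample else len(scores)   # step 5
    return math.sqrt(total / divisor)                      # step 6


scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data

print(round(standard_deviation(scores, sample=True), 3))
print(standard_deviation(scores, sample=False))   # 2.0
```

The only difference between the sample and population versions is the divisor in step 5.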
For example, in the table below we take the amount of time a cricketer spent batting in each of
8 test matches.
Because we’re dealing with a sample, we use n – 1.
n – 1 = 7
Variance = 63904 / 7 = 9129.14
Standard deviation = √9129.14 = 95.55
The standard deviation of our data is 95.55.
This means that, on average, each score deviates from the mean by about 95.55.
Standard Deviation – Individual observation
The following data shows the number of cars sold during 7 months
Month : Jan Feb March April May June July
No. of cars: 90 100 110 120 130 140 150
Find the standard deviation for the above data
Here the assumed mean A = 120, which is also the actual mean.
X d = X – A d²
90 -30 900
100 -20 400
110 -10 100
120 0 0
130 10 100
140 20 400
150 30 900
---- -----
0 2800
Standard deviation = √(∑d²/N – (∑d/N)²) = √(2800/7 – (0/7)²) = √400 = 20
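The assumed-mean shortcut gives the same answer whichever value of A is chosen, because the correction term (∑d/N)² removes the effect of the shift. A minimal sketch on the car-sales data, with `sd_assumed_mean` as an illustrative name:

```python
import math


def sd_assumed_mean(values, assumed_mean):
    """Population standard deviation via sqrt(sum(d^2)/N - (sum(d)/N)^2)."""
    n = len(values)
    d = [x - assumed_mean for x in values]
    return math.sqrt(sum(v * v for v in d) / n - (sum(d) / n) ** 2)


cars_sold = [90, 100, 110, 120, 130, 140, 150]

print(sd_assumed_mean(cars_sold, 120))   # 20.0
print(sd_assumed_mean(cars_sold, 100))   # 20.0
```

Choosing A equal to the true mean simply makes ∑d zero, which is why the second term drops out of the hand calculation.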
Discrete Series
The number of CFL lamps in a house is given below:
No. of CFLs (X) : 1 2 3 4
No. of houses (f) : 3 8 5 6
Here the assumed mean A = 2.
X f d = X – A d² fd fd²
1 3 -1 1 -3 3
2 8 0 0 0 0
3 5 1 1 5 5
4 6 2 4 12 24
--- --- ---
22 14 32
Standard Deviation = √(∑fd²/N – (∑fd/N)²), where N = ∑f
= √(32/22 – (14/22)²) = √(1.4545 – 0.4050) = √1.0496 = 1.024
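For a discrete series the same shortcut is weighted by the frequencies. A sketch on the CFL data, keeping the sign of each deviation d when forming fd (the helper name is illustrative):

```python
import math


def sd_discrete_series(values, frequencies, assumed_mean):
    """sqrt(sum(f*d^2)/N - (sum(f*d)/N)^2) with d = X - A and N = sum(f)."""
    n = sum(frequencies)
    sum_fd = sum(f * (x - assumed_mean) for x, f in zip(values, frequencies))
    sum_fd2 = sum(f * (x - assumed_mean) ** 2 for x, f in zip(values, frequencies))
    return math.sqrt(sum_fd2 / n - (sum_fd / n) ** 2)


cfl_counts = [1, 2, 3, 4]
houses = [3, 8, 5, 6]

sd = sd_discrete_series(cfl_counts, houses, assumed_mean=2)
print(round(sd, 3))
```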
Let’s take another look at the pizza delivery example where we have a mean delivery time of 20 minutes
and a standard deviation of 5 minutes.
Using the Empirical Rule, and assuming the delivery times are approximately normally distributed, we can
use the mean and standard deviation to determine that about 68% of the delivery times will fall between
15-25 minutes (20 +/- 5) and about 95% will fall between 10-30 minutes (20 +/- 2*5).
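The percentages in the Empirical Rule come from the normal distribution. Assuming delivery times really are normal with mean 20 and standard deviation 5, the standard library's `statistics.NormalDist` (Python 3.8+) can confirm them; `share_within` is an illustrative helper.

```python
from statistics import NormalDist

delivery = NormalDist(mu=20, sigma=5)   # assumed normal model of delivery time


def share_within(dist, k):
    """Return (low, high, share of values) for mean +/- k standard deviations."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return low, high, dist.cdf(high) - dist.cdf(low)


for k in (1, 2, 3):
    low, high, share = share_within(delivery, k)
    print(f"{low:.0f}-{high:.0f} minutes: {share:.1%}")
```

If the data are clearly skewed, these percentages no longer apply and percentile-based ranges are the safer description.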
Coefficient of Variation
We have considered various measures of variation: Range, IQR, and Variance (and its square root
counterpart - Standard Deviation).
These are all measures we can calculate from one quantitative variable e.g. height or weight.
But how can we compare dispersion (i.e. variability) of data from two or more distinct populations that
have vastly different means?
A popular statistic to use in such situations is the Coefficient of Variation or CV.
This is a unit-free statistic: the higher its value, the greater the dispersion.
The calculation of CV is:
Coefficient of Variation (CV) = Standard Deviation / Mean
Comparing Prices
Assume you are shopping for toilet tissue.
As you compare prices of various brands, some offer price per roll while others offer price per sheet.
You are interested in determining which pricing method has less variability so you sample several of each
and calculate the mean and standard deviation for the sampled items that are priced per roll and the mean
and standard deviation for the sampled items that are priced per sheet.
The table below summarizes your results.
Item Mean Standard Deviation
Following are the records of two players regarding their performance in cricket matches.
Scores of Player A: 48 52 55 60 65 45 63 70
Scores of Player B: 33 35 80 70 100 15 41 25
Which player is more consistent in his performance?
Coefficient of Variation for Player A
Mean : 57.25
Standard Deviation (sample, n – 1) : 8.714
C.V for Player A = (s/x̄)*100
= (8.714/57.25)*100 = 15.22%
Coefficient of Variation for Player B
Mean : 49.875
Standard Deviation (sample, n – 1) : 29.868
C.V for Player B = (s/x̄)*100
= (29.868/49.875)*100 = 59.89%
Since the C.V of Player A < the C.V of Player B, Player A is more consistent.
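The comparison can be reproduced from the eight scores of each player. This sketch uses the sample standard deviation (`statistics.stdev`, divisor n – 1); using the population version would change the percentages slightly but not the conclusion.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


player_a = [48, 52, 55, 60, 65, 45, 63, 70]
player_b = [33, 35, 80, 70, 100, 15, 41, 25]

cv_a = coefficient_of_variation(player_a)
cv_b = coefficient_of_variation(player_b)

print(f"Player A: mean {statistics.mean(player_a)}, CV {cv_a:.2f}%")
print(f"Player B: mean {statistics.mean(player_b)}, CV {cv_b:.2f}%")
print("More consistent:", "A" if cv_a < cv_b else "B")
```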
The following table shows the monthly expenditure of 80 students of a university on
breakfast:
Expenditure (in Rs.) No. of Students
78-82 2
73-77 5
68-72 8
63-67 10
58-62 20
53-57 13
48-52 9
43-47 5
38-42 6
33-37 2
Coefficient of Variation – Continuous Series
Take the assumed mean A = 55 and the class width i = 5.
X m f d = (m – A)/i d² fd fd²
33-37 35 2 -4 16 -8 32
38-42 40 6 -3 9 -18 54
43-47 45 5 -2 4 -10 20
48-52 50 9 -1 1 -9 9
53-57 55 13 0 0 0 0
58-62 60 20 1 1 20 20
63-67 65 10 2 4 20 40
68-72 70 8 3 9 24 72
73-77 75 5 4 16 20 80
78-82 80 2 5 25 10 50
Total: ∑f = 80, ∑fd = 49, ∑fd² = 377
C.V = (σ/X̄)*100
σ = √(∑fd²/N – (∑fd/N)²) × i = √(377/80 – (49/80)²) × 5 = 10.413
X̄ = A + (∑fd/N) × i = 55 + (49/80) × 5
= 58.063
C.V = (10.413/58.063)*100 = 17.93%
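The step-deviation method for grouped data can be sketched as follows. The frequencies are read from the expenditure table as it is stated (2 students in 78-82, 5 in 73-77, and so on, down to 2 in 33-37), with A = 55 and i = 5; the variable names are illustrative.

```python
import math

# Expenditure classes (Rs.) in ascending order, with the number of students
# in each class as read from the table
classes = [(33, 37), (38, 42), (43, 47), (48, 52), (53, 57),
           (58, 62), (63, 67), (68, 72), (73, 77), (78, 82)]
students = [2, 6, 5, 9, 13, 20, 10, 8, 5, 2]

assumed_mean = 55   # A, a convenient class midpoint
width = 5           # i, distance between consecutive midpoints

midpoints = [(low + high) / 2 for low, high in classes]
steps = [(m - assumed_mean) / width for m in midpoints]   # d = (m - A) / i

n = sum(students)
sum_fd = sum(f * d for f, d in zip(students, steps))
sum_fd2 = sum(f * d * d for f, d in zip(students, steps))

mean = assumed_mean + sum_fd / n * width
sd = width * math.sqrt(sum_fd2 / n - (sum_fd / n) ** 2)
cv = sd / mean * 100

print(n, sum_fd, sum_fd2)
print(round(mean, 3), round(sd, 3), round(cv, 2))
```

Working in steps of the class width keeps the numbers small for hand calculation; multiplying by i at the end converts the result back to rupees.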
The coefficient of variation is a statistical measure of the dispersion of a set of data around its mean.
This measure is used to compare the spread of data relative to the mean or average
value.
It is derived by dividing the standard deviation by the mean.
In simple words, it expresses the standard deviation as a percentage of the mean.
Standard deviation can be the same for data sets with different means, but their coefficients of variation
will not be the same.
In statistical notation:
CV = Standard deviation / Mean