
Quartile & Deviation


PROPERTIES OF ARITHMETIC MEAN

1. The algebraic sum of the deviations of a set of numbers from their arithmetic mean is zero.
2. The sum of the squares of the deviations of a set of numbers Xj from any number a is a minimum if and only if a = X̄.
3. If f1 numbers have mean m1, f2 numbers have mean m2, ..., fk numbers have mean mk, then the mean of all the numbers is

$$\bar{X} = \frac{f_1 m_1 + f_2 m_2 + \cdots + f_k m_k}{f_1 + f_2 + \cdots + f_k}$$

4. If A is any guessed or assumed arithmetic mean (which may be any number) and dj = Xj − A are the deviations of Xj from A, then

$$\bar{X} = A + \frac{\sum d_j}{n} \quad \text{(ungrouped data)}$$

$$\bar{X} = A + \frac{\sum f_j d_j}{n} \quad \text{(data with frequencies } f_j,\ n = \sum f_j \text{)}$$
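The four properties can be checked numerically. The following is a minimal sketch, assuming a small illustrative data set and an assumed mean A = 5 (both chosen for this example, not taken from the slides).

```python
from math import isclose

# Illustrative numbers chosen for this sketch (an assumption, not slide data).
values = [2, 3, 6, 8, 11]
n = len(values)
mean = sum(values) / n

# Property 1: the deviations from the mean sum to zero.
deviation_sum = sum(x - mean for x in values)

# Property 2: the sum of squared deviations is smallest when a equals the mean.
def sum_of_squares(a):
    return sum((x - a) ** 2 for x in values)

candidates = [4, 5, 5.9, 6.1, 7, 8]  # arbitrary values of a other than the mean
mean_is_minimiser = all(sum_of_squares(mean) < sum_of_squares(a) for a in candidates)

# Property 3: combined mean of two groups from their sizes and group means.
group_a, group_b = values[:3], values[3:]
sizes = [len(group_a), len(group_b)]
group_means = [sum(group_a) / len(group_a), sum(group_b) / len(group_b)]
combined_mean = sum(f * m for f, m in zip(sizes, group_means)) / sum(sizes)

# Property 4: mean recovered from an assumed mean A and deviations d = x - A.
A = 5
mean_from_assumed = A + sum(x - A for x in values) / n

print(deviation_sum, mean_is_minimiser, combined_mean, mean_from_assumed)
```

Property 2 is only spot-checked against a handful of candidate values here; it is not a proof.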
Quantiles:

The median is the middle value, or the mean of the two middle values, of an array. It divides a set of data into two equal parts.

In the same way, there are other values which divide a set of data into four, ten or a hundred equal parts. Such values are referred to as quartiles, deciles and percentiles respectively.

Collectively, the quartiles, deciles, percentiles and other values obtained by equal subdivision of the data are called quantiles.
Quartiles:

The values which divide an array (a set of data arranged in ascending or descending order) into four equal parts are called quartiles. The first, second and third quartiles are denoted by Q1, Q2 and Q3 respectively. The first and third quartiles are also called the lower and upper quartiles respectively. The second quartile is the median, the middle value.

Quartiles for Ungrouped Data:


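Quartiles for ungrouped data, arranged as an array of n values, are calculated by the following formulae:

$$Q_1 = \text{value of the } \frac{n+1}{4}\text{th item}$$

$$Q_2 = \text{value of the } \frac{2(n+1)}{4}\text{th item}$$

$$Q_3 = \text{value of the } \frac{3(n+1)}{4}\text{th item}$$

When the position is fractional, the quartile is found by interpolating between the two neighbouring items.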
Deciles:
The values that divide the data into 10 equal parts are called deciles and are denoted by D1, D2, D3, D4, D5, D6, D7, D8, D9.

Percentiles:
The values that divide the data into 100 equal parts are called percentiles and are denoted by P1, P2, P3, P4, ..., P99.
Problem
The following are the marks obtained by 20 students in a statistics test:

53 74 82 42 39 20 81 68 58 28
67 54 93 70 30 55 36 37 29 61

To apply the formulae, we first arrange the data in ascending order, i.e. in the form of an array:

20 28 29 30 36 37 39 42 53 54
55 58 61 67 68 70 74 81 82 93
With n = 20, Q1 is the value of the (20 + 1)/4 = 5.25th item. The value of the 5th item is 36 and that of the 6th item is 37. Thus the first quartile is the value 0.25 of the way between 36 and 37, which is 36 + 1(0.25) = 36.25. Therefore, Q1 = 36.25.

Similarly, Q2 is the value of the 2(20 + 1)/4 = 10.5th item. The value of the 10th item is 54 and that of the 11th item is 55. Thus the second quartile is the value 0.5 of the way between 54 and 55. Since the difference between 54 and 55 is 1, we get 54 + 1(0.5) = 54.5. Hence, Q2 = 54.5.

Likewise, Q3 is the value of the 3(20 + 1)/4 = 15.75th item. The value of the 15th item is 68 and that of the 16th item is 70. Thus the third quartile is the value 0.75 of the way between 68 and 70. As the difference between 68 and 70 is 2, the third quartile is 68 + 2(0.75) = 69.5. Therefore, Q3 = 69.5.
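The interpolation steps above can be sketched in Python. This is a minimal illustration, assuming the j(n + 1)/4 position rule that the worked values imply; the function name `quartile` is a hypothetical helper, not part of any library.

```python
def quartile(sorted_values, j):
    """Return the j-th quartile (j = 1, 2, 3) of an ascending array,
    locating it at position j(n + 1)/4 and interpolating linearly."""
    n = len(sorted_values)
    position = j * (n + 1) / 4              # 1-based, possibly fractional
    position = min(max(position, 1), n)     # keep the position inside the array
    lower_index = int(position)             # item at or just below the position
    fraction = position - lower_index
    lower = sorted_values[lower_index - 1]
    if fraction == 0:
        return lower
    upper = sorted_values[lower_index]
    return lower + fraction * (upper - lower)

marks = [53, 74, 82, 42, 39, 20, 81, 68, 58, 28,
         67, 54, 93, 70, 30, 55, 36, 37, 29, 61]
array = sorted(marks)
q1, q2, q3 = (quartile(array, j) for j in (1, 2, 3))
print(q1, q2, q3)
```

Other software may use different interpolation conventions, so quartiles computed elsewhere can differ slightly from these values. Python's `statistics.quantiles(marks, n=4)` with its default `method='exclusive'` is expected to follow the same (n + 1) convention.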
Quartiles for Grouped Data:

The quartiles may be determined from grouped data in the same way as the median, except that in place of n/2 we use n/4 (for Q1) or 3n/4 (for Q3). To calculate quartiles from grouped data we first form a cumulative frequency column. The quartiles are then calculated from the following formulae:

$$Q_1 = l + \frac{h}{f}\left(\frac{n}{4} - C.F.\right)$$

$$Q_2 = l + \frac{h}{f}\left(\frac{2n}{4} - C.F.\right) = \text{Median}$$

$$Q_3 = l + \frac{h}{f}\left(\frac{3n}{4} - C.F.\right)$$

Where, for the quartile being calculated,
l = lower class boundary of the class containing the quartile, i.e. the class whose cumulative frequency is the first to reach n/4, 2n/4 or 3n/4.
h = class interval size of the class containing the quartile.
f = frequency of the class containing the quartile.
n = number of values, or the total frequency.
C.F. = cumulative frequency of the class preceding the class containing the quartile.
For Example:
We will calculate the quartiles from the frequency distribution for the weight of 120 students as given in the following table.

Weight (lb)   Frequency (f)   Class Boundaries   Cumulative Frequency
110 – 119          1          109.5 – 119.5              1
120 – 129          4          119.5 – 129.5              5
130 – 139         17          129.5 – 139.5             22
140 – 149         28          139.5 – 149.5             50
150 – 159         25          149.5 – 159.5             75
160 – 169         18          159.5 – 169.5             93
170 – 179         13          169.5 – 179.5            106
180 – 189          6          179.5 – 189.5            112
190 – 199          5          189.5 – 199.5            117
200 – 209          2          199.5 – 209.5            119
210 – 219          1          209.5 – 219.5            120

∑f = n = 120
i. The first quartile Q1 is the value of the n/4 = 120/4 = 30th item from the lower end. From the table, the cumulative frequency of the third class is 22 and that of the fourth class is 50. Thus Q1 lies in the fourth class, i.e. 140 – 149, for which l = 139.5, h = 10, f = 28 and C.F. = 22:

$$Q_1 = 139.5 + \frac{10}{28}(30 - 22) = 139.5 + 2.86 = 142.36$$

ii. The third quartile Q3 is the value of the 3n/4 = 90th item from the lower end. The cumulative frequency of the fifth class is 75 and that of the sixth class is 93. Thus Q3 lies in the sixth class, i.e. 160 – 169, for which l = 159.5, h = 10, f = 18 and C.F. = 75:

$$Q_3 = 159.5 + \frac{10}{18}(90 - 75) = 159.5 + 8.33 = 167.83$$

From Q1 and Q3 we conclude that 25% of the students weigh 142.36 pounds or less and 75% of the students weigh 167.83 pounds or less.
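The grouped-data calculation can be sketched as a short Python function. This is an illustrative sketch under the definitions of l, h, f, n and C.F. given earlier; the name `grouped_quartile` and the tuple layout of `classes` are assumptions made for this example.

```python
def grouped_quartile(classes, j):
    """Return the j-th quartile (j = 1, 2, 3) from grouped data.

    classes: list of (lower_boundary, upper_boundary, frequency) tuples
    in ascending order.
    """
    n = sum(f for _, _, f in classes)
    target = j * n / 4                      # n/4, 2n/4 or 3n/4
    cumulative = 0                          # C.F. of the preceding classes
    for lower, upper, f in classes:
        if cumulative + f >= target:        # this class contains the quartile
            h = upper - lower               # class interval size
            return lower + (h / f) * (target - cumulative)
        cumulative += f
    raise ValueError("quartile not found; check the class frequencies")

# Weight distribution of the 120 students (class boundaries, frequency).
weights = [
    (109.5, 119.5, 1), (119.5, 129.5, 4), (129.5, 139.5, 17),
    (139.5, 149.5, 28), (149.5, 159.5, 25), (159.5, 169.5, 18),
    (169.5, 179.5, 13), (179.5, 189.5, 6), (189.5, 199.5, 5),
    (199.5, 209.5, 2), (209.5, 219.5, 1),
]
q1 = grouped_quartile(weights, 1)
q3 = grouped_quartile(weights, 3)
print(round(q1, 2), round(q3, 2))
```

The loop keeps a running cumulative frequency, so no separate cumulative column has to be supplied.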
Dispersion or Variation

Dispersion is a statistical term that describes how widely the values of a particular variable are spread. It can be measured by several different statistics, such as the range, mean deviation, semi-interquartile range, 10th – 90th percentile range, variance, coefficient of variation and standard deviation.
Range:

Of the measures of variability, the range is the easiest and quickest to determine. It is the difference between the largest and smallest numbers in the set.

Range = Maximum – Minimum

Example: Scores of students in an exam:

50, 65, 85, 95, 45, 55
Minimum: 45 Maximum: 95

Range = 95 – 45 = 50
Advantages

• Gives a quick approximation of the variability of the data.
• It is not complicated to calculate.
• Used when the mode is the preferred measure of central tendency (i.e. when you have nominal-level data).
• It is the simplest measure of dispersion.

Disadvantages
• It is of limited use if the extreme scores are not representative of the sample but are included among the scores.
• It is not very informative, because it is based only on the two most extreme scores.
• It is severely affected by extreme scores in the data distribution.
Mean Deviation or average deviation:
The mean deviation gives information about how far the data values are spread out from the mean value. It is the average distance of each value from the mean.

The mean deviation of a set of N numbers X1, X2, ..., XN is

$$MD = \frac{\sum_{j=1}^{N} |X_j - \bar{X}|}{N}$$

where X̄ is the arithmetic mean.
Exercise:

Find the mean deviation of the set of marks 2, 3, 6, 8, 11.

Arithmetic mean = (2 + 3 + 6 + 8 + 11) / 5 = 6
Mean deviation = (|2 − 6| + |3 − 6| + |6 − 6| + |8 − 6| + |11 − 6|) / 5
= (4 + 3 + 0 + 2 + 5) / 5 = 14 / 5 = 2.8
On average, each mark deviates from the mean by 2.8 marks.
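The same exercise can be checked in a few lines of Python. This is a minimal sketch; `mean_deviation` is a hypothetical helper name, and it uses absolute deviations, since the signed deviations would always sum to zero.

```python
def mean_deviation(values):
    """Average absolute deviation of the values from their arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

marks = [2, 3, 6, 8, 11]
md = mean_deviation(marks)
print(md)
```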
Semi-interquartile Range:

The semi-interquartile range, or quartile deviation, of a set of data is denoted by Q and defined by

$$Q = \frac{Q_3 - Q_1}{2}$$

where Q1 and Q3 are the first and third quartiles of the data. The semi-interquartile range is more common as a measure of dispersion than the interquartile range Q3 − Q1 itself.

Example: quartile deviation for wages of workers, with Q1 = Rs. 1100 and Q3 = Rs. 1450:

$$Q = \frac{1450 - 1100}{2} = \text{Rs. } 175$$
Variance (V): calculated by dividing the sum of the squared deviations from the mean by the number of observations. If the data are a sample, particularly a small one, (N – 1) may be used instead of N in the denominator when calculating the variance and SD.

$$V = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N} \quad \text{or} \quad V = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}$$

The higher the value of V, the greater the spread of the observations from the mean. For heterogeneous data, V will be larger. The square root of V is known as the Standard Deviation (S).
Standard Deviation:

The standard deviation of a set of N numbers X1, X2, ..., XN is denoted by s and defined by

$$s = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}}$$

where X̄ is the arithmetic mean. Thus s is the root mean square of the deviations from the mean, or root-mean-square deviation.

The variance of a set of data is defined as the square of the standard deviation and is denoted by s². When a distinction is needed, s² represents the sample variance and σ² the population variance.
Problem

The yields of 10 sample farmers (drawn randomly) from a population of farmers in a village are given below. Find the Variance (V) and Standard Deviation (SD) and interpret the result (ungrouped data).

Yield (quintals / hectare): 15, 12, 20, 25, 30, 12, 18, 20, 28 and 30

Mean yield = 210 / 10 = 21 quintals per hectare

Variance V = {(15 − 21)² + (12 − 21)² + … + (30 − 21)²} / 10 = 436 / 10

= 43.6 if N is used, and 436 / 9 = 48.4 if N − 1 is used (in squared units of quintals per hectare)

SD (s) = √43.6 = 6.6 quintals per hectare (or √48.4 = 6.96, roughly 7, if N − 1 is used)

Therefore, the variability in yield levels is moderate.
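Both denominators can be reproduced with Python's standard `statistics` module. This is a brief sketch: `pvariance` and `pstdev` divide by N, while `variance` and `stdev` divide by N − 1.

```python
import statistics

# Yields in quintals per hectare, from the problem above.
yields = [15, 12, 20, 25, 30, 12, 18, 20, 28, 30]

mean_yield = statistics.mean(yields)
var_n = statistics.pvariance(yields)           # divides by N
var_n_minus_1 = statistics.variance(yields)    # divides by N - 1
sd_n = statistics.pstdev(yields)               # square root of var_n
sd_n_minus_1 = statistics.stdev(yields)        # square root of var_n_minus_1

print(mean_yield, var_n, round(var_n_minus_1, 2))
print(round(sd_n, 2), round(sd_n_minus_1, 2))
```

The gap between the two versions shrinks as the number of observations grows, which is why the choice of denominator matters mainly for small samples.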


Problem

Standard deviation using grouped variables (continuous or discrete)

220 students were asked the number of hours per week they spent watching television. With this information, calculate the mean and standard deviation of hours spent watching television by the 220 students.

Table 3. Number of hours per week spent watching television

Hours       Number of students
10 to 14          2
15 to 19         12
20 to 24         23
25 to 29         60
30 to 34         77
35 to 39         38
40 to 44          8
Hours       Number of students (fi)   Sum of class limits   Xi (midpoint)   fi·Xi
10 to 14          2                        24                   12             24
15 to 19         12                        34                   17            204
20 to 24         23                        44                   22            506
25 to 29         60                        54                   27          1,620
30 to 34         77                        64                   32          2,464
35 to 39         38                        74                   37          1,406
40 to 44          8                        84                   42            336
Total           220                                                         6,560
Table 4. Number of hours spent watching television

Hours       Midpoint (x)   Frequency (f)   xf      (x − x̄)   (x − x̄)²   (x − x̄)²f
10 to 14        12              2            24    −17.82     317.6        635.2
15 to 19        17             12           204    −12.82     164.4      1,972.8
20 to 24        22             23           506     −7.82      61.2      1,407.6
25 to 29        27             60         1,620     −2.82       8.0        480.0
30 to 34        32             77         2,464      2.18       4.8        369.6
35 to 39        37             38         1,406      7.18      51.6      1,960.8
40 to 44        42              8           336     12.18     148.4      1,187.2
Total                         220         6,560                          8,013.2

Mean: x̄ = ∑xf / ∑f = 6,560 / 220 = 29.82 hours

Standard deviation: s = √(∑(x − x̄)²f / ∑f) = √(8,013.2 / 220) = √36.42 ≈ 6.03 hours
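The grouped calculation can be reproduced programmatically. The sketch below assumes, as the tables do, that every observation in a class sits at the class midpoint, and divides by the total frequency N. Variable names are illustrative.

```python
from math import sqrt

# (class limits, frequency) pairs from Table 3.
classes = [((10, 14), 2), ((15, 19), 12), ((20, 24), 23), ((25, 29), 60),
           ((30, 34), 77), ((35, 39), 38), ((40, 44), 8)]

midpoints = [(low + high) / 2 for (low, high), _ in classes]
freqs = [f for _, f in classes]
n = sum(freqs)

# Mean of grouped data: sum of f*x divided by the total frequency.
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n

# Standard deviation: root of the frequency-weighted mean squared deviation.
sum_sq = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints))
sd = sqrt(sum_sq / n)

print(n, round(mean, 2), round(sd, 2))
```

Because the code does not round the intermediate deviations, its sum of squared deviations (about 8,002.7) is slightly smaller than the tabulated 8,013.2. The standard deviation still comes out at about 6.03 hours.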
Properties of standard deviation
When using standard deviation keep in mind the following properties.
• Standard deviation is only used to measure spread or dispersion around the mean of a data set.
• Standard deviation is never negative.
• Standard deviation is sensitive to outliers. A single outlier can raise the standard deviation and, in turn, distort the picture of spread.
• For data with approximately the same mean, the greater the spread, the greater the standard deviation.
• If all values of a data set are the same, the standard deviation is zero (because each value is equal to the mean).

When analysing normally distributed data, standard deviation can be used in conjunction with the mean in order to calculate data intervals.

If x̄ = mean, S = standard deviation and x = a value in the data set, then
• about 68% of the data lie in the interval: x̄ − S < x < x̄ + S.
• about 95% of the data lie in the interval: x̄ − 2S < x < x̄ + 2S.
• about 99.7% of the data lie in the interval: x̄ − 3S < x < x̄ + 3S.
Assuming the frequency distribution is approximately normal, calculate the interval within which 95% of the previous example's observations would be expected to occur.

x̄ = 29.82, s = 6.03
Calculate the interval using the following formula: x̄ − 2s < x < x̄ + 2s
29.82 − (2 × 6.03) < x < 29.82 + (2 × 6.03)
29.82 − 12.06 < x < 29.82 + 12.06
17.76 < x < 41.88

This means that there is about a 95% certainty that a student will spend between 18 hours and 42 hours per week watching television.
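The interval arithmetic generalises to any multiple of the standard deviation. Here is a small sketch, assuming approximately normal data as stated above; `sd_interval` is a hypothetical helper name.

```python
def sd_interval(mean, sd, k):
    """Return (lower, upper) bounds of the interval mean - k*sd < x < mean + k*sd."""
    return mean - k * sd, mean + k * sd

# Values from the television example: mean 29.82 hours, s = 6.03 hours.
low, high = sd_interval(29.82, 6.03, 2)
print(round(low, 2), round(high, 2))
```

Calling the same function with k = 1 or k = 3 gives the intervals expected to contain roughly 68% and 99.7% of normally distributed observations.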
What’s the difference between standard deviation and variance?

Variance is the average of the squared deviations from the mean, while standard deviation is the square root of this number. Both measures reflect variability in a distribution, but their units differ:

• Standard deviation is expressed in the same units as the original values (e.g., minutes or meters).
• Variance is expressed in squared units (e.g., meters squared).

Although the units of variance are harder to understand intuitively, variance is important in statistical tests.
What Is the Coefficient of Variation (CV)?

• The coefficient of variation (CV) is a statistical measure of the dispersion of data points in a data series around the mean. The coefficient of variation represents the ratio of the standard deviation to the mean, and it is a useful statistic for comparing the degree of variation from one data series to another, even if the means are drastically different from one another.

• The higher the coefficient of variation, the greater the level of dispersion around the mean. It is generally expressed as a percentage. Because it has no units, it allows for comparison between distributions of values whose scales of measurement are not comparable.

• When we are presented with estimated values, the CV relates the standard deviation of the estimate to the value of this estimate. The lower the value of the coefficient of variation, the more precise the estimate.
The coefficient of variation, or coefficient of dispersion, is denoted by CV:

$$CV = \frac{\text{Standard Deviation}}{\text{Average}} \times 100 = \frac{s}{\bar{X}} \times 100$$

For the television example (s = 6.03, X̄ = 29.82):

$$CV = \frac{6.03}{29.82} \times 100 = 20.2$$

The standard deviation of the weekly television-watching hours is therefore about 20.2 percent of the mean.
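As a quick check of the arithmetic, the ratio can be computed directly. This is a minimal sketch; `coefficient_of_variation` is a hypothetical helper, and the guard against a zero mean is an addition for this example, since the ratio is undefined in that case.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100

# Television example: s = 6.03 hours, mean = 29.82 hours.
cv = coefficient_of_variation(6.03, 29.82)
print(round(cv, 1))
```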


Exercise:
1. Calculate the range of the following set of observations: the hours a student spends on studies per day.
Monday: 8 hours, Tuesday: 4 hours, Wednesday: 2 hours, Thursday: 6 hours, Friday: 7 hours, Saturday: 1 hour
Find the variation in the hours spent on studies per day.

2. Find the SD of the given data (ungrouped data).

Heights of children in the same age group of 14 years: 600 mm, 470 mm, 170 mm, 430 mm and 300 mm.

Mean = (600 + 470 + 170 + 430 + 300) / 5
  = 1970 / 5
  = 394

So the mean (average) height is 394 mm.

To calculate the variance, take each difference from the mean, square it, and then average the results:

Variance
σ² = (206² + 76² + (−224)² + 36² + (−94)²) / 5
  = (42436 + 5776 + 50176 + 1296 + 8836) / 5
  = 108520 / 5
  = 21704

So the variance is 21,704 mm².

The standard deviation is just the square root of the variance, so:

Standard Deviation
σ = √21704
  = 147.32...
  = 147 (to the nearest mm)

The standard deviation is useful because it lets us show which heights are within one standard deviation (147 mm) of the mean, i.e. between 394 − 147 = 247 mm and 394 + 147 = 541 mm. So, using the standard deviation we have a "standard" way of knowing what is normal, and what is extra large or extra small.
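The height exercise can be verified with the standard `statistics` module. This short sketch uses the population form (dividing by N), as the worked solution does, and lists which heights fall within one standard deviation of the mean.

```python
import statistics

heights_mm = [600, 470, 170, 430, 300]

mean_height = statistics.mean(heights_mm)
variance = statistics.pvariance(heights_mm)   # divides by N, as in the exercise
sd = statistics.pstdev(heights_mm)

# Heights no further than one standard deviation from the mean.
within_one_sd = [h for h in heights_mm if abs(h - mean_height) <= sd]

print(mean_height, variance, round(sd, 2), within_one_sd)
```

Three of the five heights fall inside the one-standard-deviation band. The tallest and shortest values lie outside it.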
