
Unit 3


UNIT 3 PROPERTIES OF NUMERICAL DATA

Objectives
After going through this unit, you will be able to:
• State different measures of Central Tendency.
• State different measures of Dispersion.
Structure
3.1 Introduction
3.2 Measures of Central Tendency
3.3 Measures of Dispersion
3.4 Summary
3.5 Keywords

3.1 INTRODUCTION

The previous unit explained how statistical data can be tabulated and presented so that meaningful
inferences can be drawn at a glance.

Now imagine this situation: You are in a class with just four other students, and the five of you took a
5-point pop quiz. Today your instructor is walking around the room, handing back the quizzes. She
stops at your desk and hands you your paper. Written in bold black ink on the front is 3/5. How do you
react? Are you happy with your score of 3 or disappointed? How do you decide? You might calculate
your percentage correct, realize it is 60%, and be appalled. But it is more likely that when deciding
how to react to your performance, you will want additional information. What additional information
would you like?

If you are like most students, you will immediately ask your neighbours, “What have you got?” and
then ask the instructor, “How did the class do?” In other words, the additional information you want
is how your quiz score compares to other students' scores. You therefore understand the importance
of comparing your score to the class distribution of scores. Should your score of 3 turn out to be among
the higher grades then you will be pleased after all. On the other hand, if 3 is among the lowest scores
in the class, you will not be quite so happy.

Standard Notations
Measure              Sample   Population
Mean                 X̄        μ
Standard Deviation   s        σ
Variance             s²       σ²
Size                 n        N

This idea of comparing individual scores to a distribution of scores is fundamental to statistics. The
objective of statistical analysis is to determine numerical measures that summarize the characteristics
of a data set. Two such characteristics are “Central Tendency” and “Dispersion”, whose standard
notations are presented in the table above.

The measures of central tendency describe a distribution in terms of its most “frequent,” “typical,” or
“average” data value. But there are different ways of representing or expressing the idea of
“typicality”. The descriptive statistics most often used for this purpose are the Mean (the average),
the Mode (the most frequently occurring score), and the Median (the middle score). However, the
measures of central tendency say nothing about how the data are spread on either side of the central
value; for that, a measure of dispersion is needed. A small dispersion indicates high uniformity of
the observations, and a large dispersion indicates less uniformity.

[Fig 3.1: Properties of numerical data: central tendency (location), variation (same mean, different
dispersion), and shape.]

There is one important situation in which all three measures of central tendency are identical. This
occurs when a distribution has a single peak and is symmetrical, that is, when the right half of the distribution is the mirror
image of the left half of the distribution. In this case, the mean will fall exactly at the middle of the
distribution (the median position) and the value at this central point will be the most frequently
observed data value, the mode. To the extent that differences are observed among these three
measures, the distribution is asymmetrical or “skewed.” Asymmetry will occur whenever the
distribution contains one or more observations whose deviation from the mean is not matched by an
offsetting deviation in the opposite direction.

For example, an asymmetrical distribution may contain some values on the high end that are very far
from the mean and are not matched by corresponding values on the low end. The degree of discrepancy
between the median and the mean can then be interpreted as an indicator of skewness, or lack of
symmetry. If these two indices are identical, the distribution is taken to be symmetrical.

If the mean is greater than the median, the extreme values (or “outliers”) are located at the high end
of the distribution and the distribution is said to be “positively skewed”; when the outliers are at the
low end of the distribution, the mean will be less than the median and the distribution is said to be
“negatively skewed.”
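
The comparison of mean and median described above can be sketched in a few lines of Python. This is only an illustration, not a formal skewness statistic: the function name `skew_direction` is hypothetical, and the nine marks are used here simply as sample data with one unusually high value.

```python
from statistics import mean, median

def skew_direction(values):
    """Indicate the direction of skew by comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "positively skewed"   # extreme values at the high end
    if m < med:
        return "negatively skewed"   # extreme values at the low end
    return "no skew indicated"       # mean and median coincide

# Illustrative data: eight marks in the 30s and one high mark of 47.
marks = [47, 35, 37, 32, 38, 39, 36, 34, 35]
print(mean(marks), median(marks), skew_direction(marks))
```

Here the single high mark pulls the mean above the median, so the rule reports a positive skew.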

With respect to shape, another way in which a distribution can be characterized is in terms of
kurtosis, that is, whether the distribution is relatively flat, relatively peaked, or somewhere in
between. A leptokurtic distribution is relatively tall and narrow, indicating that the observations
are tightly clustered within a relatively narrow range of values. Informally, it is a distribution
with relatively little dispersion. A mesokurtic distribution has observations spread across a wider
range of values (the normal distribution is the usual mesokurtic reference), and a platykurtic
distribution is one where proportionately fewer cases are observed at any one value across a still
wider range. We could also say that mesokurtic and platykurtic distributions are increasingly
“flatter” and have increasingly greater dispersions.

3.2 MEASURES OF CENTRAL TENDENCY

The mean is defined as the arithmetic average of a set of numerical scores, that is, the sum of all the
numbers divided by the number of observations contributing to that sum.

Sample Mean:

\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n} \]

Population Mean:

\[ \mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N} \]

Where,
Xi = an individual score
N = Population Size
n = Sample Size
Sigma or Σ ≡ Take the sum

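Frequency Distribution:

The mean of a frequency distribution is a weighted mean, in which each distinct value X_i is weighted
by its frequency f_i:

\[ \bar{X} = \frac{\sum_{i} f_i X_i}{\sum_{i} f_i} \]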
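
As a minimal sketch of the weighted mean of a frequency distribution, the Python snippet below multiplies each value by its frequency, adds the products, and divides by the total frequency. The function name `frequency_mean` and the small frequency table are illustrative assumptions, not part of the course material.

```python
def frequency_mean(freq_table):
    """Mean of a frequency distribution: sum(f * x) / sum(f).

    freq_table maps each distinct value x to its frequency f.
    """
    total_freq = sum(freq_table.values())
    weighted_sum = sum(x * f for x, f in freq_table.items())
    return weighted_sum / total_freq

# Illustrative table: the value 48 occurs three times, the others once each.
freq_table = {42: 1, 44: 1, 45: 1, 48: 3, 49: 1}
print(frequency_mean(freq_table))
```

The result equals the ordinary mean of the seven underlying observations, which is why the two definitions agree.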

The mean (arithmetic mean) is the most common measure of central tendency, which represents the
‘Balance Point’. Note that mean value gets affected by extreme values (outliers).

[Figure: two dot plots. Left (axis 0 to 10): Mean = 5. Right (axis 0 to 14): Mean = 6.]

The marks of seven students in a mathematics test with a maximum possible mark of 20 are given
below: 15, 13, 18, 16, 14, 17, and 12. Find the mean of this set of data values.

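Solution:
Mean = (15 + 13 + 18 + 16 + 14 + 17 + 12) / 7 = 105 / 7 = 15
Symbolically, we can set out the solution as follows:

\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{105}{7} = 15 \]

So, the mean mark is 15.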
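
The same calculation can be checked with a short Python sketch. The function name `sample_mean` is an illustrative choice; the data are the seven marks from the example.

```python
def sample_mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)

marks = [15, 13, 18, 16, 14, 17, 12]
print(sum(marks), len(marks), sample_mean(marks))  # 105 7 15.0
```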

As the mean is affected by extreme values, a more robust measure of central tendency is required to
address the issue of outliers. The median can be used for this purpose, as it is not affected by
extreme values.

[Figure: two dot plots. Left (axis 0 to 10): Median = 5. Right (axis 0 to 14): Median = 5.]

Note that for an ordered array, the median is the “middle” number, only if n or N is odd. However, if
n or N is even, the median is the average of the two middle numbers.

The median of a set of data values is the middle value of the data set when it has been arranged in
ascending order, that is, from the smallest value to the highest value.

As an example, the marks of nine students in a geography test that had a maximum possible mark of 50
are given below. Find the median of this set of data values:
47 35 37 32 38 39 36 34 35
Solution:
Arrange the data values in order from the lowest value to the highest value:
32 34 35 35 36 37 38 39 47
The fifth data value, 36, is the middle value in this arrangement, that is, median = 36.
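Note:
The median is the ½(9 + 1)th value, that is, the 5th value.
In general:
Median = ½(n + 1)th value, where n is the number of data values in the sample.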
If the number of values in the data set is even, then the median is the average of the two middle
values.
Further, find the median of the following data set:
12 18 16 21 10 13 17 19
Solution:
Arrange the data values in order from the lowest value to the highest value:
10 12 13 16 17 18 19 21
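The number of values in the data set is 8, which is even. So, the median is the average of
the two middle values:
Median = (16 + 17) / 2 = 16.5

Alternative way:
There are 8 values in the data set, so the median is the ½(8 + 1)th value = 4.5th value.
The fourth and fifth scores, 16 and 17, are in the middle. That is, there is no one middle
value, and the median is their average: (16 + 17) / 2 = 16.5.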
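
Both cases, an odd and an even number of values, can be handled by one small routine. The following Python sketch is an illustration only; the function name `median_of` is hypothetical, and the two data sets are the ones used in the examples above.

```python
def median_of(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the two middle values

geography_marks = [47, 35, 37, 32, 38, 39, 36, 34, 35]
second_set = [12, 18, 16, 21, 10, 13, 17, 19]
print(median_of(geography_marks))  # 36
print(median_of(second_set))       # 16.5
```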

Note:
• Half of the values in the data set lie below the median and half lie above the
median.
• The median is the figure most commonly quoted for property prices. Using the
median avoids the problem with the mean property price, which is affected by a
few expensive properties that are not representative of the general property
market.

Yet another common measure of central tendency is the mode, which is the value that occurs most
often. Like the median, the mode is not affected by extreme values. It can be used for either
numerical or categorical data. A distribution may have no mode or several modes.

[Figure: two dot plots on an axis from 0 to 14. Left: Mode = 9. Right: No Mode.]

The mode has applications in printing. For example, it is important to print more of the most
popular books, because printing different books in equal numbers would cause a shortage of some
books and an oversupply of others.
Find the mode of the following data set:
48 44 48 45 42 49 48
Solution:
The mode is 48 since it occurs most often.
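
A data set may have one mode, several modes, or none. The Python sketch below returns a list so that all three cases can be represented. The function name `modes` is hypothetical, and treating "every value occurs exactly once" as "no mode" is an assumed convention for this illustration.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] when every value occurs only once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # assumed convention: no value repeats, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([48, 44, 48, 45, 42, 49, 48]))  # [48]
print(modes([1, 2, 3, 4]))                  # []
print(modes([1, 1, 2, 2, 3]))               # [1, 2]
```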
We have three measures of central tendency, but which is the most reliable measure to apply? No
single measure is best for every data set; the choice depends mostly on the characteristics of
the data. The mean is popular but not very helpful for data sets that contain outliers; the median
is robust to outliers; and the mode is simple to locate but of limited use in practical
applications. The following table may be referred to when choosing among these measures, based on
the measurement scale of the data set.

Scale of data      Mean   Median   Mode   Variance / Std. Deviation
Nominal            No     No       Yes    No
Ordinal            No     Yes      Yes    No
Interval / Ratio   Yes    Yes      Yes    Yes

3.3 MEASURES OF DISPERSION


The simplest measure of dispersion is range (the difference between the maximum and minimum
values). But if there is an outlier in the data, it would tend to be the minimum or maximum value.
Thus, the range is not robust to outliers.

\[ \text{Range} = X_{\text{Largest}} - X_{\text{Smallest}} \]

[Figure: two dot plots on an axis from 7 to 12. In both panels, Range = 12 - 7 = 5.]

The range is a measure of spread: it tells us how much a data set is spread out or scattered.
It is defined as the difference between the highest and lowest values. That is:
Range = Highest value - Lowest value
Find the range of the following data set:
8 4 13 35 63
Solution:
Range = 63 - 4 = 59
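
As a quick check of this example, the range can be computed in Python as the difference between the largest and smallest values. The function name `data_range` is an illustrative choice.

```python
def data_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)

scores = [8, 4, 13, 35, 63]
print(max(scores), min(scores), data_range(scores))  # 63 4 59
```

Because only the two extreme values enter the calculation, a single outlier changes the range directly.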

The standard deviation and the variance are popular measures of spread that are optimal for normally
distributed samples. The standard deviation is the square root of the variance and has the desirable
property of being in the same units as the data. That is, if the data are in meters, the standard deviation
is in meters as well. The variance is in square meters (meters²), which is more difficult to interpret.

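Sample Variance:

\[ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \]

Population Variance:

\[ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \]

Sample Standard Deviation:

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \]

Population Standard Deviation:

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \]
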
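
The difference between the sample formulas (divide by n - 1) and the population formulas (divide by N) can be illustrated with a short Python sketch. The function names and the `sample` flag are hypothetical choices for this illustration; the data are the seven test marks 15, 13, 18, 16, 14, 17 and 12, whose mean is 15.

```python
from math import sqrt

def variance(values, sample=True):
    """Sum of squared deviations from the mean, divided by n - 1 (sample) or n (population)."""
    n = len(values)
    centre = sum(values) / n
    squared_devs = sum((x - centre) ** 2 for x in values)
    return squared_devs / (n - 1 if sample else n)

def std_dev(values, sample=True):
    """Standard deviation: the square root of the variance."""
    return sqrt(variance(values, sample))

marks = [15, 13, 18, 16, 14, 17, 12]
# Squared deviations from the mean 15 are 0, 4, 9, 1, 1, 4, 9, which sum to 28.
print(variance(marks))                # sample variance: 28 / 6
print(variance(marks, sample=False))  # population variance: 28 / 7 = 4.0
print(std_dev(marks, sample=False))   # population standard deviation: 2.0
```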
[Figure: three data sets plotted on an axis from 11 to 21, each with Mean = 15.5.
Data A: s = 3.338. Data B: s = 0.9258. Data C: s = 4.57.]

The figure above compares data sets that have the same mean but different standard deviations. Note
also that neither the standard deviation nor the variance is robust to outliers: a data value that
is separate from the body of the data can increase the value of these statistics by an arbitrarily
large amount.

The mean absolute deviation (MAD) (not in syllabus) is also sensitive to outliers. But the MAD does
not move quite as much as the standard deviation or variance in response to outlying data.
The inter-quartile range (IQR) is the difference between the 75th and 25th percentiles of the data. Since
only the middle 50% of the data affects this measure, it is considered robust to outliers. Quartile
calculations are not covered in this course; however, for the benefit of interested readers, some
insights on the IQR are given below:

If a data set of scores is arranged in ascending order of magnitude, then:

• The median is the middle value of the data set.
• The lower quartile (Q1) is the median of the lower half of the data set.
• The upper quartile (Q3) is the median of the upper half of the data set.
• The interquartile range (IQR) is the spread of the middle 50% of the data values. So:
  Interquartile range = Upper quartile - Lower quartile, that is, IQR = Q3 - Q1

The interquartile range is a more useful measure of spread than the range, as it describes the middle
50% of the data values and thus is less affected by outliers.

Find the median, lower quartile, upper quartile and interquartile range of the following data set of
scores:
19 21 24 21 24 28 25 24 30
Solution:
Arrange the score values in ascending order of magnitude:
19 21 21 24 24 24 25 28 30

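There are 9 values in the data set.
Median = ½(9 + 1)th value = 5th value = 24
Lower quartile (Q1) = median of the lower half (19, 21, 21, 24) = (21 + 21) / 2 = 21
Upper quartile (Q3) = median of the upper half (24, 25, 28, 30) = (25 + 28) / 2 = 26.5
Interquartile range = Q3 - Q1 = 26.5 - 21 = 5.5

This means the middle 50% of the data values range from 21 to 26.5.
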
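
The "median of each half" method used in this example can be sketched in Python as follows. The function name `quartiles` is hypothetical, and the sketch assumes that the middle value is left out of both halves when the number of values is odd; other quartile conventions exist and can give slightly different results.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle value so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

scores = [19, 21, 24, 21, 24, 28, 25, 24, 30]
q1, q2, q3 = quartiles(scores)
print(q1, q2, q3, q3 - q1)  # Q1, median, Q3, interquartile range
```

For the nine scores above this gives a lower quartile of 21, a median of 24 and an upper quartile of 26.5, so the interquartile range is 5.5.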
In statistics, the coefficient of variation (CV) is a normalized measure of dispersion of a probability
distribution, defined as the ratio of the standard deviation to the mean:

\[ CV = \frac{\sigma}{\mu} \]

It is also known as unitized risk or the variation coefficient. The absolute value of the CV is
sometimes known as the relative standard deviation (RSD), which is expressed as a percentage. CV and
RSD should not be used interchangeably; one term should be used consistently.

The coefficient of variation should be computed only for data measured on a ratio scale, that is, for
measurements with a meaningful zero that can only take non-negative values. The coefficient of
variation may not have any meaning for data on an interval scale.

For example, most temperature scales (e.g. Celsius, Fahrenheit) are interval scales: they can take
both positive and negative values. The Kelvin scale has an absolute zero, and no negative values
can naturally occur. Hence, the Kelvin scale is a ratio scale. While the standard deviation (SD) can be
derived on both the Kelvin and the Celsius scale (with both leading to the same SD), the CV can only
be derived meaningfully for the Kelvin scale.

The CV is usually expressed as a percentage, in which case the ratio is multiplied by 100%.

The coefficient of variation is useful because the standard deviation of data must always be
understood in the context of the mean of the data. In contrast, the actual value of the CV is independent
of the unit in which the measurement has been taken, so it is a dimensionless number. For comparison
between data sets with different units or widely different means, one should use the coefficient of
variation instead of the standard deviation. However, when the mean value is close to zero, the
coefficient of variation approaches infinity and is therefore sensitive to small changes in the mean.
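
A minimal Python sketch of the coefficient of variation is given below. It is an illustration under stated assumptions: it uses the sample standard deviation divided by the sample mean, the function name `coefficient_of_variation` is hypothetical, and the data are the seven test marks 15, 13, 18, 16, 14, 17 and 12.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean (ratio-scale data only)."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / m

marks = [15, 13, 18, 16, 14, 17, 12]
cv = coefficient_of_variation(marks)
print(f"{cv:.2%}")  # CV expressed as a percentage

# Changing the unit (multiplying every value by 100) leaves the CV unchanged.
rescaled = [x * 100 for x in marks]
print(abs(coefficient_of_variation(rescaled) - cv) < 1e-9)
```

The second check illustrates why the CV is described as dimensionless: rescaling the data multiplies the standard deviation and the mean by the same factor.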

3.4 SUMMARY

This unit aims to find a representative figure for a large collection of data. Such data tend to
concentrate about a particular value, a property called ‘central tendency’. The unit also aims to
give more meaning to the data by describing its variation. This discussion is important because
measures of central tendency and measures of dispersion together give a fuller description of the
data for any distribution. It is generally said that if you have the mean and standard deviation of
a normal distribution, you know everything about that distribution, since the proportion of data
within given sigma limits is then determined.

3.5 KEYWORDS
• Mean – Average of the observations.
• Median – Middle observation after the observations are sorted from low to high.
• Mode – Most frequently occurring observation.
• Range – Difference between the largest and smallest observations.
• Variance – Measure of variability: the average of the squared deviations from the mean.
• Standard Deviation – Measure of variability in the same units as the observations; it is the
square root of the variance.
• Quartile – The first quartile has 25% of the observations below it. The third quartile has 25% of
the observations above it. The second quartile is the median.
• Inter-quartile Range – Difference between the third and first quartiles.
