MBA 102 Book
MBA-102
BUSINESS STATISTICS
Guru Jambheshwar University of Science & Technology
CONTENTS
Lesson No.   Name of Topic
1    An Introduction to Business Statistics
2    An Overview of Central Tendency
3    Dispersion and Skewness
4    Probability Theory
5    Probability Distributions-I
6    Probability Distributions-II
7    Sampling and Sampling Methods
8    Sampling Distributions
9    Statistical Estimation
10   Testing of Hypotheses
11   Non-Parametric Tests
12   Correlation Analysis
13   Regression Analysis
14   Analysis of Time Series
15   Index Number
16   Statistical Quality Control
STRUCTURE
1.1 Learning objective
1.2 Introduction
1.2.1 Meaning and Definitions of Statistics
1.2.2 Types of Data and Data Sources
1.2.3 Types of Statistics
1.2.4 Scope of Statistics
1.3 Importance of Statistics in Business
1.4 Limitations of statistics
1.5 Check your progress
1.6 Summary
1.7 Keywords
1.8 Self-Assessment Test
1.9 Answers to check your progress
1.10 References/Suggested Readings
1.2 INTRODUCTION
For a layman, 'Statistics' means numerical information expressed in quantitative terms. This
information may relate to objects, subjects, activities, phenomena, or regions of space. As a matter of
fact, data have no limits as to their reference, coverage, and scope. At the macro level, these are data on
gross national product and shares of agriculture, manufacturing, and services in GDP (Gross Domestic
Product). At the micro level, individual firms, howsoever small or large, produce extensive statistics on
their operations. The annual reports of companies contain a variety of data on sales, production,
expenditure, inventories, capital employed, and other activities. These data are often field data,
collected by employing scientific survey techniques. Unless regularly updated, such data are the product
of a one-time effort and have limited use beyond the situation that may have called for their collection.
A student knows statistics more intimately as a subject of study like economics, mathematics,
chemistry, physics, and others. It is a discipline, which scientifically deals with data, and is often
described as the science of data. In dealing with statistics as data, statistics has developed appropriate
methods of collecting, presenting, summarizing, and analysing data, and thus consists of a body of these
methods.
1.2.1 Meaning and definitions of Statistics
To begin with, it may be noted that the word 'statistics' is used, rather curiously, in two senses: plural
and singular. In the plural sense, it refers to a set of figures or data. In the singular sense, statistics refers
to the whole body of tools that are used to collect data, organise and interpret them and, finally, to draw
conclusions from them. It should be noted that both the aspects of statistics are important if the
quantitative data are to serve their purpose. If statistics, as a subject, is inadequate and consists of poor
methodology, we would not know the right procedure to extract from the data the information they
contain. Similarly, if our data are defective, inadequate, or inaccurate, we could not reach
the right conclusions even though our subject is well developed.
A.L. Bowley defined statistics in three ways: (i) statistics is the science of counting, (ii) statistics may rightly
be called the science of averages, and (iii) statistics is the science of measurement of the social organism
regarded as a whole in all its manifestations. Boddington defined it thus: statistics is the science of
estimates and probabilities. Further, W.I. King defined statistics in a wider context: the science of
statistics is the method of judging collective, natural or social phenomena from the results obtained by
the analysis or enumeration or collection of estimates.
Seligman stated that statistics is a science that deals with the methods of collecting, classifying,
presenting, comparing and interpreting numerical data collected to throw some light on any sphere of
enquiry. Spiegel defines statistics, highlighting its role in decision-making, particularly under
uncertainty, as follows: statistics is concerned with scientific methods for collecting, organising,
summarising, presenting and analysing data, as well as drawing valid conclusions and making reasonable
decisions on the basis of such analysis. According to Prof. Horace Secrist, statistics is the aggregate of
facts, affected to a marked extent by a multiplicity of causes, numerically expressed, enumerated or
estimated according to reasonable standards of accuracy, collected in a systematic manner for a
pre-determined purpose, and placed in relation to each other.
From the above definitions, we can highlight the major characteristics of statistics as follows:
(i) Statistics are the aggregates of facts. It means a single figure is not statistics. For example, national
income of a country for a single year is not statistics but the same for two or more years is statistics.
(ii) Statistics are affected by a number of factors. For example, sale of a product depends on a number
of factors such as its price, quality, competition, the income of the consumers, and so on.
(iii) Statistics must be reasonably accurate. Wrong figures, if analysed, will lead to erroneous conclusions.
Hence, it is necessary that conclusions be based on accurate figures.
(iv) Statistics must be collected in a systematic manner. If data are collected in a haphazard manner,
they will not be reliable and will lead to misleading conclusions.
(v) Statistics must be collected for a pre-determined purpose.
(vi) Lastly, Statistics should be placed in relation to each other. If one collects data unrelated to each
other, then such data will be confusing and will not lead to any logical conclusions. Data should be
comparable over time and over space.
1.2.2 Types of data and data sources
Statistical data are the basic raw material of statistics. Data may relate to an activity of our interest, a
phenomenon, or a problem situation under study. They arise as a result of the process of measuring,
counting and/or observing. Statistical data, therefore, refer to those aspects of a problem situation that
can be measured, quantified, counted, or classified. Any object, subject, phenomenon, or activity that
generates data through this process is termed a variable. In other words, a variable is one that shows a
degree of variability when successive measurements are recorded. In statistics, data are classified into
two broad categories: quantitative data and qualitative data. This classification is based on the kind of
characteristics that are measured.
Quantitative data are those that can be quantified in definite units of measurement. These refer to
characteristics whose successive measurements yield quantifiable observations. Depending on the
nature of the variable observed for measurement, quantitative data can be further categorized as
continuous and discrete data. Obviously, a variable may be a continuous variable or a discrete variable.
(i) Continuous data represent the numerical values of a continuous variable. A continuous variable is
one that can assume any value between any two points on a line segment, thus representing an
interval of values. The values are quite precise and close to each other, yet distinguishably different.
Characteristics such as weight, length, height, thickness, velocity, temperature, and tensile
strength represent continuous variables. Thus, the data recorded on these and similar other
characteristics are called continuous data. It may be noted that a continuous variable assumes the
finest unit of measurement, finest in the sense that it enables measurements to the maximum
degree of precision.
(ii) Discrete data are the values assumed by a discrete variable. A discrete variable is the one whose
outcomes are measured in fixed numbers. Such data are essentially count data. These are derived
from a process of counting, such as the number of items possessing or not possessing a certain
characteristic. The number of customers visiting a departmental store every day, the incoming
flights at an airport, and the defective items in a consignment received for sale, are all examples of
discrete data.
Qualitative data refer to qualitative characteristics of a subject or an object. A characteristic is
qualitative in nature when its observations are defined and noted in terms of the presence or absence of
a certain attribute in discrete numbers. These data are further classified as nominal and rank data.
(i) Nominal data are the outcome of classification into two or more categories of items or units
comprising a sample or a population according to some quality characteristic. Classification of
students according to sex (as males and females), of workers according to skill (as skilled,
semi-skilled, and unskilled), and of employees according to the level of education (as matriculates,
undergraduates, and post-graduates), all result in nominal data. Given any such basis of
classification, it is always possible to assign each item to a particular class and make a summation
of items belonging to each class. The count data so obtained are called nominal data.
(ii) Rank data, on the other hand, are the result of assigning ranks to specify order in terms of the
integers 1, 2, 3, ..., n. Ranks may be assigned according to the level of performance in a test, a
contest, a competition, an interview, or a show. The candidates appearing in an interview, for
example, may be assigned ranks in integers ranging from 1 to n, depending on their performance in
the interview. Ranks so assigned can be viewed as the ordered values of a variable involving
performance as the quality characteristic.
Data sources could be seen as of two types, viz., secondary and primary. The two can be defined as
under:
(i) Secondary data: These data already exist in some form, published or unpublished, in an identifiable
secondary source. They are, generally, available from published source(s), though not necessarily
in the form actually required.
(ii) Primary data: These data do not already exist in any form, and thus have to be collected for
the first time from the primary source(s). By their very nature, these data require fresh and
first-time collection covering the whole population or a sample drawn from it.
1.2.3 Types of statistics
There are two major divisions of statistics, namely descriptive statistics and inferential statistics.
Descriptive statistics deals with collecting, summarizing, and simplifying data, which are
otherwise quite unwieldy and voluminous. It seeks to achieve this in such a manner that meaningful
conclusions can be readily drawn from the data. Descriptive statistics may thus be seen as comprising
methods of bringing out and highlighting the latent characteristics present in a set of numerical data. It
not only facilitates an understanding of the data and systematic reporting thereof, but also
makes them amenable to further discussion, analysis, and interpretation.
The first step in any scientific inquiry is to collect data relevant to the problem in hand. When the
inquiry relates to physical and/or biological sciences, data collection is normally an integral part of the
experiment itself. In fact, the very manner in which an experiment is designed, determines the kind of
data it would require and/or generate. The problem of identifying the nature and the kind of the relevant
data is thus automatically resolved as soon as the design of experiment is finalized. It is possible in the
case of physical sciences. In the case of social sciences, where the required data are often collected
through a questionnaire from a number of carefully selected respondents, the problem is not so simply
resolved. For one thing, designing the questionnaire itself is a critical initial problem. For another, the
number of respondents to be accessed for data collection and the criteria for selecting them have their
own implications and importance for the quality of results obtained. Further, once the data have been
collected, they are assembled, organized, and presented in the form of appropriate tables to make them
readable. Wherever needed, figures, diagrams, charts, and graphs are also used for better presentation of
the data. A useful tabular and graphic presentation of data will require that the raw data be properly
classified in accordance with the objectives of investigation and the relational analysis to be carried out.
A well thought-out and sharp data classification facilitates easy description of the hidden data
characteristics by means of a variety of summary measures. These include measures of central
tendency, dispersion, skewness, and kurtosis, which constitute the essential scope of descriptive
statistics. These form a large part of the subject matter of any basic textbook on the subject, and thus
they are being discussed in that order here as well.
Inferential statistics, also known as inductive statistics, goes beyond describing a given problem
situation by means of collecting, summarizing, and meaningfully presenting the related data. Instead, it
consists of methods that are used for drawing inferences, or making broad generalizations, about a
totality of observations on the basis of knowledge about a part of that totality. The totality of
observations about which an inference may be drawn, or a generalization made, is called a population or
a universe. The part of totality, which is observed for data collection and analysis to gain knowledge
about the population, is called a sample.
The desired information about a given population of interest may also be collected by
observing all the units comprising the population. This total coverage is called a census. Getting the
desired value for the population through a census is not always feasible or practical for various reasons.
Apart from time and money considerations making the census operations prohibitive, observing each
individual unit of the population with reference to any data characteristic may at times involve even
destructive testing. In such cases, obviously, the only recourse available is to employ the partial or
incomplete information gathered through a sample for the purpose. This is precisely what inferential
statistics does. Thus, obtaining a particular value from the sample information and using it for drawing
an inference about the entire population underlies the subject matter of inferential statistics. Consider a
situation in which one is required to know the average body weight of all the college students in a given
cosmopolitan city during a certain year. A quick and easy way to do this is to record the weight of only
500 students, out of a total strength of, say, 10,000, or an unknown total strength, take the average,
and use this average based on incomplete weight data to represent the average body weight of all the
college students. In a different situation, one may have to repeat this exercise for some future year and
use the quick estimate of average body weight for a comparison. This may be needed, for example, to
decide whether the weight of the college students has undergone a significant change over the years
compared.
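The body-weight illustration above can be sketched in a few lines of Python. This is only a minimal sketch: the population is simulated, and the mean of 60 kg and standard deviation of 8 kg are assumptions made purely for illustration, not figures from the text. In practice the population mean would be unknown, and only the sample mean would be available.

```python
import random
import statistics

random.seed(42)  # fixed seed so that the sketch is reproducible

# Hypothetical population: body weights (kg) of 10,000 college students.
# The mean (60 kg) and standard deviation (8 kg) are illustrative assumptions.
population = [random.gauss(60, 8) for _ in range(10_000)]

# Simple random sample of 500 students, drawn without replacement.
sample = random.sample(population, 500)

population_mean = statistics.mean(population)  # normally unknown in practice
sample_mean = statistics.mean(sample)          # the estimate actually available

print(f"Population mean:  {population_mean:.2f} kg")
print(f"Sample mean:      {sample_mean:.2f} kg")
print(f"Estimation error: {sample_mean - population_mean:+.2f} kg")
```

Changing the seed draws a different sample and gives a slightly different sample mean, which is the sampling variation that inferential statistics sets out to measure.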
Inferential statistics helps to evaluate the risks involved in reaching inferences or generalizations about
an unknown population on the basis of sample information. For example, an inspection of a sample of
five battery cells drawn from a given lot may reveal that all the five cells are in perfectly good
condition. This information may be used to conclude that the entire lot is good enough to buy.
Since this inference is based on the examination of a sample of a limited number of cells, it is quite
possible that not all the cells in the lot are in order. It is also possible that all the items that may be
included in the sample are unsatisfactory. This may be used to conclude that the entire lot is of
unsatisfactory quality, whereas the fact may indeed be otherwise. It may, thus, be noticed that there is
always a risk of an inference about a population being incorrect when based on the knowledge of a
limited sample. The remedy in such situations lies in evaluating such risks. For this, statistics provides
the necessary methods. These centre on quantifying in probabilistic terms the chances of decisions taken
on the basis of sample information being incorrect. This requires an understanding of the what, why,
and how of probability and probability distributions to equip ourselves with methods of drawing
statistical inferences and estimating the degree of reliability of these inferences.
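The risk in the battery-cell illustration can be quantified with a short calculation. The sketch below assumes a lot of 100 cells and a few hypothetical numbers of defective cells (neither figure is given in the text) and computes the chance that a random sample of five cells, drawn without replacement, contains no defective cell at all.

```python
from math import comb

def prob_all_sampled_good(lot_size: int, defectives: int, sample_size: int) -> float:
    """Probability that a random sample drawn without replacement has no defective item."""
    good = lot_size - defectives
    # Favourable samples (all good) divided by all possible samples.
    return comb(good, sample_size) / comb(lot_size, sample_size)

# Hypothetical lot of 100 cells; vary the number of defective cells in it.
for defectives in (0, 10, 20, 50):
    p = prob_all_sampled_good(lot_size=100, defectives=defectives, sample_size=5)
    print(f"{defectives:>2} defective cells in lot -> P(all 5 sampled cells good) = {p:.3f}")
```

Even when a fifth of the lot is defective, a sample of five will show no defect roughly one time in three, so a clean sample is far from proof of a good lot.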
1.2.4 Scope of statistics
Apart from the methods comprising the scope of descriptive and inferential branches of statistics,
statistics also consists of methods of dealing with a few other issues of specific nature. Since these
methods are essentially descriptive in nature, they have been discussed here as part of the descriptive
statistics. These are mainly concerned with the following:
(i) It often becomes necessary to examine how two paired data sets are related. For example, we may
have data on the sales of a product and the expenditure incurred on its advertisement for a specified
number of years. Given that sales and advertisement expenditure are related to each other, it is
useful to examine the nature of relationship between the two and quantify the degree of that
relationship. As this requires the use of appropriate statistical methods, these fall under the purview of
course, no surprise to find even historians statistical data, for history is essentially past data presented in
certain actual format.
1.3 IMPORTANCE OF STATISTICS IN BUSINESS
There are three major functions in any business enterprise in which the statistical methods are useful.
These are as follows:
(i) The planning of operations: This may relate to either special projects or to the recurring activities
of a firm over a specified period.
(ii) The setting up of standards: This may relate to the size of employment, volume of sales, fixation
of quality norms for the manufactured product, norms for the daily output, and so forth.
(iii) The function of control: This involves comparison of actual production achieved against the norm
or target set earlier. In case the production has fallen short of the target, remedial measures are taken
so that such a deficiency does not occur again.
It is worth noting that although these three functions (planning of operations, setting standards,
and control) are separate, in practice they are very much interrelated.
Different authors have highlighted the importance of Statistics in business. For instance, Croxton and
Cowden give numerous uses of Statistics in business such as project planning, budgetary planning and
control, inventory planning and control, quality control, marketing, production and personnel
administration. Within these also they have specified certain areas where Statistics is very relevant.
Another author, Irving W. Burr, dealing with the place of statistics in an industrial organisation, specifies a
number of areas where statistics is extremely useful. These are: customer wants and market research,
development design and specification, purchasing, production, inspection, packaging and shipping,
sales and complaints, inventory and maintenance, costs, management control, industrial engineering and
research.
Statistical problems arising in the course of business operations are multitudinous. As such, one may do
no more than highlight some of the more important ones to emphasise the relevance of statistics to the
business world. In the sphere of production, for example, statistics can be useful in various ways.
Statistical quality control methods are used to ensure the production of quality goods. This is achieved
by identifying and rejecting defective or substandard goods. Sales targets can be fixed on the basis of sales
forecasts, which are made using various methods of forecasting. Analysis of sales effected against
the targets set earlier would indicate the deficiency in achievement, which may be on account of several
causes: (i) targets were too high and unrealistic, (ii) salesmen's performance has been poor, (iii)
competition has emerged or increased, (iv) the quality of the company's product is poor, and so on. These factors
can be further investigated.
Another sphere in business where statistical methods can be used is personnel management. Here, one is
concerned with the fixation of wage rates, incentive norms and performance appraisal of individual
employees. The concept of productivity is very relevant here. On the basis of measurement of
productivity, the productivity bonus is awarded to the workers. Comparisons of wages and productivity
are undertaken in order to ensure increases in industrial productivity.
Statistical methods could also be used to ascertain the efficacy of a certain product, say, medicine. For
example, a pharmaceutical company has developed a new medicine in the treatment of bronchial
asthma. Before launching it on commercial basis, it wants to ascertain the effectiveness of this
medicine. It undertakes an experiment involving the formation of two comparable groups of
asthma patients. One group is given this new medicine for a specified period and the other one is treated
with the usual medicines. Records are maintained for the two groups for the specified period. This
record is then analysed to ascertain if there is any significant difference in the recovery of the two
groups. If the difference is really significant statistically, the new medicine is commercially launched.
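The text does not say which test is used to judge whether the difference is significant. One common choice, assumed here, is a two-proportion z-test on the recovery rates of the two groups. The patient counts below are hypothetical and serve only to show the arithmetic.

```python
from math import erf, sqrt

def two_proportion_z_test(recovered_a: int, n_a: int, recovered_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference between two recovery rates."""
    p_a = recovered_a / n_a
    p_b = recovered_b / n_b
    # Pooled recovery rate under the hypothesis that both treatments work equally well.
    pooled = (recovered_a + recovered_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_error
    # Standard normal CDF evaluated at |z|, written in terms of the error function.
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return z, p_value

# Hypothetical records: 70 of 100 patients recover on the new medicine,
# 55 of 100 recover on the usual medicines.
z, p_value = two_proportion_z_test(70, 100, 55, 100)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.3f}")
```

A small p-value (conventionally below 0.05) would be read as evidence that the difference in recovery is not merely due to chance; the choice of threshold is itself a convention, not something fixed by the data.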
1.4 LIMITATIONS OF STATISTICS
Statistics has a number of limitations. The pertinent ones among them are as follows:
(i) There are certain phenomena or concepts where statistics cannot be used. This is because these
phenomena or concepts are not amenable to measurement. For example, beauty, intelligence,
courage cannot be quantified. Statistics has no place in all such cases where quantification is not
possible.
(ii) Statistics reveal the average behaviour, the normal or the general trend. The 'average' concept,
if applied to an individual or a particular situation, may lead to a wrong conclusion
and sometimes may be disastrous. For example, one may be misguided when told that the average
depth of a river from one bank to the other is four feet, when there may be some points in between
where its depth is far more than four feet. On this understanding, one may enter the river at points having
greater depth, which may be hazardous.
(iii) Since statistics are collected for a particular purpose, such data may not be relevant or useful in
other situations or cases. For example, secondary data (i.e., data originally collected by someone
else) may not be useful for the other person.
(iv) Statistics are not 100 per cent precise in the way that Mathematics or Accountancy is. Those who use statistics
should be aware of this limitation.
(v) In statistical surveys, sampling is generally used as it is not physically possible to cover all the units
or elements comprising the universe. The results may not be appropriate as far as the universe is
concerned. Moreover, different surveys based on the same size of sample but different sample units
may yield different results.
(vi) At times, association or relationship between two or more variables is studied in statistics, but such
a relationship does not indicate a 'cause and effect' relationship. It simply shows the similarity or
dissimilarity in the movement of the two variables. In such cases, it is the user who has to interpret
the results carefully, pointing out the type of relationship obtained.
(vii) A major limitation of statistics is that it does not reveal everything pertaining to a certain phenomenon.
There is some background information that statistics does not cover. Similarly, there are some other
aspects related to the problem in hand, which are also not covered. The user of Statistics has to be
well informed and should interpret Statistics keeping in mind all other aspects having relevance to
the given problem.
Apart from the limitations of statistics mentioned above, there are misuses of it. Many people, knowingly
or unknowingly, use statistical data in the wrong manner. Let us see what the main misuses of
statistics are so that they can be avoided when one has to use statistical data. The misuse of
Statistics may take several forms, some of which are explained below.
(i) Sources of data not given: At times, the source of data is not given. In the absence of the source,
the reader does not know how far the data are reliable. Further, if he wants to refer to the original
source, he is unable to do so.
(ii) Defective data: Another misuse is that sometimes one gives defective data. This may be done
knowingly in order to defend one's position or to prove a particular point. This apart, the definition
used to denote a certain phenomenon may be defective. For example, in case of data relating to
unemployed persons, the definition may include even those who are employed, though partially.
The question here is how far it is justified to include partially employed persons amongst
unemployed ones.
(iii) Unrepresentative sample: In statistics, one often has to conduct a survey, which necessitates
choosing a sample from the given population or universe. The sample may turn out to be
unrepresentative of the universe. One may choose a sample just on the basis of convenience. He
may collect the desired information from either his friends or nearby respondents in his
neighbourhood even though such respondents do not constitute a representative sample.
(iv) Inadequate sample: Earlier, we have seen that a sample that is unrepresentative of the universe is
a major misuse of statistics. This apart, at times one may conduct a survey based on an extremely
inadequate sample. For example, in a city we may find that there are 1,00,000 households. When
we have to conduct a household survey, we may take a sample of merely 100 households
comprising only 0.1 per cent of the universe. A survey based on such a small sample may not yield
the right information.
(v) Unfair Comparisons: An important misuse of statistics is making unfair comparisons from the
data collected. For instance, one may construct an index of production choosing the base year
where the production was much less. Then he may compare the subsequent year's production from
this low base. Such a comparison will undoubtedly give a rosy picture of the production though in
reality it is not so. Another source of unfair comparisons could be when one makes absolute
comparisons instead of relative ones. An absolute comparison of two figures, say, of production or
export, may show a good increase, but in relative terms it may turn out to be negligible.
Another example of unfair comparison is when the population in two cities is different, but a
comparison of overall death rates and deaths by a particular disease is attempted. Such a
comparison is wrong. Likewise, when data are not properly classified or when changes in the
composition of population in the two years are not taken into consideration, comparisons of such
data would be unfair as they would lead to misleading conclusions.
(vi) Unwarranted conclusions: Another misuse of statistics may be on account of unwarranted
conclusions. This may be as a result of making false assumptions. For example, while making
projections of population in the next five years, one may assume a lower rate of growth though the
past two years indicate otherwise. Sometimes one may not be sure about the changes in business
environment in the near future. In such a case, one may use an assumption that may turn out to be
wrong. Another source of unwarranted conclusion may be the use of wrong average. Suppose in a
series there are extreme values, one is too high while the other is too low, such as 800 and 50. The
use of an arithmetic average in such a case may give a wrong idea. Instead, harmonic mean would
be proper in such a case.
(vii) Confusion of correlation and causation: In statistics, one often has to examine the
relationship between two variables. A close relationship between the two variables may not
establish a cause-and-effect relationship in the sense that one variable is the cause and the other is
the effect. Such a relationship should be taken as a measure of the degree of association rather than
as evidence of a causal relationship.
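The remark under (vi) about the choice of average can be checked numerically. The sketch below uses a hypothetical series (only the extreme values 50 and 800 come from the text; the middle values are assumed) and compares the arithmetic mean with the harmonic mean mentioned above and with the median taken up in Lesson 2.

```python
import statistics

# Hypothetical series: 50 and 800 are the extreme values mentioned in the text,
# the three middle values are assumed for illustration.
series = [50, 200, 210, 220, 800]

arithmetic_mean = statistics.mean(series)
median = statistics.median(series)
harmonic_mean = statistics.harmonic_mean(series)

print(f"Arithmetic mean: {arithmetic_mean}")
print(f"Median:          {median}")
print(f"Harmonic mean:   {harmonic_mean:.1f}")
```

For this series the arithmetic mean lies above four of the five values because it is pulled up by the single value of 800. Which average is appropriate depends on the nature of the data and the purpose of the comparison.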
1.5 CHECK YOUR PROGRESS
There are some activities to check your progress. Answer the following:
1. Statistics are affected to a marked extent by multiplicity of ……… .
2. Statistics helps in the ……………….. of suitable policies.
3. Statistics bring definiteness and …………… in conclusions by expressing them numerically.
4. Inferential statistics is also known as ………… statistics.
5. Unwarranted conclusions may be the result of making ………….. assumptions.
1.6 SUMMARY
In a summarized manner, 'Statistics' means numerical information expressed in quantitative terms. As a
matter of fact, data have no limits as to their reference, coverage, and scope. At the macro level, these
are data on gross national product and shares of agriculture, manufacturing, and services in GDP (Gross
Domestic Product). At the micro level, individual firms, howsoever small or large, produce extensive
statistics on their operations. The annual reports of companies contain a variety of data on sales,
production, expenditure, inventories, capital employed, and other activities. These data are often field
data, collected by employing scientific survey techniques. Unless regularly updated, such data are the
product of a one-time effort and have limited use beyond the situation that may have called for their
collection. A student knows statistics more intimately as a subject of study like economics,
mathematics, chemistry, physics, and others. It is a discipline, which scientifically deals with data, and
is often described as the science of data. In dealing with statistics as data, statistics has developed
appropriate methods of collecting, presenting, summarizing, and analyzing data, and thus consists of a
body of these methods.
1.7 KEYWORDS
Statistics: It is a science that deals with the methods of collecting, classifying, presenting, comparing
and interpreting numerical data collected to throw some light on any sphere of enquiry.
Descriptive statistics: It deals with collecting, summarizing, and simplifying data, which are otherwise
quite unwieldy and voluminous.
Inferential statistics: It goes beyond describing a given problem situation by means of collecting,
summarizing, and meaningfully presenting the related data.
1.8 SELF-ASSESSMENT TEST
1. Define Statistics. Explain its types, and importance to trade, commerce and business.
2. "Statistics is all-pervading". Elucidate this statement.
3. Write a note on the scope and limitations of Statistics.
4. What are the major limitations of Statistics? Explain with suitable examples.
5. Distinguish between descriptive Statistics and inferential Statistics.
1.9 ANSWERS TO CHECK YOUR PROGRESS
1. Causes
2. Formulation
3. Precision
4. Inductive
5. False
1.10 REFERENCES/SUGGESTED READINGS
1. Gupta, S. P.: Statistical Methods, Sultan Chand and Sons, New Delhi.
2. Hooda, R. P.: Statistics for Business and Economics, Macmillan, New Delhi.
3. Hein, L. W.: Quantitative Approach to Managerial Decisions, Prentice Hall, NJ.
4. Levin, Richard I. and David S. Rubin: Statistics for Management, Prentice Hall, New Delhi.
5. Morse, Lawrence B.: Statistics for Business & Economics, Harper Collins, NY.
6. Watsham, Terry J. and Keith Parramore: Quantitative Methods in Finance, International Thomson
Business Press, London.
STRUCTURE
2.1 Learning Objectives
2.2 Introduction
2.2.1 Arithmetic Mean
2.2.2 Median
2.2.3 Mode
2.2.4 Relationships of the Mean, Median and Mode
2.2.5 The Best Measure of Central Tendency
2.2.6 Geometric Mean
2.2.7 Harmonic Mean
2.2.8 Quadratic Mean
2.3 Check your progress
2.4 Summary
2.5 Keywords
2.6 Self-Assessment Test
2.7 Answers to check your progress
2.8 References/Suggested Readings
The simple arithmetic mean will be $\bar{x} = \frac{30 + 25 + 20}{3} = 25$
However, this will be wrong if the three tests carry different weights on the basis of their relative
importance. Assuming that the weights assigned to the three tests are:
Mid-term test 2 points
Laboratory 3 points
Final 5 points
Solution: On the basis of this information, we can now calculate a weighted mean as shown below:
Table 2.1: Calculation of a Weighted Mean
Type of Test Relative Weight (w) Marks (x) (wx)
Mid-term 2 30 60
Laboratory 3 25 75
Final 5 20 100
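Total Σw = 10 Σwx = 235
$\bar{x}_w = \frac{\sum wx}{\sum w} = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3}{w_1 + w_2 + w_3} = \frac{60 + 75 + 100}{2 + 3 + 5} = \frac{235}{10} = 23.5$ marks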
It will be seen that weighted mean gives a more realistic picture than the simple or unweighted mean.
Example 2.2: An investor is fond of investing in equity shares. During a period of falling prices in the
stock exchange, a stock is sold at Rs 120 per share on one day, Rs 105 on the next and Rs 90 on the
third day. The investor has purchased 50 shares on the first day, 80 shares on the second day and 100
shares on the third day. What average price per share did the investor pay?
Solution:
Table 2.2: Calculation of Weighted Average Price
Day Price per Share (Rs) (x) No of Shares Purchased (w) Amount Paid (wx)
1 120 50 6000
2 105 80 8400
3 90 100 9000
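Total - Σw = 230 Σwx = 23,400
Weighted average $= \frac{\sum wx}{\sum w} = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3}{w_1 + w_2 + w_3} = \frac{6000 + 8400 + 9000}{50 + 80 + 100} = \frac{23400}{230}$ = Rs 101.7 per share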
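The two weighted-mean calculations above can be checked with a few lines of Python. This is only a minimal sketch; the function name `weighted_mean` is a hypothetical helper introduced for illustration and does not come from the text.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Example 2.1: marks weighted by the relative importance of each test
marks_mean = weighted_mean([30, 25, 20], [2, 3, 5])        # 235 / 10

# Example 2.2: share prices weighted by the number of shares purchased
price_mean = weighted_mean([120, 105, 90], [50, 80, 100])  # 23,400 / 230

print(round(marks_mean, 1), round(price_mean, 1))  # 23.5 101.7
```

In Example 2.2 the unweighted average of the three prices would be Rs 105, which overstates the price actually paid because most shares were bought on the cheapest day.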
10-20 15 8 120
20-30 25 11 275
30-40 35 15 525
40-50 45 12 540
50-60 55 6 330
60-70 65 2 130
Σfm = 1940
$\bar{x} = \frac{\sum fm}{n} = \frac{1940}{58} = 33.45$ marks or 33 marks approximately.
It may be noted that the mid-point of each class is taken as a good approximation of the true mean of the
class. This is based on the assumption that the values are distributed fairly evenly throughout the
interval. When the number of observations is large, this assumption is usually acceptable. In the case of the
short-cut method, the concept of arbitrary mean is followed. The formula for calculation of the
arithmetic mean by the short-cut method is given below:
$\bar{x} = A + \frac{\sum fd}{n}$
Where A = arbitrary or assumed mean
f = frequency
d = deviation from the arbitrary or assumed mean
When the values are extremely large and/or in fractions, the use of the direct method would be very
cumbersome. In such cases, the short-cut method is preferable. This is because the calculation work in
the short-cut method is considerably reduced particularly for calculation of the product of values and
their respective frequencies. However, when calculations are not made manually but by a machine
calculator, it may not be necessary to resort to the short-cut method, as the use of the direct method may
not pose any problem.
As can be seen from the formula used in the short-cut method, an arbitrary or assumed mean is used.
The second term in the formula ($\sum fd / n$) is the correction factor for the difference between the actual
mean and the assumed mean. If the assumed mean turns out to be equal to the actual mean, $\sum fd / n$
will be zero. The use of the short-cut method is based on the principle that the total of deviations taken
from the actual mean is equal to zero. As such, the total of deviations taken from any other figure will
depend on how the assumed mean is related to the actual mean. While one may choose any value as the
assumed mean, it is advisable, in order to simplify calculations, to avoid extreme values, that is, values
that are too small or too large. A value apparently close to the arithmetic mean should be chosen.
For the figures given earlier pertaining to marks obtained by 58 students, we calculate the average
marks by using the short-cut method.
Example 2.4:
Table 2.4: Calculation of Arithmetic Mean by Short-cut Method
Marks   Mid-point (m)   f    d     fd
0-10    5               4    -30   -120
10-20   15              8    -20   -160
20-30   25              11   -10   -110
30-40   35              15   0     0
40-50   45              12   10    120
50-60   55              6    20    120
60-70   65              2    30    60
                                   Σfd = -90
It may be noted that we have taken arbitrary mean as 35 and deviations from midpoints. In other words,
the arbitrary mean has been subtracted from each value of mid-point and the resultant figure is shown in
column d.
$\bar{x} = A + \frac{\sum fd}{n} = 35 + \frac{-90}{58} = 35 - 1.55 = 33.45$ or 33 marks approximately.
Now we take up the calculation of arithmetic mean for the same set of data using the step-deviation
method. This is shown in Table 2.5.
Table 2.5: Calculation of Arithmetic Mean by Step-deviation Method
Marks   Mid-point   f    d     d' = d/10   fd'
0-10    5           4    -30   -3          -12
10-20   15          8    -20   -2          -16
20-30   25          11   -10   -1          -11
30-40   35          15   0     0           0
40-50   45          12   10    1           12
50-60   55          6    20    2           12
60-70   65          2    30    3           6
                                           Σfd' = -9
$\bar{x} = A + \frac{\sum fd'}{n} \times C = 35 + \frac{-9}{58} \times 10 = 33.45$ or 33 marks approximately.
It will be seen that the answer in each of the three cases is the same. The step-deviation method is the
most convenient on account of simplified calculations. It may also be noted that if we select a different
arbitrary mean and recalculate deviations from that figure, we would get the same answer.
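The agreement of the three methods can be verified numerically. The sketch below uses the mid-points and frequencies of the 58 students from the tables above; the function names are hypothetical helpers chosen for this illustration.

```python
midpoints = [5, 15, 25, 35, 45, 55, 65]
freqs = [4, 8, 11, 15, 12, 6, 2]


def direct_mean(m, f):
    """Direct method: sum(f * m) / n."""
    return sum(fi * mi for mi, fi in zip(m, f)) / sum(f)


def shortcut_mean(m, f, assumed):
    """Short-cut method: A + sum(f * d) / n, with d = m - A."""
    n = sum(f)
    return assumed + sum(fi * (mi - assumed) for mi, fi in zip(m, f)) / n


def step_deviation_mean(m, f, assumed, width):
    """Step-deviation method: A + (sum(f * d') / n) * C, with d' = (m - A) / C."""
    n = sum(f)
    total = sum(fi * ((mi - assumed) / width) for mi, fi in zip(m, f))
    return assumed + total / n * width


direct = direct_mean(midpoints, freqs)
shortcut = shortcut_mean(midpoints, freqs, assumed=35)
step = step_deviation_mean(midpoints, freqs, assumed=35, width=10)
# A different assumed mean gives the same answer
other = shortcut_mean(midpoints, freqs, assumed=25)

print(round(direct, 2), round(shortcut, 2), round(step, 2), round(other, 2))
```

All four values round to 33.45, which illustrates that the assumed mean only affects the size of the intermediate figures, not the result.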
Now that we have learnt how the arithmetic mean can be calculated by using different methods, we are
in a position to handle any problem where calculation of the arithmetic mean is involved.
Example 2.6: The mean of the following frequency distribution was found to be 1.46.
No. of Accidents No. of Days (frequency)
0 46
1 ?
2 ?
3 25
4 10
5 5
Total 200 days
Now, as the series consists of an odd number of items, to find the position of the middle item we use the
formula $\frac{n+1}{2}$, where n is the number of items. In this case, n is 9, so $\frac{n+1}{2} = 5$, that is, the
size of the 5th item is the median. This happens to be 18. Suppose the series contains one more item,
23. We then have to include 23 in the above series at the appropriate place, that is, between
21 and 25. Thus, the series is now 5, 7, 10, 15, 18, 19, 21, 23, 25, 33. Applying the above formula,
the median is the size of the 5.5th item. Here, we have to take the average of the values of the 5th and
6th items. This means an average of 18 and 19, which gives the median as 18.5.
It may be noted that the formula $\frac{n+1}{2}$ itself is not the formula for the median; it merely indicates the
position of the median, namely, the number of items we have to count until we arrive at the item whose
value is the median. In the case of the even number of items in the series, we identify the two items
whose values have to be averaged to obtain the median. In the case of a grouped series, the median is
calculated by linear interpolation with the help of the following formula:
$M = l_1 + \frac{l_2 - l_1}{f}(m - c)$
Where M = the median
$l_1$ = the lower limit of the class in which the median lies
$l_2$ = the upper limit of the class in which the median lies
f = the frequency of the class in which the median lies
m = the middle item or (n + 1)/2th, where n stands for total number of items
c = the cumulative frequency of the class preceding the one in which the median lies
Example 2.7:
Monthly Wages (Rs) No. of Workers
800-1,000 18
1,000-1,200 25
1,200-1,400 30
1,400-1,600 34
1,600-1,800 26
1,800-2,000 10
Total 143
In order to calculate median in this case, we have to first provide cumulative frequency to the table.
Thus, the table with the cumulative frequency is written as:
Monthly Wages (Rs)   Frequency   Cumulative Frequency
800-1,000            18          18
1,000-1,200          25          43
1,200-1,400          30          73
1,400-1,600          34          107
1,600-1,800          26          133
1,800-2,000          10          143
$M = l_1 + \frac{l_2 - l_1}{f}(m - c)$
$m = \frac{n+1}{2} = \frac{143+1}{2} = 72$
It means the median lies in the class-interval Rs 1,200 - 1,400.
Now, $M = 1200 + \frac{1400 - 1200}{30}(72 - 43) = 1200 + \frac{200}{30}(29)$ = Rs 1393.3
At this stage, let us introduce two other concepts, viz. quartile and decile. To understand these, we
should first know that the median belongs to a general class of statistical descriptions called fractiles. A
fractile is a value below which lies a given fraction of a set of data. In the case of the median, this fraction
is one-half (1/2). Likewise, a quartile has a fraction one-fourth (1/4). The three quartiles Q1, Q2 and Q3
are such that 25 percent of the data fall below Q1, 25 percent fall between Q1 and Q2, 25 percent fall
between Q2 and Q3 and 25 percent fall above Q3. It will be seen that Q2 is the median. We can use the
above formula for the calculation of quartiles as well. The only difference will be in the value of m. Let
us calculate both Q1 and Q3 in respect of the table given in Example 2.7.
$Q_1 = l_1 + \frac{l_2 - l_1}{f}(m - c)$
Here, m will be $\frac{n+1}{4} = \frac{143+1}{4} = 36$
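The interpolation formula is the same for the median and for the quartiles; only the position m changes. The following is a minimal sketch applied to the wage data of Example 2.7. The function name `grouped_fractile` is a hypothetical helper, and taking m = 3(n + 1)/4 for Q3 is an assumption that follows the same convention as the (n + 1)/4 used for Q1.

```python
def grouped_fractile(lower_limits, upper_limits, freqs, position):
    """Interpolate l1 + (l2 - l1) / f * (m - c) inside the class holding item m."""
    cumulative = 0  # c: cumulative frequency of the classes already passed
    for low, high, f in zip(lower_limits, upper_limits, freqs):
        if f > 0 and cumulative + f >= position:
            return low + (high - low) / f * (position - cumulative)
        cumulative += f
    raise ValueError("position exceeds the total number of items")


lower = [800, 1000, 1200, 1400, 1600, 1800]
upper = [1000, 1200, 1400, 1600, 1800, 2000]
workers = [18, 25, 30, 34, 26, 10]
n = sum(workers)  # 143

median = grouped_fractile(lower, upper, workers, (n + 1) / 2)   # m = 72
q1 = grouped_fractile(lower, upper, workers, (n + 1) / 4)       # m = 36
q3 = grouped_fractile(lower, upper, workers, 3 * (n + 1) / 4)   # m = 108

print(round(q1, 1), round(median, 1), round(q3, 1))
```

The median reproduces the Rs 1393.3 obtained in Example 2.7; under the stated convention the sketch places Q1 in the 1,000-1,200 class and Q3 in the 1,600-1,800 class.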
Solution: Since the data have two open-end classes-one in the beginning (below 50) and the other at the
end (200 and above), median should be the right choice as a measure of central tendency.
Table 2.6: Computation of Median
Size of Item Frequency Cumulative Frequency
Below 50 15 15
50-100 20 35
100-150 36 71
150-200 40 111
200 and above 10 121
Median is the size of the $\frac{n+1}{2}$th item $= \frac{121+1}{2}$ = 61st item
Now, the 61st item lies in the 100-150 class.
Median $= l_1 + \frac{l_2 - l_1}{f}(m - c) = 100 + \frac{150 - 100}{36}(61 - 35) = 100 + 36.11 = 136.11$ approx.
Example 2.9: The following data give the savings bank account balances of nine sample households
selected in a survey. The figures are in rupees.
745 2,000 1,500 68,000 461 549 3,750 1,800 4,795
(a) Find the mean and the median for these data; (b) Do these data contain an outlier? If so, exclude this
value and recalculate the mean and median. Which of these summary measures has a greater change
when an outlier is dropped?; (c) Which of these two summary measures is more appropriate for this
series?
Solution:
(a) Mean $= \frac{745 + 2000 + 1500 + 68000 + 461 + 549 + 3750 + 1800 + 4795}{9} = \frac{83600}{9}$ = Rs 9,289
Median = Size of the $\frac{n+1}{2}$th item $= \frac{9+1}{2}$ = 5th item
Arranging the data in an ascending order, we find that the median is Rs 1,800.
(b) An item of Rs 68,000 is excessively high. Such a figure is called an 'outlier'. We exclude this figure
and recalculate both the mean and the median.
Mean $= \frac{83600 - 68000}{8} = \frac{15600}{8}$ = Rs 1,950
Median = Size of the $\frac{n+1}{2}$th item $= \frac{8+1}{2}$ = 4.5th item $= \frac{1500 + 1800}{2}$ = Rs 1,650
It will be seen that the mean shows a far greater change than the median when the outlier is dropped
from the calculations.
(c) As far as these data are concerned, the median will be a more appropriate measure than the mean.
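The effect of the outlier in Example 2.9 can be reproduced with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen for illustration only.

```python
from statistics import mean, median

balances = [745, 2000, 1500, 68000, 461, 549, 3750, 1800, 4795]
without_outlier = [b for b in balances if b != 68000]

mean_all, median_all = mean(balances), median(balances)
mean_trim, median_trim = mean(without_outlier), median(without_outlier)

# Change in each measure when the outlier is dropped
mean_change = mean_all - mean_trim
median_change = median_all - median_trim

print(round(mean_all), median_all)    # 9289 1800
print(round(mean_trim), median_trim)  # 1950 1650.0
print(round(mean_change), round(median_change))
```

Dropping one value moves the mean by more than Rs 7,000 but the median by only Rs 150, which is why the median is the more appropriate summary for this series.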
Example 2.10:
Suppose we are given the following series:
We are asked to draw both types of ogive from these data and to determine the median.
Solution:
First of all, we transform the given data into two cumulative frequency distributions, one based on 'less
than' and another on 'more than' methods.
Table A
Frequency
Less than 10 6
Less than 20 18
Less than 30 40
Less than 40 77
Less than 50 94
Less than 60 102
Less than 70 107
Table B
Frequency
More than 0 107
More than 10 101
More than 20 89
More than 30 67
More than 40 30
More than 50 13
More than 60 5
It may be noted that the point of intersection of the two ogives gives the value of the median. From this
point of intersection A, we draw a straight line to meet the X-axis at M. Thus, the distance from the origin
to the point M gives the value of the median, which comes to 34, approximately. If we calculate the
median by applying the formula, then the answer comes to 33.8, or 34, approximately. It may be
pointed out that even a single ogive can be used to determine the median. Just as we have determined the
median graphically, we can also find the values of quartiles, deciles or percentiles graphically. For
example, to determine the upper quartile Q3, we have to take the size of the 3(n + 1)/4 = 81st item. From
this point on the Y-axis, we draw a perpendicular to meet the 'less than' ogive, from which another straight
line is drawn to meet the X-axis. This point will give us the value of the upper quartile. In the same
manner, the values of Q1, deciles and percentiles can be determined.
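The graphical reading can be imitated numerically by treating each ogive as a series of straight-line segments and locating the point where the two cross. The sketch below uses the cumulative frequencies of Tables A and B; the function name `ogive_intersection` and the added end points ("less than 0" = 0 and "more than 70" = 0) are assumptions made for the illustration.

```python
def ogive_intersection(bounds, less_than):
    """Return the x-value at which the 'less than' and 'more than' ogives cross."""
    n = less_than[-1]
    more_than = [n - c for c in less_than]  # 107, 101, 89, 67, 30, 13, 5, 0
    for i in range(len(bounds) - 1):
        d0 = less_than[i] - more_than[i]          # gap at the left end of the segment
        d1 = less_than[i + 1] - more_than[i + 1]  # gap at the right end
        if d0 <= 0 <= d1 and d1 != d0:
            t = -d0 / (d1 - d0)                   # fraction of the segment travelled
            return bounds[i] + t * (bounds[i + 1] - bounds[i])
    raise ValueError("the ogives do not cross")


bounds = [0, 10, 20, 30, 40, 50, 60, 70]
less_than = [0, 6, 18, 40, 77, 94, 102, 107]

median_estimate = ogive_intersection(bounds, less_than)
print(round(median_estimate, 1))
```

The crossing point lies a little below the 33.8 given by the formula because the two ogives meet at a cumulative frequency of n/2, whereas the formula counts to the (n + 1)/2th item; both round to 34.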
1. Unlike the arithmetic mean, the median can be computed from open-ended distributions. This is
because it is located in the median class-interval, which would not be an open-ended class.
2. The median can also be determined graphically whereas the arithmetic mean cannot be ascertained
in this manner.
3. As it is not influenced by the extreme values, it is preferred in case of a distribution having extreme
values.
4. In case of the qualitative data where the items are not counted or measured but are scored or
ranked, it is the most appropriate measure of central tendency.
2.2.3 Mode
The mode is another measure of central tendency. It is the value at the point around which the items are
most heavily concentrated. As an example, consider the following series: 8, 9, 11, 15, 16, 12, 15, 3, 7,
15
There are ten observations in the series, wherein the figure 15 occurs the maximum number of times,
namely three. The mode is therefore 15. The series given above is a discrete series; as such, the variable
cannot take fractional values. If the series were continuous, we could say that the mode is approximately
15, without further computation. In the case of grouped data, mode is determined by the following formula:
Mode $= l_1 + \frac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)} \times i$
Where, $l_1$ = the lower value of the class in which the mode lies
While applying the above formula, we should ensure that the class-intervals are uniform throughout. If
the class-intervals are not uniform, then they should be made uniform on the assumption that the
frequencies are evenly distributed throughout the class. In the case of unequal class-intervals, the
application of the above formula will give misleading results.
Solution: We can see from Column (2) of the table that the maximum frequency of 12 lies in the class-
interval of 60-70. This suggests that the mode lies in this class-interval. Applying the formula given
earlier, we get:
Mode $= 60 + \frac{12 - 8}{(12 - 8) + (12 - 9)} \times 10 = 60 + \frac{4}{4 + 3} \times 10 = 65.7$ approx.
In several cases, just by inspection one can identify the class-interval in which the mode lies. One
should see which frequency is the highest and then identify the class-interval to which this frequency
belongs. Having done this, the formula given for calculating the mode in a grouped frequency
distribution can be applied.
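The formula for the mode of grouped data can be written as a small function. This is a sketch only: the function name is hypothetical, and the reading of the symbols (f1 as the frequency of the modal class, f0 and f2 as the frequencies of the classes just before and after it, i as the class width) is inferred from the worked solution above rather than from a stated definition.

```python
def grouped_mode(lower, f_modal, f_prev, f_next, width):
    """Mode = l1 + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * i."""
    denominator = (f_modal - f_prev) + (f_modal - f_next)
    if denominator == 0:
        raise ValueError("modal class is not distinct from its neighbours")
    return lower + (f_modal - f_prev) / denominator * width


# Figures from the worked solution: modal class 60-70 with frequency 12,
# neighbouring frequencies 8 and 9, class width 10
mode = grouped_mode(lower=60, f_modal=12, f_prev=8, f_next=9, width=10)
print(round(mode, 1))  # 65.7
```

The mode falls slightly above the middle of the class because the class that follows the modal class has a higher frequency than the one that precedes it.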
At times, it is not possible to identify by inspection the class where the mode lies. In such cases, it
becomes necessary to use the method of grouping. This method consists of two parts:
(i) Preparation of a grouping table: A grouping table has six columns, the first column showing the
frequencies as given in the problem. Column 2 shows frequencies grouped in two's, starting from
the top. Leaving the first frequency, column 3 shows frequencies grouped in two's. Column 4
shows the frequencies grouped in three's from the top (the first three items, then the next three, and so on). Column 5
leaves the first frequency and groups the remaining items in three's. Column 6 leaves the first two
frequencies and then groups the remaining in three's. Now, the maximum total in each column is
marked and shown either in a circle or in a bold type.
(ii) Preparation of an analysis table: After having prepared a grouping table, an analysis table is
prepared. On the left-hand side, provide the first column for column numbers and on the right-hand
side the different possible values of mode. The highest values marked in the grouping table are
shown here by a bar or by simply entering 1 in the relevant cell corresponding to the values they
represent. The last row of this table will show the number of times a particular value has occurred
in the grouping table. The highest value in the analysis table will indicate the class-interval in
which the mode lies. The procedure of preparing both the grouping and analysis tables to locate the
modal class will be clear by taking an example.
Example 2.12:
The following table gives some frequency data:
10-20 10
20-30 18
30-40 25
40-50 26
50-60 17
60-70 4
Solution:
Grouping Table
Size of item   1     2     3     4     5     6
10-20          10
                     28
20-30          18                53
                           43
30-40          25                      69
                     51
40-50          26                            68
                           43
50-60          17                47
                     21
60-70          4
Analysis Table
           Size of item
Col. No.   10-20   20-30   30-40   40-50   50-60
1                                  1
2                          1       1
3                  1       1       1       1
4          1       1       1
5                  1       1       1
6                          1       1       1
Total      1       3       5       5       2
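The grouping and analysis tables can also be produced mechanically. The sketch below follows the six-column scheme described earlier (single frequencies, pairs from the top, pairs leaving the first, triples from the top, triples leaving the first, triples leaving the first two); the function name `analysis_totals` is a hypothetical helper, and ties for the maximum in a column are all counted, as in column 3 of the table.

```python
def analysis_totals(freqs):
    """Count, for each class, how often it belongs to a column's maximum group."""
    n = len(freqs)
    votes = [0] * n
    # (group size, number of leading frequencies left out) for columns 1 to 6
    columns = [(1, 0), (2, 0), (2, 1), (3, 0), (3, 1), (3, 2)]
    for size, start in columns:
        groups = [(sum(freqs[i:i + size]), i)
                  for i in range(start, n - size + 1, size)]
        if not groups:
            continue
        best = max(total for total, _ in groups)
        for total, i in groups:
            if total == best:  # every class in a maximum group gets one mark
                for j in range(i, i + size):
                    votes[j] += 1
    return votes


classes = ["10-20", "20-30", "30-40", "40-50", "50-60", "60-70"]
totals = analysis_totals([10, 18, 25, 26, 17, 4])
print(dict(zip(classes, totals)))
```

For these frequencies the totals are 1, 3, 5, 5, 2 and 0, matching the last row of the analysis table, with the 60-70 class never appearing in a maximum group.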
This is a bi-modal series as is evident from the analysis table, which shows that the two classes 30-40
and 40-50 have occurred five times each in the grouping. In such a situation, we may have to determine
mode indirectly by applying the following formula:
Mode = 3 median - 2 mean
Median = Size of the (n + 1)/2th item, that is, 101/2 = 50.5th item. This lies in the class 30-40. Applying the
formula for the median, as given earlier, we get
Median $= 30 + \frac{40 - 30}{25}(50.5 - 28) = 30 + 9 = 39$
Mean $= A + \frac{\sum fd'}{n} \times i = 35 + \frac{34}{100} \times 10 = 38.4$
Mode = 3 median - 2 mean = (3 x 39) - (2 x 38.4) = 117 - 76.8 = 40.2
This formula, Mode = 3 Median - 2 Mean, is only an empirical formula, and it can give only
approximate results. As such, its frequent use should be avoided. However, when the mode is ill defined or
the series is bimodal (as is the case in the present example), it may be used.
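The figures of Example 2.12 can be recomputed directly from the frequency table. This is a minimal sketch with variable names chosen for illustration; it uses the direct method for the mean, which gives the same 38.4 as the step-deviation calculation in the text.

```python
lower = [10, 20, 30, 40, 50, 60]
width = 10
freqs = [10, 18, 25, 26, 17, 4]
n = sum(freqs)  # 100

# Mean from class mid-points
mean = sum(f * (lo + width / 2) for lo, f in zip(lower, freqs)) / n

# Median by interpolation at the (n + 1) / 2th item
position = (n + 1) / 2
cumulative = 0
for lo, f in zip(lower, freqs):
    if cumulative + f >= position:
        median = lo + width / f * (position - cumulative)
        break
    cumulative += f

# Empirical relationship, used here because the series is bimodal
mode = 3 * median - 2 * mean

print(round(mean, 1), round(median, 1), round(mode, 1))  # 38.4 39.0 40.2
```

The estimate of 40.2 lies at the boundary between the two candidate modal classes, which is consistent with the analysis table giving 30-40 and 40-50 equal totals.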
(iii) Given the mean and median of a unimodal distribution, we can determine whether it is skewed to
the right or left. When mean > median, it is skewed to the right; when median > mean, it is skewed
to the left. It may be noted that, in a moderately skewed distribution, the median lies between the mean and the mode.
proper. However, the median can sometimes be used in the case of qualitative data when such data can
be arranged in an ascending or descending order. Let us take another example. Suppose we invite
applications for a certain vacancy in our company. A large number of candidates apply for that post. We
are now interested to know as to which age or age group has the largest concentration of applicants.
Here, obviously the mode will be the most appropriate choice. The arithmetic mean may not be
appropriate as it may be influenced by some extreme values. However, the mean happens to be the most
commonly used measure of central tendency as will be evident from the discussion in the subsequent
chapters.
2.2.6 Geometric Mean
Apart from the three measures of central tendency as discussed above, there are two other means that
are used sometimes in business and economics. These are the geometric mean and the harmonic mean.
The geometric mean is more important than the harmonic mean. We discuss below both these means.
First, we take up the geometric mean. Geometric mean is defined as the nth root of the product of n
observations of a distribution.
Symbolically, $GM = \sqrt[n]{x_1 \cdot x_2 \cdots x_n}$. If we have only two observations, say, 4 and 16, then
$GM = \sqrt{4 \times 16} = \sqrt{64} = 8$. Similarly, if there are three observations, then we have to calculate the
cube root of the product of these three observations; and so on. When the number of items is large, it
becomes extremely difficult to multiply the numbers and to calculate the root. To simplify calculations,
logarithms are used.
Example 2.13: If we have to find out the geometric mean of 2, 4 and 8, then we find
$\log GM = \frac{\sum \log x_i}{n} = \frac{\log 2 + \log 4 + \log 8}{3} = \frac{0.3010 + 0.6021 + 0.9031}{3} = \frac{1.8062}{3} = 0.60206$
GM = Antilog 0.60206 = 4
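The logarithmic route of Example 2.13 translates directly into code. This is a minimal sketch using base-10 logarithms to mirror the log tables; the function name is a hypothetical helper.

```python
import math


def geometric_mean_by_logs(values):
    """GM = antilog of the arithmetic mean of the logarithms (base 10)."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log  # the antilog


gm_three = geometric_mean_by_logs([2, 4, 8])
gm_two = geometric_mean_by_logs([4, 16])

print(round(gm_three, 4), round(gm_two, 4))  # 4.0 8.0
```

The guard against zero or negative values reflects the fact that their logarithms are undefined, which is also why the geometric mean cannot be calculated for such series.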
When the data are given in the form of a frequency distribution, then the geometric mean can be
obtained by the formula:
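$\log GM = \frac{f_1 \log x_1 + f_2 \log x_2 + \dots + f_n \log x_n}{f_1 + f_2 + \dots + f_n} = \frac{\sum f \log x}{f_1 + f_2 + \dots + f_n}$
Then, $GM = \text{Antilog} \left( \frac{\sum f \log x}{n} \right)$, where $n = f_1 + f_2 + \dots + f_n$ is the total frequency.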
The geometric mean is most suitable in the following three cases:
1. Averaging rates of change.
2. The compound interest formula.
3. Discounting, capitalization.
Example 2.14: A person has invested Rs 5,000 in the stock market. At the end of the first year the
amount has grown to Rs 6,250; he has had a 25 percent profit. If at the end of the second year his
principal has grown to Rs 8,750, the rate of increase is 40 percent for the year. What is the average rate
of increase of his investment during the two years?
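Solution: The growth factors for the two years are 1.25 and 1.40, so that
$GM = \sqrt{1.25 \times 1.40} = \sqrt{1.75} = 1.323$
The average rate of increase in the value of investment is therefore 1.323 - 1 = 0.323, which if
multiplied by 100, gives the rate of increase as 32.3 percent.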
Example 2.15: We can also derive a compound interest formula from the above set of data. This is
shown below:
Solution:
Now, 1.25 x 1.40 = 1.75. This can be written as $1.75 = (1 + 0.323)^2$.
Let $P_2 = 1.75$, $P_0 = 1$, and r = 0.323, then the above equation can be written as $P_2 = (1 + r)^2$ or
$P_2 = P_0 (1 + r)^2$,
where $P_2$ is the value of investment at the end of the second year, $P_0$ is the initial investment and r is
the average annual rate of increase over the two years. This, in fact, is the familiar compound interest
formula. This can be written in a generalised form as $P_n = P_0 (1 + r)^n$. In our case $P_0$ is Rs 5,000 and the
rate of increase in investment is 32.3 percent. Let us apply this formula to ascertain the value of $P_n$, that
is, investment at the end of the second year.
$P_n = 5,000 (1 + 0.323)^2 = 5,000 \times 1.75$ = Rs 8,750
It may be noted that in the above example, if the arithmetic mean is used, the resultant figure will be
wrong. In this case, the average rate for the two years is $\frac{25 + 40}{2}$ percent per year, which comes to 32.5.
Applying this rate, we get $P_n = \frac{165}{100} \times 5,000$ = Rs 8,250. This is obviously wrong, as the figure should
have been Rs 8,750.
Example 2.16: An economy has grown at 5 percent in the first year, 6 percent in the second year, 4.5
percent in the third year, 3 percent in the fourth year and 7.5 percent in the fifth year. What is the
average rate of growth of the economy during the five years?
Solution:
$GM = \text{Antilog} \left( \frac{\sum \log x}{n} \right) = \text{Antilog} \left( \frac{10.10987}{5} \right) = \text{Antilog } 2.021974 = 105.19$
Hence, the average rate of growth during the five-year period is 105.19 - 100 = 5.19 percent per annum.
In case of a simple arithmetic average, the corresponding rate of growth would have been 5.2 percent
per annum.
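The average growth rates of Examples 2.14 and 2.16 can be checked by taking the geometric mean of the yearly growth factors. This is a minimal sketch; `average_growth_rate` is a hypothetical helper name, and `math.prod` assumes Python 3.8 or later.

```python
import math


def average_growth_rate(rates_percent):
    """Geometric-mean growth rate (in percent) from yearly percentage changes."""
    factors = [1 + r / 100 for r in rates_percent]
    gm = math.prod(factors) ** (1 / len(factors))
    return (gm - 1) * 100


# Example 2.14: 25 percent in the first year, 40 percent in the second
investment_rate = average_growth_rate([25, 40])
final_value = 5000 * (1 + investment_rate / 100) ** 2  # compound interest formula

# Example 2.16: five years of economic growth
economy_rate = average_growth_rate([5, 6, 4.5, 3, 7.5])
simple_average = sum([5, 6, 4.5, 3, 7.5]) / 5

print(round(investment_rate, 1), round(final_value))      # 32.3 8750
print(round(economy_rate, 2), round(simple_average, 2))   # 5.19 5.2
```

Compounding the geometric-mean rate for two years returns exactly the Rs 8,750 actually reached, which is the property that makes it the appropriate average for rates of change.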
Discounting
The compound interest formula given above was $P_n = P_0 (1 + r)^n$. This can be written as
$P_0 = \frac{P_n}{(1 + r)^n}$
This may be expressed as follows:
If the future income is $P_n$ rupees and the present rate of interest is 100r percent, then the present value
of $P_n$ rupees will be $P_0$ rupees. For example, if we have a machine that has a life of 20 years and is
expected to yield a net income of Rs 50,000 per year, and at the end of 20 years it will be obsolete and
cannot be used, then the machine's present value is
$\frac{50,000}{(1 + r)} + \frac{50,000}{(1 + r)^2} + \frac{50,000}{(1 + r)^3} + \dots + \frac{50,000}{(1 + r)^{20}}$
This process of ascertaining the present value of future income by using the interest rate is known as
discounting.
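The discounting sum for the machine can be evaluated once an interest rate is chosen. The text does not give a value for r, so the 10 percent used below is purely an assumed figure for illustration, and `present_value` is a hypothetical helper name.

```python
def present_value(cash_flow, rate, years):
    """Sum of cash_flow / (1 + r)^t for t = 1 .. years."""
    return sum(cash_flow / (1 + rate) ** t for t in range(1, years + 1))


assumed_rate = 0.10  # hypothetical interest rate, not taken from the text
pv = present_value(50_000, assumed_rate, 20)

# Closed form of the same geometric series, used as a cross-check
closed_form = 50_000 * (1 - (1 + assumed_rate) ** -20) / assumed_rate

print(round(pv), round(closed_form))
```

Under this assumed rate the present value is far below the undiscounted total of Rs 10,00,000 (20 x 50,000), because income received in later years is discounted more heavily.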
In conclusion, it may be said that when there are extreme values in a series, geometric mean should be
used as it is much less affected by such values. The arithmetic mean in such cases will give misleading
results. Before we close our discussion on the geometric mean, we should be aware of its advantages
and limitations.
Advantages of G.M.
1. Geometric mean is based on each and every observation in the data set.
2. It is rigidly defined.
3. It is more suitable while averaging ratios and percentages as also in calculating growth rates.
4. As compared to the arithmetic mean, it gives more weight to small values and less weight to large
values. As a result of this characteristic of the geometric mean, it is generally less than the
arithmetic mean. At times it may be equal to the arithmetic mean.
5. It is capable of algebraic manipulation. If the geometric means of two or more series are known
along with their respective numbers of observations, then a combined geometric mean can be
calculated by using logarithms.
Limitations of G.M.
1. As compared to the arithmetic mean, geometric mean is difficult to understand.
2. Both computation of the geometric mean and its interpretation are rather difficult.
3. When there is a negative item in a series or one or more observations have zero value, then the
geometric mean cannot be calculated.
In view of the limitations mentioned above, the geometric mean is not frequently used.
2.2.7 Harmonic Mean
The harmonic mean is defined as the reciprocal of the arithmetic mean of the reciprocals of individual
observations. Symbolically,
$HM = \frac{n}{1/x_1 + 1/x_2 + 1/x_3 + \dots + 1/x_n} = \text{Reciprocal of } \frac{\sum (1/x)}{n}$
The calculation of harmonic mean becomes very tedious when a distribution has a large number of
observations. In the case of grouped data, the harmonic mean is calculated by using the following
formula:
$HM = \text{Reciprocal of } \frac{\sum_i f_i \cdot \frac{1}{x_i}}{n}$
or
$HM = \frac{n}{\sum_i f_i \cdot \frac{1}{x_i}}$
Where n is the total number of observations.
Here, each reciprocal of the original figure is weighted by the corresponding frequency (f). The main
advantage of the harmonic mean is that it is based on all observations in a distribution and is amenable
to further algebraic treatment. When we desire to give greater weight to smaller observations and less
weight to the larger observations, then the use of harmonic mean will be more suitable. As against these
advantages, there are certain limitations of the harmonic mean. First, it is difficult to understand as well
as difficult to compute. Second, it cannot be calculated if any of the observations is zero or negative.
Third, it is only a summary figure, which may not be an actual observation in the distribution.
It is worth noting that, for positive observations that are not all equal, the harmonic mean is always
lower than the geometric mean, which in turn is lower than the arithmetic mean; the three coincide only
when all the observations are equal. This is because the harmonic mean assigns lesser importance to
higher values. Since the harmonic mean is based on reciprocals, and the reciprocals of higher values
are lower than those of lower values, it is a lower average than the arithmetic mean as well as the
geometric mean.
Example 2.17: Suppose we have three observations 4, 8 and 16. We are required to calculate the
harmonic mean. Reciprocals of 4, 8 and 16 are $\frac{1}{4}$, $\frac{1}{8}$ and $\frac{1}{16}$ respectively.
$HM = \frac{n}{1/x_1 + 1/x_2 + 1/x_3} = \frac{3}{1/4 + 1/8 + 1/16} = \frac{3}{0.25 + 0.125 + 0.0625} = 6.857$ approx.
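Example 2.17 also offers a convenient check on the ordering of the three means. The sketch below computes the arithmetic, geometric and harmonic means of 4, 8 and 16; the helper name `harmonic_mean` is chosen for illustration, and `math.prod` assumes Python 3.8 or later.

```python
import math


def harmonic_mean(values):
    """HM = n / sum(1 / x); every value must be positive."""
    if any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs strictly positive values")
    return len(values) / sum(1 / v for v in values)


values = [4, 8, 16]
am = sum(values) / len(values)
gm = math.prod(values) ** (1 / len(values))
hm = harmonic_mean(values)

print(round(am, 3), round(gm, 3), round(hm, 3))  # 9.333 8.0 6.857
```

For this series the harmonic mean is the smallest and the arithmetic mean the largest, in line with the ordering described above.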
$HM = \frac{n}{\sum_i f_i \cdot \frac{1}{x_i}} = \frac{100}{20.0641} = 4.984$ approx.
Example 2.19: In a small company, two typists are employed. Typist A types one page in ten minutes
while typist B takes twenty minutes for the same. (i) Both are asked to type 10 pages. What is the
average time taken for typing one page? (ii) Both are asked to type for one hour. What is the average
time taken by them for typing one page?
Solution: Here Q-(i) is on arithmetic mean while Q-(ii) is on harmonic mean.
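Why the two questions call for different averages can be seen by working from first principles: total time divided by total pages. This is a minimal sketch with variable names chosen for illustration; the numerical results are computed here and are not quoted from the text.

```python
minutes_per_page = [10, 20]  # typist A and typist B

# (i) Each typist types the same number of pages
pages_each = 10
total_minutes = sum(t * pages_each for t in minutes_per_page)
avg_fixed_pages = total_minutes / (pages_each * len(minutes_per_page))

# (ii) Each typist works for the same length of time
minutes_each = 60
total_pages = sum(minutes_each / t for t in minutes_per_page)
avg_fixed_time = minutes_each * len(minutes_per_page) / total_pages

arithmetic_mean = sum(minutes_per_page) / len(minutes_per_page)
harmonic_mean = len(minutes_per_page) / sum(1 / t for t in minutes_per_page)

print(avg_fixed_pages, round(avg_fixed_time, 2))
print(arithmetic_mean, round(harmonic_mean, 2))
```

When the number of pages is fixed, the result coincides with the arithmetic mean of the two times; when the working time is fixed, the faster typist contributes more pages and the result coincides with the harmonic mean.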
the average number of days taken by a cargo to cross the Pacific Ocean when the ships are hired for 60
days?
Solution: Here again Q-(i) pertains to simple arithmetic mean while Q-(ii) is concerned with the
harmonic mean.
$M = \frac{10 + 15 + 20}{3} = 15$ days
$HM = \frac{60 \times 3 \text{ (days)}}{60/10 + 60/15 + 60/20} = \frac{180}{\frac{360 + 240 + 180}{60}} = 13.8$ days approx.
2.2.8 Quadratic Mean
We have seen earlier that the geometric mean is the antilogarithm of the arithmetic mean of the
logarithms, and the harmonic mean is the reciprocal of the arithmetic mean of the reciprocals. Likewise, the
quadratic mean (Q) is the square root of the arithmetic mean of the squares. Symbolically,
$Q = \sqrt{\frac{x_1^2 + x_2^2 + \dots + x_n^2}{n}}$
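The definition of the quadratic mean, the square root of the arithmetic mean of the squares, can be sketched in a few lines. The function name and the sample values 1, 5 and 7 are hypothetical and chosen only because they give a round answer.

```python
import math


def quadratic_mean(values):
    """Square root of the arithmetic mean of the squared values."""
    if not values:
        raise ValueError("at least one value is required")
    return math.sqrt(sum(v * v for v in values) / len(values))


sample = [1, 5, 7]                      # hypothetical observations
q = quadratic_mean(sample)              # sqrt((1 + 25 + 49) / 3)
arithmetic = sum(sample) / len(sample)

print(q, round(arithmetic, 3))
```

For these values the quadratic mean (5.0) exceeds the arithmetic mean (about 4.33), since squaring gives extra weight to the larger observations.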
3. When a distribution is ………….., the mean, median and mode are the same.
4. When there is a negative item in a series or one or more observations have zero value, then the
………….. cannot be calculated.
5. The quadratic mean (Q) is the ………… of the arithmetic mean of the squares.
2.4 SUMMARY
One of the most important objectives of statistical analysis is to get one single value that describes the
characteristics of the entire mass of cumbersome data. Such a value is known as the central value.
2.5 KEYWORDS
Arithmetic mean: It is obtained by adding all the observations and dividing the sum by the number of
observations.
Median: It is defined as the value of the middle item (or the mean of the values of the two middle
items) when the data are arranged in an ascending or descending order of magnitude.
Mode: It is the value at the point around which the items are most heavily concentrated.
Geometric mean: It is defined as the nth root of the product of n observations of a distribution.
Harmonic mean: It is defined as the reciprocal of the arithmetic mean of the reciprocals of individual
observations.
62-63 2
63-64 6
64-65 14
65-66 16
66-67 8
67-68 3
68-69 1
Total 50
7. A number of particular articles have been classified according to their weights. After drying for two
weeks, the same articles have again been weighed and similarly classified. It is known that the
median weight in the first weighing was 20.83 gm while in the second weighing it was 17.35 gm.
Some frequencies a and b in the first weighing and x and y in the second are missing. It is known
that a = (1/3)x and b = (1/2)y. Find out the values of the missing frequencies.
Class   Frequencies
        First Weighing   Second Weighing
0-5     a                x
5-10    b                y
10-15 11 40
15-20 52 50
20-25 75 30
25-30 22 28
8. Cities A, B and C are equidistant from each other. A motorist travels from A to B at 30 km/h; from
B to C at 40 km/h and from C to A at 50 km/h. Determine his average speed for the entire trip.
9. Calculate the harmonic mean from the following data:
Class-Interval 2-4 4-6 6-8 8-10
Frequency 20 40 30 10
10. A vehicle, when climbing up a gradient, consumes petrol at 8 km per litre. While coming down it
runs 12 km per litre. Find its average consumption for to and fro travel between two places situated
at the two ends of a 25 km long gradient.
2.7 ANSWERS TO CHECK YOUR PROGRESS
1. Unweighted
2. Skewed
3. Symmetrical
4. Geometric mean
5. Square root
STRUCTURE
3.1 Learning objectives
3.2 Introduction
3.2.1 Meaning and Definition of Dispersion
3.2.2 Significance and Properties of Measuring Variation
3.2.3 Measures of Dispersion
3.2.4 Range
3.2.5 Interquartile Range or Quartile Deviation
3.2.6 Mean Deviation
3.2.7 Standard Deviation
3.2.8 Lorenz Curve
3.3 Skewness: Meaning and Definitions
3.3.1 Tests of Skewness
3.3.2 Measures of Skewness
3.4 Moments and Kurtosis
3.5 Check your progress
3.6 Summary
3.7 Keywords
3.8 Self-Assessment Test
3.9 Answers to check your progress
3.10 References/Suggested Readings
The various measures of central value give us one single figure that represents the entire data.
But the average alone cannot adequately describe a set of observations, unless all the
observations are the same. It is necessary to describe the variability or dispersion of the
observations. In two or more distributions the central value may be the same but still there can
be wide disparities in the formation of distribution. Measures of dispersion help us in studying
this important characteristic of a distribution.
Some important definitions of dispersion are given below:
1. "Dispersion is the measure of the variation of the items." -A.L. Bowley
2. "The degree to which numerical data tend to spread about an average value is called the variation or
dispersion of the data." -Spiegel
3. "Dispersion or spread is the degree of the scatter or variation of the variable about a central value."
-Brooks & Dick
4. "The measurement of the scatterness of the mass of figures in a series about an average is called
2. Another purpose of measuring dispersion is to determine the nature and cause of variation in order to
control the variation itself. In matters of health, variations in body temperature, pulse beat and blood
pressure are the basic guides to diagnosis, and the prescribed treatment is designed to control their
variation. In industrial production, efficient operation requires control of quality variation, and seeking
the causes of that variation through inspection is basic to controlling it. In the social
sciences, a special problem requiring the measurement of variability is the measurement of
"inequality" of the distribution of income or wealth.
3. Measures of dispersion enable a comparison to be made of two or more series with regard to their
variability. The study of variation may also be looked upon as a means of determining uniformity
or consistency. A high degree of variation would mean little uniformity or consistency whereas a
low degree of variation would mean great uniformity or consistency.
4. Many powerful analytical tools in statistics, such as correlation analysis, the testing of hypotheses,
analysis of variance, statistical quality control and regression analysis, are based on measures of
variation of one kind or another.
A good measure of dispersion should possess the following properties:
1. It should be simple to understand.
2. It should be easy to compute.
3. It should be rigidly defined.
4. It should be based on each and every item of the distribution.
5. It should be amenable to further algebraic treatment.
6. It should have sampling stability.
7. Extreme items should not unduly affect it.
3.2.3 Measures of Dispersion
There are five measures of dispersion: Range, Inter-quartile range or Quartile Deviation, Mean
deviation, Standard Deviation, and Lorenz curve. Among them, the first four are mathematical methods
and the last one is the graphical method. These are discussed in the ensuing paragraphs with suitable
examples.
3.2.4 Range
The simplest measure of dispersion is the range, which is the difference between the maximum value
and the minimum value of data.
Example 3.1: Find the range for the following three sets of data:
Set 1: 05 15 15 05 15 05 15 15 15 15
Set 2: 8 7 15 11 12 5 13 11 15 9
Set 3: 5 5 5 5 5 5 5 5 5
5
Solution: In each of these three sets, the highest number is 15 and the lowest number is 5. Since the
range is the difference between the maximum value and the minimum value of the data, it is 10 in each
case. But the range fails to give any idea about the dispersal or spread of the series between the highest
and the lowest value. This becomes evident from the above data.
In a frequency distribution, range is calculated by taking the difference between the upper limit of the
highest class and the lower limit of the lowest class.
Example 3.2: Find the range for the following frequency distribution:
Size of Item Frequency
20- 40 7
40- 60 11
60- 80 30
80-100 17
100-120 5
Total 70
Solution: Here, the upper limit of the highest class is 120 and the lower limit of the lowest class is 20.
Hence, the range is 120 - 20 = 100. Note that the range is not influenced by the frequencies.
Symbolically, the range is calculated by the formula L - S, where L is the largest value and S is the
smallest value in a distribution. The coefficient of range is calculated by the formula (L - S)/(L + S). This
is a relative measure. The coefficient of range in respect of the earlier example having three sets of
data is (15 - 5)/(15 + 5) = 0.5. The coefficient of range is more appropriate for purposes of comparison, as will be evident
from the following example:
Example 3.3: Calculate the coefficient of range separately for the two sets of data given below:
Set 1 8 10 20 9 15 10 13 28
Set 2 30 35 42 50 32 49 39 33
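Solution: It can be seen that the range in both the sets of data is the same:
Set 1: 28 - 8 = 20
Set 2: 50 - 30 = 20
Coefficient of range in Set 1 is:
(28 - 8)/(28 + 8) = 20/36 = 0.56 approx.
Coefficient of range in Set 2 is:
(50 - 30)/(50 + 30) = 20/80 = 0.25
Although the two ranges are equal, the relative spread of Set 1 is more than twice that of Set 2.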
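As a quick check on Example 3.3, the range and the coefficient of range can be computed in a few lines of Python. This is only an illustrative sketch; the function names `value_range` and `coefficient_of_range` are chosen here and are not part of the text.

```python
def value_range(values):
    """Range: largest value minus smallest value (L - S)."""
    return max(values) - min(values)


def coefficient_of_range(values):
    """Relative measure of dispersion: (L - S) / (L + S)."""
    largest, smallest = max(values), min(values)
    return (largest - smallest) / (largest + smallest)


set_1 = [8, 10, 20, 9, 15, 10, 13, 28]
set_2 = [30, 35, 42, 50, 32, 49, 39, 33]

for name, data in (("Set 1", set_1), ("Set 2", set_2)):
    print(name, value_range(data), round(coefficient_of_range(data), 2))
```

Both sets have the same range of 20, but the coefficient is about 0.56 for Set 1 and 0.25 for Set 2, which is why the relative measure is preferred for comparison.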
LIMITATIONS OF RANGE
There are some limitations of range, which are as follows:
1. It is based only on two items and does not cover all the items in a distribution.
2. It is subject to wide fluctuations from sample to sample based on the same population.
3. It fails to give any idea about the pattern of distribution. This was evident from the data given in
Examples 3.1 and 3.3.
4. Finally, in the case of open-ended distributions, it is not possible to compute the range.
Despite these limitations, the range is mainly used in situations where one wants to quickly have
some idea of the variability of a set of data. When the sample size is very small, the range is considered
quite an adequate measure of the variability. Thus, it is widely used in quality control, where a continuous
check on the variability of raw materials or finished products is needed. The range is also a suitable
measure in weather forecasting. The meteorological department uses the range by giving the maximum and
the minimum temperatures. This information is quite useful to the common man, as he can know the
extent of possible variation in the temperature on a particular day.
3.2.5 Inter-quartile Range or Quartile Deviation
The interquartile range or the quartile deviation is a better measure of variation in a distribution than the
range. It leaves out the lowest 25 percent and the highest 25 percent of the distribution and uses only the
middle 50 percent. In other words, the interquartile range denotes the difference between the third quartile
and the first quartile. Symbolically,
Interquartile range = Q3 - Q1
Many times the interquartile range is reduced to the form of the semi-interquartile range or quartile
deviation, as shown below:
Quartile deviation = (Q3 - Q1)/2
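The following sketch computes the interquartile range and the quartile deviation, (Q3 - Q1)/2, for a small set of individual observations. It assumes Python 3.8 or later and uses `statistics.quantiles` with `method="exclusive"`, which locates the k-th quartile at the k(n + 1)/4-th ordered item, the same convention used for quartiles later in this lesson. The data set is Set 1 of Example 3.3, reused here purely for illustration.

```python
import statistics

data = [8, 10, 20, 9, 15, 10, 13, 28]  # Set 1 of Example 3.3, reused for illustration

# method="exclusive" places the k-th quartile at the k(n + 1)/4-th ordered item
# and interpolates linearly between neighbouring observations.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")

interquartile_range = q3 - q1
quartile_deviation = interquartile_range / 2

print(q1, q3, interquartile_range, quartile_deviation)
```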
The mean deviation is also known as the average deviation. As the name implies, it is the average of
absolute amounts by which the individual items deviate from the mean. Since the positive deviations
from the mean are equal to the negative deviations, while computing the mean deviation, we ignore
positive and negative signs. Symbolically,
$$MD = \frac{\sum |x|}{n}$$
Where MD = mean deviation, |x| = deviation of an item from the mean ignoring positive and negative
signs, n = the total number of observations.
Example 3.4: Calculate the mean deviation about the mean for the following frequency distribution:
Size of Item Frequency
2-4 20
4-6 40
6-8 30
8-10 10
Solution:
Size of Item   Mid-point (m)   Frequency (f)   fm     d = m - mean   f|d|
2-4            3               20              60     -2.6           52
4-6            5               40              200    -0.6           24
6-8            7               30              210    1.4            42
8-10           9               10              90     3.4            34
Total                          100             560                   152

$$\bar{x} = \frac{\sum fm}{n} = \frac{560}{100} = 5.6$$

$$MD(\bar{x}) = \frac{\sum f|d|}{n} = \frac{152}{100} = 1.52$$
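The calculation in Example 3.4 can be reproduced with a short Python sketch. The variable names are illustrative; the mid-points and frequencies are those of the example.

```python
midpoints = [3, 5, 7, 9]          # mid-points of classes 2-4, 4-6, 6-8, 8-10
frequencies = [20, 40, 30, 10]

n = sum(frequencies)
mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Mean deviation: frequency-weighted average of absolute deviations from the mean
mean_deviation = sum(f * abs(m - mean) for f, m in zip(frequencies, midpoints)) / n

print(round(mean, 2), round(mean_deviation, 2))
```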
2. It takes into consideration each and every item in the distribution. As a result, a change in the value
of any item will have its effect on the magnitude of mean deviation.
3. The values of extreme items have less effect on the value of the mean deviation.
4. As deviations are taken from a central value, it is possible to have meaningful comparisons of the
formation of different distributions.
$$\text{Mean} = \frac{108}{6} = 18$$
The second column shows the deviations from the mean. The third or the last column shows the squared
deviations, the sum of which is 70. The arithmetic mean of the squared deviations is:
$$\frac{\sum x^2}{N} = \frac{70}{6} = 11.67 \text{ approx.}$$
This mean of the squared deviations is known as the variance. It may be noted that this variance is
described by different terms that are used interchangeably: the variance of the distribution X; the
variance of X; the variance of the distribution; and just simply, the variance. Symbolically,
$$\mathrm{Var}(X) = \frac{\sum x^2}{N}$$
where x denotes the deviation of an item from the mean. It is also written as
$$\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{N}$$
where $\sigma^2$ (called sigma squared) is used to denote the variance.
Although the variance is a measure of dispersion, the unit of its measurement is (points)². If a
distribution relates to income of families, then the variance is in (Rs)² and not rupees. Similarly, if another
distribution pertains to marks of students, then the unit of variance is (marks)². To overcome this
inadequacy, the square root of variance is taken, which yields a better measure of dispersion known as
the standard deviation. Taking our earlier example of individual observations, we take the square root of
the variance, which gives $\sqrt{11.67} = 3.42$ approx. Symbolically,
$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N}}$$
In applied Statistics, the standard deviation is more frequently used than the variance. This can also be
written as:
$$\sigma = \sqrt{\frac{\sum x_i^2 - \frac{(\sum x_i)^2}{N}}{N}}$$
We use this formula to calculate the standard deviation from the individual observations given earlier.
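Example 3.6:
X            X²
20           400
15           225
19           361
24           576
16           256
14           196
ΣX = 108     ΣX² = 2014
Solution:
$$\sum x_i^2 = 2014, \quad \sum x_i = 108, \quad N = 6$$
$$\sigma = \sqrt{\frac{2014 - \frac{(108)^2}{6}}{6}} = \sqrt{\frac{2014 - \frac{11664}{6}}{6}} = \sqrt{\frac{\frac{12084 - 11664}{6}}{6}} = \sqrt{\frac{420/6}{6}}$$
$$\sigma = \sqrt{\frac{70}{6}} = \sqrt{11.67} = 3.42$$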
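The same result can be checked with Python's standard `statistics` module, whose `pvariance` and `pstdev` functions divide by N (the population form used in this lesson) rather than N - 1. The sketch also evaluates the shortcut form based on the sum of squares; variable names are illustrative.

```python
import statistics

observations = [20, 15, 19, 24, 16, 14]

variance = statistics.pvariance(observations)  # mean of squared deviations (divides by N)
std_dev = statistics.pstdev(observations)      # square root of the variance

# Shortcut form: (sum of squares - (sum)^2 / N) / N
n = len(observations)
shortcut_variance = (sum(x * x for x in observations) - sum(observations) ** 2 / n) / n

print(round(variance, 2), round(std_dev, 2), round(shortcut_variance, 2))
```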
Example 3.7:
The following distribution relates to marks obtained by students in an examination. Calculate the standard deviation.
Marks Number of Students
0- 10 1
10- 20 3
20- 30 6
30- 40 10
40- 50 12
50- 60 11
60- 70 6
70- 80 3
80- 90 2
90-100 1
Solution:
In the case of a frequency distribution where the individual values are not known, we use the mid-points of
the class intervals. Thus, the formula used for calculating the standard deviation is as given below:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^2}{N}}$$
where $m_i$ is the mid-point of the ith class interval, $\bar{x}$ is the mean of the distribution, $f_i$ is the frequency of
each class, N is the total frequency and K is the number of classes. This formula requires that
the mean be calculated and that deviations $(m_i - \bar{x})$ be obtained for each class. To avoid this
inconvenience, the above formula can be modified as:
$$\sigma = C \times \sqrt{\frac{\sum_{i=1}^{K} f_i d_i^2}{N} - \left(\frac{\sum_{i=1}^{K} f_i d_i}{N}\right)^2}$$
where C is the class interval, $f_i$ is the frequency of the ith class, $d_i$ is the deviation of the ith class
mid-point from an assumed origin, measured in units of the class interval, and N is the total number of observations.
Applying this formula to the table given earlier, with $\sum f_i d_i^2 = 231$, $\sum f_i d_i = -45$ and N = 55,
$$\sigma = 10 \times \sqrt{\frac{231}{55} - \left(\frac{-45}{55}\right)^2} = 10 \times \sqrt{4.2 - 0.669421} = 10 \times 1.879 = 18.8 \text{ marks}$$
When it becomes clear that the actual mean would turn out to be a fraction, calculating deviations from
the mean would be too cumbersome. In such cases, an assumed mean is used and the deviations from it
are calculated. While the mid-point of any class can be taken as an assumed mean, it is advisable to choose
the mid-point of the class that makes the calculations least cumbersome. Guided by this
consideration, in Example 3.7 we have chosen 55 as the assumed mean and, accordingly,
deviations have been taken from it. It will be seen that the calculations are considerably
simplified.
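The assumed-mean (step-deviation) calculation for Example 3.7 can be sketched in Python as follows. The class limits and frequencies come from the example and 55 is the assumed mean mentioned in the text; the variable names are illustrative.

```python
from math import sqrt

lower_limits = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
frequencies = [1, 3, 6, 10, 12, 11, 6, 3, 2, 1]
class_width = 10
assumed_mean = 55  # mid-point of the 50-60 class

midpoints = [lower + class_width / 2 for lower in lower_limits]
# Step deviations: distance from the assumed mean in units of the class width
step_devs = [(m - assumed_mean) / class_width for m in midpoints]

n = sum(frequencies)
sum_fd = sum(f * d for f, d in zip(frequencies, step_devs))
sum_fd2 = sum(f * d * d for f, d in zip(frequencies, step_devs))

std_dev = class_width * sqrt(sum_fd2 / n - (sum_fd / n) ** 2)
mean = assumed_mean + class_width * sum_fd / n

print(n, sum_fd, sum_fd2, round(mean, 2), round(std_dev, 1))
```

The sums come out as -45 and 231 with N = 55, giving a standard deviation of about 18.8 marks around a mean of about 46.8 marks.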
USES OF THE STANDARD DEVIATION
The standard deviation is a frequently used measure of dispersion. It enables us to determine as to how
far individual items in a distribution deviate from its mean. In a symmetrical, bell-shaped curve:
(i) About 68 percent of the values in the population fall within ±1 standard deviation from the mean.
(ii) About 95 percent of the values fall within ±2 standard deviations from the mean.
(iii) About 99.7 percent of the values fall within ±3 standard deviations from the mean.
The standard deviation is an absolute measure of dispersion as it measures variation in the same units as
the original data. As such, it cannot be a suitable measure while comparing two or more distributions.
For this purpose, we should use a relative measure of dispersion. One such measure of relative
dispersion is the coefficient of variation, which relates the standard deviation and the mean such that the
standard deviation is expressed as a percentage of mean. Thus, the specific unit in which the standard
deviation is measured is done away with and the new unit becomes percent. Symbolically,
$$CV \text{ (coefficient of variation)} = \frac{\sigma}{\bar{x}} \times 100$$
Example 3.8: In a small business firm, two typists are employed-typist A and typist B. Typist A types
out, on an average, 30 pages per day with a standard deviation of 6. Typist B, on an average, types out
45 pages with a standard deviation of 10. Which typist shows greater consistency in his output?
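Solution:
$$\text{Coefficient of variation for A} = \frac{\sigma_A}{\bar{x}_A} \times 100 = \frac{6}{30} \times 100 = 20\%$$
and
$$\text{Coefficient of variation for B} = \frac{\sigma_B}{\bar{x}_B} \times 100 = \frac{10}{45} \times 100 = 22.2\%$$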
These calculations clearly indicate that although typist B types out more pages, there is a greater
variation in his output as compared to that of typist A. We can say this in a different way: Though typist
A's daily output is much less, he is more consistent than typist B. The usefulness of the coefficient of
variation becomes clear in comparing two groups of data having different means, as has been the case in
the above example.
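The comparison in Example 3.8 takes only a few lines of Python. The helper name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * std_dev / mean


cv_a = coefficient_of_variation(6, 30)    # typist A
cv_b = coefficient_of_variation(10, 45)   # typist B

# The smaller coefficient indicates the more consistent output
more_consistent = "A" if cv_a < cv_b else "B"
print(round(cv_a, 1), round(cv_b, 1), more_consistent)
```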
12 144
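$$\sum X = 40, \quad \sum X^2 = 354$$
$$\bar{x} = \frac{\sum X}{N} = \frac{40}{5} = 8$$
$$\sigma = \sqrt{\frac{\sum X^2}{N} - \left(\frac{\sum X}{N}\right)^2} = \sqrt{\frac{354}{5} - \left(\frac{40}{5}\right)^2} = \sqrt{\frac{354 - 320}{5}} = 2.61 \text{ approx.}$$
$$Z = \frac{x - \bar{x}}{\sigma} = \frac{6 - 8}{2.61} = -0.77 \text{ (standard score)}$$
Applying this formula to other values:
(i) (7 - 8)/2.61 = -0.38 (ii) (5 - 8)/2.61 = -1.15 (iii) (10 - 8)/2.61 = 0.77 (iv) (12 - 8)/2.61 = 1.53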
Thus the standard scores for 6, 7,5,10 and 12 are -0.77, -0.38, -1.15, 0.77 and 1.53, respectively.
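The standard scores listed above can be reproduced with a short sketch. It assumes Python 3.8 or later for `statistics.fmean`; `pstdev` is the population standard deviation (dividing by N), matching the calculation in the text.

```python
import statistics

values = [6, 7, 5, 10, 12]

mean = statistics.fmean(values)
sigma = statistics.pstdev(values)  # population standard deviation

# Standard score: distance from the mean in units of standard deviation
z_scores = [round((x - mean) / sigma, 2) for x in values]
print(mean, round(sigma, 2), z_scores)
```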
3.2.8 Lorenz Curve
This measure of dispersion is graphical. It is known as the Lorenz curve named after Dr. Max Lorenz. It
is generally used to show the extent of concentration of income and wealth. The steps involved in
plotting the Lorenz curve are:
1. Convert a frequency distribution into a cumulative frequency table.
2. Calculate percentage for each item taking the total equal to 100.
3. Choose a suitable scale and plot the cumulative percentages of the persons and income. Use the
horizontal axis (X) to depict percentages of persons and the vertical axis (Y) to depict percentages
of income.
4. Show the line of equal distribution, which joins the origin (0 on both axes) with the point at 100 on both axes.
5. The curve obtained in (3) above can now be compared with the straight line of equal distribution
obtained in (4) above. If the Lorenz curve is close to the line of equal distribution, then it implies that
the dispersion is much less. If, on the contrary, the Lorenz curve is farther away from the line of
equal distribution, it implies that the dispersion is considerable.
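The steps above can be sketched in Python as follows. The figures for persons and incomes are hypothetical, invented only to illustrate the procedure, and the function name `lorenz_points` is likewise an assumption of this sketch. The code produces the points to be plotted; drawing the chart itself is left to any plotting tool.

```python
from itertools import accumulate


def lorenz_points(persons, incomes):
    """Return (cumulative % of persons, cumulative % of income) pairs,
    starting from the origin (0, 0)."""
    total_persons, total_income = sum(persons), sum(incomes)
    points = [(0.0, 0.0)]
    for p, y in zip(accumulate(persons), accumulate(incomes)):
        points.append((100 * p / total_persons, 100 * y / total_income))
    return points


# Hypothetical data: number of persons in each income group (lowest group
# first) and the total income earned by each group.
persons = [40, 30, 20, 10]
incomes = [100, 150, 250, 500]

points = lorenz_points(persons, incomes)

# On the line of equal distribution y equals x, so x - y measures how far
# the curve falls below that line at each plotted point.
max_gap = max(x - y for x, y in points)
print(points, max_gap)
```

A larger gap between the curve and the line of equal distribution indicates greater dispersion, in line with step 5.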
The Lorenz curve is a simple graphical device to show the disparities of distribution in any
phenomenon. It is used in business and economics to represent inequalities in income, wealth, production,
savings, and so on.
Figure 3.1 shows two Lorenz curves by way of illustration. The straight line AB is a line of equal
distribution, whereas AEB shows complete inequality. Curve ACB and curve ADB are the Lorenz
curves.
Figure 3.1: Lorenz Curve
As curve ACB is nearer to the line of equal distribution, it shows a more equitable distribution of income
than curve ADB. Assuming that these two curves are for the same company, this may be interpreted in a
different manner. Prior to taxation, the curve ADB showed greater inequality in the income of its
employees. After taxation, the company's data resulted in the curve ACB, which is closer to the line
of equal distribution. In other words, as a result of taxation, the inequality has reduced.
3.3 SKEWNESS: MEANING AND DEFINITIONS
In the above paragraphs, we have discussed frequency distributions in detail. It may be repeated here
that frequency distributions differ in three ways: average value, variability or dispersion, and shape.
Since the first two, that is, average value and variability or dispersion, have already been discussed,
our main spotlight here will be on the shape of a frequency distribution. Generally, there are two
comparable characteristics, called skewness and kurtosis, that help us to understand a distribution. Two
distributions may have the same mean and standard deviation but may differ widely in their overall
appearance, as can be seen from the following:
In both these distributions the value of the mean and standard deviation is the same ($\bar{X} = 15$, $\sigma = 5$). But it
does not imply that the distributions are alike in nature. The distribution on the left-hand side is a
symmetrical one, whereas the distribution on the right-hand side is asymmetrical or skewed. Measures of
skewness help us to distinguish between different types of distributions.
Some important definitions of skewness are as follows:
1. "When a series is not symmetrical it is said to be asymmetrical or skewed."
-Croxton & Cowden.
2. "Skewness refers to the asymmetry or lack of symmetry in the shape of a frequency distribution." -
Morris Hamburg.
3. "Measures of skewness tell us the direction and the extent of skewness. In symmetrical distribution
the mean, median and mode are identical. The more the mean moves away from the mode, the larger
the asymmetry or skewness." -Simpson & Kafka
4. "A distribution is said to be 'skewed' when the mean and the median fall at different points in the
distribution, and the balance (or centre of gravity) is shifted to one side or the other-to left or right." -
Garrett
The above definitions show that the term 'skewness' refers to lack of symmetry, i.e., when a distribution
is not symmetrical (or is asymmetrical) it is called a skewed distribution. The
concept of skewness will be clear from the following three diagrams showing
a symmetrical distribution, a positively skewed distribution and a negatively
skewed distribution.
1. Symmetrical Distribution. It is clear from the diagram (a) that in a
symmetrical distribution the values of mean, median and mode coincide.
The spread of the frequencies is the same on both sides of the centre
point of the curve.
2. Asymmetrical Distribution. A distribution, which is not symmetrical, is
called a skewed distribution and such a distribution could either be
positively skewed or negatively skewed as would be clear from the
diagrams (b) and (c).
3. Positively Skewed Distribution. In the positively skewed distribution
the value of the mean is maximum and that of mode least-the median lies in between the two as is
clear from the diagram (b).
4. Negatively Skewed Distribution. The following is the shape of a negatively skewed distribution. In
a negatively skewed distribution the value of the mode is maximum and that of the mean least; the median
lies in between the two. In the positively skewed distribution the frequencies are spread out over a
greater range of values on the high-value end of the curve (the right-hand side) than they are on the
low-value end. In the negatively skewed distribution the position is reversed, i.e. the excess tail is
on the left-hand side. It should be noted that in moderately skewed distributions the interval
between the mean and the median is approximately one-third of the interval between the mean and
the mode. It is this relationship which provides a means of measuring the degree of skewness.
3.3.1 Tests of Skewness
In order to ascertain whether a distribution is skewed or not the following tests may be
applied. Skewness is present if:
1. The values of mean, median and mode do not coincide.
2. When the data are plotted on a graph they do not give the normal bell-shaped form i.e. when cut
along a vertical line through the centre the two halves are not equal.
3. The sum of the positive deviations from the median is not equal to the sum of the negative
deviations.
4. Quartiles are not equidistant from the median.
5. Frequencies are not equally distributed at points of equal deviation from the mode.
On the contrary, when skewness is absent, i.e. in case of a symmetrical distribution, the following
conditions are satisfied:
1. The values of mean, median and mode coincide.
2. Data when plotted on a graph give the normal bell-shaped form.
3. Sum of the positive deviations from the median is equal to the sum of the negative deviations.
4. Quartiles are equidistant from the median.
5. Frequencies are equally distributed at points of equal deviations from the mode.
measure, (ii) Bowley's measure, (iii) Kelley's measure, and (iv) the moments measure. These measures are
discussed briefly below:
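KARL PEARSON'S MEASURE
The formula for measuring skewness as given by Karl Pearson is as follows:
Skewness = Mean - Mode
$$\text{Coefficient of skewness} = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$$
Using the relationship between the mean, the median and the mode in moderately skewed distributions,
Mean - Mode = 3(Mean - Median) approximately, the coefficient can also be written as:
$$Sk_p = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}}$$
Now this formula is equal to the earlier one:
$$\frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}} = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$$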
DDE, GJUS&T, Hisar 65 |
Business Statistics MBA-102
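$$SD = \sigma_x = \sqrt{\frac{\sum x^2}{N} - \left(\frac{\sum x}{N}\right)^2} = \sqrt{\frac{24270}{10} - \left(\frac{452}{10}\right)^2} = \sqrt{2427 - (45.2)^2} = 19.59$$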
Applying the values of mean, mode and standard deviation in the above formula,
$$Sk_p = \frac{45.2 - 43.7}{19.59} = 0.08$$
This shows that there is a positive skewness though the extent of skewness is marginal.
Example 3.12: From the following data, calculate the measure of skewness using the mean, median and
standard deviation:
X 10 - 20 20 - 30 30 - 40 40 - 50 50-60 60 - 70 70 - 80
f 18 30 40 55 38 20 16
Solution:
x MVx dx f fdx fdX2 cf
10 - 20 15 -3 18 -54 162 18
20 - 30 25 -2 30 -60 120 48
30 - 40 35 -1 40 -40 40 88
40-50 45=a 0 55 0 0 143
50 - 60 55 1 38 38 38 181
60 - 70 65 2 20 40 80 201
70 - 80 75 3 16 48 144 217
Total 217 -28 584
a = Assumed mean = 45, cf = Cumulative frequency, dx = Deviation from assumed mean, and i = 10
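$$\bar{x} = a + \frac{\sum f d_x}{N} \times i = 45 + \frac{-28}{217} \times 10 = 43.71$$
$$\text{Median} = l_1 + \frac{l_2 - l_1}{f_1}(m - c)$$
where m = (N + 1)/2th item = (217 + 1)/2 = 109th item, $l_1$ and $l_2$ are the lower and upper limits of the
median class, $f_1$ is its frequency and c is the cumulative frequency of the preceding class.
$$\text{Median} = 40 + \frac{50 - 40}{55}(109 - 88) = 40 + \frac{10}{55} \times 21 = 43.82$$
$$SD = \sigma_x = \sqrt{\frac{\sum f d_x^2}{\sum f} - \left(\frac{\sum f d_x}{\sum f}\right)^2} \times i = \sqrt{\frac{584}{217} - \left(\frac{-28}{217}\right)^2} \times 10 = \sqrt{2.69 - 0.016} \times 10 = 16.4$$
Skewness = 3 (Mean - Median)
= 3 (43.71 - 43.82)
= 3 x (-0.11) = -0.33
Coefficient of skewness = Skewness / Standard deviation
= -0.33/16.4 = -0.02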
The result shows that the distribution is negatively skewed, but the extent of skewness is extremely
negligible.
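The working of Example 3.12 can be verified with the sketch below, which computes the mean and standard deviation directly from the class mid-points and interpolates the median within its class. The helper `grouped_median` and the tuple layout `(lower, upper, frequency)` are choices made for this illustration.

```python
from math import sqrt

# (lower limit, upper limit, frequency) for each class of Example 3.12
classes = [(10, 20, 18), (20, 30, 30), (30, 40, 40), (40, 50, 55),
           (50, 60, 38), (60, 70, 20), (70, 80, 16)]

n = sum(f for _, _, f in classes)
mean = sum(f * (lo + hi) / 2 for lo, hi, f in classes) / n
variance = sum(f * ((lo + hi) / 2 - mean) ** 2 for lo, hi, f in classes) / n
std_dev = sqrt(variance)


def grouped_median(classes, n):
    """Interpolate the (n + 1)/2-th item within the class that contains it."""
    position = (n + 1) / 2
    cumulative = 0
    for lo, hi, f in classes:
        if cumulative + f >= position:
            return lo + (hi - lo) / f * (position - cumulative)
        cumulative += f
    raise ValueError("position lies beyond the last class")


median = grouped_median(classes, n)
coefficient = 3 * (mean - median) / std_dev

print(round(mean, 2), round(median, 2), round(std_dev, 1), round(coefficient, 2))
```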
Bowley's Measure
Bowley developed a measure of skewness, which is based on quartile values. The formula for measuring
skewness is:
$$\text{Skewness} = \frac{Q_3 + Q_1 - 2M}{Q_3 - Q_1}$$
where Q3 and Q1 are the upper and lower quartiles and M is the median. The value of this skewness varies
between ±1. In the case of an open-ended distribution, as well as where extreme values are found in the
series, this measure is particularly useful. In a symmetrical distribution, skewness is zero. This means
that Q3 and Q1 are positioned equidistantly from Q2, that is, the median. In symbols, Q3 - Q2 = Q2 - Q1.
In contrast, when the distribution is skewed, Q3 - Q2 will be different from Q2 - Q1. When Q3 - Q2
exceeds Q2 - Q1, skewness is positive. As against this, when Q3 - Q2 is less than Q2 - Q1,
skewness is negative. Bowley's measure of skewness can be written as:
Skewness = (Q3 - Q2) - (Q2 – Q1)
or Q3 - Q2 - Q2 + Q1
or Q3 + Q1 - 2Q2 (2Q2 is 2M)
However, this is an absolute measure of skewness. As such, it cannot be used while comparing two
distributions where the units of measurement are different. In view of this limitation, Bowley suggested
a relative measure of skewness as given below:
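$$\text{Relative skewness} = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{(Q_3 - Q_2) + (Q_2 - Q_1)} = \frac{Q_3 - Q_2 - Q_2 + Q_1}{Q_3 - Q_2 + Q_2 - Q_1}$$
$$= \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1} = \frac{Q_3 + Q_1 - 2M}{Q_3 - Q_1}$$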
Example 3.13: For a distribution, Bowley's coefficient of skewness is -0.56, Q1 = 16.4 and
Median = 24.2. What is the coefficient of quartile deviation?
Solution:
Bowley's coefficient of skewness is:
$$Sk_B = \frac{Q_3 + Q_1 - 2M}{Q_3 - Q_1}$$
Substituting the values in the above formula,
$$-0.56 = \frac{Q_3 + 16.4 - (2 \times 24.2)}{Q_3 - 16.4} = \frac{Q_3 + 16.4 - 48.4}{Q_3 - 16.4}$$
or -0.56 (Q3 - 16.4) = Q3 - 32
or -0.56 Q3 + 9.184 = Q3 - 32
or -0.56 Q3 - Q3 = -32 - 9.184
or -1.56 Q3 = -41.184
$$Q_3 = \frac{41.184}{1.56} = 26.4$$
Now, we have the values of both the upper and the lower quartiles.
$$\text{Coefficient of quartile deviation} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{26.4 - 16.4}{26.4 + 16.4} = \frac{10}{42.8} = 0.234 \text{ approx.}$$
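The algebra of Example 3.13 can be checked numerically. The sketch below rearranges Bowley's formula for Q3 and then evaluates the coefficient of quartile deviation; the function name `upper_quartile_from_bowley` is illustrative.

```python
def upper_quartile_from_bowley(sk_b, q1, median):
    """Solve sk_b = (q3 + q1 - 2*median) / (q3 - q1) for q3."""
    return (2 * median - q1 * (1 + sk_b)) / (1 - sk_b)


sk_b, q1, median = -0.56, 16.4, 24.2

q3 = upper_quartile_from_bowley(sk_b, q1, median)
coefficient_qd = (q3 - q1) / (q3 + q1)

print(round(q3, 1), round(coefficient_qd, 3))
```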
Example 3.14: Calculate an appropriate measure of skewness from the following data:
Value in Rs Frequency
Less than 50 40
50 - 100 80
100 - 150 130
150 – 200 60
200 and above 30
Solution: It should be noted that the series given in the question is an open-ended series. As such,
Bowley's coefficient of skewness, which is based on quartiles, would be the most appropriate measure
of skewness in this case. In order to calculate the quartiles and the median, we have to use the
cumulative frequency. The table is reproduced below with the cumulative frequency.
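Value in Rs      Frequency   Cumulative frequency
Less than 50     40          40
50 - 100         80          120
100 - 150        130         250
150 - 200        60          310
200 and above    30          340

$$Q_1 = l_1 + \frac{l_2 - l_1}{f_1}(m - c)$$
Now m = (n + 1)/4th item = 341/4 = 85.25, which lies in the 50 - 100 class.
$$Q_1 = 50 + \frac{100 - 50}{80}(85.25 - 40) = 78.28$$
For the median, m = (n + 1)/2th item = 341/2 = 170.5, which lies in the 100 - 150 class.
$$M = 100 + \frac{150 - 100}{130}(170.5 - 120) = 119.4$$
$$Q_3 = l_1 + \frac{l_2 - l_1}{f_1}(m - c)$$
Here m = 3(341)/4 = 255.75, which lies in the 150 - 200 class.
$$Q_3 = 150 + \frac{200 - 150}{60}(255.75 - 250) = 154.79$$
Bowley's coefficient of skewness is:
$$\frac{Q_3 + Q_1 - 2M}{Q_3 - Q_1} = \frac{154.79 + 78.28 - (2 \times 119.4)}{154.79 - 78.28} = \frac{-5.73}{76.51} = -0.075 \text{ approx.}$$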
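The quartile interpolation used in Example 3.14 can be sketched as a small Python function. Open-ended classes are represented here with `None` for the missing limit; this representation and the name `grouped_quantile` are assumptions of the sketch. The function works as long as the required quartile does not fall in an open-ended class, which is the case in this example.

```python
def grouped_quantile(classes, fraction):
    """Locate the ((n + 1) * fraction)-th item and interpolate within its class.

    `classes` is a list of (lower, upper, frequency) tuples in ascending order;
    an open-ended class has None for its missing limit.
    """
    n = sum(freq for _, _, freq in classes)
    position = (n + 1) * fraction
    cumulative = 0
    for lower, upper, freq in classes:
        if cumulative + freq >= position:
            if lower is None or upper is None:
                raise ValueError("quantile falls in an open-ended class")
            return lower + (upper - lower) / freq * (position - cumulative)
        cumulative += freq
    raise ValueError("position lies beyond the last class")


classes = [(None, 50, 40), (50, 100, 80), (100, 150, 130),
           (150, 200, 60), (200, None, 30)]

q1 = grouped_quantile(classes, 0.25)
median = grouped_quantile(classes, 0.50)
q3 = grouped_quantile(classes, 0.75)
bowley = (q3 + q1 - 2 * median) / (q3 - q1)

print(round(q1, 2), round(median, 2), round(q3, 2), round(bowley, 3))
```

The coefficient comes out at about -0.075, indicating a slight negative skewness.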
$$P_{50} = 40 + \frac{50 - 40}{55}(109 - 88) = 40 + \frac{10 \times 21}{55} = 43.82 \text{ approx.}$$
P90: here m = 90 (217 + 1)/100th item = 196.2th item. This lies in the class 60 - 70. Applying the above
formula:
$$P_{90} = 60 + \frac{70 - 60}{20}(196.2 - 181) = 60 + \frac{10 \times 15.2}{20} = 67.6 \text{ approx.}$$
Kelley's skewness
$$Sk_K = \frac{P_{90} - 2P_{50} + P_{10}}{P_{90} - P_{10}} = \frac{67.6 - (2 \times 43.82) + 21.27}{67.6 - 21.27} = \frac{88.87 - 87.64}{46.33} = 0.027$$
This shows that the series is positively skewed though the extent of skewness is extremely negligible. It
may be recalled that if there is a perfectly symmetrical distribution, then the skewness will be zero. One
can see that the above answer is very close to zero.
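The percentile values above (N = 217, P50 in the 40 - 50 class, P90 in the 60 - 70 class) correspond to the frequency distribution of Example 3.12. Assuming that distribution, Kelley's measure can be checked with the sketch below; the helper name `grouped_quantile` is illustrative.

```python
def grouped_quantile(classes, fraction):
    """Interpolate the ((n + 1) * fraction)-th item within its class."""
    n = sum(freq for _, _, freq in classes)
    position = (n + 1) * fraction
    cumulative = 0
    for lower, upper, freq in classes:
        if cumulative + freq >= position:
            return lower + (upper - lower) / freq * (position - cumulative)
        cumulative += freq
    raise ValueError("position lies beyond the last class")


# (lower limit, upper limit, frequency), assumed to be the data of Example 3.12
classes = [(10, 20, 18), (20, 30, 30), (30, 40, 40), (40, 50, 55),
           (50, 60, 38), (60, 70, 20), (70, 80, 16)]

p10 = grouped_quantile(classes, 0.10)
p50 = grouped_quantile(classes, 0.50)
p90 = grouped_quantile(classes, 0.90)

kelley = (p90 - 2 * p50 + p10) / (p90 - p10)
print(round(p10, 2), round(p50, 2), round(p90, 1), round(kelley, 3))
```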
3.4 MOMENTS AND KURTOSIS
MOMENTS
In mechanics, the term moment is used to denote the rotating effect of a force. In Statistics, it is used to
indicate peculiarities of a frequency distribution. The utility of moments lies in the sense that they
indicate different aspects of a given distribution. Thus, by using moments, we can measure the central
tendency of a series, dispersion or variability, skewness and the peakedness of the curve. The moments
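about the actual arithmetic mean are denoted by $\mu$. The first four moments about the mean, or central
moments, are as follows:
$$\text{First moment: } \mu_1 = \frac{1}{N} \sum (x_i - \bar{x})$$
$$\text{Second moment: } \mu_2 = \frac{1}{N} \sum (x_i - \bar{x})^2$$
$$\text{Third moment: } \mu_3 = \frac{1}{N} \sum (x_i - \bar{x})^3$$
$$\text{Fourth moment: } \mu_4 = \frac{1}{N} \sum (x_i - \bar{x})^4$$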
These moments are in relation to individual items. In the case of a frequency distribution, the first four
moments will be:
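$$\text{First moment: } \mu_1 = \frac{1}{N} \sum f_i (x_i - \bar{x})$$
$$\text{Second moment: } \mu_2 = \frac{1}{N} \sum f_i (x_i - \bar{x})^2$$
$$\text{Third moment: } \mu_3 = \frac{1}{N} \sum f_i (x_i - \bar{x})^3$$
$$\text{Fourth moment: } \mu_4 = \frac{1}{N} \sum f_i (x_i - \bar{x})^4$$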
It may be noted that the first central moment is zero, that is, $\mu_1 = 0$. The second central moment is
$\mu_2 = \sigma^2$, indicating the variance. The third central moment $\mu_3$ is used to measure skewness. The fourth central
moment $\mu_4$ gives an idea about the kurtosis.
Karl Pearson suggested another measure of skewness, which is based on the third and second central
moments as given below:
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3}$$
Example 3.16: Find the (a) first, (b) second, (c) third and (d) fourth moments for the set of numbers
2,3,4,5 and 6.
Solution:
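$$\text{(a) } \bar{x} = \frac{\sum x}{N} = \frac{2 + 3 + 4 + 5 + 6}{5} = \frac{20}{5} = 4$$
$$\text{(b) } \overline{x^2} = \frac{\sum x^2}{N} = \frac{2^2 + 3^2 + 4^2 + 5^2 + 6^2}{5} = \frac{4 + 9 + 16 + 25 + 36}{5} = 18$$
$$\text{(c) } \overline{x^3} = \frac{\sum x^3}{N} = \frac{2^3 + 3^3 + 4^3 + 5^3 + 6^3}{5} = \frac{8 + 27 + 64 + 125 + 216}{5} = 88$$
$$\text{(d) } \overline{x^4} = \frac{\sum x^4}{N} = \frac{2^4 + 3^4 + 4^4 + 5^4 + 6^4}{5} = \frac{16 + 81 + 256 + 625 + 1296}{5} = 454.8$$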
Example 3.17: Using the same set of five figures as given in Example 3.16, find the (a) first, (b) second,
(c) third and (d) fourth moments about the mean.
Solution:
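$$m_1 = \overline{(x - \bar{x})} = \frac{\sum (x - \bar{x})}{N} = \frac{(2 - 4) + (3 - 4) + (4 - 4) + (5 - 4) + (6 - 4)}{5} = \frac{-2 - 1 + 0 + 1 + 2}{5} = 0$$
$$m_2 = \overline{(x - \bar{x})^2} = \frac{\sum (x - \bar{x})^2}{N} = \frac{(-2)^2 + (-1)^2 + 0^2 + 1^2 + 2^2}{5} = \frac{4 + 1 + 0 + 1 + 4}{5} = 2$$
It may be noted that $m_2$ is the variance.
$$m_3 = \overline{(x - \bar{x})^3} = \frac{\sum (x - \bar{x})^3}{N} = \frac{(-2)^3 + (-1)^3 + 0^3 + 1^3 + 2^3}{5} = \frac{-8 - 1 + 0 + 1 + 8}{5} = 0$$
$$m_4 = \overline{(x - \bar{x})^4} = \frac{\sum (x - \bar{x})^4}{N} = \frac{(-2)^4 + (-1)^4 + 0^4 + 1^4 + 2^4}{5} = \frac{16 + 1 + 0 + 1 + 16}{5} = 6.8$$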
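Examples 3.16 and 3.17 can both be reproduced with two small functions, one for moments about the origin and one for moments about the mean. The function names are illustrative.

```python
def raw_moment(values, r):
    """r-th moment about the origin (zero)."""
    return sum(x ** r for x in values) / len(values)


def central_moment(values, r):
    """r-th moment about the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** r for x in values) / len(values)


numbers = [2, 3, 4, 5, 6]

raw = [raw_moment(numbers, r) for r in range(1, 5)]
central = [central_moment(numbers, r) for r in range(1, 5)]

print(raw)      # moments about the origin, as in Example 3.16
print(central)  # moments about the mean, as in Example 3.17
```

The third central moment is zero because the five numbers are symmetrical about their mean.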
Example 3.18: Calculate the first four central moments from the following data:
Class interval 50-60 60-70 70-80 80-90 90-100
Frequency 5 12 20 7 6
Solution:
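Class Interval   f    MV   d from 75   d' = d/10   fd'   fd'²   fd'³   fd'⁴
50 - 60          5    55   -20         -2          -10   20     -40    80
60 - 70          12   65   -10         -1          -12   12     -12    12
70 - 80          20   75   0           0           0     0      0      0
80 - 90          7    85   10          1           7     7      7      7
90 - 100         6    95   20          2           12    24     48     96
Total            50                                -3    63     3      195

Moments about the assumed mean (75), with class interval i = 10. Since the deviations were divided by i, the
r-th moment must be multiplied by i raised to the power r:
$$\mu_1' = \frac{\sum fd'}{N} \times i = \frac{-3}{50} \times 10 = -0.6$$
$$\mu_2' = \frac{\sum fd'^2}{N} \times i^2 = \frac{63}{50} \times 100 = 126$$
$$\mu_3' = \frac{\sum fd'^3}{N} \times i^3 = \frac{3}{50} \times 1000 = 60$$
$$\mu_4' = \frac{\sum fd'^4}{N} \times i^4 = \frac{195}{50} \times 10000 = 39000$$
Moments about Mean
$$\mu_1 = \mu_1' - \mu_1' = -0.6 - (-0.6) = 0$$
$$\mu_2 = \mu_2' - (\mu_1')^2 = 126 - (-0.6)^2 = 126 - 0.36 = 125.64$$
$$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2(\mu_1')^3 = 60 - 3(126)(-0.6) + 2(-0.6)^3 = 60 + 226.8 - 0.432 = 286.368$$
$$\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'(\mu_1')^2 - 3(\mu_1')^4 = 39000 - 4(60)(-0.6) + 6(126)(-0.6)^2 - 3(-0.6)^4$$
$$= 39000 + 144 + 272.16 - 0.3888 = 39415.7712$$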
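As an independent check on the shortcut method, the central moments of Example 3.18 can be computed directly from the class mid-points, in the original units of the data. The sketch below does this and also evaluates Karl Pearson's moment measure of skewness, $\beta_1 = \mu_3^2 / \mu_2^3$; the variable and function names are illustrative.

```python
midpoints = [55, 65, 75, 85, 95]
frequencies = [5, 12, 20, 7, 6]

n = sum(frequencies)
mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n


def central_moment(r):
    """r-th moment about the mean for the grouped data above."""
    return sum(f * (m - mean) ** r for f, m in zip(frequencies, midpoints)) / n


mu2, mu3, mu4 = (central_moment(r) for r in (2, 3, 4))
beta_1 = mu3 ** 2 / mu2 ** 3

print(mean, round(mu2, 2), round(mu3, 3), round(mu4, 4), round(beta_1, 4))
```

Computed this way, the mean is 74.4 and the second, third and fourth central moments are about 125.64, 286.368 and 39415.7712.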
KURTOSIS
$$K = \frac{Q}{P_{90} - P_{10}}$$
where K = kurtosis, Q = ½ (Q3 - Q1) is the semi-interquartile range, P90 is the 90th percentile and P10 is the
10th percentile. This is also known as the percentile coefficient of kurtosis. In the case of the normal
distribution, the value of K is 0.263.
Example 3.20: From the data given below, calculate the percentile coefficient of kurtosis.
Daily Wages in Rs. Number of Workers cf
50- 60 10 10
60-70 14 24
70-80 18 42
80 - 90 24 66
90-100 16 82
100 -110 12 94
110 - 120 6 100
Total 100
Solution: It may be noted that the question involved first two columns and in order to calculate
quartiles and percentiles, cumulative frequencies have been shown in column three of the above table.
$$Q_1 = l_1 + \frac{l_2 - l_1}{f_1}(m - c)$$
where m = (n + 1)/4th item, which is the 25.25th item. This falls in the 70 - 80 class interval. So,
$$Q_1 = 70 + \frac{80 - 70}{18}(25.25 - 24) = 70.69$$
$$Q_3 = l_1 + \frac{l_2 - l_1}{f_1}(m - c)$$
where m = 75.75. This falls in the 90 - 100 class interval. So,
$$Q_3 = 90 + \frac{100 - 90}{16}(75.75 - 66) = 96.09$$
$$P_{10} = l_1 + \frac{l_2 - l_1}{f_1}(m - c)$$
where m = 10.1. This falls in the 60 - 70 class interval. So,
$$P_{10} = 60 + \frac{70 - 60}{14}(10.1 - 10) = 60.07$$
$$P_{90} = l_1 + \frac{l_2 - l_1}{f_1}(m - c)$$
where m = 90.9. This falls in the 100 - 110 class interval. So,
$$P_{90} = 100 + \frac{110 - 100}{12}(90.9 - 82) = 107.42$$
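With the quartiles and percentiles in hand, the percentile coefficient of kurtosis, K = Q/(P90 - P10) with Q = ½(Q3 - Q1), follows directly. The sketch below recomputes all four values from the wage distribution of Example 3.20 and then evaluates K; the helper name `grouped_quantile` is illustrative.

```python
def grouped_quantile(classes, fraction):
    """Interpolate the ((n + 1) * fraction)-th item within its class."""
    n = sum(freq for _, _, freq in classes)
    position = (n + 1) * fraction
    cumulative = 0
    for lower, upper, freq in classes:
        if cumulative + freq >= position:
            return lower + (upper - lower) / freq * (position - cumulative)
        cumulative += freq
    raise ValueError("position lies beyond the last class")


# (lower limit, upper limit, number of workers) from Example 3.20
classes = [(50, 60, 10), (60, 70, 14), (70, 80, 18), (80, 90, 24),
           (90, 100, 16), (100, 110, 12), (110, 120, 6)]

q1 = grouped_quantile(classes, 0.25)
q3 = grouped_quantile(classes, 0.75)
p10 = grouped_quantile(classes, 0.10)
p90 = grouped_quantile(classes, 0.90)

semi_interquartile_range = (q3 - q1) / 2
kurtosis = semi_interquartile_range / (p90 - p10)

print(round(q1, 2), round(q3, 2), round(p10, 2), round(p90, 2), round(kurtosis, 3))
```

The coefficient works out to about 0.268, close to the value of 0.263 quoted above for the normal distribution.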
3.6 SUMMARY
The average value cannot adequately describe a set of observations, unless all the observations are the
same. It is necessary to describe the variability or dispersion of the observations. In two or more
distributions the central value may be the same but still there can be wide disparities in the formation of
distribution. Therefore, we have to use the measures of dispersion.
Further, two distributions may have the same mean and standard deviation but may differ widely in
their overall appearance in terms of symmetry and skewness. To distinguish between different types of
distributions, we may use the measures of skewness.
3.7 KEYWORDS
Dispersion: It measures the extent to which the items vary from some central value.
Range: It is the difference between the maximum value and the minimum value of data.
Interquartile range: It denotes the difference between the third quartile and the first quartile.
Skewness: It refers to lack of symmetry, i.e., when a distribution is not symmetrical (or is
asymmetrical) it is called a skewed distribution.
Moments: In Statistics, it is used to indicate peculiarities of a frequency distribution.
Kurtosis: It is another measure of the shape of a frequency curve. It is a Greek word, which means
bulginess.
3.8 SELF ASSESSMENT TEST
1. What do you mean by dispersion? What are the different measures of dispersion?
2. "Variability is not an important factor because even though the outcome is more certain, you still
have an equal chance of falling either above or below the median. Therefore, on an average, the
outcome will be the same." Do you agree with this statement? Give reasons for your answer.
3. Why is the standard deviation the most widely used measure of dispersion? Explain.
4. Define skewness and Dispersion.
5. Define Kurtosis and Moments.
6. What are the different measures of skewness? Which one is repeatedly used?
7. "Measures of dispersion and skewness are complementary to one another in understanding a
frequency distribution." Elucidate the statement.
8. Calculate Karl Pearson's coefficient of skewness from the following data:
Weekly Sales (Rs lakh) Number of Companies
10-12 12
12 – 14 18
14 – 16 35
16 - 18 42
18 – 20 50
22-24 30
24-26 8
9. For a distribution, the first four moments about zero are 1,7,38 and 155 respectively. (i) Compute
the moment coefficients of skewness and kurtosis. (ii) Is the distribution mesokurtic? Give reason.
10. The first four moments of a distribution about the value 4 are 1, 4, 10 and 45. Obtain various
characteristics of the distribution on the basis of the information given. Comment upon the nature
of the distribution.
11. Define kurtosis. If β1 = 1, β2 = 4 and variance = 9, find the values of μ3 and μ4 and comment
upon the nature of the distribution.
12. Calculate the first four moments about the mean from the following data. Also calculate the values
of β1 and β2
Marks 0-10 10 – 20 20-30 30 – 40 40 – 50 50 - 60 60 - 70
No. of students 5 12 18 40 15 7 3
STRUCTURE
4.1 Learning Objectives
4.2 Introduction
4.2.1 Some Basic Concepts
4.2.2 Approaches to Probability Theory
4.2.3 Probability Rules
4.2.4 Bayes' Theorem
4.3 Some Counting Concepts
4.4 Check your Progress
4.5 Summary
4.6 Keywords
4.7 Self-Assessment Test
4.8 Answers to check your progress
4.9 References/Suggested Readings
4.2 INTRODUCTION
Life is full of uncertainties. 'Probably', 'likely', 'possibly', 'chance', etc. are some of the most commonly
used terms in our day-to-day conversation. All these terms more or less convey the same sense: "the
situation under consideration is uncertain and commenting on the future with certainty is impossible".
Decision-making in such areas is facilitated through formal and precise expressions for the uncertainties
involved. For example, product demand is uncertain, but a study of demand spelled out in a form
amenable to analysis may go a long way to help analyze and facilitate decisions on sales planning and
inventory management. Intuitively, we see that if there is a high chance of a high demand in the coming
year, we may decide to stock more. We may also take some decisions regarding the price increase,
reducing sales expenses etc. to manage the demand. However, in order to make such decisions, we need
to quantify the chances of different quantities of demand in the coming year. Probability theory
provides us with the ways and means to quantify the uncertainties involved in such situations.
A probability is a quantitative measure of uncertainty - a number that conveys the strength of our
belief in the occurrence of an uncertain event.
Since uncertainty is an integral part of human life, people have always been interested - consciously or
unconsciously - in evaluating probabilities.
Having its origin associated with gamblers, the theory of probability today is an indispensable tool in
the analysis of situations involving uncertainty. It forms the basis for inferential statistics as well as for
other fields that require quantitative assessments of chance occurrences, such as quality control,
management decision analysis, and almost all areas in physics, biology, engineering and economics or
social life.
4.2.1 Some Basic Concepts
Probability, in common parlance, refers to the chance of occurrence of an event or happening. In order
that we are able to compute it, a proper understanding of certain basic concepts in probability theory is
required. These concepts are an experiment, a sample space, and an event.
EXPERIMENT
An experiment is a process that leads to one of several possible outcomes. An outcome of an experiment
is some observation or measurement.
The term experiment is used in probability theory in a much broader sense than in physics or chemistry.
Any action, whether it is the drawing of a card out of a deck of 52 cards, or reading the temperature, or
measurement of a product's dimension to ascertain quality, or the launching of a new product in the
market, constitutes an experiment in the terminology of probability theory.
The experiments in probability theory have three things in common:
there are two or more outcomes of each experiment
it is possible to specify the outcomes in advance
there is uncertainty about the outcomes
For example, the product we are measuring may turn out to be undersize or right size or oversize, and
we are not certain which way it will be when we measure it. Similarly, launching a new product
involves the uncertain outcome of meeting with success or failure in the market.
A single outcome of an experiment is called a basic outcome or an elementary event. Any particular
card drawn from a deck is a basic outcome.
SAMPLE SPACE
The sample space is the universal set S pertinent to a given experiment. It is the set of all possible
outcomes of an experiment.
So each outcome is visualized as a sample point in the sample space. The sample spaces for the above
experiments are:
Experiment                               Sample Space
Drawing a Card                           {all 52 cards in the deck}
Reading the Temperature                  {all numbers in the range of temperatures}
Measurement of a Product's Dimension     {undersize, right size, oversize}
Launching of a New Product               {success, failure}
EVENT
Approach. If we toss the coin for a sufficiently large number of times and note down the number of
times the head occurs, the proportion of times that a head occurs will give us the required probability.
Example 4-1
A fair coin is tossed twice. Find the probabilities of the following events:
(a) A, getting two heads
(b) B, getting one head and one tail
(c) C, getting at least one head or one tail
(d) D, getting four heads
Solution: Being a two-trial coin-tossing experiment, it gives rise to the following $O^n = 2^2 = 4$
possible equally likely outcomes:
HH HT TH TT
Thus, for the sample space, N(S) = 4.
We can use the Classical Approach to find out the required probabilities.
n(D) = 0
$P(D) = \frac{n(D)}{N(S)} = \frac{0}{4} = 0$
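The classical approach for all four events of Example 4-1 can be checked by listing the sample space and counting favourable outcomes. The following is a minimal sketch; the helper name `classical_probability` is an illustrative choice and not part of the text.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses of a fair coin: HH, HT, TH, TT
sample_space = ["".join(toss) for toss in product("HT", repeat=2)]

def classical_probability(event):
    """P(event) = n(event) / N(S); valid only when outcomes are equally likely."""
    favourable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favourable), len(sample_space))

p_a = classical_probability(lambda o: o.count("H") == 2)      # A: two heads
p_b = classical_probability(lambda o: o.count("H") == 1)      # B: one head and one tail
p_c = classical_probability(lambda o: "H" in o or "T" in o)   # C: at least one head or one tail
p_d = classical_probability(lambda o: o.count("H") == 4)      # D: four heads (impossible)

print(p_a, p_b, p_c, p_d)
```

Event C covers the whole sample space, so its probability is 1, while event D corresponds to the empty set and has probability 0.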
Example 4-2
A newspaper boy wants to find out the chances that on any day he will be able to sell more than 90
copies of The Times of India. From his diary, where he recorded the daily sales of the last year, he finds
out that out of 365 days, on 75 days he had sold 80 copies, on 144 days he had sold 85 copies, on 62
days he had sold 95 copies and on 84 days he had sold 100 copies of The Times of India. Find out the
required probability for the newspaper boy.
Thus, the number of days when his sales were more than 90 = (62 + 84) days = 146 days
So the required probability is
$P(\text{Sales} > 90) = \frac{n}{N} = \frac{146}{365} = 0.4$
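The relative frequency calculation of Example 4-2 can be sketched as follows. The dictionary simply restates the sales record given in the example; the variable names are illustrative.

```python
from fractions import Fraction

# Sales record from Example 4-2: copies sold -> number of days
sales_days = {80: 75, 85: 144, 95: 62, 100: 84}

total_days = sum(sales_days.values())
days_above_90 = sum(days for copies, days in sales_days.items() if copies > 90)

# Relative frequency: favourable days divided by all observed days
p_above_90 = Fraction(days_above_90, total_days)
print(days_above_90, total_days, float(p_above_90))
```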
Probability Axioms
All three approaches to probability theory share the same basic axioms. These axioms are
fundamental to probability theory and provide us with a unified approach to probability.
The axioms are:
(a) The probability of an event A, written as P(A), must be a number between zero and one, both
values inclusive. Thus
$0 \le P(A) \le 1$ (6-3)
(b) The probability of occurrence of one or the other of all possible events is equal to one. As S denotes
the sample space or the set of all possible events, we write
$P(S) = 1$ (6-4)
Thus in tossing a coin once, P(a head or a tail) = 1.
(c) If two events are such that occurrence of one implies that the other cannot occur, then the
probability that either one or the other will occur is equal to the sum of their individual
probabilities. Thus, in a coin-tossing situation, the occurrence of a head rules out the possibility of
occurrence of tail. These events are called mutually exclusive events. In such cases then, if A and B
are the two events respectively, then
P (A or B) = P (A) + P (B)
i.e. P(Head or Tail) = P (Head) + P (Tail)
It follows from the last two axioms that if two mutually exclusive events form the sample space of the
experiment, then
P(A or B) = P(A) + P(B) = 1; thus P (Head) + P (Tail) = 1
If two or more events together define the total sample space, the events are said to be collectively
exhaustive.
Given the above axioms, we may now define probability as a function, which assigns probability value
P to each sample point of an experiment abiding by the above axioms. Thus, the axioms themselves
define probability.
Interpretation of a Probability
From our discussion so far, we can give a general definition of probability:
Probability is a measure of uncertainty. The probability of event A is a quantitative measure of the
likelihood of the event's occurring.
We have also seen that 0 and 1, both values inclusive, set the range of values that the probability
measure may take. In other words, 0 ≤ P(A) ≤ 1.
When an event cannot occur (impossible event), its probability is zero. The probability of the empty set
is zero: P(Φ) = 0. In a deck where half the cards are red and half are black, the probability of drawing a
green card is zero because the set corresponding to that event is the empty set: there are no green cards.
Events that are certain to occur have probability 1.00. The probability of the entire sample space S is
equal to 1.00: P(S) = 1.00. If we draw a card out of a deck, 1 of the 52 cards in the deck will certainly
be drawn, and so the probability of the sample space, the set of all 52 cards, is equal to 1.00.
Within the range of values 0 to 1, the greater the probability, the more confidence we have in the
occurrence of the event in question. A probability of 0.95 implies a very high confidence in the
occurrence of the event. A probability of 0.80 implies a high confidence. When the probability is 0.5,
the event is as likely to occur as it is not to occur. When the probability is 0.2, the event is not very
likely to occur. When we assign a probability of 0.05, we believe the event is unlikely to occur, and so
on. Figure 4-2 is an informal aid in interpreting probability.
the probability is $\frac{1}{1+1}$, i.e. $\frac{1}{2}$; if the odds are 1 to 2, the probability is $\frac{1}{1+2}$, i.e. $\frac{1}{3}$; and so on. Also,
people sometimes say, "The probability is 80 percent." Mathematically, this probability is 0.80.
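The conversion between odds and probability can be sketched as below. It assumes that odds of a to b are quoted in favour of the event, so the probability is a / (a + b); both function names are illustrative.

```python
from fractions import Fraction

def odds_to_probability(a, b):
    """Odds of a to b in favour of an event correspond to probability a / (a + b)."""
    return Fraction(a, a + b)

def probability_to_odds(p):
    """Return (a, b) such that the odds in favour are a to b, in lowest terms."""
    p = Fraction(p)
    return p.numerator, p.denominator - p.numerator

print(odds_to_probability(1, 1))             # odds of 1 to 1
print(odds_to_probability(1, 2))             # odds of 1 to 2
print(probability_to_odds(Fraction(4, 5)))   # a probability of 80 percent
```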
4.2.3 Probability Rules
We have seen how to compute probabilities in certain situations. The nature of the events was relatively
simple, so that direct application of the definition of probability could be used for computation. Quite
often, we are interested in the probability of occurrence of more complex events. Consider for example,
that you want to find the probability that a king or a club will occur in a draw from a deck of 52 cards.
Similarly, on examining couples with two children, if one child is known to be a boy, you may be
interested in the probability of the event of both the children being boys. These two situations, we find,
are not as simple as those discussed in the earlier section. As a sequel to the theoretical development in
the field of probability, certain results are available which help us in computing probabilities in such
situations. Now we will explore these results through examples.
Example 4-3
A card is drawn from a well-shuffled pack of playing cards. Find the probability that the card drawn is
either a club or a king.
Solution: Let A be the event that a club is drawn and B the event that a king is drawn. Then,
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
= 13/52 + 4/52 – 1/52
= 16/52
= 4/13
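The result of Example 4-3 can be verified by building the deck as a set and comparing a direct count of the union with the rule of unions. This is a sketch only; the card encoding as (rank, suit) pairs is an assumption made for illustration.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = set(product(ranks, suits))   # 52 equally likely (rank, suit) pairs

clubs = {card for card in deck if card[1] == "clubs"}   # event A
kings = {card for card in deck if card[0] == "K"}       # event B

def prob(event):
    """Classical probability of an event given as a set of cards."""
    return Fraction(len(event), len(deck))

direct = prob(clubs | kings)                                # count the union directly
by_rule = prob(clubs) + prob(kings) - prob(clubs & kings)   # rule of unions
print(direct, by_rule)
```

Both routes give the same value because the king of clubs, counted once in each event, is subtracted once through the intersection term.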
Example 4-4
Suppose your chance of being offered a certain job is 0.45, your probability of getting another job is
0.55, and your probability of being offered both jobs is 0.30. What is the probability that you will be
offered at least one of the two jobs?
Solution: Let A be the event that the first job is offered and B the event that the second job is
offered. Then,
P(A) = 0.45, P(B) = 0.55 and $P(A \cap B) = 0.30$
So, the required probability is given as:
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
= 0.45 + 0.55 – 0.30
= 0.70
Mutually Exclusive Events
When the sets corresponding to two events are disjoint (i.e., have no intersection), the two events are
called mutually exclusive (see Figure 6-4).
This is not really a new rule since we can always use the rule of unions for the union of two events: If
the events happen to be mutually exclusive, we subtract zero as the probability of the intersection.
Example 4-5
A card is drawn from a well-shuffled pack of playing cards. Find the probability that the card drawn is
either a king or a queen.
Solution: Let A be the event that a king is drawn and B the event that a queen is drawn. Since
A and B are two mutually exclusive events, we have,
$P(A \cup B) = P(A) + P(B)$
= 4/52 + 4/52
= 8/52
= 2/13
We can extend the Rule of Unions to three (or more) events. Let A, B, and C be the three events defined
over the sample space S, as shown in Figure 6-5. Then, the Rule of Unions is
$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(B \cap C) - P(A \cap C) + P(A \cap B \cap C)$ (6.8)
When the three events are mutually exclusive (see Figure 8-6), the Rule of Unions is
$P(A \cup B \cup C) = P(A) + P(B) + P(C)$ (6.9)
Example 4-6
A card is drawn from a well-shuffled pack of playing cards. Find the probability that the card drawn is
(a) either a heart or an honour or king
(b) either an ace or a king or a queen
Solution: (a) Let A be the event that a heart is drawn, B the event that an honour is drawn and C the
event that a king is drawn. So we have
n(A) = 13, n(B) = 20, n(C) = 4
$n(A \cap B) = 5$, $n(B \cap C) = 4$, $n(A \cap C) = 1$
and $n(A \cap B \cap C) = 1$
The required probability (using Eq. (6.8)) is
$P(A \cup B \cup C)$ = 13/52 + 20/52 + 4/52 – 5/52 – 4/52 – 1/52 + 1/52
= 28/52
= 7/13
(b) Let A be the event that an ace is drawn, B the event that a king is drawn and C the event that a
queen is drawn. So we have
n(A) = 4 n(B) = 4 n(C) = 4
Since A, B and C are mutually exclusive events, the required probability (using Eq. (6.9)) is
$P(A \cup B \cup C)$ = 4/52 + 4/52 + 4/52
= 12/52
= 3/13
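The rule of unions for three or more events alternates between adding and subtracting intersections of increasing size. A minimal sketch of that pattern, applied to both parts of Example 4-6, is given below. It assumes the honour cards are the ace, king, queen, jack and ten of each suit (which matches n(B) = 20 in the solution); the function name is illustrative.

```python
from fractions import Fraction
from itertools import combinations, product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = set(product(ranks, suits))

def prob(event):
    return Fraction(len(event), len(deck))

def union_by_inclusion_exclusion(events):
    """Add single events, subtract pairwise intersections, add triple ones, and so on."""
    total = Fraction(0)
    for k in range(1, len(events) + 1):
        sign = 1 if k % 2 == 1 else -1
        for group in combinations(events, k):
            total += sign * prob(set.intersection(*group))
    return total

hearts = {c for c in deck if c[1] == "hearts"}
honours = {c for c in deck if c[0] in {"A", "K", "Q", "J", "10"}}   # assumed definition
kings = {c for c in deck if c[0] == "K"}
aces = {c for c in deck if c[0] == "A"}
queens = {c for c in deck if c[0] == "Q"}

p_part_a = union_by_inclusion_exclusion([hearts, honours, kings])
p_part_b = union_by_inclusion_exclusion([aces, kings, queens])
print(p_part_a, p_part_b)
```

For mutually exclusive events, as in part (b), every intersection is empty, so only the first round of additions contributes.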
Example 4-7
Find the probability of the event of getting a total of less than 12 in the experiment of throwing
a die twice.
Solution: Let A be the event of getting a total of 12.
Then we have,
A = {(6,6)} and P(A) = 1/36
The event of getting a total of less than 12 is the complement of A, so the required probability is
$P(\bar{A}) = 1 - P(A)$
$P(\bar{A}) = 1 - 1/36$
$P(\bar{A}) = 35/36$
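The rule of complements in Example 4-7 can be cross-checked against a direct count over the 36 outcomes of two throws. The sketch below uses illustrative variable names.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes

# Event A: a total of 12
p_total_12 = Fraction(sum(1 for a, b in rolls if a + b == 12), len(rolls))

# Rule of complements: P(not A) = 1 - P(A)
p_less_than_12 = 1 - p_total_12

# Direct count of totals below 12, as a cross-check
p_direct = Fraction(sum(1 for a, b in rolls if a + b < 12), len(rolls))
print(p_less_than_12, p_direct)
```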
Therefore, the probability of event A given the occurrence of event B is defined as the probability of the
intersection of A and B, divided by the probability of event B.
Example 4-8
For an experiment of throwing a die twice, find the probability:
(a) of the event of getting a total of 9, given that the die has shown up points between 4 and 6 (both
inclusive)
(b) of the event of getting points between 4 and 6 (both inclusive), given that a total of 9 has already
been obtained
Solution: Let getting a total of 9 be the event A and the die showing points between 4 and 6 (both
inclusive) on both throws be the event B.
Thus, N(S) = 36 and A = {(3,6), (4,5), (5,4), (6,3)}
B = {(4,4), (4,5), (4,6), (5,4), (5,5), (5,6), (6,4), (6,5), (6,6)}
and $A \cap B$ = {(4,5), (5,4)}
So n(A) = 4, n(B) = 9, $n(A \cap B) = 2$
So the required probabilities are
(a) $P(A / B) = \frac{P(A \cap B)}{P(B)} = \frac{2/36}{9/36} = \frac{2}{9}$
(b) $P(B / A) = \frac{P(B \cap A)}{P(A)} = \frac{2/36}{4/36} = \frac{1}{2}$
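The two conditional probabilities of Example 4-8 can be obtained by restricting attention to the outcomes in the conditioning event. The sketch below assumes, as the listed set B indicates, that event B means both throws show between 4 and 6; the function name `conditional` is illustrative.

```python
from fractions import Fraction
from itertools import product

rolls = set(product(range(1, 7), repeat=2))   # 36 equally likely outcomes

A = {r for r in rolls if sum(r) == 9}                        # total of 9
B = {r for r in rolls if all(4 <= face <= 6 for face in r)}  # both throws between 4 and 6

def conditional(event, given):
    """P(event / given) = P(event and given) / P(given).

    With equally likely outcomes this reduces to n(event and given) / n(given).
    """
    return Fraction(len(event & given), len(given))

print(conditional(A, B), conditional(B, A))
```

The two answers differ because the conditioning event changes the reduced sample space: 9 outcomes when B is given, 4 outcomes when A is given.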
Example 4-9
A box contains 10 balls, of which 2 are green, 5 are red and 3 are black. Two balls are drawn at
random, one after the other without replacement, from the box. Find the probabilities that:
(a) both the balls are of green color
(b) both the balls are of black color
(c) both the balls are of red color
(d) the first ball is red and the second one is black
(e) the first ball is green and the second one is red
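Solution: (a) $P(G_1 \cap G_2) = P(G_2 / G_1) \cdot P(G_1) = \frac{1}{9} \times \frac{2}{10} = \frac{1}{45}$
(b) $P(B_1 \cap B_2) = P(B_2 / B_1) \cdot P(B_1) = \frac{2}{9} \times \frac{3}{10} = \frac{1}{15}$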
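The product rule used in Example 4-9 applies in the same way to every part of the question. The sketch below computes all five parts under the example's assumption of drawing without replacement; the values for parts (c) to (e) come from this calculation rather than from the worked solution, and the function name is illustrative.

```python
from fractions import Fraction

counts = {"green": 2, "red": 5, "black": 3}
total = sum(counts.values())   # 10 balls in the box

def p_sequence(first, second):
    """P(first colour, then second colour) = P(first) * P(second / first)."""
    p_first = Fraction(counts[first], total)
    # One ball of the first colour has been removed before the second draw
    remaining = counts[second] - (1 if second == first else 0)
    p_second_given_first = Fraction(remaining, total - 1)
    return p_first * p_second_given_first

p_a = p_sequence("green", "green")   # (a) both green
p_b = p_sequence("black", "black")   # (b) both black
p_c = p_sequence("red", "red")       # (c) both red
p_d = p_sequence("red", "black")     # (d) red, then black
p_e = p_sequence("green", "red")     # (e) green, then red
print(p_a, p_b, p_c, p_d, p_e)
```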
Example 4-10
A consulting firm is bidding for two jobs, one with each of two large multinational corporations. The
company executives estimate that the probability of obtaining the consulting job with firm A, event A,
is 0.45. The executives also feel that if the company should get the job with firm A, then there is a 0.90
probability that firm B will also give the company the consulting job. What are the company's chances
of getting both jobs?
Solution: We are given P(A) = 0.45. We also know that P(B / A) = 0.90, and we are looking for
$P(A \cap B)$, which is the probability that both A and B will occur.
So $P(A \cap B) = P(B / A) \cdot P(A) = 0.90 \times 0.45 = 0.405$
Independent Events
Two events are said to be independent of each other if the occurrence or non-occurrence of one event in
any trial does not affect the occurrence of the other event in any trial. Events A and B are independent
of each other if and only if the following three conditions hold:
Intersection Rule
The probability of the intersection of several independent events A1, A2, … is just the product of
their separate probabilities, i.e.
$P(A_1 \cap A_2 \cap A_3 \cap \dots) = P(A_1) \cdot P(A_2) \cdot P(A_3) \cdots$ (6.15)
Union Rule
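The probability of the union of several independent events A1, A2, …, An is given by the following
equation
$P(A_1 \cup A_2 \cup \dots \cup A_n) = 1 - P(\bar{A}_1) \cdot P(\bar{A}_2) \cdots P(\bar{A}_n)$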
The union of several events is the event that at least one of the events happens.
Example 4-11
A problem in mathematics is given to five students A, B, C, D and E. Their chances of solving it are 1/2,
1/3, 1/3, 1/4 and 1/5 respectively. Find the probability that the problem will
(a) not be solved
(b) be solved
Solution: (a) The problem will not be solved when none of the students solves it. So the required
probability is:
$P(\text{problem will not be solved}) = P(\bar{A}) \cdot P(\bar{B}) \cdot P(\bar{C}) \cdot P(\bar{D}) \cdot P(\bar{E})$
$= (1 - 1/2)(1 - 1/3)(1 - 1/3)(1 - 1/4)(1 - 1/5)$
$= 2/15$
(b) The problem will be solved when at least one of the students solves it. So the required probability
is:
$P(A \cup B \cup C \cup D \cup E) = 1 - P(\bar{A}) \cdot P(\bar{B}) \cdot P(\bar{C}) \cdot P(\bar{D}) \cdot P(\bar{E})$
$= 1 - 2/15$
$= 13/15$
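The calculation in Example 4-11 rests on the students solving the problem independently, so that the probabilities of failure multiply. A minimal sketch, with illustrative variable names:

```python
from fractions import Fraction
from math import prod

# Chances of students A, B, C, D and E solving the problem
solve_probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 3), Fraction(1, 4), Fraction(1, 5)]

# Intersection rule for independent events, applied to the complements
p_not_solved = prod(1 - p for p in solve_probs)

# Union rule: at least one student solves it
p_solved = 1 - p_not_solved
print(p_not_solved, p_solved)
```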
4.2.4 Bayes’ Theorem
As we have already noted in the introduction, the basic objective behind calculating probabilities is to
help us in making decisions by quantifying the uncertainties involved in the situations. Quite often,
whether it is in our personal life or our work life, decision-making is an ongoing process. Consider for
example, a seller of winter garments, who is interested in the demand of the product. In deciding on the
amount he should stock for this winter, he has computed the probability of selling different quantities
and has noted that the chance of selling a large quantity is very high. Accordingly, he has taken the
decision to stock a large quantity of the product. Suppose, when finally the winter comes and the season
ends, he discovers that he is left with a large quantity of stock. Assuming that he stays in this business, he
feels that the earlier probability calculation should be updated given the new experience to help him
decide on the stock for the next winter.
Similar to the situation of the seller of winter garments, situations exist where we are interested in an
event on an ongoing basis. Every time some new information is available, we revise our odds
mentally. This revision of probability with added information is formalised in probability theory with
the help of the famous Bayes' Theorem. The theorem, discovered in 1761 by the English clergyman
Thomas Bayes, has had a profound impact on the development of statistics and is responsible for the
emergence of a new philosophy of science. Bayes himself is said to have been unsure of his
extraordinary result, which was presented to the Royal Society by a friend in 1763 - after Bayes' death.
We will first understand The Law of Total Probability, which is helpful for derivation of Bayes'
Theorem.
The sets B and $\bar{B}$ form a partition of the sample space. A partition of a space is the division of the
sample space into a set of events that are mutually exclusive (disjoint sets) and cover the whole space.
Whatever event B may be, either B or $\bar{B}$ must occur, but not both. Figure 6-9 demonstrates this
situation and the law of total probability.
or $P(A) = \sum_{i=1}^{n} P(A / B_i) \cdot P(B_i)$ (6.18)
Figure 8-10 shows the partition of a sample space into five events B1, B2, B3, B4 and B5 ; and shows
their intersections with set A.
As can be seen from the figure, the event A is the set addition of the intersections of A with
each of the four sets H, D, C, and S.
Example 4-12
A market analyst believes that the stock market has a 0.70 probability of going up in the next year if the
economy should do well, and a 0.20 probability of going up if the economy should not do well during
the year. The analyst believes that there is a 0.80 probability that the economy will do well in the
coming year. What is the probability that stock market will go up next year?
Solution: Let U be the event that the stock market will go up and W the event that the economy will do
well in the coming year.
Then
$P(U) = P(U / W) \cdot P(W) + P(U / \bar{W}) \cdot P(\bar{W})$
$= (0.70)(0.80) + (0.20)(0.20)$
$= 0.56 + 0.04$
$= 0.60$
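The law of total probability weights each conditional probability by the probability of its condition and adds the results. The sketch below applies it to Example 4-12; exact fractions are used to avoid floating-point rounding, and the function and key names are illustrative.

```python
from fractions import Fraction

def total_probability(priors, conditionals):
    """P(A) = sum of P(A / B_i) * P(B_i) over a partition B_1, ..., B_n."""
    assert sum(priors.values()) == 1, "the B_i must form a partition"
    return sum(conditionals[state] * priors[state] for state in priors)

# P(W) and P(not W): the economy does well or does not do well
priors = {"well": Fraction("0.80"), "not well": Fraction("0.20")}
# P(U / W) and P(U / not W): the market goes up under each condition
conditionals = {"well": Fraction("0.70"), "not well": Fraction("0.20")}

p_up = total_probability(priors, conditionals)
print(float(p_up))
```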
BAYES’ THEOREM
We will now develop Bayes' theorem. Bayes' theorem is easily derived from the law of total
probability and the definition of conditional probability.
By definition of conditional probability, we have
$P(B / A) = \frac{P(B \cap A)}{P(A)}$ (6.19)
By product rule, we have
$P(B \cap A) = P(A \cap B) = P(A / B) \cdot P(B)$ (6.20)
Substituting Eq. (6.20) in Eq. (6.19), we have
$P(B / A) = \frac{P(A / B) \cdot P(B)}{P(A)}$ (6.21)
By the law of total probability, we have
$P(A) = P(A / B) \cdot P(B) + P(A / \bar{B}) \cdot P(\bar{B})$
Substituting this expression for P(A) in the denominator of Eq. (6.21), we have Bayes' theorem
$P(B / A) = \frac{P(A / B) \cdot P(B)}{P(A / B) \cdot P(B) + P(A / \bar{B}) \cdot P(\bar{B})}$ (6.22)
Thus the theorem allows us to reverse the conditionality of events: we can obtain the probability of B
given A from the probability of A given B (and other information). As we see from the theorem, the
probability of B given A is obtained from the probabilities of B and $\bar{B}$ and from the conditional
probabilities of A given B and A given $\bar{B}$.
The probabilities P(B) and $P(\bar{B})$ are called prior probabilities of the events B and $\bar{B}$; the probability
P(B / A) is called the posterior probability of B. It is possible to write Bayes' theorem in terms of $\bar{B}$ and
A, thus giving the posterior probability of $\bar{B}$, $P(\bar{B} / A)$. Bayes' theorem may be viewed as a means of
transforming our prior probability of an event B into a posterior probability of the event B, posterior to
the known occurrence of event A.
Bayes' theorem can be extended to a partition of more than two sets. This is done by using the law
of total probability involving a partition into sets B1, B2, …, Bn. The resulting form of Bayes' theorem
is:
$P(B_i / A) = \frac{P(A / B_i) \cdot P(B_i)}{\sum_{j=1}^{n} P(A / B_j) \cdot P(B_j)}$ (6.23)
The theorem gives the probability of one of the sets in the partition, Bi, given the occurrence of event A.
Example 4-13
An economist believes that during periods of high economic growth, the Indian Rupee appreciates with
probability 0.70; in periods of moderate economic growth, it appreciates with probability 0.40; and
during periods of low economic growth, the Rupee appreciates with probability 0.20. During any period
of time the probability of high economic growth is 0.30; the probability of moderate economic growth is
0.50 and the probability of low economic growth is 0.20. Suppose the Rupee value has been
appreciating during the present period. What is the probability that we are experiencing the period of (a)
high, (b) moderate, and (c) low, economic growth?
Solution: Our partition consists of three events: high economic growth (event H), moderate economic
growth (event M) and low economic growth (event L). The prior probabilities of these events are:
P(H) = 0.30 P(M) = 0.50 P(L) = 0.20
Let A be the event that the rupee appreciates. We have the conditional probabilities
P(A / H) = 0.70 P(A / M) = 0.40 P(A / L) = 0.20
By using Bayes' theorem we can find out the required probabilities
P(H / A), P(M / A) and P(L / A)
(a) P(H / A)
$P(H / A) = \frac{P(A / H) \cdot P(H)}{P(A / H) \cdot P(H) + P(A / M) \cdot P(M) + P(A / L) \cdot P(L)}$
$= \frac{(0.70)(0.30)}{(0.70)(0.30) + (0.40)(0.50) + (0.20)(0.20)}$
$= 0.467$
(b) P(M / A)
$P(M / A) = \frac{P(A / M) \cdot P(M)}{P(A / H) \cdot P(H) + P(A / M) \cdot P(M) + P(A / L) \cdot P(L)}$
$= \frac{(0.40)(0.50)}{(0.70)(0.30) + (0.40)(0.50) + (0.20)(0.20)}$
$= 0.444$
(c) P(L / A)
$P(L / A) = \frac{P(A / L) \cdot P(L)}{P(A / H) \cdot P(H) + P(A / M) \cdot P(M) + P(A / L) \cdot P(L)}$
$= \frac{(0.20)(0.20)}{(0.70)(0.30) + (0.40)(0.50) + (0.20)(0.20)}$
$= 0.089$
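The three posterior probabilities of Example 4-13 share the same denominator, P(A), so they can be computed together. The following is a sketch under the example's figures; the function name `posterior` and the dictionary keys are illustrative.

```python
from fractions import Fraction

# Prior probabilities of the growth states: P(H), P(M), P(L)
priors = {"high": Fraction("0.30"), "moderate": Fraction("0.50"), "low": Fraction("0.20")}
# Probability that the rupee appreciates in each state: P(A / H), P(A / M), P(A / L)
likelihoods = {"high": Fraction("0.70"), "moderate": Fraction("0.40"), "low": Fraction("0.20")}

def posterior(priors, likelihoods):
    """Bayes' theorem over a partition: P(B_i / A) for every B_i."""
    joint = {state: likelihoods[state] * priors[state] for state in priors}
    evidence = sum(joint.values())   # P(A), by the law of total probability
    return {state: joint[state] / evidence for state in joint}

post = posterior(priors, likelihoods)
for state, p in post.items():
    print(state, round(float(p), 3))
```

The posteriors add up to one, which is a useful check on any hand calculation of this kind.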
4.3 SOME COUNTING CONCEPTS
If there are n events and event i can occur in $N_i$ possible ways, then the number of ways in which the
sequence of n events may occur is
$N_1 \cdot N_2 \cdot N_3 \cdots N_n$ (6.24)
Suppose that a bank has two branches, each branch has two departments, and each department has four
employees. Then there are (2)(2)(4) = 16 choices of employees, and the probability that a particular one will
be randomly selected is 1/[(2)(2)(4)] = 1/16.
We may view the choice as done sequentially: First a branch is randomly chosen, then a department
within the branch, and then the employee within the department. This is demonstrated in the tree
diagram in Figure 4-12.
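The sequential choice in the bank illustration can be enumerated directly, which mirrors the branches of a tree diagram. In this sketch the branch, department and employee labels are placeholders invented for illustration.

```python
from fractions import Fraction
from itertools import product

branches = ["branch 1", "branch 2"]
departments = ["department 1", "department 2"]
employees = ["employee 1", "employee 2", "employee 3", "employee 4"]

# Every path through the tree: one branch, then one department, then one employee
choices = list(product(branches, departments, employees))

n_choices = len(choices)                # equals (2)(2)(4)
p_particular = Fraction(1, n_choices)   # each path is equally likely
print(n_choices, p_particular)
```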
there are (10)(9)(8)(7) = 5,040 selections. We can see that this is equal to n(n − 1)(n − 2) … (n − r + 1),
which is equal to $^{n}P_{r} = \frac{n!}{(n - r)!}$.
If choices are made randomly, the probability of any predetermined assignment of 4 people out of a
group of 10 is 1/5,040.
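The count of ordered selections of 4 people out of 10 can be reached in three equivalent ways. The sketch below uses `math.perm`, which is available in Python 3.8 and later.

```python
from fractions import Fraction
from math import factorial, perm

n, r = 10, 4

by_product = 10 * 9 * 8 * 7                       # n(n - 1)...(n - r + 1)
by_formula = factorial(n) // factorial(n - r)     # n! / (n - r)!
by_library = perm(n, r)                           # built-in permutation count

p_assignment = Fraction(1, by_library)            # one predetermined assignment
print(by_product, by_formula, by_library, p_assignment)
```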
Combinations are the possible selections of r items from a group of n items regardless of the order of
selection. The number of combinations is denoted by nCr and is read n choose r. We define the number
of combinations of r out of n elements as
$^{n}C_{r} = \frac{n!}{r!\,(n - r)!}$ (6.27)
Suppose that 3 out of the 10 members of the board of directors of a large corporation are to be randomly
selected to serve on a particular task committee. How many possible selections are there? Using Eq.
(6.27), we find that the number of combinations is $^{n}C_{r} = \frac{n!}{r!\,(n - r)!} = \frac{10!}{3!\,7!} = 120$.
If the committee is chosen in a truly random fashion, what is the probability that the three committee
members chosen will be the three senior board members? This is 1 combination out of a total of 120, so
the answer is 1/120 = 0.00833.
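The committee count can be confirmed by listing every unordered selection of 3 members out of 10 and comparing with the formula. In this sketch the member labels are placeholders, and `math.comb` requires Python 3.8 or later.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

board = [f"member {i}" for i in range(1, 11)]   # 10 board members

# Each committee is an unordered selection of 3 members
committees = list(combinations(board, 3))
n_committees = len(committees)

# Probability that one specific committee (the three senior members) is chosen
p_three_seniors = Fraction(1, n_committees)
print(n_committees, comb(10, 3), round(float(p_three_seniors), 5))
```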
4.4 CHECK YOUR PROGRESS
1. Subjective Approach involves personal judgment, information, intuition, and other ……… criteria.
2. Rule of Unions allows us to write the probability of the union of two events in terms of the
probabilities of the two events and the probability of their ………. .
3. The Rule of Complements defines the probability of the ………… of an event in terms of the
probability of the original event.
4. The Product Rule allows us to write the probability of the ……… occurrence of two (or more)
events.
5. Bayes' theorem was discovered in 1761 by the English clergyman ………… .
4.5 SUMMARY
In business, we may take decisions regarding price increases, reduction of sales expenses, etc. to
manage the demand. However, in order to make such decisions, we need to quantify the chances of
different quantities of demand in the coming year. Probability theory provides us with the ways and
means to quantify the uncertainties involved in such situations. Three different approaches to the
definition and interpretation of probability have evolved, mainly to cater to the three different types of
situations under which probability measures are normally required. There are different rules for finding
probability, such as the union rule, complement rule and product rule. Every time some new information is
available, we revise our odds mentally. This revision of probability with added information is
formalized in probability theory with the help of the famous Bayes' Theorem.
4.6 KEYWORDS
Experiment: It is a process that leads to one of several possible outcomes. An outcome of an
experiment is some observation or measurement.
Sample space: It is the universal set S pertinent to a given experiment. It is the set of all possible
outcomes of an experiment.
Event: It is a subset of a sample space. It is a set of basic outcomes. We say that the event occurs if
the experiment gives rise to a basic outcome belonging to the event.
Classical Approach: In this, probability of an event is defined as the relative size of the event with
respect to the size of the sample space.
Relative Frequency Approach: It is used to compute probability from observed data when outcomes cannot be assumed equally likely. As per this approach,
the probability of occurrence of an event is given by the observed relative frequency of an event in a
very large number of trials.
Permutations: It is the possible ordered selections of r objects out of a total of n objects.
Combinations: It is the possible selections of r items from a group of n items regardless of the order of
selection.
4.7 SELF-ASSESSMENT TEST
1. Explain what you understand by the term 'probability'. How is the concept of probability
relevant to decision making under uncertainty?
2. What are different approaches to the definition of probability? Are these approaches contradictory
to one another? Which of these approaches will you apply for calculating the probability that:
(a) A leap year, selected at random, will contain 53 Mondays.
(b) An item, selected at random from a production process, is defective.
(c) Mr. Bhupinder S. Hooda will win the assembly election from Kiloi.
3. With the help of an example explain the meaning of the following:
(a) Random experiment, and sample space
(b) An event as a subset of sample space
(c) Equally likely events
(d) Mutually exclusive events.
(e) Exhaustive events
(f) Elementary and compound events.
4. A proofreader is interested in finding the probability that the number of mistakes in a page will be
less than 10. From his past experience he finds that out of 3600 pages he has proofed, 200 pages
contained no errors, 1200 pages contained 5 errors, and 2200 pages contained 11 or more errors.
Can you help him in finding the required probability?
5. State and develop the Addition Theorem of probability for:
(a) mutually exclusive events
(b) overlapping events
(c) complementary events
6. Explain the concept of conditional probability with the help of a suitable example.
7. State and develop the Multiplication Theorem of probability for:
(a) dependent events
(b) independent event
8. State Bayes' Theorem of probability. Using an appropriate example, develop the Bayesian
probability rule and generalize it.
9. What do you understand by permutations and combinations?
(a) In how many ways can we select three players out of 12 players of the Indian Cricket
team, for playing in the World XI team?
(b) In how many ways can a sub-committee of 2 out of 6 members of the executive
committee of the employees' association be constituted?
10. What is the probability that a non-leap year, selected at random, will contain
(a) 52 Sundays? (b) 53 Sundays? (c) 54 Sundays?
11. A card is drawn at random from a well-shuffled deck of 52 cards. Find the probability that
(a) the card is either a club or diamond
(b) the card is not a king
(c) the card is either a face card or a club card.
12. From a well-shuffled deck of 52 cards, two cards are drawn at random.
(a) If the cards are drawn simultaneously, find the probability that these consist of (i) both
clubs, (ii) a king and a queen, (iii) a face card and an 8.
(b) If the cards are drawn one after the other with replacement, find the probability that these
consist of (i) both clubs, (ii) a king and a queen, (iii) a face card and an 8.
13. A problem in mathematics is given to four students A, B, C and D. Their chances of solving it are
1/2, 1/3, 1/4 and 1/5 respectively. Find the probability that the problem will
(a) be solved
(b) not be solved
14. The odds that A speaks the truth are 3:2 and the odds that B does so are 7:3. In what percentage of
cases are they likely to
(a) contradict each other on an identical point?
(b) agree with each other on an identical point?
15. Among the sales staff engaged by a company 60% are males. In terms of their professional
qualifications, 70% of males and 50% of females have a degree in marketing. Find the probability
that a sales person selected at random will be
(a) a female with degree in marketing
(b) a male without degree in marketing
16. A and B play for a prize of Rs. 10,000. A is to throw a die first and is to win if he throws 1. If A
fails, B is to throw and is to win if he throws 2 or 1. If B fails, A is to throw again and to win if he
throws 3, 2 or 1, and so on. Find their respective expectations.
17. A factory has three units A, B, and C. Unit A produces 50% of its products, and units B and C each
produce 25% of the products. The percentages of defective items produced by units A, B, and C are
3%, 2% and 1%, respectively. If an item selected at random from the total production of the
factory is found defective, what is the probability that it was produced by:
(a) Unit A (b) Unit B (c) Unit C
STRUCTURE
5.1 Learning Objectives
5.2 Introduction
5.2.1 Discrete Probability Distribution
5.2.2 Bernoulli Random Variable
5.3 The Binomial Distribution
5.3.1 Conditions for a Binomial Random Variable
5.3.2 Binomial Probability Function
5.3.3 Characteristics of a Binomial Distribution
5.3.4 Importance of the Binomial Distribution
5.3.5 Fitting a Binomial Distribution
5.4 The Poisson Distribution
5.4.1 Characteristics of Poisson Distribution
5.4.2 Role of the Poisson Distribution
5.4.3 Fitting a Poisson Distribution
5.5 Check your Progress
5.6 Summary
5.7 Keywords
5.8 Self-Assessment Test
5.9 Answers to check your progress
5.10 References/Suggested Readings
Now look at the variable "the number of boys out of three births". This number varies among sample
points in the sample space and can take values 0, 1, 2, 3, and it is random, i.e. given to chance.
Discrete if it takes only a countable number of values. For example, number of dots on two dice,
number of heads in three coin tosses, number of defective items, number of boys in three births
and so on.
Continuous if it can take on any value in an interval of numbers (i.e. its possible values are
uncountably infinite). For example, measured data on heights, weights, temperature, time and
so on.
A random variable has a probability law - a rule that assigns probabilities to different values of the
random variable. This probability law - the probability assignment is called the probability distribution
of the random variable. We usually denote the random variable by X. In this lesson, we will discuss
discrete probability distributions. Continuous probability distributions will be discussed in the next
lesson.
The random variable X denoting "the number of boys out of three births", which we introduced in the
introduction of the lesson, is a discrete random variable; so it will have a discrete probability
distribution. It is easy to visualize that the random variable X is a function of sample space. We can see
the correspondence of sample points with the values of the random variable as follows:
The correspondence between sample points and the value of the random variable allows us to determine
the probability distribution of X as follows:
The above probability statements constitute the probability distribution of the random variable X
= number of boys in three births. We may appreciate how this probability law is obtained
simply by associating values of X with sets in the sample space. (For example, the set GGB,
GBG, BGG leads to X = 1.) We may write down the probability distribution of X in table format
(see Table 5-1) or we may plot it graphically by means of a probability histogram (see Figure 5-1a)
or a line chart (see Figure 5-1b).
Table 5-1
X = x      P(X = x)
0          1/8
1          3/8
2          3/8
3          1/8
Figure 5-1 Probability Distribution of the Number of Boys out of Three Births: (a) probability histogram, (b) line chart (horizontal axis X, vertical axis P(X))
The probability distribution of a discrete random variable X must satisfy the following two
conditions:
1. $P(X = x) \ge 0$ for all values x
2. $\sum_{\text{all } x} P(X = x) = 1$
These conditions must hold because the P(X = x) values are probabilities. The first condition specifies that
all probabilities must be greater than or equal to zero, as we know from Lesson 4.
For the second condition, we note that for each value x, P(x) = P(X = x) is the probability of the event
that the random variable equals x. Since by definition "all x" means all the values the random variable X
may take, and since X may take on only one value at a time, the occurrences of these values are
mutually exclusive events, and one of them must take place. Therefore, the sum of all the probabilities
P(X = x) must be 1.00.
For example, to find the probability of at most two boys out of three births, we have
$F(2) = P(X \le 2) = \sum_{\text{all } i \le 2} P(i)$
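The distribution of the number of boys, the two conditions, and the cumulative probability can be checked by direct enumeration. The following Python sketch is illustrative only: it assumes that boys and girls are equally likely and that births are independent (the assumption behind the probabilities 1/8, 3/8, 3/8, 1/8), and the names `pmf` and `cdf` are ours rather than the text's.

```python
from fractions import Fraction
from itertools import product

# Enumerate the 8 equally likely sample points (B = boy, G = girl).
sample_space = ["".join(s) for s in product("BG", repeat=3)]

# X = number of boys; accumulate P(X = x) over the sample points.
pmf = {}
for point in sample_space:
    x = point.count("B")
    pmf[x] = pmf.get(x, Fraction(0)) + Fraction(1, len(sample_space))

def cdf(k):
    """F(k) = P(X <= k): the sum of P(X = i) over all i <= k."""
    return sum(p for x, p in pmf.items() if x <= k)

for x in sorted(pmf):
    print(x, pmf[x])
print("sum of probabilities:", sum(pmf.values()))   # condition 2
print("P(X <= 2) =", cdf(2))                        # 1/8 + 3/8 + 3/8 = 7/8
```

Exact fractions are used so that the sum of the probabilities comes out as exactly 1 rather than a rounded decimal.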
In the same way we can calculate the other summary measures viz. skewness, kurtosis and
moments.
X : 0 1 2 3
P(X) : 1/8 3/8 3/8 1/8
Now imagine this experiment is repeated 200 times. We may expect that 'no head' and 'three heads' will
each occur 25 times, and that 'one head' and 'two heads' will each occur 75 times. Since these results are what
we expect on the basis of theory, the resultant distribution is called a theoretical or expected
distribution.
However, when the experiment is actually performed 200 times, the results that we actually
obtain will normally differ from the theoretically expected results. It is quite possible that in an actual
experiment 'no head' and 'three heads' may occur 20 and 28 times respectively, and 'one head' and 'two
heads' may occur 66 and 86 times respectively. The distribution so obtained through actual experiment
is called the empirical or observed distribution.
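The contrast between the expected and the observed distribution can be illustrated with a small simulation. This is a sketch under stated assumptions: three fair coins are tossed 200 times, the seed value is arbitrary, and the simulated counts are not the figures quoted in the text.

```python
import random
from math import comb

N = 200          # number of repetitions of the experiment
n, p = 3, 0.5    # three fair coins

# Theoretical (expected) frequencies: N * P(X = x)
expected = [N * comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

# Empirical (observed) frequencies from one simulated run of N repetitions.
rng = random.Random(1)   # fixed seed, chosen arbitrarily, for repeatability
observed = [0] * (n + 1)
for _ in range(N):
    heads = sum(rng.random() < p for _ in range(n))
    observed[heads] += 1

print("heads  expected  observed")
for x in range(n + 1):
    print(f"{x:5d}  {expected[x]:8.0f}  {observed[x]:8d}")
```

Changing the seed gives a different observed column each time, while the expected column stays at 25, 75, 75, 25.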
In practice, however, assessing the probability of every possible value of a random variable through
actual experiment can be difficult, even impossible, especially when the probabilities are very small.
But we may be able to find out what type of random variable the one at hand is by examining the causes
that make it random. Knowing the type, we can often approximate the random variable to a standard
one for which convenient formulae are available.
The proper identification of experiments with certain known processes in Probability theory can help us
in writing down the probability distribution function. Two such processes are the Bernoulli Process and
the Poisson Process. The standard discrete probability distributions that are consequent to these
processes are the Binomial and the Poisson distribution. We will now look into the conditions that
characterize these processes, and examine the standard distributions associated with the processes. This
will enable us to identify situations for which these distributions apply.
Let us first study the Bernoulli random variable, named so in honor of the mathematician Jakob
Bernoulli (1654-1705). It is the building block for other random variables and the resulting distributions
we will study in this lesson.
5.2.2 Bernoulli Random Variable
Suppose an operator uses a lathe to produce pins, and the lathe is not perfect in the sense that it does not
always produce a good pin. Rather, it has a probability p of producing a good pin and (1 - p) of
producing a defective one. Let us denote a good pin as "success" and a defective pin as "failure".
Just after the operator produces one pin, it is inspected; let X denote the "number of good pins
produced", i.e. "the number of successes".
Now analyzing the trial "inspecting a pin" and our random variable X, "number of successes", we
note two important points:
The trial "inspecting a pin" has only two possible outcomes, which are mutually exclusive. Such a
trial, whose outcome can only be either a success or a failure, is a Bernoulli trial. In other words,
the sample space of a Bernoulli trial is S = {success, failure}.
The random variable, X, that measures the number of successes in one Bernoulli trial, is a Bernoulli
random variable. Clearly, X is 1 if the pin is good and 0 if it is defective.
It is easy to derive the probability distribution of a Bernoulli random variable:
X    : 0     1
P(X) : 1-p   p
If X is a Bernoulli random variable, we may write
X ~ BER (p)
where ~ is read as "is distributed as" and BER stands for Bernoulli.
A Bernoulli random variable is too simple to be of immediate practical use. But it forms the building
block of the Binomial random variable, which is quite useful in practice. The binomial random variable
in turn is the basis for many other useful cases, such as Poisson random variable.
5.3 THE BINOMIAL DISTRIBUTION
In the real world we often make several trials, not just one, to achieve one or more successes. Let us
consider such cases of several trials.
Consider n number of identically and independently distributed Bernoulli random variables X1, X2
………, Xn. Here, identically means that they all have the same p, and independently means that the
value of one X does not in any way affect the value of another. For example, the value of X2 does not
affect the value of X3 or X8 and so on. Such a sequence of identically and independently distributed
Bernoulli variables is called a Bernoulli Process.
Suppose an operator produces n pins, one by one, on a lathe that has probability p of making a good pin
at each trial. The sequence of numbers (1 or 0) denoting the good and defective pins produced in each of
the n trials is a Bernoulli process. For example, in the sequence of nine trials denoted by 001011001, the
third, fifth, sixth and ninth are good pins, or successes. The rest are failures.
In practice, we are usually interested in the total number of good pins rather than the sequence of 1's and
0's. In the example above, four out of nine are good. In the general case, let X denote the total number of
good pins produced in n trials. We then have X = X1 + X2 +………+ Xn where all Xi ~ BER(p) and are
independent.
The random variable that counts the number of successes in many independent, identical Bernoulli trials
is called a Binomial Random Variable.
Table 5-2
Binomial Distribution of X
X = x      P(X = x)
0          ${}^nC_0 \, p^0 q^n$
1          ${}^nC_1 \, p^1 q^{n-1}$
…          …
x          ${}^nC_x \, p^x q^{n-x}$
…          …
n          ${}^nC_n \, p^n q^0$
Each of the terms for x = 0, 1, 2, …, n corresponds to a term in the Binomial expansion of $(p + q)^n$, where q = 1 - p.
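The entries of Table 5-2 can be generated mechanically. The sketch below is illustrative: the function name `binomial_pmf` and the values n = 9, p = 0.8 are our own assumptions, not figures from the text.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) = nCx * p^x * q^(n-x), with q = 1 - p."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

n, p = 9, 0.8   # hypothetical: nine pins, each good with probability 0.8
table = [binomial_pmf(x, n, p) for x in range(n + 1)]

for x, prob in enumerate(table):
    print(f"{x}  {prob:.4f}")

# The terms of the expansion of (p + q)^n add up to 1.
print("total:", round(sum(table), 10))
```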
2. Variance
The variance, denoted by $\sigma^2$, of a Binomial distribution is computed as
$V(X) = \sigma^2 = E[(X - \mu)^2] = \sum_{x=0}^{n} (x - \mu)^2 \, P(x)$
(a) First moment about the origin will be $m_1^0 = \sum_{x=0}^{n} x \, P(x) = np = \mu$.
(b) Second moment about the origin will be $m_2^0 = \sum_{x=0}^{n} x^2 \, P(x) = n(n-1)p^2 + np$.
(a) First moment about the mean will be $m_1 = \sum_{x=0}^{n} (x - \mu)^1 \, P(x) = 0$
(b) Second moment about the mean will be $m_2 = \sum_{x=0}^{n} (x - \mu)^2 \, P(x) = npq = \sigma^2$
(c) Third moment about the mean will be $m_3 = \sum_{x=0}^{n} (x - \mu)^3 \, P(x) = npq(q - p)$
(d) Fourth moment about the mean will be $m_4 = \sum_{x=0}^{n} (x - \mu)^4 \, P(x) = 3(npq)^2 + npq(1 - 6pq)$
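The closed-form moments above can be checked numerically by summing over x = 0, 1, …, n. The following sketch assumes hypothetical values n = 12 and p = 0.3; the helper names are ours.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for a Binomial(n, p) random variable."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def raw_moment(r, n, p):
    """r-th moment about the origin: sum of x^r * P(x)."""
    return sum(x**r * binomial_pmf(x, n, p) for x in range(n + 1))

def central_moment(r, n, p):
    """r-th moment about the mean: sum of (x - mu)^r * P(x)."""
    mu = n * p
    return sum((x - mu)**r * binomial_pmf(x, n, p) for x in range(n + 1))

n, p = 12, 0.3        # hypothetical values, chosen only for illustration
q = 1 - p
npq = n * p * q

# Each entry pairs the numerical sum with the closed-form expression.
checks = {
    "origin, 1st": (raw_moment(1, n, p), n * p),
    "origin, 2nd": (raw_moment(2, n, p), n * (n - 1) * p**2 + n * p),
    "mean, 1st":   (central_moment(1, n, p), 0.0),
    "mean, 2nd":   (central_moment(2, n, p), npq),
    "mean, 3rd":   (central_moment(3, n, p), npq * (q - p)),
    "mean, 4th":   (central_moment(4, n, p), 3 * npq**2 + npq * (1 - 6 * p * q)),
}
for name, (numeric, formula) in checks.items():
    print(f"{name:12s} numeric = {numeric:.6f}  formula = {formula:.6f}")
```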
5. Skewness
To bring out the skewness of a Binomial distribution we can calculate the moment coefficient of skewness, $\gamma_1$.
Evaluating $\gamma_1 = \dfrac{q - p}{\sqrt{npq}}$ we note:
the Binomial distribution is skewed to the right i.e. has positive skewness when $\gamma_1 > 0$, which is so
when p < q
the Binomial distribution is skewed to the left i.e. has negative skewness when $\gamma_1 < 0$, which is so
when p > q
the Binomial distribution is symmetrical i.e. has no skewness when $\gamma_1 = 0$, which is so when p = q.
Thus, n being the same, the degree of skewness in a Binomial distribution tends to vanish as p
approaches ½, i.e. as $p \to \tfrac{1}{2}$
for a given value of p, as n increases the Binomial distribution moves to the right, flattens and
Evaluating $\gamma_2 = \dfrac{1 - 6pq}{npq}$ we note
the Binomial distribution is leptokurtic when $\gamma_2 > 0$, which is so when 6pq < 1.
the Binomial distribution is platykurtic when $\gamma_2 < 0$, which is so when 6pq > 1.
the Binomial distribution is mesokurtic when $\gamma_2 = 0$, which is so when 6pq = 1.
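The sign rules for skewness and kurtosis can be seen by evaluating the two coefficients for a few values of p. This is a minimal sketch; the function name `binomial_shape` and the choice n = 20 are assumptions made for illustration.

```python
from math import sqrt

def binomial_shape(n, p):
    """Return (skewness, kurtosis) coefficients of a Binomial(n, p) distribution."""
    q = 1 - p
    gamma1 = (q - p) / sqrt(n * p * q)        # moment coefficient of skewness
    gamma2 = (1 - 6 * p * q) / (n * p * q)    # moment coefficient of kurtosis
    return gamma1, gamma2

for p in (0.1, 0.5, 0.9):
    g1, g2 = binomial_shape(20, p)
    skew = "right" if g1 > 0 else "left" if g1 < 0 else "none"
    kurt = "leptokurtic" if g2 > 0 else "platykurtic" if g2 < 0 else "mesokurtic"
    print(f"p = {p}: gamma1 = {g1:+.4f} (skew: {skew}), gamma2 = {g2:+.4f} ({kurt})")
```

With p = 0.5 the skewness is zero and 6pq = 1.5 exceeds 1, so the distribution is symmetrical and platykurtic; with p = 0.1 we have 6pq = 0.54, so it is right-skewed and leptokurtic.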
7. Normal approximation of the Binomial distribution
If n is large and if neither p nor q is too close to zero, the Binomial distribution can be closely
approximated by a Normal distribution with standardized variable $Z = \dfrac{X - np}{\sqrt{npq}}$.
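The quality of the Normal approximation can be examined by comparing an exact Binomial probability with the corresponding area under the standard Normal curve. The sketch below uses hypothetical values n = 100, p = 0.4 and k = 45. It also applies a continuity correction of 0.5, a common refinement that is our addition and is not stated in the text.

```python
from math import comb, erf, sqrt

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for a Binomial(n, p) random variable."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k + 1))

def normal_cdf(z):
    """Area under the standard Normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p, k = 100, 0.4, 45          # hypothetical values for illustration
q = 1 - p

exact = binomial_cdf(k, n, p)
z = (k + 0.5 - n * p) / sqrt(n * p * q)   # Z = (X - np) / sqrt(npq), with the 0.5 correction
approx = normal_cdf(z)

print(f"exact Binomial P(X <= {k}) = {exact:.4f}")
print(f"Normal approximation       = {approx:.4f}")
```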
8. Poisson approximation of the Binomial distribution
Binomial distribution can reasonably be approximated by the Poisson distribution when n is infinitely
The binomial probability distribution is a discrete probability distribution that is useful in describing an
enormous variety of real life events. For example, a quality control inspector wants to know the
probability of defective light bulbs in a random sample of 10 bulbs if 10% of the bulbs are defective. He
can quickly obtain the answer from tables of the binomial distribution. The binomial distribution can be
used when:
The outcome of each trial in the process is characterized as one of two types of possible
outcomes. In other words, they are attributes.
The probability of the outcome of any trial does not change and is independent of the results of
previous trials.
When a binomial distribution is to be fitted to observed data, the following procedure is adopted:
1. Determine the values of p and q. If one of these values is known, the other can be found from the
simple relationship p = (1 – q), and q = (1 – p). When p and q are equal, the distribution is
symmetrical, for p and q may be interchanged without altering the value of any term, and
consequently terms equidistant from the two ends of the series are equal. If p and q are unequal, the
distribution is skewed. If p is less than ½, the distribution is positively skewed, and when p is more than
½ the distribution is negatively skewed.
2. Expand the binomial $(q + p)^n$. The power n is equal to one less than the number of terms in the
expanded binomial. Thus when two coins are tossed (n = 2) there will be three terms in the binomial.
Similarly, when four coins are tossed (n = 4) there will be five terms and so on.
3. Multiply each term of the expanded binomial by N (the total frequency) in order to obtain the
expected frequency in each category.
Example 5-1
Assuming the probability of male birth as ½, find the probability distribution of number of boys out of 5
births.
(a) Find the probability that a family of 5 children have
Example 5.2 Eight coins are tossed at a time, 256 times. The number of heads observed at each throw is
recorded and the results are given below. Find the expected frequencies. What are the theoretical values
of mean and standard deviation? Calculate also the mean and standard deviation of the observed
frequencies.
Number of heads at a throw      Frequency
0                               2
1                               6
2                               30
3                               52
4                               67
5                               56
6                               32
7                               10
8                               1
Solution: The chance of getting a head in a single throw of one coin is ½, hence p = ½, q = ½, n = 8,
N = 256.
By expanding $256\,(\tfrac{1}{2} + \tfrac{1}{2})^8$ we shall get the expected frequencies of 0, 1, 2, …, 8 heads (successes).
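Number of Heads (X)      Frequency = $N \times {}^nC_r \, q^{n-r} p^r$
0      $256 \times (\tfrac{1}{2})^8 = 1$
1      $256 \times {}^8C_1 (\tfrac{1}{2})^7 (\tfrac{1}{2})^1 = 8$
2      $256 \times {}^8C_2 (\tfrac{1}{2})^6 (\tfrac{1}{2})^2 = 28$
3      $256 \times {}^8C_3 (\tfrac{1}{2})^5 (\tfrac{1}{2})^3 = 56$
4      $256 \times {}^8C_4 (\tfrac{1}{2})^4 (\tfrac{1}{2})^4 = 70$
5      $256 \times {}^8C_5 (\tfrac{1}{2})^3 (\tfrac{1}{2})^5 = 56$
6      $256 \times {}^8C_6 (\tfrac{1}{2})^2 (\tfrac{1}{2})^6 = 28$
7      $256 \times {}^8C_7 (\tfrac{1}{2})^1 (\tfrac{1}{2})^7 = 8$
8      $256 \times {}^8C_8 (\tfrac{1}{2})^0 (\tfrac{1}{2})^8 = 1$
n = 8      Total (N) = 256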
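Mean $= np = 8 \times \tfrac{1}{2} = 4$ and standard deviation $= \sqrt{npq} = \sqrt{8 \times \tfrac{1}{2} \times \tfrac{1}{2}} = \sqrt{2} = 1.414$
These are the mean and standard deviation of the expected frequency distribution. The mean and the
standard deviation of the observed frequency distribution shall be: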
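X      f      d      fd      fd²
0      2      -4     -8      32
1      6      -3     -18     54
2      30     -2     -60     120
3      52     -1     -52     52
4      67     0      0       0
5      56     +1     +56     56
6      32     +2     +64     128
7      10     +3     +30     90
8      1      +4     +4      16
N = 256      ∑fd = 16      ∑fd² = 548
Here d = X - A is the deviation from the assumed mean A = 4.
$\bar{X} = A + \dfrac{\sum fd}{N} = 4 + \dfrac{16}{256} = 4.0625$
$\sigma = \sqrt{\dfrac{\sum fd^2}{N} - \left(\dfrac{\sum fd}{N}\right)^2} = \sqrt{\dfrac{548}{256} - \left(\dfrac{16}{256}\right)^2} = \sqrt{2.1406 - 0.0039} = \sqrt{2.1367} = 1.462$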
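The whole of Example 5.2 can be reproduced in a few lines. The sketch below uses the observed frequencies given in the example; the variable names are ours.

```python
from math import comb, sqrt

n, p, N = 8, 0.5, 256
observed = [2, 6, 30, 52, 67, 56, 32, 10, 1]   # frequencies for 0..8 heads

# Expected frequencies: N times each term of the binomial expansion.
expected = [N * comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

# Theoretical mean and standard deviation: np and sqrt(npq).
theo_mean = n * p
theo_sd = sqrt(n * p * (1 - p))

# Mean and standard deviation of the observed frequency distribution.
obs_mean = sum(x * f for x, f in enumerate(observed)) / N
obs_var = sum(f * (x - obs_mean) ** 2 for x, f in enumerate(observed)) / N
obs_sd = sqrt(obs_var)

print("expected frequencies:", [round(e) for e in expected])
print(f"theoretical mean = {theo_mean}, sd = {theo_sd:.3f}")
print(f"observed mean = {obs_mean:.4f}, sd = {obs_sd:.3f}")
```

The observed mean (4.0625) and standard deviation (about 1.46) are close to the theoretical values of 4 and 1.414, which suggests that the binomial model with p = ½ describes the data reasonably well.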
We can develop the Poisson probability rule from the Binomial probability rule under the above
conditions.
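Let us consider a Bernoulli process with n trials and probability of success in any trial
$p = \dfrac{\lambda}{n}$, where $\lambda > 0$. Then, we know that the probability of x successes in n trials is given by

$$
\begin{aligned}
P(X = x) &= {}^nC_x \left(\frac{\lambda}{n}\right)^{x} \left(1 - \frac{\lambda}{n}\right)^{n-x} \\
&= \frac{n!}{x!\,(n-x)!} \left(\frac{\lambda}{n}\right)^{x} \left(1 - \frac{\lambda}{n}\right)^{n-x} \\
&= \frac{n[n-1][n-2]\cdots[n-(x-1)]}{x!} \left(\frac{\lambda}{n}\right)^{x} \left(1 - \frac{\lambda}{n}\right)^{n-x} \\
&= \frac{\lambda^{x}}{x!} \cdot \frac{n}{n} \cdot \frac{n-1}{n} \cdot \frac{n-2}{n} \cdots \frac{n-(x-1)}{n} \left(1 - \frac{\lambda}{n}\right)^{n-x} \\
&= \frac{\lambda^{x}}{x!} \left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right) \cdots \left(1 - \frac{x-1}{n}\right) \left(1 - \frac{\lambda}{n}\right)^{n} \left(1 - \frac{\lambda}{n}\right)^{-x}
\end{aligned}
$$

Now if $n \to \infty$, then the terms $\left(1 - \frac{1}{n}\right); \left(1 - \frac{2}{n}\right); \ldots; \left(1 - \frac{x-1}{n}\right)$ and $\left(1 - \frac{\lambda}{n}\right)^{-x}$ will all be tending to
1, and $\left(1 - \frac{\lambda}{n}\right)^{n} \to e^{-\lambda}$. Thus we have
$P(X = x) = \dfrac{e^{-\lambda} \lambda^{x}}{x!}$ where x = 0, 1, 2, …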
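The limiting argument can be illustrated numerically: holding λ = np fixed and letting n grow, the Binomial probability of x successes approaches the Poisson probability. The values λ = 2 and x = 3 below are hypothetical choices made for this sketch.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """P(X = x) for a Binomial(n, p) random variable."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    """P(X = x) = e^(-lam) * lam^x / x!"""
    return exp(-lam) * lam**x / factorial(x)

lam, x = 2.0, 3                 # hypothetical values for illustration
limit = poisson_pmf(x, lam)

gaps = []
for n in (10, 100, 1000, 10000):
    b = binomial_pmf(x, n, lam / n)     # p = lam / n shrinks as n grows
    gaps.append(abs(b - limit))
    print(f"n = {n:6d}: binomial = {b:.6f}, poisson = {limit:.6f}")
```

The gap between the two probabilities shrinks steadily as n increases, which is what the derivation predicts.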
Thus, we have seen that to describe the distribution of a Poisson random variable we need only one
parameter, $\lambda$. We write: if X ~ POI ($\lambda$), then $P(X = x) = \dfrac{e^{-\lambda} \lambda^{x}}{x!}$, x = 0, 1, 2, … We may write down
the Poisson probability distribution in table format (see Table 5-3).
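Table 5-3
Poisson Distribution of X
X = x      P(X = x)
0          $e^{-\lambda}$
1          $\lambda e^{-\lambda}$ or $\lambda \, P(X = 0)$
2          $\dfrac{\lambda^2 e^{-\lambda}}{2!}$ or $\dfrac{\lambda}{2} \, P(X = 1)$
…          …
x          $\dfrac{\lambda^x e^{-\lambda}}{x!}$ or $\dfrac{\lambda}{x} \, P(X = x-1)$
…          …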
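The second form in each row of Table 5-3 gives a convenient way to build the table step by step: start from P(X = 0) and multiply by λ/x at each step. The sketch below compares this recurrence with the direct formula; the function name `poisson_table` and the value λ = 2 are our assumptions.

```python
from math import exp, factorial

def poisson_table(lam, x_max):
    """P(X = 0..x_max) using P(0) = e^(-lam) and P(x) = (lam / x) * P(x - 1)."""
    probs = [exp(-lam)]
    for x in range(1, x_max + 1):
        probs.append(probs[-1] * lam / x)
    return probs

lam = 2.0                       # hypothetical value for illustration
by_recurrence = poisson_table(lam, 6)
by_formula = [exp(-lam) * lam**x / factorial(x) for x in range(7)]

for x, (r, f) in enumerate(zip(by_recurrence, by_formula)):
    print(f"{x}  recurrence = {r:.6f}  formula = {f:.6f}")
```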
Poisson distribution may be expected in situations where the chance of occurrence of any event is small,
and we are interested in the occurrence of the event and not in its non-occurrence. For example, number
of road accidents, number of defective items, number of deaths in flood or because of snakebite or
because of a rare disease etc. In these situations, we know about the occurrence of an event although its
probability is very small, but we do not know how many times it does not occur. For instance, we can
say that two road accidents took place today, but it is almost impossible to say how many times an
accident failed to take place. The reason is that the number of trials is very large here and the
event is of a rare type. The Poisson random variable X counts the number of times a rare event occurs
during a fixed interval of time or space.
2. Variance
The variance, denoted by $\sigma^2$, of a Poisson distribution is computed as $V(X) = \sigma^2 = E[(X - \mu)^2] = \sum_{\text{all } x} (x - \mu)^2 \, P(x) = \lambda$
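The rth moment about the origin, denoted by $m_r^0$, of a Poisson distribution is computed as
$m_r^0 = \sum_{\text{all } x} x^r \, P(x)$
(a) First moment about the origin will be
$m_1^0 = \sum_{\text{all } x} x \, P(x) = \lambda$
(b) Second moment about the origin will be
$m_2^0 = \sum_{\text{all } x} x^2 \, P(x) = \lambda + \lambda^2$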
4. Moments about the Mean
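The rth moment about the mean, denoted by $m_r$, of a Poisson distribution is computed as
$m_r = \sum_{\text{all } x} (x - \mu)^r \, P(x)$
(a) First moment about the mean will be
$m_1 = 0$
(b) Second moment about the mean will be
$m_2 = \sigma^2 = \lambda$
(c) Third moment about the mean will be
$m_3 = \lambda$
(d) Fourth moment about the mean will be
$m_4 = 3\lambda^2 + \lambda$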
5. Skewness
To bring out the skewness we can calculate the moment coefficient of skewness, $\gamma_1$:
$\beta_1 = \dfrac{(m_3)^2}{(m_2)^3} = \dfrac{1}{\lambda}$, $\quad \gamma_1 = \sqrt{\beta_1} = \dfrac{m_3}{m_2^{3/2}} = \dfrac{1}{\sqrt{\lambda}}$
Evaluating $\gamma_1 = \dfrac{1}{\sqrt{\lambda}}$ we note that the Poisson distribution is always skewed to the right, i.e. has positive
skewness, which is so as it is a distribution of rare events.
The degree of skewness in a Poisson distribution decreases as the value of $\lambda$ increases.
6. Kurtosis
A measure of kurtosis of the Poisson distribution is given by the moment coefficient of kurtosis $\gamma_2$:
$\gamma_2 = \beta_2 - 3 = \dfrac{m_4}{m_2^2} - 3 = \dfrac{1}{\lambda}$
Evaluating $\gamma_2 = \dfrac{1}{\lambda}$ we note that the Poisson distribution is leptokurtic.
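The Poisson moments and the two shape coefficients can be verified numerically by summing over a long enough range of x. In this sketch the value λ = 4 and the truncation point of 200 terms are assumptions; the neglected tail is negligibly small for this λ.

```python
from math import exp

def poisson_probs(lam, x_max=200):
    """P(X = 0..x_max) via P(0) = e^(-lam) and P(x) = (lam / x) * P(x - 1)."""
    probs = [exp(-lam)]
    for x in range(1, x_max + 1):
        probs.append(probs[-1] * lam / x)
    return probs

def central_moment(r, lam):
    """Truncated sum of (x - lam)^r * P(x); the mean of the distribution is lam."""
    return sum((x - lam) ** r * p for x, p in enumerate(poisson_probs(lam)))

lam = 4.0                                   # hypothetical value for illustration
m2, m3, m4 = (central_moment(r, lam) for r in (2, 3, 4))

gamma1 = m3 / m2 ** 1.5                     # should equal 1 / sqrt(lam)
gamma2 = m4 / m2 ** 2 - 3                   # should equal 1 / lam

print(f"m2 = {m2:.6f}, m3 = {m3:.6f}, m4 = {m4:.6f}")
print(f"gamma1 = {gamma1:.6f}, gamma2 = {gamma2:.6f}")
```

For λ = 4 the closed forms give m2 = 4, m3 = 4, m4 = 3(16) + 4 = 52, skewness 0.5 and kurtosis 0.25.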
7. Poisson approximation of the Binomial distribution
The Poisson distribution can reasonably approximate the Binomial distribution when n is infinitely large and p
is infinitely small, i.e. when
$n \to \infty$ and $p \to 0$
The Poisson distribution is used in practice in a wide variety of problems where there are infrequently
occurring events with respect to time, area, volume or similar units. Some practical situations in which
Poisson distribution can be used are given below:
1. It is used in quality control statistics to count the number of defects of an item.
2. It is used in biology and physics to count the number of bacteria and to count the number of
particles emitted from a radioactive substance.
3. It is used in insurance problems to count the number of casualties.
4. It is used by call centres or telephone companies in waiting-time problems to count the number of
incoming telephone calls or incoming customers.
5. It is used to count the number of traffic arrivals such as trucks at terminals, airplanes at airports,
ships at docks and so forth.
6. It is used in determining the number of deaths in a district in a given period, say a year, from a rare
disease.
7. It is used to count the number of typographical errors per page in typed material, the number of deaths
as a result of road accidents, etc.
8. It is used in dealing with the inspection of manufactured products when the probability that any one
piece is defective is very small and the lots are very large.
In general, the Poisson distribution explains the behaviour of those discrete variates where the
probability of occurrence of the event is small and the total number of possible cases is sufficiently
large.
The process of fitting a Poisson distribution is very simple. We have just to obtain the value of m, i.e.,
the average occurrence (the mean of the distribution), and calculate the frequency of 0 successes. The other
frequencies can be very easily calculated as follows:
$N \cdot P(0) = N e^{-m}$
$N \cdot P(x) = \dfrac{m}{x} \, N \cdot P(x - 1)$, x = 1, 2, 3, …
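The fitting procedure can be sketched as follows. The observed frequencies below are hypothetical figures made up for illustration, not data from the text; m is estimated as the mean of the observed distribution, and each further expected frequency is obtained from the previous one by multiplying by m/x.

```python
from math import exp

# Hypothetical observed frequencies for x = 0, 1, 2, 3, 4 occurrences.
observed = [120, 60, 15, 4, 1]

N = sum(observed)                                      # total frequency
m = sum(x * f for x, f in enumerate(observed)) / N     # average occurrence

# Expected frequency of 0 occurrences, then the rest by recurrence.
expected = [N * exp(-m)]
for x in range(1, len(observed)):
    expected.append(expected[-1] * m / x)

print(f"N = {N}, m = {m:.2f}")
for x, (o, e) in enumerate(zip(observed, expected)):
    print(f"{x}  observed = {o:4d}  expected = {e:7.2f}")
```

The expected frequencies listed here do not add up to exactly N because the Poisson variable can also take values beyond the largest value observed.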
Example 5.3 At a parking place the average number of car arrivals during a specified period of 15
minutes is 2. If the arrival process is well described by a Poisson process, find the probability that
during a given period of 15 minutes
(a) no car will arrive
(b) at least two cars will arrive
(c) at most three cars will arrive
(d) between 1 and 3 cars will arrive
Solution: Let X denote the number of car arrivals during the specified period of 15 minutes. So
X ~ POI ($\lambda$), with $\lambda = 2$.
We apply the Poisson probability function $P(X = x) = \dfrac{e^{-\lambda} \lambda^{x}}{x!}$, x = 0, 1, 2, …, to calculate the
required probabilities.
(a) P(no car will arrive) = P(X = 0) = $\dfrac{e^{-2} \, 2^0}{0!}$
= 0.1353
(b) P(at least two cars will arrive) = $P(X \ge 2)$
= 1 - [P(X = 0) + P(X = 1)]
= $1 - \left[\dfrac{e^{-2} \, 2^0}{0!} + \dfrac{e^{-2} \, 2^1}{1!}\right]$
= 1 - [0.1353 + 0.2707]
= 1 - 0.4060
= 0.5940
(c) P(at most three cars will arrive) = $P(X \le 3)$
= $\sum_{x=0}^{3} \dfrac{e^{-2} \, 2^x}{x!}$
= P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)
= 0.8571
(d) P(between 1 and 3 cars will arrive) = $P(1 \le X \le 3)$
= $P(X \le 3) - P(X = 0)$
= $\sum_{x=0}^{3} \dfrac{e^{-2} \, 2^x}{x!} - \dfrac{e^{-2} \, 2^0}{0!}$
= 0.8571 - 0.1353
= 0.7218
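The four probabilities of Example 5.3 can be checked with a short calculation. The sketch below assumes λ = 2 arrivals per 15 minutes, as stated in the example; the variable names are ours. Note that part (b) evaluates to 1 - 0.4060 = 0.5940.

```python
from math import exp, factorial

lam = 2   # average number of car arrivals per 15-minute period

def pois(x):
    """P(X = x) = e^(-lam) * lam^x / x!"""
    return exp(-lam) * lam**x / factorial(x)

p_none = pois(0)                                  # (a) no car arrives
p_at_least_two = 1 - (pois(0) + pois(1))          # (b) at least two cars
p_at_most_three = sum(pois(x) for x in range(4))  # (c) at most three cars
p_one_to_three = p_at_most_three - pois(0)        # (d) between 1 and 3 cars

print(f"(a) {p_none:.4f}")           # 0.1353
print(f"(b) {p_at_least_two:.4f}")   # 0.5940
print(f"(c) {p_at_most_three:.4f}")  # 0.8571
print(f"(d) {p_one_to_three:.4f}")   # 0.7218
```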
5.5 CHECK YOUR PROGRESS
1. The Binomial distribution is skewed to the right i.e. has …… when $\gamma_1 > 0$, which is so when p < q.
2. The degree of skewness in a Poisson distribution …………. as the value of $\lambda$ increases.
3. Poisson distribution can reasonably approximate …………… when n is infinitely large and p is
infinitely small.
4. If n is large and if neither of p or q is too close to zero, the Binomial distribution can be closely
approximated by a ……………… with standardized variable.
5. The random variable that counts the number of successes in many independent, …………… is called
a Binomial Random Variable.
5.6 SUMMARY
A random variable has a probability law - a rule that assigns probabilities to different values of the
random variable. This probability law - the probability assignment is called the probability distribution
of the random variable. The probability distribution of a discrete random variable lists the probabilities
of occurrence of different values of the random variable. The proper identification of experiments with
certain known processes in Probability theory can help us in writing down the probability distribution
function. Two such processes are the Bernoulli Process and the Poisson Process. The standard discrete
probability distributions that are consequent to these processes are the Binomial and the Poisson
distribution. There are different characteristics of both.
5.7 KEYWORDS
Discrete: It takes only a countable number of values.
Continuous: It can take on any value in an interval of numbers (i.e. its possible values are uncountably
infinite).
Binomial Random Variable: The random variable that counts the number of successes in many
independent, identical Bernoulli trials is called a Binomial Random Variable.
Poisson distribution: It may be expected in situations where the chance of occurrence of any event is
small, and we are interested in the occurrence of the event and not in its non-occurrence.
8. If the sum of mean and variance of a binomial distribution of 5 trials is 7/5, find the binomial
distribution.
9. The mean and variance of a binomial distribution are 2 and 1.5 respectively. Find the probability of
(a) 2 successes (b) at least 2 successes (c) at most 2 successes.
10. 150 random samples of 4 units each are inspected for the number of defective items. The results are:
Number of defective items : 0 1 2 3 4
Number of Samples : 28 62 46 10 4
Fit a binomial distribution to the observed data.
11. The probability that a particular injection will cause a reaction in an individual is 0.002. Find the
probability that out of 1000 individuals (a) no, (b) 1, (c) at least 1, and (d) at most 2 individuals will
have a reaction from the injection.
12. In a razor blades manufacturing factory, there is a small chance of 1/500 for any blade to be
defective. The blades are supplied in packets of 10. Find the approximate number of packets
containing (a) no, (b) 1, and (c) 2 defective blades in a consignment of 10,000 packets.
13. If P(x = 1) = P(x = 2), for a distribution of Poisson random variable X. Find the mean of the
distribution.
14. The distribution of typing mistakes committed by a typist is given below:
Number of mistakes (X) : 0 1 2 3 4 5
Number of pages (f) : 142 156 67 27 5 1
Fit a Poisson distribution and find the expected frequencies.
5.9 ANSWERS TO CHECK YOUR PROGRESS
1. Positive skewness
2. Decreases
3. Binomial distribution
4. Normal distribution
5. Identical Bernoulli trials
5.10 REFERENCES/SUGGESTED READINGS
9. Statistics (Theory & Practice) by Dr. B.N. Gupta. Sahitya Bhawan Publishers and Distributors (P)
Ltd., Agra.
10. Statistics for Management by G.C. Beri. Tata McGraw Hills Publishing Company Ltd., New Delhi.
11. Business Statistics by Amir D. Aczel and J. Sounderpandian. Tata McGraw Hill Publishing
Company Ltd., New Delhi.
12. Statistics for Business and Economics by R.P. Hooda. MacMillan India Ltd., New Delhi.
13. Business Statistics by S.P. Gupta and M.P. Gupta. Sultan Chand and Sons., New Delhi.
14. Statistical Method by S.P. Gupta. Sultan Chand and Sons., New Delhi.
15. Statistics for Management by Richard I. Levin and David S. Rubin. Prentice Hall of India Pvt. Ltd.,
New Delhi.
16. Statistics for Business and Economics by Kohlar Heinz. Harper Collins., New York.
STRUCTURE
6.1 Learning Objectives
6.2 Introduction
6.2.1 Continuous Probability Distribution
6.2.2 The Normal Distribution
6.3 The Standard Normal Distribution
6.3.1 The Standard Area Table
6.3.2 Finding Probabilities of Standard Normal Distribution
6.3.3 Finding the Value of Z given a Probability
6.4 The Transformation of Normal Random Variables
6.5 Check your Progress
6.6 Summary
6.7 Keywords
6.8 Self-Assessment Test
6.9 Answers to check your progress
6.10 References/Suggested Readings
6.2 INTRODUCTION
We have learnt that a probability distribution is basically a convenient representation of the different
values a random variable may take, together with their respective probabilities of occurrence. In the
last lesson, we have examined situations involving discrete random variables and the resulting discrete
probability distributions. Consider the following random variables that we have taken up in the last
lesson:
1. Number of Successes (X1) in a Bernoulli Process
2. Number of Successes (X2) in a Poisson Process
In the first case, Binomial random variable X1 could take only finite number of integer values; 0,1,2…n;
whereas in the second case, Poisson random variable X2 could take an infinite number of integer value;
0,1,2,3………… The random variables X1 and X2 are discrete, in the sense that they could be listed in a
sequence, finite or infinite. In contrast to these, let us consider a situation, where the variable of interest
may take any value within a given range. Suppose we are planning for measuring the variability of an
automatic bottling process that fills ½-liter (500 cm³) bottles with cola. The variable, say X, indicating
the deviation of the actual volume from the normal (average) volume can take any real value - positive
or negative; integer or decimal. This type of random variable, which can take an infinite number of
values in a given range, is called a continuous random variable, and the probability distribution of such
a variable is called a continuous probability distribution. The concepts and assumption inherent in the
treatment of such distributions are quite different from those used in the context of a discrete
distribution. In the present lesson, after understanding the basic concepts of continuous distributions, we
will discuss Normal distribution - an important continuous distribution that is applicable to many real-
life processes.
6.2.1 Continuous Probability Distribution
Consider our planning for measuring the variability of the automatic bottling process that fills ½-liter
(500 cm³) bottles with cola. The random variable X indicates 'the deviation of the actual volume from
the normal (average) volume.' Let us, for some time, measure our random variable X to the nearest one
cm³.
Figure 6-1 Histograms of the Distribution of X as Measurement is Refined to Smaller and Smaller
Intervals of Volume, and the Limiting Density Function f(x)
Suppose Figure 6-1a represents the histogram of the probability distribution of X. The probability of each
value of X is the area of the rectangle over the value. Since the rectangles all have the same base, the
height of each rectangle is proportional to the probability. The probabilities also add to 1.00 as required
for a probability distribution.
Volume is a continuous random variable; it can take on any value measured on an interval of numbers.
Now let us imagine the process of refining the measurement scale of X to the nearest 1/2 cm³, the
nearest 1/8 cm³… and so on. Obviously, as the process of refining the measurement scale continues, the
number of rectangles in the histogram increases and the width of each rectangle decreases. The
probability of each value is still measured by the area of the rectangle above it, and the total area of all
rectangles remains 1.00. As we keep refining our measurement scale, the discrete distribution of X tends
to a continuous probability distribution. The step like surface formed by the tops of the rectangles in the
histogram tends to a smooth function. This function is denoted by f(x) and is called the probability
density function of the continuous random variable X. The density function is the limit of the
histograms as the number of rectangles approaches infinity and the width of each rectangle approaches
zero. The density function of the limiting continuous variable X is shown in Figure 6-1. The number of
values X can assume in any interval, say between –3.00 and –2.00, approaches infinity, and so the probability that X assumes a
particular value (say X = 1.5) approaches zero. Probabilities are still measured as areas under the curve.
The probability that deviation will be between –1.50 and –1.00 is the area under f(x) between the points
x = -1.50 and x = -1.00. Let us now make some formal definitions.
A continuous random variable is a random variable that can take on any value in an interval of
numbers.
The probabilities associated with a continuous random variable X are determined by the probability
density function of the random variable. The function, denoted by f(x), has the following properties:
1. $f(x) \ge 0$ for all x
2. The probability that X will be between two numbers a and b is equal to the area under f(x) between
a and b.
$P(a < X < b) = \int_{a}^{b} f(x)\,dx$
3. The total area under the entire curve of f(x) is equal to 1.00.
$P(-\infty < X < \infty) = \int_{-\infty}^{\infty} f(x)\,dx = 1.00$
When the sample space is continuous, the probability of any single given value is zero. For a continuous
random variable, therefore, the probability of occurrence of any given value is zero. We see this from
property 2, noting that the area under a curve between a point and itself is the area of a line, which is
zero. For a continuous random variable, non-zero probabilities are associated only with intervals of
numbers.
We define the cumulative distribution function F(x) for a continuous random variable similarly to the
way we defined it for a discrete random variable: F(x) is the probability that X is less than (or equal to)
x.
$F(x) = P(X \le x)$ = area under f(x) between the smallest possible value of X (often $-\infty$) and point x
$= \int_{-\infty}^{x} f(t)\,dt$
The cumulative distribution function F(x) is a smooth, non-decreasing function that increases from 0 to
1.00.
The expected value of a continuous random variable X, denoted by E(X), and its variance, denoted by
V(X), require the use of calculus for their computation. Thus
$E(X) = \int_{-\infty}^{\infty} x \, f(x)\,dx$
$V(X) = \int_{-\infty}^{\infty} [x - E(X)]^2 \, f(x)\,dx$
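These definitions can be made concrete with a simple numerical integration. The density used below, f(x) = 3x² on the interval [0, 1], is a hypothetical example chosen because its integrals are easy to verify by hand; it is not taken from the text, and the midpoint rule is only one of several ways to approximate the integrals.

```python
def f(x):
    """Hypothetical density: f(x) = 3x^2 on [0, 1], zero elsewhere."""
    return 3 * x**2 if 0 <= x <= 1 else 0.0

def integrate(g, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of g from a to b."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

total_area = integrate(f, 0, 1)                              # property 3: equals 1
prob = integrate(f, 0.5, 1)                                  # P(0.5 < X < 1)
mean = integrate(lambda x: x * f(x), 0, 1)                   # E(X)
variance = integrate(lambda x: (x - mean) ** 2 * f(x), 0, 1) # V(X)

print(f"total area     = {total_area:.6f}")
print(f"P(0.5 < X < 1) = {prob:.6f}")
print(f"E(X)           = {mean:.6f}")
print(f"V(X)           = {variance:.6f}")
```

By hand, the same quantities are 1, 1 - 0.5³ = 0.875, 3/4 and 3/5 - (3/4)² = 0.0375, which the numerical results match closely.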
The Normal Distribution is the most versatile of all the continuous probability distributions. It is being
widely used in all data-based research in the field of agriculture, trade, business and industry. It is
found to be useful in characterizing uncertainties in many real-life processes, in statistical inferences,
and in approximating other probability distributions.
A large number of random variables occurring in practice can be approximated to the normal
distribution.
A random variable that is affected by many independent causes, where the effect of each cause is not
overwhelmingly large compared to other effects, closely follows a normal distribution.
The lengths of pins made by an automatic machine; the times taken by an assembly worker to complete
the assigned task repeatedly; the weights of baseballs; the tensile strengths of a batch of bolts; and the
volumes of cola in a particular brand of canned cola - are good examples of normally distributed
random variables. All of these are affected by several independent causes where the effect of each cause
is small. This knowledge helps us in calculating the probabilities of different events in varied situations,
which in turn is useful for decision-making.
In many real life situations, we face the problem of making statistical inferences about processes based
on limited data. Limited data is basically a sample from the full body of data on the process.
Irrespective of how the full body of data is distributed, it has been found that the Normal Distribution
can be used to characterize the sampling distribution of many of the sample statistics (we will see this in
the next few lessons). This helps considerably in statistical inference.
Finally, the Normal Distribution can be used to approximate certain probability distributions.
This helps considerably in simplifying the probability calculations.
Many mathematicians have worked on the mathematics behind the normal distribution and have made
many independent discoveries. In the initial stages, the normal distribution was developed by Abraham
De Moivre (1667-1754). His work was later taken up by Pierre S Laplace (1749-1827). But the
discovery of equation for the normal density function is attributed to Carl Friedrich Gauss (1777-1855),
who did much work with the formula. In science books, this distribution is often called the Gaussian
distribution.
We will now examine the properties of the Normal distribution.
2. The normal curve is bell-shaped and perfectly symmetric about its mean. As a result 50% of the
area lies to the right of mean and balance 50% to the left of mean. Perfect symmetry, obviously,
implies that mean, median and mode coincide in case of a normal distribution. The normal curve
gradually tapers off in height as it moves in either direction away from the mean, and gets closer to
the X-axis.
3. The normal curve has a (relative) kurtosis of 0, which means it has average peakedness and is
mesokurtic.
4. Theoretically, the normal curve never touches the horizontal axis and extends to infinity on both
sides. That is the curve is asymptotic to X-axis.
5. If several independent random variables are normally distributed, then their sum will also be
normally distributed. The mean of the sum will be the sum of all the individual means, and by
virtue of the independence, the variance of the sum will be the sum of all the individual variances.
If X1, X2, …, Xn are independent normal variables, then their sum S will also be a normal
variable with
E(S) = E(X1) + E(X2) + … + E(Xn)
and V(S) = V(X1) + V(X2) + … + V(Xn)
6. If a normal variable X undergoes a linear change in scale such as Y = aX + b, where a and b are
constants and a ≠ 0, the resultant variable Y will also be normally distributed with mean = a E(X) +
b and variance = a² V(X).
We can combine the above two properties.
If X1, X2,………… Xn are independent random variables that are normally distributed, then the random
variable Q defined as
Q = a1X1 + a2X2 + … + anXn + b will also be normally distributed with
E(Q) = a1 E(X1) + a2 E(X2) + … + an E(Xn) + b
and V(Q) = a1² V(X1) + a2² V(X2) + … + an² V(Xn)
Let us see the application of this result with the help of an example.
Example 6-1
A cost accountant needs to forecast the unit cost of a product for the next year. He notes that each unit
of the product requires 8 labor hours and 5 kg of raw material. In addition, each unit of the product is
assigned an overhead cost of Rs 200. He estimates that the cost of a labor hour next year will be
normally distributed with an expected value of Rs 45 and a standard deviation of Rs 2; the cost of raw
material will be normally distributed with an expected value of Rs 60 and a standard deviation of Rs 3.
Find the distribution of the unit cost of the product. Find its expected value and variance.
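Solution: Since the cost of labor L may not influence the cost of raw material M, we can assume that
the two are independent. The unit cost of the product, Q = 8L + 5M + 200, is then a random variable. So if
L ~ N(45, 2²) and M ~ N(60, 3²), then Q is also normally distributed, with
E(Q) = 8(45) + 5(60) + 200 = 860
and V(Q) = 8²(2²) + 5²(3²) = 256 + 225 = 481
That is, Q ~ N(860, 481), with a standard deviation of about Rs 21.93.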
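The rule for a linear combination of independent normal variables can be checked numerically for the data of Example 6-1. The sketch below is only an illustration: the function name linear_combination_of_normals is our own, and it assumes, as the example does, that labor cost and material cost are independent.

```python
import math

def linear_combination_of_normals(coeffs, means, variances, constant=0.0):
    """Mean and variance of Q = a1*X1 + ... + an*Xn + b for independent X's."""
    mean = sum(a * m for a, m in zip(coeffs, means)) + constant
    # The constant b shifts the mean only; it adds nothing to the variance.
    variance = sum(a * a * v for a, v in zip(coeffs, variances))
    return mean, variance

# Example 6-1: Q = 8L + 5M + 200, with L ~ N(45, 2^2) and M ~ N(60, 3^2)
mean_q, var_q = linear_combination_of_normals(
    coeffs=[8, 5], means=[45, 60], variances=[2**2, 3**2], constant=200
)
sd_q = math.sqrt(var_q)
print(mean_q, var_q, round(sd_q, 2))  # 860 481 21.93
```

Note that the coefficients enter the variance squared, so the 8 labor hours contribute 64 × 4 = 256 to the variance of the unit cost.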
4. The normal distribution has numerous mathematical properties which make it popular and
comparatively easy to manipulate.
5. The normal distribution is used extensively in statistical quality control in industry for setting up
control limits.
There are infinitely many possible normal random variables and the resulting normal curves for
different values of μ and σ². So the range probability P(a < X < b) will be different for different normal
curves. We can make use of integral calculus to compute the required range probability
P(a < X < b) = ∫_a^b f(x) dx
It may be appreciated that we can simplify this process of computing range probabilities to a great
extent by tabulating the range probabilities. Since it is not practicable and indeed impossible to have
separate probability tables for each of the infinitely many possible normal curves, we select one normal
curve to serve as a standard. Probabilities associated with the range of values of this standard normal
random variable are tabulated. A special transformation then allows us to apply the tabulated
probabilities to any normal random variable. The standard normal random variable is denoted by a
special name, Z (rather than the general name X we use for other random variables).
We define the standard normal random variable Z as the normal random variable with mean = 0 and
standard deviation = 1. We say
Z ~ N(0, 1²)
Type II Tables give the area towards the tail–end of the standard normal curve beyond the ordinate at
any particular z value. The hatched area shown in Figure 8-4b is P (Z > z).
As the normal curve is perfectly symmetrical, the areas given by Type I Tables when subtracted from
0.5 will provide the same areas as given by Type II Tables and vice-versa,
i.e., P(0 < Z < z) = 0.5 - P(Z > z).
Example 6-2
Find the probability that the value of the standard normal random variable will be…
(a) between 0 and 1.74 (b) less than -1.47
(c) between 1.3 and 2 (d) between -1 and 2
Solution: (a) P(Z is between 0 and 1.74)
That is, we want P(0 < Z < 1.74). In Figure 8-4a, substitute 1.74 for the point z on the graph. We are
looking for the table area in the row labeled 1.7 and the column labeled 0.04. In the table, we find the
probability 0.4591. Thus
P (0 < Z < 1.74) = 0.4591
(b) P(Z is less than -1.47)
That is, we want P(Z < -1.47). By the symmetry of the normal curve, the area to the left of -1.47 is
exactly equal to the area to the right of 1.47. We find
P(Z < -1.47) = P(Z >1.47)
= 0.5000 - 0.4292
= 0.0708
(c) P(Z is between 1.3 and 2)
That is, we want P(1.3 < Z < 2). The required probability is the area under the curve between the two
points 1.3 and 2. The table gives us the area under the curve between 0 and 1.3, and the area under the
curve between 0 and 2. Areas are additive; therefore,
P(1.3 < Z < 2) = P(0 < Z < 2) - P(0 < Z < 1.3)
= 0.4772 - 0.4032
= 0.0740
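(d) P(Z is between -1 and 2)
That is, we want P(-1 < Z < 2). The required probability is the area under the curve between the two
points -1 and 2. The table gives us the area under the curve between 0 and 1, and the area under the
curve between 0 and 2. By symmetry, the area between -1 and 0 equals the area between 0 and 1. Areas are
additive; therefore,
P(-1 < Z < 2) = P(0 < Z < 1) + P(0 < Z < 2)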
= 0.3413 + 0.4772
= 0.8185
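The table look-ups in Example 6-2 can be reproduced in software. The following is a minimal sketch using NormalDist from Python's standard statistics module; the helper names table_area and tail_area are our own labels for the Type I and Type II table entries. Because the printed tables are rounded to four decimals, the computed values may differ from the table-based answers in the last decimal place.

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)  # the standard normal random variable

def table_area(z):
    """Type I table entry: P(0 < Z < z) for z >= 0."""
    return Z.cdf(z) - 0.5

def tail_area(z):
    """Type II table entry: P(Z > z)."""
    return 1.0 - Z.cdf(z)

part_a = table_area(1.74)                   # P(0 < Z < 1.74)
part_b = tail_area(1.47)                    # P(Z < -1.47) = P(Z > 1.47) by symmetry
part_c = table_area(2.0) - table_area(1.3)  # P(1.3 < Z < 2)
part_d = table_area(1.0) + table_area(2.0)  # P(-1 < Z < 2)

for label, value in [("a", part_a), ("b", part_b), ("c", part_c), ("d", part_d)]:
    print(label, round(value, 4))
```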
In cases, where we need probabilities based on values with greater than second-decimal accuracy, we
may use a linear interpolation between two probabilities obtained from the table.
Example 6-3
Find P(0 ≤ Z ≤ 1.645)
Solution: P(0 ≤ Z ≤ 1.645) is found as the midpoint between the two probabilities P(0 ≤ Z ≤ 1.64) and
P(0 ≤ Z ≤ 1.65). So
P(0 ≤ Z ≤ 1.645) = ½[0.4495 + 0.4505]
= 0.45
Example 6-4
Find a value z of the standard normal random variable such that the probability that the random variable
will have a value between 0 and z is 0.40.
Solution: We look inside the table for the value closest to 0.40. The closest value we find to 0.40 is the
table area 0.3997. This value corresponds to 1.28 (row 1.2 and column .08).
So for P(0 < Z < z) = 0.40; z = 1.28
Example 6-5
Find the value of the standard normal random variable that cuts off an area of 0.90 to its left.
Solution: Since the area to the left of the given point z is greater than 0.50, z must be on the right side of
0. Furthermore, the area to the left of 0 all the way to -∞ is equal to 0.50. Therefore, the required table
area is TA = 0.90 - 0.50 = 0.40. We need to find the point z such that TA = 0.40.
We find that for TA = 0.40; z =1.28.
Thus z =1.28 cuts off an area of 0.90 to the left of standard normal curve.
Example 6-6
Find a 0.99 probability interval, symmetric about 0, for the standard normal random variable.
Solution: The required area between the two z values that are equidistant from 0 on either side is 0.99.
Therefore, the area under the curve between 0 and the positive z value is TA = 0.99/2 = 0.495. We now
look in our normal probability table for the area closest to 0.495. The area 0.495 lies exactly between
the two areas 0.4949 and 0.4951, corresponding to z = 2.57 and z = 2.58. Therefore, a simple linear
interpolation between the two values gives us z = 2.575. The answer, therefore, is z = ± 2.575.
So for P(-z < Z< z) = 0.99; z = 2.575
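Examples 6-4 to 6-6 read the table "inside out", from an area back to a z value. A hedged sketch of the same look-ups, assuming Python's statistics.NormalDist and a helper name (z_for_table_area) of our own choosing, is given below. The computed values are exact to many decimals, so they agree with the table answers only to the accuracy of the table.

```python
from statistics import NormalDist

Z = NormalDist()  # mean 0, standard deviation 1

def z_for_table_area(area):
    """Return z >= 0 such that P(0 < Z < z) = area, for 0 <= area < 0.5."""
    # The area to the left of z is 0.5 (left half) plus the table area.
    return Z.inv_cdf(0.5 + area)

z_ex_6_4 = z_for_table_area(0.40)      # Example 6-4: table area 0.40
z_ex_6_5 = Z.inv_cdf(0.90)             # Example 6-5: area 0.90 to the left
z_ex_6_6 = z_for_table_area(0.99 / 2)  # Example 6-6: symmetric 0.99 interval

print(round(z_ex_6_4, 3), round(z_ex_6_5, 3), round(z_ex_6_6, 3))
```

Examples 6-4 and 6-5 lead to the same point because an area of 0.90 to the left of z is the same condition as a table area of 0.40 between 0 and z.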
Example 6-7
If X ~ N(50, 10²), find the probability that the value of the random variable X will be greater than 60.
Solution:
P(X > 60) = P((X - μ)/σ > (60 - μ)/σ)
= P(Z > (60 - 50)/10)
= P(Z > 1)
= P( Z > 0) - P(0 < Z <1)
= 0.5000 - 0.3413
= 0.1587
Example 6-8
The weekly wage of 2000 workmen is normally distributed with a mean wage of Rs 70 and a wage
standard deviation of Rs 5. Estimate the number of workers whose weekly wages are
(a) between Rs 70 and Rs 71 (b) between Rs 69 and Rs 73
(c) more than Rs 72 (d) less than Rs 65
Solution: Let X be the weekly wage in Rs, then
X ~ N(70, 5²)
(a) The required probability to be calculated is P(70 < X < 71)
So P(70 < X < 71) = P((70 - μ)/σ < (X - μ)/σ < (71 - μ)/σ)
= P((70 - 70)/5 < Z < (71 - 70)/5)
= P(0 < Z < 0.2)
= 0.0793
So the number of workers whose weekly wages are between Rs 70 and Rs 71
= 2000 x 0.0793
= 159
(b) The required probability to be calculated is P(69 < X < 73)
So P(69 < X < 73) = P((69 - μ)/σ < (X - μ)/σ < (73 - μ)/σ)
= P((69 - 70)/5 < Z < (73 - 70)/5)
= P(-0.2 < Z < 0.6)
= P(-0.2 < Z < 0)+ P(0 < Z < 0.6)
= P(0 < Z < 0.2)+ P(0 < Z < 0.6)
= 0.0793 + 0.2257
= 0.3050
So the number of workers whose weekly wages are between Rs 69 and Rs 73
= 2000 x 0.3050
= 610
(c) The required probability to be calculated is P(X > 72)
So P(X > 72) = P((X - μ)/σ > (72 - μ)/σ)
= P(Z > (72 - 70)/5)
= P(Z > 0.4)
= 0.5 - P(0 < Z < 0.4)
= 0.5 – 0.1554
= 0.3446
So the number of workers whose weekly wages are more than Rs 72
= 2000 x 0.3446
= 689
(d) The required probability to be calculated is P(X < 65)
So P(X < 65) = P((X - μ)/σ < (65 - μ)/σ)
= P(Z < (65 - 70)/5)
= P(Z < -1.0)
= P(Z >1.0)
= P(Z >0) - P(0 < Z < 1.0)
= 0.5 - 0.3413
= 0.1587
So the number of workers whose weekly wages are less than Rs 65
= 2000 x 0.1587
= 317
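The four parts of Example 6-8 follow one pattern: convert the wage limits to a probability, then multiply by the number of workers. A minimal sketch of that pattern, assuming Python's statistics.NormalDist and a helper name (expected_count) of our own, is given below; it works directly on the wage scale, so the standardisation to Z happens inside the cdf call.

```python
from statistics import NormalDist

wage = NormalDist(mu=70, sigma=5)  # weekly wage in Rs, as in Example 6-8
workers = 2000

def expected_count(lower=None, upper=None):
    """Expected number of workers with lower < wage < upper (None = unbounded)."""
    p_lower = 0.0 if lower is None else wage.cdf(lower)
    p_upper = 1.0 if upper is None else wage.cdf(upper)
    return workers * (p_upper - p_lower)

counts = {
    "between 70 and 71": expected_count(70, 71),
    "between 69 and 73": expected_count(69, 73),
    "more than 72": expected_count(lower=72),
    "less than 65": expected_count(upper=65),
}
for label, value in counts.items():
    print(label, round(value))
```

Rounded to whole workers, the computed counts are 159, 610, 689 and 317.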
The Inverse Transformation
The transformation Z = (X - μ)/σ takes us from a random variable X with mean μ and standard deviation
σ to the standard normal random variable. We also have an opposite, or inverse, transformation, which
takes us from the standard normal random variable Z to the random variable X with mean μ and
standard deviation σ. The inverse transformation is given as
X = μ + Zσ
We use the inverse transformation when we want to get from a given probability, the value or values of
a normal random variable X.
Example 6-9
The amount of fuel consumed by the engines of a jetliner on a flight between two cities is a normally
distributed random variable X with mean μ = 5.7 tons and standard deviation σ = 0.5 tons. Carrying
too much fuel is inefficient as it slows the plane. If, however, too little fuel is loaded on the plane, an
emergency landing may be necessary. What should be the amount of fuel to load so that there is 0.99
probability that the plane will arrive at its destination without emergency landing?
Solution: Given that X ~ N(5.7, 0.5²),
We have to find the value x such that
P(X < x) = 0.99
or P((X - μ)/σ < z) = 0.99
or P(Z < z) = 0.99
= 0.5 + 0.49
= 0.5 + P(0 < Z < z)
From the table, value of z is 2.33
So x = μ + zσ
x = 5.7 + 2.33 × 0.5
x = 6.865
Therefore, the plane should be loaded with 6.865 tons of fuel to give 0.99 probability that the fuel will
last throughout the flight.
Example 6-10
Monthly sale of beer at a bar is believed to be approximately normally distributed with mean 2450 units
and standard deviation 400 units. To determine the level of orders and stock, the management wants to find two
values symmetrically on either side of mean, such that the probability that sales of beer during the
month will be between the two values is
(a) 0.95 (b) 0.99
Find the required values.
Solution: Let X be the monthly sale of beer, then
X ~ N(2450, 400²),
(a) We have to find the values x1 and x2 such that
P(x1 < X < x2) = 0.95
or P((x1 - μ)/σ < (X - μ)/σ < (x2 - μ)/σ) = 0.95
or P(z1 < Z < z2) = 0.95
We know P(-1.96 < Z < 1.96) = 0.95
So z1 = -1.96 and z2 = 1.96
Using the inverse transformation,
x1 = μ + z1σ and x2 = μ + z2σ
x1 = 2450 - 1.96 × 400 and x2 = 2450 + 1.96 × 400
x1 = 1666 and x2 = 3234
Therefore, the management may be 95% sure that sales in any given month will be between 1666 and
3234 units.
(b) We have to find the values x1 and x2 such that
P(x1 < X < x2) = 0.99
or P((x1 - μ)/σ < (X - μ)/σ < (x2 - μ)/σ) = 0.99
or P(z1 < Z < z2) = 0.99
We know P(-2.58 < Z < 2.58) = 0.99
So z1 = -2.58 and z2 = 2.58
Using the inverse transformation,
x1 = μ + z1σ and x2 = μ + z2σ
x1 = 2450 - 2.58 × 400 and x2 = 2450 + 2.58 × 400
x1 = 1418 and x2 = 3482
Therefore, the management may be 99% sure that sales in any given month will be between 1418 and
3482 units.
We can summarize the procedure of obtaining values of a normal random variable, given a
probability, as:
(i) draw a picture of the normal distribution in question and the standard normal distribution;
(ii) in the picture, shade in the area corresponding to the probability;
(iii) use the table to find the z value (or values) that gives the required probability; and
(iv) use the transformation from Z to X to get the appropriate value (or values) of the original normal
random variable.
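The last two steps of this procedure (find z, then transform from Z to X) can be sketched in code. The functions below are illustrative only, with names of our own choosing, and assume Python's statistics.NormalDist. They are applied to Examples 6-9 and 6-10; because the code uses unrounded z values (2.326 rather than 2.33, and 2.576 rather than 2.58), its answers differ slightly from the table-based ones.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # Z ~ N(0, 1)

def upper_cutoff(mu, sigma, probability):
    """Value x such that P(X < x) = probability, for X ~ N(mu, sigma^2)."""
    z = STANDARD_NORMAL.inv_cdf(probability)
    return mu + z * sigma  # inverse transformation X = mu + Z*sigma

def symmetric_interval(mu, sigma, probability):
    """Values (x1, x2), symmetric about mu, with P(x1 < X < x2) = probability."""
    # Half of the remaining probability lies in each tail.
    z = STANDARD_NORMAL.inv_cdf(0.5 + probability / 2)
    return mu - z * sigma, mu + z * sigma

fuel = upper_cutoff(5.7, 0.5, 0.99)                   # Example 6-9
low95, high95 = symmetric_interval(2450, 400, 0.95)   # Example 6-10 (a)
low99, high99 = symmetric_interval(2450, 400, 0.99)   # Example 6-10 (b)

print(round(fuel, 3))
print(round(low95), round(high95))
print(round(low99), round(high99))
```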
6.5 CHECK YOUR PROGRESS
1. The probabilities associated with a continuous random variable X are determined by the
………………… of the random variable.
2. For a continuous random variable, non-zero probabilities are associated only with ……………… of
numbers.
3. In science books, normal distribution is often called the …………….. .
4. The normal curve has a (relative) kurtosis of 0, which means it has ………… and is mesokurtic.
5. We define the standard normal random variable Z as the ………. random variable with mean = 0 and
standard deviation = 1.
6.6 SUMMARY
A continuous random variable is a random variable that can take on any value in an interval of
numbers. The Normal Distribution is the most versatile of all the continuous probability distributions. It
is widely used in data-based research in the fields of agriculture, trade, business and industry.
It is useful in characterizing uncertainties in many real-life processes, in statistical
inference, and in approximating other probability distributions. A large number of random variables
occurring in practice can be approximated by the normal distribution. A random variable that is
affected by many independent causes, where the effect of each cause is not overwhelmingly large
compared to the other effects, closely follows a normal distribution. The computation of range
probabilities for a normal random variable can be simplified to a great extent by tabulating the range
probabilities. Since it is not practicable, and indeed impossible, to have separate probability tables for
each of the infinitely many possible normal curves, we select one normal curve to serve as a standard.
Probabilities associated with the range of values of this standard normal random variable are
tabulated. The importance of the standard normal distribution derives from the fact that any normal
random variable may be transformed to the standard normal random variable.
6.7 KEYWORDS
Continuous random variable: The variable, say X, indicating the deviation of the actual volume from
the normal (average) volume can take any real value - positive or negative; integer or decimal. This type
of random variable, which can take an infinite number of values in a given range, is called a continuous
random variable.
Continuous probability distribution: The probability distribution of continuous random variable is
called a continuous probability distribution.
Standard Normal Distribution: The distribution of the standard normal random variable Z, defined as the
normal random variable with mean = 0 and standard deviation = 1.
6.8 SELF-ASSESSMENT TEST
1. Define continuous probability distribution. State the properties of the probability density function
of a continuous random variable.
2. (a) Define normal random variable. State the probability density function of a normal random
variable.
(b) List down important properties of a normal curve.
3. Discuss the role of normal distribution in statistical theory.
4. What do you mean by standard normal variable? Bring out the need for having a standard normal
curve.
5. Find the probability that a standard normal variable will have a value
(a) less than –8 (b) between -0.01 and 0.05
6. A sensitive measuring device is calibrated so that errors in the measurements it provides are
normally distributed with mean 0 and variance 1.00. Find the probability that a given error will be
between -2 and 2.
7. The deviation of a magnetic needle from the magnetic pole in a certain area in northern Canada is a
normally distributed random variable with mean 0 and standard deviation 1.00. What is the
probability that the absolute value of the deviation from the north pole at a given moment will be
more than 2.4?
8. Find two values of the standard normal random variable, z and -z, such that
(a) the two corresponding "tail areas" of the distribution add to 0.01.
(c) If the company wants to be 95% confident that the stock-out condition will not occur, what
should be the reorder point? The reorder point minus the mean demand during lead-time is
known as the "safety stock." What is the safety stock in this case?
(d) If the company wants to be 99% confident that the stock-out condition will not occur, what
should be the reorder point? What is the safety stock in this case?
15. If X is a normally distributed random variable with mean 125 and standard deviation 44, find a
value x such that the probability that X will be less than x is 0.66.
16. For a normal random variable with mean 8.5 and standard deviation 0.4, find a point of the
distribution such that there is a 0.95 probability that the value of the random variable will be above
it.
17. For a normal random variable with mean 29,500 and standard deviation 48, find a point of the
distribution such that the probability that the random variable will exceed this value is
(a) 0.03 (b) 0.25
18. Find two values of the normal random variable with mean 80 and standard deviation 5 lying
symmetrically on either side of the mean and covering an area of 0.98 between them.
19. For X ~ N(32, 7²), find two values x1 and x2, symmetrically lying on each side of the mean, with
(a) P(x1 < X< x2) = 0.99 (b) P(x1 < X < x2) = 0.95
20. The results of a given selection test exercise are summarized as
(i) cleared with distinction = 8%
(ii) cleared without distinction = 60%
(iii) those who failed = 30%.
A candidate gets failed if he/she obtains less than 40% marks, while one must obtain at least 75%
marks to pass with distinction. Determine the mean and standard deviation of the distribution of
marks, assuming the same to be normal.
21. The demand for gasoline at a service station is normally distributed with mean 27,009 gallons per
day and standard deviation 4,530. Find two values that will give a symmetric 0.95 probability
interval for the amount of gasoline demanded daily.
22. The percentage of protein in a certain brand of dog food is a normally distributed random variable
with mean 11.2 % and standard deviation 0.6 %. The manufacturer would like to state on the
package that the product has a protein content of at least x1 % and no more than x2 %. He wants the
statement to be true for 99% of the packages sold. Determine the values x1 and x2.
6.9 ANSWERS TO CHECK YOUR PROGRESS
1. Probability density function
2. Intervals
3. Gaussian distribution
4. Average peakedness
5. Normal
STRUCTURE
7.1 Learning Objectives
7.2 Introduction
7.2.1 Census Vs. Sampling method
7.2.2 Definitions
7.3 Probability samples vs. non-probability samples
7.3.1 Probability sampling methods
7.3.2 Non-probability sampling methods
7.3.3 Determination of Sample size
7.4 Sampling and Non-sampling errors
7.5 Check your Progress
7.6 Summary
7.7 Keywords
7.8 Self-Assessment Test
7.9 Answers to check your progress
7.10 References/Suggested readings
7.2 INTRODUCTION
Sampling is the procedure or process of selecting a sample from a population. Sampling can also be
defined as the process of drawing a sample from a population and of compiling a suitable statistic from
such a sample in order to estimate the parameter of the parent population and to test the significance of
the statistic computed from such sample. When secondary data are not available for the problem under
study, a decision may be taken to collect primary data by using any of the methods discussed in this
lesson. The required information may be obtained by following either the census method or the sample
method.
Although there are many advantages with the census method, the cost, effort and time required to
conduct a census survey are very large unless the population is very small, and in many cases they are so
prohibitive that one rarely uses this method in surveys.
Sampling involves an examination of a small portion of the elementary units in a population. Although
a census operation gives more reliable data, the sampling method is preferred when
(i) the population is very large or infinite, so that it would be impossible to conduct a census survey;
(ii) quick results are required, so that a sample survey is more appropriate than a
census survey;
(iii) the study involves destruction of the elementary units under study, so that only sample testing is
appropriate. Items such as light bulbs and ammunition often must be destroyed as a part
of the testing process;
(iv) the cost of conducting a census would be prohibitive, and it is therefore
advisable to carry out a sample survey; and
(v) accuracy may be lost because of the large size of the population. Sampling involves a
small portion of the population and therefore requires very few people for conducting
surveys and for data collection and compilation. This would not be so in the census method, where the
chances of committing errors would increase.
As the sampling involves less time and money, it would be possible to give attention to different
characteristics of the elementary units. With the same money and time, a sample can produce a detailed
study of a smaller number of units. The process of sampling involves selecting a sample, collecting all
relevant information, and finally drawing conclusions about the population from which the sample has
been drawn.
7.2.2 Definitions
The surveys are concerned with the attributes of certain entities, such as business enterprises, human
beings, etc. The attributes that are the object of the study are known as characteristics and the units
possessing them are called the elementary units.
The aggregate of elementary units to which the conclusions of the study apply is termed as
population/universe, and the units that form the basis of the sampling process are called sampling units.
The sampling unit may be an elementary unit.
The sample is defined as an aggregate of sampling units actually chosen in obtaining a representative
subset from which inferences about the population are drawn. The frame, a list or directory, defines all
the sampling units in the universe to be covered. This frame is either constructed for the purpose of a
particular survey or may consist of previously available description of the population; the latter is the
commonly used method. For example, telephone directory can be used as a frame for conducting
opinion surveys in a city or locality.
In order that sampling results reflect the characteristics of the population, it is necessary that the sample
selected for study should be
(i) Truly representative, i.e., the selected sample should truly represent the universe so that the results can be
generalised;
(ii) Adequate, i.e., the size of the sample or the sample size should be adequate enough to represent the
various characteristics of the universe;
(iii) Independent, i.e. the elementary units selected should be independent of one another and all units of
the population should have the same chance of being selected in the sample; and lastly
(iv) Homogeneous, i.e., there should not be any basic difference between the characteristics of the units
in the sample and that of the population. This means that if two or more samples are drawn from
the same population, the results should be more or less identical.
Probability sampling does not depend upon the detailed information about population for its
effectiveness. However, probability sampling requires a high level of skill and experience for its use. It
also requires sufficient time and money to execute.
Non-probability sampling is a procedure of selecting a sample without the use of probability or
randomisation. It is based on convenience, judgement, etc. The major difference between the two
approaches is that it is possible to estimate the sampling variability in the case of probability sampling
while it is not possible to estimate the same in the non-probability sampling. The classification of
various probability and non-probability methods is shown in Fig. 7.1.
[Fig. 7.1: Classification of probability and non-probability sampling methods]
possible sets of stated size that might be drawn from a given population, but the process of sample
selection should be such that the probability of selection is the same for every such set.
The objective is to achieve randomness in drawing the individual elements of a sample for ensuring that
all possible samples have the same chance of being selected. If we are to draw from a population
containing N elementary units, the elementary unit also being a sampling unit, it is necessary that each
of the N units should be individually numbered or otherwise distinctively designated. One of the
approaches for drawing a random sample of size n from a population of N units is to draw n cards from N
cards which are numbered from 1 to N and mixed thoroughly. The sample of size n thus drawn would
constitute a simple random sample (SRS). Another popular method of selecting a random sample is the
lottery method. In this method all the elements are named or numbered on a small slip of paper of
identical shape and size. These slips are folded identically and mixed up well in a container. Number of
slips of desired sample size is selected blindly from this container. Thus, the selection of elementary
units depends purely on chance and no personal bias exists. We shall illustrate this method of selection
of a sample with the following example: Suppose the warden of a students' hostel with 200 occupants
wants to constitute a welfare committee of five members selected at random. The lottery method of
selecting these five members from a group of 200 would be first to prepare 200 slips of identical shape
and size and write the name of each student on a slip. Fold these 200 slips identically and mix them well
in a container. Then select five folded slips, from the container at random. The five students so selected
would constitute a welfare committee of the hostel.
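The lottery method has a direct software analogue: drawing without replacement so that every set of five students is equally likely. The sketch below is an illustration only; it uses random.sample from Python's standard library, and the student labels and the seed value are hypothetical (the seed is there only to make the draw repeatable).

```python
import random

def lottery_sample(population, sample_size, seed=None):
    """Draw a simple random sample without replacement (the 'slips in a container')."""
    rng = random.Random(seed)  # a private generator, so the draw can be reproduced
    return rng.sample(population, sample_size)

# 200 hostel occupants, labelled hypothetically as "Student 1" ... "Student 200"
occupants = [f"Student {i}" for i in range(1, 201)]
committee = lottery_sample(occupants, 5, seed=2024)
print(committee)
```

Each call with a different seed (or with no seed) corresponds to mixing the slips afresh and drawing again.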
There are, however, some difficulties in these procedures. For, if N is large, the task becomes physically
difficult. So it is desirable to use better methods for ensuring randomness. One such method is the use
of random number tables.
Several random number tables are available for use. These numbers have been adequately tested for
randomness. Among them, the most popular ones are:
(i) Tippett's (1927) 10,400 sets of four-digit random numbers;
(ii) Fisher and Yates (1938) table of random numbers with 1,500 sets of ten-digit random numbers;
and
(iii) Rand Corporation (1955) table of random numbers of 2,00,000 sets of five-digit random
numbers.
Tippett's table of random numbers is the most widely used in practice. Given below are the first forty sets
from Tippett's table as an illustration of the general appearance of random numbers:
Tippett's numbers have been subjected to numerous tests and used in many investigations and their
randomness has been well established for all practical purposes. An example to illustrate how Tippett's
table of random numbers may be used is given below.
Suppose ten numbers between 0 and 80 are required. We start anywhere in the table and write down
the numbers in pairs. The table can be read horizontally, vertically, diagonally or in any methodical
way. Starting with the first set and reading horizontally, we obtain 29, 52, 66, 41, 39, 92, 97, 92, 79, 69,
59, 11, 31, 70, 56, 24, 41, 67 and so on. Ignoring the numbers greater than 80, we obtain for our
purpose ten random numbers, namely 29, 52, 66, 41, 39, 79, 69, 59, 11 and 31.
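The reading rule used in this illustration (take the digits in pairs, discard any pair above 80, stop after ten numbers) can be written as a short routine. This is a sketch under stated assumptions: the function name pick_from_table is our own, and the digit string is simply the pairs quoted in the text joined back into four-digit sets, not a fresh transcription of the table.

```python
def pick_from_table(digits, width, upper, count):
    """Read a string of table digits in chunks of `width`; keep values <= upper."""
    picked = []
    for start in range(0, len(digits) - width + 1, width):
        value = int(digits[start:start + width])
        if value <= upper:          # numbers outside the required range are ignored
            picked.append(value)
        if len(picked) == count:    # stop as soon as enough numbers are collected
            break
    return picked

# Four-digit sets rebuilt from the pairs quoted in the text (29, 52, 66, 41, ...)
table_digits = "2952" "6641" "3992" "9792" "7969" "5911" "3170" "5624" "4167"
selected = pick_from_table(table_digits, width=2, upper=80, count=10)
print(selected)  # [29, 52, 66, 41, 39, 79, 69, 59, 11, 31]
```

If the numbers identify units to be sampled without replacement, a repeated number would also have to be skipped; the routine above does not do this, since no repeat occurs within the first ten accepted pairs.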
The sampling procedure described above is quite satisfactory for a small population. With a large
population, the process of assigning identification numbers to each elementary sampling unit becomes
prohibitive with respect to both time and money. Moreover, the population is often geographically
spread out or composed of clearly identified strata possessing unique characteristics. Whenever any of
the above situations arise, alternative sampling schemes that are sophisticated combinations of simple
random sampling provide significantly better results for the same expenditure and time. As a result, the
simple random sampling method is not very frequently used in practice. However, the simple random
sampling scheme is the basis of all other probabilistic sampling schemes.
each stratum regardless of how the stratum is represented in the population. Thus, in the earlier
example, an equal number, i.e., 125, of elementary units will be drawn to constitute the sample.
A sample drawn by stratified random sampling scheme ensures a representative sample as the
population is first divided into various strata and then a sample is drawn from each stratum. Stratified
random sampling also ensures greater accuracy and it is maximum if each stratum is formed in such a
way that it consists of uniform or homogeneous items. Compared with a simple random sample, a
stratified sample can be more concentrated geographically, i.e., the elementary units from different
strata may be selected in such a way that all of them are located in one geographical area. This would
also reduce both time and cost involved in data collection. However, care should be exercised in
dividing the population into various strata. Each stratum must contain, as far as possible, homogeneous
units, as otherwise the reliability of the results would be lost.
In conclusion, stratification is an effective sampling device to the extent that it creates classes that are
more homogeneous than the total. When this can be done, the classes are distinguished that differ
among themselves in respect of a stated characteristic. Stratification may be futile if classes do not
differ among themselves. Thus, there should be homogeneity within classes and heterogeneity between
classes.
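The idea of drawing a separate simple random sample from each stratum can be sketched as follows. Everything specific here is an assumption made for illustration: the three strata of firms, their sizes, the total sample of 50, and the function names. The allocation is passed in as a dictionary, so either an equal number per stratum or a number proportional to stratum size can be used.

```python
import random

def proportional_allocation(strata, total):
    """Sample size per stratum in proportion to stratum size."""
    population = sum(len(units) for units in strata.values())
    # Rounding can make the sizes add up to slightly more or less than `total`.
    return {name: round(total * len(units) / population)
            for name, units in strata.items()}

def stratified_sample(strata, allocation, seed=None):
    """Draw a simple random sample of the allocated size from every stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, allocation[name])
            for name, units in strata.items()}

# Hypothetical population of 1,000 firms divided into three strata
strata = {
    "small firms": [f"S{i}" for i in range(600)],
    "medium firms": [f"M{i}" for i in range(300)],
    "large firms": [f"L{i}" for i in range(100)],
}
allocation = proportional_allocation(strata, total=50)
sample = stratified_sample(strata, allocation, seed=7)
print(allocation)  # {'small firms': 30, 'medium firms': 15, 'large firms': 5}
```

Passing an allocation such as {"small firms": 20, "medium firms": 20, "large firms": 20} instead gives a design with an equal number of units from each stratum, regardless of stratum size.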
sample of 10,000 students from a University. We may take colleges at the first stage, then draw
departments at the second stage, and choose students as the third and last stage.
Merits: Multi-stage sampling introduces flexibility in the sampling method which is lacking in the other
methods. It enables existing divisions and sub-divisions of the population to be used as units at various
stages, and permits the field work to be concentrated and yet large area to be covered.
Another advantage of this method is that sub-division into second stage units need be carried out for
only those first stage units which are included in the sample. It is, therefore, particularly valuable in
surveys of under-developed areas where no frame is generally sufficiently detailed and accurate for
subdivision of the material into reasonably small sampling units.
Limitations: However, a multi-stage sample is in general less accurate than a sample containing the
same number of final stage units which have been selected by some suitable single stage process.
where N is the population size and n is the sample size. While calculating the value of m, we may get a
fractional value. In such cases, it is rounded off to the nearest whole number.
1. Convenience Sampling
In this scheme, a sample is obtained by selecting 'convenient' population elements. For example, a
sample selected from readily available sources or lists, such as a telephone directory or a register of
small scale industrial units, will give us a convenience sample. In these cases, even if a random
approach is used for identifying the units, the scheme will not be considered as simple random
sampling. For example, if one studies the wage structure in a nearby textile industry by interviewing a
few selected workers, then the scheme adopted is convenience sampling. The results obtained by the
convenience sampling method can hardly be said to be representative of the population parameters.
Therefore, the results obtained are generally biased and unsatisfactory. However, the convenience sampling
approach is generally used for making pilot studies, particularly for testing a questionnaire and to obtain
preliminary information about the population.
2. Quota Sampling
In this method of sampling, the basic parameters which describe the population are identified first. Then
the sample is selected which conform to these parameters. Thus, in a quota sample, quotas are fixed
according to these parameters, and each field investigator is assigned with quotas of the number of units
to be interviewed. Within the preassigned quotas, the selection of the sample elements depends on the
personal judgement. For example, suppose one is studying the consumer preferences for ice creams among
children and college-going students, and it is decided to interview 250 individuals from each
category. If the city has five colleges, one may fix a quota of 50 students to be interviewed
from each college. It is left entirely to the interviewer to decide who will constitute this sub-sample of 50
students in a college: they may be the first 50 students who visit the ice cream parlour or may be the
50 students who visit the parlour between 4 p.m. and 6 p.m., etc.
Quota sampling method has the advantage that the sample will conform to the selected parameters of
the population. The cost and time involved in getting information from the sample will be relatively less
for a quota sample but there are many weaknesses too. Some of these are:
(i) It is difficult to validate the information gathered on the elementary units,
(ii) It may be difficult to specify the characteristics of the population and therefore it may be hard to
identify it,
(iii) Even when the sample does conform to the characteristics used in the quotas, the sample may be
distorted on other factors of importance in the study. For example, interviewing first 50 students or
the last 50 students visiting the ice cream parlour can make a lot of difference particularly about
their purchasing capacity, tastes, etc. This may completely distort the results.
Quota sampling is generally used in public opinion studies and election forecast polls, where there is
not sufficient time to adopt a probability sampling scheme.
3. Judgement Sampling
Judgement sampling can also be called sampling by opinion. In this method, someone who is
well acquainted with the population decides which members (elementary units) in his or her judgement
would constitute a proper cross-section representing the parameters of relevance to the study. This
method of sampling is generally used in studies involving performance of personnel. For example, if
one is studying the performance of sales staff in a marketing organisation, the people here are classified
into top grade, medium grade and low grade performers. Having specified qualities that are important in
the study, the expert (possibly here the Vice-President-sales) indicates the people who, in his or her
knowledge, would be representative of each of the three categories mentioned earlier. This, of course, is
not a scientific method, but in the absence of better evidence, such a judgement method may have to be
used.
7.3.3 Determination of sample size
We prefer samples to complete enumeration because of convenience and reduced cost of data
collection. However, in sampling, there is a likelihood of missing some useful information about the
population. For a high level of precision, we need to take a larger sample. How large should the
sample be, and what should the level of precision be? In specifying a sample size, care should be taken
that (i) neither so few units are selected as to render the risk of sampling error intolerably large, nor
(ii) so many units are included as to raise the cost of the study and make it inefficient. It is, therefore,
necessary to make a trade-off between (i) increasing sample size, which would reduce the sampling
error but increase the cost, and (ii) decreasing the sample size, which might increase the sampling error
while decreasing the cost. Therefore, one has to make a compromise between obtaining data with
greater precision and with that of lower cost of data collection. Several factors need to be considered
before determining the sample size.
The first and the foremost is the size of the error that would be tolerable for the purposes of decision-
making. The second consideration would be the degree of confidence with the results of the study, i.e.,
if one wants to be 100 per cent confident of the results, the entire population must be studied. However,
this is generally too impractical and costly. Therefore, one must accept something less than 100 per cent
confidence. In practice, the confidence levels most often used are 99 per cent, 95 per cent and 90 per
cent. The most commonly used confidence level is 95 per cent. This means that there is a 5 per cent risk
that the true population parameter is outside the range of possible error specified by the confidence
interval. This 5 per cent risk appears to be acceptable in most decisions. For a 95 per cent level of
confidence, the Z value is 1.96. The Z value can be obtained from the normal probability distribution for
a specified level of confidence. For determining the sample size, we make use of the following
relationship:
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
where $\sigma_{\bar{x}}$ is the standard error of the mean, $\sigma$ is the population standard deviation and n is
the sample size. $\sigma_{\bar{x}}$ can be calculated if we know the upper and lower confidence limits. Let these
limits be $\pm Y$; then
$$Z\,\sigma_{\bar{x}} = Y$$
where Z is the value of the normal variate for a given confidence level. The procedure is
explained using the illustration given below:
Illustration 7.1. A state cooperative department is performing a survey to determine the annual salary
earned by the 3,000 managers in the cooperative sector within the state. How large a sample
should it take in order to estimate the mean annual earnings within plus and minus Rs. 1,000 at a 95 per
cent confidence level? The standard deviation of annual earnings of the entire population is known to be
Rs. 3,000.
Solution. The desired upper and lower limit is Rs. 1,000, i.e., we want to estimate the annual
earnings within plus and minus Rs. 1,000. Therefore,
$$Z\,\sigma_{\bar{x}} = 1{,}000$$
As the level of confidence is 95 per cent, the Z value is 1.96, so
$$1.96\,\sigma_{\bar{x}} = 1{,}000 \quad\Rightarrow\quad \sigma_{\bar{x}} = \frac{1{,}000}{1.96} = 510.20$$
Since $\sigma_{\bar{x}} = \sigma/\sqrt{n}$,
$$\sqrt{n} = \frac{3{,}000}{510.2} = 5.88$$
This gives n = 34.57.
Therefore, the desired sample size is about 35.
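The arithmetic of Illustration 7.1 can be checked with a short script. This is only a minimal sketch: the function name `sample_size_for_mean` is our own, and, like the worked solution, it ignores any finite population correction.

```python
import math

def sample_size_for_mean(sigma, margin, z):
    """Return (exact n, n rounded up) such that z * sigma / sqrt(n) <= margin.

    Hypothetical helper for illustration; no finite population correction.
    """
    n_exact = (z * sigma / margin) ** 2
    return n_exact, math.ceil(n_exact)

# Illustration 7.1: sigma = Rs. 3,000, tolerable error = Rs. 1,000, Z = 1.96
n_exact, n_required = sample_size_for_mean(sigma=3000, margin=1000, z=1.96)
print(f"exact n = {n_exact:.2f}, sample size to use = {n_required}")
```

Rounding up rather than to the nearest integer guarantees that the tolerable error is not exceeded.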
Under proportional stratified sampling, the number of units drawn from stratum i is
$$n_i = \frac{N_i}{N} \times n$$
where $n_i$ = number of sample units from stratum i, N = the total number of units in the population, $N_i$ =
the total number of units in the stratum i, n = sample size desired.
The standard error of the mean is
$$\sigma_{\bar{x}} = \sqrt{\sum_{i=1}^{k} \frac{w_i^2 \sigma_i^2}{n_i}}$$
where $w_i$ = the weight of stratum i = $N_i/N$, $\sigma_i$ = the standard deviation of the ith stratum, k = the total
number of strata. In disproportionate stratified sampling, the proportion of units sampled from a
stratum is not equal to that stratum's proportion of the population. The formula for sample allocation in
this case is
$$n_i = \frac{w_i \sigma_i\, n}{\sum_{i=1}^{k} w_i \sigma_i}$$
Thus, the disproportionate stratified sample is more desirable if the standard deviation ($\sigma_i$) of each
stratum is known. The standard error of the mean of a disproportionate stratified sample is
$$\sigma_{\bar{x}} = \sqrt{\frac{\left(\sum_{i=1}^{k} w_i \sigma_i\right)^2}{\sum_{i=1}^{k} n_i}}$$
It may be observed that the standard error for stratified sample is smaller than for simple random
sample, i.e., much smaller samples may be utilized when the population has been stratified.
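Illustration 7.2. In a market area, shops are divided into two categories for a study estimating the total
sales in the area: those with a daily turnover of more than Rs. 2000 and those with a daily turnover of
less than Rs. 2000. There are 420 shops in the first stratum and 180 in the
second stratum. A sample of 50 was selected, and the standard deviation was found to be 70 for the
first stratum and 95 for the second stratum. What size of stratified random sample should be taken under
proportional and disproportional stratified sampling?
Solution. Under proportional stratified sampling, the sample size is given by
$$n_i = \frac{N_i}{N} \times n$$
and, therefore,
$$n_1 = \frac{420}{600} \times 50 = 35 \quad\text{and}\quad n_2 = \frac{180}{600} \times 50 = 15$$
With $w_1 = 420/600 = 0.7$ and $w_2 = 180/600 = 0.3$, the standard error is
$$\sigma_{\bar{x}} = \sqrt{\sum_{i} \frac{w_i^2 \sigma_i^2}{n_i}} = \sqrt{\frac{(0.7)^2 (70)^2}{35} + \frac{(0.3)^2 (95)^2}{15}} = \sqrt{122.75} = 11.08$$
Under disproportional stratified sampling,
$$n_1 = \frac{0.7 \times 70 \times 50}{0.7 \times 70 + 0.3 \times 95} = \frac{2450}{77.5} = 31.6$$
$$n_2 = \frac{0.3 \times 95 \times 50}{0.7 \times 70 + 0.3 \times 95} = \frac{1425}{77.5} = 18.4$$
i.e., about 32 and 18 shops respectively.
The standard error is given by
$$\sigma_{\bar{x}} = \sqrt{\frac{\left(\sum_{i=1}^{k} w_i \sigma_i\right)^2}{\sum n_i}} = \sqrt{\frac{(0.7 \times 70 + 0.3 \times 95)^2}{50}} = \sqrt{120.125} = 10.96$$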
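The two allocation rules of Illustration 7.2 can be compared numerically. The sketch below is illustrative only: the function names are our own, the allocations are left unrounded, and the same standard error expression (the square root of the sum of $w_i^2 \sigma_i^2 / n_i$) is applied to both allocations.

```python
import math

def proportional_allocation(sizes, n):
    """n_i = (N_i / N) * n for each stratum."""
    total = sum(sizes)
    return [n * size / total for size in sizes]

def disproportionate_allocation(sizes, sds, n):
    """n_i = w_i * sigma_i * n / sum(w_i * sigma_i), with w_i = N_i / N."""
    total = sum(sizes)
    weighted = [(size / total) * sd for size, sd in zip(sizes, sds)]
    return [n * w / sum(weighted) for w in weighted]

def standard_error(sizes, sds, allocation):
    """sqrt(sum(w_i^2 * sigma_i^2 / n_i)) for a given allocation."""
    total = sum(sizes)
    return math.sqrt(sum((size / total) ** 2 * sd ** 2 / n_i
                         for size, sd, n_i in zip(sizes, sds, allocation)))

# Illustration 7.2: 420 and 180 shops, standard deviations 70 and 95, n = 50
sizes, sds, n = [420, 180], [70, 95], 50
prop = proportional_allocation(sizes, n)
disp = disproportionate_allocation(sizes, sds, n)
se_prop = standard_error(sizes, sds, prop)
se_disp = standard_error(sizes, sds, disp)
print("proportional:", prop, round(se_prop, 2))
print("disproportionate:", [round(x, 2) for x in disp], round(se_disp, 2))
```

With the unrounded disproportionate allocation, the general expression reduces to the square root of $(\sum w_i \sigma_i)^2 / n$, which is why the script reproduces the value 10.96.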
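$$Z\,\sigma_{\bar{x}} = Y \quad\Rightarrow\quad \sigma_{\bar{x}} = \frac{10}{1.96}$$
Now, $\sigma_{\bar{x}} = \sigma/\sqrt{n}$ and therefore, the sample size will be determined by the equation
$$\frac{\sigma}{\sqrt{n}} = \frac{10}{1.96}, \quad\text{i.e.,}\quad \frac{85}{\sqrt{n}} = \frac{10}{1.96}$$
which gives n = 277.6.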
Thus, if the sample is taken as 278, the total cost involved will be 278 × 20 = Rs. 5560. As this cost is
considered to be on the higher side, the researcher, in order to reduce the cost, has
now settled for a 90 per cent confidence level. At the 90 per cent confidence level, the sample size can be
calculated as follows:
$$Z\,\sigma_{\bar{x}} = 10 \quad\Rightarrow\quad 1.65\,\sigma_{\bar{x}} = 10 \quad\Rightarrow\quad \sigma_{\bar{x}} = \frac{10}{1.65}$$
$$\frac{\sigma}{\sqrt{n}} = \frac{10}{1.65}, \quad\text{i.e.,}\quad \frac{85}{\sqrt{n}} = \frac{10}{1.65}$$
n = 196.7
The cost of survey for this sample size will be 197 × 20 = Rs. 3940. Thus, we have observed that by
reducing the confidence level from 95 per cent to 90 per cent, the researcher would reduce the cost from
Rs. 5560 to Rs. 3940. The researcher may not like to reduce the confidence level further and so further
cost reduction may not be desirable.
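The trade-off between confidence level and survey cost can be tabulated in a few lines. The sketch assumes the figures used in the calculation above (standard deviation 85, tolerable error 10, cost of Rs. 20 per sampled unit); the constant and function names are our own.

```python
import math

SIGMA = 85          # standard deviation used in the calculation above
MARGIN = 10         # tolerable error Y
COST_PER_UNIT = 20  # Rs. per sampled unit

def survey_plan(z):
    """Return (sample size rounded up, total cost) for a given Z value."""
    n = math.ceil((z * SIGMA / MARGIN) ** 2)
    return n, n * COST_PER_UNIT

for level, z in (("95 per cent", 1.96), ("90 per cent", 1.65)):
    n, cost = survey_plan(z)
    print(f"{level}: n = {n}, cost = Rs. {cost}")
```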
7.4 SAMPLING AND NON-SAMPLING ERRORS
The choice of a sample though may be made with utmost care, involves certain errors which may be
classified into two types: (a) Sampling errors, and (b) Non-Sampling errors. These errors may occur in
the collection, processing and analysis of data.
(a) Sampling Errors
Sampling errors are those which arise due to the method of sampling. Sampling errors arise primarily
due to the following reasons:
(1) Faulty selection of the sampling method.
(2) Substituting one sample for another due to difficulties in collecting the original sample.
(3) Faulty demarcation of sampling units.
(4) Variability of the population which has different characteristics.
(b) Non-Sampling Errors
Non-Sampling errors are those which creep in due to human factors which always vary from one
investigator to another. These errors arise due to any of the following factors:
(1) Faulty planning.
(2) Faulty selection of the sample units.
(3) Lack of trained and experienced staff to collect the data.
(4) Negligence and non-response on the part of the respondent.
(5) Errors in compilation.
(6) Errors due to wrong statistical measures.
(7) Framing of wrong questionnaire.
(8) Incomplete investigation of the sample survey.
7.5 CHECK YOUR PROGRESS
1. In simple random sampling, drawing of elements from the population is ............ and the choice of an
element is made in such a way that every element has the same probability of being chosen.
2. In stratified random sampling, the population is sub-divided into ............... before the sample is
drawn.
3. In convenience sampling, a sample is obtained by selecting ........... population elements.
4. In a quota sample, quotas are fixed according to these parameters, and each field investigator is
assigned with quotas of the ............... to be interviewed.
5. One has to make a compromise between obtaining data with greater ............... and with that of lower
cost of data collection.
7.6 SUMMARY
The process of selecting a sample is known as sampling. Thus, the sampling theory is a study of
relationship that exists between the population and the samples drawn from the population. The
complete enumeration, popularly known as census, may not be feasible either due to non-availability of
time or because of high cost involved. A probability sample is one for which the inclusion or exclusion
of any individual element of the population depends upon the application of probability methods and
not on a personal judgement. It is so designed and drawn that the probability of inclusion of an element
is known. The essential feature of drawing such a sample is the randomness. As against the probability
sample, we have a variety of other samples, termed as judgement samples, purposive samples, quota
samples, etc. These samples have one common distinguishing feature: personal judgement rather than
the random procedure to determine the composition of what is to be taken as a representative sample.
The judgement affects the choice of the individual elements. All such samples are non-random, and no
objective measure of precision may be attached to the results arrived at.
Non-probability sampling is a procedure of selecting a sample without the use of probability or
randomisation. It is based on convenience, judgement, etc. Several factors need to be considered before
determining the sample size. The first and the foremost is the size of the error that would be tolerable
for the purposes of decision-making. The second consideration is the degree of confidence required in
the results of the study.
7.7 KEYWORDS
Elementary Units: The attributes that are the object of the study are known as characteristics and the
units possessing them are called the elementary units.
Population: The aggregate of elementary units to which the conclusions of the study apply is termed as
population/universe.
Sampling Unit: The units that form the basis of the sampling process are called sampling units. The
sampling unit may be an elementary unit.
Sample: The sample is defined as an aggregate of sampling units actually chosen in obtaining a
representative subset from which inferences about the population are drawn.
Frame: A list or directory that defines all the sampling units in the universe to be covered.
Cluster sampling or multistage sampling: Under this method, the random selection
is made of primary, intermediate and final (or the ultimate) units from a given
population or stratum. There are several stages in which the sampling process is
carried out.
Judgement sampling method: In this method, someone who is well acquainted with the population
decides which members (elementary units) in his or her judgement would constitute a proper cross-
section representing the parameters of relevance to the study.
7.8 SELF-ASSESSMENT TEST
1. Describe the various methods of drawing a sample. Which one would you suggest and why?
2. Describe the importance of sampling. Critically examine the merits of probability sampling and non-
probability sampling methods.
3. Specify and explain the factors that make sampling preferable to a complete census in a statistical
investigation.
4. How would you determine the sample size for stratified sampling? Explain with the help of a suitable
example.
5. To determine the effectiveness of the advertising campaign of a new VCR, management would like
to know what percentage of households are aware of the new brand. The advertising agency thinks
that this figure is as high as 70 per cent. The management would like a 95% confidence interval and
a margin of error no greater than plus or minus 2 per cent. (a) What sample size should be used for
this study? (b) Suppose that management wanted to be 99 per cent confident but could tolerate an
error of plus or minus 3 per cent. How would the sample size change?
7.9 ANSWERS TO CHECK YOUR PROGRESS
1. Random
2. Strata
3. Convenient
4. Number of units
5. Precision
STRUCTURE
8.1 Learning Objectives
8.2 Introduction
8.6 Summary
8.7 Keywords
8.2 INTRODUCTION
Having discussed the various methods available for picking up a sample from a population, we would
naturally be interested in drawing statistical inferences - making generalizations about the population on
the basis of a sample drawn from it. The generalizations to be made about the population are usually
made either by way of
estimating the unknown population parameters on the basis of sample data, or
testing appropriate hypotheses stated in relation to population parameters in the light of sample
data.
These generalizations, together with the measurement of their reliability, are made in terms of the
relationship between the values of any sample statistic and those of the corresponding population
parameters. A population parameter is any number computed (or estimated) for the entire population, viz.
population mean, population median, population proportion, population variance and so on. A population
parameter is unknown but fixed; its value is to be estimated from a sample statistic, which is known
but random. A sample statistic is any number computed from our sample data, viz. sample mean, sample
median, sample proportion, sample variance and so on.
It may be appreciated that no single value of the sample statistic is likely to be equal to the
corresponding population parameter. This owes to the fact that the sample statistic being random,
assumes different values in different samples of the same size drawn from the same population.
Referring to our earlier discussion on the concept of a random variable in the lessons on Probability
Distributions, it is not difficult to see that any sample statistic is a random variable and, therefore, has
a probability distribution, better known as the sampling distribution of the statistic.
The sampling distribution of a statistic is the probability distribution of all possible values the statistic
may take when computed from random samples of the same size drawn from a specified population.
Knowing the sampling distribution of a sample statistic, we can calculate the probability that the sample statistic
assumes a particular value (if it is a discrete random variable) or has a value in a given interval. This
ability to calculate the probability that the sample statistic lies in a particular interval is the most
important factor in all statistical inferences. We will demonstrate this by an example.
Suppose we know that 40% of the population of all users of hair oil prefers our brand to the next
competing brand. A "new improved" version of our brand has been developed and given to a random
sample of 100 users for use. If 55 of these prefer our "new improved" version to the next competing
brand, what should we conclude? For an answer, we would like to know the probability that the sample
proportion in a sample of size 100 is as large as 55% or higher when the true population proportion is
only 40%, i.e. assuming that the new version is no better than the old. If this probability is quite large,
say 0.5, we might conclude that the high sample proportion viz. 55% is perhaps because of sampling
errors and the new version is not really superior to the old. On the other hand, if this probability works
out to a very small figure, say 0.001, then rather than concluding that we have observed a rare event we
might conclude that the true population proportion is higher than 40%, i.e. the new version is actually
superior to the old one as perceived by members of the population. To calculate this probability, we
need to know the probability distribution of sample proportion i.e. the sampling distribution of the
proportion.
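The probability asked for in the hair oil example can be computed directly once the sampling distribution is known. The sketch below is an illustration under the stated null assumption (true proportion 0.40, sample size 100): it evaluates the exact binomial tail and, for comparison, a normal approximation without continuity correction. The function names are our own.

```python
from math import comb, erf, sqrt

def binomial_upper_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k, n + 1))

def normal_upper_tail(n, p, p_hat):
    """Normal approximation to P(sample proportion >= p_hat)."""
    se = sqrt(p * (1 - p) / n)               # standard error of the proportion
    z = (p_hat - p) / se
    return 0.5 * (1 - erf(z / sqrt(2)))      # 1 - Phi(z)

exact = binomial_upper_tail(n=100, p=0.40, k=55)
approx = normal_upper_tail(n=100, p=0.40, p_hat=0.55)
print(f"exact binomial tail: {exact:.4f}")
print(f"normal approximation: {approx:.4f}")
```

Both figures are very small, which is the situation in which we would conclude that the new version is preferred by more than 40% of users.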
8.2.1 Sampling Distribution of the Mean
Suppose we have a simple random sample of size n, picked up from a population of size N. We take
measurements on each sample member in the characteristic of our interest and denote the observations as
$x_1, x_2, \ldots, x_n$ respectively. The sample mean for this sample is defined as:
$$\bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
If we pick up another sample of size n from the same population, we might end up with a totally
different set of sample values and so a different sample mean. Therefore, there are many (perhaps
infinitely many) possible values of the sample mean and the particular value that we obtain, if we pick up
only one sample, is determined only by chance. In other words, the sample mean is a random variable. The
possible values of this random variable depend on the possible values of the elements in the random
sample from which the sample mean is to be computed. The random sample, in turn, depends on the
distribution of the population from which it is drawn. As a random variable, $\bar{X}$ has a probability
distribution. This probability distribution is the sampling distribution of $\bar{X}$.
The sampling distribution of $\bar{X}$ is the probability distribution of all possible values the random
variable $\bar{X}$ may take when a sample of size n is taken from a specified population.
To observe the distribution of $\bar{X}$ empirically, we have to take many samples of size n and determine the
value of $\bar{X}$ for each sample. Then, looking at the various observed values of $\bar{X}$, it might be possible to
get an idea of the nature of the distribution. We will derive the distribution of $\bar{X}$ in three cases:
(a) Sampling from infinite populations
(b) Sampling with replacement from finite populations
(c) Sampling without replacement from finite populations
The sample mean is
$$\bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
Here $x_1$, representing the first observed value in the sample, is a random variable since it may take any
of the population values. Similarly $x_2$, representing the second observed value in the sample, is also a
random variable since it may take any of the population values. In other words, we can say that $x_i$,
representing the ith observed value in the sample, is a random variable.
Now when the population is infinitely large, whatever the value of $x_1$, the distribution of $x_2$ is not
affected by it. This is true for any other pair of random variables as well. In other words, $x_1, x_2, \ldots, x_n$
are independent random variables and all are picked up from the same population.
So $E(x_i) = \mu$ and $\mathrm{Var}(x_i) = \sigma^2$ for i = 1, 2, 3, ..., n.
Finally, we have
$$\mu_{\bar{x}} = E(\bar{X}) = E\left(\frac{x_1 + x_2 + \cdots + x_n}{n}\right) = E\left(\frac{x_1}{n}\right) + E\left(\frac{x_2}{n}\right) + \cdots + E\left(\frac{x_n}{n}\right) \quad [\text{as } E(A + B) = E(A) + E(B)]$$
$$= \frac{1}{n}\,(n\mu) = \mu$$
and, since the $x_i$ are independent,
$$\sigma^2_{\bar{x}} = \mathrm{Var}(\bar{X}) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(x_i) = \frac{\sigma^2}{n}$$
or
$$\sigma_{\bar{x}} = \mathrm{SD}(\bar{X}) = \frac{\sigma}{\sqrt{n}}$$
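When sampling is done without replacement from a finite population, the random variables representing
the n sample members do not remain independent, and the expression for the
variance of $\bar{X}$ changes. The results in this case will be:
$$\mu_{\bar{x}} = E(\bar{X}) = \mu$$
and
$$\sigma^2_{\bar{x}} = \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n} \cdot \frac{N - n}{N - 1} \quad\text{or}\quad \sigma_{\bar{x}} = \mathrm{S.D.}(\bar{X}) = \frac{\sigma}{\sqrt{n}} \cdot \sqrt{\frac{N - n}{N - 1}}$$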
By comparing these expressions with the ones derived above, we find that the variance of $\bar{X}$ is the same
but further multiplied by the factor $\frac{N - n}{N - 1}$. This factor is, therefore, known as the finite population
multiplier or the correction factor. In practice, almost all samples are picked up without
replacement. Also, most populations are finite although they may be very large, and so the variance of
the mean should theoretically be found by using the expression given above. However, if the population
size (N) is large and consequently the sampling ratio (n/N) small, then the finite population multiplier is
close to 1 and is not used, thus treating large finite populations as if they were infinitely large. For
example, if N = 100,000 and n = 100, the finite population multiplier will be 0.9990 (and its square root,
which multiplies the standard deviation, 0.9995), which is very close
to 1, and the variance of the mean would, for all practical purposes, be the same whether the population
is treated as finite or infinite. As a rule of thumb, the finite population multiplier need not be used if the
sampling ratio (n/N) is smaller than 0.05.
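The effect of the finite population multiplier is easy to check numerically. The following is a small sketch with function names of our own choosing; the standard deviation of 15 used in the last lines is an arbitrary illustrative value.

```python
import math

def finite_population_multiplier(N, n):
    """(N - n) / (N - 1): the factor applied to the variance of the sample mean."""
    return (N - n) / (N - 1)

def standard_error_of_mean(sigma, n, N=None):
    """sigma / sqrt(n), corrected for a finite population of size N if given."""
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt(finite_population_multiplier(N, n))
    return se

fpm = finite_population_multiplier(N=100_000, n=100)
print(f"multiplier on the variance: {fpm:.4f}")
print(f"multiplier on the standard deviation: {math.sqrt(fpm):.4f}")

# Illustrative comparison with sigma = 15 (an assumed value)
print(standard_error_of_mean(15, 100), standard_error_of_mean(15, 100, N=100_000))
```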
The above discussion on the sampling distribution of the mean presents two very important results,
which we shall be using very often in statistical estimation and hypotheses testing. We have
seen that the expected value of the sample mean is the same as the population mean. Similarly,
the variance of the sample mean is the variance of the population divided by the sample
size (and multiplied by the correction factor when appropriate). The fact that the sampling
distribution of $\bar{X}$ has mean μ is very important. It means that, on the average, the sample mean
is equal to the population mean. The distribution of the statistic is centered on the parameter to
be estimated, and this makes the statistic $\bar{X}$ a good estimator of μ. This fact will become clearer
in the next lesson, where we will discuss estimators and their properties. The fact that the
standard deviation of $\bar{X}$ is $\sigma/\sqrt{n}$ means that as the sample size increases, the standard
deviation of $\bar{X}$ decreases.
The standard deviation of $\bar{X}$ is also called the standard error of the mean. It indicates the extent to
which the observed value of sample mean can be away from the true value, due to sampling errors. For
example, if the standard error of the mean is small, we may be reasonably confident that whatever
sample mean value we have observed cannot be very far away from the true value.
Before discussing the shape of the sampling distribution of mean, let us verify the above results
empirically, with the help of a simple example.
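Consider a discrete uniform population consisting of the values 1, 2, and 3. If the random variable X
represents these population values, its mean and variance are
$$\mu = \frac{\sum X_i}{N} = \frac{6}{3} = 2$$
$$\sigma^2 = \frac{\sum (X_i - \mu)^2}{N} = \frac{(1 - 2)^2 + (2 - 2)^2 + (3 - 2)^2}{3} = \frac{2}{3}$$
(a) Sampling With Replacement
If random samples of size n = 2 are drawn with replacement from this population, we will have $N^n = 3^2$
= 9 possible samples. These are shown in Box 8-1 along with the corresponding sample mean values,
which vary from 1 to 3. The resulting distribution of $\bar{X}$ is given below:
$\bar{X}$    :  1     1.5   2     2.5   3
$P(\bar{X})$ :  1/9   2/9   3/9   2/9   1/9
Box 8-1 Possible samples of size 2 (with replacement) and their means
Sample :  (1,1)  (1,2)  (1,3)  (2,1)  (2,2)  (2,3)  (3,1)  (3,2)  (3,3)
Mean   :  1      1.5    2      1.5    2      2.5    2      2.5    3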
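Now we can find the mean and variance of the sampling distribution; the necessary calculations are
given in Table 8-1.
Table 8-1 Calculations for $\mu_{\bar{x}}$ and $\sigma^2_{\bar{x}}$
$\bar{X}$      $P(\bar{X})$     $\bar{X} \cdot P(\bar{X})$     $P(\bar{X}) \cdot [\bar{X} - E(\bar{X})]^2$
1      1/9      1/9        1/9
1.5    2/9      3/9        2/36
2      3/9      6/9        0
2.5    2/9      5/9        2/36
3      1/9      3/9        1/9
Total  1        2          1/3
The mean of the sampling distribution is
$$\mu_{\bar{x}} = E(\bar{X}) = \sum \bar{X} \cdot P(\bar{X}) = 2 = \mu$$
and the variance of the sampling distribution,
$$\sigma^2_{\bar{x}} = \mathrm{Var}(\bar{X}) = \sum P(\bar{X}) \cdot [\bar{X} - E(\bar{X})]^2 = \frac{1}{3} = \frac{2/3}{2} = \frac{\sigma^2}{n}$$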
(b) Sampling Without Replacement
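If random samples of size n = 2 are drawn without replacement from this population, we will have
$^{N}P_{n} = {}^{3}P_{2} = 6$ possible samples. These are shown in Box 8-2 along with the corresponding sample mean
values, which vary from 1.5 to 2.5.
Box 8-2 Possible samples of size 2 (without replacement) and their means
Sample :  (1,2)  (1,3)  (2,1)  (2,3)  (3,1)  (3,2)
Mean   :  1.5    2      1.5    2.5    2      2.5
The mean of the sampling distribution is
$$\mu_{\bar{x}} = E(\bar{X}) = \sum \bar{X} \cdot P(\bar{X}) = 2 = \mu$$
and the variance of the sampling distribution,
$$\sigma^2_{\bar{x}} = \mathrm{Var}(\bar{X}) = \sum P(\bar{X}) \cdot [\bar{X} - E(\bar{X})]^2 = \frac{1}{6} = \frac{2/3}{2} \cdot \frac{3 - 2}{3 - 1} = \frac{\sigma^2}{n} \cdot \frac{N - n}{N - 1}$$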
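Both results can be verified by listing every possible sample from the population {1, 2, 3}. The sketch below uses exact fractions so that the outcome can be compared directly with the formulas; the helper name `sampling_distribution_moments` is our own.

```python
from fractions import Fraction
from itertools import permutations, product

population = [1, 2, 3]
n = 2

def sampling_distribution_moments(samples):
    """Mean and variance of the sample mean over equally likely samples."""
    means = [Fraction(sum(s), len(s)) for s in samples]
    mu = sum(means) / len(means)
    var = sum((m - mu) ** 2 for m in means) / len(means)
    return mu, var

with_replacement = list(product(population, repeat=n))   # 3^2 = 9 ordered samples
without_replacement = list(permutations(population, n))  # 3P2 = 6 ordered samples

print(sampling_distribution_moments(with_replacement))     # mean 2, variance 1/3
print(sampling_distribution_moments(without_replacement))  # mean 2, variance 1/6
```

The variance without replacement equals the variance with replacement multiplied by (N - n)/(N - 1) = 1/2, as the formula requires.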
Now if we compare the shapes of the parent population and the resulting sampling distribution of mean,
we find that although our parent population is uniformly distributed, the sampling distribution of mean
is symmetrically distributed as shown in Figure 8-2.
If we increase the sample size n, we observe an interesting and important fact. As n increases:
the number of possible values $\bar{X}$ can assume increases, so the number of rectangles increases;
the probability that $\bar{X}$ assumes a particular value decreases, i.e. the width of the rectangles decreases.
Figure 8-2 Parent Population and Sampling Distribution of Mean for n = 2 and n = 5
In the limiting case when the sample size n increases infinitely, the number of values $\bar{X}$ can assume
approaches infinity and the probability that $\bar{X}$ assumes a particular value approaches zero. In other
words, the limiting distribution of $\bar{X}$ is the normal distribution. Thus, as $n \to \infty$, $\bar{X} \sim N(\mu, \sigma^2/n)$.
Figure 8-3 Limiting Distribution of $\bar{X}$ (the total area under the curve $f(\bar{X})$ is 1; the area between a and b
gives $P(a < \bar{X} < b)$)
8.2.3 THE CENTRAL LIMIT THEOREM
The result we just stated, that the limiting distribution of $\bar{X}$ is the normal distribution, is one of the
most important results in statistics. It is popularly known as the central limit theorem. When sampling is
done from a population with mean μ and standard deviation σ, the sampling distribution of the sample
mean $\bar{X}$ tends to a normal distribution with mean μ and standard deviation $\sigma/\sqrt{n}$ as the sample size n
increases. For "large enough" n: $\bar{X} \sim N(\mu, \sigma^2/n)$. The central limit theorem is remarkable
because it states that the distribution of the sample mean $\bar{X}$ tends to a normal distribution regardless of
the distribution of the population from which the random sample is drawn. The theorem allows us to
make probability statements about the possible range of values the sample mean may take. It allows us
to compute probabilities of how far away $\bar{X}$ may be from the population mean it estimates. We will
extensively use the central limit theorem in the next two lessons about statistical estimation and testing
of hypotheses.
How large is "large enough"? A commonly used rule of thumb treats a sample of 30 or more elements as
large enough for the central limit theorem to apply.
We emphasize that this is a general, and somewhat arbitrary, rule. A larger minimum sample size may
be required for a good normal approximation when the population distribution is very different from a
normal distribution. By the same reasoning, a smaller minimum sample size may suffice for a good normal
approximation when the population distribution is close to a normal distribution.
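The behaviour described by the central limit theorem can be observed by simulation. The sketch below is an illustration only: it assumes an exponential population with mean 1 and standard deviation 1 (chosen because it is far from normal), a sample size of 30 and 5,000 repeated samples; none of these values come from the text.

```python
import random
import statistics

random.seed(0)  # fixed seed so that the run is reproducible

def simulate_sample_means(n, reps=5000):
    """Draw `reps` samples of size n from an exponential population (mean 1, SD 1)."""
    return [statistics.fmean([random.expovariate(1.0) for _ in range(n)])
            for _ in range(reps)]

means = simulate_sample_means(n=30)
print("mean of sample means:", round(statistics.fmean(means), 3))
print("SD of sample means:  ", round(statistics.stdev(means), 3))
print("theoretical sigma/sqrt(n):", round(1 / 30 ** 0.5, 3))
```

The simulated sample means cluster around the population mean with a spread close to $\sigma/\sqrt{n}$, even though the parent population is strongly skewed.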
The last statement is the key to the important fact that as the sample size increases, the variation of $\bar{X}$
about its mean μ decreases. Stated another way, as we buy more information (take a larger sample),
our uncertainty (measured by the standard deviation) about the parameter being estimated decreases.
Let us now look at an example of the use of the central limit theorem.
Example 8-1
ABC Tool Company makes Laser XR, a special engine used in speedboats. The company's engineers
believe that the engine delivers an average power of 220 horsepower, and that the standard deviation of
power delivered is 15 horsepower. A potential buyer intends to sample 100 engines (each engine to be
run a single time). What is the probability that the sample mean $\bar{X}$ will be less than 217 horsepower?
Solution: Given that:
Population mean μ = 220 horsepower
Population standard deviation σ = 15 horsepower
Sample size n = 100
Here our random variable $\bar{X}$ is normal (or at least approximately so, by the central limit theorem, as our
sample size is large).
$$\bar{X} \sim N(\mu, \sigma^2/n) \quad\text{or}\quad \bar{X} \sim N(220, 15^2/100)$$
So we can use the standard normal variable $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ to find the required probability,
$$P(\bar{X} < 217) = P\left(Z < \frac{217 - 220}{15/\sqrt{100}}\right) = P(Z < -2) = 0.0228$$
So there is a small probability that the potential buyer's tests will result in a sample mean less than 217
horsepower.
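The probability in Example 8-1 can be reproduced without a normal table. This is a minimal sketch: the helper `normal_cdf` is our own and is built on the error function from Python's standard library.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Cumulative distribution function of the standard normal variable."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, n = 220, 15, 100
standard_error = sigma / sqrt(n)   # sigma / sqrt(n) = 1.5
z = (217 - mu) / standard_error    # standardised value of 217 horsepower
probability = normal_cdf(z)
print(f"Z = {z:.2f}, P(sample mean < 217) = {probability:.4f}")
```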
8.2.4 SAMPLING DISTRIBUTION OF THE PROPORTION
Let us assume we have a binomial population in which a proportion p of the population possesses a
particular attribute that is of interest to us. This also implies that a proportion q (= 1 - p) of the population
does not possess the attribute of interest. If we pick up a sample of size n with replacement and find x
successes in the sample, the sample proportion of successes ($\bar{p}$) is given by
$$\bar{p} = \frac{x}{n}$$
in which x is a binomial random variable. The possible values of this random variable depend on the
composition of the random sample from which $\bar{p}$ is computed. The probability of x successes in the
sample of size n is given by a binomial probability distribution, viz.
$$P(x) = {}^{n}C_{x}\, p^{x} q^{n-x}$$
Since $\bar{p} = x/n$ and n is fixed (determined before the sampling), the distribution of the number of
successes (x) leads to the distribution of $\bar{p}$.
The sampling distribution of $\bar{p}$ is the probability distribution of all possible values the random
variable $\bar{p}$ may take when a sample of size n is taken from a specified population.
The expected value and the variance of x, i.e. the number of successes in a sample of size n, are known to be:
E(x) = np; Var(x) = npq
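Finally, we have the mean and variance of the sampling distribution of $\bar{p}$:
$$\mu_{\bar{p}} = E(\bar{p}) = E\left(\frac{x}{n}\right) = \frac{1}{n} E(x) = \frac{1}{n} \cdot np = p$$
and
$$\sigma^2_{\bar{p}} = \mathrm{Var}(\bar{p}) = \mathrm{Var}\left(\frac{x}{n}\right) = \frac{1}{n^2} \cdot \mathrm{Var}(x) = \frac{1}{n^2} \cdot npq = \frac{pq}{n}$$
$$\sigma_{\bar{p}} = \mathrm{SD}(\bar{p}) = \sqrt{\frac{pq}{n}}$$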
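These two results can be checked exactly for any particular binomial case. The sketch below assumes illustrative values p = 1/5 and n = 10 (not taken from the text) and uses exact fractions; the function name `proportion_moments` is our own.

```python
from fractions import Fraction
from math import comb

def proportion_moments(n, p):
    """Exact mean and variance of the sample proportion x/n, x ~ Binomial(n, p)."""
    q = 1 - p
    # Probability of each possible value x/n of the sample proportion
    probs = {Fraction(x, n): comb(n, x) * p ** x * q ** (n - x) for x in range(n + 1)}
    mean = sum(value * prob for value, prob in probs.items())
    variance = sum(prob * (value - mean) ** 2 for value, prob in probs.items())
    return mean, variance

p, n = Fraction(1, 5), 10   # assumed values for illustration
mean, variance = proportion_moments(n, p)
print(mean, variance)        # expected to equal p and pq/n
print(p, p * (1 - p) / n)
```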
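When sampling is without replacement, we can use the finite population correction factor, so the sampling
distribution of $\bar{p}$ has
Mean: $\mu_{\bar{p}} = p$
Variance: $\sigma^2_{\bar{p}} = \frac{pq}{n} \cdot \frac{N - n}{N - 1}$
Standard deviation: $\sigma_{\bar{p}} = \sqrt{\frac{pq}{n}} \cdot \sqrt{\frac{N - n}{N - 1}}$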
As the sample size n increases, the central limit theorem applies here as well. The rate at which the
distribution approaches a normal distribution does depend, however, on the shape of the distribution of
the parent population.
if the population is symmetrically distributed, the distribution of $\bar{p}$ approaches the normal
distribution relatively fast
if the population distribution is very different from a symmetrical distribution, a relatively large
sample size is required to achieve a good normal approximation for the distribution of $\bar{p}$
In order to use the normal approximation for the sampling distribution of $\bar{p}$, the sample size needs to be
large. A commonly used rule of thumb says that the normal approximation to the distribution of $\bar{p}$ may
be used only if both np and nq are greater than 5. We now state the central limit theorem when
sampling for the population proportion p.
When sampling is done from a population with proportion p, the sampling distribution of the sample
proportion $\bar{p}$ approaches a normal distribution with mean p and standard deviation $\sqrt{pq/n}$ as the
sample size n increases.
For "large enough" n: $\bar{p} \sim N(p, pq/n)$
The estimated standard deviation of $\bar{p}$ is also called its standard error. We demonstrate the use of the
theorem in Example 8-2.
Example 8-2
A manufacturer of screws has noticed that, on average, a proportion 0.02 of the screws produced are
defective. A random sample of 400 screws is examined for the proportion of defective screws. Find the
probability that the proportion of defective screws ($\bar{p}$) in the sample is between 0.01 and 0.03.
Solution:
Given that:
Population proportion p = 0.02
So q = 0.98 (= 1 - 0.02)
Sample size n = 400
Since the population is infinite and the sample size is large (np = 8 and nq = 392 are both greater
than 5), the central limit theorem applies. So
$$\bar{p} \sim N\left(p, \frac{pq}{n}\right), \quad\text{i.e.}\quad \bar{p} \sim N\left(0.02, \frac{(0.02)(0.98)}{400}\right)$$
We can find the required probability using the standard normal variable $Z = \frac{\bar{p} - p}{\sqrt{pq/n}}$:
$$P(0.01 < \bar{p} < 0.03) = P\left(\frac{0.01 - 0.02}{\sqrt{\frac{(0.02)(0.98)}{400}}} < Z < \frac{0.03 - 0.02}{\sqrt{\frac{(0.02)(0.98)}{400}}}\right)$$
$$= P\left(\frac{-0.01}{0.007} < Z < \frac{0.01}{0.007}\right)$$
= P(-1.43 < Z < 1.43)
= 2 P(0 < Z < 1.43)
= 0.8472
So there is a very high probability that the sample will result in a proportion between 0.01 and 0.03.
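The calculation in Example 8-2 can be reproduced as follows, taking q = 1 - p = 0.98. This is a sketch with a helper function of our own; because it does not round Z to 1.43, the result differs from the table-based value in the fourth decimal place.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Cumulative distribution function of the standard normal variable."""
    return 0.5 * (1 + erf(z / sqrt(2)))

p, n = 0.02, 400
q = 1 - p
standard_error = sqrt(p * q / n)          # sqrt(pq/n)
z_low = (0.01 - p) / standard_error
z_high = (0.03 - p) / standard_error
probability = normal_cdf(z_high) - normal_cdf(z_low)
print(f"standard error = {standard_error:.4f}")
print(f"P(0.01 < sample proportion < 0.03) = {probability:.4f}")
```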
The sampling distribution of $\bar{X}_1 - \bar{X}_2$ is the probability distribution of all possible values
the random variable $\bar{X}_1 - \bar{X}_2$ may take when independent samples of size n1 and n2 are taken
from two specified populations.
Mean and Variance of $\bar{X}_1 - \bar{X}_2$
$\mu_{\bar{X}_1 - \bar{X}_2} = E(\bar{X}_1 - \bar{X}_2) = E(\bar{X}_1) - E(\bar{X}_2) = \mu_1 - \mu_2$ and
$\sigma^2_{\bar{X}_1 - \bar{X}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$ ; when sampling is with replacement
$\sigma^2_{\bar{X}_1 - \bar{X}_2} = \frac{\sigma_1^2}{n_1} \cdot \frac{N_1 - n_1}{N_1 - 1} + \frac{\sigma_2^2}{n_2} \cdot \frac{N_2 - n_2}{N_2 - 1}$ ; when sampling is without replacement
As the sample sizes n1 and n2 increase, the central limit theorem applies here as well. So we state the
central limit theorem when sampling for the difference of population means.
When sampling is done from two populations with means μ1 and μ2 and standard deviations σ1 and σ2
respectively, the sampling distribution of the difference of sample means $\bar{X}_1 - \bar{X}_2$ approaches
a normal distribution with mean μ1 - μ2 and standard deviation
$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$ as the sample sizes n1 and n2 increase.
For "Large Enough" n1 and n2: $\bar{X}_1 - \bar{X}_2 \sim N\left(\mu_1 - \mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)$
The estimated standard deviation of $\bar{X}_1 - \bar{X}_2$ is also called its standard error. We demonstrate
the use of the theorem in Example 8-3.
Example 8-3
The makers of Duracell batteries claim that the size AA battery lasts on average 45 minutes
longer than Duracell's main competitor, the Energizer. Two independent random samples of 100
batteries of each kind are selected. Assuming σ1 = 84 minutes and σ2 = 67 minutes, find the probability
that the difference in the average lives of Duracell and Energizer batteries based on samples does not
exceed 54 minutes.
Solution: Given that:
μ1 - μ2 = 45
σ1 = 84 and σ2 = 67
n1 = 100 and n2 = 100
Let $\bar{X}_1$ and $\bar{X}_2$ denote the two sample average lives of Duracell and Energizer batteries
respectively. Since the population is infinite and the sample sizes are large, the central limit theorem
applies, i.e.
$\bar{X}_1 - \bar{X}_2 \sim N\left(\mu_1 - \mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)$
$\bar{X}_1 - \bar{X}_2 \sim N\left(45, \frac{84^2}{100} + \frac{67^2}{100}\right)$
So we can find the required probability using the standard normal variable
$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
So $P(\bar{X}_1 - \bar{X}_2 < 54) = P\left(Z < \frac{54 - 45}{\sqrt{\frac{84^2}{100} + \frac{67^2}{100}}}\right)$
= P(Z < 0.84) = 1 - 0.20045 = 0.79955
So there is a very high probability that the difference in the average lives of Duracell and Energizer
batteries based on samples does not exceed 54 minutes.
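The same steps can be traced in code. This is a small sketch using only the Python standard library; the function name `prob_mean_difference_below` is an illustrative choice. It computes the standard error of the difference of sample means, the Z value, and the probability from the error function (not a printed table, so the last digits differ from the table value).

```python
import math


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_mean_difference_below(limit, mu_diff, sigma1, sigma2, n1, n2):
    """Return (z, P(X1bar - X2bar < limit)) under the normal approximation."""
    # Standard error for independent samples drawn with replacement
    se = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    z = (limit - mu_diff) / se
    return z, normal_cdf(z)


# Example 8-3: mu1 - mu2 = 45, sigma1 = 84, sigma2 = 67, n1 = n2 = 100
z_8_3, prob_8_3 = prob_mean_difference_below(54, 45, 84, 67, 100, 100)
print(round(z_8_3, 2), round(prob_8_3, 3))
```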
8.2.6 Sampling Distribution of the Difference of Sample Proportions
Let us assume we have two binomial populations labeled 1 and 2, so that
p1 and p2 denote the two population proportions
n1 and n2 denote the two sample sizes
$\hat{p}_1$ and $\hat{p}_2$ denote the two sample proportions
The sampling distribution of $\hat{p}_1 - \hat{p}_2$ is the probability distribution of all possible values
the random variable $\hat{p}_1 - \hat{p}_2$ may take when independent samples of size n1 and n2 are taken
from two specified binomial populations.
Mean and Variance of $\hat{p}_1 - \hat{p}_2$
$\mu_{\hat{p}_1 - \hat{p}_2} = E(\hat{p}_1 - \hat{p}_2) = E(\hat{p}_1) - E(\hat{p}_2) = p_1 - p_2$
$\sigma^2_{\hat{p}_1 - \hat{p}_2} = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}$ ; when sampling is with replacement
$\sigma^2_{\hat{p}_1 - \hat{p}_2} = \frac{p_1 q_1}{n_1} \cdot \frac{N_1 - n_1}{N_1 - 1} + \frac{p_2 q_2}{n_2} \cdot \frac{N_2 - n_2}{N_2 - 1}$ ; when sampling is without replacement
As the sample sizes n1 and n2 increase, the central limit theorem applies here as well. So we state the
central limit theorem when sampling for the difference of population proportions p1 - p2.
When sampling is done from two populations with proportions p1 and p2 respectively, the sampling
distribution of the difference of sample proportions $\hat{p}_1 - \hat{p}_2$ approaches a normal
distribution with mean p1 - p2 and standard deviation $\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$
as the sample sizes n1 and n2 increase.
For "Large Enough" n1 and n2: $\hat{p}_1 - \hat{p}_2 \sim N\left(p_1 - p_2, \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}\right)$
The estimated standard deviation of $\hat{p}_1 - \hat{p}_2$ is also called its standard error. We
demonstrate the use of the theorem in Example 8-4.
Example 8-4
It has been experienced that the proportions of defaulters (in tax payments) belonging to the business
class and the professional class are 0.20 and 0.15 respectively. The results of a sample survey are:
                            Business class          Professional class
Sample size:                n1 = 400                n2 = 420
Proportion of defaulters:   $\hat{p}_1$ = 0.21      $\hat{p}_2$ = 0.14
Find the probability of drawing two samples with a difference in the two sample proportions larger than
what is observed.
Solution: Given that:
p1 = 0.20                   p2 = 0.15
q1 = 1-0.20 = 0.80          q2 = 1-0.15 = 0.85
n1 = 400                    n2 = 420
$\hat{p}_1$ = 0.21          $\hat{p}_2$ = 0.14
Since the population is infinite and the sample sizes are large, the central limit theorem applies, i.e.
$\hat{p}_1 - \hat{p}_2 \sim N\left(p_1 - p_2, \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}\right)$
$\hat{p}_1 - \hat{p}_2 \sim N\left(0.05, \frac{(0.20)(0.80)}{400} + \frac{(0.15)(0.85)}{420}\right)$
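One way the remaining arithmetic of Example 8-4 might be carried out is sketched below in plain Python. It assumes that "larger than what is observed" means the one-sided event that the difference of sample proportions exceeds 0.21 - 0.14 = 0.07; that reading, and the function name `prob_proportion_difference_above`, are assumptions made for illustration.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_proportion_difference_above(observed_diff, p1, p2, n1, n2):
    """Return (z, P(p1_hat - p2_hat > observed_diff)), normal approximation."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (observed_diff - (p1 - p2)) / se
    return z, 1.0 - normal_cdf(z)


# Example 8-4 data; the one-sided reading of the question is an assumption.
z_8_4, prob_8_4 = prob_proportion_difference_above(
    observed_diff=0.21 - 0.14, p1=0.20, p2=0.15, n1=400, n2=420
)
print(round(z_8_4, 2), round(prob_8_4, 3))
```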
Thus, the basic difference which the sample size makes is that while the sampling distributions based on
large samples are approximately normal and the sample variance $S^2$ is an unbiased estimator of
$\sigma^2$, the same does not occur when the sample is small.
It may be appreciated that the small sampling distributions are also known as exact sampling
distributions, as the statistical inferences based on them are not subject to approximation. However, the
assumption of population being normal is the basic qualification underlying the application of small
sampling distributions.
In the category of small sampling distributions, the Binomial and Poisson distributions were already
discussed in lesson 9. Now we will discuss three more important small sampling distributions – the chi-
square, the F and the student t-distribution. The purpose of discussing these distributions at this stage is
limited only to understanding the variables, which define them and their essential properties. The
application of these distributions will be highlighted in the next two lessons.
The small sampling distributions are defined in terms of the concept of degrees of freedom. We will
discuss this concept before proceeding further.
Degrees of Freedom (df)
The concept of degrees of freedom (df) is important for many statistical calculations and probability
distributions. We may define df associated with a sample statistic as the number of observations
contained in a set of sample data which can be freely chosen. It refers to the number of independent
variables which vary freely without being influenced by the restrictions imposed by the sample
statistic(s) to be computed.
Let $x_1, x_2, \ldots, x_n$ be n observations comprising a sample whose mean
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is a value known to us.
Obviously, we are free to assign any value to n-1 observations out of the n observations. Once values are
freely assigned to n-1 observations, the freedom to do the same for the nth observation is lost and its
value is automatically determined as
nth observation = $n\bar{x}$ - sum of n-1 observations = $n\bar{x} - \sum_{i=1}^{n-1} x_i$
As the value of the nth observation must satisfy the restriction $\sum_{i=1}^{n} x_i = n\bar{x}$, we say
that one degree of freedom is lost and the sum $n\bar{x}$ of n observations has n-1 degrees of freedom.
For example, if the sum of four observations is 10, we are free to assign any value to three observations
only, say, $x_1 = 2$, $x_2 = 1$ and $x_3 = 4$. Given these values, the value of the fourth observation is
automatically determined as
$x_4 = \sum_{i=1}^{4} x_i - (x_1 + x_2 + x_3)$
$x_4 = 10 - (2 + 1 + 4)$
$x_4 = 3$
Sampling essentially consists of defining various sample statistics and making use of them in estimating
the corresponding population parameters. In this respect, degrees of freedom may be defined as the
number n of independent observations contained in a sample less the number of parameters m to be
estimated on the basis of that sample information, i.e. df = n-m.
For example, when the population variance $\sigma^2$ is not known, it is to be estimated by a particular
value of its estimator $S^2$, the sample variance. The number of observations in the sample being n,
df = n-m = n-1 because $\sigma^2$ is the only parameter (i.e. m = 1) to be estimated by the sample variance.
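The loss of one degree of freedom can be illustrated with the four-observation example above. The sketch below is a minimal illustration in plain Python; the function name `last_observation` is hypothetical. Once n-1 values and the mean are fixed, the remaining value is forced.

```python
def last_observation(free_values, n, mean):
    """Value forced on the n-th observation once n-1 values and the mean are fixed."""
    if len(free_values) != n - 1:
        raise ValueError("exactly n - 1 observations can be chosen freely")
    # n * mean is the required total; the free values use up part of it.
    return n * mean - sum(free_values)


# Four observations summing to 10 (mean 2.5); three are chosen freely.
x4 = last_observation([2, 1, 4], n=4, mean=10 / 4)
degrees_of_freedom = 4 - 1
print(x4, degrees_of_freedom)
```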
8.2.8 Sampling Distribution of the Variance
We will now discuss the sampling distribution of the variance. We will first introduce the concept of the
sample variance as an unbiased estimator of population variance and then present the chi-square
distribution, which helps us in working out probabilities for the sample variance.
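However, it can be shown that if, while calculating $S^2$, we divide the sum of squares of deviations
from the mean, $\sum_{i=1}^{n}(x_i - \bar{x})^2$, by n, the result will underestimate the population
variance $\sigma^2$ on average:
$E\left[\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}\right] = \frac{n-1}{n}\sigma^2 = \sigma^2 - \frac{\sigma^2}{n}$
i.e. the expected value falls short of $\sigma^2$ by the factor $\frac{n-1}{n}$. To compensate for this
downward bias we divide by n-1, so that
$S^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$
is an unbiased estimator of the population variance $\sigma^2$ and we have:
$E\left[\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}\right] = \sigma^2$
In other words, to get the unbiased estimator of the population variance $\sigma^2$, we divide the sum
$\sum_{i=1}^{n}(x_i - \bar{x})^2$ by n-1 rather than by n.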
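The bias of the divisor n can be verified exactly on a tiny case. The sketch below assumes a hypothetical three-value population and enumerates every sample of size 2 drawn with replacement, using exact fractions so that no rounding is involved. It averages the two candidate estimators over all samples and compares them with the population variance.

```python
from fractions import Fraction
from itertools import product

population = [2, 4, 6]  # hypothetical population, chosen only for illustration
n = 2                   # sample size, sampling with replacement

N = len(population)
mu = Fraction(sum(population), N)
sigma_sq = sum((Fraction(x) - mu) ** 2 for x in population) / N


def sum_sq_dev(sample):
    """Sum of squared deviations of a sample from its own mean."""
    mean = Fraction(sum(sample), len(sample))
    return sum((Fraction(x) - mean) ** 2 for x in sample)


samples = list(product(population, repeat=n))  # all equally likely samples
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(sigma_sq, avg_div_n, avg_div_n_minus_1)
```

In this sketch the average obtained with divisor n equals (n-1)/n times the population variance, while the average obtained with divisor n-1 equals the population variance itself.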
$X = X_1, X_2, \ldots, X_N$
We may draw a random sample of size n comprising $x_1, x_2, \ldots, x_n$ values from this population. As
brought out in section 10.2, each of the n sample values $x_1, x_2, \ldots, x_n$ can be treated as an
independent normal random variable with mean $\mu$ and variance $\sigma^2$. In other words,
$x_i \sim N(\mu, \sigma^2)$ where i = 1, 2, ..., n
Thus each of these n normally distributed random variables may be standardized so that
$Z_i = \frac{x_i - \mu}{\sigma} \sim N(0, 1^2)$ where i = 1, 2, ..., n
A sample statistic U may, then, be defined as
$U = Z_1^2 + Z_2^2 + \cdots + Z_n^2$
$U = \sum_{i=1}^{n} Z_i^2$
$U = \sum_{i=1}^{n} \left(\frac{x_i - \mu}{\sigma}\right)^2$
which will take different values in repeated random sampling. Obviously, U is a random variable. It is
called the chi-square variable, denoted by χ2. Thus the chi-square random variable is the sum of several
independent, squared standard normal random variables, and the chi-square distribution is the
probability distribution of this sum. The chi-square distribution is defined as
$f(\chi^2) = C\, e^{-\frac{1}{2}\chi^2}\, (\chi^2)^{\frac{n}{2} - 1}$ for χ2 ≥ 0
where e is the base of natural logarithm and n denotes the sample size (or the number of independent
normal random variables). C is a constant to be so determined that the total area under the χ2
distribution is unity. χ2 values are determined in terms of degrees of freedom, df = n.
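The role of the constant C can be made concrete. The sketch below assumes the usual normalising value C = 1 / (2^(n/2) Γ(n/2)), which the text does not state explicitly, and checks by simple midpoint-rule integration that the area under the density is 1 and that the mean equals the degrees of freedom. The function names and the integration settings are illustrative choices.

```python
import math


def chi_square_pdf(x, df):
    """Chi-square density C * exp(-x/2) * x**(df/2 - 1) for x > 0."""
    if x < 0:
        return 0.0
    # Assumed normalising constant: C = 1 / (2**(df/2) * Gamma(df/2))
    c = 1.0 / (2.0 ** (df / 2.0) * math.gamma(df / 2.0))
    return c * math.exp(-x / 2.0) * x ** (df / 2.0 - 1.0)


def integrate(func, upper, steps=80000):
    """Midpoint-rule integral of func over (0, upper)."""
    h = upper / steps
    return sum(func((i + 0.5) * h) for i in range(steps)) * h


df = 4
area = integrate(lambda x: chi_square_pdf(x, df), upper=80.0)
mean = integrate(lambda x: x * chi_square_pdf(x, df), upper=80.0)
print(round(area, 4), round(mean, 4))
```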
Properties of χ2 Distribution
1. A χ2 distribution is completely defined by the number of degrees of freedom, df = n. So there are
many χ2 distributions, each with its own df.
2. χ2 is a sample statistic having no corresponding parameter, which makes the χ2 distribution a
non-parametric distribution.
3. As a sum of squares the χ2 random variable cannot be negative and is, therefore, bounded on the left
by zero.
4. Let $\chi_1^2, \chi_2^2, \chi_3^2, \ldots, \chi_k^2$ be k independent χ2 random variables with degrees of
freedom $n_1, n_2, n_3, \ldots, n_k$. Then their sum $\chi_1^2 + \chi_2^2 + \chi_3^2 + \cdots + \chi_k^2$
also possesses a χ2 distribution with df = $n_1 + n_2 + n_3 + \cdots + n_k$.
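$\sum_{i=1}^{n}\left(\frac{x_i - \mu}{\sigma}\right)^2 = \frac{1}{\sigma^2}\sum_{i=1}^{n}\left[(x_i - \bar{x}) + (\bar{x} - \mu)\right]^2$
$= \frac{1}{\sigma^2}\sum_{i=1}^{n}\left[(x_i - \bar{x})^2 + (\bar{x} - \mu)^2 + 2(x_i - \bar{x})(\bar{x} - \mu)\right]$
$= \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \bar{x})^2 + \frac{1}{\sigma^2}\sum_{i=1}^{n}(\bar{x} - \mu)^2 + \frac{2}{\sigma^2}(\bar{x} - \mu)\sum_{i=1}^{n}(x_i - \bar{x})$
$= \frac{(n-1)S^2}{\sigma^2} + \left(\frac{\bar{x} - \mu}{\sigma/\sqrt{n}}\right)^2$
since $\sum_{i=1}^{n}(x_i - \bar{x})^2 = (n-1)S^2$ ; $\sum_{i=1}^{n}(\bar{x} - \mu)^2 = n(\bar{x} - \mu)^2$ and $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$
Now, we know that the LHS of the above equation is a random variable which has a chi-square
distribution with df = n. We also know that if $\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$, then
$\left(\frac{\bar{x} - \mu}{\sigma/\sqrt{n}}\right)^2$ will have a chi-square distribution with df = 1.
Since the two terms on the RHS are independent, $\frac{(n-1)S^2}{\sigma^2}$ will also have a chi-square
distribution, with df = n-1. One degree of freedom is lost because all the deviations are measured from
$\bar{x}$ and not from $\mu$.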
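The algebraic identity behind this result can be checked on any data set. The sketch below uses a hypothetical five-value sample together with assumed values of μ and σ (they are needed only to evaluate the terms, not estimated from the data) and confirms that the left-hand side equals the sum of the two right-hand terms.

```python
import math
import statistics


def chi_square_decomposition(sample, mu, sigma):
    """Return (lhs, (n-1)S^2/sigma^2, ((xbar-mu)/(sigma/sqrt(n)))^2)."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s_sq = statistics.variance(sample)  # sample variance with divisor n - 1
    lhs = sum(((x - mu) / sigma) ** 2 for x in sample)
    term_variance = (n - 1) * s_sq / sigma ** 2
    term_mean = ((x_bar - mu) / (sigma / math.sqrt(n))) ** 2
    return lhs, term_variance, term_mean


# Hypothetical sample; mu = 10 and sigma = 2 are assumed population values.
lhs, term_variance, term_mean = chi_square_decomposition(
    [8, 10, 12, 8, 7], mu=10, sigma=2
)
print(lhs, term_variance, term_mean)
```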
Expected Value and Variance of S2
directly.
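Since $\frac{(n-1)S^2}{\sigma^2}$ has a chi-square distribution with df = n-1
So $E\left[\frac{(n-1)S^2}{\sigma^2}\right] = n-1$
$\frac{n-1}{\sigma^2} E(S^2) = n-1$
$E(S^2) = \sigma^2$
Also $Var\left[\frac{(n-1)S^2}{\sigma^2}\right] = 2(n-1)$
Using the definition of variance, we get
$E\left[\frac{(n-1)S^2}{\sigma^2} - E\left(\frac{(n-1)S^2}{\sigma^2}\right)\right]^2 = 2(n-1)$
or $E\left[\frac{(n-1)S^2}{\sigma^2} - (n-1)\right]^2 = 2(n-1)$
or $E\left[\frac{(n-1)(S^2 - \sigma^2)}{\sigma^2}\right]^2 = 2(n-1)$
or $\frac{(n-1)^2}{\sigma^4} E\left[S^4 + \sigma^4 - 2S^2\sigma^2\right] = 2(n-1)$
or $\frac{(n-1)^2}{\sigma^4} E(S^2 - \sigma^2)^2 = 2(n-1)$
or $E(S^2 - \sigma^2)^2 = \frac{2(n-1)\sigma^4}{(n-1)^2}$
So $Var(S^2) = \frac{2\sigma^4}{n-1}$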
It may be noted that the conditions necessary for the central limit theorem to be operative in the case of
sample variance S2 are quite restrictive. For the sampling distribution of S2 to be approximately normal
requires not only that the parent population is normal, but also that the sample is at least as large as 100.
Example 8-5
In an automated process, a machine fills cans of coffee. The variance of the filling process is known to
be 30. In order to keep the process in control, from time to time regular checks of the variance of the
filling process are made. This is done by randomly sampling filled cans, measuring their amounts and
computing the sample variance. A random sample of 101 cans is selected for the purpose. What is the
probability that the sample variance is between 21.28 and 38.72?
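$\chi^2 = \frac{(n-1)S^2}{\sigma^2}$
Since our population is normal and the sample size is quite large, we can also estimate the required
probability using the normal distribution.
We have $S^2 \sim N\left(\sigma^2, \frac{2\sigma^4}{n-1}\right)$
So $P(21.28 < S^2 < 38.72) = P\left(\frac{21.28 - \sigma^2}{\sqrt{\frac{2\sigma^4}{n-1}}} < Z < \frac{38.72 - \sigma^2}{\sqrt{\frac{2\sigma^4}{n-1}}}\right)$
$= P\left(\frac{21.28 - 30}{\sqrt{\frac{2 \times 30 \times 30}{101 - 1}}} < Z < \frac{38.72 - 30}{\sqrt{\frac{2 \times 30 \times 30}{101 - 1}}}\right)$
$= P\left(\frac{-8.72}{4.24} < Z < \frac{8.72}{4.24}\right)$
= P(-2.06 < Z < 2.06)
= 2 P(0 < Z < 2.06)
= 2 x 0.4803 = 0.9606
which is approximately the same as calculated above using the χ2 distribution.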
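The normal approximation for the sample variance in Example 8-5 can be evaluated directly from the formula Var(S²) = 2σ⁴/(n-1). The sketch below is a minimal illustration in plain Python; the function name is illustrative, and the CDF is computed from the error function instead of a printed table. The standard error here is the square root of 2 x 30 x 30 / 100 = 18.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_sample_variance_between(low, high, sigma_sq, n):
    """Normal approximation to P(low < S^2 < high) for a normal population."""
    se = math.sqrt(2.0 * sigma_sq ** 2 / (n - 1))  # sqrt of Var(S^2)
    prob = normal_cdf((high - sigma_sq) / se) - normal_cdf((low - sigma_sq) / se)
    return se, prob


# Example 8-5: population variance 30, sample of 101 cans
se_8_5, prob_8_5 = prob_sample_variance_between(21.28, 38.72, sigma_sq=30, n=101)
print(round(se_8_5, 2), round(prob_8_5, 2))
```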
8.3 The F-Distribution and Analysis of Variance (ANOVA)
Let us assume two normal populations with variances $\sigma_1^2$ and $\sigma_2^2$ respectively. For a
random sample of size n1 drawn from the first population, we have the chi-square variable
$\chi_1^2 = \frac{(n_1 - 1)S_1^2}{\sigma_1^2}$ which possesses a χ2 distribution with ν1 = n1 - 1 df
Similarly, for a random sample of size n2 drawn from the second population, we have the chi-square
variable
$\chi_2^2 = \frac{(n_2 - 1)S_2^2}{\sigma_2^2}$ which possesses a χ2 distribution with ν2 = n2 - 1 df
A new sample statistic is defined as $F = \frac{\chi_1^2 / \nu_1}{\chi_2^2 / \nu_2}$
It is a random variable known as the F statistic, named in honor of the English statistician Sir Ronald A
Fisher.
Being a random variable it has a probability distribution, which is known as the F distribution.
The F distribution is the distribution of the ratio of two chi-square random variables that are
independent of each other, each of which is divided by its own degrees of freedom.
Properties of F- Distribution
1. The F distribution is defined by two kinds of degrees of freedom: the degrees of freedom of the
numerator, always listed as the first item in the parentheses, and the degrees of freedom of the
denominator, always listed as the second item in the parentheses. So there are a large number of F
distributions, one for each pair of ν1 and ν2. Figure 10-7 shows several F distributions with different
ν1 and ν2.
2. As a ratio of two squared quantities, the F random variable cannot be negative and is, therefore,
bounded on the left by zero.
3. The F distributions defined as $F(\nu_1, \nu_2)$ and as $F(\nu_2, \nu_1)$ are reciprocals of each other,
i.e. $F(\nu_1, \nu_2) = \frac{1}{F(\nu_2, \nu_1)}$
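Because each chi-square variable divided by its degrees of freedom reduces to S²/σ², the F statistic can be computed from two sample variances. The sketch below uses two small hypothetical samples and assumes equal population variances (both set to 1) purely for illustration; the function name `f_statistic` is not from the text. Swapping the samples shows the reciprocal relationship.

```python
import statistics


def f_statistic(sample1, sample2, sigma1_sq=1.0, sigma2_sq=1.0):
    """F = (chi1^2/v1) / (chi2^2/v2) = (S1^2/sigma1^2) / (S2^2/sigma2^2)."""
    s1_sq = statistics.variance(sample1)  # divisor n1 - 1
    s2_sq = statistics.variance(sample2)  # divisor n2 - 1
    f_value = (s1_sq / sigma1_sq) / (s2_sq / sigma2_sq)
    dof = (len(sample1) - 1, len(sample2) - 1)  # (v1, v2)
    return f_value, dof


# Hypothetical samples, used only to illustrate the computation
a = [8, 10, 12, 8, 7]
b = [12, 11, 9, 14, 4]
f_ab, dof_ab = f_statistic(a, b)
f_ba, dof_ba = f_statistic(b, a)
print(round(f_ab, 4), dof_ab, round(f_ba, 4), dof_ba)
```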
Analysis of Variance (ANOVA) is a statistical technique specially designed to test whether the means
of more than two quantitative populations are equal. This technique was developed by R. A. Fisher in
the 1920s. It consists of classifying and cross-classifying statistical results and testing whether the
means of a specified classification differ significantly. ANOVA is a method which separates the variation
ascribable to one set of causes from the variation ascribable to another set of causes. In other words,
analysis of variance is a method of splitting the total variation of the data into constituent parts which
measure different sources of variation, so that the total variation can be attributed to its various
sources or causes. The total variation is split up into the following two components:
a) Variation within the subgroups of the samples.
b) Variation between the subgroups of the samples.
After obtaining the two variations a) and b), they are tested for their significance by the F-test, which
is also known as the Variance Ratio Test.
Objects of Analysis of Variance
The first object of analysis of variance is to obtain a measure of the total variation within the series and
the second object is to find a measure of variation between or among the components. Then the
significance of the difference between the two variations may be tested. In other words, with the
ANOVA technique, we can test the hypothesis that the means of all the components constituting a
population are equal to the mean of the population or that the samples have come from the same
population.
Computation of Test Statistic
The actual analysis of variance is carried out on the basis of the ratio between two variances. The
variance ratio is obtained by dividing the variance between the samples by the variance within the
samples. This ratio forms the test statistic known as the F-Statistic, i.e.,
F = Variance between the samples / Variance within the samples
2. The population from which the samples are selected is normally distributed. If, however, the sample
sizes are large enough, this assumption of normality is not required.
3. Each one of the samples is independent of the other samples.
4. Each one of the populations has the same variance ($\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_n^2$) and
identical means (µ1 = µ2 = µ3 = ... = µn).
5. The effects of the various components are additive.
Uses of ANOVA Table
The table showing the source of variation, the sum of squares, degrees of freedom, mean square
(variance) and the formula for the F-ratio is known as the ANOVA table. It is used to test whether the
means of a number of populations (more than two) are equal. We know that the t-statistic is used for
testing whether two population means are equal. Thus, the analysis of variance test may be taken as an
extension of the t-test for the case of more than two population means.
Classification of Analysis of Variance
The analysis of variance is mainly carried on under the following two classifications:
a) One way Analysis of Variance or One way classification
b) Two way Analysis of Variance or Two way classified Data or Manifold Classification
One way Classification
In one way classification, the data are classified according to only one criterion. In this classification,
the influence of only one attribute or factor is considered. There are two methods of one way analysis
of variance. They are:
a) Direct Method
b) Shortcut Method
Direct Method
The following steps are required under the direct method of one way classification of analysis of
variance (ANOVA):
1. Set Null Hypothesis and Alternate Hypothesis:
Null Hypothesis: The means of the populations from which the k samples are drawn are equal to one
another. The notation for the Null hypothesis will be:
H0: µ1 = µ2 = ... = µk
Alternate Hypothesis: At least two of the means of the populations are unequal, i.e. not all the µi's
are equal. The notation for the Alternate hypothesis will be:
H1: not all of µ1, µ2, ..., µk are equal
2. Calculate Variance Between the Samples: The variance between samples (groups) measures the
difference between the sample mean of each group and the overall mean, weighted by the number
of observations in each group. The variance between samples takes into account the random
variations from observation to observation. It also measures the difference from one group to another.
The sum of squares between samples is denoted by SSC. For calculating the variance between the
samples we take the total of the squares of the deviations of the means of the various samples from
the grand average and divide this total by the degrees of freedom. Thus the steps in calculating the
variance between samples will be:
Calculate the mean of each sample, i.e., $\bar{X}_1$, $\bar{X}_2$, etc.
Calculate the grand average $\bar{\bar{X}}$, pronounced "X double bar". Its value is obtained as follows:
$\bar{\bar{X}} = \frac{\bar{X}_1 + \bar{X}_2 + \bar{X}_3 + \cdots}{N}$, where N is the number of samples
Take the difference between the means of the various samples and the grand average
Square these deviations, weight each by the number of observations in its sample, and obtain the
total, which will give the sum of squares between the samples
Divide the total obtained in the previous step by the degrees of freedom. The degrees of freedom
will be one less than the number of samples, i.e., if there are 4 samples then the degrees of
freedom will be 4 – 1 = 3 or v = k – 1, where k = number of samples.
3. Calculate Variance within the Samples: The variance (or sum of squares) within samples
measures those differences within the samples that are due to chance only. It is denoted by SSE. The
variance within samples (groups) measures variability around the mean of each group. For calculating
the variance within the samples, we take the total of the sum of squares of the deviations of the
various items from the mean values of the respective samples and divide this total by the degrees of
freedom. Thus the steps involved in calculating the variance within the samples will be:
Calculate the mean of each sample, i.e., $\bar{X}_1$, $\bar{X}_2$, $\bar{X}_3$, etc.
Take the deviations of the various items in a sample from the mean value of the respective sample;
Square these deviations and obtain the total, which will give the sum of squares within the samples;
Divide the total obtained in the previous step by the degrees of freedom. The degrees of freedom are
obtained by deducting the number of samples from the total number of items, i.e., v = N – k,
where k refers to the number of samples and N refers to the total number of all the observations;
4. Calculate the Ratio F as Follows:
F = Variance between the samples / Variance within the samples
Symbolically, F = MSC / MSE
The F ratio measures the ratio of the variance between groups to the variance within groups. The
variance between the sample means is the numerator and the variance within the samples is
the denominator. If there is no real difference from group to group, any sample difference will be
explainable by random variation and the variance between groups should be close to the variance
within groups. However, if there is a real difference between the groups, the variance between
groups will be significantly larger than the variance within groups.
5. Compare the F value with Table value: After calculation of the F value, it is compared with the table
value of F for the degrees of freedom at a certain significance level (generally 5%). If the calculated
value of F is greater than the table value, the difference in sample means is significant. In other
words, the samples do not come from the same population. If the calculated value of F is less than
the table value, the difference is not significant and the variation has arisen due to fluctuations of
simple sampling.
The following ANOVA table summarizes the calculations for the sums of squares, together with their
numbers of degrees of freedom and mean squares.
Solution:
Computation of Grand Mean
            Sample 1    Sample 2    Sample 3    Sample 4
            X1          X2          X3          X4
            8           12          18          13
            10          11          12          9
            12          9           16          12
            8           14          6           16
            7           4           8           15
Total =     45          50          60          65
Mean =      9           10          12          13
$\bar{\bar{X}} = \frac{\bar{X}_1 + \bar{X}_2 + \bar{X}_3 + \bar{X}_4}{N} = \frac{9 + 10 + 12 + 13}{4} = 11$
where $\bar{X}_1$, $\bar{X}_2$, etc., represent the mean of each sample and N the number of samples.
To calculate the variation between samples, calculate the square of the deviation of each sample mean
from the grand mean or average. The mean of sample 1 is 9 but the grand mean is 11. So, we will take
the difference between 9 and 11 and square it. Similarly, the differences of the other three sample
means from the grand mean are calculated and squared. Thus we have the following table:
Mean sum of squares between the samples is 50/(4 -1) = 16.7 (because there are four samples and the
degrees of freedom are 4 – 1 = 3).
Here, we find the sum of the squares of the deviations of the various items in a sample from the mean
value of the respective sample. Thus, for the first sample the mean is 9, so we will take deviations from
9; for the second sample the mean is 10, so we will take deviations from 10, and so on. The squared
deviations are given in the following table:
The table value of F for v1 = 3 and v2 = 16 degree of freedom at 5% significance level = 3.24. The
calculated value of F is less than the table value and hence the difference in the mean values of the
sample is not significant, i.e., the samples could have come from the same universe.
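The direct method can be traced on the four samples of this example. The sketch below is a minimal illustration in plain Python (the function name `one_way_anova` is illustrative). It computes the grand mean, the sums of squares between and within samples, the mean squares and the F ratio. The critical value 3.24 is taken from the text rather than computed.

```python
def one_way_anova(groups):
    """Direct-method one-way ANOVA; returns the main intermediate quantities."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    # Grand mean of all observations (equals the mean of the sample means
    # here because every sample has the same size).
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return {
        "grand_mean": grand_mean,
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df": (k - 1, n_total - k),
        "f_ratio": ms_between / ms_within,
    }


samples = [
    [8, 10, 12, 8, 7],
    [12, 11, 9, 14, 4],
    [18, 12, 16, 6, 8],
    [13, 9, 12, 16, 15],
]
result = one_way_anova(samples)
F_TABLE_5_PERCENT = 3.24  # table value for (3, 16) df quoted in the text
significant = result["f_ratio"] > F_TABLE_5_PERCENT
print(result, significant)
```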
Short cut Method
The above method of calculating the variance between the samples and the variance within the samples
is cumbersome. Generally, it is not used in practice because it is a time consuming process.
An easier method, known as the short cut method, is usually followed, which considerably reduces the
computational work. The computations by the short cut method are as follows:
Computation by Short cut Method
   Sample 1           Sample 2           Sample 3           Sample 4
X1       X1^2      X2       X2^2      X3       X3^2      X4       X4^2
8        64        12       144       18       324       13       169
10       100       11       121       12       144       9        81
12       144       9        81        16       256       12       144
8        64        14       196       6        36        16       256
7        49        4        16        8        64        15       225
∑X1 = 45, ∑X1^2 = 421; ∑X2 = 50, ∑X2^2 = 558; ∑X3 = 60, ∑X3^2 = 824; ∑X4 = 65, ∑X4^2 = 875
The sum of all the items of various samples, T = ∑X1 + ∑X2 + ∑X3 + ∑X4
= 45 + 50 + 60 + 65 = 220
Correction factor = T^2 / N = (220)^2 / 20 = 48400/20 = 2420.
Total sum of squares = ∑X1^2 + ∑X2^2 + ∑X3^2 + ∑X4^2 – Correction factor
= 421 + 558 + 824 + 875 – 2420
= 2678 – 2420 = 258 (as above)
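Sum of squares between the samples
$= \frac{(\sum X_1)^2}{n_1} + \frac{(\sum X_2)^2}{n_2} + \frac{(\sum X_3)^2}{n_3} + \frac{(\sum X_4)^2}{n_4}$ – Correction factor
$= \frac{45^2}{5} + \frac{50^2}{5} + \frac{60^2}{5} + \frac{65^2}{5} - 2420$
= 405 + 500 + 720 + 845 – 2420 = 2470 – 2420 = 50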
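The short cut computations for this example can be sketched as follows. The code assumes the standard short cut formula for the sum of squares between samples (squared sample totals divided by sample sizes, less the correction factor) and obtains the within-samples sum of squares by subtraction; the function name `shortcut_sums` is illustrative.

```python
def shortcut_sums(groups):
    """Short cut one-way ANOVA sums: correction factor and sums of squares."""
    total = sum(sum(g) for g in groups)       # T, sum of all items
    n_total = sum(len(g) for g in groups)     # N, number of all items
    correction = total ** 2 / n_total         # T^2 / N
    ss_total = sum(x * x for g in groups for x in g) - correction
    ss_between = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    ss_within = ss_total - ss_between         # obtained by subtraction
    return correction, ss_total, ss_between, ss_within


samples = [
    [8, 10, 12, 8, 7],
    [12, 11, 9, 14, 4],
    [18, 12, 16, 6, 8],
    [13, 9, 12, 16, 15],
]
correction, ss_total, ss_between, ss_within = shortcut_sums(samples)
print(correction, ss_total, ss_between, ss_within)
```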
The total sum of squares, sum of squares for between columns and sum of squares for between rows are
obtained in the same way as before.
Residual or error sum of square = Total sum of squares – Sum of squares between columns – Sum of
squares between rows.
In which v1 = (c – 1) and v2 = (c – 1) (r – 1)
Where, v1 = (r – 1) and v2 = (c – 1) (r – 1)
It should be carefully noted that v1 is not the same in both cases: in one case v1 = (c – 1) and in the
other case v1 = (r – 1).
The calculated values of F are compared with the table values. If the calculated value of F is greater than
the table value at the pre-assigned level of significance, the null hypothesis is rejected; otherwise it is
accepted.
It would be clear from the above that in problems involving two way classification, the residual is the
measuring rod for testing significance. It represents the magnitude of variation due to forces called
'chance'. The following examples illustrate the procedure:
Example 10.7
Degree of freedom = 4 – 1 = 3
Sum of squares between plot of land:
Degree of freedom = (3 – 1) = 2
Total sum of squares
=
The following ANOVA table summarizes all the calculations related to the two way classification analysis
of variance:
Source of variation    Sum of Squares    Degrees of Freedom    Mean Square    F Ratio
Between Samples        42                3                     MSC = 14       MSC/MSE = 14/10.67 = 1.312
Between Rows           26                2                     MSR = 13       MSR/MSE = 13/10.67 = 1.218
Residual or error      64                6                     MSE = 10.67
Total                  132               11
For (3, 6) d.f. F(0.05) = 4.76 and for (2, 6) d.f. F(0.05) = 5.14. The calculated values of F are less than
the table values at the 5% level of significance. The hypothesis is accepted. Hence, there is no
significant difference between treatments and plots of land.
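The entries of the two way ANOVA table follow mechanically from the sums of squares and the numbers of columns and rows. The sketch below takes the sums of squares from the table above as given inputs (it does not recompute them from raw data, which the text does not show) and uses an illustrative function name. The unrounded MSE gives F ratios that differ from the table's values in the last digit.

```python
def two_way_f_ratios(ss_columns, ss_rows, ss_error, c, r):
    """Mean squares and F ratios for a two way classification."""
    df_columns, df_rows = c - 1, r - 1
    df_error = df_columns * df_rows           # (c - 1)(r - 1)
    msc = ss_columns / df_columns
    msr = ss_rows / df_rows
    mse = ss_error / df_error
    return {
        "df": (df_columns, df_rows, df_error),
        "MSC": msc,
        "MSR": msr,
        "MSE": mse,
        "F_columns": msc / mse,
        "F_rows": msr / mse,
    }


# Sums of squares from the table above: 4 columns (samples), 3 rows (plots)
table = two_way_f_ratios(ss_columns=42, ss_rows=26, ss_error=64, c=4, r=3)
# Critical values quoted in the text for (3, 6) and (2, 6) degrees of freedom
reject_columns = table["F_columns"] > 4.76
reject_rows = table["F_rows"] > 5.14
print(table, reject_columns, reject_rows)
```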
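And, $U = \sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{\sigma}\right)^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{\sigma^2} \sim \chi^2$ (n-1 df) where i = 1, 2, ..., n
$T = \frac{\dfrac{x_i - \mu}{\sigma}}{\sqrt{\dfrac{1}{n-1} \cdot \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{\sigma^2}}}$
$T = \frac{x_i - \mu}{\sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}}$
$T = \frac{x_i - \mu}{S}$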
This statistic - the ratio of the standard normal variable Z to the square root of the χ2 variable divided
by its degrees of freedom - is known as the 't' statistic or student 't' statistic, named after the pen name
of Sir W S Gosset, who discovered the distribution of the quantity.
The random variable $\frac{x_i - \mu}{S}$ follows the t-distribution with n-1 degrees of freedom.
$\frac{x_i - \mu}{S} \sim t$ (n-1 df) where i = 1, 2, ..., n
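So $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1^2)$
Putting $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ for $\frac{x_i - \mu}{\sigma}$ in T, we get
$T = \frac{\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}}{\sqrt{\dfrac{1}{n-1} \cdot \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{\sigma^2}}}$
or $T = \frac{\bar{X} - \mu}{\sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n(n-1)}}}$
or $T = \frac{\bar{X} - \mu}{\dfrac{1}{\sqrt{n}}\sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}}$
or $T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$
When defined as above, T again follows the t-distribution with n-1 degrees of freedom.
$\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t$ (n-1 df)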
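The final form of the statistic is straightforward to compute. The following sketch uses a hypothetical sample and a hypothesised population mean, both chosen only for illustration, and the Python standard library's sample standard deviation (divisor n-1); the function name `t_statistic` is illustrative.

```python
import math
import statistics


def t_statistic(sample, mu):
    """Return (T, df) with T = (xbar - mu) / (S / sqrt(n)) and df = n - 1."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation, divisor n - 1
    return (x_bar - mu) / (s / math.sqrt(n)), n - 1


# Hypothetical data: eight observations, hypothesised mean mu = 4
t_value, t_df = t_statistic([2, 4, 4, 4, 5, 5, 7, 9], mu=4)
print(round(t_value, 4), t_df)
```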
Properties of t- Distribution
1. The t-distribution like Z distribution, is unimodal, symmetric about mean 0, and the t- variable
2. The t-distribution is defined by the degrees of freedom v = n-1, the df associated with the
distribution are the df associated with the sample standard deviation.
3. The t-distribution has no mean for n = 2 i.e. for v = 1 and no variance for n ≤ 3 i.e. for v ≤ 2.
However, for v > 1 the mean, and for v > 2 the variance, is given as E(T) = 0; Var(T) = $\frac{v}{v-2}$.
4. The variance of the t-distribution, $\frac{v}{v-2}$, must always be greater than 1, so it is more
variable than the Z distribution, which has variance 1. This follows from the fact that while Z values
vary from sample to sample owing to the change in $\bar{X}$ alone, the variation in T values is due to
changes in both $\bar{X}$ and S.
5. The variance of the t-distribution approaches 1 as the sample size n tends to increase. In general, for
n ≥ 30, the variance of the t-distribution is approximately the same as that of the Z distribution. In
other words, the t-distribution is approximately normal for n ≥ 30.
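The behaviour of the variance v/(v-2) is easy to tabulate. The sketch below, in plain Python with an illustrative function name, evaluates the formula for a few sample sizes and shows it falling towards 1 as n grows.

```python
def t_variance(v):
    """Variance of the t-distribution with v degrees of freedom (v > 2)."""
    if v <= 2:
        raise ValueError("the variance is not defined for v <= 2")
    return v / (v - 2)


# Variance for selected sample sizes n, with v = n - 1
sample_sizes = [4, 10, 30, 100]
variances = [t_variance(n - 1) for n in sample_sizes]
for n, var in zip(sample_sizes, variances):
    print(n, round(var, 4))
```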
8.5 CHECK YOUR PROGRESS
1. The variance of t-distribution approaches 1 as the sample size n tends
to………………………………..
2. As a ratio of two squared quantities, the F random variable cannot be negative and is,
therefore,…………………………..on the left by zero.
3. The F distributions defined as $F(v_1, v_2)$ and as $F(v_2, v_1)$ are…………………..of each other.
4. The t-distribution has no mean for n = 2 i.e. for v = 1 and no……………….for n ≤ 3 i.e. for v ≤ 2.
5. When sampling without replacement from a finite population, the probability distribution of the
second random variable depends on what has been the………………… of the first pick and so on.
8.6 SUMMARY
A population parameter is any number computed (or estimated) for the entire population viz. population
mean, population median, population proportion, population variance and so on. A population parameter
is unknown but fixed, and its value is to be estimated from the sample statistic, which is known but
random. A sample statistic is any number computed from our sample data viz. sample mean, sample
median, sample proportion, sample variance and so on. The sampling distribution of a statistic is the
probability distribution of all possible values the statistic may take when computed from random
samples of the same size drawn from a specified population. This lesson discussed the sampling
distributions of only the commonly used sample statistics like sample mean, sample proportion, sample
variance etc., which have a role in making inferences about the population. The F distribution is the
distribution of the ratio of two chi-square random variables that are independent of each other, each of
which is divided by its own degrees of freedom. The ratio of the standard normal variable Z to the square
root of the χ2 variable divided by its degrees of freedom is known as the 't' statistic or student 't'
statistic, named after the pen name of Sir W S Gosset, who discovered the distribution of the quantity.
8.7 KEYWORDS
Sampling Distributions: The sampling distribution of a statistic is the probability distribution
of all possible values the statistic may take when computed from random samples of the same
size drawn from a specified population.
T statistic: The ratio of the standard normal variable Z to the square root of the χ2 variable
divided by its degrees of freedom is known as the 't' statistic or student 't'.
F distribution: It is the distribution of the ratio of two chi-square random variables that are
independent of each other, each of which is divided by its own degrees of freedom.
Central Limit Theorem: When sampling is done from a population with mean μ and standard
deviation σ, the sampling distribution of the sample mean $\bar{X}$ tends to a normal distribution with
mean μ and standard deviation $\sigma/\sqrt{n}$ as the sample size n increases.
Sampling distribution of difference of sample means: The sampling distribution of $\bar{X}_1 - \bar{X}_2$
is the probability distribution of all possible values the random variable $\bar{X}_1 - \bar{X}_2$ may take
when independent samples of size n1 and n2 are taken from two specified populations.
Degree of freedom: It refers to the number of independent variables which vary freely without being
influenced by the restrictions imposed by the sample statistic(s) to be computed.
8.8 SELF-ASSESSMENT TEST
1. What is a sampling distribution, and what are the uses of sampling distributions?
2. How does the size of population and the kind of random sampling determine the shape of the
sampling distributions?
3. (a) A sample of size n = 5 is selected from a population. Under what conditions is the sampling
distribution of $\bar{X}$ normal?
(b) Suppose the population mean is μ = 105 and the population standard deviation is 20. What are
the expected value and the standard deviation of $\bar{X}$?
4. What is the most significant aspect of the central limit theorem? Discuss the practical utility of
central limit theorem in applied statistics.
5. Under what conditions is the central limit theorem most useful in sampling for making statistical
inferences about the population mean?
6. If the population mean is 1,247, the population variance is 10,000, and the sample size is 100, what
is the probability that $\bar{X}$ will be less than 1,230?
7. When sampling is from a population with standard deviation σ = 55, using a sample of size n =
150, what is the probability that $\bar{X}$ will be at least 8 units away from the population mean μ?
8. The Colosseum, once the most popular monument in Rome, dates from about AD 70. Since then,
earthquakes have caused considerable damage to the huge structure, and engineers are currently
trying to make sure the building will survive future shocks. The Colosseum can be divided into
several thousand small sections. Suppose that the average section can withstand a quake measuring
3.4 on the Richter scale with a standard deviation of 1.5. A random sample of 100 sections is
selected and tested for the maximum earthquake force they can withstand. What is the probability
that the average section in the sample can withstand an earthquake measuring at least 3.6 on the
Richter scale?
9. On June 10, 1997, the average price per share on the Big Board Composite Index in New York rose
15 cents. Assume the population standard deviation that day was 5 cents. If a random sample of 50
stocks is selected that day, what is the probability that the average price change in this sample was a
rise between 14 and 16 cents?
10. An economist wishes to estimate the average family income in a certain population. The population
standard deviation is known to be Rs 4,000, and the economist uses a random sample of size n =
225. What is the probability that the sample mean will fall within Rs 750 of the population mean?
11. When sampling is done from a population with population proportion p = 0.2, using a sample size n
= 15, what is the sampling distribution of the sample proportion $\bar{p}$? Is it reasonable to use a normal
approximation for this sampling distribution? Explain.
12. When sampling is done for the proportion of defective items in a large shipment, where the
population proportion is 0.18 and the sample size is 200, what is the probability that the sample
proportion will be at least 0.20?
13. A study of the investment industry claims that 55% of all mutual funds outperformed the stock
market as a whole last year. An analyst wants to test this claim and obtains a random sample of 280
mutual funds. The analyst finds that only 108 of the funds outperformed the market during the year.
Determine the probability that another random sample would lead to a sample proportion as low as
or lower than the one obtained by the analyst, assuming the proportion of all mutual funds that
outperformed the market is indeed 0.55.
14. In recent years, convertible sport coupes have become very popular in Japan. Toyota is currently
shipping Celicas to Los Angeles, where a customizer does a roof lift and ships them back to Japan.
Suppose that 25% of all Japanese in a given income and lifestyle category are interested in buying
Celica convertibles. A random sample of 100 Japanese consumers in the category of interest is to
be selected. What is the probability that at least 20% of those in the sample will express an interest
in a Celica convertible?
15. What are the limitations of small samples?
16. What do you understand by small sampling distributions? Why are the small sampling distributions
called exact distributions?
17. What do you understand by the concept of degrees of freedom?
18. Define the $\chi^2$ statistic. What are the important properties of the $\chi^2$ distribution?
19. Define the F statistic. What are important properties of F distribution?
20. Define the t statistic. What are important properties of t-distribution? How does t statistic differ
from Z statistic?
8.9 ANSWERS TO CHECK YOUR PROGRESS
1. Increase
2. Bounded
3. Reciprocal
4. Variance
5. Outcome
STRUCTURE
9.1 Learning Objectives
9.2 Introduction
9.2.1 Types of Estimates
9.2.2 Criteria of a good estimator
9.2.3 Method of Maximum Likelihood
9.2.4 Point Estimation
9.2.5 Interval Estimation
9.3 Sample size Determination
9.4 Check your progress
9.5 Summary
9.6 Keywords
9.7 Self-Assessment Test
9.8 Answers to check your progress
9.9 References/Suggested Reading
9.2 INTRODUCTION
The sampling process is used to draw statistical inference about the characteristics of a population or
process of interest. On many occasions we do not have enough information to calculate an exact value
of population parameters (such as $\mu$, $\sigma$, and p) and therefore make the best estimate of this value from
the corresponding sample statistics (such as $\bar{x}$, s, and $\bar{p}$). The need to use the sample statistic to draw
conclusions about the population characteristic is one of the fundamental applications of statistical
inference in business and economics. A few applications of statistical estimation are given below:
A production manager needs to determine the proportion of items being manufactured that do not
match with quality standards.
A mobile phone service company may be interested to know the average length of a long distance
telephone call and its standard deviation.
A bank needs to understand consumer awareness of its services and credit schemes.
Any service centre needs to determine the average amount of time a customer spends in queue.
In all such cases, a decision-maker needs to examine the following two concepts that are useful for
drawing statistical inference about an unknown population or process parameters based upon random
samples:
(i) Estimation: using a sample statistic to estimate an unknown parameter value
(ii) Hypothesis testing: testing a claim or belief about an unknown parameter value.
In this lesson we shall discuss methods to estimate unknown population parameters and then to
determine the range of values (confidence interval) likely to contain the parameter value.
9.2.1 TYPES OF ESTIMATES
Let us first know the concept of 'estimate' as used in Statistics. According to some dictionaries, an
estimate is a valuation based on opinion or roughly made from imperfect or incomplete data. This
definition may apply, for example, when an individual has an opinion about the competence of one
of his colleagues. But, in Statistics the term estimate is not used in this sense. In Statistics too the
estimates are made when the information available is incomplete or imperfect. However, such estimates
are made only when they are based on sound judgement or experience and when the samples are
scientifically selected. There are two types of estimates that we can make about a population: a point
estimate and an interval estimate.
A point estimate is a single number, which is used to estimate an unknown population parameter.
Although a point estimate may be the most common way of expressing an estimate, it suffers from a
major limitation since it fails to indicate how close it is to the quantity it is supposed to estimate. In
other words, a point estimate does not give any idea about the reliability or precision of the method of
estimation used. For instance, if someone claims that 40 percent of all children in a certain town do not
go to the school and are devoid of education, it would not be very helpful if this claim is based on a
small number of households, say, 20. However, as the number of households interviewed for this
purpose increases from 20 to 100, 500 or even 5,000, the claim that 40 percent of children have no
school education would become more and more meaningful and reliable. This makes it clear that a point
estimate should always be accompanied by some relevant information so that it is possible to judge how
far it is reliable.
The second type of estimate is known as the interval estimate. It is a range of values used to estimate an
unknown population parameter. In case of an interval estimate, the error is indicated in two ways: first
by the extent of its range; and second, by the probability of the true population parameter lying within
that range. Taking our previous example of 40 percent children not having a school education, the
statistician may say that actual percentage of such children in that town may lie between 35 percent and
45 percent. Thus, he will have a better idea of the reliability of such an estimate as compared to the
point estimate of 40 percent.
For example, the sample mean

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

is a point estimator of the population mean $\mu$. The value obtained by the estimator is known as an
estimate. Many different statistics can be used to estimate the same parameter. For example, we may
use the sample mean or the sample median or even the range to estimate the population mean. The
question here is: how can we evaluate the properties of these estimators, compare them with one another,
and finally decide which is the 'best'? The answer to this question is possible only when we have
certain criteria that a good estimator must satisfy. These criteria are briefly discussed below.
There are four criteria by which we can evaluate the quality of a statistic as an estimator. These are:
unbiasedness, efficiency, consistency and sufficiency.
Unbiasedness
This is a very important property that an estimator should possess. If we take all possible samples of the
same size from a population and calculate their means, the mean $\mu_{\bar{x}}$ of all these means will be equal to
the mean $\mu$ of the population. This means that the sample mean $\bar{x}$ is an unbiased estimator of the
population mean $\mu$. When the expected value (or mean) of a sample statistic is equal to the value of the
corresponding population parameter, the sample statistic is said to be an unbiased estimator.
Suppose we take the smallest sample observation as an estimator of the population mean $\mu$. It can be
easily shown that this estimator is biased. Since the smallest observation in a sample can never exceed
the sample mean (and is smaller whenever the observations differ), its expected value must be less than
$\mu$. Symbolically, $E(X_s) < \mu$, where $X_s$ stands for the smallest item and
E stands for the expected value. Thus, this estimator is biased downwards. The extent of bias is the
difference between the expected value of the estimator and the value of the parameter. In this case, bias
is equal to $E(X_s) - \mu$. In contrast, the bias for the sample mean $\bar{x}$ is zero.
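The contrast between an unbiased and a biased estimator can be checked numerically. The sketch below is only an illustration under stated assumptions: the population (the integers 1 to 100), the sample size of 10 and the helper name `simulate_bias` are all hypothetical and do not come from the lesson. It draws many random samples and averages both the sample mean and the smallest observation.

```python
import random
import statistics

def simulate_bias(population, n, trials, seed=0):
    """Return (average of sample means, average of sample minima) over many samples."""
    rng = random.Random(seed)
    means, minima = [], []
    for _ in range(trials):
        sample = rng.choices(population, k=n)   # simple random sampling with replacement
        means.append(statistics.fmean(sample))
        minima.append(min(sample))
    return statistics.fmean(means), statistics.fmean(minima)

population = list(range(1, 101))                # hypothetical population
mu = statistics.fmean(population)               # true population mean
avg_mean, avg_min = simulate_bias(population, n=10, trials=20000)

print(f"population mean           : {mu:.2f}")
print(f"average of sample means   : {avg_mean:.2f}")
print(f"average of sample minima  : {avg_min:.2f}")
print(f"estimated bias of minimum : {avg_min - mu:.2f}")
```

The average of the sample means should settle close to the population mean, while the average of the sample minima stays well below it, which is the downward bias described above.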
Consistency
Another important characteristic that an estimator should possess is consistency. Let us take the case of
the standard deviation of the sampling distribution of $\bar{x}$. The standard deviation of the sampling
distribution of the sample mean is computed by the following formula:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

The formula states that the standard deviation of the sampling distribution of $\bar{x}$ decreases as the
sample size increases and vice versa. When the sample size n increases, the population standard
deviation $\sigma$ is divided by a larger denominator. This reduces the standard error $\sigma_{\bar{x}}$,
so the sample mean clusters more and more closely around $\mu$, which is what consistency requires.
Let us take an example.
Illustration 9.1: A company has 4,000 employees whose average monthly wage comes to
Rs.4,800 with a standard deviation of Rs.1,200. Let $\bar{x}$ be the mean monthly wage for a
random sample of certain employees selected from this company. Find the mean and
standard deviation of $\bar{x}$ for a sample size of (a) 81, (b) 100 and (c) 180.
Solution
From the given information, for the population of all employees, N = 4,000, $\mu$ = Rs.4,800 and $\sigma$ = Rs.1,200.
(a) The mean of the sampling distribution of $\bar{x}$ is $\mu_{\bar{x}} = \mu$ = Rs.4,800.
As n = 81 and N = 4,000, we have n/N = 81/4,000 = 0.02. As this value is less than 0.05, the standard deviation
of $\bar{x}$ is obtained by using the formula $\sigma_{\bar{x}} = \sigma/\sqrt{n}$. Substituting the values,

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{1{,}200}{\sqrt{81}} = \frac{1{,}200}{9} = \text{Rs.}133.33$$

(b) In this case, n = 100 and n/N = 100/4,000 = 0.025, which is also less than 0.05. The mean and
the standard deviation are $\mu_{\bar{x}} = \mu$ = Rs.4,800 and

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{1{,}200}{\sqrt{100}} = \frac{1{,}200}{10} = \text{Rs.}120$$

(c) In this case, n = 180 and n/N = 180/4,000 = 0.045, which again is less than 0.05. The mean and
the standard deviation are $\mu_{\bar{x}} = \mu$ = Rs.4,800 and

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{1{,}200}{\sqrt{180}} = \frac{1{,}200}{13.42} = \text{Rs.}89.44$$

From the above three sets of calculation, it becomes clear that the mean of the sampling distribution of
$\bar{x}$ is always equal to the mean of the population regardless of the sample size. But, in case of the
standard deviation, we find the change. In the given example, we find that the standard deviation of $\bar{x}$
decreased from Rs.133.33 to Rs.120 and then to Rs.89.44 as the sample size increased from 81 to 100
and then to 180.
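The arithmetic of Illustration 9.1 can be reproduced with a few lines of Python. This is a minimal sketch: the function name `standard_error` is hypothetical, and the rule of applying the finite population correction only when n/N exceeds 0.05 is taken from the lesson's own working.

```python
import math

def standard_error(sigma, n, N=None):
    """Standard error of the sample mean, sigma / sqrt(n).

    When a population size N is given and n/N exceeds 0.05, the finite
    population correction sqrt((N - n) / (N - 1)) is applied as well.
    """
    se = sigma / math.sqrt(n)
    if N is not None and n / N > 0.05:
        se *= math.sqrt((N - n) / (N - 1))
    return se

# Figures of Illustration 9.1: N = 4,000 and sigma = Rs.1,200
for n in (81, 100, 180):
    print(f"n = {n:3d}  n/N = {n / 4000:.3f}  standard error = Rs.{standard_error(1200, n, N=4000):.2f}")
```

For these three sample sizes the sampling fraction stays below 0.05, so no correction is applied and the standard error falls steadily as n grows.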
Efficiency
Another desirable property of a good estimator is that it should be efficient. Efficiency is measured in
terms of size of the standard error of the statistic. Since an estimator is a random variable, it is
necessarily characterised by a certain amount of variability. This means that some estimates may be
more variable than others. Just as bias is related to the expected value of the estimator, so efficiency can
be defined in terms of the variance. In large samples, for example, the variance of the sample mean is
$V(\bar{x}) = \sigma^2/n$. As the sample size n increases, the variance of the sample mean, $V(\bar{x})$, becomes smaller, so
the estimator becomes more efficient. This criterion, when applied to large samples, gives better
estimates as compared to the small ones.
The efficiency of one estimator in relation to another estimator can be judged by comparing their
sampling variances. Thus, efficiency relates to the size of the standard error. Given the same sample
size, the statistic that has a smaller standard error is preferable as it is efficient in relation to another
statistic that has a larger standard error. The sampling distribution of the mean and the median have the
same mean, that is, the population mean. However, the variance of the sampling distribution of the
means is smaller than the variance of the sampling distribution of the medians. As such, the sample
mean is an efficient estimator of the population mean, while the sample median is an inefficient
estimator.
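The relative efficiency of the mean and the median can be illustrated by simulation. The sketch below assumes a standard normal population, a sample size of 25 and 5,000 repeated samples; these settings and the function name `sampling_variances` are illustrative choices, not part of the lesson.

```python
import random
import statistics

def sampling_variances(n, trials, seed=1):
    """Return (variance of sample means, variance of sample medians)
    for repeated samples of size n from a standard normal population."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    return statistics.pvariance(means), statistics.pvariance(medians)

var_mean, var_median = sampling_variances(n=25, trials=5000)
print(f"variance of sample means  : {var_mean:.4f}")
print(f"variance of sample medians: {var_median:.4f}")
print(f"ratio (median / mean)     : {var_median / var_mean:.2f}")
```

Under these assumptions the medians vary more from sample to sample than the means do, so for the same sample size the mean has the smaller standard error.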
Sufficiency
The fourth property of a good estimator is that it should be sufficient. A sufficient statistic utilises all
the information a sample contains about the parameter to be estimated. The sample mean $\bar{x}$, for example, is a sufficient
estimator of the population mean $\mu$. It implies that no other estimator of $\mu$, such as the sample median,
can provide any additional information about the parameter $\mu$. Likewise, we can say that the sample
proportion $\bar{p}$ is a sufficient estimator of the population proportion p. Having looked into properties of a good estimator briefly, a pertinent question arises: how
can we find estimators with these desirable properties? This brings us to the method of maximum
likelihood.
The maximum likelihood method provides estimators with the desirable properties such as efficiency,
consistency and sufficiency, which we have just discussed. It does not, however, guarantee an unbiased
estimator. Let us take an example to explain this method.
Example: Suppose we want to estimate the average grade of a large number of students. A random
sample of size n = 64 is taken and the sample mean x is found to be 90 marks. Now, the assumption
on which we have to base our reasoning is that the random sample of n = 64 is representative of the
population. We saw how samples that were similar to the population had greater probability of being
selected.
Let us now reverse this reasoning as follows: we have before us a random sample of size n = 64 and $\bar{x}$ =
90 marks. From which population did it most probably come: a population with $\mu$ = 85, 90 or 95?
According to our earlier approach, we would think that it most probably came from a population with
$\mu$ = 90 marks. Thus, it can be concluded that the population mean $\mu$, based on our sample, is most likely
to be 90 marks.
A point worth noting is that the population mean $\mu$ is either 90 or not; it has only one value. Hence, we
have used the term likely instead of probably.
This technique to find the estimators was first used and developed by Sir R.A. Fisher in 1922, who
called it the maximum likelihood method.
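The reasoning of the example can be made concrete by computing the likelihood of the observed sample mean under each candidate population mean. The sketch below is an illustration only: the lesson gives no population standard deviation, so the value sigma = 16 is an assumption, and the function name `likelihood_of_sample_mean` is hypothetical. The sample mean is treated as normally distributed with standard error sigma divided by the square root of n.

```python
from statistics import NormalDist

def likelihood_of_sample_mean(x_bar, mu, sigma, n):
    """Density of the observed sample mean when the population mean is mu."""
    standard_error = sigma / n ** 0.5
    return NormalDist(mu, standard_error).pdf(x_bar)

x_bar, n = 90, 64          # figures from the example
sigma = 16                 # assumed value, not given in the lesson

likelihoods = {mu: likelihood_of_sample_mean(x_bar, mu, sigma, n) for mu in (85, 90, 95)}
for mu, value in likelihoods.items():
    print(f"mu = {mu}: likelihood = {value:.6f}")

best = max(likelihoods, key=likelihoods.get)
print("candidate with the largest likelihood:", best)
```

Among the three candidates the likelihood is largest at 90, the value equal to the sample mean, and it is equally small at 85 and 95 because the normal curve is symmetric.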
In point estimation, a single sample statistic (such as $\bar{x}$, s, and $\bar{p}$) is calculated from the sample to
provide a best estimate of the true value of the corresponding population parameter (such as $\mu$, $\sigma$ and
p). Such a single relevant statistic is termed a point estimator, and the value of the statistic is termed
a point estimate. For example, we may calculate that 10 per cent of the items in a random sample taken
from a day's production are defective. The result '10 per cent' is a point estimate of the percentage of
items in the whole lot that are defective. Thus, until the next sample of items is drawn and
examined, we may proceed with manufacturing on the assumption that any day's production contains 10
per cent defective items.
Generally, a point estimate does not provide information about 'how close the estimate is' to the
population parameter unless accompanied by a statement of possible sampling errors involved based on
the sampling distribution of the statistic. It is therefore important to know the precision of an estimate
before relying on it to make a decision. Thus, decision-makers prefer to use an interval estimate that is
likely to contain the population parameter value. However, it is also important to state 'how confident'
the decision-maker is that the interval estimate actually contains the parameter value. Hence an interval estimate of a
population parameter is a confidence interval with a statement of confidence that the interval
contains the parameter value.
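The confidence interval estimate of a population parameter is obtained by applying the formula:

Confidence interval = Point estimate ± Margin of error

Suppose the population mean $\mu$ is unknown and the true population standard deviation $\sigma$ is known.
Then for a large sample size (n ≥ 30), the interval estimation of population mean $\mu$ is given by

$$\bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}} \quad\text{or}\quad \bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

$$\text{or}\quad \bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

where $z_{\alpha/2}$ is the z-value representing an area $\alpha/2$ in the right and left tails of the standard normal
probability distribution.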
In general, a 95 per cent confidence interval estimate implies that if all possible
samples of the same size were drawn, then 95 per cent of them would include the
true population mean somewhere within the interval around their sample mean and
only 5 per cent of them would not. The values of $z_{\alpha/2}$ for the most commonly-used
as well as the other confidence levels can be seen from the standard normal probability
table as shown in Table 11.1.
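The repeated-sampling interpretation of a confidence level can be checked by simulation. In the sketch below the population (normal with mean 100 and standard deviation 15), the sample size of 36, the number of trials and the function name `coverage` are all hypothetical choices made for illustration. The critical value is obtained from the standard library's `NormalDist.inv_cdf` rather than read from a printed table.

```python
import math
import random
from statistics import NormalDist, fmean

def coverage(mu, sigma, n, confidence, trials, seed=42):
    """Share of simulated samples whose z-interval contains the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95 per cent
    margin = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        x_bar = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials

rate = coverage(mu=100, sigma=15, n=36, confidence=0.95, trials=4000)
print(f"share of intervals containing the true mean: {rate:.3f}")
```

The share printed should be close to 0.95, with small differences from run to run of the kind expected when only a finite number of samples is simulated.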
Illustration 9.2: The average monthly electricity consumption for a sample of 100
families is 1250 units. Assuming the standard deviation of electric consumption of
all families is 150 units, construct a 95 per cent confidence interval estimate of the
actual mean electric consumption.
Solution: The information given is: $\bar{x}$ = 1250, $\sigma$ = 150, n = 100 and confidence level
(1 − $\alpha$) = 95 per cent. Using the 'Standard Normal Curve' we find that half of 0.95
(an area of 0.475 on each side of the mean) yields a confidence coefficient $z_{\alpha/2}$ = 1.96.
Thus the confidence limits with $z_{\alpha/2}$ = ± 1.96 for 95 per cent confidence are given by

$$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 1250 \pm 1.96\,\frac{150}{\sqrt{100}} = 1250 \pm 29.40 \text{ units}$$

Thus for 95 per cent level of confidence, the population mean $\mu$ is likely to fall between 1220.60 units
and 1279.40 units, that is, $1220.60 \le \mu \le 1279.40$.
Illustration 9.3: The quality control manager at a factory manufacturing light bulbs is interested to
estimate the average life of a large shipment of light bulbs. The standard deviation is known to be 100
hours. A random sample of 50 light bulbs gave a sample average life of 350 hours.
(a) Setup a 95 per cent confidence interval estimate of the true average life of light bulbs in the
shipment.
(b) Does the population of light bulb life have to be normally distributed? Explain.
Solution: The following information is given:
$\bar{x}$ = 350, $\sigma$ = 100, n = 50, and confidence level (1 − $\alpha$) = 95 per cent.
(a) Using the 'Standard Normal Curve', we have $z_{\alpha/2}$ = ± 1.96 for 95 per cent confidence level.
Thus the confidence limits are given by

$$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 350 \pm 1.96\,\frac{100}{\sqrt{50}} = 350 \pm 27.72$$

Hence for 95 per cent level of confidence the population mean $\mu$ is likely to fall between 322.28 hours
and 377.72 hours, that is, $322.28 \le \mu \le 377.72$.
(b) No. Since $\sigma$ is known and n = 50 is large, from the central limit theorem we may assume that $\bar{x}$ is
approximately normally distributed.
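The interval for a mean with known population standard deviation can be computed as follows. This is a minimal sketch using only the Python standard library; the function name `z_interval` is hypothetical. The critical value comes from `NormalDist.inv_cdf`, which gives about 1.96 for 95 per cent confidence, so results agree with table-based working to the usual rounding.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for a population mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Illustration 9.2: electricity consumption (units)
low, high = z_interval(x_bar=1250, sigma=150, n=100)
print(f"Illustration 9.2: {low:.2f} to {high:.2f} units")

# Illustration 9.3: life of light bulbs (hours)
low_b, high_b = z_interval(x_bar=350, sigma=100, n=50)
print(f"Illustration 9.3: {low_b:.2f} to {high_b:.2f} hours")
```

Passing a different `confidence` value, for example 0.99, widens the interval because a larger critical value is used.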
9.2.5.2 Interval Estimation for Difference of Two Means
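If all possible samples of large sizes n1 and n2 are drawn from two different populations, then the sampling
distribution of the difference between the two sample means $\bar{x}_1$ and $\bar{x}_2$ is approximately normal with mean $(\mu_1 - \mu_2)$
and standard deviation:

$$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

For a desired confidence level, the confidence interval limits for the difference of population means $(\mu_1 - \mu_2)$ are given
by

$$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\,\sigma_{\bar{x}_1 - \bar{x}_2}$$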
Illustration 9.4: The strength of the wire produced by company A has a mean of 4,500 kg and a
standard deviation of 200 kg. Company B has a mean of 4000 kg and a standard deviation of 300 kg. A
sample of 50 wires of company A and 100 wires of company B are selected at random for testing the
strength. Find 99 per cent confidence limits on the difference in the average strength of the populations
of wires produced by the two companies.
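Solution: The following information is given:
Company A: $\bar{x}_1$ = 4500, $\sigma_1$ = 200, n1 = 50
Company B: $\bar{x}_2$ = 4000, $\sigma_2$ = 300, n2 = 100
Therefore $\bar{x}_1 - \bar{x}_2$ = 4500 − 4000 = 500 and

$$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{40{,}000}{50} + \frac{90{,}000}{100}} = \sqrt{1{,}700} = 41.23$$

The required 99 per cent confidence interval limits are given by

$$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\,\sigma_{\bar{x}_1 - \bar{x}_2} = 500 \pm 2.576\,(41.23) = 500 \pm 106.2$$

Hence, the 99 per cent confidence limits on the difference in the average strength of wires produced by
the two companies are likely to fall in the interval $393.8 \le \mu_1 - \mu_2 \le 606.2$.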
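The same calculation for the difference of two means can be sketched in Python. The function name `two_mean_interval` is hypothetical, and the critical value is computed with `NormalDist.inv_cdf` (about 2.576 for 99 per cent confidence) instead of being read from a table.

```python
import math
from statistics import NormalDist

def two_mean_interval(x1, sigma1, n1, x2, sigma2, n2, confidence=0.99):
    """Confidence interval for mu1 - mu2 with known standard deviations
    and independent large samples."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    difference = x1 - x2
    margin = z * standard_error
    return difference - margin, difference + margin

# Illustration 9.4: strength of wires (kg) from companies A and B
low, high = two_mean_interval(4500, 200, 50, 4000, 300, 100, confidence=0.99)
print(f"99 per cent interval for mu1 - mu2: {low:.1f} to {high:.1f} kg")
```

Because the whole interval lies above zero, the data of the illustration suggest that company A's wire is stronger on average than company B's.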
9.2.5.3 Interval estimation of population mean ($\sigma$ unknown)
In practice, the standard deviation of a population, $\sigma$, is not likely to be known. Thus
in the large sample case, the sample standard deviation s is used as an estimate of $\sigma$, and we use a z-table to
find $z_{\alpha/2}$, the value providing an area of $\alpha/2$ in the right tail of the standard normal
probability distribution curve. Hence the interval estimate of a population mean $\mu$ for a
large sample case (n > 30) with confidence coefficient 1 − $\alpha$ is given by

$$\bar{x} \pm z_{\alpha/2}\,s_{\bar{x}} = \bar{x} \pm z_{\alpha/2}\frac{s}{\sqrt{n}}$$
When the population standard deviation is not known and the sample size is small, the procedure of
interval estimation of population mean is based on a probability distribution known as the t-distribution.
This distribution is very similar to the normal distribution. However, the t-distribution has more area in
the tails and less in the center than does the normal distribution. The t-distribution depends on a parameter
known as degrees of freedom. As the number of degrees of freedom increases, the t-distribution gradually
approaches the normal distribution, and the sample standard deviation s becomes a better estimate of
the population standard deviation $\sigma$.
The interval estimate of a population mean when the sample size is small (n < 30) with confidence
coefficient (1 − $\alpha$) is given by

$$\bar{x} \pm t_{\alpha/2}\frac{s}{\sqrt{n}} \quad\text{or}\quad \bar{x} - t_{\alpha/2}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2}\frac{s}{\sqrt{n}}$$

where $t_{\alpha/2}$ is the critical value of the t-statistic providing an area $\alpha/2$ in the right tail
of the t-distribution with n − 1 degrees of freedom, and

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

The critical values of t for the given degrees of freedom can be obtained from the
table of t-distribution (see appendix).
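The procedure of the confidence interval estimation of population mean when the
population standard deviation is known or unknown and the sample size is large or small is
summarised in Table 9.2.

Table 9.2: Confidence Interval for $\mu$

Sample size    Population standard deviation    Interval estimate of population mean $\mu$
Large          $\sigma$ assumed known           $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$
Large          $\sigma$ estimated by s          $\bar{x} \pm z_{\alpha/2}\,s/\sqrt{n}$
Small          $\sigma$ assumed known           $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$
Small          $\sigma$ estimated by s          $\bar{x} \pm t_{\alpha/2}\,s/\sqrt{n}$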
Illustration 9.5: A random sample of 64 sales invoices was taken from a large
population of sales invoices. The average value was found to be Rs.2000 with a
standard deviation of Rs.540. Find a 90 per cent confidence interval for the true
mean value of all the sales.
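Solution: The information given is: $\bar{x}$ = 2000, s = 540, n = 64, and $\alpha$ = 10 per cent.
Therefore

$$s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{540}{\sqrt{64}} = 67.50 \quad\text{and}\quad z_{\alpha/2} = 1.64 \text{ (from Normal table)}$$

The required confidence interval of population mean $\mu$ is given by

$$\bar{x} \pm z_{\alpha/2}\frac{s}{\sqrt{n}} = 2000 \pm 1.64\,(67.50) = 2000 \pm 110.70$$

Thus the mean of the sales invoices for the whole population is likely to fall between
Rs.1889.30 and Rs.2110.70, that is, $1889.30 \le \mu \le 2110.70$.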
Illustration 9.6: The personnel department of an organization would like to estimate
the family dental expenses of its employees to determine the feasibility of providing a
dental insurance plan. A random sample of 10 employees reveals the following family
dental expenses (in thousand Rs.) in the previous year: 11, 37, 25, 62, 51, 21, 18,
43, 32, 20.
Setup a 99 per cent confidence interval of the average family dental expenses for the
employees of this organization.
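Solution: The calculations for sample mean $\bar{x}$ and standard deviation s are shown in
Table 9.3.

Table 9.3: Calculations for $\bar{x}$ and s

Variable, x    $(x - \bar{x}) = (x - 32)$    $(x - \bar{x})^2$
11             −21                           441
37             5                             25
25             −7                            49
62             30                            900
51             19                            361
21             −11                           121
18             −14                           196
43             11                            121
32             0                             0
20             −12                           144
Total 320      0                             2358

From the data in Table 9.3, the sample mean is $\bar{x} = \sum x / n = 320/10$ = Rs.32 thousand, and the sample standard
deviation is

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{2358}{9}} = \text{Rs.}16.19 \text{ thousand}$$

so that the standard error is $s/\sqrt{n} = 16.19/\sqrt{10}$ = Rs.5.12 thousand. Using this information and
$t_{\alpha/2}$ = 3.250 at df = 9 for 99 per cent confidence, we have

$$\bar{x} \pm t_{\alpha/2}\frac{s}{\sqrt{n}} = 32 \pm 3.250\,(5.12) = 32 \pm 16.64$$

Hence the mean expenses per family are likely to fall between Rs.15.36 thousand and Rs.48.64 thousand, that is,
$15.36 \le \mu \le 48.64$.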
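The small-sample calculation for the dental expenses data of Illustration 9.6 can be verified in Python. This sketch uses only the standard library, which has no t-distribution, so the critical value is an assumption entered by hand: 3.250 is the value a standard t-table gives for 99 per cent confidence (an area of 0.005 in each tail) with 9 degrees of freedom. The function name `t_interval` is hypothetical.

```python
import math
import statistics

def t_interval(data, t_critical):
    """Return (mean, s, standard error, (lower, upper)) for a small-sample
    confidence interval; t_critical must be looked up for n - 1 df."""
    n = len(data)
    x_bar = statistics.fmean(data)
    s = statistics.stdev(data)              # uses the n - 1 divisor
    standard_error = s / math.sqrt(n)
    margin = t_critical * standard_error
    return x_bar, s, standard_error, (x_bar - margin, x_bar + margin)

expenses = [11, 37, 25, 62, 51, 21, 18, 43, 32, 20]   # thousand Rs.
T_99_DF9 = 3.250                                      # assumed table value, t(0.005, 9)

x_bar, s, se, (low, high) = t_interval(expenses, T_99_DF9)
print(f"mean = {x_bar:.2f}, s = {s:.2f}, standard error = {se:.2f}")
print(f"99 per cent interval: {low:.2f} to {high:.2f} (thousand Rs.)")
```

Note that s is the standard deviation of the observations, while the quantity multiplied by the critical value is the standard error, s divided by the square root of n; the two should not be confused.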
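9.2.5.4 INTERVAL ESTIMATION FOR POPULATION PROPORTION
For a large sample, the confidence interval estimate of the population proportion p is given by

$$\bar{p} \pm z_{\alpha/2}\,\sigma_{\bar{p}} = \bar{p} \pm z_{\alpha/2}\sqrt{\frac{\bar{p}(1 - \bar{p})}{n}} \quad\text{or}\quad \bar{p} - z_{\alpha/2}\,\sigma_{\bar{p}} \le p \le \bar{p} + z_{\alpha/2}\,\sigma_{\bar{p}}$$

where $z_{\alpha/2}$ is the z-value providing an area of $\alpha/2$ in the right tail of the
standard normal probability distribution and the quantity $z_{\alpha/2}\,\sigma_{\bar{p}}$ is the margin
of error.
Solution: The sample proportion is

$$\bar{p} = \frac{x}{n} = \frac{48}{144} = \frac{1}{3}$$

Using the information, n = 144, $\bar{p} = 1/3$ and $z_{\alpha/2}$ = 1.96 at 95 per cent confidence
coefficient, we have

$$\bar{p} \pm z_{\alpha/2}\sqrt{\frac{\bar{p}(1 - \bar{p})}{n}} = \frac{1}{3} \pm 1.96\sqrt{\frac{(1/3)(2/3)}{144}} = 0.333 \pm 0.077$$

Hence the population proportion of families who have two or more children is likely to be
between 25.6 and 41.0 per cent, that is, $0.256 \le p \le 0.410$.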
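The interval for a population proportion can be computed in the same way. The sketch below is a minimal illustration with a hypothetical function name, `proportion_interval`; it takes the count of favourable cases (48) and the sample size (144) used in the solution above and obtains the critical value from `NormalDist.inv_cdf`.

```python
import math
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """Large-sample confidence interval for a population proportion."""
    p_bar = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = math.sqrt(p_bar * (1 - p_bar) / n)
    margin = z * standard_error
    return p_bar, margin, (p_bar - margin, p_bar + margin)

p_bar, margin, (low, high) = proportion_interval(successes=48, n=144)
print(f"sample proportion = {p_bar:.3f}, margin of error = {margin:.3f}")
print(f"95 per cent interval: {low:.3f} to {high:.3f}")
```

The normal approximation behind this formula is usually considered reasonable only when both the expected number of favourable and of unfavourable cases in the sample are not too small.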
range of the confidence interval can be decreased by increasing the sample size
n. The decision regarding the appropriate size of the sample, however, depends
on (i) deciding in advance how good an estimate is required, and (ii) the
availability of funds, time, and ease of sample selection. For example, an
insurance company wants to estimate the proportion of claims settled within 2
months of the receipt of claim. For this purpose, the company must decide
how much error it is willing to allow in estimating the population proportion of
claims settled in a particular financial year. This means, whether accuracy is
required to be within ± 80 claims, ± 100 claims, and so on. Also, the company
needs to determine in advance the level of confidence for estimating the true
population parameter. Hence for determining the sample size for estimating
population mean or proportion, such requirements must be kept in mind along
with information regarding standard deviation.
9.3.1 Sample Size for Estimating Population Mean
When the distribution of sample mean $\bar{x}$ is normal, the standard normal variable z is given as

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \quad\text{or}\quad \bar{x} - \mu = z\,\frac{\sigma}{\sqrt{n}}$$

The value of z can be seen from the 'standard normal table' for a specified confidence coefficient 1 − $\alpha$. The
value of z in the above equation will be positive or negative, depending on whether the sample mean $\bar{x}$
is larger or smaller than the population mean $\mu$, as shown in Fig.11.4. This difference between $\bar{x}$ and the mean
$\mu$ is called the sampling error or margin of error E. Thus for estimating the population mean $\mu$ with a
condition that the error in its estimation should not exceed a fixed value, say E, we require that the
sample mean $\bar{x}$ should fall within the range $\mu \pm E$ with a specified probability. Thus the margin of error
acceptable (i.e. the maximum tolerable difference between the unknown population mean and the sample
estimate at a particular level of confidence) can be written as:

$$\bar{x} - \mu = z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \quad\text{or}\quad E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

$$\text{or}\quad \sqrt{n} = \frac{z_{\alpha/2}\,\sigma}{E}, \quad\text{i.e.,}\quad n = \frac{(z_{\alpha/2})^2\,\sigma^2}{E^2}$$

This formula for sample size n will provide the tolerable margin of error E, at the chosen confidence
level 1 − $\alpha$ (which determines the critical value of z from the normal table) with known or estimated
population standard deviation $\sigma$.
Note: If population standard deviation is not known, then sample standard deviation s can be used to
determine the sample size n.
Illustration 9.8: Suppose the sample standard deviation of P/E ratios for stocks listed on the Bombay
Stock Exchange (BSE) is s = 7.8. Assume that we are interested in estimating the population mean of
P/E ratio for all stocks listed on BSE with 95 per cent confidence. How many stocks should be included
in the sample if we desire a margin of error of 2?
Solution: The information given is: E = 2, s = 7.8, $z_{\alpha/2}$ = 1.96 at 95 per cent level of confidence.
Using the formula for n and substituting the given values, we have

$$n = \frac{(z_{\alpha/2})^2\,s^2}{E^2} = \frac{(1.96)^2 (7.8)^2}{(2)^2} = 58.43$$

Thus a sample size n = 59 should be chosen to estimate the population mean of P/E ratio for all stocks
on the BSE.
Note: The general rule used in determining sample size is to always round up to the next integer
value in order to slightly over-satisfy the desired precision of estimation.
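The sample size formula for a mean, together with the rounding-up rule, can be sketched as follows. The function name `sample_size_for_mean` is hypothetical; the inputs are those of Illustration 9.8, and the critical value is computed rather than read from a table.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, margin_of_error, confidence=0.95):
    """Smallest n for which z * sigma / sqrt(n) does not exceed the margin of error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n_exact = (z * sigma / margin_of_error) ** 2
    return math.ceil(n_exact)            # always round up, never to the nearest integer

# Illustration 9.8: s = 7.8 used in place of sigma, E = 2, 95 per cent confidence
n_required = sample_size_for_mean(sigma=7.8, margin_of_error=2, confidence=0.95)
print("required sample size:", n_required)
```

Rounding up matters: with one observation fewer the margin of error would exceed the stated tolerance of 2.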
9.3.2 Sample Size for Estimating Population Proportion
The method for determining a sample size for estimating the population proportion is similar to that
used in the previous section. We require that the sample proportion $\bar{p}$ should fall within the range
$p \pm E$, with a specified probability:

$$E = z_{\alpha/2}\,\sigma_{\bar{p}} = z_{\alpha/2}\sqrt{\frac{pq}{n}}; \quad q = 1 - p$$

$$\text{or}\quad E^2 = (z_{\alpha/2})^2\,\frac{pq}{n}, \quad\text{i.e.,}\quad n = \frac{(z_{\alpha/2})^2\,pq}{E^2}$$

The value of z can be read from the 'Standard normal table' for a specified confidence coefficient.
This formula for n will provide the desired margin of error E at the chosen confidence level 1 − $\alpha$ (which
determines the critical value of z) with known or estimated population proportion p.
Solution: The information given is: E = 0.07, p = 0.10, and $z_{\alpha/2}$ = 2.576 at 99
per cent confidence level.
Using the formula for n and substituting the given values, we have

$$n = \frac{(z_{\alpha/2})^2\,pq}{E^2} = \frac{(2.576)^2 (0.10)(0.90)}{(0.07)^2} = 121.88$$

Thus a sample size of n = 122 should be chosen.
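The sample size formula for a proportion can be checked with a short Python sketch. The function name `sample_size_for_proportion` is hypothetical; the inputs E = 0.07, p = 0.10 and 99 per cent confidence are those listed in the solution above.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(p, margin_of_error, confidence=0.99):
    """Smallest n for which z * sqrt(p * q / n) does not exceed the margin of error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    q = 1 - p
    n_exact = z ** 2 * p * q / margin_of_error ** 2
    return math.ceil(n_exact)            # round up to the next whole unit

n_required = sample_size_for_proportion(p=0.10, margin_of_error=0.07, confidence=0.99)
print("required sample size:", n_required)

# With no prior idea of p, the most cautious choice is p = 0.5, where p * q is largest
print("with p = 0.5:", sample_size_for_proportion(p=0.5, margin_of_error=0.07, confidence=0.99))
```

The second call shows how strongly the answer depends on the assumed proportion: the product pq is largest at p = 0.5, so that choice gives the largest, and therefore safest, sample size.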
When samples are drawn without replacement from a finite population of size N, the use of the finite
population correction factor reduces the standard error by a factor of $\sqrt{(N - n)/(N - 1)}$. For
example, for deciding sample size n for estimating the population mean $\mu$, the desired margin of error is
given by

$$E = z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N - n}{N - 1}}$$

Similarly, when estimating the proportion, the desired margin of error is given
by

$$E = z_{\alpha/2}\,\sigma_{\bar{p}} \quad\text{or}\quad E = z_{\alpha/2}\sqrt{\frac{pq}{n}}\sqrt{\frac{N - n}{N - 1}}$$

Let $n_0$ be the sample size for estimating the population mean without using the correction factor. Then

$$n_0 = \frac{(z_{\alpha/2})^2\,\sigma^2}{E^2}$$

The revised sample size, taking into consideration the size of the population, is
given by

$$n = \frac{n_0 N}{n_0 + (N - 1)}$$
Illustration 9.10: For a population of 1000, what should be the sample size
necessary to estimate the population mean at 95 per cent confidence with a
sampling error of 5 and the standard deviation equal to 20?
Solution: We have E = 5, $\sigma$ = 20, $z_{\alpha/2}$ = 1.96 at 95 per cent confidence level, and N = 1000. Thus

$$n_0 = \frac{(z_{\alpha/2})^2\,\sigma^2}{E^2} = \frac{(1.96)^2 (20)^2}{(5)^2} = 61.466$$

Since the population size is finite, the revised sample size obtained by using
the correction factor is

$$n = \frac{n_0 N}{n_0 + (N - 1)} = \frac{(61.466)(1000)}{61.466 + (1000 - 1)} = \frac{61466}{1060.466} = 57.96$$

Thus a sample size of n = 58 should be taken.
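The two-step calculation with the finite population correction can be sketched in Python. The function name `sample_size_finite_population` is hypothetical; the inputs are those of Illustration 9.10, and the critical value is computed with `NormalDist.inv_cdf` rather than taken from a table.

```python
import math
from statistics import NormalDist

def sample_size_finite_population(sigma, margin_of_error, N, confidence=0.95):
    """Return (n0, n): the uncorrected sample size n0 (not rounded) and the
    sample size n revised for a finite population of size N (rounded up)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = (z * sigma / margin_of_error) ** 2
    n = n0 * N / (n0 + (N - 1))
    return n0, math.ceil(n)

n0, n = sample_size_finite_population(sigma=20, margin_of_error=5, N=1000)
print(f"uncorrected n0 = {n0:.3f}")
print(f"revised sample size n = {n}")
```

The correction always lowers the required sample size, and its effect fades as N grows: for a very large population the revised value approaches the uncorrected one.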
1. If we take all possible samples of the same size from a population and calculate their means, the
mean $\mu_{\bar{x}}$ of all these means will be .......................... to the mean of the population.
2. Standard deviation of the sampling distribution of $\bar{x}$ decreases as the sample size .............. and
vice versa.
3. The efficiency of one estimator in relation to another estimator can be judged by ............... their
sampling variances.
4. As the number of degrees of freedom increases, t-distribution gradually approaches the .........., and
the sample standard deviation s becomes a better estimate of population standard deviation.
5. For estimating the population mean $\mu$ with a condition that the error in its estimation should not
............ a fixed value, say E, we require that the sample mean $\bar{x}$ should fall within the range $\mu \pm E$
with a specified probability.
9.5 SUMMARY
There are two types of estimates that we can make about a population: a point estimate and an interval
estimate. A point estimate is a single number, which is used to estimate an unknown population
parameter. Although a point estimate may be the most common way of expressing an estimate, it suffers
from a major limitation since it fails to indicate how close it is to the quantity it is supposed to estimate.
The second type of estimate is known as the interval estimate. It is a range of values used to estimate an
unknown population parameter. In case of an interval estimate, the error is indicated in two ways: first
by the extent of its range; and second, by the probability of the true population parameter lying within
that range.
There are four criteria by which we can evaluate the quality of a statistic as an
estimator. These are: unbiasedness, efficiency, consistency and sufficiency.
9.6 KEYWORDS
Estimate: An estimate is a valuation based on opinion or roughly made from imperfect or incomplete
data.
Point Estimate: A point estimate is a single number, which is used to estimate an unknown population
parameter.
2. Increases
3. Comparing
4. Normal distribution
5. Exceed
STRUCTURE
10.1 Learning Objectives
10.2 Introduction
10.2.1 The Null and the Alternative Hypothesis
10.2.2 Some Basic Concepts
10.2.3 Critical Region in Terms of Test Statistic
10.2.4 General Testing Procedure
10.2.5 Tests of Hypotheses about Population Means
10.2.6 Tests of Hypotheses about Population Proportions
10.2.7 Tests of Hypotheses about Population Variances
10.2.8 The Comparison of Two Populations
10.3 Solved Problems
10.4 Check your Progress
10.5 Summary
10.6 Keywords
10.7 Self-Assessment Test
10.8 Answers to check your progress
10.9 References/Suggested Readings
10.2 INTRODUCTION
Closely related to Statistical Estimation discussed in the preceding lesson, Testing of Hypotheses is one
of the most important aspects of the theory of decision-making. In the present lesson, we will study a
class of problems where the decision made by a decision maker depends primarily on the strength of the
evidence thrown up by a random sample drawn from a population. We can elaborate this by an example
where the operations manager of a cola company has to decide whether the bottling operation is under
statistical control or it has gone out of control (and needs some corrective action). Imagine that the
company sells cola in bottles labeled 1-liter, filled by an automatic bottling machine. The implied claim
that on the average each bottle contains 1,000 cm3 of cola may or may not be true.
If the claim is true, the process is said to be under statistical control. It is in the interest of the
company to continue the bottling process.
If the claim is not true, i.e. the average is either more than or less than 1,000 cm3, the process is said
to have gone out of control. It is in the interest of the company to halt the bottling process and set
right the error.
Therefore, to decide about the status of the bottling operation, the operations manager needs a tool,
which allows him to test such a claim.
Testing of Hypotheses provides such a tool to the decision maker. If the operations manager were to use
this tool, he would collect a sample of filled bottles from the on-going bottling process. The sample of
bottles will be evaluated and, based on the strength of the evidence produced by the sample, the
operations manager will accept or reject the implied claim and accordingly make the decision. The
implied claim (μ = 1,000 cm3) is a hypothesis that needs to be tested and the statistical procedure, which
allows us to perform such a test, is called Hypothesis Testing or Testing of Hypotheses.
What is a Hypothesis?
A thesis is something that has been proven to be true. A hypothesis is something that has not yet been
proven to be true. It is some statement about a population parameter or about a population distribution.
Our hypothesis for the example of bottling process could be:
“The average amount of cola in the bottles is equal to 1,000 cm3”
This statement is tentative as it implies some assumption, which may or may not be found valid on
verification. Hypothesis testing is the process of determining whether or not a given hypothesis is true.
If the population is large, there is no way of analyzing the population or of testing the hypothesis
directly. Instead, the hypothesis is tested on the basis of the outcome of a random sample.
10.2.1 THE NULL AND THE ALTERNATIVE HYPOTHESIS
As stated earlier, a hypothesis is a statement about a population parameter or about a population
distribution. In any testing of hypotheses problem, we are faced with a pair of hypotheses such that one
and only one of them is always true. One of this pair is called the null hypothesis and the other one the
alternative hypothesis.
A null hypothesis is an assertion about the value of a population parameter. It is an assertion that we
hold as true unless we have sufficient statistical evidence to conclude otherwise.
For example, a null hypothesis might assert that the population mean is equal to 1,000. Unless we
obtain sufficient evidence that it is not 1,000, we will accept it as 1,000. We write the null hypothesis
compactly as:
H0: μ = 1,000
Where the symbol H0 denotes the null hypothesis.
The alternative hypothesis is the negation of the null hypothesis.
For the null hypothesis H0: μ = 1,000, the alternative hypothesis is μ ≠ 1,000. We will write it as
H1: μ ≠ 1,000
We use the symbol H1 (or Ha) to denote the alternative hypothesis.
The null and alternative hypotheses assert exactly opposite statements. Obviously, both H0 and H1
cannot be true and one of them will always be true. Thus, rejecting one is equivalent to accepting the
other. At the end of our testing procedure, if we come to the conclusion that H0 should be rejected, this
also amounts to saying that H1 should be accepted and vice versa.
It is not difficult to identify the pair of hypotheses relevant in any decision situation. Can any one of the
two be called the null hypothesis? The answer is a big NO — because the roles of H0 and H1 are not
symmetrical.
The possible outcomes of a test can be summarized as:
Either: Accept H0 as a reasonable possibility - a weak conclusion, without any evidence in support of H0
or: Reject H0 and accept H1 - a strong conclusion, with strong evidence against H0
To better understand the role of null and alternative hypotheses, we can compare the process of
hypothesis testing with the process by which an accused person is judged to be innocent or guilty. The
person before the bar is assumed to be "innocent until proven guilty". So, using the language of
hypothesis testing, we have:
H0: The person is innocent
H1: The person is guilty
The trial process may result in one of two outcomes:
Accepting H0 of innocence: when there was not enough evidence to convict. However, it does not
prove that the person is truly innocent
Rejecting H0 and accepting H1 of guilt: when there is enough evidence to rule out innocence as a
possibility and to strongly establish guilt
The jury acquitted Michael Jackson, on June 13, of all charges against him in the child
molestation case. In other words, using the language of hypothesis testing the jury had to accept
the null hypothesis H0: Michael Jackson is innocent because the prosecution could not prove
their case against H0 of innocence.
In a trial case we do not have to rule out guilt in order to find someone innocent, but we do have to rule
out innocence in order to find someone guilty. Along similar lines, we do not have to rule out H1 in
order to accept H0; but we do have to rule out H0 in order to accept H1. Thus, it is clear that the two
hypotheses - null and alternative - are not interchangeable; each one plays a different, a special role. So
it becomes more important to be clear about what the null and alternative hypotheses should be in a
given situation, or else the test is meaningless.
One can conceptualize the whole procedure of testing of hypothesis as trying to answer one basic
question: Is the sample evidence strong enough to enable us to reject H0? This means that H0 will be
rejected only when there is strong sample evidence against it. However, if the sample evidence is not
strong enough, we shall conclude that we cannot reject H0 and so we accept H0 by default. Thus, H0 is
accepted even without any evidence in support of it whereas it can be rejected only when there is
overwhelming evidence against it. In other words, the decision maker is somewhat biased towards the
null hypothesis and he does not mind accepting the null hypothesis. However, he would reject the null
hypothesis only when the sample evidence against the null hypothesis is too strong to be ignored.
The null hypothesis is called by this name because in many situations, acceptance of this hypothesis
would lead to null action. Thus, one way to ensure what the null hypothesis should be is to note that…
…if the null hypothesis is true, then no corrective action would be necessary. If the alternative
hypothesis is true, then some corrective action would be necessary.
Recall our example of the cola-company in which an automatic bottling machine fills 1-liter bottles
with cola. Now consider three different situations:
Situation I: The operations manager wants to test the average amount filled, in order to know whether
the process is under statistical control.
In this situation, the operations manager will have to take corrective action when the average is either
more than or less than 1,000 cm3. Only when the average equals 1,000 cm3, no corrective action is
necessary. So we have
H0: μ = 1,000 cm3
H1: μ ≠ 1,000 cm3
Situation II: A consumer advocate suspects that the average amount of cola is less than 1,000 cm3 and
wants to test it.
In this situation, if the average amount of cola is greater than or equal to 1,000 cm3, no corrective
action is needed, but if the average amount is less than 1,000 cm3, the company has to halt the bottling
process and set right the error. So, in this case, we have
H0: μ ≥ 1,000 cm3
H1: μ < 1,000 cm3
Situation III: The owner of the company suspects that the machine is wasting cola by filling more
than 1,000 cm3 on the average and wants to test it.
From the owner's point of view, no corrective action is necessary if the average is less than or equal to
1,000 cm3. And, therefore, in this case we have
H0: μ ≤ 1,000 cm3
H1: μ > 1,000 cm3
As the bottling example indicates, there are three possible cases for the null hypothesis, involving ≥, ≤,
and = relationships. The exact null hypothesis should be finalized before any evidence is gathered, or
the test will not be valid. Data snooping - formulating the null and alternative hypotheses at one's
convenience after collecting and looking at the evidence - is unethical.
                               States of Population
Decision based on Sample       H0 True             H0 False
Accept H0                      Correct decision    Type II error
Reject H0                      Type I error        Correct decision
Both the type I and type II errors are undesirable and should be reduced to the minimum. Let us analyse
how we can minimize the chances of type I and type II errors. It may be easily realized that it is
possible, even with imperfect sample evidence, to reduce the probability of type I error all the way down to
zero. Just accept the null hypothesis, no matter what the evidence is. Since we will never reject any null
hypothesis, we will never reject a true null hypothesis and thus we will never commit a type I error!
However, it is obvious that this would be foolish. If we always accept a null hypothesis, then given a
false null hypothesis, no matter how wrong it is, we are sure to accept it. In other words, our probability
of committing a type II error will be 1. Similarly, we find it foolish to reduce the probability of type II
error all the way down to zero by always rejecting a null hypothesis, for we would then reject every true
null hypothesis, no matter how right it is. Our probability of type I error will be 1.
Therefore, we cannot and should not try to completely avoid either type of error. We should plan,
organize, and settle for some small, optimal probability of each type of error. Before we discuss this
issue, we need to learn a few more concepts.
10.2.2.2 TEST STATISTIC AND THE p-VALUE
Consider the case of the owner's suspicion related to our bottling process example. The null and alternative
hypotheses in this case are:
H0: μ ≤ 1,000
H1: μ > 1,000
Suppose the population variance is 25 and a random sample of size 100 yields a sample mean of
1,000.5. Because the sample mean is more than 1,000, the evidence goes against the null hypothesis
(H0). Can we reject H0 based on this evidence?
if we reject it, there is some chance that we might be committing a type I error, and
if we accept it, there is some chance that we might be committing a type II error.
Then what can we do? We should ask a natural question in this situation: "What is the probability that
H0 can still be true despite the evidence?" The question asks for the "credibility" of H0 in light of
unfavorable evidence. However, due to mathematical complexities, it is not possible to compute the
probability that H0 is true. We, therefore, settle for a question that comes very close.
"When the actual μ = 1,000, and with sample size 100, what is the probability of getting a sample
mean that is more than or equal to 1,000.5?"
The answer to this question is then taken, as the "credibility rating" of H0. Analyzing the question
carefully, we note an important aspect:
The condition assumed is μ = 1,000, although H0 states μ ≤ 1,000. The reason for assuming μ = 1,000
is that it gives the most benefit of doubt to H0. If we assume μ = 999, for instance, the probability of
the sample mean being more than or equal to 1,000.5 will only be smaller, and H0 will only have less
credibility. Thus the assumption μ = 1,000 gives the maximum credibility to H0.
Now using our knowledge of sampling distribution of sample mean, we can easily answer our
question.
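Since the population variance is known and the sample size is large enough, the Central Limit Theorem is
applicable here, i.e.
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
and the standard normal variable
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
is to be used to calculate the required probability $P(\bar{X} \ge 1{,}000.5)$. So
$$P(\bar{X} \ge 1{,}000.5) = P\left(Z \ge \frac{1{,}000.5 - 1{,}000}{5/\sqrt{100}}\right) = P(Z \ge 1.00) = 0.1587 \approx 0.16$$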
So the answer to our question is 16%. That is, there is a 16% chance for a sample of size 100 to yield a
sample mean more than or equal to 1,000.5 when the actual µ = 1,000. Statisticians call this 16% the
p-value. In other words, the p-value, the probability of observing a sample statistic as extreme as the one
observed if the null hypothesis is true, is a kind of "credibility rating" of H0 in light of the evidence.
A p-value close to zero means that the evidence is almost impossible to reconcile with H0, and a p-value
close to 1 means that the evidence is fully consistent with H0. Because the p-value is calculated assuming
that H0 is true, it is not literally the probability that H0 is true; read as a credibility rating, a
p-value of 16% says that H0 retains roughly 16% credibility despite the evidence. The implication is that
evidence at least this unfavorable would arise about 16% of the time even if H0 were true, so rejecting H0
on the strength of such evidence carries about a 16% risk of committing a type I error.
The formal definition of the p-value follows:
Given a null hypothesis and sample evidence with sample size n, the p-value is the probability of getting
a sample evidence with the same n that is equally or more unfavorable to the null hypothesis while the
null hypothesis is actually true. The p-value is calculated giving the null hypothesis the maximum benefit
of doubt.
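The p-value of the bottling example can be reproduced with a few lines of code. This is a minimal sketch using Python's standard library; the function name `right_tailed_p_value` is an illustrative choice, and the inputs (sample mean 1,000.5, μ = 1,000, σ = 5, n = 100) are those of the example above.

```python
from math import sqrt
from statistics import NormalDist

def right_tailed_p_value(sample_mean, mu0, sigma, n):
    """P(X-bar >= sample_mean) when the true mean is mu0 (known sigma)."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))  # test statistic Z
    return 1.0 - NormalDist().cdf(z)             # area in the right tail

p_value = right_tailed_p_value(1000.5, 1000, 5, 100)
print(round(p_value, 4))
```

The benefit of doubt is built into the call: the probability is evaluated at μ = 1,000, the boundary value of H0.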
The random variable, as Z in this case, used to calculate the p-value is called the test statistic. The formal
definition of the test statistic follows:
A test statistic is a random variable calculated from the sample evidence, which follows a well-known
distribution and thus can be used to calculate the p-value.
Most of the time, the test statistic we use will be Z, t, χ2, or F. The distributions of these random
variables are well known and we can calculate the p-value.
Up to this point it should be clear that a statistical hypothesis is always stated with reference to a
population parameter (mean, proportion or variance). The appropriate random variable calculated from
the sample evidence acts as a test statistic and provides the means to decide whether the statistical
hypothesis is to be rejected or accepted.
The standard values for α are 10%, 5%, and 1%. Suppose α is set at 5%. In the preceding example, for a
sample mean of 1,000.5 the p-value was 16%, and H0 will not be rejected. For a sample mean of 1001
the p-value will be 2.28%, which is below α = 5%. Hence H0 will be rejected.
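The comparison of the p-value with α can be sketched as follows for the two sample means mentioned above and the three standard values of α. The helper names `right_tailed_p_value` and `decide` are illustrative, and μ = 1,000, σ = 5 and n = 100 are carried over from the bottling example.

```python
from math import sqrt
from statistics import NormalDist

def right_tailed_p_value(sample_mean, mu0=1000.0, sigma=5.0, n=100):
    """p-value of the right-tailed Z-test for the bottling example."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    return 1.0 - NormalDist().cdf(z)

def decide(p_value, alpha):
    """Reject H0 only when the p-value falls below the significance level."""
    return "reject H0" if p_value < alpha else "do not reject H0"

decisions = {}
for sample_mean in (1000.5, 1001.0):
    p = right_tailed_p_value(sample_mean)
    for alpha in (0.10, 0.05, 0.01):
        decisions[(sample_mean, alpha)] = decide(p, alpha)
        print(sample_mean, alpha, round(p, 4), decisions[(sample_mean, alpha)])
```

Under these assumptions, the sample mean of 1,001 leads to rejection at α = 10% and 5% but not at 1%, which shows that the inference can depend on the significance level chosen.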
Let us analyze in some detail the implications of using a significance level α for rejecting a null
hypothesis.
The first thing to note is that if we do not reject H0, this does not prove that H0 is true. For example,
if α = 5% and the p-value = 6%, we will not reject H0. But a p-value of 6% gives H0 very little
credibility, which is hardly proof that H0 is true. It may be possible that H0 is false and by not
rejecting it, we are committing a type II error. For this reason, we should say "We cannot reject H0 at
an α of 5%" rather than "We accept H0."
The second thing to note is that α is the maximum probability of type I error we set for ourselves.
Since α is the maximum p-value at which we reject H0, it is the maximum probability of
committing a type I error. In other words, setting α = 5% means that we are willing to put up with
up to 5% chance of committing a type I error.
The third thing to note is that the selected value of α indirectly determines the probability of type II
error as well. In general, other things remaining the same, increasing the value of α will decrease
the probability of type II error. This should be intuitively obvious. For example, increasing α from
5% to 10% means that in those instances with p-value in the range 5% to 10% the H0 that would
not have been rejected before would now be rejected. Thus, some cases of false H0 that escaped
rejection before may not escape now. As a result, the probability of type II error will decrease.
The fourth thing to note about α is the meaning of (1 - α). If we set α = 5%, then (1 - α) = 95% is
the minimum confidence level that we set in order to reject H0. In other words, we want to be at
least 95% confident that H0 is false before we reject it.
One-Tailed and Two-Tailed Tests : Consider the null and alternative hypotheses:
H0: μ ≥ 1,000
H1: μ < 1,000
In this case, we will reject H0 only when $\bar{X}$ is significantly less than 1,000 or only when Z falls
significantly below zero. Thus the rejection occurs only when Z takes a significantly low value in the
left tail of its distribution.
Such a case where rejection occurs in the left tail of the distribution of the test statistic is called a left-
tailed test, as seen in Figure 10-1.
In left-tailed and right-tailed tests, rejection occurs only on one tail. Hence each of them is called a one-
tailed test.
Finally, consider the case where the null and alternative hypotheses are:
H0: μ = 1,000
H1: μ ≠ 1,000
In this case, we have to reject H0 in both cases, that is, whether $\bar{X}$ is significantly less than or greater
than 1,000. Thus, rejection occurs when Z is significantly less than or greater than zero, which is to say
that rejection occurs on both tails. Therefore, this case is called a two-tailed test. See Figure 10-3, where
the shaded areas are the rejection regions.
hypothesis at α = 1%. In other words, the inference drawn can be sensitive to the significance level
used. We should note that selecting a value for α is a question of compromise between type I and type
II error probabilities. In practice, the significance level is supposed to be arrived at after considering the
cost consequences of type I error and type II error. However, most of the time the costs are difficult to
estimate since they depend, among other things, on the unknown actual value of the parameter being
tested. Thus, arriving at a "calculated" optimal value for α is impractical. Instead, we follow an intuitive
approach of assigning one of the three standard values, 1%, 5%, and 10%, to α.
In the intuitive approach, we try to estimate the relative costs of the two types of errors. For example,
suppose we are testing the average tensile strength of a large batch of bolts produced by a machine to
see if it is above the minimum specified. Here type I error will result in rejecting a good batch of bolts
and the cost of the error is roughly equal to the cost of the batch of bolts. Type II error will result in
accepting a bad batch of bolts and its cost can be high or low depending on how the bolts are used.
If the bolts are used to hold together a structure, then the cost is high because defective bolts can result
in the collapse of the structure, causing great damage. In this case, we should strive to reduce the
probability of type II error more than that of type I error. In such cases where type II error is more
costly, we keep a large value for α, namely, 10%.
On the other hand, if the bolts are used to secure the lids on trash cans, then the cost of type II error is
not high and we should strive to reduce the probability of type I error more than that of type II error. In
such cases where type I error is more costly, we keep a small value for α, namely, 1%.
Then there are cases where we are not able to determine which type of error is more costly. If the
costs are roughly equal, or if we have not much knowledge about the relative costs of the two types of
errors, then we keep α = 5%.
Denoted by β, Type II error is committed when a wrong decision is taken in accepting a false null
hypothesis. β is the probability of accepting H0 when it should have been rejected for being false. It
should be noted that β depends on the actual value of the parameter being tested, the sample size, and α.
Let us see exactly how it depends.
H0: 1,000
H1: 1,000
obviously implies a change in the cross hatched area i.e. β. In other words, the smaller the α, the
larger the β and vice-versa. Type I and type II errors are, therefore, negatively related. Type I error
and the power of the test (1-β) are, however, positively related. Thus, the smaller the probability (α)
of rejecting H0 when it is true, the smaller is the probability (1-β) of rejecting H0 when it is false.
Sample Size
In the discussion above we said that we can keep α low or β low depending on which type of error is
more costly. What if both types of error are costly and we want to have low α as well as low β? The
only way to do this is to make our evidence more reliable, which can be done only by increasing the
sample size. If the sample size increases, then the evidence becomes more reliable and the probability of
either type of error will decrease.
Figure 10-5 shows the relationship between α and β for various values of sample size n. As n increases,
the curve shifts downwards, reducing both α and β. Thus, when the costs of both types of error are high,
the best policy is to have a large sample and a low α, such as 1%.
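The dependence of β on α and on the sample size can be illustrated numerically. The sketch below assumes a right-tailed test of H0: μ ≤ 1,000 with known σ = 5; the actual mean of 1,001 is a hypothetical value chosen only for illustration, and `type_ii_error` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def type_ii_error(alpha, n, mu0=1000.0, mu_actual=1001.0, sigma=5.0):
    """beta for the right-tailed test of H0: mu <= mu0 when the true mean is mu_actual."""
    se = sigma / sqrt(n)                                       # standard error of X-bar
    critical_mean = mu0 + STD_NORMAL.inv_cdf(1 - alpha) * se   # reject H0 above this value
    # H0 is (wrongly) accepted whenever X-bar falls below the critical mean
    return STD_NORMAL.cdf((critical_mean - mu_actual) / se)

for n in (100, 400):
    for alpha in (0.01, 0.05, 0.10):
        print(n, alpha, round(type_ii_error(alpha, n), 4))
```

For a fixed n, β falls as α rises; for a fixed α, β falls as n rises, which is the pattern described above.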
Test            Critical Region
Left-tailed     $Z < -Z_{\alpha}$
Right-tailed    $Z > Z_{\alpha}$
Two-tailed      $Z > Z_{\alpha/2}$ or $Z < -Z_{\alpha/2}$
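The critical regions of the Z-test translate directly into a small decision routine. The following is a minimal sketch; the function name `in_critical_region` and the tail labels "left", "right" and "two" are illustrative choices.

```python
from statistics import NormalDist

def in_critical_region(z, alpha, tail):
    """Return True when the observed Z falls in the rejection region."""
    std_normal = NormalDist()
    if tail == "left":
        return z < -std_normal.inv_cdf(1 - alpha)           # Z < -Z_alpha
    if tail == "right":
        return z > std_normal.inv_cdf(1 - alpha)            # Z > Z_alpha
    if tail == "two":
        return abs(z) > std_normal.inv_cdf(1 - alpha / 2)   # |Z| > Z_alpha/2
    raise ValueError("tail must be 'left', 'right' or 'two'")

print(in_critical_region(-1.95, 0.05, "left"))
print(in_critical_region(1.00, 0.05, "right"))
```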
t-test
When, in the testing of hypotheses, we use the random variable t for calculating the p-value and for
defining the critical region of the test, we call the test a t-test. The critical regions in terms of t are
summarized in Table 10-3.
Table 10-3 Critical Region of t-test
Test            Critical Region
Left-tailed     $t < -t_{\alpha}$
Right-tailed    $t > t_{\alpha}$
Two-tailed      $t > t_{\alpha/2}$ or $t < -t_{\alpha/2}$
χ2-test
When, in the testing of hypotheses, we use the random variable χ2 for calculating the p-value and for
defining the critical region of the test, we call the test a χ2-test. The critical regions in terms of χ2 are
summarized in Table 10-4.
Two-tailed      $\chi^2 > \chi^2_{\alpha/2}$ or $\chi^2 < \chi^2_{1-\alpha/2}$
F-test
When, in the testing of hypotheses, we use the random variable F for calculating the p-value and for
defining the critical region of the test, we call the test an F-test. The critical regions in terms of F are
summarized in Table 10-5.
Two-tailed      $F > F_{\alpha/2}(n_1 - 1, n_2 - 1)$ or $F < F_{1-\alpha/2}(n_1 - 1, n_2 - 1)$,
                i.e. $F < 1 / F_{\alpha/2}(n_2 - 1, n_1 - 1)$
$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$
3. The population is normal and the population standard deviation, σ, is unknown, but the sample
standard deviation, S, is known and the sample size, n, is large enough. The formula for calculating
the test statistic Z in this case is
$$Z = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$
Cases in Which the Test Statistic is t
1. The population is normal and the population standard deviation, σ, is unknown, but the sample
standard deviation, S, is known and the sample size, n, is small.
2. The population is not normal and the population standard deviation, σ, is unknown, but the sample
standard deviation, S, is known and the sample size, n, is large enough.
The formula for calculating the test statistic t in both these cases is
$$t = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$
The degrees of freedom for this t is (n-1).
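The t statistic can be computed as in the following sketch. The numbers are illustrative assumptions: a sample of n = 16 with mean 12.2 and S = 0.40, tested against μ0 = 12 in a right-tailed test, with the tabled critical value t0.05 = 1.753 for 15 degrees of freedom entered by hand (Python's standard library has no t-distribution).

```python
from math import sqrt

def t_statistic(sample_mean, mu0, s, n):
    """t = (X-bar - mu0) / (S / sqrt(n)), with n - 1 degrees of freedom."""
    return (sample_mean - mu0) / (s / sqrt(n))

n = 16
t = t_statistic(12.2, 12.0, 0.40, n)
df = n - 1
T_CRITICAL = 1.753            # tabled t_0.05 for 15 df (assumed input)
reject_h0 = t > T_CRITICAL    # right-tailed critical region
print(round(t, 3), df, reject_h0)
```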
The Binomial distribution can be used whenever we are able to calculate the necessary binomial
probabilities. When the Binomial distribution is used, the number of successes X serves as the test
statistic. It is conveniently applicable to problems where sample size, n, is small and p0 is neither very
close to 0 nor to 1.
$$Z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 (1 - p_0)}{n}}}$$
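A short numerical sketch of this Z statistic follows. The figures (60 successes in a sample of 400, tested against p0 = 0.12) are hypothetical and serve only to show the arithmetic; `proportion_z` is an illustrative name.

```python
from math import sqrt

def proportion_z(successes, n, p0):
    """Z = (p-hat - p0) / sqrt(p0 * (1 - p0) / n) for a large-sample proportion test."""
    p_hat = successes / n                           # sample proportion
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # standard error uses p0, not p-hat

z = proportion_z(60, 400, 0.12)  # hypothetical data
print(round(z, 3))
```

The standard error is computed from p0 because the test statistic is evaluated under the assumption that the null hypothesis is true.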
$$\chi^2 = \frac{(n - 1) S^2}{\sigma_0^2}$$
The degrees of freedom for this χ2 is (n - 1).
When we want to arrive at some conclusion about the difference between two population
means, we draw one sample from each of the populations. The samples drawn may be dependent
on each other or they may be independent of each other.
In many situations, we can design our test in such a way that the samples drawn are dependent
on each other and our observations come from two populations and are paired in some way. In
general, when possible, it is often advisable to pair the observations, as this makes the
experiment more precise. We can see the advantage of pairing observations with the help of an
example.
Consider a sales manager who wants to know if display at point of purchase helps in increasing the
sales of his product. He may design the experiment in two ways:
Design I: He picks up a sample of, say 12, retail shops with no display at point of purchase. Similarly
he picks up a sample of, say 10, retail shops with display at point of purchase. He will note his
observations from both samples independently of each other.
Design II: He picks up a random sample of, say 11, retail shops and notes down the observations about
weekly sales in each of these shops. Next he introduces display at point of purchase at each of these
shops and again observes the weekly sales in them.
Obviously design II is much better, as it tends to remove much of the extraneous variation in sales:
the variation in the location of the soap, experimental conditions and other extraneous factors. Now,
after eliminating the effect of all other major factors, we can attribute the difference only to the
'treatment' we are studying, the display at point of purchase.
Let us label the two populations as 1 and 2. Under the situation of paired observations, it is easy to see
that the variable in which we are interested is the difference between the two observations, i.e.
$d = x_1 - x_2$. In other words, our two-population comparison test is reduced to a hypothesis test about
one parameter, the difference between the means of the two populations, i.e. $\mu_d = \mu_1 - \mu_2$.
Thus the null hypothesis can be any of the three usual forms:
H0: $\mu_1 - \mu_2 = \mu_{d_0}$ or $\mu_d = \mu_{d_0}$    two-tailed test
$$t = \frac{\bar{d} - \mu_{d_0}}{S_d/\sqrt{n}}$$
The degrees of freedom for this t is (n-1)
Cases in Which the Test Statistic is Z
The sample size, n, is large and/or we happen to know the population standard deviation of the
difference, σd. The formula for calculating the test statistic Z is
$$Z = \frac{\bar{d} - \mu_{d_0}}{S_d/\sqrt{n}} \quad \text{or} \quad Z = \frac{\bar{d} - \mu_{d_0}}{\sigma_d/\sqrt{n}}$$
10.2.8.2 Independent Samples
When independent random samples are taken, the sample sizes need not be the same for both populations.
Let us label the two populations as 1 and 2, so that
μ1 and μ2 denote the two population means.
σ1 and σ2 denote the two population standard deviations
n1 and n2 denote the two sample sizes
$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
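The Z statistic for two independent samples can be sketched as follows. The input values (sample means 308 and 254, population standard deviations 84 and 67, both sample sizes 100, and a claimed difference of 45) are used here purely as an illustration, and `two_sample_z` is an illustrative name.

```python
from math import sqrt

def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2, diff0=0.0):
    """Z for the difference of two means with known population standard deviations."""
    standard_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    return ((mean1 - mean2) - diff0) / standard_error

z = two_sample_z(308, 254, 84, 67, 100, 100, diff0=45)
print(round(z, 3))
```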
Cases in Which the Test Statistic is t
The populations are normal; the population standard deviations, σ1 and σ2, are unknown, but the sample
standard deviations, S1 and S2, are known. The formula for calculating the test statistic t depends on two
subcases:
Subcase I: σ1 and σ2 are believed to be equal (although unknown)
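$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_0}{\sqrt{S_p^2 \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
where $S_p^2$ is the pooled variance of the two samples, which serves as the estimator of the common
population variance:
$$S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$$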
The degrees of freedom for this t is (n1 +n2 -2).
Subcase II: σ1 and σ2 are believed to be unequal (although unknown)
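$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_0}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$$
The degrees of freedom for this t is given by:
$$df = \frac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\dfrac{\left(S_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(S_2^2/n_2\right)^2}{n_2 - 1}}$$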
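The two subcases can be compared side by side in code. The sketch below uses hypothetical sample figures (means 52 and 48, sample standard deviations 6 and 9, sample sizes 12 and 10); the function names `pooled_t` and `unequal_variance_t` are illustrative.

```python
from math import sqrt

def pooled_t(mean1, mean2, s1, s2, n1, n2, diff0=0.0):
    """Subcase I: equal (unknown) variances; returns (t, degrees of freedom)."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)  # pooled variance
    t = ((mean1 - mean2) - diff0) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

def unequal_variance_t(mean1, mean2, s1, s2, n1, n2, diff0=0.0):
    """Subcase II: unequal (unknown) variances; returns (t, degrees of freedom)."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = ((mean1 - mean2) - diff0) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# Hypothetical data, for illustration only
t_pooled, df_pooled = pooled_t(52.0, 48.0, 6.0, 9.0, 12, 10)
t_unequal, df_unequal = unequal_variance_t(52.0, 48.0, 6.0, 9.0, 12, 10)
print(round(t_pooled, 3), df_pooled)
print(round(t_unequal, 3), round(df_unequal, 2))
```

The degrees of freedom in Subcase II are generally not a whole number; when reading a t-table they are usually rounded down, which is the conservative choice.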
TESTING FOR DIFFERENCE BETWEEN POPULATION PROPORTIONS
We will consider the large-sample tests for the difference between population proportions. For 'large
enough' sample sizes the distribution of the two sample proportions and also the distribution of the
difference between the two sample proportions is approximated well by a normal distribution. This
gives rise to Z-test for comparing the two population proportions. Let us assume independent random
sampling from the two populations, labeled as 1 and 2, so that
p1 and p2 denote the two population proportions
n1 and n2 denote the two sample sizes
The formula for calculating the test statistic Z depends on two cases.
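Case I: When $(p_1 - p_2)_0 = 0$, i.e. the claimed difference between the two population proportions is zero
$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
where
$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$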
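Case II: When $(p_1 - p_2)_0 \neq 0$, i.e. the claimed difference between the two population proportions is
not zero
$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)_0}{\sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}}$$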
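Both cases can be handled by one routine that switches the standard error according to the claimed difference. This is a minimal sketch with hypothetical counts (60 successes out of 200 in sample 1, and 45 out of 180 in sample 2); `two_proportion_z` is an illustrative name.

```python
from math import sqrt

def two_proportion_z(x1, n1, x2, n2, diff0=0.0):
    """Large-sample Z for the difference between two population proportions."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    if diff0 == 0:
        # Case I: pool the samples, since H0 claims a common proportion
        p_pooled = (x1 + x2) / (n1 + n2)
        se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    else:
        # Case II: proportions differ under H0, so estimate each variance separately
        se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    return ((p1_hat - p2_hat) - diff0) / se

z_case1 = two_proportion_z(60, 200, 45, 180)               # claimed difference zero
z_case2 = two_proportion_z(60, 200, 45, 180, diff0=0.02)   # claimed difference 0.02
print(round(z_case1, 3), round(z_case2, 3))
```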
TESTING FOR EQUALITY OF TWO POPULATION VARIANCES
Many a time, we may be interested in comparing the degree of variability or dispersion of two different
populations. Here the problem essentially involves testing the equality of two population variances. Let
us assume independent random sampling from the two populations, labeled as 1 and 2, so that
$$F_{(n_1 - 1,\, n_2 - 1)} = \frac{S_1^2}{S_2^2}$$
The degrees of freedom for this F is (n1-1, n2-1)
If there is a difference in the two test results, explain the reason for the difference.
Solution: (a)
1. The null and alternative hypotheses:
H0: μ ≥ 2,000
H1: μ < 2,000
The test is a left-tailed test
2. Level of significance: α = 5% or 0.05
3. Test statistic: Z; as the population standard deviation is known and sample size is greater than 30
4. Critical region: Z < -Z0.05, where Z0.05 = 1.645
5. Computations:
$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{1{,}999.6 - 2{,}000}{1.30/\sqrt{40}} = -1.95$$
6. Conclusion: We reject the null hypothesis at α =0.05 since Z = -1.95 < -Z0.05 = -1.645.
(b) Since the population is normally distributed, the test statistic is once again Z
$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{1{,}999.6 - 2{,}000}{1.30/\sqrt{20}} = -1.38$$
Conclusion: We do not reject the null hypothesis at α=0.05 since Z = -1.38 > -Z0.05 = -1.645
(c) In the first case we could reject the null hypothesis but in the second we could not, although in
both cases the sample mean was the same. The reason is that in the first case the sample size was larger
and therefore the evidence against the null hypothesis was more reliable. This produced a smaller p-
value in the first case.
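The effect of the sample size on the two results can be verified with a short sketch. It uses the figures of this example (sample mean 1,999.6, μ0 = 2,000, σ = 1.30) with n = 40 and n = 20; the function name `left_tailed_test` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def left_tailed_test(sample_mean, mu0, sigma, n):
    """Return (Z, p-value) for a left-tailed Z-test with known sigma."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    return z, NormalDist().cdf(z)   # p-value is the area to the left of Z

z_40, p_40 = left_tailed_test(1999.6, 2000, 1.30, 40)
z_20, p_20 = left_tailed_test(1999.6, 2000, 1.30, 20)
print(round(z_40, 2), round(p_40, 4))
print(round(z_20, 2), round(p_20, 4))
```

With the same sample mean, the larger sample gives a smaller standard error, a Z further from zero, and hence a smaller p-value.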
Example 10-2
An automobile manufacturer substitutes a different engine in cars that were known to have an average
miles-per-gallon rating of 31.5 on the highway. The manufacturer wants to test whether the new engine
changes the miles-per-gallon rating of the automobile model. A random sample of 100 trial runs gives
X = 29.8 miles per gallon and S = 6.6 miles per gallon. Using the 0.05 level of significance, is the
average miles-per-gallon rating on the highway for cars using the new engine different from the rating
for cars using the old engine?
Solution:
1. The null and alternative hypotheses:
H0: μ = 31.5
H1: μ ≠ 31.5
The test is a two-tailed test
2. Level of significance: α = 5% or 0.05
3. Test statistic: Z; as the sample standard deviation is known and sample size is greater than 30
4. Critical region: Z > Z0.025 or Z < -Z0.025, where Z0.025 = 1.96
5. Computations:
$$Z = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} = \frac{29.8 - 31.5}{6.6/\sqrt{100}} = -2.58$$
6. Conclusion: We reject the null hypothesis at α = 0.05 since Z = -2.58 < -Z0.025 = -1.96. So we
conclude that the average miles-per-gallon rating on the highway for cars using the new engine is
different from the rating for cars using the old engine.
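The same conclusion can be reached through the p-value. The sketch below assumes the figures of this example (sample mean 29.8, μ0 = 31.5, S = 6.6, n = 100) and doubles the tail area because the test is two-tailed; `two_tailed_test` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_test(sample_mean, mu0, s, n):
    """Return (Z, two-tailed p-value) for a large-sample Z-test."""
    z = (sample_mean - mu0) / (s / sqrt(n))
    p_value = 2 * NormalDist().cdf(-abs(z))   # both tails count as evidence against H0
    return z, p_value

z, p_value = two_tailed_test(29.8, 31.5, 6.6, 100)
print(round(z, 3), round(p_value, 4), p_value < 0.05)
```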
Example 10-3
Sixteen oil tins are taken at random from an automatic filling machine. The mean weight of the tins is
12.2 kg, with a standard deviation of 0.40 kg. Can we conclude that the filling machine is wasting oil by
filling more than the intended weight of 12 kg, at a significance level of 5%?
Solution:
1. The null and alternative hypotheses:
H0: μ ≤ 12
H1: μ > 12
The test is a right-tailed test
2. Level of significance: α = 5% or 0.05
3. Test statistic: t; as the sample standard deviation is known and sample size is small.
4. Critical region: t > t0.05, where t0.05 for 15 df = 1.7530
5. Computations: p0 = 0.5; n = 20
$$\text{p-value} = 2\,P(X \le 8) = 2 \sum_{X=0}^{8} {}^{n}C_X\, p_0^{X} (1 - p_0)^{n - X} = 2 \sum_{X=0}^{8} {}^{20}C_X\, (0.5)^{X} (1 - 0.5)^{20 - X} = 0.5034$$
6. Conclusion: We cannot reject the null hypothesis at α = 0.05 since p-value > α. So the sample does
not provide sufficient evidence that the coin is unfair.
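The two-tailed binomial p-value in step 5 can be reproduced exactly with the standard library. The sketch below is illustrative and rests on an assumption: it uses n = 20 trials with 8 observed successes, because that is the combination for which 2·P(X ≤ 8) equals the reported 0.5034. The helper `binomial_cdf` is a hypothetical name.

```python
from math import comb

def binomial_cdf(k, n, p):
    """P(X <= k) for a binomial random variable with parameters n and p."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k + 1))

n = 20          # assumed number of tosses (see the note above)
observed = 8    # observed number of successes
p0 = 0.5        # hypothesized probability of success

# Two-tailed p-value: double the lower-tail probability
p_value = 2 * binomial_cdf(observed, n, p0)
alpha = 0.05
reject_h0 = p_value < alpha

print(f"p-value = {p_value:.4f}; reject H0: {reject_h0}")
```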
Example 10-5
Solution:
1. The null and alternative hypotheses:
H0: p ≤ 0.02
H1: p > 0.02
The test is a right-tailed test
2. Level of significance: α = 5% or 0.05
3. Test statistic: Poisson random variable X, since p0 is very small and the sample size is large enough
to use the Poisson approximation of the binomial distribution.
4. Critical region: p-value < α
5. Computations:
$\chi^2 = \dfrac{(n - 1)S^2}{\sigma_0^2} = \dfrac{30 \times 1.62}{1} = 48.6$
6. Conclusion: We reject the null hypothesis at α =0.05 since χ2=48.6 > χ20.05 = 43.7729. So we
conclude that there is sufficient evidence to reject the claim of the company.
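The chi-square statistic for a single variance is simple enough to verify directly. In this sketch the sample size n = 31 and the hypothesized variance of 1 are inferred from the terms of the computation (n - 1 = 30 and a denominator of 1), and the critical value 43.7729 is the tabulated figure quoted in the conclusion. The function name is a hypothetical helper.

```python
def variance_chi_square(n, sample_variance, hypothesized_variance):
    """Chi-square statistic (n - 1) * S^2 / sigma0^2 with n - 1 degrees of freedom."""
    return (n - 1) * sample_variance / hypothesized_variance

chi2 = variance_chi_square(n=31, sample_variance=1.62, hypothesized_variance=1.0)

chi2_critical = 43.7729          # tabulated chi-square 0.05 value for 30 df, as quoted above
reject_h0 = chi2 > chi2_critical # right-tailed decision rule

print(f"chi-square = {chi2:.1f}; reject H0: {reject_h0}")
```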
Example 10-8
A sales manager wants to know if display at point of purchase helps in increasing the sales of his
product. He notes the following observations:
Shop No.                            1     2     3     4     5     6     7     8     9    10    11
Sales before display             4500  5275  7235  6844  5991  6672  4943  7615  6128  5623  5154
Sales after display              4834  5010  7562  6957  6401  6423  5334  8004  6729  6277  5769
Difference (d = before - after)  -334   265  -327  -113  -410   249  -391  -389  -581  -654  -615
d̄ = -300; Sd = 312.53
Is there sufficient evidence to conclude that display at point of purchase helps in increasing the sales of
his product?
Solution:
1. The null and alternative hypotheses:
H0: μd ≥ 0
H1: μd < 0
The test is a left-tailed test
2. Level of significance: α = 5% or 0.05
3. Test statistic: t; as the population standard deviation of the difference, σd, is not known and the
sample size, n, is small.
4. Critical region: t < -t0.05 Where t0.05 for 10 df =1.812
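Given the summary figures of Example 10-8 (mean difference -300, Sd = 312.53, n = 11), the paired t statistic can be worked out as in the sketch below. It is illustrative only: `paired_t` is a hypothetical helper, and the critical value 1.812 is the tabulated t0.05 for 10 df quoted in step 4.

```python
from math import sqrt

def paired_t(mean_diff, sd_diff, n):
    """t statistic for H0: mu_d = 0, computed from summary statistics of the differences."""
    return mean_diff / (sd_diff / sqrt(n))

# Summary statistics of d = (sales before) - (sales after), as given in the example
t = paired_t(mean_diff=-300, sd_diff=312.53, n=11)

t_critical = 1.812               # tabulated t0.05 for n - 1 = 10 degrees of freedom
reject_h0 = t < -t_critical      # left-tailed decision rule

print(f"t = {t:.2f}; reject H0: {reject_h0}")
```

With these figures the statistic lies in the critical region, so the data would support the view that the display increases sales.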
Solution:
1. The null and alternative hypotheses:
H0: μ1 − μ2 ≤ 45
H1: μ1 − μ2 > 45
The test is a right-tailed test
2. Level of significance: α = 5% or 0.05
3. Test statistic: Z
4. Critical region: Z > Z0.05 Where Z0.05 =1.645
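5. Computations: X̄1 = 308; X̄2 = 254; σ1 = 84; σ2 = 67; n1 = n2 = 100
$Z = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \dfrac{308 - 254 - 45}{\sqrt{\dfrac{84^2}{100} + \dfrac{67^2}{100}}} = 0.838$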
6. Conclusion: We cannot reject the null hypothesis at α = 0.05 since Z = 0.838 < Z0.05 = 1.645. In fact
the observed value of the test statistic falls in the non-rejection region of our right-tailed test at any
conventional level of significance. So we must conclude that there is insufficient evidence to
support Duracell's claim.
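The two-sample Z statistic used above can be sketched as follows. The function name `two_sample_z` is hypothetical, and the critical value 1.645 is the tabulated Z0.05 quoted in step 4.

```python
from math import sqrt

def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2, hypothesized_diff=0.0):
    """Z statistic for the difference of two means with known population standard deviations."""
    standard_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    return ((mean1 - mean2) - hypothesized_diff) / standard_error

z = two_sample_z(mean1=308, mean2=254, sigma1=84, sigma2=67,
                 n1=100, n2=100, hypothesized_diff=45)

z_critical = 1.645           # tabulated Z0.05 for a right-tailed test
reject_h0 = z > z_critical

print(f"Z = {z:.3f}; reject H0: {reject_h0}")
```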
Example 10-10
The following information relates to the prices (in Rs) of a product in two cities A and B.
                     City A   City B
Mean price             22       17
Standard deviation      5        6
The observations related to prices are made for 9 months in city A and for 11 months in city B. Test at
0.01 level whether there is any significant difference between prices in two cities, assuming
(a) σ1² = σ2² (b) σ1² ≠ σ2²
Solution:
1. The null and alternative hypotheses:
H0: μ1 − μ2 = 0
H1: μ1 − μ2 ≠ 0
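(a) σ1² = σ2²
$t = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_0}{\sqrt{S_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$, where $S_p^2 = \dfrac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$
$t = \dfrac{22 - 17}{\sqrt{\dfrac{8 \times 25 + 10 \times 36}{18}\left(\dfrac{1}{9} + \dfrac{1}{11}\right)}} = \dfrac{5}{2.51} = 1.99$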
The degrees of freedom for this t are n1 + n2 –2 i.e. 9 +11-2 =18
For 18 df , t0.005 =2.88
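(b) σ1² ≠ σ2²
$t = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_0}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} = \dfrac{22 - 17 - 0}{\sqrt{\dfrac{25}{9} + \dfrac{36}{11}}} = \dfrac{5}{2.46} = 2.03$
$df = \dfrac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\dfrac{(S_1^2/n_1)^2}{n_1 - 1} + \dfrac{(S_2^2/n_2)^2}{n_2 - 1}} = \dfrac{\left(\dfrac{25}{9} + \dfrac{36}{11}\right)^2}{\dfrac{(25/9)^2}{8} + \dfrac{(36/11)^2}{10}} \approx 18$
Against which, t0.005 = 2.88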
6. Conclusion: (a) We cannot reject the null hypothesis at α = 0.01 when σ1² = σ2², since t = 1.99 <
t0.005 = 2.88. (b) We cannot reject the null hypothesis at α = 0.01 when σ1² ≠ σ2², since t = 2.03 <
t0.005 = 2.88.
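Both versions of the two-sample t statistic in Example 10-10 can be reproduced with the short sketch below. The function names are hypothetical helpers, and the approximate degrees of freedom in case (b) follow the formula used in the solution.

```python
from math import sqrt

def pooled_t(mean1, mean2, s1, s2, n1, n2):
    """t statistic assuming equal population variances (pooled variance estimate)."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / sqrt(pooled_var * (1 / n1 + 1 / n2))

def unequal_variance_t(mean1, mean2, s1, s2, n1, n2):
    """t statistic and approximate degrees of freedom when variances are not assumed equal."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = (mean1 - mean2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# City A: mean 22, S = 5, n = 9; City B: mean 17, S = 6, n = 11
t_equal = pooled_t(22, 17, 5, 6, 9, 11)
t_unequal, df_unequal = unequal_variance_t(22, 17, 5, 6, 9, 11)

t_critical = 2.88   # tabulated t0.005 for 18 df, as quoted in the solution
print(f"(a) t = {t_equal:.2f}, reject H0: {abs(t_equal) > t_critical}")
print(f"(b) t = {t_unequal:.2f}, df = {df_unequal:.1f}, reject H0: {abs(t_unequal) > t_critical}")
```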
Example 10-11
A sample survey of tax-payers belonging to business class and professional class yielded the following
results:
Business Class Professional Class
Sample size n1 = 400 n2 = 420
Defaulters in tax payment x1 = 80 x2 = 65
Given these sample data, test the hypothesis at α = 5% that
(a) the defaulters rate is the same for the two classes of tax-payers
(b) the defaulters rate in the case of business class is more than that in the case of professional class by
0.07.
Solution: (a)
1. The null and alternative hypotheses:
H0: p1 − p2 = 0
H1: p1 − p2 ≠ 0
The test is a two-tailed test
2. Level of significance: α = 5% or 0.05
3. Test statistic: Z; since the sample sizes are large enough.
4. Critical region: Z > Z0.025 or Z < -Z0.025, where Z0.025 = 1.96
5. Computations:
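$\hat{p}_1 = \dfrac{x_1}{n_1} = \dfrac{80}{400} = 0.20$; $\hat{p}_2 = \dfrac{x_2}{n_2} = \dfrac{65}{420} = 0.15$; $\hat{p} = \dfrac{x_1 + x_2}{n_1 + n_2} = \dfrac{80 + 65}{400 + 420} = 0.177$
$Z = \dfrac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \dfrac{0.20 - 0.15}{\sqrt{0.177 \times 0.823\left(\dfrac{1}{400} + \dfrac{1}{420}\right)}} = 1.87$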
6. Conclusion: We cannot reject the null hypothesis at α = 0.05 since Z = 1.87 < Z0.025 = 1.96
(b)
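$\hat{p}_1 = \dfrac{x_1}{n_1} = \dfrac{80}{400} = 0.20$; $\hat{p}_2 = \dfrac{x_2}{n_2} = \dfrac{65}{420} = 0.15$
$Z = \dfrac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)_0}{\sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}} = \dfrac{0.20 - 0.15 - 0.07}{\sqrt{\dfrac{0.20 \times 0.80}{400} + \dfrac{0.15 \times 0.85}{420}}} = -0.76$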
6. Conclusion: We cannot reject the null hypothesis at α = 0.05 since Z = -0.76 > -Z0.025 = -1.96.
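The two proportion tests of Example 10-11 can be sketched in Python as shown below. To stay comparable with the hand computation, the sketch feeds in the proportions as rounded in the solution (0.20, 0.15 and the pooled 0.177); unrounded proportions would give slightly different statistics. Both function names are hypothetical helpers.

```python
from math import sqrt

def pooled_proportion_z(p1, p2, p_pooled, n1, n2):
    """Z statistic for H0: p1 - p2 = 0, using the pooled proportion in the standard error."""
    standard_error = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error

def unpooled_proportion_z(p1, p2, n1, n2, hypothesized_diff):
    """Z statistic for H0: p1 - p2 = D with D != 0, using separate sample proportions."""
    standard_error = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return ((p1 - p2) - hypothesized_diff) / standard_error

# Rounded sample proportions as used in the worked solution
z_a = pooled_proportion_z(0.20, 0.15, 0.177, 400, 420)
z_b = unpooled_proportion_z(0.20, 0.15, 400, 420, hypothesized_diff=0.07)

print(f"(a) Z = {z_a:.3f}")
print(f"(b) Z = {z_b:.3f}")
```

Neither statistic exceeds 1.96 in absolute value, so the null hypothesis is not rejected in either part.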
Example 10-12
Use the data of Example 10-10: n1 = 9, n2 = 11 and S1 = 5, S2 = 6 to test the assumption of equal
population variances.
Solution:
1. The null and alternative hypotheses:
H0: σ1² = σ2²
H1: σ1² ≠ σ2²
The test is a two-tailed test
2. Level of significance: α = 5% or 0.05
3. Test statistic: F
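4. Critical region: F(n1−1, n2−1) > Fα/2(n1−1, n2−1) or F(n1−1, n2−1) < F1−α/2(n1−1, n2−1), i.e. F(8,10) >
F0.025(8,10) = 3.85 or F(8,10) < F0.975(8,10) = 0.23
5. Computations: S1 = 5; S2 = 6; n1 = 9; n2 = 11
$F_{(n_1 - 1,\, n_2 - 1)} = F_{(8,10)} = \dfrac{S_1^2}{S_2^2} = \dfrac{25}{36} = 0.694$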
6. Conclusion: We cannot reject the null hypothesis at α = 0.05 since F(8,10) < F0.025(8,10) = 3.85 and
F(8,10) > F0.975(8,10) = 0.23. So the sample evidence supports the view that the two populations do not
have different variances.
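The variance-ratio test reduces to a single division followed by a comparison with two tabulated values. The sketch below is illustrative; `variance_ratio` is a hypothetical helper, and the critical values 3.85 and 0.23 are the ones quoted in the conclusion.

```python
def variance_ratio(s1, s2):
    """F statistic S1^2 / S2^2 for comparing two population variances."""
    return s1**2 / s2**2

f_stat = variance_ratio(5, 6)    # degrees of freedom (8, 10)

f_upper = 3.85                   # tabulated F0.025(8, 10)
f_lower = 0.23                   # tabulated F0.975(8, 10)
reject_h0 = f_stat > f_upper or f_stat < f_lower

print(f"F = {f_stat:.3f}; reject H0: {reject_h0}")
```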
10.4 CHECK YOUR PROGRESS
1. It is more precise to say "We cannot ………………… H0 at an α of 5%" rather than "We accept H0."
2. In general, other things remaining the same, ………………the value of α will decrease the
probability of type II error.
3. In the case of a left-tailed test, the p-value is the area to the ……………… of the calculated value
of the test statistic.
4. The value of β tends to ………………….. as µ1 moves nearer to µ0.
5. When the null hypothesis is about a population proportion, the test statistic can be either the
…………. or its Poisson or Normal approximation.
10.5 SUMMARY
A hypothesis is something that has not yet been proven to be true. It is some statement about a
population parameter or about a population distribution. This statement is tentative as it implies some
assumption, which may or may not be found valid on verification. Hypothesis testing is the process of
determining whether or not a given hypothesis is true. If the population is large, there is no way of
analyzing the population or of testing the hypothesis directly. Instead, the hypothesis is tested on the
basis of the outcome of a random sample. In any testing of hypotheses problem, we are faced with a pair
of hypotheses such that one and only one of them is always true. One of this pair is called the null
hypothesis and the other one the alternative hypothesis. Almost daily we compare products, services,
investment opportunities, management styles and so on. In all such situations we are interested in the
comparisons of two populations with respect to some population parameter - the population mean, the
population proportion, or the population variance.
10.6 KEYWORDS
Null Hypothesis: A null hypothesis is an assertion about the value of a population parameter. It
is an assertion that we hold as true unless we have sufficient statistical evidence to conclude
otherwise.
Alternative Hypothesis: The alternative hypothesis is the negation of the null hypothesis.
Type I Error: In the context of statistical testing, the wrong decision of rejecting a true null hypothesis
is known as Type I Error.
Type II Error: The wrong decision of accepting (not rejecting, to be more accurate) a false null
hypothesis is known as Type II Error.
P-value: The probability of observing a sample statistic as extreme as the one observed if the null
hypothesis is true.
10.7 SELF-ASSESSMENT TEST
1. What is a Hypothesis? Explain how Hypothesis Testing is useful to management?
2. What are Null and Alternative hypotheses? How will you set up null and alternative hypotheses
under the following conditions:
(a) A pharmaceutical company claims that four out of five doctors prescribe the pain medicine it
produces. You wish to test this claim.
(b) A manufacturer of golf balls claims that the variance of the weights of the company's golf balls
is controlled within 0.0028 oz². You wish to test this claim.
(c) A medicine is effective only if the concentration of a certain chemical in it is at least 200 parts
per million (ppm). At the same time the medicine would produce an undesirable side effect if
the concentration of the same chemical exceeds 200 parts per million (ppm). You wish to test
the concentration of the chemical in the medicine.
3. What are Type I and Type II Errors in hypothesis testing? Explain the relationship between the two
types of errors.
4. What is a Test Statistic? Why do we have to know the distribution of the test statistic? What are the
commonly used test statistics in hypotheses testing?
5. Distinguish between a One-tailed and Two-tailed test, give a diagram and an example in each case.
6. What is the p-value of a test? How is it calculated? Find the p-value of (a) a left-tailed,
(b) a right-tailed, and (c) a two-tailed test if
(i) In the test, the test statistic Z = -1.86. In which of these three cases will H0 be rejected at an α of
5%?
(ii) In the test, the test statistic Z = 1.75. In which of these three cases will H0 be rejected at an α of
5%?
7. What do you mean by Level of Significance of a test? "Level of significance should be specified
after due consideration to the costs associated with Type I and Type II errors". Explain this
statement.
8. What do you mean by Critical Region and Acceptance Region of a test?
9. What is the Power of a hypothesis test? Why is it important? How is the power of a hypothesis test
related to
(a) the significance level?
(b) the sample size?
(c) the actual value of the parameter?
10. Consider the use of metal detectors in airports to test people for concealed weapons. In essence, this
is a form of hypothesis testing.
(a) What are the null and alternative hypotheses?
(b) What are type I and type II errors in this case?
(c) Which type of error is more costly?
(d) Based on your answer to part (c), what value of α would you recommend for this test?
(e) If the sensitivity of the metal detector is increased, how would the probabilities of type I and
type II errors be affected?
(f) If α is to be increased, should the sensitivity of the metal detector be increased or decreased?
11. When planning a hypothesis test, what should be done if the probabilities of both type I and type II
errors are to be small?
12. "Not rejecting a Null hypothesis" is a more precise term than "Accepting a Null
hypothesis". Do you agree with this statement? Explain.
13. What steps are involved in statistical testing of a hypothesis?
14. A company is engaged in the packaging of a superior quality tea in jars of 500gm each. The
company is of the view that as long as the jars contain 500gm of tea, the process is under control.
The standard deviation of the process is 50gm. A sample of 225 jars is taken at random and the
sample average is found to be 510 gm. Has the process gone out of control?
15. A sample of size 400 was drawn and the sample mean found to be 99. Test, at 5% level of
significance, whether this sample could have come from a normal population with mean 100 and
variance 64.
16. A manufacturer of a new motorcycle claims for it an average mileage of 60 km/liter under city
conditions. However, the average mileage in 16 trials is found to be 57 km, with a standard
deviation of 2 km. Is the manufacturer's claim justified?
17. In a big city, 450 men out of a sample of 850 men were found to be smokers. Does this information,
at 5% level of significance, support the view that the majority of men in this city are smokers?
18. A stock-broker claims that she can predict with 85% accuracy whether a stock's market value will
rise or fall during the coming month. Test the stock-broker's claim at 5% level of significance if, as
a test, she predicts the outcome of 6 stocks and is correct in 5 of the predictions.
19. A company engaged in manufacturing of radio tubes finds that the life of its tubes has a variance of
0.7 years. As a result of some qualitative improvement brought about in the product, the company
claims that the variance of the life of its tubes has reduced. If the sample variance, S², on
observation of 9 tubes is found to be 0.55 years, test the claim of the company at (a) 5% level of
significance (b) 1% level of significance.
20. Seven persons were appointed in officer cadre in an organisation. Their performance was evaluated
by giving a test and the marks were recorded out of 100. They were given two-month training and
another test was held and marks were recorded out of 100.
Officer: a b c d e f g
Score Before Training: 80 76 92 60 70 56 74
Score After Training: 84 70 96 80 70 52 84
Can it be concluded that the training has benefited the employees? Use 5% level of significance.
21. The makers of Philips bulb want to demonstrate that their bulb lasts on an average of at least 100
hours longer than Philips' main competitor, Surya. Two independent random samples of 100 bulbs
of each kind are selected. The sample average lives for Philips and Surya bulbs are found to be X̄1
= 1232 hours and X̄2 = 1016 hours respectively. Assume σ1 = 84 hours and σ2 = 67 hours. Is
there evidence to substantiate Philips' claim that its bulbs last, on an average, at least 180 hours
longer than Surya bulb of the same size?
22. Consider the following data:
Sample A Sample B
Sample Mean 100 105
Standard Deviation 16 24
Sample Size 800 1600
Test, at 5% level of significance, the difference between means of two populations from which
samples are taken.
23. The following information relates to the wages (in Rs) of mill workers in two cities A and B.
City A City B
Mean wage 40 34
Standard deviation 5 6
The observations related to wages are for 8 workers in city A and for 10 workers in city B. Test
at 0.01 level whether there is any significant difference between wages in two cities, assuming
(a) σ1² = σ2² (b) σ1² ≠ σ2²
24. Test market result of two advertisements A and B, yielded the following results:
A B
Who saw the Advertisements n1 = 200 n2 = 220
Who tried the Product x1 = 40 x2 = 35
Given the data, test the hypotheses at α = 5% that
(a) both the advertisements are equally effective
(b) advertisement A is more effective than advertisement B by more than 0.05
Effectiveness of the advertisements are measured as proportion of viewers who tried the
product.
25. Use the data of Problem 23: n1 = 8, n2 = 10 and S1 = 5, S2 = 6 to test the assumption of equal
population variances.
2. Increasing
3. Left
4. Increase
5. Binomial random variable
NON-PARAMETRIC TESTS
STRUCTURE
11.1 Learning Objectives
11.2 Introduction
11.2.1 Sign tests
11.2.2 The two-sample and K-sample Median Tests
11.2.3 Wilcoxon matched-pairs test (or Signed Rank Test)
11.2.4 The Mann-Whitney U Test
11.2.5 The Kruskal-Wallis Test
11.2.6 The Spearman's rank correlation test
11.2.7 Tests of Randomness: Runs Above and Below the Median
11.2.8 Kolmogorov-Smirnov One-sample Test
11.3 Check your Progress
11.4 Summary
11.5 Keywords
11.6 Self- Assessment Test
11.7 Answers to check your progress
11.8 References/Suggested readings
11.2 INTRODUCTION
In contrast to parametric tests, non-parametric tests do not require any assumptions about the parameters
or about the nature of population. It is because of this that these methods are sometimes referred to as
the distribution free methods. Most of these methods, however, are based upon the weaker assumptions
that observations are independent and that the variable under study is continuous with approximately
symmetrical distribution. In addition to this, these methods do not require measurements as strong as
those required by parametric methods. Most of the non-parametric tests are applicable to data measured
in an ordinal or nominal scale. As opposed to this, the parametric tests are based on data measured at
least in an interval scale. The measurements obtained on interval and ratio scale are also known as high
level measurements.
Level of measurement
1. Nominal scale: This scale uses numbers or other symbols to identify the groups or classes to which
various objects belong. These numbers or symbols constitute a nominal or classifying scale. For
example, classification of individuals on the basis of sex (male, female) or on the basis of level of
education (matric, senior secondary, graduate, post graduate), etc. This scale is the weakest of all
the measurements.
2. Ordinal scale: This scale uses numbers to represent some kind of ordering or ranking of objects.
However, the differences of numbers, used for ranking, don't have any meaning. For example, the
top 4 students of class can be ranked as 1, 2, 3, 4, according to their marks in an examination.
3. Interval scale: This scale also uses numbers such that these can be ordered and their differences
have a meaningful interpretation.
4. Ratio scale: A scale possessing all the properties of an interval scale along with a true zero point is
called a ratio scale. It may be pointed out that a zero point in an interval scale is arbitrary. For
example, freezing point of water is defined at 0° Celsius or 32° Fahrenheit, implying thereby that
the zero on either scale is arbitrary and doesn't represent total absence of heat. In contrast to this,
the measurement of distance, say in metres, is done on a ratio scale. The term ratio is used here
because ratio comparisons are meaningful. For example, a distance of 100 km is four times as long
as a distance of 25 km, while 100°F does not mean that it is twice as hot as 50°F.
It should be noted here that a test that can be performed on ordinal or nominal measurements can always
be performed on high level measurements but not vice-versa. However, if along with the high
level measurements the conditions of a parametric test are also met, the parametric test should
invariably be used because this test is most powerful in the given circumstances.
From the above, we conclude that a non-parametric test should be used when either the conditions about
the parent population are not met or the level of measurements is inadequate for a parametric test.
Advantages
The non-parametric tests have gained popularity in recent years because of their usefulness in certain
circumstances. Some advantages of non-parametric tests are mentioned below:
1. Non-parametric tests require less restrictive assumptions vis-à-vis a comparable parametric test.
2. These tests often require very few arithmetic computations.
3. There is no alternative to using a non-parametric test if the data are available in ordinal or nominal
scale.
4. None of the parametric tests can handle data made up of samples from several populations without
making unrealistic assumptions. However, there are suitable non-parametric tests available to
handle such data.
Disadvantages
1. It is often said that non-parametric tests are less efficient than the parametric tests because they tend
to ignore a greater part of the information contained in the sample. In spite of this, it is argued that
although the non-parametric tests are less efficient, a researcher using them has more confidence in
his methodology than he would have if he had to adhere to the unsubstantiated assumptions inherent in
parametric tests.
2. The non-parametric tests and their accompanying tables of significant values are widely scattered in
various publications. As a result of this, the choice of most suitable method, in a given situation,
may become a difficult task.
Now, the question before us is whether 7 plus signs observed in 15 trials support the null hypothesis p =
0.5 or the alternative hypothesis p ≠ 0.5. Using the binomial probability tables for n = 15 and p = 0.5,
we find that the probability of 7 or more successes is 0.196 + 0.196 + 0.153 + 0.092 + 0.042 + 0.014 +
0.003 = 0.696. Since this value is greater than α/2 = 0.025, we find that the null
hypothesis cannot be rejected. We can also use the normal approximation to the binomial distribution
when np ≥ 5. As here p = ½, the condition for the normal approximation to the binomial distribution is
satisfied as n > 10. As such, we can use the Z statistic for which the following formula is to be used.
$Z = \dfrac{X - np}{\sqrt{npq}} = \dfrac{X - n/2}{\sqrt{n/4}} = \dfrac{7 - 15/2}{\sqrt{15/4}} = \dfrac{(14 - 15)/2}{1.9365} = \dfrac{-0.5}{1.9365} = -0.26$
Since calculated Z = -0.26 lies between Z = - 1.96 and Z = 1.96 (the critical values of Z at 0.05 level of
significance), the null hypothesis cannot be rejected.
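Both routes taken above, the exact binomial tail probability and the normal approximation, can be checked with the sketch below. It assumes n = 15 trials and 7 plus signs, the values that appear in the Z computation; `upper_tail` is a hypothetical helper name.

```python
from math import comb, sqrt

def upper_tail(k, n, p=0.5):
    """P(X >= k) for a binomial random variable with parameters n and p."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

n = 15            # number of non-zero differences (trials)
plus_signs = 7    # observed number of plus signs

# Exact route: compare the tail probability with alpha / 2
tail_probability = upper_tail(plus_signs, n)
alpha = 0.05
reject_exact = tail_probability < alpha / 2

# Normal approximation: Z = (X - n/2) / sqrt(n/4) when p = 1/2
z = (plus_signs - n * 0.5) / sqrt(n * 0.25)
reject_normal = abs(z) > 1.96

print(f"P(X >= 7) = {tail_probability:.3f}; Z = {z:.2f}")
print(f"reject H0 (exact): {reject_exact}; reject H0 (normal): {reject_normal}")
```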
Example 2: Suppose we have the following table indicating the ratings assigned to two brands of
cold drink X and Y by 12 consumers. Each respondent was asked to taste the two brands of cold drink
and then rate them.
We have to apply the two-sample sign test. H0 being both brands enjoy equal preference.
Solution: Row three of Table 11.1 shows + and – signs. When X's rating is higher than that of Y, then
the third row shows the '+' sign. As against this, when X's rating is lower than that of Y, then it shows
the '-' sign. The table shows 10 plus signs and 2 minus signs. Now, we have to examine whether '10
successes in 12 trials' supports the null hypothesis p = ½ or the alternative hypothesis p > ½. The null
hypothesis implies that both the brands enjoy equal preferences and none is better than the other. The
alternative hypothesis is that the brand X is better than brand Y. Referring to the binomial probabilities
table, we find that for n = 12 and p = ½ the probability of '10 or more successes' is 0.016 + 0.003 =
0.019. Since this is less than 0.05, the null hypothesis can be rejected at the α = 0.05 level of
significance. We can, therefore, conclude that brand X is a preferred brand as compared to brand Y.
Example 3: To illustrate the second case, which relates to two independent samples, let us consider the
following data pertaining to the downtimes (periods in which computers were inoperative on account of
failures, in minutes) of two different computers. We have to apply the two-sample sign test.
Computer A 58 60 42 62 65 59 60 52 50 75 59
52 57 30 46 66 40 78 55 52 58 44
Computer B 32 48 50 41 45 40 43 43 70 60 80
45 36 56 40 70 50 53 50 30 42 45
Solution: These data are shown in Table 11.2 along with + or – sign as may be applicable in case of
each pair of values. A plus sign is assigned when the downtime for computer A is greater than that for
computer B and a minus sign is given when the downtime for computer B is greater than that for
computer A.
Table 11.2: Downtime of computers A and B (Minutes)
Computer A   58  60  42  62  65  59  60  52  50  75  59
Computer B   32  48  50  41  45  40  43  43  70  60  80
Sign          +   +   -   +   +   +   +   +   -   +   -
Computer A   52  57  30  46  66  40  78  55  52  58  44
Computer B   45  36  56  40  70  50  53  50  30  42  45
Sign          +   +   -   +   -   -   +   +   +   +   -
It will be seen that there are 15 plus signs and 7 minus signs. Thus, we have to ascertain whether '15
successes in 22 trials' supports the null hypothesis p = ½. The null hypothesis implies that the true
average downtime is the same for both the computers A and B. The alternative hypothesis is p ≠ ½.
Let us use in this case the normal approximation of the binomial distribution. This can be done since np
and n (1 – p) are both equal to 11 in this example. Substituting n = 22 and p = ½ into the formulas for
the mean and the standard deviation of the binomial distribution, we get µ = np = 22 (½) = 11 and
σ = √[np (1 - p)] = √(22 × ½ × ½) = 2.345
Hence, Z = (X – µ)/σ = (15 – 11)/2.345 = 1.71
Since this value of 1.71 falls between – Z0.025 = - 1.96 and Z0.025 = 1.96, we find that the null hypothesis
cannot be rejected. This means that we cannot conclude that the downtimes of the two computers differ.
This seems to be surprising as we find that there are substantial differences. The two sample means, for
example, are 55.5 for A and 48.6 for B. This example illustrates the point that at times the sign test can
be quite a waste of information. It may also be noted that had the continuity correction been used, we
would have obtained:
Z = 3.5/2.345 = 1.49
This would not have changed our earlier conclusion.
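The sign counts and both Z values can be recomputed directly from the downtime data of Table 11.2. The sketch below is illustrative; variable names are arbitrary, and the continuity correction simply moves the observed count half a unit towards the mean.

```python
from math import copysign, sqrt

computer_a = [58, 60, 42, 62, 65, 59, 60, 52, 50, 75, 59,
              52, 57, 30, 46, 66, 40, 78, 55, 52, 58, 44]
computer_b = [32, 48, 50, 41, 45, 40, 43, 43, 70, 60, 80,
              45, 36, 56, 40, 70, 50, 53, 50, 30, 42, 45]

# Plus sign when A's downtime exceeds B's, minus sign in the opposite case
plus = sum(a > b for a, b in zip(computer_a, computer_b))
minus = sum(a < b for a, b in zip(computer_a, computer_b))
n = plus + minus                      # tied pairs, if any, are dropped

mu = n * 0.5                          # mean of the binomial under H0: p = 1/2
sigma = sqrt(n * 0.5 * 0.5)           # standard deviation under H0

z = (plus - mu) / sigma
# Continuity correction: shrink the deviation from the mean by 0.5
z_corrected = copysign(abs(plus - mu) - 0.5, plus - mu) / sigma

print(f"plus = {plus}, minus = {minus}, Z = {z:.2f}, corrected Z = {z_corrected:.2f}")
```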
Our null hypothesis H0 is that there is no difference in the median downtime for the two computers. The
alternative hypothesis H1 is that there is difference in the downtime of the two computers.
We now calculate the expected frequencies by the formula (Rowi × Columni)/Grand total. Thus, Table
11.4 shows both the observed and the expected frequencies. Of course, we could have obtained these
results by arguing that half the values in each sample can be expected to fall above the median and the
other half below it.
Table 11.4. Calculation of chi-square
Observed          Expected          O – E    (O – E)²    (O – E)²/E
frequencies (O)   frequencies (E)
      5                10             -5        25          2.50
     15                10              5        25          2.50
     16                11              5        25          2.27
      6                11             -5        25          2.27
Total                                                       9.54
$\chi^2 = \sum_i \dfrac{(O_i - E_i)^2}{E_i} = 9.54$
The critical value of χ² at 0.05 level of significance for (2 – 1) (2 – 1) = 1 degree of freedom is 3.841
(χ²-Table). Since the calculated value of χ² exceeds the critical value, the null hypothesis has to be
rejected. In other words, there is evidence to suggest that the median downtime is not the same for the
two computers.
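The expected frequencies and the chi-square value of Table 11.4 can be recomputed with the sketch below. Two assumptions are made: the four observed counts are arranged as a 2 × 2 table in the order they are listed, and the second observed count is taken as 15, the value implied by O – E = 5 with E = 10.

```python
# Observed 2 x 2 table (assumed arrangement of the four cells of Table 11.4)
observed = [[5, 15],
            [16, 6]]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency of each cell = (row total x column total) / grand total
expected = [[r * c / grand_total for c in column_totals] for r in row_totals]

chi_square = sum((o - e) ** 2 / e
                 for obs_row, exp_row in zip(observed, expected)
                 for o, e in zip(obs_row, exp_row))

critical_value = 3.841   # tabulated chi-square at the 0.05 level for 1 degree of freedom
reject_h0 = chi_square > critical_value

print(f"expected = {expected}, chi-square = {chi_square:.3f}, reject H0: {reject_h0}")
```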
It may be recalled that in the previous example having the same data, the null hypothesis could not be
rejected. In contrast, we find here that the two-sample median test has led to the rejection of the null
hypothesis. This may be construed as evidence that the median test is not quite as wasteful of the
information as the sign test. However, in general, it is very difficult to make a meaningful comparison
of the merits of two or more non-parametric tests, which can be used for the same purpose.
values in each sample fall above or below the median. Finally, we analyse the resulting contingency
table by the method of chi-square. Let us take an example.
Example 4: Suppose that we are given the following data relating to marks obtained by students in
Statistics in the three different sections of an MBA class in G.J.U. Hisar. The maximum marks were 100.
Section A 46 60 58 80 66 39 56 61 81 70
75 48 43 64 57 59 87 50 73 62
Section B 60 55 82 70 46 63 88 69 61 43
76 54 58 65 73 52
Section C 74 67 37 80 72 92 19 52 70 40
83 76 68 21 90 74 49 70 65 58
Test whether the differences among the three sample means are significant.
Solution: In case of such problems, analysis of variance is ordinarily performed. However, here we find
that the data for Section C have much more variability as compared to the data for the other two
sections. In view of this, it would be wrong to assume that the three population standard deviations are
the same. This means that the method of one-way analysis of variance cannot be used.
In order to perform a median test, we should first determine the median of the combined data. This
comes out to 63.5, as can easily be checked. Then we count how many of the marks in each sample fall
below or above the median. Thus, the results obtained are shown in Table 11.5.
Table 11.5. Worksheet for calculating chi-square
Below median Above median
Section A 12 8
Section B 9 7
Section C 7 13
Since the corresponding expected frequencies for Section A are 10 and 10, for Section B are 8 and 8,
and for Section C 10 and 10, we can obtain the value of chi-square. These calculations are shown
below:
$\chi^2 = \dfrac{(12 - 10)^2}{10} + \dfrac{(8 - 10)^2}{10} + \dfrac{(9 - 8)^2}{8} + \dfrac{(7 - 8)^2}{8} + \dfrac{(7 - 10)^2}{10} + \dfrac{(13 - 10)^2}{10} = 2.85$
Now, we have to compare this value with the critical value of χ² at 5 per cent level of significance. This
value is 5.991 for 2 (K – 1 = 3 – 1) degrees of freedom (Chi-square Table). As the calculated value of χ²
is less than the critical value, the null hypothesis cannot be rejected. In other words, we cannot conclude
that there is a difference in the true average (median) marks obtained by the students in Statistics test
from the three sections.
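The whole K-sample median test of Example 4 can be carried out from the raw marks, as in the following sketch. It is illustrative only: the dictionary layout and variable names are arbitrary, and the critical value 5.991 is the tabulated figure quoted above.

```python
from statistics import median

sections = {
    "A": [46, 60, 58, 80, 66, 39, 56, 61, 81, 70,
          75, 48, 43, 64, 57, 59, 87, 50, 73, 62],
    "B": [60, 55, 82, 70, 46, 63, 88, 69, 61, 43,
          76, 54, 58, 65, 73, 52],
    "C": [74, 67, 37, 80, 72, 92, 19, 52, 70, 40,
          83, 76, 68, 21, 90, 74, 49, 70, 65, 58],
}

# Step 1: median of the combined data
combined = [mark for marks in sections.values() for mark in marks]
overall_median = median(combined)

# Step 2: count marks below and above the combined median in each section
counts = {name: (sum(m < overall_median for m in marks),
                 sum(m > overall_median for m in marks))
          for name, marks in sections.items()}

# Step 3: chi-square with expected frequency = row total x column total / grand total
below_total = sum(below for below, _ in counts.values())
above_total = sum(above for _, above in counts.values())
grand_total = below_total + above_total

chi_square = 0.0
for below, above in counts.values():
    row_total = below + above
    expected_below = row_total * below_total / grand_total
    expected_above = row_total * above_total / grand_total
    chi_square += (below - expected_below) ** 2 / expected_below
    chi_square += (above - expected_above) ** 2 / expected_above

critical_value = 5.991   # tabulated chi-square at the 5 per cent level for 2 degrees of freedom
reject_h0 = chi_square > critical_value

print(f"median = {overall_median}, counts = {counts}, chi-square = {chi_square:.2f}")
```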
11.2.3 Wilcoxon matched-pairs test (or Signed Rank Test)
Wilcoxon matched-pairs test is an important non-parametric test, which can be used in various
situations in the context of two related samples, such as a study where husband and wife are matched or
where the output of two similar machines is compared. In such cases we can determine both the direction
and the magnitude of the difference between matched values, using Wilcoxon matched-pairs test.
The procedure involved in using this test is simple. To begin with, the difference (d) between each pair
of values is obtained. These differences are assigned ranks from the smallest to the largest, ignoring
signs. The actual signs of differences are then put to corresponding ranks and the test statistic T is
calculated, which happens to be the smaller of the two sums, namely, the sum of the negative ranks and
the sum of the positive ranks.
There may arise two types of situations while using this test. One situation may arise when the two
values of some matched-pair(s) is/are equal as a result the difference (d) between the values is zero. In
such a case, we do not consider the pair(s) in the calculations. The other situation may arise when we
get the same difference (d) in two or more pairs. In such a case, ranks are assigned to such pairs by
averaging their rank positions. For instance, if two pairs have rank score of 8, then each pair is assigned
8.5 rank [(8 + 9)/2 = 8.5] and the next largest pair is assigned the rank 10.
After omitting the tied pairs, if the number of matched pairs is equal to or less than 25,
then the table of critical values of T is used for testing the null hypothesis. When the calculated value of T
is equal to or smaller than the table (i.e. critical) value at the desired level of significance, then the null
hypothesis is rejected. In case the number exceeds 25, the sampling distribution of T is taken as
approximately normal with mean µT = n (n + 1)/4 and standard deviation
σT = √[n (n + 1) (2n + 1)/24]
where n is taken as the number of given matched pairs minus the number of tied pairs omitted, if any.
In such a situation, the test statistic Z is worked out as follows:
Z = (T – µT)/σT
Let us now take an example to illustrate the application of Wilcoxon matched-pairs test.
Example 5: The management of the Punjab National Bank wants to test the effectiveness of an
advertising company that is intending to enhance the awareness of the bank‘s service features. It
administered a questionnaire before the advertising campaign, designed to measure the awareness of
services offered. After the advertising campaign, the bank administered the same questionnaire to the
same group of people. Both the before and after advertising campaign scores are given in the following
table.
Consumer awareness of bank services offered
Consumer             1   2   3   4   5   6   7   8   9  10
Before ad campaign  82  81  89  74  68  80  77  66  77  75
After ad campaign   87  84  84  76  78  81  79  81  81  83
Using Wilcoxon matched-pairs test, test the hypothesis that there is no difference in awareness of
services offered after the advertising campaign.
Solution:
Table 11.6. Application of Wilcoxon matched-pairs test
Consumer   After ad   Before ad   Diff. di   Rank of |di|   Rank with (-) sign   Rank with (+) sign
           campaign   campaign
    1         87         82           5          6.5                                    6.5
    2         84         81           3          4                                      4
    3         84         89          -5          6.5              -6.5
    4         76         74           2          2.5                                    2.5
    5         78         68          10          9                                      9
    6         81         80           1          1                                      1
    7         79         77           2          2.5                                    2.5
    8         81         66          15         10                                     10
    9         81         77           4          5                                      5
   10         83         75           8          8                                      8
Total                                                             -6.5                +48.5
Null hypothesis H0: There is no difference in the awareness of bank services after the ad campaign.
Alternative hypothesis H1: There is a difference in the awareness of bank services after the ad
campaign.
The computed 'T' value is 6.5, the smaller of the two rank sums in absolute value. The critical value
of T for n = 10 at 5 per cent level of significance is 8 (Area Table). Since the computed T value is less
than the critical T value, the null hypothesis is rejected. We can conclude that after the ad campaign
there is a difference in the consumer awareness of the bank's services. This conclusion needs some
explanation. Had there been no difference in the awareness before and after the ad campaign, the sums
of positive and negative ranks would have been almost equal. However, the larger the difference
between the two series being compared, the smaller the value of T tends to be, as it is defined as the
smaller of the two rank sums. This is the case we find in this problem. It may be noted that with this
test the calculated value of T must be smaller than the critical value in order to reject the null
hypothesis.
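The ranking steps of Table 11.6 can be sketched in a few lines of Python. This is only an illustrative sketch: the helper `average_ranks` and the variable names are assumptions introduced here, not part of the text, and tied values are given the mean of the ranks they occupy, as in the table.

```python
before = [82, 81, 89, 74, 68, 80, 77, 66, 77, 75]
after = [87, 84, 84, 76, 78, 81, 79, 81, 81, 83]

def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their ranks."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1   # rank of the first occurrence of v
        ties = ordered.count(v)        # how many values are tied at v
        ranks.append(first + (ties - 1) / 2)
    return ranks

# Differences (after minus before); zero differences would be dropped.
diffs = [a - b for a, b in zip(after, before) if a != b]
ranks = average_ranks([abs(d) for d in diffs])

positive_sum = sum(r for d, r in zip(diffs, ranks) if d > 0)
negative_sum = sum(r for d, r in zip(diffs, ranks) if d < 0)
t_stat = min(positive_sum, negative_sum)   # Wilcoxon T: smaller rank sum

print(positive_sum, negative_sum, t_stat)
```

The rank sums agree with the totals of Table 11.6, and `t_stat` is the value compared with the critical value of 8.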
Let us now take another situation where the null hypothesis is true and the scores for the two groups are
sampled from identical populations. If we were to rank all N scores regardless of the group, we would
expect a mix of low and high ranks in each group. Thus, the sum of the ranks assigned to Group 1
would be broadly equal to the sum of the ranks assigned to Group 2.
The Mann-Whitney test is based on the logic just described, using the sum of the ranks in one of the
groups as the test statistic. In case that sum turns out to be too small as compared to the other sum, the
null hypothesis is rejected. The common practice is to take the sum of the ranks assigned to the smaller
group, or if n1 = n2, the smaller of the two sums as the test statistic. This value is then compared with the
critical value that can be obtained from the table of the Mann-Whitney statistic (Ws) to test the null
hypothesis.
Let us take an example to illustrate the application of this test.
Example 6: The following data indicate the lifetime (in hours) of samples of two kinds of light bulbs in
continuous use:
Brand A 603 625 641 622 585 593 660 600 633 580 613 648
Brand B 620 640 646 620 652 639 590 646 631 669 610 619
We are required to use the Mann-Whitney test to compare the lifetimes of brands A and B light bulbs.
Solution: The first step for performing the Mann-Whitney test is to rank the given data jointly (as if
they were one sample) in an increasing or decreasing order of magnitude. For our data, we thus obtain
the following array where we use the letters A and B to denote whether the light bulb was from brand A
or brand B.
Table 11.7. Ranking of light bulbs of brands A and B
Sample score  Group  Rank    Sample score  Group  Rank
580           A      1       625           A      13
585           A      2       631           B      14
590           B      3       633           A      15
593           A      4       639           B      16
600           A      5       640           B      17
603           A      6       641           A      18
As both the samples come from identical populations, it is reasonable to assume that the means of the
ranks assigned to the values of the two samples are more or less the same. As such, our null hypothesis
is:
H0: Means of ranks assigned to the values in the two groups are the same.
H1: Means are not the same.
However, instead of using the means of the ranks, we shall use rank sums for which the following
formula will be used.
U = n1n2 + [n1(n1 + 1)]/2 – R1
Where n1 and n2 are the sample sizes of Group 1 and Group 2, respectively, and R1 is the sum of the
ranks assigned to the values of the first sample. In our example, we have n1 = 12, n2 = 12 and R1 = 1 + 2
+ 4 + 5 + 6 + 8 + 12 + 13 + 15 + 18 + 21 + 23 = 128. Substituting these values in the above formula,
U = (12) (12) + [12 (12 + 1)]/2 – 128
= 144 + 78 – 128
= 94
From Appendix Table 9, the critical value of U for n1 and n2, each equal to 12, at 0.05 level of
significance is 37. Since the critical value is smaller than the calculated value of 94, we accept the
null hypothesis and conclude that there is no difference in the average lifetimes of the two brands of
light bulbs.
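The joint ranking and the value of U can be reproduced with a short Python sketch. The variable names and the helper `avg_rank` are assumptions made for illustration; tied lifetimes (620 and 646 each occur twice in brand B) are given the mean of the ranks they occupy.

```python
brand_a = [603, 625, 641, 622, 585, 593, 660, 600, 633, 580, 613, 648]
brand_b = [620, 640, 646, 620, 652, 639, 590, 646, 631, 669, 610, 619]

# Rank the two samples jointly, as if they were one sample.
pooled = sorted(brand_a + brand_b)

def avg_rank(value):
    """Rank of value in the pooled sample; ties share the mean rank."""
    first = pooled.index(value) + 1
    return first + (pooled.count(value) - 1) / 2

n1, n2 = len(brand_a), len(brand_b)
r1 = sum(avg_rank(v) for v in brand_a)   # rank sum of brand A
r2 = sum(avg_rank(v) for v in brand_b)   # rank sum of brand B

# U = n1*n2 + n1(n1 + 1)/2 - R1, the formula used in the text
u = n1 * n2 + n1 * (n1 + 1) / 2 - r1

print(r1, r2, u)
```

As a check, the two rank sums add up to 24 × 25/2 = 300, the sum of the ranks 1 to 24.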
The test statistic we have just applied is suitable when n1 and n2 are less than or equal to 25. For larger
values of n1 and/or n2, we can make use of the fact that the distribution of Ws approaches a normal
distribution as sample sizes increase. We can then use the Z test to test the hypothesis.
1. Mean = n1n2/2 = (12 × 12)/2 = 72
2. Standard error
\[ = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}} = \sqrt{\frac{12 \times 12\,(12 + 12 + 1)}{12}} = \sqrt{300} = 17.3 \]
3. Z = (Statistic – Mean)/Standard error
= (94 – 72)/17.3 = 1.27
The critical value of Z at 0.05 level of significance is 1.64. Since the calculated value of Z = 1.27 is
smaller than 1.64, the null hypothesis is accepted. This shows that there is no difference in average
lifetimes of brands A and B bulbs. The Z test is more dependable as compared to the earlier one. It may
be noted that the Mann-Whitney test requires fewer assumptions than the corresponding standard test. In
fact, the only assumption required is that the populations from which samples have been drawn are
continuous. In actual practice, even when this assumption turns out to be wrong, this is not regarded
as serious.
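The normal approximation above can be checked numerically. The sketch below simply takes U = 94 and n1 = n2 = 12 from the worked example; the variable names are illustrative assumptions.

```python
import math

n1 = n2 = 12
u = 94                                   # U computed in Example 6

mean_u = n1 * n2 / 2                     # mean of U under H0
se_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # standard error of U
z = (u - mean_u) / se_u

print(mean_u, round(se_u, 1), round(z, 2))
```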
The Test Statistic: The computation of the test statistic follows a procedure that is very similar to the
Mann-Whitney Wilcoxon test.
(i) Rank all the n1 + n2 + … + nk = n observations, arrayed in ascending order.
(ii) Find R1, R2, … Rk, where Ri is the sum of ranks of the ith sample.
The test statistic, denoted by H, is given by
\[ H = \frac{12}{n(n+1)}\left[\frac{R_1^2}{n_1} + \frac{R_2^2}{n_2} + \dots + \frac{R_k^2}{n_k}\right] - 3(n + 1) \]
It can be shown that the distribution of H is \(\chi^2\) with k – 1 d.f., when the size of each sample is
at least 5. Thus, if H > \(\chi^2_{k-1}\), H0 is rejected.
Example 7: To compare the effectiveness of three types of weight-reducing diets, a homogeneous
group of 22 women was divided into three sub-groups and each sub-group followed one of these diet
plans for a period of two months. The weight reductions, in kgs, were noted as given below:
Diet plan  Weight reduction (kg)
I          4.3  3.2  2.7  6.2  5.0  3.9
II         5.3  7.4  8.3  5.5  6.7  7.2  8.5
III        1.4  2.1  2.7  3.1  1.5  0.7  4.3  3.5  0.3
Use the Kruskal-Wallis test to test the hypothesis that the effectiveness of the three weight-reducing
diet plans is the same at 5% level of significance.
Solution:
It is given that n1 = 6, n2 = 7 and n3 = 9.
The total number of observations is 6 + 7 + 9 = 22. These are ranked in their ascending order as given
below:
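Diet plan  Ranks                                           Rank sum
I          12.5  9   6.5  17  14  11                       70
II         15    20  21   16  18  19  22                   131
III        3     5   6.5  8   4   2   12.5  10  1          52
Substituting R1 = 70, R2 = 131, R3 = 52 and n = 22 in the formula for H, we get
\[ H = \frac{12}{22 \times 23}\left[\frac{70^2}{6} + \frac{131^2}{7} + \frac{52^2}{9}\right] - 3(22 + 1) = 84.63 - 69 = 15.63 \]
The tabulated value of \(\chi^2\) at 2 d.f. and 5% level of significance is 5.99. Since H is greater than
this value, H0 is rejected at 5% level of significance.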
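The ranking and the H statistic for Example 7 can be reproduced with a minimal Python sketch. The dictionary layout and the helper `avg_rank` are assumptions introduced for illustration; tied weight reductions receive the mean of the ranks they occupy.

```python
diets = {
    "I": [4.3, 3.2, 2.7, 6.2, 5.0, 3.9],
    "II": [5.3, 7.4, 8.3, 5.5, 6.7, 7.2, 8.5],
    "III": [1.4, 2.1, 2.7, 3.1, 1.5, 0.7, 4.3, 3.5, 0.3],
}

# Rank all n observations together in ascending order.
pooled = sorted(v for group in diets.values() for v in group)
n = len(pooled)

def avg_rank(value):
    """Rank of value in the pooled sample; ties share the mean rank."""
    first = pooled.index(value) + 1
    return first + (pooled.count(value) - 1) / 2

rank_sums = {name: sum(avg_rank(v) for v in group) for name, group in diets.items()}

# H = 12 / (n(n + 1)) * sum(R_i^2 / n_i) - 3(n + 1)
h = 12 / (n * (n + 1)) * sum(
    rank_sums[name] ** 2 / len(group) for name, group in diets.items()
) - 3 * (n + 1)

print(rank_sums, round(h, 2))
```

The resulting H is then compared with the tabulated chi-square value of 5.99 for 2 d.f.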
From the given data, we can find \(d_i = R_{1i} - R_{2i}\) and then \(\sum d_i^2 = 154\).
\[ r_s = 1 - \frac{6 \times 154}{12 \times 143} = 0.46 \quad \text{and} \quad z = 0.46\sqrt{11} = 1.53. \]
Since the value of z is less than 1.645, there is no evidence against H0 at 5% level of significance.
Hence, the correlation in population cannot be regarded as positive.
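The arithmetic of the rank correlation test above can be sketched as follows. The values n = 12 and a sum of squared rank differences of 154 are read off the formula as printed (12 × 143 = n(n² – 1)); the underlying ranks are not reproduced here, so the sketch starts from these two numbers.

```python
import math

n = 12            # number of pairs, inferred from 12 x 143 = n(n^2 - 1)
sum_d2 = 154      # sum of squared rank differences used in the formula

r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))   # Spearman's rank correlation
z = r_s * math.sqrt(n - 1)                  # large-sample test statistic

print(round(r_s, 2), round(z, 2))
```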
Example 9: Suppose we have the following series of 29 college students. After performing a set of
study exercises, increases in their pulse rate were recorded as follows:
22, 23, 21, 25, 33, 32, 25, 30, 17, 20, 26, 12, 21, 20, 27, 24, 28, 14, 29, 23, 22, 36, 25, 21, 23, 19, 17, 26
and 26.
We have to test the randomness of these data.
Solution: First, we have to calculate the median of this series. If we arrange these values in an
ascending order, we find that the size of (n +1)/2th item, that is, 13th item is 24. Thus, the median is 24.
As one value is equal to 24, we omit it and get the following arrangement of a's and b's, where a
stands for an item greater than (or above) the median and b stands for an item lower than (or below) the
median:
bbb aaaaa bb a bbb aa b a bb aa bbbb aa
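On the basis of this arrangement, we find that n1 (the number of a's) = 13, n2 (the number of b's) = 15,
and the number of runs u = 12. We get
\[ \mu_r = \frac{2 n_1 n_2}{n_1 + n_2} + 1 = \frac{2 \times 13 \times 15}{13 + 15} + 1 = \frac{390}{28} + 1 = 14.93 \]
\[ \sigma_u = \sqrt{\frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}} = \sqrt{\frac{(2 \times 13 \times 15)(2 \times 13 \times 15 - 13 - 15)}{(13 + 15)^2 (13 + 15 - 1)}} \]
\[ = \sqrt{\frac{390 \times 362}{(28)^2 (27)}} = \sqrt{\frac{141180}{21168}} = \sqrt{6.6695} = 2.58 \]
\[ Z = \frac{u - \mu_r}{\sigma_u} = \frac{12 - 14.93}{2.58} = \frac{-2.93}{2.58} = -1.14 \]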
Since Z = -1.14 falls between –Z0.025 = -1.96 and Z0.025 = 1.96, the null hypothesis cannot be rejected at
the level of significance α = 0.05. We can, therefore, conclude that the randomness of the original
sample cannot be questioned.
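The counting of runs in Example 9 can be sketched in Python. The sketch takes 24 as the dividing value, as the solution above does, and drops the one observation equal to it; the variable names are assumptions made for illustration.

```python
import math

pulse = [22, 23, 21, 25, 33, 32, 25, 30, 17, 20, 26, 12, 21, 20, 27,
         24, 28, 14, 29, 23, 22, 36, 25, 21, 23, 19, 17, 26, 26]
cut = 24   # dividing value used in the solution above

# 'a' = above the dividing value, 'b' = below; values equal to it are omitted.
signs = ["a" if x > cut else "b" for x in pulse if x != cut]
n1, n2 = signs.count("a"), signs.count("b")

# A new run starts whenever the symbol changes.
runs = 1 + sum(1 for prev, cur in zip(signs, signs[1:]) if prev != cur)

mean_r = 2 * n1 * n2 / (n1 + n2) + 1
sd_r = math.sqrt(2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
                 / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
z = (runs - mean_r) / sd_r

print(n1, n2, runs, round(mean_r, 2), round(sd_r, 2), z)
```

Without intermediate rounding, Z comes out at about -1.13 rather than -1.14; either way it lies well inside ±1.96.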
It may be noted that this test is particularly useful in detecting trends and cyclic patterns in a series. If
there is a trend, there would be first mostly a's and later mostly b's or vice versa. In case of a repeated
cyclic pattern, there would be a systematic alternation of a's and b's and probably, too many runs.
We have been asked to use the Kolmogorov-Smirnov test to test the hypothesis that there is no
difference in importance ratings for durability among the respondents.
Solution: In order to apply the Kolmogorov-Smirnov test to the above data, first of all we should have
the cumulative frequency distribution from the sample. Second, we have to establish the cumulative
frequency distribution, which would be expected on the basis of the null hypothesis. Third, we have to
determine the largest absolute deviation between the two distributions mentioned above. Finally, this
value is to be compared with the critical value to ascertain its significance.
Table 11.8 shows the calculations.
Table 11.8: Worksheet for the Kolmogorov-Smirnov D
Importance of       Observed  Observed    Observed    Null        Null        Absolute difference,
durability          number    proportion  cumulative  proportion  cumulative  observed and null
                                          proportion              proportion
Very important      50        0.25        0.25        0.2         0.2         0.05
Somewhat important  60        0.30        0.55        0.2         0.4         0.15
From Table 11.8, we find that the largest absolute difference is 0.15, which is known as the
Kolmogorov-Smirnov D value. For a sample size of more than 35, the critical value of D at α = 0.05
is 1.36/√n. As the sample size in this example is 200, the critical value of D = 1.36/√200 = 0.096. As
the calculated D exceeds the critical value of 0.096, the null hypothesis that there is no difference in
importance ratings for durability among the respondents is rejected.
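The worksheet logic can be sketched in Python. Only the two categories reproduced in Table 11.8 are used here, so `d_stat` is the largest difference over those rows only; the null proportion of 0.2 per category and the sample size of 200 are taken from the text, and the variable names are illustrative assumptions.

```python
import math

n = 200                       # sample size stated in the text
observed_counts = [50, 60]    # only the two categories shown in Table 11.8
null_proportion = 0.2         # equal share per category under H0, as in the table

cumulative = 0
differences = []
for i, count in enumerate(observed_counts, start=1):
    cumulative += count
    observed_cum = cumulative / n        # observed cumulative proportion
    null_cum = null_proportion * i       # cumulative proportion under H0
    differences.append(abs(observed_cum - null_cum))

d_stat = max(differences)                # largest absolute difference
critical = 1.36 / math.sqrt(n)           # large-sample critical value at 0.05

print(round(d_stat, 2), round(critical, 3), d_stat > critical)
```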
Although there are a number of non-parametric tests, we have presented some of the more frequently
used tests in this chapter. While using these tests, we must know that the advantages we derive by
limiting our assumptions may be offset by the loss in the power of such tests. When the basic
assumptions required for parametric tests are valid, the use of non-parametric tests may lead us to
accept a false null hypothesis, and thus we may commit a Type II error. We have to consider this aspect
very carefully before deciding in favour of non-parametric tests. It may be reiterated that such tests
are more suitable in case of ranked, scaled or rated data.
In contrast to parametric tests, non-parametric tests do not require any assumptions about the parameters
or about the nature of population. It is because of this that these methods are sometimes referred to as
the distribution free methods. Most of these methods, however, are based upon the weaker assumptions
that observations are independent and that the variable under study is continuous with approximately
symmetrical distribution. In addition to this, these methods do not require measurements as strong as
that required by parametric methods. Most of the non-parametric tests are applicable to data measured
in an ordinal or nominal scale. As opposed to this, the parametric tests are based on data measured at
least in an interval scale. The measurements obtained on interval and ratio scale are also known as high
level measurements. It should be noted here that a test that can be performed on ordinal or nominal
measurements can always be performed on high level measurements but not vice-versa.
However, if along with the high level measurements the conditions of a parametric test are also met, the
parametric test should invariably be used because this test is most powerful in the given circumstances.
From the above, we conclude that a non-parametric test should be used when either the conditions about
the parent population are not met or the level of measurements is inadequate for a parametric test.
Non-parametric tests have both advantages and disadvantages. The tests available include the sign test,
the Wilcoxon test, the Kolmogorov-Smirnov test, etc.
11.5 KEYWORDS
Non-parametric tests: Tests that rely less on parameter estimation and/or assumptions about the shape
of a population distribution.
One-Sample Runs test: A non-parametric test used for determining whether the items in a sample have
been selected randomly.
Run: A sequence of identical occurrences that may be preceded and followed by different occurrences.
At times, they may not be preceded or followed by any occurrences.
Sign test: A non-parametric test that takes into account the difference between paired observations
where plus (+) and minus (-) signs are substituted for quantitative values.
Theory of runs: A theory concerned with the testing of samples for the randomness of the order in
which they have been selected.
Wilcoxon Matched-pairs Test (or Signed Rank Test): A non-parametric test that can be used in
various situations in the context of two related samples.
Kolmogorov-Smirnov test: A non-parametric test that is concerned with the degrees of agreement
between a set of observed ranks (sample values) and a theoretical frequency distribution.
Kruskal-Wallis test: A non-parametric method for testing the null hypothesis that K independent
random samples come from identical populations. It is a direct generalisation of the Mann-Whitney test.
Mann-Whitney U test: A non-parametric test that is used to determine whether two different samples
come from identical populations or whether these populations have different means.
6. The proprietor of a small business computed his average earnings per day over a period of 12 days.
For each day, an L was recorded if the earnings were less than the average, otherwise an M was
recorded. These data are given below:
LLLLMMLLLLMM
7. In a metropolitan city, a city bus service was scheduled to reach a major bus stop at 11 a.m. each
day. If the bus reached that stop within 5 minutes of 11 a.m. it was considered to be on time. Over a
15-day period, an A was recorded if the bus was on time, otherwise a B was recorded. The picture
that emerged after 15 days was as follows:
AABABBABAABBBAA
8. The following data show employees‘ rate of substandard performance before and after a new
incentive scheme. Determine whether the introduction of the new incentive scheme has reduced the
substandard performance at 0.05 level of significance.
Before 7 8 5 9 10 6 5 9 6 8
After 5 6 7 6 8 7 6 6 5 7
9. A company manufacturing electronic toys has recently been taken over by another company.
Prior to the takeover of the company, certain workers were approached to ascertain their
satisfaction levels. The same workers were again approached to know their satisfaction level after
the takeover of the company. The two sets of data are given below.
Before 69 73 58 76 82 65 75 64 87 70
After 65 75 63 75 82 68 71 65 85 68
Using an appropriate test, find out whether there has been an improvement in the satisfaction level
of workers after the takeover of their company by a new company.
10. The following data relate to the costs of building comparable lots in the two resorts A and B (in
million rupees):
Resort A 30.9 32.5 44.3 39.5 35.0 48.9
Resort B 53.9 61.0 36.0 42.5 40.9 47.9
The company owning the resort area A claimed that the median price of building lots was less in
area A as compared to resort area B. You are asked to test this claim, using a nonparametric test
with a 1 per cent level of significance.
11. On 15 different days, A had to wait for the city bus to reach his office as shown below:
17, 12, 18, 20, 25, 30, 10, 13, 7, 10, 9, 11, 5, 11 and 20 minutes.
Use the sign test at 5 per cent level of significance to test the bus company‘s claim that on an
average A should not have to wait for more than 13 minutes.
12. A company used three different methods of advertising its product in three cities. It later found the
increased sales (in thousand rupees) in identical retail outlets in the three cities as follows:
City A 70 58 60 45 55 62 80 72
City B 65 57 48 55 75 68 45 52 63
City C 53 59 71 70 63 60 58 75
Use Kruskal-Wallis method to test the hypothesis that the mean increase in sales on account of
three different methods of advertising was the same in the retail outlets in A, B and C cities. Use 5
per cent level of significance.
CORRELATION ANALYSIS
STRUCTURE
12.2 Introduction
12.8 Summary
12.9 Keywords
12.2 INTRODUCTION
...if we have information on more than one variables, we might be interested in seeing if there is any
connection - any association - between them.
Statistical methods of measures of central tendency, dispersion, skewness and kurtosis are helpful for
the purpose of comparison and analysis of distributions involving only one variable i.e. univariate
distributions. However, describing the relationship between two or more variables, is another important
part of statistics.
In many business research situations, the key to decision making lies in understanding the relationships
between two or more variables. For example, in an effort to predict the behavior of the bond market, a
broker might find it useful to know whether the interest rate of bonds is related to the prime interest
rate. While studying the effect of advertising on sales, an account executive may find it useful to know
whether there is a strong relationship between advertising dollars and sales dollars for a company.
The statistical methods of Correlation (discussed in the present lesson) and Regression (to be discussed
in the next lesson) are helpful in knowing the relationship between two or more variables which may be
related in some way, like interest rate of bonds and prime interest rate; advertising expenditure and
sales; income and consumption; crop-yield and fertilizer used; height and weight and so on.
In all these cases involving two or more variables, we may be interested in seeing:
whether there is any association between the variables;
if so, what form the relationship between the two variables takes; and
how we can make use of that relationship for predictive purposes, that is, forecasting.
Since these issues are interrelated, correlation and regression analysis, as two sides of a single process,
consist of methods of examining the relationship between two or more variables. If two (or more)
variables are correlated, we can use information about one (or more) variable(s) to predict the value of
the other variable(s), and can measure the error of estimation, which is the job of regression analysis.
"The correlation between variables is a measure of the nature and degree of association between the
variables".
As a measure of the degree of relatedness of two variables, correlation is widely used in exploratory
research when the objective is to locate variables that might be related in some way to the variable of
interest.
TYPES OF CORRELATION
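Correlation can be classified in several ways. The important ways of classifying correlation are:
(a) Positive and negative,
(b) Linear and non-linear (curvilinear), and
(c) Simple, partial and multiple.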
If both the variables move in the same direction, we say that there is a positive correlation, i.e., if one
variable increases, the other variable also increases on an average or if one variable decreases, the other
variable also decreases on an average.
On the other hand, if the variables are varying in opposite direction, we say that it is a case of negative
correlation; e.g., movements of demand and supply.
If the change in one variable is accompanied by change in another variable in a constant ratio, it is a
case of linear correlation. Observe the following data:
X : 10 20 30 40 50
Y : 25 50 75 100 125
The ratio of change in the above example is the same. It is, thus, a case of linear correlation. If we plot
these variables on graph paper, all the points will fall on the same straight line.
On the other hand, if the amount of change in one variable does not follow a constant ratio with the
change in another variable, it is a case of non-linear or curvilinear correlation. If a couple of figures in
either series X or series Y are changed, it would give a non-linear correlation.
The distinction amongst simple, partial and multiple correlation depends upon the number of variables
involved in a study. If only two variables are involved in a study, then the correlation is said to be simple
correlation. When three or more variables are involved in a study, then it is a problem of either partial or
multiple correlation. In multiple correlation, three or more variables are studied simultaneously. But in
partial correlation we consider only two variables influencing each other while the effect of other
variable(s) is held constant.
Suppose we have a problem comprising three variables X, Y and Z. X is the number of hours studied, Y
is I.Q. and Z is the number of marks obtained in the examination. In a multiple correlation, we will
study the relationship between the marks obtained (Z) and the two variables, number of hours studied
(X) and I.Q. (Y). In contrast, when we study the relationship between X and Z, keeping an average I.Q.
(Y) as constant, it is said to be a study involving partial correlation. In this lesson, we will study linear
correlation between two variables.
The correlation analysis, in discovering the nature and degree of relationship between variables, does
not necessarily imply any cause and effect relationship between the variables. Two variables may be
related to each other but this does not mean that one variable causes the other. For example, we may
find that logical reasoning and creativity are correlated, but that does not mean that if we could increase
people's logical reasoning ability, we would produce greater creativity. We need to conduct an actual
experiment to unequivocally demonstrate a causal relationship. But if it is true that influencing
someone's logical reasoning ability does influence their creativity, then the two variables must be
correlated with each other. In other words, causation always implies correlation; however, the converse
is not true. Let us see some situations:
1. The correlation may be due to chance particularly when the data pertain to a small sample. A small
sample bivariate series may show the relationship but such a relationship may not exist in the
universe.
2. It is possible that both the variables are influenced by one or more other variables. For example,
expenditure on food and entertainment for a given number of households show a positive
relationship because both have increased over time. But, this is due to rise in family incomes over the
same period. In other words, the two variables have been influenced by another variable, the increase
in family incomes.
3. There may be another situation where both the variables may be influencing each other so that we
cannot say which is the cause and which is the effect. For example, take the case of price and
demand. The rise in price of a commodity may lead to a decline in the demand for it. Here, price is
the cause and the demand is the effect. In yet another situation, an increase in demand may lead to a
rise in price. Here, the demand is the cause while price is the effect, which is just the reverse of the
earlier situation. In such situations, it is difficult to identify which variable is causing the effect on
which variable, as both are influencing each other.
The foregoing discussion clearly shows that correlation does not indicate any causation or functional
relationship. Correlation coefficient is merely a mathematical relationship and this has nothing to do
with cause and effect relation. It only reveals co-variation between two variables. When there is no
cause-and-effect relationship in a bivariate series and yet one interprets the relationship as causal, such
a correlation is called spurious or non-sense correlation. Obviously, this will be misleading. As such, one
has to be very careful in correlation exercises and look into other relevant factors before concluding a
cause-and-effect relationship.
The commonly used methods for studying linear relationship between two variables involve
both graphic and algebraic methods. Some of the widely used methods include:
1. Scatter Diagram
2. Correlation Graph
The scatter diagram method is also known as Dotogram or Dot diagram. Scatter diagram is one of the
simplest methods of diagrammatic representation of a bivariate distribution. Under this method, both the
variables are plotted on the graph paper by putting dots. The diagram so obtained is called "Scatter
Diagram". By studying the diagram, we can have a rough idea about the nature and degree of relationship
between two variables. The term scatter refers to the spreading of dots on the graph. We should keep the
following points in mind while interpreting correlation:
(i) If the plotted points are very close to each other, it indicates high degree of correlation. If the
plotted points are away from each other, it indicates low degree of correlation.
(ii) If the points on the diagram reveal any trend (either upward or downward), the variables are said to
be correlated and if no trend is revealed, the variables are uncorrelated.
(iii) If there is an upward trend rising from lower left hand corner and going upward to the upper right
hand corner, the correlation is positive since this reveals that the values of the two variables move in
the same direction. If, on the other hand, the points depict a downward trend from the upper left hand
corner to the lower right hand corner, the correlation is negative since in this case the values of the
two variables move in the opposite directions.
(iv) In particular, if all the points lie on a straight line starting from the left bottom and going up
towards the right top, the correlation is perfect and positive, and if all the points lie on a straight line
starting from left top and coming down to right bottom, the correlation is perfect and negative.
The various diagrams of the scattered data in Figure 12.1 depict different forms of correlation.
Example 12-1
Given the following data on sales (in thousand units) and expenses (in thousand rupees) of a firm for 10
months:
Month:     J   F   M   A   M   J   J   A   S   O
Sales:     50  50  55  60  62  65  68  60  60  50
Expenses:  11  13  14  16  16  15  15  14  13  13
a) Make a Scatter Diagram of the data.
b) Do you think that there is a correlation between sales and expenses of the firm? Is it
positive or negative? Is it high or low?
Solution: (a) The Scatter Diagram of the given data is shown in Figure 12.2.
[Figure 12.2: Scatter diagram, with Sales on the horizontal axis and Expenses on the vertical axis]
(b) Figure 12.2 shows that the plotted points are close to each other and reveal an upward trend. So
there is a high degree of positive correlation between sales and expenses of the firm.
The correlation graph method, also known as Correlogram, is very simple. The data pertaining to two
series are plotted on a graph sheet. We can find out the correlation by examining the direction and
closeness of the two curves.
If both the curves drawn on the graph are moving in the same direction, it is a case of positive
correlation. On the other hand, if both the curves are moving in opposite direction, correlation is said to
be negative. If the graph does not show any definite pattern on account of erratic fluctuations in the
curves, then it shows an absence of correlation.
Example 12.2
Find out graphically, if there is any correlation between yield per plot (qtls), denoted by Y, and
quantity of fertilizer used (kg), denoted by X.
Plot No.: 1 2 3 4 5 6 7 8 9 10
Y: 3.5 4.3 5.2 5.8 6.4 7.3 7.2 7.5 7.8 8.3
X: 6 8 9 12 10 15 17 20 18 24
[Figure: Correlation graph, with X and Y on the vertical axis plotted against Plot Number on the
horizontal axis]
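\[ \mathrm{Cov}(X, Y) = \frac{1}{N}\sum (X - \bar{X})(Y - \bar{Y}) \tag{4.2a} \]
\[ S_x = \sqrt{\frac{1}{N}\sum (X - \bar{X})^2} \tag{4.2b} \]
and
\[ S_y = \sqrt{\frac{1}{N}\sum (Y - \bar{Y})^2} \tag{4.2c} \]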
Thus by substituting Eqs. (4.2) in Eq. (4.1), we can write the Pearsonian correlation coefficient as
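\[ r_{xy} = \frac{\frac{1}{N}\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\frac{1}{N}\sum (X - \bar{X})^2}\,\sqrt{\frac{1}{N}\sum (Y - \bar{Y})^2}} \]
\[ r_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}} \tag{4.3} \]
If we denote \(d_x = X - \bar{X}\) and \(d_y = Y - \bar{Y}\), then
\[ r_{xy} = \frac{\sum d_x d_y}{\sqrt{\sum d_x^2 \sum d_y^2}} \tag{4.3a} \]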
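Expanding the product terms, we have
\[ \mathrm{Cov}(X, Y) = \frac{1}{N}\sum (X - \bar{X})(Y - \bar{Y}) = \frac{1}{N}\sum XY - \bar{X}\bar{Y} = \frac{1}{N}\sum XY - \frac{\sum X}{N}\cdot\frac{\sum Y}{N} \]
\[ = \frac{1}{N^2}\left[N\sum XY - \sum X \sum Y\right] \tag{4.4} \]
and
\[ S_x^2 = \frac{1}{N}\sum (X - \bar{X})^2 = \frac{1}{N}\sum X^2 - (\bar{X})^2 = \frac{1}{N}\sum X^2 - \left(\frac{\sum X}{N}\right)^2 \]
\[ = \frac{1}{N^2}\left[N\sum X^2 - \left(\sum X\right)^2\right] \tag{4.5a} \]
Similarly, we have
\[ S_y^2 = \frac{1}{N^2}\left[N\sum Y^2 - \left(\sum Y\right)^2\right] \tag{4.5b} \]
Dividing Eq. (4.4) by the square root of the product of Eqs. (4.5a) and (4.5b), we get
\[ r_{xy} = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - \left(\sum X\right)^2\right]\left[N\sum Y^2 - \left(\sum Y\right)^2\right]}} \tag{4.6} \]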
Remark: Eq. (4.3) or Eq. (4.3a) is quite convenient to apply if the means \(\bar{X}\) and \(\bar{Y}\) come
out to be integers. If \(\bar{X}\) or/and \(\bar{Y}\) is (are) fractional then the Eq. (4.3) or Eq. (4.3a) is
quite cumbersome to apply, since the computations of \(\sum (X - \bar{X})^2\), \(\sum (Y - \bar{Y})^2\) and
\(\sum (X - \bar{X})(Y - \bar{Y})\) are quite time consuming and tedious. In such a case Eq. (4.6) may be
used provided the values of X or/and Y are small. But if X and Y assume large values, the calculation of
\(\sum X^2\), \(\sum Y^2\) and \(\sum XY\) is again quite time consuming.
Thus if (i) \(\bar{X}\) and \(\bar{Y}\) are fractional and (ii) X and Y assume large values, the Eq. (4.3)
and Eq. (4.6) are not generally used for numerical problems. In such cases, the step deviation method,
where we take the deviations of the variables X and Y from any arbitrary points, is used. We will discuss
this method in the properties of correlation coefficient.
Properties of Pearson Correlation Coefficient
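1. Pearsonian correlation coefficient lies between –1 and +1, i.e. –1 ≤ r ≤ +1. A value of r = 0
indicates absence of correlation.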
2. Pearsonian Correlation coefficient is independent of the change of origin and scale. Mathematically,
if given variables X and Y are transformed to new variables U and V by change of origin and scale, i.e.
\[ U = \frac{X - A}{h} \quad \text{and} \quad V = \frac{Y - B}{k} \]
where A, B, h and k are constants and h > 0, k > 0; then the correlation coefficient between X and Y is
the same as the correlation coefficient between U and V, i.e.,
r(X,Y) = r(U, V) => rxy = ruv
Remark: This is one of the very important properties of the correlation coefficient and is extremely
helpful in numerical computation of r. We had already stated that Eq. (4.3) and Eq.(4.6) become quite
tedious to use in numerical problems if X and/or Y are in fractions or if X and Y are large. In such cases
we can conveniently change the origin and scale (if possible) in X or/and Y to get new variables U and V
and compute the correlation between U and V by the Eq. (4.7)
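\[ r_{xy} = r_{uv} = \frac{N\sum UV - \sum U \sum V}{\sqrt{\left[N\sum U^2 - \left(\sum U\right)^2\right]\left[N\sum V^2 - \left(\sum V\right)^2\right]}} \tag{4.7} \]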
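The invariance under change of origin and scale can be checked numerically. The sketch below uses the sales and expenses data of Example 12-1; the function `pearson_r` and the constants A = 60, h = 5, B = 14 and k = 1 are hypothetical choices made only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from raw sums: [N*Sxy - Sx*Sy] / sqrt([N*Sxx - Sx^2][N*Syy - Sy^2])."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

sales = [50, 50, 55, 60, 62, 65, 68, 60, 60, 50]
expenses = [11, 13, 14, 16, 16, 15, 15, 14, 13, 13]

A, h = 60, 5     # hypothetical origin and scale for X
B, k = 14, 1     # hypothetical origin and scale for Y
u = [(x - A) / h for x in sales]
v = [(y - B) / k for y in expenses]

r_xy = pearson_r(sales, expenses)
r_uv = pearson_r(u, v)
print(r_xy, r_uv)   # the two values agree up to floating-point rounding
```

Working with U and V keeps the sums small, which is the practical point of the step deviation method.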
3. Two independent variables are uncorrelated but the converse is not true
If X and Y are independent variables then
rxy = 0
However, the converse of the theorem is not true i.e., uncorrelated variables need not necessarily be
independent. As an illustration consider the following bivariate distribution.
X : 1 2 3 -3 -2 -1
Y : 1 4 9 9 4 1
For this distribution, value of r will be 0.
Hence in the above example the variables X and Y are uncorrelated. But if we examine the data carefully
we find that X and Y are not independent but are connected by the relation Y = X². The above example
illustrates that uncorrelated variables need not be independent.
Remarks: One should not confuse the terms uncorrelation and independence. rxy = 0, i.e.,
uncorrelation between the variables X and Y, simply implies the absence of any linear (straight line)
relationship between them. They may, however, be related in some form other than a straight line,
e.g., quadratic (as we have seen in the above example), logarithmic or trigonometric form.
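The bivariate distribution above can be checked directly. This short sketch computes the numerator of the raw-sum formula for r; the variable names are assumptions made for illustration.

```python
import math

xs = [1, 2, 3, -3, -2, -1]
ys = [x ** 2 for x in xs]          # Y is fully determined by X

n = len(xs)
numerator = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
denominator = math.sqrt(
    (n * sum(x * x for x in xs) - sum(xs) ** 2)
    * (n * sum(y * y for y in ys) - sum(ys) ** 2)
)
r = numerator / denominator

print(ys, numerator, r)
```

The numerator vanishes because the positive and negative values of X cancel, even though Y depends exactly on X.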
4. Pearson coefficient of correlation is the geometric mean of the two regression coefficients, i.e.
\[ r_{xy} = \pm\sqrt{b_{xy} \cdot b_{yx}} \]
The signs of both the regression coefficients are the same, and so the value of r will also have the same
sign. This property will be dealt with in detail in the next lesson on Regression Analysis.
5. The square of Pearsonian correlation coefficient is known as the coefficient of determination.
Coefficient of determination, which measures the percentage variation in the dependent variable that is
accounted for by the independent variable, is a much better and useful measure for interpreting the
value of r. This property will also be dealt with in detail in the next lesson.
Probable Error of Correlation Coefficient
The correlation coefficient establishes the relationship of the two variables. After ascertaining this level
of relationship, we may be interested to find the extent up to which this coefficient is dependable.
Probable error of the correlation coefficient is a measure for testing the reliability of the observed
value of the correlation coefficient, when we consider it as satisfying the conditions of random
sampling.
If r is the observed value of the correlation coefficient in a sample of N pairs of observations for the two
variables under consideration, then the Probable Error, denoted by PE (r) is expressed as
\[ PE(r) = 0.6745 \times SE(r) \]
or
\[ PE(r) = 0.6745 \times \frac{1 - r^2}{\sqrt{N}} \]
There are two main functions of probable error:
1. Determination of limits: The limits of the population correlation coefficient are r ± PE(r), implying that
if we take another random sample of size N from the same population, then the observed value of
the correlation coefficient in the second sample can be expected to lie within the limits given above,
with 0.5 probability. When the sample size N is small, the value of PE may lead to wrong
conclusions. Hence, to use the concept of PE effectively, the sample size N should be fairly large.
2. Interpretation of 'r': The interpretation of 'r' based on PE is as under:
If |r| < PE(r), there is no evidence of correlation, i.e. a case of insignificant correlation.
If |r| > 6 PE(r), correlation is significant. If |r| < 6 PE(r), it is insignificant.
If the probable error is small, correlation exists where r > 0.5.
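The rules above can be sketched in a few lines of Python. This is a minimal illustration, not a formal significance test: the function names `probable_error` and `interpret_r` are hypothetical, and comparing the absolute value of r with PE(r) is an assumption made here so that negative coefficients are handled sensibly.

```python
import math

def probable_error(r: float, n: int) -> float:
    """Probable error of r: 0.6745 * (1 - r^2) / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be a positive number of pairs")
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)

def interpret_r(r: float, n: int) -> str:
    """Rule-of-thumb reading of r against its probable error (assumes |r| is compared)."""
    pe = probable_error(r, n)
    if abs(r) < pe:
        return "no evidence of correlation"
    if abs(r) > 6 * pe:
        return "significant"
    return "not significant"

# Illustrative values: r = 0.9 observed on N = 10 pairs.
pe = probable_error(0.9, 10)
limits = (0.9 - pe, 0.9 + pe)   # r - PE(r), r + PE(r)
verdict = interpret_r(0.9, 10)
```

With r = 0.9 and N = 10 the probable error is about 0.0405, so r is far larger than 6 PE(r); as the text warns, such a small N still calls for caution.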
Example 12-3
Find the Pearsonian correlation coefficient between sales (in thousand units) and expenses (in thousand
rupees) of the following 10 firms:
Firm: 1 2 3 4 5 6 7 8 9 10
Sales: 50 50 55 60 65 65 65 60 60 50
Expenses: 11 13 14 16 16 15 15 14 13 13
Solution: Let sales of a firm be denoted by X and expenses be denoted by Y
Firm   X   Y   dx = X − X̄   dy = Y − Ȳ   dx²   dy²   dx·dy
1 50 11 -8 -3 64 9 24
2 50 13 -8 -1 64 1 8
3 55 14 -3 0 9 0 0
4 60 16 2 2 4 4 4
5 65 16 7 2 49 4 14
6 65 15 7 1 49 1 7
7 65 15 7 1 49 1 7
8 60 14 2 0 4 0 0
9 60 13 2 -1 4 1 -2
10 50 13 -8 -1 64 1 8
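$\sum X = 580$, $\sum Y = 140$, $\sum d_x^2 = 360$, $\sum d_y^2 = 22$, $\sum d_x d_y = 70$
$\bar{X} = \frac{\sum X}{N} = \frac{580}{10} = 58$ and $\bar{Y} = \frac{\sum Y}{N} = \frac{140}{10} = 14$
Applying the Eq. (4.3a), we have, Pearsonian coefficient of correlation
$r_{xy} = \frac{\sum d_x d_y}{\sqrt{\sum d_x^2 \sum d_y^2}}$
$r_{xy} = \frac{70}{\sqrt{360 \times 22}} = \frac{70}{\sqrt{7920}} \approx 0.79$
The value of $r_{xy} \approx 0.79$ indicates a high degree of positive correlation between sales and expenses.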
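The deviation method used in Example 12-3 can be checked with a short Python sketch. The function name `pearson_r` is illustrative; the data are the ten firms of the example.

```python
import math

def pearson_r(x, y):
    """Pearson's r from deviations about the means: sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2))."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)
    return sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)

sales = [50, 50, 55, 60, 65, 65, 65, 60, 60, 50]      # thousand units
expenses = [11, 13, 14, 16, 16, 15, 15, 14, 13, 13]   # thousand rupees
r_sales_expenses = pearson_r(sales, expenses)
```

The result is 70/√7920, roughly 0.787.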
Example 12-4
The data on price and quantity purchased relating to a commodity for 5 months is given below:
Month : January February March April May
Prices(Rs): 10 10 11 12 12
Quantity(Kg): 5 6 4 3 3
Find the Pearsonian correlation coefficient between prices and quantity and comment on its sign and
magnitude.
Solution: Let price of the commodity be denoted by X and quantity be denoted by Y
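$\sum X = 55$, $\sum Y = 21$, $\sum X^2 = 609$, $\sum Y^2 = 95$, $\sum XY = 226$
Applying the Eq. (4.6), we have, Pearsonian coefficient of correlation
$r_{xy} = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}}$
$r_{xy} = \frac{5 \times 226 - 55 \times 21}{\sqrt{(5 \times 609 - 55 \times 55)(5 \times 95 - 21 \times 21)}}$
$r_{xy} = \frac{1130 - 1155}{\sqrt{20 \times 34}} = \frac{-25}{\sqrt{680}}$
$r_{xy} \approx -0.96$
The negative sign of r indicates negative correlation and its large magnitude indicates a very high degree
of correlation. So there is a high degree of negative correlation between prices and quantity demanded.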
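The raw-sums form of the formula used in Example 12-4 avoids computing deviations. A minimal sketch, with a hypothetical helper name `pearson_r_from_sums`:

```python
import math

def pearson_r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """r = (N*SXY - SX*SY) / sqrt((N*SX2 - SX^2) * (N*SY2 - SY^2))."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

prices = [10, 10, 11, 12, 12]     # Rs
quantities = [5, 6, 4, 3, 3]      # kg

r_price_quantity = pearson_r_from_sums(
    len(prices),
    sum(prices),
    sum(quantities),
    sum(p * p for p in prices),
    sum(q * q for q in quantities),
    sum(p * q for p, q in zip(prices, quantities)),
)
```

The sums reproduce those of the example (55, 21, 609, 95, 226) and the coefficient is -25/√680, about -0.96.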
Example 12-5
Find the Pearsonian correlation coefficient from the following series of marks obtained by 10 students
in a class test in mathematics (X) and in Statistics (Y):
X: 45 70 65 30 90 40 50 75 85 60
Y: 35 90 70 40 95 40 60 80 80 50
Also calculate the Probable Error.
Solution:
Calculations for Coefficient of Correlation
{Using Eq. (4.7), with U = (X − 60)/5 and V = (Y − 65)/5}
X Y U V U² V² UV
45 35 -3 -6 9 36 18
70 90 2 5 4 25 10
65 70 1 1 1 1 1
30 40 -6 -5 36 25 30
90 95 6 6 36 36 36
40 40 -4 -5 16 25 20
50 60 -2 -1 4 1 2
75 80 3 3 9 9 9
85 80 5 3 25 9 15
60 50 0 -3 0 9 0
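$\sum U = 2$, $\sum V = -2$, $\sum U^2 = 140$, $\sum V^2 = 176$, $\sum UV = 141$
$r = \frac{10 \times 141 - 2 \times (-2)}{\sqrt{10 \times 140 - 2 \times 2}\,\sqrt{10 \times 176 - (-2) \times (-2)}}$
$= \frac{1410 + 4}{\sqrt{1400 - 4}\,\sqrt{1760 - 4}}$
$= \frac{1414}{\sqrt{2451376}} \approx 0.9$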
So, there is a high degree of positive correlation between marks obtained in Mathematics and in
Statistics.
Probable Error, denoted by PE(r), is given as
$PE(r) = 0.6745 \times \frac{1 - r^2}{\sqrt{N}}$
$PE(r) = 0.6745 \times \frac{1 - (0.9)^2}{\sqrt{10}}$
$PE(r) = 0.0405$
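Example 12-5 works with coded values instead of the raw marks. The sketch below assumes the coding U = (X − 60)/5 and V = (Y − 65)/5, which is what the U and V columns of the table imply, and checks that r computed from the coded values equals r computed from the raw marks.

```python
import math

def pearson_r(x, y):
    """Pearson's r computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

marks_maths = [45, 70, 65, 30, 90, 40, 50, 75, 85, 60]
marks_stats = [35, 90, 70, 40, 95, 40, 60, 80, 80, 50]

# Change of origin and scale (assumed coding, inferred from the table).
u = [(x - 60) / 5 for x in marks_maths]
v = [(y - 65) / 5 for y in marks_stats]

r_raw = pearson_r(marks_maths, marks_stats)
r_coded = pearson_r(u, v)
```

Both routes give about 0.903, which illustrates that r is unaffected by a change of origin and scale.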
Spearman's rank correlation coefficient, usually denoted by ρ (Rho), is given by the equation
$\rho = 1 - \frac{6\sum d^2}{N(N^2 - 1)}$ …………(4.8)
where d is the difference between the pair of ranks of the same individual in the two characteristics and
N is the number of pairs.
Example 12-6
Ten entries are submitted for a competition. Three judges study each entry and list the ten in rank order.
Their rankings are as follows:
Entry: A B C D E F G H I J
Judge J1: 9 3 7 5 1 6 2 4 10 8
Judge J2: 9 1 10 4 3 8 5 2 7 6
Judge J3: 6 3 8 7 2 4 1 5 9 10
Calculate the appropriate rank correlation to help you answer the following questions:
$\rho$ (J1 & J3) $= 1 - \frac{6\sum d^2}{N(N^2 - 1)} = 1 - \frac{6 \times 26}{10(10^2 - 1)} = 1 - \frac{156}{990} = 1 - 0.1576 = +0.8424$
$\rho$ (J2 & J3) $= 1 - \frac{6\sum d^2}{N(N^2 - 1)} = 1 - \frac{6 \times 88}{10(10^2 - 1)} = 1 - \frac{528}{990} = 1 - 0.53 = +0.47$
So (i) Judges J1 and J3 agree the most
(ii) Judges J2 and J3 disagree the most
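The pairwise comparison in Example 12-6 can be sketched as follows. The helper name `spearman_rho` is illustrative, and the function assumes untied ranks 1 to N, as Eq. (4.8) requires.

```python
from itertools import combinations

def spearman_rho(rank_a, rank_b):
    """Spearman's rho for untied ranks: 1 - 6*sum(d^2) / (N*(N^2 - 1))."""
    n = len(rank_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

judges = {
    "J1": [9, 3, 7, 5, 1, 6, 2, 4, 10, 8],
    "J2": [9, 1, 10, 4, 3, 8, 5, 2, 7, 6],
    "J3": [6, 3, 8, 7, 2, 4, 1, 5, 9, 10],
}

# Rank correlation for every pair of judges.
rho = {pair: spearman_rho(judges[pair[0]], judges[pair[1]])
       for pair in combinations(judges, 2)}

closest_pair = max(rho, key=rho.get)    # pair of judges in closest agreement
farthest_pair = min(rho, key=rho.get)   # pair of judges in least agreement
```

For these rankings the three coefficients are about 0.71 (J1, J2), 0.84 (J1, J3) and 0.47 (J2, J3), consistent with the conclusion stated in the example.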
Spearman‘s rank correlation Eq.(4.8) can also be used even if we are dealing with variables, which are
measured quantitatively, i.e. when the actual data but not the ranks relating to two variables are given.
In such a case we shall have to convert the data into ranks. The highest (or the smallest) observation is
given the rank 1. The next highest (or the next lowest) observation is given rank 2 and so on. It is
immaterial in which way (descending or ascending) the ranks are assigned. However, the same
approach should be followed for all the variables under consideration.
Example 12-7
{Using Eq.(4.8)}
value in either or both the series then Spearman‘s Eq.(4.8) for calculating the rank correlation
coefficient breaks down, since in this case the variables X [the ranks of individuals in characteristic A
(1st series)] and Y [the ranks of individuals in characteristic B (2nd series)] do not take the values from 1
to N.
In this case common ranks are assigned to the repeated items. These common ranks are the arithmetic
mean of the ranks, which these items would have got if they were different from each other and the next
item will get the rank next to the rank used in computing the common rank. For example, suppose an
item is repeated at rank 4. Then the common rank to be assigned to each item is (4+5)/2, i.e., 4.5 which
is the average of 4 and 5, the ranks which these observations would have assumed if they were different.
The next item will be assigned the rank 6. If an item is repeated thrice at rank 7, then the common rank
to be assigned to each value will be (7+8+9)/3, i.e., 8 which is the arithmetic mean of 7,8 and 9 viz., the
ranks these observations would have got if they were different from each other. The next rank to be
assigned will be 10.
If only a small proportion of the ranks are tied, this technique may be applied together with Eq.(4.8). If
a large proportion of ranks are tied, it is advisable to apply an adjustment or a correction factor to
Eq.(4.8)as explained below:
“In the Eq.(4.8) add the factor
$\frac{m(m^2 - 1)}{12}$ …………(4.8a)
to $\sum d^2$, where m is the number of times an item is repeated. This correction factor is to be added for
For a certain joint stock company, the prices of preference shares (X) and debentures (Y) are given
below:
X: 73.2 85.8 78.9 75.8 77.2 81.2 83.8
Y: 97.8 99.2 98.8 98.3 98.3 96.7 97.1
Use the method of rank correlation to determine the relationship between preference prices and
debentures prices.
Solution:
In this case, due to repeated values of Y, we have to assign the average of the 2 ranks which would
have been allotted if the values were different. Thus ranks 3 and 4 have been allotted as 3.5 to both the
values of Y = 98.3. Now we also have to add the correction factor $\frac{m(m^2 - 1)}{12}$ to $\sum d^2$, where m is the
number of times the value is repeated; here m = 2.
$\rho = 1 - \frac{6\left[\sum d^2 + \frac{m(m^2 - 1)}{12}\right]}{N(N^2 - 1)} = 1 - \frac{6\left[48.5 + \frac{2(4 - 1)}{12}\right]}{7(7^2 - 1)} = 1 - \frac{6 \times 49}{7 \times 48} = 0.125$
Hence, there is a very low degree of positive correlation, probably no correlation, between preference share prices and
debenture prices.
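The tie-handling rule and the correction factor can be sketched in Python. The names `average_ranks` and `spearman_rho_with_ties` are hypothetical; the sketch ranks the largest value as 1 and adds m(m² − 1)/12 to Σd² once for every group of tied values in either series.

```python
def average_ranks(values):
    """Rank values (largest = rank 1); tied values share the mean of their ranks.

    Returns the ranks and the sizes of the tied groups found.
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    tie_sizes = []
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = ((pos + 1) + (end + 1)) / 2   # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        if end > pos:
            tie_sizes.append(end - pos + 1)
        pos = end + 1
    return ranks, tie_sizes

def spearman_rho_with_ties(x, y):
    """Spearman's rho with the m(m^2 - 1)/12 correction added to sum(d^2)."""
    rank_x, ties_x = average_ranks(x)
    rank_y, ties_y = average_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    correction = sum(m * (m * m - 1) / 12 for m in ties_x + ties_y)
    return 1 - 6 * (sum_d2 + correction) / (n * (n * n - 1))

preference = [73.2, 85.8, 78.9, 75.8, 77.2, 81.2, 83.8]
debenture = [97.8, 99.2, 98.8, 98.3, 98.3, 96.7, 97.1]
rho_shares = spearman_rho_with_ties(preference, debenture)
```

For the share-price data this gives Σd² = 48.5, a correction of 0.5 and ρ = 0.125, matching the worked solution.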
Remarks on Spearman’s Rank Correlation Coefficient
1. We always have $\sum d = 0$, which provides a check for numerical calculations.
2. Since Spearman's rank correlation coefficient, ρ, is nothing but Karl Pearson's correlation
coefficient, r, between the ranks, it can be interpreted in the same way as Karl Pearson's
correlation coefficient.
3. Karl Pearson's correlation coefficient assumes that the parent population from which sample
observations are drawn is normal. If this assumption is violated then we need a measure which is
distribution free (or non-parametric). Spearman's ρ is such a distribution free measure, since no strict
assumptions are made about the form of the population from which sample observations are drawn.
4. Spearman's formula is easy to understand and apply as compared to Karl Pearson's formula. The
values obtained by the two formulae, viz. Pearsonian r and Spearman's ρ, are generally different. The
difference arises due to the fact that when ranking is used instead of the full set of observations, there is
always some loss of information. Unless many ties exist, the coefficient of rank correlation should be
only slightly lower than the Pearsonian coefficient.
5. Spearman‘s formula is the only formula to be used for finding correlation coefficient if we are
dealing with qualitative characteristics, which cannot be measured quantitatively but can be arranged
serially. It can also be used where actual data are given. In case of extreme observations, Spearman‘s
formula is preferred to Pearson‘s formula.
6. Spearman‘s formula has its limitations also. It is not practicable in the case of bivariate frequency
distribution. For N >30, this formula should not be used unless the ranks are given.
12.3.5 CONCURRENT DEVIATION METHOD
This is a casual method of determining the correlation between two series when we are not very serious
about its precision. This is based on the signs of the deviations (i.e. the direction of the change)
of the values of the variable from its preceding value and does not take into account the exact
magnitude of the values of the variables. Thus we put a plus (+) sign, minus (-) sign or equality (=) sign
for the deviation if the value of the variable is greater than, less than or equal to the preceding value
respectively. The deviations in the values of two variables are said to be concurrent if they have the
same sign (either both deviations are positive or both are negative or both are equal). The formula used
for computing correlation coefficient rc by this method is given by
$r_c = \pm\sqrt{\pm\frac{2c - N}{N}}$ …………(4.9)
Where c is the number of pairs of concurrent deviations and N is the number of pairs of deviations. If
(2c-N) is positive, we take positive sign in and outside the square root in Eq. (4.9) and if (2c-N) is
negative, we take negative sign in and outside the square root in Eq. (4.9).
Remarks: (i) It should be clearly noted that here N is not the number of pairs of observations but it is
the number of pairs of deviations and as such it is one less than the number of pairs of observations.
(ii) Coefficient of concurrent deviations is primarily based on the following principle:
“If the short time fluctuations of the time series are positively correlated or in other
words, if their deviations are concurrent, their curves would move in the same
direction and would indicate positive correlation between them”
Example 12-9
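$r_c = \pm\sqrt{\pm\frac{2c - N}{N}}$
$r_c = \pm\sqrt{\pm\frac{2 \times 2 - 9}{9}}$
$r_c = \pm\sqrt{\pm(-0.5556)}$
Since 2c – N = -5 (negative), we take negative sign inside and outside the square root
$r_c = -\sqrt{-(-0.5556)}$
$r_c = -\sqrt{0.5556}$
$r_c \approx -0.75$
Hence there is a fairly good degree of negative correlation between supply and price.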
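The sign convention in Eq. (4.9) can be sketched in code. The function names are hypothetical, and the two short series near the end are made-up values used only to exercise the counting step, not data from the text.

```python
import math

def direction_signs(series):
    """+1, -1 or 0 for each value compared with the value preceding it."""
    return [(cur > prev) - (cur < prev) for prev, cur in zip(series, series[1:])]

def concurrent_coefficient(c, n):
    """r_c = +/- sqrt(+/- (2c - N)/N), using the sign of (2c - N) inside and outside the root."""
    ratio = (2 * c - n) / n
    sign = 1.0 if ratio >= 0 else -1.0
    return sign * math.sqrt(sign * ratio)

def concurrent_deviation(x, y):
    """Count concurrent deviations of two series and apply Eq. (4.9)."""
    signs_x, signs_y = direction_signs(x), direction_signs(y)
    c = sum(1 for a, b in zip(signs_x, signs_y) if a == b)
    return concurrent_coefficient(c, len(signs_x))

# Counts taken from the worked example: c = 2 concurrent pairs out of N = 9.
r_c_example = concurrent_coefficient(2, 9)

# Made-up series moving in opposite directions at every step.
r_c_opposite = concurrent_deviation([1, 2, 3, 2], [5, 4, 3, 4])
```

With c = 2 and N = 9 the sketch returns −√(5/9), about −0.745.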
on the dependent variable or on both the variables. Such an approach will avoid many of the pitfalls in
the interpretation of the coefficient of correlation. It has been rightly said that the coefficient of
correlation is not only one of the most widely used, but also one of the widely abused statistical
measures.
12.5 PARTIAL CORRELATION
Partial correlation, also known as Net Correlation, is a statistical technique to study the association
between one dependent variable and one independent variable while keeping the other independent
variables constant, i.e. after eliminating their effect. In simple correlation, the effect of other
independent variables is ignored. Partial correlation coefficients are mainly divided into three categories:
Zero order coefficient
First order coefficient
Second order coefficient
12.5.1 Zero Order Coefficient
It is simple correlation between two variables whereby no other variables are held constant. It can be
understood with the help of following example:
It is assumed that there are three variables named as x1, x2 and x3. In this case, simple correlation will be
calculated between any of two variables by ignoring third variable completely in the calculation. Then,
r12 = Simple correlation between x1 and x2 variables by ignoring x3 variable.
r23 = Simple correlation between x2 and x3 variables by ignoring x1 variable.
r31 = Simple correlation between x3 and x1 variables by ignoring x2 variable.
The formulas for calculating simple correlation between two variables have already been discussed in
section 12.3 of this lesson. Therefore, you can refer to section 12.3 to understand the simple
correlation calculation techniques.
12.5.2 First Order Coefficient
It is a partial correlation between two variables keeping the third variable constant. You can keep any
one variable constant out of three variables in final calculation of partial correlation. Hence, there can
be three cases of first order coefficient:
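r12.3 = Partial correlation between x1 and x2 variables keeping x3 variable constant.
$r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}}$
r13.2 = Partial correlation between x1 and x3 variables keeping x2 variable constant.
$r_{13.2} = \frac{r_{13} - r_{12} r_{23}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23}^2}}$
r23.1 = Partial correlation between x2 and x3 variables keeping x1 variable constant.
$r_{23.1} = \frac{r_{23} - r_{12} r_{13}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13}^2}}$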
Example 12.10 The simple coefficients of correlation between two variables out of three are: r12 = 0.8,
r13 = 0.7 and r23 = 0.6 Find the partial coefficients of correlation, r12.3, r23.1 and r13.2
Solution:
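$r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}} = \frac{0.8 - 0.7 \times 0.6}{\sqrt{1 - 0.49}\,\sqrt{1 - 0.36}} = \frac{0.38}{0.7141 \times 0.8} \approx 0.665$
$r_{23.1} = \frac{r_{23} - r_{12} r_{13}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13}^2}} = \frac{0.6 - 0.8 \times 0.7}{\sqrt{1 - 0.64}\,\sqrt{1 - 0.49}} = \frac{0.04}{0.6 \times 0.7141} \approx 0.093$
$r_{13.2} = \frac{r_{13} - r_{12} r_{23}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23}^2}} = \frac{0.7 - 0.8 \times 0.6}{\sqrt{1 - 0.64}\,\sqrt{1 - 0.36}} = \frac{0.22}{0.6 \times 0.8} \approx 0.458$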
Remarks:
1. The partial correlation coefficient lies between -1 and +1.
2. These correlation coefficients are calculated on the basis of zero order coefficients, i.e. simple
coefficients of correlation where no variable is kept constant. Thus r12 is a simple correlation
coefficient between x1 and x2.
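A first order partial coefficient needs only the three zero order coefficients. The sketch below uses the standard first order formula with a hypothetical helper `partial_r(r_ab, r_ac, r_bc)`, which returns the correlation between a and b with c held constant, and applies it to the data of Example 12.10.

```python
import math

def partial_r(r_ab: float, r_ac: float, r_bc: float) -> float:
    """Partial correlation of a and b, holding c constant."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

r12, r13, r23 = 0.8, 0.7, 0.6

r12_3 = partial_r(r12, r13, r23)   # x1 and x2, holding x3 constant
r23_1 = partial_r(r23, r12, r13)   # x2 and x3, holding x1 constant
r13_2 = partial_r(r13, r12, r23)   # x1 and x3, holding x2 constant
```

For these inputs the three coefficients come to roughly 0.665, 0.093 and 0.458, all within the range −1 to +1 noted in the remarks.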
Example: 12.11 Given: r12.4 = 0.6, r13.4 = 0.5 and r23.4 = 0.7. Find r12.34 and r13.24.
Solution:
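$r_{12.34} = \frac{r_{12.4} - r_{13.4} r_{23.4}}{\sqrt{1 - r_{13.4}^2}\,\sqrt{1 - r_{23.4}^2}} = \frac{0.6 - 0.5 \times 0.7}{\sqrt{1 - 0.25}\,\sqrt{1 - 0.49}} = \frac{0.25}{0.8660 \times 0.7141} \approx 0.404$
$r_{13.24} = \frac{r_{13.4} - r_{12.4} r_{23.4}}{\sqrt{1 - r_{12.4}^2}\,\sqrt{1 - r_{23.4}^2}} = \frac{0.5 - 0.6 \times 0.7}{\sqrt{1 - 0.36}\,\sqrt{1 - 0.49}} = \frac{0.08}{0.8 \times 0.7141} \approx 0.140$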
Example 12.12 It is observed that weight of fish depends on food consumption, type of water and
temperature. Find the partial correlation between weight of fish (x1) and temperature (x4) keeping food
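consumption (x2) and type of water (x3) constant. Given that r14.3 = 0.6, r12.3 = 0.5, r24.3 = 0.7. Find r14.23.
Solution:
$r_{14.23} = \frac{r_{14.3} - r_{12.3} r_{24.3}}{\sqrt{1 - r_{12.3}^2}\,\sqrt{1 - r_{24.3}^2}} = \frac{0.6 - 0.5 \times 0.7}{\sqrt{1 - 0.25}\,\sqrt{1 - 0.49}} = \frac{0.25}{0.8660 \times 0.7141} \approx 0.404$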
This gives the partial correlation between weight of fish and temperature keeping food consumption and
type of water constant.
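A second order coefficient has the same algebraic form as a first order one, with first order coefficients as inputs. A minimal sketch under that assumption, using the given values of Examples 12.11 and 12.12 (the helper name `partial_step` is hypothetical):

```python
import math

def partial_step(r_ab: float, r_ac: float, r_bc: float) -> float:
    """Remove one more variable c from the correlation between a and b."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Example 12.11: inputs already hold x4 constant; x3 (or x2) is removed next.
r12_4, r13_4, r23_4 = 0.6, 0.5, 0.7
r12_34 = partial_step(r12_4, r13_4, r23_4)
r13_24 = partial_step(r13_4, r12_4, r23_4)

# Example 12.12: inputs already hold x3 constant; x2 is removed next.
r14_3, r12_3, r24_3 = 0.6, 0.5, 0.7
r14_23 = partial_step(r14_3, r12_3, r24_3)
```

The values obtained are about 0.404 for r12.34, 0.140 for r13.24 and 0.404 for r14.23.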
12.5.4 Limitations of Partial Correlation
The following are the main limitations of partial correlation:
1. The calculation of the partial correlation coefficient presumes that a linear relation exists
between the variables. In real situations, this condition is not always satisfied.
2. The independent variables in the study of partial correlation should be linearly independent.
3. The reliability of partial correlation coefficients decreases as their order goes up. This means that the
second order partial coefficients are not as dependable as the first order ones. Therefore, it is
necessary that the number of items in the gross correlation should be large.
4. It involves a lot of calculation work and its analysis is not easy.
12.6 MULTIPLE CORRELATION:
Multiple correlation is a technique to study the relationship between three or more variables
simultaneously. In this, the combined effect of all independent variables on a dependent variable is analyzed. The
coefficient of multiple linear correlation is represented by R. It can be understood as follows:
Assume three variables x1, x2 and x3, where x1 is the dependent variable and the other two are
independent variables. Here, the multiple correlation coefficients can be defined as follows:
R1.23 = Multiple correlation coefficient with x1 as dependent variable and x2 and x3 as independent
variables.
R2.13 = Multiple correlation coefficient with x2 as dependent variable and x1 and x3 as independent
variables.
R3.12 = Multiple correlation coefficient with x3 as dependent variable and x1 and x2 as independent
variables.
12.6.1 Calculation of Multiple Correlation Coefficient
Following are the formulas for calculating multiple correlation coefficient:
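If there are two independent variables (x2 and x3) and one dependent variable (x1), the formula for finding out the
multiple correlation is
$R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2}}$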
Remarks
1. Multiple correlation coefficient is a positive coefficient always. Its value ranges between 0 and 1. It
cannot be a negative value.
2. If R1.23 = 0, then r12 = 0 and r13 = 0.
3. R1.23 ≥ r12 and R1.23 ≥ r13
4. R1.23 is the same thing as R1.32. The position of the subscript to the right of the dot does not make a
difference.
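The remarks can be checked numerically. The sketch below uses the standard three-variable formula R1.23 = √[(r12² + r13² − 2 r12 r13 r23) / (1 − r23²)]; the function name `multiple_r` and the input values (borrowed from Example 12.10) are assumptions for illustration only.

```python
import math

def multiple_r(r12: float, r13: float, r23: float) -> float:
    """Multiple correlation of x1 on x2 and x3 from the zero order coefficients."""
    r_squared = (r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2)
    return math.sqrt(r_squared)

r12, r13, r23 = 0.8, 0.7, 0.6

R1_23 = multiple_r(r12, r13, r23)
R1_32 = multiple_r(r13, r12, r23)   # same coefficient with x2 and x3 swapped
```

Here R1.23 is about 0.846, which is not smaller than either r12 or r13, and swapping the two independent variables leaves it unchanged.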
1. If both the variables move in the same direction, we say that there is a ………… correlation.
2. If the change in one variable is accompanied by change in another variable in a constant ratio, it is a
case of ………….. correlation.
3. When three or more variables are involved in a study, then it is a problem of either
…………correlation.
4. The Pearsonian correlation coefficient between the ranks X and Y is called the………………
between the characteristics A and B for the group of individuals.
5. Correlation analysis cannot determine ……………… relationship.
12.8 SUMMARY
The statistical methods of Correlation (discussed in the present lesson) and Regression (to be discussed
in the next lesson) are helpful in knowing the relationship between two or more variables which may be
related in some way, like the interest rate of bonds and the prime interest rate. Correlation can be classified in
several ways. The important ways of classifying correlation are: positive and negative; linear and
non-linear (curvilinear); and simple, partial and multiple. The commonly used methods for studying linear
relationship between two variables involve both graphic and algebraic methods. Some of the widely
used methods include: Scatter Diagram, Correlation Graph, Pearson's Coefficient of Correlation,
Spearman's Rank Correlation and Concurrent Deviation Method. There are major uses of correlation
analysis, but some limitations too. It has been rightly said that the coefficient of correlation is not only
one of the most widely used, but also one of the most widely abused statistical measures.
12.9 KEYWORDS
Correlation: When two or more variables vary in sympathy so that movement in one tends to be
accompanied by corresponding movements in the other variable(s), they are said to be correlated.
Scatter Diagram: Under this method, both the variables are plotted on the graph paper by putting dots.
By studying the diagram, we can have a rough idea about the nature and degree of relationship between two
variables.
Correlation Graph: The data pertaining to two series are plotted on a graph sheet and the
correlation is then found by examining the direction and closeness of the two curves.
Pearson's correlation coefficient: Denoted by r(X,Y), rxy or simply r, it is a numerical measure of the
linear relationship between X and Y and is defined as the ratio of the covariance between X and Y, to the
X: 39 65 62 90 82 75 25 98 36 78
Y: 47 53 58 86 62 68 60 91 51 84
8. To study the effectiveness of an advertisement a survey is conducted by calling people at random by
asking the number of advertisements read or seen in a week (X) and the number of items purchased
(Y) in that week.
X: 5 10 4 0 2 7 3 6
Y: 10 12 5 2 1 3 4 8
Calculate the correlation coefficient and comment on the result.
9. Calculate coefficient of correlation between X and Y series from the following data and calculate its
probable error also.
X: 78 89 96 69 59 79 68 61
Y: 125 137 156 112 107 136 123 108
10. In two sets of variables X and Y, with 50 observations each, the following data are observed:
$\bar{X} = 10$, SD of X = 3
$\bar{Y} = 6$, SD of Y = 2, $r_{xy} = 0.3$
However, on subsequent verification, it was found that one value of X (= 10) and one value of Y (=
6) were inaccurate and hence weeded out, leaving the remaining 49 pairs of values. How is the original
value of $r_{xy} = 0.3$ affected?
11. Calculate coefficient of correlation r between the marks in statistics (X) and Accountancy (Y) of 10
students from the following:
X: 52 74 93 55 41 23 92 64 40 71
Y: 45 80 63 60 35 40 70 58 43 64
Also determine the probable error of r.
12. The coefficient of correlation between two variables X and Y is 0.48. The covariance is 36. The
variance of X is 16. Find the standard deviation of Y.
13. Ten entries in a painting competition were ranked by two judges as shown below:
Entry: A B C D E F G H I J
Judge I: 5 2 3 4 1 6 8 7 10 9
Judge II: 4 5 2 1 6 7 10 9 3 8
Find the coefficient of rank correlation.
14. Calculate Spearman‘s rank correlation coefficient between advertisement cost (X) and sales (Y)
from the following data:
X: 39 65 62 90 82 75 25 98 36 78
Y: 47 53 58 86 62 68 60 91 51 84
15. An examination of eight applicants for a clerical post was taken by a firm. From the marks obtained
by the applicants in the Accountancy (X) and Statistics (Y) paper, compute rank coefficient of
correlation.
Applicant: A B C D E F G H
X: 15 20 28 12 40 60 20 80
Y: 40 30 50 30 20 10 30 60
16. Calculate the coefficient of concurrent deviation from the following data:
Year: 1993 1994 1995 1996 1997 1998 1999 2000
Supply: 160 164 172 182 166 170 178 192
Price: 222 280 260 224 266 254 230 190
17. Obtain a suitable measure of correlation from the following data regarding changes in price index
of the shares A and B during nine months of a year:
Month: A M J J A S O N D
A: +4 +3 +2 -1 -3 +4 -5 +1 +2
B: -2 +5 +3 -2 -1 -3 +4 -1 -3
18. The cross-classification table shows the marks obtained by 105 students in the subjects of Statistics
and Finance:
Marks in Statistics
50-54 55-59 60-64 65-74 Total
Marks in Finance
50-59 4 6 8 7 25
60-69 - 10 12 13 35
70-79 16 9 20 - 45
80-89 - - - - -
Total 20 25 40 20 105
REGRESSION ANALYSIS
STRUCTURE
13.1 Learning Objectives
13.2 What is Regression?
13.2.1 Linear Regression
13.2.2 Regression Line of Y on X
13.2.2.1 Scatter Diagram
13.2.2.2 Fitting a Straight Line
13.2.2.3 Predicting an Estimate and its Preciseness
13.2.2.4 Error of Estimate
13.2.3 Regression Line of X on Y
13.3 Properties of Regression Coefficients
13.3.1 Regression Lines and Coefficient of Correlation
13.3.2 Coefficient of Determination
13.3.3 Correlation Analysis Versus Regression Analysis
13.4 Solved Problems
13.5 Multiple Regression Analysis
13.5.1 Regression Equations
13.5.2 Assumptions of Multiple Linear Regression Analysis
13.5.3 Methods of Multiple Regression Analysis
13.5.4 Standard Error of the Estimates
13.6 Solved Problems of Multiple Regression Analysis
In 1889, Sir Francis Galton, a cousin of Charles Darwin, published a work on heredity, “Natural
Inheritance”. He reported his discovery that sizes of seeds of sweet pea plants appeared to “revert” or
“regress” to the mean size in successive generations. He also reported results of a study of the
relationship between heights of fathers and heights of their sons. A straight line was fit to the data pairs:
height of father versus height of son. Here, too, he found a “regression to mediocrity”: the heights of the
sons represented a movement away from their fathers, towards the average height. We credit Galton
with the idea of statistical regression.
While most applications of regression analysis may have little to do with the “regression to the mean” discovered by Galton,
the term “regression” remains. It now refers to the statistical technique of modeling the relationship between two or more
variables. In a general sense, regression analysis means the estimation or prediction of the unknown value of one variable from
the known value(s) of the other variable(s). It is one of the most important and widely used statistical techniques in almost all
sciences - natural, social or physical.
In this lesson we will focus only on simple regression, i.e. linear regression involving only two variables: a
dependent variable and an independent variable. Regression analysis for studying more than two
variables at a time is known as multiple regression.
INDEPENDENT AND DEPENDENT VARIABLES
Simple regression involves only two variables; one variable is predicted by another variable. The
variable to be predicted is called the dependent variable. The predictor is called the independent
variable, or explanatory variable. For example, when we are trying to predict the demand for television
sets on the basis of population growth, we are using the demand for television sets as the dependent
variable and the population growth as the independent or predictor variable.
The decision as to which variable is which sometimes causes problems. Often the choice is obvious, as
in the case of demand for television sets and population growth, because it would make no sense to suggest
that population growth could be dependent on TV demand! The population growth has to be the
independent variable and the TV demand the dependent variable.
If we are unsure, here are some points that might be of use:
If we have control over one of the variables, then that is the independent variable. For example, a
manufacturer can decide how much to spend on advertising and expect his sales to be dependent
upon how much he spends.
If there is any lapse of time between the two variables being measured, then the latter must depend
upon the former; it cannot be the other way round.
If we want to predict the values of one variable from our knowledge of the other variable, the
variable to be predicted must be dependent on the known one.
Table 13-1
Month     Sales (Rs lac)   Marketing Expenditure (Rs thousands)
          Y                X
April     14               10
May       17               12
June      23               15
July      21               20
August    25               23
Let Y, the sales, be the dependent variable and X, the marketing expenditure, the independent variable.
We note that for each value of independent variable X, there is a specific value of the dependent
variable Y, so that each value of X and Y can be seen as paired observations.
13.2.2.1 Scatter Diagram
Before obtaining a straight-line relationship, it is necessary to discover whether the relationship between
the two variables is linear, that is, the one which is best explained by a straight line. A good way of
doing this is to plot the data on X and Y on a graph so as to yield a scatter diagram, as may be seen in
Figure 13-1. A careful reading of the scatter diagram reveals that:
The overall tendency of the points is to move upward, so the relationship is positive
The general course of movement of the various points on the diagram can be best explained by a
straight line
There is a high degree of correlation between the variables, as the points are very close to each
other
Figure 13-1 Scatter Diagram with Line of Best Fit
The deviations dj have to be squared to avoid negative deviations canceling out the positive deviations.
Since a straight line so fitted best approximates all the points on the scatter diagram, it is better known
as the best approximating line or the line of best fit. A line of best fit can be fitted by means of:
1. Free hand drawing method, and
2. Least square method
Free Hand Drawing:
Free hand drawing is the simplest method of fitting a straight line. After a careful inspection of the movement and spread of
various points on the scatter diagram, a straight line is drawn through these points by using a transparent ruler such that on
the whole it is closest to every point. A straight line so drawn is particularly useful when future approximations of the
dependent variable are promptly required.
Whereas the use of free hand drawing may yield a line nearest to the line of best fit, the major drawback
is that the slope of the line so drawn varies from person to person because of the influence of subjectivity.
Consequently, the values of the dependent variable estimated on the basis of such a line may not
be as accurate and precise as those based on the line of best fit.
Least Square Method:
The least square method of fitting a line of best fit requires minimizing the sum of the squares of
vertical deviations of each observed Y value from the fitted line. These deviations, such as d1 and d3, are
shown in Figure 13-1 and are given by Y - Yc, where Y is the observed value and Yc the corresponding
computed value given by the fitted line for the ith value of X.
$Y_c = a + bX_i$ …………(5.1)
The straight line relationship in Eq.(5.1), is stated in terms of two constants a and b
The constant a is the Y-intercept; it indicates the height on the vertical axis from where the straight
line originates, representing the value of Y when X is zero.
Constant b is a measure of the slope of the straight line; it shows the absolute change in Y for a unit
change in X. As the slope may be positive or negative, it indicates the nature of relationship
between Y and X. Accordingly, b is also known as the regression coefficient of Y on X.
Since a straight line is completely defined by its intercept a and slope b, the task of fitting the same
reduces only to the computation of the values of these two constants. Once these two values are known,
the computed Yc values against each value of X can be easily obtained by substituting X values in the
linear equation.
In the method of least squares the values of a and b are obtained by solving simultaneously the
following pair of normal equations
$\sum Y = aN + b\sum X$ …………(5.2)
$\sum XY = a\sum X + b\sum X^2$ …………(5.2)
The sums $\sum X$, $\sum Y$, $\sum XY$ and $\sum X^2$ are computed from the given
observations and then can be substituted in the above equations to obtain the values of a and b.
Since solving the two normal equations simultaneously for a and b may quite often be cumbersome and
time consuming, the two values can be directly obtained as
$a = \bar{Y} - b\bar{X}$ …………(5.3)
and
$b = \frac{N\sum XY - \sum X \sum Y}{N\sum X^2 - (\sum X)^2}$ …………(5.4)
Note: Eq. (5.3) is obtained simply by dividing both sides of the first of Eqs. (5.2) by N and Eq. (5.4) is
obtained by substituting $(\bar{Y} - b\bar{X})$ in place of a in the second of Eqs. (5.2)
Instead of directly computing b, we may first compute value of a as
$a = \frac{\sum Y \sum X^2 - \sum X \sum XY}{N\sum X^2 - (\sum X)^2}$ …………(5.5)
and
$b = \frac{\bar{Y} - a}{\bar{X}}$ …………(5.6)
Note: Eq. (5.5) is obtained by substituting $\frac{N\sum XY - \sum X \sum Y}{N\sum X^2 - (\sum X)^2}$ for b in Eq. (5.3) and Eq. (5.6) is
obtained by solving Eq. (5.3) for b.
Table 13-2
Computation of a and b
Y     X     XY     X²     Y²
14    10    140    100    196
17    12    204    144    289
23    15    345    225    529
21    20    420    400    441
25    23    575    529    625
ΣY = 100   ΣX = 80   ΣXY = 1684   ΣX² = 1398   ΣY² = 2080
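The totals of Table 13-2 can be fed into Eqs. (5.3) and (5.4) directly. A minimal sketch, with a hypothetical function name `fit_line`:

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b of Y on X, via Eqs. (5.3) and (5.4)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

expenditure = [10, 12, 15, 20, 23]   # X, Rs thousands
sales = [14, 17, 23, 21, 25]         # Y, Rs lac

a, b = fit_line(expenditure, sales)

# Both normal equations (5.2) should hold for the fitted a and b.
residual_1 = sum(sales) - (a * len(sales) + b * sum(expenditure))
residual_2 = (sum(x * y for x, y in zip(expenditure, sales))
              - (a * sum(expenditure) + b * sum(x * x for x in expenditure)))
```

For this table b = 420/590, about 0.712, and a is about 8.61, so the fitted line is roughly Yc = 8.61 + 0.712X.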
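$b = \frac{\frac{\sum XY}{N} - \frac{\sum X}{N} \cdot \frac{\sum Y}{N}}{\frac{\sum X^2}{N} - \left(\frac{\sum X}{N}\right)^2}$
or $b = \frac{\frac{\sum XY}{N} - \bar{X}\bar{Y}}{S_x^2}$
or $b = \frac{Cov(X, Y)}{S_x^2}$ …………(5.8)
We know, coefficient of correlation, rxy is given by
$r_{xy} = \frac{Cov(X, Y)}{S_x S_y}$
or $Cov(X, Y) = r_{xy} S_x S_y$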
$S_{yx} = \sqrt{\frac{\sum (Y - Y_c)^2}{N}}$ …………(5.11)
Syx measures the average absolute amount by which observed Y values depart from the corresponding computed Yc values.
Computation of Syx becomes a little cumbersome where the number of observations N is large. In such
cases Syx may be computed directly by using the equation:
$S_{yx} = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{N}}$ ………(5.12)
By substituting the values of $\sum Y^2$, $\sum Y$, and $\sum XY$ from Table 13-2, and the calculated values of
a and b
We have
3. Since Syx measures the closeness of the observed Y values and the estimated Yc values, it also serves
as a measure of the reliability of the estimate. Greater the closeness between the observed and
estimated values of Y, the lesser the error and, consequently, the more reliable the estimate. And
vice-versa.
4. Standard error of estimate Syx can also be seen as a measure of correlation insofar as it expresses the
degree of closeness of scatter of observed Y values about the regression line. The closer the observed
Y values scattered around the regression line, the higher the correlation between the two variables.
A major difficulty in using Syx as a measure of correlation is that it is expressed in the same units of
measurement as the data on the dependent variable. This creates problems in situations requiring
comparison of two or more sets of data in terms of correlation. It is mainly due to this limitation that the
standard error of estimate is not generally used as a measure of correlation. However, it does serve as
the basis of evolving the coefficient of determination, denoted as r2, which provides an alternate method
of obtaining a measure of correlation.
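The two expressions for the standard error of estimate, Eqs. (5.11) and (5.12), can be compared numerically on the data of Table 13-2. This is a sketch with illustrative variable names; the intercept and slope are recomputed inside the block so that it stands alone.

```python
import math

x = [10, 12, 15, 20, 23]   # marketing expenditure, Rs thousands
y = [14, 17, 23, 21, 25]   # sales, Rs lac
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)
sum_y2 = sum(yi * yi for yi in y)

# Least-squares slope and intercept of Y on X.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# Eq. (5.11): root mean squared deviation of Y from the fitted values Yc.
s_yx_direct = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / n)

# Eq. (5.12): shortcut that uses only the column totals.
s_yx_shortcut = math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / n)
```

Both expressions give about 2.01 (in Rs lac, the units of Y), which illustrates why Syx is awkward for comparing data sets measured in different units.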
13.2.3 REGRESSION LINE OF X ON Y
So far we have considered the regression of Y on X, in the sense that Y was in the role of dependent and
X in the role of an independent variable. In their reverse position, such that X is now the dependent and
Y the independent variable, we fit a line of regression of X on Y. The regression equation in this case
will be
$X_c = a' + b'Y$ …………(5.13)
where Xc denotes the computed values of X against the corresponding values of Y, a′ is the X-intercept
and b′ is the slope of the straight line.
Two normal equations to solve a′ and b′ are
or
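$a' = \frac{\sum X \sum Y^2 - \sum Y \sum XY}{N\sum Y^2 - (\sum Y)^2}$ …………(5.17)
and
$b' = \frac{\bar{X} - a'}{\bar{Y}}$ …………(5.18)
$b' = \frac{Cov(Y, X)}{S_y^2}$ …………(5.19)
$b' = r_{yx}\frac{S_x}{S_y}$ …………(5.20)
So, Regression equation of X on Y may also be written as
$X_c - \bar{X} = b'(Y - \bar{Y})$ …………(5.21)
$X_c - \bar{X} = r_{yx}\frac{S_x}{S_y}(Y - \bar{Y})$ …………(5.22)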
As before, once the values of a‘ and b‘ have been found, their substitution in Eq.(5.13) will enable us to
get an estimate of X corresponding to a known value of Y
The standard error of estimate of X on Y, i.e. Sxy, will be
$$S_{xy} = \sqrt{\frac{\sum (X - X_c)^2}{N}} \qquad (5.23)$$
or
$$S_{xy} = \sqrt{\frac{\sum X^2 - a' \sum X - b' \sum XY}{N}} \qquad (5.24)$$
For example, if we want to estimate the marketing expenditure needed to achieve a sales target of Rs 40 lac, we
have to obtain the regression line of X on Y, i.e.
$$X_c = a' + b'Y$$
So using Eqs. (5.17) and (5.16), and substituting the values of $\sum X$, $\sum Y$, $\sum Y^2$ and $\sum XY$ from Table
5-2, we have
$$r = \pm\sqrt{b \cdot b'} \qquad (5.25)$$
The signs of both the regression coefficients are the same, and so the value of r will also have the
same sign.
3. The mean of both the regression coefficients is either equal to or greater than the coefficient of
correlation, i.e.
$$\frac{b + b'}{2} \geq r$$
4. Regression coefficients are independent of change of origin but not of change of scale.
Mathematically, if the given variables X and Y are transformed to new variables U and V by a change of
origin and scale, i.e.
$$U = \frac{X - A}{h} \quad \text{and} \quad V = \frac{Y - B}{k}$$
where A, B, h and k are constants, h > 0, k > 0, then
Regression coefficient of Y on X = k/h (Regression coefficient of V on U)
$$b_{yx} = \frac{k}{h}\, b_{vu}$$
and
Regression coefficient of X on Y = h/k (Regression coefficient of U on V)
$$b_{xy} = \frac{h}{k}\, b_{uv}$$
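The change-of-origin-and-scale property can be checked numerically. The following is a minimal sketch, assuming illustrative data and arbitrarily chosen constants A, B, h and k (none of these values come from the property itself); the helper name `regression_coefficient` is hypothetical.

```python
def regression_coefficient(dependent, independent):
    """Least-squares slope of `dependent` on `independent`."""
    n = len(dependent)
    mean_dep = sum(dependent) / n
    mean_ind = sum(independent) / n
    cov = sum((d - mean_dep) * (i - mean_ind)
              for d, i in zip(dependent, independent))
    var_ind = sum((i - mean_ind) ** 2 for i in independent)
    return cov / var_ind

# Illustrative data and transformation constants (assumptions for this sketch)
X = [600, 630, 720, 750, 800]
Y = [1250, 1100, 1300, 1350, 1500]
A, B, h, k = 700, 1300, 10, 50

U = [(x - A) / h for x in X]   # change of origin A and scale h
V = [(y - B) / k for y in Y]   # change of origin B and scale k

b_yx = regression_coefficient(Y, X)
b_vu = regression_coefficient(V, U)
b_xy = regression_coefficient(X, Y)
b_uv = regression_coefficient(U, V)

print(b_yx, (k / h) * b_vu)    # the two values should agree
print(b_xy, (h / k) * b_uv)    # the two values should agree
```

Changing A or B leaves all four coefficients unchanged, while changing h or k rescales b_vu and b_uv, which is what the property states.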
5. The coefficient of determination is the product of both the regression coefficients, i.e.
$$r^2 = b \cdot b'$$
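If $\theta$ is the angle between the two regression lines, then
$$\tan\theta = \frac{b\,b' - 1}{b + b'} = \frac{S_x S_y}{S_x^2 + S_y^2}\cdot\frac{r^2 - 1}{r}$$
or
$$\theta = \tan^{-1}\left[\frac{S_x S_y}{S_x^2 + S_y^2}\cdot\frac{r^2 - 1}{r}\right] \qquad (5.26)$$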
The farther apart the two regression lines, the lower the degree of correlation; the nearer the two
regression lines, the higher the degree of correlation. See (c) and (d) in Figure 5-3.
If the variables are independent, i.e. r = 0, the lines of regression cut each other at right angles.
See (g) in Figure 5-3.
Note: Both the regression lines cut each other at the mean values of X and Y, i.e. at the point $(\bar{X}, \bar{Y})$.
$$r^2 = \frac{\sum (Y_c - \bar{Y})^2}{\sum (Y - \bar{Y})^2} \qquad (5.27)$$
We can calculate another coefficient K2, known as the coefficient of non-determination, which is the ratio
of unexplained variance to the total variance.
$$K^2 = \frac{\text{Unexplained Variance}}{\text{Total Variance}}$$
$$K^2 = \frac{\sum (Y - Y_c)^2}{\sum (Y - \bar{Y})^2} \qquad (5.28)$$
$$K^2 = 1 - \frac{\text{Explained Variance}}{\text{Total Variance}} = 1 - r^2 \qquad (5.29)$$
The square root of the coefficient of non-determination, i.e. K, gives the coefficient of alienation
$$K = \pm\sqrt{1 - r^2} \qquad (5.30)$$
Relation Between Syx and r:
A simple algebraic operation on Eq. (5.30) brings out some interesting points about the relation between
Syx and r. Thus, since
$$\sum (Y - Y_c)^2 = N S_{yx}^2 \quad \text{and} \quad \sum (Y - \bar{Y})^2 = N S_y^2$$
the coefficient of non-determination is
$$K^2 = \frac{\sum (Y - Y_c)^2}{\sum (Y - \bar{Y})^2} = \frac{N S_{yx}^2}{N S_y^2} = \frac{S_{yx}^2}{S_y^2}$$
So
$$1 - r^2 = \frac{S_{yx}^2}{S_y^2}$$
or
$$\frac{S_{yx}}{S_y} = \sqrt{1 - r^2} \qquad (5.31)$$
If the coefficient of correlation, r, is defined as the square root of the coefficient of determination,
$$r = \sqrt{r^2}$$
$$r^2 = 1 - \frac{S_{yx}^2}{S_y^2}$$
$$r = \sqrt{1 - \frac{S_{yx}^2}{S_y^2}} \qquad (5.32)$$
On carefully observing Eq. (5.32), it will be noticed that the ratio Syx/Sy will be large if the coefficient of
determination is small, and it will be small when the coefficient of determination is large. Thus
if r2 = r = 0, Syx/Sy =1, which means that Syx = Sy.
if r2 = r = 1, Syx/Sy =0, which means that Syx = 0.
when r = 0.865, Syx/Sy = √(1 - 0.865²) ≈ 0.502, which means that Syx is about 50.2% of Sy.
Eq. (5.32) also implies that Syx is generally less than Sy. The two can at the most be equal, but only in
the extreme situation when r = 0.
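The relation between Syx, Sy and r can be verified on any small data set. The sketch below is only an illustration under stated assumptions: the five (X, Y) pairs are illustrative values, and both standard deviations use the divisor N, as in the formulas of this lesson.

```python
import math

# Illustrative paired observations (assumed for this sketch)
X = [600, 630, 720, 750, 800]
Y = [1250, 1100, 1300, 1350, 1500]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
s_xx = sum((x - x_bar) ** 2 for x in X)
s_yy = sum((y - y_bar) ** 2 for y in Y)

# Least-squares line of Y on X and the estimated values Yc
b = s_xy / s_xx
a = y_bar - b * x_bar
Yc = [a + b * x for x in X]

S_yx = math.sqrt(sum((y - yc) ** 2 for y, yc in zip(Y, Yc)) / n)  # standard error of estimate
S_y = math.sqrt(s_yy / n)                                         # standard deviation of Y
r = s_xy / math.sqrt(s_xx * s_yy)                                 # coefficient of correlation

ratio = S_yx / S_y
print(round(ratio, 4), round(math.sqrt(1 - r ** 2), 4))  # both sides of the relation
```

The two printed values coincide, and the ratio never exceeds 1, which is the sense in which Syx cannot be larger than Sy.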
Interpretations of r2:
1. Even though the coefficient of determination, whose square root measures the degree of correlation,
is based on Syx, it is expressed as 1 - (Syx/Sy)². As it is a dimensionless pure number, the unit in
which Syx is measured becomes irrelevant. This facilitates comparison between two sets of data
in terms of their coefficient of determination r2 (or the coefficient of correlation r). This was not
possible in terms of Syx, as the units of measurement could be different.
2. The value of r2 can range between 0 and 1. When r2 = 1, all the points on the scatter diagram fall on
the regression line and the entire variations are explained by the straight line. On the other hand,
when r2 = 0, none of the points on the scatter diagram falls on the regression line, meaning thereby
that there is no relationship between the two variables. However, being always non-negative
coefficient of determination does not tell us about the direction of the relationship (whether it is
positive or negative) between the two variables.
3. When r2 = 0.7455 (or any other value), 74.55% of the total variations in sales are explained by the
marketing expenditure used. What remains is the coefficient of non-determination K2 (= 1 - r2) =
0.2545. It means 25.45% of the total variations remain unexplained, which are due to factors other
than the changes in the marketing expenditure.
4. r2 provides the necessary link between regression and correlation which are the two related aspects
of a single problem of the analysis of relationship between two variables. Unlike regression,
correlation quantifies the degrees of relationship between the variables under study, without making
a distinction between the dependent and independent ones. Nor does it, therefore, help in predicting
the value of one variable for a given value of the other.
5. The coefficient of correlation overstates the degree of relationship and its meaning is not as
explicit as that of the coefficient of determination. The coefficient of correlation r = 0.865, as
compared to r2 = 0.7455, indicates a higher degree of correlation between sales and marketing
expenditure. Therefore, the coefficient of determination is a more objective measure of the degree
of relationship.
6. The sum of r and K never adds to one, unless one of the two is zero. That is, r + K can be unity
either when there is no correlation or when there is perfect correlation. Except in these two extreme
situations, (r + K) > 1.
association, we might again be interested in predicting the value of one variable for the given and
known values of other variable(s).
1. Correlation literally means the relationship between two or more variables that vary in sympathy so
that the movements in one tend to be accompanied by the corresponding movements in the other(s).
On the other hand, regression means stepping back or returning to the average value and is a
mathematical measure expressing the average relationship between the two variables.
2. The correlation coefficient rxy between two variables X and Y is a measure of the direction and degree
of the mutual linear relationship between the two variables. It is symmetric, i.e., ryx = rxy, and it is
immaterial which of X and Y is the dependent variable and which is the independent variable. Regression
analysis aims at establishing the functional relationship between the two (or more) variables under
study and then using this relationship to predict or estimate the value of the dependent variable for
any given value of the independent variable(s). It also reflects upon the nature of the variables, i.e.,
which is the dependent variable and which is the independent variable. Regression coefficients are not
symmetric in X and Y, i.e., byx ≠ bxy.
3. Correlation need not imply a cause and effect relationship between the variables under study.
However, regression analysis clearly indicates the cause and effect relationship between the
variables. The variable corresponding to cause is taken as independent variable and the variable
corresponding to effect is taken as dependent variable.
4. Correlation coefficient rxy is a relative measure of the linear relationship between X and Y and is
independent of the units of measurement. It is a pure number lying between ±1. On the other hand,
the regression coefficients, byx and bxy are absolute measures representing the change in the value of
the variable Y (or X), for a unit change in the value of the variable X (or Y). Once the functional
form of regression curve is known; by substituting the value of the independent variable we can
obtain the value of the dependent variable and this value will be in the units of measurement of the
dependent variable.
5. There may be non-sense correlation between two variables that is due to pure chance and has no
practical relevance, e.g., the correlation, between the size of shoe and the intelligence of a group of
individuals. There is no such thing like non-sense regression.
The following table shows the number of motor registrations in a certain territory for a term of 5 years and the sale of motor
tyres by a firm in that territory for the same period.
Year Motor Registrations No. of Tyres Sold
1 600 1,250
2 630 1,100
3 720 1,300
4 750 1,350
5 800 1,500
Find the regression equation to estimate the sale of tyres when the motor registration is known. Estimate sale of tyres when
registration is 850.
Solution: Here the dependent variable is number of tyres; dependent on motor registrations. Hence we
put motor registrations as X and sales of tyres as Y and we have to establish the regression line of Y on
X.
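Calculations of values for the regression equation are given below:
X      Y       dx = X - X̄   dy = Y - Ȳ   dx²      dx·dy
600    1,250   -100         -50          10,000   5,000
630    1,100   -70          -200         4,900    14,000
720    1,300   20           0            400      0
750    1,350   50           50           2,500    2,500
800    1,500   100          200          10,000   20,000
ΣX = 3,500   ΣY = 6,500   Σdx = 0   Σdy = 0   Σdx² = 27,800   Σdx·dy = 41,500
$$\bar{X} = \frac{\sum X}{N} = \frac{3{,}500}{5} = 700 \quad \text{and} \quad \bar{Y} = \frac{\sum Y}{N} = \frac{6{,}500}{5} = 1{,}300$$
byx = Regression coefficient of Y on X
$$b_{yx} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{\sum d_x d_y}{\sum d_x^2} = \frac{41{,}500}{27{,}800} = 1.4928$$
The regression line of Y on X is
$$Y - \bar{Y} = b_{yx}\,(X - \bar{X})$$
Y - 1,300 = 1.4928 (X - 700)
Y = 1.4928 X + 255.04
When X = 850, the value of Y can be calculated from the above equation, by putting X = 850 in the
equation.
Y = 1.4928 × 850 + 255.04
= 1523.92
= 1,524 Tyres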
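The hand computation above can be cross-checked with a few lines of Python. This is a minimal sketch using the registration and tyre-sales figures from the table; the function name `fit_y_on_x` is hypothetical.

```python
def fit_y_on_x(xs, ys):
    """Return (a, b) of the least-squares line Y = a + b*X via deviations from the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_dxdy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_dx2 = sum((x - x_bar) ** 2 for x in xs)
    b = sum_dxdy / sum_dx2          # regression coefficient of Y on X
    a = y_bar - b * x_bar           # the line passes through (x_bar, y_bar)
    return a, b

registrations = [600, 630, 720, 750, 800]   # X
tyres_sold = [1250, 1100, 1300, 1350, 1500] # Y

a, b = fit_y_on_x(registrations, tyres_sold)
estimate = a + b * 850                      # estimated tyre sales for 850 registrations
print(round(b, 4), round(a, 2), round(estimate))
```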
Example 13-2
A panel of Judges A and B graded seven debaters and independently awarded the following marks:
Debater Marks by A Marks by B
1 40 32
2 34 39
3 28 26
4 30 30
5 44 38
6 38 34
7 31 28
An eighth debater was awarded 36 marks by Judge A, while Judge B was not present. If Judge B were
also present, how many marks would you expect him to award to the eighth debater, assuming that the
same degree of relationship exists in their judgement?
Solution:
Let us use marks from Judge A as X and those from Judge B as Y. Now we have to work out the
regression line of Y on X from the calculation below:
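Debater   X    Y    U = X - 35   V = Y - 30   U²    V²    UV
1         40   32   5            2            25    4     10
2         34   39   -1           9            1     81    -9
3         28   26   -7           -4           49    16    28
4         30   30   -5           0            25    0     0
5         44   38   9            8            81    64    72
6         38   34   3            4            9     16    12
7         31   28   -4           -2           16    4     8
N = 7   ΣU = 0   ΣV = 17   ΣU² = 206   ΣV² = 185   ΣUV = 121
$$\bar{X} = 35 + \frac{\sum U}{N} = 35 + \frac{0}{7} = 35 \quad \text{and} \quad \bar{Y} = 30 + \frac{\sum V}{N} = 30 + \frac{17}{7} = 32.43$$
$$b_{yx} = b_{vu} = \frac{N \sum UV - \sum U \sum V}{N \sum U^2 - \left(\sum U\right)^2} = \frac{7 \times 121 - 0 \times 17}{7 \times 206 - 0} = 0.587$$
Hence the regression line of Y on X is
Y - 32.43 = 0.587 (X - 35)
When X = 36, Y = 32.43 + 0.587 × (36 - 35) = 33.02
Thus if Judge B were present, he would have awarded about 33 marks to the eighth debater.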
Example 13-3
So r = ± √0.3 = ± 0.5477
Since both the regression coefficients are negative, we assign negative value to the correlation
coefficient
r = - 0.5477
Example 13-4
Solution:
We prepare the table for working out the values for the regression lines.
X     Y     U = X - 65   V = Y - 45   U²     UV     V²
45    25    -20          -20          400    400    400
48    30    -17          -15          289    255    225
50    35    -15          -10          225    150    100
55    30    -10          -15          100    150    225
65    40    0            -5           0      0      25
70    50    5            5            25     25     25
75    45    10           0            100    0      0
72    55    7            10           49     70     100
80    60    15           15           225    225    225
85    65    20           20           400    400    400
ΣX = 645   ΣY = 435   ΣU = -5   ΣV = -15   ΣU² = 1,813   ΣUV = 1,675   ΣV² = 1,725
We have
$$\bar{X} = \frac{\sum X}{N} = \frac{645}{10} = 64.5 \quad \text{and} \quad \bar{Y} = \frac{\sum Y}{N} = \frac{435}{10} = 43.5$$
$$b_{yx} = \frac{N \sum UV - \sum U \sum V}{N \sum U^2 - \left(\sum U\right)^2} = \frac{(10)(1{,}675) - (-5)(-15)}{(10)(1{,}813) - (-5)^2} = \frac{16{,}750 - 75}{18{,}130 - 25} = \frac{16{,}675}{18{,}105} = 0.921$$
Regression equation of Y on X is
$$Y - \bar{Y} = b_{yx}\,(X - \bar{X})$$
Y - 43.5 = 0.921 (X - 64.5)
or Y = 0.921X - 15.90
$$b_{xy} = \frac{N \sum UV - \sum U \sum V}{N \sum V^2 - \left(\sum V\right)^2} = \frac{16{,}750 - 75}{17{,}250 - 225} = \frac{16{,}675}{17{,}025} = 0.979$$
Regression equation of X on Y will be
$$X - \bar{X} = b_{xy}\,(Y - \bar{Y})$$
X - 64.5 = 0.979 (Y - 43.5)
or X = 0.979Y + 21.91
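Because the assumed-mean method involves many column totals, it is worth recomputing them directly from the X and Y columns of the table. The sketch below does this in Python, using 65 and 45 as the assumed means as in the working; the variable names are choices made for this illustration.

```python
X = [45, 48, 50, 55, 65, 70, 75, 72, 80, 85]
Y = [25, 30, 35, 30, 40, 50, 45, 55, 60, 65]
A, B = 65, 45                      # assumed means for X and Y

U = [x - A for x in X]             # deviations of X from the assumed mean
V = [y - B for y in Y]             # deviations of Y from the assumed mean
n = len(X)

sum_u, sum_v = sum(U), sum(V)
sum_uu = sum(u * u for u in U)
sum_vv = sum(v * v for v in V)
sum_uv = sum(u * v for u, v in zip(U, V))

numerator = n * sum_uv - sum_u * sum_v
b_yx = numerator / (n * sum_uu - sum_u ** 2)   # regression coefficient of Y on X
b_xy = numerator / (n * sum_vv - sum_v ** 2)   # regression coefficient of X on Y

x_bar = A + sum_u / n
y_bar = B + sum_v / n

print("totals:", sum_u, sum_v, sum_uu, sum_vv, sum_uv)
print("means:", x_bar, y_bar)
print("Y on X: Y = %.3f X + %.2f" % (b_yx, y_bar - b_yx * x_bar))
print("X on Y: X = %.3f Y + %.2f" % (b_xy, x_bar - b_xy * y_bar))
```

Comparing the printed totals with the last row of the hand-computed table is a quick way to catch a slip in any single row.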
Example 13-5
So r = ± √(9/25)
= ± 0.6
Both the values of the regression coefficients being positive, we have to consider only the positive value
of the correlation coefficient. Hence r = 0.6
(iii) We have been given the variance of X, i.e. Sx² = 9
Sx = ± 3
We consider Sx = 3 as SD is always positive
Since byx = r Sy/Sx
Substituting the values of byx, r and Sx we obtain,
Sy = byx Sx / r = (4/5 × 3) / 0.6
= 4
Example 13-6
The height of a child increases at a rate given in the table below. Fit the straight line using the method of least-square and
calculate the average increase and the standard error of estimate.
Month: 1 2 3 4 5 6 7 8 9 10
Height: 52.5 58.7 65 70.2 75.4 81.1 87.2 95.5 102.2 108.4
Solution:
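For the regression calculations, the table of X, Y, X² and XY values gives the following totals (N = 10):
ΣX = 55   ΣY = 796.2   ΣX² = 385   ΣXY = 4887.5
Considering the regression line as Y = a + bX, we can obtain the values of a and b from the above
values.
$$a = \frac{\sum Y \sum X^2 - \sum X \sum XY}{N \sum X^2 - \left(\sum X\right)^2} = \frac{796.2 \times 385 - 55 \times 4887.5}{10 \times 385 - 55 \times 55} = 45.73$$
$$b = \frac{N \sum XY - \sum X \sum Y}{N \sum X^2 - \left(\sum X\right)^2} = \frac{10 \times 4887.5 - 55 \times 796.2}{10 \times 385 - 55 \times 55} = 6.16$$
Hence the fitted line is Y = 45.73 + 6.16X, and the average increase in height is 6.16 units per month.
$$\sum (Y - Y_c)^2 = 10.421$$
$$S_{yx} = \sqrt{\frac{1}{N}\sum (Y - Y_c)^2} = \sqrt{\frac{10.421}{10}} = 1.02$$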
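The totals, the fitted line and the standard error of estimate in this example can be reproduced as follows. This is a sketch that applies the same formulas for a and b to the month and height data given in the question; the standard error uses the divisor N, as in the working above.

```python
import math

months = list(range(1, 11))                                   # X
heights = [52.5, 58.7, 65, 70.2, 75.4, 81.1, 87.2, 95.5, 102.2, 108.4]  # Y
n = len(months)

sum_x = sum(months)
sum_y = sum(heights)
sum_x2 = sum(x * x for x in months)
sum_xy = sum(x * y for x, y in zip(months, heights))

denominator = n * sum_x2 - sum_x ** 2
a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator   # intercept
b = (n * sum_xy - sum_x * sum_y) / denominator        # slope = average monthly increase

# Sum of squared residuals and standard error of estimate (divisor N)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(months, heights))
s_yx = math.sqrt(sse / n)

print(round(a, 2), round(b, 2), round(sse, 2), round(s_yx, 2))
```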
Example 13-7
Now
$$r^2 = b_{xy} \cdot b_{yx} = 4k$$
Since $0 \leq r^2 \leq 1$, we obtain $0 \leq 4k \leq 1$,
or $0 \leq k \leq \frac{1}{4}$.
Now for $k = \frac{1}{16}$,
$$r^2 = 4 \times \frac{1}{16} = \frac{1}{4}$$
$$r = \pm\frac{1}{2}$$
= ½ since byx and bxy are positive
When $k = \frac{1}{16}$, the regression line of Y on X becomes
$$Y = \frac{1}{16}X + 4$$
or X - 16Y + 64 = 0
Since the lines of regression pass through the mean values of the variables, we obtain the revised equations as
$$\bar{X} - 4\bar{Y} - 5 = 0$$
$$\bar{X} - 16\bar{Y} + 64 = 0$$
Solving these two equations, we get
$\bar{X} = 28$ and $\bar{Y} = 5.75$
Example 13-8
A firm knows from its past experience that its monthly average expenses (X) on advertisement are Rs
25,000 with standard deviation of Rs 25.25. Similarly, its average monthly product sales (Y) have been
Rs 45,000 with standard deviation of Rs 50.50. Given this information and also the coefficient of
correlation between sales and advertisement expenditure as 0.75, estimate
(i) the most appropriate value of sales against an advertisement expenditure of Rs 50,000.
(ii) the most appropriate advertisement expenditure for achieving a sales target of Rs 80,000
Solution: Given the following
X̄ = Rs 25,000   Sx = Rs 25.25
Ȳ = Rs 45,000   Sy = Rs 50.50
r = 0.75
(i) Using the equation $Y_c - \bar{Y} = r\,\frac{S_y}{S_x}\,(X - \bar{X})$, the most appropriate value of sales Yc for an
advertisement expenditure X = Rs 50,000 is
$$Y_c - 45{,}000 = 0.75 \times \frac{50.50}{25.25} \times (50{,}000 - 25{,}000)$$
Yc = 45,000 + 37,500
= Rs 82,500
(ii) Using the equation $X_c - \bar{X} = r\,\frac{S_x}{S_y}\,(Y - \bar{Y})$, the most appropriate value of advertisement
expenditure Xc for achieving a sales target Y = Rs 80,000 is
$$X_c - 25{,}000 = 0.75 \times \frac{25.25}{50.50} \times (80{,}000 - 45{,}000)$$
Xc = 13,125 + 25,000
= Rs 38,125
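When only the means, standard deviations and r are known, both estimates come from the same formula with the roles of the two variables exchanged. A minimal sketch, assuming a hypothetical helper named `estimate_from_summary`:

```python
def estimate_from_summary(target_mean, target_sd, given_mean, given_sd, r, given_value):
    """Estimate the target variable from the regression line written in mean/SD/r form."""
    return target_mean + r * (target_sd / given_sd) * (given_value - given_mean)

mean_x, sd_x = 25000, 25.25   # advertisement expenditure (X)
mean_y, sd_y = 45000, 50.50   # sales (Y)
r = 0.75

# (i) sales for an advertisement expenditure of Rs 50,000 (line of Y on X)
sales = estimate_from_summary(mean_y, sd_y, mean_x, sd_x, r, 50000)

# (ii) advertisement expenditure for a sales target of Rs 80,000 (line of X on Y)
advertising = estimate_from_summary(mean_x, sd_x, mean_y, sd_y, r, 80000)

print(sales, advertising)
```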
13.5 MULTIPLE REGRESSION ANALYSIS
In multiple regression analysis, the effect of two or more independent variables on one dependent
variable is studied. It therefore uses three or more variables in all to estimate the value of the dependent
variable. Let's take three variables, say X1, X2 and X3. Now, take X1 as the dependent variable and try to
find out its relative movement for movements in both X2 and X3, which are independent variables. The prime
objectives of multiple regression analysis are:
(i) To estimate an equation which provides estimates of the dependent variable from the values of the
two or more independent variables.
(ii) To obtain a measure of the error involved in using the regression equation as a basis for estimation.
(iii) To obtain a measure of the proportion of variance in the dependent variable explained by the
independent variables.
The first objective is accomplished by estimating an appropriate regression equation by the method of
least squares. The second objective is achieved through the calculation of a standard error of estimate.
The third objective is attained by computing the multiple coefficient of determination.
13.5.1 Regression Equations
A regression equation is an equation for estimating a dependent variable, say X1, from the independent
variables X2 and X3, and is called a regression equation of X1 on X2 and X3. The procedure of estimating a
multiple regression is similar to that of simple regression, with the difference that the other variables are
added in the regression equation. If there are three variables X1, X2 and X3, the multiple regression has
the following form:
$$X_1 = a_{1.23} + b_{12.3} X_2 + b_{13.2} X_3$$
There are three constants a1.23, b12.3 and b13.2 in the above equation. The subscript after the dot indicates
the variables which are held constant.
Interpretation of Constants: In the above equation:
a1.23 = The constant a1.23 is the intercept made by the regression plane. It gives the value of dependent
variable when X2 and X3 independent variables are 0.
b12.3 = It indicates the slope of the regression line of X1 on X2 when X3 is held constant. It measures the
amount by which a unit change in X2 is expected to affect X1 when X3 is held constant.
b13.2 = It indicates the slope of the regression line of X1 on X3 when X2 is held constant. It measures the
amount by which a unit change in X3 is expected to affect X1 when X2 is held constant.
In this way, three different regression equations can be formed using three variables X 1, X2 and X3
which are explained below:
a. The regression plane of X1 on X2 and X3
b. The regression plane of X2 on X1 and X3
c. The regression plane of X3 on X1 and X2
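The regression plane of X1 on X2 and X3
$$X_1 = a_{1.23} + b_{12.3} X_2 + b_{13.2} X_3$$
In the above equation, the values of a1.23, b12.3 and b13.2 are determined by solving simultaneously
the following three normal equations:
$$\sum X_1 = N a_{1.23} + b_{12.3} \sum X_2 + b_{13.2} \sum X_3$$
$$\sum X_1 X_2 = a_{1.23} \sum X_2 + b_{12.3} \sum X_2^2 + b_{13.2} \sum X_2 X_3$$
$$\sum X_1 X_3 = a_{1.23} \sum X_3 + b_{12.3} \sum X_2 X_3 + b_{13.2} \sum X_3^2$$
The last two of these three equations can be obtained if we multiply the regression equation by X2 and X3
in turn on both sides and sum over all observations.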
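The regression plane of X2 on X1 and X3
$$X_2 = a_{2.13} + b_{21.3} X_1 + b_{23.1} X_3$$
In the above equation, the values of a2.13, b21.3 and b23.1 are determined by solving simultaneously
the following three normal equations:
$$\sum X_2 = N a_{2.13} + b_{21.3} \sum X_1 + b_{23.1} \sum X_3$$
$$\sum X_1 X_2 = a_{2.13} \sum X_1 + b_{21.3} \sum X_1^2 + b_{23.1} \sum X_1 X_3$$
$$\sum X_2 X_3 = a_{2.13} \sum X_3 + b_{21.3} \sum X_1 X_3 + b_{23.1} \sum X_3^2$$
The last two of these three equations can be obtained if we multiply the regression equation by X1 and X3
in turn on both sides and sum over all observations.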
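The regression plane of X3 on X1 and X2
$$X_3 = a_{3.12} + b_{31.2} X_1 + b_{32.1} X_2$$
In the above equation, the values of a3.12, b31.2 and b32.1 are determined by solving simultaneously
the following three normal equations:
$$\sum X_3 = N a_{3.12} + b_{31.2} \sum X_1 + b_{32.1} \sum X_2$$
$$\sum X_1 X_3 = a_{3.12} \sum X_1 + b_{31.2} \sum X_1^2 + b_{32.1} \sum X_1 X_2$$
$$\sum X_2 X_3 = a_{3.12} \sum X_2 + b_{31.2} \sum X_1 X_2 + b_{32.1} \sum X_2^2$$
The last two of these three equations can be obtained if we multiply the regression equation by X1 and X2
in turn on both sides and sum over all observations.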
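Solving three normal equations by hand is laborious, so a small program is a convenient check. The sketch below builds the normal equations for the plane of X1 on X2 and X3 and solves them by Gauss-Jordan elimination. The data are hypothetical, constructed so that X1 = 2X2 - X3 exactly, and the function name `solve_linear_system` is an assumption of this illustration.

```python
def solve_linear_system(coeffs, rhs):
    """Solve a small square linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(b)] for row, b in zip(coeffs, rhs)]
    for col in range(n):
        # Bring the row with the largest entry in this column to the pivot position
        pivot_row = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

# Hypothetical data: X1 is an exact linear function of X2 and X3
X2 = [1, 2, 3, 4, 5, 6]
X3 = [2, 1, 4, 3, 6, 5]
X1 = [2 * x2 - x3 for x2, x3 in zip(X2, X3)]
n = len(X1)

def dot(p, q):
    """Sum of products of two columns."""
    return sum(a * b for a, b in zip(p, q))

# Coefficient matrix and right-hand side of the three normal equations
coeffs = [
    [n,       sum(X2),     sum(X3)],
    [sum(X2), dot(X2, X2), dot(X2, X3)],
    [sum(X3), dot(X2, X3), dot(X3, X3)],
]
rhs = [sum(X1), dot(X1, X2), dot(X1, X3)]

a_1_23, b_12_3, b_13_2 = solve_linear_system(coeffs, rhs)
print(round(a_1_23, 6), round(b_12_3, 6), round(b_13_2, 6))
```

Because the hypothetical X1 values lie exactly on a plane, the solution recovers the intercept and the two partial regression coefficients used to generate them.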
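We can find the values of b12.3 and b13.2 by simultaneously solving the following two normal equations:
$$\sum x_1 x_2 = b_{12.3} \sum x_2^2 + b_{13.2} \sum x_2 x_3$$
$$\sum x_1 x_3 = b_{12.3} \sum x_2 x_3 + b_{13.2} \sum x_3^2$$
where x1, x2 and x3 denote the deviations of X1, X2 and X3 from their respective means.
Similar calculations are possible for the regression equation of x2 on x1 and x3, and for the regression
equation of x3 on x1 and x2.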
Second Method
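Multiple Regression Equation of X1 on X2 and X3: It can be expressed as follows:
$$X_1 - \bar{X}_1 = \frac{\sigma_1}{\sigma_2}\left[\frac{r_{12} - r_{13} r_{23}}{1 - r_{23}^2}\right](X_2 - \bar{X}_2) + \frac{\sigma_1}{\sigma_3}\left[\frac{r_{13} - r_{12} r_{23}}{1 - r_{23}^2}\right](X_3 - \bar{X}_3)$$
Similarly, for X2 on X1 and X3, and for X3 on X1 and X2:
$$X_2 - \bar{X}_2 = \frac{\sigma_2}{\sigma_1}\left[\frac{r_{12} - r_{23} r_{13}}{1 - r_{13}^2}\right](X_1 - \bar{X}_1) + \frac{\sigma_2}{\sigma_3}\left[\frac{r_{23} - r_{12} r_{13}}{1 - r_{13}^2}\right](X_3 - \bar{X}_3)$$
$$X_3 - \bar{X}_3 = \frac{\sigma_3}{\sigma_1}\left[\frac{r_{13} - r_{23} r_{12}}{1 - r_{12}^2}\right](X_1 - \bar{X}_1) + \frac{\sigma_3}{\sigma_2}\left[\frac{r_{23} - r_{13} r_{12}}{1 - r_{12}^2}\right](X_2 - \bar{X}_2)$$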
Third Method
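If σ1, σ2, σ3 are not given, we can calculate S1, S2 and S3 from the given data. Then the regression
equations are:
Regression equation of X1 on X2 and X3
$$X_1 - \bar{X}_1 = \frac{S_1}{S_2}\left[\frac{r_{12} - r_{13} r_{23}}{1 - r_{23}^2}\right](X_2 - \bar{X}_2) + \frac{S_1}{S_3}\left[\frac{r_{13} - r_{12} r_{23}}{1 - r_{23}^2}\right](X_3 - \bar{X}_3)$$
Regression equation of X2 on X1 and X3
$$X_2 - \bar{X}_2 = \frac{S_2}{S_1}\left[\frac{r_{12} - r_{23} r_{13}}{1 - r_{13}^2}\right](X_1 - \bar{X}_1) + \frac{S_2}{S_3}\left[\frac{r_{23} - r_{12} r_{13}}{1 - r_{13}^2}\right](X_3 - \bar{X}_3)$$
Regression equation of X3 on X1 and X2
$$X_3 - \bar{X}_3 = \frac{S_3}{S_1}\left[\frac{r_{13} - r_{23} r_{12}}{1 - r_{12}^2}\right](X_1 - \bar{X}_1) + \frac{S_3}{S_2}\left[\frac{r_{23} - r_{13} r_{12}}{1 - r_{12}^2}\right](X_2 - \bar{X}_2)$$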
dependent variable X1. If the total variations of X1 is divided into two parts, the standard error would
represent the unexplained variations. The explained variations would be due to independent variables.
The standard error of the estimate of multiple regression equation X1 on X2 and X3 is calculated by the
following formula:
∑
√
Where S1.23 is the standard error of the estimate of X1 on X2 and X3. X1 is the original value of X and Y1
is the estimated value on the basis of the regression equations.
The standard error of the estimate in terms of the correlation coefficients r12, r13 and r23 is:
If it is proposed not to calculate the estimated values of X for all data points and also not to calculate the
partial correlation coefficients, the standard error of the regression estimate is calculated by the following
formula:
∑ ∑ ∑ ∑
√
∑ ∑ ∑
∑ ∑ ∑ ∑
∑ ∑ ∑ ∑
20b12.3 + 20 (-1) = 20
20b12.3 - 20 = 20
20b12.3 = 40
b12.3 = 40/20 = 2 …………………. (xi)
Substituting the values of b12.3 and b13.2 in equation (i), we get:
6a1.23 + 24(2) + 28(-1) = 20
6a1.23 + 48 - 28 = 20
6a1.23 + 20 = 20
6a1.23 = 20 – 20
a1.23 = 0/6 = 0
Putting the value of a1.23 = 0, b12.3 = 2 and b13.2 = -1, in equation (A) we get the required equation of X1
on X2 and X3 as:
X1 = 0 + 2X2 – X3
X1 = 2X2 – X3
It should be noted that in the above problem X1, X2, X3 are linear and as such the estimated values of X1
for given value of X2 and X3 would remain unchanged.
Example 13.10
In a trivariate distribution: σ1 = 3; σ2 = 4; σ3 = 5; r23 = 0.4; r31 = 0.6; r12 = 0.7.
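Determine the regression of X1 on X2 and X3, when the variates are measured from their means.
Solution: The regression equation of X1 on X2 and X3, when the variates are measured from their means, is:
$$x_1 = b_{12.3}\, x_2 + b_{13.2}\, x_3 \qquad (i)$$
$$b_{12.3} = \frac{\sigma_1}{\sigma_2}\left[\frac{r_{12} - r_{13} r_{23}}{1 - r_{23}^2}\right] = \frac{3}{4}\left[\frac{0.7 - 0.6 \times 0.4}{1 - (0.4)^2}\right] = 0.41$$
$$b_{13.2} = \frac{\sigma_1}{\sigma_3}\left[\frac{r_{13} - r_{12} r_{23}}{1 - r_{23}^2}\right] = \frac{3}{5}\left[\frac{0.6 - 0.7 \times 0.4}{1 - (0.4)^2}\right] = 0.229$$
Substituting b12.3 = 0.41 and b13.2 = 0.229 in equation (i), we get the required regression equation of
X1 on X2 and X3 as:
$$x_1 = 0.41\, x_2 + 0.229\, x_3$$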
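The two partial regression coefficients in this example depend only on the standard deviations and the three simple correlation coefficients, so they are easy to recompute. The following is a sketch under the assumption that the coefficients are given by the standard expressions in terms of σ and r; the function name `partial_regression_coefficients` is hypothetical.

```python
def partial_regression_coefficients(s1, s2, s3, r12, r13, r23):
    """Return (b12.3, b13.2) for the regression of X1 on X2 and X3."""
    common = 1 - r23 ** 2
    b12_3 = (s1 / s2) * (r12 - r13 * r23) / common
    b13_2 = (s1 / s3) * (r13 - r12 * r23) / common
    return b12_3, b13_2

# Data of the example: sigma1 = 3, sigma2 = 4, sigma3 = 5, r23 = 0.4, r31 = 0.6, r12 = 0.7
b12_3, b13_2 = partial_regression_coefficients(3, 4, 5, r12=0.7, r13=0.6, r23=0.4)

# With variates measured from their means: x1 = b12.3 * x2 + b13.2 * x3
print(round(b12_3, 3), round(b13_2, 3))
```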
Example 13.11
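The given data are: X̄1 = 6, X̄2 = 7, X̄3 = 8; σ1 = 1, σ2 = 2, σ3 = 3; r12 = 0.6, r13 = 0.7, r23 = 0.8. Find the
regression equation of X3 on X1 and X2. Estimate the value of X3 when X1 = 4 and X2 = 5.
Solution: We shall solve this problem by the third method discussed earlier. According to this method,
the regression equation of X3 on X1 and X2 is:
$$X_3 - \bar{X}_3 = \frac{\sigma_3}{\sigma_1}\left[\frac{r_{13} - r_{23} r_{12}}{1 - r_{12}^2}\right](X_1 - \bar{X}_1) + \frac{\sigma_3}{\sigma_2}\left[\frac{r_{23} - r_{13} r_{12}}{1 - r_{12}^2}\right](X_2 - \bar{X}_2)$$
$$X_3 - 8 = \frac{3}{1}\left[\frac{0.7 - 0.8 \times 0.6}{1 - (0.6)^2}\right](X_1 - 6) + \frac{3}{2}\left[\frac{0.8 - 0.7 \times 0.6}{1 - (0.6)^2}\right](X_2 - 7)$$
$$X_3 - 8 = 3\left[\frac{0.22}{0.64}\right](X_1 - 6) + 1.5\left[\frac{0.38}{0.64}\right](X_2 - 7)$$
$$X_3 - 8 = 1.031\,(X_1 - 6) + 0.891\,(X_2 - 7) \qquad (i)$$
$$X_3 = 1.031\,X_1 + 0.891\,X_2 - 4.42$$
When X1 = 4 and X2 = 5, equation (i) gives X3 = 8 + 1.031 × (-2) + 0.891 × (-2) = 4.16 (approximately).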
13.8 SUMMARY
Regression analysis means the estimation or prediction of the unknown value of one variable from the
known value(s) of the other variable(s). It is one of the most important and widely used statistical
techniques in almost all sciences, whether natural, social or physical. Regression analysis for studying more
than two variables at a time is known as multiple regression. If X and Y are two variables whose
relationship is to be described, the line that gives the best estimate of Y for any value of X is called the
regression line of Y on X. If X is taken as the dependent variable instead, the line that gives the best
estimate of X for any value of Y is called the regression line of X on Y. A line of best fit can be fitted by
means of the freehand drawing method or the least squares method. The two regression lines indicate the
nature and extent of correlation between the variables. Correlation and regression are two related aspects
of a single problem, the analysis of the relationship between the variables. If we have information on more
than one variable, we might be interested in seeing if there is any connection, any association, between
them. If we found such an association, we might again be interested in predicting the value of one variable
for the given and known values of the other variable(s).
13.9 KEYWORDS
Dependent variable: The variable to be predicted is called the dependent variable.
Independent variable: The predictor is called the independent variable, or explanatory variable.
The line of Regression: It is the graphical representation of the relationship that gives the best estimate
of one variable for any given value of the other variable.
Error of estimate: The preciseness of an estimate can be obtained only through a measure of the
magnitude of error in the estimates, called the error of estimate.
Coefficient of determination: It gives the percentage variation in the dependent variable that is
accounted for by the independent variable.
13.10 SELF-ASSESSMENT TEST
1. Explain clearly the concept of Regression. Explain with suitable examples its role in dealing with
business problems.
2. What do you understand by linear regression?
3. What is meant by 'regression'? Why should there be, in general, two lines of regression for each
bivariate distribution? How are the two regression lines useful in studying correlation between two
variables?
4. Why is the regression line known as line of best fit?
5. Write short note on
(i) Regression Coefficients
(ii) Regression Equations
(iii) Standard Error of Estimate
(iv) Coefficient of Determination
(v) Coefficient of Non-determination
6. Distinguish clearly between correlation and regression as concepts used in statistical analysis.
7. Fit a least-square line to the following data:
(i) Using X as independent variable
(ii) Using X as dependent variable
X : 1 3 4 8 9 11 14
Y : 1 2 4 5 7 8 9
Hence obtain
c) The regression coefficients of Y on X and of X on Y
d) X̄ and Ȳ
e) Coefficient of correlation between X and Y
f) What is the estimated value of Y when X = 10 and of X when Y = 5?
8. What are regression coefficients? Show that r2 = byx. bxy where the symbols have their usual
meanings. What can you say about the angle between the regression lines when (i) r = 0, (ii) r = 1
(iii) r increases from 0 to 1?
9. Obtain the equations of the lines of regression of Y on X from the following data.
X : 12 18 24 30 36 42 48
Y : 5.27 5.68 6.25 7.21 8.02 8.71 8.42
Estimate the most probable value of Y, when X = 40.
10. The following table gives the ages and blood pressure of 9 women.
Age (X) : 56 42 36 47 49 42 60 72 63
Blood Pressure (Y) : 147 125 118 128 145 140 155 160 149
Find the correlation coefficient between X and Y.
(i) Determine the least square regression equation of Y on X.
(ii) Estimate the blood pressure of a woman whose age is 45 years.
11. Given the following results for the height (X) and weight (Y) in appropriate units of 1,000 students:
X̄ = 68, Ȳ = 150, Sx = 2.5, Sy = 20 and r = 0.6.
Obtain the equations of the two lines of regression. Estimate the height of a student A who weighs
200 units and also estimate the weight of the student B whose height is 60 units.
12. From the following data, find out the probable yield when the rainfall is 29".
                     Rainfall   Yield
Mean                 25"        40 units per hectare
Standard Deviation   3"         6 units per hectare
Correlation coefficient between rainfall and production = 0.8.
13. A study of wheat prices at two cities yielded the following data:
                     City A     City B
Average Price        Rs 2.463   Rs 2.797
Standard Deviation   Rs 0.326   Rs 0.207
Coefficient of correlation r is 0.774. Estimate from the above data the most likely price of wheat
(i) at City A corresponding to the price of Rs 2.334 at City B
(ii) at City B corresponding to the price of Rs 3.052 at City A
14. Find out the regression equation showing the regression of capacity utilization on production from
the following data:
Average Standard Deviation
Production (in lakh units) 35.6 10.5
Capacity Utilization (in percentage) 84.8 8.5
r = 0.62. Estimate the production, when capacity utilisation is 70%.
15. The following table shows the mean and standard deviation of the prices of two shares in a stock
exchange.
Share Mean (in Rs) Standard Deviation (in Rs)
A Ltd. 39.5 10.8
B Ltd. 47.5 16.0
If the coefficient of correlation between the prices of two shares is 0.42, find the most likely price
of share A corresponding to a price of Rs 55, observed in the case of share B.
16. Find out the regression coefficients of Y on X and of X on Y on the basis of following data:
Variance of X = 4, Variance of Y = 9
17. Find the regression equation of X and Y and the coefficient of correlation from the following data:
18. By using the following data, find out the two lines of regression and from them compute the Karl
Pearson‘s coefficient of correlation.
19. The equations of two regression lines between two variables are expressed as
2X – 3Y = 0 and 4Y – 5X-8 = 0.
(i) Identify which of the two can be called regression line of Y on X and of X on Y.
(ii) Find X̄ and Ȳ and the correlation coefficient r from the equations
STRUCTURE
14.2 Introduction
14.6 Summary
14.7 Keywords
1. Identifying the various forces (influences) or factors which produce the variations
in the time series, and
2. Isolating, analysing and measuring the effect of these factors separately and
independently, by holding other things constant.
The purpose of decomposition models is to break a time series into its components:
Trend (T), Cyclical (C), Seasonality (S), and Irregularity (I). Decomposition of time
series provides a basis for forecasting. There are many models by which a time series
can be analysed; two models commonly used for decomposition of a time series are
discussed below.
This is a most widely used model which assumes that forecast (Y) is the product of
the four components at a particular time period. That is, the effect of four
components on the time series is interdependent.
The two causal methods, regression analysis and correlation analysis, have already
been discussed previously.
A few time series methods such as freehand curves and moving averages simply
describe the given data values, while other methods such as semi-average and least
squares help to identify a trend equation to describe the given data values.
A freehand curve drawn smoothly through the data values is often an easy and,
perhaps, adequate representation of the data. The forecast can be obtained simply by
extending the trend line. A trend line fitted by the freehand method should conform
to the following conditions:
(i) The trend line should be smooth- a straight line or mix of long gradual curves.
(ii) The sum of the vertical deviations of the observations above the trend line should
equal the sum of the vertical deviations of the observations below the trend line.
(iii) The sum of squares of the vertical deviations of the observations from the trend
line should be as small as possible.
(iv) The trend line should bisect the cycles so that area above the trend line should
be equal to the area below the trend line, not only for the entire series but as
much as possible for each full cycle.
Example 14.1: Fit a trend line to the following data by using the freehand
method.
Year :                          1991 1992 1993 1994 1995 1996 1997 1998
Sales turnover (Rs. in lakh) :  80   90   92   83   94   99   92   104
Solution: Figure 16.2 presents the freehand graph of sales turnover (Rs. in lakh)
from 1991 to 1998. Forecast can be obtained simply by extending the trend line.
[Fig. 16.2: Graph of Sales Turnover. Horizontal axis: Years (1991 to 1998); vertical axis: Sales (80 to 110).]
Limitations of freehand method
(i) This method is highly subjective because the trend line depends on personal
judgement and therefore what happens to be a good-fit for one individual may
not be so for another.
(ii) The trend line drawn cannot have much value if it is used as a basis for
predictions.
(iii) Semi-averages
The data requirements for the techniques to be discussed in this section are minimal
and these techniques are easy to use and understand.
Moving Averages
If we are observing the movement of some variable values over a period of time and
trying to project this movement into the future, then it is essential to smooth out first
the irregular pattern in the historical values of the variable, and later use this as the
basis for a future projection. This can be done by calculating a series of moving
averages.
This method is a subjective method and depends on the length of the period chosen
for calculating moving averages. To remove the effect of cyclical variations, the period
chosen should be an integer value that corresponds to or is a multiple of the
estimated average length of a cycle in the series.
The moving averages which serve as an estimate of the next period’s value of a
variable given a period of length n is expressed as:
In this method, the term ‘moving’ is used because it is obtained by summing and
averaging the values from a given number of periods, each time deleting the oldest
value and adding a new value.
The limitation of this method is that it is highly subjective and dependent on the
length of period chosen for constructing the averages. Moving averages have the
following three limitations:
(i) As the size of n (the number of periods averaged) increases, it smoothens the
variations better, but it also makes the method less sensitive to real changes in
the data.
(ii) Moving averages cannot pick-up trends very well. Since these are averages, it will
always stay within past levels and will not predict a change to either a higher or
lower level.
Solution: The moving average calculation for the first 3 years is:
$$\frac{21 + 22 + 23}{3} = 22.00$$
Similarly, the moving average calculation for the next 3 years is:
$$\frac{22 + 23 + 25}{3} = 23.33$$
Year   Value (y)   3-Year Moving   3-Year Moving   y - Moving
                   Total           Average         Average
1987   21          -               -               -
1988   22          66              22.00            0
1989   23          70              23.33           -0.33
1990   25          72              24.00            1.00
1991   24          71              23.67            0.33
1992   22          71              23.67           -1.67
1993   25          73              24.33            0.67
1994   26          78              26.00            0
1995   27          79              26.33            0.67
1996   26          -               -               -
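The 3-year moving averages in this solution can be checked with a short script. This is a minimal sketch, not part of the original text; the function name `moving_average` and the variable names are illustrative assumptions.

```python
def moving_average(values, n):
    """Return the n-period moving averages of `values` (oldest value first)."""
    return [sum(values[i:i + n]) / n for i in range(len(values) - n + 1)]

# The ten yearly figures (1987 to 1996) used in the solution.
series = [21, 22, 23, 25, 24, 22, 25, 26, 27, 26]
ma3 = moving_average(series, 3)

# Each 3-year average is centred on the middle year, so ma3[0] belongs to 1988.
deviations = [y - m for y, m in zip(series[1:-1], ma3)]

for year, m, dev in zip(range(1988, 1996), ma3, deviations):
    print(year, round(m, 2), round(dev, 2))
```

Because n is odd here, each average lines up with an actual year, and no average exists for the first and last year of the series.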
Odd and Even Number of Years
When the chosen period of length n is an odd number, the moving average at year i is
centred on i, the middle year in the consecutive sequence of n yearly values used to
compute it. For instance, with n = 5, $MA_3(5)$ is centred on the third year, $MA_4(5)$ is
centred on the fourth year, ..., and $MA_9(5)$ is centred on the ninth year.
No moving average can be obtained for the first (n-1)/2 years or the last (n-1)/2 years
of the series. Thus for a 5-year moving average, we cannot make computations for
the first two years or the last two years of the series.
When the chosen period of length n is an even number, equal parts can easily be
formed and an average of each part is obtained. For example, if n = 4, then the first
moving average M3 (placed at period 3) is an average of the first four data values, and
the second moving average M4 (placed at period 4) is the average of data values 2
through 5. The average of M3 and M4 is placed at period 3 because it is an average of
data values for periods 1 through 5.
Example 14.3: Assume a four-yearly cycle and calculate the trend by the method of
moving average from the following data relating to the production of tea in India.
Year   Production      Year   Production
       (million lbs)          (million lbs)
1987   464             1992   540
1988   515             1993   557
1989   518             1994   571
1990   467             1995   586
1991   502             1996   612
Solution: The first 4-year moving average is:
$$MA_3(4) = \frac{464 + 515 + 518 + 467}{4} = \frac{1964}{4} = 491.00$$
This moving average is centred on the middle value, that is, the third year of the
series. Similarly,
$$MA_4(4) = \frac{515 + 518 + 467 + 502}{4} = \frac{2002}{4} = 500.50$$
This moving average is centred on the fourth year of the series.
Table 14.2. presents the data along with the computations of 4-year moving
averages.
Table 14.2: Calculation of Trend and Short-term Fluctuations
Year   Production      4-Yearly       4-Yearly          4-Yearly Moving
       (million lbs)   Moving Total   Moving Average    Average Centred
1987   464             -              -                 -
1988   515             -              -                 -
                       1964           491.00
1989   518                                              495.75
                       2002           500.50
1990   467                                              503.62
                       2027           506.75
1991   502                                              511.62
                       2066           516.50
1992   540                                              529.50
                       2170           542.50
1993   557                                              553.00
                       2254           563.50
1994   571                                              572.50
                       2326           581.50
1995   586             -              -                 -
1996   612             -              -                 -
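The centring step for an even period can be sketched in a few lines of code. The sketch below uses the tea production figures of Example 14.3; the function name `centred_moving_average` is an illustrative assumption, not something defined in the text.

```python
def centred_moving_average(values, n):
    """Centred moving average for an even period n.

    Returns a list aligned with `values`; positions that have no centred
    average (the first and last n/2 periods) hold None.
    """
    totals = [sum(values[i:i + n]) for i in range(len(values) - n + 1)]
    out = [None] * len(values)
    for i in range(len(totals) - 1):
        # Average two adjacent n-period averages; the result sits at index i + n/2.
        out[i + n // 2] = (totals[i] + totals[i + 1]) / (2 * n)
    return out

tea = [464, 515, 518, 467, 502, 540, 557, 571, 586, 612]  # 1987 to 1996
trend = centred_moving_average(tea, 4)

for year, value in zip(range(1987, 1997), trend):
    print(year, value)
```

Averaging two adjacent 4-year totals and dividing by 8 is the same as averaging the two 4-year moving averages, which is how the last column of Table 14.2 is obtained.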
Weighted Moving Averages
In moving averages, each observation is given equal importance (weight). However,
different weights may be assigned to calculate a weighted average of the most recent n
values. Choice of weights is somewhat arbitrary because there is no set formula to
determine them. In most cases, the most recent observation receives the most
weightage, and the weight decreases for older data values.
A weighted moving average may be expressed mathematically as
$$\text{Weighted moving average} = \frac{\sum (\text{Weight for period } n)(\text{Data value in period } n)}{\sum \text{Weights}}$$
Example 14.4: Vacuum cleaner sales for 12 months are given below. The owner of the
supermarket decides to forecast sales by weighting the past three months as follows:
Weight Applied   Month
3                Last month
2                Two months ago
1                Three months ago
6                (sum of weights)
Month                   : 1  2  3  4  5  6  7  8  9  10 11 12
Actual sales (in units) : 10 12 13 16 19 23 26 30 28 18 16 14
Solution:
The results of 3-month weighted average are shown in Table 14.3.
$$\text{Forecast for the current month} = \frac{3 \times \text{Sales last month} + 2 \times \text{Sales two months ago} + 1 \times \text{Sales three months ago}}{6}$$
Table 14.3: Weighted Moving Average
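The forecasts of Example 14.4 can be reproduced with the following minimal sketch. The helper name `weighted_moving_average` and the convention of listing weights from oldest to newest are assumptions made for this illustration.

```python
def weighted_moving_average(history, weights):
    """Forecast the next period from the most recent len(weights) values.

    `weights` are listed from oldest to newest, e.g. [1, 2, 3].
    """
    recent = history[-len(weights):]
    return sum(w * x for w, x in zip(weights, recent)) / sum(weights)

sales = [10, 12, 13, 16, 19, 23, 26, 30, 28, 18, 16, 14]  # months 1 to 12

# Forecasts for months 4 to 12, each based on the three preceding months.
forecasts = {m: weighted_moving_average(sales[:m - 1], [1, 2, 3])
             for m in range(4, 13)}

for month, value in forecasts.items():
    print(month, round(value, 2))
```

For month 4, for instance, the forecast is (3 × 13 + 2 × 12 + 1 × 10)/6, which is about 12.17 units.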
Example 14.5: A food processor uses a moving average to forecast next month’s
demand. Past actual demand (in units) is shown below:
Month                    : 43  44  45  46  47  48  49  50  51
Actual demand (in units) : 105 106 110 110 114 121 130 128 137
(a) Compute a simple five-month moving average to forecast demand for month 52.
(b) Compute a weighted three-month moving average where the weights are highest
for the latest months and descend in order of 3, 2, 1.
Solution: Calculations for the five-month moving average are shown in Table 14.4.
Month   Actual Demand   5-month Moving   5-month Moving
                        Total            Average
43      105             -                -
44      106             -                -
45      110             545              109.0
46      110             561              112.2
47      114             585              117.0
48      121             603              120.6
49      130             630              126.0
50      128             -                -
51      137             -                -
(a) Five-month average demand for month 52 is
$$\frac{\sum x}{\text{Number of periods}} = \frac{114 + 121 + 130 + 128 + 137}{5} = 126 \text{ units}$$
Solution: Since the number of years is odd, divide the data into two
equal parts (A and B) of 3 years each, ignoring the middle year (1996). The average of
part A is
$$\bar{y}_A = \frac{102 + 105 + 114}{3} = \frac{321}{3} = 107 \text{ units}$$
[Figure: Sales plotted against Years, 1993 to 1999]
$$\text{Slope } b = \frac{\Delta y}{\Delta x} = \frac{\text{Change in sales}}{\text{Change in year}} = \frac{112 - 107}{1998 - 1994} = \frac{5}{4} = 1.25$$
Intercept = a = 107 units at 1994
Thus, the trend line is: ŷ = 107 + 1.25x
Since 2002 is 8 years distant from the origin (1994), we have
ŷ = 107 + 1.25(8) = 117
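The semi-average calculation amounts to drawing a line through two points, the mean of each half placed at the middle year of that half. A minimal sketch follows, taking the two half-series means (107 at 1994 and 112 at 1998) from the solution; the function name `semi_average_trend` is an illustrative assumption.

```python
def semi_average_trend(mean_a, year_a, mean_b, year_b):
    """Return (intercept, slope) of the line through the two half-series means.

    The intercept is the trend value at `year_a`, which serves as the origin.
    """
    slope = (mean_b - mean_a) / (year_b - year_a)
    return mean_a, slope

a, b = semi_average_trend(107, 1994, 112, 1998)
forecast_2002 = a + b * (2002 - 1994)
print(a, b, forecast_2002)
```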
Exponential Smoothing Methods
Exponential smoothing is a type of moving-average forecasting technique which
weighs past data in an exponential manner so that the most recent data carries more
weight in the moving average. Simple exponential smoothing makes no explicit
adjustment for trend effects whereas adjusted exponential smoothing does take trend
effect into account (see next section for details).
Simple Exponential Smoothing
With simple exponential smoothing, the forecast is made up of the last period
forecast plus a portion of the difference between the last period’s actual demand and
the last period’s forecast.
$$F_t = F_{t-1} + \alpha (D_{t-1} - F_{t-1}) = (1 - \alpha) F_{t-1} + \alpha D_{t-1} \qquad \ldots(16.1)$$
where $F_t$ = current period forecast
$F_{t-1}$ = last period forecast
$\alpha$ = a weight called the smoothing constant ($0 \le \alpha \le 1$)
$D_{t-1}$ = last period actual demand
From Eqn. (16.1), we may notice that each forecast is simply the previous forecast
plus some correction for demand in the last period. If demand was above the last
period forecast the correction will be positive, and if below it will be negative.
When the smoothing constant α is low, more weight is given to past data, and when it is
high, more weight is given to recent data. When α is equal to 0.9, then 99.99 per cent
of the forecast value is determined by the four most recent demands. When α is as
low as 0.1, only 34.39 per cent of the average is due to these last 4 periods and the
smoothing effect is equivalent to a 19-period arithmetic moving average.
If α were assigned a value as high as 1, each forecast would reflect total adjustment
to the recent demand and the forecast would simply be last period's actual demand,
that is, $F_t = 1.0 D_{t-1}$. Since demand fluctuations are typically random, the value of α is
generally kept in the range of 0.005 to 0.30 in order to 'smooth' the forecast. The
exact value depends upon the response to demand that is best for the individual
firm.
The following table helps illustrate this concept. For example, when α = 0.5, we can
see that the new forecast is based on demand in the last three or four periods. When
α = 0.1, the forecast places little weight on recent demand and is equivalent to a
19-period arithmetic moving average.
                         Weight Assigned to
Smoothing   Most Recent   2nd Most   3rd Most   4th Most   5th Most
Constant    Period        Recent     Recent     Recent     Recent
                          Period     Period     Period     Period
            α             α(1-α)     α(1-α)^2   α(1-α)^3   α(1-α)^4
α = 0.1     0.1           0.09       0.081      0.073      0.066
α = 0.5     0.5           0.25       0.125      0.063      0.031
If no previous forecast value is known, the old forecast starting point may be
estimated or taken to be an average of some preceding periods.
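The weights in the table, and the percentages quoted for α = 0.9 and α = 0.1, follow from the fact that the demand k periods back receives weight α(1-α)^k. The sketch below computes them; the function name `smoothing_weights` is an illustrative assumption.

```python
def smoothing_weights(alpha, periods):
    """Weight given to each of the `periods` most recent demands, newest first."""
    return [alpha * (1 - alpha) ** k for k in range(periods)]

w_low = smoothing_weights(0.1, 5)
w_mid = smoothing_weights(0.5, 5)

# Share of the forecast carried by the four most recent demands.
share_09 = sum(smoothing_weights(0.9, 4))
share_01 = sum(smoothing_weights(0.1, 4))

print([round(w, 4) for w in w_low])
print([round(w, 4) for w in w_mid])
print(round(share_09, 4), round(share_01, 4))
```

The share carried by the last four periods equals 1 - (1-α)^4, which gives 0.9999 for α = 0.9 and 0.3439 for α = 0.1.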
Example 14.8: A hospital has used a 9-month moving average forecasting method to
predict drug and surgical inventory requirements. The actual demand for one item is
shown in the table below. Using the previous moving average data, convert to an
exponential smoothing forecast for month 33.
Month             : 24 25 26 27 28 29  30 31 32
Demand (in units) : 78 65 90 71 80 101 84 60 73
Solution: The moving average of a 9-month period is given by
$$MA = \frac{\sum \text{Demand }(x)}{\text{Number of periods}} = \frac{78 + 65 + \dots + 73}{9} = 78$$
Assume $F_{t-1} = 78$. Therefore, the estimated $\alpha = \dfrac{2}{n + 1} = \dfrac{2}{9 + 1} = 0.2$
Thus, $F_t = F_{t-1} + \alpha (D_{t-1} - F_{t-1}) = 78 + 0.2(73 - 78) = 77$ units
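The conversion in Example 14.8 can be sketched as follows: the 9-month moving average serves as the starting forecast and α is set to 2/(n + 1). The function and variable names are illustrative assumptions.

```python
def exponential_smoothing_forecast(prev_forecast, prev_demand, alpha):
    """F_t = F_{t-1} + alpha * (D_{t-1} - F_{t-1})."""
    return prev_forecast + alpha * (prev_demand - prev_forecast)

demand = [78, 65, 90, 71, 80, 101, 84, 60, 73]  # months 24 to 32
n = len(demand)

start = sum(demand) / n   # 9-month moving average, used as F_{t-1}
alpha = 2 / (n + 1)       # smoothing constant matched to an n-period average
forecast_33 = exponential_smoothing_forecast(start, demand[-1], alpha)

print(start, alpha, forecast_33)
```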
Least squares is one of the most widely used methods of fitting trends to data
because it yields what is mathematically described as a 'line of best fit'. This trend
line has the properties that (i) the summation of all vertical deviations about it is
zero, that is, $\sum (y - \hat{y}) = 0$, (ii) the summation of all vertical deviations squared is a
minimum, that is, $\sum (y - \hat{y})^2$ is least, and (iii) the line goes through the mean values of
variables x and y. For linear equations, it is found by the simultaneous solution for a
and b of the two normal equations:
$$\sum y = na + b \sum x \quad \text{and} \quad \sum xy = a \sum x + b \sum x^2$$
Where the data can be coded so that $\sum x = 0$, two terms in these equations drop out
and we have $\sum y = na$ and $\sum xy = b \sum x^2$.
Coding is easily done with time-series data. For coding the data, we choose the
centre of the time period as x = 0 and have an equal number of plus and minus
periods on each side of the trend line which sum to zero.
Alternately, we can also find the values of constants a and b for any regression line
as:
$$b = \frac{\sum xy - n \bar{x} \bar{y}}{\sum x^2 - n (\bar{x})^2} \quad \text{and} \quad a = \bar{y} - b \bar{x}$$
Example 14.9: Below are given the figures of production (in thousand quintals) of a
sugar factory:
Year       : 1992 1993 1994 1995 1996 1997 1998
Production : 80   90   92   83   94   99   92
(b) Plot these figures on a graph and show the trend line.
Solution: (a) Using normal equations and the sugar production data we can compute
constants a and b as shown in Table 14.6:
Year    Time Period   Production   x^2   xy     Trend Values
        (x)           (y)                       (ŷ)
1992    1             80           1     80     84
1993    2             90           4     180    86
1994    3             92           9     276    88
1995    4             83           16    332    90
1996    5             94           25    470    92
1997    6             99           36    594    94
1998    7             92           49    644    96
Total   28            630          140   2576
$$\bar{x} = \frac{\sum x}{n} = \frac{28}{7} = 4, \qquad \bar{y} = \frac{\sum y}{n} = \frac{630}{7} = 90$$
$$b = \frac{\sum xy - n \bar{x} \bar{y}}{\sum x^2 - n (\bar{x})^2} = \frac{2576 - 7(4)(90)}{140 - 7(4)^2} = \frac{56}{28} = 2, \qquad a = \bar{y} - b \bar{x} = 90 - 2(4) = 82$$
Therefore, the trend line is ŷ = 82 + 2x.
(b) Plotting points on the graph paper, we get an actual graph representing
production of sugar over the past 7 years. Joining the point a = 82 and b = 2
(corresponds to 1993) on the graph, we get a trend line as shown in Fig. 14.4.
(c) The production of sugar for year 2001 will be ŷ = 82 + 2 (10) = 102 thousand
quintals.
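The least-squares constants of Example 14.9 can be verified with a short sketch that applies the formulas for b and a directly. The function name `linear_trend` is an illustrative assumption.

```python
def linear_trend(x, y):
    """Least-squares intercept a and slope b for the line y = a + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    sxx = sum(xi * xi for xi in x) - n * x_bar ** 2
    b = sxy / sxx
    return y_bar - b * x_bar, b

x = list(range(1, 8))                      # 1992 -> 1, ..., 1998 -> 7
production = [80, 90, 92, 83, 94, 99, 92]  # thousand quintals
a, b = linear_trend(x, production)

trend_values = [a + b * xi for xi in x]
estimate_2001 = a + b * 10                 # 2001 corresponds to x = 10
print(a, b, estimate_2001)
```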
Parabolic Trend Model
The curvilinear relationship for estimating the value of a dependent variable y from
an independent variable x might take the form
$$\hat{y} = a + bx + cx^2$$
This trend line is called the parabola.
For a non-linear equation $\hat{y} = a + bx + cx^2$, the values of constants a, b, and c can be
determined by solving three normal equations:
$$\sum y = na + b \sum x + c \sum x^2$$
$$\sum xy = a \sum x + b \sum x^2 + c \sum x^3$$
$$\sum x^2 y = a \sum x^2 + b \sum x^3 + c \sum x^4$$
When the data can be coded so that $\sum x = 0$ and $\sum x^3 = 0$, the terms containing
these sums drop out and we have
$$\sum y = na + c \sum x^2$$
$$\sum xy = b \sum x^2$$
$$\sum x^2 y = a \sum x^2 + c \sum x^4$$
To find the exact estimated value of the variable y, the values of constants a, b, and c
need to be calculated. The values of these constants can be calculated by using the
following shortest method:
$$a = \frac{\sum y - c \sum x^2}{n}; \quad b = \frac{\sum xy}{\sum x^2} \quad \text{and} \quad c = \frac{n \sum x^2 y - \sum x^2 \sum y}{n \sum x^4 - (\sum x^2)^2}$$
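The shortcut formulas for a, b and c can be checked numerically. The sketch below uses a hypothetical series generated from y = 3 + 2x + x² (chosen only so that the expected constants are known in advance); the function name `parabolic_trend_coded` is likewise an assumption for illustration.

```python
def parabolic_trend_coded(x, y):
    """Constants a, b, c of y = a + b*x + c*x**2 when sum(x) = sum(x**3) = 0."""
    if sum(x) != 0 or sum(v ** 3 for v in x) != 0:
        raise ValueError("x must be coded so that sum(x) and sum(x**3) are zero")
    n = len(x)
    sx2 = sum(v ** 2 for v in x)
    sx4 = sum(v ** 4 for v in x)
    sy = sum(y)
    sxy = sum(v * w for v, w in zip(x, y))
    sx2y = sum(v ** 2 * w for v, w in zip(x, y))
    c = (n * sx2y - sx2 * sy) / (n * sx4 - sx2 ** 2)
    b = sxy / sx2
    a = (sy - c * sx2) / n
    return a, b, c

# Hypothetical coded series, built from y = 3 + 2x + x^2 purely as a check.
x = [-2, -1, 0, 1, 2]
y = [3 + 2 * v + v ** 2 for v in x]
a, b, c = parabolic_trend_coded(x, y)
print(a, b, c)
```

The shortcut applies only to coded data with an odd number of equally spaced periods centred on x = 0; otherwise the three normal equations have to be solved simultaneously.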
Example 14.10: The prices of a commodity during 1999-2004 are given below. Fit a
parabola to these data. Estimate the price of the commodity for the year 2005.
Solving eqns. (iv) and (v) for b and c we get b = 18.04 and c = 1.78. Substituting
values of b and c in eqn. (i), we get a = 126.68.
Hence, the required non-linear trend line becomes
y = 126.68 + 18.04x + 1.78x^2
Several trend values as shown in Table 14.7 can be obtained by putting x = -2, -1,
0, 1, 2 and 3 in the trend line. The trend values are plotted on a graph paper. The
graph is shown in Fig. 14.5.
Fig. 14.5
Exponential Trend Model
When the given values of dependent variable y form approximately a geometric
progression while the corresponding independent variable x values form an
arithmetic progression, the relationship between variables x and y is given by an
exponential function, and the best fitting curve is said to describe the exponential
trend. Data from the fields of biology, banking, and economics frequently exhibit
such a trend. For example, growth of bacteria, money accumulating at compound
interest, sales or earnings over a short period, and so on, follow exponential growth.
The characteristic property of this law is that the rate of growth, that is, the rate of
change of y with respect to x, is proportional to the values of the function. The
following function has this property.
$$y = ab^{cx}, \quad a > 0$$
Coding is easily done with time-series data by simply designating the centre of the
time period as x = 0, and having an equal number of plus and minus periods on each side
which sum to zero.
Example 14.11: The sales (Rs. in million) of a company for the years 1995 to 1999
are:
Year  : 1995 1996 1997 1998 1999
Sales : 1.6  4.5  13.8 40.2 125.0
Find the exponential trend for the given data and estimate the sales for 2002.
Solution: The computational time can be reduced by coding the data. For this
consider u = x-3. The necessary computations are shown in Table 14.8.
Table 14.8: Fitting the Exponential Trend Line
Year    Time       u = x-3   u^2   Sales y   Log y    u log y
        Period x
1995    1          -2        4     1.60      0.2041   -0.4082
1996    2          -1        1     4.50      0.6532   -0.6532
1997    3           0        0     13.80     1.1399    0
1998    4           1        1     40.20     1.6042    1.6042
1999    5           2        4     125.00    2.0969    4.1938
Total                        10              5.6983    4.7366
$$\log a = \frac{1}{n} \sum \log y = \frac{1}{5}(5.6983) = 1.1397$$
$$\log b = \frac{\sum u \log y}{\sum u^2} = \frac{4.7366}{10} = 0.4737$$
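The values of log a and log b can be checked, and carried through to the estimate for 2002, with the sketch below. It assumes the model y = a·b^u fitted by least squares on log10 y with coded time u = x - 3, as in Table 14.8; the function name and the final estimate are this sketch's own calculation, not figures quoted from the text.

```python
import math

def exponential_trend_coded(u, y):
    """Fit y = a * b**u by least squares on log10(y), assuming sum(u) = 0."""
    logs = [math.log10(v) for v in y]
    log_a = sum(logs) / len(y)
    log_b = sum(ui * li for ui, li in zip(u, logs)) / sum(ui * ui for ui in u)
    return 10 ** log_a, 10 ** log_b

u = [-2, -1, 0, 1, 2]                  # 1995 to 1999 coded as u = x - 3
sales = [1.6, 4.5, 13.8, 40.2, 125.0]  # Rs. million
a, b = exponential_trend_coded(u, sales)

estimate_2002 = a * b ** 5             # 2002 corresponds to x = 8, so u = 5
print(round(math.log10(a), 4), round(math.log10(b), 4), round(estimate_2002, 1))
```

Taking antilogarithms gives a of roughly 13.8 and b of roughly 2.98, so the fitted sales almost triple from one year to the next.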
(i) Shift the origin, simply by adding or subtracting the desired number of periods
from independent variable x in the original forecasting equation.
(ii) Change the time units from annual values to monthly values by dividing
independent variable x by 12.
(iii) To change the y units from annual to monthly values, the entire right-hand side of
the equation must be divided by 12.
Example 14.12: The following forecasting equation has been derived by a least-
squares method:
ŷ = 10.27 + 1.65x (Base year: 1992; x = years; y = tonnes/year)
Rewrite the equation by
(a) shifting the origin to 1997.
(b) expressing x units in months, retaining y in tonnes/year.
(c) expressing x units in months and y in tonnes/month.
Solution: (a) Shifting of origin can be done by adding the desired number of periods,
5 (= 1997 - 1992), to x in the given equation. That is
ŷ = 10.27 + 1.65 (x + 5) = 18.52 + 1.65x
where 1997 = 0, x = years, y = tonnes/year
(b) Expressing x units in months
$$\hat{y} = 10.27 + \frac{1.65x}{12} = 10.27 + 0.14x$$
where July 1, 1992 = 0, x = months, y = tonnes/year
(c) Expressing y in tonnes/month, retaining x in months.
$$\hat{y} = \frac{1}{12}(10.27 + 0.14x) = 0.86 + 0.01x$$
where July 1, 1992 = 0, x = months, y = tonnes/month
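The three transformations of Example 14.12 can be written as small helper functions acting on the constants (a, b) of a linear trend. This is a minimal sketch; the helper names are illustrative assumptions.

```python
def shift_origin(a, b, periods):
    """Move the origin `periods` units forward: y = a + b*(x + periods)."""
    return a + b * periods, b

def rescale_x(a, b, factor):
    """Express x in units `factor` times smaller (12 for years -> months)."""
    return a, b / factor

def rescale_y(a, b, factor):
    """Express y per smaller unit (12 for tonnes/year -> tonnes/month)."""
    return a / factor, b / factor

a, b = 10.27, 1.65                           # base year 1992, x in years
a_1997, b_1997 = shift_origin(a, b, 5)       # part (a)
a_m, b_m = rescale_x(a, b, 12)               # part (b)
a_mm, b_mm = rescale_y(a_m, b_m, 12)         # part (c)

print(round(a_1997, 2), b_1997)
print(a_m, round(b_m, 2))
print(round(a_mm, 2), round(b_mm, 2))
```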
Remarks
1. If both x and y are to be expressed in months together, then divide constant
'a' by 12 and constant 'b' by 24. It is because data are sums of 12 months. Thus
the monthly trend equation becomes
$$\text{Linear trend: } \hat{y} = \frac{a}{12} + \frac{b}{24} x$$
$$\text{Parabolic trend: } \hat{y} = \frac{a}{12} + \frac{b}{144} x + \frac{c}{1728} x^2$$
But if data are given as monthly averages per year, then the value of 'a' remains
unchanged, 'b' is divided by 12 and 'c' by 144.
2. The annual trend equation can be reduced to a quarterly trend equation as:
$$\hat{y} = \frac{a}{4} + \frac{b}{4 \times 12} x = \frac{a}{4} + \frac{b}{48} x$$
14.3 SEASONAL VARIATIONS
If the time series data are in terms of annual figures, the seasonal variations are
absent. These variations are likely to be present in data recorded on quarterly or
monthly or weekly or daily or hourly basis. As discussed earlier, the seasonal
variations are periodic in nature with period less than or equal to one year. These
variations reflect the annual repetitive pattern of the economic or business activity of
any society. The main objectives of measuring seasonal variations are:
(iii) To compare the pattern of seasonal variations of two or more time series in a
given period or of the same series in different periods.
(iv) To eliminate the seasonal variations from the data. This process is known as
deseasonalisation of data.
Methods of Measuring Seasonal Variations
The measurement of seasonal variation is done by isolating them from other
components of a time series. There are four methods commonly used for the
measurement of seasonal variations. These methods are:
1. Method of Simple Averages
In the above table, A denotes the average and S.I. the seasonal index for a particular
month of various years. To calculate the seasonal index, we compute the grand average
G, given by $G = \dfrac{\sum A_i}{12} = \dfrac{523}{12} = 43.7$. Then the seasonal index for a particular month is
given by $\text{S.I.} = \dfrac{A_i}{G} \times 100$.
Further, $\sum \text{S.I.} = 1198.9 \ne 1200$. Thus, we have to adjust these values such that their
total is 1200. This can be done by multiplying each figure by $\dfrac{1200}{1198.9}$. The resulting
figures are the adjusted seasonal indices, as given below:
Jan    Feb   Mar   Apr    May    Jun    Jul    Aug   Sep   Oct   Nov   Dec
101.5  99.2  96.9  103.8  103.8  104.7  106.0  97.8  94.6  93.2  96.2  102.3
Remarks: The total equal to 1200, in case of monthly indices, and 400, in case of
quarterly indices, indicates that the ups and downs in the time series, due to seasons,
neutralise themselves within that year. It is because of this that the annual data are
free from seasonal component.
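The method of simple averages can be sketched in a few lines. The quarterly figures below are hypothetical and serve only to illustrate the steps (average each season over the years, compute the grand average, express each seasonal average as a percentage of it); the function name is also an assumption.

```python
def seasonal_indices_simple_average(rows):
    """Seasonal indices by the method of simple averages.

    `rows` holds one list per year, each with the same number of seasons.
    """
    seasons = len(rows[0])
    averages = [sum(r[s] for r in rows) / len(rows) for s in range(seasons)]
    grand = sum(averages) / seasons
    return [100 * a / grand for a in averages]

# Hypothetical quarterly figures for three years (illustration only).
data = [[40, 50, 60, 50],
        [44, 55, 66, 55],
        [48, 59, 72, 61]]
indices = seasonal_indices_simple_average(data)
print([round(i, 1) for i in indices], round(sum(indices), 1))
```

When unrounded averages are used, the indices total exactly 400 (or 1200 for monthly data); the adjustment by a factor such as 1200/1198.9 is needed only because the printed indices have been rounded.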
Example 14.14
Compute the seasonal index from the following data by the method of simple
averages.
Year   Quarter   Y      Year   Quarter   Y      Year   Quarter   Y
1980   I         106    1982   I         90     1984   I         80
       II        124           II        112           II        104
       III       104           III       101           III       95
       IV        90            IV        85            IV        83
1981   I         84     1983   I         76     1985   I         104
       II        114           II        94            II        112
       III       107           III       91            III       102
       IV        88            IV        76            IV        84
Solution
Calculation of Seasonal Indices
(i) Obtain the trend values for each month or quarter, etc. by the method of least
squares.
(ii) Divide the original values by the corresponding trend values. This would
eliminate trend values from the data. To get figures in percentages, the quotients
are multiplied by 100.
Thus, we have $\dfrac{Y}{T} \times 100 = \dfrac{T \cdot S \cdot R}{T} \times 100 = S \cdot R \times 100$
T T
(iii) Finally, the random component is eliminated by the method of simple averages.
Example 14.15
Assuming that the trend is linear, calculate seasonal indices by the ratio to trend
method from the following data:
Quarterly output of coal in 4 years (in thousand tonnes)
Year   I    II   III   IV
1982   65   58   56    61
1983   68   63   63    67
1984   70   59   56    52
1985   60   55   51    58
Solution
By adding the values of all the quarters of a year, we can obtain annual output for
each of the four years. Fit a linear trend to the data and obtain trend values for each
quarter.
Year    Output (Y)   X = 2(t-1983.5)   XY     X^2
1982    240          -3                -720   9
1983    261          -1                -261   1
1984    237           1                 237   1
1985    224           3                 672   9
Total   962           0                -72    20
From the above table, we get $a = \dfrac{962}{4} = 240.5$ and $b = \dfrac{-72}{20} = -3.6$
Thus, the trend line is Y = 240.5 - 3.6X, Origin: 1st January 1984, unit of X: 6 months.
The quarterly trend equation is
$Y = \dfrac{240.5}{4} - \dfrac{3.6}{8} X$ or Y = 60.13 - 0.45X, Origin: 1st January 1984, unit of X: 1 quarter
(i.e., 3 months).
Shifting the origin to the middle of the first quarter of 1984, we get
$Y = 60.13 - 0.45\left(X + \dfrac{1}{2}\right) = 59.9 - 0.45X$, origin: I-quarter of 1984, unit of X = 1 quarter.
Table of Quarterly Trend Values
Year      I        II       III      IV
1982      63.50    63.05    62.60    62.15
1983      61.70    61.25    60.80    60.35
1984      59.90    59.45    59.00    58.55
1985      58.10    57.65    57.20    56.75
The table of Ratio to Trend Values, i.e. $\dfrac{Y}{T} \times 100$
Year      I        II       III      IV
1982      102.36   91.99    89.46    98.15
1983      110.21   102.86   103.62   111.02
1984      116.86   99.24    94.92    88.81
1985      103.27   95.40    89.16    102.20
Total     432.70   389.49   377.16   400.18
Average   108.18   97.37    94.29    100.05
S.I.      108.20   97.40    94.32    100.08
Note: Grand Average, $G = \dfrac{399.89}{4} = 99.97$
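The ratio to trend calculation of Example 14.15 can be reproduced with the sketch below. It takes the quarterly trend equation Y = 59.9 - 0.45X (origin: first quarter of 1984) from the solution as given; the function and variable names are illustrative assumptions.

```python
coal = {1982: [65, 58, 56, 61],
        1983: [68, 63, 63, 67],
        1984: [70, 59, 56, 52],
        1985: [60, 55, 51, 58]}

def quarterly_trend(year, quarter):
    """Trend value from Y = 59.9 - 0.45X, X counted in quarters from 1984-I."""
    x = 4 * (year - 1984) + (quarter - 1)
    return 59.9 - 0.45 * x

# Ratio to trend (per cent) for every quarter.
ratios = {y: [100 * v / quarterly_trend(y, q + 1) for q, v in enumerate(vals)]
          for y, vals in coal.items()}

# Average the ratios quarter by quarter, then express them relative to
# the grand average so that the indices total 400.
averages = [sum(ratios[y][q] for y in coal) / len(coal) for q in range(4)]
grand = sum(averages) / 4
seasonal_index = [100 * a / grand for a in averages]

print([round(a, 2) for a in averages])
print([round(s, 2) for s in seasonal_index])
```

Small differences in the second decimal place relative to the printed table can arise because the table works with rounded intermediate values.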
Example 14.16.
Find seasonal variations by the ratio to trend method, from the following data:
Solution
First we fit a linear trend to the annual totals.
Year    Annual Totals (Y)   X    XY     X^2
1995    140                 -2   -280   4
1996    180                 -1   -180   1
1997    200                  0    0     0
1998    260                  1    260   1
1999    340                  2    680   4
Total   1120                 0    480   10
Now $a = \dfrac{1120}{5} = 224$ and $b = \dfrac{480}{10} = 48$
The quarterly trend equation is $Y = \dfrac{224}{4} + \dfrac{48}{16} X = 56 + 3X$, origin: 1st July 1997, unit of X
= 1 quarter.
$$Y = 56 + 3\left(X + \frac{1}{2}\right) = 57.5 + 3X$$
Table of Quarterly Trend Values
Year   I      II     III    IV
1995   27.5   30.5   33.5   36.5
Year I II III IV
Note that the Grand Average $G = \dfrac{403.12}{4} = 100.78$. Also check that the sum of indices is
400.
Remarks: If instead of multiplicative model we have an additive model, then Y = T +
S + R or S + R = Y-T. Thus, the trend values are to be subtracted from the Y values.
Random component is then eliminated by the method of simple averages.
Merits and Demerits
It is an objective method of measuring seasonal variations. However, it is very
complicated and doesn’t work if cyclical variations are present.
(i) Compute the moving averages with period equal to the period of seasonal
variations. This would eliminate the seasonal component and minimise the effect
of random component. The resulting moving averages would consist of trend,
cyclical and random components.
(ii) The original values, for each quarter (or month), are divided by the respective
moving average figures and the ratio is expressed as a percentage, i.e.
$\dfrac{Y}{\text{M.A.}} = \dfrac{TCSR}{TCR'} = SR''$, where R' and R'' denote the changed random components.
(iii) Finally, the random component R´´ is eliminated by the method of simple
averages.
Example 14.17
Given the following quarterly sale figures, in thousands of rupees, for the years 1996-1999,
find the specific seasonal indices by the method of moving averages.
Year   I    II   III   IV
1996   34   33   34    37
1997   37   35   37    39
1998   39   37   38    40
1999   42   41   42    44
Solution
Calculation of Ratio of Moving Averages
Year   Quarter   Sales   4-Quarter   Centred   4-Quarter   Ratio to
                 (Y)     Moving      Total     Moving      Moving
                         Total                 Average     Average (%)
1996   I         34      ...         ...       ...         ...
       II        33      ...         ...       ...         ...
                         138
       III       34                  279       34.9        97.4
                         141
       IV        37                  284       35.5        104.2
                         143
1997   I         37                  289       36.1        102.5
                         146
       II        35                  294       36.8        95.1
                         148
       III       37                  298       37.3        99.2
                         150
       IV        39                  302       37.8        103.2
                         152
1998   I         39                  305       38.1        102.4
                         153
       II        37                  307       38.4        96.4
                         154
       III       38                  311       38.9        97.7
                         157
       IV        40                  318       39.8        100.5
                         161
1999   I         42                  326       40.8        102.9
                         165
       II        41                  334       41.8        98.1
                         169
       III       42      ...         ...       ...         ...
       IV        44      ...         ...       ...         ...
Note that the Grand Average $G = \dfrac{399.8}{4} = 99.95$. Also check that the sum of indices is
400.
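The ratio to moving average steps can be sketched in code using the quarterly sales of this example. The variable names are illustrative assumptions, and the sketch uses the arithmetic mean to average the ratios of each quarter.

```python
sales = [34, 33, 34, 37,   # 1996
         37, 35, 37, 39,   # 1997
         39, 37, 38, 40,   # 1998
         42, 41, 42, 44]   # 1999

# 4-quarter moving totals, then centred moving averages for quarters 3 to 14.
totals = [sum(sales[i:i + 4]) for i in range(len(sales) - 3)]
centred = {i + 2: (totals[i] + totals[i + 1]) / 8 for i in range(len(totals) - 1)}
ratios = {i: 100 * sales[i] / ma for i, ma in centred.items()}

# Average the ratios quarter by quarter, then scale so the indices total 400.
by_quarter = [[r for i, r in ratios.items() if i % 4 == q] for q in range(4)]
averages = [sum(v) / len(v) for v in by_quarter]
grand = sum(averages) / 4
seasonal_index = [100 * a / grand for a in averages]

print(totals)
print([round(a, 1) for a in averages])
print([round(s, 1) for s in seasonal_index])
```

The script keeps full precision throughout, so its ratios differ slightly from the table, which rounds each moving average to one decimal place before dividing.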
Merits and Demerits
This method assumes that all the four components of a time series are present and
is, therefore, widely used for measuring seasonal variations. However, the seasonal
variations are not completely eliminated if the cycles of these variations are not of
regular nature. Further, some information is always lost at the ends of the time
series.
Link Relatives Method
This method is based on the assumption that the trend is linear and cyclical
variations are of uniform pattern. As discussed in an earlier chapter, the link relatives
are percentages of the current period (quarter or month) as compared with the previous
period. With the computation of link relatives and their average, the effect of cyclical
and random component is minimised. Further, the trend gets eliminated in the
process of adjustment of chained relatives. The following steps are involved in the
computation of seasonal indices by this method:
(i) Compute the link relative (L.R.) of each period by dividing the figure of that period
with the figure of the previous period. For example, link relative of
$$\text{3rd quarter} = \frac{\text{figure of 3rd quarter}}{\text{figure of 2nd quarter}} \times 100$$
(ii) Obtain the average of link relatives of a given quarter (or month) of various years.
A.M. or Md can be used for this purpose. Theoretically, the latter is preferable
because the former gives undue importance to extreme items.
(iii) These averages are converted into chained relatives by assuming the chained
relative of the first quarter (or month) equal to 100. The chained relative (C.R.) for the
current period (quarter or month)
$$= \frac{\text{C.R. of the previous period} \times \text{L.R. of the current period}}{100}$$
(iv) Compute the C.R. of first quarter (or month) on the basis of the last quarter (or
month). This is given by
$$= \frac{\text{C.R. of the last quarter (or month)} \times \text{L.R. of 1st quarter (or month)}}{100}$$
This value will, in general, be different from 100 due to long term trend in the data. The
chained relatives, obtained above, are to be adjusted for the effect of this trend. The
adjustment factor is
$$d = \frac{1}{4}[\text{New C.R. for 1st quarter} - 100] \text{ for quarterly data}$$
$$\text{and } d = \frac{1}{12}[\text{New C.R. for 1st month} - 100] \text{ for monthly data.}$$
On the assumption that the trend is linear, d, 2d, 3d, etc. is respectively subtracted
from the 2nd, 3rd, 4th, etc., quarter (or month).
(v) Express the adjusted chained relatives as a percentage of their average to obtain
seasonal indices.
(vi) Make sure that the sum of these indices is 400 for quarterly data and 1200 for
monthly data.
Example 14.18
Determine the seasonal indices from the following data by the method of link
relatives:
Year   1st Qr   2nd Qr   3rd Qr   4th Qr
2000   26       19       15       10
2001   36       29       23       22
2002   40       25       20       15
2003   46       26       20       18
2004   42       28       24       21
Solution
Calculation Table
Year              I        II      III     IV
2000              -        73.1    78.9    66.7
2001              360.0    80.5    79.3    95.7
2002              181.8    62.5    80.0    75.0
2003              306.7    56.5    76.9    90.0
2004              233.3    66.7    85.7    87.5
Total             1081.8   339.3   400.8   414.9
Mean              270.5    67.9    80.2    83.0
C.R.              100.0    67.9    54.5    45.2
C.R. (adjusted)   100.0    62.3    43.3    28.4
S.I.              170.9    106.5   74.0    48.6
The chained relative (C.R.) of the 1st quarter on the basis of C.R. of the 4th quarter
$$= \frac{270.5 \times 45.2}{100} = 122.3$$
The trend adjustment factor $d = \dfrac{1}{4}(122.3 - 100) = 5.6$
Thus, the adjusted C.R. of 1st quarter = 100
and for 2nd = 67.9 - 5.6 = 62.3
for 3rd = 54.5 - 2 × 5.6 = 43.3
for 4th = 45.2 - 3 × 5.6 = 28.4
The grand average of adjusted C.R., $G = \dfrac{100 + 62.3 + 43.3 + 28.4}{4} = 58.5$
$$\text{The seasonal index of a quarter} = \frac{\text{Adjusted C.R.} \times 100}{G}$$
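The steps of the link relatives method can be traced in code with the data of Example 14.18. This is a sketch under the assumption that the arithmetic mean is used to average the link relatives; all variable names are illustrative.

```python
data = {2000: [26, 19, 15, 10],
        2001: [36, 29, 23, 22],
        2002: [40, 25, 20, 15],
        2003: [46, 26, 20, 18],
        2004: [42, 28, 24, 21]}

flat = [v for year in sorted(data) for v in data[year]]

# Link relative of each quarter on the preceding quarter (none for the first).
links = [None] + [100 * flat[i] / flat[i - 1] for i in range(1, len(flat))]
mean_lr = []
for q in range(4):
    vals = [links[i] for i in range(q, len(flat), 4) if links[i] is not None]
    mean_lr.append(sum(vals) / len(vals))

# Chained relatives, starting from 100 for the first quarter.
chained = [100.0]
for q in range(1, 4):
    chained.append(chained[-1] * mean_lr[q] / 100)

new_first = chained[3] * mean_lr[0] / 100   # C.R. of quarter I via quarter IV
d = (new_first - 100) / 4                   # trend correction per quarter
adjusted = [chained[q] - q * d for q in range(4)]
grand = sum(adjusted) / 4
seasonal_index = [100 * a / grand for a in adjusted]

print([round(m, 1) for m in mean_lr])
print([round(s, 1) for s in seasonal_index])
```

Since no intermediate value is rounded, the printed indices differ from the table in the first decimal place, but they still total 400.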
Merits and Demerits
This method is less complicated than the ratio to moving average and the ratio to
trend methods. However, this method is based upon the assumption of a linear
trend, which may not always hold true.
Deseasonalisation of Data
The deseasonalisation of data implies the removal of the effect of seasonal variations
from the time series variable. If Y consists of the sum of various components, then
for its deseasonalisation, we subtract seasonal variations from it. Similarly, in case
of multiplicative model, the deseasonalisation is done by taking the ratio of Y value to
the corresponding seasonal index. A clue to this is provided by the fact that the sum
of seasonal indices is equal to zero for an additive model while their sum is 400 or
1200 for a multiplicative model.
It may be pointed out here that the deseasonalisation of data is done under the
assumption that the pattern of seasonal variations, computed on the basis of past
data, is similar to the pattern of seasonal variations in the year of deseasonalisation.
Example 14.19
Deseasonalise the following data on the sales of a company during various months
of 1990 by using their respective seasonal indices. Also interpret the
deseasonalised values.
Month   Sales        S.I.    Month   Sales        S.I.
        (Rs. '000)                   (Rs. '000)
Jan     16.5         109     Jul     36.5         85
Feb     21.3         105     Aug     44.4         88
Mar     27.1         108     Sep     54.9         98
Apr     31.0         102     Oct     62.0         102
May     35.5         100     Nov     67.6         104
Jun     36.3         89      Dec     78.7         110
Solution Let Y denote monthly sales and DS denote the deseasonalised sales. Then,
we can write
$$DS = \frac{Y}{\text{S.I.}} \times 100$$
The deseasonalised figures of sales for each month represent the monthly sales that
would have been in the absence of seasonal variations.
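Under the multiplicative model, deseasonalisation is a single division per month. The sketch below applies DS = (Y / S.I.) × 100 to the figures of Example 14.19; the variable names are illustrative assumptions.

```python
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
sales = [16.5, 21.3, 27.1, 31.0, 35.5, 36.3,
         36.5, 44.4, 54.9, 62.0, 67.6, 78.7]      # Rs. '000
index = [109, 105, 108, 102, 100, 89,
         85, 88, 98, 102, 104, 110]               # seasonal indices

# Deseasonalised sales: DS = (Y / S.I.) * 100
deseasonalised = [round(100 * y / s, 2) for y, s in zip(sales, index)]

for m, y, ds in zip(months, sales, deseasonalised):
    print(m, y, ds)
```

Months with an index above 100 are scaled down and months with an index below 100 are scaled up, so the deseasonalised series shows what sales would have been without the seasonal effect.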
14.4 MEASUREMENT OF CYCLICAL AND IRREGULAR VARIATIONS
MEASUREMENT OF CYCLICAL VARIATIONS-
RESIDUAL METHOD
As mentioned earlier, a typical time-series has four components: secular trend
(T), seasonal variation (S), cyclical variation (C), and irregular variation (I). In a
multiplicative time-series model, these components are written as:
Y = T × C × S × I
The data can be adjusted for trend and seasonal effects by dividing them by the
corresponding trend and seasonal variation values. Thus we are left with only
cyclical (C) and irregular (I) variations in the data set as shown below:
$$\frac{Y}{T \times S} = \frac{T \times C \times S \times I}{T \times S} = C \times I$$
The moving averages of an appropriate period may be used to eliminate or reduce the
effect of irregular variations, leaving behind only the cyclical variations.
The procedure of identifying cyclical variation is known as the residual method.
Recall that cyclical variations in time-series tend to oscillate above and below the
secular trend line for periods longer than one year. The steps of the residual method are
summarized as follows:
(ii) Obtain trend values and express seasonalized data as percentages of the trend
values.
(iii) Divide the original data (Y) by the corresponding trend values (T) in the
time-series to get S × C × I. Further divide S × C × I by S to get C × I.
3. With simple exponential smoothing, the forecast is made up of the last period
forecast plus a portion of the .............. between the last period’s actual demand
and the last period’s forecast.
4. When the given values of dependent variable y form approximately a geometric
progression while the corresponding independent variable x values form an
arithmetic progression, the relationship between variables x and y is given by an
exponential function, and the best fitting curve is said to describe the
................... .
5. The objective of smoothing methods is to ............... out the random variations due
to irregular components of the time series.
14.6 SUMMARY
A series of observations on a variable, recorded at successive intervals of time, is
called a time series. There are two main objectives of the analysis of any time series
data: to study the past behaviour of data and to make forecasts for the future. A time
series has different components: trend, cycles, seasonal and
irregular. Multiplicative and additive models are used for decomposition of time series.
The principal methods of measuring trend fall into the following categories: Free Hand
Curve method, Method of Averages and Method of Least Squares. The objective of
smoothing methods is to smoothen out the random variations due to irregular
components of the time series and thereby provide us with an overall impression of
the pattern of movement in the data over time. There are three smoothing methods:
Moving averages, weighted moving averages and Semi-averages. If the time
series data are in terms of annual figures, the seasonal variations are absent. These
variations are likely to be present in data recorded on quarterly or monthly or weekly
or daily or hourly basis. The measurement of seasonal variation is done by isolating
them from other components of a time series. There are four methods commonly
used for the measurement of seasonal variations. These methods are: Method of
Simple Averages, Ratio to Trend Method, Ratio to Moving Average Method and
Method of Link Relatives.
14.7 KEYWORDS
Trend- It is the broad long-term tendency of either upward or downward movement
in the average (or mean) value of the forecast variable y over time.
Cycles- An upward and downward oscillation of uncertain duration and magnitude
about the trend line due to seasonal effect with fairly regular period or long period
with irregular swings is called a cycle.
Seasonal- It is a special case of a cycle
component of time series in which the magnitude and duration of the cycle do not
vary but happen at a regular interval each year.
Irregular- An irregular or erratic (or residual) movement in a time series is caused by
short-term unanticipated and non-recurring factors.
Semi-Average Method: It permits us to estimate the slope and intercept of the trend
line quite easily if a linear function adequately describes the data.
Exponential Smoothing Methods: It is a type of moving-average forecasting
technique which weighs past data in an exponential manner so that the most recent
data carries more weight in the moving average.
Deseasonalisation of Data: It implies the removal of the effect of seasonal variations
from the time series variable.
14.8 SELF-ASSESSMENT TEST
1. What effect does seasonal variability have on a time-series? What is the basis for
this variability for an economic time-series?
2. What is measured by a moving average? Why are 4-quarter and 12-month
moving averages used to develop a seasonal index?
3. Briefly describe the moving average and least squares methods of measuring trend
in time-series.
4. Distinguish between ratio-to-trend and ratio-to-moving average as methods of
measuring seasonal variations, which is better and why?
5. Why do we deseasonalize data? Explain the ratio-to-moving average method to
compute the seasonal index.
6. Apply the method of link relatives to the following data and calculate seasonal
indexes.
Quarter   1995   1996   1997   1998   1999
I         6.0    5.4    6.8    7.2    6.6
II        6.5    7.9    6.5    5.8    7.3
III       7.8    8.4    9.3    7.5    8.0
IV        8.7    7.3    6.4    8.5    7.1
(a) Find the linear equation that describes the trend in the production of steel by the
company.
14. The sales (Rs. in lakh) of a company for the years 1990 to 1996 are given below:
Find trend values by using the equation $Y_c = ab^x$ and estimate the value for
1997.
15. A company that specializes in the production of petrol filters has recorded the
following production (in 1000 units) over the last 16 years.
(a) Develop a second degree estimating equation that best describes these data.
2. Trend
3. Difference
4. Exponential trend
5. Smoothen
3. R.P. Hooda: Statistics for Business and Economics, Macmillan India Ltd.
STRUCTURE
15.1 Learning Objectives
15.2 Introduction
15.2.1 What are Index Numbers?
15.2.2 Uses of Index Numbers
15.2.3 Types of Index Numbers
15.2.3.1 Simple Index Numbers
15.2.3.2 Composite Index Numbers
15.2.3.2.1 Simple Aggregative Price/Quantity Index
15.2.3.2.2 Index of Average of Price/Quantity Relatives
Understand the techniques and the problems involved in constructing and using index numbers.
15.2 INTRODUCTION
In business, managers and other decision makers may be concerned with the way in which the values of
variables change over time like prices paid for raw materials, numbers of employees and customers,
annual income and profits, and so on. Index numbers are one way of describing such changes.
If we turn to any journal devoted to economic and financial matters, we are very likely to come across
an index number of one or the other type. It may be an index number of share prices or a wholesale
price index or a consumer price index or an index of industrial production. The objective of these index
numbers is to measure the changes that have occurred in prices, cost of living, production, and so forth.
For example, if a wholesale price index number for the year 2000 with base year 1990 was 170, it
shows that wholesale prices, in general, increased by 70 percent in 2000 as compared to those in 1990.
Now, if the same index number moves to 180 in 2001, it shows that there has been an 80 percent increase
in wholesale prices in 2001 as compared to those in 1990.
With the help of various index numbers, economists and businessmen are able to describe and
appreciate business and economic situations quantitatively. Index numbers were originally developed
by economists for monitoring and comparing different groups of goods. It is necessary in business to
understand and manipulate the different published index series, and to construct index series of your
own. Once you have constructed your own index, you can compare it to a national one such as the RPI,
to a similar index for your industry as a whole, and to indexes for your competitors. These comparisons
are a very useful tool for decision making in business.
For example, an accountant of a supermarket chain could construct an index of the company's own sales
and compare it to the index of the volume of sales for the general supermarket industry. A graph of the
two indexes will illustrate the company's performance within the sector. It is immediately clear from
Figure 15-1 that, after initially lagging behind the general market, the supermarket company caught up
and then overtook it. In the later stages, the company was having better results than the general market,
but, as with the whole industry, those results had levelled out.
Our focus in this lesson will be on the discussion related to the methodology of index number
construction. The scope of the lesson is rather limited and as such, it does not discuss a large number of
index numbers that are presently compiled and published by different departments of the Government
of India.
the prices of a particular commodity like steel, gold, leather, etc. or a group of commodities like
consumer goods, cereals, milk and milk products, cosmetics, etc.
volume of trade, factory production, industrial or agricultural production, imports or exports, stocks
and shares, sales and profits of a business house and so on.
the national income of a country, wage structure of workers in various sectors, bank deposits,
foreign exchange reserves, cost of living of persons of a particular community, class or profession
and so on.
the variations between similar categories of objects/subjects, such as persons, groups of persons,
organisations etc. or other characteristics such as income, profession, etc.
The utility of index numbers in facilitating comparison may be seen when, for example, we are
interested in studying the general change in the price level of consumer goods, i.e. goods or commodities
consumed by the people belonging to a particular section of society, say, low income group or middle
income group or labour class and so on. Obviously these changes are not directly measurable, as the
price quotations of the various commodities are available in different units, e.g., cereals (wheat, rice,
pulses, etc.) are quoted in Rs per quintal or kg; water in Rs per gallon; milk, petrol, kerosene, etc. in Rs
per litre; cloth in Rs per metre and so on.
Further, the prices of some of the commodities may be increasing while those of others may be
decreasing during the two periods and the rates of increase or decrease may be different for different
commodities. An index number is a statistical device which enables us to arrive at a single representative
figure that gives the general level of the price of the phenomenon (commodities) in an extensive group.
According to Wheldon:
"Index number is a statistical device for indicating the relative movements of the data where
measurement of actual movements is difficult or incapable of being made."
"Index number shows by its variations the changes in a magnitude which is not susceptible either of
accurate measurement in itself or of direct valuation in practice."
On the basis of above discussion, the following characteristics of index numbers are apparent:
1. Index Numbers are specialized averages: An average is a summary figure measuring the central
tendency of the data, representing a group of figures. Index number has all these functions to
perform. L R Connor states, "in its simplest form, it (index number) represents a special case of an
average, generally a weighted average compiled from a sample of items judged to be representative
of the whole". It is a special type of average – it averages variables having different units of
measurement.
2. Index Numbers are expressed in percentages: Index numbers are expressed in terms of
percentages so as to show the extent of change. However, the percentage sign (%) is never used.
3. Index Numbers measure changes not capable of direct measurement: The technique of index
numbers is utilized in measuring changes in magnitude, which are not capable of direct
measurement. Such magnitudes do not exist in themselves. Examples of such magnitudes are 'price
level', 'cost of living', 'business or economic activity' etc. The statistical methods used in the
construction of index numbers are largely methods for combining a number of phenomena
representing a particular magnitude in such a manner that the changes in that magnitude may be
measured in a meaningful way without introduction of serious bias.
4. Index Numbers are for comparison: The index numbers by their nature are comparative. They
compare changes taking place over time or between places or between like categories.
In brief, index number is a statistical technique used in measuring the composite change in several
similar economic variables over time. It measures only the composite change, because some of the
variables included may be showing an increase, while some others may be showing a decrease. It
synthesizes the changes taking place in different directions and by varying extents into the one
composite change. Thus, an index number is a device to simplify comparison to show relative
movements of the data concerned and to replace what may be complicated figures by simple ones
calculated on a percentage basis.
level of prices or accordingly purchasing power of money, today index numbers are extensively used
for a variety of purposes in economics, business, management, etc., and for quantitative data relating to
production, consumption, profits, personnel and financial matters etc., for comparing changes in the
level of phenomenon for two periods, places, etc. In fact there is hardly any field of quantitative
measurement where index numbers are not constructed. They are used in almost all sciences: natural,
social and physical. The main uses of index numbers can be summarized as follows:
Index numbers are indispensable tools for the management personnel of any government organisation or
individual business concern and in business planning and formulation of executive decisions. The
indices of prices (wholesale & retail), output (volume of trade, import and export, industrial and
agricultural production) and bank deposits, foreign exchange and reserves etc., throw light on the nature
of, and variation in the general economic and business activity of the country. They are the indicators of
business environment. A careful study of these indices gives us a fairly good appraisal of the general
trade, economic development and business activity of the country. In the words of G Simpson and F
Kafka:
"Index numbers are today one of the most widely used statistical devices. They are used to take the pulse of the economy
and they have come to be used as indicators of inflationary or deflationary tendencies."
Like barometers, which are used in Physics and Chemistry to measure atmospheric pressure, index
numbers are rightly termed "economic barometers", which measure the pressure of economic and
business behaviour.
Since the index numbers study the relative change in the level of a phenomenon at different periods of
time, they are especially useful for the study of the general trend for a group phenomenon in time series
data. The indices of output (industrial and agricultural production), volume of trade, import and export,
etc., are extremely useful for studying the changes in the level of phenomenon due to the various
components of a time series, viz. secular trend, seasonal and cyclical variations and irregular
components and reflect upon the general trend of production and business activity. As a measure of
average change in extensive group, the index numbers can be used to forecast future events. For
instance, if a businessman is interested in establishing a new undertaking, the study of the trend of
changes in the prices, wages and incomes in different industries is extremely helpful to him to frame a
general idea of the comparative courses, which the future holds for different undertakings.
Index numbers of the data relating to various business and economic variables serve an important guide
to the formulation of appropriate policy. For example, the cost of living index numbers are used by the
government and, the industrial and business concerns for the regulation of dearness allowance (D.A.) or
grant of bonus to the workers so as to enable them to meet the increased cost of living from time to
time. The excise duty on the production or sales of a commodity is regulated according to the index
numbers of the consumption of the commodity from time to time. Similarly, the indices of consumption
of various commodities help in the planning of their future production. Although index numbers are
now widely used to study the general economic and business conditions of the society, they are also
applied with advantage by sociologists (population indices), psychologists (IQs), health and
educational authorities etc., for formulating and revising their policies from time to time.
A traditional use of index numbers is in measuring the purchasing power of money. Since the changes
in prices and purchasing power of money are inversely related, an increase in the general price index
indicates that the purchasing power of money has gone down.
$$\text{Purchasing Power} = \frac{1}{\text{General Price Index}} \times 100$$
Accordingly, if the consumer price index for a given year is 150, the purchasing power of a rupee is
(1/150) × 100 = 0.67. That is, the purchasing power of a rupee in the given year is 67 paise as compared
to the base year.
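The purchasing power calculation can be sketched in a few lines of Python. This is only an illustration of the formula as stated in the text; the function name `purchasing_power` is an assumption introduced here.

```python
def purchasing_power(price_index: float) -> float:
    """Purchasing power of one unit of money relative to the base year.

    Implements (1 / general price index) * 100, so an index of 100
    (the base year itself) gives a purchasing power of 1.0.
    """
    if price_index <= 0:
        raise ValueError("price index must be positive")
    return (1.0 / price_index) * 100.0


# Consumer price index of 150, as in the text.
pp = purchasing_power(150)
print(f"Purchasing power of a rupee: {pp:.2f} (about {pp * 100:.0f} paise)")
```

A price index above 100 always yields a purchasing power below 1, which is the inverse relationship the text describes.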
With the increase in prices, the amount of goods and services which money wages can buy (or the real
wages) goes on decreasing. Index numbers tell us the change in real wages, which are obtained as
$$\text{Real Wage} = \frac{\text{Money Wage}}{\text{Consumer Price Index}} \times 100$$
A real wage index equal to, say, 120 corresponding to a money wage index of 160 will indicate an
increase in real wages by only 20 per cent as against a 60 per cent increase in money wages.
Index numbers also serve as the basis of determining the terms of exchange. The terms of exchange are
the parity rate at which one set of commodities is exchanged for another set of commodities. It is
determined by taking the ratio of the price index for the two groups of commodities and expressing it in
percentage.
For example, if A and B are the two groups of commodities with 120 and 150 as their price index in a
particular year, respectively, the ratio 120/150 multiplied by 100 is 80 per cent. It means that prices of A
group of commodities in terms of those in group B are lower by 20 per cent.
Consumer price indices or cost of living index numbers are used for deflation of net national product,
income value series in national accounts. The technique of obtaining real wages from the given nominal
wages (as explained in use 4 above) can be used to find real income from inflated money income, real
sales from nominal sales and so on by taking into account appropriate index numbers.
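Deflation of a money series follows the same pattern as the real wage formula: divide each nominal figure by the price index of the same period and multiply by 100. The sketch below assumes a small sales series and price index invented purely for illustration; the name `deflate` is likewise an assumption.

```python
def deflate(nominal: float, price_index: float) -> float:
    """Convert a nominal (money) figure to real terms: nominal / index * 100."""
    if price_index <= 0:
        raise ValueError("price index must be positive")
    return nominal / price_index * 100.0


# Hypothetical data, invented for illustration only.
years = [2019, 2020, 2021]
nominal_sales = [200.0, 230.0, 264.0]  # Rs lakh at current prices
price_index = [100.0, 110.0, 120.0]    # base year 2019 = 100

real_sales = [deflate(n, p) for n, p in zip(nominal_sales, price_index)]

for year, nominal, real in zip(years, nominal_sales, real_sales):
    print(f"{year}: nominal {nominal:.1f}, real {real:.1f}")
```

In this made-up series nominal sales rise by 32 per cent over the two years, but only 10 per cent of that rise survives once the price increase is removed.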
Index numbers may be broadly classified into various categories depending upon the type of the
phenomenon they study. Although index numbers can be constructed for measuring relative changes in
any field of quantitative measurement, we shall primarily confine the discussion to the data relating to
economics and business i.e., data relating to prices, production (output) and consumption. In this
context index numbers may be broadly classified into the following three categories:
1. Price Index Numbers: The price index numbers measure the general changes in the prices. They
are further sub-divided into the following classes:
(i) Wholesale Price Index Numbers: The wholesale price index numbers reflect the changes in
the general price level of a country.
(ii) Retail Price Index Numbers: These indices reflect the general changes in the retail prices of
various commodities such as consumption goods, stocks and shares, bank deposits,
government bonds, etc.
(iii) Consumer Price Index: Commonly known as the Cost of living Index, CPI is a specialized
kind of retail price index and enables us to study the effect of changes in the price of a basket
of goods or commodities on the purchasing power or cost of living of a particular class or
section of the people like labour class, industrial or agricultural worker, low income or middle
income class etc.
2. Quantity Index Numbers: Quantity index numbers study the changes in the volume of goods
produced (manufactured), consumed or distributed, like: the indices of agricultural production,
industrial production, imports and exports, etc. They are extremely helpful in studying the level of
physical output in an economy.
3. Value Index Numbers: These are intended to study the change in the total value (price multiplied
by quantity) of output such as indices of retail sales or profits or inventories. However, these
indices are not as common as price and quantity indices.
Various indices can also be distinguished on the basis of the number of commodities that go into the
construction of an index. Indices constructed for individual commodities or variable are termed as
simple index numbers. Those constructed for a group of commodities or variables are known as
aggregative (or composite) index numbers.
Notations Used
Since index numbers are computed for prices, quantities, and values, these are denoted by
the lower case letters:
p, q, and v represent respectively the price, the quantity, and the value of an individual
commodity.
Subscripts 0, 1, 2, …, i, … are attached to these lower case letters to distinguish price,
quantity, or value in any one period from those in the other. Thus,
$p_0$ denotes the price of a commodity in the base period,
$p_1$ denotes the price of a commodity in period 1, or the current period, and
$p_i$ denotes the price of a commodity in the ith period, where i = 1, 2, 3, …
Similar meanings are assigned to $q_0, q_1, \ldots, q_i, \ldots$ and $v_0, v_1, \ldots, v_i, \ldots$
Capital letters P, Q and V are used to represent the price, quantity, and value index
numbers, respectively. Subscripts attached to P, Q, and V indicate the years compared.
Thus,
$P_{01}$ means the price index for period 1 relative to period 0,
$P_{02}$ means the price index for period 2 relative to period 0,
$P_{12}$ means the price index for period 2 relative to period 1, and so on.
Similar meanings are assigned to quantity Q and value V indices. It may be noted that
all indices are expressed in percent with 100 as the index for the base period, the period
with which comparison is to be made.
Here, in this lesson, we will develop methods of constructing simple as well as composite indices.
$$\text{Index for any Period } i = \frac{\text{Value in Period } i}{\text{Value in Base Year}} \times 100$$
Given are the following price-quantity data of fish, with price quoted in Rs per kg and production in qtls.
These simple indices facilitate comparison by transforming absolute quantities/prices into percentages.
Given such an index, it is easy to find the percent by which the price/quantity may have changed in a
given period as compared to the base period. For example, observing the index computed in Example
15-1, one can firmly say that the output of fish was 30 per cent more in 1984 as compared to 1980.
It may also be noted that given the simple price/quantity for the base year and the index for the period i
= 1, 2, 3, …; the actual price/quantity for the period i = 1, 2, 3, … may easily be obtained as:
$$p_i = p_0 \left(\frac{P_{0i}}{100}\right) \qquad \text{(15-3)}$$
and
$$q_i = q_0 \left(\frac{Q_{0i}}{100}\right) \qquad \text{(15-4)}$$
$$q_i = 500 \left(\frac{122.00}{100}\right) = 610$$
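A simple index and its inverse, Eqs. (15-3) and (15-4), can be sketched as two small functions. The function names are assumptions introduced here; the figures (a base quantity of 500 and an index of 122) are the ones used in the calculation just shown.

```python
def simple_index(value_i: float, value_base: float) -> float:
    """Simple index: value in period i over value in the base year, times 100."""
    if value_base == 0:
        raise ValueError("base-year value must be non-zero")
    return value_i / value_base * 100.0


def value_from_index(value_base: float, index_i: float) -> float:
    """Recover the actual price/quantity for period i from the base value and the index."""
    return value_base * index_i / 100.0


q0 = 500.0                           # base-year quantity
q_i = value_from_index(q0, 122.0)    # quantity implied by an index of 122
print(q_i)
print(round(simple_index(q_i, q0), 2))
```

The two functions are inverses of each other, so converting a value to an index and back returns the original figure.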
Irrespective of the units in which prices/quantities are quoted, this index for given prices/quantities, of a
group of commodities is constructed in the following three steps:
(i) Find the aggregate of prices/quantities of all commodities for each period (or place).
(ii) Selecting one period as the base, divide the aggregate prices/quantities corresponding to
each period (or place) by the aggregate of prices/ quantities in the base period.
The computation procedure contained in the above steps can be expressed as:
$$P_{0i} = \frac{\sum p_i}{\sum p_0} \times 100 \qquad \text{(15-5)}$$
and
$$Q_{0i} = \frac{\sum q_i}{\sum q_0} \times 100 \qquad \text{(15-6)}$$
Example 15-2
Given are the following price-quantity data, with price quoted in Rs per kg and production in qtls.
                 1980                   1985
Item       Price   Production     Price   Production
Fish         15       500           20       600
Mutton       18       590           23       640
Chicken      22       450           24       500
Find (a) Simple Aggregative Price Index with 1980 as the base.
(b) Simple Aggregative Quantity Index with 1980 as the base.
Solution:
Calculations for
Simple Aggregative Price and Quantity Indices
(Base Year = 1980)
                  Prices                    Quantities
Item       1980 (p0)   1985 (pi)     1980 (q0)   1985 (qi)
Fish           15          20           500         600
Mutton         18          23           590         640
Chicken        22          24           450         500
Sum →          55          67          1540        1740
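(a) Simple Aggregative Price Index with 1980 as the base
$$P_{0i} = \frac{\sum p_i}{\sum p_0} \times 100 = \frac{67}{55} \times 100 = 121.82$$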
(b) Simple Aggregative Quantity Index with 1980 as the base
$$Q_{0i} = \frac{\sum q_i}{\sum q_0} \times 100 = \frac{1740}{1540} \times 100 = 112.98$$
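The two calculations of Example 15-2 can be checked with a short Python sketch. The function name is an assumption; the 1985 quantities are taken as 600, 640 and 500 qtls, the values consistent with the column total of 1740. The last digit may differ from the hand-worked figures, which appear to be truncated rather than rounded.

```python
def simple_aggregative_index(current, base):
    """Sum of current-period values over sum of base-period values, times 100."""
    return sum(current) / sum(base) * 100.0


# Example 15-2 data, in the order fish, mutton, chicken.
p0 = [15, 18, 22]      # 1980 prices (Rs per kg)
p1 = [20, 23, 24]      # 1985 prices (Rs per kg)
q0 = [500, 590, 450]   # 1980 production (qtls)
q1 = [600, 640, 500]   # 1985 production (qtls)

price_index = simple_aggregative_index(p1, p0)
quantity_index = simple_aggregative_index(q1, q0)

print(f"Simple aggregative price index:    {price_index:.2f}")
print(f"Simple aggregative quantity index: {quantity_index:.2f}")
```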
Although Simple Aggregative Index is simple to calculate, it has two important limitations:
First, equal weights get assigned to every item entering into the construction of this index, irrespective of
the relative importance of each individual item. For example, items like pencil and milk are
assigned equal importance in the construction of this index. This limitation renders the index of no
practical utility.
Second, the different units in which the prices are quoted also sometimes unduly affect this index. Prices
quoted in larger units, such as the price of wheat per bushel as compared to a price per kg, will have
unduly large influence on this index. Consequently, the prices of only a few commodities may dominate
the index. This problem no longer exists when the units in which the prices of various commodities are
quoted have a common base.
Even the condition of common base will provide no real solution because commodities with relatively
high prices such as gold, which is not as important as milk, will continue to dominate this index
excessively. For example, in the Example 15-2 given above chicken prices are relatively higher than
those of fish, and hence chicken prices tend to influence this index relatively more than the prices of
fish.
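The second limitation is easy to demonstrate. The sketch below recomputes the simple aggregative price index of Example 15-2 after quoting fish per quintal instead of per kg; that change of unit is an assumption made here for illustration and is not part of the example. The mean of the individual price relatives is shown alongside because each relative is a pure ratio and so does not depend on the unit.

```python
def simple_aggregative_index(current, base):
    """Sum of current-period prices over sum of base-period prices, times 100."""
    return sum(current) / sum(base) * 100.0


def mean_of_relatives(current, base):
    """Arithmetic mean of the individual price relatives, each as a percentage."""
    relatives = [c / b * 100.0 for c, b in zip(current, base)]
    return sum(relatives) / len(relatives)


# Example 15-2 prices in Rs per kg: fish, mutton, chicken.
p0_kg = [15.0, 18.0, 22.0]
p1_kg = [20.0, 23.0, 24.0]

# Assumption for illustration: fish quoted per quintal (100 kg), the rest per kg.
p0_mixed = [15.0 * 100, 18.0, 22.0]
p1_mixed = [20.0 * 100, 23.0, 24.0]

index_kg = simple_aggregative_index(p1_kg, p0_kg)
index_mixed = simple_aggregative_index(p1_mixed, p0_mixed)
relatives_kg = mean_of_relatives(p1_kg, p0_kg)
relatives_mixed = mean_of_relatives(p1_mixed, p0_mixed)

print(f"Aggregative index, all prices per kg:    {index_kg:.2f}")
print(f"Aggregative index, fish per quintal:     {index_mixed:.2f}")
print(f"Mean of relatives, all prices per kg:    {relatives_kg:.2f}")
print(f"Mean of relatives, fish per quintal:     {relatives_mixed:.2f}")
```

Nothing about the underlying prices has changed between the two runs, yet the aggregative index moves because the fish price now dominates both sums.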
Given the prices/quantities of a number of commodities that enter into the construction of this index, it
is computed in the following three steps:
(i) After selecting the base year, find the price relative/quantity relative of each commodity for each
year with respect to the base year price/quantity. As defined earlier, the price relative/quantity
relative of a commodity for a given period is the ratio of the price/quantity of that commodity in the
given period to its price/quantity in the base period.
(ii) Multiply the result for each commodity by 100, to get simple price/quantity indices for each
commodity.
(iii) Take the average of the simple price/quantity indices by using arithmetic mean, geometric mean or
median.
$$P_{0i} = \text{Average of } \left(\frac{p_i}{p_0} \times 100\right)$$
and
$$Q_{0i} = \text{Average of } \left(\frac{q_i}{q_0} \times 100\right)$$
(a) Index of Average of Price Relatives (base year 1980); using mean, median and geometric mean.
(b) Index of Average of Quantity Relatives (base year 1980); using mean, median and geometric
mean.
Solution:
Calculations for
Index of Average of Price Relatives and Quantity Relatives
(Base Year = 1980)
                  Price Relative                        Quantity Relative
Item       (pi/p0) × 100   log[(pi/p0) × 100]    (qi/q0) × 100   log[(qi/q0) × 100]
Fish          133.33            2.1248              120.00            2.0792
Mutton        127.77            2.1063              108.47            2.0354
Chicken       109.09            2.0378              111.11            2.0457
Sum →         370.19            6.2689              339.58            6.1603
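(a) Index of Average of Price Relatives (base year 1980)
Using arithmetic mean
$$P_{0i} = \frac{\sum \left(\frac{p_i}{p_0} \times 100\right)}{N} = \frac{370.19}{3} = 123.39$$
Using median (with the relatives arranged in order of magnitude)
$$P_{0i} = \text{Size of } \left(\frac{N+1}{2}\right)\text{th item} = \text{Size of } \left(\frac{3+1}{2}\right)\text{th item} = \text{Size of 2nd item} = 127.77$$
Using geometric mean
$$P_{0i} = \text{Antilog}\left[\frac{1}{N} \sum \log\left(\frac{p_i}{p_0} \times 100\right)\right] = \text{Antilog}\left[\frac{1}{3}(6.2689)\right] = \text{Antilog}\,(2.08963) = 122.92$$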
(b) Index of Average of Quantity Relatives (base year 1980)
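Using arithmetic mean
$$Q_{0i} = \frac{\sum \left(\frac{q_i}{q_0} \times 100\right)}{N} = \frac{339.58}{3} = 113.19$$
Using median (with the relatives arranged in order of magnitude)
$$Q_{0i} = \text{Size of } \left(\frac{N+1}{2}\right)\text{th item} = \text{Size of } \left(\frac{3+1}{2}\right)\text{th item} = \text{Size of 2nd item} = 111.11$$
Using geometric mean
$$Q_{0i} = \text{Antilog}\left[\frac{1}{N} \sum \log\left(\frac{q_i}{q_0} \times 100\right)\right] = \text{Antilog}\left[\frac{1}{3}(6.1603)\right] = \text{Antilog}\,(2.05343) = 113.09$$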
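The three averages can be reproduced with Python's standard `statistics` module (the `geometric_mean` function needs Python 3.8 or later). This is a sketch using the Example 15-2 data, with 1985 quantities taken as 600, 640 and 500 qtls; the helper name `relatives` is an assumption. Results may differ from the hand-worked figures in the second decimal place because the text works with truncated relatives and four-figure logarithms.

```python
from statistics import geometric_mean, mean, median


def relatives(current, base):
    """Price or quantity relatives, each expressed as a percentage of the base."""
    return [c / b * 100.0 for c, b in zip(current, base)]


# Example 15-2 data, in the order fish, mutton, chicken.
p0 = [15.0, 18.0, 22.0]
p1 = [20.0, 23.0, 24.0]
q0 = [500.0, 590.0, 450.0]
q1 = [600.0, 640.0, 500.0]

price_rel = relatives(p1, p0)
quantity_rel = relatives(q1, q0)

for label, rel in (("Price", price_rel), ("Quantity", quantity_rel)):
    print(
        f"{label} relatives: mean {mean(rel):.2f}, "
        f"median {median(rel):.2f}, "
        f"geometric mean {geometric_mean(rel):.2f}"
    )
```

For any set of positive relatives that are not all equal, the geometric mean comes out below the arithmetic mean, which is the upward bias of the simple average that the text refers to.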
Apart from the inherent drawback that this index accords equal importance to all items entering into its
construction, the simple arithmetic mean and the median are not appropriate averages to apply to ratios,
because it is generally believed that a simple average injects an upward bias in the index. The geometric
mean is therefore considered a more appropriate average for ratios and percentages.
Among several ways of assigning weights, two widely used ways are:
(i) to use base period quantities/prices as weights, popularly known as Laspeyre's Index, and
(ii) to use the given (current) period quantities/prices as weights, popularly known as Paasche's Index.
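Laspeyre's Index
$$P_{0i}^{La} = \frac{\sum p_i q_0}{\sum p_0 q_0} \times 100 \qquad \text{(15-11)}$$
$$Q_{0i}^{La} = \frac{\sum q_i p_0}{\sum q_0 p_0} \times 100 \qquad \text{(15-12)}$$
Paasche's Index
$$P_{0i}^{Pa} = \frac{\sum p_i q_i}{\sum p_0 q_i} \times 100 \qquad \text{(15-13)}$$
$$Q_{0i}^{Pa} = \frac{\sum q_i p_i}{\sum q_0 p_i} \times 100 \qquad \text{(15-14)}$$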
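Example 15-4
From the data in Example 15-2, find:
(a) Laspeyre's Price Index for 1985, using 1980 as the base
(b) Laspeyre's Quantity Index for 1985, using 1980 as the base
(c) Paasche's Price Index for 1985, using 1980 as the base
(d) Paasche's Quantity Index for 1985, using 1980 as the base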
Solution:
Calculations for
Laspeyre's and Paasche's Indices
(Base Year = 1980)
Item        p0 q0     p1 q0     p0 q1     p1 q1
Fish         7500     10000      9000     12000
Mutton      10620     13570     11520     14720
Chicken      9900     10800     11000     12000
Sum →       28020     34370     31520     38720
(a) Laspeyre's Price Index for 1985, using 1980 as the base
$$P_{0i}^{La} = \frac{\sum p_i q_0}{\sum p_0 q_0} \times 100 = \frac{34370}{28020} \times 100 = 122.66$$
(b) Laspeyre's Quantity Index for 1985, using 1980 as the base
$$Q_{0i}^{La} = \frac{\sum q_i p_0}{\sum q_0 p_0} \times 100 = \frac{31520}{28020} \times 100 = 112.49$$
(c) Paasche's Price Index for 1985, using 1980 as the base
$$P_{0i}^{Pa} = \frac{\sum p_i q_i}{\sum p_0 q_i} \times 100 = \frac{38720}{31520} \times 100 = 122.84$$
(d) Paasche's Quantity Index for 1985, using 1980 as the base
$$Q_{0i}^{Pa} = \frac{\sum q_i p_i}{\sum q_0 p_i} \times 100 = \frac{38720}{34370} \times 100 = 112.66$$
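The four indices of Example 15-4 can be computed directly from the price and quantity lists, without building the table of products by hand. The sketch below assumes the Example 15-2 data with 1985 quantities of 600, 640 and 500 qtls; the helper name `dot` is introduced here for the sum of products.

```python
def dot(a, b):
    """Sum of pairwise products, e.g. the sum of p*q over all commodities."""
    return sum(x * y for x, y in zip(a, b))


# Example 15-2 data, in the order fish, mutton, chicken.
p0 = [15, 18, 22]      # 1980 prices
p1 = [20, 23, 24]      # 1985 prices
q0 = [500, 590, 450]   # 1980 quantities
q1 = [600, 640, 500]   # 1985 quantities

# Laspeyre's indices use base-period weights.
laspeyres_price = dot(p1, q0) / dot(p0, q0) * 100
laspeyres_quantity = dot(q1, p0) / dot(q0, p0) * 100

# Paasche's indices use given-period weights.
paasche_price = dot(p1, q1) / dot(p0, q1) * 100
paasche_quantity = dot(q1, p1) / dot(q0, p1) * 100

print(f"Laspeyre's price index:    {laspeyres_price:.2f}")
print(f"Laspeyre's quantity index: {laspeyres_quantity:.2f}")
print(f"Paasche's price index:     {paasche_price:.2f}")
print(f"Paasche's quantity index:  {paasche_quantity:.2f}")
```

The only difference between the Laspeyre's and Paasche's versions is which period supplies the weights, so a single sum-of-products helper is enough for all four.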
Accordingly, the cost of collection of 500 qtls of fish, 590 qtls of mutton and 450 qtls of chicken
has increased by 22.66 per cent in 1985 as compared to what it was in 1980. Viewed differently,
it indicates that a fixed amount of goods sold at 1985 prices yields 22.66 per cent more revenue
than what it did at 1980 prices.
2. It also implies that a fixed amount of goods when purchased at 1985 prices would cost 22.66 per
cent more than what it did at 1980 prices. In this interpretation, the Laspeyre's Price Index serves as
the basis of constructing the cost of living index, for it tells how much more it costs to maintain
the base period standard of living at the current period prices.
Laspeyre's Quantity Index, too, has precise interpretations. It reveals the percentage change in total
expenditure in the given (current) period as compared to the base period if varying amounts of the same
basket of goods are sold at the base period prices. When viewed in this manner, we will be required to
spend 12.49 per cent more in 1985 as compared to 1980 if the quantities of fish, mutton and chicken for
1985 are sold at the base period (1980) prices.
A careful examination of the Paasche's Price Index will show that this too is amenable to the following
precise interpretations:
1. It compares the cost of collection of a fixed basket of goods selected in the given period with the
cost of collection of the same basket of goods in the base period.
Accordingly, the cost of collection of a fixed basket of goods containing 600 qtls of fish, 640 qtls
of mutton and 500 qtls of chicken in 1985 is about 22.84 per cent more than the cost of collecting
the same basket of goods in 1980. Viewed a little differently, it indicates that a fixed basket of
goods sold at 1985 prices yields 22.84 per cent more revenue than what it would have earned had it
been sold at the base period (1980) prices.
2. It also tells that a fixed amount of goods purchased at 1985 prices will cost 22.84 per cent more
than what it would have cost if this fixed amount of goods had been sold at base period (1980)
prices.
Analogously, Paasche's Quantity Index, too, has its own precise meaning. It tells the per cent change in
total expenditure in the given period as compared to the base period if varying amounts of the same
basket of goods are to be sold at given period prices. When so viewed, we will be required to spend
12.66 per cent more in 1985 as compared to 1980 if the quantities of fish, mutton and chicken for
1980 are sold at the given period (1985) prices.
Relationship Between Laspeyre's and Paasche's Indices
In order to understand the relationship between Laspeyre's and Paasche's Indices, the assumptions on
which the two indices are based must be borne in mind:
Laspeyre's index is based on the assumption that unless there is a change in tastes and preferences,
people continue to buy a fixed basket of goods irrespective of how high or low the prices are likely to be
in the future. Paasche's index, on the other hand, assumes that people would have bought the same
amount of a given basket of goods in the past irrespective of how high or low were the past prices.
However, the basic contention implied in the assumptions on which the two indices are based is not
true. For, people do make shifts in their purchase pattern and preferences by buying more of goods that
tend to become cheaper and less of those that tend to become costlier. In view of this, the following two
situations that are likely to emerge need consideration:
1. When the prices of goods that enter into the construction of these indices show a general tendency
to rise, those whose prices increase more than the average increase in prices will have smaller
quantities in the given period than the corresponding quantities in the base period. That is, $q_i$'s will
be smaller than $q_0$'s when prices in general are rising. Consequently, Paasche's index will have
relatively smaller weights than those in the Laspeyre's index and, therefore, the former ($P_{0i}^{Pa}$) will
words, Paasche's index will show a relatively greater fall when the prices in general tend to fall.
An important inference based on the above discussion is that the Paasche's index has a downward bias
and the Laspeyre's index an upward bias. This directly follows from the fact that the Paasche's index,
relative to the Laspeyre's index, shows a smaller rise when the prices in general are rising, and a greater
fall when the prices in general are falling.
It may, however, be noted that when the quantity demanded increases because of change in real income,
tastes and preferences, advertising, etc., the prices remaining unchanged, the Paasche's index will show
a higher value than the Laspeyre's index. In such situations, the Paasche's index will overstate, and the
Laspeyre's will understate, the changes in prices. The former now represents the upper limit, and the
latter the lower limit, of the range of price changes.
The relationship between the two indices can be derived more precisely by making use of the
coefficient of linear correlation computed as:
$$r_{xy} = \frac{\dfrac{\sum fXY}{N} - \dfrac{\sum fX}{N} \cdot \dfrac{\sum fY}{N}}{S_x S_y} \qquad \text{(15-15)}$$
in which X and Y denote the relative price movements ($p_i / p_0$) and relative quantity movements ($q_i / q_0$)
respectively. $S_x$ and $S_y$ are the standard deviations of price and quantity movements, respectively, while
$r_{xy}$ represents the coefficient of correlation between the relative price and quantity movements; f
represents the weights assigned, that is, $p_0 q_0$. N is the sum of frequencies, i.e. $N = \sum p_0 q_0$.
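Substituting the values of X, Y, f and N in Eq. (15-15), and then rearranging the expression, we have
$$r_{xy} S_x S_y = \frac{\sum p_i q_i}{\sum p_0 q_0} - \frac{\sum p_i q_0}{\sum p_0 q_0} \times \frac{\sum p_0 q_i}{\sum p_0 q_0}$$
If $V_{0i} = \dfrac{\sum p_i q_i}{\sum p_0 q_0}$ is the index of value expanded between the base period and the ith period, then
dividing both sides by $V_{0i}$ gives
$$\frac{r_{xy} S_x S_y}{V_{0i}} = 1 - \frac{\sum p_i q_0 \times \sum p_0 q_i}{\sum p_0 q_0 \times \sum p_i q_i}$$
$$\frac{r_{xy} S_x S_y}{V_{0i}} = 1 - P_{0i}^{La} \times \frac{1}{P_{0i}^{Pa}}$$
$$\frac{P_{0i}^{La}}{P_{0i}^{Pa}} = 1 - \frac{r_{xy} S_x S_y}{V_{0i}} \qquad \text{(15-16)}$$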
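The identity can be verified numerically. The sketch below computes the weighted correlation, the two standard deviations and the value ratio for the Example 15-2 data (1985 quantities taken as 600, 640 and 500 qtls) and compares the two sides of the relationship. Variable names are assumptions introduced here, and the indices are kept as plain ratios rather than percentages, as in the derivation.

```python
from math import sqrt

# Example 15-2 data, in the order fish, mutton, chicken.
p0 = [15.0, 18.0, 22.0]
p1 = [20.0, 23.0, 24.0]
q0 = [500.0, 590.0, 450.0]
q1 = [600.0, 640.0, 500.0]

f = [p * q for p, q in zip(p0, q0)]     # weights p0*q0
N = sum(f)                              # sum of weights
X = [a / b for a, b in zip(p1, p0)]     # relative price movements
Y = [a / b for a, b in zip(q1, q0)]     # relative quantity movements

# Weighted means, standard deviations and correlation.
mean_x = sum(w * x for w, x in zip(f, X)) / N
mean_y = sum(w * y for w, y in zip(f, Y)) / N
mean_xy = sum(w * x * y for w, x, y in zip(f, X, Y)) / N
s_x = sqrt(sum(w * x * x for w, x in zip(f, X)) / N - mean_x ** 2)
s_y = sqrt(sum(w * y * y for w, y in zip(f, Y)) / N - mean_y ** 2)
r_xy = (mean_xy - mean_x * mean_y) / (s_x * s_y)

# Value ratio and the two price indices, all as ratios (not multiplied by 100).
value_ratio = sum(a * b for a, b in zip(p1, q1)) / N
laspeyres = sum(a * b for a, b in zip(p1, q0)) / N
paasche = sum(a * b for a, b in zip(p1, q1)) / sum(a * b for a, b in zip(p0, q1))

lhs = laspeyres / paasche
rhs = 1 - r_xy * s_x * s_y / value_ratio

print(f"r_xy = {r_xy:.3f}, S_x = {s_x:.4f}, S_y = {s_y:.4f}")
print(f"Laspeyre's / Paasche's = {lhs:.6f}")
print(f"1 - r S_x S_y / V      = {rhs:.6f}")
```

For this particular data set the weighted correlation between price and quantity movements comes out positive, which is why Paasche's price index (122.84) is slightly above Laspeyre's (122.66) here.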
The relationship in Eq. (15-16) offers the following useful results:
1. $P_{0i}^{La} = P_{0i}^{Pa}$ when either $r_{xy}$, $S_x$ or $S_y$ is equal to zero. That is, the two indices will give the same
result either when there is no correlation between the price and quantity movements, or when the
price or quantity movements are in the same ratio so that $S_x$ or $S_y$ is equal to zero.
2. Since in actual practice $r_{xy}$ will have a negative value between 0 and -1, and as neither $S_x$ = 0 nor $S_y$
= 0, the right hand side of Eq. (15-16) will be greater than 1. This means that $P_{0i}^{La}$ is normally greater
than $P_{0i}^{Pa}$.
3. Given the overall movement in the index of value ($V_{0i}$) expanded, the greater the coefficient of
correlation ($r_{xy}$) between price and quantity movements and/or the greater the degree of dispersion
($S_x$ and $S_y$) in the price and quantity movements, the greater the discrepancy between $P_{0i}^{La}$ and $P_{0i}^{Pa}$.
4. The longer the time interval between the two periods to be compared, the more the chances for
price and quantity movements leading to higher values of $S_x$ and $S_y$. As the assumption of tastes,
habits, and preferences remaining unchanged breaks down over a longer period, people do find
enough time to make shifts in their consumption pattern, buying more of goods that may have
become relatively cheaper and less of those that may have become relatively dearer. All this will
end up with a higher degree of correlation between the price and quantity movement.
Consequently, $P_{0i}^{La}$ will diverge from $P_{0i}^{Pa}$ more in the long run than in the short run. So long as the
To overcome the difficulty of overstatement of changes in prices by the Laspeyre's index and
understatement by the Paasche's index, different indices have been developed to compromise and
improve upon them. These are particularly useful when the given period and the base period lie far
apart and result in a greater divergence between Laspeyre's and Paasche's indices.
1. Marshall-Edgeworth Index
The Marshall-Edgeworth Index uses the average of the base period and given period quantities/prices as
the weights, and is expressed as
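$$P_{0i}^{ME} = \frac{\sum p_i \left(\dfrac{q_0 + q_i}{2}\right)}{\sum p_0 \left(\dfrac{q_0 + q_i}{2}\right)} \times 100 \qquad \text{(15-17)}$$
$$Q_{0i}^{ME} = \frac{\sum q_i \left(\dfrac{p_0 + p_i}{2}\right)}{\sum q_0 \left(\dfrac{p_0 + p_i}{2}\right)} \times 100 \qquad \text{(15-18)}$$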
2. Dorbish and Bowley Index
The Dorbish and Bowley Index is defined as the arithmetic mean of the Laspeyre's and Paasche's
indices.
$$P_{0i}^{DB} = \frac{P_{0i}^{La} + P_{0i}^{Pa}}{2} \qquad \text{(15-19)}$$
$$Q_{0i}^{DB} = \frac{Q_{0i}^{La} + Q_{0i}^{Pa}}{2} \qquad \text{(15-20)}$$
3. Fisher’s Ideal Index
The Fisher’s Ideal Index is defined as the geometric mean of the Laspeyre’s and Paasche’s indices.
$$P_{0i}^{F} = \sqrt{P_{0i}^{La} \cdot P_{0i}^{Pa}}$$ …………(15-21)
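The three compromise formulae above can be compared numerically. The following is a minimal sketch in Python; the prices and quantities are hypothetical (they are not taken from the lesson), and the helper name `agg` is introduced only for illustration.

```python
from math import sqrt

# Hypothetical prices and quantities for three commodities (illustrative only).
p_0 = [2.0, 5.0, 4.0]    # base-period prices
q_0 = [8.0, 10.0, 14.0]  # base-period quantities
p_i = [4.0, 6.0, 5.0]    # given-period prices
q_i = [6.0, 11.0, 15.0]  # given-period quantities


def agg(prices, quantities):
    """Return the aggregate sum(price * quantity) over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))


# Laspeyre's index weights by base-period quantities, Paasche's by given-period ones.
laspeyres = agg(p_i, q_0) / agg(p_0, q_0) * 100
paasche = agg(p_i, q_i) / agg(p_0, q_i) * 100

# Marshall-Edgeworth: weights are the average of q_0 and q_i, as in Eq. (15-17).
q_avg = [(a + b) / 2 for a, b in zip(q_0, q_i)]
marshall_edgeworth = agg(p_i, q_avg) / agg(p_0, q_avg) * 100

# Dorbish-Bowley: arithmetic mean of the two; Fisher: their geometric mean.
dorbish_bowley = (laspeyres + paasche) / 2
fisher = sqrt(laspeyres * paasche)

for name, value in [("Laspeyre", laspeyres), ("Paasche", paasche),
                    ("Marshall-Edgeworth", marshall_edgeworth),
                    ("Dorbish-Bowley", dorbish_bowley), ("Fisher", fisher)]:
    print(f"{name:20s}{value:8.2f}")
```

With these assumed figures all three compromise indices fall between the Paasche's and the Laspeyre's values, which is the sense in which they compromise between the two.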
$$P_{0i} = \frac{\sum v \left(\frac{p_i}{p_0} \times 100\right)}{\sum v}$$ …………(15-23)
then v can be obtained either as
(i) the product of the base period prices and the base period quantities, denoted as $v_0$, that is, $v_0 = p_0 q_0$, or
(ii) the product of the base period prices and the given period quantities, denoted as $v_i$, that is, $v_i = p_0 q_i$
$${}_0P_{0i} = \frac{\sum p_0 q_0 \left(\frac{p_i}{p_0} \times 100\right)}{\sum p_0 q_0}$$ …………(15-24)
It may be seen that ${}_0P_{0i}$ is the same as the Laspeyre’s aggregative price index.
$${}_iP_{0i} = \frac{\sum p_0 q_i \left(\frac{p_i}{p_0} \times 100\right)}{\sum p_0 q_i}$$ …………(15-25)
It may be seen that ${}_iP_{0i}$ is the same as the Paasche’s aggregative price index.
$$Q_{0i} = \frac{\sum v \left(\frac{q_i}{q_0} \times 100\right)}{\sum v}$$ …………(15-26)
then v can be obtained either as
(i) the product of the base period quantities and the base period prices, denoted as $v_0$, that is, $v_0 = q_0 p_0$, or
(ii) the product of the base period quantities and the given period prices, denoted as $v_i$, that is, $v_i = q_0 p_i$
$${}_0Q_{0i} = \frac{\sum q_0 p_0 \left(\frac{q_i}{q_0} \times 100\right)}{\sum q_0 p_0}$$ …………(15-27)
It may be seen that ${}_0Q_{0i}$ is the same as the Laspeyre’s aggregative quantity index.
$${}_iQ_{0i} = \frac{\sum q_0 p_i \left(\frac{q_i}{q_0} \times 100\right)}{\sum q_0 p_i}$$ …………(15-28)
It may be seen that ${}_iQ_{0i}$ is the same as the Paasche’s aggregative quantity index.
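The equivalences stated for Eqs. (15-24), (15-25), (15-27) and (15-28) can be checked numerically. The sketch below uses hypothetical data (not from the lesson); `weighted_relatives` and `agg` are illustrative helper names.

```python
# Hypothetical prices and quantities for three commodities (illustrative only).
p_0 = [2.0, 5.0, 4.0]    # base-period prices
q_0 = [8.0, 10.0, 14.0]  # base-period quantities
p_i = [4.0, 6.0, 5.0]    # given-period prices
q_i = [6.0, 11.0, 15.0]  # given-period quantities


def agg(prices, quantities):
    """Return sum(price * quantity) over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))


def weighted_relatives(weights, given, base):
    """Weighted average of relatives: sum(v * (given/base) * 100) / sum(v)."""
    total = sum(v * (g / b) * 100 for v, g, b in zip(weights, given, base))
    return total / sum(weights)


# Price relatives weighted by v0 = p0*q0 and by vi = p0*qi.
price_with_v0 = weighted_relatives([p * q for p, q in zip(p_0, q_0)], p_i, p_0)
price_with_vi = weighted_relatives([p * q for p, q in zip(p_0, q_i)], p_i, p_0)

# Quantity relatives weighted by v0 = q0*p0 and by vi = q0*pi.
qty_with_v0 = weighted_relatives([q * p for q, p in zip(q_0, p_0)], q_i, q_0)
qty_with_vi = weighted_relatives([q * p for q, p in zip(q_0, p_i)], q_i, q_0)

# The corresponding aggregative indices.
laspeyres_price = agg(p_i, q_0) / agg(p_0, q_0) * 100
paasche_price = agg(p_i, q_i) / agg(p_0, q_i) * 100
laspeyres_qty = agg(p_0, q_i) / agg(p_0, q_0) * 100
paasche_qty = agg(p_i, q_i) / agg(p_i, q_0) * 100

print(round(price_with_v0, 2), round(laspeyres_price, 2))
print(round(price_with_vi, 2), round(paasche_price, 2))
```

The relatives cancel against the weights (for example $p_0 q_0 \cdot p_i/p_0 = p_i q_0$), which is why each pair agrees up to floating-point rounding.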
Example 15-5
From the data in Example 15.2 find the:
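$${}_0P_{0i} = \frac{\sum p_0 q_0 \left(\frac{p_i}{p_0} \times 100\right)}{\sum p_0 q_0} = \frac{3437000}{28020} = 122.66$$
$${}_iP_{0i} = \frac{\sum p_0 q_i \left(\frac{p_i}{p_0} \times 100\right)}{\sum p_0 q_i} = \frac{3872000}{31520} = 122.84$$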
Calculations for
Index of Weighted Average of Quantity Relatives
(Base Year = 1980)
Item       v0 = q0 p0    v1 = q0 p1    q0 p0 (q1/q0) x 100    q0 p1 (q1/q0) x 100
Fish             7500         10000                 900000                1200000
Mutton          10620         13570                1152000                1472000
Chicken          9900         10800                1100000                1200000
112.66
Although the indices of weighted average of price/quantity relatives yield the same results as the
Laspeyre's or Paasche's price/quantity indices, we do construct these indices also in situations when it is
necessary and advantageous to do so. Some such situations are as follows:
(i) When a group of commodities is to be represented by a single commodity in the group, the price
relative of the latter is weighted by the group as a whole.
(ii) Where the price/quantity relatives of individual commodities have been computed, these can be
more conveniently utilised in constructing the index.
(iii) Price/quantity relatives serve a useful purpose in splicing two index series having different base
periods.
(iv) Deseasonalizing a time series requires construction of a seasonal index, which also requires the use
of relatives.
3. Factor Reversal Test: This is the second of the two important tests of consistency proposed by
Prof. Irving Fisher. According to him:
“Just as our formula should permit the interchange of two times without giving inconsistent results, so it ought to
permit interchanging the price and quantities without giving inconsistent results – i.e., the two results multiplied
together should give the true value ratio, except for a constant of proportionality.”
This implies that if the price and quantity indices are obtained for the same data, the same base and
current periods, and using the same formula, then their product (without the factor 100) should give
the true value ratio. Symbolically, we should have
$$P_{01} \times Q_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_0} = V_{01}$$ …………(15-30)
Fisher’s formula satisfies the factor reversal test. In fact, Fisher’s index is the only index satisfying
this test, as none of the other formulae discussed in the lesson satisfies it.
Remark: Since Fisher’s index is the only index that satisfies both the time reversal and factor
reversal tests, it is termed as Fisher’s Ideal Index.
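The factor reversal test of Eq. (15-30) can be tried out on a small data set. The sketch below is illustrative only: the prices and quantities are hypothetical and the helper `agg` is not part of the lesson.

```python
from math import sqrt, isclose

# Hypothetical prices and quantities for three commodities (illustrative only).
p_0 = [2.0, 5.0, 4.0]    # base-period prices
q_0 = [8.0, 10.0, 14.0]  # base-period quantities
p_1 = [4.0, 6.0, 5.0]    # current-period prices
q_1 = [6.0, 11.0, 15.0]  # current-period quantities


def agg(prices, quantities):
    """Return sum(price * quantity) over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))


# True value ratio V01 = sum(p1*q1) / sum(p0*q0), without the factor 100.
value_ratio = agg(p_1, q_1) / agg(p_0, q_0)

# Laspeyre's and Paasche's price and quantity indices (without the factor 100).
lasp_p = agg(p_1, q_0) / agg(p_0, q_0)
lasp_q = agg(p_0, q_1) / agg(p_0, q_0)
paas_p = agg(p_1, q_1) / agg(p_0, q_1)
paas_q = agg(p_1, q_1) / agg(p_1, q_0)

# Fisher's price and quantity indices are the geometric means of the above.
fisher_p = sqrt(lasp_p * paas_p)
fisher_q = sqrt(lasp_q * paas_q)

fisher_passes = isclose(fisher_p * fisher_q, value_ratio)
laspeyres_passes = isclose(lasp_p * lasp_q, value_ratio)
print("Fisher satisfies factor reversal:", fisher_passes)
print("Laspeyre satisfies factor reversal:", laspeyres_passes)
```

For this assumed data the product of Fisher's price and quantity indices reproduces the value ratio, while the product of the two Laspeyre's indices does not.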
4. Circular Test: The circular test, first suggested by Westergaard, is an extension of the time reversal test
to more than two periods and is based on the shiftability of the base period. It requires the
index to work in a circular manner, a property that enables us to find the index numbers from
period to period without referring back to the original base each time. For three periods a, b, c, the
test requires:
$$P_{ab} \times P_{bc} \times P_{ca} = 1, \quad a \neq b \neq c$$ …………(15-31)
For Instance
Hence Laspeyre‘s index does not satisfy the circular test. In fact, circular test is not satisfied by any
of the weighted aggregative formulae with changing weights. This test is satisfied only by the index
number formulae based on:
2. Splicing the Old Series to Make it Continuous with the New Series
This means reducing the old series into the new series before the base period of the latter. As shown
in Table 15.8.2(ii), splicing here takes place at the base period of the new series. To do this, the ratio
of the index of 1980 of the new series (100) to the index of 1980 of the old series (200) is computed,
and the index for each of the preceding years of the old series is then multiplied by this ratio.
Table 15.8.2(ii)
Splicing the Old Series with the New Series
         Price Index      Price Index      Spliced Index Number
Year     (1976 = 100)     (1980 = 100)     [Old Series x (100/200)]
         (Old Series)     (New Series)
1976         100              --                 50
1977         120              --                 60
1978         146              --                 73
1979         172              --                 86
1980         200             100                100
1981          --             110                110
1982          --             116                116
1983          --             125                125
1984          --             140                140
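The splicing rule in the table heading, Old Series x (100/200), can be written as a short function. This is a minimal sketch: the function name and argument names are hypothetical, and only a few old-series values that can be read unambiguously (100, 120, 172 and 200) are used.

```python
def splice_old_to_new(old_values, old_at_overlap, new_at_overlap):
    """Rescale old-series index values onto the base of the new series.

    The ratio is (new-series index) / (old-series index) in the overlap year.
    """
    ratio = new_at_overlap / old_at_overlap
    return [value * ratio for value in old_values]


# Old-series values, ending with the overlap year 1980 (old = 200, new = 100).
old_series = [100, 120, 172, 200]
spliced = splice_old_to_new(old_series, old_at_overlap=200, new_at_overlap=100)
print(spliced)
```

From the overlap year onward the new series is used as it stands, so the spliced column simply continues with the new-series values.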
Another important consideration is that the base year should not be too remote in the past. A more
recent year needs to be selected as the base year. The use of a particular year for a prolonged period
would distort the changes that it purports to measure. That is why we find that the base year of major
index numbers, such as consumer price index or index of industrial production, is shifted from time to
time.
Selection of Weights to be Used: It should be amply clear from the various indices discussed in the
lesson that the choice of the system of weights, which may be used, is fairly large. Since any system of
weights has its own merits and is capable of giving results amenable to precise interpretations, the
weights used should be decided keeping in view the purpose for which an index is constructed.
It is also worthwhile to bear in mind that the use of any system of weights should represent the relative
importance of individual commodities that enter into the construction of an index. The interpretations
that are intended to be made from an index number are also important in deciding the weights. The use
of a system of weights that involves heavy computational work deserves to be avoided.
Type of Average to be Used: What type of average should be used is a problem specific to simple
average indices. Theoretically, one can use any of the several averages that we have, such as mean,
median, mode, harmonic mean, and geometric mean. Being locational averages, the median and
mode are not appropriate, especially where the number of years for which an index is
to be computed is not large.
While the use of harmonic mean and geometric mean has some definite merits over mean, particularly
when the data to be averaged refer to ratios, mean is generally more frequently used for convenience in
computations.
Choice of Index: The problem of selection of an appropriate index arises because of availability of
different types of indices giving different results when applied to the same data. Out of the various
indices discussed, the choice should be in favour of one which is capable of giving more accurate and
precise results, and which provides answer to specific questions for which an index is constructed.
While the Fisher's index may be considered ideal for its ability to satisfy the tests of adequacy, this too
suffers from two important drawbacks. First, it involves too lengthy computations, and second, it is not
amenable to easy interpretations as are the Laspeyre's and Paasche's indices. The use of the term ideal
does not, however, mean that it is the best to use under all types of situations. Other indices are more
appropriate under situations where specific answers are needed.
Selection of Commodities: Commodities to be included in the construction of an index should be
carefully selected. Only those commodities that make the index more representative deserve to be
included. This is, in fact, a problem of sampling, since it concerns the
selection of commodities to be included in the sample.
In this context, it is important to note that the selection of commodities must not be based on random
sampling. The reason is that in random sampling every commodity, including those that are not
important and relevant, has an equal chance of being selected, and consequently the index may not be
representative. The choice of commodities has, therefore, to be deliberate and in keeping with the
relevance and importance of each individual commodity to the purpose for which the index is
constructed.
Data Collection: Collection of data through a sample is the most important issue in the construction of
index numbers. The data collected are the raw material of an index. Data quality is the basic factor that
determines the usefulness of an index. The data have to be as accurate, reliable, comparable,
representative, and adequate, as possible.
The practical utility of an index also depends on how readily it can be constructed. Therefore, data
should be collected from where these can be easily available. While the purpose of an index number
will indicate what type of data are to be collected, it also determines the source from where the data can
be available.
15.5 CHECK YOUR PROGRESS
1. Index Numbers measure changes not capable of ……… measurement.
2. The wholesale price index numbers reflect the changes in the ……… level of a country.
3. Weighted aggregative Indices make up this ……… by assigning proper weights to individual
items.
4. The Paasche's index has a ………… bias and the Laspeyre's index an upward bias.
5. Splicing two index number series means reducing two ………….. index series with different base
periods into a single series either at the base period of the old series (one with an old base year), or
at the base period of the new series (one with a recent base year).
15.6 SUMMARY
Index numbers are statistical devices designed to measure the relative changes in the level of a certain
phenomenon in two or more situations. An index number may refer to a single variable or a group of distinct
but related variables. Index numbers have several uses: they act as economic barometers, help in
studying trends and tendencies, help in formulating decisions and policies, measure the purchasing
power of money, and are used for deflation. They are of three types: price index numbers,
quantity index numbers and value index numbers. Various indices can also be distinguished on the basis
of the number of commodities that go into the construction of an index. Indices constructed for
individual commodities or variables are termed simple index numbers. Those constructed for a group
of commodities or variables are known as aggregative (or composite) index numbers. There are various
formulae for the construction of index numbers. None of the formulae measures the price changes or
quantity changes with perfection; each has some bias. The problem is to choose the most appropriate
formula in a given situation. As a measure of the formula error, a number of mathematical tests, known
as the tests of consistency or tests of adequacy, have been suggested. These are the unit test, time reversal
test, factor reversal test and circular test.
15.7 KEYWORDS
Price Index Numbers: The price index numbers measure the general changes in the prices.
Quantity Index Numbers: Quantity index numbers study the changes in the volume of goods produced
(manufactured), consumed or distributed.
Value Index Numbers: These are intended to study the change in the total value (price multiplied by
quantity) of output such as indices of retail sales or profits or inventories.
Simple Index Numbers: A simple index number is based on the price or quantity of a single
commodity.
Splicing two index number series: It means reducing two overlapping index series with different base
periods into a single series either at the base period of the old series (one with an old base year), or at
the base period of the new series (one with a recent base year).
13. Construct index number of price and index number of quantity from the following data using:
a. Laspeyre’s formula,
b. Paasche’s formula,
c. Dorbish and Bowley’s formula,
d. Marshall and Edgeworth’s formula, and
e. Fisher’s Ideal Index formula
Base Year Current Year
Commodities Price Quantity Price Quantity
A 2 8 4 15
B 5 10 15 5
C 4 14 5 10
D 2 19 2 13
Which of the formula satisfy
A 5 30 40
B 8 20 30
C 10 10 20
15. From the following data, construct price index by using Weighted Average of Price Relatives
Method:
Quantity Base Year Price Current Year Price
Commodities
16. From the information given below, calculate the Cost of Living Index number for 1985, with 1984
as base year by
a. Aggregative Expenditure Method, and
STRUCTURE
16.2 Introduction
16.6 Summary
16.7 Keywords
16.2 INTRODUCTION
The subject of ‘quality control’ has assumed considerable importance in recent years in the wake of
the globalisation of economies the world over. As a result, there has been a tremendous increase in
competition amongst business enterprises both within and outside the country.
Quality has been defined in different ways by different experts, but almost all those definitions emphasise
that quality must meet the requirements of the customer. While quality is vital for providing
satisfaction to the customer, it goes far beyond this. For industrial and commercial organisations, quality
is not only central to profitability but crucial to business survival. This aspect has assumed considerable
importance in today’s tough and challenging business environment. If quality is ignored or overlooked
by these organisations, then their continued existence is in danger.
The main factor that affects quality is variability in the process. This variability does not allow a factory
to provide a product of consistently standard quality. Prior to mass production, an individual worker or a
few of them produced by hand, checking frequently whether the product was coming out as they
had conceived it. If it was distorted, they would check where the fault lay, measure, and rework
it. However, when goods began to be manufactured on a mass scale, it became apparent that
individual items could not be identical. It is almost impossible to eliminate variability completely. Such
a situation poses a major problem in that the parts that are supposed to fit together would not fit. This
shows that variability is the cause of poor quality.
The various causes of variation in the product may be classified into two categories:
(a) Specific and identifiable
(b) Random and chance
The first category comprises such causes as the use of defective raw material, poor equipment, poor
workmanship, and so on, while the second category contains causes that do not have any bearing on the
production process. The main purpose of our quality control exercise is to segregate specific and
identifiable causes from the chance or random causes.
In the early days of mass production, inspection of the product and sorting out the defective ones was
the chief method used for quality control. It was thought that the rejection of defective items would not
cost much as the marginal cost of each unit was small. But gradually it became apparent that the costs
of defective items were much higher than supposed earlier. This is because a number of people had to
be employed to inspect the product, and the goodwill of customers was lost as well.
This realisation laid emphasis on doing things right at the very first time, focussing on the concept of
zero defects. This means that efforts must be made to prevent defects at each stage of manufacturing a
product or delivering a service. In order to achieve this, workers engaged in the production are given the
responsibility to check their output rather than to pass it on for a final inspection. One major benefit of
this approach is that workers feel a sense of pride and satisfaction for the responsibility given to them.
[Fig. 16.1 A specimen of control chart: the variable or attribute is plotted against time around a central line, with a stable zone (1) on either side of it, warning zones (2) between the warning limits and the control limits, and action zones (3) beyond the upper and lower control limits.]
The action required depends on the zones in which the results fall. The possibilities are:
(xviii) Nothing needs to be done in case of stable zone wherein variation occurs due
to common causes only.
(xix) In respect of warning zone, there seem to be special causes of variation. There
is a need for collecting more information and having a watchful eye on the
process.
(xx) Action zone suggests that special causes of variation in the process are present.
The situation demands further investigation and where appropriate the process
needs to be adjusted.
These three situations can be compared to traffic lights, which signal ‘stop’, ‘caution’
or ‘go’. Let us examine in some more detail major parts of a control chart.
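[Figure: parts of a control chart. Quality scale (vertical axis) against sample (sub-group) number (horizontal axis), with the central line (average), control limits at 3 sigma on either side (including the lower control limit, LCL), and out-of-control regions beyond the limits.]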
from the average, that is, mean –3 standard deviation. The upper and lower control limits are
usually drawn as dotted lines.
We have just said that the upper and lower control limits are set at $3\sigma$ limits. One may ask the reason for
this approach. It may be noted that the $3\sigma$ limits were first proposed by Shewhart for his control charts.
On the basis of probability considerations, if a variable X is normally distributed, the probability that a
random observation on the variable will lie within $\mu \pm 3\sigma$ (where $\mu$ is the mean and $\sigma$ the standard
deviation of X) is 0.997, which is extremely high. It may be recalled that the area of the normal curve
between $\mu - 3\sigma$ and $\mu + 3\sigma$ is 99.73 per cent. This means that the probability of a random observation going
beyond these limits is nearly 0.003. In other words, when the variable quality characteristic is assumed to be
normally distributed, the probability of a sample point going outside the $3\sigma$ limits when the process
is in control is very small. If a sample point goes beyond these limits, it is highly likely that the normality
assumption of the process is not applicable.
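The probabilities quoted above follow from the standard normal distribution and can be reproduced with the error function in Python's standard library. This is a small illustrative check; the function name `prob_within` is hypothetical.

```python
from math import erf, sqrt


def prob_within(k):
    """P(mu - k*sigma < X < mu + k*sigma) for a normally distributed X."""
    return erf(k / sqrt(2))


p_3_sigma = prob_within(3.0)    # coverage of the 3-sigma control limits
p_outside = 1.0 - p_3_sigma     # chance of a false signal for one point
p_warning = prob_within(1.96)   # coverage of limits set at 1.96 sigma

print(f"within 3 sigma   : {p_3_sigma:.4f}")
print(f"outside 3 sigma  : {p_outside:.4f}")
print(f"within 1.96 sigma: {p_warning:.4f}")
```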
In order to set up a sound quality control mechanism, the concerned organisation must be keenly
interested. It must take the following steps.
First, it must select the quality characteristics, which need to be kept under control. Besides, both their
upper and lower limits within which variation can be tolerated should be fixed up. Second, the
production process must be analysed so that the possible causes of variation can be determined. Finally,
it must lay down as to how the inspection data will be collected and recorded as also how they will be
subdivided. Depending on the type of inspection data available, any one of the following types of
control charts can be used.
(xxi) Control charts for $\bar{x}$
(xxii) Control chart for $\sigma$ or $R$ alone
(xxiii) Control chart for C
(xxiv) Control chart for p or pn
In order to ascertain whether the process is in control or out of control, $\bar{x}$-charts are constructed. In
regard to the process output, there is an assumption of normality where $\mu$ and $\sigma$ are known, though in
many situations this assumption may not hold good. We know that the sample means have a sampling
distribution with $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \sigma/\sqrt{n}$. The construction of an $\bar{x}$-chart needs the values of $\mu$ and $\sigma$ and
also a sample size $n$. There are three lines in an $\bar{x}$ control chart, viz. the centre line indicating $\mu_{\bar{x}}$, the
upper control limit (UCL), with value $\mu_{\bar{x}} + 3\sigma_{\bar{x}}$, and the lower control limit (LCL), with value $\mu_{\bar{x}} - 3\sigma_{\bar{x}}$.
In addition to the control limits, there are warning limits, which are set at $1.96\sigma_{\bar{x}}$ on either side
of the centre line. Thus, the upper warning limit (UWL) = $\mu + 1.96\sigma/\sqrt{n}$ and the lower warning limit
(LWL) = $\mu - 1.96\sigma/\sqrt{n}$. Fig. 16.1 shows these two warning limits. However, control charts do not
normally show the warning limits.
Let us take an example to illustrate the procedure used in constructing $\bar{x}$ control charts.
Example 16.1. A company is engaged in the manufacture of battery cells in its plant. The process is
said to be under control if the mean life of battery cells is 1,200 hrs with a standard deviation of 75 hrs.
Considering these values to be the process average and process dispersion, you are required to
determine the 3-sigma control limits for the $\bar{x}$-chart for samples of size 16.
Solution. Given are $\mu$ = 1,200 hrs, $\sigma$ = 75 hrs and $n$ = 16.
As the estimates of process average and process dispersion are based on a large sample, the desired
control limits can be obtained by the following formula:
$$\mu \pm 3\sigma/\sqrt{n}$$
Substituting the values in the above formula,
UCL = $\mu + 3(75/\sqrt{16})$
= 1,200 + 56.25
= 1,256.25
LCL = $\mu - 3(75/\sqrt{16})$
= 1,200 – 56.25
= 1,143.75
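The calculation in Example 16.1 can be reproduced with a few lines of Python. This is a sketch under the assumption that the sample size is 16, the value that yields the margin of 56.25 shown in the solution; the function name is hypothetical.

```python
from math import sqrt


def xbar_chart_limits(mu, sigma, n, k=3.0):
    """Return (lower, upper) limits mu -/+ k*sigma/sqrt(n) for an x-bar chart."""
    half_width = k * sigma / sqrt(n)
    return mu - half_width, mu + half_width


# Control limits (k = 3) for mu = 1,200 hrs, sigma = 75 hrs, n = 16 (assumed).
lcl, ucl = xbar_chart_limits(1200, 75, 16)

# Warning limits use k = 1.96 with the same formula.
lwl, uwl = xbar_chart_limits(1200, 75, 16, k=1.96)

print(f"LCL = {lcl:.2f}, UCL = {ucl:.2f}")
print(f"LWL = {lwl:.2f}, UWL = {uwl:.2f}")
```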
The preceding discussion has given us some basic ideas on the $\bar{x}$-chart. The question is how to construct
$\bar{x}$-charts when the population mean and population standard deviation are not known to us.
In such cases, we use sample information to estimate the unknown parameters. Let us first take the
estimation of $\mu$. This can be done by taking the mean of the sample means ($\bar{\bar{x}}$), which can be calculated
by the following formula
$$\bar{\bar{x}} = \frac{\sum x}{n \times k} = \frac{\sum \bar{x}}{k}$$
where, n = number of observations in each sample
k = number of samples taken
In respect of control charts, it has become customary to use $\bar{R}$ as an estimate of $\sigma$. $\bar{R}$ signifies the
average of the sample ranges. It is a biased estimator of $\sigma$, and $d_2$ is the correction factor. The value of
$d_2$ is given in the question. Thus, the upper and lower control limits (UCL and LCL) for an $\bar{x}$-chart are
computed with the following formulas:
$$UCL = \bar{\bar{x}} + \frac{3\bar{R}}{d_2\sqrt{n}}$$
$$LCL = \bar{\bar{x}} - \frac{3\bar{R}}{d_2\sqrt{n}}$$
In the above formulas, $d_2$ stands for the control chart factor. These limits are often calculated as $\bar{\bar{x}} \pm A_2\bar{R}$,
where $A_2 = 3/(d_2\sqrt{n})$.
By using these formulas, we can now plot the three lines: CL (central line), UCL (upper control limit)
and LCL (lower control limit). Let us take an example to show how these formulas can be used.
Example 16.2. Suppose we are given the following information:
$n$ = 20, $\bar{\bar{x}}$ = 75, $d_2$ = 3.735 and $\bar{R}$ = 15. We are asked to find the CL, UCL and LCL for an $\bar{x}$ control
chart.
Solution. It is obvious that CL is the grand mean, that is, 75.
$$UCL = \bar{\bar{x}} + \frac{3\bar{R}}{d_2\sqrt{n}} = 75 + \frac{3(15)}{3.735 \times \sqrt{20}} = 75 + \frac{45}{16.70} = 77.69$$
$$LCL = \bar{\bar{x}} - \frac{3\bar{R}}{d_2\sqrt{n}} = 75 - \frac{3(15)}{3.735 \times \sqrt{20}} = 75 - \frac{45}{16.70} = 72.31$$
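The arithmetic of Example 16.2 can be verified as follows. This is a minimal sketch; the function name is hypothetical, and the factor $A_2 = 3/(d_2\sqrt{n})$ is computed from the given $d_2$ rather than read from a table.

```python
from math import sqrt


def xbar_limits_from_range(grand_mean, r_bar, d2, n):
    """Return (LCL, CL, UCL) of an x-bar chart when sigma is estimated from R-bar."""
    a2 = 3.0 / (d2 * sqrt(n))      # control chart factor A2
    half_width = a2 * r_bar
    return grand_mean - half_width, grand_mean, grand_mean + half_width


# Values given in Example 16.2.
lcl, cl, ucl = xbar_limits_from_range(grand_mean=75, r_bar=15, d2=3.735, n=20)
print(f"LCL = {lcl:.2f}, CL = {cl:.2f}, UCL = {ucl:.2f}")
```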
Example 16.3. A company manufactures tyres. A quality control engineer is responsible to ensure that
the tyres turned out are fit for use up to 40,000 km. He monitors the life of the output from the
production process. From each of the 10 batches of 900 tyres, he has tested 5 tyres and recorded the
following data, with x and R measured in thousands of km.
Batch 1 2 3 4 5 6 7 8 9 10
x 40.2 43.1 42.4 39.8 43.1 41.5 40.7 39.2 38.9 41.9
R 1.3 1.5 1.8 0.6 2.1 1.4 1.6 1.1 1.3 1.5
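Construct an $\bar{x}$-chart using the above data. Do you think that the production process is in control?
Explain. (Value of $d_2$ = 2.326)
Solution.
$$\bar{\bar{x}} = \frac{\sum \bar{x}}{k} = \frac{410.8}{10} = 41.08$$
$$\bar{R} = \frac{\sum R}{k} = \frac{14.2}{10} = 1.42$$
CL = 41.08
$$UCL = \bar{\bar{x}} + \frac{3\bar{R}}{d_2\sqrt{n}} = 41.08 + \frac{3(1.42)}{2.326 \times \sqrt{5}} = 41.08 + \frac{4.26}{2.326 \times 2.24} \approx 41.90$$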
[Figure: $\bar{x}$-chart of tread life (in thousand km) against batch number, showing the UCL, CL and LCL.]
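The full set of limits for Example 16.3 can be worked out from the tabulated batch means and ranges. The sketch below applies the formula $\bar{\bar{x}} \pm 3\bar{R}/(d_2\sqrt{n})$ with $n$ = 5 tyres per batch and $d_2$ = 2.326 as given in the example; the variable names are hypothetical, and the list of batches outside the limits is simply what this calculation yields.

```python
from math import sqrt

# Batch means and ranges from Example 16.3 (thousand km).
x_bars = [40.2, 43.1, 42.4, 39.8, 43.1, 41.5, 40.7, 39.2, 38.9, 41.9]
ranges = [1.3, 1.5, 1.8, 0.6, 2.1, 1.4, 1.6, 1.1, 1.3, 1.5]
n = 5        # tyres tested per batch
d2 = 2.326   # control chart factor given in the example

grand_mean = sum(x_bars) / len(x_bars)   # centre line
r_bar = sum(ranges) / len(ranges)        # average range
half_width = 3 * r_bar / (d2 * sqrt(n))

ucl = grand_mean + half_width
lcl = grand_mean - half_width

# Batches whose mean falls outside the control limits.
outside = [batch for batch, x in enumerate(x_bars, start=1)
           if x > ucl or x < lcl]

print(f"CL = {grand_mean:.2f}, UCL = {ucl:.2f}, LCL = {lcl:.2f}")
print("Batches outside the limits:", outside)
```

Since several batch means lie beyond the computed limits, this calculation suggests that the process is not in statistical control.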
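information regarding the sampling distribution of R, in particular its standard deviation $\sigma_R$. For this
purpose, the following formula is used.
$$\sigma_R = d_3\sigma$$
where
$\sigma$ = population standard deviation
$d_3$ = another factor depending on $n$
The values of $d_3$ are given in the question.
Control Limits for an R-Chart
$$UCL = \bar{R} + \frac{3 d_3 \bar{R}}{d_2} = \bar{R}\left(1 + \frac{3 d_3}{d_2}\right)$$
$$LCL = \bar{R} - \frac{3 d_3 \bar{R}}{d_2} = \bar{R}\left(1 - \frac{3 d_3}{d_2}\right)$$
UCL = $\bar{R} D_4$, where $D_4 = 1 + \frac{3 d_3}{d_2}$
LCL = $\bar{R} D_3$, where $D_3 = 1 - \frac{3 d_3}{d_2}$
The values of $D_3$ and $D_4$ can also be found from the table of control chart factors.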
Example 16.4. We have to determine the UCL and the LCL by applying the above formulae to the data
given in Example 16.3.
Solution. The UCL and the LCL are calculated as follows:
$$UCL = \bar{R}\left(1 + \frac{3 d_3}{d_2}\right) = 1.42\left(1 + \frac{3(0.864)}{2.326}\right) = 1.42\,(1 + 1.11) = 2.996 \text{ or 3 approx.}$$
$$LCL = \bar{R}\left(1 - \frac{3 d_3}{d_2}\right) = 1.42\left(1 - \frac{3(0.864)}{2.326}\right) = 1.42 \times (-0.11) = -0.156 \text{ (to be taken as zero)}$$
Some explanation is needed for the zero value of LCL. A sample range is always a non-negative
number (because it is the difference between the largest and smallest observations in the sample).
However, when $n \le 6$, the LCL computed by the above equation will be negative. In this case
$n$ is 5, and the calculation accordingly shows a negative value. As such, we set the value of LCL at zero.
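The R-chart limits of Example 16.4 can be recomputed as below. This is an illustrative sketch with a hypothetical function name; it uses $\bar{R}$ = 1.42, $d_2$ = 2.326 and $d_3$ = 0.864 from the example and truncates a negative lower limit at zero, as the text prescribes.

```python
def r_chart_limits(r_bar, d2, d3):
    """Return (LCL, UCL) of an R-chart: R-bar * D3 and R-bar * D4."""
    d4_factor = 1 + 3 * d3 / d2
    d3_factor = max(0.0, 1 - 3 * d3 / d2)   # a range cannot be negative
    return r_bar * d3_factor, r_bar * d4_factor


lcl, ucl = r_chart_limits(r_bar=1.42, d2=2.326, d3=0.864)
print(f"LCL = {lcl:.3f}, UCL = {ucl:.3f}")
```

Without intermediate rounding the upper limit comes to about 3.00, which agrees with the rounded figure in the solution.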
A major limitation of R-chart arises from the characteristic of range itself. As we know that the range
considers only the highest and the lowest values in a distribution, it may ignore the nature of variation in
the remaining observations. Further, it is influenced by extreme values, which may significantly differ
from one sample to the other. In view of these limitations, R-chart is only a convenient device for
examining variability of the process.
$$UCL = \bar{C} + 3\sqrt{\bar{C}}$$
$$LCL = \bar{C} - 3\sqrt{\bar{C}}$$
This formula is based on a normal curve approximation to the Poisson distribution. The use of the C-
chart is appropriate if the occasions for a defect in each production unit are infinite, but the probability
of a defect at any point is very small and is constant.
Example 16.5.
Fifteen pieces of cloth from different rolls contained respectively 1, 5, 3, 2, 7, 6, 3, 2, 6, 5, 4, 3, 4, 6, and
3 imperfections. Draw a control chart using these data and state whether the process is in a state of
statistical control.
Solution.
$\bar{C}$ = (1 + 5 + 3 + 2 + 7 + 6 + 3 + 2 + 6 + 5 + 4 + 3 + 4 + 6 + 3)/15
= 60/15 = 4
UCL = $\bar{C} + 3\sqrt{\bar{C}}$
= $4 + 3\sqrt{4}$ = 4 + 6 = 10
LCL = $\bar{C} - 3\sqrt{\bar{C}}$
= $4 - 3\sqrt{4}$ = 4 – 6 = –2
Since the number of defects cannot be negative, the lower control limit will be taken as zero. Figure
16.4 shows both the control limits. The chart clearly shows that all the imperfections in cloth are within
the control limits, that is, no point lies outside the control limits. This suggests that the process is in a
state of statistical control.
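The C-chart calculation can be checked with a short script. The counts below are those summed in the solution (total 60 over 15 rolls); the variable names are hypothetical.

```python
from math import sqrt

# Imperfections per roll, as summed in the solution of Example 16.5.
defects = [1, 5, 3, 2, 7, 6, 3, 2, 6, 5, 4, 3, 4, 6, 3]

c_bar = sum(defects) / len(defects)          # centre line
ucl = c_bar + 3 * sqrt(c_bar)
lcl = max(0.0, c_bar - 3 * sqrt(c_bar))      # a count cannot be negative

# Rolls whose count falls outside the control limits.
outside = [roll for roll, c in enumerate(defects, start=1)
           if c > ucl or c < lcl]

print(f"CL = {c_bar}, UCL = {ucl}, LCL = {lcl}")
print("Rolls outside the limits:", outside)
```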
[Figure: C-chart of the number of imperfections against rolls of cloth, showing UCL = 10, CL = 4 and LCL = 0.]
The control chart for attributes is known as the p-chart. Such a chart is used to control the proportion or
percentage of defectives per sample. It may be noted that there is an assumption that the items are
produced by Bernoulli process, which implies the following three assumptions: (i) There are only two
outcomes– acceptable or defective. (ii) The outcomes occur randomly. (iii) There is no change in the
probability of either outcome for each trial.
As we have seen earlier, the C-chart is concerned with a count per unit. Such a count can easily be
converted into a proportion by dividing the number of defectives by the sample size. Thus, we can use the
p-chart in place of the C-chart. In order to draw the p-chart, we have to follow the following procedure:
(xxv) Calculate the average fraction defective ($\bar{p}$) by dividing the number of defective
units by the total number of units inspected.
(xxvii) The upper and lower control limits are to be obtained by using the following
formulas:
$$UCL = \bar{p} + 3\sqrt{\frac{\bar{p}\,\bar{q}}{n}}$$
$$LCL = \bar{p} - 3\sqrt{\frac{\bar{p}\,\bar{q}}{n}}$$
where $\bar{q} = (1 - \bar{p})$
Any sample point falling outside the UCL and the LCL indicates that the process is not in control. It is
preferable to set up the chart in terms of ‘percent defective’ rather than ‘fraction defective’.
Example 16.6. The following figures give the number of defectives in 10 samples, each containing 200
items: 40, 44, 22, 34, 24, 32, 28, 32, 34 and 30. Calculate the values for central line and the upper and
lower control limits of p-chart. Draw the p-chart and comment if the process can be regarded in control.
Solution.
Table 1. Worksheet for calculating the values for p-chart
Sample No.    No. of defectives    Fraction defective
1             40                   0.20
2             44                   0.22
3             22                   0.11
4             34                   0.17
5             24                   0.12
6             32                   0.16
7             28                   0.14
8             32                   0.16
9             34                   0.17
10            30                   0.15
Total         320
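The central line and control limits for this worksheet follow from the p-chart formulas given earlier. The sketch below is illustrative, with hypothetical variable names; it uses the ten defective counts and the sample size of 200 stated in the example.

```python
from math import sqrt

# Number of defectives in each of the 10 samples of Example 16.6.
defectives = [40, 44, 22, 34, 24, 32, 28, 32, 34, 30]
n = 200  # items per sample

# Average fraction defective: total defectives / total items inspected.
p_bar = sum(defectives) / (n * len(defectives))
q_bar = 1 - p_bar
sigma_p = sqrt(p_bar * q_bar / n)

ucl = p_bar + 3 * sigma_p
lcl = max(0.0, p_bar - 3 * sigma_p)   # a fraction cannot be negative

fractions = [d / n for d in defectives]
outside = [i for i, p in enumerate(fractions, start=1) if p > ucl or p < lcl]

print(f"CL = {p_bar:.4f}, UCL = {ucl:.4f}, LCL = {lcl:.4f}")
print("Samples outside the limits:", outside)
```

Multiplying the three values by 100 gives the limits in percentage-defective terms.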
It will be seen from Fig. 16.6 that all the sample points fall within the upper and lower control limits. On the
basis of this chart, we can say that the process is well under control. It may be noted that we have
plotted the percentage defective instead of fraction defective in the above chart.
[Fig. 16.6 p-chart: percentage defective against number of samples, showing the UCL, CL and LCL.]
Benefits
There are several benefits of SQC approach and these include:
(xxviii) SQC can be applied to any type of problem selected, and the process originally tackled will
show improvement.
(xxix) This approach eliminates the ‘emotion’ factor and the decisions are based on facts rather than on
opinions.
(xxx) As the workers are directly involved in the improvement process, their ‘quality awareness’
increases.
(xxxi) The knowledge and experience potential of those involved in the process is released in a
systematic way through the investigative approach. They increasingly realise that their role in
problem solving is collecting and communicating the relevant facts on which decisions are made.
(xxxii) Managers and supervisors solve problems methodically instead of in a haphazard manner. Thus,
the approach to the problem becomes unified in place of an individual approach earlier.
(xxxiii) In case of any inquiry from the government or any other appropriate authority, the quality can be
defended on the basis of statistical process control.
(xxxiv) Since the firm strictly adheres to the SQC, the users of the product may rely on it and may not
resort to check the quality themselves.
Limitations
Despite the above mentioned advantages of the SQC, it may be noted that it is unable to solve all the
problems arising in quality improvement. There are several highly complex problems where SQC may
not be in a position to contribute much towards reduction of variability. This apart, at times, managers
use SQC mechanically and construct control charts without going into the depth of the problem. As a
result, statistical methods have been criticised at times. It has been argued that continuous improvement
in quality can be attained by studying all parts of an organisation and not merely one part viz.
production process.
3. The construction of $\bar{x}$-chart needs the values of $\mu$ and $\sigma$ and also a .................. .
4. Nothing needs to be done in case of stable zone wherein variation occurs due to
................ causes only.
5. In respect of warning zone, there seem to be special causes of ...........................
6. Action zone suggests that special causes of variation in the process
are ...........................
16.6 SUMMARY
In the early days of mass production, inspection of the product and sorting out the defective ones was
the chief method used for quality control. It was thought that the rejection of defective items would not
cost much as the marginal cost of each unit was small. But gradually it became apparent that the costs
of defective items were much higher than supposed earlier. This is because a number of people had to
be employed to inspect the product besides losing the goodwill of the customers. Statistical Quality
Control (SQC) is the application of appropriate statistical tools to processes to ensure continuous
improvement in quality of products, services and productivity in the workforce. Control charts show a
step-by-step approach to statistical quality control. These are ‘road-maps’ that are very helpful in
solving the problems pertaining to quality. The underlying feature of such a chart is that there are
certain SQC techniques that are most appropriate in each step. Depending on the type of inspection data
available, any one of the following types of control charts can be used: control charts for $\bar{x}$, control
chart for $\sigma$ or $R$ alone, control chart for C, control chart for p or pn. There are several benefits and
limitations of SQC. The objective of acceptance sampling is either to accept or to reject the product. It
does not attempt to control the quality during the manufacturing process. This is altogether a different
approach from control charts.
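As a rough illustration of the x̄-chart mentioned above, the following Python sketch computes the control limits from a known process mean µ, a known standard deviation σ and a sample size n, and lists the samples whose means fall outside those limits. The function names, the 3-sigma multiplier k and the figures used in the usage note are illustrative assumptions and are not taken from the lesson.

```python
import math


def xbar_chart_limits(mu, sigma, n, k=3.0):
    """Return (LCL, CL, UCL) for an x-bar chart.

    Assumes the process mean (mu) and standard deviation (sigma) are known
    and that every sample has the same size n. k=3 gives the usual
    3-sigma limits: mu +/- k * sigma / sqrt(n).
    """
    if n <= 0:
        raise ValueError("sample size n must be positive")
    half_width = k * sigma / math.sqrt(n)
    return mu - half_width, mu, mu + half_width


def out_of_control(sample_means, mu, sigma, n, k=3.0):
    """Return the 1-based numbers of samples whose mean lies outside the limits."""
    lcl, _, ucl = xbar_chart_limits(mu, sigma, n, k)
    return [i for i, m in enumerate(sample_means, start=1) if m < lcl or m > ucl]
```

With the hypothetical values µ = 50, σ = 2 and n = 4, the half-width is 3 × 2 / √4 = 3, so the limits are 47 and 53; sample means of 49.5, 53.2, 50.1 and 46.8 would then flag the second and fourth samples.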
16.7 KEYWORDS
Acceptance sampling: It involves sampling inspection by a purchaser who has to decide whether to
accept a shipment of product.
P-Chart: The control chart for attributes is known as the p-chart. Such a chart is used to control the
proportion or percentage of defectives per sample.
R-Chart: In this chart, the value of the sample range for each of the samples is plotted. The central line
for the R-chart is placed at R̄.
Quality scale: This is a vertical scale, which is marked as per the chosen quality characteristic (either in
variables or attributes) of each sample.
Plotted samples: The control chart does not show the qualities of individual items of a sample. Instead,
the quality of the entire sample represented by a single value (a statistic) is shown. The single value
plotted on the chart is in the form of a dot (or sometimes a small circle or a cross).
Sample numbers: The samples, which are also referred to as sub-groups in SQC, are numbered
individually on a control chart and are shown on a horizontal line.
The horizontal lines: The central line represents the average quality of the samples plotted on the chart.
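The p-chart and R-chart defined in the keywords above can be sketched in Python as follows. This is a minimal sketch, assuming equal sample sizes and the conventional 3-sigma limits for the proportion of defectives; the function names and the figures in the usage note are hypothetical and do not come from the lesson.

```python
import math


def p_chart_limits(defectives, n, k=3.0):
    """Return (LCL, p_bar, UCL) for a p-chart.

    defectives: number of defective items found in each sample.
    n: common sample size (assumed equal for all samples).
    Limits are p_bar +/- k * sqrt(p_bar * (1 - p_bar) / n),
    clipped to the range [0, 1] because a proportion cannot leave it.
    """
    if n <= 0 or not defectives:
        raise ValueError("need a positive sample size and at least one sample")
    p_bar = sum(defectives) / (n * len(defectives))
    std_error = math.sqrt(p_bar * (1 - p_bar) / n)
    lcl = max(0.0, p_bar - k * std_error)
    ucl = min(1.0, p_bar + k * std_error)
    return lcl, p_bar, ucl


def r_chart_centre_line(samples):
    """Return R-bar, the mean of the sample ranges (central line of an R-chart)."""
    ranges = [max(s) - min(s) for s in samples]
    return sum(ranges) / len(ranges)
```

For example, four samples of 100 items with 4, 6, 5 and 5 defectives give an average proportion defective of 0.05, a lower limit clipped to 0 and an upper limit of about 0.115.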
16.8 SELF-ASSESSMENT TEST
1. What is statistical quality control? How is it useful to industry?
2. What is a control chart? Describe how it is constructed and used.
3. Describe briefly the working of the p-chart.
4. Write a detailed note on 'Acceptance Sampling'.
16.9 ANSWERS TO CHECK YOUR PROGRESS
3. Sample size n
4. Common
5. Variation
6. Present