Reading Material - Lesson 1-8 - Data Analysis
Data Analysis
(Reading Material)
Topics Include:
1. Data, 2. Univariate Frequency Distribution, 3. Dispersion, 4. Introduction to
Correlation and Regression, 5. Introduction to Probability, 6. Normal Distribution,
7. Hypothesis Testing, 8. Index Number.
Skill Enhancement Course (SEC)
DATA ANALYSIS
Course Description
This course introduces students to the collection and presentation of data. It also discusses
how data can be summarised and analysed for drawing statistical inferences. Students will be
introduced to important available data sources and will also be trained in the use of free
statistical software to analyse data.
Course Outline:
Readings:
1. P.H. Karmel and M. Polasek (1978), Applied Statistics for Economists, 4th edition,
Pitman.
2. M.R. Spiegel, L.J. Stephens & N. Kumar, Statistics (Schaum's Outline Series), 4th edition.
Lesson 1
Data
INTRODUCTION
The first step in a statistical investigation is planning the proposed investigation. After planning,
the next step is the collection of data, keeping in view the object and scope of the investigation. There
are a number of methods of collecting data, and the mode of collection also depends upon the availability
of resources. The collected data are then edited, presented, analysed and interpreted. If the job of data
collection is not done sincerely and seriously, the results of the investigation are bound to be inaccurate
and misleading, the resources used in performing the other steps would be wasted, and the purpose of
the investigation would be defeated.
Types of Data
I. Primary Data
II. Secondary Data
PRIMARY DATA
Definition
Data is called primary if it is originally collected in the process of investigation. Primary data are original
in nature and are generally used in special-purpose investigations. The process of collecting primary
data is time consuming. For example, suppose we want to compare the average income of employees
of two companies. This can be done by collecting data on the incomes of employees of both companies.
The data collected would be edited, presented and analysed by taking the averages of both groups. On
the basis of these averages, we would be able to state which company's average income is higher. The
data used in this investigation are primary, because the data regarding the income of employees were
collected during the process of investigation.
Methods of Collecting Primary Data
(i) Direct personal investigation
(ii) Indirect oral investigation
(iii) Through local correspondents
(iv) Through questionnaires mailed to informants
(v) Through schedules filled by enumerators.
Now we shall discuss the process of collecting primary data by these methods. We shall also discuss the
suitability, merits and demerits regarding the above mentioned methods of collecting primary data.
(i) Direct Personal Investigation
In this method of collecting data, the investigator comes into direct contact with the informants. The
investigator himself visits the different informants covered in the scope of the investigation and collects
data as per its needs. Suppose an investigator wants to use this method to collect data regarding the
wages of the employees of a factory; he would then have to contact each and every employee of the
factory in order to collect the required data. In the context of this method of collecting
primary data, Professor C.A. Moser has remarked, “In the strict sense, observation implies the use of the
eyes rather than of the ears and the voice”. The suitability of this method depends upon the personality of
the investigator. The investigator is expected to be tactful, skilled, honest, well behaved and industrious.
It is suitable when the area to be covered is small. This is also suitable when the data is to be kept secret.
(iii) Through Local Correspondents
In this method of collecting data, the informants are not directly contacted by the investigator, but instead,
the data about the informants is collected and sent to the investigator by the local correspondents,
appointed by the investigator. Newspaper agencies collect data by using this method. They appoint their
correspondents area-wise, and the correspondents send the desired data to the offices of their respective
newspapers. The suitability of this method depends upon the personality of the correspondent.
He is expected to be unbiased, skilled and honest. To eliminate the bias of the correspondents, it is
advisable to appoint more than one correspondent in each area.
(iv) Through Questionnaires Mailed to Informants
In this method of collecting data, the informants are not directly contacted by the investigator; instead,
the investigator sends questionnaires by post to the informants with a request to fill them in and send
them back. The suitability of this method depends upon the quality of the questionnaire and the response
of the informants. This method is useful when the area to be covered is widely spread. It would not work
in case the informants are illiterate or semi-literate.
(v) Through Schedules Filled by Enumerators
In this method of collecting data, the informants are not directly contacted by the investigator, but instead,
the enumerators are deputed to contact the informants and to fill the schedules on the spot, after collecting
data as per the need of the schedule. The basic difference between this method and the previous method
is that, in this method the schedules are filled by the enumerators after getting information from the
informants, whereas in the previous method, the questionnaires were to be filled by the informants
themselves. The suitability of this method depends upon the enumerators. The enumerators are expected
to be skilled, honest, hard working, well-behaved and free from bias. This method of collecting data is
suitable in case the informants are illiterate or semi-literate. In our country, census data about all the
citizens is collected after every ten years by using this method.
Requisites of a Good ‘Questionnaire’ and ‘Schedule’
In the last two methods of collecting primary data, we discussed the method of questionnaires to be filled
by the informants and the method of filling schedules by the enumerators. In fact, there is no fundamental
difference between a questionnaire and a schedule. Both contain some questions; the only difference is
that the former is filled in by the informants themselves, whereas in the case of the latter, the data
concerning the informants are filled in by the enumerators. The success of collecting data by either means
depends upon its quality. Preparation of a questionnaire or a schedule is an art. We shall now discuss in
detail the requisites of a good questionnaire or schedule.
(i) Forwarding letter: The investigator must include a forwarding letter when sending questionnaires to
the informants, requesting them to fill in the questionnaire and return it. The object of the investigation
should also be mentioned in the letter. The informants should also be assured that the filled questionnaires
will be kept confidential, if so desired. To encourage response, special concessions and free gifts may be
offered to the informants.
(ii) Questions should be minimum in number: The number of questions in a questionnaire or a schedule
should be as small as possible. Unnecessary questions should never be included. Inclusion of more than
20 or 25 questions would be undesirable.
(iii) Questions should be easy to understand: The questions included in a questionnaire or a schedule
should be easy to understand and should not be confusing in nature. The language used should be simple,
and highly technical terms should be avoided.
(iv) Questions should be logically arranged: The questions in a questionnaire or a schedule should also
be logically arranged, so that the informants react to them naturally and spontaneously. It is not fair to
ask an informant whether he is employed or unemployed after asking his monthly income; such a
sequence of questions creates a bad impression on the mind of the informant.
(v) Only well-defined terms should be used in questions: In drafting questions for a questionnaire or a
schedule, only well defined terms should be used. For example, the term ‘income’ should be clearly
defined, in the sense of whether it is to include allowances etc. along with the basic income or not.
Similarly, in the case of businessmen, it should be clear whether the informants are to report gross profits
or net profits.
(vi) Prohibited questions should not be included: No such question should be included in the
questionnaire or schedule which may immediately agitate the mind of the informants. Questions like
“Have you given up the habit of telling a lie?” or “How many times in a month do you quarrel with your
wife?” would immediately mar the spirit of the informants.
(vii) Irrelevant questions should be avoided: A questionnaire or schedule should include only those
questions which bear a direct link with the object of the investigation. If the object is to study the problem
of unemployment, then it would be useless to collect data regarding the heights and weights of the
informants.
(viii) Pilot survey: Before the questionnaire is sent to all the informants for collecting data, it should be
checked beforehand for its workability. This is done by sending the questionnaire to a selected sample
and studying the replies received thoroughly. If the investigator finds that most of the informants in the
sample have left some questions unanswered, then those questions should be modified or deleted
altogether, provided the object of the investigation permits. This is called a pilot survey. A pilot survey
must be carried out before the questionnaire is finally accepted.
SECONDARY DATA
Definition
Data is called secondary if it is not originally collected in the process of investigation; instead, data
collected by some other agency is used for the purpose. If the investigation is not of a very special
nature, secondary data may be used, provided it can serve the purpose. Suppose we want to investigate
the extent of poverty in our country; this investigation can be carried out by using the national census
data, which is obtained regularly every 10 years. The use of secondary data economises on money spent
and greatly reduces the time taken by the investigation. If some secondary data can be made use of in an
investigation, then we should use it. Secondary data ought to
be used very carefully. In this context, Connor has remarked, “Statistics, especially other peoples’
statistics are full of pitfalls for the user.”
Methods of Collecting Secondary Data
(i) Collection from Published Data
(ii) Collection from Un-published Data.
(i) Collection from Published Data
There are agencies which collect statistical data regularly and publish it. The published data is very
important and is used frequently by investigators. The main sources of published data are as follows:
(a) International publications: International organisations and governments of foreign countries collect
and publish statistical data relating to various characteristics. The data is collected regularly as well as
on an ad hoc basis.
Some of the publications are:
(i) U.N.O. Statistical Year Book
(ii) Annual Reports of I.L.O.
(iii) Annual Reports of the Economic and Social Commission for Asia and
Pacific (ESCAP)
(iv) Demography Year Book
(v) Bulletins of World Bank.
(b) Government publications: In India, the Central Government and State Governments collect data
regarding various aspects. This data is published and is found very useful for investigation purposes.
Some of the publications are:
(i) Census Report of India
(ii) Five-Year Plans
(iii) Reserve Bank of India Bulletin
(iv) Annual Survey of Industries
(v) Statistical Abstracts of India.
(c) Reports of commissions and committees: The Central Government and State Governments appoint
Commissions and Committees to study certain issues. The reports of such investigations are very useful.
Some of these are:
(i) Reports of National Labour Commission
(ii) Reports of Finance Commission
(iii) Report of Hazari Committee etc.
(d) Publications of research institutes: There are a number of research institutes in India which regularly
collect and analyse data. Some of the agencies are:
(i) Central Statistical Organisation (C.S.O.)
(ii) Institute of Economic Growth
(iii) Indian Statistical Institute
(iv) National Council of Applied Economic Research etc.
(e) Newspapers and magazines: There are many newspapers and magazines which publish data relating
to various aspects. Some of these are:
(i) Economic Times
(ii) Financial Express
(iii) Commerce
(iv) Transport
(v) Capital etc.
(f) Reports of trade associations: The trade associations also collect data and publish it. Some of the
agencies are:
(i) Stock Exchanges
(ii) Trade Unions
(iii) Federation of Indian Chamber of Commerce and Industry.
(ii) Collection from Un-published Data
The Central Government, State Governments and research institutes also collect data which, for various
reasons, is not published. This type of data is called unpublished data, and it can also be made use of in
investigations. The data collected by research scholars of universities is also generally not published.
Precautions in the Use of Secondary Data
Secondary data must be used very carefully. The applicability of secondary data should be judged
keeping in view the object and scope of the investigation. Prof. Bowley has remarked, “Secondary data
should not be accepted at their face value.” The following are the bases on which the applicability of
secondary data is to be judged.
(i) Reliability of data: The reliability of data is assessed by the reliability of the agency which collected
it. The agency should not be biased in any way, and the enumerators who collected the data should have
been unbiased and well trained. The degree of accuracy achieved should also be judged.
(ii) Suitability of data: The suitability of the data should be assessed keeping in view the object and scope
of the investigation. If the data is not suitable for the investigation, then it should not be used just for the
sake of economy of time and money; the use of unsuitable data can lead only to misleading results.
(iii) Adequacy of data: The adequacy of data should also be judged keeping in view the object and scope
of the investigation. If the data is found to be inadequate, it should not be used. For example, if the object
of the investigation is to study the problem of unemployment in India, then data regarding unemployment
in one state, say U.P., would not serve the purpose.
Branches of Data Analysis
There are mainly two branches of statistics: descriptive statistics and inferential statistics. Descriptive
statistics refers to the summary of important aspects of a data set. This includes collecting data, organizing
the data, and then presenting the data in the form of charts and tables. In addition, we often calculate
numerical measures that summarize, for instance, the data’s typical value and the data’s variability. Today,
the techniques encountered in descriptive statistics account for the most visible application of statistics—
the abundance of quantitative information that is collected and published in our society every day. The
unemployment rate, the president’s approval rating, the Dow Jones Industrial Average, batting averages,
the crime rate, and the divorce rate are but a few of the many “statistics” that can be found in a reputable
newspaper on a frequent, if not daily, basis. Yet, despite the familiarity of descriptive statistics, these
methods represent only a minor portion of the body of statistical applications.
The phenomenal growth in statistics is mainly in the field called inferential statistics. Generally,
inferential statistics refers to drawing conclusions about a large set of data— called a population—based
on a smaller set of sample data. A population is defined as all members of a specified group (not
necessarily people), whereas a sample is a subset of that particular population. The individual values
contained in a population or a sample are often referred to as observations. In most statistical applications,
we must rely on sample data in order to make inferences about various characteristics of the population.
A population consists of all items of interest in a statistical problem, and a sample is a subset of the
population. In other words, a population is the entire group that you want to draw conclusions about,
and a sample is the specific group that you will collect data from. The size of the sample is always less
than the total size of the population. Normally, sample data are analysed and a sample statistic is
calculated to make inferences about the unknown population parameter.
For example, the undergraduate students in India constitute a population, whereas 300 undergraduate
students from DU constitute a sample drawn from that population.
Populations are used when your research question requires, or when you have access to, data from every
member of the population. Usually, it is only straightforward to collect data from a whole population when
it is small, accessible and cooperative. For example, a high school administrator wants to analyse the
final exam scores of all graduating seniors to see if there is a trend. Since they are only interested in
applying their findings to the graduating seniors in this high school, they use the whole population
dataset.
When your population is large in size, geographically dispersed, or difficult to contact, it’s necessary to
use a sample. With statistical analysis, you can use sample data to make estimates or test hypotheses about
population data.
For example, suppose we want to study political attitudes in young people, and the population is the
900,000 undergraduate students in India. Because it is not practical to collect data from all of them, you
use a sample of 900 undergraduate volunteers from five universities; this is the group who will complete
your online survey.
Ideally, a sample should be randomly selected and representative of the population. Using probability
sampling methods (such as simple random sampling or stratified sampling) reduces the risk of sampling
bias and enhances both internal and external validity.
Need for Sampling
• Necessity: Sometimes it’s simply not possible to study the whole population due to its size or
inaccessibility.
• Practicality: It’s easier and more efficient to collect data from a sample.
• Cost-effectiveness: There are fewer participant, laboratory, equipment, and researcher costs
involved.
• Manageability: Storing and running statistical analyses on smaller datasets is easier and more reliable.
When you collect data from a population or a sample, there are various measurements and numbers you
can calculate from the data. A parameter is a measure that describes the whole population. A statistic is a
measure that describes the sample.
You can use estimation or hypothesis testing to estimate how likely it is that a sample statistic differs from
the population parameter.
Example: In your study of students’ political attitudes, you ask your survey participants to rate themselves
on a scale from 1, very liberal, to 7, very conservative. You find that most of your sample identifies as
liberal – the mean rating on the political attitudes scale is 3.2.
You can use this statistic, the sample mean of 3.2, to make a scientific guess about the population
parameter – that is, to infer the mean political attitude rating of all undergraduate students in India.
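The inference above can be sketched in code. This is a minimal simulation with a made-up population of 900,000 ratings (the values are illustrative, not real survey data); the point is only that the sample mean (a statistic) approximates the population mean (a parameter):

```python
import random
import statistics

# Hypothetical population: ratings on a scale from 1 (very liberal)
# to 7 (very conservative). These values are invented for illustration.
random.seed(42)
population = [random.randint(1, 7) for _ in range(900_000)]

# Draw a sample of 900 and use its mean as an estimator of the
# population mean (the parameter, which is usually unknown in practice).
sample = random.sample(population, 900)
sample_mean = statistics.mean(sample)          # the statistic
population_mean = statistics.mean(population)  # the parameter

print(f"sample mean:     {sample_mean:.2f}")
print(f"population mean: {population_mean:.2f}")
```

With a sample of 900 from 900,000, the two means typically differ only in the second decimal place, which is the sense in which the statistic "estimates" the parameter.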
What is Sampling?
Sampling is the process of selecting observations (a sample) to provide an adequate description and robust
inferences of the population. The sample is representative of the population.
Sampling methods fall into two broad categories: probability sampling and non-probability sampling.
Sample element: a case or a single unit that is selected from a population and measured in some way—
the basis of analysis (e.g., a person, thing, specific time, etc.).
Universe: the theoretical aggregation of all possible elements—unspecified to time and space (e.g.,
University of Delhi).
Population: the theoretical aggregation of specified elements as defined for a given survey, defined by
time and space (e.g., DU students in 2009).
Sample or Target population: the aggregation of the population from which the sample is actually
drawn (e.g., DU in 2009-10 academic year).
Sample frame: a specific list that closely approximates all elements in the population—from this the
researcher selects units to create the study sample (database of DU students in 2009-10).
Sample: a set of cases that is drawn from a larger pool and used to make generalizations about the
population
Estimator
When a statistic is used to estimate a parameter, it is referred to as an estimator.
Estimate
A particular value of the estimator is called an estimate.
Non-probability Sample
Any sampling process which does not ensure some nonzero probability for each element in the population
to be included in the sample would belong to the category of non-probability sampling. In this case,
samples may be picked up based on the judgment or convenience of the enumerator. Usually, the complete
sample is not decided at the beginning of the study but it evolves as the study progresses.
Probability Sample:
In this design, the sample is chosen by chance, with prior information available on what samples can be
chosen and what the probability of each sample getting chosen is. Every item of the population has a
known chance of being selected for the sample. Specifically, random sampling has the following
mathematical properties:
a) Distinct samples from the population can be defined, which means that we can clearly state the items
that belong to a particular sample of the population.
b) Each sample has a known probability of selection.
c) Each sample is selected by a random process. It may have equal or unequal probability of getting
selected.
d) The method for computing the estimate from the sample must be stated and lead to unique estimates
for a specific sample. So for example, it can be stated that the estimate is the average of the measurements
on the individual items of the sample.
This sampling procedure is amenable to the calculation of frequency distribution of the estimates (for each
sample, when repeated sampling is done). We know the number of times a particular sample will be
selected and thereafter the estimate from the sample. Thus, a well-defined sampling theory can be
developed for such procedures. Also, for this procedure, it was realized that by the use of sampling theory
and normal distribution, the amount of error to be expected in the estimates made from the sample can be
approximately predicted. There are various ways to obtain samples that represent the population. Some of
the methods of sampling are the following:
Simple Random Sampling
It is a method of obtaining a sample of size n from a population of N units such that each of the
NCn = N!/(n!(N − n)!) possible samples has an equal probability of being chosen. To be precise, the
random variables X1, X2, …, Xn are said to form a simple random sample of size n if the following two
conditions are met:
a) The Xi’s are independent random variables.
b) Every Xi has the same probability distribution.
The Xi's are then termed independent and identically distributed (iid). Such random sampling is possible
if sampling is with replacement or is from an infinite population (in which case the Xi's have equal
probability and become independent). There are various ways by which we can obtain random samples.
One is the lottery method, in which individual units of the population are allotted a number, which are
then put onto slips of paper. These slips are then shuffled and a random draw of required numbers (which
constitute the sample size) is done. This constitutes a random sample. The other method is that of using
random sampling numbers.
The most important virtue of this sampling method is that it is easier to derive the probability distribution
of the sample statistic than for any other sampling method. In simple random sampling, each unit of the
population has an equal probability p = n/N of being included in the sample, and all the units of the
sample are selected using a random mechanism. For example, suppose a national fast food chain wants
to randomly select 5 out of 50 states to sample the tastes of its consumers. A simple random sample
ensures that each of the 50C5 = 2,118,760 possible samples of size 5 has the same likelihood of being
used in the study.
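A minimal sketch of the fast-food example using Python's standard library; the state names are placeholders, and `random.sample` draws without replacement so that every 5-state subset is equally likely:

```python
import math
import random

# Hypothetical frame of 50 states (placeholder names).
states = [f"State_{i:02d}" for i in range(1, 51)]

# Number of possible samples of size 5 drawn from 50 units: 50C5.
n_samples = math.comb(50, 5)
print(n_samples)  # 2118760

# Draw one simple random sample of size 5 without replacement.
random.seed(1)
chosen = random.sample(states, 5)
print(chosen)
```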
Systematic Random Sampling (SRS)
It is useful when the population units are ordered or listed in a random fashion, for example where houses
are randomly arranged in rows. Suppose a few houses are to be selected at random from a given city.
Here systematic sampling can be used, because the houses are usually arranged in rows and are thus
numbered. The first house is selected at random, and then every 10th or 15th house is selected
systematically. This is called systematic random sampling. In systematic sampling, the sampling units
are selected at equal distances from each other in the frame, with random selection of only the first unit.
Another population in which systematic sampling is used is when selecting items from a production or
assembly line for quality testing. A manufacturer may select the first item on the production line randomly,
after which every 20th item is selected for the sample.
Steps in SRS:
1. Decide on the sample size n.
2. Divide the frame of N individuals into groups of k individuals, where k = N/n.
3. Randomly select one individual from the 1st group.
4. Select every kth individual thereafter.
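The steps above can be sketched as follows. This is a minimal illustration that assumes N is an exact multiple of n (the frame of numbered houses is hypothetical):

```python
import random

def systematic_sample(frame, n):
    """Select n units: a random start within the first group of k,
    then every k-th unit thereafter."""
    N = len(frame)
    k = N // n                   # sampling interval k = N / n
    start = random.randrange(k)  # random unit from the 1st group
    return [frame[start + i * k] for i in range(n)]

random.seed(7)
houses = list(range(1, 101))              # frame of N = 100 numbered houses
sample = systematic_sample(houses, n=10)  # k = 10: every 10th house
print(sample)
```

Note that only the starting unit is random; once it is fixed, the rest of the sample is determined by the interval k.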
Both simple random sampling and systematic sampling techniques are usually recommended when all the
population units are relatively homogeneous.
Stratified Random Sampling:
This method is adopted when the population from which a sample has to be drawn is heterogeneous. We
divide the heterogeneous population into homogeneous groups called strata and then draw a random
sample from each stratum. The strata must be non-overlapping, and adding up the units of all the strata
must give us the total number of units in the population.
Steps:
1. Divide the population into two or more subgroups (called strata) according to some common characteristic.
2. Select a simple random sample from each subgroup, with sample sizes proportional to strata sizes.
3. Combine the samples from the subgroups into one.
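These steps can be sketched with proportional allocation. The income strata below are hypothetical and sized so the allocation is exact; `round` handles any non-integer allocations in the general case:

```python
import random

def stratified_sample(strata, total_n):
    """Draw a simple random sample from each stratum, with sample
    sizes proportional to strata sizes (proportional allocation)."""
    N = sum(len(units) for units in strata.values())
    sample = []
    for name, units in strata.items():
        n_h = round(total_n * len(units) / N)  # stratum sample size
        sample.extend(random.sample(units, n_h))
    return sample

random.seed(3)
# Hypothetical income strata of a population of 1,000 people.
strata = {
    "high":         [f"H{i}" for i in range(100)],
    "upper_middle": [f"U{i}" for i in range(300)],
    "lower_middle": [f"L{i}" for i in range(400)],
    "low":          [f"B{i}" for i in range(200)],
}
sample = stratified_sample(strata, total_n=50)
print(len(sample))  # 50: allocated as 5 + 15 + 20 + 10
```

Because each stratum is internally homogeneous, combining the per-stratum samples this way typically yields a more precise estimate than a simple random sample of the same total size.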
Stratified random sampling can be used for various reasons, some of which are the following:
i) Administrative convenience may facilitate the use of stratification; for example, the agency
conducting the survey may have field offices in different geographical strata, each of which
can supervise the survey for a part of the population.
ii) Sampling problems may differ significantly for different parts of the population. For example,
people living in institutions (e.g., hotels, hospitals, prisons) are often placed in a different
stratum from people living in ordinary homes, because a different approach to sampling is
appropriate for the two situations. A population of people can also be divided into different
income strata, for example a high income group, an upper middle income group, a lower
middle income group and a low income group, which can greatly enhance the analytical
capacity of the sample statistic.
iii) Stratification may enhance the precision of the estimates of the characteristics of the whole
population (as compared to a simple random sample) by dividing a heterogeneous population
into subpopulations, each of which is internally homogeneous. The estimates for each stratum
can be combined into a precise estimate for the whole population.
Sample Size
The size of the sample depends on various considerations, including population variability,
statistical issues, economic factors, availability of participants, and the importance of the
problem. A few are:
(i) One of the most important factors affecting the sample size is the extent of variability in the
population. Taking an extreme case, if there is no variability, i.e. if all the members of the
population are exactly identical, a sample of size 1 is as good as a sample of 100 or any other
number. Therefore, the larger the variability, the larger the sample size required.
(ii) A second consideration is the confidence in the inference made: the larger the sample size,
the higher the confidence. In many situations, the confidence level is used as the basis to decide
sample size, as we shall see in the next unit.
(iii) There is generally a trade-off between the accuracy with which the sample represents
population values and the costs associated with sample size. The larger the sample, the more
confident we can be that it accurately reflects what exists in the population, but large samples
can be extremely expensive and time consuming. A small sample is less expensive and time
consuming, but it is not as accurate. Therefore, in situations requiring minimal error and
maximum accuracy of prediction of population values, large samples will be required; in cases
where more error can be tolerated, small samples will do. It is not unusual to use relatively
small samples to generalize to millions of individuals.
(iv) Other factors that help determine an adequate sample size are the diversity of the population
concerning the factors of interest and the number of factors. The greater the diversity among
individuals and the greater the number of factors present, the larger the sample required to
achieve representativeness.
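The confidence-level calculation itself is deferred to the next unit; as a preview only, a standard textbook formula for the sample size needed to estimate a mean with margin of error E at a given confidence level is n = (zσ/E)², where z is the normal critical value (1.96 for 95% confidence) and σ is the population standard deviation. A sketch:

```python
import math

def required_sample_size(sigma, margin_of_error, z=1.96):
    """n = (z * sigma / E)^2, rounded up.
    z = 1.96 corresponds to 95% confidence."""
    return math.ceil((z * sigma / margin_of_error) ** 2)

# More variability (larger sigma) or a tighter margin -> larger sample,
# matching considerations (i) and (ii) above.
print(required_sample_size(sigma=2.0, margin_of_error=0.5))   # 62
print(required_sample_size(sigma=4.0, margin_of_error=0.5))   # 246
print(required_sample_size(sigma=2.0, margin_of_error=0.25))  # 246
```

Doubling the variability, or halving the tolerated error, quadruples the required sample size.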
Sampling error
A sampling error is the difference between a population parameter and a sample statistic. In your
study, the sampling error is the difference between the mean political attitude rating of your sample
and the true mean political attitude rating of all undergraduate students in India.
Sampling errors happen even when you use a randomly selected sample. This is because random
samples are not identical to the population in terms of numerical measures like means and standard
deviations.
Because the aim of scientific research is to generalize findings from the sample to the population, you
want the sampling error to be low. You can reduce sampling error by increasing the sample size.
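The claim that increasing the sample size reduces sampling error can be checked with a small simulation. The population of ratings below is made up; the comparison of average absolute errors is the point:

```python
import random
import statistics

random.seed(0)
# Hypothetical population of attitude ratings on a 1-7 scale.
population = [random.randint(1, 7) for _ in range(100_000)]
mu = statistics.mean(population)  # the population parameter

def avg_abs_error(n, trials=200):
    """Average |sample mean - population mean| over repeated samples."""
    errs = [abs(statistics.mean(random.sample(population, n)) - mu)
            for _ in range(trials)]
    return statistics.mean(errs)

e30 = avg_abs_error(30)
e900 = avg_abs_error(900)
print(f"n=30:  {e30:.3f}")
print(f"n=900: {e900:.3f}")
```

The average error for samples of 900 comes out several times smaller than for samples of 30, even though neither sample exactly matches the population.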
*********
Lesson 2
Summarisation of the data is a necessary function of any statistical analysis. As a first step in this
direction, the huge mass of unwieldy data are summarised in the form of tables and frequency
distributions. In order to bring the characteristics of the data into sharp focus, these tables and
frequency distributions need to be summarised further. A measure of central tendency or an average
is very essential and an important summary measure in any statistical analysis. It is a single value
which can be taken as representative of the whole distribution.
Functions of an Average
1. To present a huge mass of data in summarised form: It is very difficult for the human mind to grasp
a large body of numerical figures. A measure of average is used to summarise such data into a single
figure, which makes it easier to understand and remember.
2. To facilitate comparison: Different sets of data can be compared by comparing their averages. For
example, the level of wages of workers in two factories can be compared by mean (or average) wages
of workers in each of them.
3. To help in decision-making: Most of the decisions to be taken in research, planning, etc., are based
on the average value of certain variables.
Characteristics of a Good Average
1. It should be rigidly defined, preferably by an algebraic formula, so that different persons obtain the
same value for a given set of data.
2. It should be easy to calculate.
3. It should be easy to understand.
4. It should be based on all the observations.
5. It should be capable of further mathematical treatment.
6. It should not be unduly affected by extreme observations.
7. It should not be much affected by the fluctuations of sampling.
Various measures of average can be classified into the following two broad categories:
1. Mathematical Averages:
(a) Arithmetic Mean
(b) Geometric Mean
(c) Harmonic Mean
2. Positional Averages:
(a) Median
(b) Mode
The above measures of central tendency will be discussed in order of their popularity. The Arithmetic
Mean, Median and Mode, being the most popular, are discussed first, in that order.
Arithmetic Mean
Before the discussion of arithmetic mean, we shall introduce certain notations. It will be assumed that
there are n observations whose values are denoted by X1, X2, ..... Xn, respectively. The sum of these
observations X1 + X2 + ..... + Xn will be denoted in abbreviated form as,
∑ Xi, where ∑ (called sigma) denotes the summation sign. The subscript of X, i.e., ‘i’, is a positive
integer indicating the serial number of the observation. Since there are n observations, i varies from 1
to n. When there is no ambiguity in the range of summation, this indication can be skipped and we may
simply
write X1 + X2 + ..... + Xn = ∑ Xi
Arithmetic Mean is defined as the sum of observations divided by the number of observations.
In case of simple arithmetic mean, equal importance is given to all the observations, while in weighted
arithmetic mean the importance given to various observations is not the same.
Let there be n observations X1, X2 ..... Xn. Their arithmetic mean can be calculated either by direct
method or by short cut method. The arithmetic mean of these observations will be denoted by 𝑋̅ .
(a) Direct Method: Under this method, 𝑋̅ is obtained by dividing sum of observations by number of
observations, i.e.,
𝑋̅ = ΣX/N
where ΣX is the sum of all the numbers in the sample, i.e., X1 + X2 + ..... + Xn, and N is the number of
observations in the sample.
As an example, the mean of the numbers 1, 2, 3, 6, 8 is 20/5 = 4 regardless of whether the numbers
constitute the entire population or just a sample from the population.
Example 1
Calculate the mean for pH levels of soil 6.8, 6.6, 5.2, 5.6, 5.8
Mean = 𝑋̅ = (6.8 + 6.6 + 5.2 + 5.6 + 5.8)/5 = 30/5 = 6
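As an illustrative sketch (the function name is ours, not part of the course material), the direct-method mean of Example 1 can be computed in Python:

```python
# Arithmetic mean by the direct method: sum of observations / number of observations
def arithmetic_mean(values):
    return sum(values) / len(values)

ph_levels = [6.8, 6.6, 5.2, 5.6, 5.8]        # soil pH readings from Example 1
print(round(arithmetic_mean(ph_levels), 2))  # 6.0
```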
When Data are in the form of an Ungrouped Frequency Distribution
The mean for ungrouped data is obtained from the following formula:
𝑋̅ = ∑fX / N
where X = the value of the variable, f = the frequency of the individual class, and N = the sum of the
frequencies (total frequency) of the sample.
Short-cut method
𝑋̅ = A + (∑fd / N) × c
where d = (X − A)/c, A = assumed mean, N = total frequency, and c = the common factor (for individual
values, c = 1).
Example 2
Given the following frequency distribution, calculate the arithmetic mean.
Marks: 64 63 62 61 60 59
Number of Students: 8 18 12 9 7 6
Solution:
X f fX d = X − A fd
64 8 512 2 16
63 18 1134 1 18
62 12 744 0 0
61 9 549 −1 −9
60 7 420 −2 −14
59 6 354 −3 −18
Total 60 3713 −7
Direct Method
𝑋̅ = ∑fX / N = 3713/60 = 61.88
Short-cut method
𝑋̅ = A + (∑fd / N) × c
Let A = 62 and c = 1.
𝑋̅ = 62 + (−7/60) × 1 = 61.88
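The two methods of Example 2 can be checked with a short Python sketch (the variable names are ours):

```python
# Mean of an ungrouped frequency distribution, by the direct method
# and by the short-cut (assumed mean) method of Example 2.
marks = [64, 63, 62, 61, 60, 59]
freqs = [8, 18, 12, 9, 7, 6]
N = sum(freqs)                                   # 60

direct = sum(f * x for x, f in zip(marks, freqs)) / N

A = 62                                           # assumed mean
shortcut = A + sum(f * (x - A) for x, f in zip(marks, freqs)) / N

print(round(direct, 2), round(shortcut, 2))      # 61.88 61.88
```

Both routes give the same value because the short-cut method only shifts the origin of measurement.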
Grouped Data
When the items in a list are written in the form of a range, for example, 10-20, 20-30; we need to first
calculate the class mark or midpoint.
Class Mark = (Upper Limit + Lower Limit) / 2
Then, the mean can be calculated using the formula given below,
𝑋̅ = ∑fX / N
where X = the mid-point or class mark of the individual class, f = the frequency of the individual class,
and N = the sum of the frequencies (total frequency) of the sample.
Short-cut method
𝑋̅ = A + (∑fd / N) × c
where d = (X − A)/c, A = any value of X (assumed mean), N = total frequency, and c = width of the class
interval.
Example 3: For the frequency distribution of seed yield per plot given in the table, calculate the mean
yield per plot.
Direct method:
𝑋̅ = ∑fX / n = (74.5 × 3 + 94.5 × 5 + 114.5 × 7 + 134.5 × 20)/35 = 4187.5/35 = 119.64 gms
Shortcut method:
𝑋̅ = A + (∑fd / n) × c = 94.5 + (44/35) × 20 = 119.64 gms
Merits
1. It is rigidly defined.
2. It is easy to calculate and simple to understand.
3. If the number of items is sufficiently large, it is more accurate and more reliable.
4. It is amenable to further algebraic treatment.
5. It is possible to calculate even if some of the details of the data are lacking.
Demerits
1. It is unduly affected by extreme observations.
2. It cannot be used in the study of qualitative phenomena not capable of numerical measurement, i.e.,
intelligence, beauty, honesty, etc.
3. It can ignore any single item only at the risk of losing its accuracy.
4. It cannot be located graphically.
5. It cannot be calculated for a distribution with open-end classes.
6. It may lead to fallacious conclusions if the details of the data from which it is computed are not given.
WEIGHTED MEAN
Weight here refers to the importance of a value in a distribution. A simple logic is that a number is as
important in the distribution as the number of times it appears. So, the frequency of a number can also be
its weight. But there may be other situations where we have to determine the weight on some other basis.
For example, the number of innings in which runs were scored may be taken as the weight, since it shows
the importance of the runs (50 or 100 or 200). Again, in calculating the weighted mean of the scores of
several innings of a player, we may take the strength of the opponent (as judged by the proportion of
matches lost by a team against that opponent) as the corresponding weight: the higher the proportion,
the stronger the opponent and hence the greater the weight. If Xi has a weight Wi, then the weighted
mean is defined as:
𝑋̅w = ∑XiWi / ∑Wi, for all i = 1, 2, 3,…, k.
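A minimal sketch of the weighted-mean formula (the scores and weights below are hypothetical, chosen only to illustrate the calculation):

```python
# Weighted mean: sum(Xi * Wi) / sum(Wi)
def weighted_mean(values, weights):
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

# hypothetical innings scores, weighted by assumed opponent strengths
scores = [50, 100, 40]
weights = [0.8, 0.5, 0.9]
print(round(weighted_mean(scores, weights), 2))   # 57.27
```

Note that when all weights are equal, the weighted mean reduces to the simple arithmetic mean.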
MEDIAN
Median is that value of the variable which divides the whole distribution into two equal parts. Here, it may
be noted that the data should be arranged in ascending or descending order of magnitude. When the
number of observations is odd then the median is the middle value of the data. For even number of
observations, there will be two middle values. So we take the arithmetic mean of these two middle values.
The numbers of observations below and above the median are the same. The median is not affected by
extremely large or extremely small values (as it corresponds to the middle value), and it is also not
affected by open-end class intervals. In such situations, it is preferable to the mean.
1. Median for Ungrouped Data
Mathematically, if x1, x2,…, xn are the n observations then for obtaining the median first of all we have
to arrange these n values either in ascending order or in descending order. When the observations are
arranged in ascending or descending order, the middle value gives the median if n is odd. For even number
of observations there will be two middle values. So we take the arithmetic mean of these two values.
Md = ((N + 1)/2)th observation, when N is odd
Md = [(N/2)th observation + ((N + 2)/2)th observation]/2, when N is even
For example, if the two middle observations are 7 and 8, then
Md = (7 + 8)/2 = 7.5
For Ungrouped Data (when frequencies are given)
If Xi are the different values of the variable with frequencies f, then we calculate the cumulative
frequencies from f; the median is the value of the variable corresponding to the (∑f/2)th = (N/2)th
cumulative frequency.
Cumulative frequency (cf)
The cumulative frequency of each class is the sum of the frequency of that class and the frequencies of
the previous classes, i.e., the frequencies are added successively, so that the last cumulative frequency
gives the total number of items.
Note: If N/2 is not an exact cumulative frequency, then the value of the variable corresponding to the
next cumulative frequency is the median.
Example 7: Find Median from the given frequency distribution
X 20 40 60 80
F 7 5 4 3
Solution: First find cumulative frequency
X f c.f.
20 7 7
40 5 5+7=12
60 4 12+4=16
80 3 16+3=19
∑f = 19
Md = Value of the variable corresponding to the (19/2)th = 9.5th cumulative frequency.
Since 9.5 is not an exact cumulative frequency, we take the next cumulative frequency, 12; the value of
the variable against it is 40.
So the median is 40.
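The median rules above can be sketched in Python, both for raw data and for an ungrouped frequency distribution such as Example 7 (function names are ours):

```python
# Median of raw data: middle value (odd n) or mean of the two middle values (even n)
def median_raw(values):
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]
    return (s[n // 2 - 1] + s[n // 2]) / 2

# Median of an ungrouped frequency distribution: the value whose cumulative
# frequency first reaches N/2 (i.e. the next c.f. when N/2 is not an exact c.f.)
def median_freq(values, freqs):
    n = sum(freqs)
    cum = 0
    for x, f in zip(values, freqs):
        cum += f
        if cum >= n / 2:
            return x

print(median_raw([60, 62, 70, 69, 63, 65, 60, 68, 63, 64]))  # 63.5
print(median_freq([20, 40, 60, 80], [7, 5, 4, 3]))           # 40
```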
Continuous Series
The steps given below are followed for the calculation of median in continuous series.
Step1: Find cumulative frequencies.
Step 2: Find N/2
Step 3: In the cumulative frequency column, locate the first value greater than or equal to N/2; the
corresponding class interval is the median class.
Step 4: Apply the formula: Median = l + ((N/2 − cf)/f) × i
Where l = lower limit of median class
N= no. of observations
cf denotes cumulative frequency of the class preceding the median class
f = frequency of median class
i = class size (assuming classes are of equal size)
Example: For the frequency distribution of weights of sorghum ear-heads given in the table below,
calculate the median.
Weights of ear heads (in g): 60-80 80-100 100-120 120-140 140-160
No. of ear heads (f): 22 38 45 35 24
Solution:
The cumulative frequencies are 22, 60, 105, 140 and 164, so N = 164 and N/2 = 164/2 = 82.
82 lies between the cumulative frequencies 60 and 105, so the median class is 100-120 and its lower
limit is 100.
Here, l = 100, cf = 60, N = 164, f = 45 and i = 20.
Median = 100 + ((82 − 60)/45) × 20 = 109.78 gms
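The four steps for a continuous series can be sketched as follows, using the sorghum ear-head data above (a minimal illustration; the function name is ours):

```python
# Grouped median: Md = l + ((N/2 - cf) / f) * i
def grouped_median(lower_limits, freqs, width):
    n = sum(freqs)
    cum = 0
    for l, f in zip(lower_limits, freqs):
        if cum + f >= n / 2:               # first class whose c.f. reaches N/2
            return l + (n / 2 - cum) / f * width
        cum += f

lowers = [60, 80, 100, 120, 140]
freqs = [22, 38, 45, 35, 24]
print(round(grouped_median(lowers, freqs, 20), 2))   # 109.78
```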
Merits of Median
1. Median is not influenced by extreme values because it is a positional average.
2. Median can be calculated in case of distribution with open-end intervals.
3. Median can be located even if the data are incomplete.
Demerits of Median
1. A slight change in the series may bring drastic change in median value.
2. In case of an even number of items or a continuous series, the median is an estimated value other
than any value in the series.
3. It is not suitable for further mathematical treatment except its use in calculating mean deviation.
4. It does not take into account all the observations.
Quartiles
The quartiles divide a distribution into four equal parts. For a grouped frequency distribution, the
first quartile is given by
Q1 = l + ((N/4 − C)/f) × h
Here, l is the lower limit of the first quartile class, h is its width, f is its frequency and C is the
cumulative frequency of classes preceding the first quartile class. By definition, the second quartile
is the median of the distribution. The third quartile (Q3) of a distribution can also be defined in a
similar manner.
For a discrete distribution, Q3 is that value of the variate such that at least 75% of the observations
are less than or equal to it and at least 25% of the observations are greater than or equal to it. For a
grouped frequency distribution, Q3 is that value of the variate such that area under the histogram to
the left of the ordinate at Q3 is 75% and the area to its right is 25%. The formula for computation
of Q3 can be written as
Q3 = l + ((3N/4 − C)/f) × h; where the symbols have their usual meaning.
Deciles
Deciles divide a distribution into 10 equal parts and there are, in all, 9 deciles denoted as D1, D2,......
D9 respectively. For a discrete distribution, the ith decile Di is that value of the variate such that
at least (10i)% of the observations are less than or equal to it and at least (100 − 10i)% of the
observations are greater than or equal to it (i = 1, 2, ...... 9).
For a continuous or grouped frequency distribution, Di is that value of the variate such that the area
under the histogram to the left of the ordinate at Di is (10i)% and the area to its right is (100 - 10i)%.
The formula for the i th decile can be written as
Dj = l + ((jN/10 − C)/f) × h, where j = 1, 2, ....., 9
Percentiles
Percentiles divide a distribution into 100 equal parts and there are, in all, 99 percentiles denoted as
P1, P2, ...... P25, ...... P40, ...... P60, ...... P99 respectively. For a discrete distribution, the kth
percentile Pk is that value of the variate such that at least k% of the observations are less than or
equal to it and at least (100 – k)% of the observations are greater than or equal to it. For a grouped
frequency distribution, Pk is that value of the variate such that the area under the histogram to the
left of the ordinate at Pk is k% and the area to its right is (100 – k)% . The formula for the kth
percentile can be written as
Pk = l + ((kN/100 − C)/f) × h, where k = 1, 2, ....., 99
Example: Locate Median, Q1, Q3, D4, D7, P15, P60 and P90 from the following data:
Daily profit (in Rs.): 75 76 77 78 79 80 81 82 83 84 85
No. of Shops: 15 20 32 35 33 22 20 10 8 3 2
Solution:
First we calculate the cumulative frequencies, as in the following table:
Daily profit (in Rs.): 75 76 77 78 79 80 81 82 83 84 85
No. of Shops: 15 20 32 35 33 22 20 10 8 3 2
C.F.: 15 35 67 102 135 157 177 187 195 198 200
1. Determination of Median: Here N/2 = 100. From the cumulative frequency column, we note that
there are 102 (greater than 50% of the total) observations that are less than or equal to 78 and there
are 133 observations that are greater than or equal to 78. Therefore, Md = 78.
2. Determination of Q1 and Q3: First we determine N/4 which is equal to 50. From the cumulative
frequency column, we note that there are 67 (which is greater than 25% of the total) observations
that are less than or equal to 77 and there are 165 (which is greater than 75% of the total)
observations that are greater than or equal to 77. Therefore, Q1 = 77. Similarly, Q3 = 80.
3. Determination of D4 and D7: From the cumulative frequency column, we note that there are 102
(greater than 40% of the total) observations that are less than or equal to 78 and there are 133 (greater
than 60% of the total) observations that are greater than or equal to 78. Therefore, D4 = 78.
Similarly, D7 = 80.
4. Determination of P15, P60 and P90: From the cumulative frequency column, we note that there
are 35 (greater than 15% of the total) observations that are less than or equal to 76 and there are 185
(greater than 85% of the total) observations that are greater than or equal to 76. Therefore, P15 =
76. Similarly, P60 = 79 and P90 = 82.
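For a discrete distribution, all of these positional values follow one rule: take the first value whose cumulative frequency reaches the required fraction of N. A Python sketch of the example above (the function name is ours):

```python
# Positional values (median, quartiles, deciles, percentiles) of a discrete
# frequency distribution: the first value whose cumulative frequency reaches
# the required fraction of N.
def positional_value(values, freqs, fraction):
    n = sum(freqs)
    cum = 0
    for x, f in zip(values, freqs):
        cum += f
        if cum >= fraction * n:
            return x

profit = [75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85]
shops = [15, 20, 32, 35, 33, 22, 20, 10, 8, 3, 2]

print(positional_value(profit, shops, 0.50))  # median = 78
print(positional_value(profit, shops, 0.25))  # Q1 = 77
print(positional_value(profit, shops, 0.75))  # Q3 = 80
print(positional_value(profit, shops, 0.90))  # P90 = 82
```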
Mode
The mode refers to that value in a distribution, which occur most frequently. It is an actual value,
which has the highest concentration of items in and around it. It shows the centre of concentration
of the frequency in around a given value. Therefore, where the purpose is to know the point of the
highest concentration it is preferred. It is, thus, a positional measure.
It is of great importance in agriculture, for example in finding the typical height of a crop variety,
the major source of irrigation in a region, or the most disease-prone paddy variety. The mode is thus an
important measure in case of qualitative data.
Computation of the Mode: Ungrouped or Raw Data
For ungrouped data or a series of individual observations, the mode is often found by mere inspection.
Example
Find the mode for the following seed weights (in gms): 2, 7, 10, 15, 10, 17, 8, 10, 2
Here 10 occurs 3 times, more often than any other value. Hence, Mode = 10.
In some cases the mode may be absent while in some cases there may be more than one mode.
Example
(1) 12, 10, 15, 24, 30 (no mode, because each value occurs only once)
(2) 7, 10, 15, 12, 7, 14, 24, 10, 7, 20, 10
The modal values are 7 and 10, as both occur 3 times each.
Note: If 2 or more values appear with the same frequency, each is a mode. The downside to using
the mode as a measure of central tendency is that a set of data may have no mode or may have more
than 1 mode. However, the same set of data will have only 1 mean and only 1 median. The word
modal is often used when referring to the mode of a data set. If a data set has only 1 value that
occurs most often, the set is called unimodal. Likewise, a data set that has 2 values that occur with
the greatest frequency is referred to as bimodal. Finally, when a set has more than 2 values that
occur with the same greatest frequency, the set is called multimodal.
Example: The following table represents the number of times that 100 randomly selected students ate at
the school cafeteria during the first month of school:
Number of times: 2 3 4 5 6 7 8
Number of students: 3 8 22 29 20 8 10
What is the mode of the numbers of times that a student ate at the cafeteria?
Solution : When data is arranged in a frequency table, the mode is simply the value that has the highest
frequency. Therefore, since the table shows that 29 students ate 5 times in the cafeteria, 5 is the mode of
the data set.
Mode = 5 times
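Both examples above can be checked with a small Python sketch that returns all modal values, and an empty list when no mode exists (the function name is ours):

```python
# Mode(s) of raw data: the value(s) with the highest frequency.
from collections import Counter

def modes(values):
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []          # no mode: every value occurs only once
    return sorted(x for x, c in counts.items() if c == top)

print(modes([2, 7, 10, 15, 10, 17, 8, 10, 2]))            # [10]  (unimodal)
print(modes([7, 10, 15, 12, 7, 14, 24, 10, 7, 20, 10]))   # [7, 10]  (bimodal)
```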
For a grouped or continuous frequency distribution, the mode is given by
Mode = l + ((f1 − f0)/(2f1 − f0 − f2)) × i
where
l = lower limit of the modal class
f1 = frequency of the modal class
f0 = frequency of the class preceding the modal class
f2 = frequency of the class succeeding the modal class
i = width of the modal class
In summary, we can use the following steps to compute the mode of grouped or continuous frequency
distribution with equal class intervals:
Step 1: Prepare the frequency distribution table in such a way that its first column consists of the
observations and the second column the respective frequency.
Step 2: Determine the class of maximum frequency by inspection. This class is called the modal class.
Step 3: Calculate the mode using the formula: Mode = l + ((f1 − f0)/(2f1 − f0 − f2)) × i
Example 3) The heights, in cm, of 50 students are recorded:
Height (in cm): 125-130 130-135 135-140 140-145 145-150
Number of students: 7 14 10 10 9
Here, the maximum frequency is 14 and the corresponding class is 130-135. So, 130-135 is the modal
class, with
l = 130, i = 5, f1 = 14, f0 = 7 and f2 = 10.
Mode = l + ((f1 − f0)/(2f1 − f0 − f2)) × i
= 130 + ((14 − 7)/(2 × 14 − 7 − 10)) × 5 = 133.18
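The grouped-mode formula can be sketched in Python (the function name is ours; treating a missing neighbouring class as frequency 0 is our assumption):

```python
# Grouped mode: Mode = l + (f1 - f0) / (2*f1 - f0 - f2) * i
def grouped_mode(lower_limits, freqs, width):
    m = freqs.index(max(freqs))                    # modal class position
    f1 = freqs[m]
    f0 = freqs[m - 1] if m > 0 else 0              # class before (0 if none)
    f2 = freqs[m + 1] if m < len(freqs) - 1 else 0 # class after (0 if none)
    return lower_limits[m] + (f1 - f0) / (2 * f1 - f0 - f2) * width

lowers = [125, 130, 135, 140, 145]
freqs = [7, 14, 10, 10, 9]
print(round(grouped_mode(lowers, freqs, 5), 2))   # 133.18
```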
The frequency distribution of data shows how often the values in the data set occur. A frequency
distribution is said to be symmetrical when the values of mean, median and mode are equal; the values
are then spread equally on both sides of the mean.
[Figure 1]
Whereas in the negatively skewed frequency distribution, the median and the mode would be to the right
of the mean. That means that the mean is less than the median and the median is less than the mode. (
Mean < Median < Mode )
[Figure 2]
In a positively skewed frequency distribution, the median and mode would be to the left of the mean. That
means that the mean is greater than the median and the median is greater than the mode ( Mean > Median
> Mode )
[Figure 3]
Empirical studies show that in a moderately skewed frequency distribution, a very important relationship
exists between the mean, median and mode: the distance between the mean and the median is about
one-third of the distance between the mean and the mode. Symbolically, Mode = 3 Median − 2 Mean.
Geometric Mean
The geometric mean is a type of average, usually used for growth rates like population growth or interest
rates. While the arithmetic mean adds items, the geometric mean multiplies them. Also, the geometric
mean is defined only for positive numbers.
The geometric mean of a series containing n observations is the nth root of the product of the values.
If X1, X2, …, Xn are the observations, then
GM = ⁿ√(X1 × X2 × X3 × …… × Xn)
or (X1 × X2 × X3 × …… × Xn)^(1/n)
Taking logarithms,
log GM = (1/n) log(X1 × X2 × X3 × …… × Xn)
= (1/n) (log X1 + log X2 + log X3 + …… + log Xn)
= ∑log Xi / n
GM = antilog(∑log Xi / n)
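A minimal Python sketch of the logarithm route to the geometric mean (base-10 logs with antilog = 10^x, as in the text; the function name is ours):

```python
# GM = antilog( sum(log10 Xi) / n ); defined for positive values only
import math

def geometric_mean(values):
    return 10 ** (sum(math.log10(x) for x in values) / len(values))

print(round(geometric_mean([2, 8]), 4))     # 4.0
print(round(geometric_mean([4, 4, 4]), 4))  # 4.0
```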
GM in Individual Series
Example: For a series of 50 observations, the sum of the logarithms of the values is 99.21. Find the
geometric mean.
Solution: Here n = 50
GM = antilog(∑log Xi / n) = antilog(99.21/50) = antilog(1.9842) = 96.43
Continuous distribution
Example: For the frequency distribution of weights of sorghum ear-heads given in the table, calculate
the geometric mean.
Solution: Here N = 160 and ∑f log Xi = 324.2
GM = antilog(∑f log Xi / N)
= antilog(324.2/160)
= antilog(2.02625)
= 106.23
Harmonic mean (H.M)
Harmonic mean of a set of observations is defined as the reciprocal of the arithmetic average of the
reciprocals of the given values. If X1, X2, ….., Xn are n observations,
HM = n / ∑(1/Xi), where i = 1, 2, ….., n
Example: Calculate the harmonic mean of 5, 10, 17, 24 and 30.
X 1/X
5 0.2000
10 0.1000
17 0.0588
24 0.0417
30 0.0333
Total 0.4338
HM = 5/0.4338 = 11.526
Example: The numbers of tomatoes per plant are given below. Calculate the harmonic mean.
Solution:
No. of tomatoes per plant (x) No. of plants (f) 1/x f/x
20 4 0.0500 0.2000
21 2 0.0476 0.0952
22 7 0.0454 0.3178
23 1 0.0435 0.0435
24 3 0.0417 0.1251
25 1 0.0400 0.0400
Total 18 0.8216
HM = N / ∑(f/x) = 18/0.8216 = 21.91
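Both harmonic-mean calculations can be sketched with one Python function handling raw data and frequency data (the function name is ours; small differences from the hand-worked tables arise because the tables round the reciprocals to four places):

```python
# HM = n / sum(1/x) for raw data, or N / sum(f/x) for a frequency distribution
def harmonic_mean(values, freqs=None):
    if freqs is None:
        freqs = [1] * len(values)       # raw data: every value counted once
    n = sum(freqs)
    return n / sum(f / x for x, f in zip(values, freqs))

print(round(harmonic_mean([5, 10, 17, 24, 30]), 2))
print(round(harmonic_mean([20, 21, 22, 23, 24, 25], [4, 2, 7, 1, 3, 1]), 2))
```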
Merits of H.M
1. It is rigidly defined.
2. It is based on all the observations.
3. It is amenable to further algebraic treatment.
4. It is the most suitable average when it is desired to give greater weight to smaller observations and
less weight to the larger ones.
Demerits of H.M
1. It is not easily understood.
2. It is difficult to compute.
3. It is only a summary figure and may not be the actual item in the series
4. It gives greater importance to small items and is therefore, useful only when small items have to
be given greater weightage.
5. It is rarely used in grouped data.
Lesson 3
Measures of Dispersion
Introduction
A measure of central tendency summarizes the distribution of a variable into a single figure which can be
regarded as its representative. This measure alone, however, is not sufficient to describe a distribution
because there may be a situation where two or more different distributions have the same central value.
Conversely, it is possible that the pattern of distribution in two or more situations is same but the values
of their central tendency are different. Hence, it is necessary to define some additional summary measures
to adequately represent the characteristics of a distribution. One such measure is known as the measure of
dispersion or the measure of variation.
The concept of dispersion is related to the extent of scatter or variability in observations. The
variability in an observation is often measured as its deviation from a central value. A suitable
average of all such deviations is called the measure of dispersion. In the words of A.L. Bowley,
“Dispersion is the measure of variation of the items.”
Objectives of Measuring Dispersion
The main objectives of measuring dispersion of a distribution are:
1. To test reliability of an average: A measure of dispersion can be used to test the reliability of an average.
A low value of dispersion implies that there is greater degree of homogeneity among various items and,
consequently, their average can be taken as more reliable or representative of the distribution.
2. To compare the extent of variability in two or more distributions: The extent of variability in two or
more distributions can be compared by computing their respective dispersions. A distribution having
lower value of dispersion is said to be more uniform or consistent.
3. To facilitate the computations of other statistical measures: Measures of dispersions are used in
computations of various important statistical measures like correlation, regression, test statistics,
confidence intervals, control limits, etc.
4. To serve as the basis for control of variations: The main objective of computing a measure of
dispersion is to know whether the given observations are uniform or not. This knowledge may be utilised
in many ways. In the words of Spurr and Bonini, “In matters of health, variations in body temperature,
pulse beat and blood pressure are basic guides to diagnosis.
Prescribed treatment is designed to control their variations. In industrial production, efficient operation
requires control of quality variations, the causes of which are sought through inspection and quality control
programs”. The extent of inequalities of income and wealth in any society may help in the selection of an
appropriate policy to control their variations.
Characteristics of a Good Measure of Dispersion
Like the characteristics of a measure of central tendency, a good measure of dispersion should possess
the following characteristics:
1. It should be easy to calculate.
2. It should be easy to understand.
3. It should be rigidly defined.
4. It should be based on all the observations.
5. It should be capable of further mathematical treatment.
6. It should not be unduly affected by extreme observations.
7. It should not be much affected by the fluctuations of sampling.
Measures of Dispersion
Various measures of dispersion can be classified into two broad categories:
1. The measures which express the spread of observations in terms of distance between the values of
selected observations. These are also termed as distance measures, e.g., range, interquartile range,
interpercentile range, etc.
2. The measures which express the spread of observations in terms of the average of deviations of
observations from some central value. These are also termed as the averages of second order, e.g., mean
deviation, standard deviation, etc.
The following are some important measures of dispersion
(a) Range
(b) Inter-Quartile Range
(c) Mean Deviation
(d) Standard Deviation
Range
The range of a distribution is the difference between its two extreme observations, i.e., the difference
between the largest and smallest observations. Symbolically, R = L – S where R denotes range, L and S
denote largest and smallest observations, respectively. R is the absolute measure of range. A relative
measure of range, also termed as the coefficient of range, is defined as:
Coefficient of Range = (L − S)/(L + S)
Example: Find range and coefficient of range for each of the following data:
1. Weekly wages of 10 workers of a factory are:
310, 350, 420, 105, 115, 290, 245, 450, 300, 375.
Solution:
Range = 450 − 105 = 345
Coefficient of Range = (450 − 105)/(450 + 105) = 345/555 = 0.62
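The wage example above can be checked with a short Python sketch (the function name is ours):

```python
# Range R = L - S and its relative measure, the coefficient of range (L-S)/(L+S)
def range_and_coeff(values):
    L, S = max(values), min(values)
    return L - S, (L - S) / (L + S)

wages = [310, 350, 420, 105, 115, 290, 245, 450, 300, 375]
r, coeff = range_and_coeff(wages)
print(r, round(coeff, 2))   # 345 0.62
```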
Quartile Deviation or Semi-Interquartile Range
Half of the interquartile range is called the quartile deviation or semi-interquartile range.
Symbolically,
Q.D. = (Q3 − Q1)/2
The value of Q.D. gives the average magnitude by which the two quartiles deviate from the median. If the
distribution is approximately symmetrical, then Md ± Q.D. will include about 50% of the observations
and, thus, we can write Q1 = Md − Q.D. and Q3 = Md + Q.D. Further, a low value of Q.D. indicates a high
concentration of the central 50% of observations, and vice versa.
Quartile deviation is an absolute measure of dispersion. The corresponding relative measure is known as
coefficient of quartile deviation defined as
Coeff. of Q.D. = (Q3 − Q1)/(Q3 + Q1)
Example: Find the quartile deviation, and its coefficient from the following data:
Age (in years) X no. of students f
15 4
16 6
17 10
18 15
19 12
20 9
21 4
Solution:
Table for the calculation of Q.D.
Age (in years) X no. of students f cf
15 4 4
16 6 10
17 10 20
18 15 35
19 12 47
20 9 56
21 4 60
Solution:
N/4 = 60/4 = 15; from the cumulative frequency column, Q1 = 17 (by inspection).
3N/4 = 3 × 60/4 = 45; hence Q3 = 19.
Q.D. = (Q3 − Q1)/2 = (19 − 17)/2 = 1 year
Coeff. of Q.D. = (Q3 − Q1)/(Q3 + Q1) = (19 − 17)/(19 + 17) = 0.056
In case of an ungrouped frequency distribution, the observations X1, X2, ..... Xn occur with respective
frequencies f1, f2, ..... fn such that ∑fi = N. The corresponding formulae for M.D. can be written as:
1. Mean Deviation from Mean: MDmean = ∑f|X − 𝑋̅| / N
2. Mean Deviation from Median: MDmedian = ∑f|X − M| / N
3. Mean Deviation from Mode: MDmode = ∑f|X − Mode| / N
The above formulae are also applicable to a grouped frequency distribution where the symbols
X1, X2, ..... Xn will denote the mid-values of the first, second ..... nth classes respectively.
Note: Mean deviation is minimum when deviations are taken from median.
Coefficient of Mean Deviation
The above formulae for mean deviation give an absolute measure of dispersion. The formulae for relative
measure, termed as the coefficient of mean deviation, are given below:
1. Coeff. of MDmean = (M.D. from mean) / 𝑋̅
2. Coeff. of MDmedian = (M.D. from median) / Median
3. Coeff. of MDmode = (M.D. from mode) / Mode
Example: Calculate mean deviation from mean and median for the following data of heights (in inches)
of 10 persons.
60, 62, 70, 69, 63, 65, 60, 68, 63, 64
Also calculate their respective coefficients.
Solution:
Calculation of M.D. from Mean
Mean = 𝑋̅ = (60 + 62 + 70 + 69 + 63 + 65 + 60 + 68 + 63 + 64)/10 = 64.4 inches
X 60 62 70 69 63 65 60 68 63 64 Total
|X − Mean| 4.4 2.4 5.6 4.6 1.4 0.6 4.4 3.6 1.4 0.4 28.8
Mean Deviation from Mean: MDmean = ∑|X − 𝑋̅| / n = 28.8/10 = 2.88
Coeff. of MDmean = 2.88/64.4 = 0.045
Calculation of M.D. from Md
Arranging the observations in order of magnitude, we have
60, 60, 62, 63, 63, 64, 65, 68, 69, 70
The median of the above observations is (63 + 64)/2 = 63.5 inches.
Mean Deviation from Median: MDmedian = ∑|X − M| / n
X 60 62 70 69 63 65 60 68 63 64 Total
|X − Med| 3.5 1.5 6.5 5.5 0.5 1.5 3.5 4.5 0.5 0.5 28.0
MDmedian = 28/10 = 2.80
Also, the coefficient of M.D. from Md = 2.80/63.5 = 0.044.
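Both mean deviations above can be verified with a short Python sketch (the function name is ours):

```python
# Mean deviation about any central value A: sum(|x - A|) / n
def mean_deviation(values, center):
    return sum(abs(x - center) for x in values) / len(values)

heights = [60, 62, 70, 69, 63, 65, 60, 68, 63, 64]
print(round(mean_deviation(heights, 64.4), 2))   # 2.88 (about the mean)
print(round(mean_deviation(heights, 63.5), 2))   # 2.8  (about the median)
```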
Example :
In a foreign language class there are 4 languages; the numbers of students learning each language and
the frequencies of lectures per week are given as:
No. of students (xi): 6 5 9 12
Frequency of lectures (fi): 5 7 4 9
Calculate the mean deviation about the mean for the given data.
Solution: Here N = ∑fi = 25 and Mean = ∑fixi/N = 209/25 = 8.36. The deviations |xi − 8.36|, weighted
by fi, sum to 70.64.
MDmean = 70.64/25 = 2.8256
Coeff. of MDmean = 2.8256/8.36 = 0.338
Example: Calculate Mean deviation about mean from the following grouped data
Class-X: 10-20 20-30 30-40 40-50 50-60 60-70 70-80
Frequency: 15 25 20 12 8 5 3
Solution
Class f Mid value (x) f·x |x − 35| f·|x − 35|
10-20 15 15 225 20 300
20-30 25 25 625 10 250
30-40 20 35 700 0 0
40-50 12 45 540 10 120
50-60 8 55 440 20 160
60-70 5 65 325 30 150
70-80 3 75 225 40 120
Total 88 3080 1100
Mean = 3080/88 = 35
Mean deviation from Mean = 1100/88 = 12.5
Coeff of MD mean = 12.5/ 35 = 0.3571
The standard deviation is denoted by Greek letter ‘σ ’ which is called ‘small sigma’ or simply sigma.
In terms of symbols,
σ = √[(1/n) ∑(X − 𝑋̅)²], where n is the number of observations; and
σ = √[(1/N) ∑f(X − 𝑋̅)²] for a grouped or ungrouped frequency distribution, where an observation Xi
occurs with frequency fi.
It should be noted here that the units of σ are same as the units of X.
Calculation of Standard Deviation
There are two methods of calculating standard deviation: (i) Direct Method (ii) Short-cut Method
Direct Method
1. Individual Series: If there are n observations X1, X2, ...... Xn, various steps in the calculation of
standard deviation are:
Step 1: Find the mean, 𝑋̅.
Step 2: For each data point, find the square of its deviation from the mean, (X − 𝑋̅)².
Step 3: Sum the values from Step 2, i.e., ∑(X − 𝑋̅)².
Step 4: Divide by the number of data points: (1/n) ∑(X − 𝑋̅)².
Step 5: Take the square root: √[(1/n) ∑(X − 𝑋̅)²].
Variance
The variance is defined as the average of the squared deviations from the mean. In other words, the
variance is the square of the standard deviation, i.e., σ².
Example:
The heights of dogs are: 600mm, 470mm, 170mm, 430mm and 300mm.
Find out the Mean, the Variance, and the Standard Deviation.
Solution:
First step is to find the Mean:
Mean = (600 + 470 + 170 + 430 + 300)/5 = 1970/5 = 394
X X- 394 (X-394)2
600 206 42436
470 76 5776
170 -224 50176
430 36 1296
300 -94 8836
Total 108520
σ = √[(1/n) ∑(X − 𝑋̅)²] = √(108520/5) = √21704 = 147.32 mm
Variance = σ² = 108520/5 = 21704
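The five steps above, applied to the dog-height example, can be sketched in Python (the function name is ours; this is the population variance, dividing by n as in the text):

```python
# Population variance: mean of squared deviations; sigma = sqrt(variance)
import math

def variance(values):
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)

heights = [600, 470, 170, 430, 300]
var = variance(heights)
print(var, round(math.sqrt(var), 2))   # 21704.0 147.32
```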
2. Ungrouped or Grouped Frequency Distributions: Let the observations X1, X2 ...... Xn appear
with respective frequencies f1, f2 ...... fn, where ∑f= N. As before, if the distribution is grouped,
then X1, X2 ...... Xn will denote the mid-values of the first, second ..... nth class intervals
respectively. The formulae for the calculation of standard deviation and variance can be written
as
σ = √[(1/N) ∑f(X − 𝑋̅)²] and σ² = (1/N) ∑f(X − 𝑋̅)², respectively.
An equivalent computational formula is
σ = √[∑fX²/N − (∑fX/N)²]
Example: Calculate the standard deviation of the following frequency distribution.
Solution:
Calculation of Standard Deviation
X f fX X − 𝑋̅ (X − 𝑋̅)² f(X − 𝑋̅)² fX²
10 2 20 −4 16 32 200
11 7 77 −3 9 63 847
12 10 120 −2 4 40 1440
13 12 156 −1 1 12 2028
14 15 210 0 0 0 2940
15 11 165 1 1 11 2475
16 10 160 2 4 40 2560
17 6 102 3 9 54 1734
18 3 54 4 16 48 972
Total 76 1064 300 15196
Mean = 1064 / 76 = 14
σ2 = 300/76 = 3.95
σ = √3.95 = 1.99
Alternative Method
From the last column of the above table, we have
Sum of fX² = 15196
Mean of squares = 15196/76 = 199.95
Thus, σ² = Mean of squares − Square of the mean = 199.95 − (14)² = 3.95, and σ = √3.95 = 1.99.
Example: Calculate standard deviation of the following series:
115-120 320 145-150 210
120 350 150 160
125 520 155 90
Solution:
σ² = 671750/3300 − (2750/3300)² = 202.87
σ = √202.87 = 14.24
Coefficient of Variation
The standard deviation is an absolute measure of dispersion, expressed in the same units as the variable
X. A relative measure of dispersion based on the standard deviation is known as the coefficient of
variation and is given by (σ/𝑋̅) × 100.
This measure, introduced by Karl Pearson, is used to compare the variability or homogeneity or stability
or uniformity or consistency of two or more sets of data. The data having a higher value of the
coefficient of variation are said to be more dispersed or less uniform.
Example: Calculate standard deviation and its coefficient of variation from the following data:
Measurements 0-5 5-10 10-15 15-20 20-25
Frequency 4 1 10 3 2
Solution:
CI       f     Mid-value X    d = (X − A)/i    fd     fd²
0–5      4     2.5            –2               –8     16
5–10     1     7.5            –1               –1     1
10–15    10    12.5           0                0      0
15–20    3     17.5           1                3      3
20–25    2     22.5           2                4      8
Total    20                                    –2     28
With assumed mean A = 12.5 and class width i = 5, the step deviation is d = (X − 12.5)/5.
Mean = A + i(Σfd/N) = 12.5 + 5 × (−2)/20 = 12.5 − 0.5 = 12
σ = √[Σfd²/N − (Σfd/N)²] × i
= √[28/20 − (−2/20)²] × 5 = 5.89
Thus, the coefficient of variation (CV) = (5.89/12) × 100 ≈ 49%.
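The step-deviation computation above can be reproduced in a few lines of Python (a sketch; variable names are illustrative):

```python
from math import sqrt

# Step-deviation method for the table above.
# A = 12.5 is the assumed mean and i = 5 the class width, so d = (X - A)/i.
mid = [2.5, 7.5, 12.5, 17.5, 22.5]   # class mid-values
f   = [4, 1, 10, 3, 2]               # frequencies
A, i = 12.5, 5
n = sum(f)
d = [(x - A) / i for x in mid]
sum_fd  = sum(fi * di for fi, di in zip(f, d))
sum_fd2 = sum(fi * di ** 2 for fi, di in zip(f, d))
mean  = A + i * sum_fd / n                          # 12.5 - 0.5 = 12
sigma = i * sqrt(sum_fd2 / n - (sum_fd / n) ** 2)   # multiply back by i
cv    = sigma / mean * 100
```

Running this gives σ ≈ 5.89 and CV ≈ 49%, matching the hand calculation.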
Merits, Demerits and Uses of Standard Deviation
Merits
1. It is a rigidly defined measure of dispersion.
2. It is based on all the observations.
3. It is capable of being treated mathematically. For example, if standard deviations of a number of groups
are known, their combined standard deviation can be computed.
4. It is not very much affected by the fluctuations of sampling and, therefore, is widely used in sampling
theory and test of significance.
Demerits
1. As compared to the quartile deviation and range, etc., it is difficult to understand and difficult to
calculate.
2. It gives more importance to extreme observations.
3. Since it depends upon the units of measurement of the observations, it cannot be used to compare the
dispersions of the distributions expressed in different units.
Uses of Standard Deviation
1. Standard deviation can be used to compare the dispersions of two or more distributions when their units of measurement and arithmetic means are the same.
2. It is used to test the reliability of mean. It may be pointed out here that the mean of a distribution with
lower standard deviation is said to be more reliable.
SKEWNESS
Skewness means lack of symmetry. In mathematics, a figure is called symmetric if there exists a point in it through which, if a perpendicular is drawn on the X-axis, the figure is divided into two congruent parts, i.e., parts identical in every respect, each being the mirror image of the other.
In Statistics, a distribution is called symmetric if mean, median and mode coincide. Otherwise, the
distribution becomes asymmetric. If the right tail is longer, we get a positively skewed distribution for
which mean > median > mode while if the left tail is longer, we get a negatively skewed distribution for
which mean < median < mode.
The example of the Symmetrical curve, Positive skewed curve and Negative skewed curve are given as
follows:
Figure: Symmetrical curve, positively skewed curve and negatively skewed curve
In business and economic series, measures of variation have greater practical application than measures of skewness. However, in the medical and life sciences, measures of skewness have greater practical application than measures of variation.
Various measures of Skewness
Measures of skewness help us to know to what degree and in which direction (positive or negative) the
frequency distribution has a departure from symmetry. Although positive or negative skewness can be
detected graphically, depending on whether the right tail or the left tail is longer, we do not get an idea of the magnitude. Besides, borderline cases between symmetry and asymmetry may be difficult to detect
graphically. Hence some statistical measures are required to find the magnitude of lack of symmetry. A
good measure of skewness should possess three criteria:
1. It should be a unit free number so that the shapes of different distributions, so far as symmetry is
concerned, can be compared even if the unit of the underlying variables are different;
2. If the distribution is symmetric, the value of the measure should be zero. Similarly, the measure
should give positive or negative values according as the distribution has positive or negative
skewness respectively; and
3. As we move from extreme negative skewness to extreme positive skewness, the value of the measure
should vary accordingly.
Measures of skewness can be both absolute as well as relative. Since in a symmetrical distribution the mean, median and mode are identical, the more the mean moves away from the mode, the larger the asymmetry or skewness. An absolute measure of skewness cannot be used for purposes of comparison because the same amount of skewness has different meanings in a distribution with small variation and in a distribution with large variation.
Absolute Measures of Skewness
Following are the absolute measures of skewness:
1. Skewness (Sk) = Mean – Median
For comparing two series, absolute measures cannot be used; for that, we calculate relative measures, which are called coefficients of skewness. Coefficients of skewness are pure numbers, independent of the units of measurement.
In order to make a valid comparison between the skewness of two or more distributions, we have to eliminate the disturbing influence of variation. Such elimination can be done by dividing the absolute skewness by the standard deviation. The following are the important methods of measuring relative skewness:
β and γ Coefficient of Skewness
Karl Pearson defined the following β and ℽ coefficients of skewness, based upon the second and third
central moments:
β1 = μ3² / μ2³
γ1 = ±√β1 = μ3 / μ2^(3/2)
The sign of the skewness then depends upon whether μ3 is positive or negative. It is advisable to use γ1 as a measure of skewness, since β1 is always non-negative and therefore cannot indicate the direction of skewness.
Karl Pearson’s Coefficient of Skewness
This method is most frequently used for measuring skewness. The formula for measuring coefficient of
skewness is given by
Sk = (Mean − Mode) / σ
The value of this coefficient would be zero in a symmetrical distribution. If mean is greater than mode,
coefficient of skewness would be positive otherwise negative. The value of the Karl Pearson’s coefficient
of skewness usually lies between ± 1 for moderately skewed distribution. If mode is not well defined, we
use the formula
Sk = 3(Mean − Median) / σ
The value of Sk would be zero for a symmetrical distribution. If the value is greater than zero, the distribution is positively skewed, and if it is less than zero, it is negatively skewed. This coefficient always lies between +3 and –3.
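Both forms of the coefficient are one-line computations; a Python sketch (function names are illustrative):

```python
def pearson_sk_mode(mean, mode, sigma):
    """Karl Pearson's coefficient of skewness: (mean - mode) / sigma."""
    return (mean - mode) / sigma

def pearson_sk_median(mean, median, sigma):
    """Variant used when the mode is not well defined:
    3 * (mean - median) / sigma."""
    return 3 * (mean - median) / sigma
```

With the figures of Example 1 below (mean 59.2, mode 50.88, σ = 13) the first form returns 0.64.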
Example1: For a distribution Karl Pearson’s coefficient of skewness is 0.64, standard deviation is 13 and
mean is 59.2 Find mode and median.
Solution: We have given
Sk = 0.64, σ = 13 and Mean = 59.2
Therefore, by using the formula
Sk = (Mean − Mode) / σ
0.64 = (59.2 − Mode) / 13
Mode = 59.2 − 0.64 × 13 = 59.2 − 8.32 = 50.88
Also, Mode = 3 Median − 2 Mean
50.88 = 3 Median − 2 × 59.2
Median = (50.88 + 118.4)/3 = 169.28/3 = 56.43
Remarks about Skewness
1. If the values of mean, median and mode are the same in a distribution, then skewness does not exist in that distribution. The larger the difference in these values, the larger the skewness;
2. If the sums of the frequencies are equal on both sides of the mode, skewness does not exist;
3. If the first and third quartiles are equidistant from the median, skewness does not exist. Similarly, if the deciles (first and ninth) and the percentiles (first and ninety-ninth) are equidistant from the median, there is no asymmetry;
4. If the sums of positive and negative deviations obtained from mean, median or mode are equal then
there is no asymmetry; and
5. If the graph of the data forms a symmetrical (normal) curve, so that when it is folded at the middle one part overlaps the other completely, there is no asymmetry.
KURTOSIS
If we have the knowledge of the measures of central tendency, dispersion and skewness, even then we
cannot get a complete idea of a distribution. In addition to these measures, we need to know another
measure to get the complete idea about the shape of the distribution which can be studied with the help of
Kurtosis. Prof. Karl Pearson has called it the “Convexity of a Curve”. Kurtosis gives a measure of flatness
of distribution.
The degree of kurtosis of a distribution is measured relative to that of a normal curve. The curves with
greater peakedness than the normal curve are called “Leptokurtic”. The curves which are more flat than
the normal curve are called “Platykurtic”. The normal curve is called “Mesokurtic.” The Fig. below
describes the three different curves mentioned above
Figure: Platykurtic Curve, Mesokurtic Curve and Leptokurtic Curve
Karl Pearson’s Measures of Kurtosis
For calculating the kurtosis, the second and fourth central moments of variable are used. For this,
following formula given by Karl Pearson is used:
β2 = μ4 / μ2²
or ℽ2 = β2 - 3
where, µ2 = Second order central moment of distribution
µ4 = Fourth order central moment of distribution
Description
1. If β2 = 3 or ℽ2 = 0, then curve is said to be mesokurtic;
2. If β2 < 3 or ℽ2 < 0, then curve is said to be platykurtic;
3. If β2 > 3 or ℽ2 > 0, then curve is said to be leptokurtic;
Example 2: First four moments about mean of a distribution are 0, 2.5, 0.7 and 18.75. Find coefficient of
skewness and kurtosis.
Solution: We have µ1 = 0, µ2 = 2.5, µ3 = 0.7 and µ4 = 18.75
Skewness: β1 = μ3² / μ2³ = (0.7)² / (2.5)³ = 0.49/15.625 = 0.031
Kurtosis: β2 = μ4 / μ2² = 18.75 / (2.5)² = 18.75/6.25 = 3
Since β2 = 3, the distribution is mesokurtic; since β1 is close to zero, the distribution is nearly symmetric.
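The moment coefficients used in this example are straightforward to compute; a brief Python sketch (function names are illustrative):

```python
def beta1(mu2, mu3):
    """Moment coefficient of skewness: beta1 = mu3^2 / mu2^3."""
    return mu3 ** 2 / mu2 ** 3

def beta2(mu2, mu4):
    """Moment coefficient of kurtosis: beta2 = mu4 / mu2^2.
    gamma2 = beta2 - 3 is zero for a mesokurtic curve."""
    return mu4 / mu2 ** 2
```

With μ2 = 2.5, μ3 = 0.7 and μ4 = 18.75 these reproduce β1 ≈ 0.031 and β2 = 3.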
Lesson 4
Introduction to Correlation and Regression
I. Introduction to Correlation
Correlation analysis is used to quantify the association between two variables (e.g., between an
independent and a dependent variable or between two independent variables). Various experts have
defined correlation in their own words and their definitions, broadly speaking, imply that correlation
is the degree of association between two or more variables.
A.M. Tuttle defined correlation as “Correlation is an analysis of covariation between two or more
variables.”
In other words, variables are correlated if, corresponding to a change in one variable, there is a change in the other variable. This change may be in either direction: if one variable increases (or decreases), the other may also increase (or decrease).
Correlation Coefficient: It is a numerical measure of the degree of association between two or more
variables.
Correlation and Causation
Correlation analysis deals with the association or co-variation between two or more variables and helps
to determine the degree of relationship between two or more variables. But correlation does not
indicate a cause and effect relationship between two variables. It explains only co-variation. The high
degree of correlation between two variables may exist due to any one or a combination of the following
reasons.
1. Correlation may be due to pure chance: Especially in a small sample, correlation may arise by pure chance. There may be a high degree of correlation between two variables in a sample, even though in the population there is no relationship between the variables at all. For example, the production of corn, the availability of dairy products, chlorophyll content and plant height have no relationship; if a relationship appears, it may be only chance or coincidence. Such correlation is known as spurious or nonsensical correlation.
2. Both variables are influenced by some other variables: A high degree of correlation between two variables may arise because some common cause, or several causes, affects each of them. For example, a high degree of correlation may exist between the yield per acre of paddy and that of wheat due to the effect of rainfall and other common factors such as fertilizers used and favourable weather conditions. But neither of the two variables is the cause of the other; they may not have caused each other, but there is an outside influence.
3. Mutual dependence: In this case, the variables affect each other, and which variable is to be treated as cause and which as effect must be judged from the circumstances. For example, in the case of the production of jute and rainfall, rainfall is the influencing variable and jute production the dependent one: the effect of rainfall is directly related to jute production.
Covariance vs. Correlation
Covariance reveals only the direction of the linear relationship between two variables, whereas the correlation coefficient shows both the direction and the strength of the linear relationship. The correlation coefficient is unit-free, since the units in the numerator cancel with those in the denominator; hence, it can be used for comparison.
Properties of Coefficient of Correlation
1. The coefficient of correlation is independent of the change of origin and scale of
measurements.
2. It always lies between –1 and +1.
TYPES OF CORRELATION
In a bivariate distribution, correlation is classified into many types, but the important are:
1. Positive and negative correlation
2. Simple and multiple correlation
3. Partial and total correlation
4. Linear and non-linear correlation.
Positive and Negative Correlation
Positive and negative correlation depend upon the direction of the change of the variables. If two
variables tend to move together in the same direction i.e., an increase or decrease in the value of one
variable is accompanied by an increase or decrease in the value of other variable, then the correlation
is called positive or direct correlation, e.g., height and weight of a person.
If two variables tend to move together in opposite directions i.e., an increase or decrease in the value
of one variable is accompanied by a decrease or increase in the value of other variable, then the
correlation is called negative or inverse correlation, e.g., price and demand of a commodity.
Simple and Multiple Correlation
When there are only two variables under study, the relationship is described as simple correlation, for example, the yield of wheat and the use of fertilizers. In multiple correlation, there are more than two variables under study. Multiple correlation consists of the measurement of the relationship between a dependent variable and two or more independent variables, for example, the relationship of plant yield with the number of pods and the number of clusters in pulses, studied together.
Partial and Total Correlation
The study of the relationship between two variables excluding the effect of some other variables is called partial correlation. For example, the correlation between the yield of maize and fertilizers, excluding the effect of pesticides and manures, is a partial correlation. In total correlation, all the factors are taken into account.
Linear and Non-linear Correlation
If the ratio of change between two variables is uniform, there is a linear correlation between them. In curvilinear or non-linear correlation, the amount of change in one variable does not bear a constant ratio to the amount of change in the other variable.
METHODS OF STUDYING CORRELATION
The different methods of finding out the relationship between two variables are:
1. Scatter diagram or scattergram or dot diagram,
This is the simplest method of finding out whether there is any relationship present between two
variables by plotting the values on a chart, known as scatter diagram. By this method a rough idea
about the correlation of two variables can be judged. In this method, the given data are plotted on a
graph paper in the form of dots. X variables are plotted on the horizontal axis and Y variables on the
vertical axis. Thus, we have the dots and we can know the scatter or concentration of the various
points. This will show the type of correlation.
A scatter diagram of the data helps in having a visual idea about the nature of association between two
variables. If the plotted points form a straight line running from the lower left-hand corner to the upper
right-hand corner, then there is a perfect positive correlation (i.e., r = +1). On the other hand, if the
points are in a straight line, having a falling trend from the upper left-hand corner to the lower right-
hand corner, it reveals that there is a perfect negative or inverse correlation (i.e., r = – 1). If the plotted
points fall in a narrow band, and the points are rising from lower left-hand corner to the upper right-
hand corner, there will be a high degree of positive correlation between the two variables. If the plotted
points fall in a narrow band from the upper left-hand corner to the lower right-hand corner, there will
be a high degree of negative correlation. If the plotted points lie scattered all over the diagram, there is no correlation between the two variables.
Karl Pearson’s Coefficient of Correlation
Karl Pearson, a great biometrician and statistician, suggested a mathematical method for measuring
the magnitude of linear relationship between two variables. Karl Pearson’s method is the most widely
used method in practice and is known as Pearson’s coefficient of correlation. It gives information about
the direction as well as the magnitude of the relationship between two variables. It is denoted by the
symbol ‘r’ ; the formula for calculating Pearson’s r is:
r = Cov(X, Y) / (σx σy), or equivalently
r = [NΣXY − ΣX·ΣY] / [√(NΣX² − (ΣX)²) · √(NΣY² − (ΣY)²)]
In terms of deviations from the means it is defined as
r = Σxy / (√Σx² · √Σy²)
Where 𝑥=𝑋−𝑋̅ and 𝑦=𝑌−𝑌̅
Mathematical Properties of Karl Pearson’s Coefficient of Correlation
(i) Coefficient of correlation lies between + 1 and – 1. Symbolically
–1≤r≤+1
That is ‘r’ cannot be less than – 1 and cannot exceed + 1.
(ii) The coefficient of correlation is independent of change of origin and scale of the variables x and y. By change of scale, we mean that all values of the x and y series are multiplied or divided by some constant; by change of origin, we mean that a constant is subtracted from all values of the x and y series.
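Property (ii) can be checked numerically; a Python sketch using the data of the example that follows (the transformation constants are arbitrary):

```python
from math import sqrt

def pearson_r(x, y):
    """Karl Pearson's r in deviation form: r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

x = [12, 9, 8, 10, 11, 13, 7]
y = [14, 8, 6, 9, 11, 12, 3]
# change of origin (subtract a constant) and scale (divide by a constant):
u = [(xi - 10) / 2 for xi in x]
v = [(yi - 5) / 3 for yi in y]
# pearson_r(x, y) and pearson_r(u, v) agree, illustrating the property
```

Both calls return the same value of r, confirming that the coefficient is unaffected by the transformation.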
Example: Calculate the Karl Pearson's coefficient of correlation from the following pairs of values:
Values of X:   12   9   8   10   11   13   7
Values of Y:   14   8   6   9    11   12   3
Solution:
The formula for Karl Pearson's coefficient of correlation is
r = [NΣXY − ΣX·ΣY] / [√(NΣX² − (ΣX)²) · √(NΣY² − (ΣY)²)]
The values of the different terms in the formula are calculated in the following table:
Xi      Yi      XiYi    Xi²     Yi²
12      14      168     144     196
9       8       72      81      64
8       6       48      64      36
10      9       90      100     81
11      11      121     121     121
13      12      156     169     144
7       3       21      49      9
Total   70      63      676     728     651

Substituting these values (N = 7):
r = (7 × 676 − 70 × 63) / [√(7 × 728 − 70²) · √(7 × 651 − 63²)]
= (4732 − 4410) / (√196 × √588)
= 322 / (14 × 24.25) = 0.95
There is a very high degree of positive correlation between X and Y.
Degree of Correlation
The degree of correlation between two variables can be ascertained from the quantitative value of the coefficient of correlation, which can be found by calculation using Karl Pearson's formula. The result of this formula always lies between +1 and –1: in the case of perfect positive correlation r = +1, in the case of perfect negative correlation r = –1, and in the absence of correlation r = 0, which indicates that the degree of scattering is very large. In experimental research, it is very difficult to find such exact values of r as +1, –1 and 0.
The following table shows the approximate degree of correlation according to Karl Pearson's formula:
Degree of correlation                        Positive            Negative
Perfect correlation                          +1                  –1
Very high degree of correlation              +0.9 or more        –0.9 or more
Sufficiently high degree of correlation      +0.75 to +0.9       –0.75 to –0.9
Moderate degree of correlation               +0.6 to +0.75       –0.6 to –0.75
Only the possibility of correlation          +0.3 to +0.6        –0.3 to –0.6
Possibly no correlation                      less than +0.3      less than –0.3
Absence of correlation                       0                   0
Assumptions
There are some assumptions of Karl Pearson’s coefficient of correlation. They are as follows:
(i) Linear Relationship: If the two variables are plotted on a scatter diagram, it is assumed that the
plotted points will form a straight line. So, there is a linear relationship between the variables.
(ii) Normality: The correlated variables are affected by a large number of independent causes, which together produce a normal distribution. Variables like quantity of money, age, weight, height, price and demand are affected by such forces.
(iii) Causal Relationship: Correlation is only meaningful if there is a cause-and-effect relationship between the forces affecting the distribution of items in the two series. It is meaningless if there is no such relationship. For example, there is no relationship between rice and weight, because the factors that affect these variables are not common.
(iv) Proper Grouping: It will be a better correlation analysis if there is an equal number of pairs.
(v) Error of Measurement: If the error of measurement is reduced to the minimum, the coefficient
of correlation is more reliable.
Merits and Limitations of Coefficient of Correlation
The only merit of Karl Pearson's coefficient of correlation is that it is the most popular method for expressing the degree and direction of linear association between two variables in terms of a pure number, independent of the units of the variables. This measure, however, suffers from certain limitations,
given below:
1. Coefficient of correlation r does not give any idea about the existence of cause and effect relationship
between the variables. It is possible that a high value of r is obtained although none of them seem to
be directly affecting the other. Hence, any interpretation of r should be done very carefully.
2. It is only a measure of the degree of linear relationship between two variables. If the relationship is
not linear, the calculation of r does not have any meaning.
3. Its value is unduly affected by extreme items.
4. As compared with other methods, the computations of r are cumbersome and time consuming.
Spearman’s Rank Correlation
This is a crude method of computing correlation between two characteristics. In this method, various
items are assigned ranks according to the two characteristics and a correlation is computed between
these ranks. This method is often used in the following circumstances:
1. When the quantitative measurements of the characteristics are not possible, e.g., the results of a
beauty contest where various individuals can only be ranked.
2. Even when the characteristic is measurable, it may be desirable to avoid actual measurement due to shortage of time or money, or the complexity of calculations for large data, etc.
3. When the given data consist of some extreme observations, the value of Karl Pearson’s coefficient
is likely to be unduly affected. In such a situation the computation of the rank correlation is preferred
because it will give less importance to the extreme observations.
4. It is used as a measure of the degree of association in situations where the nature of population, from
which data are collected, is not known.
Spearman’s Rank correlation coefficient is a technique which can be used to summarise the strength
and direction (negative or positive) of a relationship between two variables.
The result will always lie between +1 and –1.
Steps - calculating the coefficient
• Create a table from your data.
• Rank the two data sets. Ranking is achieved by giving the ranking '1' to the biggest number in a
column, '2' to the second biggest value and so on. The smallest value in the column will get the
lowest ranking. This should be done for both sets of measurements.
• Find the difference in the ranks (d): this is the difference between the ranks of the two values in each row of the table.
• Square the differences (d²) to remove negative values, and then sum them (∑d²).
• Calculate the coefficient (Rsp) using the formula below. The answer will always be between 1.0 (a
perfect positive correlation) and -1.0 (a perfect negative correlation).
Rsp = 1 − 6∑d² / (n³ − n)
Example: The following table displays the association between the IQ of each adolescent in a
sample with the number of hours they listen to rock music per month. Determine the strength of the
correlation between IQ and rock music using Spearman’s rank correlation.
Solution:
Step1: Give ranks to the data
Calculation Table
IQ (X)   Rock (Y)   Rank of X   Rank of Y   di = Rank of X − Rank of Y   di²
99       3          4           2           2                            4
120      0          9           1           8                            64
98       30         3           9           –6                           36
102      45         5           10          –5                           25
123      16         10          4           6                            36
105      25         6           7           –1                           1
85       17         1           5           –4                           16
110      24         7           6           1                            1
117      26         8           8           0                            0
90       5          2           3           –1                           1
Total                                                                    184
rsp = 1 − 6∑d² / (n³ − n) = 1 − (6 × 184)/(10³ − 10) = 1 − 1104/990 = −0.115
Since the value of the correlation is near zero, there is hardly any association between the variables.
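The whole procedure for untied data fits in a few lines; a Python sketch checked against this example (the function name is illustrative):

```python
def spearman(x, y):
    """Spearman's rank correlation for data without ties.

    Ranks 1..n are assigned by sorted position; the direction of ranking
    (largest-first or smallest-first) does not matter as long as it is
    the same for both series.
    """
    n = len(x)
    rx = [sorted(x).index(v) + 1 for v in x]   # rank of each x value
    ry = [sorted(y).index(v) + 1 for v in y]   # rank of each y value
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n ** 3 - n)

iq   = [99, 120, 98, 102, 123, 105, 85, 110, 117, 90]
rock = [3, 0, 30, 45, 16, 25, 17, 24, 26, 5]
```

For these data the function returns rsp ≈ −0.115, matching the table above.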
Quotations of index numbers of equity share prices of a certain joint stock company and the
prices of preference shares are given below.
Using the method of rank correlation, determine the relationship between equity share prices and preference share prices.
Solution:
∑ di2 = 90; n= 7
Rank correlation is given by
rsp = 1 − (6 × 90) / [7(49 − 1)] = 1 − 540/336 = 1 − 1.6071 = −0.6071
Interpretation: There is a negative correlation between equity shares and preference share prices.
There is a strong disagreement between equity shares and preference share prices.
Repeated Ranks
In case of a tie, i.e., when two or more individuals have the same rank, each individual is assigned
a rank equal to the mean of the ranks that would have been assigned to them in the event of there
being slight differences in their values. To understand this, let us consider the series 20, 21, 21,
24, 25, 25, 25, 26, 27, 28. Here the value 21 is repeated two times and the value 25 is repeated
three times. When we rank these values, rank 1 is given to 20. The values 21 and 21 could have
been assigned ranks 2 and 3 if these were slightly different from each other. Thus, each value will
be assigned a rank equal to mean of 2 and 3, i.e., 2.5. Further, the value 24 will be assigned a rank
equal to 4 and each of the values 25 will be assigned a rank equal to 6, the mean of 5, 6 and 7 and
so on.
Since the Spearman's formula is based upon the assumption of different ranks to different
individuals, therefore, its correction becomes necessary in case of tied ranks. It should be noted
that the means of the ranks will remain unaffected. Further, the changes in the variances are
usually small and are neglected. However, it is necessary to correct the term Σdi²: the correction factor m(m² − 1)/12, where m denotes the number of observations tied at a particular rank, is added to it for every tie. In the above example there will therefore be two correction factors, namely 2(2² − 1)/12 and 3(3² − 1)/12.
Hence, rsp is given by:
rsp = 1 − 6(Σd² + CF) / (n³ − n), where CF is the total correction factor.
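A Python sketch of the tie handling just described (average ranks plus the m(m² − 1)/12 correction; function names are illustrative):

```python
def avg_ranks(values):
    """Average ranks: tied observations share the mean of the ranks
    they would occupy (smallest value ranked 1)."""
    s = sorted(values)
    # first rank of v is s.index(v) + 1, last rank is s.index(v) + s.count(v)
    return [(s.index(v) + 1 + s.index(v) + s.count(v)) / 2 for v in values]

def spearman_tied(x, y):
    """Spearman's coefficient with tie correction:
    rsp = 1 - 6*(sum d^2 + CF) / (n^3 - n),
    where CF sums m*(m^2 - 1)/12 over every group of m tied values."""
    n = len(x)
    rx, ry = avg_ranks(x), avg_ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    cf = 0.0
    for series in (x, y):
        for v in set(series):
            m = series.count(v)
            cf += m * (m ** 2 - 1) / 12     # zero when m == 1
    return 1 - 6 * (d2 + cf) / (n ** 3 - n)
```

With the data of Example 4 later in this lesson, this reproduces Rsp ≈ −0.024.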
Merits and Demerits of Rank Correlation Coefficient
Merits of Rank Correlation Coefficient
1. Spearman’s rank correlation coefficient can be interpreted in the same way as the Karl
Pearson’s correlation coefficient;
2. It is easy to understand and easy to calculate;
3. If we want to see the association between qualitative characteristics, rank correlation
coefficient is the only formula;
4. Rank correlation coefficient is the non-parametric version of the Karl Pearson’s product
moment correlation coefficient; and
5. It does not require the assumption of the normality of the population from which the sample
observations are taken.
Demerits of Rank Correlation Coefficient
1. Product moment correlation coefficient can be calculated for bivariate frequency distribution
but rank correlation coefficient cannot be calculated; and
2. If n >30, this formula is time consuming
Example 4: Calculate rank correlation coefficient from the following data:
Expenditure on advertisement (x):   10   15   14   25   14   14   20   22
Profit (y):                         6    25   12   18   25   40   10   7
Solution: Let us denote the expenditure on advertisement by x and profit by y
Here Σdi² = 83.5. Rank 6 is repeated three times in the ranks of x, and rank 2.5 is repeated twice in the ranks of y, so the correction factor is
3(3² − 1)/12 + 2(2² − 1)/12 = 2 + 0.50 = 2.50
Rsp = 1 − 6(83.5 + 2.50)/(8³ − 8) = 1 − 516/504 = −0.024
II. Regression Analysis
Introduction
Correlation gives the degree and direction of the relationship between two variables; it does not give the nature of that relationship. Regression, by contrast, describes the dependence of a variable on an independent variable. It is a set of statistical methods used for the estimation of relationships
between a dependent variable and one or more independent variables. It can be utilized to assess the
strength of the relationship between variables and for modelling the future relationship between them.
Correlation vs. Regression
1. 'Correlation', as the name suggests, determines the interconnection or co-relationship between the variables, whereas 'regression' explains how an independent variable is numerically associated with the dependent variable.
2. In correlation, no distinction is made between independent and dependent variables; in regression, the dependent and independent variables are distinct.
3. The primary objective of correlation is to find a quantitative/numerical value expressing the association between the values; the primary intent of regression is to estimate the values of a random variable on the basis of the values of the fixed variable.
4. Correlation stipulates the degree to which both variables can move together; regression specifies the effect of a unit change in the known variable (p) on the estimated variable (q).
5. Correlation helps to establish the connection between the two variables; regression helps in estimating a variable's value on the basis of another given value.
Linear model assumptions
Assumption
Linear regression analysis is based on following fundamental assumptions:
i) The model is linear in parameters.
ii) The error term has zero mean, E(μi) = 0, and follows a normal distribution.
iii) The error terms are independent; that is, the covariance between them is zero and there is no autocorrelation: Cov(μi, μj) = 0 for all i ≠ j.
iv) The explanatory variables are fixed in repeated sampling.
v) All error terms have equal variance (i.e.) they are homoscedastic. Var(μi) = σ2
vi) The number of observations is greater than the number of regression coefficients to be estimated.
n > k + 1 where k is the number of explanatory variables.
vii) There is no multicollinearity between the explanatory variables; that is, the explanatory variables do not share a perfect linear relationship.
Simple linear regression is a model that assesses the relationship between a dependent variable and an
independent variable. In bivariate data there are two variables and therefore, there are two regression
lines.
The simple linear model Y on X is expressed using the following equation:
Y = β0 + β1X
Where:
Y – Dependent variable
X – Independent (explanatory) variable
β0 – Intercept
β1– Slope
The expression β0 + β1X is the deterministic component of the simple linear regression model, which
can be thought of as the expected value of Y for a given value of X. In other words, conditional on X,
E(Y) = β0 + β1X. The slope parameter β1 determines whether the linear relationship between X and
E(Y) is positive (β1 > 0) or negative (β1 < 0); β1 = 0 indicates that there is no linear relationship between
X and E(Y). The figure below shows the deterministic portion of the simple linear regression model for
various values of the intercept β0 and the slope β1 parameters.
FIGURE: Various examples of a simple linear regression model
The least-squares estimates for the regression of Y on X are
β0 = ȳ − β1x̄
β1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
and for the regression of X on Y,
β0 = x̄ − β1ȳ
β1 = Σ(yi − ȳ)(xi − x̄) / Σ(yi − ȳ)²
x̄ = 113/7 = 16.14
ȳ = 182/7 = 26
β1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = 248/158.857 = 1.56
β0 = ȳ − β1x̄ = 26 − 1.56 × 16.14 = 0.82
Therefore, regression line Y on X is
Y = 0.82 + 1.56 X
The estimated slope coefficient of 1.56 suggests a positive relationship between Y and X. The
estimated intercept coefficient of 0.82 suggests that if X equals zero, then predicted Y = 0.82.
Deriving Regression Line of X on Y:
β1 = Σ(yi − ȳ)(xi − x̄) / Σ(yi − ȳ)² = 248/456 = 0.544
β0 = x̄ − β1ȳ = 16.14 − 0.544 × 26 = 1.996
hence, equation is
X = 1.996 + 0.544 Y
The estimated slope coefficient of 0.544 suggests a positive relationship between X and Y. The estimated intercept coefficient of 1.996 suggests that if Y equals zero, then predicted X = 1.996.
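Both regression lines can be recovered from the summary statistics used above (x̄ = 16.14, ȳ = 26, Σ(x − x̄)(y − ȳ) = 248, Σ(x − x̄)² = 158.857, Σ(y − ȳ)² = 456); a Python sketch, with the function name illustrative:

```python
def regression_line(mean_x, mean_y, s_xy, s_xx):
    """Slope and intercept for regressing the second variable on the first:
    slope = Sxy / Sxx, intercept = mean_y - slope * mean_x."""
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope

# summary statistics of the worked example above
b0_yx, b1_yx = regression_line(16.14, 26, 248, 158.857)  # Y on X
b0_xy, b1_xy = regression_line(26, 16.14, 248, 456.0)    # X on Y
```

This reproduces Y = 0.82 + 1.56X and X = 1.996 + 0.544Y (up to rounding).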
PROPERTIES OF REGRESSION COEFFICIENTS
1. Correlation coefficient is the geometric mean between the regression coefficients.
2. It is clear from property 1 that both regression coefficients must have the same sign, i.e., both will be positive or both negative.
3. If one of the regression coefficients is greater than unity, the other must be less than unity.
4. The correlation coefficient will have the same sign as that of the regression coefficients.
5. Arithmetic mean of the regression coefficients is greater than the correlation coefficient.
6. Regression coefficients are independent of the change of origin but not of scale.
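The course pairs these formulas with free statistical software, so a short Python sketch can fit both regression lines and verify property 1. The small dataset below is hypothetical and chosen only for illustration (the lesson's raw data table is not reproduced here):

```python
# Sketch: both least-squares regression lines, plus a check of property 1
# (the correlation coefficient is the geometric mean of the two slopes).
from math import sqrt

x = [2, 4, 6, 8]            # hypothetical data
y = [3, 7, 5, 10]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

b_yx = sxy / sxx              # slope of the regression line Y on X
b_xy = sxy / syy              # slope of the regression line X on Y
a_yx = y_bar - b_yx * x_bar   # intercept of the line Y on X
r = sxy / sqrt(sxx * syy)     # correlation coefficient

# Property 1: |r| equals the geometric mean of the two regression coefficients.
assert abs(abs(r) - sqrt(b_yx * b_xy)) < 1e-12
```

Note that both slopes share the numerator ∑(xi − x̄)(yi − ȳ); only the denominator changes, which is why their product recovers r².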
For a multiple regression model with two independent variables,
y = β0 + β1x1 + β2x2
the intercept is estimated as β0 = ȳ − β1x̄1 − β2x̄2.
Summary
In regression analysis, we can predict or estimate the value of one variable from the given value of
the other variable. Regression explains the functional form of two variables one as dependent variable
and other as independent variable.
The regression analysis confined to the study of only two variables at a time is termed as simple
regression. The regression analysis for studying more than two variables at a time is known as multiple
regression.
If the bivariate data are plotted on a graph paper, a scatter diagram is obtained which indicates some
relationship between two variables. The dots of scatter diagram tend to concentrate around a curve.
This curve is known as regression curve and its functional form is called regression equation.
The geometric mean of the regression coefficients is the coefficient of correlation.
Arithmetic mean of the regression coefficients is greater than or equal to coefficient of correlation.
In multiple regression model, we assume that a linear relationship exists between some variable y,
which we call the dependent variable, and k independent variables x1, x2, ..., xk.
Fill in the blanks:
1. Regression analysis confined to the study of only two variables at a time is termed as
...........................
2. Regression line of x on y is denoted by ...........................
3. Functional form of regression curve is called ...........................
4. If both the lines of regression coincide, there is a ........................... between the variables.
5. Regression analysis is an ........................... measure.
State whether the following statements are True or False:
6. Arithmetic mean of the regression coefficients is less than coefficient of correlation.
7. If coefficient of regression r = 0, then there is no relationship between the two variables.
8. Regression coefficients are dependent upon change of origin.
9. Regression is a mathematical measure showing the average relationship between two variables.
10. Spearman’s Rank correlation method can be used when data are irregular
1. simple regression
2. x = a + by
3. regression equation
4. perfect correlation
5. absolute
6. False
7. True
8. False
9. True
10. True
******
Lesson 5
INTRODUCTION TO PROBABILITY
Managers need to cope with uncertainty in many decision-making situations. For example, a manager
cannot assume that the volume of sales in the coming year is known to him exactly; he may know
roughly what next year's sales will be, but he cannot give the exact number. There is some uncertainty.
Concepts of probability help the manager to measure this uncertainty and perform the associated
analyses. This chapter provides the conceptual framework of probability and the various probability
rules that are essential in business decisions.
Probability theory was originated from gambling theory. A large number of problems exist even
today which are based on the game of chance, such as coin tossing, dice throwing and playing cards.
A probability is a numerical value that measures the likelihood that an event occurs. This value is
between zero and one, where a value of zero indicates an impossible event and a value of one
indicates a definite event.
Consider, for example, the toss of a coin. The result of a toss can be a head or a tail, therefore, it is
a random experiment. Here we know that either a head or a tail would occur as a result of the toss,
however, it is not possible to predetermine the outcome. With the use of probability theory, it is
possible to assign a quantitative measure, to express the extent of uncertainty, associated with the
occurrence of each possible outcome of a random experiment.
In order to define an event and assign the appropriate probability to it, it is useful to first establish
some terminology and impose some structure on the situation.
Some important terms & concepts:
1. Event:
The occurrence or non-occurrence of a phenomenon is called an event. For example, in the toss of
two coins, there are four exhaustive outcomes, i.e. (H, H), (H, T), (T, H), (T, T). The events
associated with this experiment can be defined in a number of ways. For example, (i) the event of
occurrence of head on both the coins, (ii) the event of occurrence of head on at least one of the two
coins, (iii) the event of non-occurrence of head on the two coins, etc
2. Random Experiments:
Experiments of any type where the outcome cannot be predicted are called random experiments.
3. Sample Space:
A set of all possible outcomes from an experiment is called a sample space.
Eg: Consider a random experiment E of throwing 2 coins at a time. The possible outcomes are HH,
TT, HT, TH.
These 4 outcomes constitute a sample space denoted by, S ={ HH, TT, HT, TH}.
4. Trial & Event:
Consider an experiment of throwing a coin. When tossing a coin, we may get a head (H) or a tail (T).
Here the tossing of the coin is a trial and getting a head or a tail is an event.
In other words, “every non-empty subset A of the sample space S is called an event”.
5. Independent & Dependent Events:
Two events are said to be independent when the actual happening of one does not influence in any
way the happening of the other. Events which are not independent are called dependent events.
Eg: If we draw a card from a pack of well-shuffled cards and then draw a second card from the rest of
the pack (containing 51 cards), the second draw is dependent on the first. If, on the other hand, we
draw the second card after replacing the first card drawn, the second draw is independent of the first.
Example: A snowboarder competing in the Winter Olympic Games is trying to assess her
probability of earning a medal in her event, the ladies’ halfpipe. Construct the appropriate sample
space.
SOLUTION: The athlete’s attempt to predict her chances of earning a medal is an experiment
because, until the Winter Games occur, the outcome is unknown. We formalize an experiment by
constructing its sample space. The athlete’s competition has four possible outcomes: gold medal,
silver medal, bronze medal, and no medal. We formally write the sample space as S = {gold, silver,
bronze, no medal}.
Definitions
Classical Definition:
This definition, also known as the mathematical definition of probability or classical or a priori
definition of probability, was given by J. Bernoulli. With the use of this definition, the probabilities
associated with the occurrence of various events are determined by specifying the conditions of a
random experiment.
If n is the number of equally likely, mutually exclusive and exhaustive outcomes of a random
experiment out of which m outcomes are favourable to the occurrence of an event A, then the
probability that A occurs, denoted by P(A), is given by:
P(happening of an event E) = (number of ways of achieving success) / (total number of possible outcomes)
= m/n
Where m = Number of favourable cases
n = Total number of exhaustive cases.
Example:
1. In tossing a coin, what is the probability of getting a head.
Solution: Total no. of events = {H, T} = 2
Favourable event = {H} = 1
Probability = Number of favourable cases / Total number of exhaustive cases = 1/2
2. In throwing a die, the probability of getting 2.
Sol: Total no. of events = {1,2,3,4,5,6}= 6
Favourable event = {2}= 1
Probability = Number of favourable cases / Total number of exhaustive cases = 1/6
3. In throwing two dice, what is the probability of getting a total of 7?
Sol: Total no. of outcomes = 6 × 6 = 36
Number of ways of getting 7 is, (1,6), (2,5), (3,4), (4,3), (5,2), (6,1) = 6
Probability = 𝑁𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑓𝑎𝑣𝑜𝑢𝑟𝑎𝑏𝑙𝑒 𝑐𝑎𝑠𝑒𝑠/𝑇𝑜𝑡𝑎𝑙 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑒𝑥ℎ𝑎𝑢𝑠𝑡𝑖𝑣𝑒 𝑐𝑎𝑠𝑒𝑠
= 6/36
= 1/6
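The two-dice calculation above can also be checked by brute-force enumeration of the 36-outcome sample space; the sketch below uses Python's fractions module to keep the classical probability m/n exact:

```python
# Sketch: the classical (a priori) probability m/n by enumerating
# the full sample space for a throw of two dice.
from fractions import Fraction

outcomes = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
favourable = [o for o in outcomes if sum(o) == 7]   # totals of 7

p = Fraction(len(favourable), len(outcomes))        # m/n
print(p)  # 1/6
```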
4. A bag contains 6 red & 7 black balls. Find the Probability of drawing a red ball.
5. Find the Probability of a card drawn at random from an ordinary pack, is a diamond.
6. From a pack of 52 cards, 1 card is drawn at random. Find the Probability of getting a queen.
The Relative Frequency Approach:
Under this approach, the probability of an event is its relative frequency of occurrence, P = a/n. Here,
‘a’ is the actual number of times the event occurred and ‘n’ is the total number of observations or trials.
For instance, during the last calendar year there were 50 births at a local hospital. If 32 of the new
arrivals were baby girls, the relative frequency approach reveals that the probability of a baby girl
being born in that locality is
P(girl) = a/n = (No. of girls born last year) / (Total no. of births) = 32/50 = 0.64
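The relative-frequency approach can also be illustrated by simulation: as the number of trials grows, the ratio a/n settles near the true probability. A minimal sketch for a fair coin (the seed and trial count are arbitrary choices, used only to make the run reproducible):

```python
# Sketch: estimating a probability as a relative frequency a/n
# from repeated simulated trials of a fair coin toss.
import random

random.seed(42)   # arbitrary seed for reproducibility
n = 10_000
heads = sum(1 for _ in range(n) if random.random() < 0.5)

p_head = heads / n
print(round(p_head, 2))  # close to the theoretical 0.5
```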
The Subjective Approach:
The subjective approach requires the assignment of the probability of some event on the basis of the
best available evidence. Suppose we consider the next cricket match that India and Australia will play.
What is the probability that India will win? With the subjective method of assigning probabilities to the
experimental outcomes, we may use any data available, our experience, intuition, personal judgment,
etc. Subjective probabilities are assignments of numerical values measuring an individual’s degree of
confidence in the occurrence of particular events, or in the truth of particular propositions, after
account has been taken of all the relevant information.
Illustrations of subjective probability are:
1. Estimating the likelihood the New England Patriots will play in the Super Bowl next year.
2. Estimating the probability General Motors Corp. will lose its number 1 ranking in total units sold to
Ford Motor Co. or DaimlerChrysler within 2 years.
3. Estimating the likelihood you will earn an A in this course.
The types of probability are summarized in Chart 1 below. A probability statement always assigns a
likelihood of an event that has not yet occurred. There is, of course, a considerable latitude in the
degree of uncertainty that surrounds this probability, based primarily on the knowledge possessed by
the individual concerning the underlying process. The individual possesses a great deal of knowledge
about the toss of a die and can state that the probability that a one-spot will appear face up on the toss
of a true die is one-sixth. But we know very little concerning the acceptance in the marketplace of a
new and untested product.
Definitions and Notation
• The complement of an event is the event not occurring. The probability that Event A will not
occur is denoted by P(A').
• The probability that Event A occurs, given that Event B has occurred, is called a conditional
probability. The conditional probability of Event A, given Event B, is denoted by the symbol
P(A|B).
• The probability that Events A and B both occur is the probability of the intersection of A and B.
The probability of the intersection of Events A and B is denoted by P(A ∩ B). If Events A and B
are mutually exclusive, P(A ∩ B) = 0. The intersection of two events, denoted A ∩ B, is the event
consisting of all outcomes in both A and B. The figure depicts the intersection of two events A and B. A
useful way to illustrate these concepts is through the use of a Venn diagram, named after the British
mathematician John Venn (1834–1923).The intersection A ∩ B is the portion in the Venn diagram
that is included in both A and B.
Union Probability: Union probability is denoted P (E1 U E2) or P (A or B), where E1 and E2 are two
events. P (E1 U E2) is the probability that E1 will occur or that E2 will occur or both E1 and E2 will
occur. In a company, the probability that a person is male or a clerical worker is a union probability.
A person qualifies for the union by being male, or by being a clerical worker, or by being both (a male
clerical worker).
Complement Of Event
The complement of event A, denoted Ac, is the event consisting of all outcomes in the sample space
S that are not in A. In Figure given below, Ac is everything in S that is not included in A.
A snowboarder competing in the Winter Olympic Games is trying to assess her probability of
earning a medal in her event, the ladies’ halfpipe. Construct the appropriate sample space.
Now suppose the snowboarder defines the following three events:
A = {gold, silver, bronze}; that is, event A denotes earning a medal;
B = {silver, bronze, no medal}; that is, event B denotes earning at most a silver medal; and
C = {no medal}; that is, event C denotes failing to earn a medal.
a. Find A ∪ B and B ∪ C.
b. Find A ∩ B and A ∩ C.
c. Find Bc.
SOLUTION: The athlete’s attempt to predict her chances of earning a medal is an experiment because,
until the Winter Games occur, the outcome is unknown. We formalize an experiment by constructing
its sample space. The athlete’s competition has four possible outcomes: gold medal, silver medal,
bronze medal, and no medal. We formally write the sample space as
S = {gold, silver, bronze, no medal}.
a. The union of A and B denotes all outcomes in A or B (or both); here, the event A ∪ B = {gold,
silver, bronze, no medal}. Note that there is no double counting of the outcomes “silver” or “bronze”
in A ∪ B. Similarly, we have the event B ∪ C = {silver, bronze, no medal}.
b. The intersection of A and B denotes all outcomes common to A and B; here, the event A ∩ B =
{silver, bronze}. The event A ∩ C = Ø, where Ø denotes the null (empty) set; no common outcomes
appear in both A and C.
c. The complement of B denotes all outcomes in S that are not in B; here, the event Bc = {gold}.
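Python's built-in set type mirrors these operations directly, so the snowboarder's events can be checked mechanically. A small sketch of parts a–c:

```python
# Sketch: the snowboarder's sample space and events as Python sets.
S = {"gold", "silver", "bronze", "no medal"}   # sample space
A = {"gold", "silver", "bronze"}               # A: earning a medal
B = {"silver", "bronze", "no medal"}           # B: at most a silver medal
C = {"no medal"}                               # C: failing to earn a medal

print(A | B == S)       # union A ∪ B covers the whole sample space
print(A & B)            # intersection: silver and bronze
print(A & C == set())   # A ∩ C is empty: A and C are mutually exclusive
print(S - B)            # complement of B: gold only
```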
The two properties of probability
1. The probability of any event A is a value between 0 and 1; that is,
0 ≤ P(A) ≤ 1.
2. The sum of the probabilities of any list of mutually exclusive and exhaustive events equals 1.
THE ADDITION RULE
The addition rule states that the probability that A or B occurs, or that at least one
of these events occurs, is equal to the probability that A occurs, plus the probability
that B occurs, minus the probability that both A and B occur. Equivalently,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
EXAMPLE
Anthony feels that he has a 75% chance of getting an A in Statistics and a 55% chance of getting an A
in Managerial Economics. He also believes he has a 40% chance of getting an A in both classes.
a. What is the probability that he gets an A in at least one of these courses?
b. What is the probability that he does not get an A in either of these courses?
SOLUTION:
a. Let P(AS) correspond to the probability of getting an A in Statistics and P(AM) correspond to the
probability of getting an A in Managerial Economics. Thus, P(AS) = 0.75 and P(AM) = 0.55. In
addition, there is a 40% chance that Anthony gets an A in both classes; that is, P(AS ∩ AM) = 0.40. In
order to find the probability that he receives an A in at least one of these courses, we calculate
P(AS ∪ AM) = P(AS) + P(AM) − P(AS ∩ AM) = 0.75 + 0.55 − 0.40 = 0.90.
b. The probability that he does not receive an A in either of these two courses is actually the
complement of the union of the two events; that is, P((AS ∪ AM)c).
We calculated the union in part a, so using the complement rule we have
P((AS ∪ AM)c) = 1 − P(AS ∪ AM) = 1 − 0.90 = 0.10.
An alternative expression that correctly captures the required probability is P(ASc ∩ AMc), which
is the probability that he does not get an A in Statistics and he does not get an A in Managerial
Economics. A common mistake is to calculate the probability as 1 − P(AS ∩ AM) = 1 − 0.40 = 0.60,
which simply indicates that there is a 60% chance that Anthony will not get an A in both courses. This
is clearly not the required probability that Anthony does not get an A in either course.
THE ADDITION RULE FOR MUTUALLY EXCLUSIVE EVENTS
If A and B are mutually exclusive events, then P(A ∩ B) = 0 and, therefore, the addition rule simplifies
to P(A ∪ B) = P(A) + P(B).
EXAMPLE
Samantha Greene, a college senior, contemplates her future immediately after graduation. She thinks
there is a 25% chance that she will join the Peace Corps and teach English in Madagascar for the next
few years. Alternatively, she believes there is a 35% chance that she will enroll in a full-time law
school program in the United States.
a. What is the probability that she joins the Peace Corps or enrolls in law school?
b. What is the probability that she does not choose either of these options?
SOLUTION:
a. We can write the probability that Samantha joins the Peace Corps as P(A) = 0.25 and the probability
that she enrolls in law school as P(B) = 0.35.
Immediately after college, Samantha cannot choose both of these options. This implies that these
events are mutually exclusive, so P(A ∩ B) = 0. Thus, when solving for the probability that Samantha
joins the Peace Corps or enrolls in law school, P(A ∪ B), we can simply sum P(A) and P(B):
P(A ∪ B) = P(A) + P(B) = 0.25 + 0.35 = 0.60.
b. In order to find the probability that she does not choose either of these options, we need to recognize
that this probability is the complement of the union of the two events; that is, P((A ∪ B)c). Therefore,
using the complement rule, we have
P((A ∪ B)c) = 1 − P(A ∪ B) = 1 − 0.60 = 0.40.
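Both worked examples follow the same two-step pattern (addition rule, then complement of the union), which can be sketched in a few lines of Python using Anthony's numbers:

```python
# Sketch: the addition rule and the complement of the union,
# using the probabilities from the Anthony example.
p_as = 0.75      # P(A in Statistics)
p_am = 0.55      # P(A in Managerial Economics)
p_both = 0.40    # P(A in both courses)

p_union = p_as + p_am - p_both   # addition rule: P(AS ∪ AM)
p_neither = 1 - p_union          # complement rule: P((AS ∪ AM)c)

print(round(p_union, 2))    # 0.9
print(round(p_neither, 2))  # 0.1
```

For mutually exclusive events (as in the Samantha example), `p_both` is simply 0 and the same code reduces to P(A) + P(B).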
Conditional Probability
Given two events A and B, each with a positive probability of occurring, the probability that A occurs
given that B has occurred (A conditioned on B) is P(A | B) = P(A ∩ B)/P(B). Similarly, the probability
that B occurs given that A has occurred (B conditioned on A) is P(B | A) = P(A ∩ B)/P(A).
If A represents “finding a job” and B represents “prior work experience,” then P(A) = 0.80 and the
conditional probability is denoted as P(A | B) = 0.90. The vertical mark | means “given that,” and the
conditional probability is typically read as “the probability of A given B.” In this example, the
probability of finding a suitable job increases from 0.80 to 0.90 when conditioned on prior work
experience. In general, the conditional probability, P(A | B), is greater than the unconditional
probability, P(A), if B exerts a positive influence on A.
Similarly, P(A | B) is less than P(A) when B exerts a negative influence on A. Finally, if
B exerts no influence on A, then P(A | B) equals P(A). It is common to refer to “unconditional
probability” simply as “probability.”
Example:
Of the cars on a used car lot, 70% have air conditioning (AC) and 40% have a CD player (CD). 20%
of the cars have both. What is the probability that a car has a CD player, given that it has AC ?
Solution:
P(CD | AC) = P(CD ∩ AC)/P(AC) = 0.2/0.7 = 0.2857
Note: Given AC, we only consider the 70% of the cars that have AC. Of all cars, 20% have both
features, and 0.2/0.7 is about 28.57%.
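The same division can be done in code; a short Python sketch of the used-car calculation:

```python
# Sketch: conditional probability P(CD | AC) = P(CD ∩ AC) / P(AC)
# for the used-car-lot example.
p_ac = 0.70          # P(car has air conditioning)
p_cd_and_ac = 0.20   # P(car has both a CD player and AC)

p_cd_given_ac = p_cd_and_ac / p_ac
print(round(p_cd_given_ac, 4))  # 0.2857
```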
Independent Events:
Two events A and B are independent if the occurrence of one does not alter the probability of the
other; in that case,
P(A and B) = P(A) × P(B)
Example: Suppose that for a given year there is a 2% chance that your desktop computer will crash
and a 6% chance that your laptop computer will crash. Moreover, there is a 0.12% chance that both
computers will crash. Is the reliability of the two computers independent of each other?
SOLUTION: Let event D represent the outcome that your desktop crashes and event L represent the
outcome that your laptop crashes. Therefore, P(D) = 0.02, P(L) = 0.06, and P(D ∩ L) = 0.0012. The
reliability of the two computers is independent because
P(D | L) = P(D ∩ L)/P(L) = 0.0012/0.06 = 0.02 = P(D).
In other words, if your laptop crashes, it does not alter the probability that your desktop also crashes.
Equivalently,
P(L | D) = P(D ∩ L)/P(D) = 0.0012/0.02 = 0.06 = P(L).
Multiplication rule for two events A and B:
P(A and B) = P(B) × P(A | B)
If A and B are independent events, then the probability that A and B both occur equals the product of
the probability of A and the probability of B; that is,
P(A ∩ B) = P(A)P(B).
Example: The probability of passing the MBA exam is 0.50 for person A and 0.80 for person B. The
prospect of person A passing the exam is completely unrelated to person B's success on the exam.
a. What is the probability that both person A and B pass the exam?
b. What is the probability that at least one of them passes the exam?
SOLUTION: We can write the probabilities that person A passes the exam and that person B passes
the exam as P(J) = 0.50 and P(L) = 0.80, respectively.
a. Since we are told that person A's chances of passing the exam are not influenced by person B's
success, we can conclude that these events are independent, so P(J) = P(J | L) = 0.50 and
P(L) = P(L | J) = 0.80. Thus, when solving for the probability that both persons pass the exam,
we calculate the product of the probabilities:
P(J ∩ L) = P(J) P(L) = 0.50 × 0.80 = 0.40.
b. We calculate the probability that at least one of them passes the exam as
P(J ∪ L) = P(J) + P(L) − P(J ∩ L) = 0.50 + 0.80 − 0.40 = 0.90.
Total Probability Rule: If an event D can occur in conjunction with one of the n mutually exclusive
and exhaustive events X1, X2, ..., Xn, and the marginal probabilities P(Xi) and conditional
probabilities P(D/Xi) are known for all i, then the marginal probability of D is defined as:
P (D) = P (X1∩D) + P(X2∩D) + -----------------+ P(Xn∩D)
= P(X1)* P(D/X1) + P(X2)* P(D/X2) +-----------P(Xn)*P(D/Xn)
Since D could have resulted due to X1 or X2 or ... or Xn, we obtain the probability of D by relating D
with X1, X2, ..., Xn.
An export house manager purchases cotton shirts for an export consignment from two designers.
Suppose Designer A produced 65% of the shirts and Designer B produced 35%. 8% of the shirts
produced by A were defective and 12% of Designer B's shirts were defective. On a particular day, a
normal shipment arrives from both designers and the contents get mixed up. A shirt is chosen at
random for a quality check by the authority and is found to be defective, and thus the consignment is
rejected. Now what is the probability that Designer A produced the shirt? That Designer B produced
the shirt? The exporter requires this decision to take corrective measures to reduce the probability of
rejection of his consignments in future.
Here we are given data on the percentage of the shirts produced by the two designers which provides
basis to know the likelihood that a randomly selected shirt is produced by a particular Designer. The
given information shows that if we let,
X1 represent the event that designer A produced the shirt
X2 represent the event that designer B produced the shirt
Then from the given data we have:
P (X1) = .65, P (X2) = .35
These values are called Prior Probabilities. They are so called because they are established prior to
the empirical evidence about the quality of the shirt. As per the quality of shirts the conditional
probabilities show:
The probability that the shirt is defective on the condition that it is produced by Designer A is P(D/
X1) = .08. Similarly
P (D/X2) = .12
From the given data we know P (X1) = .65. The new information that the shirt is defective changes
the probabilities. With this additional information, we can revise the probability the shirt was produced
by Designer A. Let D be the event that the tested piece is defective & we want to know the probability
that the defective shirt came from Designer A. We now want to determine P(X1/D) not just P
(X1).Thus, after the additional information/ experiment is performed, we replace P(Xj) by P(Xj/D).
Recalling the rule of conditional probability:
P(X1/D) = P(X1 ∩ D)/P(D) = P(X1) · P(D/X1) / P(D)        (1)
However P (D) is not readily discernible. This is where Bayes’ theorem comes in. There are two ways
the shirt may be defective. It can come from Designer A & be defective ‘ or it can come from Designer
B & be defective. Using the rule of addition given above
P (D) = P (X1∩D) + P (X2∩D)
= P (X1). P(D/ X1) + P(X2). P (D/X2). (2)
Substituting P(D) from (2) in the denominator of the conditional probability formula (1), Bayes’
theorem tells us ;
P(Xj/D) = P(Xj ∩ D) / [P(X1 ∩ D) + P(X2 ∩ D) + ... + P(Xn ∩ D)]
Thus, in the above example, the probability was .65 that the shirt was produced by designer A & .35
that it came from designer B. These are called prior probabilities because they are based on the original
information. The new information that the shirt is defective changes the probabilities: given the
evidence of a defective shirt, the probability of X1 is revised downward to 0.553 and that of X2 is
revised upward to 0.447.
One way to lay out a revision of probabilities problem (The Bayes’ Rule) is to use a table. Table below
shows the analysis for the shirts problem.
Event        Prior P(Xi)    Conditional P(D/Xi)   Joint P(Xi ∩ D)        Posterior P(Xi/D)
Designer A   P(X1) = 0.65   P(D/X1) = 0.08        0.65 × 0.08 = 0.052    0.052 ÷ 0.094 = 0.553
Designer B   P(X2) = 0.35   P(D/X2) = 0.12        0.35 × 0.12 = 0.042    0.042 ÷ 0.094 = 0.447
                                                  P(defective) = 0.094
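The tabular layout of Bayes' Rule translates naturally into code: compute the joint probabilities, sum them to get P(D), then divide. A Python sketch of the shirt example:

```python
# Sketch: Bayes' theorem for the defective-shirt example,
# following the same prior -> joint -> posterior layout as the table.
priors = {"A": 0.65, "B": 0.35}        # prior probabilities P(Xi)
p_def_given = {"A": 0.08, "B": 0.12}   # conditional probabilities P(D | Xi)

joints = {d: priors[d] * p_def_given[d] for d in priors}   # P(Xi ∩ D)
p_d = sum(joints.values())                                 # total probability P(D)
posteriors = {d: joints[d] / p_d for d in joints}          # P(Xi | D)

print(round(p_d, 3))              # 0.094
print(round(posteriors["A"], 3))  # 0.553
print(round(posteriors["B"], 3))  # 0.447
```

The same dictionaries extend directly to any number of designers X1, ..., Xn.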
Practice Questions:
State whether the following statements are true or false:
1. The concept of probability originated from the analysis of the games of chance in the
17th century.
2. The theory of probability is a study of Statistical or Random Experiments.
3. It is the backbone of Statistical Inference and Decision Theory that are essential tools of the
analysis of most of the modern business and economic problems.
4. A phenomenon or an experiment which can result into more than one possible outcome,
is called a random phenomenon or random experiment or statistical experiment.
5. The result of a toss of a coin can be a head or a tail; thus, it is a non-random experiment.
Classification of Probability Distribution
Binomial Distribution
Meaning & Definition:
Binomial Distribution is associated with James Bernoulli, a Swiss Mathematician. Therefore, it is also
called Bernoulli distribution. Binomial distribution is the probability distribution expressing the
probability of one set of dichotomous alternatives, i.e., success or failure. In other words, it is used to
determine the probability of success in experiments on which there are only two mutually exclusive
outcomes. Binomial distribution is discrete probability distribution. Binomial Distribution can be
defined as follows: “A random variable x is said to follow a Binomial Distribution with parameters n
and p if its probability distribution function is given by:
P(x) = nCx p^x q^(n−x)
Where, p = probability of success in a single trial
q = 1 – p
n = number of trials
x = number of successes in ‘n’ trials.
Assumption or Conditions for application of Binomial Distribution
Binomial distribution can be applied when:-
1. The random experiment has two outcomes i.e., success and failure.
2. The probability of success in a single trial remains constant from trial to trial of the experiment.
3. The experiment is repeated for finite number of times.
4. The trials are independent.
Properties (features) of Binomial Distribution:
1. It is a discrete probability distribution.
2. The shape and location of Binomial distribution changes as ‘p’ changes for a given ‘n’.
3. The mode of the Binomial distribution is equal to the value of ‘x’ which has the largest probability.
4. Mean of the Binomial distribution increases as ‘n’ increases with ‘p’ remaining constant.
5. The mean of Binomial distribution is np.
7. If ‘n’ is large and if neither ‘p’ nor ‘q’ is too close zero, Binomial distribution may be approximated
to Normal Distribution.
8. If two independent random variables follow Binomial distribution, their sum also follows Binomial
distribution.
Working Rules for Solving Problems
I. Make sure that the trials in the random experiment are independent and each trial result in
either ‘success’ or ‘failure’.
II. Define the binomial variable and find the values of n and p from the given data. Also find
q by using: q = 1 – p.
III. Put the values of n, p and q in the formula:
P(x successes) = nCx p^x q^(n−x), x = 0, 1, 2, ......, n
IV. Express the event, whose probability is desired, in terms of values of the binomial variable
x. Use formula to find the required probability.
Question: Six coins are tossed simultaneously. What is the probability of obtaining 4 heads?
Solution: The pmf of the Binomial distribution is:
P(x) = nCx p^x q^(n−x)
Here x = 4, n = 6, p = ½, q = 1 – p = 1 – ½ = ½
∴ P(x = 4) = 6C4 (½)^4 (½)^(6−4) = 15 × (1/64) = 15/64 = 0.234
Question: Suppose that 5% of the LED light bulbs produced by a factory are defective, and a random
sample of 10 bulbs is selected. What is the probability that:
I. None of the LED light bulbs is defective?
II. Exactly one of the LED light bulbs is defective?
III. Two or fewer LED light bulbs are defective?
IV. Three or more of the LED light bulbs are defective?
Solution:
Here p = 0.05, q = 1- p = 1- 0.05 = 0.95 and n = 10,
Applying pmf P(x) = nCx px qn-x
We get
1. P(X = 0) = 0.5987
2. P(X = 1) = 0.3151
3. P(X ≤ 2) = P(X=0)+ P(X=1) + P(X=2) = 0.9885
4. P(X ≥ 3) = P(X=3) + P(X=4) + …+ P(X=10)
= 1 – P(X<3)
= 1 – [P(X=0)+ P(X=1) + P(X=2 ]
= 0.0115
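These LED-bulb figures can be reproduced with a short Python sketch built on the standard library's math.comb; the helper function binom_pmf is our own, not from the lesson:

```python
# Sketch: binomial pmf via math.comb, reproducing the LED-bulb
# probabilities for n = 10 trials with p = 0.05.
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for a binomial random variable: nCx * p^x * q^(n-x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.05
p0 = binom_pmf(0, n, p)                             # P(X = 0)
p_le_2 = sum(binom_pmf(x, n, p) for x in range(3))  # P(X <= 2)
p_ge_3 = 1 - p_le_2                                 # P(X >= 3) by complement

print(round(p0, 4))      # 0.5987
print(round(p_le_2, 4))  # 0.9885
print(round(p_ge_3, 4))  # 0.0115
```

The complement trick in the last line mirrors the lesson's working: summing P(X = 3) through P(X = 10) directly would give the same answer with more effort.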
Example: In the United States, about 30% of adults have four-year college degrees. Suppose five
adults are randomly selected.
a. What is the probability that none of the adults has a college degree?
b. What is the probability that no more than two of the adults have a college degree?
c. What is the probability that at least two of the adults have a college degree?
d. Calculate the expected value, the variance, and the standard deviation of this binomial distribution.
e. Graphically depict the probability distribution and comment on its symmetry/skewness.
SOLUTION: First, this problem satisfies the conditions for a Bernoulli process with a random
selection of five adults, n = 5. Here, an adult either has a college degree, with probability p = 0.30, or
does not have a college degree, with probability 1 − p = 1 − 0.30 = 0.70. Given a large number of
adults, it fulfills the requirement that the probability that an adult has a college degree stays the same
from adult to adult.
a. In order to find the probability that none of the adults has a college degree, we
let x = 0 and find
P(X = 0)
= 5C0 (0.30)^0 (0.70)^(5−0)
= [5!/(0!(5−0)!)] × (0.30)^0 × (0.70)^5
= 1 × 1 × 0.1681
= 0.1681.
In other words, from a random sample of five adults, there is a 16.81% chance that none of the adults
has a college degree.
b. We find the probability that no more than two adults have a college degree as
P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2).
We have already found P(X = 0) from part a. So we now compute P(X = 1)
and P(X = 2):
P(X = 1)
= 5C1 (0.30)^1 (0.70)^(5−1)
= [5!/(1!(5−1)!)] × (0.30)^1 × (0.70)^4
= 0.3602
P(X = 2)
= 5C2 (0.30)^2 (0.70)^(5−2)
= [5!/(2!(5−2)!)] × (0.30)^2 × (0.70)^3
= 0.3087
Next we sum the three relevant probabilities and obtain
P(X ≤ 2) = 0.1681 + 0.3602 + 0.3087 = 0.8370. From a random sample of five adults, there is an 83.7%
likelihood that no more than two of them will have a college degree.
c. We find the probability that at least two adults have a college degree as
P(X ≥ 2) = P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5).
We can solve this problem by calculating and then summing each of the four probabilities, from P(X
= 2) to P(X = 5). A simpler method uses one of the key properties of a probability distribution, which
states that the sum of the probabilities over all values of X equals 1. Therefore, P(X ≥ 2) can be written
as 1 − [P(X = 0) + P(X = 1)]. We have already calculated P(X = 0) and P(X = 1) from parts a and b, so
P(X ≥ 2) = 1 − [P(X = 0) + P(X = 1)] = 1 − (0.1681 + 0.3602) = 0.4717.
From a random sample of five adults, there is a 47.17% likelihood that at least two adults will have a
college degree.
d. We use the simplified formulas to calculate the mean, the variance, and the standard deviation as
E(X) = np = 5 × 0.30 = 1.5 adults,
Var(X) = σ² = np(1 − p) = 5 × 0.30 × 0.70 = 1.05 (adults)², and
SD(X) = σ = √(np(1 − p)) = √1.05 ≈ 1.02 adults.
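Part (d) can be verified directly from the shortcut formulas; for part (e), a binomial distribution with p < 0.5 is skewed to the right, which a bar chart of the pmf would show. A brief numerical check (the variable names are ours):

```python
from math import sqrt

n, p = 5, 0.30
mean = n * p             # E(X) = np
var = n * p * (1 - p)    # Var(X) = np(1 - p)
sd = sqrt(var)           # SD(X) = sqrt(np(1 - p))

print(mean, round(var, 2), round(sd, 4))  # 1.5 1.05 1.0247
```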
Summary
• This distribution mainly deals with attributes. An attribute is either present or absent with respect
to the elements of a population.
• The random experiment is performed for a finite and fixed number of trials.
• Each trial must result in either “success” or “failure”.
• The probability of success in each trial is the same.
• As the number of trials (n) in the binomial distribution increases, the expected number of successes (np) also increases.
Fill in the blanks:
1. There are only two possible outcomes of each trial either ........................... or ...........................
2. Binomial variable counts the number of ........................... in a random experiment with trials
satisfying 4 conditions of binomial distribution.
3. Binomial distribution mainly deals with ...........................
4. In binomial distribution, the probabilities of 0 success, 1 success, 2 successes, ... n successes are the
1st, 2nd, 3rd ... (n + 1)th terms in ........................... of (q + p)n.
5. Binomial distribution can be applied only when the number of trials is ........................... and
...........................
Answers
1. success, failure
2. successes
3. attributes
4. binomial expansion
5. finite and fixed
Questions
1. Assume that on an average one telephone number out of 15 called between 2 P.M. and 3 P.M. on
week days is busy. What is the probability that if six randomly selected telephone numbers are called,
at least three of them will be busy?
2. A bag contains 10 balls each marked with one of the digits 0 to 9. If four balls are drawn successively
with replacement from the bag, what is the probability that none is marked with the digit ‘0’?
3. In a box containing 100 bulbs, 10 are defective. What is the probability that out of a sample of 5
bulbs (i) none is defective? (ii) exactly two are defective?
Lesson 6
Normal Probability Distribution
The normal distribution is considered the cornerstone of modern statistical theory. It is of considerable
importance in statistical theory, especially in statistical inferences related to population parameters. It
is widely used in practical problems related to height, weight, distance, production, scientific
measurements etc. Carl Friedrich Gauss (1777-1855) was the first to derive the precise
mathematical formula for the normal distribution; therefore, it is sometimes also called the ‘Gaussian
distribution’.
A continuous random variable X is said to have a normal distribution if its probability density function
(pdf) is given by the following equation
f(x) = (1/√(2πσ²)) e^(−(1/2)((x−μ)/σ)²),  −∞ < x < +∞
1. The mean (μ) and standard deviation (σ), or variance (σ²), are the two parameters of the normal
distribution. μ is the centre of the distribution and σ explains the spread of values around
the centre. The values of μ and σ determine the position and shape of the distribution. It is a family
of distributions where each member is differentiated by its combination of µ and σ.
2. The values of mean (measure of central value), median (that divides the series in two equal
parts) and mode (the most frequent value) coincide. If from the peak of the distribution we
draw a perpendicular on the horizontal axis (also called variable axis), the foot of the
perpendicular gives the value of mean, median and mode.
3. It is a unimodal distribution and the curve attains its maximum value at x = μ.
4. It is a bell shaped and symmetric curve (as shown in the Figure below). If the curve is folded
from the middle, the two halves will coincide. The shape of the curve to the left of the mean is a
mirror image of the shape to the right of the mean.
Symmetrical bell-shaped curve
5. The variability of the distribution is determined by the value of σ. The greater its value, the
higher is the spread of the distribution and the greater the width of the distribution. Two
normal distribution curves having the same mean but different standard deviations are shown in the
Figure below. We can see that both the distributions are symmetric and bell shaped. However,
observations in distribution A (with higher σ) are more dispersed, therefore the curve is flatter
and has thicker tails than distribution B (with lower σ).
6. The random variable X ranges from −∞ to +∞; the tails of the curve extend to infinity in both
the directions and never touch the x-axis.
7. This distribution being a probability density function, f(x) ≥ 0. And, we know that the
probability of a continuous random variable is given by the area under the curve, therefore the
total area under the curve is one, i.e.
P(−∞ < X < +∞) = ∫_{−∞}^{+∞} f(x) dx = ∫_{−∞}^{+∞} (1/√(2πσ²)) e^(−(1/2)((x−μ)/σ)²) dx = 1
8. The distribution is symmetrical about its mean value μ. This implies the area under the
curve on each side of the mean (μ) is 0.5, i.e. P(X ≥ μ) = P(X ≤ μ) = 0.5
9. The point of inflexion of the curve occurs at µ ± 1σ. It is a point where the curve changes its
curvature, to both of its sides.
10. Empirically, it is observed that the interval µ ± 1σ covers 68.27% of the total observations,
i.e., 68.27% of all observations lie within one standard deviation of the mean of the distribution.
Similarly, µ ± 2σ covers 95.45% and µ ± 3σ covers 99.73% of the total observations.
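These coverage figures can be verified from the standard normal CDF, which can be written with the error function from Python's standard library (`phi` is a helper of our own):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, P(Z <= z), via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

for k in (1, 2, 3):
    coverage = phi(k) - phi(-k)  # P(mu - k*sigma < X < mu + k*sigma)
    print(k, round(100 * coverage, 2))  # 68.27, 95.45, 99.73
```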
P(x1 < X < x2) = ∫_{x1}^{x2} f(x) dx = ∫_{x1}^{x2} (1/√(2πσ²)) e^(−(1/2)((x−μ)/σ)²) dx
The difficulty encountered in performing integral of normal density function, requires
tabulation of normal distribution curve areas. However, we must notice from the above
equation that the area included between two values of X depends not only on the values of x1
and x2 but also on the values of the two parameters of the distribution. As the value of any one
of the two parameters of the normal distribution function changes, the probability for given
values x1 and x2 of X would also change. This means, we would require as many tables as the
number of possible combinations of μ and σ2. Since there can be infinite combinations, it is not
practicable to prepare table for each of the possible combination. To deal with this, we
transform each normal random variable with a specified μ and σ2, into a standard normal
random variable, called the Z variable which has 0 mean and 1 standard deviation or variance.
This standardized variable is used to calculate the probabilities of all normal distributions,
irrespective of the values of μ and σ2.
The standard normal curve is also a bell-shaped with mean, median and mode coinciding at
0.
The Z-curve is centred at its mean value 0 and is symmetrical around it, i.e.,
P (-∞ ≤ Z ≤ 0) = P (Z ≤ 0) = P (0 ≤ Z ≤ + ∞) = P (Z ≥ 0) = 0.5
The point of inflexion of the Z curve occurs at ±1 (it should be noticed here that for the normal
distribution it occurs at µ ± 1σ; since for the Z curve μ = 0 and σ = 1, this gives 0 ± 1×1 = ±1).
The respective limits of z, covering 68.27%, 95.45% and 99.73% of area under the standard
normal distribution curve are ±1, ±2, and ±3.
APPLICATIONS OF NORMAL DISTRIBUTION
This distribution is applied to problems concerning:
1. calculation of the hit probability of a shot;
2. statistical inference in most branches of science;
3. calculation of errors made by chance in experimental measurements.
Example: If x is a normal variate with mean 30 and S.D. 5, find
(i) P(26≤x ≤ 40) (ii) P (x ≥ 45) (iii)P(1.52 ≤ Z ≤ 1.96) (iv) P(Z > 4)
Solution: Given x is a normal variate with
mean µ = 30 and S.D. σ = 5.
Let z be the standard normal variate; then
z = (x − μ)/σ = (x − 30)/5
(i) When x = 26, z = (26 − 30)/5 = −4/5 = −0.8
When x = 40, z = (40 − 30)/5 = 2
P(26 ≤ x ≤ 40) = P(−0.8 ≤ z ≤ 2)
= Area under the standard normal curve between z = −0.8 and z = 2
= 0.2881 + 0.4772 = 0.7653
(ii) When x = 45, z = (45 − 30)/5 = 3
P(x ≥ 45) = P(z ≥ 3)
= Area under the standard normal curve to the right of z = 3
= (Area to the right of z = 0) − (Area between z = 0 and z = 3)
= P(z ≥ 0) − P(0 ≤ z ≤ 3)
= 0.5 – 0.49865
= 0.00135.
(iii) As in part (i) and shown in the Figure below,
P(1.52 ≤ Z ≤ 1.96) = P(Z ≤ 1.96) − P(Z < 1.52) = 0.9750 − 0.9357 = 0.0393
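The whole example, including part (iv) whose probability is practically zero, can be cross-checked with the same erf-based standard normal CDF; small differences from four-figure table values are rounding (`phi` and `z` are our own helpers):

```python
from math import erf, sqrt

def phi(z):
    # Standard normal CDF: P(Z <= z)
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 30, 5

def z(x):
    return (x - mu) / sigma

p1 = phi(z(40)) - phi(z(26))  # (i)   P(26 <= x <= 40)
p2 = 1 - phi(z(45))           # (ii)  P(x >= 45)
p3 = phi(1.96) - phi(1.52)    # (iii) P(1.52 <= Z <= 1.96)
p4 = 1 - phi(4)               # (iv)  P(Z > 4), effectively zero

print(round(p1, 4), round(p2, 5), round(p3, 4), p4)
```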
Lesson 7
Hypothesis Testing
Introduction
A statistical hypothesis test is a method of making statistical decisions using experimental data. In
statistics, a result is called statistically significant if it is unlikely to have occurred by chance. The
phrase “test of significance” was coined by Ronald Fisher: “Critical tests of this kind may be called
tests of significance, and when such tests are available we may discover whether a second sample is
or is not significantly different from the first.”
Hypothesis testing is sometimes called confirmatory data analysis, in contrast to exploratory data
analysis. In frequency probability, these decisions are almost always made using null-hypothesis tests;
that is, ones that answer the question: Assuming that the null hypothesis is true, what is the
probability of observing a value for the test statistic that is at least as extreme as the value that was
actually observed? One use of hypothesis testing is deciding whether experimental results contain
enough information to cast doubt on conventional wisdom.
Meaning of Hypothesis
A hypothesis is a tentative proposition relating to a certain phenomenon, which the researcher wants to
verify when required. If the researcher wants to infer something about the total population from
which the sample was taken, statistical methods are used to make inference. We may say that, while
a hypothesis is useful, it is not always necessary. Many a time, the researcher is interested in
collecting and analysing the data indicating the main characteristics without a hypothesis. Also, a
hypothesis may be rejected but can never be accepted except tentatively. Further evidence may prove
it wrong. It is wrong to conclude that since hypothesis was not rejected it can be accepted as valid.
What is a Null Hypothesis?
A null hypothesis is a statement about the population, whose credibility or validity the researcher
wants to assess based on the sample. A null hypothesis is formulated specifically to test for possible
rejection or nullification. Hence the name ‘null hypothesis’. Null hypothesis always states “no
difference”. It is this null hypothesis that is tested by the researcher.
Statistical Testing Procedure
1. Formulate the null hypothesis H0 and the alternate hypothesis HA.
According to the given problem, H0 represents the value of some parameter of the population.
2. Select an appropriate test statistic, assuming H0 to be true.
3. Calculate its value.
4. Select the level of significance, either at 1% or 5%.
5. Find the critical region.
6. If the calculated value lies within the critical region, then reject H0.
7. State the conclusion in writing.
Formulate the Hypothesis
The normal approach is to set two hypotheses instead of one, in such a way, that if one hypothesis is
true, the other is false. Alternatively, if one hypothesis is false or rejected, then the other is true or
accepted. These two hypotheses are:
(1) Null hypothesis
(2) Alternate hypothesis
Let us assume that the mean of the population is μ0 and the mean of the sample is x̄. Since we have
assumed that the population has a mean of μ0, this is our null hypothesis. We write this as H0: μ = μ0,
where H0 is the null hypothesis. The alternate hypothesis is HA: μ ≠ μ0. The rejection of the null
hypothesis will show that the mean of the population is not μ0. This implies that the alternate
hypothesis is accepted.
Statistical Significance Level
Having formulated the hypothesis, the next step is to test its validity at a certain level of significance.
The confidence with which a null hypothesis is accepted or rejected depends upon the significance level.
A significance level of say 5% means that the risk of making a wrong decision is 5%. The researcher
is likely to be wrong in rejecting a true hypothesis in 5 out of 100 occasions. A significance level of
say 1% means that the researcher is running the risk of wrongly rejecting a true hypothesis on one of
every 100 occasions. Therefore, a 1% significance level provides greater confidence in the decision
than a 5% significance level.
There are two type of tests.
One-tailed and Two-tailed Tests
A hypothesis test may be one-tailed or two-tailed. In a one-tailed test, the test statistic leading to
rejection of the null hypothesis falls in only one tail of the sampling distribution curve.
Example:
In a right-tailed test, the critical region lies entirely in the right tail of the sampling distribution.
Whether the test is one-sided or two-sided depends on the alternate hypothesis.
A tyre company claims that mean life of its new tyre is 15,000 km. Now the researcher
formulates the hypothesis that tyre life is = 15,000 km.
A two-tailed test is one in which the test statistic leading to rejection of the null hypothesis falls in
both tails of the sampling distribution curve, as shown.
When we should apply a hypothesis test that is one-tailed or two-tailed depends on the nature of the
problem. One-tailed test is used when the researcher’s interest is primarily on one side of the issue.
Example:
“Is the current advertisement less effective than the proposed new advertisement”?
A two-tailed test is appropriate, when the researcher has no reason to focus on one side of the
issue.
Example: “Are the two markets – Mumbai and Delhi different to test market a product?”
A product is manufactured by a semi-automatic machine. Now, assume that the same product is
manufactured by a fully automatic machine. This will be a two-sided test, because the null
hypothesis is that “the two methods used for manufacturing the product do not differ significantly”.
H0: μ1 = μ2
Degree of Freedom
It tells the researcher the number of elements that can be chosen freely.
Example: (a + b)/2 = 5. If we fix a = 3, then b has to be 7. Therefore, the degrees of freedom is 1.
Select Test Criteria
If the hypothesis pertains to a large sample (30 or more), the Z-test is used. When the sample is
small (less than 30), the t-test is used.
Compute
Carry out computation.
Make Decisions
Accepting or rejecting of the null hypothesis depends on whether the computed value falls in the
region of rejection at a given level of significance.
Errors in Hypothesis Testing
There are two types of errors:
1. Hypothesis is rejected when it is true.
2. Hypothesis is not rejected when it is false.
(1) is called Type 1 error (α); (2) is called Type 2 error (β).
When α = 0.10 it means that a true hypothesis will be accepted in 90 out of 100 occasions. Thus, there
is a risk of rejecting a true hypothesis in 10 out of every 100 occasions. To reduce the risk, use α =
0.01, which implies that we are prepared to take a 1% risk, i.e., the probability of rejecting a true
hypothesis is 1%. It is also possible that in hypothesis testing we may commit a Type 2 error (β), i.e.,
accepting a null hypothesis which is false.
Example of Type 1 and Type 2 error:
Type 1 and Type 2 error is presented as follows. Suppose a marketing company has 2 distributors
(retailers) with varying capabilities. On the basis of capabilities, the company has grouped them into
two categories (1) Competent retailer (2) Incompetent retailer. Thus R1 is a competent retailer and
R2 is an incompetent retailer. The firm wishes to award a performance bonus (as a part of trade
promotion) to encourage good retailership. Assume that two actions A1 and A2 would represent
whether the bonus or trade incentive is given and not given. This is shown as follows:
When the firm has failed to reward a competent retailer, it has committed a type-2 error. On the other
hand, when the bonus was given to an incompetent retailer, it has committed a type-1 error.
Types of Tests
1. Parametric test.
2. Non-parametric test.
Parametric Test
(1) Parametric tests are more powerful. The data in this test is derived from interval and ratio
measurement.
(2) In parametric tests, it is assumed that the data follows normal distributions. Examples of
parametric tests are (a) Z-Test, (b) T-Test and (c) F-Test.
(3) Observations must be independent, i.e., the selection of any one item should not affect the chances
of any other being included in the sample.
What is univariate/bivariate data analysis?
Univariate
If we wish to analyse one variable at a time, this is called univariate analysis. For example: the effect
of pricing on sales. Here, price is an independent variable and sales is a dependent variable. Change the
price and measure the sales.
Bivariate
The relationship of two variables at a time is examined by means of bi-variate data analysis. If one is
interested in a problem of detecting whether a parameter has either increased or decreased, a two-
sided test is appropriate.
Non-parametric Test
Non-parametric tests are used to test the hypothesis with nominal and ordinal data.
(1) We do not make assumptions about the shape of population distribution.
(2) These are distribution-free tests.
(3) The hypothesis of non-parametric test is concerned with something other than the value of a
population parameter.
(4) Easy to compute. There are certain situations particularly in marketing research, where the
assumptions of parametric tests are not valid. For example: In a parametric test, we assume that data
collected follows a normal distribution. In such cases, non-parametric tests are used. Examples of
non-parametric tests are (a) Binomial test (b) Chi-Square test
(c) Mann-Whitney U test (d) Sign test. A binomial test is used when the population has only two
classes such as male, female; buyers, non-buyers; success, failure etc. All observations made about
the population must fall into one of the two classes. The binomial test is used when the sample size is
small.
Advantages
1. They are quick and easy to use.
2. When data are not very accurate, these tests produce fairly good results.
Disadvantages
Non-parametric tests involve a greater risk of accepting a false hypothesis and thus committing a
Type 2 error.
P-values
A p-value, sometimes called an uncertainty or probability coefficient, is based on properties of the
sampling distribution. It is usually expressed as p less than some decimal, as in p < .05 or p < .0006,
where the decimal comes from the significance setting of the statistical procedure. It is used in two
ways: (1) as a criterion level which you, the researcher, have decided in advance to use as the cutoff
for rejecting the null hypothesis, in which case you would ordinarily say something like “setting p at
p < .05 for one-tailed or two-tailed tests of significance gives some confidence that, when the null
hypothesis is true, it will be wrongly rejected no more than 5% of the time”; and, more commonly,
(2) as an expression of inference uncertainty after you have run some test statistic regarding the
strength of some association or relationship between your independent and dependent variables, in
which case you would say something like “the evidence suggests there is a statistically significant
effect; however, p < .05 also reminds us that as often as 5% of the time such a result could arise by
chance alone.”
Summary
Hypothesis testing is the use of statistics to determine the probability that a given
hypothesis is true.
The usual process of hypothesis testing consists of four steps.
Formulate the null hypothesis and the alternative hypothesis.
Identify a test statistic that can be used to assess the truth of the null hypothesis.
Compute the P-value, which is the probability that a test statistic at least as significant as the one
observed would be obtained assuming that the null hypothesis were true.
The smaller the p-value, the stronger the evidence against the null hypothesis.
Compare the p-value to an acceptable significance value α.
If p ≤ α, the observed effect is statistically significant, the null hypothesis is ruled out, and the
alternative hypothesis is valid.
Test of Significance
Introduction
Tests for statistical significance are used to estimate the probability that a relationship observed in
the data occurred only by chance; the probability that the variables are really unrelated in the
population. They can be used to filter out unpromising hypotheses. In research reports, tests of
statistical significance are reported in three ways. First, the results of the test may be reported in the
textual discussion of the results. Include:
1. Hypothesis
2. Test statistic used and its value
3. Degrees of freedom
4. Value for alpha (p-value)
Tests for statistical significance are used because they constitute a common yardstick that can be
understood by a great many people, and they communicate essential information about a research
project that can be compared to the findings of other projects. However, they do not assure that the
research has been carefully designed and executed. In fact, tests for statistical significance may be
misleading, because they are precise numbers that bear no necessary relationship to the practical
significance of the findings of the research.
Finally, one must always use measures of association along with tests for statistical significance. The
latter estimate the probability that the relationship exists; while the former estimate the strength (and
sometimes the direction) of the relationship. Each has its use, and they are best when used together.
There are two types of tests:
Small Sample Tests
Large Sample Tests
Small Sample Tests
T-test
The t-test is used in the following circumstances: when the sample size is small (n < 30) and the
population standard deviation is not known.
Example: A certain pesticide is packed into bags by a machine. Random samples of 10 bags are
drawn and their contents are found as follows: 50,49,52,44,45,48,46,45,49,45. Confirm whether the
average packaging can be taken to be 50 kgs.
In this example, the sample size is less than 30 and the population standard deviation is not known,
so the t-test applies. Using this test we can also find out if there is any significant difference between
two means, i.e. whether two population means are equal.
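A sketch of this one-sample t-test with the standard library's `statistics` module (`mu0` is our own name for the claimed mean):

```python
from math import sqrt
from statistics import mean, stdev

data = [50, 49, 52, 44, 45, 48, 46, 45, 49, 45]
mu0 = 50  # claimed average packing weight (kg)

# t = (sample mean - mu0) / (s / sqrt(n)), with n - 1 = 9 degrees of freedom
t = (mean(data) - mu0) / (stdev(data) / sqrt(len(data)))
print(round(t, 2))  # -3.2
```

Since |t| = 3.20 exceeds the two-tailed 5% critical value 2.262 for 9 degrees of freedom, the claim of an average of 50 kg would be rejected.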
The Student’s T-distribution
Let X1, X2, ......, Xn be n independent random variables from a normal population with mean μ and
standard deviation σ (unknown). Then the statistic
t = (x̄ − μ)/(s/√n),
where x̄ is the sample mean and s is the sample standard deviation, follows the t-distribution with
(n − 1) degrees of freedom.
Features of t-distribution
1. Like the χ²-distribution, the t-distribution has one parameter ν = n − 1, where n denotes the sample
size. Hence, this distribution is known if n is known.
2. The mean of the random variable t is zero and its standard deviation is √(ν/(ν − 2)), for ν > 2.
3. The probability curve of the t-distribution is symmetrical about the ordinate at t = 0. Like a normal
variable, the t variable can take any value from −∞ to +∞.
4. The distribution approaches the normal distribution as the number of degrees of freedom becomes
large.
5. The random variable t is defined as the ratio of a standard normal variate to the square root of a
χ²-variate divided by its degrees of freedom.
Illustration: There are two nourishment programmes ‘A’ and ‘B’. Two groups of children are
subjected to this. Their weight is measured after six months. The first group of children subjected to
the programme ‘A’ weighed 44,37,48,60,41 kgs. at the end of programme. The second group of
children were subjected to nourishment programme ‘B’ and their weight was 42, 42, 58, 64, 64, 67,
62 kgs. at the end of the programme. From the above, can we conclude that nourishment programme
‘B’ increased the weight of the children significantly, at a 5% level of significance?
Null Hypothesis: There is no significant difference between Nourishment programme ‘A’ and ‘B’.
Alternative Hypothesis: Nourishment programme ‘B’ is better than ‘A’, i.e., nourishment programme
‘B’ increases the children’s weight significantly.
Solution:
The sample means are x̄A = 230/5 = 46 kg and x̄B = 399/7 = 57 kg. The pooled variance is
s_p² = [Σ(xA − x̄A)² + Σ(xB − x̄B)²]/(n1 + n2 − 2) = (310 + 674)/10 = 98.4, so
|t| = |x̄A − x̄B| / √(s_p²(1/n1 + 1/n2)) = 11/5.81 = 1.89
Since 1.89 exceeds the one-tailed critical value t0.05 (10 d.f.) = 1.812, the null hypothesis is rejected:
nourishment programme ‘B’ increases the children’s weight significantly.
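The value 1.89 can be reproduced with the pooled two-sample t-statistic; a minimal sketch mirroring the hand computation:

```python
from math import sqrt

a = [44, 37, 48, 60, 41]          # programme 'A' weights (kg)
b = [42, 42, 58, 64, 64, 67, 62]  # programme 'B' weights (kg)

na, nb = len(a), len(b)
ma, mb = sum(a) / na, sum(b) / nb  # sample means: 46 and 57
ssa = sum((x - ma) ** 2 for x in a)
ssb = sum((x - mb) ** 2 for x in b)
sp2 = (ssa + ssb) / (na + nb - 2)  # pooled variance, df = 10

t = (ma - mb) / sqrt(sp2 * (1 / na + 1 / nb))
print(round(abs(t), 2))  # 1.89
```

Since 1.89 exceeds the one-tailed 5% critical value 1.812 for 10 degrees of freedom, H0 is rejected in favour of programme ‘B’.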
Let s1² and s2² be the variances of the first sample and the second sample respectively. Then the
F-statistic is defined as the ratio of two χ²-variates, each divided by its degrees of freedom. Thus, we
can write
F = s1²/s2²
Features of F-distribution
1. This distribution has two parameters ν1 (= n1 − 1) and ν2 (= n2 − 1).
2. The mean of the F-variate with ν1 and ν2 degrees of freedom is ν2/(ν2 − 2), for ν2 > 2, and its
standard error is √[2ν2²(ν1 + ν2 − 2)/(ν1(ν2 − 2)²(ν2 − 4))], for ν2 > 4.
Large Sample Tests
Z-test
When the sample size is large (30 or more; for instance, a sample of 100), a Z-test may be used.
(b) Testing the hypothesis about the difference between two means: this can be used when two
population means are given and the null hypothesis is H0: P1 = P2.
Example: In a city during the year 2000, 20% of households indicated that they read ‘Femina’
magazine. Three years later, the publisher had reasons to believe that circulation has gone up. A
survey was conducted to confirm this. A sample of 1,000 respondents were contacted and it was
found 210 respondents confirmed that they subscribe to the periodical ‘Femina’. From the above, can
we conclude that there is a significant increase in the circulation of ‘Femina’?
Solution:
We will set up the null hypothesis and alternate hypothesis as follows, where p denotes the population
proportion of households reading ‘Femina’:
Null Hypothesis H0: p = 0.20
Alternate Hypothesis HA: p > 0.20
This is a one-tailed (right) test.
The sample proportion is 210/1000 = 0.21, so Z = (0.21 − 0.20)/√(0.20 × 0.80/1000) ≈ 0.79.
As the critical value of Z at 0.05 is 1.645 and the calculated value does not fall in the rejection
region, we cannot reject the null hypothesis; the sample does not provide significant evidence that the
circulation of ‘Femina’ has increased.
Chi-square Test
With the help of this test, we can come to know whether two or more attributes are associated or not.
How closely the two attributes are related cannot be measured by the Chi-Square test. Suppose we
have a certain number of observations classified according to two attributes. We may like to know
whether a newly introduced medicine is effective in the treatment of a certain disease or not.
The numbers of automobile accidents per week in a certain city over ten weeks were as follows:
12, 8, 20, 2, 14, 10, 15, 6, 9, 4.
Does the above data indicate that accident conditions were uniform during the 10-week period?
Expected frequency = (12 + 8 + 20 + 2 + 14 + 10 + 15 + 6 + 9 + 4)/10 = 100/10 = 10
Computation
Null hypothesis: The accident occurrence is uniform over a 10-week period.
χ² = Σ (O − E)²/E
Where O is the observed frequency, E is the expected frequency.
D.F. = 10 − 1 = 9
Table value at 5% for 9 degrees of freedom = 16.919
Since the calculated value 26.6 is greater than the table value 16.919, the null hypothesis is rejected
at the 5% level of significance.
Conclusion: The accident occurrences are not uniform over the 10-week period.
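The chi-square statistic can be reproduced in a few lines; a sketch:

```python
observed = [12, 8, 20, 2, 14, 10, 15, 6, 9, 4]  # accidents per week
expected = sum(observed) / len(observed)        # uniform: 100 / 10 = 10

chi2 = sum((o - expected) ** 2 / expected for o in observed)
print(round(chi2, 1))  # 26.6 > 16.919, the 5% table value for 9 d.f.
```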
Summary
Significance Level: Significance level is the criterion used for rejecting the null hypothesis.
Tests for statistical significance: Tests for statistical significance are used to estimate the probability
that a relationship observed in the data occurred only by chance; the probability that the variables are
really unrelated in the population.
Testing the hypothesis about difference between two means: This can be used when two
population means are given and null hypothesis is Ho : P1 = P2.
*******
Lesson 8
Index Number
Introduction
An index number is a statistical measure used to compare the average level of magnitude of a group
of distinct but related variables in two or more situations. Suppose that we want to compare the average
price level of different items of food in 2020 with what it was in 2010. Let the different items of food
be wheat, rice, milk, eggs, ghee, sugar, pulses, etc. If the prices of all these items change in the same
ratio and in the same direction; assume that prices of all the items have increased by 10% in 2020 as
compared with their prices in 2010; then there will be no difficulty in finding out the average change
in price level for the group as a whole. Obviously, the average price level of all the items taken as a
group will also be 10% higher in 2020 as compared with prices of 2010. However, in real situations,
neither the prices of all the items change in the same ratio nor in the same direction, i.e., the prices of
some commodities may change to a greater extent as compared to prices of other commodities.
Moreover, the price of some commodities may rise while that of others may fall. For such situations,
index numbers are a very useful device for measuring the average change in prices or any other
characteristic like quantity, value, etc., for the group as a whole.
Another important feature of index numbers is that they are often used to average a characteristic
expressed in different units for different items of a group. In the words of Tuttle: “An index number is
a single ratio (usually in percentage) which measures the combined (i.e., averaged) change of several
variables between two different times, places or situations.” For example, the price of wheat may be
quoted as rupee/kg., price of milk as rupee/litre, price of eggs as rupee/dozen, etc. To arrive at a single
figure that expresses the average change in price for the whole group, various prices have to be
combined and averaged in a suitable way. This single figure is known as price index and can be used
to determine the extent and direction of average change in the prices for the group. In a similar way
we can construct quantity index numbers, value index numbers, etc.
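As a toy illustration, a simple aggregative price index, ΣP1/ΣP0 × 100, for a group priced in a common unit; all figures below are hypothetical, invented for the sketch:

```python
# Hypothetical base-year (2010) and current-year (2020) prices for a small group
p2010 = {"wheat": 20, "rice": 30, "milk": 40, "sugar": 35}
p2020 = {"wheat": 26, "rice": 36, "milk": 52, "sugar": 42}

# Simple aggregative price index: (sum of current prices / sum of base prices) * 100
index = 100 * sum(p2020.values()) / sum(p2010.values())
print(round(index, 1))  # 124.8 -> prices rose 24.8% on average for the group
```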
Characteristics of index numbers
1. Index numbers are specialised averages: An average of data is its representative summary figure.
In a similar way, an index number is also an average, often a weighted average, computed for a group.
It is called a specialised average because the figures that are averaged are not necessarily expressed
in homogeneous units.
2. Index numbers measure the changes for a group which are not capable of being directly
measured: The examples of such magnitudes are: Price level of a group of items, level of business
activity in a market, level of industrial or agricultural output in an economy, etc.
3. Index numbers are expressed in terms of percentages: The changes in magnitude of a group are
expressed in terms of percentages which are independent of the units of measurement. This facilitates
the comparison of two or more index numbers in different situations.
Uses of Index Numbers
The main uses of index numbers are:
1. To measure and compare changes: The basic purpose of the construction of an index number is
to measure the level of activity of phenomena like price level, cost of living, level of agricultural
production, level of business activity, etc. It is because of this reason that sometimes index numbers
are termed as barometers of economic activity. It may be mentioned here that a barometer is an
instrument which is used to measure atmospheric pressure in physics. The level of an activity can be
expressed in terms of index numbers at different points of time or for different places at a particular
point of time. These index numbers can be easily compared to determine the trend of the level of an
activity over a period of time or with reference to different places.
2. To help in providing guidelines for framing suitable policies: Index numbers are indispensable
tools for the management of any government or non-government organisation. For example, the
increase in cost of living index is helpful in deciding the amount of additional dearness allowance that
should be paid to the workers to compensate them for the rise in prices. In addition to this, index
numbers can be used in planning and formulation of various government and business policies.
3. Price index numbers are used in deflating: This is a very important use of price index numbers.
These index numbers can be used to adjust monetary figures of various periods for changes in prices.
For example, the figure of national income of a country is computed on the basis of the prices of the
year in question. Such figures, for various years often known as national income at current prices, do
not reveal the real change in the level of production of goods and services. In order to know the real
change in national income, these figures must be adjusted for price changes in various years. Such
adjustments are possible only by the use of price index numbers and the process of adjustment, in a
situation of rising prices, is known as deflating.
4. To measure purchasing power of money: We know that there is inverse relation between the
purchasing power of money and the general price level measured in terms of a price index number.
Thus, reciprocal of the relevant price index can be taken as a measure of the purchasing power of
money.
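This inverse relation can be sketched in a couple of lines of Python. The index value used here is hypothetical, chosen only to illustrate the reciprocal:

```python
# Purchasing power of money as the reciprocal of the price index.
# The index value 125 is hypothetical: prices are 25% above the base year.
price_index = 125
purchasing_power = 100 / price_index  # one base-year rupee now buys this much
print(purchasing_power)  # 0.8
```

An index of 125 thus means a rupee buys only 80 per cent of what it bought in the base year.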
Construction of Index Numbers
To illustrate the construction of an index number, we reconsider various items of food mentioned
earlier. Suppose, for example, that the price of wheat was 300 per quintal in 2019 and 360 per
quintal in 2021. The change in price can be expressed in two ways:
1. By taking the difference of prices in the two years, i.e., 360 – 300 = 60, one can say that the price
of wheat has gone up by 60 per quintal in 2021 as compared with its price in 2019.
2. By taking the ratio of the two prices, i.e., 360/300 = 1.20, one can say that if the price of wheat in
2019 is taken to be 1, then it has become 1.20 in 2021. A more convenient way of comparing the two
prices is to express the price ratio in terms of percentage, i.e., 360/300 × 100 = 120, known as the Price
Relative of the item. In our example, the price relative of wheat is 120, which can be interpreted as the
price of wheat in 2021 when its price in 2019 is taken as 100. Further, the figure 120 indicates that the
price of wheat has gone up by 120 – 100 = 20% in 2021 as compared with its price in 2019.
The first way of expressing the price change is inconvenient because the change in price depends upon
the units in which it is quoted. This problem is taken care of in the second method, where price change
is expressed in terms of percentage. An additional advantage of this method is that various price
changes, expressed in percentage, are comparable. Further, it is very easy to grasp the 20% increase in
price rather than the increase expressed as 60 rupees/quintal.
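The price-relative computation above can be sketched in Python, using the wheat prices from the example:

```python
# Price relative of wheat: 300/quintal in the base year (2019),
# 360/quintal in the current year (2021).
p0, p1 = 300, 360
price_relative = p1 / p0 * 100  # (current price / base price) * 100
print(round(price_relative))    # 120, i.e., a 20% rise over the base year
```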
For the construction of index number, we have to obtain the average price change for the group in
2021, usually termed as the Current Year, as compared with the price of 2019, usually called the Base
Year. This comparison can be done in two ways:
1. By taking suitable average of price relatives of different items. The methods of index number
construction based on this procedure are termed as Average of Price Relative Methods.
2. By taking ratio of the averages of the prices of different items in each year. These methods are
popularly known as Aggregative Methods. Since the average in each of the above methods can be
simple or weighted, these can further be divided as simple or weighted. Various methods of index
number construction can be classified as shown below:
Figure : Various Methods of Index Number Construction
In addition to this, a particular method would depend upon the type of average used. Although,
geometric mean is more suitable for averaging ratios, arithmetic mean is often preferred because of its
simplicity with regard to computations and interpretation.
Notations and Terminology
Before writing various formulae of index numbers, it is necessary to introduce certain notations and
terminology for convenience.
Base Year: The year from which comparisons are made is called the base year. It is commonly denoted
by writing ‘0’ as a subscript of the variable.
Current Year: The year under consideration for which the comparisons are to be computed is called
the current year. It is commonly denoted by writing ‘1’ as a subscript of the variable.
Let there be n items in a group which are numbered from 1 to n. Let p0i denote the price of the
ith item in base year and p1i denote its price in current year, where i = 1, 2, ...... n. In a similar way q0i
and q1i will denote the quantities of the ith item in base and current years respectively.
Using these notations, we can write the price relative of the ith item as Pi = (p1i/p0i) × 100, and the
quantity relative of the ith item as Qi = (q1i/q0i) × 100.
Further, P01 will be used to denote the price index number of period ‘1’ as compared with the prices
of period ‘0’. Similarly, Q01 and V01 would denote the quantity and the value index numbers respectively
of period ‘1’ as compared with period ‘0’.
Construction of an Index Number
Un-weighted Index
In an un-weighted index number, no weights are assigned to the various items used for the
calculation of the index number. Two un-weighted price index numbers are given below.
Simple Average of Price Relatives: Here the price relatives of the n items are averaged with equal
importance. For example, if the price relatives of five items add up to 684.16, then
Index number P01 = Σ(p1/p0 × 100)/n = 684.16/5 = 136.83
Here, price is said to have risen by 36.83 per cent.
These un-weighted price index numbers have limited use. The reasons
are as follows:
(a) This method doesn’t take into account the relative importance of various commodities used
in the calculation of index number since equal importance is given to all the items.
(b) The different items are required to be expressed in the same unit. In practice, however, the
different items may be expressed in different units.
(c) The index number obtained by this method is not reliable as it is affected by the unit in
which prices of several commodities are quoted.
Weighted Average of Price Relatives: In order to take the relative importance of various items into
account, weighing of different items, in proportion to their degree of importance, becomes necessary.
Let wi be the weight assigned to the ith item (i = 1, 2, ..., n). Thus, the index number, given by the
weighted arithmetic mean of price relatives, is
P01 = ΣPiwi / Σwi
An index number becomes a weighted index when the relative importance of items is taken care of.
Nature of weights
While taking weighted average of price relatives, the values are often taken as weights . These weights
can be the values of base year quantities valued at base year prices, i.e., p0iq0i, or the values of current
year quantities valued at current year prices, i.e., p1iq1i, or the values of current year quantities valued
at base year prices, i.e., p0iq1i, etc., or any other value.
Example:
Construct an index number for 2010 taking 2002 as base for the following data, by using the weighted
arithmetic mean of price relatives.
Commodities Prices in 2002 Prices in 2010 Weights
A 60 100 30
B 20 20 20
C 40 60 24
D 100 120 30
E 120 80 10
Solution:
Commodities Prices in 2002 (p0) Prices in 2010 (p1) Pi = p1/p0 × 100 Weights (wi) Piwi
A 60 100 166.67 30 5000.10
B 20 20 100.00 20 2000.00
C 40 60 150.00 24 3600.00
D 100 120 120.00 30 3600.00
E 120 80 66.67 10 666.70
Total 114 14866.80
Index number P01 = ΣPiwi / Σwi = 14866.8/114 = 130.41
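The computation above can be reproduced with a short Python sketch, using the data exactly as in the example:

```python
# Weighted average of price relatives; 2002 is the base year, 2010 the current year.
p0 = [60, 20, 40, 100, 120]   # prices in 2002
p1 = [100, 20, 60, 120, 80]   # prices in 2010
w  = [30, 20, 24, 30, 10]     # weights

relatives = [c / b * 100 for c, b in zip(p1, p0)]            # Pi = p1/p0 * 100
index = sum(r * wi for r, wi in zip(relatives, w)) / sum(w)  # P01 = ΣPiwi / Σwi
print(round(index, 2))  # 130.41
```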
Simple Aggregative Method: In this method, the simple arithmetic mean of the prices of all the items
of the group for the current as well as for the base year is computed separately. The ratio of the current
year average to the base year average, multiplied by 100, gives the required index number.
Using notations, the arithmetic mean of prices of n items in the current year is ΣP1i/n and in the base
year is ΣP0i/n. Thus, the simple aggregative price index is
P01 = (ΣP1i/n)/(ΣP0i/n) × 100 = (ΣP1i/ΣP0i) × 100
Omitting the subscript i, the above index number can also be written as:
P01 = (ΣP1/ΣP0) × 100
Example: The following table gives the prices of six items in the years 2020 and 2021. Use the simple
aggregative method to find the index of 2021 with 2020 as base.
Item 2020 2021
A 40 50
B 60 60
C 20 30
D 50 70
E 80 90
F 100 100
Solution:
Let p0 be the price in 2020 and p1 be the price in 2021. Thus, we have
Item 2020 2021
A 40 50
B 60 60
C 20 30
D 50 70
E 80 90
F 100 100
Total 350 400
P01 = (400/350) × 100 = 114.29
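A minimal sketch of the simple aggregative method in Python, using the same data:

```python
# Simple aggregative price index: ratio of price totals, times 100.
p0 = [40, 60, 20, 50, 80, 100]  # 2020 (base year) prices
p1 = [50, 60, 30, 70, 90, 100]  # 2021 (current year) prices

index = sum(p1) / sum(p0) * 100  # (Σp1/Σp0) * 100 = 400/350 * 100
print(round(index, 2))  # 114.29
```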
Weighted Aggregative Method: This index number is defined as the ratio of the weighted arithmetic
mean of current year prices to that of base year prices, multiplied by 100.
Using the notations defined earlier, the weighted arithmetic mean of current year prices is Σp1iwi/Σwi
and, similarly, the weighted arithmetic mean of base year prices is Σp0iwi/Σwi. Thus, the
Price Index Number P01 = (Σp1iwi/Σwi)/(Σp0iwi/Σwi) × 100 = (Σp1iwi/Σp0iwi) × 100
Omitting the subscript, we can also write
P01 = (Σp1w/Σp0w) × 100
Nature of Weights
In case of weighted aggregative price index numbers, quantities are often taken as weights.
These quantities can be the quantities purchased in base year or in current year or an average of base
year and current year quantities or any other quantities. Depending upon the choice of weights, some
of the popular formulae for weighted index numbers can be written as follows:
1. Laspeyres’s Index: Laspeyres’ price index number uses base year quantities as weights.
Thus, we can write:
P01 = (Σp1q0/Σp0q0) × 100
2. Paasche’s Index: This index number uses current year quantities as weights. Thus, we can write
P01 = (Σp1q1/Σp0q1) × 100
3. Fisher’s Ideal Index: As will be discussed later, the Laspeyres’s index has an upward bias and
the Paasche’s index has a downward bias. In view of this, Fisher suggested that an ideal index should
be the geometric mean of the Laspeyres’s and Paasche’s indices. Thus, the
Fisher’s formula can be written as follows:
P01 = √[(Σp1q0/Σp0q0) × (Σp1q1/Σp0q1)] × 100
or, equivalently, P01 = √(P01^L × P01^P), where
P01^L = Laspeyres’s Index
P01^P = Paasche’s Index
4. Dorbish and Bowley’s Index: This index number is constructed by taking the arithmetic mean of
the Laspeyres’s and Paasche’s indices.
P01 = (1/2)[(Σp1q0/Σp0q0) + (Σp1q1/Σp0q1)] × 100
5. Marshall and Edgeworth’s Index: This index number uses the arithmetic mean of base and current
year quantities.
P01 = [Σp1(q0 + q1)/Σp0(q0 + q1)] × 100
Example: For the data given in the following table, compute
1. Laspeyres’s Price Index
2. Paasche’s Price Index
3. Fisher’s Ideal Index
4. Dorbish and Bowley’s Price Index
5. Marshall and Edgeworth’s Price Index
P0 Q0 P1 Q1
10 30 12 50
8 15 10 25
6 20 6 30
4 10 6 20
Solution:
P0 Q0 P1 Q1 P0Q0 P1Q0 P0Q1 P1Q1
10 30 12 50 300 360 500 600
8 15 10 25 120 150 200 250
6 20 6 30 120 120 180 180
4 10 6 20 40 60 80 120
Total 580 690 960 1150
The calculations of the various price index numbers are done as given below:
1. Laspeyres’s P01 = (690/580) × 100 = 118.97
2. Paasche’s P01 = (1150/960) × 100 = 119.79
3. Fisher’s P01 = √[(690/580) × (1150/960)] × 100 = 119.38
4. Dorbish and Bowley’s P01 = (1/2)[(690/580) + (1150/960)] × 100 = 119.38
5. Marshall and Edgeworth’s P01 = [(690 + 1150)/(580 + 960)] × 100 = 119.48
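The five weighted aggregative formulae can be verified with a short Python sketch over the same data:

```python
from math import sqrt

p0 = [10, 8, 6, 4]; q0 = [30, 15, 20, 10]   # base year prices and quantities
p1 = [12, 10, 6, 6]; q1 = [50, 25, 30, 20]  # current year prices and quantities

def dot(x, y):
    """Sum of products, e.g. dot(p1, q0) = Σp1q0."""
    return sum(a * b for a, b in zip(x, y))

laspeyres = dot(p1, q0) / dot(p0, q0) * 100        # 690/580 * 100
paasche   = dot(p1, q1) / dot(p0, q1) * 100        # 1150/960 * 100
fisher    = sqrt(laspeyres * paasche)              # geometric mean of the two
bowley    = (laspeyres + paasche) / 2              # arithmetic mean of the two
q_sum     = [a + b for a, b in zip(q0, q1)]
marshall  = dot(p1, q_sum) / dot(p0, q_sum) * 100  # 1840/1540 * 100

for name, val in [("Laspeyres", laspeyres), ("Paasche", paasche),
                  ("Fisher", fisher), ("Dorbish-Bowley", bowley),
                  ("Marshall-Edgeworth", marshall)]:
    print(f"{name}: {val:.2f}")
```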
Quantity Index Numbers
A quantity index number measures the change in quantities in the current year as compared with a base
year. The formulae for quantity index numbers can be written directly from the price index numbers simply
by interchanging the roles of price and quantity. Similar to a price relative, we can define a quantity
relative as Q = (q1/q0) × 100
Fisher’s quantity index, for example, is Q01 = √(Q01^L × Q01^P), where
Q01^L = Laspeyres’s quantity index
Q01^P = Paasche’s quantity index
Test of adequacy for an Index Number
Index numbers are studied to know the relative changes in price and quantity for any two years
compared. There are two tests which are used to test the adequacy for an index number. The two tests
are as follows,
(i) Time Reversal Test
(ii) Factor Reversal Test
The criterion for a good index number is to satisfy the above two tests.
Time Reversal Test
It is an important test of the consistency of a good index number. This test maintains time
consistency by working both forward and backward with respect to time (here time refers to the base year
and the current year). Symbolically, the following relationship should be satisfied: P01 × P10 = 1.
Fisher’s index number formula satisfies this relationship: when the base year and current year are
interchanged in Fisher’s formula, the product of the two resulting indices equals one.
Factor Reversal Test
This test requires that the product of the price index and the quantity index equal the value ratio,
i.e., P01 × Q01 = Σp1q1/Σp0q0. Fisher’s formula satisfies this test as well.
Example
Calculate Fisher’s price index number and show that it satisfies both the Time Reversal Test and the
Factor Reversal Test. (The data of the previous example may be used.)
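As a sketch, using the data of the previous worked example, Fisher’s index can be shown to satisfy both tests:

```python
from math import sqrt, isclose

p0 = [10, 8, 6, 4]; q0 = [30, 15, 20, 10]   # base year prices and quantities
p1 = [12, 10, 6, 6]; q1 = [50, 25, 30, 20]  # current year prices and quantities

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def fisher(pa, pb, qa, qb):
    # Fisher's index of period b relative to period a, as a plain ratio
    # (without the *100 factor): sqrt(Laspeyres * Paasche).
    return sqrt(dot(pb, qa) / dot(pa, qa) * dot(pb, qb) / dot(pa, qb))

P01 = fisher(p0, p1, q0, q1)   # price index, base -> current
P10 = fisher(p1, p0, q1, q0)   # price index with the two years interchanged
Q01 = fisher(q0, q1, p0, p1)   # quantity index (roles of p and q swapped)

print(isclose(P01 * P10, 1.0))                        # Time Reversal: True
print(isclose(P01 * Q01, dot(p1, q1) / dot(p0, q0)))  # Factor Reversal: True
```

Both products reduce algebraically to 1 and to Σp1q1/Σp0q0 respectively, which is why Fisher’s index is called ideal.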
Summary
An index number is a statistical measure used to compare the average level of magnitude of a group
of distinct but related variables in two or more situations.
In real situations, neither the prices of all the items change in the same ratio nor in the same direction,
i.e., the prices of some commodities may change to a greater extent as compared to prices of other
commodities.
The index numbers are very useful device for measuring the average change in prices or any other
characteristics like quantity, value, etc., for the group as a whole.
Index numbers are a specialized type of average used to measure the changes in a characteristic
which is not capable of being directly measured.
The changes in magnitude of a group are expressed in terms of percentages which are independent of
the units of measurement. This facilitates the comparison of two or more index numbers in different
situations.
Index numbers are indispensable tools for the management of any government or non-government
organization.
There is an inverse relation between the purchasing power of money and the general price level measured
in terms of a price index number.
The reciprocal of the relevant price index can be taken as a measure of the purchasing power of money.
The year from which comparisons are made is called the base year. It is commonly denoted by writing
‘0’ as a subscript of the variable.
While taking weighted average of price relatives, the values are often taken as weights. These weights
can be the values of base year quantities valued at base year prices.
In case of weighted aggregative price index numbers, quantities are often taken as weights.
These quantities can be the quantities purchased in base year or in current year or an average of base
year and current year quantities or any other quantities.
A quantity index number measures the change in quantities in current year as compared with a base
year.
When comparisons of various periods are made with reference to a particular period, termed the base
period, the resulting series of index numbers is known as a fixed base series.
*********