Reading Material - Lesson 1-8 - Data Analysis

This document provides information about a Skill Enhancement Course (SEC) in Data Analysis offered as part of a BA programme. The course aims to introduce students to collecting, presenting, summarizing, and analyzing data to draw statistical inferences. Topics covered include data sources, univariate frequency distributions, measures of central tendency and dispersion, bivariate frequency distributions, correlation and regression, probability theory, estimation, hypothesis testing, and index numbers. Readings include applied statistics textbooks. Good questionnaires and schedules for collecting primary data should have a forwarding letter, a minimal number of questions, easy-to-understand language, logical question arrangement, and well-defined terms.


B.A. Programme Semester IV Economics

Skill Enhancement Course (SEC)

Data Analysis
(Reading Material)

School of Open Learning


Campus of Open Learning
University of Delhi

Topics Include:
1. Data, 2. Univariate Frequency Distribution, 3. Dispersion, 4. Introduction to
Correlation and Regression, 5. Introduction to Probability, 6. Normal Distribution,
7. Hypothesis Testing, 8. Index Number.

Skill Enhancement Course (SEC)
DATA ANALYSIS

Course Description

This course introduces the student to collection and presentation of data. It also discusses
how data can be summarized and analysed for drawing statistical inferences. The students
will be introduced to important data sources that are available and will also be trained in the
use of free statistical software to analyse data.

Course Outline:

1. Sources of data. Population census versus sample surveys. Random sampling.


2. Univariate frequency distributions. Measures of central tendency: mean, median and
mode; arithmetic, geometric and harmonic mean. Measures of dispersion, skewness
and kurtosis.
3. Bivariate frequency distribution. Correlation and regression. Rank correlation.
4. Introduction to probability theory. Notions of random experiment, sample space,
event, probability of an event. Conditional probability. Independence of events.
Random variables and probability distributions. Binomial and normal distributions.
5. Estimation of population parameters from sample data. Unbiased estimators for
population mean and variance.
6. Basics of index numbers: price and quantity index numbers.

Readings:

1. P.H. Karmel and M. Polasek (1978), Applied Statistics for Economists, 4th edition,
Pitman.
2. M.R. Spiegel, L.J. Stephens and N. Kumar (4th edition), Statistics, Schaum’s Outline Series.
Lesson 1
Data
INTRODUCTION
The first step in a statistical investigation is the planning of the proposed investigation. After planning, the next step is the collection of data, keeping in view the object and scope of the investigation. There are a number of methods of collecting data, and the mode of collection also depends upon the availability of resources. The collected data is then edited, presented, analysed and interpreted. If the job of data collection is not done sincerely and seriously, the results of the investigation are bound to be inaccurate and misleading; the resources used in performing the other steps would then be wasted and the purpose of the investigation would be defeated.
Types of Data
I. Primary Data
II. Secondary Data
PRIMARY DATA
Definition
Data is called primary if it is originally collected in the process of investigation. Primary data is original in nature and is generally used for special-purpose investigations. The process of collecting primary data is time consuming. For example, suppose we want to compare the average income of the employees of two companies. This can be done by collecting data on the incomes of the employees of both companies. The collected data would be edited, presented and analysed by taking the averages of both groups. On the basis of these averages, we would be able to say which company has the higher average income. The data used in this investigation is primary, because the data on the employees' incomes was collected during the process of investigation.
Methods of Collecting Primary Data
(i) Direct personal investigation
(ii) Indirect oral investigation
(iii) Through local correspondents
(iv) Through questionnaires mailed to informants
(v) Through schedules filled by enumerators.
Now we shall discuss the process of collecting primary data by these methods. We shall also discuss the
suitability, merits and demerits regarding the above mentioned methods of collecting primary data.
(i) Direct Personal Investigation
In this method of collecting data, the investigator directly comes in contact with the informants. The investigator himself visits the different informants covered in the scope of the investigation and collects data as per the need of the investigation. Suppose an investigator wants to use this method to collect data on the wages of the employees of a factory; he would then have to contact each and every employee of the factory in order to collect the required data. In the context of this method of collecting primary data, Professor C.A. Moser has remarked, “In the strict sense, observation implies the use of the eyes rather than of the ears and the voice”. The suitability of this method depends upon the personality of the investigator, who is expected to be tactful, skilled, honest, well behaved and industrious. The method is suitable when the area to be covered is small, and also when the data is to be kept secret.
(iii) Through Local Correspondents
In this method of collecting data, the informants are not directly contacted by the investigator, but instead,
the data about the informants is collected and sent to the investigator by the local correspondents,
appointed by the investigator. Newspaper agencies collect data by using this method. They appoint their
correspondents area wise. The correspondents themselves send the desired data to the offices of their
respective newspaper. The suitability of this method depends upon the personality of the correspondent.
He is expected to be unbiased, skilled and honest. To eliminate the bias of the correspondents, it is
advisable to appoint more than one correspondent in each area.
(iv) Through Questionnaires Mailed to Informants
In this method of collecting data, the informants are not directly contacted by the investigator but instead
the investigator send questionnaires by post to the informants with the request of sending them back after
filling the same. The suitability of this method depends upon the quality of the ‘questionnaire’ and the
response of the informant. This method is useful when area to be covered is widely spread. This method
would not work in case the informants are illiterate or semi-literate.
(v) Through Schedules Filled by Enumerators
In this method of collecting data, the informants are not directly contacted by the investigator; instead, enumerators are deputed to contact the informants and fill in the schedules on the spot, after collecting data as required by the schedule. The basic difference between this method and the previous one is that here the schedules are filled in by the enumerators after getting information from the informants, whereas in the previous method the questionnaires were filled in by the informants themselves. The suitability of this method depends upon the enumerators, who are expected to be skilled, honest, hardworking, well behaved and free from bias. This method of collecting data is suitable when the informants are illiterate or semi-literate. In our country, census data about all citizens is collected every ten years by using this method.
(vi) Requisites of a Good ‘Questionnaire’ and ‘Schedule’
In the last two methods of collecting primary data, we discussed questionnaires filled in by the informants and schedules filled in by enumerators. In fact, there is no fundamental difference between a questionnaire and a schedule: both contain questions. The only difference between the two is that the former is filled in by the informants themselves, whereas in the case of the latter, the data concerning the informants is filled in by the enumerators. The success of collecting data by either means depends upon its quality. Preparing a questionnaire or schedule is an art. We now discuss in detail the requisites of a good questionnaire and schedule.
(i) Forwarding letter: The investigator must include a forwarding letter when sending questionnaires to the informants, requesting them to fill in the questionnaire and return it. The object of the investigation should also be mentioned in the letter. The informants should also be assured that the filled questionnaires would be kept confidential, if desired. To encourage response, special concessions and free gifts may be offered to the informants.
(ii) Questions should be minimum in number: The number of questions in a questionnaire or a schedule
should be as small as possible. Unnecessary questions should never be included. Inclusion of more than
20 or 25 questions would be undesirable.
(iii) Questions should be easy to understand: The questions included in a questionnaire or a schedule should be easy to understand and should not be confusing. The language used should be simple, and highly technical terms should be avoided.
(iv) Questions should be logically arranged: The questions in a questionnaire or a schedule should also be logically arranged, so that the informants react to them naturally and spontaneously. It is not fair to ask an informant whether he is employed or unemployed after asking his monthly income; such a sequence of questions creates a bad impression on the mind of the informant.
(v) Only well-defined terms should be used in questions: In drafting questions for a questionnaire or a schedule, only well-defined terms should be used. For example, the term ‘income’ should be clearly defined, in the sense of whether it is to include allowances etc. along with the basic income or not. Similarly, in the case of businessmen, it should be clear whether the informants are to report their gross profits or net profits.
(vi) Prohibited questions should not be included: No question should be included in the questionnaire or schedule which may immediately agitate the mind of the informants. Questions like “Have you given up the habit of telling a lie?” or “How many times in a month do you quarrel with your wife?” would immediately mar the spirit of the informants.
(vii) Irrelevant questions should be avoided: In a questionnaire or schedule, only those questions should be included which bear a direct link with the object of the investigation. If the object is to study the problem of unemployment, then it would be useless to collect data regarding the heights and weights of the informants.
(viii) Pilot survey: Before the questionnaire is sent to all the informants for collecting data, it should be checked beforehand for its workability. This is done by sending the questionnaire to a selected sample and studying the replies received thoroughly. If the investigator finds that most of the informants in the sample have left some questions unanswered, then those questions should be modified or deleted altogether, provided the object of the investigation permits. This is called a pilot survey. A pilot survey must be carried out before the questionnaire is finally accepted.
SECONDARY DATA
Definition
Data is called secondary if it is not originally collected in the process of investigation but is instead taken from data collected by some other agency. If the investigation is not of a very special nature, secondary data may be used, provided it can serve the purpose. Suppose we want to investigate the extent of poverty in our country; this investigation can be carried out using the national census data, which is obtained regularly every 10 years. The use of secondary data economises on money and greatly reduces the time taken by the investigation. If some secondary data can be made use of in an investigation, we should use it. Secondary data, however, ought to be used very carefully. In this context, Connor has remarked, “Statistics, especially other peoples’ statistics, are full of pitfalls for the user.”
Methods of Collecting Secondary Data
(i) Collection from Published Data
(ii) Collection from Un-published Data.
(i) Collection from Published Data
There are agencies which collect statistical data regularly and publish it. The published data is very
important and is used frequently by investigators. The main sources of published data are as follows:
(a) International publications: International Organisations and Govt. of foreign countries collect and
publish statistical data relating to various characteristics. The data is collected regularly as well as on ad-
hoc basis.
Some of the publications are:
(i) U.N.O. Statistical Year Book
(ii) Annual Reports of I.L.O.
(iii) Annual Reports of the Economic and Social Commission for Asia and
Pacific (ESCAP)
(iv) Demography Year Book
(v) Bulletins of World Bank.
(b) Government publications: In India, the Central Govt. and State Govts. collect data regarding various aspects. This data is published and is found very useful for investigation purposes. Some of the publications are:
(i) Census Report of India
(ii) Five-Year Plans
(iii) Reserve Bank of India Bulletin
(iv) Annual Survey of Industries
(v) Statistical Abstracts of India.
(c) Report of commissions and committees: The Central Govt. and State Govt. appoints Commissions
and Committees to study certain issues. The reports of such investigations are very useful. Some of these
are:
(i) Reports of National Labour Commission
(ii) Reports of Finance Commission
(iii) Report of Hazari Committee etc.
(d) Publications of research institutes: There are number of research institutes in India which regularly
collect data and analyse it. Some of the agencies are:
(i) Central Statistical Organisation (C.S.O.)
(ii) Institute of Economic Growth
(iii) Indian Statistical Institute
(iv) National Council of Applied Economic Research etc.
(e) Newspapers and magazines: There are many newspapers and magazines which publish data relating to various aspects. Some of these are:

(i) Economic Times
(ii) Financial Express
(iii) Commerce
(iv) Transport
(v) Capital etc.
(f) Reports of trade associations: The trade associations also collect data and publish it. Some of the
agencies are:
(i) Stock Exchanges
(ii) Trade Unions
(iii) Federation of Indian Chamber of Commerce and Industry.
(ii) Collection from Un-published Data
The Central Government, State Government and Research Institutes also collect data which is not
published due to some reasons. This type of data is called unpublished data. Un-published data can also
be made use of in Investigations. The data collected by research scholars of Universities is also generally
not published.
Precautions in the Use of Secondary Data
Secondary data must be used very carefully. The applicability of secondary data should be judged keeping in view the object and scope of the investigation. Prof. Bowley has remarked, “Secondary data should not be accepted at their face value.” The following are the criteria on which the applicability of secondary data is to be judged.
(i) Reliability of data: The reliability of the data is assessed by the reliability of the agency which collected it. The agency should not be biased in any way, and the enumerators who collected the data should have been unbiased and well trained. The degree of accuracy achieved should also be judged.
(ii) Suitability of data: The suitability of the data should be assessed keeping in view the object and scope of the investigation. If the data is not suitable for the investigation, it should not be used just for the sake of economy of time and money; the use of unsuitable data can lead only to misleading results.
(iii) Adequacy of data: The adequacy of the data should also be judged keeping in view the object and scope of the investigation. If the data is found to be inadequate, it should not be used. For example, if the object of the investigation is to study the problem of unemployment in India, then data on unemployment in one state, say U.P., would not serve the purpose.
Branches of Data Analysis
There are mainly two branches of Statistics: descriptive statistics and inferential statistics. Descriptive
statistics refers to the summary of important aspects of a data set. This includes collecting data, organizing
the data, and then presenting the data in the form of charts and tables. In addition, we often calculate
numerical measures that summarize, for instance, the data’s typical value and the data’s variability. Today,
the techniques encountered in descriptive statistics account for the most visible application of statistics—
the abundance of quantitative information that is collected and published in our society every day. The
unemployment rate, the president’s approval rating, the Dow Jones Industrial Average, batting averages,
the crime rate, and the divorce rate are but a few of the many “statistics” that can be found in a reputable newspaper on a frequent, if not daily, basis. Yet, despite the familiarity of descriptive statistics, these methods represent only a minor portion of the body of statistical applications.
The phenomenal growth in statistics is mainly in the field called inferential statistics. Generally,
inferential statistics refers to drawing conclusions about a large set of data— called a population—based
on a smaller set of sample data. A population is defined as all members of a specified group (not
necessarily people), whereas a sample is a subset of that particular population. The individual values
contained in a population or a sample are often referred to as observations. In most statistical applications,
we must rely on sample data in order to make inferences about various characteristics of the population.
A population consists of all items of interest in a statistical problem; a sample is a subset of the population. In other words, a population is the entire group that you want to draw conclusions about, and a sample is the specific group that you will collect data from. The size of the sample is always less than the size of the population. Normally, the sample data is analysed to calculate a sample statistic, which is used to make inferences about the unknown population parameter.
For example, the undergraduate students in India are a population, whereas 300 undergraduate students from DU are a sample drawn from that population.
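The relationship between a population, a sample, a parameter, and a statistic can be illustrated with a minimal sketch in Python; the income figures below are made up purely for illustration:

```python
import random

# Made-up population: monthly incomes (in thousands) of every employee of a firm
population = [24, 30, 28, 35, 42, 27, 31, 38, 29, 33]
population_mean = sum(population) / len(population)  # a parameter: 31.7

random.seed(1)
sample = random.sample(population, 4)    # a sample: a subset of the population
sample_mean = sum(sample) / len(sample)  # a statistic, used to estimate the parameter
```

The sample mean will generally be close to, but not exactly equal to, the population mean; that gap is the subject of the sampling-error discussion later in this lesson.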
Populations are used when your research question requires, or when you have access to, data from every member of the population. Usually, it is only straightforward to collect data from a whole population when it is small, accessible and cooperative. For example, a high school administrator wants to analyze the final exam scores of all graduating seniors to see if there is a trend. Since they are only interested in applying their findings to the graduating seniors of this high school, they use the whole population dataset.
When your population is large in size, geographically dispersed, or difficult to contact, it’s necessary to
use a sample. With statistical analysis, you can use sample data to make estimates or test hypotheses about
population data.

For example, suppose you want to study political attitudes among young people. Your population is the 900,000 undergraduate students in India. Because it is not practical to collect data from all of them, you use a sample of 900 undergraduate volunteers from five universities; this is the group who will complete your online survey.
Ideally, a sample should be randomly selected and representative of the population. Using probability
sampling methods (such as simple random sampling or stratified sampling) reduces the risk of sampling
bias and enhances both internal and external validity.

Need for Sampling

• Necessity: Sometimes it’s simply not possible to study the whole population due to its size or
inaccessibility.
• Practicality: It’s easier and more efficient to collect data from a sample.
• Cost-effectiveness: There are fewer participant, laboratory, equipment, and researcher costs
involved.
• Manageability: Storing and running statistical analyses on smaller datasets is easier and more reliable.

Population parameter vs sample statistic

When you collect data from a population or a sample, there are various measurements and numbers you
can calculate from the data. A parameter is a measure that describes the whole population. A statistic is a
measure that describes the sample.

You can use estimation or hypothesis testing to assess how much a sample statistic is likely to differ from the population parameter.

Example: In your study of students’ political attitudes, you ask your survey participants to rate themselves
on a scale from 1, very liberal, to 7, very conservative. You find that most of your sample identifies as
liberal – the mean rating on the political attitudes scale is 3.2.

You can use this statistic, the sample mean of 3.2, to make a scientific guess about the population parameter – that is, to infer the mean political attitude rating of all undergraduate students in India.
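As a rough sketch of this calculation, with made-up ratings chosen so that the sample mean comes out to 3.2 as in the example, Python's standard library can compute the sample statistic and its standard error:

```python
import statistics

# Made-up ratings on the 1 (very liberal) to 7 (very conservative) scale
ratings = [3, 2, 4, 3, 5, 2, 3, 4, 3, 3]

sample_mean = statistics.mean(ratings)       # the sample statistic (3.2)
sample_sd = statistics.stdev(ratings)        # sample standard deviation (n - 1 divisor)
std_error = sample_sd / len(ratings) ** 0.5  # typical size of the sampling error
```

The standard error indicates roughly how far the sample mean of 3.2 is likely to lie from the unknown population mean.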

What is Sampling?

Sampling is the process of selecting observations (a sample) to provide an adequate description of, and robust inferences about, the population. Ideally, the sample is representative of the population.

 There are 2 types of sampling:

 Non-Probability sampling

 Probability sampling

Basic Terms in Sampling


Before discussing sampling, let’s discuss the basic terms used in it.

 Sample element: a case or a single unit that is selected from a population and measured in some way—the basis of analysis (e.g., a person, a thing, a specific time, etc.).

 Universe: the theoretical aggregation of all possible elements—unspecified to time and space (e.g.,
University of Delhi).

 Population: the theoretical aggregation of specified elements as defined for a given survey, defined by time and space (e.g., DU students in 2009).

 Sample or Target population: the aggregation of the population from which the sample is actually
drawn (e.g., DU in 2009-10 academic year).

 Sample frame: a specific list that closely approximates all elements in the population—from this the
researcher selects units to create the study sample (database of DU students in 2009-10).

 Sample: a set of cases that is drawn from a larger pool and used to make generalizations about the
population

 Estimator
When a statistic is used to estimate a parameter, it is referred to as an estimator.
 Estimate
A particular value of the estimator is called an estimate.
Non-probability Sample
Any sampling process which does not ensure some nonzero probability for each element in the population
to be included in the sample would belong to the category of non-probability sampling. In this case,
samples may be picked up based on the judgment or convenience of the enumerator. Usually, the complete
sample is not decided at the beginning of the study but it evolves as the study progresses.
Probability Sample:
In this design, the sample is chosen by chance, with prior information available on what samples can be chosen and with what probability each sample gets chosen. Every item of the population has a known chance of being selected for the sample. Specifically, random sampling has the following mathematical properties:
a) Distinct samples from the population can be defined, which means that we can clearly state the items
that belong to a particular sample of the population.
b) Each sample has a known probability of selection.
c) Each sample is selected by a random process. It may have equal or unequal probability of getting
selected.
d) The method for computing the estimate from the sample must be stated and lead to unique estimates
for a specific sample. So for example, it can be stated that the estimate is the average of the measurements
on the individual items of the sample.
This sampling procedure is amenable to the calculation of frequency distribution of the estimates (for each
sample, when repeated sampling is done). We know the number of times a particular sample will be
selected and thereafter the estimate from the sample. Thus, a well-defined sampling theory can be
developed for such procedures. Also, for this procedure, it was realized that by the use of sampling theory
and normal distribution, the amount of error to be expected in the estimates made from the sample can be
approximately predicted. There are various ways to obtain samples that represent the population. Some of
the methods of sampling are the following:

Simple Random Sampling
It is a method of obtaining a sample of size ‘n’ from a population of ‘N’ units such that each of the NCn (“N choose n”) possible samples has an equal probability of being chosen. To be precise, the random variables X1, X2, …, Xn are said to form a simple random sample of size n if the following two conditions are met:
a) The Xi’s are independent random variables.
b) Every Xi has equal probability.
The Xi's are then termed independent and identically distributed (iid). Such random sampling is possible if sampling is with replacement or is from an infinite population (in which case the Xi's have equal probability and become independent). There are various ways by which we obtain random samples.
One is the lottery method, in which individual units of the population are allotted a number, which are
then put onto slips of paper. These slips are then shuffled and a random draw of required numbers (which
constitute the sample size) is done. This constitutes a random sample. The other method is that of using
random sampling numbers.
The most important virtue of this sampling method is that it is easier to derive the probability distribution of the sample statistic than for any other sampling method. In simple random sampling, each unit of the population has an equal probability n/N of being included in the sample, and all the units of the sample are selected using a random mechanism. For example, suppose a national fast food chain wants to randomly select 5 out of 50 states to sample the tastes of its consumers. A simple random sample ensures that each of the 50C5 = 2,118,760 possible samples of size 5 has the same likelihood of being used in the study.
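This example can be sketched in Python (the state names are placeholders): `math.comb` counts the possible samples, and `random.sample` draws one of them without replacement, each subset being equally likely:

```python
import math
import random

states = [f"state_{i}" for i in range(1, 51)]  # the 50 population units

# Number of equally likely samples of size 5: 50C5
n_samples = math.comb(50, 5)
print(n_samples)  # 2118760

random.seed(42)
chosen = random.sample(states, 5)  # one simple random sample of size 5
print(chosen)
```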
Systematic Random Sampling (SRS)
It is useful when the population units are ordered or listed in a random fashion, for example where houses are arranged in rows. Suppose a few houses are to be selected at random from a given city. Systematic sampling can be used here because the houses are usually arranged in rows and are thus numbered. The first house is selected at random, and then every 10th or 15th house is selected systematically. This is called systematic random sampling: the sampling units are selected at equal distances from each other in the frame, with random selection of only the first unit. Another situation in which systematic sampling is used is selecting items from a production or assembly line for quality testing. A manufacturer may select the first item on the production line randomly, after which every 20th item is selected for the sample.
Steps in SRS
Decide on sample size: n
Divide frame of N individuals into groups of k individuals: k=N/n
Randomly select one individual from the 1st group
Select every kth individual thereafter
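The steps above can be sketched in Python; `systematic_sample` is a hypothetical helper, and the 100 numbered houses are made-up data:

```python
import random

def systematic_sample(frame, n):
    """Pick every k-th unit after a random start, where k = N // n."""
    k = len(frame) // n          # sampling interval between selected units
    start = random.randrange(k)  # random unit from the first group of k
    return [frame[start + i * k] for i in range(n)]

houses = list(range(1, 101))            # 100 houses numbered along the rows
picked = systematic_sample(houses, 10)  # k = 10: a random start, then every 10th house
```

Only the first selection uses randomness; every later unit is fixed at the interval k from the previous one.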

Both simple random sampling and systematic sampling techniques are usually recommended when all the
population units are relatively homogeneous.
Stratified Random Sampling:
This method is adopted when the population from which a sample has to be drawn is heterogeneous. We
then divide the heterogeneous groups into homogeneous groups called strata and then draw a random
sample from each stratum. The heterogeneous groups or subpopulations must be non-overlapping and
adding up all the units of the strata must give us the total number of units of the population.
Steps
Divide population into two or more subgroups (called strata) according to some common characteristic
A simple random sample is selected from each subgroup, with sample sizes proportional to strata sizes
Samples from subgroups are combined into one
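The steps above can be sketched in Python with proportional allocation; `stratified_sample` is a hypothetical helper and the income strata below are made-up data:

```python
import random

def stratified_sample(strata, total_n):
    """Draw a simple random sample from each stratum,
    with sizes proportional to the stratum sizes (proportional allocation)."""
    population_size = sum(len(units) for units in strata.values())
    sample = []
    for units in strata.values():
        n_h = round(total_n * len(units) / population_size)  # stratum sample size
        sample.extend(random.sample(units, n_h))             # SRS within the stratum
    return sample

# Made-up income strata of unequal sizes (1,000 units in all)
strata = {
    "high": list(range(100)),
    "middle": list(range(100, 400)),
    "low": list(range(400, 1000)),
}
random.seed(0)
combined = stratified_sample(strata, 50)  # 5 high + 15 middle + 30 low units
```

Because each stratum is sampled separately, every subgroup is guaranteed representation in proportion to its size, which a single simple random sample cannot promise.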

Stratified random sampling can be used for various reasons, some of which are the following:
i) Administrative convenience may facilitate the use of stratification; for example, the agency conducting the survey may have field offices in different geographical strata, each of which can supervise the survey for a part of the population.

ii) Sampling problems may differ significantly for different parts of the population. For example, people living in institutions (e.g., hotels, hospitals, prisons) are often placed in a different stratum from people living in ordinary homes, because a different approach to sampling is appropriate for the two situations. A population of people can also be divided into strata of income groups, for example a high income group, an upper middle income group, a lower middle income group and a low income group, which can greatly enhance the analytical capacity of the sample statistic.
iii) Stratification may enhance the precision of the estimates of the characteristics of the whole population (as compared to a simple random sample) by dividing a heterogeneous population into subpopulations, each of which is internally homogeneous. The estimates from each stratum can be combined into a precise estimate for the whole population.
Sample Size
The size of the sample depends on various considerations, including population variability, statistical issues, economic factors, availability of participants, and the importance of the problem. A few are:
(i) One of the most important factors affecting the sample size is the extent of variability in the population. Taking an extreme case, if there is no variability, i.e. if all the members of the population are exactly identical, a sample of size 1 is as good as a sample of 100 or any other number. Therefore, the larger the variability, the larger is the sample size required.
(ii) A second consideration is the confidence in the inference made: the larger the sample size, the higher the confidence. In many situations, the confidence level is used as the basis for deciding the sample size, as we shall see in the next unit.
(iii) There is generally a trade-off between the accuracy of the sample in representing population values and the costs associated with sample size. The larger the sample, the more confident we can be that it accurately reflects what exists in the population, but large samples can be extremely expensive and time consuming. A small sample is less expensive and less time consuming, but it is not as accurate. Therefore, in situations requiring minimal error and maximum accuracy of prediction of population values, large samples will be required; in cases where more error can be tolerated, small samples will do. It is not unusual to use relatively small samples to generalize to millions of individuals.
(iv) Other factors that help determine an adequate sample size are the diversity of the population concerning the factors of interest and the number of factors. The greater the diversity among individuals and the greater the number of factors present, the larger the sample required to achieve representativeness.

Sampling error

A sampling error is the difference between a population parameter and a sample statistic. In your
study, the sampling error is the difference between the mean political attitude rating of your sample
and the true mean political attitude rating of all undergraduate students in India.

Sampling errors happen even when you use a randomly selected sample. This is because random
samples are not identical to the population in terms of numerical measures like means and standard
deviations.

Because the aim of scientific research is to generalize findings from the sample to the population, you
want the sampling error to be low. You can reduce sampling error by increasing the sample size.
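A small simulation illustrates both points: sampling error varies from draw to draw even with random sampling, and it tends to shrink as the sample grows. The population of attitude ratings below is hypothetical:

```python
import random
import statistics

rng = random.Random(42)
# Hypothetical population of 10,000 attitude ratings on a 1-10 scale.
population = [rng.randint(1, 10) for _ in range(10_000)]
mu = statistics.mean(population)  # population parameter (true mean)

# Sampling error = difference between the sample statistic and the parameter.
for n in (10, 100, 1000):
    sample = rng.sample(population, n)
    error = abs(statistics.mean(sample) - mu)
    print(f"n={n:4d}  sampling error = {error:.3f}")
```

Any single draw can buck the trend, but averaged over many repetitions the error falls roughly in proportion to one over the square root of the sample size.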

*********

Lesson 2

Univariate Frequency Distributions

Measure of Central Tendency

Summarisation of data is a necessary function of any statistical analysis. As a first step in this
direction, the huge mass of unwieldy data is summarised in the form of tables and frequency
distributions. In order to bring the characteristics of the data into sharp focus, these tables and
frequency distributions need to be summarised further. A measure of central tendency, or an average,
is an essential and important summary measure in any statistical analysis. It is a single value
which can be taken as representative of the whole distribution.

Functions of an Average

1. To present a huge mass of data in a summarised form: It is very difficult for the human mind to grasp
a large body of numerical figures. A measure of average is used to summarise such data into a single
figure which makes it easier to understand and remember.

2. To facilitate comparison: Different sets of data can be compared by comparing their averages. For
example, the level of wages of workers in two factories can be compared by mean (or average) wages
of workers in each of them.

3. To help in decision-making: Most of the decisions to be taken in research, planning, etc., are based
on the average value of certain variables.

Characteristics of a Good Average

A good measure of average must possess the following characteristics:

1. It should be rigidly defined, preferably by an algebraic formula, so that different persons obtain the
same value for a given set of data.

2. It should be easy to compute.

3. It should be easy to understand.

4. It should be based on all the observations.

5. It should be capable of further algebraic treatment.

6. It should not be unduly affected by extreme observations.

7. It should not be much affected by the fluctuations of sampling.

Various Measures of Average

Various measures of average can be classified into the following three categories:

1. Mathematical Averages:

(a) Arithmetic Mean or Mean

(b) Geometric Mean

(c) Harmonic Mean

(d) Quadratic Mean

2. Positional Averages:

(a) Median

(b) Mode

The above measures of central tendency will be discussed in the order of their popularity. Out of these,
the Arithmetic Mean, Median and Mode, being the most popular, are discussed in that order.

Arithmetic Mean

Before the discussion of arithmetic mean, we shall introduce certain notations. It will be assumed that
there are n observations whose values are denoted by X1, X2, ..... Xn, respectively. The sum of these
observations X1 + X2 + ..... + Xn will be denoted in abbreviated form as,

∑ Xi where ∑ (called sigma) denotes summation sign. The subscript of X, i.e., ‘i’ is a positive integer,
which indicates the serial number of the observation. Since there are n observations, variation in i will
be from 1 to n. When there is no ambiguity in range of summation, this indication can be skipped and
we may simply

write X1 + X2 + ..... + Xn = ∑ Xi

Arithmetic Mean is defined as the sum of observations divided by the number of observations.

It can be computed in two ways:

1. Simple arithmetic mean and


2. Weighted arithmetic mean

In the case of simple arithmetic mean, equal importance is given to all the observations, while in weighted
arithmetic mean, the importance given to various observations is not the same.

Calculation of simple arithmetic mean can be done in following ways:

1. When Individual Observations are Given

Let there be n observations X1, X2 ..... Xn. Their arithmetic mean can be calculated either by direct
method or by short cut method. The arithmetic mean of these observations will be denoted by 𝑋̅ .

(a) Direct Method: Under this method, 𝑋̅ is obtained by dividing sum of observations by number of
observations, i.e.,

𝑋̅ = ΣX/N
where ΣX is the sum of all the numbers in the sample, i.e., X1 + X2 + ..... + Xn, and N is the number of
observations in the sample.
As an example, the mean of the numbers 1, 2, 3, 6, 8 is 20/5 = 4 regardless of whether the numbers
constitute the entire population or just a sample from the population.
Example 1
Calculate the mean for pH levels of soil 6.8, 6.6, 5.2, 5.6, 5.8
Mean = 𝑋̅ = (6.8 + 6.6 + 5.2 + 5.6 + 5.8)/5 = 30/5 = 6

Short Cut Method


If the number of observations in the data is large and/or the figures themselves are large, it is tedious to
compute the arithmetic mean by the direct method. The computation can be made easier by the assumed
mean method. Here you assume a particular figure in the data as the arithmetic mean on the basis of
logic/experience. Then you take the deviation of each observation from the assumed mean, sum these
deviations, and divide the sum by the number of observations. The actual arithmetic mean is then the
assumed mean plus this ratio of the sum of deviations to the number of observations. Symbolically,
Mean = 𝑋̅ = A + Σ(Xᵢ − A)/n = A + Σdᵢ/n

where A = assumed mean, X = individual observation, n = total number of observations, and d = deviation
of each observation from the assumed mean, i.e., d = X − A.
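Both routes give the same answer, which a short Python sketch confirms for the Example 1 pH data (A = 6 is an arbitrary assumed mean):

```python
# Direct method vs. assumed-mean (short-cut) method for raw observations.
data = [6.8, 6.6, 5.2, 5.6, 5.8]
A = 6.0  # assumed mean, chosen by inspection

direct = sum(data) / len(data)                       # sum(X) / n
shortcut = A + sum(x - A for x in data) / len(data)  # A + sum(d) / n

print(direct, shortcut)  # both come to 6.0 (up to floating-point rounding)
```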

When Data are in the form of an Ungrouped Frequency Distribution
The mean for an ungrouped frequency distribution is obtained from the following formula:
𝑋̅ = ΣfX/N
where X = value of the variable, f = the frequency of the individual class, and N = the sum of the
frequencies (total frequency) in the sample.
Short-cut method
𝑋̅ = A + (Σfd/N) × c
where d = (X − A)/c, A = assumed mean, N = total frequency, and c = a common factor taken out of
the deviations (c = 1 when the deviations are not scaled).
Example 2
Given the following frequency distribution, calculate the arithmetic mean

Marks              : 64  63  62  61  60  59
Number of students :  8  18  12   9   7   6

Solution:

X      f     fX      d = X − A    fd
64     8     512       2          16
63     18    1134      1          18
62     12    744       0          0
61     9     549      -1          -9
60     7     420      -2          -14
59     6     354      -3          -18
Total  60    3713                 -7
Direct Method
𝑋̅ = 3713/60 = 61.88
Short-cut method

𝑋̅ = A + (Σfd/n) × c
Let A = 62
𝑋̅ = 62 + (−7/60) × 1 = 61.88
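The same computation for the Example 2 marks data, by both methods, can be sketched as:

```python
# Mean from an ungrouped frequency distribution (Example 2 data),
# by the direct method and the assumed-mean short-cut with A = 62.
marks = [64, 63, 62, 61, 60, 59]
freq = [8, 18, 12, 9, 7, 6]
A = 62

N = sum(freq)                                           # 60
direct = sum(f * x for x, f in zip(marks, freq)) / N    # 3713 / 60
shortcut = A + sum(f * (x - A) for x, f in zip(marks, freq)) / N

print(round(direct, 2), round(shortcut, 2))  # 61.88 61.88
```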
Grouped Data
When the items in a list are written in the form of a range, for example, 10-20, 20-30; we need to first
calculate the class mark or midpoint.
Class Mark = (Upper Limit + Lower Limit) / 2
Then, the mean can be calculated using the formula given below,
𝑋̅ = ΣfX/N
where X = the mid-point (class mark) of the individual class, f = the frequency of the individual class,
and N = the sum of the frequencies (total frequency) in the sample.
Short-cut method
𝑋̅ = A + (Σfd/N) × c
where d = (X − A)/c, A = assumed mean (usually one of the class marks), N = total frequency, and
c = width of the class interval.
Example 3: For the frequency distribution of seed yield per plot given in the table, calculate the mean
yield per plot.

Yield per plot (in g): 64.5-84.5  84.5-104.5  104.5-124.5  124.5-144.5
No. of plots         :     3          5            7            20
Solution

Yield (in g)    No. of plots (f)   Mid X    d = (X − A)/c   fd
64.5-84.5              3            74.5        -1          -3
84.5-104.5             5            94.5         0           0
104.5-124.5            7           114.5         1           7
124.5-144.5           20           134.5         2          40
Total                 35                                    44
Assuming A=94.5
The mean yield per plot is

Direct method:
𝑋̅ = ΣfX/n = (74.5×3 + 94.5×5 + 114.5×7 + 134.5×20)/35 = 4187.5/35 = 119.64 g

Shortcut method
𝑋̅ = A + (Σfd/n) × c = 94.5 + (44/35) × 20 = 119.64 g
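The grouped-data calculation of Example 3, with class marks taken as interval midpoints, can be sketched as:

```python
# Mean of grouped data (Example 3): class marks are interval midpoints;
# the short-cut uses assumed mean A = 94.5 and class width c = 20.
bounds = [(64.5, 84.5), (84.5, 104.5), (104.5, 124.5), (124.5, 144.5)]
freq = [3, 5, 7, 20]
A, c = 94.5, 20

mid = [(lo + hi) / 2 for lo, hi in bounds]   # 74.5, 94.5, 114.5, 134.5
N = sum(freq)                                # 35
direct = sum(f * x for x, f in zip(mid, freq)) / N
shortcut = A + c * sum(f * (x - A) / c for x, f in zip(mid, freq)) / N

print(round(direct, 2), round(shortcut, 2))  # 119.64 119.64
```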

Merits and Demerits of Arithmetic Mean

Merits

1. It is rigidly defined.

2. It is easy to understand and easy to calculate.

3. If the number of items is sufficiently large, it is more accurate and more reliable.

4. It is a calculated value and is not based on its position in the series.

5. It is possible to calculate even if some of the details of the data are lacking.

6. Of all averages, it is affected least by fluctuations of sampling.

7. It provides a good basis for comparison.

Demerits

1. It cannot be obtained by inspection nor located through a frequency graph.

2. It cannot be used in the study of qualitative phenomena not capable of numerical measurement, e.g.,
intelligence, beauty, honesty, etc.

3. No single item can be ignored without the risk of losing accuracy.

4. It is affected very much by extreme values.

5. It cannot be calculated for open-end classes.

6. It may lead to fallacious conclusions, if the details of the data from which it is computed are not given.
WEIGHTED MEAN
Weight here refers to the importance of a value in a distribution. A simple logic is that a number is as
important in the distribution as the number of times it appears. So, the frequency of a number can also
be its weight. But there may be other situations where we have to determine the weight on some other
basis. For example, the number of innings in which particular scores (50 or 100 or 200 runs) were made
may be considered as their weight, because it shows their importance. In calculating the weighted mean
of the scores of several innings of a player, we may instead take the strength of the opponent (as judged
by the proportion of matches lost by a team against that opponent) as the corresponding weight: the
higher the proportion, the stronger the opponent and hence the greater the weight. If xᵢ has a weight wᵢ,
then the weighted mean is defined as:
𝑋̅ = ΣXᵢWᵢ / ΣWᵢ, for all i = 1, 2, 3, …, k.
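As a small illustration (the scores and innings counts below are hypothetical), weighting each score by the number of innings in which it was made:

```python
# Weighted mean: each score x_i carries weight w_i (innings in which it occurred).
scores  = [50, 100, 200]  # x_i
innings = [10, 4, 1]      # w_i (hypothetical counts)

weighted_mean = sum(x * w for x, w in zip(scores, innings)) / sum(innings)
print(round(weighted_mean, 2))  # 73.33, i.e. 1100 / 15
```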
MEDIAN
Median is that value of the variable which divides the whole distribution into two equal parts. Here, it may
be noted that the data should be arranged in ascending or descending order of magnitude. When the
number of observations is odd then the median is the middle value of the data. For even number of
observations, there will be two middle values. So we take the arithmetic mean of these two middle values.
The number of observations below and above the median is the same. The median is not affected by
extremely large or extremely small values (as it corresponds to the middle value), nor by open-end class
intervals. In such situations, it is preferable to the mean.
1. Median for Ungrouped Data
Mathematically, if x1, x2,…, xn are the n observations then for obtaining the median first of all we have
to arrange these n values either in ascending order or in descending order. When the observations are
arranged in ascending or descending order, the middle value gives the median if n is odd. For even number
of observations there will be two middle values. So we take the arithmetic mean of these two values.
Md = ((N + 1)/2)th observation, when N is odd
Md = [(N/2)th observation + ((N + 2)/2)th observation] / 2, when N is even

Example 5: Find median of following observations:


6, 4, 3, 7, 8
Solution: First arrange the given data in ascending order as
3, 4, 6, 7, 8
Since, the number of observations i.e. 5, is odd, so median would be the middle value that is 6.
Example 6: Calculate median for the following data:
7, 8, 9, 3, 4, 10
Solution: First arrange given data in ascending order as
3, 4, 7, 8, 9, 10
Here, Number of observations (n) = 6 (even). So we get the median by
Md = [(6/2)th observation + ((6 + 2)/2)th observation] / 2
Md = (3rd observation + 4th observation) / 2

Md = (7 + 8)/2 = 7.5
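The rule for raw data can be sketched as a small function, checked against Examples 5 and 6:

```python
# Median of raw data: sort, then take the middle value (odd n)
# or the arithmetic mean of the two middle values (even n).
def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([6, 4, 3, 7, 8]))      # 6   (Example 5)
print(median([7, 8, 9, 3, 4, 10]))  # 7.5 (Example 6)
```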
For Ungrouped Data (when frequencies are given)
If Xᵢ are the different values of the variable with frequencies f, we first calculate the cumulative
frequencies. The median is then the value of the variable corresponding to the (Σf/2)th = (N/2)th
cumulative frequency.
Cumulative frequency (cf)
The cumulative frequency of each class is the sum of the frequency of that class and the frequencies of
the previous classes, i.e., the frequencies are added successively, so that the last cumulative frequency
gives the total number of items.

Note: If N/2 is not the exact cumulative frequency then value of the variable corresponding to next
cumulative frequencies is the median.
Example 7: Find Median from the given frequency distribution
X 20 40 60 80
F 7 5 4 3
Solution: First find cumulative frequency
X f c.f.
20 7 7
40 5 5+7=12
60 4 12+4=16
80 3 16+3=19
∑f = 19
Md = value of the variable corresponding to the (19/2)th = 9.5th cumulative frequency.
Since 9.5 is not among the cumulative frequencies, the next cumulative frequency is 12, and the value
of the variable against it is 40. So the median is 40.
Continuous Series
The steps given below are followed for the calculation of median in continuous series.
Step 1: Find cumulative frequencies.
Step 2: Find N/2
Step 3: Find the first cumulative frequency greater than N/2; the corresponding class interval is the
median class.
Step 4: Apply the formula: Median = l + ((N/2 − cf)/f) × i

Where l = lower limit of median class
N= no. of observations
cf denotes cumulative frequency of the class preceding the median class
f = frequency of median class
i = class size (assuming classes are of equal size)
Example: For the frequency distribution of weights of sorghum ear-heads given in the table below,
calculate the median.
Weights of ear No of ear
heads ( in g) heads (f)
60-80 22
80-100 38
100-120 45
120-140 35
140-160 24
Solution:

Weights of ear heads (in g)   No. of ear heads (f)   Less than class   Cumulative frequency (cf)
60-80                                 22                   <80                   22
80-100                                38                   <100                  60
100-120                               45                   <120                 105
120-140                               35                   <140                 140
140-160                               24                   <160                 164
Total                                164
Median = l + ((N/2 − cf)/f) × i

N/2 = 164/2 = 82
82 lies between 60 and 105. Corresponding to 60 the less-than class is 100, and corresponding to 105
the less-than class is 120. Therefore, the median class is 100-120. Its lower limit is 100.
Here, l = 100, cf = 60, N = 164, f = 45 and i = 20.
Median = 100 + ((82 − 60)/45) × 20 = 109.78 g
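The four steps can be sketched in Python for the ear-head data:

```python
# Median of grouped (continuous) data: find the median class, then
# interpolate with l + (N/2 - cf) / f * i.
classes = [(60, 80), (80, 100), (100, 120), (120, 140), (140, 160)]
freq = [22, 38, 45, 35, 24]

N = sum(freq)   # 164
cf = 0          # cumulative frequency of classes before the current one
for (l, u), f in zip(classes, freq):
    if cf + f >= N / 2:                       # median class located
        med = l + (N / 2 - cf) / f * (u - l)  # interpolation within the class
        break
    cf += f

print(round(med, 2))  # 109.78
```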

Merits of Median
1. Median is not influenced by extreme values because it is a positional average.
2. Median can be calculated in case of distribution with open-end intervals.
3. Median can be located even if the data are incomplete.

Demerits of Median
1. A slight change in the series may bring a drastic change in the median value.
2. In case of an even number of items or a continuous series, the median is an estimated value other
than any value in the series.
3. It is not suitable for further mathematical treatment except its use in calculating mean deviation.
4. It does not take into account all the observations.

Other Partition or Positional Measures


Median of a distribution divides it into two equal parts. It is also possible to divide it into more than
two equal parts. The values that divide a distribution into more than two equal parts are commonly
known as partition values or fractiles. Some important partition values are discussed below:
Quartiles
The values of a variable that divide a distribution into four equal parts are called quartiles. Since
three values are needed to divide a distribution into four parts, there are three quartiles, viz. Q1, Q2
and Q3, known as the first, second and the third quartile respectively. For a discrete distribution,
the first quartile (Q1) is defined as that value of the variate such that at least 25% of the observations
are less than or equal to it and at least 75% of the observations are greater than or equal to it. For a
continuous or grouped frequency distribution, Q1 is that value of the variate such that the area under
the histogram to the left of the ordinate at Q1 is 25% and the area to its right is 75%.
The formula for the computation of Q1 can be written by making suitable changes in the formula
of median. After locating the first quartile class, the formula for Q1 can be written as follows:
Q1 = l + ((N/4 − C)/f) × i

Here, l is the lower limit of the first quartile class, i is its width, f is its frequency and C is the
cumulative frequency of classes preceding the first quartile class. By definition, the second quartile is
the median of the distribution. The third quartile (Q3) of a distribution can also be defined in a similar
manner.
For a discrete distribution, Q3 is that value of the variate such that at least 75% of the observations
are less than or equal to it and at least 25% of the observations are greater than or equal to it. For a
grouped frequency distribution, Q3 is that value of the variate such that area under the histogram to
the left of the ordinate at Q3 is 75% and the area to its right is 25%. The formula for computation
of Q3 can be written as

Q3 = l + ((3N/4 − C)/f) × i; where the symbols have their usual meaning.

Deciles
Deciles divide a distribution into 10 equal parts and there are, in all, 9 deciles denoted as D1, D2,......
D9 respectively. For a discrete distribution, the i th decile Di is that value of the variate such that at
least (10i)% of the observation are less than or equal to it and at least (100 - 10i)% of the
observations are greater than or equal to it (i = 1, 2, ...... 9).
For a continuous or grouped frequency distribution, Di is that value of the variate such that the area
under the histogram to the left of the ordinate at Di is (10i)% and the area to its right is (100 - 10i)%.
The formula for the i th decile can be written as
Dj = l + ((jN/10 − C)/f) × i, where j = 1, 2, …, 9

Percentiles
Percentiles divide a distribution into 100 equal parts and there are, in all, 99 percentiles denoted as
P1, P2, ...... P25, ...... P40, ...... P60, ...... P99 respectively. For a discrete distribution, the kth
percentile Pk is that value of the variate such that at least k% of the observations are less than or
equal to it and at least (100 – k)% of the observations are greater than or equal to it. For a grouped
frequency distribution, Pk is that value of the variate such that the area under the histogram to the
left of the ordinate at Pk is k% and the area to its right is (100 – k)% . The formula for the kth
percentile can be written as
Pj = l + ((jN/100 − C)/f) × i, where j = 1, 2, …, 99

Example: Locate Median, Q1, Q3, D4, D7, P15, P60 and P90 from the following data:

Daily profit (in Rs.): 75 76 77 78 79 80 81 82 83 84 85
No. of shops         : 15 20 32 35 33 22 20 10  8  3  2
Solution:
First we calculate the cumulative frequencies, as in the following table:
Daily profit (in Rs.) 75 76 77 78 79 80 81 82 83 84 85
No. of Shops 15 20 32 35 33 22 20 10 8 3 2
C.F. 15 35 67 102 135 157 177 187 195 198 200

1. Determination of Median: Here N/2 = 100. From the cumulative frequency column, we note that
there are 102 (greater than 50% of the total) observations that are less than or equal to 78 and there
are 133 observations that are greater than or equal to 78. Therefore, Md = 78.
2. Determination of Q1 and Q3: First we determine N/4 which is equal to 50. From the cumulative
frequency column, we note that there are 67 (which is greater than 25% of the total) observations
that are less than or equal to 77 and there are 165 (which is greater than 75% of the total)
observations that are greater than or equal to 77. Therefore, Q1 = 77. Similarly, Q3 = 80.
3. Determination of D4 and D7: From the cumulative frequency column, we note that there are 102
(greater than 40% of the total) observations that are less than or equal to 78 and there are 133 (greater
than 60% of the total) observations that are greater than or equal to 78. Therefore, D4 = 78.
Similarly, D7 = 80.
4. Determination of P15, P60 and P90: From the cumulative frequency column, we note that there
are 35 (greater than 15% of the total) observations that are less than or equal to 76 and there are 185
(greater than 85% of the total) observations that are greater than or equal to 76. Therefore, P15 =
76. Similarly, P60 = 79 and P90 = 82.
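For a discrete distribution like this one, each partition value is simply the value whose cumulative frequency first reaches the required share of N; a sketch reproducing the worked answers:

```python
from itertools import accumulate

# Partition values for the shop-profit data: the value whose cumulative
# frequency first reaches p*N, where p = 1/2 for the median, 1/4 for Q1, etc.
profit = [75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85]
shops  = [15, 20, 32, 35, 33, 22, 20, 10, 8, 3, 2]

N = sum(shops)               # 200
cf = list(accumulate(shops))

def fractile(p):
    return next(v for v, c in zip(profit, cf) if c >= p * N)

print(fractile(0.50), fractile(0.25), fractile(0.75))  # 78 77 80  (Md, Q1, Q3)
print(fractile(0.40), fractile(0.70), fractile(0.15))  # 78 80 76  (D4, D7, P15)
print(fractile(0.60), fractile(0.90))                  # 79 82     (P60, P90)
```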

Mode
The mode refers to that value in a distribution which occurs most frequently. It is an actual value,
which has the highest concentration of items in and around it. It shows the centre of concentration
of the frequency in and around a given value. Therefore, where the purpose is to know the point of the
highest concentration, it is preferred. It is, thus, a positional measure.
It is very important in agriculture, for example in finding the typical height of a crop variety, the
major source of irrigation in a region, or the most disease-prone paddy variety. Thus the mode is an
important measure in case of qualitative data.
Computation of the Mode

Ungrouped or Raw Data
For ungrouped data or a series of individual observations, mode is often found by mere inspection.
Example
Find the mode for the following seed weights: 2, 7, 10, 15, 10, 17, 8, 10, 2 g.
Here 10 repeats 3 times and every other value appears only once. Hence, Mode = 10.
In some cases the mode may be absent while in some cases there may be more than one mode.
Example
(1) 12, 10, 15, 24, 30 (no mode because each value appears only once)
(2) 7, 10, 15, 12, 7, 14, 24, 10, 7, 20, 10

the modal values are 7 and 10 as both occur 3 times each.
Note: If 2 or more values appear with the same frequency, each is a mode. The downside to using
the mode as a measure of central tendency is that a set of data may have no mode or may have more
than 1 mode. However, the same set of data will have only 1 mean and only 1 median. The word
modal is often used when referring to the mode of a data set. If a data set has only 1 value that
occurs most often, the set is called unimodal. Likewise, a data set that has 2 values that occur with
the greatest frequency is referred to as bimodal. Finally, when a set has more than 2 values that
occur with the same greatest frequency, the set is called multimodal.

Example: The following table represents the number of times that 100 randomly selected students ate at
the school cafeteria during the first month of school:

Number of times    : 2  3  4  5  6  7  8
Number of students : 3  8 22 29 20  8 10

What is the mode of the numbers of times that a student ate at the cafeteria?

Solution : When data is arranged in a frequency table, the mode is simply the value that has the highest
frequency. Therefore, since the table shows that 29 students ate 5 times in the cafeteria, 5 is the mode of
the data set.

Mode = 5 times
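Finding the mode(s) of raw data by counting can be sketched as:

```python
from collections import Counter

# Mode(s) of raw data: the value(s) occurring with the highest frequency.
def modes(values):
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([2, 7, 10, 15, 10, 17, 8, 10, 2]))           # [10]     (unimodal)
print(modes([7, 10, 15, 12, 7, 14, 24, 10, 7, 20, 10]))  # [7, 10]  (bimodal)
```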

Mode in Grouped Data


To calculate the mode in case of a grouped frequency distribution, first identify the modal class, i.e.,
the class that has the highest frequency. Then apply the formula given below to calculate the mode.
Mode = l + ((f1 − f0)/(2f1 − f0 − f2)) × i

Where,
l= lower limit of the modal class
f1 = frequency of the modal class

f0= frequency of class preceding the modal class
f2 = frequency of class succeeding the modal class
i = width of the modal class
In summary, we can use the following steps to compute the mode of grouped or continuous frequency
distribution with equal class intervals:
Step 1: Prepare the frequency distribution table in such a way that its first column consists of the
observations and the second column the respective frequency.
Step 2: Determine the class of maximum frequency by inspection. This class is called the modal class.
Step 3: Calculate the mode using the formula: Mode = l + ((f1 − f0)/(2f1 − f0 − f2)) × i

Example
The heights, in cm, of 50 students are recorded:

Height (in cm)     : 125-130 130-135 135-140 140-145 145-150
Number of students :    7       14      10      10       9

Here, the maximum frequency is 14 and the corresponding class is 130-135. So, 130-135 is the modal
class, with l = 130, i = 5, f1 = 14, f0 = 7 and f2 = 10.
Mode = l + ((f1 − f0)/(2f1 − f0 − f2)) × i

Mode = 130 + ((14 − 7)/(2×14 − 7 − 10)) × 5 = 133.18

Hence, the modal height = 133.18 cm.
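The same calculation in Python, for the heights data (note the formula assumes the modal class is neither the first nor the last class, so that f0 and f2 exist):

```python
# Mode of grouped data via l + (f1 - f0) / (2*f1 - f0 - f2) * i.
classes = [(125, 130), (130, 135), (135, 140), (140, 145), (145, 150)]
freq = [7, 14, 10, 10, 9]

k = freq.index(max(freq))   # index of the modal class (here 130-135)
l, u = classes[k]
f1, f0, f2 = freq[k], freq[k - 1], freq[k + 1]
mode = l + (f1 - f0) / (2 * f1 - f0 - f2) * (u - l)

print(round(mode, 2))  # 133.18
```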

Empirical Relationship between Mean, Median and Mode

Frequency distribution of data shows how often the values in the data set occur. A frequency distribution
is said to be symmetrical when the values of mean, median and mode are equal. That is, there is an equal
number of values on both sides of the mean which means the values occur at regular frequencies.

[Figure 1]

In a negatively skewed frequency distribution, the median and the mode lie to the right of the mean.
That means that the mean is less than the median and the median is less than the mode
(Mean < Median < Mode).

[Figure 2]

In a positively skewed frequency distribution, the median and mode lie to the left of the mean. That
means that the mean is greater than the median and the median is greater than the mode
(Mean > Median > Mode).

[Figure 3]

Empirical studies have shown that in a moderately skewed frequency distribution, an important
relationship exists between the mean, median and mode: the distance between the mean and the
median is about one-third of the distance between the mean and the mode.
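This rule of thumb is often written as Mode ≈ 3 × Median − 2 × Mean, since (Mean − Median) ≈ (Mean − Mode)/3. A quick numerical check with illustrative (hypothetical) values:

```python
# If (mean - median) is one-third of (mean - mode), then
# mode = mean - 3*(mean - median) = 3*median - 2*mean.
mean, median = 61.88, 62.0   # illustrative values only
mode_estimate = 3 * median - 2 * mean
print(round(mode_estimate, 2))  # 62.24
```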

Geometric Mean

The geometric mean is a type of average, usually used for growth rates, like population growth or
interest rates. While the arithmetic mean adds items, the geometric mean multiplies them. Also, the
geometric mean is defined only for positive numbers.
The geometric mean of a series containing n observations is the nth root of the product of the values.
If X1, X2, …, Xn are the observations, then

GM = ⁿ√(X1 × X2 × X3 × … × Xn) = (X1 × X2 × X3 × … × Xn)^(1/n)

Taking logarithms,

log GM = (1/n) log(X1 × X2 × … × Xn)
       = (1/n)(log X1 + log X2 + log X3 + … + log Xn)
       = (Σ log Xi)/n

GM = antilog((Σ log Xi)/n)

For grouped data


GM = antilog((Σ f log Xi)/N)

GM is used in studies like bacterial growth, cell division, etc.


Example
If the weights of sorghum ear heads are 45, 60, 48, 100, 65 g, find the geometric mean.
Solution:
Weight of ear Log x
head x (g)
45 1.653
60 1.778
48 1.681
100 2.000
65 1.813
Total 8.925
Here n = 5
GM = antilog((Σ log Xi)/n) = antilog(8.925/5) = antilog(1.785) = 60.954
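The same calculation in Python, both as the nth root of the product and via logarithms (Python computes with full precision, so the answer differs slightly from the four-figure log-table value 60.954 above):

```python
import math

# Geometric mean of the ear-head weights: nth root of the product,
# and equivalently the antilog of the mean of the logs.
weights = [45, 60, 48, 100, 65]
n = len(weights)

gm_product = math.prod(weights) ** (1 / n)
gm_logs = 10 ** (sum(math.log10(w) for w in weights) / n)

print(round(gm_product, 2), round(gm_logs, 2))  # 60.97 60.97
```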
GM for an Ungrouped Frequency Distribution
Find the geometric mean for the following (data as used in the solution below):

Weight of sorghum (x)   No. of ear heads (f)
50                              5
63                             10
65                              5
130                            15
135                            15

Solution

Weight of sorghum (x)   No. of ear heads (f)   log x    f log x
50                              5              1.699     8.495
63                             10              1.799    17.990
65                              5              1.813     9.065
130                            15              2.114    31.710
135                            15              2.130    31.950
Total                          50              9.555    99.210

Here N = 50
GM = antilog((Σ f log x)/N) = antilog(99.21/50) = antilog(1.9842) = 96.43

Continuous distribution
Example
For the frequency distribution of weights of sorghum ear-heads given in the table below, calculate the
geometric mean.

Weights of ear heads (in g)   No. of ear heads (f)
60-80                                 22
80-100                                38
100-120                               45
120-140                               35
140-160                               20
Total                                160
Solution:
Weights of ear heads (in g)   No. of ear heads (f)   Mid x   log x   f log x
60-80                                 22                70    1.845    40.59
80-100                                38                90    1.954    74.25
100-120                               45               110    2.041    91.85
120-140                               35               130    2.114    73.99
140-160                               20               150    2.176    43.52
Total                                160                              324.20

Here N = 160
GM = antilog((Σ f log Xi)/N) = antilog(324.2/160) = antilog(2.02625) = 106.23

Harmonic mean (H.M)
The harmonic mean of a set of observations is defined as the reciprocal of the arithmetic average of the
reciprocals of the given values. If X1, X2, …, Xn are n observations,

HM = n / Σ(1/Xi), for i = 1, 2, …, n

For a frequency distribution,

HM = N / Σ(f/Xi)

H.M. is used when we are dealing with speeds, rates, etc.


Example: From the given data 5, 10, 17, 24, 30, calculate the H.M.

x       1/x
5       0.2000
10      0.1000
17      0.0588
24      0.0417
30      0.0333
Total   0.4338

HM = 5/0.4338 = 11.526

Example:
Number of tomatoes per plant are given below. Calculate the harmonic mean.

Number of tomatoes per plant 20 21 22 23 24 25


Number of plants 4 2 7 1 3 1

Solution
Number of tomatoes per plant (x)   No. of plants (f)   1/x      f/x
20                                         4          0.0500   0.2000
21                                         2          0.0476   0.0952
22                                         7          0.0454   0.3178
23                                         1          0.0435   0.0435
24                                         3          0.0417   0.1251
25                                         1          0.0400   0.0400
Total                                     18                   0.8216

HM = 18/0.8216 = 21.91
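The frequency-distribution formula HM = N / Σ(f/x) can be sketched as follows; computed at full precision the answer is 21.90, the small difference from 21.91 above reflecting the four-decimal rounding of the reciprocals:

```python
# Harmonic mean from a frequency distribution: N / sum(f/x).
x = [20, 21, 22, 23, 24, 25]
f = [4, 2, 7, 1, 3, 1]

N = sum(f)   # 18
hm = N / sum(fi / xi for xi, fi in zip(x, f))
print(round(hm, 2))  # 21.9
```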

Merits of H.M
1. It is rigidly defined.
2. It is defined on all observations.
3. It is amenable to further algebraic treatment.
4. It is the most suitable average when it is desired to give greater weight to smaller observations and
less weight to the larger ones.
Demerits of H.M
1. It is not easily understood.
2. It is difficult to compute.
3. It is only a summary figure and may not be an actual item in the series.
4. It gives greater importance to small items and is therefore, useful only when small items have to
be given greater weightage.
5. It is rarely used in grouped data.

Lesson 3
Measures of Dispersion
Introduction
A measure of central tendency summarizes the distribution of a variable into a single figure which can be
regarded as its representative. This measure alone, however, is not sufficient to describe a distribution
because there may be a situation where two or more different distributions have the same central value.
Conversely, it is possible that the pattern of distribution in two or more situations is same but the values
of their central tendency are different. Hence, it is necessary to define some additional summary measures
to adequately represent the characteristics of a distribution. One such measure is known as the measure of
dispersion or the measure of variation.
The concept of dispersion is related to the extent of scatter or variability in observations. The variability,
in an observation, is often measured as its deviation from a central value. A suitable average of all such
deviations is called the measure of dispersion. In the words of A.L. Bowley “Dispersion is the measure of
variation of the items.”
Objectives of Measuring Dispersion
The main objectives of measuring dispersion of a distribution are:
1. To test reliability of an average: A measure of dispersion can be used to test the reliability of an average.
A low value of dispersion implies that there is greater degree of homogeneity among various items and,
consequently, their average can be taken as more reliable or representative of the distribution.
2. To compare the extent of variability in two or more distributions: The extent of variability in two or
more distributions can be compared by computing their respective dispersions. A distribution having
lower value of dispersion is said to be more uniform or consistent.
3. To facilitate the computations of other statistical measures: Measures of dispersions are used in
computations of various important statistical measures like correlation, regression, test statistics,
confidence intervals, control limits, etc.
4. To serve as the basis for control of variations: The main objective of computing a measure of
dispersion is to know whether the given observations are uniform or not. This knowledge may be utilised
in many ways. In the words of Spurr and Bonini, “In matters of health, variations in body temperature,
pulse beat and blood pressure are basic guides to diagnosis.
Prescribed treatment is designed to control their variations. In industrial production, efficient operation
requires control of quality variations, the causes of which are sought through inspection and quality control
programs”. The extent of inequalities of income and wealth in any society may help in the selection of an
appropriate policy to control their variations.
Characteristics of a Good Measure of Dispersion
Like the characteristics of a measure of central tendency, a good measure of dispersion should possess the

following characteristics:
1. It should be easy to calculate.
2. It should be easy to understand.
3. It should be rigidly defined.
4. It should be based on all the observations.
5. It should be capable of further mathematical treatment.
6. It should not be unduly affected by extreme observations.
7. It should not be much affected by the fluctuations of sampling.
Measures of Dispersion
Various measures of dispersion can be classified into two broad categories:
1. The measures which express the spread of observations in terms of distance between the values of
selected observations. These are also termed as distance measures, e.g., range, interquartile range,
interpercentile range, etc.
2. The measures which express the spread of observations in terms of the average of deviations of
observations from some central value. These are also termed as the averages of second order, e.g., mean
deviation, standard deviation, etc.
The following are some important measures of dispersion
(a) Range
(b) Inter-Quartile Range
(c) Mean Deviation
(d) Standard Deviation
Range
The range of a distribution is the difference between its two extreme observations, i.e., the difference
between the largest and smallest observations. Symbolically, R = L – S where R denotes range, L and S
denote largest and smallest observations, respectively. R is the absolute measure of range. A relative
measure of range, also termed as the coefficient of range, is defined as:
Coefficient of Range = (L – S) / (L + S)

Example: Find range and coefficient of range for each of the following data:
1. Weekly wages of 10 workers of a factory are:
310, 350, 420, 105, 115, 290, 245, 450, 300, 375.

Solution:
Range = 450 – 105 = 345
Coefficient of Range = (450 – 105) / (450 + 105) = 345 / 555 = 0.62
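The arithmetic above can be checked with a short Python sketch (the code is added for illustration and is not part of the original text):

```python
# Range and coefficient of range for the weekly-wage data in the example.
wages = [310, 350, 420, 105, 115, 290, 245, 450, 300, 375]

L, S = max(wages), min(wages)    # largest and smallest observations
data_range = L - S               # R = L - S
coeff_range = (L - S) / (L + S)  # relative (unit-free) measure

print(data_range)                # 345
print(round(coeff_range, 2))     # 0.62
```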

Example: The distribution of marks obtained by 100 students:

Marks:           0-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80  80-90  90-100
No. of Students: 6     14     21     20     18     10     5      3      2      1
Solution: Range = 100 – 0 = 100 marks
Coefficient of Range = (100 – 0) / (100 + 0) = 1

Merits and Demerits of Range


Merits
1. It is easy to understand and easy to calculate.
2. It gives a quick measure of variability.
Demerits
1. It is not based on all the observations.
2. It is very much affected by extreme observations.
3. It only gives rough idea of spread of observations.
4. It does not give any idea about the pattern of the distribution. There can be two distributions with the same range but different patterns of distribution.
5. It is very much affected by fluctuations of sampling.
6. It is not capable of being treated mathematically.
7. It cannot be calculated for a distribution with open ends.
Uses of Range
In spite of its many serious demerits, the range is useful in the following situations:
1. It is used in the preparation of control charts for controlling the quality of manufactured items.
2. It is also used in the study of fluctuations of, say, price of a commodity, temperature of a patient, amount
of rainfall in a given period, etc.

Quartile Deviation or Semi-Interquartile Range
Half of the interquartile range is called the quartile deviation or semi-interquartile range.
Symbolically,
Q.D. = (Q3 – Q1) / 2

The value of Q.D. gives the average magnitude by which the two quartiles deviate from the median. If the distribution is approximately symmetrical, then Md ± Q.D. will include about 50% of the observations and, thus, we can write Q1 = Md – Q.D. and Q3 = Md + Q.D. Further, a low value of Q.D. indicates a high concentration of the central 50% of observations, and vice versa.
Quartile deviation is an absolute measure of dispersion. The corresponding relative measure is known as
coefficient of quartile deviation defined as
Coeff. of Q.D. = (Q3 – Q1) / (Q3 + Q1)

Example: Find the quartile deviation, and its coefficient from the following data:
Age (in years) X no. of students f
15 4
16 6
17 10
18 15
19 12
20 9
21 4
Solution:
Table for the calculation of Q.D.
Age (in years) X no. of students f cf
15 4 4
16 6 10
17 10 20
18 15 35
19 12 47
20 9 56
21 4 60

N/4 = 60/4 = 15. The cumulative frequency just greater than 15 is 20, so Q1 = 17 (by inspection).
3N/4 = 3 × 60/4 = 45. The cumulative frequency just greater than or equal to 45 is 47, so Q3 = 19.
Q.D. = (Q3 – Q1) / 2 = (19 – 17) / 2 = 1 year
Coeff. of Q.D. = (Q3 – Q1) / (Q3 + Q1) = (19 – 17) / (19 + 17) = 0.056
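The cumulative-frequency inspection used above can be sketched in Python (the helper function `quartile` is illustrative and not from the text):

```python
# Locating Q1 and Q3 by cumulative-frequency inspection, as in the
# worked example above (ages 15-21 of 60 students).
ages = [15, 16, 17, 18, 19, 20, 21]
freq = [4, 6, 10, 15, 12, 9, 4]
N = sum(freq)  # 60

def quartile(k):
    """Value whose cumulative frequency first reaches k*N/4."""
    target = k * N / 4
    cf = 0
    for x, f in zip(ages, freq):
        cf += f
        if cf >= target:
            return x

q1, q3 = quartile(1), quartile(3)
qd = (q3 - q1) / 2
coeff_qd = (q3 - q1) / (q3 + q1)
print(q1, q3, qd, round(coeff_qd, 3))   # 17 19 1.0 0.056
```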

Merits and Demerits of Quartile Deviation


Merits
1. It is rigidly defined.
2. It is easy to understand and easy to compute.
3. It is not affected by extreme observations and hence a suitable measure of dispersion when a distribution
is highly skewed.
4. It can be calculated even for a distribution with open ends.
Demerits
1. Since it is not based on all the observations, it is not a reliable measure of dispersion.
2. It is very much affected by the fluctuations of sampling.
3. It is not capable of being treated mathematically.
Mean Deviation or Average Deviation
Mean deviation is a measure of dispersion based on all the observations. It is defined as the arithmetic
mean of the absolute deviations of observations from a central value like mean, median or mode. Here the
dispersion in each observation is measured by its deviation from a central value. This deviation will be positive for an observation greater than the central value and negative for one less than it.
Calculation of Mean Deviation
The following are the formulae for the computation of mean deviation (M.D.) of an individual series of
observations X1, X2, ..... Xn:
1. Mean Deviation from Mean: MD_mean = Σ|X – X̄| / n
2. Mean Deviation from Median: MD_median = Σ|X – Md| / n
3. Mean Deviation from Mode: MD_mode = Σ|X – Mode| / n

In case of an ungrouped frequency distribution, the observations X1, X2, ..... Xn occur with respective
frequencies f1, f2, ..... fn such that Σfi = N. The corresponding formulae for M.D. can be written as:
1. Mean Deviation from Mean: MD_mean = Σf|X – X̄| / N
2. Mean Deviation from Median: MD_median = Σf|X – Md| / N
3. Mean Deviation from Mode: MD_mode = Σf|X – Mode| / N

The above formulae are also applicable to a grouped frequency distribution where the symbols
X1, X2, ..... Xn will denote the mid-values of the first, second ..... nth classes respectively.
Note: Mean deviation is minimum when deviations are taken from median.
Coefficient of Mean Deviation
The above formulae for mean deviation give an absolute measure of dispersion. The formulae for relative
measure, termed as the coefficient of mean deviation, are given below:
1. Coeff. of MD_mean = (M.D. from mean) / Mean
2. Coeff. of MD_median = (M.D. from median) / Median
3. Coeff. of MD_mode = (M.D. from mode) / Mode

Example: Calculate mean deviation from mean and median for the following data of heights (in inches)
of 10 persons.
60, 62, 70, 69, 63, 65, 60, 68, 63, 64
Also calculate their respective coefficients.
Solution:
Calculation of M.D. from Mean
Mean = X̄ = (60 + 62 + 70 + 69 + 63 + 65 + 60 + 68 + 63 + 64) / 10 = 64.4 inches

X 60 62 70 69 63 65 60 68 63 64 Total
|X- 4.4 2.4 5.6 4.6 1.4 0.6 4.4 3.6 1.4 0.4 28.8
Mean|

Mean Deviation from Mean: MD_mean = Σ|X – X̄| / n = 28.8 / 10 = 2.88
Coeff. of MD_mean = 2.88 / 64.4 = 0.045

Calculation of M.D. from Md
Arranging the observations in order of magnitude, we have
60, 60, 62, 63, 63, 64, 65, 68, 69, 70
The median of the above observations = (63 + 64) / 2 = 63.5 inches.
Mean Deviation from Median: MD_median = Σ|X – Md| / n

X 60 62 70 69 63 65 60 68 63 64 Total
|X- Med| 3.5 1.5 6.5 5.5 0.5 1.5 5.5 4.5 0.5 0.5 28.0
MD median = 28/10 = 2.80
Also, the coefficient of M.D. from Md = 2.80/ 63.5 = 0.044.
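Both calculations above can be reproduced with a brief Python sketch (added for illustration; not part of the original text):

```python
# Mean deviation from the mean and from the median for the ten heights.
heights = [60, 62, 70, 69, 63, 65, 60, 68, 63, 64]
n = len(heights)

mean = sum(heights) / n
md_mean = sum(abs(x - mean) for x in heights) / n

s = sorted(heights)
median = (s[n // 2 - 1] + s[n // 2]) / 2   # n = 10 is even
md_median = sum(abs(x - median) for x in heights) / n

print(mean, round(md_mean, 2))       # 64.4 2.88
print(median, round(md_median, 2))   # 63.5 2.8
print(round(md_mean / mean, 3))      # 0.045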

Example: In a foreign language class, there are 4 languages. The number of students learning each language and the frequency of lectures per week are given as:

Language:                   Sanskrit  Spanish  French  English
No. of students (xi):       6         5        9       12
Frequency of lectures (fi): 5         7        4       9

Calculate the mean deviation about the mean for the given data.
Solution: Taking xi as the variable and fi as its frequency, N = Σfi = 25.
Mean = Σfi·xi / N = (6×5 + 5×7 + 9×4 + 12×9) / 25 = 209 / 25 = 8.36
Σfi·|xi – Mean| = 5(2.36) + 7(3.36) + 4(0.64) + 9(3.64) = 11.80 + 23.52 + 2.56 + 32.76 = 70.64
MD_mean = 70.64 / 25 = 2.8256
Coeff. of MD_mean = 2.8256 / 8.36 = 0.338
Example: Calculate Mean deviation about mean from the following grouped data
Class (X): 10-20  20-30  30-40  40-50  50-60  60-70  70-80
Frequency: 15     25     20     12     8      5      3

Solution

Class    f       Mid value (x)   f·x           |x – 35|   f·|x – 35|
10-20    15      15              225           20         300
20-30    25      25              625           10         250
30-40    20      35              700           0          0
40-50    12      45              540           10         120
50-60    8       55              440           20         160
60-70    5       65              325           30         150
70-80    3       75              225           40         120
Total    n = 88                  Σf·x = 3080              Σf·|x – 35| = 1100

Mean = 3080/ 88 = 35

Mean deviation from Mean = 1100/88 = 12.5
Coeff of MD mean = 12.5/ 35 = 0.3571
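The grouped calculation above can be sketched in Python using the class mid-values (added for illustration; not part of the original text):

```python
# Grouped mean deviation about the mean, using class mid-values.
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
freq = [15, 25, 20, 12, 8, 5, 3]

mids = [(lo + hi) / 2 for lo, hi in classes]
N = sum(freq)
mean = sum(f * x for f, x in zip(freq, mids)) / N
md = sum(f * abs(x - mean) for f, x in zip(freq, mids)) / N

print(mean, md, round(md / mean, 4))   # 35.0 12.5 0.3571
```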

Merits and Demerits of Mean Deviation


Merits
1. It is easy to understand and easy to compute.
2. It is based on all the observations.
3. It is less affected by extreme observations vis-a-vis range or standard deviation (to be discussed in the
next section).
4. It is not much affected by fluctuations of sampling.
Demerits
1. It is not capable of further mathematical treatment. Since mean deviation is the arithmetic mean of the absolute values of deviations, it is not convenient to manipulate algebraically. This necessitates a search for a measure of dispersion which is capable of further mathematical treatment.
2. It is not a well-defined measure of dispersion, since the deviations can be taken from any measure of central tendency.
Uses of M.D.
The mean deviation is a very useful measure of dispersion when the sample size is small and no elaborate analysis of the data is needed. Since the standard deviation gives more importance to extreme observations, the use of mean deviation is preferred in the statistical analysis of certain economic, business and social phenomena.
Standard Deviation
From the mathematical point of view, the practice of ignoring minus sign of the deviations, while
computing mean deviation, is very inconvenient and this makes the formula, for mean deviation,
unsuitable for further mathematical treatment. Further, if the signs are taken into account, the sum of
deviations taken from their arithmetic mean is zero. This would mean that there is no dispersion in the
observations. However, the fact remains that the various observations differ from each other. To overcome this problem, the squares of the deviations from the arithmetic mean are taken and the positive
dispersion. This measure of dispersion is known as standard deviation or root-mean square deviation.
Square of standard deviation is known as variance. The concept of standard deviation was introduced by
Karl Pearson in 1893.

The standard deviation is denoted by Greek letter ‘σ ’ which is called ‘small sigma’ or simply sigma.
In terms of symbols,
σ = √[(1/n) Σ(X – X̄)²], where n is the number of observations; and
σ = √[(1/N) Σf(X – X̄)²] for a grouped or ungrouped frequency distribution, where an observation Xi occurs with frequency fi.
It should be noted here that the units of σ are same as the units of X.
Calculation of Standard Deviation
There are two methods of calculating standard deviation: (i) Direct Method (ii) Short-cut Method
Direct Method
1. Individual Series: If there are n observations X1, X2, ...... Xn, various steps in the calculation of
standard deviation are:
Step 1: Find the mean, X̄.
Step 2: For each data point, find the square of its deviation from the mean, (X – X̄)².
Step 3: Sum the values from Step 2, i.e., Σ(X – X̄)².
Step 4: Divide by the number of data points: (1/n) Σ(X – X̄)².
Step 5: Take the square root: σ = √[(1/n) Σ(X – X̄)²].

Variance
The variance is defined as the average of the squared deviations from the mean. In other words, variance is the square of the standard deviation, i.e., σ².
Example:
The heights of dogs are: 600mm, 470mm, 170mm, 430mm and 300mm.
Find out the Mean, the Variance, and the Standard Deviation.
Solution:
First step is to find the Mean:
Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394

Hence, the mean (average) height of the dogs is 394 mm.


Second step, calculate each dog's difference from the Mean

X X- 394 (X-394)2
600 206 42436
470 76 5776
170 -224 50176
430 36 1296
300 -94 8836
Total 108520

σ = √[(1/n) Σ(X – X̄)²] = √(108520 / 5) = √21704 = 147.32 mm
Variance = σ² = 21704 mm²
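The steps above can be verified with a short Python sketch (added for illustration; not part of the original text):

```python
# Mean, population variance and standard deviation of the dog heights.
heights = [600, 470, 170, 430, 300]
n = len(heights)

mean = sum(heights) / n
variance = sum((x - mean) ** 2 for x in heights) / n  # divide by n, not n-1
sd = variance ** 0.5

print(mean, variance, round(sd, 2))   # 394.0 21704.0 147.32
```

Note that this is the population formula (dividing by n), matching the text; many libraries default to the sample formula (dividing by n – 1).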
2. Ungrouped or Grouped Frequency Distributions: Let the observations X1, X2 ...... Xn appear
with respective frequencies f1, f2 ...... fn, where ∑f= N. As before, if the distribution is grouped,
then X1, X2 ...... Xn will denote the mid-values of the first, second ..... nth class intervals
respectively. The formulae for the calculation of standard deviation and variance can be written
as

σ = √[(1/N) Σf(X – X̄)²] and σ² = (1/N) Σf(X – X̄)², respectively.

Here also, we can show that


Variance = Mean of squares – Square of the mean
Therefore, we can write
σ² = ΣfX² / N – (ΣfX / N)²
σ = √[ΣfX² / N – (ΣfX / N)²]

Example: Calculate standard deviation and variance of the following data :


X: 10 11 12 13 14 15 16 17 18
f: 2 7 10 12 15 11 10 6 3

Solution.
Calculation of Standard Deviation
X      f    fX     X – X̄   (X – X̄)²   f(X – X̄)²   fX²
10     2    20     –4      16         32          200
11     7    77     –3      9          63          847
12     10   120    –2      4          40          1440
13     12   156    –1      1          12          2028
14     15   210    0       0          0           2940
15     11   165    1       1          11          2475
16     10   160    2       4          40          2560
17     6    102    3       9          54          1734
18     3    54     4       16         48          972
Total  76   1064                      300         15196

Mean = 1064 / 76 = 14
σ2 = 300/76 = 3.95

σ = √3.95 = 1.99
Alternative Method
From the last column of the above table, we have
Sum of squares = 15196
Mean of squares = 15196 / 76 = 199.95
Thus, σ² = Mean of squares – Square of the mean = 199.95 – (14)² = 3.95 and σ = √3.95 = 1.99
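The shortcut "mean of squares minus square of the mean" can be sketched in Python (added for illustration; not part of the original text):

```python
# sigma^2 = mean of squares - square of the mean, for the X/f table above.
X = [10, 11, 12, 13, 14, 15, 16, 17, 18]
f = [2, 7, 10, 12, 15, 11, 10, 6, 3]
N = sum(f)

mean = sum(fi * xi for fi, xi in zip(f, X)) / N
mean_of_squares = sum(fi * xi * xi for fi, xi in zip(f, X)) / N
variance = mean_of_squares - mean ** 2

print(mean, round(variance, 2), round(variance ** 0.5, 2))   # 14.0 3.95 1.99
```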
Example: Calculate standard deviation of the following series:

Weekly wages   No. of workers   Weekly wages   No. of workers
100-105        200              130-135        410
105-110        210              135-140        320
110-115        230              140-145        280
115-120        320              145-150        210
120-125        350              150-155        160
125-130        520              155-160        90
Solution

CI F Mid values d=X- A fd fd2


100-105 200 102.5 -25 -5000 125000
105-110 210 107.5 -20 -4200 84000
110-115 230 112.5 -15 -3450 51750
115-120 320 117.5 -10 -3200 32000
120-125 350 122.5 -5 -1750 8750
125-130 520 127.5 0 0 0
130-135 410 132.5 5 2050 10250
135-140 320 137.5 10 3200 32000
140-145 280 142.5 15 4200 63000
145-150 210 147.5 20 4200 84000
150-155 160 152.5 25 4000 100000
155-160 90 157.5 30 2700 81000
Total 3300 2750 671750
A= 127.5
σ² = Σfd² / N – (Σfd / N)²
σ² = 671750 / 3300 – (2750 / 3300)² = 202.87

σ = √202.87 = 14.24
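The short-cut method above can be reproduced in Python (added for illustration; not part of the original text):

```python
# Short-cut method with deviations d = X - A from an assumed mean A = 127.5.
mids = [102.5 + 5 * k for k in range(12)]   # mid-values 102.5, 107.5, ..., 157.5
f = [200, 210, 230, 320, 350, 520, 410, 320, 280, 210, 160, 90]
A = 127.5
N = sum(f)

d = [x - A for x in mids]
sum_fd = sum(fi * di for fi, di in zip(f, d))
sum_fd2 = sum(fi * di * di for fi, di in zip(f, d))
variance = sum_fd2 / N - (sum_fd / N) ** 2

print(N, sum_fd, sum_fd2)          # 3300 2750.0 671750.0
print(round(variance ** 0.5, 2))   # 14.24
```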
Coefficient of Variation
The standard deviation is an absolute measure of dispersion and is expressed in the same units as the variable X. A relative measure of dispersion based on the standard deviation is known as the coefficient of variation (C.V.), given by
C.V. = (σ / X̄) × 100
This measure, introduced by Karl Pearson, is used to compare the variability or homogeneity or stability or uniformity or consistency of two or more sets of data. The data having a higher value of the coefficient of variation are said to be more dispersed or less uniform.

Example: Calculate standard deviation and its coefficient of variation from the following data:
Measurements 0-5 5-10 10-15 15-20 20-25
Frequency 4 1 10 3 2
Solution:
CI      f    Mid Value (X)   d = (X – A)/i   f·d   f·d²
0-5     4    2.5             –2              –8    16
5-10    1    7.5             –1              –1    1
10-15   10   12.5            0               0     0
15-20   3    17.5            1               3     3
20-25   2    22.5            2               4     8
Total   20                                   –2    28
Here d = (X – 12.5) / 5, with A = 12.5 and i = 5.
Mean = A + i·(Σfd / N) = 12.5 – (5 × 2) / 20 = 12
σ = √[Σfd² / N – (Σfd / N)²] × i = √[28/20 – (2/20)²] × 5 = 5.89
Thus, the coefficient of variation (C.V.) = (5.89 / 12) × 100 = 49%.
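The step-deviation calculation above can be sketched in Python (added for illustration; not part of the original text):

```python
# Coefficient of variation via step deviations d = (X - A)/i with A = 12.5, i = 5.
mids = [2.5, 7.5, 12.5, 17.5, 22.5]
f = [4, 1, 10, 3, 2]
A, i = 12.5, 5
N = sum(f)

d = [(x - A) / i for x in mids]
sfd = sum(fi * di for fi, di in zip(f, d))        # -2
sfd2 = sum(fi * di * di for fi, di in zip(f, d))  # 28
mean = A + i * sfd / N
sigma = i * (sfd2 / N - (sfd / N) ** 2) ** 0.5
cv = sigma / mean * 100

print(mean, round(sigma, 2), round(cv))   # 12.0 5.89 49
```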
Merits, Demerits and Uses of Standard Deviation
Merits
1. It is a rigidly defined measure of dispersion.
2. It is based on all the observations.
3. It is capable of being treated mathematically. For example, if standard deviations of a number of groups
are known, their combined standard deviation can be computed.
4. It is not very much affected by the fluctuations of sampling and, therefore, is widely used in sampling
theory and test of significance.
Demerits
1. As compared to the quartile deviation and range, etc., it is difficult to understand and difficult to
calculate.
2. It gives more importance to extreme observations.

3. Since it depends upon the units of measurement of the observations, it cannot be used to compare the
dispersions of the distributions expressed in different units.
Uses of Standard Deviation
1. Standard deviation can be used to compare the dispersions of two or more distributions when their units
of measurements and arithmetic means are same.
2. It is used to test the reliability of mean. It may be pointed out here that the mean of a distribution with
lower standard deviation is said to be more reliable.
SKEWNESS
Skewness means lack of symmetry. In mathematics, a figure is called symmetric if there exists a point in it through which, if a perpendicular is drawn on the X-axis, it divides the figure into two congruent parts, i.e., parts identical in all respects, mirror images of each other that can be superimposed on one another.
In Statistics, a distribution is called symmetric if mean, median and mode coincide. Otherwise, the
distribution becomes asymmetric. If the right tail is longer, we get a positively skewed distribution for
which mean > median > mode while if the left tail is longer, we get a negatively skewed distribution for
which mean < median < mode.
The example of the Symmetrical curve, Positive skewed curve and Negative skewed curve are given as
follows:

Figure: Symmetrical curve, negatively skewed curve and positively skewed curve


Difference between Variance and Skewness
The following two points of difference between variance and skewness should be carefully noted.
1. Variance tells us about the amount of variability while skewness gives the direction of variability.

2. In business and economic series, measures of variation have greater practical application than
measures of skewness. However, in medical and life science field measures of skewness have greater
practical applications than the variance.

Various measures of Skewness
Measures of skewness help us to know to what degree and in which direction (positive or negative) a frequency distribution departs from symmetry. Although positive or negative skewness can be detected graphically, depending on whether the right tail or the left tail is longer, a graph gives no idea of the magnitude. Besides, borderline cases between symmetry and asymmetry may be difficult to detect graphically. Hence, some statistical measures are required to find the magnitude of the lack of symmetry. A good measure of skewness should possess three criteria:
1. It should be a unit-free number so that the shapes of different distributions, so far as symmetry is concerned, can be compared even if the units of the underlying variables are different;
2. If the distribution is symmetric, the value of the measure should be zero. Similarly, the measure
should give positive or negative values according as the distribution has positive or negative
skewness respectively; and
3. As we move from extreme negative skewness to extreme positive skewness, the value of the measure
should vary accordingly.
Measures of skewness can be both absolute as well as relative. Since in a symmetrical distribution the mean, median and mode are identical, the more the mean moves away from the mode, the larger the asymmetry or skewness. An absolute measure of skewness cannot be used for purposes of comparison because the same amount of skewness has different meanings in a distribution with small variation and in a distribution with large variation.
Absolute Measures of Skewness
Following are the absolute measures of skewness:
1. Skewness (Sk) = Mean – Median

2. Skewness (Sk) = Mean – Mode

3. Skewness (Sk) = (Q3 - Q2) - (Q2 - Q1)

For comparing two series, absolute measures cannot be used; for that, we calculate the relative measures, which are called coefficients of skewness. Coefficients of skewness are pure numbers, independent of the units of measurement.
In order to make a valid comparison between the skewness of two or more distributions we have to eliminate the disturbing influence of variation. Such elimination can be done by dividing the absolute skewness by the standard deviation. The following are the important methods of measuring relative skewness:
β and γ Coefficient of Skewness
Karl Pearson defined the following β and ℽ coefficients of skewness, based upon the second and third
central moments:

β1 = µ3² / µ2³
It is used as a measure of skewness. For a symmetrical distribution, β1 shall be zero. β1 as a measure of skewness does not tell about the direction of skewness, i.e., positive or negative, because µ3, being the sum of cubes of the deviations from the mean, may be positive or negative, but µ3² is always positive; also, µ2, being the variance, is always positive. Hence, β1 is always positive. This drawback is removed if we calculate Karl Pearson's gamma coefficient γ1, which is the signed square root of β1, i.e.
γ1 = ±√β1 = µ3 / µ2^(3/2)

Then the sign of the skewness depends upon the value of µ3, whether it is positive or negative. It is advisable to use γ1 as the measure of skewness.
Karl Pearson’s Coefficient of Skewness
This method is most frequently used for measuring skewness. The formula for measuring coefficient of
skewness is given by
Sk = (Mean – Mode) / σ

The value of this coefficient would be zero in a symmetrical distribution. If mean is greater than mode,
coefficient of skewness would be positive otherwise negative. The value of the Karl Pearson’s coefficient
of skewness usually lies between ± 1 for moderately skewed distribution. If mode is not well defined, we
use the formula
Sk = 3(Mean – Median) / σ

This is obtained by using the relationship Mode = 3 Median – 2 Mean. Here, –3 ≤ Sk ≤ 3; in practice, these limits are rarely attained.
Bowley’s Coefficient of Skewness
This method is based on quartiles. The formula for calculating coefficient of skewness is given by
Sk = [(Q3 – Q2) – (Q2 – Q1)] / (Q3 – Q1) = (Q3 – 2Q2 + Q1) / (Q3 – Q1)

The value of Sk is zero for a symmetrical distribution. If the value is greater than zero, the distribution is positively skewed, and if it is less than zero, the distribution is negatively skewed. It takes values between +1 and –1.
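The quartile formula above can be sketched in Python (the quartile values below are illustrative inputs, not taken from a worked example in the text):

```python
# Bowley's quartile coefficient of skewness.
def bowley_sk(q1, q2, q3):
    return (q3 - 2 * q2 + q1) / (q3 - q1)

print(bowley_sk(10, 15, 20))             # 0.0  (quartiles equidistant from the median: symmetric)
print(round(bowley_sk(10, 12, 20), 2))   # 0.6  (positively skewed)
```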
Example 1: For a distribution, Karl Pearson’s coefficient of skewness is 0.64, standard deviation is 13 and mean is 59.2. Find the mode and median.
Solution: We have given
Sk = 0.64, σ = 13 and Mean = 59.2
Therefore, by using the formula
Sk = (Mean – Mode) / σ
0.64 = (59.2 – Mode) / 13
Mode = 59.2 – 8.32 = 50.88

Mode = 3 Median – 2 Mean
50.88 = 3 Median – 2(59.2)
Median = (50.88 + 118.4) / 3 = 169.28 / 3 = 56.43
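The same rearrangement can be sketched in Python (added for illustration; not part of the original text):

```python
# Recovering the mode and median from Sk, sigma and the mean (Example 1).
sk, sigma, mean = 0.64, 13, 59.2

mode = mean - sk * sigma        # from Sk = (mean - mode) / sigma
median = (mode + 2 * mean) / 3  # from mode = 3*median - 2*mean

print(round(mode, 2), round(median, 2))   # 50.88 56.43
```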
Remarks about Skewness
1. If the value of mean, median and mode are same in any distribution, then the skewness does not exist
in that distribution. Larger the difference in these values, larger the skewness;
2. If the sums of the frequencies are equal on both sides of the mode, then skewness does not exist;
3. If the first and third quartiles are equidistant from the median, then skewness does not exist. Similarly, if the deciles (first and ninth) and percentiles (first and ninety-ninth) are equidistant from the median, then there is no asymmetry;
4. If the sums of positive and negative deviations obtained from mean, median or mode are equal then
there is no asymmetry; and
5. If the graph of the data forms a normal curve which, when folded at the middle, has one part overlapping the other fully, then there is no asymmetry.
KURTOSIS
Even with knowledge of the measures of central tendency, dispersion and skewness, we cannot get a complete idea of a distribution. In addition to these measures, we need another measure to get a complete idea about the shape of the distribution, which can be studied with the help of kurtosis. Prof. Karl Pearson has called it the “Convexity of a Curve”. Kurtosis gives a measure of the flatness or peakedness of a distribution.
The degree of kurtosis of a distribution is measured relative to that of a normal curve. The curves with
greater peakedness than the normal curve are called “Leptokurtic”. The curves which are more flat than
the normal curve are called “Platykurtic”. The normal curve is called “Mesokurtic.” The Fig. below
describes the three different curves mentioned above

Figure: Platykurtic Curve, Mesokurtic Curve and Leptokurtic Curve
Karl Pearson’s Measures of Kurtosis
For calculating the kurtosis, the second and fourth central moments of variable are used. For this,
following formula given by Karl Pearson is used:
β2 = µ4 / µ2²
or γ2 = β2 – 3
where, µ2 = Second order central moment of distribution
µ4 = Fourth order central moment of distribution
Description
1. If β2 = 3 or ℽ2 = 0, then curve is said to be mesokurtic;
2. If β2 < 3 or ℽ2 < 0, then curve is said to be platykurtic;
3. If β2 > 3 or ℽ2 > 0, then curve is said to be leptokurtic;

Example 2: First four moments about mean of a distribution are 0, 2.5, 0.7 and 18.75. Find coefficient of
skewness and kurtosis.
Solution: We have µ1 = 0, µ2 = 2.5, µ3 = 0.7 and µ4 = 18.75

Skewness: β1 = µ3² / µ2³ = (0.7)² / (2.5)³ = 0.031
Kurtosis: β2 = µ4 / µ2² = 18.75 / (2.5)² = 3

Since β2 equals 3, the curve is mesokurtic.
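Example 2 can be verified with a short Python sketch (added for illustration; not part of the original text):

```python
# beta_1 and beta_2 from the first four central moments (Example 2).
mu2, mu3, mu4 = 2.5, 0.7, 18.75

beta1 = mu3 ** 2 / mu2 ** 3   # measure of skewness
beta2 = mu4 / mu2 ** 2        # measure of kurtosis
gamma2 = beta2 - 3

print(round(beta1, 3), beta2, gamma2)   # 0.031 3.0 0.0
```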

Lesson 4
Introduction to Correlation and Regression
I. Introduction to Correlation
Correlation analysis is used to quantify the association between two variables (e.g., between an
independent and a dependent variable or between two independent variables). Various experts have
defined correlation in their own words and their definitions, broadly speaking, imply that correlation
is the degree of association between two or more variables.
A.M. Tuttle defined correlation as “Correlation is an analysis of covariation between two or more
variables.”
In other words, variables are correlated if, corresponding to a change in one variable, there is a change in the other variable. This change may be in either direction: if one variable increases (or decreases), the other may also increase (or decrease).

Correlation Coefficient: It is a numerical measure of the degree of association between two or more
variables.
Correlation and Causation
Correlation analysis deals with the association or co-variation between two or more variables and helps
to determine the degree of relationship between two or more variables. But correlation does not
indicate a cause and effect relationship between two variables. It explains only co-variation. The high
degree of correlation between two variables may exist due to any one or a combination of the following
reasons.
1. Correlation may be due to pure chance: Especially in a small sample, correlation may arise by pure chance. There may be a high degree of correlation between two variables in a sample even though no relationship exists between them in the population, for example, the production of corn, the availability of dairy products, chlorophyll content and plant height. These variables have no real relationship; if one appears, it may be mere chance or coincidence. Such correlation is known as spurious or nonsensical correlation.
2. Both variables are influenced by some other variable(s): A high degree of correlation between two variables may arise because some common cause, or different causes, affect each of them. For example, a high degree of correlation may exist between the yield per acre of paddy and of wheat due to the effect of rainfall and other factors like fertilizers used, favourable weather conditions, etc. But neither of the two variables is the cause of the other; it is difficult to say which is the cause and which is the effect, and they may not have caused each other at all, there being an outside influence.
3. Mutual dependence: In this case, the variables affect each other. Which variable is the cause and which the effect has to be judged from the circumstances. For example, consider the production of jute and rainfall: rainfall is the cause and jute production the effect, since the effect of rainfall bears directly on jute production.
Co-variance V/s Correlation
Covariance reveals only the direction of the linear relationship between two variables, whereas the correlation coefficient shows both the direction and the strength of the linear relationship between two
variables. The correlation coefficient is unit-free since the units in the numerator cancel with those in
the denominator. Hence, it can be used for comparison.

Properties of Coefficient of Correlation
1. The coefficient of correlation is independent of the change of origin and scale of
measurements.
2. It always lies between –1 and +1.
TYPES OF CORRELATION
In a bivariate distribution, correlation is classified into many types, of which the important ones are:
1. Positive and negative correlation
2. Simple and multiple correlation
3. Partial and total correlation
4. Linear and non-linear correlation.
Positive and Negative Correlation
Positive and negative correlation depend upon the direction of change of the variables. If two variables tend to move together in the same direction, i.e., an increase or decrease in the value of one variable is accompanied by an increase or decrease in the value of the other variable, then the correlation is called positive or direct correlation, e.g., the height and weight of a person.
If two variables tend to move in opposite directions, i.e., an increase or decrease in the value of one variable is accompanied by a decrease or increase in the value of the other variable, then the correlation is called negative or inverse correlation, e.g., the price and demand of a commodity.
Simple and Multiple Correlation
When there are only two variables under study, the relationship is described as simple correlation, for example, the yield of wheat and the use of fertilizers. In multiple correlation, there are more than two variables under study. Multiple correlation consists of the measurement of the relationship between a dependent variable and two or more independent variables, for example, when the relationship of plant yield with the number of pods and the number of clusters in pulses is studied.
Partial and Total Correlation
The study of two variables excluding the effect of some other variables is called partial correlation, for example, the correlation between the yield of maize and fertilizers excluding the effect of pesticides and manures. In total correlation, all the factors are taken into account.
Linear and Non-linear Correlation
If the ratio of change between two variables is uniform, then there is linear correlation between them. In the case of curvilinear or non-linear correlation, the amount of change in one variable does not bear a constant ratio to the amount of change in the other variable.
METHODS OF STUDYING CORRELATION
The different methods of finding out the relationship between two variables are:
1. Scatter diagram or scattergram or dot diagram,
This is the simplest method of finding out whether there is any relationship present between two
variables by plotting the values on a chart, known as scatter diagram. By this method a rough idea
about the correlation of two variables can be judged. In this method, the given data are plotted on a
graph paper in the form of dots. X variables are plotted on the horizontal axis and Y variables on the
vertical axis. Thus, we have the dots and we can know the scatter or concentration of the various
points. This will show the type of correlation.

A scatter diagram of the data helps in having a visual idea about the nature of association between two
variables. If the plotted points form a straight line running from the lower left-hand corner to the upper
right-hand corner, then there is a perfect positive correlation (i.e., r = +1). On the other hand, if the
points are in a straight line, having a falling trend from the upper left-hand corner to the lower right-
hand corner, it reveals that there is a perfect negative or inverse correlation (i.e., r = – 1). If the plotted
points fall in a narrow band, and the points are rising from lower left-hand corner to the upper right-
hand corner, there will be a high degree of positive correlation between the two variables. If the plotted
points fall in a narrow band from the upper left-hand corner to the lower right-hand corner, there will
be a high degree of negative correlation. If the plotted points lie scattered all over the diagram, there is
no correlation between the two variables.
Karl Pearson’s Coefficient of Correlation
Karl Pearson, a great biometrician and statistician, suggested a mathematical method for measuring
the magnitude of linear relationship between two variables. Karl Pearson’s method is the most widely
used method in practice and is known as Pearson’s coefficient of correlation. It gives information about
the direction as well as the magnitude of the relationship between two variables. It is denoted by the
symbol ‘r’ ; the formula for calculating Pearson’s r is:
r = Covariance(x, y) / (σx σy), or

r = [NΣXY − ΣX·ΣY] / [√(NΣX² − (ΣX)²) × √(NΣY² − (ΣY)²)]

In terms of deviations from the mean it is defined as

r = Σxy / (√(Σx²) × √(Σy²))

where x = X − X̄ and y = Y − Ȳ.
Mathematical Properties of Karl Pearson’s Coefficient of Correlation
(i) Coefficient of correlation lies between + 1 and – 1. Symbolically
–1≤r≤+1
That is ‘r’ cannot be less than – 1 and cannot exceed + 1.
(ii) The coefficient of correlation is independent of change of origin and scale of the variables X and
Y. By change of scale, we mean that all values of the X and Y series are multiplied or divided by some
constant; by change of origin, we mean that a constant is subtracted from all values of the X and Y
series.

Example: Calculate the Karl Pearson's coefficient of correlation from the following pairs of values:
Values of X: 12  9  8  10  11  13  7
Values of Y: 14  8  6   9  11  12  3
Solution:
The formula for Karl Pearson's coefficient of correlation is
r = [NΣXY − ΣX·ΣY] / [√(NΣX² − (ΣX)²) × √(NΣY² − (ΣY)²)]

The values of different terms, given in the formula, are calculated from the following table:
Xi    Yi    XiYi    Xi²    Yi²
12 14 168 144 196
9 8 72 81 64
8 6 48 64 36
10 9 90 100 81
11 11 121 121 121
13 12 156 169 144
7 3 21 49 9
Total 70 63 676 728 651

Here n = 7 (no. of pairs of observations)


rxy = (7×676 − 70×63) / [√(7×728 − 70²) × √(7×651 − 63²)] = 322 / (14 × 24.25) = 0.949

Hence, there is positive and high correlation between X and Y.
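The calculation above can be reproduced in a few lines of Python (a minimal sketch of the raw-score formula; the function and variable names are my own):

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson's coefficient of correlation via the raw-score formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))      # ΣXY
    sxx = sum(x * x for x in xs)                  # ΣX²
    syy = sum(y * y for y in ys)                  # ΣY²
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx**2) * math.sqrt(n * syy - sy**2)
    return num / den

x = [12, 9, 8, 10, 11, 13, 7]
y = [14, 8, 6, 9, 11, 12, 3]
print(round(pearson_r(x, y), 3))  # 0.949
```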


Example: Calculate the Karl Pearson's coefficient of correlation between X and Y from the following
data:
No. of pairs of observations n = 8,
Mean of X =11 and Mean of Y =10
Σ(X − X̄)² = 184;
Σ(Y − Ȳ)² = 148;
Σ(X − X̄)(Y − Ȳ) = 164
Solution:
In terms of deviations from the means,

r = Σ(X − X̄)(Y − Ȳ) / [√Σ(X − X̄)² × √Σ(Y − Ȳ)²]

r = 164 / (√184 × √148) = 0.99
Hence, there is positive and high correlation between X and Y.
Degree of Correlation
The degree of correlation between two variables can be ascertained from the quantitative value of the
coefficient of correlation. Karl Pearson's formula for the correlation coefficient (r) always yields a
result between +1 and −1. In the case of perfect positive correlation the result is r = +1, and in the
case of perfect negative correlation it is r = −1. In the absence of correlation the result is r = 0,
indicating that the scattering is very large. In experimental research, it is very difficult to obtain values
of r exactly equal to +1, −1 or 0.

The following table shows the approximate degree of correlation according to Karl Pearson's formula:

Degree of correlation                      Positive                Negative
Perfect correlation                        +1                      −1
Very high degree of correlation            +0.9 or more            −0.9 or more
Sufficiently high degree of correlation    from +0.75 to +0.9      from −0.75 to −0.9
Moderate degree of correlation             from +0.6 to +0.75      from −0.6 to −0.75
Only the possibility of correlation        from +0.3 to +0.6       from −0.3 to −0.6
Possibly no correlation                    less than +0.3          less than −0.3
Absence of correlation                     0                       0
Assumptions
There are some assumptions of Karl Pearson’s coefficient of correlation. They are as follows:
(i) Linear Relationship: If the two variables are plotted on a scatter diagram, it is assumed that the
plotted points will form a straight line. So, there is a linear relationship between the variables.
(ii) Normality: The correlated variables are each affected by a large number of independent causes,
so that they form a normal distribution. Variables such as quantity of money, age, weight, height, price
and demand are affected by such forces.
(iii) Causal Relationship: Correlation is only meaningful if there is a cause-and-effect relationship
between the forces affecting the distribution of items in the two series. It is meaningless if there is no
such relationship. For instance, there is no relationship between rice and weight, because the factors
that affect these two variables are not common.
(iv) Proper Grouping: It will be a better correlation analysis if there is an equal number of pairs.
(v) Error of Measurement: If the error of measurement is reduced to the minimum, the coefficient
of correlation is more reliable.
Merits and Limitations of Coefficient of Correlation
The chief merit of Karl Pearson's coefficient of correlation is that it is the most popular method for
expressing the degree and direction of linear association between two variables in terms of a pure
number, independent of the units of the variables. The measure, however, suffers from certain
limitations, given below:
1. The coefficient of correlation r does not give any idea about the existence of a cause-and-effect
relationship between the variables. A high value of r may be obtained even though neither variable
directly affects the other. Hence, any interpretation of r should be done very carefully.
2. It is only a measure of the degree of linear relationship between two variables. If the relationship is
not linear, the calculation of r has no meaning.
3. Its value is unduly affected by extreme items.
4. Compared with other methods, the computation of r is cumbersome and time-consuming.
Spearman’s Rank Correlation
This is a crude method of computing correlation between two characteristics. In this method, various
items are assigned ranks according to the two characteristics and a correlation is computed between
these ranks. This method is often used in the following circumstances:
1. When quantitative measurement of the characteristics is not possible, e.g., the results of a beauty
contest, where the individuals can only be ranked.
2. When the characteristic is measurable but it is desirable to avoid such measurement due to shortage
of time or money, or the complexity of calculations for large data sets.
3. When the given data contain some extreme observations, the value of Karl Pearson's coefficient is
likely to be unduly affected. In such a situation the rank correlation is preferred because it gives less
importance to the extreme observations.
4. It is used as a measure of the degree of association in situations where the nature of the population,
from which the data are collected, is not known.

Spearman’s Rank correlation coefficient is a technique which can be used to summarise the strength
and direction (negative or positive) of a relationship between two variables.
The result will always be between +1 and −1.
Steps - calculating the coefficient
• Create a table from your data.
• Rank the two data sets. Ranking is achieved by giving the ranking '1' to the biggest number in a
column, '2' to the second biggest value and so on. The smallest value in the column will get the
lowest ranking. This should be done for both sets of measurements.
• Find the difference in the ranks (d): this is the difference between the ranks of the two values on
each row of the table (the rank of the second variable subtracted from the rank of the first).
• Square the differences (d²) to remove negative values, and then sum them (Σd²).
• Calculate the coefficient (Rsp) using the formula below. The answer will always be between +1.0 (a
perfect positive correlation) and −1.0 (a perfect negative correlation).

Rsp = 1 − 6Σd² / (n³ − n)
Example: The following table displays the association between the IQ of each adolescent in a
sample with the number of hours they listen to rock music per month. Determine the strength of the
correlation between IQ and rock music using Spearman’s rank correlation.
Solution:
Step1: Give ranks to the data
Calculation Table
IQ (X)   Rock (Y)   Rank of X   Rank of Y   Di = Rank of X − Rank of Y   Di²
 99        3          4           2           2                            4
120        0          9           1           8                           64
 98       30          3           9          −6                           36
102       45          5          10          −5                           25
123       16         10           4           6                           36
105       25          6           7          −1                            1
 85       17          1           5          −4                           16
110       24          7           6           1                            1
117       26          8           8           0                            0
 90        5          2           3          −1                            1
Total                                                                    184

rsp = 1 − 6Σd² / (n³ − n)
    = 1 − (6×184) / (10³ − 10) = −0.115
Since, the value of correlation is near zero, hence, there is hardly any association between
variables.
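The steps above can be sketched in Python (a minimal illustration that assumes no tied values, reproducing the IQ/rock-music example; the names are my own):

```python
def ranks(values):
    """Rank 1 = smallest value; assumes no ties among the values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman's rank correlation: Rsp = 1 - 6*sum(d^2)/(n^3 - n)."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n**3 - n)

iq   = [99, 120, 98, 102, 123, 105, 85, 110, 117, 90]
rock = [3, 0, 30, 45, 16, 25, 17, 24, 26, 5]
print(round(spearman(iq, rock), 3))  # -0.115
```

Ranking from smallest to largest (rather than largest to smallest, as in the step list) flips the sign of each d but leaves every d², and hence Rsp, unchanged.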
Quotations of index numbers of equity share prices of a certain joint stock company and the
prices of preference shares are given below.

Using the method of rank correlation, determine the relationship between equity share and preference
share prices.
Solution:

Σdi² = 90; n = 7
Rank correlation is given by
rsp = 1 − (6×90) / [7(49 − 1)] = 1 − 1.6071 = −0.6071

Interpretation: There is a fairly strong negative correlation between equity share and preference share
prices; the two series tend to move in opposite directions.

Repeated Ranks
In case of a tie, i.e., when two or more individuals have the same rank, each individual is assigned
a rank equal to the mean of the ranks that would have been assigned to them in the event of there
being slight differences in their values. To understand this, let us consider the series 20, 21, 21,
24, 25, 25, 25, 26, 27, 28. Here the value 21 is repeated two times and the value 25 is repeated
three times. When we rank these values, rank 1 is given to 20. The values 21 and 21 could have
been assigned ranks 2 and 3 if these were slightly different from each other. Thus, each value will
be assigned a rank equal to mean of 2 and 3, i.e., 2.5. Further, the value 24 will be assigned a rank
equal to 4 and each of the values 25 will be assigned a rank equal to 6, the mean of 5, 6 and 7 and
so on.
Since the Spearman's formula is based upon the assumption of different ranks to different
individuals, therefore, its correction becomes necessary in case of tied ranks. It should be noted
that the means of the ranks will remain unaffected. Further, the changes in the variances are
usually small and are neglected. However, it is necessary to correct the term Σdi², and accordingly
the correction factor m(m² − 1)/12, where m denotes the number of observations tied at a particular
rank, is added to it for every tie. In the above example there are two correction factors, 2(2² − 1)/12
and 3(3² − 1)/12.
Hence, rsp is given by:

rsp = 1 − 6[Σd² + Σ m(m² − 1)/12] / (n³ − n)
Merits and Demerits of Rank Correlation Coefficient
Merits of Rank Correlation Coefficient
1. Spearman’s rank correlation coefficient can be interpreted in the same way as Karl Pearson’s
correlation coefficient;
2. It is easy to understand and easy to calculate;
3. If we want to see the association between qualitative characteristics, the rank correlation
coefficient is the only available formula;
4. Rank correlation coefficient is the non-parametric version of the Karl Pearson’s product
moment correlation coefficient; and
5. It does not require the assumption of the normality of the population from which the sample
observations are taken.
Demerits of Rank Correlation Coefficient
1. The product moment correlation coefficient can be calculated for a bivariate frequency distribution,
but the rank correlation coefficient cannot; and
2. If n > 30, this formula is time-consuming.
Example 4: Calculate rank correlation coefficient from the following data:
Expenditure on advertisement: 10  15  14  25  14  14  20  22
Profit:                        6  25  12  18  25  40  10   7
Solution: Let us denote the expenditure on advertisement by x and profit by y

x     Rank of x (Rx)   y     Rank of y (Ry)   d = Rx − Ry   d²
10    8                 6    8                 0             0
15    4                25    2.5               1.5           2.25
14    6                12    5                 1             1
25    1                18    4                −3             9
14    6                25    2.5               3.5          12.25
14    6                40    1                 5            25
20    3                10    6                −3             9
22    2                 7    7                −5            25
Total                                                       83.5

66
Rsp is given by the tie-corrected formula. Here rank 6 is repeated three times in the ranks of x, and
rank 2.5 is repeated twice in the ranks of y, so the correction factor is

3(3² − 1)/12 + 2(2² − 1)/12 = 2.50

Rsp = 1 − 6(83.5 + 2.50)/(8³ − 8) = 1 − 516/504 = −0.024

There is a negative association between expenditure on advertisement and profit.
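The tie-corrected calculation can be sketched in Python (a minimal illustration; the helper names are my own). Average ranks are assigned in ascending order, which leaves every d² unchanged since both series are ranked consistently:

```python
from collections import Counter

def average_ranks(values):
    """Tied values share the mean of the rank positions they occupy."""
    positions = {}
    for i, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(i)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def spearman_tied(xs, ys):
    """Spearman's coefficient with the m(m^2 - 1)/12 correction per tie."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2
             for rx, ry in zip(average_ranks(xs), average_ranks(ys)))
    cf = sum(m * (m * m - 1) / 12
             for series in (xs, ys)
             for m in Counter(series).values() if m > 1)
    return 1 - 6 * (d2 + cf) / (n**3 - n)

x = [10, 15, 14, 25, 14, 14, 20, 22]   # expenditure on advertisement
y = [6, 25, 12, 18, 25, 40, 10, 7]     # profit
print(round(spearman_tied(x, y), 3))   # -0.024
```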
II. Regression Analysis
Introduction
Correlation gives the degree and direction of the relationship between two variables, but it does not
describe the nature of that relationship, whereas regression describes the dependence of one variable
on an independent variable. Regression is a set of statistical methods used for estimating the
relationships between a dependent variable and one or more independent variables. It can be used to
assess the strength of the relationship between variables and to model the future relationship between
them.
Correlation vs. Regression
• ‘Correlation’, as the name suggests, determines the interconnection or co-relationship between the
variables, whereas ‘regression’ explains how an independent variable is numerically associated with
the dependent variable.
• In correlation, no distinction is made between independent and dependent variables; in regression,
the dependent and independent variables are distinct.
• The primary objective of correlation is to find a quantitative/numerical value expressing the
association between the values; the primary intent of regression is to estimate the values of a random
variable based on the values of a fixed variable.
• Correlation stipulates the degree to which the two variables move together; regression specifies the
effect of a unit change in the known variable (p) on the estimated variable (q).
• Correlation helps to establish the connection between the two variables; regression helps in
estimating a variable’s value based on another given value.

Uses of Regression Analysis


Regression analysis is useful in many studies.
(i) Regression analysis is used in biostatistics in all those fields where two or more related variables
have the tendency to go back to the average.
(ii) Regression analysis predicts the value of the dependent variable from the values of independent
variables.
(iii) Regression analysis is highly useful: the regression line or equation helps to estimate the value
of the dependent variable when the values of the independent variables are used in the equation.
(iv) We can calculate the coefficient of correlation (r) with the help of the regression coefficients.
(v) Regression analysis is used in the statistical estimation of demand curves, supply curves,
production functions, cost functions, consumption functions, etc.
Regression analysis includes several variations, such as linear, multiple linear, and nonlinear. The most
common models are simple linear and multiple linear. Nonlinear regression analysis is commonly used
for more complicated data sets in which the dependent and independent variables show a nonlinear
relationship.
Linear model assumptions
Linear regression analysis is based on the following fundamental assumptions:
i) The model is linear in parameters.
ii) The error term has zero mean and follows a normal distribution: E(μi) = 0.
iii) The error terms are independent, that is, the covariance between them is zero and there is no
autocorrelation: Cov(μi, μj) = 0 for all i ≠ j.
iv) The explanatory variables are fixed in repeated sampling.
v) All error terms have equal variance (i.e., they are homoscedastic): Var(μi) = σ².
vi) The number of observations is greater than the number of regression coefficients to be estimated.
n > k + 1 where k is the number of explanatory variables.
vii) There is no multicollinearity between the explanatory variables, that is, the explanatory variables
do not share a perfect linear relationship.

Simple linear regression
Simple linear regression is a model that assesses the relationship between a dependent variable and an
independent variable. In bivariate data there are two variables and therefore, there are two regression
lines.
The simple linear model Y on X is expressed using the following equation:
Y = β 0 + β1 X
Where:
Y – Dependent variable
X – Independent (explanatory) variable
β0 – Intercept
β1– Slope
The expression β0 + β1X is the deterministic component of the simple linear regression model, which
can be thought of as the expected value of Y for a given value of X. In other words, conditional on X,
E(Y) = β0 + β1X. The slope parameter β1 determines whether the linear relationship between X and
E(Y) is positive (β1 > 0) or negative (β1 < 0); β1 = 0 indicates that there is no linear relationship between
X and E(Y). The figure below shows the deterministic portion of the simple linear regression model for
various values of the intercept β0 and the slope β1.
FIGURE: Various examples of a simple linear regression model

The regression line of X on Y can be written as:

X = β0 + β1 Y

where Y is the independent variable and X is the dependent variable.
When the regression lines show some trend upward or downward, we say that there is some correlation
between two variables. But if both the lines of regression are perpendicular to each other, then both
the variables are uncorrelated or r = 0. Further, if both the lines of regression coincide, we say that
there is a perfect correlation between the variables or r = ± 1.
With the help of bivariate data, the two regression lines can be fitted by the method of least squares
for finding the values of unknown parameters.
The slope β1 and the intercept β0 of the sample regression equation of y on x are calculated as

β1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
β0 = ȳ − β1 x̄

and for x on y,

β1 = Σ(yi − ȳ)(xi − x̄) / Σ(yi − ȳ)²
β0 = x̄ − β1 ȳ

Example: Determine the equation of the straight line of y on x from the data:

x: 10  12  13  16  17  20  25
y: 10  22  24  27  29  33  37
Solution

x      y      x − x̄      y − ȳ     (x − x̄)(y − ȳ)   (x − x̄)²    (y − ȳ)²
10     10     −6.1429    −16        98.2857          37.7347     256
12     22     −4.1429     −4        16.5714          17.1633      16
13     24     −3.1429     −2         6.2857           9.8776       4
16     27     −0.1429      1        −0.1429           0.0204       1
17     29      0.8571      3         2.5714           0.7347       9
20     33      3.8571      7        27.0000          14.8776      49
25     37      8.8571     11        97.4286          78.4490     121
Σx = 113   Σy = 182                 248              158.8571    456

x̄ = 113/7 = 16.14
ȳ = 182/7 = 26

β1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = 248/158.857 = 1.56

β0 = ȳ − β1 x̄ = 26 − 1.56 × 16.14 = 0.82
Therefore, regression line Y on X is
Y = 0.82 + 1.56 X
The estimated slope coefficient of 1.56 suggests a positive relationship between Y and X. The
estimated intercept coefficient of 0.82 suggests that if X equals zero, then predicted Y = 0.82.
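The fit above can be sketched in Python (a minimal illustration; the function name is my own). Note that carrying full precision gives an intercept of about 0.80; the 0.82 in the worked example comes from rounding the slope to 1.56 before substituting:

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for the regression of y on x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) /
          sum((x - xbar) ** 2 for x in xs))
    b0 = ybar - b1 * xbar
    return b0, b1

x = [10, 12, 13, 16, 17, 20, 25]
y = [10, 22, 24, 27, 29, 33, 37]
b0, b1 = fit_line(x, y)
print(round(b1, 2), round(b0, 2))  # 1.56 0.8
```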
Deriving the Regression Line of X on Y:

β1 = Σ(yi − ȳ)(xi − x̄) / Σ(yi − ȳ)² = 248/456 = 0.544

β0 = x̄ − β1 ȳ = 16.14 − 0.544 × 26 = 1.996

Hence, the equation is
X = 1.996 + 0.544 Y
The estimated slope coefficient of 0.544 suggests a positive relationship between X and Y. The
estimated intercept coefficient of 1.996 suggests that if Y equals zero, then predicted X = 1.996.
PROPERTIES OF REGRESSION COEFFICIENTS
1. The correlation coefficient is the geometric mean of the two regression coefficients:

r = ±√(byx × bxy)

2. It follows from property 1 that both regression coefficients must have the same sign, i.e., either
both are positive or both are negative.
3. If one of the regression coefficients is greater than unity, the other must be less than unity.
4. The correlation coefficient will have the same sign as that of the regression coefficients.
5. Arithmetic mean of the regression coefficients is greater than the correlation coefficient.

6. Regression coefficients are independent of the change of origin but not of scale.
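Property 1 can be checked numerically on the data from the worked example in the previous section (a minimal sketch; the variable names are my own):

```python
import math

x = [10, 12, 13, 16, 17, 20, 25]
y = [10, 22, 24, 27, 29, 33, 37]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
sxx = sum((a - xbar) ** 2 for a in x)
syy = sum((b - ybar) ** 2 for b in y)

b_yx = sxy / sxx                 # slope of the regression of y on x
b_xy = sxy / syy                 # slope of the regression of x on y
r = sxy / math.sqrt(sxx * syy)   # Pearson's correlation coefficient

# Property 1: r is the geometric mean of the two regression coefficients.
print(round(math.sqrt(b_yx * b_xy), 4) == round(r, 4))  # True
```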

Multiple Linear Regression


A multiple linear regression model allows us to analyze the linear relationship between the response
variable and two or more explanatory variables. The choice of the explanatory variables is based on
economic theory, intuition, and/or prior research. The multiple linear regression model is a
straightforward extension of the simple linear regression model.
The multiple linear regression model is defined as
y = β0 + β1x1 + β2x2 + . . . + βkxk,
where y is the response variable and x1, x2, . . . , xk are the k explanatory variables. The coefficients β0,
β1, . . . , βk are the unknown parameters to be estimated. For each explanatory variable xj ( j = 1, . . . ,
k), the corresponding slope coefficient bj is the estimate of βj. We slightly modify the interpretation
of the slope coefficients in the context of a multiple linear regression model. Here bj measures the
change in the predicted value of the response variable y given a unit increase in the associated
explanatory variable xj, holding all other explanatory variables constant. In other words, it represents
the partial influence of xj on y.
If there are 3 variables i.e, one dependent and two independent variables, then the regression line is:
y = β0 + β1x1 + β2x2

Working with deviations from the means, the least-squares estimates are

β1 = [(Σyix1i)(Σx2i²) − (Σyix2i)(Σx1ix2i)] / [(Σx1i²)(Σx2i²) − (Σx1ix2i)²]

β2 = [(Σyix2i)(Σx1i²) − (Σyix1i)(Σx1ix2i)] / [(Σx1i²)(Σx2i²) − (Σx1ix2i)²]

β0 = ȳ − β1x̄1 − β2x̄2
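The two-regressor formulas can be sketched in Python using hypothetical data constructed from y = 1 + 2·x1 + 3·x2 with no noise, so the fit should recover those coefficients exactly (the function and data are illustrative, not from the text):

```python
def fit_two_regressors(y, x1, x2):
    """y = b0 + b1*x1 + b2*x2 via the deviation-form normal equations."""
    n = len(y)
    yb, x1b, x2b = sum(y) / n, sum(x1) / n, sum(x2) / n
    dy = [v - yb for v in y]
    d1 = [v - x1b for v in x1]
    d2 = [v - x2b for v in x2]
    s11 = sum(a * a for a in d1)                 # Σx1i²
    s22 = sum(a * a for a in d2)                 # Σx2i²
    s12 = sum(a * b for a, b in zip(d1, d2))     # Σx1i x2i
    s1y = sum(a * b for a, b in zip(d1, dy))     # Σyi x1i
    s2y = sum(a * b for a, b in zip(d2, dy))     # Σyi x2i
    den = s11 * s22 - s12**2
    b1 = (s1y * s22 - s2y * s12) / den
    b2 = (s2y * s11 - s1y * s12) / den
    b0 = yb - b1 * x1b - b2 * x2b
    return b0, b1, b2

# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 (no noise).
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
y  = [9, 8, 19, 18, 26]
print(fit_two_regressors(y, x1, x2))  # (1.0, 2.0, 3.0)
```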

Summary
In regression analysis, we can predict or estimate the value of one variable from the given value of
the other variable. Regression explains the functional form of two variables one as dependent variable
and other as independent variable.
Regression analysis confined to the study of only two variables at a time is termed simple regression.
Regression analysis studying more than two variables at a time is known as multiple regression.
If the bivariate data are plotted on a graph paper, a scatter diagram is obtained which indicates some
relationship between two variables. The dots of scatter diagram tend to concentrate around a curve.
This curve is known as regression curve and its functional form is called regression equation.
The geometric mean of the regression coefficients is the coefficient of correlation.
Arithmetic mean of the regression coefficients is greater than or equal to coefficient of correlation.
In multiple regression model, we assume that a linear relationship exists between some variable y,
which we call the dependent variable, and k independent variables x1, x2, ..., xk.
Fill in the blanks:
1. Regression analysis confined to the study of only two variables at a time is termed as
...........................
2. Regression line of x on y is denoted by ...........................
3. Functional form of regression curve is called ...........................
4. If both the lines of regression coincide, there is a ........................... between the variables.
5. Regression analysis is an ........................... measure.
State whether the following statements are True or False:
6. Arithmetic mean of the regression coefficients is less than coefficient of correlation.
7. If coefficient of regression r = 0, then there is no relationship between the two variables.
8. Regression coefficients are dependent upon change of origin.
9. Regression is a mathematical measure showing the average relationship between two variables.
10. Spearman’s Rank correlation method can be used when data are irregular
1. simple regression
2. x = a + by
3. regression equation
4. perfect correlation
5. absolute
6. False
7. True
8. False
9. True
10. True

******

Lesson 5
INTRODUCTION TO PROBABILITY
Managers need to cope with uncertainty in many decision-making situations. For example, a manager
may act as if the volume of sales in the coming year were known to him exactly. This is not true: he
knows only roughly what the next year's sales will be and cannot give the exact number; there is some
uncertainty. Concepts of probability will help the manager measure this uncertainty and perform the
associated analyses. This chapter provides the conceptual framework of probability and the various
probability rules that are essential in business decisions.
Probability theory originated from the theory of gambling. A large number of problems exist even
today which are based on games of chance, such as coin tossing, dice throwing and playing cards.
A probability is a numerical value that measures the likelihood that an event occurs. This value is
between zero and one, where a value of zero indicates an impossible event and a value of one
indicates a definite event.
Consider, for example, the toss of a coin. The result of a toss can be a head or a tail, therefore, it is
a random experiment. Here we know that either a head or a tail would occur as a result of the toss,
however, it is not possible to predetermine the outcome. With the use of probability theory, it is
possible to assign a quantitative measure, to express the extent of uncertainty, associated with the
occurrence of each possible outcome of a random experiment.
In order to define an event and assign the appropriate probability to it, it is useful to first establish
some terminology and impose some structure on the situation.
Some important terms & concepts:
1. Event:
The occurrence or non-occurrence of a phenomenon is called an event. For example, in the toss of
two coins, there are four exhaustive outcomes, i.e. (H, H), (H, T), (T, H), (T, T). The events
associated with this experiment can be defined in a number of ways. For example, (i) the event of
occurrence of head on both the coins, (ii) the event of occurrence of head on at least one of the two
coins, (iii) the event of non-occurrence of head on the two coins, etc.
2. Random Experiments:
Experiments of any type where the outcome cannot be predicted are called random experiments.
3. Sample Space:
A set of all possible outcomes from an experiment is called a sample space.
Eg: Consider a random experiment E of throwing 2 coins at a time. The possible outcomes are HH,
TT, HT, TH.
These 4 outcomes constitute a sample space denoted by, S ={ HH, TT, HT, TH}.
4. Trial & Event:
Consider an experiment of throwing a coin. When tossing a coin, we may get a head (H) or a tail (T).
Here the tossing of the coin is a trial and getting a head or a tail is an event.
In other words, “Every non-empty subset A of the sample space S is called an event”.
5. Null Event:

An event having no sample point is called a null event and is denoted by ∅.


6. Exhaustive Events:
The total number of possible outcomes in any trial is known as the exhaustive events.
Eg: In throwing a die the possible outcomes are getting 1, 2, 3, 4, 5 or 6. Hence we have 6 exhaustive
events in throwing a die.
7. Mutually Exclusive Events:
Two events are said to be mutually exclusive when the occurrence of one precludes the occurrence of
the other. In other words, if A & B are mutually exclusive events and A happens, then B will not
happen, and vice versa.
Eg: In tossing a coin the events head and tail are mutually exclusive, since a tail and a head cannot
appear at the same time.

8. Equally Likely Events:


Two events are said to be equally likely if one of them cannot be expected in preference to the other.
Eg: In throwing a coin, the events head & tail have equal chances of occurrence.

9. Independent & Dependent Events:

Two events are said to be independent when the actual happening of one does not influence in any
way the happening of the other. Events which are not independent are called dependent events.
Eg: If we draw a card from a pack of well-shuffled cards and then draw another card from the rest of
the pack (containing 51 cards), the second draw is dependent on the first. If, on the other hand, we
draw the second card after replacing the first card drawn, the second draw is independent of the first.
Example: A snowboarder competing in the Winter Olympic Games is trying to assess her
probability of earning a medal in her event, the ladies’ halfpipe. Construct the appropriate sample
space.
SOLUTION: The athlete’s attempt to predict her chances of earning a medal is an experiment
because, until the Winter Games occur, the outcome is unknown. We formalize an experiment by
constructing its sample space. The athlete’s competition has four possible outcomes: gold medal,
silver medal, bronze medal, and no medal. We formally write the sample space as S = {gold, silver,
bronze, no medal}.
Definitions
Classical Definition:
This definition, also known as the mathematical definition of probability or classical or a priori
definition of probability, was given by J. Bernoulli. With the use of this definition, the probabilities
associated with the occurrence of various events are determined by specifying the conditions of a
random experiment.
If n is the number of equally likely, mutually exclusive and exhaustive outcomes of a random
experiment out of which m outcomes are favourable to the occurrence of an event A, then the
probability that A occurs, denoted by P(A), is given by:
P(A) = (number of ways of achieving success) / (total number of possible outcomes) = m/n

where m = number of favourable cases and n = total number of exhaustive cases.

How to Interpret Probability


Mathematically, the probability that an event will occur is expressed as a number between 0 and 1.
Notationally, the probability of event A is represented by P(A).
• If P(A) equals zero, there is no chance that event A will occur.
• If P(A) is close to zero, there is little likelihood that event A will occur.
• If P(A) is close to one, there is a strong chance that event A will occur.
• If P(A) equals one, event A will definitely occur.
The sum of the probabilities of all possible outcomes in a statistical experiment is equal to one. This
means, for example, that if an experiment can have three possible outcomes (A, B, and C), then
P(A) + P(B) + P(C) = 1.

Example:
1. In tossing a coin, what is the probability of getting a head?
Solution: Total no. of events = {H, T} = 2
Favourable event = {H} = 1
Probability = Number of favourable cases / Total number of exhaustive cases = 1/2
2. In throwing a die, find the probability of getting a 2.
Sol: Total no. of events = {1, 2, 3, 4, 5, 6} = 6
Favourable event = {2} = 1
Probability = Number of favourable cases / Total number of exhaustive cases = 1/6

3. Find the probability of throwing 7 with two dice.


Sol: Total no. of possible outcomes of throwing two dice = 36
Number of ways of getting 7: (1,6), (2,5), (3,4), (4,3), (5,2), (6,1) = 6
Probability = Number of favourable cases / Total number of exhaustive cases = 6/36 = 1/6

4. A bag contains 6 red & 7 black balls. Find the Probability of drawing a red ball.

Sol: Total no. of possible ways of getting 1 ball = 6 + 7 =13


Number of ways of getting 1 red ball = 6
Probability = Number of favourable cases / Total number of exhaustive cases = 6/13

5. Find the probability that a card drawn at random from an ordinary pack is a diamond.

Sol: Total no. of possible ways of getting 1 card = 52


Number of ways of getting 1 diamond card = 13
Probability = Number of favourable cases / Total number of exhaustive cases = 13/52 = 1/4

6. From a pack of 52 cards, 1 card is drawn at random. Find the Probability of getting a queen.

Sol: A queen may be chosen in 4 ways.


Total no. of ways of selecting 1 card = 52
Probability = Number of favourable cases / Total number of exhaustive cases = 4/52 = 1/13
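The classical answer in worked example 3 can also be checked by simulation, previewing the relative-frequency idea discussed next (a minimal sketch; the seed and trial count are arbitrary choices of mine):

```python
import random

random.seed(42)

# Estimate P(sum of two dice = 7) by relative frequency and compare
# with the classical answer 6/36 = 1/6.
trials = 100_000
hits = sum(1 for _ in range(trials)
           if random.randint(1, 6) + random.randint(1, 6) == 7)
estimate = hits / trials
print(abs(estimate - 1/6) < 0.01)  # True
```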

Relative Frequency Approach:


The relative frequency approach uses past data that have been empirically observed. It notes the
frequency with which the same event has occurred in the past and estimates the probability of its
recurrence on the basis of these historical data. The probability of an event under this approach is
determined by
No.of times event has occurred
P (E) =a/n = Total no.of observations/ trials

Here, ‘a’ is the number of times the event has actually occurred and ‘n’ is the total number of
observations or trials.
For instance, during the last calendar year there were 50 births at a local hospital. If 32 of the new
arrivals were baby girls, the relative frequency approach reveals that the probability of a baby girl
being born in that locality is
P(girl) = a/n = No. of girls born last year / Total no. of births = 32/50 = 0.64
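The relative frequency calculation is a single division; the short simulation below (an illustration of ours, not from the text) also shows the estimate settling near the underlying probability as the number of trials grows:

```python
import random

def relative_frequency(occurrences, trials):
    """P(E) = a/n: number of times the event occurred over total trials."""
    return occurrences / trials

# Hospital example: 32 baby girls out of 50 births
print(relative_frequency(32, 50))  # 0.64

# With many repeated trials the estimate approaches the true probability
random.seed(1)  # fixed seed so the illustration is reproducible
heads = sum(random.random() < 0.5 for _ in range(100_000))
print(round(relative_frequency(heads, 100_000), 2))  # approximately 0.5
```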

The Subjective Approach:
The subjective approach requires the assignment of the probability of some event on the basis of the
best available evidence. Suppose we consider the next cricket match that India & Australia will play.
What is the probability that India will win? With the subjective method of assigning probabilities to the
experimental outcomes, we may use any data available, our experience, intuition, personal judgment
etc. Subjective probabilities are assignments of numerical values measuring an individual’s degree of
confidence in the occurrence of particular events, or in the truth of particular propositions, after
account has been taken of all the relevant information.
Illustrations of subjective probability are:
1. Estimating the likelihood the New England Patriots will play in the Super Bowl next year.
2. Estimating the probability General Motors Corp. will lose its number 1 ranking in total units sold to
Ford Motor Co. or DaimlerChrysler within 2 years.
3. Estimating the likelihood you will earn an A in this course.
The types of probability are summarized in Chart 1 below. A probability statement always assigns a
likelihood of an event that has not yet occurred. There is, of course, a considerable latitude in the
degree of uncertainty that surrounds this probability, based primarily on the knowledge possessed by
the individual concerning the underlying process. The individual possesses a great deal of knowledge
about the toss of a die and can state that the probability that a one-spot will appear face up on the toss
of a true die is one-sixth. But we know very little concerning the acceptance in the marketplace of a
new and untested product.

Chart 1: Definitions of probability


Some Rules for Computing Probabilities
Before discussing the rules, let’s consider a few definitions.

Definitions and Notation

• The complement of an event is the event not occurring. The probability that Event A will not
occur is denoted by P(A').
• The probability that Event A occurs, given that Event B has occurred, is called a conditional
probability. The conditional probability of Event A, given Event B, is denoted by the symbol
P(A|B).
• The probability that Events A and B both occur is the probability of the intersection of A and B.
The probability of the intersection of Events A and B is denoted by P(A ∩ B). If Events A and B
are mutually exclusive, P(A ∩ B) = 0. The intersection of two events, denoted A ∩ B, is the event
consisting of all outcomes in A and B. Figure depicts the intersection of two events A and B. A
useful way to illustrate these concepts is through the use of a Venn diagram, named after the British
mathematician John Venn (1834–1923).The intersection A ∩ B is the portion in the Venn diagram
that is included in both A and B.

Union Probability: Union probability is denoted P (E1 U E2) or P (A or B), where E1 and E2 are two
events. P (E1 U E2) is the probability that E1 will occur or that E2 will occur or both E1 and E2 will
occur. In a company, the probability that a person is male or a clerical worker is a union probability.
A person qualifies for the union by being male, by being a clerical worker, or by being both (a male
clerical worker).

Complement of an Event
The complement of event A, denoted Ac, is the event consisting of all outcomes in the sample space
S that are not in A. In Figure given below, Ac is everything in S that is not included in A.

Example: A snowboarder competing in the Winter Olympic Games is trying to assess her probability of
earning a medal in her event, the ladies’ halfpipe. Construct the appropriate sample space.
Now suppose the snowboarder defines the following three events:
A = {gold, silver, bronze}; that is, event A denotes earning a medal;
B = {silver, bronze, no medal}; that is, event B denotes earning at most a silver medal; and
C = {no medal}; that is, event C denotes failing to earn a medal.
a. Find A ∪ B and B ∪ C.
b. Find A ∩ B and A ∩ C.
c. Find Bc.
SOLUTION: The athlete’s attempt to predict her chances of earning a medal is an experiment because,
until the Winter Games occur, the outcome is unknown. We formalize an experiment by constructing
its sample space. The athlete’s competition has four possible outcomes: gold medal, silver medal,
bronze medal, and no medal. We formally write the sample space as
S = {gold, silver, bronze, no medal}.
a. The union of A and B denotes all outcomes common to A or B; here, the event A ∪ B = {gold,
silver, bronze, no medal}. Note that there is no double counting of the outcomes “silver” or “bronze”
in A ∪ B. Similarly, we have the event B ∪ C = {silver, bronze, no medal}.
b. The intersection of A and B denotes all outcomes common to A and B; here, the event A ∩ B =
{silver, bronze}. The event A ∩ C = Ø, where Ø denotes the null (empty) set; no common outcomes
appear in both A and C.
c. The complement of B denotes all outcomes in S that are not in B; here, the event Bc = {gold}.
The two properties of probability
1. The probability of any event A is a value between 0 and 1; that is,
0 ≤ P(A) ≤ 1.
2. The sum of the probabilities of any list of mutually exclusive and exhaustive events equals 1.
THE ADDITION RULE
The addition rule states that the probability that A or B occurs, or that at least one
of these events occurs, is equal to the probability that A occurs, plus the probability
that B occurs, minus the probability that both A and B occur. Equivalently,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

EXAMPLE
Anthony feels that he has a 75% chance of getting an A in Statistics and a 55% chance of getting an A
in Managerial Economics. He also believes he has a 40% chance of getting an A in both classes.
a. What is the probability that he gets an A in at least one of these courses?
b. What is the probability that he does not get an A in either of these courses?
SOLUTION:
a. Let P(AS) correspond to the probability of getting an A in Statistics and P(AM) correspond to the
probability of getting an A in Managerial Economics. Thus, P(AS) = 0.75 and P(AM) = 0.55. In
addition, there is a 40% chance that Anthony gets an A in both classes; that is, P(AS ∩ AM) = 0.40. In
order to find the probability that he receives an A in at least one of these courses, we calculate
P(AS ∪ AM) = P(AS) + P(AM) − P(AS ∩ AM) = 0.75 + 0.55 − 0.40 = 0.90.
b. The probability that he does not receive an A in either of these two courses is actually the
complement of the union of the two events; that is, P((AS ∪ AM)c).
We calculated the union in part a, so using the complement rule we have
P((AS ∪ AM)c) = 1 − P(AS ∪ AM) = 1 − 0.90 = 0.10.
An alternative expression that correctly captures the required probability is P(ASc ∩ AMc), which
is the probability that he does not get an A in Statistics and he does not get an A in Managerial
Economics. A common mistake is to calculate the probability as 1 − P(AS ∩ AM) = 1 − 0.40 = 0.60,
which simply indicates that there is a 60% chance that Anthony will not get an A in both courses. This
is clearly not the required probability that Anthony does not get an A in either course.
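The arithmetic in this example can be sketched in a few lines of Python (the variable names are ours, mirroring the events AS and AM above):

```python
# Addition rule and complement rule for Anthony's grades
p_as = 0.75      # P(A in Statistics)
p_am = 0.55      # P(A in Managerial Economics)
p_both = 0.40    # P(AS ∩ AM)

# Addition rule: P(AS ∪ AM) = P(AS) + P(AM) − P(AS ∩ AM)
p_at_least_one = p_as + p_am - p_both
# Complement rule: P((AS ∪ AM)^c) = 1 − P(AS ∪ AM)
p_neither = 1 - p_at_least_one

print(round(p_at_least_one, 2))  # 0.9
print(round(p_neither, 2))       # 0.1
```

Note that `1 - p_both` would give 0.60, the probability of not getting an A in *both* courses, which is the common mistake the text warns against.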
THE ADDITION RULE FOR MUTUALLY EXCLUSIVE EVENTS
If A and B are mutually exclusive events, then P(A ∩ B) = 0 and, therefore, the addition rule simplifies
to P(A ∪ B) = P(A) + P(B).
EXAMPLE
Samantha Greene, a college senior, contemplates her future immediately after graduation. She thinks
there is a 25% chance that she will join the Peace Corps and teach English in Madagascar for the next
few years. Alternatively, she believes there is a 35% chance that she will enroll in a full-time law
school program in the United States.
a. What is the probability that she joins the Peace Corps or enrolls in law school?
b. What is the probability that she does not choose either of these options?
SOLUTION:
a. We can write the probability that Samantha joins the Peace Corps as P(A) = 0.25 and the probability
that she enrolls in law school as P(B) = 0.35.
Immediately after college, Samantha cannot choose both of these options. This implies that these
events are mutually exclusive, so P(A ∩ B) = 0. Thus, when solving for the probability that Samantha
joins the Peace Corps or enrolls in law school, P(A ∪ B), we can simply sum P(A) and P(B):
P(A ∪ B) = P(A) + P(B) = 0.25 + 0.35 = 0.60.
b. In order to find the probability that she does not choose either of these options, we need to recognize
that this probability is the complement of the union of the two events; that is, P((A ∪ B)c). Therefore,
using the complement rule, we have
P((A ∪ B)c) = 1 − P(A ∪ B) = 1 − 0.60 = 0.40.


Conditional Probability
Given two events A and B, each with a positive probability of occurring, the probability that A occurs
given that B has occurred (A conditioned on B) is equal to P(A | B) = P(A ∩ B)/P(B). Similarly, the
probability that B occurs given that A has occurred (B conditioned on A) is equal to P(B | A) = P(A ∩
B)/P(A).
Suppose A represents “finding a job” and B represents “prior work experience”; then P(A) = 0.80 and the
conditional probability is denoted as P(A | B) = 0.90. The vertical mark | means “given that,” and the
conditional probability is typically read as “the probability of A given B.” In this example, the
probability of finding a suitable job increases from 0.80 to 0.90 when conditioned on prior work
experience. In general, the conditional probability, P(A | B), is greater than the unconditional
probability, P(A), if B exerts a positive influence on A.
Similarly, P(A | B) is less than P(A) when B exerts a negative influence on A. Finally, if

B exerts no influence on A, then P(A | B) equals P(A). It is common to refer to “unconditional
probability” simply as “probability.”
Example:
Of the cars on a used car lot, 70% have air conditioning (AC) and 40% have a CD player (CD). 20%
of the cars have both. What is the probability that a car has a CD player, given that it has AC?
Solution:

P(CD/AC) = P(CD ∩ AC)/P(AC) = 0.2/0.7 = 0.2857
Note: Given AC, we restrict attention to the 70% of cars that have AC. Since 20% of all cars have
both AC and a CD player, the required probability is 0.20/0.70 ≈ 28.57%.
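The same computation, as a minimal Python sketch of the used-car example:

```python
# Conditional probability: P(CD | AC) = P(CD ∩ AC) / P(AC)
p_ac = 0.70          # probability a car has air conditioning
p_cd_and_ac = 0.20   # probability a car has both AC and a CD player

p_cd_given_ac = p_cd_and_ac / p_ac
print(round(p_cd_given_ac, 4))  # 0.2857
```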

Independent Versus Dependent Events


Two events, A and B, are independent if P(A | B) = P(A) or, equivalently, P(B | A)= P(B). Otherwise,
the events are dependent.
If A and B are independent, then P(A | B) = P(A) and the multiplication rule simplifies to
P(A and B) = P(A) × P(B)
Example: Suppose that for a given year there is a 2% chance that your desktop computer will crash
and a 6% chance that your laptop computer will crash. Moreover, there is a 0.12% chance that both
computers will crash. Is the reliability of the two computers independent of each other?
SOLUTION: Let event D represent the outcome that your desktop crashes and event L represent the
outcome that your laptop crashes. Therefore, P(D) = 0.02, P(L) = 0.06, and P(D ∩ L) = 0.0012. The
reliability of the two computers is independent because
P(D | L) = P(D ∩ L)/P(L) = 0.0012/0.06 = 0.02 = P(D).
In other words, if your laptop crashes, it does not alter the probability that your desktop also crashes.
Equivalently,
P(L | D) = P(D ∩ L)/P(D) = 0.0012/0.02 = 0.06 = P(L).
Multiplication rule for two events A and B:
P(A and B) = P(B) × P(A | B)
If A and B are independent events, then the probability that A and B both occur equals the product of
the probability of A and the probability of B; that is,
P(A ∩ B) = P(A)P(B).
Example: The probability of passing the MBA exam is 0.50 for person A and 0.80 for person B. The
prospect of person A passing the exam is completely unrelated to person B’s success on the exam.
a. What is the probability that both person A and B pass the exam?
b. What is the probability that at least one of them passes the exam?
SOLUTION: Let J be the event that person A passes the exam and L the event that person B passes;
then P(J) = 0.50 and P(L) = 0.80.
a. Since we are told that person A’s chances of passing the exam are not influenced by person B’s
success at the exam, we can conclude that these events are independent, so P(J) = P(J | L) =
0.50 and P(L) = P(L | J) = 0.80. Thus, when solving for the probability that both persons pass
the exam, we calculate the product of the probabilities:
P(J ∩ L) = P(J) P(L) = 0.50 × 0.80 = 0.40.
b. We calculate the probability that at least one of them passes the exam as

P(J ∪ L) = P(J) + P(L) − P(J ∩ L) = 0.50 + 0.80 − 0.40 = 0.90.
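Both calculations above can be verified numerically; the sketch below checks independence for the computer-crash example and applies the multiplication and addition rules to the exam example (variable names are ours):

```python
import math

# Computer-crash example: D = desktop crashes, L = laptop crashes
p_d, p_l, p_d_and_l = 0.02, 0.06, 0.0012
p_d_given_l = p_d_and_l / p_l
# Independent because conditioning on L leaves P(D) unchanged
print(math.isclose(p_d_given_l, p_d))  # True

# MBA-exam example: J = person A passes, L = person B passes (independent)
p_j, p_l_exam = 0.50, 0.80
p_both = p_j * p_l_exam                       # multiplication rule: P(J ∩ L)
p_at_least_one = p_j + p_l_exam - p_both      # addition rule: P(J ∪ L)
print(p_both, round(p_at_least_one, 2))       # 0.4 0.9
```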


Marginal Probability
Marginal probability is also known as the law of total probability. Let Xi (i = 1, 2, ..., n) be n mutually
exclusive & collectively exhaustive events & D be an event defined additionally on the sample space,

given that the marginal probabilities P(Xi) and conditional Probabilities P (D/Xi) for all i, are known,
then the marginal probability of D is defined as:
P(D) = P(X1 ∩ D) + P(X2 ∩ D) + ... + P(Xn ∩ D)
= P(X1)·P(D/X1) + P(X2)·P(D/X2) + ... + P(Xn)·P(D/Xn)
Since D could have resulted from X1 or X2 or ... or Xn, we obtain the probability of D by relating D
with X1, X2, ..., Xn.
An export house manager purchases cotton shirts for an export consignment from two designers.
Suppose Designer A produced 65% of the shirts and Designer B produced 35%. Eight percent of the
shirts produced by A were defective and 12% of Designer B’s shirts were defective. On a particular
day, a normal shipment arrives from both designers and the contents get mixed up. A shirt is chosen
(at random) for quality check by the authority and is found to be defective & thus the consignment is
rejected. Now what is the probability that designer A produced the shirt? That designer B produced
the shirt? The decision is required by the Exporter to take corrective measures to reduce the probability
of rejection of his consignment in future.
Here we are given data on the percentage of the shirts produced by the two designers which provides
basis to know the likelihood that a randomly selected shirt is produced by a particular Designer. The
given information shows that if we let,
X1 represent the event that designer A produced the shirt
X2 represent the event that designer B produced the shirt
Then from the given data we have:
P (X1) = .65, P (X2) = .35
These values are called Prior Probabilities. They are so called because they are established prior to
the empirical evidence about the quality of the shirt. As per the quality of shirts the conditional
probabilities show:
The probability that the shirt is defective on the condition that it is produced by Designer A is P(D/
X1) = .08. Similarly
P (D/X2) = .12
From the given data we know P (X1) = .65. The new information that the shirt is defective changes
the probabilities. With this additional information, we can revise the probability the shirt was produced
by Designer A. Let D be the event that the tested piece is defective & we want to know the probability
that the defective shirt came from Designer A. We now want to determine P(X1/D), not just
P(X1). Thus, after the additional information/experiment is obtained, we replace P(Xj) by P(Xj/D).
Recalling the rule of conditional probability:
P(X1/D) = P(X1 ∩ D)/P(D) = P(X1)·P(D/X1)/P(D)    (1)

However, P(D) is not readily discernible. This is where Bayes’ theorem comes in. There are two ways
the shirt may be defective: it can come from Designer A and be defective, or it can come from Designer
B and be defective. Using the rule of addition given above,
P(D) = P(X1 ∩ D) + P(X2 ∩ D)
= P(X1)·P(D/X1) + P(X2)·P(D/X2)    (2)
Substituting P(D) from (2) in the denominator of the conditional probability formula (1), Bayes’
theorem tells us ;
P(Xj/D) = P(Xj ∩ D) / [P(X1 ∩ D) + P(X2 ∩ D) + ... + P(Xn ∩ D)]
= P(Xj)·P(D/Xj) / Σ P(Xj)·P(D/Xj),    j = 1, 2, ..., n


Where the denominator is the probability of the sample result, P(D) & the numerator is a specific term
of the denominator.
We can find P(X1/D) and P(X2/D) as:
P(X1/D) = P(X1 ∩ D)/P(D) = P(X1)·P(D/X1) / [P(X1 ∩ D) + P(X2 ∩ D)] = 0.052/0.094 = 0.553
P(X2/D) = P(X2 ∩ D)/P(D) = P(X2)·P(D/X2) / [P(X1 ∩ D) + P(X2 ∩ D)] = 0.042/0.094 = 0.447

Thus, in the above example, the probability was .65 that the shirt was produced by designer A & .35
that it came from designer B. These are called prior probabilities because they are based on the original
information. The new information that the shirt is defective changes the probabilities: with the result
of a sample of a defective shirt, the probability of X1 has been revised downward to 0.553 and that of
X2 has been revised upward to 0.447.
One way to lay out a revision of probabilities problem (The Bayes’ Rule) is to use a table. Table below
shows the analysis for the shirts problem.

Event      | Prior Probability | Conditional Probability P(D/Xi) | Joint Probability P(Xi ∩ D)   | Posterior (Revised) Probability P(Xi/D)
Designer A | P(X1) = 0.65      | P(D/X1) = 0.08                  | 0.65 × 0.08 = 0.052           | 0.052 ÷ 0.094 = 0.553
Designer B | P(X2) = 0.35      | P(D/X2) = 0.12                  | 0.35 × 0.12 = 0.042           | 0.042 ÷ 0.094 = 0.447
           |                   |                                 | P(defective) = P(D) = 0.094   |
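The entire revision table can be reproduced with a short Bayes’ rule sketch in Python (the code structure is ours; the numbers come from the shirt example):

```python
# Bayes' rule for the defective-shirt example
priors = {"A": 0.65, "B": 0.35}        # prior probabilities P(X1), P(X2)
defect_rate = {"A": 0.08, "B": 0.12}   # conditional probabilities P(D/X1), P(D/X2)

# Joint probabilities P(Xi ∩ D) = P(Xi) * P(D/Xi)
joint = {d: priors[d] * defect_rate[d] for d in priors}
p_defective = sum(joint.values())      # total probability P(D)

# Posterior (revised) probabilities P(Xi/D) = P(Xi ∩ D) / P(D)
posterior = {d: joint[d] / p_defective for d in joint}

print(round(p_defective, 3))     # 0.094
print(round(posterior["A"], 3))  # 0.553
print(round(posterior["B"], 3))  # 0.447
```

Adding a third designer would only require extending the two dictionaries; the rest of the computation is unchanged.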

Practice Questions:
State whether the following statements are true or false:
1. The concept of probability originated from the analysis of the games of chance in the
17th century.
2. The theory of probability is a study of Statistical or Random Experiments.
3. It is the backbone of Statistical Inference and Decision Theory that are essential tools of the
analysis of most of the modern business and economic problems.
4. A phenomenon or an experiment which can result into more than one possible outcome,
is called a random phenomenon or random experiment or statistical experiment.
5. The result of a toss can be a head or a tail; thus, it is a non-random experiment.

Classification of Probability Distribution

Binomial Distribution
Meaning & Definition:
Binomial Distribution is associated with James Bernoulli, a Swiss Mathematician. Therefore, it is also
called Bernoulli distribution. Binomial distribution is the probability distribution expressing the
probability of one set of dichotomous alternatives, i.e., success or failure. In other words, it is used to
determine the probability of success in experiments on which there are only two mutually exclusive
outcomes. Binomial distribution is discrete probability distribution. Binomial Distribution can be
defined as follows: “A random variable x is said to follow the Binomial Distribution with parameters n
and p if its probability mass function is given by:
P(x) = nCx p^x q^(n-x)
Where, p = probability of success in a single trial
q = 1 – p
n = number of trials
x = number of successes in ‘n’ trials.
Assumption or Conditions for application of Binomial Distribution
Binomial distribution can be applied when:-
1. The random experiment has two outcomes i.e., success and failure.
2. The probability of success in a single trial remains constant from trial to trial of the experiment.
3. The experiment is repeated for finite number of times.
4. The trials are independent.
Properties (features) of Binomial Distribution:

1. It is a discrete probability distribution.
2. The shape and location of Binomial distribution changes as ‘p’ changes for a given ‘n’.
3. The mode of the Binomial distribution is equal to the value of ‘x’ which has the largest probability.
4. Mean of the Binomial distribution increases as ‘n’ increases with ‘p’ remaining constant.
5. The mean of Binomial distribution is np.

6. The Standard deviation of Binomial distribution is √𝑛𝑝𝑞

7. If ‘n’ is large and if neither ‘p’ nor ‘q’ is too close to zero, the Binomial distribution may be
approximated by the Normal distribution.
8. If two independent random variables follow the Binomial distribution with the same probability of
success p, their sum also follows the Binomial distribution.
Working Rules for Solving Problems
I. Make sure that the trials in the random experiment are independent and each trial result in
either ‘success’ or ‘failure’.
II. Define the binomial variable and find the values of n and p from the given data. Also find
q by using: q = 1 – p.
III. Put the values of n, p and q in the formula:
P(x successes) = nCx p^x q^(n–x), x = 0, 1, 2, ......, n
IV. Express the event, whose probability is desired, in terms of values of the binomial variable
x. Use formula to find the required probability.
Question: Six coins are tossed simultaneously. What is the probability of obtaining 4 heads?
Solution: The pmf of the Binomial distribution is:
P(x) = nCx p^x q^(n-x)
Here x = 4, n = 6, p = ½, q = 1 – p = 1 – ½ = ½
∴ P(x = 4) = 6C4 (½)^4 (½)^(6-4) = [6!/(4!(6−4)!)] (½)^6 = 15/64 = 0.234
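A sketch of the pmf in Python (using the standard library’s `comb`; the helper name is ours) confirms the result:

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) = nCx * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Six coins tossed simultaneously: probability of exactly 4 heads
print(round(binomial_pmf(4, 6, 0.5), 3))  # 0.234
```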

SYSKA, a LED manufacturing company, regularly conducts quality checks at specified periods on
the products it manufactures. Historically, the failure rate for the LED light bulbs that the company
manufactures is 5%. Suppose a random sample of 10 LED light bulbs is selected. What is the
probability that
I. None of the LED light bulbs are defective?

II. Exactly one of the LED light bulbs is defective?
III. Two or fewer LED light bulbs are defective?
IV. Three or more of the LED light bulbs are defective?
Solution:
Here p = 0.05, q = 1 – p = 1 – 0.05 = 0.95 and n = 10.
Applying the pmf P(x) = nCx p^x q^(n-x),
We get
1. P(X = 0) = 0.5987
2. P(X = 1) = 0.3151
3. P(X ≤ 2) = P(X=0)+ P(X=1) + P(X=2) = 0.9885
4. P(X ≥ 3) = P(X=3) + P(X=4) + …+ P(X=10)
= 1 – P(X<3)
= 1 – [P(X=0) + P(X=1) + P(X=2)]
= 0.0115
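These four probabilities can be regenerated with the same kind of pmf sketch (the helper function is ours, not part of the text):

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) = nCx * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.05  # 10 sampled bulbs, 5% historical defect rate
p0 = binomial_pmf(0, n, p)
p1 = binomial_pmf(1, n, p)
p_le_2 = sum(binomial_pmf(x, n, p) for x in range(3))  # P(X <= 2)

print(round(p0, 4))          # 0.5987  (none defective)
print(round(p1, 4))          # 0.3151  (exactly one defective)
print(round(p_le_2, 4))      # 0.9885  (two or fewer defective)
print(round(1 - p_le_2, 4))  # 0.0115  (three or more defective)
```

Note how P(X ≥ 3) is computed as the complement 1 − P(X < 3), exactly as in step 4 of the solution.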
Example: In the United States, about 30% of adults have four-year college degrees. Suppose five
adults are randomly selected.
a. What is the probability that none of the adults has a college degree?
b. What is the probability that no more than two of the adults have a college degree?
c. What is the probability that at least two of the adults have a college degree?
d. Calculate the expected value, the variance, and the standard deviation of this binomial distribution.
e. Graphically depict the probability distribution and comment on its symmetry/skewness.
SOLUTION: First, this problem satisfies the conditions for a Bernoulli process with a random
selection of five adults, n = 5. Here, an adult either has a college degree, with probability p = 0.30, or
does not have a college degree, with probability 1 − p = 1 − 0.30 = 0.70. Given a large number of
adults, it fulfills the requirement that the probability that an adult has a college degree stays the same
from adult to adult.
a. In order to find the probability that none of the adults has a college degree, we
let x = 0 and find
P(X = 0) = 5C0 (0.30)^0 (0.70)^(5-0)
= [5!/(0!(5−0)!)] × (0.30)^0 × (0.70)^5
= 1 × 1 × 0.1681

= 0.1681.
In other words, from a random sample of five adults, there is a 16.81% chance that none of the adults
has a college degree.
b. We find the probability that no more than two adults have a college degree as
P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2).
We have already found P(X = 0) from part a. So we now compute P(X = 1)
and P(X = 2):
P(X = 1) = 5C1 (0.30)^1 (0.70)^(5-1)
= [5!/(1!(5−1)!)] × (0.30)^1 × (0.70)^4
= 0.3602
P(X = 2) = 5C2 (0.30)^2 (0.70)^(5-2)
= [5!/(2!(5−2)!)] × (0.30)^2 × (0.70)^3
= 0.3087
Next we sum the three relevant probabilities and obtain
P(X ≤ 2) = 0.1681 + 0.3602 + 0.3087 = 0.8370. From a random sample of five adults, there is an 83.7%
likelihood that no more than two of them will have a college degree.
c. We find the probability that at least two adults have a college degree as
P(X ≥ 2) = P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5).
We can solve this problem by calculating and then summing each of the four probabilities, from P(X
= 2) to P(X = 5). A simpler method uses one of the key properties of a probability distribution, which
states that the sum of the probabilities over all values of X equals 1. Therefore, P(X ≥ 2) can be written
as 1 − [P(X = 0) + P(X = 1)]. We have already calculated P(X = 0) and P(X = 1) from parts a and b, so
P(X ≥ 2) = 1 − [P(X = 0) + P(X = 1)] = 1 − (0.1681 + 0.3602) = 0.4717.
From a random sample of five adults, there is a 47.17% likelihood that at least two adults will have a
college degree.
d. We use the simplified formulas to calculate the mean, the variance, and the standard deviation as
E(X) = np = 5 × 0.30 = 1.5 adults,
Var(X) = σ² = np(1 − p) = 5 × 0.30 × 0.70 = 1.05 (adults)², and
SD(X) = σ = √(np(1 − p)) = √1.05 = 1.02 adults.


e. Before we graph this distribution, we first show the complete binomial distribution
for this example in the table below.
Table: Binomial Distribution with n = 5 and p = 0.30
x    P(X = x)
0 0.1681
1 0.3602
2 0.3087
3 0.1323
4 0.0284
5 0.0024
This binomial distribution is graphically depicted in Figure below. When randomly selecting five
adults, the most likely outcome is that exactly one adult will have a college degree. The distribution is
not symmetric; rather, it is positively skewed.
Figure: Binomial Distribution with n = 5 and p = 0.30
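Since the figure itself is not reproduced here, the distribution in the table, together with its mean, variance, and standard deviation, can be regenerated with the following sketch (helper name ours):

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """P(X = x) = nCx * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 5, 0.30  # college-degree example
for x in range(n + 1):
    print(x, round(binomial_pmf(x, n, p), 4))  # pmf values for x = 0..5

mean = n * p            # E(X) = np
var = n * p * (1 - p)   # Var(X) = np(1 - p)
sd = sqrt(var)          # SD(X)
print(mean, round(var, 2), round(sd, 2))  # 1.5 1.05 1.02
```

Printing the six pmf values makes the positive skew visible: the probabilities rise to a peak at x = 1 and then tail off toward x = 5.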

Applications of Binomial Distribution


This distribution is applied to problems concerning:
1. The number of defectives items in a sample.
2. The estimation of reliability of systems.
3. Number of rounds fired from a gun hitting a target.
4. Radar detection

Summary

• This distribution mainly deals with attributes. An attribute is either present or absent with respect
to the elements of a population.

• The random experiment is performed for a finite and fixed number of trials.

• Each trial must result in either “success” or “failure”.
• The probability of success in each trial is the same.

• The method of drawing histogram of a binomial distribution is analogous to the procedure of


drawing histogram of a frequency distribution.

• As the number of trials (n) in the binomial distribution increases with p fixed, the expected number of successes (np) also increases.
Fill in the blanks:
1. There are only two possible outcomes of each trial either ........................... or ...........................
2. Binomial variable counts the number of ........................... in a random experiment with trials
satisfying 4 conditions of binomial distribution.
3. Binomial distribution mainly deals with ...........................
4. In binomial distribution, the probabilities of 0 success, 1 success, 2 successes, ... n successes are the
1st, 2nd, 3rd ... (n + 1)th terms in ........................... of (q + p)n.
5. Binomial distribution can be applied only when the number of trials is ...........................
and ...........................
Answers
1. success, failure
2. successes
3. attributes
4. binomial expansion
5. finite and fixed
Questions
1.Assume that on an average one telephone number out of 15 called between 2 P.M. and 3 P.M. on
week days is busy. What is the probability that if six randomly selected telephone numbers are called,
at least three of them will be busy?
2. A bag contains 10 balls each marked with one of the digits 0 to 9. If four balls are drawn successively
with replacement from the bag, what is the probability that none is marked with the digit ‘0’?
3. In a box containing 100 bulbs, 10 are defective. What is the probability that out of a sample of 5
bulbs (i) none is defective? (ii) exactly two are defective?

Lesson 6
Normal Probability Distribution
The normal distribution is considered the cornerstone of modern statistical theory. It is of considerable
importance in statistical theory, especially in statistical inference related to population parameters. It
is widely used in practical problems related to height, weight, distance, production, scientific
measurements, etc. Carl Friedrich Gauss (1777–1855) was the first to derive the precise
mathematical formula for the normal distribution; therefore, it is sometimes also called the ‘Gaussian
distribution’.
A continuous random variable X is said to have a normal distribution if its probability density function
(pdf) is given by the following equation
f(x) = (1/√(2πσ²)) e^(−(1/2)((x−μ)/σ)²),    –∞ < x < ∞; –∞ < µ < ∞; σ > 0


Where µ is the mean of the distribution and σ² is the variance.
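The density can be evaluated directly from this formula; the sketch below (ours, using only the standard library) checks the peak at x = µ and the symmetry of the curve about the mean:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """f(x) = 1/sqrt(2*pi*sigma^2) * exp(-0.5 * ((x - mu)/sigma)**2)."""
    return (1.0 / sqrt(2 * pi * sigma**2)) * exp(-0.5 * ((x - mu) / sigma) ** 2)

# Standard normal (mu = 0, sigma = 1): the peak value at the mean is 1/sqrt(2*pi)
print(round(normal_pdf(0, 0, 1), 4))  # 0.3989

# Symmetry about the mean: f(mu - a) equals f(mu + a)
print(normal_pdf(-1.5, 0, 1) == normal_pdf(1.5, 0, 1))  # True
```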
The Conditions of Normality In order that the distribution of a random variable X is normal, the
factors affecting its observations must satisfy the following conditions:
1. A large number of chance factors: The factors, affecting the observations of a random variable,
should be numerous and equally probable so that the occurrence or non-occurrence of any one of them
is not predictable.
2. Condition of homogeneity: The factors must be similar over the relevant population, although their
incidence may vary from observation to observation.
3. Condition of independence: The factors, affecting observations, must act independently of each
other.
4. Condition of symmetry: Various factors operate in such a way that the deviations of observations
above and below mean are balanced with regard to their magnitude as well as their number.
Properties:

1. The mean μ and standard deviation σ (or variance σ²) are the two parameters of the normal
distribution. The μ is the centre of the distribution and σ explains the spread of values around
the centre. The values of μ and σ determine the position and shape of the distribution. It is a family
of distributions where each distribution is differentiated by its combination of µ and σ.
2. The values of mean (measure of central value), median (that divides the series in two equal
parts) and mode (the most frequent value) coincide. If from the peak of the distribution we
draw a perpendicular on the horizontal axis (also called variable axis), the foot of the
perpendicular gives the value of mean, median and mode.
3. It is a unimodal distribution and the curve attains its maximum value at x = μ.
4. It is a bell-shaped and symmetric curve (as shown in the Figure below). If the curve is folded
from the middle, the two halves will coincide. The shape of the curve to the left of the mean is a
mirror image of the shape to the right of the mean.

Symmetrical bell-shaped curve

5. The variability of the distribution is determined by the value of σ. The greater its value, the
higher is the spread of the distribution and the greater is the width of the distribution. Two
normal distribution curves having the same mean and different standard deviations are shown in
the Figure below. We can see that both the distributions are symmetric and bell-shaped. However,
observations in distribution A (with higher σ) are more dispersed; therefore the curve is flatter
and has thicker tails than distribution B (with lower σ).

6. The random variable X ranges from −∞ to +∞; the tails of the curve extend to infinity in both
directions and never touch the x-axis.

7. This distribution being a probability density function, f(x) ≥ 0. And, since the probability of a
continuous random variable is given by the area under the curve, the total area under the curve
is one, i.e.

P(−∞ < X < +∞) = ∫ from −∞ to +∞ f(x) dx = ∫ from −∞ to +∞ (1/√(2πσ²)) e^(−(1/2)((x−μ)/σ)²) dx = 1

8. The distribution is symmetrical about its mean value μ. This implies that the area under the
curve on each side of the mean (μ) is 0.5, i.e. P(X ≥ μ) = P(X ≤ μ) = 0.5
9. The point of inflexion of the curve occurs at µ ± 1σ. It is a point where the curve changes its
curvature on both of its sides.
10. Empirically, it is observed that the approximate percentage of area in the following commonly
used intervals is: µ ± 1σ covers 68.27% of the total observations, i.e., 68.27% of all
observations lie within one standard deviation of the mean of the distribution. Similarly, µ ±
2σ covers 95.45% and µ ± 3σ covers 99.73% of the total observations.

Area under Normal Probability Distribution


If the values of the mean (μ) and standard deviation (σ) are known, the distribution is said to be
fully specified. The expression used to abbreviate that X is normally distributed with mean μ
and variance σ² is X ~ N(μ, σ²), where E(X) = μ and V(X) = σ².
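The percentages in property 10 can be verified numerically, using the fact that the normal CDF can be written with the error function, Φ(x) = ½[1 + erf((x − μ)/(σ√2))]. A sketch:

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Coverage of the intervals mu +/- k*sigma for k = 1, 2, 3
for k in (1, 2, 3):
    coverage = normal_cdf(k, 0, 1) - normal_cdf(-k, 0, 1)
    print(f"mu +/- {k} sigma covers {coverage * 100:.2f}% of observations")
```

This prints 68.27%, 95.45% and 99.73%, matching the empirical rule above.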

Computing area under the Normal Curve


In order to compute the probability between two X values (x1 and x2), i.e., the area covered under
the curve between the two points, we are required to use the integration technique in the following
manner.

P(x1 < X < x2) = ∫ from x1 to x2 (1/√(2πσ²)) e^(−(1/2)((x−μ)/σ)²) dx
The difficulty of evaluating the integral of the normal density function requires tabulation of
areas under the normal curve. However, we must notice from the above equation that the area
included between two values of X depends not only on the values of x1 and x2 but also on the
values of the two parameters of the distribution. As the value of either parameter of the normal
distribution changes, the probability for given values x1 and x2 would also change. This means
we would require as many tables as the number of possible combinations of μ and σ². Since
there can be infinite combinations, it is not practicable to prepare a table for each possible
combination. To deal with this, we transform each normal random variable with a specified μ
and σ² into a standard normal random variable, called the Z variable, which has mean 0 and
standard deviation (and variance) 1. This standardised variable is used to calculate the
probabilities of all normal distributions, irrespective of the values of μ and σ².

Converting Normal Variable into Standard Normal Variable


A normal random variable with mean 0 and variance 1 is called a standard normal variable.
The normal random variable X is converted into Z using the conversion formula Z = (X − μ)/σ,
where µ = E(X) and σ = SD(X). Using rules of expectation, we find that the two parameters of
the standard normal variable Z, the mean and standard deviation, are 0 and 1, respectively:

E(Z) = E((X − μ)/σ) = (E(X) − μ)/σ = (μ − μ)/σ = 0
Var(Z) = Var((X − μ)/σ) = Var(X)/σ² = σ²/σ² = 1
Replacing X with the Z-variable, equation (2) can be re-written as

f(z) = (1/√(2π)) e^(−z²/2),  Z ~ SND(0, 1)
The properties of the Z-distribution are similar to those of the normal distributions.

The random variable Z has an infinite range, i.e., −∞ < Z < ∞.

The standard normal curve is also bell-shaped, with mean, median and mode coinciding at
0.

The Z-curve is centred at its mean value 0 and is symmetrical around it, i.e.,
P (-∞ ≤ Z ≤ 0) = P (Z ≤ 0) = P (0 ≤ Z ≤ + ∞) = P (Z ≥ 0) = 0.5

The point of inflexion of the Z curve occurs at ±1 (it should be noticed that for the normal
distribution it occurs at µ ± 1σ; since for the Z curve μ = 0 and σ = 1, this gives 0 ± 1·1 = ±1).

The respective limits of z, covering 68.27%, 95.45% and 99.73% of area under the standard
normal distribution curve are ±1, ±2, and ±3.
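The standardisation step can be sketched in code: converting X to Z leaves probabilities unchanged, so any normal probability reduces to a standard normal one.

```python
import math

def z_score(x, mu, sigma):
    """Standardise a normal variate: Z = (X - mu) / sigma."""
    return (x - mu) / sigma

def phi(z):
    """Standard normal CDF, P(Z <= z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# X ~ N(30, 5^2): P(X <= 40) equals P(Z <= 2) after standardising.
z = z_score(40, 30, 5)
print(z)                 # 2.0
print(round(phi(z), 4))  # 0.9772
```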

APPLICATIONS OF NORMAL DISTRIBUTION
This distribution is applied to problems concerning:
1. calculation of the hit probability of a shot;
2. statistical inference in most branches of science;
3. calculation of errors made by chance in experimental measurements.
If x is a normal variate with mean 30 and S.D. 5, find
(i) P(26 ≤ x ≤ 40) (ii) P(x ≥ 45) (iii) P(1.52 ≤ Z ≤ 1.96) (iv) P(Z > 4)
Solution: Given x is a normal variate with
mean µ = 30 and S.D. σ = 5.
Let z be the standard normal variate,
then z = (x − µ)/σ = (x − 30)/5
(i) When x = 26, z = (26 − 30)/5 = −4/5 = −0.8
When x = 40, z = (40 − 30)/5 = 2
P(26 ≤ x ≤ 40) = P( – 0.8 ≤ z ≤ 2)
= Area under the normal variate between z = – 0.8 and z = 2

= (Area between z = – 0.8 and z = 0) + (Area between z = 0 and z = 2)


= P(– 0.8 ≤ z ≤ 0) + P (0 ≤ z ≤ 2)
= P(0 ≤ z ≤ 0.8) + P (0 ≤ z ≤ 2)
= 0.2881 + 0.4772
= 0.7653.
(ii) When x = 45, z = (45 − 30)/5 = 3

P(x ≥ 45) = P(z ≥ 3)
= Area under the standard normal curve to the right of z = 3
= (Area to the right of z = 0) – (Area between z = 0 and z = 3)
= P(z ≥ 0) – P(0 ≤ z ≤ 3)
= 0.5 – 0.49865
= 0.00135.
(iii) As in part (i) and shown in the Figure below,
P(1.52 ≤ Z ≤ 1.96) = P(Z ≤ 1.96) − P(Z < 1.52)

= 0.9750 − 0.9357 = 0.0393.


(iv) P(Z > 4) = 1 − P(Z ≤ 4). However, the z table only goes up to 3.99 with P(Z ≤ 3.99) = 1.0
(approximately). In fact, for any z value greater than 3.99, it is acceptable to treat P(Z ≤ z) = 1.0.
Therefore, P(Z > 4) = 1 − P(Z ≤ 4) = 1 − 1 = 0.

State whether the following statements are True or False:


11. The points of inflexion of the Normal curve are equidistant from the mean on
either side.
12. Normal curve is bimodal.
13. The total area under the normal curve above x-axis is 1.
14. Probable error is the deviation on both sides of the arithmetic mean.
15. Error function is also called probability integral.

Lesson 7
Hypothesis Testing
Introduction
A statistical hypothesis test is a method of making statistical decisions using experimental data. In
statistics, a result is called statistically significant if it is unlikely to have occurred by chance. The
phrase “test of significance” was coined by Ronald Fisher: “Critical tests of this kind may be called
tests of significance, and when such tests are available we may discover whether a second sample is
or is not significantly different from the first.”
Hypothesis testing is sometimes called confirmatory data analysis, in contrast to exploratory data
analysis. In frequency probability, these decisions are almost always made using null-hypothesis tests;
that is, ones that answer the question: assuming that the null hypothesis is true, what is the
probability of observing a value for the test statistic that is at least as extreme as the value that was
actually observed? One use of hypothesis testing is deciding whether experimental results contain
enough information to cast doubt on conventional wisdom.
Meaning of Hypothesis
A hypothesis is a tentative proposition relating to a certain phenomenon, which the researcher wants to
verify when required. If the researcher wants to infer something about the total population from
which the sample was taken, statistical methods are used to make inference. We may say that, while
a hypothesis is useful, it is not always necessary. Many a time, the researcher is interested in
collecting and analysing the data indicating the main characteristics without a hypothesis. Also, a
hypothesis may be rejected but can never be accepted except tentatively. Further evidence may prove
it wrong. It is wrong to conclude that since a hypothesis was not rejected it can be accepted as valid.
What is a Null Hypothesis?
A null hypothesis is a statement about the population, whose credibility or validity the researcher
wants to assess based on the sample. A null hypothesis is formulated specifically to test for possible
rejection or nullification. Hence the name ‘null hypothesis’. Null hypothesis always states “no
difference”. It is this null hypothesis that is tested by the researcher.
Statistical Testing Procedure
1. Formulate the null hypothesis H0 and the alternate hypothesis HA.
According to the given problem, H0 represents the value of some parameter of the population.
2. Select an appropriate test, assuming H0 to be true.
3. Calculate the value of the test statistic.
4. Select the level of significance, either 1% or 5%.
5. Find the critical region.
6. If the calculated value lies within the critical region, then reject H0.
7. State the conclusion in writing.
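The steps above can be sketched for a large-sample test of a mean; the numbers used here are hypothetical and only illustrate the flow:

```python
import math

def z_test_mean(sample_mean, mu0, sigma, n, z_crit=1.96):
    """Two-tailed large-sample test of H0: mu = mu0 at the 5% level.

    Mirrors the procedure above: compute the statistic, compare it with
    the critical region |Z| > z_crit, and state the decision.
    """
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    decision = "reject H0" if abs(z) > z_crit else "do not reject H0"
    return z, decision

# Hypothetical figures: a sample of 64 with mean 52 against a claimed mean of 50.
z, decision = z_test_mean(sample_mean=52, mu0=50, sigma=8, n=64)
print(round(z, 2), decision)  # 2.0 reject H0
```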
Formulate the Hypothesis
The normal approach is to set two hypotheses instead of one, in such a way, that if one hypothesis is
true, the other is false. Alternatively, if one hypothesis is false or rejected, then the other is true or
accepted. These two hypotheses are:
(1) Null hypothesis
(2) Alternate hypothesis
Let us assume that the mean of the population is μ0 and the mean of the sample is x̄. Since we have
assumed that the population has a mean of μ0, this is our null hypothesis. We write this as H0: μ = μ0,
where H0 is the null hypothesis. The alternate hypothesis is HA: μ ≠ μ0. The rejection of the null
hypothesis will show that the mean of the population is not μ0. This implies that the alternate
hypothesis is accepted.
Statistical Significance Level
Having formulated the hypothesis, the next step is to test its validity at a certain level of significance.
The confidence with which a null hypothesis is accepted or rejected depends upon the significance level.

A significance level of, say, 5% means that the risk of making a wrong decision is 5%. The researcher
is likely to be wrong in accepting a false hypothesis or rejecting a true hypothesis on 5 out of 100
occasions. A significance level of, say, 1% means that the researcher runs the risk of being wrong
in accepting or rejecting the hypothesis on one out of every 100 occasions. Therefore, a 1%
significance level provides greater confidence in the decision than a 5% significance level.
There are two types of tests.
One-tailed and Two-tailed Tests
A hypothesis test may be one-tailed or two-tailed. In a one-tailed test, the test statistic leading to
rejection of the null hypothesis falls in only one tail of the sampling distribution curve.

Example:
In a right-tailed test, the critical region lies entirely in the right tail of the sampling distribution.
Whether the test is one-sided or two-sided depends on the alternate hypothesis.
A tyre company claims that the mean life of its new tyre is 15,000 km. The researcher then
formulates the hypothesis that mean tyre life = 15,000 km.
A two-tailed test is one in which the test statistic leading to rejection of the null hypothesis falls in both
tails of the sampling distribution curve, as shown.

Whether one should apply a one-tailed or a two-tailed hypothesis test depends on the nature of the
problem. A one-tailed test is used when the researcher’s interest is primarily on one side of the issue.
Example:
“Is the current advertisement less effective than the proposed new advertisement”?
A two-tailed test is appropriate, when the researcher has no reason to focus on one side of the
issue.
Example: “Are the two markets – Mumbai and Delhi different to test market a product?”
A product is manufactured by a semi-automatic machine. Now, assume that the same product is
manufactured by a fully automatic machine. This will be a two-sided test, because the null
hypothesis is that “the two methods used for manufacturing the product do not differ significantly”.
H0: μ1 = μ2

Degree of Freedom
It tells the researcher the number of elements that can be chosen freely.
Example: (a + b)/2 = 5. If we fix a = 3, b has to be 7. Therefore, the degrees of freedom is 1.
Select Test Criteria
If the hypothesis pertains to a large sample (30 or more), the Z-test is used. When the sample is
small (less than 30), the t-test is used.
Compute
Carry out computation.
Make Decisions
Accepting or rejecting of the null hypothesis depends on whether the computed value falls in the
region of rejection at a given level of significance.
Errors in Hypothesis Testing
There are two types of errors:
1. Hypothesis is rejected when it is true.
2. Hypothesis is not rejected when it is false.
(1) is called a Type 1 error (α); (2) is called a Type 2 error (β).
When α = 0.10, it means that a true hypothesis will be accepted on 90 out of 100 occasions. Thus, there
is a risk of rejecting a true hypothesis on 10 out of every 100 occasions. To reduce this risk, we use α =
0.01, which implies that we are prepared to take a 1% risk, i.e., the probability of rejecting a true
hypothesis is 1%. It is also possible that in hypothesis testing we may commit a Type 2 error (β), i.e.,
accept a null hypothesis which is false.
Example of Type 1 and Type 2 error:
Type 1 and Type 2 errors are illustrated as follows. Suppose a marketing company has 2 distributors
(retailers) with varying capabilities. On the basis of capabilities, the company has grouped them into
two categories: (1) competent retailer, (2) incompetent retailer. Thus R1 is a competent retailer and
R2 is an incompetent retailer. The firm wishes to award a performance bonus (as a part of trade
promotion) to encourage good retailership. Assume that two actions A1 and A2 represent
whether the bonus or trade incentive is given or not given. This is shown as follows:

When the firm has failed to reward a competent retailer, it has committed a Type 2 error. On the other
hand, when it has rewarded an incompetent retailer, it has committed a Type 1 error.
Types of Tests
1. Parametric test.
2. Non-parametric test.
Parametric Test
(1) Parametric tests are more powerful. The data in these tests are derived from interval and ratio
measurements.
(2) In parametric tests, it is assumed that the data follow a normal distribution. Examples of
parametric tests are (a) Z-test, (b) t-test and (c) F-test.
(3) Observations must be independent, i.e., the selection of any one item should not affect the chances
of any other being included in the sample.

What is univariate/bivariate data analysis?
Univariate
If we wish to analyse one variable at a time, this is called univariate analysis. For example: the effect
of pricing on sales. Here, price is the independent variable and sales is the dependent variable. Change
the price and measure the sales.
Bivariate
The relationship of two variables at a time is examined by means of bi-variate data analysis. If one is
interested in a problem of detecting whether a parameter has either increased or decreased, a two-
sided test is appropriate.
Non-parametric Test
Non-parametric tests are used to test the hypothesis with nominal and ordinal data.
(1) We do not make assumptions about the shape of population distribution.
(2) These are distribution-free tests.
(3) The hypothesis of non-parametric test is concerned with something other than the value of a
population parameter.
(4) They are easy to compute.
There are certain situations, particularly in marketing research, where the assumptions of parametric
tests are not valid. For example, a parametric test assumes that the data collected follow a normal
distribution. In such cases, non-parametric tests are used. Examples of non-parametric tests are
(a) Binomial test, (b) Chi-square test, (c) Mann-Whitney U test and (d) Sign test. A binomial test is
used when the population has only two classes, such as male/female, buyers/non-buyers,
success/failure, etc. All observations made about the population must fall into one of the two classes.
The binomial test is used when the sample size is small.
Advantages
1. They are quick and easy to use.
2. When data are not very accurate, these tests produce fairly good results.
Disadvantages
Non-parametric test involves the greater risk of accepting a false hypothesis and thus committing a
Type 2 error.
P-values
A p-value, sometimes called an uncertainty or probability coefficient, is based on properties of the
sampling distribution. It is usually expressed as p less than some decimal, as in p < .05 or p < .0006,
where the decimal is obtained by tweaking the significance setting of any statistical procedure. It is
used in two ways: (1) as a criterion level where you, the researcher have arbitrarily decided in
advance to use as the cutoff where you reject the null hypothesis, in which case, you would
ordinarily say something like “setting p at p > .65 for one-tailed or two-tailed tests of significance
allows some confidence that 65% of the time, rejecting the null hypothesis will not be in error”; and
more commonly, (2) as an expression of inference uncertainty after you have run some test statistic
regarding the strength of some association or relationship between your independent and dependent
variables, in which case, you would say something like “the evidence suggests there is a statistically
significant effect, however, p < .05 also suggests that 5% of the time, we should be uncertain about
the significance of drawing any statistical inferences.”
Summary
Hypothesis testing is the use of statistics to determine the probability that a given
hypothesis is true.
The usual process of hypothesis testing consists of four steps.
Formulate the null hypothesis and the alternative hypothesis.
Identify a test statistic that can be used to assess the truth of the null hypothesis.
Compute the P-value, which is the probability that a test statistic at least as significant as the one
observed would be obtained assuming that the null hypothesis were true.

The smaller the p-value, the stronger the evidence against the null hypothesis.
Compare the p-value to an acceptable significance value α.
If p ≤ α, the observed effect is statistically significant, the null hypothesis is ruled out, and the
alternative hypothesis is valid.

Test of Significance
Introduction
Tests for statistical significance are used to estimate the probability that a relationship observed in
the data occurred only by chance; the probability that the variables are really unrelated in the
population. They can be used to filter out unpromising hypotheses. In research reports, tests of
statistical significance are reported in three ways. First, the results of the test may be reported in the
textual discussion of the results. Include:
1. Hypothesis
2. Test statistic used and its value
3. Degrees of freedom
4. Value for alpha (p-value)
Tests for statistical significance are used because they constitute a common yardstick that can be
understood by a great many people, and they communicate essential information about a research
project that can be compared to the findings of other projects. However, they do not assure that the
research has been carefully designed and executed. In fact, tests for statistical significance may be
misleading, because they are precise numbers. But they have no relationship to the practical
significance of the findings of the research.
Finally, one must always use measures of association along with tests for statistical significance. The
latter estimate the probability that the relationship exists; while the former estimate the strength (and
sometimes the direction) of the relationship. Each has its use, and they are best when used together.
There are two types of tests:
Small Sample Tests
Large Sample Test
Small Sample Tests
T-test
The t-test is used in the following circumstances: when the sample size is n < 30.
Example: A certain pesticide is packed into bags by a machine. Random samples of 10 bags are
drawn and their contents are found as follows: 50,49,52,44,45,48,46,45,49,45. Confirm whether the
average packaging can be taken to be 50 kgs.
In this example, the sample size is less than 30 and the population standard deviation is not known,
so the t-test is used. We can find out whether there is any significant difference between the sample
mean and the claimed mean, i.e., whether the population mean can be taken to be 50 kg.
The Student’s T-distribution
Let X1, X2, ..., Xn be n independent random variables from a normal population with mean μ and
standard deviation σ (unknown).

When σ is not known, it is estimated by s, the sample standard deviation.


In such a case we would like to know the exact distribution of the statistic (X̄ − μ)/(s/√n),
and the answer to this is provided by the t-distribution.
W.S. Gosset defined the t statistic as t = (X̄ − μ)/(s/√n),

which follows the t-distribution with (n − 1) degrees of freedom.
Features of t-distribution
1. Like the χ²-distribution, the t-distribution has one parameter ν = n − 1, where n denotes the sample
size. Hence, this distribution is known if n is known.
2. The mean of the random variable t is zero and its standard deviation is √(ν/(ν − 2)), for ν > 2.
3. The probability curve of the t-distribution is symmetrical about the ordinate at t = 0. Like a normal
variable, the t variable can take any value from −∞ to +∞.
4. The distribution approaches the normal distribution as the number of degrees of freedom becomes
large.
5. The random variate t is defined as the ratio of a standard normal variate to the square root of a χ²-
variate divided by its degrees of freedom.

Illustration: There are two nourishment programmes, ‘A’ and ‘B’. Two groups of children are
subjected to them, and their weight is measured after six months. The first group of children,
subjected to programme ‘A’, weighed 44, 37, 48, 60, 41 kg at the end of the programme. The second
group of children were subjected to nourishment programme ‘B’ and weighed 42, 42, 58, 64, 64, 67,
62 kg at the end of the programme. From the above, can we conclude that nourishment programme
‘B’ increased the weight of the children significantly, at a 5% level of significance?
Null Hypothesis: There is no significant difference between nourishment programmes ‘A’ and ‘B’.
Alternative Hypothesis: Nourishment programme ‘B’ is better than ‘A’, i.e., nourishment programme
‘B’ increases the children’s weight significantly.
Solution:

Calculated t = 1.89

t at 10 d.f. at 5% level is 1.81.


Since the calculated t (1.89) is greater than 1.81, it is significant. Hence HA is accepted. Therefore,
the two nourishment programmes differ significantly with respect to weight increase.
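The calculated value t = 1.89 can be reproduced with the pooled two-sample t statistic (assuming equal population variances, with n1 + n2 − 2 = 10 degrees of freedom):

```python
import math
import statistics

group_a = [44, 37, 48, 60, 41]          # programme 'A' weights (kg)
group_b = [42, 42, 58, 64, 64, 67, 62]  # programme 'B' weights (kg)

na, nb = len(group_a), len(group_b)
ma, mb = statistics.mean(group_a), statistics.mean(group_b)  # 46 and 57

# Pooled standard deviation with (na + nb - 2) = 10 degrees of freedom
ssa = sum((x - ma) ** 2 for x in group_a)
ssb = sum((x - mb) ** 2 for x in group_b)
s_pooled = math.sqrt((ssa + ssb) / (na + nb - 2))

t = (mb - ma) / (s_pooled * math.sqrt(1 / na + 1 / nb))
print(round(t, 2))  # 1.89
```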
Snedecor’s F-distribution
Let there be two independent random samples of sizes n1 and n2 from two normal populations
with variances σ1² and σ2² respectively. Further, let s1² and s2² be the variances of the first
sample and the second sample respectively.
Then the F-statistic is defined as the ratio of two χ²-variates, each divided by its degrees of
freedom. Thus, we can write
F = (s1²/σ1²) / (s2²/σ2²)
Features of F-distribution
1. This distribution has two parameters ν1 = n1 − 1 and ν2 = n2 − 1.
2. The mean of the F-variate with ν1 and ν2 degrees of freedom is ν2/(ν2 − 2), for ν2 > 2, and its
standard error is √[2ν2²(ν1 + ν2 − 2)/(ν1(ν2 − 2)²(ν2 − 4))], for ν2 > 4.
3. The random variate F can take only positive values, from 0 to ∞.
4. For large values of ν1 and ν2, the distribution approaches the normal distribution.
5. If a random variate follows the t-distribution with ν degrees of freedom, then its square follows the
F-distribution with 1 and ν d.f., i.e., t²(ν) = F(1, ν).
6. F and χ² are also related: as ν2 → ∞, ν1·F(ν1, ν2) tends to χ²(ν1).

Large Sample Test


Z-test (Parametric Test)
(a) When sample size is > 30
(b) P1 = Proportion in sample 1
P2 = Proportion in sample 2
Example: You are working as a purchase manager for a company. The following information has
been supplied by two scooter tyres manufacturers.

In the above, the sample size is 100, hence a Z-test may be used.
Testing the hypothesis about the difference between two means: this can be used when two population
means are given and the null hypothesis is H0: μ1 = μ2.
Example: In a city during the year 2000, 15% of households indicated that they read ‘Femina’
magazine. Three years later, the publisher had reasons to believe that circulation has gone up. A
survey was conducted to confirm this. A sample of 1,000 respondents were contacted and it was

found 210 respondents confirmed that they subscribe to the periodical ‘Femina’. From the above, can
we conclude that there is a significant increase in the circulation of ‘Femina’?
Solution:
We will set up null hypothesis and alternate hypothesis as follows:
The null hypothesis is H0: p = 15%.
The alternate hypothesis is HA: p > 15%.
This is a one-tailed (right) test.

As the value of Z at 0.05 is 1.64 and the calculated value of Z falls in the rejection region, we reject
the null hypothesis and conclude that the circulation of ‘Femina’ has increased significantly.
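The Z value can be reproduced as a one-sample proportion test against the solution’s hypothesised 15%:

```python
import math

p0 = 0.15           # hypothesised proportion under H0
p_hat = 210 / 1000  # observed sample proportion (0.21)
n = 1000

se = math.sqrt(p0 * (1 - p0) / n)  # standard error under H0
z = (p_hat - p0) / se
print(round(z, 2))  # 5.31
```

Since 5.31 > 1.64, the calculated Z lies in the rejection region of the right-tailed test.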
Chi-square Test
With the help of this test, we come to know whether two or more attributes are associated or not.
How closely the two attributes are related cannot be measured by the chi-square test. Suppose we
have a certain number of observations classified according to two attributes. We may like to know
whether a newly introduced medicine is effective in the treatment of a certain disease or not.
The numbers of automobile accidents per week in a certain city were as follows:
12, 8, 20, 2, 14, 10, 15, 6, 9, 4
Does the above data indicate that accident conditions were uniform during the 10-week period?
Expected frequency = (12 + 8 + 20 + 2 + 14 + 10 + 15 + 6 + 9 + 4)/10 = 100/10 = 10
Computation
Null hypothesis: The accident occurrence is uniform over the 10-week period.

χ² = Σ (O − E)²/E
where O is the observed frequency and E is the expected frequency.
d.f. = 10 − 1 = 9
Table value at 5% for 9 degrees of freedom = 16.92
Since the calculated value 26.6 is greater than the table value 16.92, the null hypothesis is rejected at
the 5% level of significance.
Conclusion: The accident occurrences are not uniform over the 10-week period.
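The χ² computation above can be checked directly:

```python
observed = [12, 8, 20, 2, 14, 10, 15, 6, 9, 4]  # weekly accident counts
expected = sum(observed) / len(observed)         # 100 / 10 = 10

# Chi-square statistic: sum of (O - E)^2 / E over the ten weeks
chi_sq = sum((o - expected) ** 2 / expected for o in observed)
print(round(chi_sq, 1))  # 26.6
```

The result, 26.6, exceeds the 5% table value at 9 d.f. (16.92), so the null hypothesis of uniform accident occurrence is rejected.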
Summary
Significance Level: Significance level is the criterion used for rejecting the null hypothesis.
Tests for statistical significance: Tests for statistical significance are used to estimate the probability
that a relationship observed in the data occurred only by chance; the probability that the variables are
really unrelated in the population.
Testing the hypothesis about the difference between two means: This can be used when two
population means are given and the null hypothesis is H0: μ1 = μ2.

*******

Lesson 8
Index Number

Introduction
An index number is a statistical measure used to compare the average level of magnitude of a group
of distinct but related variables in two or more situations. Suppose that we want to compare the average
price level of different items of food in 2020 with what it was in 2010. Let the different items of food
be wheat, rice, milk, eggs, ghee, sugar, pulses, etc. If the prices of all these items change in the same
ratio and in the same direction (assume that the prices of all the items have increased by 10% in 2020
as compared with their prices in 2010), then there will be no difficulty in finding out the average change
in price level for the group as a whole. Obviously, the average price level of all the items taken as a
group will also be 10% higher in 2020 as compared with prices of 2010. However, in real situations,
neither the prices of all the items change in the same ratio nor in the same direction, i.e., the prices of
some commodities may change to a greater extent as compared to prices of other commodities.
Moreover, the price of some commodities may rise while that of others may fall. For such situations,
index numbers are a very useful device for measuring the average change in prices or any other
characteristic like quantity, value, etc., for the group as a whole.
Another important feature of an index number is that it is often used to average a characteristic
expressed in different units for different items of a group. In the words of Tuttle: “An index number is
a single ratio (usually in percentage) which measures the combined (i.e., averaged) change of several
variables between two different times, places or situations.” For example, the price of wheat may be
quoted as rupee/kg., price of milk as rupee/litre, price of eggs as rupee/dozen, etc. To arrive at a single
figure that expresses the average change in price for the whole group, various prices have to be
combined and averaged in a suitable way. This single figure is known as price index and can be used
to determine the extent and direction of average change in the prices for the group. In a similar way
we can construct quantity index numbers, value index numbers, etc.
Characteristics of index numbers
1. Index numbers are specialised averages: An average of a data set is its representative summary
figure. In a similar way, an index number is also an average, often a weighted average, computed for
a group. It is called a specialised average because the figures that are averaged are not necessarily
expressed in homogeneous units.
2. Index numbers measure the changes for a group which are not capable of being directly
measured: The examples of such magnitudes are: Price level of a group of items, level of business
activity in a market, level of industrial or agricultural output in an economy, etc.
3. Index numbers are expressed in terms of percentages: The changes in magnitude of a group are
expressed in terms of percentages which are independent of the units of measurement. This facilitates
the comparison of two or more index numbers in different situations.
Uses of Index Numbers
The main uses of index numbers are:
1. To measure and compare changes: The basic purpose of the construction of an index number is
to measure the level of activity of phenomena like price level, cost of living, level of agricultural
production, level of business activity, etc. It is because of this reason that sometimes index numbers
are termed as barometers of economic activity. It may be mentioned here that a barometer is an
instrument which is used to measure atmospheric pressure in physics. The level of an activity can be

expressed in terms of index numbers at different points of time or for different places at a particular
point of time. These index numbers can be easily compared to determine the trend of the level of an
activity over a period of time or with reference to different places.
2. To help in providing guidelines for framing suitable policies: Index numbers are indispensable
tools for the management of any government or non-government organisation. For example, the
increase in cost of living index is helpful in deciding the amount of additional dearness allowance that
should be paid to the workers to compensate them for the rise in prices. In addition to this, index
numbers can be used in planning and formulation of various government and business policies.
3. Price index numbers are used in deflating: This is a very important use of price index numbers.
These index numbers can be used to adjust monetary figures of various periods for changes in prices.
For example, the figure of national income of a country is computed on the basis of the prices of the
year in question. Such figures, for various years often known as national income at current prices, do
not reveal the real change in the level of production of goods and services. In order to know the real
change in national income, these figures must be adjusted for price changes in various years. Such
adjustments are possible only by the use of price index numbers and the process of adjustment, in a
situation of rising prices, is known as deflating.
4. To measure purchasing power of money: We know that there is inverse relation between the
purchasing power of money and the general price level measured in terms of a price index number.
Thus, reciprocal of the relevant price index can be taken as a measure of the purchasing power of
money.
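This reciprocal relation can be illustrated with a small Python sketch (the index value 125 is an illustrative figure, not from the text):

```python
# If the price index rises to 125 (base = 100), each rupee now buys only
# as much as 100/125 = 0.80 rupees bought in the base year.
price_index = 125.0                      # illustrative general price index
purchasing_power = 100.0 / price_index   # base-year rupees per current rupee
print(purchasing_power)
```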
Construction of Index Numbers
To illustrate the construction of an index number, we reconsider the various items of food mentioned earlier. Let the prices of the different items in the two years, 2019 and 2021, be as given below:

Item        Price in 2019 (Rs/unit)   Price in 2021 (Rs/unit)
1. Wheat    300/quintal               360/quintal
2. Rice     12/kg                     15/kg
3. Milk     7/litre                   8/litre
4. Eggs     11/dozen                  12/dozen
5. Ghee     80/kg                     88/kg
6. Sugar    9/kg                      10/kg
7. Pulses   14/kg                     16/kg
The comparison of price of an item, say wheat, in 2021 with its price in 2019 can be done in two ways,
explained below:

1. By taking the difference of prices in the two years, i.e., 360 – 300 = 60, one can say that the price
of wheat has gone up by 60/quintal in 2021 as compared with its price in 2019.
2. By taking the ratio of the two prices, i.e., 360/300 = 1.20, one can say that if the price of wheat in 2019 is taken to be 1, then it has become 1.20 in 2021. A more convenient way of comparing the two prices is to express the price ratio in terms of percentage, i.e., (360/300) × 100 = 120, known as the Price Relative of the item. In our example, the price relative of wheat is 120, which can be interpreted as the price of wheat in 2021 when its price in 2019 is taken as 100. Further, the figure 120 indicates that the price of wheat has gone up by 120 – 100 = 20% in 2021 as compared with its price in 2019.
The first way of expressing the price change is inconvenient because the change in price depends upon
the units in which it is quoted. This problem is taken care of in the second method, where price change
is expressed in terms of percentage. An additional advantage of this method is that various price
changes, expressed in percentage, are comparable. Further, it is very easy to grasp the 20% increase in
price rather than the increase expressed as 60 rupees/quintal.
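The price-relative calculation for wheat can be expressed directly in Python, using the figures from the table above:

```python
# Price relative of wheat: 2021 price over 2019 price, as a percentage
p0 = 300.0   # price in 2019, Rs/quintal (base year)
p1 = 360.0   # price in 2021, Rs/quintal (current year)
price_relative = p1 / p0 * 100
print(price_relative)   # 120.0, i.e., a 20% rise over the base year
```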
For the construction of an index number, we have to obtain the average price change for the group in 2021, usually termed the Current Year, as compared with the prices of 2019, usually called the Base Year. This comparison can be done in two ways:
1. By taking suitable average of price relatives of different items. The methods of index number
construction based on this procedure are termed as Average of Price Relative Methods.
2. By taking ratio of the averages of the prices of different items in each year. These methods are
popularly known as Aggregative Methods. Since the average in each of the above methods can be
simple or weighted, these can further be divided as simple or weighted. Various methods of index
number construction can be classified as shown below:
Figure : Various Methods of Index Number Construction

In addition to this, a particular method would depend upon the type of average used. Although the geometric mean is more suitable for averaging ratios, the arithmetic mean is often preferred because of its simplicity of computation and interpretation.
Notations and Terminology
Before writing various formulae of index numbers, it is necessary to introduce certain notations and
terminology for convenience.
Base Year: The year from which comparisons are made is called the base year. It is commonly denoted
by writing ‘0’ as a subscript of the variable.
Current Year: The year under consideration for which the comparisons are to be computed is called
the current year. It is commonly denoted by writing ‘1’ as a subscript of the variable.

Let there be n items in a group which are numbered from 1 to n. Let p0i denote the price of the
ith item in base year and p1i denote its price in current year, where i = 1, 2, ...... n. In a similar way q0i
and q1i will denote the quantities of the ith item in base and current years respectively.
Using these notations, we can write the price relative of the ith item as
Pi = (p1i / p0i) × 100
and the quantity relative of the ith item as
Qi = (q1i / q0i) × 100
Further, P01 will be used to denote the price index number of period ‘1’ as compared with the prices of period ‘0’. Similarly, Q01 and V01 will denote the quantity and the value index numbers respectively of period ‘1’ as compared with period ‘0’.
Construction of an Index Number
Un-weighted Index
In an un-weighted index number, weights are not assigned to the various items used for the calculation of the index. Two un-weighted price index numbers are given below:

Simple Average of Price Relatives:


When the arithmetic mean of price relatives is used, the index number formula is given by
P01 = ΣPi / n = Σ[(p1/p0) × 100] / n
Example: Given below are the prices of 5 items in 2005 and 2010. Compute the simple price index number of 2010 taking 2005 as the base year.

Item Price in 2005 (Rs/unit) Price in 2010 (Rs/unit)


1 15 20
2 8 7
3 200 300
4 60 110
5 100 130

Solution

Item   Price in 2005 (Rs/unit)   Price in 2010 (Rs/unit)   (p1/p0) × 100
1 15 20 133.33
2 8 7 87.50
3 200 300 150.00
4 60 110 183.33
5 100 130 130.00
Total 684.16

Index number P01 = Σ[(p1/p0) × 100] / n = 684.16 / 5 = 136.83
Here, the price level is said to have risen by 36.83 per cent, the index number being 136.83.
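The computation above can be checked with a short Python sketch:

```python
# Simple average of price relatives for the five items in the example
prices_2005 = [15, 8, 200, 60, 100]    # base-year prices (Rs/unit)
prices_2010 = [20, 7, 300, 110, 130]   # current-year prices (Rs/unit)

relatives = [p1 / p0 * 100 for p0, p1 in zip(prices_2005, prices_2010)]
index_number = sum(relatives) / len(relatives)
print(round(index_number, 2))   # 136.83
```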
Such un-weighted price index numbers have limited use. The reasons are as follows:
(a) The simple average of price relatives does not take into account the relative importance of the various commodities, since equal importance is given to all the items.
(b) In the simple aggregative method (discussed below), the different items are required to be expressed in the same unit. In practice, however, the different items may be expressed in different units.
(c) The simple aggregative index is not reliable, as it is affected by the units in which the prices of the commodities are quoted.
Weighted Average of Price Relatives: In order to take the relative importance of items into account, weighting the different items in proportion to their degree of importance becomes necessary.
Let wi be the weight assigned to the ith item (i = 1, 2, ...... n). The index number, given by the weighted arithmetic mean of price relatives, is
P01 = ΣPiwi / Σwi
An index number becomes a weighted index when the relative importance of the items is taken care of.
Nature of Weights
While taking a weighted average of price relatives, the values are often taken as weights. These weights can be the values of base year quantities valued at base year prices, i.e., p0iq0i; the values of current year quantities valued at current year prices, i.e., p1iq1i; the values of current year quantities valued at base year prices, i.e., p0iq1i; or any other value.
Example:
Construct an index number for 2010 taking 2002 as base for the following data, by using the weighted arithmetic mean of price relatives.
Commodities Prices in 2002 Prices in 2010 Weights
A 60 100 30
B 20 20 20
C 40 60 24
D 100 120 30
E 120 80 10

Solution:
Commodities   Prices in 2002   Prices in 2010   (p1/p0) × 100   Weights (w)   Relative × w
A             60               100              166.67          30            5000.10
B             20               20               100.00          20            2000.00
C             40               60               150.00          24            3600.00
D             100              120              120.00          30            3600.00
E             120              80               66.67           10            666.70
Total                                                           114           14866.80

Index number P01 = Σ[(p1/p0 × 100) × w] / Σw = 14866.80 / 114 = 130.41
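The same calculation can be sketched in Python; the exact (unrounded) relatives are used, so the total differs negligibly from the rounded table figures:

```python
# Weighted arithmetic mean of price relatives, with 2002 as base year
p0 = [60, 20, 40, 100, 120]   # prices in 2002
p1 = [100, 20, 60, 120, 80]   # prices in 2010
w  = [30, 20, 24, 30, 10]     # weights

relatives = [b / a * 100 for a, b in zip(p0, p1)]
index_number = sum(r * wi for r, wi in zip(relatives, w)) / sum(w)
print(round(index_number, 2))   # 130.41
```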
Simple Aggregative Method: In this method, the simple arithmetic means of the prices of all the items of the group for the current and the base year are computed separately. The ratio of the current year average to the base year average, multiplied by 100, gives the required index number.
Using these notations, the arithmetic mean of the prices of n items in the current year is Σp1i/n and that in the base year is Σp0i/n. Thus, the simple aggregative price index is
P01 = [(Σp1i/n) / (Σp0i/n)] × 100 = (Σp1i / Σp0i) × 100
Omitting the subscript i, the above index number can also be written as
P01 = (Σp1 / Σp0) × 100
Example: The following table gives the prices of six items in the years 2020 and 2021. Use the simple aggregative method to find the index of 2021 with 2020 as base.
Item   2020   2021
A 40 50
B 60 60
C 20 30
D 50 70
E 80 90
F 100 100
Solution:
Let p0 be the price in 2020 and p1 be the price in 2021. Thus, we have
Item 2020 2021
A 40 50
B 60 60
C 20 30
D 50 70
E 80 90
F 100 100
Total 350 400

P01 = (400 / 350) × 100 = 114.29
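The same result can be obtained in Python:

```python
# Simple aggregative price index: ratio of price totals, times 100
p0 = [40, 60, 20, 50, 80, 100]   # prices in 2020
p1 = [50, 60, 30, 70, 90, 100]   # prices in 2021

index_number = sum(p1) / sum(p0) * 100
print(round(index_number, 2))   # 114.29
```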

Weighted Aggregative Method: This index number is defined as the ratio of the weighted arithmetic mean of current year prices to that of base year prices, multiplied by 100.
Using the notations defined earlier, the weighted arithmetic mean of current year prices is Σp1iwi / Σwi and, similarly, the weighted arithmetic mean of base year prices is Σp0iwi / Σwi. Thus, the price index number is
P01 = [(Σp1iwi / Σwi) / (Σp0iwi / Σwi)] × 100 = (Σp1iwi / Σp0iwi) × 100
Omitting the subscript, we can also write
P01 = (Σp1w / Σp0w) × 100
Nature of Weights
In case of weighted aggregative price index numbers, quantities are often taken as weights.

These quantities can be the quantities purchased in base year or in current year or an average of base
year and current year quantities or any other quantities. Depending upon the choice of weights, some
of the popular formulae for weighted index numbers can be written as follows:
1. Laspeyres’ Index: Laspeyres’ price index number uses base year quantities as weights. Thus, we can write
P01 = (Σp1q0 / Σp0q0) × 100
2. Paasche’s Index: This index number uses current year quantities as weights. Thus, we can write
P01 = (Σp1q1 / Σp0q1) × 100
3. Fisher’s Ideal Index: As will be discussed later, the Laspeyres’ index has an upward bias and the Paasche’s index has a downward bias. In view of this, Fisher suggested that an ideal index should be the geometric mean of the Laspeyres’ and Paasche’s indices. Thus, Fisher’s formula can be written as
P01 = √[(Σp1q0 / Σp0q0) × (Σp1q1 / Σp0q1)] × 100
or, equivalently,
P01 = √(P01L × P01P)
where P01L is the Laspeyres’ index and P01P is the Paasche’s index.
4. Dorbish and Bowley’s Index: This index number is constructed by taking the arithmetic mean of the Laspeyres’ and Paasche’s indices:
P01 = ½ [(Σp1q0 / Σp0q0) + (Σp1q1 / Σp0q1)] × 100
5. Marshall and Edgeworth’s Index: This index number uses the arithmetic mean of base and current year quantities:
P01 = [Σp1(q0 + q1) / Σp0(q0 + q1)] × 100
Example: For the data given in the following table, compute
1. Laspeyres’s Price Index
2. Paasche’s Price Index
3. Fisher’s Ideal Index
4. Dorbish and Bowley’s Price Index
5. Marshall and Edgeworth’s Price Index
P0 Q0 P1 Q1
10 30 12 50
8 15 10 25
6 20 6 30
4 10 6 20

Solution:
P0      Q0      P1      Q1      P0Q0    P1Q0    P0Q1    P1Q1
10      30      12      50      300     360     500     600
8       15      10      25      120     150     200     250
6       20      6       30      120     120     180     180
4       10      6       20      40      60      80      120
Total                           580     690     960     1150

The various price index numbers are calculated as given below:
1. Laspeyres’ P01 = (690/580) × 100 = 118.97
2. Paasche’s P01 = (1150/960) × 100 = 119.79
3. Fisher’s P01 = √[(690/580) × (1150/960)] × 100 = 119.38
4. Dorbish and Bowley’s P01 = ½ [(690/580) + (1150/960)] × 100 = 119.38
5. Marshall and Edgeworth’s P01 = [(690 + 1150) / (580 + 960)] × 100 = 119.48
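All five weighted aggregative indices can be computed in one Python sketch from the example data:

```python
from math import sqrt

p0 = [10, 8, 6, 4];  q0 = [30, 15, 20, 10]   # base-year prices and quantities
p1 = [12, 10, 6, 6]; q1 = [50, 25, 30, 20]   # current-year prices and quantities

def total(x, y):
    """Sum of products, e.g. total(p1, q0) = sum of p1*q0 over all items."""
    return sum(a * b for a, b in zip(x, y))

laspeyres = total(p1, q0) / total(p0, q0) * 100   # base-year quantity weights
paasche   = total(p1, q1) / total(p0, q1) * 100   # current-year quantity weights
fisher    = sqrt(laspeyres * paasche)             # geometric mean of the two
bowley    = (laspeyres + paasche) / 2             # arithmetic mean of the two
marshall  = (total(p1, q0) + total(p1, q1)) / (total(p0, q0) + total(p0, q1)) * 100
```

Rounding to two decimal places reproduces the figures obtained above: 118.97, 119.79, 119.38, 119.38 and 119.48.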
Quantity Index Numbers
A quantity index number measures the change in quantities in current year as compared with a base
year. The formulae for quantity index numbers can be directly written from price index numbers simply
by interchanging the roles of price and quantity. Similar to a price relative, we can define a quantity relative as
Q = (q1 / q0) × 100

Various formulae for quantity index numbers are given below:
1. Simple aggregative index: Q01 = (Σq1 / Σq0) × 100
2. Simple average of quantity relatives: Q01 = Σ[(q1/q0) × 100] / n = ΣQ / n
3. Weighted aggregative index:
(a) Laspeyres’ Index: Q01 = (Σq1p0 / Σq0p0) × 100
(b) Paasche’s Index: Q01 = (Σq1p1 / Σq0p1) × 100
(c) Fisher’s Ideal Index:
Q01 = √[(Σq1p0 / Σq0p0) × (Σq1p1 / Σq0p1)] × 100
or, equivalently, Q01 = √(Q01L × Q01P), where Q01L is the Laspeyres’ quantity index and Q01P is the Paasche’s quantity index.
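With the same example data used earlier for the price indices, the quantity indices can be computed by swapping the roles of prices and quantities:

```python
from math import sqrt

p0 = [10, 8, 6, 4];  q0 = [30, 15, 20, 10]   # base-year prices and quantities
p1 = [12, 10, 6, 6]; q1 = [50, 25, 30, 20]   # current-year prices and quantities

def total(x, y):
    """Sum of products over all items."""
    return sum(a * b for a, b in zip(x, y))

laspeyres_q = total(q1, p0) / total(q0, p0) * 100   # base-year prices as weights
paasche_q   = total(q1, p1) / total(q0, p1) * 100   # current-year prices as weights
fisher_q    = sqrt(laspeyres_q * paasche_q)         # geometric mean of the two
```

For these data the quantity indices come out to about 165.52 (Laspeyres’), 166.67 (Paasche’s) and 166.09 (Fisher’s), indicating a rise of roughly two-thirds in quantities.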
Test of Adequacy for an Index Number
Index numbers are constructed to measure the relative changes in price and quantity between any two periods. Two tests are used to examine the adequacy of an index number formula:
(i) Time Reversal Test
(ii) Factor Reversal Test
A good index number should satisfy both of these tests.
Time Reversal Test
This is an important test of the consistency of an index number formula. The test requires that the formula work consistently both forward and backward with respect to time (here time refers to the base year and the current year). Symbolically, with the indices expressed as ratios (i.e., without the factor 100), the following relationship should be satisfied:
P01 × P10 = 1
Fisher’s index number formula satisfies this relationship. Written as a ratio,
P01 = √[(Σp1q0 / Σp0q0) × (Σp1q1 / Σp0q1)]
and, when the base year and current year are interchanged, we get
P10 = √[(Σp0q1 / Σp1q1) × (Σp0q0 / Σp1q0)]
so that P01 × P10 = √1 = 1.
Factor Reversal Test
This is another test of the consistency of an index number formula. The product of the price index number and the quantity index number from the base year to the current year should be equal to the true value ratio, i.e., the ratio of the total value of the current period to the total value of the base period. The Factor Reversal Test is given by
P01 × Q01 = Σp1q1 / Σp0q0
Fisher’s formula satisfies this test as well, since
P01 × Q01 = √[(Σp1q0 / Σp0q0) × (Σp1q1 / Σp0q1)] × √[(Σq1p0 / Σq0p0) × (Σq1p1 / Σq0p1)] = Σp1q1 / Σp0q0
Example
Calculate Fisher’s price index number and show that it satisfies both the Time Reversal Test and the Factor Reversal Test.
[The data table and the worked solution for this example appeared as images in the original and are not reproduced here.]
Summary
An index number is a statistical measure used to compare the average level of magnitude of a group
of distinct but related variables in two or more situations.
In real situations, neither the prices of all the items change in the same ratio nor in the same direction,
i.e., the prices of some commodities may change to a greater extent as compared to prices of other
commodities.
Index numbers are a very useful device for measuring the average change in prices, or in any other characteristic like quantity, value, etc., for the group as a whole.
Index numbers are specialised averages that are used to measure changes in a characteristic which is not capable of being directly measured.
The changes in magnitude of a group are expressed in terms of percentages which are independent of
the units of measurement. This facilitates the comparison of two or more index numbers in different
situations.

Index numbers are indispensable tools for the management of any government or non-government organisation.
There is inverse relation between the purchasing power of money and the general price level measured
in terms of a price index number.
The reciprocal of the relevant price index can be taken as a measure of the purchasing power of money.
The year from which comparisons are made is called the base year. It is commonly denoted by writing
‘0’ as a subscript of the variable.
While taking weighted average of price relatives, the values are often taken as weights. These weights
can be the values of base year quantities valued at base year prices.
In case of weighted aggregative price index numbers, quantities are often taken as weights.
These quantities can be the quantities purchased in base year or in current year or an average of base
year and current year quantities or any other quantities.
A quantity index number measures the change in quantities in current year as compared with a base
year.
When comparisons of various periods are made with reference to a particular period, termed the base period, the resulting series of index numbers is known as a fixed base series.

*********
