UNIT 10 SAMPLING DISTRIBUTIONS
Objectives
When you have successfully completed this unit, you should be able to:
• understand the meaning of sampling distribution of a sample statistic
• obtain the sampling distribution of the mean
• get an understanding of the sampling distribution of variance
• construct the sampling distribution of the proportion
• know the Central Limit Theorem and appreciate why it is used so
extensively in practice
• develop confidence intervals for the population mean and the population
proportion
• determine the sample size required while estimating the population mean
or the population proportion.
Structure
10.1 Introduction
10.2 Sampling Distribution of the Mean
10.3 Central Limit Theorem
10.4 Sampling Distribution of the Variance
10.5 The Student's t Distribution
10.6 Sampling Distribution of the Proportion
10.7 Interval Estimation
10.8 The Sample Size
10.9 Summary
10.10 Self-assessment Exercises
10.11 Further Readings
10.1 INTRODUCTION
Having discussed the various methods available for picking up a sample from
a population we would naturally be interested in drawing inferences about the
population based on our observations made on the sample members. This
could mean estimating the value of a population parameter, testing a
statistical hypothesis about the population, comparing two or more
populations, performing correlation and regression analysis on more than one
variable measured on the sample members, and many other inferences. We
shall discuss some of these problems in this and the subsequent units.
What is a Sampling Distribution?
Suppose we are interested in drawing some inference regarding the weight of
containers produced by an automatic filling machine. Our population,
therefore, consists of all the filled-containers produced in the past as well as
those which are going to be produced in the future by the automatic filling
machine. We pick up a sample of size n and take measurements regarding the
characteristic we are interested in viz. the weight of the filled container on
each of our sample members. We thus end up with n sample values
x₁, x₂, …, xₙ. As described in the previous unit, any quantity which can be
determined as a function of the sample values x₁, x₂, …, xₙ is called a sample
statistic.
Referring to our earlier discussion on the concept of a random variable, it is
not difficult to see that any sample statistic is a random variable and,
therefore, has a probability distribution or a probability density function. It is
also known as the sampling distribution of the statistic. In practice, we refer
to the sampling distributions of only the commonly used sampling statistics
like the sample mean, sample variance, sample proportion, sample median
etc., which have a role in making inferences about the population.
Why Study Sampling Distributions?
Sample statistics form the basis of all inferences drawn about populations. If
we know the probability distribution of the sample statistic, then we can
calculate the probability that the sample statistic assumes a particular value
(if it is a discrete random variable) or has a value in a given interval. This
ability to calculate the probability that the sample statistic lies in a particular
interval is the most important factor in all statistical inferences. We will
demonstrate this by an example.
Suppose we know that 45% of the population of all users of talcum powder
prefer our brand to the next competing brand. A "new improved" version of
our brand has been developed and given to a random sample of 100 talcum
powder users for use. If 60 of these prefer our "new improved" version to the
next competing brand, what should we conclude? For an answer, we would
like to know the probability that the sample proportion in a sample of size
100 is as large as 60% or higher when the true population proportion is only
45%, i.e. assuming that the new version is no better than the old. If this
probability is quite large, say 0.5, we might conclude that the high sample
proportion, viz. 60%, is perhaps because of sampling errors and the new
version is not really superior to the old. On the other hand, if this probability
works out to a very small figure, say 0.001, then rather than concluding that
we have observed a rare event we might conclude that the true population
proportion is higher than 45%, i.e. the new version is actually superior to the
old one as perceived by members of the population. To calculate this
probability, we need to know the probability distribution of sample
proportion or the sampling distribution of the proportion.
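The probability in this example can also be estimated by straightforward simulation. The sketch below is illustrative (the function name and the number of trials are our own choices, not from the text): it repeatedly draws samples of 100 users with a true preference rate of 45% and counts how often the sample proportion reaches 60%.

```python
import random

def prob_sample_prop_at_least(p, n, threshold, trials=20_000, seed=42):
    """Estimate P(sample proportion >= threshold) when the true
    population proportion is p, by repeated sampling."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        successes = sum(rng.random() < p for _ in range(n))
        if successes / n >= threshold:
            hits += 1
    return hits / trials

# With a true preference rate of 45%, a sample proportion of 60% or
# more in a sample of 100 turns out to be extremely rare.
prob = prob_sample_prop_at_least(p=0.45, n=100, threshold=0.60)
```

A probability this small would lead us, as argued above, to conclude that the true population proportion is higher than 45% rather than that we observed a rare event.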
10.2 SAMPLING DISTRIBUTION OF THE MEAN
We shall first discuss the sampling distribution of the mean. We start by
discussing the concept of the sample mean and then study its expected value
and variance in the general case. We shall end this section by describing the
sampling distribution of the mean in the special case when the population
distribution is normal.
The Sample Mean
Suppose we have a simple random sample of size n picked up from a
population. We take measurements on each sample member in the
characteristic of our interest and denote the observations as x₁, x₂, …, xₙ
respectively. The sample mean for this sample, represented by x̄, is defined
as

x̄ = (x₁ + x₂ + ⋯ + xₙ)/n
If we pick up another sample of size n from the same population, we might
end up with a totally different set of sample values and so a different sample
mean. Therefore, there are many (perhaps infinite) possible values of the
sample mean and the particular value that we obtain, if we pick up only one
sample, is determined only by chance causes. The distribution of the sample
mean is also referred to as the sampling distribution of the mean.
However, to observe the distribution of x̄ empirically, we have to take many
samples of size n and determine the value of x̄ for each sample. Then,
looking at the various observed values of x̄, it might be possible to get an idea
of the nature of the distribution.
Sampling from Infinite Populations
We shall study the distribution of x̄ in two cases: one when the population is
finite and we are sampling without replacement, and the other when the
population is infinitely large or when the sampling is done with replacement.
We start with the latter.
We assume we have a population which is infinitely large, having a
population mean of μ and a population variance of σ². This implies that if x is
a random variable denoting the measurement of the characteristic that we are
interested in, on one element of the population picked up randomly, then

the expected value of x, E(x) = μ
and the variance of x, Var(x) = σ²
The sample mean, x̄, can be looked at as the sum of n random variables, viz.
x₁, x₂, …, xₙ, each divided by n. Here x₁ is a random variable
representing the first observed value in the sample, x₂ is a random variable
representing the second observed value and so on. Now, when the population
is infinitely large, whatever be the value of x₁, the distribution of x₂ is not
affected by it. This is true of any other pair of random variables as well. In
other words, x₁, x₂, …, xₙ are independent random variables, all picked
up from the same population.

∴ E(x₁) = μ and Var(x₁) = σ²
E(x₂) = μ and Var(x₂) = σ², and so on.
Finally,

E(x̄) = E[(x₁ + x₂ + ⋯ + xₙ)/n]
     = (1/n)E(x₁) + (1/n)E(x₂) + ⋯ + (1/n)E(xₙ)
     = (1/n)μ + (1/n)μ + ⋯ + (1/n)μ
     = μ

and

Var(x̄) = Var[(x₁ + x₂ + ⋯ + xₙ)/n]
        = Var(x₁/n) + Var(x₂/n) + ⋯ + Var(xₙ/n)
        = (1/n²)Var(x₁) + (1/n²)Var(x₂) + ⋯ + (1/n²)Var(xₙ)
        = (1/n²)σ² + (1/n²)σ² + ⋯ + (1/n²)σ²
        = σ²/n
We have arrived at two very important results for the case when the
population is infinitely large, which we shall be using very often. The first
says that the expected value of the sample mean is the same as the population
mean while the second says that the variance of the sample mean is the
variance of the population divided by the sample size.
If we take a large number of samples of size n, then the average value of the
sample means tends to be close to the true population mean. On the other
hand, if the sample size is increased then the variance of x̄ gets reduced and,
by selecting an appropriately large value of n, the variance of x̄ can be made
as small as desired.
The standard deviation of x̄ is also called the standard error of the mean.
Very often we estimate the population mean by the sample mean. The
standard error of the mean indicates the extent to which the observed value of
the sample mean can be away from the true value, due to sampling errors. For
example, if the standard error of the mean is small, we are reasonably
confident that whatever sample mean value we have observed cannot be very
far away from the true value.
The standard error of the mean is represented by σx̄.
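These two results are easy to verify empirically. The following sketch (the population parameters and sample counts are our own illustrative choices) draws many samples of size n from a normal population and checks that the observed mean and variance of the sample means are close to μ and σ²/n respectively.

```python
import random
import statistics

def sampling_distribution_of_mean(mu, sigma, n, samples=5000, seed=1):
    """Draw many samples of size n from N(mu, sigma^2) and
    collect their sample means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
            for _ in range(samples)]

mu, sigma, n = 50.0, 10.0, 25
means = sampling_distribution_of_mean(mu, sigma, n)

avg_of_means = statistics.fmean(means)      # should be close to mu
var_of_means = statistics.pvariance(means)  # should be close to sigma**2 / n
```

With these parameters the variance of the sample mean should settle near 100/25 = 4, illustrating how the spread of x̄ shrinks as n grows.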
Sampling With Replacement
The above results have been obtained under the assumption that the random
variables x₁, x₂, …, xₙ are independent. This assumption is valid when the
population is infinitely large. It is also valid when the sampling is done with
replacement, so that the population is back to the same form before the next
sample member is picked up. Hence, if the sampling is done with
replacement, we would again have
E(x̄) = μ
and Var(x̄) = σ²/n
i.e. σx̄ = σ/√n

When the sampling is done without replacement from a finite population of
size N, on the other hand, the standard error of the mean works out to

σx̄ = (σ/√n) · √((N − n)/(N − 1))
By comparing these expressions with the ones derived above, we find that the
standard error of x̄ is the same but further multiplied by a factor
√((N − n)/(N − 1)). This factor is, therefore, known as the finite population
multiplier.
In practice, almost all samples are picked up without replacement. Also,
most populations are finite, although they may be very large, and so the
standard error of the mean should theoretically be found by using the
expression given above. However, if the population size (N) is large and
consequently the sampling ratio (n/N) small, then the finite population
multiplier is close to 1 and is not used, thus treating large finite populations
as if they were infinitely large. For example, if N = 100,000 and n = 100, the
finite population multiplier

√((N − n)/(N − 1)) = √((100,000 − 100)/(100,000 − 1))
                   = √(99,900/99,999)
                   = 0.9995
174
Which is very close to 1 and the standard error of the mean would, for all Sampling
Distributions
practical purposes, be the same whether the population is treated as finite or
infinite. As a rule of that, the finite population multiplier may not be used if
the sampling ratio (n/N) is smaller than 0.05.
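The rule of thumb above is easy to encode. A minimal sketch (the function names are our own) computes the finite population multiplier and applies it to the standard error only when the sampling ratio is 0.05 or more:

```python
import math

def finite_population_multiplier(N, n):
    """sqrt((N - n) / (N - 1)): the correction factor for sampling
    without replacement from a finite population of size N."""
    return math.sqrt((N - n) / (N - 1))

def standard_error_of_mean(sigma, n, N=None):
    """sigma / sqrt(n), corrected by the finite population multiplier
    when N is given and the sampling ratio n/N is at least 0.05
    (the rule of thumb stated in the text)."""
    se = sigma / math.sqrt(n)
    if N is not None and n / N >= 0.05:
        se *= finite_population_multiplier(N, n)
    return se

# Reproducing the worked example: N = 100,000 and n = 100.
fpm = finite_population_multiplier(100_000, 100)   # about 0.9995
```

Because 100/100,000 = 0.001 is far below 0.05, `standard_error_of_mean` would simply return σ/√n here, treating the large finite population as infinite.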
Sampling from Normal Populations
We have seen earlier that the normal distribution occurs very frequently
among many natural phenomena. For example, heights or weights of
individuals, the weights of filled-cans from an automatic machine, the
hardness obtained by heat treatment, etc. are distributed normally.
We also know that the sum of two independent random variables will follow
a normal distribution if each of the two random variables belongs to a normal
population. The sample mean, as we have seen earlier, is the sum of n random
variables x₁, x₂, …, xₙ, each divided by n. Now, if each of these random
variables is from the same normal population, it is not difficult to see that x̄
would also be distributed normally.
Let x ~ N(μ, σ²) symbolically represent the fact that the random variable x is
distributed normally with mean μ and variance σ². What we have said in the
earlier paragraphs amounts to the following:

If x ~ N(μ, σ²), then it follows that x̄ ~ N(μ, σ²/n)
We first make use of the symmetry of the normal distribution and then
calculate the z value by subtracting the mean and then dividing it by the
standard deviation of the normally distributed random variable, viz. x̄. The
probability of interest is also shown as the shaded area in Figure I above.
Figure: Chi-square distributions with 1, 5 and 10 degrees of freedom
The chi-square distribution has only one parameter viz. the degrees of
freedom and so there are many chi-square distributions each with its own
degrees of freedom. In statistical tables, chi-square values for different areas
under the right tail and the left tail of various chi-square distributions are
tabulated.
If x₁, x₂, …, xₙ are independent random variables, each having a standard
normal distribution, then (x₁² + x₂² + ⋯ + xₙ²) will have a chi-square
distribution with n degrees of freedom.

If y₁ and y₂ are independent random variables having chi-square
distributions with γ₁ and γ₂ degrees of freedom, then (y₁ + y₂) will have a
chi-square distribution with (γ₁ + γ₂) degrees of freedom.
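The construction just stated can be checked numerically. The sketch below (function name, trial count and seed are our own) builds chi-square variates as sums of squared standard normals and verifies that their observed mean is close to the degrees of freedom, a property of the chi-square distribution:

```python
import random
import statistics

def chi_square_draws(df, draws=4000, seed=7):
    """Generate chi-square variates as sums of df squared N(0, 1)
    variables, mirroring the construction stated in the text."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0, 1) ** 2 for _ in range(df))
            for _ in range(draws)]

draws = chi_square_draws(df=5)
mean_est = statistics.fmean(draws)     # close to 5 (the degrees of freedom)
var_est = statistics.pvariance(draws)  # close to 10 (twice the d.f.)
```

The same routine also illustrates additivity: concatenating draws with 2 and 3 degrees of freedom, pairwise summed, would behave like draws with 5 degrees of freedom.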
178
We have stated some results above, without deriving them, to help us grasp Sampling
Distributions
the chi-square distribution intuitively. We shall state two more results in the
same spirit.
If y₁ and y₂ are independent random variables such that y₁ has a chi-square
distribution with γ₁ degrees of freedom and (y₁ + y₂) has a chi-square
distribution with γ > γ₁ degrees of freedom, then y₂ will have a chi-square
distribution with (γ − γ₁) degrees of freedom.
Now, if x₁, x₂, …, xₙ are n random variables from a normal population with
mean μ and variance σ²,

i.e. xᵢ ~ N(μ, σ²), i = 1, 2, …, n

it implies that (xᵢ − μ)/σ ~ N(0, 1)

and so ((xᵢ − μ)/σ)² will have a chi-square distribution with 1 degree of
freedom.

Hence, Σᵢ₌₁ⁿ ((xᵢ − μ)/σ)² will have a chi-square distribution with n degrees
of freedom.
We can break up this expression by measuring the deviations from x̄ in place
of μ. We will then have

Σᵢ₌₁ⁿ ((xᵢ − μ)/σ)² = (1/σ²) Σᵢ₌₁ⁿ [(xᵢ − x̄) + (x̄ − μ)]²

 = (1/σ²) Σᵢ₌₁ⁿ (xᵢ − x̄)² + (1/σ²) Σᵢ₌₁ⁿ (x̄ − μ)²
   + (2(x̄ − μ)/σ²) Σᵢ₌₁ⁿ (xᵢ − x̄)

 = (n − 1)s²/σ² + ((x̄ − μ)/(σ/√n))²,  since Σᵢ₌₁ⁿ (xᵢ − x̄) = 0
Now, we know that the left hand side of the above equation is a random
variable which has a chi-square distribution with n degrees of freedom. We
also know that

x̄ ~ N(μ, σ²/n)

∴ ((x̄ − μ)/(σ/√n))² will have a chi-square distribution with 1 degree of
freedom.
Hence, if the two terms on the right hand side of the above equation are
independent (which will be assumed as true here; you will have to refer to
advanced texts on statistics for the proof), then it follows that
(n − 1)s²/σ² has a chi-square distribution with (n − 1) degrees of freedom.
One degree of freedom is lost because the deviations are measured from x̄
and not from μ.
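This result, too, can be checked by simulation. The sketch below (parameters and trial counts are our own illustrative choices) computes (n − 1)s²/σ² over many normal samples and verifies that its observed mean is close to n − 1, as it should be for a chi-square variable with n − 1 degrees of freedom:

```python
import random
import statistics

def scaled_sample_variances(mu, sigma, n, samples=4000, seed=3):
    """Collect (n - 1) * s^2 / sigma^2 over many samples of size n
    from N(mu, sigma^2); theory says this quantity follows a
    chi-square distribution with n - 1 degrees of freedom."""
    rng = random.Random(seed)
    out = []
    for _ in range(samples):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        s2 = statistics.variance(xs)  # sample variance, n - 1 divisor
        out.append((n - 1) * s2 / sigma ** 2)
    return out

vals = scaled_sample_variances(mu=100.0, sigma=15.0, n=10)
mean_est = statistics.fmean(vals)  # close to 9 = n - 1
```

Note that `statistics.variance` divides by n − 1, matching the definition of s² used in the derivation above.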
Expected Value and Variance of s²
In practice, therefore, we work with the distribution of (n − 1)s²/σ² and not
with the distribution of s² directly. The mean of a chi-square distribution is
equal to its degrees of freedom and the variance is equal to twice the degrees
of freedom. This can be used to find the expected value and the variance of s².
Since (n − 1)s²/σ² has a chi-square distribution with (n − 1) degrees of
freedom,

E[(n − 1)s²/σ²] = n − 1

or ((n − 1)/σ²) · E(s²) = n − 1

∴ E(s²) = σ²

Also, Var[(n − 1)s²/σ²] = 2(n − 1)

∴ E[(s² − σ²)²] = 2σ⁴/(n − 1)

i.e. Var(s²) = 2σ⁴/(n − 1)
and Var(p̄) = p(1 − p)/n
Finally, if the sample size n is large enough, we can approximate the
binomial probability distribution by a normal distribution with the same mean
and variance. Thus, if n is sufficiently large,
p̄ ~ N(p, p(1 − p)/n)
This approximation works quite well if n is sufficiently large so that both np
and n(1- p) are at least as large as 5.
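The quality of this normal approximation can be inspected directly. The sketch below (function name, sample counts and seed are our own) generates the sampling distribution of p̄ for p = 0.45 and n = 100, checks the np ≥ 5 and n(1 − p) ≥ 5 rule of thumb, and compares the observed mean and standard deviation with p and √(p(1 − p)/n):

```python
import math
import random
import statistics

def sample_proportions(p, n, samples=5000, seed=11):
    """Simulate the sampling distribution of the sample proportion."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n
            for _ in range(samples)]

p, n = 0.45, 100
# Rule of thumb from the text for using the normal approximation:
approximation_ok = (n * p >= 5) and (n * (1 - p) >= 5)

props = sample_proportions(p, n)
mean_est = statistics.fmean(props)            # close to p
sd_est = statistics.pstdev(props)             # close to sqrt(p*(1-p)/n)
sd_theory = math.sqrt(p * (1 - p) / n)
```

For these values the theoretical standard error is about 0.05, and the simulated distribution of p̄ should track N(p, p(1 − p)/n) closely.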
Activity D
A population is normally distributed with a mean of 100. A sample of size 15
is picked up at random from the population. If we know from t tables that

P(t₁₄ ≥ 1.761) = 0.05

where t₁₄ represents a t variable with 14 degrees of freedom, calculate
P(x̄ ≥ 115) if we know that the sample standard deviation is 33.
Activity E
In a Board examination this year, 85% of the students who appeared for the
examination passed. 100 students appeared in the same examination from
School Q. What is the probability that 90 or more of these students passed?
…………………………………………………………………………….....
…………………………………………………………………………….....
…………………………………………………………………………….....
…………………………………………………………………………….....
…………………………………………………………………………….....
The standard error of the mean can be easily calculated as

σx̄ = σ/√n = 0.2/√25 = 0.04 Kg
Figure IV: Distribution of �
The lower limit = x̄ − 1.645 σx̄ = 49.7 − 1.645 × 0.04 = 49.6342
and the upper limit = x̄ + 1.645 σx̄ = 49.7 + 1.645 × 0.04 = 49.7658
Therefore, we can state with 90% confidence level that the mean weight of
cement in a filled bag lies between 49.6342 Kg and 49.7658 Kg.
We can use the above approach when the population standard deviation is
known or when the sample size is large (n > 30), in which case the sample
standard deviation can be used as an estimate of the population standard
deviation. However, if the sample size is not large, as in the example above,
then one has to use the t distribution in place of the standard normal
distribution to calculate the probabilities.
Let us assume that we are interested in developing a 90% confidence interval
in the same situation as described earlier with the difference that the
population standard deviation is now not known. However, the sample
standard deviation has been calculated and is known to be 0.2 Kg.
Since the sample size n = 25, we know that (x̄ − μ)/(s/√n) follows a t
distribution with 24 degrees of freedom. From t tables, we can see that the
probability that a t statistic with 24 degrees of freedom lies between −1.711
and +1.711 is 0.90, i.e. the probability that x̄ lies between μ − 1.711 s/√n
and μ + 1.711 s/√n is 0.90. This is shown in Figure V below.
Figure V: Area under a t distribution
The lower limit = x̄ − 1.711 s/√n = 49.7 − 1.711 × 0.2/√25 = 49.6316

and the upper limit = x̄ + 1.711 s/√n = 49.7 + 1.711 × 0.2/√25 = 49.7684
In this case, we can state with 90% confidence level that the mean weight of
cement in a filled bag lies between 49.6316 Kg and 49.7684 Kg.
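The interval above is simple to reproduce in code. In this sketch the tabulated t value 1.711 for 24 degrees of freedom is taken from the text; the function name is our own:

```python
import math

# t value for 24 degrees of freedom, two-sided 90% confidence,
# as read from t tables in the example above.
T_24_90 = 1.711

def t_confidence_interval(x_bar, s, n, t_value):
    """Two-sided confidence interval for the population mean when
    the population standard deviation is unknown."""
    margin = t_value * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

low, high = t_confidence_interval(x_bar=49.7, s=0.2, n=25, t_value=T_24_90)
# low is about 49.6316 and high about 49.7684, the limits derived above
```

In practice the tabulated value would be looked up (or computed, e.g. via a statistics library) for whichever degrees of freedom and confidence level apply.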
and so n = (1.645 × 0.2/0.05)²
         = 43.3
We must have a sample size of at least 44 so that the mean weight of cement
in a filled bag can be estimated within plus or minus 0.05 Kg of the true
value with a 90% confidence level.
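The calculation can be packaged as a small helper. This sketch (the function name is our own) rounds up to the smallest n for which z·σ/√n does not exceed the desired margin:

```python
import math

def sample_size_for_mean(sigma, margin, z_value):
    """Smallest sample size n such that z * sigma / sqrt(n) <= margin."""
    return math.ceil((z_value * sigma / margin) ** 2)

# 90% confidence (z = 1.645), sigma = 0.2 Kg, margin = 0.05 Kg,
# as in the cement-bag example above.
n = sample_size_for_mean(sigma=0.2, margin=0.05, z_value=1.645)
```

Rounding 43.3 up gives n = 44, matching the conclusion above.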
It is to be noted that this approach does not work if the population standard
deviation is not known because the sample standard deviation is known only
after the sample has been analysed whereas the sample size decision is
required before the sample is picked up.
Sample Size for Estimating Population Proportion
Suppose we want to estimate the proportion of consumers in the population
who prefer our product to the next competing brand. How large a sample
should be taken so that the population proportion can be estimated within
plus or minus 0.05 with a 90% confidence level?
We shall use the sample proportion p̄ to estimate the population proportion p.
If n is sufficiently large, the distribution of p̄ can be approximated by a
normal distribution with mean p and variance p(1 − p)/n (let q = 1 − p).
From normal tables, we can now say that the probability that p̄ will lie
between (p − 1.645√(pq/n)) and (p + 1.645√(pq/n)) is 0.90. In other words,
the interval (p̄ − 1.645√(pq/n)) to (p̄ + 1.645√(pq/n)) will contain p 90% of
the time.

We also want that the interval (p̄ − 0.05) to (p̄ + 0.05) should contain p 90%
of the time.
Therefore, 1.645 √(p(1 − p)/n) = 0.05

or √(p(1 − p)/n) = 0.05/1.645 = 0.0304

or p(1 − p)/n = 0.0009239

∴ n = p(1 − p)/0.0009239
But we do not know the value of p, so n cannot be calculated directly.
However, whatever be the value of p, the highest possible value of the
expression p(1 − p) is 0.25, which occurs when p = 0.5. Hence, in the worst
case,

n = 0.25/0.0009239 = 270.6
Therefore, if we take a sample of size 271, then we are sure that our estimate
of the population proportion will be within plus or minus 0.05 of the true
value with a confidence level of 90%, whatever be the value of p.
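The worst-case argument translates directly into code. In this sketch (the function name is our own), p defaults to 0.5 because p(1 − p) is then at its maximum of 0.25:

```python
import math

def sample_size_for_proportion(margin, z_value, p=0.5):
    """Smallest n such that z * sqrt(p*(1-p)/n) <= margin.
    p = 0.5 gives the conservative worst case, since p*(1-p)
    is maximised at 0.25 there."""
    return math.ceil(p * (1 - p) * (z_value / margin) ** 2)

# 90% confidence (z = 1.645), margin 0.05, worst case over p,
# as in the example above.
n = sample_size_for_proportion(margin=0.05, z_value=1.645)
```

Rounding 270.6 up gives n = 271, as concluded above; if a rough prior estimate of p were available, passing it in place of 0.5 would give a smaller required sample.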
Activity F
100 Sodium Vapour Lamps were tested to estimate the life of such a lamp.
The life of these 100 lamps exhibited a mean of 10,000 hours with a standard
deviation of 500 hours. Construct a 90% confidence interval for the true
mean life of a Sodium Vapour Lamp.
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………
Activity G
If the sample size in the previous situation had been 15 in place of 100, what
would the confidence interval be?
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………
Activity H
We want to estimate the proportion of employees who prefer the codification
of rules and regulations. What should be the sample size if we want our
estimate to be within plus or minus 0.05 with a 95% confidence level?
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………………………
10.9 SUMMARY
We have introduced the concept of sampling distributions in this unit. We
have discussed the sampling distributions of some commonly used statistics
and also shown some applications of the same.
A sampling distribution of a sample statistic has been introduced as the
probability distribution or the probability density function of the sample
statistic. In the sampling distribution of the mean, we find that if the
population distribution is normal, the sample mean is also distributed
normally with the same mean but with a smaller standard deviation. In fact,
the standard deviation of the sample mean, also known as the standard error
of the mean, is found to be equal to the population standard deviation divided
by the square root of the sample size.
We have also presented a very important result called the central limit
theorem which assures us that if the sample size is large enough (greater than
30), the sampling distribution of the mean could be approximated by a
corresponding normal distribution with the mean and standard deviation as
given in the preceding paragraph.
We have then explored the sampling distribution of the variance and found
that a related quantity, viz. (n − 1)s²/σ², has a chi-square distribution with
(n − 1) degrees of freedom. We have learnt that the chi-square distribution is
tabulated extensively and so any probability calculations regarding s² can
easily be made by referring to the tables for the chi-square distribution.
We have introduced one more distribution viz. the t distribution which is
found to be applicable when the sampling distribution of the mean is of
interest, but the population standard deviation is unknown. It is noticed that if
the sample size is large enough (n>30), the t distribution is actually very
close to the standard normal distribution.
We have also studied the sampling distribution of the proportion and then
looked at two applications of the sampling distributions. One is in developing
an interval estimate for a population parameter with a given confidence level,
which is conceptualised as the probability that a random interval will contain
the true value of the parameter. The second application is to determine the
sample size required while estimating the population mean or the population
proportion.