Chapter 2 Sampling and Sampling Distribution
SAMPLING THEORY
Sampling is the process of learning about a population on the basis of a sample drawn
from it. Instead of studying every unit of the population, only part of the population is
studied, and conclusions about the entire population are drawn on that basis. The process
of sampling involves three elements: selecting the sample, collecting the information, and
making an inference about the population.
Census: The process of gathering data from every element in the population.
Sampling Frame: The list of all possible units in the reference population.
Sampling Error: The difference between a sample statistic and the corresponding population
parameter.
Sampling Unit: An element of the population to be sampled, i.e. the unit of selection in the
sampling process.
Sample Design: The set of procedures for selecting the sample elements from the
population.
Cost/Economy
The unit cost of collecting data in a census is usually lower than in a sample survey.
However, because of the larger number of items in the population, the total cost of a
census is significantly higher than that of a sample. Suppose it costs Birr 200 per unit to
make a census of 10,000,000 individuals, while the unit cost of sampling 5,000 individuals
is Birr 1,000. The total cost of the census is 10,000,000 x 200 = Birr 2,000,000,000, but
that of the sample is only 5,000 x 1,000 = Birr 5,000,000.
Timeliness
Because of the larger size of the population, the total time involved in a census is
significantly higher than that of sampling (i.e., a sample may provide the necessary
information quickly).
Destructive Nature of Many Tests
Many tests destroy the item being tested, so information can be collected from only part
of the population. Examples: a blood test for a patient, the life in hours of a tube light,
the strength of wires, etc.
Accuracy
Non-sampling error in a census tends to be higher than in a sample survey, because less
qualified investigators are involved in a census and its supervision, monitoring and
quality control mechanisms may be poor. The higher the degree of non-sampling error, the
less reliable the result.
SAMPLING METHODS
There are two principal methods of drawing a sample from a population: Probability
sampling and Non-probability sampling.
1) Probability Sampling
In probability sampling, every unit in the population has a known, non-zero chance of
being selected into the sample (an equal chance in the case of simple random sampling).
Selection does not depend on human judgment.
i. Simple Random Sampling
There are two methods to select a simple random sample:
Lottery method: In this method, each population item is numbered 1 to N on identical
cards or slips (same size, shape and color). The numbered cards are placed in a bowl and
mixed thoroughly, and as many cards as needed are drawn blindfolded. The subjects
whose numbers are selected constitute the sample. Since it is difficult to mix the cards
thoroughly, there is a chance of obtaining a biased sample, so another method of
selecting sample elements is needed.
Random number method: Because of this problem with the lottery method, statisticians use
the random number method, in which the numbers are taken from a table of random numbers
or generated by a computer. With a random number table, the steps are:
a) Assign a unique number to each population element in the sampling frame. Start
with serial number 1, or 01, or 001, etc. depending on the number of digits
required.
b) Choose a random starting position by closing your eyes (blind fold selection) and
placing your finger on a number in the table.
c) Select serial numbers across rows or down columns or diagonally from the
starting point.
d) Discard numbers that are not assigned to any population element and ignore
numbers that have already been selected.
e) Repeat the selection process until the required number of sample elements is
selected.
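The selection procedure above can also be carried out by a computer. The following is a minimal Python sketch, assuming the sampling frame is simply the serial numbers 1 to 100; the function name `simple_random_sample` and the frame size are illustrative assumptions, not part of the original notes.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the sampling frame, each equally likely."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    # sample() draws without replacement, so no unit is selected twice
    return rng.sample(frame, n)

# Hypothetical sampling frame: units numbered 1..100
frame = list(range(1, 101))
sample = simple_random_sample(frame, 10, seed=42)
print(sorted(sample))
```

Drawing without replacement plays the role of step d): a serial number that has already been selected cannot be selected again.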
Advantage of simple random sampling
It ensures that the sample is unbiased.
Disadvantages of simple random sampling
It requires a sampling frame, which is sometimes impossible to obtain (for example, for
a fish population).
If the population is very large, it is tedious and time consuming to number the units
and select the sample.
Minority subgroups of the population may not be represented in the sample.
ii. Stratified Sampling
In stratified sampling, a population is first divided into subgroups, called strata (singular stratum),
and a sample is selected from each stratum based on simple random or systematic sampling
method. The strata are made according to various homogeneous characteristics such as sex, race,
region or institutional affiliation such as faculty. Stratified sampling is applied if the population is
heterogeneous.
Stratified random sampling method is a three-step process:
Step 1- Divide the population into homogeneous, mutually exclusive and collectively
exhaustive groups or strata using some stratification variable (e.g. income level, sex,
education level, etc.);
Step 2- Select an independent simple random sample from each stratum;
Step 3- Form the final sample by consolidating all sample elements chosen in step 2.
Example: To select a proportionate stratified sample of 20 households from Addis Ababa that
belong to three income groups: low (50), middle (30) and high (20) (N=50+30+20=100).
Sub-divide the households into three homogeneous sub-groups or strata by income group:
low, middle and high.
Calculate the overall sampling fraction, f, as f = n/N = 20/100 = 0.2, where n = sample
size and N = population size.
Apply f to each stratum: n1 = 0.2*50 = 10, n2 = 0.2*30 = 6 and n3 = 0.2*20 = 4.
Thus, n = n1+n2+n3 = 10+6+4 = 20.
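The allocation in this example can be checked with a few lines of Python. This is only a sketch: the function name `proportionate_allocation` and the dictionary layout are assumptions made for illustration.

```python
def proportionate_allocation(strata_sizes, n):
    """Allocate a total sample of size n to strata in proportion to their sizes."""
    N = sum(strata_sizes.values())   # population size
    f = n / N                        # overall sampling fraction
    # Rounding is needed because f * size is not always a whole number
    return {name: round(f * size) for name, size in strata_sizes.items()}

# Stratum sizes from the household example
strata = {"low": 50, "middle": 30, "high": 20}
allocation = proportionate_allocation(strata, 20)
print(allocation)
```

With other stratum sizes the rounded allocations may not add up exactly to n, in which case one stratum has to be adjusted by hand.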
If the stratification variables are complex or ambiguous (such as beliefs, attitudes, etc.),
it is difficult to separate individuals into sub-groups according to these variables.
iii. Systematic Sampling
In systematic sampling only one random number is needed throughout the entire
sampling process. Elements of the population will be arranged in some order and the
elements to be included in the sample will be selected at a constant interval.
To use systematic sampling, a researcher needs:
a. A sampling frame of the population;
b. A skip interval (k), calculated as k = N/n, where N is the population size and n is
the sample size.
Example: Suppose there are 2000 subjects in the population and a sample of 50 subjects
is needed. The sampling interval is k = 2000/50 = 40. Select the starting point, say 'x',
from 1 through 40 using simple random sampling, and then include every 40th element
starting from 'x'.
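The example can be sketched in Python as follows. The frame of serial numbers 1 to 2000 and the function name `systematic_sample` are illustrative assumptions, and the sketch assumes the population size is a multiple of the sample size.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Select every k-th unit of the frame after a random start in 1..k."""
    k = len(frame) // n              # skip interval
    rng = random.Random(seed)
    start = rng.randint(1, k)        # random starting position, 1..k inclusive
    # Units are numbered from 1, list indices from 0, hence start - 1
    return frame[start - 1::k][:n]

# Hypothetical frame: subjects numbered 1..2000, sample size 50, so k = 40
frame = list(range(1, 2001))
sample = systematic_sample(frame, 50, seed=1)
print(sample[:5])
```

Only one random number (the starting point) is used; every other selection follows from the constant interval.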
2) Non-Probability Sampling
In the case of non-probability sampling, not every unit in the population has a chance of
being included in the sample. It involves at least some degree of personal subjectivity
instead of following predetermined, probabilistic rules for selection.
i. Convenience Sampling
Convenience sampling means the sample is drawn at the convenience of the researcher. It is
common in exploratory research, but it does not support conclusions about the population.
iii. Quota sampling
Example 2.5: Suppose we know that 54% of the adults in a community are females, and the study
requires 100 respondents as a sample. In quota sampling, we might interview the first 54 females
and the first 46 males.
The following factors should be considered while deciding the sample size:
i. The size of the population: the larger the size of the population, the bigger should be
the sample size.
ii. The resource available: if the resources available are vast a large sample size could be
taken. However, in most cases resources constitute a big constraint on sample size.
iii. The degree of accuracy or precision desired: the greater the degree of accuracy
desired, the larger should be the sample size. However, it does not necessarily mean
that bigger samples always ensure greater accuracy.
iv. Homogeneity or heterogeneity of the population: If the population consists of
homogeneous units a small sample may serve the purpose, but if the population
consists of heterogeneous units a large sample may be inevitable.
v. Nature of study: For an intensive and continuous study a small sample may be
suitable. But for studies which are not likely to be repeated and are quite extensive in
nature, it may be necessary to take a large sample size.
vi. Method of sampling adopted: The size of the sample is also influenced by the type of
sampling plan adopted. For example, a simple random sample may require a bigger
sample size, whereas in a properly drawn stratified sampling plan even a small
sample may give better results.
vii. Nature of respondent: Where it is expected that a large number of respondents will not
cooperate and send back the questionnaire, a large sample should be selected.
SAMPLING DISTRIBUTION
NOTE: The normal probability distribution is used to determine probabilities for
normally distributed individual measurements, given the mean and the standard
deviation. Symbolically, the variable is the measurement X, with population mean µ
and population standard deviation σ. In contrast to such distributions of individual
measurements, a sampling distribution is a probability distribution for the possible values
of a sample statistic.
Population distribution: The distribution of the measured values of the members of the
population, with mean μ, variance σ² and standard deviation σ. The population standard
deviation describes the variation among the values of members of the population, whereas
the standard deviation of a sampling distribution measures the variability, due to sampling
error, among the values of a sample statistic such as the mean or the proportion.
NB: The sampling distribution of the mean is not the sample distribution, which is the
distribution of the measured values of X in one random sample. Rather, the sampling
distribution of the mean is the probability distribution of $\bar{X}$, the sample mean.
For any given sample size n taken from a population with mean µ and standard deviation
σ, the value of the sample mean $\bar{X}$ would vary from sample to sample if several
random samples were obtained from the population. This variability serves as the basis
for the sampling distribution.
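The sampling distribution of the mean is described by two parameters: the expected value
$\mu_{\bar{X}} = \mu$ and the standard error $\sigma_{\bar{X}} = \sigma/\sqrt{n}$.
When sampling without replacement from a finite population of size N with n > 0.05N, the
standard error is
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$$
The expression $\sqrt{(N-n)/(N-1)}$ is called the finite population correction factor (finite
population multiplier). In the calculation of the standard error of the mean, if the population
standard deviation σ is not known, the sample standard deviation S is used in its place:
$$\sigma_{\bar{X}} = \frac{S}{\sqrt{n}} \quad \text{or} \quad \sigma_{\bar{X}} = \frac{S}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$$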
3. A sample of size n ≥ 30 is generally considered a large sample for statistical
analysis, whereas a sample of size n < 30 is considered a small sample. The
sampling distribution of the mean is approximately normal for sufficiently large sample
sizes (n ≥ 30).
4. When the population standard deviation σ is not known, the sample standard deviation s,
which closely approximates σ, is used to compute the standard error, i.e.
$\sigma_{\bar{X}} = s/\sqrt{n}$.
Example 1. A population consists of the following ages: 10, 20, 30, 40, and 50.
A random sample of three is to be selected from this population and mean
computed. Develop the sampling distribution of the mean.
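Solution: The number of simple random samples of size n that can be drawn without
replacement from a population of size N is ${}_N C_n = \frac{N!}{n!(N-n)!}$. With N = 5 and
n = 3, ${}_5 C_3$ = 10 samples can be drawn from the population as:

Sample   Values        Sample mean
1        10, 20, 30    20.00
2        10, 20, 40    23.33
3        10, 20, 50    26.67
4        10, 30, 40    26.67
5        10, 30, 50    30.00
6        10, 40, 50    33.33
7        20, 30, 40    30.00
8        20, 30, 50    33.33
9        20, 40, 50    36.67
10       30, 40, 50    40.00
TOTAL                  300.00

Sample mean   Frequency   Probability
20.00         1           0.1
23.33         1           0.1
26.67         2           0.2
30.00         2           0.2
33.33         2           0.2
36.67         1           0.1
40.00         1           0.1
TOTAL         10          1.0

Columns 1 and 2 show the frequency distribution of the sample means.
Columns 1 and 3 show the sampling distribution of the mean.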
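The sampling distribution in Example 1 can be reproduced by listing every possible sample. The sketch below uses only the Python standard library; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

population = [10, 20, 30, 40, 50]
n = 3

# All samples of size n drawn without replacement (order ignored)
samples = list(combinations(population, n))
# Exact sample means, kept as fractions to avoid rounding issues
means = [Fraction(sum(s), n) for s in samples]

freq = Counter(means)
dist = {m: Fraction(c, len(samples)) for m, c in sorted(freq.items())}

for m, p in dist.items():
    print(f"{float(m):6.2f}  {freq[m]}  {float(p):.1f}")
```

Each printed row gives a sample mean, its frequency among the 10 samples, and its probability.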
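The mean of the sampling distribution equals the population mean:
$$\mu = \frac{\sum X}{N} = \frac{150}{5} = 30, \qquad \mu_{\bar{X}} = \frac{\sum \bar{X}}{{}_N C_n} = \frac{300}{10} = 30$$
Regardless of the sample size, $\mu_{\bar{X}} = \mu$.
The population standard deviation is
$$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}} = \sqrt{\frac{1000}{5}} = 14.142$$
and the standard error of the mean is
$$\sigma_{\bar{X}} = \sqrt{\frac{\sum (\bar{X}_i - \mu_{\bar{X}})^2}{{}_N C_n}} = \sqrt{\frac{333.33}{10}} = 5.774$$
Since averaging reduces variability, $\sigma_{\bar{X}} < \sigma$ except in the cases where σ = 0 or n = 1.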
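As a check on Example 1, the standard error obtained by enumerating all samples can be compared with the finite population formula $\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$. This is a minimal sketch; the variable names are assumptions made for illustration.

```python
import math
from itertools import combinations

population = [10, 20, 30, 40, 50]
N, n = len(population), 3

mu = sum(population) / N
# Population standard deviation (divide by N, not N - 1)
sigma = math.sqrt(sum((x - mu) ** 2 for x in population) / N)

# Standard error from the enumerated sample means
means = [sum(s) / n for s in combinations(population, n)]
se_enumerated = math.sqrt(sum((m - mu) ** 2 for m in means) / len(means))

# Standard error from the formula with the finite population correction
se_formula = sigma / math.sqrt(n) * math.sqrt((N - n) / (N - 1))

print(round(sigma, 3), round(se_enumerated, 3), round(se_formula, 3))
```

The two standard errors agree, which illustrates why the correction factor is needed when the sample is a large fraction of a finite population (here n/N = 0.6).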
Central Limit Theorem and the Sampling Distribution of the Mean
The Central Limit Theorem (CLT) describes the relationship between the shape of the
population distribution and the shape of the sampling distribution of the mean: as the
sample size becomes large, the sampling distribution of the mean approaches a normal
distribution, whatever the shape of the population distribution.
The significance of the Central Limit Theorem is that it permits us to use sample statistics
to make inferences about population parameters without knowing anything about the
shape of the frequency distribution of that population other than what we can get from the
sample. It also permits us to use the normal distribution curve for analyzing distributions
whose shape is unknown. It creates the potential for applying the normal distribution to
many problems when the sample is sufficiently large. To find how far a particular sample
mean $\bar{X}$ lies from the mean of the sampling distribution $\mu_{\bar{X}}$, the value of $\bar{X}$ is first
converted into a value Z on the standard normal distribution by using the formula:
$$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}, \quad \text{because } \mu_{\bar{X}} = \mu$$
If the population is finite and samples of fixed size n are drawn without replacement, then
the standard error of the sampling distribution of the mean is modified to account for the
population shrinking as units are drawn:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$$
Example 2: The mean life of a certain tool is 41.5 hours with a standard deviation of 2.5
hours. What is the probability that a simple random sample of size 50 drawn from this
population will have a mean between 40.5 hours and 42 hours?
Solution: P(40.5 ≤ $\bar{X}$ ≤ 42.0) = ?
$$\mu_{\bar{X}} = \mu = 41.5, \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{2.5}{\sqrt{50}} = \frac{2.5}{7.0711} = 0.3536$$
The population distribution is unknown, but the sample size n = 50 is large enough to apply
the central limit theorem. Hence the normal distribution can be used to find the required
probability.
$$P(40.5 \le \bar{X} \le 42) = P\left(\frac{\bar{X}_1 - \mu}{\sigma_{\bar{X}}} \le Z \le \frac{\bar{X}_2 - \mu}{\sigma_{\bar{X}}}\right) = P\left(\frac{40.5 - 41.5}{0.3536} \le Z \le \frac{42 - 41.5}{0.3536}\right)$$
$$= P(-2.8281 \le Z \le 1.4140) = P(-2.8281 \le Z \le 0) + P(0 \le Z \le 1.4140) = 0.4977 + 0.4207 = 0.9184$$
Thus 0.9184 is the probability that the mean life of the sampled tools lies between 40.5
and 42 hours.
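The probability in Example 2 can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library instead of a printed normal table, so the result differs slightly from 0.9184 (roughly 0.919) because the table values are rounded to two decimal places of Z.

```python
import math
from statistics import NormalDist

mu, sigma, n = 41.5, 2.5, 50

# Standard error of the mean
se = sigma / math.sqrt(n)

# Sampling distribution of the mean, approximated as normal by the CLT
sampling_dist = NormalDist(mu, se)
prob = sampling_dist.cdf(42.0) - sampling_dist.cdf(40.5)

print(round(se, 4), round(prob, 4))
```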
Solution:
A. P($\bar{X}$ ≥ 900) = ?
Given: $\mu_{\bar{X}} = \mu$ = 800 gms, σ = 300 gms, n = 16
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{300}{\sqrt{16}} = \frac{300}{4} = 75$$
$$P(\bar{X} \ge 900) = P\left(Z \ge \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}}\right) = P\left(Z \ge \frac{900 - 800}{75}\right) = P(Z \ge 1.33) = 0.5000 - 0.4082 = 0.0918$$
B. Since Z = ±1.96 bounds the middle 95% of the area under the normal curve, solving the
formula for Z for the values of $\bar{X}$ in terms of the known values gives:
$$\bar{X}_1 = \mu_{\bar{X}} - Z\sigma_{\bar{X}} = 800 - 1.96(75) = 653 \text{ gms}, \qquad \bar{X}_2 = \mu_{\bar{X}} + Z\sigma_{\bar{X}} = 800 + 1.96(75) = 947 \text{ gms}$$
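Parts A and B of this solution can be verified with a short Python sketch based on `statistics.NormalDist`. The variable names are illustrative, and the tail probability differs from 0.0918 in the fourth decimal place because the table uses Z rounded to 1.33.

```python
import math
from statistics import NormalDist

mu, sigma, n = 800, 300, 16
se = sigma / math.sqrt(n)            # standard error of the mean

z = NormalDist()                     # standard normal distribution

# Part A: probability that the sample mean is at least 900 gms
p_at_least_900 = 1 - z.cdf((900 - mu) / se)

# Part B: limits that enclose the middle 95% of sample means
z95 = z.inv_cdf(0.975)               # about 1.96
lower, upper = mu - z95 * se, mu + z95 * se

print(se, round(p_at_least_900, 4), round(lower), round(upper))
```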
The sample proportion is
$$\bar{P} = \frac{\text{number of successes, } X}{\text{sample size, } n}$$
With the same logic as for the sampling distribution of the mean, the sampling distribution of
sample proportions has mean $\mu_{\bar{P}}$ and standard deviation (also called standard error)
$\sigma_{\bar{P}}$ given by:
$$\mu_{\bar{P}} = P \quad \text{and} \quad \sigma_{\bar{P}} = \sqrt{\frac{Pq}{n}} = \sqrt{\frac{P(1-P)}{n}}$$
If both of the following conditions hold:
A. nP ≥ 5
B. nq ≥ 5
then the sampling distribution of proportions is very closely normally distributed. It may
be noted that the sampling distribution of the proportion actually follows a binomial
distribution, because the population is binomially distributed.
For a finite population in which sampling is done without replacement we have:
$$\mu_{\bar{P}} = P \quad \text{and} \quad \sigma_{\bar{P}} = \sqrt{\frac{Pq}{n}} \sqrt{\frac{N-n}{N-1}}$$
Under the same guidelines as in the previous sections, for a large sample size (n ≥ 30)
the sampling distribution of the proportion is closely approximated by a normal distribution
with the mean and standard deviation stated above. Hence, to standardize the sample
proportion $\bar{P}$, the standard normal variable is
$$Z = \frac{\bar{P} - \mu_{\bar{P}}}{\sigma_{\bar{P}}} = \frac{\bar{P} - P}{\sqrt{Pq/n}}$$
Example 3. A few years back, a policy was introduced to give loans to
unemployed engineers to start their own business. Out of 1,000,000
engineers, 600,000 accepted the policy and got the loan. A sample of 100
unemployed engineers was taken at the time of allotment of the loans. What
is the probability that the sample proportion would have exceeded 50%
acceptance?
Solution:
Given: $\mu_{\bar{P}}$ = P = 0.60, N = 1,000,000, n = 100; P($\bar{P}$ ≥ 0.5) = ?
$$\sigma_{\bar{P}} = \sqrt{\frac{(0.60)(0.40)}{100}} = 0.0490$$
$$P(\bar{P} \ge 0.5) = P\left(Z \ge \frac{\bar{P} - \mu_{\bar{P}}}{\sigma_{\bar{P}}}\right) = P\left(Z \ge \frac{0.50 - 0.60}{0.0490}\right) = P(Z \ge -2.04) = 0.4793 + 0.5000 = 0.9793$$
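Example 3 can be checked with the following Python sketch. It also computes the finite population correction factor to show that, with n = 100 out of N = 1,000,000, the correction is negligible; the variable names are illustrative assumptions.

```python
import math
from statistics import NormalDist

P, n, N = 0.60, 100, 1_000_000

# Standard error of the sample proportion
se = math.sqrt(P * (1 - P) / n)

# Finite population correction: very close to 1 because n is tiny relative to N
fpc = math.sqrt((N - n) / (N - 1))

# Probability that the sample proportion is at least 0.50
prob = 1 - NormalDist(P, se).cdf(0.50)

print(round(se, 4), round(fpc, 5), round(prob, 4))
```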
Given: $\mu_{\bar{P}}$ = P = 0.40, n = 200
$$\sigma_{\bar{P}} = \sqrt{\frac{(0.4)(0.6)}{200}} = 0.0346$$
$$P(-0.03 \le \bar{P} - P \le 0.03) = 2P\left(0 \le Z \le \frac{0.03}{0.0346}\right) = 2P(0 \le Z \le 0.87) = 2 \times 0.3078 = 0.6156$$
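Given: $\mu_{\bar{P}}$ = P = 0.03, $\bar{P}_1$ = 0.02, $\bar{P}_2$ = 0.035, n = 300
$$\sigma_{\bar{P}} = \sqrt{\frac{(0.03)(0.97)}{300}} = 0.0098$$
$$P(0.02 \le \bar{P} \le 0.035) = P\left(\frac{\bar{P}_1 - P}{\sigma_{\bar{P}}} \le Z \le \frac{\bar{P}_2 - P}{\sigma_{\bar{P}}}\right) = P\left(\frac{0.02 - 0.03}{0.0098} \le Z \le \frac{0.035 - 0.03}{0.0098}\right)$$
$$= P(-1.02 \le Z \le 0.51) = P(-1.02 \le Z \le 0) + P(0 \le Z \le 0.51) = 0.3461 + 0.1950 = 0.5411$$
Hence the probability that the proportion of defectives will lie between 0.02 and
0.035 is 0.5411.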
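The defective-proportion calculation above can be checked with a short Python sketch. It evaluates the normal probabilities directly rather than from a table, so the result (about 0.54) differs from 0.5411 in the third decimal place because the table uses rounded Z values.

```python
import math
from statistics import NormalDist

P, n = 0.03, 300

# Standard error of the sample proportion
se = math.sqrt(P * (1 - P) / n)

# Normal approximation to the sampling distribution of the proportion
dist = NormalDist(P, se)
prob = dist.cdf(0.035) - dist.cdf(0.02)

print(round(se, 4), round(prob, 4))
```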
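Let $\bar{X}_1$ and $\bar{X}_2$ be the means of the sampling distributions of the mean for two
populations, respectively. Then the difference between their mean values $\mu_1$ and $\mu_2$ can be
estimated by generalizing the formula of the standard normal variable as follows:
$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_{\bar{X}_1} - \mu_{\bar{X}_2})}{\sigma_{(\bar{X}_1 - \bar{X}_2)}} = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sigma_{(\bar{X}_1 - \bar{X}_2)}}$$
where
$$\sigma_{(\bar{X}_1 - \bar{X}_2)} = \sqrt{\sigma_{\bar{X}_1}^2 + \sigma_{\bar{X}_2}^2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
is the standard error of the sampling distribution of the difference of two means, and
$n_1$ and $n_2$ are the sizes of the independent random samples drawn from the first and second
populations, respectively.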
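Example: Car stereos of manufacturer A have a mean lifetime of 1,400 hours with a standard
deviation of 200 hours, while those of manufacturer B have a mean lifetime of 1,200 hours with a
standard deviation of 100 hours. If a random sample of 125 stereos of each manufacturer is tested,
what is the probability that manufacturer A's stereos will have a mean lifetime which is at least
a) 160 hours and b) 250 hours more than that of manufacturer B's stereos?
a) Here $\mu_1 - \mu_2$ = 1,400 - 1,200 = 200 and
$$\sigma_{(\bar{X}_1 - \bar{X}_2)} = \sqrt{\frac{(200)^2}{125} + \frac{(100)^2}{125}} = \sqrt{400} = 20$$
$$P(\bar{X}_1 - \bar{X}_2 \ge 160) = P\left(Z \ge \frac{160 - 200}{20}\right) = P(Z \ge -2) = 0.5000 + 0.4772 = 0.9772$$
Hence, the probability is very high that the mean lifetime of the stereos of A is at least
160 hours more than that of B.
b) Proceeding in the same manner as in part a):
$$P(\bar{X}_1 - \bar{X}_2 \ge 250) = P\left(Z \ge \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sigma_{(\bar{X}_1 - \bar{X}_2)}}\right) = P\left(Z \ge \frac{250 - 200}{20}\right)$$
$$= P(Z \ge 2.5) = 0.5000 - 0.4938 = 0.0062 \quad \text{(area under normal curve)}$$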
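The car stereo example can be checked with the Python sketch below, which evaluates the probabilities that the difference in sample means is at least 160 hours and at least 250 hours. The variable names are illustrative assumptions.

```python
import math
from statistics import NormalDist

mu1, sigma1, n1 = 1400, 200, 125   # manufacturer A
mu2, sigma2, n2 = 1200, 100, 125   # manufacturer B

# Standard error of the difference of two sample means
se_diff = math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)

# Sampling distribution of the difference, approximated as normal
diff = NormalDist(mu1 - mu2, se_diff)
p_160 = 1 - diff.cdf(160)   # difference of at least 160 hours
p_250 = 1 - diff.cdf(250)   # difference of at least 250 hours

print(se_diff, round(p_160, 4), round(p_250, 4))
```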
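Given:
$\mu_1$ = 4,500, $\mu_2$ = 4,000
$\sigma_1$ = 200, $\sigma_2$ = 300
$n_1$ = 50, $n_2$ = 100
$$\sigma_{(\bar{X}_1 - \bar{X}_2)} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{(200)^2}{50} + \frac{(300)^2}{100}} = \sqrt{1700} = 41.23$$
$$P(\bar{X}_1 - \bar{X}_2 \ge 600) = P\left(Z \ge \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sigma_{(\bar{X}_1 - \bar{X}_2)}}\right) = P\left(Z \ge \frac{600 - 500}{41.23}\right)$$
$$= P(Z \ge 2.43) = 0.5000 - 0.4925 = 0.0075 \quad \text{(area under normal curve)}$$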
SAMPLING DISTRIBUTION OF THE DIFFERENCE OF TWO PROPORTIONS
Suppose two populations of size $N_1$ and $N_2$ are given. For each sample of size $n_1$ from the first
population, compute the sample proportion $\bar{P}_1$ and standard deviation $\sigma_{\bar{P}_1}$. Similarly, for
each sample of size $n_2$ from the second population, compute the sample proportion $\bar{P}_2$ and
standard deviation $\sigma_{\bar{P}_2}$. For all combinations of these samples from these populations, we can
obtain a sampling distribution of the difference $\bar{P}_1 - \bar{P}_2$ of sample proportions. Such a
distribution is called the sampling distribution of the difference of two proportions. The mean and
standard deviation of this distribution are given by:
$$\mu_{\bar{P}_1} - \mu_{\bar{P}_2} = P_1 - P_2$$
$$\sigma_{(\bar{P}_1 - \bar{P}_2)} = \sqrt{\sigma_{\bar{P}_1}^2 + \sigma_{\bar{P}_2}^2} = \sqrt{\frac{P_1 q_1}{n_1} + \frac{P_2 q_2}{n_2}}$$
If the sample sizes $n_1$ and $n_2$ are large, i.e. $n_1$ ≥ 30 and $n_2$ ≥ 30, then the sampling
distribution of the difference of proportions is closely approximated by a normal distribution.
Example 7. 10% of the machines produced by company A are defective and 5% of those
produced by company B are defective. A random sample of 250 machines is taken
from company A and a random sample of 300 machines is taken from company B.
What is the probability that the difference in sample proportions is less than or equal
to 0.02?
Solution:
$\mu_{\bar{P}_1} - \mu_{\bar{P}_2} = P_1 - P_2$ = 0.10 - 0.05 = 0.05
$n_1$ = 250, $n_2$ = 300
The standard error of the difference in sample proportions is given by
$$\sigma_{(\bar{P}_1 - \bar{P}_2)} = \sqrt{\sigma_{\bar{P}_1}^2 + \sigma_{\bar{P}_2}^2} = \sqrt{\frac{P_1 q_1}{n_1} + \frac{P_2 q_2}{n_2}} = \sqrt{\frac{(0.10)(0.90)}{250} + \frac{(0.05)(0.95)}{300}} = \sqrt{0.00052} = 0.0228$$
$$P(\bar{P}_1 - \bar{P}_2 \le 0.02) = P\left(Z \le \frac{(\bar{P}_1 - \bar{P}_2) - (P_1 - P_2)}{\sigma_{(\bar{P}_1 - \bar{P}_2)}}\right) = P\left(Z \le \frac{0.02 - 0.05}{0.0228}\right)$$
$$= P(Z \le -1.32) = 0.5000 - 0.4066 = 0.0934 \quad \text{(area under normal curve)}$$
Hence the desired probability for the difference in sample proportions is 0.0934.
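Example 7 can be checked with the Python sketch below. It computes the standard error of the difference of two proportions and the probability that the difference is at most 0.02; the result is close to 0.0934 but not identical, because the table lookup uses Z rounded to two decimals.

```python
import math
from statistics import NormalDist

p1, n1 = 0.10, 250   # company A: proportion defective, sample size
p2, n2 = 0.05, 300   # company B: proportion defective, sample size

# Standard error of the difference of two sample proportions
se_diff = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Probability that the difference in sample proportions is at most 0.02
prob = NormalDist(p1 - p2, se_diff).cdf(0.02)

print(round(se_diff, 4), round(prob, 4))
```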