Central Limit Theorem and Sample Size
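Then the mean of the sample $x_1, x_2, \ldots, x_n$ is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n},$$
and the standard deviation of the sample is
$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}}.$$
Note, the variance of the sample is simply $s^2$.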
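As a quick illustration of these two formulas, here is a minimal Python sketch. The income figures are made-up values chosen only for the example, and the standard deviation divides by n, as in the definition above.

```python
import math

# Hypothetical annual incomes (S$) for a sample of n = 5 people.
# These values are invented purely for illustration.
incomes = [42000, 38500, 51000, 47250, 40250]

n = len(incomes)

# Sample mean: sum of the observations divided by n.
x_bar = sum(incomes) / n

# Sample standard deviation with n in the denominator, as defined in the text.
s = math.sqrt(sum((x - x_bar) ** 2 for x in incomes) / n)

# The sample variance is simply s squared.
variance = s ** 2

print(f"mean = {x_bar:.2f}, std dev = {s:.2f}, variance = {variance:.2f}")
```

Many statistics packages divide by n - 1 instead of n by default, so their result may differ slightly from the value computed here.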
Let us denote the mean of the annual income of the population of all Singaporean working-age
males as $\mu$ and the standard deviation of the population as $\sigma$ (Variance = $\sigma^2$).
The question now is: how large does your sample size have to be for the sample mean
$\bar{x}$ to be close to $\mu$?
But, to answer that question, you, the researcher, have to specify two quantities: (i) how much
sampling error you are willing to tolerate, and (ii) how much confidence you want to have that
the population mean is in the range of the sampling error you have specified.
Imagine drawing $k$ samples of size $n$ from the population of all Singaporean working-age males
with mean $\mu$ and standard deviation $\sigma$ as defined above. Now calculate the mean of each sample.
This will give you a sequence of sample means as follows:
$$\bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots, \bar{x}_k$$
The CLT states that in the limit as $n \to \infty$, the distribution of these sample means will approach a
normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$ (or Variance = $\sigma^2/n$). [The proof of
this statement is beyond the scope of this class]
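Although the proof is out of scope, the claim can be checked numerically. The sketch below is only an illustration under stated assumptions: the population is taken to be exponential (a deliberately skewed, non-normal shape) with mean and standard deviation both equal to 1000, and the values of n and k are arbitrary choices, not figures from the income example.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

mu = 1000.0     # mean of the assumed exponential population
sigma = 1000.0  # an exponential distribution has std dev equal to its mean
n = 50          # size of each sample (arbitrary choice)
k = 2000        # number of samples drawn (arbitrary choice)

# Draw k samples of size n and record the mean of each one.
sample_means = []
for _ in range(k):
    sample = [rng.expovariate(1.0 / mu) for _ in range(n)]
    sample_means.append(statistics.fmean(sample))

# Compare the spread of the sample means with what the CLT predicts.
mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)
predicted_sd = sigma / math.sqrt(n)

print(f"mean of sample means:   {mean_of_means:.1f} (population mean {mu:.1f})")
print(f"std dev of sample means: {sd_of_means:.1f} (sigma/sqrt(n) = {predicted_sd:.1f})")
```

If the CLT holds, the mean of the sample means should sit close to the population mean and their standard deviation close to sigma/sqrt(n), even though the underlying population is far from normal.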
Please note, this is the same as saying that in the limit as $n \to \infty$, the distribution of
$\frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ will approach a standard normal distribution with mean 0 and
standard deviation 1 (Variance = 1).
We can state this more formally as follows:
$$\lim_{n \to \infty} \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim Z(0,1),$$
where $Z$ is the notation representing a standard normal distribution with mean 0 and standard deviation 1.
Now assume that $n$ is large enough; then we can restate this expression as:
$$\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = Z(0,1) \qquad \text{(A)}$$
Going back to our original question, equation A is the one that will allow us to determine the
relevant sample size. Also, this is where the concepts of sampling error and confidence level
become important.
Suppose you want a sampling error of S$100. In terms of equation A, this basically means that
𝑥̅ − 𝜇 = 100.
Suppose you want to be 95% confident that the population mean will be within the range of the
sampling error you have specified.
Below is a plot of a standard normal distribution (see Figure 1). When a variable follows a
standard normal distribution, 95% of the observations will lie between Z = -1.96 and Z = 1.96 [You
can also see this from a Z distribution table, which is easily available online]. In equation A,
replace $Z(0,1)$ with 1.96.
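The value 1.96 can also be recovered without a table. The following sketch uses `NormalDist` from Python's standard `statistics` module; the variable names are just illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

confidence = 0.95
alpha = 1 - confidence

# For a two-sided range, alpha/2 of the probability sits in each tail,
# so the cutoff is the (1 - alpha/2) quantile of the standard normal.
z = standard_normal.inv_cdf(1 - alpha / 2)

# Check in the other direction: the probability between -1.96 and 1.96.
coverage = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)

print(f"z for {confidence:.0%} confidence: {z:.4f}")
print(f"P(-1.96 < Z < 1.96) = {coverage:.4f}")
```

Changing `confidence` to another level, such as 0.90 or 0.99, gives the corresponding cutoff in the same way.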
Notice, we are trying to derive the value of $n$ from equation A. But we have not yet specified
$\sigma$, and basic algebra tells us that a single equation with two unknowns cannot be
solved. So we need an estimate of $\sigma$. Let's assume that the last time a survey was conducted of
the annual incomes of working-age Singaporean males, the annual incomes of those who were
sampled had a standard deviation of S$1000. Let us use this as an estimate of the standard
deviation of the population. Going back to equation A, let's set $\sigma = 1000$:
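$$\frac{100}{1000/\sqrt{n}} = 1.96$$
Solving for $n$ gives $\sqrt{n} = 1.96 \times 1000 / 100 = 19.6$, so $n = 19.6^2 = 384.16 \approx 384$.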
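Rearranging equation A for n gives n = (Z × σ / E)², where E is the sampling error. A minimal sketch of that calculation follows; the function name `required_sample_size` is a hypothetical helper introduced only for this example.

```python
import math


def required_sample_size(z, sigma, error):
    """Solve error / (sigma / sqrt(n)) = z for n.

    z     -- standard normal cutoff for the chosen confidence level
    sigma -- estimate of the population standard deviation
    error -- sampling error you are willing to tolerate
    """
    return (z * sigma / error) ** 2


# Values from the text: 95% confidence, sigma = S$1000, error = S$100.
n_raw = required_sample_size(z=1.96, sigma=1000, error=100)

# Rounding up guarantees the error target is met with a whole number of people.
n_rounded_up = math.ceil(n_raw)

print(f"n (unrounded) = {n_raw:.2f}")
print(f"n (rounded up) = {n_rounded_up}")
```

The unrounded value is about 384.16, which the text quotes as 384 by rounding to the nearest whole number. Rounding up instead is the more conservative convention, since a sample cannot include a fraction of a person.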
I know it may be hard for you to appreciate this at this stage, but this is a really powerful
result. What it basically says is that, provided our estimate of the standard deviation of the
population is correct, if you selected a sample of 384 Singaporean working age males, and
took the mean of their annual incomes, the population mean will be within a range of S$100
from the sample mean and you can have 95% confidence that the population mean lies in
that range. This is a remarkable result because we have no idea what the population mean
is. Also, note, the closer you want the sample mean to be to the population mean and the
greater the confidence you want to have that the population mean lies within the range of
sampling error, the larger the sample size has to be.
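To see this trade-off in numbers, the sketch below tabulates the required sample size for a few confidence levels and sampling errors. It keeps the S$1000 estimate of the standard deviation from the text; the other confidence levels and error values are arbitrary choices for illustration, and `sample_size` is a hypothetical helper.

```python
from statistics import NormalDist

SIGMA = 1000.0  # estimate of the population standard deviation from the text


def sample_size(confidence, error, sigma=SIGMA):
    """Return the unrounded n implied by equation A for a two-sided range."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return (z * sigma / error) ** 2


# Tighter errors and higher confidence levels both push n upward.
for confidence in (0.90, 0.95, 0.99):
    for error in (200, 100, 50):
        n = sample_size(confidence, error)
        print(f"confidence {confidence:.0%}, error S${error}: n is about {n:.0f}")
```

Because n depends on the square of σ / E, halving the sampling error multiplies the required sample size by four.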