0% found this document useful (0 votes)

5 views

Lecture Notes

Uploaded by

ivan.cheung.yui

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

5 views

Lecture Notes

Uploaded by

ivan.cheung.yui

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 90

Author name(s)

Leture notes for STAT2602

– Monograph –

October 14, 2019

Springer
Contents

1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Discrete distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Continuous distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Empirical distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1 Moment generating function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3 Point estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1 Maximum likelihood estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Method of moments estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Estimator properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.1 Unbiasedness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.2 Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.3 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4 Interval estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Confidence intervals for means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.1 One-sample case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.2 Tow-sample case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Confidence intervals for variances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3.1 One-sample case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3.2 Two-sample case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4 Confidence intervals: Large samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5 Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.2 Most powerful tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3 Generalized likelihood ratio tests: One-sample case . . . . . . . . . . . . . . . . . . . 61
5.3.1 Testing for the mean: Variance is known . . . . . . . . . . . . . . . . . . . . . . 62
5.3.2 Testing for the mean: Variance is unknown . . . . . . . . . . . . . . . . . . . . 65
5.3.3 Testing for the variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

v
vi Contents

5.3.4 Test and interval estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

5.4 Generalized likelihood ratio tests: Two-sample case . . . . . . . . . . . . . . . . . . . 71
5.4.1 Testing for the mean: Variance is known . . . . . . . . . . . . . . . . . . . . . . 71
5.4.2 Testing for the mean: Variance is unknown . . . . . . . . . . . . . . . . . . . . 74
5.4.3 Testing for the variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.5 Generalized likelihood ratio tests: Large samples . . . . . . . . . . . . . . . . . . . . . 80
5.5.1 Goodness-of-fit tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.5.2 Pearson Chi-squared test of independence . . . . . . . . . . . . . . . . . . . . . 84
Chapter 1
Basic concepts

1.1 Discrete distribution

Let x be a realization of a discrete random variable X ∈ R. Then,

f (x) = P(X = x)

is the probability density function (p.d.f.) of X.

Theorem 1.1. (Discrete univariate probability density function) A discrete univariate

probability density function has the following properties:
(1) f (x) > 0 for x ∈ S;
(2) ∑x∈S f (x) = 1;
(3) P(X ∈ A) = ∑x∈A f (x), where A ⊂ S.

Based on the p.d.f. f (x), we define the function F(x) by

F(x) = P(X ≤ x) = ∑ f (s).

s∈S and s≤x

The function F(x) is called the cumulative distribution function (c.d.f.) of the discrete
random variable X. Note that F(x) is a step function on R and the height of a step at x,
x ∈ S, equals the probability f (x) (see Fig.1.1 for an illustration).
From Theorem 1.1, we can obtain the following theorem.

Theorem 1.2. (Discrete cumulative distribution function) A discrete univariate cumula-

tive distribution function has the following properties:
(1) 0 ≤ F(x) ≤ 1 for x ∈ R;
(2) F(x) is a nondecreasing function of x;
(3) F(∞) = 1 and F(−∞) = 0.

Remark 1.1. The p.d.f. f (x) and the c.d.f. F(x) are one-to-one corresponding. We can first
define the c.d.f. F(x), and then define the p.d.f. f (x) by

f (x) = F(x) − F(x−) for x ∈ S.

1
2 1 Basic concepts

3/6

2/6
f(x)

1/6

0
1 2 3
x

1
F (x)

1/2

1/6

1 2 3
x

Fig. 1.1 The top panel is the p.d.f F(x) of a discrete random variable X, where f (x) = P(X = x) = x/6
for x = 1, 2, 3, and the bottom panel is the corresponding c.d.f. F(x).

Property 1.1. Two discrete random variables X and Y are independent if and only if
F(x, y) = FX (x)FY (y) for all (x, y) ∈ S, where F is joint distribution of X and Y , and FX
(or FY ) is the marginal distribution of X (or Y ).

Property 1.2. Let X and Y be two independent discrete random variables. Then,
(a) for arbitrary countable sets A and B,

P(X ∈ A,Y ∈ B) = P(X ∈ A)P(Y ∈ B);

(b) for any real functions g(·) and h(·), g(X) and h(Y ) are independent.
1.2 Continuous distribution 3

1.2 Continuous distribution

Let X ∈ R be a continuous random variable. The probability of X lies in an interval (a, b]

is ∫ b
P(a < X ≤ b) = f (x)dx
a
for some non-negative function f (·). We call f (x) the p.d.f. of the continuous random
variable X.
Theorem 1.3. (Continuous univariate probability density function) A continuous univari-
ate probability density function has the following properties:
(1) ∫f (x) ≥ 0 for x ∈ R;
(2) R f (x)dx =∫1;
(3) P(X ∈ A) = A f (x)dx for A ⊂ R.
Based on the p.d.f. f (x), the c.d.f of X is defined as
∫ x
F(x) = P(X ≤ x) = f (s)ds,
−∞

which also satisfies Theorem 1.2. From the fundamental theorems of calculus, we have
F ′ (x) = f (x) if exists. Since there are no steps or jumps in a continuous c.d.f., it must be
true that P(X = b) = 0 for all real values of b.
As you can see, the definition for the p.d.f. (or c.d.f.) of a continuous random variable
differs from the definition for the p.d.f. (or c.d.f.) of a discrete random variable by simply
changing the summations that appeared in the discrete case to integrals in the continuous
case.
Example 1.1. (Uniform distribution) A random variable X has a uniform distribution if
{ 1
, for a ≤ x ≤ b,
f (x) = b−a
0, otherwise.

Briefly, we say that X ∼ U(a, b).

Property 1.3. If F is a continuous c.d.f. and X ∼ U(0, 1), then Y = F −1 (X) ∼ F.
Proof.
P(Y ≤ x) = P(F −1 (X) ≤ x) = P(X ≤ F(x)) = F(x).
Note that this property helps us to generate a random variable from certain distribution.
⊓
⊔
Example 1.2. (Normal distribution) A random variable X has a normal distribution if
[ ]
1 (x − µ )2
f (x) = √ exp − for x ∈ R,
σ 2π 2σ 2

where µ ∈ R is the location parameter and σ > 0 is the scale parameter. Briefly, we say
that X ∼ N(µ , σ 2 ). A simple illustration of f (x) with different values of µ and σ is given
in Fig.1.2.
Further, Z = (X − µ )/σ ∼ N(0, 1) (the standard normal distribution), and the c.d.f. of
Z is typically denoted by Φ (x), where
4 1 Basic concepts
∫ x [ 2]
1 s
Φ (x) = P(Z ≤ x) = √ exp − ds.
−∞ 2π 2

Numerical approximations for Φ (x) have been well tabulated in practice.

0.4 0.4
N (0, 1) N (2, 1)
0.35 N (0, 4) 0.35 N (2, 4)

0.3 0.3

0.25 0.25
f(x)

f(x)
0.2 0.2

0.15 0.15

0.1 0.1

0.05 0.05

0 0
-10 -5 0 5 10 -5 2 10
x x

Fig. 1.2 The p.d.f f (x) of N(µ , σ 2 ).

Property 1.4. If the p.d.f. of a continuous random variable X is fX (x) for x ∈ R, the p.d.f.
a ) for x ∈ R.
of Y = aX + b for a ̸= 0 is fY (x) = 1a fX ( x−b
Proof. Let FX (x) be the c.d.f. of X. Then, the c.d.f. of Y is
( ) ( )
x−b x−b
FY (x) = P(Y ≤ x) = P(aX + b ≤ x) = P X ≤ = FX
a a

for x ∈ R. Hence,
( ) ( )
1 x−b 1 x−b
fY (x) = FY′ (x) = FX′ = fX .
a a a a

This completes the proof. ⊓

⊔
Property 1.5. Two continuous random variables X and Y are independent if and only if

F(x, y) = FX (x)FY (y) for all (x, y) ∈ R2 .

Property 1.6. Let X and Y be two independent continuous random variables. Then,
(a) for arbitrary intervals A and B,

P(X ∈ A,Y ∈ B) = P(X ∈ A)P(Y ∈ B);

(b) for any real functions g(·) and h(·), g(X) and h(Y ) are independent.
1.3 Empirical distribution 5

1.3 Empirical distribution

Suppose that X ∼ F(x) is a random variable resulting from a random experiment. Repeat
this experiment n independent times, we get n random variables X1 , · · · , Xn associated
with these outcomes. The collection of these random variables is called a sample from a
distribution with c.d.f. F(x) (or p.d.f. f (x)). The number n is called the sample size.
As all random variables in a sample follow the same c.d.f. as X, we expect that they can
give us the information about the c.d.f of X. Next, we are going to show that the empirical
distribution of {X1 , · · · , Xn } is close to F(x) in some probability sense.
The empirical distribution of {X1 , · · · , Xn } is defined as

1 n
Fn (x) = ∑ I(Xk ≤ x)
n k=1

for x ∈ R, where I(A) is an indicator function such that I(A) = 1 if A holds and I(A) = 0
otherwise. Obviously, Fn (x) assigns the probability 1/n to each Xk , and we can check that
it satisfies Theorem 1.2 (please do it by yourself). Since Fn (x) is the relative frequency
of the event X ≤ x, it is an approximation of the probability P(X ≤ x) = F(x). Thus, the
following result is expected.

Theorem 1.4. As n → ∞, supx∈R |Fn (x) − F(x)| → 0 almost surely (a.s.).

The proof of aforementioned theorem is omitted. Roughly speaking, the almost surely
convergence in this theorem means that Fn (x) provides an estimate of the c.d.f. F(x) for
each realization {x1 , · · · , xn }. To see it more clearly, Fig.1.3 plots the empirical distribution
function Fn (x) based on a realization {x1 , · · · , xn } with Xi ∼ N(0, 1). As a comparison, the
c.d.f. Φ (x) of N(0, 1) is also included in Fig.1.3. From this figure, we can see that Fn (x) is
getting close to Φ (x) as the sample size n increases, and this is consistent to the conclusion
in Theorem 1.4.

Example 1.3. Let X denote the number of observed heads when four coins are tosses
independently and at random. Recall that the distribution of X is B(4, 1/2). One thousand
repetitions of this experiment (actually simulated on the computer) yielded the following
results:
Number of heads Frequency
0 65
1 246
2 358
3 272
4 59

This information determines the following empirical distribution function:

x F1000 (x) x F1000 (x)

(−∞, 0) 0.000 [2, 3) 0.669
[0, 1) 0.065 [3, 4) 0.941
[1, 2) 0.311 [4, ∞) 1.000

The graph of the empirical distribution function F1000 (x) and the theoretical distribution
function F(x) for the binomial distribution are very close (please check it by yourself).
6 1 Basic concepts

(a) n=10 (b) n=50

0.8

0.8
Fn(x)

Fn(x)
0.4

0.4
0.0

0.0
−1 0 1 2 −2 −1 0 1 2 3

x x

(c) n=100 (d) n=500

0.8

0.8
Fn(x)

Fn(x)
0.4

0.4
0.0

0.0

−2 −1 0 1 2 3 −3 −1 0 1 2 3

x x

Fig. 1.3 The black step function is the empirical distribution function Fn (x) based on a realization
{x1 , · · · , xn } with Xi ∼ N(0, 1). The red solid line is the c.d.f. Φ (x) of N(0, 1).

Example 1.4. The following numbers are a random sample of size 10 from some distribu-
tion:
−0.49, 0.90, 0.76, −0.97, −0.73, 0.93, −0.88, −0.75, 0.88, 0.96.
(a) Write done the empirical distribution; (b) use the empirical distribution to estimate
P(X ≤ −0.5) and P(−0.5 ≤ X ≤ 0.5).

Solution. Order the random sample:

−0.97, −0.88, −0.75, −0.73, −0.49, 0.76, 0.88, 0.90, 0.93, 0.96.

Then, the empirical distribution function F10 (x) is as follows:

x F10 (x) x F10 (x)

(−∞, −0.97) 0.0 [−0.49, 0.76) 0.5
[−0.97, −0.88) 0.1 [0.76, 0.88) 0.6
[−0.88, −0.75) 0.2 [0.88, 0.90) 0.7
[−0.75, −0.73) 0.3 [0.90, 0.93) 0.8
[−0.73, −0.49) 0.4 [0.93, 0.96) 0.9
[0.96, ∞) 1.0

Thus, P(X ≤ −0.5) = F(−0.5) ≈ F10 (−0.5) = 0.4 and P(−0.5 ≤ X ≤ 0.5) = F(0.5) −
F(−0.5) ≈ F10 (0.5) − F10 (−0.5) = 0.5 − 0.4 = 0.1. ⊓
⊔
1.3 Empirical distribution 7

The question now is how to estimate the p.d.f. f (x)? The answer is “relative frequency
histogram”.
For the discrete random variable X, we can estimate f (x) = P(X = x) by the relative
frequency of occurrences of x. That is,

∑nk=1 I(Xk = x)
f (x) ≈ fn (x) = .
n
Example 1.3. (con’t) The relative frequency of observing x = 0, 1, 2, 3 or 4 is listed in the
second column, and it is close to the value of f (x), which is the p.d.f of B(4, 1/2).

x f1000 (x) f (x)

0 0.065 0.0625
1 0.246 0.2500
2 0.358 0.3750
3 0.272 0.2500
4 0.059 0.0625

By increasing the value of n, the difference between fn (x) and f (x) will become small.
⊓
⊔
For the continuous random variable X, we first define the so-called class intervals.
Choose an integer l ≥ 1, and a sequence of real numbers c0 , c1 , · · · , cl such that c0 < c1 <
· · · < cl . The class intervals are

(c0 , c1 ], (c1 , c2 ], · · · , (cl−1 , cl ].

Roughly speaking, the class intervals are a non-overlapped partition of the interval
[Xmin , Xmax ]. As f (x) = F ′ (x), we expect that when c j−1 and c j is close,

F(c j ) − F(c j−1 )

f (x) ≈ for x ∈ (c j−1 , c j ], j = 1, 2, · · · , l.
c j − c j−1

Note that
∑nk=1 I(Xk ∈ (c j−1 , c j ])
F(c j ) − F(c j−1 ) = P(X ∈ (c j−1 , c j ]) ≈
n
is the relative frequency of occurrences of Xk ∈ (c j−1 , c j ]. Thus, we can approximate f (x)
by
∑n I(Xk ∈ (c j−1 , c j ])
f (x) ≈ hn (x) = k=1 for x ∈ (c j−1 , c j ], j = 1, 2, · · · , l.
n(c j − c j−1 )
We call hn (x) the relative frequency histogram. Clearly, the way that we define the class
intervals is not unique, and hence the value of hn (x) is not unique. When the sample size n
is large and the length of the class interval is small, hn (x) is expected to be a good estimate
of f (x).
The property of hn (x) is as follows:
(i) hn (x) ≥ 0 for all x;
(ii) The total area bounded by the x axis and below hn (x) equals one, i.e.,
∫ cl
hn (x)dx = 1;
c0
8 1 Basic concepts

(iii) The probability for an event A, which is composed of a union of class intervals, can
be estimated by the area above A bounded by hn (x), i.e.,
∫
P(A) ≈ hn (x)dx.
A

Example 1.5. A random sample of 50 college-bound high school seniors yielded the fol-
lowing high school cumulative grade point averages (GPA’s).
3.77 2.78 3.40 2.20 3.26
3.00 2.85 2.65 3.08 2.92
3.69 2.83 2.75 3.97 2.74
2.90 3.38 2.38 2.71 3.31
3.92 3.29 4.00 3.50 2.80
3.57 2.84 3.18 3.66 2.86
2.81 3.10 2.84 2.89 2.59
2.95 2.77 3.90 2.82 3.89
2.83 2.28 3.20 2.47 3.00
3.78 3.48 3.52 3.20 3.30
(a) Construct a frequency table for these 50 GPA’s using 10 intervals of equal length with
c0 = 2.005 and c10 = 4.005.
(b) Construct a relative frequency histogram for the grouped data.
(c) Estimate f (3) and f (4).

Solution. (a) and (b). The frequency and the relative frequency histogram based on the
class intervals are given in the following table:
class interval frequency relative frequency class interval frequency relative frequency
histogram histogram
(2.005, 2.205] 1 0.1 (3.005, 3.205] 5 0.5
(2.205, 2.405] 2 0.2 (3.205, 3.405] 6 0.6
(2.405, 2.605] 2 0.2 (3.405, 3.605] 4 0.4
(2.605, 2.805] 7 0.7 (3.605, 3.805] 4 0.4
(2.805, 3.005] 14 1.4 (3.805, 4.005] 5 0.5
(c) As 3 ∈ (2.805, 3.005] and 4 ∈ (3.805, 4.005],

14
f (3) ≈ h50 (3) = = 1.4,
50 × (3.005 − 2.805)

5
f (4) ≈ h50 (4) = = 0.5.
50 × (4.005 − 3.805)
⊓
⊔

1.4 Expectation

Definition 1.1. (Expectation of a discrete statistic) If u(X) is a function of a discrete ran-

dom variable X that has a p.d.f. f (x), then

E[u(X)] = ∑ u(x) f (x),

x∈S
1.4 Expectation 9

where the summation is taken over all possible values of x. If E[u(X)] exists, it is called
the mathematical expectation (or expected value) of u(X).

Remark 1.2. E[u(X)] exists if

∑ |u(x)| f (x) < ∞.
x∈S

We say two random variables X1 and X2 are uncorrelated, if Cov(X1 , X2 ) = 0, where

Cov(X1 , X2 ) = E(X1 X2 ) − E(X1 )E(X2 )

is the covariance of X1 and X2 .

Property 1.7. Let X be a discrete random variable with finite mean E(X), and let a and
b be constants. Then,
(i) E(aX + b) = aE(X) + b;
(ii) if P(X = b) = 1, then E(X) = b;
(iii) if P(a < X ≤ b) = 1, then a < E(X) ≤ b;
(iv) if g(X) and h(X) have finite mean, then

E(g(X) + h(X)) = E(g(X)) + E(h(X)).

Property 1.8. If X ≥ 0 takes integer values, then E(X) = ∑∞ ∞

x=1 P(X ≥ x) = ∑x=0 P(X > x).

Definition 1.2. (Expectation of a continuous statistic) If u(X) is a function of a continuous

random variable that has a p.d.f. f (x), then
∫
E[u(X)] = u(x) f (x)dx.
R

If E[u(X)] exists, it is called the mathematical expectation (or expected value) of u(X).

Remark 1.3. E[u(X)] exists if

∫
|u(x)| f (x)dx < ∞.
R

Example 1.6. Let X have the N(µ , σ 2 ) distribution. Then,

∫ ∞ [ ]
x (x − µ )2
E(X) = √ exp − dx
−∞ 2πσ 2σ 2
∫ ∞ [ 2]
sσ + µ s
= √ exp − d(sσ + µ ) (letting s = (x − µ )/σ )
−∞ 2πσ 2
∫ ∞ [ 2] ∫ ∞ [ 2]
sσ s 1 s
= √ exp − ds + µ √ exp − ds.
−∞ 2π 2 −∞ 2π 2

The first integrand is an odd function, and so the integral over R is zero. The second
integrand is one by some algebra. Hence, E(X) = µ .

Property 1.9. Let X be a continuous random variable, a and b be constants, and g and h
be functions. Then,
(i) if g(X) and h(X) have finite mean then
10 1 Basic concepts

E(ag(X) + bh(X)) = aE(g(X)) + bE(h(X));

(ii) if P(a ≤ X ≤ b) = 1, then a ≤ E(X) ≤ b;

(iii) if h is non-negative, then for a > 0, P(h(X) ≥ a) ≤ E(h(X)/a);
(iv) if g is convex, then g(E(X)) ≤ E(g(X)).

Property 1.10. Let X be a non-negative random variable with c.d.f. F, p.d.f f , and finite
expected value E(X). Then,
∫ ∞
E(X) = (1 − F(x))dx.
0

Property 1.11. Let a, b, c, and d be constants. Then,

(i) E(X 2 ) = 0 if and only if P(X = 0) = 1;
(ii) Cov(aX + b, cY + d) = ac Cov(X,Y );
(iii) Var(X +Y ) = Var(X) + Var(Y ) + 2Cov(X,Y );
(iv) if X and Y are independent, E(h(X)g(Y )) = E(h(X))E(g(Y )) provided that E(h(X)) <
∞ and E(g(Y )) < ∞.
(v) −1 ≤ ρ (X,Y ) ≤ 1;
(vi) |ρ (X,Y )| = 1 if and only if P(X = aY + b) = 1 for some constants a and b;
(vii) ρ (aX + b, cY + d) = sgn(ac)ρ (X,Y ), where sgn(x) denotes the sign of x;
(viii) if X and Y are independent, ρ (X,Y ) = 0.

Property 1.12. (Cauchy-Schwarz inequality) If E(X 2 )E(Y 2 ) < ∞, then

√
E(XY ) ≤ E(X 2 )E(Y 2 ).

Proof. Without loss generality, we assume that E(Y 2 ) > 0. Note that
[ ] [ ]
0 ≤ E (XE(Y 2 ) −Y E(XY ))2 = E(Y 2 ) E(X 2 )E(Y 2 ) − (E(XY ))2 .

Hence, the conclusions holds. ⊓

⊔
Chapter 2
Preliminary

2.1 Moment generating function

Let r be a positive integer. The r-th moment about the origin of a random variable X is
defined as µr = E(X r ). In order to calculate µr , we can make use of the moment generating
function (m.g.f.).
Definition 2.1. (Moment Generating Function) A moment generating function of X is a
function of t ∈ R defined by MX (t) = E(etX ) if exists.
Property 2.1. Suppose MX (t) exists. Then,
∞ ( r)
(1) MX (t) = ∑ µr tr! ;
r=0
(r)
(2) µr = MX (0) for r = 1, 2, . . .;
(3) For constants a and b, MaX+b (t) = ebt MX (at).
Proof. (1) For a discrete random variable X we have
∞ ∞ r ∞ r
(tx)r t t
MX (t) = ∑ etx P(X = x) = ∑ ∑ P(X = x) = ∑ ∑ xr P(X = x) = ∑ µr .
x x r=0 r! r=0 r! x r=0 r!

For a continuous random variable X, the proof is similar by using integrals instead of
sums.
(2) Make use of (1).
[ ]
(3) MaX+b (t) = E e(aX+b)t = ebt E(eatX ) = ebt MX (at). ⊓
⊔

From parts (1)-(2) above, we have

(t ) ( 2) ( 3)
(1) (2) t (3) t
MX (t) = MX (0) + MX (0) + MX (0) + MX (0) +··· ,
1! 2! 3!

which is called the Maclaurin’s series of MX (t) around t = 0. If the Maclaurin’s series
expansion of MX (t) can be found, the r-th moment µr is the coefficient of t r /r!; or if
MX (t) exists and the moments are given, we can frequently sum the Maclaurin’s series to
obtain the closed form of MX (t).
Property 2.2. If MX (t) exists, there is a one-to-one correspondence between MX (t) and
the p.d.f. f (x) (or c.d.f. F(x)).

11
12 2 Preliminary

Proof. The proof is omitted. ⊓

⊔

The above property shows that we can decide the distribution of X by calculating its
m.g.f.

Example 2.1. The m.g.f. of N(µ , σ 2 ) is

∫ +∞ (x−µ )2
1 −
E(etX ) = etx √ e 2σ 2 dx
−∞ 2πσ
∫ +∞ ( )
1 −2σ 2tx + x2 − 2µ x + µ 2
= √ exp dx
−∞ 2πσ −2σ 2
( ) ∫ +∞ ( )
−2µσ 2t − σ 4t 2 1 (x − µ − σ 2t)2
= exp × √ exp dx
−2σ 2 −∞ 2πσ −2σ 2
( )
1
= exp µ t + σ 2t 2 ,
2
( )
µ −σ 2 t)2
because √21πσ exp (x−−2 σ 2 is the density function of N(µ + σ 2t, σ 2 ). ⊓
⊔

Example 2.2. Find the m.g.f. of a random variable X following a Poisson distribution with
mean λ .

Solution.
∞ ∞ ∞
e−λ λ x (λ et )x
∑ etx P(X = x) = ∑ etx = e−λ ∑ = e−λ eλ e = eλ (e −1) .
t t
MX (t) =
x=0 x=0 x! x=0 x!

⊓
⊔

Example 2.3. Find the m.g.f. of a random variable which has a (probability) density func-
tion given by {
e−x , for x > 0;
f (x) =
0, otherwise,
and then use it to find µ1 , µ2 , and µ3 .

Solution.

MX (t) = E(etX )
 +∞
∫ +∞ ∫ +∞ 
e
(t−1)x 1
= , for t < 1;
= etx f (x)dx = etx e−x dx = t −1 1−t
−∞ 0 
 0
does not exist, for t ≥ 1.

Then,

(1) 1 (2) 2
µ1 = MX (0) = = 1, µ2 = MX (0) = = 2,
(1 − t)2 t=0 (1 − t)3 t=0
(3) 2×3
µ3 = MX (0) = = 3!.
(1 − t)4 t=0

⊓
⊔
2.1 Moment generating function 13

Property 2.3. If X1 , X2 , . . . , Xn are independent random variables, MXi (t) exists for i =
1, 2, · · · , n, and Y = X1 + X2 + · · · + Xn , then MY (t) exists and
n
MY (t) = ∏ MXi (t).
i=1

Proof. The proof is left as an excise. ⊓

⊔

Example 2.4. Find the distribution of the sum of n independent random variables X1 , X2 , . . . , Xn
following Poisson distributions with means λ1 , λ2 , . . . , λn respectively.

Solution. Let Y = X1 + X2 + · · · + Xn . Then,

n n
MY (t) = ∏ MXi (t) = ∏ eλi (e −1) = e(e −1) ∑i=1 λi ,
t t n

i=1 i=1

which is the m.d.f. of Poisson random variable with mean ∑ni=1 λi . Hence, by Example
2.2 and Property 2.3, Y ∼ Poisson distribution with mean ∑ni=1 λi . ⊓
⊔

Example 2.5. For positive numbers α and λ , find the moment generating function of a
gamma distribution Gamma(α , λ ) of which the density function is given by
 α α −1 −λ x
λ x e
, for x > 0;
f (x) = Γ (α )

0, otherwise.

Solution.
∫ +∞
λ α xα −1 e−λ x
MX (t) = E(etX ) = etx dx
0 Γ (α )
∫ +∞
λα
= xα −1 e−(λ −t)x dx
0 Γ (α )
∫
λα +∞ (λ − t)α
= α
xα −1 e−(λ −t)x dx
(λ − t) 0 Γ (α )
 α
 λ , for t < λ ;
= (λ − t)α

does not exist, for t ≥ λ ,

where ∫ +∞
(λ − t)α α −1 −(λ −t)x
x e dx = 1
0 Γ (α )
is due to the fact that
(λ − t)α α −1 −(λ −t)x
x e for x > 0
Γ (α )
is the density function of a Gamma(α , λ − t) distribution. ⊓
⊔

Example 2.6. Find the distribution of the sum of n independent random variables X1 , X2 , . . . , Xn
where Xi follows Gamma(αi , λ ), i = 1, 2, . . . , n, with the p.d.f. given by
14 2 Preliminary
 α α −1 −λ x
λ ix i e
, for x > 0;
f (x) = Γ (αi )

0, otherwise.

Solution. From the previous example, we know that the m.g.f. of Xi is

( )αi
λ
MXi (t) = for t < λ , i = 1, 2, . . . , n.
λ −t

Hence, the m.g.f. of X1 + X2 + · · · + Xn is

( )α1 +α2 +···+αn
n
λ
∏ i MX (t) =
λ −t
for t < λ .
i=1

Therefore, X1 + X2 + · · · + Xn follows a Gamma(α1 + α2 + · · · + αn , λ ) distribution. ⊓

⊔
Example 2.7. Prove that the sum of n independent random variables X1 , X2 , . . . , Xn each
following a Bernoulli distribution with parameter p follows B(n, p), the binomial distri-
bution with parameters n and p.
Proof. On one hand, for i = 1, 2, . . . , n, MXi (t) = e0t P(Xi = 0) + e1t P(Xi = 1) = (1 − p) +
et p = 1 − p + pet , and hence the m.g.f. of X1 + X2 + · · · + Xn is
n
∏ MXi (t) = (1 − p + pet )n .
i=1

On the other hand, the m.g.f. of B(n, p) is

n ( ) n ( )
tx n n
∑ x e p x
(1 − p)n−x
= ∑ x (pet )x (1 − p)n−x = (pet + 1 − p)n .
x=0 x=0

Therefore, X1 + X2 + · · · + Xn follows a B(n, p) distribution. ⊓

⊔
Example 2.8. Let X1 , X2 , . . . , Xn be independent N(0, 1) random variables. Show that Y =
∑ni=1 Xi2 ∼ χn2 .
Solution. By the independence of X1 , X2 , . . . , Xn and Property 2.3,
n n
1
MY (t) = ∏ MX 2 (t) = ∏ √ = (1 − 2t)−n/2 ,
i=1
i
i=1 1 − 2t

for t < 1/2, where we have used the fact that

∫ ∞
1 x2
etx √ e− 2 dx
2 2
MX 2 (t) = E(etXi ) =
i −∞ 2π
∫ +∞ ( )
1 (1 − 2t)x2
= √ exp − dx
−∞ 2π 2
∫ +∞ ( ) ( )
1 y2 y √
= √ exp − d √ (by letting y = 1 − 2tx)
−∞ 2π 2 1 − 2t
∫ +∞ ( 2)
1 1 y 1
=√ √ exp − dy = √ .
1 − 2t −∞ 2π 2 1 − 2t
2.2 Convergence 15

Note that the m.g.f. of χn2 is (1 − 2t)−n/2 for t < 1/2. Hence, by Property 2.2, Y ∼ χn2 . ⊓
⊔

2.2 Convergence

Functions of random variables are of interest in statistical applications. Usually, functions

of random sample X = {X1 , · · · , Xn } are called statistics. Two important statistics are the
sample mean X and the sample variance S2 , where

1 n
X= ∑ Xi ,
n i=1
1 n
S2 = ∑ (Xi − X)2 .
n i=1

Although in a particular sample, say x1 , · · · , xn , we observe definite values of these statis-

tics, x and s2 , we should recognize that each value is only one observation of the respective
random variables X and S2 . That is, each X or S2 is also a random variable with its own
distribution.
Suppose that the random sample X from a distribution F(x) with mean µ = E(X)
and variance σ 2 = Var(X). When n is large, Theorem 1.4 shows that F(x) can be well
approximated by Fn (x). Meanwhile, we can easily show that X and S2 are the mean and
variance of a random variable from a distribution Fn (x). Therefore, it is expected that
when n is large, µ and σ 2 can be well approximated by X and S2 , respectively.

Definition 2.2. (Convergence in probability) Let (Zn ; n ≥ 1) be a sequence of random

variables. We say the sequence Zn converges in probability to Z if, for any ε > 0,

P(|Zn − Z| > ε ) → 0 as n → ∞.

For brevity, this is often written as Zn → p Z.

Remark 2.1. 1. For a deterministic sequence {an },

an → a as n → ∞ ⇐⇒ for any ε > 0, there exists an integer N(ε ) > 0

such that when n ≥ N, |an − a| < ε (for sure!)

2. For a random sequence {Zn },

Zn → p Z as n → ∞ ⇐⇒ for any ε > 0, there exists an integer N(ε ) > 0

such that when n ≥ N, P(|Zn − Z| < ε ) is very close
to one (but not for sure!)

3. ε specifies the accuracy of the convergence, which can be achieved for large n(≥ N).

Theorem 2.1. (Weak law of large numbers (LLN)) Let (Xi ; i ≥ 1) be a sequence of in-
dependent random variables having the same finite mean and variance, µ = E(X1 ) and
σ 2 = Var(X1 ). Then, as n → ∞,
X →p µ .
It is customary to write Sn = ∑ni=1 Xi for the partial sums of the Xi .
16 2 Preliminary

Proof. It suffices to show that for any ε > 0,

( )
1
P Sn − µ > ε → 0 as n → ∞. (2.1)
n

By Chebyshov’s inequality below, we have

( ) [( )2 ]
1 1 1
P Sn − µ > ε ≤ 2E Sn − µ
n ε n
( )2 
1  1 n
= 2E
ε ∑ (Xi − µ ) 
n i=1
( )2 
n
1
= 2 2 E  ∑ (Xi − µ )  .
n ε i=1

By the independence of (Xn ; n ≥ 1), we can obtain that

( )  [ ]
2
n n n
E ∑ (Xi − µ ) =E ∑ ∑ (Xi − µ )(X j − µ )
i=1 i=1 j=1
[ ] [ ]
n n n
=E ∑ (Xi − µ )2 +E ∑ ∑ (Xi − µ )(X j − µ )
i=1 i=1 j̸=i, j=1
[ ]
n
=E ∑ (Xi − µ ) 2
(by Property 1.11(iv))
i=1

= nσ 2 .

Thus, it follows that

( )
1 nσ 2 σ2
P Sn − µ > ε ≤ = → 0 as n → ∞,
n n2 ε 2 nε 2

which implies that (2.1) holds. This completes the proof. ⊓

⊔

Property 2.4. (Chebyshov’s inequality) Suppose that E(X 2 ) < ∞. Then, for any constant
a > 0,
E(X 2 )
P(|X| ≥ a) ≤ .
a2
Proof. This is left as an excise. ⊓
⊔

Property 2.5. If Xn → p µ and Yn → p ν , then (i) Xn +Yn → p µ + ν ; (ii) XnYn → p µν ; (iii)

Xn /Yn → µ /ν if Yn ̸= 0 and ν ̸= 0; (iv) g(Xn ) → p g(µ ) for a continuous function g(·).

Proof. The proof is omitted. ⊓

⊔

Example 2.9. Let (Xi ; i ≥ 1) be a sequence of independent random variables having the
same finite mean µ = E(X1 ), finite variance σ 2 = Var(X1 ), and finite fourth moment
µ4 = E(X14 ). Show that
S2 → p Var(X1 ).
2.2 Convergence 17
2
(Hint: S2 = n1 ∑ni=1 Xi2 − X )
Definition 2.3. (Convergence in distribution) Let (Zn ; n ≥ 1) be a sequence of random
variables. We say the sequence Zn converges in distribution to Z if, as n → ∞,

Gn (x) → G(x) whereever G(x) is continuous.

Here, Gn (x) and G(x) are the c.d.f. of Zn and Z, respectively.

Theorem 2.2. (Central limit theorem (CLT)) Let (Xi ; i ≥ 1) be a sequence of inde-
pendent random variables having the same finite mean and variance, µ = E(X1 ) and
σ 2 = Var(X1 ). Then, as n → ∞,
√
X − E(X) n(X − µ )
√ = →d N(0, 1).
Var(X) σ
( )
Central limit theorem shows that X ∼ N E(X), Var(X) , and hence it tells us the dis-
tribution of X when the sample size n is large. Next, we check the performance of central
limit theorem by simulations:
(1) Generate a realization {x1 , x2 , · · · , xn } of the independent random sample {X1 , X2 , · · · , Xn }
from N(0, 1); √
(2) Calculate zn = n(x − µ )/σ with µ = 0 and σ = 1;
(1) (2) (J)
(3) Repeat (1)-(2) J√ times to get {zn , zn , · · · , zn }, which is a sequence of realizations of
Zn , where Zn = n(X − µ )/σ ;
(1) (2) (J)
(4) Plot the (relative frequency) histogram of {zn , zn , · · · , zn }.
(1) (2) (J)
Fig. 2.1 plots the histogram of {zn , zn , · · · , zn } for n = 10, 50, and 1000 with the
(1) (2) (J)
repetition time J = 10000. From this figure, the histogram of {zn , zn , · · · , zn } is very
close to the density of N(0, 1), especially for large n. Hence, this implies that the approx-
imation in central limit theorem performs very well.
To prove the above central limit theorem, we need the following lemma:
Lemma 2.1. If
1. MZn (t), the moment generating function of Zn , exists, n = 1, 2, . . .,
2. lim MZn (t) exists and equals the moment generating function of a random variable Z,
n→∞
then
lim GZn (x) = GZ (x) for all x at which GZ (x) is continuous,
n→∞

where GZn (x) is the c.d.f. of Zn , n = 1, 2, . . ., and GZ (x) is the c.d.f. of Z.

Proof. As (Xi ; i ≥ 1) is a sequence of independent random variables having the same finite
mean µ = E(X1 ) and finite variance σ 2 = Var(X1 ), simple algebra gives us that

1 n σ2
E(X) = ∑
n i=1
E(Xi ) = µ and Var(X) = E[(X)2 ] − [E(X)]2 =
n
.

Hence, it suffices to show that

√
n(X − µ )
Zn = →d N(0, 1). (2.2)
σ
18 2 Preliminary

(1) (2) (J)

Fig. 2.1 The histogram of {zn , zn , · · · , zn } with J = 10000 and n = 10 (top), n = 50 (middle) or
n = 1000 (Bottom). The red line is the density of N(0, 1).

Xi − µ
A heuristic proof for (2.2) is based on Lemma 2.1. Let Yi = , then E(Yi ) = 0 and
σ
Var(Yi ) = 1. Suppose the moment generating function MYi (t) exists. A Taylor’s expansion
of MYi (t) around 0 gives:

(1) t 2 (2)
MYi (t) = MYi (0) + tMYi (0) + M (ε ), for some 0 ≤ ε ≤ t.
2 Yi
Since Zn = √1 ∑ni=1 Yi , then the moment generating function of Zn is thus given by
n

n )
(
t
MZn (t) = ∏ MYi √
i=1 n
[ ( )]n
t
= MYi √
n
[ √ ]n
t (1) (t/ n)2 (2)
= MYi (0) + √ MYi (0) + MYi (ε )
n 2
[ 2
]n
t t (2)
= 1 + √ E(Yi ) + MYi (ε )
n 2n
[ ]n
t 2 (2)
= 1 + MYi (ε ) ,
2n
2.3 Resampling 19
√ (2) (2)
where 0 ≤ ε ≤ t/ n. As n → ∞, ε → 0 and MYi (ε ) → MYi (0) = E(Yi2 ) = 1. Hence,
( )n ( 2) ( )
t2 t 1
lim MZn (t) = lim 1 + = exp = exp 0 × t + × 1 × t ,
2
n→∞ n→∞ 2n 2 2

which is the moment generating function of N(0, 1) random variable. Hence, the conclu-
sion follows directly from Lemma 2.1. ⊓⊔

Example 2.10. Suppose that Y ∼ χ 2 (50). Approximate P(40 < Y < 60).

Solution. By Example 2.8, Y ∼ ∑50i=1 Xi , where X1 , X2 , · · · , X50 are independent N(0, 1)

2
1 50
random variables. Let X = 50 ∑i=1 Xi . Hence,
2

P(40 < Y < 60) = P(40 < 50X < 60)

= P(4/5 < X < 6/5)
(√ √ √ )
50(4/5 − µ ) 50(X − µ ) 50(6/5 − µ )
=P < <
σ σ σ
√ √
50(6/5 − µ ) 50(4/5 − µ )
≈ Φ( ) − Φ( ) (by CLT)
σ σ
= Φ (1) − Φ (−1) ≈ 0.68,

where µ = EXi2 = 1, σ 2 = VarXi2 = 2, and Φ (·) is the c.d.f. of N(0, 1). ⊓

⊔

2.3 Resampling

Suppose {X1 , · · · , Xn } be a random sample from one population with an unknown c.d.f.
F(·). Let {x1 , · · · , xn } be one realization of {X1 , · · · , Xn }. Based on {x1 , · · · , xn }, we have
a realization of the empirical distribution:

1 n
Fn (x) = ∑ I(xk ≤ x).
n k=1

By Theorem 1.4,

F(x) ≈ Fn (x). (2.3)

Since Fn (x) is a discrete c.d.f, we can draw a random sample {X1∗ , X2∗ , · · · , XB∗ } from Fn (x),
and it is expected that the (relative frequency) histogram of {X1∗ , X2∗ , · · · , XB∗ } should be
close to f (x). Here, Xi∗ ∼ X ∗ ∼ Fn (x) is a discrete random variable such that

1
P(X ∗ = x j ) = for j = 1, 2, · · · , n.
n
Conventionally, {X1∗ , X2∗ , · · · , XB∗ } is called the bootstrap (resampling) random sample, and
B is the bootstrap sample size.

Example 2.11. Let {xi }200i=1 be a realization from N(0, 1). Fig. 2.2 plots the histogram of
original realization {xi }200 ∗ 100 ∗ 200 ∗ 500
i=1 and bootstrapped realizations {xi }i=1 , {xi }i=1 , and {xi }i=1 .
20 2 Preliminary

From this figure, we can see that the distribution of the bootstrapped realizations is very
close to the distribution of N(0, 1), especially for large B. ⊓
⊔

(a)0iginalSFBMJ[BUJPO,n=200 (b)#ootstrapQFESFBMJ[BUJPO,B=100
0.5 0.5

0.4 0.4

0.3 0.3

0.2 0.2

0.1 0.1

0 0
-5 0 5 -5 0 5

(c)#PPUTUSBQQFESFBMJ[BUJPO,B=200 (d)#PPUTUSBQQFESFBMJ[BUJPO,B=500
0.5 0.5

0.4 0.4

0.3 0.3

0.2 0.2

0.1 0.1

0 0
-5 0 5 -5 0 5

Fig. 2.2 The red line is the p.d.f of N(0, 1). (a) The histogram of original realization {xi }200
i=1 ; (b) The
histogram of one bootstrapped realization {xi∗ }100
i=1 ; (c) The histogram of one bootstrapped realization
{xi∗ }200 ∗ 500
i=1 ; (d) The histogram of one bootstrapped realization {xi }i=1 .

Finally, we consider how to use bootstrap method to approximate the distribution of a

statistic T = g(X1 , X2 , · · · , Xn ), where g(·) is a given functional. In practice, a direct cal-
culation of the distribution of T is usually infeasible. But in view of (2.3), the distribution
of T is close to the distribution of T ∗ , where T ∗ = g(X1∗ , X2∗ , · · · , Xn∗ ). Hence, it motives
us to use the following procedure to approximate the distribution of T :
(1) Generate a bootstrapped realization {x1∗ , x2∗ , · · · , xn∗ } from the distribution Fn (·);
(2) Calculate t ∗ = g(x1∗ , x2∗ , · · · , xn∗ ), which is a realization of T ∗ ;
(3) Repeat (1)-(2) J times to get {t ∗(1) ,t ∗(2) , · · · ,t ∗(J) }, which is a sequence of realizations
of T ∗ ;
(4) Plot the (relative frequency) histogram of {t ∗(1) ,t ∗(2) , · · · ,t ∗(J) }.
Since the histogram of {t ∗(1) ,t ∗(2) , · · · ,t ∗(J) } is close to the p.d.f. of T ∗ , it is also close
to the p.d.f. of T . Clearly, this bootstrap method provides us an easy way to calculate the
percentile of the distribution of T , which is important in many applications.
Chapter 3
Point estimation

In many applications, a random variable X resulting from a random experiment is assumed

to have a certain distribution with the p.d.f. f (x; θ ), where θ ∈ R s is a unknown parameter
taking value in a set Ω , and Ω is usually called the parameter space. For example, X
is often assumed to follow a normal distribution N(µ , σ 2 ); in this case θ = (µ , σ ) ∈
Ω is the unknown parameter, and the parameter space Ω = {(µ , σ ) : µ ∈ R, σ > 0}.
For the experimenter, the important question is how to find a “good” estimator for the
unknown parameter θ . Heuristically, a random sample X = {X1 , X2 , · · · , Xn }, which forms
an empirical distribution, can elicit information about the distribution of X. Hence, it is
natural to expect that we can estimate the unknown parameter θ based on a random sample
X.

3.1 Maximum likelihood estimator

We first consider the maximum likelihood estimator, which is motivated by a simple ex-
ample below.

Example 3.1. Suppose that X follows a Bernoulli distribution so that the p.d.f. of X is

f (x; p) = px (1 − p)1−x , x = 0, 1,

where the unknown parameter p ∈ Ω with Ω = {p : p ∈ (0, 1)}. Further, assume that we
have a random sample X = {X1 , X2 , · · · , Xn } with the observable values x = {x1 , x2 , · · · , xn },
respectively. Then, the probability that X = x is

L(x1 , · · · , xn ; p) = P(X1 = x1 , X2 = x2 , · · · , Xn = xn )
n
= ∏ pxi (1 − p)1−xi = p∑i=1 xi (1 − p)n−∑i=1 xi ,
n n

i=1

which is the joint p.d.f. of X1 , X2 , · · · , Xn evaluated at the observed values. The joint p.d.f.
is a function of p. Then, we want to find the value of p that maximizes this joint p.d.f., or
equivalently, we want to find p∗ such that

p∗ = arg max L(x1 , · · · , xn ; p).

p∈Ω

21
22 3 Point estimation

The way to propose p∗ is reasonable because p∗ most likely has produced the sample
values x1 , · · · , xn . We call p∗ the maximum likelihood estimate, since “likelihood” is often
used as a synonym for “probability” in informal contexts.
Conventionally, we denote L(p) = L(x1 , · · · , xn ; p), and p∗ is easier to be computed by

p∗ = arg max log L(p).

p∈Ω

[Note that p maximizes log L(p) also maximizes L(p)]. By simple algebra (see one ex-
ample below), we can show that
1 n
p∗ = ∑ xi ,
n i=1

which maximizes log L(p). The corresponding statistic, namely n−1 ∑ni=1 Xi , is called the
maximum likelihood estimator (MLE) of p; that is,

1 n
p̂ = ∑ Xi .
n i=1

Note that E( p̂ − p)2 → 0 as n → ∞. Thus, p̂ is a “good” estimator of p in some sense.

Definition 3.1. (Likelihood Function) Let X be a random sample with a joint p.d.f.
f(x1 , · · · , xn ; θ ), where the parameter θ is within a certain parameter space Ω . Then,
the likelihood function of this random sample is defined as

L(θ ) = f(X; θ ) = f(X1 , X2 , · · · , Xn ; θ )

for θ ∈ Ω . Moreover, ℓ(θ ) = log L(θ ) is called the log-likelihood function.

Definition 3.2. (Maximum Likelihood Estimator) Given a likelihood function L(θ ) for
θ ∈ Ω , the maximum likelihood estimator (MLE) of θ is defined as

θ̂ = arg max L(θ ) = arg max ℓ(θ ).

θ ∈Ω θ ∈Ω

Definition 3.3. (Maximum Likelihood Estimate) The observed value of θ̂ is called the
maximum likelihood estimate.

Example 3.2. Let X be an independent random sample from a Bernoulli distribution with
parameter p with 0 < p < 1. Find the maximum likelihood estimator of p.

Solution. For the random sample X, the likelihood function is

n
L(p) = ∏ pXi (1 − p)1−Xi .
i=1

Hence, the log-likelihood function is

n
ℓ(p) = log L(p) = ∑ [Xi log p + (1 − Xi ) log(1 − p)]
i=1
n n
= log p · ∑ Xi + log(1 − p) · ∑ (1 − Xi ).
i=1 i=1
3.1 Maximum likelihood estimator 23

Note that
( )
dℓ(p) 1 n 1 n
= ∑ Xi − n − ∑ Xi
dp p i=1 1− p i=1
n n
(1 − p) ∑ Xi − np + p ∑ Xi
i=1 i=1
=
p(1 − p)
n(X − p)
= .
p(1 − p)

It is not hard to see that

dℓ(p)
p < X ⇐⇒ >0
dp
and
dℓ(p)
p > X ⇐⇒ < 0,
dp
which imply that ℓ(p) attains its maximum at p = X. Therefore, the maximum likelihood
estimator of p is X. ⊓
⊔
Example 3.3. Let X be an independent random sample from a uniformly distribution over
the interval [0, β ]. Find the maximum likelihood estimator of β .

Solution. Note that a uniformly distribution over the interval [0, β ] has the p.d.f. given by

 1 , for 0 ≤ x ≤ β , 1
f (x; β ) = β = I(0 ≤ x ≤ β ).
 β
0, otherwise,

For the random sample X, the likelihood function is

n
1 1 n
L(β ) = ∏ I(0 ≤ Xi ≤ β ) = n ∏ I(0 ≤ Xi ≤ β ).
i=1 β β i=1

In order that L(β ) attains its maximum, β must satisfy

0 ≤ Xi ≤ β , i = 1, 2, . . . , n.
1
Since increases as β decreases, we must select β to be as small as possible subject
βn
to the previous constraint. Therefore, the maximum of L(β ) should be selected to be
the maximum of X1 , X2 , . . . , Xn , that is, the maximum likelihood estimator β̂ = X(n) =
max1≤i≤n Xi . ⊓⊔
Example 3.4. Let X be an independent random sample from N(θ1 , θ2 ), where (θ1 , θ2 ) ∈ Ω
and Ω = {(θ1 , θ2 ) : θ1 ∈ R, θ2 > 0}. Find the MLEs of θ1 and θ2 . [Note: here we let
θ1 = µ and θ2 = σ 2 ].

Solution. Let θ = (θ1 , θ2 ). For the random sample X, the likelihood function is
[ ]
n
1 (Xi − θ1 )2
L(θ ) = ∏ √ exp − .
i=1 2πθ2 2θ2
24 3 Point estimation

Then, the log-likelihood function is

n ∑n (Xi − θ1 )2
ℓ(θ ) = log L(θ ) = − log(2πθ2 ) − i=1 .
2 2θ2

As the MLE is the maximizer of ℓ(θ ), it should satisfy that

∂ ℓ(θ ) 1 n
0=
∂ θ1
= ∑ (Xi − θ1 ),
θ2 i=1
∂ ℓ(θ ) n 1 n
0= =− + 2 ∑ (Xi − θ1 )2 .
∂ θ2 2θ2 2θ2 i=1

Solving the two equations above, we obtain that

1 n 1 n
θ1 = X = ∑
n i=1
Xi and θ2 = S2 = ∑ (Xi − X)2 .
n i=1

By considering the usual condition on the second partial derivatives, these solutions do
provide a maximum. Thus, the MLEs of θ1 and θ2 are

θ̂1 = X and θ̂2 = S2 ,

respectively. ⊓
⊔

3.2 Method of moments estimator

The method of moments estimator is often used in practice, especially when we do not
know the full information about X except for its certain moments. Recall that the r-th
moment about the origin of X is defined as µr = EX r . In many situations, µr contains the
information about the unknown parameter θ . For example, if X ∼ N(µ , σ 2 ), we know that

µ1 = EX = µ and µ2 = EX 2 = Var(X) + [E(X)]2 = σ 2 + µ 2 .

or
µ = µ1 and σ 2 = µ2 − µ12 .
That is, the unknown parameters µ and σ 2 can be estimated if we find “good” estimators
for µ1 and µ2 . Note that by the weak law of large numbers (see Theorem 2.1),

1 n 1 n
m1 = ∑
n i=1
Xi → p E(X) = µ1 and m2 = ∑ Xi2 → p E(X 2 ) = µ2 .
n i=1

Thus, it is reasonable to estimate µ and σ 2 by m1 and m2 − m21 , respectively, and these

estimators are called the method of moments estimator.
To generalize the idea above, we assume that the unknown parameter θ ∈ R s can be
expressed by

θ = h(µ1 , µ2 , · · · , µk ), (3.1)
3.2 Method of moments estimator 25

where h : R k → R s . For the illustrating example above, s = 2, k = 2, θ = (µ , σ 2 ) and

h = (h1 , h2 ) with
h1 (µ1 , µ2 ) = µ1 and h2 (µ1 , µ2 ) = µ2 − µ12 .
Define the r-th sample moment of a random sample X by

1 n r
mr = ∑ Xi ,
n i=1
r = 1, 2, . . . .

Unlike µr , mr always exists for any positive integer r. In view of (3.1), the method of
moments estimator (MME) θ̃ of θ is defined by

θ̃ = h(m1 , m2 , · · · , mk ),

and the observed value of θ̃ is called the method of moments estimate.

Example 3.5. Let X be an independent random sample from a gamma distribution with
the p.d.f. given by  α α −1 −λ x
λ x e
, for x > 0;
f (x) = Γ (α )

0, otherwise.
Find a MME of (α , λ ).

Solution. Some simple algebra shows that the first two moments are

α α2 + α
µ1 = and µ2 = .
λ λ2
[Note: µ1 and µ2 can be obtained from Example 2.5.] Substituting α = λ µ1 in the second
equation, we get

(λ µ1 )2 + λ µ1 µ1 µ1
µ2 = = µ12 + or λ = ,
λ 2 λ µ2 − µ12

which implies that

(µ1 )2
α = λ µ1 = .
µ2 − (µ1 )2
Therefore, one MME of (α , λ ) is (α̃ , λ̃ ), where
2
m21 X m1 X
α̃ = = and λ̃ = = .
m2 − m21 1 n 2 m − m2 1 n
∑ ∑
2 2 2
Xi − X 1 Xi2 − X
n i=1 n i=1

⊓
⊔

It is worth noting that the way to construct h in (3.1) is not unique. Usually, we use the
lowest possible order moments to construct f , although this is may not be the optimal way.
To consider the “optimal” MME, one may refer to the generalized method of moments
estimator for a further reading.
26 3 Point estimation

3.3 Estimator properties

For the same unknown parameter θ , many different estimators may be obtained. Heuris-
tically, some estimators are good and others bad. The question is how would we establish
a criterion of goodness to compare one estimator with another? The particular properties
of estimators that we will discuss below are unbiasedness, efficiency, and consistency.

3.3.1 Unbiasedness

Suppose that θ̂ is an estimator of θ . If θ̂ is a good estimator of θ , a fairly desirable

property is that its mean be equal to θ , namely, E(θ̂ ) = θ . That is, in practice, we would
want E(θ̂ ) to be reasonably close to θ .

Definition 3.4. (Unbiased estimator) The bias of an estimator θ̂ is defined as

Bias(θ̂ ) = E(θ̂ ) − θ .

If Bias(θ̂ ) = 0, θ̂ is called an unbiased estimator of θ . Otherwise, it is said to be biased.

Definition 3.5. (Asymptotically unbiased estimator) θ̂ is an asymptotically unbiased es-

timator if
lim Bias(θ̂ ) = lim [E(θ̂ ) − θ ] = 0,
n→∞ n→∞

where n is the sample size.

Example 3.3. (con’t) (i) Show that β̂ = X(n) is an asymptotically unbiased estimator of β ;
(ii) modify this estimator of β to make it unbiased.

Solution. Let Y = X(n) . For 0 ≤ y ≤ β ,

n ( )n
y
P(Y ≤ y) = ∏ P(Xi ≤ y) = .
i=1 β

Thus, by Property 1.10,

∫ ∞ ∫ β[ ( )n ]
y β n+1
E(Y ) = P(Y > y)dy = 1− dy = β −
0 0 β (n + 1)β n
nβ
= → β as n → ∞,
n+1
which implies that Y is an
( asymptotically
) unbiased estimator of β .
n+1 n+1
Furthermore, since E Y = β , we know that Y is an unbiased estimator
n n
of β . ⊓
⊔

Example 3.4. (con’t) Show that X is an unbiased estimator of θ1 , and S2 is an asymptoti-

cally unbiased estimator of θ2 .

Solution. As each Xi ∼ N(θ1 , θ2 ), we have

3.3 Estimator properties 27

1 n 1 n
E(X) = ∑
n i=1
E(Xi ) = ∑ θ1 = θ1 .
n i=1

Hence, X is an unbiased estimator of θ1 .

Next, it is easy to see that
[ ]
1 n 1 n [ ]
2
E(S ) = E ∑
n i=1
(Xi − X) = ∑ E (Xi − X)2 .
2
n i=1

Note that for each i, we have

[ ] [ ]
E (Xi − X)2 = E (X1 − X)2
= Var(X1 − X) (since E(X1 − X) = 0)
( )
X1 + X2 + · · · + Xn
= Var X1 −
n
( )
(n − 1)X1 n
Xi
= Var −∑
n i=2 n
(n − 1)2 (n − 1)
= 2
θ2 + θ2 (by the independence among Xi ’s)
n n2
n−1
= θ2 .
n
Hence, it follows that

1 n [ ] n−1
E(S2 ) = ∑
n i=1
E (Xi − X)2 =
n
θ2 .

As lim E(S2 ) = θ2 , S2 is an asymptotically unbiased estimator of θ2 . ⊓

⊔
n→∞

3.3.2 Efficiency

Suppose that we have two unbiased estimators θ̂ and θ̃ . The question is how to compare
θ̂ and θ̃ in terms of a certain criterion. To answer this question, we first introduce the
so-called mean squared error of a given estimator θ̂ .

Definition 3.6. (Mean squared error) Suppose that θ̂ is an estimator of θ . The mean
squared error of θ̂ is [( )2 ]
MSE(θ̂ ) = E θ̂ − θ .

For a given estimator θ̂ , MSE(θ̂ ) is the mean (expected) value of the square of the
error (difference) θ̂ − θ . This criterion can be decomposed by two parts as shown below.

Property 3.1. If Var(θ̂ ) exists, then the mean squared error of θ̂ is

[ ]2
MSE(θ̂ ) = Var(θ̂ ) + Bias(θ̂ ) .

Proof.
28 3 Point estimation
[( )2 ]
MSE(θ̂ ) = E θ̂ − θ
({[ ] [ ]}2 )
=E θ̂ − E(θ̂ ) + E(θ̂ ) − θ
([ ]2 [ ][ ] [ ]2 )
= E θ̂ − E(θ̂ ) + 2 θ̂ − E(θ̂ ) E(θ̂ ) − θ + E(θ̂ ) − θ
[ ][ ] [ ]2
= Var(θ̂ ) + 2E θ̂ − E(θ̂ ) E(θ̂ ) − θ + Bias(θ̂ )
[ ]2
= Var(θ̂ ) + Bias(θ̂ ) .

⊓
⊔

Remark 3.1. The following result is straightforward:

lim MSE(θ̂ ) = 0 ⇐⇒ lim Var(θ̂ ) = 0 and lim Bias(θ̂ ) = 0.

n→∞ n→∞ n→∞

Heuristically, one wants MSE(θ̂ ) as small as possible. As we discussed above, the

unbiasedness is the desirable property for a certain estimator θ̂ . Thus, it is reasonable to
restrict our attention to only the unbiased estimator θ̂ . In this case,

MSE(θ̂ ) = Var(θ̂ )

by Property 3.1. Now, for two unbiased estimators θ̂ and θ̃ , we only need to select the
one with a smaller variance, and this motivates us to define the efficiency between θ̂ and
θ̃ .

Definition 3.7. (Efficiency) Suppose that θ̂ and θ̃ are two unbiased estimators of θ . The
efficiency of θ̂ relative to θ̃ is defined by

Var(θ̃ )
Eff (θ̂ , θ̃ ) = .
Var(θ̂ )

If Eff (θ̂ , θ̃ ) > 1, then we say that θ̂ is relatively more efficient than θ̃ .

Example 3.6. Let (Xn ; n ≥ 1) be a sequence of independent random variables having the
same finite mean and variance, µ = E(X1 ) and σ 2 = Var(X1 ). We can show that X is an
unbiased estimator of µ , and Var(X) = σn . Suppose that we now take two samples, one
2

(1) (2)
of size n1 and one of size n2 , and denote the sample means as X and X , respectively.
Then,
(2)
(1) (2) Var(X ) n1
Eff (X , X ) = (1)
= .
Var(X ) n2
Therefore, the larger is the sample size, the more efficient is the sample mean for estimat-
ing µ .
n+1
Example 3.3. (con’t) Note that X is an unbiased estimator of β . Show that
n (n)
(i) 2X is also an unbiased estimator of β ;
(ii) Compare the efficiency of these two estimators of β .

Solution. (i) Since E(X) equals the population mean, which is β /2, E(2X) = β . Thus, 2X
is an unbiased estimator of β .
3.3 Estimator properties 29

(ii) First we must find the variance of the two estimators. Recall that Y = X(n) . Before,
we have already obtained
( )n
y
P(Y ≤ y) = for 0 ≤ y ≤ β .
β

Therefore, for Z = Y 2 , it is not hard to show that

( √ )n
√ z
P(Z ≤ z) = P(Y ≤ z) = for 0 ≤ z ≤ β 2 .
β

By Property 1.10, we have

∫ β2 ( √ )n
z
E(Z) = 1− dz
0 β
∫ β ( )
yn √
=2 y 1 − n dy (by setting y = z)
0 β
∫ β( n+1
)
y
=2 y − n dy
0 β
[ 2 ]
β β n+2
=2 −
2 (n + 2)β n
n
= β 2.
n+2
Or by a direct calculation, we can show that
∫ β ∫ β
yn n n β n+2 n
E(Z) = y2 d = yn+1 dy = · = β 2.
0 βn βn 0 βn n+2 n+2
Hence,
( ) [( ) ] [ ( )]2
n+1 n+1 2 n+1
Var Y =E Y − E Y
n n n
( )2
n+1 n
= β2 −β2
n n+2
( 2 )
n + 2n + 1
= −1 β2
n2 + 2n
β2
= .
n(n + 2)

Since Var(Xi ) = β 2 /12 for each i, we have

β2 β2
Var(2X) = 4Var(X) = 4 · = .
12n 3n
Therefore, ( )
n+1 Var(2X) n+2
Eff Y, 2X = ( n+1 ) = .
n Var n Y 3
30 3 Point estimation

Thus, it can be seen that, for n > 1, n+1

n Y is more efficient than 2X, and for n = 1,
n+1
n Y
and 2X have the same efficiency. ⊔ ⊓

Now, we face two important questions:

(i) For a given unbiased estimator θ̃ , could we find another unbiased estimator θ̃∗ ,
which has a smaller variance than θ̃ ?
(ii) Among all unbiased estimators, could we find the uniformly minimum variance
unbiased estimator (UMVUE), that is,

UMVUE = arg min Var(θ̃ ).

θ̃ is unbiased

The first question tells us how to find a more efficient unbiased estimator from an initial
unbiased estimator; the second question tells us the UMVUE is relatively more efficient
than any other unbiased estimators (in other words, the UMVUE is the best unbiased
estimator).

To answer the first question, we need make use of the sufficient statistic.

Definition 3.8. (Sufficient statistic) Suppose that the random sample X has a joint p.d.f.
f(x1 , · · · , xn ; θ ), where θ is the unknown parameter. The statistic T := T (X) is sufficient
for θ if and only if
( )
f(x1 , · · · , xn ; θ ) = g T (x1 , · · · , xn ); θ h(x1 , · · · , xn ),

where g depends on x1 , · · · , xn only through T (x1 , · · · , xn ), and h does not depend on θ .

Remark 3.2. (i) Definition 3.8 is also called Factorization Theorem; (ii) Sufficient statistic
is not unique:

T = T (X) is a sufficent statistic for θ

( )
⇐⇒v(T ) = v T (X) is also a sufficient statistic for θ ,

where v(·) is an invertible function. For example, if T is a sufficient statistic for θ , then
T 3 is also a sufficient statistic for θ , while T 2 is not a sufficient statistic for θ .

Example 3.7. Suppose that X is an independent random sample from a uniform distribu-
tion U(α , β ). Find a sufficient statistic for (α , β ).

Solution. Let θ = (α , β ). The joint p.d.f. of X is

n ( ) ( )n
1 1
f(x1 , · · · , xn ; θ ) = ∏ I(α ≤ xi ≤ β ) = I(α ≤ xi ≤ β , ∀ i = 1, . . . , n)
i=1 β − α β −α
( )n ( ) ( )
1
= I α ≤ min Xi I max Xi ≤ β .
β −α 1≤i≤n 1≤i≤n
( )
Hence, by Definition 3.8, we know that min Xi , max Xi is a sufficient statistic for
1≤i≤n 1≤i≤n
(α , β ). ⊓
⊔

Example 3.8. Suppose that X is an independent random sample from a normal distribution
N(µ , σ 2 ). Find a sufficient statistic for (µ , σ 2 ).
3.3 Estimator properties 31

Solution. Let θ = (µ , σ 2 ). The joint p.d.f. of X is

f(x1 , · · · , xn ; θ )
( )
n
1 (xi − µ )2
=∏√ exp −
i=1 2πσ
2 2σ 2
( )
n
(xi − µ )2
= (2πσ ) exp − ∑
2 − n2
i=1 2σ 2
( )
n
((xi − x) − (µ − x))2
= (2πσ ) exp − ∑
2 − n2
i=1 2σ 2
( ( ))
n n n
1
= (2πσ ) exp − 2 ∑ (xi − x) + ∑ (µ − x) − 2 ∑ (xi − x)(µ − x)
2 − n2 2 2
2σ i=1 i=1 i=1
( ( ))
n
1
= (2πσ 2 )− 2 exp − 2 ∑ (xi − x)2 + n(µ − x)2
n

2σ i=1
( )
1 n ( n )
= (2πσ ) exp − 2 ∑ (xi − x) exp − 2 (µ − x)2
2 − n2 2
2σ i=1 2σ
( n ) ( n )
= (2πσ 2 )− 2 exp − 2 s2 exp − 2 (µ − x)2 .
n

2σ 2σ
Hence, by Definition 3.8, we know that (X, S2 ) is a sufficient statistic for (µ , σ 2 ). ⊓
⊔
When the random sample X is from an exponential family, the sufficient statistic for θ
can be easily found.
Property 3.2. Let X be an i.i.d. random sample from a p.d.f. having the form:
( )
s
f (x; θ ) = h(x)c(θ ) exp ∑ pi (θ )ti (x) (expontial family),
i=1

where θ = (θ1 , θ2 , · · · , θs ) ∈ Θ ⊂ R s . Then,

( )
n n n
T (X) = ∑ t1 (X j ), ∑ t2 (X j ), · · · , ∑ ts (X j )
j=1 j=1 j=1

is a sufficient statistic for θ .

Remark 3.3. The exponential family includes many of the most common distributions,
e.g., (1) normal; (2) exponential; (3) gamma; (4) chi squared; (5) beta; (6) Bernoulli; (7)
Poisson; (8) geometric; ...
Example 3.8. (con’t) Note that
( )
1 (x − µ )2
f (x; θ ) = √ exp −
2πσ 2 2σ 2
( 2 )
1 x + µ 2 − 2µ x
=√ exp −
2πσ 2 2σ 2
[ ( )] ( )
1 µ 2 nµ x n x2
= √ exp − 2 exp − ,
2πσ 2 2σ σ 2 n 2σ 2 n
32 3 Point estimation
[ ( )]
exp − 2µσ 2 , p1 (θ ) = nµ
2
which implies that h(x) = 1, c(θ ) = √ 1
σ2
, p2 (θ ) = − 2σn 2 ,
2πσ 2
2
t1 (x) = nx , and t2 (x) = xn . By Property 3.2, we know that (X, X 2 ) is a sufficient statistic
for θ .
(Note that we have found both (X, S2 ) and (X, X 2 ) are sufficient statistics for θ , and these
two sufficient statistics have the following link: (X, S2 ) = v(X, X 2 ), where the mapping v
defined by
v
(z1 , z2 ) −→ (z1 , z2 − z21 )
is invertible. Here, we have used the fact S2 = X 2 − (X)2 .) ⊓
⊔

Theorem 3.1. (Rao-Blackwell Theorem) Let θ̃ be an unbiased estimator of θ with E(θ̃ 2 ) <
∞, and T := T (X) be a sufficient statistic for θ . Let w(t) = E(θ̃ |T = t). Then, θ̃∗ = w(T )
is an unbiased estimator of θ and Var(θ̃∗ ) ≤ Var(θ̃ ).

The previous theorem shows that by using the sufficient statistic T , we can always
get a better unbiased estimator (in terms of efficiency) from an initial unbiased estimator.
Moreover, this theorem implies that

the UMVUE is a function of the sufficient statistic T .

Next, we turn to the second question on how to find the UMVUE. The following prop-
erty tells us that the UMVUE is unique.

Property 3.3. If θ̂ is the UMVUE of θ , then θ̂ is unique.

Intuitively, how to find the UMVUE is not an easy task. Below, we offer two ap-
proaches towards this goal.

3.3.2.1 UMVUE: complete and sufficient statistic method

The first approach to find the UMVUE is based on a complete and sufficient statistic. We
have already introduced the sufficient statistic. Below, we introduce the complete statistic.

Definition 3.9. (Complete statistic) For a given random sample X, T := T (X) is a com-
plete statistic of θ if

E [z(T )] = 0 for all θ implies P (z(T ) = 0) = 1 for all θ .

When the random sample X is from an exponential family, the complete statistic (as
the sufficient statistic in Property 3.2) for θ can be easily found.

Property 3.4. Let X be an i.i.d. random sample from a p.d.f. having the form:
( )
s
f (x; θ ) = h(x)c(θ ) exp ∑ pi (θ )ti (x) (expontial family),
i=1

where θ = (θ1 , θ2 , · · · , θs ) ∈ Θ ⊂ R s . Then,

( )
n n n
T (X) = ∑ t1 (X j ), ∑ t2 (X j ), · · · , ∑ ts (X j )
j=1 j=1 j=1
3.3 Estimator properties 33

is a complete statistic for θ as long as the parameter Θ contains an open set in R s .

From Properties 3.2 and 3.4, we know that if X is from an exponential family and the
related parameter Θ contains an open set in R s , then
( )
n n n
T (X) = ∑ t1 (X j ), ∑ t2 (X j ), · · · , ∑ ts (X j )
j=1 j=1 j=1

is a complete and sufficient statistic for θ . If X is not from an exponential family, we need
to check whether a statistic is complete or sufficient for θ by definition.

Example 3.8. (con’t) It is easy to see that (X, X 2 ) is also a complete statistic for θ . ⊓
⊔

The following theorem tells us the relationship between the complete and sufficient
statistic and the UMVUE.

Theorem 3.2. Let T := T (X) be a complete and sufficient statistic for θ , and ϕ (T ) be
any estimator based only on T . Then, ϕ (T ) is the unique UMVUE of its expected value
E[ϕ (T )].

From the preceding theorem, we know that if T is a complete and sufficient statistic
for θ and E[ϕ0 (T )] = θ for some functional ϕ0 (·), then ϕ0 (T ) is the UMVUE of θ . In
other words, the procedure to find the UMVUE is as follows:

(i) Find a complete and sufficient statistic T for θ ; (3.2)

(ii) Find a functional ϕ0 (·) such that E[ϕ0 (T )] = θ . (3.3)

By Theorem 3.2, the unbiased estimator ϕ0 (T ) in (3.3) is the UMVUE of θ .

Example 3.8. (con’t) We have shown that (X, X 2 ) is a complete and sufficient statistic for
θ . Next, since
( )
n 2 n 2 n ( 2 2
)
E(X) = µ , E S = σ 2 and S = X −X ,
n−1 n−1 n−1

n 2 1 n
by Theorem 3.2, we know that X is the UMVUE of µ , and
n−1
S = ∑ (Xi − X)2
n − 1 i=1
is the UMVUE of σ 2 . ⊓
⊔

Example 3.9. Let X be an i.i.d. random sample from Poisson distribution with parameter
λ . Find the UMVUE of λ .

Solution. The p.d.f. of the Poisson distribution is

e−λ λ x e−λ log(λ x ) e−λ x log(λ ) e−λ x n log(λ )

= e = e = en . (3.4)
x! x! x! x!
Hence, by Properties 3.2 and 3.4, we know that

1 n
T (X) = X = ∑ Xi
n i=1
34 3 Point estimation

is a complete and sufficient statistic for λ . Moreover, since E(X) = λ , by Theorem 3.2,
we know that X is the UMVUE of λ . ⊓ ⊔

3.3.2.2 UMVUE: CRLB method

The second approach to find the UMVUE is based on the following procedure:

(i) Find the lower bound of Var(θ̃ ) for all unbiased estimators; (3.5)
(ii) Find an unbiased estimator θ̂ whose variance achieves this lower bound. (3.6)

Clearly, θ̂ in (3.6) is the UMVUE of θ . It is worth noting that conditions (3.5)-(3.6) are not
necessary for the UMVUE, since there are some cases that the UMVUE can not achieve
the lower bound in (3.5).
To consider the lower bound of Var(θ̃ ), we need introduce the Fisher information.
Definition 3.10. (Fisher information) The Fisher information about θ is defied as
[( ) ]
∂ ℓ(θ ) 2
In (θ ) = E ,
∂θ

where ℓ(θ ) = log L(θ ) is the log-likelihood function of the random sample.
Theorem 3.3. Let X be an independent random sample from a population with the p.d.f.
f (x; θ ). Then, under certain regularity conditions, we have the following conclusions.
(i) In (θ ) = nI(θ ), where
[( ) ]
∂ log f (X; θ ) 2
I(θ ) = E ,
∂θ

and X has the same[ distribution as]the population;

∂ 2 log f (X; θ )
(ii) I(θ ) = −E ;
∂θ2
(iii) Cramer-Rao inequality:
1
Var(θ̂ ) ≥ ,
In (θ )

where θ̂ is an unbiased estimator of θ , and 1

In (θ ) is called the Cramer-Rao lower bound
(CRLB).

Proof. (i) Let ∂ log∂fθ(X;θ ) be the score function. Under certain regularity conditions, it can
be shown that the first moment of the score is
[ ] [ ]
∂
∂ log f (X; θ ) ∂ θ f (X; θ )
E =E
∂θ f (X; θ )
∫ ∂
f (x; θ )
∂θ
= f (x; θ )dx
f (x; θ )
∫ ∫
∂ ∂ ∂
= f (x; θ )dx = f (x; θ )dx = 1 = 0.
∂θ ∂θ ∂θ
3.3 Estimator properties 35

Hence, by the independence of X1 , · · · , Xn , it follows that

[( ) ]
∂ ℓ(θ ) 2
In (θ ) = E
∂θ
( )2 
n
∂ log f (X ; θ )
= E ∑ 
i

i=1 ∂θ
[ (  
) ]
n
∂ log f (Xi ; θ ) 2 n n
∂ log f (X ; θ ) ∂ log f (X ; θ )
=E ∑ + E ∑ ∑ 
i j

i=1 ∂θ i=1 j=1 and j̸=i ∂θ ∂θ

[ ( ) ]
n
∂ log f (Xi ; θ ) 2
=E ∑
i=1 ∂θ
= nI(θ ).

(ii) Under certain regularity conditions, it can be shown that

[ 2 ]
∂
E log f (X; θ )
∂θ2
 ( )2 
∫  ∂ 2 f (x; θ ) ∂ 
∂θ f (x; θ )
− ∂θ
2
= f (x; θ )dx
 f (x; θ ) f (x; θ ) 
∫ ∫ ( )2
∂2 ∂
= f (x; θ )dx − log f (x; θ ) f (x; θ )dx
∂θ2 ∂θ
∫
∂2
= f (x; θ )dx − I(θ )
∂θ2
∂2
= 1 − I(θ )
∂θ2
= −I(θ ).

(iii) Recall f(θ ) := f(x1 , · · · , xn ; θ ) = f (x1 ; θ ) · · · f (xn ; θ ) is the joint p.d.f. of X. For
any unbiased estimator θ̂ , we can write θ̂ = g(X) := g(X1 , · · · , Xn ) for some functional g.
Then, we have
∫ ∫
0 = E(θ̂ − θ ) = ··· [g(x1 , · · · , xn ) − θ ]f(θ )dx1 · · · dxn .

Differentiating both sides of the preceding equation, we obtain that

∫ ∫
∂ f(θ )
0= ··· −f(θ ) + [g(x1 , · · · , xn ) − θ ] dx1 · · · dxn ,
∂θ
which implies that
∫ ∫
∂ f(θ )
1= ··· [g(x1 , · · · , xn ) − θ ] dx1 · · · dxn .
∂θ
∫ ∫ ∫
Using the Cauchy-Schwarz Inequality: ( s1 (x)s2 (x)dx)2 ≤ s21 (x)dx s22 (x)dx, we have
36 3 Point estimation
{∫ ∫ }2
∂ f(θ )
1= · · · [g(x1 , · · · , xn ) − θ ] dx1 · · · dxn
∂θ
{∫ ∫
[ ] }2
√ 1 ∂ f(θ )
= · · · [g(x1 , · · · , xn ) − θ ] f(θ ) √ dx1 · · · dxn
f(θ ) ∂ θ
{∫ ∫ [ ] }2
√ √ ∂ log f(θ )
= · · · [g(x1 , · · · , xn ) − θ ] f(θ ) f(θ ) dx1 · · · dxn
∂θ
∫ ∫ ∫ ∫ ( )
∂ log f(θ ) 2
≤ · · · [g(x1 , · · · , xn ) − θ ] f(θ )dx1 · · · dxn × · · · f(θ )
2
dx1 · · · dxn .
∂θ

Hence, it gives us that 1 ≤ Var(θ̂ ) × In (θ ), which implies that the Cramer-Rao inequality
holds. ⊓⊔

Remark 3.4. From the proof above, we can find that “=” holds if and only if there exists a
constant A such that
√ √ ∂ log f(θ )
A[g(x1 , · · · , xn ) − θ ] f(θ ) = f(θ ) for all x1 , x2 , · · · , xn
∂θ
∂ log f(X1 , X2 , · · · , Xn ; θ )
⇐⇒A[g(X1 , · · · , Xn ) − θ ] = (with probability one)
∂θ
∂ log L(θ )
⇐⇒A[θ̂ − θ ] = (with probability one). (3.7)
∂θ
Equation (3.7) is called the attainable condition for the CRLB. In other words, if θ̂ can
achieve the lower bound, it must satisfy (3.7).
1
Corollary 3.1. If θ̂ is an unbiased estimator of θ and Var(θ̂ ) = , then θ̂ is the
In (θ )
UMVUE of θ .

Corollary 3.1 tells us how to find the UMVUE by the method of CRLB.

Example 3.10. Show that X is the UMVUE of the mean of a normal population N(µ , σ 2 ).

Solution. We use the CRLB method to do this. For −∞ < x < ∞,

[ ( ) ]
1 1 x−µ 2
f (x; µ ) = √ exp − ,
σ 2π 2 σ

which implies that

( )2
√ 1 x−µ
log f (x; µ ) = − log(σ 2π ) − ,
2 σ
∂ log f (x; µ ) x − µ
and = .
∂µ σ2
Therefore, [( )2 ] [ ]
∂ log f (X; µ ) (X − µ )2 1
I(µ ) = E =E = 2
∂µ σ 4 σ
or
3.3 Estimator properties 37
[ ] ( )
∂ 2 log f (X; µ ) −1 1
I(µ ) = −E = −E = 2.
∂ µ2 σ2 σ
Hence,
1 1 σ2
CRLB = = = .
In (µ ) nI(µ ) n

Recall that E(X) = µ and Var(X) = σn . Thus, X is the UMVUE of µ .

(A further question: what is the CRLB with respect to σ 2 ? Is this CRLB attainable?) ⊓
⊔

Example 3.11. Show that X is the UMVUE of the parameter θ of a Bernoulli population.

Solution. We use the CRLB method to do this. For x = 0 or 1,

f (x; θ ) = θ x (1 − θ )1−x ,

which implies that

∂ log f (x; θ ) ∂
= [x log θ + (1 − x) log(1 − θ )]
∂θ ∂θ
x 1−x
= −
θ 1−θ
x 1
= − .
θ (1 − θ ) 1 − θ

Noting that [ ]
X 1
E = ,
θ (1 − θ ) 1−θ
we have
[( )2 ] [ ]
∂ log f (X; θ ) X θ (1 − θ ) 1
I(θ ) = E = Var = 2 = .
∂θ θ (1 − θ ) θ (1 − θ ) 2 θ (1 − θ )

Hence,
1 1 θ (1 − θ )
CRLB = = = .
In (θ ) nI(θ ) n
θ (1−θ )
Since E(X) = θ and Var(X) = n , X is the UMVUE of θ . ⊓
⊔

If we know the full information about the population distribution X, the following
theorem tells us that the MLE tends to be the first choice asymptotically.

Theorem 3.4. Suppose that θ̂ is the MLE of a parameter θ of a population distribution.

Then, under certain regular conditions, as n → ∞,

θ̂ − θ
√ →d N(0, 1).
1/In (θ )

1
If θ̂ is an unbiased estimator of θ , the above theorem implies that Var(θ̂ ) ≈
In (θ )
when n is large. That is, the MLE θ̂ can achieve the CRLB asymptotically.
38 3 Point estimation

3.3.3 Consistency

In the previous discussions, we have restricted our attention to the unbiased estimator, and
proposed a way to check whether an unbiased estimator is UMVUE. Now, we introduce
another property of the estimator called the consistency.
Definition 3.11. (Consistent estimator) θ̂ is a consistent estimator of θ , if

θ̂ → p θ ,

that is, for any ε > 0, ( )

P |θ̂ − θ | > ε → 0 as n → ∞.
Note that the definition of the convergence in probability is given in Definition 2.2.
Property 3.5. If θ̂ is an unbiased estimator of a parameter θ and Var(θ̂ ) → 0 as n → ∞,
then θ̂ is a consistent estimator of θ .
Property 3.6. If θ̂ is an asymptotically unbiased estimator of a parameter θ and Var(θ̂ ) →
0 as n → ∞, then θ̂ is a consistent estimator of θ .
We shall mention that the unbiasedness along does not imply the consistency. A toy
example is as follows. Suppose that

θ̂ = I(0 < X1 < 1/2) − I(1/2 < X1 < 1),

where X1 ∼ U(0, 1). Then, E(θ̂ ) = 0, i.e., θ̂ is an unbiased estimator of θ = 0. But, as θ̂

takes value of either 1 or -1, θ̂ ̸→ p 0.
Property 3.7. If θ̂ → p θ and θ̃ → p θ ′ , then
(i) θ̂ ± θ̃ → p θ ± θ ′ ;
(ii) θ̂ · θ̃ → p θ · θ ′ ;
(iii) θ̂ /θ̃ → p θ /θ ′ assuming that θ̃ ̸= 0 and θ ′ ̸= 0;
(iv) if g is any real-valued function that is continuous at θ , g(θ̂ ) → p g(θ ).
Example 3.12. Suppose that X is an independent random sample from a population with
the finite mean µ = E(X1 ), finite variance σ 2 = Var(X1 ), and finite fourth moment
µ4 = E(X14 ). Show that X is a consistent estimator of µ , and S2 is a consistent estima-
tor of σ 2 .

Solution. Note that E(X) = µ and Var(X) = σn → 0 as n → ∞. Hence, by Property 3.5,

X is a consistent estimator of µ (This is just the weak law of large numbers in Theorem
2.1).
For S2 , we have S2 = X 2 − (X)2 . By the weak law of large numbers, we have

1 n 2
X2 = ∑ Xi → p µ2 = E(X12 ).
n i=1

As X → p µ , Property 3.7(iv) implies that

(X)2 → p µ 2 .

Therefore, by Property 3.7(i), S2 → p µ2 − µ 2 = σ 2 . ⊓

⊔
3.3 Estimator properties 39

Remark 3.5. If X is from a N(µ , σ 2 ) population, we can show that

nS2
= χn−1
2
,
σ2
2
where χk2 is a chi-square distribution with k degrees of freedom. Therefore, E( nS
σ2
)=
n − 1, which implies that E(S ) = n σ → σ as n → ∞. That is, S is an asymptotically
2 n−1 2 2 2

unbiased estimator of σ 2 .
2
Moreover, since Var( nS
σ2
) = 2(n − 1), we can obtain that
( )
σ 2 nS2 2σ 4 (n − 1)
2
Var(S ) = Var · = → 0 as n → ∞.
n σ2 n2

Hence, by Property 3.6, S2 is a consistent estimator of σ 2 .

Example 3.13. Let X be an independent random sample from a population with a p.d.f.
2x
f (x; θ ) = I(0 < x ≤ θ ).
θ2
(i) Find the UMVUE of θ ;
(ii) Show that this UMVUE is a consistent estimator of θ ;
(iii) Find the MLE of θ ;
(iv) Find a MME of θ ;
(v) Will the MLE be better than the MME in terms of efficiency?

Solution. (i) Since f (x; θ ) is not continuous with respect to θ , the method of CRLB does
not work. We use the method based on a complete and sufficient statistic. Note that f (x; θ )
does not belong to the exponential family (Why?). Below, we look for a complete and
sufficient statistic for θ by definition.
First, the joint p.d.f. of X is

f(x1 , x2 , · · · , xn ; θ ) = f (x1 ; θ ) f (x2 ; θ ) · · · f (xn ; θ )

n
2xi
= ∏ 2 I(0 < xi < θ )
i=1 θ
[ ][ ]
n n
2
= ∏ xi I(xi > 0) ∏ 2 I(xi ≤ θ )
i=1 i=1 θ
[ ] [( ) ]
n
2 n n
= ∏ xi I(xi > 0)
θ2 ∏
I(xi ≤ θ )
i=1 i=1
[ ] [( ) ( )]
n
2 n
= ∏ xi I(xi > 0) I max xi ≤ θ .
i=1 θ2 1≤i≤n

By Definition 3.8, we know that T := max Xi is a sufficient statistic for θ .

1≤i≤n
Second, we show that T is also a complete statistic for θ . Let z(·) be a functional such
that E(z(T )) = 0 for all θ > 0, i.e.,
∫ θ ∫ θ
2nt 2n−1 2n
0 = E(z(T )) = z(t) dt = 2n z(t)t 2n−1 dt (3.8)
0 θ 2n θ 0
40 3 Point estimation

2nt 2n−1
for all θ > 0, where we have used the fact that T has a p.d.f. f (t) = for t ∈ (0, θ )
θ 2n
(Why?). ∫
By (3.8), we can get that 0θ z(t)t 2n−1 dt = 0 for all θ > 0, which implies that z(θ )θ 2n−1 =
0 and hence z(θ ) = 0. Therefore, by Definition 3.9, we know that T is a complete statistic
for θ .
Third, we can show that
∫ θ ∫ θ
2nt 2n−1 2n 2n
E(T ) = t dt = t 2n dt = θ,
0 θ 2n θ 2n 0 2n + 1
2n + 1
which implies that Y := T is the UMVUE of θ by Theorem 3.2.
2n
(ii) By simple calculation, we have
∫ θ( ) ∫ θ
2n + 1 2 2nt 2n−1 (2n + 1)2 (2n + 1)2 2
E(Y 2 ) = t dt = t 2n+1 dt = θ ,
0 2n θ 2n (2n)θ 2n 0 (2n)(2n + 2)

which implies that

(2n + 1)2 2
Var(Y ) = E(Y 2 ) − [E(Y )]2 = θ −θ2 → 0
(2n)(2n + 2)

as n → ∞. Therefore, by Property 3.5, Y is a consistent estimator of θ .

(iii) The MLE of θ is T (Why?).
(iv) Note that
∫ θ
2x 2
E(X) = x 2 dx = θ .
0 θ 3
3
Hence, X is a MME of θ .
2
(v) This is left as an exercise. ⊓⊔
Chapter 4
Interval estimation

4.1 Basic concepts

In the previous chapter, we have learned how to construct a point estimator for a unknown
parameter θ , leading to the guess of a single value as the value of θ . However, a point
estimator for θ does not provide much information about the accuracy of the estimator. It
is desirable to generate a narrow interval that will cover the unknown parameter θ with a
large probability (confidence). This motivates us to consider the interval estimator in this
chapter.
Definition 4.1. (Interval estimator) An interval estimator of θ is a random interval
[L(X),U(X)], where L(X) := L(X1 , · · · , Xn ) and U(X) := U(X1 , · · · , Xn ) are two statis-
tics such that L(X) ≤ U(X) with probability one.
Definition 4.2. (Interval estimate) If X = x is observed, [L(x),U(x)] is the interval esti-
mate of θ .
Although the definition is based on a closed interval [L(X),U(X)], it will sometimes
be more natural to use an open interval (L(X),U(X)), a half-open and half-closed interval
(L(X),U(X)] (or [L(X),U(X))), or an one-sided interval (∞,U(X)] (or [L(X), ∞)).
The next example shows that compared to the point estimator, the interval estimator
can have some confidence (or guarantee) of capturing the parameter of interest, although
it gives up some precision.
Example 4.1. For an independent random sample X1 , X2 , X3 , X4 from N(µ , 1), consider an
interval estimator of µ by [X − 1, X + 1]. Then, the probability that µ is covered by the
interval [X − 1, X + 1] can be calculated by
( ) ( )
P µ ∈ [X − 1, X + 1] = P X − 1 ≤ µ ≤ X + 1
( )
= P −1 ≤ X − µ ≤ 1
( )
X −µ
= P −2 ≤ √ ≤2
1/4
= P (−2 ≤ Z ≤ 2)
≈ 0.9544,

where Z ∼ N(0, 1) and we have used the fact that X ∼ N(µ , 1/4). Thus, we have over
a 95% change of covering the unknown parameter with our interval estimator. Note that

41
42 4 Interval estimation

for any point estimator µ̂ of µ , we have P(µ̂ = µ ) = 0. Sacrificing some precision in

the interval estimator, in moving from a point to an interval, has resulted in increased
confidence that our assertion about µ is correct. ⊓
⊔

The certainty of the confidence (or guarantee) is quantified in the following definition.

Definition 4.3. (Confidence coefficient) For an interval estimator [L(X),U(X)] of θ , the

confidence coefficient of [L(X),U(X)], denoted by (1 − α ), is

1 − α = P(θ ∈ [L(X),U(X)]),

where P(θ ∈ [L(X),U(X)]) is the coverage probability of [L(X),U(X)].

Remark 4.1. In some situations, the coverage probability P(θ ∈ [L(X),U(X)]) may de-
pend on θ , and then the the confidence coefficient is defined as

1 − α = inf P(θ ∈ [L(X),U(X)]).

Interval estimator, together with a measure of confidence (say, the confidence coef-
ficient), is sometimes known as confidence interval. So, the terminologies of interval
estimators and confidence intervals are interchangeable. A confidence interval with con-
fidence coefficient equal to 1 − α , is called a 1 − α confidence interval.
For example, a 95% (i.e., α = 0.05) confidence interval means that if 100 confidence
intervals were constructed based on 100 different samples from the same population, we
would expect 95 of the intervals to contain θ .
Now, the question is how to construct the interval estimator. One important way to do
it is using the pivotal quantity.

Definition 4.4. (Pivotal Quantity) A random variable Q(X, θ ) = Q(X1 , · · · , Xn , θ ) is a piv-

otal quantity if the distribution of Q(X, θ ) is free of θ . That is, regardless of the distribu-
tion of X, Q(X, θ ) has the same distribution for all values of θ .

Logically, when Q(X, θ ) is a pivotal quantity, we can easily construct a 1 − α confi-

dence interval for Q(X, θ ) by
( )
1−α = P L eα ≤ Q(X, θ ) ≤ Ueα , (4.1)

where Leα and Ueα do not depend on θ . Suppose that the inequalities L
eα ≤ Q(X, θ ) ≤ Ueα
in (4.1) are equivalent to the inequalities L(X) ≤ θ ≤ U(X), Then, from (4.1), a 1 − α
confidence interval of θ is [L(X),U(X)].
In the rest of this section, we will use the pivotal quantity method to construct our
interval estimators.

4.2 Confidence intervals for means

4.2.1 One-sample case

Let X = {X1 , · · · , Xn } be an independent random sample from the population N(µ , σ 2 ).

We first consider the interval estimator of µ when σ 2 is known. Note that
4.2 Confidence intervals for means 43

X ∼ N(µ , σ 2 /n). (4.2)

X −µ
Hence, when σ 2 is known, Z = √ ∼ N(0, 1) is a pivotal quantity involving µ . Let
σ/ n
( )
1 − α = P −zα /2 ≤ Z ≤ zα /2
( )
X −µ
= P −zα /2 ≤ √ ≤ zα /2
σ/ n
( )
σ σ
= P X − zα /2 √ ≤ µ ≤ X + zα /2 √ ,
n n

where zα satisfies
P(Z ≥ zα ) = α
for Z ∼ N(0, 1). Usually, we call zα the upper percentile of N(0, 1) at the level α ; see Fig.
4.1. So, when σ 2 is known, a 1 − α confidence interval of µ is
[ ]
σ σ
X − zα /2 √ , X + zα /2 √ . (4.3)
n n

Given the observed value of X = x and the value of zα /2 , we can calculate the interval

area is 1 − α

area is α/2 area is α/2

−zα/2 zα/2

Fig. 4.1 Upper percentile of N(0, 1) distribution

estimate of µ by
[ ]
σ σ
x − zα /2 √ , x + zα /2 √ .
n n

As the point estimator, the 1 − α confidence interval is also not unique. Ideally, we should
choose it as narrow as possible in some sense, but in practice, we usually choose the
44 4 Interval estimation

equal-tail confidence interval as in (4.3) for convenience, since tables for selecting equal
probabilities in the two tails are readily available.
Example 4.2. A publishing company has just published a new college textbook. Before
the company decides the price of the book, it wants to know the average price of all such
textbooks in the market. The research department at the company took a sample of 36
such textbooks and collected information on their prices. This information produced a
mean price of $48.40 for this sample. It is known that the standard deviation of the prices
of all such textbooks is $4.50. Construct a 90% confidence interval for the mean price of
all such college textbooks assuming that the underlying population is normal.

Solution. From the given information, n = 36, x = 48.40 and σ = 4.50. Now, 1 − α = 0.9,
i.e., α = 0.1, and by (4.3), the 90% confidence interval for the mean price of all such
college textbooks is given by
σ σ 4.50 4.50
[x − zα /2 √ , x + zα /2 √ ] = [48.40 − z0.05 √ , 48.40 + z0.05 √ ]
n n 36 36
≈ [47.1662, 49.6338].

⊓
⊔
Example 4.3. Suppose the bureau of the census and statistics of a city wants to estimate
the mean family annual income µ for all families in the city. It is known that the standard
deviation σ for the family annual income is 60 thousand dollars. How large a sample
should the bureau select so that it can assert with probability 0.99 that the sample mean
will differ from µ by no more than 5 thousand dollars?

Solution. From the construction of a confidence interval, we have

( ) ( )
σ σ σ
1 − α = P −zα /2 √ ≤ X − µ ≤ zα /2 √ = P |X − µ | ≤ zα /2 √ ,
n n n

where 1 − α = 0.99 and σ = 60 thousand dollars. It suffices to have zα /2 √σn ≤ 5 or

( )2 ( )2
60zα /2 60 × 2.576
n≥ = ≈ 955.5517.
5 5

Thus, the sample size should be at least 956. (Note that we have to round 955.5517 up to
the next higher integer. This is always the case when determining the sample size.) ⊓
⊔
Next, we consider the interval estimator of µ when σ 2 is unknown. To find a pivotal
quantity, we need to use the following result.
Property 4.1. (i) X and S2 are independent;
( )2
nS2 ∑n Xi − X
(ii) 2 = i=1 2 is χn−1
2 , where χ 2 is a chi-square distribution with k degrees
σ σ k
of freedom;
X −µ
(iii) T = √ is tn−1 , where tk is a t distribution with k degrees of freedom.
S/ n − 1
k
Property 4.2. (i) If Z1 , · · · , Zk are k independent N(0, 1) random variables, then ∑ Zi2 is
i=1
χk2 ;
4.2 Confidence intervals for means 45

Z
(ii) If Z is N(0, 1), U is χk2 , and Z and U are independent, then T = √ is tk .
U/k
Proof of Property 4.1. (i) The proof of (i) is out of the scope of this course;
(ii) Note that
( )2 [ ]2
n
Xi − µ n
Xi − X X − µ
W=∑ =∑ +
i=1 σ i=1 σ σ
( ) 2 n ( )2
n
Xi − X X −µ
=∑ +∑
i=1 σ i=1 σ
nS2
= + Z2,
σ2
where we have used the fact that the cross-product term is equal to
n
(X − µ )(Xi − X) 2(X − µ ) n ( )
2∑ = ∑ Xi − X = 0.
i=1 σ 2 σ 2
i=1

nS2
Note that W is χn2 and Z 2 is χ12 by Property 4.2(i). Since and Z 2 are independent by
σ2
nS2
(i), we can show that the m.g.f. of 2 is the same as the one of χn−1
2 . Hence, (ii) holds.
σ
(iii) Note that √
(X − µ )/(σ / n)
T=√ √ .
nS2 /σ 2 1/(n − 1)
Hence, by Property 4.2(ii), T is tn−1 . ⊓
⊔
From Property 4.1(iii), we know that T is a pivotal quantity of µ . Let
( )
1 − α = P −tα /2,d f =n−1 ≤ T ≤ tα /2,d f =n−1
( )
X −µ
= P −tα /2,d f =n−1 ≤ √ ≤ tα /2,d f =n−1
S/ n − 1
( )
S S
= P X − tα /2,d f =n−1 √ ≤ µ ≤ X + tα /2,d f =n−1 √ ,
n−1 n−1
where tα ,d f =k satisfies
P(T ≥ tα ,d f =k ) = α
for a random variable T ∼ tk ; see Fig. 4.2. So, when σ 2 is unknown, a 1 − α confidence
interval of µ is
[ ]
S S
X − tα /2,d f =n−1 √ , X + tα /2,d f =n−1 √ . (4.4)
n−1 n−1

Given the observed value of X = x, S = s, and the value of tα /2,d f =n−1 , we can calculate
the interval estimate of µ by
[ ]
s s
x − tα /2,d f =n−1 √ , x + tα /2,d f =n−1 √ .
n−1 n−1
46 4 Interval estimation

area is 1 − α

area is α/2 area is α/2

−tα/2,df=n tα/2,df=n

Fig. 4.2 Upper percentile of tn distribution

Remark 4.2. Usually there is a row with ∞ degrees of freedom in a t-distribution table,
which actually shows values of zα . In fact, when n → ∞, the distribution function of tn
tends to that of N(0, 1); see Fig. 4.3. That is, in tests or exams, if n is so large that the
value of tα ,d f =n cannot be found, you may use zα instead.

t3
t10
t20
N (0, 1)

Fig. 4.3 Distributions of tn and N(0, 1)

4.2 Confidence intervals for means 47

Example 4.4. A paint manufacturer wants to determine the average drying time of a new
brand of interior wall paint. If for 12 test areas of equal size he obtained a mean drying
time of 66.3 minutes and a standard deviation of 8.4 minutes, construct a 95% confidence
interval for the true population mean assuming normality.

Solution. As n = 12, x = 66.3, s = 8.4, α = 1 − 0.95 = 0.05 and tα /2,d f =n−1 = t0.025,11 ≈
2.201, the 95% confidence interval for µ is
[ ]
8.4 8.4
66.3 − 2.201 × √ , 66.3 + 2.201 × √ ,
12 − 1 12 − 1

that is, [61.1722, 71.4278]. ⊓

⊔
Example 4.5. Construct a 95% confidence interval for the mean hourly wage of apprentice
geologists employed by the top 5 oil companies. For a sample of 50 apprentice geologists,
x = 14.75 and s = 3.0 (in dollars).

Solution. As n = 50, α = 1 − 0.95 = 0.05, t0.025,d f =49 ≈ 2.010, we have

s 3.0
tα /2,d f =n−1 √ ≈ 2.010 × √ = 0.8614.
n 50 − 1

Thus, the 95% confidence interval is [14.75 − 0.86, 14.75 + 0.86], or [13.89, 15.59]. ⊓
⊔

4.2.2 Tow-sample case

Besides the confidence interval for the mean of one single normal distribution, we shall
also consider the problem of constructing confidence intervals for the difference of the
means of two normal distributions when the variances are unknown.
Let X = {X1 , X2 , · · · , Xn } and Y = {Y1 ,Y2 , · · · ,Ym } be random samples from indepen-
dent distributions N(µX , σX2 ) and N(µY , σY2 ), respectively. We are of interest to construct
the confidence interval for µX − µY when σX2 = σY2 = σ 2 .
First, we can show that
(X −Y ) − (µX − µY )
Z= √
σ 2 /n + σ 2 /m
is N(0, 1). Also, by the independence of X and Y, from Property 4.1(ii), we know that

nSX2 mSY2
U= + 2
σ2 σ
is χn+m−2
2 . Moreover, by Property 4.1(i), Z and U are independent. Hence,

Z
T=√
U/(n + m − 2)
√
[(X −Y ) − (µX − µY )]/ σ 2 /n + σ 2 /m
= √
(nSX2 + mSY2 )/[σ 2 (n + m − 2)]
(X −Y ) − (µX − µY )
=
R
48 4 Interval estimation

is tn+m−2 , where √ ( )
nSX2 + mSY2 1 1
R= + .
n+m−2 n m
That is, T is a pivotal quantity of µX − µY . Let
( )
1 − α = P −tα /2,d f =n+m−2 ≤ T ≤ tα /2,d f =n+m−2
( )
(X −Y ) − (µX − µY )
= P −tα /2,d f =n+m−2 ≤ ≤ tα /2,d f =n+m−2
R
( )
= P (X −Y ) − tα /2,d f =n+m−2 R ≤ µX − µY ≤ (X −Y ) + tα /2,d f =n+m−2 R ,

So, when σX2 = σY2 = σ 2 is unknown, a 1 − α confidence interval of µX − µY is

[ ]
(X −Y ) − tα /2,d f =n+m−2 R, (X −Y ) + tα /2,d f =n+m−2 R . (4.5)

Given the observed value of X = x, Y = y, SX = sX , SY = sY , and the value of tα /2,d f =n+m−2 ,

we can calculate the interval estimate of µX − µY by
[ ]
(x − y) − tα /2,d f =n+m−2 r, (x − y) + tα /2,d f =n+m−2 r .

where √ ( )
ns2X + msY2 1 1
r= + .
n+m−2 n m

Example 4.6. Suppose that scores on a standardized test in mathematics taken by students
from large and small high schools are N(µX , σ 2 ) and N(µY , σ 2 ), respectively, where σ 2 is
unknown. If a random sample of n = 9 students from large high schools yielded x̄ = 81.31,
s2X = 60.76 and a random sample of m = 15 students from small high schools yielded
ȳ = 78.61, sY2 = 48.24, the endpoints for a 95% confidence interval for µX − µY are given
by
√ ( )
9 × 60.76 + 15 × 48.24 1 1
81.31 − 78.61 ± 2.074 + ,
22 9 15

since P(T ≤ 2.074) = 0.975. So, the 95% confidence interval is [−3.95, 9.35]. ⊓
⊔

4.3 Confidence intervals for variances

4.3.1 One-sample case

First, we consider the one-sample case. By Property 4.1(ii),

nS2
∼ χn−1
2
σ2
is a pivotal quantity involving σ 2 . Let
4.3 Confidence intervals for variances 49
( )
nS2
1 − α = P χ1−2
α /2,d f =n−1 ≤ ≤ χ 2
α /2,d f =n−1
σ2
( )
nS2 nS2
=P ≤σ ≤ 2
2
,
χα2 /2,d f =n−1 χ1−α /2,d f =n−1

where χα2 ,d f =n satisfies

P(T ≥ χα2 ,d f =n ) = α
for a random variable T ∼ χn2 ; see Fig. 4.4. So, a 1 − α confidence interval of σ 2 is
[ ]
nS2 nS2
, 2 . (4.6)
χα2 /2,d f =n−1 χ1− α /2,d f =n−1

Given the observed value of S = s and the values of χα2 /2,d f =n−1 and χ1−
2
α /2,d f =n−1 , we
can calculate the interval estimate of σ by 2

[ ]
ns2 ns2
, 2 .
χα2 /2,d f =n−1 χ1−α /2,d f =n−1

area is 1 − α

area is
α/2

area is α/2

χ21−α/2,df=n χ2α/2,df=n

Fig. 4.4 Upper percentile of χn2 distribution

Example 4.7. A machine is set up to fill packages of cookies. A recently taken random
sample of the weights of 25 packages from the production line gave a variance of 2.9
g2 . Construct a 95% confidence interval for the standard deviation of the weight of a
randomly selected package from the production line.

Solution. As n = 25, s2 = 2.9, α = 0.05,

50 4 Interval estimation

ns2 25(2.9) 25(2.9)

= ≈ ≈ 1.8420,
χα2 /2,d f =n−1 χ0.025,d
2
f =24 39.36
ns2 25(2.9) 25(2.9)
= ≈ ≈ 5.8468,
χ1−
2
α /2,d f =n−1 χ0.975,d
2
f =24 12.40

the 95% confidence interval for the population variance is (1.8420, 5.8468). Taking pos-
itive square roots, we obtain the 95% confidence interval for the population standard de-
viation to be (1.3572, 2.4180). ⊓ ⊔

4.3.2 Two-sample case

Next, we consider the two-sample case. Let

X = {X1 , X2 , · · · , Xn } and Y = {Y1 ,Y2 , · · · ,Ym }

be random samples from independent distributions N(µX , σX2 ) and N(µY , σY2 ), respec-
tively. We are of interest to construct the confidence interval for σX2 /σY2 .

Property 4.3. Suppose that U ∼ χr21 and V ∼ χr22 are independent. Then,

U/r1
Fr1 ,r2 =
V /r2

has an Fr1 ,r2 distribution with r1 and r2 degrees of freedom.

By Property 4.1(ii),
nSX2 mSY2
∼ χ 2
n−1 and ∼ χm−1
2
.
σX2 σY2
Then, by Property 4.3, it follows that
[ ]/[ ]
mSY2 nSX2
∼ Fm−1,n−1 ,
σY2 (m − 1) σX2 (n − 1)

which is a pivotal quantity involving σX2 /σY2 . Let

( [ ]/[ ] )
mSY2 nSX2
1 − α = P F1−α /2,d f =(m−1,n−1) ≤ ≤ Fα /2,d f =(m−1,n−1)
σY2 (m − 1) σX2 (n − 1)
( )
n(m − 1)SX2 σX2 n(m − 1)SX2
=P F1−α /2,d f =(m−1,n−1) ≤ ≤ Fα /2,d f =(m−1,n−1) ,
m(n − 1)SY2 σY2 m(n − 1)SY2

where Fα ,d f =(m,n) satisfies

P(T ≥ Fα ,d f =(m,n) ) = α
for a random variable T ∼ Fm,n ; see Fig. 4.5. So, a 1 − α confidence interval of σX2 /σY2 is
[ ]
n(m − 1)SX2 n(m − 1)SX2
F1−α /2,d f =(m−1,n−1) , Fα /2,d f =(m−1,n−1) . (4.7)
m(n − 1)SY2 m(n − 1)SY2
4.4 Confidence intervals: Large samples 51

Given the observed value of SX = sX , SY = sY , and the values of Fα /2,d f =(m−1,n−1) and
F1−α /2,d f =(m−1,n−1) , we can calculate the interval estimate of σX2 /σY2 by
[ ]
n(m − 1)s2X n(m − 1)s2X
F1−α /2,d f =(m−1,n−1) , Fα /2,d f =(m−1,n−1) .
m(n − 1)sY2 m(n − 1)sY2

area is 1 − α

area is
α/2

area is α/2

F1−α/2,df =(m,n) Fα/2,df =(m,n)

Fig. 4.5 Upper percentile of Fm,n distribution

4.4 Confidence intervals: Large samples

In the previous sections, the confidence intervals are all constructed for the normal pop-
ulation, which allows us to deal with the case of a fixed sample size n. In practice, the
normal assumption on the population is restricted. When the population is not normal,
we can make use of the CLT to propose confidence intervals, which has an approximated
confidence coefficient 1 − α for large n.
To elaborate the idea, we first introduce a useful theorem.

Theorem 4.1. (Slutsky’s theorem) If Xn →d X and Yn → p C (a constant), then

(i) Xn +Yn →d X +C;
(ii) Xn ·Yn →d X ·C;
(iii) Xn /Yn →d X/C provided that C ̸= 0.

Note that Theorem 4.1 fails if C is not a constant.

Let X be an independent random sample from a population, which has the mean µ and
the variance σ 2 < ∞. According to Theorem 4.1, CLT and the fact that S → p σ , we have
52 4 Interval estimation
√ √
n(X − µ ) n(X − µ ) S
= · →d N(0, 1), (4.8)
S σ σ
for large n. Hence, by (4.8), it follows that for large n,
( √ )
n(X − µ )
1 − α ≈ P −zα /2 ≤ ≤ zα /2
S
( )
S S
= P X − zα /2 √ ≤ µ ≤ X + zα /2 √ .
n n

So, an approximated 1 − α confidence interval of µ is

[ ]
S S
X − zα /2 √ , X + zα /2 √ . (4.9)
n n

Given the observed value of x and s, we can calculate the interval estimate of µ by
[ ]
s s
x − zα /2 √ , x + zα /2 √ .
n n

Note that the confidence interval in (4.9) only requires a large n but not the normal popu-
lation assumption. Clearly, the similar idea can be applied to the two-sample case.
To end this chapter, we consider the interval estimator for percentage p, where

p = P (X ∈ (a, b)) .

Define ξ = I(a < X < b). Then, E(ξ ) = p. This indicates that p is the theoretical mean of
ξ . Hence, by (4.9), an approximated 1 − α confidence interval of p is
[ ]
Sξ Sξ
ξ − zα /2 √ , ξ + zα /2 √ , (4.10)
n n

where ξ = 1n ∑ni=1 ξi and Sξ2 = n1 ∑ni=1 (ξi − ξ )2 = ξ (1 − ξ ) with ξi = I(a < Xi < b).
In general, we can treat the interval (a, b) as “success”, p = P(“success”), and ξ =
relative frequence of “success”.

Example 4.8. In a certain political campaign, one candidate has a poll taken at random
among the voting population. The results are n = 112 and y = 59 (for “Yes”). Should the
candidate feel very confident of winning?

Solution. Let p = P(“the condidate wins the campaign”). Then, ξ = 59/112 ≈ 0.527. Ac-
cording to (4.10), since z0.025 ≈ 1.96, an approximated 95% confident interval estimate
for p is
[ √ √ ]
0.527 ∗ (1 − 0.527) 0.527 ∗ (1 − 0.527)
0.527 − z0.025 , 0.527 + z0.025
112 112
≈ [0.435, 0.619].

There has certain possibility that p is less than 50%, and the candidate should take this
into account in campaigning. ⊓ ⊔
Chapter 5
Hypothesis testing

In scientific activities, much attention is devoted to answering questions about the validity
of theories or hypotheses concerning physical phenomena. For examples, (i) Is the new
drug effective in combating a certain disease? (ii) Are females more talented in music
than males? e.t.c. To answer these questions, we need use the hypothesis testing, which is
a procedure used to determine (make a decision) whether a hypothesis should be rejected
(declared false) or not.

5.1 Basic concepts

To study hypothesis, the following technical terms are needed:

Hypothesis a statement (or claim) about a population;
Hypothesis test a rule that leads to a decision to or not to reject a hypothesis;
Simple hypothesis a hypothesis that completely specifies the distribution of the popu-
lation;
Composite hypothesis a hypothesis that does not completely specify the distribution
of the population;
Null hypothesis (H0 ) a hypothesis that is assumed to be true before it can be rejected;
Alternative hypothesis (H1 or Ha ) a hypothesis that will be accepted if the null hy-
pothesis is rejected.
Often, a hypothesis has the special form “the unknown distributional parameter θ be-
longs to a set”. There may be two competing hypotheses of this form:

H0 : θ ∈ Ω0 versus H1 : θ ∈ Ω1 ,

where Ω0 and Ω1 are disjoint sets of possible values of the parameter θ .

Example 5.1. Suppose that the score of STAT2602 follows N(θ , 10), and we want to know
whether the theoretical mean θ = 80. In this case,

H0 : θ = 80 versus H1 : θ ̸= 80.

Here, H0 is a simple hypothesis, because θ is the only unknown parameter and Ω0 con-
sists of exactly one real number; and H1 is a composite hypothesis, because it can not
completely specify the distribution of the score. ⊓
⊔

53
54 5 Hypothesis testing

If both the two hypotheses are simple, the null hypothesis H0 is usually chosen to be
a kind of default hypothesis, which one tends to believe unless given strong evidence
otherwise.
Example 5.2. Suppose that the score of STAT2602 follows N(θ , 10), and we want to know
whether the theoretical mean θ = 80 or 70. In this case,

H0 : θ = 80 versus H1 : θ = 70.

Here, both H0 and H1 are simple hypotheses. We tend to believe H0 : θ = 80 unless given
strong evidence otherwise. ⊔⊓
In order to construct a rule to decide whether the hypothesis is rejected or not, we need
to use the test statistic defined by
Test statistic the statistic upon which the statistical decision will be based.
Usually, the test statistic is a functional on the random sample X = {X1 , · · · , Xn }, and it is
denoted by W (X). Some important terms about the test statistic are as follows:
Rejection region or critical region the set of values of the test statistic for which the
null hypothesis is rejected;
Acceptance region the set of values of the test statistic for which the null hypothesis is
not rejected (is accepted);
Type I error rejection of the null hypothesis when it is true;
Type II error acceptance of the null hypothesis when it is false.
Accept H0 Reject H0

H0 is true No error Type I error

H0 is false Type II error No error

Suppose that the rejection region is

{W (X) ∈ R}.

Definition 5.1. (Power function) The power function π (θ ) is the probability of rejecting
H0 when the true value of the parameter is θ , i.e.,

π (θ ) := Pθ (W (X) ∈ R).

Let α (θ ) and β (θ ) be probabilities of committing a type I and type II error respectively

when the true value of the parameter is θ . That is,

α (θ ) = Pθ (W (X) ∈ R) for θ ∈ Ω0 ;
β (θ ) = Pθ (W (X) ∈ Rc ) for θ ∈ Ω1 .

From α (θ ) and β (θ ), we know that

{
α (θ ), for θ ∈ Ω0 ;
π (θ ) =
1 − β (θ ), for θ ∈ Ω1 .
5.1 Basic concepts 55

Example 5.3. A manufacturer of drugs has to decide whether 90% of all patients given a
new drug will recover from a certain disease. Suppose
(a) the alternative hypothesis is that 60% of all patients given the new drug will recover;
(b) the test statistic is W , the observed number of recoveries in 20 trials;
(c) he will accept the null hypothesis when W > 14 and reject it otherwise.
Find the power function of W .

Solution: Let p = P(”recovery”). The hypotheses are

H0 : p = 0.9 versus H1 : p = 0.6.

The test statistic W follows a binomial distribution B(n, p) with parameters n = 20 and p.
The rejection region is {W ≤ 14}. Hence,

π (p) = P p (W ≤ 14)
= 1 − P p (W > 14)
20 ( )
20 k
= 1− ∑ p (1 − p)20−k
k=15 k
{
0.0113, for p = 0.9;
≈
0.8744, for p = 0.6.

(This implies that the probability of committing a type I and type II error are 0.0113 and
0.1256, respectively.) ⊓ ⊔
Example 5.4. Let X be a random sample from N(µ , σ 2 ), where σ 2 is known. Consider a
µ0
test statistic W = X−√
σ/ n
for hypotheses H0 : µ ≤ µ0 versus H1 : µ > µ0 . Assume that the
rejection region is {W ≥ K}. Then, the power function is

π (µ ) = Pµ (W ≥ K)
( )
X −µ µ −µ
= Pµ √ ≥K+ 0 √
σ/ n σ/ n
( )
µ0 − µ
= P Z ≥K+ √ ,
σ/ n

where Z ∼ N(0, 1). It is easy to see that

lim π (µ ) = 0, lim π (µ ) = 1, and π (µ0 ) = α if P(Z ≥ K) = α .

µ →−∞ µ →∞

⊓
⊔
The ideal power function is 0 for θ ∈ Ω0 and 1 for θ ∈ Ω1 . However, this ideal can
not be attained in general. For a fixed sample size, it is usually impossible to make both
types of error probability arbitrarily small. In searching for a good test, it is common to
restrict consideration to tests that control the type I error probability at a specified level.
Within this class of tests we then search for tests that have type II error probability that is
as small as possible. The size defined below is used to control the type I error probability.
Definition 5.2. (Size) For α ∈ [0, 1], a test with power function π (θ ) is a size α test if

max π (θ ) = α .
θ ∈Ω0
56 5 Hypothesis testing

Remark 5.1. α is also called the level of significance or significance level. If H0 is a simple
hypothesis θ = θ0 , then α = π (θ0 ).

Example 5.5. Suppose that we want to test the null hypothesis that the mean of a normal
population with σ 2 = 1 is µ0 against the alternative hypothesis that it is µ1 , where µ1 > µ0 .
(a) Find the value of K such that {X ≥ K} provides a rejection region with the level of
significance α = 0.05 for a random sample of size n.
(b) For the rejection region found in (a), if µ0 = 10, µ1 = 11 and we need the type II
probability β ≤ 0.06, what should n be?

Solution. (a) Note that H0 : µ = µ0 , H1 : µ = µ1 , and the rejection region is {X ≥ K}. By

definition,

α = π (µ0 ) = Pµ0 (X ≥ K)
( )
X − µ0 K − µ0
= P µ0 √ ≥ √
σ/ n σ/ n
( )
K − µ0
=P Z≥ √ ,
1/ n

where Z ∼ N(0, 1). Hence, when α = 0.05, we should have

( )
K − µ0
0.05 = P Z ≥ √ ,
1/ n

which is equivalent to
K − µ0 1.645
√ = z0.05 ≈ 1.645 or K ≈ µ0 + √ .
1/ n n

(b) By definition,

β = 1 − π (µ1 ) = Pµ1 (X < K)

( )
X − µ1 K − µ1
= P µ1 √ < √
σ/ n σ/ n
( )
K − µ1
=P Z< √
σ/ n
( )
µ0 + 1.645
√ − µ1
n
≈P Z< √ .
σ/ n

With µ0 = 10, µ1 = 11, σ 2 = 1, it follows that

( √ ) ( √ )
β ≈ P Z < n(µ0 − µ1 ) + 1.645 = P Z < − n + 1.645 .

Hence,
√
β ≤ 0.06 ⇐⇒ − n + 1.645 ≤ −z0.06 ≈ −1.555
⇐⇒ n ≥ (1.645 + 1.555)2 ≈ 10.24,

that is, n ≥ 11. ⊓

⊔
5.1 Basic concepts 57

Remark 5.2. In the above example, the value of K in the rejection region {X ≥ K} is
determined by the significance level α . For the test statistic X, the value of K uniquely
decides whether the null hypothesis is rejected or not, and it is usually called a critical
value of this test.

In summary, steps to perform a hypothesis test are as follows:

(1) State the null and alternative hypotheses and the level of significance α .
(2) Choose a test statistic.
(3) Determine the rejection region.
(4) Calculate the value of the test statistic according to the particular sample drawn.
(5) Make a decision: reject H0 if and only if the value of the test statistic falls in the
rejection region.
The key steps are (2) and (3), and they can be accomplished by using the likelihood
ratio or generalized likelihood ratio. The test statistic, denoted by W (X), is chosen case by
case. The rejection region of W (X) usually has the form of {W (X) ≤ K}, {W (X) ≥ K},
{|W (X)| ≥ K}, or {W (X) ≤ K1 } ∪ {W (X) ≥ K2 }, where the values of K, K1 and K2 are
determined by the significance level α .
Instead of using step (5), we can also use p-value to make a decision.

Definition 5.3. (p-value) Let W (x) be the observed value of the test statistic W (X).
Case 1: The rejection region is {W (X) ≤ K}, then

p-value = max Pθ (W (X) ≤ W (x));

θ ∈Ω0

Case 2: The rejection region is {W (X) ≥ K}, then

p-value = max Pθ (W (X) ≥ W (x));

θ ∈Ω0

Case 3: The rejection region is {|W (X)| ≥ K}, then

p-value = max Pθ (|W (X)| ≥ |W (x)|).

θ ∈Ω0

From the above example, we know that p-value does not depend on α , and it helps us
to make a decision by comparing its value with α .

Property 5.1. For a test statistic W (X),

H0 is rejected at the significance level α ⇐⇒ p-value ≤ α .

Proof. We only prove it for Case 1. Note that

p-value = max Pθ (W (X) ≤ W (x)) and α = max Pθ (W (X) ≤ K).

θ ∈Ω0 θ ∈Ω0

By the monotonicity of the c.d.f., we have

p-value ≤ α ⇐⇒ W (x) ≤ K
⇐⇒ the observed value of W (X) falls in the rejection region
⇐⇒ H0 is rejected at the significance level α .

⊓
⊔
58 5 Hypothesis testing

Example 5.3. (con’t) If the observed value of W is 12, then

p-value = P p (W ≤ 12) for p = 0.9

12 ( )
20
= ∑ (0.9)k (0.1)20−k ≈ 0.0004.
k=0 k

Hence, at the significance level α = 0.05, the null hypothesis is rejected. ⊓

⊔

Example 5.5. (con’t) If the observed value of X is 10.417, then

p-value = Pµ (X ≥ 10.417) for µ = 10

( )
X −µ 10.417 − µ
= Pµ √ ≥ √ for µ = 10
σ/ n σ/ n
( )
10.417 − 10
=P Z≥ √
1/ 11
≈ 0.0833.

Hence, at the significance level α = 0.05, the null hypothesis H0 : µ = 10 is not rejected.
⊓
⊔

5.2 Most powerful tests

Definition 5.4. (Most powerful tests) A test concerning a simple null hypothesis θ = θ0
against a simple alternative hypothesis θ = θ1 is said to be most powerful if the power of
the test at θ = θ1 is a maximum.

To construct a most powerful rejection region we refer to the likelihood function of a

random sample X = {X1 , X2 , · · · , Xn } defined by

L(θ ) = f(X1 , X2 , . . . , Xn ; θ ),

where f(x1 , x2 , . . . , xn ; θ ) is the joint p.d.f. of the random variables X1 , X2 , . . . , Xn from a

L(θ0 )
population with a parameter θ . Consider the likelihood ratio . Intuitively speaking,
L(θ1 )
the null hypothesis should be rejected when the likelihood ratio is small.

Theorem 5.1. (Neyman-Pearson Lemma) Suppose X1 , X2 , . . . , Xn constitute a random

sample of size n from a population with exactly one unknown parameter θ . Suppose that
there is a positive constant k and a region C such that

(i) Pθ {(X1 , X2 , . . . , Xn ) ∈ C} = α for θ = θ0 ,

and
f(x1 , x2 , . . . , xn ; θ0 )
(ii) ≤k when (x1 , x2 , . . . , xn ) ∈ C,
f(x1 , x2 , . . . , xn ; θ1 )
f(x1 , x2 , . . . , xn ; θ0 )
(iii) ≥k when (x1 , x2 , . . . , xn ) ̸∈ C.
f(x1 , x2 , . . . , xn ; θ1 )
5.2 Most powerful tests 59

Construct a test, called the likelihood ratio test, which rejects H0 : θ = θ0 and accepts
H1 : θ = θ1 if and only if (X1 , X2 , . . . , Xn ) ∈ C. Then any other test which has significance
level α ∗ ≤ α has power not more than that of this likelihood ratio test. In other words, the
likelihood ratio test is most powerful among all tests having significance level α ∗ ≤ α .

Proof. Suppose D is the rejection region of any other test which has significance level
α ∗ ≤ α . We consider first the continuous case. Note that

α = Pθ {(X1 , X2 , . . . , Xn ) ∈ C} for θ = θ0 (by (i))

∫ ∫ ∫
= ··· f(x1 , x2 , . . . , xn ; θ0 )dx1 dx2 · · · dxn ;
C
α ∗ = Pθ {(X1 , X2 , . . . , Xn ) ∈ D} for θ = θ0
∫ ∫ ∫
= ··· f(x1 , x2 , . . . , xn ; θ0 )dx1 dx2 · · · dxn .
D

Since α ≥ α ∗ , it follows that

∫ ∫ ∫
··· f(x1 , x2 , . . . , xn ; θ0 )dx1 dx2 · · · dxn
C
∫ ∫ ∫
≥ ··· f(x1 , x2 , . . . , xn ; θ0 )dx1 dx2 · · · dxn .
D

Subtracting ∫ ∫ ∫
··· f(x1 , x2 , . . . , xn ; θ0 )dx1 dx2 · · · dxn ,
C∩D
we get
∫ ∫ ∫
··· f(x1 , x2 , . . . , xn ; θ0 )dx1 dx2 · · · dxn
C∩D′
∫ ∫ ∫
≥ ··· f(x1 , x2 , . . . , xn ; θ0 )dx1 dx2 · · · dxn , (5.1)
C′ ∩D

where C′ and D′ are complements of C and D respectively. Hence,

∫ ∫ ∫
··· f(x1 , x2 , . . . , xn ; θ1 )dx1 dx2 · · · dxn
C∩D′
∫ ∫ ∫
f(x1 , x2 , . . . , xn ; θ0 )
≥ ··· dx1 dx2 · · · dxn (by (ii))
k
C∩D′
∫ ∫ ∫
f(x1 , x2 , . . . , xn ; θ0 )
≥ ··· dx1 dx2 · · · dxn (by (5.1))
k
C′ ∩D
∫ ∫ ∫
≥ ··· f(x1 , x2 , . . . , xn ; θ1 )dx1 dx2 · · · dxn . (by (iii))
C′ ∩D
∫ ∫∫
Adding · · · f(x1 , x2 , . . . , xn ; θ1 )dx1 dx2 · · · dxn , we finally obtain
C∩D
60 5 Hypothesis testing
∫ ∫ ∫
··· f(x1 , x2 , . . . , xn ; θ1 )dx1 dx2 · · · dxn
C
∫ ∫ ∫
≥ ··· f(x1 , x2 , . . . , xn ; θ1 )dx1 dx2 · · · dxn ,
D

or
Pθ {(X1 , X2 , . . . , Xn ) ∈ C} ≥ Pθ {(X1 , X2 , . . . , Xn ) ∈ D}
for θ = θ1 . The last inequality states that the power of the likelihood ratio test at θ = θ1
is at least as much as that corresponding to the rejection region D. The proof for discrete
case is similar, with sums taking places of integrals. ⊓⊔

Neyman-Pearson Lemma says that to test H0 : θ = θ0 versus H1 : θ = θ1 , the rejection

region for the likelihood ratio test is

L(θ0 )
≤ k ⇐⇒ (X1 , X2 , · · · , Xn ) ∈ C ⇐⇒ W (X) ∈ R,
L(θ1 )

where the interval R is chosen so that the test has the significance level α . Generally
speaking, the likelihood ratio helps us to determine the test statistic and the form of its
rejection region.

Example 5.6. A random sample {X1 , X2 , . . . , Xn } from a normal population N(µ , σ 2 ),

where σ 2 = σ02 is known, is to be used to test the null hypothesis µ = µ0 against the
alternative hypothesis µ = µ1 , where µ1 > µ0 . Use the Neyman-Pearson Lemma to con-
struct the most powerful test.

Solution. The likelihood function of the sample is

( )n [ ]
1 1 n
L(µ ) = √ exp − 2 ∑ (Xi − µ ) .
2
σ0 2π 2σ0 i=1

L(µ0 )
The likelihood ratio test rejects the null hypothesis µ = µ0 if and only if ≤ k, that
L(µ1 )
is,
{ }
1 n [ ]
exp ∑ (Xi − µ1 ) − (Xi − µ0 ) ≤ k
2σ02 i=1
2 2

n
⇐⇒ ∑ (−2µ1 Xi + µ12 + 2µ0 Xi − µ02 ) ≤ 2σ02 log k
i=1
n
⇐⇒ n(µ12 − µ02 ) + 2(µ0 − µ1 ) ∑ Xi ≤ 2σ02 log k
i=1
2σ 2 log k − n(µ12 − µ02 )
⇐⇒ X ≥ 0 (since µ1 > µ0 ).
2n(µ0 − µ1 )

Therefore, in order that the level of significance is α , we should choose a constant K such
that Pµ (X ≥ K) = α for µ = µ0 , that is,
( )
X − µ0 K − µ0 K − µ0 σ0 zα
P √ ≥ √ = α ⇐⇒ √ = zα ⇐⇒ K = µ0 + √ .
σ0 / n σ0 / n σ0 / n n
5.3 Generalized likelihood ratio tests: One-sample case 61

Therefore, the most powerful test having significance level α ∗ ≤ α is the one which has
the rejection region
{ } { }
σ0 z α X − µ0
X ≥ µ0 + √ or √ ≥ zα .
n σ0 / n

(Note that the rejection region found does not depend on the value of µ1 ). ⊔
⊓

Example 5.7. Suppose X1 , X2 , . . . , Xn constitute a random sample of size n from a popula-

tion given by a density
f (x) = θ xθ −1 I(0 ≤ x ≤ 1).
If 0 ≤ Xi ≤ 1 for i = 1, 2, . . . , n, find the form of the most powerful test for testing

H0 : θ = 2 versus H1 : θ = 1.

Solution. The likelihood function of the sample is

( )θ −1 ( )θ −1
n n n
L(θ ) = θ n
∏ Xi ∏ I(0 ≤ Xi ≤ 1) = θ ∏ Xi
n
.
i=1 i=1 i=1

Hence, the likelihood ratio is

n
L(2)
= 2n ∏ Xi .
L(1) i=1
n
The likelihood ratio test rejects H0 if and only if 2n ∏Xi ≤ K where K is a positive constant
i=1
n
(or, equivalently, ∏Xi ≤ k where k is a positive constant). ⊓
⊔
i=1

5.3 Generalized likelihood ratio tests: One-sample case

The Neyman-Pearson lemma provides a method of constructing most powerful rejection

regions for testing a simple null hypothesis against a simple alternative hypothesis, but it
does not always apply to composite hypotheses. We shall present a general method for
constructing rejection regions for tests of composite hypotheses which in most cases have
very satisfactory properties, although they are not necessarily uniformly most powerful.
Suppose that θ ∈ Ω , where Ω is the parametric space. Consider the following hypothe-
ses:
H0 : θ ∈ Ω0 versus H1 : θ ∈ Ω1 ,
where Ω1 is the complement of Ω0 with respect to Ω (i.e., Ω1 = Ω /Ω0 ). Let

L(Ω0 ) = max L(θ ) and L(Ω ) = max L(θ ).

θ ∈Ω0 θ ∈Ω

The generalized likelihood ratio is defined as

L(Ω0 )
Λ= .
L(Ω )
62 5 Hypothesis testing

Since Ω0 is a subset of Ω , it follows that Λ ≤ 1. When the null hypothesis is false, we

would expect Λ to be small. A generalized likelihood ratio test states, therefore, that the
null hypothesis H0 is rejected if and only if Λ falls in a rejection region of the form Λ ≤ k,
where 0 ≤ k ≤ 1.

5.3.1 Testing for the mean: Variance is known

Example 5.8. Find the generalized likelihood ratio test for testing

H0 : µ = µ0 versus H1 : µ ̸= µ0

on the basis of a random sample of size n from N(µ , σ 2 ), where σ 2 = σ02 is known.

Solution. Ω is the set of all real numbers (i.e., Ω = R) and Ω0 = {µ0 }. On one hand,
since Ω0 contains only µ0 , it follows that
( )n [ ]
1 1 n
L(Ω0 ) = √ exp − 2 ∑ (Xi − µ0 )2 .
σ0 2π 2σ0 i=1

On the other hand, since the maximum likelihood estimator of µ is X, it follows that
( )n [ ]
1 1 n
L(Ω ) = √ exp − 2 ∑ (Xi − X) . 2
σ0 2π 2σ0 i=1

Hence,
{ [ ]}
L(Ω0 ) 1 n n
Λ= = exp − 2 ∑ (Xi − µ0 ) − ∑ (Xi − X)
2 2
L(Ω ) 2σ0 i=1 i=1
[ ]
n(X − µ0 )2
= exp − .
2σ02
{ }
Therefore, the rejection region is X − µ0 ≥ K . In order that the level of significance
is α , that is, ( )
Pµ X − µ0 ≥ K = α for µ = µ0 ,
σ0
we should let K = zα /2 √n
, so that
( )
( ) σ0
Pµ X − µ0 ≥ K = Pµ X − µ0 ≥ zα /2 √
n
( ) ( )
X − µ0 X − µ0
= Pµ √ ≥ zα /2 + Pµ √ ≤ −zα /2
σ0 / n σ0 / n
( ) ( )
= P Z ≥ zα /2 + P Z ≤ −zα /2
α α
= + =α
2 2
for µ = µ0 . So, the generalized likelihood ratio test has the rejection region
5.3 Generalized likelihood ratio tests: One-sample case 63
{ }
X − µ0
√ ≥ zα /2
σ0 / n

at the significance level α . ⊓

⊔

From aforemention example and the similar technique, we can have the following ta-
ble:

Table 5.1 Testing for the mean when σ 2 = σ02 is known

Test H0 H1 Rejection region p-value

{ } ( )
X − µ0 x − µ0
Two-tailed µ = µ0 µ ̸= µ0 √ ≥ zα /2 P |Z| ≥ √
σ0 / n σ0 / n
{ } ( )
X − µ0 x − µ0
Left-tailed µ = µ0 or µ ≥ µ0 µ < µ0 √ ≤ −zα P Z≤ √
σ / n σ0 / n
{0 } ( )
X − µ0 x − µ0
Right-tailed µ = µ0 or µ ≤ µ0 µ > µ0 √ ≥ zα P Z≥ √
σ0 / n σ0 / n

Example 5.9. The standard deviation of the annual incomes of government employees is
$1400. The mean is claimed to be $35,000. Now a sample of 49 employees has been
drawn and their average income is $35,600. At the 5% significance level, can you con-
clude that the mean annual income of all government employees is not $35,000?

Solution 1.
Step 1: “The mean ... is not 35,000” can be written as “µ ̸= 35000”, while “the mean
... is 35,000” can be written as “µ = 35000”. Since the null hypothesis should include an
equality, we consider hypothesis:

H0 : µ = 35000 versus H1 : µ ̸= 35000.

Step 2: The test statistic is

X − µ0 X − 35000 X − 35000
Z= √ = √ = ,
σ/ n 1400/ 49 200

which follows N(0, 1) under H0 .

Step 3: At the significance level α = 5%, the rejection region is
{ }
|Z| ≥ zα /2 ≈ {|Z| ≥ 1.960} .

Step 4: Since x = 35600, the value of the test statistic is

35600 − 35000
= 3.
200
Step 5: Since |3| ≥ 1.960, we reject H0 and accept H1 . Therefore, we conclude that
the mean annual income of all government employees is not $35,000 at the 5% level of
significance.

Solution 2.
64 5 Hypothesis testing

Step 1:
H0 : µ = 35000 versus H1 : µ ̸= 35000.
Step 2: The test statistic is

X − µ0 X − 35000 X − 35000
Z= √ = √ = ,
σ/ n 1400/ 49 200

which follows N(0, 1) under H0 .

Step 3: Since x = 35600, the value of the test statistic is
35600 − 35000
= 3.
200
Step 4:

p-value = P (|Z| ≥ |3|) = 2P (Z ≥ 3) ≈ 2(0.5 − 0.4987) = 0.0026,

where Z follows N(0, 1).

Step 5: Since 0.0026 ≤ 0.05 = α , we reject H0 and accept H1 . Therefore we conclude
that the mean annual income of all government employees is not $35,000. ⊓ ⊔

Example 5.10. The chief financial officer in FedEx believes that including a stamped self-
addressed envelope in the monthly invoice sent to customers will reduce the amount of
time it takes for customers to pay their monthly bills. Currently, customers return their
payments in 24 days on average, with a standard deviation of 6 days. It was calculated that
an improvement of two days on average would cover the costs of the envelopes (because
cheques can be deposited earlier). A random sample of 220 customers was selected and
stamped self-addressed envelopes were included in their invoice packs. The amounts of
time taken for these customers to pay their bills were recorded and their mean is 21.63
days. Assume that the corresponding population standard deviation is still 6 days. Can
the chief financial officer conclude that the plan will be profitable at the 10% significance
level?

Solution 1. The plan will be profitable when “µ < 22”, and not profitable when “µ ≥ 22”.
Since the null hypothesis should include an equality, we have

H0 : µ ≥ 22 versus H1 : µ < 22.

The value of the test statistic is

x − µ0 21.63 − 22
√ = √ ≈ −0.9147.
σ/ n 6/ 220

Since −0.9147 > −1.282 ≈ −z0.1 , H0 should not be rejected. The chief financial officer
cannot conclude that the plan is profitable at the 10% significance level.

Solution 2: Consider

p-value ≈ P(Z ≤ −0.9147) ≈ 0.1814 > 0.1 = α ,

where Z follows N(0, 1). Therefore, H0 should not be rejected. The chief financial officer
cannot conclude that the plan is profitable at the 10% significance level. ⊓
⊔
5.3 Generalized likelihood ratio tests: One-sample case 65

5.3.2 Testing for the mean: Variance is unknown

Example 5.11. Find the generalized likelihood ratio test for testing

H0 : µ = µ0 versus H1 : µ > µ0

on the basis of a random sample of size n from N(µ , σ 2 ).

Solution. Now

Ω = {(µ , σ ) : µ ≥ µ0 , σ > 0} ,
Ω0 = {(µ , σ ) : µ = µ0 , σ > 0} ,
Ω1 = {(µ , σ ) : µ > µ0 , σ > 0} .

The likelihood function of the sample is

( )n [ ]
1 1 n
L(µ , σ ) = = √ exp − 2 ∑ (Xi − µ ) ,
2
σ 2π 2σ i=1

and hence,
∂ ln L(µ , σ ) n
= 2 (X − µ ),
∂µ σ
∂ ln L(µ , σ ) n 1 n
= − + 3 ∑ (Xi − µ )2 .
∂σ σ σ i=1

On Ω0 , the maximum value of L(µ , σ ) is L(µ0 , σ̃ ), where σ̃ satisfies

1 n
σ̃ 2 = ∑ (Xi − µ0 )2 .
n i=1

This is because σ̃ is the maximum value of ln L(µ0 , σ ), by noting that for all µ > 0,
√
1 n ∂ ln L(µ , σ )
σ< ∑
n i=1
(Xi − µ )2 ⇐⇒
∂σ
> 0,
√
1 n ∂ ln L(µ , σ )
σ> ∑ (Xi − µ )2 ⇐⇒ ∂ σ < 0.
n i=1

Therefore, ( )n
1 ( n)
L(Ω0 ) = L(µ0 , σ̃ ) = √ exp − .
σ̃ 2π 2
On Ω , the maximum value of L(µ , σ ) is L(µ̂ , σ̂ ), where (noting that L(µ , σ ) decreases
with respect to µ when µ > X and increases with respect to µ when µ < X)
{
µ0 , if X ≤ µ0 ;
µ̂ =
X, if X > µ0 ,

and
66 5 Hypothesis testing

1 n
σ̂ 2 = ∑ (Xi − µ̂ )2 .
n i=1
Therefore, ( )n
1 ( n)
L(Ω ) = L(µ̂ , σ̂ ) = √ exp − .
σ̂ 2π 2
Thus, we have


1, if X ≤ µ0 ;

 n n/2
( )n ( 2 )n/2 

L(Ω0 ) σ̂ σ̂  ∑ (Xi −X) 
2
Λ= = = =  
L(Ω ) σ̃ σ̃ 2 
 ni=1
 , if X > µ0 .

 

 ∑ (Xi −µ0 )
 2

i=1

The rejection region is {Λ ≤ k} for

{ some }nonnegative constant k < 1 (since we do not
want α to be 1). Then {Λ ≤ k} ⊆ X > µ0 and Λ ≤ k is equivalent to
n n
∑ (Xi − X)2 ∑ (Xi − X)2 1
i=1 i=1
k2/n ≥ n = n = ,
n(X−µ0 )2
∑ (Xi − µ0 )2 ∑ (Xi − X)2 + n(X − µ0 )2 1+ n
i=1 i=1 ∑ (Xi −X)2
i=1

that is,
(X − µ0 )2 n(X − µ0 )2
= n ≥ k−2/n − 1,
S2
∑ (Xi − X) 2
i=1

or (since X > µ0 )
X − µ0
√ ≥ c,
S/ n − 1
√
where c is the constant (n − 1)(k−2/n − 1). In order that the level of significance is α ,
that is, ( )
X − µ0
P(µ ,σ ) √ ≥ c = α for µ = µ0 ,
S/ n − 1
we should let c = tα ,n−1 , since
( )
X − µ0
P(µ ,σ ) √ ≥ tα ,n−1 = P (tn−1 ≥ tα ,n−1 ) for µ = µ0
S/ n − 1

by Property 4.1(iii). So, the generalized likelihood ratio test has the rejection region
{ }
X − µ0
√ ≥ tα ,n−1
S/ n − 1

at the significance level α . ⊓

⊔

Example 5.12. What will happen if we change H0 in the previous example to be µ ≤ µ0 ?

5.3 Generalized likelihood ratio tests: One-sample case 67

Solution. Note that

Ω = {(µ , σ ) : −∞ < µ < ∞, σ > 0} ,

Ω0 = {(µ , σ ) : µ ≤ µ0 , σ > 0} ,
Ω1 = {(µ , σ ) : µ > µ0 , σ > 0} .

On Ω0 , the maximum value of L(µ , σ ) is L(µ̃ , σ̃ ) where

{
X, if X < µ0 ;
µ̃ =
µ0 , if X ≥ µ0 ,

and
1 n
σ̃ 2 = ∑ (Xi − µ̃ )2 .
n i=1
Therefore, ( )n
1 ( n)
L(Ω0 ) = L(µ̃ , σ̃ ) = √ exp − .
σ̃ 2π 2
On Ω , the maximum value of L(µ , σ ) is L(µ̂ , σ̂ ), where

1 n
µ̂ = X and σ̂ 2 = ∑ (Xi − X)2
n i=1

Therefore, ( )n
1 ( n)
L(Ω ) = L(µ̂ , σ̂ ) = √ exp − .
σ̂ 2π 2
Thus, we have


1, if X ≤ µ0 ;

 n n/2
( )n ( 2 )n/2 

L(Ω0 ) σ̂ σ̂  ∑ (Xi −X) 
2
Λ= = = =  
L(Ω ) σ̃ σ̃ 2 
 ni=1
 , if X > µ0 .

 

 ∑ (Xi −µ0 )
 2

i=1

So the generalized likelihood ratio test is the same as that in the previous example. ⊓
⊔

From aforemention two examples and the similar technique, we can have the following
table:

Table 5.2 Testing for the mean when σ 2 is unknown

Test H0 H1 Rejection region p-value

{ } ( )
X − µ0 x − µ0
Two-tailed µ = µ0 µ ̸= µ0 √ ≥ tα /2,n−1 P |tn−1 | ≥ √
S/ n − 1 s/ n − 1
{ } ( )
X − µ0 x − µ0
Left-tailed µ = µ0 or µ ≥ µ0 µ < µ0 √ ≤ −tα ,n−1 P tn−1 ≤ √
S/ n − 1 s/ n − 1
{ } ( )
X − µ0 x − µ0
Right-tailed µ = µ0 or µ ≤ µ0 µ > µ0 √ ≥ tα ,n−1 P tn−1 ≥ √
S/ n − 1 s/ n − 1
68 5 Hypothesis testing

Example 5.13. According to the last census in a city, the mean family annual income was
316 thousand dollars. A random sample of 900 families taken this year produced a mean
family annual income of 313 thousand dollars and a standard deviation of 70 thousand
dollars. At the 2.5% significance level, can we conclude that the mean family annual
income has declined since the last census?

Solution. Consider hypothesis

H0 : µ ≥ 316 versus H1 : µ < 316.

The value of the test statistic is

x − µ0 313 − 316
√ = √ ≈ −1.286.
s/ n − 1 70/ 900 − 1

Since −1.286 > −1.963 = −t0.025,899 , we do not reject H0 . Thus we cannot conclude that
the mean family annual income has declined since the last census at the 2.5% level of
significance. ⊓
⊔

5.3.3 Testing for the variance

Example 5.14. Given a random sample of size n from a normal population with unknown
mean and variance, find the generalized likelihood ratio test for testing the null hypothesis
σ = σ0 (σ0 > 0) against the alternative hypothesis σ ̸= σ0 .

Solution. Note that

Ω = {(µ , σ ) : −∞ < µ < ∞, σ > 0} ,

Ω0 = {(µ , σ ) : −∞ < µ < ∞, σ = σ0 } ,
Ω1 = {(µ , σ ) : −∞ < µ < ∞, σ > 0, σ ̸= σ0 } ,

and the likelihood function of the sample {X1 , · · · , Xn } is

( )n [ ]
1 1 n
L(µ , σ ) = √ exp − 2 ∑ (Xi − µ ) .2
σ 2π 2σ i=1

On Ω0 , the maximum value of L(µ , σ ) is L(µ̃ , σ0 ) where µ̃ = X. Therefore,

L(Ω0 ) = L(µ̃ , σ0 )
( )n [ ]
1 1 n
= √ exp − 2 ∑ (Xi − X)2 .
σ0 2π 2σ0 i=1

On Ω , the maximum value of L(µ , σ ) is L(µ̂ , σ̂ ) where µ̂ = X and σ̂ 2 = S2 . Therefore,

( )n ( n)
1
L(Ω ) = L(µ̂ , σ̂ ) = √ exp − .
σ̂ 2π 2

Thus, we have
5.3 Generalized likelihood ratio tests: One-sample case 69
 n 
 ∑ i − 2
( ) (X X)
L(Ω0 ) 2 n/2
σ̂  n

Λ = = exp − i=1 + 
L(Ω ) σ02  2σ02 2

 n n/2  n 
∑ i −  ∑ i −
(X X)2 (X X)2
 n
   i=1 
=  i=1  exp − + .
 nσ02   2σ02 2

The rejection region is {Λ ≤ k} for some positive constant k < 1 (since we do not want
1 n
α to be 1). Letting Y = ∑ (Xi − X)2 ,
nσ02 i=1
( )
nY n
Λ ≤ k ⇐⇒ Y n/2
exp − + ≤ k,
2 2
⇐⇒ Y exp(−Y + 1) ≤ k2/n ,
k2/n
⇐⇒ Y exp(−Y ) ≤ .
e
For y > 0 define a function g(y) = ye−y . Then,

dg(y)
= e−y − ye−y = (1 − y)e−y .
dy
Since
dg(y) dg(y)
y < 1 ⇐⇒ > 0 and y > 1 ⇐⇒ < 0,
dy dy

g(y) will be small when y is close to zero or very large. Thus we reject the null hypothesis
σ = σ0 when the value of Y (or nY ) is large or small, that is, the rejection region of our
generalized likelihood ratio test has the rejection region:

{nY ≤ K1 } ∪ {nY ≥ K2 }.

nS2
Note that nY = . In order that the level of significance is α , that is,
σ02
( ) ( 2 )
nS2 nS
P(µ ,σ ) ≤ K1 + P ≥ K2 = α for σ = σ0 ,
σ02 σ02

we should let K1 = χ1−

2
α /2,n−1 and K2 = χα /2,n−1 , since
2

( ) ( ) α
nS2
P(µ ,σ ) ≤ K1 = P χn−1
2
≤ χ1−
2
α /2,n−1 =
σ02 2

and ( ) ( ) α
nS2
P(µ ,σ ) ≥ K2 = P χn−1
2
≥ χα2 /2,n−1 =
σ02 2
70 5 Hypothesis testing

for σ = σ0 by using the fact that nY ∼ χn−1

2 from Property 4.1(ii). ⊓
⊔

From the aforemention example and the similar technique, we can have the following
table:

Table 5.3 Testing for the variance

Test H0 H1 Rejection region p-value

{ } ( ( )
nS2 ns2
Two-tailed σ = σ0 σ ̸= σ0 ≤ χ 2
α 2 min P χn−1
2
≤ 2 ,
σ2 1− /2,n−1
σ0
{0 2 } ( 2 ))
nS ns
∪ ≥ χα2 /2,n−1 P χn−1
2
≥ 2
σ 2 σ0
{ 20 } ( )
nS ns2
Left-tailed σ = σ0 or σ ≥ σ0 σ < σ0 ≤ χ 2
1−α ,n−1 P χn−1
2
≤ 2
σ2 σ0
{ 02 } ( )
nS ns2
Right-tailed σ = σ0 or σ ≤ σ0 σ > σ0 ≥ χ 2
α ,n−1 P χn−1 ≥ 2
2
σ02 σ0

Example 5.15. One important factor in inventory control is the variance of the daily de-
mand for the product. A manager has developed the optimal order quantity and reorder
point, assuming that the variance is equal to 250. Recently, the company has experienced
some inventory problems, which induced the operations manager to doubt the assump-
tion. To examine the problem, the manager took a sample of 25 daily demands and found
that s2 = 270.58. Do these data provide sufficient evidence at the 5% significance level to
infer that the management scientist’s assumption about the variance is wrong?

Solution. Consider hypothesis

H0 : σ 2 = 250 vesus H1 : σ 2 ̸= 250.

The value of test statistic is

ns2 25 × 270.58
= ≈ 25.976.
σ02 250

Since χ1−0.05/2,25−1
2 ≈ 12.401 ≤ 25.976 ≤ 39.364 ≈ χ0.05/2,25−1
2 , we do not reject H0 .
There is not sufficient evidence at the 5% significance level to infer that the management
scientists assumption about the variance is wrong. ⊓ ⊔

5.3.4 Test and interval estimation

We can obtain the interval estimation by using the two-tailed hypothesis testing. For ex-
ample, consider hypotheses

H0 : µ = µ0 versus µ ̸= µ0 .

If the variance is known, the acceptance region is

5.4 Generalized likelihood ratio tests: Two-sample case 71
{ }
X − µ0 σ σ
√ < zα /2 ⇐⇒ X − zα /2 √ < µ0 < X + zα /2 √ .
σ/ n n n

at the significance level α . As H0 is accepted, µ = µ0 hence

( )
σ σ
P X − zα /2 √ < µ < X + zα /2 √ = 1 − α.
n n
[ ]
That is, X − zα /2 √σn , X + zα /2 √σn is the 1 − α confidence interval of µ .
Similarly, we can find the confidence interval of µ when the variance is unknown, and
σ 2 by using the two-tailed hypothesis testing.

5.4 Generalized likelihood ratio tests: Two-sample case

In this section, we assume that there are two populations following N(µ1 , σ12 ) and
N(µ2 , σ22 ) respectively. A sample {Xi , i = 1, 2, . . . , n1 } is taken from the population
N(µ1 , σ12 ) and a sample {Y j , j = 1, 2, . . . , n2 } is taken from the population N(µ2 , σ22 ).
Assume that these two samples are independent (that is, X1 , X2 , . . . , Xn1 , Y1 ,Y2 , . . . ,Yn2 are
independent).

5.4.1 Testing for the mean: Variance is known

We first consider the hypothesis testing for µ1 − µ2 when σ1 and σ2 are known.

Example 5.16. Assume that σ1 and σ2 are known. Find the generalized likelihood ratio
for testing
H0 : µ1 − µ2 = δ versus H1 : µ1 − µ2 ̸= δ .

Solution. Note that

Ω0 = {(µ1 , µ2 ) : µ1 − µ2 = δ } ,
Ω1 = {(µ1 , µ2 ) : µ1 − µ2 ̸= δ } ,
Ω = Ω0 ∪ Ω1 = {(µ1 , µ2 ) : −∞ < µ1 < ∞, −∞ < µ2 < ∞} .

The likelihood function of the two samples is

( )n1 [ ]
1 1 n1
L(µ1 , µ2 ) = √ exp − 2 ∑ (Xi − µ1 ) 2
σ1 2π 2σ1 i=1
( )n2 [ ]
1 1 n2
× √ exp − 2 ∑ (Y j − µ2 )2 .
σ2 2π 2σ2 j=1

On Ω0 , we have

1 n1 1 n2
ln L(µ1 , µ2 ) = ln L(µ1 , µ1 − δ ) = C − ∑ (Xi − µ1 )2 − 2σ 2
2σ12 i=1
∑ (Y j − µ1 + δ )2 ,
2 j=1
72 5 Hypothesis testing

where C depends on neither µ1 nor µ2 . By direct calculation,

∂ 1 n1 1 n2
ln L(µ1 , µ1 − δ ) = 2 ∑ (Xi − µ1 ) + 2 ∑ (Y j − µ1 + δ )
∂ µ1 σ1 i=1 σ2 j=1

n1 (X − µ1 ) n2 (Y − µ1 + δ )
= +
σ12 σ22
( )
n1 X n2 (Y + δ ) n1 n2
= 2 + − + µ1 .
σ1 σ22 σ12 σ22

Therefore, the maximum likelihood estimator of µ1 is

n1 X n2 (Y + δ )
+
σ2 σ22
µ̃1 = 1 n1 n2 ,
+
σ12 σ22

since
∂
µ1 < µ̃1 ⇐⇒ ln L(µ1 , µ1 − δ ) > 0,
∂ µ1
∂
µ1 > µ̃1 ⇐⇒ ln L(µ1 , µ1 − δ ) < 0.
∂ µ1

On Ω , it is easy to see that the maximum likelihood estimator of µ1 is X and that of µ2

is Y , since µ1 = X maximizes
( )n1 [ ]
1 1 n1
√ exp − 2 ∑ (Xi − µ1 )2 ,
σ1 2π 2σ1 i=1

and µ2 = Y maximizes
( )n2 [ ]
n2
1 1
√
σ2 2π
exp − 2
2σ2
∑ (Y j − µ2 ) 2
.
j=1

Thus, the generalized likelihood ratio is

L(Ω0 )
Λ =
L(Ω )
[ ]
1 n1 [ ] 1 [
n2 ]
= exp − 2 ∑ (Xi − µ̃1 )2 − (Xi − X)2 − 2 ∑ (Y j − µ̃1 + δ ) − (Y j −Y )
2 2
2σ1 i=1 2σ2 j=1
[ ]
n1 (X − µ̃1 )2 n2 (Y − µ̃1 + δ )2
= exp − −
2σ12 2σ22
[ ′ ]
= exp C (X −Y − δ )2 ,

where C′ is negative and does not depend on the samples, because

5.4 Generalized likelihood ratio tests: Two-sample case 73
n2 n1
(X −Y − δ ) (Y + δ − X)
σ22 σ12
X − µ̃1 = n1 n2 and Y − µ̃1 + δ = n1 n2 .
+ 2 + 2
σ1 σ2
2 σ1 σ2
2

{ }
Therefore the rejection region should be |X −Y − δ | ≥ K .
Under H0 , we have  ( 2)
X follows N µ1 , σ1 ,
( n1
)
Y follows N µ1 − δ , σ22 ,
n2

and thus (by the independence between the two samples)

( )
σ2 σ2
X −Y follows N δ , 1 + 2 .
n1 n2

Therefore, the rejection region is

 

 

 |X −Y − δ | 
√ ≥ zα /2 ,

 

 σ1 + σ2 
2 2
n1 n2

X −Y − δ
where the test statistic is √ . ⊓
⊔
σ12 σ22
+
n1 n2

From the aforemention example and the similar technique, we can have the following
table:

Table 5.4 Testing for the mean when variances σ12 and σ22 are known

Test H0 H1 Rejection region p-value

   

 

 X −Y − δ   |x − y − δ | 
Two-tailed µ1 − µ2 = δ µ1 − µ2 ̸= δ √ ≥ zα /2 P
|Z| ≥ √



 σ12 σ22 
 σ12 σ22
 
n1 + n2 n1 + n2
   

 

 X −Y − δ   x−y−δ 
Left-tailed µ1 − µ2 = δ µ1 − µ2 < δ √ ≤ −zα P
Z ≤ √ 2



 

 σ1 + σ2  σ1 σ22
2 2
n1 n2 n1 + n2

or µ1 − µ2 ≥ δ
   

 

 X −Y − δ   x−y−δ 
Right-tailed µ1 − µ2 = δ µ1 − µ2 > δ √ ≥ zα P
Z ≥ √ 2



 

 σ1 + σ2  σ1 σ22
2 2
n1 n2 n1 + n2

or µ1 − µ2 ≤ δ
74 5 Hypothesis testing

5.4.2 Testing for the mean: Variance is unknown

We second consider the hypothesis testing for µ1 − µ2 when σ1 and σ2 are unknown but
equal.

Example 5.17. Assume that σ1 and σ2 are unknown but equal to σ . Find the generalized
likelihood ratio for testing

H0 : µ1 − µ2 = δ versus H1 : µ1 − µ2 ̸= δ .

Solution. Note that

Ω0 = {(µ1 , µ2 , σ ) : µ1 − µ2 = δ , σ > 0} ,
Ω1 = {(µ1 , µ2 , σ ) : µ1 − µ2 ̸= δ , σ > 0} ,
Ω = Ω0 ∪ Ω1 = {(µ1 , µ2 , σ ) : σ > 0} .

The likelihood function of the two samples is

( )n1 +n2 { [ ]}
n1 n2
1 1
L(µ1 , µ2 , σ ) = √ exp − 2 ∑ (Xi − µ1 ) + ∑ (Y j − µ2 )
2 2
.
σ 2π 2σ i=1 j=1

On Ω0 , we have

ln L(µ1 , µ2 , σ ) = ln L(µ1 , µ1 − δ , σ )
[ ]
n1 n2
1
= C − (n1 + n2 ) ln σ − 2
2σ ∑ (Xi − µ1 ) 2
+ ∑ (Y j − µ1 + δ ) 2
,
i=1 j=1

where C is a constant. Then,

∂ n1 (X − µ1 ) + n2 (Y − µ1 + δ )
ln L(µ1 , µ1 − δ , σ ) =
∂ µ1 σ2
n1 X + n2 (Y + δ ) n1 + n2
= − µ1 .
σ2 σ2
This implies that the maximum likelihood estimator of µ1 is

n1 X + n2 (Y + δ )
µ̃1 = ,
n1 + n2
which does not depend on σ , since

∂
µ1 < µ̃1 ⇐⇒ ln L(µ1 , µ1 − δ , σ ) > 0,
∂ µ1
∂
µ1 > µ̃1 ⇐⇒ ln L(µ1 , µ1 − δ , σ ) < 0.
∂ µ1

Therefore, it is now sufficient to consider L(µ̃1 , µ̃1 − δ , σ ) for finding the maximum like-
lihood estimator of σ . By direct calculation,
5.4 Generalized likelihood ratio tests: Two-sample case 75
[ ]
∂ n1 + n2 1 n1 n2
ln L(µ̃1 , µ̃1 − δ , σ ) = − + 3 ∑ (Xi − µ̃1 ) + ∑ (Y j − µ̃1 + δ )
2 2
∂σ σ σ i=1 j=1
[ ]
n1 n2
1
= 3 −(n1 + n2 )σ + ∑ (Xi − µ̃1 ) + ∑ (Y j − µ̃1 + δ )
2 2 2
σ i=1 j=1

and the maximum likelihood estimator of σ is

v [ ]
u
u 1 n1 n2
t
σ̃ = ∑ (Xi − µ̃1 ) + ∑ (Y j − µ̃1 + δ ) ,
n1 + n2 i=1
2 2
j=1

since
∂
σ > σ̃ ⇐⇒ ln L(µ̃1 , µ̃1 − δ , σ ) < 0.
∂σ
Therefore,
n1 + n2
ln L(Ω0 ) = C − (n1 + n2 ) ln σ̃ − .
2
On Ω , we have
[ ]
n1 n2
1
ln L(µ1 , µ2 , σ ) = C − (n1 + n2 ) ln σ − 2
2σ ∑ (Xi − µ1 ) 2
+ ∑ (Y j − µ2 )
2
,
i=1 j=1

where C is a constant. Then, by direct calculation,

∂ n1 (X − µ1 )
ln L(µ1 , µ2 , σ ) = ,
∂ µ1 σ2
∂ n2 (Y − µ2 )
ln L(µ1 , µ2 , σ ) = ,
∂ µ2 σ2
[ ]
∂ n1 + n2 1 n1 n2

∂σ
ln L(µ1 , µ2 , σ ) = −
σ
+ 3
σ ∑ (Xi − µ1 )2 + ∑ (Y j − µ2 )2 .
i=1 j=1

Hence, by following the same routine as before, we can show that the maximum likelihood
estimators are

µ̂1 = X,
µ̂2 = Y ,
[ ]
n1 n2
1
σ̂ =
2
n1 + n2 ∑ (Xi − X) 2
+ ∑ (Y j −Y ) 2
.
i=1 j=1

Therefore,
n1 + n2
ln L(Ω ) = C − (n1 + n2 ) ln σ̂ − .
2
Now, the generalized likelihood ratio is
( )−(n1 +n2 )/2
L(Ω0 ) σ̃ −(n1 +n2 ) σ̃ 2
Λ= = −(n +n ) = .
L(Ω ) σ̂ 1 2 σ̂ 2
76 5 Hypothesis testing

Note that
n1 n2
∑ (Xi − µ̃1 )2 + ∑ (Y j − µ̃1 + δ )2
σ̃ 2 i=1 j=1
=
σ̂ 2 n1 n2
∑ (Xi − X)2 + ∑ (Y j −Y )2
i=1 j=1
n1 n2
∑ (Xi − X)2 + n1 (X − µ̃1 )2 + ∑ (Y j −Y )2 + n2 (Y − µ̃1 + δ )2
i=1 j=1
= n1 n2
∑ (Xi − X)2 + ∑ (Y j −Y )2
i=1 j=1

n1 (X − µ̃1 )2 + n2 (Y − µ̃1 + δ )2
= 1+ n1 n2
∑ (Xi − X)2 + ∑ (Y j −Y )2
i=1 j=1
[ ]2 [ ]2
n2 (X −Y − δ ) n1 (Y + δ − X)
n1 + n2
n1 + n2 n1 + n2
= 1+ n1 n2
∑ (Xi − X)2 + ∑ (Y j −Y )2
i=1 j=1
n1 n2
(X −Y − δ )2
n1 + n2
= 1+
n1 S12 + n2 S22
(X −Y − δ )2
= 1+ ( ) ,
1 1 [ 2 ]
+ n1 S1 + n2 S22
n1 n2
{ }
where S12 and S22 are the sample variances of {Xi , i = 1, 2, . . . , n1 } and Y j , j = 1, 2, . . . , n2
respectively, Therefore H0 should be rejected when

|X −Y − δ |
√
1√ 2
is large.
1 2
+ n1 S1 + n2 S2
n1 n2
( ) ( )
σ2 σ2
Under H0 , X follows N µ1 , and Y follows N µ1 − δ , , and thus X − Y
( ) n1 n2
σ2 σ2
follows N δ , + , which implies that
n1 n2

X −Y − δ
√ follows N(0, 1).
1 1
σ +
n1 n2

n1 S12 n2 S2
Besides, the fact that the two independent random variables and 22 follows χn21 −1
σ 2 σ
n1 S12 + n2 S22
and χn22 −1 respectively implies that follows χn21 +n2 −2 . Therefore,
σ2
5.4 Generalized likelihood ratio tests: Two-sample case 77

X −Y − δ
√
σ n11 + n12 X −Y − δ
W=√ / =√ √
2 2
n1 S1 + n2 S2 1 1 n1 S12 + n2 S22
(n1 + n2 − 2) +
σ2 n1 n2 n1 + n2 − 2

follows tn1 +n2 −2 by Property 4.2(ii). Letting

n1 n2
∑ (Xi − X)2 + ∑ (Y j −Y )2
n1 S12 + n2 S22 i=1 j=1
S2p = = ,
n1 + n2 − 2 n1 + n2 − 2
{ } X −Y − δ
then the rejection region is |W | ≥ tα /2,n1 +n2 −2 , where the test statistic is W = √ .
1 1
Sp +
n1 n2
⊓
⊔

From the aforemention example and the similar technique, we can have the following
table:

Table 5.5 Testing for the mean when variances σ12 = σ22 are unknown

Test H0 H1 Rejection region p-value

   
 X −Y − δ  x − y − δ
Two-tailed µ1 − µ2 = δ µ1 − µ2 ̸= δ √ ≥ tα /2,n1 +n2 −2 P |tn1 +n2 −2 | ≥ √ 
 S 1
+ 1  s 1
+ 1
p p
 n1 n2   n1 n2

 X −Y − δ  x − y − δ
Left-tailed µ1 − µ2 = δ µ1 − µ2 < δ √ ≤ −tα ,n1 +n2 −2 
P tn1 +n2 −2 ≤ √ 
S 1
+ 1  s 1
+ 1
p n1 n2 p n1 n2

or µ1 − µ2 ≥ δ
   
 X −Y − δ  x − y − δ
Right-tailed µ1 − µ2 = δ µ1 − µ2 > δ √ ≥ tα ,n1 +n2 −2 P tn1 +n2 −2 ≥ √ 
S 1
+ 1  s p n11 + n12
p n1 n2

or µ1 − µ2 ≤ δ

Remark 5.3. S p is called the pooled sample variance, which is an unbiased estimator of
σ 2 under H0 .

Example 5.18. A consumer agency wanted to estimate the difference in the mean amounts
of caffeine in two brands of coffee. The agency took a sample of 15 500-gramme jars of
Brand I coffee that showed the mean amount of caffeine in these jars to be 80 mg per jar
and the standard deviation to be 5 mg. Another sample of 12 500-gramme jars of Brand II
coffee gave a mean amount of caffeine equal to 77 mg per jar and a standard deviation of
6 mg. Assuming that the two populations are normally distributed with equal variances,
check at the 5% significance level whether the mean amount of caffeine in 500-gramme
jars is greater for Brand 1 than for Brand 2.
78 5 Hypothesis testing

Solution. Let the amounts of caffeine in jars of Brand I be referred to as population 1 and
those of Brand II be referred to as population 2.
We consider the hypotheses:

H0 : µ1 ≤ µ2 versus H1 : µ1 > µ2 ,

where µ1 and µ2 are the mean of population 1 and population 2, respectively.

Note that
n1 = 15, x1 = 80, s1 = 5,
and
n2 = 12, x2 = 77, s2 = 6, α = 0.05.
√
1 1
Hence, x1 − x2 = 80 − 77 = 3, tα ,n1 +n2 −2 = t0.05,25 ≈ 1.708, + ≈ 0.3873, and
n1 n2
√ √
n1 s21 + n2 s22 15 ∗ 52 + 12 ∗ 62
sp = = ≈ 5.4626.
n1 + n2 − 2 15 + 12 − 2

Therefore, the observed value of the test statistic is

x − x2 3
w= √1 = ≈ 1.42.
s p n11 + n12 5.4626 ∗ 0.3873

As 1.42 < 1.708, we can not reject H0 . Thus, we conclude that the mean amount of caf-
feine in 500-gramme jars is not greater for Brand 1 than for Brand 2 at the 5% significance
level. ⊔ ⊓

5.4.3 Testing for the variance

In the above subsection, the assumption σ1 = σ2 is needed. Hence, it is interesting to

perform tests comparing σ1 and σ2 .

Example 5.19. Find the generalized likelihood ratio test for hypotheses

H0 : σ1 = σ2 versus H1 : σ1 ̸= σ2 .

Solution. It can be proved (details omitted) that the generalized likelihood ratio is
( )n1 /2
S2
C 12
S2
[ ](n1 +n2 )/2 ,
S12
n1 2 + n2
S2

where C is a constant.
For w > 0 define the function
5.4 Generalized likelihood ratio tests: Two-sample case 79

wn1 /2
G(w) = .
[n1 w + n2 ](n1 +n2 )/2

Then,
n1 n1 + n2
ln G(w) = ln w − ln [n1 w + n2 ] ,
2 2
d n1 n1 + n2 n1
ln G(w) = − ×
dw 2w 2 n1 w + n2
n1 n2 (1 − w)
= ,
2w [n1 w + n2 ]

which is negative when w > 1 and is positive when w < 1. Therefore, the value of G(w)
will be small when w is very large or very small. Therefore H0 should be rejected when
S12
is large or small.
S22
n1 S12
n1 (n2 − 1)S12 (n1 −1)σ12
When H0 is true, = follows Fn1 −1,n2 −1 by Property 4.3. Thus,
n2 (n1 − 1)S22 n2 S22
(n2 −1)σ22
n1 (n2 − 1)S12
we let the test statistic be W = , and the rejection region is
n2 (n1 − 1)S22

{W ≤ F1−α /2,n1 −1,n2 −1 } ∪ {W ≥ Fα /2,n1 −1,n2 −1 }.

⊓
⊔
From the aforemention example and the similar technique, we can have the following
table:

Table 5.6 Testing for the variance

Test H0 H1 Rejection region p-value

{ } ( ( )
n1 (n2 − 1)S12 n1 (n2 − 1)s21
Two-tailed σ1 = σ2 σ1 ̸= σ2 ≥ F α /2,n −1,n −1 2 min P Fn −1,n −1 ≥ ,
n (n − 1)S22 1 2 1 2
n2 (n1 − 1)s22
{ 2 1 } ( ))
n1 (n2 − 1)S12 n1 (n2 − 1)s12
∪ ≤ F1−α /2,n1 −1,n2 −1 P Fn1 −1,n2 −1 ≤
n2 (n1 − 1)S22 n2 (n1 − 1)s22
{ } ( )
n1 (n2 − 1)S12 n1 (n2 − 1)s21
Left-tailed σ1 = σ2 σ1 < σ2 ≤ F1−α ,n1 −1,n2 −1 P F n1 −1,n2 −1 ≤
n2 (n1 − 1)S22 n2 (n1 − 1)s22
or σ1 ≥ σ2
{ } ( )
n1 (n2 − 1)S12 n1 (n2 − 1)s21
Right-tailed σ1 = σ2 σ1 > σ2 ≥ Fα ,n1 −1,n2 −1 P Fn1 −1,n2 −1 ≥
n2 (n1 − 1)S22 n2 (n1 − 1)s22
or σ1 ≤ σ2

Remark 5.4. Recall Fα ,m,n as the positive real number such that P(X ≥ Fα ,m,n ) = α where
X follows Fm,n . Suppose X follows Fm,n . Then 1/X follows Fn,m and

1
F1−α ,m,n = ,
Fα ,n,m
80 5 Hypothesis testing

because
( )
1 1
1 − α = P(F1−α ,m,n < X) = P >
F1−α ,m,n X
( )
1 1
=⇒ α = P ≤
F1−α ,m,n X
1
=⇒ = Fα ,n,m .
F1−α ,m,n

Example 5.20. A study involves the number of absences per year among union and non-
union workers. A sample of 16 union workers has a sample standard deviation of 3.0 days.
A sample of 10 non-union workers has a sample standard deviation of 2.5 days. At the
10% significance level, can we conclude that the variance of the number of days absent
for union workers is different from that for nonunion workers?

Solution. Let all union workers be referred to as population 1 and all non-union workers
be referred to as population 2.
We consider the hypotheses:

H0 : σ1 = σ2 versus H1 : σ1 ̸= σ2 ,

where σ12 and σ22 are the variance of population 1 and population 2, respectively.
Note that n1 = 16, s1 = 3, n2 = 10, and s2 = 2.5. Hence, the value of the test statistic
is
n1 (n2 − 1) s21 3.02
= 0.96 ∗ = 1.3824.
n2 (n1 − 1) s22 2.52
Since
1
< 1 < 1.3824 < 3.006 ≈ f0.05,15,9 ,
f0.05,9,15
we cannot reject H0 . Thus we conclude that the data do not indicate that the variance of
the number of days absent for union workers is different from that for non-union workers
at the 10% significance level. ⊓
⊔

Remark 5.5. Since H0 should not be rejected, we may test

H0 : µ1 = µ2 versus H1 : µ1 ̸= µ2

using the pooled sample variance.

5.5 Generalized likelihood ratio tests: Large samples

Unfortunately, the likelihood ratio method does not always produce a test statistic with a
known probability distribution. Nevertheless, if the sample size is large, we can obtain an
approximation to the distribution of a generalized likelihood ratio.

Theorem 5.2. Suppose that we are testing

H0 : θi = θi,0 for all i = 1, 2, . . . , d

5.5 Generalized likelihood ratio tests: Large samples 81

versus
H1 : θi ̸= θi,0 for at least one i = 1, 2, . . . , d
and that Λ is the generalized likelihood ratio. Then, under very general conditions, when
H0 is true,
−2 ln Λ →d χd2 as n → ∞.

Remark 5.6. For more general cases, d can be determined by

d = [number of parameters to estimate when determining L(Ω )]

− [number of parameters to estimate when determining L(Ω0 )] .

For example, if we test

H0 : θi = θi,0 for all i = 1, 2, . . . , m

against
H1 : θi ̸= θi,0 for at least one i = 1, 2, . . . , m,
then d = m.

5.5.1 Goodness-of-fit tests

Suppose m is an integer greater than 1, and there is a population X such that

P(X = ai ) = pi , i = 1, 2, . . . , m,

where i ̸= j implies ai ̸= a j for i, j = 1, 2, . . . , m, pi > 0 for i = 1, 2, . . . , m and

p1 + p2 + · · · + pm = 1.

Now suppose X1 , X2 , . . . , Xn constitute a random sample of size n from the population.

For i = 1, 2, . . . , m, let Yi be the number of k such that Xk = ai . Then
n!
P(Yi = yi , i = 1, 2, . . . , m) = py1 py2 · · · pymm
y1 !y2 ! · · · ym ! 1 2
for non-negative integers yi , i = 1, 2, . . . , m, such that y1 + y2 + · · · + ym = n.
We want to test

H0 : pi = pi,0 for all i = 1, 2, . . . , m

versus
H1 : pi ̸= pi,0 for at least one i = 1, 2, . . . , m,
where pi,0 > 0 for i = 1, 2, . . . , m and

p1,0 + p2,0 + · · · + pm,0 = 1.

This is equivalent to testing

H0 : pi = pi,0 for all i = 1, 2, . . . , m − 1

82 5 Hypothesis testing

versus
H1 : pi ̸= pi,0 for at least one i = 1, 2, . . . , m − 1,
where pi,0 > 0 for i = 1, 2, . . . , m − 1 and

p1,0 + p2,0 + · · · + pm−1,0 < 1.

If we let (O for observed frequency and E for expected frequency when H0 is true)

Oi = Yi for i = 1, 2, . . . , m − 1,
m−1
Om = n − ∑ Yi ,
i=1
Ei = npi,0 for i = 1, 2, . . . , m − 1,
( )
m−1
Em = n 1 − ∑ pi,0 ,
i=1

then, it can be proved that

m
(Oi − Ei )2 m
O2 m m m
O2
−2 ln Λ ≈ ∑ Ei
= ∑ i − 2 ∑ Oi + ∑ Ei = ∑ i − 2n + n
i=1 i=1 Ei i=1 i=1 i=1 Ei
m
O2
= ∑ Eii − n.
i=1

Note that {−2 ln Λ ≥ K} is the rejection region, and

{ }
m
(Oi − Ei )2
∑ Ei ≥ χα2 ,m−1
i=1

can serve as an approximate rejection region. Since this is only an approximate result,
it is suggested that all expected frequencies should be no less than 5, so that the sample
is large enough. To meet this rule, some categories may be combined when to do so is
logical.

Example 5.21. A journal reported that, in a bag of m&m’s chocolate peanut candies, there
are 30% brown, 30% yellow, 10% blue, 10% red, 10% green and 10% orange candies.
Suppose you purchase a bag of m&m’s chocolate peanut candies at a nearby store and
find 17 brown, 20 yellow, 13 blue, 7 red, 6 green and 9 orange candies, for a total of 72
candies. At the 0.1 level of significance, does the bag purchased agree with the distribution
suggested by the journal?

Solution. Consider hypotheses:

H0 : the bag purchased agrees with the distribution suggested by the journal,

versus

H1 : the bag purchased does not agree with the distribution suggested by the journal.

Then we have the table below, in which all expected frequencies are at least 5.
5.5 Generalized likelihood ratio tests: Large samples 83

Colour Oi Ei O i − Ei
Brown 17 72 × 30% = 21.6 -4.6
Yellow 20 72 × 30% = 21.6 -1.6
Blue 13 72 × 10% = 7.2 5.8
Red 7 72 × 10% = 7.2 -0.2
Green 6 72 × 10% = 7.2 -1.2
Orange 9 72 × 10% = 7.2 1.8
Total 72 72 0

Therefore, as the sample is large enough,

6
O2
−2 ln Λ ≈ ∑ Eii −n
i=1
172 + 202132 + 72 + 62 + 92
= + − 72
21.6 7.2
≈ 6.426 < 9.236 ≈ χ0.1,6−1
2
.

Alternatively,
6
(Oi − Ei )2
−2 ln Λ ≈ ∑ Ei
i=1
(−4.6)2 (−1.6)2 5.82 (−0.2)2 (−1.2)2 1.82
= + + + + +
21.6 21.6 7.2 7.2 7.2 7.2
≈ 6.426 < 9.236 ≈ χ0.1,6−1
2
.

Hence we should not reject H0 . At the significance level 10%, we cannot conclude that
the bag purchased does not agree with the distribution suggested by the journal. ⊔
⊓

Example 5.22. A traffic engineer wishes to study whether drivers have a preference for
certain tollbooths at a bridge during non-rush hours. The number of automobiles passing
through each tollbooth lane was counted during a randomly selected 15-minute interval.
The sample information is as follows.
Tollbooth Lane 1 2 3 4 5 Total
Number of Cars observed 171 224 211 180 214 100

Can we conclude that there are differences in the numbers of cars selecting respectively
each of the lanes? Test at the 5% significance level.

Solution. Consider hypotheses:

H0 : there is no preference among the five lanes,

versus

H1 : there is a preference among the five lanes.

All the five expected frequencies equal 1000 ÷ 5 = 200, which is not less than 5. There-
fore, as the sample is large enough,
84 5 Hypothesis testing

5
O2
−2 ln Λ ≈ ∑ Eii −n
i=1
1712 + 2242 + 2112 + 1802 + 2142
= − 1000
200
≈ 10.67 ≥ 9.488 ≈ χ0.05,5−1
2
.

Hence, H0 should be rejected. At the significance level 5%, we can conclude that there
are differences in the numbers of cars selecting respectively each of the lanes. ⊓
⊔

When testing goodness of fit to help select an appropriate population model, we usually
are interested in testing whether some family of distributions seems appropriate and are
not interested in the lack of fit due to the wrong parameter values. Suppose we want to
test

H0 : the population follows a particular distribution with k unknown parameters.

For calculating Ei ’s, we have to use the maximum likelihood estimate of the unknown
parameters. Then the rejection region is {−2 ln Λ ≥ K} or, approximately,
{ }
m
(Oi − Ei )2
∑ Ei ≥ χα2 ,m−1−k .
i=1

5.5.2 Pearson Chi-squared test of independence

Consider the following joint distribution of two discrete random variables X and Y :

Value of Y
Probability Row sum
b1 ··· bj ··· bc
a1 p1,1 · · · p1, j · · · p1,c p1.
··· ··· ··· ··· ··· ··· ···
Value of X ai pi,1 · · · pi, j · · · pi,c pi.
··· ··· ··· ··· ··· ··· ···
ar pr,1 · · · pr, j · · · pr,c pr.
Column sum p.1 · · · p. j · · · p.c 1
We want to test
H0 : X and Y are independent
versus
H1 : X and Y are not independent.
That is, we want to test

H0 : pi, j = pi· p· j for i = 1, 2, . . . , r − 1 and j = 1, 2, . . . , c − 1

versus

H1 : pi, j ̸= pi· p· j for at least one i and j,

where i = 1, 2, . . . , r − 1 and j = 1, 2, . . . , c − 1.
5.5 Generalized likelihood ratio tests: Large samples 85

A random sample of size n taken from this distribution is a set of n independent vectors,
or ordered pairs of random variables, (X1 ,Y1 ), (X2 ,Y2 ), . . ., (Xn ,Yn ) each following this
distribution. From such a sample we obtain the following table, where Oi, j (called the
observed frequency of the (i, j)-th cell) is the number of k such that Xk = ai and Yk = b j ,
i = 1, 2, . . . , r, j = 1, 2, . . . , c. A box containing an observed frequency is called a cell.
Such a two-way classification table is also called a contingency table or cross-tabulation.
Ours is an r × c contingency table.
Value of Y
Observed frequency Row sum
b1 ··· bj ··· bc
a1 O1,1 · · · O1, j · · · O1,c n1.
··· ··· ··· ··· ··· ··· ···
Value of X ai Oi,1 · · · Oi, j · · · Oi,c ni.
··· ··· ··· ··· ··· ··· ···
ar Or,1 · · · Or, j · · · Or,c nr.
Column sum n.1 · · · n. j · · · n.c n
Let Λ be the generalized likelihood ratio. Then it can be proved that
( )
r c O2 r c O2
r c
(Oi, j − Ei, j )2
−2lnΛ ≈ ∑ ∑ =∑∑ −n = n ∑ ∑
i, j i, j
−1
i=1 j=1 Ei, j i=1 j=1 Ei, j i=1 j=1 ni· n· j

ni· n· j
where Ei, j = is the expected frequency corresponding to Oi, j when H0 is true, i =
n
1, 2, . . . , r and j = 1, 2, . . . , c. The rejection region is approximately
{ }
r c
(Oi, j − Ei, j )2
∑∑ Ei, j
≥ χα ,(r−1)(c−1) ,
2
i=1 j=1

where

the number of degrees of freedom

= [number of parameters to estimate when determining L(Ω )]
− [number of parameters to estimate when determining L(Ω0 )]
= (rc − 1) − [(r − 1) + (c − 1)]
= rc − r − c + 1
= (r − 1)(c − 1).

This test is called the Pearson Chi-squared test of independence. As in previous sections,
we require each expected frequency to be at least 5.
Example 5.23. Suppose we draw a sample of 360 students and obtain the following infor-
mation. At the 0.01 level of significance, test whether a student’s ability in mathematics
is independent of the student’s interest in statistics.
Ability in Math
sum
Low Average High
Low 63 42 15 120
Interest in Statistics Average 58 61 31 150
High 14 47 29 90
Sum 135 150 75 360
86 5 Hypothesis testing

Solution. Consider hypotheses:

H0 : ability in mathematics and interest in statistics are independent,

versus

H1 : ability in mathematics and interest in statistics are not independent (are related).

The table below shows the expected frequencies (where, for example, 45 = 120 × 135 ÷
360 and 50 = 120 × 150 ÷ 360).
Ability in Math
sum
Low Average High
Low 45 50 25 120
Interest in Statistics Average 56.25 62.5 31.25 150
High 33.75 37.5 18.75 90
Sum 135 150 75 360
All expected frequencies are at least 5. Therefore, as the sample is large enough,
( ) ( )
r c O2
632 422 292
n ∑∑
i, j
− 1 = 360 + +···+ −1
i=1 j=1 ni· n· j 120 × 135 120 × 150 90 × 75
≈ 32.140 ≥ 13.277 ≈ χ0.01,(3−1)(3−1)
2
.

Hence, at the significance level 1%, we reject H0 and conclude that there is a relationship
between a student’s ability in mathematics and the student’s interest in statistics.
Alternatively, the value of the test statistic equals
3 3
(Oi, j − Ei, j )2 (63 − 45)2 (42 − 50)2 (29 − 18.75)2
∑∑ Ei, j
=
45
+
50
+···+
18.75
i=1 j=1

≈ 32.140 ≥ 13.277 ≈ χ0.01,(3−1)(3−1)

2
.

⊓
⊔

ALL ST218 Lecture Notes
No ratings yet
ALL ST218 Lecture Notes
87 pages
William D. Penny - Signal Processing Course
100% (1)
William D. Penny - Signal Processing Course
178 pages
Statistical Methods in Data Analysis - W. J. Metzger
No ratings yet
Statistical Methods in Data Analysis - W. J. Metzger
278 pages
book (1)
No ratings yet
book (1)
113 pages
8822 LectureNotes
No ratings yet
8822 LectureNotes
74 pages
ps-notes
No ratings yet
ps-notes
62 pages
Lecture Notes For STAT2602
No ratings yet
Lecture Notes For STAT2602
104 pages
Book
No ratings yet
Book
106 pages
Probability theory (1)
No ratings yet
Probability theory (1)
78 pages
Econometricks-short Guide
No ratings yet
Econometricks-short Guide
110 pages
Formulae and Distributions Tables
No ratings yet
Formulae and Distributions Tables
19 pages
Other Lec From Other
100% (1)
Other Lec From Other
205 pages
Introduction To Probability Theory and S
No ratings yet
Introduction To Probability Theory and S
127 pages
Introduction To Probability Theory and Statistics
No ratings yet
Introduction To Probability Theory and Statistics
127 pages
Statistical Inference in Science
No ratings yet
Statistical Inference in Science
262 pages
Handbook Statistical Foundations of Machine Learning
No ratings yet
Handbook Statistical Foundations of Machine Learning
267 pages
STAT 330 Supplementary Notes
No ratings yet
STAT 330 Supplementary Notes
134 pages
Probability and Statistics Ii: George Deligiannidis Module Lecturer 2020/21: Kalliopi Mylona
No ratings yet
Probability and Statistics Ii: George Deligiannidis Module Lecturer 2020/21: Kalliopi Mylona
107 pages
REG2022
No ratings yet
REG2022
313 pages
Cimentaciones Maquinas
100% (1)
Cimentaciones Maquinas
235 pages
STAT6101 Coursenotes 1516 PDF
No ratings yet
STAT6101 Coursenotes 1516 PDF
73 pages
Support 1 Annee 2023-24
No ratings yet
Support 1 Annee 2023-24
161 pages
Statistical Analysis of Contingency Tables (Fagerland, Morten W. Laake, Petter Lydersen Etc.) (Z-Library)
No ratings yet
Statistical Analysis of Contingency Tables (Fagerland, Morten W. Laake, Petter Lydersen Etc.) (Z-Library)
657 pages
STAT515 Lecture
No ratings yet
STAT515 Lecture
85 pages
An Introduction to Biostatistic 3rd Edition Thomas Glover download
100% (1)
An Introduction to Biostatistic 3rd Edition Thomas Glover download
74 pages
Lecture Notes in Probability: Raz Kupferman Institute of Mathematics The Hebrew University April 5, 2009
No ratings yet
Lecture Notes in Probability: Raz Kupferman Institute of Mathematics The Hebrew University April 5, 2009
159 pages
Stoch Book 19
No ratings yet
Stoch Book 19
134 pages
4b_ProbabilityNotes
No ratings yet
4b_ProbabilityNotes
79 pages
Mathematical Methods Notes
No ratings yet
Mathematical Methods Notes
230 pages
Lectnotemat 5
No ratings yet
Lectnotemat 5
346 pages
CI Textbook
No ratings yet
CI Textbook
490 pages
Ec2142 CourseNotes
No ratings yet
Ec2142 CourseNotes
94 pages
ECON835 Lecture Notes Part 1 Probability Through Asymptotics [Fall 2014]
No ratings yet
ECON835 Lecture Notes Part 1 Probability Through Asymptotics [Fall 2014]
75 pages
Full Notes
No ratings yet
Full Notes
197 pages
Notes Aukland Studied PDF
No ratings yet
Notes Aukland Studied PDF
200 pages
Bias Reduction in Nonparametric Hazard
No ratings yet
Bias Reduction in Nonparametric Hazard
112 pages
Math 6410: Ordinary Differential Equations: Lectures Notes On
No ratings yet
Math 6410: Ordinary Differential Equations: Lectures Notes On
100 pages
Course Notes
No ratings yet
Course Notes
211 pages
Part IA - Probability: Definitions
No ratings yet
Part IA - Probability: Definitions
18 pages
analysis2
No ratings yet
analysis2
66 pages
Class Notes
No ratings yet
Class Notes
147 pages
Estimations
100% (1)
Estimations
183 pages
Course Notes Stats 210 Statistical Theory
No ratings yet
Course Notes Stats 210 Statistical Theory
199 pages
Book Statistik Non Parametrik, Komang Suardika
No ratings yet
Book Statistik Non Parametrik, Komang Suardika
492 pages
Mathematical Statistics Intro Course 1713243381
No ratings yet
Mathematical Statistics Intro Course 1713243381
142 pages
Course Notes Stochastic Processes - Auckland
No ratings yet
Course Notes Stochastic Processes - Auckland
195 pages
(Universitext) Pierre Brémaud - Probability Theory and Stochastic Processes (2020, Springer)
100% (5)
(Universitext) Pierre Brémaud - Probability Theory and Stochastic Processes (2020, Springer)
717 pages
Intro To Prob Theory
No ratings yet
Intro To Prob Theory
302 pages
EC400Stats Lecturenotes2021
No ratings yet
EC400Stats Lecturenotes2021
101 pages
Stochastic Processes
No ratings yet
Stochastic Processes
133 pages
The Moment Problem Konrad Schmüdgen 2017
No ratings yet
The Moment Problem Konrad Schmüdgen 2017
530 pages
MA225 L3 Notes
No ratings yet
MA225 L3 Notes
40 pages
Delay Differential Equation Models in Mathematical Biology
No ratings yet
Delay Differential Equation Models in Mathematical Biology
104 pages
Lectnotemat 2
No ratings yet
Lectnotemat 2
348 pages
Gilboa Notes For Introduction To Decision Theory
No ratings yet
Gilboa Notes For Introduction To Decision Theory
79 pages
Statistical Tolerance Regions: Theory, Applications, and Computation
From Everand
Statistical Tolerance Regions: Theory, Applications, and Computation
Kalimuthu Krishnamoorthy
No ratings yet
Inference and Prediction in Large Dimensions
From Everand
Inference and Prediction in Large Dimensions
Denis Bosq
3.5/5 (1)
Computer-Aided Modeling of Reactive Systems
From Everand
Computer-Aided Modeling of Reactive Systems
Warren E. Stewart
No ratings yet
Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods
From Everand
Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods
Joseph Keshet
No ratings yet
Error-controlled Adaptive Finite Elements in Solid Mechanics
From Everand
Error-controlled Adaptive Finite Elements in Solid Mechanics
Erwin Stein
No ratings yet
AE6207_Problem Set 5
No ratings yet
AE6207_Problem Set 5
1 page
MATH230 Lecture Notes 3
No ratings yet
MATH230 Lecture Notes 3
45 pages
PMMT110 Tutorial Sheet 6 S2 2024
No ratings yet
PMMT110 Tutorial Sheet 6 S2 2024
9 pages
[FREE PDF sample] (Original PDF) Categorical Data Analysis 3rd Edition by Alan Agresti ebooks
No ratings yet
[FREE PDF sample] (Original PDF) Categorical Data Analysis 3rd Edition by Alan Agresti ebooks
45 pages
Session 4-6
No ratings yet
Session 4-6
69 pages
Topic 5
No ratings yet
Topic 5
28 pages
Probst at Lab
No ratings yet
Probst at Lab
9 pages
Assignment 2
No ratings yet
Assignment 2
4 pages
L 2 Discrete Distributions
No ratings yet
L 2 Discrete Distributions
18 pages
GSLC 2 (Renewal Processes PART 2) - Leonardo - 2301857374 - LA41
No ratings yet
GSLC 2 (Renewal Processes PART 2) - Leonardo - 2301857374 - LA41
3 pages
Statistics For Managers Using Microsoft Excel 8th Edition Levine Solutions Manual 1
100% (40)
Statistics For Managers Using Microsoft Excel 8th Edition Levine Solutions Manual 1
36 pages
Bivariate Distributions
No ratings yet
Bivariate Distributions
60 pages
Variance Gamma
No ratings yet
Variance Gamma
24 pages
Y11 SM Textbook - Worked Solutions - CH 12
No ratings yet
Y11 SM Textbook - Worked Solutions - CH 12
13 pages
LAS 06 Mean of A Discrete Random Variable
No ratings yet
LAS 06 Mean of A Discrete Random Variable
2 pages
Lecture Notes Week 2
No ratings yet
Lecture Notes Week 2
10 pages
2014 Fröhlich ReinsurancePricing
No ratings yet
2014 Fröhlich ReinsurancePricing
85 pages
(Slide) NIU 2024 Fall Statistics I Lecture Note 2 Topic 2
No ratings yet
(Slide) NIU 2024 Fall Statistics I Lecture Note 2 Topic 2
59 pages
Mathematics 12 02466
No ratings yet
Mathematics 12 02466
29 pages
Propensity Score Matching
No ratings yet
Propensity Score Matching
14 pages
Assignment 4
No ratings yet
Assignment 4
1 page
Homework 4
No ratings yet
Homework 4
2 pages
MA40189 20 Open
No ratings yet
MA40189 20 Open
6 pages
Skewness, Moments and Kurtosis
No ratings yet
Skewness, Moments and Kurtosis
15 pages
Stable Distribution
No ratings yet
Stable Distribution
8 pages
结构方程模型输出
No ratings yet
结构方程模型输出
2 pages
Gamma Distribution and Beta Distribution
No ratings yet
Gamma Distribution and Beta Distribution
6 pages
Ch4 Sol
No ratings yet
Ch4 Sol
21 pages
Wind Energy: Homework2: BEGGAH Rofeyda Nor El Yaquine October 21, 2023
No ratings yet
Wind Energy: Homework2: BEGGAH Rofeyda Nor El Yaquine October 21, 2023
5 pages
02-Random Variables
No ratings yet
02-Random Variables
62 pages