Rohatgi Expl
Rohatgi Expl
Rohatgi Expl
Mathematical Statistics II
Spring Semester 2012
Dr. J urgen Symanzik
Utah State University
Department of Mathematics and Statistics
3900 Old Main Hill
Logan, UT 843223900
Tel.: (435) 7970696
FAX: (435) 7971822
e-mail: symanzik@math.usu.edu
Contents
Acknowledgements 1
6 Limit Theorems 1
6.1 Modes of Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
6.2 Weak Laws of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
6.3 Strong Laws of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
6.4 Central Limit Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
7 Sample Moments 36
7.1 Random Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.2 Sample Moments and the Normal Distribution . . . . . . . . . . . . . . . . . . 39
8 The Theory of Point Estimation 44
8.1 The Problem of Point Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.2 Properties of Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
8.3 Sucient Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
8.4 Unbiased Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
8.5 Lower Bounds for the Variance of an Estimate . . . . . . . . . . . . . . . . . . 67
8.6 The Method of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
8.7 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.8 Decision Theory Bayes and Minimax Estimation . . . . . . . . . . . . . . . . 83
9 Hypothesis Testing 91
9.1 Fundamental Notions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
9.2 The NeymanPearson Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
9.3 Monotone Likelihood Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
9.4 Unbiased and Invariant Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
10 More on Hypothesis Testing 115
10.1 Likelihood Ratio Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
10.2 Parametric ChiSquared Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
10.3 tTests and FTests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
10.4 Bayes and Minimax Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
1
11 Condence Estimation 134
11.1 Fundamental Notions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
11.2 ShortestLength Condence Intervals . . . . . . . . . . . . . . . . . . . . . . . . 138
11.3 Condence Intervals and Hypothesis Tests . . . . . . . . . . . . . . . . . . . . . 143
11.4 Bayes Condence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
12 Nonparametric Inference 152
12.1 Nonparametric Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
12.2 Single-Sample Hypothesis Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
12.3 More on Order Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
13 Some Results from Sampling 169
13.1 Simple Random Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
13.2 Stratied Random Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
14 Some Results from Sequential Statistical Inference 176
14.1 Fundamentals of Sequential Sampling . . . . . . . . . . . . . . . . . . . . . . . 176
14.2 Sequential Probability Ratio Tests . . . . . . . . . . . . . . . . . . . . . . . . . 180
Index 184
2
Acknowledgements
I would like to thank my students, Hanadi B. Eltahir, Rich Madsen, and Bill Morphet, who
helped during the Fall 1999 and Spring 2000 semesters in typesetting these lecture notes
using L
A
T
E
X and for their suggestions how to improve some of the material presented in class.
Thanks are also due to more than 50 students who took Stat 6710/20 with me since the Fall
2000 semester for their valuable comments that helped to improve and correct these lecture
notes.
In addition, I particularly would like to thank Mike Minnotte and Dan Coster, who previously
taught this course at Utah State University, for providing me with their lecture notes and other
materials related to this course. Their lecture notes, combined with additional material from
Casella/Berger (2002), Rohatgi (1976) and other sources listed below, form the basis of the
script presented here.
The primary textbook required for this class is:
Casella, G., and Berger, R. L. (2002): Statistical Inference (Second Edition), Duxbury
Press/Thomson Learning, Pacic Grove, CA.
A Web page dedicated to this class is accessible at:
http://www.math.usu.edu/~symanzik/teaching/2012_stat6720/stat6720.html
This course closely follows Casella and Berger (2002) as described in the syllabus. Additional
material originates from the lectures from Professors Hering, Trenkler, and Gather I have
attended while studying at the Universitat Dortmund, Germany, the collection of Masters and
PhD Preliminary Exam questions from Iowa State University, Ames, Iowa, and the following
textbooks:
Bandelow, C. (1981): Einf uhrung in die Wahrscheinlichkeitstheorie, Bibliographisches
Institut, Mannheim, Germany.
B uning, H., and Trenkler, G. (1978): Nichtparametrische statistische Methoden, Walter
de Gruyter, Berlin, Germany.
Casella, G., and Berger, R. L. (1990): Statistical Inference, Wadsworth & Brooks/Cole,
Pacic Grove, CA.
Fisz, M. (1989): Wahrscheinlichkeitsrechnung und mathematische Statistik, VEB Deut-
scher Verlag der Wissenschaften, Berlin, German Democratic Republic.
Gibbons, J. D., and Chakraborti, S. (1992): Nonparametric Statistical Inference (Third
Edition, Revised and Expanded), Dekker, New York, NY.
3
Johnson, N. L., and Kotz, S., and Balakrishnan, N. (1994): Continuous Univariate
Distributions, Volume 1 (Second Edition), Wiley, New York, NY.
Johnson, N. L., and Kotz, S., and Balakrishnan, N. (1995): Continuous Univariate
Distributions, Volume 2 (Second Edition), Wiley, New York, NY.
Kelly, D. G. (1994): Introduction to Probability, Macmillan, New York, NY.
Lehmann, E. L. (1983): Theory of Point Estimation (1991 Reprint), Wadsworth &
Brooks/Cole, Pacic Grove, CA.
Lehmann, E. L. (1986): Testing Statistical Hypotheses (Second Edition 1994 Reprint),
Chapman & Hall, New York, NY.
Mood, A. M., and Graybill, F. A., and Boes, D. C. (1974): Introduction to the Theory
of Statistics (Third Edition), McGraw-Hill, Singapore.
Parzen, E. (1960): Modern Probability Theory and Its Applications, Wiley, New York,
NY.
Rohatgi, V. K. (1976): An Introduction to Probability Theory and Mathematical Statis-
tics, John Wiley and Sons, New York, NY.
Rohatgi, V. K., and Saleh, A. K. E. (2001): An Introduction to Probability and Statistics
(Second Edition), John Wiley and Sons, New York, NY.
Searle, S. R. (1971): Linear Models, Wiley, New York, NY.
Tamhane, A. C., and Dunlop, D. D. (2000): Statistics and Data Analysis From Ele-
mentary to Intermediate, Prentice Hall, Upper Saddle River, NJ.
Additional denitions, integrals, sums, etc. originate from the following formula collections:
Bronstein, I. N. and Semendjajew, K. A. (1985): Taschenbuch der Mathematik (22.
Auage), Verlag Harri Deutsch, Thun, German Democratic Republic.
Bronstein, I. N. and Semendjajew, K. A. (1986): Erganzende Kapitel zu Taschenbuch der
Mathematik (4. Auage), Verlag Harri Deutsch, Thun, German Democratic Republic.
Sieber, H. (1980): Mathematische Formeln Erweiterte Ausgabe E, Ernst Klett, Stuttgart,
Germany.
J urgen Symanzik, January 16, 2012.
4
6 Limit Theorems
(Based on Rohatgi, Chapter 6, Rohatgi/Saleh, Chapter 6 & Casella/Berger,
Section 5.5)
Motivation:
I found this slide from my Stat 250, Section 003, Introductory Statistics class (an under-
graduate class I taught at George Mason University in Spring 1999):
What does this mean at a more theoretical level???
1
6.1 Modes of Convergence
Denition 6.1.1:
Let X
1
, . . . , X
n
be iid rvs with common cdf F
X
(x). Let T = T(X) be any statistic, i.e., a
Borelmeasurable function of X that does not involve the population parameter(s) , dened
on the support A of X. The induced probability distribution of T(X) is called the sampling
distribution of T(X).
Note:
(i) Commonly used statistics are:
Sample Mean: X
n
=
1
n
n
i=1
X
i
Sample Variance: S
2
n
=
1
n1
n
i=1
(X
i
X
n
)
2
Sample Median, Order Statistics, Min, Max, etc.
(ii) Recall that if X
1
, . . . , X
n
are iid and if E(X) and V ar(X) exist, then E(X
n
) = =
E(X), E(S
2
n
) =
2
= V ar(X), and V ar(X
n
) =
2
n
.
(iii) Recall that if X
1
, . . . , X
n
are iid and if X has mgf M
X
(t) or characteristic function
X
(t)
then M
Xn
(t) = (M
X
(
t
n
))
n
or
Xn
(t) = (
X
(
t
n
))
n
.
Note: Let X
n
n=1
be a sequence of rvs on some probability space (, L, P). Is there any
meaning behind the expression lim
n
X
n
= X? Not immediately under the usual denitions
of limits. We rst need to dene modes of convergence for rvs and probabilities.
Denition 6.1.2:
Let X
n
n=1
be a sequence of rvs with cdfs F
n
n=1
and let X be a rv with cdf F. If
F
n
(x) F(x) at all continuity points of F, we say that X
n
converges in distribution to
X (X
n
d
X) or X
n
converges in law to X (X
n
L
X), or F
n
converges weakly to F
(F
n
w
F).
Example 6.1.3:
Let X
n
N(0,
1
n
). Then
F
n
(x) =
_
x
exp
_
1
2
nt
2
_
_
2
n
dt
2
=
_
nx
exp(
1
2
s
2
)
2
ds
= (
nx)
= F
n
(x)
() = 1, if x > 0
(0) =
1
2
, if x = 0
() = 0, if x < 0
If F
X
(x) =
_
1, x 0
0, x < 0
the only point of discontinuity is at x = 0. Everywhere else,
(
nx) = F
n
(x) F
X
(x), where (z) = P(Z z) with Z N(0, 1).
So, X
n
d
X, where P(X = 0) = 1, or X
n
d
0 since the limiting rv here is degenerate,
i.e., it has a Dirac(0) distribution.
Example 6.1.4:
In this example, the sequence F
n
n=1
converges pointwise to something that is not a cdf:
Let X
n
Dirac(n), i.e., P(X
n
= n) = 1. Then,
F
n
(x) =
_
0, x < n
1, x n
It is F
n
(x) 0 x which is not a cdf. Thus, there is no rv X such that X
n
d
X.
Example 6.1.5:
Let X
n
n=1
be a sequence of rvs such that P(X
n
= 0) = 1
1
n
and P(X
n
= n) =
1
n
and let
X Dirac(0), i.e., P(X = 0) = 1.
It is
F
n
(x) =
0, x < 0
1
1
n
, 0 x < n
1, x n
F
X
(x) =
_
0, x < 0
1, x 0
It holds that F
n
w
F
X
but
E(X
k
n
) = 0
k
(1
1
n
) +n
k
1
n
= n
k1
, E(X
k
) = 0.
Thus, convergence in distribution does not imply convergence of moments/means.
3
Note:
Convergence in distribution does not say that the X
i
s are close to each other or to X. It only
means that their cdfs are (eventually) close to some cdf F. The X
i
s do not even have to be
dened on the same probability space.
Example 6.1.6:
Let X and X
n
n=1
be iid N(0, 1). Obviously, X
n
d
X but lim
n
X
n
,= X.
Theorem 6.1.7:
Let X and X
n
n=1
be discrete rvs with support A and A
n
n=1
, respectively. Dene
the countable set A = A
_
n=1
A
n
= a
k
: k = 1, 2, 3, . . .. Let p
k
= P(X = a
k
) and
p
nk
= P(X
n
= a
k
). Then it holds that p
nk
p
k
k i X
n
d
X.
Theorem 6.1.8:
Let X and X
n
n=1
be continuous rvs with pdfs f and f
n
n=1
, respectively. If f
n
(x) f(x)
for almost all x as n then X
n
d
X.
Theorem 6.1.9:
Let X and X
n
n=1
be rvs such that X
n
d
X. Let c IR be a constant. Then it holds:
(i) X
n
+c
d
X +c.
(ii) cX
n
d
cX.
(iii) If a
n
a and b
n
b, then a
n
X
n
+b
n
d
aX +b.
Proof:
Part (iii):
Suppose that a > 0, a
n
> 0. (If a < 0, a
n
< 0, the result follows via (ii) and c = 1.)
Let Y
n
= a
n
X
n
+b
n
and Y = aX +b. It is
F
Y
(y) = P(Y y) = P(aX +b y) = P(X
y b
a
) = F
X
(
y b
a
).
Likewise,
F
Yn
(y) = F
Xn
(
y b
n
a
n
).
If y is a continuity point of F
Y
,
yb
a
is a continuity point of F
X
. Since a
n
a, b
n
b and
F
Xn
(x) F
X
(x), it follows that F
Yn
(y) F
Y
(y) for every continuity point y of F
Y
. Thus,
a
n
X
n
+b
n
d
aX +b.
4
Denition 6.1.10:
Let X
n
n=1
be a sequence of rvs dened on a probability space (, L, P). We say that X
n
converges in probability to a rv X (X
n
p
X, P- lim
n
X
n
= X) if
lim
n
P([ X
n
X [> ) = 0 > 0.
Note:
The following are equivalent:
lim
n
P([ X
n
X [> ) = 0
lim
n
P([ X
n
X [ ) = 1
lim
n
P( : [ X
n
() X() [> ) = 0
If X is degenerate, i.e., P(X = c) = 1, we say that X
n
is consistent for c. For example, let
X
n
such that P(X
n
= 0) = 1
1
n
and P(X
n
= 1) =
1
n
. Then
P([ X
n
[> ) =
_
1
n
, 0 < < 1
0, 1
Therefore, lim
n
P([ X
n
[> ) = 0 > 0. So X
n
p
0, i.e., X
n
is consistent for 0.
Theorem 6.1.11:
(i) X
n
p
X X
n
X
p
0.
(ii) X
n
p
X, X
n
p
Y = P(X = Y ) = 1.
(iii) X
n
p
X, X
m
p
X = X
n
X
m
p
0 as n, m .
(iv) X
n
p
X, Y
n
p
Y = X
n
Y
n
p
X Y .
(v) X
n
p
X, k IR a constant = kX
n
p
kX.
(vi) X
n
p
k, k IR a constant = X
r
n
p
k
r
r IN.
(vii) X
n
p
a, Y
n
p
b, a, b IR = X
n
Y
n
p
ab.
(viii) X
n
p
1 = X
1
n
p
1.
(ix) X
n
p
a, Y
n
p
b, a IR, b IR0 =
Xn
Yn
p
a
b
.
(x) X
n
p
X, Y an arbitrary rv = X
n
Y
p
XY .
5
(xi) X
n
p
X, Y
n
p
Y = X
n
Y
n
p
XY .
Proof:
See Rohatgi, page 244245, and Rohatgi/Saleh, page 260261, for partial proofs.
Theorem 6.1.12:
Let X
n
p
X and let g be a continuous function on IR. Then g(X
n
)
p
g(X).
Proof:
Preconditions:
1.) X rv = > 0 k = k() : P([X[ > k) <
2
2.) g is continuous on IR
= g is also uniformly continuous on [k, k] (see Denition of uniformly continuous
in Theorem 3.3.3 (iii):
> 0 > 0 x
1
, x
2
IR :[ x
1
x
2
[< [ g(x
1
) g(x
2
) [< .)
= = (, k) : [X[ k, [X
n
X[ < [g(X
n
) g(X)[ <
Let
A = [X[ k = : [X()[ k
B = [X
n
X[ < = : [X
n
() X()[ <
C = [g(X
n
) g(X)[ < = : [g(X
n
()) g(X())[ <
If A B
2.)
= C
= A B C
= C
C
(A B)
C
= A
C
B
C
= P(C
C
) P(A
C
B
C
) P(A
C
) +P(B
C
)
Now:
P([g(X
n
) g(X)[ ) P([X[ > k)
. .
2
by 1.)
+ P([X
n
X[ )
. .
2
for nn
0
(,,k) since Xn
p
X
for n n
0
(, , k)
6
Corollary 6.1.13:
(i) Let X
n
p
c, c IR and let g be a continuous function on IR. Then g(X
n
)
p
g(c).
(ii) Let X
n
d
X and let g be a continuous function on IR. Then g(X
n
)
d
g(X).
(iii) Let X
n
d
c, c IR and let g be a continuous function on IR. Then g(X
n
)
d
g(c).
Theorem 6.1.14:
X
n
p
X = X
n
d
X.
Proof:
X
n
p
X P([X
n
X[ > ) 0 as n > 0
X
x x
X
n
Theorem 6.1.14
It holds:
P(X x ) = P(X x , [X
n
X[ ) +P(X x , [X
n
X[ > )
(A)
P(X
n
x) +P([X
n
X[ > )
(A) holds since X x and X
n
within of X, thus X
n
x.
Similarly, it holds:
P(X
n
x) = P(X
n
x, [ X
n
X [ ) +P(X
n
x, [ X
n
X [> )
P(X x +) +P([X
n
X[ > )
Combining the 2 inequalities from above gives:
P(X x ) P([X
n
X[ > )
. .
0 as n
P(X
n
x)
. .
=Fn(x)
P(X x +) +P([X
n
X[ > )
. .
0 as n
7
Therefore,
P(X x ) F
n
(x) P(X x +) as n .
Since the cdfs F
n
() are not necessarily left continuous, we get the following result for 0:
P(X < x) F
n
(x) P(X x) = F
X
(x)
Let x be a continuity point of F. Then it holds:
F(x) = P(X < x) F
n
(x) F(x)
= F
n
(x) F(x)
= X
n
d
X
Theorem 6.1.15:
Let c IR be a constant. Then it holds:
X
n
d
c X
n
p
c.
Example 6.1.16:
In this example, we will see that
X
n
d
X ,= X
n
p
X
for some rv X. Let X
n
be identically distributed rvs and let (X
n
, X) have the following joint
distribution:
X
n
X
0 1
0 0
1
2
1
2
1
1
2
0
1
2
1
2
1
2
1
Obviously, X
n
d
X since all have exactly the same cdf, but for any (0, 1), it is
P([ X
n
X [> ) = P([ X
n
X [= 1) = 1 n,
so lim
n
P([ X
n
X [> ) ,= 0. Therefore, X
n
,
p
X.
Theorem 6.1.17:
Let X
n
n=1
and Y
n
n=1
be sequences of rvs and X be a rv dened on a probability space
(, L, P). Then it holds:
Y
n
d
X, [ X
n
Y
n
[
p
0 = X
n
d
X.
8
Proof:
Similar to the proof of Theorem 6.1.14. See also Rohatgi, page 253, Theorem 14, and Ro-
hatgi/Saleh, page 269, Theorem 14.
Theorem 6.1.18: Slutskys Theorem
Let X
n
n=1
and Y
n
n=1
be sequences of rvs and X be a rv dened on a probability space
(, L, P). Let c IR be a constant. Then it holds:
(i) X
n
d
X, Y
n
p
c = X
n
+Y
n
d
X +c.
(ii) X
n
d
X, Y
n
p
c = X
n
Y
n
d
cX.
If c = 0, then also X
n
Y
n
p
0.
(iii) X
n
d
X, Y
n
p
c =
Xn
Yn
d
X
c
if c ,= 0.
Proof:
(i) Y
n
p
c
Th.6.1.11(i)
Y
n
c
p
0
= Y
n
c = Y
n
+ (X
n
X
n
) c = (X
n
+Y
n
) (X
n
+c)
p
0 (A)
X
n
d
X
Th.6.1.9(i)
= X
n
+c
d
X +c (B)
Combining (A) and (B), it follows from Theorem 6.1.17:
X
n
+Y
n
d
X +c
(ii) Case c = 0:
> 0 k > 0, it is
P([ X
n
Y
n
[> ) = P([ X
n
Y
n
[> , Y
n
k
) +P([ X
n
Y
n
[> , Y
n
>
k
)
P([ X
n
k
[> ) +P(Y
n
>
k
)
P([ X
n
[> k) +P([ Y
n
[>
k
)
Since X
n
d
X and Y
n
p
0, it follows for any xed k > 0
lim
n
P([ X
n
Y
n
[> ) P([ X [> k).
As k is arbitrary, we can make P([ X [> k) as small as we want by choosing k large.
Therefore, X
n
Y
n
p
0.
9
Case c ,= 0:
Since X
n
d
X and Y
n
p
c, it follows from (ii), case c = 0, that X
n
Y
n
cX
n
=
X
n
(Y
n
c)
p
0.
= X
n
Y
n
p
cX
n
Th.6.1.14
= X
n
Y
n
d
cX
n
Since cX
n
d
cX by Theorem 6.1.9 (ii), it follows from Theorem 6.1.17:
X
n
Y
n
d
cX
(iii) Let Z
n
p
1 and let Y
n
= cZ
n
.
c=0
=
1
Yn
=
1
Zn
1
c
Th.6.1.11(v,viii)
=
1
Yn
p
1
c
With part (ii) above, it follows:
X
n
d
X and
1
Yn
p
1
c
=
Xn
Yn
d
X
c
Denition 6.1.19:
Let X
n
n=1
be a sequence of rvs such that E([ X
n
[
r
) < for some r > 0. We say that X
n
converges in the r
th
mean to a rv X (X
n
r
X) if E([ X [
r
) < and
lim
n
E([ X
n
X [
r
) = 0.
Example 6.1.20:
Let X
n
n=1
be a sequence of rvs dened by P(X
n
= 0) = 1
1
n
and P(X
n
= 1) =
1
n
.
It is E([ X
n
[
r
) =
1
n
r > 0. Therefore, X
n
r
0 r > 0.
Note:
The special cases r = 1 and r = 2 are called convergence in absolute mean for r = 1
(X
n
1
X) and convergence in mean square for r = 2 (X
n
ms
X or X
n
2
X).
10
Theorem 6.1.21:
Assume that X
n
r
X for some r > 0. Then X
n
p
X.
Proof:
Using Markovs Inequality (Corollary 3.5.2), it holds for any > 0:
E([ X
n
X [
r
)
r
P([ X
n
X [ ) P([ X
n
X [> )
X
n
r
X = lim
n
E([ X
n
X [
r
) = 0
= lim
n
P([ X
n
X [> ) lim
n
E([ X
n
X [
r
)
r
= 0
= X
n
p
X
Example 6.1.22:
Let X
n
n=1
be a sequence of rvs dened by P(X
n
= 0) = 1
1
n
r
and P(X
n
= n) =
1
n
r
for
some r > 0.
For any > 0, P([ X
n
[> ) 0 as n ; so X
n
p
0.
For 0 < s < r, E([ X
n
[
s
) =
1
n
rs
0 as n ; so X
n
s
0. But E([ X
n
[
r
) = 1 , 0 as
n ; so X
n
,
r
0.
Theorem 6.1.23:
If X
n
r
X, then it holds:
(i) lim
n
E([ X
n
[
r
) = E([ X [
r
); and
(ii) X
n
s
X for 0 < s < r.
Proof:
(i) For 0 < r 1, it holds:
E([ X
n
[
r
) = E([ X
n
X+X [
r
) E(([ X
n
X [ + [ X [)
r
)
()
E([ X
n
X [
r
+ [ X [
r
)
= E([ X
n
[
r
) E([ X [
r
) E([ X
n
X [
r
)
= lim
n
E([ X
n
[
r
) lim
n
E([ X [
r
) lim
n
E([ X
n
X [
r
) = 0
= lim
n
E([ X
n
[
r
) E([ X [
r
) (A)
() holds due to Bronstein/Semendjajew (1986), page 36 (see Handout)
11
Similarly,
E([ X [
r
) = E([ X X
n
+X
n
[
r
) E(([ X
n
X [ + [ X
n
[)
r
)
E([ X
n
X [
r
+ [ X
n
[
r
)
= E([ X [
r
) E([ X
n
[
r
) E([ X
n
X [
r
)
= lim
n
E([ X [
r
) lim
n
E([ X
n
[
r
) lim
n
E([ X
n
X [
r
) = 0
= E([ X [
r
) lim
n
E([ X
n
[
r
) (B)
Combining (A) and (B) gives
lim
n
E([ X
n
[
r
) = E([ X [
r
)
For r > 1, it follows from Minkowskis Inequality (Theorem 4.8.3):
[E([ X X
n
+X
n
[
r
)]
1
r
[E([ X X
n
[
r
)]
1
r
+ [E([ X
n
[
r
)]
1
r
= [E([ X [
r
)]
1
r
[E([ X
n
[
r
)]
1
r
[E([ X X
n
[
r
)]
1
r
= [E([ X [
r
)]
1
r
lim
n
[E([ X
n
[
r
)]
1
r
lim
n
[E([ X
n
X [
r
)]
1
r
= 0 since X
n
r
X
= [E([ X [
r
)]
1
r
lim
n
[E([ X
n
[
r
)]
1
r
(C)
Similarly,
[E([ X
n
X +X [
r
)]
1
r
[E([ X
n
X [
r
)]
1
r
+ [E([ X [
r
)]
1
r
= lim
n
[E([ X
n
[
r
)]
1
r
lim
n
[E([ X [
r
)]
1
r
lim
n
[E([ X
n
X [
r
)]
1
r
= 0 since X
n
r
X
= lim
n
[E([ X
n
[
r
)]
1
r
[E([ X [
r
)]
1
r
(D)
Combining (C) and (D) gives
lim
n
[E([ X
n
[
r
)]
1
r
= [E([ X [
r
)]
1
r
= lim
n
E([ X
n
[
r
) = E([ X [
r
)
(ii) For 1 s < r, it follows from Lyapunovs Inequality (Theorem 3.5.4):
[E([ X
n
X [
s
)]
1
s
[E([ X
n
X [
r
)]
1
r
= E([ X
n
X [
s
) [E([ X
n
X [
r
)]
s
r
= lim
n
E([ X
n
X [
s
) lim
n
[E([ X
n
X [
r
)]
s
r
= 0 since X
n
r
X
= X
n
s
X
An additional proof is required for 0 < s < r < 1.
12
Denition 6.1.24:
Let X
n
n=1
be a sequence of rvs on (, L, P). We say that X
n
converges almost surely
to a rv X (X
n
a.s.
X) or X
n
converges with probability 1 to X (X
n
w.p.1
X) or X
n
converges strongly to X i
P( : X
n
() X() as n ) = 1.
Note:
An interesting characterization of convergence with probability 1 and convergence in proba-
bility can be found in Parzen (1960) Modern Probability Theory and Its Applications on
page 416 (see Handout).
Example 6.1.25:
Let = [0, 1] and P a uniform distribution on . Let X
n
() = +
n
and X() = .
For [0, 1),
n
0 as n . So X
n
() X() [0, 1).
However, for = 1, X
n
(1) = 2 ,= 1 = X(1) n, i.e., convergence fails at = 1.
Anyway, since P( : X
n
() X() as n ) = P( [0, 1)) = 1, it is X
n
a.s.
X.
Theorem 6.1.26:
X
n
a.s.
X = X
n
p
X.
Proof:
Choose > 0 and > 0. Find n
0
= n
0
(, ) such that
P
_
n=n
0
[ X
n
X [
_
1 .
Since
n=n
0
[ X
n
X [ [ X
n
X [ n n
0
, it is
P ([ X
n
X [ ) P
_
n=n
0
[ X
n
X [
_
1 n n
0
.
Therefore, P([ X
n
X [ ) 1 as n . Thus, X
n
p
X.
13
Example 6.1.27:
X
n
p
X ,= X
n
a.s.
X:
Let = (0, 1] and P a uniform distribution on .
Dene A
n
by
A
1
= (0,
1
2
], A
2
= (
1
2
, 1]
A
3
= (0,
1
4
], A
4
= (
1
4
,
1
2
], A
5
= (
1
2
,
3
4
], A
6
= (
3
4
, 1]
A
7
= (0,
1
8
], A
8
= (
1
8
,
1
4
], . . .
Let X
n
() = I
An
().
It is P([ X
n
0 [ ) 0 > 0 since X
n
is 0 except on A
n
and P(A
n
) 0. Thus X
n
p
0.
But P( : X
n
() 0) = 0 (and not 1) because any keeps being in some A
n
beyond any
n
0
, i.e., X
n
() looks like 0 . . . 010 . . . 010 . . . 010 . . ., so X
n
,
a.s.
0.
Note that n
0
in this proof relates to N in Parzen (1960), page 416.
Example 6.1.28:
X
n
r
X ,= X
n
a.s.
X:
Let X
n
be independent rvs such that P(X
n
= 0) = 1
1
n
and P(X
n
= 1) =
1
n
.
It is E([ X
n
0 [
r
) = E([ X
n
[
r
) = E([ X
n
[) =
1
n
0 as n , so X
n
r
0 r > 0 (and
due to Theorem 6.1.21, also X
n
p
0).
But
P(X
n
= 0 m n n
0
) =
n
0
n=m
(1
1
n
) = (
m1
m
)(
m
m + 1
)(
m + 1
m + 2
) . . . (
n
0
2
n
0
1
)(
n
0
1
n
0
) =
m1
n
0
As n
0
, it is P(X
n
= 0 m n n
0
) 0 m, so X
n
,
a.s.
0.
Example 6.1.29:
X
n
a.s.
X ,= X
n
r
X:
Let = [0, 1] and P a uniform distribution on .
Let A
n
= [0,
1
ln n
].
Let X
n
() = nI
An
() and X() = 0.
It holds that > 0 n
0
:
1
ln n
0
< = X
n
() = 0 n > n
0
and P( = 0) = 0. Thus,
X
n
a.s.
0.
But E([ X
n
0 [
r
) =
n
r
lnn
r > 0, so X
n
,
r
X.
14
6.2 Weak Laws of Large Numbers
Theorem 6.2.1: WLLN: Version I
Let X
i
i=1
be a sequence of iid rvs with mean E(X
i
) = and variance V ar(X
i
) =
2
< .
Let X
n
=
1
n
n
i=1
X
i
. Then it holds
lim
n
P([ X
n
[ ) = 0 > 0,
i.e., X
n
p
.
Proof:
By Markovs Inequality (Corollary 3.5.2), it holds for all > 0:
P([ X
n
[ )
E((X
n
)
2
)
2
=
V ar(X
n
)
2
=
2
n
2
0 as n
Note:
For iid rvs with nite variance, X
n
is consistent for .
A more general way to derive a WLLN follows in the next Denition.
Denition 6.2.2:
Let X
i
i=1
be a sequence of rvs. Let T
n
=
n
i=1
X
i
. We say that X
i
obeys the WLLN
with respect to a sequence of norming constants B
i
i=1
, B
i
> 0, B
i
, if there exists a
sequence of centering constants A
i
i=1
such that
B
1
n
(T
n
A
n
)
p
0.
Theorem 6.2.3:
Let X
i
i=1
be a sequence of pairwise uncorrelated rvs with E(X
i
) =
i
and V ar(X
i
) =
2
i
,
i IN. If
n
i=1
2
i
as n , we can choose A
n
=
n
i=1
i
and B
n
=
n
i=1
2
i
and get
n
i=1
(X
i
i
)
n
i=1
2
i
p
0.
15
Proof:
By Markovs Inequality (Corollary 3.5.2), it holds for all > 0:
P([
n
i=1
X
i
n
i=1
i
[>
n
i=1
2
i
)
E((
n
i=1
(X
i
i
))
2
)
2
(
n
i=1
2
i
)
2
=
1
2
n
i=1
2
i
0 as n
Note:
To obtain Theorem 6.2.1, we choose A
n
= n and B
n
= n
2
.
Theorem 6.2.4:
Let X
i
i=1
be a sequence of rvs. Let X
n
=
1
n
n
i=1
X
i
. A necessary and sucient condition
for X
i
to obey the WLLN with respect to B
n
= n is that
E
_
X
2
n
1 +X
2
n
_
0
as n .
Proof:
Rohatgi, page 258, Theorem 2, and Rohatgi/Saleh, page 275, Theorem 2.
Example 6.2.5:
Let (X
1
, . . . , X
n
) be jointly Normal with E(X
i
) = 0, E(X
2
i
) = 1 for all i, and Cov(X
i
, X
j
) =
if [ i j [= 1 and Cov(X
i
, X
j
) = 0 if [ i j [> 1.
Let T
n
=
n
i=1
X
i
. Then, T
n
N(0, n + 2(n 1)) = N(0,
2
). It is
E
_
X
2
n
1 +X
2
n
_
= E
_
T
2
n
n
2
+T
2
n
_
=
2
2
_
0
x
2
n
2
+x
2
e
x
2
2
2
dx [ y =
x
, dy =
dx
=
2
2
_
0
2
y
2
n
2
+
2
y
2
e
y
2
2
dy
=
2
2
_
0
(n + 2(n 1))y
2
n
2
+ (n + 2(n 1))y
2
e
y
2
2
dy
n + 2(n 1)
n
2
_
0
2
2
y
2
e
y
2
2
dy
. .
=1, since Var of N(0,1) distribution
16
0 as n
= X
n
p
0
Note:
We would like to have a WLLN that just depends on means but does not depend on the
existence of nite variances. To approach this, we consider the following:
Let X
i
i=1
be a sequence of rvs. Let T
n
=
n
i=1
X
i
. We truncate each [ X
i
[ at c > 0 and get
X
c
i
=
_
X
i
, [ X
i
[ c
0, otherwise
Let T
c
n
=
n
i=1
X
c
i
and m
n
=
n
i=1
E(X
c
i
).
Lemma 6.2.6:
For T
n
, T
c
n
and m
n
as dened in the Note above, it holds:
P([ T
n
m
n
[> ) P([ T
c
n
m
n
[> ) +
n
i=1
P([ X
i
[> c) > 0
Proof:
It holds for all > 0:
P([ T
n
m
n
[> ) = P([ T
n
m
n
[> and [ X
i
[ c i 1, . . . , n) +
P([ T
n
m
n
[> and [ X
i
[> c for at least one i 1, . . . , n)
()
P([ T
c
n
m
n
[> ) +P([ X
i
[> c for at least one i 1, . . . , n)
P([ T
c
n
m
n
[> ) +
n
i=1
P([ X
i
[> c)
() holds since T
c
n
= T
n
when [ X
i
[ c i 1, . . . , n.
Note:
If the X
i
s are identically distributed, then
P([ T
n
m
n
[> ) P([ T
c
n
m
n
[> ) +nP([ X
1
[> c) > 0.
If the X
i
s are iid, then
P([ T
n
m
n
[> )
nE((X
c
1
)
2
)
2
+nP([ X
1
[> c) > 0 ().
Note that P([ X
i
[> c) = P([ X
1
[> c) i IN if the X
i
s are identically distributed and that
E((X
c
i
)
2
) = E((X
c
1
)
2
) i IN if the X
i
s are iid.
17
Theorem 6.2.7: Khintchines WLLN
Let X
i
i=1
be a sequence of iid rvs with nite mean E(X
i
) = . Then it holds:
X
n
=
1
n
T
n
p
Proof:
If we take c = n and replace by n in () in the Note above, we get
P
_
T
n
m
n
n
>
_
= P([ T
n
m
n
[> n)
E((X
n
1
)
2
)
n
2
+nP([ X
1
[> n).
Since E([ X
1
[) < , it is nP([ X
1
[> n) 0 as n by Theorem 3.1.9. From Corollary
3.1.12 we know that E([ X [
) =
_
0
x
1
P([ X [> x)dx. Therefore,
E((X
n
1
)
2
) = 2
_
n
0
xP([ X
n
1
[> x)dx
= 2
_
A
0
xP([ X
n
1
[> x)dx + 2
_
n
A
xP([ X
n
1
[> x)dx
(+)
K +
_
n
A
dx
K +n
In (+), A is chosen suciently large such that xP([ X
n
1
[> x) <
2
x A for an arbitrary
constant > 0 and K > 0 a constant.
Therefore,
E((X
n
1
)
2
)
n
2
K
n
2
+
2
Since is arbitrary, we can make the right hand side of this last inequality arbitrarily small
for suciently large n.
Since E(X
i
) = i, it is
m
n
n
=
n
i=1
E(X
n
i
)
n
as n .
Note:
Theorem 6.2.7 meets the previously stated goal of not having a nite variance requirement.
18
6.3 Strong Laws of Large Numbers
Denition 6.3.1:
Let X
i
i=1
be a sequence of rvs. Let T
n
=
n
i=1
X
i
. We say that X
i
obeys the SLLN
with respect to a sequence of norming constants B
i
i=1
, B
i
> 0, B
i
, if there exists a
sequence of centering constants A
i
i=1
such that
B
1
n
(T
n
A
n
)
a.s.
0.
Note:
Unless otherwise specied, we will only use the case that B
n
= n in this section.
Theorem 6.3.2:
X
n
a.s.
X lim
n
P( sup
mn
[ X
m
X [> ) = 0 > 0.
Proof: (see also Rohatgi, page 249, Theorem 11)
WLOG, we can assume that X = 0 since X
n
a.s.
X implies X
n
X
a.s.
0. Thus, we have to
prove:
X
n
a.s.
0 lim
n
P( sup
mn
[ X
m
[> ) = 0 > 0
Choose > 0 and dene
A
n
() = sup
mn
[ X
m
[>
C = lim
n
X
n
= 0
=:
Since X
n
a.s.
0, we know that P(C) = 1 and therefore P(C
c
) = 0.
Let B
n
() = C A
n
(). Note that B
n+1
() B
n
() and for the limit set
n=1
B
n
() = . It
follows that
lim
n
P(B
n
()) = P(
n=1
B
n
()) = 0.
We also have
P(B
n
()) = P(A
n
C)
= 1 P(C
c
A
c
n
)
= 1 P(C
c
)
. .
=0
P(A
c
n
) +P(C
c
A
C
n
)
. .
=0
= P(A
n
)
19
= lim
n
P(A
n
()) = 0
=:
Assume that lim
n
P(A
n
()) = 0 > 0 and dene D() = lim
n
[ X
n
[> .
Since D() A
n
() n IN, it follows that P(D()) = 0 > 0. Also,
C
c
= lim
n
X
n
,= 0
_
k=1
lim
n
[ X
n
[>
1
k
.
= 1 P(C)
k=1
P(D(
1
k
)) = 0
= X
n
a.s.
0
Note:
(i) X
n
a.s.
0 implies that > 0 > 0 n
0
IN : P( sup
nn
0
[ X
n
[> ) < .
(ii) Recall that for a given sequence of events A
n
n=1
,
A = lim
n
A
n
= lim
n
_
k=n
A
k
=
n=1
_
k=n
A
k
is the event that innitely many of the A
n
occur. We write P(A) = P(A
n
i.o.) where
i.o. stands for innitely often.
(iii) Using the terminology dened in (ii) above, we can rewrite Theorem 6.3.2 as
X
n
a.s.
0 P([ X
n
[> i.o.) = 0 > 0.
20
Theorem 6.3.3: BorelCantelli Lemma
Let A be dened as in (ii) of the previous Note.
(i) 1
st
BCLemma:
Let A
n
n=1
be a sequence of events such that
n=1
P(A
n
) < . Then P(A) = 0.
(ii) 2
nd
BCLemma:
Let A
n
n=1
be a sequence of independent events such that
n=1
P(A
n
) = . Then
P(A) = 1.
Proof:
(i):
P(A) = P( lim
n
_
k=n
A
k
)
= lim
n
P(
_
k=n
A
k
)
lim
n
k=n
P(A
k
)
= lim
n
_
k=1
P(A
k
)
n1
k=1
P(A
k
)
_
()
= 0
() holds since
n=1
P(A
n
) < .
(ii): We have A
c
=
_
n=1
k=n
A
c
k
. Therefore,
P(A
c
) = P( lim
n
k=n
A
c
k
) = lim
n
P(
k=n
A
c
k
).
If we choose n
0
> n, it holds that
k=n
A
c
k
n
0
k=n
A
c
k
.
Therefore,
P(
k=n
A
c
k
) lim
n
0
P(
n
0
k=n
A
c
k
)
indep.
= lim
n
0
n
0
k=n
(1 P(A
k
))
21
(+)
lim
n
0
exp
_
n
0
k=n
P(A
k
)
_
= 0
= P(A) = 1
(+) holds since
1 exp
n
0
j=n
1
n
0
j=n
(1
j
)
n
0
j=n
j
for n
0
> n and 0
j
1
Example 6.3.4:
Independence is necessary for 2
nd
BCLemma:
Let = (0, 1) and P a uniform distribution on .
Let A
n
= I
(0,
1
n
)
(). Therefore,
n=1
P(A
n
) =
n=1
1
n
= .
But for any , A
n
occurs only for 1, 2, . . . ,
1
|, where
1
i=1
be a sequence of independent rvs with common mean 0 and variances
2
i
. Let
T
n
=
n
i=1
X
i
. Then it holds:
P( max
1kn
[ T
k
[ )
n
i=1
2
i
2
> 0
Proof:
See Rohatgi, page 268, Lemma 2, and Rohatgi/Saleh, page 284, Lemma 1.
Lemma 6.3.6: Kroneckers Lemma
For any real numbers x
n
, if
n=1
x
n
converges to s < and B
n
, then it holds:
1
B
n
n
k=1
B
k
x
k
0 as n
22
Proof:
See Rohatgi, page 269, Lemma 3, and Rohatgi/Saleh, page 285, Lemma 2.
Theorem 6.3.7: Cauchy Criterion
X
n
a.s.
X lim
n
P(sup
m
[ X
n+m
X
n
[ ) = 1 > 0.
Proof:
See Rohatgi, page 270, Theorem 5.
Theorem 6.3.8:
If
n=1
V ar(X
n
) < , then
n=1
(X
n
E(X
n
)) converges almost surely.
Proof:
See Rohatgi, page 272, Theorem 6, and Rohatgi/Saleh, page 286, Theorem 4.
Corollary 6.3.9:
Let X
i
i=1
be a sequence of independent rvs. Let B
i
i=1
, B
i
> 0, B
i
, be a sequence
of norming constants. Let T
n
=
n
i=1
X
i
.
If
i=1
V ar(X
i
)
B
2
i
< , then it holds:
T
n
E(T
n
)
B
n
a.s.
0
Proof:
This Corollary follows directly from Theorem 6.3.8 and Lemma 6.3.6.
Lemma 6.3.10: Equivalence Lemma
Let X
i
i=1
and X
i=1
be sequences of rvs. Let T
n
=
n
i=1
X
i
and T
n
=
n
i=1
X
i
.
If the series
i=1
P(X
i
,= X
i
) < , then the series X
i
and X
i
are tailequivalent and
T
n
and T
n
are convergenceequivalent, i.e., for B
n
the sequences
1
Bn
T
n
and
1
Bn
T
n
converge on the same event and to the same limit, except for a null set.
Proof:
See Rohatgi, page 266, Lemma 1.
23
Lemma 6.3.11:
Let X be a rv with E([ X [) < . Then it holds:
n=1
P([ X [ n) E([ X [) 1 +
n=1
P([ X [ n)
Proof:
Continuous case only:
Let X have a pdf f. Then it holds:
E([ X [) =
_
[ x [ f(x)dx =
k=0
_
k|x|k+1
[ x [ f(x)dx
=
k=0
kP(k [ X [ k + 1) E([ X [)
k=0
(k + 1)P(k [ X [ k + 1)
It is
k=0
kP(k [ X [ k + 1) =
k=0
k
n=1
P(k [ X [ k + 1)
=
n=1
k=n
P(k [ X [ k + 1)
=
n=1
P([ X [ n)
Similarly,
k=0
(k + 1)P(k [ X [ k + 1) =
n=1
P([ X [ n) +
k=0
P(k [ X [ k + 1)
=
n=1
P([ X [ n) + 1
Theorem 6.3.12:
Let X
i
i=1
be a sequence of independent rvs. Then it holds:
X
n
a.s.
0
n=1
P([ X
n
[> ) < > 0
Proof:
See Rohatgi, page 265, Theorem 3.
24
Theorem 6.3.13: Kolmogorovs SLLN
Let X
i
i=1
be a sequence of iid rvs. Let T
n
=
n
i=1
X
i
. Then it holds:
T
n
n
= X
n
a.s.
< E([ X [) < (and then = E(X))
Proof:
=:
Suppose that X
n
a.s.
< . It is
T
n
=
n
i=1
X
i
=
n1
i=1
X
i
+X
n
= T
n1
+X
n
.
=
Xn
n
=
T
n
n
..
a.s.
n 1
n
. .
1
T
n1
n 1
. .
a.s.
a.s.
0
By Theorem 6.3.12, we have
n=1
P([
X
n
n
[ 1) < , i.e.,
n=1
P([ X
n
[ n) < .
Since the X
i
are iid (and, say, have the same distribution as X), we can apply Lemma 6.3.11
for X. Now:
n=1
P([ X [ n) <
= 1 +
n=1
P([ X [ n) <
Lemma 6.3.11
= E([ X [) <
Th. 6.2.7 (WLLN)
= X
n
p
E(X)
Since X
n
a.s.
, it follows by Theorem 6.1.26 that X
n
p
. Therefore, it must hold that
= E(X) by Theorem 6.1.11 (ii).
=:
Let E([ X [) < . Dene truncated rvs:
X
k
=
_
X
k
, if [ X
k
[ k
0, otherwise
T
n
=
n
k=1
X
k
X
n
=
T
n
n
25
Then it holds:
k=1
P(X
k
,= X
k
) =
k=1
P([ X
k
[> k)
k=1
P([ X
k
[ k)
iid
=
k=1
P([ X [ k)
Lemma 6.3.11
E([ X [)
<
By Lemma 6.3.10, it follows that T
n
and T
n
are convergenceequivalent. Thus, it is sucient
to prove that X
n
a.s.
E(X).
We now establish the conditions needed in Corollary 6.3.9. It is
V ar(X
n
) E((X
n
)
2
)
=
_
x
2
f
X
n
(x)dx
=
_
n
n
x
2
f
X
n
(x)dx
=
_
n
n
x
2
f
X
(x)dx
_
n
n
f
X
(x)dx
= c
n
_
n
n
x
2
f
X
(x)dx where c
n
=
1
_
n
n
f
X
(x)dx
and c
1
c
2
. . . c
n
. . . 1
= c
n
n1
k=0
_
k|x|<k+1
x
2
f
X
(x)dx
c
n
n1
k=0
(k + 1)
2
_
k|x|<k+1
f
X
(x)dx
= c
n
n1
k=0
(k + 1)
2
P(k [ X [< k + 1)
c
n
n
k=0
(k + 1)
2
P(k [ X [< k + 1)
=
n=1
1
n
2
V ar(X
n
) c
1
n=1
n
k=0
(k + 1)
2
n
2
P(k [ X [< k + 1)
26
= c
1
_
n=1
n
k=1
(k + 1)
2
n
2
P(k [ X [< k + 1) +
n=1
1
n
2
P(0 [ X [< 1)
_
()
c
1
_
k=1
(k + 1)
2
P(k [ X [< k + 1)
_
n=k
1
n
2
_
+ 2P(0 [ X [< 1)
_
(A)
() holds since
n=1
1
n
2
=
2
6
1.65 < 2 and the rst two sums can be rearranged as follows:
n k
1 1
2 1, 2
3 1, 2, 3
.
.
.
.
.
.
=
k n
1 1, 2, 3, . . .
2 2, 3, . . .
3 3, . . .
.
.
.
.
.
.
It is
n=k
1
n
2
=
1
k
2
+
1
(k + 1)
2
+
1
(k + 2)
2
+. . .
1
k
2
+
1
k(k + 1)
+
1
(k + 1)(k + 2)
+. . .
=
1
k
2
+
n=k+1
1
n(n 1)
From Bronstein, page 30, # 7, we know that
1 =
1
1 2
+
1
2 3
+
1
3 4
+. . . +
1
n(n + 1)
+. . .
=
1
1 2
+
1
2 3
+
1
3 4
+. . . +
1
(k 1) k
+
n=k+1
1
n(n 1)
=
n=k+1
1
n(n 1)
= 1
1
1 2
1
2 3
1
3 4
. . .
1
(k 1) k
=
1
2
1
2 3
1
3 4
. . .
1
(k 1) k
=
1
3
1
3 4
. . .
1
(k 1) k
=
1
4
. . .
1
(k 1) k
= . . .
=
1
k
27
=
n=k
1
n
2
1
k
2
+
n=k+1
1
n(n 1)
=
1
k
2
+
1
k
2
k
Using this result in (A), we get
n=1
1
n
2
V ar(X
n
) c
1
_
2
k=1
(k + 1)
2
k
P(k [ X [< k + 1) + 2P(0 [ X [< 1)
_
= c
1
_
2
k=0
kP(k [ X [< k + 1) + 4
k=1
P(k [ X [< k + 1)
+2
k=1
1
k
P(k [ X [< k + 1) + 2P(0 [ X [< 1)
_
(B)
c
1
(2E([ X [) + 4 + 2 + 2)
<
To establish (B), we use an inequality from the Proof of Lemma 6.3.11, i.e.,
k=0
kP(k [ X [< k + 1)
Proof
n=1
P([ X [ n)
Lemma 6.3.11
E([ X [)
Thus, the conditions needed in Corollary 6.3.9 are met. With B
n
= n, it follows that
1
n
T
n
1
n
E(T
n
)
a.s.
0 (C)
Since E(X
n
) E(X) as n , it follows by Kroneckers Lemma (6.3.6) that
1
n
E(T
n
)
E(X). Thus, when we replace
1
n
E(T
n
) by E(X) in (C), we get
1
n
T
n
a.s.
E(X)
Lemma 6.3.10
=
1
n
T
n
a.s.
E(X)
since T
n
and T
n
are convergenceequivalent (as dened in Lemma 6.3.10).
28
6.4 Central Limit Theorems
Let X
n
n=1
be a sequence of rvs with cdfs F
n
n=1
. Suppose that the mgf M
n
(t) of X
n
exists.
Questions: Does M
n
(t) converge? Does it converge to a mgf M(t)? If it does converge, does
it hold that X
n
d
X for some rv X?
Example 6.4.1:
Let X
n
n=1
be a sequence of rvs such that P(X
n
= n) = 1. Then the mgf is M
n
(t) =
E(e
tX
) = e
tn
. So
lim
n
M
n
(t) =
0, t > 0
1, t = 0
, t < 0
So M
n
(t) does not converge to a mgf and F
n
(x) F(x) = 1 x. But F(x) is not a cdf.
Note:
Due to Example 6.4.1, the existence of mgfs M
n
(t) that converge to something is not enough
to conclude convergence in distribution.
Conversely, suppose that X
n
has mgf M
n
(t), X has mgf M(t), and X
n
d
X. Does it hold
that
M
n
(t) M(t)?
Not necessarily! See Rohatgi, page 277, Example 2, and Rohatgi/Saleh, page 289, Example
2, as a counter example. Thus, convergence in distribution of rvs that all have mgfs does
not imply the convergence of mgfs.
However, we can make the following statement in the next Theorem:
Theorem 6.4.2: Continuity Theorem
Let X
n
n=1
be a sequence of rvs with cdfs F
n
n=1
and mgfs M
n
(t)
n=1
. Suppose that
M
n
(t) exists for [ t [ t
0
n. If there exists a rv X with cdf F and mgf M(t) which exists for
[ t [ t
1
< t
0
such that lim
n
M
n
(t) = M(t) t [t
1
, t
1
], then F
n
w
F, i.e., X
n
d
X.
29
Example 6.4.3:
Let X
n
Bin(n,
n
). Recall (e.g., from Theorem 3.3.12 and related Theorems) that for
X Bin(n, p) the mgf is M
X
(t) = (1 p +pe
t
)
n
. Thus,
M
n
(t) = (1
n
+
n
e
t
)
n
= (1 +
(e
t
1)
n
)
n
()
e
(e
t
1)
as n .
In () we use the fact that lim
n
(1 +
x
n
)
n
= e
x
. Recall that e
(e
t
1)
is the mgf of a rv X where
X Poisson(). Thus, we have established the wellknown result that the Binomial distribu-
tion approaches the Poisson distribution, given that n in such a way that np = > 0.
Note:
Recall Theorem 3.3.11: Suppose that X
n
n=1
is a sequence of rvs with characteristic fuctions
n
(t)
n=1
. Suppose that
lim
n
n
(t) = (t) t (h, h) for some h > 0,
and (t) is the characteristic function of a rv X. Then X
n
d
X.
Theorem 6.4.4: LindebergLevy Central Limit Theorem
Let X
n
n=1
be a sequence of iid rvs with E(X
i
) = and 0 < V ar(X
i
) =
2
< . Then it
holds for X
n
=
1
n
n
i=1
X
i
that
n(X
n
)
d
Z
where Z N(0, 1).
Proof:
Let Z N(0, 1). According to Theorem 3.3.12 (v), the characteristic function of Z is
Z
(t) = exp(
1
2
t
2
).
Let (t) be the characteristic function of X
i
. We now determine the characteristic function
n
(t) of
n(Xn)
n
(t) = E
exp
it
n(
1
n
n
i=1
X
i
)
=
_
. . .
_
exp
it
n(
1
n
n
i=1
x
i
)
dF
X
(x)
30
= exp(
it
)
_
exp(
itx
1
n
)dF
X
1
(x
1
) . . .
_
exp(
itx
n
n
)dF
Xn
(x
n
)
=
_
(
t
n
) exp(
it
n
)
_
n
Recall from Theorem 3.3.5 that if the k
th
moment exists, then
(k)
(0) = i
k
E(X
k
). In partic-
ular, it holds for the given distribution that
(1)
(0) = iE(X) = i and
(2)
(0) = i
2
E(X
2
) =
i
2
(
2
+
2
) = (
2
+
2
). Also recall the denition of a Taylor series in MacLaurins form:
f(x) = f(0) +
f
(0)
1!
x +
f
(0)
2!
x
2
+
f
(0)
3!
x
3
+. . . +
f
(n)
(0)
n!
x
n
+. . . ,
e.g.,
f(x) = e
x
= 1 +x +
x
2
2!
+
x
3
3!
+. . .
Thus, if we develop a Taylor series for (
t
n
) around t = 0, we get:
(
t
n
) = (0) +
t
(0) +
1
2
t
2
(
n)
2
(0) +
1
6
t
3
(
n)
3
(0) +. . .
= 1 +t
i
n
1
2
t
2
2
+
2
n
2
+o
_
(
t
n
)
2
_
Here we make use of the Landau symbol o. In general, if we write u(x) = o(v(x)) for
x L, this implies lim
xL
u(x)
v(x)
= 0, i.e., u(x) goes to 0 faster than v(x) or v(x) goes to
faster than u(x). We say that u(x) is of smaller order than v(x) as x L. Examples are
1
x
3
= o(
1
x
2
) and x
2
= o(x
3
) for x . See Rohatgi, page 6, for more details on the Landau
symbols O and o.
Similarly, if we develop a Taylor series for exp(
it
n
) around t = 0, we get:
exp(
it
n
) = 1 t
i
n
1
2
t
2
2
n
2
+o
_
(
t
n
)
2
_
Combining these results, we get:
n
(t) =
1
..
(1)
+t
i
n
. .
(2)
1
2
t
2
2
+
2
n
2
. .
(3)
+o
_
(
t
n
)
2
_
. .
(4)
1
..
(A)
t
i
n
. .
(B)
1
2
t
2
2
n
2
. .
(C)
+o
_
(
t
n
)
2
_
. .
(D)
n
=
1
..
(1)(A)
t
i
n
. .
(1)(B)
1
2
t
2
2
n
2
. .
(1)(C)
+t
i
n
. .
(2)(A)
+t
2
2
n
2
. .
(2)(B)
1
2
t
2
2
+
2
n
2
. .
(3)(A)
+o
_
(
t
n
)
2
_
. .
all others
n
31
=
_
1
1
2
t
2
n
+o
_
(
t
n
)
2
_
_
n
=
_
1 +
1
2
t
2
n
+o
_
1
n
_
_
n
()
exp(
t
2
2
) as n
Thus, lim
n
n
(t) =
Z
(t) t. For a proof of (), see Rohatgi, page 278, Lemma 1. According
to the Note above, it holds that
n(X
n
)
d
Z.
Denition 6.4.5:
Let X
1
, X
2
be iid nondegenerate rvs with common cdf F. Let a
1
, a
2
> 0. We say that F is
stable if there exist constants A and B (depending on a
1
and a
2
) such that
B
1
(a
1
X
1
+a
2
X
2
A) also has cdf F.
Note:
When generalizing the previous denition to sequences of rvs, we have the following examples
for stable distributions:
X
i
iid Cauchy. Then
1
n
n
i=1
X
i
Cauchy (here B
n
= n, A
n
= 0).
X
i
iid N(0, 1). Then
1
n
n
i=1
X
i
N(0, 1) (here B
n
=
n, A
n
= 0).
Denition 6.4.6:
Let X
i
i=1
be a sequence of iid rvs with common cdf F. Let T
n
=
n
i=1
X
i
. F belongs to
the domain of attraction of a distribution V if there exist norming and centering constants
B
n
n=1
, B
n
> 0, and A
n
n=1
such that
P(B
1
n
(T
n
A
n
) x) = F
B
1
n
(TnAn)
(x) V (x) as n
at all continuity points x of V .
Note:
A very general Theorem from Lo`eve states that only stable distributions can have domains
of attraction. From the practical point of view, a wide class of distributions F belong to the
domain of attraction of the Normal distribution.
32
Theorem 6.4.7: Lindeberg Central Limit Theorem
Let X
i
i=1
be a sequence of independent nondegenerate rvs with cdfs F
i
i=1
. Assume
that E(X
k
) =
k
and V ar(X
k
) =
2
k
< . Let s
2
n
=
n
k=1
2
k
.
If the F
k
are absolutely continuous with pdfs f
k
= F
k
, assume that it holds for all > 0 that
(A) lim
n
1
s
2
n
n
k=1
_
{|x
k
|>sn}
(x
k
)
2
F
k
(x)dx = 0.
If the X
k
are discrete rvs with support x
kl
and probabilities p
kl
, l = 1, 2, . . ., assume that
it holds for all > 0 that
(B) lim
n
1
s
2
n
n
k=1
|x
kl
k
|>sn
(x
kl
k
)
2
p
kl
= 0.
The conditions (A) and (B) are called Lindeberg Condition (LC). If either LC holds, then
n
k=1
(X
k
k
)
s
n
d
Z
where Z N(0, 1).
Proof:
Similar to the proof of Theorem 6.4.4, we can use characteristic functions again. An alterna-
tive proof is given in Rohatgi, pages 282288.
Note:
Feller shows that the LC is a necessary condition if
2
n
s
2
n
0 and s
2
n
as n .
Corollary 6.4.8:
Let X
i
i=1
be a sequence of iid rvs such that
1
n
n
i=1
X
i
has the same distribution for all n.
If E(X
i
) = 0 and V ar(X
i
) = 1, then X
i
N(0, 1).
Proof:
Let F be the common cdf of
1
n
n
i=1
X
i
for all n (including n = 1). By the CLT,
lim
n
P(
1
n
n
i=1
X
i
x) = (x),
where (x) denotes P(Z x) for Z N(0, 1). Also, P(
1
n
n
i=1
X
i
x) = F(x) for each n.
Therefore, we must have F(x) = (x).
33
Note:
In general, if X
1
, X
2
, . . ., are independent rvs such that there exists a constant A with
P([ X
n
[ A) = 1 n, then the LC is satised if s
2
n
as n . Why??
Suppose that s
2
n
as n . Since the [ X
k
[s are uniformly bounded (by A), so are the
rvs (X
k
E(X
k
)). Thus, for every > 0 there exists an N
such that if n N
then
P([ X
k
E(X
k
) [< s
n
, k = 1, . . . , n) = 1.
This implies that the LC holds since we would integrate (or sum) over the empty set, i.e., the
set [ x
k
[> s
n
= .
The converse also holds. For a sequence of uniformly bounded independent rvs, a necessary
and sucient condition for the CLT to hold is that s
2
n
as n .
Example 6.4.9:
Let X
i
i=1
be a sequence of independent rvs such that E(X
k
) = 0,
k
= E([ X
k
[
2+
) <
for some > 0, and
n
k=1
k
= o(s
2+
n
).
Does the LC hold? It is:
1
s
2
n
n
k=1
_
{|x|>sn}
x
2
f
k
(x)dx
(A)
1
s
2
n
n
k=1
_
{|x|>sn}
[ x [
2+
n
f
k
(x)dx
1
s
2
n
n
n
k=1
_
[ x [
2+
f
k
(x)dx
=
1
s
2
n
n
n
k=1
k
=
1
k=1
k
s
2+
n
(B)
0 as n
(A) holds since for [ x [> s
n
, it is
|x|
n
> 1. (B) holds since
n
k=1
k
= o(s
2+
n
).
Thus, the LC is satised and the CLT holds.
34
Note:
(i) In general, if there exists a > 0 such that
n
k=1
E([ X
k
k
[
2+
)
s
2+
n
0 as n ,
then the LC holds.
(ii) Both the CLT and the WLLN hold for a large class of sequences of rvs X
i
n
i=1
. If
the X
i
s are independent uniformly bounded rvs, i.e., if P([ X
n
[ M) = 1 n, the
WLLN (as formulated in Theorem 6.2.3) holds. The CLT holds provided that s
2
n
as n .
If the rvs X
i
are iid, then the CLT is a stronger result than the WLLN since the CLT
provides an estimate of the probability P(
1
n
[
n
i=1
X
i
n [ ) 1 P([ Z [
n),
where Z N(0, 1), and the WLLN follows. However, note that the CLT requires the
existence of a 2
nd
moment while the WLLN does not.
(iii) If the X
i
are independent (but not identically distributed) rvs, the CLT may apply
while the WLLN does not.
(iv) See Rohatgi, pages 289293, and Rohatgi/Saleh, pages 299303, for additional details
and examples.
35
7 Sample Moments
7.1 Random Sampling
(Based on Casella/Berger, Section 5.1 & 5.2)
Denition 7.1.1:
Let X
1
, . . . , X
n
be iid rvs with common cdf F. We say that X
1
, . . . , X
n
is a (random)
sample of size n from the population distribution F. The vector of values x
1
, . . . , x
n
is
called a realization of the sample. A rv g(X
1
, . . . , X
n
) which is a Borelmeasurable function
of X
1
, . . . , X
n
and does not depend on any unknown parameter is called a (sample) statistic.
Denition 7.1.2:
Let X
1
, . . . , X
n
be a sample of size n from a population with distribution F. Then
X =
1
n
n
i=1
X
i
is called the sample mean and
S
2
=
1
n 1
n
i=1
(X
i
X)
2
=
1
n 1
_
n
i=1
X
2
i
nX
2
_
is called the sample variance.
Denition 7.1.3:
Let X
1
, . . . , X
n
be a sample of size n from a population with distribution F. The function
F
n
(x) =
1
n
n
i=1
I
(,x]
(X
i
)
is called empirical cumulative distribution function (empirical cdf).
Note:
For any xed x IR,
F
n
(x) is a rv.
Theorem 7.1.4:
The rv
F
n
(x) has pmf
P(
F
n
(x) =
j
n
) =
_
n
j
_
(F(x))
j
(1 F(x))
nj
, j 0, 1, . . . , n,
36
with E(
F
n
(x)) = F(x) and V ar(
F
n
(x)) =
F(x)(1F(x))
n
.
Proof:
It is I
(,x]
(X
i
) Bin(1, F(x)). Then n
F
n
(x) Bin(n, F(x)).
The results follow immediately.
Corollary 7.1.5:
By the WLLN, it follows that
F
n
(x)
p
F(x).
Corollary 7.1.6:
By the CLT, it follows that
n(
F
n
(x) F(x))
_
F(x)(1 F(x))
d
Z,
where Z N(0, 1).
Theorem 7.1.7: GlivenkoCantelli Theorem
F
n
(x) converges uniformly to F(x), i.e., it holds for all > 0 that
lim
n
P( sup
<x<
[
F
n
(x) F(x) [> ) = 0.
Denition 7.1.8:
Let X
1
, . . . , X
n
be a sample of size n from a population with distribution F. We call
a
k
=
1
n
n
i=1
X
k
i
the sample moment of order k and
b
k
=
1
n
n
i=1
(X
i
a
1
)
k
=
1
n
n
i=1
(X
i
X)
k
the sample central moment of order k.
Note:
It is b
1
= 0 and b
2
=
n1
n
S
2
.
37
Theorem 7.1.9:
Let X
1
, . . . , X
n
be a sample of size n from a population with distribution F. Assume that
E(X) = , V ar(X) =
2
, and E((X )
k
) =
k
exist. Then it holds:
(i) E(a
1
) = E(X) =
(ii) V ar(a
1
) = V ar(X) =
2
n
(iii) E(b
2
) =
n1
n
2
(iv) V ar(b
2
) =
4
2
2
n
2(
4
2
2
2
)
n
2
+
4
3
2
2
n
3
(v) E(S
2
) =
2
(vi) V ar(S
2
) =
4
n
n3
n(n1)
2
2
Proof:
(i)
E(X) =
1
n
n
i=1
E(X
i
) =
n
n
=
(ii)
V ar(X) =
_
1
n
_
2
n
i=1
V ar(X
i
) =
2
n
(iii)
E(b
2
) = E
_
1
n
n
i=1
(X
i
X)
2
_
= E
1
n
n
i=1
X
2
i
1
n
2
_
n
i=1
X
i
_
2
= E(X
2
)
1
n
2
E
i=1
X
2
i
+
i=j
X
i
X
j
()
= E(X
2
)
1
n
2
(nE(X
2
) +n(n 1)
2
)
=
n 1
n
(E(X
2
)
2
)
=
n 1
n
2
() holds since X
i
and X
j
are independent and then, due to Theorem 4.5.3, it holds
that E(X
i
X
j
) = E(X
i
)E(X
j
).
See Casella/Berger, page 214, and Rohatgi, page 303306, for the proof of parts (iv) through
(vi) and results regarding the 3
rd
and 4
th
moments and covariances.
38
7.2 Sample Moments and the Normal Distribution
(Based on Casella/Berger, Section 5.3)
Theorem 7.2.1:
Let X
1
, . . . , X
n
be iid N(,
2
) rvs. Then X =
1
n
n
i=1
X
i
and (X
1
X, . . . , X
n
X) are
independent.
Proof:
By computing the joint mgf of (X, X
1
X, X
2
X, . . . , X
n
X), we can use Theorem 4.6.3
(iv) to show independence. We will use the following two facts:
(1):
M
X
(t) = M
_
1
n
n
i=1
X
i
_
(t)
(A)
=
n
i=1
M
X
i
(
t
n
)
(B)
=
_
exp(
t
n
+
2
t
2
2n
2
)
_
n
= exp
_
t +
2
t
2
2n
_
(A) holds by Theorem 4.6.4 (i). (B) follows from Theorem 3.3.12 (vi) since the X
i
s are iid.
(2):
M
X
1
X,X
2
X,...,XnX
(t
1
, t
2
, . . . , t
n
)
Def.4.6.1
= E
_
exp
_
n
i=1
t
i
(X
i
X)
__
= E
_
exp
_
n
i=1
t
i
X
i
X
n
i=1
t
i
__
= E
_
exp
_
n
i=1
X
i
(t
i
t)
__
, where t =
1
n
n
i=1
t
i
= E
_
n
i=1
exp (X
i
(t
i
t))
_
(C)
=
n
i=1
E(exp(X
i
(t
i
t)))
=
n
i=1
M
X
i
(t
i
t)
39
(D)
=
n
i=1
exp
_
(t
i
t) +
2
(t
i
t)
2
2
_
= exp
i=1
(t
i
t)
. .
=0
+
2
2
n
i=1
(t
i
t)
2
= exp
_
2
2
n
i=1
(t
i
t)
2
_
(C) follows from Theorem 4.5.3 since the X
i
s are independent. (D) holds since we evaluate
M
X
(h) = exp(h +
2
h
2
2
) for h = t
i
t.
From (1) and (2), it follows:
M
X,X
1
X,...,XnX
(t, t
1
, . . . , t
n
)
Def.4.6.1
= E
_
exp(tX +t
1
(X
1
X) +. . . +t
n
(X
n
X))
_
= E
_
exp(tX +t
1
X
1
t
1
X +. . . +t
n
X
n
t
n
X)
_
= E
_
exp
_
n
i=1
X
i
t
i
(
n
i=1
t
i
t)X
__
= E
exp
i=1
X
i
t
i
(t
1
+. . . +t
n
t)
n
i=1
X
i
n
= E
_
exp
_
n
i=1
X
i
(t
i
t
1
+. . . +t
n
t
n
)
__
= E
_
n
i=1
exp
_
X
i
nt
i
nt +t
n
__
, where t =
1
n
n
i=1
t
i
(E)
=
n
i=1
E
_
exp
_
X
i
[t +n(t
i
t)]
n
__
=
n
i=1
M
X
i
_
t +n(t
i
t)
n
_
(F)
=
n
i=1
exp
_
[t +n(t
i
t)]
n
+
2
2
1
n
2
[t +n(t
i
t)]
2
_
40
= exp
nt +n
n
i=1
(t
i
t)
. .
=0
+
2
2n
2
n
i=1
(t +n(t
i
t))
2
= exp(t) exp
2
2n
2
nt
2
+ 2nt
n
i=1
(t
i
t)
. .
=0
+n
2
n
i=1
(t
i
t)
2
= exp
_
t +
2
2n
t
2
_
exp
_
2
2
n
i=1
(t
i
t)
2
_
(1)&(2)
= M
X
(t)M
X
1
X,...,XnX
(t
1
, . . . , t
n
)
Thus, X and (X
1
X, . . . , X
n
X) are independent by Theorem 4.6.3 (iv). (E) follows
from Theorem 4.5.3 since the X
i
s are independent. (F) holds since we evaluate M
X
(h) =
exp(h +
2
h
2
2
) for h =
t+n(t
i
t)
n
.
Corollary 7.2.2:
X and S
2
are independent.
Proof:
This can be seen since S
2
is a function of the vector (X
1
X, . . . , X
n
X), and (X
1
X, . . . , X
n
X) is independent of X, as previously shown in Theorem 7.2.1. We can use
Theorem 4.2.7 to formally complete this proof.
Corollary 7.2.3:
(n 1)S
2
2
2
n1
.
Proof:
Recall the following facts:
If Z N(0, 1) then Z
2
2
1
.
If Y
1
, . . . , Y
n
iid
2
1
, then
n
i=1
Y
i
2
n
.
For
2
n
, the mgf is M(t) = (1 2t)
n/2
.
If X
i
N(,
2
), then
X
i
N(0, 1) and
(X
i
)
2
2
2
1
.
Therefore,
n
i=1
(X
i
)
2
2
2
n
and
(X)
2
(
n
)
2
= n
(X)
2
2
2
1
. ()
41
Now consider
n
i=1
(X
i
)
2
=
n
i=1
((X
i
X) + (X ))
2
=
n
i=1
((X
i
X)
2
+ 2(X
i
X)(X ) + (X )
2
)
= (n 1)S
2
+ 0 +n(X )
2
Therefore,
n
i=1
(X
i
)
2
2
. .
W
=
n(X )
2
2
. .
U
+
(n 1)S
2
2
. .
V
We have an expression of the form: W = U +V
Since U and V are functions of X and S
2
, we know by Corollary 7.2.2 that they are independent
and also that their mgfs factor by Theorem 4.6.3 (iv). Now we can write:
M
W
(t) = M
U
(t)M
V
(t)
= M
V
(t) =
M
W
(t)
M
U
(t)
()
=
(1 2t)
n/2
(1 2t)
1/2
= (1 2t)
(n1)/2
Note that this is the mgf of
2
n1
by the uniqueness of mgfs. Thus, V =
(n1)S
2
2
2
n1
.
Corollary 7.2.4:
n(X )
S
t
n1
.
Proof:
Recall the following facts:
If Z N(0, 1), Y
2
n
and Z, Y independent, then
Z
_
Y
n
t
n
.
Z
1
=
n(X)
N(0, 1), Y
n1
=
(n1)S
2
2
2
n1
, and Z
1
, Y
n1
are independent.
Therefore,
n(X )
S
=
(X)
/
n
S/
n
/
n
=
(X)
/
n
_
S
2
(n1)
2
(n1)
=
Z
1
_
Y
n1
(n1)
t
n1
.
42
Corollary 7.2.5:
Let (X
1
, . . . , X
m
) iid N(
1
,
2
1
) and (Y
1
, . . . , Y
n
) iid N(
2
,
2
2
). Let X
i
, Y
j
be independent
i, j.
Then it holds:
X Y (
1
2
)
_
[(m1)S
2
1
/
2
1
] + [(n 1)S
2
2
/
2
2
]
m+n 2
2
1
/m +
2
2
/n
t
m+n2
In particular, if
1
=
2
, then:
X Y (
1
2
)
_
(m1)S
2
1
+ (n 1)S
2
2
mn(m+n 2)
m+n
t
m+n2
Proof:
Homework.
Corollary 7.2.6:
Let (X
1
, . . . , X
m
) iid N(
1
,
2
1
) and (Y
1
, . . . , Y
n
) iid N(
2
,
2
2
). Let X
i
, Y
j
be independent
i, j.
Then it holds:
S
2
1
/
2
1
S
2
2
/
2
2
F
m1,n1
In particular, if
1
=
2
, then:
S
2
1
S
2
2
F
m1,n1
Proof:
Recall that, if Y
1
2
m
and Y
2
2
n
, then
F =
Y
1
/m
Y
2
/n
F
m,n
.
Now, C
1
=
(m1)S
2
1
2
1
2
m1
and C
2
=
(n1)S
2
2
2
2
2
n1
. Therefore,
S
2
1
/
2
1
S
2
2
/
2
2
=
(m1)S
2
1
2
1
(m1)
(n1)S
2
2
2
2
(n1)
=
C
1
/(m1)
C
2
/(n 1)
F
m1,n1
.
If
1
=
2
, then
S
2
1
S
2
2
F
m1,n1
.
43
8 The Theory of Point Estimation
(Based on Casella/Berger, Chapters 6 & 7)
8.1 The Problem of Point Estimation
Let X be a rv dened on a probability space (, L, P). Suppose that the cdf F of X depends
on some set of parameters and that the functional form of F is known except for a nite
number of these parameters.
Denition 8.1.1:
The set of admissible values of is called the parameter space . If F
is the cdf of X
when is the parameter, the set F
i=1
be a sequence of iid rvs with cdf F
i=1
be a sequence of iid Bin(1, p) rvs. Let X
n
=
1
n
n
i=1
X
i
. Since E(X
i
) = p, it
follows by the WLLN that X
n
p
p, i.e., consistency, and by the SLLN that X
n
a.s.
p, i.e,
strong consistency.
However, a consistent estimate may not be unique. We may even have innite many consistent
estimates, e.g.,
n
i=1
X
i
+a
n +b
p
p nite a, b IR.
Theorem 8.2.3:
If T
n
is a sequence of estimates such that E(T
n
) and V ar(T
n
) 0 as n , then T
n
is
consistent for .
Proof:
P([ T
n
[> )
(A)
E((T
n
)
2
)
2
=
E[((T
n
E(T
n
)) + (E(T
n
) ))
2
]
2
=
V ar(T
n
) + 2E[(T
n
E(T
n
))(E(T
n
) )] + (E(T
n
) )
2
2
=
V ar(T
n
) + (E(T
n
) )
2
2
(B)
0 as n
45
(A) holds due to Corollary 3.5.2 (Markovs Inequality). (B) holds since V ar(T
n
) 0 as
n and E(T
n
) as n .
Denition 8.2.4:
Let ( be a group of Borelmeasurable functions of IR
n
onto itself which is closed under com-
position and inverse. A family of distributions P
(g(X) A) = P
g()
(X A).
Example 8.2.5:
Let (X
1
, . . . , X
n
) be iid N(,
2
) with pdf
f(x
1
, . . . , x
n
) =
1
(
2)
n
exp
_
1
2
2
n
i=1
(x
i
)
2
_
.
The group of linear transformations ( has elements
g(x
1
, . . . , x
n
) = (ax
1
+b, . . . , ax
n
+b), a > 0, < b < .
The pdf of g(X) is
f
(x
1
, . . . , x
n
) =
1
(
2a)
n
exp
_
1
2a
2
2
n
i=1
(x
i
a b)
2
_
, x
i
= ax
i
+b, i = 1, . . . , n.
So f : < < ,
2
> 0 is invariant under this group (, with g(,
2
) = (a+b, a
2
2
),
where < a +b < and a
2
2
> 0.
Denition 8.2.6:
Let ( be a group of transformations that leaves F
: invariant. An estimate T is
invariant under ( if
T(g(X
1
), . . . , g(X
n
)) = T(X
1
, . . . , X
n
) g (.
46
Denition 8.2.7:
An estimate T is:
location invariant if T(X
1
+a, . . . , X
n
+a) = T(X
1
, . . . , X
n
), a IR
scale invariant if T(cX
1
, . . . , cX
n
) = T(X
1
, . . . , X
n
), c IR0
permutation invariant if T(X
i
1
, . . . , X
in
) = T(X
1
, . . . , X
n
) permutations (i
1
, . . . , i
n
)
of 1, . . . , n
Example 8.2.8:
Let F
N(,
2
).
S
2
is location invariant.
X and S
2
are both permutation invariant.
Neither X nor S
2
is scale invariant.
Note:
Dierent sources make dierent use of the term invariant. Mood, Graybill & Boes (1974)
for example dene location invariant as T(X
1
+ a, . . . , X
n
+ a) = T(X
1
, . . . , X
n
) + a (page
332) and scale invariant as T(cX
1
, . . . , cX
n
) = cT(X
1
, . . . , X
n
) (page 336). According to their
denition, X is location invariant and scale invariant.
47
8.3 Sucient Statistics
(Based on Casella/Berger, Section 6.2)
Denition 8.3.1:
Let X = (X
1
, . . . , X
n
) be a sample from F
: IR
k
. A statistic T = T(X) is
sucient for (or for the family of distributions F
(T A) = 0 ).
Note:
(i) The sample X is always sucient but this is not particularly interesting and usually is
excluded from further considerations.
(ii) Idea: Once we have reduced from X to T(X), we have captured all the information
in X about .
(iii) Usually, there are several sucient statistics for a given family of distributions.
Example 8.3.2:
Let X = (X
1
, . . . , X
n
) be iid Bin(1, p) rvs. To estimate p, can we ignore the order and simply
count the number of successes?
Let T(X) =
n
i=1
X
i
. It is
P(X
1
= x
1
, . . . X
n
= x
n
[
n
i=1
X
i
= t) =
P(X
1
= x
1
, . . . , X
n
= x
n
, T = t)
P(T = t)
=
P(X
1
= x
1
, . . . , X
n
= x
n
)
P(T = t)
,
n
i=1
x
i
= t
0, otherwise
=
p
t
(1 p)
nt
_
n
t
_
p
t
(1 p)
nt
,
n
i=1
x
i
= t
0, otherwise
=
1
_
n
t
_
,
n
i=1
x
i
= t
0, otherwise
48
This does not depend on p. Thus, T =
n
i=1
X
i
is sucient for p.
Example 8.3.3:
Let X = (X
1
, . . . , X
n
) be iid Poisson(). Is T =
n
i=1
X
i
sucient for ? It is
P(X
1
= x
1
, . . . , X
n
= x
n
[ T = t) =
P(X
1
= x
1
, . . . , X
n
= x
n
, T = t)
P(T = t)
=
i=1
e
x
i
x
i
!
e
n
(n)
t
t!
,
n
i=1
x
i
= t
0, otherwise
=
e
n
x
i
x
i
!
e
n
(n)
t
t!
,
n
i=1
x
i
= t
0, otherwise
=
t!
n
t
n
i=1
x
i
!
,
n
i=1
x
i
= t
0, otherwise
This does not depend on . Thus, T =
n
i=1
X
i
is sucient for .
Example 8.3.4:
Let X
1
, X
2
be iid Poisson(). Is T = X
1
+ 2X
2
sucient for ? It is
P(X
1
= 0, X
2
= 1 [ X
1
+ 2X
2
= 2) =
P(X
1
= 0, X
2
= 1, X
1
+ 2X
2
= 2)
P(X
1
+ 2X
2
= 2)
=
P(X
1
= 0, X
2
= 1)
P(X
1
+ 2X
2
= 2)
=
P(X
1
= 0, X
2
= 1)
P(X
1
= 0, X
2
= 1) +P(X
1
= 2, X
2
= 0)
=
e
(e
)
e
(e
) + (
e
2
2
)e
=
1
1 +
2
,
49
i.e., this is a counterexample. This expression still depends on . Thus, T = X
1
+ 2X
2
is
not sucient for .
Note:
Denition 8.3.1 can be dicult to check. In addition, it requires a candidate statistic. We
need something constructive that helps in nding sucient statistics without having to check
Denition 8.3.1. The next Theorem helps in nding such statistics.
Theorem 8.3.5: Factorization Criterion
Let X
1
, . . . , X
n
be rvs with pdf (or pmf) f(x
1
, . . . , x
n
[ ), . Then T(X
1
, . . . , X
n
) is
sucient for i we can write
f(x
1
, . . . , x
n
[ ) = h(x
1
, . . . , x
n
) g(T(x
1
, . . . , x
n
) [ ),
where h does not depend on and g does not depend on x
1
, . . . , x
n
except as a function of T.
Proof:
Discrete case only.
=:
Suppose T(X) is sucient for . Let
g(t [ ) = P
(T(X) = t)
h(x) = P(X = x [ T(X) = t)
Then it holds:
f(x [ ) = P
(X = x)
()
= P
(X = x, T(X) = T(x) = t)
= P
(T(X) = t
0
) =
{x : T(x)=t
0
}
P
(X = x)
50
=
{x : T(x)=t
0
}
h(x)g(T(x) [ )
= g(t
0
[ )
{x : T(x)=t
0
}
h(x) (A)
If P
(T(X) = t
0
) > 0, it holds:
P
(X = x [ T(X) = t
0
) =
P
(X = x, T(X) = t
0
)
P
(T(X) = t
0
)
=
(X = x)
P
(T(X) = t
0
)
, if T(x) = t
0
0, otherwise
(A)
=
g(t
0
[ )h(x)
g(t
0
[ )
{x : T(x)=t
0
}
h(x)
, if T(x) = t
0
0, otherwise
=
h(x)
{x : T(x)=t
0
}
h(x)
, if T(x) = t
0
0, otherwise
This last expression does not depend on . Thus, T(X) is sucient for .
Note:
(i) In the Theorem above, and T may be vectors.
(ii) If T is sucient for , then also any 1to1 mapping of T is sucient for . However,
this does not hold for arbitrary functions of T.
Example 8.3.6:
Let X
1
, . . . , X
n
be iid Bin(1, p). It is
P(X
1
= x
1
, . . . , X
n
= x
n
[ p) = p
x
i
(1 p)
n
x
i
.
Thus, h(x
1
, . . . , x
n
) = 1 and g(
x
i
[ p) = p
x
i
(1 p)
n
x
i
.
Hence, T =
n
i=1
X
i
is sucient for p.
51
Example 8.3.7:
Let X_1, ..., X_n be iid Poisson(λ). It is
P(X_1 = x_1, ..., X_n = x_n | λ) = Π_{i=1}^n e^(-λ) λ^(x_i) / x_i! = e^(-nλ) λ^(Σ x_i) / Π x_i!.
Thus, h(x_1, ..., x_n) = 1 / Π x_i! and g(Σ x_i | λ) = e^(-nλ) λ^(Σ x_i).
Hence, T = Σ_{i=1}^n X_i is sufficient for λ.
Example 8.3.8:
Let X_1, ..., X_n be iid N(μ, σ²) where μ ∈ IR and σ² > 0 are both unknown. It is
f(x_1, ..., x_n | μ, σ²) = 1/(σ √(2π))^n · exp( − Σ (x_i − μ)² / (2σ²) )
= 1/(σ √(2π))^n · exp( − Σ x_i² / (2σ²) + μ Σ x_i / σ² − n μ² / (2σ²) ).
Hence, T = (Σ_{i=1}^n X_i, Σ_{i=1}^n X_i²) is sufficient for (μ, σ²).
Example 8.3.9:
Let X_1, ..., X_n be iid U(θ, θ + 1) where −∞ < θ < ∞. It is
f(x_1, ..., x_n | θ) = { 1,  θ < x_i < θ + 1 ∀i ∈ {1, ..., n}
                       { 0,  otherwise
= Π_{i=1}^n I_(θ,∞)(x_i) · I_(−∞,θ+1)(x_i)
= I_(θ,∞)(min(x_i)) · I_(−∞,θ+1)(max(x_i)).
Hence, T = (X_(1), X_(n)) is sufficient for θ.
Definition 8.3.10:
Let {f_θ : θ ∈ Θ} be a family of pdfs (or pmfs). We say this family is complete if
E_θ(g(X)) = 0 ∀θ ∈ Θ
implies that
P_θ(g(X) = 0) = 1 ∀θ ∈ Θ.
We say a statistic T(X) is complete if the family of distributions of T is complete.
Example 8.3.11:
Let X_1, ..., X_n be iid Bin(1, p). We have seen in Example 8.3.6 that T = Σ_{i=1}^n X_i is sufficient for p. Is it also complete?
We know that T ~ Bin(n, p). Thus,
E_p(g(T)) = Σ_{t=0}^n g(t) (n choose t) p^t (1 − p)^(n−t) = 0  ∀p ∈ (0, 1)
implies that
(1 − p)^n Σ_{t=0}^n g(t) (n choose t) ( p/(1 − p) )^t = 0  ∀p ∈ (0, 1).
However, Σ_{t=0}^n g(t) (n choose t) ( p/(1 − p) )^t is a polynomial in p/(1 − p) which is only equal to 0 for all p ∈ (0, 1) if all of its coefficients are 0.
Therefore, g(t) = 0 for t = 0, 1, ..., n. Hence, T is complete.
Example 8.3.12:
Let X_1, ..., X_n be iid N(θ, θ²). We know from Example 8.3.8 that T = (Σ_{i=1}^n X_i, Σ_{i=1}^n X_i²) is sufficient for θ. Is it also complete?
We know that Σ_{i=1}^n X_i ~ N(nθ, nθ²). Therefore,
E( (Σ_{i=1}^n X_i)² ) = nθ² + n²θ² = n(n + 1)θ²   (variance plus squared expectation)
E( Σ_{i=1}^n X_i² ) = n(θ² + θ²) = 2nθ²   (n times (Var + E²)).
It follows that
E_θ( 2(Σ_{i=1}^n X_i)² − (n + 1) Σ_{i=1}^n X_i² ) = 0  ∀θ.
But g(x_1, ..., x_n) = 2(Σ_{i=1}^n x_i)² − (n + 1) Σ_{i=1}^n x_i² is not identically equal to 0.
Therefore, T is not complete.
Note:
Recall from Section 5.2 what it means if we say the family of distributions {f_θ : θ ∈ Θ} is a one-parameter (or k-parameter) exponential family.
Theorem 8.3.13:
Let {f_θ : θ ∈ Θ} be a k-parameter exponential family, i.e.,
f_θ(x) = exp( Σ_{i=1}^k Q_i(θ) T_i(x) + D(θ) + S(x) ).
Then T = (T_1(X), ..., T_k(X)) has a pdf (or pmf) of the form
f_θ(t) = exp( Σ_{i=1}^k t_i Q_i(θ) + D(θ) + S*(t) )
for suitable S*(t).
Proof:
The proof follows from our Theorems regarding the transformation of rvs.
Theorem 8.3.14:
Let T = T(X) have a distribution from a one-parameter exponential family in natural form, i.e.,
f_θ(t) = exp( θt + D(θ) + S*(t) ),
where the parameter space Θ contains an open interval (a, b). Then T is complete.
Proof:
Discrete case only. We have to show that
E_θ(g(T(X))) = Σ_t g(t) P_θ(T(X) = t)   (A)
= Σ_t g(t) exp( θt + D(θ) + S*(t) ) = 0  ∀θ   (B)
implies g(t) = 0 ∀t. Note that in (A) we make use of a result established in Theorem 8.3.13.
We now define functions g⁺ and g⁻ as:
g⁺(t) = { g(t), if g(t) ≥ 0
        { 0, otherwise
g⁻(t) = { −g(t), if g(t) < 0
        { 0, otherwise
It is g(t) = g⁺(t) − g⁻(t), so (B) implies
Σ_t g⁺(t) exp( θt + S*(t) ) = Σ_t g⁻(t) exp( θt + S*(t) )   (C)
where the term exp(D(θ)) in (A) drops out as a constant on both sides.
If we fix θ_0 ∈ (a, b) and define
p⁺(t) = g⁺(t) exp( θ_0 t + S*(t) ) / Σ_t g⁺(t) exp( θ_0 t + S*(t) ),
p⁻(t) = g⁻(t) exp( θ_0 t + S*(t) ) / Σ_t g⁻(t) exp( θ_0 t + S*(t) ),
it is obvious that p⁺(t) ≥ 0 ∀t and p⁻(t) ≥ 0 ∀t and Σ_t p⁺(t) = 1 and Σ_t p⁻(t) = 1. Hence, p⁺ and p⁻ are pmfs. Consider the mgfs of p⁺ and p⁻. It holds that
M⁺(δ) = Σ_t e^(δt) p⁺(t)
= Σ_t g⁺(t) exp( (θ_0 + δ)t + S*(t) ) / Σ_t g⁺(t) exp( θ_0 t + S*(t) )
(C)= Σ_t g⁻(t) exp( (θ_0 + δ)t + S*(t) ) / Σ_t g⁻(t) exp( θ_0 t + S*(t) )
= Σ_t e^(δt) p⁻(t)
= M⁻(δ)   ∀δ ∈ (a − θ_0, b − θ_0), where a − θ_0 < 0 and b − θ_0 > 0.
By the uniqueness of mgfs it follows that p⁺(t) = p⁻(t) ∀t
⇒ g⁺(t) = g⁻(t) ∀t
⇒ g(t) = 0 ∀t
⇒ T is complete.
Definition 8.3.15:
Let X = (X_1, ..., X_n) be a sample from {F_θ : θ ∈ Θ ⊆ IR^k} and let T = T(X) be a sufficient statistic for θ. T = T(X) is called a minimal sufficient statistic for θ if, for any other sufficient statistic T' = T'(X), T is a function of T'.
Note:
(i) A minimal sufficient statistic achieves the greatest possible data reduction for a sufficient statistic.
(ii) If T is minimal sufficient for θ, then also any 1-to-1 mapping of T is minimal sufficient for θ. However, this does not hold for arbitrary functions of T.
Definition 8.3.16:
Let X = (X_1, ..., X_n) be a sample from {F_θ : θ ∈ Θ ⊆ IR^k}. A statistic T = T(X) is called ancillary if its distribution does not depend on the parameter θ.
Example 8.3.17:
Let X_1, ..., X_n be iid U(θ, θ + 1) where −∞ < θ < ∞. As shown in Example 8.3.9, T = (X_(1), X_(n)) is sufficient for θ. Define
R_n = X_(n) − X_(1).
Use the result from Stat 6710, Homework Assignment 5, Question (viii) (a) to obtain
f_{R_n}(r | θ) = f_{R_n}(r) = n(n − 1) r^(n−2) (1 − r) I_(0,1)(r).
This means that R_n ~ Beta(n − 1, 2). Moreover, R_n does not depend on θ and, therefore, R_n is ancillary.
Theorem 8.3.18: Basu's Theorem
Let X = (X_1, ..., X_n) be a sample from {F_θ : θ ∈ Θ ⊆ IR^k}. If T = T(X) is a complete and minimal sufficient statistic, then T is independent of any ancillary statistic.
Theorem 8.3.19:
Let X = (X_1, ..., X_n) be a sample from {F_θ : θ ∈ Θ ⊆ IR^k}. If a minimal sufficient statistic T = T(X) exists for θ, then any complete sufficient statistic is also a minimal sufficient statistic.
Note:
(i) Due to the last Theorem, Basu's Theorem often only is stated in terms of a complete sufficient statistic (which automatically is also a minimal sufficient statistic).
(ii) As already shown in Corollary 7.2.2, X̄ and S² are independent when sampling from a N(μ, σ²) population. As outlined in Casella/Berger, page 289, we could also use Basu's Theorem to obtain the same result.
(iii) The converse of Basu's Theorem is false, i.e., if T(X) is independent of any ancillary statistic, it does not necessarily follow that T(X) is a complete, minimal sufficient statistic.
(iv) As seen in Examples 8.3.8 and 8.3.12, T = (Σ_{i=1}^n X_i, Σ_{i=1}^n X_i²) is sufficient for θ but it is not complete when X_1, ..., X_n are iid N(θ, θ²). However, it can be shown that T is minimal sufficient. So, there may be distributions where a minimal sufficient statistic exists but a complete statistic does not exist.
(v) As with invariance, there exist several different definitions of ancillarity within the literature, the one defined in this chapter being the most commonly used.
8.4 Unbiased Estimation
(Based on Casella/Berger, Section 7.3)
Definition 8.4.1:
Let {F_θ : θ ∈ Θ} be a family of distributions. A statistic T is called an unbiased estimate of a parametric function d(θ) if E_θ(T) = d(θ) ∀θ ∈ Θ.
Any function d(θ) for which an unbiased estimate T exists is called an estimable function.
If T is biased,
b(θ, T) = E_θ(T) − d(θ)
is called the bias of T.
Example 8.4.2:
If the k-th population moment exists, the k-th sample moment is an unbiased estimate. If Var(X) = σ², the sample variance S² is an unbiased estimate of σ².
However, note that for X_1, ..., X_n iid N(μ, σ²), S is not an unbiased estimate of σ:
(n − 1)S²/σ² ~ χ²_{n−1} = Gamma( (n−1)/2, 2 )
⇒ E( √( (n − 1)S²/σ² ) ) = ∫_0^∞ √x · x^((n−1)/2 − 1) e^(−x/2) / ( 2^((n−1)/2) Γ((n−1)/2) ) dx
= √2 Γ(n/2)/Γ((n−1)/2) · ∫_0^∞ x^(n/2 − 1) e^(−x/2) / ( 2^(n/2) Γ(n/2) ) dx
(∗)= √2 Γ(n/2)/Γ((n−1)/2)
⇒ E(S) = σ √( 2/(n − 1) ) · Γ(n/2)/Γ((n−1)/2).
(∗) holds since x^(n/2 − 1) e^(−x/2) / ( 2^(n/2) Γ(n/2) ) is the pdf of a Gamma(n/2, 2) distribution and thus the integral is 1.
So S is biased for σ and
b(σ, S) = σ ( √( 2/(n − 1) ) · Γ(n/2)/Γ((n−1)/2) − 1 ).
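As a quick numerical sketch (the function name is ours), the bias factor E(S)/σ = √(2/(n−1)) Γ(n/2)/Γ((n−1)/2) can be evaluated for several sample sizes; it approaches 1 as n grows, so the bias of S vanishes asymptotically:

    # Bias of S for sigma in Example 8.4.2: E(S)/sigma tends to 1 as n grows.
    from math import sqrt, lgamma, exp

    def bias_factor(n):
        """E(S)/sigma for a N(mu, sigma^2) sample of size n."""
        return sqrt(2.0 / (n - 1)) * exp(lgamma(n / 2) - lgamma((n - 1) / 2))

    for n in (2, 5, 10, 30, 100):
        print(n, bias_factor(n))   # approx 0.798, 0.940, 0.973, 0.991, 0.997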
Note:
If T is unbiased for θ, g(T) is not necessarily unbiased for g(θ) (unless g is a linear function).
Example 8.4.3:
Unbiased estimates may not exist (see Rohatgi, page 351, Example 2) or they may be absurd as in the following case:
Let X ~ Poisson(λ) and let d(λ) = e^(−2λ). Consider T(X) = (−1)^X as an estimate. It is
E_λ(T(X)) = e^(−λ) Σ_{x=0}^∞ (−1)^x λ^x / x! = e^(−λ) Σ_{x=0}^∞ (−λ)^x / x! = e^(−λ) e^(−λ) = e^(−2λ) = d(λ).
Hence T is unbiased for d(λ), but since T alternates between −1 and 1 while d(λ) > 0, T is not a good estimate.
Note:
If there exist 2 unbiased estimates T_1 and T_2 of θ, then any estimate of the form αT_1 + (1 − α)T_2 for 0 < α < 1 will also be an unbiased estimate of θ. Which one should we choose?
Definition 8.4.4:
The mean square error of an estimate T of θ is defined as
MSE(θ, T) = E_θ( (T − θ)² ) = Var_θ(T) + (b(θ, T))².
Let {T_i}_{i=1}^∞ be a sequence of estimates of θ. If
lim_{i→∞} MSE(θ, T_i) = 0 ∀θ ∈ Θ,
then {T_i} is called a mean-squared-error consistent (MSE-consistent) sequence of estimates of θ.
Note:
(i) If we allow all estimates and compare their MSE, generally it will depend on θ which estimate is better. For example, θ̂ = 17 is perfect if θ = 17, but it is lousy otherwise.
(ii) If we restrict ourselves to the class of unbiased estimates, then MSE(θ, T) = Var_θ(T).
(iii) MSE-consistency means that both the bias and the variance of T_i approach 0 as i → ∞.
Definition 8.4.5:
Let θ_0 ∈ Θ and let U(θ_0) be the class of all unbiased estimates T of θ_0 such that E_{θ_0}(T²) < ∞. Then T_0 ∈ U(θ_0) is called a locally minimum variance unbiased estimate (LMVUE) at θ_0 if
E_{θ_0}( (T_0 − θ_0)² ) ≤ E_{θ_0}( (T − θ_0)² )  ∀T ∈ U(θ_0).
Definition 8.4.6:
Let U be the class of all unbiased estimates T of θ ∈ Θ such that E_θ(T²) < ∞ ∀θ ∈ Θ. Then T_0 ∈ U is called a uniformly minimum variance unbiased estimate (UMVUE) of θ if
E_θ( (T_0 − θ)² ) ≤ E_θ( (T − θ)² )  ∀θ ∈ Θ and ∀T ∈ U.
An Excursion into Logic II
In our first Excursion into Logic in Stat 6710 Mathematical Statistics I, we have established the following results:
A ⇒ B is equivalent to ¬B ⇒ ¬A is equivalent to ¬A ∨ B:

A   B   A ⇒ B   ¬A   ¬B   ¬B ⇒ ¬A   ¬A ∨ B
1   1     1      0    0       1        1
1   0     0      0    1       0        0
0   1     1      1    0       1        1
0   0     1      1    1       1        1

When dealing with formal proofs, there exists one more technique to show A ⇒ B. Equivalently, we can show (A ∧ ¬B) ⇒ 0, a technique called Proof by Contradiction. This means, assuming that A and ¬B hold, we show that this implies 0, i.e., something that is always false, i.e., a contradiction. And here is the corresponding truth table:

A   B   A ⇒ B   ¬B   A ∧ ¬B   (A ∧ ¬B) ⇒ 0
1   1     1      0      0           1
1   0     0      1      1           0
0   1     1      0      0           1
0   0     1      1      0           1

Note:
We make use of this proof technique in the Proof of the next Theorem.
Example:
Let A : x = 5 and B : x² = 25. Obviously A ⇒ B.
But we can also prove this in the following way:
A : x = 5 and ¬B : x² ≠ 25
⇒ x² = 25 ∧ x² ≠ 25.
This is impossible, i.e., a contradiction. Thus, A ⇒ B.
Theorem 8.4.7:
Let U be the class of all unbiased estimates T of θ ∈ Θ with E_θ(T²) < ∞ ∀θ ∈ Θ, and suppose that U is non-empty. Let U_0 be the set of all unbiased estimates of 0, i.e.,
U_0 = {ν : E_θ(ν) = 0, E_θ(ν²) < ∞ ∀θ ∈ Θ}.
Then T_0 ∈ U is UMVUE iff
E_θ(νT_0) = 0  ∀θ ∈ Θ, ∀ν ∈ U_0.
Proof:
Note that E_θ(νT_0) always exists. This follows from the Cauchy-Schwarz Inequality (Theorem 4.5.7 (ii)):
(E_θ(νT_0))² ≤ E_θ(ν²) E_θ(T_0²) < ∞
because E_θ(ν²) < ∞ and E_θ(T_0²) < ∞. Therefore, also E_θ(νT_0) < ∞.
"⇒":
We suppose that T_0 ∈ U is UMVUE and that E_{θ_0}(ν_0 T_0) ≠ 0 for some θ_0 ∈ Θ and some ν_0 ∈ U_0.
It holds
E_θ(T_0 + λν_0) = E_θ(T_0) = θ  ∀λ ∈ IR.
Therefore, T_0 + λν_0 ∈ U ∀λ ∈ IR.
Also, E_{θ_0}(ν_0²) > 0 (since otherwise, P_{θ_0}(ν_0 = 0) = 1 and then E_{θ_0}(ν_0 T_0) = 0).
Now let
λ = − E_{θ_0}(T_0 ν_0) / E_{θ_0}(ν_0²).
Then,
E_{θ_0}( (T_0 + λν_0)² ) = E_{θ_0}( T_0² + 2λ T_0 ν_0 + λ² ν_0² )
= E_{θ_0}(T_0²) − 2 (E_{θ_0}(T_0 ν_0))² / E_{θ_0}(ν_0²) + (E_{θ_0}(T_0 ν_0))² / E_{θ_0}(ν_0²)
= E_{θ_0}(T_0²) − (E_{θ_0}(T_0 ν_0))² / E_{θ_0}(ν_0²)
< E_{θ_0}(T_0²),
and therefore,
Var_{θ_0}(T_0 + λν_0) < Var_{θ_0}(T_0).
This means, T_0 is not UMVUE, i.e., a contradiction!
"⇐":
Let E_θ(νT_0) = 0 for some T_0 ∈ U, for all θ ∈ Θ and all ν ∈ U_0.
We choose T ∈ U, then also T_0 − T ∈ U_0 and
E_θ( T_0 (T_0 − T) ) = 0,
i.e.,
E_θ(T_0²) = E_θ(T_0 T).
It follows from the Cauchy-Schwarz Inequality (Theorem 4.5.7 (ii)) that
E_θ(T_0²) = E_θ(T_0 T) ≤ (E_θ(T_0²))^(1/2) (E_θ(T²))^(1/2).
This implies
(E_θ(T_0²))^(1/2) ≤ (E_θ(T²))^(1/2)
and
Var_θ(T_0) ≤ Var_θ(T),
where T is an arbitrary unbiased estimate of θ. Thus, T_0 is UMVUE.
Theorem 8.4.8:
Let U be the non-empty class of unbiased estimates of θ as defined in Theorem 8.4.7. Then there exists at most one UMVUE T ∈ U for θ.
Proof:
Suppose T_0, T_1 ∈ U are both UMVUE.
Then T_1 − T_0 ∈ U_0, Var_θ(T_0) = Var_θ(T_1), and E_θ( T_0 (T_1 − T_0) ) = 0
⇒ E_θ(T_0²) = E_θ(T_0 T_1)
⇒ Cov_θ(T_0, T_1) = E_θ(T_0 T_1) − E_θ(T_0) E_θ(T_1) = E_θ(T_0²) − (E_θ(T_0))² = Var_θ(T_0) = Var_θ(T_1)
⇒ ρ_{T_0 T_1} = 1
⇒ P_θ(a T_0 + b T_1 = 0) = 1 for some a, b
⇒ θ = E_θ(T_0) = E_θ( −(b/a) T_1 ) = −(b/a) E_θ(T_1)
⇒ −b/a = 1
⇒ P_θ(T_0 = T_1) = 1
Theorem 8.4.9:
(i) If a UMVUE T exists for a real function d(θ), then λT is the UMVUE for λd(θ), λ ∈ IR.
(ii) If UMVUEs T_1 and T_2 exist for real functions d_1(θ) and d_2(θ), respectively, then T_1 + T_2 is the UMVUE for d_1(θ) + d_2(θ).
Proof:
Homework.
Theorem 8.4.10:
If a sample consists of n independent observations X_1, ..., X_n from the same distribution, the UMVUE, if it exists, is permutation invariant.
Proof:
Homework.
Theorem 8.4.11: Rao-Blackwell
Let {F_θ : θ ∈ Θ} be a family of distribution functions and let h be an unbiased estimate of θ with E_θ(h²) < ∞. Let T be a sufficient statistic for {F_θ : θ ∈ Θ}. Then the conditional expectation E(h | T) is independent of θ and it is an unbiased estimate of θ. Additionally,
E_θ( (E(h | T) − θ)² ) ≤ E_θ( (h − θ)² )
with equality iff h = E(h | T).
Proof:
By Theorem 4.7.3, E_θ(E(h | T)) = E_θ(h) = θ.
Since the distribution of X | T does not depend on θ due to sufficiency, neither does E(h | T) depend on θ.
Thus, we only have to show that
E_θ( (E(h | T))² ) ≤ E_θ(h²) = E_θ( E(h² | T) ).
Thus, we only have to show that
(E(h | T))² ≤ E(h² | T).
But the Cauchy-Schwarz Inequality (Theorem 4.5.7 (ii)) gives us
(E(h | T))² ≤ E(h² | T) E(1 | T) = E(h² | T).
Equality holds iff
E_θ( (E(h | T))² ) = E_θ(h²) = E_θ( E(h² | T) )
⟺ E_θ( E(h² | T) − (E(h | T))² ) = 0
⟺ E_θ( Var(h | T) ) = 0
⟺ Var(h | T) = 0 with probability 1, i.e., h is a function of T and h = E(h | T).
Theorem 8.4.12: Lehmann-Scheffé
Let T be a complete sufficient statistic for {F_θ : θ ∈ Θ} and let U be the non-empty class of unbiased estimates h of θ with E_θ(h²) < ∞. Then E(h | T), h ∈ U, is the (essentially unique) UMVUE of θ.
Proof:
Let h_1, h_2 ∈ U. Then
E_θ( E(h_1 | T) ) = E_θ( E(h_2 | T) ) = θ by Theorem 8.4.11.
Therefore,
E_θ( E(h_1 | T) − E(h_2 | T) ) = 0 ∀θ ∈ Θ.
Since T is complete, E(h_1 | T) = E(h_2 | T).
Therefore, E(h | T) must be the same for all h ∈ U and E(h | T) improves all h ∈ U. Therefore, E(h | T) is UMVUE by Theorem 8.4.11.
Note:
We can use Theorem 8.4.12 to find the UMVUE in two ways if we have a complete sufficient statistic T:
(i) If we can find an unbiased estimate h(T), it will be the UMVUE since E(h(T) | T) = h(T).
(ii) If we have any unbiased estimate h and if we can calculate E(h | T), then E(h | T) will be the UMVUE. The process of determining the UMVUE this way often is called Rao-Blackwellization.
(iii) Even if a complete sufficient statistic does not exist, the UMVUE may still exist (see Rohatgi, page 357-358, Example 10).
Example 8.4.13:
Let X_1, ..., X_n be iid Bin(1, p). Then T = Σ_{i=1}^n X_i is a complete sufficient statistic as seen in Examples 8.3.6 and 8.3.11.
Since E(X_1) = p, X_1 is an unbiased estimate of p. However, due to part (i) of the Note above, since X_1 is not a function of T, X_1 is not the UMVUE.
We can use part (ii) of the Note above to construct the UMVUE. It is
P(X_1 = x | T) = { T/n, x = 1
                 { (n − T)/n, x = 0
⇒ E(X_1 | T) = T/n = X̄
⇒ X̄ is the UMVUE for p.
If we are interested in the UMVUE for d(p) = p(1 − p) = p − p² = Var(X), we can find it in the following way:
E(T) = np
E(T²) = E( Σ_{i=1}^n X_i² + Σ_{i=1}^n Σ_{j=1, j≠i}^n X_i X_j ) = np + n(n − 1)p²
⇒ E( nT / (n(n − 1)) ) = np/(n − 1)
E( T² / (n(n − 1)) ) = p/(n − 1) + p²
⇒ E( (nT − T²) / (n(n − 1)) ) = np/(n − 1) − p/(n − 1) − p² = (n − 1)p/(n − 1) − p² = p − p² = d(p)
Thus, due to part (i) of the Note above, (nT − T²)/(n(n − 1)) is the UMVUE for d(p) = p(1 − p).
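The unbiasedness of (nT − T²)/(n(n − 1)) for p(1 − p) can be checked by a short simulation. This is only an illustrative sketch (function names are ours):

    # Simulation check of Example 8.4.13: (nT - T^2)/(n(n-1)) with T = sum(X_i)
    # is unbiased for p(1-p).
    import random

    def mean_umvue(n, p, reps=100_000, seed=2):
        random.seed(seed)
        total = 0.0
        for _ in range(reps):
            t = sum(1 for _ in range(n) if random.random() < p)
            total += (n * t - t * t) / (n * (n - 1))
        return total / reps

    n, p = 10, 0.3
    print(mean_umvue(n, p))   # approx 0.21
    print(p * (1 - p))        # exact target d(p) = 0.21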
8.5 Lower Bounds for the Variance of an Estimate
(Based on Casella/Berger, Section 7.3)
Theorem 8.5.1: Cramér-Rao Lower Bound (CRLB)
Let Θ be an open interval of IR. Let {f_θ : θ ∈ Θ} be a family of pdfs (or pmfs) such that the set {x : f_θ(x) = 0} is independent of θ.
Let ψ(θ) be defined on Θ and let it be differentiable for all θ ∈ Θ. Let T be an unbiased estimate of ψ(θ) such that E_θ(T²) < ∞. Suppose that
(i) ∂f_θ(x)/∂θ exists for all θ and all x,
(ii) for a pdf f_θ
∂/∂θ ( ∫ f_θ(x) dx ) = ∫ ∂f_θ(x)/∂θ dx = 0,
or for a pmf f_θ
∂/∂θ Σ_x f_θ(x) = Σ_x ∂f_θ(x)/∂θ = 0,
(iii) for a pdf f_θ
∂/∂θ ( ∫ T(x) f_θ(x) dx ) = ∫ T(x) ∂f_θ(x)/∂θ dx,
or for a pmf f_θ
∂/∂θ Σ_x T(x) f_θ(x) = Σ_x T(x) ∂f_θ(x)/∂θ.
Then it holds
(ψ'(θ))² ≤ E_θ( (T(X) − ψ(θ))² ) · E_θ[ ( ∂ log f_θ(X)/∂θ )² ]   (A).
Further, for any θ_0 ∈ Θ, either ψ'(θ_0) = 0 and equality holds in (A) for θ = θ_0, or we have
E_{θ_0}( (T(X) − ψ(θ_0))² ) ≥ (ψ'(θ_0))² / E_{θ_0}[ ( ∂ log f_θ(X)/∂θ )² ]   (B).
Finally, if equality holds in (B), then there exists a real number K(θ_0) ≠ 0 such that
T(X) − ψ(θ_0) = K(θ_0) · ∂ log f_θ(X)/∂θ |_{θ=θ_0}   (C)
with probability 1, provided that T is not a constant.
Note:
(i) Conditions (i), (ii), and (iii) are called regularity conditions. Conditions under which they hold can be found in Rohatgi, page 11-13, Parts 12 and 13.
(ii) The right hand side of inequality (B) is called the Cramér-Rao Lower Bound at θ_0, or, in symbols, CRLB(θ_0).
(iii) The expression E_θ[ ( ∂ log f_θ(X)/∂θ )² ] is called the Fisher Information in X.
Proof:
From (ii), we get
E_θ( ∂ log f_θ(X)/∂θ ) = ∫ ( ∂ log f_θ(x)/∂θ ) f_θ(x) dx
= ∫ ( ∂f_θ(x)/∂θ ) (1/f_θ(x)) f_θ(x) dx
= ∫ ∂f_θ(x)/∂θ dx
= 0
⇒ E_θ( ψ(θ) · ∂ log f_θ(X)/∂θ ) = 0.
From (iii), we get
E_θ( T(X) · ∂ log f_θ(X)/∂θ ) = ∫ T(x) ( ∂ log f_θ(x)/∂θ ) f_θ(x) dx
= ∫ T(x) ( ∂f_θ(x)/∂θ ) (1/f_θ(x)) f_θ(x) dx
= ∫ T(x) ∂f_θ(x)/∂θ dx
(iii)= ∂/∂θ ( ∫ T(x) f_θ(x) dx )
= ∂/∂θ E_θ(T(X))
= ∂/∂θ ψ(θ)
= ψ'(θ)
⇒ E_θ( (T(X) − ψ(θ)) · ∂ log f_θ(X)/∂θ ) = ψ'(θ)
⇒ (ψ'(θ))² = ( E_θ( (T(X) − ψ(θ)) · ∂ log f_θ(X)/∂θ ) )²
(∗)≤ E_θ( (T(X) − ψ(θ))² ) · E_θ[ ( ∂ log f_θ(X)/∂θ )² ],
i.e., (A) holds. (∗) follows from the Cauchy-Schwarz Inequality (Theorem 4.5.7 (ii)).
If ψ'(θ_0) ≠ 0, then the left-hand side of (A) is > 0. Therefore, the right-hand side is > 0. Thus,
E_{θ_0}[ ( ∂ log f_θ(X)/∂θ )² ] > 0,
and (B) follows directly from (A).
If ψ'(θ_0) = 0, but equality does not hold in (A), then
E_{θ_0}[ ( ∂ log f_θ(X)/∂θ )² ] > 0,
and (B) follows directly from (A) again.
Finally, if equality holds in (B), then ψ'(θ_0) ≠ 0 (because T is not constant). Thus, MSE(ψ(θ_0), T(X)) > 0. The Cauchy-Schwarz Inequality (Theorem 4.5.7 (iii)) gives equality iff there exist constants (a, b) ∈ IR² − {(0, 0)} such that
P_{θ_0}( a (T(X) − ψ(θ_0)) + b · ∂ log f_θ(X)/∂θ |_{θ=θ_0} = 0 ) = 1.
This implies K(θ_0) = −b/a, i.e., (C) holds.
Corollary 8.5.2:
Under the assumptions of Theorem 8.5.1, it holds for an unbiased estimate T of ψ(θ) that
Var_θ(T(X)) ≥ (ψ'(θ))² / E_θ[ ( ∂ log f_θ(X)/∂θ )² ]   (∗).
If we have ψ(θ) = θ, the inequality (∗) above reduces to
Var_θ(T(X)) ≥ [ E_θ( ( ∂ log f_θ(X)/∂θ )² ) ]^(−1).
Finally, if X = (X_1, ..., X_n) is iid with identical f_θ, then
Var_θ(T(X)) ≥ (ψ'(θ))² / ( n E_θ[ ( ∂ log f_θ(X_1)/∂θ )² ] ).
Example 8.5.3:
Let X_1, ..., X_n be iid Bin(1, p). Let X = Σ_{i=1}^n X_i ~ Bin(n, p), p ∈ Θ = (0, 1) ⊆ IR. Let
ψ(p) = E(T(X)) = Σ_{x=0}^n T(x) (n choose x) p^x (1 − p)^(n−x).
ψ(p) is differentiable with respect to p under the summation sign since it is a finite polynomial in p.
Since X = Σ_{i=1}^n X_i with f_p(x_1) = p^(x_1) (1 − p)^(1−x_1), x_1 ∈ {0, 1}
⇒ log f_p(x_1) = x_1 log p + (1 − x_1) log(1 − p)
⇒ ∂ log f_p(x_1)/∂p = x_1/p − (1 − x_1)/(1 − p) = ( x_1(1 − p) − p(1 − x_1) ) / ( p(1 − p) ) = (x_1 − p)/( p(1 − p) )
⇒ E_p[ ( ∂ log f_p(X_1)/∂p )² ] = Var(X_1) / ( p²(1 − p)² ) = 1/( p(1 − p) ).
So, if ψ(p) = p and if T is unbiased for p, then
Var_p(T(X)) ≥ 1 / ( n · 1/( p(1 − p) ) ) = p(1 − p)/n.
Since Var(X̄) = p(1 − p)/n, X̄ attains the CRLB. Therefore, X̄ is the UMVUE.
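As a brief numerical sketch (function names are ours), the CRLB p(1 − p)/n and Var(X̄) can be tabulated side by side to confirm that they coincide:

    # Example 8.5.3: the Fisher information per observation is 1/(p(1-p)), so
    # the CRLB for unbiased estimates of p is p(1-p)/n, which Var(Xbar) attains.
    def crlb(p, n):
        fisher_per_obs = 1.0 / (p * (1 - p))
        return 1.0 / (n * fisher_per_obs)

    def var_xbar(p, n):
        return p * (1 - p) / n

    for p in (0.1, 0.5, 0.9):
        print(p, crlb(p, n=20), var_xbar(p, n=20))   # the two columns coincide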
Example 8.5.4:
Let X_1, ..., X_n be iid U(0, θ), Θ = (0, ∞) ⊆ IR. It is
f_θ(x) = (1/θ) I_(0,θ)(x)
⇒ log f_θ(x) = − log θ
⇒ ( ∂ log f_θ(x)/∂θ )² = 1/θ²
⇒ E_θ[ ( ∂ log f_θ(X)/∂θ )² ] = 1/θ².
Thus, the CRLB is θ²/n.
We know that (n+1)/n · X_(n) is the UMVUE since it is a function of a complete sufficient statistic X_(n) (see Homework) and E(X_(n)) = nθ/(n+1). It is
Var( (n+1)/n · X_(n) ) = θ² / ( n(n+2) ) < θ²/n  ???
How is this possible? Quite simple, since one of the required conditions for Theorem 8.5.1 does not hold: the support of X depends on θ.
Theorem 8.5.5: Chapman, Robbins, Kiefer Inequality (CRK Inequality)
Let Θ ⊆ IR. Let {f_θ : θ ∈ Θ} be a family of pdfs (or pmfs) and let T be an unbiased estimate of ψ(θ) with E_θ(T²) < ∞.
For θ ∈ Θ let S(θ) = {x : f_θ(x) > 0} be the support of f_θ. If δ ≠ θ, δ, θ ∈ Θ, and S(δ) ⊆ S(θ), then f_θ(x) > 0 whenever f_δ(x) > 0.
Then it holds that
Var_θ(T(X)) ≥ sup_{δ : S(δ) ⊆ S(θ), δ ≠ θ} (ψ(δ) − ψ(θ))² / Var_θ( f_δ(X)/f_θ(X) ).
Proof:
Since T is unbiased, it follows
E_θ(T(X)) = ψ(θ)  ∀θ ∈ Θ.
For δ ≠ θ and S(δ) ⊆ S(θ), it follows
∫_{S(θ)} T(x) · ( f_δ(x) − f_θ(x) )/f_θ(x) · f_θ(x) dx = E_δ(T(X)) − E_θ(T(X)) = ψ(δ) − ψ(θ)
and
0 = ∫_{S(θ)} ( f_δ(x) − f_θ(x) )/f_θ(x) · f_θ(x) dx = E_θ( f_δ(X)/f_θ(X) − 1 ).
Therefore
Cov_θ( T(X), f_δ(X)/f_θ(X) − 1 ) = ψ(δ) − ψ(θ).
It follows by the Cauchy-Schwarz Inequality (Theorem 4.5.7 (ii)) that
(ψ(δ) − ψ(θ))² = ( Cov_θ( T(X), f_δ(X)/f_θ(X) − 1 ) )²
≤ Var_θ(T(X)) · Var_θ( f_δ(X)/f_θ(X) − 1 )
= Var_θ(T(X)) · Var_θ( f_δ(X)/f_θ(X) ).
Thus,
Var_θ(T(X)) ≥ (ψ(δ) − ψ(θ))² / Var_θ( f_δ(X)/f_θ(X) ).
Finally, we take the supremum of the right-hand side with respect to {δ : S(δ) ⊆ S(θ), δ ≠ θ}, which completes the proof.
Note:
(i) The CRK inequality holds without the previous regularity conditions.
(ii) An alternative form of the CRK inequality is:
Let θ, θ + δ ∈ Θ, δ ≠ 0, be distinct with S(θ + δ) ⊆ S(θ). Let ψ(θ) = θ. Define
J = J(θ, δ) = (1/δ²) E_θ[ ( f_{θ+δ}(X)/f_θ(X) )² − 1 ].
Then the CRK inequality reads as
Var_θ(T(X)) ≥ 1 / inf_δ (J)
with the infimum taken over δ ≠ 0 such that S(θ + δ) ⊆ S(θ).
(iii) The CRK inequality works for discrete Θ; the CRLB does not work in such cases.
Example 8.5.6:
Let X ~ U(0, θ), θ > 0. The required conditions for the CRLB are not met. Recall from Example 8.5.4 that (n+1)/n · X_(n) is UMVUE with Var( (n+1)/n · X_(n) ) = θ²/( n(n+2) ) < θ²/n = CRLB.
Let ψ(θ) = θ. If δ < θ, then S(δ) ⊆ S(θ). It is
E_θ[ ( f_δ(X)/f_θ(X) )² ] = ∫_0^δ (θ/δ)² (1/θ) dx = θ/δ
and
E_θ( f_δ(X)/f_θ(X) ) = ∫_0^δ (θ/δ)(1/θ) dx = 1
⇒ Var_θ( f_δ(X)/f_θ(X) ) = θ/δ − 1 = (θ − δ)/δ
⇒ Var_θ(T(X)) ≥ sup_{δ : 0 < δ < θ} (δ − θ)² / ( (θ − δ)/δ ) = sup_{δ : 0 < δ < θ} δ(θ − δ) (∗)= θ²/4.
See Homework for a proof of (∗).
Now, assume that n = 1. Thus, (n+1)/n · X_(n) = 2X. Since X is complete and sufficient and 2X is unbiased for θ, T(X) = 2X is the UMVUE. It is
Var_θ(2X) = 4 Var_θ(X) = 4 θ²/12 = θ²/3 > θ²/4.
Since the CRK lower bound is not achieved by the UMVUE, it is not achieved by any unbiased estimate of θ.
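A quick simulation (an illustrative sketch; function names are ours) confirms that for n = 1 the variance of the UMVUE 2X sits at θ²/3, strictly above the CRK bound θ²/4:

    # Example 8.5.6 with n = 1: T(X) = 2X is the UMVUE of theta, its variance
    # is theta^2/3, and the CRK lower bound theta^2/4 is not attained.
    import random

    def var_2x(theta, reps=200_000, seed=3):
        random.seed(seed)
        vals = [2 * random.uniform(0, theta) for _ in range(reps)]
        m = sum(vals) / reps
        return sum((v - m) ** 2 for v in vals) / reps

    theta = 2.0
    print(var_2x(theta))        # approx theta^2/3 = 1.333
    print(theta ** 2 / 3)       # exact variance of 2X
    print(theta ** 2 / 4)       # CRK lower bound, strictly smaller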
Definition 8.5.7:
Let T_1, T_2 be unbiased estimates of θ with E_θ(T_1²) < ∞ and E_θ(T_2²) < ∞. We define the efficiency of T_1 relative to T_2 by
eff_θ(T_1, T_2) = Var_θ(T_1) / Var_θ(T_2)
and say that T_1 is more efficient than T_2 if eff_θ(T_1, T_2) < 1.
Definition 8.5.8:
Assume the regularity conditions of Theorem 8.5.1 are satisfied by a family of cdfs {F_θ : θ ∈ Θ}. An unbiased estimate T of θ is called most efficient for this family if
Var_θ(T) = [ E_θ( ( ∂ log f_θ(X)/∂θ )² ) ]^(−1).
Definition 8.5.9:
Let T be the most efficient estimate for the family of cdfs {F_θ : θ ∈ Θ}. Then the efficiency of an unbiased estimate T_1 of θ is defined as
eff_θ(T_1) = eff_θ(T_1, T) = Var_θ(T_1) / Var_θ(T).
Definition 8.5.10:
T_1 is asymptotically (most) efficient if T_1 is asymptotically unbiased, i.e., lim_{n→∞} E_θ(T_1) = θ, and lim_{n→∞} eff_θ(T_1) = 1, where n is the sample size.
Theorem 8.5.11:
A necessary and sufficient condition for an estimate T of θ to be most efficient is that T is sufficient and
(1/K(θ)) (T(x) − θ) = ∂ log f_θ(x)/∂θ   (∗),
where K(θ) is defined as in Theorem 8.5.1 and the regularity conditions for Theorem 8.5.1 hold.
Proof:
=:
Theorem 8.5.1 says that if T is most ecient, then () holds.
Assume that = IR. We dene
C(
0
) =
_
0
1
K()
d, (
0
) =
_
0
K()
d, and (x) = lim
log f
(x) c(x).
73
Integrating () with respect to gives
_
0
1
K()
T(x)d
_
0
K()
d =
_
0
log f
(x)
d
= T(x)C(
0
) (
0
) = log f
(x)[
+c(x)
= T(x)C(
0
) (
0
) = log f
0
(x) lim
log f
(x) +c(x)
= T(x)C(
0
) (
0
) = log f
0
(x) (x)
Therefore,
f
0
(x) = exp(T(x)C(
0
) (
0
) +(x))
which belongs to an exponential family. Thus, T is sucient.
=:
From (), we get
E
_
_
log f
(X)
_
2
_
=
1
(K())
2
V ar
(T(X)).
Additionally, it holds
E
_
(T(X) )
log f
(X)
_
= 1
as shown in the Proof of Theorem 8.5.1.
Using () in the line above, we get
K()E
_
_
log f
(X)
_
2
_
= 1,
i.e.,
K() =
_
E
_
_
log f
(X)
_
2
__
1
.
Therefore,
V ar
(T(X)) =
_
E
_
_
log f
(X)
_
2
__
1
,
i.e., T is most ecient for .
Note:
Instead of saying a necessary and sucient condition for an estimate T of to be most
ecient ... in the previous Theorem, we could say that an estimate T of is most ecient
i ..., i.e., necessary and sucient means the same as i.
A is necessary for B means: B A (because A B)
A is sucient for B means: A B
74
8.6 The Method of Moments
(Based on Casella/Berger, Section 7.2.1)
Definition 8.6.1:
Let X_1, ..., X_n be iid with pdf (or pmf) f_θ, θ ∈ Θ, and suppose that θ can be written as a function h of the first k population moments m_j = E_θ(X_1^j), i.e., θ = h(m_1, ..., m_k). Then the method of moments estimate (mom) of θ is
θ̂_mom = T(X_1, ..., X_n) = h( (1/n) Σ_{i=1}^n X_i, (1/n) Σ_{i=1}^n X_i², ..., (1/n) Σ_{i=1}^n X_i^k ).
Note:
(i) The Definition above can also be used to estimate joint moments. For example, we use (1/n) Σ_{i=1}^n X_i Y_i to estimate E(XY).
(ii) Since E( (1/n) Σ_{i=1}^n X_i^j ) = m_j, method of moment estimates are unbiased for the population moments. The WLLN and the CLT say that these estimates are consistent and asymptotically Normal as well.
(iii) If θ is not a linear function of the population moments, θ̂_mom will, in general, not be unbiased. However, it will be consistent and (usually) asymptotically Normal.
(iv) Method of moments estimates do not exist if the related moments do not exist.
(v) Method of moments estimates may not be unique. If there exist multiple choices for the mom, one usually takes the estimate involving the lowest-order sample moment.
(vi) Alternative method of moment estimates can be obtained from central moments (rather than from raw moments) or by using moments other than the first k moments.
Example 8.6.2:
Let X_1, ..., X_n be iid N(μ, σ²).
Since μ = m_1, it is μ̂_mom = X̄. This is an unbiased, consistent and asymptotically Normal estimate.
Since σ = √(m_2 − m_1²), it is σ̂_mom = √( (1/n) Σ_{i=1}^n X_i² − X̄² ). This is a consistent, asymptotically Normal estimate. However, it is not unbiased.
Example 8.6.3:
Let X_1, ..., X_n be iid Poisson(λ).
We know that E(X_1) = Var(X_1) = λ.
Thus, X̄ and (1/n) Σ_{i=1}^n (X_i − X̄)² are possible choices for the mom of λ. Due to part (v) of the Note above, one uses λ̂_mom = X̄.
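For concreteness, the two moment estimates from Example 8.6.2 can be computed directly from a data vector. The following sketch uses a hypothetical sample (data values and function name are ours):

    # Method of moments estimates for a N(mu, sigma^2) sample (Example 8.6.2):
    # mu_mom = sample mean, sigma_mom = sqrt(second sample moment - mean^2).
    from math import sqrt

    def mom_normal(x):
        n = len(x)
        m1 = sum(x) / n
        m2 = sum(v * v for v in x) / n
        return m1, sqrt(m2 - m1 * m1)

    data = [4.9, 5.6, 4.4, 5.1, 5.8, 4.7, 5.3, 5.2]   # hypothetical sample
    print(mom_normal(data))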
8.7 Maximum Likelihood Estimation
(Based on Casella/Berger, Section 7.2.2)
Definition 8.7.1:
Let (X_1, ..., X_n) be an n-rv with pdf (or pmf) f_θ(x_1, ..., x_n), θ ∈ Θ. We call the function
L(θ; x_1, ..., x_n) = f_θ(x_1, ..., x_n)
of θ the likelihood function.
Note:
(i) Often θ is a vector of parameters.
(ii) If (X_1, ..., X_n) are iid with pdf (or pmf) f_θ, then L(θ; x_1, ..., x_n) = Π_{i=1}^n f_θ(x_i).
Definition 8.7.2:
A maximum likelihood estimate (MLE) is a non-constant estimate θ̂_ML such that
L(θ̂_ML; x_1, ..., x_n) = sup_{θ∈Θ} L(θ; x_1, ..., x_n).
Note:
It is often convenient to work with log L when determining the maximum likelihood estimate. Since the log is monotone, the maximum is the same.
Example 8.7.3:
Let X_1, ..., X_n be iid N(μ, σ²), where μ and σ² are unknown.
L(μ, σ²; x_1, ..., x_n) = 1/( σ^n (2π)^(n/2) ) · exp( − Σ_{i=1}^n (x_i − μ)² / (2σ²) )
⇒ log L(μ, σ²; x_1, ..., x_n) = −(n/2) log σ² − (n/2) log(2π) − Σ_{i=1}^n (x_i − μ)² / (2σ²).
The MLE must satisfy
∂ log L/∂μ = (1/σ²) Σ_{i=1}^n (x_i − μ) = 0   (A)
∂ log L/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (x_i − μ)² = 0   (B)
These are the two likelihood equations. From equation (A) we get μ̂_ML = X̄. Substituting this for μ into equation (B) and solving for σ², we get σ̂²_ML = (1/n) Σ_{i=1}^n (X_i − X̄)². Note that σ̂²_ML is biased for σ².
Formally, we still have to verify that we found a maximum (and not a minimum) and that there is no parameter at the edge of the parameter space at which the likelihood function takes its absolute maximum, since such a maximum would not be detectable by our approach for local extrema.
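As an informal check (a sketch with hypothetical data; the helper names are ours), the closed-form MLEs can be compared with a crude grid search over (μ, σ²), which lands on essentially the same values:

    # Example 8.7.3: mu = xbar and sigma^2 = (1/n)*sum((x - xbar)^2) maximize
    # log L; a coarse grid search over (mu, sigma^2) agrees.
    from math import log, pi

    def loglik(mu, s2, x):
        n = len(x)
        return -0.5 * n * log(2 * pi * s2) - sum((v - mu) ** 2 for v in x) / (2 * s2)

    x = [1.2, 0.7, 2.1, 1.6, 0.9, 1.4]                 # hypothetical data
    mu_hat = sum(x) / len(x)
    s2_hat = sum((v - mu_hat) ** 2 for v in x) / len(x)

    grid = [(m / 100, s / 100) for m in range(50, 251) for s in range(5, 151)]
    best = max(grid, key=lambda p: loglik(p[0], p[1], x))
    print(mu_hat, s2_hat)   # closed-form MLE
    print(best)             # grid maximizer, close to the values above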
Example 8.7.4:
Let X_1, ..., X_n be iid U(θ − 1/2, θ + 1/2).
L(θ; x_1, ..., x_n) = { 1, if θ − 1/2 ≤ x_i ≤ θ + 1/2 ∀i = 1, ..., n
                      { 0, otherwise
[Figure (Example 8.7.4): the likelihood L(θ; x), plotted as a function of θ, equals 1 on the interval from X_(n) − 1/2 to X_(1) + 1/2 and equals 0 outside of it.]
Therefore, any θ̂(X) such that max(X_i) − 1/2 ≤ θ̂(X) ≤ min(X_i) + 1/2 is an MLE. Obviously, the MLE is not unique.
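The interval of MLEs can be computed directly from data. A small simulation sketch (names are ours) also shows that the true θ always lies inside this interval:

    # Example 8.7.4: every value in [max(x) - 1/2, min(x) + 1/2] maximizes the
    # likelihood, so the MLE is a whole interval.
    import random

    random.seed(4)
    theta = 3.0
    x = [random.uniform(theta - 0.5, theta + 0.5) for _ in range(10)]
    lower, upper = max(x) - 0.5, min(x) + 0.5
    print(lower, upper)               # any theta-hat in this interval is an MLE
    print(lower <= theta <= upper)    # always True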
Example 8.7.5:
Let X ~ Bin(1, p), p ∈ [1/4, 3/4].
L(p; x) = p^x (1 − p)^(1−x) = { p, if x = 1
                               { 1 − p, if x = 0
This is maximized by
p̂ = { 3/4, if x = 1
    { 1/4, if x = 0
  = (2x + 1)/4.
It is
E_p(p̂) = (3/4) p + (1/4)(1 − p) = (1/2) p + 1/4
MSE_p(p̂) = E_p( (p̂ − p)² )
= E_p( ( (2X + 1)/4 − p )² )
= (1/16) E_p( (2X + 1 − 4p)² )
= (1/16) E_p( 4X² + 2·2X − 2·8pX − 2·4p + 1 + 16p² )
= (1/16) ( 4( p(1 − p) + p² ) + 4p − 16p² − 8p + 1 + 16p² )
= 1/16.
So p̂ is biased with MSE_p(p̂) = 1/16. If we compare this with p̃ = 1/2 regardless of the data, we have
MSE_p(1/2) = E_p( (1/2 − p)² ) = (1/2 − p)² ≤ 1/16  ∀p ∈ [1/4, 3/4].
Thus, in this example the MLE is worse than the trivial estimate when comparing their MSEs.
Theorem 8.7.6:
Let T be a sufficient statistic for {f_θ : θ ∈ Θ}. If a unique MLE of θ exists, then it is a function of T.
Proof:
Since T is sufficient, it is f_θ(x) = h(x) g_θ(T(x)) due to the Factorization Criterion (Theorem 8.3.5). Maximizing the likelihood function with respect to θ takes h(x) as a constant and therefore is equivalent to maximizing g_θ(T(x)) with respect to θ. But g_θ(T(x)) depends on x only through T(x), so the maximizing θ is a function of T.
Theorem 8.7.7:
Suppose the regularity conditions of Theorem 8.5.1 hold and a most efficient estimate θ̂ of θ exists. Then θ̂ satisfies the likelihood equation and is an MLE of θ.
Proof:
By Theorem 8.5.11, a most efficient estimate θ̂ satisfies
∂ log f_θ(X)/∂θ = (1/K(θ)) (θ̂(X) − θ)   w.p. 1.
Thus, θ̂ satisfies the likelihood equations.
We define A(θ) = 1/K(θ). Then it follows
∂² log f_θ(X)/∂θ² = A'(θ) (θ̂(X) − θ) − A(θ).
The Proof of Theorem 8.5.11 gives us
A(θ) = E_θ[ ( ∂ log f_θ(X)/∂θ )² ] > 0.
So
∂² log f_θ(X)/∂θ² |_{θ=θ̂} = −A(θ̂) < 0,
i.e., log f_θ(x) has a maximum at θ = θ̂ and θ̂ is an MLE.
Theorem 8.7.8:
Let θ̂ be an MLE of θ and let h map Θ onto Λ. Then h(θ̂) is an MLE of h(θ).
Proof:
For each λ ∈ Λ, we define
Θ_λ = {θ : θ ∈ Θ, h(θ) = λ}
and
M(λ; x) = sup_{θ ∈ Θ_λ} L(θ; x),
the likelihood function induced by h.
Let θ̂ be an MLE and a member of Θ_{λ̂}, where λ̂ = h(θ̂). It holds
M(λ̂; x) = sup_{θ ∈ Θ_{λ̂}} L(θ; x) ≥ L(θ̂; x),
but also
M(λ̂; x) ≤ sup_{λ ∈ Λ} M(λ; x) = sup_{λ ∈ Λ} ( sup_{θ ∈ Θ_λ} L(θ; x) ) = sup_{θ ∈ Θ} L(θ; x) = L(θ̂; x).
Therefore,
M(λ̂; x) = L(θ̂; x) = sup_{λ ∈ Λ} M(λ; x).
Thus, λ̂ = h(θ̂) is an MLE.
Example 8.7.9:
Let X_1, ..., X_n be iid Bin(1, p). Let h(p) = p(1 − p).
Since the MLE of p is p̂ = X̄, the MLE of h(p) is h(p̂) = X̄(1 − X̄).
Theorem 8.7.10:
Consider the following conditions a pdf f
can fulll:
(i)
log f
,
2
log f
2
,
3
log f
3
exist for all for all x. Also,
_
(x)
dx = E
_
log f
(X)
_
= 0 .
(ii)
_
2
f
(x)
2
dx = 0 .
(iii) <
_
2
log f
(x)
2
f
(x)dx < 0 .
81
(iv) There exists a function H(x) such that for all :
3
log f
(x)
H(x)f
2
_
g()
log f
(x)
H(x)f
n
of the
likelihood equation is asymptotically Normal, i.e.,
n
)
d
Z
where Z N(0, 1) and
2
=
_
E
_
_
log f
(X)
_
2
__
1
.
(iii) (Kulldorf) Conditions (i), (iii), and (v) imply that, with probability approaching 1, as
n , the likelihood equation has a consistent solution.
(iv) (Kulldorf) Conditions (i), (ii), (iii), and (v) imply that a consistent solution
n
of the
likelihood equation is asymptotically Normal.
Note:
In case of a pmf f_θ, analogous results hold, with the integrals replaced by sums.
8.8 Decision Theory - Bayes and Minimax Estimation
Let A be the set of possible actions (or decisions) that are open to the statistician in a given situation, e.g.,
A = {reject H_0, do not reject H_0} (Hypothesis testing, see Chapter 9)
A = {artefact found is of Greek, Roman origin} (Classification)
A = Θ (Estimation)
Definition 8.8.1:
A decision function d is a statistic, i.e., a Borel-measurable function, that maps IR^n into A. If X = x is observed, the statistician takes action d(x) ∈ A.
Note:
For the remainder of this Section, we are restricting ourselves to A = Θ, i.e., we are facing the problem of estimation.
Definition 8.8.2:
A nonnegative function L that maps Θ × A into IR is called a loss function. The value L(θ, a) is the loss incurred to the statistician if he/she takes action a when θ is the true parameter value.
Definition 8.8.3:
Let D be a class of decision functions that map IR^n into A. Let L be a loss function on Θ × A. The function R that maps Θ × D into IR is defined as
R(θ, d) = E_θ( L(θ, d(X)) )
and is called the risk function of d at θ.
Example 8.8.4:
Let A = IR. Let L(θ, a) = (θ − a)². Then it holds that
R(θ, d) = E_θ( L(θ, d(X)) ) = E_θ( (θ − d(X))² ) = E_θ( (θ − θ̂)² ).
Note that this is just the MSE. If θ̂ is unbiased, this would just be Var(θ̂).
Note:
The basic problem of decision theory is that we would like to find a decision function d ∈ D such that R(θ, d) is minimized for all θ ∈ Θ. Unfortunately, this is usually not possible.
Definition 8.8.5:
The minimax principle is to choose the decision function d* ∈ D such that
max_{θ∈Θ} R(θ, d*) ≤ max_{θ∈Θ} R(θ, d)  ∀d ∈ D.
Note:
If the problem of interest is an estimation problem, we call a d* that satisfies the minimax principle a minimax estimate of θ.
If θ itself is regarded as a random variable with an a priori distribution π on Θ, the Bayes risk of a decision function d is defined as
R(π, d) = E_π( R(θ, d) ),
where π is the a priori distribution.
Note:
If θ is a continuous rv and X is of continuous type, then
R(π, d) = E_π( R(θ, d) )
= ∫_Θ R(θ, d) π(θ) dθ
= ∫_Θ E_θ( L(θ, d(X)) ) π(θ) dθ
= ∫_Θ ( ∫ L(θ, d(x)) f(x | θ) dx ) π(θ) dθ
= ∫_Θ ∫ L(θ, d(x)) f(x | θ) π(θ) dx dθ
= ∫_Θ ∫ L(θ, d(x)) f(x, θ) dx dθ
= ∫ g(x) ( ∫_Θ L(θ, d(x)) h(θ | x) dθ ) dx
Similar expressions can be written if θ and/or X are discrete.
Definition 8.8.9:
A decision function d* is called a Bayes rule (or Bayes decision function) with respect to the a priori distribution π if
R(π, d*) = inf_{d∈D} R(π, d).
Theorem 8.8.10:
Let A = IR. Let L(θ, d(x)) = (θ − d(x))². In this case, a Bayes rule is
d(x) = E(θ | X = x).
Proof:
Minimizing
R(π, d) = ∫ g(x) ( ∫_Θ (θ − d(x))² h(θ | x) dθ ) dx,
where g is the marginal pdf of X and h is the conditional pdf of θ given x, is the same as minimizing
∫_Θ (θ − d(x))² h(θ | x) dθ.
However, this is minimized when d(x) = E(θ | X = x) as shown in Stat 6710, Homework 3, Question (ii), for the unconditional case.
Note:
Under the conditions of Theorem 8.8.10, d(x) = E(θ | X = x) is called the Bayes estimate.
Example 8.8.11:
Let X ~ Bin(n, p). Let L(p, d(x)) = (p − d(x))².
Let π(p) = 1 ∀p ∈ (0, 1), i.e., p ~ U(0, 1), be the a priori distribution of p.
Then it holds:
f(x, p) = (n choose x) p^x (1 − p)^(n−x)
g(x) = ∫ f(x, p) dp = ∫_0^1 (n choose x) p^x (1 − p)^(n−x) dp
h(p | x) = f(x, p)/g(x)
= (n choose x) p^x (1 − p)^(n−x) / ∫_0^1 (n choose x) p^x (1 − p)^(n−x) dp
= p^x (1 − p)^(n−x) / ∫_0^1 p^x (1 − p)^(n−x) dp
E(p | x) = ∫_0^1 p h(p | x) dp
= ∫_0^1 p · p^x (1 − p)^(n−x) dp / ∫_0^1 p^x (1 − p)^(n−x) dp
= ∫_0^1 p^(x+1) (1 − p)^(n−x) dp / ∫_0^1 p^x (1 − p)^(n−x) dp
= B(x + 2, n − x + 1) / B(x + 1, n − x + 1)
= [ Γ(x + 2) Γ(n − x + 1) / Γ(x + 2 + n − x + 1) ] / [ Γ(x + 1) Γ(n − x + 1) / Γ(x + 1 + n − x + 1) ]
= (x + 1)/(n + 2)
Thus, by Theorem 8.8.10, the Bayes rule is
p̂_Bayes = d*(X) = (X + 1)/(n + 2).
The Bayes risk of d*(X) is
R(π, d*(X)) = E_π( R(p, d*(X)) )
= ∫_0^1 π(p) R(p, d*(X)) dp
= ∫_0^1 π(p) E_p( L(p, d*(X)) ) dp
= ∫_0^1 π(p) E_p( (p − d*(X))² ) dp
= ∫_0^1 E_p( ( (X + 1)/(n + 2) − p )² ) dp
= ∫_0^1 E_p( ( (X + 1)/(n + 2) )² − 2p (X + 1)/(n + 2) + p² ) dp
= 1/(n + 2)² ∫_0^1 E_p( (X + 1)² − 2p(n + 2)(X + 1) + p²(n + 2)² ) dp
= 1/(n + 2)² ∫_0^1 E_p( X² + 2X + 1 − 2p(n + 2)(X + 1) + p²(n + 2)² ) dp
= 1/(n + 2)² ∫_0^1 ( np(1 − p) + (np)² + 2np + 1 − 2p(n + 2)(np + 1) + p²(n + 2)² ) dp
= 1/(n + 2)² ∫_0^1 ( np − np² + n²p² + 2np + 1 − 2n²p² − 2np − 4np² − 4p + n²p² + 4np² + 4p² ) dp
= 1/(n + 2)² ∫_0^1 ( 1 − 4p + np − np² + 4p² ) dp
= 1/(n + 2)² ∫_0^1 ( 1 + (n − 4)p + (4 − n)p² ) dp
= 1/(n + 2)² [ p + ((n − 4)/2) p² + ((4 − n)/3) p³ ]_0^1
= 1/(n + 2)² ( 1 + (n − 4)/2 + (4 − n)/3 )
= 1/(n + 2)² · (6 + 3n − 12 + 8 − 2n)/6
= 1/(n + 2)² · (n + 2)/6
= 1/( 6(n + 2) )
Now we compare the Bayes rule d*(X) with the unbiased estimate X/n, whose risk is R(p, X/n) = Var_p(X/n) = p(1 − p)/n:
R(π, X/n) = ∫_0^1 ( p(1 − p)/n ) dp = 1/(6n),
which is, as expected, larger than the Bayes risk of d*(X).
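Both Bayes risks can be recomputed by numerical integration over the uniform prior. The following sketch (function names are ours) reproduces 1/(6(n+2)) for the Bayes rule and 1/(6n) for X/n:

    # Example 8.8.11: under a U(0,1) prior and squared error loss, the Bayes
    # estimate (X+1)/(n+2) has Bayes risk 1/(6(n+2)); X/n has Bayes risk 1/(6n).
    from math import comb

    def bayes_risk(est, n, grid=10_000):
        """Integrate E_p[(est(X) - p)^2] over the U(0,1) prior on p."""
        total = 0.0
        for i in range(grid):
            p = (i + 0.5) / grid
            risk = sum(comb(n, x) * p**x * (1 - p)**(n - x) * (est(x, n) - p)**2
                       for x in range(n + 1))
            total += risk / grid
        return total

    n = 10
    print(bayes_risk(lambda x, n: (x + 1) / (n + 2), n), 1 / (6 * (n + 2)))
    print(bayes_risk(lambda x, n: x / n, n), 1 / (6 * n))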
Theorem 8.8.12:
Let {f_θ : θ ∈ Θ} be a family of pdfs (or pmfs). Suppose that an estimate d* of θ is a Bayes estimate corresponding to some prior distribution π on Θ. If the risk function R(θ, d*) is constant on Θ, then d* is a minimax estimate of θ.
Proof:
Homework.
Definition 8.8.13:
Let F denote the class of pdfs (or pmfs) {f_θ : θ ∈ Θ}, Θ ⊆ IR^k, where the functional form of the members of F is known except for the parameter θ. A parametric hypothesis is an assertion about the unknown parameter θ; we test a null hypothesis H_0 : θ ∈ Θ_0 ⊆ Θ against the alternative H_1 : θ ∈ Θ_1 = Θ − Θ_0.
9 Hypothesis Testing
9.1 Fundamental Notions
Definition 9.1.4:
Let X ~ F_θ, θ ∈ Θ. Let C be a subset of IR^n such that, if x ∈ C, then H_0 is rejected (with probability 1), i.e.,
C = {x ∈ IR^n : H_0 is rejected for this x}.
The set C is called the critical region.
Definition 9.1.5:
If we reject H_0 when it is true, we call this a Type I error. If we fail to reject H_0 when it is false, we call this a Type II error. Usually, H_0 and H_1 are chosen such that the Type I error is considered more serious.
Example 9.1.6:
We first consider a non-statistical example, in this case a jury trial. Our hypotheses are that the defendant is innocent or guilty. Our possible decisions are guilty or not guilty. Since it is considered worse to punish the innocent than to let the guilty go free, we make innocence the null hypothesis. Thus, we have

                            Truth (unknown)
Decision (known)      Innocent (H_0)     Guilty (H_1)
Not Guilty (H_0)      Correct            Type II Error
Guilty (H_1)          Type I Error       Correct

The jury tries to make a decision "beyond a reasonable doubt", i.e., it tries to make the probability of a Type I error small.
Definition 9.1.7:
If C is the critical region, then P_θ(C), θ ∈ Θ_0, is a probability of Type I error, and P_θ(C^c), θ ∈ Θ_1, is a probability of Type II error.
Note:
We would like both error probabilities to be 0, but this is usually not possible. We usually settle for fixing the probability of Type I error to be small, e.g., 0.05 or 0.01, and minimizing the Type II error.
Definition 9.1.8:
Every Borel-measurable mapping φ of IR^n → [0, 1] is called a test function. φ(x) is the probability of rejecting H_0 when x is observed.
If φ is the indicator function of a subset C ⊆ IR^n, φ is called a nonrandomized test and C is the critical region of this test function.
Otherwise, if φ is not an indicator function of a subset C ⊆ IR^n, φ is called a randomized test.
Definition 9.1.9:
Let φ be a test function of the hypothesis H_0 : θ ∈ Θ_0 against the alternative H_1 : θ ∈ Θ_1. We say that φ has a level of significance of α (or φ is a level-α test or φ is of size α) if
E_θ(φ(X)) = P_θ(reject H_0) ≤ α  ∀θ ∈ Θ_0.
In short, we say that φ is a test for the problem (α, Θ_0, Θ_1).
Definition 9.1.10:
Let φ be a test for the problem (α, Θ_0, Θ_1). For every θ ∈ Θ, we define
β_φ(θ) = E_θ(φ(X)) = P_θ(reject H_0).
We call β_φ(θ), θ ∈ Θ_1, the power of φ against the alternative θ.
Definition 9.1.12:
Let Φ_α be the class of all tests for the problem (α, Θ_0, Θ_1). A test φ_0 ∈ Φ_α is called a uniformly most powerful (UMP) test if
β_{φ_0}(θ) ≥ β_φ(θ)  ∀θ ∈ Θ_1 and ∀φ ∈ Φ_α.
Example 9.1.13:
Let X_1, ..., X_n be iid N(μ, 1), Θ = {μ_0, μ_1}, μ_0 < μ_1.
Let H_0 : X_i ~ N(μ_0, 1) vs. H_1 : X_i ~ N(μ_1, 1).
Intuitively, reject H_0 when X̄ is too large, i.e., if X̄ ≥ k.
Under H_0 it holds that X̄ ~ N(μ_0, 1/n).
For a given α, we can solve the following equation for k:
P_{μ_0}(X̄ > k) = P( (X̄ − μ_0)/(1/√n) > (k − μ_0)/(1/√n) ) = P(Z > z_α) = α
Here, (X̄ − μ_0)/(1/√n) = Z ~ N(0, 1) and z_α satisfies P(Z > z_α) = α, i.e., z_α is the upper α quantile of the N(0, 1) distribution. It follows that
(k − μ_0)/(1/√n) = z_α
and therefore,
k = μ_0 + z_α/√n.
Thus, we obtain the nonrandomized test
φ(x) = { 1, if x̄ > μ_0 + z_α/√n
        { 0, otherwise
It has power
β_φ(μ_1) = P_{μ_1}( X̄ > μ_0 + z_α/√n )
= P( (X̄ − μ_1)/(1/√n) > (μ_0 − μ_1)√n + z_α )
= P( Z > z_α − √n(μ_1 − μ_0) )
> α,
since √n(μ_1 − μ_0) > 0. The probability of a Type II error is
P(Type II error) = 1 − β_φ(μ_1).
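Size and power of this test are easy to evaluate numerically with only the standard normal cdf. The following sketch (names are ours; z_α is plugged in for α = 0.05) illustrates the calculation:

    # Example 9.1.13: size and power of the test that rejects H0 when
    # xbar > mu0 + z_alpha/sqrt(n).
    from math import erf, sqrt

    def Phi(z):
        """Standard normal cdf."""
        return 0.5 * (1 + erf(z / sqrt(2)))

    def power(mu, mu0, n, z_alpha):
        crit = mu0 + z_alpha / sqrt(n)
        return 1 - Phi((crit - mu) * sqrt(n))

    mu0, mu1, n, z_alpha = 0.0, 0.5, 25, 1.645   # z_alpha for alpha = 0.05
    print(power(mu0, mu0, n, z_alpha))           # size, approx 0.05
    print(power(mu1, mu0, n, z_alpha))           # power at mu1, well above alpha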
Example 9.1.14:
Let X ~ Bin(6, p), p ∈ Θ = (0, 1).
H_0 : p = 1/2, H_1 : p ≠ 1/2.
Desired level of significance: α = 0.05.
Reasonable plan: Since E_{p=1/2}(X) = 3, reject H_0 when |X − 3| ≥ c for some constant c. But how should we select c?

x       c = |x − 3|    P_{p=1/2}(X = x)    P_{p=1/2}(|X − 3| ≥ c)
0, 6         3              0.015625              0.03125
1, 5         2              0.093750              0.21875
2, 4         1              0.234375              0.68750
3            0              0.312500              1.00000

Thus, there is no nonrandomized test with α = 0.05.
What can we do instead? Three possibilities:
(i) Reject if |X − 3| = 3, i.e., use a nonrandomized test of size α = 0.03125.
(ii) Reject if |X − 3| ≥ 2, i.e., use a nonrandomized test of size α = 0.21875.
(iii) Reject if |X − 3| = 3, do not reject if |X − 3| ≤ 1, and reject with probability
(0.05 − 0.03125)/(2 · 0.093750) = 0.1 if |X − 3| = 2. Thus, we obtain the randomized test
φ(x) = { 1, if x = 0, 6
        { 0.1, if x = 1, 5
        { 0, if x = 2, 3, 4
This test has size
α = E_{p=1/2}(φ(X))
= 1 · 0.015625 · 2 + 0.1 · 0.093750 · 2 + 0 · (0.234375 · 2 + 0.3125)
= 0.05
as intended. The power of φ can be calculated for any p ≠ 1/2 and it is
β_φ(p) = P_p(X = 0 or X = 6) + 0.1 · P_p(X = 1 or X = 5).
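The size and power of the randomized test can be evaluated directly from the Bin(6, p) pmf, as in the following sketch (helper names are ours):

    # Example 9.1.14: size and power of the randomized test phi for X ~ Bin(6, p):
    # reject if |X - 3| = 3, reject with probability 0.1 if |X - 3| = 2.
    from math import comb

    def binom_pmf(x, n, p):
        return comb(n, x) * p**x * (1 - p)**(n - x)

    def beta_phi(p, n=6):
        reject = sum(binom_pmf(x, n, p) for x in (0, 6))
        maybe = sum(binom_pmf(x, n, p) for x in (1, 5))
        return reject + 0.1 * maybe

    print(beta_phi(0.5))    # size = 0.05
    print(beta_phi(0.8))    # power against p = 0.8, approx 0.30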
9.2 The Neyman-Pearson Lemma
(Based on Casella/Berger, Section 8.3.2)
Let {f_θ : θ ∈ Θ = {θ_0, θ_1}} be a family of possible distributions of X. We write f_0(x) = f_{θ_0}(x) and f_1(x) = f_{θ_1}(x).
Theorem 9.2.1: Neyman-Pearson Lemma (NP Lemma)
Suppose we wish to test H_0 : X ~ f_0(x) vs. H_1 : X ~ f_1(x), where f_i is the pdf (or pmf) of X under H_i, i = 0, 1, and where both H_0 and H_1 are simple.
(i) Any test of the form
φ(x) = { 1, if f_1(x) > k f_0(x)
        { γ(x), if f_1(x) = k f_0(x)
        { 0, if f_1(x) < k f_0(x)     (∗)
for some k ≥ 0 and 0 ≤ γ(x) ≤ 1, is most powerful of its significance level for testing H_0 vs. H_1.
If k = ∞, the test
φ(x) = { 1, if f_0(x) = 0
        { 0, if f_0(x) > 0     (∗∗)
is most powerful of size (or significance level) 0 for testing H_0 vs. H_1.
(ii) Given α, 0 ≤ α ≤ 1, there exists a test of the form (∗) or (∗∗) with γ(x) = γ (i.e., a constant) such that
E_{θ_0}(φ(X)) = α.
Proof:
We prove the continuous case only.
(i):
Let be a test satisfying (). Let
0
(
(X)) E
0
((X)).
It holds that
_
((x)
(x))(f
1
(x) kf
0
(x))dx =
__
f
1
>kf
0
((x)
(x))(f
1
(x) kf
0
(x))dx
_
+
__
f
1
<kf
0
((x)
(x))(f
1
(x) kf
0
(x))dx
_
since
_
f
1
=kf
0
((x)
(x))(f
1
(x) kf
0
(x))dx = 0.
96
It is
_
f
1
>kf
0
((x)
(x))(f
1
(x) kf
0
(x))dx =
_
f
1
>kf
0
(1
(x))
. .
0
(f
1
(x) kf
0
(x))
. .
0
dx 0
and
_
f
1
<kf
0
((x)
(x))(f
1
(x) kf
0
(x))dx =
_
f
1
<kf
0
(0
(x))
. .
0
(f
1
(x) kf
0
(x))
. .
0
dx 0.
Therefore,
0
_
((x)
(x))(f
1
(x) kf
0
(x))dx
= E
1
((X)) E
1
(
(X)) k(E
0
((X)) E
0
(
(X))).
Since E
0
((X)) E
0
(
(
1
)
(
1
) k(E
0
((X)) E
0
(
(X))) 0,
i.e., is most powerful.
If k = , any test
(
1
)
(
1
) = E
1
((X)) E
1
(
(X)) =
_
{x|f
0
(x)=0}
(1
(x))f
1
(x)dx 0,
i.e., is most powerful of size 0.
(ii):
If = 0, then use (). Otherwise, assume that 0 < 1 and (x) = . It is
E
0
((X)) = P
0
(f
1
(X) > kf
0
(X)) +P
0
(f
1
(X) = kf
0
(X))
= 1 P
0
(f
1
(X) kf
0
(X)) +P
0
(f
1
(X) = kf
0
(X))
= 1 P
0
_
f
1
(X)
f
0
(X)
k
_
+P
0
_
f
1
(X)
f
0
(X)
= k
_
.
Note that the last step is valid since P
0
(f
0
(X) = 0) = 0.
Therefore, given 0 < 1, we want to nd k and such that E
0
((X)) = , i.e.,
P
0
_
f
1
(X)
f
0
(X)
k
_
P
0
_
f
1
(X)
f
0
(X)
= k
_
= 1 .
Note that
f
1
(X)
f
0
(X)
is a rv and, therefore, P
0
_
f
1
(X)
f
0
(X)
k
_
is a cdf.
If there exists a k
0
such that
P
0
_
f
1
(X)
f
0
(X)
k
0
_
= 1 ,
97
we choose = 0 and k = k
0
.
Otherwise, if there exists no such k
0
, then there exists a k
1
such that
P
0
_
f
1
(X)
f
0
(X)
< k
1
_
1 < P
0
_
f
1
(X)
f
0
(X)
k
1
_
, (+)
i.e., the cdf has a jump at k
1
. In this case, we choose k = k
1
and
=
P
0
_
f
1
(X)
f
0
(X)
k
1
_
(1 )
P
0
_
f
1
(X)
f
0
(X)
= k
1
_ .
[Figure (Theorem 9.2.1): cdf F(x) = P_{θ_0}(f_1(X)/f_0(X) ≤ x) with a jump at k_1; the level 1 − α lies between P_{θ_0}(f_1/f_0 < k_1) and P_{θ_0}(f_1/f_0 ≤ k_1).]
Let us verify that these values for k
1
and meet the necessary conditions:
Obviously,
P
0
_
f
1
(X)
f
0
(X)
k
1
_
0
_
f
1
(X)
f
0
(X)
k
1
_
(1 )
P
0
_
f
1
(X)
f
0
(X)
= k
1
_ P
0
_
f
1
(X)
f
0
(X)
= k
1
_
= 1 .
Also, since
P
0
_
f
1
(X)
f
0
(X)
k
1
_
> (1 ),
it follows that 0 and since
=
P
0
_
f
1
(X)
f
0
(X)
k
1
_
(1 )
P
0
_
f
1
(X)
f
0
(X)
= k
1
_
(+)
0
_
f
1
(X)
f
0
(X)
k
1
_
P
0
_
f
1
(X)
f
0
(X)
< k
1
_
P
0
_
f
1
(X)
f
0
(X)
= k
1
_
=
P
0
_
f
1
(X)
f
0
(X)
= k
1
_
P
0
_
f
1
(X)
f
0
(X)
= k
1
_
= 1,
it follows that 1. Overall, 0 1 as required.
98
Theorem 9.2.2:
If a sucient statistic T exists for the family f
: =
0
,
1
, then the Neyman
Pearson most powerful test is a function of T.
Proof:
Homework
Example 9.2.3:
We want to test H_0 : X ~ N(0, 1) vs. H_1 : X ~ Cauchy(1, 0), based on a single observation. It is
f_1(x)/f_0(x) = [ 1/( π(1 + x²) ) ] / [ (1/√(2π)) exp(−x²/2) ] = √(2/π) · exp(x²/2)/(1 + x²).
The MP test is
φ(x) = { 1, if √(2/π) · exp(x²/2)/(1 + x²) > k
        { 0, otherwise
where k is determined such that E_{H_0}(φ(X)) = α.
If α < 0.113, we reject H_0 if |x| > z_{α/2}, where z_{α/2} is the upper α/2 quantile of a N(0, 1) distribution.
If α > 0.113, we reject H_0 if |x| > k_1 or if |x| < k_2, where k_1 > 0, k_2 > 0, such that
exp(k_1²/2)/(1 + k_1²) = exp(k_2²/2)/(1 + k_2²)
and
∫_{k_2}^{k_1} (1/√(2π)) exp(−x²/2) dx = (1 − α)/2.
[Figure (Example 9.2.3a): the likelihood ratio f_1(x)/f_0(x) as a function of x; it equals about 0.798 at x = 0 and at x = ±1.585 and is smaller in between, with cutoffs ±k_1 and ±k_2 marked.]
[Figure (Example 9.2.3b): the N(0, 1) density f_0(x) with the cutoffs ±1.585; each tail has probability 0.113/2.]
Why is α = 0.113 so interesting?
For x = 0, it is
f_1(x)/f_0(x) = √(2/π) ≈ 0.7979.
Similarly, for x ≈ 1.585 and x ≈ −1.585, it is
f_1(x)/f_0(x) = √(2/π) · exp( (1.585)²/2 )/( 1 + (1.585)² ) ≈ 0.7979 ≈ f_1(0)/f_0(0).
More importantly, P_{H_0}(|X| > 1.585) = 0.113.
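The special role of 0.113 and the two-cutoff case can be verified numerically. The following sketch (all function names are ours; the cutoff search is a simple scan, not an exact solver) checks the ratio values at 0 and 1.585, the tail probability 0.113, and finds approximate (k_1, k_2) for a given α > 0.113:

    # Example 9.2.3: f1/f0 = sqrt(2/pi)*exp(x^2/2)/(1+x^2) takes the same value
    # at x = 0 and |x| = 1.585, and P_{H0}(|X| > 1.585) is about 0.113.
    from math import exp, sqrt, pi, erf

    def ratio(x):
        return sqrt(2 / pi) * exp(x * x / 2) / (1 + x * x)

    def Phi(z):
        return 0.5 * (1 + erf(z / sqrt(2)))

    print(ratio(0.0), ratio(1.585))          # both approx 0.798
    print(2 * (1 - Phi(1.585)))              # approx 0.113

    def cutoffs(alpha, step=1e-4):
        """Scan k2 in (0,1); k1 > 1 solves ratio(k1) = ratio(k2). Return the
        pair whose rejection probability under H0 is closest to alpha."""
        best = None
        k2 = step
        while k2 < 1.0:
            target = ratio(k2)
            k1 = 1.0
            while ratio(k1) < target:
                k1 += step
            a = 1 - 2 * (Phi(k1) - Phi(k2))  # P_{H0}(|X| > k1 or |X| < k2)
            if best is None or abs(a - alpha) < abs(best[0] - alpha):
                best = (a, k1, k2)
            k2 += 0.005
        return best

    print(cutoffs(0.2))                      # (approx alpha, k1, k2)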
9.3 Monotone Likelihood Ratios
(Based on Casella/Berger, Section 8.3.2)
Suppose we want to test H
0
:
0
vs. H
1
: >
0
for a family of pdfs f
: IR.
In general, it is not possible to nd a UMP test. However, there exist conditions under which
UMP tests exist.
Denition 9.3.1:
Let f
1
<
2
, whenever f
1
and f
2
are distinct, the ratio
f
2
(x)
f
1
(x)
is a nondecreasing function of T(x)
for the set of values x for which at least one of f
1
and f
2
is > 0.
Note:
We can also dene families of densities with nonincreasing MLR in T(X), but such families
can be treated by symmetry.
Example 9.3.2:
Let X
1
, . . . , X
n
U[0, ], > 0. Then the joint pdf is
f
(x) =
_
1
n
, 0 x
(n)
0, otherwise
=
1
n
I
[0,]
(x
(n)
),
where x
(n)
= x
max
= max
i=1,...,n
x
i
.
Let
2
>
1
, then
f
2
(x)
f
1
(x)
=
_
2
_
n
I
[0,
2
]
(x
(n)
)
I
[0,
1
]
(x
(n)
)
.
It is
I
[0,
2
]
(x
(n)
)
I
[0,
1
]
(x
(n)
)
=
_
1, x
(n)
[0,
1
]
, x
(n)
(
1
,
2
]
since for x
(n)
[0,
1
], it holds that x
(n)
[0,
2
].
But for x
(n)
(
1
,
2
], it is I
[0,
1
]
(x
(n)
) = 0.
= as T(X) = X
(n)
increases, the density ratio goes from (
2
)
n
to .
=
f
2
f
1
is a nondecreasing function of T(X) = X
(n)
= the family of U[0, ] distributions has a MLR in T(X) = X
(n)
101
Theorem 9.3.3:
The oneparameter exponential family f
(x) =
n
i=1
_
e
x
i
1
x
i
!
_
= e
n
n
i=1
x
i
n
i=1
1
x
i
!
= exp
_
n +
n
i=1
x
i
log()
n
i=1
log(x
i
!)
_
,
which belongs to the oneparameter exponential family.
Since Q() = log() is a nondecreasing function of , it follows by Theorem 9.3.3 that the
Poisson family with parameter > 0 has a MLR in T(X) =
n
i=1
X
i
.
We can verify this result by Denition 9.3.1:
f
2
(x)
f
1
(x)
=
x
i
2
x
i
1
e
n
2
e
n
1
=
_
1
_
x
i
e
n(
2
1
)
.
If
2
>
1
, then
2
1
> 1 and
_
1
_
x
i
is a nondecreasing function of
x
i
.
Therefore, f
i=1
X
i
.
Theorem 9.3.5:
Let X f
1, if T(x) > t
0
, if T(x) = t
0
0, if T(x) < t
0
()
has a nondecreasing power function and is UMP of its size E
0
((X)) = , if the size is not 0.
Also, for every 0 1 and every
0
, there exists a t
0
and a ( t
0
,
0 1), such that the test of form () is the UMP size test of H
0
vs. H
1
.
102
Proof:
=:
Let
1
<
2
,
1
,
2
and suppose E
1
((X)) > 0, i.e., the size is > 0. Since f
has a MLR
in T,
f
2
(x)
f
1
(x)
is a nondecreasing function of T. Therefore, any test of form () is equivalent to
a test
(x) =
1, if
f
2
(x)
f
1
(x)
> k
, if
f
2
(x)
f
1
(x)
= k
0, if
f
2
(x)
f
1
(x)
< k
()
which by the NeymanPearson Lemma (Theorem 9.2.1) is MP of size for testing H
0
: =
1
vs. H
1
: =
2
.
Let
t
has size and power . The power of the MP test of form () must be at least as the
MP test cannot have power less than the trivial test
t
, i.e,
E
2
((X)) E
2
(
t
(X)) = = E
1
((X)).
Thus, for
2
>
1
,
E
2
((X)) E
1
((X)),
i.e., the power function of the test of form () is a nondecreasing function of .
Now let
1
=
0
and
2
>
0
. We know that a test of form () is MP for H
0
: =
0
vs.
H
1
: =
2
>
0
, provided that its size = E
0
((X)) is > 0.
Notice that the test of form () does not depend on
2
. It only depends on t
0
and .
Therefore, the test of form () is MP for all
2
1
. Thus, this test is UMP for simple
H
0
: =
0
vs. composite H
1
: >
0
with size E
0
((X)) =
0
.
Since is UMP for a class
of tests (
) satisfying
E
0
(
(X))
0
,
must also be UMP for the more restrictive class
of tests (
) satisfying
E
(X))
0
0
.
But since the power function of is nondecreasing, it holds for that
E
((X)) E
0
((X)) =
0
0
.
Thus, is the UMP size
0
test of H
0
:
0
vs. H
1
: >
0
if
0
> 0.
103
=:
Use the NeymanPearson Lemma (Theorem 9.2.1).
Note:
By interchanging inequalities throughout Theorem 9.3.5 and its proof, we see that this The-
orem also provides a solution of the dual problem H
0
:
0
vs. H
1
: <
0
.
Theorem: 9.3.6
For the oneparameter exponential family, there exists a UMP twosided test of H
0
:
1
or
2
, (where
1
<
2
) vs. H
1
:
1
< <
2
of the form
(x) =
1, if c
1
< T(x) < c
2
i
, if T(x) = c
i
, i = 1, 2
0, if T(x) < c
1
, or if T(x) > c
2
Note:
UMP tests for H
0
:
1
2
and H
0
: =
0
do not exist for oneparameter exponential
families.
104
9.4 Unbiased and Invariant Tests
(Based on Rohatgi, Section 9.5, Rohatgi/Saleh, Section 9.5 & Casella/Berger,
Section 8.3.2)
If we look at all size tests in the class
by reasonable restrictions?
Denition 9.4.1:
A size test of H
0
:
0
vs H
1
:
1
is unbiased if
E
((X))
1
.
Note:
This condition means that
()
0
and
()
1
. In other words, the
power of this test is never less than .
Denition 9.4.2:
Let U
. A UMP test
will have
1
since we must
compare all tests
, it is
also a UMPU test in U
.
Example 9.4.3:
Let X
1
, . . . , X
n
be iid N(,
2
), where
2
> 0 is known. Consider H
0
: =
0
vs H
1
: ,=
0
.
From the NeymanPearson Lemma, we know that for
1
>
0
, the MP test is of the form
1
(X) =
1, if X >
0
+
n
z
0, otherwise
and for
2
<
0
, the MP test is of the form
2
(X) =
1, if X <
0
n
z
0, otherwise
If a test is UMP, it must have the same rejection region as
1
and
2
. However, these 2
rejection regions are dierent (actually, their intersection is empty). Thus, there exists no
105
UMP test.
We next state a helpful Theorem and then continue with this example and see how we can
nd a UMPU test.
Theorem 9.4.4:
Let c
1
, . . . , c
n
IR be constants and f
1
(x), . . . , f
n+1
(x) be realvalued functions. Let ( be the
class of functions (x) satisfying 0 (x) 1 and
_
(x)f
i
(x)dx = c
i
i = 1, . . . , n.
If ( satises
(x) =
1, if f
n+1
(x) >
n
i=1
k
i
f
i
(x)
0, if f
n+1
(x) <
n
i=1
k
i
f
i
(x)
for some constants k
1
, . . . , k
n
IR, then
maximizes
_
(x)f
n+1
(x)dx among all (.
Proof:
Let
(x) (x))
_
f
n+1
(x)
n
i=1
k
i
f
i
(x)
_
0 x.
This holds since if
(x) = 0,
the left factor is 0 and the right factor is 0.
Therefore,
0
_
(
(x) (x))
_
f
n+1
(x)
n
i=1
k
i
f
i
(x)
_
dx
=
_
(x)f
n+1
(x)dx
_
(x)f
n+1
(x)dx
n
i=1
k
i
__
(x)f
i
(x)dx
_
(x)f
i
(x)dx
_
. .
=c
i
c
i
=0
Thus,
_
(x)f
n+1
(x)dx
_
(x)f
n+1
(x)dx.
Note:
(i) If f
n+1
is a pdf, then
0
, f
2
= f
1
, and
c
1
= .
106
Example 9.4.3: (continued)
So far, we have seen that there exists no UMP test for H
0
: =
0
vs H
1
: ,=
0
.
We will show that
3
(x) =
1, if X <
0
n
z
/2
or if X >
0
+
n
z
/2
0, otherwise
is a UMPU size test.
Due to Theorem 9.2.2, we only have to consider functions of sucient statistics T(X) = X.
Let
2
=
2
n
.
To be unbiased and of size , a test must have
(i)
_
(t)f
0
(t)dt = , and
(ii)
_
(t)f
(t)dt[
=
0
=
_
(t)
_
(t)
_
=
0
dt = 0, i.e., we have a minimum at
0
.
We want to maximize
_
(t)f
(t)dt, ,=
0
such that conditions (i) and (ii) hold.
We choose an arbitrary
1
,=
0
and let
f
1
(t) = f
0
(t)
f
2
(t) =
(t)
=
0
f
3
(t) = f
1
(t)
We now consider how the conditions on
2
exp(
1
2
2
(x
1
)
2
) >
k
1
2
exp(
1
2
2
(x
0
)
2
) +
k
2
2
exp(
1
2
2
(x
0
)
2
)(
x
0
2
)
exp(
1
2
2
(x
1
)
2
) > k
1
exp(
1
2
2
(x
0
)
2
) +k
2
exp(
1
2
2
(x
0
)
2
)(
x
0
2
)
exp(
1
2
2
((x
0
)
2
(x
1
)
2
)) > k
1
+k
2
(
x
0
2
)
exp(
x(
1
0
)
2
2
1
2
0
2
2
) > k
1
+k
2
(
x
0
2
)
107
Note that the left hand side of this inequality is increasing in x if
1
>
0
and decreasing in
x if
1
<
0
. Either way, we can choose k
1
and k
2
such that the linear function in x crosses
the exponential function in x at the two points
L
=
0
n
z
/2
,
U
=
0
+
n
z
/2
.
Obviously,
3
satises (i). We still need to check that
3
satises (ii) and that
3
() has a
minimum at
0
but omit this part from our proof here.
3
is of the form
t
(x) = also satises (i) and (ii) above. Therefore,
3
() ,=
0
. This means that
3
is unbiased.
Overall,
3
is a UMPU test of size .
Denition 9.4.5:
A test is said to be similar on a subset
of if
() = E
((X)) =
.
A test is said to be similar on
if it is similar on
for some , 0 1.
Note:
The trivial test (x) = is similar on every
.
Theorem 9.4.6:
Let be an unbiased test of size for H
0
:
0
vs H
1
:
1
such that
() is a
continuous function in . Then is similar on the boundary =
0
1
, where
0
and
1
are the closures of
0
and
1
, respectively.
Proof:
Let . There exist sequences
n
and
n
whith
n
0
and
n
1
such that
lim
n
n
= and lim
n
n
= .
By continuity,
(
n
)
() and
n
)
().
Since
(
n
) implies
() and since
n
) implies
() it must hold
that
() = .
108
Denition 9.4.7:
A test that is UMP among all similar tests on the boundary =
0
1
is called a
UMP similar test.
Theorem 9.4.8:
Suppose
((X))
0
.
Since the trivial test (x) = is similar, it must hold for
0
that
0
()
1
since
0
is UMP similiar. This implies that
0
is unbiased.
Since
() is continuous in , we see from Theorem 9.4.6 that the class of unbiased tests is
a subclass of the class of similar tests. Since
0
is UMP in the larger class, it is also UMP
in the subclass. Thus,
0
is UMPU.
Note:
The continuity of the power function
i=1
X
i
, we could use Theorem 9.3.5 to nd a UMP
test. However, we want to illustate the use of Theorem 9.4.8 here.
It is = 0 and the power function
() =
_
IR
n
(x)
_
1
2
_
n
exp
_
1
2
n
i=1
(x
i
)
2
_
dx
of any test is continuous in . Thus, due to Theorem 9.4.6, any unbiased size test of H
0
is similar on .
We need a UMP test of H
0
: = 0 vs H
1
: > 0.
By the NP Lemma, a MP test of H
0
: = 0 vs H
1
: =
1
, where
1
> 0 is given by
(x) =
1, if exp
_
x
2
i
2
(x
i
)
2
2
_
> k
0, otherwise
109
or equivalently, by Theorem 9.2.2,
(x) =
1, if T =
n
i=1
X
i
> k
0, otherwise
Since under H
0
, T N(0, n), k is determined by = P
=0
(T > k) = P(
T
n
>
k
n
), i.e.,
k =
nz
.
is independent of
1
for every
1
> 0. So is UMP similar for H
0
vs. H
1
.
Finally, is of size , since for 0, it holds that
E
((X)) = P
(T >
nz
)
= P
_
T n
n
> z
n
_
()
P(Z > z
)
=
() holds since
Tn
n
N(0, 1) for 0 and z
n z
for 0.
Thus all the requirements are met for Theorem 9.4.8, i.e.,
such that if
X P
, then g(X) P
.
Denition 9.4.10:
A group ( of transformations on X leaves a hypothesis testing problem invariant if (
leaves both P
:
0
and P
:
1
invariant, i.e., if y = g(x) h
(y), then
f
(x) :
0
h
(y) :
0
and f
(x) :
1
h
(y) :
1
.
110
Note:
We want two types of invariance for our tests:
Measurement Invariance: If y = g(x) is a 1to1 mapping, the decision based on y should
be the same as the decision based on x. If (x) is the test based on x and
(y) is the
test based on y, then it must hold that (x) =
(g(x)) =
(y).
Formal Invariance: If two tests have the same structure, i.e, the same , the same pdfs (or
pmfs), and the same hypotheses, then we should use the same test in both problems.
So, if the transformed problem in terms of y has the same formal structure as that of
the problem in terms of x, we must have that
(y) = (x) =
(g(x)).
We can combine these two requirements in the following denition:
Denition 9.4.11:
An invariant test with respect to a group ( of tansformations is any test such that
(x) = (g(x)) x g (.
Example 9.4.12:
Let X Bin(n, p). Let H
0
: p =
1
2
vs. H
1
: p ,=
1
2
.
Let ( = g
1
, g
2
, where g
1
(x) = n x and g
2
(x) = x.
If is invariant, then (x) = (n x). Is the test problem invariant? For g
2
, the answer is
obvious.
For g
1
, we get:
g
1
(X) = n X Bin(n, 1 p)
H
0
: p =
1
2
: f
p
(x) : p =
1
2
= h
p
(g
1
(x)) : p =
1
2
= Bin(n,
1
2
)
H
1
: p ,=
1
2
: f
p
(x) : p ,=
1
2
. .
=Bin(n,p=
1
2
)
= h
p
(g
1
(x)) : p ,=
1
2
. .
=Bin(n,p=
1
2
)
So all the requirements in Denition 9.4.10 are met. If, for example, n = 10, the test
(x) =
1, if x = 0, 1, 2, 8, 9, 10
0, otherwise
is invariant under (. For example, (4) = 0 = (10 4) = (6), and, in general,
(x) = (10 x) x 0, 1, . . . , 9, 10.
111
Example 9.4.13:
Let X
1
, . . . , X
n
N(,
2
) where both and
2
> 0 are unknown. It is X N(,
2
n
) and
n1
2
S
2
2
n1
and X and S
2
independent.
Let H
0
: 0 vs. H
1
: > 0.
Let ( be the group of scale changes:
( = g
c
(x, s
2
), c > 0 : g
c
(x, s
2
) = (cx, c
2
s
2
)
The problem is invariant because, when g
c
(x, s
2
) = (cx, c
2
s
2
), then
(i) cX and c
2
S
2
are independent.
(ii) cX N(c,
c
2
2
n
) N(,
2
n
).
(iii)
n1
c
2
2
c
2
S
2
2
n1
.
So, this is the same family of distributions and Denition 9.4.10 holds because 0 implies
that c 0 (for c > 0).
An invariant test satises (x, s
2
) (cx, c
2
s
2
), c > 0, s
2
> 0, x IR.
Let c =
1
s
. Then (x, s
2
) (
x
s
, 1) so invariant tests depend on (x, s
2
) only through
x
s
.
If
x
1
s
1
,=
x
2
s
2
, then there exists no c > 0 such that (x
2
, s
2
2
) (cx
1
, c
2
s
2
1
). So invariance places
no restrictions on for dierent
x
1
s
1
=
x
2
s
2
. Thus, invariant tests are exactly those that depend
only on
x
s
, which are equivalent to tests that are based only on t =
x
s/
n
. Since this mapping
is 1to1, the invariant test will use T =
X
S/
n
t
n1
if = 0. Note that this test does not
depend on the nuisance parameter
2
. Invariance often produces such results.
Denition 9.4.14:
Let ( be a group of transformations on the space of X. We say a statistic T(x) is maximal
invariant under ( if
(i) T is invariant, i.e., T(x) = T(g(x)) g (, and
(ii) T is maximal, i.e., T(x
1
) = T(x
2
) implies that x
1
= g(x
2
) for some g (.
112
Example 9.4.15:
Let x = (x
1
, . . . , x
n
) and g
c
(x) = (x
1
+c, . . . , x
n
+c).
Consider T(x) = (x
n
x
1
, x
n
x
2
, . . . , x
n
x
n1
).
It is T(g
c
(x)) = (x
n
x
1
, x
n
x
2
, . . . , x
n
x
n1
) = T(x), so T is invariant.
If T(x) = T(x
), then x
n
x
i
= x
n
x
i
i = 1, 2, . . . , n 1.
This implies that x
i
x
i
= x
n
x
n
= c i = 1, 2, . . . , n 1.
Thus, g
c
(x
) = (x
1
+c, . . . , x
n
+c) = x.
Therefore, T is maximal invariant.
Denition 9.4.16:
Let I
Z
_
=
T
1
.
.
.
T
n1
Z
X
1
X
n
.
.
.
X
n1
X
n
X
n
f
i
(t
1
+z, t
2
+z, . . . , t
n1
+z, z)dz
which is independent of . The problem is thus reduced to testing a simple hypothesis against
a simple alternative. By the NP Lemma (Theorem 9.2.1), the MP test is
(t
1
, . . . , t
n1
) =
_
1, if (t) > c
0, if (t) < c
where t = (t
1
, . . . , t
n1
) and (t) =
_
f
1
(t
1
+z, t
2
+z, . . . , t
n1
+z, z)dz
_
f
0
(t
1
+z, t
2
+z, . . . , t
n1
+z, z)dz
.
In the homework assignment, we use this result to construct a UMP invariant test of
H
0
: X N(, 1) vs. H
1
: X Cauchy(1, ),
where a Cauchy(1, ) distribution has pdf f(x; ) =
1
1
1 + (x )
2
, where IR.
114
10 More on Hypothesis Testing
10.1 Likelihood Ratio Tests
(Based on Casella/Berger, Section 8.2.1)
Definition 10.1.1:
The likelihood ratio test statistic for
H_0 : θ ∈ Θ_0 vs. H_1 : θ ∈ Θ_1 = Θ − Θ_0
is
λ(x) = sup_{θ∈Θ_0} f_θ(x) / sup_{θ∈Θ} f_θ(x).
The likelihood ratio test (LRT) is the test function
φ(x) = I_[0,c)(λ(x)),
for some constant c ∈ [0, 1], where c is usually chosen in such a way to make φ a test of size α.
Note:
(i) We have to select c such that 0 ≤ c ≤ 1 since 0 ≤ λ(x) ≤ 1.
(ii) LRTs are strongly related to MLEs. If θ̂ is the unrestricted MLE of θ over Θ and θ̂_0 is the MLE of θ over Θ_0, then λ(x) = f_{θ̂_0}(x) / f_{θ̂}(x).
Example 10.1.2:
Let X_1, ..., X_n be a sample from N(μ, 1). We want to construct a LRT for
H_0 : μ = μ_0 vs. H_1 : μ ≠ μ_0.
It is μ̂_0 = μ_0 and μ̂ = X̄. Thus,
λ(x) = (2π)^(−n/2) exp( −(1/2) Σ (x_i − μ_0)² ) / [ (2π)^(−n/2) exp( −(1/2) Σ (x_i − x̄)² ) ]
= exp( −(n/2) (x̄ − μ_0)² ).
The LRT rejects H_0 if λ(x) ≤ c, or equivalently, |x̄ − μ_0| ≥ √( −2 log c / n ). This means, the LRT rejects H_0 : μ = μ_0 if x̄ is too far from μ_0.
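For this example, λ(x) and −2 log λ(x) = n(x̄ − μ_0)² are simple to compute; under H_0 the latter behaves like a χ²_1 random variable. An illustrative sketch (function name and simulated data are ours):

    # Example 10.1.2: lambda(x) = exp(-n*(xbar - mu0)^2/2), and -2*log(lambda)
    # = n*(xbar - mu0)^2, approximately chi-squared with 1 df under H0.
    from math import exp, log
    import random

    def lrt(x, mu0):
        """Return (lambda(x), -2*log lambda(x)) for H0: mu = mu0, sigma = 1."""
        n = len(x)
        xbar = sum(x) / n
        lam = exp(-n * (xbar - mu0) ** 2 / 2)
        return lam, -2 * log(lam)

    random.seed(5)
    x = [random.gauss(0.0, 1.0) for _ in range(20)]   # sample generated under H0
    print(lrt(x, mu0=0.0))     # -2*log(lambda) should look like a chi^2_1 draw
    print(lrt(x, mu0=1.0))     # a wrong mu0 gives a much smaller lambda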
Theorem 10.1.3:
If T(X) is sucient for and
(T(x)) = (x) x,
i.e., the LRT can be expressed as a function of every sucient statistic for .
Proof:
Since T is sucient, it follows from Theorem 8.3.5 that its pdf (or pmf) factorizes as f
(x) =
g
0
f
(x)
sup
(x)
=
sup
0
g
(T)h(x)
sup
(T)h(x)
=
sup
0
g
(T)
sup
(T)
=
(T(x))
Thus, our simplied expression for (x) indeed only depends on a sucient statistic T.
Theorem 10.1.4:
If for a given , 0 1, and for a simple hypothesis H
0
and a simple alternative H
1
a
nonrandomized test based on the NP Lemma and a LRT exist, then these tests are equivalent.
Proof:
See Homework.
Note:
Usually, LRTs perform well since they are often UMP or UMPU size tests. However, this
does not always hold. Rohatgi, Example 4, page 440441, cites an example where the LRT is
not unbiased and it is even worse than the trivial test (x) = .
Theorem 10.1.5:
Under some regularity conditions on f
under H
0
.
116
Note:
The regularity conditions required for Theorem 10.1.5 are basically the same as for Theorem
8.7.10. Under independent parameters we understand parameters that are unspecied, i.e.,
free to vary.
Example 10.1.6:
Let X
1
, . . . , X
n
N(,
2
) where IR and
2
> 0 are both unknown.
Let H
0
: =
0
vs. H
1
: ,=
0
.
We have = (,
2
), = (,
2
) : IR,
2
> 0 and
0
= (
0
,
2
) :
2
> 0.
It is
0
= (
0
,
1
n
n
i=1
(x
i
0
)
2
) and
= (x,
1
n
n
i=1
(x
i
x)
2
).
Now, the LR test statistic (x) can be determined:
(x) =
f
0
(x)
f
(x)
=
1
(
2
n
)
n
2
(
(x
i
0
)
2
)
n
2
exp
_
(x
i
0
)
2
2
1
n
(x
i
0
)
2
_
1
(
2
n
)
n
2
(
(x
i
x)
2
)
n
2
exp
_
(x
i
x)
2
2
1
n
(x
i
x)
2
_
=
_
(x
i
x)
2
(x
i
0
)
2
_n
2
=
_
x
2
i
nx
2
x
2
i
2
0
x
i
+n
2
0
nx
2
+nx
2
_n
2
=
1
1 +
_
n(x
0
)
2
(x
i
x)
2
_
n
2
Note that this is a decreasing function of
t(X) =
n(X
0
)
_
1
n1
(X
i
X)
2
=
n(X
0
)
S
()
t
n1
.
() holds due to Corollary 7.2.4.
So we reject H
0
if t(x) is large. Now,
2 log (x) = 2(
n
2
) log
_
1 +n
(x
0
)
2
(x
i
x)
2
_
= nlog
_
1 +n
(x
0
)
2
(x
i
x)
2
_
117
Under H
0
,
n(X
0
)
N(0, 1) and
(X
i
X)
2
2
2
n1
and both are independent according
to Theorem 7.2.1.
Therefore, under H
0
,
n(X
0
)
2
1
n1
(X
i
X)
2
F
1,n1
.
Thus, the mgf of 2 log (X) under H
0
is
M
n
(t) = E
H
0
(exp(2t log (X)))
= E
H
0
(exp(nt log(1 +
F
n 1
)))
= E
H
0
(exp(log(1 +
F
n 1
)
nt
))
= E
H
0
((1 +
F
n 1
)
nt
)
=
_
0
(
n
2
)
(
n1
2
)(
1
2
)(n 1)
1
2
1
f
_
1 +
f
n 1
_
n
2
_
1 +
f
n 1
_
nt
df
Note that
f
1,n1
(f) =
(
n
2
)
(
n1
2
)(
1
2
)(n 1)
1
2
1
f
_
1 +
f
n 1
_
n
2
I
[0,)
(f)
is the pdf of a F
1,n1
distribution.
Let y =
_
1 +
f
n1
_
1
, then
f
n1
=
1y
y
and df =
n1
y
2
dy.
Thus,
M
n
(t) =
(
n
2
)
(
n1
2
)(
1
2
)(n 1)
1
2
_
0
1
f
_
1 +
f
n 1
_
nt
n
2
df
=
(
n
2
)
(
n1
2
)(
1
2
)(n 1)
1
2
(n 1)
1
2
_
1
0
y
n3
2
nt
(1 y)
1
2
dy
()
=
(
n
2
)
(
n1
2
)(
1
2
)
B(
n 1
2
nt,
1
2
)
=
(
n
2
)
(
n1
2
)(
1
2
)
(
n1
2
nt)(
1
2
)
(
n
2
nt)
=
(
n
2
)
(
n1
2
)
(
n1
2
nt)
(
n
2
nt)
, t <
1
2
1
2n
118
() holds since the integral represents the Beta function (see also Example 8.8.11).
As n , we can apply Stirlings formula which states that
((n) + 1) ((n))!
2((n))
(n)+
1
2
exp((n)).
So,
M
n
(t)
2(
n2
2
)
n1
2
exp(
n2
2
)
2(
n(12t)3
2
)
n(12t)2
2
exp(
n(12t)3
2
)
2(
n3
2
)
n2
2
exp(
n3
2
)
2(
n(12t)2
2
)
n(12t)1
2
exp(
n(12t)2
2
)
=
_
n 2
n 3
_n2
2
_
n 2
2
_1
2
_
n(1 2t) 3
n(1 2t) 2
_
n(12t)2
2
_
n(1 2t) 2
2
_
1
2
=
_
(1 +
1
n 3
)
n2
_1
2
. .
e
1
2 as n
_
(1
1
n(1 2t) 2
)
n(12t)2
_1
2
. .
e
1
2 as n
_
n 2
n(1 2t) 2
_1
2
. .
(12t)
1
2 as n
Thus,
M
n
(t)
1
(1 2t)
1
2
as n .
Note that this is the mgf of a
2
1
distribution. Therefore, it follows by the Continuity Theorem
(Theorem 6.4.2) that
2 log (X)
d
2
1
.
Obviously, we could use Theorem 10.1.5 (after checking that the regularity conditions hold)
as well to obtain the same result. Since under H
0
1 parameter (
2
) is unspecied and under
H
1
2 parameters (,
2
) are unspecied, it is = 2 1 = 1.
119
10.2 Parametric ChiSquared Tests
(Based on Rohatgi, Section 10.3 & Rohatgi/Saleh, Section 10.3)
Denition 10.2.1: Normal Variance Tests
Let X
1
, . . . , X
n
be a sample from a N(,
2
) distribution where may be known or unknown
and
2
> 0 is unknown. The following table summarizes the
2
tests that are typically being
used:
Reject H
0
at level if
H
0
H
1
known unknown
I
0
<
0
(x
i
)
2
2
0
2
n;1
s
2
2
0
n1
2
n1;1
II
0
>
0
(x
i
)
2
2
0
2
n;
s
2
2
0
n1
2
n1;
III =
0
,=
0
(x
i
)
2
2
0
2
n;1/2
s
2
2
0
n1
2
n1;1/2
or
(x
i
)
2
2
0
2
n;/2
or s
2
2
0
n1
2
n1;/2
Note:
(i) In Denition 10.2.1,
0
is any xed positive constant.
(ii) Tests I and II are UMPU if is unknown and UMP if is known.
(iii) In test III, the constants have been chosen in such a way to give equal probability to
each tail. This is the usual approach. However, this may result in a biased test.
(iv)
2
n;1
is the (lower) quantile and
2
n;
is the (upper) 1 quantile, i.e., for X
2
n
,
it holds that P(X
2
n;1
) = and P(X
2
n;
) = 1 .
(v) We can also use
2
tests to test for equality of binomial probabilities as shown in the
next few Theorems.
Theorem 10.2.2:
Let X
1
, . . . , X
k
be independent rvs with X
i
Bin(n
i
, p
i
), i = 1, . . . , k. Then it holds that
T =
k
i=1
_
X
i
n
i
p
i
_
n
i
p
i
(1 p
i
)
_
2
d
2
k
as n
1
, . . . , n
k
.
120
Proof:
Homework
Corollary 10.2.3:
Let X
1
, . . . , X
k
be as in Theorem 10.2.2 above. We want to test the hypothesis that H
0
: p
1
=
p
2
= . . . = p
k
= p, where p is a known constant (vs. the alternative H
1
that at least one of
the p
i
s is dierent from the other ones). An appoximate level test rejects H
0
if
y =
k
i=1
_
x
i
n
i
p
_
n
i
p(1 p)
_
2
2
k;
.
Theorem 10.2.4:
Let X
1
, . . . , X
k
be independent rvs with X
i
Bin(n
i
, p), i = 1, . . . , k. Then the MLE of p is
p =
k
i=1
x
i
k
i=1
n
i
.
Proof:
This can be shown by using the joint likelihood function or by the fact that
X
i
Bin(
n
i
, p)
and for X Bin(n, p), the MLE is p =
x
n
.
Theorem 10.2.5:
Let X
1
, . . . , X
k
be independent rvs with X
i
Bin(n
i
, p
i
), i = 1, . . . , k. An approximate
level test of H
0
: p
1
= p
2
= . . . = p
k
= p, where p is unknown (vs. the alternative H
1
that
at least one of the p
i
s is dierent from the other ones), rejects H
0
if
y =
k
i=1
_
x
i
n
i
p
_
n
i
p(1 p)
_
2
2
k1;
,
where p =
x
i
n
i
.
Theorem 10.2.6:
Let (X
1
, . . . , X
k
) be a multinomial rv with parameters n, p
1
, p
2
, . . . , p
k
where
k
i=1
p
i
= 1 and
k
i=1
X
i
= n. Then it holds that
U
k
=
k
i=1
(X
i
np
i
)
2
np
i
d
2
k1
121
as n .
An approximate level test of H
0
: p
1
= p
1
, p
2
= p
2
, . . . , p
k
= p
k
rejects H
0
if
k
i=1
(x
i
np
i
)
2
np
i
>
2
k1;
.
Proof:
Case k = 2 only:
U
2
=
(X
1
np
1
)
2
np
1
+
(X
2
np
2
)
2
np
2
=
(X
1
np
1
)
2
np
1
+
(n X
1
n(1 p
1
))
2
n(1 p
1
)
= (X
1
np
1
)
2
_
1
np
1
+
1
n(1 p
1
)
_
= (X
1
np
1
)
2
_
(1 p
1
) +p
1
np
1
(1 p
1
)
_
=
(X
1
np
1
)
2
np
1
(1 p
1
)
By the CLT,
X
1
np
1
_
np
1
(1 p
1
)
d
N(0, 1). Therefore, U
2
d
2
1
.
Theorem 10.2.7:
Let X
1
, . . . , X
n
be a sample from X. Let H
0
: X F, where the functional form of F is
known completely. We partition the real line into k disjoint Borel sets A
1
, . . . , A
k
and let
P(X A
i
) = p
i
, where p
i
> 0 i = 1, . . . , k.
Let Y
j
= #X
i
s in A
j
=
n
i=1
I
A
j
(X
i
), j = 1, . . . , k.
Then, (Y
1
, . . . , Y
k
) has multinomial distribution with parameters n, p
1
, p
2
, . . . , p
k
.
Theorem 10.2.8:
Let X
1
, . . . , X
n
be a sample from X. Let H
0
: X F
, where = (
1
, . . . ,
r
) is unknown.
Let the MLE
exist. We partition the real line into k disjoint Borel sets A
1
, . . . , A
k
and let
P
(X A
i
) = p
i
, where p
i
> 0 i = 1, . . . , k.
Let Y
j
= #X
i
s in A
j
=
n
i=1
I
A
j
(X
i
), j = 1, . . . , k.
Then it holds that
V
k
=
k
i=1
(Y
i
n p
i
)
2
n p
i
d
2
kr1
.
122
An approximate level test of H
0
: X F
rejects H
0
if
k
i=1
(y
i
n p
i
)
2
n p
i
>
2
kr1;
,
where r is the number of parameters in that have to be estimated.
123
10.3 tTests and FTests
(Based on Rohatgi, Section 10.4 & 10.5 & Rohatgi/Saleh, Section 10.4 & 10.5)
Denition 10.3.1: One and TwoTailed t-Tests
Let X
1
, . . . , X
n
be a sample from a N(,
2
) distribution where
2
> 0 may be known or
unknown and is unknown. Let X =
1
n
n
i=1
X
i
and S
2
=
1
n1
n
i=1
(X
i
X)
2
.
The following table summarizes the z and ttests that are typically being used:
Reject H
0
at level if
H
0
H
1
2
known
2
unknown
I
0
>
0
x
0
+
n
z
x
0
+
s
n
t
n1;
II
0
<
0
x
0
+
n
z
1
x
0
+
s
n
t
n1;1
III =
0
,=
0
[ x
0
[
n
z
/2
[ x
0
[
s
n
t
n1;/2
Note:
(i) In Denition 10.3.1,
0
is any xed constant.
(ii) These tests are based on just one sample and are often called one sample ttests.
(iii) Tests I and II are UMP and test III is UMPU if
2
is known. Tests I, II, and III are
UMPU and UMP invariant if
2
is unknown.
(iv) For large n ( 30), we can use ztables instead of t-tables. Also, for large n we can
drop the Normality assumption due to the CLT. However, for small n, none of these
simplications is justied.
Denition 10.3.2: TwoSample t-Tests
Let X
1
, . . . , X
m
be a sample from a N(
1
,
2
1
) distribution where
2
1
> 0 may be known or
unknown and
1
is unknown. Let Y
1
, . . . , Y
n
be a sample from a N(
2
,
2
2
) distribution where
2
2
> 0 may be known or unknown and
2
is unknown.
Let X =
1
m
m
i=1
X
i
and S
2
1
=
1
m1
m
i=1
(X
i
X)
2
.
Let Y =
1
n
n
i=1
Y
i
and S
2
2
=
1
n1
n
i=1
(Y
i
Y )
2
.
Let S
2
p
=
(m1)S
2
1
+(n1)S
2
2
m+n2
.
The following table summarizes the z and ttests that are typically being used:
124
Reject H
0
at level if
H
0
H
1
2
1
,
2
2
known
2
1
,
2
2
unknown,
1
=
2
I
1
2
1
2
> x y +z
2
1
m
+
2
2
n
x y +t
m+n2;
s
p
_
1
m
+
1
n
II
1
2
1
2
< x y +z
1
_
2
1
m
+
2
2
n
x y +t
m+n2;1
s
p
_
1
m
+
1
n
III
1
2
=
1
2
,= [ x y [ z
/2
_
2
1
m
+
2
2
n
[ x y [ t
m+n2;/2
s
p
_
1
m
+
1
n
Note:
(i) In Denition 10.3.2, is any xed constant.
(ii) All tests are UMPU and UMP invariant.
(iii) If
2
1
=
2
2
=
2
(which is unknown), then S
2
p
is an unbiased estimate of
2
. We should
check that
2
1
=
2
2
with an Ftest.
(iv) For large m + n, we can use ztables instead of t-tables. Also, for large m and large n
we can drop the Normality assumption due to the CLT. However, for small m or small
n, none of these simplications is justied.
Denition 10.3.3: Paired t-Tests
Let (X
1
, Y
1
) . . . , (X
n
, Y
n
) be a sample from a bivariate N(
1
,
2
,
2
1
,
2
2
, ) distribution where
all 5 parameters are unknown.
Let D
i
= X
i
Y
i
N(
1
2
,
2
1
+
2
2
2
1
2
).
Let D =
1
n
n
i=1
D
i
and S
2
d
=
1
n1
n
i=1
(D
i
D)
2
.
The following table summarizes the ttests that are typically being used:
H
0
H
1
Reject H
0
at level if
I
1
2
1
2
> d +
s
d
n
t
n1;
II
1
2
1
2
< d +
s
d
n
t
n1;1
III
1
2
=
1
2
,= [ d [
s
d
n
t
n1;/2
125
Note:
(i) In Denition 10.3.3, is any xed constant.
(ii) These tests are special cases of onesample tests. All the properties stated in the Note
following Denition 10.3.1 hold.
(iii) We could do a test based on Normality assumptions if
2
=
2
1
+
2
2
2
1
2
were
known, but that is a very unrealistic assumption.
Denition 10.3.4: FTests
Let X
1
, . . . , X
m
be a sample from a N(
1
,
2
1
) distribution where
1
may be known or unknown
and
2
1
is unknown. Let Y
1
, . . . , Y
n
be a sample from a N(
2
,
2
2
) distribution where
2
may
be known or unknown and
2
2
is unknown.
Recall that
m
i=1
(X
i
X)
2
2
1
2
m1
,
n
i=1
(Y
i
Y )
2
2
2
2
n1
,
and
m
i=1
(X
i
X)
2
(m1)
2
1
n
i=1
(Y
i
Y )
2
(n 1)
2
2
=
2
2
2
1
S
2
1
S
2
2
F
m1,n1
.
The following table summarizes the Ftests that are typically being used:
Reject H
0
at level if
H
0
H
1
1
,
2
known
1
,
2
unknown
I
2
1
2
2
2
1
>
2
2
1
m
(x
i
1
)
2
1
n
(y
i
2
)
2
F
m,n;
s
2
1
s
2
2
F
m1,n1;
II
2
1
2
2
2
1
<
2
2
1
n
(y
i
2
)
2
1
m
(x
i
1
)
2
F
n,m;
s
2
2
s
2
1
F
n1,m1;
III
2
1
=
2
2
2
1
,=
2
2
1
m
(x
i
1
)
2
1
n
(y
i
2
)
2
F
m,n;/2
s
2
1
s
2
2
F
m1,n1;/2
if s
2
1
s
2
2
or
1
n
(y
i
2
)
2
1
m
(x
i
1
)
2
F
n,m;/2
or
s
2
2
s
2
1
F
n1,m1;/2
if s
2
1
< s
2
2
126
Note:
(i) Tests I and II are UMPU and UMP invariant if
1
and
2
are unknown.
(ii) Test III uses equal tails and therefore may not be unbiased.
(iii) If an Ftest (at level
1
) and a ttest (at level
2
) are both performed, the combined test
has level = 1(1
1
)(1
2
) = 11+
1
+
2
2
=
1
+
2
2
max(
1
,
2
)
(
1
+
2
if both are small).
127
10.4 Bayes and Minimax Tests
(Based on Rohatgi, Section 10.6 & Rohatgi/Saleh, Section 10.6)
Hypothesis testing may be conducted in a decisiontheoretic framework. Here our action
space / consists of two options: a
0
= fail to reject H
0
and a
1
= reject H
0
.
Usually, we assume no loss for a correct decision. Thus, our loss function looks like:
L(, a
0
) =
0, if
0
a(), if
1
L(, a
1
) =
b(), if
0
0, if
1
We consider the following special cases:
01 loss: a() = b() = 1, i.e., all errors are equally bad.
Generalized 01 loss: a() = c
II
, b() = c
I
, i.e., all Type I errors are equally bad and all
Type II errors are equally bad and Type I errors are worse than Type II errors or vice
versa.
Then, the risk function can be written as
R(, d(X)) = L(, a
0
)P
(d(X) = a
0
) +L(, a
1
)P
(d(X) = a
1
)
=
a()P
(d(X) = a
0
), if
1
b()P
(d(X) = a
1
), if
0
The minimax rule minimizes
max
a()P
(d(X) = a
0
), b()P
(d(X) = a
1
).
Theorem 10.4.1:
The minimax rule d for testing
H
0
: =
0
vs. H
1
: =
1
under the generalized 01 loss function rejects H
0
if
f
1
(x)
f
0
(x)
k,
128
where k is chosen such that
R(
1
, d(X)) = R(
0
, d(X))
c
II
P
1
(d(X) = a
0
) = c
I
P
0
(d(X) = a
1
)
c
II
P
1
_
f
1
(X)
f
0
(X)
< k
_
= c
I
P
0
_
f
1
(X)
f
0
(X)
k
_
.
Proof:
Let d
), then
R(
0
, d) = R(
1
, d) < maxR(
0
, d
), R(
1
, d
).
So, d
is not minimax.
If R(
0
, d) R(
0
, d
), i.e.,
c
I
P
0
(d = a
1
) = R(
0
, d) R(
0
, d
) = c
I
P
0
(d
= a
1
),
then
P(reject H
0
[ H
0
true) = P
0
(d = a
1
) P
0
(d
= a
1
).
By the NP Lemma, the rule d is MP of its size. Thus,
P
1
(d = a
1
) P
1
(d
= a
1
) P
1
(d = a
0
) P
1
(d
= a
0
)
= R(
1
, d) R(
1
, d
)
= maxR(
0
, d), R(
1
, d) = R(
1
, d) R(
1
, d
) maxR(
0
, d
), R(
1
, d
) d
= d is minimax
Example 10.4.2:
Let X
1
, . . . , X
n
be iid N(, 1). Let H
0
: =
0
vs. H
1
: =
1
>
0
.
As we have seen before,
f
1
(x)
f
0
(x)
k
1
is equivalent to x k
2
.
Therefore, we choose k
2
such that
c
II
P
1
(X < k
2
) = c
I
P
0
(X k
2
)
c
II
(
n(k
2
1
)) = c
I
(1 (
n(k
2
0
))),
where (z) = P(Z z) for Z N(0, 1).
Given c
I
, c
II
,
0
,
1
, and n, we can solve (numerically) for k
2
using Normal tables.
129
Note:
Now suppose we have a prior distribution () on . Then the Bayes risk of a decision rule
d (under the loss function introduced before) is
R(, d) = E
(R(, d(X)))
=
_
R(, d)()d
=
_
0
b()()P
(d(X) = a
1
)d +
_
1
a()()P
(d(X) = a
0
)d
if is a pdf.
The Bayes risk for a pmf looks similar (see Rohatgi, page 461).
Theorem 10.4.3:
The Bayes rule for testing H
0
: =
0
vs. H
1
: =
1
under the prior (
0
) =
0
and
(
1
) =
1
= 1
0
and the generalized 01 loss function is to reject H
0
if
f
1
(x)
f
0
(x)
c
I
0
c
II
1
.
Proof:
We wish to minimize R(, d). We know that
R(, d)
Def. 8.8.8
= E
(R(, d))
Def. 8.8.3
= E
(E
(L(, d(X))))
Note
=
_
g(x)
..
marginal
(
_
L(, d(x)) h( [ x)
. .
posterior
d) dx
Def. 4.7.1
=
_
g(x)E
(L(, d(X)) [ X = x) dx
= E
X
(E
(x)
()f
(x)
=
0
f
0
(x)
0
f
0
(x) +
1
f
1
(x)
, =
0
1
f
1
(x)
0
f
0
(x) +
1
f
1
(x)
, =
1
130
Therefore,
E
(L(, d(X)) [ X = x) =
c
I
h(
0
[ x), if =
0
, d(x) = a
1
c
II
h(
1
[ x), if =
1
, d(x) = a
0
0, if =
0
, d(x) = a
0
0, if =
1
, d(x) = a
1
This will be minimized if we reject H
0
, i.e., d(x) = a
1
, when c
I
h(
0
[ x) c
II
h(
1
[ x)
= c
I
0
f
0
(x) c
II
1
f
1
(x)
=
f
1
(x)
f
0
(x)
c
I
0
c
II
1
Note:
For minimax rules and Bayes rules, the signicance level is no longer predetermined.
Example 10.4.4:
Let X
1
, . . . , X
n
be iid N(, 1). Let H
0
: =
0
vs. H
1
: =
1
>
0
. Let c
I
= c
II
.
By Theorem 10.4.3, the Bayes rule d rejects H
0
if
f
1
(x)
f
0
(x)
0
1
0
= exp
_
(x
i
1
)
2
2
+
(x
i
0
)
2
2
_
0
1
0
= exp
_
(
1
0
)
x
i
+
n(
2
0
2
1
)
2
_
0
1
0
= (
1
0
)
x
i
+
n(
2
0
2
1
)
2
ln(
0
1
0
)
=
1
n
x
i
1
n
ln(
0
1
0
)
1
0
+
0
+
1
2
If
0
=
1
2
, then we reject H
0
if x
0
+
1
2
.
131
Note:
We can generalize Theorem 10.4.3 to the case of classifying among k options
1
, . . . ,
k
. If we
use the 01 loss function
L(
i
, d) =
1, if d(X) =
j
j ,= i
0, if d(X) =
i
,
then the Bayes rule is to pick
i
if
i
f
i
(x)
j
f
j
(x) j ,= i.
Example 10.4.5:
Let X
1
, . . . , X
n
be iid N(, 1). Let
1
<
2
<
3
and let
1
=
2
=
3
.
Choose =
i
if
i
exp
_
(x
k
i
)
2
2
_
j
exp
_
(x
k
j
)
2
2
_
, j ,= i, j = 1, 2, 3.
Similar to Example 10.4.4, these conditions can be transformed as follows:
x(
i
j
)
(
i
j
)(
i
+
j
)
2
, j ,= i, j = 1, 2, 3.
In our particular example, we get the following decision rules:
(i) Choose
1
if x
1
+
2
2
(and x
1
+
3
2
).
(ii) Choose
2
if x
1
+
2
2
and x
2
+
3
2
.
(iii) Choose
3
if x
2
+
3
2
(and x
1
+
3
2
).
Note that in (i) and (iii) the condition in parentheses automatically holds when the other
condition holds.
132
If
1
= 0,
2
= 2, and
3
= 4, we have the decision rules:
0 2 4
1
2
3
Example 10.4.5
(i) Choose
1
if x 1.
(ii) Choose
2
if 1 x 3.
(iii) Choose
3
if x 3.
We do not have to worry how to handle the boundary since the probability that the rv will
realize on any of the two boundary points is 0.
133
11 Condence Estimation
11.1 Fundamental Notions
(Based on Casella/Berger, Section 9.1 & 9.3.2)
Let X be a rv and a, b be xed positive numbers, a < b. Then
P(a < X < b) = P(a < X and X < b)
= P(a < X and
X
b
< 1)
= P(a < X and
aX
b
< a)
= P(
aX
b
< a < X)
The interval I(X) = (
aX
b
, X) is an example of a random interval. I(X) contains the value
a with a certain xed probability.
For example, if X U(0, 1), a =
1
4
, and b =
3
4
, then the interval I(X) = (
X
3
, X) contains
1
4
with probability
1
2
.
Denition 11.1.1:
Let P
, IR
k
, be a set of probability distributions of a rv X. A family of subsets
S(x) of , where S(x) depends on x but not on , is called a family of random sets. In
particular, if IR and S(x) is an interval ((x), (x)) where (x) and (x) depend on
x but not on , we call S(X) a random interval, with (X) and (X) as lower and upper
bounds, respectively. (X) may be and (X) may be +.
Note:
Frequently in inference, we are not interested in estimating a parameter or testing a hypoth-
esis about it. Instead, we are interested in establishing a lower or upper bound (or both) for
one or multiple parameters.
Denition 11.1.2:
A family of subsets S(x) of IR
k
is called a family of condence sets at condence
level 1 if
P
(S(X) ) 1 ,
where 0 < < 1 is usually small.
The quantity
inf
(S(X) ) = 1
134
is called the condence coecient (i.e., the smallest probability of true coverage is 1 ).
Denition 11.1.3:
For k = 1, we use the following names for some of the condence sets dened in Denition
11.1.2:
(i) If S(x) = ((x), ), then (x) is called a level 1 lower condence bound.
(ii) If S(x) = (, (x)), then (x) is called a level 1 upper condence bound.
(iii) S(x) = ((x), (x)) is called a level 1 condence interval (CI).
Denition 11.1.4:
A family of 1 level condence sets S(x) is called uniformly most accurate (UMA)
if
P
(S(X)
) P
(S
(X)
) ,
, ,=
,
and for any 1 level family of condence sets S
(
1
() < T(X, ) <
2
()) 1 .
Since the distribution of T(X, ) is independent of ,
1
() and
2
() also do not depend on
.
If T(X, ) is increasing in , solve the equations
1
() = T(X, ) for (X) and
2
() = T(X, )
135
for (X).
If T(X, ) is decreasing in , solve the equations
1
() = T(X, ) for (X) and
2
() =
T(X, ) for (X).
In either case, it holds that
P
n
t
n1
and T(X, ) is independent of and monotone and decreasing in .
We choose
1
() and
2
() such that
P(
1
() < T(X, ) <
2
()) = 1
and solve for which yields
P(
1
() <
X
S/
n
<
2
()) =
P(
1
()
S
n
< X <
2
()
S
n
) =
P(
1
()
S
n
X < <
2
()
S
n
X) =
P(X
S
2
()
n
< < X
S
1
()
n
) = 1 .
Thus,
(X
S
2
()
n
, X
S
1
()
n
)
136
is a 1 level CI for . We commonly choose
2
() =
1
() = t
n1;/2
.
Example 11.1.7:
Let X
1
, . . . , X
n
U(0, ).
We know that
= max(X
i
) = Max
n
is the MLE for and sucient for .
The pdf of Max
n
is given by
f
n
(y) =
ny
n1
n
I
(0,)
(y).
Then the rv T
n
=
Maxn
1
t
n1
dt = 1
=
n
2
n
1
= 1
If we choose
2
= 1 and
1
=
1/n
, then (Max
n
,
1/n
Max
n
) is a 1 level CI for . This
holds since
1 = P(
1/n
<
Max
n
< 1)
= P(
1/n
>
Max
n
> 1)
= P(
1/n
Max
n
> > Max
n
)
137
11.2 ShortestLength Condence Intervals
(Based on Casella/Berger, Section 9.2.2 & 9.3.1)
In practice, we usually want not only an interval with coverage probability 1 for , but if
possible the shortest (most precise) such interval.
Denition 11.2.1:
A rv T(X, ) whose distribution is independent of is called a pivot.
Note:
The methods we will discuss here can provide the shortest interval based on a given pivot.
They will not guarantee that there is no other pivot with a shorter minimal interval.
Example 11.2.2:
Let X
1
, . . . , X
n
N(,
2
), where
2
> 0 is known. The obvious pivot for is
T
(X) =
X
/
n
N(0, 1).
Suppose that (a, b) is an interval such that P(a < Z < b) = 1 , where Z N(0, 1).
A 1 level CI based on this pivot is found by
1 = P
_
a <
X
/
n
< b
_
= P
_
X b
n
< < X a
n
_
.
The length of the interval is L = (b a)
n
.
To minimize L, we must choose a and b such that b a is minimal while
(b) (a) =
1
2
_
b
a
e
x
2
2
dx = 1 ,
where (z) = P(Z z). We use the notation (z) =
n
(
db
da
1) =
n
(
(a)
(b)
1).
138
The minimum occurs when (a) = (b) which happens when a = b or a = b. If we select
a = b, then (b) (a) = (a) (a) = 0 ,= 1 . Thus, we must have that b = a = z
/2
.
Thus, the shortest CI based on T
is
(X z
/2
n
, X +z
/2
n
).
Denition 11.2.3:
A pdf f(x) is unimodal i there exists a x
and
f(x) is nonincreasing for x x
.
Theorem 11.2.4:
Let f(x) be a unimodal pdf. If the interval [a, b] satises
(i)
_
b
a
f(x)dx = 1
(ii) f(a) = f(b) > 0, and
(iii) a x
b, where x
is a mode of f(x),
then the interval [a, b] is the shortest of all intervals which satisfy condition (i).
Proof:
Let [a
, b
f(x)dx < 1,
i.e., a contradiction.
We assume that a
is similar.
Suppose that b
a. Then a
a x
x* a b a b
Theorem 11.2.4a
139
It follows
_
b
f(x)dx f(b
)(b
) [x b
f(x) f(b
)
f(a)(b
) [b
a x
f(b
) f(a)
< f(a)(b a) [b
_
b
a
f(x)dx [f(x) f(a) for a x b
= 1 [by (i)
Suppose b
> b a, i.e.,
b
wouldnt be of shorter length than b a. Thus, we have to consider the case that
a
a < b
< b.
x* a b a b
Theorem 11.2.4b
It holds that
_
b
f(x)dx =
_
b
a
f(x)dx +
_
a
a
f(x)dx
_
b
b
f(x)dx
Note that
_
a
a
f(x)dx f(a)(a a
) and
_
b
b
f(x)dx f(b)(b b
). Therefore, we get
_
a
a
f(x)dx
_
b
b
f(x)dx f(a)(a a
) f(b)(b b
)
= f(a)((a a
) (b b
) (b a))
< 0
Thus,
_
b
f(x)dx < 1 .
140
Note:
Example 11.2.2 is a special case of Theorem 11.2.4. However, Theorem 11.2.4 is not immedi-
ately applicable in the following example since the length of that interval is proportional to
1
a
1
b
(and not to b a).
Example 11.2.5:
Let X
1
, . . . , X
n
N(,
2
), where is known. The obvious pivot for
2
is
T
2(X) =
(X
i
)
2
2
2
n
.
So
P
_
a <
(X
i
)
2
2
< b
_
= 1
P
_
(X
i
)
2
b
<
2
<
(X
i
)
2
a
_
= 1
We wish to minimize
L = (
1
a
1
b
)
(X
i
)
2
such that
_
b
a
f
n
(t)dt = 1 , where f
n
(t) is the pdf of a
2
n
distribution.
We get
f
n
(b)
db
da
f
n
(a) = 0
and
dL
da
=
_
1
a
2
+
1
b
2
db
da
_
(X
i
)
2
=
_
1
a
2
+
1
b
2
f
n
(a)
f
n
(b)
_
(X
i
)
2
.
We obtain a minimum if a
2
f
n
(a) = b
2
f
n
(b).
Note that in practice equal tails
2
n;/2
and
2
n;1/2
are used, which do not result in shortest
length CIs. The reason for this selection is simple: When these tests were developed, com-
puters did not exist that could solve these equations numerically. People in general had to
rely on tabulated values. Manually solving the equation above for each case obviously wasnt
a feasible solution.
141
Example 11.2.6:
Let X
1
, . . . , X
n
U(0, ). Let Max
n
= max X
i
= X
(n)
. Since T
n
=
Maxn
has pdf
nt
n1
I
(0,1)
(t) which does not depend on , T
n
can be selected as a our pivot. The den-
sity of T
n
is strictly increasing for n 2, so we cannot nd constants a and b as in Example
11.2.5.
If P(a < T
n
< b) = 1 , then P(
Maxn
b
< <
Maxn
a
) = 1 .
We wish to minimize
L = Max
n
(
1
a
1
b
)
such that
_
b
a
nt
n1
dt = b
n
a
n
= 1 .
We get
nb
n1
na
n1
da
db
= 0 =
da
db
=
b
n1
a
n1
and
dL
db
= Max
n
(
1
a
2
da
db
+
1
b
2
) = Max
n
(
b
n1
a
n+1
+
1
b
2
) = Max
n
(
a
n+1
b
n+1
b
2
a
n+1
) < 0 for 0 a < b 1.
Thus, L does not have a local minimum. However, since
dL
db
< 0, L is strictly decreasing
as a function of b. It is minimized when b = 1, i.e., when b is as large as possible. The
corresponding a is selected as a =
1/n
.
The shortest 1 level CI based on T
n
is (Max
n
,
1/n
Max
n
).
142
11.3 Condence Intervals and Hypothesis Tests
(Based on Casella/Berger, Section 9.2)
Example 11.3.1:
Let X
1
, . . . , X
n
N(,
2
), where
2
> 0 is known. In Example 11.2.2 we have shown that
the interval
(X z
/2
n
, X +z
/2
n
)
is a 1 level CI for .
Suppose we dene a test of H
0
: =
0
vs. H
1
: ,=
0
that rejects H
0
i
0
does not
fall in this interval. Then,
P
0
(Type I error) = P
0
(Reject H
0
when H
0
is true )
= P
0
__
X z
/2
n
, X +z
/2
n
_
,
0
_
= 1 P
0
__
X z
/2
n
, X +z
/2
n
_
0
_
= 1 P
0
_
X z
/2
n
0
and
0
X +z
/2
n
_
= P
0
_
X z
/2
n
0
or
0
X +z
/2
n
_
= P
0
_
X
0
z
/2
n
or X
0
z
/2
n
_
= P
0
_
X
0
n
z
/2
or
X
0
n
z
/2
_
= P
0
_
[ X
0
[
n
z
/2
_
= ,
i.e., has size . So a test based on the shortest 1 level CI obtained in Example 11.2.2
is equivalent to the UMPU test III of size introduced in Denition 10.3.1 (when
2
is known).
Conversely, if (x,
0
) is a family of size tests of H
0
: =
0
, the set
0
[ (x,
0
) fails to reject H
0
(S(X)
)
H
1
(
(S(X) ) = P
(X A()) 1 .
Let S
() = x : S
(x) .
Then,
P
(X A
()) = P
(S
(X) ) 1 .
Since A(
0
) is UMP, it holds that
P
(X A
(
0
)) P
(X A(
0
)) H
1
(
0
).
This implies that
P
(S
(X)
0
) P
(X A(
0
)) = P
(S(X)
0
) H
1
(
0
).
Example 11.3.3:
Let X be a rv that belongs to a oneparameter exponential family with pdf
f
(x) = exp(Q()T(x) +S
(x) +D()),
where Q() is nondecreasing.
We consider a test H
0
: =
0
vs. H
1
: <
0
. By Theorem 9.3.3, the family f
has a MLR in T(X). It follows by the Note after Theorem 9.3.5 that the acceptance region
of a UMP size test of H
0
has the form A(
0
) = x : T(x) > c(
0
) and this test has a
nonincreasing power function.
Now consider a similar test H
0
: =
1
vs. H
1
: <
1
. The acceptance region of a UMP
size test of H
0
also has the form A(
1
) = x : T(x) > c(
1
).
Thus, for
1
0
,
P
0
(T(X) c(
0
)) = = P
1
(T(X) c(
1
)) P
0
(T(X) c(
1
))
144
(since for a UMP test, it holds that power size). Therefore, we can choose c() as non
decreasing.
A level 1 CI for is then
S(x) = : x A() = (, c
1
(T(x))),
where c
1
(T(x)) = sup
: c() T(x).
Example 11.3.4:
Let X Exp() with f
(x) =
1
I
(0,)
(x), which belongs to a oneparameter exponential
family. Then Q() =
1
0
(x)dx =
_
c(
0
)
0
1
0
e
0
dx = 1 e
c(
0
)
0
.
Thus,
e
c(
0
)
0
= 1 = c(
0
) =
0
log(
1
1
)
Therefore, the UMA family of 1 level condence sets is of the form
S(x) = : x A()
= : x log
_
1
1
_
= :
x
log(
1
1
)
=
_
0,
x
log(
1
1
)
_
.
Note:
Just as we frequently restrict the class of tests (when UMP tests dont exist), we can make
the same sorts of restrictions on CIs.
Denition 11.3.5:
A family S(x) of condence sets for parameter is said to be unbiased at level 1 if
P
(S(X) ) 1 and P
(S(X)
) 1 ,
, ,=
.
If S(x) is unbiased and minimizes P
(S(X)
(S(X)
) = P
(X A(
)) 1 ,
, ,=
.
Thus, S is unbiased.
Let S
() = x : S
(x) .
It holds that
P
(X A
)) = P
(S
(X)
) 1 .
Therefore, A
(S
(X)
) = P
(X A
))
()
P
(X A(
))
= P
(S(X)
)
() holds since A() is the acceptance region of a UMPU test.
Theorem 11.3.7:
Let be an interval on IR and f
((X) (X))) =
_
((x) (x))f
(x)dx =
_
=
P
(S(X)
) d
.
Proof:
It holds that =
_
((X) (X))) =
_
IR
n
((x) (x))f
(x)dx
=
_
IR
n
_
_
(x)
(x)
d
_
f
(x)dx
146
=
_
IR
_
IR
n
..
1
(
1
(
)
. .
IR
n
f
(x)dx
=
_
IR
P
_
X [
1
(
),
1
(
)]
_
d
=
_
IR
P
(S(X)
) d
=
_
=
P
(S(X)
) d
Note:
Theorem 11.3.7 says that the expected length of the CI is the probability that S(X) includes
the false
.
Corollary 11.3.8:
If S(X) is UMAU, then E
((X) (X)) =
_
=
P
(S(X)
) d
.
Since a UMAU CI minimizes this probability for all
n
, X +z
/2
n
) is the shortest 1 level CI for .
By Example 9.4.3, the equivalent test is UMPU. So by Theorem 11.3.6 this interval is UMAU
and by Corollary 11.3.8 it has shortest expected length as well.
Example 11.3.10:
Let X
1
, . . . , X
n
N(,
2
), where and
2
> 0 are both unknown.
Note that
T(X,
2
) =
(n 1)S
2
2
= T
2
n1
.
Thus,
P
2
_
1
<
(n 1)S
2
2
<
2
_
= 1 P
2
_
(n 1)S
2
2
<
2
<
(n 1)S
2
1
_
= 1 .
147
We now dene P() as
P
2
_
(n 1)S
2
2
<
2
<
(n 1)S
2
1
_
= P
2
_
(n 1)S
2
2
<
2
2
<
(n 1)S
2
2
_
= P
_
T
2
< <
T
1
_
= P
_
1
T
<
1
<
2
T
_
= P(
1
< T
<
2
)
= P(),
where =
2
2
.
If our test is unbiased, then it follows from Denition 11.3.5 that
P(1) = 1 and P() < 1 ,= 1.
This implies that we can nd
1
,
2
such that P(1) = 1 is a (local) maximum and therefore
dP()
d
=1
=
_
d
d
_
2
f
T
(x)dx
_
=1
()
= (f
T
(
2
)
2
f
T
(
1
)
1
+ 0) [
=1
=
2
f
T
(
2
)
1
f
T
(
1
)
= 0,
where f
T
is the pdf that relates to T
2
,
(n 1)S
2
1
_
is an unbiased 1 level CI for
2
.
Rohatgi, Theorem 4(b), page 428429, states that the related test is UMPU. Therefore, by
Theorem 11.3.6 and Corollary 11.3.8, our CI is UMAU with shortest expected length among
all unbiased intervals.
Note that this CI is dierent from the equaltail CI based on Denition 10.2.1, III, and from
the shortestlength CI obtained in Example 11.2.5.
148
11.4 Bayes Condence Intervals
(Based on Casella/Berger, Section 9.2.4)
Denition 11.4.1:
Given a posterior distribution h( [ x), a level 1 credible set (Bayesian condence
set) is any set A such that
P( A [ x) =
_
A
h( [ x)d = 1 .
Note:
If A is an interval, we speak of a Bayesian condence interval.
Example 11.4.2:
Let X Bin(n, p) and (p) U(0, 1).
In Example 8.8.11, we have shown that
h(p [ x) =
p
x
(1 p)
nx
_
1
0
p
x
(1 p)
nx
dp
I
(0,1)
(p)
= B(x + 1, n x + 1)
1
p
x
(1 p)
nx
I
(0,1)
(p)
=
(n + 2)
(x + 1)(n x + 1)
p
x
(1 p)
nx
I
(0,1)
(p)
p [ x Beta(x + 1, n x + 1),
where B(a, b) =
(a)(b)
(a+b)
is the beta function evaluated for a and b and Beta(x +1, n x +1)
represents a Beta distribution with parameters x + 1 and n x + 1.
Using the observed value for x and tables for incomplete beta integrals or a numerical ap-
proach, we can nd
1
and
2
such that P
p|x
(
1
< p <
2
) = 1 . So (
1
,
2
) is a credible
interval for p.
Note:
(i) The denitions and interpretations of credible intervals and condence intervals are quite
dierent. Therefore, very dierent intervals may result.
(ii) We can often use Theorem 11.2.4 to nd the shortest credible interval (if the precondi-
tions hold).
149
Example 11.4.3:
Let X
1
, . . . , X
n
be iid N(, 1) and () N(0, 1). We want to construct a Bayesian level
1 CI for .
By Denition 8.8.7, the posterior distribution of given x is
h( [ x) =
()f(x [ )
g(x)
where
g(x) =
_
f(x, )d
=
_
()f(x [ )d
=
_
2
exp(
1
2
2
)
1
(
2)
n
exp
_
1
2
n
i=1
(x
i
)
2
_
d
=
_
1
(2)
n+1
2
exp
_
1
2
_
n
i=1
x
2
i
2
n
i=1
x
i
+n
2
_
1
2
2
_
d
=
_
1
(2)
n+1
2
exp
_
1
2
n
i=1
x
2
i
+nx
n
2
1
2
2
_
d
=
1
(2)
n+1
2
exp
_
1
2
n
i=1
x
2
i
_
_
exp
_
n + 1
2
_
2
2
nx
n + 1
__
d
=
1
(2)
n+1
2
exp
_
1
2
n
i=1
x
2
i
_
_
exp
_
n + 1
2
_
nx
n + 1
_
2
+
n + 1
2
_
nx
n + 1
_
2
_
d
=
1
(2)
n+1
2
exp
_
1
2
n
i=1
x
2
i
+
n
2
x
2
2(n + 1)
_
2
1
n + 1
_
1
_
2
1
n+1
exp
_
1
2
1
1/(n + 1)
_
nx
n + 1
_
2
_
d
. .
=1 since pdf of a N(
nx
n+1
,
1
n+1
) distribution
=
(n + 1)
1
2
(2)
n
2
exp
_
1
2
n
i=1
x
2
i
+
n
2
x
2
2(n + 1)
_
150
Therefore,
h( [ x) =
()f(x [ )
g(x)
=
1
2
exp(
1
2
2
)
1
(
2)
n
exp
_
1
2
n
i=1
(x
i
)
2
_
(n + 1)
1
2
(2)
n
2
exp
_
1
2
n
i=1
x
2
i
+
n
2
x
2
2(n + 1)
_
=
n + 1
2
exp
_
1
2
n
i=1
x
2
i
+nx
n
2
1
2
2
+
1
2
n
i=1
x
2
i
n
2
x
2
2(n + 1)
_
=
n + 1
2
exp
_
n + 1
2
_
2
n + 1
nx +
2
n + 1
n
2
x
2
2(n + 1)
__
=
1
_
2
1
n+1
exp
_
1
2
1
1
n+1
_
nx
n + 1
_
2
_
,
i.e.,
[ x N
_
nx
n + 1
,
1
n + 1
_
.
Therefore, a Bayesian level 1 CI for is
_
n
n + 1
X
z
/2
n + 1
,
n
n + 1
X +
z
/2
n + 1
_
.
The shortest (classical) level 1 CI for (treating
2
= 1 as xed) is
_
X
z
/2
n
, X +
z
/2
n
_
as seen in Example 11.2.2.
Thus, the Bayesian CI is slightly shorter than the classical CI since we use additional infor-
mation in constructing the Bayesian CI.
151
12 Nonparametric Inference
12.1 Nonparametric Estimation
Denition 12.1.1:
A statistical method which does not rely on assumptions about the distributional form of a
rv (except, perhaps, that it is absolutely continuous, or purely discrete) is called a nonpara-
metric or distributionfree method.
Note:
Unless otherwise specied, we make the following assumptions for the remainder of this chap-
ter: Let X
1
, . . . , X
n
be iid F, where F is unknown. Let T be the class of all possible
distributions of X.
Denition 12.1.2:
A statistic T(X) is sucient for a family of distributions T if the conditional distibution of
X given T = t is the same for all F T.
Example 12.1.3:
Let X
1
, . . . , X
n
be absolutely continuous. Let T = (X
(1)
, . . . , X
(n)
) be the order statistics.
It holds that
f(x [ T = t) =
1
n!
,
so T is sucient for the family of absolutely continuous distributions on IR.
Denition 12.1.4:
A family of distributions T is complete if the only unbiased estimate of 0 is the 0 itself, i.e.,
E
F
(h(X)) = 0 F T = h(x) = 0 x.
Denition 12.1.5:
A statistic T(X) is complete in relation to T if the class of induced distributions of T is
complete.
Theorem 12.1.6:
The order statistic (X
(1)
, . . . , X
(n)
) is a complete sucient statistic, provided that X
1
, . . . , X
n
are of either (pure) discrete of (pure) continuous type.
152
Denition 12.1.7:
A parameter g(F) is called estimable if it has an unbiased estimate, i.e., if there exists a
T(X) such that
E
F
(T(X)) = g(F) F T.
Example 12.1.8:
Let T be the class of distributions for which second moments exist. Then X is unbiased for
(F) =
_
xdF(x). Thus, (F) is estimable.
Denition 12.1.9:
The degree m of an estimable parameter g(F) is the smallest sample size for which an unbi-
ased estimate exists for all F T.
An unbiased estimate based on a sample of size m is called a kernel.
Lemma 12.1.10:
There exists a symmetric kernel for every estimable parameter.
Proof:
Let T(X
1
, . . . , X
m
) be a kernel of g(F). Dene
T
s
(X
1
, . . . , X
m
) =
1
m!
c
T
s
(X
i
1
, . . . , X
im
),
where T
s
is dened as in Lemma 12.1.10 and the summation c is over all
_
n
m
_
combina-
tions of m integers (i
1
, . . . , i
m
) from 1, , n. U(X
1
, . . . , X
n
) is symmetric in the X
i
s and
E
F
(U(X)) = g(F) for all F.
Example 12.1.13:
For estimating (F) with degree m of (F) = 1:
Symmetric kernel:
T
s
(X
i
) = X
i
, i = 1, . . . , n
Ustatistic:
U
(X) =
1
_
n
1
_
c
X
i
=
1 (n 1)!
n!
c
X
i
=
1
n
n
i=1
X
i
= X
For estimating
2
(F) with degree m of
2
(F) = 2:
Symmetric kernel:
T
s
(X
i
1
, X
i
2
) =
1
2
(X
i
1
X
i
2
)
2
, i
1
, i
2
= 1, . . . , n, i
1
,= i
2
Ustatistic:
U
2 (X) =
1
_
n
2
_
i
1
<i
2
1
2
(X
i
1
X
i
2
)
2
=
1
_
n
2
_
1
4
i
1
=i
2
(X
i
1
X
i
2
)
2
=
(n 2)! 2!
n!
1
4
i
1
=i
2
(X
i
1
X
i
2
)
2
154
=
1
2n(n 1)
i
1
i
2
=i
1
(X
2
i
1
2X
i
1
X
i
2
+X
2
i
2
)
=
1
2n(n 1)
(n 1)
n
i
1
=1
X
2
i
1
2(
n
i
1
=1
X
i
1
)(
n
i
2
=1
X
i
2
) + 2
n
i=1
X
2
i
+
(n 1)
n
i
2
=1
X
2
i
2
=
1
2n(n 1)
n
n
i
1
=1
X
2
i
1
n
i
1
=1
X
2
i
1
2(
n
i
1
=1
X
i
1
)
2
+ 2
n
i=1
X
2
i
+
n
n
i
2
=1
X
2
i
2
n
i
2
=1
X
2
i
2
=
1
n(n 1)
_
n
n
i=1
X
2
i
(
n
i=1
X
i
)
2
_
=
1
n(n 1)
n
n
i=1
X
i
1
n
n
j=1
X
j
=
1
(n 1)
n
i=1
(X
i
X)
2
= S
2
Theorem 12.1.14:
Let T be the class of all absolutely continuous or all purely discrete distribution functions on
IR. Any estimable function g(F), F T, has a unique estimate that is unbiased and sym-
metric in the observations and has uniformly minimum variance among all unbiased estimates.
Proof:
Let X
1
, . . . , X
n
iid
F T, with T(X
1
, . . . , X
n
) an unbiased estimate of g(F).
We dene
T
i
= T
i
(X
1
, . . . , X
n
) = T(X
i
1
, X
i
2
, . . . , X
in
), i = 1, 2, . . . , n!,
over all possible permutations of 1, . . . , n.
Let T =
1
n!
n!
i=1
T
i
and T =
n!
i=1
T
i
.
155
Then
E
F
(T) = g(F)
and
V ar(T) = E(T
2
) (E(T))
2
= E
_
(
1
n!
n!
i=1
T
i
)
2
_
[g(F)]
2
= E
(
1
n!
)
2
n!
i=1
n!
j=1
T
i
T
j
[g(F)]
2
E
n!
i=1
n!
j=1
T
i
T
j
[g(F)]
2
= E
_
n!
i=1
T
i
_
n!
j=1
T
j
[g(F)]
2
= E
_
n!
i=1
T
i
_
2
[g(F)]
2
= E(T
2
) [g(F)]
2
= V ar(T)
Equality holds i T
i
= T
j
i, j = 1, . . . , n!
= T is symmetric in (X
1
, . . . , X
n
) and T = T
= by Rohatgi, Problem 4, page 538, T is a function of order statistics
= by Rohatgi, Theorem 1, page 535, T is a complete sucient statistic
= by Note (i) following Theorem 8.4.12, T is UMVUE
Corollary 12.1.15:
If T(X
1
, . . . , X
n
) is unbiased for g(F), F T, the corresponding Ustatistic is an essentially
unique UMVUE.
156
Denition 12.1.16:
Suppose we have independent samples X
1
, . . . , X
m
iid
F T, Y
1
, . . . , Y
n
iid
G T (G may or
may not equal F.) Let g(F, G) be an estimable function with unbiased estimator T(X
1
, . . . , X
k
, Y
1
, . . . , Y
l
).
Dene
T
s
(X
1
, . . . , X
k
, Y
1
, . . . , Y
l
) =
1
k!l!
P
X
P
Y
T(X
i
1
, . . . , X
i
k
, Y
j
1
, . . . , Y
j
l
)
(where P
X
and P
Y
are permutations of X and Y ) and
U(X, Y ) =
1
_
m
k
__
n
l
_
C
X
C
Y
T
s
(X
i
1
, . . . , X
i
k
, Y
j
1
, . . . , Y
j
l
)
(where C
X
and C
Y
are combinations of X and Y ).
U is a called a generalized Ustatistic.
Example 12.1.17:
Let X
1
, . . . , X
m
and Y
1
, . . . , Y
n
be independent random samples from F and G, respectively,
with F, G T. We wish to estimate
g(F, G) = P
F,G
(X Y ).
Let us dene
Z
ij
=
_
1, X
i
Y
j
0, X
i
> Y
j
for each pair X
i
, Y
j
, i = 1, 2, . . . , m, j = 1, 2, . . . , n.
Then
m
i=1
Z
ij
is the number of Xs Y
j
, and
n
j=1
Z
ij
is the number of Y s > X
i
.
E(I(X
i
Y
j
)) = g(F, G) = P
F,G
(X Y ),
and degrees k and l are = 1, so we use
U(X, Y ) =
1
_
m
1
__
n
1
_
C
X
C
Y
T
s
(X
i
1
, . . . , X
i
k
, Y
j
1
, . . . , Y
j
l
)
=
(m1)!(n 1)!
m!n!
C
X
C
Y
1
1!1!
P
X
P
Y
T(X
i
1
, . . . , X
i
k
, Y
j
1
, . . . , Y
j
l
)
=
1
mn
m
i=1
n
j=1
I(X
i
Y
j
).
This MannWhitney estimator (or Wilcoxin 2Sample estimator) is unbiased and
symmetric in the Xs and Y s. It follows by Corollary 12.1.15 that it has minimum variance.
157
12.2 Single-Sample Hypothesis Tests
Let X
1
, . . . , X
n
be a sample from a distribution F. The problem of t is to test the hypoth-
esis that the sample X
1
, . . . , X
n
is from some specied distribution against the alternative
that it is from some other distribution, i.e., H
0
: F = F
0
vs. H
1
: F(x) ,= F
0
(x) for some x.
Denition 12.2.1:
Let X
1
, . . . , X
n
iid
F, and let the corresponding empirical cdf be
F
n
(x) =
1
n
n
i=1
I
(,x]
(X
i
).
The statistic
D
n
= sup
x
[ F
n
(x) F(x) [
is called the twosided KolmogorovSmirnov statistic (KS statistic).
The onesided KS statistics are
D
+
n
= sup
x
[F
n
(x) F(x)] and D
n
= sup
x
[F(x) F
n
(x)].
Theorem 12.2.2:
For any continuous distribution F, the KS statistics D
n
, D
n
, D
+
n
are distribution free.
Proof:
Let X
(1)
, . . . , X
(n)
be the order statistics of X
1
, . . . , X
n
, i.e., X
(1)
X
(2)
. . . X
(n)
, and
dene X
(0)
= and X
(n+1)
= +.
Then,
F
n
(x) =
i
n
for X
(i)
x < X
(i+1)
, i = 0, . . . , n.
Therefore,
D
+
n
= max
0in
sup
X
(i)
x<X
(i+1)
[
i
n
F(x)]
= max
0in
i
n
[ inf
X
(i)
x<X
(i+1)
F(x)]
()
= max
0in
i
n
F(X
(i)
)
= max max
1in
_
i
n
F(X
(i)
)
_
, 0
() holds since F is nondecreasing in [X
(i)
, X
(i+1)
).
158
Note that D
+
n
is a function of F(X
(i)
). In order to make some inference about D
+
n
, the dis-
tribution of F(X
(i)
) must be known. We know from the Probability Integral Transformation
(see Rohatgi, page 203, Theorem 1) that for a rv X with continuous cdf F
X
, it holds that
F
X
(X) U(0, 1).
Thus, F(X
(i)
) is the i
th
order statistic of a sample from U(0, 1), independent from F. There-
fore, the distribution of D
+
n
is independent of F.
Similarly, the distribution of
D
n
= max max
1in
_
F(X
(i)
)
i 1
n
_
, 0
is independent of F.
Since
D
n
= sup
x
[ F
n
(x) F(x) [= max D
+
n
, D
n
,
the distribution of D
n
is also independent of F.
Theorem 12.2.3:
If F is continuous, then
P(D
n
+
1
2n
) =
0, if 0
_
+
1
2n
1
2n
_
+
3
2n
3
2n
. . .
_
+
2n1
2n
2n1
2n
f(u)du, if 0 < <
2n1
2n
1, if
2n1
2n
where
f(u) = f(u
1
, . . . , u
n
) =
_
n!, if 0 < u
1
< u
2
< . . . < u
n
< 1
0, otherwise
is the joint pdf of an order statistic of a sample of size n from U(0, 1).
Note:
As Gibbons & Chakraborti (1992), page 108109, point out, this result must be interpreted
carefully. Consider the case n = 2.
For 0 < <
3
4
, it holds that
P(D
2
+
1
4
) =
_
+
1
4
1
4
0<u
1
<u
2
<1
_
+
3
4
3
4
2! du
2
du
1
.
Note that the integration limits overlap if
+
1
4
+
3
4
1
4
159
When 0 < <
1
4
, it automatically holds that 0 < u
1
< u
2
< 1. Thus, for 0 < <
1
4
, it holds
that
P(D
2
+
1
4
) =
_
+
1
4
1
4
_
+
3
4
3
4
2! du
2
du
1
= 2!
_
+
1
4
1
4
_
u
2
[
+
3
4
3
4
_
du
1
= 2!
_
+
1
4
1
4
2 du
1
= 2! (2) u
1
[
+
1
4
1
4
= 2! (2)
2
For
1
4
<
3
4
, the region of integration is as follows:
u1
u2
1
1
0
1/4 + nu
3/4 - nu
3/4 - nu
Area 1 Area 2
Note to Theorem 12.2.3
Thus, for
1
4
<
3
4
, it holds that
P(D
2
+
1
4
) =
_
+
1
4
1
4
0<u
1
<u
2
<1
_
+
3
4
3
4
2! du
2
du
1
=
_
+
1
4
3
4
_
1
u
1
2! du
2
du
1
+
_ 3
4
0
_
1
3
4
2! du
2
du
1
= 2
_
_
+
1
4
3
4
_
u
2
[
1
u
1
_
du
1
+
_ 3
4
0
_
u
2
[
1
3
4
_
du
1
_
= 2
_
_
+
1
4
3
4
(1 u
1
) du
1
+
_ 3
4
0
(1
3
4
+) du
1
_
160
= 2
_
u
1
u
2
1
2
_
+
1
4
3
4
+
_
u
1
4
+u
1
_
3
4
= 2
_
( +
1
4
)
( +
1
4
)
2
2
( +
3
4
) +
( +
3
4
)
2
2
+
( +
3
4
)
4
+( +
3
4
)
_
= 2
_
+
1
4
2
2
4
1
32
+
3
4
+
2
2
3
4
+
9
32
4
+
3
16
2
+
3
4
_
= 2
_
2
+
3
2
1
16
_
= 2
2
+ 3
1
8
Combining these results gives
P(D
2
+
1
4
) =
0, if 0
2! (2)
2
, if 0 < <
1
4
2
2
+ 3
1
8
, if
1
4
<
3
4
1, if
3
4
Theorem 12.2.4:
Let F be a continuous cdf. Then it holds z 0:
lim
n
P(D
n
z
n
) = L
1
(z) = 1 2
i=1
(1)
i1
exp(2i
2
z
2
).
Theorem 12.2.5:
Let F be a continuous cdf. Then it holds:
P(D
+
n
z) = P(D
n
z) =
0, if z 0
_
1
1z
_
un
n1
n
z
. . .
_
u
3
2
n
z
_
u
2
1
n
z
f(u)du, if 0 < z < 1
1, if z 1
where f(u) is dened in Theorem 12.2.3.
Note:
It should be obvious that the statistics D
+
n
and D
n
have the same distribution because of
symmetry.
161
Theorem 12.2.6:
Let F be a continuous cdf. Then it holds z 0:
lim
n
P(D
+
n
z
n
) = lim
n
P(D
n
z
n
) = L
2
(z) = 1 exp(2z
2
)
Corollary 12.2.7:
Let V
n
= 4n(D
+
n
)
2
. Then it holds V
n
d
2
2
, i.e., this transformation of D
+
n
has an asymptotic
2
2
distribution.
Proof:
Let x 0. Then it follows:
lim
n
P(V
n
x)
x=4z
2
= lim
n
P(V
n
4z
2
)
= lim
n
P(4n(D
+
n
)
2
4z
2
)
= lim
n
P(
nD
+
n
z)
Th.12.2.6
= 1 exp(2z
2
)
4z
2
=x
= 1 exp(x/2)
Thus, lim
n
P(V
n
x) = 1 exp(x/2) for x 0. Note that this is the cdf of a
2
2
distribu-
tion.
Denition 12.2.8:
Let D
n;
be the smallest value such that P(D
n
> D
n;
) . Likewise, let D
+
n;
be the
smallest value such that P(D
+
n
> D
+
n;
) .
The KolmogorovSmirnov test (KS test) rejects H
0
: F(x) = F
0
(x) x at level if
D
n
> D
n;
.
It rejects H
0
: F(x) F
0
(x) x at level if D
n
> D
+
n;
and it rejects H
0
: F(x) F
0
(x) x
at level if D
+
n
> D
+
n;
.
Note:
Rohatgi, Table 7, page 661, gives values of D
n;
and D
+
n;
for selected values of and small
n. Theorems 12.2.4 and 12.2.6 allow the approximation of D
n;
and D
+
n;
for large n.
162
Example 12.2.9:
Let X
1
, . . . , X
n
C(1, 0). We want to test whether H
0
: X N(0, 1).
The following data has been observed for x
(1)
, . . . , x
(10)
:
1.42, 0.43, 0.19, 0.26, 0.30, 0.45, 0.64, 0.96, 1.97, and 4.68
The results for the KS test have been obtained through the following SPlus session, i.e.,
D
+
10
= 0.02219616, D
10
= 0.3025681, and D
10
= 0.3025681:
> x _ c(-1.42, -0.43, -0.19, 0.26, 0.30, 0.45, 0.64, 0.96, 1.97, 4.68)
> FX _ pnorm(x)
> FX
[1] 0.07780384 0.33359782 0.42465457 0.60256811 0.61791142 0.67364478
[7] 0.73891370 0.83147239 0.97558081 0.99999857
> Dp _ (1:10)/10 - FX
> Dp
[1] 2.219616e-02 -1.335978e-01 -1.246546e-01 -2.025681e-01 -1.179114e-01
[6] -7.364478e-02 -3.891370e-02 -3.147239e-02 -7.558081e-02 1.434375e-06
> Dm _ FX - (0:9)/10
> Dm
[1] 0.07780384 0.23359782 0.22465457 0.30256811 0.21791142 0.17364478
[7] 0.13891370 0.13147239 0.17558081 0.09999857
> max(Dp)
[1] 0.02219616
> max(Dm)
[1] 0.3025681
> max(max(Dp), max(Dm))
[1] 0.3025681
>
> ks.gof(x, alternative = "two.sided", mean = 0, sd = 1)
One-sample Kolmogorov-Smirnov Test
Hypothesized distribution = normal
data: x
ks = 0.3026, p-value = 0.2617
alternative hypothesis:
True cdf is not the normal distn. with the specified parameters
Using Rohatgi, Table 7, page 661, we have to use D
10;0.20
= 0.323 for = 0.20. Since
D
10
= 0.3026 < 0.323 = D
10;0.20
, it is p > 0.20. The KS test does not reject H
0
at level
= 0.20. As SPlus shows, the precise pvalue is even p = 0.2617.
163
Note:
Comparison between
2
and KS goodness of t tests:
KS uses all available data;
2
bins the data and loses information
KS works for all sample sizes;
2
requires large sample sizes
it is more dicult to modify KS for estimated parameters;
2
can be easily adapted
for estimated parameters
KS is conservative for discrete data, i.e., it tends to accept H
0
for such data
the order matters for KS;
2
is better for unordered categorical data
164
12.3 More on Order Statistics
Denition 12.3.1:
Let F be a continuous cdf. A tolerance interval for F with tolerance coecient is
a random interval such that the probability is that this random interval covers at least a
specied percentage 100p% of the distribution.
Theorem 12.3.2:
If order statistics X
(r)
< X
(s)
are used as the endpoints for a tolerance interval for a continuous
cdf F, it holds that
=
sr1
i=0
_
n
i
_
p
i
(1 p)
ni
.
Proof:
According to Denition 12.3.1, it holds that
= P
X
(r)
,X
(s)
_
P
X
(X
(r)
< X < X
(s)
) p
_
.
Since F is continuous, it holds that F
X
(X) U(0, 1). Therefore,
P
X
(X
(r)
< X < X
(s)
) = P(X < X
(s)
) P(X X
(r)
)
= F(X
(s)
) F(X
(r)
)
= U
(s)
U
(r)
,
where U
(s)
and U
(r)
are the order statistics of a U(0, 1) distribution.
Thus,
= P
X
(r)
,X
(s)
_
P
X
(X
(r)
< X < X
(s)
) p
_
= P(U
(s)
U
(r)
p).
By Therorem 4.4.4, we can determine the joint distribution of order statistics and calculate
as
=
_
1
p
_
yp
0
n!
(r 1)!(s r 1)!(n s)!
x
r1
(y x)
sr1
(1 y)
ns
dx dy.
Rather than solving this integral directly, we make the transformation
U = U
(s)
U
(r)
V = U
(s)
.
Then the joint pdf of U and V is
f
U,V
(u, v) =
n!
(r1)!(sr1)!(ns)!
(v u)
r1
u
sr1
(1 v)
ns
, if 0 < u < v < 1
0, otherwise
165
and the marginal pdf of U is
f
U
(u) =
_
1
0
f
U,V
(u, v) dv
=
n!
(r 1)!(s r 1)!(n s)!
u
sr1
I
(0,1)
(u)
_
1
u
(v u)
r1
(1 v)
ns
dv
(A)
=
n!
(r 1)!(s r 1)!(n s)!
u
sr1
(1 u)
ns+r
I
(0,1)
(u)
_
1
0
t
r1
(1 t)
ns
dt
. .
B(r,ns+1)
=
n!
(r 1)!(s r 1)!(n s)!
u
sr1
(1 u)
ns+r
(r 1)!(n s)!
(n s +r)!
I
(0,1)
(u)
=
n!
(n s +r)!(s r 1)!
u
sr1
(1 u)
ns+r
I
(0,1)
(u)
= n
_
n 1
s r 1
_
u
sr1
(1 u)
ns+r
I
(0,1)
(u).
(A) is based on the transformation t =
v u
1 u
, v u = (1 u)t, 1 v = 1 u (1 u)t =
(1 u)(1 t) and dv = (1 u)dt.
It follows that
= P(U
(s)
U
(r)
p)
= P(U p)
=
_
1
p
n
_
n 1
s r 1
_
u
sr1
(1 u)
ns+r
du
(B)
= P(Y < s r) [ where Y Bin(n, p)
=
sr1
i=0
_
n
i
_
p
i
(1 p)
ni
.
(B) holds due to Rohatgi, Remark 3 after Theorem 5.3.18, page 216, since for X Bin(n, p),
it holds that
P(X < k) =
_
1
p
n
_
n 1
k 1
_
x
k1
(1 x)
nk
dx.
166
Example 12.3.3:
Let s = n and r = 1. Then,
=
n2
i=0
_
n
i
_
p
i
(1 p)
ni
= 1 p
n
np
n1
(1 p).
If p = 0.8 and n = 10, then
10
= 1 (0.8)
10
10 (0.8)
9
(0.2) = 0.624,
i.e., (X
(1)
, X
(10)
) denes a 62.4% tolerance interval for 80% probability.
If p = 0.8 and n = 20, then
20
= 1 (0.8)
20
20 (0.8)
19
(0.2) = 0.931,
and if p = 0.8 and n = 30, then
30
= 1 (0.8)
30
30 (0.8)
29
(0.2) = 0.989.
Theorem 12.3.4:
Let k
p
be the p
th
quantile of a continuous cdf F. Let X
(1)
, . . . , X
(n)
be the order statistics of
a sample of size n from F. Then it holds that
P(X
(r)
k
p
X
(s)
) =
s1
i=r
_
n
i
_
p
i
(1 p)
ni
.
Proof:
It holds that
P(X
(r)
k
p
) = P(at least r of the X
i
s are k
p
)
=
n
i=r
_
n
i
_
p
i
(1 p)
ni
.
Therefore,
P(X
(r)
k
p
X
(s)
) = P(X
(r)
k
p
) P(X
(s)
< k
p
)
=
n
i=r
_
n
i
_
p
i
(1 p)
ni
i=s
_
n
i
_
p
i
(1 p)
ni
=
s1
i=r
_
n
i
_
p
i
(1 p)
ni
.
167
Corollary 12.3.5:
(X
(r)
, X
(s)
) is a level
s1
i=r
_
n
i
_
p
i
(1 p)
ni
condence interval for k
p
.
Example 12.3.6:
Let n = 10. We want a 95% condence interval for the median, i.e., k
p
where p =
1
2
.
We get the following probabilities p
r,s
=
s1
i=r
_
n
i
_
p
i
(1 p)
ni
that (X
(r)
, X
(s)
) covers k
0.5
:
p
r,s
s
2 3 4 5 6 7 8 9 10
1 0.01 0.05 0.17 0.38 0.62 0.83 0.94 0.99 0.998
2 0.04 0.16 0.37 0.61 0.82 0.93 0.98 0.99
3 0.12 0.32 0.57 0.77 0.89 0.93 0.94
4 0.21 0.45 0.66 0.77 0.82 0.83
r 5 0.25 0.45 0.57 0.61 0.62
6 0.21 0.32 0.37 0.38
7 0.12 0.16 0.17
8 0.04 0.05
9 0.01
Only the random intervals (X
(1)
, X
(9)
), (X
(1)
, X
(10)
), (X
(2)
, X
(9)
), and (X
(2)
, X
(10)
) give the
desired coverage probability. Therefore, we use the one that comes closest to 95%, i.e.,
(X
(2)
, X
(9)
), as the 95% condence interval for the median.
168
13 Some Results from Sampling
13.1 Simple Random Samples
Denition 13.1.1:
Let be a population of size N with mean and variance
2
. A sampling method (of size
n) is called simple if the set S of possible samples contains all combinations of n elements of
(without repetition) and the probability for each sample s S to become selected depends
only on n, i.e., p(s) =
1
(
N
n
)
s S. Then we call s S a simple random sample (SRS) of
size n.
Theorem 13.1.2:
Let be a population of size N with mean and variance
2
. Let Y : IR be a measurable
function. Let n
i
be the total number of times the parameter y
i
occurs in the population and
p
i
=
n
i
N
be the relative frequency the parameter y
i
occurs in the population. Let (y
1
, . . . , y
n
)
be a SRS of size n with respect to Y , where P(Y = y
i
) = p
i
=
n
i
N
.
Then the components y
i
, i = 1, . . . , n, are identically distributed as Y and it holds for i ,= j:
P(y
i
= y
k
, y
j
= y
l
) =
1
N(N 1)
n
kl
, where n
kl
=
n
k
n
l
, k ,= l
n
k
(n
k
1), k = l
Note:
(i) In Sampling, many authors use capital letters to denote properties of the population
and small letters to denote properties of the random sample. In particular, x
i
s and y
i
s
are considered as random variables related to the sample. They are not seen as specic
realizations.
(ii) The following equalities hold in the scenario of Theorem 13.1.2:
N =
i
n
i
=
1
N
i
n
i
y
i
2
=
1
N
i
n
i
( y
i
)
2
=
1
N
i
n
i
y
2
i
2
169
Theorem 13.1.3:
Let the same conditions hold as in Theorem 13.1.2. Let y =
1
n
n
i=1
y
i
be the sample mean of a
SRS of size n. Then it holds:
(i) E(y) = , i.e., the sample mean is unbiased for the population mean .
(ii) V ar(y) =
1
n
N n
N 1
2
=
1
n
(1 f)
N
N 1
2
, where f =
n
N
.
Proof:
(i)
E(y) =
1
n
n
i=1
E(y
i
) = , since E(y
i
) = i.
(ii)
V ar(y) =
1
n
2
i=1
V ar(y
i
) + 2
i<j
Cov(y
i
, y
j
)
Cov(y
i
, y
j
) = E(y
i
y
j
) E(y
i
)E(y
j
)
= E(y
i
y
j
)
2
=
k,l
y
k
y
l
P(y
i
= y
k
, y
j
= y
l
)
2
Th.13.1.2
=
1
N(N 1)
k=l
y
k
y
l
n
k
n
l
+
k
y
2
k
n
k
(n
k
1)
2
=
1
N(N 1)
k,l
y
k
y
l
n
k
n
l
k
y
2
k
n
k
2
=
1
N(N 1)
__
k
y
k
n
k
__
l
y
l
n
l
_
k
y
2
k
n
k
_
2
Note (ii)
=
1
N(N 1)
_
N
2
2
N(
2
+
2
)
_
2
=
1
N 1
_
N
2
2
(N 1)
_
=
1
N 1
2
, for i ,= j
170
= V ar(y) =
1
n
2
i=1
V ar(y
i
) + 2
i<j
Cov(y
i
, y
j
)
=
1
n
2
_
n
2
+n(n 1)
_
1
N 1
2
__
=
1
n
_
1
n 1
N 1
_
2
=
1
n
N n
N 1
2
=
1
n
(1
n
N
)
N
N 1
2
=
1
n
(1 f)
N
N 1
2
Theorem 13.1.4:
Let y
n
be the sample mean of a SRS of size n. Then it holds that
_
n
1 f
y
n
_
N
N1
d
N(0, 1),
where N and f =
n
N
is a constant.
In particular, when the y
i
s are 01distributed with E(y
i
) = P(y
i
= 1) = p i, then it holds
that
_
n
1 f
y
n
p
_
N
N1
p(1 p)
d
N(0, 1),
where N and f =
n
N
is a constant.
171
13.2 Stratied Random Samples
Denition 13.2.1:
Let be a population of size N, that is split into m disjoint sets
j
, called strata, of size
N
j
, j = 1, . . . , m, where N =
m
j=1
N
j
. If we independently draw a random sample of size n
j
in
each strata, we speak of a stratied random sample.
Note:
(i) The random samples in each strata are not always SRSs.
(ii) Stratied random samples are used in practice as a means to reduce the sample variance
in the case that data in each strata is homogeneous and data among dierent strata is
heterogeneous.
(iii) Frequently used strata in practice are gender, state (or county), income range, ethnic
background, etc.
Denition 13.2.2:
Let Y : IR be a measurable function. In case of a stratied random sample, we use the
following notation:
Let Y
jk
, j = 1, . . . , m, k = 1, . . . , N
j
be the elements in
j
. Then, we dene
(i) Y
j
=
N
j
k=1
Y
jk
the total in the j
th
strata,
(ii)
j
=
1
N
j
Y
j
the mean in the j
th
strata,
(iii) =
1
N
m
j=1
N
j
j
the expectation (or grand mean),
(iv) N =
m
j=1
Y
j
=
m
j=1
N
j
k=1
Y
jk
the total,
(v)
2
j
=
1
N
j
N
j
k=1
(Y
jk
j
)
2
the variance in the j
th
strata, and
(vi)
2
=
1
N
m
j=1
N
j
k=1
(Y
jk
)
2
the variance.
172
(vii) We denote an (ordered) sample in
j
of size n
j
as (y
j1
, . . . , y
jn
j
) and y
j
=
1
n
j
n
j
k=1
y
jk
the
sample mean in the j
th
strata.
Theorem 13.2.3:
Let the same conditions hold as in Denitions 13.2.1 and 13.2.2. Let
j
be an unbiased
estimate of
j
and
V ar(
j
) be an unbiased estimate of V ar(
j
). Then it holds:
(i) =
1
N
m
j=1
N
j
j
is unbiased for .
V ar( ) =
1
N
2
m
j=1
N
2
j
V ar(
j
).
(ii)
V ar( ) =
1
N
2
m
j=1
N
2
j
V ar(
j
) is unbiased for V ar( ).
Proof:
(i)
E( ) =
1
N
m
j=1
N
j
E(
j
) =
1
N
m
j=1
N
j
j
=
By independence of the samples within each strata,
V ar( ) =
1
N
2
m
j=1
N
2
j
V ar(
j
).
(ii)
E(
V ar( )) =
1
N
2
m
j=1
N
2
j
E(
V ar(
j
)) =
1
N
2
m
j=1
N
2
j
V ar(
j
) = V ar( )
Theorem 13.2.4:
Let the same conditions hold as in Theorem 13.2.3. If we draw a SRS in each strata, then it
holds:
(i) =
1
N
m
j=1
N
j
y
j
is unbiased for , where y
j
=
1
n
j
n
j
k=1
y
jk
, j = 1, . . . , m.
V ar( ) =
1
N
2
m
j=1
N
2
j
1
n
j
(1 f
j
)
N
j
N
j
1
2
j
, where f
j
=
n
j
N
j
.
173
(ii)
V ar( ) =
1
N
2
m
j=1
N
2
j
1
n
j
(1 f
j
)s
2
j
is unbiased for V ar( ), where
s
2
j
=
1
n
j
1
n
j
k=1
(y
jk
y
j
)
2
.
Proof:
For a SRS in the j
th
strata, it follows by Theorem 13.1.3:
E(y
j
) =
j
V ar(y
j
) =
1
n
j
(1 f
j
)
N
j
N
j
1
2
j
Also, we can show that
E(s
2
j
) =
N
j
N
j
1
2
j
.
Now the proof follows directly from Theorem 13.2.3.
Denition 13.2.5:
Let the same conditions hold as in Denitions 13.2.1 and 13.2.2. If the sample in each strata
is of size n
j
= n
N
j
N
, j = 1, . . . , m, where n is the total sample size, then we speak of pro-
portional selection.
Note:
(i) In the case of proportional selection, it holds that f
j
=
n
j
N
j
=
n
N
= f, j = 1, . . . , m.
(ii) Proportional strata cannot always be obtained for each combination of m, n, and N.
Theorem 13.2.6:
Let the same conditions hold as in Denition 13.2.5. If we draw a SRS in each strata, then it
holds in case of proportional selection that
V ar( ) =
1
N
2
1 f
f
m
j=1
N
j
2
j
,
where
2
j
=
N
j
N
j
1
2
j
.
Proof:
The proof follows directly from Theorem 13.2.4 (i).
174
Theorem 13.2.7:
If we draw (1) a stratied random sample that consists of SRSs of sizes n
j
under proportional
selection and (2) a SRS of size n =
m
j=1
n
j
from the same population, then it holds that
V ar(y) V ar( ) =
1
n
N n
N(N 1)
j=1
N
j
(
j
)
2
1
N
m
j=1
(N N
j
)
2
j
.
Proof:
See Homework.
175
14 Some Results from Sequential Statistical Inference
14.1 Fundamentals of Sequential Sampling
Example 14.1.1:
A particular machine produces a large number of items every day. Each item can be either
defective or nondefective. The unknown proportion of defective items in the production
of a particular day is p.
Let (X
1
, . . . , X
m
) be a sample from the daily production where x
i
= 1 when the item is
defective and x
i
= 0 when the item is nondefective. Obviously, S
m
=
m
i=1
X
i
Bin(m, p)
denotes the total number of defective items in the sample (assuming that m is small compared
to the daily production).
We might be interested to test H
0
: p p
0
vs. H
1
: p > p
0
at a given signicance level
and use this decision to trash the entire daily production and have the machine xed if indeed
p > p
0
. A suitable test could be
1
(x
1
, . . . , x
m
) =
1, if s
m
> c
0, if s
m
c
where c is chosen such that
1
is a level test.
However, wouldnt it be more benecial if we sequentially sample the items (e.g., take item
# 57, 623, 1005, 1286, 2663, etc.) and stop the machine as soon as it becomes obvious that
it produces too many bad items. (Alternatively, we could also nish the time consuming and
expensive process to determine whether an item is defective or nondefective if it is impossible
to surpass a certain proportion of defectives.) For example, if for some j < m it already holds
that s
j
> c, then we could stop (and immediately call maintenance) and reject H
0
after only
j observations.
More formally, let us dene T = minj [ S
j
> c and T
and rejects H
0
if
T m. Thus, if we consider R
0
= (x
1
, . . . , x
m
) [ t m and R
1
= (x
1
, . . . , x
m
) [ s
m
> c
as critical regions of two tests
0
and
1
, then these two tests are equivalent.
176
Denition 14.1.2:
Let be the parameter space and / the set of actions the statistician can take. We assume
that the rvs X
1
, X
2
, . . . are observed sequentially and iid with common pdf (or pmf) f
(x).
A sequential decision procedure is dened as follows:
(i) A stopping rule species whether an element of / should be chosen without taking
any further observation. If at least one observation is taken, this rule species for every
set of observed values (x
1
, x
2
, . . . , x
n
), n 1, whether to stop sampling and choose an
action in / or to take another observation x
n+1
.
(ii) A decision rule species the decision to be taken. If no observation has been taken,
then we take action d
0
/. If n 1 observation have been taken, then we take action
d
n
(x
1
, . . . , x
n
) /, where d
n
(x
1
, . . . , x
n
) species the action that has to be taken for
the set (x
1
, . . . , x
n
) of observed values. Once an action has been taken, the sampling
process is stopped.
Note:
In the remainder of this chapter, we assume that the statistician takes at least one observation.
Denition 14.1.3:
Let R
n
IR
n
, n = 1, 2, . . ., be a sequence of Borelmeasurable sets such that the sampling
process is stopped after observing X
1
= x
1
, X
2
= x
2
, . . . , X
n
= x
n
if (x
1
, . . . , x
n
) R
n
. If
(x
1
, . . . , x
n
) / R
n
, then another observation x
n+1
is taken. The sets R
n
, n = 1, 2, . . . are called
stopping regions.
Denition 14.1.4:
With every sequential stopping rule we associate a stopping random variable N which
takes on the values 1, 2, 3, . . .. Thus, N is a rv that indicates the total number of observations
taken before the sampling is stopped.
Note:
We use the (sloppy) notation $\{N = n\}$ to denote the event that sampling is stopped after observing exactly $n$ values $x_1, \ldots, x_n$ (i.e., sampling is not stopped before taking $n$ samples). Then the following equalities hold:

$$\{N = 1\} = R_1$$

$$\{N = n\} = \{(x_1, \ldots, x_n) \in \mathbb{R}^n \mid \text{sampling is stopped after } n \text{ observations but not before}\}
= (R_1 \cup R_2 \cup \ldots \cup R_{n-1})^c \cap R_n
= R_1^c \cap R_2^c \cap \ldots \cap R_{n-1}^c \cap R_n$$

$$\{N \le n\} = \bigcup_{k=1}^{n} \{N = k\}$$

Here we will only consider closed sequential sampling procedures, i.e., procedures where sampling eventually stops with probability 1, i.e.,

$$P(N < \infty) = 1, \qquad P(N = \infty) = 1 - P(N < \infty) = 0.$$
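The event decomposition above translates directly into code. The sketch below (my own illustration; the particular stopping regions are an arbitrary choice, not from the notes) returns $N$ for a given sample path by checking $R_1, R_2, \ldots$ in turn; capping the sample size at $m$, as in the inspection example, is what makes the procedure closed.

import random

def stopping_time(xs, in_region):
    """Return N, the first n with (x_1, ..., x_n) in R_n, i.e. the path was
    not stopped at stages 1, ..., n-1 but is stopped at stage n."""
    for n in range(1, len(xs) + 1):
        if in_region(xs[:n]):
            return n
    return None                              # never stopped within the available data

# Illustrative stopping regions: R_n = {(x_1,...,x_n) : x_1 + ... + x_n > c or n >= m}.
# The cap n >= m is what makes the procedure closed, i.e. P(N < infinity) = 1.
m, c = 50, 8
def in_region(prefix):
    return sum(prefix) > c or len(prefix) >= m

rng = random.Random(2)
path = [1 if rng.random() < 0.10 else 0 for _ in range(m)]
print("N =", stopping_time(path, in_region))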
Theorem 14.1.5: Wald's Equation
Let $X_1, X_2, \ldots$ be iid rvs with $E(|X_1|) < \infty$. Let $N$ be a stopping variable. Let $S_N = \sum_{k=1}^{N} X_k$. If $E(N) < \infty$, then it holds that

$$E(S_N) = E(X_1) E(N).$$
Proof:
Define a sequence of rvs $Y_i$, $i = 1, 2, \ldots$, where

$$Y_i = \begin{cases} 1, & \text{if } N \ge i \\ 0, & \text{if } N < i \end{cases}.$$

Note that $Y_i$ depends only on $X_1, \ldots, X_{i-1}$ (since $\{N \ge i\} = \{N \le i-1\}^c$) and is therefore independent of $X_i$. Obviously, it holds that

$$S_N = \sum_{n=1}^{\infty} X_n Y_n.$$

Thus, it follows that

$$E(S_N) = E\left(\sum_{n=1}^{\infty} X_n Y_n\right). \qquad (*)$$

It holds that

$$\sum_{n=1}^{\infty} E(|X_n Y_n|) = \sum_{n=1}^{\infty} E(|X_n|) E(|Y_n|)
= E(|X_1|) \sum_{n=1}^{\infty} P(N \ge n)
= E(|X_1|) \sum_{n=1}^{\infty} \sum_{k=n}^{\infty} P(N = k)
\stackrel{(A)}{=} E(|X_1|) \sum_{n=1}^{\infty} n P(N = n)
= E(|X_1|) E(N)
< \infty$$

(A) holds due to the following rearrangement of indices:

n   k
1   1, 2, 3, ...
2   2, 3, ...
3   3, ...
...

We may therefore interchange the expectation and summation signs in $(*)$ and get

$$E(S_N) = E\left(\sum_{n=1}^{\infty} X_n Y_n\right)
= \sum_{n=1}^{\infty} E(X_n Y_n)
= \sum_{n=1}^{\infty} E(X_n) E(Y_n)
= E(X_1) \sum_{n=1}^{\infty} P(N \ge n)
= E(X_1) E(N),$$

which completes the proof.
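Wald's Equation is easy to check by simulation. The sketch below (my own illustration; the Bernoulli setting and the threshold c are arbitrary choices) stops at the first time the partial sum reaches c, so that $S_N = c$ by construction, and compares the Monte Carlo estimates of $E(S_N)$ and $E(X_1) E(N)$.

import random

rng = random.Random(3)
p, c = 0.3, 4                         # arbitrary: Bernoulli(p) increments, stop once S_n >= c

def one_path():
    """Stop at N = min{n : S_n >= c}; return (N, S_N)."""
    s, n = 0, 0
    while s < c:
        n += 1
        s += 1 if rng.random() < p else 0
    return n, s

paths = [one_path() for _ in range(100_000)]
E_N  = sum(n for n, _ in paths) / len(paths)
E_SN = sum(s for _, s in paths) / len(paths)
# Wald's Equation: E(S_N) = E(X_1) E(N); here E(S_N) = c = 4 and E(X_1) = p
print(E_SN, p * E_N)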
14.2 Sequential Probability Ratio Tests
Definition 14.2.1:
Let $X_1, X_2, \ldots$ be a sequence of iid rvs with common pdf (or pmf) $f$. We want to test a simple hypothesis $H_0: X \sim f_0$ vs. a simple alternative $H_1: X \sim f_1$ when the observations are taken sequentially.
Let $f_{0n}$ and $f_{1n}$ denote the joint pdfs (or pmfs) of $X_1, \ldots, X_n$ under $H_0$ and $H_1$ respectively, i.e.,

$$f_{0n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_0(x_i) \quad \text{and} \quad f_{1n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_1(x_i).$$

Finally, let

$$\lambda_n(x_1, \ldots, x_n) = \frac{f_{1n}(x)}{f_{0n}(x)},$$

where $x = (x_1, \ldots, x_n)$. Then a sequential probability ratio test (SPRT) for testing $H_0$ vs. $H_1$ is the following decision rule:

(i) If at any stage of the sampling process it holds that

$$\lambda_n(x) \ge A,$$

then stop and reject $H_0$.

(ii) If at any stage of the sampling process it holds that

$$\lambda_n(x) \le B,$$

then stop and accept $H_0$, i.e., reject $H_1$.

(iii) If

$$B < \lambda_n(x) < A,$$

then continue sampling by taking another observation $x_{n+1}$.
Note:
(i) It is usually convenient to define

$$Z_i = \log \frac{f_1(X_i)}{f_0(X_i)},$$

where $Z_1, Z_2, \ldots$ are iid rvs. Then, we work with

$$\log \lambda_n(x) = \sum_{i=1}^{n} z_i = \sum_{i=1}^{n} (\log f_1(x_i) - \log f_0(x_i))$$

instead of using $\lambda_n(x)$. Obviously, we now have to use constants $b = \log B$ and $a = \log A$ instead of the original constants $B$ and $A$.

(ii) $A$ and $B$ (where $A > B$) are constants such that the SPRT will have strength $(\alpha, \beta)$, where

$$\alpha = P(\text{Type I error}) = P(\text{Reject } H_0 \mid H_0)$$

and

$$\beta = P(\text{Type II error}) = P(\text{Accept } H_0 \mid H_1).$$

If $N$ is the stopping rv, then

$$\alpha = P_0(\lambda_N(X) \ge A) \quad \text{and} \quad \beta = P_1(\lambda_N(X) \le B).$$
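In this log form, an SPRT is just a running sum of the $z_i$ compared against the two constants $a$ and $b$. The Python sketch below (my own generic illustration; the function and parameter names are hypothetical) implements the decision rule of Definition 14.2.1 in the log form of Note (i).

def sprt(stream, log_f1, log_f0, a, b):
    """SPRT of Definition 14.2.1 in the log form of Note (i).

    stream          -- iterable yielding x_1, x_2, ... one at a time
    log_f1, log_f0  -- log densities (or log pmfs) under H_1 and H_0
    a, b            -- log stopping bounds, a = log A > b = log B
    Returns (decision, n); decision is 'undecided' if the stream ends first.
    """
    log_lambda, n = 0.0, 0
    for x in stream:
        n += 1
        log_lambda += log_f1(x) - log_f0(x)   # z_i = log f_1(x_i) - log f_0(x_i)
        if log_lambda >= a:
            return "reject H0", n             # lambda_n(x) >= A
        if log_lambda <= b:
            return "accept H0", n             # lambda_n(x) <= B
    return "undecided", n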
Example 14.2.2:
Let $X_1, X_2, \ldots$ be iid $N(\mu, \sigma^2)$, where $\mu$ is unknown and $\sigma^2 > 0$ is known. We want to test $H_0: \mu = \mu_0$ vs. $H_1: \mu = \mu_1$, where $\mu_0 < \mu_1$.
If our data is sampled sequentially, we can construct a SPRT as follows:

$$\log \lambda_n(x) = \sum_{i=1}^{n} \left( -\frac{1}{2\sigma^2}(x_i - \mu_1)^2 - \left( -\frac{1}{2\sigma^2}(x_i - \mu_0)^2 \right) \right)
= \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( (x_i - \mu_0)^2 - (x_i - \mu_1)^2 \right)
= \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( x_i^2 - 2 x_i \mu_0 + \mu_0^2 - x_i^2 + 2 x_i \mu_1 - \mu_1^2 \right)
= \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( -2 x_i \mu_0 + \mu_0^2 + 2 x_i \mu_1 - \mu_1^2 \right)
= \frac{1}{2\sigma^2} \left( \sum_{i=1}^{n} 2 x_i (\mu_1 - \mu_0) + n(\mu_0^2 - \mu_1^2) \right)
= \frac{\mu_1 - \mu_0}{\sigma^2} \left( \sum_{i=1}^{n} x_i - n \frac{\mu_0 + \mu_1}{2} \right)$$

We decide for $H_0$ if

$$\log \lambda_n(x) \le b
\iff \frac{\mu_1 - \mu_0}{\sigma^2} \left( \sum_{i=1}^{n} x_i - n \frac{\mu_0 + \mu_1}{2} \right) \le b
\iff \sum_{i=1}^{n} x_i \le n \frac{\mu_0 + \mu_1}{2} + b^*,$$

where $b^* = \frac{\sigma^2}{\mu_1 - \mu_0} b$.

We decide for $H_1$ if

$$\log \lambda_n(x) \ge a
\iff \frac{\mu_1 - \mu_0}{\sigma^2} \left( \sum_{i=1}^{n} x_i - n \frac{\mu_0 + \mu_1}{2} \right) \ge a
\iff \sum_{i=1}^{n} x_i \ge n \frac{\mu_0 + \mu_1}{2} + a^*,$$

where $a^* = \frac{\sigma^2}{\mu_1 - \mu_0} a$.
[Figure for Example 14.2.2: plot of $\sum_{i=1}^{n} x_i$ against $n$ with the two parallel boundary lines $n \frac{\mu_0 + \mu_1}{2} + b^*$ and $n \frac{\mu_0 + \mu_1}{2} + a^*$; accept $H_0$ below the lower line, continue sampling between the lines, accept $H_1$ above the upper line.]
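The two parallel lines make this SPRT very easy to run. Below is a small Python sketch (my own illustration; all numeric constants are arbitrary) that computes $a^*$ and $b^*$ from log stopping bounds $a$ and $b$, here taken from the Wald approximation discussed after Theorem 14.2.4 below, and samples until the running sum leaves the continuation band.

import math
import random

mu0, mu1, sigma = 0.0, 1.0, 2.0        # arbitrary illustrative values, mu0 < mu1
alpha, beta = 0.05, 0.05
a = math.log((1 - beta) / alpha)       # a = log A*, Wald approximation (Theorem 14.2.4)
b = math.log(beta / (1 - alpha))       # b = log B*

a_star = sigma**2 / (mu1 - mu0) * a    # boundaries for the running sum (Example 14.2.2)
b_star = sigma**2 / (mu1 - mu0) * b

def midline(n):
    return n * (mu0 + mu1) / 2

rng = random.Random(4)
s, n = 0.0, 0
while True:
    n += 1
    s += rng.gauss(mu0, sigma)         # sampling under H_0 here; try mu1 as well
    if s >= midline(n) + a_star:
        print("accept H1 after n =", n, "observations")
        break
    if s <= midline(n) + b_star:
        print("accept H0 after n =", n, "observations")
        break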
Theorem 14.2.3:
For a SPRT with stopping bounds $A$ and $B$, $A > B$, and strength $(\alpha, \beta)$, we have

$$A \le \frac{1 - \beta}{\alpha} \quad \text{and} \quad B \ge \frac{\beta}{1 - \alpha},$$

where $0 < \alpha < 1$ and $0 < \beta < 1$.

Theorem 14.2.4:
Assume we select for given $\alpha, \beta \in (0, 1)$, where $\alpha + \beta \le 1$, the stopping bounds

$$A^* = \frac{1 - \beta}{\alpha} \quad \text{and} \quad B^* = \frac{\beta}{1 - \alpha}.$$

Then it holds that the SPRT with stopping bounds $A^*$ and $B^*$ has strength $(\alpha', \beta')$, where

$$\alpha' \le \frac{\alpha}{1 - \beta}, \qquad \beta' \le \frac{\beta}{1 - \alpha}, \qquad \text{and} \qquad \alpha' + \beta' \le \alpha + \beta.$$
Note:
(i) The approximation $A^* = \frac{1 - \beta}{\alpha}$ and $B^* = \frac{\beta}{1 - \alpha}$ in Theorem 14.2.4 is called the Wald Approximation for the optimal stopping bounds of a SPRT.

(ii) $A^*$ and $B^*$ are functions of $\alpha$ and $\beta$ only and do not depend on the pdfs (or pmfs) $f_0$ and $f_1$. Therefore, they can be computed once and for all $f_i$'s, $i = 0, 1$.
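As a small worked illustration (mine, not from the notes; the function name is hypothetical), the Python snippet below computes the Wald approximation and the guarantees of Theorem 14.2.4 for given error levels.

def wald_bounds(alpha, beta):
    """Wald approximation A* = (1-beta)/alpha, B* = beta/(1-alpha) and the
    resulting guarantees on the realized strength (Theorem 14.2.4)."""
    assert 0 < alpha < 1 and 0 < beta < 1 and alpha + beta <= 1
    return {
        "A*": (1 - beta) / alpha,
        "B*": beta / (1 - alpha),
        "alpha' at most": alpha / (1 - beta),
        "beta' at most": beta / (1 - alpha),
        "alpha' + beta' at most": alpha + beta,
    }

# A* and B* depend on alpha and beta only, not on f_0 or f_1:
print(wald_bounds(0.05, 0.10))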
THE END !!!
Index
α-similar, 108
0-1 Loss, 128
A Posteriori Distribution, 86
A Priori Distribution, 86
Action, 83
Alternative Hypothesis, 91
Ancillary, 56
Asymptotically (Most) Efficient, 73
Basu's Theorem, 56
Bayes Estimate, 87
Bayes Risk, 86
Bayes Rule, 87
Bayesian Confidence Interval, 149
Bayesian Confidence Set, 149
Bias, 58
Borel-Cantelli Lemma, 21
Cauchy Criterion, 23
Centering Constants, 15, 19
Central Limit Theorem, Lindeberg, 33
Central Limit Theorem, Lindeberg-Lévy, 30
Chapman, Robbins, Kiefer Inequality, 71
CI, 135
Closed, 178
Complete, 52, 152
Complete in Relation to T, 152
Composite, 91
Confidence Bound, Lower, 135
Confidence Bound, Upper, 135
Confidence Coefficient, 135
Confidence Interval, 135
Confidence Level, 134
Confidence Sets, 134
Conjugate Family, 90
Consistent, 5
Consistent in the r-th Mean, 45
Consistent, Mean-Squared-Error, 59
Consistent, Strongly, 45
Consistent, Weakly, 45
Continuity Theorem, 29
Contradiction, Proof by, 61
Convergence, Almost Sure, 13
Convergence, In r-th Mean, 10
Convergence, In Absolute Mean, 10
Convergence, In Distribution, 2
Convergence, In Law, 2
Convergence, In Mean Square, 10
Convergence, In Probability, 5
Convergence, Strong, 13
Convergence, Weak, 2
Convergence, With Probability 1, 13
Convergence-Equivalent, 23
Cramér-Rao Lower Bound, 67, 68
Credible Set, 149
Critical Region, 92
CRK Inequality, 71
CRLB, 67, 68
Decision Function, 83
Decision Rule, 177
Degree, 153
Distribution, A Posteriori, 86
Distribution, Population, 36
Distribution, Sampling, 2
Distribution-Free, 152
Domain of Attraction, 32
Efficiency, 73
Efficient, Asymptotically (Most), 73
Efficient, More, 73
Efficient, Most, 73
Empirical CDF, 36
Empirical Cumulative Distribution Function, 36
Equivalence Lemma, 23
Error, Type I, 92
Error, Type II, 92
Estimable, 153
Estimable Function, 58
Estimate, Bayes, 87
Estimate, Maximum Likelihood, 77
Estimate, Method of Moments, 75
Estimate, Minimax, 84
Estimate, Point, 44
Estimator, 44
Estimator, Mann-Whitney, 157
Estimator, Wilcoxon 2-Sample, 157
Exponential Family, One-Parameter, 53
F-Test, 126
Factorization Criterion, 50
Family of CDFs, 44
Family of Confidence Sets, 134
Family of PDFs, 44
Family of PMFs, 44
Family of Random Sets, 134
Fisher Information, 68
Formal Invariance, 111
Generalized U-Statistic, 157
Glivenko-Cantelli Theorem, 37
Hypothesis, Alternative, 91
Hypothesis, Null, 91
Independence of $\bar{X}$ and $S^2$, 41
Induced Function, 46
Inequality, Kolmogorov's, 22
Interval, Random, 134
Invariance, Measurement, 111
Invariant, 46, 110
Invariant Test, 111
Invariant, Location, 47
Invariant, Maximal, 112
Invariant, Permutation, 47
Invariant, Scale, 47
KS Statistic, 158
KS Test, 162
Kernel, 153
Kernel, Symmetric, 153
Khintchine's Weak Law of Large Numbers, 18
Kolmogorov's Inequality, 22
Kolmogorov's SLLN, 25
Kolmogorov-Smirnov Statistic, 158
Kolmogorov-Smirnov Test, 162
Kronecker's Lemma, 22
Landau Symbols O and o, 31
Lehmann-Scheffé, 65
Level of Significance, 93
Level-α Test, 93
Likelihood Function, 77
Likelihood Ratio Test, 115
Likelihood Ratio Test Statistic, 115
Lindeberg Central Limit Theorem, 33
Lindeberg Condition, 33
Lindeberg-Lévy Central Limit Theorem, 30
LMVUE, 60
Locally Minimum Variance Unbiased Estimate, 60
Location Invariant, 47
Logic, 61
Loss Function, 83
Lower Confidence Bound, 135
LRT, 115
Mann-Whitney Estimator, 157
Maximal Invariant, 112
Maximum Likelihood Estimate, 77
Mean Square Error, 59
Mean-Squared-Error Consistent, 59
Measurement Invariance, 111
Method of Moments Estimate, 75
Minimax Estimate, 84
Minimax Principle, 84
Minimal Sufficient, 56
MLE, 77
MLR, 101
MOM, 75
Monotone Likelihood Ratio, 101
More Efficient, 73
Most Efficient, 73
Most Powerful Test, 93
MP, 93
MSE-Consistent, 59
Neyman-Pearson Lemma, 96
Nonparametric, 152
Nonrandomized Test, 93
Normal Variance Tests, 120
Norming Constants, 15, 19
NP Lemma, 96
Null Hypothesis, 91
One-Sample t-Test, 124
One-Tailed t-Test, 124
Paired t-Test, 125
Parameter Space, 44
Parametric Hypothesis, 91
Permutation Invariant, 47
Pivot, 138
Point Estimate, 44
Point Estimation, 44
Population Distribution, 36
Posterior Distribution, 86
Power, 93
Power Function, 93
Prior Distribution, 86
Probability Integral Transformation, 159
Probability Ratio Test, Sequential, 180
Problem of Fit, 158
Proof by Contradiction, 61
Proportional Selection, 174
Random Interval, 134
Random Sample, 36
Random Sets, 134
Random Variable, Stopping, 177
Randomized Test, 93
Rao-Blackwell, 64
Rao-Blackwellization, 65
Realization, 36
Regularity Conditions, 68
Risk Function, 83
Risk, Bayes, 86
Sample, 36
Sample Central Moment of Order k, 37
Sample Mean, 36
Sample Moment of Order k, 37
Sample Statistic, 36
Sample Variance, 36
Sampling Distribution, 2
Scale Invariant, 47
Selection, Proportional, 174
Sequential Decision Procedure, 177
Sequential Probability Ratio Test, 180
Significance Level, 93
Similar, 108
Similar, α-, 108
Simple, 91, 169
Simple Random Sample, 169
Size, 93
Slutsky's Theorem, 9
SPRT, 180
SRS, 169
Stable, 32
Statistic, 2, 36
Statistic, Kolmogorov-Smirnov, 158
Statistic, Likelihood Ratio Test, 115
Stopping Random Variable, 177
Stopping Regions, 177
Stopping Rule, 177
Strata, 172
Stratified Random Sample, 172
Strong Law of Large Numbers, Kolmogorov's, 25
Strongly Consistent, 45
Sufficient, 48, 152
Sufficient, Minimal, 56
Symmetric Kernel, 153
t-Test, 124
Tail-Equivalent, 23
Taylor Series, 31
Test Function, 93
Test, Invariant, 111
Test, Kolmogorov-Smirnov, 162
Test, Likelihood Ratio, 115
Test, Most Powerful, 93
Test, Nonrandomized, 93
Test, Randomized, 93
Test, Uniformly Most Powerful, 93
Tolerance Coefficient, 165
Tolerance Interval, 165
Two-Sample t-Test, 124
Two-Tailed t-Test, 124
Type I Error, 92
Type II Error, 92
U-Statistic, 154
U-Statistic, Generalized, 157
UMA, 135
UMAU, 145
UMP, 93
UMP α-similar, 109
UMP Invariant, 113
UMP Unbiased, 105
UMPU, 105
UMVUE, 60
Unbiased, 58, 105, 145
Uniformly Minimum Variance Unbiased Estimate, 60
Uniformly Most Accurate, 135
Uniformly Most Accurate Unbiased, 145
Uniformly Most Powerful Test, 93
Unimodal, 139
Upper Confidence Bound, 135
Wald's Equation, 178
Wald Approximation, 183
Weak Law of Large Numbers, 15
Weak Law of Large Numbers, Khintchine's, 18
Weakly Consistent, 45
Wilcoxon 2-Sample Estimator, 157