
ON BASIC CONCEPTS OF STATISTICS

JAROSLAV HAJEK
MATHEMATICAL INSTITUTE OF THE CZECHOSLOVAK ACADEMY OF SCIENCES,
and CHARLES UNIVERSITY, PRAGUE
1. Summary
This paper is a contribution to current discussions on fundamental concepts,
principles, and postulates of statistics. In order to exhibit the basic ideas and
attitudes, mathematical niceties are suppressed as much as possible. The heart
of the paper lies in definitions, simple theorems, and nontrivial examples. The
main issues under analysis are sufficiency, invariance, similarity, conditionality,
likelihood, and their mutual relations.
Section 2 contains a definition of sufficiency for a subparameter (or sufficiency
in the presence of a nuisance parameter), and a criticism of an alternative
definition due to A. N. Kolmogorov [11]. In that section, a comparison of the
principles of sufficiency in the sense of Blackwell-Girshick [2] and in the sense
of A. Birnbaum [1] is added. In theorem 3.5 it is shown that for nuisance
parameters introduced by a group of transformations, the sub-σ-field of invariant
events is sufficient for the respective subparameter.
Section 4 deals with the notion of similarity in the x-space as well as in the
(x, θ)-space, and with related notions such as ancillary and exhaustive statistics.
Confidence intervals and fiducial probabilities are shown to involve a postulate
of "independence under ignorance."
Sections 5 and 6 are devoted to the principles of conditionality and of likeli-
hood, as formulated by A. Birnbaum [1]. Their equivalence is proved and their
strict form is criticized. The two principles deny gains obtainable by mixing
strategies, disregarding that, in non-Bayesian conditions, the expected maximum
conditional risk is generally larger than the maximum overall risk. Therefore,
the notion of "correct" conditioning is introduced, in a general enough way to
include the examples given in the literature to support the conditionality
principle. It is shown that in correct conditioning the maximum risk equals the
expected maximum conditional risk, and that in invariant problems the
sub-σ-field of invariant events yields the deepest correct conditioning.
A proper field of application of the likelihood principle is shown to consist of
families of experiments, in which the likelihood functions, possibly after a com-
mon transformation of the parameter, have approximately normal form with
constant variance. Then each observed likelihood function allows computing the
risk without reference to the particular experiment.
In section 7, some forms of the Bayesian approach are touched upon, such as
those based on diffuse prior densities, or on a family of prior densities.
FIFTH BERKELEY SYMPOSIUM: HAJEK
In section 8, some instructive examples are given with comments.
References are confined to papers quoted only.

2. Sufficiency
The notion of sufficiency does not promote many disputes. Nonetheless, there
are two points, namely sufficiency for a subparameter and the principle of suffi-
ciency, which deserve a critical examination.
2.1. Sufficiency for a subparameter. Let us consider an experiment (X, θ),
where X denotes the observations and θ denotes a parameter. The random
element X takes its values in an x-space, and θ takes its values in a θ-space. A
function τ of θ will be called a subparameter. If θ were replaced by a σ-field, then
τ would be replaced by a sub-σ-field. Under what conditions may we say that a
statistic is sufficient for τ? Sufficiency for τ may also be viewed as sufficiency in
the presence of a nuisance parameter. The present author is aware of only one
attempt in this direction, due to Kolmogorov [11].
DEFINITION 2.1. A statistic T = t(X) is called sufficient for a subparameter τ,
if the posterior distribution of τ, given X = x, depends only on T = t and on the
prior distribution of θ.
Unfortunately, the following theorem shows that the Kolmogorov definition is
void.
THEOREM 2.1. If τ is a nonconstant subparameter, and if T is sufficient for τ
in the sense of definition 2.1, then T is sufficient for θ as well.
PROOF. For simplicity, let us consider the discrete case only. Let T be not
sufficient for θ, and let us try to show that it cannot be sufficient for τ. Since T
is not sufficient for θ, there exist two pairs, (θ₁, θ₂) and (x₁, x₂), such that
(2.1) T(x₁) = T(x₂)
and
(2.2) Pθ₁(X = x₁)/Pθ₂(X = x₁) ≠ Pθ₁(X = x₂)/Pθ₂(X = x₂).
If τ(θ₁) ≠ τ(θ₂), let us consider the following prior distribution: ν(θ = θ₁) =
ν(θ = θ₂) = ½. Then, for τ₁ = τ(θ₁) and τ₂ = τ(θ₂),
(2.3) P(τ = τ₁|X = x₁) = Pθ₁(X = x₁)/[Pθ₁(X = x₁) + Pθ₂(X = x₁)]
and
(2.4) P(τ = τ₁|X = x₂) = Pθ₁(X = x₂)/[Pθ₁(X = x₂) + Pθ₂(X = x₂)].
Obviously, (2.2) entails P(τ = τ₁|X = x₁) ≠ P(τ = τ₁|X = x₂), which implies, in
view of (2.1), that T is not sufficient for τ.
If τ(θ₁) = τ(θ₂) held, we would choose θ₃ such that τ(θ₃) ≠ τ(θ₁) = τ(θ₂). Note
that the equations
(2.5) Pθ₁(X = x₁)/Pθ₃(X = x₁) = Pθ₁(X = x₂)/Pθ₃(X = x₂)
and
(2.6) Pθ₂(X = x₁)/Pθ₃(X = x₁) = Pθ₂(X = x₂)/Pθ₃(X = x₂)
are not compatible with (2.2). Thus either (2.5) or (2.6) does not hold, and the
above reasoning may be accomplished either with (θ₁, θ₃) or (θ₂, θ₃), both pairs
satisfying the conditions τ(θ₁) ≠ τ(θ₃), τ(θ₂) ≠ τ(θ₃). Q.E.D.
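The mechanism of this proof can be checked numerically. The following sketch uses a toy two-point model of our own choosing (all names are illustrative): a constant statistic T trivially satisfies (2.1), while the posterior probability of the first τ-value differs at the two sample points, exactly as in (2.3) and (2.4).

```python
from fractions import Fraction as F

# Toy discrete model (our own illustration) on the x-space {1, 2}:
# two parameter points theta1, theta2 with tau(theta1) != tau(theta2).
P = {
    "theta1": {1: F(3, 4), 2: F(1, 4)},
    "theta2": {1: F(1, 4), 2: F(3, 4)},
}
prior = {"theta1": F(1, 2), "theta2": F(1, 2)}  # the prior used in the proof

def posterior_tau1(x):
    """Posterior probability of tau = tau(theta1) given X = x, as in (2.3)-(2.4)."""
    num = prior["theta1"] * P["theta1"][x]
    den = sum(prior[th] * P[th][x] for th in P)
    return num / den

# A constant T satisfies (2.1), yet the likelihood ratios at x = 1 and x = 2
# differ (condition (2.2)), so the posterior of tau depends on x itself and
# not merely on T(x): T cannot be sufficient for tau in Kolmogorov's sense.
print(posterior_tau1(1), posterior_tau1(2))  # 3/4 versus 1/4
```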
Thus we have to try to define sufficiency in the presence of nuisance parameters
in some less stringent way.
DEFINITION 2.2. Let Cτ be the convex hull of the distributions {Pθ: τ(θ) = τ}
for each possible τ-value. We shall say that T is sufficient for τ if
(i) the distribution of T depends on τ only, that is,
(2.7) Pθ(dt) = Pτ(dt),
and
(ii) there exist distributions Qτ ∈ Cτ such that T is sufficient for the family {Qτ}.
In the same manner we define a sufficient sub-σ-field for τ.
Now we shall prove an analogue of the well-known Rao-Blackwell theorem.
For this purpose, let us consider a decision problem with a convex set D of
decisions d, and with a loss function L(τ, d), which is convex in d for each τ, and
depends on θ only through τ: L(θ, d) = L(τ(θ), d). Applying the minimax principle
to eliminate the nuisance parameter, we associate with each decision function
δ(x) the following risk:
(2.8) R(τ, δ) = sup ∫ L[τ(θ), δ(x)] Pθ(dx),
where the supremum is taken over all θ-values such that τ(θ) = τ. Now, if T is
sufficient for τ in the sense of definition 2.2, we can associate with each decision
function δ(x) another decision function δ̄(t) defined as follows:
(2.9) δ̄(t) = ∫ δ(x) Qτ(dx|T = t),
provided that δ(x) is integrable. Note that the right side of (2.9) does not depend
on τ, since T is sufficient for {Qτ}, according to definition 2.2, and that δ̄(t) ∈ D
for every t in view of the convexity of D. Finally put
(2.10) δ*(x) = δ̄(t(x)).
THEOREM 2.2. Under the above assumptions,
(2.11) R(τ, δ*) ≤ R(τ, δ)
holds for all τ.
PROOF. Since the distribution of T depends on τ only, we have
(2.12) R(τ, δ*) = ∫ L[τ(θ), δ*(x)] Pθ(dx) = ∫ L[τ(θ), δ*(x)] Qτ(dx)
for all θ such that τ(θ) = τ. Furthermore, since L(τ, d) is convex in d,
142 FIFTH BERKELEY SYMPOSIUM: HAJEK

(2.13) ∫ L[τ(θ), δ*(x)] Qτ(dx) ≤ ∫ L[τ(θ), δ(x)] Qτ(dx)
≤ sup ∫ L[τ(θ), δ(x)] Pθ(dx) = R(τ, δ).
This concludes the proof.
Thus, adopting the minimax principle for dealing with nuisance parameters,
and under the due convexity assumptions, one may restrict oneself to decision
procedures depending on the sufficient statistic only. This finds its application
in point estimation and in hypothesis testing, for example.
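The content of theorem 2.2 can be illustrated in its simplest special case, the classical Rao-Blackwell improvement with no nuisance parameter present. The sketch below (the model and all names are our own, chosen for illustration) computes exact quadratic risks in a Bernoulli sample, comparing a crude estimate with its conditional expectation given the sufficient statistic.

```python
from itertools import product

def exact_mse(delta, p, n=5):
    """Exact mean squared error of the estimator delta of p over all 2^n
    outcomes of a Bernoulli(p) sample of size n."""
    mse = 0.0
    for x in product([0, 1], repeat=n):
        prob = 1.0
        for xi in x:
            prob *= p if xi == 1 else 1 - p
        mse += prob * (delta(x) - p) ** 2
    return mse

def crude(x):
    return x[0]                 # uses the first observation only

def improved(x):
    return sum(x) / len(x)      # E[x1 | T = sum(x)]: a function of T only

# The conditional expectation given the sufficient statistic never does worse:
for p in (0.2, 0.5, 0.9):
    assert exact_mse(improved, p) <= exact_mse(crude, p)
# exact_mse(crude, p) equals p(1-p); exact_mse(improved, p) equals p(1-p)/5.
```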
REMARK 2.1. Le Cam [13] presented three parallel ways of defining suffi-
ciency. All of them could probably be used in giving equivalent definitions of
sufficiency for a subparameter. For example, Kolmogorov's definition 2.1 could
be reformulated as follows: T is called sufficient for τ if it is sufficient for each
subsystem {Pθ'} such that {θ'} ⊂ {θ} and θ₁' ≠ θ₂' ⇒ τ(θ₁') ≠ τ(θ₂').
REMARK 2.2. We also could extend the notion of sufficiency by adding some
"limiting points." For example, we could introduce ε-sufficiency, as in Le Cam
[13], and then say that T is sufficient if it is ε-sufficient for every ε > 0, or if it is
sufficient for every compact subset of the θ-space.
REMARK 2.3. (Added in proof.) Definition 2.2 does not satisfy the natural
requirement that T should be sufficient for τ if it is sufficient for some finer
subparameter τ', τ = τ(τ'). To illustrate this point, let us consider a sample
from N(μ, σ²) and put T = (X̄, s²), τ = μ, τ' = (μ, σ²). Then T is not sufficient
for τ in the sense of definition 2.2, since its distribution fails to depend
on μ only. On the other hand, T is sufficient for τ', as is well known. The
definition 2.2 should be corrected as follows.
DEFINITION 2.2*. A statistic T = t(X) is called sufficient for a subparameter τ,
if it is sufficient in the sense of definition 2.2 for some subparameter τ' such that
τ = τ(τ').
REMARK 2.4. (Added in proof.) A more stringent (and, therefore, more conse-
quential) definition of sufficiency for a subparameter is provided by Lehmann [14]
in problem 31 of chapter III: T is sufficient for τ if, first, θ = (τ, η); second,
Pθ(dt) = Pτ(dt); third, Pθ(dx|T = t) = Pη(dx|T = t). If T is sufficient in this
sense, it is also sufficient in the sense of definition 2.2, where we may take
Qτ = P(τ,η₁) for any particular η₁.
2.2. The principle of sufficiency. Comparing the formulations of this princi-
ple as given in Blackwell-Girshick [2] and in A. Birnbaum [1], one feels that
there is an apparent discrepancy. According to Birnbaum's sufficiency principle,
we are not allowed to use randomized tests, for example, while no such impli-
cation follows from the Blackwell-Girshick sufficiency principle. The difference
is serious but easy to explain: Blackwell and Girshick consider only convex situ-
ations (that is, convex D and convex L(θ, d) for each θ), where the Rao-Blackwell
theorem can be proved, while A. Birnbaum has in mind any possible situation.
However, what may be supported in convex situations by a theorem is a rather
stringent postulate in general conditions. (If, in Blackwell-Girshick, the situ-
ation is not convex, they make it convex by allowing randomized decisions.)
In estimating a real parameter with L(θ, d) = (θ − d)², the convexity con-
ditions are satisfied and no randomization is useful. If, however, θ ran
through a discrete subset of real numbers, randomization might bring the same
gains from the minimax point of view as in testing a simple hypothesis against
a simple alternative. And even in convex situations the principle may not exclude
any decision procedure as inadmissible. To this end it is necessary for L(τ, d) to
be strictly convex in d, and not, for example, linear, as in randomized extensions
of nonconvex problems.
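The minimax gain obtainable by randomization can be exhibited in an extreme instance of testing a simple hypothesis against a simple alternative, namely one in which the observation is completely uninformative (a sketch of our own; all names are illustrative).

```python
# Testing a simple hypothesis against a simple alternative with 0-1 loss
# when the observation is completely uninformative (a single possible
# outcome).  A nonrandomized rule must then always accept or always reject.

def max_risk(prob_reject):
    """Maximum of the two error probabilities of the rule that rejects the
    hypothesis with probability prob_reject, whatever the data."""
    type_one = prob_reject        # risk under the hypothesis
    type_two = 1 - prob_reject    # risk under the alternative
    return max(type_one, type_two)

best_deterministic = min(max_risk(0), max_risk(1))   # equals 1
fair_coin = max_risk(0.5)                            # equals 0.5
print(best_deterministic, fair_coin)
```

A principle that forbids the use of the auxiliary coin thus denies halving the maximum risk, which is the point criticized above.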

3. Invariance
Most frequently, the nuisance parameter is introduced by a group of trans-
formations of the x-space onto itself. Then, if we have a finite invariant measure
on the group, we can easily show that the sub-σ-field of invariant events is
sufficient in the presence of the corresponding nuisance parameter.
Let G = {g} be a group of one-to-one transformations of the x-space onto itself.
Take a probability distribution P of X and put
(3.1) Pg(X ∈ A) = P(gX ∈ A).
Then obviously, Ph(X ∈ g⁻¹A) = Pgh(X ∈ A). Let μ be a σ-finite measure and
denote by μg the measure such that μg(A) = μ(gA).
THEOREM 3.1. Let P ≪ μ and μg ≪ μ for all g ∈ G. Then Pg ≪ μ. Denoting
p(x, g) = dPg/dμ and p(x) = dP/dμ, then
(3.2) p(x, g) = p(g⁻¹x)(dμg⁻¹/dμ)(x)
holds. More generally,
(3.3) p(x, h⁻¹g) = p(h(x), g)(dμh/dμ)(x).
PROOF. According to the definition, one can write
(3.4) Pg(X ∈ A) = P(X ∈ g⁻¹A) = ∫_{g⁻¹A} p(x) dμ = ∫_A p(g⁻¹(y)) dμg⁻¹(y)
= ∫_A p(g⁻¹(y))(dμg⁻¹/dμ)(y) dμ(y).
CONDITION 3.1. Let 𝒢 be a σ-field of subsets of G, and let 𝒜 be the σ-field of
subsets of the x-space. Assume that
(i) μg ≪ μ for all g ∈ G,
(ii) p(x, g) is 𝒜 × 𝒢-measurable,
(iii) the functions φh(g) = hg and ψh(g) = gh are 𝒢-measurable,
(iv) there is an invariant probability measure ν on 𝒢, that is, ν(Bg) = ν(gB) =
ν(B) for all g ∈ G and B ∈ 𝒢.
THEOREM 3.2. Under condition 3.1, let us put
(3.5) p̄(x) = ∫ p(x, g) dν(g).
Then, for each h ∈ G,
(3.6) p̄(h(x)) = [(dμh/dμ)(x)]⁻¹ p̄(x).
REMARK 3.1. Note that the first factor on the right-hand side does not de-
pend on p. Obviously, p̄(x, g) = p̄(x) for all g ∈ G.
PROOF. In view of (3.3), we have
(3.7) p̄(h(x)) = ∫ p(h(x), g) dν(g) = [(dμh/dμ)(x)]⁻¹ ∫ p(x, h⁻¹g) dν(g)
= [(dμh/dμ)(x)]⁻¹ ∫ p(x, f) dνh(f)
= [(dμh/dμ)(x)]⁻¹ ∫ p(x, f) dν(f)
= [(dμh/dμ)(x)]⁻¹ p̄(x). Q.E.D.

We shall say that an event A is G-invariant, if gA = A for all g ∈ G. Obviously,
the set of G-invariant events is a sub-σ-field ℬ, and a measurable function f is
ℬ-measurable if and only if f(g(x)) = f(x) for all g ∈ G. Consider now two
distributions P and P₀ and seek the derivative of P relative to P₀ on ℬ, say
[dP/dP₀]ℬ. Assume that P ≪ μ and P₀ ≪ μ, denote p = dP/dμ, p₀ = dP₀/dμ,
and introduce p̄(x) and p̄₀(x) by (3.5).
THEOREM 3.3. Under condition 3.1 and under the assumption that p̄₀(x) = 0
entails p̄(x) = 0 almost μ-everywhere, we have P ≪ P₀ on ℬ and
(3.8) [dP/dP₀]ℬ = p̄(x)/p̄₀(x).
PROOF. Put l(x) = p̄(x)/p̄₀(x). Theorem 3.2 entails l(g(x)) = l(x) for all
g ∈ G, namely, l(x) is ℬ-measurable. Further, for any B ∈ ℬ, in view of (3.2)
and of [p̄₀ = 0] ⊂ [p̄ = 0],
(3.9) ∫_B l(x) dP₀ = ∫_B l(x)p₀(x) dμ = ∫_B l(x)p₀(g⁻¹(x)) dμg⁻¹
= ∫_B l(x)p₀(x, g) dμ = ∫_B l(x)p̄₀(x) dμ = ∫_B p̄(x) dμ
holds. On the other hand,
(3.10) P(B) = ∫_B p(x) dμ = ∫_B p(g⁻¹(x)) dμg⁻¹ = ∫_B p(x, g) dμ = ∫_B p̄(x) dμ.
Thus P(B) = ∫_B l(x) dP₀, B ∈ ℬ, which concludes the proof.
THEOREM 3.4. Let the statistic T = t(X) have an expectation under Ph, h ∈ G.
Let condition 3.1 be satisfied. Then the conditional expectation of T relative to the
sub-σ-field ℬ of G-invariant events and under Ph equals
(3.11) Eh(T|ℬ, x) = ∫ t(g⁻¹(x)) p(x, gh) dν(g) / p̄(x).
PROOF. The proof would follow the lines of the proofs of theorems 3.2
and 3.3.
Consider now a dominated family of probability distributions {Pτ} and define
Pτ,g by (3.1) for each τ. Putting θ = (τ, g), we can say that τ is a subparameter
of θ.
THEOREM 3.5. Under condition 3.1 the sub-σ-field ℬ of G-invariant events is
sufficient for τ in the sense of definition 2.2.
PROOF. First, the G-invariant events have a probability depending on τ only,
that is, Pτ,g(B) = Pτ(B). Second, for
(3.12) Qτ(A) = ∫ Pτ,g(A) dν(g)
we have
(3.13) Qτ(A) = ∫ [∫_A pτ(x, g) dμ] dν(g)
= ∫_A [∫ pτ(x, g) dν(g)] dμ
= ∫_A p̄τ(x) dμ.
Now let P₀ be some probability measure such that Pτ ≪ P₀ ≪ μ, and introduce
p̄₀(x) by (3.5). Then, according to theorem 3.3, p̄τ(x)/p̄₀(x) is ℬ-measurable for
all τ. Thus p̄τ(x) = [p̄τ(x)/p̄₀(x)]p̄₀(x), and ℬ is sufficient for {Qτ}, according to
the factorization criterion.
REMARK 3.2. Considering right-invariant and left-invariant probability
measures on 𝒢 does not provide any generalization. Actually, if ν is right-
invariant, then ν'(B) = ∫ ν(gB) dν(g) is invariant, that is, both right-invariant
and left-invariant.
REMARK 3.3. Theorem 3.3 sometimes remains valid even if there exists only
a right-invariant σ-finite measure ν, as in the case of the groups of
location and/or scale shifts (see [6], chapter II).
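Theorem 3.5 can be verified exactly in a small finite situation. In the sketch below (a model of our own choosing; names are illustrative), the group is Z₅ acting by cyclic shifts on pairs of observations, ν is the uniform (and hence invariant) probability on the group, and the invariant events are generated by the difference of the two coordinates; their probabilities indeed depend on τ only, never on the group element.

```python
from fractions import Fraction as F
from itertools import product

n = 5                                    # the group G: Z_5, acting by cyclic shifts
q = {"tau_a": [F(1, 2), F(1, 4), F(1, 8), F(1, 16), F(1, 16)],
     "tau_b": [F(1, 5)] * 5}             # two subparameter values (illustrative)

def prob(tau, g, x):
    """P_{tau,g} of the point x = (x1, x2): two independent observations
    with probabilities q_tau, shifted by g (mod n), as in (3.1)."""
    return q[tau][(x[0] - g) % n] * q[tau][(x[1] - g) % n]

def prob_invariant(tau, g, d):
    """Probability of the G-invariant event {x: x2 - x1 = d (mod n)}."""
    return sum(prob(tau, g, x) for x in product(range(n), repeat=2)
               if (x[1] - x[0]) % n == d)

# The invariant events have probabilities depending on tau only, never on g,
# which is the first requirement of definition 2.2.
for tau in q:
    for d in range(n):
        assert len({prob_invariant(tau, g, d) for g in range(n)}) == 1
```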

4. Similarity
The concept of similarity plays a very important role in classical statistics,
namely in contributions by J. Neyman and R. A. Fisher. It may be applied in
the x-space as well as in the (x, 0)-space.
4.1. Similarity in the x-space. Consider a family of distributions {Pθ} on
measurable subsets of the x-space. We say that an event A is similar, if its
probability is independent of θ:
(4.1) Pθ(A) = P(A) for all θ.
Obviously the events "whole x-space" and "the empty set" are always similar.
The class of similar events is closed under complementation and formation of
countable disjoint unions. Such classes are called λ-fields by Dynkin [3]. A
λ-field is a broader concept than a σ-field. The class of similar events is usually
not a σ-field, which causes ambiguity in applications of the conditionality
principle, as we shall see. Generally, if both A and B are similar and are not
disjoint, then A ∩ B and A ∪ B may not be similar. Consequently, the system
of σ-fields contained in a λ-field may not include a largest σ-field.
Dynkin [3] calls a system of subsets a π-field, if it is closed under intersection,
and shows that a λ-field containing a π-field contains the smallest σ-field over the
π-field. Thus, for example, if for a vector statistic T = t(X) the events
{x: t(x) ≤ c} are similar for every vector c, then {x: T ∈ A} is similar for every
Borel set A.
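That the class of similar events is a λ-field but generally not a σ-field can be seen already in a four-point example (the particular numbers below are our own illustration): two similar events whose intersection fails to be similar.

```python
from fractions import Fraction as F

# A four-point x-space and two distributions (the numbers are our own).
P0 = {1: F(1, 4), 2: F(1, 4), 3: F(1, 4), 4: F(1, 4)}
P1 = {1: F(1, 10), 2: F(4, 10), 3: F(1, 10), 4: F(4, 10)}

def prob(P, event):
    return sum(P[x] for x in event)

A, B = {1, 2}, {2, 3}

# A and B are both similar, each with probability 1/2 ...
assert prob(P0, A) == prob(P1, A) == F(1, 2)
assert prob(P0, B) == prob(P1, B) == F(1, 2)

# ... yet their intersection {2} is not similar.
print(prob(P0, A & B), prob(P1, A & B))  # 1/4 versus 2/5
```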
More generally, a statistic V = v(X) is called similar, if EθV = ∫ v(x)Pθ(dx)
exists and is independent of θ. We also may define similarity with respect to a
nuisance parameter only. The notion of similarity forms a basis for the definition
of several other important notions.
Ancillary statistics. A statistic U = u(X) is called ancillary, if its distribution
is independent of θ, that is, if the events generated by U are similar. (We have
seen that it suffices that the events {U ≤ c} be similar.)
Correspondingly, a sub-σ-field will be called ancillary, if all its events are
similar.
Exhaustive statistics. A statistic T = t(X) is called exhaustive if (T, U), with
U an ancillary statistic, is a minimal sufficient statistic.
Complete families of distributions. A family {Pθ} is called complete if the only
similar statistics are constants.
Our definition of an exhaustive statistic follows Fisher's explanation (ii) in
([5], p. 49), and the examples given by him.
Denote by Iθ, IθT, IθT(u) Fisher's information for the families {Pθ(dx)},
{Pθ(dt)}, {Pθ(dt|U = u)}, respectively. Then, since (T, U) is sufficient and U is
ancillary,
(4.2) E IθT(U) = Iθ.
If IθT < Iθ, Fisher calls (4.2) "recovering the information lost." What is the
real content of this phrase?
The first interpretation would be that T = t contains all information supplied
by X = x, provided that we know U = u. But knowing both T = t and U = u,
we know (T, U) = (t, u), that is, we know the value of the sufficient statistic.
Thus this interpretation is void, and, moreover, it holds even if U is not ancillary.
A more appropriate interpretation seems to be as follows. Knowing
{Pθ(dt|U = u)}, we may dismiss the knowledge of P(du) as well as of
{Pθ(dt|U = u')} for u' ≠ u. This, however, expresses nothing else than the con-
ditionality principle formulated below.
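The identity (4.2) can be checked numerically in a simple mixture experiment (our own illustration, using a finite-difference approximation to the score): a fair coin U chooses between observing one or two Bernoulli(θ) trials, U is ancillary, and the overall information equals the expectation of the conditional informations.

```python
from math import comb, log

theta, h = 0.3, 1e-6

def binom_pmf(n, th, y):
    return comb(n, y) * th**y * (1 - th)**(n - y)

def fisher_info(pmf, support):
    """Expected squared score at theta, the derivative of the log-likelihood
    being taken by a central finite difference."""
    total = 0.0
    for point in support:
        d = (log(pmf(theta + h, point)) - log(pmf(theta - h, point))) / (2 * h)
        total += pmf(theta, point) * d * d
    return total

# The full observation is (U, Y): a fair coin U in {1, 2}, then Y binomial
# with U trials.  U is ancillary (its distribution is free of theta).
joint_support = [(u, y) for u in (1, 2) for y in range(u + 1)]
overall = fisher_info(lambda th, uy: 0.5 * binom_pmf(uy[0], th, uy[1]),
                      joint_support)

cond = {u: fisher_info(lambda th, y, n=u: binom_pmf(n, th, y), range(u + 1))
        for u in (1, 2)}
expected_conditional = 0.5 * cond[1] + 0.5 * cond[2]

assert abs(overall - expected_conditional) < 1e-4         # the identity (4.2)
assert abs(overall - 1.5 / (theta * (1 - theta))) < 1e-3  # closed form for this model
```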
Fisher makes a twofold use of exhaustive statistics: first, for extending the
scope of fiducial distributions to cases where no appropriate sufficient statistic
exists, and, second, for a (rather unconvincing) eulogy of maximum likelihood
estimates.
The present author is not sure about the appropriateness of restricting our-
selves to "minimal" sufficient statistics in the definition of exhaustive statistics.
Without this restriction, there would rarely exist a minimal exhaustive statistic,
and with this restriction we have to check, in particular cases, whether the
employed sufficient statistic is really minimal.
4.2. Similarity in the (x, θ)-space. Consider a family of distributions {Pθ} on
the x-space, the family of all possible prior distributions {ν} on the θ-space,
and the family of distributions {Rν} on the (x, θ)-space given by
(4.3) Rν(dx dθ) = ν(dθ)Pθ(dx).
Then we may introduce the notion of similarity in the (x, θ)-space in the same
manner as before in the x-space, with {Pθ} replaced by {Rν}. Consequently, the
prior distribution will play the role of an unknown parameter. Thus an event A
in the (x, θ)-space will be called similar, if
(4.4) Rν(A) = R(A)
for all prior distributions ν.
To avoid confusion, we shall call measurable functions of (x, θ) quantities and
not statistics. A quantity H = h(X, θ) will be called similar if its expectation
EH = ∫∫ h(x, θ)ν(dθ)Pθ(dx) is independent of ν. Ancillary statistics in the (x, θ)-
space are called pivotal quantities or distribution-free quantities. Since only the
first component of (x, θ) is observable, the applications of similarity in the (x, θ)-
space are quite different from those in the x-space.
Confidence regions. This method of region estimation dwells on the following
idea. Having a similar event A, whose probability equals 1 − α, with α
very small, and knowing that X = x, we can feel confidence that (x, θ) ∈ A,
namely, that the unknown θ lies within the region S(x) = {θ: (x, θ) ∈ A}. Here,
our confidence that the event A has occurred is based on its high probability, and
this confidence is assumed to be unaffected by the knowledge of X = x. Thus we
assume a sort of independence between A and X, though their joint distribution is
indeterminate. The fact that this intrinsic assumption may be dubious is most
appropriately manifested in cases when S(x) is either empty, so that we know
that A did not occur, or equals the whole θ-space, so that we are sure that A has
occurred. Such a situation arises in the following.
EXAMPLE 1. For this example, θ ∈ [0, 1], x is real, Pθ(dx) is uniform over
[θ, θ + 2], and
(4.5) A = {(x, θ): θ + α < x < θ + 2 − α}.
Then S(x) = ∅ for 3 − α < x ≤ 3 or 0 ≤ x < α, and S(x) = [0, 1] for
1 + α < x < 2 − α. Although this example is somewhat artificial, the difficulty
involved seems to be real.
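The pathology of example 1 is easy to exhibit numerically; in the sketch below (the code names are ours), the event A is similar with probability 1 − α under every θ, and yet S(x) is empty near the ends of the sample space and equals the whole parameter space [0, 1] in the middle.

```python
import random

alpha = 0.1   # the "a" of example 1

def region(x):
    """S(x) = {theta in [0, 1]: theta + alpha < x < theta + 2 - alpha}."""
    lo, hi = max(0.0, x - 2 + alpha), min(1.0, x - alpha)
    return (lo, hi) if lo < hi else None    # None stands for the empty region

# A is similar: under every theta its probability is 1 - alpha.
rng = random.Random(0)
theta = 0.37
hits = sum(theta + alpha < rng.uniform(theta, theta + 2) < theta + 2 - alpha
           for _ in range(200000))
assert abs(hits / 200000 - (1 - alpha)) < 0.01

# Yet S(x) is empty near the ends of the sample space and is all of [0, 1]
# in the middle, where the confidence statement carries no information.
assert region(0.05) is None and region(2.95) is None
assert region(1.5) == (0.0, 1.0)
```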
Fiducial distribution. Let F(t, θ) be the distribution function of T under θ,
and assume that F is continuous in t for every θ. Then H = F(T, θ) is a pivotal
quantity with uniform distribution over [0, 1]. Thus R(F(T, θ) ≤ x) = x. Now,
if F(t, θ) is strictly decreasing in θ for each t, and if F⁻¹(t, x) denotes its inverse for
fixed t, then F(T, θ) ≤ x is equivalent to θ ≥ F⁻¹(T, x). Now, again, if we know
that T = t, and if we feel that it does not affect the probabilities concerning
F(T, θ), we may write
(4.6) x = R(F(T, θ) ≤ x) = R(θ ≥ F⁻¹(T, x)) = R(θ ≥ F⁻¹(T, x)|T = t)
= R(θ ≥ F⁻¹(t, x)|T = t),
namely, for θ̄ = F⁻¹(t, x),
(4.7) R(θ < θ̄|T = t) = 1 − F(t, θ̄).
As we have seen, in confidence regions as well as in fiducial probabilities, a
peculiar postulate of independence is involved. The postulate may be generally
formulated as follows.
Postulate of independence under ignorance. Having a pair (Z, W) such that
the marginal distribution of Z is known, but the joint distribution of (Z, W), as
well as the marginal distribution of W, are unknown, and observing W = w, we
assume that
(4.8) P(Z ∈ A|W = w) = P(Z ∈ A).
J. Neyman gives a different justification of the postulate than R. A. Fisher.
Let us make an attempt to formulate the attitudes of both these authors.
Neyman's justification. Assume that we perform a long series of independent
replications of the given experiment, and denote the results by (Z₁, w₁), …,
(Z_N, w_N), where the wᵢ's are observed numbers and the Zᵢ's are not observable.
Let us decide to accept the hypothesis Zᵢ ∈ A at each replication. Then our
decisions will be correct approximately in 100P% of cases with P = P(Z ∈ A).
Thus, in a long series, our mistakes in determining P(Z ∈ A|W = w) by (4.8),
if any, will compensate each other.
Fisher's justification. Suppose that the only statistics V = v(W), such that
the probabilities P(Z E AIV = v) are well-determined, are constants. Then we
are allowed to take P(Z E AIW = w) = P(Z E A), since our absence of knowl-
edge prevents us from doing anything better.
The above interpretation of Fisher's view is based on the following passage
from his book ([5], pp. 54-55):
"The particular pair of values of θ and T appropriate to a particular experi-
menter certainly belongs to this enlarged set, and within this set the proportion
of cases satisfying the inequality
(4.9) θ > 2nT/χ²
is certainly equal to the chosen probability P. It might, however, have been
true . . . that in some recognizable subset, to which his case belongs, the pro-
portion of cases in which the inequality was satisfied should have some value
other than P. It is the stipulated absence of knowledge a priori of the distribution
of 0, together with the exhaustive character of the statistic T, that makes the
recognition of any such subset impossible, and so guarantees that in his par-
ticular case . . . the general probability is applicable."
To apply our general scheme to the case considered by R. A. Fisher, we should
put Z = 1 if (4.9) is satisfied and Z = 0 otherwise, and W = T.
REMARK 4.1. We can see that Fisher's argumentation applies to a more
specific situation than described in the above postulate. He requires that con-
ditional probabilities are not known for any statistic V = v(W) except for
V = const. Thus the notion of fiducial probabilities is a more special notion than
the notion of confidence regions, since Neyman's justification does not need any
such restrictions.
Fisher's additional requirement is in accord with his requirement that fiducial
probabilities should be based on the minimal sufficient statistics, and in such a
case does not lead to difficulties.
However, if the fiducial probabilities are allowed to be based on exhaustive
statistics, then his requirement is contradictory, since no minimal exhaustive
statistic may exist. Nonetheless, we have the following.
THEOREM 4.1. If a minimal sufficient statistic S = s(X) is complete, then it is
also a minimal exhaustive statistic.
PROOF. If there existed an exhaustive statistic T = t(S) different from S,
there would exist a nonconstant ancillary statistic U = u(S), which contradicts
the assumed completeness of S. Q.E.D.
If S is not complete, then there may exist nonconstant ancillary statistics
U = u(S), and the family corresponding to exhaustive statistics contains a
minimal member if and only if the family of U's contains a maximal member.
REMARK 4.2. Although the two above justifications are unconvincing, they
correspond to habits of human thinking. For example, if one knows that an
individual comes from a subpopulation of a population where the proportion of
individuals with a property A equals P, and if one knows nothing else about the
subpopulation, one applies P to the given individual without hesitation. This is
true even if we know the "name" of the individual, that is, if the subpopulation
consists of a single individual. Fisher is right, if he claims that we would not use
P if the given subpopulation would be a part of a larger one, in which the
proportion is known, too. He does not say, however, what to do if there is no
minimal such larger subpopulation.
In confidence regions the subpopulation consists of pairs (X, θ) such that
X = x. It is true that we know the "proportion" of elements of that subpopula-
tion for which the event A occurs, but we do not know how much of the proba-
bility mass each element carries. Thus we can utilize the knowledge of this
proportion only if it equals 0 or 1, which is exemplified by example 1 but
occurs rather rarely in practice. The problem becomes still more puzzling if the
knowledge of the proportion is based on estimates only, and if these estimates
become less reliable as the subpopulation from which they are derived be-
comes smaller. Is there any reasonable recommendation as to what to do if
we know P(X ∈ A|U₁ = u₁) and P(X ∈ A|U₂ = u₂), but we do not know
P(X ∈ A|U₁ = u₁, U₂ = u₂)?
REMARK 4.3. Fisher attacked violently the Bayes postulate as an adequate
form of expressing mathematically our ignorance. He reinforced this attitude in
([5], p. 20), where we may read: "It is evidently easier for the practitioner of
natural science to recognize the difference between knowing and not knowing
than it seems to be for the more abstract mathematician." On the other hand, as
we have seen, he admits that the absence of knowledge of proportions in a
subpopulation allows us to act in the same manner as if we knew that the pro-
portion is the same as in the whole population. The present author suspects that
this new postulate, if not stronger than the Bayes postulate, is by no means
weaker. Actually, if a proportion P in a population is interpreted as probability
for an individual taken from this population, we tacitly assume that the indi-
vidual has been selected according to the uniform distribution.
One cannot escape the feeling that all attempts to avoid expressing in a
mathematical form the absence of our prior knowledge about the parameter have
eventually been a failure. Of course, the mathematical formalization of ignorance
should be understood broadly enough, including not only prior distributions, but
also the minimax principle, and so on.
4.3. Estimation. Similarity in the (x, θ)-space is widely utilized in estimation.
For example, an estimate θ̂ is unbiased if and only if the quantity H = θ̂ − θ is
similar with zero expectation. Further, similarity provides the following class of
estimation methods: starting with a similar quantity H = h(X, θ) with EH = c,
and observing X = x, we take for θ the solution of the equation
(4.10) h(x, θ) = c.
This class includes the method of maximum likelihood, for
(4.11) h(x, θ) = (∂/∂θ) log pθ(x).
Another method is offered by the pivotal quantity H = F(T, θ) considered in
connection with fiducial probabilities. Since EH = ½, we may take for θ, given
T = t, the solution, if any, of
(4.12) F(t, θ) = ½,
that is, the parameter value for which the observed value is the median.
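Both special cases of the scheme (4.10) can be carried out explicitly for a single observation from an exponential distribution with mean θ (the model is our choice, for illustration): the score equation (4.11) with c = 0 yields the maximum likelihood estimate, and the median equation (4.12) yields t/log 2.

```python
from math import exp, log

def solve(f, lo, hi, tol=1e-12):
    """Bisection for a root of f on [lo, hi], assuming a change of sign."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

t = 2.0   # the single observation; the model is exponential with mean theta

# (4.11) with c = 0: the score equation d/dtheta log p_theta(t) = 0.
mle = solve(lambda th: -1 / th + t / th**2, 0.1, 50.0)
assert abs(mle - t) < 1e-9                # for this model the MLE is t itself

# (4.12): H = F(t, theta) = 1 - exp(-t/theta) is pivotal with EH = 1/2;
# equating it to 1/2 gives the "median" estimate t / log 2.
med = solve(lambda th: (1 - exp(-t / th)) - 0.5, 0.1, 50.0)
assert abs(med - t / log(2)) < 1e-9
```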

5. Conditionality
If the prior distribution ν of θ is known, all statisticians agree that the decisions
given X = x should be made on the basis of the conditional distribution
Rν(dθ|X = x). This exceptional agreement is caused by the fact that then there
is only one distribution in the (x, θ)-space, so that the problems are transferred
from the ground of statistics to the ground of pure probability theory.
The problems arise in situations where conditioning is applied to a family of
distributions. Here two basically different situations must be distinguished ac-
cording to whether the conditioning statistic is ancillary or not. Conditioning
with respect to an ancillary statistic is considered in the conditionality principle
as formulated by A. Birnbaum [1]. On the other hand, for example, conditioning
with respect to a sufficient statistic for a nuisance parameter, successfully used
in constructing most powerful similar tests (see Lehmann [14]), is a quite
different problem and will not be considered here.
Our discussion will concentrate on the conditionality principle, which may be
formulated as follows.
The principle of conditionality. Given an ancillary statistic U, and knowing
U = u, statistical inference should be based on conditional probabilities
Pθ(dx|U = u) only; that is, the probabilities Pθ(dx|U = u') for u' ≠ u and P(du)
should be disregarded.
This principle, if properly illustrated, looks very appealing. Moreover, all
proper Bayesian procedures are concordant with it. We say that a Bayesian
procedure is proper, if it is based on a prior distribution ν established independ-
ently of the experiment. A Bayesian procedure which is not proper is exemplified
by taking for the prior distribution the measure ν given by
(5.1) ν(dθ) = √Iθ dθ,
where Iθ denotes the Fisher information associated with the particular experi-
ment. Obviously, such a Bayesian procedure is not compatible with the principle
of conditionality. (Cf. [4].)
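The dependence of the prior (5.1) on the particular experiment can be exhibited numerically (the binomial/negative-binomial pair below is our own illustration): two sampling rules built from the same Bernoulli trials, hence with likelihoods of the same kernel θˢ(1 − θ)ᶠ, yield non-proportional priors √Iθ.

```python
from math import comb, log, sqrt

def info_binomial(n, theta, h=1e-6):
    """Fisher information of Binomial(n, theta), by the expected squared
    score with a central finite difference."""
    total = 0.0
    for y in range(n + 1):
        p = lambda th, y=y: comb(n, y) * th**y * (1 - th)**(n - y)
        d = (log(p(theta + h)) - log(p(theta - h))) / (2 * h)
        total += p(theta) * d * d
    return total

# Closed forms: I = n/(theta(1-theta)) for n Bernoulli trials observed
# binomially, and I = r/(theta^2(1-theta)) when one samples until the
# r-th success (negative binomial).
assert abs(info_binomial(10, 0.4) - 10 / (0.4 * 0.6)) < 1e-3

# The two square-root-information priors are not proportional: their ratio
# sqrt(10/(t(1-t))) / sqrt(4/(t^2(1-t))) = sqrt(2.5 t) varies with t.
ratios = [sqrt(10 / (t * (1 - t))) / sqrt(4 / (t * t * (1 - t)))
          for t in (0.2, 0.4, 0.6)]
assert max(ratios) - min(ratios) > 0.1
```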
The term "statistical inference" used in the above definition is very vague.
Birnbaum [1] makes use of an equally vague term, "evidential meaning." The
present author sees two possible interpretations within the framework of
decision theory.
Risk interpretation. A decision procedure δ defined on the original experiment
should be associated with the same risk as its restriction associated with the
partial experiment given U = u. Of course this transfer of risk is possible only
after the experiment has been performed and u is known.
Decision rules interpretation. A decision function δ should be interpreted as a
function of two arguments, δ = δ(E, x), with E denoting an experiment from a
class ℰ of experiments and x denoting one of its outcomes. The principle of
conditionality then restricts the class of "reasonable" decision functions to those
for which δ(E, x) = δ(F, x) as soon as F is a subexperiment of E, and x belongs
to the set of outcomes of F (F being a subexperiment of E means that F equals
"E given U = u," where U is some ancillary statistic and u one of its particular
values).
The principle of conditionality, as it stands, is not acceptable in non-Bayesian
conditions, since it denies possible gains obtainable by randomization (mixed
strategies). On the other hand, some examples exhibited to support this principle
(see [1], p. 280) seem to be very convincing. Before attempting to delimit the
area of proper applications of the principle, let us observe that the principle is
generally ambiguous. Actually, since generally no maximal ancillary statistic
exists, it is not clear which ancillary statistic should be chosen for conditioning.
152 FIFTH BERKELEY SYMPOSIUM: HAJEK
Let us denote the conditional distribution given U = u by Pθ(dx|U = u) and
the respective conditional risk by
(5.2) R(θ, U, u) = ∫ L(θ, δ(x)) Pθ(dx|U = u).
In terms of a sub-σ-field ℬ, the same will be denoted by
(5.3) R(θ, ℬ, x) = ∫ L(θ, δ(x)) Pθ(dx|ℬ, x),
where R(θ, ℬ, x), as a function of x, is ℬ-measurable. Since the decision function δ
will be fixed in our considerations, we have deleted it in the symbol for the risk.
DEFINITION 5.1. A conditioning relative to an ancillary statistic U or an
ancillary sub-σ-field ℬ will be called correct, if
(5.4) R(θ, U, u) = R(θ)b(u)
or
(5.5) R(θ, ℬ, x) = R(θ)b(x),
respectively. In (5.4) and (5.5), R(θ) denotes the overall risk, and the function b(x)
is ℬ-measurable. Obviously, Eb(U) = Eb(X) = 1.
The above definition is somewhat too strict, but it covers all examples ex-
hibited in literature to support the conditionality principle.
THEOREM 5.1. If the conditioning relative to an ancillary sub-σ-field ℬ is
correct, then
(5.6) E[sup_θ R(θ, ℬ, X)] = sup_θ R(θ).
PROOF. The proof follows immediately from (5.5). Obviously, for a non-
correct conditioning we may obtain E[sup_θ R(θ, ℬ, X)] > sup_θ R(θ), so that,
from the minimax point of view, conditional reasoning generally disregards
possible gains obtained by mixing (randomization), and, therefore, is hardly
acceptable under non-Bayesian conditions.
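The strict inequality mentioned in the proof is easy to realize concretely. The following toy construction (ours, not the paper's) has an ancillary U acting purely as a randomization device; conditioning on U is not correct, and the expected maximum conditional risk strictly exceeds the maximum overall risk:

```python
# Toy illustration (our construction, not from the paper): an ancillary
# statistic U acting as a built-in randomization device.  theta in {0, 1},
# 0-1 loss, and no observation besides U itself.
thetas = [0, 1]
us = [1, 2]
p_u = {1: 0.5, 2: 0.5}          # U is uniform, hence ancillary

def loss(theta, d):
    return 0 if theta == d else 1

delta = {1: 0, 2: 1}            # rule "guess U - 1": U is used as a fair coin

# overall risk R(theta) = sum_u P(U = u) L(theta, delta(u))
R = {t: sum(p_u[u] * loss(t, delta[u]) for u in us) for t in thetas}
sup_overall = max(R.values())

# expected maximum conditional risk E[sup_theta R(theta, U)]
exp_sup_cond = sum(p_u[u] * max(loss(t, delta[u]) for t in thetas) for u in us)

print(sup_overall, exp_sup_cond)   # 0.5 versus 1.0
```

Here R(θ, U, u) takes the values 0 and 1 and cannot be written in the product form R(θ)b(u), so definition 5.1 fails, and conditioning on U throws away exactly the gain obtained from mixing.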
THEOREM 5.2. As in section 3, consider the family {P_g} generated by a proba-
bility distribution P and a group G of transformations g. Further, assume the loss
function L to be invariant, namely, such that L(hg, δ(h(x))) = L(g, δ(x)) holds for
all h, g, and x. Then the conditioning relative to the ancillary sub-σ-field ℬ of
G-invariant events is correct. That is, R(g) = R and
(5.7) R(g, ℬ, x) = R(ℬ, x) = ∫ L[g, δ(h⁻¹(x))] p(x, hg) dν(h)/p(x).
PROOF. For B ∈ ℬ,
(5.8) ∫_B L(g, δ(x)) p(x, g) dμ(x) = ∫_B L[g, δ(h(y))] p(h(y), g) dμ(h(y))
= ∫_B L(h⁻¹g, δ(y)) p(y, h⁻¹g) dμ(y).
The theorem easily follows from theorem 3.4 and from the above relations.
The following theorem describes an important family of cases of ineffective
conditioning.
THEOREM 5.3. If there exists a complete sufficient statistic T = t(X), then
R(θ, U, u) = R(θ) holds for every ancillary statistic U and every δ which is a
function of T.
PROOF. The theorem follows from the well-known fact (see Lehmann [14],
p. 162) that all ancillary statistics are independent of T, if T is complete and
sufficient.
DEFINITION 5.2. We shall say an ancillary sub-σ-field ℬ yields the deepest
correct conditioning, if for every nonnegative convex function ψ and every other
ancillary sub-σ-field ℬ′ yielding a correct conditioning,
(5.9) Eψ[b(X)] ≥ Eψ[b′(X)]
holds, with b and b′ corresponding to ℬ and ℬ′ by (5.5), respectively.
THEOREM 5.4. Under condition 3.1 and under the conditions of theorem 5.2,
the ancillary sub-σ-field ℬ of G-invariant events yields the deepest correct con-
ditioning. Further, if any other ancillary sub-σ-field ℬ′ possesses this property, then
R(ℬ, x) = R(ℬ′, x) almost μ-everywhere.
PROOF. Let ν denote the invariant measure on G. Take a C ∈ ℬ′ and denote
by χ_C(x) its indicator. Then
(5.10) φ_C(x) = ∫ χ_C(gx) dν(g)
is ℬ-measurable and
(5.11) P(C) = ∫ φ_C(x) p(x) dμ.
Further, since the loss function is invariant,
(5.12) ∫_C L(g, δ(x)) p(x, g) dμ = ∫ χ_C(g(y)) L(g₁, δ(y)) p(y) dμ(y),
where g₁ denotes the identity transformation. Consequently, denoting by R(g, C)
the conditional risk, given C, we have from (5.10) and (5.12),
(5.13) P(C) ∫ R(g, C) dν(g) = ∫ φ_C(x) L(g₁, δ(x)) p(x) dμ.
Now, since R(g) = R, according to theorem 5.2, and since the correctness of ℬ′
entails R(g, C) = R b′_C, we have
(5.14) ∫ R(g, C) dν(g) = R b′_C.
Note that
(5.15) P(C) b′_C = ∫_C b′(x) p(x) dμ.
Further, since φ_C(x) is ℬ-measurable,
(5.16) ∫ φ_C(x) L(g₁, δ(x)) p(x) dμ = ∫ φ_C(x) R(ℬ, x) p(x) dμ
= ∫ φ_C(x) R b(x) p(x) dμ.
By combining (5.13) through (5.16), we obtain
(5.17) P(C) b′_C = ∫ φ_C(x) b(x) p(x) dμ.
Now, let us assume Eψ(b(X)) < ∞. Then, given ε > 0, we choose a finite
partition {C_k}, C_k ∈ ℬ′, such that
(5.18) Eψ(b′(X)) ≤ Σ_k P(C_k) ψ(b′_{C_k}) + ε,
and we note that, in view of (5.11), (5.17), and of the convexity of ψ,
(5.19) P(C_k) ψ(b′_{C_k}) ≤ ∫ φ_{C_k}(x) ψ(b(x)) p(x) dμ,
and, in view of (5.10),
(5.20) Σ_k φ_{C_k}(x) = 1
for every x. Consequently, (5.18) through (5.20) entail
(5.21) Eψ(b′(X)) ≤ Eψ(b(X)) + ε.
Since ε > 0 is arbitrary, (5.9) follows. The case Eψ(b(X)) = ∞ could be treated
similarly.
The second assertion of the theorem follows from the course of proving the
first assertion. Actually, we obtain Eψ(b′(X)) < Eψ(b(X)) for some ψ, unless b′
is a function of b a.e. Also conversely, b must be a function of b′, because the two
conditionings are equally deep. Q.E.D.
A restricted conditionality principle. If all ancillary statistics (sub-σ-fields)
yielding the deepest correct conditioning give the same conditional risk, and if
there exists at least one such ancillary statistic (sub-σ-field), then the use of the
conditional risk is obligatory.

6. Likelihood
Still more attractive than the principle of conditionality is the principle
of likelihood, which may be formulated along the lines of A. Birnbaum [1], as follows.
The principle of likelihood. Statistical inferences should be based on the likeli-
hood functions only, disregarding the other structure of the particular experi-
ments. Here, again, various interpretations are possible. We suggest the follow-
ing.
Interpretation. For a given θ-space, the particular decision procedures should
be regarded as functionals on a space ℒ of likelihood functions, where all functions
differing by a positive multiplier are regarded as equivalent. Given an experi-
ment E such that for all outcomes x the likelihood functions l_x(θ) belong to ℒ,
and a decision procedure δ in the above sense, we should put
(6.1) δ(x) = δ[l_x(·)],
where l_x(θ) = pθ(x).
If ℒ contains only unimodal functions, an estimation procedure of the above
kind is the maximum likelihood estimation method.
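A standard illustration of this interpretation (our example, not the paper's) is the binomial versus negative-binomial pair: the two likelihoods differ by a positive multiplier only, so a procedure of the form (6.1), such as the maximum likelihood estimator, must treat the two experiments identically:

```python
# A standard illustration (not from the paper): two experiments whose
# likelihoods differ by a positive multiplier only, so any procedure that
# is a functional of the likelihood alone, such as maximum likelihood,
# must treat them identically.
from math import comb

n, x = 12, 3            # binomial: n Bernoulli(p) trials, x successes observed
r = 3                   # negative binomial: trials continued until r successes

def lik_binom(p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def lik_negbin(p):      # probability that the r-th success occurs at trial n
    return comb(n - 1, r - 1) * p**r * (1 - p)**(n - r)

grid = [i / 1000 for i in range(1, 1000)]
ratio = [lik_binom(p) / lik_negbin(p) for p in grid]   # constant in p
mle_binom = max(grid, key=lik_binom)
mle_negbin = max(grid, key=lik_negbin)
print(mle_binom, mle_negbin)    # both equal x/n = 0.25
```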
A. Birnbaum [1] proved that the principle of conditionality joined with the
principle of sufficiency is equivalent to the principle of likelihood. He also con-
jectured that the principle of sufficiency may be left out in his theorem, and gave
a hint, which was not quite clear, for proving it. His conjecture is indeed
true if we assume that statistical inferences should be invariant under one-to-one
transformations of the x-space onto some other homeomorphic space. In the following
theorem we shall give the "decision interpretation" to the principle of condition-
ality, and the class 𝓔 of experiments will be regarded as the class of all experi-
ments such that all their possible likelihood functions belong to some space ℒ.
THEOREM 6.1. Under the above stipulations, the principle of conditionality is
equivalent to the principle of likelihood.
PROOF. If F is a subexperiment of E, and if x belongs to the space of possible
outcomes of F, the likelihood functions l_x(θ|E) and l_x(θ|F) differ by a positive
multiplier only; in fact, they are identical. Thus the likelihood principle
entails the conditionality principle.
Further, assume that the likelihood principle is violated for a decision pro-
cedure δ; that is, there exist in 𝓔 two experiments E₁ = (𝒳₁, P₁θ) and E₂ = (𝒳₂, P₂θ)
such that for some points x₁ and x₂,
(6.2) δ(E₁, x₁) ≠ δ(E₂, x₂),
whereas
(6.3) P₁θ(X₁ = x₁) = c P₂θ(X₂ = x₂) for all θ,
with c = c(x₁, x₂), but independent of θ. We here assume the spaces 𝒳₁ and 𝒳₂
finite and disjoint, and the σ-fields to consist of all subsets. Then let us choose a
number λ such that
(6.4) 0 < λ < (1 + c)⁻¹,
and consider the experiment E = (𝒳₁ ∪ 𝒳₂, Pθ), where
(6.5) Pθ(X = x) = λ P₁θ(x) if x ∈ 𝒳₁,
= λc P₂θ(x) if x ∈ 𝒳₂ − {x₂},
= 1 − λ(c + 1 − P₁θ(x₁)) if x = x₂.
Then E conditioned by x ∈ 𝒳₁ coincides with E₁, and E conditioned by x ∈ (𝒳₂ −
{x₂}) ∪ {x₁} coincides with Ẽ₂, which is equivalent to E₂ up to the one-to-one trans-
formation φ(x) = x, if x ∈ 𝒳₂ − {x₂}, and φ(x₂) = x₁. Thus we should have
δ(E₁, x₁) = δ(E, x₁) = δ(Ẽ₂, x₁) = δ(E₂, x₂), according to the conditionality
principle. In view of (6.2) this is not true, so that the conditionality principle is
violated for δ, too. Q.E.D.
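The construction (6.5) can be checked numerically. In the following sketch the parameter set, the sample spaces, and all probabilities are hypothetical numbers of ours, chosen to satisfy (6.3) with c = 2:

```python
# A numerical check of the construction (6.5), with hypothetical numbers:
# theta in {0, 1}, X1 = {"a1", "x1"}, X2 = {"b1", "x2"}, and
# P1theta(x1) = c * P2theta(x2) with c = 2, as required by (6.3).
P1 = {0: {"a1": 0.6, "x1": 0.4}, 1: {"a1": 0.4, "x1": 0.6}}
P2 = {0: {"b1": 0.8, "x2": 0.2}, 1: {"b1": 0.7, "x2": 0.3}}
c, lam = 2.0, 0.25                      # lam < 1/(1 + c), as in (6.4)

def P(theta):                           # the mixture experiment (6.5)
    return {"a1": lam * P1[theta]["a1"],
            "x1": lam * P1[theta]["x1"],
            "b1": lam * c * P2[theta]["b1"],
            "x2": 1 - lam * (c + 1 - P1[theta]["x1"])}

for theta in (0, 1):
    pr = P(theta)
    assert abs(sum(pr.values()) - 1) < 1e-12         # a probability distribution
    # the two conditioning events are ancillary: their probabilities are
    # free of theta (lam and lam * c, respectively)
    assert abs(pr["a1"] + pr["x1"] - lam) < 1e-12
    assert abs(pr["b1"] + pr["x1"] - lam * c) < 1e-12
    # E given X1 reproduces E1; E given {b1, x1} reproduces E2 with the
    # point x2 relabeled as x1
    assert abs(pr["x1"] / lam - P1[theta]["x1"]) < 1e-12
    assert abs(pr["x1"] / (lam * c) - P2[theta]["x2"]) < 1e-12
    assert abs(pr["b1"] / (lam * c) - P2[theta]["b1"]) < 1e-12
print("construction (6.5) verified")
```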
Given a fixed space ℒ, the likelihood principle says what decision procedures
of wide scope (covering many experimental situations) are permissible. Having
any two such procedures, and wanting to compare them, we must either resort
to some particular experiment and compute the risk, or assume some a priori
distribution ν(dθ) and compute the conditional risk. In both cases we leave the
proper ground of the likelihood principle.
However, for some special spaces ℒ, all conceivable experiments with likelihood
functions in ℒ give us the same risk, so that the risk may be regarded as inde-
pendent of the particular experiment. The most important situation of this kind
is treated in the following.
THEOREM 6.2. Let θ be real and let ℒ_σ consist of the following functions:
(6.6) l_t(θ) = C exp[−(t − θ)²/(2σ²)], −∞ < t < ∞,
with σ fixed and positive. Then for each experiment with likelihood functions in ℒ_σ
there exists a complete sufficient statistic T = t(X), which is normally distributed
with expectation θ and variance σ².
PROOF. We have a measurable space (𝒳, 𝒜) with a σ-finite measure μ(dx)
such that the densities with respect to μ allow the representation
(6.7) pθ(x) = c(x) exp[−(t(x) − θ)²/(2σ²)].
This relation shows that T = t(X) is sufficient, by the factorization criterion.
Further,
(6.8) 1 = ∫ pθ(x) μ(dx) = ∫ exp[−θ²/(2σ²) + θt/σ²] μ̃(dt),
where μ̃ = μ* t⁻¹ and μ*(dx) = c(x) exp[−t²(x)/(2σ²)] μ(dx). Now, assuming that
there exist two different measures μ̃ satisfying (6.8), we easily derive a contra-
diction with the completeness of exponential families of distributions (see
Lehmann [14], theorem 1, p. 132). Thus μ̃(dt) = (2πσ²)^(−1/2) exp[−t²/(2σ²)] dt,
and the theorem is proved.
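A minimal numerical check of the theorem (ours) is the normal location sample itself: for n i.i.d. N(θ, τ²) observations, the θ-dependence of the likelihood has exactly the shape (6.6) with t(x) = x̄ and σ² = τ²/n:

```python
# A hedged numerical check of theorem 6.2 in its simplest instance: for a
# sample of n i.i.d. N(theta, tau^2) observations, the theta-dependence of
# the likelihood has the shape (6.6) with t(x) = xbar and sigma^2 = tau^2/n,
# so xbar is the complete sufficient statistic and is N(theta, sigma^2).
import math
import random

random.seed(1)
n, tau2, theta_true = 5, 2.0, 0.7
xs = [random.gauss(theta_true, math.sqrt(tau2)) for _ in range(n)]
xbar = sum(xs) / n
sigma2 = tau2 / n

def loglik(theta):
    return sum(-0.5 * math.log(2 * math.pi * tau2)
               - (x - theta) ** 2 / (2 * tau2) for x in xs)

# log l(theta1) - log l(theta2) must equal the same difference computed
# from the normal shape (6.6) centered at xbar with variance sigma2
for th1, th2 in [(0.0, 1.0), (-0.3, 0.9), (2.0, -2.0)]:
    lhs = loglik(th1) - loglik(th2)
    rhs = ((xbar - th2) ** 2 - (xbar - th1) ** 2) / (2 * sigma2)
    assert abs(lhs - rhs) < 1e-9
print("likelihood shape (6.6) confirmed for the normal sample")
```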
The whole work by Fisher suggests that he associated the likelihood principle
with the above family of likelihoods, and, asymptotically, with families which
in the limit shrink to ℒ_σ for some σ > 0 (see example 8.7). If so, the present
author cannot see any serious objections against the principle, and particularly
against the method of maximum likelihood. Outside of this area, however, the
likelihood principle is misleading, because the information about the kind of
experiment does not lose its value even if we know the likelihood.

7. Vaguely known prior distributions


There are several ways of utilizing a vague knowledge of the prior distribution
ν(dθ). Let us examine three of them.
7.1. Diffuse prior distributions. Assume that ν(dθ) = a⁻¹g(θ/a) dθ, where g(x)
is continuous and bounded. Then, under general conditions the posterior density
p(θ|X = x, a) will tend to a limit for a → ∞, and the limit will, independently
of g, correspond to ν(dθ) = dθ. This will be true especially if the likelihood
functions l(θ|X = x) are strongly unimodal, that is, if −log l(θ|X = x) is convex
in θ for every x, in which case also all moments of finite order will converge.
There are, however, at least two difficulties connected with this approach.
Difficulty 1. However large a fixed a is, for a nonnegligible portion of
θ-values placed at the tails of the prior distribution, the experiment will lead to
such results x that the posterior distribution with ν(dθ) = a⁻¹g(θ/a) dθ will differ
significantly from that with ν(dθ) = dθ. Thus, independently of the rate of
diffusion, our results will be biased in a nonnegligible portion of cases.
Difficulty 2. If interested in rapidly increasing functions of θ, say e^θ, and
trying to estimate them by the posterior expectation, the expectation will diverge
for a → ∞ under usual conditions (see example 8.5).
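Difficulty 2 can be made concrete (a sketch of ours, anticipating example 8.5 below): under the diffuse prior the predictive distribution of a future observation is a location-scale Student t, and the posterior expectation of eˣ diverges because eˣ beats any polynomial tail. The degrees of freedom below are a hypothetical choice:

```python
# Hedged sketch of difficulty 2: E e^X is +infinity under a Student t
# predictive distribution, since e^x beats the polynomial tail.  Partial
# integrals over [0, M] grow without bound.  (df = 4 is a hypothetical
# choice for illustration.)
import math

def t_density(x, df):
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def partial_moment(M, df, steps=20000):
    # trapezoid approximation of the integral of e^x t_df(x) over [0, M]
    h = M / steps
    s = 0.5 * (t_density(0.0, df) + math.exp(M) * t_density(M, df))
    s += sum(math.exp(i * h) * t_density(i * h, df) for i in range(1, steps))
    return s * h

print(partial_moment(10, 4), partial_moment(20, 4))  # the second is far larger
```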
7.2. A family of possible prior distributions. Given a family of prior distri-
butions {ν_a(dθ)}, where a denotes some "metaparameter," we may either
(i) estimate a by â = a(X), or
(ii) draw conclusions which are independent of a.
7.3. Making use of non-Bayesian risks. If we knew the prior distribution
ν(dθ) exactly, we could make use of the Bayesian decision function δ_ν, associating
with every outcome x the decision d = d(x) that minimizes
(7.1) R(ν, x, d) = ∫ L(τ(θ), d) ν(dθ|X = x).
Simultaneously, the respective minimum
(7.2) R(ν, x) = min_{d′∈D} R(ν, x, d′)
could characterize the risk, given X = x.
If the knowledge of ν is vague, we can still utilize δ_ν as a more or less good
solution, the quality of which depends on our luck with the choice of ν. Obviously,
in such a situation, the use of R(ν, x) would be too optimistic. Consequently, we
had better replace it by an estimate R̂(θ, δ_ν) = r(X) of the usual risk
(7.3) R(θ, δ_ν) = ∫ L(θ, δ_ν(x)) Pθ(dx).
Then our notion of risk will remain realistic even if our assumptions concerning
ν are not. Furthermore, comparing R(ν, x) with r(x), we may obtain
information about the appropriateness of the chosen prior distribution.

8. Examples
EXAMPLE 8.1. Let us assume that θ_i = 0 or 1, θ = (θ₁, ⋯, θ_N), and τ =
θ₁ + ⋯ + θ_N. Further, put
(8.1) pθ(x₁, ⋯, x_N) = Π_{i=1}^N [f(x_i)]^{θ_i} [g(x_i)]^{1−θ_i},
where f and g are some one-dimensional densities. In other words, X =
(X₁, ⋯, X_N) is a random sample of size N, each member X_i of which comes
from a distribution with density either f or g. We are interested in estimating the
number τ of the X_i associated with the density f. Let T = (T₁, ⋯, T_N) be the
order statistic, namely T₁ ≤ ⋯ ≤ T_N are the observations X₁, ⋯, X_N re-
arranged in ascending magnitude.
PROPOSITION 8.1. The vector T is a sufficient statistic for τ in the sense of
definition 2.2, and
(8.2) p_τ(t) = p_τ(t₁, ⋯, t_N) = τ!(N − τ)! Σ_{s_τ ∈ S_τ} Π_{i ∈ s_τ} f(t_i) Π_{i ∉ s_τ} g(t_i),
where S_τ denotes the system of all subsets s_τ of size τ from {1, ⋯, N}.
PROOF. A simple application of theorem 3.5 to the permutation group.
PROPOSITION 8.2. For every t,
(8.3) p_{τ+1}(t) p_{τ−1}(t) ≤ [p_τ(t)]²
holds.
PROOF. See [7]. Relation (8.3) means that −log p_τ(t) is convex in τ; that is,
the likelihoods are (strongly) unimodal. Thus we could try to estimate τ by the
method of maximum likelihood, or still better by the posterior expectation for the
uniform prior distribution. If X_i = 0 or 1, then T is equivalent to T′ = X₁ +
⋯ + X_N.
This kind of problem occurs in compound decision making. See H. Robbins
[17].
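Proposition 8.2 can be checked numerically from formula (8.2). In the following sketch N and the densities f, g (two unit-variance normals) are illustrative choices of ours:

```python
# A hedged numerical check of proposition 8.2, using formula (8.2) with a
# small N and two concrete densities chosen for illustration: f the N(1, 1)
# density and g the N(0, 1) density.
import itertools
import math

def norm_pdf(x, mu):
    return math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

f = lambda x: norm_pdf(x, 1.0)
g = lambda x: norm_pdf(x, 0.0)

def p_tau(t, tau):
    # formula (8.2): t is the ordered sample, tau the number of f-members
    N = len(t)
    total = 0.0
    for s in itertools.combinations(range(N), tau):
        prod = 1.0
        for i in range(N):
            prod *= f(t[i]) if i in s else g(t[i])
        total += prod
    return math.factorial(tau) * math.factorial(N - tau) * total

t = sorted([-0.5, 0.2, 0.9, 1.7])
# (8.3): -log p_tau(t) is convex in tau
for tau in range(1, len(t)):
    assert p_tau(t, tau + 1) * p_tau(t, tau - 1) <= p_tau(t, tau) ** 2
print([p_tau(t, tau) for tau in range(len(t) + 1)])
```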
EXAMPLE 8.2. Let 0 < σ² < K, μ real, θ = (μ, σ²), and
(8.4) pθ(x₁, ⋯, x_n) = (2πσ²)^(−n/2) exp{−(1/(2σ²)) Σ_{i=1}^n (x_i − μ)²}.
PROPOSITION 8.3. The statistic
s² = (n − 1)⁻¹ Σ_{i=1}^n (X_i − X̄)²
is sufficient for σ².
PROOF. For given σ² < K we take for the mixing distribution of μ the normal
distribution with zero expectation and the variance (K − σ²)/n. Then we obtain
the mixed density
(8.5) q(x₁, ⋯, x_n) = (2πσ²)^(−n/2) (σ²/K)^(1/2) exp{−(1/(2σ²)) Σ_{i=1}^n (x_i − x̄)² − (n/(2K)) x̄²},
and apply definition 2.2.
REMARK 1. If K → ∞, then the mixing distribution tends to the uniform
distribution over the real line, which is not finite. For practical purposes the
bound K means no restriction, since such a K always exists.
REMARK 2. A collection of examples appropriate for illustrating the notion
of sufficiency for a subparameter can be found in J. Neyman and E. Scott [16].
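The mixed density (8.5), as written here, can be verified numerically (our sketch): quadrature over μ reproduces the closed form, and the ratio of mixed densities for two values of σ² depends on the data only through Σ(x_i − x̄)²:

```python
# A hedged numerical check of the mixed density (8.5) as reconstructed
# above: q is computed by quadrature over mu and compared with the closed
# form, and the ratio q(x; sigma1^2)/q(x; sigma2^2) is seen to depend on
# the data only through the spread sum (x_i - xbar)^2.
import math

K, n = 5.0, 3            # illustrative bound K and sample size (ours)

def q_numeric(xs, s2, steps=4000, lim=10.0):
    # mixture of p_theta over mu ~ N(0, (K - s2)/n), trapezoid rule
    v = (K - s2) / n
    h = 2 * lim / steps
    total = 0.0
    for i in range(steps + 1):
        mu = -lim + i * h
        lik = (2 * math.pi * s2) ** (-n / 2) * math.exp(
            -sum((x - mu) ** 2 for x in xs) / (2 * s2))
        prior = math.exp(-mu * mu / (2 * v)) / math.sqrt(2 * math.pi * v)
        w = 0.5 if i in (0, steps) else 1.0
        total += w * lik * prior
    return total * h

def q_closed(xs, s2):
    # the closed form (8.5)
    xbar = sum(xs) / len(xs)
    Q = sum((x - xbar) ** 2 for x in xs)
    return ((2 * math.pi * s2) ** (-n / 2) * math.sqrt(s2 / K)
            * math.exp(-Q / (2 * s2) - n * xbar ** 2 / (2 * K)))

a, b = [0.0, 1.0, 2.0], [3.0, 4.0, 5.0]      # same spread, different means
for xs in (a, b):
    for s2 in (1.0, 2.0):
        assert abs(q_numeric(xs, s2) / q_closed(xs, s2) - 1) < 1e-7
# the sigma^2-dependence goes through the spread only:
r_a = q_numeric(a, 1.0) / q_numeric(a, 2.0)
r_b = q_numeric(b, 1.0) / q_numeric(b, 2.0)
assert abs(r_a / r_b - 1) < 1e-6
print("mixed density (8.5) verified")
```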
EXAMPLE 8.4. If τ attains only two values, say τ₀ and τ₁, and if it admits a
sufficient statistic in the sense of definition 2.2, then the respective distributions Q_{τ₀}
and Q_{τ₁} are least favorable for testing the composite hypothesis τ = τ₀ against
the composite alternative τ = τ₁ (see Lehmann [14]). In particular, if T is
ancillary under τ(θ) = τ₀, and if
(8.6) pθ(x) = r_τ(t(x)) p_b(x), θ = (τ, b),
where the range of b is independent of the range of τ, then T is sufficient for τ in
the sense of definition 2.2, and the respective family {Q_τ} may be defined so that
we choose an arbitrary b, say b₀, and then put Q_τ(dx) = r_τ(t(x)) p_{b₀}(x) μ(dx). In
this way asymptotic sufficiency of the vector of ranks for a class of testing prob-
lems is proved in [6]. (See also remark 2.4.)
EXAMPLE 8.5. The standard theory of probability sampling from finite popu-
lations is linear and nonparametric. If we have any information about the type of
distribution in the population, and if this type could make nonlinear estimates
preferable, we may proceed as follows.
Consider the population values Y₁, ⋯, Y_N as a sample from a distribution
with two parameters. Particularly, assume that the random variables X_i =
h(Y_i), where h is a known strictly increasing function, are normal (μ, σ²). For
example, if h(y) = log y, then the Y_i's are log-normally distributed. Now a
simple random sample may be identified with the partial sequence Y₁, ⋯, Y_n,
2 ≤ n < N. Our task is to estimate Y = Y₁ + ⋯ + Y_N, or, equivalently, as
we know Y₁ + ⋯ + Y_n, to estimate
(8.7) Z = Y_{n+1} + ⋯ + Y_N.
Now the minimum variance unbiased estimate of Z equals
(8.8) Ẑ = (N − n)[B(½n − 1, ½n − 1)]⁻¹ ∫₀¹ h⁻¹[x̄ + (2v − 1)s(n − 1)n^(−1/2)] [v(1 − v)]^(½n−2) dv,
where
(8.9) x̄ = n⁻¹ Σ_{i=1}^n h(y_i), s² = (n − 1)⁻¹ Σ_{i=1}^n [h(y_i) − x̄]².
On the other hand, choosing the usual "diffuse" prior distribution ν(dμ dσ) =
σ⁻¹ dμ dσ, we obtain for the conditional expectation of Z, given Y_i = y_i,
i = 1, ⋯, n, the following result:
(8.10) Z′ = E(Z|Y_i = y_i, 1 ≤ i ≤ n, ν) = (N − n)(n − 1)^(−1/2)
[B(½(n − 1), ½)]⁻¹ ∫ h⁻¹(x̄ + vs(1 + 1/n)^(1/2)) [1 + v²/(n − 1)]^(−n/2) dv.
However, for most important functions h, for example for h(y) = log y, that is
h⁻¹(x) = eˣ, we obtain Z′ = ∞. Thus the Bayesian approach should be based on
some other prior distribution. In any case, however, the Bayesian solution would
be too sensitive with respect to the choice of ν(dμ dσ).
Exactly the same unpleasant result (8.10) is obtained by the method of fiducial
prediction recommended by R. A. Fisher ([5], p. 116). For details, see [9].
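The reconstructed formula (8.8) admits a simple sanity check (ours): for the identity transformation h(y) = y it must reduce to (N − n)x̄, because the symmetric Beta weight integrates the odd term away; and for h(y) = log y it remains finite, in contrast with (8.10):

```python
# A hedged sanity check of (8.8) as reconstructed above: for the identity
# transformation h(y) = y the estimator must reduce to (N - n) * xbar,
# since the symmetric Beta(n/2 - 1, n/2 - 1) weight integrates the odd
# term away; for h(y) = log y it stays finite, unlike (8.10).
import math

def Z_hat(h_inv, xbar, s, n, N, steps=100000):
    a = n / 2 - 1                          # Beta(a, a) weight in (8.8)
    inv_beta = math.gamma(2 * a) / math.gamma(a) ** 2   # 1 / B(a, a)
    dv = 1.0 / steps
    total = 0.0
    for i in range(1, steps):              # integrand vanishes at 0, 1 for n > 4
        v = i * dv
        w = (v * (1 - v)) ** (n / 2 - 2)
        total += h_inv(xbar + (2 * v - 1) * s * (n - 1) / math.sqrt(n)) * w
    return (N - n) * inv_beta * total * dv

n, N, xbar, s = 6, 20, 2.0, 0.8            # hypothetical sample summaries (ours)
print(Z_hat(lambda x: x, xbar, s, n, N))   # approximately (N - n) * xbar = 28
print(Z_hat(math.exp, xbar, s, n, N))      # finite, in contrast with (8.10)
```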
EXAMPLE 8.6. Consider the following method of sampling from a finite popu-
lation of size N: each unit is selected by an independent experiment, and the
probability of its being included in the sample equals n/N. Then the sample size
K is a binomial random variable with expectation n. To avoid empty samples,
let us reject samples of size < k₀, where k₀ is a positive integer, so that K will have
a truncated binomial distribution. Further, let us assume that we are estimating
the population total Y = y₁ + ⋯ + y_N by the estimator
(8.11) Ŷ = (N/K) Σ_{i ∈ s_K} y_i,
where s_K denotes a sample of size K. Then
(8.12) E((Ŷ − Y)² | K = k) = N(N − k)σ²/k
holds, where σ² is the population variance σ² = (N − 1)⁻¹ Σ_{i=1}^N (y_i − ȳ)². Thus
K yields a correct conditioning. Further, the deepest correct conditioning is that
relative to K.
On the other hand, simple random sampling of fixed size does not allow
effective correct conditioning, unless we restrict somehow the set of possible
(y₁, ⋯, y_N)-values.
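Formula (8.12) can be verified exactly on a tiny population (hypothetical values of ours), since given K = k the Bernoulli sample is uniformly distributed over all size-k subsets:

```python
# A hedged exact check of (8.12) on a tiny hypothetical population:
# under independent (n/N) inclusion the conditional law given K = k is
# uniform over all size-k subsets, so the conditional mean square error
# of (8.11) can be enumerated and compared with N (N - k) sigma^2 / k.
import itertools

y = [1.0, 3.0, 4.0, 8.0]               # hypothetical population values
N = len(y)
Y = sum(y)
ybar = Y / N
sigma2 = sum((yi - ybar) ** 2 for yi in y) / (N - 1)

for k in range(1, N + 1):
    subsets = list(itertools.combinations(range(N), k))
    mse = sum(((N / k) * sum(y[i] for i in s) - Y) ** 2
              for s in subsets) / len(subsets)
    assert abs(mse - N * (N - k) * sigma2 / k) < 1e-9
print("conditional MSE formula (8.12) verified for N =", N)
```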
EXAMPLE 8.7. Let f(x) be a continuous one-dimensional density such that
−log f(x) is convex. Assume that
(8.13) I = ∫ [f′(x)/f(x)]² f(x) dx < ∞.
Now for every integer N put
(8.14) pθ(x₁, ⋯, x_N) = Π_{i=1}^N f(x_i − θ)
and
(8.15) l_x(θ) = pθ(x₁, ⋯, x_N), −∞ < θ < ∞.
Let t(x) be the mode (or the mid-mode) of the likelihood l_x(θ), that is,
(8.16) l_x(t(x)) ≥ l_x(θ), −∞ < θ < ∞.
Since f(x) is strictly unimodal, t(x) is uniquely defined. Then put
(8.17) l*_x(θ) = c_N (NI)^(1/2)(2π)^(−1/2) exp[−½NI(θ − t(x))²],
where c_N is chosen so that
(8.18) ∫ l*_x(θ) dx₁ ⋯ dx_N = 1.
Then c_N → 1 and
(8.19) lim_{N→∞} ∫ |l_x(θ) − l*_x(θ)| dx₁ ⋯ dx_N = 0,
the integrals being independent of θ. Thus, for large N, the experiment with
likelihoods l may be approximated by an experiment with normal likelihoods
l* (see [8]). Similar results may be found in Le Cam [12] and P. Huber [10].
EXAMPLE 8.8. Let θ = (θ₁, ⋯, θ_N),
(8.20) pθ(x₁, ⋯, x_N) = (2π)^(−N/2) exp{−½ Σ_{i=1}^N (x_i − θ_i)²},
and let (θ₁, ⋯, θ_N) be regarded as a sample from a normal distribution (μ, σ²).
If μ and σ² were known, we would obtain the following joint density of (x, θ):
(8.21) r_ν(x₁, ⋯, x_N, θ₁, ⋯, θ_N)
= σ^(−N)(2π)^(−N) exp{−½ Σ_{i=1}^N (x_i − θ_i)² − (1/(2σ²)) Σ_{i=1}^N (θ_i − μ)²}.

Consequently, the best estimator of (θ₁, ⋯, θ_N) would be (θ̂₁, ⋯, θ̂_N) defined
by
(8.22) θ̂_i = (μ + σ²x_i)/(1 + σ²).
Now, in the prior experiment (μ, σ²) can be estimated by (θ̄, s_θ²), where
(8.23) θ̄ = N⁻¹ Σ_{i=1}^N θ_i, s_θ² = (N − 1)⁻¹ Σ_{i=1}^N (θ_i − θ̄)².
In the x-experiment, in turn, a sufficient pair of statistics for (θ̄, s_θ²) is (x̄, s²),
where
(8.24) x̄ = N⁻¹ Σ_{i=1}^N x_i, s² = (N − 1)⁻¹ Σ_{i=1}^N (x_i − x̄)².

Now, while θ̄ may be estimated by x̄, the estimation of s_θ² must be accomplished
by some more complicated function of s². We know that the distribution of
(N − 1)s² is noncentral χ² with (N − 1) degrees of freedom and the parameter
of noncentrality (N − 1)s_θ². Denoting the distribution function of that distri-
bution by F_{N−1}(x, δ), where δ denotes the parameter of noncentrality, we could
estimate s_θ² by ŝ_θ² = h(s²), where h(s²) denotes the solution of
(8.25) F_{N−1}((N − 1)s², (N − 1)h(s²)) = ½.
(See section 4.3.) On substituting the estimates in (8.21), one obtains the modified
estimators
(8.26) θ̃_i = (x̄ + ŝ_θ² x_i)/(1 + ŝ_θ²).
The estimators (8.26) will be for large N nearly as good as the estimators (8.22)
(cf. C. Stein [18]).
The estimators (8.26) could be successfully applied to estimating the averages
in individual strata in sample surveys. They represent a compromise between
estimates based on observations from the same stratum only, which are unbiased
but have large variance, and the overall estimates, which are biased but have
small variance.
The same method could be used for θ_i = 1 or 0, with (θ₁, ⋯, θ_N) regarded as
a sample from an alternative distribution with unknown p. The parameter p
could then be estimated along the lines of example 8.1 (cf. H. Robbins [17]).
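A sketch of the estimators (8.26) follows. The noncentral χ² distribution function is implemented through its standard Poisson mixture of central χ² distributions, and (8.25) is solved by bisection; the sample values and all names are ours:

```python
# A hedged sketch of the estimators (8.26).  F_{N-1}(x, delta) is computed
# via the standard Poisson mixture of central chi-square cdf's, and (8.25)
# is solved for the noncentrality by bisection.  Sample values are ours.
import math

def gammainc_P(a, x):
    # regularized lower incomplete gamma P(a, x), by the standard series
    if x <= 0:
        return 0.0
    term = 1.0 / a
    s = term
    k = 0
    while term > s * 1e-15 and k < 10000:
        k += 1
        term *= x / (a + k)
        s += term
    return s * math.exp(a * math.log(x) - x - math.lgamma(a))

def chi2_cdf(x, df):
    return gammainc_P(df / 2, x / 2)

def ncx2_cdf(x, df, nc):
    # F(x; df, nc) = sum_j Poisson(nc/2)(j) * chi2_cdf(x, df + 2 j)
    s, w, j = 0.0, math.exp(-nc / 2), 0
    while j < 1000:
        s += w * chi2_cdf(x, df + 2 * j)
        j += 1
        w *= (nc / 2) / j
        if w < 1e-16 and j > nc / 2:
            break
    return s

def h(s2, N):
    # solve (8.25): F_{N-1}((N-1) s2, (N-1) d) = 1/2 for d >= 0
    x = (N - 1) * s2
    if chi2_cdf(x, N - 1) < 0.5:
        return 0.0                     # no nonnegative solution: shrink fully
    lo, hi = 0.0, max(1.0, 2 * s2)
    while ncx2_cdf(x, N - 1, (N - 1) * hi) > 0.5:
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if ncx2_cdf(x, N - 1, (N - 1) * mid) > 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

xs = [-2.1, -1.0, -0.4, 0.3, 0.8, 1.2, 1.9, 2.5, 3.1, 4.0]
N = len(xs)
xbar = sum(xs) / N
s2 = sum((x - xbar) ** 2 for x in xs) / (N - 1)
shat2 = h(s2, N)
theta_tilde = [(xbar + shat2 * x) / (1 + shat2) for x in xs]  # (8.26)
print(round(shat2, 3), [round(t, 2) for t in theta_tilde])
```

Each θ̃_i lies between x̄ and x_i, the shrinkage factor ŝ_θ²/(1 + ŝ_θ²) being common to all coordinates; this is the compromise between stratum-only and overall estimates described above.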

9. Concluding remarks
The genius of classical statistics is based on skillful manipulations with the
notions of sufficiency, similarity, and conditionality. The importance of similarity
increased after the introduction of the notion of completeness. An adequate imbedding
of similarity and conditionality into the framework of the general theory of decision
making is not straightforward. Classical statistics provided a richness of various
methods and did not care too much about criteria. The decision theory added a
great many criteria, but stimulated very few new methods. From the point of
view of decision making, risk is important only before we make a decision. After
the decision has been irreversibly made, the risk is irrelevant, and, for instance, its
estimation or speculation about the conditional risk makes no sense. Such an
attitude seems to be strange to the spirit of classical statistics. If regarding
statistical problems as games against Nature, one must keep in mind that besides
randomizations introduced by the statistician, there are randomizations (ancil-
lary statistics) involved intrinsically in the structure of the experiment. Such ran-
domizations may make the transfer to conditional risks facultative.
REFERENCES
[1] ALLAN BIRNBAUM, "On the foundations of statistical inference," J. Amer. Statist. Assoc.,
Vol. 57 (1962), pp. 269-326.
[2] D. BLACKWELL and M. A. GIRSHICK, Theory of Games and Statistical Decisions, New
York, Wiley, 1954.
[3] E. B. DYNKIN, The Foundations of the Theory of Markovian Processes, Moscow, Fizmatgiz,
1959. (In Russian.)
[4] BRUNO DE FINETTI and LEONARD J. SAVAGE, "Sul modo di scegliere le probabilità iniziali,"
Sui Fondamenti della Statistica, Biblioteca del Metron, Series C, Vol. I (1962), pp. 81-147.
[5] R. A. FISHER, Statistical Methods and Scientific Inference, London, Oliver and Boyd, 1956.
[6] J. HÁJEK and Z. ŠIDÁK, The Theory of Rank Tests, Publishing House of the Czechoslovak Academy
of Sciences, to appear.
[7] S. M. SAMUELS, "On the number of successes in independent trials," Ann. Math. Statist.,
Vol. 36 (1965), pp. 1272-1278.
[8] J. HÁJEK, "Asymptotic normality of maximum likelihood estimates," to be published.
[9] J. HÁJEK, "Parametric theory of simple random sampling from finite populations," to be
published.
[10] P. J. HUBER, "Robust estimation of a location parameter," Ann. Math. Statist., Vol. 35
(1964), pp. 73-101.
[11] A. N. KOLMOGOROV, "Sur l'estimation statistique des paramètres de la loi de Gauss,"
Izv. Akad. Nauk SSSR Ser. Mat., Vol. 6 (1942), pp. 3-32.
[12] L. LE CAM, "Les propriétés asymptotiques des solutions de Bayes," Publ. Inst. Statist.
Univ. Paris, Vol. 7 (1958), pp. 3-4.
[13] L. LE CAM, "Sufficiency and approximate sufficiency," Ann. Math. Statist., Vol. 35 (1964),
pp. 1419-1455.
[14] E. L. LEHMANN, Testing Statistical Hypotheses, New York, Wiley, 1959.
[15] J. NEYMAN, "Two breakthroughs in the theory of decision making," Rev. Inst. Internat.
Statist., Vol. 30 (1962), pp. 11-27.
[16] J. NEYMAN and E. L. SCOTT, "Consistent estimates based on partially consistent
observations," Econometrica, Vol. 16 (1948), pp. 1-32.
[17] H. ROBBINS, "Asymptotically subminimax solutions of compound decision problems,"
Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability,
Berkeley and Los Angeles, University of California Press, 1950, pp. 131-148.
[18] CHARLES STEIN, "Inadmissibility of the usual estimator for the mean of a multivariate
normal distribution," Proceedings of the Third Berkeley Symposium on Mathematical
Statistics and Probability, Berkeley and Los Angeles, University of California Press, 1956,
Vol. I, pp. 197-206.
