On Basic Concepts of Statistics
JAROSLAV HAJEK
MATHEMATICAL INSTITUTE OF THE CZECHOSLOVAK ACADEMY OF SCIENCES,
and CHARLES UNIVERSITY, PRAGUE
1. Summary
This paper is a contribution to current discussions on fundamental concepts,
principles, and postulates of statistics. In order to exhibit the basic ideas and
attitudes, mathematical niceties are suppressed as much as possible. The heart
of the paper lies in definitions, simple theorems, and nontrivial examples. The
main issues under analysis are sufficiency, invariance, similarity, conditionality,
likelihood, and their mutual relations.
Section 2 contains a definition of sufficiency for a subparameter (or sufficiency
in the presence of a nuisance parameter), and a criticism of an alternative defi-
nition due to A. N. Kolmogorov [11]. In that section, a comparison of the
principles of sufficiency in the sense of Blackwell-Girschick [2] and in the sense
of A. Birnbaum [1] is added. In theorem 3.5 it is shown that for nuisance parameters introduced by a group of transformations, the sub-σ-field of invariant events is sufficient for the respective subparameter.
Section 4 deals with the notion of similarity in the x-space as well as in the (x, θ)-space, and with related notions such as ancillary and exhaustive statistics.
Confidence intervals and fiducial probabilities are shown to involve a postulate
of "independence under ignorance."
Sections 5 and 6 are devoted to the principles of conditionality and of likeli-
hood, as formulated by A. Birnbaum [1]. Their equivalence is proved and their
strict form is criticized. The two principles deny gains obtainable by mixing
strategies, disregarding that, in non-Bayesian conditions, the expected maximum
conditional risk is generally larger than the maximum overall risk. Therefore,
the notion of "correct" conditioning is introduced, in a general enough way to
include the examples given in the literature to support the conditionality
principle. It is shown that in correct conditioning the maximum risk equals the
expected maximum conditional risk, and that in invariant problems the sub-σ-field of invariant events yields the deepest correct conditioning.
A proper field of application of the likelihood principle is shown to consist of
families of experiments, in which the likelihood functions, possibly after a com-
mon transformation of the parameter, have approximately normal form with
constant variance. Then each observed likelihood function allows computing the
risk without reference to the particular experiment.
In section 7, some forms of the Bayesian approach are touched upon, such as
those based on diffuse prior densities, or on a family of prior densities.
FIFTH BERKELEY SYMPOSIUM: HAJEK
In section 8, some instructive examples are given with comments.
References are confined to papers quoted only.
2. Sufficiency
The notion of sufficiency does not provoke many disputes. Nonetheless, there are two points, namely sufficiency for a subparameter and the principle of sufficiency, which deserve a critical examination.
2.1. Sufficiency for a subparameter. Let us consider an experiment (X, θ), where X denotes the observations and θ denotes a parameter. The random element X takes its values in an x-space, and θ takes its values in a θ-space. A function τ of θ will be called a subparameter. If θ were replaced by a σ-field, then τ would be replaced by a sub-σ-field. Under what conditions may we say that a statistic is sufficient for τ? Sufficiency for τ may also be viewed as sufficiency in the presence of a nuisance parameter. The present author is aware of only one attempt in this direction, due to Kolmogorov [11].
DEFINITION 2.1. A statistic T = t(X) is called sufficient for a subparameter τ, if the posterior distribution of τ, given X = x, depends only on T = t and on the prior distribution of θ.
Unfortunately, the following theorem shows that the Kolmogorov definition is
void.
THEOREM 2.1. If τ is a nonconstant subparameter, and if T is sufficient for τ in the sense of definition 2.1, then T is sufficient for θ as well.
PROOF. For simplicity, let us consider the discrete case only. Let T be not sufficient for θ, and let us try to show that it cannot be sufficient for τ. Since T is not sufficient for θ, there exist two pairs, (θ1, θ2) and (x1, x2), such that
(2.1) t(x1) = t(x2)
and
(2.2) Pθ1(X = x1)/Pθ2(X = x1) ≠ Pθ1(X = x2)/Pθ2(X = x2).
If τ(θ1) ≠ τ(θ2), let us consider the following prior distribution: ν(θ = θ1) = ν(θ = θ2) = 1/2. Then, for τ1 = τ(θ1) and τ2 = τ(θ2),
(2.3) P(τ = τ1 | X = x1) = Pθ1(X = x1)/[Pθ1(X = x1) + Pθ2(X = x1)]
and
(2.4) P(τ = τ1 | X = x2) = Pθ1(X = x2)/[Pθ1(X = x2) + Pθ2(X = x2)].
Obviously, (2.2) entails P(τ = τ1 | X = x1) ≠ P(τ = τ1 | X = x2), which implies, in view of (2.1), that T is not sufficient for τ.
If τ(θ1) = τ(θ2) held, we would choose θ3 such that τ(θ3) ≠ τ(θ1) = τ(θ2). Note that the equations
(2.5) Pθ1(X = x1)/Pθ3(X = x1) = Pθ1(X = x2)/Pθ3(X = x2)
and
(2.6) Pθ2(X = x1)/Pθ3(X = x1) = Pθ2(X = x2)/Pθ3(X = x2)
are not compatible with (2.2). Thus either (2.5) or (2.6) does not hold, and the above reasoning may be accomplished either with (θ1, θ3) or (θ2, θ3), both pairs satisfying the condition τ(θ1) ≠ τ(θ3), τ(θ2) ≠ τ(θ3). Q.E.D.
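The construction in the proof is easy to check numerically; the following sketch (with illustrative probabilities, not taken from the paper) exhibits a statistic constant on {x1, x2} whose posterior for τ nevertheless varies with x:

```python
# Numerical check of the construction in the proof of theorem 2.1.
# Two parameter points theta1, theta2 with tau(theta1) != tau(theta2),
# and two sample points x1, x2 with t(x1) = t(x2) but different
# likelihood ratios, as in (2.1)-(2.2). All numbers are illustrative.

p = {  # p[theta][x] = P_theta(X = x)
    "theta1": {"x1": 0.8, "x2": 0.2},
    "theta2": {"x1": 0.4, "x2": 0.6},
}

def posterior_tau1(x):
    """P(tau = tau1 | X = x) under the uniform two-point prior, as in (2.3)-(2.4)."""
    num = 0.5 * p["theta1"][x]
    den = 0.5 * p["theta1"][x] + 0.5 * p["theta2"][x]
    return num / den

post_x1 = posterior_tau1("x1")  # 0.8 / (0.8 + 0.4) = 2/3
post_x2 = posterior_tau1("x2")  # 0.2 / (0.2 + 0.6) = 1/4
# Although t(x1) = t(x2), the posterior of tau differs, so a statistic
# constant on {x1, x2} cannot be sufficient for tau in Kolmogorov's sense.
assert post_x1 != post_x2
```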
Thus we have to try to define sufficiency in the presence of nuisance parameters
in some less stringent way.
DEFINITION 2.2. Let 𝒫τ be the convex hull of the distributions {Pθ: τ(θ) = τ}, for all possible τ-values. We shall say that T is sufficient for τ if
(i) the distribution of T depends on τ only, that is,
(2.7) Pθ(dt) = Pτ(dt),
and
(ii) there exist distributions Qτ ∈ 𝒫τ such that T is sufficient for the family {Qτ}.
In the same manner we define a sufficient sub-σ-field for τ.
Now we shall prove an analogue of the well-known Rao-Blackwell theorem.
For this purpose, let us consider a decision problem with a convex set D of decisions d, and with a loss function L(τ, d), which is convex in d for each τ, and depends on θ only through τ, L(θ, d) = L(τ(θ), d). Applying the minimax principle to eliminate the nuisance parameter, we associate with each decision function δ(x) the following risk:
(2.8) R(τ, δ) = sup ∫ L[τ(θ), δ(x)] Pθ(dx),
where the supremum is taken over all θ-values such that τ(θ) = τ. Now, if T is sufficient for τ in the sense of definition 2.2, we can associate with each decision function δ(x) another decision function δ̄(t) defined as follows:
(2.9) δ̄(t) = ∫ δ(x) Qτ(dx | T = t),
provided that δ(x) is integrable. Note that the right side of (2.9) does not depend on τ, since T is sufficient for {Qτ}, according to definition 2.2, and that δ̄(t) ∈ D for every t in view of convexity of D. Finally put
(2.10) δ*(x) = δ̄(t(x)).
THEOREM 2.2. Under the above assumptions,
(2.11) R(τ, δ*) ≤ R(τ, δ)
holds for all τ.
PROOF. Since the distribution of T depends on τ only, we have
(2.12) R(τ, δ*) = ∫ L[τ(θ), δ*(x)] Pθ(dx) = ∫ L[τ(θ), δ*(x)] Qτ(dx)
for all θ such that τ(θ) = τ. Furthermore, since L(τ, d) is convex in d, Jensen's inequality applied to (2.9) gives ∫ L[τ(θ), δ*(x)] Qτ(dx) ≤ ∫ L[τ(θ), δ(x)] Qτ(dx), and the last integral does not exceed R(τ, δ), because Qτ is a convex combination of distributions Pθ with τ(θ) = τ. Q.E.D.
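The averaging (2.9)-(2.10) is the familiar Rao-Blackwell smoothing; a minimal numeric sketch for plain sufficiency (no nuisance parameter; the model and loss below are illustrative choices):

```python
from itertools import product

# Rao-Blackwell averaging as in (2.9)-(2.10), sketched for the simplest
# case: X = (X1, X2) i.i.d. Bernoulli(p), T = X1 + X2 sufficient,
# delta(x) = x1, squared-error loss (convex in d).

def risk(delta, p):
    """Exact risk E_p[(delta(X) - p)^2], summing over the four sample points."""
    r = 0.0
    for x in product((0, 1), repeat=2):
        prob = (p if x[0] else 1 - p) * (p if x[1] else 1 - p)
        r += prob * (delta(x) - p) ** 2
    return r

# delta*(x) = E[delta(X) | T = t(x)]; the conditional law of X given T
# does not depend on p, so the average is computable: E[X1 | T] = T/2.
delta_star = lambda x: (x[0] + x[1]) / 2

for p in (0.1, 0.3, 0.5, 0.9):
    assert risk(delta_star, p) <= risk(lambda x: x[0], p)  # the analogue of (2.11)
```

Here the risk of δ is p(1 − p), while that of δ* is p(1 − p)/2, so the inequality is strict at interior p.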
3. Invariance
Most frequently, the nuisance parameter is introduced by a group of transformations of the x-space on itself. Then, if we have a finite invariant measure on the group, we can easily show that the sub-σ-field of invariant events is sufficient in the presence of the corresponding nuisance parameter.
Let G = {g} be a group of one-to-one transformations of the x-space on itself.
Take a probability distribution P of X and put
(3.1) Pg(X ∈ A) = P(gX ∈ A).
Then obviously, Ph(X ∈ g⁻¹A) = Pgh(X ∈ A). Let μ be a σ-finite measure and denote by μg the measure such that μg(A) = μ(gA).
THEOREM 3.1. Let P << μ and μg << μ for all g ∈ G. Then Pg << μ. Denoting p(x, g) = dPg/dμ and p(x) = dP/dμ, then
(3.2) p(x, g) = p(g⁻¹x) (dμg⁻¹/dμ)(x)
holds. More generally,
(3.3) p(x, h⁻¹g) = p(h(x), g) (dμh/dμ)(x).
PROOF. According to the definition, one can write
(3.4) Pg(X ∈ A) = P(X ∈ g⁻¹A) = ∫_{g⁻¹A} p(x) dμ = ∫_A p(g⁻¹(y)) dμg⁻¹(y) = ∫_A p(g⁻¹(y)) (dμg⁻¹/dμ)(y) dμ(y).
CONDITION 3.1. Let 𝒢 be a σ-field of subsets of G, and let 𝒜 be the σ-field of subsets of the x-space. Assume that
(i) μg << μ for all g ∈ G,
(ii) p(x, g) is 𝒜 × 𝒢-measurable,
(iii) the functions φh(g) = hg and ψh(g) = gh are 𝒢-measurable,
(iv) there is an invariant probability measure ν on 𝒢, that is, ν(Bg) = ν(gB) = ν(B) for all g ∈ G and B ∈ 𝒢.
THEOREM 3.2. Under condition 3.1, let us put
(3.5) p̄(x) = ∫ p(x, g) dν(g).
PROOF. Put l(x) = p̄(x)/p̄₀(x). Theorem 3.2 entails l(g(x)) = l(x) for all g ∈ G, namely, l(x) is ℬ-measurable. Further, for any B ∈ ℬ, in view of (3.2) and of [p̄₀ = 0] ⊂ [p̄ = 0],
(3.9) ∫_B l(x) dP₀ = ∫_B l(x) p₀(x) dμ = ∫_B l(x) p₀(g⁻¹(x)) dμg⁻¹ = ∫_B l(x) p₀(x, g) dμ = ∫_B l(x) p̄₀(x) dμ = ∫_B p̄(x) dμ
holds. On the other hand,
(3.10) P(B) = ∫_B p(x) dμ = ∫_B p(g⁻¹(x)) dμg⁻¹ = ∫_B p(x, g) dμ = ∫_B p̄(x) dμ.
Thus P(B) = ∫_B l(x) dP₀, B ∈ ℬ, which concludes the proof.
THEOREM 3.4. Let the statistic T = t(X) have an expectation under Ph, h ∈ G. Let condition 3.1 be satisfied. Then the conditional expectation of T relative to the sub-σ-field ℬ of G-invariant events and under Ph equals
(3.11) Eh(T | ℬ, x) = ∫ t(g⁻¹(x)) p(x, gh) dν(g)/p̄(x).
PROOF. The proof would follow the lines of the proofs of theorems 3.2
and 3.3.
Consider now a dominated family of probability distributions {Pτ} and define Pτ,g by (3.1) for each τ. Putting θ = (τ, g), we can say that τ is a subparameter of θ.
THEOREM 3.5. Under condition 3.1 the sub-σ-field ℬ of G-invariant events is sufficient for τ in the sense of definition 2.2.
PROOF. First, the G-invariant events have a probability depending on τ only, that is, Pτ,g(B) = Pτ(B). Second, for
(3.12) Qτ(A) = ∫ Pτ,g(A) dν(g)
we have
(3.13) Qτ(A) = ∫ [∫_A pτ(x, g) dμ] dν(g) = ∫_A ∫ pτ(x, g) dν(g) dμ = ∫_A p̄τ(x) dμ.
Now let P₀ be some probability measure such that Pτ << P₀ << μ, and introduce p̄₀(x) by (3.5). Then, according to theorem 3.3, p̄τ(x)/p̄₀(x) is ℬ-measurable for all τ. Thus p̄τ(x) = [p̄τ(x)/p̄₀(x)] p̄₀(x), and ℬ is sufficient for {Qτ}, according to the factorization criterion.
REMARK 3.2. Considering right-invariant and left-invariant probability measures on 𝒢 does not provide any generalization. Actually, if ν is right-invariant, then ν′(B) = ∫ ν(gB) dν(g) is invariant, that is, both right-invariant and left-invariant.
REMARK 3.3. Theorem 3.3 sometimes remains valid even if there exists only a right-invariant σ-finite measure ν, as in the case of the groups of location and/or scale shifts (see [6], chapter II).
4. Similarity
The concept of similarity plays a very important role in classical statistics,
namely in contributions by J. Neyman and R. A. Fisher. It may be applied in
the x-space as well as in the (x, 0)-space.
4.1. Similarity in the x-space. Consider a family of distributions {Pθ} on measurable subsets of the x-space. We say that an event A is similar, if its probability is independent of θ:
(4.1) Pθ(A) = P(A) for all θ.
Obviously the events "whole x-space" and "the empty set" are always similar. The class of similar events is closed under complementation and formation of countable disjoint unions. Such classes are called λ-fields by Dynkin [3]. A λ-field is a broader concept than a σ-field. The class of similar events is usually
not a σ-field, which causes ambiguity in applications of the conditionality principle, as we shall see. Generally, if both A and B are similar and are not disjoint, then A ∩ B and A ∪ B may not be similar. Consequently, the system of σ-fields contained in a λ-field may not include a largest σ-field.
Dynkin [3] calls a system of subsets a π-field, if it is closed under intersection, and shows that a λ-field containing a π-field contains the smallest σ-field over the π-field. Thus, for example, if for a vector statistic T = t(X) the events {x: t(x) ≤ c} are similar for every vector c, then {x: T ∈ A} is similar for every Borel set A.
More generally, a statistic V = v(X) is called similar, if EθV = ∫ v(x) Pθ(dx) exists and is independent of θ. We also may define similarity with respect to a nuisance parameter only. The notion of similarity forms a basis for the definition of several other important notions.
Ancillary statistics. A statistic U = u(X) is called ancillary, if its distribution is independent of θ, that is, if the events generated by U are similar. (We have seen that it suffices that the events {U ≤ c} be similar.)
Correspondingly, a sub-σ-field will be called ancillary, if all its events are similar.
Exhaustive statistics. A statistic T = t(X) is called exhaustive if (T, U), with U an ancillary statistic, is a minimal sufficient statistic.
Complete families of distributions. A family {Pθ} is called complete if the only similar statistics are constants.
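The definition of an ancillary statistic can be illustrated exactly in a toy discrete location family (an illustrative model, not from the paper):

```python
from itertools import product
from collections import Counter

# Ancillarity in a discrete location family: X = theta + E with E
# uniform on {0, 1, 2}, two independent observations. The difference
# U = X1 - X2 depends only on the errors, so its law is free of theta.

def law_of_difference(theta):
    """Exact distribution of U = X1 - X2 under the given theta."""
    c = Counter()
    for e1, e2 in product(range(3), repeat=2):
        x1, x2 = theta + e1, theta + e2
        c[x1 - x2] += 1 / 9
    return dict(c)

# The law of U is the same for every theta: U is ancillary, i.e. the
# events generated by U are similar in the sense of (4.1).
assert law_of_difference(0) == law_of_difference(5) == law_of_difference(-3)
```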
Our definition of an exhaustive statistic follows Fisher's explanation (ii) in
([5], p. 49), and the examples given by him.
Denote by Iθ, Iθ^T, Iθ^T(u) Fisher's information for the families {Pθ(dx)}, {Pθ(dt)}, {Pθ(dt | U = u)}, respectively. Then, since (T, U) is sufficient and U is ancillary,
(4.2) E Iθ^T(U) = Iθ.
If Iθ^T < Iθ, Fisher calls (4.2) "recovering the information lost." What is the
real content of this phrase?
The first interpretation would be that T = t contains all information supplied by X = x, provided that we know U = u. But knowing both T = t and U = u, we know (T, U) = (t, u), that is, we know the value of the sufficient statistic. Thus this interpretation is void, and, moreover, it holds even if U is not ancillary.
A more appropriate interpretation seems to be as follows. Knowing {Pθ(dt | U = u)}, we may dismiss the knowledge of P(du) as well as of {Pθ(dt | U = u′)} for u′ ≠ u. This, however, expresses nothing else than the conditionality principle formulated below.
Fisher makes a twofold use of exhaustive statistics: first, for extending the scope of fiducial distributions to cases where no appropriate sufficient statistic exists, and, second, for a (rather unconvincing) eulogy of maximum likelihood estimates.
The present author is not sure about the appropriateness of restricting ourselves to "minimal" sufficient statistics in the definition of exhaustive statistics.
Without this restriction, there would rarely exist a minimal exhaustive statistic,
and with this restriction we have to check, in particular cases, whether the
employed sufficient statistic is really minimal.
4.2. Similarity in the (x, θ)-space. Consider a family of distributions {Pθ} on the x-space, the family of all possible prior distributions {ν} on the θ-space, and the family of distributions {Rν} on the (x, θ)-space given by
(4.3) Rν(dx dθ) = ν(dθ) Pθ(dx).
Then we may introduce the notion of similarity in the (x, θ)-space in the same manner as before in the x-space, with {Pθ} replaced by {Rν}. Consequently, the prior distribution will play the role of an unknown parameter. Thus an event A in the (x, θ)-space will be called similar, if
(4.4) Rν(A) = R(A)
for all prior distributions ν.
To avoid confusion, we shall call measurable functions of (x, θ) quantities and not statistics. A quantity H = h(X, θ) will be called similar if its expectation EH = ∫∫ h(x, θ) ν(dθ) Pθ(dx) is independent of ν. Ancillary statistics in the (x, θ)-space are called pivotal quantities or distribution-free quantities. Since only the first component of (x, θ) is observable, the applications of similarity in the (x, θ)-space are quite different from those in the x-space.
Confidence regions. This method of region estimation dwells on the following
idea. Having a similar event A, whose probability equals 1 − α, with α very small, and knowing that X = x, we can feel confidence that (x, θ) ∈ A, namely, that the unknown θ lies within the region S(x) = {θ: (x, θ) ∈ A}. Here,
our confidence that the event A has occurred is based on its high probability, and
this confidence is assumed to be unaffected by the knowledge of X = x. Thus we
assume a sort of independence between A and X, though their joint distribution is
indeterminate. The fact that this intrinsic assumption may be dubious is most appropriately manifested in cases when S(x) is either empty, so that we know that A did not occur, or equals the whole θ-space, so that we are sure that A has occurred. Such a situation arises in the following.
EXAMPLE 1. For this example, θ ∈ [0, 1], x is real, Pθ(dx) is uniform over [θ, θ + 2], and
(4.5) A = {(x, θ): θ + α < x < θ + 2 − α}.
Then S(x) = ∅ for 3 − α < x < 3 or 0 < x < α, and S(x) = [0, 1] for 1 + α < x < 2 − α. Although this example is somewhat artificial, the difficulty
involved seems to be real.
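The anomaly of example 1 is easy to verify by direct computation; in the sketch below the value α = 0.25 and the θ-grid are illustrative choices:

```python
# The region S(x) of example 1 computed directly: P_theta is uniform on
# [theta, theta + 2], theta in [0, 1], and
# A = {(x, theta): theta + alpha < x < theta + 2 - alpha}.

alpha = 0.25

def S(x, grid=1000):
    """S(x) = {theta in [0, 1]: (x, theta) in A}, evaluated on a theta-grid."""
    thetas = [k / grid for k in range(grid + 1)]
    return [t for t in thetas if t + alpha < x < t + 2 - alpha]

assert S(0.1) == []         # 0 < x < alpha: S(x) is empty (A surely did not occur)
assert S(2.9) == []         # 3 - alpha < x < 3: empty again
assert len(S(1.5)) == 1001  # 1 + alpha < x < 2 - alpha: S(x) is all of [0, 1]
```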
Fiducial distribution. Let F(t, θ) be the distribution function of T under θ, and assume that F is continuous in t for every θ. Then H = F(T, θ) is a pivotal quantity with uniform distribution over [0, 1]. Thus R(F(T, θ) < x) = x. Now, if F(t, θ) is strictly decreasing in θ for each t, and if F⁻¹(t, x) denotes its inverse for fixed t, then F(T, θ) < x is equivalent to θ > F⁻¹(T, x). Now, again, if we know that T = t, and if we feel that it does not affect the probabilities concerning F(T, θ), we may write
(4.6) x = R(F(T, θ) < x) = R(θ > F⁻¹(T, x)) = R(θ > F⁻¹(T, x) | T = t) = R(θ > F⁻¹(t, x) | T = t),
namely, for θ̄ = F⁻¹(t, x),
(4.7) R(θ < θ̄ | T = t) = 1 − F(t, θ̄).
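The recipe (4.6)-(4.7) can be made concrete; the exponential scale model below is an illustrative choice satisfying the monotonicity assumptions:

```python
import math

# Fiducial recipe (4.6)-(4.7) for an exponential model with scale
# parameter theta: T has distribution function F(t, theta) = 1 - exp(-t/theta),
# continuous in t and strictly decreasing in theta for each fixed t > 0.

def F(t, theta):
    return 1.0 - math.exp(-t / theta)

def F_inv(t, x):
    """Solve F(t, theta) = x for theta at fixed t (the inverse in theta)."""
    return -t / math.log(1.0 - x)

t = 2.0
for x in (0.1, 0.5, 0.9):
    theta_bar = F_inv(t, x)
    assert abs(F(t, theta_bar) - x) < 1e-12
    # (4.7): the fiducial probability R(theta < theta_bar | T = t)
    # equals 1 - F(t, theta_bar), which here is 1 - x.
    assert abs((1.0 - F(t, theta_bar)) - (1.0 - x)) < 1e-12
```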
As we have seen, in confidence regions as well as in fiducial probabilities, a peculiar postulate of independence is involved. The postulate may be generally formulated as follows.
Postulate of independence under ignorance. Having a pair (Z, W) such that the marginal distribution of Z is known, but the joint distribution of (Z, W), as well as the marginal distribution of W, are unknown, and observing W = w, we assume that
(4.8) P(Z ∈ A | W = w) = P(Z ∈ A).
J. Neyman gives a different justification of the postulate than R. A. Fisher.
Let us make an attempt to formulate the attitudes of both these authors.
Neyman's justification. Assume that we perform a long series of independent replications of the given experiment, and denote the results by (Z1, w1), ⋯, (ZN, wN), where the wi's are observed numbers and the Zi's are not observable. Let us decide to accept the hypothesis Zi ∈ A at each replication. Then our decisions will be correct in approximately 100P% of cases, with P = P(Z ∈ A). Thus, in a long series, our mistakes in determining P(Z ∈ A | W = w) by (4.8), if any, will compensate each other.
Fisher's justification. Suppose that the only statistics V = v(W), such that the probabilities P(Z ∈ A | V = v) are well-determined, are constants. Then we are allowed to take P(Z ∈ A | W = w) = P(Z ∈ A), since our absence of knowledge prevents us from doing anything better.
The above interpretation of Fisher's view is based on the following passage
from his book ([5], pp. 54-55):
"The particular pair of values of 0 and T appropriate to a particular experi-
menter certainly belongs to this enlarged set, and within this set the proportion
of cases satisfying the inequality
T 2
(4.9) a > 2n X2(
is certainly equal to the chosen probability P. It might, however, have been true . . . that in some recognizable subset, to which his case belongs, the proportion of cases in which the inequality was satisfied should have some value other than P. It is the stipulated absence of knowledge a priori of the distribution of θ, together with the exhaustive character of the statistic T, that makes the recognition of any such subset impossible, and so guarantees that in his particular case . . . the general probability is applicable."
To apply our general scheme to the case considered by R. A. Fisher, we should
put Z = 1 if (4.9) is satisfied and Z = 0 otherwise, and W = T.
REMARK 4.1. We can see that Fisher's argumentation applies to a more specific situation than that described in the above postulate. He requires that conditional probabilities be not known for any statistic V = v(W) except for V = const. Thus the notion of fiducial probabilities is a more special notion than the notion of confidence regions, since Neyman's justification does not need any such restrictions.
Fisher's additional requirement is in accord with his requirement that fiducial probabilities should be based on the minimal sufficient statistic, and in such a case does not lead to difficulties.
However, if the fiducial probabilities are allowed to be based on exhaustive statistics, then his requirement is contradictory, since no minimal exhaustive statistic may exist. Nonetheless, we have the following.
THEOREM 4.1. If a minimal sufficient statistic S = s(X) is complete, then it is
also a minimal exhaustive statistic.
PROOF. If there existed an exhaustive statistic T = t(S) different from S,
there would exist a nonconstant ancillary statistic U = u(S), which contradicts
the assumed completeness of S. Q.E.D.
If S is not complete, then there may exist ancillary statistics U = u(S) and
the family corresponding to exhaustive statistics contains a minimal member if
and only if the family of U's contains a maximal member.
REMARK 4.2. Although the two above justifications are unconvincing, they
correspond to habits of human thinking. For example, if one knows that an
individual comes from a subpopulation of a population where the proportion of
individuals with a property A equals P, and if one knows nothing else about the
subpopulation, one applies P to the given individual without hesitation. This is
true even if we know the "name" of the individual, that is, if the subpopulation
consists of a single individual. Fisher is right, if he claims that we would not use
P if the given subpopulation would be a part of a larger one, in which the
proportion is known, too. He does not say, however, what to do if there is no
minimal such larger subpopulation.
In confidence regions the subpopulation consists of pairs (X, 0) such that
X = x. It is true that we know the "proportion" of elements of that subpopula-
tion for which the event A occurs, but we do not know how much of the proba-
bility mass each element carries. Thus we can utilize the knowledge of this
proportion only if it equals 0 or 1, which is exemplified by example 1 but
occurs rather rarely in practice. The problem becomes still more puzzling if the
knowledge of the proportion is based on estimates only, and if these estimates
become less reliable as the subpopulation from which they are derived be-
comes smaller. Is there any reasonable recommendation as to what to do if
we know P(X ∈ A | U1 = u1) and P(X ∈ A | U2 = u2), but we do not know P(X ∈ A | U1 = u1, U2 = u2)?
REMARK 4.3. Fisher attacked violently the Bayes postulate as an adequate
5. Conditionality
If the prior distribution ν of θ is known, all statisticians agree that the decisions given X = x should be made on the basis of the conditional distribution Rν(dθ | X = x). This exceptional agreement is caused by the fact that then there
is only one distribution in the (x, 0)-space, so that the problems are transferred
from the ground of statistics to the ground of pure probability theory.
The problems arise in situations where conditioning is applied to a family of
distributions. Here two basically different situations must be distinguished ac-
cording to whether the conditioning statistic is ancillary or not. Conditioning
with respect to an ancillary statistic is considered in the conditionality principle
as formulated by A. Birnbaum [1]. On the other hand, for example, conditioning
with respect to a sufficient statistic for a nuisance parameter, successfully used
in constructing most powerful similar tests (see Lehmann [14]), is a quite
different problem and will not be considered here.
Our discussion will concentrate on the conditionality principle, which may be
formulated as follows.
The principle of conditionality. Given an ancillary statistic U, and knowing U = u, statistical inference should be based on the conditional probabilities Pθ(dx | U = u) only; that is, the probabilities Pθ(dx | U = u′) for u′ ≠ u and P(du) should be disregarded.
This principle, if properly illustrated, looks very appealing. Moreover, all
proper Bayesian procedures are concordant with it. We say that a Bayesian
procedure is proper, if it is based on a prior distribution ν established independently of the experiment. A Bayesian procedure which is not proper is exemplified by taking for the prior distribution the measure ν given by
(5.1) ν(dθ) = √Iθ dθ,
where Iθ denotes the Fisher information associated with the particular experiment. Obviously, such a Bayesian procedure is not compatible with the principle of conditionality. (Cf. [4].)
The term "statistical inference" used in the above definition is very vague.
Birnbaum [1] makes use of an equally vague term "evidential meaning." The
present author sees two possible interpretations within the framework of the
decision theory.
Risk interpretation. A decision procedure δ defined on the original experiment should be associated with the same risk as its restriction associated with the partial experiment given U = u. Of course this transfer of risk is possible only after the experiment has been performed and u is known.
Decision rules interpretation. A decision function δ should be interpreted as a function of two arguments, δ = δ(E, x), with E denoting an experiment from a class ℰ of experiments and x denoting one of its outcomes. The principle of conditionality then restricts the class of "reasonable" decision functions to those for which δ(E, x) = δ(F, x) as soon as F is a subexperiment of E, and x belongs to the set of outcomes of F (F being a subexperiment of E means that F equals "E given U = u," where U is some ancillary statistic and u one of its particular values).
In this section we shall analyze the former interpretation, returning to the
latter in the next section.
The principle of conditionality, as it stands, is not acceptable in non-Bayesian
conditions, since it denies possible gains obtainable by randomization (mixed
strategies). On the other hand, some examples exhibited to support this principle
(see [1], p. 280) seem to be very convincing. Before attempting to delimit the
area of proper applications of the principle, let us observe that the principle is
generally ambiguous. Actually, since generally no maximal ancillary statistic
exists, it is not clear which ancillary statistic should be chosen for conditioning.
Let us denote the conditional distribution given U = u by Pθ(dx | U = u) and the respective conditional risk by
(5.2) R(θ, U, u) = ∫ L(θ, δ(x)) Pθ(dx | U = u).
In terms of a sub-σ-field ℬ, the same will be denoted by
(5.3) R(θ, ℬ, x) = ∫ L(θ, δ(x)) Pθ(dx | ℬ, x),
where R(θ, ℬ, x), as a function of x, is ℬ-measurable. Since the decision function δ will be fixed in our considerations, we have deleted it in the symbol for the risk.
DEFINITION 5.1. A conditioning relative to an ancillary statistic U or an ancillary sub-σ-field ℬ will be called correct, if
(5.4) R(θ, U, u) = R(θ) b(u)
or
(5.5) R(θ, ℬ, x) = R(θ) b(x),
respectively. In (5.4) and (5.5), R(θ) denotes the overall risk, and the function b(x) is ℬ-measurable. Obviously, E b(U) = E b(X) = 1.
The above definition is somewhat too strict, but it covers all examples exhibited in the literature to support the conditionality principle.
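The canonical example behind this definition is the mixture of two measuring instruments chosen by an ancillary coin; the sketch below (variances and estimator are illustrative choices) checks that (5.4) holds with the factor b(u) free of θ:

```python
# Correct conditioning in the two-instrument mixture experiment that is
# commonly used to motivate the conditionality principle. An ancillary
# coin U picks the instrument: given U = u, X is normal with mean theta
# and variance var[u]. For delta(x) = x under squared-error loss, the
# conditional risk R(theta, U, u) = var[u] is free of theta, so it
# factorizes as R(theta) b(u), i.e. (5.4) holds.

var = {1: 1.0, 2: 100.0}   # the "precise" and the "imprecise" instrument
p_u = {1: 0.5, 2: 0.5}     # U is ancillary: its law does not involve theta

R_cond = {u: var[u] for u in var}                  # R(theta, U, u)
R_overall = sum(p_u[u] * R_cond[u] for u in var)   # R(theta) = 50.5
b = {u: R_cond[u] / R_overall for u in var}        # the factor b(u)

assert all(abs(R_cond[u] - R_overall * b[u]) < 1e-12 for u in var)  # (5.4)
assert abs(sum(p_u[u] * b[u] for u in var) - 1.0) < 1e-12           # E b(U) = 1
```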
THEOREM 5.1. If the conditioning relative to an ancillary sub-σ-field ℬ is correct, then
(5.6) E[sup_θ R(θ, ℬ, X)] = sup_θ R(θ).
PROOF. The proof follows immediately from (5.5). Obviously, for a noncorrect conditioning we may obtain E[sup_θ R(θ, ℬ, X)] > sup_θ R(θ), so that, from the minimax point of view, conditional reasoning generally disregards possible gains obtained by mixing (randomization), and, therefore, is hardly acceptable under non-Bayesian conditions.
THEOREM 5.2. As in section 3, consider the family {Pg} generated by a probability distribution P and a group G of transformations g. Further, assume the loss function L to be invariant, namely, such that L(hg, δ(h(x))) = L(g, δ(x)) holds for all h, g, and x. Then the conditioning relative to the ancillary sub-σ-field ℬ of G-invariant events is correct. That is, R(g) = R and
(5.7) R(g, ℬ, x) = R(ℬ, x) = ∫ L[g, δ(h⁻¹(x))] p(x, hg) dν(h)/p̄(x).
PROOF. For B ∈ ℬ,
(5.8) ∫_B L(g, δ(x)) p(x, g) dμ(x) = ∫_B L[g, δ(h(y))] p(h(y), g) dμh(y) = ∫_B L(h⁻¹g, δ(y)) p(y, h⁻¹g) dμ(y).
The theorem easily follows from theorem 3.4 and from the above relations.
The following theorem describes an important family of cases of ineffective
conditioning.
THEOREM 5.3. If there exists a complete sufficient statistic T = t(X), then R(θ, U, u) = R(θ) holds for every ancillary statistic U and every δ which is a function of T.
PROOF. The theorem follows from the well-known fact (see Lehmann [14], p. 162) that all ancillary statistics are independent of T, if T is complete and sufficient.
DEFINITION 5.2. We shall say that an ancillary sub-σ-field ℬ yields the deepest correct conditioning, if for every nonnegative convex function ψ and every other ancillary sub-σ-field ℬ̃ yielding a correct conditioning,
(5.9) Eψ[b(X)] ≥ Eψ[b̃(X)]
holds, with b and b̃ corresponding to ℬ and ℬ̃ by (5.5), respectively.
THEOREM 5.4. Under condition 3.1 and under the conditions of theorem 5.2, the ancillary sub-σ-field ℬ of G-invariant events yields the deepest correct conditioning. Further, if any other ancillary sub-σ-field ℬ̃ possesses this property, then R(ℬ̃, x) = R(ℬ, x) almost μ-everywhere.
PROOF. Let ν denote the invariant measure on 𝒢. Take a C ∈ ℬ̃ and denote by χC(x) its indicator. Then
(5.10) φC(x) = ∫ χC(g(x)) dν(g)
is ℬ-measurable and
(5.11) P(C) = ∫ φC(x) p(x) dμ.
Further, since the loss function is invariant,
(5.12) ∫_C L(g, δ(x)) p(x, g) dμ = ∫ χC(g(y)) L(g₁, δ(y)) p(y) dμ(y),
where g₁ denotes the identity transformation. Consequently, denoting by R(g, C) the conditional risk, given C, we have from (5.10) and (5.12),
(5.13) P(C) ∫ R(g, C) dν(g) = ∫ φC(x) L(g₁, δ(x)) p(x) dμ.
Now, since R(g) = R, according to theorem 5.2, and since the correctness of ℬ̃ entails R(g, C) = R b̃C, we have
(5.14) ∫ R(g, C) dν(g) = R b̃C.
Note that
(5.15) P(C) b̃C = ∫_C b̃(x) p(x) dμ.
Further, since φC(x) is ℬ-measurable,
(5.16) ∫ φC(x) L(g₁, δ(x)) p(x) dμ = ∫ φC(x) R(ℬ, x) p(x) dμ = ∫ φC(x) R b(x) p(x) dμ.
By combining (5.13) through (5.16), we obtain
(5.17) P(C) b̃C = ∫ φC(x) b(x) p(x) dμ.
Now, let us assume Eψ(b(X)) < ∞. Then, given ε > 0, we choose a finite partition {Ck}, Ck ∈ ℬ̃, such that
(5.18) Eψ(b̃(X)) ≤ Σk P(Ck) ψ(b̃Ck) + ε,
and we note that, in view of (5.11), (5.17), and of convexity of ψ,
(5.19) P(Ck) ψ(b̃Ck) ≤ ∫ φCk(x) ψ(b(x)) p(x) dμ,
and, in view of (5.10),
(5.20) Σk φCk(x) = 1
for every x. Consequently, (5.18) through (5.20) entail
(5.21) Eψ(b̃(X)) ≤ Eψ(b(X)) + ε.
Since ε > 0 is arbitrary, (5.9) follows. The case Eψ(b(X)) = ∞ could be treated similarly.
The second assertion of the theorem follows from the course of proving the first assertion. Actually, we obtain Eψ(b̃(X)) < Eψ(b(X)) for some ψ, unless b̃ is a function of b a.e. Also conversely, b must be a function of b̃, because the two conditionings are equally deep. Q.E.D.
A restricted conditionality principle. If all ancillary statistics (sub-σ-fields) yielding the deepest correct conditioning give the same conditional risk, and if there exists at least one such ancillary statistic (sub-σ-field), then the use of the conditional risk is obligatory.
6. Likelihood
Still more attractive than the principle of conditionality appears the principle of likelihood, which may be formulated along the lines of A. Birnbaum [1], as follows.
The principle of likelihood. Statistical inferences should be based on the likelihood functions only, disregarding the other structure of the particular experiments. Here, again, various interpretations are possible. We suggest the following.
Interpretation. For a given θ-space, the particular decision procedures should
be regarded as functionals on a space ℒ of likelihood functions, where all functions
differing by a positive multiplier are regarded as equivalent. Given an experi-
ment E such that for all outcomes x the likelihood functions l_x(θ) belong to ℒ,
and a decision procedure δ in the above sense, we should put
(6.1) δ(x) = δ[l_x(·)],
where l_x(θ) = p_θ(x).
If ℒ contains only unimodal functions, an estimation procedure of the above
kind is the maximum likelihood estimation method.
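A minimal sketch of this interpretation, with an assumed Gaussian-shaped likelihood and a grid search standing in for exact maximization: the functional δ[l_x(·)] returns the mode, and likelihoods differing by a positive multiplier yield the same decision.

```python
# Sketch of interpretation (6.1): a decision procedure as a functional
# on likelihood functions, invariant under positive multipliers.
# The grid and argmax details are illustrative assumptions, not from the paper.
import math

def delta(likelihood, grid):
    """Functional delta[l(.)]: here, the maximum likelihood rule."""
    return max(grid, key=likelihood)

# Two experiments yielding proportional likelihoods at their outcomes:
# l1(theta) = exp(-(1 - theta)^2),  l2(theta) = 5 * exp(-(1 - theta)^2).
l1 = lambda th: math.exp(-(1.0 - th) ** 2)
l2 = lambda th: 5.0 * math.exp(-(1.0 - th) ** 2)

grid = [k / 100.0 for k in range(-200, 201)]
d1, d2 = delta(l1, grid), delta(l2, grid)
print(d1, d2)  # identical decisions: the multiplier 5 is irrelevant
```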
A. Birnbaum [1] proved that the principle of conditionality joined with the
principle of sufficiency is equivalent to the principle of likelihood. He also con-
jectured that the principle of sufficiency may be left out in his theorem, and gave
a hint, which was not quite clear, for proving it. His conjecture is indeed
true if we assume that statistical inferences should be invariant under one-to-one
transformations of the x-space onto some other homeomorphic space. In the following
theorem we shall give the "decision interpretation" to the principle of condition-
ality, and the class ℰ of experiments will be regarded as the class of all experi-
ments such that all their possible likelihood functions belong to some space ℒ.
THEOREM 6.1. Under the above stipulations, the principle of conditionality is
equivalent to the principle of likelihood.
PROOF. If F is a subexperiment of E, and if x belongs to the space of possible
outcomes of F, the likelihood functions l_x(θ|E) and l_x(θ|F) differ by a positive
multiplier only; namely, they are identical. Thus the likelihood principle
entails the conditionality principle.
Further, assume that the likelihood principle is violated for a decision pro-
cedure δ, that is, there exist in ℰ two experiments E₁ = (𝒳₁, P₁θ) and E₂ = (𝒳₂, P₂θ)
such that for some points x₁ and x₂,
(6.2) δ(E₁, x₁) ≠ δ(E₂, x₂),
whereas
(6.3) P₁θ(X₁ = x₁) = c P₂θ(X₂ = x₂) for all θ,
with c = c(x₁, x₂), but independent of θ. We here assume the spaces 𝒳₁ and 𝒳₂
finite and disjoint, and the σ-fields to consist of all subsets. Then let us choose a
number λ such that
(6.4) 0 < λ < (c + 1)⁻¹,
and consider the experiment E = (𝒳₁ ∪ 𝒳₂, Pθ), where
(6.5) Pθ(X = x) = λ P₁θ(x) if x ∈ 𝒳₁,
= λc P₂θ(x) if x ∈ 𝒳₂ − {x₂},
= 1 − λ(c + 1 − P₁θ(x₁)) if x = x₂.
Then E conditioned by x ∈ 𝒳₁ coincides with E₁, and E conditioned by x ∈ (𝒳₂ −
{x₂}) ∪ {x₁} coincides with Ē₂, which is equivalent to E₂ up to the one-to-one trans-
formation φ(x) = x, if x ∈ 𝒳₂ − {x₂}, and φ(x₂) = x₁. Thus we should have
δ(E₁, x₁) = δ(E, x₁) = δ(Ē₂, x₁) = δ(E₂, x₂), according to the conditionality
principle. In view of (6.2) this is not true, so that the conditionality principle is
violated for δ, too. Q.E.D.
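The construction (6.5) can be checked numerically on a small instance. In the sketch below the two experiments, their probabilities, c, and λ are all illustrative choices satisfying (6.3) and (6.4):

```python
# Numeric sketch of the mixture experiment (6.5) used in the proof.
# E1 on {'a','b'}, E2 on {'c','d'}, two parameter points theta in {0,1};
# the specific probabilities below are illustrative assumptions only.
P1 = {0: {'a': 0.2, 'b': 0.8}, 1: {'a': 0.4, 'b': 0.6}}
P2 = {0: {'c': 0.1, 'd': 0.9}, 1: {'c': 0.2, 'd': 0.8}}
x1, x2, c = 'a', 'c', 2.0           # (6.3): P1[th][x1] = c * P2[th][x2] for all th
lam = 0.25                          # (6.4): any 0 < lam < 1/(c+1)

def P(th, x):
    """The mixed experiment E of (6.5) on {'a','b','c','d'}."""
    if x in ('a', 'b'):
        return lam * P1[th][x]
    if x == x2:
        return 1.0 - lam * (c + 1.0 - P1[th][x1])
    return lam * c * P2[th][x]

for th in (0, 1):
    total = sum(P(th, x) for x in 'abcd')
    assert abs(total - 1.0) < 1e-9             # E is a genuine experiment
    # conditioning on {'a','b'} recovers E1:
    pa = P(th, 'a') / (P(th, 'a') + P(th, 'b'))
    assert abs(pa - P1[th]['a']) < 1e-9
    # conditioning on {'d','a'} recovers E2 with x2 = 'c' relabeled as 'a':
    cond = P(th, 'a') + P(th, 'd')
    assert abs(P(th, 'a') / cond - P2[th][x2]) < 1e-9
print("mixture construction verified")
```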
Given a fixed space ℒ, the likelihood principle says what decision procedures
of wide scope (covering many experimental situations) are permissible. Having
any two such procedures, and wanting to compare them, we must either resort
to some particular experiment and compute the risk, or assume some a priori
distribution ν(dθ) and compute the conditional risk. In both cases we leave the
proper ground of the likelihood principle.
However, for some special spaces ℒ, all conceivable experiments with likelihood
functions in ℒ give us the same risk, so that the risk may be regarded as inde-
pendent of the particular experiment. The most important situation of this kind
is treated in the following.
THEOREM 6.2. Let θ be real and let ℒ_σ consist of the following functions:
(6.6) l(θ) = C exp[−(2σ²)⁻¹(t − θ)²], −∞ < t < ∞,
with σ fixed and positive. Then for each experiment with likelihood functions in ℒ_σ
there exists a complete sufficient statistic T = t(X), which is normally distributed
with expectation θ and variance σ².
PROOF. We have a measurable space (𝒳, 𝒜) with a σ-finite measure μ(dx)
such that the densities with respect to μ allow the representation
(6.7) p_θ(x) = c(x) exp[−(2σ²)⁻¹(t(x) − θ)²].
This relation shows that T = t(X) is sufficient, by the factorization criterion.
Further,
(6.8) 1 = ∫ p_θ(x) μ(dx) = ∫ exp[−θ²/2σ² + θt/σ²] μ̃(dt),
where μ̃ = μ*t⁻¹ and μ*(dx) = c(x) exp[−(2σ²)⁻¹ t²(x)] μ(dx). Now, assuming that
there exist two different measures μ̃ satisfying (6.8), we easily derive a contra-
diction with the completeness of exponential families of distributions (see
Lehmann [14], theorem 1, p. 132). Thus μ̃(dt) = (2πσ²)^{−1/2} exp[−t²/2σ²] dt and
the theorem is proved.
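As a concrete instance of theorem 6.2, a sample of n independent N(θ, s²) observations with s² known has a likelihood of the form (6.6) with t(x) = x̄ and σ² = s²/n. A short numerical check (the data values are arbitrary illustrations):

```python
# Sketch: an iid N(theta, s^2) sample of size n has a likelihood of the
# form (6.6) with t(x) = xbar and sigma^2 = s^2/n, so it lies in L_sigma.
# The sample values below are arbitrary illustrative numbers.
import math

x = [0.3, -1.2, 2.1, 0.7, 0.1]     # arbitrary data
n, s2 = len(x), 1.5                # s^2 assumed known
xbar = sum(x) / n

def loglik(th):
    return sum(-0.5 * math.log(2 * math.pi * s2) - (xi - th) ** 2 / (2 * s2)
               for xi in x)

sigma2 = s2 / n                    # the variance appearing in (6.6)
for th in (-2.0, -0.5, 0.0, 1.3, 4.0):
    lhs = loglik(th) - loglik(0.0)
    rhs = -((xbar - th) ** 2 - xbar ** 2) / (2 * sigma2)
    assert abs(lhs - rhs) < 1e-9   # likelihood proportional to exp[-(t-theta)^2/(2 sigma^2)]
print("likelihood lies in L_sigma with t = xbar, sigma^2 =", sigma2)
```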
The whole work of Fisher suggests that he associated the likelihood principle
with the above family of likelihoods and, asymptotically, with families which
in the limit shrink to ℒ_σ for some σ > 0 (see example 8.7). If so, the present
author cannot see any serious objections against the principle, and particularly
against the method of maximum likelihood. Outside this area, however, the
likelihood principle is misleading, because the information about the kind of
experiment does not lose its value even if we know the likelihood.
8. Examples
EXAMPLE 8.1. Let us assume that θ_i = 0 or 1, θ = (θ₁, …, θ_N) and τ =
θ₁ + ⋯ + θ_N. Further, put
(8.1) p_θ(x₁, …, x_N) = Π_{i=1}^N [f(x_i)]^{θ_i} [g(x_i)]^{1−θ_i},
where f and g are some one-dimensional densities. In other words, X =
(X₁, …, X_N) is a random sample of size N, each member X_i of which comes
from a distribution with density either f or g. We are interested in estimating the
number of the X_i associated with the density f. Let T = (T₁, …, T_N) be the
order statistic, namely T₁ ≤ ⋯ ≤ T_N are the observations X₁, …, X_N re-
arranged in ascending magnitude.
PROPOSITION 8.1. The vector T is a sufficient statistic for τ in the sense of
definition 2.2, and
(8.2) p_τ(t) = p_τ(t₁, …, t_N) = τ!(N − τ)! Σ_{s_τ∈S_τ} Π_{i∈s_τ} f(t_i) Π_{i∉s_τ} g(t_i),
where S_τ denotes the system of all subsets s_τ of size τ from {1, …, N}.
PROOF. A simple application of theorem 3.5 to the permutation group.
PROPOSITION 8.2. For every t,
(8.3) p_{τ+1}(t) p_{τ−1}(t) ≤ p_τ(t)²
holds.
PROOF. See [7]. Relation (8.3) means that −log p_τ(t) is convex in τ; that is,
the likelihoods are (strongly) unimodal. Thus we could try to estimate τ by the
method of maximum likelihood, or still better by the posterior expectation for the
uniform prior distribution. If X_i = 0 or 1, then T is equivalent to T′ = X₁ +
⋯ + X_N.
This kind of problem occurs in compound decision making. See H. Robbins
[17].
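Formula (8.2) and the maximum likelihood estimation of τ can be sketched directly; f, g, and the ordered sample below are illustrative choices, not from the text:

```python
# Sketch of example 8.1: computing p_tau(t) by formula (8.2) and
# estimating tau by maximum likelihood. The densities f, g and the
# data are illustrative choices, not from the paper.
import math
from itertools import combinations

f = lambda x: math.exp(-0.5 * (x - 2.0) ** 2) / math.sqrt(2 * math.pi)  # N(2,1)
g = lambda x: math.exp(-0.5 * x ** 2) / math.sqrt(2 * math.pi)          # N(0,1)

def p_tau(tau, t):
    """Formula (8.2): tau!(N-tau)! times the sum over size-tau subsets."""
    N = len(t)
    tot = sum(
        math.prod(f(t[i]) for i in s)
        * math.prod(g(t[i]) for i in range(N) if i not in s)
        for s in combinations(range(N), tau)
    )
    return math.factorial(tau) * math.factorial(N - tau) * tot

# Ordered sample: three values near 0 and two near 2 suggest tau = 2.
t = (-0.4, 0.1, 0.3, 1.9, 2.2)
tau_hat = max(range(len(t) + 1), key=lambda tau: p_tau(tau, t))
print("ML estimate of tau:", tau_hat)  # -> 2
```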
EXAMPLE 8.2. Let 0 < σ² < ∞, μ real, θ = (μ, σ²), and
(8.4) p_θ(x₁, …, x_n) = (2π)^{−n/2} σ^{−n} exp[−(2σ²)⁻¹ Σ_{i=1}^n (x_i − μ)²],
where the observations are x_i = h(y_i), and
x̄ = n⁻¹ Σ_{i=1}^n h(y_i), s² = (n − 1)⁻¹ Σ_{i=1}^n [h(y_i) − x̄]².
On the other hand, choosing the usual "diffuse" prior distribution ν(dμ dσ) =
σ⁻¹ dμ dσ, we obtain for the conditional expectation of Z, given Y_i = y_i,
i = 1, …, n, the following result:
(8.10) Z̄ = E(Z | Y_i = y_i, 1 ≤ i ≤ n, ν) = (N − n)(n − 1)^{−1/2}
[B(½(n − 1), ½)]⁻¹ ∫_{−∞}^{∞} h⁻¹(x̄ + vs(1 + 1/n)^{1/2}) [1 + v²/(n − 1)]^{−n/2} dv.
However, for most important functions h, for example for h(y) = log y, that is,
h⁻¹(x) = eˣ, we obtain Z̄ = ∞. Thus the Bayesian approach should be based on
some other prior distribution. In any case, however, the Bayesian solution would
be too sensitive with respect to the choice of ν(dμ dσ).
Exactly the same unpleasant result (8.10) obtains by the method of fiducial
prediction recommended by R. A. Fisher ([5], p. 116). For details, see [9].
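The divergence Z̄ = ∞ for h⁻¹(x) = eˣ is easy to see numerically: the exponential factor in the integrand of (8.10) overwhelms the polynomial t-like tail, so truncated versions of the integral grow without bound. (The values of n, x̄, and s below are arbitrary.)

```python
# Numerical illustration for example 8.2: with h^{-1}(x) = e^x the
# integrand of (8.10) grows like exp(c*v) against only a polynomial
# t-type tail, so truncated integrals increase without bound (Z_bar = inf).
# The values of n, xbar and s are arbitrary illustrative choices.
import math

n, xbar, s = 10, 0.0, 1.0

def integrand(v):
    return math.exp(xbar + v * s * math.sqrt(1 + 1.0 / n)) \
           * (1 + v * v / (n - 1)) ** (-n / 2)

def truncated(limit, steps=20000):
    """Midpoint-rule approximation of the integral over [-limit, limit]."""
    h = 2.0 * limit / steps
    return sum(integrand(-limit + (k + 0.5) * h) for k in range(steps)) * h

vals = [truncated(L) for L in (10.0, 50.0, 100.0, 200.0)]
print(vals)   # grows without bound as the truncation point increases
```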
EXAMPLE 8.6. Consider the following method of sampling from a finite popu-
lation of size N: each unit is selected by an independent experiment, and the
probability of its being included in the sample equals n/N. Then the sample size
K is a binomial random variable with expectation n. To avoid empty samples,
let us reject samples of size < k₀, where k₀ is a positive integer, so that K will have
a truncated binomial distribution. Further, let us assume that we are estimating
the population total Y = y₁ + ⋯ + y_N by the estimator
(8.11) Ŷ = (N/K) Σ_{i∈s_K} y_i,
where s_K denotes a sample of size K. Then
(8.12) E((Ŷ − Y)² | K = k) = N(N − k) k⁻¹ σ²
holds, where σ² is the population variance σ² = (N − 1)⁻¹ Σ_{i=1}^N (y_i − ȳ)². Thus
K yields a correct conditioning. Further, the deepest correct conditioning is that
relative to K.
On the other hand, simple random sampling of fixed size does not allow
effective correct conditioning, unless we restrict somehow the set of possible
(y₁, …, y_N)-values.
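Identity (8.12) can be verified exactly for a small population: given K = k, every size-k subset is equally likely, so averaging (Ŷ − Y)² over all such subsets must reproduce N(N − k)σ²/k. (The population values below are illustrative.)

```python
# Exact check of (8.12) for a small population: conditionally on K = k,
# the sample is simple random of size k, and averaging (Yhat - Y)^2 over
# all size-k subsets reproduces N(N-k)sigma^2/k. The y-values are arbitrary.
from itertools import combinations

y = [3.0, 7.0, 1.0, 4.0, 9.0, 2.0]          # illustrative population
N = len(y)
Y = sum(y)
ybar = Y / N
sigma2 = sum((yi - ybar) ** 2 for yi in y) / (N - 1)

for k in (2, 3, 4):
    sq_errs = [((N / k) * sum(y[i] for i in s) - Y) ** 2
               for s in combinations(range(N), k)]
    mse = sum(sq_errs) / len(sq_errs)       # E((Yhat - Y)^2 | K = k)
    assert abs(mse - N * (N - k) * sigma2 / k) < 1e-9
print("conditional MSE matches N(N-k)sigma^2/k for k = 2, 3, 4")
```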
EXAMPLE 8.7. Let f(x) be a continuous one-dimensional density such that
−log f(x) is convex. Assume that
(8.13) I = ∫ [f′(x)/f(x)]² f(x) dx < ∞.
Now for every integer N put
(8.14) p_θ(x₁, …, x_N) = Π_{i=1}^N f(x_i − θ)
and
(8.15) l_x(θ) = p_θ(x₁, …, x_N), −∞ < θ < ∞.
Let t(x) be the mode (or the mid-mode) of the likelihood l_x(θ), that is,
(8.16) l_x(t(x)) ≥ l_x(θ), −∞ < θ < ∞.
Since f(x) is strictly unimodal, t(x) is uniquely defined. Then put
(8.17) l*_x(θ) = c_N (NI)^{1/2} (2π)^{−1/2} exp[−(NI/2)(θ − t(x))²],
where c_N is chosen so that
(8.18) ∫ l*_x(θ) dx₁ ⋯ dx_N = 1.
Then c_N → 1 and
(8.19) lim_{N→∞} ∫ |l_x(θ) − l*_x(θ)| dx₁ ⋯ dx_N = 0,
the integrals being independent of θ. Thus, for large N, the experiment with
likelihoods l_x may be approximated by an experiment with normal likelihoods
l*_x (see [8]). Similar results may be found in Le Cam [12] and P. Huber [10].
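The role of the convexity of −log f(x) can be illustrated with the logistic density: the log likelihood Σ log f(x_i − θ) is then concave in θ, so the mode t(x) of (8.16) is unique. A sketch with a fixed simulated sample (the data and grid are illustrative):

```python
# Sketch for example 8.7: with -log f convex (here the logistic density),
# the log-likelihood theta -> sum log f(x_i - theta) is concave, so the
# mode t(x) in (8.16) is unique. The sample is a fixed illustrative one.
import math, random

def logf(x):                      # logistic density; -log f is convex
    return -x - 2.0 * math.log1p(math.exp(-x))

random.seed(1)
xs = [random.uniform(-3, 3) for _ in range(50)]   # arbitrary sample

grid = [k / 100.0 for k in range(-500, 501)]
ll = [sum(logf(x - th) for x in xs) for th in grid]

# concavity on the grid: all second differences are nonpositive
second_diffs = [ll[i - 1] - 2 * ll[i] + ll[i + 1] for i in range(1, len(ll) - 1)]
assert max(second_diffs) <= 1e-9

t_hat = grid[max(range(len(grid)), key=lambda i: ll[i])]  # the mode t(x)
print("unique likelihood mode near", t_hat)
```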
EXAMPLE 8.8. Let θ = (θ₁, …, θ_N),
(8.20) p_θ(x₁, …, x_N) = (2π)^{−N/2} exp[−½ Σ_{i=1}^N (x_i − θ_i)²],
and let (θ₁, …, θ_N) be regarded as a sample from a normal distribution (μ, σ²).
If μ and σ² were known, we would obtain the following joint density of (x, θ):
(8.21) r_ν(x₁, …, x_N, θ₁, …, θ_N)
= σ^{−N} (2π)^{−N} exp[−½ Σ_{i=1}^N (x_i − θ_i)² − (2σ²)⁻¹ Σ_{i=1}^N (θ_i − μ)²].
Consequently, the best estimator of (θ₁, …, θ_N) would be (θ̂₁, …, θ̂_N) defined
by
(8.22) θ̂_i = (μ + σ² x_i)/(1 + σ²), i = 1, …, N.
Now, in the prior experiment (μ, σ²) can be estimated by (θ̄, S_θ²), where
(8.23) θ̄ = N⁻¹ Σ_{i=1}^N θ_i, S_θ² = (N − 1)⁻¹ Σ_{i=1}^N (θ_i − θ̄)².
In the x-experiment, in turn, a sufficient pair of statistics for (θ̄, S_θ²) is (x̄, S_x²),
where
(8.24) x̄ = N⁻¹ Σ_{i=1}^N x_i, S_x² = (N − 1)⁻¹ Σ_{i=1}^N (x_i − x̄)².
The estimators obtained from (8.22) on replacing (μ, σ²) by estimates based on
(x̄, S_x²) will be for large N nearly as good as the estimators (8.22)
(cf. C. Stein [18]).
Estimators of this kind could be successfully applied to estimating the averages
in individual strata in sample surveys. They represent a compromise between
estimates based on observations from the same stratum only, which are unbiased
but have large variance, and the overall estimates, which are biased but have
small variance.
The same method could be used for θ_i = 1 or 0, and (θ₁, …, θ_N) regarded as
a sample from an alternative distribution with unknown p. The parameter p
could then be estimated along the lines of example 8.1 (cf. H. Robbins [17]).
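A sketch of the compromise estimator of this example: substitute empirical moments for the unknown (μ, σ²) in (8.22) and shrink each x_i toward x̄. The simulation settings are illustrative; in this simple Monte Carlo check the shrunken estimators improve on the raw x_i:

```python
# Sketch of the compromise estimator of example 8.8 (cf. (8.22)-(8.24)):
# substitute (xbar, S_x^2) for the unknown (mu, sigma^2 + 1) and shrink
# each x_i toward xbar. The simulation setup is illustrative only.
import random

random.seed(7)
N, mu, sigma2 = 400, 2.0, 1.0
theta = [random.gauss(mu, sigma2 ** 0.5) for _ in range(N)]
x = [random.gauss(th, 1.0) for th in theta]        # X_i ~ N(theta_i, 1)

xbar = sum(x) / N
Sx2 = sum((xi - xbar) ** 2 for xi in x) / (N - 1)  # estimates sigma^2 + 1
shrink = max(0.0, 1.0 - 1.0 / Sx2)                 # estimated sigma^2/(1 + sigma^2)
theta_hat = [xbar + shrink * (xi - xbar) for xi in x]

mse_raw = sum((xi - th) ** 2 for xi, th in zip(x, theta)) / N
mse_shr = sum((ti - th) ** 2 for ti, th in zip(theta_hat, theta)) / N
print(mse_raw, mse_shr)    # shrinkage improves on the raw x_i
```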
9. Concluding remarks
The genius of classical statistics is based on skillful manipulations with the
notions of sufficiency, similarity and conditionality. The importance of similarity
increased after introducing the notion of completeness. An adequate imbedding