Module 2: Principles of Data Reduction
Recall the (parametric) inference problem: Let X1 , · · · , Xn be a random sample from some
distribution F , which is parameterized by some parameter vector θ, θ ∈ Θ, where Θ is the parameter
space. Our goal is to infer about F (which is equivalent to inferring about θ, or some function of θ) using the random sample X1 , · · · , Xn . Usually a statistician summarizes the data using some statistics (functions of the sample), for example, the mean, SD, mode, maximum, minimum, etc.
Recall that the sample space, say X , is a subset of Rn , and a statistic T ({X1 , · · · , Xn }) = T (X) is a function from X → R. Any statistic T (X) defines a form of data reduction, in the sense that the possible values of T (X) induce a partition of the sample space. Suppose the statistic T (X) has the realization t; then the collection of sample realizations At = {x = (x1 , . . . , xn )′ : T (x) = t} consists of exactly those points which lead to the functional value t. Now, let T = {t = T (x) : x ∈ X } be the range of the function T with domain X . Then it is not difficult to see that ∪{At : t ∈ T } = X , and if t ̸= t′ , then At ∩ At′ = ∅. Therefore, the collection of sets {At : t ∈ T } defines a partition of X , called the partition induced by the statistic T . Thus the reduction due to the statistic T is equivalent to this partition.
Example. Let X, Y be a random sample from uniform(0, 1). Consider two statistics T1 = max{X, Y } and T2 = I(X > Y ), where I is the indicator function. It is easy to see that A1,t = {(x, y) : max{x, y} = t} is the collection of all points on the two line segments {x = t, 0 < y ≤ t} and {y = t, 0 < x ≤ t}, whereas A2,0 = {(x, y) : x ≤ y} and A2,1 = {(x, y) : x > y}. Thus, T2 induces a higher level of data reduction.
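The contrast can also be seen empirically. The following sketch (purely illustrative; the variable names are ours) draws many (X, Y ) pairs and counts how many distinct values each statistic produces: T2 collapses the square into only two cells of its induced partition, whereas T1 produces essentially one cell per observed value.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 10_000
x, y = rng.uniform(0, 1, size=(2, n_draws))   # pairs (X, Y) from uniform(0, 1)

t1 = np.maximum(x, y)        # T1 = max{X, Y}: (almost surely) a new value for every pair
t2 = (x > y).astype(int)     # T2 = I(X > Y): only the values 0 and 1

print("distinct values taken by T1:", np.unique(np.round(t1, 6)).size)   # close to n_draws
print("distinct values taken by T2:", np.unique(t2).size)                # exactly 2
```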
Note that a higher level of reduction may lead to over-summarizing the data, resulting in loss of important information about the population. On the other hand, no reduction, or a very low level of reduction, may lead to storing unimportant information. The goal of a statistician is to employ the highest level of reduction as long as no important information is lost. What information is important? In parametric inference, all information relevant to the parameter θ is important.
1 Sufficient Statistic
Definition 1 (Sufficient Statistic). Let X = {X1 , · · · , Xn } be a random sample from the distribution
{Fθ ; θ ∈ Θ}. A statistic T (X) is a sufficient statistic for θ, if the conditional distribution of the random
sample X given the value of T (X) does not depend on θ.
(One may skip the following in the first reading) To see in more detail, let X be a discrete (or
continuous) random variable with pmf (or, pdf) fθ . Then the conditional distribution of X given
T (X) = t is
\[
f_{X \mid T(X)=t}(x) \;=\; \frac{f_{\theta;X,T}(x,t)}{f_{\theta;T}(t)} \;=\;
\begin{cases}
 f_{\theta;X}(x)/f_{\theta;T}(t) & \text{if } x \in A_t,\\[2pt]
 0 & \text{otherwise.}
\end{cases}
\]
By definition of sufficient statistic, fX|T (X)=t (x) is free of θ, and hence is completely known. Thus, it is
(theoretically) possible to simulate from this distribution. Suppose the random variable Y | T (X) = t
is distributed according to this conditional distribution. Then the unconditional distribution of Y is the same as the unconditional distribution of X, i.e., for any (measurable) subset A ⊆ X , Pθ (X ∈ A) = Pθ (Y ∈ A) regardless of the value of θ, since
\[
P_\theta(Y \in A) \;=\; \int_A \int_{t} f_{X \mid T=t}(x)\, f_{\theta;T}(t)\, dt\, dx
 \;=\; \int_A f_{\theta;X}(x)\, dx \;=\; P_\theta(X \in A).
\]
Thus, $Y \stackrel{D}{=} X$ (equal in distribution).
This implies that a statistician who observes only the value of T (X) (call this person S2), and hence does not know the full data or the value of θ, is still able to generate realizations from the distribution of X, regardless of the value of θ.
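As a concrete (hedged) illustration of this point, take the Bernoulli(θ) model of Example 1 below: given T = ΣXi = t, the sample X is uniform over all 0–1 vectors with exactly t ones, which involves no θ. The sketch below, with names of our own choosing, draws Y from this conditional distribution and checks that its marginal behaviour matches that of X.

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta, reps = 5, 0.3, 20_000   # theta is only used to generate the "true" data X

def regenerate_given_T(t, n, rng):
    """Draw Y from the conditional law of X given sum(X) = t in the Bernoulli model:
    uniform over all 0-1 vectors with exactly t ones.  Note that theta is never used."""
    y = np.zeros(n, dtype=int)
    y[rng.choice(n, size=int(t), replace=False)] = 1
    return y

x = rng.binomial(1, theta, size=(reps, n))           # statistician S1 sees the full sample
y = np.array([regenerate_given_T(xi.sum(), n, rng)   # S2 only sees T(x) = sum(xi)
              for xi in x])

# The marginal behaviour of Y matches that of X (up to Monte Carlo error),
# e.g. P(X1 = 1) and P(Y1 = 1) are both close to theta = 0.3:
print(x[:, 0].mean(), y[:, 0].mean())
```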
Example 1. Let X1 , · · · , Xn be a random sample from the Bernoulli(θ) distribution. Then the statistic T (X) = $\sum_{i=1}^{n} X_i$ is sufficient.
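For Example 1 the sufficiency can be checked directly from Definition 1; a sketch, using the fact that T = ΣXi ∼ Binomial(n, θ): for any x with $\sum_i x_i = t$,
\[
P_\theta\bigl(X = x \mid T(X) = t\bigr)
 \;=\; \frac{P_\theta(X = x)}{P_\theta(T = t)}
 \;=\; \frac{\theta^{t}(1-\theta)^{n-t}}{\binom{n}{t}\,\theta^{t}(1-\theta)^{n-t}}
 \;=\; \frac{1}{\binom{n}{t}},
\]
which is free of θ, so T (X) = $\sum_{i=1}^{n} X_i$ is sufficient.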
Example 2. Let X1 , . . . , Xn be a random sample from the Gamma(2, θ) distribution. Then the statistic T (X) = $\sum_{i=1}^{n} X_i$ is sufficient.
Example 3. Let X1 , . . . , Xn be a random sample from the Normal(µ, 1) distribution. Then the statistic T (X) = $\sum_{i=1}^{n} X_i$ is sufficient.
[Alternative proof using orthogonal transformation]
Example 4. Let X1 , · · · , Xn be a random sample from some absolutely continuous distribution with
parameter vector θ. Then {X(1) , · · · , X(n) } is jointly sufficient for θ. The complete
sample {X1 , · · · , Xn } is also sufficient for θ.
Theorem 1. Let X1 , · · · , Xn denote a random sample from a discrete or absolutely continuous dis-
tribution that has a joint pmf or joint pdf fX (·; θ), θ ∈ Θ. The statistic T = T (X) is sufficient for
θ if and only if (iff ) there exist functions g(t; θ) and h(x) such that, for all sample points x and all
parameter values θ ∈ Θ
fX (x; θ) = g (T (x); θ) h(x).
[Proof]
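As an illustration of how Theorem 1 is applied, here is a sketch of the factorization for the Normal(µ, 1) model of Example 3:
\[
f_X(x;\mu) \;=\; (2\pi)^{-n/2}\exp\Bigl\{-\tfrac12\sum_{i=1}^n (x_i-\mu)^2\Bigr\}
 \;=\; \underbrace{\exp\Bigl\{\mu\sum_{i=1}^n x_i - \tfrac{n\mu^2}{2}\Bigr\}}_{g(T(x);\,\mu)}
   \;\underbrace{(2\pi)^{-n/2}\exp\Bigl\{-\tfrac12\sum_{i=1}^n x_i^2\Bigr\}}_{h(x)},
\]
so T (X) = $\sum_{i=1}^{n} X_i$ is sufficient for µ.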
Remark 1. 1. Let X1 , · · · , Xn be a random sample from some distribution with parameter vector θ, and let T (X) be a sufficient statistic for θ. If T (X) is a function of another statistic U (X), then U (X) is also sufficient for θ. However, the converse is not true in general.
[Proof/counter example]
2. Let X1 , · · · , Xn be a random sample from some distribution with parameter vector θ, and let T (X) be a sufficient statistic for θ. If U (X) is a bijective function of T (X), then U (X) is also sufficient for θ. [Homework]
Example 5. Let X1 , . . . , Xn be a random sample from Uniform(0, θ) distribution. Then the statistic
T (X) = X(n) is sufficient for θ.
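A sketch of the factorization behind Example 5, with I(·) denoting the indicator function:
\[
f_X(x;\theta) \;=\; \prod_{i=1}^n \frac{1}{\theta}\, I(0 < x_i < \theta)
 \;=\; \underbrace{\theta^{-n}\, I\bigl(x_{(n)} < \theta\bigr)}_{g(T(x);\,\theta)}
   \;\underbrace{I\bigl(x_{(1)} > 0\bigr)}_{h(x)},
\]
so by Theorem 1, T (X) = X(n) is sufficient for θ.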
Example 6. Let X1 , . . . , Xn be a random sample from the discrete uniform distribution with equal probability mass on each point of {1, 2, . . . , θ}. Then the statistic T (X) = X(n) is sufficient for θ.
Example 7. Let X1 , . . . , Xn be a random sample from the Normal(µ, σ 2 ) distribution. Then the statistic T(X) = ($\sum_{i=1}^{n} X_i$, $\sum_{i=1}^{n} X_i^2$) is jointly sufficient for θ = (µ, σ 2 ).
Exponential family. A family of pmfs or pdfs is called a d-parameter exponential family if it can be expressed as
\[
f_X(x;\theta) \;=\; h(x)\, c(\theta)\, \exp\Bigl\{\sum_{i=1}^{k} w_i(\theta)\, t_i(x)\Bigr\} \tag{1}
\]
where θ = (θ1 , · · · , θd ). Here h(x) ≥ 0 for all x, and t1 , . . . , tk are real-valued functions of x, not depending on θ. Further, c(θ), w1 (θ), . . . , wk (θ) are real-valued functions of θ, not depending on x.
Many common distributions belong to the exponential family. Examples include (i) binomial(n, p)
with n known, (ii) Poisson(λ), (iii) normal(µ, σ 2 ), (iv) exponential(λ), (v) Beta(α, β), (vi) Gamma(α, β),
etc.
Theorem 2 (Sufficient statistics for exponential family of distributions.). Let X1 , . . . , Xn be a random
sample from a distribution with pmf or pdf fX (·; θ), θ ∈ Θ ⊆ Rd which belongs to an exponential family
given by (1) with d ≤ k. Then the statistic T(X) is jointly sufficient for θ, where
\[
\mathbf{T}(X) \;=\; \Bigl(\sum_{i=1}^{n} t_1(X_i),\; \cdots,\; \sum_{i=1}^{n} t_k(X_i)\Bigr).
\]
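For instance, for the Poisson(λ) family listed above, a single observation has pmf (a sketch)
\[
f_X(x;\lambda) \;=\; \frac{e^{-\lambda}\lambda^{x}}{x!}
 \;=\; \underbrace{\frac{1}{x!}}_{h(x)}\; \underbrace{e^{-\lambda}}_{c(\lambda)}\,
   \exp\bigl\{\underbrace{(\log\lambda)}_{w_1(\lambda)}\,\underbrace{x}_{t_1(x)}\bigr\},
\]
so, by Theorem 2, T (X) = $\sum_{i=1}^{n} X_i$ is sufficient for λ.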
2 Minimal Sufficient Statistic
Definition 2 (Minimal Sufficient Statistic). A sufficient statistic T (X) is called a minimal sufficient statistic if, for every other sufficient statistic U (X), T (X) is a function of U (X).
Understanding minimal sufficiency. Suppose T (X) and U (X) are two different sufficient statistics for the class of distributions Fθ , θ ∈ Θ. Let T = {t = T (z) : z ∈ X } and U = {u = U (z) : z ∈ X } be the ranges of T and U , respectively. For any t ∈ T and u ∈ U, consider the pre-images of t and u, namely At = {z : T (z) = t} ⊆ X and Bu = {z : U (z) = u} ⊆ X .
Note that both {At : t ∈ T } and {Bu : u ∈ U} are partitions of X . As both T and U are sufficient, each point z ∈ At carries the same information about θ (as the conditional distribution fX|T =t is free of θ). Similarly, each z ∈ Bu carries the same information about θ.
Now, suppose T (X) is a function of U (X), i.e., there exists a function h : U → T such that for each realization z ∈ X (sample space), T (z) = t = h(u) = h(U (z)), where U (z) = u. Define Ct = {u : h(u) = t} ⊆ U as the pre-image of t.
Consider a z ∈ Bu0 , then U (z) = u0 , and T (z) = h(U (z)) = h(u0 ) = t0 , which implies z ∈ At0 ,
where h(u0 ) = t0 . Thus, Bu0 ⊆ At0 , where h(u0 ) = t0 . Now, suppose there exists another u′ such that
h(u′ ) = t0 . Then, similarly we have Bu′ ⊆ At0 . So, more generally, Bu ⊆ At0 , for all u ∈ Ct0 .
As every z ∈ At0 carries the same information about θ, Bu0 has the same information about θ as At0 . However, At0 , being a larger subset of X , achieves a higher level of data reduction.
Thus, a minimal sufficient statistic T , being a function of every other sufficient statistic, provides the highest level of data reduction among the class of sufficient statistics.
2. Let both T (X) and U (X) be minimal sufficient statistics. Then U (X) is a bijective function of T (X). [Proof ]
3. A minimal sufficient statistic is not unique.
Example 8. Let X1 , . . . , Xn be a random sample from Normal(µ, σ 2 ) distribution. Then the statistic
(X̄, S 2 ) is minimal sufficient, where S 2 denotes the sample variance.
Example 9. Let X1 , . . . , Xn be a random sample from Uniform(θ, θ + 1) distribution. Then the
statistic T (X) = (X(1) , X(n) ) is minimal sufficient.
Example 10. Let X1 , . . . , Xn be a random sample from the binomial(m, p) distribution, where m is known. Then the statistic T (X) = $\sum_{i=1}^{n} X_i$ is minimal sufficient.
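The criterion used to verify minimal sufficiency is not reproduced in this extract; a standard one (the Lehmann–Scheffé likelihood-ratio criterion) says that T is minimal sufficient if the ratio fX (x; θ)/fX (y; θ) is free of θ exactly when T (x) = T (y). A sketch for Example 10 under that criterion:
\[
\frac{f_X(x;p)}{f_X(y;p)}
 \;=\; \frac{\prod_{i} \binom{m}{x_i}\, p^{x_i}(1-p)^{m-x_i}}{\prod_{i} \binom{m}{y_i}\, p^{y_i}(1-p)^{m-y_i}}
 \;=\; \frac{\prod_{i} \binom{m}{x_i}}{\prod_{i} \binom{m}{y_i}}
   \Bigl(\frac{p}{1-p}\Bigr)^{\sum_i x_i - \sum_i y_i},
\]
which is free of p if and only if $\sum_i x_i = \sum_i y_i$, i.e., iff T (x) = T (y).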
3 Ancillary Statistic
Ancillary Statistic. An ancillary statistic plays a role complementary to that of a sufficient statistic: while a sufficient statistic contains all the information about θ that can be obtained from the sample {X1 , . . . , Xn }, an ancillary statistic contains no information about θ.
Definition 3 (Ancillary Statistic). A statistic S(X) whose distribution does not depend on the pa-
rameter θ is called an ancillary statistic.
Example 11. Let X1 , . . . , Xn be a random sample from normal(µ, 1) distribution. Then the statistic
S 2 is ancillary for µ.
Example 12. Let X1 , . . . , Xn be a random sample from uniform(θ, θ + 1) distribution. Then the
statistic S(X) = (X(n) − X(1) ) is ancillary for θ.
1. Location family of distributions: Let X1 , . . . , Xn be a random sample, where
Xi = θ + Wi , i = 1, . . . , n,
where the Wi s are i.i.d. from some distribution with CDF F (which does not depend on θ). This type of distribution of X is called a location family of distributions, and θ is called the location parameter.
Let S(X) be a statistic such that
S(X1 + d, . . . , Xn + d) = S(X1 , . . . , Xn )
for all real d. Then S(X) is ancillary for θ, and is called a location-invariant statistic.
Example: The range, mean deviation about the mean, and standard deviation are location-invariant statistics.
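For instance, a one-line check that the range is location invariant:
\[
\max_i (x_i + d) - \min_i (x_i + d) \;=\; \max_i x_i - \min_i x_i \qquad \text{for all } d \in \mathbb{R},
\]
so in a location family the range X(n) − X(1) is ancillary for θ.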
2. Scale family of distributions: Let X1 , . . . , Xn be a random sample, where
Xi = θWi , i = 1, . . . , n,
where the Wi s are i.i.d. from some distribution with CDF F (which does not depend on θ). This type of distribution of X is called a scale family of distributions, and θ is called the scale parameter. Let
S(X) be a statistic such that
S(cX1 , . . . , cXn ) = S(X1 , . . . , Xn )
for all c > 0. Then S(X) is ancillary for θ, and is called a scale-invariant statistic.
Example: The statistics $X_1^2 / \sum_{i=1}^{n} X_i^2$ and $\min_i X_i / \max_i X_i$ are scale-invariant statistics.
Example 13. Let X1 , · · · , Xn be a random sample from uniform(0, θ). Then X(n) /X(1) is an ancillary
statistic for θ.
Example 14. Let X1 , · · · , Xn be a random sample from uniform(θ − 1, θ + 1). Then X1 − X̄n is an
ancillary statistic.
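A quick numerical sanity check of Example 13 (illustrative only; the ratio X(n) /X(1) is scale invariant, hence ancillary): its empirical quantiles should not move when θ changes. A sketch, assuming nothing beyond NumPy:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 5, 100_000

def ratio_stat(theta):
    """Simulate S = X_(n) / X_(1) for a uniform(0, theta) sample of size n (Example 13)."""
    x = rng.uniform(0, theta, size=(reps, n))
    return x.max(axis=1) / x.min(axis=1)

# The empirical quantiles of S barely move when theta changes,
# consistent with S being ancillary for theta.
for theta in (1.0, 50.0):
    print(theta, np.quantile(ratio_stat(theta), [0.25, 0.50, 0.75]).round(2))
```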
4 Completeness
Complete family of distributions. So far we have come across the concepts of minimal sufficient and ancillary statistics. Intuitively, it seems interesting to know whether these two statistics are unrelated (independent). In fact, if two statistics S(X) and T (X) are independently distributed and T (X) is sufficient for θ, then S(X) must be ancillary for θ. (Why?)
Remark 3. The converse of the above statement is not true in general, i.e., if S(X) is sufficient for
θ and T (X) is ancillary for θ, then it is not necessarily true that S(X) and T (X) are independently
distributed. For example, let X1 , X2 be a random sample from normal(θ, 1) distribution. Then S(X) =
(X1 , X2 )′ is jointly sufficient for θ and T (X) = X1 − X2 is ancillary for θ. Now, observe that S and
T are not independent. [To see this, you may verify that P (T > 0 | S = s) ̸= P (T > 0).]
Remark 4. In fact, even if S(X) is minimal sufficient for θ and T (X) is ancillary for θ, it is not necessarily true that S(X) and T (X) are independently distributed. For example, let X1 , · · · , Xn be a random sample from uniform(θ, θ + 1). We have previously seen that S(X) = [X(1) , X(n) ]′ is minimal sufficient for θ. Being a bijective function of S(X), S⋆ (X) = [X(1) + X(n) , X(n) − X(1) ]′ is also minimal sufficient. On the other hand, T (X) = X(n) − X(1) is ancillary for θ. Thus, here the ancillary statistic is a function of the minimal sufficient statistic, and so they cannot be independent.
In all the above cases, observe that there exists a non-zero function of the sufficient (or minimal sufficient) statistic whose expectation is a constant c (free of θ). For example, in Remark 3, T (X) = X1 − X2 satisfies Eθ (T (X)) = 0, but Pθ (T (X) = 0) = 0 ̸= 1. In Remark 4, if we take the function g : R2 → R as g((x, y)) = y, then Eθ (g (S⋆ (X))) = Eθ (X(n) − X(1) ) = c (free of θ), but Pθ (X(n) − X(1) = c) = 0 ̸= 1. It turns out that, if one rules out the possibility that a non-zero function of the sufficient statistic T = T (X) has zero expectation for all θ, then T becomes independent of every ancillary statistic of θ. This special property, which ensures independence of minimal sufficient and ancillary statistics, is called completeness.
Definition 4 (Complete Family of Distributions). Let T (X) be a statistic with pdf or pmf fT (·; θ).
The family of distributions {fT (·; θ); θ ∈ Θ} is called complete if, for every function g, Eθ (g(T )) = 0 for all θ ∈ Θ implies Pθ (g(T ) = 0) = 1 for all θ ∈ Θ. If the family of distributions of a statistic T (X) is complete, then T (X) is called a complete statistic.
Remark 5. Eθ (g(T )) = 0 for all θ ∈ Θ does not in general imply that Pθ (g(T ) = 0) = 1 for all θ ∈ Θ.
For example, let {X1 , X2 } be a random sample from N (θ, 1), T = X1 − X2 and g(T ) = T . Then for
any θ, Eθ (g(T )) = 0. However, P (g(T ) = 0) = P (X1 = X2 ) = 0.
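For contrast with Remark 5, here is a sketch (standard, though the proof is not worked out in this extract) of why the family of T ∼ Binomial(n, θ), 0 < θ < 1, is complete. If Eθ (g(T )) = 0 for all θ ∈ (0, 1), then
\[
\sum_{t=0}^{n} g(t)\binom{n}{t}\theta^{t}(1-\theta)^{n-t}
 \;=\; (1-\theta)^{n}\sum_{t=0}^{n} g(t)\binom{n}{t}\Bigl(\frac{\theta}{1-\theta}\Bigr)^{t} \;=\; 0
 \qquad \text{for all } 0<\theta<1,
\]
so the polynomial in r = θ/(1 − θ) ∈ (0, ∞) must have all coefficients zero, i.e., g(t) = 0 for t = 0, 1, . . . , n, and hence Pθ (g(T ) = 0) = 1 for all θ.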
Remark 6. Completeness is a property of the family of distribution of a statistic T (X). For example,
the N (θ, 1), θ ∈ R, family is complete. (Proof after midsem)
Example 18. Let {X1 , · · · , Xn } be a random sample from uniform(0, θ), θ > 0. Then the family of
distributions of X(n) is complete. (Proof)
Example 19. The family normal(0, σ 2 ), σ 2 > 0, is not complete. However, if X ∼ normal(0, σ 2 ), then the family of distributions of T (X) = X 2 is complete.
Thus, if X1 , · · · , Xn is a random sample from normal(0, σ 2 ), then T1 (X) = $\sum_{i=1}^{n} X_i$ is not complete, but T2 (X) = $\sum_{i=1}^{n} X_i^2$ is complete.
The statistic T(X) defined in Theorem 2 is complete if d ≤ k, and {(w1 (θ), · · · , wk (θ)); θ ∈ Θ} contains an open set in Rk . (Without proof )
Remark 7. The condition that {(w1 (θ), · · · , wk (θ)); θ ∈ Θ} contains an open set in Rk is crucial. For example, one can show that the minimal sufficient statistic for the Normal(θ, θ2 ), θ ̸= 0, distribution is not complete. (Homework)
Theorem 5 (Basu’s theorem). Let {X1 , · · · , Xn } be a sample from the family of distributions with
pmf or pdf fX (·; θ), θ ∈ Θ. If T (X) is a complete sufficient statistic for θ, then T (X) is independent
of every ancillary statistic of θ. [Proof after midsem]
Remark 8. Basu’s theorem provides a way to verify independence of two statistics without explicitly
deriving the joint/conditional distributions. Let us see an example.
Example 20. Let X1 , · · · , Xn be a random sample from normal(µ, 1). Let S = S(X) be such that
S(X) = S(X + c1) for all c ∈ R. Then X̄n and S are independent. In particular, X̄ and
S 2 are independent.
Example 22. Let X1 , · · · , Xn be a random sample from uniform(0, θ). Then X(n) and X(1) /X(n) are
independent.
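A numerical sanity check of Example 22 (illustrative only; the actual proof is via Basu's theorem, since X(n) is complete sufficient and X(1) /X(n) is scale invariant, hence ancillary): independence implies, in particular, zero correlation, which the simulation below is consistent with.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps, theta = 5, 200_000, 2.0

x = rng.uniform(0, theta, size=(reps, n))
t = x.max(axis=1)                  # X_(n): complete sufficient statistic for theta
s = x.min(axis=1) / x.max(axis=1)  # X_(1)/X_(n): scale invariant, hence ancillary

# Basu's theorem says T and S are independent; in particular their correlation is 0.
print(np.corrcoef(t, s)[0, 1])     # close to 0
```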