Arup Bose · Snigdhansu Chatterjee
U-Statistics, Mm-Estimators and Resampling
Texts and Readings in Mathematics
Volume 75
Advisory Editor
C. S. Seshadri, Chennai Mathematical Institute, Chennai
Managing Editor
Rajendra Bhatia, Indian Statistical Institute, New Delhi
Editors
Manindra Agrawal, Indian Institute of Technology, Kanpur
V. Balaji, Chennai Mathematical Institute, Chennai
R. B. Bapat, Indian Statistical Institute, New Delhi
V. S. Borkar, Indian Institute of Technology, Mumbai
T. R. Ramadas, Chennai Mathematical Institute, Chennai
V. Srinivas, Tata Institute of Fundamental Research, Mumbai
Technical Editor
P. Vanchinathan, Vellore Institute of Technology, Chennai
The Texts and Readings in Mathematics series publishes high-quality textbooks,
research-level monographs, lecture notes and contributed volumes. Undergraduate
and graduate students of mathematics, research scholars, and teachers would find
this book series useful. The volumes are carefully written as teaching aids and
highlight characteristic features of the theory. The books in this series are
co-published with Hindustan Book Agency, New Delhi, India.
U-Statistics, Mm-Estimators and Resampling
Arup Bose, Statistics and Mathematics Unit, Indian Statistical Institute, Kolkata, West Bengal, India
Snigdhansu Chatterjee, School of Statistics, University of Minnesota, Minneapolis, MN, USA
This work is a co-publication with Hindustan Book Agency, New Delhi, licensed for sale in all countries
in electronic form, in print form only outside of India. Sold and distributed in print within India by
Hindustan Book Agency, P-19 Green Park Extension, New Delhi 110016, India. ISBN: 978-93-86279-71-2
© Hindustan Book Agency 2018.
© Springer Nature Singapore Pte Ltd. 2018 and Hindustan Book Agency 2018
This work is subject to copyright. All rights are reserved by the Publishers, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publishers, the authors, and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publishers nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Contents
Preface  xi
1 Introduction to U-statistics  1
1.1 Definition and examples  1
1.2 Some finite sample properties  6
1.2.1 Variance  6
1.2.2 First projection  7
1.3 Law of large numbers and asymptotic normality  8
1.4 Rate of convergence  12
1.5 Degenerate U-statistics  17
1.6 Exercises  30
3 Introduction to resampling  69
3.1 Introduction  69
3.2 Three standard examples  71
3.3 Resampling methods: the jackknife and the bootstrap  77
5 An Introduction to R  127
5.1 Introduction, installation, basics  127
5.1.1 Conventions and rules  130
5.2 The first steps of R programming  131
5.3 Initial steps of data analysis  133
5.3.1 A dataset  134
5.3.2 Exploring the data  137
5.3.3 Writing functions  142
5.3.4 Computing multivariate medians  143
5.4 Multivariate median regression  145
5.5 Exercises  149
Bibliography  150
To Baishali
S. C.
Preface
This small book covers three important topics that we believe every statistics
student should be familiar with: U -statistics, Mm -estimates and Resampling.
The final chapter is a quick and short introduction to the statistical software
R, primarily geared towards implementing resampling. We hasten to add that
the book is introductory. However, adequate references are provided for the
reader to explore further.
Any U-statistic (with finite variance) is the non-parametric minimum variance unbiased estimator of its expectation. Many common statistics and estimators are either U-statistics or are approximately so. The systematic study of U-statistics began with Hoeffding (1948), and comprehensive treatments of U-statistics are available in many places, including Lee (1990) and Korolyuk and Borovskich (1993).
In Chapter 1 we cover the very basics of U-statistics. We begin with their definition and examples. The exact finite sample distribution and other properties of U-statistics can seldom be calculated. We cover some asymptotic properties of U-statistics such as the central limit theorem, the weak and strong laws of large numbers, the law of iterated logarithm, a deviation result and a distributional limit theorem for first order degenerate U-statistics. As direct applications of these, we establish the asymptotic normality of many common estimators and the sum of weighted chi-square limit for the Cramér-von Mises statistic. Other applications are provided in Chapter 2. In particular, the idea of linearization, or the so-called weak representation of a U-statistic, carries over to the next chapters.
Chapter 2 is on M -estimators and their general versions Mm -estimators,
introduced by Huber (1964) out of robustness considerations. Asymptotic
properties of these estimates have been treated under different sets of con-
ditions in several books and innumerable research articles. Establishing the
most general results for these estimators requires sophisticated treatment us-
ing techniques from the theory of empirical processes. We strive for a simple
approach.
We impose a few very simple conditions on the model. Primary among these is a convexity condition, which is still general enough to be widely applicable. Under these conditions, a huge class of Mm-estimators are approximate U-statistics. Hence the theory developed in Chapter 1 can be profitably used to derive asymptotic properties, such as the central limit theorem for Mm-estimators, by linearizing them. We present several examples to show how the general results can be applied to specific estimators. In particular, several multivariate estimates of location are discussed in detail.
The linearization in Chapters 1 and 2 was achieved by expending consid-
erable technical effort. There still remain two noteworthy issues. First, such
a linearization may not be easily available for many estimates. Second, even
if an asymptotic normality result is established, it may not be easy to find or
estimate the asymptotic variance.
Since the inception of the bootstrap method in the early eighties, resampling has become an alternative to asymptotic distributional results. It is now a necessary item in the everyday toolkit of a statistician. It attempts to replace analytic derivations with the force of computations. Again, there are several excellent monographs and books on this topic, in addition to the surfeit of articles on both the theory and applications of resampling.
In Chapter 3, we introduce the main ideas of resampling in an easy way by using three benchmark examples: the sample mean, the sample median and the ordinary least squares estimates of regression parameters. In particular, we explain when and how resampling can produce “better” estimates than those from traditional asymptotic normal approximations. We also present a short overview of the most common methods of resampling.
Chapter 4 focuses on resampling estimates for the sampling distribution and asymptotic variance of U-statistics and Mm-estimators. In particular, we discuss the traditional Efron (multinomial) bootstrap and its drawbacks in the context of U-statistics. We discuss how the generalized bootstrap arises naturally in this context. We establish a bootstrap linearization result for U-statistics. The generalized bootstraps with additive and multiplicative weights are given special attention, the first due to the computational savings obtained and the second due to its connection with Efron's bootstrap. Finally, we also establish a weighted U-statistics result for the generalized bootstrap of Mm-estimators.
Arup Bose is Professor at the Statistics and Mathematics Unit, Indian Sta-
tistical Institute, Kolkata, India. He is a Fellow of the Institute of Mathemati-
cal Statistics and of all the three national science academies of India. He has
significant research contributions in the areas of statistics, probability, eco-
nomics and econometrics. He is a recipient of the Shanti Swarup Bhatnagar
Prize and the C R Rao National Award in Statistics. His current research in-
terests are in large dimensional random matrices, free probability, high dimen-
sional data, and resampling. He has authored three books: Patterned Random Matrices; Large Covariance and Autocovariance Matrices (with Monika Bhattacharjee); and Random Circulant Matrices (with Koushik Saha), published by Chapman & Hall.
Chapter 1
Introduction to U-statistics
We shall often write $U_n$ for $U_n(h)$ when the function $h$ is clear from the context. An appropriate extension is available when $h$ is vector valued.
Note that a $U$-statistic of degree $m$ is also a $U$-statistic of degree $(m+1)$. As a consequence, the sum of two $U$-statistics is again a $U$-statistic. In this book, we consider the order $m$ to be the smallest integer for which the above definition holds.
\[
U_n = n^{-1}\sum_{i=1}^{n} Y_i = \bar{Y}. \tag{1.2}
\]
\[
U_n = (n-1)^{-1}\sum_{i=1}^{n}\big(Y_i - \bar{Y}\big)^2 = s_n^2. \tag{1.4}
\]
\[
h\big((x_1, y_1), (x_2, y_2)\big) = \frac{1}{2}(x_1 - x_2)(y_1 - y_2).
\]
\[
U_n = (n-1)^{-1}\sum_{i=1}^{n}\big(X_i - \bar{X}\big)\big(Y_i - \bar{Y}\big). \tag{1.5}
\]
This is a $U$-statistic with $h\big((x_1, y_1), (x_2, y_2)\big) = \mathrm{sign}\big((x_1 - x_2)(y_1 - y_2)\big)$.
Example 1.5 (Gini's mean difference): A measure of income inequality is Gini's mean difference, given by
\[
U_n = \binom{n}{2}^{-1}\sum_{1\le i<j\le n} |Y_i - Y_j|. \tag{1.7}
\]
\[
E(U_n) = \frac{2}{\pi^{1/2}}\,\sigma. \tag{1.8}
\]
\[
T^{+} = \sum_{i=1}^{n} R_i\, I_{\{Y_i > 0\}}. \tag{1.9}
\]
$T^{+}$ is the sum of the ranks of all the positive observations. It can be written as a linear combination of two $U$-statistics with kernels of sizes 1 and 2. To see this, note that
\[
I_{\{Y_i + Y_j > 0\}} = I_{\{Y_i > 0\}}I_{\{|Y_j| < Y_i\}} + I_{\{Y_j > 0\}}I_{\{|Y_i| < Y_j\}}, \tag{1.10}
\]
with kernels $f(x_1, x_2) = I_{\{x_1 + x_2 > 0\}}$ and $g(x_1) = I_{\{x_1 > 0\}}$.
Example 1.7: A correlation coefficient different from the usual product moment correlation was introduced and studied in detail by Bergsma (2006). We need some preparation to define it. First suppose that $Z$, $Z_1$ and $Z_2$ are i.i.d. real valued random variables with the distribution function $F$. Let
\[
h_F(z_1, z_2) = -\frac{1}{2}\,E\big(|z_1 - z_2| - |z_1 - Z_2| - |Z_1 - z_2| + |Z_1 - Z_2|\big). \tag{1.12}
\]
\[
\rho^{*}(X, Y) = \frac{\kappa(X, Y)}{\sqrt{\kappa(X, X)\,\kappa(Y, Y)}}.
\]
\[
A_{1k} = \frac{1}{n}\sum_{i=1}^{n} |X_k - X_i|, \quad A_{2k} = \frac{1}{n}\sum_{i=1}^{n} |Y_k - Y_i|, \quad\text{and}
\]
\[
B_1 = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} |X_i - X_j|, \quad B_2 = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} |Y_i - Y_j|.
\]
For k, l = 1, . . . , n, let
\[
h_{\hat{F}_X}(X_k, X_l) = -\frac{1}{2}\Big(|X_k - X_l| - \frac{n}{n-1}A_{1k} - \frac{n}{n-1}A_{1l} + \frac{n}{n-1}B_1\Big),
\]
\[
h_{\hat{F}_Y}(Y_k, Y_l) = -\frac{1}{2}\Big(|Y_k - Y_l| - \frac{n}{n-1}A_{2k} - \frac{n}{n-1}A_{2l} + \frac{n}{n-1}B_2\Big).
\]
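The quantities above are just sample means of absolute differences, so they are easy to compute directly. The following R sketch evaluates $A_{1k}$, $B_1$ and the matrix of kernel values $h_{\hat{F}_X}(X_k, X_l)$ for a numeric vector; the function name and the example data are illustrative only, and this is not code from the book's package.

  # Illustrative sketch: empirical Bergsma-type kernel matrix for a sample x
  h.Fhat <- function(x) {
    n  <- length(x)
    D  <- abs(outer(x, x, "-"))   # matrix of |X_k - X_l|
    A  <- rowMeans(D)             # A_k = (1/n) sum_i |X_k - X_i|
    B  <- mean(D)                 # B = (1/n^2) sum_{i,j} |X_i - X_j|
    cn <- n / (n - 1)
    -0.5 * (D - cn * outer(A, rep(1, n)) - cn * outer(rep(1, n), A) + cn * B)
  }

  set.seed(1)
  x <- rnorm(10)
  H <- h.Fhat(x)   # n x n matrix of kernel values h_{Fhat_X}(X_k, X_l)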
1.2.1 Variance
Assume that $\mathbb{V}\big(h(Y_1, \ldots, Y_m)\big) < \infty$, where $\mathbb{V}$ denotes variance. To compute $\mathbb{V}(U_n)$, we need to compute
\[
\mathrm{COV}\big(h(Y_{j_1}, \ldots, Y_{j_m}),\, h(Y_{i_1}, \ldots, Y_{i_m})\big).
\]
This can be seen as follows. First, we can choose the $m$ indices $\{i_1, \ldots, i_m\}$ from $\{1, \ldots, n\}$ in $\binom{n}{m}$ ways. Then choose $c$ of those which are to be common with $\{j_1, \ldots, j_m\}$ in $\binom{m}{c}$ ways. Now choose the rest of the $(m-c)$ indices of $\{j_1, \ldots, j_m\}$ from the $(n-m)$ remaining indices in $\binom{n-m}{m-c}$ ways.
Hence
\begin{align}
\mathbb{V}(U_n) &= \binom{n}{m}^{-2}\sum_{1\le i_1<\cdots<i_m\le n}\ \sum_{1\le j_1<\cdots<j_m\le n} \mathrm{COV}\big(h(Y_{i_1}, \ldots, Y_{i_m}),\, h(Y_{j_1}, \ldots, Y_{j_m})\big) \tag{1.15}\\
&= \binom{n}{m}^{-2}\sum_{c=1}^{m}\binom{n}{m}\binom{m}{c}\binom{n-m}{m-c}\,\delta_c
 = \binom{n}{m}^{-1}\sum_{c=1}^{m}\binom{m}{c}\binom{n-m}{m-c}\,\delta_c. \tag{1.16}
\end{align}
As a consequence, as $n \to \infty$,
\[
\mathbb{V}(U_n) = \frac{m^2\delta_1}{n} + O(n^{-2}) \quad\text{and} \tag{1.17}
\]
\[
\mathbb{V}\big(n^{1/2}(U_n - \theta)\big) \to m^2\delta_1. \tag{1.18}
\]
Example 1.8: Let $h(x_1, x_2) = \frac{(x_1 - x_2)^2}{2}$. Then
\[
U_n = s_n^2 = (n-1)^{-1}\sum_{i=1}^{n}\big(Y_i - \bar{Y}_n\big)^2.
\]
\[
\delta_1 = \frac{\mu_4 - \sigma^4}{4}, \tag{1.19}
\]
\[
\delta_2 = \frac{\mu_4 + \sigma^4}{2}, \quad\text{and}
\]
\[
\mathbb{V}(U_n) = \frac{4(n-2)}{n(n-1)}\,\delta_1 + \frac{2}{n(n-1)}\,\delta_2. \tag{1.20}
\]
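As a numerical illustration of (1.19)–(1.20), the following R sketch compares the exact variance formula for $U_n = s_n^2$ with a Monte Carlo estimate for exponential data; the sample size, number of replications and all object names are illustrative choices.

  set.seed(123)
  n <- 20; reps <- 20000
  # For Exp(1): sigma^2 = 1 and the central fourth moment mu4 = 9
  mu4 <- 9; sig4 <- 1
  delta1 <- (mu4 - sig4) / 4
  delta2 <- (mu4 + sig4) / 2
  v.exact <- 4 * (n - 2) / (n * (n - 1)) * delta1 + 2 / (n * (n - 1)) * delta2
  v.mc <- var(replicate(reps, var(rexp(n))))   # var() computes the U-statistic s_n^2
  c(exact = v.exact, monte.carlo = v.mc)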
so that
\[
E\tilde{h}_1(Y_1) = E\,h(Y_1, \ldots, Y_m) - \theta = 0.
\]
Let
\[
R_n = U_n - \theta - \frac{m}{n}\sum_{i=1}^{n}\tilde{h}_1(Y_i). \tag{1.22}
\]
\[
\mathbb{V}(U_n) = \frac{m^2}{n}\,\delta_1 + \mathbb{V}(R_n) = \frac{m^2}{n}\,\delta_1 + O(n^{-2}). \tag{1.26}
\]
\[
\mathbb{V}\big(n^{1/2}R_n\big) \to 0 \quad\text{and hence}\quad n^{1/2}R_n \xrightarrow{P} 0. \tag{1.27}
\]
(a)
\[
U_n - \theta = \frac{m}{n}\sum_{i=1}^{n}\tilde{h}_1(Y_i) + R_n \quad\text{where } n^{1/2}R_n \xrightarrow{P} 0. \tag{1.28}
\]
(b)
\[
n^{1/2}(U_n - \theta) \xrightarrow{D} N(0, m^2\delta_1) \quad\text{where } \delta_1 = \mathbb{V}\big(\tilde{h}_1(Y_1)\big).
\]
\[
U_n - \theta \xrightarrow{P} 0 \quad\text{as } n \to \infty.
\]
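To see the central limit theorem of part (b) in action, the following R sketch simulates Gini's mean difference (1.7) for standard normal data and compares the variance of the standardized statistic with the limit $m^2\delta_1$; the constant $\delta_1$ is evaluated by simulating the closed-form first projection $h_1(y) = E|y - Y_2|$ for $N(0,1)$, and all names and sizes are illustrative.

  set.seed(7)
  gini <- function(y) mean(dist(y))       # average of |Y_i - Y_j| over pairs i < j
  n <- 200; reps <- 2000
  theta <- 2 / sqrt(pi)                   # E|Y1 - Y2| for N(0,1) data
  z <- replicate(reps, sqrt(n) * (gini(rnorm(n)) - theta))
  h1 <- function(y) 2 * dnorm(y) + y * (2 * pnorm(y) - 1)   # E|y - Y2| for N(0,1)
  delta1 <- var(h1(rnorm(1e5)))
  c(simulated.var = var(z), limit.var = 4 * delta1)          # m = 2, limit m^2 * delta1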
Rate results in the weak law, when stronger moment conditions are as-
sumed, are given in Section 1.4. A much stronger result than the weak law
is actually true for U -statistics and we state it below:
Theorem 1.3 (Hoeffding (1961)). (Strong law of large numbers (SLLN) for $U$-statistics.) If $E|h(Y_1, \ldots, Y_m)| < \infty$, then
\[
U_n - \theta \xrightarrow{a.s.} 0 \quad\text{as } n \to \infty.
\]
This result can be proved by using SLLN for either reverse martingales
Berk (1966) or forward martingales Hoeffding (1961). See Lee (1990) for a
detailed proof.
The next result we present is the law of iterated logarithm (LIL) for $U$-statistics. See Lee (1990) for a proof of this result. This result will be used later, in Chapter 2.
Theorem 1.4 (Dehling et al. (1986)). (Law of iterated logarithm (LIL) for $U$-statistics.) Suppose $U_n$ is a $U$-statistic with kernel $h$ such that $\delta_1 > 0$ and $E|h(Y_1, \ldots, Y_m)|^2 < \infty$. Then as $n \to \infty$,
\[
\limsup_{n} \frac{n^{1/2}(U_n - \theta)}{\big(2m^2\delta_1\log\log n\big)^{1/2}} = 1 \quad\text{almost surely.}
\]
Example 1.9: Consider the $U$-statistic $s_n^2$. In Example 1.8, we have calculated
\[
\delta_1 = \frac{\mu_4 - \sigma^4}{4},
\]
where $\mu_4 = E\big(Y - EY\big)^4$. Thus if $\mu_4 < \infty$,
\[
n^{1/2}\big(s_n^2 - \sigma^2\big) \xrightarrow{D} N\big(0, \mu_4 - \sigma^4\big).
\]
\begin{align*}
\delta_1 &= \mathrm{COV}\big(Y_1Y_2,\, Y_1\tilde{Y}_2\big)
 = E\,Y_1^2Y_2\tilde{Y}_2 - E\,Y_1Y_2\,E\,Y_1\tilde{Y}_2\\
 &= \mu^2\,EY_1^2 - \mu^4 = \mu^2\,\mathbb{V}(Y_1) = \mu^2\sigma^2.
\end{align*}
continuous. Then
\begin{align}
h_1(x, y) &= E\,h\big((x, y), (X_2, Y_2)\big) \tag{1.29}\\
&= P\big((x - X_2)(y - Y_2) > 0\big) - P\big((x - X_2)(y - Y_2) < 0\big)\\
&= P\big((X_2 > x, Y_2 > y)\ \text{or}\ (X_2 < x, Y_2 < y)\big) - P\big((X_2 > x, Y_2 < y)\ \text{or}\ (X_2 < x, Y_2 > y)\big)\\
&= 1 - 2F(x, \infty) - 2F(\infty, y) + 4F(x, y)\\
&= \big(1 - 2F_1(x)\big)\big(1 - 2F_2(y)\big) + 4\big(F(x, y) - F_1(x)F_2(y)\big). \tag{1.30}
\end{align}
Hence under independence, $n^{1/2}U_n \xrightarrow{D} N(0, 4/9)$.
Example 1.12: Wilcoxon's statistic, defined in Example 1.6, is used for testing the null hypothesis that the distribution $F$ of $Y_1$ is continuous and symmetric about 0. Recall the expression for $T^{+}$ in (1.11). Note that under the null hypothesis $E\,U_n(f) = P(Y_1 + Y_2 > 0) = 1/2$.
\[
n^{-3/2}\big(nU_n(g)\big) \xrightarrow{P} 0. \tag{1.36}
\]
Further,
\[
\mathbb{V}\big(\tilde{f}_1\big) = \mathrm{COV}\big(I_{\{Y_1+Y_2>0\}},\, I_{\{Y_1+\tilde{Y}_2>0\}}\big)
 = P\big(Y_1 + Y_2 > 0,\ Y_1 + \tilde{Y}_2 > 0\big) - (1/2)^2.
\]
As a consequence,
\[
\mathbb{V}\big(\tilde{f}_1\big) = \frac{1}{3} - \frac{1}{4} = \frac{1}{12} = \delta_1. \tag{1.38}
\]
\[
n^{1/2}\big(U_n(f) - 1/2\big) \xrightarrow{D} N(0, 1/3).
\]
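The limit in the last display is easy to check numerically. The R sketch below computes $U_n(f)$, the proportion of pairs with $Y_i + Y_j > 0$, for data symmetric about zero, and compares the variance of the standardized statistic with the $N(0, 1/3)$ limit; all names and sizes are illustrative.

  set.seed(42)
  Unf <- function(y) {              # kernel f(y_i, y_j) = I{y_i + y_j > 0}, i < j
    s <- outer(y, y, "+") > 0
    mean(s[upper.tri(s)])
  }
  n <- 100; reps <- 5000
  z <- replicate(reps, sqrt(n) * (Unf(rnorm(n)) - 1/2))
  c(simulated.var = var(z), limit.var = 1/3)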
(b) If $\psi(s) = E\exp\big(s|h(Y_1, \ldots, Y_m)|\big) < \infty$ for some $0 < s \le s_0$, then for $k = [n/m]$ and $0 < s \le s_0k$,
\[
E\exp\big(sU_n(h)\big) \le \big(\psi(s/k)\big)^k. \tag{1.40}
\]
(c) (Berk (1970)). Under the same assumption as (b), for every $\epsilon > 0$, there exist constants $0 < \delta < 1$ and $C$ such that
\[
P\Big(\sup_{k\ge n}\big|U_k(h) - \mu\big| > \epsilon\Big) \le C\delta^n. \tag{1.41}
\]
Proof of Theorem 1.5: Now suppose $m > 1$. Recall the weak decomposition
\[
U_n(h) - \theta = \frac{m}{n}\sum_{i=1}^{n}\tilde{h}_1(Y_i) + R_n. \tag{1.43}
\]
Since the result has already been established for m = 1, it is now enough to
prove that
\[
P\Big(\sup_{k\ge n}|R_k| \ge \epsilon\Big) = o(n^{1-r}). \tag{1.44}
\]
Using this in (1.45) verifies (1.44) and proves Theorem 1.5(a) completely.
The detailed proofs of (b) and (c) (without the supremum) can be found
in Serfling (1980) (page 200–202). We just mention that for the special case
of m = 1, part (c) is an immediate consequence of the following lemma whose
proof is available in Durrett (1991), Lemma 9.4, Chapter 1.
Lemma 1.2. Let $Y_1, Y_2, \ldots$ be i.i.d. random variables, and $E\,e^{s|Y_n|} < \infty$ for some $s > 0$. Let $S_n = Y_1 + \cdots + Y_n$, $\mu = EY_n$. Then for each $\varepsilon > 0$ there exists $a > 0$ such that
\[
P\Big(\Big|\frac{S_n}{n} - \mu\Big| > \varepsilon\Big) = O(e^{-an}), \quad\text{as } n \to \infty. \tag{1.47}
\]
Note that $\Psi_n(t)$ is finite for each $t$ since $h_{n1}$ is bounded. Letting $k = [n/m]$, and using Theorem 1.5(b),
\begin{align*}
A_1 = P\big(n^{1/2}U_n(h_{n1}) \ge v_na_n\big) &= P\big(tn^{1/2}U_n(h_{n1})/v_n \ge ta_n\big)\\
&\le \exp(-ta_n)\,\Psi_n\big(n^{1/2}t/v_n\big)\\
&\le \exp(-ta_n)\,\big[E\exp\big(n^{1/2}tY/(v_nk)\big)\big]^k, \text{ say,}
\end{align*}
\[
A_1 \le \exp\Big(-ta_n + \frac{t^2n}{k}\Big). \tag{1.50}
\]
Let $t = K(\log n)^{1/2}/(4(2m-1))$. Then for all large $n$, bounding the exponent in (1.50) yields
\[
P\big(|n^{1/2}U_n(h_{n1})| > Kv_n(\log n)^{1/2}/2\big) \le n^{-K^2/(16(2m-1))}. \tag{1.52}
\]
Note that in the above proof the restrictions we have in place on $m_n$, $K$ and $t$ (with $t = K(\log n)^{1/2}/(4(2m-1)) \le n^{-1/2}kv_n/(2m_n)$) are all compatible. The theorem follows by using (1.52) and (1.53) and the given condition on $v_n$.
Example 1.13: Consider the $U$-statistic $U_n = s_n^2$ from Example 1.9, which has the kernel
\[
h(x_1, x_2) = \frac{(x_1 - x_2)^2}{2}.
\]
Then as calculated earlier, $\delta_1 = \frac{1}{4}\big(\mu_4 - \mu_2^2\big)$ where
\[
\mu_4 = E\big(Y_1 - \mu\big)^4, \quad \mu_2 = E\big(Y_1 - \mu\big)^2. \tag{1.54}
\]
Note that
\[
\mu_4 = \mu_2^2 \iff \big(Y_1 - \mu\big)^2 \text{ is a constant} \iff Y_1 = \mu \pm C \ \ (C \text{ a constant}).
\]
Then $n^{1/2}\big(s_n^2 - \mu_2\big) \xrightarrow{P} 0$.
Example 1.14: Let $f$ be a function such that $Ef(Y_1) = 0$ and $Ef^2(Y_1) < \infty$. Let $U_n$ be the $U$-statistic with kernel $h(x_1, x_2) = f(x_1)f(x_2)$. Then $h_1(x) = E\,f(x)f(Y_2) = 0$, and
\[
n^{1/2}U_n(h) \xrightarrow{D} N(0, 0). \tag{1.55}
\]
\[
U_n(h) = \frac{1}{n-1}\Big(\sum_{i=1}^{n}\frac{f(Y_i)}{\sqrt{n}}\Big)^2 - \frac{1}{n(n-1)}\sum_{i=1}^{n} f^2(Y_i).
\]
\[
nU_n \xrightarrow{D} \sigma^2\big(\chi^2_1 - 1\big) \tag{1.56}
\]
where $\sigma^2 = Ef^2(Y_1)$ and $\chi^2_1$ is a chi-square random variable with one degree of freedom.
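A quick simulation illustrates (1.56). The sketch below uses $f(x) = x$ with standard normal data, so that $\sigma^2 = 1$ and $nU_n$ should behave like $\chi^2_1 - 1$; the sample size, replication count and names are illustrative.

  set.seed(11)
  nUn <- function(y) {              # n * U_n for the kernel h(x1, x2) = x1 * x2
    n <- length(y)
    (sum(y)^2 - sum(y^2)) / (n - 1)
  }
  z <- replicate(5000, nUn(rnorm(200)))
  c(mean = mean(z), var = var(z))   # the limit chi2_1 - 1 has mean 0 and variance 2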
Example 1.15: (continued from Example 1.13). Suppose the $\{Y_i\}$'s are i.i.d., each taking the values $\pm 1$ with probability $1/2$. Then
\[
n^{1/2}\big(s_n^2 - 1\big) \xrightarrow{D} N(0, 0). \tag{1.57}
\]
However, writing $\bar{Y} = n^{-1}\sum_{i=1}^{n} Y_i$ and noting that $Y_i^2 = 1$ for all $i$,
\[
s_n^2 = \binom{n}{2}^{-1}\sum_{1\le i<j\le n}\frac{(Y_i - Y_j)^2}{2}
 = \frac{1}{n-1}\Big(\sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2\Big)
 = \frac{n}{n-1} - \frac{n}{n-1}\,\bar{Y}^2.
\]
Hence as $n \to \infty$,
\[
n\big(s_n^2 - 1\big) = \frac{n}{n-1} - \frac{n}{n-1}\big(\sqrt{n}\,\bar{Y}\big)^2 \xrightarrow{D} 1 - \chi^2_1.
\]
$E\big(f_1(Y_1)f_2(Y_1)\big) = 0$ and $Ef_1^2(Y_1) = Ef_2^2(Y_2) = 1$. That is, $\{f_1, f_2\}$ are orthonormal.
\[
nU_n \xrightarrow{D} a_1(W_1 - 1) + a_2(W_2 - 1) \tag{1.59}
\]
To motivate further the limit result that we will state and prove shortly, let us continue to assume $\delta_1 = 0$. Recalling the formula for variance given in (1.16), now
\[
\mathbb{V}(nU_n) = n^2\binom{n}{m}^{-1}\binom{m}{2}\binom{n-m}{m-2}\,\delta_2 + \text{smaller order terms}
 = \frac{[m(m-1)]^2}{2}\,\delta_2 + o(1).
\]
\[
A_h\phi_j = \lambda_j\phi_j, \quad \forall j, \tag{1.63}
\]
\[
\int \phi_j^2(x)\,dF(x) = 1, \quad \int \phi_j(x)\phi_k(x)\,dF(x) = 0, \quad \forall j \ne k, \quad\text{and} \tag{1.64}
\]
\[
h(x, y) = \sum_{j=1}^{\infty}\lambda_j\phi_j(x)\phi_j(y). \tag{1.65}
\]
The equality in (1.65) is in the $L_2$ sense. That is, if $Y_1, Y_2$ are i.i.d. $F$ then
\[
E\Big[h(Y_1, Y_2) - \sum_{k=1}^{n}\lambda_k\phi_k(Y_1)\phi_k(Y_2)\Big]^2 \to 0 \quad\text{as } n \to \infty. \tag{1.66}
\]
Further,
\[
h_1(x) = E\,h(x, Y_2) = \sum_{k=1}^{\infty}\lambda_k\phi_k(x)\,E\phi_k(Y_2) \quad\text{almost surely } F. \tag{1.67}
\]
\[
E\,h^2(Y_1, Y_2) = \sum_{k=1}^{\infty}\lambda_k^2. \tag{1.68}
\]
Now we are ready to state our first theorem on degenerate $U$-statistics for $m = 2$. The version of this theorem for $m > 2$ is given later in Theorem 1.9.
Theorem 1.7 (Gregory (1977); Serfling (1980)). ($\chi^2$-limit theorem.) Suppose $h(x, y)$ is a kernel such that $E\,h(x, Y_1) = 0$ a.e. $x$ and $E\,h^2(Y_1, Y_2) < \infty$. Then
\[
nU_n \xrightarrow{D} \sum_{k=1}^{\infty}\lambda_k(W_k - 1) \tag{1.69}
\]
where $\{W_k\}$ are i.i.d. $\chi^2_1$ random variables and $\{\lambda_k\}$ are the (non-zero) eigenvalues of the operator $A_h$ given in (1.62).
Proof of Theorem 1.7: The idea of the proof is really as in Example 1.16 after reducing the infinitely many eigenvalues case to the finitely many eigenvalues case. First note that
\[
h_1(x) = \sum_{k=1}^{\infty}\lambda_k\phi_k(x)\,E_F\big(\phi_k(Y_1)\big) \quad\text{a.e. } F. \tag{1.71}
\]
\[
h(x, y) = \sum_{k=1}^{\infty}\lambda_k\phi_k(x)\phi_k(y) \quad\text{in the } L_2(F\times F) \text{ sense.} \tag{1.73}
\]
Now
\[
nU_n = n\binom{n}{2}^{-1}\sum_{1\le i<j\le n} h(Y_i, Y_j) = \frac{1}{n-1}\sum_{1\le i\ne j\le n} h(Y_i, Y_j).
\]
\[
T_n = \frac{1}{n}\sum_{1\le i\ne j\le n} h(Y_i, Y_j)
 = \frac{1}{n}\sum_{1\le i\ne j\le n}\sum_{k=1}^{\infty}\lambda_k\phi_k(Y_i)\phi_k(Y_j). \tag{1.74}
\]
If the sum over $k$ were a finite sum then the rest of the proof would proceed exactly as in Example 1.16. With this in mind, define the truncated sum
\[
T_{nk} = \frac{1}{n}\sum_{1\le i\ne j\le n}\ \sum_{t=1}^{k}\lambda_t\phi_t(Y_i)\phi_t(Y_j), \quad k \ge 1. \tag{1.75}
\]
Lemma 1.3. Suppose for every $k$, $\{T_{nk}\}$ is any sequence of random variables and $\{T_n\}$ is another sequence of random variables, all on the same probability space, such that
(i) $T_{nk} \xrightarrow{D} A_k$ as $n \to \infty$,
(ii) $A_k \xrightarrow{D} A$ as $k \to \infty$,
(iii) $\lim_{k\to\infty}\limsup_{n\to\infty} P\big(|T_n - T_{nk}| > \epsilon\big) = 0 \ \ \forall\,\epsilon > 0$.
Then $T_n \xrightarrow{D} A$.
Proof of Lemma 1.3. Let $\varphi_X(t) = E(e^{itX})$ denote the characteristic function of the random variable $X$.
Now
\[
|\varphi_{T_n}(t) - \varphi_A(t)| \le |\varphi_{T_n}(t) - \varphi_{T_{nk}}(t)| + |\varphi_{T_{nk}}(t) - \varphi_{A_k}(t)| + |\varphi_{A_k}(t) - \varphi_A(t)| = B_1 + B_2 + B_3, \text{ say.}
\]
For $B_1$, first let $n \to \infty$, then let $k \to \infty$, and use condition (iii) to conclude that it goes to zero (letting $\epsilon \to 0$ at the end). Conditions (i) and (ii) show in the same way that $B_2$ and $B_3$ also go to zero. This proves the lemma.
Now we continue with the proof of the theorem. We shall apply Lemma 1.3 to $\{T_n\}$ and $\{T_{nk}\}$ defined in (1.74) and (1.75). Suppose $\{W_k\}$ is a sequence of i.i.d. $\chi^2_1$ random variables. Let
\[
A_k = \sum_{j=1}^{k}\lambda_j(W_j - 1), \qquad A = \sum_{j=1}^{\infty}\lambda_j(W_j - 1). \tag{1.76}
\]
Note that by Kolmogorov’s three series theorem (see for example Chow and
Teicher (1997) Corollary 3, page 117), the value of the infinite series in (1.76)
is finite almost surely, and hence A is a legitimate random variable.
Also note that $A_k \xrightarrow{D} A$ as $k \to \infty$. Hence, condition (ii) of Lemma 1.3 is verified. Now we claim that $T_{nk} \xrightarrow{D} A_k$. This is because
\[
T_{nk} = \frac{1}{n}\sum_{1\le i\ne j\le n}\sum_{t=1}^{k}\lambda_t\phi_t(Y_i)\phi_t(Y_j)
 = \frac{1}{n}\sum_{t=1}^{k}\lambda_t\Big(\sum_{i=1}^{n}\phi_t(Y_i)\Big)^2 - \frac{1}{n}\sum_{t=1}^{k}\lambda_t\sum_{i=1}^{n}\phi_t^2(Y_i). \tag{1.77}
\]
\[
\frac{1}{n}\sum_{i=1}^{n}\phi_t^2(Y_i) \xrightarrow{a.s.} E\phi_t^2(Y_1) = 1. \tag{1.78}
\]
\[
\Big(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\phi_t(Y_i),\ t = 1, \ldots, k\Big) \xrightarrow{D} N(0, I_k) \tag{1.79}
\]
\[
T_{nk} \xrightarrow{D} \sum_{t=1}^{k}\lambda_tW_t - \sum_{t=1}^{k}\lambda_t = \sum_{t=1}^{k}\lambda_t(W_t - 1) \tag{1.80}
\]
where $\{W_t\}$ are i.i.d. $\chi^2_1$ random variables. This verifies condition (i). To verify condition (iii) of Lemma 1.3, consider
\[
T_n - T_{nk} = \frac{1}{n}\sum_{1\le i\ne j\le n} h(Y_i, Y_j) - \frac{1}{n}\sum_{1\le i\ne j\le n}\sum_{t=1}^{k}\lambda_t\phi_t(Y_i)\phi_t(Y_j)
 = \frac{2}{n}\binom{n}{2}\,U_{nk}, \tag{1.81}
\]
where $U_{nk}$ is the $U$-statistic with kernel
\[
g_k(x, y) = h(x, y) - \sum_{t=1}^{k}\lambda_t\phi_t(x)\phi_t(y).
\]
Now since $h$ is a degenerate kernel and $E\phi_k(Y_1) = 0$ $\forall k$, we conclude that $g_k$ is also a degenerate kernel. Note that using (1.73),
\[
E g_k^2(Y_1, Y_2) = E\Big[h(Y_1, Y_2) - \sum_{t=1}^{k}\lambda_t\phi_t(Y_1)\phi_t(Y_2)\Big]^2
 = E\Big[\sum_{t=k+1}^{\infty}\lambda_t\phi_t(Y_1)\phi_t(Y_2)\Big]^2
 = \sum_{t=k+1}^{\infty}\lambda_t^2.
\]
\[
\mathbb{V}\big(T_n - T_{nk}\big) = \frac{(n-1)^2}{\binom{n}{2}}\sum_{t=k+1}^{\infty}\lambda_t^2 \le 2\sum_{t=k+1}^{\infty}\lambda_t^2. \tag{1.82}
\]
Hence
\[
\lim_{k\to\infty}\limsup_{n\to\infty} P\big(|T_n - T_{nk}| > \epsilon\big) \le \lim_{k\to\infty}\frac{2}{\epsilon^2}\sum_{t=k+1}^{\infty}\lambda_t^2 = 0.
\]
This establishes condition (iii) of Lemma 1.3 and hence the Theorem is com-
pletely proved.
Definition 1.2: For any sequence of numbers $\{x_1, \ldots, x_n\}$, its empirical cumulative distribution function (e.c.d.f.) is defined as
\[
F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I_{\{x_i \le x\}}. \tag{1.83}
\]
\[
CV_n \xrightarrow{D} \frac{1}{\pi^2}\sum_{k=1}^{\infty}\frac{W_k}{k^2}. \tag{1.85}
\]
\[
F_0(x) = x, \quad 0 \le x \le 1, \tag{1.86}
\]
and hence
\begin{align}
CV_n &= n\int_0^1\big(F_n(x) - x\big)^2\,dx = n\int_0^1\Big(\frac{1}{n}\sum_{i=1}^{n} I_{\{Y_i\le x\}} - x\Big)^2 dx \tag{1.87}\\
&= \frac{1}{n}\sum_{1\le i,j\le n}\int_0^1\big(I_{\{Y_i\le x\}} - x\big)\big(I_{\{Y_j\le x\}} - x\big)\,dx\\
&= \frac{2}{n}\sum_{1\le i<j\le n}\int_0^1\big(I_{\{Y_i\le x\}} - x\big)\big(I_{\{Y_j\le x\}} - x\big)\,dx
 + \frac{1}{n}\sum_{i=1}^{n}\int_0^1\big(I_{\{Y_i\le x\}} - x\big)^2 dx\\
&= \binom{n}{2}\frac{2}{n}\,U_n(f) + U_n(h). \tag{1.88}
\end{align}
Note that the $I_{\{Y_i\le x\}}$ are i.i.d. Bernoulli random variables with probability of success $x$. Hence by the SLLN,
\begin{align}
U_n(h) &\xrightarrow{a.s.} E\int_0^1\big(I_{\{Y_i\le x\}} - x\big)^2 dx \tag{1.89}\\
&= \int_0^1 E\big(I_{\{Y_i\le x\}} - x\big)^2 dx \tag{1.90}\\
&= \int_0^1 x(1-x)\,dx = \frac{1}{6}. \tag{1.91}
\end{align}
Moreover,
\[
E f(x_1, Y_2) = E\int_0^1\big(I_{\{x_1\le x\}} - x\big)\big(I_{\{Y_2\le x\}} - x\big)\,dx = 0.
\]
\[
nU_n(f) \xrightarrow{D} \sum_{k=1}^{\infty}\lambda_k(W_k - 1) \tag{1.92}
\]
where $\{W_k\}$ are i.i.d. $\chi^2_1$ variables and $\{\lambda_k\}$ are the eigenvalues of the kernel $f$. We now identify the values $\{\lambda_k\}$. The eigenequation is
\[
\int_0^1 f(x_1, x_2)\,g(x_2)\,dx_2 = \lambda g(x_1). \tag{1.93}
\]
Now
\begin{align}
f(x_1, x_2) &= \int_0^1\big[I(x_1\le x)I(x_2\le x) - xI(x_1\le x) - xI(x_2\le x) + x^2\big]\,dx\\
&= \int_0^1\big[I\big(x \ge \max(x_1, x_2)\big) - xI(x_1\le x) - xI(x_2\le x) + x^2\big]\,dx\\
&= 1 - \max(x_1, x_2) - \frac{1-x_1^2}{2} - \frac{1-x_2^2}{2} + \frac{1}{3}\\
&= \frac{1}{3} - \max(x_1, x_2) + \frac{x_1^2 + x_2^2}{2}. \tag{1.94}
\end{align}
Recall that any eigenfunction $g$ must satisfy $\int_0^1 g(x)\,dx = 0$ (see (1.72)). Hence using (1.94), (1.93) reduces to
\begin{align}
\lambda g(x_1) &= \int_0^1\Big(\frac{1}{3} - \max(x_1, x_2) + \frac{x_1^2 + x_2^2}{2}\Big)g(x_2)\,dx_2\\
&= \int_0^1\frac{x_2^2\,g(x_2)}{2}\,dx_2 - \int_{x_1}^1 x_2\,g(x_2)\,dx_2 - x_1\int_0^{x_1} g(x_2)\,dx_2. \tag{1.95}
\end{align}
For the moment, assume that $g$ is a continuous function. Then differentiating (1.95) twice implies that $\lambda g'(x_1) = -\int_0^{x_1} g(x_2)\,dx_2$ and hence $\lambda g''(x_1) = -g(x_1)$. It is well known that the general solution to this second order differential equation is $g(x) = C_1e^{itx} + C_2e^{-itx}$ with $t = \lambda^{-1/2}$; since $g'(0) = 0$, this reduces to $g(x) = 2C_1\cos(tx)$.
Now
\[
0 = \int_0^1 g(x)\,dx \ \Rightarrow\ t = \pi k, \quad k = 0, \pm 1, \ldots \tag{1.99}
\]
Further,
\[
1 = \int_0^1 g^2(x)\,dx = 4C_1^2\int_0^1\cos^2(\pi kx)\,dx = \frac{4C_1^2}{2} = 2C_1^2.
\]
\[
\lambda\pi^2k^2 = 1, \quad k = \pm 1, \ldots \tag{1.101}
\]
or
\[
\lambda = \frac{1}{\pi^2k^2}, \quad k = 1, 2, \ldots \tag{1.102}
\]
(with possible multiplicity). But $\{\sqrt{2}\cos(\pi kx),\ k = 1, 2, \ldots\}$ is a complete orthonormal system and hence we can conclude that the eigenvalues (with no multiplicities) and the eigenfunctions are given by
\[
\Big(\lambda_k = \frac{1}{\pi^2k^2},\ g_k(x) = \sqrt{2}\cos(\pi kx)\Big), \quad k = 1, 2, \ldots. \tag{1.103}
\]
Notice that $\sum_{k=1}^{\infty}\lambda_k = 1/6$ (which was first proved by Euler; see Knopp (1923) for an early proof).
As a consequence, using (1.88), (1.91) and (1.92),
\[
CV_n \xrightarrow{D} \sum_{k=1}^{\infty}\lambda_k(W_k - 1) + \frac{1}{6} = \sum_{k=1}^{\infty}\lambda_kW_k = \frac{1}{\pi^2}\sum_{k=1}^{\infty}\frac{W_k}{k^2}.
\]
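The following R sketch computes the Cramér–von Mises statistic $CV_n$ for uniform data by direct numerical integration of (1.87) on a fine grid, and compares its simulated distribution with a truncated version of the limit $\pi^{-2}\sum_k W_k/k^2$; the grid size, truncation point and all names are illustrative choices.

  set.seed(3)
  cvm <- function(y, grid = seq(0, 1, length.out = 2001)) {
    n  <- length(y)
    Fn <- ecdf(y)
    n * mean((Fn(grid) - grid)^2)      # approximates n * integral of (F_n(x) - x)^2 dx
  }
  reps <- 5000
  cv.sim <- replicate(reps, cvm(runif(50)))
  k <- 1:200                           # truncate the weighted chi-square series
  cv.lim <- colSums(matrix(rchisq(reps * length(k), df = 1), nrow = length(k)) / (pi^2 * k^2))
  c(quantile(cv.sim, 0.95), quantile(cv.lim, 0.95))   # the two 95% quantiles should be close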
Example 1.18: (Example 1.7 continued) Using (1.13), it is easy to see that $\kappa_n$ is a degenerate $U$-statistic. An application of Theorem 1.7 yields the following: suppose $\{(X_i, Y_i)\}$ are i.i.d. and moreover all the $\{X_i, Y_j\}$ are independent. Further suppose that their second moments are finite. Then
\[
n\kappa_n \xrightarrow{D} \sum_{i,j=1}^{\infty}\lambda_i\mu_j Z_{i,j}
\]
where $\{Z_{i,j}\}$ are i.i.d. chi-square random variables each with one degree of freedom. The $\{\lambda_i\}$ and $\{\mu_j\}$ are given by the solution of the eigenvalue equations:
\[
\int h_{F_X}(x_1, x_2)\,g_X(x_2)\,dF_X(x_2) = \lambda\,g_X(x_1) \quad\text{a.e. } (F_X), \tag{1.104}
\]
\[
\int h_{F_Y}(y_1, y_2)\,g_Y(y_2)\,dF_Y(y_2) = \mu\,g_Y(y_1) \quad\text{a.e. } (F_Y). \tag{1.105}
\]
We now state the asymptotic limit law for degenerate $U$-statistics for general $m > 2$. Define the second projection $h_2$ of the kernel $h$ as
\[
h_2(x_1, x_2) = E\,h(x_1, x_2, Y_3, \ldots, Y_m).
\]
For such a first order degenerate kernel with finite second moment,
\[
nU_n \xrightarrow{D} \binom{m}{2}\sum_{k=1}^{\infty}\lambda_k(W_k - 1),
\]
where $\{W_k\}$ are i.i.d. $\chi^2_1$ random variables and $\{\lambda_k\}$ are the eigenvalues of the operator defined by
\[
Ag(x_2) = \int h_2(x_1, x_2)\,g(x_1)\,dF(x_1).
\]
We omit the proof of this theorem, which is easy once we use Theorem 1.7 and some easily derivable properties of the second-order remainder in the Hoeffding decomposition. For details see Lee (1990), page 83.
1.6 Exercises
1. Suppose F is the set of all cumulative distribution functions on R. Let
Y1 , . . . , Yn be i.i.d. observations from some unknown F ∈ F.
(a) Show that the e.c.d.f. Fn is a complete sufficient statistic for this
space.
(c) Using the above fact, show that any U -statistic (with finite vari-
ance) is the nonparametric UMVUE of its expectation.
5. Show that for the $U$-statistic Kendall's tau given in Example 1.4, under independence of $X$ and $Y$,
where
\[
\delta_c = \mathrm{COV}\big(f(X_{i_1}, \ldots, X_{i_{m_1}}),\, g(X_{j_1}, \ldots, X_{j_{m_2}})\big)
\]
and $c$ = number of common indices between the sets $\{i_1, \ldots, i_{m_1}\}$ and $\{j_1, \ldots, j_{m_2}\}$.
7. Using the previous exercise, show that for Wilcoxon's $T^{+}$ given in Example 1.6,
\[
\mathbb{V}(T^{+}) = \binom{n}{2}\big[(n-1)(p_4 - p_2^2) + p_2(1 - p_2) + 4(p_3 - p_1p_2)\big] + np_1(1 - p_1)
\]
where
10. Show that for the sample covariance $U$-statistic given in Example 1.3,
\[
\delta_1 = \big(\mu_{2,2} - \sigma_{XY}^2\big)/4, \qquad \delta_2 = \big(\mu_{2,2} + \sigma_X^2\sigma_Y^2\big)/2, \quad\text{and hence}
\]
\[
\mathbb{V}\,U_n = n^{-1}\mu_{2,2} - \frac{(n-2)\sigma_{XY}^2 - \sigma_X^2\sigma_Y^2}{n(n-1)},
\]
where
\[
\mu_{2,2} = E\big[(X - \mu_X)^2(Y - \mu_Y)^2\big], \quad \sigma_X^2 = \mathbb{V}(X), \quad \sigma_Y^2 = \mathbb{V}(Y), \quad \sigma_{XY} = \mathrm{COV}(X, Y).
\]
\[
L_n = \sum_{i=1}^{n} i\,Y_{(i)} \tag{1.107}
\]
18. Verify (1.33) that U and V are independent continuous U (−1, 1) ran-
dom variables.
19. Show that in the case of the sample variance in the degenerate case, there is only one eigenvalue, equal to $-1$.
For the kernel $h(x, y) = xy + x^3y^3$, show that $U_n(h)$ is degenerate and the $L_2$ operator $A_h$ has two eigenvalues and eigenfunctions. Find these and the asymptotic distribution of $nU_n(h)$.
\[
h(x, y) = |x + y| - |x - y|,
\]
where $|a| = \big(\sum_{j=1}^{d} a_j^2\big)^{1/2}$ for $a = (a_1, \ldots, a_d) \in \mathbb{R}^d$. Show that $U_n(h)$ is a degenerate $U$-statistic. Find the asymptotic distribution of $nU_n(h)$ when $d = 1$.
25. (This result is used in the proof of Theorem 2.3.) Recall the variance
formula (1.18).
26. For Bergsma's $\kappa$ in Example 1.18, show that the eigenvalues and eigenfunctions when all the random variables are i.i.d. continuous uniform $(0, 1)$ are given by $\big\{\frac{1}{\pi^2k^2},\ g_k(u) = \sqrt{2}\cos(\pi ku),\ 0 \le u \le 1,\ k = 1, 2, \ldots\big\}$.
27. To have an idea of how fast the asymptotic distribution takes hold in the degenerate case, plot $\big(k,\ \sum_{i=1}^{k}\lambda_i\big/\sum_{i=1}^{\infty}\lambda_i\big)$, $k = 1, 2, \ldots$ for (a) the Cramér-von Mises statistic and (b) Bergsma's $\kappa_n$, when all distributions are continuous uniform $(0, 1)$.
Chapter 2
Mm-estimators and U-statistics
The special case when m = 1 is the one that is most commonly studied
and in that case θ0 is traditionally called an M -parameter.
The sample analogue $Q_n$ of $Q$ is given by
\[
Q_n(\theta) = \binom{n}{m}^{-1}\sum_{1\le i_1<i_2<\cdots<i_m\le n} f(Y_{i_1}, \ldots, Y_{i_m}, \theta). \tag{2.2}
\]
So
\[
Q_n(\theta_n) = \inf_{\theta} Q_n(\theta). \tag{2.3}
\]
Note that $|f(x, \theta)| \le 2|\theta|$ and hence $Q(\theta) = Ef(Y, \theta)$ is finite for all $\theta \in \mathbb{R}$. It is easy to check that
\[
f(x, \theta) = \theta\big(2I_{\{x\le 0\}} - 1\big) + 2\int_0^{\theta}\big(I_{\{x\le s\}} - I_{\{x\le 0\}}\big)\,ds - (2p-1)\theta. \tag{2.7}
\]
Hence
\[
Q(\theta) = 2\int_0^{\theta} F(s)\,ds - 2p\theta \quad\text{for all } \theta \in \mathbb{R}. \tag{2.8}
\]
\[
f(x_1, x_2, \theta) = \Big|\frac{x_1 + x_2}{2} - \theta\Big| - \Big|\frac{x_1 + x_2}{2}\Big|. \tag{2.10}
\]
\[
f(x, \theta) = \Big(\sum_{k=1}^{d}\big(x_k - \theta_k\big)^2\Big)^{1/2} - \Big(\sum_{k=1}^{d} x_k^2\Big)^{1/2}. \tag{2.13}
\]
Note that $Q(\theta) = Ef(Y, \theta)$ is finite if $E|Y| < \infty$, where $|a|$ is the Euclidean norm of the vector $a$. It can be shown that if $P$ does not put all its mass on a hyperplane (that is, if $P\big(\sum_{i=1}^{d} C_iY_i = C\big) \ne 1$ for any choice of real numbers $(C, C_1, \ldots, C_d)$), then $Q(\theta)$ is minimized at a unique $\theta_0$ (see Kemperman (1987)). This $\theta_0$ is called the L1-median. The corresponding M-estimator is called the sample L1-median. It is unique if $\{Y_1, \ldots, Y_n\}$ do not lie on a lower dimensional hyperplane. If $d = 1$, the L1-median reduces to the usual median discussed in Example 2.3.
Later in Chapter 5 we illustrate the L1-median with a real data application. The R package for this book, called UStatBookABSC, also contains a function to obtain the L1-median on general datasets.
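The L1-median has no closed form, but it is easy to compute numerically. The sketch below implements the classical Weiszfeld iteration in R purely as an illustration; it is not the function from the UStatBookABSC package, and the function name, tolerance and starting value are arbitrary choices.

  # Illustrative sketch: L1-median of an n x d data matrix via the Weiszfeld iteration
  l1median <- function(Y, tol = 1e-8, maxit = 200) {
    theta <- colMeans(Y)                       # start from the coordinatewise mean
    for (it in 1:maxit) {
      d <- sqrt(rowSums(sweep(Y, 2, theta)^2)) # distances |Y_i - theta|
      d <- pmax(d, tol)                        # guard against a data point equal to theta
      w <- 1 / d
      theta.new <- colSums(Y * w) / sum(w)
      if (sum(abs(theta.new - theta)) < tol) break
      theta <- theta.new
    }
    theta
  }

  set.seed(5)
  Y <- matrix(rnorm(200), ncol = 2)
  l1median(Y)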
Example 2.8 (Oja-median): This multivariate median was introduced by Oja (1983). For any $(d+1)$ points $x_1, x_2, \ldots, x_{d+1} \in \mathbb{R}^d$ for $d \ge 2$, the simplex formed by them is the smallest convex set containing these points. Let $\Delta(x_1, \ldots, x_d, x_{d+1})$ denote the absolute volume of this simplex. Let $Q(\theta) = E\big[\Delta(Y_1, \ldots, Y_d, \theta) - \Delta(Y_1, \ldots, Y_d, 0)\big]$; its minimizer is the Oja median.
2.2 Convexity
Many researchers have studied the asymptotic properties of M -estimators and
Mm -estimators. Early works on the asymptotic properties of M1 -estimators
and M2 -estimators are Huber (1967) and Maritz et al. (1977). Using condi-
tions similar to Huber (1967), Oja (1984) proved the consistency and asymp-
totic normality of Mm -estimators. His results apply to some of the estimators
that we have presented above.
We emphasize that all examples of f we have considered so far have a
common feature. They are all convex functions of θ. Statisticians prefer to
work with convex loss functions for various reasons. We shall make this blan-
ket assumption here. This does entail some loss of generality. But convexity
leads to a significant simplification in the study of Mm-estimators while at the same time covering the examples presented above.
In the next sub-section we will show that we can always choose a measurable
version of the sub-gradient. If f is differentiable, then this sub-gradient is
simply the ordinary derivative. This sub-gradient will be crucial to us.
Example 2.9: (i) For the usual median, it can be checked that a sub-gradient is given by
\[
g(x, \theta) =
\begin{cases}
1 & \text{if } \theta > x\\
0 & \text{if } \theta = x\\
-1 & \text{if } \theta < x.
\end{cases} \tag{2.17}
\]
2.3 Measurability
As Examples 2.3 and 2.6 showed, an Mm -estimator is not necessarily unique.
However, it can be shown by using the convexity assumption, that a mea-
surable minimizer can always be chosen. This can be done by the following
selection theorem and its corollary. The asymptotic results that we will
discuss later, hold for any measurable sequence of minimizers of {Qn (θ)}.
At the heart of choosing a sequence of measurable minimizers is the idea
of measurable selections. This is a very interesting topic in mathematics and
there are many selection theorems in the literature. See for example Castaing
and Valadier (1977). Γ is said to be a multifunction if it assigns a subset Γ(z)
of Rd to each z. A function σ : Z → Rd is said to be a selection of Γ if
σ(z) ∈ Γ(z) for every z. If Z is a measurable space, then σ is said to be a
measurable selection if z → σ(z) is a measurable function.
We quote the following theorem from Castaing and Valadier (1977). For
its proof, see Theorem 3.6 and Proposition 3.11 in Section 3.2 there.
whenever the inf is in the range of q(z, ·), otherwise a(z) is taken to be some
fixed number.
for any subset $A$ of $\mathbb{R}^d$. Indeed, $\inf_{\alpha\in A}$ can be replaced by $\inf_{\alpha\in C}$, where $C$ is a countable dense subset of $A$, because $q(z, \cdot)$ is continuous. Let
We have
This is because the right side infimum is certainly in the range of q(z, ·).
Thus,
\[
Z_0 = \{z : \Gamma(z) \ne \emptyset\} \tag{2.22}
\]
We now show how we can apply the above corollary to obtain a measurable minimiser $\theta_n$ in (2.3). Suppose $f(x_1, \ldots, x_m, \theta)$ is a function on $\mathcal{Y}^m \times \mathbb{R}^d$ which is measurable in $(x_1, \ldots, x_m)$ and convex in $\theta$. Note that convexity automatically implies continuity in $\theta$.
Suppose $\{Y_1, \ldots, Y_n\}$, $n \ge m$, are i.i.d. $\mathcal{Y}$ valued random variables. On $\mathcal{Y}^n$ consider the function $q(\cdot, \alpha) = Q_n(\alpha)$ and apply Corollary 2.1 to get a measurable minimiser $\theta_n$.
\[
\theta_n \xrightarrow{a.s.} \theta_0 \quad\text{as } n \to \infty. \tag{2.24}
\]
Proof of Lemma 2.1: Recall that convex functions converge pointwise ev-
erywhere if they converge pointwise on a dense set. Moreover the everywhere
convergence is uniform over compact sets. See Rockafellar (1970), Theorem
10.8 for additional details.
Let C be a countable dense set. To prove (a), it is just enough to observe
that with probability 1, convergence hn (α) → h(α) takes place for all α ∈ C
and then apply the above criterion for convergence of convex functions.
To prove (b), consider an arbitrary sub-sequence of the sequence. For any
fixed α ∈ C, we can select a further sub-sequence, along which hn (α) → h(α)
holds almost surely. Now we can apply the Cantor diagonal method to get
hold of one single sub-sequence {hn } which converges pointwise almost surely
on C. Now apply (a) to conclude that this sub-sequence converges almost
everywhere uniformly on compact sets. Since for any sub-sequence, we have
exhibited a further sub-sequence which converges uniformly, almost surely
on compact sets, the original sequence converges in probability uniformly on
compact sets. This completes the proof.
Proof of Theorem 2.2: Note that by the SLLN for U -statistics, Qn (α) con-
verges to Q(α) for each α almost surely. By Lemma 2.1, this convergence is
uniform on any compact set almost surely.
Let $B$ be a ball of arbitrary radius around $\theta_0$. If $\theta_n$ is not consistent, then there is an $\epsilon > 0$ and a set $S$ in the probability space such that $P(S) > 0$ and for each sample point in $S$, there is a sub-sequence of $\theta_n$ that lies outside this ball. We assume without loss that for each point in this set, the convergence
\[
\theta_n^{*} = \gamma_n\theta_0 + (1 - \gamma_n)\theta_n.
\]
First note that the right side converges to $Q(\theta_0)$. Now, every $\theta_n^{*}$ lies on the compact set $\{\theta : |\theta - \theta_0| = \epsilon\}$. Hence there is a sub-sequence of $\{\theta_n^{*}\}$ which converges to, say, $\theta_1$. Since the convergence of $Q_n$ to $Q$ is uniform on compact sets, the left side of the above equation converges to $Q(\theta_1)$. Hence, $Q(\theta_1) \le Q(\theta_0)$. This is a contradiction to the uniqueness of $\theta_0$ since $|\theta_0 - \theta_1| = \epsilon$. This proves the theorem.
Let $U_n$ be the $U$-statistic based on the kernel $g(Y_1, \ldots, Y_m, \theta_0)$.
Theorem 2.3. Suppose Assumptions (I)–(V) hold. Then for any sequence of measurable minimizers $\{\theta_n\}$,
(a) $\theta_n - \theta_0 = -H^{-1}U_n + o_P(n^{-1/2})$,
(b) $n^{1/2}(\theta_n - \theta_0) \xrightarrow{D} N\big(0, m^2H^{-1}KH^{-1}\big)$, where $K = \mathbb{V}\big(E\big[g(Y_1, \ldots, Y_m, \theta_0)\mid Y_1\big]\big)$.
Hence,
or
For the proof of this theorem, as well as for those theorems given later, assume without loss that $\theta_0 = 0$ and $Q(\theta_0) = 0$. As a consequence,
Note that Vn = m s∈S Yn,s is a U -statistic. From Exercise 25 of
Chapter 1, using (2.31), it follows that
n −1 m
2
V Yn,s ≤ E (Yn,s − EYn,s )
m n
s∈S
m 2
≤ K EYn,s
n
m 2
≤ K 2 E αt g(Yn,s , n−1/2 α) − g(Yn,s , 0) .
n
\[
Z_{n+1} - Z_n = \alpha^T\big[g(Z, (n+1)^{-1/2}\alpha) - g(Z, 0)\big] - \alpha^T\big[g(Z, n^{-1/2}\alpha) - g(Z, 0)\big]
 = \alpha^T\big[g(Z, (n+1)^{-1/2}\alpha) - g(Z, n^{-1/2}\alpha)\big].
\]
∇Q(n−1/2 α) → 0.
Now, due to convexity, by Lemma 2.1, the convergences in (2.35) and (2.36) are uniform on compact sets. Thus for every $\epsilon > 0$ and every $M > 0$, the inequality
\[
\sup_{|\alpha|\le M}\big|nQ_n\big(\alpha/\sqrt{n}\big) - nQ_n(0) - \alpha^Tn^{1/2}U_n - \alpha^TH\alpha/2\big| < \epsilon \tag{2.37}
\]
\[
\alpha_n \xrightarrow{D} N\big(0, m^2H^{-1}KH^{-1}\big). \tag{2.39}
\]
The rest of the argument is on the intersection of the two events in (2.37) and (2.41), and that has probability at least $1 - 2\epsilon$.
Consider the convex function
\[
A_n(\alpha) = nQ_n\big(\alpha/\sqrt{n}\big) - nQ_n(0). \tag{2.42}
\]
From (2.37),
\[
A_n(\alpha) \ge B_n(\alpha) - \epsilon. \tag{2.44}
\]
Comparing the two bounds in (2.43) and (2.44), and using the condition that $\alpha$ lies on the sphere, it can be shown that the bound in (2.44) is always strictly larger than the one in (2.43) once we choose $T = 4\big(\lambda_{\min}(H)\big)^{-1/2}$
Additionally,
Since g is bounded, Assumption (IV) is trivially satisfied. Thus all the con-
ditions (I)–(V) are satisfied.
Moreover,
\[
K = \mathbb{V}\big(2I_{\{\theta_0\ge Y_1\}}\big) = 4\,\mathbb{V}\big(I_{\{Y_1\le\theta_0\}}\big) = 4p(1-p). \tag{2.47}
\]
\[
n^{1/2}(\theta_n - \theta_0) \xrightarrow{D} N\big(0,\ p(1-p)\big(f^2(\theta_0)\big)^{-1}\big). \tag{2.48}
\]
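A small simulation makes the normal limit (2.48) concrete. The R sketch below checks the asymptotic variance $p(1-p)/f^2(\theta_0)$ for the median ($p = 1/2$) of standard normal data; the sample size, replication count and names are illustrative.

  set.seed(21)
  n <- 400; reps <- 5000; p <- 0.5
  theta0 <- qnorm(p)                                  # true quantile
  z <- replicate(reps, sqrt(n) * (quantile(rnorm(n), p) - theta0))
  c(simulated.var = var(z),
    limit.var = p * (1 - p) / dnorm(theta0)^2)        # p(1 - p) / f(theta0)^2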
Incidentally, if the assumptions of Theorem 2.3 are not satisfied, the limiting distribution of the Mm-estimate need not be normal. Smirnov (1949) (translated in Smirnov (1952)) studied the sample quantiles in such non-regular situations in complete detail, identifying the class of possible limit distributions.
Wolfe (1973) pages 205-206) where (Xi , Yi ) are bivariate i.i.d. random vari-
ables and
(iv) The location estimate of Maritz et al. (1977) can also be treated in this
way. Let β be any fixed number between 0 and 1. Let
This implies
Recall that $\nabla Q(\theta) = E[g(Y_1, \theta)]$. By simple algebra, for $|x| \le |\theta|$,
\[
\big|g(x, \theta) - g(x, 0) - h(x, 0)\theta\big| \le 5\,\frac{|\theta|^2}{|x|^2} + \frac{|\theta|^3}{|x|^3}. \tag{2.61}
\]
Using these two inequalities, and the inverse moment condition (2.58), it is
easy to check that, the matrix H exists and can be evaluated as
Let $Y^{(i)}$ be the $d \times d$ matrix obtained from $Y$ by deleting its $i$-th row and replacing it by a row of 1's at the end. That is,
\[
Y^{(i)} =
\begin{pmatrix}
Y_{11} & Y_{12} & \cdots & Y_{1d}\\
Y_{21} & Y_{22} & \cdots & Y_{2d}\\
\vdots & \vdots & & \vdots\\
Y_{i-1,1} & Y_{i-1,2} & \cdots & Y_{i-1,d}\\
Y_{i+1,1} & Y_{i+1,2} & \cdots & Y_{i+1,d}\\
\vdots & \vdots & & \vdots\\
Y_{d1} & Y_{d2} & \cdots & Y_{dd}\\
1 & 1 & \cdots & 1
\end{pmatrix}.
\]
Let $\det(M)$ denote the determinant of the matrix $M$. It is easily seen that
\[
f(Y_1, \ldots, Y_d, \theta) = \big|\det\big(M(\theta)\big)\big| - \big|\det\big(M(0)\big)\big| = \big|\theta^TT - Z\big| - |Z| \tag{2.63}
\]
where
\[
T = (T_1, \ldots, T_d)^T, \quad T_i = (-1)^{i+1}\det\big(Y^{(i)}\big), \quad\text{and}\quad Z = (-1)^d\det(Y). \tag{2.64}
\]
and has common features of the gradients of the sample mean (g(x) = x) as
well as of U -quantiles (g(x) = sign function), see Examples 2.12 and 2.13.
Assume that E|Y1 |2 < ∞. This implies E|T |2 < ∞ which in turn implies
E|gi |2 < ∞ and thus Assumption (IV) is satisfied.
It easily follows that the $i$-th element of the gradient vector of $Q(\theta)$ equals
\[
Q_i(\theta) = 2E\big[T_i\,I_{\{Z\le\theta^TT\}}\big]. \tag{2.66}
\]
If, further, $F$ has a density, it follows that the derivative of $Q_i(\theta)$ with respect to $\theta_j$ is given by
\[
Q_{ij}(\theta) = 2E\big[T_iT_j\,f_{Z|T}(\theta^TT)\big]. \tag{2.67}
\]
Clearly then Assumption (V) will be satisfied if we assume that, the density
of F exists and H defined above exists and is positive definite. This condition
is satisfied by many common distributions.
Example 2.15: The $p$th-order Oja-median for $1 < p < 2$ is defined by minimizing
\[
Q(\theta) = E\big[\Delta^p(Y_1, \ldots, Y_d, \theta) - \Delta^p(Y_1, \ldots, Y_d, 0)\big]. \tag{2.69}
\]
(b) If Assumption (VIb) also holds with some $r > 1$, then for every $\delta > 0$,
\[
P\Big(\sup_{k\ge n}|\theta_k - \theta_0| > \delta\Big) = o(n^{1-r}) \quad\text{as } n \to \infty. \tag{2.73}
\]
Note that if r < 2, then Assumption (VIb) is weaker than Assumption (IV)
needed for the asymptotic normality. If r > 2, then Assumption (VIb) is
stronger than Assumption (IV) but weaker than Assumption (VIa), and still
implies complete convergence.
Incidentally, the last time that the estimator is $\epsilon$ distance away from the parameter is of interest as $\epsilon$ approaches zero. See Bose and Chatterjee (2001b) and the references there for some information on this problem.
\[
\sup_{\beta\in B}|Q(\beta) - h(\beta)| < \epsilon \quad\text{implies}\quad \sup_{\alpha\in A}|Q(\alpha) - h(\alpha)| < 5\delta L + 3\epsilon. \tag{2.75}
\]
On the other hand, to each $\alpha \in A$ there corresponds $\beta \in B$ such that $|\alpha - \beta| < \delta$ and thus $\alpha + 2(\beta - \alpha) = \gamma \in A_0$. From (2.76) it follows that
Proof of Theorem 2.4: We first prove part (b). Fix $\delta > 0$. Note that $Q$ is convex and hence is continuous. Further, it is also Lipschitz, with Lipschitz constant $L$ say, in a neighborhood of 0. Hence there exists an $\epsilon > 0$ such that $Q(\alpha) > 2\epsilon$ for all $|\alpha| = \delta$.
Fix $\alpha$. By Assumption (VIb) and Theorem 1.5,
\[
P\Big(\sup_{k\ge n}\big|Q_k(\alpha) - Q_k(0) - Q(\alpha)\big| > \epsilon\Big) = o(n^{1-r}). \tag{2.78}
\]
Suppose that the event in (2.80) occurs. Using the fact that $f_k(\alpha) = Q_k(\alpha) - Q_k(0)$ is convex, $f_k(0) = 0$, $f_k(\alpha) > \epsilon$ for all $|\alpha| = \delta$ and $Q(\alpha) > 2\epsilon$ for all $|\alpha| = \delta$, we conclude that $f_k(\alpha)$ attains its minimum on the set $|\alpha| \le \delta$. This proves part (b) of the theorem.
To prove part (a), we follow the argument given in the proof of part (b)
but use Theorem 1.5(c) to obtain the required exponential rate. The rest of
the proof remains unchanged. We omit the details.
Theorem 2.5. Suppose Assumptions (I)–(V) and (VII)–(IX) hold for some $0 \le s < 1$ and $r > (8 + d(1+s))/(1-s)$. Then almost surely as $n \to \infty$,
\[
n^{1/2}(\theta_n - \theta_0) = -H^{-1}n^{1/2}U_n + O\big(n^{-(1+s)/4}(\log n)^{1/2}(\log\log n)^{(1+s)/4}\big). \tag{2.82}
\]
The almost sure results obtained in Theorems 2.5 and 2.6 are by no means exact. We shall discuss this issue in some detail later.
To prove the theorems, we need a lemma. It is a refinement of Lemma 2.2 on convex functions to the gradient of convex functions.
\[
\sup_{\beta\in B}|k(\beta) - p(\beta)| < \epsilon \quad\text{implies}\quad \sup_{\alpha\in A}|k(\alpha) - p(\alpha)| < 4\delta L + 2\epsilon. \tag{2.83}
\]
Thus,
\begin{align*}
\delta e^Tp(\alpha) &\le h(\alpha + \delta e) - h(\alpha) \le \sum_i\lambda_i(\beta_i - \alpha)^Tp(\beta_i)\\
&\le \sum_i\lambda_i\big[(\beta_i - \alpha)^Tk(\alpha) + |\beta_i - \alpha||k(\beta_i) - k(\alpha)| + |\beta_i - \alpha||p(\beta_i) - k(\beta_i)|\big]
\end{align*}
and
\[
Y_{n,s} = g\Big(Y_s, \frac{\alpha}{\sqrt{n}}\Big) - g(Y_s, 0). \tag{2.86}
\]
Note that
\[
E(Y_{n,s}) = G\Big(\frac{\alpha}{\sqrt{n}}\Big), \tag{2.87}
\]
and
\[
\binom{n}{m}^{-1}\sum_{s\in S} Y_{n,s} = G_n\Big(\frac{\alpha}{\sqrt{n}}\Big) - U_n. \tag{2.88}
\]
sS
Let
By using Theorem 1.6 with vn2 = C 2 n−(1+s)/2 ln1+s , for some K and D,
α α
sup P n1/2 |Gn ( √ ) − Un − G( √ )| > KCn−(1+s)/4 ln(1+s)/2 (log n)1/2
|α|≤M ln n n
≤ Dn1−r/2 C −r/2 nr(1+s)/4 ln−r(1+s)/2 (log n)r/2
= Dn1−r(1−s)/4 (log n)r/2 (log log n)−r(1+s)/4 . (2.90)
and so the inequality (2.90) continues to hold when we replace $n^{1/2}G\big(\frac{\alpha}{\sqrt{n}}\big)$ by $H\alpha$.
Let $\epsilon_n = n^{-(1+s)/4}l_n^{(1+s)/2}(\log n)^{1/2}$ denote the rate appearing in (2.90), and consider the event
\[
\Big|n^{1/2}G_n\Big(\frac{\alpha}{\sqrt{n}}\Big) - n^{1/2}U_n - H\alpha\Big| \le KC\epsilon_n. \tag{2.93}
\]
Since $r > [8 + d(1+s)]/(1-s)$, the right side is summable and hence we can apply the Borel-Cantelli lemma to conclude that almost surely, for large $n$,
\[
\sup_{|\alpha|\le Ml_n}\Big|n^{1/2}G_n\Big(\frac{\alpha}{\sqrt{n}}\Big) - n^{1/2}U_n - H\alpha\Big| \le K_1\epsilon_n. \tag{2.96}
\]
By the LIL for $U$-statistics given in Theorem 1.4, $n^{1/2}U_nl_n^{-1}$ is bounded almost surely as $n \to \infty$. Hence we can choose $M$ so that
\[
|n^{1/2}H^{-1}U_n| \le Ml_n - 1
\]
almost surely for large $n$. Now consider the convex function $nQ_n(n^{-1/2}\alpha) - nQ_n(0)$ on the sphere
\[
S = \{\alpha : |\alpha + H^{-1}n^{1/2}U_n| = K_2\epsilon_n\},
\]
and so the radial directional derivatives of the function are positive. This shows that the minimiser $n^{1/2}\theta_n$ of the function must lie within the sphere
Proof of Theorem 2.6: Let $v_n$ and $X_{ns}$ be as in the proof of Theorem 2.5. Let $U_n$ be the $U$-statistic with kernel $X_{ns} - EX_{ns}$, which is now bounded since $g$ is bounded. By arguments similar to those given in the proof of Theorem 1.6 for the kernel $h_{n1}$,
\[
P\big(|n^{1/2}U_n| \ge v_n(\log n)^{1/2}\big) \le \exp\big\{-Kt(\log n)^{1/2} + t^2n/k\big\}, \tag{2.98}
\]
Assume that
\[
K(\theta) - K(\theta_0) - (\theta - \theta_0)k(\theta_0) = O\big(|\theta - \theta_0|^{3/2}\big) \quad\text{as } \theta \to \theta_0.
\]
Then Assumption (VII) holds with $s = 0$.
where $f_{Z|T}(\cdot)$ denotes the conditional density of $Z$ given $T$. Hence Assumption (VII) will be satisfied if we assume that for each $i$, as $\theta \to \theta_0$,
\[
E\big|Y_i\big\{F_{Z|T}(\theta^TT) - F_{Z|T}(\theta_0^TT) - f_{Z|T}(\theta_0^TT)(\theta - \theta_0)^TT\big\}\big|
\]
(i) (L1 -median). Since results for the univariate median (and quantiles) are
very well known (see for example Bahadur (1966), Kiefer (1967)), we confine
our attention to the case d ≥ 2.
\[
\big|g(x, \theta) - g(x, 0) - h(x, 0)\theta\big| \le 6\,\frac{|\theta|^2}{|x|^2}. \tag{2.106}
\]
where
\[
I_1 \le 4|\theta|\int_{|x|\le|\theta|}|x|^{-1}\,dF(x) \le 2|\theta|^{(3+s)/2}\int_{|x|\le|\theta|}|x|^{-(3+s)/2}\,dF(x) \tag{2.108}
\]
The inverse moment condition (2.104) assures that Assumption (VII) holds
with ∇2 Q(θ0 ) = H. Thus we have verified all the conditions needed and the
proposition is proved.
Let us investigate the nature of the inverse moment condition (2.104). If Y1
has a density f bounded on every compact subset of Rd then E[|Y1 −θ|−2 ] < ∞
if d ≥ 3 and E[|Y1 −θ0 |−(1+s) ] < ∞ for any 0 ≤ s < 1 if d = 2, and Theorem 2.5
is applicable. However, this boundedness or even the existence of a density
as such is not needed if d ≥ 2. This is in marked contrast with the situation
for d = 1 where the existence of the density is required since it appears in
the leading term of the representation. For most common distributions, the
representation holds with s = 1 from dimension d ≥ 3, and with some s < 1
for dimension d = 2. The weakest representation corresponds to s = 0 and
gives a remainder O(n−1/4 (log n)1/2 (log log n)1/4 ) if E[|Y1 − θ|−3/2 ] < ∞.
The strongest representation corresponds to s = 1 and gives a remainder
O(n−1/2 (log n)1/2 (log log n)1/2 ) if E[|Y1 − θ|−2 ] < ∞.
The moment condition (2.104) forces F to necessarily assign zero mass at
the median. Curiously, if F assigns zero mass to an entire neighborhood of
the median, then the moment condition is automatically satisfied.
Now assume that the L1 -median is zero and Y is dominated in a neighbor-
hood of zero by a variable Z which has a radially symmetric density f (|x|).
Transforming to polar coordinates, the moment condition is satisfied if the
integral of g(r) = r−(3+s)/2+d−1 f (r) is finite. If d = 2 and f is bounded in a
neighborhood of zero, then the integral is finite for all s < 1. If f (r) = O(r−β ),
(β > 0), then the integral is finite if s < 2d − 3 − 2β. In particular, if f is
bounded (β = 0), then any s < 1 is feasible for d = 2 and s = 1 for d = 3.
(ii) (Hodges-Lehmann estimate) The above arguments also show that if the moment condition is changed to $E\big|m^{-1}(Y_1 + \cdots + Y_m) - \theta_0\big|^{-(3+s)/2} < \infty$, Proposition 2.1 holds for the Hodges-Lehmann estimator with
\[
U_n = \binom{n}{m}^{-1}\sum_{1\le i_1<i_2<\cdots<i_m\le n} g\big(m^{-1}(Y_{i_1} + \cdots + Y_{i_m}), \theta_0\big). \tag{2.110}
\]
(iii) (Geometric quantiles) For any $u$ such that $|u| < 1$, the $u$-th geometric quantile of Chaudhuri (1996) is defined by taking $f(\theta, x) = |x - \theta| - |x| - u^T\theta$. Note that $u = 0$ corresponds to the L1-median. The arguments given in the proof of Proposition 2.1 remain valid and the representations of Theorems 2.5 and 2.6 hold for these estimates. One can also define the Hodges-Lehmann version of these quantiles and the representations would still hold.
2.8 Exercises
1. Show that (2.8) is minimized at θ = F−1 (p) and is unique if F is strictly
increasing at F−1 (p). Find out all the minimizers if F is not strictly
increasing at F−1 (p).
7. Argue how, in the proof of Theorem 2.3, we can, without loss of generality, assume $\theta_0 = 0$ and $Q(\theta_0) = 0$.
8. Refer to (2.43) and (2.44). Show that $B_n(\alpha) - \epsilon > A_n(\alpha_n)$ for all $\alpha \in \{\alpha : |\alpha - \alpha_n| = 2[\lambda_{\min}(H)]^{-1/2}\epsilon^{1/2}\}$.
10. For the L1 median given in Example 2.9(ii), check that the gradient
vector is indeed given by (2.56).
13. Verify the calculations for the Oja median given in Example 2.14.
14. Formulate conditions for the asymptotic normality of the pth-order Oja-
median.
Chapter 3
Introduction to resampling
3.1 Introduction
In the previous two chapters we have seen many examples of statistical pa-
rameters and their estimates. In general suppose there is a parameter of
interest θ and observable data Y = (Y1 , . . . , Yn ). The steps for statistical
inference can be divided into three broad issues.
(II) Given an estimator θ̂n of θ (that is, a function of Y) how good is this
estimator?
(III) How do we obtain confidence sets, test hypotheses and settle other such questions of inference about θ?
\[
\hat{\sigma}^2 = (n-1)^{-1}\sum_{i=1}^{n}\big(Y_i - \hat{\theta}_n\big)^2.
\]
Using considerable ingenuity, W. S. Gosset, who wrote under the pen name Student, obtained the exact sampling distribution of $T_n = n^{1/2}(\hat{\theta}_n - \theta)/\hat{\sigma}$ (see Student (1908)). This distribution is now known as Student's $t$-distribution with $(n-1)$ degrees of freedom.
If the variables are not i.i.d. normal, the above result does not hold. More importantly, it is typically impossible to find a closed form formula for the distribution of $T_n$. While it is possible to obtain the sampling distribution of many other statistics, there is no general solution. In particular, the sampling distributions of the $U$-statistics and the Mm-estimates that we discussed in Chapters 1 and 2 are completely intractable in general.
Nevertheless, asymptotic solutions are available. It is often possible to
suitably center and scale the estimator θ̂n , which then converges in distribu-
tion. In Chapters 1 and 2 we have seen numerous instances of this where
convergence happens to the normal distribution.
We have also seen in Chapter 2 how asymptotic normality was established
by a weak representation result, so that the leading term of the centered
statistic is a sum of i.i.d. variables and thus the usual CLT can be applied.
This linearization was achieved by expending considerable technical effort.
There still remain two noteworthy issues.
First, such a linearization may not be easily available for many estimates and the limit distribution need not be normal. Second, even if
\[
a_n(\hat{\theta}_n - \theta) \xrightarrow{D} N(0, V)
\]
holds, it may not be easy to find or estimate the asymptotic variance $V$. The sample
median as an estimator of the population median is one such case, since the
asymptotic variance depends on the true probability density value at the
unknown true population median.
This is where resampling comes in. It attempts to replace analytic deriva-
tions with the force of computations. We will introduce some of the popular
resampling techniques and their properties in Section 3.5 below, but before
that, in Section 3.2 we set the stage with three classical examples of problems
where we may study statistical inference using the (i) finite sample exact
distribution approach if available, (ii) asymptotics-driven approach, and (iii)
resampling-based approach. Our discussion of the basic ideas of resampling is centered around these three examples. In Section 3.3 we define the notion
of consistency of resampling plans in estimating the variance and the entire
sampling distribution.
Then we introduce the quick and easy resampling technique, the jackknife, which is aimed primarily at estimating the bias and variance of a statistic. The bootstrap is introduced in the context of estimating the sampling properties of the sample mean in Section 3.4.1. We also introduce the Singh property, which shows in what sense and how the bootstrap can produce better estimates than an asymptotics-based method. This is followed by a discussion of resampling for the sample median in Section 3.4.3. After some discussion of the principles and features of resampling in general in Section 3.3.2, we present in Section 3.5 several resampling methods that have been developed for use in linear regression. In Chapter 4 we will focus on resampling for U-statistics and Mm-estimates.
tions.
Resampling techniques prove their worth more when we have non-i.i.d. data, since calculation of properties of sampling distributions becomes more complicated as we move away from the i.i.d. structure. The simplest example
of a non-i.i.d. model is the linear regression and that is our third bench-
mark example. We shall see later that there are many eminently reasonable
resampling techniques available for such non-i.i.d. models.
Now instead of the normalized statistic $Z_n$, we may use the Studentized statistic
\[
T_n = n^{1/2}(\hat{\theta}_n - \theta)/\hat{\sigma}.
\]
\[
\mathcal{L}_{T_n} \xrightarrow{D} N(0, 1).
\]
We shall see later that even in this basic case, appropriate resampling techniques can assure better accuracy than offered by the normal approximation.
Example 3.2 (The median): Suppose the data is as in Example 3.1. However, the parameter of interest is now the median $\xi \in \mathbb{R}$. Recall that for any distribution $F$, and for any $\alpha \in (0, 1)$, the $\alpha$-th quantile of $F$ is defined as
\[
F^{-1}(\alpha) = \inf\{x \in \mathbb{R} : F(x) \ge \alpha\}.
\]
In order to use this result, an estimate of f (ξ) is required. Note that this
is a non-trivial problem since the density f is unknown and that forces us
to enter the realm of density estimation. This estimation also adds an extra
error when using the asymptotic normal approximation (3.2) for inference.
We shall see later that when we use an appropriate resampling technique,
this additional estimation step is completely avoided.
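To make this point concrete, the following R sketch estimates the variance of the sample median in two ways: via the asymptotic formula, which needs the unknown density at the median, and via a simple bootstrap, which does not; the sample size, number of resamples and names are illustrative, and the "asymptotic" line cheats by using the true density for reference.

  set.seed(9)
  y <- rnorm(100)                            # observed data; true median xi = 0
  B <- 2000
  med.boot <- replicate(B, median(sample(y, replace = TRUE)))
  var.boot <- var(med.boot)                  # bootstrap estimate of Var(median)
  var.asym <- 1 / (4 * 100 * dnorm(0)^2)     # 1 / (4 n f(xi)^2), using the true density
  c(bootstrap = var.boot, asymptotic = var.asym)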
Example 3.3(Simple linear regression): Suppose the data is {(Yi , xi ), i =
1, . . . , n}, where x1 , . . . , xn is a sequence of known constants. Consider the
simple linear regression model
Yi = β1 + β2 xi + ei . (3.3)
We assume that the error or noise terms e1 , . . . , en are i.i.d. from some dis-
tribution F, with Ee1 = 0, and Ve1 = σ 2 < ∞.
The random variable Yi is the i-th response, while xi is often called the
i-th covariate. In the above, we considered the case where the xi ’s are non-
random, but random variables may also be used as covariates with minor
differences in technical conditions that we discuss later. The above simple
linear regression has one slope parameter β2 and the intercept parameter β1 ,
and is a special case of the multiple linear regression, where several covariates
with their own slope parameters may be considered.
It is convenient to express the multiple regression model in a linear alge-
braic notation. We establish some convenient notations first.
Y = Xβ + e. (3.4)
The simple linear regression (3.3) can be seen as a special case of (3.4), with
the choice of p = 2, xi1 = 1 and xi2 = xi for i = 1, . . . , n.
If $F$ is the Normal distribution, i.e., if $e_1, \ldots, e_n$ are i.i.d. $N(0, \sigma^2)$, then we have the Gauss-Markov model. This is the most well-studied model for
linear regression, and exact inference is tractable in this case. For example,
if β = (β1 , . . . , βp )T ∈ Rp is the primary parameter of interest, it can be
estimated by the maximum likelihood method using the normality assumption,
and the sampling distribution of the resulting estimator can be described.
\[
\bar{Y} = n^{-1}\sum_{i=1}^{n} Y_i, \qquad \bar{x} = n^{-1}\sum_{i=1}^{n} x_i.
\]
The above exact distribution may be used for inference when σ 2 is known.
Even when it is not known, an exact distribution can be obtained, when we
use the estimate of σ 2 given in (3.6).
The vector of residuals from the above multiple linear regression model
fitting is defined as
r = Y − X β̂.
\[
\hat{\sigma}^2 = (n-p)^{-1}\sum_{i=1}^{n} r_i^2. \tag{3.6}
\]
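In R, the quantities $\hat{\beta}$, $r$ and $\hat{\sigma}^2$ of (3.6) are readily obtained from a fitted linear model. The sketch below does this for simulated data from the simple linear regression (3.3); the data-generating values and names are arbitrary.

  set.seed(13)
  n <- 50
  x <- runif(n); e <- rnorm(n, sd = 0.5)
  y <- 1 + 2 * x + e                          # beta1 = 1, beta2 = 2
  fit  <- lm(y ~ x)
  beta <- coef(fit)                           # hat(beta)
  r    <- resid(fit)                          # residual vector r = Y - X hat(beta)
  sig2 <- sum(r^2) / (n - 2)                  # (3.6) with p = 2; equals summary(fit)$sigma^2
  c(beta, sigma2.hat = sig2)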
Suppose that the errors e1 , . . . , en are i.i.d. from some unknown distribu-
tion F with mean zero and finite variance σ 2 , which is also unknown. We
assume that X is of full column rank, and that
n−1 X T X → V as n → ∞ (3.7)
See for example, Freedman (1981) where (among other places) similar results
are presented and discussed. He also discussed the results for the case where
the covariates are random.
We may also obtain the CLT based approximation of the distribution $\mathcal{L}_{T_n}$ of the Studentized statistic:
\[
T_n = \hat{\sigma}^{-1}\big(X^TX\big)^{1/2}\big(\hat{\beta} - \beta\big), \tag{3.9}
\]
\[
\mathcal{L}_{T_n} \xrightarrow{D} N_p\big(0, I_p\big). \tag{3.10}
\]
Variations of the technical conditions are also possible. For example, Shao and Tu (1995) discuss the asymptotic normality of $\hat{\beta}$ using the conditions
\[
X^TX \to \infty, \quad \max_i x_i^T\big(X^TX\big)^{-1}x_i \to 0 \ \text{ as } n \to \infty, \quad\text{and}\quad E|e_1|^{2+\delta} < \infty \ \text{ for some } \delta > 0. \tag{3.11}
\]
n
Yi − xi β .
i=1
If the above convergence does not hold, we say that the estimator is variance
inconsistent.
If the above convergence does not hold, we say that the estimator is distri-
butionally inconsistent.
Note that the quantity $\sup_x\big|\hat{\mathcal{L}}_n(x) - \mathcal{L}_n(x)\big|$ defines a distance metric between the distributions $\hat{\mathcal{L}}_n$ and $\mathcal{L}_n$, and we will use this metric several times in this chapter.
Clearly, conditional on the data, V̂n and L̂n are random objects, the ran-
domness coming from the resampling scheme used to derive the estimate.
There are myriad notions of such resampling estimates. We now proceed to
introduce some of the more important and basic resampling schemes.
It is implicitly assumed that the functional form of Tn is such that all the T(i) ’s
are well defined. Let us define T̄ = n^{-1} Σ_{i=1}^n T(i). The jackknife estimator of
the bias ETn − θ is defined as
(n − 1)(T̄ − Tn).
The jackknife estimator of the variance of Tn is
V̂nJ = (n − 1) n^{-1} Σ_{i=1}^n (T(i) − T̄)².    (3.13)
Note that, leaving aside the factor (n − 1), this may be considered to be
the variance of the empirical distribution of T(i) , 1 ≤ i ≤ n. That is the
source of the resampling randomness here.
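A minimal R sketch of the delete-one jackknife bias and variance estimates above, for a generic statistic; the function and variable names here are ours, for illustration only.

jackknife = function(y, stat){
  n = length(y)
  Tn = stat(y)
  T.i = sapply(1:n, function(i) stat(y[-i]))          # leave-one-out statistics T_(i)
  T.bar = mean(T.i)
  list(bias = (n - 1) * (T.bar - Tn),                  # jackknife bias estimate
       variance = (n - 1) / n * sum((T.i - T.bar)^2))  # variance estimate (3.13)
}
set.seed(2)
y = rexp(30)
print(jackknife(y, mean))    # for the sample mean this equals s^2/n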
Miller (1964) studied the consistency properties of the jackknife estimator V̂nJ.
A jackknife estimator of the sampling distribution of Tn is
F̂J(x) = n^{-1} Σ_{i=1}^n I{√(n(n−1)) (T(i) − Tn) ≤ x}.
It will be seen later in Section 3.5 that the different delete-d jackknives are
special cases of a suite of resampling methods called the generalized bootstrap.
The consistency results for the various jackknives can be derived from the
properties of the generalized bootstrap.
FnB(x) = n^{-1} Σ_{i=1}^n I{Yib ≤ x},
L̂Tnb(x) = B^{-1} Σ_{b=1}^B I{Tnb ≤ x}.
Thus the bootstrap was a breakthrough in two aspects: it replaces analytical
derivations by computation, and, as we shall see, it can be more accurate than
the classical normal approximation. Let
Yb = (Y1b , . . . , Ynb )^T ∈ R^n
denote the resample vector. Note that each Yib can be any one of the original
Y1 , . . . , Yn with a probability 1/n. So there may be repetitions in the Yb
series, and chances are high that not all of the original elements of Y will
show up in Yb .
Clearly, conditional on the original data Y, the resample Yb is random,
and if we repeat the SRSWR, we may obtain a completely different Yb re-
sample vector.
Let us define
θ̂nb = n^{-1} Σ_{i=1}^n Yib ,  and  Znb = n^{1/2} (θ̂nb − θ̂n),
and let LZnb be the distribution of Znb given Y. This conditional distribution
is random but depends only on the sample Y. Hence it can be calculated
exactly when we consider all the nn possible choices of the resample vector
Yb . The bootstrap idea is to approximate the distribution of the normalized
Zn = n1/2 (θ̂n − θ) by this conditional distribution. For use later on, we also
define the asymptotically pivotal random variable Z̃n = n1/2 (θ̂n − θ)/σ, and
denote its exact finite sample distribution by LZ̃n .
Note that by the CLT, LZn converges to N (0, σ 2 ) and LZ̃n converges to
the parameter-free distribution N (0, 1) as n → ∞. Further, LZnb is also
the distribution of a standardized partial sum of (conditionally) i.i.d. random
variables. Hence it is not too hard to show that as n → ∞, this also converges
(almost surely) to the N (0, σ 2 ) distribution. One easy proof of this when the
third moment is finite follows from the Berry-Esseen bound given in the next
section. The fundamental bootstrap result is that,
sup_x |LZnb(x) − LZn(x)| → 0, almost surely.    (3.14)
Suppose that σ 2 is known. From the above result, either LZnb or the
CLT-based N (0, σ 2 ) distribution may be used as an approximation for the
unknown sampling distribution LZn for obtaining confidence intervals or con-
ducting hypothesis tests. Unfortunately, it turns out that the accuracy of the
approximation (3.14) is the same as that of the normal approximation (3.1).
Thus apparently no gain has been achieved.
In a vast number of practical problems and real data applications the
variance σ 2 is unknown, and we now consider that case. In the classical
frequentist statistical approach, we may obtain a consistent estimator of σ 2 ,
say σ̂ 2 , and use it as a plug-in quantity for eventual inference. An unbiased
estimator of σ 2 is
σ̂u² = (n − 1)^{-1} Σ_{i=1}^n (Yi − θ̂n)²,
and the Studentized statistic Tn, obtained from Z̃n by replacing σ with its estimate, satisfies
LTn →_D N(0, 1).
Instead of the standard Normal quantile, a t_{df=n−1} quantile is often used, which
makes little practical difference when n is large, yet accommodates the hope
of being exact in case the data are i.i.d. N(θ, σ²). However, questions remain
about how accurate intervals like (3.15) are. We will address these issues in
the next few pages.
Another common estimate of σ² is
σ̂n² = n^{-1} Σ_{i=1}^n (Yi − θ̂n)².
The bootstrap Studentized statistic Tnb is formed from the resample Yb in the same way as Tn is formed from Y,
and we denote its distribution by LTnb . A major discovery in the early days
of the bootstrap, which greatly contributed to the flourishing of this topic, is
that LTnb can be a better approximation for LZ̃n compared to N(0, 1). This
in turn leads to the fact that a bootstrap-based one-sided (1 − α) confidence
interval for θ can be orders of magnitude more accurate than (3.15). We
discuss these aspects in greater detail in Section 3.4.2.
We now briefly discuss the computation aspect of this approach. Note
that there are nn possible values of Yb . See Hall (1992), Appendix I for
details on distribution of possible repetitions of {Yi } in Yb . Hence finding
the exact conditional distribution of Znb involves evaluating it for all these
values. This is computationally infeasible even when n is moderate, hence
a Monte Carlo scheme is regularly used. Suppose we repeat the process of
getting Yb several times b = 1, 2, . . . , B, and get θ̂nb and Znb for 1 ≤ b ≤ B.
Define the e.c.d.f. of these Zn1 , . . . , ZnB :
L̂Znb(·) = B^{-1} Σ_{b=1}^B I{Znb ≤ ·}.    (3.16)
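The Monte Carlo scheme leading to (3.16) takes only a few lines of R; here is an illustrative sketch with simulated data (our own names, not the book's code).

set.seed(3)
n = 40
Y = rgamma(n, shape = 2)                 # original sample
theta.hat = mean(Y)
B = 2000
Znb = replicate(B, {
  Yb = sample(Y, n, replace = TRUE)      # SRSWR resample Y_b
  sqrt(n) * (mean(Yb) - theta.hat)       # Z_nb
})
L.hat = ecdf(Znb)                        # e.c.d.f. as in (3.16)
print(c(bootstrap = L.hat(1), CLT = pnorm(1, sd = sd(Y))))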
(b) (Bentkus and Götze (1996)) Suppose Y1 , . . . , Yn are i.i.d. from a distri-
bution F with EY1 = 0, 0 < VY1 = σ 2 < ∞. Let μ3 = E|Y1 |3 . Define
Ȳ = n^{-1} Σ_{i=1}^n Yi ,
σ̂² = n^{-1} Σ_{i=1}^n (Yi − Ȳ)²,  and  Tn = Ȳ/σ̂.
Then there exists an absolute constant C > 0 such that for all n ≥ 2
sup_x |P(√n Tn < x) − Φ(x)| ≤ C μ3 / (σ³ √n).    (3.20)
The main essence of Theorem 3.1 is that for Zn or Tn , n1/2 times the
absolute difference between the actual sampling distribution and the Normal
distribution is upper bounded by a finite constant. Thus, in using the normal
approximation, we make an error of O(n−1/2 ). It is also known, and can be
easily verified by using i.i.d. Bernoulli variables, that the rate n−1/2 cannot
be improved in general. This puts a limit on the accuracy of the normal
approximation for the normalized and the Studentized mean. We note in passing that
such results have been obtained in many other, more complex and challenging,
non-i.i.d. models and for many other statistics.
Assuming that the third moment of the distribution is finite, we can apply
the Berry-Esseen bound (3.19) to Znb along with that for Zn, and this implies
(3.14) mentioned earlier. However, at this point it is still not clear which is a
better approximation for LZn: N(0, 1) or LZnb?
We shall now show that there is a crucial difference between dealing with
a normalized statistic and a Studentized statistic. Basically, for a normalized
statistic there is no gain in bootstrapping. However, under suitable condi-
tions, LTnb is a better estimator for LZ̃n compared to N (0, 1). This is known
as the Singh property. This property is now known to hold in many other
models and statistics but we shall restrict ourselves to only Tnb. We need an
Edgeworth expansion for LZ̃n, which is given in the following theorem.
Theorem 3.2. Suppose Yi are i.i.d. with distribution F that has mean 0,
variance σ 2 and finite third moment μ3 .
(a) If F is lattice with span h, then uniformly in x,
LZ̃n(x) = Φ(x) + μ3(1 − x²)/(6σ³ n^{1/2}) φ(x) + h/(6σ³ n^{1/2}) g(n^{1/2} σ h^{-1} x) φ(x) + o(n^{-1/2}).    (3.21)
(b) If F is non-lattice, then uniformly in x,
LZ̃n(x) = Φ(x) + μ3(1 − x²)/(6σ³ n^{1/2}) φ(x) + o(n^{-1/2}).    (3.22)
These bounds show that the order O(n−1/2 ) in the Berry-Esseen theorem
is sharp and provide additional information on the leading error term un-
der additional conditions. Note that the nature of the leading error term is
different in the lattice and non-lattice cases.
Now, if we could obtain similar expansions for LZnb then we could use
the two sets of expansions to compare the two distributions. This is a non-
trivial issue. Note that for any fixed n, the bootstrap distribution of Y1b ,
being the e.c.d.f. of Y1 , . . . , Yn , is necessarily a discrete distribution. When F
is lattice (non-lattice), the bootstrap distribution may or may not be lattice
(respectively non-lattice). However, it should behave as a lattice (respectively
non-lattice) especially when n is large.
In an extremely remarkable work, the following result was proved by Singh
(1981). For a detailed exposition on Edgeworth expansions in the context of
bootstrap, see Bhattacharya and Qumsiyeh (1989), Hall (1992) and Bose and
Babu (1991).
Theorem 3.3 (Singh (1981)). Suppose Yi are i.i.d. with distribution F which
has mean 0, variance σ 2 and finite third moment μ3 . Then
(a) lim sup_{n→∞} E|Y1|³ σ^{-3} n^{1/2} sup_x |LZ̃n(x) − LTnb(x)| ≤ 2C0 , almost surely,
(b) If F is lattice with span h, then, almost surely, uniformly in x,
LTnb(x) = Φ(x) + μ3(1 − x²)/(6σ³ n^{1/2}) φ(x) + h/(6σ³ n^{1/2}) g(n^{1/2} σ̂n h^{-1} x) φ(x) + o(n^{-1/2}).
Consequently,
lim sup_{n→∞} n^{1/2} sup_x |LZ̃n(x) − LTnb(x)| = h/√(2πσ²), almost surely.
(c) If F is non-lattice, then, almost surely, uniformly in x,
LTnb(x) = Φ(x) + μ3(1 − x²)/(6σ³ n^{1/2}) φ(x) + o(n^{-1/2}).
Consequently,
n^{1/2} sup_x |LZ̃n(x) − LTnb(x)| → 0, almost surely.
Part (a) shows that the difference between LTnb and LZ̃n has an upper
bound of order O(n−1/2 ), the same as the difference between LZn and the
normal approximation. Thus there may be no improvement in using LTnb .
Part (b) shows that when the parent distribution F is lattice, there is
no improvement in using LTnb to approximate LZ̃n compared to the normal
approximation.
However, part (c) is most interesting. It implies that using the asymptoti-
cally pivotal Studentized statistic Tnb is extremely fruitful, and the bootstrap
distribution LTnb is a better estimator of LZ̃n compared to the N (0, 1) ap-
proximation.
This is the higher order accuracy or Singh Property. More complex for-
mulations and clever manipulations can result in even higher order terms be-
ing properly emulated by the bootstrap, see Abramovitch and Singh (1985).
Moreover, such higher order accuracy results have been proved in many other
set-ups, including Studentized versions of various U -statistics. The use of
Edgeworth expansions in the context of the bootstrap has been explored in
detail in Hall (1992), where corresponding results for many (asymptotically)
pivotal random variables of interest may be found.
The sharper approximation has direct consequence in inference. Consider
the problem of getting a one-sided (1−α) confidence interval of θ, based on the
data Y1 , . . . , Yn from some unknown distribution F with mean θ. We assume
that F is non-lattice, and that the variance σ 2 is known. We discuss only
the unbounded left-tail version, where the interval is of the form (−∞, Rn,α )
for some statistic Rn,α. The CLT-based estimator for this is (−∞, θ̂n +
n^{-1/2} σ z_{1−α}]. From (3.22), we can compute that this interval has a O(n^{-1/2})
coverage error, that is,
P(θ ∈ (−∞, θ̂n + n^{-1/2} σ z_{1−α}]) = 1 − α + O(n^{-1/2}).
On the other hand, using Theorem 3.3(c), if tα,b is the α-th quantile of
Tnb, that is, if P(Tnb ≤ tα,b) = α, we have
1 − α + O(n^{-1}) = P(Z̃n ≥ tα,b)
                 = P(θ̂n − θ ≥ n^{-1/2} σ tα,b)
                 = P(θ ∈ (−∞, θ̂n − n^{-1/2} σ tα,b]).
Thus, we obtain that the bootstrap-based confidence interval (−∞, θ̂n −
n^{-1/2} σ tα,b] is O(n^{-1}) accurate almost surely.
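The following R sketch (ours, for illustration) contrasts the CLT-based one-sided interval with the bootstrap-t interval just described, Studentizing within each resample since σ is typically unknown in practice.

set.seed(4)
n = 30; alpha = 0.05
Y = rexp(n) + 1                          # skewed data with mean 2
theta.hat = mean(Y); s = sd(Y)
B = 2000
Tnb = replicate(B, {
  Yb = sample(Y, n, replace = TRUE)
  sqrt(n) * (mean(Yb) - theta.hat) / sd(Yb)    # Studentized bootstrap statistic
})
t.alpha = quantile(Tnb, alpha)                 # quantile t_{alpha,b}
ci.clt  = c(-Inf, theta.hat + qnorm(1 - alpha) * s / sqrt(n))
ci.boot = c(-Inf, theta.hat - t.alpha * s / sqrt(n))
print(rbind(ci.clt, ci.boot))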
The above discussion was for the case of one-sided confidence intervals,
when σ is known. Results are also available for the case when σ is unknown,
and it can be shown that the interval in (3.15) has a coverage error of O(n^{-1/2}),
while the corresponding bootstrap interval has O(n^{-1}) coverage error. Similar
results are available for two-sided intervals, both when σ is known and when
it is unknown. The accuracy of the coverage is different from the one-sided
case, but the bootstrap intervals are still more accurate than the traditional
intervals by a factor of n^{-1/2}. We do not discuss the details here since several
technical tools need to be developed for that. Many such details may be found
in Hall (1992).
It can be seen from the above discussion that the existence and ready
usability of an asymptotically pivotal random variable is critically important
in obtaining the higher-order accuracy of the bootstrap estimator. Thus,
Studentization is typically a crucial step in obtaining the Singh Property.
Details on the Studentization and bootstrap-based inference with the Singh
property are given in Babu and Singh (1983, 1984, 1985); Hall (1986, 1988)
and in several other places.
For the sample median ξ̂n, with ξ the population median and f the density of F,
a bootstrap variance estimator based on the bootstrap medians ξ̂nb is
V̂nB = B^{-1} Σ_{b=1}^B (ξ̂nb − ξ̂n)²,
and it can be shown that
4f²(ξ) V̂nB → 1 almost surely.
This bootstrap can also be used to estimate the entire sampling distribution Ln
of the normalized sample median, for which
Ln →_D N(0, 1/(4f²(ξ))).
If L̂n denotes the corresponding bootstrap estimate of Ln, then
lim sup_{n→∞} n^{1/4} (log log n)^{1/2} sup_x |L̂n(x) − Ln(x)| = CF almost surely.
However, note that the accuracy of the above approximation is of the order
O(n−1/4 (log log n)−1/2 ), which is quite low. Other resampling schemes have
been studied in this context in Falk and Reiss (1989); Hall and Martin (1991);
Falk (1992). We omit the details here.
We now return to the simple linear regression model (3.3), fitted by least squares. Define the residuals
ri = Yi − β̂1 − β̂2 xi .
In the residual bootstrap, for each b = 1, . . . , B we obtain {rib , i = 1, . . . , n} as an SRSWR sample from {ri , i = 1, . . . , n}, and set Yib = β̂1 + β̂2 xi + rib .
A slight modification of the above needs to be done for the case where the
linear regression model is fitted without the intercept term β1 . In that case,
n
define r̄ = n−1 i=1 ri and for every b ∈ {1, . . . , B}, we obtain {rib , i =
1, . . . , n} as an i.i.d. sample from {ri − r̄, i = 1, . . . , n}. That is, they are an
SRSWR from the centered residuals. Note that when an intercept term is
present in the model, r̄ = 0 almost surely and hence no centering was needed.
Then for every b = 1, . . . , B, we obtain the bootstrap β̂ b by minimizing
Σ_{i=1}^n (Yib − β1 − β2 xi)².
Suppose rib = Yib − β̂b1 − β̂b2 xi are the residuals at the b-th bootstrap step.
Define
σ̂b² = (n − 2)^{-1} Σ_{i=1}^n (Yib − β̂b1 − β̂b2 xi)² = (n − 2)^{-1} Σ_{i=1}^n rib².
This is similar to the noise variance estimator σ̂ 2 based on the original data.
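A minimal R sketch of the residual bootstrap described above, with simulated data and illustrative names of our own.

set.seed(5)
n = 60
x = runif(n, 0, 10)
y = 1 + 0.8 * x + rt(n, df = 5)          # non-normal errors
fit = lm(y ~ x)
r = residuals(fit); yhat = fitted(fit)
B = 1000
beta.boot = t(replicate(B, {
  rb = sample(r, n, replace = TRUE)      # SRSWR from the residuals
  yb = yhat + rb                         # bootstrap responses
  coef(lm(yb ~ x))                       # bootstrap estimate betahat_b
}))
print(apply(beta.boot, 2, sd))           # resampling standard errors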
In order to state the results for the distributional consistency of the above
procedure, we use a measure of distance between distribution functions. Sup-
pose Fr,p is the space of probability distribution functions on Rp that have
finite r-th moment, r ≥ 1. That is,
Fr,p = G: ||x||r dG(x) < ∞ .
x∈Rp
For H, G ∈ Fr,p, the Mallows distance is defined as ρr(H, G) = inf (E||X − Y||^r)^{1/r},
where the infimum is taken over TX,Y, the collection of all possible joint distributions of (X, Y) whose
marginal distributions are H and G respectively (Mallows, 1972). In a slight
abuse of notation, we may also write the above as ρr(X, Y).
Consider either the normalized residual bootstrap statistic Znb or its Studentized version Tnb. It can be shown that, conditional on the data, their distributions converge almost surely to Normal limits.
Coupled with the fact that the normalized and Studentized statistic that
were formed using the original estimator β̂ converge to the same limiting
distributions (see (3.8), (3.10)), this shows that the residual bootstrap is
consistent for both Tn and Zn. In practice, a Monte Carlo approach is taken,
and the e.c.d.f. of {Znb , b = 1, . . . , B} or of {Tnb , b = 1, . . . , B} is used for
bootstrap inference with a (large) choice of B.
A different scheme is to resample entire data-pairs: draw {(Yib , xib), i = 1, . . . , n}
as an SRSWR sample from {(Yi , xi), i = 1, . . . , n}, and obtain β̂b by minimizing
Σ_{i=1}^n (Yib − β1 − β2 xib)².
This bootstrap is known as the paired bootstrap. Assume that (Yi , xi ) are
i.i.d. with E||(Yi , xi )||4 < ∞, V(xi ) ∈ (0, ∞), E(ei |xi ) = 0 almost surely. Let
1/2
the distribution of Tnb = σ̂b−1 X T X (β̂ b − β̂), conditional on the data,
be LTnb . Freedman (1981) proved that LTnb → Np (0, Ip ) almost surely. This
establishes the distributional consistency of the paired bootstrap for Tn .
It can actually be shown that under very standard regularity conditions,
the paired bootstrap is distributionally consistent even in many cases where
the explanatory variables are random, and when the errors are heteroscedas-
tic. It remains consistent in multiple linear regression even when the dimension
p increases with the sample size n. These de-
tails follow from the fact that the paired bootstrap is a special case of the
generalized bootstrap described later, for which corresponding results were es-
tablished in Chatterjee and Bose (2005). However, there are additional steps
needed before the Singh property can be claimed for this resampling scheme.
Resampling rows of the data, as the paired bootstrap does, may produce
resamples where the design matrix is not of full column rank. However, such
cases happen with exponentially small probability (Chatterjee and Bose, 2000),
and may be ignored during computation.
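The paired bootstrap is equally easy to code; the sketch below reuses y, x, n and B from the residual bootstrap sketch above.

paired.boot = t(replicate(B, {
  idx = sample(1:n, n, replace = TRUE)   # resample (Y_i, x_i) pairs
  coef(lm(y[idx] ~ x[idx]))
}))
print(apply(paired.boot, 2, sd))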
In the wild bootstrap, we set Yib = β̂1 + β̂2 xi + ri Wib , where the Wib are external
random variables, generated independently of the data, with mean zero and unit variance.
We obtain β̂b by minimizing
Σ_{i=1}^n (Yib − β1 − β2 xi)².
This is known as the wild or the external bootstrap. Under (3.11), Shao
and Tu (1995) established the distributional consistency of this resampling
scheme.
When random regressors are in use, Mammen (1993) established the dis-
tributional consistency for both the paired and the wild bootstrap under very
general conditions.
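A sketch of the wild bootstrap, again reusing y, x, n and B from the earlier sketches; standard normal external weights are used here as one common (and here assumed) choice.

fit = lm(y ~ x); r = residuals(fit); yhat = fitted(fit)
wild.boot = t(replicate(B, {
  W = rnorm(n)                 # external weights with mean 0, variance 1
  yb = yhat + r * W            # wild bootstrap responses
  coef(lm(yb ~ x))
}))
print(apply(wild.boot, 2, sd))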
In the generalized bootstrap (GBS), β̂b is obtained by minimizing
Σ_{i=1}^n Wib (Yi − β1 − β2 xi)².
Here {W1b , . . . , Wnb } are a set of random weights, and the properties of the
GBS are entirely controlled by the distribution of the n-dimensional vector
Wnb = (W1b , . . . , Wnb ). We discuss special cases below, several of which were
first formally listed in Præstgaard and Wellner (1993). We omit much of the
details, and just list the essential elements of the resampling methodology.
Let Πn be the n-dimensional vector all of whose elements are 1/n; thus Πn =
(1/n) 1n ∈ R^n, where 1n is the n-dimensional vector of all 1’s.
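One GBS replicate is simply a weighted least squares fit; the sketch below (reusing y, x, n and B from above) uses i.i.d. exponential weights scaled to sum to n, one possible choice of Wnb that we assume here for illustration.

gbs.boot = t(replicate(B, {
  E = rexp(n)
  W = n * E / sum(E)              # random weights with sum n
  coef(lm(y ~ x, weights = W))    # minimizes sum_i W_i (y_i - b1 - b2 x_i)^2
}))
print(apply(gbs.boot, 2, sd))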
3.6 Exercises
1. Show that if Tn is the sample mean as in Example 3.1, then its jackknife
variance estimate is the same as the traditional unbiased variance estimator
given in (1.4).
2. Show that the two expressions for the jackknife variance estimator V̂nJ given in this chapter are equal.
3. Show that if Tn is the sample mean as in Example 3.1, then its naive
bootstrap variance estimate is
n^{-2} Σ_{i=1}^n (Yi − θ̂n)².
4. Suppose Y1b , . . . , Ynb is a bootstrap resample drawn by SRSWR from Y1 , . . . , Yn .
(a) What is the probability that at least one of the original values
Y1 , Y2 , . . . , Yn does not show up in the resample values Y1b , . . . , Ynb ?
(b) Compute for k = 1, . . . , n, the probability that exactly k of the
original values Y1 , Y2 , . . . , Yn show up in the collection of resample
values Y1b , . . . , Ynb .
(c) Comment on the cases k = 1 and k = n.
Resampling U -statistics and M -estimators
4.1 Introduction
Consider a non-degenerate U -statistic Un with a symmetric kernel h of degree
m = 2, and let Unb denote its bootstrap version computed from an SRSWR
resample of the data. Suppose that E|h(Y1 , Y2 )|² < ∞, E|h(Y1 , Y1 )|² < ∞, and
∫ h(x, y) dF(y) is not a constant. Then the distribution of Unb conditional
on the data is a consistent estimator of the distribution of Un with τn = 1,
that is, almost surely,
sup_{x∈R} |LUn(x) − LUnb(x)| → 0.
Later, Helmers (1991) proved the Singh property of the bootstrap approximation
for the distribution of a Studentized non-degenerate U -statistic of
degree 2. Suppose the kernel satisfies Eh(Y1 , Y2 ) = θ, the corresponding
U -statistic is
Un = \binom{n}{2}^{-1} Σ_{1≤i<j≤n} h(Yi , Yj),
and
Sn² = 4(n − 1)(n − 2)^{-2} Σ_{i=1}^n ( (n − 1)^{-1} Σ_{j=1}^n h(Yi , Yj) − Un )²,
For the bootstrap versions, we compute Unb and Snb² using the formulas
for Un and Sn² given above, but replacing the Yi ’s with the Yib ’s. Define
θn = n^{-2} Σ_{i,j=1}^n h(Yi , Yj),
which is very close to Un , except that the h(Yi , Yi ) terms are now included in
the summation, which thus has n² terms. Consequently, θn has the scaling
factor n², comparable to the \binom{n}{2} factor that appears in the denominator of
Un . Using these, we obtain the bootstrap distributional estimator
LTnb(x) = PB( n^{1/2} Snb^{-1} (Unb − θn) ≤ x ),  x ∈ R.
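The following R sketch (ours) computes Un , Sn², θn and the bootstrap Studentized statistic above for the degree-2 kernel h(x, y) = (x − y)²/2, whose U -statistic is the sample variance.

h = function(x, y) (x - y)^2 / 2
Ustat = function(y){
  H = outer(y, y, h)
  n = length(y)
  Un = sum(H[upper.tri(H)]) / choose(n, 2)
  q = rowSums(H) / (n - 1)                        # (n-1)^{-1} sum_j h(y_i, y_j)
  Sn2 = 4 * (n - 1) / (n - 2)^2 * sum((q - Un)^2)
  theta.n = mean(H)                               # n^{-2} sum_{i,j} h(Y_i, Y_j)
  list(Un = Un, Sn2 = Sn2, theta.n = theta.n)
}
set.seed(9)
Y = rlnorm(50); n = length(Y)
orig = Ustat(Y)
Tnb = replicate(1000, {
  Yb = sample(Y, n, replace = TRUE)
  b = Ustat(Yb)
  sqrt(n) * (b$Un - orig$theta.n) / sqrt(b$Sn2)   # bootstrap Studentized statistic
})
print(quantile(Tnb, c(0.05, 0.95)))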
Theorem 4.2 (Helmers (1991)). Suppose that the distribution of the Hoeffding
projection h1 is non-lattice, E|h(Y1 , Y2 )|^{4+ε} < ∞ for some ε > 0, and
E|h(Y1 , Y1 )|³ < ∞. Then the Singh property holds for LTnb as an estimator of
the distribution of the Studentized Un .
Now consider a general U -statistic Un with a symmetric kernel h of degree m. We define its generalized bootstrap (GBS) version as
Unb = \binom{n}{m}^{-1} Σ_{1≤i1<···<im≤n} Wn:i1,...,im h(Yi1 , . . . , Yim).    (4.2)
One choice of the resampling weights is the multiplicative form
Wn:i1,...,im = Π_{j=1}^m Wn:ij.    (4.3)
An alternative is the additive form
Wn:i1,...,im = m^{-1} Σ_{j=1}^m Wn:ij.    (4.5)
In Section 4.4 below we discuss the properties of this particular choice of re-
sampling weights in detail. We will see that using the additive form (4.5) can
lead to significant improvement in computational efficiency, without compro-
mising on the accuracy.
Recall Example 2.2 from Chapter 2, where we showed that all U -statistics
are Mm -estimators. That is, for any kernel h(y1 , . . . , ym ), which is symmetric
in its arguments, the corresponding U -statistic can be obtained as the unique
Mm -estimator when we use the contrast function
f(y1 , . . . , ym , θ) = (θ − h(y1 , . . . , ym))² − h²(y1 , . . . , ym).
In view of this, instead of just presenting the analysis of GBS for U -statistics
given in (4.2), we present in Section 4.5 the full discussion of GBS for generic
Mm -estimators.
The resampling weights Wn:i = Wi are assumed to satisfy the following conditions:
EB W1 = 1,    (4.7)
0 < k < τn² < K,    (4.8)
c11 = O(n^{-1}),    (4.9)
c22 → 1,    (4.10)
sup_n c4 < ∞.    (4.11)
Note that when Wn = (W1 , . . . , Wn) has the Multinomial(n; 1/n, . . . , 1/n)
distribution that links the GBS to Efron’s bootstrap, all the above conditions
are satisfied.
Theorem 4.3. Suppose the kernel h satisfies (4.13)-(4.15) and the resampling
weights are of the form (4.5) where Wn:i satisfy (4.7)-(4.11) and also
Σ_{i=1}^n Wn:i = n.
Then
sup_{x∈R} |LUn(x) − LUnb(x)| → 0 almost surely as n → ∞.    (4.16)
The condition Σ_{i=1}^n Wn:i = n and (4.8) together imply (4.9) and (4.12).
A convergence in probability version of this theorem may also be established
under conditions more relaxed than (4.13)-(4.14).
To prove the theorem we need a CLT for weighted sums of row-wise ex-
changeable variables from Præstgaard and Wellner (1993). The idea behind
this result is Hajek’s classic CLT (Hájek (1961)) for sampling without replace-
ment.
Theorem 4.4. Suppose that for each m, {amj , 1 ≤ j ≤ m} are real constants and {Bmj , 1 ≤ j ≤ m} are exchangeable random variables satisfying
m^{-1} Σ_{j=1}^m (amj − ām)² → σ² > 0,
m^{-1} max_{j=1,...,m} (amj − ām)² → 0,
m^{-1} Σ_{j=1}^m (Bmj − B̄m)² →_P c² > 0, and
lim_{K→∞} lim sup_{m→∞} E[ (Bmj − B̄m)² I{|Bmj − B̄m| > K} ] = 0.
Here,
ām = m^{-1} Σ_{j=1}^m amj  and  B̄m = m^{-1} Σ_{j=1}^m Bmj.
Then
m^{-1/2} Σ_{j=1}^m (amj Bmj − ām B̄m) →_D N(0, c²σ²).    (4.17)
Then
n^{1/2} τn^{-1} (Unb − Un) = n^{1/2} N^{-1} τn^{-1} Σ_s hs (Ws − 1)
= n^{1/2} N^{-1} τn^{-1} Σ_s Σ_{j=1}^m h1(Yij) (Ws − 1) + n^{1/2} N^{-1} θ τn^{-1} Σ_s (Ws − 1) + n^{1/2} N^{-1} τn^{-1} Σ_s gs (Ws − 1)
= T1 + T2 + T3 , say.
Because of the condition Σ_{i=1}^n Wn:i = n we have that Σ_s (Ws − 1) = 0,
so T2 = 0 almost surely.
Using (4.13) and the fact that E( N^{-1} Σ_s gs )⁴ = O(n^{-4}) (see Serfling
(1980), page 188) and (4.9), after some algebra we have that for any δ > 0,
PB( |T3| > δ ) = OP(n^{-2}).
It remains to consider T1 . For this term also, using (4.6), we have that
T1 = n^{-1/2} Σ_{i=1}^n Wi h1(Yi) + rnb ,
where again
PB( |rnb| > δ ) = oP(n^{-2}) for any δ > 0.
We now need to show that the distribution of n^{1/2}(Un − θ) and the boot-
strap distribution of n^{-1/2} m Σ_{i=1}^n Wi h1(Yi) converge to the same limiting
normal distribution. For the original U -statistic, this is the UCLT, Theo-
rem 1.1. For the bootstrap statistic, we use Theorem 4.4 to get the result.
The conditions (4.10) and (4.11) are required in order to satisfy the conditions
of Theorem 4.4. The details are left as an exercise.
For 1 ≤ i ≤ n, let Ũni = \binom{n-1}{m-1}^{-1} Σ_{s: i∈s} h(Yi1 , . . . , Yim) be the average of the kernel over all index sets s that contain i. With the additive weights (4.5), the GBS statistic (4.2) reduces to
Unb = n^{-1} Σ_{i=1}^n Wi Ũni .    (4.18)
We now compare the use of general multiplicative weights and additive weights
for bootstrapping. Since Monte Carlo simulations are an integral part of the
computation of the bootstrap, we assume that B bootstrap iterations are carried
out under both methods.
The remarkable fact is that while using additive weights (4.5), the time
and storage space requirements are reduced simultaneously, and for each boot-
strap step instead of requiring O(nm ) time and space, the requirement is only
O(n).
Proof of Theorem 4.5: (a) Let us assume that for given (y1 , . . . , yn) the
computation of each h(y1 , . . . , ym) takes H units of time. Then it can be seen
that each Ũni requires \binom{n-1}{m-1} H units of time, and once all Ũni ’s have been
computed and stored, Un is easily computed in n steps. Thus an initial
computation of order n^m is required for both bootstrap methods.
However, once the bootstrap weights are generated (without loss we as-
sume these to be generated in O(n) time), for the additive weights case only
2n more steps are needed, whereas for the general resampling weights case
the number of steps needed is m\binom{n}{m}, assuming all the h(Yi1 , . . . , Yim) are
stored. Thus for each bootstrap iteration the time complexities are O(n) and
O(n^m) respectively for the two methods. This completes the proof of part
(a).
Part (b) is easily proved by observing that when we use additive weights
only the Ũni defined in (4.18) need to be stored, whereas when we use general
resampling weights all the h(Yi1 , . . . , Yim ) have to be stored.
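A sketch in R of the computational point just proved, for a degree-2 kernel: the Ũni are computed once, after which each additive-weight GBS replicate in (4.18) costs only O(n).

h = function(x, y) (x - y)^2 / 2
set.seed(10)
Y = rnorm(100); n = length(Y)
H = outer(Y, Y, h)                            # one-time O(n^2) work
U.tilde = (rowSums(H) - diag(H)) / (n - 1)    # average of h over index sets containing i
Un = mean(U.tilde)                            # equals the U-statistic Un
Unb = replicate(2000, {
  W = as.vector(rmultinom(1, n, rep(1, n)))   # multinomial weights, sum n
  mean(W * U.tilde)                           # (4.18): O(n) per replicate
})
print(sd(sqrt(n) * (Unb - Un)))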
(II) (4.19) exists and is finite for all θ, that is, Q(θ) is well defined.
Proof of Theorem 4.6: In order to prove this theorem, we use the trian-
gulation Lemma of Niemiro (1992) which is also given in Chapter 2. Without
loss of generality, let θ0 = 0 and Q(θ0 ) = 0. Now define
Then we have
Proof of (4.29). Fix an M > 0. For fixed δ∗ > 0, get δ > 0 and ε > 0 such
that M1 = M + (2δ)^{1/2} and δ∗ > 5M1 λmax(H) δ + 3ε, where λmax(H) is the
maximum eigenvalue of H. Consider the set A1 = {θ : ||θ|| ≤ M1} and let
B = {b1 , . . . , bN} be a finite δ-triangulation of A1 . Note that with
≤ PB[ sup_{θ∈B} τn^{-1} | Σ_i Xnbi − θ^T Hθ/2 | > ε ]    (4.31)
≤ Σ_{j=1}^N PB[ τn^{-1} | Σ_i Wi Xni(bj) − bj^T H bj/2 | > ε ]
≤ Σ_{j=1}^N PB[ | Σ_i Wi Xni(bj) | > ε/2 ] + Σ_{j=1}^N I{ τn^{-1} | Σ_i Xni(bj) − bj^T H bj/2 | > ε/2 }
≤ k Σ_{j=1}^N Σ_{i=1}^n Xni²(bj) + Σ_{j=1}^N I{ τn^{-1} | Σ_i Xni(bj) − bj^T H bj/2 | > ε/2 }    (4.32)
= oP(1).    (4.33)
In the above calculations (4.31) follows from the triangulation Lemma 2.2
given in Chapter 2. Observe that in (4.32) the index j runs over finitely many
points, so it is enough to show the probability rate (4.33) for each fixed b. It
has been proved by Niemiro (1992) (see page 1522) that for fixed b,
Σ_i Xni²(b) = oP(1)  and  Σ_i Xni(b) − b^T H b/2 = oP(1).    (4.34)
Hence (4.33) follows by using the lower bound in (4.8). This proves (4.29).
Proof of (4.30). Note that
PB[ ||n^{-1/2} Snb|| > M ] ≤ (1/(M²n)) EB || Σ_{i=1}^n Wi g(Yi ; 0) ||²
≤ (2/(M²n)) [ τn^{-2} EB || Σ_{i=1}^n Wi g(Yi ; 0) ||² + || Σ_{i=1}^n g(Yi ; 0) ||² ]
≤ (K τn^{-2}/(M²n)) Σ_{i=1}^n ||g(Yi ; 0)||² + (2/(M²n)) || Σ_{i=1}^n g(Yi ; 0) ||²
= UM , say.
The constant K in the last step is obtained by using the upper bound condi-
tion in (4.8) and (4.9). Now fix any two constants ε, δ > 0. By choosing M
large enough, UM can be made small with high probability. Whenever inequalities such as
sup_{||θ||≤M} τn^{-1} | Σ_{i=1}^n Xnbi − θ^T Hθ/2 | < δ0 ,    (4.36)
hold, the convex function nQnb (n−1/2 θ) − nQnb (0) assumes at n−1/2 H −1 Snb
a value less than its values on the sphere
||θ + n^{-1/2} H^{-1} Snb|| = κ (τn δ0)^{1/2},
where κ = 2 λmin^{-1/2}(H). The global minimiser of this function is n^{1/2} θ̂nb . Hence
Since δ0 is arbitrary, and because of (4.29) and (4.30), we have for any δ > 0
PB [||rnb3 || > δ] = oP (1).
Now use Theorem 2.3 from Chapter 2 to get the result. This step again
uses (4.8).
Also observe that given the condition on the bootstrap weights, Theorem 4.4
may be applied to obtain
Define cj = EB(Ws Wt) whenever |s ∩ t| = j, for j = 0, 1, . . . , m.
From (4.42) we have that the asymptotic mean of f1 is 0. Let its asymp-
totic variance be v²(m). This will in general be a function of m. Then the
appropriate standardized bootstrap statistic is
n^{1/2} ξn^{-1} v(m)^{-1} (θ̂nB − θ̂n).
Then we have the following Corollary. Its proof is similar to the proof of the
corresponding results in Section 4.5.1, and we omit the details.
Corollary 4.2. Assume the conditions of Theorem 4.7. Assume also that the
fn:i ’s are exchangeable with sup_n EB fn:i⁴ < ∞ and EB( fn:i² fn:j² ) → 1 for i ≠ j as
n → ∞. Then
Proof of Theorem 4.7: The first part of this proof is similar to the proof
of Theorem 4.6.
Then arguments similar to those used in the proof of Theorem 4.6 yield
(4.44), once it is established that for any fixed ε, δ > 0,
(a) For any M > 0,
PB[ sup_{||θ||≤M} | nN^{-1} Σ_s Xnbs − θ^T Hθ/2 | > δ ] = oP(1).    (4.48)
The proofs of these are similar to those of (4.29) and (4.30). Carefully
following those arguments, we only have to show that
n² N^{-2} EB ( Σ_s Ws Xns )² = oP(1),    (4.50)
n N^{-2} EB || Σ_s Ws g(Ys , 0) ||² = OP(1).    (4.51)
A little algebra shows that nN^{-1} Σ_s Xns = OP(1). The first term is oP(1)
from this and (4.42). For the other term, first note that the sum over j is
finite, and from (4.43), we only need to show that
n² N^{-2} Σ_{|s∩t|=j} Xns Xnt = oP(1) for every fixed j = 1, . . . , m.
Since the number of terms in Σ_{|s∩t|=j} is O(n^{-1} N²), it is enough to show
that for any s, t ∈ S,
Observe that θT (g(Ys , n−1/2 θ) − g(Ys , 0)) are non-negative random variables
that are non-increasing in n, and their limit is 0. This establishes the above.
Details of this argument are similar to those given in Chapter 2 following (2.33).
The proof of (4.51) is along the same lines. This proves (4.44).
In order to get the representation (4.45), we have to show
n^{1/2} N^{-1} Σ_s Ws g(Ys , 0) = m n^{-1/2} Σ_i fi g1(Yi , 0) + Rnb1 ,
Let h(Ys , 0) = g(Ys , 0) − Σ_{j=1}^m g1(Yij , 0). This is a kernel of a first order
degenerate U -statistic. Then we have
Σ_s Ws g(Ys , 0) = Σ_s Σ_{j=1}^m Ws g1(Yij , 0) + Σ_s Ws h(Ys , 0),
and
Σ_s Σ_{j=1}^m Ws g1(Yij ; 0) = \binom{n-1}{m-1} Σ_{i=1}^n fi g1(Yi ; 0).
Also,
E || N^{-1} Σ_s h(Ys , 0) ||² = O(n^{-2}).    (4.52)
Now using this result and (4.43), after some algebra we obtain
PB[ || n^{1/2} N^{-1} Σ_s Ws h(Ys , 0) || > δ ] = oP(1),
4.6 Exercises
1. Suppose w1 , . . . , wk are exchangeable random variables such that Σ_{i=1}^k wi
is a constant. Show that for every m ≠ n, c11 = corr(wm , wn) = −1/(k − 1),
irrespective of the distribution of {wi}. Hence verify (4.12).
3. Extend the setup in the previous question to the case where each pos-
sible k-dimensional vector has d coordinates equal to 0 and the rest equal to 1.
An Introduction to R
> library(UStatBookABSC)
Packages and core R may occasionally need to be updated, for which the
steps are mostly similar to installation. Package installation, updating and
new package creation can also be done from inside R Studio, in some ways
more easily.
As a general rule, using the R help pages is very highly recommended.
They contain a lot more information than what is given below. The way to
obtain information about any command, say the print command, is to simply
type in
> ?print
If the exact R command is not known, just searching for it in generic terms
on the internet usually elicits what is needed.
Simple computations can be done by typing in the data and commands
at the R prompt, that is the “>” symbol in the R console. However, it is not
good practice to type in lengthy commands or programs in the R console
prompt, or corresponding places in any IDE. One should write R programs
using a text editor, save them as files, and then run them. R programs are
often called R scripts.
One common way of writing such scripts or programs is by using the built-
in editor in R. In the menubar in the R workspace, clicking on the File menu
followed by the New script menu opens the editor on an empty page where
one may write a new program. To edit an existing script, the menu button
Open script may be used. In order to run a program/script, one should
click on the Source menu button. The Change dir menu button allows one
to switch working directory.
> getwd()
[1] "/Users/ABSC/Programs"
> setwd("/Users/ABSC/Programs/UStatBook")
> getwd()
[1] "/Users/ABSC/Programs/UStatBook"
Sometimes, but not always, one might want to come back to a work done
earlier on an R workspace. To facilitate this, R can save a workspace at the
end of a session. Also, when starting a new R session, one may begin with
a previously saved workspace. All of these are useful, if and when we want
to return to a previous piece of work in R. It is an inconvenience, however,
when one wants to do fresh work, and does not want older variable name
assignments and other objects stored in R memory to crop up. Also, a saved
workspace often takes up considerable storage space. Memory problems are
also known to occur. These issues are not huge problems for experts, but
often inconvenience beginners.
A simple way to avoid these issues is to include the command
rm(list = ls())
as the first line of the file UStatBookCodes.R. We run this file in R by clicking
on the “Source File” command. Thus
> source("/Users/ABSC/Programs/UStatBook/UStatBookCodes.R")
executes all the commands that we save in the file UStatBookCodes.R, which
we have saved in the directory /Users/ABSC/Programs/UStatBook.
Typing a command of the form
R CMD BATCH UStatBookCodes.R UStatBookOutFile.txt &
at the terminal prompt will run R to execute all the commands in the file UStatBookCodes.R, and
save any resulting output to the file UStatBookOutFile.txt. The ampersand
at the end is useful for Linux, Unix and Macintosh users, whereby the pro-
cesses can be run in the background, and not be subject to either inadvertent
stopping and will not require continuous monitoring. Being able to execute
R files from the terminal is extremely useful when running large programs,
which is often the case for Statistics research. A more recent alternative to
the R CMD BATCH command is Rscript. Considerable additional flexibility is
available for advanced users, the help pages contain relevant details.
One can work with several kinds of variables in R. Each single, or scalar,
variable can be of type numeric, double, integer, logical or character.
A collection of such scalars can be gathered in a vector, or a matrix. For
example, the command
> x = vector(length = 5)
> print(x)
the output is
[1] FALSE FALSE FALSE FALSE FALSE
This shows that a vector of length 5 has been created and assigned the address
x. The “*” operation between two vectors of equal length produces another
vector of same length, whose elements are the coordinatewise product. Thus,
> z = x*y
> print(z)
[1] 1 38 -96 32
The functions InnerProduct and Norm above are built in functions in the
package UStatBookABSC for users, with some additional details to handle the
case where the vectors x and y have missing values.
Figure 5.1: Histogram of rainfall amounts on rainy days in Kolkata during the monsoon season of 2012 (x-axis: Rainfall, y-axis: Frequency).
5.3.1 A dataset
For illustration purposes, we shall use the data on precipitation in Kolkata,
India in 2012, during the months June to September, which corresponds ap-
proximately to the monsoon season. It consists of fifty-one rows and four
columns, where the columns are on the date of precipitation, the precipita-
tion amount in millimeters, the maximum and the minimum temperature for
that day in degree Celcius. Days in which the precipitation amount was below
0.099 millimeters are not included. Suppose for the moment this dataset is
available as a comma separated value (csv) file in our computer in the direc-
tory /Users/ABSC/Programs/UStatBook under the filename Kolkata12.csv.
We can insert it in our R session as a data.frame called Kol Precip as follows
> setwd("/Users/ABSC/Programs/UStatBook")
> Kol_Precip = read.csv(file = "Kolkata12.csv")
We urge the readers to experiment with different kinds of data files, and
read the documentation corresponding to ?read.table. For example, readers
may consider reading in other kinds of text files where the values are not
separated by commas, where each row is not necessarily complete, and where
missing values are depicted in various ways, and where the data resides in a
Figure 5.2: Density plot of rainfall amounts on rainy days in Kolkata during the monsoon season of 2012 (N = 51, bandwidth = 3.671).
> library(UStatBookABSC)
>data(CCU12_Precip)
>ls(CCU12_Precip)
>?CCU12_Precip
The last command above brings up the help page for the dataset, which
also contains an executable example.
Suppose after accessing this data inside R, we wish to save a copy of it as a
comma separated value text file under the name filename Kolk12Precip.csv.
This is easily done inside R as follows
Figure: Pairwise scatter plots of the variables Precip, TMax and TMin.
> library(UStatBookABSC)
>data(CCU12_Precip)
> write.csv(CCU12_Precip, file = "Kolk12Precip.csv",
row.names = FALSE)
As in the case of reading of files, writing of files can also be done in various
formats.
library(UStatBookABSC)
Rainfall = CCU12_Precip$Precip;
print("Average rainfall on a rainy day
in Kolkata in 2012 monsoon is:")
print(mean(Rainfall), digits = 3);
print(paste("Variance of rainfall is", var(Rainfall)));
print(paste("and the standard deviation is",
sd(Rainfall), "while"));
print(paste("the median rainfall is",
median(Rainfall), "and"));
print("here is some summary statistics");
print(summary(Rainfall), digits = 2);
[1] "and the standard deviation is 11.0453304835209 while"
> print(paste("the median rainfall is",
median(Rainfall), "and"));
[1] "the median rainfall is 7.9 and"
> print("here is some summary statistics");
[1] "here is some summary statistics"
> print(summary(Rainfall), digits = 2);
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3 2.0 7.9 10.7 14.0 42.9
hist(Rainfall);
dev.new();
plot(density(Rainfall), xlim = c(0,45),
main = "Density of Kolkata-2012 Rainfall");
and obtain the histogram and the density plot of the Rainfall data presented
respectively in Figures 5.1 and 5.2. The command dev.new() requires R to
put the second plot in a separate window. Note the additional details
supplied to the plot command for the density plot, which controls the limit
of the x-axis of the plotting region, and places the title on top of the plot.
There are many more parameters that can be set to make the graphical output
from R pretty, and readers should explore those. In fact, one great advantage
MaxTemp = CCU12_Precip$TMax;
MinTemp = CCU12_Precip$TMin;
print("The covariance of the max and min temperature is ");
print(cov(MaxTemp, MinTemp));
we find that the covariance between the maximum and minimum tempera-
ture on a rainy day is 0.58. We might want to test if the difference between
the maximum and minimum temperature on those days is, say, 20 degrees
Celsius, and one way of conducting such a test is by using the t.test as
follows:
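A call of the following form (a sketch; the null value 20 is passed through the mu argument) performs such a two-sample test, and its output ends with the sample estimates shown below.

> t.test(MaxTemp, MinTemp, mu = 20)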
mean of x mean of y
33.55686 25.93529
Note that we do not recommend the above two-sample test for the present
data: the maximum and minimum temperature for a given day are very likely
related, and we have not verified that assumptions compatible with a two-
sample t.test hold. The above computation is merely for displaying the
syntax of how to conduct a two-sample test in R.
Let us now conduct a paired t-test, perhaps with the alternative hypothesis
that the true difference is less than 10 degrees Celsius, keeping in mind that
Kolkata is in a tropical climate region. Additionally, suppose we want a 99%
one-sided confidence interval. This is implemented as follows:
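A call of the following form (a sketch) carries out the paired, one-sided test with a 99% confidence level.

> t.test(MaxTemp, MinTemp, paired = TRUE, mu = 10, alternative = "less", conf.level = 0.99)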
Additionally, we may decide not to rely on the t.test only, and conduct
a signed-rank test.
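The corresponding signed-rank test can be obtained with wilcox.test; a sketch of such a call is

> wilcox.test(MaxTemp, MinTemp, paired = TRUE, mu = 10, alternative = "less")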
I.Product = function(x, y){
T = sum(x * y);
return(T)
}
The first line says that I.Product is the name of the function, and that
its arguments are x, y. The second line computes the function, and the third
line returns to the main program the computed value of the function. Here
is how this program may be used:
A = c( 1, 2, 3);
BVec = c(0, 2, -1);
I = I.Product(A, BVec);
print(I)
[1] 1
Note that we deliberately used the vectors A and BVec as arguments, the
names of the arguments do not need to match how the function is written.
Also, once the function I.Product is in the system, it can be used repeat-
edly. This may not seem a big deal for a simple function like I.Product,
but in reality many functions are much more complex and elaborate, and
their codification as standalone functions helps programming greatly. Also,
even simple functions typically require checks and conditions to prevent er-
rors and use with incompatible arguments. See, for example, the code for
function InnerProduct in the UStatBookABSC package, which does the same
computation as I.Product, but has checks in place to ensure that both the
vectors used as arguments are numeric, they have the same length, and has
additional methodological steps to handle missing values in either argument.
>Data.CCU = CCU12_Precip[,-1];
>M.Oja = OjaMedian(Data.CCU);
>print(M.Oja)
Precip TMax TMin
9.36444 33.65908 26.16204
Notice that the Oja-median and the L1 -median have similar, but not iden-
tical, estimates. In general, computation of Oja-median is more involved since
determinants of several matrices have to be computed for each optimization
step.
Figure 5.4: Scatter plot of slope parameters from the multivariate response L1-regression fitting, with Beta.Precip on the x-axis and Beta.TMax on the y-axis. The filled-in larger circle is the L1-median.
the first element of each covariate vector being 1 for the intercept term, and
the second element being the TMin observation. Thus, the parameter for
the present problem is the 2 × 2 matrix B. Note that the elements of the
response variable (Precip, TMax) are potentially physically related to each
other by the Clausius-Clapeyron relation (see Dietz and Chatterjee (2014)),
and the data analysis of this section is motivated by the need to understand
the nature of this relationship conditional on the covariate TMin. We use the
function L1Regression in the package UStatBookABSC to obtain the estima-
tor B̂ minimizing
Ψn(B) = Σ_{i=1}^n |Yi − B^T xi|,
where recall that |a| = [Σ_k ak²]^{1/2} is the Euclidean norm of a.
The following code implements the above task:
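(A sketch of such a call is given below; the argument order and names assumed for L1Regression may differ from the actual ones in UStatBookABSC, so consult ?L1Regression before use.)

library(UStatBookABSC)
data(CCU12_Precip)
DataY = cbind(CCU12_Precip$Precip, CCU12_Precip$TMax)   # bivariate response
DataX = cbind(1, CCU12_Precip$TMin)                     # intercept and TMin
L1Fit = L1Regression(DataY, DataX)                      # assumed calling convention
print(L1Fit)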
$Convergence
[1] 0.0005873908
$BetaHat
[,1] [,2]
[1,] 164.803552 16.5318273
[2,] -5.866605 0.6498325
The algorithm is iterative: at each iteration step j, we compute the relative change in norm |B̂j+1 − B̂j|/|B̂j|, where |B|
is the Euclidean norm of the vectorized version of B, i.e., when the columns
of B are stacked one below another to form a pd-length vector. We declare
convergence of the algorithm if this relative change in norm is less than ε,
and we use ε = 0.001 here.
The above display shows that the function L1Regression has a list as the
output. The first item states that convergence occurred at the 4-th iteration
step, and that the relative change in norm at the last step was 0.0005, and
then the final B̂ value is shown. Thus we have
B̂ = [ 164.803552   16.5318273
       −5.866605    0.6498325 ].
We also use the function WLS from the package UStatBookABSC to obtain
a least squares estimator B̃ of B, for which the results are displayed below:
Thus we have
B̃ = [ 145.412968   16.0526220
       −5.193363    0.6749197 ].
> B = 500;
> Probabilities = rep(1, nrow(DataY))
In the above, we set the resampling Monte Carlo size at B = 500. The next
two steps generates the multinomial weights for which we require the package
MASS, which has been called inside UStatBookABSC anyway. The next steps
implement a loop where L1Regression is repeatedly implemented with the
resampling weights, the entire results stored in the list L1Regression.Boot
and further, just the relevant slope parameters stored also in the matrix
T2Effect.
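A sketch of what such a loop might look like is given below; the way the resampling weights are passed to L1Regression (the Weights argument) is an assumption on our part and may differ in the package.

T2Effect = matrix(0, nrow = B, ncol = 2)
L1Regression.Boot = list()
for (b in 1:B){
  W = as.vector(rmultinom(1, nrow(DataY), Probabilities))   # multinomial resampling weights
  Fit.b = L1Regression(DataY, DataX, Weights = W)           # assumed argument name
  L1Regression.Boot[[b]] = Fit.b
  T2Effect[b, ] = Fit.b$BetaHat[2, ]                        # second (slope) row of BetaHat
}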
We now present a graphical display of the above results with the code
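A sketch consistent with the description that follows is given below; L1Fit, holding the original fit, is the (assumed) object name used in the earlier sketch.

plot(T2Effect[, 1], T2Effect[, 2], pch = "*",
     xlab = "Beta.Precip", ylab = "Beta.TMax")
points(L1Fit$BetaHat[2, 1], L1Fit$BetaHat[2, 2], col = 2, pch = 19, cex = 2)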
The above code is for plotting the resampling estimates of B̂[2, ], stored
in the matrix T2Effect. On this scatter plot, we overlay the original B̂[2, ],
as a big filled-in circle. The commands col =2 tells R to use red color for
the point on the computer screen and in color print outs, pch = 19 tells it
to depict the point with a filled-in red circle, and cex = 2 tells it to increase
the size of the point. The output from the above code is given in Figure 5.4.
5.5 Exercises
1. Write an R function that can implement the Bayesian bootstrap on the
L1-regression example for the Kolkata precipitation data.
Athreya, K. B., Ghosh, M., Low, L. Y., and Sen, P. K. (1984). Laws of large
numbers for bootstrapped U -statistics. Journal of Statistical Planning and
Inference, 9(2):185 – 194.
Dehling, H., Denker, M., and Philipp, W. (1986). A bounded law of the
iterated logarithm for Hilbert space valued martingales and its application
to U -statistics. Probability Theory and Related Fields, 72(1):111 – 131.
Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans.
SIAM, Philadelphia, USA.
Esseen, C.-G. (1942). On the Liapounoff limit of error in the theory of prob-
ability. Arkiv för Matematik, Astronomi och Fysik, 28A(2):1 – 19.
Ghosh, M., Parr, W. C., Singh, K., and Babu, G. J. (1984). A note on
bootstrapping the sample median. The Annals of Statistics, 12(3):1130–
1135.
Gregory, G. G. (1977). Large sample theory for U -statistics and tests of fit.
The Annals of Statistics, 5(1):110–123.
Hall, P. and Martin, M. A. (1991). On the error incurred using the boot-
strap variance estimate when constructing confidence intervals for quan-
tiles. Journal of Multivariate Analysis, 38(1):70 – 81.
Hoeffding, W. (1961). The strong law of large numbers for U -statistics. Insti-
tute of Statistics mimeo series 302, University of North Carolina, Chapel
Hill, USA.
Hubback, J. A. (1946). Sampling for rice yield in Bihar and Orissa. Sankhyā,
pages 281 – 294. First published in 1927 as Bulletin 166, Imperial Agricul-
tural Research Institute, Pusa, India.
Maritz, J. S., Wu, M., and Staudte, R. G. (1977). A location estimator based
on a U -statistic. The Annals of Statistics, 5(4):779 – 786.
Tukey, J. W. (1958). Bias and confidence in not quite large samples (abstract).
The Annals of Mathematical Statistics, 29(2):614–614.
Wagner, T. J. (1969). On the rate of convergence for the law of large numbers.
The Annals of Mathematical Statistics, 40(6):2195 – 2197.
Abramovitch, L. 91
Arcones, M. A. 66, 67, 106
Athreya, K. B. 105
Denker, M. 9
Dietz, L. 146
Durrett, R. 14
Dynkin, E. B. 29

L-statistics, 32
L1-median, 38, 143
L1-median, CLT, 52
L1-median, exponential rate, 58
L1-median, strong representation, 64
L1-median, sub-gradient, 41
M-estimator, 35
M-estimator, asymptotic normality, 103
M-estimator, generalized bootstrap, 104
M-estimator, sample mean, 36
M1-estimate, non-i.i.d., 76
M2-estimator, 37
M2-estimator, sample variance, 36
Mm-estimator, 35, 36
Mm-estimator, U-statistics, 36
Mm-estimator, CLT, 45
Mm-estimator, convexity, 39
Mm-estimator, last passage time, 56
Mm-estimator, rate of convergence, 55
Mm-estimator, strong consistency, 43
Mm-estimator, strong representation, 58
Mm-estimator, weak representation, 45
Mm-parameter, 36
U-median, 38
U-quantile, 38
U-quantiles, CLT, 51
U-quantiles, exponential rate, 58
U-quantiles, multivariate, 38
U-quantiles, strong representation, 62
U-statistics, Mm-estimator, 35, 36
U-statistics, χ2 limit theorem, 20
U-statistics, asymptotic normality, 103
U-statistics, bootstrap, 105
U-statistics, central limit theorem, 9
U-statistics, degenerate, 10, 14
U-statistics, degree, order, 2
U-statistics, deviation result, 15
U-statistics, first projection, 8
U-statistics, generalized bootstrap, 104
U-statistics, kernel, 2
U-statistics, linear combination, 3
U-statistics, multivariate, 9, 32