Arup Bose · Snigdhansu Chatterjee
U-Statistics, Mm-Estimators and Resampling
Texts and Readings in Mathematics
Volume 75
Advisory Editor
C. S. Seshadri, Chennai Mathematical Institute, Chennai
Managing Editor
Rajendra Bhatia, Indian Statistical Institute, New Delhi
Editors
Manindra Agrawal, Indian Institute of Technology, Kanpur
V. Balaji, Chennai Mathematical Institute, Chennai
R. B. Bapat, Indian Statistical Institute, New Delhi
V. S. Borkar, Indian Institute of Technology, Mumbai
T. R. Ramadas, Chennai Mathematical Institute, Chennai
V. Srinivas, Tata Institute of Fundamental Research, Mumbai
Technical Editor
P. Vanchinathan, Vellore Institute of Technology, Chennai
The Texts and Readings in Mathematics series publishes high-quality textbooks,
research-level monographs, lecture notes and contributed volumes. Undergraduate
and graduate students of mathematics, research scholars, and teachers would find
this book series useful. The volumes are carefully written as teaching aids and
highlight characteristic features of the theory. The books in this series are
co-published with Hindustan Book Agency, New Delhi, India.
U-Statistics, Mm-Estimators and Resampling
Arup Bose, Statistics and Mathematics Unit, Indian Statistical Institute, Kolkata, West Bengal, India
Snigdhansu Chatterjee, School of Statistics, University of Minnesota, Minneapolis, MN, USA
This work is a co-publication with Hindustan Book Agency, New Delhi, licensed for sale in all countries
in electronic form, in print form only outside of India. Sold and distributed in print within India by
Hindustan Book Agency, P-19 Green Park Extension, New Delhi 110016, India. ISBN: 978-93-86279-71-2
© Hindustan Book Agency 2018.
© Springer Nature Singapore Pte Ltd. 2018 and Hindustan Book Agency 2018
This work is subject to copyright. All rights are reserved by the Publishers, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publishers, the authors, and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publishers nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Contents
Preface  xi
1 Introduction to U-statistics  1
1.1 Definition and examples  1
1.2 Some finite sample properties  6
1.2.1 Variance  6
1.2.2 First projection  7
1.3 Law of large numbers and asymptotic normality  8
1.4 Rate of convergence  12
1.5 Degenerate U-statistics  17
1.6 Exercises  30
3 Introduction to resampling  69
3.1 Introduction  69
3.2 Three standard examples  71
3.3 Resampling methods: the jackknife and the bootstrap  77
5 An Introduction to R  127
5.1 Introduction, installation, basics  127
5.1.1 Conventions and rules  130
5.2 The first steps of R programming  131
5.3 Initial steps of data analysis  133
5.3.1 A dataset  134
5.3.2 Exploring the data  137
5.3.3 Writing functions  142
5.3.4 Computing multivariate medians  143
5.4 Multivariate median regression  145
5.5 Exercises  149
Bibliography  150
To Baishali
S. C.
Preface
This small book covers three important topics that we believe every statistics
student should be familiar with: U -statistics, Mm -estimates and Resampling.
The final chapter is a quick and short introduction to the statistical software
R, primarily geared towards implementing resampling. We hasten to add that
the book is introductory. However, adequate references are provided for the
reader to explore further.
Any U-statistic (with finite variance) is the non-parametric minimum variance unbiased estimator of its expectation. Many common statistics and estimators are either U-statistics or are approximately so. The systematic study of U-statistics began with Hoeffding (1948), and comprehensive treatments of U-statistics are available in many places, including Lee (1990) and Korolyuk and Borovskich (1993).
In Chapter 1 we cover the very basics of U-statistics. We begin with their definition and examples. The exact finite sample distribution and other properties of U-statistics can seldom be calculated. We cover some asymptotic properties of U-statistics such as the central limit theorem, the weak and strong laws of large numbers, the law of iterated logarithm, a deviation result and a distributional limit theorem for first order degenerate U-statistics. As direct applications of these, we establish the asymptotic normality of many common estimators and the sum of weighted chi-square limit for the Cramér-von Mises statistic. Other applications are provided in Chapter 2. In particular, the idea of linearization, or the so-called weak representation of a U-statistic, carries over to the next chapters.
Chapter 2 is on M -estimators and their general versions Mm -estimators,
introduced by Huber (1964) out of robustness considerations. Asymptotic
properties of these estimates have been treated under different sets of con-
ditions in several books and innumerable research articles. Establishing the
most general results for these estimators requires sophisticated treatment us-
ing techniques from the theory of empirical processes. We strive for a simple
approach.
We impose a few very simple conditions on the model. Primary among these is a convexity condition, which is still general enough to be widely applicable. Under these conditions, a huge class of Mm-estimators are approximate U-statistics. Hence the theory developed in Chapter 1 can be profitably used to derive asymptotic properties, such as the central limit theorem for Mm-estimators, by linearizing them. We present several examples to show how the general results can be applied to specific estimators. In particular, several multivariate estimates of location are discussed in detail.
The linearization in Chapters 1 and 2 was achieved by expending consid-
erable technical effort. There still remain two noteworthy issues. First, such
a linearization may not be easily available for many estimates. Second, even
if an asymptotic normality result is established, it may not be easy to find or
estimate the asymptotic variance.
Since the inception of the bootstrap method in the early eighties, resampling has become an alternative to asymptotic distributional results. It is now a necessary item in the everyday toolkit of a statistician. It attempts to replace analytic derivations with the force of computations. Again, there are several excellent monographs and books on this topic, in addition to the surfeit of articles on both the theory and applications of resampling.
In Chapter 3, we introduce the main ideas of resampling in an easy way by using three benchmark examples: the sample mean, the sample median and the ordinary least squares estimates of regression parameters. In particular, we explain when and how resampling can produce “better” estimates than those from traditional asymptotic normal approximations. We also present a short overview of the most common methods of resampling.
Chapter 4 focuses on resampling estimates for the sampling distribution and asymptotic variance of U-statistics and Mm-estimators. In particular, we discuss the traditional Efron (multinomial) bootstrap and its drawbacks in the context of U-statistics. We discuss how the generalized bootstrap arises naturally in this context. We establish a bootstrap linearization result for U-statistics. The generalized bootstraps with additive and multiplicative weights are given special attention, the first due to the computational savings obtained and the second due to its connection with Efron's bootstrap. Finally, we also establish a weighted U-statistics result for the generalized bootstrap of Mm-estimators.
Arup Bose is Professor at the Statistics and Mathematics Unit, Indian Sta-
tistical Institute, Kolkata, India. He is a Fellow of the Institute of Mathemati-
cal Statistics and of all the three national science academies of India. He has
significant research contributions in the areas of statistics, probability, eco-
nomics and econometrics. He is a recipient of the Shanti Swarup Bhatnagar
Prize and the C R Rao National Award in Statistics. His current research in-
terests are in large dimensional random matrices, free probability, high dimen-
sional data, and resampling. He has authored three books: Patterned Random Matrices; Large Covariance and Autocovariance Matrices (with Monika Bhattacharjee); and Random Circulant Matrices (with Koushik Saha), published by Chapman & Hall.
Chapter 1
Introduction to U-statistics
We shall often write $U_n$ for $U_n(h)$ when the function $h$ is clear from the context. An appropriate extension is available when $h$ is vector valued.
Note that a $U$-statistic of degree $m$ is also a $U$-statistic of degree $(m+1)$. As a consequence, the sum of two $U$-statistics is again a $U$-statistic. In this book, we consider the order $m$ to be the smallest integer for which the above definition holds.
\[
U_n = n^{-1}\sum_{i=1}^{n} Y_i = \bar{Y}. \tag{1.2}
\]
\[
U_n = (n-1)^{-1}\sum_{i=1}^{n}\big(Y_i - \bar{Y}\big)^2 = s_n^2. \tag{1.4}
\]
\[
h\big((x_1, y_1), (x_2, y_2)\big) = \frac{1}{2}(x_1 - x_2)(y_1 - y_2).
\]
\[
U_n = (n-1)^{-1}\sum_{i=1}^{n}\big(X_i - \bar{X}\big)\big(Y_i - \bar{Y}\big). \tag{1.5}
\]
This is a $U$-statistic with $h\big((x_1, y_1), (x_2, y_2)\big) = \mathrm{sign}\big((x_1 - x_2)(y_1 - y_2)\big)$.
Example 1.5 (Gini's mean difference): A measure of income inequality is Gini's mean difference, given by
\[
U_n = \binom{n}{2}^{-1}\sum_{1\le i<j\le n} |Y_i - Y_j|. \tag{1.7}
\]
\[
E(U_n) = \frac{2}{\pi^{1/2}}\,\sigma. \tag{1.8}
\]
\[
T^{+} = \sum_{i=1}^{n} R_i\, I_{\{Y_i > 0\}}. \tag{1.9}
\]
$T^{+}$ is the sum of the ranks of all the positive observations. It can be written as a linear combination of two $U$-statistics with kernels of sizes 1 and 2. To see this, note that
\[
I_{\{Y_i + Y_j > 0\}} = I_{\{Y_i > 0\}}I_{\{|Y_j| < Y_i\}} + I_{\{Y_j > 0\}}I_{\{|Y_i| < Y_j\}}, \tag{1.10}
\]
with kernels $f(x_1, x_2) = I_{\{x_1 + x_2 > 0\}}$ and $g(x_1) = I_{\{x_1 > 0\}}$.
Example 1.7: A correlation coefficient different from the usual product moment correlation was introduced and studied in detail by Bergsma (2006). We need some preparation to define it. First suppose that $Z$, $Z_1$ and $Z_2$ are i.i.d. real valued random variables with the distribution function $F$. Let
\[
h_F(z_1, z_2) = -\frac{1}{2}\,E\big(|z_1 - z_2| - |z_1 - Z_2| - |Z_1 - z_2| + |Z_1 - Z_2|\big). \tag{1.12}
\]
\[
\rho^{*}(X, Y) = \frac{\kappa(X, Y)}{\sqrt{\kappa(X, X)\,\kappa(Y, Y)}}.
\]
\[
A_{1k} = \frac{1}{n}\sum_{i=1}^{n} |X_k - X_i|, \quad A_{2k} = \frac{1}{n}\sum_{i=1}^{n} |Y_k - Y_i|, \quad\text{and}
\]
\[
B_1 = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} |X_i - X_j|, \quad B_2 = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} |Y_i - Y_j|.
\]
For k, l = 1, . . . , n, let
\[
h_{\hat{F}_X}(X_k, X_l) = -\frac{1}{2}\Big(|X_k - X_l| - \frac{n}{n-1}A_{1k} - \frac{n}{n-1}A_{1l} + \frac{n}{n-1}B_1\Big),
\]
\[
h_{\hat{F}_Y}(Y_k, Y_l) = -\frac{1}{2}\Big(|Y_k - Y_l| - \frac{n}{n-1}A_{2k} - \frac{n}{n-1}A_{2l} + \frac{n}{n-1}B_2\Big).
\]
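The quantities above are just sample means of absolute differences, so they are easy to compute directly. The following R sketch evaluates $A_{1k}$, $B_1$ and the matrix of kernel values $h_{\hat{F}_X}(X_k, X_l)$ for a numeric vector; the function name and the example data are illustrative only, and this is not code from the book's package.

  # Illustrative sketch: empirical Bergsma-type kernel matrix for a sample x
  h.Fhat <- function(x) {
    n  <- length(x)
    D  <- abs(outer(x, x, "-"))   # matrix of |X_k - X_l|
    A  <- rowMeans(D)             # A_k = (1/n) sum_i |X_k - X_i|
    B  <- mean(D)                 # B = (1/n^2) sum_{i,j} |X_i - X_j|
    cn <- n / (n - 1)
    -0.5 * (D - cn * outer(A, rep(1, n)) - cn * outer(rep(1, n), A) + cn * B)
  }

  set.seed(1)
  x <- rnorm(10)
  H <- h.Fhat(x)   # n x n matrix of kernel values h_{Fhat_X}(X_k, X_l)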
1.2.1 Variance
Assume that $\mathbb{V}\big(h(Y_1, \ldots, Y_m)\big) < \infty$, where $\mathbb{V}$ denotes variance. To compute $\mathbb{V}(U_n)$, we need to compute
\[
\mathrm{COV}\big(h(Y_{j_1}, \ldots, Y_{j_m}),\, h(Y_{i_1}, \ldots, Y_{i_m})\big).
\]
This can be seen as follows. First, we can choose the $m$ indices $\{i_1, \ldots, i_m\}$ from $\{1, \ldots, n\}$ in $\binom{n}{m}$ ways. Then choose $c$ of those which are to be common with $\{j_1, \ldots, j_m\}$ in $\binom{m}{c}$ ways. Now choose the rest of the $(m-c)$ indices of $\{j_1, \ldots, j_m\}$ from the $(n-m)$ remaining indices in $\binom{n-m}{m-c}$ ways.
Hence
\begin{align}
\mathbb{V}(U_n) &= \binom{n}{m}^{-2}\sum_{1\le i_1<\cdots<i_m\le n}\ \sum_{1\le j_1<\cdots<j_m\le n} \mathrm{COV}\big(h(Y_{i_1}, \ldots, Y_{i_m}),\, h(Y_{j_1}, \ldots, Y_{j_m})\big) \tag{1.15}\\
&= \binom{n}{m}^{-2}\sum_{c=1}^{m}\binom{n}{m}\binom{m}{c}\binom{n-m}{m-c}\,\delta_c
 = \binom{n}{m}^{-1}\sum_{c=1}^{m}\binom{m}{c}\binom{n-m}{m-c}\,\delta_c. \tag{1.16}
\end{align}
As a consequence, as $n \to \infty$,
\[
\mathbb{V}(U_n) = \frac{m^2\delta_1}{n} + O(n^{-2}) \quad\text{and} \tag{1.17}
\]
\[
\mathbb{V}\big(n^{1/2}(U_n - \theta)\big) \to m^2\delta_1. \tag{1.18}
\]
Example 1.8: Let $h(x_1, x_2) = \frac{(x_1 - x_2)^2}{2}$. Then
\[
U_n = s_n^2 = (n-1)^{-1}\sum_{i=1}^{n}\big(Y_i - \bar{Y}_n\big)^2.
\]
\[
\delta_1 = \frac{\mu_4 - \sigma^4}{4}, \tag{1.19}
\]
\[
\delta_2 = \frac{\mu_4 + \sigma^4}{2}, \quad\text{and}
\]
\[
\mathbb{V}(U_n) = \frac{4(n-2)}{n(n-1)}\,\delta_1 + \frac{2}{n(n-1)}\,\delta_2. \tag{1.20}
\]
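As a numerical illustration of (1.19)–(1.20), the following R sketch compares the exact variance formula for $U_n = s_n^2$ with a Monte Carlo estimate for exponential data; the sample size, number of replications and all object names are illustrative choices.

  set.seed(123)
  n <- 20; reps <- 20000
  # For Exp(1): sigma^2 = 1 and the central fourth moment mu4 = 9
  mu4 <- 9; sig4 <- 1
  delta1 <- (mu4 - sig4) / 4
  delta2 <- (mu4 + sig4) / 2
  v.exact <- 4 * (n - 2) / (n * (n - 1)) * delta1 + 2 / (n * (n - 1)) * delta2
  v.mc <- var(replicate(reps, var(rexp(n))))   # var() computes the U-statistic s_n^2
  c(exact = v.exact, monte.carlo = v.mc)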
so that
\[
E\tilde{h}_1(Y_1) = E\,h(Y_1, \ldots, Y_m) - \theta = 0.
\]
Let
\[
R_n = U_n - \theta - \frac{m}{n}\sum_{i=1}^{n}\tilde{h}_1(Y_i). \tag{1.22}
\]
\[
\mathbb{V}(U_n) = \frac{m^2}{n}\,\delta_1 + \mathbb{V}(R_n) = \frac{m^2}{n}\,\delta_1 + O(n^{-2}). \tag{1.26}
\]
\[
\mathbb{V}\big(n^{1/2}R_n\big) \to 0 \quad\text{and hence}\quad n^{1/2}R_n \xrightarrow{P} 0. \tag{1.27}
\]
(a)
\[
U_n - \theta = \frac{m}{n}\sum_{i=1}^{n}\tilde{h}_1(Y_i) + R_n \quad\text{where } n^{1/2}R_n \xrightarrow{P} 0. \tag{1.28}
\]
(b)
\[
n^{1/2}(U_n - \theta) \xrightarrow{D} N(0, m^2\delta_1) \quad\text{where } \delta_1 = \mathbb{V}\big(\tilde{h}_1(Y_1)\big).
\]
\[
U_n - \theta \xrightarrow{P} 0 \quad\text{as } n \to \infty.
\]
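To see the central limit theorem of part (b) in action, the following R sketch simulates Gini's mean difference (1.7) for standard normal data and compares the variance of the standardized statistic with the limit $m^2\delta_1$; the constant $\delta_1$ is evaluated by simulating the closed-form first projection $h_1(y) = E|y - Y_2|$ for $N(0,1)$, and all names and sizes are illustrative.

  set.seed(7)
  gini <- function(y) mean(dist(y))       # average of |Y_i - Y_j| over pairs i < j
  n <- 200; reps <- 2000
  theta <- 2 / sqrt(pi)                   # E|Y1 - Y2| for N(0,1) data
  z <- replicate(reps, sqrt(n) * (gini(rnorm(n)) - theta))
  h1 <- function(y) 2 * dnorm(y) + y * (2 * pnorm(y) - 1)   # E|y - Y2| for N(0,1)
  delta1 <- var(h1(rnorm(1e5)))
  c(simulated.var = var(z), limit.var = 4 * delta1)          # m = 2, limit m^2 * delta1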
Rate results in the weak law, when stronger moment conditions are as-
sumed, are given in Section 1.4. A much stronger result than the weak law
is actually true for U -statistics and we state it below:
Theorem 1.3 (Hoeffding (1961)). (Strong law of large numbers (SLLN) for $U$-statistics.) If $E|h(Y_1, \ldots, Y_m)| < \infty$, then
\[
U_n - \theta \xrightarrow{a.s.} 0 \quad\text{as } n \to \infty.
\]
This result can be proved by using SLLN for either reverse martingales
Berk (1966) or forward martingales Hoeffding (1961). See Lee (1990) for a
detailed proof.
The next result we present is the law of iterated logarithm (LIL) for $U$-statistics. See Lee (1990) for a proof of this result. This result will be used later, in Chapter 2.
Theorem 1.4 (Dehling et al. (1986)). (Law of iterated logarithm (LIL) for $U$-statistics.) Suppose $U_n$ is a $U$-statistic with kernel $h$ such that $\delta_1 > 0$ and $E|h(Y_1, \ldots, Y_m)|^2 < \infty$. Then as $n \to \infty$,
\[
\limsup_{n} \frac{n^{1/2}(U_n - \theta)}{\big(2m^2\delta_1\log\log n\big)^{1/2}} = 1 \quad\text{almost surely.}
\]
Example 1.9: Consider the $U$-statistic $s_n^2$. In Example 1.8, we have calculated
\[
\delta_1 = \frac{\mu_4 - \sigma^4}{4},
\]
where $\mu_4 = E\big(Y - EY\big)^4$. Thus if $\mu_4 < \infty$,
\[
n^{1/2}\big(s_n^2 - \sigma^2\big) \xrightarrow{D} N\big(0, \mu_4 - \sigma^4\big).
\]
\begin{align*}
\delta_1 &= \mathrm{COV}\big(Y_1Y_2,\, Y_1\tilde{Y}_2\big)
 = E\,Y_1^2Y_2\tilde{Y}_2 - E\,Y_1Y_2\,E\,Y_1\tilde{Y}_2\\
 &= \mu^2\,EY_1^2 - \mu^4 = \mu^2\,\mathbb{V}(Y_1) = \mu^2\sigma^2.
\end{align*}
continuous. Then
\begin{align}
h_1(x, y) &= E\,h\big((x, y), (X_2, Y_2)\big) \tag{1.29}\\
&= P\big((x - X_2)(y - Y_2) > 0\big) - P\big((x - X_2)(y - Y_2) < 0\big)\\
&= P\big((X_2 > x, Y_2 > y)\ \text{or}\ (X_2 < x, Y_2 < y)\big) - P\big((X_2 > x, Y_2 < y)\ \text{or}\ (X_2 < x, Y_2 > y)\big)\\
&= 1 - 2F(x, \infty) - 2F(\infty, y) + 4F(x, y)\\
&= \big(1 - 2F_1(x)\big)\big(1 - 2F_2(y)\big) + 4\big(F(x, y) - F_1(x)F_2(y)\big). \tag{1.30}
\end{align}
Hence under independence, $n^{1/2}U_n \xrightarrow{D} N(0, 4/9)$.
Example 1.12: Wilcoxon's statistic, defined in Example 1.6, is used for testing the null hypothesis that the distribution $F$ of $Y_1$ is continuous and symmetric about 0. Recall the expression for $T^{+}$ in (1.11). Note that under the null hypothesis $E\,U_n(f) = P(Y_1 + Y_2 > 0) = 1/2$.
\[
n^{-3/2}\big(nU_n(g)\big) \xrightarrow{P} 0. \tag{1.36}
\]
Further,
\[
\mathbb{V}\big(\tilde{f}_1\big) = \mathrm{COV}\big(I_{\{Y_1+Y_2>0\}},\, I_{\{Y_1+\tilde{Y}_2>0\}}\big)
 = P\big(Y_1 + Y_2 > 0,\ Y_1 + \tilde{Y}_2 > 0\big) - (1/2)^2.
\]
As a consequence,
\[
\mathbb{V}\big(\tilde{f}_1\big) = \frac{1}{3} - \frac{1}{4} = \frac{1}{12} = \delta_1. \tag{1.38}
\]
\[
n^{1/2}\big(U_n(f) - 1/2\big) \xrightarrow{D} N(0, 1/3).
\]
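The limit in the last display is easy to check numerically. The R sketch below computes $U_n(f)$, the proportion of pairs with $Y_i + Y_j > 0$, for data symmetric about zero, and compares the variance of the standardized statistic with the $N(0, 1/3)$ limit; all names and sizes are illustrative.

  set.seed(42)
  Unf <- function(y) {              # kernel f(y_i, y_j) = I{y_i + y_j > 0}, i < j
    s <- outer(y, y, "+") > 0
    mean(s[upper.tri(s)])
  }
  n <- 100; reps <- 5000
  z <- replicate(reps, sqrt(n) * (Unf(rnorm(n)) - 1/2))
  c(simulated.var = var(z), limit.var = 1/3)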
(b) If $\psi(s) = E\exp\big(s|h(Y_1, \ldots, Y_m)|\big) < \infty$ for some $0 < s \le s_0$, then for $k = [n/m]$ and $0 < s \le s_0k$,
\[
E\exp\big(sU_n(h)\big) \le \big(\psi(s/k)\big)^k. \tag{1.40}
\]
(c) (Berk (1970)). Under the same assumption as (b), for every $\epsilon > 0$, there exist constants $0 < \delta < 1$ and $C$ such that
\[
P\Big(\sup_{k\ge n}\big|U_k(h) - \mu\big| > \epsilon\Big) \le C\delta^n. \tag{1.41}
\]
Proof of Theorem 1.5: Now suppose $m > 1$. Recall the weak decomposition
\[
U_n(h) - \theta = \frac{m}{n}\sum_{i=1}^{n}\tilde{h}_1(Y_i) + R_n. \tag{1.43}
\]
Since the result has already been established for m = 1, it is now enough to
prove that
\[
P\Big(\sup_{k\ge n}|R_k| \ge \epsilon\Big) = o(n^{1-r}). \tag{1.44}
\]
Using this in (1.45) verifies (1.44) and proves Theorem 1.5(a) completely.
The detailed proofs of (b) and (c) (without the supremum) can be found
in Serfling (1980) (page 200–202). We just mention that for the special case
of m = 1, part (c) is an immediate consequence of the following lemma whose
proof is available in Durrett (1991), Lemma 9.4, Chapter 1.
Lemma 1.2. Let $Y_1, Y_2, \ldots$ be i.i.d. random variables, and $E\,e^{s|Y_n|} < \infty$ for some $s > 0$. Let $S_n = Y_1 + \cdots + Y_n$, $\mu = EY_n$. Then for each $\varepsilon > 0$ there exists $a > 0$ such that
\[
P\Big(\Big|\frac{S_n}{n} - \mu\Big| > \varepsilon\Big) = O(e^{-an}), \quad\text{as } n \to \infty. \tag{1.47}
\]
Note that $\Psi_n(t)$ is finite for each $t$ since $h_{n1}$ is bounded. Letting $k = [n/m]$, and using Theorem 1.5(b),
\begin{align*}
A_1 = P\big(n^{1/2}U_n(h_{n1}) \ge v_na_n\big) &= P\big(tn^{1/2}U_n(h_{n1})/v_n \ge ta_n\big)\\
&\le \exp(-ta_n)\,\Psi_n\big(n^{1/2}t/v_n\big)\\
&\le \exp(-ta_n)\,\big[E\exp\big(n^{1/2}tY/(v_nk)\big)\big]^k, \text{ say,}
\end{align*}
\[
A_1 \le \exp\Big(-ta_n + \frac{t^2n}{k}\Big). \tag{1.50}
\]
Let $t = K(\log n)^{1/2}/(4(2m-1))$. Then for all large $n$, bounding the exponent in (1.50) yields
\[
P\big(|n^{1/2}U_n(h_{n1})| > Kv_n(\log n)^{1/2}/2\big) \le n^{-K^2/(16(2m-1))}. \tag{1.52}
\]
Note that in the above proof the restrictions we have in place on $m_n$, $K$ and $t$ (with $t = K(\log n)^{1/2}/(4(2m-1)) \le n^{-1/2}kv_n/(2m_n)$) are all compatible. The theorem follows by using (1.52) and (1.53) and the given condition on $v_n$.
Example 1.13: Consider the $U$-statistic $U_n = s_n^2$ from Example 1.9, which has the kernel
\[
h(x_1, x_2) = \frac{(x_1 - x_2)^2}{2}.
\]
Then as calculated earlier, $\delta_1 = \frac{1}{4}\big(\mu_4 - \mu_2^2\big)$ where
\[
\mu_4 = E\big(Y_1 - \mu\big)^4, \quad \mu_2 = E\big(Y_1 - \mu\big)^2. \tag{1.54}
\]
Note that
\[
\mu_4 = \mu_2^2 \iff \big(Y_1 - \mu\big)^2 \text{ is a constant} \iff Y_1 = \mu \pm C \ \ (C \text{ a constant}).
\]
Then $n^{1/2}\big(s_n^2 - \mu_2\big) \xrightarrow{P} 0$.
Example 1.14: Let $f$ be a function such that $Ef(Y_1) = 0$ and $Ef^2(Y_1) < \infty$. Let $U_n$ be the $U$-statistic with kernel $h(x_1, x_2) = f(x_1)f(x_2)$. Then $h_1(x) = E\,f(x)f(Y_2) = 0$, and
\[
n^{1/2}U_n(h) \xrightarrow{D} N(0, 0). \tag{1.55}
\]
\[
U_n(h) = \frac{1}{n-1}\Big(\sum_{i=1}^{n}\frac{f(Y_i)}{\sqrt{n}}\Big)^2 - \frac{1}{n(n-1)}\sum_{i=1}^{n} f^2(Y_i).
\]
\[
nU_n \xrightarrow{D} \sigma^2\big(\chi^2_1 - 1\big) \tag{1.56}
\]
where $\sigma^2 = Ef^2(Y_1)$ and $\chi^2_1$ is a chi-square random variable with one degree of freedom.
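A quick simulation illustrates (1.56). The sketch below uses $f(x) = x$ with standard normal data, so that $\sigma^2 = 1$ and $nU_n$ should behave like $\chi^2_1 - 1$; the sample size, replication count and names are illustrative.

  set.seed(11)
  nUn <- function(y) {              # n * U_n for the kernel h(x1, x2) = x1 * x2
    n <- length(y)
    (sum(y)^2 - sum(y^2)) / (n - 1)
  }
  z <- replicate(5000, nUn(rnorm(200)))
  c(mean = mean(z), var = var(z))   # the limit chi2_1 - 1 has mean 0 and variance 2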
Example 1.15: (continued from Example 1.13). Suppose the $\{Y_i\}$'s are i.i.d., each taking the values $\pm 1$ with probability $1/2$. Then
\[
n^{1/2}\big(s_n^2 - 1\big) \xrightarrow{D} N(0, 0). \tag{1.57}
\]
However, writing $\bar{Y} = n^{-1}\sum_{i=1}^{n} Y_i$ and noting that $Y_i^2 = 1$ for all $i$,
\[
s_n^2 = \binom{n}{2}^{-1}\sum_{1\le i<j\le n}\frac{(Y_i - Y_j)^2}{2}
 = \frac{1}{n-1}\Big(\sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2\Big)
 = \frac{n}{n-1} - \frac{n}{n-1}\,\bar{Y}^2.
\]
Hence as $n \to \infty$,
\[
n\big(s_n^2 - 1\big) = \frac{n}{n-1} - \frac{n}{n-1}\big(\sqrt{n}\,\bar{Y}\big)^2 \xrightarrow{D} 1 - \chi^2_1.
\]
$E\big(f_1(Y_1)f_2(Y_1)\big) = 0$ and $Ef_1^2(Y_1) = Ef_2^2(Y_2) = 1$. That is, $\{f_1, f_2\}$ are orthonormal.
\[
nU_n \xrightarrow{D} a_1(W_1 - 1) + a_2(W_2 - 1) \tag{1.59}
\]
To motivate further the limit result that we will state and prove shortly, let us continue to assume $\delta_1 = 0$. Recalling the formula for variance given in (1.16), now
\[
\mathbb{V}(nU_n) = n^2\binom{n}{m}^{-1}\binom{m}{2}\binom{n-m}{m-2}\,\delta_2 + \text{smaller order terms}
 = \frac{[m(m-1)]^2}{2}\,\delta_2 + o(1).
\]
\[
A_h\phi_j = \lambda_j\phi_j, \quad \forall j, \tag{1.63}
\]
\[
\int \phi_j^2(x)\,dF(x) = 1, \quad \int \phi_j(x)\phi_k(x)\,dF(x) = 0, \quad \forall j \ne k, \quad\text{and} \tag{1.64}
\]
\[
h(x, y) = \sum_{j=1}^{\infty}\lambda_j\phi_j(x)\phi_j(y). \tag{1.65}
\]
The equality in (1.65) is in the $L_2$ sense. That is, if $Y_1, Y_2$ are i.i.d. $F$ then
\[
E\Big[h(Y_1, Y_2) - \sum_{k=1}^{n}\lambda_k\phi_k(Y_1)\phi_k(Y_2)\Big]^2 \to 0 \quad\text{as } n \to \infty. \tag{1.66}
\]
Further,
\[
h_1(x) = E\,h(x, Y_2) = \sum_{k=1}^{\infty}\lambda_k\phi_k(x)\,E\phi_k(Y_2) \quad\text{almost surely } F. \tag{1.67}
\]
\[
E\,h^2(Y_1, Y_2) = \sum_{k=1}^{\infty}\lambda_k^2. \tag{1.68}
\]
Now we are ready to state our first theorem on degenerate $U$-statistics for $m = 2$. The version of this theorem for $m > 2$ is given later in Theorem 1.9.
Theorem 1.7 (Gregory (1977); Serfling (1980)). ($\chi^2$-limit theorem.) Suppose $h(x, y)$ is a kernel such that $E\,h(x, Y_1) = 0$ a.e. $x$ and $E\,h^2(Y_1, Y_2) < \infty$. Then
\[
nU_n \xrightarrow{D} \sum_{k=1}^{\infty}\lambda_k(W_k - 1) \tag{1.69}
\]
where $\{W_k\}$ are i.i.d. $\chi^2_1$ random variables and $\{\lambda_k\}$ are the (non-zero) eigenvalues of the operator $A_h$ given in (1.62).
Proof of Theorem 1.7: The idea of the proof is really as in Example 1.16 after reducing the infinitely many eigenvalues case to the finitely many eigenvalues case. First note that
\[
h_1(x) = \sum_{k=1}^{\infty}\lambda_k\phi_k(x)\,E_F\big(\phi_k(Y_1)\big) \quad\text{a.e. } F. \tag{1.71}
\]
\[
h(x, y) = \sum_{k=1}^{\infty}\lambda_k\phi_k(x)\phi_k(y) \quad\text{in the } L_2(F\times F) \text{ sense.} \tag{1.73}
\]
Now
\[
nU_n = n\binom{n}{2}^{-1}\sum_{1\le i<j\le n} h(Y_i, Y_j) = \frac{1}{n-1}\sum_{1\le i\ne j\le n} h(Y_i, Y_j).
\]
\[
T_n = \frac{1}{n}\sum_{1\le i\ne j\le n} h(Y_i, Y_j)
 = \frac{1}{n}\sum_{1\le i\ne j\le n}\sum_{k=1}^{\infty}\lambda_k\phi_k(Y_i)\phi_k(Y_j). \tag{1.74}
\]
If the sum over $k$ were a finite sum then the rest of the proof would proceed exactly as in Example 1.16. With this in mind, define the truncated sum
\[
T_{nk} = \frac{1}{n}\sum_{1\le i\ne j\le n}\ \sum_{t=1}^{k}\lambda_t\phi_t(Y_i)\phi_t(Y_j), \quad k \ge 1. \tag{1.75}
\]
Lemma 1.3. Suppose for every $k$, $\{T_{nk}\}$ is any sequence of random variables and $\{T_n\}$ is another sequence of random variables, all on the same probability space, such that
(i) $T_{nk} \xrightarrow{D} A_k$ as $n \to \infty$,
(ii) $A_k \xrightarrow{D} A$ as $k \to \infty$,
(iii) $\lim_{k\to\infty}\limsup_{n\to\infty} P\big(|T_n - T_{nk}| > \epsilon\big) = 0 \ \ \forall\,\epsilon > 0$.
Then $T_n \xrightarrow{D} A$.
Proof of Lemma 1.3. Let $\varphi_X(t) = E(e^{itX})$ denote the characteristic function of the random variable $X$.
Now
\[
|\varphi_{T_n}(t) - \varphi_A(t)| \le |\varphi_{T_n}(t) - \varphi_{T_{nk}}(t)| + |\varphi_{T_{nk}}(t) - \varphi_{A_k}(t)| + |\varphi_{A_k}(t) - \varphi_A(t)| = B_1 + B_2 + B_3, \text{ say.}
\]
For $B_1$, first let $n \to \infty$, then let $k \to \infty$, and use condition (iii) to conclude that it goes to zero (letting $\epsilon \to 0$ at the end). Conditions (i) and (ii) show in the same way that $B_2$ and $B_3$ also go to zero. This proves the lemma.
Now we continue with the proof of the theorem. We shall apply Lemma 1.3 to $\{T_n\}$ and $\{T_{nk}\}$ defined in (1.74) and (1.75). Suppose $\{W_k\}$ is a sequence of i.i.d. $\chi^2_1$ random variables. Let
\[
A_k = \sum_{j=1}^{k}\lambda_j(W_j - 1), \qquad A = \sum_{j=1}^{\infty}\lambda_j(W_j - 1). \tag{1.76}
\]
Note that by Kolmogorov’s three series theorem (see for example Chow and
Teicher (1997) Corollary 3, page 117), the value of the infinite series in (1.76)
is finite almost surely, and hence A is a legitimate random variable.
Also note that $A_k \xrightarrow{D} A$ as $k \to \infty$. Hence, condition (ii) of Lemma 1.3 is verified. Now we claim that $T_{nk} \xrightarrow{D} A_k$. This is because
\[
T_{nk} = \frac{1}{n}\sum_{1\le i\ne j\le n}\sum_{t=1}^{k}\lambda_t\phi_t(Y_i)\phi_t(Y_j)
 = \frac{1}{n}\sum_{t=1}^{k}\lambda_t\Big(\sum_{i=1}^{n}\phi_t(Y_i)\Big)^2 - \frac{1}{n}\sum_{t=1}^{k}\lambda_t\sum_{i=1}^{n}\phi_t^2(Y_i). \tag{1.77}
\]
\[
\frac{1}{n}\sum_{i=1}^{n}\phi_t^2(Y_i) \xrightarrow{a.s.} E\phi_t^2(Y_1) = 1. \tag{1.78}
\]
\[
\Big(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\phi_t(Y_i),\ t = 1, \ldots, k\Big) \xrightarrow{D} N(0, I_k) \tag{1.79}
\]
\[
T_{nk} \xrightarrow{D} \sum_{t=1}^{k}\lambda_tW_t - \sum_{t=1}^{k}\lambda_t = \sum_{t=1}^{k}\lambda_t(W_t - 1) \tag{1.80}
\]
where $\{W_t\}$ are i.i.d. $\chi^2_1$ random variables. This verifies condition (i). To verify condition (iii) of Lemma 1.3, consider
\[
T_n - T_{nk} = \frac{1}{n}\sum_{1\le i\ne j\le n} h(Y_i, Y_j) - \frac{1}{n}\sum_{1\le i\ne j\le n}\sum_{t=1}^{k}\lambda_t\phi_t(Y_i)\phi_t(Y_j)
 = \frac{2}{n}\binom{n}{2}\,U_{nk}, \tag{1.81}
\]
where $U_{nk}$ is the $U$-statistic with kernel
\[
g_k(x, y) = h(x, y) - \sum_{t=1}^{k}\lambda_t\phi_t(x)\phi_t(y).
\]
Now since $h$ is a degenerate kernel and $E\phi_k(Y_1) = 0$ $\forall k$, we conclude that $g_k$ is also a degenerate kernel. Note that using (1.73),
\[
E g_k^2(Y_1, Y_2) = E\Big[h(Y_1, Y_2) - \sum_{t=1}^{k}\lambda_t\phi_t(Y_1)\phi_t(Y_2)\Big]^2
 = E\Big[\sum_{t=k+1}^{\infty}\lambda_t\phi_t(Y_1)\phi_t(Y_2)\Big]^2
 = \sum_{t=k+1}^{\infty}\lambda_t^2.
\]
\[
\mathbb{V}\big(T_n - T_{nk}\big) = \frac{(n-1)^2}{\binom{n}{2}}\sum_{t=k+1}^{\infty}\lambda_t^2 \le 2\sum_{t=k+1}^{\infty}\lambda_t^2. \tag{1.82}
\]
Hence
\[
\lim_{k\to\infty}\limsup_{n\to\infty} P\big(|T_n - T_{nk}| > \epsilon\big) \le \lim_{k\to\infty}\frac{2}{\epsilon^2}\sum_{t=k+1}^{\infty}\lambda_t^2 = 0.
\]
This establishes condition (iii) of Lemma 1.3 and hence the Theorem is com-
pletely proved.
Definition 1.2: For any sequence of numbers $\{x_1, \ldots, x_n\}$, its empirical cumulative distribution function (e.c.d.f.) is defined as
\[
F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I_{\{x_i \le x\}}. \tag{1.83}
\]
\[
CV_n \xrightarrow{D} \frac{1}{\pi^2}\sum_{k=1}^{\infty}\frac{W_k}{k^2}. \tag{1.85}
\]
\[
F_0(x) = x, \quad 0 \le x \le 1, \tag{1.86}
\]
and hence
\begin{align}
CV_n &= n\int_0^1\big(F_n(x) - x\big)^2\,dx = n\int_0^1\Big(\frac{1}{n}\sum_{i=1}^{n} I_{\{Y_i\le x\}} - x\Big)^2 dx \tag{1.87}\\
&= \frac{1}{n}\sum_{1\le i,j\le n}\int_0^1\big(I_{\{Y_i\le x\}} - x\big)\big(I_{\{Y_j\le x\}} - x\big)\,dx\\
&= \frac{2}{n}\sum_{1\le i<j\le n}\int_0^1\big(I_{\{Y_i\le x\}} - x\big)\big(I_{\{Y_j\le x\}} - x\big)\,dx
 + \frac{1}{n}\sum_{i=1}^{n}\int_0^1\big(I_{\{Y_i\le x\}} - x\big)^2 dx\\
&= \binom{n}{2}\frac{2}{n}\,U_n(f) + U_n(h). \tag{1.88}
\end{align}
Note that the $I_{\{Y_i\le x\}}$ are i.i.d. Bernoulli random variables with probability of success $x$. Hence by the SLLN,
\begin{align}
U_n(h) &\xrightarrow{a.s.} E\int_0^1\big(I_{\{Y_i\le x\}} - x\big)^2 dx \tag{1.89}\\
&= \int_0^1 E\big(I_{\{Y_i\le x\}} - x\big)^2 dx \tag{1.90}\\
&= \int_0^1 x(1-x)\,dx = \frac{1}{6}. \tag{1.91}
\end{align}
Moreover,
\[
E f(x_1, Y_2) = E\int_0^1\big(I_{\{x_1\le x\}} - x\big)\big(I_{\{Y_2\le x\}} - x\big)\,dx = 0.
\]
\[
nU_n(f) \xrightarrow{D} \sum_{k=1}^{\infty}\lambda_k(W_k - 1) \tag{1.92}
\]
where $\{W_k\}$ are i.i.d. $\chi^2_1$ variables and $\{\lambda_k\}$ are the eigenvalues of the kernel $f$. We now identify the values $\{\lambda_k\}$. The eigenequation is
\[
\int_0^1 f(x_1, x_2)\,g(x_2)\,dx_2 = \lambda g(x_1). \tag{1.93}
\]
Now
\begin{align}
f(x_1, x_2) &= \int_0^1\big[I(x_1\le x)I(x_2\le x) - xI(x_1\le x) - xI(x_2\le x) + x^2\big]\,dx\\
&= \int_0^1\big[I\big(x \ge \max(x_1, x_2)\big) - xI(x_1\le x) - xI(x_2\le x) + x^2\big]\,dx\\
&= 1 - \max(x_1, x_2) - \frac{1-x_1^2}{2} - \frac{1-x_2^2}{2} + \frac{1}{3}\\
&= \frac{1}{3} - \max(x_1, x_2) + \frac{x_1^2 + x_2^2}{2}. \tag{1.94}
\end{align}
Recall that any eigenfunction $g$ must satisfy $\int_0^1 g(x)\,dx = 0$ (see (1.72)). Hence using (1.94), (1.93) reduces to
\begin{align}
\lambda g(x_1) &= \int_0^1\Big(\frac{1}{3} - \max(x_1, x_2) + \frac{x_1^2 + x_2^2}{2}\Big)g(x_2)\,dx_2\\
&= \int_0^1\frac{x_2^2\,g(x_2)}{2}\,dx_2 - \int_{x_1}^1 x_2\,g(x_2)\,dx_2 - x_1\int_0^{x_1} g(x_2)\,dx_2. \tag{1.95}
\end{align}
For the moment, assume that $g$ is a continuous function. Then differentiating (1.95) twice implies that $\lambda g'(x_1) = -\int_0^{x_1} g(x_2)\,dx_2$ and hence $\lambda g''(x_1) = -g(x_1)$. It is well known that the general solution to this second order differential equation is $g(x) = C_1e^{itx} + C_2e^{-itx}$ with $t = \lambda^{-1/2}$; since $g'(0) = 0$, this reduces to $g(x) = 2C_1\cos(tx)$.
Now
\[
0 = \int_0^1 g(x)\,dx \ \Rightarrow\ t = \pi k, \quad k = 0, \pm 1, \ldots \tag{1.99}
\]
Further,
\[
1 = \int_0^1 g^2(x)\,dx = 4C_1^2\int_0^1\cos^2(\pi kx)\,dx = \frac{4C_1^2}{2} = 2C_1^2.
\]
\[
\lambda\pi^2k^2 = 1, \quad k = \pm 1, \ldots \tag{1.101}
\]
or
\[
\lambda = \frac{1}{\pi^2k^2}, \quad k = 1, 2, \ldots \tag{1.102}
\]
(with possible multiplicity). But $\{\sqrt{2}\cos(\pi kx),\ k = 1, 2, \ldots\}$ is a complete orthonormal system and hence we can conclude that the eigenvalues (with no multiplicities) and the eigenfunctions are given by
\[
\Big(\lambda_k = \frac{1}{\pi^2k^2},\ g_k(x) = \sqrt{2}\cos(\pi kx)\Big), \quad k = 1, 2, \ldots. \tag{1.103}
\]
Notice that $\sum_{k=1}^{\infty}\lambda_k = 1/6$ (which was first proved by Euler; see Knopp (1923) for an early proof).
As a consequence, using (1.88), (1.91) and (1.92),
\[
CV_n \xrightarrow{D} \sum_{k=1}^{\infty}\lambda_k(W_k - 1) + \frac{1}{6} = \sum_{k=1}^{\infty}\lambda_kW_k = \frac{1}{\pi^2}\sum_{k=1}^{\infty}\frac{W_k}{k^2}.
\]
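The following R sketch computes the Cramér–von Mises statistic $CV_n$ for uniform data by direct numerical integration of (1.87) on a fine grid, and compares its simulated distribution with a truncated version of the limit $\pi^{-2}\sum_k W_k/k^2$; the grid size, truncation point and all names are illustrative choices.

  set.seed(3)
  cvm <- function(y, grid = seq(0, 1, length.out = 2001)) {
    n  <- length(y)
    Fn <- ecdf(y)
    n * mean((Fn(grid) - grid)^2)      # approximates n * integral of (F_n(x) - x)^2 dx
  }
  reps <- 5000
  cv.sim <- replicate(reps, cvm(runif(50)))
  k <- 1:200                           # truncate the weighted chi-square series
  cv.lim <- colSums(matrix(rchisq(reps * length(k), df = 1), nrow = length(k)) / (pi^2 * k^2))
  c(quantile(cv.sim, 0.95), quantile(cv.lim, 0.95))   # the two 95% quantiles should be close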
Example 1.18: (Example 1.7 continued) Using (1.13), it is easy to see that $\kappa_n$ is a degenerate $U$-statistic. An application of Theorem 1.7 yields the following: suppose $\{(X_i, Y_i)\}$ are i.i.d. and moreover all the $\{X_i, Y_j\}$ are independent. Further suppose that their second moments are finite. Then
\[
n\kappa_n \xrightarrow{D} \sum_{i,j=1}^{\infty}\lambda_i\mu_j Z_{i,j}
\]
where $\{Z_{i,j}\}$ are i.i.d. chi-square random variables each with one degree of freedom. The $\{\lambda_i\}$ and $\{\mu_j\}$ are given by the solution of the eigenvalue equations:
\[
\int h_{F_X}(x_1, x_2)\,g_X(x_2)\,dF_X(x_2) = \lambda\,g_X(x_1) \quad\text{a.e. } (F_X), \tag{1.104}
\]
\[
\int h_{F_Y}(y_1, y_2)\,g_Y(y_2)\,dF_Y(y_2) = \mu\,g_Y(y_1) \quad\text{a.e. } (F_Y). \tag{1.105}
\]
We now state the asymptotic limit law for degenerate $U$-statistics for general $m > 2$. Define the second projection $h_2$ of the kernel $h$ as
\[
h_2(x_1, x_2) = E\,h(x_1, x_2, Y_3, \ldots, Y_m).
\]
For such a first order degenerate kernel with finite second moment,
\[
nU_n \xrightarrow{D} \binom{m}{2}\sum_{k=1}^{\infty}\lambda_k(W_k - 1),
\]
where $\{W_k\}$ are i.i.d. $\chi^2_1$ random variables and $\{\lambda_k\}$ are the eigenvalues of the operator defined by
\[
Ag(x_2) = \int h_2(x_1, x_2)\,g(x_1)\,dF(x_1).
\]
We omit the proof of this theorem, which is easy once we use Theorem 1.7 and some easily derivable properties of the second-order remainder in the Hoeffding decomposition. For details see Lee (1990), page 83.
1.6 Exercises
1. Suppose F is the set of all cumulative distribution functions on R. Let
Y1 , . . . , Yn be i.i.d. observations from some unknown F ∈ F.
(a) Show that the e.c.d.f. Fn is a complete sufficient statistic for this
space.
(c) Using the above fact, show that any U -statistic (with finite vari-
ance) is the nonparametric UMVUE of its expectation.
5. Show that for the $U$-statistic Kendall's tau given in Example 1.4, under independence of $X$ and $Y$,
where
\[
\delta_c = \mathrm{COV}\big(f(X_{i_1}, \ldots, X_{i_{m_1}}),\, g(X_{j_1}, \ldots, X_{j_{m_2}})\big)
\]
and $c$ = number of common indices between the sets $\{i_1, \ldots, i_{m_1}\}$ and $\{j_1, \ldots, j_{m_2}\}$.
7. Using the previous exercise, show that for Wilcoxon's $T^{+}$ given in Example 1.6,
\[
\mathbb{V}(T^{+}) = \binom{n}{2}\big[(n-1)(p_4 - p_2^2) + p_2(1 - p_2) + 4(p_3 - p_1p_2)\big] + np_1(1 - p_1)
\]
where
10. Show that for the sample covariance $U$-statistic given in Example 1.3,
\[
\delta_1 = \big(\mu_{2,2} - \sigma_{XY}^2\big)/4, \qquad \delta_2 = \big(\mu_{2,2} + \sigma_X^2\sigma_Y^2\big)/2, \quad\text{and hence}
\]
\[
\mathbb{V}\,U_n = n^{-1}\mu_{2,2} - \frac{(n-2)\sigma_{XY}^2 - \sigma_X^2\sigma_Y^2}{n(n-1)},
\]
where
\[
\mu_{2,2} = E\big[(X - \mu_X)^2(Y - \mu_Y)^2\big], \quad \sigma_X^2 = \mathbb{V}(X), \quad \sigma_Y^2 = \mathbb{V}(Y), \quad \sigma_{XY} = \mathrm{COV}(X, Y).
\]
\[
L_n = \sum_{i=1}^{n} i\,Y_{(i)} \tag{1.107}
\]
18. Verify (1.33) that U and V are independent continuous U (−1, 1) ran-
dom variables.
19. Show that in the case of the sample variance in the degenerate case, there is only one eigenvalue, equal to $-1$.
For the kernel $h(x, y) = xy + x^3y^3$, show that $U_n(h)$ is degenerate and the $L_2$ operator $A_h$ has two eigenvalues and eigenfunctions. Find these and the asymptotic distribution of $nU_n(h)$.
\[
h(x, y) = |x + y| - |x - y|,
\]
where $|a| = \big(\sum_{j=1}^{d} a_j^2\big)^{1/2}$ for $a = (a_1, \ldots, a_d) \in \mathbb{R}^d$. Show that $U_n(h)$ is a degenerate $U$-statistic. Find the asymptotic distribution of $nU_n(h)$ when $d = 1$.
25. (This result is used in the proof of Theorem 2.3.) Recall the variance
formula (1.18).
26. For Bergsma's $\kappa$ in Example 1.18, show that the eigenvalues and eigenfunctions when all the random variables are i.i.d. continuous uniform $(0, 1)$ are given by $\big\{\frac{1}{\pi^2k^2},\ g_k(u) = \sqrt{2}\cos(\pi ku),\ 0 \le u \le 1,\ k = 1, 2, \ldots\big\}$.
27. To have an idea of how fast the asymptotic distribution takes hold in the degenerate case, plot $\big(k,\ \sum_{i=1}^{k}\lambda_i\big/\sum_{i=1}^{\infty}\lambda_i\big)$, $k = 1, 2, \ldots$ for (a) the Cramér-von Mises statistic and (b) Bergsma's $\kappa_n$, when all distributions are continuous uniform $(0, 1)$.
Chapter 2
Mm-estimators and U-statistics
The special case when m = 1 is the one that is most commonly studied
and in that case θ0 is traditionally called an M -parameter.
The sample analogue $Q_n$ of $Q$ is given by
\[
Q_n(\theta) = \binom{n}{m}^{-1}\sum_{1\le i_1<i_2<\cdots<i_m\le n} f(Y_{i_1}, \ldots, Y_{i_m}, \theta). \tag{2.2}
\]
So
\[
Q_n(\theta_n) = \inf_{\theta} Q_n(\theta). \tag{2.3}
\]
Note that $|f(x, \theta)| \le 2|\theta|$ and hence $Q(\theta) = Ef(Y, \theta)$ is finite for all $\theta \in \mathbb{R}$. It is easy to check that
\[
f(x, \theta) = \theta\big(2I_{\{x\le 0\}} - 1\big) + 2\int_0^{\theta}\big(I_{\{x\le s\}} - I_{\{x\le 0\}}\big)\,ds - (2p-1)\theta. \tag{2.7}
\]
Hence
\[
Q(\theta) = 2\int_0^{\theta} F(s)\,ds - 2p\theta \quad\text{for all } \theta \in \mathbb{R}. \tag{2.8}
\]
\[
f(x_1, x_2, \theta) = \Big|\frac{x_1 + x_2}{2} - \theta\Big| - \Big|\frac{x_1 + x_2}{2}\Big|. \tag{2.10}
\]
\[
f(x, \theta) = \Big(\sum_{k=1}^{d}\big(x_k - \theta_k\big)^2\Big)^{1/2} - \Big(\sum_{k=1}^{d} x_k^2\Big)^{1/2}. \tag{2.13}
\]
Note that $Q(\theta) = Ef(Y, \theta)$ is finite if $E|Y| < \infty$, where $|a|$ is the Euclidean norm of the vector $a$. It can be shown that if $P$ does not put all its mass on a hyperplane (that is, if $P\big(\sum_{i=1}^{d} C_iY_i = C\big) \ne 1$ for any choice of real numbers $(C, C_1, \ldots, C_d)$), then $Q(\theta)$ is minimized at a unique $\theta_0$ (see Kemperman (1987)). This $\theta_0$ is called the L1-median. The corresponding M-estimator is called the sample L1-median. It is unique if $\{Y_1, \ldots, Y_n\}$ do not lie on a lower dimensional hyperplane. If $d = 1$, the L1-median reduces to the usual median discussed in Example 2.3.
Later in Chapter 5 we illustrate the L1-median with a real data application. The R package for this book, called UStatBookABSC, also contains a function to obtain the L1-median on general datasets.
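The L1-median has no closed form, but it is easy to compute numerically. The sketch below implements the classical Weiszfeld iteration in R purely as an illustration; it is not the function from the UStatBookABSC package, and the function name, tolerance and starting value are arbitrary choices.

  # Illustrative sketch: L1-median of an n x d data matrix via the Weiszfeld iteration
  l1median <- function(Y, tol = 1e-8, maxit = 200) {
    theta <- colMeans(Y)                       # start from the coordinatewise mean
    for (it in 1:maxit) {
      d <- sqrt(rowSums(sweep(Y, 2, theta)^2)) # distances |Y_i - theta|
      d <- pmax(d, tol)                        # guard against a data point equal to theta
      w <- 1 / d
      theta.new <- colSums(Y * w) / sum(w)
      if (sum(abs(theta.new - theta)) < tol) break
      theta <- theta.new
    }
    theta
  }

  set.seed(5)
  Y <- matrix(rnorm(200), ncol = 2)
  l1median(Y)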
Example 2.8 (Oja-median): This multivariate median was introduced by Oja (1983). For any $(d+1)$ points $x_1, x_2, \ldots, x_{d+1} \in \mathbb{R}^d$ for $d \ge 2$, the simplex formed by them is the smallest convex set containing these points. Let $\Delta(x_1, \ldots, x_d, x_{d+1})$ denote the absolute volume of this simplex. Let $Q(\theta) = E\big[\Delta(Y_1, \ldots, Y_d, \theta) - \Delta(Y_1, \ldots, Y_d, 0)\big]$; its minimizer is the Oja median.
2.2 Convexity
Many researchers have studied the asymptotic properties of M -estimators and
Mm -estimators. Early works on the asymptotic properties of M1 -estimators
and M2 -estimators are Huber (1967) and Maritz et al. (1977). Using condi-
tions similar to Huber (1967), Oja (1984) proved the consistency and asymp-
totic normality of Mm -estimators. His results apply to some of the estimators
that we have presented above.
We emphasize that all examples of f we have considered so far have a
common feature. They are all convex functions of θ. Statisticians prefer to
work with convex loss functions for various reasons. We shall make this blan-
ket assumption here. This does entail some loss of generality. But convexity
leads to a significant simplification in the study of Mm-estimators while at the same time covering the examples presented above.
In the next sub-section we will show that we can always choose a measurable
version of the sub-gradient. If f is differentiable, then this sub-gradient is
simply the ordinary derivative. This sub-gradient will be crucial to us.
Example 2.9: (i) For the usual median, it can be checked that a sub-gradient is given by
\[
g(x, \theta) =
\begin{cases}
1 & \text{if } \theta > x\\
0 & \text{if } \theta = x\\
-1 & \text{if } \theta < x.
\end{cases} \tag{2.17}
\]
2.3 Measurability
As Examples 2.3 and 2.6 showed, an Mm -estimator is not necessarily unique.
However, it can be shown by using the convexity assumption, that a mea-
surable minimizer can always be chosen. This can be done by the following
selection theorem and its corollary. The asymptotic results that we will
discuss later, hold for any measurable sequence of minimizers of {Qn (θ)}.
At the heart of choosing a sequence of measurable minimizers is the idea
of measurable selections. This is a very interesting topic in mathematics and
there are many selection theorems in the literature. See for example Castaing
and Valadier (1977). Γ is said to be a multifunction if it assigns a subset Γ(z)
of Rd to each z. A function σ : Z → Rd is said to be a selection of Γ if
σ(z) ∈ Γ(z) for every z. If Z is a measurable space, then σ is said to be a
measurable selection if z → σ(z) is a measurable function.
We quote the following theorem from Castaing and Valadier (1977). For
its proof, see Theorem 3.6 and Proposition 3.11 in Section 3.2 there.
whenever the inf is in the range of q(z, ·), otherwise a(z) is taken to be some
fixed number.
for any subset $A$ of $\mathbb{R}^d$. Indeed, $\inf_{\alpha\in A}$ can be replaced by $\inf_{\alpha\in C}$, where $C$ is a countable dense subset of $A$, because $q(z, \cdot)$ is continuous. Let
We have
This is because the right side infimum is certainly in the range of q(z, ·).
Thus,
\[
Z_0 = \{z : \Gamma(z) \ne \emptyset\} \tag{2.22}
\]
We now show how we can apply the above corollary to obtain a measurable minimiser $\theta_n$ in (2.3). Suppose $f(x_1, \ldots, x_m, \theta)$ is a function on $\mathcal{Y}^m \times \mathbb{R}^d$ which is measurable in $(x_1, \ldots, x_m)$ and convex in $\theta$. Note that convexity automatically implies continuity in $\theta$.
Suppose $\{Y_1, \ldots, Y_n\}$, $n \ge m$, are i.i.d. $\mathcal{Y}$ valued random variables. On $\mathcal{Y}^n$ consider the function $q(\cdot, \alpha) = Q_n(\alpha)$ and apply Corollary 2.1 to get a measurable minimiser $\theta_n$.
\[
\theta_n \xrightarrow{a.s.} \theta_0 \quad\text{as } n \to \infty. \tag{2.24}
\]
Proof of Lemma 2.1: Recall that convex functions converge pointwise ev-
erywhere if they converge pointwise on a dense set. Moreover the everywhere
convergence is uniform over compact sets. See Rockafellar (1970), Theorem
10.8 for additional details.
Let C be a countable dense set. To prove (a), it is just enough to observe
that with probability 1, convergence hn (α) → h(α) takes place for all α ∈ C
and then apply the above criterion for convergence of convex functions.
To prove (b), consider an arbitrary sub-sequence of the sequence. For any
fixed α ∈ C, we can select a further sub-sequence, along which hn (α) → h(α)
holds almost surely. Now we can apply the Cantor diagonal method to get
hold of one single sub-sequence {hn } which converges pointwise almost surely
on C. Now apply (a) to conclude that this sub-sequence converges almost
everywhere uniformly on compact sets. Since for any sub-sequence, we have
exhibited a further sub-sequence which converges uniformly, almost surely
on compact sets, the original sequence converges in probability uniformly on
compact sets. This completes the proof.
Proof of Theorem 2.2: Note that by the SLLN for U -statistics, Qn (α) con-
verges to Q(α) for each α almost surely. By Lemma 2.1, this convergence is
uniform on any compact set almost surely.
Let $B$ be a ball of arbitrary radius around $\theta_0$. If $\theta_n$ is not consistent, then there is an $\epsilon > 0$ and a set $S$ in the probability space such that $P(S) > 0$ and for each sample point in $S$, there is a sub-sequence of $\theta_n$ that lies outside this ball. We assume without loss that for each point in this set, the convergence
\[
\theta_n^{*} = \gamma_n\theta_0 + (1 - \gamma_n)\theta_n.
\]
First note that the right side converges to $Q(\theta_0)$. Now, every $\theta_n^{*}$ lies on the compact set $\{\theta : |\theta - \theta_0| = \epsilon\}$. Hence there is a sub-sequence of $\{\theta_n^{*}\}$ which converges to, say, $\theta_1$. Since the convergence of $Q_n$ to $Q$ is uniform on compact sets, the left side of the above equation converges to $Q(\theta_1)$. Hence, $Q(\theta_1) \le Q(\theta_0)$. This is a contradiction to the uniqueness of $\theta_0$ since $|\theta_0 - \theta_1| = \epsilon$. This proves the theorem.
Let $U_n$ be the $U$-statistic based on the kernel $g(Y_1, \ldots, Y_m, \theta_0)$.
Theorem 2.3. Suppose Assumptions (I)–(V) hold. Then for any sequence of measurable minimizers $\{\theta_n\}$,
(a) $\theta_n - \theta_0 = -H^{-1}U_n + o_P(n^{-1/2})$,
(b) $n^{1/2}(\theta_n - \theta_0) \xrightarrow{D} N\big(0, m^2H^{-1}KH^{-1}\big)$, where $K = \mathbb{V}\big(E\big[g(Y_1, \ldots, Y_m, \theta_0)\mid Y_1\big]\big)$.
Hence,
or
For the proof of this theorem, as well as for those theorems given later, assume without loss that $\theta_0 = 0$ and $Q(\theta_0) = 0$. As a consequence,
Note that Vn = m s∈S Yn,s is a U -statistic. From Exercise 25 of
Chapter 1, using (2.31), it follows that
n −1 m
2
V Yn,s ≤ E (Yn,s − EYn,s )
m n
s∈S
m 2
≤ K EYn,s
n
m 2
≤ K 2 E αt g(Yn,s , n−1/2 α) − g(Yn,s , 0) .
n
\[
Z_{n+1} - Z_n = \alpha^T\big[g(Z, (n+1)^{-1/2}\alpha) - g(Z, 0)\big] - \alpha^T\big[g(Z, n^{-1/2}\alpha) - g(Z, 0)\big]
 = \alpha^T\big[g(Z, (n+1)^{-1/2}\alpha) - g(Z, n^{-1/2}\alpha)\big].
\]
∇Q(n−1/2 α) → 0.
Now, due to convexity, by Lemma 2.1, the convergences in (2.35) and (2.36) are uniform on compact sets. Thus for every $\epsilon > 0$ and every $M > 0$, the inequality
\[
\sup_{|\alpha|\le M}\big|nQ_n\big(\alpha/\sqrt{n}\big) - nQ_n(0) - \alpha^Tn^{1/2}U_n - \alpha^TH\alpha/2\big| < \epsilon \tag{2.37}
\]
\[
\alpha_n \xrightarrow{D} N\big(0, m^2H^{-1}KH^{-1}\big). \tag{2.39}
\]
The rest of the argument is on the intersection of the two events in (2.37) and (2.41), and that has probability at least $1 - 2\epsilon$.
Consider the convex function
\[
A_n(\alpha) = nQ_n\big(\alpha/\sqrt{n}\big) - nQ_n(0). \tag{2.42}
\]
From (2.37),
\[
A_n(\alpha) \ge B_n(\alpha) - \epsilon. \tag{2.44}
\]
Comparing the two bounds in (2.43) and (2.44), and using the condition that $\alpha$ lies on the sphere, it can be shown that the bound in (2.44) is always strictly larger than the one in (2.43) once we choose $T = 4\big(\lambda_{\min}(H)\big)^{-1/2}$
Additionally,
Since g is bounded, Assumption (IV) is trivially satisfied. Thus all the con-
ditions (I)–(V) are satisfied.
Moreover,
\[
K = \mathbb{V}\big(2I_{\{\theta_0\ge Y_1\}}\big) = 4\,\mathbb{V}\big(I_{\{Y_1\le\theta_0\}}\big) = 4p(1-p). \tag{2.47}
\]
\[
n^{1/2}(\theta_n - \theta_0) \xrightarrow{D} N\big(0,\ p(1-p)\big(f^2(\theta_0)\big)^{-1}\big). \tag{2.48}
\]
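A small simulation makes the normal limit (2.48) concrete. The R sketch below checks the asymptotic variance $p(1-p)/f^2(\theta_0)$ for the median ($p = 1/2$) of standard normal data; the sample size, replication count and names are illustrative.

  set.seed(21)
  n <- 400; reps <- 5000; p <- 0.5
  theta0 <- qnorm(p)                                  # true quantile
  z <- replicate(reps, sqrt(n) * (quantile(rnorm(n), p) - theta0))
  c(simulated.var = var(z),
    limit.var = p * (1 - p) / dnorm(theta0)^2)        # p(1 - p) / f(theta0)^2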
Incidentally, if the assumptions of Theorem 2.3 are not satisfied, the limiting distribution of the Mm-estimate need not be normal. Smirnov (1949) (translated in Smirnov (1952)) studied the sample quantiles in such non-regular situations in complete detail, identifying the class of possible limit distributions.
Wolfe (1973) pages 205-206) where (Xi , Yi ) are bivariate i.i.d. random vari-
ables and
(iv) The location estimate of Maritz et al. (1977) can also be treated in this
way. Let β be any fixed number between 0 and 1. Let
This implies
Recall that $\nabla Q(\theta) = E[g(Y_1, \theta)]$. By simple algebra, for $|x| \le |\theta|$,
\[
\big|g(x, \theta) - g(x, 0) - h(x, 0)\theta\big| \le 5\,\frac{|\theta|^2}{|x|^2} + \frac{|\theta|^3}{|x|^3}. \tag{2.61}
\]
Using these two inequalities, and the inverse moment condition (2.58), it is
easy to check that, the matrix H exists and can be evaluated as
Let $Y^{(i)}$ be the $d \times d$ matrix obtained from $Y$ by deleting its $i$-th row and replacing it by a row of 1's at the end. That is,
\[
Y^{(i)} =
\begin{pmatrix}
Y_{11} & Y_{12} & \cdots & Y_{1d}\\
Y_{21} & Y_{22} & \cdots & Y_{2d}\\
\vdots & \vdots & & \vdots\\
Y_{i-1,1} & Y_{i-1,2} & \cdots & Y_{i-1,d}\\
Y_{i+1,1} & Y_{i+1,2} & \cdots & Y_{i+1,d}\\
\vdots & \vdots & & \vdots\\
Y_{d1} & Y_{d2} & \cdots & Y_{dd}\\
1 & 1 & \cdots & 1
\end{pmatrix}.
\]
Let $\det(M)$ denote the determinant of the matrix $M$. It is easily seen that
\[
f(Y_1, \ldots, Y_d, \theta) = \big|\det\big(M(\theta)\big)\big| - \big|\det\big(M(0)\big)\big| = \big|\theta^TT - Z\big| - |Z| \tag{2.63}
\]
where
\[
T = (T_1, \ldots, T_d)^T, \quad T_i = (-1)^{i+1}\det\big(Y^{(i)}\big), \quad\text{and}\quad Z = (-1)^d\det(Y). \tag{2.64}
\]
and has common features of the gradients of the sample mean (g(x) = x) as
well as of U -quantiles (g(x) = sign function), see Examples 2.12 and 2.13.
Assume that E|Y1 |2 < ∞. This implies E|T |2 < ∞ which in turn implies
E|gi |2 < ∞ and thus Assumption (IV) is satisfied.
It easily follows that the $i$-th element of the gradient vector of $Q(\theta)$ equals
\[
Q_i(\theta) = 2E\big[T_i\,I_{\{Z\le\theta^TT\}}\big]. \tag{2.66}
\]
If, further, $F$ has a density, it follows that the derivative of $Q_i(\theta)$ with respect to $\theta_j$ is given by
\[
Q_{ij}(\theta) = 2E\big[T_iT_j\,f_{Z|T}(\theta^TT)\big]. \tag{2.67}
\]
Clearly then Assumption (V) will be satisfied if we assume that, the density
of F exists and H defined above exists and is positive definite. This condition
is satisfied by many common distributions.
Example 2.15: The $p$th-order Oja-median for $1 < p < 2$ is defined by minimizing
\[
Q(\theta) = E\big[\Delta^p(Y_1, \ldots, Y_d, \theta) - \Delta^p(Y_1, \ldots, Y_d, 0)\big]. \tag{2.69}
\]
(b) If Assumption (VIb) also holds with some $r > 1$, then for every $\delta > 0$,
\[
P\Big(\sup_{k\ge n}|\theta_k - \theta_0| > \delta\Big) = o(n^{1-r}) \quad\text{as } n \to \infty. \tag{2.73}
\]
Note that if r < 2, then Assumption (VIb) is weaker than Assumption (IV)
needed for the asymptotic normality. If r > 2, then Assumption (VIb) is
stronger than Assumption (IV) but weaker than Assumption (VIa), and still
implies complete convergence.
Incidentally, the last time that the estimator is $\epsilon$ distance away from the parameter is of interest as $\epsilon$ approaches zero. See Bose and Chatterjee (2001b) and the references there for some information on this problem.
\[
\sup_{\beta\in B}|Q(\beta) - h(\beta)| < \epsilon \quad\text{implies}\quad \sup_{\alpha\in A}|Q(\alpha) - h(\alpha)| < 5\delta L + 3\epsilon. \tag{2.75}
\]
On the other hand, to each $\alpha \in A$ there corresponds $\beta \in B$ such that $|\alpha - \beta| < \delta$ and thus $\alpha + 2(\beta - \alpha) = \gamma \in A_0$. From (2.76) it follows that
Proof of Theorem 2.4: We first prove part (b). Fix $\delta > 0$. Note that $Q$ is convex and hence is continuous. Further, it is also Lipschitz, with Lipschitz constant $L$ say, in a neighborhood of 0. Hence there exists an $\epsilon > 0$ such that $Q(\alpha) > 2\epsilon$ for all $|\alpha| = \delta$.
Fix $\alpha$. By Assumption (VIb) and Theorem 1.5,
\[
P\Big(\sup_{k\ge n}\big|Q_k(\alpha) - Q_k(0) - Q(\alpha)\big| > \epsilon\Big) = o(n^{1-r}). \tag{2.78}
\]
Suppose that the event in (2.80) occurs. Using the fact that $f_k(\alpha) = Q_k(\alpha) - Q_k(0)$ is convex, $f_k(0) = 0$, $f_k(\alpha) > \epsilon$ for all $|\alpha| = \delta$ and $Q(\alpha) > 2\epsilon$ for all $|\alpha| = \delta$, we conclude that $f_k(\alpha)$ attains its minimum on the set $|\alpha| \le \delta$. This proves part (b) of the theorem.
To prove part (a), we follow the argument given in the proof of part (b)
but use Theorem 1.5(c) to obtain the required exponential rate. The rest of
the proof remains unchanged. We omit the details.
Theorem 2.5. Suppose Assumptions (I)–(V) and (VII)–(IX) hold for some $0 \le s < 1$ and $r > (8 + d(1+s))/(1-s)$. Then almost surely as $n \to \infty$,
\[
n^{1/2}(\theta_n - \theta_0) = -H^{-1}n^{1/2}U_n + O\big(n^{-(1+s)/4}(\log n)^{1/2}(\log\log n)^{(1+s)/4}\big). \tag{2.82}
\]
The almost sure results obtained in Theorems 2.5 and 2.6 are by no means exact. We shall discuss this issue in some detail later.
To prove the theorems, we need a lemma. It is a refinement of Lemma 2.2 on convex functions to the gradient of convex functions.
\[
\sup_{\beta\in B}|k(\beta) - p(\beta)| < \epsilon \quad\text{implies}\quad \sup_{\alpha\in A}|k(\alpha) - p(\alpha)| < 4\delta L + 2\epsilon. \tag{2.83}
\]
Thus,
\begin{align*}
\delta e^Tp(\alpha) &\le h(\alpha + \delta e) - h(\alpha) \le \sum_i\lambda_i(\beta_i - \alpha)^Tp(\beta_i)\\
&\le \sum_i\lambda_i\big[(\beta_i - \alpha)^Tk(\alpha) + |\beta_i - \alpha||k(\beta_i) - k(\alpha)| + |\beta_i - \alpha||p(\beta_i) - k(\beta_i)|\big]
\end{align*}
and
\[
Y_{n,s} = g\Big(Y_s, \frac{\alpha}{\sqrt{n}}\Big) - g(Y_s, 0). \tag{2.86}
\]
Note that
\[
E(Y_{n,s}) = G\Big(\frac{\alpha}{\sqrt{n}}\Big), \tag{2.87}
\]
and
\[
\binom{n}{m}^{-1}\sum_{s\in S} Y_{n,s} = G_n\Big(\frac{\alpha}{\sqrt{n}}\Big) - U_n. \tag{2.88}
\]
sS
Let
By using Theorem 1.6 with vn2 = C 2 n−(1+s)/2 ln1+s , for some K and D,
α α
sup P n1/2 |Gn ( √ ) − Un − G( √ )| > KCn−(1+s)/4 ln(1+s)/2 (log n)1/2
|α|≤M ln n n
≤ Dn1−r/2 C −r/2 nr(1+s)/4 ln−r(1+s)/2 (log n)r/2
= Dn1−r(1−s)/4 (log n)r/2 (log log n)−r(1+s)/4 . (2.90)
and so the inequality (2.90) continues to hold when we replace $n^{1/2}G\big(\frac{\alpha}{\sqrt{n}}\big)$ by $H\alpha$.
Let $\epsilon_n = n^{-(1+s)/4}l_n^{(1+s)/2}(\log n)^{1/2}$ denote the rate appearing in (2.90), and consider the event
\[
\Big|n^{1/2}G_n\Big(\frac{\alpha}{\sqrt{n}}\Big) - n^{1/2}U_n - H\alpha\Big| \le KC\epsilon_n. \tag{2.93}
\]
Since $r > [8 + d(1+s)]/(1-s)$, the right side is summable and hence we can apply the Borel-Cantelli lemma to conclude that almost surely, for large $n$,
\[
\sup_{|\alpha|\le Ml_n}\Big|n^{1/2}G_n\Big(\frac{\alpha}{\sqrt{n}}\Big) - n^{1/2}U_n - H\alpha\Big| \le K_1\epsilon_n. \tag{2.96}
\]
By the LIL for $U$-statistics given in Theorem 1.4, $n^{1/2}U_nl_n^{-1}$ is bounded almost surely as $n \to \infty$. Hence we can choose $M$ so that
\[
|n^{1/2}H^{-1}U_n| \le Ml_n - 1
\]
almost surely for large $n$. Now consider the convex function $nQ_n(n^{-1/2}\alpha) - nQ_n(0)$ on the sphere
\[
S = \{\alpha : |\alpha + H^{-1}n^{1/2}U_n| = K_2\epsilon_n\},
\]
and so the radial directional derivatives of the function are positive. This shows that the minimiser $n^{1/2}\theta_n$ of the function must lie within the sphere
Proof of Theorem 2.6: Let $v_n$ and $X_{ns}$ be as in the proof of Theorem 2.5. Let $U_n$ be the $U$-statistic with kernel $X_{ns} - EX_{ns}$, which is now bounded since $g$ is bounded. By arguments similar to those given in the proof of Theorem 1.6 for the kernel $h_{n1}$,
\[
P\big(|n^{1/2}U_n| \ge v_n(\log n)^{1/2}\big) \le \exp\big\{-Kt(\log n)^{1/2} + t^2n/k\big\}, \tag{2.98}
\]
Assume that
\[
K(\theta) - K(\theta_0) - (\theta - \theta_0)k(\theta_0) = O\big(|\theta - \theta_0|^{3/2}\big) \quad\text{as } \theta \to \theta_0.
\]
Then Assumption (VII) holds with $s = 0$.
where $f_{Z|T}(\cdot)$ denotes the conditional density of $Z$ given $T$. Hence Assumption (VII) will be satisfied if we assume that for each $i$, as $\theta \to \theta_0$,
\[
E\big|Y_i\big\{F_{Z|T}(\theta^TT) - F_{Z|T}(\theta_0^TT) - f_{Z|T}(\theta_0^TT)(\theta - \theta_0)^TT\big\}\big|
\]
(i) (L1 -median). Since results for the univariate median (and quantiles) are
very well known (see for example Bahadur (1966), Kiefer (1967)), we confine
our attention to the case d ≥ 2.
\[
\big|g(x, \theta) - g(x, 0) - h(x, 0)\theta\big| \le 6\,\frac{|\theta|^2}{|x|^2}. \tag{2.106}
\]
where
\[
I_1 \le 4|\theta|\int_{|x|\le|\theta|}|x|^{-1}\,dF(x) \le 2|\theta|^{(3+s)/2}\int_{|x|\le|\theta|}|x|^{-(3+s)/2}\,dF(x) \tag{2.108}
\]
The inverse moment condition (2.104) assures that Assumption (VII) holds
with ∇2 Q(θ0 ) = H. Thus we have verified all the conditions needed and the
proposition is proved.
Let us investigate the nature of the inverse moment condition (2.104). If Y1
has a density f bounded on every compact subset of Rd then E[|Y1 −θ|−2 ] < ∞
if d ≥ 3 and E[|Y1 −θ0 |−(1+s) ] < ∞ for any 0 ≤ s < 1 if d = 2, and Theorem 2.5
is applicable. However, this boundedness or even the existence of a density
as such is not needed if d ≥ 2. This is in marked contrast with the situation
for d = 1 where the existence of the density is required since it appears in
the leading term of the representation. For most common distributions, the
representation holds with s = 1 from dimension d ≥ 3, and with some s < 1
for dimension d = 2. The weakest representation corresponds to s = 0 and
gives a remainder O(n−1/4 (log n)1/2 (log log n)1/4 ) if E[|Y1 − θ|−3/2 ] < ∞.
The strongest representation corresponds to s = 1 and gives a remainder
O(n−1/2 (log n)1/2 (log log n)1/2 ) if E[|Y1 − θ|−2 ] < ∞.
The moment condition (2.104) forces F to necessarily assign zero mass at
the median. Curiously, if F assigns zero mass to an entire neighborhood of
the median, then the moment condition is automatically satisfied.
Now assume that the L1 -median is zero and Y is dominated in a neighbor-
hood of zero by a variable Z which has a radially symmetric density f (|x|).
Transforming to polar coordinates, the moment condition is satisfied if the
integral of g(r) = r−(3+s)/2+d−1 f (r) is finite. If d = 2 and f is bounded in a
neighborhood of zero, then the integral is finite for all s < 1. If f (r) = O(r−β ),
(β > 0), then the integral is finite if s < 2d − 3 − 2β. In particular, if f is
bounded (β = 0), then any s < 1 is feasible for d = 2 and s = 1 for d = 3.
(ii) (Hodges-Lehmann estimate) The above arguments also show that if the moment condition is changed to $E\big|m^{-1}(Y_1 + \cdots + Y_m) - \theta_0\big|^{-(3+s)/2} < \infty$, Proposition 2.1 holds for the Hodges-Lehmann estimator with
\[
U_n = \binom{n}{m}^{-1}\sum_{1\le i_1<i_2<\cdots<i_m\le n} g\big(m^{-1}(Y_{i_1} + \cdots + Y_{i_m}), \theta_0\big). \tag{2.110}
\]
(iii) (Geometric quantiles) For any $u$ such that $|u| < 1$, the $u$-th geometric quantile of Chaudhuri (1996) is defined by taking $f(\theta, x) = |x - \theta| - |x| - u^T\theta$. Note that $u = 0$ corresponds to the L1-median. The arguments given in the proof of Proposition 2.1 remain valid and the representations of Theorems 2.5 and 2.6 hold for these estimates. One can also define the Hodges-Lehmann version of these quantiles and the representations would still hold.
2.8 Exercises
1. Show that (2.8) is minimized at θ = F−1 (p) and is unique if F is strictly
increasing at F−1 (p). Find out all the minimizers if F is not strictly
increasing at F−1 (p).
7. Argue how, in the proof of Theorem 2.3, we can, without loss of generality, assume $\theta_0 = 0$ and $Q(\theta_0) = 0$.
8. Refer to (2.43) and (2.44). Show that $B_n(\alpha) - \epsilon > A_n(\alpha_n)$ for all $\alpha \in \{\alpha : |\alpha - \alpha_n| = 2[\lambda_{\min}(H)]^{-1/2}\epsilon^{1/2}\}$.
10. For the L1 median given in Example 2.9(ii), check that the gradient
vector is indeed given by (2.56).
13. Verify the calculations for the Oja median given in Example 2.14.
14. Formulate conditions for the asymptotic normality of the pth-order Oja-
median.
Chapter 3
Introduction to resampling
3.1 Introduction
In the previous two chapters we have seen many examples of statistical pa-
rameters and their estimates. In general suppose there is a parameter of
interest θ and observable data Y = (Y1 , . . . , Yn ). The steps for statistical
inference can be divided into three broad issues.
(II) Given an estimator θ̂n of θ (that is, a function of Y) how good is this
estimator?
(III) How do we obtain confidence sets, test hypotheses and settle other such questions of inference about θ?
\[
\hat{\sigma}^2 = (n-1)^{-1}\sum_{i=1}^{n}\big(Y_i - \hat{\theta}_n\big)^2.
\]
Using considerable ingenuity, W. S. Gosset, who wrote under the pen name Student, obtained the exact sampling distribution of $T_n = n^{1/2}(\hat{\theta}_n - \theta)/\hat{\sigma}$ (see Student (1908)). This distribution is now known as Student's $t$-distribution with $(n-1)$ degrees of freedom.
If the variables are not i.i.d. normal, the above result does not hold. More importantly, it is typically impossible to find a closed form formula for the distribution of $T_n$. While it is possible to obtain the sampling distribution of many other statistics, there is no general solution. In particular, the sampling distributions of the $U$-statistics and the Mm-estimates that we discussed in Chapters 1 and 2 are completely intractable in general.
Nevertheless, asymptotic solutions are available. It is often possible to
suitably center and scale the estimator θ̂n , which then converges in distribu-
tion. In Chapters 1 and 2 we have seen numerous instances of this where
convergence happens to the normal distribution.
We have also seen in Chapter 2 how asymptotic normality was established
by a weak representation result, so that the leading term of the centered
statistic is a sum of i.i.d. variables and thus the usual CLT can be applied.
This linearization was achieved by expending considerable technical effort.
There still remain two noteworthy issues.
First, such a linearization may not be easily available for many estimates and the limit distribution need not be normal. Second, even if
\[
a_n(\hat{\theta}_n - \theta) \xrightarrow{D} N(0, V)
\]
holds, it may not be easy to find or estimate the asymptotic variance $V$. The sample
median as an estimator of the population median is one such case, since the
asymptotic variance depends on the true probability density value at the
unknown true population median.
This is where resampling comes in. It attempts to replace analytic deriva-
tions with the force of computations. We will introduce some of the popular
resampling techniques and their properties in Section 3.5 below, but before
that, in Section 3.2 we set the stage with three classical examples of problems
where we may study statistical inference using the (i) finite sample exact
distribution approach if available, (ii) asymptotics-driven approach, and (iii)
resampling-based approach. Our discussion of the basic ideas of resampling is centered around these three examples. In Section 3.3 we define the notion
of consistency of resampling plans in estimating the variance and the entire
sampling distribution.
Then we introduce the quick and easy resampling technique, the jackknife, which is aimed primarily at estimating the bias and variance of a statistic. The bootstrap is introduced in the context of estimating the sampling properties of the sample mean in Section 3.4.1. We also introduce the Singh property, which shows in what sense and how the bootstrap can produce better estimates than an asymptotics-based method. This is followed by a discussion of resampling for the sample median in Section 3.4.3. After some discussion of the principles and features of resampling in general in Section 3.3.2, we present in Section 3.5 several resampling methods that have been developed for use in linear regression. In Chapter 4 we will focus on resampling for U-statistics and Mm-estimates.
tions.
Resampling techniques prove their worth more when we have non-i.i.d. data, since calculation of properties of sampling distributions becomes more complicated as we move away from the i.i.d. structure. The simplest example
of a non-i.i.d. model is the linear regression and that is our third bench-
mark example. We shall see later that there are many eminently reasonable
resampling techniques available for such non-i.i.d. models.
Now instead of the normalized statistic $Z_n$, we may use the Studentized statistic
\[
T_n = n^{1/2}(\hat{\theta}_n - \theta)/\hat{\sigma}.
\]
\[
\mathcal{L}_{T_n} \xrightarrow{D} N(0, 1).
\]
We shall see later that even in this basic case, appropriate resampling techniques can assure better accuracy than offered by the normal approximation.
Example 3.2 (The median): Suppose the data is as in Example 3.1. However, the parameter of interest is now the median $\xi \in \mathbb{R}$. Recall that for any distribution $F$, and for any $\alpha \in (0, 1)$, the $\alpha$-th quantile of $F$ is defined as
\[
F^{-1}(\alpha) = \inf\{x \in \mathbb{R} : F(x) \ge \alpha\}.
\]
In order to use this result, an estimate of f (ξ) is required. Note that this
is a non-trivial problem since the density f is unknown and that forces us
to enter the realm of density estimation. This estimation also adds an extra
error when using the asymptotic normal approximation (3.2) for inference.
We shall see later that when we use an appropriate resampling technique,
this additional estimation step is completely avoided.
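To make this point concrete, the following R sketch estimates the variance of the sample median in two ways: via the asymptotic formula, which needs the unknown density at the median, and via a simple bootstrap, which does not; the sample size, number of resamples and names are illustrative, and the "asymptotic" line cheats by using the true density for reference.

  set.seed(9)
  y <- rnorm(100)                            # observed data; true median xi = 0
  B <- 2000
  med.boot <- replicate(B, median(sample(y, replace = TRUE)))
  var.boot <- var(med.boot)                  # bootstrap estimate of Var(median)
  var.asym <- 1 / (4 * 100 * dnorm(0)^2)     # 1 / (4 n f(xi)^2), using the true density
  c(bootstrap = var.boot, asymptotic = var.asym)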
Example 3.3(Simple linear regression): Suppose the data is {(Yi , xi ), i =
1, . . . , n}, where x1 , . . . , xn is a sequence of known constants. Consider the
simple linear regression model
Yi = β1 + β2 xi + ei . (3.3)
We assume that the error or noise terms e1 , . . . , en are i.i.d. from some dis-
tribution F, with Ee1 = 0, and Ve1 = σ 2 < ∞.
The random variable Yi is the i-th response, while xi is often called the
i-th covariate. In the above, we considered the case where the xi ’s are non-
random, but random variables may also be used as covariates with minor
differences in technical conditions that we discuss later. The above simple
linear regression has one slope parameter β2 and the intercept parameter β1 ,
and is a special case of the multiple linear regression, where several covariates
with their own slope parameters may be considered.
It is convenient to express the multiple regression model in a linear alge-
braic notation. We establish some convenient notations first.
Y = Xβ + e. (3.4)
The simple linear regression (3.3) can be seen as a special case of (3.4), with
the choice of p = 2, xi1 = 1 and xi2 = xi for i = 1, . . . , n.
If $F$ is the Normal distribution, i.e., if $e_1, \ldots, e_n$ are i.i.d. $N(0, \sigma^2)$, then we have the Gauss-Markov model. This is the most well-studied model for
linear regression, and exact inference is tractable in this case. For example,
if β = (β1 , . . . , βp )T ∈ Rp is the primary parameter of interest, it can be
estimated by the maximum likelihood method using the normality assumption,
and the sampling distribution of the resulting estimator can be described.
\[
\bar{Y} = n^{-1}\sum_{i=1}^{n} Y_i, \qquad \bar{x} = n^{-1}\sum_{i=1}^{n} x_i.
\]
The above exact distribution may be used for inference when σ 2 is known.
Even when it is not known, an exact distribution can be obtained, when we
use the estimate of σ 2 given in (3.6).
The vector of residuals from the above multiple linear regression model
fitting is defined as
r = Y − X β̂.
\[
\hat{\sigma}^2 = (n-p)^{-1}\sum_{i=1}^{n} r_i^2. \tag{3.6}
\]
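In R, the quantities $\hat{\beta}$, $r$ and $\hat{\sigma}^2$ of (3.6) are readily obtained from a fitted linear model. The sketch below does this for simulated data from the simple linear regression (3.3); the data-generating values and names are arbitrary.

  set.seed(13)
  n <- 50
  x <- runif(n); e <- rnorm(n, sd = 0.5)
  y <- 1 + 2 * x + e                          # beta1 = 1, beta2 = 2
  fit  <- lm(y ~ x)
  beta <- coef(fit)                           # hat(beta)
  r    <- resid(fit)                          # residual vector r = Y - X hat(beta)
  sig2 <- sum(r^2) / (n - 2)                  # (3.6) with p = 2; equals summary(fit)$sigma^2
  c(beta, sigma2.hat = sig2)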
Suppose that the errors e1 , . . . , en are i.i.d. from some unknown distribu-
tion F with mean zero and finite variance σ 2 , which is also unknown. We
assume that X is of full column rank, and that
n−1 X T X → V as n → ∞ (3.7)
See for example, Freedman (1981) where (among other places) similar results
are presented and discussed. He also discussed the results for the case where
the covariates are random.
We may also obtain the CLT based approximation of the distribution $\mathcal{L}_{T_n}$ of the Studentized statistic:
\[
T_n = \hat{\sigma}^{-1}\big(X^TX\big)^{1/2}\big(\hat{\beta} - \beta\big), \tag{3.9}
\]
\[
\mathcal{L}_{T_n} \xrightarrow{D} N_p\big(0, I_p\big). \tag{3.10}
\]
Variations of the technical conditions are also possible. For example, Shao and Tu (1995) discuss the asymptotic normality of $\hat{\beta}$ using the conditions
\[
X^TX \to \infty, \quad \max_i x_i^T\big(X^TX\big)^{-1}x_i \to 0 \ \text{ as } n \to \infty, \quad\text{and}\quad E|e_1|^{2+\delta} < \infty \ \text{ for some } \delta > 0. \tag{3.11}
\]
n
Yi − xi β .
i=1
If the above convergence does not hold, we say that the estimator is variance
inconsistent.
If the above convergence does not hold, we say that the estimator is distri-
butionally inconsistent.
Note that the quantity $\sup_x\big|\hat{\mathcal{L}}_n(x) - \mathcal{L}_n(x)\big|$ defines a distance metric between the distributions $\hat{\mathcal{L}}_n$ and $\mathcal{L}_n$, and we will use this metric several times in this chapter.
Clearly, conditional on the data, V̂n and L̂n are random objects, the ran-
domness coming from the resampling scheme used to derive the estimate.
There are myriad notions of such resampling estimates. We now proceed to
introduce some of the more important and basic resampling schemes.
It is implicitly assumed that the functional form of Tn is such that all the T(i) ’s
are well defined. Let us define T̄ = n^{-1} Σ_{i=1}^n T(i). The jackknife estimator of
the bias ETn − θ is defined as
(n − 1)(T̄ − Tn).
The jackknife estimator of the variance of Tn is
V̂nJ = (n − 1) n^{-1} Σ_{i=1}^n (T(i) − T̄)².    (3.13)
Note that, leaving aside the factor (n − 1), this may be considered to be
the variance of the empirical distribution of T(i) , 1 ≤ i ≤ n. That is the
source of the resampling randomness here.
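A minimal R sketch of the delete-one jackknife bias and variance estimates above, for a generic statistic; the function and variable names here are ours, for illustration only.

jackknife = function(y, stat){
  n = length(y)
  Tn = stat(y)
  T.i = sapply(1:n, function(i) stat(y[-i]))          # leave-one-out statistics T_(i)
  T.bar = mean(T.i)
  list(bias = (n - 1) * (T.bar - Tn),                  # jackknife bias estimate
       variance = (n - 1) / n * sum((T.i - T.bar)^2))  # variance estimate (3.13)
}
set.seed(2)
y = rexp(30)
print(jackknife(y, mean))    # for the sample mean this equals s^2/n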
Miller (1964) studied the consistency properties of the jackknife estimator V̂nJ.
A jackknife estimator of the sampling distribution of Tn is
F̂J(x) = n^{-1} Σ_{i=1}^n I{√(n(n−1)) (T(i) − Tn) ≤ x}.
It will be seen later in Section 3.5 that the different delete-d jackknives are
special cases of a suite of resampling methods called the generalized bootstrap.
The consistency results for the various jackknives can be derived from the
properties of the generalized bootstrap.
FnB(x) = n^{-1} Σ_{i=1}^n I{Yib ≤ x},
L̂Tnb(x) = B^{-1} Σ_{b=1}^B I{Tnb ≤ x}.
Thus the bootstrap was a breakthrough in two aspects: it replaces analytical
derivations by computation, and, as we shall see, it can be more accurate than
the classical normal approximation. Let
Yb = (Y1b , . . . , Ynb )^T ∈ R^n
denote the resample vector. Note that each Yib can be any one of the original
Y1 , . . . , Yn with a probability 1/n. So there may be repetitions in the Yb
series, and chances are high that not all of the original elements of Y will
show up in Yb .
Clearly, conditional on the original data Y, the resample Yb is random,
and if we repeat the SRSWR, we may obtain a completely different Yb re-
sample vector.
Let us define
θ̂nb = n^{-1} Σ_{i=1}^n Yib ,  and  Znb = n^{1/2} (θ̂nb − θ̂n),
and let LZnb be the distribution of Znb given Y. This conditional distribution
is random but depends only on the sample Y. Hence it can be calculated
exactly when we consider all the nn possible choices of the resample vector
Yb . The bootstrap idea is to approximate the distribution of the normalized
Zn = n1/2 (θ̂n − θ) by this conditional distribution. For use later on, we also
define the asymptotically pivotal random variable Z̃n = n1/2 (θ̂n − θ)/σ, and
denote its exact finite sample distribution by LZ̃n .
Note that by the CLT, LZn converges to N (0, σ 2 ) and LZ̃n converges to
the parameter-free distribution N (0, 1) as n → ∞. Further, LZnb is also
the distribution of a standardized partial sum of (conditionally) i.i.d. random
variables. Hence it is not too hard to show that as n → ∞, this also converges
(almost surely) to the N (0, σ 2 ) distribution. One easy proof of this when the
third moment is finite follows from the Berry-Esseen bound given in the next
section. The fundamental bootstrap result is that,
sup_x |LZnb(x) − LZn(x)| → 0, almost surely.    (3.14)
Suppose that σ 2 is known. From the above result, either LZnb or the
CLT-based N (0, σ 2 ) distribution may be used as an approximation for the
unknown sampling distribution LZn for obtaining confidence intervals or con-
ducting hypothesis tests. Unfortunately, it turns out that the accuracy of the
approximation (3.14) is the same as that of the normal approximation (3.1).
Thus apparently no gain has been achieved.
In a vast number of practical problems and real data applications the
variance σ 2 is unknown, and we now consider that case. In the classical
frequentist statistical approach, we may obtain a consistent estimator of σ 2 ,
say σ̂ 2 , and use it as a plug-in quantity for eventual inference. An unbiased
estimator of σ 2 is
σ̂u² = (n − 1)^{-1} Σ_{i=1}^n (Yi − θ̂n)²,
and the Studentized statistic Tn, obtained from Z̃n by replacing σ with its estimate, satisfies
LTn →_D N(0, 1).
Instead of the standard Normal quantile, a t_{df=n−1} quantile is often used, which
makes little practical difference when n is large, yet accommodates the hope
of being exact in case the data are i.i.d. N(θ, σ²). However, questions remain
about how accurate intervals like (3.15) are. We will address these issues in
the next few pages.
Another common estimate of σ² is
σ̂n² = n^{-1} Σ_{i=1}^n (Yi − θ̂n)².
The bootstrap Studentized statistic Tnb is formed from the resample Yb in the same way as Tn is formed from Y,
and we denote its distribution by LTnb . A major discovery in the early days
of the bootstrap, which greatly contributed to the flourishing of this topic, is
that LTnb can be a better approximation for LZ̃n compared to N(0, 1). This
in turn leads to the fact that a bootstrap-based one-sided (1 − α) confidence
interval for θ can be orders of magnitude more accurate than (3.15). We
discuss these aspects in greater detail in Section 3.4.2.
We now briefly discuss the computation aspect of this approach. Note
that there are nn possible values of Yb . See Hall (1992), Appendix I for
details on distribution of possible repetitions of {Yi } in Yb . Hence finding
the exact conditional distribution of Znb involves evaluating it for all these
values. This is computationally infeasible even when n is moderate, hence
a Monte Carlo scheme is regularly used. Suppose we repeat the process of
getting Yb several times b = 1, 2, . . . , B, and get θ̂nb and Znb for 1 ≤ b ≤ B.
Define the e.c.d.f. of these Zn1 , . . . , ZnB :
L̂Znb(·) = B^{-1} Σ_{b=1}^B I{Znb ≤ ·}.    (3.16)
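The Monte Carlo scheme leading to (3.16) takes only a few lines of R; here is an illustrative sketch with simulated data (our own names, not the book's code).

set.seed(3)
n = 40
Y = rgamma(n, shape = 2)                 # original sample
theta.hat = mean(Y)
B = 2000
Znb = replicate(B, {
  Yb = sample(Y, n, replace = TRUE)      # SRSWR resample Y_b
  sqrt(n) * (mean(Yb) - theta.hat)       # Z_nb
})
L.hat = ecdf(Znb)                        # e.c.d.f. as in (3.16)
print(c(bootstrap = L.hat(1), CLT = pnorm(1, sd = sd(Y))))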
(b) (Bentkus and Götze (1996)) Suppose Y1 , . . . , Yn are i.i.d. from a distri-
bution F with EY1 = 0, 0 < VY1 = σ 2 < ∞. Let μ3 = E|Y1 |3 . Define
Ȳ = n^{-1} Σ_{i=1}^n Yi ,
σ̂² = n^{-1} Σ_{i=1}^n (Yi − Ȳ)²,  and  Tn = Ȳ/σ̂.
Then there exists an absolute constant C > 0 such that for all n ≥ 2
sup_x |P(√n Tn < x) − Φ(x)| ≤ C μ3 / (σ³ √n).    (3.20)
The main essence of Theorem 3.1 is that for Zn or Tn , n1/2 times the
absolute difference between the actual sampling distribution and the Normal
distribution is upper bounded by a finite constant. Thus, in using the normal
approximation, we make an error of O(n−1/2 ). It is also known, and can be
easily verified by using i.i.d. Bernoulli variables, that the rate n−1/2 cannot
be improved in general. This puts a limit on the accuracy of the normal
approximation for the normalized and the Studentized mean. We note in passing that
such results have been obtained in many other, more complex and challenging,
non-i.i.d. models and for many other statistics.
Assuming that the third moment of the distribution is finite, we can apply
the Berry-Esseen bound (3.19) to Znb along with that for Zn, and this implies
(3.14) mentioned earlier. However, at this point it is still not clear which is a
better approximation for LZn: N(0, 1) or LZnb?
We shall now show that there is a crucial difference between dealing with
a normalized statistic and a Studentized statistic. Basically, for a normalized
statistic there is no gain in bootstrapping. However, under suitable condi-
tions, LTnb is a better estimator for LZ̃n compared to N (0, 1). This is known
as the Singh property. This property is now known to hold in many other
models and statistics but we shall restrict ourselves to only Tnb. We need an
Edgeworth expansion for LZ̃n, which is given in the following theorem.
Theorem 3.2. Suppose Yi are i.i.d. with distribution F that has mean 0,
variance σ 2 and finite third moment μ3 .
(a) If F is lattice with span h, then uniformly in x,
LZ̃n(x) = Φ(x) + μ3(1 − x²)/(6σ³ n^{1/2}) φ(x) + h/(6σ³ n^{1/2}) g(n^{1/2} σ h^{-1} x) φ(x) + o(n^{-1/2}).    (3.21)
(b) If F is non-lattice, then uniformly in x,
LZ̃n(x) = Φ(x) + μ3(1 − x²)/(6σ³ n^{1/2}) φ(x) + o(n^{-1/2}).    (3.22)
These bounds show that the order O(n−1/2 ) in the Berry-Esseen theorem
is sharp and provide additional information on the leading error term un-
der additional conditions. Note that the nature of the leading error term is
different in the lattice and non-lattice cases.
Now, if we could obtain similar expansions for LZnb then we could use
the two sets of expansions to compare the two distributions. This is a non-
trivial issue. Note that for any fixed n, the bootstrap distribution of Y1b ,
being the e.c.d.f. of Y1 , . . . , Yn , is necessarily a discrete distribution. When F
is lattice (non-lattice), the bootstrap distribution may or may not be lattice
(respectively non-lattice). However, it should behave as a lattice (respectively
non-lattice) especially when n is large.
In an extremely remarkable work, the following result was proved by Singh
(1981). For a detailed exposition on Edgeworth expansions in the context of
bootstrap, see Bhattacharya and Qumsiyeh (1989), Hall (1992) and Bose and
Babu (1991).
Theorem 3.3 (Singh (1981)). Suppose Yi are i.i.d. with distribution F which
has mean 0, variance σ 2 and finite third moment μ3 . Then
(a) lim sup_{n→∞} E|Y1|³ σ^{-3} n^{1/2} sup_x |LZ̃n(x) − LTnb(x)| ≤ 2C0 , almost surely,
(b) If F is lattice with span h, then, almost surely, uniformly in x,
LTnb(x) = Φ(x) + μ3(1 − x²)/(6σ³ n^{1/2}) φ(x) + h/(6σ³ n^{1/2}) g(n^{1/2} σ̂n h^{-1} x) φ(x) + o(n^{-1/2}).
Consequently,
lim sup_{n→∞} n^{1/2} sup_x |LZ̃n(x) − LTnb(x)| = h/√(2πσ²), almost surely.
(c) If F is non-lattice, then, almost surely, uniformly in x,
LTnb(x) = Φ(x) + μ3(1 − x²)/(6σ³ n^{1/2}) φ(x) + o(n^{-1/2}).
Consequently,
n^{1/2} sup_x |LZ̃n(x) − LTnb(x)| → 0, almost surely.
Part (a) shows that the difference between LTnb and LZ̃n has an upper
bound of order O(n−1/2 ), the same as the difference between LZn and the
normal approximation. Thus there may be no improvement in using LTnb .
Part (b) shows that when the parent distribution F is lattice, there is
no improvement in using LTnb to approximate LZ̃n compared to the normal
approximation.
However, part (c) is most interesting. It implies that using the asymptoti-
cally pivotal Studentized statistic Tnb is extremely fruitful, and the bootstrap
distribution LTnb is a better estimator of LZ̃n compared to the N (0, 1) ap-
proximation.
This is the higher order accuracy or Singh Property. More complex for-
mulations and clever manipulations can result in even higher order terms be-
ing properly emulated by the bootstrap, see Abramovitch and Singh (1985).
Moreover, such higher order accuracy results have been proved in many other
set-ups, including Studentized versions of various U -statistics. The use of
Edgeworth expansions in the context of the bootstrap has been explored in
detail in Hall (1992), where corresponding results for many (asymptotically)
pivotal random variables of interest may be found.
The sharper approximation has direct consequence in inference. Consider
the problem of getting a one-sided (1−α) confidence interval of θ, based on the
data Y1 , . . . , Yn from some unknown distribution F with mean θ. We assume
that F is non-lattice, and that the variance σ 2 is known. We discuss only
the unbounded left-tail version, where the interval is of the form (−∞, Rn,α )
for some statistic Rn,α. The CLT-based estimator for this is (−∞, θ̂n +
n^{-1/2} σ z_{1−α}]. From (3.22), we can compute that this interval has a O(n^{-1/2})
coverage error, that is,
P(θ ∈ (−∞, θ̂n + n^{-1/2} σ z_{1−α}]) = 1 − α + O(n^{-1/2}).
On the other hand, using Theorem 3.3(c), if tα,b is the α-th quantile of
Tnb, that is, if P(Tnb ≤ tα,b) = α, we have
1 − α + O(n^{-1}) = P(Z̃n ≥ tα,b)
                 = P(θ̂n − θ ≥ n^{-1/2} σ tα,b)
                 = P(θ ∈ (−∞, θ̂n − n^{-1/2} σ tα,b]).
Thus, we obtain that the bootstrap-based confidence interval (−∞, θ̂n −
n^{-1/2} σ tα,b] is O(n^{-1}) accurate almost surely.
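The following R sketch (ours, for illustration) contrasts the CLT-based one-sided interval with the bootstrap-t interval just described, Studentizing within each resample since σ is typically unknown in practice.

set.seed(4)
n = 30; alpha = 0.05
Y = rexp(n) + 1                          # skewed data with mean 2
theta.hat = mean(Y); s = sd(Y)
B = 2000
Tnb = replicate(B, {
  Yb = sample(Y, n, replace = TRUE)
  sqrt(n) * (mean(Yb) - theta.hat) / sd(Yb)    # Studentized bootstrap statistic
})
t.alpha = quantile(Tnb, alpha)                 # quantile t_{alpha,b}
ci.clt  = c(-Inf, theta.hat + qnorm(1 - alpha) * s / sqrt(n))
ci.boot = c(-Inf, theta.hat - t.alpha * s / sqrt(n))
print(rbind(ci.clt, ci.boot))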
The above discussion was for the case of one-sided confidence intervals,
when σ is known. Results are also available for the case when σ is unknown,
and it can be shown that the interval in (3.15) has a coverage error of O(n^{-1/2}),
while the corresponding bootstrap interval has O(n^{-1}) coverage error. Similar
results are available for two-sided intervals, both when σ is known and when
it is unknown. The accuracy of the coverage is different from the one-sided
case, but the bootstrap intervals are still more accurate than the traditional
intervals by a factor of n^{-1/2}. We do not discuss the details here since several
technical tools need to be developed for that. Many such details may be found
in Hall (1992).
It can be seen from the above discussion that the existence and ready
usability of an asymptotically pivotal random variable is critically important
in obtaining the higher-order accuracy of the bootstrap estimator. Thus,
Studentization is typically a crucial step in obtaining the Singh Property.
Details on the Studentization and bootstrap-based inference with the Singh
property are given in Babu and Singh (1983, 1984, 1985); Hall (1986, 1988)
and in several other places.
For the sample median ξ̂n, with ξ the population median and f the density of F,
a bootstrap variance estimator based on the bootstrap medians ξ̂nb is
V̂nB = B^{-1} Σ_{b=1}^B (ξ̂nb − ξ̂n)²,
and it can be shown that
4f²(ξ) V̂nB → 1 almost surely.
This bootstrap can also be used to estimate the entire sampling distribution Ln
of the normalized sample median, for which
Ln →_D N(0, 1/(4f²(ξ))).
If L̂n denotes the corresponding bootstrap estimate of Ln, then
lim sup_{n→∞} n^{1/4} (log log n)^{1/2} sup_x |L̂n(x) − Ln(x)| = CF almost surely.
However, note that the accuracy of the above approximation is of the order
O(n−1/4 (log log n)−1/2 ), which is quite low. Other resampling schemes have
been studied in this context in Falk and Reiss (1989); Hall and Martin (1991);
Falk (1992). We omit the details here.
We now return to the simple linear regression model (3.3), fitted by least squares. Define the residuals
ri = Yi − β̂1 − β̂2 xi .
In the residual bootstrap, for each b = 1, . . . , B we obtain {rib , i = 1, . . . , n} as an SRSWR sample from {ri , i = 1, . . . , n}, and set Yib = β̂1 + β̂2 xi + rib .
A slight modification of the above needs to be done for the case where the
linear regression model is fitted without the intercept term β1 . In that case,
n
define r̄ = n−1 i=1 ri and for every b ∈ {1, . . . , B}, we obtain {rib , i =
1, . . . , n} as an i.i.d. sample from {ri − r̄, i = 1, . . . , n}. That is, they are an
SRSWR from the centered residuals. Note that when an intercept term is
present in the model, r̄ = 0 almost surely and hence no centering was needed.
Then for every b = 1, . . . , B, we obtain the bootstrap β̂ b by minimizing
Σ_{i=1}^n (Yib − β1 − β2 xi)².
Suppose rib = Yib − β̂b1 − β̂b2 xi are the residuals at the b-th bootstrap step.
Define
σ̂b² = (n − 2)^{-1} Σ_{i=1}^n (Yib − β̂b1 − β̂b2 xi)² = (n − 2)^{-1} Σ_{i=1}^n rib².
This is similar to the noise variance estimator σ̂ 2 based on the original data.
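A minimal R sketch of the residual bootstrap described above, with simulated data and illustrative names of our own.

set.seed(5)
n = 60
x = runif(n, 0, 10)
y = 1 + 0.8 * x + rt(n, df = 5)          # non-normal errors
fit = lm(y ~ x)
r = residuals(fit); yhat = fitted(fit)
B = 1000
beta.boot = t(replicate(B, {
  rb = sample(r, n, replace = TRUE)      # SRSWR from the residuals
  yb = yhat + rb                         # bootstrap responses
  coef(lm(yb ~ x))                       # bootstrap estimate betahat_b
}))
print(apply(beta.boot, 2, sd))           # resampling standard errors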
In order to state the results for the distributional consistency of the above
procedure, we use a measure of distance between distribution functions. Sup-
pose Fr,p is the space of probability distribution functions on Rp that have
finite r-th moment, r ≥ 1. That is,
Fr,p = G: ||x||r dG(x) < ∞ .
x∈Rp
For H, G ∈ Fr,p, the Mallows distance is defined as ρr(H, G) = inf (E||X − Y||^r)^{1/r},
where the infimum is taken over TX,Y, the collection of all possible joint distributions of (X, Y) whose
marginal distributions are H and G respectively (Mallows, 1972). In a slight
abuse of notation, we may also write the above as ρr(X, Y).
Consider either the normalized residual bootstrap statistic Znb or its Studentized version Tnb. It can be shown that, conditional on the data, their distributions converge almost surely to Normal limits.
Coupled with the fact that the normalized and Studentized statistic that
were formed using the original estimator β̂ converge to the same limiting
distributions (see (3.8), (3.10)), this shows that the residual bootstrap is
consistent for both Tn and Zn. In practice, a Monte Carlo approach is taken,
and the e.c.d.f. of {Znb , b = 1, . . . , B} or of {Tnb , b = 1, . . . , B} is used for
bootstrap inference with a (large) choice of B.
A different scheme is to resample entire data-pairs: draw {(Yib , xib), i = 1, . . . , n}
as an SRSWR sample from {(Yi , xi), i = 1, . . . , n}, and obtain β̂b by minimizing
Σ_{i=1}^n (Yib − β1 − β2 xib)².
This bootstrap is known as the paired bootstrap. Assume that (Yi , xi ) are
i.i.d. with E||(Yi , xi )||4 < ∞, V(xi ) ∈ (0, ∞), E(ei |xi ) = 0 almost surely. Let
1/2
the distribution of Tnb = σ̂b−1 X T X (β̂ b − β̂), conditional on the data,
be LTnb . Freedman (1981) proved that LTnb → Np (0, Ip ) almost surely. This
establishes the distributional consistency of the paired bootstrap for Tn .
It can actually be shown that under very standard regularity conditions,
the paired bootstrap is distributionally consistent even in many cases where
the explanatory variables are random, and when the errors are heteroscedas-
tic. It remains consistent in multiple linear regression even when the dimension
p increases with the sample size n. These de-
tails follow from the fact that the paired bootstrap is a special case of the
generalized bootstrap described later, for which corresponding results were es-
tablished in Chatterjee and Bose (2005). However, there are additional steps
needed before the Singh property can be claimed for this resampling scheme.
Resampling rows of the data, as the paired bootstrap does, may produce
resamples where the design matrix is not of full column rank. However, such
cases happen with exponentially small probability (Chatterjee and Bose, 2000),
and may be ignored during computation.
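The paired bootstrap is equally easy to code; the sketch below reuses y, x, n and B from the residual bootstrap sketch above.

paired.boot = t(replicate(B, {
  idx = sample(1:n, n, replace = TRUE)   # resample (Y_i, x_i) pairs
  coef(lm(y[idx] ~ x[idx]))
}))
print(apply(paired.boot, 2, sd))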
In the wild bootstrap, we set Yib = β̂1 + β̂2 xi + ri Wib , where the Wib are external
random variables, generated independently of the data, with mean zero and unit variance.
We obtain β̂b by minimizing
Σ_{i=1}^n (Yib − β1 − β2 xi)².
This is known as the wild or the external bootstrap. Under (3.11), Shao
and Tu (1995) established the distributional consistency of this resampling
scheme.
When random regressors are in use, Mammen (1993) established the dis-
tributional consistency for both the paired and the wild bootstrap under very
general conditions.
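A sketch of the wild bootstrap, again reusing y, x, n and B from the earlier sketches; standard normal external weights are used here as one common (and here assumed) choice.

fit = lm(y ~ x); r = residuals(fit); yhat = fitted(fit)
wild.boot = t(replicate(B, {
  W = rnorm(n)                 # external weights with mean 0, variance 1
  yb = yhat + r * W            # wild bootstrap responses
  coef(lm(yb ~ x))
}))
print(apply(wild.boot, 2, sd))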
In the generalized bootstrap (GBS), β̂b is obtained by minimizing
Σ_{i=1}^n Wib (Yi − β1 − β2 xi)².
Here {W1b , . . . , Wnb } are a set of random weights, and the properties of the
GBS are entirely controlled by the distribution of the n-dimensional vector
Wnb = (W1b , . . . , Wnb ). We discuss special cases below, several of which were
first formally listed in Præstgaard and Wellner (1993). We omit much of the
details, and just list the essential elements of the resampling methodology.
Let Πn be the n-dimensional vector all of whose elements are 1/n; thus Πn =
(1/n) 1n ∈ R^n, where 1n is the n-dimensional vector of all 1’s.
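One GBS replicate is simply a weighted least squares fit; the sketch below (reusing y, x, n and B from above) uses i.i.d. exponential weights scaled to sum to n, one possible choice of Wnb that we assume here for illustration.

gbs.boot = t(replicate(B, {
  E = rexp(n)
  W = n * E / sum(E)              # random weights with sum n
  coef(lm(y ~ x, weights = W))    # minimizes sum_i W_i (y_i - b1 - b2 x_i)^2
}))
print(apply(gbs.boot, 2, sd))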
3.6 Exercises
1. Show that if Tn is the sample mean as in Example 3.1, then its jackknife
variance estimate is the same as the traditional unbiased variance estimator
given in (1.4).
2. Show that the two expressions for the jackknife variance estimator V̂nJ given in this chapter are equal.
3. Show that if Tn is the sample mean as in Example 3.1, then its naive
bootstrap variance estimate is
n^{-2} Σ_{i=1}^n (Yi − θ̂n)².
4. Suppose Y1b , . . . , Ynb is a bootstrap resample drawn by SRSWR from Y1 , . . . , Yn .
(a) What is the probability that at least one of the original values
Y1 , Y2 , . . . , Yn does not show up in the resample values Y1b , . . . , Ynb ?
(b) Compute for k = 1, . . . , n, the probability that exactly k of the
original values Y1 , Y2 , . . . , Yn show up in the collection of resample
values Y1b , . . . , Ynb .
(c) Comment on the cases k = 1 and k = n.
Resampling U -statistics and M -estimators
4.1 Introduction
Consider a non-degenerate U -statistic Un with a symmetric kernel h of degree
m = 2, and let Unb denote its bootstrap version computed from an SRSWR
resample of the data. Suppose that E|h(Y1 , Y2 )|² < ∞, E|h(Y1 , Y1 )|² < ∞, and
∫ h(x, y) dF(y) is not a constant. Then the distribution of Unb conditional
on the data is a consistent estimator of the distribution of Un with τn = 1,
that is, almost surely,
sup_{x∈R} |LUn(x) − LUnb(x)| → 0.
Later, Helmers (1991) proved the Singh property of the bootstrap approximation
for the distribution of a Studentized non-degenerate U -statistic of
degree 2. Suppose the kernel satisfies Eh(Y1 , Y2 ) = θ, the corresponding
U -statistic is
Un = \binom{n}{2}^{-1} Σ_{1≤i<j≤n} h(Yi , Yj),
and
Sn² = 4(n − 1)(n − 2)^{-2} Σ_{i=1}^n ( (n − 1)^{-1} Σ_{j=1}^n h(Yi , Yj) − Un )²,
For the bootstrap versions, we compute Unb and Snb² using the formulas
for Un and Sn² given above, but replacing the Yi ’s with the Yib ’s. Define
θn = n^{-2} Σ_{i,j=1}^n h(Yi , Yj),
which is very close to Un , except that the h(Yi , Yi ) terms are now included in
the summation, which thus has n² terms. Consequently, θn has the scaling
factor n², comparable to the \binom{n}{2} factor that appears in the denominator of
Un . Using these, we obtain the bootstrap distributional estimator
LTnb(x) = PB( n^{1/2} Snb^{-1} (Unb − θn) ≤ x ),  x ∈ R.
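The following R sketch (ours) computes Un , Sn², θn and the bootstrap Studentized statistic above for the degree-2 kernel h(x, y) = (x − y)²/2, whose U -statistic is the sample variance.

h = function(x, y) (x - y)^2 / 2
Ustat = function(y){
  H = outer(y, y, h)
  n = length(y)
  Un = sum(H[upper.tri(H)]) / choose(n, 2)
  q = rowSums(H) / (n - 1)                        # (n-1)^{-1} sum_j h(y_i, y_j)
  Sn2 = 4 * (n - 1) / (n - 2)^2 * sum((q - Un)^2)
  theta.n = mean(H)                               # n^{-2} sum_{i,j} h(Y_i, Y_j)
  list(Un = Un, Sn2 = Sn2, theta.n = theta.n)
}
set.seed(9)
Y = rlnorm(50); n = length(Y)
orig = Ustat(Y)
Tnb = replicate(1000, {
  Yb = sample(Y, n, replace = TRUE)
  b = Ustat(Yb)
  sqrt(n) * (b$Un - orig$theta.n) / sqrt(b$Sn2)   # bootstrap Studentized statistic
})
print(quantile(Tnb, c(0.05, 0.95)))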
Theorem 4.2 (Helmers (1991)). Suppose that the distribution of the Hoeffding
projection h1 is non-lattice, E|h(Y1 , Y2 )|^{4+ε} < ∞ for some ε > 0, and
E|h(Y1 , Y1 )|³ < ∞. Then the Singh property holds for LTnb as an estimator of
the distribution of the Studentized Un .
Now consider a general U -statistic Un with a symmetric kernel h of degree m. We define its generalized bootstrap (GBS) version as
Unb = \binom{n}{m}^{-1} Σ_{1≤i1<···<im≤n} Wn:i1,...,im h(Yi1 , . . . , Yim).    (4.2)
One choice of the resampling weights is the multiplicative form
Wn:i1,...,im = Π_{j=1}^m Wn:ij.    (4.3)
An alternative is the additive form
Wn:i1,...,im = m^{-1} Σ_{j=1}^m Wn:ij.    (4.5)
In Section 4.4 below we discuss the properties of this particular choice of re-
sampling weights in detail. We will see that using the additive form (4.5) can
lead to significant improvement in computational efficiency, without compro-
mising on the accuracy.
Recall Example 2.2 from Chapter 2, where we showed that all U -statistics
are Mm -estimators. That is, for any kernel h(y1 , . . . , ym ), which is symmetric
in its arguments, the corresponding U -statistic can be obtained as the unique
Mm -estimator when we use the contrast function
f(y1 , . . . , ym , θ) = (θ − h(y1 , . . . , ym))² − h²(y1 , . . . , ym).
In view of this, instead of just presenting the analysis of GBS for U -statistics
given in (4.2), we present in Section 4.5 the full discussion of GBS for generic
Mm -estimators.
The resampling weights Wn:i = Wi are assumed to satisfy the following conditions:
EB W1 = 1,    (4.7)
0 < k < τn² < K,    (4.8)
c11 = O(n^{-1}),    (4.9)
c22 → 1,    (4.10)
sup_n c4 < ∞.    (4.11)
Note that when Wn = (W1 , . . . , Wn) has the Multinomial(n; 1/n, . . . , 1/n)
distribution that links the GBS to Efron’s bootstrap, all the above conditions
are satisfied.
Theorem 4.3. Suppose the kernel h satisfies (4.13)-(4.15) and the resampling
weights are of the form (4.5) where Wn:i satisfy (4.7)-(4.11) and also
Σ_{i=1}^n Wn:i = n.
Then
sup_{x∈R} |LUn(x) − LUnb(x)| → 0 almost surely as n → ∞.    (4.16)
The condition Σ_{i=1}^n Wn:i = n and (4.8) together imply (4.9) and (4.12).
A convergence in probability version of this theorem may also be established
under conditions more relaxed than (4.13)-(4.14).
To prove the theorem we need a CLT for weighted sums of row-wise ex-
changeable variables from Præstgaard and Wellner (1993). The idea behind
this result is Hajek’s classic CLT (Hájek (1961)) for sampling without replace-
ment.
Theorem 4.4. Suppose that for each m, {amj , 1 ≤ j ≤ m} are real constants and {Bmj , 1 ≤ j ≤ m} are exchangeable random variables satisfying
m^{-1} Σ_{j=1}^m (amj − ām)² → σ² > 0,
m^{-1} max_{j=1,...,m} (amj − ām)² → 0,
m^{-1} Σ_{j=1}^m (Bmj − B̄m)² →_P c² > 0, and
lim_{K→∞} lim sup_{m→∞} E[ (Bmj − B̄m)² I{|Bmj − B̄m| > K} ] = 0.
Here,
ām = m^{-1} Σ_{j=1}^m amj  and  B̄m = m^{-1} Σ_{j=1}^m Bmj.
Then
m^{-1/2} Σ_{j=1}^m (amj Bmj − ām B̄m) →_D N(0, c²σ²).    (4.17)
Then
n^{1/2} τn^{-1} (Unb − Un) = n^{1/2} N^{-1} τn^{-1} Σ_s hs (Ws − 1)
= n^{1/2} N^{-1} τn^{-1} Σ_s Σ_{j=1}^m h1(Yij) (Ws − 1) + n^{1/2} N^{-1} θ τn^{-1} Σ_s (Ws − 1) + n^{1/2} N^{-1} τn^{-1} Σ_s gs (Ws − 1)
= T1 + T2 + T3 , say.
Because of the condition Σ_{i=1}^n Wn:i = n we have that Σ_s (Ws − 1) = 0,
so T2 = 0 almost surely.
Using (4.13) and the fact that E( N^{-1} Σ_s gs )⁴ = O(n^{-4}) (see Serfling
(1980), page 188) and (4.9), after some algebra we have that for any δ > 0,
PB( |T3| > δ ) = OP(n^{-2}).
It remains to consider T1 . For this term also, using (4.6), we have that
T1 = n^{-1/2} Σ_{i=1}^n Wi h1(Yi) + rnb ,
where again
PB( |rnb| > δ ) = oP(n^{-2}) for any δ > 0.
We now need to show that the distribution of n^{1/2}(Un − θ) and the boot-
strap distribution of n^{-1/2} m Σ_{i=1}^n Wi h1(Yi) converge to the same limiting
normal distribution. For the original U -statistic, this is the UCLT, Theo-
rem 1.1. For the bootstrap statistic, we use Theorem 4.4 to get the result.
The conditions (4.10) and (4.11) are required in order to satisfy the conditions
of Theorem 4.4. The details are left as an exercise.
For 1 ≤ i ≤ n, let Ũni = \binom{n-1}{m-1}^{-1} Σ_{s: i∈s} h(Yi1 , . . . , Yim) be the average of the kernel over all index sets s that contain i. With the additive weights (4.5), the GBS statistic (4.2) reduces to
Unb = n^{-1} Σ_{i=1}^n Wi Ũni .    (4.18)
We now compare the use of general multiplicative weights and additive weights
for bootstrapping. Since Monte Carlo simulations are an integral part of the
computation of the bootstrap, we assume that B bootstrap iterations are carried
out under both methods.
The remarkable fact is that while using additive weights (4.5), the time
and storage space requirements are reduced simultaneously, and for each boot-
strap step instead of requiring O(nm ) time and space, the requirement is only
O(n).
Proof of Theorem 4.5: (a) Let us assume that for given (y1 , . . . , yn) the
computation of each h(y1 , . . . , ym) takes H units of time. Then it can be seen
that each Ũni requires \binom{n-1}{m-1} H units of time, and once all Ũni ’s have been
computed and stored, Un is easily computed in n steps. Thus an initial
computation of order n^m is required for both bootstrap methods.
However, once the bootstrap weights are generated (without loss we as-
sume these to be generated in O(n) time), for the additive weights case only
2n more steps are needed, whereas for the general resampling weights case
the number of steps needed is m\binom{n}{m}, assuming all the h(Yi1 , . . . , Yim) are
stored. Thus for each bootstrap iteration the time complexities are O(n) and
O(n^m) respectively for the two methods. This completes the proof of part
(a).
Part (b) is easily proved by observing that when we use additive weights
only the Ũni defined in (4.18) need to be stored, whereas when we use general
resampling weights all the h(Yi1 , . . . , Yim ) have to be stored.
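A sketch in R of the computational point just proved, for a degree-2 kernel: the Ũni are computed once, after which each additive-weight GBS replicate in (4.18) costs only O(n).

h = function(x, y) (x - y)^2 / 2
set.seed(10)
Y = rnorm(100); n = length(Y)
H = outer(Y, Y, h)                            # one-time O(n^2) work
U.tilde = (rowSums(H) - diag(H)) / (n - 1)    # average of h over index sets containing i
Un = mean(U.tilde)                            # equals the U-statistic Un
Unb = replicate(2000, {
  W = as.vector(rmultinom(1, n, rep(1, n)))   # multinomial weights, sum n
  mean(W * U.tilde)                           # (4.18): O(n) per replicate
})
print(sd(sqrt(n) * (Unb - Un)))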
(II) (4.19) exists and is finite for all θ, that is, Q(θ) is well defined.
Proof of Theorem 4.6: In order to prove this theorem, we use the trian-
gulation Lemma of Niemiro (1992) which is also given in Chapter 2. Without
loss of generality, let θ0 = 0 and Q(θ0 ) = 0. Now define
Then we have
Proof of (4.29). Fix an M > 0. For fixed δ∗ > 0, get δ > 0 and ε > 0 such
that M1 = M + (2δ)^{1/2} and δ∗ > 5M1 λmax(H) δ + 3ε, where λmax(H) is the
maximum eigenvalue of H. Consider the set A1 = {θ : ||θ|| ≤ M1} and let
B = {b1 , . . . , bN} be a finite δ-triangulation of A1 . Note that with
≤ PB[ sup_{θ∈B} τn^{-1} | Σ_i Xnbi − θ^T Hθ/2 | > ε ]    (4.31)
≤ Σ_{j=1}^N PB[ τn^{-1} | Σ_i Wi Xni(bj) − bj^T H bj/2 | > ε ]
≤ Σ_{j=1}^N PB[ | Σ_i Wi Xni(bj) | > ε/2 ] + Σ_{j=1}^N I{ τn^{-1} | Σ_i Xni(bj) − bj^T H bj/2 | > ε/2 }
≤ k Σ_{j=1}^N Σ_{i=1}^n Xni²(bj) + Σ_{j=1}^N I{ τn^{-1} | Σ_i Xni(bj) − bj^T H bj/2 | > ε/2 }    (4.32)
= oP(1).    (4.33)
In the above calculations (4.31) follows from the triangulation Lemma 2.2
given in Chapter 2. Observe that in (4.32) the index j runs over finitely many
points, so it is enough to show the probability rate (4.33) for each fixed b. It
has been proved by Niemiro (1992) (see page 1522) that for fixed b,
Σ_i Xni²(b) = oP(1)  and  Σ_i Xni(b) − b^T H b/2 = oP(1).    (4.34)
Hence (4.33) follows by using the lower bound in (4.8). This proves (4.29).
Proof of (4.30). Note that
PB[ ||n^{-1/2} Snb|| > M ] ≤ (1/(M²n)) EB || Σ_{i=1}^n Wi g(Yi ; 0) ||²
≤ (2/(M²n)) [ τn^{-2} EB || Σ_{i=1}^n Wi g(Yi ; 0) ||² + || Σ_{i=1}^n g(Yi ; 0) ||² ]
≤ (K τn^{-2}/(M²n)) Σ_{i=1}^n ||g(Yi ; 0)||² + (2/(M²n)) || Σ_{i=1}^n g(Yi ; 0) ||²
= UM , say.
The constant K in the last step is obtained by using the upper bound condi-
tion in (4.8) and (4.9). Now fix any two constants ε, δ > 0. By choosing M
large enough, UM can be made small with high probability. Whenever inequalities such as
sup_{||θ||≤M} τn^{-1} | Σ_{i=1}^n Xnbi − θ^T Hθ/2 | < δ0 ,    (4.36)
hold, the convex function nQnb (n−1/2 θ) − nQnb (0) assumes at n−1/2 H −1 Snb
a value less than its values on the sphere
||θ + n^{-1/2} H^{-1} Snb|| = κ (τn δ0)^{1/2},
where κ = 2 λmin^{-1/2}(H). The global minimiser of this function is n^{1/2} θ̂nb . Hence
Since δ0 is arbitrary, and because of (4.29) and (4.30), we have for any δ > 0
PB [||rnb3 || > δ] = oP (1).
Now use Theorem 2.3 from Chapter 2 to get the result. This step again
uses (4.8).
Also observe that given the condition on the bootstrap weights, Theorem 4.4
may be applied to obtain
Define cj = EB(Ws Wt) whenever |s ∩ t| = j, for j = 0, 1, . . . , m.
From (4.42) we have that the asymptotic mean of f1 is 0. Let its asymp-
totic variance be v²(m). This will in general be a function of m. Then the
appropriate standardized bootstrap statistic is
n^{1/2} ξn^{-1} v(m)^{-1} (θ̂nB − θ̂n).
Then we have the following Corollary. Its proof is similar to the proof of the
corresponding results in Section 4.5.1, and we omit the details.
Corollary 4.2. Assume the conditions of Theorem 4.7. Assume also that the
fn:i ’s are exchangeable with sup_n EB fn:i⁴ < ∞ and EB( fn:i² fn:j² ) → 1 for i ≠ j as
n → ∞. Then
Proof of Theorem 4.7: The first part of this proof is similar to the proof
of Theorem 4.6.
Then arguments similar to those used in the proof of Theorem 4.6 yield
(4.44), once it is established that for any fixed ε, δ > 0,
(a) For any M > 0,
PB[ sup_{||θ||≤M} | nN^{-1} Σ_s Xnbs − θ^T Hθ/2 | > δ ] = oP(1).    (4.48)
The proofs of these are similar to those of (4.29) and (4.30). Carefully
following those arguments, we only have to show that
n² N^{-2} EB ( Σ_s Ws Xns )² = oP(1),    (4.50)
n N^{-2} EB || Σ_s Ws g(Ys , 0) ||² = OP(1).    (4.51)
A little algebra shows that nN^{-1} Σ_s Xns = OP(1). The first term is oP(1)
from this and (4.42). For the other term, first note that the sum over j is
finite, and from (4.43), we only need to show that
n² N^{-2} Σ_{|s∩t|=j} Xns Xnt = oP(1) for every fixed j = 1, . . . , m.
Since the number of terms in Σ_{|s∩t|=j} is O(n^{-1} N²), it is enough to show
that for any s, t ∈ S,
Observe that θT (g(Ys , n−1/2 θ) − g(Ys , 0)) are non-negative random variables
that are non-increasing in n, and their limit is 0. This establishes the above.
Details of this argument are similar to those given in Chapter 2 following (2.33).
The proof of (4.51) is along the same lines. This proves (4.44).
In order to get the representation (4.45), we have to show
n^{1/2} N^{-1} Σ_s Ws g(Ys , 0) = m n^{-1/2} Σ_i fi g1(Yi , 0) + Rnb1 ,
Let h(Ys , 0) = g(Ys , 0) − Σ_{j=1}^m g1(Yij , 0). This is a kernel of a first order
degenerate U -statistic. Then we have
Σ_s Ws g(Ys , 0) = Σ_s Σ_{j=1}^m Ws g1(Yij , 0) + Σ_s Ws h(Ys , 0),
and
Σ_s Σ_{j=1}^m Ws g1(Yij ; 0) = \binom{n-1}{m-1} Σ_{i=1}^n fi g1(Yi ; 0).
Also,
E || N^{-1} Σ_s h(Ys , 0) ||² = O(n^{-2}).    (4.52)
Now using this result and (4.43), after some algebra we obtain
PB[ || n^{1/2} N^{-1} Σ_s Ws h(Ys , 0) || > δ ] = oP(1),
4.6 Exercises
1. Suppose w1 , . . . , wk are exchangeable random variables such that Σ_{i=1}^k wi
is a constant. Show that for every m ≠ n, c11 = corr(wm , wn) = −1/(k − 1),
irrespective of the distribution of {wi}. Hence verify (4.12).
3. Extend the setup in the previous question to the case where each pos-
sible k-dimensional vector has d coordinates equal to 0 and the rest equal to 1.
An Introduction to R
> library(UStatBookABSC)
Packages and core R may occasionally need to be updated, for which the
steps are mostly similar to installation. Package installation, updating and
new package creation can also be done from inside R Studio, in some ways
more easily.
As a general rule, using the R help pages is very highly recommended.
They contain a lot more information than what is given below. The way to
obtain information about any command, say the print command, is to simply
type in
> ?print
If the exact R command is not known, just searching for it in generic terms
on the internet usually elicits what is needed.
Simple computations can be done by typing in the data and commands
at the R prompt, that is the “>” symbol in the R console. However, it is not
good practice to type in lengthy commands or programs in the R console
prompt, or corresponding places in any IDE. One should write R programs
using a text editor, save them as files, and then run them. R programs are
often called R scripts.
One common way of writing such scripts or programs is by using the built-
in editor in R. In the menubar in the R workspace, clicking on the File menu
followed by the New script menu opens the editor on an empty page where
one may write a new program. To edit an existing script, the menu button
Open script may be used. In order to run a program/script, one should
click on the Source menu button. The Change dir menu button allows one
to switch working directory.
> getwd()
[1] "/Users/ABSC/Programs"
> setwd("/Users/ABSC/Programs/UStatBook")
> getwd()
[1] "/Users/ABSC/Programs/UStatBook"
Sometimes, but not always, one might want to come back to a work done
earlier on an R workspace. To facilitate this, R can save a workspace at the
end of a session. Also, when starting a new R session, one may begin with
a previously saved workspace. All of these are useful, if and when we want
to return to a previous piece of work in R. It is an inconvenience, however,
when one wants to do fresh work, and does not want older variable name
assignments and other objects stored in R memory to crop up. Also, a saved
workspace often takes up considerable storage space. Memory problems are
also known to occur. These issues are not huge problems for experts, but
often inconvenience beginners.
A simple way to avoid these issues is to include the command
rm(list = ls())
as the first line of the file UStatBookCodes.R. We run this file in R by clicking
on the “Source File” command. Thus
> source("/Users/ABSC/Programs/UStatBook/UStatBookCodes.R")
executes all the commands that we save in the file UStatBookCodes.R, which
we have saved in the directory /Users/ABSC/Programs/UStatBook.
Typing a command of the form
R CMD BATCH UStatBookCodes.R UStatBookOutFile.txt &
at the terminal prompt will run R to execute all the commands in the file UStatBookCodes.R, and
save any resulting output to the file UStatBookOutFile.txt. The ampersand
at the end is useful for Linux, Unix and Macintosh users, whereby the pro-
cesses can be run in the background, and not be subject to either inadvertent
stopping and will not require continuous monitoring. Being able to execute
R files from the terminal is extremely useful when running large programs,
which is often the case for Statistics research. A more recent alternative to
the R CMD BATCH command is Rscript. Considerable additional flexibility is
available for advanced users, the help pages contain relevant details.
One can work with several kinds of variables in R. Each single, or scalar,
variable can be of type numeric, double, integer, logical or character.
A collection of such scalars can be gathered in a vector, or a matrix. For
example, the command
> x = vector(length = 5)
> print(x)
the output is
[1] FALSE FALSE FALSE FALSE FALSE
This shows that a vector of length 5 has been created and assigned the address
x. The “*” operation between two vectors of equal length produces another
vector of same length, whose elements are the coordinatewise product. Thus,
> z = x*y
> print(z)
[1] 1 38 -96 32
The functions InnerProduct and Norm above are built in functions in the
package UStatBookABSC for users, with some additional details to handle the
case where the vectors x and y have missing values.
Figure 5.1: Histogram of rainfall amounts on rainy days in Kolkata during the monsoon season of 2012 (x-axis: Rainfall, y-axis: Frequency).
5.3.1 A dataset
For illustration purposes, we shall use the data on precipitation in Kolkata,
India in 2012, during the months June to September, which corresponds ap-
proximately to the monsoon season. It consists of fifty-one rows and four
columns, where the columns are on the date of precipitation, the precipita-
tion amount in millimeters, the maximum and the minimum temperature for
that day in degree Celcius. Days in which the precipitation amount was below
0.099 millimeters are not included. Suppose for the moment this dataset is
available as a comma separated value (csv) file in our computer in the direc-
tory /Users/ABSC/Programs/UStatBook under the filename Kolkata12.csv.
We can insert it in our R session as a data.frame called Kol Precip as follows
> setwd("/Users/ABSC/Programs/UStatBook")
> Kol_Precip = read.csv(file = "Kolkata12.csv")
We urge the readers to experiment with different kinds of data files, and
read the documentation corresponding to ?read.table. For example, readers
may consider reading in other kinds of text files where the values are not
separated by commas, where each row is not necessarily complete, and where
missing values are depicted in various ways, and where the data resides in a
Figure 5.2: Density plot of rainfall amounts on rainy days in Kolkata during the monsoon season of 2012 (N = 51, bandwidth = 3.671).
> library(UStatBookABSC)
>data(CCU12_Precip)
>ls(CCU12_Precip)
>?CCU12_Precip
The last command above brings up the help page for the dataset, which
also contains an executable example.
Suppose after accessing this data inside R, we wish to save a copy of it as a
comma separated value text file under the name filename Kolk12Precip.csv.
This is easily done inside R as follows
Figure: Pairwise scatter plots of the variables Precip, TMax and TMin.
> library(UStatBookABSC)
>data(CCU12_Precip)
> write.csv(CCU12_Precip, file = "Kolk12Precip.csv",
row.names = FALSE)
As in the case of reading of files, writing of files can also be done in various
formats.
library(UStatBookABSC)
Rainfall = CCU12_Precip$Precip;
print("Average rainfall on a rainy day
in Kolkata in 2012 monsoon is:")
print(mean(Rainfall), digits = 3);
print(paste("Variance of rainfall is", var(Rainfall)));
print(paste("and the standard deviation is",
sd(Rainfall), "while"));
print(paste("the median rainfall is",
median(Rainfall), "and"));
print("here is some summary statistics");
print(summary(Rainfall), digits = 2);
[1] "and the standard deviation is 11.0453304835209 while"
> print(paste("the median rainfall is",
median(Rainfall), "and"));
[1] "the median rainfall is 7.9 and"
> print("here is some summary statistics");
[1] "here is some summary statistics"
> print(summary(Rainfall), digits = 2);
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3 2.0 7.9 10.7 14.0 42.9
hist(Rainfall);
dev.new();
plot(density(Rainfall), xlim = c(0,45),
main = "Density of Kolkata-2012 Rainfall");
and obtain the histogram and the density plot of the Rainfall data presented
respectively in Figures 5.1 and 5.2. The command dev.new() requires R to
put the second plot in a separate window. Note the additional details
supplied to the plot command for the density plot, which controls the limit
of the x-axis of the plotting region, and places the title on top of the plot.
There are many more parameters that can be set to make the graphical output
from R pretty, and readers should explore those. In fact, one great advantage
MaxTemp = CCU12_Precip$TMax;
MinTemp = CCU12_Precip$TMin;
print("The covariance of the max and min temperature is ");
print(cov(MaxTemp, MinTemp));
we find that the covariance between the maximum and minimum tempera-
ture on a rainy day is 0.58. We might want to test if the difference between
the maximum and minimum temperature on those days is, say, 20 degrees
Celsius, and one way of conducting such a test is by using the t.test as
follows:
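A call of the following form (a sketch; the null value 20 is passed through the mu argument) performs such a two-sample test, and its output ends with the sample estimates shown below.

> t.test(MaxTemp, MinTemp, mu = 20)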
mean of x mean of y
33.55686 25.93529
Note that we do not recommend the above two-sample test for the present
data: the maximum and minimum temperature for a given day are very likely
related, and we have not verified that assumptions compatible with a two-
sample t.test hold. The above computation is merely for displaying the
syntax of how to conduct a two-sample test in R.
Let us now conduct a paired t-test, perhaps with the alternative hypothesis
that the true difference is less than 10 degrees Celsius, keeping in mind that
Kolkata is in a tropical climate region. Additionally, suppose we want a 99%
one-sided confidence interval. This is implemented as follows:
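A call of the following form (a sketch) carries out the paired, one-sided test with a 99% confidence level.

> t.test(MaxTemp, MinTemp, paired = TRUE, mu = 10, alternative = "less", conf.level = 0.99)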
Additionally, we may decide not to rely on the t.test only, and conduct
a signed-rank test.
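The corresponding signed-rank test can be obtained with wilcox.test; a sketch of such a call is

> wilcox.test(MaxTemp, MinTemp, paired = TRUE, mu = 10, alternative = "less")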
I.Product = function(x, y){
T = sum(x * y);
return(T)
}
The first line says that I.Product is the name of the function, and that
its arguments are x, y. The second line computes the function, and the third
line returns to the main program the computed value of the function. Here
is how this program may be used:
A = c( 1, 2, 3);
BVec = c(0, 2, -1);
I = I.Product(A, BVec);
print(I)
[1] 1
Note that we deliberately used the vectors A and BVec as arguments, the
names of the arguments do not need to match how the function is written.
Also, once the function I.Product is in the system, it can be used repeat-
edly. This may not seem a big deal for a simple function like I.Product,
but in reality many functions are much more complex and elaborate, and
their codification as standalone functions helps programming greatly. Also,
even simple functions typically require checks and conditions to prevent er-
rors and use with incompatible arguments. See, for example, the code for
function InnerProduct in the UStatBookABSC package, which does the same
computation as I.Product, but has checks in place to ensure that both the
vectors used as arguments are numeric, they have the same length, and has
additional methodological steps to handle missing values in either argument.
>Data.CCU = CCU12_Precip[,-1];
>M.Oja = OjaMedian(Data.CCU);
>print(M.Oja)
Precip TMax TMin
9.36444 33.65908 26.16204
Notice that the Oja-median and the L1 -median have similar, but not iden-
tical, estimates. In general, computation of Oja-median is more involved since
determinants of several matrices have to be computed for each optimization
step.
Figure 5.4: Scatter plot of slope parameters from the multivariate response L1-regression fitting, with Beta.Precip on the x-axis and Beta.TMax on the y-axis. The filled-in larger circle is the L1-median.
the first element of each covariate vector being 1 for the intercept term, and
the second element being the TMin observation. Thus, the parameter for
the present problem is the 2 × 2 matrix B. Note that the elements of the
response variable (Precip, TMax) are potentially physically related to each
other by the Clausius-Clapeyron relation (see Dietz and Chatterjee (2014)),
and the data analysis of this section is motivated by the need to understand
the nature of this relationship conditional on the covariate TMin. We use the
function L1Regression in the package UStatBookABSC to obtain the estima-
tor B̂ minimizing
Ψn(B) = Σ_{i=1}^n |Yi − B^T xi|,
where recall that |a| = [Σ_k ak²]^{1/2} is the Euclidean norm of a.
The following code implements the above task:
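(A sketch of such a call is given below; the argument order and names assumed for L1Regression may differ from the actual ones in UStatBookABSC, so consult ?L1Regression before use.)

library(UStatBookABSC)
data(CCU12_Precip)
DataY = cbind(CCU12_Precip$Precip, CCU12_Precip$TMax)   # bivariate response
DataX = cbind(1, CCU12_Precip$TMin)                     # intercept and TMin
L1Fit = L1Regression(DataY, DataX)                      # assumed calling convention
print(L1Fit)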
$Convergence
[1] 0.0005873908
$BetaHat
[,1] [,2]
[1,] 164.803552 16.5318273
[2,] -5.866605 0.6498325
The algorithm is iterative: at each iteration step j, we compute the relative change in norm |B̂j+1 − B̂j|/|B̂j|, where |B|
is the Euclidean norm of the vectorized version of B, i.e., when the columns
of B are stacked one below another to form a pd-length vector. We declare
convergence of the algorithm if this relative change in norm is less than ε,
and we use ε = 0.001 here.
The above display shows that the function L1Regression has a list as the
output. The first item states that convergence occurred at the 4-th iteration
step, and that the relative change in norm at the last step was 0.0005, and
then the final B̂ value is shown. Thus we have
B̂ = [ 164.803552   16.5318273
       −5.866605    0.6498325 ].
We also use the function WLS from the package UStatBookABSC to obtain
a least squares estimator B̃ of B, for which the results are displayed below:
Thus we have
B̃ = [ 145.412968   16.0526220
       −5.193363    0.6749197 ].
> B = 500;
> Probabilities = rep(1, nrow(DataY))
In the above, we set the resampling Monte Carlo size at B = 500. The next
two steps generates the multinomial weights for which we require the package
MASS, which has been called inside UStatBookABSC anyway. The next steps
implement a loop where L1Regression is repeatedly implemented with the
resampling weights, the entire results stored in the list L1Regression.Boot
and further, just the relevant slope parameters stored also in the matrix
T2Effect.
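A sketch of what such a loop might look like is given below; the way the resampling weights are passed to L1Regression (the Weights argument) is an assumption on our part and may differ in the package.

T2Effect = matrix(0, nrow = B, ncol = 2)
L1Regression.Boot = list()
for (b in 1:B){
  W = as.vector(rmultinom(1, nrow(DataY), Probabilities))   # multinomial resampling weights
  Fit.b = L1Regression(DataY, DataX, Weights = W)           # assumed argument name
  L1Regression.Boot[[b]] = Fit.b
  T2Effect[b, ] = Fit.b$BetaHat[2, ]                        # second (slope) row of BetaHat
}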
We now present a graphical display of the above results with the code
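A sketch consistent with the description that follows is given below; L1Fit, holding the original fit, is the (assumed) object name used in the earlier sketch.

plot(T2Effect[, 1], T2Effect[, 2], pch = "*",
     xlab = "Beta.Precip", ylab = "Beta.TMax")
points(L1Fit$BetaHat[2, 1], L1Fit$BetaHat[2, 2], col = 2, pch = 19, cex = 2)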
The above code is for plotting the resampling estimates of B̂[2, ], stored
in the matrix T2Effect. On this scatter plot, we overlay the original B̂[2, ],
as a big filled-in circle. The commands col =2 tells R to use red color for
the point on the computer screen and in color print outs, pch = 19 tells it
to depict the point with a filled-in red circle, and cex = 2 tells it to increase
the size of the point. The output from the above code is given in Figure 5.4.
5.5 Exercises
1. Write an R function that can implement the Bayesian bootstrap on the
L1-regression example for the Kolkata precipitation data.
Athreya, K. B., Ghosh, M., Low, L. Y., and Sen, P. K. (1984). Laws of large
numbers for bootstrapped U -statistics. Journal of Statistical Planning and
Inference, 9(2):185 – 194.
Dehling, H., Denker, M., and Philipp, W. (1986). A bounded law of the
iterated logarithm for Hilbert space valued martingales and its application
to U -statistics. Probability Theory and Related Fields, 72(1):111 – 131.
Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans.
SIAM, Philadelphia, USA.
Esseen, C.-G. (1942). On the Liapounoff limit of error in the theory of prob-
ability. Arkiv för Matematik, Astronomi och Fysik, 28A(2):1 – 19.
Ghosh, M., Parr, W. C., Singh, K., and Babu, G. J. (1984). A note on
bootstrapping the sample median. The Annals of Statistics, 12(3):1130–
1135.
Gregory, G. G. (1977). Large sample theory for U -statistics and tests of fit.
The Annals of Statistics, 5(1):110–123.
Hall, P. and Martin, M. A. (1991). On the error incurred using the boot-
strap variance estimate when constructing confidence intervals for quan-
tiles. Journal of Multivariate Analysis, 38(1):70 – 81.
Hoeffding, W. (1961). The strong law of large numbers for U -statistics. Insti-
tute of Statistics mimeo series 302, University of North Carolina, Chapel
Hill, USA.
Hubback, J. A. (1946). Sampling for rice yield in Bihar and Orissa. Sankhyā,
pages 281 – 294. First published in 1927 as Bulletin 166, Imperial Agricul-
tural Research Institute, Pusa, India.
Maritz, J. S., Wu, M., and Staudte, R. G. (1977). A location estimator based
on a U -statistic. The Annals of Statistics, 5(4):779 – 786.
Tukey, J. W. (1958). Bias and confidence in not quite large samples (abstract).
The Annals of Mathematical Statistics, 29(2):614–614.
Wagner, T. J. (1969). On the rate of convergence for the law of large numbers.
The Annals of Mathematical Statistics, 40(6):2195 – 2197.
Abramovitch, L. 91
Arcones, M. A. 66, 67, 106
Athreya, K. B. 105
Denker, M. 9
Dietz, L. 146
Durrett, R. 14
Dynkin, E. B. 29

L-statistics, 32
L1-median, 38, 143
L1-median, CLT, 52
L1-median, exponential rate, 58
L1-median, strong representation, 64
L1-median, sub-gradient, 41
M-estimator, 35
M-estimator, asymptotic normality, 103
M-estimator, generalized bootstrap, 104
M-estimator, sample mean, 36
M1-estimate, non-i.i.d., 76
M2-estimator, 37
M2-estimator, sample variance, 36
Mm-estimator, 35, 36
Mm-estimator, U-statistics, 36
Mm-estimator, CLT, 45
Mm-estimator, convexity, 39
Mm-estimator, last passage time, 56
Mm-estimator, rate of convergence, 55
Mm-estimator, strong consistency, 43
Mm-estimator, strong representation, 58
Mm-estimator, weak representation, 45
Mm-parameter, 36
U-median, 38
U-quantile, 38
U-quantiles, CLT, 51
U-quantiles, exponential rate, 58
U-quantiles, multivariate, 38
U-quantiles, strong representation, 62
U-statistics, Mm-estimator, 35, 36
U-statistics, χ2 limit theorem, 20
U-statistics, asymptotic normality, 103
U-statistics, bootstrap, 105
U-statistics, central limit theorem, 9
U-statistics, degenerate, 10, 14
U-statistics, degree, order, 2
U-statistics, deviation result, 15
U-statistics, first projection, 8
U-statistics, generalized bootstrap, 104
U-statistics, kernel, 2
U-statistics, linear combination, 3
U-statistics, multivariate, 9, 32