Investigating Smooth Multiple Regression by the Method of Average Derivatives
Author(s): Wolfgang Härdle and Thomas M. Stoker
Source: Journal of the American Statistical Association, Vol. 84, No. 408 (Dec., 1989), pp. 986-995
Published by: Taylor & Francis, Ltd. on behalf of the American Statistical Association
Stable URL: https://www.jstor.org/stable/2290074

Investigating Smooth Multiple Regression by the
Method of Average Derivatives
WOLFGANG HARDLE and THOMAS M. STOKER*

Let (x_1, ..., x_k, y) be a random vector where y denotes a response on the vector x of predictor variables. In this article we propose a technique [termed average derivative estimation (ADE)] for studying the mean response m(x) = E(y | x) through the estimation of the k-vector of average derivatives δ = E(m'). The ADE procedure involves two stages: first estimate δ using an estimator δ̂, and then approximate m(x) by m̂(x) = ĝ(x^Tδ̂), where ĝ is an estimator of the univariate regression of y on x^Tδ̂. We argue that the ADE procedure exhibits several attractive characteristics: data summarization through interpretable coefficients, graphical depiction of the possible nonlinearity between y and x^Tδ̂, and theoretical properties consistent with dimension reduction. We motivate the ADE procedure using examples of models that take the form m(x) = g(x^Tβ). In this framework, δ is shown to be proportional to β and m̂(x) infers m(x) exactly. The focus of the procedure is on the estimator δ̂, which is based on a simple average of kernel smoothers and is shown to be a √N consistent and asymptotically normal estimator of δ. The estimator ĝ(·) is a standard kernel regression estimator and is shown to have the same properties as the kernel regression of y on x^Tδ. In sum, the estimator δ̂ converges to δ at the rate typically available in parametric estimation problems, and m̂(x) converges to E(y | x^Tδ) at the optimal one-dimensional nonparametric rate. We also give a consistent estimator of the asymptotic covariance matrix of δ̂, to facilitate inference. We discuss the conditions underlying these results, including how √N consistent estimation of δ requires undersmoothing relative to pointwise multivariate estimation. We also indicate the relationship between the ADE method and projection pursuit regression. For illustration, we apply the ADE method to data on automobile collisions.

KEY WORDS: ADE regression; GLIM models; Kernel estimation; Nonparametric estimation.

* Wolfgang Härdle is Privatdozent, Institut für Wirtschaftstheorie, Universität Bonn, D-5300 Bonn, Federal Republic of Germany. Thomas M. Stoker is Professor of Applied Economics, Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA 02139. This research was funded by the Deutsche Forschungsgemeinschaft, Sonderforschungsbereich 303, and National Science Foundation Grant SES-8721889. The authors thank R. Carroll, J. Hart, J. Powell, D. Scott, J. Wooldridge, and several seminar audiences for helpful comments, and T. Foster for assistance in the computational work.

© 1989 American Statistical Association, Journal of the American Statistical Association, December 1989, Vol. 84, No. 408, Theory and Methods.

1. INTRODUCTION

The popularity of linear modeling in empirical analysis is based on the ease with which the results can be interpreted. This tradition influenced the modeling of various parametric nonlinear regression relationships, where the mean response variable is assumed to be a nonlinear function of a weighted sum of the predictor variables. As in linear modeling, this feature is attractive because the coefficients, or weights of the sum, give a simple picture of the relative impacts of the individual predictor variables on the response variable. In this article we propose a flexible method of studying general multivariate regression relationships in line with this approach. Our method is to first estimate a specific set of coefficients, termed average derivatives, and then compute a (univariate) nonparametric regression of the response on the weighted sum of predictor variables.

The central focus of this article is the analysis of the average derivative, which is defined as follows. Let (x, y) = (x_1, ..., x_k, y) denote a random vector, where y is the response studied. If the mean response of y given x is denoted by

m(x) = E(y | x),   (1.1)

then the vector of "average derivatives" is given as

δ = E(m'),   (1.2)

where m' ≡ ∂m/∂x is the vector of partial derivatives and the expectation is taken with respect to the marginal distribution of x. We argue in the next section that δ represents sensible "coefficients" of changes in x and y.

We construct a nonparametric estimator δ̂ of δ, based on an observed random sample (x_i, y_i) (i = 1, ..., N). Our procedure for modeling m(x) is to first compute δ̂, form the weighted sums z_i = x_i^Tδ̂ for i = 1, ..., N (where x^T is the transpose of x), and then compute the (Nadaraya-Watson) kernel estimator ĝ(·) of the regression of y_i on z_i. The regression function m(x) is then approximated by

m̂(x) = ĝ(x^Tδ̂).   (1.3)

The output of the procedure is threefold: a summary of the relative impacts of changes in x on y (via δ̂), a visual depiction of the nonlinearity between y and the weighted sum x^Tδ̂ (a graph of ĝ), and a formula for computing estimates of the mean response m(x) [from Eq. (1.3)]. We refer to this as the ADE method, for "average derivative estimation."

In addition to allowing data summarization through interpretable coefficients, the average derivative estimator is computationally simple and has theoretical properties consistent with dimension reduction. The statistic δ̂ is based on a simple average of nonparametric kernel smoothers, and its properties depend only on regularity properties of the joint density of (x, y) or, in particular, on no functional form assumptions on the regression function m(x). The limiting distribution of √N(δ̂ - δ) is multivariate normal. The nonparametric regression estimator m̂(x) = ĝ(x^Tδ̂) is constructed from a k-dimensional predictor variable, but it achieves the optimal rate that is typical for one-dimensional smoothing problems (see Stone 1980). Although δ̂ and ĝ(·) each involve the choice of a smoothing parameter, they are computed directly from the data in two steps and thus require no computer-intensive iterative techniques for finding optimal objective function values.

Section 2 motivates the ADE method through several examples familiar from applied work. Section 3 introduces the estimators δ̂ and ĝ and establishes their large-sample statistical properties. Section 4 discusses the results, including the relationship of the ADE method to projection pursuit regression (PPR) of Friedman and Stuetzle (1981). Section 5 applies the ADE method to data on automobile collisions. Section 6 concludes with a discussion of related research.

2. MOTIVATION OF THE ADE PROCEDURE

The average derivative δ is most naturally interpreted in situations where the influence of x on y is modeled via a weighted sum of the predictors; that is, where m(x) = g(x^Tβ) for a vector of coefficients β. In such a model, δ is intimately related to β, as m' = [dg/d(x^Tβ)]β, so that δ = E[dg/d(x^Tβ)]β = γβ, where γ is a scalar (assumed nonzero). Thus δ is proportional to the coefficients β whenever the mean response is determined by x^Tβ.

An obvious example is the classical linear regression model y = α + x^Tβ + ε, where ε is a random variable such that E(ε | x) = 0, which gives δ = β. Another class of models is those that are linear up to transformations:

φ(y) = ψ(x^Tβ) + ε,   (2.1)

where ψ(·) is a nonconstant transformation, φ(·) is invertible, and ε is a random disturbance that is independent of x. Here we have that m(x) = E[φ^{-1}(ψ(x^Tβ) + ε) | x] = g(x^Tβ). The form (2.1) includes the model of Box and Cox (1964), where φ(y) = (y^{λ_1} - 1)/λ_1 and ψ(x^Tβ) = α + [(x^Tβ)^{λ_2} - 1]/λ_2.

Other models exhibiting this structure are discrete regression models, where y is 1 or 0 according to

y = 1 if ε < ψ(x^Tβ)
  = 0 if ε ≥ ψ(x^Tβ).   (2.2)

Here the regression function m(x) is the probability that y = 1, which is given as m(x) = Pr{ε < ψ(x^Tβ) | x} = g(x^Tβ). References to specific examples of binary response models can be found in Manski and McFadden (1981). Standard probit models specify that ε is a normal random variable (with distribution function Φ) and ψ(x^Tβ) = α + x^Tβ, giving m(x) = Φ(α + x^Tβ). Logistic regression models are likewise included; here m(x) = exp(α + x^Tβ)/[1 + exp(α + x^Tβ)].

Censored regression, where

y = ψ(x^Tβ) + ε if ψ(x^Tβ) + ε ≥ 0
  = 0 if ψ(x^Tβ) + ε < 0,   (2.3)

is likewise included, and setting ψ(x^Tβ) = α + x^Tβ gives the familiar censored linear regression model [see Powell (1986), among others].

A parametric approach to the estimation of any of these models, for instance based on maximum likelihood, requires the (parametric) specification of the distribution of the random variable ε and of the transformation ψ(·), and, for (2.1), the transformation φ(·). Substantial bias can result if any of these features is incorrectly specified. Nonparametric estimation of δ = γβ avoids such restrictive specifications. In fact, the form m(x) = g(x^Tβ) generalizes the "generalized linear models" (GLIM); see McCullagh and Nelder (1983). These models have g invertible, with g^{-1} referred to as the "link" function. Other approaches that generalize GLIM can be found in Breiman and Friedman (1985), Hastie and Tibshirani (1986), and O'Sullivan, Yandell, and Raynor (1986).

Turning our attention to ADE regression modeling, we show in the next section that m̂(x) of (1.3) will estimate g(x^Tδ) = E(y | x^Tδ), in general. Consequently, the ADE method will completely infer m(x) when

m(x) = g(x^Tδ).   (2.4)

But this is the case for each of the aforementioned examples, or whenever m(x) = g(x^Tβ), since a (nonzero) rescaling of β can be absorbed into g. Here m(x) is reparameterized to have coefficients δ = γβ by defining g̃(·) ≡ g(·/γ), so that m(x) = g(x^Tβ) = g̃(x^Tδ). This rescaling corresponds to E[dg̃/d(x^Tδ)] = 1, a normalization of g̃ that would not obtain for alternative scalings of β.

Equivalently, we can interpret the scale of δ by noting that if each value x is translated to x + Δ, then the change in the overall mean of y is Δ^Tδ. This feature is familiar for coefficients when the true model is linear, but not for coefficients within a nonlinear model. For instance, alternative scalings of β for the transformation model (2.1) would make the average change dependent on φ(·) and ψ(·).

Finally, there are modeling situations where δ is interpretable but (2.4) does not obtain. For instance, if x = (x_1, x_2) and the model is partially linear,

y = x_1^Tβ_1 + ψ(x_2) + ε,   (2.5)

then δ_1 = β_1 and δ_2 = E(ψ'), where δ = (δ_1, δ_2) conforms with the partition of x. If, in addition, ψ = g(x_2^Tβ_2), then δ_1 = β_1 and δ_2 = γβ_2, so δ_2 is proportional to the coefficients within the nonlinear part of the model. See Robinson (1988) for references to partially linear models and Stoker (1986) for other examples where the average derivative has a direct interpretation.

3. KERNEL ESTIMATION OF AVERAGE DERIVATIVES

Our approach to estimation of δ uses nonparametric estimation of the marginal density of x. Let f(x) denote this marginal density, f' ≡ ∂f/∂x the vector of partial derivatives, and l(x) ≡ -∂ ln f/∂x = -f'/f the negative log-density derivative. If f(x) = 0 on the boundary of x values, then integration by parts gives

δ = E(m') = E[l(x)y].   (3.1)
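To see where (3.1) comes from (it is proved formally as theorem 1 of Stoker 1986), a componentwise integration-by-parts sketch suffices; the display below is our summary of that argument for the jth coordinate, using the boundary condition f = 0 on the boundary of the support (Assumption 2 of the Appendix):

\[
\delta_j = E\!\left[\frac{\partial m(x)}{\partial x_j}\right]
= \int \frac{\partial m(x)}{\partial x_j}\, f(x)\, dx
= -\int m(x)\,\frac{\partial f(x)}{\partial x_j}\, dx
= E\!\left[-\frac{\partial \ln f(x)}{\partial x_j}\, m(x)\right]
= E\big[l_j(x)\, y\big],
\]

where the boundary term vanishes because f = 0 on the boundary, and the last equality uses m(x) = E(y | x) together with iterated expectations.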


Our estimator of δ is a sample analog of the last term in this formula, using a nonparametric estimator of l(x) evaluated at each observation.

In particular, the density function f(x) is estimated at x using the (Rosenblatt-Parzen) kernel density estimator

f̂_h(x) = N^{-1} h^{-k} Σ_{j=1}^N K[(x_j - x)/h],   (3.2)

where K(·) is a kernel function, h = h_N is the bandwidth parameter, and h → 0 as N → ∞. The vector function l(x) is then estimated using f̂_h(x) as

l̂_h(x) = -f̂'_h(x)/f̂_h(x),   (3.3)

where f̂'_h ≡ ∂f̂_h/∂x is an estimator of the partial density derivative. For a suitable kernel K(·), under general conditions f̂_h(x), f̂'_h(x), and l̂_h(x) are consistent estimators of f(x), f'(x), and l(x), respectively.

Because of the division by f̂_h, the function l̂_h may exhibit erratic behavior when the value of f̂_h is very small. Consequently, for estimation of δ we only include terms for which the value of f̂_h(x_i) is above a bound. Toward this end, define the indicator Î_i ≡ I[f̂_h(x_i) > b], where I[·] is the indicator function and b = b_N is a trimming bound such that b → 0 as N → ∞.

The "average derivative estimator" δ̂ is defined as

δ̂ = N^{-1} Σ_{i=1}^N l̂_h(x_i) y_i Î_i.   (3.4)
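As a concrete illustration of (3.2)-(3.4), the following sketch computes δ̂ with a product Gaussian kernel, a fixed bandwidth, and quantile-based trimming. It is a minimal illustration in our own notation, not the authors' code: Theorem 3.1 below in fact calls for a compact-support kernel of order p ≥ k + 2, and the bandwidth and trimming rules here are placeholders.

```python
import numpy as np

def gaussian_kernel(u):
    """Product Gaussian kernel K(u) and its gradient dK/du for u of shape (..., k).
    Illustrative choice only: the theory requires a compact-support kernel of order p >= k + 2."""
    k = u.shape[-1]
    K = np.exp(-0.5 * np.sum(u ** 2, axis=-1)) / (2 * np.pi) ** (k / 2)
    dK = -u * K[..., None]                      # gradient of K with respect to u
    return K, dK

def average_derivative(x, y, h, trim=0.05):
    """Average derivative estimator delta_hat of Eq. (3.4).

    x: (N, k) predictors; y: (N,) responses; h: bandwidth h_N;
    trim: fraction of observations with the smallest density estimates to drop
          (the trimming bound b of the paper, expressed here as a quantile)."""
    N, k = x.shape
    u = (x[None, :, :] - x[:, None, :]) / h     # u[i, j] = (x_j - x_i) / h
    K, dK = gaussian_kernel(u)
    f_hat = K.sum(axis=1) / (N * h ** k)                  # Eq. (3.2): density at each x_i
    fprime_hat = -dK.sum(axis=1) / (N * h ** (k + 1))     # estimated density gradient
    l_hat = -fprime_hat / f_hat[:, None]                  # Eq. (3.3)
    b = np.quantile(f_hat, trim)                          # trimming bound b_N
    keep = f_hat > b                                      # indicator I_i
    delta_hat = (l_hat[keep] * y[keep, None]).sum(axis=0) / N   # Eq. (3.4)
    return delta_hat, l_hat, f_hat, keep
```

The same skeleton accommodates any kernel with an available gradient, such as the biweight product kernel used in Section 5.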
We derive the large-sample statistical properties of δ̂ on the basis of smoothness conditions on m(x) and f(x). The required assumptions (listed in the Appendix) are described as follows. As before, the k-vector x is continuously distributed with density f(x), and f(x) = 0 on the boundary of x values. The regression function m(x) = E(y | x) is (a.e.) continuously differentiable, and the second moments of m' and ly exist. The density f(x) is assumed to be smooth, having partial derivatives of order p ≥ k + 2. The kernel function K(u) has compact support and is assumed to be of order p. We also require some technical conditions on the behavior of m(x) and f(x) in the tails of the distribution, for instance, ruling out thick tails and rapid increases in m(x) as |x| → ∞.

Under these conditions, δ̂ is an asymptotically normal estimator of δ, stated formally as follows.

Theorem 3.1. Given Assumptions 1-9 stated in the Appendix, if (a) N → ∞, h → 0, b → 0, and b^{-1}h → 0; (b) for some ε > 0, b^4 N^{1-ε} h^{2k+2} → ∞; and (c) Nh^{2p-2} → 0, then √N(δ̂ - δ) has a limiting normal distribution with mean 0 and variance Σ, where Σ is the covariance matrix of r(y, x), with

r(y, x) = m'(x) + [y - m(x)]l(x).   (3.5)

The proof of Theorem 3.1, as well as those of the other results of the article, are contained in the Appendix.

The covariance matrix Σ could be consistently estimated as the sample variance of uniformly consistent estimators of r(y_i, x_i) (i = 1, ..., N), and the latter could be constructed using any uniformly consistent estimators of l(x), m(x), and m'(x). The proof of Theorem 3.1 suggests a more direct estimator of r(y_i, x_i), defined as

r̂_{hi} = l̂_h(x_i) y_i Î_i + N^{-1} h^{-k} Σ_{j=1}^N [h^{-1} K'((x_i - x_j)/h) - K((x_i - x_j)/h) l̂_h(x_j)] [y_j Î_j / f̂_h(x_j)].   (3.6)

Define the estimator Σ̂ of Σ as the sample covariance matrix of {r̂_{hi} Î_i}:

Σ̂ = N^{-1} Σ_{i=1}^N r̂_{hi} r̂_{hi}^T Î_i - r̄_h r̄_h^T,   (3.7)

where r̄_h ≡ N^{-1} Σ_{i=1}^N r̂_{hi} Î_i. We then have the following theorem.

Theorem 3.2. If N → ∞, h → 0, b → 0, and b^{-1}h → 0, then Σ̂ is a consistent estimator of Σ.

Theorem 3.2 facilitates the measurement of the precision of δ̂ as well as inference on hypotheses about δ. For instance, the covariance matrix of δ̂ is estimated by N^{-1}Σ̂. Moreover, consider testing restrictions that certain components of δ are 0, or testing equality restrictions across components of δ. Such restrictions are captured by the null hypothesis that Qδ = q_0, where Q is a k_1 × k matrix of full rank k_1 ≤ k. Tests of this hypothesis can be based on the Wald statistic W = N(Qδ̂ - q_0)^T (QΣ̂Q^T)^{-1} (Qδ̂ - q_0), which has a limiting χ² distribution with k_1 df.
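Given δ̂ and the covariance estimator Σ̂ of (3.7), the Wald test just described is a few lines of linear algebra. A minimal sketch (function and variable names are ours; scipy is used only for the χ² tail probability):

```python
import numpy as np
from scipy import stats

def wald_test(delta_hat, Sigma_hat, Q, q0, N):
    """Wald statistic W = N (Q d - q0)' (Q S Q')^{-1} (Q d - q0) for H0: Q delta = q0,
    referred to a chi-squared distribution with rank(Q) degrees of freedom."""
    diff = Q @ delta_hat - q0
    W = N * diff @ np.linalg.solve(Q @ Sigma_hat @ Q.T, diff)
    df = Q.shape[0]
    return W, stats.chi2.sf(W, df)

# Example: H0 that components 2 and 3 of a 3-vector delta are zero,
# as in the (delta_2, delta_3) = (0, 0) test of Section 5.
# Q = np.array([[0., 1., 0.], [0., 0., 1.]]); q0 = np.zeros(2)
```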


We now turn our attention to the estimation of g(x^Tδ) = E(y | x^Tδ) and add the assumption that g(·) is twice differentiable. Set ẑ_j ≡ x_j^Tδ̂ (j = 1, ..., N), and let f_1 denote the density of z = x^Tδ. Define ĝ_{h'}(z) as the (Nadaraya-Watson) kernel estimator of the regression of y on ẑ = x^Tδ̂:

ĝ_{h'}(z) = [N^{-1} Σ_{j=1}^N h'^{-1} K_1((z - ẑ_j)/h') y_j] / f̂_{1h'}(z),   (3.8)

where f̂_{1h'} is the density estimator

f̂_{1h'}(z) = N^{-1} Σ_{j=1}^N h'^{-1} K_1((z - ẑ_j)/h'),   (3.9)

with bandwidth h' = h'_N, and K_1 is a (univariate) kernel function. Suppose, for a moment, that z_j = x_j^Tδ instead of ẑ_j were used in (3.8) and (3.9); then it is well known (Schuster 1972) that the resulting regression estimator is asymptotically normal and converges (pointwise) at the optimal (univariate) rate N^{2/5}. Theorem 3.3 states that there is no cost to using the estimated values ẑ_j as described previously.

Theorem 3.3. Given Assumptions 1-10 stated in the Appendix, let z be such that f_1(z) ≥ b_1 > 0. If N → ∞ and h' ~ N^{-1/5}, then N^{2/5}[ĝ_{h'}(z) - g(z)] has a limiting normal distribution with mean B(z) and variance V(z), where

B(z) = [g''(z)/2 + g'(z)f'_1(z)/f_1(z)] ∫ u² K_1(u) du

and

V(z) = [var(y | x^Tδ = z)/f_1(z)] ∫ K_1(u)² du.   (3.10)

The bias and variance given in (3.10) can be estimated consistently for each z using y, ĝ_{h'}, and f̂_{1h'}, and their derivatives, using standard methods. Therefore, asymptotic confidence intervals can be constructed for ĝ_{h'}(z). It is clear that the same confidence intervals apply to m̂(x) = ĝ_{h'}(x^Tδ̂), for z = x^Tδ̂.
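The second ADE stage, Eqs. (3.8)-(3.9), is an ordinary univariate Nadaraya-Watson smoother applied to the fitted index ẑ_i = x_i^Tδ̂. A minimal sketch in our notation (any univariate kernel K_1 may be supplied, e.g., the biweight kernel of Section 5):

```python
import numpy as np

def nadaraya_watson(z_hat, y, z_grid, h1, kernel):
    """Kernel regression g_hat of y on z_hat = x @ delta_hat, Eqs. (3.8)-(3.9).

    z_hat: (N,) index values x_i' delta_hat; z_grid: evaluation points;
    h1: bandwidth h'; kernel: univariate kernel function K1."""
    u = (np.asarray(z_grid)[:, None] - np.asarray(z_hat)[None, :]) / h1   # (G, N)
    w = kernel(u)
    f1_hat = w.mean(axis=1) / h1                         # Eq. (3.9): density of z
    num = (w * np.asarray(y)[None, :]).mean(axis=1) / h1 # numerator of Eq. (3.8)
    return num / f1_hat, f1_hat                          # g_hat(z_grid), f1_hat(z_grid)

# The fitted ADE regression of Eq. (1.3) is then
# m_hat(x_new) = nadaraya_watson(x @ delta_hat, y, x_new @ delta_hat, h1, K1)[0].
```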
4. REMARKS AND DISCUSSION

4.1 The Average Derivative Estimator

As indicated in the Introduction, the most interesting feature of Theorem 3.1 is that δ̂ converges to δ at rate √N. This is the rate typically available in parametric estimation problems and is the rate that would be attained if the values l(x_i) (i = 1, ..., N) were known and used in the average (3.4). The estimator l̂_h(x) converges pointwise to l(x) at a slower rate, so Theorem 3.1 gives a situation where the average of nonparametric estimators converges more quickly than any of its individual components. This occurs because of the overlap between kernel densities at different evaluation points; for instance, if x_i and x_j are sufficiently close, the data used in the local average l̂_h(x_i) will overlap with those used in l̂_h(x_j). These overlaps lead to the approximation of δ̂ by U statistics with kernels depending on N. The asymptotic normality of δ̂ follows from results on the equivalence of such U statistics to (ordinary) sample averages. In a similar spirit, Powell, Stock, and Stoker (in press) obtained √N convergence rates for the estimation of "density weighted" average derivatives, and Carroll (1982), Robinson (1988), and Härdle and Marron (1987) showed how kernel densities can be used to obtain √N convergence rates for certain parameters in specific semiparametric models. We also note that our method of trimming follows Bickel (1982), Manski (1984), and Robinson (1988).

For any given sample size, the bandwidth h and the trimming bound b can be set to any (positive) values, so their choice can be based entirely on the small-sample behavior of δ̂. Conditions (a)-(c) of Theorem 3.1 indicate how the initial bandwidth and trimming bound must be decreased as the sample size is increased. These conditions are certainly feasible; suppose that h = h_0 N^{-ζ} and b = b_0 N^{-η}; then (a)-(c) are equivalent to 0 < η < ζ and 1/(2p - 2) < ζ < (1 - 4η - ε)/(2k + 2). Since p ≥ k + 2 and ε is arbitrarily small, η can be chosen small enough to fulfill the last condition.

The bandwidth conditions arise as follows. Condition (b) assures that the estimator δ̂ can be "linearized" to δ̃ without an estimated denominator and is a sufficient condition for asymptotic normality. Condition (c) assures that the bias of δ̂ vanishes at rate √N. Conditions (a)-(c) are one-sided in implying that the trimming bound b cannot converge too quickly to 0 as N → ∞, but rather must converge slowly. The behavior of the bandwidth h as N → ∞ is bounded both below and above by Conditions (b) and (c).

Condition (c) does imply that the pointwise convergence of f̂_h(x) to f(x) must be suboptimal. Stone (1980) showed that the optimal pointwise rate of convergence under our conditions is N^{p/(2p+k)}, and Collomb and Härdle (1986) showed that this rate is achievable with kernel density estimators such as (3.2), for instance, by taking h* = h_0 N^{-1/(2p+k)}. But we have that N(h*)^{2p-2} → ∞, which violates Condition (c), so as N → ∞, h must converge to 0 more quickly than h*. This occurs because (c) is a bias condition; as N → ∞, the (pointwise) bias of f̂_h(x) must vanish at a faster rate than its (pointwise) variance, for the bias of δ̂ to be o(N^{-1/2}). In other words, for √N consistent estimation of δ, one must "undersmooth" the nonparametric component l̂_h(x).

4.2 Modeling Multiple Regression

Theorem 3.3 shows that the optimal one-dimensional convergence rate is achievable in the estimation of g(x^Tδ) = E(y | x^Tδ), using δ̂ instead of δ. The requirement that g(·) is twice differentiable affixes the optimal rate at N^{2/5}, but otherwise plays no role: if g(·) is assumed differentiable of order q and K_1(·) is a kernel of order q, then it is easily shown that the optimal rate of N^{q/(2q+1)} is attained. The attainment of optimal one-dimensional rates of convergence is possible for the ADE method because the additive structure of g(x^Tδ) is sufficient to follow the "dimension reduction principle" of Stone (1986). Alternative uses of additive structure can be found in Breiman and Friedman (1985) and Hastie and Tibshirani (1986).

The ADE method can be regarded as a version of PPR of Friedman and Stuetzle (1981). The first step of PPR is to choose β (normalized as a direction) and g to minimize s(g, β) = Σ_i [y_i - g(x_i^Tβ)]², and any model of the form m(x) = g(x^Tβ) is inferred by the ADE estimator m̂(x) = ĝ_{h'}(x^Tδ̂) at the optimal one-dimensional rate of convergence. For a general regression function, however, m̂(x), ĝ_{h'}, and δ̂ will not necessarily minimize the sum of squares s(g, β): given g, β is chosen such that {y_i - g(x_i^Tβ)} is orthogonal to {x_i g'(x_i^Tβ)}, which does not imply that β = δ/|δ|.

Given δ̂, ĝ_{h'} is a local least squares estimator; namely, Σ_i K_1[(z - x_i^Tδ̂)/h'](y_i - t)² is minimized by t = ĝ_{h'}(z). Moreover, δ̂ is a type of least squares estimator, as follows. Set λ̂_i = (S_l)^{-1} l̂_h(x_i)Î_i, where S_l is the sample moment S_l = N^{-1} Σ_i l̂_h(x_i) l̂_h(x_i)^T Î_i. Then δ̂ is the value of d that minimizes the sum of squares Σ_i [y_i - λ̂_i^T d]², or equivalently, S_l^{-1}δ̂ are the coordinates of {y_i} projected onto the subspace spanned by {l̂_h(x_i)Î_i}.
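The least-squares characterization of δ̂ just stated can be checked in one line from the normal equations; in our notation, with λ̂_i = S_l^{-1} l̂_h(x_i)Î_i,

\[
\hat d
= \Big(\textstyle\sum_i \hat\lambda_i \hat\lambda_i^{\mathsf T}\Big)^{-1} \sum_i \hat\lambda_i y_i
= \big(N S_l^{-1}\big)^{-1} S_l^{-1} \sum_i \hat l_h(x_i)\, y_i\, \hat I_i
= N^{-1} \sum_i \hat l_h(x_i)\, y_i\, \hat I_i
= \hat\delta ,
\]

since Σ_i λ̂_i λ̂_i^T = S_l^{-1}(N S_l)S_l^{-1} = N S_l^{-1} and Î_i² = Î_i.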
ADE and PPR thus represent different computational methods of inferring m(x) = g(x^Tβ). The possible advantages of ADE arise from reduced computational effort; given h, b, and h', m̂(x) = ĝ_{h'}(x^Tδ̂) is computed directly from the data, whereas minimizing s(g, β) (by checking all directions β and computing ĝ for each β) typically involves considerable computational effort [although the results of Ichimura (1987) may provide some improvement].

5. ADE IN AN AUTOMOBILE COLLISION STUDY

We illustrate the ADE approach with data from a project on the calibration of automobile dummies for studying automobile safety. The data consist of observations from N = 58 simulated side impact collisions as described in Kallieris, Mattern, and Härdle (1989) and listed in the Appendix. All calculations are performed using GAUSS on a microcomputer.

Table 2. Collision (Side Impact) Data

AGE  VEL  ACL  y     AGE  VEL  ACL  y
 22   50   98  0      30   45   95  0
 21   49  160  0      27   46   96  1
 40   50  134  1      25   44  106  0
 43   50  142  1      53   44   86  1
 23   51  118  0      64   45   65  1
 58   51  143  1      54   45  103  0
 29   51   77  0      41   45  102  1
 29   51  184  0      36   45  108  1
 47   51  100  1      27   45  140  0
 39   51  188  1      45   45   94  1
 22   50  162  0      49   40   77  0
 52   51  151  1      24   40  101  0
 28   50  181  1      65   40   82  1
 42   50  158  1      63   51  169  1
 59   51  168  1      26   40   82  0
 28   41  128  0      60   45   83  1
 23   61  268  1      47   45  103  1
 38   41   76  0      59   44  104  1
 50   61  185  1      26   44  139  0
 28   41   58  0      31   45  128  1
 40   61  190  1      47   46  138  1
 32   50   94  0      41   45  102  0
 53   47  131  0      25   44   90  0
 44   50  120  1      50   44   88  1
 38   51  107  1      53   50  128  1
 36   50   97  0      62   50  136  1
 33   53  138  1      23   50  108  0
 51   41   68  1      27   60  176  1
 60   42   78  1      19   60  191  0

Of interest is whether the accidents are judged to result in a fatality, so the response is y = 1 if fatal, y = 0 if not fatal. The k = 3 predictor variables are the age of the subject (AGE, x_1), the velocity of the automobile (VEL, x_2), and the maximal acceleration (upon impact) measured on the subject's abdomen (ACL, x_3). The x variables are standardized for the analysis: each variable is centered by its sample mean and divided by its standard deviation. For this application, the regression E(y | x) = m(x) is the conditional probability of a fatality given x.

Because of the moderately small sample size, for computing δ̂ we use a standard positive kernel instead of the higher-order kernel prescribed by Theorem 3.1 (for k = 3, a kernel of order p = 5 is indicated, and some limited small-sample Monte Carlo experiments showed that the oscillating local weights produce slightly smaller bias but considerably higher variance than a standard positive kernel). In particular, we used the kernel K(u_1, u_2, u_3) = K_1(u_1)K_1(u_2)K_1(u_3), where K_1 is the univariate "biweight" kernel

K_1(u) = (15/16)(1 - u²)² I(|u| ≤ 1).   (5.1)

Although our theoretical results do not constrain the choice of bandwidth h, some Monte Carlo experience suggests that reasonable small-sample performance is obtained by setting h in the range of 1 to 2 (one to two standard deviations of the predictors), and so we set h = 1.5. Likewise for the trimming bound b; for interpretation we set the bound to drop the α = 5% of observations with the smallest estimated density values.
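For completeness, the biweight kernel (5.1) and its derivative (which enters the density-gradient estimate behind δ̂) can be coded directly; this is a small sketch in our notation, with the product form K(u_1, u_2, u_3) = K_1(u_1)K_1(u_2)K_1(u_3) used in this section:

```python
import numpy as np

def biweight(u):
    """Biweight kernel K1(u) = (15/16)(1 - u^2)^2 on |u| <= 1, Eq. (5.1)."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u ** 2) ** 2, 0.0)

def biweight_deriv(u):
    """Derivative K1'(u) = -(15/4) u (1 - u^2) on |u| <= 1."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, -(15.0 / 4.0) * u * (1.0 - u ** 2), 0.0)

def product_biweight(u):
    """Product kernel K(u1, u2, u3) = K1(u1) K1(u2) K1(u3) for u of shape (..., 3)."""
    return np.prod(biweight(u), axis=-1)
```

With the predictors standardized, h = 1.5 and a 5% trimming rate correspond to the settings reported in the note to Table 1; these kernels can be substituted for the Gaussian placeholder in the average-derivative sketch given after Eq. (3.4).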
The average derivative estimates δ̂ are given in Table 1 for the collision data in Table 2. The AGE effect is reasonably precisely estimated, whereas the VEL and ACL effects are not very well estimated (on the basis of their standard errors). On the basis of the appropriate Wald statistics, (δ_1, δ_2, δ_3) = 0 and (δ_2, δ_3) = 0 are rejected at a 5% level of significance, whereas δ_3 = 0 is not. Consequently, we could set δ̂_3 = 0 for the remainder of the analysis, but we do not do so here, retaining the estimates from Table 1. In addition, the estimates are not sensitive to the bandwidth or trimming percentage choice; although not reported, virtually identical estimates are obtained for bandwidths in the range of 1 to 2 and trimming percentages in the range of 1%-10%.

Table 1. Average Derivative Estimates for Collision Data

                      Predictor variables
δ̂                  AGE (x_1)   VEL (x_2)   ACL (x_3)
Value                 .134        .051        .045
Standard error        .033        .028        .027

                            Hypothesis tests
Null hypothesis                Wald statistic W   Degrees of freedom q   Pr(χ²_q > W)
(δ_1, δ_2, δ_3) = (0, 0, 0)         19.41                  3                .00023
(δ_2, δ_3) = (0, 0)                  7.61                  2                .022
δ_3 = 0                              3.44                  1                .063

NOTE: N = 58; h = 1.5; α = 5%.

For computing the kernel regression ĝ(·) of y on x^Tδ̂, we also employed the biweight kernel K_1, with bandwidth h' = .20. The curve ĝ is graphed in Figure 1. Figure 1 displays the familiar shape of a cumulative distribution function, but it is important to note that there is nothing in the framework that implies this shape or implies that g(x^Tδ) = E(y | x^Tδ) (or ĝ) should be monotonic in x^Tδ. Although somewhat beyond the scope of this article, it may be of interest to explore some features of this finding, as a brief illustration of how nonparametric analysis can be used to guide parametric modeling.

[Figure 1. ADE Regression for Collision Data: the estimated curve ĝ (estimated probability of fatality, 0 to 1) plotted against z = x^Tδ̂ over roughly -0.60 to 0.60.]

In particular, suppose that these data were consistent with a (homoscedastic) discrete response model of the form

y = 1 if ε < x^Tδ
  = 0 if ε ≥ x^Tδ,   (5.2)


where ε is distributed independently of x (possibly with nonzero mean). This formulation specializes (2.2) by setting ψ(z) = z and normalizing β to δ. As discussed previously, in this model E(y | x) = m(x) = g(x^Tδ) = Pr(ε < x^Tδ), so g(z) is the cumulative distribution function of ε. Moreover, under this model, g' = dg/dz is the density function of ε, which we can estimate by the kernel estimator of g' (the derivative of ĝ). The estimator is graphed in Figure 2. Its multimodal shape indicates the possibility of a "mixture" distribution for ε, which is in contrast with standard parameterizations of binary response models (e.g., in a probit model ε is assumed to be normally distributed). Although we do not pursue these issues further here, at minimum, these results suggest that one should test for the presence of a mixture, as well as look for additional distinctions in the data (or design discrimination rules) that can be built into the model so that a unimodal density for ε is statistically appropriate.

[Figure 2. ADE Regression Derivative for Collision Data: the estimated derivative ĝ' plotted against z = x^Tδ̂ over the same range, with values from 0 to about 3.5.]

The appearance of several modes of ĝ'(·) is not due to undersmoothing; it remains with a tripling of the bandwidth h' to .6. Moreover, it is not due to using the imprecise estimate δ̂_3; dropping ACL and reestimating gives a more pronounced multimodal shape of ĝ'.

There is one feature of the results, which appears in conflict with the framework, that merits further study. The average of ĝ'(x_i^Tδ̂) over the data is 1.76, which contrasts with the normalization E(g') = 1. Although possibly due to sampling error or our particular choice of h' (doubling h' to .4 decreases the average to 1.17), this could signal underestimation of δ or, in particular, underestimation of the scale of δ. Although we could easily "correct" for this, our intention here is just to indicate the need for further study of the scaling and/or normalization of δ̂. With regard to the preceding discussion, it is important to note that a rescaling of δ̂ would only relabel the horizontal axes of Figures 1 and 2. In particular, the scaling of δ̂ does not affect the substantive conclusions of Figures 1 and 2, nor does it affect the fitted values m̂(x) of the ADE model (1.3).

6. CONCLUDING REMARKS

In this article we have advanced the ADE method as a useful yet flexible tool for studying general regression relationships. At its center is the estimation of average derivatives, which we propose as sensible coefficients for measuring the relative impacts of separate predictor variables on the mean response. Although we have established attractive statistical properties for the estimators, it is important to stress that the real motivation for the ADE method is the economy it offers for nonparametric data summarization. Instead of attempting to interpret a fully flexible nonparametric regression, the ADE method permits the significance of individual predictor variables to be judged via simple hypothesis tests on the value of the average derivatives. Nonlinearity of the relationship is summarized by a graph of the function ĝ. As such, we regard the ADE method as a natural outgrowth of linear modeling, or "running (ordinary least squares) regressions," as a useful method of data summarization.

Although the results of our empirical illustration are encouraging [another application is given in Härdle, Hildenbrand, and Jerison (1988)], many questions can be posed regarding the practical implementation of the ADE estimators. For instance, are there automatic methods for setting the bandwidth and trimming parameters that assure good small-sample performance of the estimators? Would small-sample performance be improved by normalizing the scale of δ̂ or by using alternative methods of nonparametric approximation for the ingredients of δ̂? These sorts of issues need to be addressed as part of future research.

The ADE estimators are simple to compute, using standard software packages available for microcomputers. In addition, these procedures are being implemented as part of the exploratory data software package XploRe of Härdle (1988).

APPENDIX: ASSUMPTIONS, PROOFS OF THEOREMS, AND DATA

A.1 Assumptions for Theorems 3.1, 3.2, and 3.3

1. The support Ω of f is a convex, possibly unbounded subset of R^k with nonempty interior. The underlying measure of (y, x) can be written as ν_y × ν_x, where ν_x is Lebesgue measure.

2. f(x) = 0 for all x ∈ ∂Ω, where ∂Ω is the boundary of Ω.


3. m(x) ≡ E(y | x) is continuously differentiable on Ω̃ ⊂ Ω, where Ω - Ω̃ is a set of measure 0.

4. The moments E[l^T(x)l(x)y²] and E[(m')^T(m')] exist. M_2(x) ≡ E(y² | x) is continuous.

5. All derivatives of f(x) of order p exist, where p ≥ k + 2.

6. The kernel function K(·) has support {u | |u| ≤ 1}, is symmetric, has p moments, and K(u) = 0 for all u ∈ {u | |u| = 1}. K(u) is of order p:

∫ K(u) du = 1,
∫ u_1^{l_1} u_2^{l_2} ··· u_k^{l_k} K(u) du = 0 for l_1 + l_2 + ··· + l_k < p,
∫ u_1^{l_1} u_2^{l_2} ··· u_k^{l_k} K(u) du ≠ 0 for l_1 + l_2 + ··· + l_k = p.

7. The functions f(x) and m(x) obey local Lipschitz conditions: for v in a neighborhood of 0, there exist functions ω_f, ω_{f'}, ω_{m'}, and ω_{lm} such that

|f(x + v) - f(x)| ≤ ω_f(x)|v|,
|f'(x + v) - f'(x)| ≤ ω_{f'}(x)|v|,
|m'(x + v) - m'(x)| ≤ ω_{m'}(x)|v|,

and

|l(x + v)m(x + v) - l(x)m(x)| ≤ ω_{lm}(x)|v|,

where E[(yω_f)²] < ∞, E[(yω_{f'})²] < ∞, E[ω_{m'}²] < ∞, and E[ω_{lm}²] < ∞.

8. Let A_N = {x | f(x) > b} and B_N = {x | f(x) ≤ b}. As N → ∞, ∫_{B_N} m(x)f'(x) dx = o(N^{-1/2}).

9. If f^{(p)} denotes any pth-order derivative of f, then f^{(p)} is locally Hölder continuous: there exist γ > 0 and c(x) such that |f^{(p)}(x + v) - f^{(p)}(x)| ≤ c(x)|v|^γ. The p + γ moments of K(u) exist. The following integrals are bounded as N → ∞:

∫_{A_N} m(x)f^{(p)}(x) dx;   h^γ ∫_{A_N} c(x)m(x) dx;
∫_{A_N} m(x)l(x)f^{(p)}(x) dx;   h^γ ∫_{A_N} c(x)m(x)l(x) dx.

An additional assumption for Theorem 3.3 follows.

10. m(x) = E(y | x) is twice differentiable for all x in the interior of Ω.

A.2 Proof of the Main Results

We begin with two preliminary remarks. First, Equation (3.1) is shown formally as theorem 1 of Stoker (1986), by componentwise integration by parts (see also Beran 1977). Second, because of Condition (c), as N → ∞, the pointwise mean squared errors of f̂_h and f̂'_h are dominated by their variances. Therefore, since the set {x | f(x) ≥ b} is compact and b^{-1}h → 0, for any ε > 0 we have that [compare Silverman (1978) and Collomb and Härdle (1986)]

sup_x |f̂_h(x) - f(x)| I[f(x) > b] = O_p[(N^{1-(ε/2)} h^k)^{-1/2}]   (A.1a)

and

sup_x |f̂'_h(x) - f'(x)| I[f(x) > b] = O_p[(N^{1-(ε/2)} h^{k+2})^{-1/2}].   (A.1b)

In the proofs, we use two (unobservable) "estimators" that are related to δ̂. First, define δ̄ based on trimming with respect to the true density value:

δ̄ = N^{-1} Σ_{i=1}^N l̂_h(x_i) y_i I_i,   (A.2)

where I_i ≡ I[f(x_i) > b] (i = 1, ..., N). Next define a linearization δ̃ of δ̄:

δ̃ = δ̃_0 + δ̃_1 + δ̃_2,   (A.3)

where

δ̃_0 = N^{-1} Σ_{i=1}^N l(x_i) y_i I_i,
δ̃_1 = -N^{-1} Σ_{i=1}^N {[f̂'_h(x_i) - f'(x_i)]/f(x_i)} y_i I_i,
δ̃_2 = -N^{-1} Σ_{i=1}^N {[f̂_h(x_i) - f(x_i)]/f(x_i)} l(x_i) y_i I_i.   (A.4)

Proof of Theorem 3.1. The proof consists of the following four steps.

Step 1. Linearization: √N(δ̄ - δ̃) = o_p(1).
Step 2. Asymptotic normality: √N[δ̃ - E(δ̃)] has a limiting normal distribution with mean 0 and variance Σ.
Step 3. Asymptotic bias: √N[E(δ̃) - δ] = o(1).
Step 4. Trimming: √N(δ̂ - δ) has the same limiting distribution as √N(δ̄ - δ).

The combination of Steps 1-4 yields Theorem 3.1.

Step 1: Linearization. Some arithmetic gives

√N(δ̄ - δ̃) = N^{-1/2} Σ_i {[f̂_h(x_i) - f(x_i)][f̂'_h(x_i) - f'(x_i)]/[f(x_i)f̂_h(x_i)]} y_i I_i + N^{-1/2} Σ_i {[f̂_h(x_i) - f(x_i)]²/[f(x_i)f̂_h(x_i)]} l(x_i) y_i I_i,

so by (A.1a), there is a constant c_f such that, with high probability,

|√N(δ̄ - δ̃)| ≤ b^{-2} c_f (N^{1-(ε/2)} h^k)^{-1/2} N^{1/2} {sup_{A_N}|f̂'_h - f'| (N^{-1} Σ_i |y_i| I_i) + sup_{A_N}|f̂_h - f| (N^{-1} Σ_i |l(x_i)y_i| I_i)}.

The terms N^{-1} Σ_i |y_i| I_i and N^{-1} Σ_i |l(x_i)y_i| I_i are bounded in probability by Chebyshev's inequality, so by (A.1a) and (A.1b) we have that √N(δ̄ - δ̃) = o_p(1), since b^4 N^{1-ε} h^{2k+2} → ∞ by Condition (b).

Step 2: Asymptotic Normality. We show that √N[δ̃ - E(δ̃)] has a limiting normal distribution by showing that δ̃_1 and δ̃_2 are √N equivalent to (ordinary) sample averages and then appealing to standard central limit theory. Throughout this section, v_i = (y_i, x_i). For δ̃_0, we have that

√N[δ̃_0 - E(δ̃_0)] = N^{-1/2} Σ_i {r_0(v_i) - E[r_0(v)]} + o_p(1),   (A.5)

where r_0(v) = l(x)y, since var(ly) exists and b → 0 as N → ∞.


To analyze δ̃_1 and δ̃_2, we approximate them by U statistics. The U statistic related to δ̃_1 can be written as

U_1 = (N choose 2)^{-1} Σ_{i=1}^{N-1} Σ_{j=i+1}^N p_{1N}(v_i, v_j),

with

p_{1N}(v_i, v_j) = (1/2) h^{-(k+1)} K'((x_i - x_j)/h) [y_j I_j/f(x_j) - y_i I_i/f(x_i)],

where K' ≡ ∂K/∂u. Note that by the symmetry of K(·), we have

√N[δ̃_1 - E(δ̃_1)] = √N[U_1 - E(U_1)] - N^{-1}{√N[U_1 - E(U_1)]}.

The second term in this expansion will converge in probability to 0 provided that √N[U_1 - E(U_1)] has a limiting distribution, which we show later. Therefore, we have that

√N[δ̃_1 - E(δ̃_1)] = √N[U_1 - E(U_1)] + o_p(1).   (A.6)

The U statistic related to δ̃_2 is

U_2 = (N choose 2)^{-1} Σ_{i=1}^{N-1} Σ_{j=i+1}^N p_{2N}(v_i, v_j),

with

p_{2N}(v_i, v_j) = -(1/2) h^{-k} K((x_i - x_j)/h) [l(x_i) y_i I_i/f(x_i) + l(x_j) y_j I_j/f(x_j)].

U_2 is related to δ̃_2 via

√N[δ̃_2 - E(δ̃_2)] = √N[U_2 - E(U_2)] - N^{-1}{√N[U_2 - E(U_2)]} + N^{-1/2} Σ_i N^{-1} h^{-k} K(0) [l(x_i) y_i I_i/f(x_i)].

As before, the second term converges in probability to 0 provided that √N[U_2 - E(U_2)] has a limiting distribution, as shown later. The third term converges in probability to 0, because its variance is bounded by K(0)² N^{-2} h^{-2k} (h/b)² E[l(x)²y²I] = o(1), since Nh^k → ∞ and h/b → 0. Therefore,

√N[δ̃_2 - E(δ̃_2)] = √N[U_2 - E(U_2)] + o_p(1).   (A.7)

The analysis of U_1 and U_2 is quite similar, so we present the details only for U_1. We note that U_1 is a U statistic with varying kernel (e.g., see Nolan and Pollard 1987), since p_{1N} depends on N through the bandwidth h. Asymptotic normality of U_1 follows from lemma 3.1 of Powell, Stock, and Stoker (in press), which states that if E[|p_{1N}(v_i, v_j)|²] = o(N), then

√N[U_1 - E(U_1)] = N^{-1/2} Σ_i {r_{1N}(v_i) - E[r_{1N}(v)]} + o_p(1),   (A.8)

where r_{1N}(v_i) = 2E[p_{1N}(v_i, v_j) | v_i]. This condition is implied by (b): if M_1(x) ≡ E(yI | x) and M_2(x) ≡ E(y²I | x), then

E[|p_{1N}(v_i, v_j)|²] ≤ (4b²h^{2k+2})^{-1} ∫∫ |K'((x_i - x_j)/h)|² [M_2(x_i) + M_2(x_j) - 2M_1(x_i)M_1(x_j)] f(x_i)f(x_j) dx_i dx_j
= (4b²h^{2k+2})^{-1} h^k ∫∫ |K'(u)|² [M_2(x) + M_2(x + hu) - 2M_1(x)M_1(x + hu)] f(x)f(x + hu) dx du
= O(b^{-2}h^{-(k+2)}) = O[N(b²Nh^{k+2})^{-1}] = o(N),

since b²Nh^{k+2} → ∞ is implied by Condition (b). Therefore, (A.8) is valid.

We now refine (A.8) to show that U_1 is equivalent to an average whose components do not vary with N, namely, the average of r_1(v) - E[r_1(v)], where r_1(v_i) = l(x_i)y_i + m'(x_i). For this, let b* = sup_{x,u}{f(x + hu) | f(x) = b, |u| ≤ 1} and I_i* = I[f(x_i) > b*]. By construction, if |u| ≤ 1, then I[f(x + hu) > b] - I_i* ≠ 0 only when I_i* = 0, and b* → 0 and h/b* → 0 as b → 0 and h → 0. Now write r_{1N}(v_i) = E[2p_{1N}(v_i, v_j) | v_i] as

r_{1N}(v_i) = h^{-(k+1)} ∫ K'((x_i - x)/h) {m(x)I[f(x) > b] - [y_i I_i/f(x_i)] f(x)} dx
= [y_i I_i/f(x_i)] ∫ h^{-1}K'(u) f(x_i + hu) du - I_i* ∫ h^{-1}K'(u) m(x_i + hu) du - (1 - I_i*) ∫ h^{-1}K'(u) m(x_i + hu) {I[f(x_i + hu) > b] - I_i*} du
= -[y_i I_i/f(x_i)] ∫ K(u) f'(x_i + hu) du + I_i* ∫ K(u) m'(x_i + hu) du + (1 - I_i*) a(x_i; h, b),

where a(x_i; h, b) = -∫ h^{-1}K'(u) m(x_i + hu){I[f(x_i + hu) > b] - I_i*} du, so the difference between r_{1N} and r_1 is

t_{1N}(v_i) ≡ r_{1N}(v_i) - r_1(v_i)
= -[y_i I_i/f(x_i)] ∫ K(u)[f'(x_i + hu) - f'(x_i)] du + I_i* ∫ K(u)[m'(x_i + hu) - m'(x_i)] du - (1 - I_i) l(x_i)y_i - (1 - I_i*) m'(x_i) + (1 - I_i*) a(x_i; h, b).

The second moment E[|t_{1N}(v)|²] vanishes as N → ∞. By Assumption 7, the second moment of [y_i I_i/f(x_i)] ∫ K(u)[f'(x_i + hu) - f'(x_i)] du is bounded by (h/b)² (∫ |u|K(u) du)² E[(yω_{f'})²] = O[(h/b)²] = o(1). The second moment of I_i* ∫ K(u)[m'(x_i + hu) - m'(x_i)] du is bounded by h² (∫ |u|K(u) du)² E[ω_{m'}²] = O(h²) = o(1). The second moments of (1 - I_i)l(x_i)y_i and (1 - I_i*)m'(x_i) vanish by Assumption 4, since b → 0 and b* → 0. Finally, the second moment of (1 - I_i*)a(x_i; h, b) vanishes if the second moment of a(x_i; h, b) exists. Consider the ℓth component a_ℓ(x_i; h, b) of a, and define the marginal kernel K_{(ℓ)} = ∫ K(u) du_ℓ and the conditional kernel K_ℓ = K/K_{(ℓ)}. For given x, integrating a_ℓ(x; h, b) by parts absorbs h^{-1} and shows that a_ℓ is the sum of two terms: the expectation [with regard to K(u)] of m'_ℓ(x + hu){I[f(x + hu) > b] - I[f(x) > b*]} and the expectation [with regard to K_{(ℓ)}] of K_ℓ m(x + hu) over u values such that f(x + hu) = b. Because the variances of m' and y exist, the second moment of each of these expectations exists, so E(a_ℓ²) exists. Therefore, E(a²) exists, so the second moment of (1 - I_i*)a(x; h, b) vanishes, which suffices to prove E[|t_{1N}(v)|²] = o(1).

This fact completes the proof that U_1 is asymptotically normal, as

N^{-1/2} Σ_i {r_{1N}(v_i) - E[r_{1N}(v)]} = N^{-1/2} Σ_i {r_1(v_i) - E[r_1(v)]} + N^{-1/2} Σ_i {t_{1N}(v_i) - E[t_{1N}(v)]},   (A.9)


and the last term converges in probability to 0, since its variance is bounded by E[|t_{1N}(v)|²] = o(1). Combining (A.9), (A.8), and (A.6), we have

√N[δ̃_1 - E(δ̃_1)] = N^{-1/2} Σ_i {r_1(v_i) - E[r_1(v)]} + o_p(1),   (A.10)

where r_1(v) = l(x)y + m'(x).

The U statistic representation of δ̃_2 is analyzed in a similar fashion. In particular, E[|p_{2N}(v_i, v_j)|²] = o(N) follows from (b), so U_2 - E(U_2) is √N equivalent to a sample average, which combined with (A.7) gives

√N[δ̃_2 - E(δ̃_2)] = N^{-1/2} Σ_i {r_2(v_i) - E[r_2(v)]} + o_p(1),   (A.11)

where r_2(v) = -[l(x)y + l(x)m(x)]. Combining (A.5), (A.10), and (A.11) yields Step 2, as

√N[δ̃ - E(δ̃)] = N^{-1/2} Σ_i {r(v_i) - E[r(v)]} + o_p(1),   (A.12)

with r_0(v) + r_1(v) + r_2(v) = r(v) ≡ r(y, x) in the statement of Theorem 3.1.

Step 3: Asymptotic Bias. The bias of δ̃ is E(δ̃) - δ = τ_{0N} - τ_{1N} - τ_{2N}, where

τ_{0N} = E[l(x_i)y_i I_i] - δ,
τ_{1N} = E{[(f̂'_h(x_i) - f'(x_i))/f(x_i)] y_i I_i},
τ_{2N} = E{[(f̂_h(x_i) - f(x_i))/f(x_i)] l(x_i) y_i I_i}.

Let A_N and B_N be defined as before; then

τ_{0N} = ∫_{A_N} l(x)m(x)f(x) dx - ∫ l(x)m(x)f(x) dx = ∫_{B_N} m(x)f'(x) dx = o(N^{-1/2})

by Assumption 8. We only show that τ_{1N} = o(N^{-1/2}), with the proof of τ_{2N} = o(N^{-1/2}) quite similar. Let ι denote an index set (ℓ_1, ..., ℓ_k) with Σ_j ℓ_j = p. For u = (u_1, ..., u_k), define u^ι = u_1^{ℓ_1} ··· u_k^{ℓ_k} and f^{(p)} = ∂^p f/(∂u)^ι. By partial integration we have

τ_{1N} = ∫_{A_N} m(x) ∫ K(u)[f'(x + hu) - f'(x)] du dx = ∫_{A_N} m(x) Σ_ι ∫ K(u) h^{p-1} f^{(p)}(ξ) u^ι du dx,

where the summation is over all index sets ι with Σ_j ℓ_j = p and ξ lies on the line segment between x and x + hu. Therefore,

τ_{1N} = h^{p-1} ∫_{A_N} m(x) Σ_ι f^{(p)}(x) ∫ K(u)u^ι du dx + h^{p-1} ∫_{A_N} m(x) Σ_ι ∫ K(u)[f^{(p)}(ξ) - f^{(p)}(x)] u^ι du dx = O(h^{p-1})

by Assumption 9. Therefore, by Condition (c), we have τ_{1N} = O[N^{-1/2}(N^{1/2}h^{p-1})] = o(N^{-1/2}). The same analysis for τ_{2N} completes the proof of √N[E(δ̃) - δ] = o(1).

Step 4: Trimming. Steps 1-3 have shown that √N(δ̄ - δ - R̄) = o_p(1), where R̄ = N^{-1} Σ_i [r(y_i, x_i) - E(r)], so δ̄ is asymptotically normal. We now demonstrate the same property for δ̂. For this, let c_N = c_f(N^{1-(ε/2)}h^k)^{-1/2}, where c_f is an upper bound consistent with (A.1a). Define the average kernel estimator based on trimming with respect to the bound b + c_N: δ̂_u ≡ N^{-1} Σ_i l̂_h(x_i)y_i I[f(x_i) > b + c_N]. Since b^{-1}c_N → 0 by Condition (b), δ̂_u obeys the tenets of Steps 1-3, so √N(δ̂_u - δ - R̄) = o_p(1). We now show that √N(δ̂ - δ̂_u) = o_p(1). First, with high probability Î_i - I[f(x_i) > b + c_N] = Ĩ_i, where Ĩ_i ≡ I[f(x_i) ≤ b + c_N; f̂_h(x_i) > b], so

√N(δ̂ - δ̂_u) = N^{-1/2} Σ_i l̂_h(x_i)y_i Ĩ_i = N^{-1/2} Σ_i [l̂_h(x_i) - l(x_i)]y_i Ĩ_i + N^{-1/2} Σ_i l(x_i)y_i Ĩ_i.

The latter term vanishes in probability, since

E|N^{-1/2} Σ_i l(x_i)y_i Ĩ_i|² ≤ E{|l(x)y|² I[f(x) ≤ b + c_N]} = o(1)

by the Lebesgue dominated convergence theorem, since b + c_N → 0 and E|l(x)y|² exists. The first term also vanishes in probability, as

N^{-1/2} Σ_i [l̂_h(x_i) - l(x_i)]y_i Ĩ_i = N^{-1/2} Σ_i {[f'(x_i) - f̂'_h(x_i)]/f̂_h(x_i)} y_i Ĩ_i - N^{-1/2} Σ_i {[f̂_h(x_i) - f(x_i)]/f̂_h(x_i)} l(x_i) y_i Ĩ_i.

Thus, with high probability,

|N^{-1/2} Σ_i [l̂_h(x_i) - l(x_i)]y_i Ĩ_i| ≤ b^{-1} sup{|f̂'_h - f'| I[f(x) > b - c_N]} (N^{-1/2} Σ_i |y_i| Ĩ_i) + b^{-1} sup{|f̂_h - f| I[f(x) > b - c_N]} (N^{-1/2} Σ_i |l(x_i)y_i| Ĩ_i) = o_p(1),


since N^{-1/2} Σ_i |y_i| Ĩ_i and N^{-1/2} Σ_i |l(x_i)y_i| Ĩ_i are each o_p(1), as before, and b²N^{1-(ε/2)}h^{k+2} → ∞ by Condition (b). Therefore, √N(δ̂ - δ̂_u) = o_p(1), so √N(δ̂ - δ - R̄) = o_p(1). This completes the proof of Theorem 3.1.

Proof of Theorem 3.2. The estimator Σ̂ is constructed by direct estimation of the U statistic structure of δ̂. In particular, define p̂_{1N}(v_i, v_j) and p̂_{2N}(v_i, v_j) by replacing f, l, and I by f̂_h, l̂_h, and Î in the expressions for p_{1N} and p_{2N}. Next define r̂_{0i} = l̂_h(x_i)y_iÎ_i, r̂_{1i} = 2N^{-1} Σ_j p̂_{1N}(v_i, v_j), r̂_{2i} = 2N^{-1} Σ_j p̂_{2N}(v_i, v_j), and r̂_i = r̂_{0i} + r̂_{1i} + r̂_{2i}. By techniques similar to those cited for (A.1a,b), we have that sup_i |r̂_i - r(y_i, x_i)| Î_i = o_p(1).

An argument similar to Step 4 can be applied to Σ̂, so consistency of Σ̂ will follow from consistency of N^{-1} Σ_i r̂_i r̂_i^T Î_i for E(rr^T) and consistency of N^{-1} Σ_i r̂_i Î_i for E(r). But these follow immediately; for instance, we have

N^{-1} Σ_i r̂_i r̂_i^T Î_i - E(rr^T)
= N^{-1} Σ_i (r̂_i - r_i)(r̂_i - r_i)^T Î_i + N^{-1} Σ_i r_i(r̂_i - r_i)^T Î_i + N^{-1} Σ_i (r̂_i - r_i) r_i^T Î_i - N^{-1} Σ_i r_i r_i^T (1 - Î_i) + N^{-1} Σ_i r_i r_i^T - E(rr^T)
= o_p(1),

since sup_i |r̂_i - r(y_i, x_i)| Î_i = o_p(1), the variance of r exists, and Pr{f(x) < b} = o(1). This completes the proof of Theorem 3.2.

Proof of Theorem 3.3. With z_j = x_j^Tδ, define d_j = ẑ_j - z_j = x_j^T(δ̂ - δ); since f_1(z) ≥ b_1 > 0, d_j = O_p(N^{-1/2}). Denote by g̃_{h'} and f̃_{1h'} the kernel regression and density estimators (3.8) and (3.9) using z_j instead of ẑ_j. When h' ~ N^{-1/5}, it is a standard result (Schuster 1972) that N^{2/5}[g̃_{h'}(z) - g(z)] has the limiting distribution given in Theorem 3.3. Consequently, the result follows if ĝ_{h'}(z) - g̃_{h'}(z) = o_p(N^{-2/5}).

First consider f̂_{1h'} - f̃_{1h'}. By applying the triangle inequality to the Taylor expansion of f̂_{1h'}, we have

|f̂_{1h'}(z) - f̃_{1h'}(z)| ≤ sup_j{|d_j|} |f̃'_{1h'}(z)| + sup_j{|d_j|²} N^{-1}h'^{-3} Σ_j |K_1''[(z - ξ_j)/h']|,

where ξ_j lies between ẑ_j and z_j. Therefore, f̂_{1h'}(z) - f̃_{1h'}(z) = o_p(N^{-2/5}), and by a similar argument f̂_{1h'}(z)ĝ_{h'}(z) - f̃_{1h'}(z)g̃_{h'}(z) = o_p(N^{-2/5}), so we can conclude that ĝ_{h'}(z) - g̃_{h'}(z) = o_p(N^{-2/5}). This completes the proof of Theorem 3.3.

[Received March 1987. Revised March 1989.]

REFERENCES

Beran, R. (1977), "Adaptive Estimates for Autoregressive Processes," Annals of the Institute of Statistical Mathematics, 28, 77-89.
Bickel, P. (1982), "On Adaptive Estimation," The Annals of Statistics, 10, 647-671.
Box, G. E. P., and Cox, D. R. (1964), "An Analysis of Transformations," Journal of the Royal Statistical Society, Ser. B, 26, 211-252.
Breiman, L., and Friedman, J. H. (1985), "Estimating Optimal Transformations for Multiple Regression and Correlation," Journal of the American Statistical Association, 80, 580-619.
Carroll, R. J. (1982), "Adapting for Heteroscedasticity in Linear Models," The Annals of Statistics, 10, 1224-1233.
Collomb, G., and Härdle, W. (1986), "Strong Uniform Convergence Rates in Robust Nonparametric Time Series Analysis and Prediction: Kernel Regression Estimation From Dependent Observations," Stochastic Processes and Their Applications, 23, 77-89.
Friedman, J. H., and Stuetzle, W. (1981), "Projection Pursuit Regression," Journal of the American Statistical Association, 76, 817-823.
Härdle, W. (1988), "XploRe-A Computing Environment for Exploratory Regression and Density Smoothing," Statistical Software Newsletters, 14, 113-119.
Härdle, W., Hildenbrand, W., and Jerison, M. (1988), "Empirical Evidence on the Law of Demand," working paper (Sonderforschungsbereich 303), Universität Bonn.
Härdle, W., and Marron, J. S. (1987), "Semiparametric Comparison of Regression Curves," working paper (Sonderforschungsbereich 303), Universität Bonn.
Hastie, T., and Tibshirani, R. (1986), "Generalized Additive Models" (with discussion), Statistical Science, 1, 297-318.
Ichimura, H. (1987), "Estimation of Single Index Models," unpublished doctoral dissertation, Massachusetts Institute of Technology, Dept. of Economics.
Kallieris, D., Mattern, R., and Härdle, W. (1989), "Verhalten des EUROSID beim 90 Grad Seitenaufprall im Vergleich zu PMTO sowie US-SID, HYBRID II und APROD," in Forschungsvereinigung Automobiltechnik (FAT) Schriftenreihe, Frankfurt am Main.
Manski, C. F. (1984), "Adaptive Estimation of Nonlinear Regression Models," Econometric Reviews, 3, 145-194.
Manski, C. F., and McFadden, D. (1981), Structural Analysis of Discrete Data With Econometric Applications, Cambridge, MA: MIT Press.
McCullagh, P., and Nelder, J. A. (1983), Generalized Linear Models, London: Chapman & Hall.
Nolan, D., and Pollard, D. (1987), "U-Processes: Rates of Convergence," The Annals of Statistics, 15, 780-799.
O'Sullivan, F., Yandell, B. S., and Raynor, W. J. (1986), "Automatic Smoothing of Regression Functions in Generalized Linear Models," Journal of the American Statistical Association, 81, 96-103.
Powell, J. L. (1986), "Symmetrically Trimmed Least Squares Estimation for Tobit Models," Econometrica, 54, 1435-1460.
Powell, J. L., Stock, J. H., and Stoker, T. M. (in press), "Semiparametric Estimation of Index Coefficients," Econometrica, 57.
Robinson, P. M. (1988), "Root-N-Consistent Semiparametric Regression," Econometrica, 56, 931-954.
Schuster, E. F. (1972), "Joint Asymptotic Distribution of the Estimated Regression Function at a Finite Number of Distinct Points," The Annals of Mathematical Statistics, 43, 84-88.
Silverman, B. W. (1978), "Weak and Strong Uniform Consistency of the Kernel Estimate of a Density Function and Its Derivatives," The Annals of Statistics, 6, 177-184; Addendum (1980), 8, 1175-1176.
Stoker, T. M. (1986), "Consistent Estimation of Scaled Coefficients," Econometrica, 54, 1461-1481.
Stone, C. J. (1980), "Optimal Rates of Convergence for Nonparametric Estimators," The Annals of Statistics, 8, 1348-1360.
Stone, C. J. (1986), "The Dimensionality Reduction Principle for Generalized Additive Models," The Annals of Statistics, 14, 590-606.
