Let (x1, ..., xk, y) be a random vector where y denotes a response on the vector x of predictor variables. In this article we propose a technique [termed average derivative estimation (ADE)] for studying the mean response m(x) = E(y | x) through the estimation of the k vector of average derivatives δ = E(m'). The ADE procedure involves two stages: first estimate δ using an estimator δ̂, and then approximate m(x) by m̂(x) = ĝ(xᵀδ̂), where ĝ is an estimator of the univariate regression of y on xᵀδ̂. We argue that the ADE procedure exhibits several attractive characteristics: data summarization through interpretable coefficients, graphical depiction of the possible nonlinearity between y and xᵀδ̂, and theoretical properties consistent with dimension reduction. We motivate the ADE procedure using examples of models that take the form m(x) = g(xᵀβ). In this framework, δ is shown to be proportional to β and m̂(x) infers m(x) exactly. The focus of the procedure is on the estimator δ̂, which is based on a simple average of kernel smoothers and is shown to be a √N consistent and asymptotically normal estimator of δ. The estimator ĝ(·) is a standard kernel regression estimator and is shown to have the same properties as the kernel regression of y on xᵀδ. In sum, the estimator δ̂ converges to δ at the rate typically available in parametric estimation problems, and m̂(x) converges to E(y | xᵀδ) at the optimal one-dimensional nonparametric rate. We also give a consistent estimator of the asymptotic covariance matrix of δ̂, to facilitate inference. We discuss the conditions underlying these results, including how √N consistent estimation of δ requires undersmoothing relative to pointwise multivariate estimation. We also indicate the relationship between the ADE method and projection pursuit regression. For illustration, we apply the ADE method to data on automobile collisions.

KEY WORDS: ADE regression; GLIM models; Kernel estimation; Nonparametric estimation.
typical for one-dimensional smoothing problems (see Stone 1980). Although δ̂ and ĝ(·) each involve choice of a smoothing parameter, they are computed directly from the data in two steps and thus require no computer-intensive iterative techniques for finding optimal objective function values.

Section 2 motivates the ADE method through several examples familiar from applied work. Section 3 introduces the estimators δ̂ and ĝ and establishes their large-sample statistical properties. Section 4 discusses the results, including the relationship of the ADE method to projection pursuit regression (PPR) of Friedman and Stuetzle (1981). Section 5 applies the ADE method to data on automobile collisions. Section 6 concludes with a discussion of related research.

2. MOTIVATION OF THE ADE PROCEDURE

The average derivative δ is most naturally interpreted in situations where the influence of x on y is modeled via a weighted sum of the predictors; where m(x) = g(xᵀβ) for a vector of coefficients β. In such a model, δ is intimately related to β, as m' = [dg/d(xᵀβ)]β, so that δ = E[dg/d(xᵀβ)]β = γβ, where γ is a scalar (assumed nonzero). Thus δ is proportional to the coefficients β whenever the mean response is determined by xᵀβ.

An obvious example is the classical linear regression model; y = α + xᵀβ + ε, where ε is a random variable such that E(ε | x) = 0, which gives δ = β. Another class of models is those that are linear up to transformations:

  φ(y) = ψ(xᵀβ) + ε,   (2.1)

where ψ(·) is a nonconstant transformation, φ(·) is invertible, and ε is a random disturbance that is independent of x. Here we have that m(x) = E[φ⁻¹(ψ(xᵀβ) + ε) | x] = g(xᵀβ). The form (2.1) includes the model of Box and Cox (1964), where φ(y) = (y^λ1 − 1)/λ1 and ψ(xᵀβ) = α + [(xᵀβ)^λ2 − 1]/λ2.

Other models exhibiting this structure are discrete regression models, where y is 1 or 0 according to

  y = 1 if ε ≤ ψ(xᵀβ),   y = 0 if ε > ψ(xᵀβ).   (2.2)

Here the regression function m(x) is the probability that y = 1, which is given as m(x) = Pr{ε ≤ ψ(xᵀβ) | x} = g(xᵀβ). References to specific examples of binary response models can be found in Manski and McFadden (1981). Standard probit models specify that ε is a normal random variable (with distribution function Φ) and ψ(xᵀβ) = α + xᵀβ, giving m(x) = Φ(α + xᵀβ). Logistic regression models are likewise included; here m(x) = exp(α + xᵀβ)/[1 + exp(α + xᵀβ)].

Censored regression, where

  y = ψ(xᵀβ) + ε if ψ(xᵀβ) + ε ≥ 0,   y = 0 if ψ(xᵀβ) + ε < 0,   (2.3)

is likewise included, and setting ψ(xᵀβ) = α + xᵀβ gives the familiar censored linear regression model [see Powell (1986), among others].

A parametric approach to the estimation of any of these models, for instance based on maximum likelihood, requires the (parametric) specification of the distribution of the random variable ε and of the transformation ψ(·), and for (2.1), the transformation φ(·). Substantial bias can result if any of these features is incorrectly specified. Nonparametric estimation of δ = γβ avoids such restrictive specifications. In fact, the form m(x) = g(xᵀβ) generalizes the "generalized linear models" (GLIM); see McCullagh and Nelder (1983). These models have g invertible, with g⁻¹ referred to as the "link" function. Other approaches that generalize GLIM can be found in Breiman and Friedman (1985), Hastie and Tibshirani (1986), and O'Sullivan, Yandell, and Raynor (1986).

Turning our attention to ADE regression modeling, we show in the next section that m̂(x) of (1.3) will estimate g(xᵀδ) = E(y | xᵀδ), in general. Consequently, the ADE method will completely infer m(x) when

  m(x) = g(xᵀδ).   (2.4)

But this is the case for each of the aforementioned examples, or whenever m(x) = g(xᵀβ), since a (nonzero) rescaling of β can be absorbed into g. Here m(x) is reparameterized to have coefficients δ = γβ by defining g(·) ≡ g̃(·/γ), where g̃ is the original function, so m(x) = g̃(xᵀβ) = g(xᵀδ). This rescaling corresponds to E[dg/d(xᵀδ)] = 1, a normalization of g that would not obtain for alternative scalings of β.

Equivalently, we can interpret the scale of δ by noting that if each value x is translated to x + Δ, then the change in the overall mean of y is Δᵀδ. This feature is familiar for coefficients when the true model is linear, but not for coefficients within a nonlinear model. For instance, alternative scalings of β for the transformation model (2.1) would make the average change dependent on φ(·) and ψ(·).

Finally, there are modeling situations where δ is interpretable but (2.4) does not obtain. For instance, if x = (x1, x2) and the model is partially linear, m(x) = x1ᵀβ1 + φ(x2), then δ1 = β1 and δ2 = E(φ'), where δ = (δ1, δ2) conforms with the partition of x. If, in addition, φ(x2) = ψ(x2ᵀβ2), then δ1 = β1 and δ2 = E(ψ')β2, so δ2 is proportional to the coefficients within the nonlinear part of the model. See Robinson (1988) for references to partially linear models and Stoker (1986) for other examples where the average derivative has a direct interpretation.

3. KERNEL ESTIMATION OF AVERAGE DERIVATIVES

Our approach to estimation of δ uses nonparametric estimation of the marginal density of x. Let f(x) denote this marginal density, f' ≡ ∂f/∂x the vector of partial derivatives, and l ≡ −∂ ln f/∂x = −f'/f the negative log-density derivative. If f(x) = 0 on the boundary of x values, then integration by parts gives

  δ = E(m') = E[l(x)y].   (3.1)
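The identity (3.1) follows from a componentwise integration by parts; a brief worked sketch, using only the stated condition that f vanishes on the boundary of the x values (so the boundary term drops out):

\[
\delta = E(m') = \int m'(x)\,f(x)\,dx
       = -\int m(x)\,f'(x)\,dx
       = \int m(x)\Big[-\frac{f'(x)}{f(x)}\Big]f(x)\,dx
       = E[\,l(x)\,m(x)\,] = E[\,l(x)\,y\,],
\]

where the last equality uses iterated expectations, E[l(x)y] = E[l(x)E(y | x)] = E[l(x)m(x)].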
Our estimator of δ is a sample analog of the last term in this formula, using a nonparametric estimator of l(x) evaluated at each observation. In particular, the density function f(x) is estimated at x using the (Rosenblatt-Parzen) kernel density estimator.

Theorem 3.1. Given Assumptions 1-9 stated in the Appendix, if (a) N → ∞, h → 0, b → 0, and b⁻¹h → 0; (b) for some ε > 0, b⁴N^(1−ε)h^(2k+2) → ∞; and (c) Nh^(2p−2) → 0, then √N(δ̂ − δ) has a limiting normal distribution with mean 0 and variance Σ, where Σ is the covariance matrix of r(y, x), with

  r(y, x) = m'(x) + [y − m(x)]l(x).   (3.5)

The proof of Theorem 3.1, as well as those of the other results of the article, are contained in the Appendix.

The covariance matrix Σ could be consistently estimated as the sample variance of uniformly consistent estimators of r(yᵢ, xᵢ) (i = 1, ..., N), and the latter could be constructed using any uniformly consistent estimators of l(x), m(x), and m'(x). The proof of Theorem 3.1 suggests a more direct estimator of r(yᵢ, xᵢ), defined as

  r̂ᵢ = l̂_h(xᵢ)yᵢ + N⁻¹ Σⱼ h^(−(k+1)) K'((xᵢ − xⱼ)/h) [...].

The regression estimator ĝ_h'(z) appearing in (3.8) and (3.9) is the kernel regression of yᵢ on the fitted index ẑᵢ = xᵢᵀδ̂, computed with bandwidth h' = h1·N^(−1/5) and a (univariate) kernel function K1. Suppose, for a moment, that zᵢ = xᵢᵀδ instead of ẑᵢ were used in (3.8) and (3.9); then it is well known (Schuster 1972) that the resulting regression estimator is asymptotically normal and converges (pointwise) at the optimal (univariate) rate N^(2/5). Theorem 3.3 states that there is no cost to using the estimated values ẑᵢ as described previously.

Theorem 3.3. Given Assumptions 1-10 stated in the Appendix, let z be such that f1(z) ≥ b1 > 0. If N → ∞ and h' = h1·N^(−1/5), then N^(2/5)[ĝ_h'(z) − g(z)] has a limiting normal distribution with mean B(z) and variance V(z), where

  B(z) = [g''(z)/2 + g'(z)f1'(z)/f1(z)] ∫ u²K1(u) du
and
  V(z) = [var(y | xᵀδ = z)/f1(z)] ∫ K1(u)² du.   (3.10)

The bias and variance given in (3.10) can be estimated consistently for each z using y, ĝ_h', and f̂1_h', and their derivatives, using standard methods. Therefore, asymptotic confidence intervals can be constructed for ĝ_h'(z). It is clear that the same confidence intervals apply to m̂(x) = ĝ_h'(xᵀδ̂), for z = xᵀδ̂.

4. REMARKS AND DISCUSSION

4.1 The Average Derivative Estimator

As indicated in the Introduction, the most interesting feature of Theorem 3.1 is that δ̂ converges to δ at rate √N. This is the rate typically available in parametric estimation problems and is the rate that would be attained if the values l(xᵢ) (i = 1, ..., N) were known and used in the average (3.4). The estimator l̂_h(x) converges pointwise to l(x) at a slower rate, so Theorem 3.1 gives a situation where the average of nonparametric estimators converges more quickly than any of its individual components. This occurs because of the overlap between kernel densities at different evaluation points; for instance, if xᵢ and xⱼ are sufficiently close, the data used in the local average f̂_h(xᵢ) will overlap with that used in f̂_h(xⱼ). These overlaps lead to the approximation of δ̂ by U statistics with kernels depending on N. The asymptotic normality of δ̂ follows from results on the equivalence of such U statistics to (ordinary) sample averages. In a similar spirit, Powell, Stock, and Stoker (in press) obtained √N convergence rates for the estimation of "density weighted" average derivatives, and Carroll (1982), Robinson (1988), and Härdle and Marron (1987) showed how kernel densities can be used to obtain √N convergence rates for certain parameters in specific semiparametric models. We also note that our method of trimming follows Bickel (1982), Manski (1984), and Robinson (1988).

For any given sample size, the bandwidth h and the trimming bound b can be set to any (positive) values, so their choice can be based entirely on the small-sample behavior of δ̂. Conditions (a)-(c) of Theorem 3.1 indicate how the initial bandwidth and trimming bound must be decreased as the sample size is increased. These conditions are certainly feasible; suppose that h = h0·N^(−ξ) and b = b0·N^(−η); then (a)-(c) are equivalent to 0 < η < ξ and 1/(2p − 2) < ξ < (1 − 4η − ε)/(2k + 2). Since p ≥ k + 2 and ε is arbitrarily small, η can be chosen small enough to fulfill the last condition.

The bandwidth conditions arise as follows. Condition (b) assures that the estimator δ̂ can be "linearized" to a form without an estimated denominator and is a sufficient condition for asymptotic normality. Condition (c) assures that the bias of δ̂ vanishes at rate √N. Conditions (a)-(c) are one-sided in implying that the trimming bound b cannot converge too quickly to 0 as N → ∞, but rather must converge slowly. The behavior of the bandwidth h as N → ∞ is bounded both below and above by Conditions (b) and (c).

Condition (c) does imply that the pointwise convergence of f̂_h(x) to f(x) must be suboptimal. Stone (1980) showed that the optimal pointwise rate of convergence under our conditions is N^(p/(2p+k)), and Collomb and Härdle (1986) showed that this rate is achievable with kernel density estimators such as (3.2), for instance, by taking h = h0·N^(−1/(2p+k)). But then Nh^(2p−2) → ∞, which violates Condition (c), so as N → ∞, h must converge to 0 more quickly than N^(−1/(2p+k)). This occurs because (c) is a bias condition; as N → ∞, the (pointwise) bias of f̂_h(x) must vanish at a faster rate than its (pointwise) variance, for the bias of δ̂ to be o(N^(−1/2)). In other words, for √N consistent estimation of δ, one must "undersmooth" the nonparametric component l̂_h(x).

4.2 Modeling Multiple Regression

Theorem 3.3 shows that the optimal one-dimensional convergence rate is achievable in the estimation of g(xᵀδ) = E(y | xᵀδ), using δ̂ instead of δ. The requirement that g(·) is twice differentiable affixes the optimal rate at N^(2/5), but otherwise plays no role: if g(·) is assumed differentiable of order q and K1(·) is a kernel of order q, then it is easily shown that the optimal rate of N^(q/(2q+1)) is attained. The attainment of optimal one-dimensional rates of convergence is possible for the ADE method because the additive structure of g(xᵀδ) is sufficient to follow the "dimension reduction principle" of Stone (1986). Alternative uses of additive structure can be found in Breiman and Friedman (1985) and Hastie and Tibshirani (1986).

The ADE method can be regarded as a version of PPR of Friedman and Stuetzle (1981). The first step of PPR is to choose β (normalized as a direction) and g to minimize s(g, β) = Σᵢ [yᵢ − g(xᵢᵀβ)]², and any model of the form m(x) = g(xᵀβ) is inferred by the ADE estimator m̂(x) = ĝ_h'(xᵀδ̂) at the optimal one-dimensional rate of convergence. For a general regression function, however, m̂(x), ĝ_h', and δ̂ will not necessarily minimize the sum of squares s(g, β): given g, β is chosen such that {yᵢ − g(xᵢᵀβ)} is orthogonal to {xᵢg'(xᵢᵀβ)}, which does not imply that β = δ̂/‖δ̂‖.

Given δ̂, ĝ_h' is a local least squares estimator; namely, Σᵢ K1[(z − xᵢᵀδ̂)/h'](yᵢ − t)² is minimized by t = ĝ_h'(z). Moreover, δ̂ is a type of least squares estimator, as follows. Set ẑᵢ = (S_l)⁻¹ l̂_h(xᵢ)Îᵢ, where S_l is the sample moment S_l = N⁻¹ Σᵢ l̂_h(xᵢ)l̂_h(xᵢ)ᵀÎᵢ. Then δ̂ is the value of d that minimizes the sum of squares Σᵢ [yᵢ − ẑᵢᵀd]², or equivalently, S_l⁻¹δ̂ are the coordinates of {yᵢ} projected onto the subspace spanned by {l̂_h(xᵢ)Îᵢ}.

ADE and PPR thus represent different computational methods of inferring m(x) = g(xᵀβ). The possible advantages of ADE arise from reduced computational effort; given h, b, and h', m̂(x) = ĝ_h'(xᵀδ̂) is computed directly from the data, whereas minimizing s(g, β) (by checking all directions β and computing g for each β) typically involves considerable computational effort [although the results of Ichimura (1987) may provide some improvement].
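To fix ideas, the following is a minimal numerical sketch (not from the article, and not its software) of the two-stage procedure described in Section 3: a product Gaussian kernel gives the Rosenblatt-Parzen density estimate f̂_h and its gradient, l̂_h = −f̂_h'/f̂_h is averaged against y over the untrimmed observations to form δ̂, and ĝ is the kernel regression of y on the fitted index ẑ = xᵀδ̂. The kernel choice, the bandwidths, the trimming bound, and the simulated data are illustrative assumptions only; the sketch also omits higher-order kernels and the covariance estimator.

    import numpy as np

    def ade_fit(x, y, h, b, h_prime):
        """Two-stage average derivative estimation (ADE), illustrative sketch.

        x : (N, k) predictors; y : (N,) responses.
        h : bandwidth of the multivariate kernel density estimate,
        b : trimming bound on the estimated density,
        h_prime : bandwidth for the regression of y on the index x'delta_hat.
        """
        N, k = x.shape

        # Pairwise scaled differences u_ij = (x_i - x_j)/h and product Gaussian kernel.
        u = (x[:, None, :] - x[None, :, :]) / h                      # (N, N, k)
        kern = np.exp(-0.5 * (u ** 2).sum(axis=2)) / (2 * np.pi) ** (k / 2)

        # Rosenblatt-Parzen density estimate and its gradient at each x_i.
        f_hat = kern.sum(axis=1) / (N * h ** k)                      # (N,)
        grad_f = -(u * kern[:, :, None]).sum(axis=1) / (N * h ** (k + 1))  # (N, k)

        # Negative log-density derivative, with trimming I[f_hat > b].
        keep = f_hat > b
        l_hat = -grad_f[keep] / f_hat[keep, None]

        # Stage 1: average derivative estimate (trimmed observations contribute zero).
        delta_hat = (l_hat * y[keep, None]).sum(axis=0) / N          # (k,)

        # Stage 2: kernel regression of y on the index z = x'delta_hat.
        z = x @ delta_hat
        def g_hat(z0):
            w = np.exp(-0.5 * ((z0 - z) / h_prime) ** 2)
            return np.sum(w * y) / np.sum(w)

        return delta_hat, g_hat

    # Illustrative use on simulated single-index data, m(x) = g(x'beta):
    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 3))
    beta = np.array([1.0, 0.5, 0.0])
    Y = np.tanh(X @ beta) + 0.1 * rng.normal(size=400)
    d_hat, g_hat = ade_fit(X, Y, h=0.6, b=1e-3, h_prime=0.3)
    print(d_hat)        # should point roughly in the direction of beta
    print(g_hat(0.0))   # estimate of E(y | x'delta_hat = 0)

As the single-index simulation suggests, the first stage recovers the direction of β up to the scale factor γ = E(g'), which is the sense in which δ̂ summarizes the model.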
Table 2. Collision (Side Impact) Data
(data columns: AGE, VEL, ACL, y)

Predictor variables:

                      AGE (x1)   VEL (x2)   ACL (x3)
  δ̂  Value              .134       .051       .045
     Standard error      .033       .028       .027

Hypothesis tests:

  Null hypothesis              Wald statistic W   Degrees of freedom q   Pr(χ²_q > W)
  (δ1, δ2, δ3) = (0, 0, 0)          19.41                  3                 .00023
  (δ2, δ3) = (0, 0)                  7.61                  2                 .022
  δ3 = 0                             3.44                  1                 .063
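The Wald statistics above are quadratic forms in the restricted components of δ̂ and the estimated asymptotic covariance matrix of Theorem 3.1. A minimal sketch of that computation follows; the function name, its arguments, and the use of the chi-squared reference distribution are illustrative assumptions, not code from the article.

    import numpy as np
    from scipy import stats

    def wald_test(delta_hat, Sigma_hat, N, idx):
        """Wald test of H0: delta[idx] = 0.

        delta_hat : (k,) average derivative estimate
        Sigma_hat : (k, k) estimated covariance of sqrt(N)(delta_hat - delta)
        N         : sample size
        idx       : indices of the components restricted to zero
        """
        d = delta_hat[idx]
        V = Sigma_hat[np.ix_(idx, idx)] / N        # covariance of delta_hat[idx]
        W = float(d @ np.linalg.solve(V, d))       # Wald quadratic form
        q = len(idx)
        p_value = 1.0 - stats.chi2.cdf(W, df=q)    # Pr(chi^2_q > W)
        return W, q, p_value

For example, wald_test(delta_hat, Sigma_hat, N, idx=[2]) would test the single restriction δ3 = 0.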
Figure 1. ADE Regression for Collision Data. [Vertical axis: ĝ (estimated % fatal), 0.00 to 1.00; horizontal axis: z = xᵀδ̂, −0.60 to 0.60.]

The model here takes y = 1 if ε < xᵀδ and y = 0 otherwise, where ε is distributed independently of x (possibly with nonzero mean). This formulation specializes (2.2) by setting ψ(z) = z and normalizing β to δ. As discussed previously, in this model E(y | x) = m(x) = g(xᵀδ) = Pr(ε < xᵀδ), so g(z) is the cumulative distribution function of ε. Moreover, under this model, g' = dg/dz is the density function of ε, which we can estimate by the kernel estimator ĝ' (the derivative of ĝ). The estimator is graphed in Figure 2. Its multimodal shape indicates the possibility of a "mixture" distribution for ε, which is in contrast with standard parameterizations of binary response models (e.g., in a probit model ε is assumed to be normally distributed). Although we do not pursue these issues further here, at minimum, these results suggest that one should test for the presence of a mixture, as well as look for additional distinctions in the data (or design discrimination rules) that can be built into the model so that a unimodal density for ε is statistically appropriate.

The appearance of several modes of ĝ'(·) is not due to undersmoothing; it remains with a tripling of the bandwidth h' to .6. Moreover, it is not due to using the imprecise estimate δ̂3; dropping ACL and reestimating gives a more pronounced multimodal shape of ĝ'.

There is one feature of the results, which appears in conflict with the framework, that merits further study. The average of ĝ'(xᵢᵀδ̂) over the data is 1.76, which contrasts with the normalization E(g') = 1. Although possibly due to sampling error or our particular choice of h' (doubling h' to .4 decreases the average to 1.17), this could signal underestimation of δ or, in particular, underestimation of the scale of δ. Although we could easily "correct" for this, our intention here is just to indicate the need for further study of scaling and/or normalization of δ̂. With regard to the preceding discussion, it is important to note that a rescaling of δ̂ would only relabel the horizontal axes of Figures 1 and 2. In particular, the scaling of δ̂ does not affect the substantive conclusions of Figures 1 and 2, nor does it affect the fitted values m̂(x) of the ADE model (1.3).

Figure 2. [Graph of the kernel estimate ĝ' for the collision data; vertical axis roughly 1.40 to 3.50.]

6. CONCLUDING REMARKS

In this article we have advanced the ADE method as a useful yet flexible tool for studying general regression relationships. At its center is the estimation of average derivatives, which we propose as sensible coefficients for measuring the relative impacts of separate predictor variables on the mean response. Although we have established attractive statistical properties for the estimators, it is important to stress that the real motivation for the ADE method is the economy it offers for nonparametric data summarization. Instead of attempting to interpret a fully flexible nonparametric regression, the ADE method permits the significance of individual predictor variables to be judged via simple hypothesis tests on the value of the average derivatives. Nonlinearity of the relationship is summarized by a graph of the function ĝ. As such, we regard the ADE method as a natural outgrowth of linear modeling, or "running (ordinary least squares) regressions," as a useful method of data summarization.

Although the results of our empirical illustration are encouraging [another application is given in Härdle, Hildenbrand, and Jerison (1988)], many questions can be posed regarding practical implementation of the ADE estimators. For instance, are there automatic methods for setting the bandwidth and trimming parameters that assure good small-sample performance of the estimators? Would small-sample performance be improved by normalizing the scale of δ̂ or using alternative methods of nonparametric approximation for the ingredients of δ̂? These sorts of issues need to be addressed as part of future research.

The ADE estimators are simple to compute, using standard software packages available for microcomputers. In addition, these procedures are being implemented as part of the exploratory data software package XploRe of Härdle (1988).
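As a concrete illustration of the estimate discussed in Section 5, ĝ' can be obtained by differentiating the kernel regression of y on ẑ = xᵀδ̂ analytically. The sketch below assumes a Gaussian kernel K1 and an illustrative bandwidth, neither taken from the article, and evaluates ĝ and ĝ' on a grid of index values.

    import numpy as np

    def kernel_regression_and_derivative(z, y, grid, h_prime):
        """Nadaraya-Watson estimate g_hat and its derivative on a grid.

        z : (N,) index values x_i' delta_hat, y : (N,) responses,
        grid : (M,) evaluation points, h_prime : bandwidth.
        """
        u = (grid[:, None] - z[None, :]) / h_prime          # (M, N)
        K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)      # Gaussian kernel
        dK = -u * K / h_prime                               # d/dz0 of K((z0 - z_i)/h')

        num, den = K @ y, K.sum(axis=1)                     # numerator, denominator
        dnum, dden = dK @ y, dK.sum(axis=1)

        g = num / den
        g_prime = (dnum * den - num * dden) / den ** 2      # quotient rule
        return g, g_prime

In the binary response interpretation above, ĝ estimates the distribution function of ε and ĝ' its density, so a plot of g_prime over the grid is the analog of Figure 2.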
APPENDIX

3. m(x) ≡ E(y | x) is continuously differentiable for x ∈ Ω, where Ω̄ − Ω is a set of measure 0.

4. The moments E[lᵀ(x)l(x)y²] and E[(m')ᵀ(m')] exist.

An additional assumption for Theorem 3.3 follows.

10. m(x) = E(y | x) is twice differentiable for all x in the [...].

In the proofs, we use two (unobservable) "estimators" that are related to δ̂. First, define δ̃ based on trimming with respect to the true density value, Ĩᵢ = I[f(xᵢ) > b]:

  [...]   and   [...].

(Here r0(v) = l(x)y; var(ly) exists and b → 0 as N → ∞.)

To analyze δ̃1 and δ̃2, we approximate them by U statistics. The U statistic related to δ̃1 can be written as

  U1 = [N(N − 1)/2]⁻¹ Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} p1N(vᵢ, vⱼ),

with

  p1N(vᵢ, vⱼ) = (1/2) h^(−(k+1)) K'((xᵢ − xⱼ)/h) [yⱼĨⱼ/f(xⱼ) − yᵢĨᵢ/f(xᵢ)],

where K' ≡ ∂K/∂u. Note that by symmetry of K(·), we have [...]. The second term in this expansion will converge in probability to 0 provided that √N[U1 − E(U1)] has a limiting distribution, which we show later. Therefore, we have that

  √N[δ̃1 − E(δ̃1)] = √N[U1 − E(U1)] + o_p(1).   (A.6)

As before, the second term converges in probability to 0 provided that √N[U2 − E(U2)] has a limiting distribution, as shown later. The third term converges in probability to 0, because its variance is bounded by K(0)²N⁻²h^(−2k)(h/b)²E[l(x)²y²] = o(1), since Nh^k → ∞ and h/b → 0. Therefore,

  √N[δ̃2 − E(δ̃2)] = √N[U2 − E(U2)] + o_p(1).   (A.7)

The analysis of U1 and U2 is quite similar, so we present the details only for U1. We note that U1 is a U statistic with varying kernel (e.g., see Nolan and Pollard 1987), since p1N depends on N through the bandwidth h. Asymptotic normality of U1 follows from lemma 3.1 of Powell, Stock, and Stoker (in press), which states that if E[|p1N(vᵢ, vⱼ)|²] = o(N), then

  √N[U1 − E(U1)] = N^(−1/2) Σᵢ {r1N(vᵢ) − E[r1N(v)]} + o_p(1),   (A.8)

where r1N(v) ≡ 2E[p1N(v, vⱼ) | v]. This condition is implied by (b): if M1(x) ≡ E(y | x) and M2(x) ≡ E(y² | x), then

  E[|p1N(vᵢ, vⱼ)|²] ≤ (4b²h^(2k+2))⁻¹ ∫∫ |K'((xᵢ − xⱼ)/h)|² [M2(xᵢ) + M2(xⱼ) − 2M1(xᵢ)M1(xⱼ)] f(xᵢ)f(xⱼ) dxᵢ dxⱼ

by Assumption 9, and this is o(N) since b²Nh^(k+2) → ∞ is implied by Condition (b). Therefore, (A.8) is valid.

We now refine (A.8) to show that U1 is equivalent to an average whose components do not vary with N, namely, the average of r1(v) − E[r1(v)], where r1(vᵢ) = l(xᵢ)yᵢ + m'(xᵢ). For this, set b* = sup_{x,u}{f(x + hu) : f(x) = b, |u| ≤ 1} and Iᵢ* = I[f(xᵢ) > b*]. By construction, if |u| ≤ 1, then I[f(xᵢ + hu) > b] − Iᵢ* ≠ 0 only when Iᵢ* = 0, and b* → 0 and h/b* → 0 as b → 0 and h → 0. Now write r1N(vᵢ) = E[2p1N(vᵢ, vⱼ) | vᵢ] as

  r1N(vᵢ) = −[yᵢĨᵢ/f(xᵢ)] h^(−k) ∫ K'((xᵢ − x)/h) f(x) dx [...]
          = −[yᵢĨᵢ/f(xᵢ)] ∫ h⁻¹K'(u) f(xᵢ + hu) du − Iᵢ* ∫ h⁻¹K'(u) [...]
            + (1 − Iᵢ*)m'(xᵢ) + (1 − Iᵢ*)a(xᵢ; h, b).

The second moment E[|t1N(v)|²] vanishes as N → ∞. By Assumption 7, the second moment of [yᵢĨᵢ/f(xᵢ)] ∫ K(u)[f'(xᵢ + hu) − f'(xᵢ)] du is bounded by (h/b)²(∫ |u|K(u) du)² E[y²ω²] = O[(h/b)²] = o(1). The second moment of Iᵢ* ∫ K(u)[m'(xᵢ + hu) − m'(xᵢ)] du is bounded by h²(∫ |u|K(u) du)² E[ω²] = O(h²) = o(1). The second moments of (1 − Ĩᵢ)l(xᵢ)yᵢ and (1 − Iᵢ*)m'(xᵢ) vanish by Assumption 4, since b → 0 and b* → 0. Finally, the second moment of (1 − Iᵢ*)a(xᵢ; h, b) vanishes if the second moment of a(xᵢ; h, b) exists. Consider the tth component a_t(xᵢ; h, b) of a and define the marginal kernel K_(t) = ∫ K(u) du_t and the conditional kernel K_t = K/K_(t). For given x, integrating a_t(x; h, b) by parts absorbs h⁻¹ and shows that a_t is the sum of two terms: the expectation [with regard to K(u)] of m_t'(x + hu){I[f(x + hu) > b] − I[f(x) > b*]}, and the expectation [with regard to K_(t)] of K_t m(x + hu) over u values such that f(x + hu) = b. Because the variances of m' and y exist, the second moment of each of these expectations exists, so E(a_t²) exists. Therefore, E(‖a‖²) exists, so the second moment of (1 − Iᵢ*)a(x; h, b) vanishes, which suffices to prove E[|t1N(v)|²] = o(1).

This fact completes the proof that U1 is asymptotically normal, as

  N^(−1/2) Σᵢ {r1N(vᵢ) − E[r1N(v)]} = N^(−1/2) Σᵢ {r1(vᵢ) − E[r1(v)]} + N^(−1/2) Σᵢ {t1N(vᵢ) − E[t1N(v)]},   (A.9)

and the last term converges in probability to 0, since its variance is bounded by E[|t1N(v)|²] = o(1). Combining (A.9), (A.8), and (A.6), we have

  √N[δ̃1 − E(δ̃1)] = N^(−1/2) Σᵢ {r1(vᵢ) − E[r1(v)]} + o_p(1),   (A.10)

where r1(v) ≡ l(x)y + m'(x). The U statistic representation of δ̃2 is analyzed in a similar fashion. In particular, E[|p2N(vᵢ, vⱼ)|²] = o(N) follows from (b), so U2 − E(U2) is √N equivalent to a sample average, which combined with (A.7) gives

  √N[δ̃2 − E(δ̃2)] = N^(−1/2) Σᵢ {r2(vᵢ) − E[r2(v)]} + o_p(1).   (A.11)

Step 3: Asymptotic Bias. The bias of δ̃ is E(δ̃) − δ, where E(δ̃) = T0N − T1N − T2N with

  T0N = E[l(xᵢ)yᵢĨᵢ],
  T1N = E{([f̂_h'(xᵢ) − f'(xᵢ)]/f(xᵢ)) yᵢĨᵢ},
and
  T2N = E{([f̂_h(xᵢ) − f(xᵢ)]/f(xᵢ)) [...] yᵢĨᵢ}.

Let A_N, B_N be defined as before; then

  T0N = ∫ l(x)m(x)f(x) dx − ∫_{[f(x) ≤ b]} l(x)m(x)f(x) dx,
  [...]
  + h^(p−1) ∫ m(x) ∫ K(u)[f^(p)(ξ) − f^(p)(x)] u^p du dx + [...],

so that by Condition (c) we have T1N = O[N^(−1/2)(N^(1/2)h^(p−1))] = o(N^(−1/2)). The same analysis for T2N completes the proof that √N[E(δ̃) − δ] = o(1).

Step 4: Trimming. Steps 1-3 have shown that √N(δ̃ − δ − R̄) = o_p(1), where R̄ = N⁻¹ Σᵢ [r(yᵢ, xᵢ) − E(r)], so δ̃ is asymptotically normal. We now demonstrate the same property for δ̂. For this, let c_N = c_f (N^(1/2)h^k)^(−1/2), where c_f is an upper bound consistent with (A.1a). Define the average kernel estimator based on trimming with respect to the bound b + c_N: δ̂_u = N⁻¹ Σ_{i=1}^{N} l̂_h(xᵢ)yᵢ I[f(xᵢ) > b + c_N]. Since b⁻¹c_N → 0 by Condition (b), δ̂_u obeys the tenets of Steps 1-3, so √N(δ̂_u − δ − R̄) = o_p(1). We now show that √N(δ̂ − δ̂_u) = o_p(1). First, Îᵢ(1 − Iᵢᵘ) = I[f(xᵢ) ≤ b + c_N; f̂_h(xᵢ) > b], so

  √N(δ̂ − δ̂_u) = N^(−1/2) Σᵢ l̂_h(xᵢ)yᵢ [Îᵢ − Iᵢᵘ],   (A.12)

which is bounded in norm by [...] + N^(−1/2) Σᵢ ‖l(xᵢ)yᵢ‖ I[f(xᵢ) < b + c_N]. For the second term,

  E‖N^(−1/2) Σᵢ l(xᵢ)yᵢ I[f(xᵢ) < b + c_N]‖² [...] = E{‖l(x)y‖² I[f(x) < b + c_N]} = o(1)

by the Lebesgue dominated convergence theorem, since b + c_N → 0 and E‖l(x)y‖² exists. The first term also vanishes in probability, as it is bounded by

  b⁻¹ sup_x{|f̂_h − f| I[f̂_h(x) > b − c_N]} (N^(−1/2) Σᵢ ‖l(xᵢ)yᵢ‖) [...]