
Detection of Influential Observation in Linear Regression

R. Dennis Cook
Technometrics, Vol. 19, No. 1 (Feb., 1977), pp. 15-18.
Stable URL:
http://links.jstor.org/sici?sici=0040-1706%28197702%2919%3A1%3C15%3ADOIOIL%3E2.0.CO%3B2-8

TECHNOMETRICS, VOL. 19, NO. 1, FEBRUARY 1977

Detection of Influential Observation in Linear Regression

R. Dennis Cook
Department of Applied Statistics
University of Minnesota
St. Paul, Minnesota 55108

A new measure based on confidence ellipsoids is developed for judging the contribution of
each data point to the determination of the least squares estimate of the parameter vector in full
rank linear regression models. It is shown that the measure combines information from the
studentized residuals and the variances of the residuals and predicted values. Two examples are
presented.

KEY WORDS
Influential observations
Confidence ellipsoids
Variances of residuals
Outliers
1. INTRODUCTION

It is perhaps a universally held opinion that the overall summary statistics (e.g., $R^2$) arising from data analyses based on full rank linear regression models can present a distorted and misleading picture. This has led to the recommendation and use of a number of procedures that can isolate peculiarities in the data; plots of the residuals ($R_i$) and examination of standardized residuals are probably the two most widely used. The studentized residuals, $t_i$ (i.e., the residual divided by its standard error), have been recommended (see, e.g., [2], [4], [6]) as more appropriate than the standardized residuals (i.e., the residual divided by the square root of the mean square for error) for detecting outliers. Also, approximate critical values for the maximum absolute studentized residual are available [8].
Behnken and Draper [2] have illustrated that the estimated variances of the predicted values (or, equivalently, the estimated variances of the residuals, $\hat{V}(R_i)$) contain relevant information beyond that furnished by residual plots or studentized residuals. Specifically, they state "A wide variation in the [variance of the residuals] reflects a peculiarity of the X matrix, namely a nonhomogeneous spacing of the observations and will thus often direct attention to data deficiencies." The opinion that these variances contain additional information was also put forth by Huber [6] and Davies and Hutton [4]. Box and Draper [3] developed a robust design criterion based on the sums of squares of the variances of the predicted values.

Received December 1975; revised September 1976


If a potentially critical observation has been detected using one or more of the above measures, then the examination of the effects of deleting the observation seems a natural next step. However, the problem of determining which point(s) to delete can be very perplexing, especially in large data sets, because each point now has two associated measures $(t_i, \hat{V}(R_i))$ which must be judged simultaneously. For example, assuming the mean square for error to be 1.0, which point from the set $(t_i, V(R_i)) = (1, .1), (1.732, .25), (3, .5), (5.196, .75)$ is most likely to be critical?

It is the purpose of this note to suggest an easily interpretable measure that combines information from both $t_i$ and $V(R_i)$, and that will naturally isolate "critical" values.
2. DEVELOPMENT

Consider the model

$$ Y = X\beta + e, $$

where $Y$ is an $n \times 1$ vector of observations, $X$ is an $n \times p$ full rank matrix of known constants, $\beta$ is a $p \times 1$ vector of unknown parameters and $e$ is an $n \times 1$ vector of randomly distributed errors such that $E(e) = 0$ and $V(e) = I\sigma^2$. Recall that the least squares estimate of $\beta$ is

$$ \hat{\beta} = (X'X)^{-1}X'Y. $$

The corresponding residual vector is

$$ R = Y - X\hat{\beta}. $$

The covariance matrices of $\hat{Y} = X\hat{\beta}$ and $R$ are, respectively, $V(\hat{Y}) = X(X'X)^{-1}X'\sigma^2$ and $V(R) = (I - X(X'X)^{-1}X')\sigma^2$. It follows immediately that

$$ V(\hat{Y}_i) = x_i'(X'X)^{-1}x_i\,\sigma^2 \tag{1} $$

and

$$ V(R_i) = \left(1 - x_i'(X'X)^{-1}x_i\right)\sigma^2, \tag{2} $$

where $x_i'$ denotes the $i$th row of $X$. Finally, the normal theory $(1 - \alpha) \times 100\%$ confidence ellipsoid for the unknown vector, $\beta$, is given by the set of all vectors $\beta^*$, say, that satisfy

$$ \frac{(\beta^* - \hat{\beta})'X'X(\beta^* - \hat{\beta})}{p s^2} \le F(p,\, n - p,\, 1 - \alpha), \tag{3} $$

where $s^2 = R'R/(n - p)$ and $F(p, n - p, 1 - \alpha)$ is the $1 - \alpha$ probability point of the central F-distribution with $p$ and $n - p$ degrees of freedom.
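The quantities above are straightforward to compute numerically. The following sketch (not part of the original paper; the simulated data and all variable names are purely illustrative) uses NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # n x p design, full rank
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y                 # least squares estimate
R = Y - X @ beta_hat                         # residual vector
s2 = R @ R / (n - p)                         # s^2 = R'R/(n - p)

v = np.einsum('ij,jk,ik->i', X, XtX_inv, X)  # v_i = x_i'(X'X)^{-1}x_i
var_pred = v * s2                            # estimate of V(Yhat_i), eq. (1)
var_resid = (1.0 - v) * s2                   # estimate of V(R_i), eq. (2)
```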
To determine the degree of influence the $i$th data point has on the estimate, $\hat{\beta}$, a natural first step would be to compute the least squares estimate of $\beta$ with the point deleted. Accordingly, let $\hat{\beta}_{(-i)}$ denote the least squares estimate of $\beta$ with the $i$th point deleted. An easily interpretable measure of the distance of $\hat{\beta}_{(-i)}$ from $\hat{\beta}$ is (3). Thus, the suggested measure of the critical nature of each data point is now defined to be

$$ D_i = \frac{(\hat{\beta}_{(-i)} - \hat{\beta})'X'X(\hat{\beta}_{(-i)} - \hat{\beta})}{p s^2}. \tag{4} $$

This provides a measure of the distance between $\hat{\beta}_{(-i)}$ and $\hat{\beta}$ in terms of descriptive levels of significance. Suppose, for example, that $D_i \approx F(p, n - p, .5)$; then the removal of the $i$th data point moves the least squares estimate to the edge of the 50% confidence region for $\beta$ based on $\hat{\beta}$. Such a situation may be cause for concern. For an uncomplicated analysis one would like each $\hat{\beta}_{(-i)}$ to stay well within a 10%, say, confidence region.

On the surface it might seem that any desirability this measure has would be overshadowed by the computations necessary for the determination of $n + 1$ regressions. However, it is easily shown that (see [1])

$$ \hat{\beta}_{(-i)} = (X_{(-i)}'X_{(-i)})^{-1}(X'Y - x_iY_i), \tag{5} $$

where $X_{(-i)}$ is obtained by removing the $i$th row, $x_i'$, from $X$ and $Y_i$ is the $i$th observation. Also, letting $v_i = x_i'(X'X)^{-1}x_i$ and assuming $v_i < 1$, as will be the case if $X_{(-i)}$ has full rank $p$,

$$ (X_{(-i)}'X_{(-i)})^{-1} = (X'X)^{-1} + \frac{(X'X)^{-1}x_ix_i'(X'X)^{-1}}{1 - v_i}. \tag{6} $$

Substitution of (6) into (5) yields

$$ \hat{\beta}_{(-i)} - \hat{\beta} = -\,\frac{(X'X)^{-1}x_iR_i}{1 - v_i}. \tag{7} $$
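As a quick check of (7), the updated estimate can be compared against an explicit refit with the $i$th row removed. A minimal sketch continuing the code above (the index $i$ is arbitrary):

```python
# Verify the updating formula (7) against a direct refit without row i.
i = 4
beta_drop = beta_hat - XtX_inv @ X[i] * R[i] / (1.0 - v[i])  # eq. (7)

mask = np.arange(n) != i
beta_refit = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ Y[mask])

assert np.allclose(beta_drop, beta_refit)
```

This is why only one regression is needed: everything on the right-hand side of (7) comes from the full-data fit.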


Note that $D_i$ depends on three relevant quantities, all relating to the full data set: the number of parameters, $p$; the $i$th studentized residual,

$$ t_i = \frac{R_i}{s\sqrt{1 - v_i}}; $$

and the ratio of the variance of the $i$th predicted value, $V(\hat{Y}_i) = x_i'(X'X)^{-1}x_i\sigma^2 = v_i\sigma^2$ (see equation (1)), to the variance of the $i$th residual, $V(R_i) = \sigma^2(1 - v_i)$ (see equation (2)). Thus $D_i$ can be written as simply

$$ D_i = \frac{t_i^2}{p} \cdot \frac{V(\hat{Y}_i)}{V(R_i)}. \tag{8} $$

Clearly, $t_i^2$ is a measure of the degree to which the $i$th observation can be considered as an outlier from the assumed model. In addition, it is easily demonstrated that if the possible presence of a single outlier is modeled by adding a parameter vector $\Delta' = (0, 0, \ldots, 0, \delta, 0, \ldots, 0)$ (both $\delta$ and its position within $\Delta'$ are unknown) to the model, then $\max t_i^2$ is a monotonic function of the normal theory likelihood ratio test of the hypothesis that $\delta = 0$.

The ratios $V(\hat{Y}_i)/V(R_i)$ measure the relative sensitivity of the estimate, $\hat{\beta}$, to potential outlying values at each data point. They are also monotonic functions of the $v_i$'s, which are the quantities Box and Draper [3] used in the development of their robust design (i.e., insensitive to outliers) criterion. A large value of the ratio indicates that the associated point has heavy weight in the determination of $\hat{\beta}$. The two individual measures combine in (8) to produce a measure of the overall impact any single point has on the least squares solution.

A little care must be exercised when using $D_i$, since $\hat{\beta}_{(-i)}$ is essentially undefined in extreme cases when $V(R_i) = 0$ (see the lemma in [1]); i.e., when $X_{(-i)}$ has rank less than $p$.
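Equation (8) gives every $D_i$ from the full-data fit alone. A sketch continuing the code above, which also verifies (8) against the definition (4):

```python
# Cook's distance via the closed form (8), checked against definition (4).
t = R / np.sqrt(s2 * (1.0 - v))        # studentized residuals t_i
D = (t**2 / p) * (v / (1.0 - v))       # eq. (8): (t_i^2/p) * V(Yhat_i)/V(R_i)

D_def = np.empty(n)
for i in range(n):
    diff = -XtX_inv @ X[i] * R[i] / (1.0 - v[i])   # beta_(-i) - beta, eq. (7)
    D_def[i] = diff @ (X.T @ X) @ diff / (p * s2)  # eq. (4)

assert np.allclose(D, D_def)
```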
Returning to the example given in the Introduction, we see that each point has an equal overall impact on the regression since in each case $pD_i = 9.0$. To continue the example, suppose $p = 3$ and $n - p = 24$; then $D_i = 3.0$ and the removal of any of the four points would move the estimate of $\beta$ to the edge of the 95% confidence region about $\hat{\beta}$. However, inspection of the individual components shows that the reasons for the extreme movement are different. The two points $(3, .5)$ and $(5.196, .75)$ could be rejected as containing outliers in the dependent variable while the remaining two could not. Inspection of $X$ would be necessary to determine why the remaining points are important. It may be, for example, that they correspond to outlying values in the independent variables.
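A numerical check of this example (illustrative, not from the paper; SciPy supplies the F quantile). With the mean square for error equal to 1.0, $V(R_i) = 1 - v_i$, so $pD_i = t_i^2\,v_i/(1 - v_i)$:

```python
# Each introductory point gives p*D_i = 9.0; with p = 3, n - p = 24,
# the common value D_i = 3.0 is approximately the 95% point of F(3, 24).
from scipy.stats import f

points = [(1.0, 0.1), (1.732, 0.25), (3.0, 0.5), (5.196, 0.75)]  # (t_i, V(R_i))
for t_i, var_Ri in points:
    v_i = 1.0 - var_Ri                  # sigma^2 assumed to be 1.0
    print(t_i**2 * v_i / (1.0 - v_i))   # -> 9.0 in each case

print(f.ppf(0.95, 3, 24))               # -> about 3.01
```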


In any analysis additional information may be gained by examining $t_i$ and $V(\hat{Y}_i)/V(R_i)$ separately. A three column output of $t_i$, $V(\hat{Y}_i)/V(R_i)$ and $D_i$ would seem to be a highly desirable option in any multiple regression program; a sketch of such an output follows.
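Such a display is easy to produce with modern software. A sketch using statsmodels (a present-day convenience, not software contemplated by the paper), applied to the simulated data from the earlier code:

```python
# Three-column diagnostic output: t_i, V(Yhat_i)/V(R_i), D_i.
import statsmodels.api as sm

infl = sm.OLS(Y, X).fit().get_influence()
t_i = infl.resid_studentized_internal   # studentized residuals
h = infl.hat_matrix_diag                # v_i
ratio = h / (1.0 - h)                   # V(Yhat_i)/V(R_i)
cooks_d = infl.cooks_distance[0]        # D_i

print("     t_i    ratio      D_i")
for row in zip(t_i, ratio, cooks_d):
    print("%8.3f %8.3f %8.3f" % row)
```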
The following two examples should serve as additional demonstrations of the use of $D_i$. No attempt at a "complete" analysis is made.

3. EXAMPLES

Example 1 - Longley Data

Longley [7] presented a data set relating six economic variables to total derived employment for the years 1947 to 1962. Table 1 lists the residuals standardized by $s$, $t_i$, $V(\hat{Y}_i)/V(R_i)$, $D_i$ and the year. Notice first that there are considerable differences between $R_i/s$ and $t_i$. Second, the point with the largest $D_i$ value corresponds to 1951. Removal of this point will move the least squares estimate to the edge of a 35% confidence region around $\hat{\beta}$. The second largest $D_i$ value is at 1962 and its removal will move the estimate of $\beta$ to approximately the edge of a 15% confidence region. Clearly, 1951 and 1962 have the greatest impact on the determination of $\hat{\beta}$. The point with the largest studentized residual is 1956; however, the effect of this point on $\hat{\beta}$ is not important relative to the effects of 1951 and 1962. The identification of the points with $\max |t_i|$ and $\max V(\hat{Y}_i)/V(R_i)$ (or $\max v_i$) would not have isolated 1951. (It is interesting to note that 1951 was the first full year of the Korean conflict.)

TABLE 1 - Longley Data

Year    Ri/s    |ti|    V(Ŷi)/V(Ri)    Di
1947    0.88    1.15       0.74       0.14
1948   -0.31    0.48       1.30       0.04
1949    0.15    0.19       0.57        *
1950   -1.34    1.70       0.59       0.24
1951    1.02    1.64       1.60       0.61
1952   -0.82    1.03       0.59       0.09
1953   -0.54    0.75       0.97       0.08
1954   -0.04    0.06       1.02        *
1955    0.05    0.07       0.84        *
1956    1.48    1.83       0.49       0.23
1957   -0.06    0.07       0.56        *
1958   -0.13    0.18       0.93        *
1959   -0.51    0.64       0.60       0.04
1960   -0.28    0.32       0.30        *
1961    1.12    1.42       0.59       0.17
1962   -0.68    1.21       2.21       0.47

*: smaller than 5 × 10⁻³


Example 2 - Hald Data

The data for this example were previously published by Hald (see [2] and p. 165 of [5]). There are 4 regressors and 13 observation points. Table 2 lists $R_i/s$, $t_i$, $V(\hat{Y}_i)/V(R_i)$ and $D_i$. In contrast to the previous example, the data here seem well behaved. Observation number 8 has the largest $D_i$ value but its removal moves the least squares estimate to the edge of only the 10% confidence region for $\beta$.

TABLE 2 - Hald Data

Observation    Ri/s     |ti|    V(Ŷi)/V(Ri)    Di
     1         0.002    0.003      1.22         *
     2         0.62     0.76       0.50        0.06
     3        -0.68     1.05       1.36        0.30
     4        -0.71     0.84       0.42        0.06
     5         0.10     0.13       0.56         *
     6         1.61     1.71       0.14        0.08
     7        -0.59     0.74       0.58        0.06
     8        -1.24     1.69       0.69        0.31
     9         0.56     0.67       0.42        0.04
    10         0.12     0.21       2.34        0.02
    11         0.81     1.07       0.74        0.17
    12         0.40     0.46       0.36        0.02
    13        -0.94     1.12       0.44        0.11

*: smaller than 2 × 10⁻³

4. EXTENSIONS

It is easily seen that $D_i$ is invariant under changes of scale. If the scale of each variable is thought to be an important consideration it may be more desirable to compute the squared length of $(\hat{\beta}_{(-i)} - \hat{\beta})$. It is easily shown that

$$ \|\hat{\beta}_{(-i)} - \hat{\beta}\|^2 = \frac{R_i^2\, x_i'(X'X)^{-2}x_i}{(1 - v_i)^2}. $$

The proposed measure was developed under the implicit presumption that $\beta$ is the parameter of interest. This may not always be the case. If interest is in $q$, say, linearly independent combinations of the elements of $\beta$, then it would be more reasonable to measure the influence each data point has on the determination of the least squares estimates of these combinations. Let $A$ denote a $q \times p$ rank $q$ matrix and let $\psi = A\beta$ denote the combinations of interest. A generalized measure of the importance of the $i$th point is now defined to be

$$ D_i(A) = \frac{(\hat{\psi}_{(-i)} - \hat{\psi})'\left(A(X'X)^{-1}A'\right)^{-1}(\hat{\psi}_{(-i)} - \hat{\psi})}{q s^2}, \tag{9} $$

where $\hat{\psi} = A\hat{\beta}$ and $\hat{\psi}_{(-i)} = A\hat{\beta}_{(-i)}$. To obtain the descriptive levels of significance the values of this generalized measure should, of course, be compared to the probability points of the central F-distribution with $q$ and $n - p$ degrees of freedom.

The case when $q = 1$, i.e., when $A$ is chosen to be a $1 \times p$ vector $z'$, say, warrants special emphasis. From (9) it is easily seen that

$$ D_i(z') = p D_i\, \rho^2(z'\hat{\beta},\, x_i'\hat{\beta}), \tag{10} $$

where $D_i = D_i(I)$ and $\rho(\cdot\,,\cdot)$ denotes the correlation coefficient. If $z'$ is a vector of values of the independent variables then $D_i(z')$ measures the distance between the predicted mean value of $y$ at $z$ using the $i$th data point $(z'\hat{\beta})$ and the predicted value at $z$ without the $i$th point $(z'\hat{\beta}_{(-i)})$. Note also that when $z'$ is chosen to be a vector of the form $(0, \ldots, 1, 0, \ldots, 0)$, $D_i(z')$ measures the distance between the corresponding components of $\hat{\beta}$ and $\hat{\beta}_{(-i)}$.

The maximum value of $D_i(z')$ for a fixed $i$ is obtained by choosing $z' = x_i'$. Since $\rho^2(z'\hat{\beta}, x_i'\hat{\beta}) \le 1$, with equality when $z = x_i$, it follows that

$$ D_i(z') \le D_i(x_i') = p D_i $$

for all $z$. Thus, when prediction of mean values or the individual components of $\beta$ are of interest it may not be necessary to use (10) directly: if $D_i(x_i')$ shows a negligible difference between $x_i'\hat{\beta}$ and $x_i'\hat{\beta}_{(-i)}$, then the difference between $z'\hat{\beta}$ and $z'\hat{\beta}_{(-i)}$ must also be negligible for all $z$.
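A sketch of the $q = 1$ case, continuing the earlier code; the vector $z$ below is arbitrary and purely illustrative. It computes $D_i(z')$ directly from (9) with $A = z'$ and verifies the identity (10):

```python
# D_i(z') for a single linear combination z'beta, and the identity (10).
z = np.array([1.0, 0.5, -0.5])   # an arbitrary 1 x p "combination" vector
i = 4

beta_drop = beta_hat - XtX_inv @ X[i] * R[i] / (1.0 - v[i])           # beta with i deleted
D_z = (z @ beta_drop - z @ beta_hat) ** 2 / (s2 * (z @ XtX_inv @ z))  # eq. (9), q = 1

rho2 = (z @ XtX_inv @ X[i]) ** 2 / ((z @ XtX_inv @ z) * v[i])         # corr^2(z'b, x_i'b)
assert np.isclose(D_z, p * D[i] * rho2)                               # eq. (10)
```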


5. ACKNOWLEDGEMENT

The author would like to thank Professor C. Bingham for his suggestions and criticisms.
REFERENCES
[1] Beckman, R. J. and Trussell, H. J. (1974). The distribution of an arbitrary studentized residual and the effects of updating in multiple regression. J. Amer. Statist. Assoc., 69, 199-201.
[2] Behnken, D. W. and Draper, N. R. (1972). Residuals and their variance patterns. Technometrics, 14, 102-111.
[3] Box, G. E. P. and Draper, N. R. (1975). Robust design. Biometrika, 62, 347-352.
[4] Davies, R. B. and Hutton, B. (1975). The effects of errors in the independent variables in linear regression. Biometrika, 62, 383-391.
[5] Draper, N. R. and Smith, H. (1966). Applied Regression Analysis. Wiley, New York.
[6] Huber, P. J. (1975). Robustness and designs. A Survey of Statistical Design and Linear Models. North-Holland, Amsterdam.
[7] Longley, J. W. (1967). An appraisal of least squares programs for the electronic computer from the point of view of the user. J. Amer. Statist. Assoc., 62, 819-841.
[8] Lund, R. E. (1975). Tables for an approximate test for outliers in linear models. Technometrics, 17, 473-476.
