Chapter 3
The presence of linear patterns is reassuring, but the absence of such patterns does not imply that a linear model is incorrect. Most statistical software provides an option for creating a scatterplot matrix. Viewing all the plots together gives an indication of whether a multiple linear regression model may provide a reasonable fit to the data. Keep in mind that the scatterplots of y versus x1, y versus x2, ..., y versus xk convey information only about pairs of variables, whereas the linearity assumption concerns y and x1, x2, ..., xk jointly.
If some of the explanatory variables are themselves interrelated, this provides a first indication of the problem of multicollinearity.
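For instance, a scatterplot matrix can be produced with pandas. The data below are simulated purely for illustration, with x1 and x2 deliberately correlated so that the pairwise plots hint at multicollinearity; the column names mirror the notation in the text.

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=50)   # correlated regressors
y = 2 + 1.5 * x1 - 0.7 * x2 + rng.normal(size=50)
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

scatter_matrix(df, diagonal="hist")   # pairwise scatterplots of y, x1, x2
plt.show()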
For the simple linear regression model, each residual has zero expectation:
$$E(e_i) = E(y_i - \hat{y}_i) = E\left[\beta_0 + \beta_1 x_i + \varepsilon_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right]$$
$$= E(\beta_0 + \beta_1 x_i) + E(\varepsilon_i) - E(\hat{\beta}_0 + \hat{\beta}_1 x_i) = \beta_0 + \beta_1 x_i + 0 - \beta_0 - \beta_1 x_i = 0$$
and their approximate average variance is estimated by (since $\bar{e} = 0$)
$$\frac{\sum_{i=1}^{n}(e_i - \bar{e})^2}{n-(k+1)} = \frac{\sum_{i=1}^{n} e_i^2}{n-(k+1)} = \frac{SS_E}{n-(k+1)} = MSE$$
That is, the residual mean square $MSE$ estimates the variance of the errors of the fitted regression model, so the standard deviation of the residuals estimates the error standard deviation. The residuals are not independent, however, as the n residuals have only $n - p$ degrees of freedom associated with them, where $p = k + 1$ is the number of parameters. This non-independence of the residuals has little effect on their use for model adequacy checking as long as n is not small relative to the number of parameters p.
Sometimes it is useful to work with scaled residuals. These scaled residuals are helpful in finding observations that are outliers, or extreme values, that is, observations that are separated in some fashion from the rest of the data.
In matrix form, the vector of residuals is
$$e = y - \hat{y} = y - Hy = (I - H)y = (I - H)\varepsilon$$
where $H = X(X'X)^{-1}X'$ is the hat matrix. The hat matrix has several useful properties: in particular, it is symmetric and idempotent, and so is $(I - H)$. Thus, the residuals are the same linear transformation of the observations $y$ and of the errors $\varepsilon$. The covariance matrix of the residuals is given by:
$$\operatorname{Var}(e) = \operatorname{Var}[(I-H)\varepsilon] = (I-H)\operatorname{Var}(\varepsilon)(I-H)' = \sigma^2(I-H)$$
since $\operatorname{Var}(\varepsilon) = \sigma^2 I$ and $(I-H)$ is symmetric and idempotent. The matrix $(I-H)$ is generally not diagonal, so the residuals have different variances and they are correlated. The variance of the $i$th residual is given by:
$$\operatorname{Var}(e_i) = \sigma^2(1-h_{ii})$$
where $h_{ii}$ is the $i$th diagonal element of the hat matrix $H$. The covariance between residuals $e_i$ and $e_j$ is given by:
$$\operatorname{Cov}(e_i, e_j) = -\sigma^2 h_{ij}$$
where $h_{ij}$ is the $(i,j)$th element of the hat matrix. Now since $0 \le h_{ii} \le 1$, using the residual mean square $MSE$ to estimate the variance of the residuals actually overestimates $\operatorname{Var}(e_i)$. Furthermore, since $h_{ii}$ is a measure of the location of the $i$th point in $x$-space, $\operatorname{Var}(e_i)$ depends on where the point $x_i$ lies. Generally, points near the centre of the $x$-space have larger residual variance (poorer least squares fit) than points at more remote locations. Violations of model assumptions are more likely at remote points, and these violations may be hard to detect from inspection of the ordinary residuals $e_i$ (or the standardized residuals $d_i$) because their residuals will usually be smaller.
A logical procedure, then, is to examine the studentized residuals
$$r_i = \frac{e_i}{\sqrt{MSE\,(1-h_{ii})}}, \qquad i = 1, 2, \ldots, n.$$
Because remote points are potentially highly influential on the least squares fit, examination of the studentized residuals is generally recommended.
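A minimal numpy sketch on simulated data (the design matrix, coefficients, and error variance below are hypothetical, not the delivery time data), computing the hat matrix and the ordinary, standardized, and studentized residuals from the formulas above:

import numpy as np

rng = np.random.default_rng(0)
n, k = 25, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n), rng.uniform(0, 30, n)])
y = X @ np.array([2.0, 1.5, 0.01]) + rng.normal(scale=3.0, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix H = X(X'X)^{-1}X'
e = y - H @ y                             # ordinary residuals e = (I - H)y
h = np.diag(H)                            # leverages h_ii
MSE = e @ e / (n - (k + 1))               # residual mean square

d = e / np.sqrt(MSE)                      # standardized residuals d_i
r = e / np.sqrt(MSE * (1 - h))            # studentized residuals r_i
print(np.round(np.column_stack([e, d, r]), 3))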
Some of these points are very easy to see by examining the studentized residuals for a simple linear regression model. If there is only one independent variable (SLR), it can be shown that the studentized residuals are given as:
$$r_i = \frac{e_i}{\sqrt{MSE\left[1 - \left(\dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{SS_{xx}}\right)\right]}}, \qquad i = 1, 2, \ldots, n$$
Notice that when the observation $x_i$ is close to the midpoint of the $x$-data, $x_i - \bar{x}$ will be small, and the estimated standard deviation of $e_i$ (the denominator) will be large. Conversely, when $x_i$ is near the extreme ends of the range of the $x$-data, $x_i - \bar{x}$ will be large, and the estimated standard deviation of $e_i$ will be small. Also, when the sample size $n$ is really large, the effect of $(x_i - \bar{x})^2$ will be relatively small, so in big data sets the studentized residuals may not differ much from the standardized residuals.
A related idea is to measure prediction error: the $i$th PRESS residual is obtained by withholding observation $i$, fitting the model to the remaining $n-1$ observations, and using that model to predict the withheld response, giving $e_{(i)} = y_i - \hat{y}_{(i)}$. This prediction error calculation is repeated for each observation $i = 1, 2, \ldots, n$. These prediction errors are usually called PRESS residuals; some authors call $e_{(i)}$ deleted residuals. It would initially seem that calculating the PRESS residuals requires fitting $n$ different regressions. However, it is possible to calculate the PRESS residuals from the results of a single least squares fit to all $n$ observations, since it can be shown that the $i$th PRESS residual is
$$e_{(i)} = \frac{e_i}{1-h_{ii}}, \qquad i = 1, 2, \ldots, n.$$
Let $\hat{\beta}_{(i)}$ be the vector of estimated coefficients obtained by deleting the $i$th observation. Then:
$$\hat{\beta}_{(i)} = \left(X_{(i)}'X_{(i)}\right)^{-1}X_{(i)}'\,y_{(i)}$$
where $X_{(i)}$ and $y_{(i)}$ are the $X$ matrix and $y$ vector with the $i$th observation deleted. Then the $i$th deleted residual can be rewritten as:
$$e_{(i)} = y_i - \hat{y}_{(i)} = y_i - x_i'\hat{\beta}_{(i)} = y_i - x_i'\left(X_{(i)}'X_{(i)}\right)^{-1}X_{(i)}'\,y_{(i)}$$
where $x_i'$ is the $i$th row of $X$. If $X'X$ is the cross-product matrix for all $n$ observations and $X_{(i)}'X_{(i)} = X'X - x_i x_i'$ is the cross-product matrix with the $i$th observation deleted, it can be shown that
$$\left(X'X - x_i x_i'\right)^{-1} = (X'X)^{-1} + \frac{(X'X)^{-1}x_i x_i'(X'X)^{-1}}{1-h_{ii}}$$
Therefore
$$e_{(i)} = y_i - x_i'\left[(X'X)^{-1} + \frac{(X'X)^{-1}x_i x_i'(X'X)^{-1}}{1-h_{ii}}\right]X_{(i)}'\,y_{(i)}$$
$$= y_i - x_i'(X'X)^{-1}X_{(i)}'\,y_{(i)} - \frac{h_{ii}\,x_i'(X'X)^{-1}X_{(i)}'\,y_{(i)}}{1-h_{ii}}$$
$$= \frac{y_i(1-h_{ii}) - (1-h_{ii})\,x_i'(X'X)^{-1}X_{(i)}'\,y_{(i)} - h_{ii}\,x_i'(X'X)^{-1}X_{(i)}'\,y_{(i)}}{1-h_{ii}}$$
$$= \frac{y_i(1-h_{ii}) - x_i'(X'X)^{-1}X_{(i)}'\,y_{(i)}}{1-h_{ii}}$$
Since $X_{(i)}'\,y_{(i)} = X'y - x_i y_i$,
$$e_{(i)} = \frac{y_i(1-h_{ii}) - x_i'(X'X)^{-1}(X'y - x_i y_i)}{1-h_{ii}} = \frac{y_i(1-h_{ii}) - x_i'\hat{\beta} + h_{ii}y_i}{1-h_{ii}} = \frac{y_i - x_i'\hat{\beta}}{1-h_{ii}} = \frac{e_i}{1-h_{ii}}$$
From the above it is easy to see that the PRESS residual is just the ordinary residual weighted according to the diagonal element $h_{ii}$ of the hat matrix. Residuals associated with points for which $h_{ii}$ is large will have large PRESS residuals. These points will generally be high-influence points. Generally, a large difference between the ordinary residual and the PRESS residual indicates a point where the model fits the data well, but a model built without that point predicts poorly. A sketch of this computation follows.
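A minimal sketch, on simulated data, of computing the PRESS residuals from a single fit and verifying the identity $e_{(i)} = e_i/(1-h_{ii})$ against explicit leave-one-out refits:

import numpy as np

rng = np.random.default_rng(42)
n = 20
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y
h = np.diag(H)
press_fast = e / (1 - h)                   # ith PRESS residual e_i/(1 - h_ii)

# Brute force: delete observation i, refit, predict y_i
press_slow = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    press_slow[i] = y[i] - X[i] @ b_i

assert np.allclose(press_fast, press_slow)  # the identity holds
print("PRESS statistic =", np.sum(press_fast ** 2))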
The standardized deleted residual is calculated as:
$$s_{(i)} = \frac{e_{(i)}}{\sqrt{\operatorname{Var}(e_{(i)})}} = \frac{e_i/(1-h_{ii})}{\sqrt{\sigma^2/(1-h_{ii})}} = \frac{e_i}{\sqrt{\sigma^2(1-h_{ii})}}$$
since $\operatorname{Var}(e_{(i)}) = \operatorname{Var}\left(\frac{e_i}{1-h_{ii}}\right) = \frac{\sigma^2(1-h_{ii})}{(1-h_{ii})^2} = \frac{\sigma^2}{1-h_{ii}}$.
3.2.5 Detecting Outliers: Studentized Deleted (r-Student) Residuals
The studentized residual $r_i$ discussed above is often considered an outlier diagnostic. It is customary to use $MSE$ as an estimate of $\sigma^2$ in computing $r_i$. Another approach would be to use an estimate of $\sigma^2$ based on the data set with the $i$th observation deleted from the sample. Denote the estimate of $\sigma^2$ so obtained by $S_{(i)}^2$. It can be shown that
$$S_{(i)}^2 = \frac{[n-(k+1)]\,MSE - e_i^2/(1-h_{ii})}{n-(k+1)-1}$$
This estimate of $\sigma^2$ is used instead of $MSE$ to produce an externally studentized (deleted) residual, usually called R-student, given by
$$t_i = \frac{e_i}{\sqrt{S_{(i)}^2(1-h_{ii})}}, \qquad i = 1, 2, \ldots, n.$$
In many situations, $t_i$ will differ little from the studentized residual $r_i$. However, if the $i$th observation is influential, then $S_{(i)}^2$ can differ significantly from $MSE$, and thus the R-student statistic will be more sensitive to this point. Under the usual assumptions, R-student has a sampling distribution that can be approximated by a $t$ distribution with $n - (k+1) - 1$ degrees of freedom. A sketch of the computation follows.
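A short sketch (again on simulated, hypothetical data) computing R-student via the $S_{(i)}^2$ update formula and comparing it with the internally studentized residuals:

import numpy as np

rng = np.random.default_rng(7)
n, k = 25, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n), rng.uniform(0, 30, n)])
y = X @ np.array([2.0, 1.5, 0.01]) + rng.normal(scale=3.0, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
e, h = y - H @ y, np.diag(H)
MSE = e @ e / (n - (k + 1))

S2_i = ((n - (k + 1)) * MSE - e**2 / (1 - h)) / (n - (k + 1) - 1)
t = e / np.sqrt(S2_i * (1 - h))           # R-student t_i
r = e / np.sqrt(MSE * (1 - h))            # studentized r_i, for comparison
print(np.round(np.column_stack([r, t]), 3))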
Example 1: The Delivery Time Data
The table below presents the scaled residuals for the soft drink delivery time data model. Examining column 1 (the ordinary residuals, originally calculated in Table 3.3), note that one residual, $e_9 = 7.4197$, seems suspiciously large. Column 2 shows that the standardized residual is $d_9 = e_9/\sqrt{MSE} = 2.2763$. All other standardized residuals are inside the $\pm 2$ limits.
Column 5 contains the PRESS (deleted) residuals. The PRESS residual for point 9 is substantially larger than the corresponding ordinary residual, indicating that this is likely to be a point where the model fits reasonably well but does not provide good predictions of fresh data; the point is remote from the rest of the sample.
Column 6 displays the values of R-student. Only one value, $t_9$, is unusually large. Note that $t_9$ is larger than the corresponding studentized residual $r_9$, indicating that when observation 9 is set aside, $S_{(9)}^2$ is smaller than $MSE$, so clearly this observation is influential. Note that $S_{(9)}^2$ is calculated as follows:
$$S_{(9)}^2 = \frac{[n-(k+1)]\,MSE - e_9^2/(1-h_{9,9})}{n-(k+1)-1} = \frac{(22)(10.6239) - (7.4197)^2/(1-0.49829)}{21} = 5.9046$$
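A quick numeric check of this calculation (the delivery time data have n = 25 observations and k = 2 regressors, so n - (k + 1) = 22, consistent with the 22 and 21 above):

n_minus_p = 22            # n - (k + 1) with n = 25, k = 2
MSE = 10.6239
e9, h99 = 7.4197, 0.49829
S2_9 = (n_minus_p * MSE - e9**2 / (1 - h99)) / (n_minus_p - 1)
print(round(S2_9, 4))     # 5.9046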
Fitted values and residuals (note that the residuals sum to zero):

   ŷ_i        e_i
 2158.70    106.76
 1678.15    -67.27
 2316.00    -14.59
 2061.30     65.09
 2207.50   -215.98
 1708.30   -213.60
 1784.70     48.56
 2575.00     40.06
 2357.90      8.73
 2256.70     37.57
 2165.20     20.37
 2399.55    -88.95
 1799.80     80.82
 2336.75     71.17
 1765.30    -45.14
 2053.50     94.44
 2414.40      9.50
 2200.50     37.10
 2654.20    100.68
 1753.70    -75.32
3.3 RESIDUAL PLOTS
Graphical analysis of residuals is a very effective way to investigate the adequacy of the fit of a regression model and to check the underlying assumptions. In this section, the basic residual plots are introduced and illustrated. These plots are typically generated by the regression options in computer software packages. They should be examined routinely in all regression modeling problems. It is often a good idea to plot both the original residuals and one or more of the scaled residuals. Studentized residuals are usually chosen because they have approximately constant variance.
An idealized plot whose points lie close to a straight line can be considered normal. Conversely, panel (c) shows flattening at the extremes, which is a pattern typical of samples from a distribution with heavier tails than the normal. Panels (d) and (e) exhibit patterns associated with positive and negative skew, respectively.
Because samples taken from a normal distribution will not plot exactly as a straight line, some experience is required to interpret normal probability plots. Daniel and Wood [1980] present normal probability plots for sample sizes 8-384. Study of these plots is helpful in acquiring a feel for how much deviation from the straight line is acceptable. Small sample sizes ($n \le 16$) often produce normal probability plots that deviate substantially from linearity. For larger sample sizes ($n \ge 32$) the plots are much better behaved. Usually about 20 points are required to produce normal probability plots that are stable enough to be easily interpreted.
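A normal probability plot can be produced, for example, with scipy's probplot; the residuals below are simulated stand-ins rather than output from a fitted model:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(3)
resid = rng.normal(size=40)                   # stand-in for model residuals

stats.probplot(resid, dist="norm", plot=plt)  # ordered residuals vs normal quantiles
plt.title("Normal probability plot of residuals")
plt.show()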
Andrews [1979] and Gnanadesikan [1977] note that normal probability plots often exhibit no unusual behavior even if the errors $\varepsilon_i$ are not normally distributed. This problem occurs because the residuals are not a simple random sample; they are the remnants of a parameter estimation process. The residuals are actually linear combinations of the model errors. Thus, fitting the parameters tends to destroy the evidence of non-normality in the residuals, and consequently one cannot always rely on the normal probability plot to detect departures from normality. A common defect that shows up on the normal probability plot is the occurrence of one or two large residuals. Sometimes this is an indication that the corresponding observations are outliers.
Example: The Delivery Time Data
The figure below presents normal probability plots of the residuals from the regression model for the delivery time data. Notice that the residuals are plotted on the vertical axis. The original least squares residuals are plotted in panel (a) and the R-student residuals are plotted in panel (b).
The residuals in both plots do not lie exactly along a straight line, indicating that there may be some problems with the normality assumption, or that there may be one or more outliers in the data. From the earlier example, the studentized residual for observation 9 is moderately large ($r_9 = 3.2138$), as is the R-student residual ($t_9 = 4.3108$). However, there is no indication of a severe problem in the delivery time data.
Ryan-Joiner test (similar to the Shapiro-Wilk test)
This test assesses normality by calculating the correlation between the data and the normal scores of the data. If the correlation coefficient is near 1, the population is likely to be normal. The test statistic measures the strength of this correlation, and the null hypothesis of population normality is rejected if the statistic falls below the critical value. The test statistic is calculated as follows:
$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
where the $x_{(i)}$ are the ordered sample values and the $a_i$ are constants generated from the means, variances, and covariances of the order statistics of a sample of size $n$ from a normal distribution.
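The Ryan-Joiner test itself is implemented in Minitab rather than in scipy, but the closely related Shapiro-Wilk test is available as scipy.stats.shapiro. A minimal sketch on simulated data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(loc=10, scale=2, size=30)

W, p = stats.shapiro(x)
print(f"W = {W:.4f}, p-value = {p:.4f}")  # large p: do not reject normality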
Kolmogorov-Smirnov test
This is a test that is based on the empirical cumulative distribution function (ecdf). Given $n$ ordered data points $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$, the ecdf is defined as:
$$E_n(x_{(i)}) = \frac{n(i)}{n}$$
where $n(i)$ is the number of points less than or equal to $x_{(i)}$ (the data having been ordered from smallest to largest). Thus $E_n$ is a step function that increases by $1/n$ at the value of each ordered data point. The test statistic is defined as:
$$D = \max_{1 \le i \le n}\left\{ F(x_{(i)}) - \frac{i-1}{n},\; \frac{i}{n} - F(x_{(i)}) \right\}$$
where $F$ is the theoretical cumulative normal distribution. The null hypothesis of normality is rejected if the test statistic is greater than the critical value obtained from a table.
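A sketch of the test using scipy.stats.kstest against a normal distribution with mean and standard deviation estimated from the sample. One caveat: when the parameters are estimated from the same data, the standard Kolmogorov-Smirnov critical values are only approximate (the Lilliefors correction addresses this):

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
x = rng.normal(loc=5, scale=1.5, size=50)

# p-value is approximate here because the normal parameters are estimated
D, p = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))
print(f"D = {D:.4f}, p-value = {p:.4f}")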
The patterns in panels (b) and (c) indicate that the variance of the errors is not constant. The outward-opening funnel pattern in panel (b) implies that the variance is an increasing function of $y$ (an inward-opening funnel is also possible, indicating that $\operatorname{Var}(\varepsilon)$ increases as $y$ decreases). This scenario occurs when the error enters the model multiplicatively rather than additively, as is often assumed: $y = E(y)\,\varepsilon$ rather than $y = E(y) + \varepsilon$.
The double-bow pattern in panel (c) often occurs when $y$ is a proportion between zero and one, as in binomial experiments: the variance of a binomial proportion near 0.5 is greater than that of one near zero or one. A curved plot such as in panel (d) indicates non-linearity. This could mean that other independent variables are needed in the model; for example, a squared term may be necessary. In practice, transformations on the response are generally employed to stabilize the variance; transformations on the independent variables, or the method of weighted least squares, may also be helpful in these cases.
A plot of the residuals against the fitted values $\hat{y}_i$ may also reveal one or more unusually large residuals. These points are, of course, potential outliers. Large residuals that occur at the extreme $\hat{y}_i$ values could also indicate either that the variance is not constant or that the true relationship between $y$ and $x$ is not linear. These possibilities should be investigated before the points are considered outliers. A sketch of such a plot follows.
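A minimal sketch of a residual-versus-fitted plot on simulated data; a structureless horizontal band around zero is the ideal pattern:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(11)
n = 30
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
yhat, e = H @ y, y - H @ y

plt.scatter(yhat, e)
plt.axhline(0, linestyle="--")
plt.xlabel("fitted value $\\hat{y}_i$")
plt.ylabel("residual $e_i$")
plt.show()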
Example: The Delivery Time Data
The figure below presents plots of the original residuals and the R-student values versus the fitted values of delivery time. These plots do not exhibit any strongly unusual pattern, although the large residual $e_9$ (or $t_9$) shows up clearly. There does seem to be a slight tendency for the model to under-predict short delivery times and over-predict long delivery times.
Example: The Delivery Time Data
The figure below presents the plots of R-student values from the delivery time problem versus both independent variables. Panel (a) plots R-student values versus cases and panel (b) plots R-student values versus distance. Neither of these plots reveals any clear indication of a problem with either misspecification of the regressor (implying the need for either a transformation on the regressor or higher-order terms in cases and/or distance) or inequality of variance, although the moderately large residual associated with point 9 is apparent on both plots.
3.4 DETECTING LACK OF FIT
Consider the general linear model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$.
Assume that this model is correctly specified (i.e., that the terms in the model accurately represent the true relationship between $y$ and the independent variables). Recall that the assumption made on the random error was that $E(\varepsilon) = 0$ for any given set of values of $x_1, x_2, \ldots, x_k$, implying: $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$.
$$\hat{\varepsilon}_p = y - \left(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_{j-1} x_{j-1} + \hat{\beta}_{j+1} x_{j+1} + \cdots + \hat{\beta}_k x_k\right) = \hat{\varepsilon} + \hat{\beta}_j x_j$$
That is, partial residuals measure the influence of $x_j$ on the dependent variable $y$ after the effects of the other independent variables $x_1, x_2, \ldots, x_{j-1}, x_{j+1}, \ldots, x_k$ have been removed or accounted for.
If the variable $x_3$ enters the model linearly, the partial residual plot of $e(y\,|\,x_1,x_2)$ against $e(x_3\,|\,x_1,x_2)$ should show a linear relationship, that is, the partial residuals should fall along a straight line with non-zero slope. The slope of this line will be the regression coefficient of $x_3$ in the multiple linear regression model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$. Thus, if the partial residual plot shows a horizontal band, this indicates that there is no additional useful information in $x_3$ for describing $y$. Meanwhile, if the partial residual plot shows a curvilinear band, then higher-order terms in $x_3$ or a transformation on $x_3$ may be needed. A sketch follows.
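A sketch of a partial residual plot for $x_3$ on simulated data, using the definition of the partial residual, $\hat{\varepsilon} + \hat{\beta}_j x_j$, given above:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(21)
n = 60
x1, x2, x3 = rng.normal(size=(3, n))
y = 1 + 0.5*x1 - 0.3*x2 + 1.2*x3 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2, x3])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b

partial = e + b[3] * x3                    # partial residuals for x3
plt.scatter(x3, partial)                   # a straight band with slope b3
plt.xlabel("$x_3$")                        # supports a linear term in x3
plt.ylabel("partial residual for $x_3$")
plt.show()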
The lack-of-fit test requires that we have replicate observations on the response $y$ for at least one level of $x$. Furthermore, the procedure assumes that the normality, independence, and constant variance requirements are met and that only the first-order or straight-line character of the relationship is in doubt. Suppose that we have $n_i$ observations on the response at the $i$th level of the independent variable $x_i$, $i = 1, 2, \ldots, m$. Let $y_{ij}$ denote the $j$th observation on the response at $x_i$, $j = 1, 2, \ldots, n_i$, so that there are $n = \sum_{i=1}^{m} n_i$ total observations.
The test procedure involves partitioning the residual sum of squares into two components:
$$SS_E = SS_{PE} + SS_{LOF}$$
where $SS_{PE}$ is the sum of squares due to pure error and $SS_{LOF}$ is the sum of squares due to lack of fit. Note that the $(ij)$th residual is now written as:
$$y_{ij} - \hat{y}_i = (y_{ij} - \bar{y}_i) + (\bar{y}_i - \hat{y}_i)$$
where $\bar{y}_i$ is the average of the $n_i$ observations at $x_i$. Squaring both sides and summing over $i$ and $j$ produces:
$$\sum_{i=1}^{m}\sum_{j=1}^{n_i}(y_{ij} - \hat{y}_i)^2 = \sum_{i=1}^{m}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2 + \sum_{i=1}^{m} n_i(\bar{y}_i - \hat{y}_i)^2$$
It can be seen that the pure error sum of squares
$$SS_{PE} = \sum_{i=1}^{m}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2$$
is obtained by computing the corrected sum of squares of the repeat observations at each level of $x_i$ and then pooling over the $m$ levels of $x$. If the assumption of constant variance is satisfied, this is a model-independent measure of pure error, since only the variability of the dependent variable $y$ at each $x$ level is used to compute $SS_{PE}$. Since there are $(n_i - 1)$ degrees of freedom for pure error at each level $x_i$, the total number of degrees of freedom associated with $SS_{PE}$ is
$$\sum_{i=1}^{m}(n_i - 1) = n - m.$$
Meanwhile, the lack-of-fit sum of squares
$$SS_{LOF} = \sum_{i=1}^{m} n_i(\bar{y}_i - \hat{y}_i)^2$$
is a weighted sum of squared deviations between the mean response $\bar{y}_i$ at each $x$ level and the corresponding fitted value. If the fitted values $\hat{y}_i$ are close to the corresponding average responses $\bar{y}_i$, then there is a strong indication that the regression function is linear. There are $(m - 2)$ degrees of freedom associated with $SS_{LOF}$, since there are $m$ levels of $x$ and two degrees of freedom are lost because two parameters must be estimated to obtain $\hat{y}_i$.
Computationally, $SS_{LOF}$ is obtained by subtracting $SS_{PE}$ from $SS_E$. The test statistic for lack of fit is given as:
$$F_{lof} = \frac{SS_{LOF}/(m-2)}{SS_{PE}/(n-m)} = \frac{MS_{LOF}}{MS_{PE}}$$
If the true regression function is linear, then the statistic $F_{lof}$ follows the $F_{m-2,\,n-m}$ distribution. The hypotheses to be tested are as below:
$H_0$: the linear relationship assumed in the model is reasonable
$H_1$: the linear relationship assumed in the model is not reasonable
Example

x   1.0    1.0   2.0    3.3    3.3    4.0    4.0    4.0    4.7
y   10.84  9.30  16.35  22.88  24.35  24.56  25.86  29.16  24.59

x   5.0    5.6    5.6    5.6    6.0    6.0    6.5    6.9
y   22.25  25.90  27.20  25.61  25.45  26.56  21.03  21.46
Check that the fitted straight-line regression model is $\hat{y} = 13.301 + 2.108x$, with $SS_T = 487.6126$, $SS_R = 234.7087$, and $SS_E = 252.9039$. Note that there are $m = 10$ distinct levels of $x$, with repeat points at $x = 1.0, 3.3, 4.0, 5.6$, and $6.0$.
[Figure: Scatterplot of y versus x]
From the plot above, there is some indication that a straight-line regression would not be satisfactory, and it would be helpful to conduct a test to determine whether systematic curvature is present.
The $SS_{PE}$ is computed using the repeat points as follows:

Level of x   y_ij                   ȳ_i      Σ_j (y_ij − ȳ_i)²   d.o.f.
1.0          10.84, 9.30            10.070     1.1858             1
3.3          22.88, 24.35           23.615     1.0805             1
4.0          24.56, 25.86, 29.16    26.527    11.2467             2
5.6          25.90, 27.20, 25.61    26.237     1.4341             2
6.0          25.45, 26.56           26.005     0.6161             1
Total                                         15.5632             7
The ANOVA table with $SS_E$ partitioned into $SS_{PE}$ and $SS_{LOF}$ is given as:

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F / F_lof   p-value
Regression                234.7087          1                  234.7087     13.92      0.002
Residual                  252.9039         15                   16.8603
  Lack of fit             237.3407          8                   29.6676     13.34      0.0013
  Pure error               15.5632          7                    2.2233
Total                     487.6126         16
From the table above, the lack-of-fit test statistic is quite large with a very small p-value, and thus the tentative straight-line regression model is not a reasonable description of the relationship between x and y; there is lack of fit.
The ANOVA table from Minitab 16 is given below for comparison:

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F / F_lof   p-value
Regression                237.48            1                  237.48       14.24      0.002
Residual                  250.13           15                   16.68
  Lack of fit             234.57            8                   29.32       13.19      0.001
  Pure error               15.56            7                    2.22
Total                     487.61           16
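The whole partition can be reproduced with a short numpy/scipy sketch on the example data. Its results match the Minitab output above; the small differences from the hand computation appear to stem from rounding in the hand-fitted coefficients.

import numpy as np
from scipy import stats

x = np.array([1.0, 1.0, 2.0, 3.3, 3.3, 4.0, 4.0, 4.0, 4.7,
              5.0, 5.6, 5.6, 5.6, 6.0, 6.0, 6.5, 6.9])
y = np.array([10.84, 9.30, 16.35, 22.88, 24.35, 24.56, 25.86, 29.16, 24.59,
              22.25, 25.90, 27.20, 25.61, 25.45, 26.56, 21.03, 21.46])

n, m = len(x), len(np.unique(x))             # n = 17 observations, m = 10 levels
X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)    # straight-line least squares fit
e = y - X @ b
SSE = e @ e

# Pure error: pooled corrected SS of the repeats at each x level
SSPE = sum(((y[x == xv] - y[x == xv].mean())**2).sum() for xv in np.unique(x))
SSLOF = SSE - SSPE

F_lof = (SSLOF / (m - 2)) / (SSPE / (n - m))
p = stats.f.sf(F_lof, m - 2, n - m)
print(f"SSE={SSE:.4f}, SSPE={SSPE:.4f}, SSLOF={SSLOF:.4f}")
print(f"F_lof={F_lof:.2f}, p={p:.4f}")       # agrees with the Minitab table above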