
CHAPTER 3

MODEL ADEQUACY CHECKING

The fitting of a linear regression model, the estimation of its parameters, and the testing of hypotheses on the estimated coefficients are based on the following major assumptions:
1. The relationship between the study variable and explanatory variables is linear, at
least approximately.
2. The error term has zero mean.
3. The error term has constant variance.
4. The errors are uncorrelated.
5. The errors are normally distributed.
In short, the validity of many inferences associated with a regression analysis depends on the error term, $\varepsilon$, satisfying the above assumptions. Taken together, assumptions 4 and 5 imply that the errors are independent random variables. Assumption 5 is required for hypothesis testing and interval estimation. The validity of these assumptions is needed for the results to be meaningful. Always consider the validity of these assumptions to be doubtful and conduct analyses to examine the adequacy of the model that has been tentatively entertained. The types of model inadequacies discussed here have potentially serious consequences.
If these assumptions are violated, the results can be incorrect and may have serious consequences. If the departures are small, the final results may not change significantly and the least squares regression analysis produces reliable statistical tests and confidence intervals. But if the departures are large, the model obtained may become unstable in the sense that a different sample could lead to an entirely different model with opposite conclusions. Detection and examination of departures from the underlying assumptions cannot be made using standard summary statistics such as the t-statistic, F-statistic or $R^2$.
One important point to keep in mind is that these assumptions are for the population, while we work only with a sample. So the main issue is to make a decision about the population on the basis of a sample of data. Several diagnostic methods for checking violations of the regression assumptions are based on the study of the model residuals with the help of various types of graphics.

3.1 CHECKING OF LINEAR RELATIONSHIP


3.1.1 CASE OF ONE EXPLANATORY VARIABLE
If there is only one explanatory variable in the model, then it is easy to check for the existence of a linear relationship between the dependent variable, y, and the independent variable, x, by using a scatter diagram. If the scatter diagram shows a linear trend, it indicates that the relationship between y and x is linear. If the trend is not linear, then it indicates that the relationship between y and x is nonlinear.
However, as we shall see later, linearization can often be achieved by applying a suitable transformation to the variables.

3.1.2. CASE OF MORE THAN ONE EXPLANATORY VARIABLES


To check the assumption of linearity between the dependent variable and a set of explanatory variables, the matrix scatter plot of the data can be used. A scatterplot matrix is a two-dimensional array of two-dimensional plots where each frame contains a scatter diagram, except for the diagonal. Thus, each plot sheds some light on the relationship between a pair of variables. It gives more information than the correlation coefficient between each pair of variables because it gives a sense of the linearity or nonlinearity of the relationship and some awareness of how the individual data points are scattered over the region. It consists of the scatter plots of y versus $x_1$, y versus $x_2$, ..., y versus $x_k$, as well as the scatter plots of $x_i$ versus $x_j$.

The presence of linear patterns is reassuring, but the absence of such patterns does not imply that the linear model is incorrect. Most statistical software provides an option for creating the scatterplot matrix. A view of all the plots provides an indication of whether a multiple linear regression model may provide a reasonable fit to the data. Keep in mind that the scatter plots of y versus $x_1$, y versus $x_2$, ..., y versus $x_k$ only give information on pairs of variables, whereas the assumption of linearity is between y and $x_1, x_2, \ldots, x_k$ jointly.
If some of the explanatory variables are themselves interrelated, this provides a first indication of the problem of multicollinearity.
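As an illustration, the minimal sketch below draws a scatterplot matrix with pandas and seaborn; the file name and column names (y, x1, x2) are assumptions for the example and should be replaced with your own data.

```python
# Sketch: scatterplot matrix for a response y and regressors x1, x2
# (the file and column names below are illustrative assumptions).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("delivery_time.csv")   # assumed data set with columns y, x1, x2
sns.pairplot(df[["y", "x1", "x2"]])     # scatter diagram for every pair of variables
plt.show()
```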

3.2 RESIDUAL ANALYSIS


3.2.1 Definition of Residuals
Although both dependent and independent variables are observed, the true
“population” regression parameters are not known. Since the model coefficients
from OLS estimation are not the same as the true population parameters, the
estimation errors occur, and these are better known as the residuals. The residuals are defined as
$$e_i = y_i - \hat{y}_i, \qquad i = 1, 2, \ldots, n$$
where $y_i$ is an observation and $\hat{y}_i$ is the corresponding fitted value. The fitted value is obtained by substituting the values of the x variables into the regression model. Since a residual may be viewed as the deviation between the data and the fit, it is also a measure of the variability in the response variable not explained by the regression model. It is also convenient to think of the residuals as the realized or observed values of the model errors. Thus, any departures from the assumptions on the errors should show up in the residuals. Analysis of the residuals is an effective way to discover several types of model inadequacies. As we will see, plotting residuals is a very effective way to investigate how well the regression model fits the data and to check the assumptions previously mentioned.
The residuals have several important properties. They have zero mean, since
$$E(e_i) = E(y_i - \hat{y}_i) = E\left[(\beta_0 + \beta_1 x_i + \varepsilon_i) - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right]$$
$$= E(\beta_0 + \beta_1 x_i) + E(\varepsilon_i) - E(\hat{\beta}_0 + \hat{\beta}_1 x_i) = \beta_0 + \beta_1 x_i + 0 - \beta_0 - \beta_1 x_i = 0$$
and their approximate average variance is estimated by
$$\frac{\sum_{i=1}^{n}(e_i - \bar{e})^2}{n-(k+1)} = \frac{\sum_{i=1}^{n} e_i^2}{n-(k+1)} = \frac{SS_E}{n-(k+1)} = MS_E$$

That is, the standard deviation of the residuals is estimated by the standard error of the fitted regression model, $\sqrt{MS_E}$. The residuals are not independent, however, as the n
residuals have only ( n − p ) degrees of freedom associated with them. This non-
independence of the residuals has little effect on their use for model adequacy
checking as long as n is not small relative to the number of parameters p.
Sometimes it is useful to work with scaled residuals. These scaled residuals are
helpful in finding observations that are outliers, or extreme values, that is,
observations that are separated in some fashion from the rest of the data.

3.2.2 Detecting Outliers: Standardized residuals


The residuals are standardized based on the concept of a residual minus its mean, divided by its standard deviation. Since $E(e_i) = 0$ and $MS_E$ estimates the approximate average variance, the standardized residuals are calculated as
$$d_i = \frac{e_i}{\sqrt{MS_E}}, \qquad i = 1, 2, \ldots, n$$
The standardized residuals have mean zero and approximately unit variance. Consequently, a large standardized residual ($|d_i| > 3$, say) potentially indicates an outlier.
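A minimal sketch of this calculation with statsmodels is shown below; the arrays x and y are assumed to be available, and the $|d_i| > 3$ screening rule follows the text.

```python
# Sketch: standardized residuals d_i = e_i / sqrt(MSE) for an OLS fit
# (x and y are assumed NumPy arrays; the |d_i| > 3 rule follows the text).
import numpy as np
import statsmodels.api as sm

X = sm.add_constant(x)              # x: (n, k) array of regressors
fit = sm.OLS(y, X).fit()            # y: (n,) response vector
mse = fit.mse_resid                 # SSE / (n - k - 1)
d = fit.resid / np.sqrt(mse)        # standardized residuals
print(np.where(np.abs(d) > 3)[0])   # indices of potential outliers
```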

3.2.3 Detecting Outliers: Studentized Residuals


Using $MS_E$ as the variance of the ith residual $e_i$ is only an approximation. This residual scaling can be improved by dividing $e_i$ by the exact standard deviation of the ith residual. Recall that we may write the vector of residuals as
$$e = (I - H)y$$

where H = X ( XX) X is the hat matrix. The hat matrix has several useful
−1

properties. It is symmetric ( H = H ) and idempotent ( HH = H ) . Similarly the


matrix ( I − H ) is symmetric and idempotent. Substituting y = X +  into above
yields
e = ( I − H )( X +  ) = X − HX + ( I − H ) 
= X − X ( XX ) XX + ( I − H )  = ( I − H ) 
−1

Thus, the residuals are the same linear transformation of the observations y and the
errors  . The covariance matrix of the residuals is given by:
Var ( e ) = Var ( I − H )   = ( I − H )Var (  )( I − H ) =  2 ( I − H )
since Var (  ) =  2I and ( I − H ) is symmetric and idempotent. The matrix ( I − H ) is
generally not diagonal, so the residuals have different variances and they are
correlated. The variance of the ith residual is given by:
Var ( i ) =  2 (1 − hii )
where hij is the ith diagonal element of the hat matrix H. The covariance between
residuals  i and  j is given by:
Cov (  i ,  j ) = − 2 hij
where $h_{ij}$ is the (i, j)th element of the hat matrix. Now, since $0 \le h_{ii} \le 1$, using the residual mean square $MS_E$ to estimate the variance of the residuals actually overestimates $\mathrm{Var}(e_i)$. Furthermore, since $h_{ii}$ is a measure of the location of the ith point in x-space, $\mathrm{Var}(e_i)$ depends on where the point $x_i$ lies. Generally, points near the centre of the x-space have larger residual variance (poorer least squares fit) than points at more remote locations. Violations of model assumptions are more likely at remote points, and these violations may be hard to detect from inspection of the ordinary residuals $e_i$ (or the standardized residuals $d_i$) because their residuals will usually be smaller.
A logical procedure, then, is to examine the studentized residuals
$$r_i = \frac{e_i}{\sqrt{MS_E(1 - h_{ii})}}, \qquad i = 1, 2, \ldots, n$$
instead of $e_i$ (or $d_i$). The studentized residuals have constant variance, $\mathrm{Var}(r_i) = 1$, regardless of the location $x_i$ when the form of the model is correct. In many situations the variance of the residuals stabilizes, particularly for large data sets. In these cases there may be little difference between the standardized and studentized residuals. Thus, standardized and studentized residuals often convey equivalent information. However, since any point with a large residual and a large $h_{ii}$ is potentially highly influential on the least squares fit, examination of the studentized residuals is generally recommended.
Some of these points are easy to see by examining the studentized residuals for a simple linear regression model. If there is only one independent variable (SLR), it can be shown that the studentized residuals are given by
$$r_i = \frac{e_i}{\sqrt{MS_E\left[1 - \left(\dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{SS_{xx}}\right)\right]}}, \qquad i = 1, 2, \ldots, n$$
Notice that when the observation $x_i$ is close to the midpoint of the x-data, $x_i - \bar{x}$ will be small and the estimated standard deviation of $e_i$ (the denominator) will be large. Conversely, when $x_i$ is near the extreme ends of the range of the x-data, $x_i - \bar{x}$ will be large and the estimated standard deviation of $e_i$ will be small. Also, when the sample size n is very large, the effect of $(x_i - \bar{x})^2$ will be relatively small, so in big data sets the studentized residuals may not differ dramatically from the standardized residuals.
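As a sketch (not the only way to compute them), the studentized residuals can be obtained directly from the hat diagonal; statsmodels exposes the same quantities through its influence diagnostics, but the explicit version below makes the formula visible. X is assumed to already contain the intercept column.

```python
# Sketch: internally studentized residuals r_i = e_i / sqrt(MSE * (1 - h_ii)).
import numpy as np

def studentized_residuals(X, y):
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # h_ii = x_i' (X'X)^{-1} x_i
    e = y - X @ (XtX_inv @ X.T @ y)               # ordinary residuals
    n, p = X.shape                                # p = k + 1 parameters
    mse = e @ e / (n - p)
    return e / np.sqrt(mse * (1.0 - h))
```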

3.2.4 Detecting Outliers: PRESS (DELETED) Residuals


The standardized and studentized residuals are effective in detecting outliers.
Another approach to making residuals useful in finding outliers is to examine the
quantity that is computed from yi − yˆ(i ) , where yˆ( i ) is the fitted value of the ith
response based on all observations except the ith one (i.e the ith observation is
deleted from the model). The logic behind this is that if the ith observation yi is
really unusual, the regression model based on all observations may be overly
influenced by this observation. This could produce a fitted value yˆ i that is very
similar to the observed value yi , and consequently, the ordinary residual ei will be
small. Therefore, it will be hard to detect the outlier.
However, if the ith observation is deleted, then $\hat{y}_{(i)}$ cannot be influenced by that observation, so the resulting residual is more likely to indicate the presence of the outlier. Thus, the idea is that if an observation is very influential, leaving it out could drastically change the appearance of the regression line and would produce a much larger deleted residual than the ordinary residual.
After the ith observation is deleted, the regression model is fit to the remaining $(n-1)$ observations, and the predicted value of $y_i$ corresponding to the deleted observation is calculated. The corresponding prediction error is
$$e_{(i)} = y_i - \hat{y}_{(i)}$$
This prediction error calculation is repeated for each observation i = 1, 2, ..., n .
These prediction errors are usually called PRESS residuals. Some authors call e(i )
deleted residuals. It would initially seem that calculating the PRESS residuals
requires fitting n different regressions. However, it is possible to calculate PRESS
residuals from the results of a single least squares fit to all n observations since it
can be shown that the ith PRESS residual is
$$e_{(i)} = \frac{e_i}{1 - h_{ii}}, \qquad i = 1, 2, \ldots, n$$
Let $\hat{\beta}_{(i)}$ be the vector of estimated coefficients obtained by deleting the ith observation. Then
$$\hat{\beta}_{(i)} = \left(X_{(i)}'X_{(i)}\right)^{-1} X_{(i)}'y_{(i)}$$
where $X_{(i)}$ and $y_{(i)}$ are the X matrix and y vector with the ith observation deleted. The ith deleted residual can then be rewritten as
$$e_{(i)} = y_i - \hat{y}_{(i)} = y_i - x_i'\hat{\beta}_{(i)} = y_i - x_i'\left(X_{(i)}'X_{(i)}\right)^{-1} X_{(i)}'y_{(i)}$$
where $x_i'$ is the ith row of X. If $X'X$ is a $(k \times k)$ matrix and $(X'X - x_i x_i')$ is that matrix with the ith observation removed, it can be shown that
$$\left(X_{(i)}'X_{(i)}\right)^{-1} = \left(X'X - x_i x_i'\right)^{-1} = (X'X)^{-1} + \frac{(X'X)^{-1} x_i x_i' (X'X)^{-1}}{1 - x_i'(X'X)^{-1} x_i} = (X'X)^{-1} + \frac{(X'X)^{-1} x_i x_i' (X'X)^{-1}}{1 - h_{ii}}$$
since $h_{ii} = x_i'(X'X)^{-1}x_i$. Substituting,
$$e_{(i)} = y_i - x_i'\left[(X'X)^{-1} + \frac{(X'X)^{-1} x_i x_i'(X'X)^{-1}}{1 - h_{ii}}\right] X_{(i)}'y_{(i)}$$
$$= y_i - x_i'(X'X)^{-1}X_{(i)}'y_{(i)} - \frac{h_{ii}\, x_i'(X'X)^{-1}X_{(i)}'y_{(i)}}{1 - h_{ii}}$$
$$= \frac{y_i(1 - h_{ii}) - (1 - h_{ii})\,x_i'(X'X)^{-1}X_{(i)}'y_{(i)} - h_{ii}\, x_i'(X'X)^{-1}X_{(i)}'y_{(i)}}{1 - h_{ii}}$$
$$= \frac{y_i(1 - h_{ii}) - x_i'(X'X)^{-1}X_{(i)}'y_{(i)}}{1 - h_{ii}}$$
Since $X'y = X_{(i)}'y_{(i)} + x_i y_i$,
$$e_{(i)} = \frac{y_i(1 - h_{ii}) - x_i'(X'X)^{-1}(X'y - x_i y_i)}{1 - h_{ii}} = \frac{y_i(1 - h_{ii}) - x_i'\hat{\beta} + h_{ii} y_i}{1 - h_{ii}} = \frac{y_i - x_i'\hat{\beta}}{1 - h_{ii}} = \frac{e_i}{1 - h_{ii}}$$

From above it is easy to see that the PRESS residual is just the ordinary residual
weighted according to the diagonal elements of the hat matrix hii . Residuals
associated with points for which hii is large will have large PRESS residuals. These
points will generally be high influence points. Generally, a large difference between
the ordinary residual and the PRESS residual will indicate a point where the model
fits the data well, but a model built without that point predicts poorly.
The standardized deleted residual is calculated as
$$e_{(i)}^{s} = \frac{e_i}{\sqrt{\sigma^2\left(1 - h_{ii}\right)}}$$
3.2.5 Detecting Outliers: Studentized Deleted (r-Student) Residuals
The studentized residual $r_i$ discussed above is often considered an outlier diagnostic. It is customary to use $MS_E$ as an estimate of $\sigma^2$ in computing $r_i$. Another approach would be to use an estimate of $\sigma^2$ based on a data set with the ith observation deleted from the sample. Denote the estimate of $\sigma^2$ so obtained by $S^2_{(i)}$. It can be shown that
$$S^2_{(i)} = \frac{\left[n - (k+1)\right]MS_E - e_i^2/(1 - h_{ii})}{n - (k+1) - 1}$$
This estimate of $\sigma^2$ is used instead of $MS_E$ to produce an externally studentized (deleted) residual, usually called R-student, given by
$$t_i = \frac{e_i}{\sqrt{S^2_{(i)}(1 - h_{ii})}}, \qquad i = 1, 2, \ldots, n$$
In many situations $t_i$ will differ little from the studentized residual $r_i$. However, if the ith observation is influential, then $S^2_{(i)}$ can differ significantly from $MS_E$, and thus the R-student statistic will be more sensitive to this point. Under the usual assumptions, R-student has a sampling distribution that can be approximated by a t distribution with $(n - 1) - (k + 1)$ degrees of freedom.
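A sketch of the computation, reusing the single-fit quantities above (X is assumed to include the intercept column):

```python
# Sketch: R-student (externally studentized) residuals
# t_i = e_i / sqrt(S^2_(i) * (1 - h_ii)), with S^2_(i) from the identity above.
import numpy as np

def r_student(X, y):
    n, p = X.shape                                          # p = k + 1
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    mse = e @ e / (n - p)
    s2 = ((n - p) * mse - e**2 / (1.0 - h)) / (n - p - 1)   # S^2_(i)
    return e / np.sqrt(s2 * (1.0 - h))
```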

Example 1: The Delivery Time Data
The table below presents the scaled residuals using the model for the soft drink delivery time data. Examining column 1 (the ordinary residuals, originally calculated in Table 3.3), note that one residual, $e_9 = 7.4197$, seems suspiciously large. Column 2 shows that the standardized residual is $d_9 = e_9/\sqrt{MS_E} = 2.2763$. All other standardized residuals are inside the $\pm 2$ limits.
Column 3 shows that the studentized residual at point 9 is $r_9 = e_9/\sqrt{MS_E(1 - h_{9,9})} = 3.2138$, which is substantially larger than the standardized residual. Note that point 9 has the largest value of $x_1$ (30 cases) and $x_2$ (1460 feet). Taking account of the remote location of point 9 when scaling its residual, it can be concluded that the model does not fit this point well. The diagonal elements of the hat matrix, which are used extensively in computing scaled residuals, are shown in column 4.

Scaled Residuals for Delivery Time Data


No.    e_i = y_i − ŷ_i (1)    d_i (2)    r_i (3)    h_ii (4)    e_(i) (5)    t_i (6)
1 −5.0281 −1.5426 −1.6277 0.10180 −5.5980 −1.6956
2 1.1464 0.3517 0.3490 0.07070 1.2336 0.3575
3 −0.0498 −0.0153 −0.0161 0.09874 −0.0557 −0.0157
4 4.9244 1.5108 1.5798 0.05838 5.2297 1.6392
5 −0.4444 −0.1363 −0.1418 0.07501 −0.4804 −0.1386
6 −0.2896 −0.0888 −0.0908 0.04287 −0.3025 −0.0887
7 0.8446 0.2501 0.2704 0.08180 0.9198 0.2646
8 1.1566 0.3548 0.3667 0.06373 1.2353 0.3594
9 7.4197 2.2763 3.2138 0.49829 14.7888 4.3108
10 2.3764 0.7291 0.8133 0.19630 2.9568 0.8068
11 2.2375 0.6865 0.7181 0.08613 2.4484 0.7099
12 −0.5930 −0.1819 −0.1932 0.11366 −0.6690 −0.1890
13 1.0270 0.3151 0.3252 0.06113 1.0938 0.3185
14 1.0675 0.3275 0.3411 0.07824 1.1581 0.3342
15 0.6712 0.2059 0.2103 0.04111 0.7000 0.2057
16 −0.6629 −0.2034 −0.2227 0.16594 −0.7948 −0.2178
17 0.4364 0.1339 0.1381 0.05943 0.4640 0.1349
18 3.4486 1.0580 1.1130 0.09626 3.8159 1.1193
19 1.7932 0.5502 0.5787 0.09645 1.9846 0.5296
20 −5.7880 −1.7758 −1.8736 0.10169 −6.4432 −1.9967
21 −2.6142 −0.8020 −0.8779 0.16528 −3.1318 −0.8731
22 −3.6865 −1.1310 −1.4500 0.39158 −6.0591 −1.4896
23 −4.6076 −1.4136 −1.4437 0.04126 −4.8059 −1.4825
24 −4.5728 −1.4029 −1.4961 0.12061 −5.2000 −1.5422
25 −0.2126 −0.0652 −0.0675 0.06664 −0.2278 −0.0660

Column 5 contains the PRESS (deleted) residuals. The PRESS residual for point 9 is substantially larger than the corresponding ordinary residual, indicating that this is likely to be a point where the model fits reasonably well but does not provide good predictions of fresh data; such points are remote from the rest of the sample. Column 6 displays the values of R-student. Only one value, $t_9$, is unusually large. Note that $t_9$ is larger than the corresponding studentized residual $r_9$, indicating that when observation 9 is set aside, $S^2_{(9)}$ is smaller than $MS_E$, so clearly this observation is influential. Note that $S^2_{(9)}$ is calculated as follows:

$$S^2_{(9)} = \frac{\left[n - (k+1)\right]MS_E - e_9^2/(1 - h_{9,9})}{n - (k+1) - 1} = \frac{(22)(10.6239) - (7.4197)^2/(1 - 0.49829)}{21} = 5.9046$$
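As a quick numerical check of this arithmetic (a sketch only; n = 25 and k = 2 for the delivery time data):

```python
# Quick numerical check of S^2_(9) for the delivery time data
# (values taken from the table above).
n, k = 25, 2
mse, e9, h99 = 10.6239, 7.4197, 0.49829
s2_9 = ((n - (k + 1)) * mse - e9**2 / (1 - h99)) / (n - (k + 1) - 1)
print(round(s2_9, 4))   # approximately 5.9046
```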

Example 2: The Rocket Propellant Data


y_i    e_i = y_i − ŷ_i    d_i    r_i    h_ii    e_(i)    S²_(i)    t_i

2158.70 106.76
1678.15 -67.27
2316.00 -14.59
2061.30 65.09
2207.50 -215.98
1708.30 -213.60
1784.70 48.56
2575.00 40.06
2357.90 8.73
2256.70 37.57
2165.20 20.37
2399.55 -88.95
1799.80 80.82
2336.75 71.17
1765.30 -45.14
2053.50 94.44
2414.40 9.50
2200.50 37.10
2654.20 100.68
1753.70 -75.32

3.3 RESIDUAL PLOTS
Graphical analysis of residuals is a very effective way to investigate the adequacy of the fit of a regression model and to check the underlying assumptions. In this section, the basic residual plots are introduced and illustrated. These plots are typically generated by the regression options or commands in statistical software packages. They should be examined routinely in all regression modeling problems. It is often a good idea to plot both the original residuals and one or more of the scaled residuals. Studentized residuals are usually chosen because they have constant variance.

3.3.1 NORMAL PROBABILITY PLOT


The assumption of normality of disturbances is very much needed for the validity of
the results for testing of hypothesis, confidence intervals and prediction intervals.
Small departures from the normality assumptions do not affect the model greatly,
but gross non-normality is potentially more serious as the t or F statistics and
confidence and prediction intervals depend on the normality assumption.
Furthermore, if the errors come from a distribution with thicker or heavier tails than
the normal, the least squares fit may be sensitive to a small subset of the data. Heavy-
tailed error distributions often generate outliers that “pull” the least squares fit too
much in their direction. In these cases, other estimation techniques (such as the
robust regression methods) should be considered.
A very simple method of checking the normality assumption is to construct a normal probability plot of the residuals. This is a graph designed so that the cumulative normal distribution will plot as a straight line. Let $e_{[1]} \le e_{[2]} \le \ldots \le e_{[n]}$ be the residuals ranked in increasing order. If the ranked residuals $e_{[i]}$ are plotted against the cumulative probability $P_i = (i - \tfrac{1}{2})/n$, $i = 1, 2, \ldots, n$, on the normal probability plot, the resulting points should lie approximately on a straight line. The straight line is usually determined visually, with emphasis on the central values (e.g., the 0.33 and 0.67 cumulative probability points) rather than the extremes. If the residuals are normally distributed, then the ordered residuals should be approximately the same as the ordered normal scores, so the resulting points should lie approximately on a straight line with intercept zero and slope one (these being the mean and standard deviation of the standardized residuals). Substantial departures from a straight line indicate that the distribution is not normal.
Sometimes normal probability plots are constructed by plotting the ranked residual $e_{[i]}$ against the "expected normal value" $\Phi^{-1}\left[(i - \tfrac{1}{2})/n\right]$, where $\Phi$ denotes the standard normal cumulative distribution function. Note that Minitab uses $P_i = (i - \tfrac{3}{8})/(n + \tfrac{1}{4})$. Figure (a) displays an "idealized" normal probability plot.
Notice that the points lie approximately along a straight line. Panels (b-e) present
other typical problems. Panel (b) shows a sharp upward and downward curve at both
extremes, indicating that the tails of this distribution are too light for it to be

considered normal. Conversely, panel (c) shows flattening at the extremes, which is
a pattern typical of samples from a distribution with heavier tails than the normal.
Panels (d) and (e) exhibit patterns associated with positive and negative skew,
respectively.

Because samples taken from a normal distribution will not plot exactly as a straight
line, some experience is required to interpret normal probability plots. Daniel and
Wood [1980] present normal probability plots for sample sizes 8-384. Study of these
plots is helpful in acquiring a feel for how much deviation from the straight line is
acceptable. Small sample sizes ( n  16 ) often produce normal probability plots that
deviate substantially from linearity. For large sample sizes ( n  32 ) the plots are
much better behaved. Usually about 20 points are required to produce normal
probability plots that are stable enough to be easily interpreted.
Andrews [1979] and Gnanadesikan [1977] note that normal probability plots often exhibit no unusual behavior even if the errors $\varepsilon_i$ are not normally distributed. This problem occurs because the residuals are not a simple random sample; they are the remnants of a parameter estimation process. The residuals are actually linear combinations of the model errors. Thus, fitting the parameters tends to destroy the evidence of non-normality in the residuals, and consequently we cannot always rely on the normal probability plot to detect departures from normality. A common defect that shows up on the normal probability plot is the occurrence of one or two large residuals. Sometimes this is an indication that the corresponding observations are outliers.
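A sketch of how such a plot can be produced with scipy (the plotting positions differ slightly from Minitab's, and resid is assumed to hold the residuals or R-student values):

```python
# Sketch: normal probability (Q-Q) plot of the residuals.
import matplotlib.pyplot as plt
from scipy import stats

stats.probplot(resid, dist="norm", plot=plt)   # resid: array of residuals (assumed)
plt.title("Normal probability plot of residuals")
plt.show()
```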

Example: The Delivery Time Data
Figure below presents normal probability plots of the residuals from the regression
model for the delivery time data. Notice that the residuals are plotted on the vertical
axis. The original least squares residuals are plotted in Figure (a) and the R-student
residuals are plotted in Figure (b).
The residuals in both plots do not lie exactly along a straight line, indicating that
there may be some problems with the normality assumption, or that there may be
one or more outliers in the data. From Example 1, the studentized residual for observation 9 is moderately large ($r_9 = 3.2138$), as is the R-student residual ($t_9 = 4.3108$). However, there is no indication of a severe problem in the delivery
time data.

An alternative to assessing the normal probability plot is to conduct a normality test. Unlike the graphical technique mentioned above, a normality test is a one-sample hypothesis test to determine whether the population from which the sample is drawn is non-normal. The null hypothesis for a normality test states that the population is normal. There exist various forms of normality tests, and a few will be discussed.
Anderson-Darling test
This test compares the empirical cumulative distribution function of the sample data
with the distribution expected if the data were normal. If the observed difference is
sufficiently large, the null hypothesis of population normality is rejected. The test
statistic is calculated as
$$A^2 = -n - S, \qquad \text{where } S = \sum_{i=1}^{n} \frac{2i - 1}{n}\left[\ln F(y_i) + \ln\left(1 - F(y_{n+1-i})\right)\right]$$
with the observations $y_i$ arranged in ascending order and F the cumulative distribution function of the hypothesized normal distribution.

Ryan-Joiner test (similar to the Shapiro-Wilk test)
This test assesses normality by calculating the correlation between the data and the normal scores of the data. If the correlation coefficient is near 1, the population is likely to be normal. The test statistic measures the strength of this correlation, and the null hypothesis of population normality is rejected if it falls below the critical value. The test statistic is calculated as follows:
$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
where the $x_{(i)}$ are the ordered sample values and the $a_i$ are constants generated from the means, variances and covariances of the order statistics of a sample of size n from a normal distribution.

Kolmogorov-Smirnov test
This is a test that is based on the empirical cumulative distribution function (ecdf). Given n ordered data points $e_{(1)}, e_{(2)}, \ldots, e_{(n)}$, the ecdf is defined as
$$E_n(e_{(i)}) = \frac{n(i)}{n}$$
where $n(i)$ is the number of points less than $e_{(i)}$ (the data having been ordered from smallest to largest). Thus $E_n(e_{(i)})$ is a step function that increases by $1/n$ at the value of each ordered data point. The test statistic is defined as
$$D = \max_{1 \le i \le n}\left\{ F(e_{(i)}) - \frac{i-1}{n},\; \frac{i}{n} - F(e_{(i)}) \right\}$$
where F is the theoretical cumulative normal distribution. The null hypothesis of normality is rejected if the test statistic is greater than the critical value obtained from a table.
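A sketch of how these tests can be run in Python (scipy's Shapiro-Wilk test is used here as a stand-in for the Ryan-Joiner test, and the Lilliefors variant of Kolmogorov-Smirnov from statsmodels is used because the normal parameters are estimated from the data; resid is assumed to hold the residuals):

```python
# Sketch: normality tests on the residuals.
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

print(stats.anderson(resid, dist="norm"))   # Anderson-Darling: A^2 and critical values
print(stats.shapiro(resid))                 # Shapiro-Wilk W and p-value
print(lilliefors(resid, dist="norm"))       # KS-type statistic and p-value
```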

3.3.2 PLOT OF RESIDUALS AGAINST THE FITTED VALUES


A plot of the residuals $e_i$ (or the scaled residuals $d_i$, $r_i$ or $t_i$) versus the corresponding fitted values $\hat{y}_i$ is useful for detecting several common types of model inadequacies. In particular, the plot is useful in validating the assumption of homoscedasticity of the errors. The homoscedasticity assumption is violated when the errors have unequal variance at different values of the independent variables, i.e., the data exhibit heteroscedasticity. If this plot resembles Figure (a) below, that is, the residuals can be contained in a horizontal band, then there are no obvious model defects. Plots of $e_i$ versus $\hat{y}_i$ that resemble any of the patterns in panels (b-d) are evidence of model deficiencies.

The patterns in panels (b) and (c) indicate that the variance of the errors is not constant. The outward-opening funnel pattern in panel (b) implies that the variance is an increasing function of y [an inward-opening funnel is also possible, indicating that $\mathrm{Var}(\varepsilon)$ increases as y decreases]. This scenario occurs when the errors enter the model multiplicatively rather than additively, as is usually assumed: $y = E[y]\,\varepsilon$ rather than $y = E[y] + \varepsilon$.
The double-bow pattern in panel (c) often occurs when y is a proportion between zero and one, as in binomial experiments; the variance of a binomial proportion near 0.5 is greater than that near zero or one. A curved plot such as in panel (d) indicates non-linearity. This could mean that other independent variables are needed in the model; for example, a squared term may be necessary. In practice, transformations on the response are generally employed to stabilize the variance; transformations on the independent and/or the dependent variables, or the method of weighted least squares, may also be helpful in these cases.

A plot of the residuals against yˆ i may also reveal one or more unusually large
residuals. These points are, of course, potential outliers. Large residuals that occur
at the extreme yˆ i values could also indicate that either the variance is not constant
or the true relationship between y and x is not linear. These possibilities should be
investigated before the points are considered outliers.
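A sketch of the basic plot (fit is assumed to be a fitted statsmodels OLS results object; the R-student values could be plotted instead of the raw residuals):

```python
# Sketch: residuals versus fitted values.
import matplotlib.pyplot as plt

plt.scatter(fit.fittedvalues, fit.resid)   # fit: statsmodels OLS results (assumed)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted value")
plt.ylabel("Residual")
plt.show()
```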

Example: The Delivery Time Data
Figure below presents plots of the original residuals and the R-student values versus
the fitted values of delivery time. These plots do not exhibit any strong unusual pattern, although the large residual $e_9$ (or $t_9$) shows up clearly. There does seem to be a slight tendency for the model to under-predict short delivery times and over-predict long delivery times.

3.3.3 PLOT OF RESIDUALS AGAINST INDEPENDENT VARIABLE


Plotting the residuals against the corresponding values of each independent variable
can also be helpful. These plots often exhibit patterns such as those against the fitted
value, except that the horizontal scale is xij for the jth independent variable rather
than yˆ i . Once again an impression of a horizontal band containing the residuals is
desirable. The funnel and double bow patterns in panels (b) and (c) indicate non-
constant variance.
The curved band in panel (d), or a non-linear pattern in general, implies that the assumed relationship between y and the independent variable $x_j$ is not correct. Thus, either higher-order terms in $x_j$ (such as $x_j^2$) or a transformation should be considered. For the case of simple linear regression, it is not necessary to plot residuals versus both $\hat{y}_i$ and the independent variable, since $\hat{y}_i$ is just a linear function of the independent variable and as such the two plots would differ only in the scale of the horizontal axis.
It is also helpful to plot the residuals against explanatory variables that are not currently in the model but which could potentially be included. Any structure in the plot of residuals versus an omitted variable indicates that incorporation of that variable could improve the model.

Example: The Delivery Time Data
Figure below presents the plots of R-student values from the delivery time problem
versus both independent variables. Panel a plots R-student values versus cases and
panel b plots R-student values versus distance. Neither of these plots reveals any
clear indication of a problem with either misspecification of the regressor (implying
the need for either a transformation on the regressor or higher order terms in cases
and/or distance) or inequality of variance, although the moderately large residual
associated with point 9 is apparent on both plots.

3.3.5 PLOT OF RESIDUALS IN TIME SEQUENCE


If the time sequence in which the data were collected is known, it is a good idea to
plot the residuals against time order. Ideally, this plot will resemble Figure (a) in
Section 3.3.4; that is, a horizontal band will enclose all of the residuals, and the
residuals will fluctuate in a more or less random fashion within this band. However,
if this plot resembles the patterns in Figure (b-d) in Section 3.3.4, this may indicate
that the variance is changing with time, or that linear or quadratic terms in time
should be added to the model.
The time sequence plot of residuals may indicate that the errors at one time period
are correlated with those at other time periods. The correlation between model errors
at different time periods is called autocorrelation. A plot such as Figure (a)
indicates positive autocorrelation, while Figure (b) is typical of negative
autocorrelation. The presence of autocorrelation is a potentially serious violation of
the basic regression assumptions.

3.4 DETECTING LACK OF FIT
Consider the general linear model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k + \varepsilon$.
Assume that this model is correctly specified (i.e., that the terms in the model accurately represent the true relationship between y and the independent variables). Recall that the assumption made on the random error $\varepsilon$ was that $E(\varepsilon) = 0$ for any given set of values of $x_1, x_2, \ldots, x_k$, implying $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k$.

Suppose an analyst hypothesizes a misspecified model with mean $E_m(y)$ such that $E(y) \ne E_m(y)$. The hypothesized equation for the misspecified model is $y = E_m(y) + \varepsilon$, giving $\varepsilon = y - E_m(y)$. It is easy to see that for the misspecified model, $E(\varepsilon) = E(y) - E_m(y) \ne 0$. Thus, for a misspecified model, the assumption $E(\varepsilon) = 0$ is violated. A misspecified model may still yield a significant F-value in the ANOVA table, and without further investigation an unknowledgeable analyst may not detect the lack of fit in the model.
Based on the residual plots discussed in Section 3.3, the presence of trend, changes in the variability, or more than 5% of the standardized residuals exceeding 2 in absolute value would indicate a problem of model fit.

3.4.1 PARTIAL RESIDUAL PLOTS


A plot of the residuals versus an independent variable is useful in determining whether a curvature effect for that variable is needed in the model. However, this plot may not completely show the correct or complete marginal effect of an independent variable, given the other variables in the model. Define the partial residuals for the jth independent variable, $x_j$, as
$$e^{p} = y - \left(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \ldots + \hat{\beta}_{j-1} x_{j-1} + \hat{\beta}_{j+1} x_{j+1} + \ldots + \hat{\beta}_k x_k\right) = e + \hat{\beta}_j x_j$$
where e is the ordinary residual. That is, partial residuals measure the influence of $x_j$ on the dependent variable y after the effects of the other independent variables, $x_1, x_2, \ldots, x_{j-1}, x_{j+1}, \ldots, x_k$, have been removed or accounted for.

Suppose we are considering a first-order multiple regression model with three independent variables, $x_1, x_2, x_3$, that is, $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$. To check whether the relationship between y and $x_3$ is correctly specified, or whether $x_3$ should enter the regression model at all (i.e., to determine the nature of the marginal effect of $x_3$), a partial regression is conducted as follows:
1) regress y on $x_1$ and $x_2$ and obtain the residuals $e_{y|x_1x_2}$
2) regress $x_3$ on $x_1$ and $x_2$ and obtain the residuals $e_{x_3|x_1x_2}$
3) plot the residuals $e_{y|x_1x_2}$ against $e_{x_3|x_1x_2}$

If the variable $x_3$ enters the model linearly, the plot of $e_{y|x_1x_2}$ against $e_{x_3|x_1x_2}$ should show a linear relationship, that is, the partial residuals fall along a straight line with non-zero slope. The slope of this line will be the regression coefficient of $x_3$ in the multiple linear regression model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$. Thus, if the partial residual plot shows a horizontal band, this indicates that there is no additional useful information in $x_3$ for describing y. Meanwhile, if the partial residual plot shows a curvilinear band, then higher-order terms in $x_3$ or a transformation on $x_3$ should be considered; a sketch of the three-step procedure follows.
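A sketch of the three-step partial (added-variable) procedure with statsmodels; the arrays y, x1, x2 and x3 are assumed to be available:

```python
# Sketch: added-variable (partial regression) plot for x3 given x1 and x2.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

X12 = sm.add_constant(np.column_stack([x1, x2]))
e_y = sm.OLS(y, X12).fit().resid      # step 1: residuals of y given x1, x2
e_x3 = sm.OLS(x3, X12).fit().resid    # step 2: residuals of x3 given x1, x2
plt.scatter(e_x3, e_y)                # step 3: slope of a fitted line equals beta_3
plt.xlabel("residuals of x3 | x1, x2")
plt.ylabel("residuals of y | x1, x2")
plt.show()
```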

3.4.2 LACK OF FIT TEST

The lack of fit test requires that we have replicate observations on the response y for
at least one level of x. Furthermore, the procedure assumes that the normality,
independence and constant variance requirements are met and that only the first-
order or straight-line character of the relationship is in doubt. Suppose that we have $n_i$ observations on the response at the ith level of the independent variable $x_i$, $i = 1, 2, \ldots, m$. Let $y_{ij}$ denote the jth observation on the response at $x_i$, $j = 1, 2, \ldots, n_i$, so that there are $n = \sum_{i=1}^{m} n_i$ total observations.

The test procedure involves partitioning the residual sum of squares into two components:
$$SS_E = SS_{PE} + SS_{LOF}$$
where $SS_{PE}$ is the sum of squares due to pure error and $SS_{LOF}$ is the sum of squares due to lack of fit. Note that the (ij)th residual is now written as

$$y_{ij} - \hat{y}_i = (y_{ij} - \bar{y}_i) + (\bar{y}_i - \hat{y}_i)$$
where $\bar{y}_i$ is the average of the $n_i$ observations at $x_i$. Squaring both sides and summing over i and j produces
$$\sum_{i=1}^{m}\sum_{j=1}^{n_i}(y_{ij} - \hat{y}_i)^2 = \sum_{i=1}^{m}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2 + \sum_{i=1}^{m}\sum_{j=1}^{n_i}(\bar{y}_i - \hat{y}_i)^2$$
since the cross-product term can be shown to be equal to zero.


m ni

 ( yij − yi )
2
It can be seen that the SSPE : is obtained by computing the corrected
i =1 j =1

sum of squares of the repeat observations at each level of xi and then pooling over
the m levels of x. If the assumption of constant variance is satisfied, this is a model-
independent measure of pure error since only the variability of the dependent
variable y at each x level is used to compute SSPE . Since there are ( ni − 1) degrees
of freedom for pure error at each level xi , the total number of degrees of freedom
m
associated with SSPE is  ( ni − 1) = n − m .
i =1

Meanwhile, the lack-of-fit sum of squares, $SS_{LOF} = \sum_{i=1}^{m}\sum_{j=1}^{n_i}(\bar{y}_i - \hat{y}_i)^2 = \sum_{i=1}^{m} n_i(\bar{y}_i - \hat{y}_i)^2$, is a weighted sum of squared deviations between the mean response $\bar{y}_i$ at each level of x and the corresponding fitted value. If the fitted values $\hat{y}_i$ are close to the corresponding average responses $\bar{y}_i$, then there is a strong indication that the regression function is linear. There are $(m - 2)$ degrees of freedom associated with $SS_{LOF}$, since there are m levels of x and two degrees of freedom are lost because two parameters must be estimated to obtain the $\hat{y}_i$. Computationally, $SS_{LOF}$ is obtained by subtracting $SS_{PE}$ from $SS_E$. The test statistic for lack of fit is
$$F_{LOF} = \frac{SS_{LOF}/(m - 2)}{SS_{PE}/(n - m)} = \frac{MS_{LOF}}{MS_{PE}}$$
If the true regression function is linear, then the statistic $F_{LOF}$ follows the $F_{m-2,\,n-m}$ distribution. The hypotheses to be tested are:
H0 : The linear relationship assumed in the model is reasonable
H1 : The linear relationship assumed in the model is not reasonable
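A sketch of the whole computation for simple linear regression, assuming x and y are 1-D NumPy arrays with repeat x levels:

```python
# Sketch: pure-error / lack-of-fit partition of SSE for a straight-line fit.
import numpy as np
from scipy import stats

def lack_of_fit_test(x, y):
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares fit
    sse = np.sum((y - X @ beta) ** 2)
    levels = np.unique(x)
    ss_pe = sum(np.sum((y[x == lv] - y[x == lv].mean()) ** 2) for lv in levels)
    m, n = len(levels), len(x)
    ss_lof = sse - ss_pe
    f = (ss_lof / (m - 2)) / (ss_pe / (n - m))
    return f, 1 - stats.f.cdf(f, m - 2, n - m)     # F_lof and its p-value
```

Here np.unique groups the repeat points, so the same function works for any number of x levels.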

Example
x 1.0 1.0 2.0 3.3 3.3 4.0 4.0 4.0 4.7
y 10.84 9.30 16.35 22.88 24.35 24.56 25.86 29.16 24.59
x 5.0 5.6 5.6 5.6 6.0 6.0 6.5 6.9
y 22.25 25.90 27.20 25.61 25.45 26.56 21.03 21.46

Check that the fitted straight-line regression model is $\hat{y} = 13.301 + 2.108x$ with $SS_T = 487.6126$, $SS_R = 234.7087$ and $SS_E = 252.9039$. Note that there are m = 10 distinct levels of x, with repeat points at x = 1.0, 3.3, 4.0, 5.6 and 6.0.

[Figure: Scatterplot of y versus x for the example data]

From the plot above, there is some indication that the straight-line regression would not be satisfactory, and it would be helpful to conduct a test to determine whether systematic curvature is present.
The $SS_{PE}$ is computed using the repeat points as follows:

Level of x    y_ij    ȳ_i    Σ_j (y_ij − ȳ_i)²    d.o.f
1.0 10.84, 9.30 10.070 1.1858 1
3.3 22.88, 24.35 23.615 1.0805 1
4.0 24.56, 25.86, 29.16 26.527 11.2467 2
5.6 25.90, 27.20, 25.61 26.237 1.4341 2
6.0 25.45, 26.56 26.005 0.6161 1
Total 15.5632 7

The ANOVA table with SSE partitioned into SSPE and SSLOF is given as:
Source of Variation    Sum of Squares    Degrees of Freedom    Mean Square    F / F_lof    p-value
Regression 234.7087 1 234.7087 13.92 < 0.0013
Residual 252.9039 15 16.8603
Lack of fit 237.3407 8 29.6676 13.34 0.0013
Pure error 15.5632 7 2.2233
Total 487.6126 16

From the table above, the lack-of-fit test statistic is quite large with a very small p-value; thus the tentatively fitted linear regression is not reasonable for describing the relationship between x and y (there is lack of fit).

The ANOVA table from Minitab16 is given below for comparison
Source of Variation    Sum of Squares    Degrees of Freedom    Mean Square    F / F_lof    p-value
Regression 237.48 1 237.48 14.24 0.002
Residual 250.13 15 16.68
Lack of fit 234.57 8 29.32 13.19 0.001
Pure error 15.56 7 2.22
Total 487.61 16

3.5 DETECTION AND TREATMENT OF OUTLIERS


• An outlier is an extreme observation.
• Residuals that are considerably larger in absolute value than the others, say 3 or 4 standard deviations from the mean, indicate potential outliers in y-space. This idea is derived from the 3-sigma or 4-sigma limits.
• Depending on their location, outliers can have moderate to severe effects on the
regression model.
• Outliers may indicate a model failure for these points.
• Residual plots against $\hat{y}_i$ and normal probability plots help in identifying outliers. Examination of scaled residuals, e.g., studentized and R-student residuals, is more helpful since they have mean zero and approximately unit variance.
• Outliers can also occur in the explanatory variables, in x-space, and these can likewise affect the regression results.
• Sometimes outliers are “bad” values occurring as a result of unusual but explainable events, for example faulty measurements, incorrect recording of data, or failure of the measuring instrument.
• Bad values need to be discarded, but there should be strong non-statistical evidence that the outlier is a bad value before it is discarded. Discarding bad values is desirable because least squares pulls the fitted equation toward the outlier.
• Sometimes the outlier is an unusual but perfectly plausible observation. If such observations are deleted, they may give a false impression of improvement in the fit of the equation.
• Sometimes the outlier is more important than the rest of the data because it may
control many key model properties.
• The effect of outliers on the regression model may be checked by dropping these
points and refitting the regression equation.
• The values of the t-statistics, F-statistic, $R^2$ and residual mean square may be sensitive to outliers.
