
Lecture 15: Diagnostics and Inference for Multiple Linear Regression

1 Review
In the multiple linear regression model, we assume that the response Y is a linear function of all
the predictors, plus a constant, plus noise:

Y = β0 + β1 X1 + β2 X2 + · · · + βp Xp + ε.    (1)

The number of coefficients is q = p + 1.


We make no assumptions about the (marginal or joint) distributions of the Xi, but we assume
that E[ε|X] = 0, Var[ε|X] = σ², and that ε is uncorrelated across measurements. In matrix form,
the model is

Y = Xβ + ε    (2)

where X is an n × q matrix that includes an initial column of all 1’s. Remember that q = p + 1.
When we add the Gaussian noise assumption, we are making all of the assumptions above, and
further assuming that

ε ∼ MVN(0, σ²I)    (3)

independently of X.
The least squares estimate of the coefficients is

β̂ = (XᵀX)⁻¹ Xᵀ Y    (4)

Under the Gaussian noise assumption, this is also the maximum likelihood estimate.
The fitted values (i.e., estimates of the conditional means at the data points used to estimate the
model) are given by the “hat” or “influence” matrix H = X(XᵀX)⁻¹Xᵀ:

Ŷ ≡ m̂ = Xβ̂ = HY    (5)

which is symmetric and idempotent. The residuals are given by

e = (I − H)Y (6)

and I − H is also symmetric and idempotent.


The in-sample mean squared error, which is the maximum likelihood estimate of σ², has a small
negative bias:

E[σ̂²] = E[(1/n) eᵀe] = σ² (n − q)/n = σ² (1 − q/n).    (7)
Since HXβ = Xβ, the residuals can also be written

e = (I − H)ε    (8)

hence

E[e] = 0 and Var[e] = σ²(I − H).    (9)

Under the Gaussian noise assumption, β̂, m̂, and e all have Gaussian distributions.
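To make the matrix formulas concrete, here is a minimal R sketch (the data and all variable names are made up for illustration; they are not from the notes) that computes β̂, the hat matrix, the fitted values, and the residuals by hand, and checks them against lm:

# Minimal R sketch of Eqs. (4)-(6) on simulated data; all names are illustrative.
set.seed(1)
n <- 100; p <- 3
X.raw <- matrix(rnorm(n * p), nrow = n)              # the p predictors
beta <- c(5, 1, -2, 0.5)                             # true coefficients, intercept first
Y <- drop(cbind(1, X.raw) %*% beta) + rnorm(n, sd = 2)

X <- cbind(1, X.raw)                                 # n x q design matrix, q = p + 1
beta.hat <- solve(t(X) %*% X, t(X) %*% Y)            # Eq. (4)
H <- X %*% solve(t(X) %*% X) %*% t(X)                # hat matrix
fitted.by.hand <- H %*% Y                            # Eq. (5)
resid.by.hand  <- (diag(n) - H) %*% Y                # Eq. (6)

fit <- lm(Y ~ X.raw)                                 # lm does the same algebra (via QR)
all.equal(as.numeric(beta.hat), as.numeric(coef(fit)))
all.equal(as.numeric(resid.by.hand), as.numeric(residuals(fit)))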

1.1 Point Predictions
Suppose that X0 is the m × q dimensional matrix storing the values of the predictor variables at
m points where we want to make predictions. (These may or may not include points we used to
estimate the model, and m may be bigger, smaller or equal to n.) Similarly, let Y0 be the m × 1
matrix of random values of Y at those points. The point predictions we want to make are

E[Y0 | X0] = m(X0) = X0 β    (10)

and we estimate this by

m̂(X0) = X0 β̂    (11)

which is to say

m̂(X0) = X0 (XᵀX)⁻¹ Xᵀ Y    (12)

(It’s easy to verify that when X0 = X, this reduces to HY.)
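As a quick illustration (again with made-up data and variable names), point predictions at new values X0 can be obtained either from predict() or directly as X0 β̂:

# Self-contained sketch of Eqs. (10)-(12); names and data are illustrative.
set.seed(2)
n <- 100; m <- 5
x1 <- rnorm(n); x2 <- rnorm(n)
Y  <- 3 + 2 * x1 - x2 + rnorm(n)
fit <- lm(Y ~ x1 + x2)

new.points <- data.frame(x1 = rnorm(m), x2 = rnorm(m))   # m new points
pred.lm    <- predict(fit, newdata = new.points)          # what predict() returns
X0         <- cbind(1, new.points$x1, new.points$x2)      # m x q matrix, column of 1's first
pred.hand  <- X0 %*% coef(fit)                            # Eq. (11): X0 beta.hat
all.equal(as.numeric(pred.lm), as.numeric(pred.hand))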

2 Diagnostics for Multiple Linear Regression


Before proceeding to detailed statistical inference, we need to check our modeling assumptions,
which means we need diagnostics.

2.1 Plots
All of the plots we learned how to do for simple linear regression remain valuable:

1. Plot the residuals against the predictors. This now means p distinct plots, of course. Each
of them should show a flat scatter of points around 0 (because E[ε|Xi] = 0), of roughly
constant width (because Var[ε|Xi] = σ²). Curvature or steps in these plots are a sign of potential
nonlinearity, or of an omitted variable. Changing width is a potential sign of non-constant
variance.

2. Plot the squared residuals against the predictors. Each of these p plots should show a flat
scatter of points around σ̂².

3. Plot the residuals against the fitted values. This is an extra plot, redundant when we only
have one predictor (because the fitted values were linear in the predictor).

4. Plot the squared residuals against the fitted values.

5. Plot the residuals against coordinates. If observations are dated, time-stamped, or spatially
located, plot the residuals as functions of time, or make a map. If there is a meaningful order
to the observations, plot residuals from successive observations against each other. Because
the εi are uncorrelated, all of these plots should show a lack of structure.

6. Plot the residuals’ distribution against a Gaussian. (qq-plot)

Out-of-sample predictions, with either random or deliberately selected testing sets, also remain
valuable.
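A compact way to produce most of these plots in R, assuming a fitted lm object called fit with two predictors (the data and names below are illustrative):

# Residual diagnostic plots for a fitted model; data and names are illustrative.
set.seed(3)
n <- 200
x1 <- rnorm(n); x2 <- runif(n)
Y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)
fit <- lm(Y ~ x1 + x2)

par(mfrow = c(2, 3))
plot(x1, residuals(fit)); abline(h = 0, lty = 2)           # 1. residuals vs. a predictor
plot(x2, residuals(fit)); abline(h = 0, lty = 2)
plot(x1, residuals(fit)^2)                                 # 2. squared residuals vs. a predictor
plot(fitted(fit), residuals(fit)); abline(h = 0, lty = 2)  # 3. residuals vs. fitted values
plot(fitted(fit), residuals(fit)^2)                        # 4. squared residuals vs. fitted values
qqnorm(residuals(fit)); qqline(residuals(fit))             # 6. residuals vs. a Gaussian (q-q plot)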

2.1.1 Collinearity
A linear dependence between two (or more) columns of the X matrix is called collinearity, and it
keeps us from finding a unique solution by least squares. Computationally, collinearity will show up
as the determinant of XᵀX being zero; equivalently, the smallest eigenvalue of XᵀX will be zero.
If lm is given a collinear set of predictor variables, it will sometimes give an error message, but
more often it will decide not to estimate one of the collinear variables, and return an NA for
the offending coefficient. We will return to the subject of collinearity in a future lecture.
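For instance (with made-up variables), an exactly collinear column is dropped by lm and its coefficient comes back as NA:

# How lm() reacts to an exactly collinear predictor; names are illustrative.
set.seed(4)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
x3 <- x1 + x2                        # exact linear dependence among the columns of X
Y  <- 1 + x1 - x2 + rnorm(n)
fit <- lm(Y ~ x1 + x2 + x3)
coef(fit)                            # the coefficient on x3 is NA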

2.1.2 Interactions
Another possible complication for multiple regression which we didn’t have with the simple regres-
sion model is that of interactions between variables. One of our assumptions is that each variable
makes a distinct, additive contribution to the response, and the size of this contribution is com-
pletely insensitive to the contributions of other variables. If this is not true — if the relationship
between Y and Xi changes depending on the value of another predictor, Xj — then there is an
interaction between them. There are several ways of looking for interactions. We will return to
this subject in a future lecture.
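One simple way to look for an interaction, sketched in R with illustrative, simulated variables, is to compare an additive fit to one that includes a product term:

# Comparing an additive model to one with an x1:x2 interaction; data are simulated.
set.seed(5)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
Y  <- 1 + x1 + 2 * x2 + 1.5 * x1 * x2 + rnorm(n)   # the truth has an interaction

additive <- lm(Y ~ x1 + x2)
with.int <- lm(Y ~ x1 * x2)          # shorthand for x1 + x2 + x1:x2
coef(summary(with.int))              # look at the x1:x2 row
anova(additive, with.int)            # F test comparing the two fits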

2.2 Remedies
All of the remedies for model problems we discussed earlier, for the simple linear model, are still
available to us.

Transform the response. We can change the response variable from Y to g(Y ), in the hope
that the assumptions of the linear-Gaussian model are more nearly satisfied for this new variable.
That is, we hope that

g(Y) = β0 + β1 X1 + · · · + βp Xp + ε,   ε ∼ N(0, σ²).    (13)

Transform the predictors. We can also transform each of the predictors, making the model

Y = β0 + β1 f1(X1) + · · · + βp fp(Xp) + ε,   ε ∼ N(0, σ²)    (14)

As the notation suggests, each Xi could be subject to a different transformation. Again, it’s just a
matter of what we put in the columns of the X matrix before solving for β̂. (In 402 you will see
how the functions can be estimated automatically.)
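In R, transformations of the response or of the predictors are just different formulas passed to lm; here is a small sketch with made-up data and an illustrative choice of g and f1:

# Transforming the response (g = log) and one predictor (f1 = log); illustrative only.
set.seed(6)
n  <- 100
x1 <- runif(n, 1, 10); x2 <- runif(n, 1, 10)
Y  <- exp(0.5 + 0.3 * log(x1) + 0.1 * x2 + rnorm(n, sd = 0.2))

fit <- lm(log(Y) ~ log(x1) + x2)     # Eqs. (13)/(14) with g = log, f1 = log, f2 = identity
summary(fit)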

Changing the variables used. One option which is available to us with multiple regression is
to add in new variables, or to remove ones we’re already using. This should be done carefully, with
an eye towards satisfying the model assumptions, rather than blindly increasing some score. We
will discuss this extensively later.

Removing Outliers. As always, we can remove outliers, as long as we document the fact that
we are doing so.

3 Inference for Multiple Linear Regression
The results in this section presume that all of the modeling assumptions are correct. Also, all
distributions stated are conditional on X.

3.1 Sampling Distributions


As in the simple linear model, the sampling distributions are the basis of all inference.
In the simple linear model, because the noise ε is Gaussian, and the coefficient estimators were
linear in the noise, β̂0 and β̂1 were also Gaussian. This remains true for the Gaussian multiple linear
regression model:

β̂ = (XᵀX)⁻¹ Xᵀ Y    (15)
  = (XᵀX)⁻¹ Xᵀ (Xβ + ε)    (16)
  = β + (XᵀX)⁻¹ Xᵀ ε    (17)

Since (XᵀX)⁻¹ Xᵀ ε is a constant matrix times a Gaussian vector, it is also Gaussian; adding on the
constant vector β still leaves us with a Gaussian. We saw the expectation and variance last time, so

β̂ ∼ MVN(β, σ² (XᵀX)⁻¹)    (18)

It follows that

β̂j ∼ N(βj, σ² [(XᵀX)⁻¹]jj).    (19)

The same logic applies to the estimates of conditional means. In §1.1, we saw that the estimated
conditional means at new observations X0 are given by

m̂(X0) = X0 (XᵀX)⁻¹ Xᵀ Y,    (20)

so it follows that

m̂(X0) ∼ MVN(X0 β, σ² X0 (XᵀX)⁻¹ X0ᵀ).    (21)

Eq. 21 simplifies for the special case of the fitted values, i.e., the estimated conditional means at
the original data points:

Ŷ ∼ MVN(Xβ, σ² H).    (22)

Similarly, the residuals have a Gaussian distribution:

e ∼ MVN(0, σ² (I − H)).    (23)

The in-sample mean squared error, or training error, or estimate of σ², is

σ̂² = n⁻¹ eᵀe

and

n σ̂² / σ² ∼ χ²_{n−q}    (24)

where again q = p + 1. We will not prove this here.
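These sampling distributions are easy to check by simulation. Here is a small sketch (not from the notes; all settings are illustrative) that holds the design fixed and replicates the noise:

# Simulation check of Eqs. (19) and (24); all settings here are illustrative.
set.seed(7)
n <- 40; q <- 3; sigma <- 2
x1 <- rnorm(n); x2 <- rnorm(n)                 # fixed design across replications
reps <- 5000
beta1.hat <- numeric(reps); scaled.mse <- numeric(reps)
for (r in 1:reps) {
  Y   <- 1 + 0.5 * x1 - 1 * x2 + rnorm(n, sd = sigma)
  fit <- lm(Y ~ x1 + x2)
  beta1.hat[r]  <- coef(fit)["x1"]
  scaled.mse[r] <- n * mean(residuals(fit)^2) / sigma^2   # n * sigma.hat^2 / sigma^2
}
mean(beta1.hat)      # close to the true slope 0.5
mean(scaled.mse)     # close to n - q = 37, the mean of a chi-squared with n - q df
qqnorm(beta1.hat)    # roughly a straight line: Gaussian sampling distribution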

Constraints on the residuals. The residuals are not all independent of each other. In the case
of the simple linear model, the fact that we estimated the model by least squares left us with two
constraints, Σi ei = 0 and Σi ei Xi = 0. If we had only one constraint, that would let us fill in the
last residual if we knew the other n − 1 residuals. Having two constraints meant that knowing any
n − 2 residuals determined the remaining two.
We got those constraints from the normal or estimating equations, which in turn came from
setting the derivative of the mean squared error (or of the log-likelihood) to zero. In the multiple
regression model, when we set the derivative to zero, we get the matrix equation
Xᵀ(Y − Xβ̂) = 0    (25)

But the term in parentheses is just e, so the equation is

Xᵀ e = 0    (26)

Expanding out the matrix multiplication,

Σi ei     = 0
Σi Xi1 ei = 0
    ⋮
Σi Xip ei = 0    (27)
Thus the residuals are subject to p + 1 linear constraints, and knowing any n − (p + 1) of them will
fix the rest. The vector of residuals e is a point in an n-dimensional space. As a random vector,
without any constraints it could lie anywhere in that space, as, for instance, ε can. The constraints,
however, force it to live in a lower-dimensional subspace, specifically a space of dimension n − (p + 1).
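Eq. 26 is easy to verify numerically in R (the data below are illustrative): the residual vector is orthogonal to every column of the design matrix, intercept column included.

# Check that X^T e = 0 up to rounding error; data and names are illustrative.
set.seed(8)
n  <- 60
x1 <- rnorm(n); x2 <- rnorm(n)
Y  <- 2 + x1 - 0.5 * x2 + rnorm(n)
fit <- lm(Y ~ x1 + x2)

X <- model.matrix(fit)          # n x (p + 1) design matrix, first column all 1's
t(X) %*% residuals(fit)         # p + 1 entries, all zero up to numerical error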

Bias of σ̂². Let’s compute the bias of σ̂². Before we do so, remember from Lecture 13 that if
Q = ZᵀCZ is a quadratic form, then E[Q] = µᵀCµ + tr(CΣ), where µ = E[Z] and Σ = Var(Z). Also
remember that Hᵀ = H and H² = H. So,

E[σ̂²] = E[(1/n) eᵀe]    (28)
      = (1/n) E[((I − H)ε)ᵀ ((I − H)ε)]    (29)
      = (1/n) E[εᵀ (I − Hᵀ)(I − H) ε]    (30)
      = (1/n) E[εᵀ (I − H − Hᵀ + HᵀH) ε]    (31)
      = (1/n) E[εᵀ (I − H) ε]    (32)
      = (1/n) tr((I − H) Var[ε])    (33)
      = (1/n) tr((I − H) σ² I)    (34)
      = (σ²/n) tr(I − H)    (35)
      = (σ²/n) (n − q)    (36)

since Var[ε] = σ² I, tr I = n, and tr H = p + 1 (homework).
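The trace fact used in the last step is also easy to check numerically: in R, hatvalues() returns the diagonal of H for a fitted model (the data below are arbitrary and only for illustration).

# Numerical check that tr(H) = p + 1; the data here are arbitrary.
set.seed(9)
n <- 30
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
fit <- lm(rnorm(n) ~ x1 + x2 + x3)
sum(hatvalues(fit))             # the leverages (diagonal of H) sum to p + 1 = 4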

3.2 t Distributions for Coefficient and Conditional Mean Estimators
From Eq. 19, it follows that

(β̂j − βj) / sej ∼ N(0, 1)    (37)

where

sej = √(σ² [(XᵀX)⁻¹]jj).    (38)

The estimated standard error is

sê[β̂j] = √(σ̂² [(XᵀX)⁻¹]jj).    (39)

We then have that

(β̂j − βj) / sê[β̂j] ∼ t_{n−q}.    (40)

The same applies to the estimated conditional means, and to the distribution of a new Y0 around
the estimated conditional mean (in a prediction interval). Thus, all the theory we did for parametric
and predictive inference in the simple model carries over, just with a different number of degrees
of freedom.
As with the simple model, t_{n−q} → N(0, 1), so t statistics approach z statistics as the sample
size grows. R uses the t distribution, but you can use the Normal approximation if you like.
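In R, confint() builds exactly these t-based intervals; here is a sketch with illustrative data, including the interval for one slope assembled by hand from Eq. 40:

# t-based confidence intervals for the coefficients; data and names are illustrative.
set.seed(10)
n <- 80
x1 <- rnorm(n); x2 <- rnorm(n)
Y  <- 1 + 0.5 * x1 - 2 * x2 + rnorm(n)
fit <- lm(Y ~ x1 + x2)

confint(fit, level = 0.95)                       # uses t quantiles with n - q df

est <- coef(fit)["x1"]                           # the same interval for x1, by hand
se  <- coef(summary(fit))["x1", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = n - 3) * se      # q = p + 1 = 3 here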

3.3 Hypothesis Testing


The summary function in R lists a p-value for testing H0 : βj = 0 for every coefficient j. As
usual, we should be skeptical about whether this is useful. As we said earlier, it is probably more
important to focus on confidence intervals and prediction.
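For reference, this is where those tests live in R’s output; a minimal sketch with simulated, illustrative data:

# Pulling the coefficient table and its p-values out of summary(); data are illustrative.
set.seed(11)
n <- 80
x1 <- rnorm(n); x2 <- rnorm(n)
Y  <- 1 + 0.5 * x1 + rnorm(n)          # x2 has no effect in the truth
fit <- lm(Y ~ x1 + x2)

coef(summary(fit))                     # Estimate, Std. Error, t value, Pr(>|t|)
coef(summary(fit))[, "Pr(>|t|)"]       # the p-values for H0: beta_j = 0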
Common Mistakes. Looking at these hypothesis tests often leads to some common mistakes. Here are a few of them:

• Saying “βi wasn’t significantly different from zero, so Xi doesn’t matter for Y ”. After all, Xi
could still be an important cause of Y , but we don’t have enough data, or enough variance in
Xi , or enough variance in Xi uncorrelated with other X’s, to accurately estimate its slope.
All of these would prevent us from saying that βi was significantly different from 0, i.e.,
distinguishable from 0 with high reliability.

• Saying “βi was significantly different from zero, so Xi really matters to Y ”. After all, any
βi which is not exactly zero can be made arbitrarily significant by increasing n and/or the
sample variance of Xi . That is, its t statistic will go to ±∞, and the p-value will become as
small as you have the patience to make it.

• Deleting all the variables whose coefficients didn’t have stars by them, and re-running the
regression. After all, since it makes no sense to pretend that the statistically significant
variables are the only ones which matter, limiting the regression to the statistically significant
variables is even less sensible.

• Saying “all my coefficients are really significant, so the linear-Gaussian model must be right”.
After all, all the hypothesis tests on individual coefficients presume the linear Gaussian model,
both in the null and in the alternative. The tests have no power to notice nonlinearities, non-
constant noise variance, or non-Gaussian noise.
