Lecture 22: Review For Exam 2

1 Basic Model Assumptions (Without Gaussian Noise)
We say “equations”, plural, because this is equivalent to the set of p + 1 equations
\frac{1}{n} \sum_{i=1}^{n} e_i = 0    (10)

\frac{1}{n} \sum_{i=1}^{n} e_i X_{ij} = 0    (11)
(Many people omit the factor of 1/n.) This tells us that while e is an n-dimensional vector, it is
subject to p + 1 linear constraints, so it is confined to a linear subspace of dimension n − p − 1.
Thus n − p − 1 is the number of residual degrees of freedom.
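As a quick numerical check of these constraints, here is a minimal sketch in R, using a small simulated data set with made-up variables x1, x2, and y (later sketches reuse this toy setup):

  # Toy data, purely for illustration
  set.seed(1)
  n <- 100
  x1 <- rnorm(n); x2 <- rnorm(n)
  y <- 3 + 2*x1 - x2 + rnorm(n)      # so here p = 2 and sigma^2 = 1
  fit <- lm(y ~ x1 + x2)
  e <- residuals(fit)
  mean(e)                            # ~ 0, as in Eq. (10)
  mean(e * x1); mean(e * x2)         # ~ 0, as in Eq. (11)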
The solution to the estimating equations is

\hat{\beta} = (X^T X)^{-1} X^T Y.
This is one of the two most important equations in the whole subject. It says that the coefficients
are a linear function of the response vector Y.
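Continuing the toy simulation above, a sketch of computing \hat{\beta} directly from this formula and checking it against lm():

  X <- cbind(1, x1, x2)                       # the n x (p+1) design matrix
  beta.hat <- solve(t(X) %*% X, t(X) %*% y)   # (X^T X)^{-1} X^T Y
  cbind(beta.hat, coef(fit))                  # the two columns agree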
The least squares estimate is a constant plus noise:

\hat{\beta} = \beta + (X^T X)^{-1} X^T \epsilon.
Its variance is

\mathrm{Var}\left[\hat{\beta}\right] = \sigma^2 (X^T X)^{-1}.    (18)

Since the entries in X^T X are usually proportional to n, it can be helpful to write this as

\mathrm{Var}\left[\hat{\beta}\right] = \frac{\sigma^2}{n} \left( \frac{1}{n} X^T X \right)^{-1}.    (19)
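In the toy example, the plug-in version of this variance (with \hat{\sigma}^2 in place of \sigma^2) can be checked against R's vcov():

  sigma2.hat <- sum(residuals(fit)^2) / (n - ncol(X))   # divides by n - p - 1
  sigma2.hat * solve(t(X) %*% X)                        # matches vcov(fit)
  vcov(fit)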
The fitted values are thus linear in Y: set the responses all to zero and all the fitted values will be
zero; double all the responses and all the fitted values will double.
The n × n hat matrix H ≡ X(X^T X)^{-1} X^T, also called the influence, projection or prediction
matrix, controls the fitted values. It is a function of X alone, ignoring the response variable totally.
It is an n × n matrix with several important properties:
• It is symmetric, H^T = H.
• It is idempotent, H^2 = H.
• Its trace \mathrm{tr}\, H = \sum_i H_{ii} = p + 1, the number of degrees of freedom for the fitted values.
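These properties are easy to verify numerically in the toy example (a sketch; hatvalues() is base R's function for the diagonal of H):

  H <- X %*% solve(t(X) %*% X) %*% t(X)
  max(abs(H - t(H)))                           # symmetry: zero up to rounding
  max(abs(H %*% H - H))                        # idempotence
  sum(diag(H))                                 # trace = p + 1 = 3 here
  all.equal(diag(H), unname(hatvalues(fit)))   # lm's leverages agree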
To make a prediction at a new point, not in the data used for estimation, we take its predictor
coordinates and group them into a 1 × (p + 1) matrix X_{new} (including the 1 for the intercept).
The point prediction for Y is then X_{new} \hat{\beta}. The expected value is X_{new} \beta, and the variance is

\mathrm{Var}\left[ X_{new} \hat{\beta} \right] = X_{new} \mathrm{Var}\left[\hat{\beta}\right] X_{new}^T = \sigma^2 X_{new} (X^T X)^{-1} X_{new}^T.
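A sketch of this calculation in the toy example, at a made-up new point (x1 = 0.5, x2 = -1); predict() with se.fit = TRUE reports the corresponding standard error:

  x.new <- c(1, 0.5, -1)                                   # intercept, x1, x2
  sum(x.new * coef(fit))                                   # point prediction X_new beta-hat
  sigma2.hat * t(x.new) %*% solve(t(X) %*% X) %*% x.new    # estimated variance
  predict(fit, newdata = data.frame(x1 = 0.5, x2 = -1), se.fit = TRUE)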
The residuals are also linear in the response:
e ≡ Y − \hat{m} = (I − H) Y.    (24)
2. ∂\hat{Y}_i / ∂Y_i = H_{ii}; the leverage says how much changing the response value for point i changes the fitted value there.
3. \mathrm{Cov}\left[\hat{Y}_i, Y_i\right] = \sigma^2 H_{ii}; the leverage says how much covariance there is between the ith response and the ith fitted value.
4. \mathrm{Var}[e_i] = \sigma^2 (1 − H_{ii}); the leverage controls how big the ith residual is.
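In R, the leverages and the implied residual variances can be pulled out directly (a sketch, again substituting \hat{\sigma}^2 for \sigma^2):

  h <- hatvalues(fit)
  head(h)                         # the H_ii
  head(sigma2.hat * (1 - h))      # estimated Var[e_i]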
• X is not collinear: none of its columns is a linear combination of the other columns, which is equivalent to saying that
• the eigenvalues of X^T X are all > 0. (If there are zero eigenvalues, the corresponding eigenvectors indicate linearly-dependent combinations of predictor variables.)
Nearly-collinear predictor variables tend to lead to large variances for coefficient estimates, with
high levels of correlation among the estimates.
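A sketch of both diagnostics in the toy example; the collinear column x3 is made up for illustration:

  eigen(t(X) %*% X, symmetric = TRUE)$values   # all well above 0 here
  x3 <- 2*x1 + x2                              # exactly collinear with x1 and x2
  coef(lm(y ~ x1 + x2 + x3))                   # lm() reports NA for the aliased column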
It is perfectly OK for one column of X to be a function of another, provided it is a nonlinear
function. Thus in polynomial regression we add extra columns for powers of one or more of the
predictor variables. (Any other nonlinear function is however also legitimate.) This complicates the
interpretation of coefficients as slopes, just as though we had done a transformation of a column.
Estimation and inference for the coefficients on these predictor variables goes exactly like estimation
and inference for any other coefficient.
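For instance, in the toy example a quadratic term in x1 can be added with I() (a sketch):

  quad.fit <- lm(y ~ x1 + I(x1^2) + x2)
  summary(quad.fit)$coefficients    # inference for I(x1^2) works like any other coefficient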
One column of X could be a (nonlinear) function of two or more of the other columns; this
is how we represent interactions. Usually the interaction column is just a product of two other
columns, for a product or multiplicative interaction; this also complicates the interpretation of
coefficients as slopes. (See the notes on interactions.) Estimation and inference for the coefficients
on these predictor variables goes exactly like estimation and inference for any other coefficient.
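A sketch of a product interaction in the toy example; the * operator expands to the main effects plus the product column:

  int.fit <- lm(y ~ x1 * x2)        # i.e., x1 + x2 + x1:x2
  summary(int.fit)$coefficients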
We can include qualitative predictor variables with k discrete categories or levels by introducing
binary indicator variables for k − 1 of the levels, and adding them to X. The coefficients on these
indicators tell us about amounts that are added (or subtracted) to the response for every individual
who is a member of that category or level, compared to what would be predicted for an otherwise-
identical individual in the baseline category. Equivalently, every category gets its own intercept.
Estimation and inference for the coefficients on these predictor variables goes exactly like estimation
and inference for any other coefficient.
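A sketch with a made-up three-level factor g added to the toy example; R builds the k - 1 = 2 indicator columns automatically:

  g <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
  cat.fit <- lm(y ~ x1 + g)
  coef(cat.fit)    # "gb" and "gc" are intercept shifts relative to the baseline level "a"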
Interacting the indicator variables for categories with other variables gives coefficients which say
what amount is added to the slope used for each member of that category (compared to the slope
for members of the baseline level). Equivalently, each category gets its own slope. Estimation and
inference for the coefficients on these predictor variables goes exactly like estimation and inference
for any other coefficient.
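Continuing with the made-up factor g, interacting it with x1 gives each level its own slope (a sketch):

  slope.fit <- lm(y ~ x1 * g)
  coef(slope.fit)   # "x1:gb" and "x1:gc" are slope offsets relative to the baseline slope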
Model selection for prediction aims at picking a model which will predict well on new data
drawn from the same distribution as the data we’ve seen. One way to estimate this out-of-sample
performance is to look at what the expected squared error would be on new data with the same X
matrix, but a new, independent realization of Y. In the notes on model selection, we showed that
\frac{1}{n} E\left[ (Y' − \hat{m})^T (Y' − \hat{m}) \right] = \frac{1}{n} E\left[ (Y − \hat{m})^T (Y − \hat{m}) \right] + \frac{2}{n} \sum_{i=1}^{n} \mathrm{Cov}\left[ Y_i, \hat{m}_i \right]    (30)

= \frac{1}{n} E\left[ (Y − \hat{m})^T (Y − \hat{m}) \right] + \frac{2}{n} \sigma^2 \, \mathrm{tr}\, H    (31)

= \frac{1}{n} E\left[ (Y − \hat{m})^T (Y − \hat{m}) \right] + \frac{2}{n} \sigma^2 (p + 1).    (32)
3 Gaussian Noise
The Gaussian noise assumption is added on to the other assumptions already made. It is that
\epsilon_i \sim N(0, \sigma^2), independent of the predictor variables and of all other \epsilon_j. In other words, \epsilon has a
multivariate Gaussian distribution,

\epsilon \sim MVN(0, \sigma^2 I).    (35)
Under this assumption, it follows that, since \hat{\beta} is a linear function of \epsilon, it also has a multivariate
Gaussian distribution:

\hat{\beta} \sim MVN(\beta, \sigma^2 (X^T X)^{-1})    (36)

and

\hat{Y} \sim MVN(X\beta, \sigma^2 H).    (37)
It follows from this that

\hat{\beta}_i \sim N\left(\beta_i, \sigma^2 (X^T X)^{-1}_{i+1, i+1}\right)    (38)

and

\hat{Y}_i \sim N(X_i \beta, \sigma^2 H_{ii}).    (39)
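A small Monte Carlo sketch of Eq. (36) in the toy example, where the true \sigma^2 = 1 by construction:

  B <- 2000
  beta.sims <- replicate(B, {
    y.sim <- 3 + 2*x1 - x2 + rnorm(n)   # new noise, same design matrix X
    coef(lm(y.sim ~ x1 + x2))
  })
  apply(beta.sims, 1, var)              # close to the diagonal of sigma^2 (X^T X)^{-1}
  diag(solve(t(X) %*% X))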
The sampling distribution of the estimated conditional mean at a new point X_{new} is

X_{new} \hat{\beta} \sim N\left(X_{new} \beta, \; \sigma^2 X_{new} (X^T X)^{-1} X_{new}^T\right).
Moreover, the MSE is statistically independent of \hat{\beta}. We may therefore define

\widehat{\mathrm{se}}\left[\hat{\beta}_i\right] = \hat{\sigma} \sqrt{(X^T X)^{-1}_{i+1, i+1}}    (41)

and

\widehat{\mathrm{se}}\left[\hat{Y}_i\right] = \hat{\sigma} \sqrt{H_{ii}}    (42)

and

\frac{\hat{Y}_i − m(X_i)}{\widehat{\mathrm{se}}\left[\hat{m}_i\right]} \sim t_{n−p−1} \approx N(0, 1).    (44)
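In R, summary() reports exactly these standard errors, along with the t statistics discussed next; a sketch for the toy fit:

  summary(fit)$coefficients      # Estimate, Std. Error, t value, Pr(>|t|)
  sqrt(diag(vcov(fit)))          # the same standard errors, as in Eq. (41)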
The Wald test for the hypothesis that \beta_i = \beta_i^* therefore forms the test statistic

\frac{\hat{\beta}_i − \beta_i^*}{\widehat{\mathrm{se}}\left[\hat{\beta}_i\right]}    (45)
and rejects the hypothesis if it is too large (above or below zero) compared to the quantiles of a
tn−p−1 distribution. The summary function of R runs such a test of the hypothesis that βi = 0.
There is nothing magic or even especially important about testing for a 0 coefficient, and the same
test works for testing whether a slope = 42 (for example).
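A sketch of such a Wald test in the toy example, for the (arbitrary) null value 42 for the coefficient on x1:

  b.hat  <- coef(fit)["x1"]
  se.hat <- sqrt(vcov(fit)["x1", "x1"])
  tstat  <- (b.hat - 42) / se.hat
  2 * pt(-abs(tstat), df = df.residual(fit))   # two-sided p-value against t_{n-p-1}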
Important! The null hypothesis being tested is that \beta_i = \beta_i^* and that all of the modeling assumptions hold exactly.
The Wald test does not test any of the model assumptions (it presumes them all), and it cannot say
whether in an absolute sense X_i matters for Y; adding or removing other predictors can change
whether the true \beta_i = 0.
Warning! Retaining the null hypothesis βi = 0 can happen if either the parameter is precisely
estimated, and confidently known to be close to zero, or if it is imprecisely estimated, and might
as well be zero or something huge on either side. Saying “We can ignore this because we can be
quite sure it’s small” can make sense; saying “We can ignore this because we have no idea what it
is” is preposterous.
To test whether several coefficients (βj : j ∈ S) are all simultaneously zero, use an F test. The
null hypothesis is
H0 : βj = 0 for all j ∈ S
and the alternative is
H_1: \beta_j \neq 0 for at least one j \in S.
The F statistic is

F_{stat} = \frac{(\hat{\sigma}^2_{null} − \hat{\sigma}^2_{full}) / s}{\hat{\sigma}^2_{full} / (n − p − 1)}    (46)

where s is the number of elements in S. Under that null hypothesis,

F_{stat} \sim F_{s, n−p−1}.    (47)
If we are testing a subset of coefficients, we have a “partial” F test. A “full” F test sets s = p,
i.e., it tests the null hypothesis of an intercept-only model (with independent, constant-variance
Gaussian noise) against the alternative of the linear model on X1 , . . . Xp (and only those variables,
with independent, constant-variance Gaussian noise). This is only of interest under very unusual
circumstances.
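A sketch of a partial F test in the toy example, testing whether the coefficients on x2 and on the x1:x2 interaction are both zero:

  full    <- lm(y ~ x1 * x2)
  reduced <- lm(y ~ x1)
  anova(reduced, full)      # F statistic with s = 2 numerator degrees of freedom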
Once again, no F test is capable of checking any modeling assumptions. This is because both
the null hypothesis and the alternative hypothesis presume that all of the modeling assumptions
are exactly correct.
A 1 − α confidence interval for βi is
\hat{\beta}_i \pm \widehat{\mathrm{se}}\left[\hat{\beta}_i\right] t_{n−p−1}(\alpha/2) \approx \hat{\beta}_i \pm \widehat{\mathrm{se}}\left[\hat{\beta}_i\right] z_{\alpha/2}.    (48)
We saw how to create a confidence ellipsoid for several coefficients. These make a simultaneous
guarantee: all the parameters are trapped inside the confidence region with probability 1 − α. A
simpler way to get a simultaneous confidence region for all p parameters is to use 1−α/p confidence
intervals for each one (“Bonferroni correction”). This gives a confidence hyper-rectangle.
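In R, confint() gives the t-based intervals, and shrinking \alpha gives a crude Bonferroni version (a sketch; here the correction divides by the number of reported coefficients):

  confint(fit, level = 0.95)
  confint(fit, level = 1 - 0.05 / length(coef(fit)))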
A 1 − α confidence interval for the regression function at a point is
\hat{m}(X_i) \pm \widehat{\mathrm{se}}\left[\hat{m}(X_i)\right] t_{n−p−1}(\alpha/2).    (49)
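A sketch using predict() at the made-up new point from before:

  predict(fit, newdata = data.frame(x1 = 0.5, x2 = -1),
          interval = "confidence", level = 0.95)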
Residuals. The cross-validated or studentized residuals are:
1. Temporarily hold out data point i
2. Re-estimate the coefficients to get \hat{\beta}^{(−i)} and \hat{\sigma}^{(−i)}.
3. Make a prediction for Y_i, namely, \hat{Y}_i^{(−i)} = \hat{m}^{(−i)}(X_i).
4. Calculate

t_i = \frac{Y_i − \hat{Y}_i^{(−i)}}{\hat{\sigma}^{(−i)} + \widehat{\mathrm{se}}\left[\hat{m}_i^{(−i)}\right]}.    (50)
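R's rstudent() returns leave-one-out studentized residuals of this general kind without literally refitting the model n times (it uses an algebraic shortcut); a sketch for the toy fit:

  head(rstudent(fit))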