lecture_8
Zheng Tian
1 Introduction
1.1 Overview
This lecture extends the simple regression model to the multiple regression model. Many
aspects of multiple regression parallel those of regression with a single regressor. The
coefficients of the multiple regression model can be estimated using the OLS estimation
method. The algebraic and statistical properties of the OLS estimators of the multiple
regression are also similar to those of the simple regression. However, there are some new concepts, such as omitted variable bias and multicollinearity, that deepen our understanding of the OLS estimation.
1.2 Learning goals
1.3 Readings
In the last two lectures, we used a simple linear regression model to examine the effect of class sizes on test scores in the California elementary school districts. The simple linear regression model with only one regressor is

TestScorei = β0 + β1 STRi + OtherFactorsi

Obviously, it ignores too many other important factors. Instead, all these other factors are lumped into a single term, OtherFactors, which is the error term, ui, in the regression model.
Percentage of English learners as an example
The percentage of English learners in a school district could be a relevant and important determinant of test scores, yet it is omitted from the simple regression model.
• How can its omission affect the estimate of the effect of student-teacher ratios on test scores?
  – Districts with high percentages of English learners tend to have large student-teacher ratios. That is, these two variables are correlated, with a correlation coefficient of 0.17.
  – The higher the percentage of English learners a district has, the lower the test scores its students tend to earn.
We can include these important but ignored variables, like the percentage of English learners (PctEL), in the regression model.
A regression model with more than one regressor is a multiple regression model:

Yi = β0 + β1 X1i + β2 X2i + · · · + βk Xki + ui ,   i = 1, . . . , n    (1)

where
• X1i , X2i , . . . , Xki are the ith observations on each of the k regressors.
• ui is the error term associated with the ith observation, representing all other factors
that are not included in the model.
• The population regression line (or population regression function) is the relationship
that holds between Y and X on average in the population
– The linear regression model with a single regressor is in fact a multiple regres-
sion model with two regressors, 1 and X.
Y = β0 + β1 X1 + · · · + βk Xk + u (2)
Suppose we have two regressors, X1 and X2 , and we are interested in the effect of X1 on Y . Let X1 change by ∆X1 while holding X2 constant. Then the new value of Y is

Y + ∆Y = β0 + β1 (X1 + ∆X1 ) + β2 X2

Subtracting the original equation Y = β0 + β1 X1 + β2 X2 gives ∆Y = β1 ∆X1 , so that

β1 = ∆Y /∆X1 , holding X2 constant

Partial effect

In calculus terms, with Y = β0 + β1 X1 + · · · + βk Xk , the coefficient on Xi is the partial derivative

βi = ∂Y /∂Xi

By the definition of a partial derivative, βi is the effect on Y of a marginal change in Xi , holding the other regressors constant.
When we save the data set of California school districts in Excel, it is saved in a spreadsheet
as shown in Figure 1.
Each row represents an observation of all variables pertaining to a school district, and
each column represents a variable with all observations. This format of data display can
be concisely denoted using vectors and a matrix.
$$\mathbf{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad
\mathbf{X} = \begin{pmatrix} 1 & X_{11} & \cdots & X_{k1} \\ 1 & X_{12} & \cdots & X_{k2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{1n} & \cdots & X_{kn} \end{pmatrix}
= \begin{pmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_n' \end{pmatrix}, \quad
\mathbf{u} = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}, \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}$$
• X is an n × (k + 1) matrix of n observations on k + 1 regressors which include the
intercept term as a regressor of 1’s.
The multiple regression model in Equation (1) for the ith observation can be written
as
Yi = x′i β + ui ,   i = 1, . . . , n    (3)
Stacking all n observations in Equation (3) yields the multiple regression model in
matrix form:
Y = Xβ + u (4)
Alternatively, X can be written in terms of its columns as

X = [X0 , X1 , . . . , Xk ]

where Xj = [Xj1 , Xj2 , . . . , Xjn ]′ is an n × 1 vector of the n observations on the jth regressor, j = 1, . . . , k, and X0 is a vector of 1s, that is, X0 = [1, 1, . . . , 1]′ . More often, we use ι to denote such a vector of 1s.¹
Y = β0 ι + β1 X1 + · · · + βk Xk + u (5)
¹ ι has the following properties: (1) ι′x = Σⁿᵢ₌₁ xᵢ for an n × 1 vector x; (2) ι′ι = n and (ι′ι)⁻¹ = 1/n; (3) ι′(ι′ι)⁻¹x = x̄; and (4) ι′Xι = Σⁿᵢ₌₁ Σⁿⱼ₌₁ xᵢⱼ for an n × n matrix X.
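To make the matrix notation concrete, here is a minimal numpy sketch of how Y and the design matrix X in Equation (5) could be assembled; the numbers are made-up illustrative values, not the actual California data.

```python
import numpy as np

# Made-up values for five districts: test scores, student-teacher ratios,
# and percentages of English learners (purely illustrative).
test_score = np.array([690.8, 661.2, 643.6, 647.7, 640.8])
str_ratio  = np.array([17.9, 21.5, 18.7, 17.4, 18.7])
pct_el     = np.array([0.0, 4.6, 30.0, 0.0, 13.9])

n = len(test_score)
Y    = test_score                                  # n x 1 vector of the dependent variable
iota = np.ones(n)                                  # the vector of 1s for the intercept
X    = np.column_stack([iota, str_ratio, pct_el])  # n x (k+1) design matrix, here k = 2

print(X.shape)   # (5, 3): n observations on k + 1 regressors, including the intercept
```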
3 The OLS Estimator in Multiple Regression
The idea of the ordinary least squares estimation for a multiple regression model is exactly
the same as for a simple regression model. The OLS estimators of the multiple regression
model are obtained by minimizing the sum of the squared prediction mistakes,

S(b0 , b1 , . . . , bk ) = Σⁿᵢ₌₁ (Yi − b0 − b1 X1i − · · · − bk Xki )²

or, in matrix notation, with the predicted values and residuals defined as

Ŷ = Xb,   û = Y − Xb

the objective is S(b) = û′û.
The OLS estimator of β as a solution to the minimization problem
The formula for the OLS estimator is obtained by taking the derivative of the sum of
squared prediction mistakes, S(b0 , b1 , . . . , bk ), with respect to each coefficient, setting these
derivatives to zero, and solving for the estimator β̂.
$$\frac{\partial}{\partial b_j}\sum_{i=1}^{n}\left(Y_i - b_0 - b_1 X_{1i} - \cdots - b_k X_{ki}\right)^2
= -2\sum_{i=1}^{n} X_{ji}\left(Y_i - b_0 - b_1 X_{1i} - \cdots - b_k X_{ki}\right) = 0$$

for j = 0, 1, . . . , k, where X0i = 1 for all i. In matrix notation, the objective function is

S(b) = û′û = Y′Y − b′X′Y − Y′Xb + b′X′Xb

Setting the vector of derivatives to zero, ∂S(b)/∂b = −2X′Y + 2X′Xb = 0, and solving for b yields the OLS estimator

β̂ = b = (X′X)⁻¹X′Y    (7)
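As a minimal numerical sketch (not part of the original derivation), the formula β̂ = (X′X)⁻¹X′Y can be computed directly with numpy; the simulated data below are purely illustrative.

```python
import numpy as np

def ols(X, Y):
    """OLS coefficients beta_hat = (X'X)^{-1} X'Y, computed via the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ Y)   # solve is numerically safer than inverting X'X

# Illustrative simulated data with k = 2 regressors plus an intercept.
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(size=n)

print(ols(X, Y))   # estimates close to [1.0, 2.0, -0.5]
```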
3.2 Example: the OLS estimator of β̂1 in a simple regression model
Let us take a simple linear regression model as an example. The simple linear regression
model written in matrix notation is
Y = β0 ι + β1 X1 + u = Xβ + u
where
$$\mathbf{Y} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \quad
\mathbf{X} = \begin{pmatrix} \iota & \mathbf{X}_1 \end{pmatrix} = \begin{pmatrix} 1 & X_{11} \\ \vdots & \vdots \\ 1 & X_{1n} \end{pmatrix}, \quad
\mathbf{u} = \begin{pmatrix} u_1 \\ \vdots \\ u_n \end{pmatrix}, \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}$$
First, we compute X′X:

$$\mathbf{X}'\mathbf{X} = \begin{pmatrix} \iota' \\ \mathbf{X}_1' \end{pmatrix}\begin{pmatrix} \iota & \mathbf{X}_1 \end{pmatrix}
= \begin{pmatrix} 1 & \cdots & 1 \\ X_{11} & \cdots & X_{1n} \end{pmatrix}
\begin{pmatrix} 1 & X_{11} \\ \vdots & \vdots \\ 1 & X_{1n} \end{pmatrix}
= \begin{pmatrix} \iota'\iota & \iota'\mathbf{X}_1 \\ \mathbf{X}_1'\iota & \mathbf{X}_1'\mathbf{X}_1 \end{pmatrix}
= \begin{pmatrix} n & \sum_{i=1}^{n} X_{1i} \\ \sum_{i=1}^{n} X_{1i} & \sum_{i=1}^{n} X_{1i}^2 \end{pmatrix}$$
Next, we compute X′Y:

$$\mathbf{X}'\mathbf{Y} = \begin{pmatrix} \iota' \\ \mathbf{X}_1' \end{pmatrix}\mathbf{Y}
= \begin{pmatrix} 1 & \cdots & 1 \\ X_{11} & \cdots & X_{1n} \end{pmatrix}
\begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}
= \begin{pmatrix} \iota'\mathbf{Y} \\ \mathbf{X}_1'\mathbf{Y} \end{pmatrix}
= \begin{pmatrix} \sum_{i=1}^{n} Y_i \\ \sum_{i=1}^{n} X_{1i} Y_i \end{pmatrix}$$
Finally, we compute β̂ = (X′X)⁻¹X′Y, which is
$$\begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix}
= \frac{1}{n\sum_{i=1}^{n} X_{1i}^2 - \left(\sum_{i=1}^{n} X_{1i}\right)^2}
\begin{pmatrix} \sum_{i=1}^{n} X_{1i}^2 & -\sum_{i=1}^{n} X_{1i} \\ -\sum_{i=1}^{n} X_{1i} & n \end{pmatrix}
\begin{pmatrix} \sum_{i=1}^{n} Y_i \\ \sum_{i=1}^{n} X_{1i} Y_i \end{pmatrix}$$

$$= \frac{1}{n\sum_{i=1}^{n} X_{1i}^2 - \left(\sum_{i=1}^{n} X_{1i}\right)^2}
\begin{pmatrix} \sum_{i=1}^{n} X_{1i}^2 \sum_{i=1}^{n} Y_i - \sum_{i=1}^{n} X_{1i} \sum_{i=1}^{n} X_{1i} Y_i \\
-\sum_{i=1}^{n} X_{1i} \sum_{i=1}^{n} Y_i + n\sum_{i=1}^{n} X_{1i} Y_i \end{pmatrix}$$
Therefore, β̂1 is the second element of the vector pre-multiplied by the fraction, that is,
$$\hat\beta_1 = \frac{n\sum_{i=1}^{n} X_{1i} Y_i - \sum_{i=1}^{n} X_{1i} \sum_{i=1}^{n} Y_i}{n\sum_{i=1}^{n} X_{1i}^2 - \left(\sum_{i=1}^{n} X_{1i}\right)^2}
= \frac{\sum_{i=1}^{n} (X_{1i} - \bar X_1)(Y_i - \bar Y)}{\sum_{i=1}^{n} (X_{1i} - \bar X_1)^2}$$
It follows that
$$\hat\beta_0 = \frac{\sum_{i=1}^{n} X_{1i}^2 \sum_{i=1}^{n} Y_i - \sum_{i=1}^{n} X_{1i} \sum_{i=1}^{n} X_{1i} Y_i}{n\sum_{i=1}^{n} X_{1i}^2 - \left(\sum_{i=1}^{n} X_{1i}\right)^2} = \bar Y - \hat\beta_1 \bar X_1$$
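A quick numerical check (a sketch with simulated, purely illustrative data): the matrix formula (X′X)⁻¹X′Y reproduces the familiar simple-regression formulas for β̂0 and β̂1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X1 = rng.normal(size=n)
Y = 3.0 - 2.0 * X1 + rng.normal(size=n)

# Matrix formula with X = [iota, X1]
X = np.column_stack([np.ones(n), X1])
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Familiar simple-regression formulas
b1 = np.sum((X1 - X1.mean()) * (Y - Y.mean())) / np.sum((X1 - X1.mean()) ** 2)
b0 = Y.mean() - b1 * X1.mean()

print(np.allclose(beta_hat, [b0, b1]))   # True: the two routes agree
```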
Now we can apply the OLS estimation method for multiple regression to the California school district data. Recall that the estimated simple linear regression model is

$$\widehat{TestScore} = 698.9 - 2.28 \times STR$$
Since we are concerned that the estimated effect of STR may be overstated when the percentage of English learners in the districts is ignored, we include this new variable in the multiple regression model to control for the effect of English learners, yielding the new estimated regression model

$$\widehat{TestScore} = 686.0 - 1.10 \times STR - 0.65 \times PctEL$$
• The interpretation of the new estimated coefficient on STR is that, holding the percentage of English learners constant, a one-unit decrease in STR is estimated to increase test scores by 1.10 points.
• We can also interpret the estimated coefficient on PctEL: holding STR constant, a one-unit decrease in PctEL is estimated to increase test scores by 0.65 points.
• The magnitude of the negative effect of STR on test scores in the multiple regression is approximately half as large as when STR is the only regressor, which confirms our concern that important variables may be omitted from the simple linear regression model.
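In practice, such estimates are typically obtained with a regression package. The sketch below uses statsmodels and assumes a hypothetical file caschool.csv with columns testscr, str, and el_pct; the actual file and column names depend on how the data set was saved.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file and column names for the California school district data.
df = pd.read_csv("caschool.csv")

X = sm.add_constant(df[["str", "el_pct"]])   # intercept, STR, PctEL
results = sm.OLS(df["testscr"], X).fit()
print(results.params)                        # expected to be close to 686.0, -1.10, -0.65
```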
4 Measures of Fit in Multiple Regression
The standard error of regression (SER) estimates the standard deviation of the error term
u. In multiple regression, the SER is
n
1 X û0 û SSR
SER = sû , where s2û = û2i = = (10)
n−k−1 n−k−1 n−k−1
i=1
4.2 R2
Like in the regression model with single regressor, we can define T SS, ESS, and SSR in
the multiple regression model.
• The total sum of squares (TSS): TSS = Σⁿᵢ₌₁ (Yi − Ȳ)²
• The explained sum of squares (ESS): ESS = Σⁿᵢ₌₁ (Ŷi − Ȳ)²
• The sum of squared residuals (SSR): SSR = Σⁿᵢ₌₁ ûi²
The identity TSS = ESS + SSR still holds, so we can define R² as

R² = ESS/TSS = 1 − SSR/TSS    (11)
Limitations of R2
1. R2 is valid only if a regression model is estimated using the OLS since otherwise it
would not be true that T SS = ESS + SSR.
2. R² defined in terms of deviations from the mean is only valid when a constant term is included in the regression. Otherwise, use the uncentered version of R², which is defined as

R²u = ESS/TSS = 1 − SSR/TSS    (12)

where TSS = Σⁿᵢ₌₁ Yi² = Y′Y, ESS = Σⁿᵢ₌₁ Ŷi² = Ŷ′Ŷ, and SSR = Σⁿᵢ₌₁ ûi² = û′û are computed using the uncentered variables. Note that in a regression without a constant term, the equality TSS = ESS + SSR is still true.
3. R² never decreases, and typically increases, when a new regressor is added. To see this, compare the following two models:

Y = β0 + β1 X1 + u    (13)
Y = β0 + β1 X1 + β2 X2 + u    (14)
Since both models use the same Y, T SS must be the same. If the OLS estimator β̂2
does not equal 0, then SSR in Equation (13) is always larger than that of Equation
(14) since the former SSR is minimized with respect to β0 , β1 and with the constraint
of β2 = 0 and the latter is minimized without the constraint over β2 .
The adjusted R², or R̄², is a modified version of R² in Equation (11). R̄² improves on R² in the sense that it does not necessarily increase when a new regressor is added. R̄² is defined as

$$\bar R^2 = 1 - \frac{SSR/(n-k-1)}{TSS/(n-1)} = 1 - \frac{n-1}{n-k-1}\,\frac{SSR}{TSS} = 1 - \frac{s_{\hat u}^2}{s_Y^2} \qquad (15)$$
• The adjustment is made by dividing SSR and TSS by their corresponding degrees of freedom, which are n − k − 1 and n − 1, respectively.
• s²û is the sample variance of the OLS residuals, which is given in Equation (10); s²Y is the sample variance of Y.
• The definition of R̄2 in Equation (15) is valid only when a constant term is included
in the regression model.
• Since (n − 1)/(n − k − 1) > 1, it is always true that R̄² < R².
• R̄² can be negative. This happens when the regressors, taken together, reduce the sum of squared residuals by such a small amount that this reduction fails to offset the factor (n − 1)/(n − k − 1).
• Both R2 and R̄2 are valid when the regression model is estimated by the OLS
estimators. R2 or R̄2 computed with the estimators other than the OLS estimators
is usually called pseudo R2 .
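A minimal sketch (my own illustration, with simulated data) of how the SER, R², and R̄² in Equations (10), (11), and (15) can be computed from OLS residuals:

```python
import numpy as np

def fit_measures(X, Y):
    """Return SER, R^2 and adjusted R^2 for an OLS regression of Y on X (X includes the intercept)."""
    n, kp1 = X.shape                               # kp1 = k + 1 regressors, including the constant
    b = np.linalg.solve(X.T @ X, X.T @ Y)
    u_hat = Y - X @ b
    SSR = u_hat @ u_hat
    TSS = np.sum((Y - Y.mean()) ** 2)
    R2 = 1 - SSR / TSS
    R2_bar = 1 - (SSR / (n - kp1)) / (TSS / (n - 1))
    SER = np.sqrt(SSR / (n - kp1))
    return SER, R2, R2_bar

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
Y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)
print(fit_measures(X, Y))
```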
Suppose we partition the regressors in Equation (1) into two groups and write the model as

Y = β0 + β1 X1 + · · · + βk1 Xk1 + βk1+1 Xk1+1 + · · · + βk Xk + u    (16)

or, in matrix form,

Y = X1 β1 + X2 β2 + u    (17)

where X1 is an n × (k1 + 1) matrix composed of the intercept and the first k1 regressors in Equation (16), and X2 is an n × k2 matrix composed of the remaining k2 regressors, with k1 + k2 = k. β1 = (β0 , β1 , . . . , βk1 )′ and β2 = (βk1+1 , . . . , βk )′.

Suppose that we are interested in β2 but not so much in β1 in Equation (17). How can we estimate β2? There are two strategies.

The first strategy is to obtain the OLS estimate of the whole coefficient vector from Equation (7), i.e., β̂ = (X′X)⁻¹X′Y; β̂2 is then the vector consisting of the last k2 elements of β̂.

The second strategy proceeds in three steps:

1. Regress Y on X1 and collect the residuals, denoted Ỹ.
2. Regress each regressor in X2 on X1 and collect the residuals. As such, we get an n × k2 matrix composed of all the residuals, X̃2 = (X̃k1+1 · · · X̃k ).
3. Regress Ỹ on X̃2 , and obtain the estimate of β2 as β̂2 = (X̃2′ X̃2 )⁻¹ X̃2′ Ỹ.
The Frisch-Waugh-Lovell (FWL) theorem states that

1. the OLS estimates of β2 from the second strategy and those from the first strategy are numerically identical;
2. the residuals from the regression in Step 3 are identical to the residuals from the full regression in Equation (17).

The proof of the FWL theorem is beyond the scope of this lecture. Interested students
may refer to Exercise 18.7. Understanding the meaning of this theorem is much more
important than understanding the proof.
The FWL theorem provides a mathematical statement of how the multiple regression
coefficients in β̂ 2 capture the effects of X2 on Y, controlling for other X.
• Step 3 estimates the effect of the regressors in X2 on Y using the parts in X2 and
Y that have excluded the effects of X1 .
Consider the simple regression model

Yi = β0 + β1 Xi + ui ,   i = 1, . . . , n

Following the estimation strategy in the FWL theorem, we can carry out the following regressions:

1. Regress Yi on the constant term only, Yi = α + ei , and obtain the residuals yi = Yi − Ȳ.
2. Regress Xi on the constant term only and obtain the residuals xi = Xi − X̄.
3. Regress yi on xi ,

   yi = β1 xi + vi ,   or in matrix notation y = β1 x + v

4. We can obtain β̂1 directly by applying the formula in Equation (7). That is,

$$\hat\beta_1 = (\mathbf{x}'\mathbf{x})^{-1}\mathbf{x}'\mathbf{y} = \frac{\sum_i x_i y_i}{\sum_i x_i^2} = \frac{\sum_i (X_i - \bar X)(Y_i - \bar Y)}{\sum_i (X_i - \bar X)^2}$$
which is exactly the same as the OLS estimator of β1 in Yi = β0 + β1 Xi + ui .
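A short numerical check of the FWL theorem (my own sketch with simulated data): partialling X1 out of both Y and X2 and regressing the residuals reproduces exactly the coefficients on X2 from the full regression.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])          # intercept plus one regressor
X2 = rng.normal(size=(n, 2)) + 0.5 * X1[:, [1]]                 # two regressors correlated with X1
Y = X1 @ np.array([1.0, 2.0]) + X2 @ np.array([-1.0, 0.5]) + rng.normal(size=n)

# First strategy: one multiple regression of Y on [X1, X2]; beta_2 is the last two elements.
X = np.hstack([X1, X2])
beta2_direct = np.linalg.solve(X.T @ X, X.T @ Y)[-2:]

# Second strategy (FWL): partial X1 out of Y and X2, then regress the residuals.
M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)          # residual-maker matrix for X1
Y_tilde, X2_tilde = M1 @ Y, M1 @ X2
beta2_fwl = np.linalg.solve(X2_tilde.T @ X2_tilde, X2_tilde.T @ Y_tilde)

print(np.allclose(beta2_direct, beta2_fwl))                     # True: numerically identical
```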
We introduce four least squares assumptions for a multiple regression model. The first
three assumptions are the same as those in the simple regression model with just minor
modifications to suit multiple regressors. The fourth assumption is a new one.
Assumption #1 E(ui |xi ) = 0, where xi = (1, X1i , X2i , . . . , Xki )′. The conditional mean of ui given X1i , X2i , . . . , Xki is zero. This is the key assumption to ensure that the OLS estimators are unbiased.

Assumption #2 (X1i , . . . , Xki , Yi ), i = 1, . . . , n, are independently and identically distributed (i.i.d.) draws from their joint distribution.

Assumption #3 Large outliers are unlikely, i.e., 0 < E(X⁴ji ) < ∞ for j = 1, . . . , k and 0 < E(Yi⁴ ) < ∞. That is, the dependent variable and the regressors have finite kurtosis.

Assumption #4 There is no perfect multicollinearity; that is, no regressor is a perfect linear function of the other regressors.
Under the least squares assumptions, the OLS estimator β̂ can be shown to be an unbiased and consistent estimator of β in the multiple regression model of Equation (4).
Unbiasedness
Substituting Y = Xβ + u into β̂ = (X′X)⁻¹X′Y gives β̂ = β + (X′X)⁻¹X′u, so that

E(β̂|X) = β + (X′X)⁻¹X′ E(u|X) = β

in which E(u|X) = 0 follows from the first least squares assumption. Hence β̂ is unbiased.
Consistency
By the least squares assumptions,

$$\operatorname*{plim}_{n\to\infty}\frac{1}{n}\mathbf{X}'\mathbf{X} = \mathbf{Q}_X \qquad (20)$$

which means that, as n grows large, (1/n)X′X converges in probability to a nonstochastic (k + 1) × (k + 1) matrix QX with full rank k + 1. In Chapter 18, we will see that QX = E(xi xi′), where xi = [1, X1i , . . . , Xki ]′ is the ith row of X written as a column vector.

Next, consider the term

$$\operatorname*{plim}_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i u_i$$

Since E(ui |xi ) = 0, we know that E(xi ui ) = E(xi E(ui |xi )) = 0. Also, by Assumptions #2 and #3, xi ui are i.i.d. and have a positive finite variance. Thus, by the law of large numbers,

$$\operatorname*{plim}_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i u_i = E(\mathbf{x}_i u_i) = 0$$

Combining these two results with β̂ = β + ((1/n)X′X)⁻¹(1/n)X′u shows that β̂ converges in probability to β; that is, β̂ is consistent.
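A small simulation sketch (my own illustration, not from the text): as the sample size grows, the OLS estimates settle down near the true coefficients, which is what consistency says.

```python
import numpy as np

rng = np.random.default_rng(4)
beta_true = np.array([1.0, 2.0])

# The slope estimate gets closer to the true value 2.0 as n grows.
for n in [50, 500, 5_000, 50_000]:
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    Y = X @ beta_true + rng.normal(size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    print(n, beta_hat[1])
```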
7.2 The Gauss-Markov theorem and efficiency
The Gauss-Markov theorem states that, under the Gauss-Markov conditions

1. E(u|X) = 0,
2. Var(u|X) = E(uu′|X) = σu² In ,
3. X has full column rank,

the OLS estimator β̂ is the best (most efficient) linear conditionally unbiased estimator of β; that is, for any other linear conditionally unbiased estimator β̃, Var(β̃|X) − Var(β̂|X) is a positive semidefinite matrix.²

That Var(β̃|X) − Var(β̂|X) is a positive semidefinite matrix means that for any nonzero (k + 1) × 1 vector c,

c′ (Var(β̃|X) − Var(β̂|X)) c ≥ 0

so that the linear combination c′β̂ has a variance no larger than that of c′β̃.

Like in the regression model with a single regressor, the least squares assumptions can be summarized by the Gauss-Markov conditions:

• Assumptions #1, #2, and the additional assumption of homoskedasticity imply that Var(u|X) = σu² In .
2
The complete proof of the Gauss-Markov theorem in multiple regression is in Appendix 18.5.
For a random vector x, the variance of x is a covariance matrix, defined as

Var(x) = E[(x − E(x))(x − E(x))′]

which also holds for the conditional variance by replacing the expectation operator with the conditional expectation operator. In particular, for the error vector u with E(u|X) = 0, we have Var(u|X) = E(uu′|X), where

$$\mathbf{u}\mathbf{u}' = \begin{pmatrix}
u_1^2 & u_1 u_2 & \cdots & u_1 u_n \\
u_2 u_1 & u_2^2 & \cdots & u_2 u_n \\
\vdots & \vdots & \ddots & \vdots \\
u_n u_1 & u_n u_2 & \cdots & u_n^2
\end{pmatrix}$$
Consider a linear estimator β̃ = AY, where A is a (k + 1) × n matrix that may depend on X but not on Y. Substituting Y = Xβ + u gives

β̃ = AY = AXβ + Au

Conditional unbiasedness requires E(β̃|X) = AXβ + A E(u|X) = β for every β, which only holds when AX = Ik+1 and the first Gauss-Markov condition, E(u|X) = 0, holds.

The OLS estimator β̂ is a linear conditionally unbiased estimator with A = (X′X)⁻¹X′. Obviously, AX = Ik+1 is true for β̂.
Under homoskedasticity, Var(u|X) = σu² In , and the conditional variance of the OLS estimator is

Var(β̂|X) = σu² (X′X)⁻¹    (21)

If the homoskedasticity assumption does not hold, denote the covariance matrix of u as Var(u|X) = Ω. Heteroskedasticity means that the diagonal elements of Ω can be different (i.e., Var(ui |X) = σi² for i = 1, . . . , n), while the off-diagonal elements are zeros, that is,

$$\boldsymbol{\Omega} = \begin{pmatrix}
\sigma_1^2 & 0 & \cdots & 0 \\
0 & \sigma_2^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_n^2
\end{pmatrix}$$

In this case, the conditional variance of the OLS estimator takes the sandwich form

Var(β̂|X) = (X′X)⁻¹ X′ΩX (X′X)⁻¹    (22)
7.3 The asymptotic normal distribution
In large samples, the OLS estimator β̂ has the multivariate normal asymptotic distribution

β̂ →d N(β, Σβ̂ )    (23)

where Σβ̂ = Var(β̂|X), given by Equation (21) in the homoskedastic case and by Equation (22) in the heteroskedastic case.
The proof of the asymptotic normal distribution and the multivariate central limit theorem
are given in Chapter 18.
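For inference one needs an estimate of Σβ̂. The sketch below (my own illustration, following the textbook formulas rather than any specific library routine) computes both the homoskedasticity-only estimate from Equation (21) and a heteroskedasticity-robust sandwich estimate based on Equation (22), replacing σu² and the σi² with squared OLS residuals.

```python
import numpy as np

def ols_covariances(X, Y):
    """Homoskedasticity-only and heteroskedasticity-robust estimates of Var(beta_hat | X)."""
    n, kp1 = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ Y
    u_hat = Y - X @ b

    # Homoskedastic case (Equation (21)): s_u^2 (X'X)^{-1}
    s2_u = u_hat @ u_hat / (n - kp1)
    V_homo = s2_u * XtX_inv

    # Heteroskedastic case (Equation (22)) with Omega replaced by diag(u_hat^2):
    # (X'X)^{-1} X' diag(u_hat^2) X (X'X)^{-1}
    meat = X.T @ (u_hat[:, None] ** 2 * X)
    V_robust = XtX_inv @ meat @ XtX_inv
    return V_homo, V_robust

rng = np.random.default_rng(7)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + np.abs(X[:, 1]) * rng.normal(size=n)   # heteroskedastic errors
V_homo, V_robust = ols_covariances(X, Y)
print(np.sqrt(np.diag(V_homo)), np.sqrt(np.diag(V_robust)))           # standard errors differ
```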
8 Omitted Variable Bias

The omitted variable bias is the bias in the OLS estimator that arises when the included regressors, X, are correlated with omitted variables, Z, where X may include k regressors, X1 , . . . , Xk , and Z may include m omitted variables, Z1 , . . . , Zm . The omitted variable bias occurs when two conditions are met: (1) the omitted variables are determinants of the dependent variable, and (2) the omitted variables are correlated with at least one included regressor.

Suppose the true model is

Y = Xβ + Zγ + u    (24)

in which the first least squares assumption, E(u|X, Z) = 0, holds. We further assume that Cov(X, Z) ≠ 0. If we omit Z, we instead estimate the short model

Y = Xβ + ε    (25)

where ε = Zγ + u. Since ε represents all the factors that are not in Equation (25), including Z, and Cov(X, Z) ≠ 0, it follows that Cov(X, ε) ≠ 0, which implies that E(ε|X) ≠ 0. (Recall that in Chapter 4 we proved that E(ui |Xi ) = 0 ⇒ Cov(ui , Xi ) = 0, so that Cov(ui , Xi ) ≠ 0 ⇒ E(ui |Xi ) ≠ 0.) Therefore, Assumption #1 does not hold for the short model, which means that the OLS estimator of Equation (25) is biased.
An informal proof that the OLS estimator of Equation (25) is biased goes as follows. The OLS estimator of Equation (25) is β̂ = (X′X)⁻¹X′Y. Plugging in the true model for Y, we have

β̂ = (X′X)⁻¹X′(Xβ + Zγ + u) = β + (X′X)⁻¹X′Zγ + (X′X)⁻¹X′u

so that, using E(u|X, Z) = 0,

E(β̂|X, Z) = β + (X′X)⁻¹X′Zγ

The second term in the equation above usually does not equal zero unless either (1) γ = 0, so that Z does not in fact belong in the true model, or (2) X′Z = 0, so that the included and omitted regressors are orthogonal. Therefore, if these two conditions do not hold, β̂ for the short model is biased, and the magnitude and direction of the bias are determined by X′Zγ.
To see the form of the bias in the simplest case, consider a model with one included regressor X1i and one omitted regressor X2i :

Yi = β0 + β1 X1i + β2 X2i + ui ,   i = 1, . . . , n
Yi = β0 + β1 X1i + εi ,   i = 1, . . . , n,   where εi = β2 X2i + ui

The OLS estimator of β1 in the short regression can be decomposed as

$$\hat\beta_1 = \beta_1 + \frac{\sum_{i=1}^{n}(X_{1i}-\bar X_1)\varepsilon_i}{\sum_{i=1}^{n}(X_{1i}-\bar X_1)^2} \qquad (26)$$

As n → ∞, the numerator of the second term converges to Cov(X1 , ε) = ρX1ε σX1 σε and the denominator converges to σ²X1 , where ρX1ε is the correlation coefficient between X1i and εi . Therefore,

$$\hat\beta_1 \xrightarrow{\;p\;} \beta_1 + \underbrace{\rho_{X_1\varepsilon}\,\frac{\sigma_\varepsilon}{\sigma_{X_1}}}_{\text{omitted variable bias}} \qquad (27)$$
From Equations (26) and (27), we can summarize some facts about the omitted variable bias (a small simulation sketch follows the list):

• Omitted variable bias is a problem regardless of whether the sample size is large or small: β̂ is biased and inconsistent when there is omitted variable bias.
• Whether the bias is large or small in practice depends on |ρX1ε | or on the magnitude of X′Zγ.
• One easy way to detect possible omitted variable bias is to check whether, when a new regressor is added, the estimated coefficients on some previously included regressors change substantially.
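The sketch below (my own illustration with simulated data) shows the bias in action: a regressor X1 is generated to be correlated with an omitted determinant Z of Y, and the short regression of Y on X1 alone produces a slope far from the true value.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
Z = rng.normal(size=n)
X1 = 0.8 * Z + rng.normal(size=n)            # X1 is correlated with the omitted variable Z
Y = 1.0 + 2.0 * X1 + 3.0 * Z + rng.normal(size=n)

def slope_on_x1(regressors, Y):
    """OLS coefficient on X1 when regressing Y on an intercept plus the given regressors."""
    X = np.column_stack([np.ones(len(Y))] + regressors)
    return np.linalg.solve(X.T @ X, X.T @ Y)[1]

print(slope_on_x1([X1], Y))      # roughly 2 + 3 * Cov(X1, Z) / Var(X1), about 3.46: biased
print(slope_on_x1([X1, Z], Y))   # close to the true value 2 once Z is included
```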
9 Multicollinearity
Perfect multicollinearity refers to the situation in which one of the regressors is a perfect linear function of the other regressors.
• In the terminology of linear algebra, perfect multicollinearity means that the vectors
of regressors are linearly dependent.
• That is, the vector of a regressor can be expressed as a linear combination of vectors
of the other regressors.
Remember that the matrix of regressors X can be written in terms of column vectors as
X = [ι, X1 , X2 , . . . , Xk ]
That the k + 1 column vectors are linearly dependent means that there exists some nonzero (k + 1) × 1 vector β = [β0 , β1 , . . . , βk ]′ such that

β0 ι + β1 X1 + · · · + βk Xk = 0
• If X does not have full column rank, then X′X is singular, that is, the inverse of X′X does not exist. Therefore, we can state the assumption of requiring no perfect multicollinearity in another way as assuming that X has full column rank.
• If X′X is not invertible, the OLS estimator based on the formula β̂ = (X′X)⁻¹X′Y does not exist.
Remember that perfect multicollinearity occurs when one regressor can be expressed as a linear combination of the other regressors. This problem is a logical error made when the researcher sets up the regression model: the researcher includes redundant regressors that provide the same information that one regressor alone can sufficiently provide.
Suppose we start with the model

Y = β0 + β1 X1 + β2 X2 + u

and we want to add a new variable Z to this model. The following choices of Z cause perfect multicollinearity:
• Z = aX1 or Z = bX2
• Z = 1 − aX1
• Z = aX1 + bX2
However, we can add a Z that is not a linear function of X1 or X2 such that there is no
perfect multicollinearity problem. For example,
• Z = X12
• Z = ln X1
• Z = X1 X2
The dummy variable trap is a good case of perfect multicollinearity that a modeler often
encounters. Recall that a binary variable (or dummy variable) Di , taking values
of one or zero, can be used in a regression model to distinguish two mutually exclusive
groups of samples, for instance, the male and the female. In fact, dummy variables can
be constructed to represent more than two groups and be used in multiple regression to
examine the difference between these groups.
Suppose that we have a data set composed of people from four ethnic groups: White, African American, Hispanic, and Asian, and we want to estimate a regression model to see whether wages among these four groups are different. We may (mistakenly, as we will see) set up a multiple regression model as follows:

Wagei = β0 + β1 Whitei + β2 Africani + β3 Hispanici + β4 Asiani + ui    (28)

where Whitei is a dummy variable that equals 1 if the ith observation is white and 0 otherwise, and similarly for Africani , Hispanici , and Asiani .
A concrete example
To be concrete, suppose we have four observations: Chuck, Mike, Juan, and Li, who are
White, African American, Hispanic, and Asian, respectively. Then the dummy variables
are
$$White = \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \quad
African = \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \end{pmatrix}, \quad
Hispanic = \begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \end{pmatrix}, \quad
Asian = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}$$
However, when we construct a model like Equation (28), we fall into the dummy variable trap and suffer from perfect multicollinearity. This is because the model has a constant term, β0 × ι, and ι is exactly the sum of all the dummy variables. That is,

$$\begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} = White + African + Hispanic + Asian$$

so the columns of X are linearly dependent. For a white worker, for example, the model reads Wage = β0 + β1 + u, so β0 and β1 cannot be separately estimated.
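A minimal numerical illustration of the trap (my own sketch, using the four observations above): with the intercept and all four dummies the design matrix loses full column rank, so X′X is singular; dropping one dummy restores full rank.

```python
import numpy as np

white    = np.array([1, 0, 0, 0])
african  = np.array([0, 1, 0, 0])
hispanic = np.array([0, 0, 1, 0])
asian    = np.array([0, 0, 0, 1])
iota     = np.ones(4)

# Intercept plus all four dummies: iota = white + african + hispanic + asian,
# so the columns are linearly dependent and X'X is singular.
X_trap = np.column_stack([iota, white, african, hispanic, asian])
print(np.linalg.matrix_rank(X_trap))        # 4, not 5: X does not have full column rank

# Keeping the intercept and dropping one dummy (the reference group) fixes the problem.
X_ok = np.column_stack([iota, african, hispanic, asian])
print(np.linalg.matrix_rank(X_ok))          # 4 = number of columns: full column rank
```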
To avoid the dummy variable trap, we can use either of the following two methods: (1) drop the constant term and include all four dummy variables, or (2) keep the constant term and drop one of the dummy variables, which serves as the reference group.
The difference between these two methods lies in how we interpret the coefficients on
dummy variables.
With the first method, the model is

Wagei = β1 Whitei + β2 Africani + β3 Hispanici + β4 Asiani + ui

For a white worker the model reads Wage = β1 + u, so β1 is the population mean wage of whites, that is, β1 = E(Wage|White = 1). Similarly, β2 , β3 , and β4 are the population mean wages of African Americans, Hispanics, and Asians, respectively.
With the second method, if we drop the dummy variable for white people, then the model becomes

Wagei = β1 + β2 Africani + β3 Hispanici + β4 Asiani + ui

and the constant term β1 is just the population mean wage of whites, that is, β1 = E(Wage|White = 1). For an African American worker the model reads Wage = β1 + β2 + u, so that β2 = E(Wage|African = 1) − E(Wage|White = 1), the difference in population mean wages between African Americans and whites.
Similarly, β3 and β4 are the differences in population mean wages between Hispanics and whites and between Asians and whites, respectively.
Therefore, when we adopt the second method and drop the dummy variable for the reference group, the coefficients on the other dummy variables represent the differences in population means between each group of interest and the reference group.
Imperfect multicollinearity arises when two or more regressors are highly, but not perfectly, correlated.

• Imperfect multicollinearity does not affect the unbiasedness of the OLS estimators. However, it does affect their efficiency, i.e., the variance of the OLS estimators.
Consider a regression model with two regressors,

Y = β0 + β1 X1 + β2 X2 + u    (31)

By the FWL theorem, estimating Equation (31) gives the same OLS estimators of β1 and β2 as estimating the following model in the demeaned variables (e.g., yi = Yi − Ȳ and x1i = X1i − X̄1 ),

y = β1 x1 + β2 x2 + v    (32)
Suppose that X1 and X2 are correlated so that their correlation coefficient |ρ12 | > 0. And
the square of the sample correlation coefficient is
$$r_{12}^2 = \frac{\left(\sum_i x_{1i} x_{2i}\right)^2}{\sum_i x_{1i}^2 \sum_i x_{2i}^2}
= \frac{\left[\sum_i (X_{1i} - \bar X_1)(X_{2i} - \bar X_2)\right]^2}{\sum_i (X_{1i} - \bar X_1)^2 \sum_i (X_{2i} - \bar X_2)^2} \qquad (33)$$
• β̂ is still unbiased since the assumption of E(u|X) = 0 holds and so does E(v|x) = 0.
• The variance of β̂1 , which is the first diagonal element of σu² (x′x)⁻¹ , is affected by r12 . To see this, we write Var(β̂1 |x) explicitly as

$$\mathrm{Var}(\hat\beta_1|\mathbf{x})
= \frac{\sigma_u^2 \sum_i x_{2i}^2}{\sum_i x_{1i}^2 \sum_i x_{2i}^2 - \left(\sum_i x_{1i} x_{2i}\right)^2}
= \frac{\sigma_u^2 \sum_i x_{2i}^2}{\sum_i x_{1i}^2 \sum_i x_{2i}^2\left(1 - \dfrac{\left(\sum_i x_{1i} x_{2i}\right)^2}{\sum_i x_{1i}^2 \sum_i x_{2i}^2}\right)}
= \frac{\sigma_u^2}{\sum_i x_{1i}^2}\,\frac{1}{1 - r_{12}^2}$$
• The variance inflation factor (VIF) is a commonly used indicator for detecting multicollinearity. Its definition is

VIF = 1 / (1 − r²12 )

The smaller the VIF for a regressor, the less severe the problem of multicollinearity. However, there is no widely accepted cut-off value for the VIF. A VIF > 10 for a regressor is often seen as an indication of multicollinearity, but this rule of thumb cannot always be trusted.
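A minimal sketch (my own illustration with simulated data) of computing r²12 and the VIF for two highly correlated regressors:

```python
import numpy as np

def vif(x1, x2):
    """Variance inflation factor 1 / (1 - r12^2) for a pair of regressors."""
    r12 = np.corrcoef(x1, x2)[0, 1]
    return 1.0 / (1.0 - r12 ** 2)

rng = np.random.default_rng(6)
n = 1_000
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)   # x2 is highly correlated with x1
print(vif(x1, x2))                           # a large value, signalling multicollinearity
```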
Possible remedies for multicollinearity
• Include more observations, in the hope that the variation in the regressors widens, i.e., that Σi (X1i − X̄1 )² increases.
• Drop the variable(s) that are highly correlated with other regressors. Notice that by doing this we run the risk of omitted variable bias. There is always a trade-off between including all relevant regressors and keeping the regression model parsimonious.³
³ The word "parsimonious" in econometrics means that we always want to make the model as concise as possible without any redundant variables included.