
Lecture 8: Linear Regression with Multiple Regressors

Zheng Tian

Contents
1 Introduction

2 The Multiple Regression Model

3 The OLS Estimator in Multiple Regression

4 Measures of Fit in Multiple Regression

5 The Frisch-Waugh-Lovell Theorem

6 The Least Squares Assumptions in Multiple Regression

7 The Statistical Properties of the OLS Estimators in Multiple Regression

8 The Omitted Variable Bias

9 Multicollinearity

1 Introduction

1.1 Overview

This lecture extends the simple regression model to the multiple regression model. Many
aspects of multiple regression parallel those of regression with a single regressor. The
coefficients of the multiple regression model can be estimated using the OLS estimation
method. The algebraic and statistical properties of the OLS estimators of the multiple
regression are also similar to those of the simple regression. However, there are some new
concepts, such as the omitted variable bias and multicollinearity, that deepen our understanding of the OLS estimation.

1.2 Learning goals

• Be able to set up a multiple regression model with matrix notation.

• Understand the meaning of holding other things constant.

• Estimate the multiple regression model by OLS.

• Understand the Frisch-Waugh-Lovell theorem.

• Be able to detect omitted variable bias and multicollinearity.

1.3 Readings

• Introduction to Econometrics by Stock and Watson. Read thoroughly Chapter 6 and Sections 18.1 and 18.2.

• Introductory Econometrics: a Modern Approach by Wooldridge. Chapter 3.

2 The Multiple Regression Model

2.1 The problem of a simple linear regression

In the last two lectures, we used a simple linear regression model to examine the effect of class size on test scores in California elementary school districts. The simple linear regression model with only one regressor is

TestScore = β0 + β1 × STR + OtherFactors

Is this model adequate to characterize the determination of test scores?

Obviously, it ignores too many other important factors. All these other factors are lumped into a single term, OtherFactors, which corresponds to the error term, ui, in the regression model.

What are possible other important factors?

• School district characteristics: average income level, demographic components

• School characteristics: teachers’ quality, school buildings

• Student characteristics: family economic conditions, individual ability

Percentage of English learners as an example

The percentage of English learners in a school district could be a relevant and important determinant of test scores, which is omitted in the simple regression model.

• How can it affect the estimate of the effect of student-teacher ratios on test scores?

– The districts with high percentages of English learners tend to have large student-teacher ratios. That is, these two variables are correlated, with a correlation coefficient of 0.17.

– The higher the percentage of English learners a district has, the lower the test scores its students tend to earn.

– In the simple regression model, the estimated negative coefficient on the student-teacher ratio could capture not only the negative influence of class size but also that of the percentage of English learners.

– In the terminology of statistics, the magnitude of the coefficient on the student-teacher ratio is overestimated.

– Generally, we commit an omitted variable bias by setting up a simple regression model. We will explore the omitted variable bias in the last section of this lecture.

• Solutions to the omitted variable bias

We can include these important but ignored variables, like the percentage of English learners (PctEL), in the regression model:

TestScorei = β0 + β1 STRi + β2 PctELi + OtherFactorsi

A regression model with more than one regressor is called a multiple regression model.

2.2 A multiple regression model

The general form of a multiple regression model is

Yi = β0 + β1 X1i + β2 X2i + · · · + βk Xki + ui , i = 1, . . . , n (1)

where

• Yi is the ith observation on the dependent variable.

• X1i , X2i , . . . , Xki are the ith observation on each of the k regressors.

• ui is the error term associated with the ith observation, representing all other factors
that are not included in the model.

• The population regression line (or population regression function) is the relationship
that holds between Y and X on average in the population

E(Yi |X1i , . . . , Xki ) = β0 + β1 X1i + · · · + βk Xki

• β1, . . . , βk are the coefficients on the corresponding Xi, i = 1, . . . , k. β0 is the intercept, which can also be thought of as the coefficient on a regressor X0i that equals 1 for all observations.

– Including X0i , there are k + 1 regressors in the multiple regression model.

– The linear regression model with a single regressor is in fact a multiple regression model with two regressors, 1 and X.

2.3 The interpretation of βi

Holding other things constant

We can suppress the subscript i in Equation (1) so that we re-write it as

Y = β0 + β1 X1 + · · · + βk Xk + u (2)

In Equation (2), the coefficient βi on a regressor Xi, for i = 1, . . . , k, measures the effect on Y of a unit change in Xi, holding other X constant.

Suppose we have two regressors, X1 and X2, and we are interested in the effect of X1 on Y. We can let X1 change by ∆X1 while holding X2 constant. Then the new value of Y is

Y + ∆Y = β0 + β1 (X1 + ∆X1 ) + β2 X2

Subtracting Y = β0 + β1 X1 + β2 X2 , we have ∆Y = β1 ∆X1 . That is

β1 = ∆Y/∆X1,    holding X2 constant

Partial effect

If Y and the Xi for i = 1, . . . , k are continuous and differentiable variables, then from Equation (2) we know that βi is simply the partial derivative of Y with respect to Xi. That is,

βi = ∂Y/∂Xi

By the definition of a partial derivative, βi is just the effect of a marginal change in Xi on
Y holding other X constant.

2.4 The matrix notation of a multiple regression model

Consider the matrix notation as a way to organize data

When we save the data set of California school districts in Excel, it is saved in a spreadsheet
as shown in Figure 1.

Figure 1: The California data set in a spreadsheet

Each row represents an observation of all variables pertaining to a school district, and
each column represents a variable with all observations. This format of data display can
be concisely denoted using vectors and a matrix.

Let us first define the following vectors and matrices:

$$
\mathbf{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad
\mathbf{X} = \begin{pmatrix} 1 & X_{11} & \cdots & X_{k1} \\ 1 & X_{12} & \cdots & X_{k2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{1n} & \cdots & X_{kn} \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_n' \end{pmatrix}, \quad
\mathbf{u} = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}, \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}
$$

• Y is an n × 1 vector of n observations on the dependent variable.

• X is an n × (k + 1) matrix of n observations on k + 1 regressors which include the
intercept term as a regressor of 1’s.

• xi is a (k + 1) × 1 vector of the ith observation on all (k + 1) regressors. Thus, xi' denotes the ith row of X.

• β is a (k + 1) × 1 vector of the (k + 1) regression coefficients.

• u is an n × 1 vector of the n error terms.

Write a multiple regression model with matrix notation

• The multiple regression model for one observation

The multiple regression model in Equation (1) for the ith observation can be written
as
Yi = xi'β + ui,    i = 1, . . . , n    (3)

• The multiple regression model for all observations

Stacking all n observations in Equation (3) yields the multiple regression model in
matrix form:
Y = Xβ + u (4)

X can also be written in terms of column vectors as

X = [X0 , X1 , . . . , Xk ]

where Xi = [Xi1, Xi2, . . . , Xin]' is an n × 1 vector of n observations on the ith regressor. X0 is a vector of 1s, that is, X0 = [1, 1, . . . , 1]'. More often, we use ι to denote such a vector of 1s.¹

Thus, Equation (4) can be re-written as

Y = β0 ι + β1 X1 + · · · + βk Xk + u (5)
¹ ι has the following properties: (1) ι'x = Σ_{i=1}^n xi for an n × 1 vector x; (2) ι'ι = n and (ι'ι)⁻¹ = 1/n; (3) (ι'ι)⁻¹ι'x = x̄; and (4) ι'Xι = Σ_{i=1}^n Σ_{j=1}^n xij for an n × n matrix X.
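
As a concrete illustration of this notation, here is a minimal sketch (assuming NumPy is available; the numbers are made up for illustration and are not the California data set) that packs a small data table into the vectors and matrix defined above, with ι as the first column of X, and checks one of the ι properties from the footnote.

```python
import numpy as np

# Toy data table: each row is one district (observation), columns are variables
# (the numbers are invented for illustration only)
test_score = np.array([650.0, 660.5, 672.3, 640.8])   # Y
str_ = np.array([19.5, 20.1, 18.4, 22.0])             # X1: student-teacher ratio
pct_el = np.array([10.0, 5.5, 30.2, 0.0])             # X2: percentage of English learners

Y = test_score                                         # the n x 1 vector of Y
iota = np.ones(len(Y))                                 # the vector of 1s
X = np.column_stack([iota, str_, pct_el])              # n x (k+1) matrix; rows are x_i'

# Footnote property (3): (iota' iota)^{-1} iota' Y equals the sample mean of Y
print(iota @ Y / (iota @ iota), Y.mean())              # identical numbers
```
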
3 The OLS Estimator in Multiple Regression

3.1 The OLS estimator

The minimization problem

The idea of the ordinary least squares estimation for a multiple regression model is exactly
the same as for a simple regression model. The OLS estimators of the multiple regression
model are obtained by minimizing the sum of the squared prediction mistakes.

Let b = [b0, b1, . . . , bk]' be an estimator of β = [β0, β1, . . . , βk]'. The predicted value of Yi is

Ŷi = b0 + b1 X1i + · · · + bk Xki = xi'b,    i = 1, . . . , n

or in matrix notation
Ŷ = Xb

The prediction mistakes made with b, called the residuals, are

ûi = Yi − b0 − b1 X1i − · · · − bk Xki = Yi − xi'b

or in matrix notation, the residual vector is

û = Y − Xb

Then the sum of the squared prediction mistakes (residuals) is

$$
\begin{aligned}
S(\mathbf{b}) = S(b_0, b_1, \ldots, b_k) &= \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - \cdots - b_k X_{ki})^2 \\
&= \sum_{i=1}^{n} (Y_i - \mathbf{x}_i'\mathbf{b})^2 = (\mathbf{Y} - \mathbf{X}\mathbf{b})'(\mathbf{Y} - \mathbf{X}\mathbf{b}) \\
&= \hat{\mathbf{u}}'\hat{\mathbf{u}} = \sum_{i=1}^{n} \hat{u}_i^2
\end{aligned}
$$

The OLS estimator is the solution to the following minimization problem:

min_b S(b) = û'û    (6)

The OLS estimator of β as a solution to the minimization problem

The formula for the OLS estimator is obtained by taking the derivative of the sum of
squared prediction mistakes, S(b0 , b1 , . . . , bk ), with respect to each coefficient, setting these
derivatives to zero, and solving for the estimator β̂.

The derivative of S(b0 , . . . , bk ) with respect to bj is

$$
\frac{\partial}{\partial b_j} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - \cdots - b_k X_{ki})^2
= -2 \sum_{i=1}^{n} X_{ji} (Y_i - b_0 - b_1 X_{1i} - \cdots - b_k X_{ki}) = 0
$$

There are k + 1 such equations for j = 0, . . . , k. Solving this system of equations, we obtain the OLS estimator β̂ = (β̂0, . . . , β̂k)'.

Using matrix notation, the formula for the OLS estimator β̂ is

β̂ = (X'X)⁻¹X'Y    (7)

To prove Equation (7), we need to use some results of matrix calculus:

$$
\frac{\partial \mathbf{a}'\mathbf{x}}{\partial \mathbf{x}} = \mathbf{a}, \qquad
\frac{\partial \mathbf{x}'\mathbf{a}}{\partial \mathbf{x}} = \mathbf{a}, \qquad \text{and} \qquad
\frac{\partial \mathbf{x}'\mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = (\mathbf{A} + \mathbf{A}')\mathbf{x} \qquad (8)
$$

When A is symmetric, ∂(x'Ax)/∂x = 2Ax.

Proof of Equation (7). The sum of squared prediction mistakes is

S(b) = û'û = Y'Y − b'X'Y − Y'Xb + b'X'Xb = Y'Y − 2b'X'Y + b'X'Xb

The first-order condition for minimizing S(b) with respect to b is

−2X'Y + 2X'Xb = 0    (9)

Then

b = (X'X)⁻¹X'Y

given that X'X is invertible.

Note that Equation (9) is a system of k + 1 equations.
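
To make the matrix formula concrete, the following sketch (assuming NumPy; the data are simulated and the variable names are illustrative) builds X with a column of 1s and computes β̂ from the normal equations X'Xb = X'Y. Solving the normal equations is numerically preferable to forming (X'X)⁻¹ explicitly, though both give the estimator in Equation (7).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Simulated data from Y = 1 + 2*X1 - 3*X2 + u (coefficients are illustrative)
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
u = rng.normal(size=n)
Y = 1 + 2 * X1 - 3 * X2 + u

# Build the n x (k+1) regressor matrix with an intercept column of 1s
X = np.column_stack([np.ones(n), X1, X2])

# OLS estimator: beta_hat solves the normal equations X'X b = X'Y,
# which is the same as beta_hat = (X'X)^{-1} X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)  # approximately [1, 2, -3]
```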

3.2 Example: the OLS estimator of β̂1 in a simple regression model

Let us take a simple linear regression model as an example. The simple linear regression model written in matrix notation is

Y = β0 ι + β1 X1 + u = Xβ + u

where

$$
\mathbf{Y} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \quad
\mathbf{X} = \begin{pmatrix} \boldsymbol{\iota} & \mathbf{X}_1 \end{pmatrix} = \begin{pmatrix} 1 & X_{11} \\ \vdots & \vdots \\ 1 & X_{1n} \end{pmatrix}, \quad
\mathbf{u} = \begin{pmatrix} u_1 \\ \vdots \\ u_n \end{pmatrix}, \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}
$$

Let’s get the components in Equation (7) step by step.

First, the most important part is (X'X)⁻¹.

$$
\mathbf{X}'\mathbf{X} = \begin{pmatrix} \boldsymbol{\iota}' \\ \mathbf{X}_1' \end{pmatrix} \begin{pmatrix} \boldsymbol{\iota} & \mathbf{X}_1 \end{pmatrix}
= \begin{pmatrix} \boldsymbol{\iota}'\boldsymbol{\iota} & \boldsymbol{\iota}'\mathbf{X}_1 \\ \mathbf{X}_1'\boldsymbol{\iota} & \mathbf{X}_1'\mathbf{X}_1 \end{pmatrix}
= \begin{pmatrix} n & \sum_{i=1}^{n} X_{1i} \\ \sum_{i=1}^{n} X_{1i} & \sum_{i=1}^{n} X_{1i}^2 \end{pmatrix}
$$

Recall that the inverse of a 2 × 2 matrix can be calculated as follows

$$
\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}^{-1}
= \frac{1}{a_{11}a_{22} - a_{12}a_{21}} \begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}
$$

Thus, the inverse of X'X is

$$
(\mathbf{X}'\mathbf{X})^{-1} = \frac{1}{n\sum_{i=1}^{n} X_{1i}^2 - \left(\sum_{i=1}^{n} X_{1i}\right)^2}
\begin{pmatrix} \sum_{i=1}^{n} X_{1i}^2 & -\sum_{i=1}^{n} X_{1i} \\ -\sum_{i=1}^{n} X_{1i} & n \end{pmatrix}
$$

Next, we compute X'Y.

$$
\mathbf{X}'\mathbf{Y} = \begin{pmatrix} \boldsymbol{\iota}' \\ \mathbf{X}_1' \end{pmatrix} \mathbf{Y}
= \begin{pmatrix} \boldsymbol{\iota}'\mathbf{Y} \\ \mathbf{X}_1'\mathbf{Y} \end{pmatrix}
= \begin{pmatrix} \sum_{i=1}^{n} Y_i \\ \sum_{i=1}^{n} X_{1i} Y_i \end{pmatrix}
$$

Finally, we compute β̂ = (X'X)⁻¹X'Y, which is

$$
\begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix}
= \frac{1}{n\sum_{i=1}^{n} X_{1i}^2 - \left(\sum_{i=1}^{n} X_{1i}\right)^2}
\begin{pmatrix} \sum_{i=1}^{n} X_{1i}^2 & -\sum_{i=1}^{n} X_{1i} \\ -\sum_{i=1}^{n} X_{1i} & n \end{pmatrix}
\begin{pmatrix} \sum_{i=1}^{n} Y_i \\ \sum_{i=1}^{n} X_{1i} Y_i \end{pmatrix}
= \frac{1}{n\sum_{i=1}^{n} X_{1i}^2 - \left(\sum_{i=1}^{n} X_{1i}\right)^2}
\begin{pmatrix} \sum_{i=1}^{n} X_{1i}^2 \sum_{i=1}^{n} Y_i - \sum_{i=1}^{n} X_{1i} \sum_{i=1}^{n} X_{1i}Y_i \\[4pt] n\sum_{i=1}^{n} X_{1i}Y_i - \sum_{i=1}^{n} X_{1i} \sum_{i=1}^{n} Y_i \end{pmatrix}
$$

Therefore, β̂1 is the second element of the vector pre-multiplied by the fraction, that is,

$$
\hat\beta_1 = \frac{n\sum_{i=1}^{n} X_{1i}Y_i - \sum_{i=1}^{n} X_{1i}\sum_{i=1}^{n} Y_i}{n\sum_{i=1}^{n} X_{1i}^2 - \left(\sum_{i=1}^{n} X_{1i}\right)^2}
= \frac{\sum_{i=1}^{n} (X_{1i} - \bar{X}_1)(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_{1i} - \bar{X}_1)^2}
$$

It follows that

$$
\hat\beta_0 = \frac{\sum_{i=1}^{n} X_{1i}^2 \sum_{i=1}^{n} Y_i - \sum_{i=1}^{n} X_{1i} \sum_{i=1}^{n} X_{1i}Y_i}{n\sum_{i=1}^{n} X_{1i}^2 - \left(\sum_{i=1}^{n} X_{1i}\right)^2} = \bar{Y} - \hat\beta_1 \bar{X}_1
$$
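
As a quick numerical check of this derivation (a sketch assuming NumPy, with simulated data), the deviation-from-the-mean formulas for β̂1 and β̂0 agree with the general matrix formula:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X1 = rng.normal(size=n)
Y = 0.5 + 1.5 * X1 + rng.normal(size=n)

# Deviation-from-the-mean formulas for the simple regression slope and intercept
beta1_hat = np.sum((X1 - X1.mean()) * (Y - Y.mean())) / np.sum((X1 - X1.mean()) ** 2)
beta0_hat = Y.mean() - beta1_hat * X1.mean()

# The general matrix formula (X'X)^{-1} X'Y gives the same numbers
X = np.column_stack([np.ones(n), X1])
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.allclose(beta_hat, [beta0_hat, beta1_hat]))  # True
```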

3.3 Application to Test Scores and the Student-Teacher Ratio

Now we can apply the OLS estimation method for multiple regression to the California school district application. Recall that the estimated simple linear regression model is

$$\widehat{TestScore} = 698.9 - 2.28 \times STR$$

Since we are concerned that the estimated coefficient on STR may be overestimated when the percentage of English learners in the districts is not taken into account, we include this new variable in the multiple regression model to control for the effect of English learners, yielding a new estimated regression model:

$$\widehat{TestScore} = 686.0 - 1.10 \times STR - 0.65 \times PctEL$$

• The interpretation of the new estimated coefficient on STR is, holding the per-
centage of English learners constant, a unit decrease in STR is estimated to
increase test scores by 1.10 points.

• We can also interpret the estimated coefficient on PctEL: holding STR constant, a one-unit decrease in PctEL is estimated to increase test scores by 0.65 points.

• The magnitude of the negative effect of STR on test scores in the multiple regression is approximately half as large as when STR is the only regressor, which confirms our concern that the simple linear regression model omits an important variable.

4 Measures of Fit in Multiple Regression

4.1 The standard error of the regression (SER)

The standard error of the regression (SER) estimates the standard deviation of the error term u. In multiple regression, the SER is

$$
SER = s_{\hat{u}}, \quad \text{where} \quad s_{\hat{u}}^2 = \frac{1}{n-k-1}\sum_{i=1}^{n} \hat{u}_i^2 = \frac{\hat{\mathbf{u}}'\hat{\mathbf{u}}}{n-k-1} = \frac{SSR}{n-k-1} \qquad (10)
$$

Here SSR is divided by (n − k − 1) because there are (k + 1) coefficients to be estimated from a sample of n observations.

4.2 R²

The definition of R² in multiple regression models

As in the regression model with a single regressor, we can define TSS, ESS, and SSR in the multiple regression model.

• The total sum of squares (TSS): TSS = Σ_{i=1}^n (Yi − Ȳ)²

• The explained sum of squares (ESS): ESS = Σ_{i=1}^n (Ŷi − Ȳ)²

• The sum of squared residuals (SSR): SSR = Σ_{i=1}^n ûi²

Let yi be the deviation of Yi from its sample mean, that is, yi = Yi − Ȳ, i = 1, . . . , n. In matrix notation, we have

$$
\boldsymbol{\iota} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}, \quad
\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} - \begin{pmatrix} \bar{Y} \\ \bar{Y} \\ \vdots \\ \bar{Y} \end{pmatrix} = \mathbf{Y} - \bar{Y}\boldsymbol{\iota}
$$

Therefore, y is the vector of deviations from the mean of Yi, i = 1, . . . , n. Similarly, we can define the deviations from the mean of Ŷi, i = 1, . . . , n, as ŷ = Ŷ − Ȳι. Then we can rewrite TSS, ESS, and SSR as

TSS = y'y,    ESS = ŷ'ŷ,    and    SSR = û'û

In multiple regression, the relationship that

TSS = ESS + SSR,    or    y'y = ŷ'ŷ + û'û

still holds so that we can define R² as

$$
R^2 = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS} \qquad (11)
$$

Limitations of R²

1. R² is valid only if the regression model is estimated by OLS, since otherwise it would not be true that TSS = ESS + SSR.

2. R² defined in terms of deviations from the mean is only valid when a constant term is included in the regression. Otherwise, use the uncentered version of R², defined as

$$
R_u^2 = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS} \qquad (12)
$$

where TSS = Σ_{i=1}^n Yi² = Y'Y, ESS = Σ_{i=1}^n Ŷi² = Ŷ'Ŷ, and SSR = Σ_{i=1}^n ûi² = û'û, using the uncentered variables. Note that in a regression without a constant term, the equality TSS = ESS + SSR is still true.

3. Most importantly, R² increases whenever an additional regressor is included in a multiple regression model, unless the estimated coefficient on the added regressor is exactly zero.

Consider two regression models

Y = β0 + β1 X1 + u    (13)
Y = β0 + β1 X1 + β2 X2 + u    (14)

Since both models use the same Y, TSS must be the same. The SSR of Equation (13) is minimized with respect to β0 and β1 subject to the constraint β2 = 0, while the SSR of Equation (14) is minimized without that constraint. Hence the SSR of Equation (13) can never be smaller than that of Equation (14), and it is strictly larger whenever the OLS estimator β̂2 does not equal 0.

4.3 The adjusted R²

The definition of the adjusted R²

The adjusted R², or R̄², is a modified version of R² in Equation (11). R̄² improves on R² in the sense that it does not necessarily increase when a new regressor is added.

R̄² is defined as

$$
\bar{R}^2 = 1 - \frac{SSR/(n-k-1)}{TSS/(n-1)} = 1 - \frac{n-1}{n-k-1}\,\frac{SSR}{TSS} = 1 - \frac{s_{\hat{u}}^2}{s_Y^2} \qquad (15)
$$

• The adjustment is made by dividing SSR and TSS by their corresponding degrees of freedom, which are n − k − 1 and n − 1, respectively.

• s_û² is the sample variance of the OLS residuals, which is given in Equation (10); s_Y² is the sample variance of Y.

• The definition of R̄² in Equation (15) is valid only when a constant term is included in the regression model.

• Since (n − 1)/(n − k − 1) > 1, it is always true that R̄² < R².

• On one hand, k ↑ ⇒ SSR/TSS ↓. On the other hand, k ↑ ⇒ (n − 1)/(n − k − 1) ↑. Whether R̄² increases or decreases depends on which of these effects is stronger.

• R̄² can be negative. This happens when the regressors, taken together, reduce the sum of squared residuals by such a small amount that this reduction fails to offset the factor (n − 1)/(n − k − 1).
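
The following sketch (assuming NumPy; the data are simulated) computes the SER, R², and adjusted R² for an OLS fit, following Equations (10), (11), and (15):

```python
import numpy as np

def ols_fit_stats(X, Y):
    """OLS fit plus measures of fit. X must already contain a column of 1s."""
    n, kp1 = X.shape                                 # kp1 = k + 1 regressors incl. intercept
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    u_hat = Y - X @ beta_hat                         # residuals
    ssr = u_hat @ u_hat                              # sum of squared residuals
    tss = np.sum((Y - Y.mean()) ** 2)                # total sum of squares (centered)
    r2 = 1 - ssr / tss                               # Equation (11)
    adj_r2 = 1 - (n - 1) / (n - kp1) * ssr / tss     # Equation (15); n - k - 1 = n - kp1
    ser = np.sqrt(ssr / (n - kp1))                   # Equation (10)
    return beta_hat, r2, adj_r2, ser

# Illustrative use with simulated data
rng = np.random.default_rng(2)
n = 200
X1, X2 = rng.normal(size=n), rng.normal(size=n)
Y = 1 + 0.5 * X1 - 0.3 * X2 + rng.normal(size=n)
X = np.column_stack([np.ones(n), X1, X2])
print(ols_fit_stats(X, Y))
```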

The usefulness of R² and R̄²

• Both R² and R̄² are valid when the regression model is estimated by OLS. An R² or R̄² computed with estimators other than the OLS estimators is usually called a pseudo R².

• Their importance as measures of fit should not be overstated. We cannot rely heavily on R² or R̄² to judge whether some regressors should be included in the model or not.

5 The Frisch-Waugh-Lovell Theorem

5.1 The grouped regressors

Consider a multiple regression model

$$
Y_i = \underbrace{\beta_0 + \beta_1 X_{1i} + \cdots + \beta_{k_1} X_{k_1,i}}_{k_1+1 \text{ regressors}} \;+\; \underbrace{\beta_{k_1+1} X_{k_1+1,i} + \cdots + \beta_k X_{ki}}_{k_2 \text{ regressors}} \;+\; u_i \qquad (16)
$$

In Equation (16), among the k regressors X1, . . . , Xk, we collect the first k1 regressors and the intercept into one group and the remaining k2 = k − k1 regressors into a second group. In matrix notation, we write

Y = X1β1 + X2β2 + u    (17)

where X1 is an n × (k1 + 1) matrix composed of the intercept and the first k1 regressors in Equation (16), and X2 is an n × k2 matrix composed of the remaining k2 regressors; β1 = (β0, β1, . . . , βk1)' and β2 = (βk1+1, . . . , βk)'.

5.2 Two estimation strategies

Suppose that we are interested in β2 but not so much in β1 in Equation (17). How can we estimate β2?

The first strategy: the standard OLS estimation

We can obtain the OLS estimate of β2 from Equation (7), i.e., β̂ = (X'X)⁻¹X'Y; β̂2 is the vector consisting of the last k2 elements of β̂.

In matrix notation, we can get β̂2 from the following equation:

$$
\begin{pmatrix} \hat{\boldsymbol\beta}_1 \\ \hat{\boldsymbol\beta}_2 \end{pmatrix}
= \begin{pmatrix} \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_2 \\ \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_2 \end{pmatrix}^{-1}
\begin{pmatrix} \mathbf{X}_1'\mathbf{Y} \\ \mathbf{X}_2'\mathbf{Y} \end{pmatrix}
$$

The second strategy: stepwise OLS estimation

Alternatively, we can perform the following steps to estimate β2:

1. Regress each regressor in X2 on all regressors in X1, including the intercept, and collect the residuals from these regressions, denoted X̃2.

That is, for each regressor Xi in X2, i = k1 + 1, . . . , k, we estimate a multiple regression

Xi = γ0 + γ1 X1 + · · · + γk1 Xk1 + v

The residuals from this regression are

X̃i = Xi − γ̂0 − γ̂1 X1 − · · · − γ̂k1 Xk1

Collecting these residual vectors, we get an n × k2 matrix X̃2 = (X̃k1+1, . . . , X̃k).

2. Regress Y on all regressors in X1, denoting the residuals from this regression as Ỹ.

3. Regress Ỹ on X̃2, and obtain the estimate of β2 as (X̃2'X̃2)⁻¹X̃2'Ỹ.

5.3 The Frisch-Waugh-Lovell Theorem

The Frisch-Waugh-Lovell (FWL) theorem states that

1. the OLS estimate of β2 obtained from the second strategy and that from the first strategy are numerically identical;

2. the residuals from the regression of Ỹ on X̃2 and the residuals from Equation (17) are numerically identical.

The proof of the FWL theorem is beyond the scope of this lecture. Interested students may refer to Exercise 18.7. Understanding the meaning of this theorem is much more important than understanding the proof.

The FWL theorem provides a mathematical statement of how the multiple regression coefficients in β̂2 capture the effects of X2 on Y, controlling for the other X.

• Step 1 purges the effects of the regressors in X1 from the regressors in X2.

• Step 2 purges the effects of the regressors in X1 from Y.

• Step 3 estimates the effect of the regressors in X2 on Y using the parts of X2 and Y from which the effects of X1 have been removed.
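
A minimal numerical sketch of the theorem (assuming NumPy; the data and the partition into X1 and X2 are illustrative): the coefficient on X2 from the single full regression matches the coefficient from the residual-on-residual regression.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one regressor
X2 = rng.normal(size=(n, 1)) + 0.5 * X1[:, [1]]           # correlated with X1
Y = X1 @ np.array([1.0, 2.0]) + X2[:, 0] * (-1.5) + rng.normal(size=n)

def ols(X, Y):
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Strategy 1: one regression of Y on [X1, X2]; the coefficient on X2 is the last element
beta_full = ols(np.column_stack([X1, X2]), Y)

# Strategy 2: partial X1 out of X2 and out of Y, then regress residuals on residuals
X2_tilde = X2 - X1 @ ols(X1, X2)
Y_tilde = Y - X1 @ ols(X1, Y)
beta2_fwl = ols(X2_tilde, Y_tilde)

print(np.allclose(beta_full[-1], beta2_fwl[0]))  # True: numerically identical
```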

5.4 An example of the FWL theorem

Consider a regression model with a single regressor

Yi = β0 + β1 Xi + ui , i = 1, . . . , n

Following the estimation strategy in the FWL theorem, we can carry out the following
regressions,

1. Regress Yi on 1. That is, estimate the model

Yi = α + ei

Then, the OLS estimator of α is Ȳ and the residuals are yi = Yi − Ȳ.

2. Similarly, regress Xi on 1. Then the residuals from this regression are xi = Xi − X̄.

3. Regress yi on xi without intercept. That is, estimate the model

yi = β1 xi + vi or in matrix notation y = β1 x + v

4. We can obtain β̂1 directly by applying the formula in Equation (7). That is,

$$
\hat\beta_1 = (\mathbf{x}'\mathbf{x})^{-1}\mathbf{x}'\mathbf{y} = \frac{\sum_i x_i y_i}{\sum_i x_i^2} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}
$$

which is exactly the same as the OLS estimator of β1 in Yi = β0 + β1 Xi + ui .

6 The Least Squares Assumptions in Multiple Regression

We introduce four least squares assumptions for a multiple regression model. The first
three assumptions are the same as those in the simple regression model with just minor
modifications to suit multiple regressors. The fourth assumption is a new one.

Assumption #1 E(ui|xi) = 0. The conditional mean of ui given X1i, X2i, . . . , Xki is zero. This is the key assumption ensuring that the OLS estimators are unbiased.

Assumption #2 (Yi, xi'), i = 1, . . . , n, are i.i.d. This assumption holds automatically if the data are collected by simple random sampling.

Assumption #3 Large outliers are unlikely, i.e., 0 < E(Xji⁴) < ∞ for j = 1, . . . , k and 0 < E(Yi⁴) < ∞. That is, the dependent variable and the regressors have finite kurtosis.

Assumption #4 No perfect multicollinearity. The regressors are said to exhibit perfect multicollinearity if one of the regressors is a perfect linear function of the other regressors.

7 The Statistical Properties of the OLS Estimators in Multiple Regression

7.1 Unbiasedness and consistency

Under the least squares assumptions, the OLS estimator β̂ can be shown to be an unbiased and consistent estimator of β in the multiple regression model of Equation (4).

Unbiasedness

The OLS estimator β̂ is unbiased if E(β̂) = β.

To show the unbiasedness, we can rewrite β̂ as follows,

$$
\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X}\boldsymbol\beta + \mathbf{u}) = \boldsymbol\beta + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u} \qquad (18)
$$

Thus, the conditional expectation of β̂ is

$$
E(\hat{\boldsymbol\beta} \mid \mathbf{X}) = \boldsymbol\beta + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,E(\mathbf{u} \mid \mathbf{X}) = \boldsymbol\beta \qquad (19)
$$

in which E(u|X) = 0 from the first least squares assumption.

Using the law of iterated expectation, we have

E(β̂) = E(E(β̂|X)) = E(β) = β

Therefore, β̂ is an unbiased estimator of β.

Consistency

The OLS estimator β̂ is consistent if, as n → ∞, β̂ converges to β in probability, that is, plim_{n→∞} β̂ = β.

From Equation (18), we have

$$
\operatorname*{plim}_{n\to\infty} \hat{\boldsymbol\beta} = \boldsymbol\beta + \left(\operatorname*{plim}_{n\to\infty} \frac{\mathbf{X}'\mathbf{X}}{n}\right)^{-1} \operatorname*{plim}_{n\to\infty} \frac{\mathbf{X}'\mathbf{u}}{n}
$$

Let us first make an assumption, which is usually true, that

$$
\operatorname*{plim}_{n\to\infty} \frac{1}{n}\mathbf{X}'\mathbf{X} = \underset{(k+1)\times(k+1)}{\mathbf{Q}_X} \qquad (20)
$$

which means that as n becomes very large, X'X/n converges to a nonstochastic matrix QX with full rank (k + 1). In Chapter 18, we will see that QX = E(xi xi'), where xi' = [1, X1i, . . . , Xki] is the ith row of X.

Now let us look at plim_{n→∞} (1/n)X'u, which can be rewritten as

$$
\operatorname*{plim}_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i u_i
$$

Since E(ui|xi) = 0, we know that E(xiui) = E(xi E(ui|xi)) = 0. Also, by Assumptions #2 and #3, we know that the xiui are i.i.d. and have positive finite variance. Thus, by the law of large numbers,

$$
\operatorname*{plim}_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i u_i = E(\mathbf{x}_i u_i) = \mathbf{0}
$$

Therefore, we can conclude that

plim_{n→∞} β̂ = β

That is, β̂ is consistent.
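
A small Monte Carlo sketch (assuming NumPy; the data-generating process and coefficient values are illustrative) shows the OLS estimate settling down around the true β as the sample size grows, in line with consistency:

```python
import numpy as np

rng = np.random.default_rng(4)
beta_true = np.array([1.0, 2.0, -0.5])   # [intercept, beta1, beta2], illustrative values

def ols_estimate(n):
    """Draw a sample of size n from the model and return the OLS estimate."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    u = rng.normal(size=n)
    Y = X @ beta_true + u
    return np.linalg.solve(X.T @ X, X.T @ Y)

for n in (50, 500, 5000, 50000):
    est = ols_estimate(n)
    print(n, np.round(est, 3), "max abs error:", round(float(np.max(np.abs(est - beta_true))), 3))
```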

7.2 The Gauss-Markov theorem and efficiency

The Gauss-Markov conditions

The Gauss-Markov conditions for multiple regression are

1. E(u|X) = 0,

2. Var(u|X) = E(uu'|X) = σu² In (homoskedasticity),

3. X has full column rank (no perfect multicollinearity).

The Gauss-Markov Theorem

If the Gauss-Markov conditions hold in the multiple regression model, then the OLS estimator β̂ is more efficient than any other linear unbiased estimator β̃, in the sense that Var(β̃) − Var(β̂) is a positive semidefinite matrix. That is, the OLS estimator is BLUE.

That Var(β̃) − Var(β̂) is a positive semidefinite matrix means that for any nonzero (k + 1) × 1 vector c,

c'(Var(β̃) − Var(β̂))c ≥ 0

or we can simply write

Var(β̃) ≥ Var(β̂)

The equality holds only when β̃ = β̂.²

² The complete proof of the Gauss-Markov theorem in multiple regression is in Appendix 18.5.

Understanding the Gauss-Markov conditions

As in the regression model with a single regressor, the least squares assumptions can be summarized by the Gauss-Markov conditions as follows.

• Assumptions #1 and #2 imply that E(u|X) = 0n:

E(ui|X) = E(ui|[x1, . . . , xi, . . . , xn]') = E(ui|xi) = 0

in which the second equality follows from Assumption #2, namely that the xi, for i = 1, . . . , n, are independent across observations.

• Assumptions #1 and #2, together with the additional assumption of homoskedasticity, imply that Var(u|X) = σu² In.

For a random vector x, the variance of x is a covariance matrix defined as

Var(x) = E[(x − E(x))(x − E(x))']

which also holds for the conditional variance by replacing the expectation operator with the conditional expectation operator.

Since E(u|X) = 0, its covariance matrix, conditioned on X, is

Var(u|X) = E(uu'|X)

where

$$
\mathbf{u}\mathbf{u}' = \begin{pmatrix}
u_1^2 & u_1 u_2 & \cdots & u_1 u_n \\
u_2 u_1 & u_2^2 & \cdots & u_2 u_n \\
\vdots & \vdots & \ddots & \vdots \\
u_n u_1 & u_n u_2 & \cdots & u_n^2
\end{pmatrix}
$$

Thus, in the matrix uu',

– the conditional expectations of the diagonal elements, given X, are the conditional variances of the ui, each equal to σu² because of homoskedasticity;

– the conditional expectations of the off-diagonal elements are the covariances of ui and uj, conditioned on X. Since ui and uj are independent according to Assumption #2, E(uiuj|X) = 0.

Therefore, the conditional covariance matrix of u is

$$
\mathrm{Var}(\mathbf{u}\mid\mathbf{X}) = \begin{pmatrix}
\sigma_u^2 & 0 & \cdots & 0 \\
0 & \sigma_u^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_u^2
\end{pmatrix} = \sigma_u^2 \mathbf{I}_n
$$

Linear conditionally unbiased estimators

Any linear estimator of β can be written as

β̃ = AY = AXβ + Au

where A is a weight matrix that depends only on X, not on Y.

For β̃ to be conditionally unbiased, we must have

E(β̃|X) = AXβ + A E(u|X) = β

which holds only when AX = Ik+1 and the first Gauss-Markov condition holds.

The OLS estimator β̂ is a linear conditionally unbiased estimator with A = (X'X)⁻¹X'. Obviously, AX = Ik+1 holds for β̂.

The conditional covariance matrix of β̂

The conditional variance matrix of β̂ can be derived as follows

$$
\begin{aligned}
\mathrm{Var}(\hat{\boldsymbol\beta}\mid\mathbf{X}) &= E\left[(\hat{\boldsymbol\beta}-\boldsymbol\beta)(\hat{\boldsymbol\beta}-\boldsymbol\beta)'\mid\mathbf{X}\right] \\
&= E\left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}\,\left((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}\right)'\mid\mathbf{X}\right] \\
&= E\left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}\mathbf{u}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mid\mathbf{X}\right] \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,E(\mathbf{u}\mathbf{u}'\mid\mathbf{X})\,\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}
\end{aligned}
$$

Then, by the second Gauss-Markov condition, we have

$$
\mathrm{Var}(\hat{\boldsymbol\beta}\mid\mathbf{X}) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\sigma_u^2\mathbf{I}_n)\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = \sigma_u^2(\mathbf{X}'\mathbf{X})^{-1}
$$

The homoskedasticity-only covariance matrix of β̂ is

$$
\mathrm{Var}(\hat{\boldsymbol\beta}\mid\mathbf{X}) = \sigma_u^2(\mathbf{X}'\mathbf{X})^{-1} \qquad (21)
$$

If the homoskedasticity assumption does not hold, denote the covariance matrix of u as

Var(u|X) = Ω

Heteroskedasticity means that the diagonal elements of Ω can be different (i.e., Var(ui|X) = σi² for i = 1, . . . , n), while the off-diagonal elements are zeros, that is,

$$
\boldsymbol\Omega = \begin{pmatrix}
\sigma_1^2 & 0 & \cdots & 0 \\
0 & \sigma_2^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_n^2
\end{pmatrix}
$$

Define Σ = X'ΩX. Then the heteroskedasticity-robust covariance matrix of β̂ is

$$
\mathrm{Var}_h(\hat{\boldsymbol\beta}\mid\mathbf{X}) = (\mathbf{X}'\mathbf{X})^{-1}\boldsymbol\Sigma(\mathbf{X}'\mathbf{X})^{-1} \qquad (22)
$$
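
The sketch below (assuming NumPy; the data are simulated) computes both covariance estimates. For the robust version it plugs diag(ûi²) in for Ω, an HC0-style choice; textbook formulas sometimes add degrees-of-freedom corrections.

```python
import numpy as np

def ols_covariances(X, Y):
    """OLS estimates with homoskedasticity-only and a robust covariance estimate."""
    n, kp1 = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    u_hat = Y - X @ beta_hat

    # Homoskedasticity-only covariance, Equation (21): s^2 (X'X)^{-1}
    s2 = u_hat @ u_hat / (n - kp1)
    cov_homo = s2 * XtX_inv

    # Robust covariance, Equation (22) with Sigma_hat = X' diag(u_hat^2) X (HC0-style)
    Sigma_hat = X.T @ (u_hat[:, None] ** 2 * X)
    cov_robust = XtX_inv @ Sigma_hat @ XtX_inv
    return beta_hat, cov_homo, cov_robust

# Illustrative use with simulated heteroskedastic data
rng = np.random.default_rng(9)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
u = rng.normal(size=n) * (1 + np.abs(X[:, 1]))    # error variance depends on the regressor
Y = X @ np.array([1.0, 2.0]) + u
b, V0, V1 = ols_covariances(X, Y)
print(np.sqrt(np.diag(V0)), np.sqrt(np.diag(V1)))  # the two sets of standard errors differ
```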

7.3 The asymptotic normal distribution

In large samples, the OLS estimator β̂ has a multivariate normal asymptotic distribution,

$$
\hat{\boldsymbol\beta} \xrightarrow{\;d\;} N(\boldsymbol\beta, \boldsymbol\Sigma_{\hat\beta}) \qquad (23)
$$

where Σ_β̂ = Var(β̂|X), for which we use Equation (21) in the homoskedastic case and Equation (22) in the heteroskedastic case.

The proof of the asymptotic normal distribution and the multivariate central limit theorem
are given in Chapter 18.

8 The Omitted Variable Bias

8.1 The definition of the omitted variable bias

The omitted variable bias is the bias in the OLS estimator that arises when the included regressors, X, are correlated with omitted variables, Z, where X may include k regressors, X1, . . . , Xk, and Z may include m omitted variables, Z1, . . . , Zm. The omitted variable bias occurs when two conditions are met:

1. X is correlated with some omitted variables in Z.

2. The omitted variables are determinants of the dependent variable Y.

8.2 The reason for the omitted variable bias

Suppose that the true model is

Y = Xβ + Zγ + u (24)

in which the first least squares assumption, E(u|X, Z) = 0, holds. We further assume that Cov(X, Z) ≠ 0.

However, we mistakenly exclude Z from the regression analysis and estimate a short model

Y = Xβ + ε    (25)

Since ε represents all other factors that are not in Equation (25), including Z, and Cov(X, Z) ≠ 0, it follows that Cov(X, ε) ≠ 0, which implies that E(ε|X) ≠ 0. (Recall that in Chapter 4 we proved that E(ui|Xi) = 0 ⇒ Cov(ui, Xi) = 0; by contraposition, Cov(ui, Xi) ≠ 0 ⇒ E(ui|Xi) ≠ 0.) Therefore, Assumption #1 does not hold for the short model, which means that the OLS estimator of Equation (25) is biased.

An informal proof that the OLS estimator of Equation (25) is biased goes as follows.

The OLS estimator of Equation (25) is β̂ = (X'X)⁻¹X'Y. Plugging the true model in for Y, we have

$$
\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X}\boldsymbol\beta + \mathbf{Z}\boldsymbol\gamma + \mathbf{u}) = \boldsymbol\beta + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Z}\boldsymbol\gamma + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}
$$

Taking the expectation of β̂, conditioned on X and Z, we have

$$
E(\hat{\boldsymbol\beta} \mid \mathbf{X}, \mathbf{Z}) = \boldsymbol\beta + \underbrace{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Z}\boldsymbol\gamma}_{\text{omitted variable bias}} + \mathbf{0} \qquad (26)
$$

The second term in the equation above does not generally equal zero unless either

1. γ = 0, which means that Z are not determinants of Y in the true model, or

2. X'Z = 0, which means that X and Z are not correlated.

Therefore, if neither of these two conditions holds, β̂ from the short model is biased, and the magnitude and direction of the bias are determined by X'Zγ.

8.3 An illustration using a linear model with two regressors

Suppose the true model is

Yi = β0 + β1 X1i + β2 X2i + ui , i = 1, . . . , n

with E(ui |X1i , X2i ) = 0

However, we estimate the wrong model

Yi = β0 + β1 X1i + εi,    i = 1, . . . , n

In Lecture 5 we showed that β̂1 can be expressed as

$$
\hat\beta_1 = \beta_1 + \frac{\frac{1}{n}\sum_i (X_{1i} - \bar{X}_1)\varepsilon_i}{\frac{1}{n}\sum_i (X_{1i} - \bar{X}_1)^2}
$$

As n → ∞, the numerator of the second term converges to Cov(X1, ε) = ρ_{X1ε} σ_{X1} σ_ε and the denominator converges to σ²_{X1}, where ρ_{X1ε} is the correlation coefficient between X1i and εi. Therefore, we have

$$
\hat\beta_1 \xrightarrow{\;p\;} \beta_1 + \underbrace{\rho_{X_1\varepsilon}\,\frac{\sigma_\varepsilon}{\sigma_{X_1}}}_{\text{omitted variable bias}} \qquad (27)
$$

From Equations (26) and (27), we can summarize some facts about the omitted variable
bias:

• Omitted variable bias is a problem regardless of whether the sample size is large or small. β̂ is biased and inconsistent when there is omitted variable bias.

• Whether this bias is large or small in practice depends on |ρ_{X1ε}| or the magnitude of X'Zγ.

• The direction of the bias is determined by the sign of ρ_{X1ε} or of X'Zγ.

• One easy way to detect the existence of omitted variable bias is to check whether, when a new regressor is added, the estimated coefficients on some previously included regressors change substantially.
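
A small simulation sketch (assuming NumPy; the data-generating process and coefficients are illustrative) in which X1 is correlated with an omitted variable X2: the short regression's slope stays biased even with a very large sample, while the long regression recovers β1.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000                      # large n: the gap is driven by bias, not sampling noise
beta1, beta2 = -2.0, 1.5         # true coefficients (illustrative)

X1 = rng.normal(size=n)
X2 = 0.6 * X1 + rng.normal(size=n)      # omitted variable, correlated with X1
Y = 1.0 + beta1 * X1 + beta2 * X2 + rng.normal(size=n)

def ols(X, Y):
    return np.linalg.solve(X.T @ X, X.T @ Y)

short = ols(np.column_stack([np.ones(n), X1]), Y)        # omits X2
long = ols(np.column_stack([np.ones(n), X1, X2]), Y)     # includes X2

print("short-model slope:", round(short[1], 3))   # about beta1 + 0.6*beta2 = -1.1
print("long-model slope: ", round(long[1], 3))    # about beta1 = -2.0
```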

9 Multicollinearity

9.1 Perfect multicollinearity

Perfect multicollinearity refers to the situation in which one of the regressors is a perfect linear function of the other regressors.

• In the terminology of linear algebra, perfect multicollinearity means that the vectors
of regressors are linearly dependent.

• That is, the vector of a regressor can be expressed as a linear combination of vectors
of the other regressors.

Remember that the matrix of regressors X can be written in terms of column vectors as

X = [ι, X1 , X2 , . . . , Xk ]

where Xi = [Xi1, Xi2, . . . , Xin]' is an n × 1 vector of n observations on the ith regressor, and ι is a vector of 1s representing the constant term.

That the k + 1 column vectors are linearly dependent means that there exists some nonzero (k + 1) × 1 vector β = [β0, β1, . . . , βk]' such that

β0 ι + β1 X1 + · · · + βk Xk = 0

If the column vectors ι, X1, . . . , Xk are linearly dependent, then it follows that

• X does not have full column rank.

• If X does not have full column rank, then X'X is singular, that is, the inverse of X'X does not exist. Therefore, we can state the assumption of no perfect multicollinearity in another way: X has full column rank.

• If X'X is not invertible, the OLS estimator based on the formula β̂ = (X'X)⁻¹X'Y does not exist.
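
A quick numerical sketch (assuming NumPy; the regressors are illustrative): when one regressor is an exact linear combination of the others, X loses full column rank and X'X becomes (numerically) singular, so the OLS formula cannot be applied.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Z = 2.0 * X1 - 3.0 * X2          # exact linear combination of X1 and X2

X = np.column_stack([np.ones(n), X1, X2, Z])
print(np.linalg.matrix_rank(X))              # 3, not 4: X does not have full column rank
print(np.linalg.cond(X.T @ X))               # enormous condition number: X'X is singular
# np.linalg.inv(X.T @ X) would raise LinAlgError or return numerical garbage
```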

9.2 Examples of perfect multicollinearity

Remember that perfect multicollinearity occurs when one regressor can be expressed as a linear combination of the other regressors. This problem is a logical error made when the researcher sets up the regression model: the researcher includes redundant regressors that provide the same information a single regressor could sufficiently provide.

Possible linear combination

Suppose we have a multiple regression model

Y = β0 + β1 X1 + β2 X2 + u

And we want to add a new variable Z into this model. The following practices cause
perfect multicollinearity

• Z = aX1 or Z = bX2

• Z = 1 − aX1

• Z = aX1 + bX2

However, we can add a Z that is not a linear function of X1 or X2 such that there is no
perfect multicollinearity problem. For example,

• Z = X12

• Z = ln X1

• Z = X1 X2

9.3 The dummy variable trap

The dummy variable trap is a common case of perfect multicollinearity that a modeler often encounters. Recall that a binary variable (or dummy variable) Di, taking values of one or zero, can be used in a regression model to distinguish two mutually exclusive groups of observations, for instance, males and females. In fact, dummy variables can be constructed to represent more than two groups and used in a multiple regression to examine the differences between these groups.

Suppose that we have data on people from four ethnic groups: White, African American, Hispanic, and Asian, and we want to estimate a regression model to see whether wages differ among these four groups. We may (mistakenly, as we will see) set up a multiple regression model as follows

Wagei = β0 + β1 Whitei + β2 Africani + β3 Hispanici + β4 Asiani + ui    (28)

where Whitei is a dummy variable that equals 1 if the ith person is White and 0 otherwise, and similarly for Africani, Hispanici, and Asiani.

A concrete example

To be concrete, suppose we have four observations: Chuck, Mike, Juan, and Li, who are
White, African American, Hispanic, and Asian, respectively. Then the dummy variables
are        
1 0 0 0
       
0 1 0 0
0 , Af rican = 0 , Hispanic = 1 , Asian = 0
W hite =        
       
0 0 0 1

However, when we construct a model like Equation (28), we fall into the dummy variable trap and suffer perfect multicollinearity. This is because the model contains a constant regressor, the vector of 1s multiplying β0, which is the sum of all the dummy variables. That is,

$$
\begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} = White + African + Hispanic + Asian
$$

To see the problem, note that when the observation is Chuck, the model is

Wage = β0 + β1 + u

Estimating this model can only identify the sum β0 + β1, so we cannot get a unique solution for β1.

To avoid the dummy variable trap, we can use either of the following two methods:

1. drop the constant term

2. drop one dummy variable

The difference between these two methods lies in how we interpret the coefficients on
dummy variables.

Drop the constant term

If we drop the constant term, the model becomes

Wage = β1 White + β2 African + β3 Hispanic + β4 Asian + u    (29)

For Chuck, or any White person, the model becomes

Wage = β1 + u

Then β1 is the population mean wage of Whites, that is, β1 = E(Wage|White = 1). Similarly, β2, β3, and β4 are the population mean wages of African Americans, Hispanics, and Asians, respectively.

Drop one dummy variable

If we drop the dummy variable for White people, the model becomes

Wage = β1 + β2 African + β3 Hispanic + β4 Asian + u    (30)

For White people, the model is

Wage = β1 + u

and the constant term β1 is just the population mean wage of Whites, that is,

β1 = E(Wage|White = 1)

So we say that White people serve as the reference case in Model (30).

For African Americans, the model is

Wage = β1 + β2 + u

From it we have E(Wage|African = 1) = β1 + β2, so that

β2 = E(Wage|African = 1) − β1 = E(Wage|African = 1) − E(Wage|White = 1)

Similarly, we can get that

β3 = E(Wage|Hispanic = 1) − E(Wage|White = 1)
β4 = E(Wage|Asian = 1) − E(Wage|White = 1)

Therefore, when we adopt the second method and drop the dummy variable for the reference case, the coefficients on the other dummy variables represent the differences in population means between the group of interest and the reference group.
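
A sketch (assuming NumPy; the group labels and wages are simulated) illustrating the two remedies: with the constant dropped, the coefficients are the four group mean wages; with one dummy dropped, the intercept is the reference-group mean and the other coefficients are differences from it.

```python
import numpy as np

rng = np.random.default_rng(7)
groups = rng.integers(0, 4, size=400)                 # 0=White, 1=African American, 2=Hispanic, 3=Asian
group_means = np.array([20.0, 18.0, 19.0, 22.0])      # illustrative population mean wages
wage = group_means[groups] + rng.normal(size=400)

D = np.eye(4)[groups]                                  # the four dummy columns

def ols(X, Y):
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Remedy 1: drop the constant; the coefficients are the four group mean wages
print(ols(D, wage))

# Remedy 2: keep the constant, drop the White dummy; the intercept is the White mean
# and the remaining coefficients are differences from the White mean
X = np.column_stack([np.ones(len(wage)), D[:, 1:]])
print(ols(X, wage))
```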

9.4 Imperfect Multicollinearity

Definition of imperfect multicollinearity

Imperfect multicollinearity is a problem that arises when two or more regressors are highly correlated. Although they bear similar names, imperfect multicollinearity and perfect multicollinearity are two different concepts.

• Perfect multicollinearity is a problem of model building, resulting in a total failure to estimate the linear model.

• Imperfect multicollinearity is usually a problem with the data, arising when some regressors are highly correlated.

• Imperfect multicollinearity does not affect the unbiasedness of the OLS estimators. However, it does affect their efficiency, i.e., the variance of the OLS estimators.

An illustration using a regression model with two regressors

Suppose we have a linear regression model with two regressors.

Y = β0 + β1 X1 + β2 X2 + u (31)

where, for simplicity, u is assumed to be homoskedastic.

By the FWL theorem, estimating Equation (31) yields the same OLS estimators of β1 and β2 as estimating the following model,

y = β1 x1 + β2 x2 + v    (32)

where y = Y − Ȳι, x1 = X1 − X̄1ι, and x2 = X2 − X̄2ι; that is, y, x1, and x2 are expressed as deviations from their means. Denote x = [x1 x2] as the matrix of all regressors in Model (32).

Suppose that X1 and X2 are correlated so that their correlation coefficient satisfies |ρ12| > 0. The square of the sample correlation coefficient is

$$
r_{12}^2 = \frac{\left[\sum_i (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)\right]^2}{\sum_i (X_{1i} - \bar{X}_1)^2 \sum_i (X_{2i} - \bar{X}_2)^2}
= \frac{\left(\sum_i x_{1i} x_{2i}\right)^2}{\sum_i x_{1i}^2 \sum_i x_{2i}^2} \qquad (33)
$$

The OLS estimator of Model (32) is

β̂ = (x'x)⁻¹x'y    (34)

with the homoskedasticity-only covariance matrix

Var(β̂|x) = σu² (x'x)⁻¹    (35)

• β̂ is still unbiased since the assumption of E(u|X) = 0 holds and so does E(v|x) = 0.

• The variance of β̂1, which is the first diagonal element of σu²(x'x)⁻¹, is affected by r12. To see this, we write Var(β̂1|x) explicitly as

$$
\begin{aligned}
\mathrm{Var}(\hat\beta_1\mid\mathbf{x}) &= \frac{\sigma_u^2 \sum_i x_{2i}^2}{\sum_i x_{1i}^2 \sum_i x_{2i}^2 - \left(\sum_i x_{1i}x_{2i}\right)^2} \\
&= \frac{\sigma_u^2 \sum_i x_{2i}^2}{\sum_i x_{1i}^2 \sum_i x_{2i}^2 \left(1 - \dfrac{\left(\sum_i x_{1i}x_{2i}\right)^2}{\sum_i x_{1i}^2 \sum_i x_{2i}^2}\right)} \\
&= \frac{\sigma_u^2}{\sum_i x_{1i}^2\,(1 - r_{12}^2)}
\end{aligned}
$$

Therefore, when X1 and X2 are highly correlated, that is, when r12² gets close to 1, Var(β̂1|x) becomes very large.

• The consequence of imperfect multicollinearity is that it may lead us to wrongly fail to reject the null hypothesis that a coefficient is zero in a t-test.

• The variance inflation factor (VIF) is a commonly used indicator for detecting multicollinearity. It is defined as

VIF = 1 / (1 − r12²)

The smaller the VIF for a regressor, the less severe the problem of multicollinearity. However, there is no widely accepted cut-off value of the VIF for detecting multicollinearity. A VIF > 10 for a regressor is often seen as an indication of multicollinearity, but we cannot always trust this rule of thumb.
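
The sketch below (assuming NumPy; the data are simulated) computes the VIF for each regressor in the general case, using the R² from regressing that regressor on all the other regressors plus an intercept; with two regressors this reduces to 1/(1 − r12²).

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (X should not include the intercept)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        # Regress column j on the remaining columns plus an intercept
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b = np.linalg.solve(others.T @ others, others.T @ X[:, j])
        resid = X[:, j] - others @ b
        r2_j = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = 1 / (1 - r2_j)                 # VIF_j = 1 / (1 - R_j^2)
    return out

# Illustrative use: X2 is highly correlated with X1, so both VIFs are large
rng = np.random.default_rng(8)
X1 = rng.normal(size=500)
X2 = X1 + 0.1 * rng.normal(size=500)
print(vif(np.column_stack([X1, X2])))
```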

Possible remedies for multicollinearity

• Include more observations, in the hope that the variation in X widens, i.e., that Σi (X1i − X̄1)² increases.

• Drop the variable(s) that are highly correlated with other regressors. Notice that by doing this we run the risk of omitted variable bias. There is always a trade-off between including all relevant regressors and keeping the regression model parsimonious.³

³ The word "parsimonious" in econometrics means that we want to make the model as concise as possible, without any redundant variables included.
