Topic 3 - endogeneity (1)

Econ 7IE

Uploaded by

saien1moodley5
ECON7IE

Topic 3
Endogeneity
WHY IS THIS TOPIC IMPORTANT?
• We commonly need to estimate models where:
– One or more important factors cannot be measured
– Some of the data may be inaccurate
– There are multiple causal relationships, not just X → Y

• These are all examples of the presence of endogeneity


– Its effect on the reliability of regression model results is a key issue in empirical
research

• In this topic, we’ll learn what endogeneity is, how it affects the reliability of OLS results,
and what methods can be used to overcome it

Part 1

The Problem of Endogeneity

We consider the case of an endogenous explanatory variable, which arises when one of the Classical Linear Regression Model assumptions is violated.
1.1 DEFINITION OF ENDOGENEITY
• Consider the regression model
𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + ⋯ + 𝛽𝑘 𝑋𝑘 + 𝑢
• If any 𝑋𝑗 is correlated with 𝑢 for any reason, then:
– 𝑋𝑗 is an endogenous explanatory variable
• Three key statistical / economic reasons why 𝑋𝑗 and 𝑢 may be correlated:
a) Omitted variables that are correlated with 𝑋𝑗
b) Measurement error in 𝑋𝑗
c) Simultaneity (or bi-directional causality) between 𝑋𝑗 and 𝑌

• We will:
– Try to identify sources of endogeneity in models
– Derive expressions for the consequences of endogeneity
– See how we can estimate models to overcome this issue
a) Omitted variable
• An important explanatory variable is omitted from the regression
– And it is correlated with any of the included X variables
• Why might a variable be omitted from a model?
• E.g.
𝑒𝑎𝑟𝑛𝑖𝑛𝑔𝑠 = 𝛽1 + 𝛽2 𝑦𝑟𝑠𝑐ℎ𝑜𝑜𝑙 + 𝛽3 𝑎𝑏𝑖𝑙𝑖𝑡𝑦 + ⋯ + 𝑢

b) Measurement error
• An explanatory variable is measured with error i.e. is inaccurate:
– Some variables are inherently difficult to measure e.g. income
– May need to use a proxy when true variable is unavailable
• E.g.
𝑙𝑒𝑖𝑠𝑢𝑟𝑒 𝑡𝑖𝑚𝑒 = 𝛽1 + 𝛽2 ℎℎ𝑖𝑛𝑐𝑜𝑚𝑒 + 𝛽3 𝑚𝑎𝑙𝑒 + ⋯ + 𝑢
c) Simultaneity
• One (or more) explanatory variables are jointly determined with Y
– i.e. X affects Y, and Y affects X
• Common in macro models
• Also occurs with many other complex economic processes
• E.g. effect of inflation on trade openness:
𝑜𝑝𝑒𝑛𝑛𝑒𝑠𝑠 = 𝛽1 + 𝛽2 𝑖𝑛𝑓𝑙𝑎𝑡𝑖𝑜𝑛 + 𝛽3 𝑙𝑛𝑝𝑐𝑖𝑛𝑐 + 𝛽4 𝑙𝑛𝑙𝑎𝑛𝑑 + 𝑢1
𝑖𝑛𝑓𝑙𝑎𝑡𝑖𝑜𝑛 = 𝛼1 + 𝛼2 𝑜𝑝𝑒𝑛𝑛𝑒𝑠𝑠 + 𝛼3 𝑙𝑛𝑝𝑐𝑖𝑛𝑐 + 𝑢2
• Possible to suspect/identify simultaneity even when only given one equation:
– If we suspect feedback from Y to X
• All demand and supply models suffer from simultaneity:
– Equilibrium price and quantity are determined simultaneously
– Through the interaction of demand and supply
Class Exercise Question 1
• Identify the source of endogeneity related to the first X variable in each
of the following models:
a) Omitted variables that are correlated with 𝑋𝑗
b) Measurement error in 𝑋𝑗
c) Simultaneity (or bi-directional causality) between 𝑋𝑗 and 𝑌
• In some cases, more than one source may apply!

1. 𝑚𝑢𝑟𝑑𝑒𝑟 𝑟𝑎𝑡𝑒 = 𝛽1 + 𝛽2 𝑝𝑜𝑙𝑖𝑐𝑒 + 𝛽3 𝑖𝑛𝑐𝑜𝑚𝑒 + 𝑢


2. 𝑒𝑚𝑝𝑙𝑜𝑦𝑒𝑑 = 𝛽1 + 𝛽2 𝑖𝑚𝑚𝑖𝑔_𝑠𝑡𝑎𝑡𝑢𝑠 + 𝛽3 𝑒𝑑𝑢𝑐𝑎𝑡𝑖𝑜𝑛 + 𝑢
3. ℎ𝑒𝑎𝑙𝑡ℎ_𝑠𝑡𝑎𝑡𝑢𝑠 = 𝛽1 + 𝛽2 𝑖𝑛𝑐𝑜𝑚𝑒 + 𝛽3 𝑎𝑔𝑒 + ⋯ + 𝑢
4. 𝑔𝑟𝑜𝑤𝑡ℎ = 𝛽1 + 𝛽2 𝑖𝑛𝑠𝑡𝑖𝑡𝑢𝑡𝑖𝑜𝑛𝑎𝑙_𝑞𝑢𝑎𝑙𝑖𝑡𝑦 + 𝛽3 𝑐𝑎𝑝𝑖𝑡𝑎𝑙 + 𝛽4 𝑙𝑎𝑏𝑜𝑢𝑟 + 𝑢
5. 𝑙𝑛ℎ𝑤𝑎𝑔𝑒 = 𝛽1 + 𝛽2 𝑦𝑟𝑠𝑐ℎ𝑜𝑜𝑙 + 𝛽3 𝑒𝑥𝑝 + ⋯ + 𝑢
6. 𝑞𝑢𝑎𝑛𝑡𝑖𝑡𝑦𝑇𝑉𝑠 = 𝛽1 + 𝛽2 𝑝𝑟𝑖𝑐𝑒𝑇𝑉𝑠 + 𝛽3 𝑖𝑛𝑐𝑜𝑚𝑒 + ⋯ + 𝑢
1.2 SUMMARY THUS FAR
• Endogeneity is present in a lot of models

• We need to be able to:


– Explain its source ✓
– Understand its effect on our ability to estimate reliable parameters
– Correct any resulting econometric problems

1.3 STANDARD ASSUMPTIONS FOR THE
CLASSICAL LINEAR REGRESSION MODEL (CLRM)
• These assumptions are required:
– For OLS estimators to be unbiased estimators of population parameters

• Assumptions relate to statistical properties of estimators:


– Somewhat abstract!
– Describe properties of estimators when random sampling is done repeatedly
– Have nothing to do with a particular sample
– i.e. not meaningful to discuss properties of estimates obtained from a single sample

• Assumption CLRM1:
– The model is linear in the parameters
• Assumption CLRM2:
– The dataset is a random sample drawn from the population
• Assumption CLRM3:
– There is no perfect multicollinearity
• Assumption CLRM4:
– The error terms must be uncorrelated with all the X variables
– i.e. there is no endogeneity

• When CLRM4 holds: we have exogenous explanatory variables


• But if any 𝑋𝑗 is correlated with 𝑢 for any reason, then 𝑋𝑗 is an endogenous explanatory
variable

Assumption CLRM4: Zero conditional mean
𝐸(𝑢 | 𝑋2, 𝑋3, …, 𝑋𝑘) = 0
or
𝑐𝑜𝑣(𝑢, 𝑋𝑗) = 0, 𝑗 = 2, …, 𝑘
• CLRM4 is more likely to hold when fewer factors are in the error term
– i.e. When the model is better specified
• BUT CLRM4 can fail due to three sources discussed previously
• We cannot know for sure whether the average value of the unobserved factors is
unrelated to the explanatory variables.
• But this is the most important assumption:

Exogeneity is the key assumption to enable a causal interpretation of the regression results
WHY?
1.4 RESULT: CONSISTENCY OF OLS
Under assumptions CLRM1-CLRM4:
OLS estimator 𝑏𝑗 is consistent for 𝛽𝑗 for all 𝑗 = 2, … , 𝑘

• What is consistency?
– It is an asymptotic or large sample property
– Let 𝑏𝑗 be the OLS estimator of 𝛽𝑗 for some j.
– For each N, 𝑏𝑗 has a probability distribution (representing its possible values in
different random samples of size N).
– If this estimator is consistent, then the distribution of 𝑏𝑗 becomes more and more
tightly distributed around 𝛽𝑗 as the sample size grows.
– As N tends to infinity, the distribution of 𝑏𝑗 collapses to the single point 𝛽𝑗 :

𝑝𝑙𝑖𝑚 𝑏𝑗 = 𝛽𝑗
(Say: 𝛽𝑗 is the probability limit of 𝑏𝑗)
[Fig C3. Sampling distributions of 𝑏𝑗 for increasing sample sizes: densities 𝑓(𝑏𝑗) for N = 4, N = 16 and N = 40; horizontal axis 𝑏𝑗 with 𝛽𝑗 marked. Source: Wooldridge (2013)]
Why does consistency matter?
• Virtually all economists agree:
– consistency is a minimal requirement for an estimator

• Nobel Prize-winning econometrician Clive W. J. Granger:


– “If you can’t get it right as N goes to infinity, you shouldn’t be in this business.”

Showing the consistency of OLS
• In general, we need matrix algebra to show this.
• But, we can illustrate it for a simple model with a single X
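• The formula (estimator) for the slope coefficient is given by:
$$b_2 = \frac{\sum_{i=1}^{N}(X_{i2} - \bar{X}_2)\,Y_i}{\sum_{i=1}^{N}(X_{i2} - \bar{X}_2)^2}$$
• Substituting 𝑌𝑖 = 𝛽1 + 𝛽2 𝑋𝑖2 + 𝑢𝑖 and rearranging gives:
$$b_2 = \beta_2 + \frac{\frac{1}{N}\sum_{i=1}^{N}(X_{i2} - \bar{X}_2)\,u_i}{\frac{1}{N}\sum_{i=1}^{N}(X_{i2} - \bar{X}_2)^2}$$
• By the law of large numbers, the numerator and denominator converge in probability to 𝑐𝑜𝑣(𝑋2, 𝑢) and 𝑣𝑎𝑟(𝑋2), i.e.
$$\operatorname{plim} b_2 = \beta_2 + \frac{\operatorname{cov}(X_2, u)}{\operatorname{var}(X_2)} = \beta_2 \quad \text{because } \operatorname{cov}(X_2, u) = 0 \text{ under CLRM4}$$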
1.5 CONSEQUENCE OF VIOLATING ASSUMPTION CLRM4
• Given that:
$$\operatorname{plim} b_2 = \beta_2 + \frac{\operatorname{cov}(X_2, u)}{\operatorname{var}(X_2)} \qquad (1.1)$$
• Then the inconsistency (or asymptotic bias) is:
$$\operatorname{plim} b_2 - \beta_2 = \frac{\operatorname{cov}(X_2, u)}{\operatorname{var}(X_2)}$$
If 𝑐𝑜𝑣(𝑋2, 𝑢) = 0: OLS is consistent and unbiased
If 𝑐𝑜𝑣(𝑋2, 𝑢) < 0: OLS is inconsistent and biased downwards
If 𝑐𝑜𝑣(𝑋2, 𝑢) > 0: OLS is inconsistent and biased upwards

• If the covariance is small, the inconsistency might be negligible
– But we cannot estimate 𝑐𝑜𝑣(𝑋2, 𝑢) since 𝑢 is unobserved
• We need to use our knowledge of the relationship being estimated
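Eq.(1.1) can be checked numerically. The sketch below is an illustration only: the data-generating process, the parameter values (𝛽1 = 1, 𝛽2 = 2) and the function names are assumptions chosen so that 𝑣𝑎𝑟(𝑋2) = 1 and 𝑐𝑜𝑣(𝑋2, 𝑢) = c, which makes the predicted probability limit 𝛽2 + c.

```python
import random

def ols_slope(x, y):
    """Slope of a simple regression of y on x: sample cov(x, y) / sample var(x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    return s_xy / s_xx

def simulated_slope(n, c, beta1=1.0, beta2=2.0, seed=0):
    """OLS slope from one simulated sample with cov(X2, u) = c and var(X2) = 1."""
    rng = random.Random(seed)
    x2 = [rng.gauss(0, 1) for _ in range(n)]
    # u = c * X2 + independent noise, so cov(X2, u) = c * var(X2) = c
    u = [c * xi + rng.gauss(0, 1) for xi in x2]
    y = [beta1 + beta2 * xi + ui for xi, ui in zip(x2, u)]
    return ols_slope(x2, y)

# Hypothetical values of cov(X2, u): zero, positive, negative
for c in (0.0, 0.5, -0.5):
    b2 = simulated_slope(n=50_000, c=c)
    print(f"cov(X2,u) = {c:+.1f}: b2 = {b2:.3f}; eq.(1.1) predicts {2.0 + c:.3f}")
```

With a large sample, the estimate should sit close to 𝛽2 only in the c = 0 case; in the other two cases it settles near 𝛽2 + c, and increasing the sample size does not remove the gap.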
• We will examine each of the three potential causes of endogeneity
• i.e. of violating assumption CLRM4
1. Omitted variables
2. Measurement error
3. Bidirectional causality (simultaneity)

• We will look at:


– Why is 𝑢 correlated with 𝑋𝑗 in each case?
– What is the nature of the resulting asymptotic bias in each case?
– What is the general econometric method of solving the endogeneity issue?
• Instrumental variables

2. OMITTED VARIABLES
• Suppose the true model is:
𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + 𝑢
• but instead we estimate:
𝑌 = 𝑏1 + 𝑏2 𝑋2 + 𝑣
– E.g. 𝑌 is earnings, 𝑋2 is years of education, and 𝑋3 is ability
– Does 𝑏2 measure the true return to education, 𝛽2 ?
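• From eq.(1.1):
$$\begin{aligned}
\operatorname{plim} b_2 &= \beta_2 + \frac{\operatorname{cov}(X_2, \text{residual})}{\operatorname{var}(X_2)} \\
&= \beta_2 + \frac{\operatorname{cov}(X_2, \beta_3 X_3 + u)}{\operatorname{var}(X_2)} \\
&= \beta_2 + \frac{\operatorname{cov}(X_2, \beta_3 X_3) + \operatorname{cov}(X_2, u)}{\operatorname{var}(X_2)} \\
&= \beta_2 + \beta_3 \frac{\operatorname{cov}(X_2, X_3)}{\operatorname{var}(X_2)} \qquad \text{(since } \operatorname{cov}(X_2, u) = 0 \text{ in the true model)}
\end{aligned}$$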
$$\operatorname{plim} b_2 = \beta_2 + \beta_3 \frac{\operatorname{cov}(X_2, X_3)}{\operatorname{var}(X_2)}$$
• Therefore 𝑏2 is asymptotically unbiased only if either:
➢ 𝛽3 = 0 (i.e. there is no omitted variable), or
➢ 𝑋2 and 𝑋3 are uncorrelated.
• If neither of these holds, then 𝑏2 is biased and inconsistent,
– the direction of the asymptotic bias depends on the sign of 𝛽3 𝑐𝑜𝑣(𝑋2, 𝑋3).
• In the example:
𝑒𝑎𝑟𝑛𝑖𝑛𝑔𝑠 = 𝛽1 + 𝛽2 𝑦𝑟𝑠𝑐ℎ𝑜𝑜𝑙 + 𝛽3 𝑎𝑏𝑖𝑙𝑖𝑡𝑦 + 𝑢
– what is the direction of the bias of the return to education, when ability is
unobserved?
• Determining direction of bias is more complex with multiple Xs:
– Depends on their relationships with each other and with the omitted factor

Now try Exercise 3, Question 2!
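The omitted-variable formula can be illustrated with simulated data. This is a sketch under assumed values, not an estimate from real data: the coefficients, the way schooling depends on ability, and all variable names are hypothetical, chosen so that 𝑐𝑜𝑣(𝑋2, 𝑋3) = 0.5 and 𝑣𝑎𝑟(𝑋2) = 1.25.

```python
import random

def ols_slope(x, y):
    """Slope of a simple regression of y on x: sample cov(x, y) / sample var(x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    return s_xy / s_xx

rng = random.Random(42)
n = 50_000
beta1, beta2, beta3 = 1.0, 0.10, 0.50  # hypothetical "true" parameters

ability = [rng.gauss(0, 1) for _ in range(n)]
# Schooling rises with ability: cov(X2, X3) = 0.5 and var(X2) = 0.25 + 1 = 1.25
yrschool = [0.5 * a + rng.gauss(0, 1) for a in ability]
earnings = [beta1 + beta2 * s + beta3 * a + rng.gauss(0, 1)
            for s, a in zip(yrschool, ability)]

# "Short" regression that omits ability
b2_short = ols_slope(yrschool, earnings)
predicted_plim = beta2 + beta3 * 0.5 / 1.25  # beta2 + beta3 * cov(X2, X3) / var(X2)

print(f"true beta2 = {beta2:.2f}")
print(f"short-regression b2 = {b2_short:.3f}; formula predicts {predicted_plim:.3f}")
```

Because 𝛽3 and 𝑐𝑜𝑣(𝑋2, 𝑋3) are both positive in this made-up setup, the short regression overstates the return to schooling.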


3. MEASUREMENT ERROR
• Suppose that the true model is given by
𝑌 = 𝛽1 + 𝛽2 𝑋2∗ + 𝑢
• But 𝑋2∗ cannot be measured accurately: we only have an imperfect measure 𝑋2
– E.g. 𝑋2∗ is actual income, but 𝑋2 is reported income
• What is the effect on our ability to estimate 𝛽2 ?
• The measurement error in the population is simply
𝑒2 = 𝑋2 − 𝑋2∗
• We make the classical errors-in-variables (CEV) assumption: the measurement error is
uncorrelated with the true (unobserved) 𝑋2∗
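• Simplify eq.(1.1) for 𝑝𝑙𝑖𝑚 𝑏2, using various properties of variance and covariance in this context, to become:
$$\operatorname{plim} b_2 = \beta_2 + \frac{\operatorname{cov}(X_2, \text{residual})}{\operatorname{var}(X_2)} = \beta_2 \, \frac{\operatorname{var}(X_2^*)}{\operatorname{var}(X_2^*) + \operatorname{var}(e_2)}$$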
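The "various properties of variance and covariance" can be spelled out as a short worked derivation. This sketch assumes, in addition to CEV, that the measurement error 𝑒2 is uncorrelated with 𝑢 and that the true 𝑋2∗ is uncorrelated with 𝑢; these extra assumptions are not stated above and are labelled here as such.

$$\begin{aligned}
X_2 = X_2^* + e_2 \;&\Rightarrow\; Y = \beta_1 + \beta_2 X_2 + (u - \beta_2 e_2) \\
\operatorname{cov}(X_2,\, u - \beta_2 e_2) &= \operatorname{cov}(X_2^* + e_2,\, u) - \beta_2 \operatorname{cov}(X_2^* + e_2,\, e_2) = -\beta_2 \operatorname{var}(e_2) \\
\operatorname{var}(X_2) &= \operatorname{var}(X_2^*) + \operatorname{var}(e_2) \\
\operatorname{plim} b_2 &= \beta_2 - \beta_2 \frac{\operatorname{var}(e_2)}{\operatorname{var}(X_2^*) + \operatorname{var}(e_2)} = \beta_2 \, \frac{\operatorname{var}(X_2^*)}{\operatorname{var}(X_2^*) + \operatorname{var}(e_2)}
\end{aligned}$$

The first line shows that the error term of the regression actually estimated is 𝑢 − 𝛽2𝑒2, which contains the measurement error and is therefore correlated with the observed 𝑋2.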
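• In the presence of measurement error:
$$\operatorname{plim} b_2 = \beta_2 \, \frac{\operatorname{var}(X_2^*)}{\operatorname{var}(X_2^*) + \operatorname{var}(e_2)}$$
– 𝑣𝑎𝑟(𝑋2∗) is the ‘signal’, i.e. the true information contained in 𝑋2∗
– 𝑣𝑎𝑟(𝑒2) is the ‘noise’, i.e. the measurement error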
• Therefore, the OLS estimate 𝑏2 is biased towards zero (this is called attenuation bias).
– The larger the degree of measurement error, the greater is the attenuation bias.

• Issue is more complex in models with multiple Xs:


– Generally, measurement error in a single variable causes inconsistency in all
estimators
4. SIMULTANEITY
• Simultaneity arises when some of the Xs are jointly determined with
the dependent variable in the same economic model.
– There is bidirectional causality between X and Y
• We should view the equation we are interested in estimating as part of a system of
relationships:
– multiple causal relationships.

• Some examples:
– Models of demand and supply i.e. market equilibrium
• For commodities
• For an input into production e.g. labour
– Models of the macroeconomy
Example 1: Demand and supply
• Consider a system of supply and demand for a commodity:
Demand: Q = 𝛼1 + 𝛼2P + 𝛼3Y + u1 (4.1)
Supply: Q = 𝛽1 + 𝛽2P + u2 (4.2)
• In equilibrium, equate demand and supply:
𝛼1 + 𝛼2P + 𝛼3Y + u1 = 𝛽1 + 𝛽2P + u2
𝛼2P − 𝛽2P = 𝛽1 + u2 − (𝛼1 + 𝛼3Y + u1)
P(𝛼2 − 𝛽2) = 𝛽1 − 𝛼1 − 𝛼3Y + u2 − u1
$$P = \underbrace{\frac{\beta_1 - \alpha_1}{\alpha_2 - \beta_2}}_{\text{intercept}} \; \underbrace{- \frac{\alpha_3}{\alpha_2 - \beta_2}}_{\text{slope}} \, Y + \underbrace{\frac{u_2 - u_1}{\alpha_2 - \beta_2}}_{\text{error term}} \qquad (4.3)$$
• Thus P is a function of u1: i.e. X variable correlated with error term
• P is an endogenous explanatory variable
– It is simultaneously determined with Q
– Cannot meaningfully estimate (4.1) using OLS: the estimator of 𝛼2 will be inconsistent.
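The correlation between P and the demand error can be made explicit from (4.3). In this sketch the demand coefficients are written as 𝛼 and the supply coefficients as 𝛽 (the labels these notes use for the structural parameters), and two assumptions are added for illustration: Y is uncorrelated with 𝑢1, and the two structural errors are uncorrelated with each other.

$$\operatorname{cov}(P, u_1) = \operatorname{cov}\!\left(\frac{u_2 - u_1}{\alpha_2 - \beta_2},\; u_1\right) = \frac{\operatorname{cov}(u_2, u_1) - \operatorname{var}(u_1)}{\alpha_2 - \beta_2} = -\frac{\operatorname{var}(u_1)}{\alpha_2 - \beta_2}$$

If demand slopes down (𝛼2 < 0) and supply slopes up (𝛽2 > 0), then 𝛼2 − 𝛽2 < 0 and the covariance is positive: a positive demand shock raises the equilibrium price, so P cannot be treated as exogenous in (4.1).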
4.1 SIMULTANEITY BIAS

• Demand and supply equations (4.1) and (4.2) are known as structural equations:
– They describe the structure of the economy:
• Derivable from economic theory
• Have a causal interpretation
• In the structural equations:
– Price and quantity are determined simultaneously:
• price affects quantity and quantity affects price
– P and Q are endogenous variables, while Y is exogenous
– Estimation by OLS will lead to biased and inconsistent coefficient estimates
• Explanatory variables are correlated with error term
• Determining the direction of the bias is generally complicated in models with multiple X
variables.
Avoiding simultaneity bias
• Equations such as (4.3) are known as reduced form equations:
– Endogenous variables are expressed as a function only of all exogenous variables
(and a constant)
– Can derive a similar equation for Q
• Write the reduced form equations as:
𝑃 = 𝜋11 + 𝜋21 𝑌 + 𝑣1 (4.3a)
𝑄 = 𝜋12 + 𝜋22 𝑌 + 𝑣2 (4.4)
• These reduced form equations can be estimated by OLS:
– All the RHS variables are exogenous
• But:
– We don’t care about the values of the 𝜋 parameters
– The parameters of interest are 𝛼1 , 𝛼2 and 𝛼3 , and 𝛽1 and 𝛽2 (from the structural
equations)
4.2 IDENTIFICATION OF STRUCTURAL EQUATIONS
• In OLS, we can identify the value of the parameters if:
– each explanatory variable is uncorrelated with the error term
• This condition does not hold when there is endogeneity

• We can sometimes still identify (or consistently estimate) the parameters in a structural
equation
– Similarly for cases of omitted variables or measurement error.

• Do we have enough information to retrieve the original coefficients (𝛼s and 𝛽s) from
the 𝜋s?
– Answer depends on having additional exogenous variables
– i.e. exogenous variables that are not in the equation of interest
Three possible situations
1. An equation is unidentified
– We cannot get the structural coefficients from the reduced form estimates
– E.g. the demand equation Q = 𝛼1 + 𝛼2P + 𝛼3Y + u1
– There are no additional exogenous variables in the model

2. An equation is exactly identified


– We can get unique structural form coefficient estimates
– E.g. the supply equation Q = 𝛽1 + 𝛽2P + u2

3. An equation is over-identified
– More than one set of structural coefficients could be obtained from the reduced form
– Example given later
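A worked mapping between the reduced form and the structural parameters shows why the supply equation is exactly identified while the demand equation is not. The sketch uses the demand-supply system with demand coefficients 𝛼 and supply coefficients 𝛽, and the reduced-form coefficients 𝜋11, 𝜋21 (price equation) and 𝜋12, 𝜋22 (quantity equation). Substituting the reduced form for P into the supply equation gives:

$$\begin{aligned}
Q &= \beta_1 + \beta_2(\pi_{11} + \pi_{21} Y + v_1) + u_2 \\
\Rightarrow\quad \pi_{12} &= \beta_1 + \beta_2 \pi_{11}, \qquad \pi_{22} = \beta_2 \pi_{21} \\
\Rightarrow\quad \beta_2 &= \frac{\pi_{22}}{\pi_{21}}, \qquad \beta_1 = \pi_{12} - \beta_2 \pi_{11} \qquad (\text{requires } \pi_{21} \neq 0)
\end{aligned}$$

The supply parameters are therefore recovered uniquely from the four 𝜋s. The two remaining relationships, $\pi_{11} = (\beta_1 - \alpha_1)/(\alpha_2 - \beta_2)$ and $\pi_{21} = -\alpha_3/(\alpha_2 - \beta_2)$, contain three unknown demand parameters, so 𝛼1, 𝛼2 and 𝛼3 cannot be solved for.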
Condition for a structural equation to be identified
• A structural equation satisfies the order condition if:
– number of exogenous variables excluded from the equation is
– at least as large as the number of right-hand side endogenous variables
• This is a necessary (but not sufficient) condition for identification
• The rank condition is a sufficient condition
– but requires matrix algebra: beyond scope of this module
• Express the order condition as:
K – k0 ≥ m0
• where K = total no. of exogenous variables in the equation system (i.e. overall model)
k0 = no. of exogenous variables in the structural equation
m0 = no. of endogenous variables on RHS of structural equation
Demand: Q = 𝛼1 + 𝛼2P + 𝛼3Y + u1 (4.1)
Supply: Q = 𝛽1 + 𝛽2P + u2 (4.2)

Are each of these structural equations identified?


For the model as a whole: K=
Demand equation: k0 = ; m0 =
K – k0 =

Supply equation: k0 = ; m0 =
K – k0 =

• Therefore we can get consistent estimates of the parameters in the supply equation
– but not in the demand equation.

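The order condition is simple counting, so it can be written as a small helper. The function name and its return labels are illustrative assumptions; the counts for the demand-supply system treat Y as the only exogenous variable. Since the order condition is necessary but not sufficient, "exactly identified" and "over-identified" here mean "passes the order condition", subject to the rank condition.

```python
def order_condition(K, k0, m0):
    """Classify a structural equation using the order condition K - k0 >= m0.

    K  : exogenous variables in the whole system
    k0 : exogenous variables included in this equation
    m0 : endogenous variables on the RHS of this equation
    """
    excluded = K - k0  # exogenous variables excluded from this equation
    if excluded < m0:
        return "unidentified"
    if excluded == m0:
        return "exactly identified"
    return "over-identified"

# Demand-supply system (4.1)-(4.2): Y is the only exogenous variable, so K = 1
print("Demand:", order_condition(K=1, k0=1, m0=1))  # Y included, P endogenous
print("Supply:", order_condition(K=1, k0=0, m0=1))  # Y excluded, P endogenous
```

The demand equation excludes no exogenous variable, so it fails the condition; the supply equation excludes exactly one (Y), matching its single endogenous regressor.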
Example 2: Keynesian macro model
• For a closed economy:
𝐶 = 𝛽1 + 𝛽2 𝑌 + 𝛽3 𝑟 + 𝑢1 (4.5)
𝐼 = 𝛾1 + 𝛾2 𝑟 + 𝑢2 (4.6)
𝑌 ≡ 𝐶 + 𝐼 + 𝐺 (4.7)
• Three equations in the system:
– therefore three endogenous (dependent) variables
• Assume all other variables are exogenous

• Is equation (4.5) identified?


– For the model as a whole: K=
– For equation (4.5): k0 = ; m0 =
– Therefore:

• Various issues with such a simple macro model:
1. Difficult to argue that interest rates and government spending are exogenous
2. Model would be estimated with time series data, but is static:
• We expect adjustment lags

• Can adapt the model to deal with issue 2, e.g.


𝐶𝑡 = 𝛽1 + 𝛽2 𝑌𝑡 + 𝛽3 𝑟𝑡 + 𝛽4 𝐶𝑡−1 + 𝑢1
𝐼𝑡 = 𝛾1 + 𝛾2 𝑟𝑡 + 𝛾3 𝑌𝑡−1 + 𝑢2
• Then the lagged values can be treated as exogenous:
– They are referred to as predetermined variables
– Including lags helps with identification (as well as better modelling dynamic
behaviour)

Now try Exercise 3, Questions 3.1 and 3.2!


Part 2

Estimation in the Presence of Endogeneity: The use of instrumental variables

We focus on how to address endogeneity, and various associated statistical tests
5. ESTIMATION: INSTRUMENTAL VARIABLE TECHNIQUE
• Recall:
– We cannot use OLS directly on the structural equations
– Because the endogenous explanatory variable/s are correlated with the errors

• One solution:
– Don’t use the endogenous Xs
– Rather, use some other variables instead
• We want these other variables to be:
– (highly) correlated with the endogenous Xs, but
– NOT correlated with the errors

• They are called INSTRUMENTS (IVs)


• Here, we express the use of instruments more formally:
• Consider the equation:
Y1 = 𝛽1 + 𝛽2X + 𝛽3Y2 + u
where X is exogenous and Y2 is endogenous (correlated with u).
• The method of instrumental variables requires that we find a variable Z which is an
instrument for Y2
• Z must be:
1) strongly correlated with Y2
Instrument relevance: corr(Z, Y2) ≠ 0
but
2) not correlated with u
Instrument exogeneity: corr (Z, u) = 0
• If the instrument is good (i.e. satisfies the two conditions above):
– we can use it to consistently estimate the parameters in the equation of interest.
5.1 WHERE DO THE INSTRUMENTS COME FROM?
• Depends on the source of endogeneity
• Simultaneity:
– Provided we have a model with multiple equations:
– Instruments are the excluded exogenous variables from other equations
• Including any predetermined variables
• Omitted variable and measurement error:
– More challenging:
• There aren’t additional equations with extra variables
– Need to make an argument for choice of instrument/s, and justify
– Similarly for cases of simultaneity with only one equation
• Panel data often provides instruments from previous time periods
– See Topics 5 and 6 for more information
Some examples of instruments: 1

• We want to estimate the causal effect of skipping class on academic performance:


𝑚𝑎𝑟𝑘 = 𝛽1 + 𝛽2 𝑎𝑏𝑠𝑒𝑛𝑡 + 𝛽3 𝑝𝑟𝑒𝑣𝑚𝑎𝑟𝑘𝑠 + 𝛽4 𝑚𝑜𝑡𝑖𝑣𝑎𝑡𝑖𝑜𝑛 + 𝑢
– But motivation is an omitted variable
– We suspect it is correlated with absenteeism

• Proposed IV:
– Use distance between living location and campus as instrument for absent
• Justification:
– Relevance: longer commute → higher probability of being absent (e.g. due to transport problems)
– Exogeneity: distance not expected to be correlated with motivation
Some examples of instruments: 2
• We want to estimate the causal effect of education on earnings:
log(𝑤𝑎𝑔𝑒) = 𝛽1 + 𝛽2 𝑦𝑟𝑠𝑐ℎ𝑜𝑜𝑙 + 𝛽3 𝑎𝑏𝑖𝑙𝑖𝑡𝑦 + 𝑢
• Proposed IV 1: Parents’ education
– Relevance: parents’ education is correlated with child’s education in many samples
(true for SA?)
– Exogeneity: but likely to be correlated with child’s ability
• Proposed IV 2: Number of siblings
– Relevance: having more siblings is typically associated with lower education per child
(true for SA?)
– Exogeneity: likely to be uncorrelated with child’s ability
• Need to make similar arguments for measurement error cases
The statistical reliability of the results depends on having good IVs
5.2 TWO-STAGE LEAST SQUARES (2SLS)
• Two-stage least squares (2SLS) provides a method for using multiple
instrumental variables.
• 2SLS proceeds as follows:
– Stage 1:
• Regress each endogenous variable that appears on the RHS of the structural
equation on all of its instruments
– In simultaneous equations, this is the reduced form equation
• Predict the value of each endogenous variable, $\hat{Z}$
– Stage 2:
• Use the predicted value of each endogenous variable in place of the variable
itself
• Standard errors have to be corrected in Stage 2
• Interpret the resulting coefficients and perform hypothesis tests as usual.
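The two stages can be sketched by hand on simulated data. Everything in this example is an assumption made for illustration: a model with a single endogenous regressor x, a single instrument z, an unobserved common factor that makes x and u correlated, and true coefficients 𝛽1 = 1, 𝛽2 = 2. It reproduces only the 2SLS point estimate; as noted above, the standard errors from a manual second stage are not valid without correction.

```python
import random

def ols(x, y):
    """Return (intercept, slope) from a simple regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope

rng = random.Random(1)
n = 50_000
beta1, beta2 = 1.0, 2.0  # hypothetical true parameters

z = [rng.gauss(0, 1) for _ in range(n)]        # instrument: affects x, unrelated to u
common = [rng.gauss(0, 1) for _ in range(n)]   # unobserved factor in both x and u
x = [0.8 * zi + ci + rng.gauss(0, 1) for zi, ci in zip(z, common)]
u = [ci + rng.gauss(0, 1) for ci in common]
y = [beta1 + beta2 * xi + ui for xi, ui in zip(x, u)]

# Naive OLS: inconsistent because cov(x, u) = 1 by construction
_, b_ols = ols(x, y)

# Stage 1: regress the endogenous x on the instrument and form predicted values
a1, p1 = ols(z, x)
x_hat = [a1 + p1 * zi for zi in z]

# Stage 2: use the predicted values in place of x
_, b_2sls = ols(x_hat, y)

print(f"true beta2 = {beta2}, OLS = {b_ols:.3f}, 2SLS = {b_2sls:.3f}")
```

In this setup OLS converges to 𝛽2 + 1/2.64 (about 2.38), while the two-stage estimate should be close to the true value of 2.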
Stata example
Consider a demand and supply model for a food product:
Demand: Q = 𝛼1 + 𝛼2P + 𝛼3PS + 𝛼4INC + u1
Supply: Q = 𝛽1 + 𝛽2P + 𝛽3PF + u2
• Q is quantity; P is price; PS is price of a substitute; INC is per capita income; PF is price of
factor of production
• Endogenous: Q and P; exogenous: PS, INC and PF.
• The demand equation, estimated by OLS:
. regress q p ps inc
Source | SS df MS Number of obs = 30
-------------+------------------------------ F( 3, 26) = 8.52
Model | 305.92719 3 101.97573 Prob > F = 0.0004
Residual | 311.209627 26 11.969601 R-squared = 0.4957
-------------+------------------------------ Adj R-squared = 0.4375
Total | 617.136817 29 21.2805799 Root MSE = 3.4597
------------------------------------------------------------------------------
q | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p | .0232954 .0768423 0.30 0.764 -.1346562 .181247
ps | .7100395 .2143246 3.31 0.003 .269489 1.15059
inc | .0764442 1.190855 0.06 0.949 -2.371393 2.524282
_cons | 1.091045 3.71158 0.29 0.771 -6.538218 8.720308
------------------------------------------------------------------------------
Note: if price and quantity are simultaneously determined, then the coefficient on p is likely to be biased.
. ivregress 2sls q (p = ps inc pf) ps inc, first
First-stage regressions
-----------------------
Note: this stage creates an instrument for the potentially-endogenous variable, price.
Number of obs = 30
F( 3, 26) = 69.19
Prob > F = 0.0000
R-squared = 0.8887
Adj R-squared = 0.8758
Root MSE = 6.5975
------------------------------------------------------------------------------
p | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ps | 1.708147 .3508806 4.87 0.000 .9869017 2.429393
inc | 7.602491 1.724336 4.41 0.000 4.058068 11.14691
pf | 1.353906 .2985062 4.54 0.000 .7403175 1.967494
_cons | -32.51242 7.984235 -4.07 0.000 -48.92425 -16.10059
------------------------------------------------------------------------------

Instrumental variables (2SLS) regression
Note: stage 2 uses the instrument in place of price in the regression.
Number of obs = 30
Wald chi2(3) = 20.43
Prob > chi2 = 0.0001
R-squared = .
Root MSE = 4.5895
------------------------------------------------------------------------------
q | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p | -.3744591 .1533755 -2.44 0.015 -.6750695 -.0738486
ps | 1.296033 .3306669 3.92 0.000 .6479381 1.944128
inc | 5.013977 2.125875 2.36 0.018 .847339 9.180615
_cons | -4.279471 5.161076 -0.83 0.407 -14.39499 5.836052
------------------------------------------------------------------------------
Instrumented: p
Instruments: ps inc pf
Note: after dealing with the endogeneity, price has a significant negative effect on quantity demanded.
5.3 TESTING FOR INSTRUMENT VALIDITY
• Estimates produced using IV are consistent only when the IV used is valid
• Illustrate properties of IV estimation if Z is a poor IV:
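$$\operatorname{plim} b_{2,IV} = \beta_2 + \frac{\operatorname{corr}(Z, u)}{\operatorname{corr}(Z, X_2)} \cdot \frac{\sigma_u}{\sigma_{X_2}}$$
– Instrument exogeneity: 𝑐𝑜𝑟𝑟(𝑍, 𝑢) should be close to zero
– Instrument relevance: 𝑐𝑜𝑟𝑟(𝑍, 𝑋2) should be large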
• If Z is not exogenous: estimates are inconsistent
• If relevance of Z is weak:
– Can have large asymptotic bias (and high std errors)
– Even if 𝑐𝑜𝑟𝑟(𝑍, 𝑢) is small
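The weak-instrument point can be made concrete by plugging numbers into the formula for the IV probability limit and comparing with the OLS counterpart, which follows from eq.(1.1) as corr(X2, u)·σu/σX2. All correlations and standard deviations below are hypothetical values chosen for illustration, and the function names are assumptions.

```python
def iv_asymptotic_bias(corr_zu, corr_zx, sigma_u, sigma_x):
    """plim(b_IV) - beta2 = [corr(Z,u) / corr(Z,X2)] * (sigma_u / sigma_X2)."""
    return (corr_zu / corr_zx) * (sigma_u / sigma_x)

def ols_asymptotic_bias(corr_xu, sigma_u, sigma_x):
    """plim(b_OLS) - beta2 = cov(X2,u) / var(X2) = corr(X2,u) * sigma_u / sigma_X2."""
    return corr_xu * sigma_u / sigma_x

sigma_u, sigma_x = 1.0, 1.0   # hypothetical standard deviations
corr_xu = 0.20                # hypothetical endogeneity of X2
corr_zu = 0.02                # instrument is "almost" exogenous

ols_bias = ols_asymptotic_bias(corr_xu, sigma_u, sigma_x)
strong_iv = iv_asymptotic_bias(corr_zu, corr_zx=0.50, sigma_u=sigma_u, sigma_x=sigma_x)
weak_iv = iv_asymptotic_bias(corr_zu, corr_zx=0.05, sigma_u=sigma_u, sigma_x=sigma_x)

print(f"OLS asymptotic bias:            {ols_bias:.2f}")
print(f"IV bias, strong instrument:     {strong_iv:.2f}")
print(f"IV bias, weak instrument:       {weak_iv:.2f}")
```

With the same small 𝑐𝑜𝑟𝑟(𝑍, 𝑢), the strong instrument gives a much smaller asymptotic bias than OLS, while the weak instrument gives a larger one, because the small denominator magnifies even a slight failure of exogeneity.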
1) Instrument relevance:
• Straightforward to assess:
– Examine the first stage of 2SLS
• Focus on significance of the IVs, rather than all exogenous variables.
– IVs should be significantly related to the endogenous X:
• Use t-test for one IV, or F-test for multiple IVs
– Rule of thumb: for a single endogenous explanatory variable, the F-statistic in the
first stage should be greater than 10.
. ivregress 2sls q (p = ps inc pf) ps inc, first
First-stage regressions
-----------------------
Number of obs = 30
F( 3, 26) = 69.19
Prob > F = 0.0000
R-squared = 0.8887
Adj R-squared = 0.8758
Root MSE = 6.5975
------------------------------------------------------------------------------
p | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ps | 1.708147 .3508806 4.87 0.000 .9869017 2.429393
inc | 7.602491 1.724336 4.41 0.000 4.058068 11.14691
pf | 1.353906 .2985062 4.54 0.000 .7403175 1.967494
_cons | -32.51242 7.984235 -4.07 0.000 -48.92425 -16.10059
------------------------------------------------------------------------------
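In this first stage, ps and inc also appear in the demand equation itself, so pf is the only excluded instrument for price. The reported F(3, 26) = 69.19 tests all three regressors jointly; for a single excluded instrument, the F-statistic for that instrument alone equals the square of its t-statistic. The arithmetic below uses the pf coefficient and standard error from the output above; the variable names are illustrative.

```python
# First-stage coefficient and standard error on pf, taken from the Stata output
coef_pf = 1.353906
se_pf = 0.2985062

t_pf = coef_pf / se_pf   # t-statistic on the excluded instrument
f_pf = t_pf ** 2         # F-statistic for a single restriction equals t squared

print(f"t on pf = {t_pf:.2f}")
print(f"F for pf alone = {f_pf:.2f}")
print("passes rule of thumb (F > 10):", f_pf > 10)
```

The instrument-specific F is well below the overall 69.19 but still above the rule-of-thumb threshold of 10.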
2) Instrument exogeneity:
• If the coefficients are exactly identified:
– There is no statistical test for this assumption.
– Researcher must use knowledge and judgement of the research question at hand.

• If equation is over-identified (i.e. extra IVs), can conduct a test

Test for over-identifying restrictions
• Suppose that we have q more instruments than we need:
– i.e. q = (K – k0) – (m0) > 0
– Recall that IVs must be excluded exogenous variables
– E.g. one endogenous X (m0 = 1), and three proposed IVs (K – k0 = 3)
• q = 3 – 1 = 2 over-identifying restrictions.
• Then we can test whether the 2SLS residuals are correlated with q linear functions of
the instruments

• Procedure for testing over-identifying restrictions:


1) Estimate structural equation by 2SLS; obtain residuals, $\hat{u}_1$.
2) Regress $\hat{u}_1$ on all exogenous variables. Obtain $R_1^2$.
3) Test statistic = $nR_1^2 \sim \chi^2$ with df = q
4) If $nR_1^2 > \chi^2_{crit}$, reject $H_0$: IVs uncorrelated with $\hat{u}_1$
5) If $H_0$ is rejected, conclude that at least some of the IVs are not exogenous.
• Recall that our model is:
Demand: Q = 𝛼1 + 𝛼2P + 𝛼3PS + 𝛼4INC + u1
Supply: Q = 𝛽1 + 𝛽2P + 𝛽3PF + u2
• q = (K – k0) – (m0) = (no. of proposed IVs) – (no. of endogenous Xs)
– Demand equation: q = (3-2) – (1) = 0
– Supply equation: q = (3-1) – (1) = 1
. ivregress 2sls q (p = ps inc pf) pf

Instrumental variables (2SLS) regression Number of obs = 30


Wald chi2(2) = 211.69
Prob > chi2 = 0.0000
R-squared = 0.9019
Root MSE = 1.4207

------------------------------------------------------------------------------
q | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p | .3379816 .0236408 14.30 0.000 .2916465 .3843166
pf | -1.000909 .0782929 -12.78 0.000 -1.154361 -.8474581
_cons | 20.0328 1.160349 17.26 0.000 17.75856 22.30704
------------------------------------------------------------------------------
Instrumented: p
Instruments: pf ps inc

. predict u, resid
. reg u pf ps inc

Source | SS df MS Number of obs = 30


-------------+---------------------------------- F(3, 26) = 0.47
Model | 3.0948454 3 1.03161513 Prob > F = 0.7080
Residual | 57.4597199 26 2.20998923 R-squared = 0.0511
-------------+---------------------------------- Adj R-squared = -0.0584
Total | 60.5545653 29 2.08808846 Root MSE = 1.4866

------------------------------------------------------------------------------
u | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pf | .0363318 .067262 0.54 0.594 -.1019273 .1745909
ps | .0790798 .0790635 1.00 0.326 -.0834376 .2415971
inc | -.4023461 .3885424 -1.04 0.310 -1.201007 .3963143
_cons | -1.149104 1.799078 -0.64 0.529 -4.847162 2.548953
------------------------------------------------------------------------------

• Then $nR^2 = 30 \times 0.0511 = 1.533$
• $\chi^2_{crit}(\alpha = 0.05;\ df = q = 1) = 3.841$
• $nR^2 < \chi^2_{crit}$, therefore cannot reject $H_0$
• Therefore there is no evidence that the instruments used are not exogenous.

Now try Exercise 3, Question 3.3!
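The test decision for the supply equation can be reproduced with a few lines of arithmetic. The sample size, R-squared and 5% critical value are the ones reported above; the function name is an illustrative assumption.

```python
def overid_test(n, r_squared, chi2_crit):
    """Return (n * R^2, reject?) for the test of over-identifying restrictions."""
    stat = n * r_squared
    return stat, stat > chi2_crit

# Values from the regression of the 2SLS residuals on pf, ps and inc
n = 30
r_squared = 0.0511
chi2_crit_5pct = 3.841  # chi-squared critical value, df = q = 1, alpha = 0.05

stat, reject = overid_test(n, r_squared, chi2_crit_5pct)
print(f"nR^2 = {stat:.3f}, critical value = {chi2_crit_5pct}, reject H0: {reject}")
```

Failing to reject here means the data give no evidence against instrument exogeneity; it does not prove that the instruments are exogenous.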


5.4 TESTING FOR ENDOGENEITY
• It is ‘costly’ to use IV if there is no endogeneity:
– IV is less efficient (has larger standard errors) than OLS.

• Statistical Properties of OLS and IV:


         Endogeneity      No endogeneity
OLS      Inconsistent     Consistent and efficient
IV       Consistent       Consistent but inefficient

• In the presence of endogeneity:


– Only IV is consistent
– BUT may have bias in small samples
• Recall: consistency is an asymptotic property
A. Regression-based Test
• Consider the equation:
Y1 = 𝛽1 + 𝛽2X + 𝛽3Y2 + u
where X is exogenous and Y2 may be endogenous.
• Estimate the reduced form equation for Y2
– i.e. regress Y2 on all the truly exogenous variables
– and obtain the residuals, e.

• Now include these residuals in the model of interest:


Y1 = 𝛽1 + 𝛽2X + 𝛽3Y2 + θe + u
• Hypotheses: H0: θ = 0, i.e. Y2 is exogenous
H1: θ ≠ 0, i.e. Y2 is endogenous
• Thus a standard t-test on the coefficient on e in the above regression:
– constitutes a test of the null hypothesis of Y2 being exogenous.
B. Hausman Test
• Estimate the model by both OLS and IV:
– Compare (statistically) the coefficient values and their variances.

• H0: no endogeneity bias (both OLS and IV estimators will be consistent, but OLS is more efficient)
• H1: endogeneity (only IV will be consistent – the difference between the OLS and IV coefficients will not converge to zero as n → ∞)

• If there is a systematic difference in the OLS and IV estimates:


– the explanatory variable/s is/are endogenous.
• The test statistic is based on the differences between all of the coefficients:
– follows a chi-squared distribution (with df = number of instrumented variables).

Stata example
A. Regression-based test:
To test whether price is endogenous in the demand equation, estimate the
reduced form equation for price, then include its residuals in the demand equation:
Reduced form equation: regress the potentially endogenous variable, p, on all exogenous variables in the model:
. reg p ps inc pf
Source | SS df MS Number of obs = 30
-------------+------------------------------ F( 3, 26) = 69.19
Model | 9034.77551 3 3011.59184 Prob > F = 0.0000
Residual | 1131.69721 26 43.5268157 R-squared = 0.8887
-------------+------------------------------ Adj R-squared = 0.8758
Total | 10166.4727 29 350.568025 Root MSE = 6.5975

------------------------------------------------------------------------------
p | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ps | 1.708147 .3508806 4.87 0.000 .9869017 2.429393
inc | 7.602491 1.724336 4.41 0.000 4.058068 11.14691
pf | 1.353906 .2985062 4.54 0.000 .7403175 1.967494
_cons | -32.51242 7.984235 -4.07 0.000 -48.92425 -16.10059
------------------------------------------------------------------------------
Predict the residuals from the reduced form equation:
. predict e, resid
Include the residuals as an extra variable in the demand equation:
. regress q p ps inc e
Source | SS df MS Number of obs = 30
-------------+------------------------------ F( 4, 25) = 60.88
Model | 559.677099 4 139.919275 Prob > F = 0.0000
Residual | 57.4597181 25 2.29838873 R-squared = 0.9069
-------------+------------------------------ Adj R-squared = 0.8920
Total | 617.136817 29 21.2805799 Root MSE = 1.516

------------------------------------------------------------------------------
q | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p | -.3744591 .0506639 -7.39 0.000 -.4788032 -.2701149
ps | 1.296033 .1092277 11.87 0.000 1.071074 1.520992
inc | 5.013977 .702231 7.14 0.000 3.567705 6.460249
e | .7124655 .0678067 10.51 0.000 .5728149 .852116
_cons | -4.279471 1.704836 -2.51 0.019 -7.790645 -.7682958
------------------------------------------------------------------------------

p-value on residuals = 0.000: reject H0 at all conventional levels of significance
• Therefore reject H0: θ = 0 (p is exogenous)
• Therefore price is endogenous in the demand equation.
B. Hausman test:
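Command for the Hausman test, comparing the two sets of estimates (IV and OLS are assumed here to be the names under which the two sets of results were saved beforehand with estimates store):
. hausman IV OLS, cons sigmamore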
---- Coefficients ----
| (b) (B) (b-B) sqrt(diag(V_b-V_B))
| IV OLS Difference S.E.
-------------+----------------------------------------------------------------
p | -.3744591 .0232954 -.3977545 .0863877
ps | 1.296033 .7100395 .5859938 .1272711
inc | 5.013977 .0764442 4.937533 1.072376
_cons | -4.279471 1.091045 -5.370516 1.166414
------------------------------------------------------------------------------
b = consistent under Ho and Ha; obtained from ivregress
B = inconsistent under Ha, efficient under Ho; obtained from regress
Test: Ho: difference in coefficients not systematic

chi2(1) = (b-B)'[(V_b-V_B)^(-1)](b-B)
= 21.20
Prob>chi2 = 0.0000
Reject H0 at all conventional levels of significance

• H0: no endogeneity bias


• Therefore endogeneity does exist in the demand equation:
– We must estimate the equation using IV, not OLS.
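As a rough check on the reported statistic: with a single instrumented variable, the chi2(1) value in the Hausman output appears to match the squared ratio of the difference in the price coefficients to its reported standard error. The snippet below only reproduces that arithmetic from the figures in the table; it is not a description of how Stata computes the test internally, and the variable names are illustrative.

```python
# Row for p in the Hausman output: (b - B) and sqrt(diag(V_b - V_B))
diff_p = -0.3977545     # IV coefficient minus OLS coefficient
se_diff_p = 0.0863877   # standard error of that difference

hausman_stat = (diff_p / se_diff_p) ** 2
chi2_crit_5pct = 3.841  # chi-squared critical value, df = 1, alpha = 0.05

print(f"(b - B)^2 / (V_b - V_B) = {hausman_stat:.2f}")
print("reject H0 of no endogeneity bias:", hausman_stat > chi2_crit_5pct)
```

The result agrees with the reported chi2(1) = 21.20 to two decimal places.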
6. CONCLUSION

• Endogeneity is one of the key issues in empirical econometrics:


– It violates an assumption that is required to have unbiased, consistent estimators
– It means that relationships can no longer be interpreted as causal

• The way in which endogeneity is discussed and dealt with is a crucial determinant of:
– Reliability of empirical estimates
– Whether an empirical paper is published
– Success of empirical dissertations for advanced degrees

• In this topic, we’ve gone through some key tools for dealing with this issue:
– It remains a complex conceptual and empirical issue which is difficult to grapple with.
