Topic 3 - endogeneity (1)
Topic 3
Endogeneity
WHY IS THIS TOPIC IMPORTANT?
• We commonly need to estimate models where:
– One or more important factors cannot be measured
– Some of the data may be inaccurate
– There are multiple causal relationships, not just X → Y
• In this topic, we’ll learn what endogeneity is, how it affects the reliability of OLS results,
and what methods can be used to overcome it
Part 1
• We will:
– Try to identify sources of endogeneity in models
– Derive expressions for the consequences of endogeneity
– See how we can estimate models to overcome this issue
a) Omitted variable
• An important explanatory variable is omitted from the regression
– And it is correlated with at least one of the included X variables
• Why might a variable be omitted from a model?
• E.g.
𝑒𝑎𝑟𝑛𝑖𝑛𝑔𝑠 = 𝛽1 + 𝛽2 𝑦𝑟𝑠𝑐ℎ𝑜𝑜𝑙 + 𝛽3 𝑎𝑏𝑖𝑙𝑖𝑡𝑦 + ⋯ + 𝑢
b) Measurement error
• An explanatory variable is measured with error i.e. is inaccurate:
– Some variables are inherently difficult to measure e.g. income
– May need to use a proxy when true variable is unavailable
• E.g.
𝑙𝑒𝑖𝑠𝑢𝑟𝑒 𝑡𝑖𝑚𝑒 = 𝛽1 + 𝛽2 ℎℎ𝑖𝑛𝑐𝑜𝑚𝑒 + 𝛽3 𝑚𝑎𝑙𝑒 + ⋯ + 𝑢
c) Simultaneity
• One (or more) explanatory variables are jointly determined with Y
– i.e. X affects Y, and Y affects X
• Common in macro models
• Also occurs with many other complex economic processes
• E.g. effect of inflation on trade openness:
𝑜𝑝𝑒𝑛𝑛𝑒𝑠𝑠 = 𝛽1 + 𝛽2 𝑖𝑛𝑓𝑙𝑎𝑡𝑖𝑜𝑛 + 𝛽3 𝑙𝑛𝑝𝑐𝑖𝑛𝑐 + 𝛽4 𝑙𝑛𝑙𝑎𝑛𝑑 + 𝑢
𝑖𝑛𝑓𝑙𝑎𝑡𝑖𝑜𝑛 = 𝛼1 + 𝛼2 𝑜𝑝𝑒𝑛𝑛𝑒𝑠𝑠 + 𝛼3 𝑙𝑛𝑝𝑐𝑖𝑛𝑐 + 𝑣
• Possible to suspect/identify simultaneity even when only given one equation:
– If we suspect feedback from Y to X
• All demand and supply models suffer from simultaneity:
– Equilibrium price and quantity are determined simultaneously
– Through the interaction of demand and supply
Class Exercise Question 1
• Identify the source of endogeneity related to the first X variable in each
of the following models:
a) Omitted variables that are correlated with 𝑋𝑗
b) Measurement error in 𝑋𝑗
c) Simultaneity (or bi-directional causality) between 𝑋𝑗 and 𝑌
• In some cases, more than one source may apply!
1.3 STANDARD ASSUMPTIONS FOR THE
CLASSICAL LINEAR REGRESSION MODEL (CLRM)
• These assumptions are required:
– For OLS estimators to be unbiased estimators of population parameters
• Assumption CLRM1:
– The model is linear in the parameters
• Assumption CLRM2:
– The dataset is a random sample drawn from the population
• Assumption CLRM3:
– There is no perfect multicollinearity
• Assumption CLRM4:
– The error terms must be uncorrelated with all the X variables
– i.e. there is no endogeneity
Assumption CLRM4: Zero conditional mean
$E(u \mid X_2, X_3, \dots, X_k) = 0$
or
$\operatorname{cov}(u, X_j) = 0, \quad j = 2, \dots, k$
• CLRM4 is more likely to hold when fewer factors are in the error term
– i.e. When the model is better specified
• BUT CLRM4 can fail due to three sources discussed previously
• We cannot know for sure whether the average value of the unobserved factors is
unrelated to the explanatory variables.
• But this is the most important assumption:
• What is consistency?
– It is an asymptotic or large sample property
– Let 𝑏𝑗 be the OLS estimator of 𝛽𝑗 for some j.
– For each N, 𝑏𝑗 has a probability distribution (representing its possible values in
different random samples of size N).
– If this estimator is consistent, then the distribution of 𝑏𝑗 becomes more and more
tightly distributed around 𝛽𝑗 as the sample size grows.
– As N tends to infinity, the distribution of 𝑏𝑗 collapses to the single point 𝛽𝑗 :
$\operatorname{plim} b_j = \beta_j$
– We say: $\beta_j$ is the probability limit of $b_j$
[Fig C3. Sampling distributions $f(b_j)$ of $b_j$ for increasing sample sizes (N = 4, N = 16, N = 40); horizontal axis $b_j$, with $\beta_j$ marked. Source: Wooldridge (2013)]
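The tightening of the sampling distribution can be illustrated with a small simulation. This is only a sketch in Python (standard library only): the data-generating process, the parameter values (`beta1 = 1.0`, `beta2 = 0.5`), the number of replications and the helper names are all assumptions made for illustration, not part of the slides. It uses the same sample sizes as Fig C3.

```python
import random

def ols_slope(x, y):
    """Slope b2 from a simple OLS regression of y on x (with an intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    num = sum((xi - x_bar) * yi for xi, yi in zip(x, y))
    den = sum((xi - x_bar) ** 2 for xi in x)
    return num / den

def simulate_slopes(n, reps=2000, beta1=1.0, beta2=0.5, seed=0):
    """Draw `reps` random samples of size n and return the OLS slope from each."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        u = [rng.gauss(0.0, 1.0) for _ in range(n)]  # u drawn independently of x, so CLRM4 holds
        y = [beta1 + beta2 * xi + ui for xi, ui in zip(x, u)]
        slopes.append(ols_slope(x, y))
    return slopes

def std_dev(values):
    """Standard deviation of a list of numbers."""
    m = sum(values) / len(values)
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5

# Spread of b2 across repeated samples, for the sample sizes shown in Fig C3
spread = {n: std_dev(simulate_slopes(n)) for n in (4, 16, 40)}
for n, sd in spread.items():
    print(f"N = {n:2d}: standard deviation of b2 across samples = {sd:.3f}")
```

The spread of the estimates should shrink as N grows, which is the pattern the figure depicts.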
Why does consistency matter?
• Virtually all economists agree:
– consistency is a minimal requirement for an estimator
Showing the consistency of OLS
• In general, we need matrix algebra to show this.
• But, we can illustrate it for a simple model with a single X
• The formula (estimator) for the slope coefficient is given by:
$$b_2 = \frac{\sum_{i=1}^{N} (X_{i2} - \bar{X}_2)\, Y_i}{\sum_{i=1}^{N} (X_{i2} - \bar{X}_2)^2}$$
• Substituting 𝑌𝑖 = 𝛽1 + 𝛽2 𝑋𝑖2 + 𝑢𝑖 and rearranging gives:
$$b_2 = \beta_2 + \frac{\frac{1}{N}\sum_{i=1}^{N} (X_{i2} - \bar{X}_2)\, u_i}{\frac{1}{N}\sum_{i=1}^{N} (X_{i2} - \bar{X}_2)^2}$$
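The "rearranging" step can be spelled out. The sketch below expands the numerator of the slope estimator after substituting for $Y_i$; it relies only on the two standard identities $\sum_i (X_{i2} - \bar{X}_2) = 0$ and $\sum_i (X_{i2} - \bar{X}_2) X_{i2} = \sum_i (X_{i2} - \bar{X}_2)^2$.

```latex
\begin{align*}
\sum_{i=1}^{N} (X_{i2} - \bar{X}_2)\, Y_i
  &= \beta_1 \sum_{i=1}^{N} (X_{i2} - \bar{X}_2)
   + \beta_2 \sum_{i=1}^{N} (X_{i2} - \bar{X}_2)\, X_{i2}
   + \sum_{i=1}^{N} (X_{i2} - \bar{X}_2)\, u_i \\
  &= \beta_2 \sum_{i=1}^{N} (X_{i2} - \bar{X}_2)^2
   + \sum_{i=1}^{N} (X_{i2} - \bar{X}_2)\, u_i
\end{align*}
```

Dividing both sides by $\sum_i (X_{i2} - \bar{X}_2)^2$, and multiplying the numerator and denominator of the last term by $1/N$, gives the expression for $b_2$.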
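• By the law of large numbers, the numerator and denominator converge in probability to $\operatorname{cov}(X_2, u)$ and $\operatorname{var}(X_2)$ respectively, i.e.
$$\operatorname{plim} b_2 = \beta_2 + \frac{\operatorname{cov}(X_2, u)}{\operatorname{var}(X_2)} = \beta_2$$
because $\operatorname{cov}(X_2, u) = 0$ under CLRM4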
1.5 CONSEQUENCE OF VIOLATING ASSUMPTION CLRM4
• Given that:
$$\operatorname{plim} b_2 = \beta_2 + \frac{\operatorname{cov}(X_2, u)}{\operatorname{var}(X_2)} \qquad (1.1)$$
• Then the inconsistency (or asymptotic bias) is:
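$$\operatorname{plim} b_2 - \beta_2 = \frac{\operatorname{cov}(X_2, u)}{\operatorname{var}(X_2)}$$
– If $\operatorname{cov}(X_2, u) = 0$: OLS is consistent (no asymptotic bias)
– If $\operatorname{cov}(X_2, u) < 0$: OLS is inconsistent and biased downwards
– If $\operatorname{cov}(X_2, u) > 0$: OLS is inconsistent and biased upwards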
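A quick numerical check of eq. (1.1) is sketched below in Python (standard library only). The set-up is hypothetical: the error u is allowed to leak into X2 through an assumed parameter `rho`, so that cov(X2, u) = 0.5 and var(X2) = 1.25, and the true slope is set to 2. Under these assumptions the formula predicts that the OLS slope settles near 2 + 0.5/1.25 = 2.4 rather than 2.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists (divides by N)."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / len(a)

def ols_slope(x, y):
    """Simple-regression slope: cov(x, y) / var(x)."""
    return cov(x, y) / cov(x, x)

rng = random.Random(1)
n = 50_000
beta1, beta2 = 1.0, 2.0   # assumed true parameters
rho = 0.5                 # assumed: X2 = noise + rho * u, so cov(X2, u) = rho * var(u)

u = [rng.gauss(0.0, 1.0) for _ in range(n)]
x2 = [rng.gauss(0.0, 1.0) + rho * ui for ui in u]   # var(X2) = 1 + rho**2
y = [beta1 + beta2 * xi + ui for xi, ui in zip(x2, u)]

b2 = ols_slope(x2, y)
plim_b2 = beta2 + rho / (1 + rho ** 2)   # eq. (1.1) with var(u) = 1

print(f"OLS slope b2           = {b2:.3f}")
print(f"beta2 + cov/var (1.1)  = {plim_b2:.3f}")
print(f"true beta2             = {beta2:.3f}")
```

Because cov(X2, u) is positive here, the estimate is pushed upwards, and a larger sample does not remove the gap.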
2. OMITTED VARIABLES
• Suppose the true model is:
𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + 𝑢
• but instead we estimate:
𝑌 = 𝑏1 + 𝑏2 𝑋2 + 𝑣
– E.g. 𝑌 is earnings, 𝑋2 is years of education, and 𝑋3 is ability
– Does 𝑏2 measure the true return to education, 𝛽2 ?
• From eq.(1.1):
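$$
\begin{aligned}
\operatorname{plim} b_2 &= \beta_2 + \frac{\operatorname{cov}(X_2, v)}{\operatorname{var}(X_2)} \\
&= \beta_2 + \frac{\operatorname{cov}(X_2, \beta_3 X_3 + u)}{\operatorname{var}(X_2)} \\
&= \beta_2 + \frac{\operatorname{cov}(X_2, \beta_3 X_3) + \operatorname{cov}(X_2, u)}{\operatorname{var}(X_2)} \\
&= \beta_2 + \beta_3 \frac{\operatorname{cov}(X_2, X_3)}{\operatorname{var}(X_2)}
\end{aligned}
$$
– Here $v = \beta_3 X_3 + u$ is the error term of the estimated model, and the last step uses $\operatorname{cov}(X_2, u) = 0$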
$$\operatorname{plim} b_2 = \beta_2 + \beta_3 \frac{\operatorname{cov}(X_2, X_3)}{\operatorname{var}(X_2)}$$
• Therefore 𝑏2 is asymptotically unbiased only if either:
➢ 𝛽3 = 0 (i.e. there is no omitted variable), or
➢ 𝑋2 and 𝑋3 are uncorrelated.
• If neither of these two occurs, then b2 is biased and inconsistent,
– direction of asymptotic bias depends on sign of 𝛽3 𝑐𝑜𝑣 𝑋2 , 𝑋3 .
• In the example:
𝑒𝑎𝑟𝑛𝑖𝑛𝑔𝑠 = 𝛽1 + 𝛽2 𝑦𝑟𝑠𝑐ℎ𝑜𝑜𝑙 + 𝛽3 𝑎𝑏𝑖𝑙𝑖𝑡𝑦 + 𝑢
– what is the direction of the bias of the return to education, when ability is
unobserved?
• Determining direction of bias is more complex with multiple Xs:
– Depends on their relationships with each other and with the omitted factor
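The omitted-variable formula can be checked on simulated data. The following is a sketch in Python (standard library only) under invented assumptions: ability is standard normal, years of schooling rise with ability, and the "true" coefficients are beta2 = 1 and beta3 = 3. None of these numbers come from the slides or from real data.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists (divides by N)."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / len(a)

rng = random.Random(4)
n = 50_000
beta1, beta2, beta3 = 5.0, 1.0, 3.0   # assumed true parameters

ability = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Assumed link: more able people get more schooling, so cov(yrschool, ability) = 2, var(yrschool) = 5
yrschool = [12.0 + 2.0 * a + rng.gauss(0.0, 1.0) for a in ability]
earnings = [beta1 + beta2 * s + beta3 * a + rng.gauss(0.0, 1.0)
            for s, a in zip(yrschool, ability)]

# "Short" regression that omits ability: earnings on yrschool only
b2_short = cov(yrschool, earnings) / cov(yrschool, yrschool)

# Prediction from plim b2 = beta2 + beta3 * cov(X2, X3) / var(X2)
predicted = beta2 + beta3 * 2.0 / 5.0

print(f"slope with ability omitted = {b2_short:.3f}")
print(f"formula prediction         = {predicted:.3f}")
print(f"true return to schooling   = {beta2:.3f}")
```

With beta3 > 0 and a positive covariance between schooling and ability, the estimated return to schooling is overstated in this set-up.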
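• Measurement error: writing the observed variable as $X_2 = X_2^* + e_2$ (true value $X_2^*$ plus measurement error $e_2$), the classical errors-in-variables result is:
$$\operatorname{plim} b_2 = \beta_2 \, \frac{\operatorname{var}(X_2^*)}{\operatorname{var}(X_2^*) + \operatorname{var}(e_2)}$$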
• Therefore, the OLS estimate 𝑏2 is biased towards zero (this is called attenuation bias).
– The larger the degree of measurement error, the greater is the attenuation bias.
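Attenuation bias is easy to see in a simulation. The sketch below uses Python (standard library only) and assumes classical measurement error: the observed regressor equals the true one plus independent noise. The variances (4 for the true variable, 1 for the error) and the slope of 1.5 are invented for illustration; they imply a shrinkage factor of 4/(4 + 1) = 0.8.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists (divides by N)."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / len(a)

def ols_slope(x, y):
    """Simple-regression slope: cov(x, y) / var(x)."""
    return cov(x, y) / cov(x, x)

rng = random.Random(5)
n = 50_000
beta1, beta2 = 1.0, 1.5   # assumed true parameters

x_true = [rng.gauss(0.0, 2.0) for _ in range(n)]          # var(X2*) = 4
x_obs = [xt + rng.gauss(0.0, 1.0) for xt in x_true]       # var(e2) = 1, independent of X2* and u
y = [beta1 + beta2 * xt + rng.gauss(0.0, 1.0) for xt in x_true]

b2_true_x = ols_slope(x_true, y)   # regression on the correctly measured variable
b2_obs_x = ols_slope(x_obs, y)     # regression on the mismeasured variable
predicted = beta2 * 4.0 / (4.0 + 1.0)

print(f"slope using true X2*      = {b2_true_x:.3f}")
print(f"slope using mismeasured X = {b2_obs_x:.3f}")
print(f"formula prediction        = {predicted:.3f}")
```

Raising the variance of the measurement error in this sketch pulls the estimated slope further towards zero.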
• Some examples:
– Models of demand and supply i.e. market equilibrium
• For commodities
• For an input into production e.g. labour
– Models of the macroeconomy
Example 1: Demand and supply
• Consider a system of supply and demand for a commodity:
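Demand: $Q = \alpha_1 + \alpha_2 P + \alpha_3 Y + u_1$ (4.1)
Supply: $Q = \beta_1 + \beta_2 P + u_2$ (4.2)
• In equilibrium, equate demand and supply:
$$
\begin{aligned}
\alpha_1 + \alpha_2 P + \alpha_3 Y + u_1 &= \beta_1 + \beta_2 P + u_2 \\
\alpha_2 P - \beta_2 P &= \beta_1 + u_2 - (\alpha_1 + \alpha_3 Y + u_1) \\
P(\alpha_2 - \beta_2) &= \beta_1 - \alpha_1 - \alpha_3 Y + u_2 - u_1
\end{aligned}
$$
$$P = \underbrace{\frac{\beta_1 - \alpha_1}{\alpha_2 - \beta_2}}_{\text{intercept}} \; \underbrace{- \frac{\alpha_3}{\alpha_2 - \beta_2}}_{\text{slope}} \, Y + \underbrace{\frac{u_2 - u_1}{\alpha_2 - \beta_2}}_{\text{error term}} \qquad (4.3)$$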
• Thus P is a function of u1: i.e. X variable correlated with error term
• P is an endogenous explanatory variable
– It is simultaneously determined with Q
– Cannot meaningfully estimate (4.1) using OLS: the estimate of $\alpha_2$ will be inconsistent.
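The correlation between price and the demand error can be made concrete with simulated data. The sketch below is in Python (standard library only). Demand parameters are written as alpha and supply parameters as beta, and every numerical value is a hypothetical choice. Price is generated from the equilibrium solution (4.3), and the sample covariance between P and u1 is compared with the value implied by (4.3), namely −var(u1)/(α2 − β2).

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists (divides by N)."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / len(a)

rng = random.Random(2)
n = 50_000
alpha1, alpha2, alpha3 = 10.0, -1.0, 0.5   # hypothetical demand parameters
beta1, beta2 = 2.0, 1.0                    # hypothetical supply parameters

income = [rng.gauss(10.0, 2.0) for _ in range(n)]   # exogenous Y
u1 = [rng.gauss(0.0, 1.0) for _ in range(n)]        # demand shock, var = 1
u2 = [rng.gauss(0.0, 1.0) for _ in range(n)]        # supply shock, var = 1

d = alpha2 - beta2
# Equilibrium price from eq. (4.3)
price = [(beta1 - alpha1) / d - (alpha3 / d) * yi + (e2 - e1) / d
         for yi, e1, e2 in zip(income, u1, u2)]

# Check that this price clears the market: demand and supply give the same Q
q_demand = [alpha1 + alpha2 * p + alpha3 * yi + e1 for p, yi, e1 in zip(price, income, u1)]
q_supply = [beta1 + beta2 * p + e2 for p, e2 in zip(price, u2)]
max_gap = max(abs(qd - qs) for qd, qs in zip(q_demand, q_supply))

cov_p_u1 = cov(price, u1)
implied = -1.0 / d   # -var(u1) / (alpha2 - beta2), with var(u1) = 1

print(f"largest demand-supply gap = {max_gap:.2e}")
print(f"sample cov(P, u1)         = {cov_p_u1:.3f}")
print(f"implied by (4.3)          = {implied:.3f}")
```

A non-zero covariance between P and u1 is exactly the violation of CLRM4 that makes OLS on the demand equation inconsistent.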
4.1 SIMULTANEITY BIAS
• Demand and supply equations (4.1) and (4.2) are known as structural equations:
– They describe the structure of the economy:
• Derivable from economic theory
• Have a causal interpretation
• In the structural equations:
– Price and quantity are determined simultaneously:
• price affects quantity and quantity affects price
– P and Q are endogenous variables, while Y is exogenous
– Estimation by OLS will lead to biased and inconsistent coefficient estimates
• Explanatory variables are correlated with error term
• Determining the direction of the bias is generally complicated in models with multiple X
variables.
Avoiding simultaneity bias
• Equations such as (4.3) are known as reduced form equations:
– Endogenous variables are expressed as a function only of all exogenous variables
(and a constant)
– Can derive a similar equation for Q
• Write the reduced form equations as:
$P = \pi_{11} + \pi_{21} Y + v_1$ (4.3a)
$Q = \pi_{12} + \pi_{22} Y + v_2$ (4.4)
• These reduced form equations can be estimated by OLS:
– All the RHS variables are exogenous
• But:
– We don’t care about the values of the 𝜋 parameters
– The parameters of interest are 𝛼1 , 𝛼2 and 𝛼3 , and 𝛽1 and 𝛽2 (from the structural
equations)
4.2 IDENTIFICATION OF STRUCTURAL EQUATIONS
• In OLS, we can identify the value of the parameters if:
– each explanatory variable is uncorrelated with the error term
• This condition does not hold when there is endogeneity
• We can sometimes still identify (or consistently estimate) the parameters in a structural
equation
– Similarly for cases of omitted variables or measurement error.
• Do we have enough information to retrieve the original coefficients (𝛼s and 𝛽s) from
the 𝜋s?
– Answer depends on having additional exogenous variables
– i.e. exogenous variables that are not in the equation of interest
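One way to see what "retrieving the structural coefficients from the πs" means is the following sketch in Python (standard library only). It simulates a demand and supply system with hypothetical parameter values (alpha for demand, beta for supply), estimates the two reduced forms by OLS, and then backs out the supply slope as the ratio of the reduced-form slopes, since substituting the reduced form for P into the supply equation gives π22 = β2·π21 and π12 = β1 + β2·π11. This works for the supply equation because income Y is excluded from it; no analogous calculation is available for the demand equation in this system.

```python
import random

def ols_fit(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

rng = random.Random(6)
n = 50_000
alpha1, alpha2, alpha3 = 10.0, -1.0, 0.5   # hypothetical demand parameters
beta1, beta2 = 2.0, 1.0                    # hypothetical supply parameters

income = [rng.gauss(10.0, 2.0) for _ in range(n)]
u1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
u2 = [rng.gauss(0.0, 1.0) for _ in range(n)]

d = alpha2 - beta2
price = [(beta1 - alpha1) / d - (alpha3 / d) * yi + (e2 - e1) / d
         for yi, e1, e2 in zip(income, u1, u2)]
quantity = [beta1 + beta2 * p + e2 for p, e2 in zip(price, u2)]

# Reduced forms: each endogenous variable regressed on the exogenous variable Y
pi11, pi21 = ols_fit(income, price)
pi12, pi22 = ols_fit(income, quantity)

# Back out the supply parameters from the reduced-form estimates
beta2_ils = pi22 / pi21
beta1_ils = pi12 - beta2_ils * pi11

# For comparison: OLS applied directly to the supply equation
_, beta2_ols = ols_fit(price, quantity)

print(f"supply slope from reduced forms = {beta2_ils:.3f}  (true value {beta2})")
print(f"supply intercept from reduced   = {beta1_ils:.3f}  (true value {beta1})")
print(f"supply slope from direct OLS    = {beta2_ols:.3f}")
```

In this set-up the direct OLS slope settles well away from the true supply slope, while the value recovered from the reduced forms is close to it.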
Three possible situations
1. An equation is unidentified
– We cannot get the structural coefficients from the reduced form estimates
– E.g. the demand equation $Q = \alpha_1 + \alpha_2 P + \alpha_3 Y + u_1$
– There are no additional exogenous variables in the model
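2. An equation is exactly identified
– A unique set of structural coefficients can be obtained from the reduced form estimates
3. An equation is over-identified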
– More than one set of structural coefficients could be obtained from the reduced form
– Example given later
Condition for a structural equation to be identified
• A structural equation satisfies the order condition if:
– number of exogenous variables excluded from the equation is
– at least as large as the number of right-hand side endogenous variables
• This is a necessary (but not sufficient) condition for identification
• The rank condition is a sufficient condition
– but requires matrix algebra: beyond scope of this module
• Express the order condition as:
K – k0 ≥ m0
• where K = no. of exogenous variables in the equation system (i.e. overall model)
in total
k0 = no. of exogenous variables in the structural equation
m0 = no. of endogenous variables on RHS of structural equation
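The counting rule can be written as a small helper. This is a sketch in Python; the function name `order_condition` and its return labels are made up for illustration. It checks only the order condition, which is necessary but not sufficient, so "identified" here always means "provided the rank condition also holds".

```python
def order_condition(K, k0, m0):
    """Classify a structural equation using the order condition K - k0 >= m0.

    K  : number of exogenous variables in the whole system
    k0 : number of exogenous variables included in this equation
    m0 : number of endogenous variables on the RHS of this equation
    The order condition is necessary, not sufficient, for identification.
    """
    excluded = K - k0          # exogenous variables excluded from this equation
    if excluded < m0:
        return "unidentified"
    if excluded == m0:
        return "exactly identified"
    return "over-identified"

# System (4.1)-(4.2): the only exogenous variable is Y, so K = 1
print("demand (4.1):", order_condition(K=1, k0=1, m0=1))   # Y included, P endogenous
print("supply (4.2):", order_condition(K=1, k0=0, m0=1))   # Y excluded, P endogenous

# Food-product model used in the Stata example: exogenous PS, INC, PF, so K = 3
print("food demand :", order_condition(K=3, k0=2, m0=1))   # PS, INC included
print("food supply :", order_condition(K=3, k0=1, m0=1))   # PF included
```

The excess K − k0 − m0, when positive, is the number of over-identifying restrictions.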
Demand: $Q = \alpha_1 + \alpha_2 P + \alpha_3 Y + u_1$ (4.1)
Supply: $Q = \beta_1 + \beta_2 P + u_2$ (4.2)
Supply equation: k0 = ; m0 =
K – k0 =
• Therefore we can get consistent estimates of the parameters in the supply equation
– but not in the demand equation.
Example 2: Keynesian macro model
• For a closed economy:
𝐶 = 𝛽1 + 𝛽2 𝑌 + 𝛽3 𝑟 + 𝑢1 (4.5)
𝐼 = 𝛾1 + 𝛾2 𝑟 + 𝑢2 (4.6)
𝑌 ≡ 𝐶 + 𝐼 + 𝐺 (4.7)
• Three equations in the system:
– therefore three endogenous (dependent) variables
• Assume all other variables are exogenous
• Various issues with such a simple macro model:
1. Difficult to argue that interest rates and government spending are exogenous
2. Model would be estimated with time series data, but is static:
• We expect adjustment lags
• One solution:
– Don’t use the endogenous Xs
– Rather, use some other variables instead
• We want these other variables to be:
– (highly) correlated with the endogenous Xs, but
– NOT correlated with the errors
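In symbols, for a single endogenous regressor $X_2$ and a single instrument, the two requirements and the resulting simple IV estimator can be sketched as follows. The instrument is written as $W$ here; that label is an assumption made for this sketch, not notation from the slides.

```latex
\begin{align*}
\text{Relevance:}  \quad & \operatorname{cov}(W, X_2) \neq 0 \\
\text{Exogeneity:} \quad & \operatorname{cov}(W, u) = 0 \\
\text{IV estimator:} \quad & b_2^{IV} = \frac{\sum_{i=1}^{N} (W_i - \bar{W})\, Y_i}{\sum_{i=1}^{N} (W_i - \bar{W})\, X_{i2}} \\
\text{Probability limit:} \quad & \operatorname{plim} b_2^{IV} = \beta_2 + \frac{\operatorname{cov}(W, u)}{\operatorname{cov}(W, X_2)} = \beta_2
\end{align*}
```

Exogeneity makes the numerator of the bias term zero, and relevance keeps the denominator away from zero. When the instrument is only weakly correlated with $X_2$, even a small correlation with $u$ can produce a large inconsistency.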
• Proposed IV:
– Use distance between living location and campus as instrument for absent
• Motivation:
– Relevance: longer commute → higher probability of being absent (e.g. due to transport problems)
– Exogeneity: distance not expected to be correlated with motivation
Some examples of instruments: 2
• We want to estimate the causal effect of education on earnings:
log(𝑤𝑎𝑔𝑒) = 𝛽1 + 𝛽2 𝑦𝑟𝑠𝑐ℎ𝑜𝑜𝑙 + 𝛽3 𝑎𝑏𝑖𝑙𝑖𝑡𝑦 + 𝑢
• Proposed IV 1: Parents’ education
– Relevance: parents’ education is correlated with child’s education in many samples
(true for SA?)
– Exogeneity: but likely to be correlated with child’s ability
• Proposed IV 2: Number of siblings
– Relevance: having more siblings is typically associated with lower education per child
(true for SA?)
– Exogeneity: likely to be uncorrelated with child’s ability
• Need to make similar arguments for measurement error cases
The statistical reliability of the results depends on having good IVs
5.2 TWO-STAGE LEAST SQUARES (2SLS)
• Two-stage least squares (2SLS) provides a method for using multiple
instrumental variables.
• 2SLS proceeds as follows:
– Stage 1:
• Regress each endogenous variable that appears on the RHS of the structural
equation on all of its instruments
– In simultaneous equations, this is the reduced form equation
• Predict the value of each endogenous variable, $\hat{Z}$
– Stage 2:
• Use the predicted value of each endogenous variable in place of the variable
itself
• Standard errors have to be corrected in Stage 2
• Interpret the resulting coefficients and perform hypothesis tests as usual.
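The two stages can be carried out by hand on simulated data. The sketch below is in Python (standard library only), with one endogenous regressor, one instrument and no other explanatory variables. The data-generating process, parameter values and variable names (`instrument`, `x_endog`, `common_shock`) are all hypothetical. It reports point estimates only: as noted above, standard errors from a manual second stage are not valid without correction, so none are computed.

```python
import random

def ols_fit(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

rng = random.Random(3)
n = 50_000
beta1, beta2 = 1.0, 2.0   # assumed true structural parameters

instrument = [rng.gauss(0.0, 1.0) for _ in range(n)]     # exogenous by construction
common_shock = [rng.gauss(0.0, 1.0) for _ in range(n)]   # unobserved; drives both x and u
x_endog = [z + c + rng.gauss(0.0, 1.0) for z, c in zip(instrument, common_shock)]
u = [c + rng.gauss(0.0, 1.0) for c in common_shock]      # cov(x_endog, u) = 1
y = [beta1 + beta2 * x + e for x, e in zip(x_endog, u)]

# OLS on the structural equation (inconsistent here)
_, b2_ols = ols_fit(x_endog, y)

# Stage 1: regress the endogenous variable on the instrument and predict it
g0, g1 = ols_fit(instrument, x_endog)
x_hat = [g0 + g1 * z for z in instrument]

# Stage 2: use the predicted values in place of the endogenous variable
_, b2_2sls = ols_fit(x_hat, y)

print(f"true beta2    = {beta2:.3f}")
print(f"OLS estimate  = {b2_ols:.3f}")
print(f"2SLS estimate = {b2_2sls:.3f}")
```

In practice a dedicated routine such as Stata's `ivregress 2sls` does both stages and reports corrected standard errors.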
Stata example
Consider a demand and supply model for a food product:
Demand: $Q = \alpha_1 + \alpha_2 P + \alpha_3 PS + \alpha_4 INC + u_1$
Supply: $Q = \beta_1 + \beta_2 P + \beta_3 PF + u_2$
• Q is quantity; P is price; PS is price of a substitute; INC is per capita income; PF is price of
factor of production
• Endogenous: Q and P; exogenous: PS, INC and PF.
• The demand equation, estimated by OLS:
. regress q p ps inc
Source | SS df MS Number of obs = 30
-------------+------------------------------ F( 3, 26) = 8.52
Model | 305.92719 3 101.97573 Prob > F = 0.0004
Residual | 311.209627 26 11.969601 R-squared = 0.4957
-------------+------------------------------ Adj R-squared = 0.4375
Total | 617.136817 29 21.2805799 Root MSE = 3.4597
------------------------------------------------------------------------------
q | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p | .0232954 .0768423 0.30 0.764 -.1346562 .181247
ps | .7100395 .2143246 3.31 0.003 .269489 1.15059
inc | .0764442 1.190855 0.06 0.949 -2.371393 2.524282
_cons | 1.091045 3.71158 0.29 0.771 -6.538218 8.720308
------------------------------------------------------------------------------
• If price and quantity are simultaneously determined, then the coefficient on p is likely to be biased.
. ivregress 2sls q (p = ps inc pf) ps inc, first
First-stage regressions
-----------------------
Number of obs = 30
F( 3, 26) = 69.19
Prob > F = 0.0000
R-squared = 0.8887
Adj R-squared = 0.8758
Root MSE = 6.5975
------------------------------------------------------------------------------
p | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ps | 1.708147 .3508806 4.87 0.000 .9869017 2.429393
inc | 7.602491 1.724336 4.41 0.000 4.058068 11.14691
pf | 1.353906 .2985062 4.54 0.000 .7403175 1.967494
_cons | -32.51242 7.984235 -4.07 0.000 -48.92425 -16.10059
------------------------------------------------------------------------------
• This stage creates an instrument for the potentially endogenous variable, price.
Test for over-identifying restrictions
• Suppose that we have q more instruments than we need:
– i.e. q = (K – k0) – (m0) > 0
– Recall that IVs must be excluded exogenous variables
– E.g. one endogenous X (m0 = 1), and three proposed IVs (K – k0 = 3)
• q = 3 – 1 = 2 over-identifying restrictions.
• Then we can test whether the 2SLS residuals are correlated with q linear functions of
the instruments
------------------------------------------------------------------------------
q | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p | .3379816 .0236408 14.30 0.000 .2916465 .3843166
pf | -1.000909 .0782929 -12.78 0.000 -1.154361 -.8474581
_cons | 20.0328 1.160349 17.26 0.000 17.75856 22.30704
------------------------------------------------------------------------------
Instrumented: p
Instruments: pf ps inc
. predict u, resid
. reg u pf ps inc
------------------------------------------------------------------------------
u | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pf | .0363318 .067262 0.54 0.594 -.1019273 .1745909
ps | .0790798 .0790635 1.00 0.326 -.0834376 .2415971
inc | -.4023461 .3885424 -1.04 0.310 -1.201007 .3963143
_cons | -1.149104 1.799078 -0.64 0.529 -4.847162 2.548953
------------------------------------------------------------------------------
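The auxiliary regression of the 2SLS residuals on all the exogenous variables is usually turned into a test statistic in the following way. This is the standard textbook form of the over-identification test; the $R^2$ of the auxiliary regression is not shown in the output above, so no numerical value is computed here.

```latex
\begin{align*}
H_0 &: \text{all instruments are uncorrelated with the structural error} \\
\text{Test statistic} &: N \cdot R_{\hat{u}}^2 \;\overset{a}{\sim}\; \chi^2_q \quad \text{under } H_0
\end{align*}
```

Here $R_{\hat{u}}^2$ is the R-squared from regressing the 2SLS residuals on all exogenous variables, $N$ is the sample size (30 in this example) and $q$ is the number of over-identifying restrictions. For the supply equation, two exogenous variables (ps and inc) are excluded and there is one endogenous regressor (p), so $q = 2 - 1 = 1$. A large value of the statistic is evidence that at least one instrument is not exogenous.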
• H0: no endogeneity bias (both OLS and IV estimators will be consistent, but
OLS is more efficient)
• H1: endogeneity (only IV will be consistent – the difference between the OLS and IV
coefficients will not converge to zero as n → ∞)
Stata example
A. Regression-based test:
To test whether price is endogenous in the demand equation, estimate the
reduced form equation for price, then include its residuals in the demand equation:
Reduced form equation: regress the potentially endogenous variable, p, on all exogenous variables in the model:
. reg p ps inc pf
Source | SS df MS Number of obs = 30
-------------+------------------------------ F( 3, 26) = 69.19
Model | 9034.77551 3 3011.59184 Prob > F = 0.0000
Residual | 1131.69721 26 43.5268157 R-squared = 0.8887
-------------+------------------------------ Adj R-squared = 0.8758
Total | 10166.4727 29 350.568025 Root MSE = 6.5975
------------------------------------------------------------------------------
p | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ps | 1.708147 .3508806 4.87 0.000 .9869017 2.429393
inc | 7.602491 1.724336 4.41 0.000 4.058068 11.14691
pf | 1.353906 .2985062 4.54 0.000 .7403175 1.967494
_cons | -32.51242 7.984235 -4.07 0.000 -48.92425 -16.10059
------------------------------------------------------------------------------
Predict the residuals from the reduced form equation:
. predict e, resid
Include the residuals as an extra variable in the demand equation:
. regress q p ps inc e
Source | SS df MS Number of obs = 30
-------------+------------------------------ F( 4, 25) = 60.88
Model | 559.677099 4 139.919275 Prob > F = 0.0000
Residual | 57.4597181 25 2.29838873 R-squared = 0.9069
-------------+------------------------------ Adj R-squared = 0.8920
Total | 617.136817 29 21.2805799 Root MSE = 1.516
------------------------------------------------------------------------------
q | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p | -.3744591 .0506639 -7.39 0.000 -.4788032 -.2701149
ps | 1.296033 .1092277 11.87 0.000 1.071074 1.520992
inc | 5.013977 .702231 7.14 0.000 3.567705 6.460249
e | .7124655 .0678067 10.51 0.000 .5728149 .852116
_cons | -4.279471 1.704836 -2.51 0.019 -7.790645 -.7682958
------------------------------------------------------------------------------
The p-value on the residuals, e, is 0.000: reject H0 at all conventional levels of significance
• Therefore reject H0: θ = 0 (p is exogenous)
• Therefore price is endogenous in the demand equation.
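Written out, the regression-based test uses the two equations below. The symbol $\theta$ is the coefficient on the reduced-form residuals; the names $\pi_1, \dots, \pi_4$ and $\varepsilon$ are labels chosen for this sketch.

```latex
\begin{align*}
\text{Reduced form:} \quad & p = \pi_1 + \pi_2\, ps + \pi_3\, inc + \pi_4\, pf + e \\
\text{Augmented demand equation:} \quad & q = \alpha_1 + \alpha_2\, p + \alpha_3\, ps + \alpha_4\, inc + \theta\, \hat{e} + \varepsilon \\
\text{Hypotheses:} \quad & H_0: \theta = 0 \ (\text{p is exogenous}) \quad \text{vs.} \quad H_1: \theta \neq 0 \\
\text{From the output:} \quad & t = \frac{0.7124655}{0.0678067} \approx 10.51
\end{align*}
```

The coefficients on p, ps and inc in the augmented regression (for example −.3744591 on p) are numerically the same as the IV estimates reported in the Hausman table, which is a general property of this residual-inclusion approach.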
B. Hausman test:
Command for the Hausman test, comparing the two sets of estimates:
. hausman IV OLS, cons sigmamore
---- Coefficients ----
| (b) (B) (b-B) sqrt(diag(V_b-V_B))
| IV OLS Difference S.E.
-------------+----------------------------------------------------------------
p | -.3744591 .0232954 -.3977545 .0863877
ps | 1.296033 .7100395 .5859938 .1272711
inc | 5.013977 .0764442 4.937533 1.072376
_cons | -4.279471 1.091045 -5.370516 1.166414
------------------------------------------------------------------------------
b = consistent under Ho and Ha; obtained from ivregress
B = inconsistent under Ha, efficient under Ho; obtained from regress
Test: Ho: difference in coefficients not systematic
chi2(1) = (b-B)'[(V_b-V_B)^(-1)](b-B)
= 21.20
Prob>chi2 = 0.0000
Reject H0 at all conventional levels of significance
• The way in which endogeneity is discussed and dealt with is a crucial determinant of:
– Reliability of empirical estimates
– Whether an empirical paper is published
– Success of empirical dissertations for advanced degrees
• In this topic, we’ve gone through some key tools for dealing with this issue:
– It remains a complex conceptual and empirical issue which is difficult to grapple with.