Lecture 09 Model Misspecification
Model Misspecification
Model specification is concerned with the choice of the functional form that is used to
analyze relationships between variables. In fact, the first assumption of Model A requires the
model to be correctly specified. In other words, it is assumed that we know precisely which
variables should be included. However, in practice no one can be absolutely sure that the chosen
functional form is the correct one; therefore specification errors are possible. There are two main
reasons for their existence:
1) the formula is wrong;
2) the list of explanatory variables is wrong.
The objective of this lecture is to discuss the consequences of either including variables that
should not appear in the correct specification or leaving out variables that are relevant for the
relationship being analyzed. So, basically, the lecture will deal with the second reason for the
presence of specification errors. Procedures for choosing the functional form are a more advanced
subject that will not be considered in detail, but we will outline the general approach to the choice
of model specification.
Variable misspecification:
Suppose it is necessary to choose one of the two specifications (1) and (2):
𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝑢 (1)
𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + 𝑢 (2)
Two kinds of misspecification are possible here:
I. Omitting an important explanatory variable, i.e. estimating the type (1) relationship when the
true one is of the type (2).
II. Including an unnecessary explanatory variable, i.e. estimating the type (2) relationship when the
true one is of the type (1).
I. Omission of an important variable:
Suppose that the true relationship is of type (2), but we estimate (1) instead, obtaining the fitted model 𝑌̂ = 𝑏1 + 𝑏2 𝑋2. Then
\[
b_2 = \frac{\widehat{\mathrm{Cov}}(X_2, Y)}{\widehat{\mathrm{Var}}(X_2)}
= \frac{\widehat{\mathrm{Cov}}(X_2,\ \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u)}{\widehat{\mathrm{Var}}(X_2)}
= \frac{\widehat{\mathrm{Cov}}(X_2, \beta_1) + \beta_2\,\widehat{\mathrm{Cov}}(X_2, X_2) + \beta_3\,\widehat{\mathrm{Cov}}(X_2, X_3) + \widehat{\mathrm{Cov}}(X_2, u)}{\widehat{\mathrm{Var}}(X_2)}
\]
\[
= \frac{0 + \beta_2\,\widehat{\mathrm{Var}}(X_2) + \beta_3\,\widehat{\mathrm{Cov}}(X_2, X_3) + \widehat{\mathrm{Cov}}(X_2, u)}{\widehat{\mathrm{Var}}(X_2)}
= \underbrace{\beta_2}_{\text{true value}}
+ \underbrace{\beta_3\,\frac{\widehat{\mathrm{Cov}}(X_2, X_3)}{\widehat{\mathrm{Var}}(X_2)}}_{\text{bias}}
+ \underbrace{\frac{\widehat{\mathrm{Cov}}(X_2, u)}{\widehat{\mathrm{Var}}(X_2)}}_{\text{random component}}
\]
As $X_2$ is non-stochastic and
\[
\mathrm{E}\big(\widehat{\mathrm{Cov}}(X_2, u)\big)
= \mathrm{E}\!\left(\frac{\sum (X_{2i} - \bar{X}_2)(u_i - \bar{u})}{n}\right)
= \frac{\sum (X_{2i} - \bar{X}_2)\,\mathrm{E}(u_i - \bar{u})}{n} = 0,
\]
\[
\mathrm{E}(b_2)
= \mathrm{E}\!\left(\beta_2 + \beta_3\,\frac{\widehat{\mathrm{Cov}}(X_2, X_3)}{\widehat{\mathrm{Var}}(X_2)} + \frac{\widehat{\mathrm{Cov}}(X_2, u)}{\widehat{\mathrm{Var}}(X_2)}\right)
= \beta_2 + \beta_3\,\frac{\widehat{\mathrm{Cov}}(X_2, X_3)}{\widehat{\mathrm{Var}}(X_2)}
\]
\[
\mathrm{bias} = \mathrm{E}(b_2) - \beta_2
= \beta_3\,\frac{\widehat{\mathrm{Cov}}(X_2, X_3)}{\widehat{\mathrm{Var}}(X_2)}
= \beta_3\,\frac{\sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}{\sum (X_{2i} - \bar{X}_2)^2}
\]
If 𝑏𝑖𝑎𝑠 > 0, the estimate is said to be biased upwards.
If 𝑏𝑖𝑎𝑠 < 0, the estimate is said to be biased downwards.
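The bias formula above can be checked numerically. Below is a minimal simulation sketch (not part of the original lecture notes; the parameter values, the positive 𝑋2–𝑋3 correlation, and the use of Python/NumPy are illustrative assumptions): the slope of the short regression of 𝑌 on 𝑋2 alone is compared with 𝛽2 + 𝛽3·Cov(𝑋2, 𝑋3)/Var(𝑋2).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta1, beta2, beta3 = 1.0, 2.0, 3.0   # illustrative true parameters

# X2 and X3 are positively correlated, so omitting X3 should bias b2 upwards.
x2 = rng.normal(size=n)
x3 = 0.5 * x2 + rng.normal(size=n)
u = rng.normal(size=n)
y = beta1 + beta2 * x2 + beta3 * x3 + u

# Short regression of Y on X2 only (X3 omitted): slope = Cov(X2, Y) / Var(X2).
b2_short = np.cov(x2, y)[0, 1] / np.var(x2, ddof=1)

# Value predicted by the formula: beta2 + beta3 * Cov(X2, X3) / Var(X2).
predicted = beta2 + beta3 * np.cov(x2, x3)[0, 1] / np.var(x2, ddof=1)

print(f"b2 from the short regression: {b2_short:.3f}")
print(f"beta2 + predicted bias:       {predicted:.3f}")
```

With these assumed numbers both values come out close to 3.5, i.e. well above the true 𝛽2 = 2, illustrating an upward bias.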
Intuitive explanation:
If 𝑋3 is omitted, 𝑋2 has two effects:
• direct effect on 𝑌;
• apparent indirect effect acting as a proxy for missing 𝑋3.
The strength of the apparent indirect effect depends on two factors:
1) the ability of 𝑋2 to mimic the behavior of 𝑋3, which can be derived by regressing 𝑋3 on 𝑋2; in fact, it is the slope coefficient in the regression 𝑋3 = ℎ1 + ℎ2 𝑋2 + 𝑢, where
\[
h_2 = \frac{\widehat{\mathrm{Cov}}(X_2, X_3)}{\widehat{\mathrm{Var}}(X_2)}
= \frac{\sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}{\sum (X_{2i} - \bar{X}_2)^2};
\]
2) the direct effect of 𝑋3 on 𝑌, measured by 𝛽3.
[Figure: path diagram showing the effect of 𝑋3 on 𝑌, the direct effect of 𝑋2 on 𝑌 holding 𝑋3 constant, and the apparent effect of 𝑋2 acting as a mimic for 𝑋3.]
In other cases the direction of the bias is not definite, because it depends on the relative strengths
of the direct and indirect effects. The main point is that, in general, it is impossible to determine
the individual contribution of each explanatory variable to R2 in multiple regression analysis.
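This last claim can be illustrated with a small numerical sketch (not from the lecture; all parameter values are assumptions chosen for illustration): when the regressors are correlated, the R2 values of the two simple regressions do not add up to the R2 of the multiple regression, so there is no unique per-variable decomposition.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
x2 = rng.normal(size=n)
x3 = 0.7 * x2 + rng.normal(size=n)           # correlated regressors
y = 1.0 + 2.0 * x2 + 3.0 * x3 + rng.normal(size=n)

def r_squared(X, y):
    """R^2 of an OLS regression of y on a constant and the columns in X."""
    X = np.column_stack([np.ones(len(y))] + list(X))
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

r2_x2 = r_squared([x2], y)        # simple regression on X2 alone
r2_x3 = r_squared([x3], y)        # simple regression on X3 alone
r2_both = r_squared([x2, x3], y)  # multiple regression on both

# The individual R^2 values do not sum to the multiple-regression R^2.
print(r2_x2, r2_x3, r2_x2 + r2_x3, r2_both)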
II. Inclusion of an irrelevant variable:
If 𝑋3 has no effect on the dependent variable but 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + 𝑢 is estimated,
the true relationship can be written with 𝛽3 = 0. However, at the moment of estimation we do
not know this fact (in other words, we do not use all available information), and as a result the
estimates are inefficient. At the same time, the properties of the OLS estimators, including
unbiasedness, do not depend on the true values of the parameters. Formally, the model is still
correctly specified, so the standard errors remain valid, but they tend to be larger, indicating the loss of efficiency.
Mathematically,
\[
\sigma^2_{b_2} = \frac{\sigma_u^2}{\sum (X_{2i} - \bar{X}_2)^2} \times \frac{1}{1 - r^2_{X_2, X_3}}
\]
Therefore, as $0 < 1 - r^2_{X_2, X_3} < 1$, $\sigma^2_{b_2}$ is larger than the variance
$\sigma_u^2 / \sum (X_{2i} - \bar{X}_2)^2$ that would be obtained from 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝑢.
Consequences of this type of misspecification: other things being equal, the estimated coefficients
are unbiased; standard errors, t-tests and the F-test are valid, but efficiency is lower.
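A minimal simulation sketch of this result (not from the lecture; the sample size, coefficients and the 𝑋2–𝑋3 correlation are assumptions chosen for illustration): adding the irrelevant regressor 𝑋3 leaves 𝑏2 unbiased, but inflates its variance by roughly the factor 1/(1 − r²).

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 5_000
beta1, beta2 = 1.0, 2.0

# Fixed (non-stochastic) regressors; X3 is correlated with X2 but irrelevant for Y.
x2 = rng.normal(size=n)
x3 = 0.8 * x2 + 0.6 * rng.normal(size=n)

X_short = np.column_stack([np.ones(n), x2])        # correct specification
X_long = np.column_stack([np.ones(n), x2, x3])     # includes the irrelevant X3

b2_short, b2_long = [], []
for _ in range(reps):
    u = rng.normal(size=n)
    y = beta1 + beta2 * x2 + u                     # true model does NOT involve X3
    b2_short.append(np.linalg.lstsq(X_short, y, rcond=None)[0][1])
    b2_long.append(np.linalg.lstsq(X_long, y, rcond=None)[0][1])

r2 = np.corrcoef(x2, x3)[0, 1] ** 2
print("mean b2 (short, long):", np.mean(b2_short), np.mean(b2_long))  # both close to 2
print("variance ratio long/short:", np.var(b2_long) / np.var(b2_short))
print("1 / (1 - r^2):", 1 / (1 - r2))                                 # theoretical inflation factor
```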
Consequences of misspecification (summary):
• True model 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝑢, fitted model 𝑌̂ = 𝑏1 + 𝑏2 𝑋2: correct specification, no problems.
• True model 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + 𝑢, fitted model 𝑌̂ = 𝑏1 + 𝑏2 𝑋2 (omission of an important variable): coefficients are biased (in general) and standard errors are invalid.
• True model 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝑢, fitted model 𝑌̂ = 𝑏1 + 𝑏2 𝑋2 + 𝑏3 𝑋3 (inclusion of an irrelevant variable): coefficients are unbiased (in general) but inefficient; standard errors are valid.
• True model 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + 𝑢, fitted model 𝑌̂ = 𝑏1 + 𝑏2 𝑋2 + 𝑏3 𝑋3: correct specification, no problems.
Proxy variables:
There are many examples when one of explanatory variables is some unobserved factor. In
particular, it is either not precisely defined (as the quality of education) or it requires a lot of time
to obtain data. However, skipping such factor results in omitted variable bias. The problem can be
reduced or eliminated by using a proxy variable that is linearly related to the unobserved variable
as much as possible.
Consider 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + ... + 𝛽𝑘 𝑋𝑘 + 𝑢.
Assume that 𝑋2 is unobserved, so we use a proxy 𝑍 such that 𝑋2 = 𝜆 + 𝜇𝑍.
\[
Y = \beta_1 + \beta_2(\lambda + \mu Z) + \beta_3 X_3 + \ldots + \beta_k X_k + u
\]
The relationship is transformed into
\[
Y = (\beta_1 + \beta_2 \lambda) + \beta_2 \mu Z + \beta_3 X_3 + \ldots + \beta_k X_k + u.
\]
All variables become observed. If the proxy relationship is an exact one, then
1. The coefficients of 𝑋3, … , 𝑋𝑘 will be the same as those we would obtain using 𝑋2 as
an explanatory variable.
2. Standard errors and t-statistics of these coefficients will be the same as those obtainable
when 𝑋2 is used as an explanatory variable.
3. 𝑅² will be the same as in the model with 𝑋2.
4. The coefficient of 𝑍 will provide an estimate of 𝛽2 𝜇, and, consequently, it is impossible
to obtain an estimate of 𝛽2, unless 𝜇 is known.
5. The t-statistic for 𝑍 will be the same as the one we would obtain for 𝑋2, so we can test the
significance of the explanatory variable 𝑋2 even though we are unable to estimate its
coefficient.
6. It is impossible to obtain an estimate of the intercept 𝛽1, since the intercept of the
estimated model is equal to 𝛽1 + 𝛽2 𝜆. However, it is usually more important to estimate
the regression coefficients than the intercept.
In practice, the relationship between the proxy variable and the approximated one is usually not
exactly linear, but the above tendencies are still observed and should be taken into account.
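The six properties above can be verified with a short numerical sketch for the exact-proxy case (not from the lecture; all numbers, including λ and μ, are illustrative assumptions): with 𝑋2 = λ + μ𝑍, regressing on 𝑍 reproduces the coefficients of the other variables and 𝑅², while the coefficient of 𝑍 estimates 𝛽2μ and the intercept estimates 𝛽1 + 𝛽2λ.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
beta1, beta2, beta3 = 1.0, 2.0, -1.5
lam, mu = 0.7, 3.0                        # proxy relationship: X2 = lam + mu * Z

z = rng.normal(size=n)
x2 = lam + mu * z                         # unobserved variable, exactly linear in the proxy
x3 = rng.normal(size=n)
y = beta1 + beta2 * x2 + beta3 * x3 + rng.normal(size=n)

def ols(X, y):
    """OLS on a constant plus the columns in X; returns (coefficients, R^2)."""
    X = np.column_stack([np.ones(len(y))] + list(X))
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    return b, r2

b_true, r2_true = ols([x2, x3], y)        # infeasible regression using the unobserved X2
b_proxy, r2_proxy = ols([z, x3], y)       # feasible regression using the proxy Z

print("coefficient of X3:", b_true[2], b_proxy[2])                  # identical
print("R^2:", r2_true, r2_proxy)                                    # identical
print("coefficient of Z vs beta2*mu:", b_proxy[1], beta2 * mu)
print("intercept vs beta1 + beta2*lambda:", b_proxy[0], beta1 + beta2 * lam)
```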
Unintended proxies:
If we use a proxy variable without realizing that it is a proxy, this situation is called the
unintentional use of a proxy. The consequences depend on the purpose for which the regression is estimated:
• use to predict future values of the dependent variable;
• use as a policy variable to influence other variables.