
Lecture 9

Model Misspecification
Model specification is concerned with the choice of the functional form that is used to
analyze relationships between variables. In fact, the first assumption of Model A requires the
model to be correctly specified. In other words, it is assumed that we know precisely which
variables should be included. However, in practice no one can be absolutely sure that the chosen
functional form is a correct one; therefore specification errors are possible. Primarily, there are
two reasons for their existence:
1) the formula is wrong;
2) the list of explanatory variables is wrong.
The objective of this lecture is to discuss the consequences of either including variables that
should not appear in the correct specification or leaving out variables that are relevant for the
relationship being analyzed. So the lecture will mainly deal with the second reason for the presence
of specification errors. Procedures for selecting the functional form are a more advanced subject
that will not be considered in detail, but we will outline the general approach to the choice of model
specification.
Variable misspecification:
Suppose it is necessary to choose one of the two functions (1) and (2):
𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝑢 (1)
𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + 𝑢 (2)
Two kinds of misspecification are possible here:
I. Omitting an important explanatory variable, i.e. estimating the type (1) relationship when the
true one is of the type (2).
II. Including an unnecessary explanatory variable, i.e. estimating the type (2) relationship when the
true one is of the type (1).

Two kinds of misspecification

• Fitted 𝑌̂ = 𝑏1 + 𝑏2 𝑋2 when the true model is 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝑢: correct specification.
• Fitted 𝑌̂ = 𝑏1 + 𝑏2 𝑋2 when the true model is 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + 𝑢: omission of a relevant variable.
• Fitted 𝑌̂ = 𝑏1 + 𝑏2 𝑋2 + 𝑏3 𝑋3 when the true model is 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝑢: inclusion of an irrelevant variable.
• Fitted 𝑌̂ = 𝑏1 + 𝑏2 𝑋2 + 𝑏3 𝑋3 when the true model is 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + 𝑢: correct specification.

I. Omission of a relevant variable:


Let’s analyze the statistical properties of the estimated coefficients. Remember that there are several
equivalent ways to express the formulas for the OLS coefficients. This lecture uses the sample
covariance/variance approach, an alternative tool that leads to the same result as the derivation
in the lecture slides.
Correct specification: 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + 𝑢 (*);
Fitted model: 𝑌̂ = 𝑏1 + 𝑏2 𝑋2.
OLS result:
$$b_2 = \frac{\widehat{\mathrm{Cov}}(X_2, Y)}{\widehat{\mathrm{Var}}(X_2)}$$
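As a quick numerical check (a minimal sketch in Python with synthetic data; the numbers are illustrative, not from the lecture), the covariance/variance ratio reproduces the OLS slope:

```python
import numpy as np

# Synthetic data purely for illustration
rng = np.random.default_rng(0)
X2 = rng.normal(size=100)
Y = 1.0 + 2.0 * X2 + rng.normal(size=100)

# OLS slope as the ratio of sample covariance to sample variance
b2 = np.cov(X2, Y, ddof=0)[0, 1] / np.var(X2)

# Cross-check against a library fit of Y on X2 (np.polyfit returns [slope, intercept])
slope, intercept = np.polyfit(X2, Y, 1)
print(b2, slope)  # the two slope estimates coincide up to rounding
```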
Substituting the right-hand side of (*) for 𝑌 and using the properties of sample variance and covariance:

$$b_2 = \frac{\widehat{\mathrm{Cov}}(X_2, Y)}{\widehat{\mathrm{Var}}(X_2)} = \frac{\widehat{\mathrm{Cov}}(X_2,\ \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u)}{\widehat{\mathrm{Var}}(X_2)} =$$
$$= \frac{\widehat{\mathrm{Cov}}(X_2, \beta_1) + \beta_2\,\widehat{\mathrm{Cov}}(X_2, X_2) + \beta_3\,\widehat{\mathrm{Cov}}(X_2, X_3) + \widehat{\mathrm{Cov}}(X_2, u)}{\widehat{\mathrm{Var}}(X_2)} =$$
$$= \frac{0 + \beta_2\,\widehat{\mathrm{Var}}(X_2) + \beta_3\,\widehat{\mathrm{Cov}}(X_2, X_3) + \widehat{\mathrm{Cov}}(X_2, u)}{\widehat{\mathrm{Var}}(X_2)} = \underbrace{\beta_2}_{\text{true value}} + \underbrace{\beta_3\,\frac{\widehat{\mathrm{Cov}}(X_2, X_3)}{\widehat{\mathrm{Var}}(X_2)}}_{\text{bias}} + \underbrace{\frac{\widehat{\mathrm{Cov}}(X_2, u)}{\widehat{\mathrm{Var}}(X_2)}}_{\text{random component}}$$
As 𝑋2 is non-stochastic and
$$E\left(\widehat{\mathrm{Cov}}(X_2, u)\right) = E\left(\frac{\sum (X_{2i} - \bar{X}_2)(u_i - \bar{u})}{n}\right) = \frac{\sum (X_{2i} - \bar{X}_2)\,E(u_i - \bar{u})}{n} = 0,$$
we obtain
$$E(b_2) = E\left(\beta_2 + \beta_3\,\frac{\widehat{\mathrm{Cov}}(X_2, X_3)}{\widehat{\mathrm{Var}}(X_2)} + \frac{\widehat{\mathrm{Cov}}(X_2, u)}{\widehat{\mathrm{Var}}(X_2)}\right) = \beta_2 + \beta_3\,\frac{\widehat{\mathrm{Cov}}(X_2, X_3)}{\widehat{\mathrm{Var}}(X_2)},$$
$$\mathrm{bias} = E(b_2) - \beta_2 = \beta_3\,\frac{\widehat{\mathrm{Cov}}(X_2, X_3)}{\widehat{\mathrm{Var}}(X_2)} = \beta_3\,\frac{\sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}{\sum (X_{2i} - \bar{X}_2)^2}.$$
If 𝑏𝑖𝑎𝑠 > 0, the estimate is said to be biased upwards.
If 𝑏𝑖𝑎𝑠 < 0, the estimate is said to be biased downwards.
Intuitive explanation:
If 𝑋3 is omitted, 𝑋2 has two effects:
• a direct effect on 𝑌;
• an apparent indirect effect, acting as a proxy for the missing 𝑋3.
The strength of the apparent indirect effect depends on two factors:
1) the ability of 𝑋2 to mimic the behavior of 𝑋3, which can be derived by regressing 𝑋3 on 𝑋2. In fact, it is the slope coefficient ℎ2 in the regression 𝑋3 = ℎ1 + ℎ2 𝑋2 + 𝑢, where
$$h_2 = \frac{\widehat{\mathrm{Cov}}(X_2, X_3)}{\widehat{\mathrm{Var}}(X_2)} = \frac{\sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}{\sum (X_{2i} - \bar{X}_2)^2};$$
2) the direct effect of 𝑋3 on 𝑌, measured by 𝛽3, i.e. the importance of the omitted variable in explaining the dependent variable.
[Figure: path diagram with 𝑋2, 𝑋3 and 𝑌, showing the direct effect of 𝑋2 on 𝑌 holding 𝑋3 constant, the effect of 𝑋3 on 𝑌, and the apparent effect of 𝑋2 acting as a mimic for 𝑋3.]
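In a given sample the two channels combine exactly: the short-regression slope equals the long-regression slope plus the estimate of 𝛽3 times ℎ2. A minimal numpy sketch with made-up data (variable names and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X2 = rng.normal(size=n)
X3 = 0.5 * X2 + rng.normal(size=n)           # X3 correlated with X2
Y = 1.0 + 2.0 * X2 + 3.0 * X3 + rng.normal(size=n)  # true model (2)

# Long regression: Y on a constant, X2 and X3
Xlong = np.column_stack([np.ones(n), X2, X3])
b1, b2_long, b3_long = np.linalg.lstsq(Xlong, Y, rcond=None)[0]

# Short regression: Y on a constant and X2 only (X3 omitted)
b2_short = np.cov(X2, Y, ddof=0)[0, 1] / np.var(X2)

# h2: slope from regressing X3 on X2 (the "mimicking" coefficient)
h2 = np.cov(X2, X3, ddof=0)[0, 1] / np.var(X2)

# Exact in-sample identity: b2_short = b2_long + b3_long * h2
print(b2_short, b2_long + b3_long * h2)
```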
Consequences of this type of misspecification: other things being equal, the estimated coefficients are
biased, and the standard errors, t tests, and F test are invalid.
Comparison of 𝑹² in the presence of omitted variable bias:
Consider three models:
Correct specification: 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + 𝑢 with $R^2_{\text{correct}}$ (i);
Variable 𝑋3 is omitted: 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝑢 with $R^2_{\text{om\_var3}}$ (ii);
Variable 𝑋2 is omitted: 𝑌 = 𝛽1 + 𝛽3 𝑋3 + 𝑢 with $R^2_{\text{om\_var2}}$ (iii).
Let’s analyze how the goodness-of-fit measures compare. Is it possible to determine the
contribution to 𝑅² of each explanatory variable in (i) by running (ii) and (iii) and then calculating
𝑅² separately? If it were, their separate measures would determine the joint explanatory power
exactly in one way: $R^2_{\text{correct}} = R^2_{\text{om\_var3}} + R^2_{\text{om\_var2}}$. BUT IT IS NOT TRUE because of the
omitted variable bias. In fact, either $R^2_{\text{correct}} > R^2_{\text{om\_var3}} + R^2_{\text{om\_var2}}$ or $R^2_{\text{correct}} < R^2_{\text{om\_var3}} + R^2_{\text{om\_var2}}$.
The answer depends on the direction of the biases in (ii) and (iii).
Case 1: both biases are positive:
$$\mathrm{bias}_{ii} > 0 \ \text{and}\ \mathrm{bias}_{iii} > 0 \iff \beta_3\,\frac{\widehat{\mathrm{Cov}}(X_2, X_3)}{\widehat{\mathrm{Var}}(X_2)} > 0 \ \text{and}\ \beta_2\,\frac{\widehat{\mathrm{Cov}}(X_2, X_3)}{\widehat{\mathrm{Var}}(X_3)} > 0.$$
In (ii) the indirect effect of 𝑋2 acting as a proxy for 𝑋3 is positive, inflating its apparent explanatory power. In (iii) the indirect effect of 𝑋3 acting as a proxy for 𝑋2 is also positive, inflating its apparent explanatory power. Combining these two results: $R^2_{\text{correct}} < R^2_{\text{om\_var3}} + R^2_{\text{om\_var2}}$.
Case 2: both biases are negative:
$$\mathrm{bias}_{ii} < 0 \ \text{and}\ \mathrm{bias}_{iii} < 0 \iff \beta_3\,\frac{\widehat{\mathrm{Cov}}(X_2, X_3)}{\widehat{\mathrm{Var}}(X_2)} < 0 \ \text{and}\ \beta_2\,\frac{\widehat{\mathrm{Cov}}(X_2, X_3)}{\widehat{\mathrm{Var}}(X_3)} < 0.$$
In (ii) the apparent explanatory power of 𝑋2 is reduced by the negative bias in its coefficient. Similarly, in (iii) the apparent explanatory power of 𝑋3 is reduced by the negative bias in its coefficient. Combining these two effects: $R^2_{\text{correct}} > R^2_{\text{om\_var3}} + R^2_{\text{om\_var2}}$.

For other combinations of bias directions, the answer to the question of which sum is greater is
not definite, because it depends on the relative strengths of the indirect effects. The main point is
that, in general, it is impossible to determine the contribution to 𝑅² of each explanatory variable
in multiple regression analysis.
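The failure of the decomposition is easy to see numerically. Below is a minimal sketch (synthetic data; the positive correlation between the regressors is chosen so that both biases are positive, putting us in the first case above):

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on X (X includes a constant column)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(2)
n = 500
X2 = rng.normal(size=n)
X3 = 0.6 * X2 + rng.normal(size=n)      # positively correlated regressors
Y = 1 + 2 * X2 + 3 * X3 + rng.normal(size=n)

ones = np.ones(n)
r2_correct = r_squared(np.column_stack([ones, X2, X3]), Y)   # model (i)
r2_om3 = r_squared(np.column_stack([ones, X2]), Y)           # model (ii)
r2_om2 = r_squared(np.column_stack([ones, X3]), Y)           # model (iii)

# With both biases positive, the separate R^2's double-count shared power:
print(r2_correct, r2_om3 + r2_om2)      # expect r2_correct < the sum here
```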
II. Inclusion of an irrelevant variable:
If 𝑋3 has no effect on the dependent variable but 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + 𝑢 is estimated,
then we can express the relationship with 𝛽3 = 0. However, at the moment of estimation we do
not know this fact (in other words, we do not use all available information), which results in
inefficient estimates. At the same time, the properties of the OLS estimators, including unbiasedness,
do not depend on the true values of the parameters. Formally, the model is still correctly specified,
so the standard errors remain valid, but they tend to be larger, indicating the loss of efficiency.
Mathematically,
$$\sigma^2_{b_2} = \frac{\sigma^2_u}{\sum (X_{2i} - \bar{X}_2)^2} \times \frac{1}{1 - r^2_{X_2, X_3}}.$$
Therefore, since $0 < 1 - r^2_{X_2, X_3} < 1$, $\sigma^2_{b_2}$ exceeds the variance $\sigma^2_u / \sum (X_{2i} - \bar{X}_2)^2$ obtained from 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝑢.

Consequences of this type of misspecification: other things being equal, the estimated coefficients are
unbiased, and the standard errors, t tests, and F test are valid, but efficiency is lower.
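A small simulation makes the efficiency loss visible: with a fixed design, the slope estimate stays centred on 𝛽2 whether or not the irrelevant variable is included, but its sampling variance is inflated by the factor 1/(1 − r²). A sketch with synthetic data (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 50, 2000
X2 = rng.normal(size=n)
X3 = 0.8 * X2 + 0.6 * rng.normal(size=n)    # irrelevant but correlated with X2
Xshort = np.column_stack([np.ones(n), X2])
Xlong = np.column_stack([np.ones(n), X2, X3])

b2_short, b2_long = [], []
for _ in range(reps):
    Y = 1 + 2 * X2 + rng.normal(size=n)     # true model: X3 plays no role
    b2_short.append(np.linalg.lstsq(Xshort, Y, rcond=None)[0][1])
    b2_long.append(np.linalg.lstsq(Xlong, Y, rcond=None)[0][1])

# Both estimators are centred on beta2 = 2, but including the irrelevant
# X3 inflates the sampling variance of b2 by roughly 1 / (1 - r^2).
print(np.mean(b2_short), np.mean(b2_long))
print(np.var(b2_short), np.var(b2_long))    # long-model variance is larger
```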

Consequences of misspecification

• Fitted 𝑌̂ = 𝑏1 + 𝑏2 𝑋2 when the true model is 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝑢: correct specification.
• Fitted 𝑌̂ = 𝑏1 + 𝑏2 𝑋2 when the true model is 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + 𝑢: coefficients are biased (in general); standard errors and tests are invalid.
• Fitted 𝑌̂ = 𝑏1 + 𝑏2 𝑋2 + 𝑏3 𝑋3 when the true model is 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝑢: coefficients are unbiased (in general) but inefficient; standard errors and tests are valid.
• Fitted 𝑌̂ = 𝑏1 + 𝑏2 𝑋2 + 𝑏3 𝑋3 when the true model is 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + 𝑢: correct specification.
The approach to the choice of model specification
The foregoing discussion implies that it is preferable to start from an initial model with the
maximum number of explanatory variables and then improve it by gradually eliminating
insignificant variables, thereby obtaining more and more efficient estimators. If we go
the other way round, i.e. start from a model with a minimum number of explanatory variables
and add new ones, the obtained estimates will be biased from the very beginning and the standard
errors will be invalid.
However, a criterion is needed to determine which explanatory variables are insignificant
and should be excluded. If we simply exclude all variables with insignificant coefficients, we may
make an error because of the presence of multicollinearity. Therefore, the decision to
exclude variables is made on the basis of an F-test for the joint explanatory power of several
variables (the F-test for linear restrictions), as sketched below.
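A minimal implementation of that test from first principles (numpy only; the data and the null hypothesis are made up for illustration):

```python
import numpy as np

def f_test(X_unres, X_res, y):
    """F statistic for the restrictions that drop the columns of X_unres
    not present in X_res (classical F-test for linear restrictions)."""
    def rss(X):
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        e = y - X @ beta
        return e @ e
    n, k = X_unres.shape
    q = k - X_res.shape[1]                  # number of restrictions
    return ((rss(X_res) - rss(X_unres)) / q) / (rss(X_unres) / (n - k))

rng = np.random.default_rng(4)
n = 100
X2, X3, X4 = rng.normal(size=(3, n))
Y = 1 + 2 * X2 + 0.1 * X3 + 0.1 * X4 + rng.normal(size=n)

ones = np.ones(n)
X_unres = np.column_stack([ones, X2, X3, X4])
X_res = np.column_stack([ones, X2])         # H0: beta3 = beta4 = 0
print(f_test(X_unres, X_res, Y))  # compare with the critical value of F(2, n - 4)
```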

Proxy variables:
There are many situations in which one of the explanatory variables is an unobserved factor: either
it is not precisely defined (like the quality of education) or obtaining data on it requires a lot of
time. However, omitting such a factor results in omitted variable bias. The problem can be
reduced or eliminated by using a proxy variable that is as closely linearly related to the unobserved
variable as possible.
Consider 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + ... + 𝛽𝑘 𝑋𝑘 + 𝑢.
Assume that 𝑋2 is unobserved, so we use a proxy 𝑍 such that 𝑋2 = 𝜆 + 𝜇𝑍. The relationship is transformed into:
$$Y = \beta_1 + \beta_2(\lambda + \mu Z) + \beta_3 X_3 + \ldots + \beta_k X_k + u = (\beta_1 + \beta_2\lambda) + \beta_2\mu Z + \beta_3 X_3 + \ldots + \beta_k X_k + u.$$
Now all variables are observable. If the proxy relationship is an exact one, then
1. The coefficients of 𝑋3, … , 𝑋𝑘 will be the same as those we would obtain using 𝑋2 as
an explanatory variable.
2. Standard errors and t-statistics of these coefficients will be the same as those obtainable
when 𝑋2 is used as an explanatory variable.
3. 𝑅 2 will be the same as in the model with 𝑋2.
4. The coefficient of 𝑍 will provide an estimate of 𝛽2 𝜇, and, consequently, it is impossible
to obtain an estimate of 𝛽2, unless 𝜇 is known.
5. The t-statistic for 𝑍 will be the same as the one we would obtain for 𝑋2, so we can test the
significance of the explanatory variable 𝑋2 even though we are unable to estimate its
coefficient.
6. It is impossible to obtain an estimate of the intercept 𝛽1, since the intercept of the
estimated model is equal to 𝛽1 + 𝛽2 𝜆. However, it is usually more important to estimate
the regression coefficients than the intercept.
In practice, the relationship between the proxy variable and the approximated one is usually not
exactly linear, but the above tendencies are still observed and should be taken into account.
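Points 1-6 can be verified numerically when the proxy relationship is exact. A minimal sketch (𝜆, 𝜇 and all data are invented for illustration):

```python
import numpy as np

# Invented setup: X2 is "unobserved", Z is an exact linear proxy for it
rng = np.random.default_rng(5)
n = 200
Z = rng.normal(size=n)
lam, mu = 3.0, 1.5
X2 = lam + mu * Z                      # X2 = lambda + mu*Z, exactly
X3 = rng.normal(size=n)
Y = 1 + 2 * X2 + 4 * X3 + rng.normal(size=n)

ones = np.ones(n)

def fit(X):
    beta = np.linalg.lstsq(X, Y, rcond=None)[0]
    e = Y - X @ beta
    return beta, 1 - e @ e / ((Y - Y.mean()) @ (Y - Y.mean()))

beta_x, r2_x = fit(np.column_stack([ones, X2, X3]))   # infeasible: uses X2
beta_z, r2_z = fit(np.column_stack([ones, Z, X3]))    # feasible: uses the proxy

print(beta_x[2], beta_z[2])            # coefficient of X3: identical
print(r2_x, r2_z)                      # R^2: identical
print(beta_z[1], beta_x[1] * mu)       # coefficient of Z equals b2 * mu
print(beta_z[0], beta_x[0] + beta_x[1] * lam)  # intercept equals b1 + b2*lambda
```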

Unintended proxies:
If we use a proxy variable without realizing that it is a proxy, the situation is called
unintentional use of a proxy. The consequences depend on the motive for estimating the regression:
• Use to predict future values of the dependent variable: if correlation(𝑋2, 𝑍) is relatively high,
then it does not matter whether 𝑋2 or 𝑍 is included for that purpose.
• Use as a policy variable to influence other variables: it matters whether 𝑋2 or 𝑍 is used. If there
is no functional connection between 𝑋2 and 𝑍, then the proxy has no direct effect on 𝑌.
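The prediction side of this distinction can be illustrated with a short simulation (synthetic data; the correlation level is chosen arbitrarily): fitted values from the regression on 𝑍 track those from the regression on 𝑋2 almost perfectly, even though only 𝑋2 actually drives 𝑌.

```python
import numpy as np

# Synthetic data: Y is driven by X2 only; Z is highly correlated with X2
rng = np.random.default_rng(6)
n = 300
Z = rng.normal(size=n)
X2 = 2 + 1.5 * Z + 0.2 * rng.normal(size=n)
Y = 1 + 2 * X2 + rng.normal(size=n)

# Fit simple regressions of Y on X2 and of Y on the unintended proxy Z
b_x2 = np.polyfit(X2, Y, 1)
b_z = np.polyfit(Z, Y, 1)

# For prediction the two fits are nearly interchangeable ...
print(np.corrcoef(np.polyval(b_x2, X2), np.polyval(b_z, Z))[0, 1])  # close to 1
# ... but intervening on Z itself would not move Y unless Z also moves X2.
```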

A Monte Carlo experiment: omitted variable

A Monte Carlo experiment is a controlled experiment that allows us to check whether the results
of an estimated model are plausible; in particular, finite-sample distributional properties can be
investigated.
We know that omitted variables result in biased estimates. This point can be illustrated by
means of Monte Carlo simulation:
Correct specification: 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + 𝑢
Estimated model: 𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝑢
Stages:
1) Choose the true values of 𝛽1, 𝛽2, 𝛽3;
2) Choose 𝑋2 and 𝑋3 in each observation;
3) Use a random generating process to produce the disturbance term, and use the true relationship
𝑌 = 𝛽1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + 𝑢 to calculate the corresponding values of the dependent variable;
4) Use the generated values of 𝑌 to run the regression with the omitted variable, 𝑌 = 𝛽1 +
𝛽2 𝑋2 + 𝑢, and estimate the parameters;
5) Repeat the procedure from step 3.
As a result, when we repeat the experiment many times, we obtain the distribution of the estimates
of 𝛽1 and 𝛽2 from step 4). We can then compare these values to the true parameter values chosen
in step 1). The finite-sample bias can be determined (a simulation along these lines is sketched
below):
$$\mathrm{bias} = \beta_3\,\frac{\widehat{\mathrm{Cov}}(X_2, X_3)}{\widehat{\mathrm{Var}}(X_2)} = \beta_3\,\frac{\sum (X_{2i} - \bar{X}_2)(X_{3i} - \bar{X}_3)}{\sum (X_{2i} - \bar{X}_2)^2}.$$
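A compact implementation of the five stages (numpy; the parameter values, sample size, and replication count are illustrative choices):

```python
import numpy as np

# Step 1: choose the true parameter values
beta1, beta2, beta3 = 1.0, 2.0, 3.0

# Step 2: choose X2 and X3 in each observation (held fixed across replications)
rng = np.random.default_rng(7)
n = 50
X2 = rng.normal(size=n)
X3 = 0.5 * X2 + rng.normal(size=n)

b2_estimates = []
for _ in range(5000):                        # steps 3-5: repeat many times
    u = rng.normal(size=n)                   # step 3: generate the disturbance ...
    Y = beta1 + beta2 * X2 + beta3 * X3 + u  # ... and the dependent variable
    # Step 4: fit the misspecified model Y = b1 + b2*X2 (X3 omitted)
    b2_estimates.append(np.cov(X2, Y, ddof=0)[0, 1] / np.var(X2))

# Compare the simulated mean bias of b2 with the theoretical bias formula
bias_theory = beta3 * np.cov(X2, X3, ddof=0)[0, 1] / np.var(X2)
print(np.mean(b2_estimates) - beta2, bias_theory)  # the two should agree
```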
