The Stata Journal (yyyy) vv, Number ii, pp. 1–38
Boston College Economics Working Paper No. 667
Abstract. We extend our 2003 paper on instrumental variables (IV) and GMM estimation and testing and describe enhanced routines that address HAC standard errors, weak instruments, LIML and k-class estimation, tests for endogeneity, and RESET and autocorrelation tests for IV estimates.
Keywords: st0001, instrumental variables, weak instruments, generalized method of moments, endogeneity, heteroskedasticity, serial correlation, HAC standard errors, LIML, CUE, overidentifying restrictions, Frisch–Waugh–Lovell theorem, RESET, Cumby–Huizinga test
1 Introduction
In an earlier paper, Baum et al. (2003), we discussed instrumental variables (IV) es-
timators in the context of Generalized Method of Moments (GMM) estimation and
presented Stata routines for estimation and testing comprising the ivreg2 suite. Since
that time, those routines have been considerably enhanced and additional routines have
been added to the suite. This paper presents the analytical underpinnings of both ba-
sic IV/GMM estimation and these enhancements and describes the enhanced routines.
Some of these features are now also available in Stata 10’s ivregress, while others are
not.
The additions include:
• Estimation and testing that is robust to, and efficient in the presence of, arbitrary
serial correlation.
• A range of test statistics that allow the user to address the problems of underiden-
tification or weak identification, including statistics that are robust in the presence
of heteroskedasticity, autocorrelation or clustering.
• Three additional IV/GMM estimators: the GMM continuously updated estimator
(CUE) of Hansen et al. (1996); limited-information maximum likelihood (LIML);
and k-class estimators.
• A more intuitive syntax for GMM estimation: the gmm2s option requests the two-step feasible efficient GMM estimator, which reduces to standard IV/2SLS if no robust covariance matrix estimator is also requested. The cue option requests the continuously updated GMM estimator (CUE).
• An option that allows the user to “partial out” regressors: something which is
particularly useful when the user has a rank-deficient estimate of the covariance
matrix of orthogonality conditions (common with the cluster option and single-
ton dummy variables).
• Several advanced options, including options that will speed up estimation using
ivreg2 by suppressing the calculation of various checks and statistics.
• A version of the RESET regression specification test, ivreset, that (unlike official
Stata’s ovtest) is appropriate for use in an instrumental variables context.
2.1 Setup
The equation to be estimated is, in matrix notation,
y = Xβ + u (1)
The matrix of regressors X is n × K. We partition the regressors into [X1 X2], with the K1 regressors X1 assumed under the null to be endogenous and the K2 ≡ (K − K1) remaining regressors X2 assumed exogenous, giving us

y = [X_1 \; X_2]\,[\beta_1' \;\; \beta_2']' + u \qquad (2)
The set of instrumental variables is Z and is n × L. This is the full set of variables
that are assumed to be exogenous, i.e., E(Zi ui ) = 0. We partition the instruments
into [Z1 Z2], where the L1 instruments Z1 are excluded instruments and the remaining L2 ≡ (L − L1) instruments Z2 ≡ X2 are the included instruments/exogenous regressors, so that Z = [Z1 Z2].
The order condition for identification of the equation is L ≥ K, implying that there must be at least as many excluded instruments (L1) as there are endogenous regressors (K1), since Z2 is common to both lists. If L = K, the equation is said to be exactly identified
by the order condition; if L > K, the equation is overidentified. The order condition is
necessary but not sufficient for identification; see Section 7 for a full discussion.
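For example, in the inflation equation estimated below, the unemployment rate is the single endogenous regressor (K1 = 1) and there are L1 = 4 excluded instruments, so the equation is overidentified by the order condition, with L − K = 3 overidentifying restrictions.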
Each of the L moment equations corresponds to a sample moment. For some given
estimator β̂, we can write these L sample moments as
g(\hat{\beta}) = \frac{1}{n}\sum_{i=1}^{n} g_i(\hat{\beta}) = \frac{1}{n}\sum_{i=1}^{n} Z_i'(y_i - X_i\hat{\beta}) = \frac{1}{n} Z'\hat{u} \qquad (6)
The intuition behind GMM is to choose an estimator for β that brings g(β̂) as close to
zero as possible. If the equation to be estimated is exactly identified, so that L = K,
then we have as many equations—the L moment conditions—as we do unknowns: the
K coefficients in β̂. In this case it is possible to find a β̂ that solves g(β̂) = 0, and this
GMM estimator is in fact a special case of the IV estimator as we discuss below.
If the equation is overidentified, however, so that L > K, then we have more equa-
tions than we do unknowns. In general it will not be possible to find a β̂ that will set all
L sample moment conditions exactly to zero. In this case, we take an L × L weighting matrix W and use it to construct a quadratic form in the moment conditions. This gives us the GMM objective function:

J(\hat{\beta}) = n\, g(\hat{\beta})'\, W\, g(\hat{\beta}) \qquad (7)
In the linear case we are considering, deriving and solving the K first-order conditions ∂J(β̂)/∂β̂ = 0 (treating W as a matrix of constants) yields the GMM estimator:1

\hat{\beta}_{GMM} = (X'ZWZ'X)^{-1}\, X'ZWZ'y \qquad (9)
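To see where Equation (9) comes from, a brief sketch of the algebra, using g(β̂) = (1/n)Z′û from Equation (6):

J(\hat{\beta}) = n\, g(\hat{\beta})' W g(\hat{\beta}) = \frac{1}{n}\,(y - X\hat{\beta})'\, Z W Z'\,(y - X\hat{\beta})

\frac{\partial J(\hat{\beta})}{\partial \hat{\beta}} = -\frac{2}{n}\, X'Z W Z'(y - X\hat{\beta}) = 0 \;\Longrightarrow\; X'ZWZ'X\,\hat{\beta} = X'ZWZ'y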
The GMM estimator is consistent for any symmetric positive definite weighting matrix W, and thus there are as many GMM estimators as there are choices of weighting matrix W. Efficiency is not guaranteed for an arbitrary W, so we refer to the estimator defined in Equation (9) as the possibly inefficient GMM estimator.
We are particularly interested in efficient GMM estimators: GMM estimators with
minimum asymptotic variance. Moreover, for any GMM estimator to be useful, we
must be able to conduct inference, and for that we need estimates of the variance of the
estimator. Both require estimates of the covariance matrix of orthogonality conditions,
a key concept in GMM estimation.
The asymptotic variance of the possibly inefficient GMM estimator takes the sandwich form

V(\hat{\beta}_{GMM}) = (Q_{XZ}' W Q_{XZ})^{-1}\,(Q_{XZ}' W S W Q_{XZ})\,(Q_{XZ}' W Q_{XZ})^{-1} \qquad (11)

where Q_{XZ} is the cross-moment matrix of the regressors and instruments and S ≡ E(g_i g_i').
1. The results of the minimization, and hence the GMM estimator, will be the same for weighting
matrices that differ by a constant of proportionality.
Under standard assumptions (see Hayashi (2000), pp. 202–203, 209) the inefficient GMM estimator is √n-consistent. That is,

\sqrt{n}\,(\hat{\beta}_{GMM} - \beta) \rightarrow N[0,\, V(\hat{\beta}_{GMM})] \qquad (12)
The efficient GMM estimator (EGMM) makes use of an optimal weighting matrix W which minimizes the asymptotic variance of the estimator. This is achieved by choosing W = S^{-1}. Substituting this into Equation (9) and Equation (13), we obtain the efficient GMM estimator

\hat{\beta}_{EGMM} = (X'Z S^{-1} Z'X)^{-1}\, X'Z S^{-1} Z'y \qquad (14)
Similarly,

\sqrt{n}\,(\hat{\beta}_{EGMM} - \beta) \rightarrow N[0,\, V(\hat{\beta}_{EGMM})] \qquad (16)

and we perform inference on \sqrt{n}\,\hat{\beta}_{EGMM} by using

\frac{1}{n}\, V\!\left(\sqrt{n}\,\hat{\beta}_{EGMM}\right) = \frac{1}{n}\left(Q_{XZ}'\, S^{-1} Q_{XZ}\right)^{-1} \qquad (17)
In practice, S is estimated by \hat{S} = \frac{1}{n} Z'\hat{\Omega}Z, where Ω̂ is the diagonal matrix of squared residuals û²_i from β̃, the consistent but not necessarily efficient first-step GMM estimator. In the ivreg2 implementation of two-
step efficient GMM, this first-step estimator is β̂IV , the IV estimator. The resulting
estimate Ŝ can be used to conduct consistent inference for the first-step estimator using
Equation (11), or it can be used to obtain and conduct inference for the efficient GMM
estimator using Equations (14) and (17).
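In ivreg2 syntax, the two-step feasible efficient GMM estimator under arbitrary heteroskedasticity is requested by combining the gmm2s and robust options; a schematic sketch, with placeholder variable names:

. * schematic: y = depvar, x1 endogenous, x2 exogenous, z1 z2 excluded instruments
. ivreg2 y x2 (x1 = z1 z2), gmm2s robust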
In the next section we discuss how the two-step GMM estimator can be applied
when the errors are serially correlated.
The long-run covariance matrix of the orthogonality conditions

S = \Gamma_0 + \sum_{j=1}^{\infty}\left(\Gamma_j + \Gamma_j'\right) \qquad (21)

may be seen as a generalization of Equation (20), with \Gamma_0 = E(g_i g_i') and

\Gamma_j = E(g_t\, g_{t-j}'), \quad j = \pm 1, \pm 2, \ldots \qquad (22)

The sample autocovariances are

\hat{\Gamma}_j = \frac{1}{n}\sum_{t=1}^{n-j} \hat{g}_t\, \hat{g}_{t-j}' = \frac{1}{n}\sum_{t=1}^{n-j} Z_t'\,\hat{u}_t \hat{u}_{t-j}\, Z_{t-j} \qquad (24)
The usual way this is handled in practice is for the summation to be truncated at a
specified lag q. Thus the S matrix can be estimated by
\hat{S} = \hat{\Gamma}_0 + \sum_{j=1}^{q} \kappa\!\left(\frac{j}{q_n}\right)\left(\hat{\Gamma}_j + \hat{\Gamma}_j'\right) \qquad (25)
where u_t, u_{t−j} are replaced by consistent estimates from first-stage estimation. The kernel function, κ(j/q_n), applies appropriate weights to the terms of the summation, with q_n defined as the bandwidth of the kernel (possibly as a function of n).3 In many
kernels, consistency is obtained by having the weight fall to zero after a certain number
of lags.
The best-known approach to this problem in econometrics is that of Newey and
West (1987b), which generates Ŝ using the Bartlett kernel function and a user-specified
value of q. For the Bartlett kernel, κ(j/q_n) = 1 − j/q_n if j ≤ q_n − 1, and 0 otherwise. These
estimates are said to be HAC: heteroskedasticity- and autocorrelation-consistent, as
they incorporate the standard sandwich formula (Equation (20)) in computing Γ0 .
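For example, with bandwidth q_n = 5 the Bartlett weights applied to Γ̂_1, . . . , Γ̂_4 are 1 − j/5 = 0.8, 0.6, 0.4, 0.2, and autocovariances at lag 5 and beyond receive zero weight.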
HAC estimates can be calculated by ivreg2 using the robust and bw() options with
the kernel function’s bandwidth (the bw() option) set to q.4 The bandwidth may also
be chosen optimally by specifying bw(auto) using the automatic bandwidth selection
criterion of Newey and West (1994).5,6 By default, ivreg2 uses the Bartlett kernel
function.7 If the equation contains endogenous regressors, these options will cause the
IV estimates to be HAC. If the equation is overidentified and the robust, gmm2s and
bw() options are specified, the resulting GMM estimates will be both HAC and more
efficient than those produced by IV.
The Newey–West (Bartlett kernel function) specification is only one of many feasible
HAC estimators of the covariance matrix. Andrews (1991) shows that in the class of
positive semidefinite kernels, the rate of convergence of Ŝ → S depends on the choice of
kernel and bandwidth. The Bartlett kernel’s performance is bettered by those in a subset
of this class, including the Quadratic Spectral kernel. Accordingly, ivreg2 provides a
menu of kernel choices, including (abbreviations in parentheses): Quadratic Spectral
(qua or qs), Truncated (tru); Parzen (par); Tukey–Hanning (thann); Tukey–Hamming
(thamm); Daniell (dan); and Tent (ten). In the cases of the Bartlett, Parzen, and Tukey–
Hanning/Hamming kernels, the number of lags used to construct the kernel estimate
equals the bandwidth (bw) minus one.8 If the kernels above are used with bw(1),
no lags are used and ivreg2 will report the usual Eicker–Huber–White “sandwich”
heteroskedastic–robust variance estimates. Most, but not all, of these kernels guarantee
3. For more detail on this GMM estimator, see Hayashi (2000), pp. 406–417.
4. For the special case of OLS, Newey–West standard errors are available from [TS] newey with the
maximum lag (q − 1) specified by newey’s lag() option.
5. This implementation is identical to that provided by Stata’s [R] ivregress.
6. Automatic bandwidth selection is only available for the Bartlett, Parzen and Quadratic spectral
kernels; see below.
7. A common choice of bandwidth for the Bartlett kernel function is T^{1/3}.
8. A common choice of bandwidth for these kernels is (q − 1) ≈ T^{1/4} (Greene (2003), p. 200). A value related to the periodicity of the data (4 for quarterly, 12 for monthly, etc.) is often chosen.
that the estimated Ŝ is positive definite and therefore always invertible; the truncated kernel, for example, was proposed in the early literature in this area but is now rarely used because it can generate a noninvertible Ŝ. For a survey covering various kernel
estimators and their properties, see Cushing and McGarvey (1999) and Hall (2005), pp.
75–86.
Under conditional homoskedasticity the expression for the autocovariance matrix simplifies to \Gamma_j = E(u_t u_{t-j})\,E(Z_t' Z_{t-j}), and the calculations of the corresponding kernel estimators also simplify; see Hayashi (2000), pp. 413–414. These estimators may perform better than their heteroskedastic-
robust counterparts in finite samples. If the researcher is satisfied with the assumption
of homoskedasticity but wants to deal with autocorrelation of unknown form, she should
use the AC correction without the H correction for arbitrary heteroskedasticity by omit-
ting the robust option. ivreg2 allows selection of H, AC, or HAC VCEs by combining the robust, bw() and kernel options. Thus both robust and bw() must be specified to calculate a HAC VCE of the Newey–West type, employing the default Bartlett kernel.9
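As a concrete sketch of the taxonomy, using the inflation equation estimated below (the bandwidth of 5 is purely illustrative):

. * H: heteroskedasticity-robust VCE
. ivreg2 D.inf (UR=L2.ggdp L.TBILL L.ER L.TBON), robust
. * AC: autocorrelation-consistent VCE, maintaining conditional homoskedasticity
. ivreg2 D.inf (UR=L2.ggdp L.TBILL L.ER L.TBON), bw(5)
. * HAC: robust and bw() combined, with the default Bartlett kernel
. ivreg2 D.inf (UR=L2.ggdp L.TBILL L.ER L.TBON), robust bw(5)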
To illustrate the use of HAC standard errors, we estimate a quarterly time-series
model relating the change in the U.S. inflation rate (D.inf) to the unemployment rate
(UR) for 1960q3–1999q4. As instruments, we use the second lag of quarterly GDP growth
and the lagged values of the Treasury bill rate, the trade-weighted exchange rate and
the Treasury medium-term bond rate.10 We first estimate the equation with standard
IV under the assumption of i.i.d. errors.
. use http://fmwww.bc.edu/ec-p/data/stockwatson/macrodat
. generate inf = 100 * log( CPI / L4.CPI )
(4 missing values generated)
. generate ggdp = 100 * log( GDP / L4.GDP )
(10 missing values generated)
. ivreg2 D.inf (UR=L2.ggdp L.TBILL L.ER L.TBON)
IV (2SLS) estimation
(output omitted)
9. It should also be noted that Stata’s official [TS] newey does not allow gaps in time-series data. As
there is no difficulty in computing HAC estimates with gaps in a regularly spaced time series, ivreg2
handles this case properly.
10. These data accompany Stock and Watson (2003).
Instrumented: UR
Excluded instruments: L2.ggdp L.TBILL L.ER L.TBON
In these estimates, the negative coefficient on the unemployment rate is consistent with
macroeconomic theories of the natural rate. In that context, lowering unemployment
below the natural rate will cause an acceleration of price inflation. The Sargan statistic indicates that the null hypothesis of the test of overidentifying restrictions cannot be rejected.
An absence of autocorrelation in the error process is unusual in time series analysis,
so we test the equation using ivactest, as discussed below in Section 10. Using the
default value of one lag, we consider whether the error process exhibits AR(1) behavior. The test statistic decisively rejects the null that the errors are serially independent:
. ivactest
Cumby-Huizinga test with H0: errors nonautocorrelated at order 1
Test statistic: 25.909524
Under H0, Chi-sq(1) with p-value: 3.578e-07
Given this strong rejection of the null of independence, we reestimate the equation with
HAC standard errors, choosing a bandwidth (bw) of 5 (roughly T 1/3 ) and the robust
option. By default, the Bartlett kernel is used, so that these are Newey–West two-step
efficient GMM estimates.
. ivreg2 D.inf (UR=L2.ggdp L.TBILL L.ER L.TBON), gmm2s robust bw(5)
2-Step GMM estimation
(output omitted)
Instrumented: UR
Excluded instruments: L2.ggdp L.TBILL L.ER L.TBON
It appears that, once HAC estimates of the covariance matrix are generated, the statistical significance of the unemployment rate in this equation is questionable. One important statistic is also altered: the test for overidentification, denoted the Sargan test in the former estimates, was on the borderline of rejecting its null hypothesis at the 90% level. When we reestimate the equation with HAC standard errors, various summary statistics are "robustified" as well: in this case, the test of overidentifying restrictions, now denoted Hansen's J. That statistic is now far from rejecting its null, giving us greater confidence that our instrument set is appropriate.
As noted earlier, the second-step minimization treats the weighting matrix W = (S(β̃))^{-1}
as a constant matrix. Thus the residuals in the estimate of S are the first-stage residuals
defined by β̃, whereas the residuals in the orthogonality conditions g are the second-stage
residuals defined by β̂.
The minimization problem that defines the GMM "continuously updated estimator" (CUE) of Hansen et al. (1996) is, by contrast,

\hat{\beta}_{CUE} \equiv \arg\min_{\hat{\beta}}\; J(\hat{\beta}) = n\, g(\hat{\beta})' \left(S(\hat{\beta})\right)^{-1} g(\hat{\beta}) \qquad (28)
Here, the weighting matrix is a function of the β being estimated. The residuals in S
are the same residuals that are in g, and estimation of S is done simultaneously with
the estimation of β. In general, solving this minimization problem requires numerical
methods.
Both the two-step efficient GMM and CUE GMM procedures reduce to familiar
estimators under linearity and conditional homoskedasticity. In this case, S = E(g_i g_i') = E(u_i^2 Z_i'Z_i) = E(u_i^2)\,E(Z_i'Z_i) = σ²Q_{ZZ}. As usual, Q_{ZZ} is estimated by its sample counterpart \frac{1}{n}Z'Z. In two-step efficient GMM under homoskedasticity, the minimization becomes
\hat{\beta}_{IV} \equiv \arg\min_{\hat{\beta}}\; J(\hat{\beta}) = \frac{\hat{u}(\hat{\beta})'\, P_Z\, \hat{u}(\hat{\beta})}{\hat{\sigma}^2} \qquad (29)
where \hat{u}(\hat{\beta}) \equiv (y - X\hat{\beta}) and P_Z \equiv Z(Z'Z)^{-1}Z' is the projection matrix. In the minimization, the error variance \hat{\sigma}^2 is treated as a constant and hence doesn't require first-step estimation, and the β̂ that solves (29) is the IV estimator \hat{\beta}_{IV} = (X'P_Z X)^{-1} X'P_Z y.11
With CUE GMM under conditional homoskedasticity, the estimated error variance
is a function of the residuals σ̂ 2 = û0 (β̂)û(β̂)/n and the minimization becomes
\hat{\beta}_{LIML} \equiv \arg\min_{\hat{\beta}}\; J(\hat{\beta}) = \frac{\hat{u}(\hat{\beta})'\, P_Z\, \hat{u}(\hat{\beta})}{\hat{u}(\hat{\beta})'\, \hat{u}(\hat{\beta})/n} \qquad (30)
The β̂ that solves (30) is defined as the limited information maximum likelihood (LIML)
estimator.
Unlike CUE estimators in general, the LIML estimator can be derived analytically
and does not require numerical methods. This derivation is the solution to an eigenvalue
problem (see Davidson and MacKinnon (1993), pp. 644–49). The LIML estimator
was first derived by Anderson and Rubin (1949), who also provided the first test of
overidentifying restrictions for estimation of an equation with endogenous regressors.
This Anderson–Rubin statistic (not to be confused with the test discussed below under
11. The error variance σ̂ 2 , required for inference, is calculated at the end using the IV residuals.
“weak identification”) follows naturally from the solution to the eigenvalue problem. If
we denote the minimum eigenvalue by λ, then the Anderson–Rubin likelihood ratio test
statistic for the validity of the overidentifying restrictions (orthogonality conditions) is
n log(λ). Since LIML is also an efficient GMM estimator, the value J of the minimized
GMM objective function also provides a test of overidentifying restrictions. The J test
of the same overidentifying restrictions is closely related to the Anderson–Rubin test; the minimized value of the LIML GMM objective function is in fact J = n(λ − 1)/λ. Of course, n log(λ) ≈ n(λ − 1)/λ for λ close to one.
Although CUE and LIML provide no asymptotic efficiency gains over two-step GMM
and IV, recent research suggests that their finite-sample performance may be superior.
In particular, there is evidence suggesting that CUE and LIML perform better than
IV-GMM in the presence of weak instruments (Hahn et al. (2004)). This is reflected,
for example, in the critical values for the Stock–Yogo weak instruments test discussed
below in Section 7.3.12 The disadvantage of CUE in general is that it requires numerical
optimization; LIML does not, but does require the often rather strong assumption of
i.i.d. disturbances. In ivreg2, the cue option combined with the robust, cluster,
and/or bw options generates coefficient estimates that are efficient in the presence of the
corresponding deviations from i.i.d. disturbances. Specifying cue with no other options
is equivalent to the combination of the options liml and coviv (“covariance-IV”: see
below).
The implementation of the CUE estimator in ivreg2 uses Stata’s ml routine to mini-
mize the objective function. The starting values are either IV or two-step efficient GMM
coefficient estimates. These can be overridden with the cueinit option, which takes a
matrix of starting values of the coefficient vector β as its argument. The cueoptions
option passes its contents to Stata’s ml command. Estimation with the cue option can
be slow and problematic when the number of parameters to be estimated is substantial,
and it should be used with caution.
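For example, one might supply two-step efficient GMM estimates explicitly as CUE starting values (a sketch only; the matrix name b0cue is arbitrary):

. qui ivreg2 D.inf (UR=L2.ggdp L.TBILL L.ER L.TBON), gmm2s robust bw(5)
. mat b0cue = e(b)
. ivreg2 D.inf (UR=L2.ggdp L.TBILL L.ER L.TBON), cue cueinit(b0cue) robust bw(5)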
Fuller's modified LIML estimator, available via the fuller(#) option, sets k = λ − α/(N − L), where λ is the LIML eigenvalue and α is a positive constant chosen by the user; α = 1 is often suggested as a good choice; see Fuller (1977) or Davidson and MacKinnon (1993), pp. 649–50. Nagar's bias-adjusted 2SLS estimator can be obtained with the kclass(#) option by setting k = 1 + (L − K)/N, where (L − K) is the number of overidentifying restrictions and N is the sample size; see Nagar (1959). Research suggests that both of these k-class estimators have better finite-sample performance than IV in the presence of weak instruments, though like IV, neither is robust to violations of the i.i.d. assumption.
the i.i.d. assumption. ivreg2 also provides Stock–Yogo critical values for the Fuller
version of LIML.
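To illustrate with the wage equation estimated in Section 5 below (a sketch; here there are L − K = 2 overidentifying restrictions, so Nagar's k is 1 + 2/N):

. qui ivreg2 lwage exper expersq (educ=age kidslt6 kidsge6)
. ivreg2 lwage exper expersq (educ=age kidslt6 kidsge6), fuller(1)
. ivreg2 lwage exper expersq (educ=age kidslt6 kidsge6), kclass(`=1 + 2/e(N)')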
The default covariance matrix reported by ivreg2 for the LIML and general k-class
estimators is (Davidson and MacKinnon (1993), p. 650)

V(\hat{\beta}_k) = \hat{\sigma}^2\left(X'(I - k M_Z)X\right)^{-1}

where M_Z ≡ I − P_Z. The traditional IV-form covariance matrix \hat{\sigma}^2 (X' P_Z X)^{-1} is also valid, and can be obtained with the coviv option. With coviv, the covariance matrix for LIML and the other general k-class estimators will differ from that for the IV estimator only because the estimate of the error variance σ̂² will differ.
(output omitted)
Instrumented: UR
Excluded instruments: L2.ggdp L.TBILL L.ER L.TBON
When this estimator is employed, the magnitude of the point estimate of the UR co-
efficient falls yet farther, and it is no longer significantly different from zero at any
reasonable level of significance.
13. If the test statistic is required for an inefficient GMM estimator (e.g., an overidentifying restric-
tions test for the IV estimator that is robust to heteroskedasticity), ivreg2 reports the J statistic for
the corresponding efficient GMM estimator; see our 2003 paper. This J statistic is identical to that
produced by estat overid following official Stata’s ivregress gmm.
The orthog option takes as its argument the list of exogenous variables ZB whose exogeneity
is called into question. If the exogenous variable being tested is an instrument, the
efficient GMM estimator that does not use the corresponding orthogonality condition
simply drops the instrument. This is illustrated in the following pair of estimations
where the second regression is the estimation implied by the orthog option in the first:
ivreg2 y x1 x2 (x3 = z1 z2 z3 z4), orthog(z4)
ivreg2 y x1 x2 (x3 = z1 z2 z3)
If the exogenous variable that is being tested is a regressor, the efficient GMM estimator
that does not use the corresponding orthogonality condition treats the regressor as
endogenous, as below; again, the second estimation is implied by the use of orthog in
the former equation:
ivreg2 y x1 x2 (x3 = z1 z2 z3 z4), orthog(x2)
ivreg2 y x1 (x2 x3 = z1 z2 z3)
When the endog option is used instead, as in

ivreg2 y x1 x2 (x3 = z1 z2 z3 z4), endog(x3)

the test statistic reported for the endogeneity of x3 is numerically equal to the test
statistic reported for the orthog option in
ivreg2 y x1 x2 x3 ( = z1 z2 z3 z4), orthog(x3)
The endog option is both easier to understand and more convenient to use.
Under conditional homoskedasticity, this endogeneity test statistic is numerically
equal to a Hausman test statistic: see Hayashi (2000), pp. 233–34 and Baum et al.
(2003), pp. 19–22. The endogeneity test statistic can also be calculated after ivreg
or ivreg2 by the command ivendog. Unlike the Durbin–Wu–Hausman versions of the
endogeneity test reported by ivendog, the endog option of ivreg2 can report test statis-
tics that are robust to various violations of conditional homoskedasticity. The one statistic available from ivendog but not from ivreg2's endog option is the Wu–Hausman F-test version of the endogeneity test.
To illustrate this option, we use a data set provided in Wooldridge (2003). We es-
timate the log of females' wages as a function of the worker's experience, experience squared and years of education. If the education variable is considered endogenous, it is instrumented with the worker's age and counts of the number of pre-school children and
older children in the household. We test whether the educ variable need be considered
endogenous in this equation with the endog option:
. use http://fmwww.bc.edu/ec-p/data/wooldridge/mroz.dta
. ivreg2 lwage exper expersq (educ=age kidslt6 kidsge6), endog(educ)
IV (2SLS) estimation
(output omitted)
Instrumented: educ
Included instruments: exper expersq
Excluded instruments: age kidslt6 kidsge6
In this context, we estimate the equation treating educ as endogenous, and merely name
it in the endog varlist to perform the C (GMM distance) test. The test cannot reject
its null that educ may be treated as exogenous. In contrast, we may calculate this same
test statistic with the earlier orthog option:
ivreg2 lwage exper expersq educ (=age kidslt6 kidsge6), orthog(educ)
Using orthog, we again list educ in the option’s varlist, but we must estimate the
equation with that variable treated as exogenous: an equivalent but perhaps a less
intuitive way to perform the test.
with instruments Z̃1 and X̃2B will be the same as the shared coefficients estimated for
the original model
y = [X_1 \; X_2]\,[\beta_1' \;\; \beta_2']' + u \qquad (35)
with instruments Z1 and X2 . It is even possible to partial-out the full set of included
exogenous variables X2 , so that the partialled-out version of the model becomes
ỹ = X̃1 β1 + ũ (36)
with no exogenous regressors and only excluded instruments Z̃1 , and the estimated β̂1
will be the same as that obtained when estimating the full set of regressors.
The FWL theorem is implemented in ivreg2 by the new partial(varlist) option,
which requests that the exogenous regressors in the varlist should be partialled out from
all the other variables (other regressors and excluded instruments) in the estimation. If
the equation includes a constant, it is automatically partialled out as well.
The partial option is most useful when the covariance matrix of orthogonality con-
ditions S is not of full rank. When this is the case, efficient GMM and overidentification
tests are infeasible as the optimal GMM weighting matrix W = S^{-1} cannot be calcu-
lated. In some important cases, partialling out enough exogenous regressors can make
the covariance matrix of the remaining orthogonality conditions full rank, and efficient
GMM becomes feasible.
The invariance of the estimation results to partialling-out applies to one- and two-
step estimators such as OLS, IV, LIML and two-step GMM, but not to CUE or to
GMM iterated more than two steps. The reason is that the latter estimators update
the estimated S matrix. An updated S implies different estimates of the coefficients on
the partialled-out variables, which imply different residuals, which in turn produce a dif-
ferent estimated S. Intuitively, partialling-out uses OLS estimates of the coefficients on
the partialled-out variables to generate the S matrix, whereas CUE would use more effi-
cient HOLS (“heteroskedastic OLS”) estimates.14 Partialling out exogenous regressors
that are not of interest may still be desirable with CUE estimation, however, because
reducing the number of parameters estimated makes the CUE numerical optimization
faster and more reliable.
14. We are grateful to Manuel Arellano for helpful discussions on this point. On HOLS, see our 2003
paper.
One common case calling for partialling-out arises when using cluster and the
number of clusters is less than L, the number of (exogenous regressors + excluded
instruments). This causes the matrix S to be rank deficient (Baum et al. (2003), pp. 9–
10). The problem can be addressed by using partial to remove enough exogenous
regressors for S to have full rank. A similar problem arises if a robust covariance matrix
is requested when the regressors include a variable that is a singleton dummy, i.e., a
variable with one value of 1 and (N − 1) values of zero or vice versa. The singleton
dummy causes the robust covariance matrix estimator to be less than full rank. In this case, partialling out the singleton dummy solves the problem.
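For example (a schematic sketch; the variable names, including the cluster variable state, are hypothetical), partialling out an exogenous regressor can restore full rank of S when the clusters are few:

. * state, x2 etc. are placeholder names for this sketch
. ivreg2 y x1 x2 (x3 = z1 z2 z3 z4), cluster(state) partial(x2)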
The partial option has two limitations: it cannot be used with time-series opera-
tors, and post-estimation [R] predict can be used only to generate residuals.
15. As X2 ≡ Z2 , these variables are perfectly correlated with each other. The canonical correlations
between X and Z before partialling out would also include the L2 ≡ K2 correlations that are equal to
unity.
calculated as the eigenvalues of (\tilde{X}_1'\tilde{X}_1)^{-1}(\tilde{X}_1'\tilde{Z}_1)(\tilde{Z}_1'\tilde{Z}_1)^{-1}(\tilde{Z}_1'\tilde{X}_1). The rank condition
can then be interpreted as the requirement that all K1 of the canonical correlations
must be significantly different from zero. If one or more of the canonical correlations is
zero, the model is underidentified or unidentified.
An alternative and useful interpretation of the rank condition is to use the reduced
form. Write the set of reduced form (“first stage”) equations for the regressors X as
X = ZΠ + v (37)
The Anderson canonical correlation statistic tests the null hypothesis that the smallest canonical correlation r_{K1} is zero. A large-sample test statistic for this is simply n r_{K1}². Under the null, the test statistic is distributed χ² with (L − K + 1) degrees of freedom, so that it may be calculated even for an exactly-identified equation. A failure to reject the null hypothesis suggests the model is unidentified. Not surprisingly, given its "N × R²" form, this test can be interpreted as an LM test.16
The Cragg–Donald (1993) statistic is an alternative and closely related test for the rank of a matrix that can also be used to test for underidentification. Whereas the Anderson test is an LM test, the Cragg–Donald test is a Wald test, also derived from an eigenvalue problem. Poskitt and Skeels (2002) show that in fact the Cragg–Donald test statistic can be stated in terms of canonical correlations as n r_{K1}²/(1 − r_{K1}²) (see Poskitt and Skeels (2002), p. 17). It is also distributed as χ²(L − K + 1).
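As a worked illustration of how closely the two are related: with n = 1000 and a smallest canonical correlation of r_{K1} = 0.1, the Anderson LM statistic is n r_{K1}² = 1000 × 0.01 = 10, while the Cragg–Donald Wald statistic is 10/(1 − 0.01) ≈ 10.1.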
Both these tests require the assumption of i.i.d. errors, and hence are reported if
ivreg2 is invoked without the robust, cluster or bw options. The Anderson LM χ2
statistic is reported by ivreg2 in the main regression output while both the Anderson
LM and Cragg–Donald Wald χ2 statistics are reported with the first option.
If the errors are heteroskedastic or serially correlated, the Anderson and Cragg–
Donald statistics are not valid. This is an important shortcoming, because these viola-
tions of the i.i.d. assumption would typically be expected to cause the null of underi-
dentification to be rejected too often. Researchers would face the danger of interpreting
a rejection of the null as evidence of a well-specified model that is adequately identified,
when in fact it was both underidentified and misspecified.
Recently, several robust statistics for testing the rank of a matrix have been pro-
posed. Kleibergen and Paap (2006) have proposed the rk statistic for this purpose.
Their rk test statistic is reported by ivreg2 if the user requests any sort of robust
covariance estimator. The LM version of the Kleibergen–Paap rk statistic can be con-
sidered as a generalization of the Anderson canonical correlation rank statistic to the
non-i.i.d. case. Similarly, the Wald version of the rk statistic reduces to the Cragg–
Donald statistic when the errors are i.i.d. The rk test is implemented in Stata by the
ranktest command of Kleibergen and Schaffer (2007) which ivreg2 uses to calculate
the rk statistic. If ivreg2 is invoked with the robust, bw or cluster options, the
tests of underidentification reported by ivreg2 are based on the rk statistic and will be
correspondingly robust to heteroskedasticity, autocorrelation or clustering. For a full
discussion of the rk statistic, see Kleibergen and Paap (2006).
It is useful to note that in the special case of a single endogenous regressor, the
Anderson, Cragg–Donald, and Kleibergen–Paap statistics reduce to familiar statistics
available from OLS estimation of the single reduced form equation with an appropriate
choice of VCE estimator. Thus the Cragg–Donald Wald statistic can be calculated by estimating (38) and testing the joint significance of the coefficients Π11 on the excluded instruments Z1 using a standard Wald test and a traditional non-robust covariance estimator.
16. Earlier versions of ivreg2 reported an LR version of this test, where the test statistic is −n log(1 − r_{K1}²). This LR test has the same asymptotic distribution as the LM form. See Anderson (1984), pp. 497–498.
17. This can be done very simply in Stata using ivreg2 by estimating (38) with only Z2 as regressors,
Z1 as excluded instruments and an empty list of endogenous regressors. The Sargan statistic reported
by ivreg2 will be the Anderson LM statistic. See our 2003 article for further discussion.
18. See the on-line help for ranktest for examples. These test statistics are “large-sample” χ2 tests and
can be obtained from OLS regression using ivreg2. Stata’s regress command reports finite-sample t
tests. Also note that the robust rk LM statistic can be obtained as described in the preceding footnote.
Invoke ivreg2 with X1 as the dependent variable, Z2 as regressors, Z1 as excluded instruments and
no endogenous regressors. With the robust option the reported Hansen J statistic is the robust rk
statistic.
The Stock and Yogo (2005) weak instruments tests use the Cragg–Donald F statistic from an estimation that assumes i.i.d. disturbances. The null hypothesis being tested is that the estimator is weakly identified in the sense that it is subject to bias that the investigator finds unacceptably large. The Stock–Yogo weak instruments tests come in two flavors: maximal relative bias and maximal size, where the null is that the instruments suffer from the specified bias or size distortion. Rejection of their null hypothesis represents the
absence of a weak instruments problem. The first flavor is based on the ratio of the
bias of the estimator to the bias of OLS. The null is that instruments are weak, where
weak instruments are defined as instruments that can lead to an asymptotic relative
bias greater than some value b. Because this test uses the finite sample distribution
of the IV estimator, it cannot be calculated in certain cases. This is because the mth
moment of the IV estimator exists if and only if m < (L − K + 1).19
The second flavor of the Stock–Yogo tests is based on the performance of the Wald
test statistic for β1 . Under weak identification, the Wald test rejects too often. The test
statistic is based on the rejection rate r (10%, 20%, etc.) that the researcher is willing
to tolerate if the true rejection rate should be the standard 5%. Weak instruments are
defined as instruments that will lead to a rejection rate of r when the true rejection rate
is 5%.
Stock and Yogo (2005) have tabulated critical values for their two weak identification
tests for the IV estimator, the LIML estimator, and Fuller’s modified LIML estimator.
The weak instruments bias in the IV estimator is larger than that of the LIML estima-
tors, and hence the critical values for the null that instruments are weak are also larger.
The Stock–Yogo critical values are available for a range of possible circumstances (up
to 3 endogenous regressors and 100 excluded instruments).
The weak identification test that uses the Cragg–Donald F statistic, like the cor-
responding underidentification test, requires an assumption of i.i.d. errors. This is a
potentially serious problem, for the same reason as given earlier: if the test statistic is
large simply because the disturbances are not i.i.d., the researcher will commit a Type I
error and incorrectly conclude that the model is adequately identified.
If the user specifies the robust, cluster or bw options in ivreg2, the reported
weak instruments test statistic is a Wald F statistic based on the Kleibergen–Paap rk
statistic. We are not aware of any studies on testing for weak instruments in the presence
of non-i.i.d. errors. In our view, however, the use of the rk Wald statistic, as the robust
analog of the Cragg–Donald statistic, is a sensible choice and clearly superior to the
use of the latter in the presence of heteroskedasticity, autocorrelation or clustering. We
suggest, however, that when using the rk statistic to test for weak identification, users
either apply with caution the critical values compiled by Stock and Yogo (2005) for the
i.i.d. case, or refer to the older “rule of thumb” of Staiger and Stock (1997) that the
F -statistic should be at least 10 for weak identification not to be considered a problem.
ivreg2 will report in the main regression output the relevant Stock and Yogo (2005)
critical values for IV, LIML and Fuller-LIML estimates if they are available. The re-
ported test statistic will be the Cragg–Donald statistic if the traditional covariance
estimator is used or the rk statistic if a robust covariance estimator is requested. If
the user requests two-step GMM estimation, ivreg2 will report an rk statistic and the
IV critical values. If the user requests the CUE estimator, ivreg2 will report an rk
statistic and the LIML critical values. The justification for this is that IV and LIML are
special cases of two-step GMM and CUE respectively, and the similarities carry over to
weak instruments: the literature suggests that IV and two-step GMM are less robust
to weak instruments than LIML and CUE. However, users of ivreg2 may again wish to exercise some caution in applying the Stock–Yogo critical values in these cases.
Now consider estimating a reduced form equation for y with the full set of instruments
as regressors:
y = Z1 γ1 + Z2 γ2 + η (42)
If the null H0 : β1 = 0 is correct, Π11 β1 = 0, and therefore γ1 = 0. Thus the Anderson
and Rubin (1949) test of the null H0 : β1 = 0 is obtained by estimating the reduced
form for y and testing that the coefficients γ1 of the excluded instruments Z1 are jointly
equal to zero. If we fail to reject γ1 = 0, then we also fail to reject β1 = 0.
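The test is straightforward to compute directly. A sketch using the inflation equation estimated earlier, where the constant is the only included exogenous regressor, so the reduced form for y regresses D.inf on the full set of excluded instruments:

. qui regress D.inf L2.ggdp L.TBILL L.ER L.TBON
. test L2.ggdp L.TBILL L.ER L.TBON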
The Anderson–Rubin statistic is robust to the presence of weak instruments. As
instruments become weak, the elements of Π11 become smaller, and hence so does
Π11 β1 : the null H0 : γ1 = 0 is less likely to be rejected. That is, as instruments become
weak, the power of the test declines, an intuitively appealing feature: weak instruments
come at a price. ivreg2 reports both the χ2 version of the Anderson–Rubin statistic
(distributed with L1 degrees of freedom) and the F -statistic version of the test. ivreg2
also reports the closely-related Stock and Wright (2000) S-statistic. The S statistic
tests the same null hypothesis as the A-R statistic and has the same distribution under
the null. It is given by the value of the CUE objective function (with the exogenous
regressors partialled out). Whereas the A-R statistic provides a Wald test, the S statistic
provides an LM or GMM distance test of the same hypothesis.
Importantly, if the model is estimated with a robust covariance matrix estimator,
both the Anderson–Rubin statistic and the S statistic reported by ivreg2 are corre-
spondingly robust. See Dufour (2003) and Chernozhukov and Hansen (2005) for further
discussion of the Anderson–Rubin approach. For related alternative test statistics that
are also robust to weak instruments (but not violations of the i.i.d. assumption), see
the condivreg and condtest commands available from Moreira and Poi (2003) and
Mikusheva and Poi (2006).
(output omitted)
Instrumented: iq
Included instruments: s expr tenure rns smsa _Iyear_67 _Iyear_68 _Iyear_69
_Iyear_70 _Iyear_71 _Iyear_73
Excluded instruments: age mrt
Note that for J(β0) to be the appropriate test statistic, it is necessary for the exogenous regressors to be partialled out with the partial() option.
20. It is important to note that an Anderson–Rubin confidence region need be neither finite nor connected.
The test provided in condivreg (Moreira and Poi (2003), Mikusheva and Poi (2006)) is uniformly most
powerful in the situation where there is one endogenous regressor and i.i.d. errors. The Anderson–
Rubin test provided by ivreg2 is a simple and preferable alternative when errors are not i.i.d. or there
is more than one endogenous regressor.
. di e(sargan)
102.10909
. mat S0 = e(S)
. qui ivreg2 lw med age (iq=kww), gmm2s smatrix(S0)
. test med age
( 1) med = 0
( 2) age = 0
chi2( 2) = 102.11
Prob > chi2 = 0.0000
. qui ivreg2 lw kww age (iq=med), gmm2s smatrix(S0)
. test kww age
( 1) kww = 0
( 2) age = 0
chi2( 2) = 102.11
Prob > chi2 = 0.0000
. qui ivreg2 lw med kww (iq=age), gmm2s smatrix(S0)
. test med kww
( 1) med = 0
( 2) kww = 0
chi2( 2) = 102.11
Prob > chi2 = 0.0000
The Pesaran–Taylor version of the test uses the optimal forecast of y, defined as X̂β̂, where β̂ is the IV estimate of the coefficients and X̂ ≡ [ZΠ̂ Z2], i.e., the reduced form predicted values of the endogenous regressors plus the exogenous regressors. Note that if the equation is exactly identified, the optimal forecasts and reduced
form forecasts coincide, and the Pesaran–Taylor and Pagan–Hall tests are identical.
The ivreset test flavors vary according to the polynomial terms (square, cube,
fourth power of ŷ), the choice of forecast values (Pesaran–Taylor optimal forecasts or
Pagan–Hall reduced form forecasts), test statistic (Wald or GMM-distance), and large
vs. small sample statistic (χ2 or F -statistic). The test statistic is distributed with
degrees of freedom equal to the number of polynomial terms. The default is the Pesaran–
Taylor version using the square of the optimal forecast of y and a χ2 Wald statistic with
one degree of freedom.
If the original ivreg2 estimation was heteroskedastic-robust, cluster-robust, AC or
HAC, the reported RESET test will be as well. The ivreset command can also be
used after OLS regression with [R] regress or ivreg2 when there are no endogenous
regressors. In this case, either a standard Ramsey RESET test using fitted values of y
or a robust test corresponding to the specification of the original regression is reported.
We illustrate use of ivreset using a model fitted to the Griliches data:
. use http://fmwww.bc.edu/ec-p/data/hayashi/griliches76.dta
(Wages of Very Young Men, Zvi Griliches, J.Pol.Ec. 1976)
. quietly ivreg2 lw s expr tenure rns smsa (iq=med kww), robust
. ivreset
Ramsey/Pesaran-Taylor RESET test
Test uses square of fitted value of y (X-hat*beta-hat)
Ho: E(y|X) is linear in X
Wald test statistic: Chi-sq(1) = 4.53 P-value = 0.0332
Test is heteroskedastic-robust
. ivreset, poly(4) rf small
Ramsey/Pagan-Hall RESET test
Test uses square, cube and 4th power of reduced form prediction of y
Ho: E(y|X) is linear in X
Wald test statistic: F(3,748) = 1.72 P-value = 0.1616
Test is heteroskedastic-robust
The first ivreset takes all the defaults, and corresponds to a second-order polynomial in ŷ with the Pesaran–Taylor optimal forecast and a Wald χ² test statistic, which rejects the null at better than 95%. The second employs a fourth-order polynomial and requests
the Pagan–Hall reduced form forecast with a Wald F -statistic, falling short of the 90%
level of significance.
21. If the previous command estimated a V CE under the assumption of i.i.d. errors, q must be 0.
In earlier versions of ivreg2, the gmm option by itself produced estimates with a heteroskedasticity-robust estimator. When the gmm option was combined with the bw
option, estimates were autocorrelation-robust but not heteroskedasticity-robust. This
version of ivreg2 uses a new taxonomy of estimation options, summarized below. Note
that the gmm2s option by itself produces the IV (2SLS) estimator, as described in Section
2.2. One of the options [robust, cluster, bw] must be added to generate two-step
efficient GMM estimates.
The following table summarizes the estimator and the properties of its point and
interval estimates for each combination of estimation options.
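In outline (a summary assembled from the option descriptions above):

Option(s)               Estimator       VCE and tests consistent under
(none)                  IV/2SLS         i.i.d. errors (assumed)
robust                  IV/2SLS         heteroskedasticity
bw()                    IV/2SLS         autocorrelation
robust bw()             IV/2SLS         heteroskedasticity and autocorrelation
cluster()               IV/2SLS         intra-cluster correlation
gmm2s                   IV/2SLS         i.i.d. errors (assumed)
gmm2s robust            two-step GMM    heteroskedasticity
gmm2s bw()              two-step GMM    autocorrelation
gmm2s robust bw()       two-step GMM    heteroskedasticity and autocorrelation
gmm2s cluster()         two-step GMM    intra-cluster correlation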
A number of tests performed by ivreg2 are not available from [R] ivregress. These
include the “GMM distance” tests of endogeneity/exogeneity discussed in Section 5,
the general underidentification/weak identification test of Kleibergen and Paap (2006)
discussed in Section 7 and tests for instrument relevance. In diagnosing potentially weak
instruments, ivreg2’s ability to save the first-stage regressions is also unique.
12 Syntax diagrams
These diagrams describe all of the programs in the ivreg2 suite, including those which
have not been substantially modified since their documentation in Baum et al. (2003).
ivreg2 depvar [varlist1] (varlist2 = varlist_iv) [weight] [if] [in] [, gmm2s bw(# | auto) kernel(string) liml fuller(#) kclass(#) coviv cue cueinit(matrix) cueoptions(string) b0(matrix) robust cluster(varname) orthog(varlist_ex) endog(varlist_en) redundant(varlist_ex) partial(varlist_ex) small noconstant smatrix(matrix) wmatrix(matrix) first ffirst savefirst savefprefix(string) rf saverf saverfprefix(string) nocollin noid level(#) noheader nofooter eform(string) depname(varname) plus]
overid [, chi2 dfr f all]
ivhettest [varlist] [, ivlev ivsq fitlev fitsq ph phnorm nr2 bpg all]
ivendog [varlist]
ivreset [, polynomial(#) rform cstat small]
ivactest [, s(#) q(#)]
13 Acknowledgements
We are grateful to many members of the Stata user community who have assisted in the
identification of useful features in the ivreg2 suite and helped identify problems with
the programs. We thank (without implicating) Austin Nichols for his suggestions on
this draft, Frank Kleibergen for discussions about testing for identification, and Manuel
Arellano and Graham Elliott for helpful discussions of GMM-CUE. Some portions of
the discussion of weak instruments in Section 7.3 are taken from Chapter 8 of Baum
(2006).
14 References
Ahn, S. C. 1997. Orthogonality tests in linear models. Oxford Bulletin of Economics
and Statistics 59(1): 183–186.
Anderson, T. W., and H. Rubin. 1949. Estimation of the parameters of a single equation
in a complete system of stochastic equations. Annals of Mathematical Statistics 20:
46–63.
Baum, C. F., M. E. Schaffer, and S. Stillman. 2003. Instrumental variables and GMM:
Estimation and testing. Stata Journal 3: 1–31.
Bound, J., D. A. Jaeger, and R. Baker. 1995. Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association 90: 443–450.
Chernozhukov, V., and C. Hansen. 2005. The reduced form: A simple approach to inference with weak instruments. Working paper, University of Chicago, Graduate School of Business.
Cragg, J. G., and S. G. Donald. 1993. Testing identifiability and specification in instru-
mental variables models. Econometric Theory 9: 222–240.
Cumby, R. E., and J. Huizinga. 1992. Testing the autocorrelation structure of distur-
bances in ordinary least squares and instrumental variables regressions. Econometrica
60(1): 185–195.
Frisch, R., and F. V. Waugh. 1933. Partial time regressions as compared with individual
trends. Econometrica 1(4): 387–401.
Greene, W. H. 2003. Econometric Analysis. 5th ed. Upper Saddle River, NJ: Prentice–
Hall.
Hahn, J., and J. Hausman. 2002. Notes on bias in estimators for simultaneous equation
models. Economics Letters 75(2): 237–41.
Hahn, J., J. Hausman, and G. Kuersteiner. 2004. Estimation with weak instruments:
Accuracy of higher-order bias and MSE approximations. Econometrics Journal 7(1):
272–306.
Hall, A. R., and F. P. M. Peixe. 2003. A consistent method for the selection of relevant instruments. Econometric Reviews 22(5): 269–287.
Hansen, L., J. Heaton, and A. Yaron. 1996. Finite sample properties of some alternative
GMM estimators. Journal of Business and Economic Statistics 14(3): 262–280.
Hayashi, F. 2000. Econometrics. 1st ed. Princeton, NJ: Princeton University Press.
Kleibergen, F., and R. Paap. 2006. Generalized reduced rank tests using the singular
value decomposition. Journal of Econometrics 127(1): 97–126.
Lovell, M. 1963. Seasonal adjustment of economic time series. Journal of the American
Statistical Association 58: 993–1010.
Mikusheva, A., and B. P. Poi. 2006. Tests and confidence sets with correct size when
instruments are potentially weak. Stata Journal 6(3): 335–347.
Moreira, M., and B. Poi. 2003. Implementing tests with the correct size in the simultaneous equations model. Stata Journal 3(1): 57–70.
Nagar, A. 1959. The bias and moment matrix of the general k-class estimators of the parameters in simultaneous equations. Econometrica 27(4): 575–595.
Newey, W. K., and K. D. West. 1987a. Hypothesis testing with efficient method of
moments estimation. International Economic Review 28: 777–787.