Feasible Generalized Least Squares For Panel Data With Cross-Sectional and Serial Correlations
https://doi.org/10.1007/s00181-020-01977-2
Abstract
This paper considers generalized least squares (GLS) estimation for linear panel data
models. By estimating the large error covariance matrix consistently, the proposed
feasible GLS estimator is more efficient than the ordinary least squares in the presence
of heteroskedasticity, serial and cross-sectional correlations. The covariance matrix
used for the feasible GLS is estimated via the banding and thresholding method. We
establish the limiting distribution of the proposed estimator, examine its finite-sample performance in a Monte Carlo study, and apply the method in an empirical application.
1 Introduction
Yuan Liao (corresponding author): yuan.liao@rutgers.edu
Jushan Bai: jb3064@columbia.edu
Sung Hoon Choi: shchoi@economics.rutgers.edu
1 Columbia University, 420 West 118th St. MC 3308, New York, NY 10027, USA
2 Rutgers University, 75 Hamilton St., New Brunswick, NJ 08901, USA
Arellano (1987); Driscoll and Kraay (1998); Hansen (2007a); Vogelsang (2012), among
others. A widely used class of robust standard errors is clustered standard errors, for
example, Petersen (2009), Wooldridge (2010) and Cameron and Miller (2015). Bai et al. (2019) proposed a robust standard error with unknown clusters. In an interesting paper, Abadie et al. (2017) argued for caution in applying clustered standard errors, since they may give rise to conservative confidence intervals. The
second approach is to use the generalized least squares estimator (GLS) that directly
takes into account heteroskedasticity, and cross-sectional and serial correlations in the
estimation. It is well known that GLS is more efficient than OLS.
This paper focuses on the second approach. For panel models, the underlying
covariance matrix involves a large number of parameters. It is important to make
GLS operational. We thus consider feasible generalized least squares (FGLS). Hansen
(2007b) studied FGLS estimation that takes into account serial correlation and clus-
tering problems in fixed effects panel and multilevel models. His approach requires
the cluster structure to be known. This motivates our paper: we allow the cluster structure to be unknown, and control for heteroskedasticity and for both serial and cross-sectional correlations by estimating the large error covariance matrix consistently.
In the cross-sectional setting, Romano and Wolf (2017) obtained asymptotically valid inference for the FGLS estimator, combined with heteroskedasticity-consistent standard errors, without knowledge of the functional form of the conditional heteroskedasticity.
Moreover, Miller and Startz (2018) adapted machine learning methods (i.e., support vector regression) to account for possible misspecification of the form of heteroskedasticity.
In this paper, we consider (i) balanced panel data, (ii) the case of large-N large-T ,
and (iii) both serial and cross-sectional correlations, but unknown structure of clusters.
We introduce a modified FGLS estimator that eliminates the cross-sectional and serial
correlation bias by proposing a high-dimensional error covariance matrix estimator.
In addition, our proposed method is applicable when the knowledge of clusters is not
available. Let u_t be an N × 1 vector of regression noises, whose definition will be made clear later. Following the idea of Bai and Liao (2017), the FGLS in this paper involves estimating an NT × NT dimensional inverse covariance matrix Ω^{-1}, where

Ω = (E u_t u_s')_{t,s ≤ T}.
A main contribution of this paper is the theoretical justification for estimating the large error covariance matrix.
For the FGLS, it is crucial for the asymptotic analysis to prove that the effect of estimating Ω is first-order negligible. In the usual low-dimensional settings that involve
estimating optimal weight matrix, such as the optimal GMM estimations, it has been
well known that consistency for the inverse covariance matrix estimator is sufficient
for the first-order asymptotic theory, e.g., Hansen (1982), Newey (1990), Newey and
McFadden (1994). However, it turns out that when the covariance matrix is high-dimensional, not even the optimal convergence rate for estimating Ω^{-1} is sufficient.
In fact, proving the first-order equivalence between the FGLS and the infeasible GLS
(that uses the true Ω^{-1}) is a very challenging problem under the large-N, large-T
setting. We provide a new theoretical argument to achieve this goal.
The banding and thresholding methods, which we employ in this paper, are two
useful regularization methods. In the recent machine learning literature, these methods
have been extensively exploited for estimating high-dimensional parameters. Moreover, in the econometrics literature, nonparametric machine learning techniques have proven to be powerful tools: Bai and Ng (2017); Chernozhukov et al. (2016, 2017); Wager and Athey (2018), etc.
The rest of the paper is organized as follows. In Section 2, we describe the model and the large error covariance matrix estimator, and introduce the implementation of the FGLS estimator and its limiting distribution. In Section 3, we apply our methods to study the US divorce rate problem. Conclusions are provided in Section 4. All proofs and Monte Carlo studies are given in the online supplement.
Throughout this paper, let ν_min(A) and ν_max(A) denote the minimum and maximum eigenvalues of a matrix A, respectively. Also, we use ‖A‖ = √(ν_max(A'A)), ‖A‖_1 = max_i Σ_j |A_ij| and ‖A‖_F = √(tr(A'A)) as the operator norm, ℓ1-norm and the Frobenius norm of a matrix A, respectively. Note that if A is a vector, ‖A‖ = ‖A‖_F is equal to the Euclidean norm.
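These three norms can be computed directly; the small helper below is purely illustrative (the example matrices are our own):

```python
import numpy as np

def op_norm(A):
    # operator norm: sqrt of the largest eigenvalue of A'A
    return np.sqrt(np.linalg.eigvalsh(A.T @ A).max())

def l1_norm(A):
    # matrix l1-norm used in the text: maximum absolute row sum
    return np.abs(A).sum(axis=1).max()

def fro_norm(A):
    # Frobenius norm: sqrt(tr(A'A))
    return np.sqrt(np.trace(A.T @ A))

# for a (column) vector, the operator and Frobenius norms coincide
# with the Euclidean norm, as noted in the text
v = np.array([[3.0], [4.0]])
assert np.isclose(op_norm(v), 5.0) and np.isclose(fro_norm(v), 5.0)
```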
The model (1) can be stacked and represented in full matrix notation as
Y = Xβ + U, (2)
The banded population approximation Ω_B to Ω is defined blockwise: writing h = t − s,

(Ω_B)_{t,s} := Σ_h, if |h| ≤ L,   and   (Ω_B)_{t,s} := 0, if |h| > L,

where Σ_h = E u_t u_{t−h}'.
Therefore, we use “banding” to control the serial correlation, and “sparsity” to control
the cross-sectional correlation.
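The banding operation on the NT × NT block matrix can be illustrated as follows (a toy sketch with hypothetical dimensions; the block layout, time-major, is our assumption):

```python
import numpy as np

def band_blocks(Omega, N, T, L):
    """Zero out every N x N block (t, s) with |t - s| > L (banding)."""
    B = Omega.copy()
    for t in range(T):
        for s in range(T):
            if abs(t - s) > L:
                B[t*N:(t+1)*N, s*N:(s+1)*N] = 0.0
    return B

N, T, L = 2, 4, 1
B = band_blocks(np.ones((N * T, N * T)), N, T, L)
# the lag-1 block survives, the lag-2 block is zeroed
assert B[0, N] == 1.0 and B[0, 2 * N] == 0.0
```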
Given the intuition of the population approximation, we construct the large covariance estimator as follows. First, we denote the OLS estimator of β by β̂_OLS and the corresponding residuals by

û_it = y_it − x_it' β̂_OLS.
Now, we estimate the N × N block matrix Σ_h = E u_t u_{t−h}'. To do so, let
R̂_{h,ij} = (1/T) Σ_{t=h+1}^{T} û_it û_{j,t−h}, if h ≥ 0;   R̂_{h,ij} = (1/T) Σ_{t=1}^{T+h} û_it û_{j,t−h}, if h < 0,

and

σ̂_{h,ii} = R̂_{h,ii};   σ̂_{h,ij} = s_ij(R̂_{h,ij}), if i ≠ j,

where

s_ij(z) = sgn(z)(|z| − τ_ij)_+,
where (x)+ = x if x ≥ 0, and zero otherwise. Here, sgn(·) denotes the sign function,
and other thresholding functions, e.g., hard thresholding, are possible. For the threshold
value, we specify

τ_ij = M γ_T √(|R̂_{0,ii}| |R̂_{0,jj}|),

for some pre-determined constant M > 0, where γ_T = √(log(LN)/T) is such that max_{|h|≤L} max_{i,j≤N} |R̂_{h,ij} − E u_it u_{j,t−h}| = O_P(γ_T). Note that here we use an entry-dependent threshold τ_ij, which may vary across (i, j). Then, define
Σ̂_h = (σ̂_{h,ij})_{N×N}.   (4)
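To make the block estimator concrete, here is a minimal sketch in Python (illustrative only; the simulated residuals and constants are our own stand-ins, not the paper's data):

```python
import numpy as np

def autocov_block(u, h):
    # R_h = (1/T) sum_{t=h+1}^{T} u_t u_{t-h}'  (h >= 0), with u a T x N array
    T = u.shape[0]
    return u[h:].T @ u[:T - h] / T

def soft_threshold(R, tau):
    # s_ij(z) = sgn(z)(|z| - tau_ij)_+ off the diagonal; the diagonal is kept
    S = np.sign(R) * np.maximum(np.abs(R) - tau, 0.0)
    np.fill_diagonal(S, np.diag(R))
    return S

rng = np.random.default_rng(0)
T, N, L, M = 200, 10, 3, 1.8
u = rng.normal(size=(T, N))              # stand-in for OLS residuals
R0 = autocov_block(u, 0)
gamma_T = np.sqrt(np.log(L * N) / T)     # the rate gamma_T = sqrt(log(LN)/T)
tau = M * gamma_T * np.sqrt(np.outer(np.abs(np.diag(R0)), np.abs(np.diag(R0))))
Sigma0 = soft_threshold(R0, tau)
```

With cross-sectionally independent noise, most off-diagonal sample covariances fall below τ_ij and are set exactly to zero, while the diagonal is kept unthresholded.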
Here, ω(h, L) is the kernel function (see Andrews (1991) and Newey and West (1994)). We let ω(h, L) = 1 − |h|/(L + 1) be the Bartlett kernel, where L is the bandwidth. Our final estimator of Ω is an NT × NT matrix:

Ω̂ = (Ω̂_{t,s}), where Ω̂_{t,s} = ω(t − s, L) Σ̂_{t−s},

and the FGLS estimator is

β̂_FGLS = [X' Ω̂^{-1} X]^{-1} X' Ω̂^{-1} Y.
Note that the FGLS estimator defined above leaves two quantities to be specified by applied researchers: (i) the constant M > 0 in the threshold value τ_ij, and (ii) the Newey–West bandwidth L. We discuss the choice of these two quantities in Section 2.2.2 below.
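Putting the pieces together, the assembly of Ω̂ and the FGLS step can be sketched as follows (illustrative code under an assumed time-major data layout; this is not the authors' implementation, and the small-sample behavior of the inversion is not addressed):

```python
import numpy as np

def sigma_hat(u, h, M, L):
    # Thresholded estimate of Sigma_h = E[u_t u_{t-h}'] from T x N residuals u;
    # the diagonal is kept unthresholded, as in sigma_hat_{h,ii} = R_hat_{h,ii}
    T, N = u.shape
    R = u[abs(h):].T @ u[:T - abs(h)] / T
    if h < 0:
        R = R.T                        # R_{-h} is the transpose of R_h
    gamma = np.sqrt(np.log(max(L, 1) * N) / T)
    R0 = u.T @ u / T
    tau = M * gamma * np.sqrt(np.outer(np.abs(np.diag(R0)), np.abs(np.diag(R0))))
    S = np.sign(R) * np.maximum(np.abs(R) - tau, 0.0)
    np.fill_diagonal(S, np.diag(R))
    return S

def fgls(Y, X, u, M=1.8, L=3):
    # Assemble Omega_hat from Bartlett-weighted thresholded blocks, then
    # compute beta = [X' Omega^{-1} X]^{-1} X' Omega^{-1} Y.
    # Rows are stacked by time: observation (t, i) sits at index t*N + i.
    T, N = u.shape
    Omega = np.zeros((N * T, N * T))
    for t in range(T):
        for s in range(max(0, t - L), min(T, t + L + 1)):
            h = t - s
            w = 1.0 - abs(h) / (L + 1.0)          # Bartlett kernel weight
            Omega[t*N:(t+1)*N, s*N:(s+1)*N] = w * sigma_hat(u, h, M, L)
    Oinv = np.linalg.inv(Omega)
    return np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ Y)
```

In very small samples Ω̂ may fail to be positive definite; a practitioner might then add a small ridge to its diagonal before inverting (our suggestion, not part of the paper's procedure).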
Our suggested covariance matrix estimator Ω̂ requires the choice of two tuning parameters, L and M, which are the bandwidth and the threshold constant, respectively. We write Ω̂ = Ω̂(M, L) to emphasize that the covariance estimator depends on M and L. First, to choose the bandwidth L, we suggest using L* = 4(T/100)^{2/9}, as proposed by Newey and West (1994). For small T, we also recommend L ≤ 3.
As for the choice of the thresholding constant M, our recommended rule of thumb is any constant in the interval [0.5, 2]. Based on extensive simulations with various values of N and T, we find that M = 1.8 is a universally good choice.
Alternatively, M can also be chosen through multifold cross-validation. To describe this procedure, we divide the data into P = log(T) contiguous blocks J_1, ..., J_P, each of length T/log(T), and take one of the P blocks in turn as the validation set. At the pth split, we denote by Σ̂_0^p the sample covariance matrix based on the validation set, defined by Σ̂_0^p = |J_p|^{-1} Σ_{t∈J_p} û_t û_t'. Let Σ̂_0^{S,p}(M) be the soft-thresholding estimator with threshold constant M, computed on the training data set {û_t}_{t∉J_p}. Finally, we choose the constant M* by minimizing the cross-validation objective function

M* = arg min_{c < M < C̄} (1/P) Σ_{p=1}^{P} ‖Σ̂_0^{S,p}(M) − Σ̂_0^p‖_F².

Here, Σ̂_0^S(M) is the soft-thresholded estimator as defined in Eq. (4). The resulting estimator of Ω is then Ω̂(M*, L*). To determine the lower bound c, one can plot λ_min{Σ̂(C, L)} as a function of C, fixing L = L*, and visually determine c.
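This cross-validation can be sketched as follows (our own illustrative implementation; the contiguous block splitting, bandwidth rule, and candidate grid are assumptions):

```python
import numpy as np

def soft(R, tau):
    # soft-thresholding off the diagonal; diagonal kept
    S = np.sign(R) * np.maximum(np.abs(R) - tau, 0.0)
    np.fill_diagonal(S, np.diag(R))
    return S

def choose_M(u, grid):
    """Pick the threshold constant M by multifold cross-validation.
    u: T x N residual matrix, split into P = log(T) contiguous blocks;
    each block serves once as the validation set."""
    T, N = u.shape
    P = max(int(np.log(T)), 2)
    blocks = np.array_split(np.arange(T), P)
    L = max(int(4 * (T / 100) ** (2 / 9)), 1)
    gamma = np.sqrt(np.log(L * N) / T)
    losses = []
    for M in grid:
        loss = 0.0
        for val in blocks:
            train = np.setdiff1d(np.arange(T), val)
            R_train = u[train].T @ u[train] / len(train)
            tau = M * gamma * np.sqrt(np.outer(np.diag(R_train), np.diag(R_train)))
            R_val = u[val].T @ u[val] / len(val)     # validation covariance
            loss += np.linalg.norm(soft(R_train, tau) - R_val, 'fro') ** 2
        losses.append(loss / P)
    return grid[int(np.argmin(losses))]

rng = np.random.default_rng(0)
M_star = choose_M(rng.normal(size=(60, 8)), [0.5, 1.0, 1.5, 2.0])
assert M_star in [0.5, 1.0, 1.5, 2.0]
```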
Table 1 summarizes the recommended quantities for implementing the proposed FGLS estimator.
Note that an advantage of the method proposed in this paper is that it does not assume
known cluster information (i.e., the number of clusters and the membership of clusters). On the other hand, when clustering information is available, the method can be modified to incorporate it, and is particularly suitable when the number of clusters is small and the size of each cluster is large.
For example, let C_1, ..., C_G be disjoint subsets of {1, ..., N} representing known clusters, such that u_it and u_js are uncorrelated whenever i and j belong to different clusters, for any (t, s). Then we can naturally re-arrange the N × N matrix Σ_h = E u_t u_{t−h}' so that it decomposes into G disjoint blocks on the diagonal, with all off-diagonal blocks equal to zero:

Σ_h = diag(Σ_{h,1}, ..., Σ_{h,G}).
It is assumed that G is small while the size of each diagonal block matrix is large.
Within the gth (g ≤ G) diagonal block, say Σ_{h,g}, we apply thresholding to further reduce the dimensionality. So we estimate Σ_{h,g} by Σ̂_{h,g} = (σ̂_{h,g,ij}), where

σ̂_{h,g,ij} = R̂_{h,ii}, if i = j and i, j ∈ C_g;   σ̂_{h,g,ij} = s_ij(R̂_{h,ij}), if i ≠ j and i, j ∈ C_g.
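With known cluster labels, imposing this block-diagonal restriction on an autocovariance block amounts to masking cross-cluster entries (a sketch; the labels and matrix are hypothetical):

```python
import numpy as np

def restrict_to_clusters(R, labels):
    # zero every (i, j) entry with i and j in different known clusters,
    # yielding the structure diag(Sigma_h1, ..., Sigma_hG) after
    # re-ordering the cross-sectional units by cluster
    labels = np.asarray(labels)
    return R * (labels[:, None] == labels[None, :])

R = np.arange(16.0).reshape(4, 4)
S = restrict_to_clusters(R, [0, 0, 1, 1])
# within-cluster entries kept, cross-cluster entries zeroed
assert S[0, 1] == R[0, 1] and S[0, 2] == 0.0
```

Within each surviving block, thresholding would then be applied exactly as in the unknown-cluster case.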
[Table 1 Recommended choices for implementation. Here, P, C̄ and c are required constants for the choice of M based on cross-validation.]
2.3 The effect of Ω̂^{-1} − Ω^{-1}

For the asymptotic analysis, the key requirement is that

(1/√(NT)) X'(Ω̂^{-1} − Ω^{-1}) U = o_P(1).   (5)
In the usual low-dimensional settings that involve estimating optimal weight matrix,
such as the optimal GMM estimations, it has been well known that consistency for the
inverse covariance matrix estimator is sufficient for the first-order asymptotic theory,
e.g., Hansen (1982), Newey (1990), Newey and McFadden (1994). It turns out that, when the covariance matrix is high-dimensional, not even the optimal convergence rate of Ω̂^{-1} − Ω^{-1} is sufficient. In fact, proving equation (5) is a very challenging problem.
In the general case when both cross-sectional and serial correlations are present, our strategy is to use a careful expansion of (1/√(NT)) X'(Ω̂^{-1} − Ω^{-1}) U. We shall proceed in two steps:

Step 1: Show that (1/√(NT)) X'(Ω̂^{-1} − Ω^{-1}) U = (1/√(NT)) W'(Ω − Ω̂) ε + o_P(1), where W = Ω^{-1} X and ε = Ω^{-1} U.

Step 2: Show that (1/√(NT)) W'(Ω̂ − Ω) ε = o_P(1).
To analyze Step 2, note that

(1/√(NT)) W'(Ω̂ − Ω) ε ≈ (1/√(NT)) Σ_{|h|≤L} Σ_{i,j∈A_h^b} (1/√T) Σ_{t=h+1}^{T} w_it ε_{j,t−h} · (1/√T) Σ_{s=h+1}^{T} (u_is u_{j,s−h} − E u_it u_{j,t−h}).   (6)
Here, w_it is defined so that we can write W = (w_1, · · · , w_T)' with w_t being an N × d matrix collecting the w_it; ε_it is defined similarly.

To build intuition, consider the case of pure heteroskedasticity with

σ̂_i² = (1/T) Σ_{t=1}^{T} u_it²,

because u_it is known. Then, the GLS becomes:

[ (1/(NT)) Σ_{i=1}^{N} Σ_{t=1}^{T} x_it x_it' σ̂_i^{-2} ]^{-1} (1/(NT)) Σ_{i=1}^{N} Σ_{t=1}^{T} x_it y_it σ̂_i^{-2}.
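This heteroskedasticity-only GLS can be sketched directly (illustrative code; the array shapes and simulated data are our own assumptions):

```python
import numpy as np

def gls_heteroskedastic(y, x, sigma2):
    """beta = [sum_{i,t} x_it x_it'/sigma_i^2]^{-1} sum_{i,t} x_it y_it/sigma_i^2.
    y: N x T, x: N x T x d, sigma2: length-N vector of unit variances."""
    N, T, d = x.shape
    A = np.zeros((d, d))
    b = np.zeros(d)
    for i in range(N):
        A += x[i].T @ x[i] / sigma2[i]    # weighted Gram matrix
        b += x[i].T @ y[i] / sigma2[i]    # weighted cross moments
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
N, T, d = 40, 30, 2
beta = np.array([1.0, -0.5])
x = rng.normal(size=(N, T, d))
sigma = rng.uniform(0.5, 2.0, size=N)
y = x @ beta + rng.normal(size=(N, T)) * sigma[:, None]
assert np.all(np.abs(gls_heteroskedastic(y, x, sigma ** 2) - beta) < 0.2)
```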
For the estimation of σ_i² to be first-order negligible, we need

(1/√(NT)) Σ_{i=1}^{N} Σ_{t=1}^{T} x_it u_it (σ̂_i^{-2} − σ_i^{-2}) = o_P(1).   (7)
Expanding σ̂_i^{-2} − σ_i^{-2}, this reduces to showing

A ≡ (1/√(NT)) Σ_{i=1}^{N} Σ_{t=1}^{T} x_it u_it σ_i^{-2} ( (1/T) Σ_{s=1}^{T} (u_is² − E u_is²) ) σ_i^{-2} = o_P(1).   (8)
A direct calculation gives

E A = (√(NT)/T) · (1/(NT)) Σ_{i=1}^{N} Σ_{t=1}^{T} E( x_it E(u_it³ | x_it) ) σ_i^{-4}.
We impose the following conditions, regulating the sparsity and serial weak depen-
dence.
Assumption 2.1 (i) {u_t, x_t}_{t≥1} is strictly stationary. In addition, each u_t has zero mean, and {u_t}_{t≥1} and {x_t}_{t≥1} are independent.
(ii) There are constants c_1, c_2 > 0 such that λ_min(Σ_h) > c_1 and ‖Σ_h‖_1 < c_2 for each fixed h.
(iii) Exponential tail: There exist r_1, r_2 > 0 and b_1, b_2 > 0 such that, for any s > 0, i ≤ N and l ≤ d,

P(|u_it| > s) ≤ exp(−(s/b_1)^{r_1}),   P(|x_it,l| > s) ≤ exp(−(s/b_2)^{r_2}).

(iv) Strong mixing: There exists κ ∈ (0, 1) such that r_1^{-1} + r_2^{-1} + κ^{-1} > 1, and C > 0 such that for all T > 0 the α-mixing coefficient between F_{−∞}^0 and F_T^∞ is bounded by exp(−C T^κ), where F_{−∞}^0 and F_T^∞ denote the σ-algebras generated by {(x_t, u_t) : t ≤ 0} and {(x_t, u_t) : t ≥ T}, respectively.
Condition (ii) requires that Σ_h be well conditioned. Condition (iii) ensures the
Bernstein-type inequality for weakly dependent data, which requires the underly-
ing distributions to be thin-tailed. Condition (iv) is the standard α-mixing condition,
adapted to the large-N panel. In addition, we impose the following regularity condi-
tions.
Assumption 2.2 (i) There exists a constant C > 0 such that for all i ≤ N and t ≤ T, E‖x_it‖⁴ < C and E u_it⁴ < C.
(ii) Define ξ_T(L) = max_{t≤T} Σ_{|h|>L} ‖E u_t u_{t−h}'‖. Then, ξ_T(L) → 0.
(iii) Define f_T(L) = max_{t≤T} Σ_{|h|≤L} ‖E u_t u_{t−h}'‖ (1 − ω(|h|, L)). Then f_T(L) → 0.
Assumption 2.2 allows us to prove the convergence rate of the covariance matrix
estimator. Condition (ii) is an extension of the standard weak serial dependence con-
dition to the high-dimensional case in panel data literature. It allows us to employ
banding or the Newey–West truncation procedure. Condition (iii) is satisfied by various kernel functions for the HAC-type estimator. For the Bartlett kernel, for example,

max_{t≤T} Σ_{|h|≤L} ‖E u_t u_{t−h}'‖ (1 − ω(|h|, L)) ≤ (1/L) max_{t≤T} Σ_{|h|=0}^{∞} ‖E u_t u_{t−h}'‖ |h|,

which converges to zero as L → ∞ as long as max_{t≤T} Σ_{|h|=0}^{∞} ‖E u_t u_{t−h}'‖ |h| < ∞.
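The Bartlett-kernel inequality behind this bound, 1 − ω(h, L) = |h|/(L + 1) ≤ |h|/L, can be checked numerically:

```python
L = 5
weights = [1.0 - h / (L + 1.0) for h in range(L + 1)]  # Bartlett omega(h, L)
# the residual factor 1 - omega(h, L) = h/(L+1) is bounded by h/L,
# which is what delivers the 1/L factor in the displayed bound
assert all(1.0 - w <= h / L + 1e-12 for h, w in zip(range(L + 1), weights))
```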
In this paper, we assume Σ_h to be a sparse matrix for each h and impose conditions similar to those in Bickel and Levina (2008a) and Fan et al. (2013): write Σ_h = (Σ_{h,ij})_{N×N}, where Σ_{h,ij} = E u_it u_{j,t−h}. For some q ∈ [0, 1), we define

m_N = max_{|h|≤L} max_{i≤N} Σ_{j=1}^{N} |Σ_{h,ij}|^q.
Assumption 2.3 For any NT × NT matrix M, we denote by (M)_{ts,ij} the (i, j)th element of the (t, s)th block of the matrix M.

(i) Σ_{|h|>L} ‖Σ_h‖_1 = O(L^{-α}), for a constant α > 0.
(ii) max_{i≤N, t≤T} Σ_{s=1}^{T} Σ_{j=1}^{N} |(Ω^{-1})_{ts,ij}| = O(1).
(iii) There is q ∈ [0, 1) such that L m_N γ_T^{1−q} = o(1) holds. In addition, √T L² m_N² γ_T^{3−2q} = o(1), and √(NT) L³ m_N³ γ_T^{3−3q} = o(1).
(iv) √(NT) (ξ_T(L) + f_T(L))³ = o(1) and L^{-α} √(NT) m_N γ_T^{1−q} = o(1).
Conditions (i)–(ii) require weak cross-sectional correlations. Condition (iii) contains the sparsity assumptions on the growth of m_N, associated with q and the rate of L.
Remark 2.2 To understand Assumption 2.3, consider a simple case where E u_it u_{j,t−h} is nonzero for only finitely many pairs i ≠ j. This corresponds to q = 0 and m_N = O(1). Then condition (iii) requires

√N L³ log^{3/2}(LT) = o(T).
In practice, the bandwidth L and log(L T ) both grow very slowly compared to N
and T . So essentially this condition requires N = o(T 2 ). In addition, condition (iv)
assumes that the autocorrelations decay sufficiently fast as L → ∞. Suppose both ξ_T(L) and f_T(L) decay at a polynomial rate in L (e.g., of order L^{-c₀}); then this condition requires the order of the polynomial, c₀, to be sufficiently large.
Under all of the above conditions, we show in the appendix the convergence of Ω̂ − Ω. This leads to the following proposition.
Proposition 2.1 Under Assumptions 2.1–2.2, for q ∈ [0, 1) and α > 0 such that Assumption 2.3 holds,

√(NT) (β̂_FGLS − β) = [ (1/(NT)) X' Ω^{-1} X ]^{-1} (1/√(NT)) X' Ω^{-1} U + [ (1/(NT)) X' Ω^{-1} X ]^{-1} (1/√(NT)) X' (Ω̂^{-1} − Ω^{-1}) U + o_P(1),

where G_{T,ij}^1(h) = (1/√T) Σ_{t=h+1}^{T} (u_it u_{j,t−h} − E u_it u_{j,t−h}) and G_{T,ij}^2(h) = (1/√T) Σ_{t=h+1}^{T} w_it ε_{j,t−h}.
Remark 2.3 While it is difficult to verify the above high-level condition in the presence of serial dependence, cross-sectional dependence, or both, the intuition can be understood in the simple i.i.d. case. Suppose u_it is independent across both i and t. Then we can set L = 0 and this condition becomes

A ≡ (1/√(NT)) Σ_{i=1}^{N} (1/T) Σ_{t=1}^{T} (u_it² − E u_it²) Σ_{s=1}^{T} x_is u_is σ_i^{-4} = o_P(1).
In the literature, the cause of the sharp increase in the US divorce rate in the 1960s–1970s is an important research question. During the 1970s, more than half of the states in the US liberalized their divorce laws, and the effects of these reforms on divorce rates have been investigated by many, such as Allen (1992) and Peters (1986). With controls for
state and year fixed effects, Friedberg (1998) suggested that state law reforms significantly increased divorce rates. She also assumed that unilateral divorce laws affected divorce rates permanently. However, empirical evidence shows that divorce rates have been decreasing since 1975. The question therefore arises whether the law reforms also contributed to this decline. Wolfers (2006) revisited this
question by using a treatment-effect panel data model and identified only temporary effects of the reforms on divorce rates. In particular, he used dummy variables for the first two years after the reforms, for years 3–4, 5–6, and so on. More specifically, the
following fixed effect panel data model was considered:
y_it = α_i + μ_t + Σ_{k=1}^{8} β_k X_it,k + δ_i t + u_it,   (10)
where y_it is the divorce rate for state i and year t, α_i a state fixed effect, μ_t a year fixed effect, and δ_i t a state-specific linear time trend with unknown coefficient δ_i. X_it,k is a binary regressor indicating that state i at time t is in the kth two-year interval after the reform. Wolfers (2006) suggested
that “the divorce rate rose sharply following the adoption of unilateral divorce laws,
but this rise was reversed within about a decade.” He also concluded that “15 years
after reform the divorce rate is lower as a result of the adoption of unilateral divorce,
although it is hard to draw any strong conclusions about long-run effects.”
Both Friedberg (1998) and Wolfers (2006) used a weighted model by multiplying
all variables by the square root of the state population. In addition, they used conventional OLS standard errors, which do not account for heteroskedasticity or for serial and cross-sectional correlations. However, standard errors can be biased when these correlations are disregarded. Therefore, we re-estimated the model of Wolfers (2006) using the
proposed FGLS method, and OLS with the heteroskedasticity-robust standard errors of White
(1980), the clustered standard error of Arellano (1987), and the robust standard error
of Bai et al. (2019).
The same dataset as in Wolfers (2006) is used, which includes the divorce rate, state-
level reform years, binary regressors, and state population. Due to missing observations
around divorce law reforms, we exclude Indiana, New Mexico and Louisiana. As a
result, we obtain balanced panel data from 1956 to 1988 for 48 states. We fit the
models both with and without linear time trend and use OLS and FGLS in each model
to estimate β.
The choice of the tuning parameters for implementing the FGLS follows the guidance provided in Table 1. Specifically, we set the bandwidth to L = 3, following the Newey and West (1994) rule L* = 4(T/100)^{2/9}. The thresholding constant is chosen by the cross-validation method discussed in Section 2.2.2; specifically, M = 1.8 and M = 1.9 for the models with and without linear time trends, respectively. The Bartlett kernel is used in both the OLS robust standard errors and the FGLS estimation.
Model (10) is in fact more complicated than the model we formally studied in
this paper, due to the inclusion of linear time trends and fixed effects. While theo-
retical studies of models with trends might be challenging in the high-dimensional
GLS setting, it is straightforward to implement it in the same FGLS framework
by applying a projection transformation that eliminates the time trend. Specifically, let ℓ = (1, 2, ..., T)' and P_ℓ = I_T − ℓ(ℓ'ℓ)^{-1}ℓ'. We can define Ẏ_i = P_ℓ (y_i1, ..., y_iT)' and Ẋ_i = P_ℓ (X_i1, ..., X_iT)', and define ẏ_it and Ẋ_it accordingly from y_it and X_it, after further removing the fixed effects.
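The detrending projection can be verified numerically (a sketch; T = 5 is an arbitrary toy choice):

```python
import numpy as np

def detrend_projection(T):
    # P = I_T - l (l'l)^{-1} l' with l = (1, 2, ..., T)'
    l = np.arange(1, T + 1, dtype=float).reshape(-1, 1)
    return np.eye(T) - l @ np.linalg.inv(l.T @ l) @ l.T

T = 5
P = detrend_projection(T)
trend = 2.5 * np.arange(1, T + 1)      # a pure linear trend
assert np.allclose(P @ trend, 0.0)     # annihilated by the projection
```

Note that P_ℓ alone removes only the trend component proportional to t; the fixed effects are removed in a separate step, as the text indicates.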
The estimates of β_1, · · · , β_8, with and without linear time trends, and the corresponding standard errors are summarized in Table 2 below. The OLS and FGLS estimates in both models
are similar to each other. The results show that divorce rates rose soon after the law
reform. However, within a decade, divorce rates had fallen over time. Interestingly,
FGLS confirms the negative effects of the law reforms on the divorce rates, specifically,
11-15+ years after the reform in the model with state-specific linear time trends, and
9-15+ years after the reform in the model without state-specific linear time trends.
In addition, the FGLS estimates for 1–6 and 1–4 years are positive and statistically significant in the models with and without linear time trends, respectively. For OLS, the coefficient estimates for 3–4 and 7–15+ years are significant in the model without linear time trends based on se_BCL. In contrast, the OLS estimates are statistically significant only for 1–4 years when linear time trends are added. According to the clustered standard error se_CX, only the 11–15+ estimates are statistically significant in the model without trends.
Based on the OLS and FGLS estimation results with and without linear time trends, we draw the following conclusion: in the first 8 years, the overall trend of the divorce rate is increasing, although the law reform begins to reduce the divorce rate after 3–4 years; from 8 years after the reform onward, the law reform has a negative effect on the divorce rate. Finally, we also note a noticeable difference between the magnitudes of the OLS and FGLS estimates. While this difference may be due to the small sample, the magnitudes of these estimates mostly lie within the 95% confidence interval of the other estimator. For instance, β̂_FGLS = 0.133 for the effect
[Table 2 Empirical application: effects of divorce law reform with state and year fixed effects. US state-level annual data from 1956 to 1988; the dependent variable is the divorce rate per 1,000 persons per year. Columns: β̂_OLS with se_W, se_CX and se_BCL; β̂_FGLS with se_FGLS.]
of 1–2 years, which is within the 95% confidence interval constructed using β̂_OLS; the latter is [−0.0184, 0.5304]. For another example, β̂_FGLS = 0.165 for the effect of 3–4 years, which is within the 95% confidence interval constructed using β̂_OLS; the latter is [0.050, 0.367]. We note that these confidence intervals are relatively wide, as a consequence of the relatively small sample size in this study.
Overall, the results of FGLS estimates are consistent with Wolfers (2006). The
FGLS confirms that the law reforms significantly contribute to the subsequent decrease
in the divorce rates, more specifically, 9–15 years after the reform in the model without
linear time trends, and 11–15 years after in the model with linear time trends. Wolfers (2006) de-emphasized the negative coefficients at the end of the period, as they are not robust to the inclusion of state-specific quadratic trends, which we do not employ in this paper. Nevertheless, we interpret the economic insight of these results as two sides of the same treatment, the law reforms: after the earlier dissolution of bad matches following the reforms, marital relations were gradually affected and changed.
4 Conclusions
This paper considers generalized least squares (GLS) estimation for linear panel data
models. By estimating the large error covariance matrix consistently, the proposed
feasible GLS estimator is more efficient than the ordinary least squares (OLS) in the
presence of heteroskedasticity, serial and cross-sectional correlations. The covariance
matrix used for the feasible GLS is estimated via the banding and thresholding method.
We establish the limiting distribution of the proposed estimator, examine its finite-sample performance in a Monte Carlo study, and apply the method in an empirical application.
References
Abadie A, Athey S, Imbens GW, Wooldridge J (2017) When should you adjust standard errors for clustering?.
National Bureau of Economic Research Working Paper No 24003
Allen DW (1992) Marriage and divorce: comment. Am Econ Rev 82(3):679–685
Andrews DW (1991) Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econ
J Econ Soc 59(3):817–858
Arellano M (1987) Computing Robust standard errors for within-groups estimators. Oxford Bull Econ Stat
49(4):431–434
Bai J, Choi SH, Liao Y (2019) Standard Errors for Panel Data Models with Unknown Clusters.
arXiv:1910.07406
Bai J, Liao Y (2017) Inferences in panel data with interactive effects using large covariance matrices. J
Econ 200(1):59–78
Bai J, Ng S (2017) Principal components and regularized estimation of factor models. arXiv:1708.08137
Bickel PJ, Levina E (2008a) Covariance regularization by thresholding. Ann Stat 36(6):2577–2604
Bickel PJ, Levina E (2008b) Regularized estimation of large covariance matrices. Ann Stat 36(1):199–227
Cameron AC, Miller DL (2015) A practitioner’s guide to cluster-robust inference. J Human Resour
50(2):317–372
Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W (2017) Double/debiased/Neyman machine learning of treatment effects. Am Econ Rev 107(5):261–65
Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey WK (2016) Double machine
learning for treatment and causal parameters. Discussion paper, cemmap working paper, Centre for
Microdata Methods and Practice
Driscoll JC, Kraay AC (1998) Consistent covariance matrix estimation with spatially dependent panel data.
Rev Econ Stat 80(4):549–560
Fan J, Liao Y, Mincheva M (2013) Large covariance estimation by thresholding principal orthogonal complements. J R Stat Soc Ser B 75(4):603–680
Friedberg L (1998) Did unilateral divorce raise divorce rates? Evidence from panel data. Am Econ Rev
88(3):608–627
Hansen CB (2007a) Asymptotic properties of a robust variance matrix estimator for panel data when T is
large. J Econ 141(2):597–620
Hansen CB (2007b) Generalized least squares inference in panel and multilevel models with serial correlation and fixed effects. J Econ 140(2):670–694
Hansen LP (1982) Large sample properties of generalized method of moments estimators. Econ J Econ
Soc, pp 1029–1054
Liang K-Y, Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika
73(1):13–22
Miller S, Startz R (2018) Feasible generalized least squares using machine learning. Available at SSRN 2966194
Newey WK (1990) Efficient instrumental variables estimation of nonlinear models. Econ J Econ Soc, pp
809–837
Newey WK, McFadden D (1994) Large sample estimation and hypothesis testing. Handbook Econ 4:2111–2245
Newey WK, West KD (1987) A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econ J Econ Soc 55:703–708
Newey WK, West KD (1994) Automatic lag selection in covariance matrix estimation. Rev Econ Stud
61(4):631–653
Peters HE (1986) Marriage and divorce: Informational constraints and private contracting. Am Econ Rev
76(3):437–454
Petersen MA (2009) Estimating standard errors in finance panel data sets: Comparing approaches. Rev
Financ Stud 22(1):435–480
Romano JP, Wolf M (2017) Resurrecting weighted least squares. J Econ 197(1):1–19
Vogelsang TJ (2012) Heteroskedasticity, autocorrelation, and spatial correlation robust inference in linear
panel models with fixed-effects. J Econ 166(2):303–319
Wager S, Athey S (2018) Estimation and inference of heterogeneous treatment effects using random forests.
J Am Stat Assoc 113(523):1228–1242
White H (1980) A heteroskedasticity-consistent covariance matrix estimator and a direct test for het-
eroskedasticity. Econ J Econ Soc, pp 817–838
Wolfers J (2006) Did unilateral divorce laws raise divorce rates? A reconciliation and new results. Am Econ
Rev 96(5):1802–1820
Wooldridge JM (2010) Econometric analysis of cross section and panel data. MIT press, Cambridge