
Empirical Economics

https://doi.org/10.1007/s00181-020-01977-2

Feasible generalized least squares for panel data with cross-sectional and serial correlations

Jushan Bai1 · Sung Hoon Choi2 · Yuan Liao2

Received: 21 March 2020 / Accepted: 30 October 2020


© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract
This paper considers generalized least squares (GLS) estimation for linear panel data
models. By estimating the large error covariance matrix consistently, the proposed
feasible GLS estimator is more efficient than the ordinary least squares in the presence
of heteroskedasticity, serial and cross-sectional correlations. The covariance matrix
used for the feasible GLS is estimated via the banding and thresholding method. We
establish the limiting distribution of the proposed estimator. A Monte Carlo study examines its finite-sample performance, and the method is applied in an empirical study of US divorce rates.

Keywords Panel data · Efficiency · Thresholding · Banding · Cross-sectional correlation · Serial correlation · Heteroskedasticity

1 Introduction

Heteroskedasticity, cross-sectional and serial correlations are important problems in the error terms of panel regression models. There are two approaches to deal with these problems. The first approach is to use the ordinary least squares (OLS) estimator but with a standard error that is robust to heteroskedasticity and correlations, for example, White (1980); Newey and West (1987); Liang and Zeger (1986); Arellano (1987); Driscoll and Kraay (1998); Hansen (2007a); Vogelsang (2012), among others.

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s00181-020-01977-2) contains supplementary material, which is available to authorized users.

Corresponding author: Yuan Liao (yuan.liao@rutgers.edu)
Jushan Bai (jb3064@columbia.edu)
Sung Hoon Choi (shchoi@economics.rutgers.edu)

1 Columbia University, 420 West 118th St. MC 3308, New York, NY 10027, USA
2 Rutgers University, 75 Hamilton St., New Brunswick, NJ 08901, USA


A widely used class of robust standard errors is clustered standard errors, for
example, Petersen (2009), Wooldridge (2010) and Cameron and Miller (2015). Bai et
al. (2019) proposed a robust standard error with unknown clusters. In an interesting paper, Abadie et al. (2017) argued for caution in the application of clustered standard errors, since they may give rise to conservative confidence intervals. The
second approach is to use the generalized least squares estimator (GLS) that directly
takes into account heteroskedasticity, and cross-sectional and serial correlations in the
estimation. It is well known that GLS is more efficient than OLS.
This paper focuses on the second approach. For panel models, the underlying covariance matrix involves a large number of parameters, so it is important to make GLS operational. We thus consider feasible generalized least squares (FGLS). Hansen (2007b) studied FGLS estimation that takes into account serial correlation and clustering problems in fixed-effects panel and multilevel models. His approach requires the cluster structure to be known; this motivates our paper. We allow the cluster structure to be unknown and control for heteroskedasticity and for both serial and cross-sectional correlations by estimating the large error covariance matrix consistently.
In a cross-sectional setting, Romano and Wolf (2017) obtained asymptotically valid inference for the FGLS estimator, combined with heteroskedasticity-consistent standard errors, without knowledge of the functional form of the conditional heteroskedasticity. Moreover, Miller and Startz (2018) adapted machine learning methods (i.e., support vector regression) to account for a possibly misspecified form of heteroskedasticity.
In this paper, we consider (i) balanced panel data, (ii) the case of large $N$ and large $T$, and (iii) both serial and cross-sectional correlations with an unknown cluster structure. We introduce a modified FGLS estimator that eliminates the cross-sectional and serial correlation bias by proposing a high-dimensional error covariance matrix estimator. In addition, our proposed method is applicable when knowledge of the clusters is not available. Let $u_t$ be an $N \times 1$ vector of regression errors, whose definition is made precise later. Following the idea of Bai and Liao (2017), the FGLS in this paper involves estimating an $NT \times NT$ dimensional inverse covariance matrix $\Omega^{-1}$, where

$$\Omega = (E u_t u_s'),$$

and each block $E u_t u_s'$ is an $N \times N$ autocovariance matrix. Here, parametric structures on the serial or cross-sectional correlations are not imposed. By assuming weak dependence, we apply nonparametric methods to estimate the covariance matrix. To address the estimation of serial autocorrelations, we employ the idea of Newey–West truncation. This method, in the FGLS setting, is equivalent to "banding," previously proposed by Bickel and Levina (2008b) for estimating large covariance matrices. We apply it to band out the off-diagonal $N \times N$ blocks that are far from the diagonal blocks. In addition, to control for the cross-sectional correlation, we assume that each of the $N \times N$ blocks is sparse, potentially resulting from the presence of cross-sectional correlations within clusters. We then estimate them by applying the thresholding approach of Bickel and Levina (2008a). We apply thresholding separately to the $N \times N$ blocks, which are formed by the time lags $E u_t u_{t-h}'$: $h = 0, 1, 2, \ldots$. This allows the cluster membership to potentially change over time. A contribution


of this paper is the theoretical justification for estimating the large error covariance
matrix.
For the FGLS, it is crucial for the asymptotic analysis to prove that the effect of estimating $\Omega$ is first-order negligible. In the usual low-dimensional settings that involve estimating an optimal weight matrix, such as optimal GMM estimation, it is well known that consistency of the inverse covariance matrix estimator is sufficient for the first-order asymptotic theory, e.g., Hansen (1982), Newey (1990), Newey and McFadden (1994). However, it turns out that when the covariance matrix is high-dimensional, not even the optimal convergence rate for estimating $\Omega^{-1}$ is sufficient. In fact, proving the first-order equivalence between the FGLS and the infeasible GLS (which uses the true $\Omega^{-1}$) is a very challenging problem in the large-$N$, large-$T$ setting. We provide a new theoretical argument to achieve this goal.
The banding and thresholding methods, which we employ in this paper, are two useful regularization methods. In the recent machine learning literature, these methods have been extensively exploited for estimating high-dimensional parameters. Moreover, in the econometrics literature, nonparametric machine learning techniques have proven to be powerful tools: Bai and Ng (2017); Chernozhukov et al. (2016); Chernozhukov et al. (2017); Wager and Athey (2018), etc.
The rest of the paper is organized as follows. In Sect. 2, we describe the model and the large error covariance matrix estimator, introduce the implementation of the FGLS estimator, and present its limiting distribution. In Sect. 3, we apply our method to study US divorce rates. Conclusions are provided in Sect. 4. All proofs and Monte Carlo studies are given in the online supplement.
Throughout this paper, let $\nu_{\min}(A)$ and $\nu_{\max}(A)$ denote the minimum and maximum eigenvalues of a matrix $A$, respectively. Also, we use $\|A\| = \sqrt{\nu_{\max}(A'A)}$, $\|A\|_1 = \max_j \sum_i |A_{ij}|$ and $\|A\|_F = \sqrt{\mathrm{tr}(A'A)}$ as the operator norm, $\ell_1$-norm and Frobenius norm of a matrix $A$, respectively. Note that if $A$ is a vector, $\|A\| = \|A\|_F$ is equal to the Euclidean norm.

2 Feasible generalized least squares

We consider a linear model¹

$$y_{it} = x_{it}'\beta + u_{it}. \qquad (1)$$

The model (1) can be stacked and represented in full matrix notation as

$$Y = X\beta + U, \qquad (2)$$

where $Y = (y_1', \cdots, y_T')'$ is the $NT \times 1$ vector of $y_{it}$, with each $y_t$ being an $N \times 1$ vector; $X = (x_1', \cdots, x_T')'$ is the $NT \times d$ matrix of $x_{it}$, with each $x_t$ being $N \times d$; and $U = (u_1', \cdots, u_T')'$ is the $NT \times 1$ vector of $u_{it}$, with each $u_t$ being an $N \times 1$ vector.

¹ For technical simplicity, we focus on a simple model without fixed effects. It is straightforward to allow additive fixed effects $\alpha_i + \mu_t$ by de-meaning first; the theory would be slightly more sophisticated, though such extensions are straightforward.


Let $\Omega = (E u_t u_s')$ be an $NT \times NT$ matrix consisting of many block matrices. The $(t,s)$th block is the $N \times N$ covariance matrix $E u_t u_s'$. We consider the following (infeasible) GLS estimator of $\beta$:

$$\hat\beta^{\,inf}_{GLS} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}Y. \qquad (3)$$

Note that $\Omega$ is a high-dimensional conditional covariance matrix, which is very difficult to estimate. We aim to achieve the following: (i) obtain a "good" estimator of $\Omega^{-1}$, allowing an arbitrary form of weak dependence in $u_{it}$, and (ii) show that the effect of replacing $\Omega^{-1}$ by $\hat\Omega^{-1}$ is asymptotically negligible.
We start with a population approximation for $\Omega$ in order to gain intuition. Then, we suggest an estimator of $\Omega$ that takes into account both serial and cross-sectional correlations.

2.1 Population approximation

We start with a "banding" approximation to control serial correlations. Recall that $\Omega = (E u_t u_s')$, where the $(t,s)$ block is $E u_t u_s'$. By assuming serial stationarity and a strong mixing condition, $E u_t u_s'$ depends on $(t,s)$ only through $h = t - s$. Specifically, with slight abuse of notation, we can write $\Omega_{t,s} = \Sigma_h = E u_t u_{t-h}'$. Note that for $i \neq j$, it is possible that $E u_{it}u_{j,t-h} \neq E u_{i,t-h}u_{jt}$, so $\Sigma_h$ is possibly non-symmetric for $h > 0$. On the other hand, $\Omega$ is symmetric because $\Omega_{s,t} = \Omega_{t,s}'$. The diagonal blocks are the same, and all equal $\Sigma_0 = E u_t u_t'$, while the magnitudes of the elements of the off-diagonal blocks $\Sigma_h = E u_t u_{t-h}'$ decay to zero as $|h| \to \infty$ under the weak serial dependence assumption.
In the Newey–West spirit, $\Omega$ can be approximated by $\Omega^{NW} = (\Omega^{NW}_{t,s})$, where each block can be written as $\Omega^{NW}_{t,s} = \Sigma^{NW}_h$ for $h = t - s$. Here, $\Sigma^{NW}_h$ is an $N \times N$ block matrix, defined as

$$\Sigma^{NW}_h = \begin{cases} E u_t u_{t-h}', & \text{if } |h| \le L \\ 0, & \text{if } |h| > L,\end{cases}$$

for some pre-determined $L \to \infty$. For instance, as suggested by Newey and West (1994), we can set $L$ equal to $4(T/100)^{2/9}$. Note that $\Sigma^{NW}_h = \Sigma^{NW\prime}_{-h}$. We regard $\Omega^{NW} = (\Sigma^{NW}_h)$ as the "population banding approximation."


Next, we focus on the $N \times N$ block matrix $\Sigma_h = E u_t u_{t-h}'$ to control cross-sectional correlations. Under the intuition that $u_{it}$ is cross-sectionally weakly dependent, we assume $\Sigma_h$ is a sparse matrix, that is, $\Sigma_{h,ij} = E u_{it}u_{j,t-h}$ is "small" for "many" pairs $(i,j)$. Then, $\Sigma_h$ can be approximated by a sparse matrix $\Sigma^{BL}_h = (\Sigma^{BL}_{h,ij})_{N\times N}$ (Bickel and Levina 2008a), where

$$\Sigma^{BL}_{h,ij} = \begin{cases} E u_{it}u_{j,t-h}, & \text{if } |E u_{it}u_{j,t-h}| > \tau_{ij} \\ 0, & \text{if } |E u_{it}u_{j,t-h}| \le \tau_{ij},\end{cases}$$

for some pre-determined threshold $\tau_{ij} \to 0$. We regard $\Sigma^{BL}_h$ as the "population sparse approximation."
In summary, we approximate $\Omega$ by an $NT \times NT$ matrix $(\Omega^{NT}_{t,s})$, where each block $\Omega^{NT}_{t,s}$ is an $N \times N$ matrix, defined as follows: for $h = t - s$,

$$\Omega^{NT}_{t,s} := \begin{cases} \Sigma^{BL}_h, & \text{if } |h| \le L \\ 0, & \text{if } |h| > L.\end{cases}$$

Therefore, we use "banding" to control the serial correlation and "sparsity" to control the cross-sectional correlation.
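As a worked illustration (ours, not a display from the paper), for bandwidth $L = 2$ and $T = 5$ the banded approximation $\Omega^{NW}$ has the block-Toeplitz form

```latex
\Omega^{NW} \;=\;
\begin{pmatrix}
\Sigma_0 & \Sigma_1' & \Sigma_2' & 0         & 0         \\
\Sigma_1 & \Sigma_0  & \Sigma_1' & \Sigma_2' & 0         \\
\Sigma_2 & \Sigma_1  & \Sigma_0  & \Sigma_1' & \Sigma_2' \\
0        & \Sigma_2  & \Sigma_1  & \Sigma_0  & \Sigma_1' \\
0        & 0         & \Sigma_2  & \Sigma_1  & \Sigma_0
\end{pmatrix},
\qquad \Sigma_h = E u_t u_{t-h}'.
```

The final approximation $\Omega^{NT}$ is obtained by replacing each block $\Sigma_h$ with its sparse version $\Sigma^{BL}_h$.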

2.2 Implementation of feasible GLS

2.2.1 The estimator of Ω and FGLS

Given the intuition from the population approximation, we construct the large covariance estimator as follows. First, we denote the OLS estimator of $\beta$ by $\hat\beta_{OLS}$ and the corresponding residuals by $\hat u_{it} = y_{it} - x_{it}'\hat\beta_{OLS}$.
Now, we estimate the $N \times N$ block matrix $\Sigma_h = E u_t u_{t-h}'$. To do so, let

$$\hat R_{h,ij} = \begin{cases}\dfrac{1}{T}\sum_{t=h+1}^{T}\hat u_{it}\hat u_{j,t-h}, & \text{if } h \ge 0\\[4pt] \dfrac{1}{T}\sum_{t=1}^{T+h}\hat u_{it}\hat u_{j,t-h}, & \text{if } h < 0,\end{cases}
\qquad\text{and}\qquad
\hat\sigma_{h,ij} = \begin{cases}\hat R_{h,ii}, & \text{if } i = j\\ s_{ij}(\hat R_{h,ij}), & \text{if } i \neq j,\end{cases}$$

where $s_{ij}(\cdot): \mathbb R \to \mathbb R$ is a "soft-thresholding function" with an entry-dependent threshold $\tau_{ij}$ such that

$$s_{ij}(z) = \mathrm{sgn}(z)(|z| - \tau_{ij})_+,$$

where $(x)_+ = x$ if $x \ge 0$, and zero otherwise. Here, $\mathrm{sgn}(\cdot)$ denotes the sign function, and other thresholding functions, e.g., hard thresholding, are possible. For the threshold value, we specify

$$\tau_{ij} = M\gamma_T\sqrt{\hat R_{0,ii}\,\hat R_{0,jj}},$$

for some pre-determined value $M > 0$, where $\gamma_T = \sqrt{\log(LN)/T}$ is such that $\max_{h\le L}\max_{i,j\le N}|\hat R_{h,ij} - E u_{it}u_{j,t-h}| = O_P(\gamma_T)$. Note that here we use an entry-dependent threshold $\tau_{ij}$, which may vary across $(i,j)$. Then, define

$$\hat\Sigma_h = (\hat\sigma_{h,ij})_{N\times N}. \qquad (4)$$
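As a concrete illustration, the following is a minimal sketch in Python/NumPy (our own code, with hypothetical names such as `lagged_autocov`, `soft_threshold_block`, and `U_hat`) of the lag-$h$ residual autocovariance $\hat R_h$ and the entry-dependent soft thresholding that yields $\hat\Sigma_h$:

```python
import numpy as np

def lagged_autocov(U_hat, h):
    """R_h = (1/T) * sum_{t=h+1}^{T} u_hat_t u_hat_{t-h}', with U_hat a T x N residual matrix, h >= 0.
    For h < 0 one can use the relation R_{-h} = R_h' implied by the definition above."""
    T, N = U_hat.shape
    R = np.zeros((N, N))
    for t in range(h, T):
        R += np.outer(U_hat[t], U_hat[t - h])
    return R / T

def soft_threshold_block(R_h, R_0, M, gamma_T):
    """Entry-dependent soft thresholding: sigma_{h,ij} = sgn(R_{h,ij}) * (|R_{h,ij}| - tau_ij)_+ for i != j,
    with tau_ij = M * gamma_T * sqrt(R_{0,ii} * R_{0,jj}); diagonal entries are kept un-thresholded."""
    d = np.sqrt(np.abs(np.diag(R_0)))
    tau = M * gamma_T * np.outer(d, d)
    S = np.sign(R_h) * np.maximum(np.abs(R_h) - tau, 0.0)
    np.fill_diagonal(S, np.diag(R_h))
    return S

# Example usage: thresholded lag blocks Sigma_hat_0, ..., Sigma_hat_L from OLS residuals U_hat (T x N)
# T, N = U_hat.shape
# gamma_T = np.sqrt(np.log(L * N) / T)
# R0 = lagged_autocov(U_hat, 0)
# Sigma_hat = [soft_threshold_block(lagged_autocov(U_hat, h), R0, M, gamma_T) for h in range(L + 1)]
```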


Next, we define the $(t,s)$th block $\hat\Omega_{t,s}$ as an $N \times N$ matrix: for $h = t - s$,

$$\hat\Omega_{t,s} = \begin{cases}\omega(|h|, L)\hat\Sigma_h, & \text{if } |h| \le L\\ 0, & \text{if } |h| > L.\end{cases}$$

Here, $\omega(h, L)$ is a kernel function (see Andrews (1991) and Newey and West (1994)). We let $\omega(h, L) = 1 - h/(L+1)$ be the Bartlett kernel, where $L$ is the bandwidth. Our final estimator of $\Omega$ is the $NT \times NT$ matrix

$$\hat\Omega = (\hat\Omega_{t,s}).$$

Here, $\hat\Omega$ is a nonparametric estimator, which does not require an assumed parametric structure on $\Omega$.
Finally, given $\hat\Omega$, we propose the feasible GLS (FGLS) estimator of $\beta$ as

$$\hat\beta_{FGLS} = [X'\hat\Omega^{-1}X]^{-1}X'\hat\Omega^{-1}Y.$$

Note that the FGLS estimator defined above leaves two quantities to be specified by applied researchers: (i) the constant $M > 0$ in the threshold value $\tau_{ij}$, and (ii) the Newey–West bandwidth $L$. We discuss the choice of these two quantities in Sect. 2.2.2 below.
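The following sketch (again our own Python, not the authors' code) assembles the lag blocks into $\hat\Omega$ with Bartlett weights and computes $\hat\beta_{FGLS}$ together with the standard errors implied by Theorem 2.1 below. Forming and inverting the full $NT \times NT$ matrix is for illustration only; in practice one would exploit the banded block structure and check positive definiteness via the constant $c$ discussed in Sect. 2.2.2:

```python
import numpy as np

def fgls(Y, X, Sigma_hat, L):
    """Feasible GLS with a banded, thresholded covariance estimate.
    Y: (N*T,) and X: (N*T, d), stacked by time blocks (y_1', ..., y_T')';
    Sigma_hat: list [Sigma_hat_0, ..., Sigma_hat_L] of N x N thresholded lag blocks."""
    N = Sigma_hat[0].shape[0]
    T = Y.shape[0] // N
    Omega_hat = np.zeros((N * T, N * T))
    for t in range(T):
        for s in range(T):
            h = t - s
            if abs(h) > L:
                continue                                  # banding: zero blocks beyond lag L
            w = 1.0 - abs(h) / (L + 1.0)                  # Bartlett kernel weight
            B = Sigma_hat[abs(h)]
            Omega_hat[t*N:(t+1)*N, s*N:(s+1)*N] = w * (B if h >= 0 else B.T)
    Omega_inv = np.linalg.inv(Omega_hat)
    A = X.T @ Omega_inv @ X                               # X' Omega_hat^{-1} X
    beta = np.linalg.solve(A, X.T @ Omega_inv @ Y)        # FGLS estimate
    se = np.sqrt(np.diag(np.linalg.inv(A)))               # sqrt of diag of (X' Omega_hat^{-1} X)^{-1}
    return beta, se
```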

Remark 2.1 (Universal thresholding) We apply thresholding separately to the $N \times N$ blocks $(\hat\sigma_{h,ij})_{N\times N}$, which are the estimated lag blocks for $E u_t u_{t-h}'$: $h = 0, 1, 2, \ldots$. This allows the cluster membership to potentially change over time, that is, the identities of the zero and nonzero elements of $E u_t u_{t-h}'$ can change over $h$. If it is known that the cluster membership (i.e., the identities of the nonzero elements) is time-invariant, then one would set $\hat\sigma_{h,ij} = 0$ if $\max_{h\le L}|\hat R_{h,ij}| \le \tau_{ij}$ for $i \neq j$. This could potentially increase the finite-sample accuracy of identifying the cluster membership.

2.2.2 Choice of tuning parameters

Our suggested covariance matrix estimator $\hat\Omega$ requires the choice of two tuning parameters, $L$ and $M$, which are the bandwidth and the threshold constant, respectively. We write $\hat\Omega = \hat\Omega(M, L)$, as the covariance estimator depends on $M$ and $L$. First, to choose the bandwidth $L$, we suggest using $L^* = 4(T/100)^{2/9}$, as proposed by Newey and West (1994). For small $T$, we also recommend $L \le 3$.
As for the choice of the thresholding constant $M$, our recommended rule-of-thumb choice is any constant in the interval $[0.5, 2]$. Based on extensive simulation studies with various values of $N$ and $T$, we find that $M = 1.8$ is a universally good choice.
Alternatively, $M$ can also be chosen through multifold cross-validation. To describe this procedure, we split the data $P$ times. We divide the data into $P = \log(T)$ blocks $J_1, \ldots, J_P$ with block length $T/\log(T)$ and take one of the $P$ blocks as the validation set. At the $p$th split, we denote by $\hat\Sigma^p_0$ the sample covariance matrix based on the validation set, defined by $\hat\Sigma^p_0 = |J_p|^{-1}\sum_{t\in J_p}\hat u_t\hat u_t'$. Let $\hat\Sigma^{S,p}_0(M)$ be the thresholding estimator with threshold constant $M$ using the training data set $\{\hat u_t\}_{t\notin J_p}$. Finally, we choose the constant $M^*$ by minimizing the cross-validation objective function

$$M^* = \arg\min_{c < M < \bar C}\ \frac{1}{P}\sum_{p=1}^{P}\big\|\hat\Sigma^{S,p}_0(M) - \hat\Sigma^p_0\big\|_F^2,$$

where $\bar C$ is a large constant such that $\hat\Sigma^{S}_0(\bar C)$ is a diagonal matrix, and can be fixed at, e.g., $\bar C = 3$; $c$ is a constant that guarantees the positive definiteness of $\hat\Omega(M, L)$ for $M > c$: for each fixed $L$,

$$c = \inf\big[M > 0: \lambda_{\min}\{\hat\Omega(C, L)\} > 0,\ \forall C > M\big].$$

Here, $\hat\Sigma^{S}_0(M)$ is the soft-thresholded estimator as defined in Eq. (4). The resulting estimator of $\Omega$ is then $\hat\Omega(M^*, L^*)$. To determine $c$, one can plot $\lambda_{\min}\{\hat\Omega(C, L)\}$ as a function of $C$, fixing $L = L^*$, and visually determine $c$.
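A sketch of this cross-validation in Python (our own code, reusing `soft_threshold_block` from the earlier sketch; `c_lower` stands for the constant $c$, which is determined separately by inspecting $\lambda_{\min}\{\hat\Omega(C,L)\}$):

```python
import numpy as np

def choose_M_by_cv(U_hat, M_grid, L, c_lower, C_bar=3.0):
    """Pick M in (c_lower, C_bar) minimizing the average Frobenius loss between the soft-thresholded
    lag-0 block from the training blocks and the raw lag-0 covariance of the validation block."""
    T, N = U_hat.shape
    P = max(int(np.log(T)), 2)                     # P = log(T) blocks
    folds = np.array_split(np.arange(T), P)
    gamma_T = np.sqrt(np.log(L * N) / T)
    best_M, best_loss = None, np.inf
    for M in M_grid:
        if M <= c_lower or M >= C_bar:
            continue
        loss = 0.0
        for val in folds:
            train = np.setdiff1d(np.arange(T), val)
            Sig_val = U_hat[val].T @ U_hat[val] / len(val)          # validation covariance
            R0_train = U_hat[train].T @ U_hat[train] / len(train)   # training covariance
            Sig_thr = soft_threshold_block(R0_train, R0_train, M, gamma_T)
            loss += np.linalg.norm(Sig_thr - Sig_val, 'fro') ** 2
        loss /= P
        if loss < best_loss:
            best_M, best_loss = M, loss
    return best_M
```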
In summary, Table 1 summarizes the recommended quantities for implementing
the proposed FGLS estimator.

2.2.3 Incorporating known clusters

Note that an advantage of the method proposed in this paper is that it does not assume known cluster information (i.e., the number of clusters and the cluster memberships). On the other hand, when clustering information is available, the method can be modified to take that information into account; this is particularly suitable when the number of clusters is small and the size of each cluster is large.
For example, let $C_1, \ldots, C_G$ be disjoint subsets of $\{1, \ldots, N\}$ representing known clusters, such that $u_{it}$ and $u_{js}$ are uncorrelated for any $(t,s)$ if $i$ and $j$ belong to different clusters. Then, we can naturally re-arrange the $N \times N$ matrix $\Sigma_h = E u_t u_{t-h}'$ so that it decomposes into $G$ disjoint diagonal blocks, with all off-diagonal blocks equal to zero:

$$\Sigma_h = \begin{pmatrix}\Sigma_{h,1} & & \\ & \ddots & \\ & & \Sigma_{h,G}\end{pmatrix}.$$

It is assumed that $G$ is small while the size of each diagonal block matrix is large. Within the $g$th ($g \le G$) diagonal block, say $\Sigma_{h,g}$, we apply thresholding to further reduce the dimensionality. So we estimate $\Sigma_{h,g}$ by $\hat\Sigma_{h,g} = (\hat\sigma_{h,g,ij})$, where

$$\hat\sigma_{h,g,ij} = \begin{cases}\hat R_{h,ii}, & \text{if } i = j \text{ and } i,j\in C_g\\ s_{ij}(\hat R_{h,ij}), & \text{if } i \neq j \text{ and } i,j\in C_g.\end{cases}$$

Table 1  Recommended choices for implementation

  Quantity              Recommended choice
  $s_{ij}(z)$           $\mathrm{sgn}(z)(|z| - \tau_{ij})_+$
  $\omega(|h|, L)$      $1 - |h|/(L+1)$
  $\tau_{ij}$           $M\gamma_T\sqrt{\hat R_{0,ii}\,\hat R_{0,jj}}$
  $\gamma_T$            $\sqrt{\log(LN)/T}$
  $L$                   $4(T/100)^{2/9}$
  $M$ (rule of thumb)   $1.8$
  $P$                   $\log T$
  $\bar C$              $3$
  $c$                   determined visually by plotting $\lambda_{\min}\{\hat\Omega(C, L)\}$

Here, $P$, $\bar C$ and $c$ are the constants required for the choice of $M$ based on cross-validation.

Putting these estimated diagonal blocks together, we obtain $\hat\Sigma_h$, the estimate of $\Sigma_h$. The within-cluster thresholding then allows unknown correlations within each cluster. In contrast, conventional clustered standard errors lose many degrees of freedom when the cluster sizes are too large (because each cluster is effectively treated as a "single observation"), resulting in conservative confidence intervals. See Cameron and Miller (2015) for more discussion.
Moreover, when the number of clusters is large and the size of each cluster is small, we are in the usual setting of clustered standard errors. One then does not need to apply thresholding, as the known clusters naturally form small diagonal blocks of $\Sigma_h$. Because these blocks are small, sufficient degrees of freedom are retained and it is straightforward to estimate $\Sigma_h$.
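A sketch of the known-cluster construction above (our own Python, reusing `soft_threshold_block`; `clusters` is a hypothetical list of index arrays for $C_1, \ldots, C_G$). Setting `threshold_within=False` corresponds to the case of many small clusters, where no thresholding is needed:

```python
import numpy as np

def clustered_sigma(R_h, R_0, clusters, M, gamma_T, threshold_within=True):
    """Set all entries across different clusters to zero; within each cluster, either keep the raw
    sample block (small clusters) or apply entry-dependent soft thresholding (large clusters)."""
    N = R_h.shape[0]
    Sigma = np.zeros((N, N))
    for idx in clusters:
        block = R_h[np.ix_(idx, idx)]
        if threshold_within:
            block = soft_threshold_block(block, R_0[np.ix_(idx, idx)], M, gamma_T)
        Sigma[np.ix_(idx, idx)] = block
    return Sigma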

2.3 The effect of $\hat\Omega^{-1} - \Omega^{-1}$

A key step in proving the asymptotic properties of $\hat\beta_{FGLS}$ is to show that it is asymptotically equivalent to $\hat\beta^{\,inf}_{GLS}$, that is,

$$\frac{1}{\sqrt{NT}}X'(\hat\Omega^{-1} - \Omega^{-1})U = o_P(1). \qquad (5)$$

In the usual low-dimensional settings that involve estimating an optimal weight matrix, such as optimal GMM estimation, it is well known that consistency of the inverse covariance matrix estimator is sufficient for the first-order asymptotic theory, e.g., Hansen (1982), Newey (1990), Newey and McFadden (1994). It turns out that when the covariance matrix is high-dimensional, not even the optimal convergence rate of $\hat\Omega - \Omega$ is sufficient. In fact, proving Eq. (5) is a very challenging problem. In the general case when both cross-sectional and serial correlations are present, our strategy is to use a careful expansion of $\frac{1}{\sqrt{NT}}X'(\hat\Omega^{-1} - \Omega^{-1})U$. We proceed in two steps:
Step 1: Show that $\frac{1}{\sqrt{NT}}X'(\hat\Omega^{-1} - \Omega^{-1})U = \frac{1}{\sqrt{NT}}W'(\hat\Omega - \Omega)\varepsilon + o_P(1)$, where $W = \Omega^{-1}X$ and $\varepsilon = \Omega^{-1}U$.
Step 2: Show that $\frac{1}{\sqrt{NT}}W'(\hat\Omega - \Omega)\varepsilon = o_P(1)$.

Now, suppose $\omega(h, L) = 1$ and $\Omega \approx \Omega^{NW}$, and let $A^b_h = \{(i,j): |E u_{it}u_{j,t-h}| \neq 0\}$ and $A^s_h = \{(i,j): |E u_{it}u_{j,t-h}| = 0\}$. As for Step 2, we shall show

$$\frac{1}{\sqrt{NT}}W'(\hat\Omega - \Omega)\varepsilon \approx \frac{1}{\sqrt{NT}}\sum_{|h|\le L}\sum_{i,j\in A^b_h}\frac{1}{T}\sum_{t=h+1}^{T}w_{it}\varepsilon_{j,t-h}\sum_{s=h+1}^{T}(u_{is}u_{j,s-h} - E u_{it}u_{j,t-h}). \qquad (6)$$

Here, $w_{it}$ is defined such that we can write $W = (w_1', \cdots, w_T')'$ with $w_t$ being an $N \times d$ matrix of $w_{it}$; $\varepsilon_{it}$ is defined similarly. Proving (6) to be $o_P(1)$ in the presence of both serial and cross-sectional correlations is technically very challenging. We thus directly assume it is $o_P(1)$ as a high-level condition (see Assumption 2.4 in Sect. 2.4 below). To appreciate the need for this high-level condition, consider the following simple example.
A simple example To illustrate the key technical issue, consider a simple and ideal case where $u_{it}$ is known and independent across both $i$ and $t$, but with cross-sectional heteroskedasticity. In this case, the covariance matrix of the $NT \times 1$ vector $U$ is a diagonal matrix with diagonal elements $\sigma_i^2 = E u_{it}^2$:

$$\Omega = \begin{pmatrix}D & & \\ & \ddots & \\ & & D\end{pmatrix}, \qquad\text{where } D = \begin{pmatrix}\sigma_1^2 & & \\ & \ddots & \\ & & \sigma_N^2\end{pmatrix}.$$

Then, a natural estimator for $\Omega$ is

$$\hat\Omega = \begin{pmatrix}\hat D & & \\ & \ddots & \\ & & \hat D\end{pmatrix}, \qquad\text{where } \hat D = \begin{pmatrix}\hat\sigma_1^2 & & \\ & \ddots & \\ & & \hat\sigma_N^2\end{pmatrix},$$

and $\hat\sigma_i^2 = \frac{1}{T}\sum_{t=1}^{T}u_{it}^2$, because $u_{it}$ is known. Then, the GLS becomes

$$\left(\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}x_{it}x_{it}'\hat\sigma_i^{-2}\right)^{-1}\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}x_{it}y_{it}\hat\sigma_i^{-2}.$$

A key step is to prove that the effect of estimating $D$ is asymptotically negligible:

$$\frac{1}{\sqrt{NT}}\sum_{i=1}^{N}\sum_{t=1}^{T}x_{it}u_{it}(\hat\sigma_i^{-2} - \sigma_i^{-2}) = o_P(1). \qquad (7)$$

It can be shown that the problem reduces to proving

$$A \equiv \frac{1}{\sqrt{NT}}\sum_{i=1}^{N}\sum_{t=1}^{T}x_{it}u_{it}\sigma_i^{-2}\left(\frac{1}{T}\sum_{s=1}^{T}(u_{is}^2 - E u_{is}^2)\right)\sigma_i^{-2} = o_P(1). \qquad (8)$$

Under the simplified conditions of this example ($u_{it}$ is independent across both $i$ and $t$), it is straightforward to calculate $\mathrm{var}(A)$ and show that it converges to zero as $N, T \to \infty$, regardless of whether $N < T$ or not.
As for $EA$, straightforward calculations yield

$$EA = \frac{\sqrt{NT}}{T}\,\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}E\big(x_{it}E(u_{it}^3\mid x_{it})\big)\sigma_i^{-4}.$$
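To spell out the reasoning behind this expression, here is a short derivation (ours, not reproduced from the paper) of $EA$ under the example's implicit assumptions that $(x_t, u_t)$ is serially independent and $E(u_{it}\mid x_{it}) = 0$:

```latex
% Only the s = t terms contribute to EA in the simple example.
\begin{aligned}
EA &= \frac{1}{\sqrt{NT}}\,\frac{1}{T}\sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{s=1}^{T}
      \sigma_i^{-4}\,E\!\left[x_{it}\,u_{it}\,(u_{is}^{2}-Eu_{is}^{2})\right]\\
   &= \frac{1}{\sqrt{NT}}\,\frac{1}{T}\sum_{i=1}^{N}\sum_{t=1}^{T}
      \sigma_i^{-4}\,E\!\left[x_{it}\,E(u_{it}^{3}\mid x_{it})\right]
   \quad\text{(for $s\neq t$, $u_{is}$ is independent of $(x_{it},u_{it})$ and $E(u_{is}^2-Eu_{is}^2)=0$;}\\
   &\qquad\qquad\qquad\qquad\qquad\qquad\quad\;\;
      \text{at $s=t$, the centering term vanishes since $E(u_{it}\mid x_{it})=0$)}\\
   &= \frac{\sqrt{NT}}{T}\cdot\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}
      E\!\left[x_{it}\,E(u_{it}^{3}\mid x_{it})\right]\sigma_i^{-4}.
\end{aligned}
```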


Generally, if $u_{it}\mid x_{it}$ is non-Gaussian and asymmetric, then $E(u_{it}^3\mid x_{it}) \neq 0$. Hence, we require $N/T \to 0$ to have $EA \to 0$. Thus, to allow for non-Gaussian and asymmetric conditional distributions, $N = o(T)$ is required in the GLS setting.
We do not explicitly impose $N = o(T)$ in this paper as a formal assumption, but instead impose Assumption 2.4. On the one hand, when the distribution of $u_{it}$ is symmetric, we do not require $N = o(T)$, because, as shown in the above example, $E(u_{it}^3\mid x_{it}) = 0$ is sufficient for $EA \to 0$ and is satisfied by symmetric distributions. On the other hand, when $u_{it}$ is non-symmetric, Assumption 2.4 implicitly requires $N = o(T)$. Note that $N = o(T)$ is a strong assumption in many microeconomic applications of panel data models. But, as illustrated in the simple example above, if $u_{it}\mid x_{it}$ is not symmetric, it is required for feasible GLS even if $\Omega$ is diagonal. One possible approach to weakening this assumption is to remove the higher-order bias from $\hat\Omega$. Higher-order debiasing is a complicated procedure in the presence of general weak dependences; this is left for future research.

2.4 Asymptotic results of FGLS

We impose the following conditions, regulating the sparsity and serial weak dependence.

Assumption 2.1 (i) $\{u_t, x_t\}_{t\ge1}$ is strictly stationary. In addition, each $u_t$ has zero mean vector, and $\{u_t\}_{t\ge1}$ and $\{x_t\}_{t\ge1}$ are independent.
(ii) There are constants $c_1, c_2 > 0$ such that $\lambda_{\min}(\Sigma_h) > c_1$ and $\|\Sigma_h\|_1 < c_2$ for each fixed $h$.
(iii) Exponential tail: There exist $r_1, r_2 > 0$ and $b_1, b_2 > 0$ such that for any $s > 0$, $i \le N$ and $l \le d$,

$$P(|u_{it}| > s) \le \exp(-(s/b_1)^{r_1}), \qquad P(|x_{it,l}| > s) \le \exp(-(s/b_2)^{r_2}).$$

(iv) Strong mixing: There exist $\kappa \in (0,1)$ such that $r_1^{-1} + r_2^{-1} + \kappa^{-1} > 1$, and $C > 0$ such that for all $T > 0$,

$$\sup_{A\in\mathcal F^0_{-\infty},\,B\in\mathcal F^\infty_T}|P(A)P(B) - P(AB)| < \exp(-CT^{\kappa}),$$

where $\mathcal F^0_{-\infty}$ and $\mathcal F^\infty_T$ denote the $\sigma$-algebras generated by $\{(x_t, u_t): t \le 0\}$ and $\{(x_t, u_t): t \ge T\}$, respectively.

Condition (ii) requires that $\Sigma_h$ be well conditioned. Condition (iii) ensures the Bernstein-type inequality for weakly dependent data, which requires the underlying distributions to be thin-tailed. Condition (iv) is the standard $\alpha$-mixing condition, adapted to the large-$N$ panel. In addition, we impose the following regularity conditions.
Assumption 2.2 (i) There exists a constant $C > 0$ such that for all $i \le N$ and $t \le T$, $E\|x_{it}\|^4 < C$ and $Eu_{it}^4 < C$.
(ii) Define $\xi_T(L) = \max_{t\le T}\sum_{|h|>L}\|E u_t u_{t-h}'\|$. Then, $\xi_T(L) \to 0$.
(iii) Define $f_T(L) = \max_{t\le T}\sum_{|h|\le L}\|E u_t u_{t-h}'\|(1 - \omega(|h|, L))$. Then $f_T(L) \to 0$.

Assumption 2.2 allows us to prove the convergence rate of the covariance matrix estimator. Condition (ii) is an extension of the standard weak serial dependence condition to the high-dimensional case in the panel data literature. It allows us to employ the banding (Newey–West truncation) procedure. Condition (iii) is satisfied by various kernel functions for the HAC-type estimator. For the Bartlett kernel, for example,

$$\max_{t\le T}\sum_{|h|\le L}\|E u_t u_{t-h}'\|(1 - \omega(|h|, L)) \le \frac{1}{L}\max_{t\le T}\sum_{|h|=0}^{\infty}\|E u_t u_{t-h}'\|\,|h|$$

converges to zero as $L \to \infty$, as long as $\max_{t\le T}\sum_{|h|=0}^{\infty}\|E u_t u_{t-h}'\|\,|h| < \infty$.
In this paper, we assume $\Sigma_h$ to be a sparse matrix for each $h$ and impose conditions similar to those in Bickel and Levina (2008a) and Fan et al. (2013): write $\Sigma_h = (\Sigma_{h,ij})_{N\times N}$, where $\Sigma_{h,ij} = E u_{it}u_{j,t-h}$. For some $q \in [0,1)$, we define

$$m_N = \max_{|h|\le L}\max_{i\le N}\sum_{j=1}^{N}|\Sigma_{h,ij}|^q$$

as a measure of sparsity. We require that $m_N$ be either fixed or grow slowly as $N \to \infty$. In particular, when $q = 0$, $m_N = \max_{|h|\le L}\max_{i\le N}\sum_{j=1}^{N}1\{\Sigma_{h,ij} \neq 0\}$, which corresponds to the exact sparsity case.
Let

$$\gamma_T = \sqrt{\log(LN)/T}.$$

Assumption 2.3 For any $NT \times NT$ matrix $\mathcal M$, we denote by $(\mathcal M)_{ts,ij}$ the $(i,j)$th element of the $(t,s)$th block of the matrix $\mathcal M$.
(i) $\sum_{|h|>L}\|\Sigma_h\|_1 = O(L^{-\alpha})$, for a constant $\alpha > 0$.
(ii) $\max_{i\le N, t\le T}\sum_{s=1}^{T}\sum_{j=1}^{N}|(\Omega^{-1})_{ts,ij}| = O(1)$.
(iii) There is $q \in [0,1)$ such that $Lm_N\gamma_T^{1-q} = o(1)$ holds. In addition, $\sqrt{T}L^2m_N^2\gamma_T^{3-2q} = o(1)$ and $\sqrt{NT}L^3m_N^3\gamma_T^{3-3q} = o(1)$.
(iv) $\sqrt{NT}(\xi_T(L) + f_T(L))^3 = o(1)$ and $L^{-\alpha}\sqrt{NT}\,m_N\gamma_T^{1-q} = o(1)$.

Conditions (i)-(ii) require weak cross-sectional correlations. Condition (iii) contains the sparsity assumptions on the growth of $m_N$, associated with $q$ and the rate of $L$.
Remark 2.2 To understand Assumption 2.3, consider a simple case where $E u_{it}u_{j,t-h}$ is nonzero for only finitely many pairs $i \neq j$. This corresponds to $q = 0$ and $m_N = O(1)$. Then condition (iii) requires

$$\sqrt{N}L^3\log^{3/2}(LN) = o(T).$$

In practice, the bandwidth $L$ and $\log(LN)$ both grow very slowly compared to $N$ and $T$, so essentially this condition requires $N = o(T^2)$. In addition, condition (iv) assumes that the autocorrelations decay sufficiently fast as $L \to \infty$. Suppose both $\xi_T(L)$ and $f_T(L)$ decay at a polynomial rate in $L$ (e.g., of order $L^{-c_0}$); then this condition requires that the order of the polynomial, $c_0$, be sufficiently large.
Under all of the above conditions, we show in the appendix the convergence of $\hat\Omega - \Omega$, which leads to the following proposition.
Proposition 2.1 Under Assumptions 2.1-2.2, for $q \in [0,1)$ and $\alpha > 0$ such that Assumption 2.3 holds,

$$\sqrt{NT}(\hat\beta_{FGLS} - \beta) = \Gamma^{-1}\frac{1}{\sqrt{NT}}X'\Omega^{-1}U + \Gamma^{-1}\frac{1}{\sqrt{NT}}X'(\hat\Omega^{-1} - \Omega^{-1})U + o_P(1),$$

where $\Gamma = E(X'\Omega^{-1}X/NT)$.

As we see from the above proposition, the effect of $\hat\Omega - \Omega$ also appears as a "weighted average" in the second term on the right-hand side of the expansion. The negligibility of this term relies on the following high-level condition. We define $W = \Omega^{-1}X$ and $\varepsilon = \Omega^{-1}U$. Then $W = (w_1', \cdots, w_T')'$ with $w_t$ being an $N \times d$ matrix of $w_{it}$, and $\varepsilon_{it}$ is defined similarly.
Assumption 2.4 Let $A^b_h = \{(i,j): |E u_{it}u_{j,t-h}| \neq 0\}$. Then,

$$\Bigg|\frac{1}{\sqrt{NT}}\sum_{h=0}^{L}\sum_{i,j\in A^b_h}\mathcal G^1_{T,ij}(h)\,\mathcal G^2_{T,ij}(h)\Bigg| = o_P(1), \qquad (9)$$

where $\mathcal G^1_{T,ij}(h) = \frac{1}{\sqrt T}\sum_{t=h+1}^{T}(u_{it}u_{j,t-h} - E u_{it}u_{j,t-h})$ and $\mathcal G^2_{T,ij}(h) = \frac{1}{\sqrt T}\sum_{t=h+1}^{T}w_{it}\varepsilon_{j,t-h}$.
Remark 2.3 While it is difficult to verify the above high-level condition in the presence of serial dependence, cross-sectional dependence, or both, the intuition can be understood in the simple i.i.d. case. Suppose $u_{it}$ is independent across both $i$ and $t$. Then we can set $L = 0$ and this condition becomes

$$A \equiv \frac{1}{\sqrt{NT}}\sum_{i=1}^{N}\frac{1}{T}\sum_{t=1}^{T}(u_{it}^2 - E u_{it}^2)\sum_{s=1}^{T}x_{is}u_{is}\sigma_i^{-4} = o_P(1),$$

which is (8) as discussed in Sect. 2.3. As discussed there, while it is straightforward to see that $\mathrm{var}(A) = o(1)$, proving $EA = o(1)$ requires either that $u_{it}$ has a symmetric distribution so that $Eu_{it}^3 = 0$, or that $N = o(T)$ for asymmetric distributions. Similar conditions were required for high-dimensional GLS problems, for instance, by Bai and Liao (2017) in panel data models with interactive effects.


Then, we have the following limiting distribution.

Theorem 2.1 Suppose $\mathrm{var}(U\mid X) = \mathrm{var}(U) = \Omega$. Under Assumptions 2.1-2.4, for $q \in [0,1)$ and $\alpha > 0$ such that Assumption 2.3 holds, as $N, T \to \infty$,

$$\sqrt{NT}(\hat\beta_{FGLS} - \beta)\ \xrightarrow{d}\ N(0, \Gamma^{-1}),$$

where $\Gamma = \lim_{N,T\to\infty}E(X'\Omega^{-1}X/NT)$, assumed to exist. A consistent estimator of $\Gamma$ is $\hat\Gamma = X'\hat\Omega^{-1}X/NT$.

The asymptotic variance of the FGLS estimator is $\mathrm{Avar}(\hat\beta_{FGLS}) = \Gamma^{-1}/NT$, and an estimator of it is $(X'\hat\Omega^{-1}X)^{-1}$. Asymptotic standard errors can be obtained in the usual fashion from the asymptotic variance estimates.

3 Empirical study: Effects of divorce law reforms on divorce rates

In the literature, the cause of the sharp increase in the US divorce rate in the 1960s-1970s is an important research question. During the 1970s, more than half of the states in the US liberalized their divorce systems, and the effects of the reforms on divorce rates have been investigated by many authors, such as Allen (1992) and Peters (1986). With controls for state and year fixed effects, Friedberg (1998) suggested that state law reforms significantly increased divorce rates. She also assumed that unilateral divorce laws affected divorce rates permanently. However, according to empirical evidence, divorce rates have been decreasing since 1975. Therefore, the question arises of whether the law reforms also contributed to the decrease in divorce rates. Wolfers (2006) revisited this question using a treatment-effect panel data model and identified only temporary effects of the reforms on divorce rates. In particular, he used dummy variables for the first two years after the reforms, years 3-4, years 5-6, and so on. More specifically, the following fixed-effect panel data model was considered:

$$y_{it} = \alpha_i + \mu_t + \sum_{k=1}^{8}\beta_k X_{it,k} + \delta_i t + u_{it}, \qquad (10)$$

where $y_{it}$ is the divorce rate for state $i$ and year $t$, $\alpha_i$ a state fixed effect, $\mu_t$ a year fixed effect, and $\delta_i t$ a linear time trend with unknown coefficient $\delta_i$. $X_{it,k}$ is a binary regressor for the $k$th post-reform period, so that $\beta_k$ captures the treatment effect roughly $2k$ years after the reform. Wolfers (2006) suggested that "the divorce rate rose sharply following the adoption of unilateral divorce laws, but this rise was reversed within about a decade." He also concluded that "15 years after reform the divorce rate is lower as a result of the adoption of unilateral divorce, although it is hard to draw any strong conclusions about long-run effects."
Both Friedberg (1998) and Wolfers (2006) used a weighted model obtained by multiplying all variables by the square root of the state population. In addition, they used conventional OLS standard errors, which do not take into account heteroskedasticity or serial and cross-sectional correlations. However, standard errors can be biased when these correlations are ignored. Therefore, we re-estimate the model of Wolfers (2006) using the proposed FGLS method and OLS with the heteroskedasticity-robust standard errors of White (1980), the clustered standard errors of Arellano (1987), and the robust standard errors of Bai et al. (2019).
The same dataset as in Wolfers (2006) is used, which includes the divorce rate, state-level reform years, the binary regressors, and state population. Due to missing observations around the divorce law reforms, we exclude Indiana, New Mexico and Louisiana. As a result, we obtain a balanced panel from 1956 to 1988 for 48 states. We fit the models both with and without linear time trends and use OLS and FGLS in each model to estimate $\beta$.
The choice of the tuning parameters for implementing the FGLS follows the guidance in Table 1. Specifically, we set the bandwidth to $L = 3$, as suggested by Newey and West (1994) ($L = 4(T/100)^{2/9}$). The thresholding values are chosen by the cross-validation method discussed in Sect. 2.2.2; specifically, $M = 1.8$ and $M = 1.9$ for the models with and without linear time trends, respectively. The Bartlett kernel is used for both the OLS robust standard errors and the FGLS estimation.
Model (10) is in fact more complicated than the model we formally study in this paper, due to the inclusion of linear time trends and fixed effects. While the theoretical study of models with trends might be challenging in the high-dimensional GLS setting, it is straightforward to implement them in the same FGLS framework by applying a projection transformation to eliminate the time trend. Specifically, let $\Lambda = (1, 2, \ldots, T)'$ and $P_\Lambda = I_T - \Lambda(\Lambda'\Lambda)^{-1}\Lambda'$. We can define $\tilde Y_i = P_\Lambda(y_{i1}, \ldots, y_{iT})'$ and $\tilde X_i = P_\Lambda(X_{i1}, \ldots, X_{iT})'$, and then define $\ddot y_{it}$ and $\ddot X_{it}$ accordingly from $\tilde y_{it}$ and $\tilde X_{it}$ by further removing the fixed effects.
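A sketch of this transformation in Python (our own code and naming; it omits the state-population weighting used in the actual estimation):

```python
import numpy as np

def detrend_and_demean(panel):
    """panel: T x N matrix of one variable (rows = years, columns = states).
    First project out a state-specific linear trend with P = I_T - Lam (Lam'Lam)^{-1} Lam',
    then remove state and year means to eliminate the additive fixed effects alpha_i and mu_t."""
    T = panel.shape[0]
    Lam = np.arange(1, T + 1, dtype=float).reshape(-1, 1)     # Lambda = (1, 2, ..., T)'
    P = np.eye(T) - Lam @ np.linalg.inv(Lam.T @ Lam) @ Lam.T
    out = P @ panel                                           # state-specific linear trend removed
    out = out - out.mean(axis=0, keepdims=True)               # remove state means (alpha_i)
    out = out - out.mean(axis=1, keepdims=True)               # remove year means (mu_t)
    return out
```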
The estimates of $\beta_1, \cdots, \beta_8$ with and without linear time trends, together with their standard errors, are summarized in Table 2 below. The OLS and FGLS estimates in both models are similar to each other. The results show that divorce rates rose soon after the law reform. However, within a decade, divorce rates had fallen over time. Interestingly, FGLS confirms the negative effects of the law reforms on divorce rates, specifically, 11-15+ years after the reform in the model with state-specific linear time trends, and 9-15+ years after the reform in the model without state-specific linear time trends. In addition, the FGLS estimates for 1-6 and 1-4 years are positive and statistically significant in the models with and without linear time trends, respectively. For OLS, the coefficient estimates for 3-4 and 7-15+ years are significant in the model without linear time trends based on $se_{BCL}$. In contrast, the OLS estimates are statistically significant only for 1-4 years when a linear time trend is added. According to the clustered standard errors $se_{CX}$, only 11-15+ years are statistically significant in the model without trends.
Based on the OLS and FGLS estimation results with and without a linear time trend, we draw the following conclusion: in the first 8 years, the overall trend of the divorce rate is increasing, but the law reform reduces the divorce rate after years 3-4. However, 8 years after the reform, we observe that the law reform has a negative effect on the divorce rate. Finally, we also note a noticeable difference between the magnitudes of the OLS and FGLS estimates. While this difference may be due to the small sample, the magnitudes of these estimates are mostly within the 95% confidence interval of the other estimator.

Table 2  Empirical application: effects of divorce law reform with state and year fixed effects. US state-level annual data from 1956 to 1988; the dependent variable is the divorce rate per 1000 persons per year

  Effects        β̂_OLS     se_W      se_CX     se_BCL    β̂_FGLS    se_FGLS

  Panel A: Without state-specific linear time trends
  1-2 years       0.256     0.140     0.189     0.148      0.133     0.046*
  3-4 years       0.209     0.081*    0.159     0.089*     0.165     0.056*
  5-6 years       0.126     0.073     0.168     0.069      0.100     0.059
  7-8 years       0.105     0.070     0.165     0.040*     0.026     0.061
  9-10 years     -0.122     0.060*    0.161     0.054*    -0.129     0.061*
  11-12 years    -0.344     0.071*    0.173*    0.075*    -0.253     0.062*
  13-14 years    -0.496     0.074*    0.188*    0.062*    -0.324     0.063*
  15+ years      -0.508     0.089*    0.223*    0.077*    -0.325     0.067*

  Panel B: With state-specific linear time trends
  1-2 years       0.286     0.152     0.206     0.140*     0.171     0.044*
  3-4 years       0.254     0.099*    0.171     0.126*     0.220     0.058*
  5-6 years       0.186     0.102     0.206     0.143      0.175     0.067*
  7-8 years       0.177     0.109     0.230     0.146      0.097     0.075
  9-10 years     -0.037     0.111     0.241     0.154     -0.073     0.082
  11-12 years    -0.247     0.128     0.268     0.183     -0.240     0.089*
  13-14 years    -0.386     0.137*    0.295     0.209     -0.329     0.098*
  15+ years      -0.414     0.158*    0.337     0.243     -0.382     0.108*

Note: Standard errors with asterisks indicate significance at the 5% level using N(0,1) critical values. For OLS standard errors, se_W and se_CX refer to the heteroskedasticity-robust standard errors of White (1980) and the clustered standard errors of Arellano (1987), respectively; se_BCL is the robust standard error suggested by Bai et al. (2019). The threshold values for FGLS chosen by cross-validation are M = 1.9 and M = 1.8 for Panel A and Panel B, respectively. OLS and FGLS estimates and standard errors use state population weights.

For instance, $\hat\beta_{FGLS} = 0.133$ for the effect of 1-2 years, which is within the 95% confidence interval constructed from $\hat\beta_{OLS}$; the latter is $[-0.0184, 0.5304]$. As another example, $\hat\beta_{FGLS} = 0.165$ for the effect of 3-4 years, which is within the 95% confidence interval constructed from $\hat\beta_{OLS}$; the latter is $[0.050, 0.367]$. We note that these confidence intervals are relatively wide, a consequence of the relatively small sample size in this study.
Overall, the FGLS results are consistent with Wolfers (2006). The FGLS confirms that the law reforms significantly contributed to the subsequent decrease in divorce rates, specifically, 9-15 years after the reform in the model without linear time trends, and 11-15 years after in the model with linear time trends. Although Wolfers (2006) de-emphasized the negative coefficients at the end of the period, as they are not robust to the inclusion of state-specific quadratic trends (which we do not employ in this paper), we nevertheless interpret the economic insight of these results as two sides of the same treatment, the law reforms: after the earlier dissolution of bad matches following the reforms, marital relations were gradually affected and changed.


4 Conclusions

This paper considers generalized least squares (GLS) estimation for linear panel data
models. By estimating the large error covariance matrix consistently, the proposed
feasible GLS estimator is more efficient than the ordinary least squares (OLS) in the
presence of heteroskedasticity, serial and cross-sectional correlations. The covariance
matrix used for the feasible GLS is estimated via the banding and thresholding method.
We establish the limiting distribution of the proposed estimator. A Monte Carlo study examines its finite-sample performance, and the method is applied in an empirical study of the effects of divorce law reforms on US divorce rates.

References
Abadie A, Athey S, Imbens GW, Wooldridge J (2017) When should you adjust standard errors for clustering? National Bureau of Economic Research Working Paper No 24003
Allen DW (1992) Marriage and divorce: comment. Am Econ Rev 82(3):679–685
Andrews DW (1991) Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econ
J Econ Soc 59(3):817–858
Arellano M (1987) Computing Robust standard errors for within-groups estimators. Oxford Bull Econ Stat
49(4):431–434
Bai J, Choi SH, Liao Y (2019) Standard Errors for Panel Data Models with Unknown Clusters.
arXiv:1910.07406
Bai J, Liao Y (2017) Inferences in panel data with interactive effects using large covariance matrices. J
Econ 200(1):59–78
Bai J, Ng S (2017) Principal components and regularized estimation of factor models. arXiv:1708.08137
Bickel PJ, Levina E (2008a) Covariance regularization by thresholding. Ann Stat 36(6):2577–2604
Bickel PJ, Levina E (2008b) Regularized estimation of large covariance matrices. Ann Stat 36(1):199–227
Cameron AC, Miller DL (2015) A practitioner’s guide to cluster-robust inference. J Human Resour
50(2):317–372
Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W (2017) Double/debiased/Neyman machine learning of treatment effects. Am Econ Rev 107(5):261–65
Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey WK (2016) Double machine
learning for treatment and causal parameters. Discussion paper, cemmap working paper, Centre for
Microdata Methods and Practice
Driscoll JC, Kraay AC (1998) Consistent covariance matrix estimation with spatially dependent panel data.
Rev Econ Stat 80(4):549–560
Fan J, Liao Y, Mincheva M (2013) Large covariance estimation by thresholding principal orthogonal complements. J R Stat Soc Ser B 75(4):603–680
Friedberg L (1998) Did unilateral divorce raise divorce rates? Evidence from panel data. Am Econ Rev
88(3):608–627
Hansen CB (2007a) Asymptotic properties of a robust variance matrix estimator for panel data when T is
large. J Econ 141(2):597–620
Hansen CB (2007b) Generalized least squares inference in panel and multilevel models with serial correlation and fixed effects. J Econ 140(2):670–694
Hansen LP (1982) Large sample properties of generalized method of moments estimators. Econ J Econ
Soc, pp 1029–1054
Liang K-Y, Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika
73(1):13–22
Miller S, Startz R (2018) Feasible generalized least squares using machine learning. Available at SSRN 2966194
Newey WK (1990) Efficient instrumental variables estimation of nonlinear models. Econ J Econ Soc, pp
809–837
Newey WK, McFadden D (1994) Large sample estimation and hypothesis testing. Handbook Econ 4:2111–
2245


Newey WK, West KD (1987) A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econ J Econ Soc 55:703–708
Newey WK, West KD (1994) Automatic lag selection in covariance matrix estimation. Rev Econ Stud
61(4):631–653
Peters HE (1986) Marriage and divorce: Informational constraints and private contracting. Am Econ Rev
76(3):437–454
Petersen MA (2009) Estimating standard errors in finance panel data sets: Comparing approaches. Rev
Financ Stud 22(1):435–480
Romano JP, Wolf M (2017) Resurrecting weighted least squares. J Econ 197(1):1–19
Vogelsang TJ (2012) Heteroskedasticity, autocorrelation, and spatial correlation robust inference in linear
panel models with fixed-effects. J Econ 166(2):303–319
Wager S, Athey S (2018) Estimation and inference of heterogeneous treatment effects using random forests.
J Am Stat Assoc 113(523):1228–1242
White H (1980) A heteroskedasticity-consistent covariance matrix estimator and a direct test for het-
eroskedasticity. Econ J Econ Soc, pp 817–838
Wolfers J (2006) Did unilateral divorce laws raise divorce rates? A reconciliation and new results. Am Econ
Rev 96(5):1802–1820
Wooldridge JM (2010) Econometric analysis of cross section and panel data. MIT press, Cambridge

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.

