
Instrumented Principal Component Analysis*

Bryan Kelly Seth Pruitt Yinan Su

Yale, AQR, NBER Arizona State University Johns Hopkins University

December 17, 2020

Abstract

We propose a new approach to latent factor analysis that, in addition to the main panel of interest, introduces other relevant data that serve as instruments for dynamic factor loadings. The method, called IPCA, provides a parsimonious means of incorporating vast conditioning information into factor model estimates. This improves the efficiency of estimates for the latent factors and their loadings, and helps to ascertain the economic relationships among factors and individuals via the observable instruments. The estimation is fast to calculate and accommodates unbalanced panels. We show consistency and asymptotic normality under general panel data generating processes. We demonstrate the advantages of IPCA in simulated data and in applications to equity asset pricing and international macroeconomics.

Keywords: IPCA, Latent Factors, Factor Loading Modeling, Instruments, Identification, Large Panel
* Kelly (corresponding author): bryan.kelly@yale.edu; Yale School of Management, 165 Whitney
Ave., New Haven, CT 06511; (p) 203-432-2221; (f) 203-436-9604. Pruitt: seth.pruitt@asu.edu; W.P.
Carey School of Business, 400 E Lemon St., Tempe, AZ 85287. Su: ys@jhu.edu; Carey Business
School, 100 International Drive, Baltimore, MD 21202. We are grateful to Lars Hansen, Dacheng Xiu,
Jonathan Wright, Yuehao Bai, and Federico Bandi as well as seminar and conference participants
at ASU, Duke, and Montreal for helpful comments. We thank Rongchen Li for excellent research
assistance.

1 Introduction
Latent factor analysis is an empirical workhorse in economics and finance. Its essence
is parsimony: a small number of common factors drive the variation of a large cross
section (Geweke, 1977; Sargent and Sims, 1977). A large literature has moved be-
yond using factor analysis merely for exploratory dimension-reduction and data pre-
processing, to using it as an estimator for economic models that link hundreds, thou-
sands, or millions of individual variables to a few aggregate economic forces. At the
heart of these models are factor loadings—they quantify economic relationships as
individuals’ sensitivities to the aggregate factors.
High dimensional environments present an econometric challenge because the
number of factor loadings to be estimated is increasing with the size of the cross
section. The literature studies this issue predominantly in the case of static factor
loadings (e.g., Bai, 2003; Stock and Watson, 2002). But the estimation challenges
are magnified when factor loadings are time-varying, as called for in many important
economic applications. In this case, the number of factor loadings may even exceed
the size of the data panel.1
We propose instrumented principal component analysis, or IPCA, to overcome
the challenges of recovering accurate estimates of latent factor models when panel
dimensions are large and/or when loadings are time-varying. Our solution draws on
(potentially vast) conditioning data, beyond that in the main panel of interest, to
instrument for the factor loadings. This makes it possible to simultaneously bring
more data to bear on the statistical model while introducing structure among factor
loadings that reduces the model’s degrees of freedom. In doing so, it also significantly
expands the economic content of factor models.
For concreteness, let t = 1, . . . , T index time and i = 1, . . . , N index individuals.
The IPCA model is

xi,t = βi,t ft + µi,t , (1)


βi,t = ci,t Γ + ηi,t . (2)
1 Existing solutions for handling dynamic factor loadings rely on serial dependence of the loadings. Recent advances in state-space models view factor loadings as latent dynamic processes themselves (Primiceri, 2005; Pruitt, 2012; Del Negro and Otrok, 2008, among others). Su and Wang (2017) achieve smoothly varying loadings by repeatedly estimating static ones in moving kernels. But this does not relieve the difficulty associated with the need to estimate separate loadings for each individual in a large cross section.

Equation (1) is a generic factor model in which xi,t is scalar panel data, ft and βi,t
are the factors and their loadings (K × 1 and 1 × K vectors, respectively, where K
is the number of factors), and µi,t is the idiosyncratic error. Equation (2) is the core
of IPCA and links information in an L-dimensional vector of instrumental variables,
ci,t , with the factor loadings βi,t . The key restriction giving content to IPCA is that
the mapping from instruments to loadings is fixed over time and across individuals.
The mapping is linearly parameterized by the L × K matrix Γ, which is the primary
target of estimation along with the latent factors ft .2
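To make the structure of equations (1) and (2) concrete, the following minimal sketch simulates a balanced IPCA panel in Python/NumPy. The dimensions, distributions, and variable names are our own illustrative assumptions, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, L, K = 200, 100, 6, 2            # panel and model dimensions (illustrative)

Gamma = rng.normal(size=(L, K))        # true L x K mapping from instruments to loadings
f = rng.normal(size=(T, K))            # latent factors f_t
c = rng.normal(size=(N, T, L))         # instruments c_{i,t}

x = np.empty((N, T))                   # panel of interest x_{i,t}
for t in range(T):
    beta_t = c[:, t, :] @ Gamma + 0.1 * rng.normal(size=(N, K))  # eq. (2): beta = c Gamma + eta
    x[:, t] = beta_t @ f[t] + rng.normal(size=N)                 # eq. (1): x = beta f + mu
```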
With the advent of “big data,” a wealth of new information is increasingly available
for economic analysis. Drawing on this wealth, IPCA leverages instrumental condi-
tioning data ci,t , which can vary over time and across individuals, to better hone in
on the latent factor structure in xi,t . Economic theory often suggests that individu-
als possess ancillary observable attributes that convey information about their factor
loadings. For example, Fama and French (1993) note that firm-level observable char-
acteristics “such as size and book-to-market equity must proxy for sensitivity to . . .
risk factors in returns”—Kelly et al. (2019) (henceforth KPS) use IPCA to formalize
this idea and estimate a conditional asset-pricing model with an unprecedented de-
gree of empirical success in equity returns. As another example, option “Greeks” are
sensitivities to underlying risk factors and are mechanically linked to dynamic quantities
such as the option’s moneyness and maturity (Büchner and Kelly, 2020), which are
generally observable aspects of the option contract and the option’s traded price.
As a third example, some macroeconomic models describe inter-firm
trade as interaction of nodes on a network. As a firm’s centrality increases or de-
creases it becomes more or less integral to overall economic activity, thus implying
dynamic loadings on an aggregate growth factor (Acemoglu and Azar, 2020).
The common idea underlying each of these examples is that the individual’s (op-
tion; stock; firm) constant identity becomes irrelevant once we condition on the ap-
propriate attributes of the individual. These attributes (moneyness; book-to-market
ratio; network centrality) that dictate the individual’s exposure to aggregate fluctu-
ations are often observable and thus ripe for harvesting in the estimation of latent
factor models. IPCA operationalizes this logic through equation (2). The matrix Γ
2 The error term ηi,t allows for unobservable behavior of βi,t on top of what observable instruments
can capture. An orthogonality condition between the instruments and errors, similar to the exclusion
restriction in the method of instrumental variable regression, is necessary to achieve consistent model
estimates.

determines how each of the L attributes maps into each of the K factor loadings.
This is a constraint that IPCA imposes on the factor loadings in the generic factor
model (1) by requiring that the Γ mapping applies uniformly for all individuals in all
time periods.3
Despite the fact that IPCA brings more data to bear on the factor model, the
constraint in equation (2) renders IPCA an especially parsimonious factor model. The
philosophy of IPCA is that individual loadings are an unnecessary excess, requiring
a divergent count of N × K parameters in the commonly studied large cross section
limit. IPCA displaces the need to estimate individual-specific loading parameters,
instead requiring only an understanding of how individuals’ attributes map into their
factor loadings, summarized by the L × K parameters in Γ, whose size does not
increase with either of the panel’s dimensions. Accordingly, our theoretical analyses
consider large panel asymptotics with N, T approaching infinity while holding L, K
constant. The ability to assimilate large panel data in a factor model with fixed
parameter dimensions means that IPCA’s Γ estimator converges √N times faster
the individual static loading estimates derived from principal components analysis
(PCA).4
At the same time, the constraint that IPCA imposes on factor loadings brings
new economic content to factor modeling. It probes the economic content of a factor
model by estimating which attributes in ci,t are important for capturing the factor
structure in xi,t . By evaluating the bindingness of the constraints imposed on (1), IPCA pro-
vides new tests of the economic determinants of individuals’ sensitivities to aggregate
shocks, which open a broad avenue of economic discovery. In summary, two directions
of improvements—expanded input data and decreased model freedom—combine to
form the flexible yet robust estimator of IPCA with deeper economic underpinnings
than traditionally ascertained by statistical factor models.

Estimation In the following analysis, we often plug βi,t in equation (2) into equation
(1) and combine their two errors into a new compound error term ei,t , yielding an
3 This logic shares an analogy with state-space models, which impose constraints on the serial dependence of βi,t ; the main difference is that in the typical state space model the dynamics of βi,t are inferred only from the main panel data, xi,t .
4 PCA β’s convergence rate is √T according to Bai and Ng (2002); Bai (2003). IPCA Γ’s convergence rate is √(NT) per Theorem 3.

equivalent representation of the IPCA model:

xi,t = ci,t Γft + ei,t , ei,t := ηi,t ft + µi,t . (3)

IPCA is estimated as a least squares problem. It minimizes the sample sum of squared
errors (compound errors ei,t ) over parameters Γ and {ft } jointly. This least squares
estimation is inspired by PCA, which minimizes the sum of squared errors over {ft }
and static loading parameters {βi }.
The optimization does not admit an analytical solution in general, but is speed-
ily solved numerically by an alternating least squares (ALS) algorithm. It iterates
between minimizing over Γ while holding {ft } fixed, and minimizing over {ft } while
holding Γ fixed, until convergence. Importantly, the two partial optimization sub-
problems are simple linear regressions. This has a number of practical benefits. For
example, the procedure converges quickly and without the need for complicated nu-
merical algorithms. And IPCA handles unbalanced panels as easily as pooled panel
OLS, which is a great advantage in many applications.
Expanding the least squares problem to include nested constraints yields flexible
tests of economic hypotheses. For example, restricting the lth row of Γ to be all
zeros aligns with a test for the marginal contribution of the lth instrument to overall
model performance. As another example, restricting one of the K factors to be some
observable macroeconomic time series can be used to test the series’ relevance for
modeling covariation within the panel. In the context of cross-sectional asset pricing,
another example restricts one of the factors to be a constant in order to test whether
the factor space admits arbitrage opportunities.5

Asymptotic Results We show consistency as well as derive the rates of convergence
and the limiting distributions of the Γ and ft estimates, within the framework of
a large panel wherein N, T simultaneously approach infinity. Notably, Γ converges
at the rate of √(NT). As a comparison, PCA loadings’ (β) rate is √T since each relies on
an individual’s time-series information. IPCA additionally picks up cross-sectional
information to estimate Γ—the benefit of bringing the wealth of instrumental infor-
mation into the problem. Meanwhile, the estimation of ft achieves a convergence rate
of √N, the same as PCA. This is because estimating ft relies on cross-sectional linear
5 These tests are demonstrated in KPS.

regressions, which cannot be more accurate than √N.
The asymptotic results are connected to the panel literature if we view IPCA
(Eq. 3) as a panel model with time-fixed effects ft and a structural parameter Γ.
Gagliardini and Gourieroux (2014) study maximum likelihood estimators of general
non-linear panel models with time-fixed effects, and arrive at the same convergence
rates. However, in our linear factor model, asymptotics are built from the moment
condition of the orthogonality between instruments and errors, rather than distribu-
tional assumptions.

Rotational Identification Rotational unidentification is a well-known issue in fac-


tor analysis. In the context of IPCA, the mapping matrix Γ and the factors {ft } can-
not be separately identified without further assumptions because a parameter pair
Γ, {ft } and any of its “rotations” ΓR, {R−1 ft } generate identical sample fits.
It is well understood that the choice of identifying assumption, or “normaliza-
tion,” is not unique, though little work has investigated the effects of this choice on
asymptotic properties of factor model estimators. We provide a new analysis that
explicitly describes how the normalization choice affects stochastic rates of conver-
gence and asymptotic distributions. This includes an asymptotic decomposition of
parameter estimation error into a component arising from the estimation problem
absent a normalization, and a separate component purely attributable to the choice
of normalization. This decomposition can be applied not only to IPCA, but to latent
factor estimators more generally, including PCA.6 Based on this derivation, we show
that the asymptotic results for IPCA discussed above hold under any valid identifying
assumption.

Outline The paper proceeds as follows. Section 2 introduces the notion of a stochas-
tic panel and uses that to describe the generating process of panel data considered
by the paper. In Section 3, we discuss estimation and analyze choices of identifying
normalization. Based on these preparations, Section 4 proves the consistency of Γ es-
timation, Section 5 presents the asymptotic distributions of Γ estimation errors, and
Section 6 contains the asymptotics of the factor estimation. Section 7 examines the
small sample properties of IPCA estimation with Monte Carlo simulations. Section 8
6 Bai and Ng (2013) work out the asymptotics of PCA for a few specific normalizations other than the one in Bai (2003), while we develop a unified analysis of IPCA that is general to any normalization choice.

applies IPCA to the empirical analyses of international macroeconomics and returns
of newly listed stocks. Section 9 concludes.

2 The Data Generating Process


2.1 Stochastic Panels
We construct what we call “stochastic panels” to model the generating process of two-
dimensional data, which for concreteness we refer to as time series and cross-sectional
dimensions, respectively. This extends the concept of a stochastic process defined
as a transformation that traces out a sequence of sample points, to two recombining
transformations that trace out a rectangular lattice of sample points.7 The notion
of a stochastic panel helps formalize the genesis of objects that are conventionally
expressed with single or double subscripts. For example, variables with a single
t subscript can denote common or aggregate realizations that contain population
information about the current cross section. This is the collection of events that
is invariant to the “cross-sectional” transformation. Likewise, a single i subscript
would imply invariance to the time series transformation, while a double subscript
i, t implies dependence on transformations in both dimensions.
A stochastic panel is defined as a random variable (vector) X, on probability
space {Ω, F, P r}, with two transformations S[d] : Ω → Ω on two different directions
[d] ∈ {[cs], [ts]}, satisfying the following conditions:

SP.1 Measurable: Both transformations are F-measurable, i.e., $\forall \Lambda \in \mathcal{F},\ S_{[d]}^{-1}(\Lambda) \in \mathcal{F}$.

SP.2 Recombining: $S_{[d]} \circ S_{[-d]} = S_{[-d]} \circ S_{[d]}$.8

SP.3 Measure-preserving: $\forall \Lambda \in \mathcal{F},\ Pr\big(S_{[d]}^{-1}(\Lambda)\big) = Pr(\Lambda)$, for either d.

SP.4 Jointly ergodic: Any event invariant to both transformations has a probability
of either zero or one, i.e., $\forall \Lambda$ s.t. $\Lambda = S_{[d]}^{-1}(\Lambda),\ \forall d$, $Pr(\Lambda) = 0$ or $1$.

7 For stochastic processes defined with a transformation on the sample space, see for example Hansen and Sargent (2013). This fundamental construction is similar to the one in Gagliardini et al. (2016). They work with a sample space for the time-series process and a continuum to represent individuals. Our sample space can be seen as the Cartesian product of these two sets, and we use two transformations to describe the sampling scheme from a single probability space.
8 [−d] means the other direction with respect to [d].

Figure 1: Sample Stochastic Panel Generated by Lattice Points

[Figure: a 3 × 3 lattice of sample points generated from ω by the transformations S[cs] (horizontal) and S[ts] (vertical).]

Notes: The horizontal blue arrows are S[cs], vertical blue arrows are S[ts]. The random variables evaluated at the nine lattice points starting from ω constitute a sample stochastic panel with N = 3, T = 3. The square blocks represent partitions in F, the columns of blocks are F[ts]’s partitions, and the row blocks are F[cs]’s partitions.

SP.5 Cross-sectionally exchangeable: The sequence of random variables $\{X \circ S_{[cs]}^{\,i-1},\ i = 1, 2, \dots\}$ is exchangeable.

Condition SP.1 is the familiar measurability condition from the definition of


stochastic processes except applied in two directions. The interpretation of S[ts] (ω)
is “the next period”, and the interpretation of S[cs] (ω) is “the next individual” to be
sampled, as illustrated in Figure 1.
Condition SP.2 guarantees that the second individual’s next period is the same as
the next period’s second individual. Inductively, starting from an ω ∈ Ω, following
the two transformations N and T times, respectively, one can trace out an N ×
T rectangular lattice of sample points (again see Figure 1). The set of random
variables evaluated at the lattice of sample points defines the sample stochastic panel
conventionally represented with double subscripts:

$$X_{i,t} := X \circ S_{[cs]}^{\,i-1} \circ S_{[ts]}^{\,t-1}, \quad \forall i = 1, \cdots, N,\ t = 1, \cdots, T.$$

Furthermore, let the sub-σ-algebra F[d] ⊆ F be the collection of events invariant under
S[−d], the transformation of the other direction. In Figure 1, imagine F is generated
by the square partitions; then F[ts] and F[cs] consist of the column blocks and row
blocks, respectively.
Let $X^{[d]}$ be an F[d]-measurable random variable; it is straightforward to see that
$X^{[d]} = X^{[d]} \circ S_{[-d]}$. That is to say, $X^{[ts]}_{i,t}$ is constant for i = 1, 2, · · ·, so the i subscript

is redundant and dropped. This provides a definition of random variables that are
represented using only a single subscript.

$$X_t^{[ts]} := X^{[ts]} \circ S_{[ts]}^{\,t-1}, \quad t = 1, \cdots, T, \qquad\qquad X_i^{[cs]} := X^{[cs]} \circ S_{[cs]}^{\,i-1}, \quad i = 1, \cdots, N.$$

The t-subscript random variables are F[ts] -measurable. They represent common or ag-
gregate realization and contain distributional information about the “current” cross-
sectional population. The factor process ft is the main example. Symmetrically,
F[cs] -measurable i-subscript variables are about individual i’s static characteristics. A
static factor loading βi would be of this sort.
Condition SP.3 implies that each direction itself defines a stationary stochastic pro-
cess in the traditional sense. Stationary stochastic processes admit the one-directional
law of large numbers (LLN).9

Lemma 1 (One-directional LLN). Under Conditions SP.1–3 and if $E\|X\|^2 < \infty$, then
$$\frac{1}{N}\sum_{i=1}^{N} X_{i,t} \xrightarrow{L^2} E\big[X_{\cdot,t} \mid \mathcal{F}_{[ts]}\big], \ \forall t, \qquad \text{and} \qquad \frac{1}{T}\sum_{t=1}^{T} X_{i,t} \xrightarrow{L^2} E\big[X_{i,\cdot} \mid \mathcal{F}_{[cs]}\big], \ \forall i.$$

Notice the right-hand sides are F[ts] or F[cs] -measurable. That means, for example, the
cross-sectional average converges to a time-specific aggregate, which can be written
with a single-t subscript. This allows Xi,t to be non-ergodic in either direction. For
example, if ω and the next individual S[cs] (ω) are in different S[ts] -invariant events
(different row blocks in Figure 1), X1,t will not repeat the course of events of X2,t ,
no matter how long time lasts. We intentionally leave this possibility as a realistic
feature of panel data. For example, for the application in Section 8.1, although all
countries’ import/export shares are varying over time, some countries are inherently
more trade-reliant than others. As a result, their time-series averages converge to
different limits given by the expectations conditional on country identity (F[cs] ).
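As a purely illustrative numerical check of this point (our own construction, not from the paper), the small simulation below builds a panel with an individual-specific component: each time-series average converges to its own individual-specific limit, in the spirit of Lemma 1, while the full panel average stays close to the unconditional mean.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 500, 20000

a = rng.normal(size=N)                       # individual-specific component (F_[cs]-measurable)
b = rng.normal(size=T)                       # time-specific component (F_[ts]-measurable)
X = a[:, None] + b[None, :] + rng.normal(size=(N, T))

time_means = X.mean(axis=1)                  # average over t, one individual at a time
print(np.abs(time_means - a).max())          # small: each limit is the individual-specific a_i
print(abs(X.mean()))                         # small: the panel-wide average approaches EX = 0
```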
Condition SP.4 introduces and imposes “panel-wise” ergodicity when jointly con-
sidering both directions. This allows panel-wise averages to converge to deterministic
limits, even as we intentionally allow non-deterministic one-directional limits. The
9 We write a mean-square convergence result because it is used in the following derivations. An almost-sure convergence result is also obvious.

convergence result to deterministic limits paves the foundation for frequentist infer-
ence on stochastic panel-based models.

Lemma 2 (Panel LLN). Under Conditions SP.1–4, and if $E\|X\|^2 < \infty$, then as N, T → ∞,
$$\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} X_{i,t} \xrightarrow{L^1} EX.$$

The four conditions discussed so far are symmetric in the two directions. However,
we expect the cross section to be independent in some sense, while the time series
evolutions are serially dependent. SP.5 formalizes these properties by strengthening
stationarity in SP.3 to exchangeability only for the [cs] direction.10 Under this con-
dition, F[cs] -measurable single i-subscript variables (for instance the fixed loadings βi
in PCA) are cross-sectionally i.i.d. And, the double-subscript variables are i.i.d. con-
ditional on F[ts] .11 For example, book-to-marketi,t is i.i.d. across stocks conditional on
the common time-specific information. Unconditionally, it is not independent (only
exchangeable) due to aggregate fluctuations of the value ratio.

2.2 IPCA Assumptions


We can now formally state the assumptions of the IPCA model using the machinery of
stochastic panels. Let c, µ, η, f 0 be the random variables of a stochastic panel. Among
them, f 0 is F[ts] -measurable (no i subscript). Based on those primitive variables,
equations (1)-(3) hold for each ω, defining e, β, x accordingly.12

2.2.1 Assumptions for Consistency

For consistency, we make the following three assumptions.


Assumption A (Instrument orthogonal to error). $E\big[c_{i,t}^{\top} e_{i,t} \mid \mathcal{F}_{[ts]}\big] = 0_{L\times 1}$.

10 We note SP.5 is not required for consistency results, but only for asymptotic normality.
11 The first claim is because ergodic exchangeable processes are i.i.d. The second is the de Finetti decomposition of exchangeable processes onto sub-σ-algebras of invariant events (see Hansen and Sargent, 2013).
12 When referring to the true parameters, we use a zero superscript, for example Γ⁰ and f_t⁰, and omit the zero superscript (Γ, ft) for generic parameters depending on the context.



This assumption is IPCA’s analogue to the exclusion restriction in instrumental vari-
able regression and is a key condition giving content to model equation (2). It is
necessary for consistent estimation of Γ.13
Assumption B (Moments). The following moments exist: (1) $E\big\|f_t^0 f_t^{0\top}\big\|^2$, (2) $E\big\|c_{i,t}^{\top} e_{i,t}\big\|^2$, (3) $E\big\|c_{i,t}^{\top} c_{i,t}\big\|^2$, (4) $E\big[\big\|c_{i,t}^{\top} c_{i,t}\big\|^2 \|f_t^0\|^2\big]$.

Assumption C. (1) The parameter space Ψ of Γ is compact and away from rank deficiency: $\det(\Gamma^{\top}\Gamma) > \epsilon$ for some $\epsilon > 0$. (2) Almost surely, $c_{i,t}$ is bounded, and (3) define $\Omega^{cc}_{t} := E\big[c_{i,t}^{\top} c_{i,t} \mid \mathcal{F}_{[ts]}\big]$; then almost surely $\det \Omega^{cc}_{t} > \epsilon$ for some $\epsilon > 0$.

The two assumptions above are regularity conditions for consistency. Assumption
B lists the required finite moments to apply the panel Law of Large Numbers (Lemma
2). Assumption C guarantees matrix Γ> Ct> Ct Γ, whose inverse frequently appears,
remains nonsingular (Ct denotes the N × L matrix that stacks up the cross section
of ci,t ).

2.2.2 Assumptions for Asymptotic Normality

For asymptotic normality and deriving the asymptotic variance, we impose the fol-
lowing additional assumptions.

Assumption D (Central Limit Theorems).

(1) As N, T → ∞, $\frac{1}{\sqrt{NT}} \sum_{i,t} \mathrm{vect}\big(c_{i,t}^{\top} e_{i,t} f_t^{0\top}\big) \xrightarrow{d} \mathrm{Normal}\big(0, \Omega^{cef}\big)$.

(2) For any t, as N → ∞, $\frac{1}{\sqrt{N}} C_t^{\top} e_t \xrightarrow{d} \mathrm{Normal}\big(0, \Omega^{ce}_t\big)$.

(3) As N, T → ∞, $\frac{1}{\sqrt{T}} \sum_{t} \mathrm{vecb}\big(f_t^0 f_t^{0\top} - V^{ff}\big) \xrightarrow{d} \mathrm{Normal}\big(0, V_{[3]}\big)$, where $V^{ff} := E\big[f_t^0 f_t^{0\top}\big]$.14

Assumption E (Bounded Dependence). There exists an M < ∞, such that ∀N, T, $\frac{1}{NT}\sum_{i,j,t,s} \|\tau_{ij,ts}\| \le M$, where $\tau_{ij,ts} := E\big[c_{i,t}^{\top} e_{i,t} e_{j,s} c_{j,s}\big]$.

Assumption D(1) is a panel-wise central limit theorem. It gives the asymptotic


distribution of the dominant term in the score function S(Γ0 ) (see Lemma 3). As-
sumption D(2) is a cross-sectional central limit theorem conditional on aggregate time
13 Notice, an alternative to Assumption A is its sufficient condition of imposing orthogonality with respect to each of the two primitive errors: $E\big[c_{i,t}^{\top}\eta_{i,t} \mid \mathcal{F}_{[ts]}\big] = 0$ and $E\big[c_{i,t}^{\top}\mu_{i,t} \mid \mathcal{F}_{[ts]}\big] = 0$.
14 vect(A) is an operator that vectorizes matrix A’s elements to a column vector by going right first, then down. For a square matrix A, vecb(A) vectorizes A’s upper triangular entries, not including the diagonal, to a vector by going right first, then down. See details with an example in Appendix A.



series information (F[ts] ). It gives factor estimation’s asymptotic distribution (see The-
orem 4.b), and works in a similar way as Bai’s (2003) Assumption F3. Assumption
D(3) is used by the results in Appendix B, which is related to how orthogonalizing
factors (as a normalization) introduces “contamination” to estimation.
Assumption E sets a bound for the time-series and cross-sectional dependence of $c_{i,t}^{\top} e_{i,t}$ for proving asymptotic normality (see Lemma 3). It is analogous to Bai’s (2003) Assumption C4.

Assumption F (Constant Second Moment of Instruments). $\Omega^{cc}_t$ is constant at $\Omega^{cc}$.

Assumption F shuts down the variation of ci,t ’s cross-sectional second moment,


in order to retain concise expressions for the asymptotic variances in the specific
identification cases (Theorem 3). The essence of the asymptotic theory does not
depend on Assumption F. For example, the convergence rates would not change if it
were relaxed.

3 Estimator as Normalized Optimizer


Estimation consists of two steps: optimizing the objective function to find the set of
equivalent solutions, followed by a normalization of the solutions that selects a unique
element of this set to serve as the parameter estimate. The second step is necessary
because IPCA has a rotational unidentification issue well-known from other latent
factor settings. While our identification approach in some sense follows convention in
the literature, we more formally construct identification conditions. This formality is
not just for defining the estimator, but for more rigorously expounding the notion of
a “true” parameter and the corresponding measurement of estimation errors. As a
result, we develop a general asymptotic analysis framework for estimators normalized
by any valid identifying assumption, which is also applicable in contexts beyond IPCA
(such as in the asymptotic analysis of PCA).



3.1 Optimization
IPCA is estimated as a least squares problem that minimizes the sample sum of
squared errors (SSE) over parameters Γ and {ft }:
$$\min_{\Gamma,\{f_t\}} \ \sum_{i,t} (x_{i,t} - c_{i,t}\Gamma f_t)^2. \qquad (4)$$

This objective is inspired by PCA which also optimizes the sample SSE but over
parameters {βi } and {ft }.
We use an Alternating Least Squares (ALS) method for the numerical solution
of the optimization because, unlike PCA, the IPCA optimization problem does not
have a solution through an eigen-decomposition.15 The SSE target in (4) is quadratic
in either Γ or {ft } when the other is fixed. This property allows for analytical opti-
mization of Γ and {ft } one at a time. Given any Γ, factors {ft } are t-separable and
solved with cross-sectional OLS for each t:
$$\hat{f}_t(\Gamma) := \arg\min_{f_t} \sum_{i} (x_{i,t} - c_{i,t}\Gamma f_t)^2 = \big(\Gamma^{\top} C_t^{\top} C_t \Gamma\big)^{-1} \Gamma^{\top} C_t^{\top} x_t.^{16} \qquad (5)$$

Symmetrically, given {ft }, the optimizing Γ (vectorized as γ) is solved with pooled


panel OLS of xi,t onto LK regressors, ci,t ⊗ ft> :
$$\arg\min_{\gamma}\sum_{i,t}(x_{i,t}-c_{i,t}\Gamma f_t)^2 = \left(\sum_{i,t}\big(c_{i,t}^{\top}\otimes f_t\big)\big(c_{i,t}^{\top}\otimes f_t\big)^{\top}\right)^{-1}\left(\sum_{i,t}\big(c_{i,t}^{\top}\otimes f_t\big)\,x_{i,t}\right). \qquad (6)$$
Throughout the paper, the lower case γ represents the vectorized Γ⊤ matrix, γ = vect(Γ) (likewise for γ⁰, γ̂, and so forth).
15 In the special case that Ct⊤Ct is constant across t, the solution of the IPCA estimation optimization reduces to the SVD of the interaction of xi,t and ci,t, which is similar to the projected principal component analysis of Fan et al. (2016); however, they did not consider time-varying instruments nor time-varying loadings. The SSE target can be transformed to a sum of multivariate Rayleigh quotients with inverted matrices (Γ⊤Ct⊤CtΓ)⁻¹. Were Ct⊤Ct invariant, these matrices would be invariant and could be pulled out of the t-summation, leading to a case similar to PCA where the SSE can be minimized with SVD. A previous version of this paper and the Appendix of KPS detail the solution under this special case.
16 Here, xt is the N × 1 vector of xi,t, Ct is the N × L matrix of ci,t. Notice f̂t(Γ) depends on the sample cross-section and is not F[ts]-measurable (although it carries a single t subscript). A more proper but cumbersome indexing would be f̂N,t.



The ALS algorithm begins with an initial guess and iterates between updates of Γ
and {ft } while holding the other fixed according to the above first-order conditions.
The program stops when the optimality conditions are satisfied up to a predetermined
tolerance.17 The ALS method can be seen as a Block Coordinate Descent algorithm
with Γ and {ft } as the two blocks. In practice, it converges after a few hundred
iterations in a matter of seconds in our empirical and simulated datasets. The speedy
calculation is a notably important feature that will allow for widespread application
of IPCA. For example, it is easy to repeatedly estimate the model for a large number
of generated samples as required in simulation-based inference procedures.
Notice the numerical method easily adapts to unbalanced panels, because both
the partial optimizations are regressions which tolerate missing values. In particular,
we can rewrite Equations (4)-(6) with the minor change that all summations run over only the observed i, t panel entries. Then the same procedure follows through.18
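As a concrete illustration of the ALS procedure just described, the sketch below implements the two regression steps, equations (5) and (6), for a balanced panel in Python/NumPy. The function name, the random initialization, and the convergence tolerance are our own illustrative choices rather than the paper's implementation; the unbalanced case only requires restricting each regression to observed (i, t) entries.

```python
import numpy as np

def ipca_als(x, c, K, tol=1e-8, max_iter=1000, seed=0):
    """Alternating least squares for IPCA on a balanced panel.

    x : (N, T) array of the panel x_{i,t}
    c : (N, T, L) array of instruments c_{i,t}
    K : number of latent factors
    Returns Gamma (L, K) and F (T, K), identified up to a rotation.
    """
    N, T, L = c.shape
    rng = np.random.default_rng(seed)
    Gamma = rng.normal(size=(L, K))              # initial guess
    F = np.zeros((T, K))

    for _ in range(max_iter):
        # Step 1, eq. (5): cross-sectional OLS of x_t on C_t Gamma, period by period
        for t in range(T):
            Z = c[:, t, :] @ Gamma               # N x K instrumented loadings
            F[t] = np.linalg.lstsq(Z, x[:, t], rcond=None)[0]

        # Step 2, eq. (6): pooled OLS of x_{i,t} on the LK regressors c_{i,t} (x) f_t'
        W = np.einsum('itl,tk->itlk', c, F).reshape(N * T, L * K)
        gamma = np.linalg.lstsq(W, x.reshape(N * T), rcond=None)[0]
        Gamma_new = gamma.reshape(L, K)

        if np.max(np.abs(Gamma_new - Gamma)) < tol:
            return Gamma_new, F
        Gamma = Gamma_new
    return Gamma, F
```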
The following sections focus on Γ first by concentrating out ft as an intermediate object. Define the target function G(Γ) as the concentrated SSE objective in (4) with respect to Γ only,
$$G(\Gamma) = \frac{1}{2NT}\sum_{i,t}\big(x_{i,t} - c_{i,t}\Gamma \hat{f}_t(\Gamma)\big)^2, \qquad (7)$$
where f̂t(Γ) is the optimal factor given any Γ according to (5). And, define the score function as the derivative: $S(\Gamma) = \partial G(\Gamma)/\partial \gamma$. Therefore, the two-argument joint minimization problem (Eq. 4) is equivalent to minimizing G(Γ), or solving the first order condition $S(\Gamma) = 0_{LK\times 1}$, with respect to Γ only. The asymptotic analysis proceeds by first analyzing S to characterize the asymptotics of Γ̂ while treating the factor estimate f̂t(Γ̂) as an implicit intermediate. Once the asymptotic analysis of Γ̂ is done, Section 6 comes back to the factor asymptotics by plugging Γ̂ in.
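The concentration in equation (7) is straightforward to make explicit in code. The sketch below (our own illustration, reusing the array shapes from the ALS sketch above) profiles out the factors period by period and returns the concentrated SSE for a given Γ.

```python
import numpy as np

def concentrated_sse(Gamma, x, c):
    """G(Gamma) from eq. (7): profile out f_t via cross-sectional OLS, then
    return the scaled sum of squared concentrated errors."""
    N, T, L = c.shape
    sse = 0.0
    for t in range(T):
        Z = c[:, t, :] @ Gamma                              # N x K instrumented loadings
        f_hat = np.linalg.lstsq(Z, x[:, t], rcond=None)[0]  # f_hat_t(Gamma), eq. (5)
        resid = x[:, t] - Z @ f_hat
        sse += resid @ resid
    return sse / (2 * N * T)
```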
17 We can prove uniqueness for the asymptotic score function up to rotation in the sense of Proposition 2, which provides a theoretical foundation of the ALS method. In simulations we see that convergence is unique and fast unless data generating errors are simulated to be unreasonably large.
18 In practice, we first “fill up” data {xi,t, ci,t} at any unobserved i, t entry with zeros, and then use the same ALS program for the completed panel. It is easy to show this process is equivalent to summing over only the observed entries.



3.2 Normalization
In population, a class of “rotated” true parameters Γ0 R yields exactly the same data
generating process for xi,t , ci,t , so long as true factors are rotated inversely as R−1 ft0
(for any full rank K × K matrix R). The counterpart sample issue is that the target
and score are invariant to rotation: if Γ̂ solves S(Γ) = 0, then any rotation Γ̂R is also a solution.
IPCA estimation deals with the issue by following the convention in the latent
factor literature—it selects a normalization from the set of many rotational equivalent
optimizers as the unique estimator. The choice of normalization is often made based
on economic interpretability, algebraic elegance, or computational convenience.
The convention in the literature is to derive asymptotic theory of an estimator
conditional on a specific normalization. In contrast with this convention, we provide
a general method to characterize asymptotic properties under any valid normaliza-
tion. To this end, we construct the following concepts for a generic normalization.
We will discuss two specific normalization examples towards the end of this section.
The asymptotic analysis in the following sections are first derived with the generic
normalization, based on which the specific cases become straightforward evaluations.
To define key concepts, let parameter space Ψ be the set of all possible parameters
Γ. A parameter Γ is called rotationally equivalent to Γ0 if there exists an invertible
matrix R such that Γ = Γ0 R. Define an identification condition as a subset Θ ⊂ Ψ,
such that for any Γ ∈ Ψ there is a unique Γ0 ∈ Θ which is rotationally equivalent
to Γ. Associated with an identification condition, the normalization is the mapping
from any Γ to the (unique) equivalent Γ0 ∈ Θ. Define identification function I(Γ)
as a (vector-valued) function on the parameter space such that its solution set constitutes an identification condition: Θ = {Γ | I(Γ) = 0}. The identification function
operationalizes a normalization, and is analyzed in tandem with the score function in
our asymptotic analysis.
Figure 2 illustrates these concepts. Dashed lines through the parameter space
Ψ represent sets of rotationally equivalent parameters. The data generating process
is unchanged for parameters in the same set. In addition, the target function G is
invariant within each set. The particular set that minimizes G is drawn in blue.
We have plotted a generic identification condition Θ as the solid black line which
intersects each dashed line at only one point. There are many possible normalizations,
representable as different Θ’s that “cut through” all of the rotationally equivalent



Figure 2: Identification Condition, Normalization, and Estimation

[Figure: within the parameter space Ψ, the identification condition Θ ⇔ I(Γ) = 0 intersects the set arg min G(Γ) ⇔ S(Γ) = 0 at the estimator Γ̂; a rotated parameter Γ′ = ΓR is normalized onto Θ.]

Notes: The dashed lines are sets of rotationally equivalent parameters. A particular one (in blue) minimizes the target function. The black curve represents the identification condition Θ. The intersection of sets Θ and arg min G(Γ) defines the estimator Γ̂. The red arrow represents the normalization of a parameter to Θ.

parameter sets each only once. We will analyze two concrete examples of Θ further
below.

3.3 Estimator Defined as Normalized Optimizer


The estimator is defined as the optimizer of target function G(Γ) that is normalized
by Θ:

$$\hat{\Gamma}(\Theta) = \arg\min_{\Gamma \in \Theta} G(\Gamma). \qquad (8)$$

When there is no emphasis on the specific normalization choice, we omit “(Θ)” and simply write Γ̂.
Despite first appearances, (8) is not a constrained optimization because the con-
straint “Γ ∈ Θ” never restrains the target from achieving its global optimum—it
only picks a unique representation out of the (rotationally equivalent) set of Γ’s that
all achieve the same minimum. Equivalently, Γ̂ is the solution of the simultaneous
equations S(Γ) = 0 and I(Γ) = 0, shown as the intersection of the two corresponding
curves in Figure 2. Based on this representation, we will characterize the asymptotics
of Γ̂ by analyzing the score and identification functions.
In addition, we note that a normalization Θ must be known to the econometrician,
meaning it can depend on the sample but cannot depend on the underlying population
parameters.



3.4 Two Normalization Examples
We give two specific normalization examples that are convenient and interpretable,
called ΘX and ΘY . We will keep returning to these two examples for concrete asymp-
totic analysis, and always use “X” and “Y” subscripts when referring to the con-
structions. For example, IX (Γ) and IY (Γ) are identification functions for the two
cases.
Γ has LK degrees of freedom in total, within which K 2 degrees are unidentified
since the rotation R is K × K. So an identification condition must restrict K 2
degrees of freedom. The first example pins down the top K × K block of Γ as the
identity matrix, and leaves the lower L − K rows free. Specifically, define ΘX :=
{Γ ∈ Ψ | Block1:K (Γ) = IK }, where Block1:K (·) is a function of a matrix that cuts
out its first K rows. The associated normalization mapping is Γ0 = ΓBlock1:K (Γ)−1 .
This identification condition forces a one-to-one correspondence between the K
factors and the first K instruments, which gives the factors an economic interpretation
dictated by those instruments.19 Using a familiar asset pricing example from Fama and
French (1993), suppose there are three factors (K = 3), that xi,t consists of monthly
stock returns, and that the first three instruments are lagged size, book-to-market
ratio, and momentum of a stock (ci,t = [sizei,t−1 , bmi,t−1 , momi,t−1 , ...]). We know that
βi,t is given by ci,t Γ. Under ΘX , the first loading is βi,t,1 = 1 · sizei,t−1 + 0 · bmi,t−1 +
0 · momi,t−1 + effects of other instruments—loadings on the first factor are driven
one-for-one by the first instrument (size), but are unaffected by book-to-market and
momentum, giving the first factor a direct interpretation as a size factor analogous
to the “SMB” factor of Fama and French (1993). Likewise, the second and third
factors are book-to-market and momentum factors. The lower L − K rows of Γ tell
us how the remaining L − K instruments affect loadings. Hence, this normalization
is reminiscent of, and a less ad-hoc alternative to, the popular characteristics-sorted
portfolios that the empirical asset pricing literature has adopted for understanding
systematic risks.20
19 Without loss of generality, we can reorder the instruments however we wish.
20 This identification condition ΘX appears similar to identifying restriction PC3 in Bai and Ng
(2013), but carries an important distinction due to IPCA’s utilization of instruments. In the static-
loading PCA estimator, the PC3 restriction says that the first K individuals have a direct corre-
spondence to the K factors. Continuing the asset-pricing example, it would imply that the factors
are aligned with the first K individual stocks which, from a financial economics point-of-view, is
ad-hoc and not intuitive for understanding the factor space.
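In code, the ΘX normalization amounts to a single rotation. The sketch below is our own NumPy illustration, taking an arbitrary rotationally equivalent pair (Gamma, F) and returning the representative with an identity top block; it assumes the top K × K block is invertible.

```python
import numpy as np

def normalize_theta_x(Gamma, F):
    """Rotate (Gamma, F) so that Block_{1:K}(Gamma) = I_K."""
    K = Gamma.shape[1]
    R = Gamma[:K, :]                       # top K x K block, assumed invertible
    Gamma_x = Gamma @ np.linalg.inv(R)     # top block becomes the identity
    F_x = F @ R.T                          # factors rotated inversely: f_t -> R f_t
    return Gamma_x, F_x
```

For example, applied to the output of the ALS sketch above: `Gamma_x, F_x = normalize_theta_x(*ipca_als(x, c, K))`.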



For the second normalization, define ΘY as the set of Γ ∈ Ψ such that [1] Γ is ortho-normal: Γ⊤Γ = I_K; and [2] f̂t(Γ) is orthogonal: $\frac{1}{T}\sum_t \hat{f}_t(\Gamma)\hat{f}_t^{\top}(\Gamma)$ is diagonal, with distinct and descending entries. Appendix C.4 verifies ΘY is indeed an identification condition that satisfies the uniqueness condition, and writes out the associated normalization procedure.21 This identification condition is a familiar PCA convention and is implemented for IPCA in KPS. It chooses a set of orthogonal factors to represent the factor space. Notice [1] and [2] pin down ½K(K + 1) and ½K(K − 1) degrees of freedom, respectively, for a total of K². Importantly, identification condition ΘY depends on the sample, because f̂t(Γ) in [2] is a sample-dependent function. This sample-dependence brings some complexity into the asymptotic analysis, which we will address in the following sections.
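Appendix C.4 spells out the exact ΘY normalization procedure and its sign restriction. Purely as a rough illustration (our own sketch, which ignores the sign restriction discussed in footnote 21), one standard way to impose conditions [1] and [2] on an estimated pair is:

```python
import numpy as np

def normalize_theta_y(Gamma, F):
    """Rotate (Gamma, F) so that Gamma'Gamma = I_K and the factors' sample
    second moment is diagonal with descending entries (signs not pinned down)."""
    # [1] make Gamma ortho-normal using the symmetric square root of Gamma'Gamma
    evals, evecs = np.linalg.eigh(Gamma.T @ Gamma)
    A_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
    A_half_inv = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    Gamma1, F1 = Gamma @ A_half_inv, F @ A_half            # Gamma1'Gamma1 = I_K

    # [2] rotate so the factors' sample second moment is diagonal and descending
    d, U = np.linalg.eigh(F1.T @ F1 / F1.shape[0])
    U = U[:, np.argsort(d)[::-1]]
    return Gamma1 @ U, F1 @ U                              # orthonormality is preserved
```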

4 Consistency of Γ Estimation
This section demonstrates that IPCA estimates the mapping matrix Γ consistently.
Since the estimate Γ b is the simultaneous solution of the first order condition (score
function) and the identification function, its consistency is based on the (uniform)
convergence of these two functions, following the standard strategy for analyzing
M -estimators (Newey and McFadden, 1994). Relative to the classical framework,
we must confront two additional challenges: simultaneous N, T convergence and the
identification issue. To address the former, we rely on the large sample properties of
stochastic panels developed in Section 2.1. For the latter, we build on the normaliza-
tion concepts constructed in Section 3.2.

4.1 Score Function Convergence


We show the score function uniformly converges to a deterministic limiting function:

Proposition 1 (Uniform Convergence of the Score Function).


21 An additional nuance is that the signs of the K Γ columns (and the corresponding K factors)
need one more restriction to pin down. If the additional restriction is not suitable, it brings problems
in finite sample simulations (for example, the one Stock and Watson (2002) proposed). Appendix
D explains the issue in detail and constructs a sign restriction [3] that is theoretically sound and
practically easy to use in simulations.



Under Assumptions A–C, the score function converges uniformly in probability:
$$\sup_{\Gamma\in\Psi}\ \big\|S(\Gamma) - S^0(\Gamma)\big\| \xrightarrow{p} 0, \quad N, T \to \infty.^{22}$$

Moreover, we verify that the limiting function S 0 is solved only by Γ0 and its uniden-
tified rotations:

Proposition 2 (Only True Parameters Solve Limiting Score). Under Assumption C,


S 0 (Γ) = 0 if and only if Γ is rotationally equivalent with Γ0 .

These two results are the foundation of IPCA consistency. When combined, they
imply a sense of set convergence regardless of the identification issue—the set of SSE
minimizers {Γ s.t. S(Γ) = 0} converges to the set of unidentified true parameters.
Once we build the counterparts of these results for the identification function in the
next subsection, score and identification functions together will lead to estimator
convergence.
But before that, let us explain the intuition behind Proposition 1, which is essen-
tial for IPCA’s large sample theory. By taking the derivative of the target, we find
$S(\Gamma) = \frac{1}{NT}\sum_t \mathrm{vect}\big(C_t^{\top}\hat{e}_t(\Gamma)\hat{f}_t^{\top}(\Gamma)\big)$, where f̂t(Γ) and êt(Γ) are the coefficients and errors, respectively, of the cross-sectional OLS regression of xt onto CtΓ.23 The sample
OLS estimate fbt (Γ) misses the true ft0 for two reasons. For one, it uses a Γ that is not
the true Γ0 , and thereby the instrumented factor loadings Ct Γ are off. Second, even
if we knew the true Γ0 and thus the sample β’s were right, OLS in the finite cross
section does not exactly reveal the true ft0 . To formalize this intuition, we construct
fet (Γ) as the “population” counterpart of the same xt -on-Ct Γ regression. By “popu-
lation” we mean conditional on F[ts] : that is, the collection of all “aggregate” events
which contain the information of how the “current” cross section would be generated.
Specifically,
$$\tilde{f}_t(\Gamma) := E\big[\Gamma^{\top} c_{i,t}^{\top} c_{i,t}\Gamma \mid \mathcal{F}_{[ts]}\big]^{-1} E\big[\Gamma^{\top} c_{i,t}^{\top} x_{i,t} \mid \mathcal{F}_{[ts]}\big] = \big(\Gamma^{\top}\Omega^{cc}_t\Gamma\big)^{-1}\Gamma^{\top}\Omega^{cc}_t\Gamma^0 f_t^0, \qquad (9)$$
with the shorthand $\Omega^{cc}_t := E\big[c_{i,t}^{\top}c_{i,t} \mid \mathcal{F}_{[ts]}\big]$.24 Therefore, the aforementioned two sources
22 The expression of S⁰ is in Appendix C.6 together with the proof.
23 Eq. (5) defined f̂t(Γ), and êt(Γ) := xt − CtΓf̂t(Γ).
24 Assumption A is used in Eq. (9).



of fbt (Γ) error are formally represented with the decomposition
−1 > >
fbt (Γ) = fet (Γ) + Γ> Ct> Ct Γ Γ Ct eet (Γ), (10)

where eet (Γ) := xt − Ct Γfet (Γ) is the corresponding population OLS error. Importantly,
given a rotation R, fet (Γ0 R) = R−1 ft0 and eet (Γ0 R) = et . This means if one knew the
true Γ0 (even if not knowing the particular rotation), functions fet (Γ)ande et (Γ) would
be able to reveal the true factor structure. The second term in (10) captures the
remaining error caused only by the finite cross section.
With this knowledge, the score function can be broken down by representing
ft (Γ), ebt (Γ) with fet (Γ), eet (Γ). We save the exact expression to Appendix C.6, except
b
noting that the score function inherits the same decomposition: into a part that is only
“off” due to the “wrong” Γ, and another part due to finite cross section. Intuitively,
the first part converges to the probability limit S 0 (Γ), while the the second vanishes
in the large-N, T limit.25 Finally, the limiting score S 0 (Γ) can recover the true Γ0 (up
to a rotation) according to Proposition 2.

4.2 Identification Function Convergence


As defined in Subsection 3.2, an identification function is solved by the normalized
parameters: Θ = {Γ s.t. I(Γ) = 0}. We define the following identification functions
for the two examples ΘX and ΘY, respectively:
$$I_X(\Gamma) := \mathrm{vect}\big(\mathrm{Block}_{1:K}(\Gamma) - I_K\big), \qquad I_Y(\Gamma) := \begin{bmatrix} \mathrm{veca}\big(\Gamma^{\top}\Gamma - I_K\big) \\[2pt] \mathrm{vecb}\big(\tfrac{1}{T}\sum_t \hat{f}_t(\Gamma)\hat{f}_t^{\top}(\Gamma) - V^{ff}\big) \end{bmatrix}.^{26}$$

Notice fbt (Γ) involves sample data, and hence IY is a random function.
We show IY converges uniformly to a deterministic function IY0 , which also con-
stitutes an identification function. The same claim trivially applies to IX as it is
25 The proof is in Appendix C.6, which deals with the additional complication of uniform large-N, T convergence across Γ, which is necessary for the convergence discussed next.
26 Mappings veca and vecb vectorize the upper triangular entries of a square matrix. The difference is that veca includes the diagonal entries, and vecb does not. See details with an example in Appendix A. V^ff is the factor’s population second moment matrix as defined in Assumption D. It is actually redundant to write “−V^ff” in the expression of IY, as long as the assumed V^ff is a diagonal matrix, since vecb ignores the diagonal entries. However, we leave it in to note ΘY can be easily adjusted at the applied researcher’s discretion with the asymptotic analysis largely unchanged. For example, one can specify V^ff = IK and switch “veca” and “vecb” to standardize the factors instead of Γ.



deterministic to begin with. In detail, Appendix C.8 verifies both IX , IY satisfy the
following conditions about a generic identification function I:

IF.1 Uniform convergence: There exists a deterministic function I⁰(Γ) such that
$$\sup_{\Gamma\in\Psi}\ \big\|I(\Gamma) - I^0(\Gamma)\big\| \xrightarrow{p} 0, \quad N, T \to \infty.$$

IF.2 Limit is an Identification Condition: The set Θ⁰ := {Γ ∈ Ψ | I⁰(Γ) = 0} is an identification condition.

In addition, we maintain the identification assumption about the true Γ0 :

IA Identification Assumption: Γ0 ∈ Θ0 , i.e., I 0 (Γ0 ) = 0.

IF.2 and IA combined imply Γ0 solves I 0 , and it is the only solution among all of its
equivalent rotations.
The above convergence and uniqueness conditions about I are symmetric to
Propositions 1 and 2 about S. Since the estimator is the intersection of S and I,
these conditions together form the premise of estimator consistency, which will be
formally stated in Subsection 4.4. Therefore, IF.1–2 can be used to establish if any
normalization, besides ΘX and ΘY , yields a consistent parametric estimation.
We note IA is an innocuous assumption because the true parameter can always
be normalized to satisfy any identification condition, including Θ0 , without changing
the data generating process. Next, we sort out some ambiguity brought about by
having different, but rotationally equivalent, true parameters.

4.3 Which True Parameter?


This subsection clarifies some subtle issues involved in normalizing the true parameter.
Let us first review how the points talked about so far are shown in Figure 3. Based on
this visual representation, we explain the roles played by the different true parameters
in asymptotic analysis.
Proposition 1 establishes that the set of optimizers (shown as the upper “S(Γ) =
0” curve) converges to the set of rotationally-equivalent true parameters (the lower
“S 0 (Γ) = 0” curve). Meanwhile, IF.1 says the identification conditions are also con-
verging. The blue vertical line describes the normalization ΘY for a specific sample.



Figure 3: Estimated and True Parameters in the Two Normalization Cases

[Figure: the optimizer set S(Γ) = 0 and the true-parameter set S⁰(Γ) = 0, converging at rate Op(1/√(NT)), are intersected by the identification conditions ΘX = Θ⁰X, ΘY, and Θ⁰Y, with ΘY converging to Θ⁰Y at rate Op(1/√T); the estimates Γ̂(ΘX), Γ̂(ΘY) and the true parameters Γ⁰(ΘX), Γ⁰(ΘY), Γ⁰(Θ⁰Y) sit at the corresponding intersections, and dashed arrows A, B, C mark three estimation errors.]

Notes: The top horizontal curve is the set of optimizers. The lower curve is the set of rotationally equivalent true parameters. The three vertical lines are identification conditions. All the black objects are deterministic, the blue ones are sample-dependent. The two solid red arrows point from the random sets to their deterministic limits, labeled with the rates of convergence. The dashed red arrows A, B, C mark three specific estimation errors, whose asymptotic distributions are in Theorems 3.a, 3.b, 5, respectively.

It varies across samples but converges to the limit Θ0Y , shown as the vertical line on
the right. On the other hand, the limit of ΘX is itself (since it is deterministic), shown
as the vertical line on the left. The estimators Γ̂(ΘX) and Γ̂(ΘY) are at the intersec-
tions of the optimizer set and the identification-condition set, where the parenthesis
clarifies which identification condition it is in.
We make two main points. First, assumption IA means that the true Γ⁰ is not the
same under the two normalization cases ΘX and ΘY. This can be seen by noting
that the limiting identification conditions Θ0 are different for the two cases—one is
ΘX itself and the other is Θ0Y . To avoid ambiguity, we write the true parameter under
the two cases as Γ0 (ΘX ) and Γ0 (Θ0Y ), while Γ0 is reserved as the shorthand when
there is no emphasis on the specific case. The two are rotationally equivalent, but
satisfy different limiting identification conditions as clarified in the parentheses. The
true parameter is deterministic, since it is the intersection of two deterministic sets:
S 0 (Γ) = 0 and Θ0 . This deterministic truth is the parameter in IPCA definition
(Eq. 2). Later on, it will serve as the fixed reference point in asymptotic expansion.
Finally, the asymptotic variance is expressed with regard to the generic Γ0 , which is
subsequently evaluated at either Γ0 (ΘX ) or Γ0 (Θ0Y ) depending on the specific case.
Our second main point is that we measure estimation error against the true pa-
rameter normalized in the same way as the estimate is normalized. We call this the
normalized true parameter and denote it as Γ0 (Θ), where Θ is the normalization used



for estimation.27 Arrows A and B in Figure 3 are examples of this choice of estimation
error. The two arrows represent the two estimation errors involved in our asymptotic
expansions: Γ̂(ΘX) − Γ⁰(ΘX) and Γ̂(ΘY) − Γ⁰(ΘY). Each pair is an estimate minus a true parameter, both under a common normalization.
A subtle distinction is that the normalized truth can be made sample-dependent
by the normalization. This is the case of Arrow B in particular, where the reference
point Γ⁰(ΘY) is sample-dependent. On the other hand, in the case of ΘX there is no
distinction for the normalized truth because ΘX and its limit are one and the same.
Measuring estimation errors against the normalized truth is well-justified, al-
though perhaps peculiar at first sight that a “true” parameter can be random. Indeed,
it follows and formalizes the convention of asymptotic analysis in factor analysis.28
Following previous literature, our primary interest is on Arrows A and B, since they are essentially about the accuracy of IPCA estimation in recovering a true parameter, albeit a rotated true parameter. In contrast, the asymptotic analysis of Γ̂(ΘY) − Γ⁰(Θ⁰Y) (Arrow C) would reveal the complete asymptotic distribution of the estimator Γ̂(ΘY), since the reference point Γ⁰(Θ⁰Y) is fixed. We will show that Arrow C converges at a slower rate than Arrows A or B due to an additional stochastic wedge. The wedge exists because the estimate is normalized according to ΘY while the reference point is normalized by its limit Θ⁰Y.29 This slows down the convergence of Arrow C, because it also relies on the convergence of ΘY to Θ⁰Y (which happens at the slower rate √T).
We derive these and other facts rigorously in the analysis below. We focus our
asymptotic analysis on the errors labeled as Arrows A and B, and relegate theoretical
and simulation results for Arrow C to Appendix B.
To recap, we have defined the following notations. The estimate is Γ̂(Θ), or Γ̂ in short when there is no emphasis on the specific normalization Θ. The true parameter is Γ⁰(Θ⁰), or Γ⁰ in short similarly. And, the normalized true parameter is Γ⁰(Θ), against which the estimation errors are referenced. The instances of these
27 Since Θ is an identification condition, per its definition, such a Γ⁰(Θ) is unique.
28 Bai and Ng (2002) and Bai (2003) are prominent examples in the literature in which estimation errors are measured against the normalized truth. They study the PCA estimation error λ̂i − H⁻¹λi⁰, where H is a sample-dependent rotation imposed on the true λi⁰. (In their notation, λi is the factor loading, equivalent to our βi.) Although in Bai and Ng (2013) such a rotation matrix H is avoided, the true parameters used as comparison targets are still normalized according to sample realizations. For example, they directly restrict the sample second moment matrix of the true factors in their PC1 and PC2. This is comparable to our ΘY condition [2].
29 One cannot normalize the estimate by Θ⁰Y, simply because it is composed of population quantities. It is only a theoretical construction for large sample theory.



three objects in the two normalization examples can be found in Figure 3. Lastly,
the same notations with a small γ represent the vectorized variants.

4.4 Consistency of Γ Estimation


Now we combine the previous preparations to show the consistency of Γ estimation. Per the estimator definition, Γ̂ solves the simultaneous equations [S; I](Γ) = 0.30 Proposition 1 and IF.1 combined mean that [S; I](Γ) uniformly converges to [S⁰; I⁰](Γ). Proposition 2, IF.2, and IA together imply Γ⁰ is the unique solution of [S⁰; I⁰](Γ) = 0. Since the solution of a uniformly converging function converges to the limit’s unique solution (Newey and McFadden, 1994), we have Γ̂ →p Γ⁰. Meanwhile, Γ⁰(Θ) is the solution of [S⁰; I](Γ), and we can similarly show it approaches the same fixed limit Γ⁰. Bridged by the common limit Γ⁰, the difference between the estimate and the normalized truth must converge to zero.

Theorem 1 (Γ Estimation Consistency — Generic Normalization). Under Assumptions A–C, and if the identification condition Θ has an associated identification function I(Γ) that satisfies IF.1–2, then as N, T → ∞, $\hat{\Gamma} - \Gamma^0(\Theta) \xrightarrow{p} 0$.

The theorem above is the consistency result for a generic identification condition
Θ. For the two specific cases (ΘX and ΘY ), we already verified that both identification
functions IX (Γ) and IY (Γ) satisfy Condition IF.1–2. Therefore, the estimators in these
two specific cases are consistent as well.

Corollary 1 (Γ Estimation Consistency — Specific Cases). Under Assumptions A–C, the IPCA Γ̂ estimated under normalization ΘX or ΘY is consistent with respect to the accordingly normalized true parameters: as N, T → ∞, $\hat{\Gamma}(\Theta_X) - \Gamma^0(\Theta_X) \xrightarrow{p} 0$, and $\hat{\Gamma}(\Theta_Y) - \Gamma^0(\Theta_Y) \xrightarrow{p} 0$.

5 Asymptotic Distributions of Γ Estimation Error


This section analyzes the asymptotic distribution of the estimation error γ̂ − γ⁰(Θ). We
first provide the general results for a generic identification condition Θ, and then
calculate the asymptotic distributions under the two specific normalization cases.
30
[S; I] (Γ) means the vector-valued function by vertically stacking up S(Γ) and I(Γ). Same for a
few other stacked functions below.

24

Electronic copy available at: https://ssrn.com/abstract=2983919


5.1 Asymptotic Error — Generic Normalization
Theorem 2 (Asymptotic Error — Generic Normalization). Under the conditions of
Theorem 1, and assuming both S(Γ) and I(Γ) are continuously differentiable in a
neighborhood around γ 0 , where γ 0 satisfy IA, as N, T → ∞,
−1
b − γ 0 = − H 0> H 0 + J 0> J 0 H 0> S(γ 0 ) + J 0> I(γ 0 ) + op S(γ 0 ) + I(γ 0 ) ,
 
γ
(2.a)
−1
b − γ 0 (Θ) = − H 0> H 0 + J 0> J 0 H 0> S(γ 0 ) + op S(γ 0 ) ,
 
γ (2.b)

∂S 0 (Γ) ∂I 0 (Γ)
where H 0 := ∂γ >
and J 0 := ∂γ >
.
γ=γ 0 γ=γ 0

The proof is built on Newey and McFadden’s (1994) analysis of a M -estimator by


linearizing the score function. We extend that result to the not-locally-identified situ-
ation by appending the identification function to the score, and linearizing both sym-
metrically.31 This analytical method, and the results it affords, could more broadly
be used to analyze other estimators that require a normalization step.
The theorem offers a clear decomposition of the sources of randomness in parame-
ter estimation. On the right-hand sides of 2.a and 2.b, the only random terms are the
score and identification functions (both evaluated at Γ0 ), representing the sample in-
accuracies emanating from optimization and normalization. The theorem says γ b − γ0
loads on both sources of randomness while γ b − γ 0 (Θ) loads only on the randomness
of normalization.
The intuition can be illustrated by Figure 3. Look at the small neighborhood
around Γ0 (Θ0Y ), and imagine both the “S(Γ) = 0” curve and the ΘY curve are “wob-
bling” around their deterministic limits: they represent the two sources of randomness
affecting estimation. Since the estimate Γ(Θ
b Y ) is the intersection of the two curves,
it must load on the randomness of both (shown as Arrow C). However, since both
b Y ) and Γ0 (ΘY ) are under the same identification ΘY , their difference does not
Γ(Θ
depend on the randomness of ΘY (shown as Arrow B).

5.2 Asymptotic Error — Specific Cases


This section brings the general asymptotic result in Theorem 2.b to the specific cases
denoted by Arrows A and B. For the reasons discussed in Section 4.3, asymptotic
31
The proof that details our way of analysis is in Appendix C.10.

25

Electronic copy available at: https://ssrn.com/abstract=2983919


analysis of Arrow C, which is a special case of 2.a, is relegated to Appendix B.
To evaluate 2.b, it only remains to calculate the three right-hand side terms: H 0 ,
J 0 , and the asymptotic distribution of S(γ 0 ). As we have discussed in Subsection 4.3,
the general expressions in Theorem 2 need to be evaluated locally at either Γ0 (ΘX )
or Γ0 (Θ0Y ). As mentioned before, we use use subscripts X and Y to denote the values
calculated for their corresponding normalizations.
The Hessian-like matrices HX0 , JX0 and HY0 , JY0 are calculated by taking the deriva-
tives of the limiting functions, which are detailed in Appendix C.12–C.13. The more
p
interesting result is the asymptotic distribution of S(γ 0 ). We know S(γ 0 ) → − 0 from
Propositions 1 and 2. The following lemma says that the convergence happens at the

rate of N T and gives the asymptotic distributions.

Lemma 3 (Asymptotic Distribution of the Score Evaluated at Γ0 ).


Under Assumptions A–F, as N, T → ∞ such that T /N → 0,
√  d 
[1]
 √  d 
[1]

N T S γ 0 (ΘX ) →− Normal 0, VX , N T S Γ0 (Θ0Y ) →− Normal 0, VY .32


This lemma implies the convergence rate of Γ estimation error is N T , regardless

of the normalization choice. The N T -convergence rate highlights IPCA’s ability to
harness not just the time-series, but also the cross-sectional information. The ability
ultimately comes from the assumption that the instrumental mapping is common
across individuals. In contrast, without modeling the structure of the factor loadings,
even were factors observed then the loading estimation could only rely on time-series

information and thereby achieve convergence at rate T .

From the perspective of the panel literature, the N T convergence rate can be
understood by viewing Γ as a common structural parameter and ft as the time fixed-
effects. This view raises the concern that estimation could be asymptotically biased
in the presence of “incidental parameters” whose number increases with the sample
size (Neyman and Scott, 1948; Lancaster, 2000). Gagliardini and Gourieroux (2014)
note the incidental parameter problem is much less pronounced in the case of time
fixed-effects with large cross sections. We follow them and focus on the T /N → 0
case, in which the asymptotic distribution is centered around zero and no asymptotic
bias correction is needed. Should T /N converge to a positive number, we conjecture
the asymptotic distribution is still normal but with a non-zero mean—exact analysis
32 [1] [1]
The expressions of VX and VY is with the proof in appendix C.11.

26

Electronic copy available at: https://ssrn.com/abstract=2983919


is left to future research. The simulation results we report below show no noticeable
bias.
Finally, assembling the previous calculations into 2.b, we have the asymptotic
distributions for Arrows A and B. The expressions are symmetric for the two cases
and the specific values are numerically verified by Monte Carlo simulation in Section
7.

Theorem 3 (Asymptotic Error — Specific Cases). Under Assumptions A–F, as


N, T → ∞ such that T /N → 0,
√  d −1 0 > 
[1]

b(ΘX ) − γ 0 (ΘX ) →
NT γ − − HX0 > HX0 + JX0 > JX0 HX Normal 0, VX , (3.a)
√  d −1 0 > 
[1]

b(ΘY ) − γ 0 (ΘY ) →
NT γ − − HY0 > HY0 + JY0 > JY0 HY Normal 0, VY . (3.b)

6 Asymptotic Analysis of Factor Estimation


We have so far treated factor estimation fbt (Γ)
b implicitly, because we concentrated-out
ft in the target function (Eq. 7). Now, with the asymptotics of Γ b in hand, we come
back to factor estimation and lay out its asymptotic distribution with relative ease.
We measure estimation error against the normalized true factor for the reasons dis-
cussed in 4.3. Given Γ0 (Θ) is the normalized true Γ, the correspondingly normalized
true factor is fet (Γ0 (Θ)).33
This factor estimation error can be decomposed into two sources following the
discussion around Eq. (10),

    > > −1


b − fet Γ0 (Θ) = fet (Γ)
fbt (Γ) b − fet Γ0 (Θ) + Γb Ct Ct Γ
b b> Ct> eet (Γ).
Γ b (11)

The first term captures the part of factor estimation error only due to the inaccuracy
of Γ estimation. The previous section concluded that Γ b and Γ0 (Θ) converge at the

rate of N T . Hence, the first term has the same rate of convergence. The second
term captures the remaining inaccuracy from the cross-sectional regression’s finite
sample. That is, even were Γ0 (Θ) known, the second term still would dominate at
33
Given a (sample-dependent) R, such that the normalized true Γ is represented as Γ0 (Θ) = Γ0 R,
as R−1 ft0 = R −1 e
 
the correspondingly normalized true factor is inversely rotated
n o ft Γ0 = fet Γ0 R =
  
fet Γ0 (Θ) . In other words, the two pairs, Γ0 , f 0 and Γ0 (Θ), fet Γ0 (Θ) , are rotationally equiv-
t
alent to each other.

27

Electronic copy available at: https://ssrn.com/abstract=2983919



rate N . Based on this decomposition, we arrive at the follow results about the
convergence of factor estimation error.

Theorem 4 (Factor Estimation Asymptotics).


Under Assumptions A–F, with an identification Θ that satisfies IF.1–2,
p
b − fet (Γ0 (Θ)) →
(4.a) Factor estimation is consistent: as N, T → ∞, fbt (Γ) − 0.
(4.b) Factor estimation error centered against the normalized true factor converges

to a normal distribution at the rate of N : as N, T → ∞, ∀t,
√ 
0
 d 
[2]

N ft (Γ) − ft Γ (Θ) →
b b e − Normal 0, Vt .34

The theorem gives the factor’s asymptotic distribution under a generic normaliza-
[2]
tion Θ. Similar to Γ estimation, we can evaluate Vt at either the Γ0 (ΘX ) or Γ0 (Θ0Y )
normalizations.35

7 Simulations
This section presents a concise set of simulations that illustrate the behavior of the
IPCA estimation in finite samples, and assess the accuracy of approximation based
on the asymptotic theory derived above. To summarize, we find that estimation
errors are well-approximated with a normal distribution. This is true even in rather
small samples, and when the true generating process has errors with large variance.
This shows that applied researchers can confidently assume normality for confidence
intervals and hypothesis tests when applying IPCA, as it verifies the asymptotic
derivations in the previous section. We present details below.
For given N, T , we generate a stochastic panel of c, f 0 , e and use these to assemble
the x panel. We calibrate the simulated data to the IPCA model (fixing K = 2 and
L = 10) estimated from US monthly stock returns in KPS.36
Simulations proceed according to the following steps:
34 [2]
The expression of the asymptotic variance Vt is with the proof in Appendix C.14.
35
Appendix B.3 contains the asymptotics when centered by the original ft0 . This situation cor-
responds to Arrow C for Γ estimation, and the the additional stochastic wedge introduced by a
sample-based normalization affects the asymptotic distribution.
36
In particular, we use the ten most significant instruments from KPS as calibration targets. They
are market capitalization, total assets, market beta, short-term reversal, momentum, turnover, price
relative to its 52-week high, long-term reversal, unexplained volume, and idiosyncratic volatility with
respect to the Fama-French three factor model.

28

Electronic copy available at: https://ssrn.com/abstract=2983919


1. Generate factors. Fit a VAR(1) process to estimated IPCA factors from KPS.
Simulate ft0 according to the estimated VAR employing normal innovations.

2. Generate instruments. For each stock, calculate the time-series averages of


the instruments. Pool the demeaned characteristics into a panel and estimate
a ten variable panel VAR(1). Next, for each i, generate the means of ci,t as
an i.i.d. draw from the empirical distribution of the time series means. Then,
simulate the dynamic component of ci,t from the estimated VAR with normal
innovations.37

3. Generate errors. Elements of the error panel e are simulated from an i.i.d.
normal distribution whose variance is calibrated so that the population R2 ,
defined as 1 − Ee2 /Ex2 , equals 20%, matching the empirical R2 in the estimated
model in KPS.

4. Generate main panel. We fix Γ(Θ b Y ) at its empirically estimated value and
calculate xi,t according to model equation (3).

We produce 200 simulated sample panels of dimension N = 200, T = 200. For


each simulated sample, we estimate Γ under the two identification conditions ΘX
and ΘY . Figure 4 reports the histograms of the estimation errors and overlay them
with the theoretical distributions for comparison. Given the data generating process,
the theoretical distributions of estimation errors are approximated by the asymptotic
distributions given in Theorem 3.38
b X ) − Γ0 (ΘX ). It corresponds to The-
The first panel reports estimation error Γ(Θ
orem 3.a, or Arrow A in Figure 3. The four entries at the top are empty, because
the corresponding four entries of Γ are pinned down by the identification condition
and do not need to be estimated. The second panel corresponds to Theorem 3.b, or
b Y ) − Γ0 (ΘY ). Note that while there are K 2 more distribu-
Arrow B. It reports Γ(Θ
tions presented on the right, the two normalizations have exactly the same degrees of
37
We generate ci,t with ex-ante i.i.d. means, so that each individual’s time-series process is non-
ergodic. This is deliberate so that ci,t admits the flexible property allowed by stochastic panels and
resembles real-world instrument data.
38
The required population moments, Ωcc , Ωcef , V[3] etc., are calculated by large sample Monte
Carlo. In the process of calculating these population quantities via Monte Carlo, we rotate the data
generating process ex ante according to the required identification assumption IA. For example, ft0
needs to be inversely rotated when Γ0 is normalized from Γ0 (Θ0Y ) to Γ0 (ΘX ), resulting in a different
value for Ωcef .

29

Electronic copy available at: https://ssrn.com/abstract=2983919


Figure 4: Γ Estimation Errors — Simulated v.s. Asymptotic Approximation

Note: This figure reports the small sample distributions of Γ estimation errors under the two example
normalization cases. We conduct 200 simulations with sample dimensions N = 200, T = 200.
b X ) − Γ0 (ΘX ) (Arrow A). The right panel reports
The left panel reports the distribution of Γ(Θ
0
the distribution of Γ(ΘY ) − Γ (ΘY ) (Arrow B). Each subplot corresponds to one element in the
b
L × K (10 × 2) matrix of Γ. Each histogram plots the simulated estimation errors, which is overlaid
with the asymptotic distributions from Theorem 3 as finite sample approximations. The horizontal
axis range is ±6 theoretical standard deviations, the tick marks are at ±3 theoretical standard
deviations. The vertical axes are probability density for the bell curves or frequency density for the
bars.

freedom. In other words, the top four distributions on the right duplicate information
in the distributions plotted below them.
In all cases, the simulated distributions are centered around zero and match the
theoretical distributions well. For some entries, we observe some skewness and tail
heaviness. These are due to the small sample size and relatively large error variance
in the generating process. In untabulated simulations with N = 1000, T = 1000, the
asymptotics more-fully kick in and the distributions become visually indistinguishable
from a normal. Hence, the simulation results suggest that even with panels of only

30

Electronic copy available at: https://ssrn.com/abstract=2983919


moderate size, our asymptotic approximations are accurate.

8 Applications
This section describes two empirical applications of IPCA to demonstrate its broad
usefulness for analyzing economic data. The first is an application to international
macroeconomics, where IPCA makes it easy analyze many nations’ evolving relation-
ships to global business cycles using country-level instruments. The second applica-
tion builds on KPS and uses IPCA to analyze a dynamic model of asset risk and
expected returns.

8.1 International Macroeconomics


Country-level macroeconomic fluctuations are globally connected (Backus et al., 1992).
Using a static state-space model, Gregory et al. (1997) use maximum likelihood to
document this for the G7 countries. Likewise, Kose et al. (2003) use a static state-
space model estimated with Bayesian methods to disentangle global from regional
and country-specific growth factors for a panel of countries over 30 years. Recently
Kose et al. (2012) (henceforth KOP) used data from the World Development Indica-
tors and estimated their static state-space model for a panel of countries before and
after 1985 in order to analyze the convergence or decoupling of global business cycles.
In essence, they ask: have countries’ relationships to global growth changed as the
countries themselves have evolved? This question is ideally suited for investigation
with IPCA.
We use IPCA to analyze the global factor structure in GDP growth using data
from the World Development Indicators database. We include as many countries as
possible from the “industrial/developed” and “emerging” country groups studied by
KOP. After reasonable data filters on indicators and countries, which we detail in the
Appendix E, we are left with 45 countries. Within this sample there are nine variables
that are available for most countries, so we use these as our instruments. The first
two instruments are the import and export share of GDP—natural indicators of a
country’s economic connectedness with the rest of the world. Next we use the propor-
tion of GDP relative to world GDP to measure the nation’s relative size. We account
for dynamics in capital intensity using gross capital formation, and we use popula-

31

Electronic copy available at: https://ssrn.com/abstract=2983919


Figure 5: MAE and Average Loading
Industrial/developed Emerging
3 1.5 5 2

4 1.5

2 1

3 1

1 0.5

2 0.5

MAE MAE
Mean loading Mean loading
0 0 1 0
1960 1970 1980 1990 2000 2010 2020 1960 1970 1980 1990 2000 2010 2020

Notes: The left axis shows the IPCA model mean absolute error in units of percentage
annual growth and corresponds to the blue dotted line. The right axis corresponds to the
equally weighted cross-sectional mean factor loading in each group, shown in red.

tion growth to account for growth in the labor force. To account for recent economic
growth and risks, we include the 5-year rolling mean and volatility of the nation’s
GDP growth and its rate of inflation. Finally, we include a constant characteristic.
Next, we double the set of instruments to 18 by interacting the nine variable above
with an indicator for whether a country is in industrial/developed group. Our annual
data run from 1961 to 2015 so that T = 55, and we demean the growth rates following
KOP. About 91% of the 5,280 possible country-year observations are non-missing.
We study latent factor models with K = 1 and compare IPCA to the static-
loading PCA estimator. We find a panel R2 from the IPCA model of 32%, capturing
roughly triple the variation in demeaned country growth explained by KOP. The R2
from PCA is 22%, or two-thirds that achieved by IPCA. When making a head-to-head
comparison of PCA and IPCA it is important to keep in mind two major differences
between the estimators. The first is their stark difference in parameterization. IPCA
achieves its fit using only 18 parameters to estimate its loadings, or 60% fewer pa-
rameters than the 45 used by PCA.39 Second, IPCA accommodates dynamics in each
country’s loading on the global growth factor while PCA estimates static loadings. If
countries converge or decouple as they evolve, IPCA’s dynamic loadings are capable
of detecting this. PCA’s static loadings, on the other hand, cannot detect such dy-
namics and will instead try to fit an evolving system with a static model, and this
type of misspecification is difficult to diagnose.
Results from IPCA show that beta dynamics are indeed critical to understanding
39
These parameter counts are net of the 45 country-specific means used to demean the data in
both IPCA and PCA.

32

Electronic copy available at: https://ssrn.com/abstract=2983919


Table 1: Global Growth Model Estimates

GDP -0.27∗ Ind×GDP 0.44∗


Capital Formation -0.09 Ind×Capital Formation 0.21
Exports 0.45 Ind×Exports -0.48
Imports -0.34 Ind×Imports 0.32
Inflation 0.23∗ Ind×Inflation -0.05
Pop. Growth -0.31 Ind×Pop. Growth 0.17
Growth Vol. 0.79∗ Ind×Growth Vol. 0.06
Mean Growth 0.00 Ind×Mean Growth -0.19∗
Constant -0.60∗ Ind×Constant 0.69∗

Notes: Estimated Γ coefficients scaled by the panel standard deviation of each instrument
(the exceptions are Constant and Ind×Constant which are unscaled). The instruments
are the log of GDP as a fraction of world GDP, gross capital formation, export and import
share of GDP, inflation, population growth, 5-year rolling mean and volatility of GDP
growth, and a constant. Each instrument is also interacted with an indicator for inclusion
in the “industrial/developed” country group. An asterisk denotes statistical significance
at the 10% level or better (using a bootstrap test following KPS).

the global business cycle. Figure 5 shows the time series of loadings on the global
growth factor broken out by industrial/developed economies and emerging economies.
For readability, we report the equally weighted cross-sectional average loading within
each group of countries.
We see substantial variation in global growth sensitivity in each group. This is
underpinned by an interesting state dependence in loadings—they rise sharply in eco-
nomic downturns. While this is visually evident in the plot for industrial/developed
countries, the precise nature of state dependence in loadings can be read from the es-
timated Γ matrix, which is shown in Table 1. To make estimates more interpretable,
we scale each element of Γ to describe the effect on factor loadings from a one standard
deviation increase in the associated instrument.
First, the constant and its interaction with the industrial dummy show that load-
ings in the industrial/developed group are significantly higher than those in emerg-
ing economies. But the largest and perhaps most interesting finding is the role of
growth volatility in describing state dependence in global growth sensitivity. A well
documented pattern in the business cycle literature is the spike in growth volatility
associated with recessions (Bloom, 2014).Table 1 shows that such rises in volatility
are accompanied by concomitant rises in sensitivity to the global growth factor. It
also shows that the dependence of growth sensitivity is one of the few instruments

33

Electronic copy available at: https://ssrn.com/abstract=2983919


that plays a similar role in both developed and emerging countries. For the emerg-
ing group, a one standard deviation increase in growth volatility associates with an
increased loading of 0.79 on the global growth factor, versus 0.85 for the industrial
group (the difference is insignificant). The only other instrument that the two groups
agree on is inflation. We see that higher inflation associates with high global sensi-
tivity, that this effect is slightly stronger in emerging economies (estimate of 0.23),
but that the difference versus developed economies is insignificant. For the remaining
instruments, our estimates show significant differences in the drivers of global expo-
sure. Emerging economies are especially sensitive to global fluctuations if they have
high exports, low imports, are relatively small, have high inflation, and have low pop-
ulation growth. In industrial/developed group, the effects of most instruments other
than volatility net to nearly zero. One other significant effect in industrial/developed
countries is that global sensitivity rises when recent growth has been low (based on
the significant coefficient of −0.19 on recent mean growth). This compounds the
jump in global sensitivity associated with a rise in volatility, because recent growth
volatility and recent mean growth are negatively correlated.
Finally, we see some broad evidence of global convergence from the dotted lines
shown in Figure 5. These describe the mean absolute error (MAE) from the IPCA
model each year. They show that over time the global growth factor has become
increasingly successful at describing the full panel of growth rates. This is evident
from the downward trend in MAE among both industrial and emerging economies.
Overall, IPCA results illustrate an important role for dynamic loadings in global
business cycle models and show that IPCA’s ability to accommodate such dynamics
ultimately delivers a more accurate description of the data than leading alternatives.

8.2 Asset Pricing


KPS apply IPCA to describe systematic risk and associated risk premium (cost of
capital) of US stocks, where systematic risk is defined as dynamic loadings on latent
factors that are instrumented by stock characteristics. Here, we expand upon their
asset pricing context and use IPCA to calculate systematic risk and cost of capital for
newly listed firms that the model has not seen before. This is motivated by the chal-
lenging question of how to value private firms and if it is possible to use information
in publicly traded equity prices for this purpose. Like in the macroeconomic example,

34

Electronic copy available at: https://ssrn.com/abstract=2983919


IPCA is ideally suited to address this question because the model parameterizes risk
and cost of capital as a function of firm characteristics. IPCA finds the mapping
between the return behavior of traded firms and their characteristics, which can then
be extrapolated to non-traded firms to approximate their as-if traded value. Note
that this is not possible with standard empirical asset pricing methodologies, which
require the history of publicly traded prices for individual assets to infer their future
risk premia (and thus cost of capital).
Our data consists of over 2.9 million stock-month observations of excess stock
returns (x) and 93 associated firm characteristics (z) from 1965-2018.40 We use
lagged firm characteristics to instrument for the conditional systematic risk loadings
(ci,t = zi,t−1 ).41 The out-of-sample evaluation is performed for newly listed firms,
defined as the first 12 months following a firm’s initial public offering (IPO). In our
data, 21,275 firms have an IPO, comprising 249,414 out-of-sample stock-month obser-
vations. In-sample estimation is based on the complementary 2.6 million observations
of incumbent stocks that excludes those in the test sample of new listings.
We calculate the total R2 and predictive R2 , as defined in KPS, to evaluate the
estimates of stocks’ systematic risk and expect returns, respectively. The procedure
starts by first estimating Γ b and {fbt } within the in-sample data set of incumbent
stocks. Then, the estimates are brought to the data of new listings to calculate fitted
values as
bTot
x i,t+1 := zi,t Γft+1 ,
bb bPred
x i,t+1 := zi,t Γλ
bb

for all i, t in the out-of-sample data set. The first term x bTot
i,t+1 reconstructs the realized
return as the factor model’s fitted value (that is, using the factor realization, fbt+1 ).
The second term is a prediction of the new listing return and replaces the factor
realization with the factor’s estimated mean λ, b directly following KPS.42 Based on
these fits, we calculate the out-of-sample total and predictive R2 as the explained
variation in xi,t+1 due to x bTot bPred
i,t+1 and x i,t+1 , respectively.
Table 2 reports the results for the IPCA model (with K = 4, as advocated by
KPS). The close similarity of the in-sample and out-of-sample total R2 indicates the
40
The dataset is from Gu et al. (2020). Firm characteristics are transformed into ranks on the
interval [−0.5, 0.5] as in KPS. Any missing characteristic is assigned the value 0, which is replacement
with the cross-sectional mean/median.
41
The information content of unexpected idiosyncratic return shocks is formalized by Assumption
A.
42 b
λ is calculated as the in-sample time series mean of fbt+1 .

35

Electronic copy available at: https://ssrn.com/abstract=2983919


Table 2: Explained Variation of Stock Returns

Total R2 Predictive R2
Incumbent stocks (in-sample) 15.66 0.25
New listings (out-of-sample) 13.44 0.22
Notes: R2 in percentage. Based on IPCA with K = 4 for the 1965-2018 US stock-month
panel.

same characteristics that determine the systematic riskiness of incumbent stocks also
determine the riskiness of new listings. In other words, once we condition on firm
characteristics, the model finds a highly similar description for the common variation
among returns on newly listed stocks compared to the common variation in returns
on incumbents.
While the total R2 is indicative of the model’s ability to describe systematic risks
of new listings, the predictive R2 summarizes the model’s description of their ex-
pected returns (or, in equivalent terms, cost of capital or discount rates). That is,
the predictive R2 is especially informative about the usefulness of the model for asset
valuation. The close similarity of the in-sample and out-of-sample predictive R2 indi-
cates that the IPCA model is as effective at “pricing” new listings as it is for pricing
incumbent stocks. The most important takeaway is that the model does this without
using the individual return history of the new listings (which is of course unavailable
and the crux of the research question), but manages to price them nonetheless by
extrapolating what it learns from data on incumbent stocks.43

9 Conclusion
This paper has introduced a new approach of modeling and estimating the latent
factor structure of panel data, called Instrumented Principal Component Analysis
(IPCA). The key innovation is using additional panel data to instrument for the
dynamic factor loadings. Mainly, each individual’s time-varying factor loading is
related to instrumental data according to a common and constant mapping.
Estimating this mapping, rather than the factor loadings directly, has many econo-
metric advantages compared to other latent variable estimators like PCA. On one
43
This performance is even more remarkable when we recognize that new firms’ stock returns are
more variable than incumbents.

36

Electronic copy available at: https://ssrn.com/abstract=2983919


hand, the mapping’s parameterization tends to be more parsimonious. Its degrees of
freedom are fixed and not increasing with the size of the cross section, which improves
the rate of convergence and tends to avoid over-fit problems in high-dimensional ap-
plications. At the same time, IPCA brings broad possibilities of economic discovery
by harnessing the wealth of additional panel information as well as relying on eco-
nomic theories that indicate relationships between observable covariates and factor
exposures. These advantages are exemplified with two applications to equity returns
and international macroeconomics respectively.
Our main theoretical contribution is to derive the statistical properties of the
IPCA estimator. We show consistency and asymptotic normality under general data
generating processes. We emphasize generality with two main theoretical innova-
tions. First, the fundamental construction of stochastic panels establishes a collec-
tion of large panel asymptotic results under general conditions without resorting to
higher-level assumptions. Second, our method deals with the well-known rotational
unidentification issue in latent factor analysis in general terms rather than under spe-
cific normalization assumptions. We show how the choice of normalization affects the
rate of convergence and the asymptotic distribution. This identification framework is
applicable to estimators other than IPCA that also require additional normalization.

37

Electronic copy available at: https://ssrn.com/abstract=2983919


References
Acemoglu, D. and Azar, P. D. (2020). Endogenous Pro-
duction Networks. Econometrica, 88(1):33–82. eprint:
https://onlinelibrary.wiley.com/doi/pdf/10.3982/ECTA15899.

Backus, D. K., Kehoe, P. J., and Kydland, F. E. (1992). International Real Business
Cycles. Journal of Political Economy, 100(4):745–775. Publisher: University of
Chicago Press.

Bai, J. (2003). Inferential Theory for Factor Models of Large Dimensions. Economet-
rica, 71(1):135–171.

Bai, J. and Ng, S. (2002). Determining the Number of Factors in Approximate Factor
Models. Econometrica, 70(1):191–221.

Bai, J. and Ng, S. (2013). Principal components estimation and identification of static
factors. Journal of Econometrics, 176(1):18–29.

Büchner, M. and Kelly, B. (2020). A Factor Model for Option Returns. Yale Univer-
sity Working Paper.

Bloom, N. (2014). Fluctuations in Uncertainty. Journal of Economic Perspectives,


28(2):153–176.

Del Negro, M. and Otrok, C. (2008). Dynamic factor models with time-varying
parameters: measuring changes in international business cycles. FRB of New York
Staff Report, (326).

Fama, E. F. and French, K. R. (1993). Common risk factors in the returns on stocks
and bonds. Journal of Financial Economics, 33(1):3–56.

Fan, J., Liao, Y., and Wang, W. (2016). Projected principal component analysis in
factor models. The Annals of Statistics, 44(1):219–254.

Gagliardini, P. and Gourieroux, C. (2014). EFFICIENCY IN LARGE DYNAMIC


PANEL MODELS WITH COMMON FACTORS. Econometric Theory, 30(5):961–
1020. Publisher: Cambridge University Press.

38

Electronic copy available at: https://ssrn.com/abstract=2983919


Gagliardini, P., Ossola, E., and Scaillet, O. (2016). Time-Varying Risk Premium in
Large Cross-Sectional Equity Data Sets. Econometrica, 84(3):985–1046.

Geweke, J. (1977). The dynamic factor analysis of economic time series. Latent
variables in socio-economic models.

Gregory, A., Head, A., and Raynauld, J. (1997). Measuring world business cycles.
International Economic Review, 38(3):677–701.

Gu, S., Kelly, B., and Xiu, D. (2020). Empirical Asset Pricing via Machine Learning.
The Review of Financial Studies, 33(5):2223–2273. Publisher: Oxford Academic.

Hansen, L. P. and Sargent, T. J. (2013). Risk, Uncertainty, and Value.

Kelly, B. T., Pruitt, S., and Su, Y. (2019). Characteristics are covariances: A unified
model of risk and return. Journal of Financial Economics, 134(3):501–524.

Kose, A., Otrok, C., and Prasad, E. (2012). Global business cycles: convergence or
decoupling? International Economic Review, 53(2):511–538.

Kose, M. A., Otrok, C., and Whiteman, C. H. (2003). International Business Cy-
cles: World, Region, and Country-Specific Factors. American Economic Review,
93(4):1216–1239.

Lancaster, T. (2000). The incidental parameter problem since 1948. Journal of


Econometrics, 95(2):391–413.

Newey, W. K. and McFadden, D. (1994). Chapter 36 Large sample estimation and


hypothesis testing. In Handbook of Econometrics, volume 4, pages 2111–2245.
Elsevier.

Neyman, J. and Scott, E. L. (1948). Consistent Estimates Based on Partially Con-


sistent Observations. Econometrica, 16(1):1–32. Publisher: [Wiley, Econometric
Society].

Primiceri, G. E. (2005). Time Varying Structural Vector Autoregressions and Mone-


tary Policy. The Review of Economic Studies, 72(3):821–852.

Pruitt, S. (2012). Uncertainty Over Models and Data: The Rise and Fall of American
Inflation. Journal of Money, Credit and Banking, 44(2-3):341–365.

39

Electronic copy available at: https://ssrn.com/abstract=2983919


Sargent, T. J. and Sims, C. A. (1977). Business cycle modeling without pretending to
have too much a priori economic theory. New methods in business cycle research,
1:145–168.

Stock, J. H. and Watson, M. W. (2002). Forecasting Using Principal Components


from a Large Number of Predictors. Journal of the American Statistical Association,
97(460):1167–1179.

Su, L. and Wang, X. (2017). On time-varying factor models: Estimation and testing.
Journal of Econometrics, 198(1):84–101.

40

Electronic copy available at: https://ssrn.com/abstract=2983919


Appendix

A Notation Details
Matrix Operations: vect (A) vectorizes matrix A to a column by going right first,
then down. Throughout the paper, γ is always vect (Γ), same for all the other deco-
b Γ0 (Θ), etc. Related, veca (A) stacks A’s upper triangle entries, including the
rated Γ,
diagonal, in a vector by going right first, then down; vecb (A) stacks A’s upper tri-
angle entries, not including
h the diagonal, ini a vector by going right first, then down.
For example, let A = 1, 2, 3; 4, 5, 6; 7, 8, 9 , then vect (A) = [1, 2, 3, 4, 5, 6, 7, 8, 9]> ,
veca (A) = [1, 2, 3, 5, 6, 9]> , vecb (A) = [2, 3, 6]> .
[A; B] means the two matrices vertically stacked.
k·k is the Euclidean norm of a vector, or the Frobenius norm if the input is a
matrix. k·k2 is the square of k·k, or the sum of squares across all entries of the vector
or matrix.
P P
Summations like i , t are always in the range of the panel sample: i = 1 to N ,
t = 1 to T , without writing out the details.

B Results About Arrow C


B.1 Asymptotic Distribution of Γ(Θ
b Y)

This subsection works out the asymptotic distribution of Γ(Θb Y )−Γ0 (Θ0 ), or Arrow C
Y
0 0
in Figure 3. Since the reference point Γ (ΘY ) is deterministic, this analysis provides
the complete asymptotic distribution of the estimator Γ(Θb Y ). In contrast to Arrow B
(studied in Section 5.2), Arrow C also depends on the randomness of the identification
ΘY . We rely on the general result 2.a for the asymptotic distribution. Subsection
5.2 has already calculated three of the four right-hand side inputs: HY0 , JY0 and the
asymptotic distribution of S (Γ0 (Θ0Y )). The fourth piece that needs to be analyzed is
the asymptotic distribution IY (Γ0 (Θ0Y )).
Identification function IY (Γ), as defined in 4.2 is stacked up by two parts. The
top 12 K(K + 1) entries are from ortho-normalizing Γ, which is irrelevant of sample
information. Therefore, similar to the ΘX case, the top part of IY (Γ0 (Θ0Y )) is still
deterministic. The bottom 12 K(K −1) rows of IY (Γ) are from diagonalizing the sample

41

Electronic copy available at: https://ssrn.com/abstract=2983919


second moment matrix of fbt , a process that introduces a rotational contamination

with convergence rate T . To see why it is this rate, suppose the true ft0 is observed,
which follows a diagonal second moment matrix in population. However, its sample
second moment would not be exactly diagonal, off by a sample error that is in the

order of 1/ T . Imposing the sample normalization requires rotating ft0 slightly just
to offset that error.

Lemma 4 (Asymptotic distribution of IY (Γ0 )). Under Assumptions A–F,


" #!
√ d 0 0
Γ0 (Θ0Y ) →

T IY − Normal 0K 2 ×1 , ,
0 V[3]

where V[3] is specified in Assumption D(3).44

Lemma 4 implies the “contamination effect” that the normalization step intro-

b Y ) converges at the rate of T . Surprisingly, this is even slower than
duces to Γ(Θ

that of the core optimization step, which converges at N T according Lemma 3.
Hence it solely shows up in the asymptotic distribution of Γ(Θ
b Y ) as the dominant

term. In summary, the convergence rate of Arrow C is T , while those of Arrows A

and B are both N T .

Theorem 5 (Asymptotic Distribution of Arrow C). Under Assumptions A–F, as


N, T → ∞,
" #!
√ d −1 0 0
b(ΘY ) − γ 0 (Θ0Y ) →− − HY0 > HY0 + JY0 > JY0 JY0 > Normal 0,
 
T γ .
0 V[3]

B.2 Simulation Results about Arrow C


Figure 6 presents the simulation results about Arrow C. The results are from the same
simulations from Section 7. The first panel reports the simulated error of Arrow
C and compare that with the theoretical ones from Theorem 5. Except for some
entries that roughly match, the simulated small-sample distributions are wider than
the theoretical distributions by orders of magnitude. This is purely a finite sample
phenomenon, the asymptotics have no problem.
44
In the variance matrix, the 0 block in the top left is 12 K(K + 1) × 12 K(K + 1), top right is
1 1 1
2 K(K + 1) × 2 K(K − 1), lower left is 2 K(K − 1) × 12 K(K + 1). Together with V[3] of 21 K(K −
1
1) × 2 K(K − 1), they make up the variance matrix of K 2 × K 2 .

42

Electronic copy available at: https://ssrn.com/abstract=2983919


Figure 6: Γ Estimation Errors — Simulated v.s. Asymptotic Approximation

Note: This figure is constructed with the same simulation exercises in Figure 4, and also follows
the same format. The simulated distribution of a new set of random variables (written at the top
of each panel) are reported. The corresponding theoretical distributions are from Theorems 5, 3.b,
and 5 respectively. Additionally, the x-axes ticks are marked with numbers for comparison. Again,
the x-axes range is ±6 theoretical standard deviations, and the tick numbers mark ±3 theoretical
standard deviations.

To diagnose the issue, we decompose b Y ) − Γ0 (Θ0 ) = (Γ(Θ b Y) −


 √Arrow  C asΓ(Θ√  Y
Γ0 (ΘY )) + (Γ0 (ΘY ) − Γ0 (Θ0Y )) = Op 1/ N T + Op 1/ T , and respectively plot
the two terms in the second and third panels. Although the first term (arrow B) is
asymptotically small, its level, as reported in the second panel, is still large at N =
200, T = 200.45 As a result, the first term messes up the small sample distribution of
Arrow C, which theoretically is only driven by the second term. In the third panel,
we ignore the first term and only look at the dominant second term. It conforms to
the asymptotic theory from Theorem 5 well. In this sense, the third panel verifies the
derivation of large sample asymptotics of the theorem.
So N, T -orders aside, why the asymptotic variance of the first term (Arrow B) is
45
The second panel of Figure 6 is a copy of the second panel of Figure 4, except we now include
the x-axis labels for comparing theoretical variances.

43

Electronic copy available at: https://ssrn.com/abstract=2983919


much larger than the second? Retracing the analyses, Arrow B hinges on the accuracy
of the first order condition, which eventually depends on the N, T -CLT of c> ef 0>
(Assumption D(1)). While the second depends on the accuracy of the identification
condition, which is about the time-series CLT of f 0 f 0> only (Assumption D(3)). So
the relative levels of the variances of e to f 0 is critical here, which can be roughly
measured and calibrated to empirical data by R2 . As a check, scaling down the
variance of e to an unrealistic level makes the first panel also converge well (detailed
plot omitted).

B.3 Factor Estimation Error Measured Against {ft0 }


Section 6 analyzed the factor estimation error against the normalized true fet (Γ0 (ΘY )).
Now we measured factor estimation against ft0 . We know ft0 = fet (Γ0 ). The analogue
of Eq. (11) is
 
ft Γ(ΘY ) − ft0
b b
     −1  
0 0 > > > >
= ft Γ(ΘY ) − ft Γ (ΘY ) + Γ(ΘY ) Ct Ct Γ(ΘY )
e b e b b Γ Ct eet Γ(ΘY ) .
b b

The only difference happens at the first term, which is now driven by the estimation

b Y ) and Γ0 (Θ0 ) (Arrow C), which has a convergence rate of T
error between Γ(Θ Y
according to Subsection B.1. This implies the additional stochastic wedge introduced
by a sample-based normalization
 mightno longer be dominated by the √ second term.

b Y ) and f 0 converge the slower of N or T .
The following Theorem says fbt Γ(Θ t

Theorem 6 (Factor Estimation Error Measured Against {ft0 }). Under Assumptions
A–F, as N, T → ∞,
   n √ √ o
b Y ) − ft0 = Op max 1/ N , 1/ T
fbt Γ(Θ .

C Proofs of All Theoretical Results


C.1 Proof of Lemma 1
Proof. This is a direct application of Birkhoff Law of Large Numbers (Hansen and
Sargent, 2013, Theorem 2.5.1). Notice in the two right-hand sides, the dot “·” can take

44

Electronic copy available at: https://ssrn.com/abstract=2983919


any natural number without changing the value of the conditional expectation.

C.2 Lemma 5 and its Proof


The joint ergodicity condition SP.4 first implies the single-subscript random variables
are ergodic, and hence their averages converge to the unconditional expectations:

Lemma 5 (Single-subscript LLN). Under Conditions SP.1–4, let X [d] be F[d] -measurable
2
(single-subscript) random vectors, and if E X [d] < ∞, then

N T
1 X [cs] L2 1 X [ts] L2
Xi −→ EX [cs] , as N → ∞ and Xt −→ EX [ts] , as T → ∞.
N i=1 T t=1

[cs] L2
Proof. Apply Lemma 1, we have N1 N
P  [cs] [ts] 
i=1 X i −
→ E X F . It remains to show
[cs] [cs] [cs]
Xi is ergodic. We know X is measured by F , within which every event is
S[ts] -invariant. So an S[cs] -invariant event within F[cs] must be invariant to bothn trans- o
[cs]
formations. By condition SP.4, it has probability either 0 or 1. That is to say, Xi
is an ergodic stochastic process (in the traditional one-directional sense). That implies
 
E X [cs] F[ts] = EX [cs] w.p. 1, which in turn gives the desired m.s. convergence.
The other direction is symmetric.

C.3 Proof of Lemma 2


Combining Lemmas 1 and 5 yields a result that the panel-wise average can recover
the population moment when both N and T are large.

Proof.
!
1 XX 1X 1 X 1X 
Xi,t − E X·,t F[ts] E X·,t F[ts]
  
Xi,t = + (12)
NT t i T t N i T t

L1
We are going to show first term −→ 0 (N → ∞, uniformly across T ), second term
L1
−→ EX, as N, T → ∞ (T → ∞, uniformly across N ).

45

Electronic copy available at: https://ssrn.com/abstract=2983919


For any T , apply limN →∞ E k·k to the first term
!
1 X 1 X  [ts] 
lim E Xi,t − E X·,t F (13)

T N i

N →∞
t

1X 1 X  [ts] 
≤ lim Xi,t − E X·,t F (14)

E
N →∞ T N
t i

1 X  [ts] 
= lim E Xi,t − E X·,t F = 0. (15)

N →∞ N
i

We applied a triangular inequality, stationarity by condition SP.3, and Lemma 1


(notice L2 convergence implies L1 convergence), in that order.
 
For the second term, notice the summand E X·,t F[ts] is F[ts] -measurable. There-
fore, Lemma 5 implies L2 convergence which in turn implies L1 convergence towards
the unconditional expectation.

C.4 Verify ΘY is an identification condition


Lemma 6. Set ΘY is an identification condition. I.e., for any Γ ∈ Ψ there is a
unique Γ0 ∈ ΘY that is rotationally equivalent to Γ.

Proof. We proof by first constructing the normalization Γ0 first, and then show it is
unique. Before that, we notice a relationship:

1Xb 1 X b b>
ft (ΓR) fbt > (ΓR) = R−1 ft (Γ)ft (Γ)R−1> (16)
T t T t

normalization procedure: Starting from a Γ, we look for a ΓR ∈ ΘY .


1. Find Cholesky decomposition: [Chol]> [Chol] = Γ> Γ.
2. Calculate: [OldV] = T1 t fbt (Γ)fbt> (Γ).
P

3. Find Eigen-decomposition: [Chol] [OldV] [Chol]> = [Orth] [Diag] [Orth]> such


that [Diag] is has descending diagonal entries.
4. Finally, we find the normalization as Γ0 = Γ [Chol]−1 [Orth].
(5) In addition, it might be required to align the signs of the K columns of Γ0
following the discussion in Appendix D.
To verify the normalization procedure is correct, we need to check Γ0 ∈ ΘY . We
verify that: Γ0> Γ0 = [Orth]> [Chol]> Γ> Γ [Chol]−1 [Orth] = IK . And, based on Eq. 16,

46

Electronic copy available at: https://ssrn.com/abstract=2983919


we verify that:

1 X b 0 b> 0
ft (Γ ) ft (Γ ) = [Orth]−1 [Chol] [OldV] [Chol]> [Orth]−1 > = [Diag] (17)
T t

normalization is unique:
We need to show such an Γ0 is unique. Since Γ0 = ΓR, we just need to show such
an R is unique. We inspect the restrictions of ΘY and shrink the set of possible R
down to singularity, following the above normalization procedure:
First, given ΘY restriction [1], R> Γ> ΓR = IK . The possible R must have de-
composition: R = [Chol]−1 [Orth], where [Orth] can only be ortho-normal matri-
ces ([Orth]> [Orth] = I). Plug this decomposition into ΘY restriction [2], we need
an [Orth] that also satisfies [Orth]−1 [Chol] [OldV] [Chol]> [Orth]−1 > = [Diag]. We
have found setting [Orth] as the eigen-decomposition satisfies. Importantly, when
we restrict the eigen-decomposition with distinct and decreasing eigen-values such an
eigen-decomposition is unique.

C.5 Some Lemmas for Uniform Large-N, T Convergence


Let xN,t (Γ) represent a sequence (indexed by N ) of stochastic process functions of
Γ with finite dimensions. Define several large N limiting conditions that it can be
subject to. These conditions are all uniformly across Γ, and prefixed with “U-”.
 
lim E sup kxN,t (Γ)k = 0 (U-mean converging)
N →∞ Γ∈Ψ
 
2
lim E sup kxN,t (Γ)k = 0 (U-mean square converging)
N →∞ Γ∈Ψ
 
2
∃M, N ∗ , s.t. E sup kxN,t (Γ)k < M ∀N > N ∗
Γ∈Ψ
(U-mean square bounded)
 
∃M, N ∗ , s.t. P r sup kxN,t (Γ)k < M = 1, ∀N > N ∗ (U-a.s. bounded)
Γ

Next, Lemma 7 is the upgrade version of Lemma 2 to deal with “uniform across Γ”.

Lemma 7. If xN,t (Γ) is stationary in t and U-mean converging, then its time-series
average is converging in the large-N probability limit uniformly for any T . That is,

47

Electronic copy available at: https://ssrn.com/abstract=2983919


∀, δ > 0, ∃N [1] s.t. ∀T and ∀N > N [1]
( )
1 X
P r sup xN,t (Γ) >  < δ. (18)

Γ∈Ψ T
t

P
Proof. Start by an inequality exchanging the order of sup and :

1 X 1X 1X
sup xN,t (Γ) ≤ sup kxN,t (Γ)k ≤ sup kxN,t (Γ)k . (19)

Γ∈Ψ T t Γ∈Ψ T
t
T t Γ∈Ψ

Apply expectation on both sides, and by stationarity, ∀T



1 X 1X
E sup xN,t (Γ) ≤ E sup kxN,t (Γ)k = E sup kxN,t (Γ)k . (20)

Γ∈Ψ T t T t Γ∈Ψ Γ∈Ψ

The last term E supΓ∈Ψ kxN,t (Γ)k is irrelevant of T , and it converges to zero as N →
∞, according to the precondition about U-mean converging. Hence, the first term is
T -uniform large N -convergence: ∀ > 0, ∃N [1] s.t. ∀T and ∀N > N [1]

1 X
E sup xN,t (Γ) < . (21)

Γ∈Ψ T t

Therefore, by Chebyshev’s inequality, we can make conclusion statement about T -


uniform N -convergence in probability.

Lemma 7 is critical to establish large N, T simultaneous convergence. Notice in


the conclusion statement of the Lemma, N [1] does not depend on T but only on , δ.
Hence, this statement implies large N, T simultaneous convergence, a property that
will be used in the proof of Proposition 1 later.
Lemma 7 also shows U-mean converging is important since it is the necessary
condition for large-N, T simultaneous convergence. The next lemma gives some cal-
culation rules to reach a U-mean converging sequence.
[1] [2]
Lemma 8. If xN,t (Γ) is U-mean square converging, xN,t (Γ) is U-mean square bounded,
[3]
and xN,t (Γ) is U-a.s. bounded, then
[1] [2]
1. xN,t (Γ)xN,t (Γ) is U-mean converging, which implies
[1]
2. xN,t (Γ) by itself is also U-mean converging.

48

Electronic copy available at: https://ssrn.com/abstract=2983919


[1] [3]
3. xN,t (Γ)xN,t (Γ) is still U-mean square converging.
[2] [3]
4. xN,t (Γ)xN,t (Γ) is still U-mean square bounded.

Proof. 1. For each ω, we have



[1] [2] [1] [2]
sup kxN,t (Γ)k = sup xN,t (Γ)xN,t (Γ) ≤ sup xN,t (Γ) sup xN,t (Γ) . (22)

Γ Γ Γ Γ

So, put inside expectation:


 
[1] [2]
E sup kxN,t (Γ)k ≤ E sup xN,t (Γ) sup xN,t (Γ) (23)

Γ Γ Γ
  2   2 1/2
[1] [2]
≤ E sup xN,t (Γ) E sup xN,t (Γ) (24)

Γ Γ

by Cauchy-Schwarz inequality. Then it is easy to wrap up the proof with deterministic


limit analysis. Namely the product of a sequence converging to zero and a bounded
sequence is also converging to zero, and the square root of a converging sequence
converges to the square root.
2. Trivial.
3. By a matrix version of the Cauchy-Schwarz inequality:
2 2 2
[1] [3] [1] [3]
kxN,t (Γ)k2 = xN,t (Γ)xN,t (Γ) ≤ xN,t (Γ) xN,t (Γ) (25)

Apply E supΓ on both sides:


2 2
[1] [3]
E sup kxN,t (Γ)k2 ≤ E sup xN,t (Γ) xN,t (Γ) (26)

Γ
Γ 2 2 
[1] [3]
≤ E sup xN,t (Γ) sup xN,t (Γ) (27)

Γ Γ

2
[3] [3]
For any ω, if supΓ∈Ψ xN,t (Γ) < M , then supΓ∈Ψ xN,t (Γ) < M 2 . That means

2
[3]
xN,t (Γ) is also U-a.s. bounded for large enough N ’s. Plug the bound, M 2 , back

into the expectation calculation for any finite N we had above:


 2 
2 [1]
E sup kxN,t (Γ)k ≤ E sup xN,t (Γ) M 2 (28)

Γ Γ

49

Electronic copy available at: https://ssrn.com/abstract=2983919


Take large-N limits on both sides:
 2 
2 [1]
lim E sup kxN,t (Γ)k ≤ lim E sup xN,t (Γ) M 2 = 0. (29)

N Γ N Γ

[1] [2]
4. Almost the same the as the previous proof. Just change xN,t (Γ) to xN,t (Γ) every-
where until the last three lines. In the last three lines, just change “limN = 0” to
“lim supN < ∞”.

Lemma 9 builds the necessary conditions for U-mean square converging from low
level conditions for the primitives for example Assumption A. It is closely related to
Lemma 1.

Lemma 9. If xi,t is a stochastic panel with zero time-series conditional expectation


with bounded unconditional second moment:

E x F[ts] = 0 and E kxk2 < +∞,


 
(30)

1 46
P
then its cross-sectional average N i xi,t is U-mean square converging.

Proof. Apply Lemma 1 to x:

N
1 X L2
Xi,t −→ 0, ∀t. (31)
N i=1

Because x is irrelevant of Γ, it is easy to see the mean squared convergence result


above implies U-mean square converging.

C.6 Proposition 1
C.6.1 Complete the statement of Proposition 1

We first state the complete version of the Proposition 1 with the addition of an
intermediate large-N result and writing the expressions of the probability limits. .
46
notice here Γ does not enter in the random function.

50

Electronic copy available at: https://ssrn.com/abstract=2983919


Proposition 1 (Uniform Convergence of the Score Function).
Under Assumptions A–C, the score function converges uniformly in probability:

p
sup kS(Γ) − ST (Γ)k →− 0, N → ∞, ∀T, (32)
Γ∈Ψ
p
sup S(Γ) − S 0 (Γ) →
− 0, N, T → ∞, (33)
Γ∈Ψ

where

1X 
cc 0 e>

ST (Γ) := vect Ωt Πt (Γ)ft ft (Γ) , (34)
T t
 
0 e>
S 0 (Γ) := E vect Ωcc
t Π t (Γ)f f
t t (Γ) , (35)
 −1 > cc  0
Πt (Γ) := IL − Γ Γ> Ωcc t Γ Γ Ωt Γ . (36)

C.6.2 Preparations for the Proof of Proposition 1

The proof of Proposition 1 is quite involved. The first part manipulates the expression
of the score function to a form consisting of primitive random variables. Second,
Lemma 10 deals with the cross-section convergence. Then, Lemma 11 deals with the
time-series dimension. In the final step, the results are put together to finish the
proof of for the large N, T convergence in Proposition 1.
The first part of proof manipulates the expression of the score function to a form
consisting of primitive random variables. Then, we analyze its uniform convergence.
We already have

1 X > b   1 X  
S(Γ) = Ct ⊗ ft (Γ) xt − Ct Γfbt (Γ) = vect Ct> ebt (Γ)fbt> (Γ) , (37)
NT t NT t

and
−1 > >
fbt (Γ) = fet (Γ) + Γ> Ct> Ct Γ Γ Ct eet (Γ). (38)

Population OLS error:

eet (Γ) = xt − Ct Γfet (Γ) = et + Ct Πt (Γ)ft0 ,


 −1 > cc  0
with the shorthand Πt (Γ) := IL − Γ Γ> Ωcc
t Γ Γ Ωt Γ . Combined with Eq.

51

Electronic copy available at: https://ssrn.com/abstract=2983919


(10), we have
−1 > >
ebt (Γ) = et + Ct Πt (Γ)ft0 − Ct Γ Γ> Ct> Ct Γ Γ Ct eet (Γ)
−1 > >
fbt (Γ) = fet (Γ) + Γ> C > Ct Γ

t Γ C eet (Γ)
t (39)

Plug those back to the score. Each summand in equation (37) yields 3 × 2 = 6 terms:

Ct> ebt (Γ)fbt> (Γ)


=Ct> et fet> (Γ)
+Ct> Ct Πt (Γ)ft0 fe> (Γ)
t
−1
−Ct> Ct Γ Γ >
Ct> Ct Γ >
Γ Ct> eet (Γ) fet> (Γ)
−1
+Ct> et ee> > >
t (Γ)Ct Γ Γ Ct Ct Γ
−1
+Ct> Ct Πt (Γ)ft0 ee> > >
t (Γ)Ct Γ Γ Ct Ct Γ
−1 > > −1
−Ct> Ct Γ Γ> Ct> Ct Γ Γ Ct eet (Γ) ee> > >
t (Γ)Ct Γ Γ Ct Ct Γ . (40)

[1] [6]
Call the six terms St (Γ) to St (Γ), so that

1 X 
[1] [6]

S(Γ) = vect St (Γ) + · · · + St (Γ) . (41)
NT t

Before jumping into the rigorous proof, we give a loose description of the rationale.
Given the expression of score function in Eq. (41), Proposition 1 states the score
function’s uniform probability limit. First, taking N → ∞ at a fixed t, we have
the three modular results, N1 Ct> et → 0, N1 ΓCt> eet (Γ) → 0, and N1 Ct> Ct = Op (1) ,
by cross-sectional LLN. Plugging these into the score expression (40), we find that
1 [p]
S (Γ) → 0 for p = 1, 3, 4, 5, 6. The second term is an exception as the only
N t
one not involving an eet (Γ) or et error term. It correspond to the first source of fbt
[2]
decomposition purely from a “wrong” Γ rather than the finite sample. We see St
[2] 0 e>
increases with N . We have N1 St (Γ) → Ωcc t Πt (Γ)ft ft (Γ). The cross-sectional limit is
F[ts] measurable (Lemma 1). Taking a finite time-series average yields the finite-T
large-N convergence of the score given by Eq. (32).47 Finally, taking T → ∞ delivers
the score’s convergence to the unconditional expectation, as Eq. (33) report.
The proof, to a large extent, follows the steps of proving Lemmas 5 and 2, with
47
This result could be used to construct finite-T large-N inference – we leave that for future work.

52

Electronic copy available at: https://ssrn.com/abstract=2983919


the complication of uniform convergence across Γ, which is necessary for solution
convergence proved further below.

Lemma 10 (Large N Cross-sectional Convergence at each t).

1 X  [1] [6]

0 e>
St (Γ) + · · · + St (Γ) − Ωcc
t Πt (Γ)ft ft (Γ) (42)
N i

is U-mean converging.

Proof. The cross-section convergence is the bulk of the analysis. We proceed by


analyzing the six terms one by one. We list out the statements in each step and
provide in-line proofs of the statements.

1. 1
C >e
N t t
is U-mean square converging.
This is by Lemma 9, treating ci,t ei,t as the xi,t in the lemma. The conditions
are met given assumptions A, B.
−1 > cc 0
2. Γ> Ωcc
t Γ Γ Ωt Γ is U-a.s. bounded.
This is because it is a continuous function w.r.t. Γ, Ωcc 0
t , and Γ whose domains
are all bounded and away from singularity given assumptions C.

3. fet> (Γ) is U-mean square bounded.


−1 > cc 0 0 −1 > cc 0
fet (Γ) = Γ> Ωcc
t Γ Γ Ω t Γ f t , in which Γ> cc
Ω t Γ Γ Ωt Γ is U-a.s. bounded
0
by the previous statement, ft is U-mean square bounded by assumption B1.
Then apply lemma 8.4.
1
c>
P cc

4. N i i,t c i,t − Ω t is U-mean square converging.
The argument is the same as statement number 1 above. Treat c> cc
i,t ci,t − Ωt as
the xi,t and apply Lemma 9. The conditions are met given the definition of Ωcct
and assumption B.
1 [1]
5. S (Γ)
N t
is U-mean converging.
[1] h i
Notice decomposition N1 St (Γ) = N1 Ct> et fet> (Γ) , use the two previous state-


ments about the two parts and apply lemma 8.1.

53

Electronic copy available at: https://ssrn.com/abstract=2983919


6. Πt (Γ)ft0 fet> (Γ) is U-mean square bounded.
h −1 > cc  0 i  0 0 >  h 0> cc −1 i
Πt (Γ)ft0 fet> (Γ) = IL − Γ Γ> Ωcc
t Γ Γ Ωt Γ f f
t t Γ Ωt Γ Γ> cc
Ωt Γ
(43)

The third term, according to statement number 2 above, is U-a.s. bounded. By


the same arguments, so is the first term. The middle term is U-mean square
bounded by assumption B(1). Then by Lemma 8.4, the three things together
is U-mean square bounded.
7. $\frac1N S_t^{[2]}(\Gamma) - \Omega^{cc}_t\Pi_t(\Gamma)f_t^0\tilde f_t^\top(\Gamma)$ is U-mean converging.
$$\frac1N S_t^{[2]}(\Gamma) = \frac1N C_t^\top C_t\,\Pi_t(\Gamma)f_t^0\tilde f_t^\top(\Gamma) \tag{44}$$
$$\frac1N S_t^{[2]}(\Gamma) - \Omega^{cc}_t\Pi_t(\Gamma)f_t^0\tilde f_t^\top(\Gamma) = \left[\frac1N\sum_i c_{i,t}^\top c_{i,t} - \Omega^{cc}_t\right]\left[\Pi_t(\Gamma)f_t^0\tilde f_t^\top(\Gamma)\right] \tag{45}$$
   Then, this follows by a straightforward application of the previous two statements together with Lemma 8.1.
8. $C_t^\top C_t\Gamma\left(\Gamma^\top C_t^\top C_t\Gamma\right)^{-1}$ is U-a.s. bounded.
   The term equals $\left[\frac1N C_t^\top C_t\Gamma\right]\left[\left(\Gamma^\top\tfrac1N C_t^\top C_t\Gamma\right)^{-1}\right]$. Obviously the first part is U-a.s. bounded. Treat the second part as a non-linear function of the form $\left(\Gamma^\top\Omega\Gamma\right)^{-1}$, which is a continuous function for non-singular inputs. It remains to show that the domain of the inputs is bounded and away from singularity, so that the output is bounded. We know $\frac1N C_t^\top C_t$ is not only bounded, but also uniformly approaches $\Omega^{cc}_t$ a.s., which is invertible a.s. by Assumption C.2. So for large enough $N$, $\frac1N C_t^\top C_t$ is invertible a.s. as well. Also, $\Gamma$ is full rank according to Assumption C.1.


9. $\frac1N\Gamma^\top C_t^\top\tilde e_t(\Gamma)$ is U-mean square converging.
$$\frac1N\Gamma^\top C_t^\top\tilde e_t(\Gamma) = \frac1N\Gamma^\top C_t^\top e_t + \frac1N\Gamma^\top C_t^\top C_t\,Q_t^\top(\Gamma)\Gamma^0 f_t^0 \tag{46}$$
$$= \frac1N\Gamma^\top C_t^\top e_t + \Gamma^\top\left[\frac1N\sum_i\left(c_{i,t}^\top c_{i,t} - \Omega^{cc}_t\right)\right]Q_t^\top(\Gamma)\Gamma^0 f_t^0 \tag{47}$$
   The first term is U-mean square converging, according to statement 1, Assumption C, and Lemma 8.3. We want to show the second is as well. We apply a $\mathrm{vect}(\cdot)$ operator to the summand, which does not affect the norm, and rearrange it as
$$\mathrm{vect}\left(\left(c_{i,t}^\top c_{i,t} - \Omega^{cc}_t\right)Q_t^\top(\Gamma)\Gamma^0 f_t^0\right) = \left[\left(c_{i,t}^\top c_{i,t} - \Omega^{cc}_t\right)\otimes f_t^{0\top}\right]\mathrm{vect}\left(Q_t^\top(\Gamma)\Gamma^0\right) \tag{48}$$
   so the second term altogether equals
$$\Gamma^\top\left[\frac1N\sum_i\left(c_{i,t}^\top c_{i,t} - \Omega^{cc}_t\right)\otimes f_t^{0\top}\right]\mathrm{vect}\left(Q_t^\top(\Gamma)\Gamma^0\right). \tag{49}$$
   The first and third parts are U-a.s. bounded. The middle part is U-mean square converging by Lemma 9, given Assumption B(4). Then we can apply Lemma 8.4 twice.
10. $\frac1N S_t^{[3]}(\Gamma)$ is U-mean converging.
$$\frac1N S_t^{[3]}(\Gamma) = \left[C_t^\top C_t\Gamma\left(\Gamma^\top C_t^\top C_t\Gamma\right)^{-1}\right]\left[\frac1N\Gamma^\top C_t^\top\tilde e_t(\Gamma)\right]\tilde f_t^\top(\Gamma) \tag{50}$$
    The three terms are U-a.s. bounded, U-mean square converging, and U-mean square bounded, respectively; then apply Lemma 8.
11. $\frac1N S_t^{[4]}(\Gamma)$, $\frac1N S_t^{[5]}(\Gamma)$, $\frac1N S_t^{[6]}(\Gamma)$ are all U-mean converging.
    For the last three terms, given the similarities to the situations above, we just write out the decompositions. The remaining arguments about repeatedly applying Lemma 8 are omitted.
$$\frac1N S_t^{[4]}(\Gamma) = \left[\frac1N C_t^\top e_t\right]\left[\frac1N\tilde e_t^\top(\Gamma)C_t\Gamma\right]\left(\Gamma^\top\tfrac1N C_t^\top C_t\Gamma\right)^{-1} \tag{51}$$
$$\frac1N S_t^{[5]}(\Gamma) = \left[\frac1N C_t^\top C_t\,\Pi_t(\Gamma)f_t^0\right]\left[\frac1N\tilde e_t^\top(\Gamma)C_t\Gamma\right]\left(\Gamma^\top\tfrac1N C_t^\top C_t\Gamma\right)^{-1} \tag{52}$$
$$\frac1N S_t^{[6]}(\Gamma) = \left[\frac1N C_t^\top C_t\Gamma\left(\Gamma^\top\tfrac1N C_t^\top C_t\Gamma\right)^{-1}\right]\left[\frac1N\Gamma^\top C_t^\top\tilde e_t(\Gamma)\right]\left[\frac1N\tilde e_t^\top(\Gamma)C_t\Gamma\right] \tag{53}$$
$$\qquad\times\left(\Gamma^\top\tfrac1N C_t^\top C_t\Gamma\right)^{-1} \tag{54}$$

Finally, given the analysis above of $S_t^{[1]}\cdots S_t^{[6]}$, we can conclude the required statement.
Lemma 11 ($S_T$ Convergence). $\sup_{\Gamma\in\Psi}\left\|S_T(\Gamma) - S^0(\Gamma)\right\|\xrightarrow{p}0$, as $T\to\infty$.

Proof. This is a familiar case in the sense that it only has the time-series dimension: it is a stationary and ergodic time-series average analysis. The only twist is that it requires uniform convergence over $\Gamma\in\Psi$. We proceed by applying Lemma 2.4 in Newey and McFadden (1994). It requires constructing a random variable $d_t$ that does not depend on $\Gamma$, and verifying that it dominates the summand and has finite expectation. Notice $S_T(\Gamma) = \frac1T\sum_t\mathrm{vect}\left(\Omega^{cc}_t\Pi_t(\Gamma)f_t^0\tilde f_t^\top(\Gamma)\right)$, and the summand satisfies
$$\left\|\mathrm{vect}\left(\Omega^{cc}_t\Pi_t(\Gamma)f_t^0\tilde f_t^\top(\Gamma)\right)\right\| = \left\|\left[\Omega^{cc}_t\Pi_t(\Gamma)\right]\left[f_t^0 f_t^{0\top}\right]\left[\Gamma^{0\top}\Omega^{cc}_t\Gamma\left(\Gamma^\top\Omega^{cc}_t\Gamma\right)^{-1}\right]\right\| \tag{55}$$
$$\le \left\|\Omega^{cc}_t\Pi_t(\Gamma)\right\|\,\left\|f_t^0 f_t^{0\top}\right\|\,\left\|\Gamma^{0\top}\Omega^{cc}_t\Gamma\left(\Gamma^\top\Omega^{cc}_t\Gamma\right)^{-1}\right\| \tag{56}$$
$$\le M\left\|f_t^0 f_t^{0\top}\right\| := d_t \tag{57}$$
where $M$ is the a.s. bound such that
$$\Pr\left(\sup_{\Gamma\in\Psi}\left\|\Omega^{cc}_t\Pi_t(\Gamma)\right\|\,\left\|\Gamma^{0\top}\Omega^{cc}_t\Gamma\left(\Gamma^\top\Omega^{cc}_t\Gamma\right)^{-1}\right\| < M\right) = 1. \tag{58}$$
A finite $M$ exists, because the two norms within the sup are continuous functions on compact domains, given Assumption C. We have thus constructed $d_t$, and shown that the summand's norm is bounded by $d_t$. Moreover, $E d_t < \infty$ given Assumption B(1).

C.6.3 The Main Proof of Proposition 1

After preparing the lemmas above, we are finally ready for the main proof of Proposition 1.
Proof. According to Lemma 10, $\frac1N\left(S_t^{[1]}(\Gamma) + \cdots + S_t^{[6]}(\Gamma)\right) - \Omega^{cc}_t\Pi_t(\Gamma)f_t^0\tilde f_t^\top(\Gamma)$ is U-mean converging. We have the time-series average defined as
$$S_T(\Gamma) := \frac1T\sum_t\mathrm{vect}\left(\Omega^{cc}_t\Pi_t(\Gamma)f_t^0\tilde f_t^\top(\Gamma)\right) \tag{59}$$
Applying Lemma 7, we have $\sup_{\Gamma\in\Psi}\left\|S(\Gamma) - S_T(\Gamma)\right\|\xrightarrow{p}0$ as $N\to\infty$, for all $T$. That is to say,


for all $\varepsilon, \delta > 0$, there exists $N^{[1]}$ s.t. for all $T$ and all $N > N^{[1]}$,
$$\Pr\left(\sup_{\Gamma\in\Psi}\left\|S(\Gamma) - S_T(\Gamma)\right\| < \delta\right) > 1 - \varepsilon. \tag{60}$$
By Lemma 11, $\sup_{\Gamma\in\Psi}\left\|S_T(\Gamma) - S^0(\Gamma)\right\|\xrightarrow{p}0$ as $T\to\infty$. That is to say, for all $\varepsilon, \delta > 0$, there exists $T^{[1]}$ s.t., irrespective of $N$, for all $T > T^{[1]}$,
$$\Pr\left(\sup_{\Gamma\in\Psi}\left\|S_T(\Gamma) - S^0(\Gamma)\right\| < \delta\right) > 1 - \varepsilon. \tag{61}$$
Combined, for all $N > N^{[1]}$, $T > T^{[1]}$,
$$\Pr\left(\sup_{\Gamma\in\Psi}\left\|S(\Gamma) - S^0(\Gamma)\right\| < 2\delta\right) \tag{62}$$
$$\ge \Pr\left(\sup_{\Gamma\in\Psi}\left\|S(\Gamma) - S_T(\Gamma)\right\| < \delta \text{ and } \sup_{\Gamma\in\Psi}\left\|S_T(\Gamma) - S^0(\Gamma)\right\| < \delta\right) \tag{63}$$
$$\ge 1 - 2\varepsilon. \tag{64}$$
That is the required conclusion: $\sup_{\Gamma\in\Psi}\left\|S(\Gamma) - S^0(\Gamma)\right\|\xrightarrow{p}0$ as $N, T\to\infty$.

C.7 Proof of Proposition 2


Proof. “If”: It is easy to verify that $\Gamma^0$ and all of its rotations solve $S^0(\Gamma) = 0$, because they solve $\Pi_t(\Gamma) = 0$, $\forall\omega$.
“Only if”: All we need to show is that for any $\Gamma$ not rotationally equivalent to $\Gamma^0$, $S^0(\Gamma)\neq0$. The random term in $S^0$ is:
$$\Omega^{cc}_t\Pi_t(\Gamma)f_t^0\tilde f_t^\top(\Gamma) = \Omega^{cc}_t\left[I_L - \Gamma\left(\Gamma^\top\Omega^{cc}_t\Gamma\right)^{-1}\Gamma^\top\Omega^{cc}_t\right]\Gamma^0\,f_t^0 f_t^{0\top}\,\Gamma^{0\top}\Omega^{cc}_t\Gamma\left(\Gamma^\top\Omega^{cc}_t\Gamma\right)^{-1} := A\,Rf_t^0 f_t^{0\top}R^\top\,B, \tag{65}$$
where $R$ is a rotation s.t. $E\left[Rf_t^0 f_t^{0\top}R^\top\right]$ is diagonal with positive entries, and $A$ and $B$ are shorthands for the corresponding $L\times K$ and $K\times K$ matrices, respectively. Notice that for any $\Gamma$ not rotationally equivalent to $\Gamma^0$, $\Pi_t(\Gamma)\neq0$, so $A, B$ are full rank $K$ w.p. 1, by Assumption C. One can construct constant $p, q$ of lengths $L, K$ s.t. the signs of each entry in $p^\top A$ and $Bq$ are always the same. As a result, $p^\top E\left[ARf_t^0 f_t^{0\top}R^\top B\right]q > 0$.
 



C.8 Specific Identification Functions Satisfy IF.1–2
Lemma 12 (Verify Condition IF.1–2). The identification functions $I_X(\Gamma)$ and $I_Y(\Gamma)$ both satisfy Condition IF.1–2.

Proof. It is obvious for $\Theta_X$, because it is deterministic. The identification function and its limit are the same: $I_X(\Gamma) = I_X^0(\Gamma) = \mathrm{vect}\left(\mathrm{Block}_{1:K}(\Gamma) - I_K\right)$.
For $\Theta_Y$, construct $I_Y^0$ as
$$I_Y^0(\Gamma) := \begin{bmatrix} \mathrm{veca}\left(\Gamma^\top\Gamma - I_K\right) \\ \mathrm{vecb}\left(E\left[\tilde f_t(\Gamma)\tilde f_t^\top(\Gamma)\right] - V^{ff}\right) \end{bmatrix}.^{48} \tag{66}$$
⁴⁸ We are constructing $I_Y^0$ that will serve in IA to pick the true $\Gamma^0(\Theta_Y^0)$. So in defining $I_Y^0$, the function $\tilde f_t(\Gamma)$ relies on a $\Gamma^0$ that is not pinned down yet. The factors need to be accordingly rotated by “re-defining” $f_t^0 = \tilde f_t\left(\Gamma^0(\Theta_Y^0)\right)$.

Next, we need to verify the two parts of the Condition.

Verify IF.1:
The top $\frac12K(K+1)$ rows of $I_Y$ and $I_Y^0$ are the same, so we mostly need to work on the lower parts of $I_Y$ and $I_Y^0$. Define
$$I_{[2]}(\Gamma) := \frac1T\sum_t\hat f_t(\Gamma)\hat f_t^\top(\Gamma) - V^{ff} \tag{67}$$
$$I_{[2]}^0(\Gamma) := E\left[\tilde f_t(\Gamma)\tilde f_t^\top(\Gamma)\right] - V^{ff} \tag{68}$$
Given $\hat f_t(\Gamma) = \tilde f_t(\Gamma) + \left(\Gamma^\top C_t^\top C_t\Gamma\right)^{-1}\Gamma^\top C_t^\top\tilde e_t(\Gamma)$, we have
$$I_{[2]}(\Gamma) - I_{[2]}^0(\Gamma) = \frac1T\sum_t\Big\{\tilde f_t(\Gamma)\tilde f_t^\top(\Gamma) - E\left[\tilde f_t(\Gamma)\tilde f_t^\top(\Gamma)\right] + \tilde f_t(\Gamma)\tilde e_t^\top(\Gamma)C_t\Gamma\left(\Gamma^\top C_t^\top C_t\Gamma\right)^{-1} + \left(\Gamma^\top C_t^\top C_t\Gamma\right)^{-1}\Gamma^\top C_t^\top\tilde e_t(\Gamma)\tilde f_t^\top(\Gamma) + \left(\Gamma^\top C_t^\top C_t\Gamma\right)^{-1}\Gamma^\top C_t^\top\tilde e_t(\Gamma)\tilde e_t^\top(\Gamma)C_t\Gamma\left(\Gamma^\top C_t^\top C_t\Gamma\right)^{-1}\Big\}. \tag{69}$$

The analysis from here is very similar to the proof of Proposition 1.


For the first term, mimic Lemma 11 to draw the conclusion that

1 X  h i
p
sup fet (Γ)fet> (Γ) − E fet (Γ)fet> (Γ) →
− 0, T → ∞. (70)

Γ∈Ψ T t

48 0
We are constructing IY that will serve in IA to pick the true Γ0 (Θ0Y ). So in defining IY
0
, function
0
ft (Γ) relies on a Γ that is not
e
 pinned down yet. The factors needs to be accordingly rotated by
“re-defining” ft0 = fet Γ0 (Θ0Y ) .

58

Electronic copy available at: https://ssrn.com/abstract=2983919


Define the sum of the second, third, and fourth terms inside the summation as $\Xi_t(\Gamma)$:
$$\Xi_t(\Gamma) = \tilde f_t(\Gamma)\tilde e_t^\top(\Gamma)C_t\Gamma\left(\Gamma^\top C_t^\top C_t\Gamma\right)^{-1} \tag{71}$$
$$\quad + \left(\Gamma^\top C_t^\top C_t\Gamma\right)^{-1}\Gamma^\top C_t^\top\tilde e_t(\Gamma)\tilde f_t^\top(\Gamma) \tag{72}$$
$$\quad + \left(\Gamma^\top C_t^\top C_t\Gamma\right)^{-1}\Gamma^\top C_t^\top\tilde e_t(\Gamma)\tilde e_t^\top(\Gamma)C_t\Gamma\left(\Gamma^\top C_t^\top C_t\Gamma\right)^{-1} \tag{73}$$
Repeating the arguments in Lemma 10, we arrive at the conclusion that $\Xi_t(\Gamma)$ is U-mean converging. Combining these two conclusions and following the same arguments as in the proof of Proposition 1, IF.1 is verified.
Verify IF.2:
For any parameter $\Gamma$ and its rotation $\Gamma' = \Gamma R$, we find the relationship that mirrors Eq. 16 in Lemma 6:
$$E\left[\tilde f_t(\Gamma R)\tilde f_t^\top(\Gamma R)\right] = R^{-1\top}E\left[\tilde f_t(\Gamma)\tilde f_t^\top(\Gamma)\right]R^{-1} \tag{74}$$
Given this property, we find $I_Y^0$ and $I_Y$ behave in the same way when $\Gamma$ is rotated. Therefore, the normalization procedure is the same as that in Lemma 6, which in turn proves the normalization is unique; that verifies IF.2.
A difference to point out is that here the input of the normalization procedure, $E\left[\tilde f_t(\Gamma)\tilde f_t^\top(\Gamma)\right]$, is an object about the underlying truth and is unknown to the econometrician.

C.9 Proof of Theorem 1


Proof. We stack the score and identification functions together and treat the pair as a single function. By their definitions, $\hat\Gamma$ solves $[S; I](\Gamma) = 0$ and $\Gamma^0(\Theta)$ solves $[S^0; I](\Gamma) = 0$. Proposition 1 and Condition IF.1 imply that the stacked function $[S; I]$ is uniformly converging:
$$\sup_{\Gamma\in\Psi}\left\|[S; I](\Gamma) - \left[S^0; I^0\right](\Gamma)\right\|\xrightarrow{p}0, \quad N, T\to\infty.$$
Based on the previous analysis, Proposition 2 and IF.1–2 combined imply $\Gamma^0$ is the unique solution to the function limits' equation system $\left[S^0; I^0\right](\Gamma) = 0$. We know the solution of a uniformly converging function converges to the limit's unique solution, according to Newey and McFadden (1994). Therefore, $\hat\Gamma\xrightarrow{p}\Gamma^0$. Similarly, $[S^0; I]$ uniformly converges to $[S^0; I^0]$, implying $\Gamma^0(\Theta)\xrightarrow{p}\Gamma^0$. These two convergence results combined imply $\hat\Gamma - \Gamma^0(\Theta)$ converges to zero.

Corollary 1 is a direct application of Theorem 1.

C.10 Proof of Theorem 2


Proof. We first consider $\hat\gamma - \gamma^0$. Later, $\hat\gamma - \gamma^0(\Theta)$ has almost the same derivation, except for one important difference.
Per Lagrange's Mean Value Theorem, at each realization, there exists $\bar\gamma$ in between $\hat\gamma$ and $\gamma^0$ such that
$$S(\hat\gamma) = S(\gamma^0) + \left.\frac{\partial S(\gamma)}{\partial\gamma^\top}\right|_{\gamma=\bar\gamma}\left(\hat\gamma - \gamma^0\right). \tag{75}$$
Notice $S(\hat\gamma) = 0$. That means
$$\bar H\left(\hat\gamma - \gamma^0\right) = -S(\gamma^0), \tag{76}$$
where $\bar H := \left.\frac{\partial S(\gamma)}{\partial\gamma^\top}\right|_{\gamma=\bar\gamma}$ is shorthand for the sample Hessian matrix at $\bar\gamma$.⁴⁹

⁴⁹ Lagrange's Mean Value Theorem only applies to functions of multiple inputs and scalar output, not to vector-valued functions. So each row of $\bar H$ is evaluated at a different $\bar\gamma$. The same holds for the additional $K^2$ rows of $\bar J$ below. But importantly, the different $\bar\gamma$'s are all in between $\gamma^0$ and $\hat\gamma$ (in a linear-combination sense). This guarantees we can later take the limit of the Hessian (see Newey and McFadden, 1994, footnote 25).

Usually, without the unidentification problem, one would divide by the Hessian to get $\hat\gamma - \gamma^0 = -\bar H^{-1}S(\gamma^0)$, and then take the limit of $\bar H$ to express the asymptotic distribution. But that is not possible here when $S$ has non-unique solutions.
The unidentification problem manifests as a singular $\bar H$. Specifically, the score has zero gradient in the $K^2$ directions of unidentification. Intuitively, think of the IPCA case, where the rotation matrix $R$ is $K\times K$: there are $K^2$ directions in which to marginally perturb $\Gamma^0$ without changing the score; it would remain constantly 0. This implies the score has zero gradient in those $K^2$ directions, meaning $\bar H$, although an $LK$-square matrix, has a rank of only $LK - K^2$.
A rank-deficient $\bar H$ leads to non-uniqueness of the $\hat\gamma$ solutions to equation (76). Given one $\hat\gamma - \gamma^0$ that solves the linear equation, there is a family of other vectors that all solve the equation, forming a $K^2$-dimensional sub-linear space of solutions. This is the unidentification issue manifested as the same non-unique-solution problem, but now for a linearized equation.
To solve the problem, we resort to the identification function to pin down the indeterminacy. We linearize $I$ similarly and then append it to the linearized score (remember $I(\hat\gamma) = 0$ too):
$$I(\hat\gamma) = I(\gamma^0) + \left.\frac{\partial I(\gamma)}{\partial\gamma^\top}\right|_{\gamma=\bar\gamma}\left(\hat\gamma - \gamma^0\right), \tag{77}$$
$$\bar J\left(\hat\gamma - \gamma^0\right) = -I(\gamma^0), \tag{78}$$
where $\bar J := \left.\frac{\partial I(\gamma)}{\partial\gamma^\top}\right|_{\gamma=\bar\gamma}$ is $I$'s counterpart of the Hessian. Notice $I$ is a $K^2\times1$ vector and $\bar J$ is a $K^2\times LK$ Jacobian. So these additional $K^2$ equations pin down the additional $K^2$ degrees of freedom. Append the $K^2$ equations in (78) below (76) and form a single linearization of the estimator,
$$\left[\bar H; \bar J\right]\left(\hat\gamma - \gamma^0\right) = -\left[S(\gamma^0); I(\gamma^0)\right].$$
Now this linear equation system has a unique solution at $(\hat\gamma - \gamma^0)$, because the stacked $\left[\bar H; \bar J\right]$ matrix of size $(LK + K^2)\times LK$ is now full rank $LK$. To solve it, left-multiply both sides by the pseudoinverse $\left(\left[\bar H; \bar J\right]^\top\left[\bar H; \bar J\right]\right)^{-1}\left[\bar H; \bar J\right]^\top$:
$$\hat\gamma - \gamma^0 = -\left(\bar H^\top\bar H + \bar J^\top\bar J\right)^{-1}\left(\bar H^\top S(\gamma^0) + \bar J^\top I(\gamma^0)\right). \tag{79}$$
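To make this step concrete, the following is a minimal numerical sketch (not the paper's code) of the stacked-and-pseudoinverted linearization in Eq. (79); the inputs `H_bar`, `J_bar`, `S0`, and `I0` are placeholder arrays standing in for the sample Hessian, the identification Jacobian, and the score and identification functions evaluated at $\gamma^0$.

```python
import numpy as np

def linearized_error(H_bar, J_bar, S0, I0):
    """Solve [H_bar; J_bar] (gamma_hat - gamma0) = -[S0; I0] via the
    pseudoinverse formula of Eq. (79): -(H'H + J'J)^{-1} (H'S0 + J'I0)."""
    A = H_bar.T @ H_bar + J_bar.T @ J_bar   # full rank LK even though H_bar is not
    b = H_bar.T @ S0 + J_bar.T @ I0
    return -np.linalg.solve(A, b)           # = gamma_hat - gamma0

# toy illustration with arbitrary placeholder dimensions (L = 5, K = 2)
L, K = 5, 2
rng = np.random.default_rng(0)
H_bar = rng.standard_normal((L * K, L * K))
J_bar = rng.standard_normal((K * K, L * K))
err = linearized_error(H_bar, J_bar,
                       rng.standard_normal(L * K), rng.standard_normal(K * K))
```

Solving the normal equations of the stacked system is numerically equivalent to applying the pseudoinverse written above, which is why the sketch forms $\bar H^\top\bar H + \bar J^\top\bar J$ directly.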

With the estimator thus linearized, the rest of the asymptotic derivation follows canonical M-estimator analysis. Given $\hat\gamma\xrightarrow{p}\gamma^0$, the mean value $\bar\gamma$ between $\hat\gamma$ and $\gamma^0$ goes to $\gamma^0$ as well. By the continuous mapping theorem, we find the limits of the Hessians:
$$\underset{N,T\to\infty}{\operatorname{plim}}\,\bar H = H^0 := \left.\frac{\partial S^0(\gamma)}{\partial\gamma^\top}\right|_{\gamma=\gamma^0}, \qquad \underset{N,T\to\infty}{\operatorname{plim}}\,\bar J = J^0 := \left.\frac{\partial I^0(\gamma)}{\partial\gamma^\top}\right|_{\gamma=\gamma^0}.^{50} \tag{80}$$
Then, taking the probability limit of equation (79) leads to line 2.a in Theorem 2.
The derivation of line 2.b is similar, save for an important difference. Let us redo this proof, but now start by applying Lagrange's Mean Value Theorem to $\left(\hat\gamma - \gamma^0(\Theta)\right)$ instead of $\left(\hat\gamma - \gamma^0\right)$. Denote the requisite mean value as $\bar{\bar\gamma}$ and denote the Hessians evaluated at $\bar{\bar\gamma}$ as $\bar{\bar H}, \bar{\bar J}$. Then the same logic applies until we find the counterpart of equation (79) as
$$\hat\gamma - \gamma^0(\Theta) = -\left(\bar{\bar H}^\top\bar{\bar H} + \bar{\bar J}^\top\bar{\bar J}\right)^{-1}\left(\bar{\bar H}^\top S\left(\gamma^0(\Theta)\right) + \bar{\bar J}^\top I\left(\gamma^0(\Theta)\right)\right). \tag{81}$$
Not only $\hat\gamma$ but also $\gamma^0(\Theta)$ converges to $\gamma^0$. Hence the new mean value $\bar{\bar\gamma}$ between them converges to $\gamma^0$ as well. So the limits of $\bar{\bar H}$ and $\bar{\bar J}$ are still $H^0$ and $J^0$, which are evaluated at the same deterministic $\gamma^0$.⁵¹

⁵¹ This shows the importance of keeping the deterministic $\gamma^0$ as the limiting reference point in the linearization, even though we are ultimately interested in results about $\gamma^0(\Theta)$ and $\hat\gamma$.
The important difference in this case is that $I\left(\gamma^0(\Theta)\right) = 0$ by construction, which eliminates the $I$ term in (81) and leads to line 2.b in the Theorem. This difference carries an important intuition that is discussed below.
The remainder of the proof only needs to show $S\left(\gamma^0(\Theta)\right) = S(\gamma^0) + o_p\left(S(\gamma^0)\right)$. Apply the Mean Value Theorem one more time, between $\gamma^0(\Theta)$ and $\gamma^0$:
$$S\left(\gamma^0(\Theta)\right) - S(\gamma^0) = \left.\frac{\partial S(\gamma)}{\partial\gamma^\top}\right|_{\gamma=\gamma^{[3]}}\left(\gamma^0(\Theta) - \gamma^0\right) \tag{82}$$
$$= \left(\left.\frac{\partial S(\gamma)}{\partial\gamma^\top}\right|_{\gamma=\gamma^{[3]}} - \left.\frac{\partial S^0(\gamma)}{\partial\gamma^\top}\right|_{\gamma=\gamma^{[3]}}\right)\left(\gamma^0(\Theta) - \gamma^0\right) + \left.\frac{\partial S^0(\gamma)}{\partial\gamma^\top}\right|_{\gamma=\gamma^{[3]}}\left(\gamma^0(\Theta) - \gamma^0\right) \tag{83}$$
Notice the first term is $O_p\left(S^0(\Gamma)\right)o_p(1) = o_p\left(S^0(\Gamma)\right)$. The second term is 0 because $S^0$ is constant at 0 from $\gamma^0(\Theta)$ to $\gamma^0$.

C.11 Lemma 3
We state and prove a general version of Lemma 3 that can be evaluated at the two specific cases.

Lemma 13 (Asymptotic Distribution of the Score evaluated at $\Gamma^0$ — Generic Identification). Under Assumptions A–F, if the identification condition $\Theta$ has an associated identification function $I(\Gamma)$ that satisfies IF.1–2, and $\gamma^0$ is under IA, then as $N, T\to\infty$ such that $T/N\to0$,
$$\sqrt{NT}\,S(\Gamma^0)\xrightarrow{d}\mathrm{Normal}\left(0, V^{[1]}\right),$$
where $V^{[1]} = \left(Q^0\otimes I_K\right)\Omega^{cef}\left(Q^{0\top}\otimes I_K\right)$ and $Q^0 := Q_t(\Gamma^0)$, given that $Q_t(\Gamma) := I_L - \Omega^{cc}_t\Gamma\left(\Gamma^\top\Omega^{cc}_t\Gamma\right)^{-1}\Gamma^\top$ is constant over $t$ under Assumption F.

Lemma 3 simply provides two special cases of Lemma 13. In particular, there are two differences between $V_X^{[1]}$ and $V_Y^{[1]}$. First, $Q^0$ is evaluated under either $\Gamma^0(\Theta_X)$ or $\Gamma^0(\Theta_Y^0)$. Second, the corresponding true $f_t^0$ also needs to be rotated, resulting in rotated values of the asymptotic variance $\Omega^{cef}$.
Below is a proof of the general Lemma 13.

Proof. We evaluate the score at $\Gamma^0$ by breaking it into the six parts in Eq. 41. When evaluated at $\Gamma^0$, $S_t^{[2]}(\Gamma^0) = S_t^{[5]}(\Gamma^0) = 0$, because they contain $Q_t(\Gamma)^\top\Gamma^0$, which is zero at $\Gamma = \Gamma^0$. For the remaining four terms, we take them in two pairs, and only need to show the following results: as $N, T\to\infty$,
$$\frac{1}{\sqrt{NT}}\sum_t\mathrm{vect}\left(S_t^{[1]}(\Gamma^0) + S_t^{[3]}(\Gamma^0)\right)\xrightarrow{d}\mathrm{Normal}\left(0, V^{[1]}\right), \tag{84}$$
$$\frac{1}{\sqrt{NT}}\sum_t\mathrm{vect}\left(S_t^{[4]}(\Gamma^0) + S_t^{[6]}(\Gamma^0)\right)\xrightarrow{p}0. \tag{85}$$
Equation 84: We have
$$S_t^{[1]}(\Gamma^0) + S_t^{[3]}(\Gamma^0) = \left[I_L - C_t^\top C_t\Gamma^0\left(\Gamma^{0\top}C_t^\top C_t\Gamma^0\right)^{-1}\Gamma^{0\top}\right]C_t^\top e_tf_t^{0\top} \tag{86}$$
$$= \left(Q_t(\Gamma^0) - \varepsilon_{N,t}\right)C_t^\top e_tf_t^{0\top} \tag{87}$$
where $\varepsilon_{N,t} = C_t^\top C_t\Gamma^0\left(\Gamma^{0\top}C_t^\top C_t\Gamma^0\right)^{-1}\Gamma^{0\top} - \Omega^{cc}_t\Gamma^0\left(\Gamma^{0\top}\Omega^{cc}_t\Gamma^0\right)^{-1}\Gamma^{0\top}$. We break out the two terms, put them into $\frac{1}{\sqrt{NT}}\sum_t\mathrm{vect}(\cdot)$, and respectively show their asymptotics are:
$$\frac{1}{\sqrt{NT}}\sum_t\mathrm{vect}\left(Q_t(\Gamma^0)C_t^\top e_tf_t^{0\top}\right)\xrightarrow{d}\mathrm{Normal}\left(0, V^{[1]}\right), \tag{88}$$
$$\frac{1}{\sqrt{NT}}\sum_t\mathrm{vect}\left(\varepsilon_{N,t}C_t^\top e_tf_t^{0\top}\right)\xrightarrow{p}0, \quad N, T\to\infty. \tag{89}$$
For the first one, notice $\mathrm{vect}\left(Q^0C_t^\top e_tf_t^{0\top}\right) = \left(Q^0\otimes I_K\right)\mathrm{vect}\left(C_t^\top e_tf_t^{0\top}\right)$. Then it follows by applying the CMT to the CLT in Assumption D(1).


For the second one, we bound its second moment:
$$\left\|\frac{1}{\sqrt{NT}}\sum_t\mathrm{vect}\left(\varepsilon_{N,t}C_t^\top e_tf_t^{0\top}\right)\right\|^2 \tag{90}$$
$$= \frac{1}{NT}\sum_{i,j,t,s}\mathrm{vect}\left(\varepsilon_{N,t}c_{i,t}^\top e_{i,t}f_t^{0\top}\right)^\top\mathrm{vect}\left(\varepsilon_{N,s}c_{j,s}^\top e_{j,s}f_s^{0\top}\right) \tag{91}$$
$$= \frac{1}{NT}\sum_{i,j,t,s}e_{i,t}c_{i,t}\left(\varepsilon_{N,t}\otimes f_t^{0}\right)^\top\left(\varepsilon_{N,s}\otimes f_s^{0}\right)c_{j,s}^\top e_{j,s} \tag{92}$$
Take the expectation:
$$E\left\|\frac{1}{\sqrt{NT}}\sum_t\mathrm{vect}\left(\varepsilon_{N,t}C_t^\top e_tf_t^{0\top}\right)\right\|^2 \tag{93}$$
$$\le \frac{1}{NT}\sum_{i,j,t,s}E\left\|\mathrm{vect}\left(\varepsilon_{N,t}\otimes f_t^{0}\right)\right\|\,\left\|\tau_{ij,ts}\right\|\,E\left\|\mathrm{vect}\left(\varepsilon_{N,s}\otimes f_s^{0}\right)\right\| \tag{94}$$
$$\le \left(\sup_t E\left\|\mathrm{vect}\left(\varepsilon_{N,t}\otimes f_t^{0}\right)\right\|^2\right)\frac{1}{NT}\sum_{i,j,t,s}\left\|\tau_{ij,ts}\right\| \tag{95}$$
Notice the first term $\to0$ as $N, T\to\infty$, and the second term is bounded according to Assumption E. Therefore, this second moment $\to0$.
Equation 85: The summand is
$$S_t^{[4]}(\Gamma^0) + S_t^{[6]}(\Gamma^0) \tag{96}$$
$$= C_t^\top M_t(\Gamma^0)e_te_t^\top C_t\Gamma^0\left(\Gamma^{0\top}C_t^\top C_t\Gamma^0\right)^{-1} \tag{97}$$
$$= Q_t^N(\Gamma^0)\left[\frac1N C_t^\top e_te_t^\top C_t\right]\Gamma^0\left(\frac1N\Gamma^{0\top}C_t^\top C_t\Gamma^0\right)^{-1} \tag{98}$$
where $Q_t^N(\Gamma) := I_L - \frac1N C_t^\top C_t\Gamma\left(\frac1N\Gamma^\top C_t^\top C_t\Gamma\right)^{-1}\Gamma^\top$.
Notice the first term in (98), $Q_t^N(\Gamma^0)$, has a constant large-$N$ probability limit $Q^0$, which is defined in the statement of Lemma 13. The situation is similar for the last term, $\Gamma^0\left(\frac1N\Gamma^{0\top}C_t^\top C_t\Gamma^0\right)^{-1}$. Since both of these parts are functions of $C_t$ only, their large-$N$ convergence is bounded according to Assumption C.2. The constant large-$N$ limits can be analyzed outside of the $t$-summation. Therefore, all we need to show is that the $t$-sum of the middle term satisfies $\frac{1}{\sqrt{NT}}\sum_t\mathrm{vect}\left(\frac1N C_t^\top e_te_t^\top C_t\right)\xrightarrow{p}0$:
$$\frac{1}{\sqrt{NT}}\sum_t\mathrm{vect}\left(\frac1N C_t^\top e_te_t^\top C_t\right) \tag{99}$$
$$= \frac1T\sum_t\mathrm{vect}\left(\frac{\sqrt T}{N}\,\frac{1}{\sqrt N}\left(\sum_i c_{i,t}^\top e_{i,t}\right)\left(\sum_i c_{i,t}^\top e_{i,t}\right)^\top\right) \tag{100}$$
$$= \frac1T\sum_t\mathrm{vect}\left(\frac{\sqrt T}{N}\,\frac{1}{\sqrt N}\sum_{i,j}c_{i,t}^\top e_{i,t}e_{j,t}c_{j,t}\right) \tag{101}$$

Take the unconditional expectation:
$$E\left[\frac{1}{\sqrt{NT}}\sum_t\mathrm{vect}\left(\frac1N C_t^\top e_te_t^\top C_t\right)\right] \tag{102}$$
$$= \frac{\sqrt T}{N}\,E\left[\mathrm{vect}\left(\frac{1}{\sqrt N}\sum_{i,j}c_{i,t}^\top e_{i,t}e_{j,t}c_{j,t}\right)\right] \tag{103}$$
$$= \frac{\sqrt T}{N}\,E\left[\mathrm{vect}\left(\frac{1}{\sqrt N}\sum_i c_{i,t}^\top c_{i,t}e_{i,t}^2\right)\right] \tag{104}$$
$$\to 0. \tag{105}$$
The first equality is due to time-series stationarity, the second equality is due to cross-sectional i.i.d. SP.5, and the final limit is due to the cross-sectional LLN and the condition that $T/N\to0$. Notice the random variable is non-negative and its unconditional expectation converges, thereby we have shown its probability limit converges as well.

Additional comment: based on this analysis, we conjecture that if $T/N$ converges to a fixed positive number, then Eq. 85 has a non-zero probability limit. As a result, the score's asymptotic distribution is still normal but with a bias. That means the $\Gamma$ estimation would be asymptotically biased, but still consistent, as the “incidental parameter problem” would have suggested. We leave this scenario for future research.

C.12 Calculate $H_X^0$, $H_Y^0$

Similarly to the situation in C.11, we state and prove a general result about the $H^0$ calculation, and discuss how to evaluate the general expression at $\Theta_X$ and $\Theta_Y$.

Lemma 14 (Calculate $H^0$ - General). Under Assumption F,
$$H^0 = \left(\Omega^{cc}\otimes V^{ff}\right)\left.\frac{\partial}{\partial\gamma^\top}\mathrm{vect}\left(\Pi(\Gamma)\right)\right|_{\gamma=\gamma^0}.^{52}$$
Similar to the evaluation of $V^{[1]}$ in Subsection C.11, there are two differences between $H_X^0$ and $H_Y^0$. First, the expression of $\Pi(\Gamma)$ and its derivative involves only $\Omega^{cc}$, which does not depend on the rotation; but the derivative needs to be evaluated at either $\Gamma^0(\Theta_X)$ or $\Gamma^0(\Theta_Y^0)$. Second, the corresponding true $f_t^0$ also needs to be rotated. This will result in a difference in the assumed asymptotic variance $V^{ff}$.
Next is the proof of the general Lemma 14.

Proof. Start from the definition
$$H^0 := \left.\frac{\partial S^0(\Gamma)}{\partial\gamma^\top}\right|_{\gamma=\gamma^0} = E\left[\left.\frac{\partial\,\mathrm{vect}\left(\Omega^{cc}_t\Pi_t(\Gamma)f_t^0\tilde f_t^\top(\Gamma)\right)}{\partial\gamma^\top}\right|_{\gamma=\gamma^0}\right] \tag{106}$$
Notice $\Pi_t(\Gamma^0) = 0$. Therefore, the terms involving $\nabla\tilde f_t$ after taking the derivative drop out; only the $\nabla\Pi_t(\Gamma^0)$ terms survive. We write $H^0$ column by column. The $p$'th column, i.e., the derivative w.r.t. the $p$'th entry $\gamma_p$, simplifies as
$$H^0_{\,\cdot p} = E\left[\mathrm{vect}\left(\Omega^{cc}_t\left.\frac{\partial\Pi_t(\Gamma)}{\partial\gamma_p}\right|_{\gamma=\gamma^0}f_t^0 f_t^{0\top}\right)\right] \tag{107}$$
This result does not require Assumption F and can be calculated by LLN simulation if one is interested in the general case. Now we impose the constant-$\Omega^{cc}_t$ Assumption F:
$$H^0_{\,\cdot p} = \mathrm{vect}\left(\Omega^{cc}\left.\frac{\partial\Pi(\Gamma)}{\partial\gamma_p}\right|_{\gamma=\gamma^0}V^{ff}\right) = \left(\Omega^{cc}\otimes V^{ff}\right)\left.\frac{\partial\,\mathrm{vect}\left(\Pi(\Gamma)\right)}{\partial\gamma_p}\right|_{\gamma=\gamma^0} \tag{108}$$
Finally, appending the columns together, we have $H^0 = \left(\Omega^{cc}\otimes V^{ff}\right)\left.\frac{\partial\,\mathrm{vect}(\Pi(\Gamma))}{\partial\gamma^\top}\right|_{\gamma=\gamma^0}$.
γ=γ 0
⁵² Under Assumption F, $\Pi_t(\Gamma)$ is also time-constant and deterministic, so we drop its $t$ subscript. We choose not to write out the deterministic derivative $\frac{\partial}{\partial\gamma}\mathrm{vect}\left(\Pi(\Gamma)\right)$ analytically for conciseness; it is easily calculated numerically in the simulated and empirical exercises.
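As an illustration of that numerical route, the sketch below (not the paper's code) computes the Jacobian $\partial\,\mathrm{vect}(\Pi(\Gamma))/\partial\gamma^\top$ by central finite differences and assembles $H^0$ as in Lemma 14. It assumes the representation $\Pi(\Gamma) = \left[I_L - \Gamma\left(\Gamma^\top\Omega^{cc}\Gamma\right)^{-1}\Gamma^\top\Omega^{cc}\right]\Gamma^0$ implied by Eq. (43) and a row-major `vect` ordering; `Gamma0`, `Omega_cc`, and `V_ff` are user-supplied inputs.

```python
import numpy as np

def Pi(Gamma, Gamma0, Omega_cc):
    # Pi(Gamma) = [I_L - Gamma (Gamma' W Gamma)^{-1} Gamma' W] Gamma0, with W = Omega_cc
    L = Gamma.shape[0]
    P = Gamma @ np.linalg.solve(Gamma.T @ Omega_cc @ Gamma, Gamma.T @ Omega_cc)
    return (np.eye(L) - P) @ Gamma0

def H0_numeric(Gamma0, Omega_cc, V_ff, eps=1e-6):
    # H0 = (Omega_cc kron V_ff) * d vect(Pi(Gamma)) / d gamma', Jacobian by central differences
    L, K = Gamma0.shape
    gamma0 = Gamma0.reshape(-1)                      # row-major stacking of Gamma
    vect_Pi = lambda g: Pi(g.reshape(L, K), Gamma0, Omega_cc).reshape(-1)
    J = np.zeros((L * K, L * K))
    for p in range(L * K):
        e = np.zeros(L * K); e[p] = eps
        J[:, p] = (vect_Pi(gamma0 + e) - vect_Pi(gamma0 - e)) / (2 * eps)
    return np.kron(Omega_cc, V_ff) @ J
```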



C.13 Calculate $J_X^0$, $J_Y^0$

$J_X^0$ is much easier. The identification function $I_X$ is deterministic, i.e., $I_X = I_X^0$. According to $J^0$'s definition (Eq. 80), we calculate $I_X^0$'s derivative and get
$$J_X^0 = \left[I_{K^2\times K^2},\; 0_{K^2\times(L-K)K}\right]. \tag{109}$$
$J_Y^0$ is much more involved and we summarize the calculation in the following lemma.

Lemma 15 (Calculate $J_Y^0$). Under Assumption F, $J_Y^0$ is stacked from two parts. The top $\frac12K(K+1)$ rows are
$$J^0_{Y\,[1:\frac12K(K+1),\,:]} = \left.\frac{\partial\,\mathrm{veca}\left(\Gamma^\top\Gamma\right)}{\partial\gamma^\top}\right|_{\gamma=\gamma^0}.$$
For the bottom $\frac12K(K-1)$ rows, the $p$'th column is
$$J^0_{Y\,[\frac12K(K+1)+1:K^2,\;p]} = \mathrm{vecb}\left(D_p\left(\Gamma^0,\Omega^{cc}\right)V^{ff} + V^{ff}D_p^\top\left(\Gamma^0,\Omega^{cc}\right)\right) \tag{110}$$

Proof. We omit the “S” subscript in this proof. The first part of $I$ is deterministic, so the first part of $J^0$ is easy. For the lower part of $J^0$:
$$J^0_{[\frac12K(K+1)+1:K^2,\,:]} = \underset{N,T\to\infty}{\operatorname{plim}}\left.\frac{\partial}{\partial\gamma^\top}\mathrm{vecb}\left(I_{[2]}\right)\right|_{\gamma^0} = \underset{N,T\to\infty}{\operatorname{plim}}\left.\frac{\partial}{\partial\gamma^\top}\mathrm{vecb}\left(\frac1T\sum_t\tilde f_t(\Gamma)\tilde f_t^\top(\Gamma)\right)\right|_{\gamma^0}$$
$$\left.\frac{\partial}{\partial\gamma_p}\mathrm{vecb}\left(\tilde f_t(\Gamma)\tilde f_t^\top(\Gamma)\right)\right|_{\gamma^0} = \mathrm{vecb}\left(\left.\frac{\partial\tilde f_t(\Gamma)}{\partial\gamma_p}\right|_{\gamma^0}f_t^{0\top} + f_t^0\left.\frac{\partial\tilde f_t^\top(\Gamma)}{\partial\gamma_p}\right|_{\gamma^0}\right)$$
$$\left.\frac{\partial\tilde f_t(\Gamma)}{\partial\gamma_p}\right|_{\gamma^0} = \left.\frac{\partial}{\partial\gamma_p}\left[\left(\Gamma^\top\Omega^{cc}_t\Gamma\right)^{-1}\Gamma^\top\Omega^{cc}_t\Gamma^0\right]\right|_{\gamma^0}f_t^0 := D_p\left(\Gamma^0,\Omega^{cc}_t\right)f_t^0$$
$$\left.\frac{\partial}{\partial\gamma_p}\mathrm{vecb}\left(\tilde f_t(\Gamma)\tilde f_t^\top(\Gamma)\right)\right|_{\gamma^0} = \mathrm{vecb}\left(D_p\left(\Gamma^0,\Omega^{cc}_t\right)f_t^0f_t^{0\top} + f_t^0f_t^{0\top}D_p^\top\left(\Gamma^0,\Omega^{cc}_t\right)\right)$$
$$J^0_{[\frac12K(K+1)+1:K^2,\;p]} = E\left[\mathrm{vecb}\left(D_p\left(\Gamma^0,\Omega^{cc}_t\right)f_t^0f_t^{0\top} + f_t^0f_t^{0\top}D_p^\top\left(\Gamma^0,\Omega^{cc}_t\right)\right)\right]$$
By Assumption F, $J^0_{[\frac12K(K+1)+1:K^2,\;p]} = \mathrm{vecb}\left(D_p\left(\Gamma^0,\Omega^{cc}\right)V^{ff} + V^{ff}D_p^\top\left(\Gamma^0,\Omega^{cc}\right)\right)$.



C.14 Proof of Theorem 4
Proof. (1) Given $\hat\Gamma$'s consistency for the two targets (Theorem 1), and that $\tilde f_t(\Gamma)$ has a bounded derivative around $\Gamma^0$, we have $\tilde f_t(\hat\Gamma) - \tilde f_t\left(\Gamma^0(\Theta)\right)\xrightarrow{p}0$. It remains to show the second term is $o_p(1)$. We have shown that $\frac1N\Gamma^\top C_t^\top\tilde e_t(\Gamma)$ is U-mean square converging and $\left(\frac1N\Gamma^\top C_t^\top C_t\Gamma\right)^{-1}$ is U-a.s. bounded (see intermediate steps proving Lemma 10). Combined, $\left(\Gamma^\top C_t^\top C_t\Gamma\right)^{-1}\Gamma^\top C_t^\top\tilde e_t(\Gamma)$ is U-mean square converging. Because that is a uniform result, when evaluated at $\hat\Gamma$, the second term is converging (in m.s., which implies in probability).
(2) Given Theorem 2 and Lemma 3, $\hat\Gamma - \Gamma^0(\Theta) = O_p\left(1/\sqrt{NT}\right)$. So the first term is $o_p\left(1/\sqrt{NT}\right)$. For the second term:
$$\sqrt N\left(\hat\Gamma^\top C_t^\top C_t\hat\Gamma\right)^{-1}\hat\Gamma^\top C_t^\top\tilde e_t(\hat\Gamma) = \sqrt N\left(\Gamma^{0\top}C_t^\top C_t\Gamma^0\right)^{-1}\Gamma^{0\top}C_t^\top e_t + o_p(1) \tag{111}$$
$$\xrightarrow{d}\mathrm{Normal}\left(0, V_t^{[2]}\right), \tag{112}$$
where $V_t^{[2]} = \left(\Gamma^{0\top}\Omega^{cc}_t\Gamma^0\right)^{-1}\Gamma^{0\top}\Omega^{ce}_t\Gamma^0\left(\Gamma^{0\top}\Omega^{cc}_t\Gamma^0\right)^{-1}$, with $\Omega^{ce}_t$ from Assumption D(2). Notice neither $\Omega^{cc}_t$ nor $\Omega^{ce}_t$ is affected by the rotation. So one only needs to plug in the value of $\Gamma^0$ as either $\Gamma^0(\Theta_X)$ or $\Gamma^0(\Theta_Y^0)$ to evaluate $V_t^{[2]}$ for the specific cases.
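For reference, the plug-in version of this sandwich is straightforward to compute; the sketch below (not from the paper) evaluates $V_t^{[2]}$ in Eq. (112) given user-supplied estimates of $\Omega^{cc}_t$ and $\Omega^{ce}_t$ and the chosen rotation of $\Gamma^0$.

```python
import numpy as np

def V2_plugin(Gamma0, Omega_cc, Omega_ce):
    # V_t^[2] = (G' Omega_cc G)^{-1} (G' Omega_ce G) (G' Omega_cc G)^{-1}, Eq. (112)
    A_inv = np.linalg.inv(Gamma0.T @ Omega_cc @ Gamma0)
    return A_inv @ (Gamma0.T @ Omega_ce @ Gamma0) @ A_inv
```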

C.15 Proof of Lemma 4


Proof. The top $\frac12K(K+1)$ rows of $I_Y(\Gamma^0)$ are deterministic at 0: $I_{Y[1:\frac12K(K+1)]}(\Gamma^0) = 0_{\frac12K(K+1)\times1}$. Next we analyze the four terms in Eq. 69 in the proof of Lemma 12. The first term has a plim which in general is non-zero; when evaluated at $\Gamma^0$, since $\tilde f_t(\Gamma^0) = f_t^0$, it is $O_p\left(1/\sqrt T\right)$. According to the analysis in Lemma 12, the second and third are $O_p\left(1/\sqrt{NT}\right)$. The fourth is $O_p(1/N)$. That is to say,
$$\underset{N,T\to\infty}{\operatorname{plim}}\left[\sqrt T\,I_{[2]}(\Gamma^0) - \frac{1}{\sqrt T}\sum_t\left(f_t^0f_t^{0\top} - V^{ff}\right)\right] = 0_{K\times K} \tag{113}$$
By the time-series CLT in D(3): $\sqrt T\,\mathrm{vecb}\left(I_{[2]}(\Gamma^0)\right)\xrightarrow{d}\mathrm{Normal}\left(0_{\frac12K(K-1)\times1}, V^{[3]}\right)$. In addition, the cross-covariances between the top and the bottom parts are 0.


C.16 Proof of Theorem 6
Proof. Straightforward based on Theorems 4 and 5.

D The Sign Issue in Normalization


As mentioned in footnote 21, identification conditions like $\Theta_Y$ with only [1] and [2] do not pin down the signs of each $\Gamma$ column. To be precise, for any $\Gamma\in\Theta_Y$, let $\Gamma' = \Gamma\,\mathrm{diag}\{s\}$, where $\mathrm{diag}\{s\}$ is a $K\times K$ diagonal matrix with $+1$'s or $-1$'s on the diagonal. Then $\Gamma$ and $\Gamma'$ are obviously rotationally equivalent, because the corresponding factors can flip their signs accordingly. Meanwhile, both $\Gamma$ and $\Gamma'$ satisfy $\Theta_Y$'s [1] and [2]. Without additional sign restrictions, this would violate the uniqueness property of an identification condition.
It is not hard to add restrictions for the signs to pin down this remaining bit of unidentification in theory. For example, one can restrict all the sample factor means ($\frac1T\sum_t\hat f_t(\Gamma)$) to be positive (call this [3′]). Alternatively, restricting the first non-zero element in each column of $\Gamma$ to be positive works as well (call this [3′′]). Similar restrictions are seen in Stock and Watson (2002) and Bai and Ng (2013).
However, we report that the sign issue is trickier in finite-sample simulations. For example, suppose the true factors have zero (or close to zero) expectations ($E[f_t^0] = 0$). Then, even if the true factors are observed, their finite-sample averages are arbitrarily positive or negative, making [3′] unstable. Even if $\hat\Gamma$ is estimated rather accurately, the small sign flip of the factor mean makes a large difference in $\hat\Gamma - \Gamma^0$, keeping it away from converging. Similarly, [3′′] also runs into finite-sample problems if $\Gamma^0$'s first non-zero elements in some columns are close to zero.
So how does one design a sign restriction such that the signal-to-noise ratio in picking the sign is always maximized, adapting to any potentially peculiar model? One way is to make $\Gamma$'s signs always align with those of $\Gamma^0$. So let [3′′′] be the set of $\Gamma$ s.t. $\Gamma_k^\top\Gamma_k^0 > 0$, $\forall k$. Then [3′′′]'s obvious problem is that it depends on population information ($\Gamma^0$), disqualifying it as an estimation's identification condition.
Finally, we give a sign restriction that is both theoretically sound and practically easy to use in simulation exercises. Let [3] be the set of $\Gamma$ s.t. $\Gamma_k^\top\hat\Gamma_k > 0$, $\forall k$, where $\hat\Gamma = \arg\min_{\Gamma\in[1],[2],[3'']} G(\Gamma)$. The idea is that for the $\Gamma$ within the target minimizing set, we use [3′′] (or [3′]) to pin down its signs. That gives the unique estimate. And for $\Gamma$ outside of the set, we align its signs to the estimate. This will normalize the true value towards the direction of the estimate when constructing the reference point $\Gamma^0(\Theta)$. This avoids the sign indeterminacy problem within a small neighborhood around $\hat\Gamma$. The benefit is that when calculating the estimation error, $\hat\Gamma - \Gamma^0(\Theta)$ always takes the smallest possible value among all possible sign combinations. To calculate that quantity in simulation exercises, one can first ignore the sign issue and calculate $\hat\Gamma$ and $\Gamma^0(\Theta)$ both up to sign unidentification, then pick the signs to minimize the norm between the two:
$$\min_{s_1\ldots s_K=\pm1}\left\|\hat\Gamma - \Gamma^0(\Theta)\,\mathrm{diag}\{s\}\right\|. \tag{114}$$
s1 ...sK =±1

E International Macroeconomic Instruments Data


There are fifty-nine indicators (besides GDP growth itself) that we could use as
instruments. Because we require a country-year observation to have GDP growth
and all its instruments non-missing, we first restrict attention to only indicators that
are 80% nonmissing in the panel; we choose 80% as a cut-off so as to be able to include
import and export percentage, which are very natural instruments to include. We
then drop indicators that a priori appear unconnected to global growth: adolescent
fertility rates, fertility rates, life expectancy, mortality rates, total population, and
surface area. We drop gross national income and gross national income per capita
because of their high correlation with GDP and GDP growth, respectively. We drop
Guinea, Haiti, and Tanzania because they don’t have non-missing GDP growth before
1985. Because we want to include exports/imports as instruments, we drop Congo,
Ethiopia, Paraguay, and Zambia because they don’t have non-missing import/export
data before 1985. We want to include Germany in our analysis and it has no CO2
emissions data before 1985, so we drop that indicator. We want to include Hong
Kong in our analysis and it has no domestic credit data before 1985, so we drop that
indicator. We want to include Belgium in our analysis and it has no population density
data before 1985, so we drop that indicator. We want to include capital formation but
Luxembourg and Cabo Verde have no data before 1985, so we drop those countries.
We drop the percentage of merchandise trade given its high correlation with imports
and exports which we do include. Urban population growth is quite correlated with
population growth, so we exclude the former. This leaves us with the indicators discussed in the text, combined with an Industrial dummy. We then restrict attention to the available countries that Kose et al. (2012) labeled as Industrial and Emerging, leaving the countries in Table 3.

Table 3: List of Countries

Argentina           Hong Kong SAR, China   Norway
Australia           Iceland                Pakistan
Austria             India                  Peru
Belgium             Indonesia              Philippines
Brazil              Ireland                Portugal
Canada              Israel                 Singapore
Chile               Italy                  South Africa
China               Japan                  Spain
Colombia            Jordan                 Sweden
Denmark             Korea, Rep.            Switzerland
Egypt, Arab Rep.    Malaysia               Thailand
Finland             Mexico                 Turkey
France              Morocco                United Kingdom
Germany             Netherlands            United States
Greece              New Zealand            Venezuela, RB
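The screening steps described above are mechanical, and the sketch below illustrates their order of operations in pandas under assumed inputs: a long-format World Development Indicators extract with hypothetical column and indicator names (`country`, `year`, `indicator`, `value`), none of which are the paper's actual identifiers.

```python
import pandas as pd

wdi = pd.read_csv("wdi_panel.csv")  # hypothetical long-format extract

# 1) keep indicators at least 80% non-missing across the country-year panel
wide = wdi.pivot_table(index=["country", "year"], columns="indicator", values="value")
keep = [c for c in wide.columns if wide[c].notna().mean() >= 0.80]

# 2) drop indicators judged a priori unrelated to global growth or redundant with GDP
drop_indicators = {"adolescent_fertility", "fertility", "life_expectancy", "mortality",
                   "population_total", "surface_area", "gni", "gni_per_capita",
                   "co2_emissions", "domestic_credit", "population_density",
                   "merchandise_trade", "urban_population_growth"}
keep = [c for c in keep if c not in drop_indicators]

# 3) drop countries lacking pre-1985 GDP growth or import/export coverage
drop_countries = {"Guinea", "Haiti", "Tanzania", "Congo", "Ethiopia", "Paraguay",
                  "Zambia", "Luxembourg", "Cabo Verde"}
panel = wide.loc[~wide.index.get_level_values("country").isin(drop_countries), keep]

# 4) require country-year rows with GDP growth and all instruments non-missing
panel = panel.dropna()
```

The actual WDI series codes and the Kose et al. (2012) country classification must be substituted for these placeholders before the screen reproduces the sample used in the text.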
