Instrumented Principal Component Analysis
Estimation In the following analysis, we often plug β_{i,t} in equation (2) into equation (1) and combine their two errors into a new compound error term e_{i,t}, yielding an estimable representation of the form x_{i,t} = c_{i,t} Γ f_t + e_{i,t}.
Footnote 3: This logic shares an analogy with state-space models, which impose constraints on the serial dependence of β_{i,t}; the main difference is that in the typical state-space model the dynamics of β_{i,t} are inferred only from the main panel data, x_{i,t}.
Footnote 4: PCA β's convergence rate is √T according to Bai and Ng (2002) and Bai (2003). IPCA Γ's convergence rate is √(NT) per Theorem 3.
IPCA is estimated as a least squares problem. It minimizes the sample sum of squared
errors (compound errors ei,t ) over parameters Γ and {ft } jointly. This least squares
estimation is inspired by PCA, which minimizes the sum of squared errors over {ft }
and static loading parameters {βi }.
The optimization does not admit an analytical solution in general, but is speedily solved numerically by an alternating least squares (ALS) algorithm. It iterates between minimizing over Γ while holding {f_t} fixed, and minimizing over {f_t} while holding Γ fixed, until convergence. Importantly, the two partial optimization subproblems are simple linear regressions. This has a number of practical benefits. For example, the procedure converges quickly and without the need for complicated numerical algorithms. And IPCA handles unbalanced panels as easily as pooled panel OLS, which is a great advantage in many applications.
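To make the estimation loop concrete, here is a minimal runnable sketch of the ALS iteration in Python/NumPy. It is our own illustration of the procedure described above, not the authors' code; the array layout, starting value, and stopping rule are all assumptions.

```python
import numpy as np

def ipca_als(X, C, K, tol=1e-8, max_iter=1000):
    """Alternating least squares for the IPCA sum-of-squared-errors objective.

    X : (N, T) array, panel of outcomes x_{i,t}
    C : (N, L, T) array, panel of instruments c_{i,t}
    K : number of latent factors
    Returns Gamma (L, K) and F (K, T); identification up to rotation is not imposed here.
    """
    N, L, T = C.shape
    rng = np.random.default_rng(0)
    Gamma, _ = np.linalg.qr(rng.standard_normal((L, K)))  # orthonormal starting value
    F = np.zeros((K, T))
    prev_sse = np.inf
    for _ in range(max_iter):
        # Given Gamma, each f_t is a cross-sectional OLS coefficient (a t-separable step).
        for t in range(T):
            B = C[:, :, t] @ Gamma                     # (N, K) instrumented loadings
            F[:, t], *_ = np.linalg.lstsq(B, X[:, t], rcond=None)
        # Given {f_t}, vec(Gamma) solves one pooled OLS: the regressor for (i, t)
        # is the outer product of c_{i,t} and f_t, flattened to length L*K.
        Z = np.vstack([np.einsum('il,k->ilk', C[:, :, t], F[:, t]).reshape(N, L * K)
                       for t in range(T)])
        y = X.T.reshape(-1)                            # observations stacked t-block by t-block
        gamma, *_ = np.linalg.lstsq(Z, y, rcond=None)
        Gamma = gamma.reshape(L, K)
        sse = float(np.sum((y - Z @ gamma) ** 2))
        if prev_sse - sse < tol * max(prev_sse, 1.0):  # stop when the SSE stops improving
            break
        prev_sse = sse
    return Gamma, F
```

Each iteration performs only OLS solves, which is why convergence is fast and why unbalanced panels are easy to accommodate (unobserved entries can simply be dropped from both regressions).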
Expanding the least squares problem to include nested constraints yields flexible tests of economic hypotheses. For example, restricting the l-th row of Γ to be all zeros aligns with a test for the marginal contribution of the l-th instrument to overall model performance; a sketch of how to impose this restriction follows below. As another example, restricting one of the K factors to be some observable macroeconomic time series can be used to test the series' relevance for modeling covariation within the panel. In the context of cross-sectional asset pricing, another example restricts one of the factors to be a constant in order to test whether the factor space admits arbitrage opportunities.^5
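As an illustration of the first example, the zero restriction on the l-th row of Γ can be imposed inside the same ALS routine simply by dropping the l-th instrument from the Γ-update regression. The sketch below is hypothetical and reuses the `ipca_als` helper from the previous sketch.

```python
import numpy as np

def ipca_als_row_restricted(X, C, K, l, **kwargs):
    """Estimate IPCA with the l-th row of Gamma restricted to zero, by deleting
    the l-th instrument and re-inserting a zero row afterwards. Comparing the
    restricted and unrestricted SSE gives the raw material for a test of the
    l-th instrument's marginal contribution."""
    C_restricted = np.delete(C, l, axis=1)           # drop instrument l everywhere
    Gamma_r, F_r = ipca_als(X, C_restricted, K, **kwargs)
    Gamma_full = np.insert(Gamma_r, l, 0.0, axis=0)  # zero row back in place
    return Gamma_full, F_r
```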
Outline The paper proceeds as follows. Section 2 introduces the notion of a stochastic panel and uses it to describe the generating process of the panel data considered by the paper. In Section 3, we discuss estimation and analyze choices of identifying normalization. Based on these preparations, Section 4 proves the consistency of Γ estimation, Section 5 presents the asymptotic distributions of Γ estimation errors, and Section 6 contains the asymptotics of the factor estimation. Section 7 examines the small sample properties of IPCA estimation with Monte Carlo simulations. Section 8 presents empirical applications.
Footnote 6: Bai and Ng (2013) works out the asymptotics of PCA for a few specific normalizations other than the one in Bai (2003), while we develop a unified analysis of IPCA that is general to any normalization choice.
SP.4 Jointly ergodic: Any event invariant to both transformations has a probability of either zero or one, i.e., ∀Λ s.t. Λ = S_{[d]}^{−1}(Λ), ∀d, Pr(Λ) = 0 or 1.
Footnote 7: For stochastic processes defined with a transformation on the sample space, see for example Hansen and Sargent (2013). This fundamental construction is similar to the one in Gagliardini et al. (2016). They work with a sample space for the time-series process and a continuum to represent individuals. Our sample space can be seen as the Cartesian product of these two sets, and we use two transformations to describe the sampling scheme from a single probability space.
Footnote 8: [−d] means the other direction with respect to [d].
[Figure 1: the sample space Ω, a point ω, and its images S_{[cs]}(ω) and S_{[ts]}(ω)]
Notes: The horizontal blue arrows are S_{[cs]}; the vertical blue arrows are S_{[ts]}. The random variables evaluated at the nine lattice points starting from ω constitute a sample stochastic panel with N = 3, T = 3. The square blocks represent partitions in F; the columns of blocks are F_{[ts]}'s partitions, and the rows of blocks are F_{[cs]}'s partitions.
SP.5 Cross-sectionally exchangeable: The sequence of random variables {X ∘ S_{[cs]}^{i−1}, i = 1, 2, ...} is exchangeable.
Furthermore, let the sub-σ-algebra F_{[d]} ⊆ F be the collection of events invariant under S_{[−d]}, the transformation of the other direction. In Figure 1, imagine F is generated by the square partitions; then F_{[ts]} and F_{[cs]} consist of the column blocks and row blocks, respectively.
Let X^{[d]} be an F_{[d]}-measurable random variable; it is straightforward to see X^{[d]} = X^{[d]} ∘ S_{[−d]}. That is to say, X_{i,t}^{[ts]} is constant for i = 1, 2, ..., so the i subscript can be dropped. Define
X_t^{[ts]} := X^{[ts]} ∘ S_{[ts]}^{t−1}, t = 1, ..., T,   X_i^{[cs]} := X^{[cs]} ∘ S_{[cs]}^{i−1}, i = 1, ..., N.
The t-subscript random variables are F_{[ts]}-measurable. They represent common or aggregate realizations and contain distributional information about the "current" cross-sectional population. The factor process f_t is the main example. Symmetrically, F_{[cs]}-measurable i-subscript variables are about individual i's static characteristics. A static factor loading β_i would be of this sort.
Condition SP.3 implies that each direction itself defines a stationary stochastic pro-
cess in the traditional sense. Stationary stochastic processes admit the one-directional
law of large numbers (LLN).9
Lemma 1 (One-directional LLN). Under Conditions SP.1–3, and if E‖X‖² < ∞, then
(1/N) Σ_{i=1}^{N} X_{i,t} →^{L2} E[X_{·,t} | F_{[ts]}], ∀t,   and   (1/T) Σ_{t=1}^{T} X_{i,t} →^{L2} E[X_{i,·} | F_{[cs]}], ∀i.
Notice the right-hand sides are F_{[ts]}- or F_{[cs]}-measurable. That means, for example, the cross-sectional average converges to a time-specific aggregate, which can be written with a single-t subscript. This allows X_{i,t} to be non-ergodic in either direction. For example, if ω and the next individual S_{[cs]}(ω) are in different S_{[ts]}-invariant events (different row blocks in Figure 1), X_{1,t} will not repeat the course of events of X_{2,t}, no matter how much time passes. We intentionally leave this possibility open, as a realistic feature of panel data. For example, for the application in Section 8.1, although all countries' import/export shares vary over time, some countries are inherently more trade-reliant than others. As a result, their time-series averages converge to different limits given by the expectations conditional on country identity (F_{[cs]}).
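A tiny simulation (with made-up numbers, our own illustration) makes this non-ergodicity concrete: each individual's time-series average converges to an individual-specific limit, while the panel-wide average still converges to the unconditional mean, anticipating Lemma 2 below.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 500, 2000
beta = rng.choice([0.5, 2.0], size=N)      # individual-specific trait (F_[cs]-measurable)
f = rng.standard_normal(T)                 # common time-series shocks (F_[ts]-measurable)
X = beta[:, None] + f[None, :] + rng.standard_normal((N, T))

# Time-series averages converge to different, individual-specific limits:
print(X[beta == 0.5].mean(axis=1).mean())  # approx 0.5
print(X[beta == 2.0].mean(axis=1).mean())  # approx 2.0
# The panel-wise average converges to the deterministic unconditional mean:
print(X.mean())                            # approx 1.25 = E[beta]
```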
Condition SP.4 introduces and imposes "panel-wise" ergodicity when jointly considering both directions. This allows panel-wise averages to converge to deterministic limits, even as we intentionally allow non-deterministic one-directional limits. The following panel LLN formalizes this.
Footnote 9: We write a mean-square convergence result because it is used in subsequent derivations. An almost-sure convergence result is also immediate.
Lemma 2 (Panel LLN). Under Conditions SP.1–4, and if E‖X‖² < ∞, then as N, T → ∞,
(1/NT) Σ_{i=1}^{N} Σ_{t=1}^{T} X_{i,t} →^{L1} EX.
The four conditions discussed so far are symmetric in the two directions. However, we expect the cross section to be independent in some sense, while the time-series evolutions are serially dependent. SP.5 formalizes these properties by strengthening the stationarity in SP.3 to exchangeability for the [cs] direction only.^10 Under this condition, F_{[cs]}-measurable single-i-subscript variables (for instance the fixed loadings β_i in PCA) are cross-sectionally i.i.d. And the double-subscript variables are i.i.d. conditional on F_{[ts]}.^11 For example, book-to-market_{i,t} is i.i.d. across stocks conditional on the common time-specific information. Unconditionally, it is not independent (only exchangeable) due to aggregate fluctuations of the value ratio.
Assumption C. (1) The parameter space Ψ of Γ is compact and bounded away from rank deficiency: det(Γ^⊤Γ) > ε for some ε > 0. (2) Almost surely, c_{i,t} is bounded. (3) Define Ω_t^cc := E[c_{i,t}^⊤ c_{i,t} | F_{[ts]}]; then almost surely det Ω_t^cc > ε for some ε > 0.
The two assumptions above are regularity conditions for consistency. Assumption B lists the required finite moments to apply the panel Law of Large Numbers (Lemma 2). Assumption C guarantees that the matrix Γ^⊤ C_t^⊤ C_t Γ, whose inverse frequently appears, remains nonsingular (C_t denotes the N × L matrix that stacks up the cross section of c_{i,t}).
For asymptotic normality and deriving the asymptotic variance, we impose the following additional assumptions.
This objective is inspired by PCA, which also optimizes the sample SSE but over parameters {β_i} and {f_t}.
We use an Alternating Least Squares (ALS) method for the numerical solution of the optimization because, unlike PCA, the IPCA optimization problem does not have a solution through an eigen-decomposition.^15 The SSE target in (4) is quadratic in either Γ or {f_t} when the other is fixed. This property allows for analytical optimization of Γ and {f_t} one at a time. Given any Γ, the factors {f_t} are t-separable and solved with cross-sectional OLS for each t:
f̂_t(Γ) := argmin_{f_t} Σ_i (x_{i,t} − c_{i,t} Γ f_t)² = (Γ^⊤ C_t^⊤ C_t Γ)^{−1} Γ^⊤ C_t^⊤ x_t.   (5)^16
where f̂_t(Γ) is the optimal factor given any Γ according to (5). And define the score function as the derivative: S(Γ) := ∂G(Γ)/∂γ. Therefore, the two-argument joint minimization problem (Eq. 4) is equivalent to minimizing G(Γ), or solving the first-order condition S(Γ) = 0_{LK×1}, with respect to Γ only. The asymptotic analysis proceeds by first analyzing S to characterize the asymptotics of Γ̂ while treating the factor estimate f̂_t(Γ̂) as an implicit intermediate. Once the asymptotic analysis of Γ̂ is done, Section 6 comes back to the factor asymptotics by plugging Γ̂ in.
Footnote 17: We can prove uniqueness for the asymptotic score function up to rotation in the sense of Proposition 2, which provides a theoretical foundation for the ALS method. In simulations we see that convergence is unique and fast unless the data-generating errors are simulated to be unreasonably large.
Footnote 18: In practice, we first "fill up" the data {x_{i,t}, c_{i,t}} at any unobserved (i, t) entry with zeros, and then use the same ALS program for the completed panel. It is easy to show this process is equivalent to summing over only the observed entries.
[Figure 2: the parameter space Ψ, the optimizer set Γ̂ = argmin G(Γ) with S(Γ) = 0, the identification condition Θ ⇔ I(Γ) = 0, and rotations Γ′ = ΓR]
Notes: The dashed lines are sets of rotationally equivalent parameters. A particular one (in blue) minimizes the target function. The black curve represents the identification condition Θ. The intersection of the sets Θ and argmin G(Γ) defines the estimator Γ̂. The red arrow represents the normalization of a parameter to Θ.
parameter sets each only once. We will analyze two concrete examples of Θ further below.
Γ̂(Θ) := argmin_{Γ∈Θ} G(Γ).   (8)
When there is no emphasis on the specific normalization choice, we omit "(Θ)" and simply write Γ̂.
Despite first appearances, (8) is not a constrained optimization, because the constraint "Γ ∈ Θ" never prevents the target from achieving its global optimum; it only picks a unique representation out of the (rotationally equivalent) set of Γ's that all achieve the same minimum. Equivalently, Γ̂ is the solution of the simultaneous equations S(Γ) = 0 and I(Γ) = 0, shown as the intersection of the two corresponding curves in Figure 2. Based on this representation, we will characterize the asymptotics of Γ̂ by analyzing the score and identification functions.
In addition, we note that a normalization Θ must be known to the econometrician,
meaning it can depend on the sample but cannot depend on the underlying population
parameters.
4 Consistency of Γ Estimation
This section demonstrates that IPCA estimates the mapping matrix Γ consistently. Since the estimate Γ̂ is the simultaneous solution of the first-order condition (score function) and the identification function, its consistency is based on the (uniform) convergence of these two functions, following the standard strategy for analyzing M-estimators (Newey and McFadden, 1994). Relative to the classical framework, we must confront two additional challenges: simultaneous N, T convergence and the identification issue. To address the former, we rely on the large sample properties of stochastic panels developed in Section 2.1. For the latter, we build on the normalization concepts constructed in Section 3.2.
Moreover, we verify that the limiting function S^0 is solved only by Γ^0 and its unidentified rotations.
These two results are the foundation of IPCA consistency. When combined, they
imply a sense of set convergence regardless of the identification issue—the set of SSE
minimizers {Γ s.t. S(Γ) = 0} converges to the set of unidentified true parameters.
Once we build the counterparts of these results for the identification function in the
next subsection, score and identification functions together will lead to estimator
convergence.
But before that, let us explain the intuition behind Proposition 1, which is essential for IPCA's large sample theory. By taking the derivative of the target, we find S(Γ) = (1/NT) Σ_t vect(C_t^⊤ ê_t(Γ) f̂_t^⊤(Γ)), where f̂_t(Γ) and ê_t(Γ) are the coefficients and errors, respectively, of the cross-sectional OLS regression of x_t onto C_tΓ.^23 The sample OLS estimate f̂_t(Γ) misses the true f_t^0 for two reasons. For one, it uses a Γ that is not the true Γ^0, and thereby the instrumented factor loadings C_tΓ are off. Second, even if we knew the true Γ^0 and thus the sample β's were right, OLS in the finite cross section does not exactly reveal the true f_t^0. To formalize this intuition, we construct f̃_t(Γ) as the "population" counterpart of the same x_t-on-C_tΓ regression. By "population" we mean conditional on F_{[ts]}: that is, the collection of all "aggregate" events which contain the information of how the "current" cross section would be generated. Specifically,
f̃_t(Γ) := E[Γ^⊤ c_{i,t}^⊤ c_{i,t} Γ | F_{[ts]}]^{−1} E[Γ^⊤ c_{i,t}^⊤ x_{i,t} | F_{[ts]}] = (Γ^⊤ Ω_t^cc Γ)^{−1} Γ^⊤ Ω_t^cc Γ^0 f_t^0,   (9)
with the shorthand Ω_t^cc := E[c_{i,t}^⊤ c_{i,t} | F_{[ts]}].^24 Therefore, the aforementioned two sources
Footnote 22: The expression of S^0 is in Appendix C.6, together with the proof.
Footnote 23: Eq. (5) defined f̂_t(Γ), and ê_t(Γ) := x_t − C_tΓ f̂_t(Γ).
Footnote 24: Assumption A is used in Eq. (9).
where ẽ_t(Γ) := x_t − C_tΓ f̃_t(Γ) is the corresponding population OLS error. Importantly, given a rotation R, f̃_t(Γ^0 R) = R^{−1} f_t^0 and ẽ_t(Γ^0 R) = e_t. This means that if one knew the true Γ^0 (even without knowing the particular rotation), the functions f̃_t(Γ) and ẽ_t(Γ) would reveal the true factor structure. The second term in (10) captures the remaining error caused only by the finite cross section.
With this knowledge, the score function can be broken down by representing f̂_t(Γ), ê_t(Γ) with f̃_t(Γ), ẽ_t(Γ). We save the exact expression for Appendix C.6, noting only that the score function inherits the same decomposition: into a part that is only "off" due to the "wrong" Γ, and another part due to the finite cross section. Intuitively, the first part converges to the probability limit S^0(Γ), while the second vanishes in the large-N, T limit.^25 Finally, the limiting score S^0(Γ) can recover the true Γ^0 (up to a rotation) according to Proposition 2.
I_X(Γ) := vect(Block_{1:K}(Γ) − I_K),   I_Y(Γ) := [ veca(Γ^⊤Γ − I_K) ; vecb((1/T) Σ_t f̂_t(Γ) f̂_t^⊤(Γ) − V^ff) ].^26
Notice f̂_t(Γ) involves sample data, and hence I_Y is a random function.
We show I_Y converges uniformly to a deterministic function I_Y^0, which also constitutes an identification function. The same claim trivially applies to I_X, as it is deterministic.
Footnote 25: The proof is in Appendix C.6, which deals with the additional complication of uniform large-N, T convergence across Γ, which is necessary for the convergence discussed next.
Footnote 26: The mappings veca and vecb vectorize the upper-triangular entries of a square matrix; the difference is that veca includes the diagonal entries and vecb does not. See details with an example in Appendix A. V^ff is the factor's population second-moment matrix as defined in Assumption D. It is actually redundant to write "−V^ff" in the expression of I_Y, as long as the assumed V^ff is a diagonal matrix, since vecb ignores the diagonal entries. However, we leave it in to note that Θ_Y can easily be adjusted at the applied researcher's discretion with the asymptotic analysis largely unchanged. For example, one can specify V^ff = I_K and switch "veca" and "vecb" to standardize the factors instead of Γ.
IF.1 Uniform convergence: There exists a deterministic function I^0(Γ) such that
sup_{Γ∈Ψ} ‖I(Γ) − I^0(Γ)‖ →_p 0, N, T → ∞.
IF.2 and IA combined imply that Γ^0 solves I^0, and it is the only solution among all of its equivalent rotations.
The above convergence and uniqueness conditions about I are symmetric to Propositions 1 and 2 about S. Since the estimator is the intersection of S and I, these conditions together form the premise of estimator consistency, which will be formally stated in Subsection 4.4. Therefore, IF.1–2 can be used to establish whether any normalization besides Θ_X and Θ_Y yields a consistent parametric estimation.
We note IA is an innocuous assumption, because the true parameter can always be normalized to satisfy any identification condition, including Θ^0, without changing the data generating process. Next, we sort out some ambiguity brought about by having different, but rotationally equivalent, true parameters.
[Figure 3: the estimators Γ̂(Θ_X) and Γ̂(Θ_Y) on the optimizer set S(Γ) = 0, the identification conditions, and estimation-error arrows A, B, C, with convergence rate O_p(1/√(NT))]
Notes: The top horizontal curve is the set of optimizers. The lower curve is the set of rotationally equivalent true parameters. The three vertical lines are identification conditions. All the black objects are deterministic; the blue ones are sample-dependent. The two solid red arrows point from the random sets to their deterministic limits, labeled with the rates of convergence. The dashed red arrows A, B, C mark three specific estimation errors, whose asymptotic distributions are in Theorems 3.a, 3.b, and 5, respectively.
It varies across samples but converges to the limit Θ_Y^0, shown as the vertical line on the right. On the other hand, the limit of Θ_X is itself (since it is deterministic), shown as the vertical line on the left. The estimators Γ̂(Θ_X) and Γ̂(Θ_Y) are at the intersections of the optimizer set and the identification-condition set, where the parenthesis clarifies which identification condition is used.
We make two main points. First, assumption IA means that the true Γ^0's under the two normalization cases Θ_X and Θ_Y are not the same. This can be seen by noting that the limiting identification conditions Θ^0 are different for the two cases: one is Θ_X itself and the other is Θ_Y^0. To avoid ambiguity, we write the true parameter under the two cases as Γ^0(Θ_X) and Γ^0(Θ_Y^0), while Γ^0 is reserved as the shorthand when there is no emphasis on the specific case. The two are rotationally equivalent, but satisfy different limiting identification conditions as clarified in the parentheses. The true parameter is deterministic, since it is the intersection of two deterministic sets: S^0(Γ) = 0 and Θ^0. This deterministic truth is the parameter in the IPCA definition (Eq. 2). Later on, it will serve as the fixed reference point in the asymptotic expansion. Finally, the asymptotic variance is expressed with regard to the generic Γ^0, which is subsequently evaluated at either Γ^0(Θ_X) or Γ^0(Θ_Y^0) depending on the specific case.
Our second main point is that we measure estimation error against the true parameter normalized in the same way as the estimate is normalized. We call this the normalized true parameter and denote it as Γ^0(Θ), where Θ is the normalization used
The theorem above is the consistency result for a generic identification condition Θ. For the two specific cases (Θ_X and Θ_Y), we already verified that both identification functions I_X(Γ) and I_Y(Γ) satisfy Conditions IF.1–2. Therefore, the estimators in these two specific cases are consistent as well.
This lemma implies that the convergence rate of the Γ estimation error is √(NT), regardless of the normalization choice. The √(NT) convergence rate highlights IPCA's ability to harness not just the time-series but also the cross-sectional information. This ability ultimately comes from the assumption that the instrumental mapping is common across individuals. In contrast, without modeling the structure of the factor loadings, even were the factors observed, the loading estimation could only rely on time-series information and would thereby achieve convergence at rate √T.
From the perspective of the panel literature, the √(NT) convergence rate can be understood by viewing Γ as a common structural parameter and f_t as time fixed effects. This view raises the concern that estimation could be asymptotically biased in the presence of "incidental parameters" whose number increases with the sample size (Neyman and Scott, 1948; Lancaster, 2000). Gagliardini and Gourieroux (2014) note that the incidental parameter problem is much less pronounced in the case of time fixed effects with large cross sections. We follow them and focus on the T/N → 0 case, in which the asymptotic distribution is centered around zero and no asymptotic bias correction is needed. Should T/N converge to a positive number, we conjecture the asymptotic distribution is still normal but with a non-zero mean; the exact analysis is left for future research.
Footnote 32: The expressions of V_X^[1] and V_Y^[1] are given with the proof in Appendix C.11.
The first term captures the part of the factor estimation error due only to the inaccuracy of the Γ estimation. The previous section concluded that Γ̂ and Γ^0(Θ) converge at the rate √(NT). Hence, the first term has the same rate of convergence. The second term captures the remaining inaccuracy from the cross-sectional regression's finite sample. That is, even were Γ^0(Θ) known, the second term would still dominate, at rate 1/√N.
Footnote 33: Given a (sample-dependent) R such that the normalized true Γ is represented as Γ^0(Θ) = Γ^0 R, the correspondingly normalized true factor is inversely rotated, as R^{−1} f_t^0 = R^{−1} f̃_t(Γ^0) = f̃_t(Γ^0 R) = f̃_t(Γ^0(Θ)). In other words, the two pairs, {Γ^0, f_t^0} and {Γ^0(Θ), f̃_t(Γ^0(Θ))}, are rotationally equivalent to each other.
The theorem gives the factor's asymptotic distribution under a generic normalization Θ. Similar to the Γ estimation, we can evaluate V_t^[2] at either the Γ^0(Θ_X) or the Γ^0(Θ_Y^0) normalization.^35
7 Simulations
This section presents a concise set of simulations that illustrate the behavior of the IPCA estimation in finite samples and assess the accuracy of approximations based on the asymptotic theory derived above. To summarize, we find that the estimation errors are well approximated by a normal distribution. This is true even in rather small samples, and when the true generating process has errors with large variance.
This shows that applied researchers can confidently assume normality for confidence
intervals and hypothesis tests when applying IPCA, as it verifies the asymptotic
derivations in the previous section. We present details below.
For given N, T, we generate a stochastic panel of c, f^0, e and use these to assemble the x panel. We calibrate the simulated data to the IPCA model (fixing K = 2 and L = 10) estimated from US monthly stock returns in KPS.^36
Simulations proceed according to the following steps:
Footnote 34: The expression of the asymptotic variance V_t^[2] is given with the proof in Appendix C.14.
Footnote 35: Appendix B.3 contains the asymptotics when centered by the original f_t^0. This situation corresponds to Arrow C for Γ estimation, and the additional stochastic wedge introduced by a sample-based normalization affects the asymptotic distribution.
Footnote 36: In particular, we use the ten most significant instruments from KPS as calibration targets. They are market capitalization, total assets, market beta, short-term reversal, momentum, turnover, price relative to its 52-week high, long-term reversal, unexplained volume, and idiosyncratic volatility with respect to the Fama-French three-factor model.
3. Generate errors. Elements of the error panel e are simulated from an i.i.d. normal distribution whose variance is calibrated so that the population R², defined as 1 − Ee²/Ex², equals 20%, matching the empirical R² in the estimated model in KPS.
4. Generate main panel. We fix Γ̂(Θ_Y) at its empirically estimated value and calculate x_{i,t} according to model equation (3).
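A minimal sketch of steps 3–4 in Python, assuming placeholder arrays for the calibrated inputs (`Gamma_hat`, `c`, and `f0` stand in for the KPS-calibrated objects, which we do not reproduce here):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, K, L = 200, 200, 2, 10
# Placeholders for the calibrated inputs (the paper takes these from KPS estimates):
Gamma_hat = rng.standard_normal((L, K))
c = rng.standard_normal((N, L, T))
f0 = rng.standard_normal((K, T))

# Systematic part of the panel: x_{i,t} = c_{i,t} Gamma f_t + e_{i,t}.
signal = np.einsum('ilt,lk,kt->it', c, Gamma_hat, f0)

# Step 3: pick the error variance so the population R^2 = 1 - E e^2 / E x^2 is 20%.
target_r2 = 0.20
var_e = np.var(signal) * (1.0 - target_r2) / target_r2
e = rng.normal(scale=np.sqrt(var_e), size=(N, T))

# Step 4: assemble the main panel.
x = signal + e
print(1.0 - e.var() / x.var())   # sample analogue of the 20% R^2 target
```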
Note: This figure reports the small-sample distributions of Γ estimation errors under the two example normalization cases. We conduct 200 simulations with sample dimensions N = 200, T = 200. The left panel reports the distribution of Γ̂(Θ_X) − Γ^0(Θ_X) (Arrow A). The right panel reports the distribution of Γ̂(Θ_Y) − Γ^0(Θ_Y) (Arrow B). Each subplot corresponds to one element of the L × K (10 × 2) matrix Γ. Each histogram plots the simulated estimation errors, overlaid with the asymptotic distributions from Theorem 3 as finite-sample approximations. The horizontal axis range is ±6 theoretical standard deviations; the tick marks are at ±3 theoretical standard deviations. The vertical axes are probability density for the bell curves or frequency density for the bars.
freedom. In other words, the top four distributions on the right duplicate information in the distributions plotted below them.
In all cases, the simulated distributions are centered around zero and match the theoretical distributions well. For some entries, we observe some skewness and tail heaviness. These are due to the small sample size and relatively large error variance in the generating process. In untabulated simulations with N = 1000, T = 1000, the asymptotics kick in more fully and the distributions become visually indistinguishable from a normal. Hence, the simulation results suggest that even with panels of only
8 Applications
This section describes two empirical applications of IPCA to demonstrate its broad
usefulness for analyzing economic data. The first is an application to international macroeconomics, where IPCA makes it easy to analyze many nations' evolving relationships to global business cycles using country-level instruments. The second application builds on KPS and uses IPCA to analyze a dynamic model of asset risk and expected returns.
[Figure 5: mean factor loadings and IPCA mean absolute error over time, by country group]
Notes: The left axis shows the IPCA model mean absolute error in units of percentage annual growth and corresponds to the blue dotted line. The right axis corresponds to the equally weighted cross-sectional mean factor loading in each group, shown in red.
population growth to account for growth in the labor force. To account for recent economic growth and risks, we include the 5-year rolling mean and volatility of the nation's GDP growth and its rate of inflation. Finally, we include a constant characteristic. Next, we double the set of instruments to 18 by interacting the nine variables above with an indicator for whether a country is in the industrial/developed group. Our annual data run from 1961 to 2015 so that T = 55, and we demean the growth rates following KOP. About 91% of the 5,280 possible country-year observations are non-missing.
We study latent factor models with K = 1 and compare IPCA to the static-loading PCA estimator. We find a panel R² from the IPCA model of 32%, capturing roughly triple the variation in demeaned country growth explained by KOP. The R² from PCA is 22%, or two-thirds that achieved by IPCA. When making a head-to-head comparison of PCA and IPCA, it is important to keep in mind two major differences between the estimators. The first is their stark difference in parameterization. IPCA achieves its fit using only 18 parameters to estimate its loadings, or 60% fewer parameters than the 45 used by PCA.^39 Second, IPCA accommodates dynamics in each country's loading on the global growth factor while PCA estimates static loadings. If countries converge or decouple as they evolve, IPCA's dynamic loadings are capable of detecting this. PCA's static loadings, on the other hand, cannot detect such dynamics and will instead try to fit an evolving system with a static model, and this type of misspecification is difficult to diagnose.
Footnote 39: These parameter counts are net of the 45 country-specific means used to demean the data in both IPCA and PCA.
[Table 1: estimated Γ coefficients]
Notes: Estimated Γ coefficients scaled by the panel standard deviation of each instrument (the exceptions are Constant and Ind×Constant, which are unscaled). The instruments are the log of GDP as a fraction of world GDP, gross capital formation, export and import share of GDP, inflation, population growth, the 5-year rolling mean and volatility of GDP growth, and a constant. Each instrument is also interacted with an indicator for inclusion in the "industrial/developed" country group. An asterisk denotes statistical significance at the 10% level or better (using a bootstrap test following KPS).
Results from IPCA show that beta dynamics are indeed critical to understanding the global business cycle. Figure 5 shows the time series of loadings on the global growth factor broken out by industrial/developed economies and emerging economies. For readability, we report the equally weighted cross-sectional average loading within each group of countries.
We see substantial variation in global growth sensitivity in each group. This is
underpinned by an interesting state dependence in loadings—they rise sharply in eco-
nomic downturns. While this is visually evident in the plot for industrial/developed
countries, the precise nature of state dependence in loadings can be read from the es-
timated Γ matrix, which is shown in Table 1. To make estimates more interpretable,
we scale each element of Γ to describe the effect on factor loadings from a one standard
deviation increase in the associated instrument.
First, the constant and its interaction with the industrial dummy show that load-
ings in the industrial/developed group are significantly higher than those in emerg-
ing economies. But the largest and perhaps most interesting finding is the role of
growth volatility in describing state dependence in global growth sensitivity. A well
documented pattern in the business cycle literature is the spike in growth volatility
associated with recessions (Bloom, 2014). Table 1 shows that such rises in volatility are accompanied by concomitant rises in sensitivity to the global growth factor. It also shows that this dependence of growth sensitivity is one of the few instruments
for all i, t in the out-of-sample data set. The first term x̂_{i,t+1}^Tot reconstructs the realized return as the factor model's fitted value (that is, using the factor realization f̂_{t+1}). The second term is a prediction of the new listing return and replaces the factor realization with the factor's estimated mean λ̂, directly following KPS.^42 Based on these fits, we calculate the out-of-sample total and predictive R² as the explained variation in x_{i,t+1} due to x̂_{i,t+1}^Tot and x̂_{i,t+1}^Pred, respectively.
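In code, the two fits and the two R² measures amount to the following sketch (the array names and their time alignment, with characteristics lagged relative to returns, are our assumptions):

```python
import numpy as np

def total_and_predictive_r2(x, C, Gamma_hat, F_hat):
    """x: (N, T) realized returns; C: (N, L, T) lagged characteristics;
    Gamma_hat: (L, K); F_hat: (K, T) estimated factor realizations.
    Returns the total and predictive R^2 (as fractions, not percent)."""
    lambda_hat = F_hat.mean(axis=1)                               # estimated factor mean
    x_tot = np.einsum('ilt,lk,kt->it', C, Gamma_hat, F_hat)       # fit with realized factors
    x_pred = np.einsum('ilt,lk,k->it', C, Gamma_hat, lambda_hat)  # fit with the factor mean
    r2_tot = 1.0 - np.nansum((x - x_tot) ** 2) / np.nansum(x ** 2)
    r2_pred = 1.0 - np.nansum((x - x_pred) ** 2) / np.nansum(x ** 2)
    return r2_tot, r2_pred
```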
Table 2 reports the results for the IPCA model (with K = 4, as advocated by
KPS). The close similarity of the in-sample and out-of-sample total R2 indicates the
Footnote 40: The dataset is from Gu et al. (2020). Firm characteristics are transformed into ranks on the interval [−0.5, 0.5] as in KPS. Any missing characteristic is assigned the value 0, which is replacement with the cross-sectional mean/median.
Footnote 41: The information content of unexpected idiosyncratic return shocks is formalized by Assumption A.
Footnote 42: λ̂ is calculated as the in-sample time-series mean of f̂_{t+1}.
Table 2:
                                   Total R²   Predictive R²
Incumbent stocks (in-sample)        15.66        0.25
New listings (out-of-sample)        13.44        0.22
Notes: R² in percentage. Based on IPCA with K = 4 for the 1965–2018 US stock-month panel.
same characteristics that determine the systematic riskiness of incumbent stocks also
determine the riskiness of new listings. In other words, once we condition on firm
characteristics, the model finds a highly similar description for the common variation
among returns on newly listed stocks compared to the common variation in returns
on incumbents.
While the total R² is indicative of the model's ability to describe the systematic risks of new listings, the predictive R² summarizes the model's description of their expected returns (or, in equivalent terms, cost of capital or discount rates). That is, the predictive R² is especially informative about the usefulness of the model for asset valuation. The close similarity of the in-sample and out-of-sample predictive R² indicates that the IPCA model is as effective at "pricing" new listings as it is for pricing incumbent stocks. The most important takeaway is that the model does this without using the individual return history of the new listings (which is of course unavailable and the crux of the research question), but manages to price them nonetheless by extrapolating what it learns from data on incumbent stocks.^43
9 Conclusion
This paper has introduced a new approach to modeling and estimating the latent factor structure of panel data, called Instrumented Principal Component Analysis (IPCA). The key innovation is using additional panel data to instrument for the dynamic factor loadings. Namely, each individual's time-varying factor loading is related to instrumental data according to a common and constant mapping.
Estimating this mapping, rather than the factor loadings directly, has many econometric advantages compared to other latent variable estimators like PCA. On one
Footnote 43: This performance is even more remarkable when we recognize that new firms' stock returns are more variable than incumbents'.
Backus, D. K., Kehoe, P. J., and Kydland, F. E. (1992). International Real Business Cycles. Journal of Political Economy, 100(4):745–775.
Bai, J. (2003). Inferential Theory for Factor Models of Large Dimensions. Econometrica, 71(1):135–171.
Bai, J. and Ng, S. (2002). Determining the Number of Factors in Approximate Factor
Models. Econometrica, 70(1):191–221.
Bai, J. and Ng, S. (2013). Principal components estimation and identification of static
factors. Journal of Econometrics, 176(1):18–29.
Büchner, M. and Kelly, B. (2020). A Factor Model for Option Returns. Yale University Working Paper.
Del Negro, M. and Otrok, C. (2008). Dynamic factor models with time-varying parameters: measuring changes in international business cycles. FRB of New York Staff Report, (326).
Fama, E. F. and French, K. R. (1993). Common risk factors in the returns on stocks
and bonds. Journal of Financial Economics, 33(1):3–56.
Fan, J., Liao, Y., and Wang, W. (2016). Projected principal component analysis in
factor models. The Annals of Statistics, 44(1):219–254.
Geweke, J. (1977). The dynamic factor analysis of economic time series. Latent
variables in socio-economic models.
Gregory, A., Head, A., and Raynauld, J. (1997). Measuring world business cycles.
International Economic Review, 38(3):677–701.
Gu, S., Kelly, B., and Xiu, D. (2020). Empirical Asset Pricing via Machine Learning. The Review of Financial Studies, 33(5):2223–2273.
Kelly, B. T., Pruitt, S., and Su, Y. (2019). Characteristics are covariances: A unified
model of risk and return. Journal of Financial Economics, 134(3):501–524.
Kose, A., Otrok, C., and Prasad, E. (2012). Global business cycles: convergence or
decoupling? International Economic Review, 53(2):511–538.
Kose, M. A., Otrok, C., and Whiteman, C. H. (2003). International Business Cycles: World, Region, and Country-Specific Factors. American Economic Review, 93(4):1216–1239.
Pruitt, S. (2012). Uncertainty Over Models and Data: The Rise and Fall of American
Inflation. Journal of Money, Credit and Banking, 44(2-3):341–365.
Su, L. and Wang, X. (2017). On time-varying factor models: Estimation and testing.
Journal of Econometrics, 198(1):84–101.
A Notation Details
Matrix Operations: vect(A) vectorizes matrix A into a column by going right first, then down. Throughout the paper, γ is always vect(Γ); the same holds for all the other decorated versions Γ̂, Γ^0(Θ), etc. Relatedly, veca(A) stacks A's upper-triangle entries, including the diagonal, in a vector by going right first, then down; vecb(A) stacks A's upper-triangle entries, not including the diagonal, in a vector by going right first, then down. For example, let A = [1, 2, 3; 4, 5, 6; 7, 8, 9]; then vect(A) = [1, 2, 3, 4, 5, 6, 7, 8, 9]^⊤, veca(A) = [1, 2, 3, 5, 6, 9]^⊤, and vecb(A) = [2, 3, 6]^⊤.
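Since these three mappings are purely mechanical, a small reference implementation (our own, in Python) reproducing the example above may be useful:

```python
import numpy as np

def vect(A):
    """Row-major vectorization: go right first, then down."""
    return np.asarray(A).reshape(-1)

def veca(A):
    """Stack the upper-triangular entries, including the diagonal, row by row."""
    A = np.asarray(A)
    rows, cols = np.triu_indices_from(A, k=0)
    return A[rows, cols]

def vecb(A):
    """Stack the upper-triangular entries, excluding the diagonal, row by row."""
    A = np.asarray(A)
    rows, cols = np.triu_indices_from(A, k=1)
    return A[rows, cols]

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
assert vect(A).tolist() == [1, 2, 3, 4, 5, 6, 7, 8, 9]
assert veca(A).tolist() == [1, 2, 3, 5, 6, 9]
assert vecb(A).tolist() == [2, 3, 6]
```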
[A; B] means the two matrices vertically stacked.
‖·‖ is the Euclidean norm of a vector, or the Frobenius norm if the input is a matrix. ‖·‖² is the square of ‖·‖, i.e., the sum of squares across all entries of the vector or matrix.
Summations like Σ_i, Σ_t always run over the panel sample, i = 1 to N and t = 1 to T, without writing out the details.
This subsection works out the asymptotic distribution of Γ̂(Θ_Y) − Γ^0(Θ_Y^0), or Arrow C in Figure 3. Since the reference point Γ^0(Θ_Y^0) is deterministic, this analysis provides the complete asymptotic distribution of the estimator Γ̂(Θ_Y). In contrast to Arrow B (studied in Section 5.2), Arrow C also depends on the randomness of the identification Θ_Y. We rely on the general result 2.a for the asymptotic distribution. Subsection 5.2 has already calculated three of the four right-hand-side inputs: H_Y^0, J_Y^0, and the asymptotic distribution of S(Γ^0(Θ_Y^0)). The fourth piece that needs to be analyzed is the asymptotic distribution of I_Y(Γ^0(Θ_Y^0)).
The identification function I_Y(Γ), as defined in 4.2, is stacked up from two parts. The top ½K(K+1) entries are from ortho-normalizing Γ, which is irrelevant of sample information. Therefore, similar to the Θ_X case, the top part of I_Y(Γ^0(Θ_Y^0)) is still deterministic. The bottom ½K(K−1) rows of I_Y(Γ) are from diagonalizing the sample
Lemma 4 implies that the "contamination effect" that the normalization step introduces into Γ̂(Θ_Y) converges at the rate √T. Surprisingly, this is even slower than that of the core optimization step, which converges at √(NT) according to Lemma 3. Hence it solely shows up in the asymptotic distribution of Γ̂(Θ_Y) as the dominant term. In summary, the convergence rate of Arrow C is √T, while those of Arrows A and B are both √(NT).
Note: This figure is constructed with the same simulation exercises as Figure 4 and follows the same format. The simulated distributions of a new set of random variables (written at the top of each panel) are reported. The corresponding theoretical distributions are from Theorems 5, 3.b, and 5, respectively. Additionally, the x-axis ticks are marked with numbers for comparison. Again, the x-axis range is ±6 theoretical standard deviations, and the tick numbers mark ±3 theoretical standard deviations.
The only difference happens at the first term, which is now driven by the estimation error between Γ̂(Θ_Y) and Γ^0(Θ_Y^0) (Arrow C), which has a convergence rate of √T according to Subsection B.1. This implies the additional stochastic wedge introduced by a sample-based normalization might no longer be dominated by the second term. The following theorem says that f̂_t(Γ̂(Θ_Y)) and f_t^0 converge at the slower of √N or √T.
Theorem 6 (Factor Estimation Error Measured Against {f_t^0}). Under Assumptions A–F, as N, T → ∞,
f̂_t(Γ̂(Θ_Y)) − f_t^0 = O_p(max{1/√N, 1/√T}).
Lemma 5 (Single-subscript LLN). Under Conditions SP.1–4, let X^[d] be F_{[d]}-measurable (single-subscript) random vectors. If E‖X^[d]‖² < ∞, then
(1/N) Σ_{i=1}^{N} X_i^[cs] →^{L2} E X^[cs] as N → ∞,   and   (1/T) Σ_{t=1}^{T} X_t^[ts] →^{L2} E X^[ts] as T → ∞.
Proof. Applying Lemma 1, we have (1/N) Σ_{i=1}^{N} X_i^[cs] →^{L2} E[X^[cs] | F_{[ts]}]. It remains to show X_i^[cs] is ergodic. We know X^[cs] is measured by F_{[cs]}, within which every event is S_{[ts]}-invariant. So an S_{[cs]}-invariant event within F_{[cs]} must be invariant to both transformations. By Condition SP.4, it has probability either 0 or 1. That is to say, {X_i^[cs]} is an ergodic stochastic process (in the traditional one-directional sense). That implies E[X^[cs] | F_{[ts]}] = E X^[cs] w.p. 1, which in turn gives the desired m.s. convergence. The other direction is symmetric.
Proof.
(1/NT) Σ_t Σ_i X_{i,t} = (1/T) Σ_t [ (1/N) Σ_i X_{i,t} − E[X_{·,t} | F_{[ts]}] ] + (1/T) Σ_t E[X_{·,t} | F_{[ts]}]   (12)
We are going to show that the first term →^{L1} 0 (as N → ∞, uniformly across T), and that the second term →^{L1} EX as N, T → ∞ (as T → ∞, uniformly across N).
Proof. We prove by first constructing the normalization Γ^0 and then showing it is unique. Before that, we notice a relationship:
(1/T) Σ_t f̂_t(ΓR) f̂_t^⊤(ΓR) = R^{−1} [(1/T) Σ_t f̂_t(Γ) f̂_t^⊤(Γ)] R^{−1⊤}   (16)
(1/T) Σ_t f̂_t(Γ^0) f̂_t^⊤(Γ^0) = [Orth]^{−1} [Chol] [OldV] [Chol]^⊤ ([Orth]^{−1})^⊤ = [Diag]   (17)
The normalization is unique: we need to show such a Γ^0 is unique. Since Γ^0 = ΓR, we just need to show such an R is unique. We inspect the restrictions of Θ_Y and shrink the set of possible R down to a singleton, following the above normalization procedure.
First, given Θ_Y restriction [1], R^⊤ Γ^⊤ Γ R = I_K. The possible R must have the decomposition R = [Chol]^{−1} [Orth], where [Orth] can only be an orthonormal matrix ([Orth]^⊤ [Orth] = I). Plugging this decomposition into Θ_Y restriction [2], we need an [Orth] that also satisfies [Orth]^{−1} [Chol] [OldV] [Chol]^⊤ ([Orth]^{−1})^⊤ = [Diag]. We have found that setting [Orth] to be the eigen-decomposition satisfies this. Importantly, when we restrict the eigen-decomposition to distinct and decreasing eigenvalues, such an eigen-decomposition is unique.
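The construction in this proof is directly computable. Below is a minimal sketch (our own) of the Θ_Y normalization: a Cholesky step to orthonormalize Γ, followed by an eigen-rotation that diagonalizes the factor second-moment matrix, keeping the fitted values C_tΓf_t invariant.

```python
import numpy as np

def normalize_theta_y(Gamma, F):
    """Rotate (Gamma, F) so that Gamma' Gamma = I_K and (1/T) sum_t f_t f_t'
    is diagonal with decreasing entries, as in the Theta_Y restrictions."""
    # Step 1 ([Chol]): make Gamma orthonormal; counter-rotate F to keep Gamma @ F fixed.
    chol = np.linalg.cholesky(Gamma.T @ Gamma)       # lower-triangular "Chol"
    G1 = Gamma @ np.linalg.inv(chol).T
    F1 = chol.T @ F
    # Step 2 ([Orth]): diagonalize the factor second-moment matrix "[OldV]".
    old_v = (F1 @ F1.T) / F.shape[1]
    eigval, eigvec = np.linalg.eigh(old_v)
    order = np.argsort(eigval)[::-1]                 # distinct, decreasing eigenvalues
    orth = eigvec[:, order]
    G2 = G1 @ orth                                   # still orthonormal
    F2 = orth.T @ F1
    return G2, F2
```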
Next, Lemma 7 is the upgraded version of Lemma 2 that deals with uniformity across Γ.
Lemma 7. If x_{N,t}(Γ) is stationary in t and U-mean converging, then its time-series average converges in the large-N probability limit uniformly for any T. That is, sup_{Γ∈Ψ} ‖(1/T) Σ_t x_{N,t}(Γ)‖ →_p 0 as N → ∞, for all T.
Proof. Start from an inequality exchanging the order of sup and Σ:
sup_{Γ∈Ψ} ‖(1/T) Σ_t x_{N,t}(Γ)‖ ≤ sup_{Γ∈Ψ} (1/T) Σ_t ‖x_{N,t}(Γ)‖ ≤ (1/T) Σ_t sup_{Γ∈Ψ} ‖x_{N,t}(Γ)‖.   (19)
The expectation of the last term, E sup_{Γ∈Ψ} ‖x_{N,t}(Γ)‖, is irrelevant of T, and it converges to zero as N → ∞, according to the precondition of U-mean convergence. Hence, the first term exhibits T-uniform large-N convergence: ∀ε > 0, ∃N^[1] s.t. ∀T and ∀N > N^[1],
E sup_{Γ∈Ψ} ‖(1/T) Σ_t x_{N,t}(Γ)‖ < ε.   (21)
For any ω, if sup_{Γ∈Ψ} ‖x_{N,t}^[3](Γ)‖ < M, then sup_{Γ∈Ψ} ‖x_{N,t}^[3](Γ)‖² < M². That means ‖x_{N,t}^[3](Γ)‖² is also U-a.s. bounded for large enough N's. Plug the bound, M², back in.
4. Almost the same as the previous proof. Just change x_{N,t}^[1](Γ) to x_{N,t}^[2](Γ) everywhere until the last three lines. In the last three lines, just change "lim_N = 0" to "lim sup_N < ∞".
Lemma 9 builds the necessary conditions for U-mean-square convergence from low-level conditions on the primitives, for example Assumption A. It is closely related to Lemma 1.
then its cross-sectional average (1/N) Σ_i x_{i,t} is U-mean-square converging:^46
(1/N) Σ_{i=1}^{N} X_{i,t} →^{L2} 0, ∀t.   (31)
C.6 Proposition 1
C.6.1 Complete statement of Proposition 1
We first state the complete version of Proposition 1, with the addition of an intermediate large-N result and with the expressions of the probability limits written out.
Footnote 46: Notice here Γ does not enter the random function.
sup_{Γ∈Ψ} ‖S(Γ) − S_T(Γ)‖ →_p 0, N → ∞, ∀T,   (32)
sup_{Γ∈Ψ} ‖S(Γ) − S^0(Γ)‖ →_p 0, N, T → ∞,   (33)
where
S_T(Γ) := (1/T) Σ_t vect(Ω_t^cc Π_t(Γ) f_t^0 f̃_t^⊤(Γ)),   (34)
S^0(Γ) := E vect(Ω_t^cc Π_t(Γ) f_t^0 f̃_t^⊤(Γ)),   (35)
Π_t(Γ) := [I_L − Γ (Γ^⊤ Ω_t^cc Γ)^{−1} Γ^⊤ Ω_t^cc] Γ^0.   (36)
The proof of Proposition 1 is quite involved. The first part manipulates the expression of the score function into a form consisting of primitive random variables. Second, Lemma 10 deals with the cross-sectional convergence. Then, Lemma 11 deals with the time-series dimension. In the final step, the results are put together to finish the proof of the large-N, T convergence in Proposition 1.
The first part of the proof manipulates the expression of the score function into a form consisting of primitive random variables. Then, we analyze its uniform convergence. We already have
S(Γ) = (1/NT) Σ_t (C_t^⊤ ⊗ f̂_t(Γ)) (x_t − C_t Γ f̂_t(Γ)) = (1/NT) Σ_t vect(C_t^⊤ ê_t(Γ) f̂_t^⊤(Γ)),   (37)
and
f̂_t(Γ) = f̃_t(Γ) + (Γ^⊤ C_t^⊤ C_t Γ)^{−1} Γ^⊤ C_t^⊤ ẽ_t(Γ).   (38)
Plug those back into the score. Each summand in equation (37) yields 3 × 2 = 6 terms. Call the six terms S_t^[1](Γ) to S_t^[6](Γ), so that
S(Γ) = (1/NT) Σ_t vect(S_t^[1](Γ) + ··· + S_t^[6](Γ)).   (41)
Before jumping into the rigorous proof, we give a loose description of the rationale. Given the expression of the score function in Eq. (41), Proposition 1 states the score function's uniform probability limit. First, taking N → ∞ at a fixed t, we have the three modular results, (1/N) C_t^⊤ e_t → 0, (1/N) Γ^⊤ C_t^⊤ ẽ_t(Γ) → 0, and (1/N) C_t^⊤ C_t = O_p(1), by the cross-sectional LLN. Plugging these into the score expression (40), we find that (1/N) S_t^[p](Γ) → 0 for p = 1, 3, 4, 5, 6. The second term is the exception, as the only one not involving an ẽ_t(Γ) or e_t error term. It corresponds to the first source of the f̂_t decomposition, coming purely from a "wrong" Γ rather than from the finite sample. We see S_t^[2] increases with N. We have (1/N) S_t^[2](Γ) → Ω_t^cc Π_t(Γ) f_t^0 f̃_t^⊤(Γ). The cross-sectional limit is F_{[ts]}-measurable (Lemma 1). Taking a finite time-series average yields the finite-T large-N convergence of the score given by Eq. (32).^47 Finally, taking T → ∞ delivers the score's convergence to the unconditional expectation, as Eq. (33) reports.
The proof, to a large extent, follows the steps of proving Lemmas 5 and 2, with
Footnote 47: This result could be used to construct finite-T large-N inference; we leave that for future work.
(1/N) [S_t^[1](Γ) + ··· + S_t^[6](Γ)] − Ω_t^cc Π_t(Γ) f_t^0 f̃_t^⊤(Γ)   (42)
is U-mean converging.
1. (1/N) C_t^⊤ e_t is U-mean-square converging. This is by Lemma 9, treating c_{i,t} e_{i,t} as the x_{i,t} in the lemma. The conditions are met given Assumptions A and B.
2. (Γ^⊤ Ω_t^cc Γ)^{−1} Γ^⊤ Ω_t^cc Γ^0 is U-a.s. bounded. This is because it is a continuous function w.r.t. Γ, Ω_t^cc, and Γ^0, whose domains are all bounded and away from singularity given Assumption C.
(1/N) S_t^[2](Γ) = (1/N) C_t^⊤ C_t Π_t(Γ) f_t^0 f̃_t^⊤(Γ)   (44)
(1/N) S_t^[2](Γ) − Ω_t^cc Π_t(Γ) f_t^0 f̃_t^⊤(Γ) = [(1/N) Σ_i c_{i,t}^⊤ c_{i,t} − Ω_t^cc] Π_t(Γ) f_t^0 f̃_t^⊤(Γ)   (45)
vect((c_{i,t}^⊤ c_{i,t} − Ω_t^cc) Q_t(Γ) Γ^0 f_t^0) = ((c_{i,t}^⊤ c_{i,t} − Ω_t^cc) ⊗ f_t^{0⊤}) vect(Q_t^⊤(Γ) Γ^0)   (48)
The first and third parts are U-a.s. bounded. The middle part is U-mean-square converging, by Lemma 9, given Assumption B(4). Then we can just apply Lemma 8.4 twice.
10. (1/N) S_t^[3](Γ) is U-mean converging.
(1/N) S_t^[3](Γ) = [C_t^⊤ C_t Γ (Γ^⊤ C_t^⊤ C_t Γ)^{−1}] [(1/N) Γ^⊤ C_t^⊤ ẽ_t(Γ)] f̃_t^⊤(Γ)   (50)
The three terms are U-a.s. bounded, U-mean-square converging, and U-mean-square bounded, respectively; apply Lemma 8.
11. (1/N) S_t^[4](Γ), (1/N) S_t^[5](Γ), and (1/N) S_t^[6](Γ) are all U-mean converging.
For the last three terms, given the similarities to the situations above, we just write out the decompositions. The remaining arguments about repeatedly applying Lemma 8 are omitted.
(1/N) S_t^[4](Γ) = [(1/N) C_t^⊤ e_t] [(1/N) ẽ_t^⊤(Γ) C_t Γ] ((1/N) Γ^⊤ C_t^⊤ C_t Γ)^{−1}   (51)
(1/N) S_t^[5](Γ) = [(1/N) C_t^⊤ C_t Π_t(Γ) f_t^0] [(1/N) ẽ_t^⊤(Γ) C_t Γ] ((1/N) Γ^⊤ C_t^⊤ C_t Γ)^{−1}   (52)
(1/N) S_t^[6](Γ) = [(1/N) C_t^⊤ C_t Γ] ((1/N) Γ^⊤ C_t^⊤ C_t Γ)^{−1} [(1/N) Γ^⊤ C_t^⊤ ẽ_t(Γ)] [(1/N) ẽ_t^⊤(Γ) C_t Γ] ((1/N) Γ^⊤ C_t^⊤ C_t Γ)^{−1}   (53)–(54)
Finally, given the analysis above of S_t^[1], ..., S_t^[6], we can conclude the required statement.
Proof. This is a familiar case in the sense that it only has the time-series dimension: it is a stationary and ergodic time-series average analysis. The only twist is that it requires uniform convergence over Γ ∈ Ψ. We proceed by applying Lemma 2.4 in Newey and McFadden (1994). It requires constructing a Γ-irrelevant random variable d_t and verifying that it dominates t(Γ) and has finite expectation. Notice S_T(Γ) = (1/T) Σ_t vect(t(Γ)), with t(Γ) = Ω_t^cc Π_t(Γ) f_t^0 f̃_t^⊤(Γ).
A finite M exists because the two norms within the sup are continuous functions on compact domains, given Assumption C. We have thus constructed d_t and shown that ‖vect(t(Γ))‖ ≤ ‖d_t‖. Moreover, E d_t < ∞ given Assumption B(1).
After preparing the lemmas above, we are finally ready for the main proof of Proposition 1.
Proof. According to Lemma 10, (1/N)(S_t^[1](Γ) + ··· + S_t^[6](Γ)) − Ω_t^cc Π_t(Γ) f_t^0 f̃_t^⊤(Γ) is U-mean converging. We have the time-series average defined as
S_T(Γ) := (1/T) Σ_t vect(Ω_t^cc Π_t(Γ) f_t^0 f̃_t^⊤(Γ)).   (59)
Applying Lemma 7, we have sup_{Γ∈Ψ} ‖S(Γ) − S_T(Γ)‖ →_p 0 as N → ∞, ∀T. That is to say, ∀ε, δ > 0, ∃N^[1] s.t., irrelevant of T, ∀N > N^[1],
Pr( sup_{Γ∈Ψ} ‖S(Γ) − S_T(Γ)‖ < δ ) > 1 − ε.
By Lemma 11, sup_{Γ∈Ψ} ‖S_T(Γ) − S^0(Γ)‖ →_p 0 as T → ∞. That is to say, ∀ε, δ > 0, ∃T^[1] s.t., irrelevant of N, ∀T > T^[1],
Pr( sup_{Γ∈Ψ} ‖S_T(Γ) − S^0(Γ)‖ < δ ) > 1 − ε.   (61)
Combining the two bounds, for all N > N^[1] and T > T^[1],
Pr( sup_{Γ∈Ψ} ‖S(Γ) − S^0(Γ)‖ < 2δ ) ≥ 1 − 2ε.   (64)
That gives the required conclusion: sup_{Γ∈Ψ} ‖S(Γ) − S^0(Γ)‖ →_p 0 as N, T → ∞.
where R is a rotation s.t. E[R f_t^0 f_t^{0⊤} R] is diagonal with positive entries, and A and B are shorthands for the L × K and K × K terms, respectively. Notice that for any Γ not rotationally equivalent to Γ^0, Π_t(Γ) ≠ 0, so A and B are of full rank K w.p. 1, by Assumption C. One can construct constant vectors p, q of lengths L, K s.t. the signs of each entry in p^⊤A and Bq are always the same. As a result, p^⊤ E[A R f_t^0 f_t^{0⊤} R B] q > 0.
I_Y^0(Γ) := [ veca(Γ^⊤Γ − I_K) ; vecb(E[f̃_t(Γ) f̃_t^⊤(Γ)] − V^ff) ].^48   (66)
I_[2](Γ) := (1/T) Σ_t f̂_t(Γ) f̂_t^⊤(Γ) − V^ff   (67)
I_[2]^0(Γ) := E[f̃_t(Γ) f̃_t^⊤(Γ)] − V^ff   (68)
I_[2](Γ) − I_[2]^0(Γ) = (1/T) Σ_t { f̃_t(Γ) f̃_t^⊤(Γ) − E[f̃_t(Γ) f̃_t^⊤(Γ)]
  + f̃_t(Γ) ẽ_t^⊤(Γ) C_t Γ (Γ^⊤ C_t^⊤ C_t Γ)^{−1}
  + (Γ^⊤ C_t^⊤ C_t Γ)^{−1} Γ^⊤ C_t^⊤ ẽ_t(Γ) f̃_t^⊤(Γ)
  + (Γ^⊤ C_t^⊤ C_t Γ)^{−1} Γ^⊤ C_t^⊤ ẽ_t(Γ) ẽ_t^⊤(Γ) C_t Γ (Γ^⊤ C_t^⊤ C_t Γ)^{−1} }.   (69)
Repeating the arguments in Lemma 10, we arrive at the conclusion that Ξ_t(Γ) is U-mean converging. Combining these two conclusions and following the same arguments as in the proof of Proposition 1, IF.1 is verified.
Verify IF.2: For any parameter Γ and its rotation Γ′ = ΓR, we find the relationship that mirrors Eq. 16 in Lemma 6:
E[f̃_t(ΓR) f̃_t^⊤(ΓR)] = R^{−1} E[f̃_t(Γ) f̃_t^⊤(Γ)] R^{−1⊤}.   (74)
Given this property, I_Y^0 and I_Y behave in the same way when Γ is rotated. Therefore, the normalization procedure is the same as that in Lemma 6, which in turn proves the normalization is unique; that verifies IF.2. A difference to point out is that here the input of the normalization procedure, E[f̃_t(Γ) f̃_t^⊤(Γ)], is an object about the underlying truth, unknown to the econometrician.
Based on the previous analysis, Proposition 2 and IF.1–2 combined imply Γ^0 is the unique solution to the function limits' equation system [S^0; I^0](Γ) = 0. We know the solution of the uniformly converging functions converges to the limit's unique solution, which yields the consistency of Γ̂.
Notice S(γ̂) = 0. That means
H̄ (γ̂ − γ^0) = −S(γ^0),   (76)
where H̄ := ∂S(γ)/∂γ^⊤ |_{γ=γ̄} is shorthand for the sample Hessian matrix at γ̄.^49
Usually, without the unidentification problem, one would divide by the Hessian to get γ̂ − γ^0 = −H̄^{−1} S(γ^0), and then take the limit of H̄ to express the asymptotic distribution. But that is not possible here, where S has non-unique solutions.
The unidentification problem manifests as a singular H̄. Specifically, the score has zero gradients in the K² directions of unidentification. Intuitively, think of the IPCA case, where the rotation matrix R is K × K. So there are K² directions in which to marginally perturb Γ^0 without changing the score; it would be constantly 0. This implies the score has zero gradient in those K² directions, meaning H̄, although an LK-square matrix, has rank only LK − K².
A rank-deficient H̄ leads to the non-uniqueness of the γ̂ solutions in equation (76). Given one γ̂ − γ^0 that solves the linear equation, there is a family of other vectors
Footnote 49: Lagrange's Mean Value Theorem only applies to functions with multiple inputs and scalar output, not to vector-valued functions. So each row of H̄ is evaluated at a different γ̄. The same holds for the additional K² rows of J̄ below. But importantly, the different γ̄'s are all in between γ^0 and γ̂ (in a linear-combination sense). This guarantees that later we can take the limit of the Hessian (see Newey and McFadden, 1994, footnote 25).
J̄ (γ̂ − γ^0) = −I(γ^0),   (78)
where J̄ := ∂I(γ)/∂γ^⊤ |_{γ=γ̄} is I's counterpart of the Hessian. Notice I is a K² × 1 vector and J̄ is a K² × LK Jacobian. So these additional K² equations pin down the additional K² degrees of freedom. Append the K² equations in (78) below (76) to form a single linearization of the estimator. Now this linear equation system has a unique solution at (γ̂ − γ^0), because the stacked [H̄; J̄] matrix of size (LK + K²) × LK is now full rank LK. To solve it, left-multiply both sides by the pseudoinverse ([H̄; J̄]^⊤ [H̄; J̄])^{−1} [H̄; J̄]^⊤:
γ̂ − γ^0 = −(H̄^⊤ H̄ + J̄^⊤ J̄)^{−1} (H̄^⊤ S(γ^0) + J̄^⊤ I(γ^0)).   (79)
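Numerically, (79) is just an ordinary least-squares solve of the stacked system; a sketch (all array names ours):

```python
import numpy as np

def solve_stacked(H_bar, J_bar, s, i):
    """Solve the stacked linearization [H_bar; J_bar] d = -[s; i] for
    d = gamma_hat - gamma0. H_bar is LK x LK but has rank LK - K^2;
    appending the K^2 Jacobian rows of the identification function
    restores full column rank."""
    A = np.vstack([H_bar, J_bar])              # (LK + K^2) x LK, full column rank
    b = -np.concatenate([s, i])
    d, *_ = np.linalg.lstsq(A, b, rcond=None)  # equals eq. (79)'s pseudoinverse formula
    return d
```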
With the estimator thus linearized, the rest of the asymptotic derivation follows canonical M-estimator analysis. Given γ̂ →_p γ^0, the mean value γ̄ between γ̂ and γ^0 goes to γ^0 as well. By the continuous mapping theorem, we find the limits of the Hessians:
plim_{N,T→∞} H̄ = H^0 := ∂S^0(γ)/∂γ^⊤ |_{γ=γ^0},   plim_{N,T→∞} J̄ = J^0 := ∂I^0(γ)/∂γ^⊤ |_{γ=γ^0}.^50   (80)
Then, taking the probability limit of equation (79) leads to line 2.a in Theorem 2.
The derivation of line 2.b is similar, save for an important difference. Let us redo this proof, but now start by applying Lagrange's Mean Value Theorem to (γ̂ − γ^0(Θ)) instead of (γ̂ − γ^0). Denote the requisite mean value as γ̿.
γ̂ − γ^0(Θ) = −(H̿^⊤ H̿ + J̄^⊤ J̄)^{−1} (H̿^⊤ S(γ^0(Θ)) + J̄^⊤ I(γ^0(Θ))).   (81)
Not only γ̂ but also γ^0(Θ) converges to γ^0. Hence the new mean value γ̿ between them converges to γ^0 as well. So the limits of H̿ and J̄ are still H^0 and J^0, which are evaluated at the same deterministic γ^0.^51
The important difference in this case is that I(γ^0(Θ)) = 0 by construction, which eliminates the I term in (81) and leads to line 2.b in the Theorem. This difference carries an important intuition that is discussed below.
The remainder of the proof only needs to show S(γ^0(Θ)) = S(γ^0) + o_p(S(γ^0)). Apply the Mean Value Theorem another time, between γ^0(Θ) and γ^0:
S(Γ^0(Θ)) − S(Γ^0) = (∂S(Γ)/∂γ^⊤ |_{γ=γ^[3]}) (γ^0(Θ) − γ^0)   (82)
 = (∂S(Γ)/∂γ^⊤ |_{γ=γ^[3]} − ∂S^0(Γ)/∂γ^⊤ |_{γ=γ^[3]}) (γ^0(Θ) − γ^0) + (∂S^0(Γ)/∂γ^⊤ |_{γ=γ^[3]}) (γ^0(Θ) − γ^0)   (83)
Notice the first term is O_p(S^0(Γ)) o_p(1) = o_p(S^0(Γ)). The second term is 0 because S^0 is constant at 0 from γ^0(Θ) to γ^0.
C.11 Lemma 3
We state and prove a general version of Lemma 3 that can be evaluated at the two specific cases.
Footnote 51: This shows the importance of keeping the deterministic γ^0 as the limiting reference point in the linearization, even though we are ultimately interested in the result about γ^0(Θ) and γ̂.
Lemma 3 simply provides two special cases of Lemma 13. In particular, there are two differences between V_X^[1] and V_Y^[1]. First, Q^0 is evaluated under either Γ^0(Θ_X) or Γ^0(Θ_Y^0). Second, the corresponding true f_t^0 also needs to be rotated, resulting in rotated values of the asymptotic variance Ω^cef.
Below is a proof of the general Lemma 13.
Proof. We evaluate the score at Γ^0 by breaking it into the six parts in Eq. 41. When evaluated at Γ^0, S_t^[2](Γ^0) = S_t^[5](Γ^0) = 0, because they contain Q_t(Γ)^⊤ Γ^0, which is zero at Γ = Γ^0. For the remaining four terms, we take them in two pairs, and only need to show the following results: as N, T → ∞,
(1/√(NT)) Σ_t vect(S_t^[1](Γ^0) + S_t^[3](Γ^0)) →_d Normal(0, V^[1]),   (84)
(1/√(NT)) Σ_t vect(S_t^[4](Γ^0) + S_t^[6](Γ^0)) →_p 0.   (85)
(1/√(NT)) Σ_t vect(Q_t(Γ^0) C_t^⊤ e_t f_t^{0⊤}) →_d Normal(0, V^[1]),   (88)
(1/√(NT)) Σ_t vect(Δ_{N,t} C_t^⊤ e_t f_t^{0⊤}) →_p 0, N, T → ∞.   (89)
For the first one, notice vect(Q^0 C_t^⊤ e_t f_t^{0⊤}) = (Q^0 ⊗ I_K) vect(C_t^⊤ e_t f_t^{0⊤}). Then it
Take expectations:
E ‖(1/√(NT)) Σ_t vect(Δ_{N,t} C_t^⊤ e_t f_t^{0⊤})‖²   (93)
 ≤ (1/(NT)) Σ_{i,j,t,s} E‖vect(Δ_{N,t} ⊗ f_t^0)‖ ‖τ_{ij,ts}‖ E‖vect(Δ_{N,s} ⊗ f_s^0)‖   (94)
 ≤ (sup_t E‖vect(Δ_{N,t} ⊗ f_t^0)‖²) ((1/(NT)) Σ_{i,j,t,s} ‖τ_{ij,ts}‖)   (95)
S_t^[4](Γ^0) + S_t^[6](Γ^0)   (96)
 = C_t^⊤ M_t(Γ^0) e_t e_t^⊤ C_t Γ^0 (Γ^{0⊤} C_t^⊤ C_t Γ^0)^{−1}   (97)
 = Q_t^N(Γ^0) [(1/N) C_t^⊤ e_t] e_t^⊤ C_t Γ^0 ((1/N) Γ^{0⊤} C_t^⊤ C_t Γ^0)^{−1}   (98)
where Q_t^N(Γ) := I_L − (1/N) C_t^⊤ C_t Γ ((1/N) Γ^⊤ C_t^⊤ C_t Γ)^{−1} Γ^⊤.
Notice that the first term in (98), Q_t^N(Γ), has a constant large-N probability limit Q^0, which is defined in the statement of Lemma 13. It is a similar situation for the last term, Γ^0 ((1/N) Γ^{0⊤} C_t^⊤ C_t Γ^0)^{−1}. Since both of these parts are functions only of C_t, their large-N convergence is bounded according to Assumption C.2. The constant large-N limits can be analyzed outside of the t-summation. Therefore, all we need to show is about
The first equation is due to time-series stationarity, the second equation is due to the cross-sectional i.i.d. property (SP.5), and the final limit is due to the cross-sectional LLN and the condition that T/N → 0. Notice the random variable is non-negative and its unconditional expectation converges; thereby we have shown its probability limit converges as well.
Similar to the evaluation of V^[1] in Subsection C.11, there are two differences between H_X^0 and H_Y^0. First, the expression of Π(Γ) and its derivative involves only Ω^cc, which does not depend on the rotation, but the derivative needs to be evaluated at either Γ^0(Θ_X) or Γ^0(Θ_Y^0). Second, the corresponding true f_t^0 also needs to be rotated. This will result in a difference in the assumed asymptotic variance V^ff.
Next is the proof of the general Lemma 14.
Notice Π_t(Γ^0) = 0. Therefore, the terms involving ∇f̃_t after taking the derivative drop out; only the ∇Π_t(Γ^0) terms survive. We write H^0 column by column. The p-th column, i.e., the derivative w.r.t. the p-th entry γ_p, simplifies to
H^0_p = E[ vect( Ω_t^cc (∂Π_t(Γ)/∂γ_p |_{γ=γ^0}) f_t^0 f_t^{0⊤} ) ].   (107)
This result does not require Assumption F and can be calculated by LLN simulation if one is interested in the general case. Now, we impose the constant-Ω_t^cc Assumption F:^52
H^0_p = vect( Ω^cc (∂Π(Γ)/∂γ_p |_{γ=γ^0}) V^ff ) = (Ω^cc ⊗ V^ff) (∂vect(Π(Γ))/∂γ_p |_{γ=γ^0}).   (108)
Finally, appending the columns together, we have H^0 = (Ω^cc ⊗ V^ff) ∂vect(Π(Γ))/∂γ^⊤ |_{γ=γ^0}.
Footnote 52: Under Assumption F, Π_t(Γ) is also time-constant and deterministic, so we drop its t subscript. We choose not to write out the deterministic derivatives ∂vect(Π(Γ))/∂γ analytically, for conciseness; they are easily calculated numerically in the simulated and empirical exercises.
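Footnote 52's numerical route can be as simple as a central-difference Jacobian; a sketch under our reconstruction of Π(Γ) = (I_L − Γ(Γ^⊤Ω^cc Γ)^{−1}Γ^⊤Ω^cc)Γ^0, with all helper names ours:

```python
import numpy as np

def dvec_pi_dgamma(Gamma0, Omega_cc, Gamma_true, h=1e-6):
    """Central-difference Jacobian of vect(Pi(Gamma)) at Gamma0, where
    Pi(Gamma) = (I_L - Gamma (Gamma' W Gamma)^{-1} Gamma' W) Gamma_true, W = Omega_cc."""
    L, K = Gamma0.shape

    def vec_pi(Gamma):
        M = Gamma @ np.linalg.solve(Gamma.T @ Omega_cc @ Gamma, Gamma.T @ Omega_cc)
        return ((np.eye(L) - M) @ Gamma_true).reshape(-1)   # row-major vect

    g0 = Gamma0.reshape(-1)
    J = np.empty((L * K, L * K))
    for p in range(L * K):
        e = np.zeros(L * K)
        e[p] = h
        J[:, p] = (vec_pi((g0 + e).reshape(L, K))
                   - vec_pi((g0 - e).reshape(L, K))) / (2 * h)
    return J
```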
J_Y^0 is much more involved, and we summarize the calculation in the following lemma.
J_Y^0[(½K(K+1)+1 : K²), p] = vecb( D_p(Γ^0, Ω^cc) V^ff + V^ff D_p^⊤(Γ^0, Ω^cc) )   (110)
Proof. We omit the "S" subscript in this proof. The first part of I is deterministic, so the first part of J^0 is easy. For the lower part of J^0:
J^0[(½K(K+1)+1 : K²), :] = plim_{N,T→∞} (∂/∂γ^⊤) vecb(I_[2]) |_{γ^0} = plim_{N,T→∞} (∂/∂γ^⊤) vecb( (1/T) Σ_t f̃_t(Γ) f̃_t^⊤(Γ) ) |_{γ^0}
(∂/∂γ_p) vecb( f̃_t(Γ) f̃_t^⊤(Γ) ) |_{γ^0} = vecb( (∂f̃_t(Γ)/∂γ_p |_{γ^0}) f_t^{0⊤} + f_t^0 (∂f̃_t^⊤(Γ)/∂γ_p |_{γ^0}) )
∂f̃_t(Γ)/∂γ_p |_{γ^0} = (∂/∂γ_p)[(Γ^⊤ Ω_t^cc Γ)^{−1} Γ^⊤ Ω_t^cc Γ^0] |_{γ^0} f_t^0 := D_p(Γ^0, Ω_t^cc) f_t^0
(∂/∂γ_p) vecb( f̃_t(Γ) f̃_t^⊤(Γ) ) |_{γ^0} = vecb( D_p(Γ^0, Ω_t^cc) f_t^0 f_t^{0⊤} + f_t^0 f_t^{0⊤} D_p^⊤(Γ^0, Ω_t^cc) )
J^0[(½K(K+1)+1 : K²), p] = E vecb( D_p(Γ^0, Ω_t^cc) f_t^0 f_t^{0⊤} + f_t^0 f_t^{0⊤} D_p^⊤(Γ^0, Ω_t^cc) )
By Assumption F, J^0[(½K(K+1)+1 : K²), p] = vecb( D_p(Γ^0, Ω^cc) V^ff + V^ff D_p^⊤(Γ^0, Ω^cc) ).
√N (Γ̂^⊤ C_t^⊤ C_t Γ̂)^{−1} Γ̂^⊤ C_t^⊤ ẽ_t(Γ̂) = √N (Γ^{0⊤} C_t^⊤ C_t Γ^0)^{−1} Γ^{0⊤} C_t^⊤ e_t + o_p(1)   (111)
 →_d Normal(0, V_t^[2]),   (112)
where V_t^[2] = (Γ^{0⊤} Ω_t^cc Γ^0)^{−1} Γ^{0⊤} Ω_t^ce Γ^0 (Γ^{0⊤} Ω_t^cc Γ^0)^{−1}, with Ω_t^ce from Assumption D(2). Notice neither Ω_t^cc nor Ω_t^ce is affected by the rotation. So one only needs to plug in the value of Γ^0 as either Γ^0(Θ_X) or Γ^0(Θ_Y^0) to evaluate V_t^[2] for the specific cases.
plim_{N,T→∞} [ √T I_[2](Γ^0) − (1/√T) Σ_t (f_t^0 f_t^{0⊤} − V^ff) ] = 0_{K×K}   (113)
By the time-series CLT in D(3): √T vecb(I_[2](Γ^0)) →_d Normal(0_{½K(K−1)×1}, V^[3]). In addition, the cross-covariances between the top and the bottom part are 0.
zero element in each column of Γ to be positive works as well (call this [3″]). Similar restrictions are seen in Stock and Watson (2002) and Bai and Ng (2013).
However, we report that the sign issue is trickier in finite-sample simulations. For example, suppose the true factors have zero (or close to zero) expectations (E[f_t^0] = 0). Then, even if the true factors are observed, their finite-sample averages are arbitrarily positive or negative, making [3′] unstable. Even if Γ̂ is estimated rather accurately, the small sign flipping of the factor mean makes a large difference in Γ̂ − Γ^0, keeping it away from converging. Similarly, [3″] also runs into finite-sample problems if Γ^0's first non-zero elements in some columns are close to zero.
So how does one design a sign restriction such that the signal-to-noise ratio in picking the sign is always maximized, adapting to any potentially peculiar model? One way is to make Γ's signs always align with those of Γ^0. So let [3‴] be the set of Γ s.t. Γ_k^⊤ Γ_k^0 > 0, ∀k. Then [3‴]'s obvious problem is that it depends on population information (Γ^0), disqualifying it as an estimation's identification condition.
Finally, we give a sign restriction that is both theoretically sound and practically easy to use in simulation exercises. Let [3] be the set of Γ s.t. Γ_k^⊤ Γ̂_k > 0, ∀k, where Γ̂ = argmin_{Γ∈[1],[2],[3″]} G(Γ). The idea is that for the Γ's within the target-minimizing set, we use [3″] (or [3′]) to pin down the signs. That gives the unique estimate. And for
smallest possible value among all possible sign combinations. To calculate that quantity in simulation exercises, one can first ignore the sign issue and calculate Γ̂ and Γ^0(Θ) both up to sign unidentification, and then pick the signs to minimize the norm between the two:
min_{s_1,...,s_K = ±1} ‖Γ̂ − Γ^0(Θ) diag{s}‖.   (114)
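In simulation code, the minimization in (114) is a brute-force search over the 2^K sign vectors, which is cheap since K is small; a sketch (our own helper):

```python
import itertools
import numpy as np

def align_signs(Gamma_hat, Gamma0):
    """Flip column signs of Gamma0 to minimize ||Gamma_hat - Gamma0 diag(s)||_F
    over all s in {-1, +1}^K, as in eq. (114)."""
    K = Gamma0.shape[1]
    best_norm, best_s = min(
        (np.linalg.norm(Gamma_hat - Gamma0 * np.array(s)), s)
        for s in itertools.product([-1.0, 1.0], repeat=K)
    )
    return Gamma0 * np.array(best_s), best_norm
```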