Robust Sparse Principal Component Regression Under The High Dimensional Elliptical Model
Abstract
In this paper we focus on principal component regression and its application to high dimensional non-Gaussian data. The major contributions are twofold. First, in low dimensions and under the Gaussian model, by borrowing strength from recent developments in minimax optimal principal component estimation, we sharply characterize for the first time the potential advantage of classical principal component regression over least square estimation. Second, we propose and analyze a new robust sparse principal component regression for high dimensional elliptically distributed data. The elliptical distribution is a semiparametric generalization of the Gaussian, including many well known distributions such as the multivariate Gaussian, rank-deficient Gaussian, t, Cauchy, and logistic. It allows the random vector to be heavy tailed and to have tail dependence. These extra flexibilities make it very suitable for modeling finance and biomedical imaging data. Under the elliptical model, we prove that our method can estimate the regression coefficients at the optimal parametric rate and is therefore a good alternative to Gaussian based methods. Experiments on synthetic and real world data are conducted to illustrate the empirical usefulness of the proposed method.
1 Introduction
Principal component regression (PCR) has been widely used in statistics for years (Kendall, 1968). Take classical linear regression with random design for example. Let $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$ be $n$ independent realizations of a random vector $\mathbf{X} \in \mathbb{R}^d$ with mean $\mathbf{0}$ and covariance matrix $\mathbf{\Sigma}$. The classical linear regression model and the simple principal component regression model can be elaborated as follows:
\[
\text{(Classical linear regression model)} \quad \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon};
\]
\[
\text{(Principal component regression model)} \quad \mathbf{Y} = \alpha \mathbf{X}\mathbf{u}_1 + \boldsymbol{\epsilon}, \tag{1.1}
\]
where $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^T \in \mathbb{R}^{n \times d}$, $\mathbf{Y} \in \mathbb{R}^n$, $\mathbf{u}_i$ is the $i$-th leading eigenvector of $\mathbf{\Sigma}$, $\boldsymbol{\epsilon} \sim N_n(\mathbf{0}, \sigma^2 \mathbf{I}_n)$ is independent of $\mathbf{X}$, $\boldsymbol{\beta} \in \mathbb{R}^d$, and $\alpha \in \mathbb{R}$. Here $\mathbf{I}_n \in \mathbb{R}^{n \times n}$ is the identity matrix. Principal component regression can then be conducted in two steps: first, we obtain an estimator $\widehat{\mathbf{u}}_1$ of $\mathbf{u}_1$; second, we project the data onto the direction of $\widehat{\mathbf{u}}_1$ and solve a simple linear regression to estimate $\alpha$.
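To make the model concrete, the following minimal Python/NumPy sketch simulates data from the principal component regression model (1.1) under a Gaussian design. The specific values of $n$, $d$, $\alpha$, $\sigma$, and the eigenvalues of $\mathbf{\Sigma}$ are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha, sigma = 100, 10, 1.0, 1.0            # illustrative choices

# Covariance with a dominant leading eigenvalue (low-rank-like structure).
eigvals = np.array([10.0] + [1.0] * (d - 1))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthonormal eigenvectors
Sigma = Q @ np.diag(eigvals) @ Q.T
u1 = Q[:, 0]                                      # leading eigenvector of Sigma

# Design matrix with rows N(0, Sigma) and response from model (1.1).
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
eps = sigma * rng.standard_normal(n)
Y = alpha * X @ u1 + eps                          # Y = alpha * X u1 + eps
beta_true = alpha * u1                            # regression coefficient beta = alpha * u1
```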
Inspecting Equation (1.1), it is easy to see that the principal component regression model is a subset of the general linear regression (LR) model with the constraint that the regression coefficient $\boldsymbol{\beta}$ is proportional to $\mathbf{u}_1$. There has been much discussion of the advantages of principal component regression over classical linear regression. In low dimensional settings, Massy (1965) pointed out that principal component regression can be much more efficient than linear regression in handling collinearity among predictors. More recently, Cook (2007) and Artemiou and Li (2009) argued that principal component regression has the potential to play an even more important role. In particular, letting $\widehat{\mathbf{u}}_j$ be the $j$-th leading eigenvector of the sample covariance matrix $\widehat{\mathbf{\Sigma}}$ of $\mathbf{x}_1, \ldots, \mathbf{x}_n$, Artemiou and Li (2009) show that, under mild conditions, with high probability the correlation between the response $\mathbf{Y}$ and $\mathbf{X}\widehat{\mathbf{u}}_i$ is higher than or equal to the correlation between $\mathbf{Y}$ and $\mathbf{X}\widehat{\mathbf{u}}_j$ when $i < j$. This indicates, although not rigorously, that principal component regression can possibly borrow strength from the low rank structure of $\mathbf{\Sigma}$, which motivates our work.
Even though the statistical performance of principal component regression in low dimensions is not fully understood, there is even less analysis of principal component regression in high dimensions, where the dimension $d$ can be exponentially larger than the sample size $n$. This is partially due to the fact that estimating the leading eigenvectors of $\mathbf{\Sigma}$ is itself difficult. For example, Johnstone and Lu (2009) show that, even under the Gaussian model, when $d/n \to \gamma$ for some $\gamma > 0$, there exist multiple settings under which $\widehat{\mathbf{u}}_1$ can be an inconsistent estimator of $\mathbf{u}_1$. To attack this "curse of dimensionality", one solution is to add a sparsity assumption on $\mathbf{u}_1$, leading to various versions of sparse PCA; see Zou et al. (2006), d'Aspremont et al. (2007), and Moghaddam et al. (2006), among others. Under (sub)Gaussian settings, minimax optimal rates have been established for estimating $\mathbf{u}_1, \ldots, \mathbf{u}_m$ (Vu and Lei, 2012; Ma, 2013; Cai et al., 2013). Very recently, Han and Liu (2013b) relaxed the Gaussian assumption by conducting a scale invariant version of sparse PCA (i.e., estimating the leading eigenvector of the correlation instead of the covariance matrix). However, their approach cannot easily be applied to estimate $\mathbf{u}_1$, and the rate of convergence they proved is not the parametric rate.
This paper improves upon the aforementioned results in two directions. First, with regard to classical principal component regression, under a double asymptotic framework in which $d$ is allowed to increase with $n$, and by borrowing very recent developments in principal component analysis (Vershynin, 2010; Lounici, 2012; Bunea and Xiao, 2012), we explicitly show for the first time the advantage of principal component regression over classical linear regression. We explicitly confirm the following two advantages of principal component regression: (i) principal component regression is insensitive to collinearity, while linear regression is very sensitive to it; (ii) principal component regression can utilize the low rank structure of the covariance matrix $\mathbf{\Sigma}$, while linear regression cannot.
Second, in high dimensions where $d$ can increase much faster, even exponentially faster, than $n$, we propose a robust method for conducting (sparse) principal component regression under a non-Gaussian elliptical model. The elliptical distribution is a semiparametric generalization of the Gaussian, relaxing the light tail and zero tail dependence constraints while preserving the symmetry property. We refer to Klüppelberg et al. (2007) for more details. This distribution family includes many well known distributions such as the multivariate Gaussian, rank-deficient Gaussian, t, logistic, and many others. Under the elliptical model, we exploit the result of Han and Liu (2013a), who showed that, by utilizing a robust covariance matrix estimator, the multivariate Kendall's tau, we can obtain an estimator $\widetilde{\mathbf{u}}_1$ that recovers $\mathbf{u}_1$ at the optimal parametric rate shown in Vu and Lei (2012). We then exploit $\widetilde{\mathbf{u}}_1$ in conducting principal component regression and show that the obtained estimator $\check{\boldsymbol{\beta}}$ can estimate $\boldsymbol{\beta}$ at the optimal $\sqrt{s\log d/n}$ rate. The optimal rates in estimating $\mathbf{u}_1$ and $\boldsymbol{\beta}$, combined with the discussion of classical principal component regression, indicate that the proposed method has the potential to handle high dimensional complex data and has advantages over high dimensional linear regression methods such as ridge regression and the lasso. These theoretical results are also backed up by numerical experiments on both synthetic and real world equity data.
Let $\mathrm{Tr}(\mathbf{M})$ be the trace of $\mathbf{M}$. Let $\lambda_j(\mathbf{M})$ be the $j$-th largest eigenvalue of $\mathbf{M}$ and $\mathbf{\Theta}_j(\mathbf{M})$ be the corresponding leading eigenvector. In particular, we let $\lambda_{\max}(\mathbf{M}) := \lambda_1(\mathbf{M})$ and $\lambda_{\min}(\mathbf{M}) := \lambda_d(\mathbf{M})$. We define $\mathbb{S}^{d-1} := \{\mathbf{v} \in \mathbb{R}^d : \|\mathbf{v}\|_2 = 1\}$ to be the $d$-dimensional unit sphere. We define the matrix $\ell_{\max}$ norm and $\ell_2$ norm as $\|\mathbf{M}\|_{\max} := \max\{|\mathbf{M}_{ij}|\}$ and $\|\mathbf{M}\|_2 := \sup_{\mathbf{v} \in \mathbb{S}^{d-1}} \|\mathbf{M}\mathbf{v}\|_2$. We define $\mathrm{diag}(\mathbf{M})$ to be the diagonal matrix with $[\mathrm{diag}(\mathbf{M})]_{jj} = \mathbf{M}_{jj}$ for $j = 1, \ldots, d$. We denote $\mathrm{vec}(\mathbf{M}) := (\mathbf{M}_{*1}^T, \ldots, \mathbf{M}_{*d}^T)^T$. For any two sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n \overset{c,C}{\asymp} b_n$ if there exist two fixed constants $c, C$ such that $c \le a_n/b_n \le C$.
2 Classical Principal Component Regression

Let $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$ be $n$ independent observations of a $d$-dimensional random vector $\mathbf{X} \sim N_d(\mathbf{0}, \mathbf{\Sigma})$, let $\mathbf{u}_1 := \mathbf{\Theta}_1(\mathbf{\Sigma})$, and let $\epsilon_1, \ldots, \epsilon_n \sim N_1(0, \sigma^2)$ be independent of each other and of $\{\mathbf{x}_i\}_{i=1}^n$. We suppose that the following principal component regression model holds:
\[
\mathbf{Y} = \alpha \mathbf{X}\mathbf{u}_1 + \boldsymbol{\epsilon}, \tag{2.1}
\]
where $\mathbf{Y} = (Y_1, \ldots, Y_n)^T \in \mathbb{R}^n$, $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]^T \in \mathbb{R}^{n \times d}$, and $\boldsymbol{\epsilon} = (\epsilon_1, \ldots, \epsilon_n)^T \in \mathbb{R}^n$. We are interested in estimating the regression coefficient $\boldsymbol{\beta} := \alpha\mathbf{u}_1$.
Let $\widehat{\boldsymbol{\beta}}$ represent the classical least square estimator, obtained without taking into account the information that $\boldsymbol{\beta}$ is proportional to $\mathbf{u}_1$. It can be expressed as
\[
\widehat{\boldsymbol{\beta}} := (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}. \tag{2.2}
\]
We then have the following proposition, which shows that the mean square error of $\widehat{\boldsymbol{\beta}}$ is highly related to the scale of $\lambda_{\min}(\mathbf{\Sigma})$.

Proposition 2.1. Under the principal component regression model shown in (2.1), we have
\[
\mathbb{E}\|\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}\|_2^2 = \frac{\sigma^2}{n - d - 1}\left(\frac{1}{\lambda_1(\mathbf{\Sigma})} + \cdots + \frac{1}{\lambda_d(\mathbf{\Sigma})}\right).
\]
Proposition 2.1 reflects the vulnerability of the least square estimator to collinearity. More specifically, when $\lambda_d(\mathbf{\Sigma})$ is extremely small, going to zero at the scale of $O(1/n)$, $\widehat{\boldsymbol{\beta}}$ can be an inconsistent estimator even when $d$ is fixed. On the other hand, using the Markov inequality, when $\lambda_d(\mathbf{\Sigma})$ is lower bounded by a fixed constant and $d = o(n)$, the rate of convergence of $\widehat{\boldsymbol{\beta}}$ is well known to be $O_P(\sqrt{d/n})$.
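As a quick sanity check of Proposition 2.1, the following sketch compares the empirical mean square error of the least square estimator with the closed-form expression above; the sample size, dimension, eigenvalues, and number of replications are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, alpha, sigma = 100, 10, 1.0, 1.0
eigvals = np.array([10.0] + [1.0] * (d - 1))      # lambda_1, ..., lambda_d
Sigma = np.diag(eigvals)
u1 = np.eye(d)[:, 0]
beta = alpha * u1

mse = []
for _ in range(2000):
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    Y = X @ beta + sigma * rng.standard_normal(n)
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]   # (X^T X)^{-1} X^T Y
    mse.append(np.sum((beta_hat - beta) ** 2))

theory = sigma**2 / (n - d - 1) * np.sum(1.0 / eigvals)  # Proposition 2.1 formula
print(np.mean(mse), theory)   # the two numbers should be close up to Monte Carlo error
```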
Motivated by Equation (2.1), the classical principal component regression estimator can be elaborated as follows.

(1) We first estimate $\mathbf{u}_1$ by the leading eigenvector $\widehat{\mathbf{u}}_1$ of the sample covariance matrix $\widehat{\mathbf{\Sigma}} := \frac{1}{n}\sum_i \mathbf{x}_i\mathbf{x}_i^T$.

(2) We then estimate $\alpha \in \mathbb{R}$ in Equation (2.1) by standard least square estimation on the projected data $\widehat{\mathbf{Z}} := \mathbf{X}\widehat{\mathbf{u}}_1 \in \mathbb{R}^n$:
\[
\widetilde{\alpha} := (\widehat{\mathbf{Z}}^T\widehat{\mathbf{Z}})^{-1}\widehat{\mathbf{Z}}^T\mathbf{Y}.
\]
The final principal component regression estimator $\widetilde{\boldsymbol{\beta}}$ is then obtained as $\widetilde{\boldsymbol{\beta}} = \widetilde{\alpha}\widehat{\mathbf{u}}_1$. We then have the following important theorem, which provides a rate of convergence of $\widetilde{\boldsymbol{\beta}}$ to $\boldsymbol{\beta}$.
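For concreteness, the two-step classical principal component regression estimator described above might be sketched in NumPy as follows, assuming `X` and `Y` have been generated as in the earlier simulation snippet.

```python
import numpy as np

def classical_pcr(X, Y):
    """Two-step classical PCR sketch: (1) leading eigenvector of the sample covariance,
    (2) least squares of Y on the projected data Z = X u1_hat."""
    n = X.shape[0]
    Sigma_hat = X.T @ X / n                      # sample covariance (mean-zero design)
    _, eigvecs = np.linalg.eigh(Sigma_hat)
    u1_hat = eigvecs[:, -1]                      # leading eigenvector
    Z = X @ u1_hat                               # projected data, in R^n
    alpha_tilde = (Z @ Y) / (Z @ Z)              # (Z^T Z)^{-1} Z^T Y
    return alpha_tilde * u1_hat                  # beta_tilde = alpha_tilde * u1_hat
```

Note that `u1_hat` is only identifiable up to sign; the sign ambiguity is absorbed by the sign of `alpha_tilde`, so the returned estimator is unaffected.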
Theorem 2.2. Let $r^*(\mathbf{\Sigma}) := \mathrm{Tr}(\mathbf{\Sigma})/\lambda_{\max}(\mathbf{\Sigma})$ represent the effective rank of $\mathbf{\Sigma}$ (Vershynin, 2010). Suppose that
\[
\|\mathbf{\Sigma}\|_2 \cdot \sqrt{\frac{r^*(\mathbf{\Sigma})\log d}{n}} = o(1).
\]
Under Model (2.1), when $\lambda_{\max}(\mathbf{\Sigma}) > c_1$ and $\lambda_2(\mathbf{\Sigma})/\lambda_1(\mathbf{\Sigma}) < C_1 < 1$ for some fixed constants $c_1$ and $C_1$, we have
\[
\|\widetilde{\boldsymbol{\beta}} - \boldsymbol{\beta}\|_2 = O_P\left(\sqrt{\frac{1}{n}} + \left(\alpha + \frac{1}{\sqrt{\lambda_{\max}(\mathbf{\Sigma})}}\right)\cdot\sqrt{\frac{r^*(\mathbf{\Sigma})\log d}{n}}\right). \tag{2.3}
\]
Theorem 2.2, compared with Proposition 2.1, conveys several important messages about the performance of principal component regression. First, compared with the least square estimator $\widehat{\boldsymbol{\beta}}$, $\widetilde{\boldsymbol{\beta}}$ is insensitive to collinearity in the sense that $\lambda_{\min}(\mathbf{\Sigma})$ plays no role in its rate of convergence. Second, when $\lambda_{\min}(\mathbf{\Sigma})$ is lower bounded by a fixed constant and $\alpha$ is upper bounded by a fixed constant, the rate of convergence of $\widehat{\boldsymbol{\beta}}$ is $O_P(\sqrt{d/n})$ while that of $\widetilde{\boldsymbol{\beta}}$ is $O_P(\sqrt{r^*(\mathbf{\Sigma})\log d/n})$, where $r^*(\mathbf{\Sigma}) := \mathrm{Tr}(\mathbf{\Sigma})/\lambda_{\max}(\mathbf{\Sigma}) \le d$ and is of order $o(d)$ when $\mathbf{\Sigma}$ has a low rank structure. These two observations, taken together, illustrate the advantages of classical principal component regression over least square estimation and justify its use. One more point is worth noting: the performance of $\widetilde{\boldsymbol{\beta}}$, unlike that of $\widehat{\boldsymbol{\beta}}$, depends on $\alpha$. When $\alpha$ is small, $\widetilde{\boldsymbol{\beta}}$ can predict more accurately.
These three observations are verified in Figure 1. Here the data are generated according to Equation (2.1) with $n = 100$, $d = 10$, $\sigma^2 = 1$, and $\mathbf{\Sigma}$ a diagonal matrix with descending diagonal values $\mathbf{\Sigma}_{ii} = \lambda_i$. In Figure 1(A), we set $\alpha = 1$, $\lambda_1 = 10$, $\lambda_j = 1$ for $j = 2, \ldots, d-1$, and vary $\lambda_d$ from $1$ to $1/100$; in Figure 1(B), we set $\alpha = 1$, $\lambda_j = 1$ for $j = 2, \ldots, d$, and vary $\lambda_1$ from $1$ to $100$; in Figure 1(C), we set $\lambda_1 = 10$, $\lambda_j = 1$ for $j = 2, \ldots, d$, and vary $\alpha$ from $0.1$ to $10$. In the three panels, the empirical mean square error is plotted against $1/\lambda_d$, $\lambda_1$, and $\alpha$, respectively. The results match the theory in each case.
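The simulation behind panel (A) can be reproduced, up to Monte Carlo noise, with a short script of the following form; the number of replications and the grid of $\lambda_d$ values are illustrative assumptions on our part.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, alpha, sigma = 100, 10, 1.0, 1.0

def mse_lr_vs_pcr(lambda_d, reps=200):
    """Empirical MSE of least squares (LR) and classical PCR as lambda_d shrinks."""
    eigvals = np.array([10.0] + [1.0] * (d - 2) + [lambda_d])
    Sigma, u1 = np.diag(eigvals), np.eye(d)[:, 0]
    beta = alpha * u1
    err_lr, err_pcr = [], []
    for _ in range(reps):
        X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
        Y = X @ beta + sigma * rng.standard_normal(n)
        beta_lr = np.linalg.lstsq(X, Y, rcond=None)[0]
        w = np.linalg.eigh(X.T @ X / n)[1][:, -1]        # leading sample eigenvector
        Z = X @ w
        beta_pcr = ((Z @ Y) / (Z @ Z)) * w
        err_lr.append(np.sum((beta_lr - beta) ** 2))
        err_pcr.append(np.sum((beta_pcr - beta) ** 2))
    return np.mean(err_lr), np.mean(err_pcr)

for lam_d in [1.0, 0.1, 0.05, 0.01]:
    print(lam_d, mse_lr_vs_pcr(lam_d))   # LR error grows as lambda_d -> 0; PCR error does not
```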
Figure 1: Justification of Proposition 2.1 and Theorem 2.2. The empirical mean square errors are plotted against $1/\lambda_d$, $\lambda_1$, and $\alpha$ in panels (A), (B), and (C), respectively. The results of classical linear regression (LR) and principal component regression (PCR) are marked by a black solid line and a red dotted line.
We would like to point out that the elliptical family is significantly larger than the Gaussian family. In fact, the Gaussian is fully parameterized by finite dimensional parameters (mean and covariance). In contrast, the elliptical is a semiparametric family, since the elliptical density can be represented as $g((\mathbf{x} - \boldsymbol{\mu})^T\mathbf{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}))$, where the function $g(\cdot)$ is completely unspecified. If we consider the "volumes" of the elliptical family and the Gaussian family with respect to the Lebesgue reference measure, the volume of the Gaussian family is zero (like a line in a three-dimensional space), while the volume of the elliptical family is positive (like a ball in a three-dimensional space).
3.2 Multivariate Kendall’s tau
As an important step in conducting principal component regression, we need to estimate $\mathbf{u}_1 = \mathbf{\Theta}_1(\mathrm{Cov}(\mathbf{X})) = \mathbf{\Theta}_1(\mathbf{\Sigma})$ as accurately as possible. Since the random variable $\xi$ in Equation (3.1) can be very heavy tailed, the corresponding elliptically distributed random vector can be heavy tailed as well. Therefore, as has been pointed out by various authors (Tyler, 1987; Croux et al., 2002; Han and Liu, 2013b), the leading eigenvector of the sample covariance matrix $\widehat{\mathbf{\Sigma}}$ can be a poor estimator of $\mathbf{u}_1 = \mathbf{\Theta}_1(\mathbf{\Sigma})$ under the elliptical distribution. This motivates the development of a robust estimator.
In particular, in this paper we consider the multivariate Kendall's tau, proposed by Choi and Marden (1998) and recently studied in depth by Han and Liu (2013a). In the following we give a brief description of this estimator. Let $\mathbf{X} \sim EC_d(\boldsymbol{\mu}, \mathbf{\Sigma}, \xi)$ and let $\widetilde{\mathbf{X}}$ be an independent copy of $\mathbf{X}$. The population multivariate Kendall's tau matrix, denoted by $\mathbf{K} \in \mathbb{R}^{d\times d}$, is defined as
\[
\mathbf{K} := \mathbb{E}\left(\frac{(\mathbf{X} - \widetilde{\mathbf{X}})(\mathbf{X} - \widetilde{\mathbf{X}})^T}{\|\mathbf{X} - \widetilde{\mathbf{X}}\|_2^2}\right). \tag{3.2}
\]
Its sample counterpart, the sample multivariate Kendall's tau matrix, is the second-order U-statistic
\[
\widehat{\mathbf{K}} := \frac{2}{n(n-1)}\sum_{i < i'}\frac{(\mathbf{x}_i - \mathbf{x}_{i'})(\mathbf{x}_i - \mathbf{x}_{i'})^T}{\|\mathbf{x}_i - \mathbf{x}_{i'}\|_2^2}, \tag{3.3}
\]
and we have that $\mathbb{E}(\widehat{\mathbf{K}}) = \mathbf{K}$. It is easy to see that $\max_{jk}|\widehat{\mathbf{K}}_{jk}| \le 1$ and $\max_{jk}|\mathbf{K}_{jk}| \le 1$. Therefore, $\widehat{\mathbf{K}}$ is a bounded matrix and hence can be a nicer statistic to work with than the sample covariance matrix. Moreover, we have the following important proposition, coming from Oja (2010), showing that $\mathbf{K}$ has the same eigenspace as $\mathbf{\Sigma}$ and $\mathrm{Cov}(\mathbf{X})$.
Proposition 3.1 (Oja (2010)). Let $\mathbf{X} \sim EC_d(\boldsymbol{\mu}, \mathbf{\Sigma}, \xi)$ be continuous and let $\mathbf{K}$ be the population multivariate Kendall's tau matrix. Then, if $\lambda_j(\mathbf{\Sigma}) \neq \lambda_k(\mathbf{\Sigma})$ for any $k \neq j$, we have
\[
\mathbf{\Theta}_j(\mathbf{\Sigma}) = \mathbf{\Theta}_j(\mathbf{K}) \quad\text{and}\quad \lambda_j(\mathbf{K}) = \mathbb{E}\left(\frac{\lambda_j(\mathbf{\Sigma})U_j^2}{\lambda_1(\mathbf{\Sigma})U_1^2 + \cdots + \lambda_d(\mathbf{\Sigma})U_d^2}\right), \tag{3.4}
\]
where $(U_1, \ldots, U_d)^T$ is uniformly distributed on the unit sphere $\mathbb{S}^{d-1}$.
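A direct (and deliberately simple, $O(n^2 d^2)$) implementation of the sample multivariate Kendall's tau and of the eigenvector extraction suggested by Proposition 3.1 might look as follows; this is a plain sketch, not the authors' code.

```python
import numpy as np

def multivariate_kendalls_tau(X):
    """Sample multivariate Kendall's tau matrix K_hat for an n x d data matrix X."""
    n, d = X.shape
    K = np.zeros((d, d))
    for i in range(n):
        for j in range(i + 1, n):
            diff = X[i] - X[j]
            sq = diff @ diff
            if sq > 0:                       # skip ties (zero pairwise difference)
                K += np.outer(diff, diff) / sq
    return 2.0 * K / (n * (n - 1))

def leading_eigvec_kendall(X):
    """Estimate Theta_1(Sigma) by the leading eigenvector of K_hat (Proposition 3.1)."""
    K = multivariate_kendalls_tau(X)
    _, vecs = np.linalg.eigh(K)
    return vecs[:, -1]
```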
The robust sparse principal component regression can then be elaborated as a two-step procedure:

(i) Inspired by the model $M_d(\mathbf{Y}, \boldsymbol{\epsilon}; \mathbf{\Sigma}, \xi, s)$ and Proposition 3.1, we consider the following optimization problem for estimating $\mathbf{u}_1 := \mathbf{\Theta}_1(\mathbf{\Sigma})$:
\[
\widetilde{\mathbf{u}}_1 = \underset{\mathbf{v} \in \mathbb{R}^d}{\arg\max}\ \mathbf{v}^T\widehat{\mathbf{K}}\mathbf{v}, \quad\text{subject to}\quad \mathbf{v} \in \mathbb{S}^{d-1} \cap B_0(s), \tag{3.6}
\]
where $B_0(s) := \{\mathbf{v} \in \mathbb{R}^d : \|\mathbf{v}\|_0 \le s\}$ and $\widehat{\mathbf{K}}$ is the sample multivariate Kendall's tau matrix. The corresponding global optimum is denoted by $\widetilde{\mathbf{u}}_1$. By Proposition 3.1, $\widetilde{\mathbf{u}}_1$ is also an estimator of $\mathbf{\Theta}_1(\mathrm{Cov}(\mathbf{X}))$ whenever the covariance matrix exists.
(ii) We then estimate $\alpha \in \mathbb{R}$ in Equation (3.5) by standard least square estimation on the projected data $\widetilde{\mathbf{Z}} := \mathbf{X}\widetilde{\mathbf{u}}_1 \in \mathbb{R}^n$:
\[
\check{\alpha} := (\widetilde{\mathbf{Z}}^T\widetilde{\mathbf{Z}})^{-1}\widetilde{\mathbf{Z}}^T\mathbf{Y}.
\]
The final principal component regression estimator $\check{\boldsymbol{\beta}}$ is then obtained as $\check{\boldsymbol{\beta}} = \check{\alpha}\widetilde{\mathbf{u}}_1$.
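Putting the pieces together, the two-step robust sparse principal component regression can be sketched as below. The sparse eigenvector step is written against a generic `sparse_leading_eigvec(K, s)` routine; any sparse PCA solver applied to $\widehat{\mathbf{K}}$, such as the truncated power method sketched later in Section 4, can play this role. The helper name is ours, not the paper's.

```python
import numpy as np

def robust_sparse_pcr(X, Y, s, sparse_leading_eigvec):
    """Two-step RPCR sketch:
    (i)  u1_tilde = s-sparse leading eigenvector of the multivariate Kendall's tau K_hat,
    (ii) least squares of Y on the projected data Z_tilde = X u1_tilde."""
    n, d = X.shape
    K = np.zeros((d, d))
    for i in range(n):                         # sample multivariate Kendall's tau K_hat
        for j in range(i + 1, n):
            diff = X[i] - X[j]
            sq = diff @ diff
            if sq > 0:
                K += np.outer(diff, diff) / sq
    K *= 2.0 / (n * (n - 1))
    u1_tilde = sparse_leading_eigvec(K, s)     # step (i): approximately solve (3.6)
    Z = X @ u1_tilde                           # step (ii): project and regress
    alpha_check = (Z @ Y) / (Z @ Z)
    return alpha_check * u1_tilde              # beta_check = alpha_check * u1_tilde
```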
3.4 Theoretical Property
In Theorem 2.2, we showed that the accuracy with which $\mathbf{u}_1$ is estimated plays an important role in conducting principal component regression. Following this discussion and the very recent results in Han and Liu (2013a), the following "easiest" and "hardest" conditions are considered. Here $\kappa_L, \kappa_U$ are two constants larger than 1.

Condition 1 ("Easiest"): $\lambda_1(\mathbf{\Sigma}) \overset{1,\kappa_U}{\asymp} d\,\lambda_j(\mathbf{\Sigma})$ for any $j \in \{2, \ldots, d\}$ and $\lambda_2(\mathbf{\Sigma}) \overset{1,\kappa_U}{\asymp} \lambda_j(\mathbf{\Sigma})$ for any $j \in \{3, \ldots, d\}$;

Condition 2 ("Hardest"): $\lambda_1(\mathbf{\Sigma}) \overset{\kappa_L,\kappa_U}{\asymp} \lambda_j(\mathbf{\Sigma})$ for any $j \in \{2, \ldots, d\}$.

In the sequel, we say that the model $M_d(\mathbf{Y}, \boldsymbol{\epsilon}; \mathbf{\Sigma}, \xi, s)$ holds if the data $(\mathbf{Y}, \mathbf{X})$ are generated from the model $M_d(\mathbf{Y}, \boldsymbol{\epsilon}; \mathbf{\Sigma}, \xi, s)$.
Under Conditions 1 and 2, we then have the following theorem, which shows that, under certain conditions, $\|\check{\boldsymbol{\beta}} - \boldsymbol{\beta}\|_2 = O_P(\sqrt{s\log d/n})$, which is the optimal parametric rate for estimating the regression coefficient (Ravikumar et al., 2008).

Theorem 3.2. Let the model $M_d(\mathbf{Y}, \boldsymbol{\epsilon}; \mathbf{\Sigma}, \xi, s)$ hold, let $|\alpha|$ in Equation (3.5) be upper bounded by a constant, and let $\|\mathbf{\Sigma}\|_2$ be lower bounded by a constant. Then, under Condition 1 or Condition 2, and for any random vector $\mathbf{X}$ such that
\[
\max_{\mathbf{v} \in \mathbb{S}^{d-1},\, \|\mathbf{v}\|_0 \le 2s} |\mathbf{v}^T(\widehat{\mathbf{\Sigma}} - \mathbf{\Sigma})\mathbf{v}| = o_P(1),
\]
we have $\|\check{\boldsymbol{\beta}} - \boldsymbol{\beta}\|_2 = O_P(\sqrt{s\log d/n})$.
Figure 2: Curves of averaged estimation error between the estimates and the true parameters for different distributions (normal, multivariate-t, EC1, and EC2, from left to right) using the truncated power method. Here $n = 100$, $d = 200$, and we are interested in estimating the regression coefficient $\boldsymbol{\beta}$. The horizontal axis represents the cardinality of the estimates' support sets and the vertical axis represents the empirical mean square error. From left to right, the minimum mean square errors for the lasso are 0.53, 0.55, 1, and 1.
4 Experiments
In this section we conduct studies on both synthetic and real-world data to investigate the empirical performance of the robust sparse principal component regression proposed in this paper. We use the truncated power algorithm proposed in Yuan and Zhang (2013) to approximate the global optimum $\widetilde{\mathbf{u}}_1$ of (3.6). Here the cardinalities of the support sets of the leading eigenvectors are treated as tuning parameters. The following three methods are considered:
lasso: the classical $L_1$-penalized regression;

PCR: sparse principal component regression using the sample covariance matrix as the sufficient statistic and exploiting the truncated power algorithm to estimate $\mathbf{u}_1$;

RPCR: the robust sparse principal component regression proposed in this paper, using the multivariate Kendall's tau as the sufficient statistic and exploiting the truncated power algorithm to estimate $\mathbf{u}_1$.
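The truncated power algorithm of Yuan and Zhang (2013) alternates a power iteration step with hard truncation to the $s$ largest entries in magnitude. A bare-bones version, usable as the `sparse_leading_eigvec` routine in the RPCR sketch above, is given below; the initialization rule and iteration count are simple assumptions on our part, not prescriptions from the paper.

```python
import numpy as np

def truncated_power_method(K, s, n_iter=100):
    """Approximate the s-sparse leading eigenvector of a symmetric PSD matrix K
    (in the spirit of Yuan and Zhang, 2013): power step, keep the s
    largest-magnitude entries, renormalize."""
    d = K.shape[0]
    # Initialize on the coordinate with the largest diagonal entry (a simple heuristic).
    v = np.zeros(d)
    v[np.argmax(np.diag(K))] = 1.0
    for _ in range(n_iter):
        w = K @ v                                   # power iteration step
        keep = np.argsort(np.abs(w))[-s:]           # indices of the s largest |entries|
        v_new = np.zeros(d)
        v_new[keep] = w[keep]
        norm = np.linalg.norm(v_new)
        if norm == 0:
            break
        v_new /= norm                               # project back to the unit sphere
        if np.linalg.norm(v_new - v) < 1e-10:
            v = v_new
            break
        v = v_new
    return v
```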
We first conduct a simulation study to back up the theoretical results and to further investigate the empirical performance of the proposed robust sparse principal component regression method. To illustrate the empirical usefulness of the proposed method, we first consider generating the data matrix $\mathbf{X}$. To generate $\mathbf{X}$, we need to specify $\mathbf{\Sigma}$ and $\xi$. In detail, let $\omega_1 > \omega_2 > \omega_3 = \ldots = \omega_d$ be the eigenvalues and $\mathbf{u}_1, \ldots, \mathbf{u}_d$ the eigenvectors of $\mathbf{\Sigma}$, with $\mathbf{u}_j := (u_{j1}, \ldots, u_{jd})^T$. The top two leading eigenvectors $\mathbf{u}_1, \mathbf{u}_2$ of $\mathbf{\Sigma}$ are specified to be sparse with $s_j := \|\mathbf{u}_j\|_0$ and $u_{jk} = 1/\sqrt{s_j}$ for $k \in [1 + \sum_{i=1}^{j-1}s_i,\ \sum_{i=1}^j s_i]$ and zero otherwise. $\mathbf{\Sigma}$ is then generated as $\mathbf{\Sigma} = \sum_{j=1}^2(\omega_j - \omega_d)\mathbf{u}_j\mathbf{u}_j^T + \omega_d\mathbf{I}_d$. Across all settings, we let $s_1 = s_2 = 10$, $\omega_1 = 5.5$, $\omega_2 = 2.5$, and $\omega_j = 0.5$ for all $j = 3, \ldots, d$. With $\mathbf{\Sigma}$ in hand, we then consider the following four different elliptical distributions (a data-generation sketch for the first two settings is given below):
(Normal) $\mathbf{X} \sim EC_d(\mathbf{0}, \mathbf{\Sigma}, \zeta_1)$ with $\zeta_1 \overset{d}{=} \chi_d$. Here $\chi_d$ is the chi distribution with $d$ degrees of freedom: for $Y_1, \ldots, Y_d \overset{i.i.d.}{\sim} N(0,1)$, $\sqrt{Y_1^2 + \ldots + Y_d^2} \overset{d}{=} \chi_d$. In this setting, $\mathbf{X}$ follows a Gaussian distribution (Fang et al., 1990).

(Multivariate-t) $\mathbf{X} \sim EC_d(\mathbf{0}, \mathbf{\Sigma}, \zeta_2)$ with $\zeta_2 \overset{d}{=} \sqrt{\nu}\,\xi_1^*/\xi_2^*$, where $\xi_1^* \overset{d}{=} \chi_d$ and $\xi_2^* \overset{d}{=} \chi_\nu$ are independent, with $\nu \in \mathbb{Z}_+$. In this setting, $\mathbf{X}$ follows a multivariate-t distribution with $\nu$ degrees of freedom (Fang et al., 1990).
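A sketch of the synthetic data generation, covering the construction of $\mathbf{\Sigma}$ described above and the Normal and multivariate-t elliptical designs, is given below. The degrees of freedom $\nu = 3$ is an illustrative assumption; $n = 100$ and $d = 200$ follow the setting reported in Figure 2.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, nu = 100, 200, 3        # nu: multivariate-t degrees of freedom (illustrative)

# Sparse leading eigenvectors u1, u2 with disjoint supports of size 10 each.
s1 = s2 = 10
u1 = np.zeros(d); u1[:s1] = 1.0 / np.sqrt(s1)
u2 = np.zeros(d); u2[s1:s1 + s2] = 1.0 / np.sqrt(s2)
w1, w2, wd = 5.5, 2.5, 0.5
Sigma = (w1 - wd) * np.outer(u1, u1) + (w2 - wd) * np.outer(u2, u2) + wd * np.eye(d)

A = np.linalg.cholesky(Sigma)                     # Sigma = A A^T

def sample_elliptical(xi):
    """Draw samples of X = xi * A U, with U uniform on the unit sphere S^{d-1}."""
    G = rng.standard_normal((len(xi), d))
    U = G / np.linalg.norm(G, axis=1, keepdims=True)
    return xi.reshape(-1, 1) * (U @ A.T)

# Normal design: xi ~ chi_d, so X ~ N(0, Sigma).
X_normal = sample_elliptical(np.sqrt(rng.chisquare(d, size=n)))

# Multivariate-t design: xi = sqrt(nu) * chi_d / chi_nu.
xi_t = np.sqrt(nu) * np.sqrt(rng.chisquare(d, size=n)) / np.sqrt(rng.chisquare(nu, size=n))
X_t = sample_elliptical(xi_t)
```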
[Figure omitted in extraction: only its axis label "Sample Quantiles" and a legend comparing lasso, PCR, and RPCR are recoverable.]
Acknowledgement
Han's research is supported by a Google Fellowship. Liu is supported by NSF Grants III-1116730 and NSF III-1332109, an NIH sub-award, and an FDA sub-award from Johns Hopkins University.
References
Artemiou, A. and Li, B. (2009). On principal components and regression: a statistical explanation of a natural
phenomenon. Statistica Sinica, 19(4):1557.
Bunea, F. and Xiao, L. (2012). On the sample covariance matrix estimator of reduced effective rank population
matrices, with applications to fPCA. arXiv preprint arXiv:1212.5321.
Cai, T. T., Ma, Z., and Wu, Y. (2013). Sparse PCA: Optimal rates and adaptive estimation. The Annals of
Statistics (to appear).
Choi, K. and Marden, J. (1998). A multivariate version of Kendall's τ. Journal of Nonparametric Statistics, 9(3):261–293.
Cook, R. D. (2007). Fisher lecture: Dimension reduction in regression. Statistical Science, 22(1):1–26.
Croux, C., Ollila, E., and Oja, H. (2002). Sign and rank covariance matrices: statistical properties and ap-
plication to principal components analysis. In Statistical data analysis based on the L1-norm and related
methods, pages 257–269. Springer.
d'Aspremont, A., El Ghaoui, L., Jordan, M. I., and Lanckriet, G. R. (2007). A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448.
Fang, K., Kotz, S., and Ng, K. (1990). Symmetric Multivariate and Related Distributions. Chapman & Hall, London.
Han, F. and Liu, H. (2013a). Optimal sparse principal component analysis in high dimensional elliptical model.
arXiv preprint arXiv:1310.3561.
Han, F. and Liu, H. (2013b). Scale-invariant sparse PCA on high dimensional meta-elliptical data. Journal of
the American Statistical Association (in press).
Johnstone, I. M. and Lu, A. Y. (2009). On consistency and sparsity for principal components analysis in high
dimensions. Journal of the American Statistical Association, 104(486).
Kendall, M. G. (1968). A course in multivariate analysis.
Klüppelberg, C., Kuhn, G., and Peng, L. (2007). Estimating the tail dependence function of an elliptical
distribution. Bernoulli, 13(1):229–251.
Lounici, K. (2012). Sparse principal component analysis with missing observations. arXiv preprint
arXiv:1205.7060.
Ma, Z. (2013). Sparse principal component analysis and iterative thresholding. The Annals of Statistics (to appear).
Massy, W. F. (1965). Principal components regression in exploratory statistical research. Journal of the Amer-
ican Statistical Association, 60(309):234–256.
Moghaddam, B., Weiss, Y., and Avidan, S. (2006). Spectral bounds for sparse PCA: Exact and greedy algorithms. Advances in Neural Information Processing Systems, 18:915.
Oja, H. (2010). Multivariate Nonparametric Methods with R: An approach based on spatial signs and ranks,
volume 199. Springer.
Ravikumar, P., Raskutti, G., Wainwright, M., and Yu, B. (2008). Model selection in Gaussian graphical models: High-dimensional consistency of ℓ1-regularized MLE. Advances in Neural Information Processing Systems (NIPS), 21.
Tyler, D. E. (1987). A distribution-free M-estimator of multivariate scatter. The Annals of Statistics, 15(1):234–251.
Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint
arXiv:1011.3027.
Vu, V. Q. and Lei, J. (2012). Minimax rates of estimation for sparse PCA in high dimensions. Journal of Machine Learning Research (AISTATS Track).
Yuan, X. and Zhang, T. (2013). Truncated power method for sparse eigenvalue problems. Journal of Machine
Learning Research, 14:899–925.
Zou, H., Hastie, T., and Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286.