
The Annals of Statistics

2016, Vol. 44, No. 6, 2323–2359


DOI: 10.1214/15-AOS1423
© Institute of Mathematical Statistics, 2016

INFLUENTIAL FEATURES PCA FOR HIGH DIMENSIONAL CLUSTERING1

BY JIASHUN JIN2 AND WANJIE WANG

Carnegie Mellon University and National University of Singapore
We consider a clustering problem where we observe feature vectors Xi ∈ R^p, i = 1, 2, . . . , n, from K possible classes. The class labels are unknown and the main interest is to estimate them. We are primarily interested in the modern regime of p ≫ n, where classical clustering methods face challenges.
We propose Influential Features PCA (IF-PCA) as a new clustering pro-
cedure. In IF-PCA, we select a small fraction of features with the largest
Kolmogorov–Smirnov (KS) scores, obtain the first (K − 1) left singular vec-
tors of the post-selection normalized data matrix, and then estimate the labels
by applying the classical k-means procedure to these singular vectors. In this
procedure, the only tuning parameter is the threshold in the feature selec-
tion step. We set the threshold in a data-driven fashion by adapting the recent
notion of Higher Criticism. As a result, IF-PCA is a tuning-free clustering
method.
We apply IF-PCA to 10 gene microarray data sets. The method has competitive clustering performance. In particular, in three of the data sets, the error rates of IF-PCA are only 29% or less of the error rates of other methods. We have also rediscovered a phenomenon on the empirical null by Efron
[J. Amer. Statist. Assoc. 99 (2004) 96–104] on microarray data.
With delicate analysis, especially post-selection eigen-analysis, we derive
tight probability bounds on the Kolmogorov–Smirnov statistics and show that
IF-PCA yields clustering consistency in a broad context. The clustering prob-
lem is connected to the problems of sparse PCA and low-rank matrix recov-
ery, but it is different in important ways. We reveal an interesting phase tran-
sition phenomenon associated with these problems and identify the range of
interest for each.

1. Introduction. Consider a clustering problem where we have feature vec-


tors Xi ∈ R^p, i = 1, 2, . . . , n, from K possible classes. For simplicity, we assume
K is small and is known to us. The class labels y1, y2, . . . , yn take values in
{1, 2, . . . , K}, but are unfortunately unknown to us, and the main interest is to es-
timate them.

Received July 2014; revised December 2015.


1 Discussed in 10.1214/16-AOS1423A, 10.1214/16-AOS1423B, 10.1214/16-AOS1423C,
10.1214/16-AOS1423D; rejoinder at 10.1214/16-AOS1501.
2 Supported in part by NSF Grants DMS-12-08315 and DMS-15-13414.
MSC2010 subject classifications. Primary 62H30, 62G32; secondary 62E20.
Key words and phrases. Empirical null, feature selection, gene microarray, Hamming distance,
phase transition, post-selection spectral clustering, sparsity.

TABLE 1
Gene microarray data sets investigated in this paper. Note that K is small and p ≫ n
(p: number of genes; n: number of subjects)

#   Data name        Abbreviation  Source                     K   n    p
1   Brain            Brn           Pomeroy (02)               5   42   5597
2   Breast Cancer    Brst          Wang et al. (05)           2   276  22,215
3   Colon Cancer     Cln           Alon et al. (99)           2   62   2000
4   Leukemia         Leuk          Golub et al. (99)          2   72   3571
5   Lung Cancer(1)   Lung1         Gordon et al. (02)         2   181  12,533
6   Lung Cancer(2)   Lung2         Bhattacharjee et al. (01)  2   203  12,600
7   Lymphoma         Lymp          Alizadeh et al. (00)       3   62   4026
8   Prostate Cancer  Prst          Singh et al. (02)          2   102  6033
9   SRBCT            SRB           Kahn (01)                  4   63   2308
10  SuCancer         Su            Su et al. (01)             2   174  7909

Our study is largely motivated by clustering using gene microarray data. In a


typical setting, we have patients from several different classes (e.g., normal, dis-
eased), and for each patient, we have measurements (gene expression levels) on
the same set of genes. The class labels of the patients are unknown and it is of
interest to use the expression data to predict them.
Table 1 lists 10 gene microarray data sets (arranged alphabetically). Data sets
1, 3, 4, 7, 8 and 9 were analyzed and cleaned in Dettling (2004), Data set 5 is
from Gordon et al. (2002), Data sets 2, 6, 10 were analyzed and grouped into
two classes in Yousefi et al. (2010), among which Data set 10 was cleaned by
us in the same way as by Dettling (2004). All the data sets can be found at
www.stat.cmu.edu/~jiashun/Research/software/GenomicsData. The data sets are
analyzed in Section 1.4, after our approach is fully introduced.
In these data sets, the true labels are given but (of course) we do not use them
for clustering; the true labels are thought of as the “ground truth” and are only used
for comparing the error rates of different methods.
View each Xi as the sum of a “signal component” and a “noise component”:
(1.1) Xi = E[Xi ] + Zi , Zi ≡ Xi − E[Xi ].

For any numbers a1, a2, . . . , ap, let diag(a1, a2, . . . , ap) be the p × p diagonal
matrix whose ith diagonal entry is ai, 1 ≤ i ≤ p. We assume
(1.2) Zi ∼ N(0, Σ) i.i.d., where Σ = diag(σ²(1), σ²(2), . . . , σ²(p)),
and the vector σ = (σ(1), σ(2), . . . , σ(p))' is unknown to us. Assumption (1.2)
is only for simplicity: our method to be introduced below is not tied to such an
assumption, and works well with most of the data sets in Table 1; see Sections 1.1
and 1.4 for more discussion.

Denote the overall mean vector by μ̄ = (1/n) Σ_{i=1}^n E[Xi]. For K different vectors
μ1, μ2, . . . , μK ∈ R^p, we model E[Xi] by (yi are the class labels)
(1.3) E[Xi] = μ̄ + μk if and only if yi = k.
For 1 ≤ k ≤ K, let δk be the fraction of samples in Class k. Note that
(1.4) δ1 μ1 + δ2 μ2 + · · · + δK μK = 0,
so μ1 , μ2 , . . . , μK are linearly dependent. However, it is natural to assume
(1.5) μ1 , μ2 , . . . , μK−1 are linearly independent.

DEFINITION 1.1. We call feature j a useless feature (for clustering) if


μ1 (j ) = μ2 (j ) = · · · = μK (j ) = 0, and a useful feature otherwise.

We call μk the contrast mean vector of Class k, 1 ≤ k ≤ K. In many applica-


tions, the contrast mean vectors are sparse in the sense that only a small fraction
of the features are useful. Examples include but are not limited to gene microar-
ray data: it is widely believed that only a small fraction of genes are differentially
expressed, so the contrast mean vectors are sparse.
We are primarily interested in the modern regime of p ≫ n. In such a regime,
classical methods (e.g., k-means, hierarchical clustering, Principal Component
Analysis (PCA) [Hastie, Tibshirani and Friedman (2009)]) are either computa-
tionally challenging or ineffective. Our primary interest is to develop new methods
that are appropriate for this regime.

1.1. Influential features PCA (IF-PCA). Denote the data matrix by
X = [X1, X2, . . . , Xn]'.
We propose IF-PCA as a new spectral clustering method. Conceptually, IF-PCA
contains an IF part and a PCA part. In the IF part, we select features by exploiting
the sparsity of the contrast mean vectors, where we remove many columns of X
leaving only those we think are influential for clustering (and so the name of Influ-
ential Features). In the PCA part, we apply the classical PCA to the post-selection
data matrix.3
We normalize each column of X and denote the resultant matrix by W:
W(i, j) = [Xi(j) − X̄(j)]/σ̂(j), 1 ≤ i ≤ n, 1 ≤ j ≤ p,
where X̄(j) = (1/n) Σ_{i=1}^n Xi(j) and σ̂(j) = [(1/(n−1)) Σ_{i=1}^n (Xi(j) − X̄(j))²]^{1/2} are the
empirical mean and standard deviation associated with feature j, respectively.
Write
W = [W1, W2, . . . , Wn]'.

3 Such a two-stage clustering idea (i.e., feature selection followed by post-selection clustering)
is not completely new and can be found in Chan and Hall (2010), for example. Of course, their
procedure is very different from ours.

For any 1 ≤ j ≤ p, denote the empirical CDF associated with feature j by
Fn,j(t) = (1/n) Σ_{i=1}^n 1{Wi(j) ≤ t}.
IF-PCA contains two “IF” steps and two “PCA” steps as follows.
Input: data matrix X, number of classes K, and parameter t.
Output: predicted n × 1 label vector ŷ_t^IF = (ŷ_{t,1}^IF, ŷ_{t,2}^IF, . . . , ŷ_{t,n}^IF)'.
• IF-1. For each 1 ≤ j ≤ p, compute a Kolmogorov–Smirnov (KS) statistic by
(1.6) ψn,j = √n · sup_{−∞<t<∞} |Fn,j(t) − Φ(t)|, Φ: CDF of N(0, 1).
• IF-2. Following the suggestions by Efron (2004), we renormalize by
(1.7) ψ*n,j = [ψn,j − mean of all p KS-scores]/SD of all p KS-scores.4
• PCA-1. Fix a threshold t > 0. For short, let W^(t) be the matrix formed by restricting the columns of W to the set of retained indices Ŝp(t), where
(1.8) Ŝp(t) = {1 ≤ j ≤ p : ψ*n,j ≥ t}.
Let Û^(t) ∈ R^{n,K−1} be the matrix consisting of the first (K − 1) (unit-norm) left singular vectors of W^(t).5 Define a matrix Û*^(t) ∈ R^{n,K−1} by truncating Û^(t) entry-wise with threshold Tp = log(p)/√n.6
• PCA-2. Cluster by applying the classical k-means to Û*^(t), assuming there are ≤ K classes. Let ŷ_t^IF be the predicted label vector. (A schematic sketch of these four steps is given right after this list.)
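To make the four steps concrete, here is a minimal Python sketch of IF-PCA for a given threshold t. It is our own illustration rather than the implementation used in the paper: it follows (1.6)–(1.8) but omits the entry-wise truncation (which, as footnote 6 notes, is not used in numerical work), and the helper names ks_scores and if_pca are ours. The k-means call uses 30 random restarts, mirroring the "replicates = 30" setting described in Section 1.4.

import numpy as np
from scipy.stats import norm
from sklearn.cluster import KMeans

def ks_scores(W):
    """KS statistic (1.6) for each column of the normalized data matrix W (n x p)."""
    n, p = W.shape
    psi = np.empty(p)
    for j in range(p):
        x = np.sort(W[:, j])
        F = norm.cdf(x)                      # Phi evaluated at the sorted column
        up = np.arange(1, n + 1) / n - F     # sup taken over the jump points of F_{n,j}
        lo = F - np.arange(0, n) / n
        psi[j] = np.sqrt(n) * max(up.max(), lo.max())
    return psi

def if_pca(X, K, t):
    """IF-PCA for a given threshold t: select features by re-normalized KS scores,
    then run k-means on the first K-1 left singular vectors of the post-selection matrix."""
    W = (X - X.mean(0)) / X.std(0, ddof=1)          # column-wise normalization
    psi = ks_scores(W)
    psi_star = (psi - psi.mean()) / psi.std()       # step IF-2, Efron-type renormalization
    S = psi_star >= t                               # retained features, cf. (1.8)
    U, _, _ = np.linalg.svd(W[:, S], full_matrices=False)
    return KMeans(n_clusters=K, n_init=30).fit_predict(U[:, :K - 1])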
In the procedure, t is the only tuning parameter. In Section 1.3, we propose a
data-driven approach to choosing t, so the method becomes tuning-free. Step IF-2 is
largely for gene microarray data, and is not necessary if models (1.1)–(1.2) hold.
In Table 2, we use the Lung Cancer(1) data to illustrate how IF-PCA performs
with different choices of t. The results show that with t properly set, the number
of clustering errors of IF-PCA can be as low as 4. In comparison, classical PCA
(column 2 of Table 2; where t = 0.000 so we do not perform feature selection) has
22 clustering errors.
In Figure 1, we compare IF-PCA with classical PCA by investigating Û^(t) defined
in Step PCA-1 for two choices of t: (a) t = 0.000, so Û^(t) is the first singular

4 Alternatively, we can normalize the KS-scores with the sample median and Median Absolute Deviation (MAD); see Section 1.5 for more discussion.
5 For a matrix M ∈ R^{n,m}, the kth left (right) singular vector is the eigenvector associated with the kth largest eigenvalue of the matrix MM' (of the matrix M'M).
6 That is, Û*^(t)(i, k) = Û(i, k)1{|Û(i, k)| ≤ Tp} + Tp sgn(Û(i, k))1{|Û(i, k)| > Tp}, 1 ≤ i ≤ n, 1 ≤ k ≤ K − 1. We usually take Tp = log(p)/√n as above, but log(p) can be replaced by any sequence that tends to ∞ as p → ∞. The truncation is mostly for the theoretical analysis in Section 2 and is not used in the numerical study (real or simulated data).

TABLE 2
Clustering errors and # of selected features for different choices of t [Lung Cancer(1) data].
Columns highlighted correspond to the sweet spot of the threshold choice

Threshold t 0.000 0.608 0.828 0.938 1.048 1.158 1.268 1.378 1.488
# of selected features 12,533 5758 1057 484 261 129 63 21 2
Clustering errors 22 22 24 4 5 7 38 39 33

vector of pre-selection data matrix W , and (b) a data-driven threshold choice by


Higher Criticism to be introduced in Section 1.3. For (b), the entries of Û (t) can be
clearly divided into two groups, yielding almost error-free clustering results. Such
a clear separation does not exist for (a). These results suggest that IF-PCA may
significantly improve classical PCA.
Two important questions arise:
• In (1.7), we use a modified KS statistic for feature selection. What is the ratio-
nale behind the use of KS statistics and the modification?
• The clustering errors critically depend on the threshold t. How to set t in a data-
driven fashion?
In Section 1.2, we address the first question. In Section 1.3, we propose a data-
driven threshold choice by the recent notion of Higher Criticism.

1.2. KS statistic, normality assumption, and Efron’s empirical null. The goal
in Steps 1–2 is to find an easy-to-implement method to rank the features. The focus
of Step 1 is on a data matrix satisfying models (1.1)–(1.5), and the focus of Step 2
is to adjust Step 1 in a way so to work well with microarray data. We consider two
steps separately.

FIG. 1. Comparison of Û^(t) for t = 0.000 (left; no feature selection) and t = 1.057 (right; t is set
by Higher Criticism in a data-driven fashion); note Û^(t) is an n × 1 vector since K = 2. y-axis:
entries of Û^(t); x-axis: sample indices. Plots are based on the Lung Cancer(1) data, where ADCA and
MPM represent the two classes.

Consider the first step. The interest is to test for each fixed j , 1 ≤ j ≤ p, whether
feature j is useless or useful. Since we have no prior information about the class
labels, the problem can be reformulated as that of testing whether all n samples
associated with the jth feature are i.i.d. Gaussian,
(1.9) H0,j : Xi(j) ∼ N(μ̄(j), σ²(j)) i.i.d., i = 1, 2, . . . , n,
or are i.i.d. draws from a K-component heterogeneous Gaussian mixture,
(1.10) H1,j : Xi(j) ∼ Σ_{k=1}^K δk N(μ̄(j) + μk(j), σ²(j)) i.i.d., i = 1, 2, . . . , n,

where δk > 0 is the prior probability that Xi (j ) comes from Class k, 1 ≤ k ≤


K. Note that μ̄(j ), σ (j ) and ((δ1 , μ1 (j )), . . . , (δK , μK (j ))) are unknown. The
above is a well-known difficult testing problem. For example, in such a setting, the
classical Likelihood Ratio Test (LRT) is known to be not well behaved [e.g., Chen
and Li (2009)].
Our proposal is to use the Kolmogorov–Smirnov (KS) test, which measures the
maximum difference between the empirical CDF for the normalized data and the
CDF of N(0, 1). The KS test is a well-known goodness-of-fit test [e.g., Shorack
and Wellner (1986)]. In the idealized Gaussian model (1.9)–(1.10), the KS test is
asymptotically equivalent to the optimal moment-based tests (e.g., see Section 2),
but its success is not tied to a specific model for the alternative hypothesis, and
is more robust against occasional outliers. Also, Efron’s null correction (below) is
more successful if we use KS instead of moment-based tests for feature ranking.
This is our rationale for Step 1.
We now discuss our rationale for Step 2. We discover an interesting phe-
nomenon which we illustrate with Figure 2 [Lung Cancer(1) data]. Ideally, if the

FIG. 2. Left: the histogram of KS-scores of the Lung Cancer(1) data. The blue and red curves denote
the theoretical null and empirical null densities, respectively. Right: the empirical survival function
of the adjusted KS-scores of the Lung Cancer(1) data (red) and the survival function of the theoretical
null (blue).

normality assumption (1.2) is valid for this data set, then the density function of the
KS statistic for model (1.9) (the blue curve in left panel; obtained by simulations)
should fit well with the histogram of the KS-scores based on the Lung Cancer(1)
data. Unfortunately, this is not the case, and there is a substantial discrepancy in
fitting. On the other hand, if we translate and rescale the blue curve so that it has
the same mean and standard deviation as the KS-scores associated with Lung Can-
cer(1) data, then the new curve (red curve; left panel of Figure 2) fits well with the
histogram.7
A related phenomenon was discussed in Efron (2004), though there the focus was
on Studentized t-statistics in a different setting. As in Efron (2004), we call the density
functions associated with two curves (blue and red) the theoretical null and the
empirical null, respectively. The phenomenon is then: the theoretical null has a
poor fit with the histogram of the KS-scores of the real data, but the empirical null
may have a good fit.
In the right panel of Figure 2, we view this from a slightly different perspective,
and show that the survival function associated with the adjusted KS-scores (i.e.,
ψ*n,j) of the real data fits well with the theoretical null.
The above observations explain the rationale for Step 2. Also, they suggest that
IF-PCA does not critically depend on the normality assumption and works well for
microarray data. This is further validated in Section 1.4.

REMARK. Efron (2004) suggests several possible reasons (e.g., dependence
between different samples, dependence between the genes) for the discrepancy
between the theoretical null and empirical null, but what has really caused such
a discrepancy is not fully understood. Whether Efron’s empirical null is useful in
other application areas or other data types (and if so, to what extent) is also an open
problem, and to understand it we need a good grasp on the mechanism by which
the data sets of interest are generated.

1.3. Threshold choice by Higher Criticism. The performance of IF-PCA crit-


ically depends on the threshold t, and it is of interest to set t in a data-driven
fashion. We approach this by the recent notion of Higher Criticism.
Higher Criticism (HC) was first introduced in Donoho and Jin (2004) as a
method for large-scale multiple testing. In Donoho and Jin (2008), HC was also
found to be useful to set a threshold for feature selection in the context of classi-
fication. HC is also useful in many other settings. See Donoho and Jin (2015), Jin
and Ke (2016) for reviews on HC.
To adapt HC for threshold choice in IF-PCA, we must modify the procedure
carefully, since the purpose is very different from those in previous literature. The
approach contains three simple steps as follows.

7 If we replace the sample mean and standard deviation by the sample median and MAD, respectively, then it gives rise to the normalization in the second footnote of Section 1.1.

TABLE 3
Pseudocode for IF-HCT-PCA (for microarray data; threshold set by Higher Criticism)

Input: data matrix X, number of classes K. Output: class label vector ŷ_HC^IF.
1. Rank features: Let ψn,j be the KS-scores as in (1.6) and F0 be the CDF of ψn,j under the null, 1 ≤ j ≤ p.
2. Normalize KS-scores: ψ*n = (ψn − mean(ψn))/SD(ψn).
3. Threshold choice by HCT: Calculate P-values by πj = 1 − F0(ψ*n,j), 1 ≤ j ≤ p, and sort them as π(1) < π(2) < · · · < π(p). Define HCp,j = √p (j/p − π(j)) / [ j/p + max{√n (j/p − π(j)), 0} ]^{1/2}, and let ĵ = argmax_{j: π(j) > log(p)/p, j < p/2} {HCp,j}. The HC threshold t_p^HC is the ĵth largest KS-score.
4. Post-selection PCA: Define the post-selection data matrix W^(HC) (i.e., the sub-matrix of W consisting of all columns j of W with ψ*n,j > t_p^HC). Let U ∈ R^{n,K−1} be the matrix of the first (K − 1) left singular vectors of W^(HC). Cluster by ŷ_HC^IF = kmeans(U, K).

• For 1 ≤ j ≤ p, calculate a P-value πj = 1 − F0(ψn,j), where F0 is the distribution of ψn,j under the null (i.e., feature j is useless).
• Sort all P-values in ascending order: π(1) < π(2) < · · · < π(p).
• Define the Higher Criticism score by
(1.11) HCp,j = √p (j/p − π(j)) / [ j/p + max{√n (j/p − π(j)), 0} ]^{1/2}.
Let ĵ be the index such that ĵ = argmax_{1≤j≤p/2, π(j)>log(p)/p} {HCp,j}. The HC threshold t_p^HC for IF-PCA is then the ĵth largest KS-score. (A schematic sketch of this threshold choice is given right after this list.)
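A minimal sketch (ours, not the authors' implementation) of the three steps above. The null distribution F0 is approximated by simulating KS-scores of studentized N(0,1) samples, in the spirit of the P-value computation described in Section 1.4; the function names and Monte Carlo sizes are illustrative assumptions. In IF-HCT-PCA, psi_star would be the adjusted KS-scores ψ*n,j from the previous sketch.

import numpy as np
from scipy.stats import norm

def null_ks_scores(n, n_rep, rng):
    """Simulate KS-scores (1.6) under the null: i.i.d. N(0,1) samples of size n,
    studentized columnwise exactly as the real data are."""
    X = rng.standard_normal((n, n_rep))
    W = (X - X.mean(0)) / X.std(0, ddof=1)
    W.sort(axis=0)
    F = norm.cdf(W)
    i = np.arange(1, n + 1)[:, None]
    return np.sqrt(n) * np.maximum(i / n - F, F - (i - 1) / n).max(axis=0)

def hc_threshold(psi_star, null_scores, n):
    """Higher Criticism threshold: P-values from the simulated null CDF F0,
    then the HC functional (1.11), maximized over the allowed range of j."""
    p = len(psi_star)
    F0_sorted = np.sort(null_scores)
    pv = 1.0 - np.searchsorted(F0_sorted, psi_star, side="right") / len(F0_sorted)
    pv_sorted = np.sort(pv)
    j = np.arange(1, p + 1)
    hc = np.sqrt(p) * (j / p - pv_sorted) / np.sqrt(
        j / p + np.maximum(np.sqrt(n) * (j / p - pv_sorted), 0.0))
    ok = (pv_sorted > np.log(p) / p) & (j < p / 2)
    j_hat = j[ok][np.argmax(hc[ok])]
    return np.sort(psi_star)[::-1][j_hat - 1]       # the j_hat-th largest adjusted KS-score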
Combining HCT with IF-PCA gives a tuning-free clustering procedure IF-HCT-
PCA, or IF-PCA for short if there is no confusion. See Table 3.
For illustration, we again employ the Lung Cancer(1) data. In this data set,
ĵ = 251, t_p^HC = 1.0573, and HC selects the 251 genes with the largest KS-scores.
In Figure 3, we plot the error rates of IF-PCA applied to the k features of W with
the largest KS-scores, where k ranges from 1 to p/2 (for different k, we use
the same ranking of all p genes). The figure shows that there is a “sweet spot”
for k where the error rates are the lowest. HCT corresponds to ĵ = 251, and 251 is

FIG. 3. Error rates of IF-PCA (y-axis) for different numbers of selected features k (x-axis) [Lung
Cancer(1) data]. HCT corresponds to 251 selected features (dashed vertical line).

in this sweet spot. This suggests that HCT gives a reasonable threshold choice, at
least for some real data sets.

REMARK. When we apply HC to microarray data, we follow the discussion
in Section 1.2 and take F0 to be the distribution of ψn,j under the null, but with
the mean and variance adjusted to match those of the KS-scores. In the definition,
we require π(ĵ) > log(p)/p, as HCp,j may be ill-behaved for very small j [e.g.,
Donoho and Jin (2004)].

The rationale for HCT can also be explained theoretically. For illustration, consider the case where K = 2, so we only have two classes. Fixing a threshold t > 0,
let Û^(t) be the first left singular vector of W^(t) as in Section 1.1. In a companion
paper [Jin, Ke and Wang (2015a)], we show that when the signals are rare and
weak, then for t in the range of interest,
(1.12) Û^(t) ∝ snr(t) · U + z + rem,
where U is an n × 1 non-stochastic vector with only two distinct entries (each determining one of the two classes), snr(t) is a non-stochastic function of t, z ∼ N(0, In),
and rem is a remainder term [whose entries are asymptotically of much smaller
magnitude than those of z or snr(t) · U]. Therefore, the performance of IF-PCA
is best when we maximize snr(t) (though this quantity is unobservable). We call such a
threshold the Ideal Threshold: t_p^ideal = argmax_{t>0} {snr(t)}.

Let F̄p(t) be the survival function of ψn,j under the null (it does not depend on j),
and let Ĝp(t) = (1/p) Σ_{j=1}^p 1{ψn,j ≥ t} be the empirical survival function. Introduce
HCp(t) = √p [Ĝp(t) − F̄p(t)] / [ Ĝp(t) + max{√n (Ĝp(t) − F̄p(t)), 0} ]^{1/2},
and let ψ(1) > ψ(2) > · · · > ψ(p) be the sorted values of ψn,j. Recall that
π(k) is the kth smallest P-value. By definition, Ĝp(t)|_{t=ψ(k)} = k/p
and F̄p(t)|_{t=ψ(k)} = π(k). As a result,
HCp(t)|_{t=ψ(k)} = √p [k/p − π(k)] / [ k/p + max{√n (k/p − π(k)), 0} ]^{1/2},
where the right-hand side is the form of HC introduced in (1.11). Note that HCp(t) is a function that is
discontinuous only at t = ψ(k), 1 ≤ k ≤ p, and between two adjacent points of discontinuity the function
is monotone. Combining this with the definition of t_p^HC, t_p^HC = argmax_t {HCp(t)}.
Now, as p → ∞, some regularity appears, and Ĝp(t) converges to a non-stochastic counterpart, denoted by Ḡp(t), which can be viewed as the survival
function associated with the marginal density of ψn,j. Introduce
IdealHC(t) = √p [Ḡp(t) − F̄p(t)] / [ Ḡp(t) + max{√n (Ḡp(t) − F̄p(t)), 0} ]^{1/2}
as the ideal counterpart of HCp(t). It is seen that HCp(t) ≈ IdealHC(t) for t in the range of interest,
and so t_p^HC ≈ t_p^idealHC, where the latter is defined as the non-stochastic threshold t
that maximizes IdealHC(t).

In Jin, Ke and Wang (2015a), we show that under a broad class of rare and weak

signal models, the leading term of the Taylor expansion of snr(t) is proportional to
that of IdealHC(t) for t in the range of interest, and so tpidealHC ≈ tpideal . Combining
this with the discussions above, we have tpHC ≈ tpidealHC ≈ tpideal , which explains the
rationale for HCT.
The above relationships are justified in Jin, Ke and Wang (2015a). The proofs
are rather long (70 manuscript pages in Annals of Statistics format), so we
report them in a separate paper. The ideas above are similar to those in Donoho and
Jin (2008), but the focus there is on classification while ours is on clustering;
our version of HC is also very different from theirs.

1.4. Applications to gene microarray data. We compare IF-HCT-PCA with


four other clustering methods (applied to the normalized data matrix W directly,
without feature selection): (1) SpectralGem [Lee, Luca and Roeder (2010)] which
is the same as classical PCA introduced earlier, (2) classical k-means, (3) hierar-
chical clustering [Hastie, Tibshirani and Friedman (2009)], and (4) k-means++
[Arthur and Vassilvitskii (2007)]. In theory, k-means is NP-hard, but heuristic algorithms are available; we use the built-in k-means package in Matlab with the
parameter “replicates” equal to 30, so that the algorithm randomly samples initial cluster centroid positions 30 times (in the last step of either classical PCA
or IF-HCT-PCA, k-means is also used, again with 30 “replicates”). The k-means++ algorithm [Arthur and Vassilvitskii (2007)] is a recent modification
of k-means. It improves the performance of k-means in some numerical studies,
though the problem remains NP-hard in theory. For hierarchical clustering, we use
“complete” as the linkage function; other choices give more or less the same re-
sults. In IF-HCT-PCA, the P-values associated with the KS-scores are computed
using simulated KS-scores under the null with 2 × 10³ × p independent replications; see Section 1.3 for remarks on F0. In Table 3, we repeat the main steps of
IF-HCT-PCA for clarification, by presenting the pseudocode.
We applied all 5 methods to each of the 10 gene microarray data sets in Table 1.
The results are reported in Table 4. Since all methods except hierarchical clustering
have algorithmic randomness (they depend on built-in k-means package in Matlab
which uses a random start), we report the mean error rate based on 30 independent
replications. The standard deviation of all methods is very small (< 0.0001) except
for k-means++, so we only report the standard deviation of k-means++. In the
last column of Table 4,
(1.13) r = (error rate of IF-HCT-PCA) / (minimum of the error rates of the other 4 methods).
We find that r < 1 for all data sets except for two. In particular, r ≤ 0.29 for
three of the data sets, marking a substantial improvement, and r ≤ 0.87 for three
other data sets, marking a moderate improvement. The r-values also suggest an

TABLE 4
Comparison of clustering error rates by different methods for the 10 gene microarray data sets
introduced in Table 1. Column 5: numbers in the brackets are the standard deviations (SD); SD for
all other methods are negligible so are not reported. Last column: see (1.13)

# Data set K kmeans kmeans++ Hier SpecGem IF-HCT-PCA r

1 Brain 5 0.286 0.427 (0.09) 0.524 0.143 0.262 1.83


2 Breast Cancer 2 0.442 0.430 (0.05) 0.500 0.438 0.406 0.94
3 Colon Cancer 2 0.443 0.460 (0.07) 0.387 0.484 0.403 1.04
4 Leukemia 2 0.278 0.257 (0.09) 0.278 0.292 0.069 0.27
5 Lung Cancer(1) 2 0.116 0.196 (0.09) 0.177 0.122 0.033 0.29
6 Lung Cancer(2) 2 0.436 0.439 (0.00) 0.301 0.434 0.217 0.72
7 Lymphoma 3 0.387 0.317 (0.13) 0.468 0.226 0.065 0.29
8 Prostate Cancer 2 0.422 0.432 (0.01) 0.480 0.422 0.382 0.91
9 SRBCT 4 0.556 0.524 (0.06) 0.540 0.508 0.444 0.87
10 SuCancer 2 0.477 0.459 (0.05) 0.448 0.489 0.333 0.74

interesting point: for “easier” data sets, IF-PCA tends to have more improvements
over the other 4 methods.
We make several remarks. First, for the Brain data set, unexpectedly, IF-PCA
underperforms classical PCA, but still outperforms other methods. Among our data
sets, the Brain data seem to be an “outlier”. Possible reasons include (a) useful
features are not sparse, and (b) the sample size is very small (n = 42) so the useful
features are individually very weak. When (a)–(b) happen, it is almost impossible
to successfully separate the useful features from useless ones, and it is preferable
to use classical PCA. Such a scenario may be found in Jin, Ke and Wang (2015b);
see, for example, Figure 1 (left) and related context therein.
Second, for Colon Cancer, all methods behave unsatisfactorily, and IF-PCA
slightly underperforms hierarchical clustering (r = 1.04). The data set is known
to be a difficult one even for classification (where class labels of training samples
are known [Donoho and Jin (2008)]). For such a difficult data set, it is hard for
IF-PCA to significantly outperform other methods.
Last, for the SuCancer data set, the KS-scores are significantly skewed to the
right. Therefore, instead of using the normalization (1.7), we normalize ψn,j such
that the mean and standard deviation for the lower 50% of KS-scores match those
for the lower 50% of the simulated KS-scores under the null; compare this with
Section 1.3 for remarks on P -value calculations.

1.5. Three variants of IF-HCT-PCA. First, in IF-HCT-PCA, we normalize the KS-scores with the sample mean and sample standard deviation as
in (1.7). Alternatively, we may normalize the KS-scores by ψ*n,j = [ψn,j −
median of all KS-scores]/[MAD of all KS-scores] (MAD: Median Absolute Deviation), while the other steps of IF-HCT-PCA are kept intact. Denote the resultant

TABLE 5
Clustering error rates of IF-HCT-PCA, IF-HCT-PCA-med, IF-HCT-kmeans and IF-HCT-hier

Brn Brst Cln Leuk Lung1 Lung2 Lymp Prst SRB Su

IF-HCT-PCA 0.262 0.406 0.403 0.069 0.033 0.217 0.065 0.382 0.444 0.333
IF-HCT-PCA-med 0.333 0.424 0.436 0.014 0.017 0.217 0.097 0.382 0.206 0.333
IF-HCT-kmeans 0.191 0.380 0.403 0.028 0.033 0.217 0.032 0.382 0.401 0.328
IF-HCT-hier 0.476 0.351 0.371 0.250 0.177 0.227 0.355 0.412 0.603 0.500

variant by IF-HCT-PCA-med (med: median). Second, recall that IF-HCT-PCA has


two stages: in the first one, we select features with a threshold determined by HC;
in the second one, we apply PCA to the post-selection data matrix. Alternatively, in
the second stage, we may apply classical k-means or hierarchical clustering to the
post-selection data instead (the first stage is intact). Denote these two alternatives
by IF-HCT-kmeans and IF-HCT-hier, respectively.
Table 5 compares IF-HCT-PCA with the three variants (in IF-HCT-kmeans, the
“replicates” parameter in k-means is again 30). The first three
methods have similar performances, while the last one performs noticeably less
satisfactorily. Not surprisingly, these methods generally outperform their classical
counterparts (i.e., classical PCA, classical k-means, and hierarchical clustering;
see Table 4).
We remark that, for post-selection clustering, it is frequently preferable to use
PCA rather than k-means. First, k-means can be much slower than PCA, especially
when the number of features selected in the IF step is large. Second, the k-means
algorithm we use in Matlab is only a heuristic approximation of the theoretical
k-means (which is NP-hard), so it is not always easy to justify the performance of
the k-means algorithm theoretically.

1.6. Connection to sparse PCA. The study is closely related to the recent in-
terest on sparse PCA [Amini and Wainwright (2009), Arias-Castro, Lerman and
Zhang (2013), Johnstone (2001), Jung and Marron (2009), Lei and Vu (2015),
Ma (2013), Zou, Hastie and Tibshirani (2006)], but is different in important ways.
Consider the normalized data matrix W = [W1, W2, . . . , Wn]' for example. In our
model, recall that μ1, μ2, . . . , μK are the K sparse contrast mean vectors and the
noise covariance matrix Σ is diagonal, so we have
W ≈ M Σ^{-1/2} + Z, where Z ∈ R^{n,p} has i.i.d. N(0, 1) entries,
and M ∈ R^{n,p} is the matrix whose ith row is μk' if and only if i ∈ Class k. This
is a setting that is frequently considered in the sparse PCA literature.
However, we must note that the main focus of sparse PCA is to recover the
supports of μ1 , μ2 , . . . , μK , while the main focus here is subject clustering. We
recognize that, the two problems—support recovery and subject clustering—are

essentially two different problems, and addressing one successfully does not nec-
essarily address the other successfully. For illustration, consider two scenarios:
• If useful features are very sparse but each is sufficiently strong, it is easy to
identify the support of the useful features, but due to the extreme sparsity, it
may be still impossible to have consistent clustering.
• If most of the useful features are very weak with only a few of them very strong,
the latter will be easy to identify and may yield consistent clustering, still, it
may be impossible to satisfactorily recover the supports of μ1 , μ2 , . . . , μK , as
most of the useful features are very weak.
In a forthcoming manuscript, Jin, Ke and Wang (2015b), we investigate the connections and differences between the two problems more closely, and elaborate on the above
points in detail.
With that being said, from a practical viewpoint, one may still wonder how
sparse PCA may help in subject clustering. A straightforward clustering approach
that exploits the sparse PCA ideas is the following:
• Estimate the first (K − 1) right singular vectors of the matrix M Σ^{-1/2} using the
sparse PCA algorithm as in Zou, Hastie and Tibshirani (2006), equation (3.7)
(say). Denote the estimates by ν̂1^sp, ν̂2^sp, . . . , ν̂_{K−1}^sp.
• Cluster by applying classical k-means to the n × (K − 1) matrix [W ν̂1^sp, W ν̂2^sp,
. . . , W ν̂_{K−1}^sp], assuming there are ≤ K classes.
For short, we call this approach Clu-sPCA. One problem here is that Clu-sPCA is
not tuning-free, as most existing sparse PCA algorithms have one or more tuning
parameters. How to set the tuning parameters in subject clustering is a challenging
problem: for example, since the class labels are unknown, conventional cross-validation (as we may use in classification, where the class labels of the training set
are known) might not help.
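To make the two-step recipe above concrete, one possible sketch in Python follows. It uses scikit-learn's SparsePCA merely as a stand-in for the Zou–Hastie–Tibshirani estimator of equation (3.7) (the two formulations differ), so it only approximates the Clu-sPCA procedure compared in Table 6; the function name clu_spca and the tuning parameter alpha are our own illustrative choices.

import numpy as np
from sklearn.decomposition import SparsePCA
from sklearn.cluster import KMeans

def clu_spca(W, K, alpha=1.0):
    """Clu-sPCA sketch: sparse estimates of the first K-1 right singular directions,
    then k-means on the projections of W onto those directions."""
    spca = SparsePCA(n_components=K - 1, alpha=alpha, random_state=0)
    spca.fit(W)                          # rows of components_ are sparse loading vectors
    V_hat = spca.components_.T           # p x (K-1), playing the role of nu_1^sp, ..., nu_{K-1}^sp
    scores = W @ V_hat                   # n x (K-1) matrix [W nu_1^sp, ..., W nu_{K-1}^sp]
    return KMeans(n_clusters=K, n_init=30).fit_predict(scores)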
In Table 6, we compare IF-HCT-PCA and Clu-sPCA using the 10 data sets in
Table 1. Note that in Clu-sPCA, the tuning parameter in the sparse PCA step [Zou,
Hastie and Tibshirani (2006), equation (3.7)] is ideally chosen to minimize the
clustering errors, using the true class labels. The results are based on 30 indepen-

TABLE 6
Clustering error rates for IF-HCT-PCA and Clu-sPCA. The tuning parameter of Clu-sPCA is
chosen ideally to minimize the errors (IF-HCT-PCA is tuning-free). Only SDs that are larger
than 0.01 are reported (in brackets)

Brn Brst Cln Leuk Lung1 Lung2 Lymp Prst SRB Su

IF-HCT-PCA 0.262 0.406 0.403 0.069 0.033 0.217 0.065 0.382 0.444 0.333
Clu-sPCA 0.263 0.438 0.435 0.292 0.110 0.433 0.190 (0.01) 0.422 0.428 0.437

dent repetitions. Compared to Clu-sPCA, IF-HCT-PCA outperforms for half of the


data sets (bold face), and has similar performances for the remaining half.
The above results support our philosophy: the problem of subject clustering and
the problem of support recovery are related but different, and success in one does
not automatically lead to the success in the other.

1.7. Summary and contributions. Our contribution is three-fold: feature selec-


tion by the KS statistic, post-selection PCA for high dimensional clustering, and
threshold choice by the recent idea of Higher Criticism.
In the first fold, we rediscover a phenomenon found earlier by Efron (2004) for
microarray study, but the focus there is on t-statistic or F -statistic, and the focus
here is on the KS statistic. We establish tight probability bounds on the KS statis-
tic when the data is Gaussian or Gaussian mixtures where the means and variances
are unknown; see Section 2.5. While tight tail probability bounds have been avail-
able for decades in the case where the data are i.i.d. from N(0, 1), the current case
is much more challenging. Our results follow the work by Siegmund (1982) and
Loader (1992) on the local Poisson approximation of boundary crossing probabil-
ity, and are useful for pinning down the thresholds in KS screening.
In the second fold, we propose to use IF-PCA for clustering and have success-
fully applied it to gene microarray data. The method compares favorably with other
methods, which suggests that both the IF step and the post-selection PCA step are
effective. We also establish a theoretical framework where we investigate the clus-
tering consistency carefully; see Section 2. The analysis it entails is sophisticated
and involves delicate post-selection eigen-analysis (i.e., eigen-analysis on the post-
selection data matrix). We also gain useful insight that the success of feature se-
lection depends on the feature-wise weighted third moment of the samples, while
the success of PCA depends more on the feature-wise weighted second moment.
Our study is closely related to the SpectralGem approach by Lee, Luca and Roeder
(2010), but our focus on KS screening, post-selection PCA, and clustering with
microarray data is different.
In the third fold, we propose to set the threshold by Higher Criticism. We find
an intimate relationship between the HC functional and the signal-to-noise ratio
associated with post-selection eigen-analysis. As mentioned in Section 1.3, the
full analysis on the HC threshold choice is difficult and long, so for reasons of
space, we do not include it in this paper.
Our findings support the philosophy of Donoho (2015): for real data analysis, we prefer simple models and methods that allow sophisticated theoretical analysis over complicated and computationally intensive methods (an
increasing trend in some other scientific communities).

1.8. Content and notation. Section 2 contains the main theoretical results,
where we show IF-PCA is consistent in clustering under some regularity condi-
tions. Section 3 contains the numerical studies and Section 4 discusses connec-
tion to other work and addresses some future research. Secondary theorems and

TABLE 7
Pseudocode for IF-PCA (for a given threshold t > 0)

Input: data matrix X, number of classes K, threshold t > 0. Output: class label vector ŷ_t^IF.
1. Rank features: Let ψn,j, 1 ≤ j ≤ p, be the KS-scores as in (1.6).
2. Post-selection PCA: Define the post-selection data matrix W^(t) (i.e., the sub-matrix of W consisting of all columns j with ψn,j > t). Let U ∈ R^{n,K−1} be the matrix of the first (K − 1) left singular vectors of W^(t). Cluster by ŷ_t^IF = kmeans(U, K).

lemmas are proved in the supplementary material of the paper. In this paper, Lp
denotes a generic multi-log(p) term (see Section 2.3). For a vector ξ, ‖ξ‖ denotes
the ℓ²-norm. For a real matrix A, ‖A‖ denotes the matrix spectral norm, ‖A‖F
denotes the matrix Frobenius norm, and smin(A) denotes the smallest non-zero
singular value.

2. Main results. Section 2.1 introduces our asymptotic model, Section 2.2
discusses the main regularity conditions and related notation. Section 2.3 presents
the main theorem and Section 2.4 presents two corollaries, together with a phase
transition phenomenon. Section 2.5 discusses the tail probability of the KS statis-
tic, which is the key for the IF step. Section 2.6 studies post-selection eigen-
analysis which is the key for the PCA step. The main theorems and corollaries
are proved in Section 2.7.
To be utterly clear, the IF-PCA procedure we study in this section is the one
presented in Table 7, where the threshold t > 0 is given.

2.1. The asymptotic clustering model. The model we consider is (1.1), (1.2),
(1.3) and (1.5), where the data matrix is X = [X1, X2, . . . , Xn]', with Xi ∼ N(μ̄ +
μk, Σ) if and only if i ∈ Class k, 1 ≤ k ≤ K, and Σ = diag(σ²(1), σ²(2), . . . , σ²(p)); K is
the number of classes, μ̄ is the overall mean vector, and μ1, μ2, . . . , μK are the contrast
mean vectors, which satisfy (1.5).
We use p as the driving asymptotic parameter, and let other parameters be tied
to p through fixed parameters. Fixing θ ∈ (0, 1), we let
(2.1) n = np = p^θ,
so that as p → ∞, p ≫ n ≫ 1.8 Let M ∈ R^{K,p} be the matrix
(2.2) M = [m1, m2, . . . , mK]', where mk = Σ^{-1/2} μk.
Denote the set of useful features by
(2.3) Sp = Sp(M) = {1 ≤ j ≤ p : mk(j) ≠ 0 for some 1 ≤ k ≤ K},

8 For simplicity, we drop the subscript p of n as long as there is no confusion.

and let sp = sp(M) = |Sp(M)| be the number of useful features. Fixing ϑ ∈ (0, 1),
we let
(2.4) sp = p^{1−ϑ}.
Throughout this paper, the number of classes K is fixed as p changes.

DEFINITION 2.1. We call model (1.1), (1.2), (1.3) and (1.5) the Asymptotic
Clustering Model if (2.1) and (2.4) hold, and denote it by ACM(ϑ, θ).

It is more convenient to work with the normalized data matrix W = [W1,
W2, . . . , Wn]', where, as before, Wi(j) = [Xi(j) − X̄(j)]/σ̂(j), and X̄(j) and
σ̂(j) are the empirical mean and standard deviation associated with feature j,
1 ≤ j ≤ p, 1 ≤ i ≤ n. Introduce Σ̂ = diag(σ̂²(1), σ̂²(2), . . . , σ̂²(p)) and
Σ̃ = E[Σ̂]. Note that σ̂²(j) is an unbiased estimator of σ²(j) when feature j is
useless, but not necessarily so when feature j is useful. As a result, Σ̂ is “closer”
to Σ̃ than to Σ; this causes (unavoidable) complications in notation. Denote for
short
(2.5) Ω = Σ^{1/2} Σ̃^{-1/2}.
This is a p × p diagonal matrix where most of the diagonal entries are 1, and all other
diagonal entries are close to 1 (under mild conditions). Let 1n be the n × 1 vector of ones
and ek ∈ R^K be the kth standard basis vector of R^K, 1 ≤ k ≤ K. Let L ∈ R^{n,K}
be the matrix whose ith row is ek' if and only if Sample i ∈ Class k. Recall the
definition of M in (2.2). With this notation, we can write
(2.6) W = (LM + Z Σ^{-1/2}) Ω + R, where Z Σ^{-1/2} has i.i.d. N(0, 1) entries,
where R stands for the remainder term
(2.7) R = 1n (μ̄ − X̄)' Σ̂^{-1/2} + (LM Σ^{1/2} + Z)(Σ̂^{-1/2} − Σ̃^{-1/2}).
Recall that rank(LM) = K − 1 and Ω is nearly the identity matrix.
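For readers who wish to experiment numerically, the following sketch (ours, not part of the paper) draws data from ACM(ϑ, θ) in the simplest configuration: Σ = Ip, μ̄ = 0, equal class frequencies, sp = p^{1−ϑ} useful features, and signal strength u0 at the critical order (log(p)/n)^{1/6} discussed in Section 2.4. All parameter values are illustrative.

import numpy as np

def sample_acm(p, theta=0.6, vartheta=0.5, K=2, u0=None, rng=None):
    """Draw (X, y) from the asymptotic clustering model: n = p^theta samples,
    s_p = p^(1 - vartheta) useful features, per-feature signal strength about u0."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, s = int(round(p ** theta)), int(round(p ** (1 - vartheta)))
    u0 = (np.log(p) / n) ** (1 / 6) if u0 is None else u0   # critical order from Section 2.4
    delta = np.full(K, 1.0 / K)
    y = rng.choice(K, size=n, p=delta)
    # contrast means supported on the first s features, centered so that (1.4) holds
    mu = np.zeros((K, p))
    mu[:, :s] = u0 * rng.choice([-1.0, 1.0], size=(K, s))
    mu[:, :s] -= delta @ mu[:, :s]
    sigma = np.ones(p)                                       # Sigma = I_p for simplicity
    X = mu[y] + rng.standard_normal((n, p)) * sigma          # (1.1)-(1.3) with mu_bar = 0
    return X, y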

2.2. Regularity conditions and related notation. We use C > 0 as a generic
constant, which may change from occurrence to occurrence, but does not depend
on p. Recall that δk is the fraction of samples in Class k, and σ²(j) is the jth
diagonal entry of Σ. The following regularity conditions are mild:
(2.8) min_{1≤k≤K} {δk} ≥ C,  and  max_{1≤j≤p} {σ(j) + σ^{-1}(j)} ≤ C.
Introduce two p × 1 vectors κ = (κ(1), κ(2), . . . , κ(p))' and τ = (τ(1), τ(2), . . . , τ(p))' by
(2.9) κ(j) = κ(j; M, p, n) = ( Σ_{k=1}^K δk mk²(j) )^{1/2},
(2.10) τ(j) = τ(j; M, p, n) = (6√(2π))^{-1} · √n · | Σ_{k=1}^K δk mk³(j) |.

Note that κ(j) and τ(j) are related to the weighted second and third moments
of the jth column of M, respectively; τ and κ play key roles in the success of
feature selection and of post-selection PCA, respectively. In the case where the τ(j) are
all small, the success of our method relies on higher moments of the columns of
M; see Section 2.5 for more discussion. Introduce
ε(M) = max_{1≤k≤K, j∈Sp(M)} |mk(j)|,    τmin = min_{j∈Sp(M)} τ(j).
We are primarily interested in the range where the feature strengths are rare and
weak, so we assume that as p → ∞,
(2.11) ε(M) → 0.9
In Section 2.5, we shall see that τ(j) can be viewed as the Signal-to-Noise Ratio
(SNR) associated with the jth feature, and τmin is the minimum SNR over all useful
features. The most interesting range for τ(j) is τ(j) ≥ O(√(log(p))). In fact, if
the τ(j) are of a much smaller order, then the useful features and the useless features
are nearly inseparable. In light of this, we fix a constant r > 0 and assume
(2.12) τmin ≥ a0 · √(2r log(p)), where a0 = √((π − 2)/(4π)).10
By the way τ(j) is defined, the interesting range for a non-zero mk(j) is |mk(j)| ≥
O((log(p)/n)^{1/6}). We also need some technical conditions, which can be largely
relaxed with more complicated analysis:11
(2.13) max_{j∈Sp(M)} { (√n/τ(j)) Σ_{k=1}^K δk mk⁴(j) } ≤ C p^{−δ},
       min_{{(j,k): mk(j)≠0}} |mk(j)| ≥ C (log(p)/n)^{1/2},
for some δ > 0. As the most interesting range of |mk(j)| is O((log(p)/n)^{1/6}),
these conditions are mild.

9 This condition is used in the post-selection eigen-analysis. Recall that W^(t) is the shorthand notation for the post-selection normalized data matrix associated with threshold t. As W^(t) is the sum of a low-rank matrix and a noise matrix, (W^(t))(W^(t))' equals the sum of four terms, two of which are “cross terms.” In the eigen-analysis of (W^(t))(W^(t))', we need condition (2.11) to control the cross terms.
10 Throughout this paper, a0 denotes the constant √((π − 2)/(4π)). The constant comes from the analysis of the tail behavior of the KS statistic; see Theorems 2.3–2.4.
11 Condition (2.13) is only needed for Theorem 2.4 on the tail behavior of the KS statistic associated with a useful feature. The conditions ensure that singular cases will not happen, so the weighted third moment [captured by τ(j)] is the leading term in the Taylor expansion. For more discussion, see the remark in Section 2.5.

Similarly, for the threshold t in (1.8) we use for the KS-scores, the interesting
range is t = O(√(log(p))). In light of this, we are primarily interested in thresholds
of the form
(2.14) tp(q) = a0 · √(2q log(p)), where q > 0 is a constant.
We now define a quantity errp, which bounds the clustering error rate of IF-PCA in
our main results. Define
ρ1(L, M) = ρ1(L, M; p, n) = sp ‖κ‖²_∞ / ‖κ‖².
Introduce two K × K matrices A and Γ (where A is diagonal) by
A(k, k) = √δk ‖mk‖,    Γ(k, ℓ) = mk' Ω² mℓ / (‖mk‖ · ‖mℓ‖),    1 ≤ k, ℓ ≤ K;
recall that Ω is “nearly” the identity matrix. Note that ‖AΓA‖ ≤ ‖κ‖², and that
when ‖m1‖, . . . , ‖mK‖ have comparable magnitudes, all the non-zero eigenvalues of AΓA
have the same magnitude. In light of this, let smin(AΓA) be the smallest non-zero singular
value of AΓA and introduce the ratio
ρ2(L, M) = ρ2(L, M; p, n) = ‖κ‖² / smin(AΓA).
Define
errp = ρ2(L, M) [ (1 + p^{1−(ϑ∧q)}/n)^{1/2}/‖κ‖ + p^{−[(√r−√q)₊]²/(2K)} + p^{ϑ−1} + ρ1(L, M)(p^{(ϑ−q)₊}/n)^{1/2} ].
This quantity errp combines the “bias” term associated with the useful features
that we have missed in feature selection and the “variance” term associated with the
retained features; see Lemmas 2.2 and 2.3 for details. Throughout this paper, we
assume that there is a constant C > 0 such that
(2.15) errp ≤ p^{−C}.

REMARK. Note that ρ1(L, M) ≥ 1 and ρ2(L, M) ≥ 1. A relatively small
ρ1(L, M) means that the τ(j) are more or less of the same magnitude, and a relatively
small ρ2(L, M) means that the (K − 1) non-zero eigenvalues of LMΩ²M'L' have
comparable magnitudes. Our hope is that neither of these two ratios is unduly
large.

2.3. Main theorem: Clustering consistency by IF-PCA. Recall that ψn,j is the KS
statistic. For any threshold t > 0, denote the set of retained features by
Ŝp(t) = {1 ≤ j ≤ p : ψn,j ≥ t}.
For any n × p matrix W, let W^{Ŝp(t)} be the matrix formed by replacing all columns
of W with index j ∉ Ŝp(t) by the vector of zeros (note the slight difference
from W^(t) in Section 1.1). Denote the n × (K − 1) matrix of the first
(K − 1) left singular vectors of W^{Ŝp(tp(q))} by
Û^{(tp(q))} = Û(W^{Ŝp(tp(q))}) = [η̂1, η̂2, . . . , η̂_{K−1}], where η̂k = η̂k(W^{Ŝp(tp(q))}).
Recall that W = (LM + ZΣ^{-1/2})Ω + R, and let LMΩ = UDV' be a Singular
Value Decomposition (SVD) of LMΩ such that D ∈ R^{K−1,K−1} is a diagonal matrix
whose diagonal entries are the singular values arranged in descending order, U ∈ R^{n,K−1}
satisfies U'U = I_{K−1}, and V ∈ R^{p,K−1} satisfies V'V = I_{K−1}. Then U is the
non-stochastic counterpart of Û^{(tp(q))}. We hope that the linear space spanned by the
columns of Û^{(tp(q))} is “close” to that spanned by the columns of U.

DEFINITION 2.2. Lp > 0 denotes a multi-log(p) term that may vary from
occurrence to occurrence but satisfies Lp p^{−δ} → 0 and Lp p^{δ} → ∞ for all δ > 0.

For any K ≥ 1, let
(2.16) HK = {all K × K orthogonal matrices}.
The following theorem is proved in Section 2.7; it shows that the singular
vectors obtained by IF-PCA span a low-dimensional subspace that is “very close” to
its counterpart in the ideal case where there is no noise.

THEOREM 2.1. Fix (ϑ, θ) ∈ (0, 1)², and consider ACM(ϑ, θ). Suppose the
regularity conditions (2.8), (2.11), (2.12), (2.13) and (2.15) hold, and the threshold
in IF-PCA is set as t = tp(q) as in (2.14). Then there is a matrix H in H_{K−1} such
that as p → ∞, with probability at least 1 − o(p^{−2}), ‖Û^{(tp(q))} − UH‖_F ≤ Lp errp.

Recall that in IF-PCA, once Û^{(tp(q))} is obtained, we estimate the class labels by
truncating Û^{(tp(q))} entry-wise (see the PCA-1 step and the footnote in Section 1.1)
and then cluster by applying the classical k-means. Also, the estimated class labels
are denoted by ŷ_{tp(q)}^IF = (ŷ_{tp(q),1}^IF, ŷ_{tp(q),2}^IF, . . . , ŷ_{tp(q),n}^IF)'. We measure the clustering
errors by the Hamming distance
Hamm*_p(ŷ_{tp(q)}^IF, y) = min_π { Σ_{i=1}^n P(ŷ_{tp(q),i}^IF ≠ π(yi)) },
where the minimum is over all permutations π of {1, 2, . . . , K}. The following theorem is our main
result, which gives an upper bound for the Hamming errors of IF-PCA.

THEOREM 2.2. Fix (ϑ, θ) ∈ (0, 1)², and consider ACM(ϑ, θ). Suppose the
regularity conditions (2.8), (2.11), (2.12), (2.13) and (2.15) hold, and let tp = tp(q)
as in (2.14) and Tp = log(p)/√n in IF-PCA. As p → ∞,
n^{-1} Hamm*_p(ŷ_{tp(q)}^IF, y) ≤ Lp errp.

The theorem can be proved by Theorem 2.1 and an adaptation of Jin (2015),
Theorem 2.2. In fact, by Lemma 2.1 below, the absolute values of all entries of
U are bounded from above by C/√n. By the choice of Tp and the definitions, the
truncated matrix Û*^{(tp(q))} satisfies ‖Û*^{(tp(q))} − UH‖_F ≤ ‖Û^{(tp(q))} − UH‖_F. Using
this and Theorem 2.1, the proof of Theorem 2.2 is essentially an exercise in the classical
theory of the k-means algorithm. For this reason, we omit the proof.
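In the numerical studies (e.g., Table 4), the empirical counterpart of n^{-1} Hamm*_p is the misclustering rate minimized over permutations of the class labels. A small sketch of that computation (ours; the labels are assumed to be coded 0, . . . , K − 1):

from itertools import permutations
import numpy as np

def clustering_error(y_hat, y, K):
    """Empirical analogue of n^{-1} Hamm*_p: fraction of misclustered samples,
    minimized over all permutations of the K class labels."""
    y_hat, y = np.asarray(y_hat), np.asarray(y)
    best = 1.0
    for perm in permutations(range(K)):          # K is small, so brute force is fine
        relabeled = np.array(perm)[y_hat]        # relabel the estimated clusters
        best = min(best, np.mean(relabeled != y))
    return best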

2.4. Two corollaries and a phase transition phenomenon. Corollary 2.1 can
be viewed as a simplified version of Theorem 2.1, so we omit the proof; recall that
Lp denotes a generic multi-log(p) term.

COROLLARY 2.1. Suppose the conditions of Theorem 2.1 hold, and suppose
max{ρ1(L, M), ρ2(L, M)} ≤ Lp as p → ∞. Then there is a matrix H in H_{K−1}
such that as p → ∞, with probability at least 1 − o(p^{−2}),
‖Û^{(tp(q))} − UH‖_F ≤ Lp p^{−[(√r−√q)₊]²/(2K)} + Lp ‖κ‖^{-1}(p^{(1−ϑ)/2} + 1) ×
  { p^{−θ/2+[(ϑ−q)₊]/2},        if (1 − ϑ) > θ,
    p^{−(1−ϑ)/2+[(1−θ−q)₊]/2},  if (1 − ϑ) ≤ θ. }

By assumption (2.12), the interesting range for a non-zero mk(j) is |mk(j)| ≳
Lp n^{-1/6}. It follows that ‖κ‖ ≳ Lp p^{(1−ϑ)/2} n^{-1/6} and ‖κ‖^{-1} p^{(1−ϑ)/2} → ∞. In
this range, we have the following corollary, which is proved in Section 2.7.

COROLLARY 2.2. Suppose the conditions of Corollary 2.1 hold, and ‖κ‖ =
Lp p^{(1−ϑ)/2} n^{-1/6}. Then as p → ∞, the following hold:
(a) If (1 − ϑ) < θ/3, then for any r > 0 and whatever q is chosen, the upper bound on
min_{H∈H_{K−1}} ‖Û^{(tp(q))} − UH‖_F in Corollary 2.1 goes to infinity.
(b) If θ/3 < (1 − ϑ) < 1 − 2θ/3, then for any r > ϑ − 2θ/3, there exists q ∈ (0, r)
such that min_{H∈H_{K−1}} ‖Û^{(tp(q))} − UH‖_F → 0 with probability at least 1 − o(p^{−2}).
In particular, if (1 − ϑ) ≤ θ and r > (√(K(1 − ϑ) − Kθ/3) + √(1 − θ))², by taking
q = 1 − θ,
min_{H∈H_{K−1}} ‖Û^{(tp(q))} − UH‖_F ≤ Lp n^{1/6} sp^{-1/2};
if (1 − ϑ) > θ and r > (√(2Kθ/3) + √ϑ)², by taking q = ϑ,
min_{H∈H_{K−1}} ‖Û^{(tp(q))} − UH‖_F ≤ Lp n^{-1/3}.
(c) If (1 − ϑ) > 1 − 2θ/3, then for any r > 0, by taking q = 0, min_{H∈H_{K−1}} ‖Û^{(tp(q))} −
UH‖_F → 0 with probability at least 1 − o(p^{−2}).

To interpret Corollary 2.2, we take a special case where K = 2, all diagonal entries
of Σ are bounded from above and below by constants, and all non-zero features
μk(j) have comparable magnitudes; that is, there is a positive number u0 that may
depend on (n, p) and a constant C > 0 such that
(2.17) u0 ≤ |μk(j)| ≤ C u0 for any (k, j) such that μk(j) ≠ 0.
In our parameterization, sp = p^{1−ϑ}, n = p^θ and u0 ≍ τmin^{1/3}/n^{1/6} ≍ (log(p)/n)^{1/6}
since K = 2. Cases (a)–(c) in Corollary 2.2 translate to (a) 1 ≪ sp ≪ n^{1/3},
(b) n^{1/3} ≪ sp ≪ p/n^{2/3} and (c) sp ≫ p/n^{2/3}, respectively.
The primary interest in this paper is Case (b). In this case, Corollary 2.2 says
that both feature selection and post-selection PCA can be successful, provided that
u0 = c0(log(p)/n)^{1/6} for an appropriately large constant c0. Case (a) addresses the
case of very sparse signals, and Corollary 2.2 says that we need signals stronger
than u0 ≍ (log(p)/n)^{1/6} for IF-PCA to be successful. Case (c) addresses
the case where the signals are relatively dense, and PCA is successful without feature
selection (i.e., taking q = 0).
We have been focusing on the case u0 = Lp n^{-1/6}, as our primary interest is in
clustering by IF-PCA. For a more complete picture, we model u0 by u0 = Lp p^{−α};
we let the exponent α vary and investigate the critical order of u0 for several
different problems and methods. In this case, it is seen that u0 ∼ n^{-1/6}
is the critical order for the success of feature selection (see Section 2.5), u0 ∼
√(p/(ns)) is the critical order for the success of Classical PCA, and u0 ∼ 1/√s
is the critical order for IF-PCA in an idealized situation where the Screen step
finds exactly all the useful features. These suggest an interesting phase transition
phenomenon for IF-PCA.
• Feature selection is trivial but clustering is impossible: 1 ≪ s ≪ n^{1/3} and
n^{-1/6} ≪ u0 ≤ 1/√s. Individually, useful features are sufficiently strong, so it
is trivial to recover the support of M Σ^{1/2} (say, by thresholding the KS-scores
one by one); note that M Σ^{1/2} = [μ1, μ2, . . . , μK]'. However, useful features
are so sparse that it is impossible for any method to cluster consistently.
• Clustering and feature selection are possible but non-trivial: n^{1/3} ≪ s ≪
p/n^{2/3} and u0 = (r log(p)/n)^{1/6}, where r is a constant. In this range, feature
selection is indispensable, and there is a region where IF-PCA may yield consistent
clustering but Classical PCA may not. A similar conclusion can be drawn
if the purpose is to recover the support of M Σ^{1/2} by thresholding the KS-scores.
• Clustering is trivial but feature selection is impossible: s ≫ p/n^{2/3} and
√(p/(ns)) ≤ u0 ≪ n^{-1/6}. In this range, the sparsity level is low and Classical
PCA is able to yield consistent clustering, but the useful features are individually
so weak that it is impossible to fully recover the support of M Σ^{1/2} by
using all p different KS-scores.
In Jin, Ke and Wang (2015b), we investigate the phase transition with much more
refined studies (in a slightly different setting).

2.5. Tail probability of KS statistic. IF-PCA consists of a screening step (IF-step)
and a PCA step. In the IF-step, the key is to study the tail behavior of the
KS statistic ψn,j, defined in (1.6). Fix 1 ≤ j ≤ p. Recall that in our model, Xi ∼
N(μ̄ + μk, Σ) if i ∈ Class k, 1 ≤ i ≤ n, and that j is a useless feature if and only
if μ1(j) = μ2(j) = · · · = μK(j) = 0.
Recall that a0 = √((π − 2)/(4π)). Theorem 2.3 addresses the tail behavior of
ψn,j when feature j is useless.

THEOREM 2.3. Fix θ ∈ (0, 1) and let n = np = p^θ. Fix 1 ≤ j ≤ p. If the jth
feature is a useless feature, then as p → ∞, for any sequence tp such that tp → ∞
and tp/√n → 0,
1 ≲ P(ψn,j ≥ tp) / [ (√2 a0)^{-1} exp(−tp²/(2a0²)) ] ≲ 2.

We conjecture that P(ψn,j ≥ tp) ∼ 2 · (√2 a0)^{-1} exp(−tp²/(2a0²)), which would
possibly require a more sophisticated proof than the one in this paper.
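Theorem 2.3 and the conjecture above can be probed by simulation. The sketch below (ours; the sample size and Monte Carlo sizes are arbitrary illustrative choices) simulates null KS-scores ψn,j and compares their empirical right tail with the constant-2 envelope 2(√2 a0)^{-1} exp(−t²/(2a0²)).

import numpy as np
from scipy.stats import norm

a0 = np.sqrt((np.pi - 2) / (4 * np.pi))

def null_psi(n, n_rep, rng):
    """KS statistic (1.6) for n i.i.d. N(0,1) samples, studentized columnwise."""
    X = rng.standard_normal((n, n_rep))
    W = (X - X.mean(0)) / X.std(0, ddof=1)
    W.sort(axis=0)
    F = norm.cdf(W)
    i = np.arange(1, n + 1)[:, None]
    return np.sqrt(n) * np.maximum(i / n - F, F - (i - 1) / n).max(axis=0)

rng = np.random.default_rng(0)
psi = null_psi(n=200, n_rep=20000, rng=rng)
for t in [0.8, 1.0, 1.2]:
    emp = np.mean(psi >= t)                                       # empirical right tail
    env = 2 / (np.sqrt(2) * a0) * np.exp(-t ** 2 / (2 * a0 ** 2)) # envelope from Theorem 2.3
    print(f"t={t:.1f}  empirical tail={emp:.4f}  envelope={env:.4f}")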
Recall that τ is defined in (2.10). Theorem 2.4 addresses the tail behavior of
ψn,j when feature j is useful.

THEOREM 2.4. Fix θ ∈ (0, 1). Let n = np = p^θ, and let τ(j) be as in (2.10),
where j is a useful feature. Suppose (2.12) and (2.13) hold, and the threshold tp is
such that tp → ∞, tp/√n → 0, and τ(j) ≥ (1 + C)tp for some constant
C > 0. Then as p → ∞,
P(ψn,j ≤ tp) ≤ CK exp( −(τ(j) − tp)² / (2K a0²) ) + O(p^{−3}).

Theorems 2.3–2.4 are proved in the supplementary material [Jin and Wang
(2016)]. Combining the two theorems, roughly speaking, we have that:
• if j is a useless feature, then the right tail of ψn,j behaves like that of N(0, a0²);
• if j is a useful feature, then the left tail of ψn,j is bounded by that of N(τ(j), Ka0²).
These suggest that feature selection using the KS statistic in the current setting
is very similar to feature selection in Stein's normal means model; the latter is
more or less well understood [e.g., Abramovich et al. (2006)].
As a result, the most interesting range for τ(j) is τ(j) ≥ O(√(log(p))). If we
threshold the KS-scores at tp(q) = a0√(2q log(p)), then by an argument similar to feature
selection in Stein's normal means setting, we expect that:
• all useful features are retained, except for a fraction ≤ C p^{−[(√r−√q)₊]²/K};
• no more than (1 + o(1)) · p^{1−q} useless features are (mistakenly) retained;
• #{retained features} = |Ŝp(tp(q))| ≤ C[p^{1−ϑ} + p^{1−q} + log(p)].
These facts pave the way for the PCA step; see the sections below.

REMARK. Theorem 2.4 hinges on τ(j), which is a quantity proportional to the
“third moment” Σ_{k=1}^K δk mk³(j) and can be viewed as the “effective signal strength”
of the KS statistic. In the symmetric case (say, K = 2 and δ1 = δ2 = 1/2), the
third moment (which equals 0) is no longer the right quantity for calibrating the
effective signal strength of the KS statistic, and we must use the fourth moment.
In such cases, for 1 ≤ j ≤ p, let
ω(j) = √n sup_{−∞<y<∞} | (1/8) y(1 − 3y²) φ(y) · ( Σ_{k=1}^K δk mk²(j) )² + (1/24) φ⁽³⁾(y) · Σ_{k=1}^K δk mk⁴(j) |,
where φ⁽³⁾(y) is the third derivative of the standard normal density φ(y).
Theorem 2.4 continues to hold provided that (a) τ(j) is replaced by ω(j),
(b) the condition (2.12) of τmin ≥ a0√(2r log(p)) is replaced by that of ωmin ≥
a0√(2r log(p)), where ωmin = min_{j∈Sp(M)} {ω(j)}, and (c) the first part of condition (2.13),
max_{j∈Sp(M)} { (√n/τ(j)) Σ_{k=1}^K δk mk⁴(j) } ≤ Cp^{−δ}, is replaced by that of
max_{j∈Sp(M)} { (√n/ω(j)) Σ_{k=1}^K δk |mk(j)|⁵ } ≤ Cp^{−δ}. This is consistent with Arias-Castro and Verzelen (2014), which studies the clustering problem in a similar
setting (especially the symmetric case) in great detail.
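Since ω(j) is defined through a one-dimensional supremum, it is easy to evaluate numerically. A small sketch (ours; the grid range and resolution are arbitrary choices) using the identity φ⁽³⁾(y) = (3y − y³)φ(y):

import numpy as np
from scipy.stats import norm

def omega(m_col, delta, n, grid=np.linspace(-6, 6, 2001)):
    """omega(j) from the remark: m_col = (m_1(j), ..., m_K(j)), delta = class frequencies."""
    m2 = np.sum(delta * m_col ** 2)              # weighted second moment
    m4 = np.sum(delta * m_col ** 4)              # weighted fourth moment
    phi = norm.pdf(grid)
    phi3 = (3 * grid - grid ** 3) * phi          # third derivative of the standard normal density
    f = grid * (1 - 3 * grid ** 2) * phi * m2 ** 2 / 8 + phi3 * m4 / 24
    return np.sqrt(n) * np.max(np.abs(f))        # sup approximated on the grid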
In the literature, tight bounds of this kind are only available for the case where Xi are i.i.d. samples from a known distribution (in particular, the parameters, if any, are known). In this case, the bound is derived by Kolmogorov (1933); see also Shorack and Wellner (1986). The setting considered here is more complicated, and how to derive tight bounds is an interesting but rather challenging problem. The main difficulty lies in that any estimates of the unknown parameters (μ̄(j), μ1(j), ..., μK(j), σ(j)) have stochastic fluctuations of the same order as those of the empirical CDF, and the two types of fluctuations are correlated in a complicated way, so it is hard to derive the right constant a0 in the exponent. There are two existing approaches: one is due to Durbin (1985), which approximates the stochastic process by a Brownian bridge; the other is due to Loader (1992) [see also Siegmund (1982), Woodroofe (1978)] and rests on a local Poisson approximation of the boundary crossing probability. It is argued in Loader (1992) that the second approach is more accurate. Our proofs follow the ideas in Loader (1992), Siegmund (1982).

2.6. Post-selection eigen-analysis. For the PCA step, as in Section 2.3, we let W^{Ŝp(tp(q))} be the n × p matrix whose jth column is the same as that of W if j ∈ Ŝp(tp(q)) and is the zero vector otherwise. With this notation,

(2.18)    W^{Ŝp(tp(q))} = LM + L(M − M^{Ŝp(tp(q))}) + (ZΣ^{−1/2} + R)^{Ŝp(tp(q))}.

We analyze the three terms on the right-hand side separately.
Consider the first term LM. Recall that L ∈ R^{n,K} with the ith row being e_k′ if and only if i ∈ Class k, 1 ≤ i ≤ n, 1 ≤ k ≤ K, and M ∈ R^{K,p} with the kth row being m_k′ = (Σ^{−1/2}μ_k)′, 1 ≤ k ≤ K. Also, recall that A = diag(√δ1 ‖m1‖, ..., √δK ‖mK‖) and Ω ∈ R^{K,K} with Ω(k, ℓ) = m_k′m_ℓ/(‖m_k‖ · ‖m_ℓ‖), 1 ≤ k, ℓ ≤ K. Note that rank(AΩA) = rank(LM) = K − 1. Assume all non-zero eigenvalues of AΩA are simple, and denote them by λ1 > λ2 > · · · > λ_{K−1}. Write

(2.19)    AΩA = Q · diag(λ1, λ2, ..., λ_{K−1}) · Q′,    Q ∈ R^{K,K−1},

where the kth column of Q is the kth eigenvector of AΩA, and let

(2.20)    LM = UDV′

be an SVD of LM. Introduce

(2.21)    G = diag(√δ1, √δ2, ..., √δK) ∈ R^{K,K}.
The following lemma is proved in the supplementary material [Jin and Wang (2016), Appendix C].

LEMMA 2.1. The matrix LM has (K − 1) non-zero singular values, which are √(nλ1), ..., √(nλ_{K−1}). Also, there is a matrix H ∈ H_{K−1} [see (2.16)] such that U = n^{−1/2}L(G^{−1}QH) ∈ R^{n,K−1}. For the matrix G^{−1}QH, the ℓ²-norm of the kth row is (δ_k^{−1} − 1)^{1/2}, and the ℓ²-distance between the kth row and the ℓth row is (δ_k^{−1} + δ_ℓ^{−1})^{1/2}, which is no less than 2, 1 ≤ k < ℓ ≤ K.

By Lemma 2.1 and the definitions, it follows that:

• For any 1 ≤ i ≤ n and 1 ≤ k ≤ K, the ith row of U equals the kth row of n^{−1/2}G^{−1}QH if and only if Sample i comes from Class k.
• The matrix U has K distinct rows, according to which the rows of U partition into K different groups. This partition coincides with the partition of the n samples into K different classes. Also, the ℓ²-distance between each pair of the K distinct rows is no less than 2/√n (see the numerical sketch below).
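The following is a small numerical check of this row structure (a sketch with arbitrary test values for L and M, assuming, as in the model here, that the weighted contrasts satisfy Σk δk mk = 0 and that rank(M) = K − 1).

# Numerical sanity check of the K-distinct-row structure of U (a sketch; arbitrary test values).
import numpy as np

rng = np.random.default_rng(2)
n, p, K = 300, 60, 3
labels = rng.integers(0, K, size=n)
L = np.eye(K)[labels]                                    # n x K class membership matrix
delta = np.bincount(labels, minlength=K) / n             # empirical class proportions

M = rng.standard_normal((K, p))
M[K - 1] = -(delta[:K - 1] @ M[:K - 1]) / delta[K - 1]   # enforce sum_k delta_k m_k = 0

U = np.linalg.svd(L @ M, full_matrices=False)[0][:, :K - 1]   # first (K-1) left singular vectors

# U should have K distinct rows, constant within each class ...
rows = np.array([U[labels == k][0] for k in range(K)])
spread = max(np.abs(U[labels == k] - rows[k]).max() for k in range(K))
print("within-class spread (should be ~0):", spread)

# ... with pairwise distances (delta_k^{-1} + delta_l^{-1})^{1/2} / sqrt(n) >= 2/sqrt(n)
for k in range(K):
    for l in range(k + 1, K):
        d = np.linalg.norm(rows[k] - rows[l])
        pred = np.sqrt(1 / delta[k] + 1 / delta[l]) / np.sqrt(n)
        print(f"classes ({k},{l}): distance {d:.4f}, predicted {pred:.4f}, 2/sqrt(n) = {2 / np.sqrt(n):.4f}")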
Consider the second term on the right-hand side of (2.18). This is the “bias”
term caused by useful features which we may fail to select.

LEMMA 2.2. Suppose the conditions of Theorem 2.1 hold. As p → ∞, with probability at least 1 − o(p^{−2}),

    ‖L(M − M^{Ŝp(tp(q))})‖ ≤ Cκ√n · [ p^{−(1−ϑ)/2} ρ1(L, M) · log(p) + p^{−[(√r−√q)_+]²/(2K)} ].

Consider the last term on the right-hand side of (2.18). This is the "variance" term, consisting of two parts: the part from the original measurement noise matrix Z, and the remainder term due to normalization.

LEMMA 2.3. Suppose the conditions of Theorem 2.1 hold. As p → ∞, with probability at least 1 − o(p^{−2}),

    ‖(ZΣ^{−1/2} + R)^{Ŝp(tp(q))}‖ ≤ C[ √n + p^{(1−ϑ∧q)/2} + κ p^{(ϑ−q)_+/2} ρ1(L, M) · log³(p) ].

Combining Lemmas 2.2–2.3 and using the definition of errp,

(2.22)    ‖W^{Ŝp(tp(q))} − LM‖ ≤ Lp errp · √n κ / ρ2(L, M).
2.7. Proofs of the main results. We now show Theorem 2.1 and Corollary 2.2. The proof of Theorem 2.2 is very similar to that of Theorem 2.2 in Jin (2015), and the proof of Corollary 2.1 is elementary, so we omit them.
Consider Theorem 2.1. Let

    T = LMM′L′,    T̂ = W^{Ŝp(tp(q))}(W^{Ŝp(tp(q))})′.

Recall that U and Û^{(tp(q))} contain the (K − 1) leading eigenvectors of T and T̂, respectively. Using the sine-theta theorem [Davis and Kahan (1970)] [see also Proposition 1 in Cai, Ma and Wu (2015)],

(2.23)    ‖Û^{(tp(q))}(Û^{(tp(q))})′ − UU′‖ ≤ 2 smin(T)^{−1} ‖T̂ − T‖;

in (2.23), we have used the fact that T has a rank of K − 1, so that the gap between the (K − 1)th and Kth largest eigenvalues equals the minimum non-zero singular value smin(T). The following lemma is proved in the supplementary material [Jin and Wang (2016), Appendix C].
LEMMA 2.4. For any integers 1 ≤ m ≤ p and two p × m matrices V1, V2 satisfying V1′V1 = V2′V2 = I, there exists an orthogonal matrix H ∈ R^{m,m} such that ‖V1 − V2H‖_F ≤ ‖V1V1′ − V2V2′‖_F.
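A quick numerical check of Lemma 2.4 (a sketch): here the orthogonal matrix H is taken to be the Procrustes rotation computed from the SVD of V2′V1, which is one valid choice rather than necessarily the one used in the proof.

# Sketch: align two orthonormal frames with a Procrustes rotation and check the inequality.
import numpy as np

rng = np.random.default_rng(3)
p, m = 200, 4
V1 = np.linalg.qr(rng.standard_normal((p, m)))[0]              # p x m, orthonormal columns
V2 = np.linalg.qr(V1 + 0.3 * rng.standard_normal((p, m)))[0]   # a perturbed frame

P, _, Qt = np.linalg.svd(V2.T @ V1)
H = P @ Qt                                                     # orthogonal Procrustes rotation

lhs = np.linalg.norm(V1 - V2 @ H, "fro")
rhs = np.linalg.norm(V1 @ V1.T - V2 @ V2.T, "fro")
print(f"||V1 - V2 H||_F = {lhs:.4f}  <=  ||V1 V1' - V2 V2'||_F = {rhs:.4f}")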

Combine (2.23) with Lemma 2.4 and note that Û^{(tp(q))}(Û^{(tp(q))})′ − UU′ has a rank of 2K or smaller. It follows that there is an H ∈ H_{K−1} such that

(2.24)    ‖Û^{(tp(q))} − UH‖_F ≤ 2√(2K) smin(T)^{−1} ‖T̂ − T‖.

First, ‖T̂ − T‖ ≤ 2‖LM‖ · ‖W^{Ŝp(tp(q))} − LM‖ + ‖W^{Ŝp(tp(q))} − LM‖². From Lemmas 2.2–2.3 and (2.15), ‖LM‖ ≫ ‖W^{Ŝp(tp(q))} − LM‖. Therefore,

    ‖T̂ − T‖ ≲ 2‖LM‖ ‖W^{Ŝp(tp(q))} − LM‖ ≤ 2√n κ · ‖W^{Ŝp(tp(q))} − LM‖.

Second, by Lemma 2.1,

    smin(T) = n · smin(AΩA) = nκ²/ρ2(L, M).

Plugging these results into (2.24), we find that

(2.25)    ‖Û^{(tp(q))} − UH‖_F ≤ 4√(2K) (ρ2(L, M)/(√n κ)) ‖W^{Ŝp(tp(q))} − LM‖,

where, by Lemmas 2.2–2.3, the right-hand side equals Lp errp. The claim then follows by combining (2.25) and (2.22).
Consider Corollary 2.2. For each j ∈ Sp(M), it can be deduced that κ(j) ≥ ε(M), using especially (2.11). Therefore, κ ≥ Lp p^{(1−ϑ)/2} n^{−1/6} = Lp p^{(1−ϑ)/2−θ/6}. The error bound in Corollary 2.1 reduces to

(2.26)    Lp p^{−[(√r−√q)_+]²/(2K)} + Lp · { p^{−θ/3+(ϑ−q)_+/2},              θ < 1 − ϑ,
                                             { p^{θ/6−(1−ϑ)/2+(1−θ−q)_+/2},   θ ≥ 1 − ϑ.

Note that (2.26) is lower bounded by Lp p^{θ/6−(1−ϑ)/2} for any q ≥ 0; and it is upper bounded by Lp p^{−θ/3+ϑ/2} when taking q = 0. The first and third claims then follow immediately. Below, we show the second claim.

First, consider the case θ < 1 − ϑ. If r > ϑ, we can take any q ∈ (ϑ, r) and the error bound is o(1). If r ≤ ϑ, noting that (ϑ − r)/2 < θ/3, there exists q < r such that (ϑ − q)/2 < θ/3, and the corresponding error bound is o(1). In particular, if r > (√(2Kθ/3) + √ϑ)², we have (√r − √ϑ)²/(2K) > θ/3; then for q ≥ ϑ, the error bound is Lp p^{−θ/3} + Lp p^{−(√r−√q)²/(2K)}; for q < ϑ, the error bound is Lp p^{−θ/3+(ϑ−q)/2}; so the optimal q* = ϑ and the corresponding error bound is Lp p^{−θ/3} = Lp n^{−1/3}.

Next, consider the case 1 − ϑ ≤ θ < 3(1 − ϑ). If r > 1 − θ, for any q ∈ (1 − θ, r), the error bound is o(1); note that θ/6 < (1 − ϑ)/2. If r ≤ 1 − θ, noting that (1 − θ − r)/2 < (1 − ϑ)/2 − θ/6, there is a q < r such that (1 − θ − q)/2 < (1 − ϑ)/2 − θ/6, and the corresponding error bound is o(1).
TABLE 8
Pseudocode for IF-PCA(1) (for simulations; threshold set by Higher Criticism)

Input: data matrix X, number of classes K. Output: class label vector ŷ_HC^{IF}.
1. Rank features: Let ψn,j be the KS-scores as in (1.6), and F0 be the CDF of ψn,j under the null, 1 ≤ j ≤ p.
2. Threshold choice by HCT: Calculate P-values by πj = 1 − F0(ψn,j), 1 ≤ j ≤ p, and sort them by π(1) < π(2) < · · · < π(p). Define HCp,j = √p (j/p − π(j)) / √( max{√n (j/p − π(j)), 0} + j/p ), and let ĵ = argmax_{ {j: π(j) > log(p)/p, j < p/2} } {HCp,j}. The HC threshold tp^{HC} is the ĵth largest KS-score.
3. Post-selection PCA: Define the post-selection data matrix W^{(HC)} (i.e., the sub-matrix of W consisting of all columns j of W with ψn,j > tp^{HC}). Let U ∈ R^{n,K−1} be the matrix of the first (K − 1) left singular vectors of W^{(HC)}. Cluster by ŷ_HC^{IF} = kmeans(U, K).

In particular, if r > (√(K(1 − ϑ) − Kθ/3) + √(1 − θ))², we have that (√r − √(1 − θ))²/(2K) > (1 − ϑ)/2 − θ/6; then for q ≥ 1 − θ, the error bound is Lp p^{θ/6−(1−ϑ)/2} + Lp p^{−(√r−√q)²/(2K)}; for q < 1 − θ, the error bound is Lp p^{θ/6−(1−ϑ)/2+(1−θ−q)/2}; so the optimal q* = 1 − θ and the corresponding error bound is Lp p^{θ/6−(1−ϑ)/2} = Lp n^{1/6} sp^{−1/2}.
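As a companion to the pseudocode in Table 8, here is a minimal Python sketch of IF-PCA(1) (not the authors' code). It assumes the KS score is the √n-scaled KS distance of the standardized column to Φ, approximates the null CDF F0 by Monte Carlo, and takes W to be the column-standardized data matrix; the toy example at the end uses illustrative sizes and separations.

# A minimal sketch of IF-PCA(1) as in Table 8; see the assumptions stated above.
import numpy as np
from scipy.stats import norm
from scipy.cluster.vq import kmeans2

def ks_score(x):
    n = len(x)
    z = np.sort((x - x.mean()) / x.std(ddof=1))
    grid = np.arange(1, n + 1) / n
    d = np.maximum(np.abs(grid - norm.cdf(z)), np.abs(grid - 1.0 / n - norm.cdf(z)))
    return np.sqrt(n) * d.max()

def if_pca_hct(X, K, n_null=2000, seed=0):
    n, p = X.shape
    rng = np.random.default_rng(seed)
    psi = np.array([ks_score(X[:, j]) for j in range(p)])
    # Approximate the null CDF F0 by Monte Carlo; P-values pi_j = 1 - F0(psi_j)
    null = np.sort([ks_score(rng.standard_normal(n)) for _ in range(n_null)])
    pvals = 1.0 - np.searchsorted(null, psi, side="right") / n_null
    # Higher Criticism over the sorted P-values, restricted as in Table 8
    pi_sorted = np.sort(pvals)
    j_over_p = np.arange(1, p + 1) / p
    hc = np.sqrt(p) * (j_over_p - pi_sorted) / np.sqrt(
        np.maximum(np.sqrt(n) * (j_over_p - pi_sorted), 0.0) + j_over_p)
    ok = (pi_sorted > np.log(p) / p) & (np.arange(1, p + 1) < p / 2)
    j_hat = np.flatnonzero(ok)[np.argmax(hc[ok])]
    t_hc = np.sort(psi)[::-1][j_hat]               # HC threshold: the j_hat-th largest KS score
    selected = psi > t_hc
    # Post-selection PCA on the selected (standardized) columns, then k-means
    W = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U = np.linalg.svd(W[:, selected], full_matrices=False)[0][:, :K - 1]
    return kmeans2(U, K, minit="++")[1], selected

# Toy two-class example with illustrative sizes and a strong, asymmetric mean separation
rng = np.random.default_rng(4)
n, p = 200, 3000
truth = rng.random(n) < 1 / 3
X = rng.standard_normal((n, p))
X[:, :30] += np.where(truth, 3.0, -1.5)[:, None]   # weighted contrast average is zero
labels, selected = if_pca_hct(X, K=2)
print("features selected:", selected.sum(),
      " clustering error:", min(np.mean(labels != truth), np.mean(labels == truth)))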

3. Simulations. We conducted a small-scale simulation study to investigate


the numerical performance of IF-PCA. We consider two variants of IF-PCA, de-
noted by IF-PCA(1) and IF-PCA(2). In IF-PCA(1), the threshold is chosen using
HCT (so the choice is data-driven), and in IF-PCA(2), the threshold t is given. In
both variants, we skip the normalization step on KS scores (that step is designed
for microarray data only). The pseudocodes of IF-PCA(2) and IF-PCA(1) are given
in Table 7 (Section 2) and Table 8, respectively. We compared IF-PCA(1) and IF-
PCA(2) with four other methods: classical k-means (kmeans), k-means++
(kmeans+), classical hierarchical clustering (Hier) and SpectralGem (SpecGem;
same as classical PCA). In hierarchical clustering, we only consider the linkage
type of “complete”; other choices of linkage have very similar results.
In each experiment, we fix parameters (K, p, θ, ϑ, r, rep), two probability mass vectors δ = (δ1, ..., δK) and γ = (γ1, γ2, γ3), and three probability densities gσ, gμ defined over (0, ∞) and gμ̄ defined over (−∞, ∞). With these parameters, we let n = np = p^θ and εp = p^{−ϑ}; n is the sample size, εp is roughly the fraction of useful features (so the number of useful features is roughly p^{1−ϑ}), and rep is the number of repetitions.12 We generate the n × p data matrix X as follows:

12 For each parameter setting, we generate the X matrix for rep times, and at each time, we apply
all the six algorithms. The clustering errors are averaged over all the repetitions.
• Generate the class labels y1, y2, ..., yn i.i.d. from MN(K, δ),13 and let L be the n × K matrix such that the ith row of L equals e_k′ if and only if yi = k, 1 ≤ k ≤ K.
• Generate the overall mean vector μ̄ by μ̄(j) ∼ gμ̄ i.i.d., 1 ≤ j ≤ p.
• Generate the contrast mean vectors μ1, ..., μK as follows. First, generate b1, b2, ..., bp i.i.d. from Bernoulli(εp). Second, for each j such that bj = 1, generate the i.i.d. signs {βk(j)}_{k=1}^{K−1} such that βk(j) = −1, 0, 1 with probability γ1, γ2, γ3, respectively, and generate the feature magnitudes {hk(j)}_{k=1}^{K−1} i.i.d. from gμ. Last, for 1 ≤ k ≤ K − 1, set μk by [the factor 72π is chosen to be consistent with (2.10)]

    μk(j) = (72π · 2r log(p) · n^{−1})^{1/6} · hk(j) · bj · βk(j),

and let μK = −(1/δK) Σ_{k=1}^{K−1} δk μk.
• Generate the noise matrix Z as follows. First, generate a p × 1 vector σ by σ(j) ∼ gσ i.i.d. Second, generate the n rows of Z i.i.d. from N(0, Σ), where Σ = diag(σ²(1), σ²(2), ..., σ²(p)).
• Let X = 1μ̄′ + L[μ1, ..., μK]′ + Z (see the sketch below).
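The sketch below implements this generating scheme; the specific gμ̄, gμ, gσ and parameter values are illustrative placeholders for the choices listed in each experiment.

# Sketch of the data-generating scheme above (illustrative parameter and distribution choices).
import numpy as np

rng = np.random.default_rng(5)
K, p, theta, vartheta, r = 2, 10_000, 0.6, 0.7, 0.5
delta = np.array([1 / 3, 2 / 3])                    # class proportions
gamma = np.array([0.5, 0.0, 0.5])                   # sign probabilities for (-1, 0, +1)
n = int(round(p ** theta))
eps_p = p ** (-vartheta)                            # fraction of useful features

y = rng.choice(K, size=n, p=delta)                  # class labels from MN(K, delta)
L = np.eye(K)[y]                                    # n x K membership matrix

mu_bar = rng.standard_normal(p)                     # g_mubar: N(0, 1)
b = rng.random(p) < eps_p                           # useful-feature indicators, Bernoulli(eps_p)
signs = rng.choice([-1.0, 0.0, 1.0], size=(K - 1, p), p=gamma)   # beta_k(j)
h = rng.uniform(0.8, 1.2, size=(K - 1, p))                        # g_mu: uniform on (0.8, 1.2) here
amp = (72 * np.pi * 2 * r * np.log(p) / n) ** (1 / 6)             # factor from the text
mu = amp * h * b * signs                                          # mu_1, ..., mu_{K-1}
mu_K = -(delta[:K - 1, None] * mu).sum(axis=0) / delta[K - 1]     # weighted contrasts sum to 0
mu_all = np.vstack([mu, mu_K])                                    # K x p contrast means

sigma = rng.uniform(1.0, 1.2, size=p)               # g_sigma: uniform on (1.0, 1.2) here
Z = rng.standard_normal((n, p)) * sigma             # rows i.i.d. N(0, diag(sigma^2))

X = mu_bar + L @ mu_all + Z                         # n x p data matrix
print("n =", n, " number of useful features =", int(b.sum()))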
In the simulation settings, r can be viewed as the parameter of (average) signal strength. The density gσ characterizes noise heteroscedasticity; when gσ is a point mass at 1, the noise variances of all the features are equal. The density gμ controls the strengths of useful features; when gμ is a point mass at 1, all the useful features have the same strength. The signs of useful features are captured in the probability vector γ; when K = 2, we always set γ2 = 0 so that μk(j) ≠ 0 for a useful feature j; when K ≥ 3, for a useful feature j, we allow μk(j) = 0 for some k.
For IF-PCA(2), the theoretical threshold choice as in (2.14) is t = √(2q̃ log(p)) for some 0 < q̃ < (π − 2)/(4π) ≈ 0.09. We often set q̃ ∈ {0.03, 0.04, 0.05, 0.06}, depending on the signal strength parameter r.
The simulation study contains five experiments, which we now describe.

EXPERIMENT 1. In this experiment, we study the effect of signal strength on


clustering performance, and compare two cases: the classes have unequal or equal
number of samples. We set (K, p, θ, ϑ, rep) = (2, 4 × 10^4, 0.6, 0.7, 100), and γ =
(0.5, 0, 0.5) (so that the useful features have equal probability to have positive
and negative signs). Denote by U (a, b) the uniform distribution over (a − b, a +
b). We set gμ as U (0.8, 1.2), gσ as U (1, 1.2), and gμ̄ as N(0, 1). We investigate
two choices of δ: (δ1 , δ2 ) = (1/3, 2/3) and (δ1 , δ2 ) = (1/2, 1/2); we call them
“asymmetric” and “symmetric” case, respectively. In the latter case, the two classes
roughly have equal number of samples. The threshold in IF-PCA(2) is taken to be t = √(2 · 0.06 · log(p)).

13 We say X ∼ MN(K, δ) if P(X = k) = δk, 1 ≤ k ≤ K; MN stands for multinomial.
FIG. 4. Comparison of clustering error rates [Experiment 1(a)]. x-axis: signal strength parameter r. y-axis: error rates. Left: δ = (1/3, 2/3). Right: δ = (1/2, 1/2).

In Experiment 1(a), we let the signal strength parameter r ∈ {0.20, 0.35, 0.50,
0.65} for the asymmetric case, and r ∈ {0.06, 0.14, 0.22, 0.30} for the symmetric
case. The results are summarized in Figure 4. We find that two versions of IF-PCA
outperform the other methods in most settings, increasingly so when the signal
strength increases. Moreover, two versions of IF-PCA have similar performance,
with those of IF-PCA(1) being slightly better. This suggests that our threshold
choice by HCT is not only data-driven but also yields satisfactory clustering re-
sults. On the other hand, it also suggests that IF-PCA is relatively insensitive to
different choices of the threshold, as long as they are in a certain range.
In Experiment 1(b), we make a more careful comparison between the asym-
metric and symmetric cases. Note that for the same parameter r, the actual signal
strength in the symmetric case is stronger because of normalization. As a result,
for δ = (1/3, 2/3), we still let r ∈ {0.20, 0.35, 0.50, 0.65}, but for δ = (1/2, 1/2),
we take r  = c0 × {0.20, 0.35, 0.50, 0.65}, where c0 is a constant chosen such that
for any r > 0, r and c0 r yield the same value of κ(j ) [see (2.9)] in the asym-
metric and symmetric cases, respectively; we note that κ(j ) can be viewed as the
effective signal-to-noise ratio of Kolmogorov–Smirnov statistic. The results are
summarized in Table 9. Both versions of IF-PCA have better clustering results
when δ = (1/3, 2/3), suggesting that the clustering task is more difficult in the
symmetric case. This is consistent with the theoretical results; see, for example,
Arias-Castro and Verzelen (2014), Jin, Ke and Wang (2015b).

EXPERIMENT 2. In this experiment, we allow feature sparsity to vary [Experi-


ment 2(a)], and investigate the effect of unequal feature strength [Experiment 2(b)].
We set (K, p, θ, r, rep) = (2, 4 × 10^4, 0.6, 0.3, 100) (so n = 577), γ = (0.5, 0, 0.5) and (δ1, δ2) = (1/3, 2/3). The threshold for IF-PCA(2) is t = √(2 · 0.05 · log(p)).
TABLE 9
Comparison of average clustering error rates (Experiment 1). Numbers in the brackets are the standard deviations of the error rates

             (δ1, δ2) = (1/2, 1/2)              (δ1, δ2) = (1/3, 2/3)
r        IF-PCA(1)      IF-PCA(2)        IF-PCA(1)      IF-PCA(2)
0.20     0.467 (0.04)   0.481 (0.01)     0.391 (0.11)   0.443 (0.08)
0.35     0.429 (0.08)   0.480 (0.02)     0.253 (0.15)   0.341 (0.16)
0.50     0.368 (0.13)   0.466 (0.05)     0.144 (0.14)   0.225 (0.18)
0.65     0.347 (0.13)   0.459 (0.07)     0.099 (0.12)   0.098 (0.11)

In Experiment 2(a), we let ϑ range in {0.68, 0.72, 0.76, 0.80}. Since the number
of useful features is roughly p^{1−ϑ}, a larger ϑ corresponds to a higher sparsity level.

For any u and a, b > 0, let TN(u, b², a) be the conditional distribution of (X | u − a ≤ X ≤ u + a) for X ∼ N(u, b²), where TN stands for "Truncated Normal." We take gμ̄ as N(0, 1), gμ as TN(1, 0.1², 0.2), and gσ as TN(1, 0.1², 0.1). The
0.12 , 0.2), and gσ as TN(1, 0.12 , 0.1). The
results are summarized in the left panel of Figure 5, where for all sparsity levels,
two versions of IF-PCA have similar performance and each of them significantly
outperforms the other methods.
In Experiment 2(b), we use the same setting except that gμ is TN(1,  0.1, 0.7)
and gσ is the point mass at 1. Note that in Experiment 2(a), the support of gμ is
(0.8, 1.2), and in the current setting, the support is (0.3, 1.7) which is wider. As a
result, the strengths of useful features in the current setting have more variability.
At the same time, we force the noise variance of all features to be 1, for a fair
comparison. The results are summarized in the right panel of Figure 5. They are
similar to those in Experiment 2(a), suggesting that IF-PCA continues to work well
even when the feature strengths are unequal.

FIG. 5. Comparison of average clustering error rates (Experiment 2). x-axis: sparsity parameter ϑ. y-axis: error rates. Left: gμ is TN(1, 0.1², 0.2) and gσ is TN(1, 0.1², 0.1). Right: gμ is TN(1, 0.1, 0.7) and gσ is point mass at 1.
TABLE 10
Comparison of average clustering error rates (Experiment 3). Numbers in the brackets are the
standard deviations of the error rates

Threshold (q̃) ϑ = 0.68 ϑ = 0.72 ϑ = 0.76 ϑ = 0.80

IF-PCA(1) HCT (stochastic) 0.053 (0.08) 0.157 (0.16) 0.337 (0.14) 0.433 (0.10)
IF-PCA(2) 0.03 0.038 (0.05) 0.152 (0.12) 0.345 (0.13) 0.449 (0.06)
0.04 0.045 (0.08) 0.122 (0.12) 0.312 (0.15) 0.427 (0.09)
0.05 0.068 (0.12) 0.154 (0.15) 0.303 (0.16) 0.413 (0.12)
0.06 0.118 (0.15) 0.237 (0.17) 0.339 (0.16) 0.423 (0.10)

EXPERIMENT 3. In this experiment, we study how different threshold choices


affect the performance of IF-PCA. With the same settings as those in Experiment 2(b), we investigate four threshold choices for IF-PCA(2): t = √(2q̃ log(p)) for q̃ ∈
{0.03, 0.04, 0.05, 0.06}, where we recall that the theoretical choice of threshold
(2.14) suggests 0 < q̃ < 0.09. The results are summarized in Table 10, which sug-
gest that IF-PCA(1) and IF-PCA(2) have comparable performances, and that IF-
PCA(2) is relatively insensitive to different threshold choices, as long as they fall
in a certain range. However, the best threshold choice does depend on ϑ. From a
practical viewpoint, since ϑ is unknown, it is preferable to set the threshold in a
data-driven fashion; this is what we use in IF-PCA(1).

EXPERIMENT 4. In this experiment, we investigate the effects of correlations


among the noise on the clustering results. We generate the data matrix X the same as before, except that the noise matrix Z is replaced by ZA, for a matrix
A ∈ R p,p . Fixing a number d ∈ (−1, 1), we consider three choices of A, (a)–(c).
In (a), A(i, j ) = 1{i = j } + d · 1{j = i + 1}, 1 ≤ i, j ≤ p. In (b)–(c), fixing an
integer N > 1, for each j = 1, 2, . . . , p, we randomly generate a size N subset of
{1, 2, . . . , p} \ {j }, denoted by IN (j ). We then let A(i, j ) = 1{i = j } + d · 1{i ∈
IN (j )}. For (b), we take N = 5 and for (c), we take N = 20. We set d = 0.1 in
(a)–(c). We set (K, p, θ, ϑ, r, rep) = (4, 2 × 10^4, 0.5, 0.6, 0.7, 100) (so n = 141),
and (δ1 , δ2 , δ3 , δ4 ) = (1/4, 1/4, 1/4, 1/4), γ = (0.3, 0.05, 0.65). For an exponen-
tial random variable X ∼ Exp(λ), denote the density of [b + X|a1 ≤ b + X ≤ a2 ]

by TSE(λ, b, a1 , a2 ), where TSE stands for “Truncated Shifted Exponential.” We

take gμ̄ as N(0, 1), gμ as TSE(0.1, 0.9, −∞, ∞) (so it has mean 1), and gσ as TSE(0.1, 0.9, 0.9, 1.2). The threshold for IF-PCA(2) is t = √(2 · 0.03 · log(p)). The
TSE(0.1, 0.9, 0.9, 1.2). The threshold for IF-PCA(2) is t = 2 · 0.03 · log(p). The
results are summarized in the left panel of Figure 6, which suggest that IF-PCA
continues to work in the presence of correlations among the noise: IF-PCA sig-
nificantly outperforms the other 4 methods, especially for the randomly selected
correlations.
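To make the construction of A concrete, here is a small sketch of cases (a)–(c), with a reduced p for illustration.

# Sketch of the correlation structures (a)-(c) for A; p is reduced here for illustration.
import numpy as np

def make_A_banded(p, d):
    """Case (a): A(i, j) = 1{i = j} + d * 1{j = i + 1}."""
    return np.eye(p) + d * np.eye(p, k=1)

def make_A_random(p, d, N, rng):
    """Cases (b)-(c): A(i, j) = 1{i = j} + d * 1{i in I_N(j)}, where I_N(j) is a random
    size-N subset of {1, ..., p} excluding j."""
    A = np.eye(p)
    for j in range(p):
        pool = np.delete(np.arange(p), j)
        A[rng.choice(pool, size=N, replace=False), j] += d
    return A

rng = np.random.default_rng(6)
p, d = 500, 0.1
A_a = make_A_banded(p, d)
A_b = make_A_random(p, d, N=5, rng=rng)     # case (b)
A_c = make_A_random(p, d, N=20, rng=rng)    # case (c)
Z = rng.standard_normal((50, p))
Z_corr = Z @ A_a                            # replacing Z by ZA correlates the noise columns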
FIG. 6. Comparison of average clustering error rates for Experiment 4 (left panel) and Experiment 5 (right panel). y-axis: error rates.

EXPERIMENT 5. In this experiment, we study how different noise distri-


butions affect the clustering results. We generate the data matrix X the same
as before, except for the distribution of the noise matrix Z is different. We
consider three different settings for the noise matrix Z: (a) for a vector a = (a1, a2, ..., aK), generate row i of Z by Zi ∼ N(0, ak Ip) i.i.d. if Sample i comes from Class k, 1 ≤ k ≤ K, 1 ≤ i ≤ n; (b) Z = √(2/3) Z̃, where all entries of Z̃ are i.i.d. samples from t6(0), where t6(0) denotes the central t-distribution with df = 6; (c) Z = [Z̃ − 6]/√12, where the entries of Z̃ are i.i.d. samples from the chi-squared distribution with df = 6 [in (b)–(c), the constants √(2/3) and √12 are chosen so that each entry of Z has zero mean and unit variance]. We set (K, p, θ, ϑ, r, rep) = (4, 2 × 10^4, 0.5, 0.55, 1, 100), (δ1, δ2, δ3, δ4) = (1/4, 1/4, 1/3, 1/6), and γ = (0.4, 0.1, 0.5). We take gμ̄ to be N(0, 1). In case (a), we take (a1, a2, a3, a4) = (0.8, 1, 1.2, 1.4). The threshold for IF-PCA(2) is set as t = √(2 · 0.03 · log(p)). The results are summarized in the right panel of Figure 6, which suggest that IF-PCA continues to outperform the other 4 clustering methods.

4. Connections and extensions. We propose IF-PCA as a new spectral clus-


tering method, and we have successfully applied the method to clustering using
gene microarray data. IF-PCA is a two-stage method which consists of a marginal
screening step and a post-selection clustering step. The methodology contains
three important ingredients: using the KS statistic for marginal screening, post-
selection PCA and threshold choice by HC.
The KS statistic can be viewed as an omnibus test or a goodness-of-fit mea-
sure. The methods and theory we developed on the KS statistic can be useful in
many other settings, where it is of interest to find a powerful yet robust test. For
example, they can be used for non-Gaussian detection of the Cosmic Microwave
Background (CMB) or can be used for detecting rare and weak signals or small
cliques in large graphs [e.g., Donoho and Jin (2015)].
The KS statistic can also be viewed as a marginal screening procedure. Screen-
ing is a well-known approach in high dimensional analysis. For example, in vari-
able selection, we use marginal screening for dimension reduction [Fan and Lv
(2008)], and in cancer classification, we use screening to adapt Fisher’s LDA and
QDA to modern settings [Donoho and Jin (2008), Efron (2009), Fan et al. (2015)].
However, the setting here is very different.
Of course, another important reason that we choose to use the KS-based
marginal screening in IF-PCA is for simplicity and practical feasibility: with such
a screening method, we are able to (a) use Efron’s proposal of empirical null to cor-
rect the null distribution, and (b) set the threshold by Higher Criticism; (a)–(b) are
especially important as we wish to have a tuning-free and yet effective procedure
for subject clustering with gene microarray data. In more complicated situations, it
is possible that marginal screening is sub-optimal, and it is desirable to use a more
sophisticated screening method. We mention two possibilities below.
In the first possibility, we might use the recent approaches by Birnbaum et al.
(2013), Paul and Johnstone (2012), where the primary interest is signal recov-
ery or feature estimation. The point here is that, while the two problems—subject
clustering and feature estimation—are very different, we still hope that a better
feature estimation method may improve the results of subject clustering. In these
papers, the authors proposed Augmented sparse PCA (ASPCA) as a new approach
to feature estimation and showed that under certain sparse settings, ASPCA may
have advantages over marginal screening methods, and that ASPCA is asymp-
totically minimax. This suggests an alternative to IF-PCA, where in the IF step,
we replace the marginal KS screening by some augmented feature screening ap-
proaches. However, the open question is, how to develop such an approach that is
tuning-free and practically feasible. We leave this to the future work.
Another possibility is to combine the KS statistic with the recent innovation
of Graphlet Screening [Jin, Zhang and Zhang (2014), Ke, Jin and Fan (2014)]
in variable selection. This is particularly appropriate if the columns of the noise
matrix Z are correlated, where it is desirable to exploit the graphic structures of
the correlations to improve the screening efficiency. Graphlet Screening is a graph-guided multivariate screening procedure and has advantages over the better-known methods of marginal screening and the lasso. At the heart of Graphlet Screening is a graph, which in our setting is defined as follows: each feature j, 1 ≤ j ≤ p, is a node, and there is an edge between nodes i and j if and only if column i and column j of the normalized data matrix W are strongly correlated (note that for a useful feature, the means of the corresponding column of W are non-zero; in our range of interest, these non-zero means are of the order n^{−1/6}, and so have negligible effects on the correlations). In this sense, adapting Graphlet Screening in the screening step helps to handle highly correlated data. We leave this to the future work.
The post-selection PCA is a flexible idea that can be adapted to address many
other problems. Take model (1.1) for example. The method can be adapted to ad-
dress the problem of testing whether LM = 0 or LM ≠ 0 (i.e., whether the data
matrix consists of a low-rank structure or not), the problem of estimating M, or
the problem of estimating LM. The latter is connected to recent interest on sparse
PCA and low-rank matrix recovery. Intellectually, the PCA approach is connected
to SCORE for community detection on social networks [Jin (2015)], but is very
different.
Threshold choice by HC is a recent innovation, and was first proposed in
Donoho and Jin (2008) [see also Fan, Jin and Yao (2013)] in the context of classi-
fication. However, our focus here is on clustering, and the method and theory we
need are very different from those in Donoho and Jin (2008), Fan, Jin and Yao
(2013). In particular, this paper requires sophisticated post-selection Random Ma-
trix Theory (RMT), which we do not need in Donoho and Jin (2008), Fan, Jin
and Yao (2013). Our study on RMT is connected to Baik and Silverstein (2006),
Guionnet and Zeitouni (2000), Johnstone (2001), Lee, Zou and Wright (2010),
Paul (2007) but is very different.
At a high level, IF-PCA is connected to the approaches by Azizyan, Singh and
Wasserman (2013), Chan and Hall (2010) in that all three approaches are two-
stage methods that consist of a screening step and a post-selection clustering step.
However, the screening step and the post-selection step in all three approaches are
significantly different from each other. Also, IF-PCA is connected to the spectral
graph partitioning algorithm by Ng, Jordan and Weiss (2002), but it is very differ-
ent, especially in feature selection and threshold choice by HC.
In this paper, we have assumed that the first (K − 1) contrast mean vectors
μ1 , μ2 , . . . , μK−1 are linearly independent (consequently, the rank of the matrix
M [see (2.6)] is (K − 1)), and that K is known (recall that K is the number of
classes). In the gene microarray examples we discuss in this paper, a class is a patient group (normal, cancer, cancer sub-type), so K is usually known to us a priori. Moreover, it is believed that different cancer sub-types can be distin-
guished from each other by one or more genes (though we do not know which)
so μ1 , μ2 , . . . , μK−1 are linearly independent. Therefore, both assumptions are
reasonable.
On the other hand, in a broader context, either of these two assumptions could
be violated. Fortunately, at least to some extent, the main ideas in this paper can
be extended. We consider two cases. In the first one, we assume K is known but
r = rank(M) < (K − 1). In this case, the main results in this paper continue to
hold, provided that some mild regularity conditions hold. In detail, let U ∈ R^{n,r} be the matrix consisting of the first r left singular vectors of LM as before; it can be shown that, as before, U has K distinct rows. The additional regularity condition we need here is that the ℓ²-distance between any pair of the K distinct rows has a reasonable lower bound. In the second case, we assume K is unknown and has to
be estimated. In the literature, this is a well-known hard problem. To tackle this
problem, one might utilize the recent developments on rank detection by Kritchman and Nadler (2008) [see also Birnbaum et al. (2013), Cai, Ma and Wu (2015)], where, in a similar setting, the authors constructed a confidence lower bound for the
number of classes K. A problem of interest is then to investigate how to combine
the methods in these papers with IF-PCA to deal with the more challenging case
of unknown K; we leave this for future study.

Acknowledgements. The authors would like to thank David Donoho,


Shiqiong Huang, Tracy Zheng Ke, Pei Wang and anonymous referees for valu-
able pointers and discussion.

SUPPLEMENTARY MATERIAL
Supplement to “Influential Features PCA for high dimensional clustering”
(DOI: 10.1214/15-AOS1423SUPP; .pdf). Owing to space constraints, the technical
proofs are relegated to a supplementary document, Jin and Wang (2016). It contains
three sections: Appendices A–C.

REFERENCES
ABRAMOVICH, F., BENJAMINI, Y., DONOHO, D. L. and JOHNSTONE, I. M. (2006). Adapting to unknown sparsity by controlling the false discovery rate. Ann. Statist. 34 584–653. MR2281879
AMINI, A. A. and WAINWRIGHT, M. J. (2009). High-dimensional analysis of semidefinite relaxations for sparse principal components. Ann. Statist. 37 2877–2921. MR2541450
ARIAS-CASTRO, E., LERMAN, G. and ZHANG, T. (2013). Spectral clustering based on local PCA. Available at arXiv:1301.2007.
ARIAS-CASTRO, E. and VERZELEN, N. (2014). Detection and feature selection in sparse mixture models. Available at arXiv:1405.1478.
ARTHUR, D. and VASSILVITSKII, S. (2007). k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms 1027–1035. ACM, New York. MR2485254
AZIZYAN, M., SINGH, A. and WASSERMAN, L. (2013). Minimax theory for high-dimensional Gaussian mixtures with sparse mean separation. In Advances in Neural Information Processing Systems 2139–2147. Curran Associates, Red Hook, NY.
BAIK, J. and SILVERSTEIN, J. W. (2006). Eigenvalues of large sample covariance matrices of spiked population models. J. Multivariate Anal. 97 1382–1408. MR2279680
BIRNBAUM, A., JOHNSTONE, I. M., NADLER, B. and PAUL, D. (2013). Minimax bounds for sparse PCA with noisy high-dimensional data. Ann. Statist. 41 1055–1084. MR3113803
CAI, T., MA, Z. and WU, Y. (2015). Optimal estimation and rank detection for sparse spiked covariance matrices. Probab. Theory Related Fields 161 781–815. MR3334281
CHAN, Y. and HALL, P. (2010). Using evidence of mixed populations to select variables for clustering very high-dimensional data. J. Amer. Statist. Assoc. 105 798–809. MR2724862
CHEN, J. and LI, P. (2009). Hypothesis test for normal mixture models: The EM approach. Ann. Statist. 37 2523–2542. MR2543701
DAVIS, C. and KAHAN, W. M. (1970). The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal. 7 1–46. MR0264450
DETTLING, M. (2004). BagBoosting for tumor classification with gene expression data. Bioinformatics 20 3583–3593.
DONOHO, D. (2015). 50 years of data science. Unpublished manuscript.
DONOHO, D. and JIN, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist. 32 962–994. MR2065195
DONOHO, D. and JIN, J. (2008). Higher criticism thresholding: Optimal feature selection when useful features are rare and weak. Proc. Natl. Acad. Sci. USA 105 14790–14795.
DONOHO, D. and JIN, J. (2015). Higher criticism for large-scale inference, especially for rare and weak effects. Statist. Sci. 30 1–25. MR3317751
DURBIN, J. (1985). The first-passage density of a continuous Gaussian process to a general boundary. J. Appl. Probab. 22 99–122. MR0776891
EFRON, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J. Amer. Statist. Assoc. 99 96–104. MR2054289
EFRON, B. (2009). Empirical Bayes estimates for large-scale prediction problems. J. Amer. Statist. Assoc. 104 1015–1028. MR2562003
FAN, Y., JIN, J. and YAO, Z. (2013). Optimal classification in sparse Gaussian graphic model. Ann. Statist. 41 2537–2571. MR3161437
FAN, J. and LV, J. (2008). Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol. 70 849–911. MR2530322
FAN, J., KE, Z. T., LIU, H. and XIA, L. (2015). QUADRO: A supervised dimension reduction method via Rayleigh quotient optimization. Ann. Statist. 43 1498–1534. MR3357869
GORDON, G. J., JENSEN, R. V., HSIAO, L., GULLANS, S. R., BLUMENSTOCK, J. E., RAMASWAMY, S., RICHARDS, W. G., SUGARBAKER, D. J. and BUENO, R. (2002). Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res. 62 4963–4967.
GUIONNET, A. and ZEITOUNI, O. (2000). Concentration of the spectral measure for large matrices. Electron. Commun. Probab. 5 119–136 (electronic). MR1781846
HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York. MR2722294
JIN, J. (2015). Fast community detection by SCORE. Ann. Statist. 43 57–89. MR3285600
JIN, J. and KE, Z. T. (2016). Rare and weak effects in large-scale inference: Methods and phase diagrams. Statist. Sinica 26 1–34.
JIN, J., KE, Z. T. and WANG, W. (2015a). Optimal spectral clustering by Higher Criticism Thresholding. Manuscript.
JIN, J., KE, Z. T. and WANG, W. (2015b). Phase transitions for high dimensional clustering and related problems. Available at arXiv:1502.06952.
JIN, J. and WANG, W. (2016). Supplement to "Influential Features PCA for high dimensional clustering." DOI:10.1214/15-AOS1423SUPP.
JIN, J., ZHANG, C. and ZHANG, Q. (2014). Optimality of graphlet screening in high dimensional variable selection. J. Mach. Learn. Res. 15 2723–2772. MR3270749
JOHNSTONE, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. 29 295–327. MR1863961
JUNG, S. and MARRON, J. S. (2009). PCA consistency in high dimension, low sample size context. Ann. Statist. 37 4104–4130. MR2572454
KE, Z. T., JIN, J. and FAN, J. (2014). Covariate assisted screening and estimation. Ann. Statist. 42 2202–2242. MR3269978
KOLMOGOROV, A. N. (1933). Sulla determinazione empirica di una legge di distribuzione. G. Ist. Ital. Attuari 4 83–91.
KRITCHMAN, S. and NADLER, B. (2008). Determining the number of components in a factor model from limited noisy data. Chemometr. Intell. Lab 94 19–32.
LEE, A. B., LUCA, D. and ROEDER, K. (2010). A spectral graph approach to discovering genetic ancestry. Ann. Appl. Stat. 4 179–202. MR2758169
LEE, S., ZOU, F. and WRIGHT, F. A. (2010). Convergence and prediction of principal component scores in high-dimensional settings. Ann. Statist. 38 3605–3629. MR2766862
LEI, J. and VU, V. Q. (2015). Sparsistency and agnostic inference in sparse PCA. Ann. Statist. 43 299–322. MR3311861
LOADER, C. R. (1992). Boundary crossing probabilities for locally Poisson processes. Ann. Appl. Probab. 2 199–228. MR1143400
MA, Z. (2013). Sparse principal component analysis and iterative thresholding. Ann. Statist. 41 772–801. MR3099121
NG, A. Y., JORDAN, M. I. and WEISS, Y. (2002). On spectral clustering: Analysis and an algorithm. Adv. Neural Inf. Process. Syst. 2 849–856.
PAUL, D. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statist. Sinica 17 1617–1642. MR2399865
PAUL, D. and JOHNSTONE, I. M. (2012). Augmented sparse principal component analysis for high dimensional data. Available at arXiv:1202.1242.
SHORACK, G. R. and WELLNER, J. A. (1986). Empirical Processes with Applications to Statistics. Wiley, New York. MR0838963
SIEGMUND, D. (1982). Large deviations for boundary crossing probabilities. Ann. Probab. 10 581–588. MR0659529
WOODROOFE, M. (1978). Large deviations of likelihood ratio statistics with applications to sequential testing. Ann. Statist. 6 72–84. MR0455183
YOUSEFI, M. R., HUA, J., SIMA, C. and DOUGHERTY, E. R. (2010). Reporting bias when using real data sets to analyze classification performance. Bioinformatics 26 68–76.
ZOU, H., HASTIE, T. and TIBSHIRANI, R. (2006). Sparse principal component analysis. J. Comput. Graph. Statist. 15 265–286. MR2252527

DEPARTMENT OF STATISTICS
CARNEGIE MELLON UNIVERSITY
PITTSBURGH, PENNSYLVANIA 15213
USA
E-MAIL: jiashun@stat.cmu.edu

DEPARTMENT OF STATISTICS AND APPLIED PROBABILITY
NATIONAL UNIVERSITY OF SINGAPORE
SINGAPORE 117546
E-MAIL: staww@nus.edu.sg
