TABLE 1
Gene microarray data sets investigated in this paper. Note that K is small and p ≫ n
(p: number of genes; n: number of subjects)
and the vector σ = (σ (1), σ (2), . . . , σ (p)) is unknown to us. Assumption (1.2)
is only for simplicity: our method to be introduced below is not tied to such an
assumption, and works well with most of the data sets in Table 1; see Sections 1.1
and 1.4 for more discussions.
Denote the overall mean vector by μ̄ = (1/n) Σ_{i=1}^n E[X_i]. For K different vectors μ_1, μ_2, ..., μ_K ∈ R^p, we model E[X_i] by (y_i are the class labels)
(1.3) E[Xi ] = μ̄ + μk if and only if yi = k.
For 1 ≤ k ≤ K, let δk be the fraction of samples in Class k. Note that
(1.4) δ1 μ1 + δ2 μ2 + · · · + δK μK = 0,
so μ1 , μ2 , . . . , μK are linearly dependent. However, it is natural to assume
(1.5) μ1 , μ2 , . . . , μK−1 are linearly independent.
3 Such a two-stage clustering idea (i.e., feature selection followed by post-selection clustering)
is not completely new and can be found in Chan and Hall (2010), for example. Of course, their
procedure is very different from ours.
Let Û^{(t)} ∈ R^{n,K−1} be the matrix consisting of the first K − 1 (unit-norm) left singular vectors of W^{(t)}.⁵ Define a matrix Û_*^{(t)} ∈ R^{n,K−1} by truncating Û^{(t)} entry-wise at the threshold T_p = √(log(p)/n).⁶
4 Alternatively, we can normalize the KS-scores with sample median and Median Absolute Devia-
tion (MAD); see Section 1.5 for more discussion.
5 For a matrix M ∈ R^{n,m}, the kth left (right) singular vector is the eigenvector associated with the kth largest eigenvalue of the matrix MM′ (of the matrix M′M).
6 That is, Û_*^{(t)}(i, k) = Û^{(t)}(i, k)·1{|Û^{(t)}(i, k)| ≤ T_p} + T_p·sgn(Û^{(t)}(i, k))·1{|Û^{(t)}(i, k)| > T_p}, 1 ≤ i ≤ n, 1 ≤ k ≤ K − 1. We usually take T_p = √(log(p)/n) as above, but log(p) can be replaced by any sequence that tends to ∞ as p → ∞. The truncation is mostly for theoretical analysis in Section 2 and is not used in the numerical study (real or simulated data).
TABLE 2
Clustering errors and # of selected features for different choices of t [Lung Cancer(1) data].
Columns highlighted correspond to the sweet spot of the threshold choice
Threshold t 0.000 0.608 0.828 0.938 1.048 1.158 1.268 1.378 1.488
# of selected features 12,533 5758 1057 484 261 129 63 21 2
Clustering errors 22 22 24 4 5 7 38 39 33
1.2. KS statistic, normality assumption, and Efron’s empirical null. The goal
in Steps 1–2 is to find an easy-to-implement method to rank the features. The focus
of Step 1 is on a data matrix satisfying models (1.1)–(1.5), and the focus of Step 2
is to adjust Step 1 so that it works well with microarray data. We consider the two
steps separately.
FIG. 1. Comparison of Û^{(t)} for t = 0.000 (left; no feature selection) and t = 1.057 (right; t is set
by Higher Criticism in a data-driven fashion); note Û (t) is an n × 1 vector since K = 2. y-axis:
entries of Û (t) , x-axis: sample indices. Plots are based on Lung Cancer(1) data, where ADCA and
MPM represent two different classes.
Consider the first step. The interest is to test for each fixed j , 1 ≤ j ≤ p, whether
feature j is useless or useful. Since we have no prior information about the class
labels, the problem can be reformulated as that of testing whether all n samples
associated with the j th feature are i.i.d. Gaussian
(1.9)  H_{0,j}: X_i(j) ∼^{i.i.d.} N(μ̄(j), σ²(j)),  i = 1, 2, ..., n,
or they are i.i.d. from a K-component heterogeneous Gaussian mixture:
(1.10)  H_{1,j}: X_i(j) ∼^{i.i.d.} Σ_{k=1}^K δ_k N(μ̄(j) + μ_k(j), σ²(j)),  i = 1, 2, ..., n.
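The definition of the KS-score ψ_{n,j} in (1.6) is not reproduced in this excerpt; the following is a minimal Python sketch of the screening step, assuming ψ_{n,j} is the (√n-scaled) Kolmogorov–Smirnov distance between the empirical CDF of the n samples of feature j and the normal CDF fitted with that feature's sample mean and standard deviation. It is an illustration only, not the authors' code.

import numpy as np
from scipy.stats import kstest

def ks_scores(X):
    """Feature-wise KS-scores: for each column j of the n x p data matrix X,
    the sup-distance between the empirical CDF and the fitted normal CDF."""
    n, p = X.shape
    psi = np.zeros(p)
    for j in range(p):
        x = X[:, j]
        mu_hat, sd_hat = x.mean(), x.std(ddof=1)
        # kstest returns the sup-distance between the empirical CDF and N(mu_hat, sd_hat^2)
        psi[j] = kstest(x, 'norm', args=(mu_hat, sd_hat)).statistic
    # a sqrt(n) scaling is applied here; any monotone scaling gives the same ranking
    return np.sqrt(n) * psi

The adjusted KS-scores discussed below are then obtained by standardizing these scores with their overall mean and standard deviation (or median and MAD).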
FIG. 2. Left: The histogram of KS-scores of the Lung Cancer(1) data. The two lines in blue and
red denote the theoretical null and empirical null densities, respectively. Right: empirical survival
function of the adjusted KS-scores based on Lung Cancer(1) data (red) and the survival function of
theoretical null (blue).
If the normality assumption (1.2) is valid for this data set, then the density function of the
KS statistic for model (1.9) (the blue curve in left panel; obtained by simulations)
should fit well with the histogram of the KS-scores based on the Lung Cancer(1)
data. Unfortunately, this is not the case, and there is a substantial discrepancy in
fitting. On the other hand, if we translate and rescale the blue curve so that it has
the same mean and standard deviation as the KS-scores associated with Lung Can-
cer(1) data, then the new curve (red curve; left panel of Figure 2) fits well with the
histogram.7
A related phenomenon was discussed in Efron (2004), though for Studentized t-statistics in a different setting. As in Efron (2004), we call the density
functions associated with two curves (blue and red) the theoretical null and the
empirical null, respectively. The phenomenon is then: the theoretical null has a
poor fit with the histogram of the KS-scores of the real data, but the empirical null
may have a good fit.
In the right panel of Figure 2, we view this from a slightly different perspective, and show that the survival function associated with the adjusted KS-scores (i.e., ψ*_{n,j}) of the real data fits well with the theoretical null.
The above observations explain the rationale for Step 2. Also, they suggest that
IF-PCA does not critically depend on the normality assumption and works well for
microarray data. This is further validated in Section 1.4.
7 If we replace the sample mean and standard deviation by the sample median and MAD, respectively, then this gives rise to the normalization in the second footnote of Section 1.1.
TABLE 3
Pseudocode for IF-HCT-PCA (for microarray data; threshold set by Higher Criticism)

Input: data matrix X, number of classes K. Output: class label vector ŷ_{HC}^{IF}.
1. Rank features: Let ψ_{n,j} be the KS-scores as in (1.6) and F_0 be the CDF of ψ_{n,j} under the null, 1 ≤ j ≤ p.
2. Normalize KS-scores: ψ_n^* = (ψ_n − mean(ψ_n))/SD(ψ_n).
3. Threshold choice by HCT: Calculate P-values by π_j = 1 − F_0(ψ_{n,j}^*), 1 ≤ j ≤ p, and sort them by π_(1) < π_(2) < ... < π_(p). Define HC_{p,j} = √p (j/p − π_(j)) / √( j/p + √n · max{j/p − π_(j), 0} ), and let ĵ = argmax_{j: π_(j) > log(p)/p, j < p/2} {HC_{p,j}}. The HC threshold t_p^{HC} is the ĵth largest KS-score.
4. Post-selection PCA: Define the post-selection data matrix W^{(HC)} (i.e., the sub-matrix of W consisting of all columns j of W with ψ_{n,j}^* > t_p^{HC}). Let U ∈ R^{n,K−1} be the matrix of the first (K − 1) left singular vectors of W^{(HC)}. Cluster by ŷ_{HC}^{IF} = kmeans(U, K).
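To make the threshold-selection steps concrete, here is a minimal Python sketch of Steps 2–3 of Table 3 (our own illustration; the null CDF F0 is assumed to be available, for example tabulated by simulating KS-scores under the null).

import numpy as np

def hc_threshold(psi, F0, n):
    """Higher-Criticism threshold for the KS-scores, following Table 3 (Steps 2-3).
    psi : length-p array of KS-scores.  F0 : vectorized null CDF of the KS-score."""
    p = len(psi)
    # Step 2: normalize with the empirical null (sample mean and SD of the scores)
    psi_star = (psi - psi.mean()) / psi.std(ddof=1)
    # Step 3: P-values under the theoretical null, sorted increasingly
    pi = np.sort(1.0 - F0(psi_star))
    j = np.arange(1, p + 1)
    hc = np.sqrt(p) * (j / p - pi) / np.sqrt(j / p + np.sqrt(n) * np.maximum(j / p - pi, 0.0))
    # maximize HC over the range pi_(j) > log(p)/p and j < p/2
    ok = (pi > np.log(p) / p) & (j < p / 2)
    j_hat = j[ok][np.argmax(hc[ok])]
    # the HC threshold is the j_hat-th largest (normalized) KS-score
    return np.sort(psi_star)[::-1][j_hat - 1]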
Let ĵ be the index such that ĵ = argmax_{1≤j≤p/2, π_(j)>log(p)/p} {HC_{p,j}}. The HC threshold t_p^{HC} for IF-PCA is then the ĵth largest KS-score.
Combining HCT with IF-PCA gives a tuning-free clustering procedure IF-HCT-
PCA, or IF-PCA for short if there is no confusion. See Table 3.
For illustration, we again employ the Lung Cancer(1) data. In this data set,
jˆ = 251, tpHC = 1.0573, and HC selects 251 genes with the largest KS-scores.
In Figure 3, we plot the error rates of IF-PCA applied to the k features of W with
the largest KS-scores, where k ranges from 1 to p/2 (for different k, we are using
the same ranking for all p genes). The figure shows that there is a “sweet spot”
for k where the error rates are the lowest. HCT corresponds to jˆ = 251 and 251 is
FIG. 3. Error rates by IF-PCA (y-axis) with different numbers of selected features k (x-axis) [Lung
Cancer(1) data]. HCT corresponds to 251 selected features (dashed vertical line).
in this sweet spot. This suggests that HCT gives a reasonable threshold choice, at
least for some real data sets.
The rationale for HCT can also be explained theoretically. For illustration, con-
sider the case where K = 2 so we only have two classes. Fixing a threshold t > 0,
let Û (t) be the first left singular vector of W (t) as in Section 1.1. In a companion
paper [Jin, Ke and Wang (2015a)], we show that when the signals are rare and
weak, then for t in the range of interest,
(1.12)  Û^{(t)} ∝ snr(t) · U + z + rem,
where U is an n × 1 non-stochastic vector with only two distinct entries (each determines one of the two classes), snr(t) is a non-stochastic function of t, z ∼ N(0, I_n), and rem is the remainder term [the entries of which are asymptotically of much smaller magnitude than those of z or snr(t) · U]. Therefore, the performance of IF-PCA is best when we maximize snr(t) (though this is unobservable). We call such a threshold the Ideal Threshold: t_p^{ideal} = argmax_{t>0} {snr(t)}.
Let F̄_p(t) be the survival function of ψ_{n,j} under the null (not dependent on j), and let Ĝ_p(t) = (1/p) Σ_{j=1}^p 1{ψ_{n,j} ≥ t} be the empirical survival function. Introduce HC_p(t) = √p [Ĝ_p(t) − F̄_p(t)] / √( Ĝ_p(t) + √n · max{Ĝ_p(t) − F̄_p(t), 0} ), and let ψ_(1) > ψ_(2) > ... > ψ_(p) be the sorted values of ψ_{n,j}. Recall that π_(k) is the kth smallest P-value. By definitions, we have Ĝ_p(t)|_{t=ψ_(k)} = k/p and F̄_p(t)|_{t=ψ_(k)} = π_(k). As a result, we have HC_p(t)|_{t=ψ_(k)} = √p [k/p − π_(k)] / √( k/p + √n · max{k/p − π_(k), 0} ), where the right-hand side is the form of HC introduced in (1.11). Note that HC_p(t) is a function that is discontinuous only at t = ψ_(k), 1 ≤ k ≤ p, and between two adjacent discontinuity points the function is monotone. Combining this with the definition of t_p^{HC}, t_p^{HC} = argmax_t {HC_p(t)}.
Now, as p → ∞, some regularity appears, and Ĝ_p(t) converges to a non-stochastic counterpart, denoted by Ḡ_p(t), which can be viewed as the survival function associated with the marginal density of ψ_{n,j}. Introduce IdealHC(t) = √p [Ḡ_p(t) − F̄_p(t)] / √( Ḡ_p(t) + √n · max{Ḡ_p(t) − F̄_p(t), 0} ) as the ideal counterpart of HC_p(t). It is seen that HC_p(t) ≈ IdealHC(t) for t in the range of interest, and so t_p^{HC} ≈ t_p^{idealHC}, where the latter is defined as the nonstochastic threshold t that maximizes IdealHC(t).
In Jin, Ke and Wang (2015a), we show that under a broad class of rare and weak
signal models, the leading term of the Taylor expansion of snr(t) is proportional to
that of IdealHC(t) for t in the range of interest, and so tpidealHC ≈ tpideal . Combining
this with the discussions above, we have tpHC ≈ tpidealHC ≈ tpideal , which explains the
rationale for HCT.
The above relationships are justified in Jin, Ke and Wang (2015a). The proofs
are rather long (70 manuscript pages in Annals of Statistics format), so we will
report them in a separate paper. The ideas above are similar to those in Donoho and Jin (2008), but the focus there is on classification while our focus is on clustering; our version of HC is also very different from theirs.
TABLE 4
Comparison of clustering error rates by different methods for the 10 gene microarray data sets
introduced in Table 1. Column 5: numbers in the brackets are the standard deviations (SD); the SDs for all other methods are negligible and so are not reported. Last column: see (1.13)
An interesting point: for "easier" data sets, IF-PCA tends to have larger improvements over the other four methods.
We make several remarks. First, for the Brain data set, unexpectedly, IF-PCA
underperforms classical PCA, but still outperforms other methods. Among our data
sets, the Brain data seem to be an “outlier”. Possible reasons include (a) useful
features are not sparse, and (b) the sample size is very small (n = 42) so the useful
features are individually very weak. When (a)–(b) happen, it is almost impossible
to successfully separate the useful features from useless ones, and it is preferable
to use classical PCA. Such a scenario may be found in Jin, Ke and Wang (2015b);
see, for example, Figure 1 (left) and related context therein.
Second, for Colon Cancer, all methods behave unsatisfactorily, and IF-PCA
slightly underperforms hierarchical clustering (r = 1.04). The data set is known
to be a difficult one even for classification (where class labels of training samples
are known [Donoho and Jin (2008)]). For such a difficult data set, it is hard for
IF-PCA to significantly outperform other methods.
Last, for the SuCancer data set, the KS-scores are significantly skewed to the
right. Therefore, instead of using the normalization (1.7), we normalize ψn,j such
that the mean and standard deviation for the lower 50% of KS-scores match those
for the lower 50% of the simulated KS-scores under the null; compare this with
Section 1.3 for remarks on P -value calculations.
TABLE 5
Clustering error rates of IF-HCT-PCA, IF-HCT-PCA-med, IF-HCT-kmeans and IF-HCT-hier for the 10 gene microarray data sets in Table 1 (one column per data set)
IF-HCT-PCA 0.262 0.406 0.403 0.069 0.033 0.217 0.065 0.382 0.444 0.333
IF-HCT-PCA-med 0.333 0.424 0.436 0.014 0.017 0.217 0.097 0.382 0.206 0.333
IF-HCT-kmeans 0.191 0.380 0.403 0.028 0.033 0.217 0.032 0.382 0.401 0.328
IF-HCT-hier 0.476 0.351 0.371 0.250 0.177 0.227 0.355 0.412 0.603 0.500
1.6. Connection to sparse PCA. The study is closely related to the recent in-
terest on sparse PCA [Amini and Wainwright (2009), Arias-Castro, Lerman and
Zhang (2013), Johnstone (2001), Jung and Marron (2009), Lei and Vu (2015),
Ma (2013), Zou, Hastie and Tibshirani (2006)], but is different in important ways.
Consider the normalized data matrix W = [W_1, W_2, ..., W_n]′ for example. In our model, recall that μ_1, μ_2, ..., μ_K are the K sparse contrast mean vectors and that the noise covariance matrix Σ is diagonal; we have
W ≈ MΣ^{−1/2} + Z,  where Z ∈ R^{n,p} has i.i.d. N(0, 1) entries,
and M ∈ R^{n,p} is the matrix whose ith row is μ_k′ if and only if i ∈ Class k. This is a setting that is frequently considered in the sparse PCA literature.
However, we must note that the main focus of sparse PCA is to recover the supports of μ_1, μ_2, ..., μ_K, while the main focus here is subject clustering. We recognize that the two problems—support recovery and subject clustering—are
essentially two different problems, and addressing one successfully does not nec-
essarily address the other successfully. For illustration, consider two scenarios:
• If useful features are very sparse but each is sufficiently strong, it is easy to
identify the support of the useful features, but due to the extreme sparsity, it
may be still impossible to have consistent clustering.
• If most of the useful features are very weak with only a few of them very strong,
the latter will be easy to identify and may yield consistent clustering, still, it
may be impossible to satisfactorily recover the supports of μ1 , μ2 , . . . , μK , as
most of the useful features are very weak.
In a forthcoming manuscript, Jin, Ke and Wang (2015b), we investigate the connections and differences between the two problems more closely, and elaborate on the above points in detail.
With that being said, from a practical viewpoint, one may still wonder how
sparse PCA may help in subject clustering. A straightforward clustering approach
that exploits the sparse PCA ideas is the following:
• Estimate the first (K − 1) right singular vectors of the matrix MΣ^{−1/2} using the sparse PCA algorithm as in Zou, Hastie and Tibshirani (2006), equation (3.7) (say). Denote the estimates by ν̂_1^{sp}, ν̂_2^{sp}, ..., ν̂_{K−1}^{sp}.
• Cluster by applying classical k-means to the n × (K − 1) matrix [Wν̂_1^{sp}, Wν̂_2^{sp}, ..., Wν̂_{K−1}^{sp}], assuming there are ≤ K classes.
For short, we call this approach Clu-sPCA. One problem here is that Clu-sPCA is not tuning-free, as most existing sparse PCA algorithms have one or more tuning parameters. How to set the tuning parameters in subject clustering is a challenging problem: for example, since the class labels are unknown, conventional cross-validation (as we may use in classification, where class labels of the training set are known) might not help.
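A minimal sketch of Clu-sPCA (our own illustration): sklearn's SparsePCA, an l1-penalized dictionary-learning formulation, is used here as a stand-in for the sparse PCA of Zou, Hastie and Tibshirani (2006), equation (3.7), and its penalty parameter alpha plays the role of the tuning parameter discussed above.

import numpy as np
from sklearn.decomposition import SparsePCA
from sklearn.cluster import KMeans

def clu_spca(W, K, alpha=1.0, seed=0):
    """Clu-sPCA sketch: sparse estimates of the K-1 leading right singular vectors,
    followed by k-means on the projected data (alpha is the sparsity tuning parameter)."""
    spca = SparsePCA(n_components=K - 1, alpha=alpha, random_state=seed)
    spca.fit(W)                             # W: n x p normalized data matrix
    V = spca.components_.T                  # p x (K-1) sparse directions
    scores = W @ V                          # n x (K-1) matrix [W v_1, ..., W v_{K-1}]
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(scores)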
In Table 6, we compare IF-HCT-PCA and Clu-sPCA using the 10 data sets in
Table 1. Note that in Clu-sPCA, the tuning parameter in the sparse PCA step [Zou,
Hastie and Tibshirani (2006), equation (3.7)] is ideally chosen to minimize the
clustering errors, using the true class labels. The results are based on 30 independent repetitions.
TABLE 6
Clustering error rates for IF-HCT-PCA and Clu-sPCA. The tuning parameter of Clu-sPCA is
chosen ideally to minimize the errors (IF-HCT-PCA is tuning-free). Only SDs that are larger
than 0.01 are reported (in brackets)
IF-HCT-PCA 0.262 0.406 0.403 0.069 0.033 0.217 0.065 0.382 0.444 0.333
Clu-sPCA 0.263 0.438 0.435 0.292 0.110 0.433 0.190 (0.01) 0.422 0.428 0.437
1.8. Content and notation. Section 2 contains the main theoretical results,
where we show IF-PCA is consistent in clustering under some regularity condi-
tions. Section 3 contains the numerical studies, and Section 4 discusses connections to other work and some directions for future research.
TABLE 7
Pseudocode for IF-PCA (for a given threshold t > 0)

Input: data matrix X, number of classes K, threshold t > 0. Output: class label vector ŷ_t^{IF}.
1. Rank features: Let ψ_{n,j}, 1 ≤ j ≤ p, be the KS-scores as in (1.6).
2. Post-selection PCA: Define the post-selection data matrix W^{(t)} (i.e., the sub-matrix of W consisting of all columns j with ψ_{n,j} > t). Let U ∈ R^{n,K−1} be the matrix of the first (K − 1) left singular vectors of W^{(t)}. Cluster by ŷ_t^{IF} = kmeans(U, K).
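A minimal Python sketch of Table 7 (our own illustration, assuming the KS-scores psi have already been computed and W is the n × p normalized data matrix):

import numpy as np
from sklearn.cluster import KMeans

def if_pca(W, psi, K, t, seed=0):
    """IF-PCA for a given threshold t (Table 7): W is the n x p normalized data
    matrix and psi holds the p KS-scores as in (1.6)."""
    W_t = W[:, psi > t]                                   # post-selection data matrix W^(t)
    U, _, _ = np.linalg.svd(W_t, full_matrices=False)     # left singular vectors of W^(t)
    U = U[:, :K - 1]                                      # keep the first K-1 of them
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(U)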
Secondary theorems and lemmas are proved in the supplementary material of the paper. In this paper, L_p denotes a generic multi-log(p) term (see Section 2.3). For a vector ξ, ‖ξ‖ denotes the ℓ²-norm. For a real matrix A, ‖A‖ denotes the matrix spectral norm, ‖A‖_F denotes the matrix Frobenius norm, and s_min(A) denotes the smallest non-zero singular value.
2. Main results. Section 2.1 introduces our asymptotic model, Section 2.2
discusses the main regularity conditions and related notation. Section 2.3 presents
the main theorem and Section 2.4 presents two corollaries, together with a phase
transition phenomenon. Section 2.5 discusses the tail probability of the KS statis-
tic, which is the key for the IF step. Section 2.6 studies post-selection eigen-
analysis which is the key for the PCA step. The main theorems and corollaries
are proved in Section 2.7.
To be utterly clear, the IF-PCA procedure we study in this section is the one
presented in Table 7, where the threshold t > 0 is given.
2.1. The asymptotic clustering model. The model we consider is (1.1), (1.2),
(1.3) and (1.5), where the data matrix is X = [X_1, X_2, ..., X_n]′, with X_i ∼ N(μ̄ + μ_k, Σ) if and only if i ∈ Class k, 1 ≤ k ≤ K, and Σ = diag(σ_1², σ_2², ..., σ_p²); K is the number of classes, μ̄ is the overall mean vector, and μ_1, μ_2, ..., μ_K are contrast mean vectors which satisfy (1.5).
We use p as the driving asymptotic parameter, and let other parameters be tied to p through fixed parameters. Fixing θ ∈ (0, 1), we let
(2.1)  n = n_p = p^θ,
so that as p → ∞, p ≫ n ≫ 1.⁸ Let M ∈ R^{K,p} be the matrix
(2.2)  M = [m_1, m_2, ..., m_K]′,  where m_k = Σ^{−1/2}μ_k.
Denote the set of useful features by
(2.3)  S_p = S_p(M) = {1 ≤ j ≤ p : m_k(j) ≠ 0 for some 1 ≤ k ≤ K},
and let s_p = s_p(M) = |S_p(M)| be the number of useful features. Fixing ϑ ∈ (0, 1), we let
(2.4)  s_p = p^{1−ϑ}.
Throughout this paper, the number of classes K is fixed, as p changes.
DEFINITION 2.1. We call model (1.1), (1.2), (1.3) and (1.5) the Asymptotic
Clustering Model if (2.1) and (2.4) hold and denote it by ACM(ϑ, θ ).
(2.10)  τ(j) = τ(j; M, p, n) = (6√(2π))^{−1} · √n · |Σ_{k=1}^K δ_k m_k³(j)|.
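For illustration, consider K = 2. By (1.4) and (2.2), δ_1 m_1(j) + δ_2 m_2(j) = 0, so m_2(j) = −(δ_1/δ_2) m_1(j) and

Σ_{k=1}^2 δ_k m_k³(j) = δ_1 m_1³(j) − δ_2 (δ_1/δ_2)³ m_1³(j) = δ_1 (1 − δ_1²/δ_2²) m_1³(j),

which is non-zero in the asymmetric case δ_1 ≠ δ_2 but vanishes when δ_1 = δ_2 = 1/2; the symmetric case therefore calls for the fourth-moment quantity ω(j) discussed in the Remark of Section 2.5.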
Note that κ(j ) and τ (j ) are related to the weighted second and third moments
of the j th column of M, respectively; τ and κ play a key role in the success of
feature selection and post-selection PCA, respectively. In the case that τ (j )’s are
all small, the success of our method relies on higher moments of the columns of
M; see Section 2.5 for more discussion. Introduce
ε(M) = max_{1≤k≤K, j∈S_p(M)} |m_k(j)|,    τ_min = min_{j∈S_p(M)} τ(j).
We are primarily interested in the range where the feature strengths are rare and weak, so we assume that as p → ∞,
(2.11)  ε(M) → 0.⁹
In Section 2.5, we shall see that τ(j) can be viewed as the Signal-to-Noise Ratio (SNR) associated with the jth feature and τ_min is the minimum SNR of all useful features. The most interesting range for τ(j) is τ(j) ≥ O(√(log(p))). In fact, if the τ(j)'s are of a much smaller order, then the useful features and the useless features are merely inseparable. In light of this, we fix a constant r > 0 and assume
(2.12)  τ_min ≥ a_0 · √(2r log(p)),  where a_0 = √((π − 2)/(4π)).¹⁰
By the way τ(j) is defined, the interesting range for non-zero m_k(j) is |m_k(j)| ≥ O((log(p)/n)^{1/6}). We also need some technical conditions, which can be largely relaxed with more complicated analysis:¹¹
(2.13)  max_{j∈S_p(M)} { (√n/τ(j)) Σ_{k=1}^K δ_k m_k⁴(j) } ≤ Cp^{−δ},
        min_{(j,k): m_k(j)≠0} |m_k(j)| ≥ C (log(p)/n)^{1/2},
for some δ > 0. As the most interesting range of |m_k(j)| is O((log(p)/n)^{1/6}), these conditions are mild.
9 This condition is used in the post-selection eigen-analysis. Recall that W^{(t)} is the shorthand notation for the post-selection normalized data matrix associated with threshold t. As W^{(t)} is the sum of a low-rank matrix and a noise matrix, (W^{(t)})(W^{(t)})′ equals the sum of four terms, two of which are "cross terms." In the eigen-analysis of (W^{(t)})(W^{(t)})′, we need condition (2.11) to control the cross terms.
10 Throughout this paper, a_0 denotes the constant √((π − 2)/(4π)). The constant comes from the analysis of the tail behavior of the KS statistic; see Theorems 2.3–2.4.
11 Condition (2.13) is only needed for Theorem 2.4 on the tail behavior of the KS statistic associated with a useful feature. The conditions ensure that singular cases do not happen, so that the weighted third moment [captured by τ(j)] is the leading term in the Taylor expansion. For more discussion, see the remark in Section 2.5.
Similarly, for the threshold t in (1.8) that we use for the KS-scores, the interesting range is t = O(√(log(p))). In light of this, we are primarily interested in thresholds of the form
(2.14)  t_p(q) = a_0 · √(2q log(p)),  where q > 0 is a constant.
We now define a quantity err_p, which controls the clustering error rate of IF-PCA in our main results. Define
ρ_1(L, M) = ρ_1(L, M; p, n) = s_p‖κ‖_∞² / ‖κ‖².
Introduce two K × K matrices A and Ω (where A is diagonal) by
A(k, k) = √δ_k ‖m_k‖,    Ω(k, ℓ) = m_k′m_ℓ/(‖m_k‖ · ‖m_ℓ‖),  1 ≤ k, ℓ ≤ K;
recall that Ω is "nearly" the identity matrix. Note that ‖AΩA‖ ≤ ‖κ‖², and that when m_1, ..., m_K have comparable magnitudes, all the eigenvalues of AΩA have the same magnitude. In light of this, let s_min(AΩA) be the minimum singular value of AΩA and introduce the ratio
ρ_2(L, M) = ‖κ‖²/s_min(AΩA).
Define
err_p = ρ_2(L, M) [ √(1 + p^{1−(ϑ∧q)}/n)/‖κ‖ + p^{−[(√r−√q)_+]²/(2K)} + √( p^{(ϑ−q)_+}/n + p^{ϑ−1} ) · ρ_1(L, M) ].
This quantity errp combines the “bias” term associated with the useful features
that we have missed in feature selection and the “variance” term associated with
retained features; see Lemmas 2.2 and 2.3 for details. Throughout this paper, we
assume that there is a constant C > 0 such that
(2.15) errp ≤ p −C .
For any n × p matrix W, let W^{Ŝ_p(t)} be the matrix formed by replacing all columns of W with index j ∉ Ŝ_p(t) by the vector of zeros (note the slight difference compared with W^{(t)} in Section 1.1). Denote by Û^{(t)} the n × (K − 1) matrix of the first (K − 1) left singular vectors of W^{Ŝ_p(t)}.
DEFINITION 2.2. L_p > 0 denotes a multi-log(p) term that may vary from
occurrence to occurrence but satisfies Lp p−δ → 0 and Lp pδ → ∞, ∀δ > 0.
THEOREM 2.1. Fix (ϑ, θ) ∈ (0, 1)², and consider ACM(ϑ, θ). Suppose the regularity conditions (2.8), (2.11), (2.12), (2.13) and (2.15) hold, and the threshold in IF-PCA is set as t = t_p(q) as in (2.14). Then there is a matrix H in H_{K−1} such that as p → ∞, with probability at least 1 − o(p^{−2}), ‖Û^{(t_p(q))} − UH‖_F ≤ L_p err_p.
Recall that in IF-PCA, once Û^{(t_p(q))} is obtained, we estimate the class labels by truncating Û^{(t_p(q))} entry-wise (see the PCA-1 step and the footnote in Section 1.1) and then cluster by applying the classical k-means. Also, the estimated class labels are denoted by ŷ_{t_p(q)}^{IF} = (ŷ_{t_p(q),1}^{IF}, ŷ_{t_p(q),2}^{IF}, ..., ŷ_{t_p(q),n}^{IF})′. We measure the clustering errors by the Hamming distance
Hamm_p^*(ŷ_{t_p(q)}^{IF}, y) = min_π Σ_{i=1}^n P( ŷ_{t_p(q),i}^{IF} ≠ π(y_i) ),
where the minimum is over all permutations π of {1, 2, ..., K}. The following theorem is our main result, which gives an upper bound for the Hamming errors of IF-PCA.
THEOREM 2.2. Fix (ϑ, θ) ∈ (0, 1)², and consider ACM(ϑ, θ). Suppose the regularity conditions (2.8), (2.11), (2.12), (2.13) and (2.15) hold, and let t_p = t_p(q) as in (2.14) and T_p = √(log(p)/n) in IF-PCA. As p → ∞,
n^{−1} Hamm_p^*(ŷ_{t_p(q)}^{IF}, y) ≤ L_p err_p.
The theorem can be proved by Theorem 2.1 and an adaptation of Jin (2015), Theorem 2.2. In fact, by Lemma 2.1 below, the absolute values of all entries of U are bounded from above by C/√n. By the choice of T_p and the definitions, the truncated matrix Û_*^{(t_p(q))} satisfies ‖Û_*^{(t_p(q))} − UH‖_F ≤ ‖Û^{(t_p(q))} − UH‖_F. Using this and Theorem 2.1, the proof of Theorem 2.2 is basically an exercise in the classical theory of the k-means algorithm. For this reason, we skip the proof.
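In the numerical studies, the reported error rates are the empirical counterpart of the Hamming distance above, that is, the smallest fraction of mismatched labels over all relabelings of the estimated classes. A minimal sketch (ours, assuming class labels coded 1, ..., K):

import numpy as np
from itertools import permutations

def clustering_error(y_hat, y, K):
    """Smallest fraction of misclustered samples over all relabelings of {1, ..., K}."""
    y_hat, y = np.asarray(y_hat), np.asarray(y)
    best = len(y)
    for perm in permutations(range(1, K + 1)):
        relabeled = np.array([perm[k - 1] for k in y])   # apply the permutation to the true labels
        best = min(best, int(np.sum(y_hat != relabeled)))
    return best / len(y)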
2.4. Two corollaries and a phase transition phenomenon. Corollary 2.1 can
be viewed as a simplified version of Theorem 2.1, so we omit the proof; recall that
Lp denotes a generic multi-log(p) term.
(c) If (1 − ϑ) > 1 − 2θ/3, then for any r > 0, by taking q = 0, min_{H∈H_{K−1}} ‖Û^{(t_p(q))} − UH‖_F → 0 with probability at least 1 − o(p^{−2}).
(2.17)  u_0 ≤ |μ_k(j)| ≤ Cu_0  for any (k, j) such that μ_k(j) ≠ 0.
In our parameterization, s_p = p^{1−ϑ}, n = p^θ and u_0 ≍ τ_min^{1/3}/n^{1/6} ≍ (log(p)/n)^{1/6} since K = 2. Cases (a)–(c) in Corollary 2.2 translate to (a) 1 ≪ s_p ≪ n^{1/3}, (b) n^{1/3} ≪ s_p ≪ p/n^{2/3} and (c) s_p ≫ p/n^{2/3}, respectively.
The primary interest in this paper is Case (b). In this case, Corollary 2.2 says
that both feature selection and post-selection PCA can be successful, provided that
u_0 = c_0(log(p)/n)^{1/6} for an appropriately large constant c_0. Case (a) addresses the case of very sparse signals, and Corollary 2.2 says that we need signals stronger than u_0 ≍ (log(p)/n)^{1/6} for IF-PCA to be successful. Case (c) addresses
the case where signals are relatively dense, and PCA is successful without feature
selection (i.e., taking q = 0).
We have been focused on the case u0 = Lp n−1/6 as our primary interest is on
clustering by IF-PCA. For a more complete picture, we model u0 by u0 = Lp p−α ;
we let the exponent α vary and investigate what is the critical order for u0 for some
different problems and different methods. In this case, it is seen that u0 ∼ n−1/6
is the critical order for the success of feature selection (see Section 2.5), u0√∼
√
p/(ns) is the critical order for the success of Classical PCA and u0 ∼ 1/ s
is the critical order for IF-PCA in an idealized situation where the Screen step
finds exactly all the useful features. These suggest an interesting phase transition
phenomenon for IF-PCA.
• Feature selection is trivial but clustering is impossible. 1 ≪ s ≪ n^{1/3} and n^{−1/6} ≪ u_0 ≤ 1/√s. Individually, useful features are sufficiently strong, so it is trivial to recover the support of MΣ^{1/2} (say, by thresholding the KS-scores one by one); note that MΣ^{1/2} = [μ_1, μ_2, ..., μ_K]′. However, useful features are so sparse that it is impossible for any method to have consistent clustering.
• Clustering and feature selection are possible but non-trivial. n^{1/3} ≪ s ≪ p/n^{2/3} and u_0 = (r log(p)/n)^{1/6}, where r is a constant. In this range, feature selection is indispensable, and there is a region where IF-PCA may yield consistent clustering but Classical PCA may not. A similar conclusion can be drawn if the purpose is to recover the support of MΣ^{1/2} by thresholding the KS-scores.
• Clustering is trivial but feature selection is impossible. s ≫ p/n^{2/3} and √(p/(ns)) ≤ u_0 ≪ n^{−1/6}. In this range, the sparsity level is low and Classical PCA is able to yield consistent clustering, but the useful features are individually so weak that it is impossible to fully recover the support of MΣ^{1/2} by using all p different KS-scores.
In Jin, Ke and Wang (2015b), we investigate the phase transition with much more
refined studies (in a slightly different setting).
Theorems 2.3–2.4 are proved in the supplementary material Jin and Wang (2016). Combining the two theorems, roughly speaking, we have that:
• if j is a useless feature, then the right tail of ψ_{n,j} behaves like that of N(0, a_0²),
• if j is a useful feature, then the left tail of ψ_{n,j} is bounded by that of N(τ(j), Ka_0²).
These suggest that feature selection using the KS statistic in the current setting is very similar to feature selection in Stein's normal means model; the latter is more or less well understood [e.g., Abramovich et al. (2006)]. As a result, the most interesting range for τ(j) is τ(j) ≥ O(√(log(p))). If we threshold the KS-scores at t_p(q) = a_0√(2q log(p)), by a similar argument as in feature selection with a Stein's normal means setting, we expect that:
• All useful features are retained, except for a fraction ≤ Cp^{−[(√r−√q)_+]²/K},
REMARK. Theorem 2.4 hinges on τ(j), which is a quantity proportional to the "third moment" Σ_{k=1}^K δ_k m_k³(j) and can be viewed as the "effective signal strength" of the KS statistic. In the symmetric case (say, K = 2 and δ_1 = δ_2 = 1/2), the third moment (which equals 0) is no longer the right quantity for calibrating the effective signal strength of the KS statistic, and we must use the fourth moment. In such cases, for 1 ≤ j ≤ p, let
ω(j) = √n · sup_{−∞<y<∞} | (1/8) y(1 − 3y²)φ(y) · (Σ_{k=1}^K δ_k m_k²(j))² + (1/24) φ^{(3)}(y) · Σ_{k=1}^K δ_k m_k⁴(j) |,
where φ^{(3)}(y) is the third derivative of the standard normal density φ(y). Theorem 2.4 continues to hold provided that (a) τ(j) is replaced by ω(j), (b) the condition (2.12) of τ_min ≥ a_0√(2r log(p)) is replaced by that of ω_min ≥ a_0√(2r log(p)), where ω_min = min_{j∈S_p(M)} {ω(j)}, and (c) the first part of condition (2.13), max_{j∈S_p(M)} {(√n/τ(j)) Σ_{k=1}^K δ_k m_k⁴(j)} ≤ Cp^{−δ}, is replaced by that of max_{j∈S_p(M)} {(√n/ω(j)) Σ_{k=1}^K δ_k |m_k(j)|⁵} ≤ Cp^{−δ}. This is consistent with Arias-Castro and Verzelen (2014), which studies the clustering problem in a similar setting (especially the symmetric case) in great detail.
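The two effective-signal-strength quantities are easy to evaluate numerically. The sketch below (ours) computes τ(j) from (2.10) and approximates the sup over y in ω(j) by a grid search, taking the displayed form of ω(j) at face value.

import numpy as np
from scipy.stats import norm

def tau(m_col, delta, n):
    """tau(j) of (2.10): (6*sqrt(2*pi))^(-1) * sqrt(n) * |sum_k delta_k m_k(j)^3|."""
    return np.sqrt(n) * abs(np.sum(delta * m_col ** 3)) / (6.0 * np.sqrt(2.0 * np.pi))

def omega(m_col, delta, n, grid=np.linspace(-10.0, 10.0, 20001)):
    """omega(j) as displayed above, with the sup over y replaced by a grid search.
    Uses phi'''(y) = (3y - y^3) phi(y) for the standard normal density phi."""
    phi = norm.pdf(grid)
    phi3 = (3.0 * grid - grid ** 3) * phi
    s2 = np.sum(delta * m_col ** 2)
    s4 = np.sum(delta * m_col ** 4)
    f = grid * (1.0 - 3.0 * grid ** 2) * phi * s2 ** 2 / 8.0 + phi3 * s4 / 24.0
    return np.sqrt(n) * np.max(np.abs(f))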
In the literature, tight bounds of this kind are only available for the case where
Xi are i.i.d. samples from a known distribution (especially, parameters—if any—
are known). In this case, the bound is derived by Kolmogorov (1933); also see
Shorack and Wellner (1986). The setting considered here is more complicated,
and how to derive tight bounds is an interesting but rather challenging prob-
lem. The main difficulty lies in that any estimates of the unknown parameters (μ̄(j), μ_1(j), ..., μ_K(j), σ(j)) have stochastic fluctuations of the same order as the stochastic fluctuation of the empirical CDF, and the two types of fluctuations are correlated in a complicated way, so it is hard to derive the right constant a_0 in the exponent. There are two existing approaches: one is due to Durbin
2.6. Post-selection eigen-analysis. For the PCA step, as in Section 2.3, we let W^{Ŝ_p(t_p(q))} be the n × p matrix where the jth column is the same as that of W if j ∈ Ŝ_p(t_p(q)) and is the zero vector otherwise. With such notation,
(2.18)  W^{Ŝ_p(t_p(q))} = LM + L(M^{Ŝ_p(t_p(q))} − M) + (ZΣ^{−1/2} + R)^{Ŝ_p(t_p(q))}.
We analyze the three terms on the right-hand side separately.
Consider the first term LM. Recall that L ∈ R^{n,K} with the ith row being e_k if and only if i ∈ Class k, 1 ≤ i ≤ n, 1 ≤ k ≤ K, and M ∈ R^{K,p} with the kth row being m_k′ = (Σ^{−1/2}μ_k)′, 1 ≤ k ≤ K. Also, recall that A = diag(√δ_1‖m_1‖, ..., √δ_K‖m_K‖) and Ω ∈ R^{K,K} with Ω(k, ℓ) = m_k′m_ℓ/(‖m_k‖ · ‖m_ℓ‖), 1 ≤ k, ℓ ≤ K. Note that rank(AΩA) = rank(LM) = K − 1. Assume all non-zero eigenvalues of AΩA are simple, and denote them by λ_1 > λ_2 > ... > λ_{K−1}. Write
(2.19)  AΩA = Q · diag(λ_1, λ_2, ..., λ_{K−1}) · Q′,  Q ∈ R^{K,K−1},
where the kth column of Q is the kth eigenvector of AΩA, and let
(2.20)  LM = UDV′
be an SVD of LM. Introduce
(2.21)  G = diag(δ_1, δ_2, ..., δ_K) ∈ R^{K,K}.
The following lemma is proved in the supplementary material [Jin and Wang
(2016), Appendix C].
• The matrix U has K distinct rows, according to which the rows of U partition into K different groups. This partition coincides with the partition of the n samples into K different classes. Also, the ℓ²-norm between each pair of the K distinct rows is no less than √(2/n).
Consider the second term on the right-hand side of (2.18). This is the “bias”
term caused by useful features which we may fail to select.
Consider the last term on the right-hand side of (2.18). This is the “variance”
term consisting of two parts, the part from original measurement noise matrix Z
and the remainder term due to normalization.
Combine (2.23) with Lemma 2.4 and note that Û^{(t_p(q))}(Û^{(t_p(q))})′ − UU′ has rank 2K or smaller. It follows that there is an H ∈ H_{K−1} such that
(2.24)  ‖Û^{(t_p(q))} − UH‖_F ≤ 2√(2K) · s_min^{−1}(T) · ‖T̂ − T‖.
First, ‖T̂ − T‖ ≤ 2‖LM‖ · ‖W^{Ŝ_p(t_p(q))} − LM‖ + ‖W^{Ŝ_p(t_p(q))} − LM‖². From Lemmas 2.2–2.3 and (2.15), ‖LM‖ ≫ ‖W^{Ŝ_p(t_p(q))} − LM‖. Therefore,
‖T̂ − T‖ ≲ 2‖LM‖ · ‖W^{Ŝ_p(t_p(q))} − LM‖ ≤ 2√n‖κ‖ · ‖W^{Ŝ_p(t_p(q))} − LM‖.
Second, by Lemma 2.1,
s_min(T) = n · s_min(AΩA) = n‖κ‖²/ρ_2(L, M).
Plugging these results into (2.24), we find that
(2.25)  ‖Û^{(t_p(q))} − UH‖_F ≤ 4√(2K) · (ρ_2(L, M)/(√n‖κ‖)) · ‖W^{Ŝ_p(t_p(q))} − LM‖,
where by Lemmas 2.2–2.3, the right-hand side equals L_p err_p. The claim then follows by combining (2.25) and (2.22).
Consider Corollary 2.2. For each j ∈ S_p(M), it can be deduced that κ(j) ≥ ε(M), using especially (2.11). Therefore, ‖κ‖ ≥ L_p p^{(1−ϑ)/2} n^{−1/6} = L_p p^{(1−ϑ)/2−θ/6}. The error bound in Corollary 2.1 reduces to
(2.26)  L_p p^{−[(√r−√q)_+]²/(2K)} + L_p · { p^{−θ/3+(ϑ−q)_+/2}, θ < 1 − ϑ;  p^{θ/6−(1−ϑ)/2+(1−θ−q)_+/2}, θ ≥ 1 − ϑ }.
Note that (2.26) is lower bounded by L_p p^{θ/6−(1−ϑ)/2} for any q ≥ 0, and it is upper bounded by L_p p^{−θ/3+ϑ/2} when taking q = 0. The first and third claims then follow immediately. Below, we show the second claim.
First, consider the case θ < 1 − ϑ. If r > ϑ, we can take any q ∈ (ϑ, r) and the error bound is o(1). If r ≤ ϑ, noting that (ϑ − r)/2 < θ/3, there exists q < r such that (ϑ − q)/2 < θ/3, and the corresponding error bound is o(1).
In particular, if r > (√(2Kθ/3) + √ϑ)², we have (√r − √ϑ)²/(2K) > θ/3; then for q ≥ ϑ, the error bound is L_p p^{−θ/3} + L_p p^{−(√r−√q)²/(2K)}; for q < ϑ, the error bound is L_p p^{−θ/3+(ϑ−q)/2}; so the optimal q* = ϑ and the corresponding error bound is L_p p^{−θ/3} = L_p n^{−1/3}.
Next, consider the case 1−ϑ ≤ θ < 3(1−ϑ). If r > 1−θ , for any q ∈ (1−θ, r),
the error bound is o(1); note that θ/6 < (1−ϑ)/2. If r ≤ 1−θ , noting that (1−θ −
r)/2 < (1−ϑ)/2−θ/6, there is a q < r such that (1−θ −q)/2 < (1−ϑ)/2−θ/6,
and the corresponding error bound is o(1).
TABLE 8
Pseudocode for IF-PCA(1) (for simulations; threshold set by Higher Criticism)

Input: data matrix X, number of classes K. Output: class label vector ŷ_{HC}^{IF}.
1. Rank features: Let ψ_{n,j} be the KS-scores as in (1.6), and F_0 be the CDF of ψ_{n,j} under the null, 1 ≤ j ≤ p.
2. Threshold choice by HCT: Calculate P-values by π_j = 1 − F_0(ψ_{n,j}), 1 ≤ j ≤ p, and sort them by π_(1) < π_(2) < ... < π_(p). Define HC_{p,j} = √p (j/p − π_(j)) / √( j/p + √n · max{j/p − π_(j), 0} ), and let ĵ = argmax_{j: π_(j) > log(p)/p, j < p/2} {HC_{p,j}}. The HC threshold t_p^{HC} is the ĵth largest KS-score.
3. Post-selection PCA: Define the post-selection data matrix W^{(HC)} (i.e., the sub-matrix of W consisting of all columns j of W with ψ_{n,j} > t_p^{HC}). Let U ∈ R^{n,K−1} be the matrix of the first (K − 1) left singular vectors of W^{(HC)}. Cluster by ŷ_{HC}^{IF} = kmeans(U, K).
In particular, if r > (√(K(1 − ϑ) − Kθ/3) + √(1 − θ))², we have that (√r − √(1 − θ))²/(2K) > (1 − ϑ)/2 − θ/6; then for q ≥ 1 − θ, the error bound is L_p p^{θ/6−(1−ϑ)/2} + L_p p^{−(√r−√q)²/(2K)}; for q < 1 − θ, the error bound is
12 For each parameter setting, we generate the data matrix X rep times, and each time we apply all six algorithms. The clustering errors are averaged over all the repetitions.
• Generate the class labels y_1, y_2, ..., y_n i.i.d. from MN(K, δ),¹³ and let L be the n × K matrix such that the ith row of L equals e_k if and only if y_i = k, 1 ≤ k ≤ K.
• Generate the overall mean vector μ̄ by μ̄(j) ~ g_μ̄, i.i.d. for 1 ≤ j ≤ p.
• Generate the contrast mean vectors μ_1, ..., μ_K as follows. First, generate b_1, b_2, ..., b_p i.i.d. from Bernoulli(ε_p). Second, for each j such that b_j = 1, generate the i.i.d. signs {β_k(j)}_{k=1}^{K−1} such that β_k(j) = −1, 0, 1 with probability γ_1, γ_2, γ_3, respectively, and generate the feature magnitudes {h_k(j)}_{k=1}^{K−1} i.i.d. from g_μ. Last, for 1 ≤ k ≤ K − 1, set μ_k by [the factor 72π is chosen to be consistent with (2.10)]
μ_k(j) = (72π · 2r log(p) · n^{−1})^{1/6} · h_k(j) · b_j · β_k(j),
and let μ_K = −(1/δ_K) Σ_{k=1}^{K−1} δ_k μ_k.
• Generate the noise matrix Z as follows. First, generate a p × 1 vector σ by σ(j) ~ g_σ, i.i.d. Second, generate the n rows of Z i.i.d. from N(0, Σ), where Σ = diag(σ²(1), σ²(2), ..., σ²(p)).
• Let X = 1μ̄′ + L[μ_1, ..., μ_K]′ + Z.
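As an illustration of the recipe above, here is a minimal simulation sketch (ours), in which g_μ̄ = N(0, 1) and both g_μ and g_σ are taken to be point masses at 1 for brevity:

import numpy as np

def generate_acm_data(n, p, K, delta, eps_p, r, gamma, rng=None):
    """Sketch of the simulation recipe above, with g_mubar = N(0,1) and both
    g_mu and g_sigma taken as point masses at 1 (so h_k(j) = 1 and sigma(j) = 1)."""
    rng = np.random.default_rng() if rng is None else rng
    delta = np.asarray(delta, dtype=float)
    y = rng.choice(np.arange(1, K + 1), size=n, p=delta)       # class labels from MN(K, delta)
    L = np.eye(K)[y - 1]                                       # n x K membership matrix
    mubar = rng.standard_normal(p)                             # overall mean vector
    b = rng.binomial(1, eps_p, size=p)                         # useful-feature indicators
    beta = rng.choice([-1, 0, 1], size=(K - 1, p), p=gamma)    # signs of the useful features
    amp = (72 * np.pi * 2 * r * np.log(p) / n) ** (1 / 6)      # factor matching (2.10)
    mu = amp * b * beta                                        # first K-1 contrast mean vectors
    mu_K = -(delta[:K - 1] @ mu) / delta[K - 1]                # enforce (1.4)
    M = np.vstack([mu, mu_K])                                  # K x p matrix of contrast means
    Z = rng.standard_normal((n, p))                            # noise with Sigma = I_p
    return np.outer(np.ones(n), mubar) + L @ M + Z, y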
In the simulation settings, r can be viewed as the parameter of (average) signal strength. The density g_σ characterizes noise heteroscedasticity; when g_σ is a point mass at 1, the noise variances of all the features are equal. The density g_μ controls the strengths of useful features; when g_μ is a point mass at 1, all the useful features have the same strength. The signs of useful features are captured in the probability vector γ; when K = 2, we always set γ_2 = 0 so that μ_k(j) ≠ 0 for a useful feature j; when K ≥ 3, for a useful feature j, we allow μ_k(j) = 0 for some k.
For IF-PCA(2), the theoretical threshold choice as in (2.14) is t = √(2q̃ log(p)) for some 0 < q̃ < (π − 2)/(4π) ≈ 0.09. We often set q̃ ∈ {0.03, 0.04, 0.05, 0.06}, depending on the signal strength parameter r.
The simulation study contains five experiments, which we now describe.
FIG. 4. Comparison of clustering error rates [Experiment 1(a)]. x-axis: signal strength parameter
r. y-axis: error rates. Left: δ = (1/3, 2/3). Right: δ = (1/2, 1/2).
In Experiment 1(a), we let the signal strength parameter r ∈ {0.20, 0.35, 0.50,
0.65} for the asymmetric case, and r ∈ {0.06, 0.14, 0.22, 0.30} for the symmetric
case. The results are summarized in Figure 4. We find that two versions of IF-PCA
outperform the other methods in most settings, increasingly so when the signal
strength increases. Moreover, two versions of IF-PCA have similar performance,
with those of IF-PCA(1) being slightly better. This suggests that our threshold
choice by HCT is not only data-driven but also yields satisfactory clustering re-
sults. On the other hand, it also suggests that IF-PCA is relatively insensitive to
different choices of the threshold, as long as they are in a certain range.
In Experiment 1(b), we make a more careful comparison between the asym-
metric and symmetric cases. Note that for the same parameter r, the actual signal
strength in the symmetric case is stronger because of normalization. As a result,
for δ = (1/3, 2/3), we still let r ∈ {0.20, 0.35, 0.50, 0.65}, but for δ = (1/2, 1/2),
we take r = c0 × {0.20, 0.35, 0.50, 0.65}, where c0 is a constant chosen such that
for any r > 0, r and c0 r yield the same value of κ(j ) [see (2.9)] in the asym-
metric and symmetric cases, respectively; we note that κ(j ) can be viewed as the
effective signal-to-noise ratio of the Kolmogorov–Smirnov statistic. The results are
summarized in Table 9. Both versions of IF-PCA have better clustering results
when δ = (1/3, 2/3), suggesting that the clustering task is more difficult in the
symmetric case. This is consistent with the theoretical results; see, for example,
Arias-Castro and Verzelen (2014), Jin, Ke and Wang (2015b).
TABLE 9
Comparison of average clustering error rates (Experiment 1). Numbers in the brackets are the standard deviations of the error rates
In Experiment 2(a), we let ϑ range in {0.68, 0.72, 0.76, 0.80}. Since the number
of useful features is roughly p 1−ϑ , a larger ϑ corresponds to a higher sparsity level.
For any u and a, b > 0, let TN(u, b², a) be the conditional distribution of (X | u − a ≤ X ≤ u + a) for X ∼ N(u, b²), where TN stands for "Truncated Normal." We take g_μ̄ as N(0, 1), g_μ as TN(1, 0.1², 0.2), and g_σ as TN(1, 0.1², 0.1). The
results are summarized in the left panel of Figure 5, where for all sparsity levels,
two versions of IF-PCA have similar performance and each of them significantly
outperforms the other methods.
In Experiment 2(b), we use the same setting except that gμ is TN(1, 0.1, 0.7)
and gσ is the point mass at 1. Note that in Experiment 2(a), the support of gμ is
(0.8, 1.2), and in the current setting, the support is (0.3, 1.7) which is wider. As a
result, the strengths of useful features in the current setting have more variability.
At the same time, we force the noise variance of all features to be 1, for a fair
comparison. The results are summarized in the right panel of Figure 5. They are
similar to those in Experiment 2(a), suggesting that IF-PCA continues to work well
even when the feature strengths are unequal.
FIG. 5. Comparison of average clustering error rates (Experiment 2). x-axis: sparsity parameter ϑ. y-axis: error rates. Left: g_μ is TN(1, 0.1², 0.2) and g_σ is TN(1, 0.1², 0.1). Right: g_μ is TN(1, 0.1, 0.7) and g_σ is a point mass at 1.
TABLE 10
Comparison of average clustering error rates (Experiment 3). Numbers in the brackets are the
standard deviations of the error rates
IF-PCA(1) HCT (stochastic) 0.053 (0.08) 0.157 (0.16) 0.337 (0.14) 0.433 (0.10)
IF-PCA(2) 0.03 0.038 (0.05) 0.152 (0.12) 0.345 (0.13) 0.449 (0.06)
0.04 0.045 (0.08) 0.122 (0.12) 0.312 (0.15) 0.427 (0.09)
0.05 0.068 (0.12) 0.154 (0.15) 0.303 (0.16) 0.413 (0.12)
0.06 0.118 (0.15) 0.237 (0.17) 0.339 (0.16) 0.423 (0.10)
FIG. 6. Comparison of average clustering error rates for Experiment 4 (left panel) and Experiment
5 (right panel). y-axis: error rates.
Background (CMB) or can be used for detecting rare and weak signals or small
cliques in large graphs [e.g., Donoho and Jin (2015)].
The KS statistic can also be viewed as a marginal screening procedure. Screen-
ing is a well-known approach in high dimensional analysis. For example, in vari-
able selection, we use marginal screening for dimension reduction [Fan and Lv
(2008)], and in cancer classification, we use screening to adapt Fisher’s LDA and
QDA to modern settings [Donoho and Jin (2008), Efron (2009), Fan et al. (2015)].
However, the setting here is very different.
Of course, another important reason that we choose to use the KS-based
marginal screening in IF-PCA is for simplicity and practical feasibility: with such
a screening method, we are able to (a) use Efron’s proposal of empirical null to cor-
rect the null distribution, and (b) set the threshold by Higher Criticism; (a)–(b) are
especially important as we wish to have a tuning-free and yet effective procedure
for subject clustering with gene microarray data. In more complicated situations, it
is possible that marginal screening is sub-optimal, and it is desirable to use a more
sophisticated screening method. We mention two possibilities below.
In the first possibility, we might use the recent approaches by Birnbaum et al.
(2013), Paul and Johnstone (2012), where the primary interest is signal recov-
ery or feature estimation. The point here is that, while the two problems—subject
clustering and feature estimation—are very different, we still hope that a better
feature estimation method may improve the results of subject clustering. In these
papers, the authors proposed Augmented sparse PCA (ASPCA) as a new approach
to feature estimation and showed that under certain sparse settings, ASPCA may
have advantages over marginal screening methods, and that ASPCA is asymp-
totically minimax. This suggests an alternative to IF-PCA, where in the IF step,
we replace the marginal KS screening by some augmented feature screening ap-
proaches. However, the open question is how to develop such an approach that is tuning-free and practically feasible. We leave this to future work.
Another possibility is to combine the KS statistic with the recent innovation
of Graphlet Screening [Jin, Zhang and Zhang (2014), Ke, Jin and Fan (2014)]
in variable selection. This is particularly appropriate if the columns of the noise
matrix Z are correlated, where it is desirable to exploit the graphic structures of
the correlations to improve the screening efficiency. Graphlet Screening is a graph-guided multivariate screening procedure and has advantages over the better-known methods of marginal screening and the lasso. At the heart of Graphlet Screening is
a graph, which in our setting is defined as follows: each feature j , 1 ≤ j ≤ p, is a
node, and there is an edge between nodes i and j if and only if row i and row j of
the normalized data matrix W are strongly correlated (note that for a useful feature,
the means of the corresponding row of W are non-zero; in our range of interest,
these non-zero means are at the order of n−1/6 , and so have negligible effects over
the correlations). In this sense, adapting Graphlet Screening in the screening step helps to handle highly correlated data. We leave this to future work.
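For concreteness, the graph just described can be formed from pairwise feature correlations. Below is a minimal sketch (ours), treating the features as the columns of the n × p normalized data matrix W and using an ad hoc correlation cutoff rho:

import numpy as np

def feature_graph(W, rho=0.5):
    """Adjacency matrix of the feature graph for a Graphlet-Screening-style adaptation:
    nodes are the p features (columns of the n x p normalized data matrix W), and two
    features are joined by an edge when their absolute sample correlation exceeds rho."""
    C = np.corrcoef(W, rowvar=False)          # p x p correlation matrix of the features
    A = (np.abs(C) > rho).astype(int)
    np.fill_diagonal(A, 0)                    # no self-loops
    return A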
The post-selection PCA is a flexible idea that can be adapted to address many
other problems. Take model (1.1) for example. The method can be adapted to address the problem of testing whether LM = 0 or LM ≠ 0 (i.e., whether the data matrix contains a low-rank structure or not), the problem of estimating M, or
the problem of estimating LM. The latter is connected to recent interest on sparse
PCA and low-rank matrix recovery. Intellectually, the PCA approach is connected
to SCORE for community detection on social networks [Jin (2015)], but is very
different.
Threshold choice by HC is a recent innovation, and was first proposed in
Donoho and Jin (2008) [see also Fan, Jin and Yao (2013)] in the context of classi-
fication. However, our focus here is on clustering, and the method and theory we
need are very different from those in Donoho and Jin (2008), Fan, Jin and Yao
(2013). In particular, this paper requires sophisticated post-selection Random Ma-
trix Theory (RMT), which we do not need in Donoho and Jin (2008), Fan, Jin
and Yao (2013). Our study on RMT is connected to Baik and Silverstein (2006),
Guionnet and Zeitouni (2000), Johnstone (2001), Lee, Zou and Wright (2010),
Paul (2007) but is very different.
In a high level, IF-PCA is connected to the approaches by Azizyan, Singh and
Wasserman (2013), Chan and Hall (2010) in that all three approaches are two-
stage methods that consist of a screening step and a post-selection clustering step.
However, the screening step and the post-selection step in all three approaches are
significantly different from each other. Also, IF-PCA is connected to the spectral
graph partitioning algorithm by Ng, Jordan and Weiss (2002), but it is very differ-
ent, especially in feature selection and threshold choice by HC.
In this paper, we have assumed that the first (K − 1) contrast mean vectors
μ1 , μ2 , . . . , μK−1 are linearly independent (consequently, the rank of the matrix
M [see (2.6)] is (K − 1)), and that K is known (recall that K is the number of
classes). In the gene microarray examples we discuss in this paper, a class is a patient group (normal, cancer, or a cancer sub-type), so K is usually known to us a priori. Moreover, it is believed that different cancer sub-types can be distinguished from each other by one or more genes (though we do not know which), so μ_1, μ_2, ..., μ_{K−1} are linearly independent. Therefore, both assumptions are
reasonable.
On the other hand, in a broader context, either of these two assumptions could
be violated. Fortunately, at least to some extent, the main ideas in this paper can
be extended. We consider two cases. In the first one, we assume K is known but
r = rank(M) < (K − 1). In this case, the main results in this paper continue to
hold, provided that some mild regularity conditions hold. In detail, let U ∈ R^{n,r} be the matrix consisting of the first r left singular vectors of LM as before; it can be shown that, as before, U has K distinct rows. The additional regularity condition we need here is that the ℓ²-norm between any pair of the K distinct rows has a reasonable lower bound. In the second case, we assume K is unknown and has to
be estimated. In the literature, this is a well-known hard problem. To tackle this
problem, one might utilize the recent developments on rank detection [Kritchman and Nadler (2008); see also Birnbaum et al. (2013), Cai, Ma and Wu (2015)], where in a similar setting the authors constructed a confidence lower bound for the number of classes K. A problem of interest is then to investigate how to combine
number of classes K. A problem of interest is then to investigate how to combine
the methods in these papers with IF-PCA to deal with the more challenging case
of unknown K; we leave this for future study.
SUPPLEMENTARY MATERIAL
Supplement to “Influential Features PCA for high dimensional clustering”
(DOI: 10.1214/15-AOS1423SUPP; .pdf). Owing to space constraints, the technical proofs are relegated to a supplementary document, Jin and Wang (2016). It contains
three sections: Appendices A–C.
REFERENCES
ABRAMOVICH, F., BENJAMINI, Y., DONOHO, D. L. and JOHNSTONE, I. M. (2006). Adapting to unknown sparsity by controlling the false discovery rate. Ann. Statist. 34 584–653. MR2281879
AMINI, A. A. and WAINWRIGHT, M. J. (2009). High-dimensional analysis of semidefinite relaxations for sparse principal components. Ann. Statist. 37 2877–2921. MR2541450
ARIAS-CASTRO, E., LERMAN, G. and ZHANG, T. (2013). Spectral clustering based on local PCA. Available at arXiv:1301.2007.
ARIAS-CASTRO, E. and VERZELEN, N. (2014). Detection and feature selection in sparse mixture models. Available at arXiv:1405.1478.
ARTHUR, D. and VASSILVITSKII, S. (2007). k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms 1027–1035. ACM, New York. MR2485254
AZIZYAN, M., SINGH, A. and WASSERMAN, L. (2013). Minimax theory for high-dimensional Gaussian mixtures with sparse mean separation. In Advances in Neural Information Processing Systems 2139–2147. Curran Associates, Red Hook, NY.
BAIK, J. and SILVERSTEIN, J. W. (2006). Eigenvalues of large sample covariance matrices of spiked population models. J. Multivariate Anal. 97 1382–1408. MR2279680
BIRNBAUM, A., JOHNSTONE, I. M., NADLER, B. and PAUL, D. (2013). Minimax bounds for sparse PCA with noisy high-dimensional data. Ann. Statist. 41 1055–1084. MR3113803
CAI, T., MA, Z. and WU, Y. (2015). Optimal estimation and rank detection for sparse spiked covariance matrices. Probab. Theory Related Fields 161 781–815. MR3334281
CHAN, Y. and HALL, P. (2010). Using evidence of mixed populations to select variables for clustering very high-dimensional data. J. Amer. Statist. Assoc. 105 798–809. MR2724862
CHEN, J. and LI, P. (2009). Hypothesis test for normal mixture models: The EM approach. Ann. Statist. 37 2523–2542. MR2543701
DAVIS, C. and KAHAN, W. M. (1970). The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal. 7 1–46. MR0264450
DETTLING, M. (2004). BagBoosting for tumor classification with gene expression data. Bioinformatics 20 3583–3593.