Sparse Principal Component Analysis
Hui Zou, Trevor Hastie, and Robert Tibshirani
Key Words: Arrays; Gene expression; Lasso/elastic net; Multivariate analysis; Singular
value decomposition; Thresholding.
1. INTRODUCTION
Principal component analysis (PCA) (Jolliffe 1986) is a popular data-processing and
dimension-reduction technique, with numerous applications in engineering, biology, and so-
cial science. Some interesting examples include handwritten zip code classification (Hastie,
Tibshirani, and Friedman 2001) and human face recognition (Hancock, Burton, and Bruce
1996). Recently PCA has been used in gene expression data analysis (Alter, Brown, and
Botstein 2000). Hastie et al. (2000) proposed the so-called gene shaving techniques using
PCA to cluster highly variable and coherent genes in microarray datasets.
PCA seeks the linear combinations of the original variables such that the derived vari-
ables capture maximal variance. PCA can be computed via the singular value decomposition
(SVD) of the data matrix. In detail, let the data X be an n × p matrix, where n and p are the
number of observations and the number of variables, respectively. Without loss of generality,
assume the column means of X are all 0. Let the SVD of X be
$$X = UDV^T. \qquad (1.1)$$
Z = UD are the principal components (PCs), and the columns of V are the corresponding loadings of the principal components. The sample variance of the ith PC is $D_{ii}^2/n$. In gene expression data the standardized PCs U are called the eigen-arrays and V are the eigen-genes (Alter, Brown, and Botstein 2000). Usually the first q ($q \ll \min(n, p)$) PCs are chosen to represent the data, thus a great dimensionality reduction is achieved.
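To make the notation concrete, here is a minimal sketch in Python/NumPy (our own illustration; the variable names and simulated data are not from the article) that computes the PCs and their loadings from the SVD of a centered data matrix, as in (1.1).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))        # n = 100 observations, p = 6 variables
X = X - X.mean(axis=0)               # center the columns, as assumed in (1.1)

# Thin SVD: X = U D V^T
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T                             # columns of V are the loadings
Z = U * d                            # principal components Z = U D

# Sample variance of the ith PC is D_ii^2 / n
pc_var = d**2 / X.shape[0]
print(pc_var)
print(np.allclose(Z, X @ V))         # the PCs are the data projected onto the loadings
```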
The success of PCA is due to the following two important optimal properties:
1. principal components sequentially capture the maximum variability among the
columns of X, thus guaranteeing minimal information loss;
2. principal components are uncorrelated, so we can talk about one principal compo-
nent without referring to others.
However, PCA also has an obvious drawback: each PC is a linear combination of all
p variables and the loadings are typically nonzero. This often makes it difficult to interpret
the derived PCs. Rotation techniques are commonly used to help practitioners interpret
principal components (Jolliffe 1995). Vines (2000) considered simple principal components
by restricting the loadings to take values from a small set of allowable integers such as 0,
1, and −1.
We feel it is desirable not only to achieve the dimensionality reduction but also to reduce
the number of explicitly used variables. An ad hoc way to achieve this is to artificially set the
loadings with absolute values smaller than a threshold to zero. This informal thresholding
approach is frequently used in practice, but can be potentially misleading in various respects
(Cadima and Jolliffe 1995). McCabe (1984) presented an alternative to PCA which found a
subset of principal variables. Jolliffe, Trendafilov, and Uddin (2003) introduced SCoTLASS
to get modified principal components with possible zero loadings.
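For illustration only, here is a minimal sketch of the informal thresholding rule mentioned above (our own code, not a procedure recommended by the article): loadings with absolute value below a cutoff are set to zero and the vector is renormalized.

```python
import numpy as np

def threshold_loadings(v, cutoff):
    """Ad hoc 'simple thresholding': zero out small loadings, then renormalize."""
    v_sparse = np.where(np.abs(v) < cutoff, 0.0, v)
    norm = np.linalg.norm(v_sparse)
    return v_sparse / norm if norm > 0 else v_sparse

# Example: a dense loading vector with many small entries
v = np.array([0.70, 0.68, 0.15, -0.12, 0.05, -0.03])
print(threshold_loadings(v, cutoff=0.2))   # only the two large loadings survive
```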
The same interpretation issues arise in multiple linear regression, where the response
is predicted by a linear combination of the predictors. Interpretable models are obtained via
variable selection. The lasso (Tibshirani 1996) is a promising variable selection technique,
simultaneously producing accurate and sparse models. Zou and Hastie (2005) proposed
the elastic net, a generalization of the lasso, which has some advantages. In this article we
introduce a new approach for estimating PCs with sparse loadings, which we call sparse
principal component analysis (SPCA). SPCA is built on the fact that PCA can be written as
a regression-type optimization problem, with a quadratic penalty; the lasso penalty (via the
elastic net) can then be directly integrated into the regression criterion, leading to a modified
PCA with sparse loadings.
In the next section we briefly review the lasso and the elastic net. The methodological
details of SPCA are presented in Section 3. We present an efficient algorithm for fitting
the SPCA model. We also derive an appropriate expression for representing the variance
explained by modified principal components. In Section 4 we consider a special case of the
SPCA algorithm for handling gene expression arrays efficiently. The proposed methodology
is illustrated by using real data and simulation examples in Section 5. Discussions are in
Section 6. The article ends with an Appendix summarizing technical details.
2. THE LASSO AND THE ELASTIC NET
Consider the linear regression model with n observations, predictors $X = [X_1, \ldots, X_p]$, and response Y. The lasso estimates are defined by the penalized least squares criterion
$$\hat\beta_{\text{lasso}} = \arg\min_\beta \Big\|Y - \sum_{j=1}^p X_j\beta_j\Big\|^2 + \lambda\sum_{j=1}^p |\beta_j|, \qquad (2.1)$$
where λ is non-negative. The lasso was originally solved by quadratic programming (Tibshirani 1996). Efron, Hastie, Johnstone, and Tibshirani (2004) showed that the lasso estimates
β̂ are piecewise linear as a function of λ, and proposed an algorithm called LARS to effi-
ciently solve the entire lasso solution path in the same order of computations as a single least
squares fit. The piecewise linearity of the lasso solution path was first proved by Osborne,
Presnell, Turlach (2000) where a different algorithm was proposed to solve the entire lasso
solution path.
The lasso continuously shrinks the coefficients toward zero, and achieves its prediction
accuracy via the bias-variance trade-off. Due to the nature of the $L_1$ penalty, some coefficients
will be shrunk to exactly zero if λ is large enough. Therefore the lasso simultaneously
produces both an accurate and sparse model, which makes it a favorable variable selection
method. However, the lasso has several limitations as pointed out by Zou and Hastie (2005).
The most relevant one to this work is that the number of variables selected by the lasso is
limited by the number of observations. For example, if applied to microarray data where
there are thousands of predictors (genes) (p > 1000) but fewer than 100 samples (n < 100),
the lasso can select at most n genes, which is clearly unsatisfactory.
The elastic net (Zou and Hastie 2005) generalizes the lasso to overcome these draw-
backs, while enjoying its other favorable properties. For any non-negative λ1 and λ2 , the
elastic net estimates β̂en are given as follows
$$\hat\beta_{\text{en}} = (1+\lambda_2)\left\{\arg\min_\beta \Big\|Y - \sum_{j=1}^p X_j\beta_j\Big\|^2 + \lambda_2\sum_{j=1}^p \beta_j^2 + \lambda_1\sum_{j=1}^p |\beta_j|\right\}. \qquad (2.2)$$
The elastic net penalty is a convex combination of the ridge and lasso penalties. Obviously,
the lasso is a special case of the elastic net when λ2 = 0. Given a fixed λ2 , the LARS-
EN algorithm (Zou and Hastie 2005) efficiently solves the elastic net problem for all λ1
with the computational cost of a single least squares fit. When p > n, we choose some
λ2 > 0. Then the elastic net can potentially include all variables in the fitted model, so this
particular limitation of the lasso is removed. Zou and Hastie (2005) compared the elastic
net with the lasso and discussed the application of the elastic net as a gene selection method
in microarray analysis.
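To make criterion (2.2) concrete, the following sketch solves the inner (naive elastic net) minimization by coordinate descent. This is our own illustration, not the LARS-EN algorithm used by the authors; the function names and the simulated data are assumptions.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator S(z, gamma) = sign(z)(|z| - gamma)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def naive_elastic_net(X, y, lam1, lam2, n_iter=200):
    """Coordinate descent for ||y - X b||^2 + lam2 ||b||^2 + lam1 ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]     # partial residual excluding j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam1 / 2.0) / (col_sq[j] + lam2)
    return beta

# The elastic net estimate in (2.2) rescales the naive solution by (1 + lam2)
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=50)
beta_en = (1 + 0.1) * naive_elastic_net(X, y, lam1=5.0, lam2=0.1)
print(np.round(beta_en, 3))
```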
3. MOTIVATION AND DETAILS OF SPCA
Jolliffe, Trendafilov, and Uddin (2003) proposed SCoTLASS, which imposes the lasso constraint directly on PCA: the kth modified loading vector $a_k$ maximizes $a_k^T(X^TX)a_k$ subject to
$$a_k^Ta_k = 1, \quad a_k^Ta_h = 0 \ (h < k), \quad \text{and} \quad \sum_{j=1}^p |a_{kj}| \le t$$
for some tuning parameter t. Although sufficiently small t yields some exact zero loadings,
there is not much guidance with SCoTLASS in choosing an appropriate value for t. One
could try several t values, but the high computational cost of SCoTLASS makes this an im-
practical solution. This high computational cost is probably due to the fact that SCoTLASS
is not a convex optimization problem. Moreover, the examples in Jolliffe, Trendafilov, and
Uddin (2003) showed that the loadings obtained by SCoTLASS are not sparse enough when
one requires a high percentage of explained variance.
We consider a different approach to modifying PCA. We first show how PCA can be
recast exactly in terms of a (ridge) regression problem. We then introduce the lasso penalty
by changing this ridge regression to an elastic-net regression.
Theorem 1. For each i, denote by $Z_i = U_iD_{ii}$ the ith principal component. Consider a positive λ and the ridge estimates $\hat\beta_{\text{ridge}}$ given by
$$\hat\beta_{\text{ridge}} = \arg\min_\beta \|Z_i - X\beta\|^2 + \lambda\|\beta\|^2. \qquad (3.4)$$
Let $\hat v = \hat\beta_{\text{ridge}}/\|\hat\beta_{\text{ridge}}\|$; then $\hat v = V_i$.
The theme of this simple theorem is to show the connection between PCA and a
regression method. Regressing PCs on variables was discussed in Cadima and Jolliffe
(1995), where they focused on approximating PCs by a subset of k variables. We extend it
to a more general case of ridge regression in order to handle all kinds of data, especially
gene expression data. Obviously, when n > p and X is a full rank matrix, the theorem does
not require a positive λ. Note that if p > n and λ = 0, ordinary multiple regression has
no unique solution that is exactly Vi . The same happens when n > p and X is not a full
rank matrix. However, PCA always gives a unique solution in all situations. As shown in
Theorem 1, this indeterminacy is eliminated by the positive ridge penalty ($\lambda\|\beta\|^2$). Note
that after normalization the coefficients are independent of λ, therefore the ridge penalty is
not used to penalize the regression coefficients but to ensure the reconstruction of principal
components.
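A quick numerical check of Theorem 1 (a sketch on simulated data of our own choosing): regress the ith principal component $Z_i = XV_i$ on X with a ridge penalty, normalize the coefficients, and compare with the loading vector $V_i$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 40, 8, 10.0
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
i = 0
Z_i = X @ V[:, i]                      # ith principal component

# Ridge regression of Z_i on X: beta = (X^T X + lam I)^{-1} X^T Z_i
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Z_i)
v_hat = beta / np.linalg.norm(beta)

# Up to sign, the normalized ridge coefficients recover the loading vector V_i
print(np.allclose(np.abs(v_hat), np.abs(V[:, i]), atol=1e-8))
```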
Now let us add the L1 penalty to (3.4) and consider the following optimization problem:
$$\hat\beta = \arg\min_\beta \|Z_i - X\beta\|^2 + \lambda\|\beta\|^2 + \lambda_1\|\beta\|_1, \qquad (3.5)$$
where $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$ is the 1-norm of β. We call $\hat V_i = \hat\beta/\|\hat\beta\|$ an approximation to $V_i$, and $X\hat V_i$ the ith approximated principal component. Zou and Hastie (2005) called (3.5) the naive elastic net, which differs from the elastic net by a scaling factor $(1 + \lambda)$. Since we are using the normalized fitted coefficients, the scaling factor does not affect $\hat V_i$. Clearly, a large enough $\lambda_1$ gives a sparse $\hat\beta$, hence a sparse $\hat V_i$. Given a fixed λ, (3.5) is efficiently solved for all $\lambda_1$ by using the LARS-EN algorithm (Zou and Hastie 2005). Thus, we can flexibly choose a sparse approximation to the ith principal component.
Theorem 2. For any λ > 0, let
$$(\hat\alpha, \hat\beta) = \arg\min_{\alpha,\beta} \sum_{i=1}^n \|x_i - \alpha\beta^Tx_i\|^2 + \lambda\|\beta\|^2 \qquad (3.6)$$
subject to $\|\alpha\|^2 = 1$.
Then $\hat\beta \propto V_1$.
The next theorem extends Theorem 2 to derive the whole sequence of PCs.
Theorem 3. Suppose we are considering the first k principal components. Let $A_{p\times k} = [\alpha_1, \ldots, \alpha_k]$ and $B_{p\times k} = [\beta_1, \ldots, \beta_k]$. For any λ > 0, let
$$(\hat A, \hat B) = \arg\min_{A,B} \sum_{i=1}^n \|x_i - AB^Tx_i\|^2 + \lambda\sum_{j=1}^k \|\beta_j\|^2$$
subject to $A^TA = I_{k\times k}$.
Then $\hat\beta_j \propto V_j$ for $j = 1, 2, \ldots, k$.
Theorems 2 and 3 effectively transform the PCA problem to a regression-type problem.
The critical element is the objective function $\sum_{i=1}^n \|x_i - AB^Tx_i\|^2$. If we restrict B = A, then
$$\sum_{i=1}^n \|x_i - AB^Tx_i\|^2 = \sum_{i=1}^n \|x_i - AA^Tx_i\|^2,$$
whose minimizer under the orthonormal constraint on A is exactly the first k loading vectors
of ordinary PCA. This formulation arises in the “closest approximating linear manifold”
derivation of PCA (e.g., Hastie, Tibshirani, and Friedman 2001). Theorem 3 shows that we
can still have exact PCA while relaxing the restriction B = A and adding the ridge penalty
term. As can be seen later, these generalizations enable us to flexibly modify PCA.
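As a small numerical illustration of this "closest approximating linear manifold" view (our own check on simulated data): with A equal to the first k ordinary loading vectors, the projection $XAA^T$ coincides with the best rank-k approximation of X, and the reconstruction error equals the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 7))
X = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
k = 3
A = Vt[:k].T                                  # first k ordinary PCA loading vectors

X_proj = X @ A @ A.T                          # projection onto the k-dimensional manifold
X_rank_k = (U[:, :k] * d[:k]) @ Vt[:k]        # best rank-k approximation of X

print(np.allclose(X_proj, X_rank_k))                               # True
print(np.isclose(np.sum((X - X_proj) ** 2), np.sum(d[k:] ** 2)))   # True
```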
The proofs of Theorems 2 and 3 are given in the Appendix; here we give an intuitive
explanation. Note that
$$\sum_{i=1}^n \|x_i - AB^Tx_i\|^2 = \|X - XBA^T\|^2. \qquad (3.8)$$
Since $\|X - XBA^T\|^2 = \|XA_\perp\|^2 + \sum_{j=1}^k \|X\alpha_j - X\beta_j\|^2$ for any orthonormal complement $A_\perp$ of A, minimizing over B with A fixed amounts to k independent ridge regressions of the components $X\alpha_j$ on X, which is the source of the connection.
To obtain sparse loadings, we add the lasso penalty to the criterion of Theorem 3 and consider the following optimization problem:
$$(\hat A, \hat B) = \arg\min_{A,B}\ \sum_{i=1}^n \|x_i - AB^Tx_i\|^2 + \lambda\sum_{j=1}^k \|\beta_j\|^2 + \sum_{j=1}^k \lambda_{1,j}\|\beta_j\|_1 \qquad (3.12)$$
subject to $A^TA = I_{k\times k}$.
Whereas the same λ is used for all k components, different $\lambda_{1,j}$'s are allowed for penalizing the loadings of different principal components. Again, if p > n, a positive λ is required in order to get exact PCA when the sparsity constraint (the lasso penalty) vanishes ($\lambda_{1,j} = 0$). We call (3.12) the SPCA criterion hereafter.
B given A: For a fixed A, the $\beta_j$ in (3.12) decouple, and each $\beta_j$ solves an elastic net problem that can be handled efficiently by the LARS-EN algorithm.
A given B: On the other hand, if B is fixed, then we can ignore the penalty part in (3.12) and only try to minimize $\sum_{i=1}^n \|x_i - AB^Tx_i\|^2 = \|X - XBA^T\|^2$, subject to $A^TA = I_{k\times k}$. The solution is obtained by a reduced rank form of the Procrustes rotation, given in Theorem 4 below. We compute the SVD
$$(X^TX)B = UDV^T$$
and set $\hat A = UV^T$.
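A minimal sketch of this A-step (our own code; the matrix B below is an arbitrary stand-in for the current sparse loadings): compute the SVD of $(X^TX)B$ and set $A = UV^T$.

```python
import numpy as np

def procrustes_a_step(X, B):
    """Given fixed B, update A = U V^T from the SVD (X^T X) B = U D V^T,
    the reduced-rank Procrustes rotation minimizing ||X - X B A^T||^2 over A^T A = I."""
    U, _, Vt = np.linalg.svd(X.T @ X @ B, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))
B = rng.normal(size=(5, 2))              # candidate sparse loadings (k = 2 columns)
A = procrustes_a_step(X, B)
print(np.allclose(A.T @ A, np.eye(2)))   # A has orthonormal columns
```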
Although (3.16) (with Σ instead of $X^TX$) is not quite an elastic net problem, we can easily turn it into one. Create the artificial response $Y^{**}$ and predictors $X^{**}$ as follows:
$$Y^{**} = \Sigma^{\frac{1}{2}}\alpha_j, \qquad X^{**} = \Sigma^{\frac{1}{2}}. \qquad (3.17)$$
Then (3.16) becomes an elastic net problem in $Y^{**}$ and $X^{**}$.
Let $\hat Z$ denote the modified PCs. Usually the total variance explained by $\hat Z$ is calculated by $\mathrm{tr}(\hat Z^T\hat Z)$. This is reasonable when the modified PCs are uncorrelated. However, if they are correlated, $\mathrm{tr}(\hat Z^T\hat Z)$ is too optimistic for representing the total variance. Suppose $(\hat Z_i,\ i = 1, 2, \ldots, k)$
are the first k modified PCs by any method, and the (k + 1)th modified PC Ẑk+1 is obtained.
We want to compute the total variance explained by the first k + 1 modified PCs, which
should be the sum of the explained variance by the first k modified PCs and the additional
variance from Ẑk+1 . If Ẑk+1 is correlated with (Ẑi , i = 1, 2, . . . , k), then its variance
contains contributions from $(\hat Z_i,\ i = 1, 2, \ldots, k)$, which should not be included in the
total variance given the presence of (Ẑi , i = 1, 2, . . . , k).
Here we propose a new formula to compute the total variance explained by Z, which
takes into account the correlations among Z. We use regression projection to remove the
linear dependence between correlated components. Denote by $\hat Z_{j\cdot 1,\ldots,j-1}$ the residual after adjusting $\hat Z_j$ for $\hat Z_1, \ldots, \hat Z_{j-1}$, that is,
$$\hat Z_{j\cdot 1,\ldots,j-1} = \hat Z_j - H_{1,\ldots,j-1}\hat Z_j,$$
where $H_{1,\ldots,j-1}$ is the projection matrix on $\{\hat Z_i\}_1^{j-1}$. Then the adjusted variance of $\hat Z_j$ is $\|\hat Z_{j\cdot 1,\ldots,j-1}\|^2$, and the total explained variance is defined as $\sum_{j=1}^k \|\hat Z_{j\cdot 1,\ldots,j-1}\|^2$. When the modified PCs $\hat Z$ are uncorrelated, the new formula agrees with $\mathrm{tr}(\hat Z^T\hat Z)$.
Note that the above computations depend on the order of Ẑi . However, since we have
a natural order in PCA, ordering is not an issue here. Using the QR decomposition, we can easily compute the adjusted variance. Suppose $\hat Z = QR$, where Q is orthonormal and R is upper triangular. Then it is straightforward to see that
$$\|\hat Z_{j\cdot 1,\ldots,j-1}\|^2 = R_{jj}^2,$$
so the adjusted total variance equals $\sum_{j=1}^k R_{jj}^2$.
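A short sketch (our own illustration on simulated, deliberately correlated components) of computing the adjusted total variance from the QR decomposition:

```python
import numpy as np

def adjusted_total_variance(Z):
    """Adjusted total variance of (possibly correlated) modified PCs Z via QR:
    sum of squared diagonal entries of R, where Z = QR."""
    R = np.linalg.qr(Z, mode="r")
    return np.sum(np.diag(R) ** 2)

rng = np.random.default_rng(5)
Z = rng.normal(size=(100, 3))
Z[:, 2] = 0.6 * Z[:, 0] + 0.1 * rng.normal(size=100)   # make the components correlated

print(adjusted_total_variance(Z))   # adjusted for the correlations
print(np.trace(Z.T @ Z))            # naive tr(Z^T Z) is larger when components overlap
```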
4. SPCA FOR p ≫ n AND GENE EXPRESSION ARRAYS
Gene expression arrays typically have p ≫ n, and for such data we take λ → ∞ in the SPCA criterion. Let $\hat V_j(\lambda) = \hat\beta_j/\|\hat\beta_j\|$ be the loadings derived from (3.12), and let $(\hat A, \hat B)$ solve
$$(\hat A, \hat B) = \arg\min_{A,B}\ -2\,\mathrm{tr}(A^TX^TXB) + \sum_{j=1}^k \|\beta_j\|^2 + \sum_{j=1}^k \lambda_{1,j}\|\beta_j\|_1 \qquad (4.1)$$
subject to $A^TA = I_{k\times k}$.
When $\lambda \to \infty$, $\hat V_j(\lambda) \to \hat\beta_j/\|\hat\beta_j\|$.
We can use the same alternating algorithm in Section 3.3 to solve (4.1), where we only
need to replace the general elastic net problem with its special case (λ = ∞). Note that
given A,
$$\hat\beta_j = \arg\min_{\beta_j}\ -2\alpha_j^T(X^TX)\beta_j + \|\beta_j\|^2 + \lambda_{1,j}\|\beta_j\|_1, \qquad (4.2)$$
which has an explicit solution given by the soft-thresholding rule (4.3) below.
Gene Expression Arrays SPCA Algorithm. Replace Step 2 in the general SPCA algorithm with
Step 2∗: for j = 1, 2, . . . , k,
$$\beta_j = \left(\left|\alpha_j^TX^TX\right| - \frac{\lambda_{1,j}}{2}\right)_+ \operatorname{Sign}\!\left(\alpha_j^TX^TX\right). \qquad (4.3)$$
The operation in (4.3) is called soft-thresholding. Figure 1 gives an illustration of how the
soft-thresholding rule operates. Recently soft-thresholding has become increasingly popular
in the literature. For example, nearest shrunken centroids (Tibshirani, Hastie, Narasimhan,
and Chu 2002) adopts the soft-thresholding rule to simultaneously classify samples and
select important genes in microarrays.
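The following sketch (our own code, assuming the λ = ∞ special case described above) alternates the soft-thresholding update (4.3) with the Procrustes step for A. The function name and the stopping rule (a fixed number of iterations) are our own choices.

```python
import numpy as np

def spca_soft_threshold(X, k, lam1, n_iter=100):
    """SPCA for p >> n (lambda = infinity): alternate
    (i)  beta_j = (|alpha_j^T X^T X| - lam1_j/2)_+ * sign(alpha_j^T X^T X)   [step (4.3)]
    (ii) A = U V^T from the SVD of (X^T X) B                                 [Procrustes step].
    lam1 is a length-k array of penalties; returns normalized sparse loadings."""
    XtX = X.T @ X
    # initialize A with the ordinary PCA loadings
    A = np.linalg.svd(X, full_matrices=False)[2][:k].T
    for _ in range(n_iter):
        C = XtX @ A                                   # column j is X^T X alpha_j
        B = np.sign(C) * np.maximum(np.abs(C) - lam1 / 2.0, 0.0)
        U, _, Vt = np.linalg.svd(XtX @ B, full_matrices=False)
        A = U @ Vt
    norms = np.linalg.norm(B, axis=0)
    norms[norms == 0] = 1.0
    return B / norms                                  # sparse loading vectors

rng = np.random.default_rng(6)
X = rng.normal(size=(20, 200))                        # n = 20 samples, p = 200 "genes"
X = X - X.mean(axis=0)
V_sparse = spca_soft_threshold(X, k=2, lam1=np.array([30.0, 30.0]))
print((np.abs(V_sparse) > 0).sum(axis=0))             # number of nonzero loadings per PC
```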
5. EXAMPLES
Table 2. Pitprops Data: Loadings of the First Six Modified PCs by SCoTLASS. Empty cells have zero
loadings.
t = 1.75
Variable PC1 PC2 PC3 PC4 PC5 PC6
topdiam 0.664 −0.025 0.002 −0.035
length 0.683 −0.001 −0.040 0.001 −0.018
moist 0.641 0.195 0.180 −0.030
testsg 0.701 0.001 −0.001
ovensg −0.887 −0.056
ringtop 0.293 −0.186 −0.373 0.044
ringbut 0.001 0.107 −0.658 −0.051 0.064
bowmax 0.001 0.735 0.021 −0.168
bowdist 0.283 −0.001
whorls 0.113 −0.001 0.388 −0.017 0.320
clear −0.923
knots 0.001 −0.554 0.016 0.004
diaknot 0.703 0.001 −0.197 0.080
Figure 2. Pitprops data: The sequences of sparse approximations to the first six principal components. The curves
show the percentage of explained variance (PEV) as a function of λ1 . The vertical broken lines indicate the choice
of λ1 used in our SPCA analysis.
Table 3. Pitprops Data: Loadings of the First Six Sparse PCs by SPCA. Empty cells have zero loadings.
As a reference, we also considered the simple thresholding approach. Although simple thresholding has various drawbacks, it may serve as a benchmark for testing sparse PC
methods. A variant of simple thresholding is soft-thresholding. We found that when used in
PCA, soft-thresholding performs very similarly to simple thresholding. Thus, we omitted the
results of soft-thresholding in this article. Both SCoTLASS and SPCA were compared with
simple thresholding. Table 4 presents the loadings and the corresponding variance explained
by simple thresholding. To make the comparisons fair, we let the numbers of nonzero
loadings obtained by simple thresholding match the results of SCoTLASS and SPCA, as
shown in the top and bottom parts of Table 4, respectively. In terms of variance, it seems
that simple thresholding is better than SCoTLASS and worse than SPCA. Moreover, the
variables with nonzero loadings selected by SPCA differ from those chosen by simple thresholding
for the first three PCs; while SCoTLASS seems to create a similar sparseness pattern as
simple thresholding does, especially in the leading PC.
We consider three hidden factors: $V_1 \sim N(0, 290)$, $V_2 \sim N(0, 300)$, and $V_3 = -0.3V_1 + 0.925V_2 + \epsilon$ with $\epsilon \sim N(0, 1)$, where $V_1$, $V_2$, and $\epsilon$ are independent. Ten observable variables are then generated as $X_i = V_1 + \epsilon_i^1$ for $i = 1, \ldots, 4$; $X_i = V_2 + \epsilon_i^2$ for $i = 5, \ldots, 8$; and $X_i = V_3 + \epsilon_i^3$ for $i = 9, 10$; with all $\epsilon_i^j \sim N(0, 1)$ independent.
We used the exact covariance matrix of (X1 , . . . , X10 ) to perform PCA, SPCA and simple
thresholding (in the population setting).
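Assuming the factor model sketched above (the specific coefficients are our reconstruction of the simulation setup, not given verbatim in this extract), the following code builds the exact covariance matrix and reproduces the population PCA and the simple-thresholding comparison.

```python
import numpy as np

# Covariance of the three factors under the assumed model; var(V3) is rounded to 283.8
cov_v = np.array([[290.0,   0.0,  -87.0],
                  [  0.0, 300.0,  277.5],
                  [-87.0, 277.5, 283.8]])

groups = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])    # which factor drives each X_i
Sigma = cov_v[np.ix_(groups, groups)] + np.eye(10)   # add unit noise variance

# Population PCA: eigen-decomposition of Sigma
eigval, eigvec = np.linalg.eigh(Sigma)
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]
print(round(100 * eigval[:2].sum() / eigval.sum(), 1))   # close to the 99.6% quoted above

# Simple thresholding keeps the four largest loadings of the leading eigenvector
v1 = eigvec[:, 0]
keep = np.argsort(-np.abs(v1))[:4]
print(np.sort(keep))   # expected to include indices 8 and 9 (X9, X10), as the text notes
```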
Table 4. Pitprops Data: Loadings of the First Six Modified PCs by Simple Thresholding. Empty cells
have zero loadings.
The variance of the three underlying factors is 290, 300, and 283.8, respectively. The
numbers of variables associated with the three factors are 4, 4, and 2. Therefore V2 and
V1 are almost equally important, and they are much more important than V3 . The first two
PCs together explain 99.6% of the total variance. These facts suggest that we only need
to consider two derived variables with “correct” sparse representations. Ideally, the first
derived variable should recover the factor V2 only using (X5 , X6 , X7 , X8 ), and the second
derived variable should recover the factor V1 only using (X1 , X2 , X3 , X4 ). In fact, if we
sequentially maximize the variance of the first two derived variables under the orthonormal
constraint, while restricting the numbers of nonzero loadings to four, then the first derived
variable uniformly assigns nonzero loadings on (X5 , X6 , X7 , X8 ); and the second derived
variable uniformly assigns nonzero loadings on (X1 , X2 , X3 , X4 ).
Both SPCA (λ = 0) and simple thresholding were carried out by using the oracle
information that the ideal sparse representations use only four variables. Table 5 summarizes
the comparison results. Clearly, SPCA correctly identifies the sets of important variables. In
fact, SPCA delivers the ideal sparse representations of the first two principal components.
Mathematically, it is easy to show that if t = 2 is used, SCoTLASS is also able to find the
same sparse solution. In this example, both SPCA and SCoTLASS produce the ideal sparse
PCs, which may be explained by the fact that both methods explicitly use the lasso penalty.
In contrast, simple thresholding incorrectly includes X9 , X10 in the most important
variables. The variance explained by simple thresholding is also lower than that by SPCA,
although the relative difference is small (less than 5%). Due to the high correlation between
V2 and V3 , variables X9 , X10 achieve loadings which are even higher than those of the true
important variables (X5 , X6 , X7 , X8 ). Thus the truth is disguised by the high correlation.
On the other hand, simple thresholding correctly discovers the second factor, because V1
has a low correlation with V3 .
Figure 3. The sparse leading principal component: percentage of explained variance versus sparsity. Simple
thresholding and SPCA have similar performance. However, there is still a consistent difference in the selected
genes (the ones with nonzero loadings).
An important task in microarray analysis is to identify subsets of genes that are biologically relevant to the outcome (e.g., tumor type or survival time). PCA (or SVD) has been a popular
tool for this purpose. Many gene-clustering methods in the literature use PCA (or SVD) as
a building block. For example, gene shaving (Hastie et al. 2000) uses an iterative principal
component shaving algorithm to identify subsets of coherent genes. Here we consider
another approach to gene selection through SPCA. The idea is intuitive: if the (sparse)
principal component can explain a large part of the total variance of gene expression levels,
then the subset of genes representing the principal component is considered important.
We illustrate the sparse PC selection method on Ramaswamy's data (Ramaswamy et al. 2001), which has p = 16,063 genes and n = 144 samples. Its first principal
component explains 46% of the total variance. For microarray data like this, it appears
that SCoTLASS cannot be practically useful for finding sparse PCs. We applied SPCA
(λ = ∞) to find the leading sparse PC. A sequence of values for λ1 were used such that
the number of nonzero loadings varied over a wide range. As displayed in Figure 3, the
percentage of explained variance decreases at a slow rate as the sparsity increases. As few
as 2.5% of these 16,063 genes suffice to construct the leading principal component
with an affordable loss of explained variance (from 46% to 40%). Simple thresholding
was also applied to this data. It seems that when using the same number of genes, simple
thresholding always explains slightly higher variance than SPCA does. When SPCA and simple thresholding select the same number of genes, about 2% of the selected genes differ, and this difference rate is quite consistent.
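To illustrate the computation behind Figure 3, here is a compact sketch (our own code, run on simulated data rather than the Ramaswamy data) that sweeps λ1 in the λ = ∞ algorithm for the leading sparse PC and records sparsity against the percentage of explained variance.

```python
import numpy as np

def leading_sparse_pc(X, lam1, n_iter=200):
    """Leading sparse PC via the lambda = infinity iterations:
    beta = soft-threshold(X^T X alpha, lam1/2), then alpha = X^T X beta, normalized."""
    XtX = X.T @ X
    alpha = np.linalg.svd(X, full_matrices=False)[2][0]     # start from the PCA loading
    for _ in range(n_iter):
        beta = np.sign(XtX @ alpha) * np.maximum(np.abs(XtX @ alpha) - lam1 / 2.0, 0.0)
        if np.all(beta == 0):
            return beta
        alpha = XtX @ beta
        alpha /= np.linalg.norm(alpha)
    return beta / np.linalg.norm(beta)

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 500))
X = X - X.mean(axis=0)
total_var = np.sum(X ** 2)

for lam1 in [0.0, 50.0, 100.0, 150.0]:
    v = leading_sparse_pc(X, lam1)
    pev = np.sum((X @ v) ** 2) / total_var if v.any() else 0.0
    print(lam1, int(np.count_nonzero(v)), round(100 * pev, 1))
```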
6. DISCUSSION
It has been an interesting research topic for years to derive principal components with
sparse loadings. From a practical point of view, a good method to achieve the sparseness
goal should (at least) possess the following properties:
1. without any sparsity constraint, the method should reduce to exact PCA;
2. it should be computationally efficient for both small p and big p data;
3. it should avoid misidentifying the important variables.
The often-used simple thresholding approach is not criterion based. However, this
informal method seems to possess the first two of the desirable properties listed above. If
the explained variance and sparsity are the only concerns, simple thresholding is a reasonable
approach, and it is extremely convenient. We have shown that simple thresholding can work
well with gene expression arrays. The serious problem with simple thresholding is that it
can misidentify the real important variables. Nevertheless, simple thresholding is regarded
as a benchmark for any potentially better method.
Using the lasso constraint in PCA, SCoTLASS successfully derives sparse loadings.
However, SCoTLASS is not computationally efficient, and it lacks a good rule to pick its
tuning parameter. In addition, it is not feasible to apply SCoTLASS to gene expression
arrays, where PCA is quite a popular tool.
In this work we have developed SPCA using our SPCA criterion (3.12). This new
criterion gives exact PCA results when its sparsity (lasso) penalty term vanishes. SPCA
allows flexible control on the sparse structure of the resulting loadings. Unified efficient
algorithms have been proposed to compute SPCA solutions for both regular multivariate
data and gene expression arrays. As a principled procedure, SPCA enjoys advantages in
several aspects, including computational efficiency, high explained variance, and the ability to identify important variables.
Software in R for fitting the SPCA model (and elastic net models) is available in the
CRAN contributed package elasticnet.
APPENDIX: PROOFS
Proof of Theorem 1: Using $X^TX = VD^2V^T$ and $V^TV = I$, we have
$$\hat\beta_{\text{ridge}} = \left(X^TX + \lambda I\right)^{-1}X^T(XV_i) = \frac{D_{ii}^2}{D_{ii}^2 + \lambda}\,V_i. \qquad (\text{A.1})$$
Hence $\hat v = V_i$. ✷
Note that since Theorem 2 is a special case of Theorem 3, we will not prove it separately.
We first provide a lemma.
Lemma 1. Consider the ridge regression criterion
$$C_\lambda(\beta) = \|y - X\beta\|^2 + \lambda\|\beta\|^2.$$
Then if $\hat\beta = \arg\min_\beta C_\lambda(\beta)$,
$$C_\lambda(\hat\beta) = y^T(I - S_\lambda)y,$$
where $S_\lambda$ is the ridge operator
$$S_\lambda = X(X^TX + \lambda I)^{-1}X^T.$$
Proof of Lemma 1: Differentiating $C_\lambda$ with respect to β gives $-X^T(y - X\hat\beta) + \lambda\hat\beta = 0$. Premultiplication by $\hat\beta^T$ and rearrangement give $\lambda\|\hat\beta\|^2 = (y - X\hat\beta)^TX\hat\beta$. Since
$$\|y - X\hat\beta\|^2 + \lambda\|\hat\beta\|^2 = (y - X\hat\beta)^Ty - (y - X\hat\beta)^TX\hat\beta + \lambda\|\hat\beta\|^2,$$
we obtain $C_\lambda(\hat\beta) = (y - X\hat\beta)^Ty$. The result follows since the "fitted values" $X\hat\beta = S_\lambda y$. ✷
Proof of Theorem 3. We use the notation introduced in Section 3: A = [α1 , . . . , αk ]
and B = [β1 , . . . , βk ]. Let
$$C_\lambda(A, B) = \sum_{i=1}^n \|x_i - AB^Tx_i\|^2 + \lambda\sum_{j=1}^k \|\beta_j\|^2.$$
As in (3.9) we have
$$\sum_{i=1}^n \|x_i - AB^Tx_i\|^2 = \|X - XBA^T\|^2 \qquad (\text{A.2})$$
$$= \|XA_\perp\|^2 + \|XA - XB\|^2, \qquad (\text{A.3})$$
where $A_\perp$ is any orthonormal complement of A. Hence, with A fixed, minimizing $C_\lambda(A, B)$ over B amounts to k independent ridge regressions of the components $X\alpha_j$ on X. Using Lemma 1 and (A.2), the partially optimized penalized criterion is given by
$$C_\lambda(A, \hat B) = \|XA_\perp\|^2 + \mathrm{tr}\big((XA)^T(I - S_\lambda)(XA)\big). \qquad (\text{A.5})$$
Since $A^TA = I$, the last term is equal to $\mathrm{tr}(N^TN)$, and hence we need to maximize (minus half) the middle term. With the SVD $M^TN = UDV^T$, this middle term becomes
$$\mathrm{tr}(M^TNA^T) = \mathrm{tr}(UDV^TA^T) = \mathrm{tr}(V^TA^TUD),$$
which is maximized by taking $\hat A = UV^T$.
For the λ → ∞ result of Section 4, consider the SPCA criterion (3.12) evaluated at $B/(1+\lambda)$:
$$C_{\lambda,\lambda_1}(A, B) = \sum_{i=1}^n \Big\|x_i - A\frac{B^T}{1+\lambda}x_i\Big\|^2 + \lambda\sum_{j=1}^k \Big\|\frac{\beta_j}{1+\lambda}\Big\|^2 + \sum_{j=1}^k \lambda_{1,j}\Big\|\frac{\beta_j}{1+\lambda}\Big\|_1. \qquad (\text{A.12})$$
Since
$$\sum_{j=1}^k \Big\|\frac{\beta_j}{1+\lambda}\Big\|^2 = \frac{1}{(1+\lambda)^2}\,\mathrm{tr}(B^TB),$$
and
$$\sum_{i=1}^n \Big\|x_i - A\frac{B^T}{1+\lambda}x_i\Big\|^2 = \mathrm{tr}\Big(\big(X - X\tfrac{B}{1+\lambda}A^T\big)^T\big(X - X\tfrac{B}{1+\lambda}A^T\big)\Big)$$
$$= \mathrm{tr}(X^TX) + \frac{1}{(1+\lambda)^2}\,\mathrm{tr}(B^TX^TXB) - \frac{2}{1+\lambda}\,\mathrm{tr}(A^TX^TXB),$$
Thus we have
$$C_{\lambda,\lambda_1}(A, B) = \mathrm{tr}(X^TX) + \frac{1}{1+\lambda}\left[\mathrm{tr}\Big(B^T\frac{X^TX + \lambda I}{1+\lambda}B\Big) - 2\,\mathrm{tr}(A^TX^TXB) + \sum_{j=1}^k \lambda_{1,j}\|\beta_j\|_1\right].$$
As $\lambda \to \infty$, $(X^TX + \lambda I)/(1+\lambda) \to I$, and minimizing the bracketed term reduces to the criterion (4.1) of Section 4, which proves the claim there.
ACKNOWLEDGMENTS
We thank the editor, an associate editor, and referees for helpful comments and suggestions which greatly
improved the manuscript.
REFERENCES
Alter, O., Brown, P., and Botstein, D. (2000), "Singular Value Decomposition for Genome-Wide Expression Data Processing and Modeling," Proceedings of the National Academy of Sciences, 97, 10101–10106.
Cadima, J., and Jolliffe, I. (1995), “Loadings and Correlations in the Interpretation of Principal Components,”
Journal of Applied Statistics, 22, 203–214.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), “Least Angle Regression,” The Annals of Statistics,
32, 407–499.
Hancock, P., Burton, A., and Bruce, V. (1996), “Face Processing: Human Perception and Principal Components
Analysis,” Memory and Cognition, 24, 26–40.
Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, New York: Springer-Verlag.
Hastie, T., Tibshirani, R., Eisen, M., Brown, P., Ross, D., Scherf, U., Weinstein, J., Alizadeh, A., Staudt, L.,
and Botstein, D. (2000), " 'Gene Shaving' as a Method for Identifying Distinct Sets of Genes With Similar
Expression Patterns,” Genome Biology, 1, 1–21.
Jeffers, J. (1967), "Two Case Studies in the Application of Principal Component Analysis," Applied Statistics, 16, 225–236.
Jolliffe, I. (1986), Principal Component Analysis, New York: Springer Verlag.
——— (1995), "Rotation of Principal Components: Choice of Normalization Constraints," Journal of Applied
Statistics, 22, 29–35.
Jolliffe, I. T., Trendafilov, N. T., and Uddin, M. (2003), “A Modified Principal Component Technique Based on
the Lasso,” Journal of Computational and Graphical Statistics, 12, 531–547.
Mardia, K., Kent, J., and Bibby, J. (1979), Multivariate Analysis, New York: Academic Press.
McCabe, G. (1984), “Principal Variables,” Technometrics, 26, 137–144.
Osborne, M. R., Presnell, B., and Turlach, B. A. (2000), “A New Approach to Variable Selection in Least Squares
Problems,” IMA Journal of Numerical Analysis, 20, 389–403.
Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J., Poggio, T., Gerald, W., Loda, M., Lander, E., and Golub, T. (2001), "Multiclass Cancer Diagnosis Using Tumor Gene Expression Signatures," Proceedings of the National Academy of Sciences, 98, 15149–15154.
Tibshirani, R. (1996), “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society,
Series B, 58, 267–288.
Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002), "Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression," Proceedings of the National Academy of Sciences, 99, 6567–6572.
Vines, S. (2000), “Simple Principal Components,” Applied Statistics, 49, 441–451.
Zou, H., and Hastie, T. (2005), “Regularization and Variable Selection via the Elastic Net,” Journal of the Royal
Statistical Society, Series B, 67, 301–320.