Abstract
Principal components analysis (PCA) is a standard tool in multivariate data analysis for reducing the number of
dimensions while retaining as much of the data’s variation as possible. Instead of investigating thousands of
original variables, the first few components containing the majority of the data’s variation are explored. The
visualization and statistical analysis of these new variables, the principal components, can help to find
similarities and differences between samples. Important original variables that are the major contributors
to the first few components can be discovered as well.
This chapter seeks to deliver a conceptual understanding of PCA as well as a mathematical description.
We describe how PCA can be used to analyze different datasets, and we include practical code examples.
Possible shortcomings of the methodology and ways to overcome these problems are also discussed.
Key words: Principal components analysis, Multivariate data analysis, Metabolite profiling, Codon
usage, Dimensionality reduction
1. Introduction
2. Important Concepts
The “-omics” technologies mentioned above require careful data
preprocessing and normalization before the actual data analysis can
be performed. The methods used for this purpose have recently
been discussed for microarrays (2) and metabolite data (3). Here we only briefly outline important concepts
specific to these data types and give a general introduction using different examples.
2.1. Data Normalization and Transformation
Data normalization aims to remove technical variability and to impute missing values that result from
experimental issues. Data transformation, in contrast, aims to bring the data distribution closer to a
Gaussian one and ensures that more powerful parametric statistical methods can be used later on. The basic
steps for data preparation are shown in Fig. 1 and are detailed below.
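As an illustration of these steps (a minimal sketch using base R, not the chapter’s own code), assume a hypothetical data frame raw with samples in rows and positive measurements in columns:

> # impute missing values, here simply with the per-variable mean
> imputed <- apply(raw, 2, function(x) { x[is.na(x)] <- mean(x, na.rm=TRUE); x })
> # log-transform to make right-skewed distributions more Gaussian-like
> logged <- log2(imputed)
> # center each variable and scale it to unit variance
> prepared <- scale(logged)

More sophisticated imputation strategies exist; the mean imputation above is only a placeholder.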
[Figure: distributions of the variables cm, kg, and grade, shown on the original scale and after transformation and scaling.]
2.2. Principal Components Analysis
We next demonstrate PCA using the example of the students dataset, with the variables “cm,” “kg,” and
“grade.” The students are the samples in this case. As can be seen in Table 1, the height, measured in cm,
and the weight, measured in kg, have a higher variance than the grade, which ranges from 1 to 5. After
scaling to unit variance, it can be seen in Table 2 that all variables have a variance of one. The weight
and the height of our sample students have a larger covariance than that between the grade and either weight
or height. Remember that the covariance between two variables indicates the degree to which they vary together.
Fig. 3. Pairs plot of unscaled (upper triangle) and scaled (lower triangle) data. Individuals can be identified by their letter
codes.
Table 1
Covariance matrix for the original data. The diagonal contains the variances

            cm       kg    Grade
cm       47.54    36.90     0.11
kg       36.90   123.18     0.84
Grade     0.11     0.84     0.62
Table 2
Covariance/correlation matrix for the scaled data

            cm       kg    Grade
cm        1.00     0.48     0.02
kg        0.48     1.00     0.10
Grade     0.02     0.10     1.00
Table 3
Covariance matrix for the principal components. The matrix is diagonal; all off-diagonal values are zero,
which means that the principal components are uncorrelated with each other
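The quantities in Tables 1–3 can be reproduced with a few lines of base R. This is a minimal sketch assuming a data frame students with the numeric columns cm, kg, and grade:

> cov(students)               # covariance matrix of the raw data (Table 1)
> scaled <- scale(students)   # scale all variables to unit variance
> cov(scaled)                 # equals the correlation matrix (Table 2)
> pca <- prcomp(students, scale.=TRUE)
> round(cov(pca$x), 10)       # diagonal matrix: the PCs are uncorrelated (Table 3)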
Fig. 4. Common plots for PCA visualization. (a) Screeplot for the first few components using a bar plot; (b) screeplot using
lines; (c) biplot showing the most relevant loading vectors; (d) correlation plot.
Fig. 5. Illustrative projection of a three-dimensional data point cloud onto a two-dimensional surface.
Fig. 6. PCA plots using the students’ height and weight data. (a) Scaled data; (b) data projected
into the new coordinate system of principal components; (c) screeplot of the two resulting
PCs; (d) biplot showing the loading vectors of the original variables.
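Plots like those in Fig. 6 can be produced with base R alone. A minimal sketch, again assuming a data frame students with the numeric columns cm and kg:

> pca <- prcomp(students[, c("cm", "kg")], scale.=TRUE)
> plot(scale(students[, c("cm", "kg")]))   # scaled data, cf. Fig. 6a
> plot(pca$x)                              # scores in PC coordinates, cf. Fig. 6b
> screeplot(pca)                           # variances of the two PCs, cf. Fig. 6c
> biplot(pca)                              # scores plus loading vectors, cf. Fig. 6d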
3. Biological Examples
PCA was first applied to microarray data in 2000 (9, 10) and has
been reviewed in this context (11). We decided to choose other
types of data to illustrate PCA. First, we use a dataset that requires the transformation of qualitative
data into numerical data: codon frequencies from various taxa are used to evaluate the main codon usage
differences with a PCA. Next, we use data from a recent study about principal changes in the Escherichia coli
(E. coli) metabolism after applying stress conditions (12). We extend that data analysis with the
visualizations that PCA enables, to better understand the E. coli primary stress metabolism.
3.1. Sequence Data Analysis
Here we use PCA to demonstrate the well-known fact that codon usage differs between the protein-coding genes
of different organisms. Genome sequences for many taxa are freely available from different online resources.
We take just five genomes for this example, although we could easily have taken 50 or 500: Arabidopsis
thaliana (a higher plant), Caenorhabditis elegans (a nematode), Drosophila melanogaster (the fruit fly),
Canis familiaris (the dog), and Saccharomyces cerevisiae (yeast). For each of these genomes, we only use the
protein-coding genes and end up with about 33,000 (plant), 24,000 (nematode), 15,000 (dog), 18,000 (fruit
fly), and 6,000 (yeast) genes.
For each species, we then record how many times each of the 64 possible codons is used across these gene
sequences. The data to be analyzed for interesting patterns is therefore a 5 × 64 matrix. It describes the
combined codon usage of all protein-coding genes for each of the five taxa. An abbreviated and transposed
version of this matrix is shown below; note that absolute frequencies were recorded.
> codons=read.table('http://bitbucket.org/mittelmark/r-code/downloads/codonsTaxa.txt',
      header=TRUE, row.names=1)
> head(codons[,c(1:3,62:64)])
        AAA    AAC    AAG    TTC    TTG    TTT
ATHA 419472 275641 432728 269345 285985 299032
CELE 394707 192068 272375 249129 210524 240164
CFAM 243989 187183 314228 205704 134896 186681
DMEL 185136 280003 417111 227142 173632 141944
SSCE 124665  72230  88924  52059  77227  76198
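A PCA on this matrix can then be run directly with base R. As a minimal sketch (one possible preprocessing, not necessarily the one used to produce Fig. 7), the absolute counts are first converted to relative frequencies so that genome size does not dominate the analysis:

> rel <- codons / rowSums(codons)   # relative codon frequencies per taxon
> pca <- prcomp(rel, scale.=TRUE)   # taxa are the samples, codons the variables
> summary(pca)                      # variance explained per component
> screeplot(pca)                    # cf. Fig. 7a
> biplot(pca)                       # cf. Fig. 7b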
Fig. 7. Codon data. (a) Screeplot of the PCs’ variances; (b) biplot of the first two PCs with the most
important loadings; (c) correlation plot for all variables.
3.2. Metabolite Data Analysis
In this example we employ PCA to analyze the system-level stress adjustments in the response of E. coli to
five different perturbations. We make use of time-resolved metabolite measurements to get a detailed
understanding of the successive events following heat shock, cold shock, oxidative stress, lactose diauxie,
and entry into stationary phase. A previous analysis of the metabolite data, together with transcript data
measured under the exact same perturbations and time points, was able to show that E. coli’s response on the
metabolic level exhibits a higher degree of specificity than its response on the transcript level (12).
[Figure: map of the E. coli primary metabolism covering glycolysis, the pentose phosphate pathway, and the TCA cycle, with intermediates such as Glc-6-P, Fru-6-P, G3P, pyruvic acid, acetyl-CoA, citric acid, 2-ketoglutaric acid, succinyl-CoA, succinic acid, fumaric acid, malic acid, and OAA.]
Fig. 10. Metabolite concentrations of conditions and different time-points. Within each time-series, each metabolite
concentration is normalized to preperturbation levels.
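The normalization described in the caption of Fig. 10, followed by a PCA, could be sketched as follows. The matrix metab and the row label "t0" are hypothetical placeholders for one condition’s time series (rows = time points, columns = metabolites):

> # divide each metabolite by its preperturbation level and take logs
> fold <- sweep(metab, 2, as.numeric(metab["t0", ]), "/")
> logfold <- log2(fold)
> pca <- prcomp(logfold, scale.=TRUE)
> plot(pca$x[, 1:2], type="b")   # trajectory of the time series in PC space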
4. PCA Improvements and Alternatives
PCA is an excellent method for finding orthogonal directions that correspond to maximum variance. Datasets
can, of course, contain other types of structure that PCA is not designed to detect. For example, the
largest variations might not be of the greatest biological importance. This problem cannot easily be solved,
as it requires knowledge of the biology behind the data. In such cases it may be important to remove
outliers to minimize the effect of single values on the overall outcome. Outlier-insensitive PCA algorithms
such as robust PCA (14) and weighted PCA (15) are available, as is an R package, rrcov (16), which can be
used to apply some of these advanced PCA methods to the dataset. The package provides, for example, the
function PcaCov, which calls robust estimators of the covariance matrix.
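A minimal sketch of such a robust analysis, assuming the rrcov package is installed (USArrests is a built-in R example dataset, used here only as a stand-in for real data):

> library(rrcov)                    # robust multivariate methods (16)
> rpca <- PcaCov(scale(USArrests))  # PCA based on a robust covariance estimate
> summary(rpca)                     # variances of the robust components
> getLoadings(rpca)                 # loadings, analogous to prcomp()$rotation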
In datasets with many variables it is sometimes difficult to obtain a general description of a certain
component. For this purpose, e.g., in microarray analysis, the enrichment of certain ontology terms among
the variables contributing most to a component is often used to get an impression of what the component
actually represents (17).
A further problem with PCA is that the components, although uncorrelated and mutually orthogonal, are not
necessarily statistically independent. Independent components analysis (ICA) does not have this shortcoming.
Some authors have found that ICA outperforms PCA (18), others have found the opposite (19, 20). Which method
works best in practice depends on the actual data structure, and ICA is in some cases a possible alternative
to PCA. The fastICA algorithm can be used for this purpose (21, 22). Because ICA does not reduce the number
of variables as PCA does, it can be used in conjunction with PCA to obtain a decreased number of variables
to consider. For instance, it has been shown that ICA, when performed on the first few principal components,
i.e., on the results of a preceding PCA, can improve the sample differentiation (23).
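This combination could be sketched as follows, assuming the fastICA package is installed and X is a placeholder for a numeric data matrix:

> library(fastICA)
> pcs <- prcomp(X, scale.=TRUE)$x[, 1:3]   # scores of the first three PCs
> ica <- fastICA(pcs, n.comp=2)            # ICA on the PCA results, cf. (23)
> plot(ica$S, xlab="IC1", ylab="IC2")      # estimated independent components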
Higher-order dependencies, for instance data scattered in a ringlike manner around a certain point, are
sometimes difficult to resolve with standard PCA, and a nonlinear approach that first transforms the data
into a new coordinate system may be required. Such approaches are known as nonlinear PCA or kernel PCA
(24, 25); a small sketch follows below. To obtain deeper insights into the relevant variables required to
differentiate between the samples, factor analysis might be a better choice.
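As a sketch of such a kernelized analysis, assuming the kernlab package is installed and X is again a placeholder for a numeric data matrix:

> library(kernlab)
> kp <- kpca(as.matrix(X), kernel="rbfdot", kpar=list(sigma=0.2), features=2)
> plot(rotated(kp))   # data projected onto the first two kernel PCs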
5. Availability of R-Code
The example data and the R-code required to create the graphics of this article are available at the
webpage http://bitbucket.org/mittelmark/r-code/wiki/Home.
The script file ma.pca.r contains some functions which can be
used to simplify data analysis using R. The data and functions of the
ma.pca object can be investigated by typing the ls(ma.pca)
command. Some of the most important functions and objects are:
- ma.pca$new(data) - performs a new PCA on data; needs to be called first
- ma.pca$summary() - returns a summary with the variances of the most important components
- ma.pca$scores - the positions of the data points in the new coordinate system
- ma.pca$loadings - numerical values describing the amount each variable contributes to a certain component
- ma.pca$plot() - a pairs plot of the most important components, with the % of variance on the diagonal
- ma.pca$biplot() - produces a biplot for the samples and the most important variables
- ma.pca$corplot() - produces a correlation plot for all variables on selected components
- ma.pca$screeplot() - produces an improved screeplot for the PCA
These functions accept further parameters; for example, components other than the first two can be selected
with the pcs argument. For instance, ma.pca$corplot(pcs=c('PC2','PC3'), cex=1.2) plots the second against
the third component and slightly enlarges the text labels. To get comfortable with the functions, users
should study the material on the project website and the R source code.
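A typical session using these helper functions might look like this (a sketch; it assumes the script file has been downloaded from the project page into the working directory):

> source('ma.pca.r')    # load the ma.pca object and its functions
> ma.pca$new(codons)    # e.g., the codon data from above
> ma.pca$summary()      # variances of the most important components
> ma.pca$biplot()       # samples and most important variables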
References
1. Hotelling H (1933) Analysis of complex statistical variables into principal components. J Educ Psychol 24:417–441 and 498–520
2. Quackenbush J (2002) Microarray data normalization and transformation. Nat Genet 32(Suppl):496–501
3. Steinfath M, Groth D, Lisec J, Selbig J (2008) Metabolite profile analysis: from raw data to regression and classification. Physiol Plant 132:150–161
4. Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13:21–27
5. Bo TM, Dysvik B, Jonassen I (2004) LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Res 32:e34
6. Stacklies W, Redestig H, Scholz M et al (2007) pcaMethods - a Bioconductor package providing PCA methods for incomplete data. Bioinformatics 23:1164–1167
7. Troyanskaya O, Cantor M, Sherlock G et al (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17:520–525
8. Celton M, Malpertuy A, Lelandais G, de Brevern AG (2010) Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments. BMC Genomics 11:15
9. Alter O, Brown PO, Botstein D (2000) Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 97:10101–10106
10. Alter O, Brown PO, Botstein D (2003) Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proc Natl Acad Sci USA 100:3351–3356
11. Quackenbush J (2001) Computational analysis of microarray data. Nat Rev Genet 2:418–427
12. Jozefczuk S, Klie S, Catchpole G et al (2010) Metabolomic and transcriptomic stress response of Escherichia coli. Mol Syst Biol 6:364
13. Gasch AP, Spellman PT, Kao CM et al (2000) Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 11:4241–4257
14. Hubert M, Engelen S (2004) Robust PCA and classification in biosciences. Bioinformatics 20:1728–1736
15. Kriegel HP, Kröger P, Schubert E, Zimek A (2008) A general framework for increasing the robustness of PCA-based correlation clustering algorithms. In: Ludäscher B, Mamoulis N (eds) Scientific and statistical database management. Springer, Berlin
16. Todorov V, Filzmoser P (2009) An object-oriented framework for robust multivariate analysis. J Stat Softw 32:1–47
17. Ma S, Kosorok MR (2009) Identification of differential gene pathways with principal component analysis. Bioinformatics 25:882–889
18. Draper BA, Baek K, Bartlett MS, Beveridge JR (2003) Recognizing faces with PCA and ICA. Comput Vis Image Understand 91:115–137
19. Virtanen J, Noponen T, Meriläinen P (2009) Comparison of principal and independent component analysis in removing extracerebral interference from near-infrared spectroscopy signals. J Biomed Opt 14:054032
20. Baek K, Draper BA, Beveridge JR, She K (2002) PCA vs. ICA: a comparison on the FERET data set. In: Proc of the 4th Intern Conf on Computer Vision (ICCV), pp 824–827
21. Hyvärinen A (1999) Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans Neural Netw 10:626–634
22. Marchini JL, Heaton C, Ripley BD (2009) fastICA: FastICA algorithms to perform ICA and projection pursuit. http://cran.r-project.org/web/packages/fastICA
23. Scholz M, Selbig J (2007) Visualization and analysis of molecular data. Methods Mol Biol 358:87–104
24. Scholz M, Kaplan F, Guy CL et al (2005) Non-linear PCA: a missing data approach. Bioinformatics 21:3887–3895
25. Schölkopf B, Smola A, Müller KR (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10:1299–1319
26. Hotelling H (1936) Relations between two sets of variates. Biometrika 28:321–377
27. de Leeuw J, Mair P (2009) Simple and canonical correspondence analysis using the R package anacor. J Stat Softw 31:1–18