
Chapter 22

Principal Components Analysis


Detlef Groth, Stefanie Hartmann, Sebastian Klie, and Joachim Selbig

Abstract
Principal components analysis (PCA) is a standard tool in multivariate data analysis to reduce the number of
dimensions, while retaining as much as possible of the data’s variation. Instead of investigating thousands of
original variables, the first few components containing the majority of the data’s variation are explored. The
visualization and statistical analysis of these new variables, the principal components, can help to find
similarities and differences between samples. Important original variables that are the major contributors
to the first few components can be discovered as well.
This chapter seeks to deliver a conceptual understanding of PCA as well as a mathematical description.
We describe how PCA can be used to analyze different datasets, and we include practical code examples.
Possible shortcomings of the methodology and ways to overcome these problems are also discussed.

Key words: Principal components analysis, Multivariate data analysis, Metabolite profiling, Codon
usage, Dimensionality reduction

1. Introduction

Modern data analysis is challenged by the enormous number of possible variables that can be measured simultaneously. Examples
include microarrays that measure nucleotide or protein levels, next
generation sequencers that measure RNA levels, or GC/MS and
LC/MS that measure metabolite levels. The simultaneous analysis
of genomic, transcriptomic, proteomic, and metabolomic data fur-
ther increases the number of variables investigated in parallel.
A typical problem illustrating this issue is the statistical evalua-
tion of clinical data, for instance investigating the differences
between healthy and diseased patients in cancer research. Having
measured thousands of gene expression levels, an obvious question
is which expression levels contribute the most to the differences
between the individuals, and which genotypic and phenotypic
properties, e.g., sex or age, are also important for the differences.
Visualization and exploration of just two variables is an easy task,


whereas the exploration of multidimensional data sets requires data decomposition and dimension reduction techniques such as principal components analysis (PCA) (1). PCA can deliver an overview of the variables that contribute the most to the differences and similarities between samples. In many cases, these
might be the variables that are of biological interest. Those compo-
nents can be used to visualize and to describe the dataset in a
concise manner. Visualizing sample differences and similarities can
also give information about the amount of biological and technical
variation in the datasets.
Mathematically, PCA uses the symmetrical covariance matrix
between the variables. For square matrices of size N × N, N eigen-
vectors with N eigenvalues can be determined. The components
are the eigenvectors of this square matrix and the eigenvector with
the largest eigenvalue is the first principal component. If we assume
that the most-varying components are the most important ones,
we can plot the first components against each other to visualize the
distances and differences between the samples in the datasets.
By exploring the variables that contribute most to important com-
ponents, it is possible to get insights into biological key processes.
It is not uncommon that the dataset has ten or even fewer principal
components containing more than 90% of the total variance in the
dataset, as opposed to thousands of original variables.
It is important to note that PCA is an unsupervised method,
i.e., a possible class membership of the samples is not taken into
account by this method. Although grouping of samples or variables
might be apparent in a low dimensional representation, PCA is not
a clustering tool as no distances are considered and no cluster labels
are assigned.

2. Important Concepts

The “-omics” technologies mentioned above require careful data
preprocessing and normalization before the actual data analysis can
be performed. The methods used for this purpose have recently
been discussed for microarrays (2) and metabolite data (3). We here
only briefly outline important concepts specific to these data types
and give a general introduction using different examples.

2.1. Data Normalization and Transformation

Data normalization aims to remove the technical variability and to impute missing values that result from experimental issues.
Data transformation, in contrast, aims to bring the data distribution closer to a Gaussian one and ensures that more powerful parametric statistical methods can be used later on. The basic steps for
data preparation are shown in Fig. 1 and are detailed below.

Fig. 1. Steps in data preparation for PCA.

Data normalization consists mainly of background subtraction and missing value imputation. Many higher-level data analysis methods
assume a complete data matrix without missing values. If we simply omit rows and columns with missing values from large data matrices, too much information will be lost, even if the missing values are few. For instance, with 1,000 rows and 50 columns and only 1% of the data missing, we would have 500 missing values, and if these are distributed uniformly over the matrix, almost no data would be left after omitting the affected rows and columns. For missing values, simple
but often-used methods like the replacement of a missing value with the row or column mean or median, or, for log-transformed data, the replacement of missing values with zeros, are not advisable: they do not take into account the correlative relations within the data. Better suited are methods that use only relevant, similar rows or columns for the mean or median determination, for instance the K-nearest neighbor (KNN) algorithm, which uses the k most similar rows or columns for the calculation (4). Other methods for missing value estimation are based on least squares methods (5) or PCA approaches (6). For an experimental comparison of different methods, the interested reader should consult refs. 7 and 8.
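
As a minimal sketch of how such imputation could be carried out in R (assuming the Bioconductor packages impute and pcaMethods are installed, and that mat is a hypothetical numeric matrix with missing values coded as NA):

# KNN-based imputation (cf. ref. 4) and PCA-based imputation (cf. ref. 6);
# 'mat' is a hypothetical data matrix with NAs, samples in rows.
library(impute)                                  # Bioconductor package
library(pcaMethods)                              # Bioconductor package
knn.res  <- impute.knn(mat, k = 10)              # use the k most similar rows
mat.knn  <- knn.res$data                         # completed matrix
ppca.res <- pca(mat, method = "ppca", nPcs = 3)  # probabilistic PCA model
mat.ppca <- completeObs(ppca.res)                # completed matrix
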
PCA is heavily influenced by outliers, and therefore the next
step after data normalization and missing value imputation should
be the removal of outliers. An easy and frequently used method is
the removal of all values that are more than three times the standard
deviation from the sample mean. This should be done in an iterative
manner, because the outlier itself influences both the sample mean
and the standard deviation. After removal of outliers, the values for
the outliers should be imputed again as described above. Even if an

outlier is not the result of technical variation, it is a good idea to remove it before the PCA. The reason is that the PCA result is otherwise dominated by noisy variables that contain outliers. The R-package pcaMethods can be used for outlier removal
and missing value imputation (6).
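
A minimal sketch of the iterative three-standard-deviation rule described above; the flagged values would afterwards be imputed as described before (mat is again a hypothetical numeric data matrix):

# Iteratively set values more than n.sd standard deviations away from the mean
# to NA; mean and sd are recomputed because outliers influence both of them.
remove.outliers <- function(x, n.sd = 3) {
  repeat {
    m   <- mean(x, na.rm = TRUE)
    s   <- sd(x, na.rm = TRUE)
    out <- which(!is.na(x) & abs(x - m) > n.sd * s)
    if (length(out) == 0) break
    x[out] <- NA                  # flag outliers for later imputation
  }
  x
}
mat.clean <- apply(mat, 2, remove.outliers)   # apply column by column
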
After normalization, some kind of data transformation is needed, because variables within a dataset often differ in their values by orders of magnitude. If, for example, the height of humans is recorded in meters instead of centimeters, the variance for this variable will be much lower than the variance of the weight recorded in kilograms for the same people. In order to give both variables an equal chance of contributing to the analysis, they need to be standardized. The technique mostly used for this purpose is called "scaling to unit variance": to determine the individual unit variance value, the so-called z-score (z_i), the mean of this variable (m) is subtracted from each original value (x_i) and the difference is then divided by the standard deviation (s_x): z_i = (x_i − m)/s_x.
After scaling all variables to have a variance of one, the covariance matrix equals the correlation matrix. One disadvantage of this approach is that low-level values, for instance values only slightly larger than the background, get a high impact on the resulting PCA due to their large relative scatter.
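
As a minimal sketch, the z-score computation by hand and R's built-in scale() function give the same result (mat is a hypothetical numeric data matrix with samples in rows and variables in columns):

# Scaling to unit variance: manual z-scores versus scale()
z.manual <- apply(mat, 2, function(x) (x - mean(x)) / sd(x))
z.scaled <- scale(mat)                                  # centers and divides by sd
all.equal(as.numeric(z.manual), as.numeric(z.scaled))   # TRUE
apply(z.scaled, 2, var)                                 # every variable now has variance 1
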
Other data transformations that might replace or precede the
scaling procedure are log-transformations. In case zeroes exist in
the dataset, a positive constant (e.g., 1) must be added to all
elements of the data matrix, and then log-transformation can be
performed. If there are negative values, the asinh-transformation is
an option. Log-transformation of the data can bring a nonnormal
data distribution closer to being normally distributed. This allows
the usage of more powerful parametric methods for such data.
Often the individual values are transformed to fold-changes by dividing them by the mean or median of this variable. With this approach the data also need to be log-transformed to ensure a normal distribution and a centering of the data at zero.
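
A minimal sketch of these transformations (mat is again a hypothetical numeric data matrix; the added constant and the log base are illustrative choices, not fixed rules):

# Log-transformation after adding a positive constant (here 1) for data with zeros,
# and the asinh-transformation for data that may contain negative values
mat.log   <- log2(mat + 1)
mat.asinh <- asinh(mat)
# Fold-changes relative to the per-variable median, then log-transformed
# (assuming strictly positive values)
mat.fc    <- log2(sweep(mat, 2, apply(mat, 2, median), FUN = "/"))
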
Scaling and log-transformation are illustrated using measurements for 26 students: their height in centimeters, their weight in kilograms, and their statistics course grade as a numerical value between 1 and 5. The influence of scaling and log-transformation
on the original data distribution for each variable is shown in Fig. 2.
In Fig. 3, the data for each individual in our example dataset before
and after scaling are visualized using a pairs plot.
A step often required for later analysis, e.g., testing for differ-
entially expressed genes, is filtering to exclude variables that are not
altered between samples or that generally have low abundance.
PCA is well suited to ignore nonrelevant variables in the datasets.
For better compatibility with subsequent analyses like inferential

Fig. 2. Comparison of original, log-transformed, and scaled data.

tests, a filtering procedure may also be applied before performing the PCA. The problem that, after scaling, low-level and sometimes noisy values get an equal impact on the PCA can be diminished if a filtering step is introduced. In our examples we use the standard scaling procedure without outlier removal and without any further data transformation or filtering steps.

2.2. Principal Components Analysis

We next demonstrate the PCA using the example of the students dataset, with the variables "cm," "kg," and "grade." The students are the samples in this case. As can be seen in Table 1, the height, measured in cm, and the weight, measured in kg, have a higher variance in comparison to the grade, which ranges from 1 to 5. After scaling to unit variance, it can be seen in Table 2 that all variables have a variance of one. The weight and the height of our sample students have a larger covariance than that between grade and either weight or height. Remember that the covariance between variables indicates the degree of correlation between two

Fig. 3. Pairs plot of unscaled (upper triangle) and scaled (lower triangle) data. Individuals can be identified by their letter
codes.

Table 1
Covariance matrix for the original data. The diagonal contains the variances

          cm      kg   Grade
cm     47.54   36.90    0.11
kg     36.90  123.18    0.84
Grade   0.11    0.84    0.62

variables when the variables are scaled to have unit variance. High positive values denote a high degree of positive correlation, and large negative values indicate a high degree of negative correlation. With unit variance, a covariance near zero means no

Table 2
Covariance/correlation matrix for scaled data

         cm    kg  Grade
cm     1.00  0.48   0.02
kg     0.48  1.00   0.10
Grade  0.02  0.10   1.00

Table 3
Covariance matrix for the principal components. The matrix is diagonal; all off-diagonal values are zero, which means that the principal components are uncorrelated with each other

      PC1   PC2   PC3
PC1  1.49  0.00  0.00
PC2  0.00  1.01  0.00
PC3  0.00  0.00  0.50

correlation between the variables. In the example, weight and height are correlated, but the grades are, of course, not correlated with the weight and height of the students.

2.2.1. Mathematical Background

Mathematically, PCA uses the symmetric covariance matrix to determine the principal components (PCs). The PCs are the eigenvectors of the covariance matrix, and the eigenvector with the largest eigenvalue is the first principal component. The eigenvector with the second largest eigenvalue is the second principal component, and so on. Principal components are uncorrelated with each other, as the covariance matrix in Table 3 shows. The following code reads in the data from a Web site and performs the calculation of the eigenvectors and their eigenvalues directly using R.
> students = read.table('http://cdn.bitbucket.org/mittelmark/r-code/downloads/survey.min.tab', header=TRUE)
> eigen.res = eigen(cov(scale(students)))
> eigen.res
$values
[1] 1.4880792 1.0077034 0.5042174

$vectors
           [,1]        [,2]       [,3]
[1,]  0.6961980  0.19381501  0.6911904
[2,]  0.7093489 -0.03800063 -0.7038324
[3,] -0.1101476  0.98030184 -0.1639384

> eigen.res$values/sum(eigen.res$values)
[1] 0.4960264 0.3359011 0.1680725

After scaling, each variable contributes equally to the overall variance; with three variables, each contributes one-third of the total variation in the dataset. However, the largest eigenvalue of the covariance matrix is around 1.5, which means that the first principal component contains around 50% of the overall variation in the dataset, and the second component still around 34% of the total variation. As shown in the last line of the code example, the exact proportions can be obtained by dividing the eigenvalues by the sum of all eigenvalues.
The second element of the result of the eigen calculation in R ($vectors) contains the loading vectors. Their columns hold the values for each principal component, and their rows the values for the variables belonging to the eigenvectors: "cm," "kg," and "grade" in this case. A large absolute loading value means that the variable contributes much to this principal component. The variables "cm" (first row) and "kg" (second row) contribute mostly to components PC1 (first column) and PC3 (third column), whereas the variable "grade" (third row) contributes mostly to PC2 (second column).
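
For readability, the eigenvector matrix can be labeled with the variable and component names; a small sketch assuming the eigen.res object and the students data from above:

# Label rows (variables) and columns (components) of the loading matrix
rownames(eigen.res$vectors) <- colnames(students)
colnames(eigen.res$vectors) <- paste("PC", 1:ncol(eigen.res$vectors), sep = "")
eigen.res$vectors            # loadings of cm, kg, and grade on PC1-PC3
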
Using the prcomp function of R, the same calculations can be done in a more straightforward manner. We create an object called "pcx" from the scaled data. This object contains the variable loadings in a table called "rotation," and the coordinates of the individuals inside the new coordinate system of principal components in a table "x." The latter are also called scores; the loadings, in turn, indicate how the original variables correlate with a component. The summary command for the pcx object shows the contribution of the most important components to the total variance.
> pcx = prcomp(scale(students))
> summary(pcx)
Importance of components:
                         PC1   PC2   PC3
Standard deviation     1.220 1.004 0.710
Proportion of Variance 0.496 0.336 0.168
Cumulative Proportion  0.496 0.832 1.000
> pcx$rotation
             PC1         PC2        PC3
cm     0.6961980 -0.19381501 -0.6911904
kg     0.7093489  0.03800063  0.7038324
grade -0.1101476 -0.98030184  0.1639384
> head(pcx$x, n = 4)

           PC1        PC2        PC3
A -1.482874036  0.9159609  0.2510519
B -0.005895547 -0.8607660  0.1385797
C -0.714299682  0.3420358 -0.4925597
D -1.280122071  0.3014947  0.2079020

Fig. 4. Common plots for PCA visualization. (a) Screeplot for the first few components using a bar plot; (b) screeplot using lines; (c) biplot showing the most relevant loading vectors; (d) correlation plot.

The variances (eigenvalues) of the first few components are often plotted in a so-called screeplot to show how the variance of the principal components decreases with additional components, as shown in Fig. 4a, b. Often, even when a dataset consists of thousands of variables, the majority of the variance is in the first few components. To investigate how the variables contribute to the loading vectors, a biplot and a correlation plot can be used. The biplot shows both the position of the samples in the new coordinate space and the loading vectors of the original variables in the new coordinate system (Fig. 4c). Often the number of variables shown is limited to a few, mostly restricted to those that correlate best with the main principal components. In this way, biplots can uncover the

correspondence between variables and samples and identify samples with similar variable values. A correlation plot can show how well a given variable correlates with a principal component (Fig. 4d). In the students example dataset we can see that the variables "kg" and "cm" correlate well with the first component, whereas the "grade," while not correlated with the other variables, correlates perfectly with the second component. Here, PC1 represents something like the general size, i.e., weight and height, whereas PC2 perfectly represents the course grade. To illustrate this we can examine the individual values for some students: on the right side of Fig. 4c are the large students, and on the upper side are the students with a high grade.
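
Similar plots can also be obtained directly with base R functions; a minimal sketch using the pcx object from above:

# Base-R versions of the standard PCA plots
screeplot(pcx)                   # component variances as bars (cf. Fig. 4a)
screeplot(pcx, type = "lines")   # the same information as a line plot (cf. Fig. 4b)
biplot(pcx)                      # samples plus loading vectors (cf. Fig. 4c)
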
If we compare the biplot with the original data in Fig. 2, we see that students with higher (Z, Y) and lower grades (U, W), and with larger (H, J, F, E) and smaller sizes (S, N, G), are nicely shown in the 2D space of the biplot. As PCA performs dimension reduction, there are some guidelines on how many components should be investigated. For example, the components that cumulatively account for at least 90% of the total variance can be used. Alternatively, components that account for less variance than the original variables (unit variance) can be disregarded.
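
A minimal sketch of these two rules of thumb, using the prcomp object pcx from the students example:

# Number of components by cumulative variance and by the unit-variance criterion
ev     <- pcx$sdev^2             # component variances (eigenvalues)
cumvar <- cumsum(ev) / sum(ev)   # cumulative proportion of total variance
which(cumvar >= 0.9)[1]          # components needed for at least 90% of the variance
sum(ev >= 1)                     # components explaining at least unit variance
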

2.2.2. Geometrical Illustration

An intuitive geometrical explanation of PCA is a rotation of the original data space. For three variables, imagine finding a viewpoint outside of a three-dimensional visualization of the data points that maximizes the spread of their projection onto a two-dimensional surface; the angles chosen for this viewpoint represent the new component system (Fig. 5). We illustrate this with just two variables, the weight and height of our students. The data are shown in Fig. 6a in an xy-plot. To project the 2D space into a new coordinate system, we draw a line onto the xy-plot which shows the vector of our first component. The second component is always orthogonal to the first. We can see that after projecting the data into the new coordinate system the first component contains much more variance than the second component (Fig. 6b). Using just the columns for the weight and height, we see that the first component now contains

Fig. 5. Illustrative projection of a three-dimensional data point cloud onto a two-dimensional surface.

Fig. 6. PCA plots using the students' height and weight data. (a) Scaled data; (b) data projected into the new coordinate system of principal components; (c) screeplot of the two resulting PCs; (d) biplot showing the loading vectors of the original variables.

around 66% of the variance, whereas the second component now contains 34% of the variance. This is nicely visualized in the screeplot shown in Fig. 6c.
The principal components in this example have a certain meaning. The first component represents the general size of the students, where size is not restricted to the height but also includes the width, or weight. The second component could be interpreted as the body mass index (BMI): people with a high value for PC2 have a larger BMI than people with a low PC2 value. The biplot in Fig. 6d also visualizes the loadings of the original variables in the new component space. We can now also determine some important properties of the different subjects regarding their BMI. For instance, we can assume that students F and E are large but neither too light nor too heavy for their height. In contrast, student H is quite heavy for his/her height. On the left are smaller students (for example Z) who do not vary greatly in their BMI; student Y, in contrast, has a low BMI. The original and the scaled data for all students can be seen in Fig. 3.

PCA can be performed either on the variables or on the samples. Generally only one type of PCA has to be performed. PCA on variables focuses on the correlations between the variables, whereas PCA on samples examines the correlations between the samples. In our examples we perform a PCA on the variables. To switch between both modes, the data matrix just has to be transposed (see the code sketch below). If you perform a serial experiment in which just one or two parameters are changed, for example time and concentration, you will perform a PCA on the variables, whereas if you have many replicates for few conditions you will do a PCA on the samples. For a PCA on samples with many variables, the eigenvalues often do not contain much information, as there are too many of them. In this case it is advisable to try to group related variables together.
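
A minimal sketch of switching between the two modes, assuming the students data from above:

# PCA on the variables (samples in rows) versus PCA on the samples (transposed matrix)
pca.on.variables <- prcomp(scale(students))
pca.on.samples   <- prcomp(scale(t(students)))
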

3. Biological Examples

PCA was first applied to microarray data in 2000 (9, 10) and has been reviewed in this context (11). We decided to choose other types of data to illustrate PCA. First we use a dataset that deals with the transformation of qualitative data into numerical data: codon frequencies from various taxa will be used to evaluate their main codon differences with a PCA. Next, we use data from a recent study about principal changes in the Escherichia coli (E. coli) metabolism after applying stress conditions (12). We extend that data analysis with visualizations that PCA enables in order to better understand the E. coli primary stress metabolism.

3.1. Sequence Data Analysis

Here we use PCA to demonstrate the well-known fact that the codon usage differs between the protein-coding genes of different organisms. Genome sequences for many taxa are freely available from different online resources. We take just five genomes for this example, although we could easily have taken 50 or 500: Arabidopsis thaliana (a higher plant), Caenorhabditis elegans (a nematode), Drosophila melanogaster (the fruit fly), Canis familiaris (the dog), and Saccharomyces cerevisiae (yeast). For each of these genomes, we only use protein-coding genes and end up with about 33,000 (plant), 24,000 (nematode), 15,000 (dog), 18,000 (fruit fly), and 6,000 (yeast) genes.
For one species at a time, we then record for each of these gene sequences how many times each of the 64 possible codons is used. The data to be analyzed for interesting patterns is therefore a 5 × 64 matrix. It describes the combined codon usage of all protein-coding genes for each of the five taxa. An abbreviated and transposed version of this matrix is shown below; note that absolute frequencies were recorded.

> codons = read.table('http://bitbucket.org/mittelmark/r-code/downloads/codonsTaxa.txt', header=TRUE, row.names=1)
> head(codons[,c(1:3,62:64)])
        AAA    AAC    AAG    TTC    TTG    TTT
ATHA 419472 275641 432728 269345 285985 299032
CELE 394707 192068 272375 249129 210524 240164
CFAM 243989 187183 314228 205704 134896 186681
DMEL 185136 280003 417111 227142 173632 141944
SSCE 124665  72230  88924  52059  77227  76198

The visualization of such a matrix in a pairs plot, as we did before, is generally impractical due to the large number of variables. Instead, we perform a PCA, again using the software package R and the R-script ma.pca.r. After importing the matrix into R and loading ma.pca.r, a new ma.pca object can be created with the command ma.pca$new(codons) (assuming the data is stored in a variable codons). This automatically performs a PCA on the data, and the internal object created by the prcomp function of R can be used for plotting and analysis. As can be seen in Fig. 7, the first four components carry almost all the variance of the dataset. The command ma.pca$biplot(), for example, generates the biplot shown in Fig. 7b, which here displays the five most important codons that differentiate between the five taxa in the first two

Fig. 7. Codon data. (a) Screeplot of the PCs' variances; (b) biplot of the first two PCs and the most important loadings; (c) correlation plot for all variables.

principal components. The command to get a list of these codons is shown below:

> source("http://bitbucket.org/mittelmark/r-code/downloads/ma.pca.r")
> codons = t(scale(t(codons)))   # transpose and scale
> ma.pca$new(codons)
> ma.pca$screeplot()             # Fig. 7a
> ma.pca$biplot(top=5)           # Fig. 7b
> ma.pca$corplot()               # Fig. 7c
> ma.pca$getMainLoadings("PC1")[1:5]
[1] "CAG" "CAC" "ACT" "GCC" "CAT"

Figure 7b shows that the frequencies of the two sets of codons shown in grey (CAT, GCC, ACT, CAC, CAG) correlate best with the first principal component, and they are therefore especially useful for distinguishing the plant, the yeast, and the nematode from the dog and the fruit fly. Similarly, the codons in black font (CCT, GGG, TCG, ACG, TGG) are responsible for the separation of the five taxa along the second principal component.
In addition to the biplot, a correlation plot can be generated. The command ma.pca$corplot() produces the plot shown in Fig. 7c and displays the individual correlation of each variable with the principal components. Finally, a summary of the PCA can be printed to the R-terminal. This shows the amount of variance in the individual components:
> ma.pca$summary()
Importance of components:
                         PC1   PC2   PC3    PC4      PC5
Standard deviation     6.128 3.497 3.259 1.8980 2.13e-15
Proportion of Variance 0.587 0.191 0.166 0.0563 0.00e+00
Cumulative Proportion  0.587 0.778 0.944 1.0000 1.00e+00

Almost 60% of the total variance is in the first component, which agrees nicely with the correlation plot in Fig. 7c, showing that most codons correlate either positively or negatively with PC1.

3.2. Metabolite Data Analysis

In this example we employ PCA to analyze the system-level stress adjustments in the response of E. coli to five different perturbations. We make use of time-resolved metabolite measurements to get a detailed understanding of the successive events following heat- and cold-shock, oxidative stress, lactose diauxie, and stationary phase. A previous analysis of the metabolite data, together with transcript data measured under the exact same perturbations and time-points, was able to show that E. coli's response on the metabolic level shows a higher degree of specificity compared with the general response observed on the transcript level. Furthermore, this specificity is even more prominent during the early stress adaptation phase (12).
The lactose diauxie experiment describes two distinct growth phases of E. coli. Those two growth phases are characterized by the exclusive use of either of two carbon sources: first glucose and then, upon depletion of the former in the media, lactose. Stationary phase describes the timeframe in which E. coli stops growing because nutrient concentrations become limiting. Furthermore, because of the increased cell density, stationary phase is characterized by hypoxia, i.e., low oxygen levels. The dataset considered in this example consists of metabolite concentrations measured with gas chromatography mass spectrometry (GC-MS). The samples were obtained for each experimental condition at time points 10–50 min post-perturbation, plus an additional control time-series. Each experimental condition was independently repeated three times, and the measurements reported consist of the median of those three measurements per condition and time-point. An analysis of the obtained spectra led to the identification of 188 metabolites, of which 95 could be positively identified (58 metabolic profiles could be chemically classified and 35 remain of unknown structure). A detailed treatment of the extraction and data normalization procedures can be found in ref. 12.
Out of the 95 experimentally identified metabolites, we select 11 metabolites from E. coli's primary metabolism for the PCA (Fig. 8). The reasoning for this selection is the following: the response of the metabolism following a perturbation is characterized by E. coli's general strategy of energy conservation, which is expected to be reflected by a rapid decrease of central carbon metabolism intermediates. From the literature (13) we know that on the genome level this energy conservation coincides with a down-regulation of genes related to cell growth.
We create a data frame “metabolites” in which each row repre-
sents a measurement for a certain experimental condition and time-
point. This amounts to 37 conditions:
> metabolites = read.table('http://bitbucket.org/mittelmark/r-code/downloads/primary-metabolism-ecoli.tab', header=TRUE)
> metabolites[c(1:3,35:38),1:5]
       X2KeGuAc   SuAc   FuAc   MaAc X6PGAc
cold_0   0.0055 0.0200 0.4507 0.0936 0.0086
cold_1   0.0038 0.0219 0.3794 0.0559 0.0109
cold_2   0.0053 0.0285 0.3311 0.0619 0.0105
stat_2   0.0307 0.1680 1.5829 0.2729 0.0374
stat_3   0.0997 0.2824 2.7050 0.3279 0.0217
stat_4   0.0495 0.1085 1.4768 0.2568 0.0141
stat_5   0.0850 0.1086 1.2772 0.1875 0.0126
> dim(metabolites)
[1] 38 11

Fig. 8. Overview of E. coli's primary metabolism. Metabolites for which concentrations were measured are denoted in bold.

Here, for example, cold_2 denotes the measurement for the second time-point (i.e., 20 min after application of the cold-stress) for E. coli cells treated with cold-stress. Each such condition is characterized by 11 entries or observations (the columns of our data frame), which are given by the 11 metabolite concentrations measured.
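
A biplot like the one discussed next (Fig. 9) can also be obtained with base R; a minimal sketch using the metabolites data frame from above (details of the layout will differ from the published figure):

# PCA of the metabolite profiles and a simple biplot of conditions and loadings
pcx.met <- prcomp(scale(metabolites))
summary(pcx.met)   # proportion of variance in PC1 and PC2
biplot(pcx.met)    # conditions/time-points plus metabolite arrows
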
Figure 9 shows a biplot of all 37 different conditions and their respective measurement time-points. We project the conditions onto the axes defined by the first and second principal components, which together capture 79% of the total variation in the dataset. It is directly visible that those two components are enough to discriminate the type of experimental treatment, as well as to discriminate the time within a condition.
Lactose diauxie and stationary phase both show a larger distance from the origin than any of the other stresses. Clearly, both conditions are characterized by either depletion (stationary phase) or change of the primary carbon source (lactose diauxie). Naturally we would expect that to have a huge impact on E. coli's primary metabolism as a result of changes in the corresponding metabolite levels. Out of the three stress conditions, the cold-shock measurements are the closest to the control time-points. Again, this relates to the fact

Fig. 9. Biplot of experimental conditions and their respective time-points.

that cold-shock is the physiologically mildest stress compared to heat-shock and oxidative stress.
In the origin of the PCA plot we find the early time-points from control, heat, cold, and oxidative stress. Most likely this can be attributed to the fact that stress adaptation is often not instantaneous and thus not immediately reflected on the metabolic level.
Notable exceptions are the 10 min measurements for lactose diauxie and stationary phase. For heat stress we can observe that, the further time progresses, the greater the distance of the measurements from the origin. However, this trend is reversed for the late stationary phase and lactose diauxic shift measurements (stat_5 and lac_4), as those time-points move back closer to the origin. One possible explanation is that E. coli has (partially) adapted to the new nutrient conditions, and the metabolic profile is again closer to the control condition.
Finally, we examine the metabolite levels that are important for the discrimination of the time-points: the arrows in the biplot indicate which metabolites have a dominant effect on the two principal components. Since the direction of the arrows points towards the time-points from stationary phase, we can assume an increase of the metabolites associated with these arrows.

Fig. 10. Metabolite concentrations of conditions and different time-points. Within each time-series, each metabolite
concentration is normalized to preperturbation levels.

Indeed, an investigation of the metabolite concentrations (Fig. 10) shows a general decrease of the primary metabolites in the cold-, heat-, and oxidative-stress conditions, a strong increase for stationary phase, and a medium increase for the lactose shift. Decreased levels of, for example, phosphoenolpyruvic acid (PEP) and glyceric acid-3-phosphate (GlAc3Ph) from glycolysis are dominant effects of stress application. This finding is in accordance with the previously mentioned energy conservation strategy.
The pronounced and counter-intuitive increase of the TCA-cycle intermediates 2-ketoglutaric acid (2KeGlu), succinic acid (SuAc), and malic acid (MaAc) can be explained by the previously mentioned increase in bacterial culture density under stationary phase, which results in a shift from aerobic to micro-aerobic (hypoxic) conditions. The lack of oxygen triggers a number of adjustments of the activity of TCA-cycle enzymes with the aim of providing an alternative electron acceptor for cellular respiration. Briefly, this increase of TCA-cycle intermediates arises from a repression of the enzyme 2-ketoglutarate dehydrogenase, which normally converts 2-ketoglutaric acid to succinyl-CoA, with the result of an accumulation of 2-ketoglutaric acid. A subsequent replacement of succinate dehydrogenase activity by fumarate reductase allows the usage of fumarate (FuAc) as an alternative electron acceptor.
This in turn leads to an accumulation of succinic acid, which cannot be metabolized further and is excreted from the cell. Finally, the accumulation of malic acid can be interpreted as an effect of a change in metabolic flux towards the malate, fumarate, and succinate branch of the TCA cycle, forced by the increased demand for fumarate production for use as an electron acceptor.

4. PCA Improvements and Alternatives

PCA is an excellent method for finding orthogonal directions that correspond to maximum variance. Datasets can, of course, contain other types of structure that PCA is not designed to detect. For example, the largest variations might not be of the greatest biological importance. This is a problem which cannot easily be solved, as it requires knowledge of the biology behind the data. In this case it may be important to remove outliers to minimize the effect of single values on the overall outcome. Outlier-insensitive PCA algorithms such as robust (14) or weighted PCA (15) are available, as is an R package, rrcov (16), which can be used to apply some of these advanced PCA methods to a dataset. The package provides the function PcaCov, which calls robust estimators of covariance.
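
A minimal sketch of a robust PCA (assuming the rrcov package is installed; the students data from above serve only as an illustrative input):

# Robust PCA via robust covariance estimation (PcaCov from the rrcov package)
library(rrcov)
rpc <- PcaCov(scale(students))
summary(rpc)   # variances of the robust principal components
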
In datasets with many variables it is sometimes difficult to obtain a general description of a certain component. For this purpose, e.g., in microarray analysis, the enrichment of certain ontology terms among the variables contributing most to a component is often used to get an impression of what the component is actually representing (17).
A problem with PCA is sometimes that the components, although uncorrelated, may still be statistically dependent, and they are constrained to be orthogonal to each other. Independent components analysis (ICA) does not have this shortcoming. Some authors have found that ICA outperforms PCA (18), other authors have found the opposite (19, 20). Which method is best in practice depends on the actual data structure, and ICA is in some cases a possible alternative to PCA. The fastICA algorithm can be used for this purpose (21, 22). Because ICA does not reduce the number of variables as PCA does, ICA can be used in conjunction with PCA to get a decreased number of variables to consider. For instance, it has been shown that ICA, when performed on the first few principal components, i.e., on the results of a preceding PCA, can improve the sample differentiation (23).
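
A minimal sketch of this combined approach (assuming the fastICA package is installed; the students data from above are used only for illustration):

# ICA applied to the scores of the first two principal components
library(fastICA)
pcs <- prcomp(scale(students))$x[, 1:2]   # PCA scores as input for ICA
ica <- fastICA(pcs, n.comp = 2)           # estimate two independent components
head(ica$S)                               # estimated independent source signals
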
Higher-order dependencies, for instance data that scatter in a ringlike manner around a certain point, are sometimes difficult to resolve with standard PCA, and a nonlinear approach may be required that first transforms the data into a new coordinate system. This approach is sometimes called kernel PCA (24, 25). To obtain deeper insights into the relevant variables required to differentiate between the samples, factor analysis might be a better choice.

Where PCA tries to find a projection of one set of points into a lower-dimensional space, canonical correlation analysis (CCA, (26)) extends PCA in that it tries to find a projection of two sets of corresponding points. An example where CCA could be applied is a dataset consisting of one data matrix carrying gene expression data and another carrying metabolite data. There is an R package which can be used to perform simple correspondence analysis as well as CCA (27).

5. Availability of R-Code

The example data and the R-code required to create the graphics of this article are available at the webpage http://bitbucket.org/mittelmark/r-code/wiki/Home.
The script file ma.pca.r contains some functions which can be used to simplify data analysis using R. The data and functions of the ma.pca object can be investigated by typing the ls(ma.pca) command. Some of the most important functions and objects are:
- ma.pca$new(data)—performs a new PCA analysis on data, needs to be called first
- ma.pca$summary()—returns a summary, with the variances for the most important components
- ma.pca$scores—the positions of the data points in the new coordinate system
- ma.pca$loadings—numerical values describing the amount each variable contributes to a certain component
- ma.pca$plot()—a pairs plot for the most important components, with the % of variance in the diagonal
- ma.pca$biplot()—produces a biplot for the samples and for the most important variables
- ma.pca$corplot()—produces a correlation plot for all variables on selected components
- ma.pca$screeplot()—produces an improved screeplot for the PCA
These functions have different parameters; for example, components other than the first two can be chosen for plotting with the pcs argument. For instance, ma.pca$corplot(pcs=c('PC2','PC3'), cex=1.2) would plot the second versus the third component and slightly enlarge the text labels. To get comfortable with the functions, users should study the material on the project website and the R source code.

Acknowledgments

We thank Kristin Feher for carefully reviewing our manuscript.

References
1. Hotelling H (1933) Analysis of complex statistical variables into principal components. J Educ Psychol 24:417–441 and 498–520
2. Quackenbush J (2002) Microarray data normalization and transformation. Nat Genet 32(Suppl):496–501
3. Steinfath M, Groth D, Lisec J, Selbig J (2008) Metabolite profile analysis: from raw data to regression and classification. Physiol Plant 132:150–161
4. Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13:21–27
5. Bo TM, Dysvik B, Jonassen I (2004) LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Res 32:e34
6. Stacklies W, Redestig H, Scholz M et al (2007) pcaMethods—a bioconductor package providing PCA methods for incomplete data. Bioinformatics 23:1164–1167
7. Troyanskaya O, Cantor M, Sherlock G et al (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17:520–525
8. Celton M, Malpertuy A, Lelandais G, de Brevern AG (2010) Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments. BMC Genomics 11:15
9. Alter O, Brown PO, Botstein D (2000) Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 97:10101–10106
10. Alter O, Brown PO, Botstein D (2003) Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proc Natl Acad Sci USA 100:3351–3356
11. Quackenbush J (2001) Computational analysis of microarray data. Nat Rev Genet 2:418–427
12. Jozefczuk S, Klie S, Catchpole G et al (2010) Metabolomic and transcriptomic stress response of Escherichia coli. Mol Syst Biol 6:364
13. Gasch AP, Spellman PT, Kao CM et al (2000) Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 11:4241–4257
14. Hubert M, Engelen S (2004) Robust PCA and classification in biosciences. Bioinformatics 20:1728–1736
15. Kriegel HP, Kröger P, Schubert E, Zimek A (2008) A general framework for increasing the robustness of PCA-based correlation clustering algorithms. In: Ludäscher B, Mamoulis N (eds) Scientific and statistical database management. Springer, Berlin
16. Todorov V, Filzmoser P (2009) An object-oriented framework for robust multivariate analysis. J Stat Softw 32:1–47
17. Ma S, Kosorok MR (2009) Identification of differential gene pathways with principal component analysis. Bioinformatics 25:882–889
18. Draper BA, Baek K, Bartlett MS, Beveridge JR (2003) Recognizing faces with PCA and ICA. Comput Vis Image Understand 91:115–137
19. Virtanen J, Noponen T, Meriläinen P (2009) Comparison of principal and independent component analysis in removing extracerebral interference from near-infrared spectroscopy signals. J Biomed Opt 14:054032
20. Baek K, Draper BA, Beveridge JR, She K (2002) PCA vs. ICA: a comparison on the FERET data set. In: Proc of the 4th Intern Conf on Computer Vision, ICCV 20190, pp 824–827
21. Hyvärinen A (1999) Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans Neural Netw 10:626–634
22. Marchini JL, Heaton C, Ripley BD (2009) fastICA: FastICA algorithms to perform ICA and projection pursuit. http://cran.r-project.org/web/packages/fastICA
23. Scholz M, Selbig J (2007) Visualization and analysis of molecular data. Methods Mol Biol 358:87–104
24. Scholz M, Kaplan F, Guy CL et al (2005) Non-linear PCA: a missing data approach. Bioinformatics 21:3887–3895
25. Schölkopf B, Smola A, Müller KR (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10:1299–1319
26. Hotelling H (1936) Relations between two sets of variates. Biometrika 28:321–377
27. de Leeuw J, Mair P (2009) Simple and canonical correspondence analysis using the R package anacor. J Stat Softw 31:1–18
