A Tutorial On Principal Component Analysis
Jonathon Shlens
Google Research
Mountain View, CA 94043
Electronic address: jonathon.shlens@gmail.com
I. INTRODUCTION
Principal component analysis (PCA) is a standard tool in modern data analysis - in diverse fields from neuroscience to computer graphics - because it is a simple, non-parametric method
for extracting relevant information from confusing data sets.
With minimal effort PCA provides a roadmap for how to reduce a complex data set to a lower dimension to reveal the
sometimes hidden, simplified structures that often underlie it.
Take for example a simple toy problem from physics diagrammed in Figure 1. Pretend we are studying the motion
of the physicist's ideal spring. This system consists of a ball
of mass m attached to a massless, frictionless spring. The ball
is released a small distance away from equilibrium (i.e. the
spring is stretched). Because the spring is ideal, it oscillates
indefinitely along the x-axis about its equilibrium at a set frequency.
This is a standard problem in physics in which the motion
along the x direction is solved by an explicit function of time.
In other words, the underlying dynamics can be expressed as
a function of a single variable x.
However, being ignorant experimenters we do not know any
of this. We do not know which, let alone how many, axes
and dimensions are important to measure. Thus, we decide to
measure the ball's position in a three-dimensional space (since
we live in a three dimensional world). Specifically, we place
three movie cameras around our system of interest. At 120 Hz
each movie camera records an image indicating a two-dimensional position of the ball (a projection). Unfortunately, because of our ignorance, we do not even know what the real
x, y and z axes are, so we choose three camera positions $\vec{a}$, $\vec{b}$ and $\vec{c}$
at some arbitrary angles with respect to the system. The angles
between our measurements might not even be 90°! Now, we
record with the cameras for several minutes. The big question
remains: how do we get from this data set to a simple equation
of x?

FIG. 1 A toy example. The position of a ball attached to an oscillating spring is recorded using three cameras A, B and C. The position of the ball tracked by each camera is depicted in each panel below.
We know a priori that if we were smart experimenters, we
would have just measured the position along the x-axis with
one camera. But this is not what happens in the real world.
We often do not know which measurements best reflect the
dynamics of our system in question. Furthermore, we sometimes record more dimensions than we actually need.
Also, we have to deal with that pesky, real-world problem of
noise. In the toy example this means that we need to deal
with air, imperfect cameras or even friction in a less-than-ideal
spring. Noise contaminates our data set only serving to obfuscate the dynamics further. This toy example is the challenge
experimenters face every day. Keep this example in mind as
we delve further into abstract concepts. Hopefully, by the end
of this paper we will have a good understanding of how to
systematically extract x using principal component analysis.
A. A Naive Basis

One sample or trial from our toy example can be expressed as a 6-dimensional column vector

$$\vec{X} = \begin{bmatrix} x_A \\ y_A \\ x_B \\ y_B \\ x_C \\ y_C \end{bmatrix}$$

where each camera contributes a 2-dimensional projection of the ball's position to the entire vector $\vec{X}$. If we record the ball's position for 10 minutes at 120 Hz, then we have recorded 10 × 60 × 120 = 72000 of these vectors.

With this concrete example, let us recast this problem in abstract terms. Each sample $\vec{X}$ is an m-dimensional vector,
where m is the number of measurement types. Equivalently,
every sample is a vector that lies in an m-dimensional vector space spanned by some orthonormal basis. From linear
algebra we know that all measurement vectors form a linear
combination of this set of unit length basis vectors. What is
this orthonormal basis?
This question is usually a tacit assumption that is often overlooked.
Pretend we gathered our toy example data above, but only
looked at camera A. What is an orthonormal basis for (xA , yA )?
A naive choice
would be {(1, 0), (0, 1)}, but why select this basis over $\{(\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}), (\frac{-\sqrt{2}}{2}, \frac{\sqrt{2}}{2})\}$ or any other arbitrary rotation? The reason is that the naive basis reflects the method we gathered the data. Pretend we record the position (2, 2). We did not record $2\sqrt{2}$ in the $(\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2})$ direction and 0 in the perpendicular direction. Rather, we recorded the position (2, 2) on our camera meaning 2 units up and 2 units to the left in our camera window. Thus our original basis reflects the method we measured our data.
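As a quick numerical check of this point (an illustration added here, not part of the original text), the short MATLAB sketch below expresses the recorded point (2, 2) in both the naive basis and the rotated basis; in the rotated basis its coordinates are (2√2, 0).

% Minimal sketch: one recorded point expressed in two different orthonormal bases.
p = [2; 2];                        % the position as recorded in the naive (camera) basis
R = [ sqrt(2)/2  sqrt(2)/2;        % rows of R are the rotated basis vectors
     -sqrt(2)/2  sqrt(2)/2];
p_rotated = R * p;                 % coordinates in the rotated basis: [2*sqrt(2); 0]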
How do we express this naive basis in linear algebra? In the
two dimensional case, {(1, 0), (0, 1)} can be recast as individual row vectors. A matrix constructed out of these row vectors is the 2 × 2 identity matrix I. We can generalize this to the m-dimensional case by constructing an m × m identity matrix

$$B = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = I$$

where each row is an orthonormal basis vector b_i with m components. We can consider our naive basis as the effective starting point: all of our data has been recorded in this basis
and thus it can be trivially expressed as a linear combination of {b_i}.
B. Change of Basis
With this rigor we may now state more precisely what PCA
asks: Is there another basis, which is a linear combination of
the original basis, that best re-expresses our data set?
A close reader might have noticed the conspicuous addition of
the word linear. Indeed, PCA makes one stringent but powerful assumption: linearity. Linearity vastly simplifies the problem by restricting the set of potential bases. With this assumption PCA is now limited to re-expressing the data as a linear
combination of its basis vectors.
Let X be the original data set, where each column is a single sample (or moment in time) of our data set (i.e. $\vec{X}$). In the toy example X is an m × n matrix where m = 6 and n = 72000. Let Y be another m × n matrix related by a linear transformation P. X is the original recorded data set and Y is a new representation of that data set.

$$PX = Y \qquad (1)$$

The row vectors $\{p_1, \ldots, p_m\}$ of P are a set of new basis vectors for expressing the columns of X. This interpretation can be seen by writing out the explicit dot products of PX,

$$PX = \begin{bmatrix} p_1 \\ \vdots \\ p_m \end{bmatrix} \begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix}$$

$$Y = \begin{bmatrix} p_1 \cdot x_1 & \cdots & p_1 \cdot x_n \\ \vdots & \ddots & \vdots \\ p_m \cdot x_1 & \cdots & p_m \cdot x_n \end{bmatrix}$$

so that each column of Y has the form

$$y_i = \begin{bmatrix} p_1 \cdot x_i \\ \vdots \\ p_m \cdot x_i \end{bmatrix}$$

(In this section x_i and y_i are column vectors, but be forewarned: in all other sections x_i and y_i are row vectors.)

C. Questions Remaining

By assuming linearity the problem reduces to finding the appropriate change of basis. The row vectors $\{p_1, \ldots, p_m\}$ in this transformation will become the principal components of X. Several questions now arise.

- What is the best way to re-express X?
- What is a good choice of basis P?

IV. VARIANCE AND THE GOAL

Now comes the most important question: what does "best express" the data mean? This section will build up an intuitive answer to this question and along the way tack on additional assumptions.
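As a minimal illustration of Equation 1 (a sketch added here, not part of the original text, with illustrative variable names), the MATLAB lines below re-express a 2 × n block of data, standing in for camera A's recordings, in a basis rotated by 45 degrees; the rows of P are the new orthonormal basis vectors.

% Minimal sketch of the change of basis Y = PX.
X = randn(2, 1000);                 % toy stand-in for camera A's (x_A, y_A) recordings
theta = pi / 4;                     % rotate the basis by 45 degrees
P = [ cos(theta)  sin(theta);       % each row of P is one new basis vector
     -sin(theta)  cos(theta)];
Y = P * X;                          % the same data re-expressed in the new basis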
FIG. 2 Simulated data of (x, y) for camera A. The signal and noise variances σ²_signal and σ²_noise are graphically represented by the two lines subtending the cloud of data. Note that the largest direction of variance does not lie along the basis of the recording (x_A, y_A) but rather along the best-fit line.

FIG. 3 [Figure residue: two panels labeled "low redundancy" and "high redundancy", plotting one measurement type r_1 against another r_2.]
B. Redundancy

Figure 3 hints at a second confounding factor in our data: redundancy. Two measurement types r_1 and r_2 may range from completely uncorrelated (low redundancy) to strongly correlated (high redundancy), in which case recording both measurements adds little information beyond recording one.

C. Covariance Matrix

Consider two sets of measurements with zero means, A = {a_1, ..., a_n} and B = {b_1, ..., b_n}. The variances of A and B are individually defined as

$$\sigma^2_A = \frac{1}{n}\sum_i a_i^2, \qquad \sigma^2_B = \frac{1}{n}\sum_i b_i^2$$

and the covariance between A and B is a straightforward generalization,

$$\sigma^2_{AB} = \frac{1}{n}\sum_i a_i b_i \qquad (2)$$

Note that σ²_AB = σ²_A if A = B.¹

¹ Note that in practice, the covariance σ²_AB is calculated as (1/(n−1)) Σ_i a_i b_i. The slight change in normalization constant arises from estimation theory, but that is beyond the scope of this tutorial.

Finally, we can generalize from two measurement types to an arbitrary number m by collecting the measurements into the rows of a single matrix,

$$X = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix}$$

One interpretation of X is the following. Each row of X corresponds to all measurements of a particular type. Each column of X corresponds to a set of measurements from one particular trial (this is $\vec{X}$ from section 3.1). We now arrive at a definition for the covariance matrix C_X,

$$C_X \equiv \frac{1}{n} X X^T.$$

The ijth element of C_X is the dot product between the vector of the ith measurement type and the vector of the jth measurement type. We can summarize several properties of C_X:

- C_X is a square symmetric m × m matrix (Theorem 2 of Appendix A).
- The diagonal terms of C_X are the variance of particular measurement types.
- The off-diagonal terms of C_X are the covariance between measurement types.

C_X captures the covariance between all possible pairs of measurements. The covariance values reflect the noise and redundancy in our measurements.

Pretend we have the option of manipulating C_X. We will suggestively define our manipulated covariance matrix C_Y. What features do we want to optimize in C_Y?
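To make the definition concrete, here is a minimal MATLAB sketch (not from the original text; variable names are illustrative) that computes C_X for a data matrix whose rows are measurement types and whose columns are samples. It uses the 1/n normalization of the definition above; in practice one would use 1/(n−1) as noted in the footnote.

% Minimal sketch: covariance matrix C_X = (1/n) X X' after removing the mean of each row.
[m, n] = size(X);                   % m measurement types, n samples
X = X - repmat(mean(X, 2), 1, n);   % subtract off the mean of each measurement type
CX = (1 / n) * (X * X');            % m x m covariance matrix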
D. Diagonalize the Covariance Matrix

Summarizing the last two sections, we would like the covariance between distinct measurement types in C_Y to be zero, that is, we would like C_Y to be diagonal. Fortunately, there is an efficient, analytical solution to this problem. We will discuss two solutions in the following sections.
E. Summary of Assumptions
This section provides a summary of the assumptions behind PCA and hints at when these assumptions might perform poorly.
I. Linearity
Linearity frames the problem as a change of basis. Several areas of research have explored how to extend these notions to nonlinear regimes (see Discussion).

II. Large variances have important structure

PCA ranks directions by variance, so it implicitly treats directions with large variance as interesting structure and directions with small variance as noise (see Discussion and Figure 6b for when this fails).

III. The principal components are orthogonal

This assumption makes PCA soluble with linear algebra decomposition techniques, but, as discussed later, it can be overly stringent.
V. SOLVING PCA USING EIGENVECTOR DECOMPOSITION

We derive our first algebraic solution to PCA based on an important property of eigenvector decomposition. Once again, the data set is X, an m × n matrix, where m is the number of measurement types and n is the number of samples. The goal is summarized as follows.

Find some orthonormal matrix P in Y = PX such that $C_Y \equiv \frac{1}{n} Y Y^T$ is a diagonal matrix. The rows of P are the principal components of X.

We begin by rewriting C_Y in terms of the unknown variable,

$$\begin{aligned}
C_Y &= \frac{1}{n} Y Y^T \\
    &= \frac{1}{n} (PX)(PX)^T \\
    &= \frac{1}{n} P X X^T P^T \\
    &= P \left( \frac{1}{n} X X^T \right) P^T \\
C_Y &= P C_X P^T
\end{aligned}$$

Note that we have identified the covariance matrix of X in the last line. Because C_X is symmetric, it is diagonalized by an orthogonal matrix of its eigenvectors, $C_X = E D E^T$, where the columns of E are the eigenvectors of C_X and D is a diagonal matrix of the corresponding eigenvalues. The trick is to select P so that each row p_i is an eigenvector of C_X, i.e. $P \equiv E^T$. Since P is then orthonormal, $P^{-1} = P^T$, and we can finish evaluating C_Y,

$$\begin{aligned}
C_Y &= P C_X P^T \\
    &= P (E D E^T) P^T \\
    &= P (P^T D P) P^T \\
    &= (P P^T) D (P P^T) \\
    &= (P P^{-1}) D (P P^{-1}) \\
C_Y &= D
\end{aligned}$$

This choice of P diagonalizes C_Y, which was the goal of PCA: the principal components of X are the eigenvectors of C_X, and the ith diagonal value of C_Y is the variance of X along p_i.
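The algebra above can be checked numerically. The MATLAB sketch below (an illustration under the same data layout as before, not the paper's own code) chooses the rows of P to be the eigenvectors of C_X and confirms that the covariance of Y = PX comes out diagonal.

% Minimal sketch: diagonalizing the covariance by choosing P = E' (rows are eigenvectors of C_X).
[m, n] = size(X);
X = X - repmat(mean(X, 2), 1, n);   % zero-mean each measurement type
CX = (1 / n) * (X * X');            % covariance of X
[E, D] = eig(CX);                   % columns of E are eigenvectors, D holds eigenvalues
P = E';                             % principal components as rows
Y = P * X;                          % the data re-expressed in the principal component basis
CY = (1 / n) * (Y * Y');            % equals D up to round-off; off-diagonal terms are ~0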
VI. A MORE GENERAL SOLUTION USING SVD
A. Singular Value Decomposition

Let $\{v_1, v_2, \ldots, v_r\}$ be the orthonormal set of eigenvectors of the symmetric matrix $X^T X$ with associated eigenvalues $\{\lambda_1, \ldots, \lambda_r\}$, where r is the rank of $X^T X$. Define the singular values $\sigma_i \equiv \sqrt{\lambda_i}$ and the vectors $u_i \equiv \frac{1}{\sigma_i} X v_i$. These definitions carry two new and unexpected properties,

$$u_i \cdot u_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \qquad\qquad \|X v_i\| = \sigma_i$$

These properties are both proven in Theorem 5. We now have all of the pieces to construct the decomposition. The scalar version of singular value decomposition is just a restatement of the third definition,

$$X v_i = \sigma_i u_i \qquad (3)$$

This result says quite a bit: X multiplied by an eigenvector of $X^T X$ is equal to a scalar times another vector. The set of eigenvectors $\{v_1, v_2, \ldots, v_r\}$ and the set of vectors $\{u_1, u_2, \ldots, u_r\}$ are both orthonormal sets or bases in r-dimensional space.

We can summarize this result for all vectors in one matrix multiplication by following the prescribed construction in Figure 4. We start by constructing a new diagonal matrix $\Sigma$,

$$\Sigma = \begin{bmatrix} \sigma_1 & & & \\ & \ddots & & \\ & & \sigma_r & \\ & & & \mathbf{0} \end{bmatrix}$$

where $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r$ are the rank-ordered set of singular values. Likewise we construct accompanying orthogonal matrices,

$$V = \begin{bmatrix} v_1 & v_2 & \cdots & v_m \end{bmatrix}, \qquad U = \begin{bmatrix} u_1 & u_2 & \cdots & u_n \end{bmatrix}$$

where we have appended an additional (m − r) and (n − r) orthonormal vectors to fill up the matrices for V and U respectively (i.e. to deal with degeneracy issues). Figure 4 provides a graphical representation of how all of the pieces fit together to form the matrix version of SVD,

$$XV = U\Sigma$$

where each column of V and U perform the scalar version of the decomposition (Equation 3). Because V is orthogonal, we can multiply both sides by $V^{-1} = V^T$ to arrive at the final form of the decomposition,

$$X = U \Sigma V^T \qquad (4)$$

B. Interpreting SVD

The final form of SVD is a concise but thick statement. Instead let us reinterpret Equation 3 as

$$Xa = kb$$

where a and b are column vectors and k is a scalar constant. The set $\{v_1, v_2, \ldots, v_m\}$ is analogous to a and the set $\{u_1, u_2, \ldots, u_n\}$ is analogous to b. What is unique though is that $\{v_1, \ldots, v_m\}$ and $\{u_1, \ldots, u_n\}$ are orthonormal sets of vectors which span an m- or n-dimensional space, respectively. In particular, loosely speaking these sets appear to span all possible "inputs" (a) and "outputs" (b) of the matrix X.
FIG. 4 Construction of the matrix form of SVD (Equation 4) from the scalar form (Equation 3). The scalar form of SVD is expressed in Equation 3, X v_i = σ_i u_i. The mathematical intuition behind the construction of the matrix form is that we want to express all n scalar equations in just one equation. We construct three new matrices V, U and Σ. All singular values are first rank-ordered σ_1 ≥ σ_2 ≥ ... ≥ σ_r, and the corresponding vectors are indexed in the same rank order. Each pair of associated vectors v_i and u_i is stacked in the ith column of its respective matrix, and the corresponding singular value σ_i is placed along the diagonal (the iith position) of Σ. This generates the matrix equation XV = UΣ. The matrices V and U are m × m and n × n matrices respectively and Σ is a diagonal matrix with a few non-zero values (represented by the checkerboard) along its diagonal. Solving this single matrix equation solves all n scalar equations.
This interpretation can be made explicit by manipulating Equation 4. Multiplying both sides on the left by $U^T$ (recall $U^T U = I$ because U is orthogonal) gives

$$\begin{aligned}
X &= U \Sigma V^T \\
U^T X &= \Sigma V^T \\
U^T X &= Z
\end{aligned}$$

where we have defined $Z \equiv \Sigma V^T$. Note the form of this last equation: it is a change of basis of exactly the kind in Equation 1, with the orthonormal columns of U acting as the new basis for re-expressing the columns of X.
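As a numerical companion to this section (an illustrative sketch, not from the original text), the MATLAB code below computes the SVD of a toy data matrix and checks the matrix form XV = UΣ of Equation 4.

% Minimal sketch: the matrix form of the SVD.
X = randn(6, 500);                  % illustrative m x n data matrix
[U, S, V] = svd(X, 'econ');         % X = U*S*V'
sigma = diag(S);                    % rank-ordered singular values
err = norm(X * V - U * S);          % X V = U Sigma, so err is ~0
% if the rows of X are zero-mean, the columns of U are also the
% eigenvectors of the covariance matrix C_X = (1/n) X X'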
FIG. 5 Quick Summary of PCA [boxed summary: organize the data as an m × n matrix, subtract off the mean of each measurement type, and compute the eigenvectors of the covariance or the SVD of the data].
VII. DISCUSSION
Principal component analysis (PCA) has widespread applications because it reveals simple underlying structures in complex data sets using analytical solutions from linear algebra.
Figure 5 provides a brief summary for implementing PCA.
A primary benefit of PCA arises from quantifying the importance of each dimension for describing the variability of a data
set. In particular, the measurement of the variance along each
FIG. 6 Example of when PCA fails (red lines). (a) Tracking a person on a ferris wheel (black dots). All dynamics can be described by the phase of the wheel θ, a non-linear combination of the naive basis. (b) In this example data set, non-Gaussian distributed data and non-orthogonal axes cause PCA to fail. The axes with the largest variance do not correspond to the appropriate answer.
principal component provides a means for comparing the relative importance of each dimension. An implicit hope behind employing this method is that the variance along a small number of principal components (i.e. less than the number of measurement types) provides a reasonable characterization of the complete data set. This statement is the precise intuition behind any method of dimensional reduction, a vast arena of active research. In the example of the spring, PCA identifies that a majority of variation exists along a single dimension (the direction of motion $\hat{x}$), even though 6 dimensions are recorded.
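In practice this comparison is often reported as the fraction of the total variance captured by each principal component. A minimal MATLAB sketch (not from the original text; the data layout is the same m × n convention used throughout) is shown below.

% Minimal sketch: fraction of total variance along each principal component.
[m, n] = size(X);
X = X - repmat(mean(X, 2), 1, n);        % zero-mean each measurement type
[U, S, V] = svd(X / sqrt(n), 'econ');    % singular values of the scaled data
variances = diag(S).^2;                  % variance along each principal component
explained = variances / sum(variances);  % fractions that sum to 1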
Although PCA works on a multitude of real world problems, any diligent scientist or engineer must ask when does
PCA fail? Before we answer this question, let us note a remarkable feature of this algorithm. PCA is completely nonparametric: any data set can be plugged in and an answer
comes out, requiring no parameters to tweak and no regard for
how the data was recorded. From one perspective, the fact that
PCA is non-parametric (or plug-and-play) can be considered
a positive feature because the answer is unique and independent of the user. From another perspective the fact that PCA
is agnostic to the source of the data is also a weakness. For
instance, consider tracking a person on a ferris wheel in Figure 6a. The data points can be cleanly described by a single
variable, the precession angle of the wheel θ; however, PCA
would fail to recover this variable.
A deeper appreciation of the limits of PCA requires some consideration about the underlying assumptions and in tandem,
a more rigorous description of the source of data. Generally speaking, the primary motivation behind this method is
to decorrelate the data set, i.e. remove second-order dependencies. The manner of approaching this goal is loosely akin
to how one might explore a town in the Western United States:
drive down the longest road running through the town. When
one sees another big road, turn left or right and drive down
this road, and so forth. In this analogy, PCA requires that each
new road explored must be perpendicular to the previous, but
clearly this requirement is overly stringent and the data (or
town) might be arranged along non-orthogonal axes, such as
Figure 6b. Figure 6 provides two examples of this type of data
where PCA provides unsatisfying results.
To address these problems, we must define what we consider
optimal results. In the context of dimensional reduction, one
measure of success is the degree to which a reduced representation can predict the original data. In statistical terms,
we must define an error function (or loss function). It can
be proved that under a common loss function, mean squared
error (i.e. L2 norm), PCA provides the optimal reduced representation of the data. This means that selecting orthogonal
directions for principal components is the best solution to predicting the original data. Given the examples of Figure 6, how
could this statement be true? Our intuitions from Figure 6
suggest that this result is somehow misleading.
The solution to this paradox lies in the goal we selected for the
analysis. The goal of the analysis is to decorrelate the data, or
said in other terms, the goal is to remove second-order dependencies in the data. In the data sets of Figure 6, higher order
dependencies exist between the variables. Therefore, removing second-order dependencies is insufficient at revealing all
structure in the data.⁷
Multiple solutions exist for removing higher-order dependencies. For instance, if prior knowledge about the problem is available, then a nonlinearity (i.e. kernel) might be applied to the data to transform it to a more appropriate naive
basis. For instance, in Figure 6a, one might examine the polar coordinate representation of the data. This parametric approach is often termed kernel PCA.
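To make the ferris-wheel remark concrete, here is a small MATLAB sketch (purely illustrative, not from the original text) that maps noisy (x, y) positions on a circle into polar coordinates before any further analysis; in that representation essentially all of the variation lies along the angle θ, so a linear method applied afterwards recovers the single underlying variable.

% Minimal sketch: re-expressing circular data in polar coordinates (the kernel idea).
t = linspace(0, 4 * pi, 1000);                    % illustrative wheel phase over several revolutions
xy = [cos(t); sin(t)] + 0.02 * randn(2, 1000);    % noisy positions on a circle
[theta, r] = cart2pol(xy(1, :), xy(2, :));        % nonlinear change of variables
% nearly all of the variance is now along theta; r is approximately constant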
Another direction is to impose more general statistical definitions of dependency within a data set, e.g. requiring that data
along reduced dimensions be statistically independent. This
class of algorithms, termed independent component analysis
(ICA), has been demonstrated to succeed in many domains
where PCA fails. ICA has been applied to many areas of signal and image processing, but suffers from the fact that solutions are (sometimes) difficult to compute.
Writing this paper has been an extremely instructional experience for me. I hope that this paper helps to demystify the
motivation and results of PCA, and the underlying assumptions behind this important analysis technique. Please send
me a note if this has been useful to you as it inspires me to
keep writing!
⁷ When are second order dependencies sufficient for revealing all dependencies in a data set? This statistical condition is met when the first and second
order statistics are sufficient statistics of the data. This occurs, for instance,
when a data set is Gaussian distributed.
Appendix A: Linear Algebra

In the first part of the proof, let A be just some matrix, not necessarily symmetric, and let it have independent eigenvectors (i.e. no degeneracy). Furthermore, let $E = [e_1\ e_2\ \ldots\ e_n]$ be the matrix of eigenvectors placed in the columns. Let D be a diagonal matrix where the ith eigenvalue is placed in the iith position.

We will now show that AE = ED. We can examine the columns of the right-hand and left-hand sides of the equation.

Left hand side: $AE = [Ae_1\ Ae_2\ \ldots\ Ae_n]$
Right hand side: $ED = [\lambda_1 e_1\ \lambda_2 e_2\ \ldots\ \lambda_n e_n]$

Evidently, if AE = ED then $Ae_i = \lambda_i e_i$ for all i. This equation is the definition of the eigenvalue equation. Therefore, it must be that AE = ED. A little rearrangement provides $A = EDE^{-1}$, completing the first part of the proof.

For the second part of the proof, we show that a symmetric matrix always has orthogonal eigenvectors. For some symmetric matrix, let $\lambda_1$ and $\lambda_2$ be distinct eigenvalues for eigenvectors $e_1$ and $e_2$.

$$\begin{aligned}
\lambda_1\, e_1 \cdot e_2 &= (\lambda_1 e_1)^T e_2 \\
&= (A e_1)^T e_2 \\
&= e_1^T A^T e_2 \\
&= e_1^T A e_2 \\
&= e_1^T (\lambda_2 e_2) \\
\lambda_1\, e_1 \cdot e_2 &= \lambda_2\, e_1 \cdot e_2
\end{aligned}$$

By the last relation, $(\lambda_1 - \lambda_2)\, e_1 \cdot e_2 = 0$. Because the eigenvalues are distinct, it must be that $e_1 \cdot e_2 = 0$: the eigenvectors of a symmetric matrix are orthogonal. A symmetric matrix therefore has the special property that all of its eigenvectors are not just linearly independent but also orthogonal, thus completing our proof: E is an orthogonal matrix, so $E^{-1} = E^T$ and we may write $A = EDE^T$.

For the properties used in Section VI (Theorem 5), consider the set of vectors $\{X v_i\}$, where the $v_i$ are orthonormal eigenvectors of $X^T X$ with eigenvalues $\lambda_i$. All of these properties arise from the dot product of any two vectors from this set.

$$\begin{aligned}
(X v_i) \cdot (X v_j) &= (X v_i)^T (X v_j) \\
&= v_i^T X^T X v_j \\
&= v_i^T (\lambda_j v_j) \\
&= \lambda_j\, v_i \cdot v_j \\
(X v_i) \cdot (X v_j) &= \lambda_j\, \delta_{ij}
\end{aligned}$$

Appendix B: Code

This code is written for Matlab (http://www.mathworks.com).
% Fragments of the covariance-based PCA routine (see the completed sketch below):
[M, N] = size(data);        % data is an M x N matrix: M measurement types, N samples
[PC, V] = eig(covariance);  % eigenvectors (columns of PC) and eigenvalues of the covariance matrix
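The rest of the routine can be filled in along the following lines; this is a sketch consistent with the steps described in the text (mean subtraction, eigenvector computation, rank-ordering, projection) rather than a verbatim reproduction of the original appendix code.

function [signals, PC, V] = pca_sketch(data)
% PCA via the covariance method of Section V (illustrative sketch).
% data    - M x N matrix (M measurement types, N samples)
% signals - M x N matrix of projected data
% PC      - each column is a principal component
% V       - M x 1 vector of variances
[M, N] = size(data);
data = data - repmat(mean(data, 2), 1, N);    % subtract off the mean of each measurement type
covariance = 1 / (N - 1) * (data * data');    % covariance matrix
[PC, V] = eig(covariance);                    % eigenvectors and eigenvalues
V = diag(V);                                  % extract the eigenvalues as a vector
[V, rindices] = sort(V, 'descend');           % rank-order by variance
PC = PC(:, rindices);                         % reorder the principal components to match
signals = PC' * data;                         % project the original data set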