
14 Singular Value Decomposition

For any high-dimensional data analysis, one's first thought should often be: can I use an SVD? The singular
value decomposition is an invaluable analysis tool for dealing with large high-dimensional data. In many
cases of high-dimensional data, most of the dimensions do not contribute to the structure of the data. But
filtering these out takes some care, since it may not be clear which dimensions are important, especially when
the importance comes from a combination of dimensions. The singular value decomposition can be viewed as
a way of finding these important dimensions, and thus the key relationships in the data.
On the other hand, the SVD is often viewed as a numerical linear algebra operation that is done on a
matrix. It decomposes a matrix down into three component matrices. These matrices have structure, being
orthogonal or diagonal.
The goal of this note is to bridge these views, and in particular to provide geometric intuition for the SVD.
The SVD is related to several other tools which we will also consider:

• PCA (Principal Component Analysis): a geometric interpretation, after centering the data
• Eigen-decomposition: shares the same components after data has been made “square.”
• MDS (Multidimensional Scaling): Given just pairwise distances, convert to eigen-decomposition

Data. We will focus on a dataset $A \subset \mathbb{R}^d$ where $A$ is a set of $n$ "points." At the same time, we will think of
$A$ as an $n \times d$ data matrix: each row corresponds to a point, and each column corresponds to a dimension. We
will use Euclidean distance, so it is essential that all dimensions have the same units! (Some interpretations
reverse these so a column is a point and a row is a dimension, but they are really the same.)

14.0.1 Projections
Unlike in linear regression, this family of techniques will measure error as a projection from $a_i \in \mathbb{R}^d$
to the closest point $\pi_F(a_i)$ on a subspace $F$. To define this we will use linear algebra.
First recall that given a unit vector $u \in \mathbb{R}^d$ and any data point $p \in \mathbb{R}^d$, the dot product
$$\langle u, p \rangle$$
is the norm of $p$ projected onto the line through $u$. If we multiply this scalar by $u$ then
$$\pi_u(p) = \langle u, p \rangle u,$$
and it results in the point on the line through $u$ that is closest to the data point $p$. This is a projection of $p$ onto $u$.
To understand this for a subspace $F$, we will need to define a basis. For now we will assume that $F$
contains the origin $(0, 0, 0, \ldots, 0)$ (as did the line through $u$). Then if $F$ is $k$-dimensional, this means
there is a $k$-dimensional basis $U_F = \{u_1, u_2, \ldots, u_k\}$ so that

• For each $u_i \in U_F$ we have $\|u_i\| = 1$, that is, $u_i$ is a unit vector.

• For each pair $u_i, u_j \in U_F$ with $i \neq j$ we have $\langle u_i, u_j \rangle = 0$; the pairs are orthogonal.

• For any point $x \in F$ we can write $x = \sum_{i=1}^{k} \alpha_i u_i$; in particular $\alpha_i = \langle x, u_i \rangle$.

Given such a basis, the projection onto $F$ of a point $p \in \mathbb{R}^d$ is simply
$$\pi_F(p) = \sum_{i=1}^{k} \langle u_i, p \rangle u_i.$$
Thus if $p$ happens to be exactly in $F$, then this recovers $p$ exactly.
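To make this concrete, here is a minimal numpy sketch of the projection formula above (the helper name project_onto_subspace is illustrative, not from the text):

import numpy as np

def project_onto_subspace(p, U_F):
    # Project p onto the subspace spanned by the orthonormal columns of U_F.
    # p   : length-d vector
    # U_F : d x k matrix whose columns u_1, ..., u_k form an orthonormal basis of F
    alphas = U_F.T @ p          # the k new coordinates alpha_i = <u_i, p>
    return U_F @ alphas         # recombine: sum_i alpha_i u_i = pi_F(p), still in R^d

# Example: F is the x-y plane inside R^3.
U_F = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.0, 0.0]])
p = np.array([3.0, 4.0, 5.0])
print(project_onto_subspace(p, U_F))   # -> [3. 4. 0.]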


The other powerful part of the basis $U_F$ is that it defines a new coordinate system. Instead of using the $d$
original coordinates, we can use new coordinates $(\alpha_1(p), \alpha_2(p), \ldots, \alpha_k(p))$ where $\alpha_i(p) = \langle u_i, p \rangle$. To be
clear, $\pi_F(p)$ is still in $\mathbb{R}^d$, but there is a $k$-dimensional representation if we restrict to $F$.
When $F$ is $d$-dimensional, this operation can still be interesting. The basis we choose $U_F = \{u_1, u_2, \ldots, u_d\}$
could be the same as the original coordinate axes, that is, we could have $u_i = e_i = (0, 0, \ldots, 0, 1, 0, \ldots, 0)$
where only the $i$th coordinate is $1$. But if it is another basis, then this acts as a rotation (with the possibility of
also a mirror flip). The first coordinate is rotated to be along $u_1$; the second along $u_2$; and so on. In $\pi_F(p)$,
the point $p$ does not change, just its representation.

14.0.2 SSE Goal


As usual our goal will be to minimize the sum of squared errors. In this case we define this as
$$\mathrm{SSE}(A, F) = \sum_{a_i \in A} \|a_i - \pi_F(a_i)\|^2,$$
and our desired $k$-dimensional subspace $F$ is
$$F^* = \arg\min_F \mathrm{SSE}(A, F).$$
As compared to linear regression, this is much less of a "proxy goal" where the true goal was prediction. Now
we have no labels (the $y_i$ values), so we simply try to fit a model through all of the data.
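As a small, hedged illustration of this objective (the function name sse is ours, not from the text), the error of a candidate subspace with orthonormal basis U_F can be computed as:

import numpy as np

def sse(A, U_F):
    # A   : n x d data matrix (rows are the points a_i)
    # U_F : d x k matrix with orthonormal columns spanning F (through the origin)
    proj = A @ U_F @ U_F.T            # row i becomes pi_F(a_i)
    return np.sum((A - proj) ** 2)    # sum_i ||a_i - pi_F(a_i)||^2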

14.1 The SVD Operator


First we document what the following operation in matlab does:

[U, S, V] = svd(A)

It returns three matrices $U$, $S$, and $V$ so that $A = U S V^T$.
The backend of this (in almost any language) calls some very carefully optimized Fortran code as part
of the LAPACK library. First of all, no information is lost since we can simply recover the original data as
$A = U S V^T$, up to numerical precision, and the Fortran library is optimized to provide very high numerical
precision.
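As a hedged aside, the same decomposition is available in numpy; note that numpy returns $V^T$ directly and the singular values as a vector rather than as the diagonal matrix $S$:

import numpy as np

A = np.random.randn(6, 3)                 # an n x d data matrix with n=6, d=3
U, s, Vt = np.linalg.svd(A, full_matrices=True)

S = np.zeros(A.shape)                     # rebuild the n x d matrix S
S[:len(s), :len(s)] = np.diag(s)

print(np.allclose(A, U @ S @ Vt))         # True: A = U S V^T up to numerical precision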

The structure that lurks beneath. The matrix $S$ only has non-zero elements along its diagonal. So
$S_{i,j} = 0$ if $i \neq j$. The remaining values $\sigma_1 = S_{1,1}, \sigma_2 = S_{2,2}, \ldots, \sigma_r = S_{r,r}$ are known as the singular
values. They have the property that
$$\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r \geq 0$$
where $r \leq \min\{n, d\}$ is the rank of the matrix $A$. So the number of non-zero singular values reports the
rank (this is a numerical way of computing the rank of a matrix).
The matrices $U$ and $V$ are orthogonal. Thus, their columns are all unit vectors and orthogonal to each
other (within each matrix). The columns of $U$, written $u_1, u_2, \ldots, u_n$, are called the left singular vectors;
and the columns of $V$, written $v_1, v_2, \ldots, v_d$, are called the right singular vectors.
This means for any vector $x \in \mathbb{R}^d$, the columns of $V$ (the right singular vectors) provide a basis. That is,
we can write
$$x = \sum_{i=1}^{d} \alpha_i v_i \quad \text{for } \alpha_i = \langle x, v_i \rangle.$$
Similarly, for any vector $y \in \mathbb{R}^n$, the columns of $U$ (the left singular vectors) provide a basis. This also
implies that $\|x\| = \|V^T x\|$ and $\|y\| = \|U^T y\|$.

[Figure: the decomposition $A = U S V^T$. The columns of $V$ (the right singular vectors $v_j$) are the important directions and, being orthogonal, create a basis. The diagonal entries of $S$ (the singular values) give the importance of the singular vectors, in decreasing order. The columns of $U$ (the left singular vectors) map the contribution of the data points (one row of $A$ per point) to the singular values, as in $\|Ax\|$.]

Tracing the path of a vector. To illustrate what this decomposition demonstrates, a useful exercise is to
trace what happens to a vector $x \in \mathbb{R}^d$ as it is left-multiplied by $A$, that is $Ax = U S V^T x$.
First $V^T x$ produces a new vector $\xi \in \mathbb{R}^d$. It essentially changes no information, it just changes the basis to
that described by the right singular vectors. For instance, the new $i$th coordinate is $\xi_i = \langle v_i, x \rangle$.
Next $\eta \in \mathbb{R}^n$ is the result of $S V^T x = S\xi$. It scales $\xi$ by the singular values of $S$. Note that if $d < n$ (the
case we will focus on), then the last $n - d$ coordinates of $\eta$ are $0$. In fact, for $j > r$ (where $r = \mathrm{rank}(A)$)
we have $\eta_j = 0$. For $j \leq r$, the vector $\eta$ is stretched longer in the first coordinates, since these have larger
values.
The final result is a vector $y \in \mathbb{R}^n$, the result of $Ax = U S V^T x = U\eta$. This again just changes the basis
of $\eta$ so that it aligns with the left singular vectors. In the setting $n > d$, the last $n - d$ left singular vectors
are meaningless, since the corresponding entries in $\eta$ are $0$.
Working backwards, this final $U$ matrix can be thought of as mapping the effect of $\eta$ onto each of the data
points of $A$. The $\eta$ vector, in turn, can be thought of as scaling by the content of the data matrix $A$ (the $U$ and
$V^T$ matrices contain no scaling information). And the $\xi$ vector arises via the special rotation matrix $V^T$ that
puts the starting point $x$ into the right basis to do the scaling (from the original $d$-dimensional coordinates
to one that suits the data better).
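A small numpy sketch of this trace (the variable names xi, eta, and y follow the text):

import numpy as np

A = np.random.randn(5, 2)                 # n=5 points in d=2 dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=True)

x = np.array([0.6, 0.8])                  # a unit vector in R^d

xi = Vt @ x                               # change of basis: xi_i = <v_i, x>, norm unchanged
eta = np.zeros(A.shape[0])                # eta lives in R^n
eta[:len(s)] = s * xi                     # scale by the singular values; the last n-d entries stay 0
y = U @ eta                               # rotate into the basis of left singular vectors

print(np.allclose(y, A @ x))              # True: the three steps together are exactly Ax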

Example: Tracing through the SVD
Consider a matrix
$$A = \begin{bmatrix} 4 & 3 \\ 2 & 2 \\ -1 & -3 \\ -5 & -2 \end{bmatrix},$$
and its SVD $[U, S, V] = \mathrm{svd}(A)$:
$$U = \begin{bmatrix} -0.6122 & 0.0523 & 0.0642 & 0.7864 \\ -0.3415 & 0.2026 & 0.8489 & -0.3487 \\ 0.3130 & -0.8070 & 0.4264 & 0.2625 \\ 0.6408 & 0.5522 & 0.3057 & 0.4371 \end{bmatrix}, \quad S = \begin{bmatrix} 8.1655 & 0 \\ 0 & 2.3074 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}, \quad V = \begin{bmatrix} -0.8142 & -0.5805 \\ -0.5805 & 0.8142 \end{bmatrix}.$$
Now consider a vector $x = (0.243, 0.97)$ (scaled very slightly so it is a unit vector, $\|x\| = 1$).
Multiplying by $V^T$ rotates (and flips) $x$ to $\xi = V^T x$; still $\|\xi\| = 1$.

[Figure: left, $x$ drawn in the original coordinates $(x_1, x_2)$ with the directions $v_1$ and $v_2$ overlaid; right, $\xi = V^T x$ in the coordinates aligned with $v_1$ and $v_2$.]

Next, multiplying by $S$ scales $\xi$ to $\eta = S\xi$. Notice there are now imaginary third and fourth coordinates;
they are both coming out of the page! Don't worry, they won't poke you, since their magnitude is $0$.

[Figure: $\eta = S\xi$ drawn against the $v_1$ and $v_2$ axes; the third and fourth coordinates are $0$.]

Finally, $y = U\eta = Ax$ is again another rotation of $\eta$ in this four-dimensional space.
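This example can be checked numerically; a hedged sketch (numpy may return singular vectors with flipped signs, which is an equally valid SVD, so only the final product is compared):

import numpy as np

A = np.array([[ 4.0,  3.0],
              [ 2.0,  2.0],
              [-1.0, -3.0],
              [-5.0, -2.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(s)                                  # approximately [8.1655, 2.3074]

x = np.array([0.243, 0.970])
x = x / np.linalg.norm(x)                 # scale so ||x|| = 1 exactly

xi = Vt @ x                               # rotate (and flip) x
eta = np.zeros(4); eta[:2] = s * xi       # scale; the third and fourth coordinates are 0
y = U @ eta                               # rotate in R^4
print(np.allclose(y, A @ x))              # True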

14.1.1 Best Rank-k Approximation


So how does this help solve the initial problem of finding $F^*$, which minimized the SSE? The singular
values hold the key.

It turns out that there is a unique singular value decomposition, up to ties in the singular values. This
means there is exactly one (up to singular value ties) set of right singular vectors which rotate into a basis
so that $\|Ax\| = \|S V^T x\|$ for all $x \in \mathbb{R}^d$ (recall that $U$ is orthogonal, so it does not change the norm,
$\|U\eta\| = \|\eta\|$).
Next we realize that the singular values come in sorted order $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r$. In fact, they are
defined so that we choose $v_1$ to maximize $\|A v_1\|$, then we find the next singular vector $v_2$ which is
orthogonal to $v_1$ and maximizes $\|A v_2\|$, and so on. Then $\sigma_i = \|A v_i\|$.
If we define $F$ with the basis $U_F = \{v_1, v_2, \ldots, v_k\}$, then
$$\|x - \pi_F(x)\|^2 = \left\| \sum_{i=1}^{d} v_i \langle x, v_i \rangle - \sum_{i=1}^{k} v_i \langle x, v_i \rangle \right\|^2 = \sum_{i=k+1}^{d} \langle x, v_i \rangle^2,$$
so the projection error is that part of $x$ in the last $(d - k)$ right singular vectors.
But we are not trying to directly predict new data here (like in regression). Rather, we are trying to
approximate the data we have. We want to minimize $\sum_i \|a_i - \pi_F(a_i)\|^2$. But for any unit vector $u$, we
recall now that
$$\|Au\|^2 = \sum_{i=1}^{n} \langle a_i, u \rangle^2.$$
Thus the projection error can be measured with a set of orthonormal vectors $w_1, w_2, \ldots, w_{d-k}$ which are
each orthogonal to $F$, as $\sum_{j=1}^{d-k} \|A w_j\|^2$. When defining $F$ as the span of the first $k$ right singular vectors,
these orthogonal vectors are the remaining $(d - k)$ right singular vectors, so the projection error is
$$\sum_{i=1}^{n} \|a_i - \pi_F(a_i)\|^2 = \sum_{j=k+1}^{d} \|A v_j\|^2 = \sum_{j=k+1}^{d} \sigma_j^2.$$

And thus by how the right singular vectors are defined, this expression is minimized when $F$ is defined as
the span of the first $k$ right singular vectors.
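A hedged numerical check that the projection error onto the top-k right singular vectors equals the sum of the remaining squared singular values:

import numpy as np

A = np.random.randn(50, 6)                    # n=50 points in d=6 dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
V_k = Vt[:k].T                                # d x k basis of F: the top-k right singular vectors
proj = A @ V_k @ V_k.T                        # project each row a_i onto F
sse = np.sum((A - proj) ** 2)

print(np.isclose(sse, np.sum(s[k:] ** 2)))    # True: SSE = sum_{j>k} sigma_j^2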

Best rank-k approximation. A similar goal is to find the best rank-$k$ approximation of $A$. That is, a matrix
$A_k \in \mathbb{R}^{n \times d}$ so that $\mathrm{rank}(A_k) = k$ and it minimizes both
$$\|A - A_k\|_2 \quad \text{and} \quad \|A - A_k\|_F.$$
Note that $\|A - A_k\|_2 = \sigma_{k+1}$ and $\|A - A_k\|_F^2 = \sum_{j=k+1}^{d} \sigma_j^2$. And recall that for a matrix $M$ we define the
Frobenius norm $\|M\|_F^2 = \sum_{i,j} M_{i,j}^2$ and the spectral norm $\|M\|_2 = \max_x \|Mx\|/\|x\|$.
Remarkably, this $A_k$ matrix also comes from the SVD. If we set $S_k$ as the matrix $S$ in the decomposition
with all but the first $k$ singular values set to $0$, then it has rank $k$. Hence $A_k = U S_k V^T$ also has rank $k$ and
is our solution. But we can notice that when we set most of $S_k$ to $0$, then the last $(d - k)$ columns of $V$ are
meaningless since they are only multiplied by $0$s in $U S_k V^T$, so we can also set those to all $0$s, or remove
them entirely (along with the last $(d - k)$ columns of $S_k$). Similarly, we can zero out or remove the last $(n - k)$
columns of $U$. These matrices are referred to as $V_k$ and $U_k$ respectively, and then $A_k = U_k S_k V_k^T$.

[Figure: the reduced decomposition $A_k = U_k S_k V_k^T$.]
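A hedged numpy sketch of the truncated decomposition and its error norms (the helper rank_k_approx is ours, not a library routine):

import numpy as np

def rank_k_approx(A, k):
    # Best rank-k approximation A_k = U_k S_k V_k^T.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

A = np.random.randn(40, 8)
k = 3
A_k = rank_k_approx(A, k)

_, s, _ = np.linalg.svd(A, full_matrices=False)
print(np.isclose(np.linalg.norm(A - A_k, 2), s[k]))                        # ||A - A_k||_2 = sigma_{k+1}
print(np.isclose(np.linalg.norm(A - A_k, 'fro') ** 2, np.sum(s[k:] ** 2))) # ||A - A_k||_F^2 = sum_{j>k} sigma_j^2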

In another view, we can also write a matrix $A$ with rank $r$ as
$$A = \sum_{i=1}^{r} \sigma_i u_i v_i^T,$$
where each $u_i v_i^T$ is an $n \times d$ matrix with rank $1$.
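The same identity can be checked with a short, hedged numpy snippet:

import numpy as np

A = np.random.randn(5, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A as a sum of rank-1 matrices sigma_i * u_i v_i^T.
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))
print(np.allclose(A, A_rebuilt))              # True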


We will next relate the SVD to other common matrix analysis forms: PCA, eigendecomposition, and
MDS. One may find literature that uses slightly different forms of these terms (they are often intermingled),
but I believe this is the cleanest, most consistent mapping.

14.2 Principal Component Analysis (PCA)


Recall that the original goal of this topic was to find the $k$-dimensional subspace $F$ to minimize
$$\|A - \pi_F(A)\|_F^2 = \sum_{a_i \in A} \|a_i - \pi_F(a_i)\|^2.$$
We have not actually solved this yet. The top $k$ right singular vectors $V_k$ of $A$ only provided this bound
assuming that $F$ contains the origin $(0, 0, \ldots, 0)$. However, this might not be the case!
Principal Component Analysis (PCA) is an extension of the SVD for when we do not restrict the subspace
to go through the origin. It turns out, as with simple linear regression, that the optimal $F$ must go
through the mean of all of the data. So we can still use the SVD, after a simple preprocessing step called
centering that shifts the data matrix so its mean is exactly at the origin.
Specifically, centering adjusts the original input data matrix $A \in \mathbb{R}^{n \times d}$ so that each column (each
dimension) has an average value of $0$. This is easier than it seems. Define $\bar{a}_j = \frac{1}{n} \sum_{i=1}^{n} A_{i,j}$ (the average
of each column $j$). Then set each $\tilde{A}_{i,j} = A_{i,j} - \bar{a}_j$ to represent the entry in the $i$th row and $j$th column of
the centered matrix $\tilde{A}$.
There is also a centering matrix $C_n = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T$, where $I_n$ is the $n \times n$ identity matrix, $\mathbf{1}$ is the all-ones column
vector (of length $n$), and thus $\mathbf{1}\mathbf{1}^T$ is the all-ones $n \times n$ matrix. Then we can also just write $\tilde{A} = C_n A$.
Now to perform PCA on a data set $A$, we compute $[U, S, V] = \mathrm{svd}(C_n A) = \mathrm{svd}(\tilde{A})$.
The resulting singular values $\mathrm{diag}(S) = \{\sigma_1, \sigma_2, \ldots, \sigma_r\}$ are known as the principal values, and the
top $k$ right singular vectors $V_k = [v_1 \; v_2 \; \ldots \; v_k]$ are known as the top-$k$ principal directions.
This often gives a better fit to the data than just the SVD. The SVD finds the best rank-$k$ approximation
of $A$, which is the best $k$-dimensional subspace (under the Frobenius and spectral norms) that passes through
the origin. If all of the data is far from the origin, this can essentially "waste" a dimension on passing through
the origin. However, we also need to store the shift from the origin, a vector $\bar{a} = (\bar{a}_1, \bar{a}_2, \ldots, \bar{a}_d) \in \mathbb{R}^d$.
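A hedged numpy sketch of PCA as centering followed by an SVD (the function name pca and its return values are illustrative):

import numpy as np

def pca(A, k):
    # Returns the column means (the shift), the top-k principal directions,
    # and the k-dimensional coordinates of each centered point.
    a_bar = A.mean(axis=0)                    # per-column averages, the vector (a_bar_1, ..., a_bar_d)
    A_tilde = A - a_bar                       # centered matrix, the same as C_n A
    U, s, Vt = np.linalg.svd(A_tilde, full_matrices=False)
    V_k = Vt[:k].T                            # top-k principal directions (d x k)
    coords = A_tilde @ V_k                    # k-dimensional coordinates of each point
    return a_bar, V_k, coords

A = np.random.randn(100, 5) + 10.0            # data far from the origin
a_bar, V_k, coords = pca(A, k=2)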

14.3 Eigenvalues and Eigenvectors


A matrix decomposition related to the SVD is the eigendecomposition. This is only defined for a square matrix
$B \in \mathbb{R}^{n \times n}$.
An eigenvector of $B$ is a vector $v$ such that there is some scalar $\lambda$ with
$$Bv = \lambda v.$$
That is, multiplying $B$ by $v$ results in a scaled version of $v$. The associated value $\lambda$ is called the eigenvalue.
As a convention, we typically normalize $v$ so $\|v\| = 1$.
In general, a square matrix $B \in \mathbb{R}^{n \times n}$ may have up to $n$ eigenvectors (collected as a matrix $V \in \mathbb{R}^{n \times n}$) and
eigenvalues (a vector $\ell \in \mathbb{R}^n$). Some of the eigenvalues may be complex numbers (even when all of the entries
of $B$ are real!).

Again, it is easy to compute with matlab as [V, L] = eigs(B).
For this reason, we will focus on positive semidefinite matrices. A positive definite matrix $B \in \mathbb{R}^{n \times n}$
is a symmetric matrix with all positive eigenvalues. Another characterization is that for every vector $x \in \mathbb{R}^n$,
the value $x^T B x$ is positive. A positive semidefinite matrix $B \in \mathbb{R}^{n \times n}$ may have some eigenvalues equal to $0$,
with the rest positive; equivalently, for any vector $x \in \mathbb{R}^n$, the value $x^T B x$ may be zero or positive.
How do we get positive semidefinite matrices? Let's start with a data matrix $A \in \mathbb{R}^{n \times d}$. Then we can
construct two positive semidefinite matrices
$$B_R = A^T A \quad \text{and} \quad B_L = A A^T.$$
Matrix $B_R$ is $d \times d$ and $B_L$ is $n \times n$. If the rank of $A$ is $d$, then $B_R$ is positive definite. If the rank of $A$ is
$n$, then $B_L$ is positive definite.

Eigenvectors and eigenvalues in relation to the SVD. Next consider the SVD of $A$, so that $[U, S, V] = \mathrm{svd}(A)$.
Then we can write
$$B_R V = A^T A V = (V S U^T)(U S V^T) V = V S^2.$$
Note that the last step follows because, for orthogonal matrices $U$ and $V$, we have $U^T U = I$ and $V^T V = I$,
where $I$ is the identity matrix, which has no effect. The matrix $S$ is a square diagonal matrix $S = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_d)$.
(Technically, $S \in \mathbb{R}^{n \times d}$. To make this simple argument work, first assume w.l.o.g. (without loss of generality)
that $d \leq n$. Then the bottom $n - d$ rows of $S$ are all zeros, which means the last $n - d$ columns of $U$ do not
matter. So we can ignore both these $n - d$ rows and columns. Then $S$ is square. This makes $U$ no longer
orthogonal, so $U^T U$ is then a projection, not the identity; but it turns out this is a projection onto the span of $A$,
so the argument still works.) Then $S^2 = SS$ (the product of $S$ with itself) is again diagonal with entries
$S^2 = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2)$.
Now consider a single column $v_i$ of $V$ (which is the $i$th right singular vector of $A$). Then extracting this
column's role in the linear system $B_R V = V S^2$ we obtain
$$B_R v_i = v_i \sigma_i^2.$$
This means that the $i$th right singular vector of $A$ is an eigenvector (in fact the $i$th eigenvector) of $B_R = A^T A$.
Moreover, the $i$th eigenvalue $\lambda_i$ of $B_R$ is the $i$th singular value of $A$ squared: $\lambda_i = \sigma_i^2$.
Similarly we can derive
$$B_L U = A A^T U = (U S V^T)(V S U^T) U = U S^2,$$
and hence the left singular vectors of $A$ are the eigenvectors of $B_L = A A^T$, and the eigenvalues of $B_L$ are
the squared singular values of $A$.
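A quick, hedged numerical check of this relationship (eigenvectors may come back in a different order and with flipped signs, so we sort and compare magnitudes):

import numpy as np

A = np.random.randn(7, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

B_R = A.T @ A
eigvals, eigvecs = np.linalg.eigh(B_R)        # eigh: B_R is symmetric, so eigenvalues are real
order = np.argsort(eigvals)[::-1]             # sort into decreasing order

print(np.allclose(eigvals[order], s ** 2))                      # lambda_i = sigma_i^2
print(np.allclose(np.abs(eigvecs[:, order]), np.abs(Vt.T)))     # eigenvectors match the right singular vectors (up to sign)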

14.4 Multidimensional Scaling


Dimensionality reduction is an abstract problem with input of a high-dimensional data set $P \subset \mathbb{R}^d$ and a
goal of finding a corresponding lower-dimensional data set $Q \subset \mathbb{R}^k$, where $k \ll d$, and properties of $P$
are preserved in $Q$. Both low-rank approximations through direct SVD and through PCA are examples of
this: $Q = \pi_{V_k}(P)$. However, these techniques require an explicit representation of $P$ to start with. In some
cases, we are only presented $P$ more abstractly. There are two common situations:

• We are provided a set of $n$ objects $X$, and a bivariate function $d : X \times X \to \mathbb{R}$ that returns a distance
between them. For instance, we can put two cities into an airline website, and it may return a dollar
amount for the cheapest flight between those two cities. This dollar amount is our "distance."

• We are simply provided a matrix $D \in \mathbb{R}^{n \times n}$, where each entry $D_{i,j}$ is the distance between the $i$th
and $j$th points. In the first scenario, we can calculate such a matrix $D$.

Multi-Dimensional Scaling (MDS) has the goal of taking such a distance matrix $D$ for $n$ points and giving
low-dimensional (typically Euclidean) coordinates to these points so that the embedded points have similar
spatial relations to those described in $D$. If we had some original data set $A$ which resulted in $D$, we could
just apply PCA to find the embedding. It is important to note that in the setting of MDS we are typically just
given $D$, and not the original data $A$. However, as we will show next, we can derive a matrix that will act
like $A A^T$ using only $D$.
A similarity matrix $M$ is an $n \times n$ matrix where entry $M_{i,j}$ is the similarity between the $i$th and the $j$th
data point. The similarity often associated with Euclidean distance $\|a_i - a_j\|$ is the standard inner (or dot)
product $\langle a_i, a_j \rangle$. We can write
$$\|a_i - a_j\|^2 = \|a_i\|^2 + \|a_j\|^2 - 2\langle a_i, a_j \rangle,$$
and hence
$$\langle a_i, a_j \rangle = \frac{1}{2}\left( \|a_i\|^2 + \|a_j\|^2 - \|a_i - a_j\|^2 \right). \tag{14.1}$$
Next we observe that for the $n \times n$ matrix $A A^T$ the entry $[A A^T]_{i,j} = \langle a_i, a_j \rangle$. So it seems hopeful that we
can derive $A A^T$ from $D$ using equation (14.1). That is, we can set $\|a_i - a_j\|^2 = D_{i,j}^2$. However, we also
need values for $\|a_i\|^2$ and $\|a_j\|^2$.


Since the embedding has an arbitrary shift to it (if we add a shift vector $s$ to all embedded points,
then no distances change), we can arbitrarily choose $a_1$ to be at the origin. Then $\|a_1\|^2 = 0$ and
$\|a_j\|^2 = \|a_1 - a_j\|^2 = D_{1,j}^2$. Using this assumption and equation (14.1), we can then derive the similarity
matrix $A A^T$. Then we can run the eigendecomposition on $A A^T$ and use the coordinates of each point along
the first $k$ eigenvectors to get an embedding. This is known as classical MDS.
It is often used with $k$ as 2 or 3 so the data can be easily visualized.
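A hedged sketch of classical MDS exactly as derived above, pinning $a_1$ at the origin (many references instead use a double-centering of $D$, but this follows the text; the function name classical_mds is ours):

import numpy as np

def classical_mds(D, k):
    # D : n x n matrix of pairwise distances; returns an n x k embedding.
    D2 = D ** 2
    norms = D2[0]                                      # ||a_j||^2 = D_{1,j}^2 once a_1 is at the origin
    M = 0.5 * (norms[:, None] + norms[None, :] - D2)   # M_{i,j} = <a_i, a_j>, acts like A A^T
    eigvals, eigvecs = np.linalg.eigh(M)
    order = np.argsort(eigvals)[::-1][:k]              # take the top-k eigenvalues
    vals = np.clip(eigvals[order], 0.0, None)          # guard against tiny negative values
    return eigvecs[:, order] * np.sqrt(vals)           # coordinates along the first k eigenvectors

# Usage: recover 2-dimensional coordinates from pairwise distances of some points.
P = np.random.randn(10, 2)
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
Q = classical_mds(D, k=2)
print(np.allclose(np.linalg.norm(Q[:, None] - Q[None, :], axis=2), D))   # pairwise distances preserved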
There are several other forms that try to preserve the distances more directly, whereas this approach
essentially just minimizes the squared residuals of the projection from some unknown original (high-
dimensional) embedding. One can see that we recover the distances with no error if we use all $n$ eigenvectors,
if they exist. However, as mentioned, there may be fewer than $n$ eigenvectors, or they may be associated
with complex eigenvalues. So if our goal is an embedding into $k = 3$ or $k = 10$, there is no guarantee that
this will work, or even what guarantees this will have. But MDS is used a lot nonetheless.

