


Unsupervised Learning:

Principal Component Analysis


Unsupervised Learning
• Dimension Reduction ("turning the complex into the simple"): we only have the function's input; high-dimensional vectors x go in, and the function should produce compact codes.
• Generation ("creating something from nothing"): we only have the function's output; the input can be random numbers.

Dimension Reduction

vector x (High Dim) → function → vector z (Low Dim)

[Figure: an S-shaped point cloud. It looks 3-D, but it is actually a 2-D sheet.]


Dimension Reduction
• In MNIST, a digit is 28 × 28 = 784 dims.
• Most 28 × 28-dim vectors are not digits; the actual digits occupy a much lower-dimensional region.

[Figure: the digit "3" rotated by −20°, −10°, 0°, 10°, 20°. A single number, the rotation angle, describes the whole family of images.]


Clustering

[Figure: unlabeled handwritten 0s and 1s grouped into Cluster 1, Cluster 2, and Cluster 3]

Open question: how many clusters do we need?
• K-means
  • Cluster $X = \{x^1, \dots, x^n, \dots, x^N\}$ into K clusters
  • Initialize cluster centers $c^i$, $i = 1, 2, \dots, K$ (K random $x^n$ from $X$)
  • Repeat:
    • For all $x^n$ in $X$:
      $$b_i^n = \begin{cases} 1 & x^n \text{ is closest to } c^i \\ 0 & \text{otherwise} \end{cases}$$
    • Update all $c^i$:
      $$c^i = \sum_{x^n} b_i^n x^n \Big/ \sum_{x^n} b_i^n$$
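A minimal NumPy sketch of the loop above; the toy data `X`, the cluster count `K`, and the iteration budget are illustrative assumptions:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """K-means as above: X is (N, d); returns centers c^i and assignments."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]  # K random x^n from X
    for _ in range(n_iters):
        # b_i^n = 1 for the center closest to x^n (stored here as an index)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # c^i = mean of the points currently assigned to cluster i
        for i in range(K):
            if np.any(assign == i):
                centers[i] = X[assign == i].mean(axis=0)
    return centers, assign
```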
Clustering
• Hierarchical Agglomerative Clustering (HAC)
  • Step 1: build a tree, repeatedly merging the closest pair of examples/clusters until a single root remains
  • Step 2: pick a threshold; cutting the tree at that height yields the clusters
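A sketch of both steps using SciPy; the toy data, the linkage method, and the distance threshold are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                            # toy data (assumption)
Z = linkage(X, method="average")                     # Step 1: build the tree
labels = fcluster(Z, t=0.5, criterion="distance")    # Step 2: cut at a threshold
```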
Distributed Representation
• Clustering: an object must belong to exactly one cluster. ("Gon is an Enhancer.")
• Distributed representation: describe the object by a vector over all the categories (the Nen types from Hunter × Hunter). "Gon is:"

  Enhancement     0.70
  Emission        0.25
  Transmutation   0.05
  Manipulation    0.00
  Conjuration     0.00
  Specialization  0.00
Distributed Representation

vector x (High Dim) → function → vector z (Low Dim)

• Feature selection: simply keep the informative dimensions. [Figure: data spread only along $x_2$; selecting $x_2$ and dropping $x_1$ is enough. This only works when some dimensions carry no information.]
• Principal component analysis (PCA): $z = Wx$ [Bishop, Chapter 12]
PCA

$$z = Wx$$

Reduce to 1-D: $z_1 = w^1 \cdot x$

Project all the data points $x$ onto $w^1$ to obtain a set of $z_1$. We want the variance of $z_1$ to be as large as possible, so that the projection spreads the data out instead of collapsing it:

$$\mathrm{Var}(z_1) = \frac{1}{N} \sum_{z_1} (z_1 - \bar{z}_1)^2, \qquad \lVert w^1 \rVert_2 = 1$$

[Figure: projecting the same data onto one direction gives large variance, onto another direction small variance]

Reduce to 2-D: also take $z_2 = w^2 \cdot x$. We again want the variance of $z_2$ to be as large as possible,

$$\mathrm{Var}(z_2) = \frac{1}{N} \sum_{z_2} (z_2 - \bar{z}_2)^2, \qquad \lVert w^2 \rVert_2 = 1,$$

subject to $w^1 \cdot w^2 = 0$ (otherwise the maximizer would just be $w^1$ again). Stacking the rows,

$$W = \begin{bmatrix} (w^1)^T \\ (w^2)^T \\ \vdots \end{bmatrix}$$

is an orthogonal matrix.
Warning of Math
PCA

$$z_1 = w^1 \cdot x$$

$$\bar{z}_1 = \frac{1}{N}\sum z_1 = \frac{1}{N}\sum w^1 \cdot x = w^1 \cdot \frac{1}{N}\sum x = w^1 \cdot \bar{x}$$

$$\mathrm{Var}(z_1) = \frac{1}{N}\sum_{z_1} (z_1 - \bar{z}_1)^2 = \frac{1}{N}\sum_x \big(w^1 \cdot x - w^1 \cdot \bar{x}\big)^2 = \frac{1}{N}\sum \big(w^1 \cdot (x - \bar{x})\big)^2$$

Using $(a \cdot b)^2 = (a^T b)^2 = a^T b\, a^T b = a^T b\,(a^T b)^T = a^T b\, b^T a$:

$$= \frac{1}{N}\sum (w^1)^T (x - \bar{x})(x - \bar{x})^T w^1 = (w^1)^T \left[ \frac{1}{N}\sum (x - \bar{x})(x - \bar{x})^T \right] w^1 = (w^1)^T \mathrm{Cov}(x)\, w^1$$

So: find $w^1$ maximizing $(w^1)^T S w^1$ subject to $\lVert w^1 \rVert_2^2 = (w^1)^T w^1 = 1$, where $S = \mathrm{Cov}(x)$.
Find $w^1$ maximizing $(w^1)^T S w^1$ subject to $(w^1)^T w^1 = 1$.

$S = \mathrm{Cov}(x)$ is symmetric and positive-semidefinite (non-negative eigenvalues).

Using a Lagrange multiplier [Bishop, Appendix E]:

$$g(w^1) = (w^1)^T S w^1 - \alpha\big((w^1)^T w^1 - 1\big)$$

Setting $\partial g(w^1)/\partial w_1^1 = 0$, $\partial g(w^1)/\partial w_2^1 = 0$, … gives

$$S w^1 - \alpha w^1 = 0 \quad\Rightarrow\quad S w^1 = \alpha w^1,$$

so $w^1$ is an eigenvector of $S$. Multiplying by $(w^1)^T$,

$$(w^1)^T S w^1 = \alpha\, (w^1)^T w^1 = \alpha,$$

so the variance being maximized equals $\alpha$: choose the maximum one.

$w^1$ is the eigenvector of the covariance matrix $S$ corresponding to the largest eigenvalue $\lambda_1$.
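A small NumPy check of this result; the toy data matrix `X` (one example per row) is an assumption:

```python
import numpy as np

X = np.random.rand(100, 6)              # toy data, N x d (assumption)
S = np.cov(X, rowvar=False)             # S = Cov(x)
eigvals, eigvecs = np.linalg.eigh(S)    # eigh: S is symmetric; eigenvalues ascending
w1 = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue
z1 = (X - X.mean(axis=0)) @ w1          # project the centered data onto w^1
print(np.var(z1, ddof=1), eigvals[-1])  # Var(z1) equals lambda_1
```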
Find $w^2$ maximizing $(w^2)^T S w^2$ subject to $(w^2)^T w^2 = 1$ and $(w^2)^T w^1 = 0$.

$$g(w^2) = (w^2)^T S w^2 - \alpha\big((w^2)^T w^2 - 1\big) - \beta\big((w^2)^T w^1 - 0\big)$$

Setting $\partial g(w^2)/\partial w_1^2 = 0$, $\partial g(w^2)/\partial w_2^2 = 0$, … gives

$$S w^2 - \alpha w^2 - \beta w^1 = 0$$

Left-multiplying by $(w^1)^T$:

$$(w^1)^T S w^2 - \alpha\, (w^1)^T w^2 - \beta\, (w^1)^T w^1 = 0$$

Here $(w^1)^T w^2 = 0$ and $(w^1)^T w^1 = 1$. The first term is a scalar, so

$$(w^1)^T S w^2 = \big((w^1)^T S w^2\big)^T = (w^2)^T S^T w^1 = (w^2)^T S w^1 = \lambda_1\, (w^2)^T w^1 = 0,$$

using $S w^1 = \lambda_1 w^1$. Therefore $\beta = 0$, and the condition becomes $S w^2 - \alpha w^2 = 0$, i.e. $S w^2 = \alpha w^2$.

$w^2$ is the eigenvector of the covariance matrix $S$ corresponding to the 2nd largest eigenvalue $\lambda_2$ (it cannot be $w^1$ again because of the orthogonality constraint).
PCA - decorrelation

$z = Wx$ decorrelates the features: $\mathrm{Cov}(z) = D$, a diagonal matrix. [Figure: correlated data in $(x_1, x_2)$ becomes axis-aligned in $(z_1, z_2)$ after PCA.]

$$\mathrm{Cov}(z) = \frac{1}{N}\sum (z - \bar{z})(z - \bar{z})^T = W S W^T, \qquad S = \mathrm{Cov}(x)$$

$$W S W^T = W S \begin{bmatrix} w^1 & \cdots & w^K \end{bmatrix} = W \begin{bmatrix} S w^1 & \cdots & S w^K \end{bmatrix} = W \begin{bmatrix} \lambda_1 w^1 & \cdots & \lambda_K w^K \end{bmatrix}$$

$$= \begin{bmatrix} \lambda_1 W w^1 & \cdots & \lambda_K W w^K \end{bmatrix} = \begin{bmatrix} \lambda_1 e_1 & \cdots & \lambda_K e_K \end{bmatrix} = D \quad \text{(diagonal matrix)}$$

since $W w^k = e_k$, the k-th standard basis vector (the rows of $W$ are orthonormal).
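A quick NumPy check of this property, continuing the toy setup above (`X` and `eigvecs` as before):

```python
K = 2
W = eigvecs[:, ::-1][:, :K].T        # rows are the top-K eigenvectors w^1, ..., w^K
Z = (X - X.mean(axis=0)) @ W.T       # z = W x for every centered example
print(np.round(np.cov(Z, rowvar=False), 3))  # ~diagonal: off-diagonals near 0
```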
End of Warning
PCA – Another Point of View

Basic components: every digit image is (roughly) a combination of basic strokes $u^1, u^2, u^3, u^4, u^5, \dots$

[Figure: a "7" decomposed as $1 \times u^1 + 1 \times u^3 + 1 \times u^5$; its coefficient vector is $(1, 0, 1, 0, 1)$]

$$x \approx c_1 u^1 + c_2 u^2 + \cdots + c_K u^K + \bar{x}$$

$x$: the pixels in a digit image; each $u^k$: a component; the coefficients $(c_1, c_2, \dots, c_K)$ represent the digit image.
PCA – Another Point of View

$$x - \bar{x} \approx c_1 u^1 + c_2 u^2 + \cdots + c_K u^K = \hat{x}$$

Reconstruction error: $\lVert (x - \bar{x}) - \hat{x} \rVert_2$. Find $u^1, \dots, u^K$ minimizing the total error:

$$L = \min_{u^1, \dots, u^K} \sum \left\lVert (x - \bar{x}) - \sum_{k=1}^{K} c_k u^k \right\rVert_2$$

PCA: $z = Wx$, i.e.

$$\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_K \end{bmatrix} = \begin{bmatrix} (w^1)^T \\ (w^2)^T \\ \vdots \\ (w^K)^T \end{bmatrix} x$$

The $\{w^1, w^2, \dots, w^K\}$ from PCA are the components $\{u^1, u^2, \dots, u^K\}$ minimizing $L$. Proof in [Bishop, Chapter 12.1.2].
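A NumPy sketch of this reconstruction, reusing `X`, `x_bar`, and the component matrix `W` from the checks above; the coefficients are $c_k = (x - \bar{x}) \cdot w^k$, as the Autoencoder slides below spell out:

```python
x_bar = X.mean(axis=0)
C = (X - x_bar) @ W.T                              # c_k = (x - x_bar) . w^k per example
X_hat = C @ W                                      # x_hat = sum_k c_k w^k
err = np.linalg.norm((X - x_bar) - X_hat, axis=1)  # per-example reconstruction error
print(err.mean())
```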
For the whole data set:

$$x^1 - \bar{x} \approx c_1^1 u^1 + c_2^1 u^2 + \cdots$$
$$x^2 - \bar{x} \approx c_1^2 u^1 + c_2^2 u^2 + \cdots$$
$$x^3 - \bar{x} \approx c_1^3 u^1 + c_2^3 u^2 + \cdots$$
$$\cdots$$

Collecting the centered examples as the columns of a matrix, we minimize the error of a factorization:

$$\underbrace{\begin{bmatrix} x^1 - \bar{x} & x^2 - \bar{x} & x^3 - \bar{x} & \cdots \end{bmatrix}}_{\text{Matrix } X} \;\approx\; \begin{bmatrix} u^1 & u^2 & \cdots \end{bmatrix} \begin{bmatrix} c_1^1 & c_1^2 & c_1^3 & \cdots \\ c_2^1 & c_2^2 & c_2^3 & \cdots \\ \vdots & \vdots & \vdots \end{bmatrix}$$

SVD gives the best rank-K factorization:

$$\underset{M \times N}{X} \;\approx\; \underset{M \times K}{U} \;\; \underset{K \times K}{\Sigma} \;\; \underset{K \times N}{V}$$

The K columns of U are a set of orthonormal eigenvectors corresponding to the K largest eigenvalues of $XX^T$. This is the solution of PCA.

SVD: http://speech.ee.ntu.edu.tw/~tlkagk/courses/LA_2016/Lecture/SVD.pdf
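A minimal NumPy sketch of PCA via SVD, matching the layout above (examples stored as columns); the toy sizes are assumptions:

```python
import numpy as np

A = np.random.rand(784, 1000)              # M x N: N examples as columns (assumption)
Xc = A - A.mean(axis=1, keepdims=True)     # subtract x_bar from every column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
K = 30
components = U[:, :K]                      # u^1, ..., u^K: eigenvectors of X X^T
codes = components.T @ Xc                  # K x N matrix of coefficients c_k^n
```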
Autoencoder

PCA looks like a neural network with one hidden layer (linear activation function).

If $\{w^1, w^2, \dots, w^K\}$ are the components $\{u^1, u^2, \dots, u^K\}$, then to minimize the reconstruction error take

$$c_k = (x - \bar{x}) \cdot w^k, \qquad \hat{x} = \sum_{k=1}^{K} c_k w^k$$

$K = 2$:

[Figure: a network with 3-dim input $x - \bar{x}$. Encoding: the hidden units $c_1, c_2$ are computed with weights $(w_1^1, w_2^1, w_3^1)$ and $(w_1^2, w_2^2, w_3^2)$. Decoding: the same weights map $(c_1, c_2)$ back to the output $(\hat{x}_1, \hat{x}_2, \hat{x}_3)$. The network is trained to minimize the error between $\hat{x}$ and $x - \bar{x}$, e.g. by gradient descent.]

It can be deep: Deep Autoencoder.
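A sketch of this network in PyTorch (the framework choice is an assumption; the slides do not fix one): a single tied weight matrix used to both encode and decode, trained by gradient descent:

```python
import torch

d, K = 3, 2
W = torch.randn(K, d, requires_grad=True)  # shared encode/decode weights w^k
opt = torch.optim.SGD([W], lr=0.01)

X = torch.randn(100, d)                    # toy data (assumption)
Xc = X - X.mean(dim=0)                     # x - x_bar

for _ in range(1000):
    C = Xc @ W.T                           # encode: c_k = (x - x_bar) . w^k
    X_hat = C @ W                          # decode: x_hat = sum_k c_k w^k
    loss = ((X_hat - Xc) ** 2).mean()      # reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note that gradient descent does not constrain the rows of W to be orthonormal, so it is not guaranteed to recover the exact PCA components, only (at best) the subspace they span.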
PCA - Pokémon
• Inspired from: https://www.kaggle.com/strakul5/d/abcsds/pokemon/principal-component-analysis-of-pokemon-data
• 800 Pokémon, 6 features for each (HP, Atk, Def, Sp Atk, Sp Def, Speed)
• How many principal components? Look at how much of the total variance each one explains:

$$\frac{\lambda_i}{\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 + \lambda_5 + \lambda_6}$$

          λ1     λ2     λ3     λ4     λ5     λ6
  ratio   0.45   0.18   0.13   0.12   0.07   0.04

Using 4 components is good enough.
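This ratio is what scikit-learn exposes as `explained_variance_ratio_`. A sketch, assuming the Kaggle CSV is available locally and uses these column names (adjust if yours differ):

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv("Pokemon.csv")   # file name/path is an assumption
feats = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]
pca = PCA(n_components=6).fit(df[feats])
print(pca.explained_variance_ratio_)   # roughly [0.45, 0.18, 0.13, 0.12, 0.07, 0.04]
```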
PCA - Pokémon

         HP     Atk    Def    Sp Atk  Sp Def  Speed   interpretation
  PC1    0.4    0.4    0.4    0.5     0.4     0.3     overall strength
  PC2    0.1    0.0    0.6   -0.3     0.2    -0.7     defense (sacrificing speed)
  PC3   -0.5   -0.6    0.1    0.3     0.6     0.1     special defense (sacrificing attack and HP)
  PC4    0.7   -0.4   -0.4    0.1     0.2    -0.3     strong vitality (high HP)
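These rows are the component loadings; with the scikit-learn sketch above they can be read off `pca.components_`:

```python
for i, pc in enumerate(pca.components_[:4], start=1):
    # eigenvectors are defined up to sign, so signs may be flipped vs. the table
    print(f"PC{i}:", dict(zip(feats, pc.round(1))))
```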
PCA - Pokémon
• http://140.112.21.35:2880/~tlkagk/pokemon/pca.html
• The code is modified from http://jkunst.com/r/pokemon-visualize-em-all/
PCA - MNIST

$$\text{image} = a_1 w^1 + a_2 w^2 + \cdots$$

30 components: [Figure: the 30 principal components of the digit images, the "eigen-digits"]
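A sketch of how such eigen-digits can be computed, assuming MNIST fetched through scikit-learn:

```python
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

X, _ = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
pca = PCA(n_components=30).fit(X)
eigen_digits = pca.components_.reshape(30, 28, 28)  # each row drawn as a 28x28 image
```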
PCA - Face

30 components: [Figure: the 30 principal components of the face images, the "eigen-faces"]

http://www.cs.unc.edu/~lazebnik/research/spring08/assignment3.html
Weakness of PCA
• Unsupervised: PCA ignores labels, so the maximum-variance projection can merge classes that a supervised method such as LDA would keep apart. [Figure: the same labeled data projected by PCA (classes overlap) and by LDA (classes separated)]
• Linear: PCA can only apply a linear projection; it cannot unfold a nonlinear manifold. [Figure: the S-curve manifold, from http://www.astroml.org/book_figures/chapter7/fig_S_manifold_PCA.html]

[Figure: MNIST, Pixel (28×28) → PCA (2) vs. Pixel (28×28) → t-SNE (2)]
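A sketch of the last comparison with scikit-learn; using the small 8×8 `digits` set as a stand-in for MNIST (to keep it fast) is an assumption:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)            # 8x8 digits stand in for 28x28 MNIST
Z_pca = PCA(n_components=2).fit_transform(X)
Z_tsne = TSNE(n_components=2).fit_transform(X)
# scatter-plot Z_pca and Z_tsne colored by y to reproduce the comparison
```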
Acknowledgement
• Thanks to 彭冲 for spotting an error in the cited material
• Thanks to Hsiang-Chih Cheng for spotting errors on the slides
Appendix
• http://4.bp.blogspot.com/_sHcZHRnxlLE/S9EpFXYjfvI/AAAAAAAABZ0/_oEQiaR3WVM/s640/dimensionality+reduction.jpg
• https://lvdmaaten.github.io/publications/papers/TR_Dimensionality_Reduction_Review_2009.pdf
