


Unsupervised Learning:

Principal Component Analysis


Unsupervised Learning
• Dimension Reduction ("turning the complex into the simple"): we only have the function's input; high-dimensional vectors x go in, and the function should produce compact codes.
• Generation ("creating something from nothing"): we only have the function's output; the input can be random numbers.

Dimension Reduction

vector x (High Dim) → function → vector z (Low Dim)

[Figure: an S-shaped point cloud. It looks 3-D, but it is actually a 2-D sheet.]


Dimension Reduction
• In MNIST, a digit is 28 × 28 = 784 dims.
• Most 28 × 28-dim vectors are not digits; the actual digits occupy a much lower-dimensional region.

[Figure: the digit "3" rotated by −20°, −10°, 0°, 10°, 20°. A single number, the rotation angle, describes the whole family of images.]


Clustering

[Figure: unlabeled handwritten 0s and 1s grouped into Cluster 1, Cluster 2, and Cluster 3]

Open question: how many clusters do we need?
• K-means
  • Cluster $X = \{x^1, \dots, x^n, \dots, x^N\}$ into K clusters
  • Initialize cluster centers $c^i$, $i = 1, 2, \dots, K$ (K random $x^n$ from $X$)
  • Repeat:
    • For all $x^n$ in $X$:
      $$b_i^n = \begin{cases} 1 & x^n \text{ is closest to } c^i \\ 0 & \text{otherwise} \end{cases}$$
    • Update all $c^i$:
      $$c^i = \sum_{x^n} b_i^n x^n \Big/ \sum_{x^n} b_i^n$$
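A minimal NumPy sketch of the loop above; the toy data `X`, the cluster count `K`, and the iteration budget are illustrative assumptions:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """K-means as above: X is (N, d); returns centers c^i and assignments."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]  # K random x^n from X
    for _ in range(n_iters):
        # b_i^n = 1 for the center closest to x^n (stored here as an index)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # c^i = mean of the points currently assigned to cluster i
        for i in range(K):
            if np.any(assign == i):
                centers[i] = X[assign == i].mean(axis=0)
    return centers, assign
```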
Clustering
• Hierarchical Agglomerative Clustering (HAC)
  • Step 1: build a tree, repeatedly merging the closest pair of examples/clusters until a single root remains
  • Step 2: pick a threshold; cutting the tree at that height yields the clusters
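A sketch of both steps using SciPy; the toy data, the linkage method, and the distance threshold are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                            # toy data (assumption)
Z = linkage(X, method="average")                     # Step 1: build the tree
labels = fcluster(Z, t=0.5, criterion="distance")    # Step 2: cut at a threshold
```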
Distributed Representation
• Clustering: an object must belong to exactly one cluster. ("Gon is an Enhancer.")
• Distributed representation: describe the object by a vector over all the categories (the Nen types from Hunter × Hunter). "Gon is:"

  Enhancement     0.70
  Emission        0.25
  Transmutation   0.05
  Manipulation    0.00
  Conjuration     0.00
  Specialization  0.00
Distributed Representation

vector x (High Dim) → function → vector z (Low Dim)

• Feature selection: simply keep the informative dimensions. [Figure: data spread only along $x_2$; selecting $x_2$ and dropping $x_1$ is enough. This only works when some dimensions carry no information.]
• Principal component analysis (PCA): $z = Wx$ [Bishop, Chapter 12]
PCA

$$z = Wx$$

Reduce to 1-D: $z_1 = w^1 \cdot x$

Project all the data points $x$ onto $w^1$ to obtain a set of $z_1$. We want the variance of $z_1$ to be as large as possible, so that the projection spreads the data out instead of collapsing it:

$$\mathrm{Var}(z_1) = \frac{1}{N} \sum_{z_1} (z_1 - \bar{z}_1)^2, \qquad \lVert w^1 \rVert_2 = 1$$

[Figure: projecting the same data onto one direction gives large variance, onto another direction small variance]

Reduce to 2-D: also take $z_2 = w^2 \cdot x$. We again want the variance of $z_2$ to be as large as possible,

$$\mathrm{Var}(z_2) = \frac{1}{N} \sum_{z_2} (z_2 - \bar{z}_2)^2, \qquad \lVert w^2 \rVert_2 = 1,$$

subject to $w^1 \cdot w^2 = 0$ (otherwise the maximizer would just be $w^1$ again). Stacking the rows,

$$W = \begin{bmatrix} (w^1)^T \\ (w^2)^T \\ \vdots \end{bmatrix}$$

is an orthogonal matrix.
Warning of Math
PCA

$$z_1 = w^1 \cdot x$$

$$\bar{z}_1 = \frac{1}{N}\sum z_1 = \frac{1}{N}\sum w^1 \cdot x = w^1 \cdot \frac{1}{N}\sum x = w^1 \cdot \bar{x}$$

$$\mathrm{Var}(z_1) = \frac{1}{N}\sum_{z_1} (z_1 - \bar{z}_1)^2 = \frac{1}{N}\sum_x \big(w^1 \cdot x - w^1 \cdot \bar{x}\big)^2 = \frac{1}{N}\sum \big(w^1 \cdot (x - \bar{x})\big)^2$$

Using $(a \cdot b)^2 = (a^T b)^2 = a^T b\, a^T b = a^T b\,(a^T b)^T = a^T b\, b^T a$:

$$= \frac{1}{N}\sum (w^1)^T (x - \bar{x})(x - \bar{x})^T w^1 = (w^1)^T \left[ \frac{1}{N}\sum (x - \bar{x})(x - \bar{x})^T \right] w^1 = (w^1)^T \mathrm{Cov}(x)\, w^1$$

So: find $w^1$ maximizing $(w^1)^T S w^1$ subject to $\lVert w^1 \rVert_2^2 = (w^1)^T w^1 = 1$, where $S = \mathrm{Cov}(x)$.
Find $w^1$ maximizing $(w^1)^T S w^1$ subject to $(w^1)^T w^1 = 1$.

$S = \mathrm{Cov}(x)$ is symmetric and positive-semidefinite (non-negative eigenvalues).

Using a Lagrange multiplier [Bishop, Appendix E]:

$$g(w^1) = (w^1)^T S w^1 - \alpha\big((w^1)^T w^1 - 1\big)$$

Setting $\partial g(w^1)/\partial w_1^1 = 0$, $\partial g(w^1)/\partial w_2^1 = 0$, … gives

$$S w^1 - \alpha w^1 = 0 \quad\Rightarrow\quad S w^1 = \alpha w^1,$$

so $w^1$ is an eigenvector of $S$. Multiplying by $(w^1)^T$,

$$(w^1)^T S w^1 = \alpha\, (w^1)^T w^1 = \alpha,$$

so the variance being maximized equals $\alpha$: choose the maximum one.

$w^1$ is the eigenvector of the covariance matrix $S$ corresponding to the largest eigenvalue $\lambda_1$.
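A small NumPy check of this result; the toy data matrix `X` (one example per row) is an assumption:

```python
import numpy as np

X = np.random.rand(100, 6)              # toy data, N x d (assumption)
S = np.cov(X, rowvar=False)             # S = Cov(x)
eigvals, eigvecs = np.linalg.eigh(S)    # eigh: S is symmetric; eigenvalues ascending
w1 = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue
z1 = (X - X.mean(axis=0)) @ w1          # project the centered data onto w^1
print(np.var(z1, ddof=1), eigvals[-1])  # Var(z1) equals lambda_1
```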
Find $w^2$ maximizing $(w^2)^T S w^2$ subject to $(w^2)^T w^2 = 1$ and $(w^2)^T w^1 = 0$.

$$g(w^2) = (w^2)^T S w^2 - \alpha\big((w^2)^T w^2 - 1\big) - \beta\big((w^2)^T w^1 - 0\big)$$

Setting $\partial g(w^2)/\partial w_1^2 = 0$, $\partial g(w^2)/\partial w_2^2 = 0$, … gives

$$S w^2 - \alpha w^2 - \beta w^1 = 0$$

Left-multiplying by $(w^1)^T$:

$$(w^1)^T S w^2 - \alpha\, (w^1)^T w^2 - \beta\, (w^1)^T w^1 = 0$$

Here $(w^1)^T w^2 = 0$ and $(w^1)^T w^1 = 1$. The first term is a scalar, so

$$(w^1)^T S w^2 = \big((w^1)^T S w^2\big)^T = (w^2)^T S^T w^1 = (w^2)^T S w^1 = \lambda_1\, (w^2)^T w^1 = 0,$$

using $S w^1 = \lambda_1 w^1$. Therefore $\beta = 0$, and the condition becomes $S w^2 - \alpha w^2 = 0$, i.e. $S w^2 = \alpha w^2$.

$w^2$ is the eigenvector of the covariance matrix $S$ corresponding to the 2nd largest eigenvalue $\lambda_2$ (it cannot be $w^1$ again because of the orthogonality constraint).
PCA - decorrelation

$z = Wx$ decorrelates the features: $\mathrm{Cov}(z) = D$, a diagonal matrix. [Figure: correlated data in $(x_1, x_2)$ becomes axis-aligned in $(z_1, z_2)$ after PCA.]

$$\mathrm{Cov}(z) = \frac{1}{N}\sum (z - \bar{z})(z - \bar{z})^T = W S W^T, \qquad S = \mathrm{Cov}(x)$$

$$W S W^T = W S \begin{bmatrix} w^1 & \cdots & w^K \end{bmatrix} = W \begin{bmatrix} S w^1 & \cdots & S w^K \end{bmatrix} = W \begin{bmatrix} \lambda_1 w^1 & \cdots & \lambda_K w^K \end{bmatrix}$$

$$= \begin{bmatrix} \lambda_1 W w^1 & \cdots & \lambda_K W w^K \end{bmatrix} = \begin{bmatrix} \lambda_1 e_1 & \cdots & \lambda_K e_K \end{bmatrix} = D \quad \text{(diagonal matrix)}$$

since $W w^k = e_k$, the k-th standard basis vector (the rows of $W$ are orthonormal).
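A quick NumPy check of this property, continuing the toy setup above (`X` and `eigvecs` as before):

```python
K = 2
W = eigvecs[:, ::-1][:, :K].T        # rows are the top-K eigenvectors w^1, ..., w^K
Z = (X - X.mean(axis=0)) @ W.T       # z = W x for every centered example
print(np.round(np.cov(Z, rowvar=False), 3))  # ~diagonal: off-diagonals near 0
```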
End of Warning
PCA – Another Point of View

Basic components: every digit image is (roughly) a combination of basic strokes $u^1, u^2, u^3, u^4, u^5, \dots$

[Figure: a "7" decomposed as $1 \times u^1 + 1 \times u^3 + 1 \times u^5$; its coefficient vector is $(1, 0, 1, 0, 1)$]

$$x \approx c_1 u^1 + c_2 u^2 + \cdots + c_K u^K + \bar{x}$$

$x$: the pixels in a digit image; each $u^k$: a component; the coefficients $(c_1, c_2, \dots, c_K)$ represent the digit image.
PCA – Another Point of View

$$x - \bar{x} \approx c_1 u^1 + c_2 u^2 + \cdots + c_K u^K = \hat{x}$$

Reconstruction error: $\lVert (x - \bar{x}) - \hat{x} \rVert_2$. Find $u^1, \dots, u^K$ minimizing the total error:

$$L = \min_{u^1, \dots, u^K} \sum \left\lVert (x - \bar{x}) - \sum_{k=1}^{K} c_k u^k \right\rVert_2$$

PCA: $z = Wx$, i.e.

$$\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_K \end{bmatrix} = \begin{bmatrix} (w^1)^T \\ (w^2)^T \\ \vdots \\ (w^K)^T \end{bmatrix} x$$

The $\{w^1, w^2, \dots, w^K\}$ from PCA are the components $\{u^1, u^2, \dots, u^K\}$ minimizing $L$. Proof in [Bishop, Chapter 12.1.2].
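A NumPy sketch of this reconstruction, reusing `X`, `x_bar`, and the component matrix `W` from the checks above; the coefficients are $c_k = (x - \bar{x}) \cdot w^k$, as the Autoencoder slides below spell out:

```python
x_bar = X.mean(axis=0)
C = (X - x_bar) @ W.T                              # c_k = (x - x_bar) . w^k per example
X_hat = C @ W                                      # x_hat = sum_k c_k w^k
err = np.linalg.norm((X - x_bar) - X_hat, axis=1)  # per-example reconstruction error
print(err.mean())
```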
For the whole data set:

$$x^1 - \bar{x} \approx c_1^1 u^1 + c_2^1 u^2 + \cdots$$
$$x^2 - \bar{x} \approx c_1^2 u^1 + c_2^2 u^2 + \cdots$$
$$x^3 - \bar{x} \approx c_1^3 u^1 + c_2^3 u^2 + \cdots$$
$$\cdots$$

Collecting the centered examples as the columns of a matrix, we minimize the error of a factorization:

$$\underbrace{\begin{bmatrix} x^1 - \bar{x} & x^2 - \bar{x} & x^3 - \bar{x} & \cdots \end{bmatrix}}_{\text{Matrix } X} \;\approx\; \begin{bmatrix} u^1 & u^2 & \cdots \end{bmatrix} \begin{bmatrix} c_1^1 & c_1^2 & c_1^3 & \cdots \\ c_2^1 & c_2^2 & c_2^3 & \cdots \\ \vdots & \vdots & \vdots \end{bmatrix}$$

SVD gives the best rank-K factorization:

$$\underset{M \times N}{X} \;\approx\; \underset{M \times K}{U} \;\; \underset{K \times K}{\Sigma} \;\; \underset{K \times N}{V}$$

The K columns of U are a set of orthonormal eigenvectors corresponding to the K largest eigenvalues of $XX^T$. This is the solution of PCA.

SVD: http://speech.ee.ntu.edu.tw/~tlkagk/courses/LA_2016/Lecture/SVD.pdf
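A minimal NumPy sketch of PCA via SVD, matching the layout above (examples stored as columns); the toy sizes are assumptions:

```python
import numpy as np

A = np.random.rand(784, 1000)              # M x N: N examples as columns (assumption)
Xc = A - A.mean(axis=1, keepdims=True)     # subtract x_bar from every column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
K = 30
components = U[:, :K]                      # u^1, ..., u^K: eigenvectors of X X^T
codes = components.T @ Xc                  # K x N matrix of coefficients c_k^n
```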
Autoencoder

PCA looks like a neural network with one hidden layer (linear activation function).

If $\{w^1, w^2, \dots, w^K\}$ are the components $\{u^1, u^2, \dots, u^K\}$, then to minimize the reconstruction error take

$$c_k = (x - \bar{x}) \cdot w^k, \qquad \hat{x} = \sum_{k=1}^{K} c_k w^k$$

$K = 2$:

[Figure: a network with 3-dim input $x - \bar{x}$. Encoding: the hidden units $c_1, c_2$ are computed with weights $(w_1^1, w_2^1, w_3^1)$ and $(w_1^2, w_2^2, w_3^2)$. Decoding: the same weights map $(c_1, c_2)$ back to the output $(\hat{x}_1, \hat{x}_2, \hat{x}_3)$. The network is trained to minimize the error between $\hat{x}$ and $x - \bar{x}$, e.g. by gradient descent.]

It can be deep: Deep Autoencoder.
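A sketch of this network in PyTorch (the framework choice is an assumption; the slides do not fix one): a single tied weight matrix used to both encode and decode, trained by gradient descent:

```python
import torch

d, K = 3, 2
W = torch.randn(K, d, requires_grad=True)  # shared encode/decode weights w^k
opt = torch.optim.SGD([W], lr=0.01)

X = torch.randn(100, d)                    # toy data (assumption)
Xc = X - X.mean(dim=0)                     # x - x_bar

for _ in range(1000):
    C = Xc @ W.T                           # encode: c_k = (x - x_bar) . w^k
    X_hat = C @ W                          # decode: x_hat = sum_k c_k w^k
    loss = ((X_hat - Xc) ** 2).mean()      # reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note that gradient descent does not constrain the rows of W to be orthonormal, so it is not guaranteed to recover the exact PCA components, only (at best) the subspace they span.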
PCA - Pokémon
• Inspired from: https://www.kaggle.com/strakul5/d/abcsds/pokemon/principal-component-analysis-of-pokemon-data
• 800 Pokémon, 6 features for each (HP, Atk, Def, Sp Atk, Sp Def, Speed)
• How many principal components? Look at how much of the total variance each one explains:

$$\frac{\lambda_i}{\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 + \lambda_5 + \lambda_6}$$

          λ1     λ2     λ3     λ4     λ5     λ6
  ratio   0.45   0.18   0.13   0.12   0.07   0.04

Using 4 components is good enough.
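This ratio is what scikit-learn exposes as `explained_variance_ratio_`. A sketch, assuming the Kaggle CSV is available locally and uses these column names (adjust if yours differ):

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv("Pokemon.csv")   # file name/path is an assumption
feats = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]
pca = PCA(n_components=6).fit(df[feats])
print(pca.explained_variance_ratio_)   # roughly [0.45, 0.18, 0.13, 0.12, 0.07, 0.04]
```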
PCA - Pokémon

         HP     Atk    Def    Sp Atk  Sp Def  Speed   interpretation
  PC1    0.4    0.4    0.4    0.5     0.4     0.3     overall strength
  PC2    0.1    0.0    0.6   -0.3     0.2    -0.7     defense (sacrificing speed)
  PC3   -0.5   -0.6    0.1    0.3     0.6     0.1     special defense (sacrificing attack and HP)
  PC4    0.7   -0.4   -0.4    0.1     0.2    -0.3     strong vitality (high HP)
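These rows are the component loadings; with the scikit-learn sketch above they can be read off `pca.components_`:

```python
for i, pc in enumerate(pca.components_[:4], start=1):
    # eigenvectors are defined up to sign, so signs may be flipped vs. the table
    print(f"PC{i}:", dict(zip(feats, pc.round(1))))
```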
PCA - Pokémon
• http://140.112.21.35:2880/~tlkagk/pokemon/pca.html
• The code is modified from http://jkunst.com/r/pokemon-visualize-em-all/
PCA - MNIST

$$\text{image} = a_1 w^1 + a_2 w^2 + \cdots$$

30 components: [Figure: the 30 principal components of the digit images, the "eigen-digits"]
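A sketch of how such eigen-digits can be computed, assuming MNIST fetched through scikit-learn:

```python
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

X, _ = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
pca = PCA(n_components=30).fit(X)
eigen_digits = pca.components_.reshape(30, 28, 28)  # each row drawn as a 28x28 image
```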
PCA - Face

30 components: [Figure: the 30 principal components of the face images, the "eigen-faces"]

http://www.cs.unc.edu/~lazebnik/research/spring08/assignment3.html
Weakness of PCA
• Unsupervised: PCA ignores labels, so the maximum-variance projection can merge classes that a supervised method such as LDA would keep apart. [Figure: the same labeled data projected by PCA (classes overlap) and by LDA (classes separated)]
• Linear: PCA can only apply a linear projection; it cannot unfold a nonlinear manifold. [Figure: the S-curve manifold, from http://www.astroml.org/book_figures/chapter7/fig_S_manifold_PCA.html]

[Figure: MNIST, Pixel (28×28) → PCA (2) vs. Pixel (28×28) → t-SNE (2)]
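A sketch of the last comparison with scikit-learn; using the small 8×8 `digits` set as a stand-in for MNIST (to keep it fast) is an assumption:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)            # 8x8 digits stand in for 28x28 MNIST
Z_pca = PCA(n_components=2).fit_transform(X)
Z_tsne = TSNE(n_components=2).fit_transform(X)
# scatter-plot Z_pca and Z_tsne colored by y to reproduce the comparison
```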
Acknowledgement
• Thanks to 彭冲 for spotting an error in the cited material
• Thanks to Hsiang-Chih Cheng for spotting errors on the slides
Appendix
• http://4.bp.blogspot.com/_sHcZHRnxlLE/S9EpFXYjfvI/AAAAAAAABZ0/_oEQiaR3WVM/s640/dimensionality+reduction.jpg
• https://lvdmaaten.github.io/publications/papers/TR_Dimensionality_Reduction_Review_2009.pdf
