
1

Learning a Nonlinear Embedding by
Preserving Class Neighbourhood Structure
AISTATS `07 San Juan, Puerto Rico
Ruslan Salakhutdinov and Geoffrey E. Hinton
Presenter:
WooSung Choi
(ws_choi@korea.ac.kr)
DataKnow. Lab
Korea UNIV.

2

Background
(k-) Nearest Neighbor Query

3

kNN (k-Nearest Neighbor) Query
[Figure: 2-D example of a kNN query in the x-y plane]

4

kNN (k-Nearest Neighbor) Classification
Salakhutdinov, Ruslan, and Geoffrey E. Hinton. "Learning a nonlinear embedding by preserving class neighbourhood structure." International Conference on Artificial Intelligence and Statistics, 2007.
NN     Class
1-NN   6
2-NN   6
3-NN   6
4-NN   6
5-NN   0
Result of 5-NN classification: 6 (4 of the 5 nearest neighbours, 80%)
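As a concrete illustration of the classification rule above, here is a minimal kNN sketch in NumPy; the toy data, variable names, and the use of Euclidean distance in the raw input space are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def knn_classify(query, train_x, train_y, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    dists = np.linalg.norm(train_x - query, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # majority label

# Toy usage: 4 of the 5 nearest neighbours are class 6, one is class 0 -> predict 6
train_x = np.array([[0.0], [0.1], [0.2], [0.3], [5.0]])
train_y = np.array([6, 6, 6, 6, 0])
print(knn_classify(np.array([0.05]), train_x, train_y, k=5))  # -> 6
```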

5

Motivating Example
• MNIST
 Dimensionality: 28 x 28 = 784
 50,000 training images
 10,000 test images
• Error: 2.77%
• Query response: 108ms

6

Reality Check
• Curse of dimensionality
 [Qin Lv et al., Image Similarity Search with Compact Data Structures @ CIKM '04]

 Poor performance when the number of dimensions is high
[Roger Weber et al., A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces @ VLDB '98]

7

Locality Sensitive Hashing, Data Sensitive Hashing

Technique                  | Curse of dimensionality | Recall | Considers data distribution | Underlying approach
Scan                       | X (none)                | 1      | △                           | N/A
R-Tree-based solution      | O (severe)              | 1      | O                           | Index: tree
Locality Sensitive Hashing | △ (less severe)         | < 1    | X                           | Hashing + mathematics
Data Sensitive Hashing     | △ (less severe)         | < 1    | O                           | Hashing + machine learning

8

Abstract

9

Abstract
• How to pre-train and fine-tune a multilayer neural network (MNN)
 To learn a nonlinear transformation
 From the input space
 To a low-dimensional feature space
 Where kNN classification performs well
 Performance can be further improved using unlabeled data

10

Introduction

11

Notation
• Transformation to Low-Dim Feature Space
 Input vectors: $x_a \in X$
 Transformation function $f : X \to Y$
 Parameterized by $W$
 Output vectors: $f(x_a|W)$
• Similarity Measure
 Input vectors: $x_a, x_b \in X$
 Output: $\mathrm{DIST}[f(x_a|W), f(x_b|W)]$

12

Objective (informal)
• $a^{th}$ training case: $(x_a, c_a)$
 $x_a \in X$
 $c_a \in \{1, 2, 3, \ldots, C\}$
• Goal
 if $c_a = c_b$
 then $\mathrm{DIST}[f(x_a|W), f(x_b|W)]$ should be smaller
 else if $c_a \neq c_b$
 then $\mathrm{DIST}[f(x_a|W), f(x_b|W)]$ should be greater

13

Objective (formal)
• $a^{th}$ training case: $(x_a, c_a)$
 $x_a \in X$
 $c_a \in \{1, 2, 3, \ldots, C\}$
• Goal: Maximizing $O_{NCA}$
 $O_{NCA} = \sum_{a=1}^{N} \sum_{b:\, c_a = c_b} \frac{\exp(-d_{ab})}{\sum_{z \neq a} \exp(-d_{az})}$
 where $d_{ab} = \| f(x_a|W) - f(x_b|W) \|^2$
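A small NumPy sketch of this objective, assuming the embedded points $f(x_a|W)$ are already available as the rows of a matrix; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def o_nca(embedded, labels):
    """O_NCA = sum_a sum_{b: c_a=c_b, b!=a} exp(-d_ab) / sum_{z!=a} exp(-d_az),
    with d_ab the squared Euclidean distance between embedded points a and b."""
    diff = embedded[:, None, :] - embedded[None, :, :]      # pairwise differences
    d = np.sum(diff ** 2, axis=-1)                          # squared distances d_ab
    kernel = np.exp(-d)
    np.fill_diagonal(kernel, 0.0)                           # exclude z = a from every sum
    p = kernel / kernel.sum(axis=1, keepdims=True)          # p_ab
    same_class = labels[:, None] == labels[None, :]
    np.fill_diagonal(same_class, False)
    return np.sum(p[same_class])

# Example: three points of class 1 close together, one point of class 2 far away;
# most selection probability stays in-class, so the value is close to 3.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 3.0]])
print(o_nca(emb, np.array([1, 1, 1, 2])))
```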

14

Related Works: Linear Transformation
• Linear Transformation [8, 9, 18]
 $f(x_a|W) = W x_a$
 Weakness
 Limited number of parameters
 If $f: R^{784} \to R^{30}$, then $W$ is a 30-by-784 matrix (23,520 parameters)
 In this paper: 785*500 + 501*500 + 501*2000 + 2001*30 parameters
 Cannot model higher-order correlations
• Deep Autoencoder [14], DBN[12]

15

In this paper
• Non-Linear Transformation
 Overview
 Pre-training: similar to [12, 14]
 Stack of RBMs (a rough forward-pass sketch of the resulting encoder follows below)
 RBM1: 784-500
 RBM2: 500-500
 RBM3: 500-2000
 RBM4: 2000-30
 Fine-tuning: backpropagation
 To maximize the objective function
 Maximize the expected number of correctly classified points on the training data
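A minimal sketch of the forward pass through the 784-500-500-2000-30 encoder described above, in NumPy. Random weights stand in for the RBM pre-trained ones, every layer (including the 30-unit code layer) is treated as sigmoid for simplicity, and all names are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Layer sizes of the stacked-RBM encoder described on the slide
sizes = [784, 500, 500, 2000, 30]
rng = np.random.default_rng(0)
# In the paper these weights come from greedy RBM pre-training; random values are placeholders
weights = [rng.normal(0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def encode(x):
    """Map a batch of 784-d inputs to the 30-d code layer."""
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(h @ W + b)
    return h

codes = encode(rng.random((5, 784)))   # five fake MNIST-sized inputs
print(codes.shape)                     # (5, 30)
```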

16

Objective (formal)
• $a^{th}$ training case: $(x_a, c_a)$
 $x_a \in X$
 $c_a \in \{1, 2, 3, \ldots, C\}$
• Goal: Maximizing $O_{NCA}$
 $O_{NCA} = \sum_{a=1}^{N} \sum_{b:\, c_a = c_b} \frac{\exp(-d_{ab})}{\sum_{z \neq a} \exp(-d_{az})}$
 where $d_{ab} = \| f(x_a|W) - f(x_b|W) \|^2$

17

2. Learning Nonlinear NCA
Neighbourhood Component Analysis

18

Notation
Symbol | Definition
$a = 1, 2, 3, \ldots, N$ | Index
$x_a \in R^d$ | $a^{th}$ training vector (d-dimensional data)
$c_a \in \{1, 2, \ldots, C\}$ | Label of the $a^{th}$ training vector
$(x_a, c_a)$ | Labeled training case
$f(x_a|W)$ | Output of the multilayer neural network parameterized by $W$
$d_{ab} = \| f(x_a|W) - f(x_b|W) \|^2$ | Euclidean distance metric
$p_{ab} = \frac{\exp(-d_{ab})}{\sum_{z \neq a} \exp(-d_{az})}$ | The probability that point a selects b as its neighbour in the transformed feature space

Worked example (distances from point a to points a, b, c, d, e):
$d_{aa}, d_{ab}, d_{ac}, d_{ad}, d_{ae}$ = 0, 1, 5, 7, 7
$e^{-d_{aa}}, e^{-d_{ab}}, e^{-d_{ac}}, e^{-d_{ad}}, e^{-d_{ae}}$ = 1, 0.3678, 0.0497, 0.002, 0.002
$p_{aa}, p_{ab}, p_{ac}, p_{ad}, p_{ae}$ = 0, 0.88, 0.11, 0, 0
$p_{ab} = \frac{0.3678}{0.3678 + 0.0497 + 0.0002 + 0.0002} \approx 0.88$
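The worked example can be reproduced with a short sketch. The slide's $e^{-d}$ row (1, 0.3678, 0.0497, ...) corresponds to distances of roughly 0, 1, 3, 7, 7 (since $e^{-3} \approx 0.0497$), so those values are assumed here so the computed probabilities match the quoted 0.88 / 0.11.

```python
import numpy as np

# Distances from point a to points a, b, c, d, e (assumed values, see note above)
d = np.array([0.0, 1.0, 3.0, 7.0, 7.0])
kernel = np.exp(-d)
kernel[0] = 0.0                     # z != a: point a never selects itself
p = kernel / kernel.sum()           # p_ab = exp(-d_ab) / sum_{z != a} exp(-d_az)
print(np.round(p, 2))               # [0.   0.88 0.12 0.   0.  ]
```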

19

Notation
Symbol | Definition
$p_{ab} = \frac{\exp(-d_{ab})}{\sum_{z \neq a} \exp(-d_{az})}$ | The probability that point a selects b as its neighbour in the transformed feature space
$p(c_a = k) = \sum_{b:\, c_b = k} p_{ab}$ | The probability that point a belongs to class k
$O_{NCA} = \sum_{a=1}^{N} \sum_{b:\, c_a = c_b} \frac{\exp(-d_{ab})}{\sum_{z \neq a} \exp(-d_{az})}$ | The expected number of correctly classified points on the training data

Worked example (continuing the previous slide):
$c_a, c_b, c_c, c_d, c_e$ = N/A, 3, 3, 2, 1
$p_{aa}, p_{ab}, p_{ac}, p_{ad}, p_{ae}$ = 0, 0.88, 0.11, 0, 0
$p(c_a = 3) = 0.99$, $p(c_a = 2) = 0$, $p(c_a = 1) = 0$
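Continuing the same toy numbers, a one-loop sketch of the class-probability aggregation $p(c_a = k) = \sum_{b:\, c_b = k} p_{ab}$; the placeholder label -1 for the query point is an illustrative assumption.

```python
import numpy as np

p_ab   = np.array([0.0, 0.88, 0.11, 0.0, 0.0])   # p_aa .. p_ae from the previous slide
labels = np.array([-1, 3, 3, 2, 1])              # c_a unknown (-1), c_b=3, c_c=3, c_d=2, c_e=1

for k in (1, 2, 3):
    print(k, p_ab[labels == k].sum())            # p(c_a=3) = 0.88 + 0.11 = 0.99
```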

20

Learning Rule
• Backpropagation to maximize
 $O_{NCA} = \sum_{a=1}^{N} \sum_{b:\, c_a = c_b} \frac{\exp(-d_{ab})}{\sum_{z \neq a} \exp(-d_{az})}$
• Derivation
 $\frac{\partial}{\partial f(x_a|W)} \left[ \sum_{b:\, c_a = c_b} \frac{\exp(-d_{ab}^2)}{\sum_{z \neq a} \exp(-d_{az}^2)} \right]$
 Note the change of notation: above, $d_{ab} = \| f(x_a|W) - f(x_b|W) \|^2$; from here on, $d_{ab} = \| f(x_a|W) - f(x_b|W) \|$, so the objective is written with $\exp(-d_{ab}^2)$.
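Before deriving the analytic form, the gradient of $O_{NCA}$ with respect to each embedded point can be checked numerically. Below is a central-difference sketch; it restates the objective from the earlier sketch (using squared distances, which gives the same value as $\exp(-d_{ab}^2)$ with the unsquared norm), and all names are illustrative.

```python
import numpy as np

def o_nca(embedded, labels):
    # O_NCA as sketched earlier: sum of in-class selection probabilities
    d = np.sum((embedded[:, None, :] - embedded[None, :, :]) ** 2, axis=-1)
    k = np.exp(-d)
    np.fill_diagonal(k, 0.0)
    p = k / k.sum(axis=1, keepdims=True)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)
    return np.sum(p[same])

def nca_grad_numeric(embedded, labels, eps=1e-5):
    """Central-difference estimate of dO_NCA / df(x_a|W) for every embedded point."""
    grad = np.zeros_like(embedded)
    for a in range(embedded.shape[0]):
        for j in range(embedded.shape[1]):
            plus, minus = embedded.copy(), embedded.copy()
            plus[a, j] += eps
            minus[a, j] -= eps
            grad[a, j] = (o_nca(plus, labels) - o_nca(minus, labels)) / (2 * eps)
    return grad

emb = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 3.0]])
print(nca_grad_numeric(emb, np.array([1, 1, 1, 2])))
```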

21

Learning Rule
• Backpropagation to maximize
 $O_{NCA} = \sum_{a=1}^{N} \sum_{b:\, c_a = c_b} \frac{\exp(-d_{ab})}{\sum_{z \neq a} \exp(-d_{az})}$
• Derivation
 Standard backpropagation
 Output layer: $\frac{\partial f(x_a|W)}{\partial W_{ij}} = \frac{\partial net_j}{\partial W_{ij}} \frac{\partial f(x_a|W)}{\partial net_j} = o_i \, net_j (1 - net_j)$
 Inner layer: $\frac{\partial f(x_a|W)}{\partial W_{ki}} = \sum_i \frac{\partial net_i}{\partial W_{ki}} \frac{\partial net_j}{\partial net_i} \frac{\partial f(x_a|W)}{\partial net_j} = o_k \sum_i w_{ij} \frac{\partial f(x_a|W)}{\partial net_j}$
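As a shape-level sketch of these two chain-rule steps for a one-hidden-layer sigmoid network; the weight values, layer sizes, and the use of the usual sigmoid derivative $\sigma(net)(1-\sigma(net))$ are assumptions for illustration, not values from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One hidden layer of sigmoids feeding one sigmoid output (placeholder weights and input)
x  = np.array([0.5, -0.3, 1.0])          # o_k: activities entering the hidden layer
W1 = np.full((3, 4), 0.1)                # w_ki: input -> hidden weights
W2 = np.full((4, 1), 0.2)                # w_ij: hidden -> output weights

h   = sigmoid(x @ W1)                    # hidden activities o_i
out = sigmoid(h @ W2)                    # network output f(x|W)

# Output layer: df/dW_ij = o_i * sigma'(net_j)
delta_out = out * (1 - out)
grad_W2 = np.outer(h, delta_out)

# Inner layer: df/dW_ki = o_k * sigma'(net_i) * sum_j w_ij * df/dnet_j
delta_hidden = h * (1 - h) * (W2 @ delta_out)
grad_W1 = np.outer(x, delta_hidden)
print(grad_W2.shape, grad_W1.shape)      # (4, 1) (3, 4)
```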

22

Learning Rule
• Backpropagation to maximize
 $O_{NCA} = \sum_{a=1}^{N} \sum_{b:\, c_a = c_b} \frac{\exp(-d_{ab})}{\sum_{z \neq a} \exp(-d_{az})}$
• Derivation
 Weight update: $\Delta w_{ij} = \alpha \frac{\partial O_{NCA}}{\partial W}$

23

Details
• Pre-training
 Mini-batches, each containing 100 cases
 Epochs: 50
• Fine-tuning
 Method: conjugate gradients on larger mini-batches of 5,000, with three line searches performed for each mini-batch
 Epochs: 50
• Dataset
 60,000 training images
 10,000 for validation

24

Experiment

25

Result

26

Result

27

Appendix
Regularized Nonlinear NCA

28

Regularized Nonlinear NCA

29

Application
• Learn compact binary codes that allow efficient retrieval
 GIST descriptor + Locality Sensitive Hashing scheme + Nonlinear NCA
 Dataset: LabelMe
 22,000 images
 Labels: {human, woman, man, etc.}
Torralba, Antonio, Rob Fergus, and Yair Weiss. "Small codes and large image databases for recognition." IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
http://labelme2.csail.mit.edu/Release3.0/browserTools/php/publications.php

30

Neural Network
Toy Example: AND gate, XOR gate

31

AND gate
[Diagram: a single sigmoid unit z with inputs x, y and a constant bias input 1, and weights $w_0, w_1, w_2$]
Truth table:
x y t
0 0 0
0 1 0
1 0 0
1 1 1
$z = x \cdot w_0 + y \cdot w_1 + 1 \cdot w_2$
$sigm(x) = \frac{1}{1 + e^{-x}}$
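A tiny sketch of this unit in NumPy; the particular weight values are hand-picked assumptions that realize the AND truth table, not values from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hand-picked weights (assumption): the bias keeps the unit below 0.5 unless both inputs are 1
w0, w1, w2 = 10.0, 10.0, -15.0          # weights on x, y and the constant-1 bias input

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z = x * w0 + y * w1 + 1 * w2        # z = x*w0 + y*w1 + 1*w2, as on the slide
    print(x, y, int(sigmoid(z) > 0.5))  # 0 0 0 / 0 1 0 / 1 0 0 / 1 1 1
```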

32

XOR gate
[Diagram: two hidden sigmoid units $z_1$ and $z_2$, each fed by x, y and a constant bias input 1 with weights $(w_{00}, w_{01}, w_{02})$ and $(w_{10}, w_{11}, w_{12})$; an output sigmoid unit $z_3$ fed by $sigm(z_1)$, $sigm(z_2)$ and a bias with weights $(w_{20}, w_{21}, w_{22})$]
Truth table:
x y t
0 0 0
0 1 1
1 0 1
1 1 0
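A minimal sketch of this two-layer network in NumPy. The weights are hand-picked assumptions, not from the slides: unit $z_1$ acts as OR, unit $z_2$ as NAND, and the output unit $z_3$ ANDs them together, which yields XOR. A training algorithm for these gates (next slide) would instead learn the nine weights with the backpropagation update $\Delta w = \alpha \, \partial E / \partial w$ sketched earlier.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hand-picked weights (assumption): z1 = OR, z2 = NAND, z3 = AND(z1, z2) -> XOR
w00, w01, w02 = 10.0, 10.0, -5.0      # x, y, bias -> z1  (OR)
w10, w11, w12 = -10.0, -10.0, 15.0    # x, y, bias -> z2  (NAND)
w20, w21, w22 = 10.0, 10.0, -15.0     # z1, z2, bias -> z3 (AND)

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z1 = sigmoid(x * w00 + y * w01 + 1 * w02)
    z2 = sigmoid(x * w10 + y * w11 + 1 * w12)
    z3 = sigmoid(z1 * w20 + z2 * w21 + 1 * w22)
    print(x, y, int(z3 > 0.5))        # 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0
```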

33

Implementations
• Toy example: Training Algorithm for logic gate
• NLNCA for MNIST
