
1

Learning a Nonlinear Embedding by
Preserving Class Neighbourhood Structure
AISTATS `07 San Juan, Puerto Rico
Ruslan Salakhutdinov and Geoffrey E. Hinton
Presenter:
WooSung Choi
(ws_choi@korea.ac.kr)
DataKnow. Lab
Korea UNIV.

2

Background
(k-) Nearest Neighbor Query

3

kNN (k-Nearest Neighbor) Query
[Figure: 2-D example of a kNN query in the x-y plane]

4

kNN (k-Nearest Neighbor) Classification
Salakhutdinov, Ruslan, and Geoffrey E. Hinton. "Learning a nonlinear embedding by preserving class neighbourhood structure." International Conference on Artificial Intelligence and Statistics, 2007.
NN     Class
1-NN   6
2-NN   6
3-NN   6
4-NN   6
5-NN   0
Result of 5-NN classification: 6 (4 of the 5 nearest neighbours, 80%)
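As a concrete illustration of the classification rule above, here is a minimal kNN sketch in NumPy; the toy data, variable names, and the use of Euclidean distance in the raw input space are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def knn_classify(query, train_x, train_y, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    dists = np.linalg.norm(train_x - query, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # majority label

# Toy usage: 4 of the 5 nearest neighbours are class 6, one is class 0 -> predict 6
train_x = np.array([[0.0], [0.1], [0.2], [0.3], [5.0]])
train_y = np.array([6, 6, 6, 6, 0])
print(knn_classify(np.array([0.05]), train_x, train_y, k=5))  # -> 6
```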

5

Motivating Example
• MNIST
 Dimensionality: 28 x 28 = 784
 50,000 training images
 10,000 test images
• Error: 2.77%
• Query response: 108ms

6

Reality Check
• Curse of dimensionality
 [Qin Lv et al., Image Similarity Search with Compact Data Structures @ CIKM '04]

 Poor performance when the number of dimensions is high
[Roger Weber et al., A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces @ VLDB '98]

7

Locality Sensitive Hashing, Data Sensitive Hashing

Technique                  | Curse of dimensionality | Recall | Considers data distribution | Underlying approach
Scan                       | X (none)                | 1      | △                           | N/A
R-Tree-based solution      | O (severe)              | 1      | O                           | Index: tree
Locality Sensitive Hashing | △ (less severe)         | < 1    | X                           | Hashing + mathematics
Data Sensitive Hashing     | △ (less severe)         | < 1    | O                           | Hashing + machine learning

8

Abstract

9

Abstract
• How to pre-train and fine-tune a multilayer neural network (MNN)
 To learn a nonlinear transformation
 From the input space
 To a low-dimensional feature space
 Where kNN classification performs well
 Performance can be further improved using unlabeled data

10

Introduction

11

Notation
• Transformation to Low-Dim Feature Space
 Input vectors: $x_a \in X$
 Transformation function $f : X \to Y$
 Parameterized by $W$
 Output vectors: $f(x_a|W)$
• Similarity Measure
 Input vectors: $x_a, x_b \in X$
 Output: $\mathrm{DIST}[f(x_a|W), f(x_b|W)]$

12

Objective (informal)
• $a^{th}$ training case: $(x_a, c_a)$
 $x_a \in X$
 $c_a \in \{1, 2, 3, \ldots, C\}$
• Goal
 if $c_a = c_b$
 then $\mathrm{DIST}[f(x_a|W), f(x_b|W)]$ should be smaller
 else if $c_a \neq c_b$
 then $\mathrm{DIST}[f(x_a|W), f(x_b|W)]$ should be greater

13

Objective (formal)
• $a^{th}$ training case: $(x_a, c_a)$
 $x_a \in X$
 $c_a \in \{1, 2, 3, \ldots, C\}$
• Goal: Maximizing $O_{NCA}$
 $O_{NCA} = \sum_{a=1}^{N} \sum_{b:\, c_a = c_b} \frac{\exp(-d_{ab})}{\sum_{z \neq a} \exp(-d_{az})}$
 where $d_{ab} = \| f(x_a|W) - f(x_b|W) \|^2$
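A small NumPy sketch of this objective, assuming the embedded points $f(x_a|W)$ are already available as the rows of a matrix; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def o_nca(embedded, labels):
    """O_NCA = sum_a sum_{b: c_a=c_b, b!=a} exp(-d_ab) / sum_{z!=a} exp(-d_az),
    with d_ab the squared Euclidean distance between embedded points a and b."""
    diff = embedded[:, None, :] - embedded[None, :, :]      # pairwise differences
    d = np.sum(diff ** 2, axis=-1)                          # squared distances d_ab
    kernel = np.exp(-d)
    np.fill_diagonal(kernel, 0.0)                           # exclude z = a from every sum
    p = kernel / kernel.sum(axis=1, keepdims=True)          # p_ab
    same_class = labels[:, None] == labels[None, :]
    np.fill_diagonal(same_class, False)
    return np.sum(p[same_class])

# Example: three points of class 1 close together, one point of class 2 far away;
# most selection probability stays in-class, so the value is close to 3.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 3.0]])
print(o_nca(emb, np.array([1, 1, 1, 2])))
```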

14

Related Works: Linear Transformation
• Linear Transformation [8, 9, 18]
 $f(x_a|W) = W x_a$
 Weakness
 Limited number of parameters
 If $f: R^{784} \to R^{30}$, then $W$ is a 30-by-784 matrix (23,520 parameters)
 In this paper: 785*500 + 501*500 + 501*2000 + 2001*30 parameters
 Cannot model higher-order correlations
• Deep Autoencoder [14], DBN[12]

15

In this paper
• Non-Linear Transformation
 Overview
 Pre-training: similar to [12, 14]
 Stack of RBMs (a rough forward-pass sketch of the resulting encoder follows below)
 RBM1: 784-500
 RBM2: 500-500
 RBM3: 500-2000
 RBM4: 2000-30
 Fine-tuning: backpropagation
 To maximize the objective function
 Maximize the expected number of correctly classified points on the training data
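A minimal sketch of the forward pass through the 784-500-500-2000-30 encoder described above, in NumPy. Random weights stand in for the RBM pre-trained ones, every layer (including the 30-unit code layer) is treated as sigmoid for simplicity, and all names are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Layer sizes of the stacked-RBM encoder described on the slide
sizes = [784, 500, 500, 2000, 30]
rng = np.random.default_rng(0)
# In the paper these weights come from greedy RBM pre-training; random values are placeholders
weights = [rng.normal(0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def encode(x):
    """Map a batch of 784-d inputs to the 30-d code layer."""
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(h @ W + b)
    return h

codes = encode(rng.random((5, 784)))   # five fake MNIST-sized inputs
print(codes.shape)                     # (5, 30)
```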

16

Objective (formal)
• $a^{th}$ training case: $(x_a, c_a)$
 $x_a \in X$
 $c_a \in \{1, 2, 3, \ldots, C\}$
• Goal: Maximizing $O_{NCA}$
 $O_{NCA} = \sum_{a=1}^{N} \sum_{b:\, c_a = c_b} \frac{\exp(-d_{ab})}{\sum_{z \neq a} \exp(-d_{az})}$
 where $d_{ab} = \| f(x_a|W) - f(x_b|W) \|^2$

17

2. Learning Nonlinear NCA
Neighbourhood Component Analysis

18

Notation
Symbol | Definition
$a = 1, 2, 3, \ldots, N$ | Index
$x_a \in R^d$ | $a^{th}$ training vector (d-dimensional data)
$c_a \in \{1, 2, \ldots, C\}$ | Label of the $a^{th}$ training vector
$(x_a, c_a)$ | Labeled training case
$f(x_a|W)$ | Output of the multilayer neural network parameterized by $W$
$d_{ab} = \| f(x_a|W) - f(x_b|W) \|^2$ | Euclidean distance metric
$p_{ab} = \frac{\exp(-d_{ab})}{\sum_{z \neq a} \exp(-d_{az})}$ | The probability that point a selects b as its neighbour in the transformed feature space

Worked example (distances from point a to points a, b, c, d, e):
$d_{aa}, d_{ab}, d_{ac}, d_{ad}, d_{ae}$ = 0, 1, 5, 7, 7
$e^{-d_{aa}}, e^{-d_{ab}}, e^{-d_{ac}}, e^{-d_{ad}}, e^{-d_{ae}}$ = 1, 0.3678, 0.0497, 0.002, 0.002
$p_{aa}, p_{ab}, p_{ac}, p_{ad}, p_{ae}$ = 0, 0.88, 0.11, 0, 0
$p_{ab} = \frac{0.3678}{0.3678 + 0.0497 + 0.0002 + 0.0002} \approx 0.88$
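The worked example can be reproduced with a short sketch. The slide's $e^{-d}$ row (1, 0.3678, 0.0497, ...) corresponds to distances of roughly 0, 1, 3, 7, 7 (since $e^{-3} \approx 0.0497$), so those values are assumed here so the computed probabilities match the quoted 0.88 / 0.11.

```python
import numpy as np

# Distances from point a to points a, b, c, d, e (assumed values, see note above)
d = np.array([0.0, 1.0, 3.0, 7.0, 7.0])
kernel = np.exp(-d)
kernel[0] = 0.0                     # z != a: point a never selects itself
p = kernel / kernel.sum()           # p_ab = exp(-d_ab) / sum_{z != a} exp(-d_az)
print(np.round(p, 2))               # [0.   0.88 0.12 0.   0.  ]
```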

19

Notation
Symbol | Definition
$p_{ab} = \frac{\exp(-d_{ab})}{\sum_{z \neq a} \exp(-d_{az})}$ | The probability that point a selects b as its neighbour in the transformed feature space
$p(c_a = k) = \sum_{b:\, c_b = k} p_{ab}$ | The probability that point a belongs to class k
$O_{NCA} = \sum_{a=1}^{N} \sum_{b:\, c_a = c_b} \frac{\exp(-d_{ab})}{\sum_{z \neq a} \exp(-d_{az})}$ | The expected number of correctly classified points on the training data

Worked example (continuing the previous slide):
$c_a, c_b, c_c, c_d, c_e$ = N/A, 3, 3, 2, 1
$p_{aa}, p_{ab}, p_{ac}, p_{ad}, p_{ae}$ = 0, 0.88, 0.11, 0, 0
$p(c_a = 3) = 0.99$, $p(c_a = 2) = 0$, $p(c_a = 1) = 0$
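Continuing the same toy numbers, a one-loop sketch of the class-probability aggregation $p(c_a = k) = \sum_{b:\, c_b = k} p_{ab}$; the placeholder label -1 for the query point is an illustrative assumption.

```python
import numpy as np

p_ab   = np.array([0.0, 0.88, 0.11, 0.0, 0.0])   # p_aa .. p_ae from the previous slide
labels = np.array([-1, 3, 3, 2, 1])              # c_a unknown (-1), c_b=3, c_c=3, c_d=2, c_e=1

for k in (1, 2, 3):
    print(k, p_ab[labels == k].sum())            # p(c_a=3) = 0.88 + 0.11 = 0.99
```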

20

Learning Rule
• Backpropagation to maximize
 $O_{NCA} = \sum_{a=1}^{N} \sum_{b:\, c_a = c_b} \frac{\exp(-d_{ab})}{\sum_{z \neq a} \exp(-d_{az})}$
• Derivation
 $\frac{\partial}{\partial f(x_a|W)} \left[ \sum_{b:\, c_a = c_b} \frac{\exp(-d_{ab}^2)}{\sum_{z \neq a} \exp(-d_{az}^2)} \right]$
 Note the change of notation: above, $d_{ab} = \| f(x_a|W) - f(x_b|W) \|^2$; from here on, $d_{ab} = \| f(x_a|W) - f(x_b|W) \|$, so the objective is written with $\exp(-d_{ab}^2)$.
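Before deriving the analytic form, the gradient of $O_{NCA}$ with respect to each embedded point can be checked numerically. Below is a central-difference sketch; it restates the objective from the earlier sketch (using squared distances, which gives the same value as $\exp(-d_{ab}^2)$ with the unsquared norm), and all names are illustrative.

```python
import numpy as np

def o_nca(embedded, labels):
    # O_NCA as sketched earlier: sum of in-class selection probabilities
    d = np.sum((embedded[:, None, :] - embedded[None, :, :]) ** 2, axis=-1)
    k = np.exp(-d)
    np.fill_diagonal(k, 0.0)
    p = k / k.sum(axis=1, keepdims=True)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)
    return np.sum(p[same])

def nca_grad_numeric(embedded, labels, eps=1e-5):
    """Central-difference estimate of dO_NCA / df(x_a|W) for every embedded point."""
    grad = np.zeros_like(embedded)
    for a in range(embedded.shape[0]):
        for j in range(embedded.shape[1]):
            plus, minus = embedded.copy(), embedded.copy()
            plus[a, j] += eps
            minus[a, j] -= eps
            grad[a, j] = (o_nca(plus, labels) - o_nca(minus, labels)) / (2 * eps)
    return grad

emb = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 3.0]])
print(nca_grad_numeric(emb, np.array([1, 1, 1, 2])))
```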

21

Learning Rule
• Backpropagation to maximize
 $O_{NCA} = \sum_{a=1}^{N} \sum_{b:\, c_a = c_b} \frac{\exp(-d_{ab})}{\sum_{z \neq a} \exp(-d_{az})}$
• Derivation
 Standard backpropagation
 Output layer: $\frac{\partial f(x_a|W)}{\partial W_{ij}} = \frac{\partial net_j}{\partial W_{ij}} \frac{\partial f(x_a|W)}{\partial net_j} = o_i \, net_j (1 - net_j)$
 Inner layer: $\frac{\partial f(x_a|W)}{\partial W_{ki}} = \sum_i \frac{\partial net_i}{\partial W_{ki}} \frac{\partial net_j}{\partial net_i} \frac{\partial f(x_a|W)}{\partial net_j} = o_k \sum_i w_{ij} \frac{\partial f(x_a|W)}{\partial net_j}$
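As a shape-level sketch of these two chain-rule steps for a one-hidden-layer sigmoid network; the weight values, layer sizes, and the use of the usual sigmoid derivative $\sigma(net)(1-\sigma(net))$ are assumptions for illustration, not values from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One hidden layer of sigmoids feeding one sigmoid output (placeholder weights and input)
x  = np.array([0.5, -0.3, 1.0])          # o_k: activities entering the hidden layer
W1 = np.full((3, 4), 0.1)                # w_ki: input -> hidden weights
W2 = np.full((4, 1), 0.2)                # w_ij: hidden -> output weights

h   = sigmoid(x @ W1)                    # hidden activities o_i
out = sigmoid(h @ W2)                    # network output f(x|W)

# Output layer: df/dW_ij = o_i * sigma'(net_j)
delta_out = out * (1 - out)
grad_W2 = np.outer(h, delta_out)

# Inner layer: df/dW_ki = o_k * sigma'(net_i) * sum_j w_ij * df/dnet_j
delta_hidden = h * (1 - h) * (W2 @ delta_out)
grad_W1 = np.outer(x, delta_hidden)
print(grad_W2.shape, grad_W1.shape)      # (4, 1) (3, 4)
```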

22

Learning Rule
• Backpropagation to maximize
 $O_{NCA} = \sum_{a=1}^{N} \sum_{b:\, c_a = c_b} \frac{\exp(-d_{ab})}{\sum_{z \neq a} \exp(-d_{az})}$
• Derivation
 Weight update: $\Delta w_{ij} = \alpha \frac{\partial O_{NCA}}{\partial W}$

23

Details
• Pre-training
 Mini-batches, each containing 100 cases
 Epochs: 50
• Fine-tuning
 Method: conjugate gradients on larger mini-batches of 5,000, with three line searches performed for each mini-batch
 Epochs: 50
• Dataset
 60,000 training images
 10,000 for validation

24

Experiment

25

Result

26

Result

27

Appendix
Regularized Nonlinear NCA

28

Regularized Nonlinear NCA

29

Application
• Learn compact binary codes that allow efficient retrieval
 GIST descriptor + Locality Sensitive Hashing scheme + Nonlinear NCA
 Dataset: LabelMe
 22,000 images
 Labels: {human, woman, man, etc.}
Torralba, Antonio, Rob Fergus, and Yair Weiss. "Small codes and large image databases for recognition." IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
http://labelme2.csail.mit.edu/Release3.0/browserTools/php/publications.php

30

Neural Network
Toy Example: AND gate, XOR gate

31

AND gate
[Diagram: a single sigmoid unit z with inputs x, y and a constant bias input 1, and weights $w_0, w_1, w_2$]
Truth table:
x y t
0 0 0
0 1 0
1 0 0
1 1 1
$z = x \cdot w_0 + y \cdot w_1 + 1 \cdot w_2$
$sigm(x) = \frac{1}{1 + e^{-x}}$
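A tiny sketch of this unit in NumPy; the particular weight values are hand-picked assumptions that realize the AND truth table, not values from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hand-picked weights (assumption): the bias keeps the unit below 0.5 unless both inputs are 1
w0, w1, w2 = 10.0, 10.0, -15.0          # weights on x, y and the constant-1 bias input

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z = x * w0 + y * w1 + 1 * w2        # z = x*w0 + y*w1 + 1*w2, as on the slide
    print(x, y, int(sigmoid(z) > 0.5))  # 0 0 0 / 0 1 0 / 1 0 0 / 1 1 1
```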

32

XOR gate
[Diagram: two hidden sigmoid units $z_1$ and $z_2$, each fed by x, y and a constant bias input 1 with weights $(w_{00}, w_{01}, w_{02})$ and $(w_{10}, w_{11}, w_{12})$; an output sigmoid unit $z_3$ fed by $sigm(z_1)$, $sigm(z_2)$ and a bias with weights $(w_{20}, w_{21}, w_{22})$]
Truth table:
x y t
0 0 0
0 1 1
1 0 1
1 1 0
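A minimal sketch of this two-layer network in NumPy. The weights are hand-picked assumptions, not from the slides: unit $z_1$ acts as OR, unit $z_2$ as NAND, and the output unit $z_3$ ANDs them together, which yields XOR. A training algorithm for these gates (next slide) would instead learn the nine weights with the backpropagation update $\Delta w = \alpha \, \partial E / \partial w$ sketched earlier.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hand-picked weights (assumption): z1 = OR, z2 = NAND, z3 = AND(z1, z2) -> XOR
w00, w01, w02 = 10.0, 10.0, -5.0      # x, y, bias -> z1  (OR)
w10, w11, w12 = -10.0, -10.0, 15.0    # x, y, bias -> z2  (NAND)
w20, w21, w22 = 10.0, 10.0, -15.0     # z1, z2, bias -> z3 (AND)

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z1 = sigmoid(x * w00 + y * w01 + 1 * w02)
    z2 = sigmoid(x * w10 + y * w11 + 1 * w12)
    z3 = sigmoid(z1 * w20 + z2 * w21 + 1 * w22)
    print(x, y, int(z3 > 0.5))        # 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0
```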

33

Implementations
• Toy example: Training Algorithm for logic gate
• NLNCA for MNIST
