[N] – SHOT Learning
In deep learning models
davinnovation@gmail.com
Learning process: Human vs. Deep Learning
What they need: a human needs a single picture book; a deep learning model needs 60,000 training examples (MNIST).
In Traditional Deep Learning…
http://pinktentacle.com/tag/waseda-university/
One Shot Learning
Give the learning model the ability to infer from very little data.
One Shot Learning
Train on one (or a few) examples and still perform well.
One Shot Learning – Many approaches
Transfer Learning
Domain Adaptation
Imitation Learning
…..
It started from Meta Learning.
'Meta' means something that is self-referencing: 'knowing about knowing'.
Today…
Neural Turing Machine
https://arxiv.org/abs/1410.5401
Basic structure, modelled on a Turing machine:
[Diagram: a Controller receives External Input and produces External Output, and accesses Memory through Read Heads and Write Heads]
Neural Turing Machine
https://arxiv.org/abs/1410.5401
[Diagram: the same structure, with the Controller realized as a neural network (RNN)]
The memory is an N x M block $M_t$ (N memory locations, each a vector of size M); the subscript t indexes time.
Neural Turing Machine
https://arxiv.org/abs/1410.5401
Read Heads
A read head emits a normalized weighting over the memory locations $i = 0, 1, \dots$:
$$\sum_i w_t(i) = 1, \qquad 0 \le w_t(i) \le 1, \qquad \text{e.g. } w_t = [0.9,\ 0.1,\ 0,\ \dots]$$
The read vector is the weighted sum of the memory slots:
$$r_t \leftarrow \sum_i w_t(i)\, M_t(i)$$
Example: with slots $M_t(0) = [1, 2, 1]$ and $M_t(1) = [2, 1, 3]$ and weights $[0.9,\ 0.1]$, the read vector is $r_t = [1.1,\ 1.9,\ 1.2]$.
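A minimal NumPy sketch of this read operation (the slot-per-column layout and the variable names are mine, chosen to reproduce the slide's numbers):

```python
import numpy as np

# Memory with one slot per column: M_t(0) = [1, 2, 1], M_t(1) = [2, 1, 3]
M_t = np.array([[1.0, 2.0],
                [2.0, 1.0],
                [1.0, 3.0]])

# Normalized read weighting over the slots
w_t = np.array([0.9, 0.1])

# r_t <- sum_i w_t(i) * M_t(i)
r_t = M_t @ w_t
print(r_t)  # [1.1, 1.9, 1.2], as in the slide
```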
Neural Turing Machine
https://arxiv.org/abs/1410.5401
Write Heads: erase
A write head also emits a normalized weighting:
$$\sum_i w_t(i) = 1, \qquad 0 \le w_t(i) \le 1, \qquad \text{e.g. } w_t = [0.9,\ 0.1,\ 0,\ \dots]$$
Erase vector $e_t$ with $0 < e_t < 1$:
$$M'_t(i) \leftarrow M_{t-1}(i)\,\big[1 - w_t(i)\, e_t\big]$$
Example: with $e_t \approx [1, 0, 1]$ (for the computation the values are set near 0 and 1; they can never be exactly 0 or 1), the memory from the previous slide becomes $M'_t(0) = [0.1,\ 2,\ 0.1]$, $M'_t(1) = [1.8,\ 1,\ 2.7]$.
A write is performed in two steps: Erase -> Add.
Neural Turing Machine
https://arxiv.org/abs/1410.5401
Write Heads: add
Add vector $a_t$ with $0 < a_t < 1$:
$$M_t(i) \leftarrow M'_t(i) + w_t(i)\, a_t$$
Example: with the same weighting $w_t = [0.9,\ 0.1,\ 0,\ \dots]$ and $a_t \approx [1, 1, 0]$ (again set near 0 and 1 for the computation), the erased memory becomes $M_t(0) = [1.0,\ 2.9,\ 0.1]$, $M_t(1) = [1.9,\ 1.1,\ 2.7]$.
Erase -> Add completes one write.
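A NumPy sketch of the erase-then-add write, reproducing the numbers from the two slides above (the slot-per-column layout and the "nearly 0/1" erase and add vectors are illustrative assumptions):

```python
import numpy as np

M_prev = np.array([[1.0, 2.0],
                   [2.0, 1.0],
                   [1.0, 3.0]])       # memory before the write, one slot per column
w_t = np.array([0.9, 0.1])            # write weighting over the two slots
e_t = np.array([1.0, 0.0, 1.0])       # erase vector (~1, ~0, ~1)
a_t = np.array([1.0, 1.0, 0.0])       # add vector (~1, ~1, ~0)

# Erase: M'_t(i) <- M_{t-1}(i) * [1 - w_t(i) * e_t]
M_erased = M_prev * (1.0 - np.outer(e_t, w_t))
# Add:   M_t(i) <- M'_t(i) + w_t(i) * a_t
M_t = M_erased + np.outer(a_t, w_t)

print(M_erased)  # [[0.1, 1.8], [2.0, 1.0], [0.1, 2.7]]
print(M_t)       # [[1.0, 1.9], [2.9, 1.1], [0.1, 2.7]]
```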
Neural Turing Machine
https://arxiv.org/abs/1410.5401
Addressing Mechanism
How is the weight vector $w_t$ calculated?
Content-based addressing:
- based on the similarity between the current memory contents and a key value produced by the controller.
Location-based addressing:
- based on the memory address itself. For example, to compute $f(x, y) = x \times y$, the variables $x$ and $y$ are stored at different addresses, retrieved by location, and a multiplication algorithm is applied.
The NTM uses both mechanisms concurrently.
Neural Turing Machine
https://arxiv.org/abs/1410.5401
Addressing Mechanism: use both mechanisms concurrently.
[Diagram: the previous state $w_{t-1}$ and the memory $M_t$ are combined with the controller outputs $k_t, \beta_t, g_t, s_t, \gamma_t$ in four stages: Content Addressing -> Interpolation -> Convolutional Shift -> Sharpening, producing $w_t^c \to w_t^g \to w'_t \to w_t$. Content Addressing is the content-based part; the remaining stages are the location-based part.]
- $k_t$: key vector
- $\beta_t$: key strength
- $g_t$: interpolation gate, in (0, 1)
- $s_t$: shift weighting, defined over integer shifts
- $\gamma_t$: sharpening weight
Neural Turing Machine
https://arxiv.org/abs/1410.5401
Addressing Mechanism: Content Addressing
Inputs: $k_t$, $\beta_t$, $M_t$; output: $w_t^c$.
$$w_t^c(i) \leftarrow \frac{\exp\big(\beta_t\, K(k_t, M_t(i))\big)}{\sum_j \exp\big(\beta_t\, K(k_t, M_t(j))\big)}, \qquad K(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|} \ \text{(cosine similarity)}$$
Example: with key $k_t = [3, 2, 1]$ and memory slots $M_t(0) = [1, 2, 1]$, $M_t(1) = [2, 1, 3]$:
- $\beta_t = 0$: $w_t^c = [0.5,\ 0.5]$
- $\beta_t = 5$: $w_t^c = [0.61,\ 0.39]$
- $\beta_t = 50$: $w_t^c = [0.98,\ 0.02]$
The key strength $\beta_t$ controls how sharply the weighting focuses on the best match.
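A NumPy sketch of content addressing that reproduces the numbers above (function and variable names are my own):

```python
import numpy as np

def content_addressing(k_t, M_t, beta_t):
    """w_t^c(i) = softmax_i( beta_t * cosine(k_t, M_t(i)) ), slots stored as columns."""
    sims = (M_t.T @ k_t) / (np.linalg.norm(M_t, axis=0) * np.linalg.norm(k_t))
    e = np.exp(beta_t * sims)
    return e / e.sum()

M_t = np.array([[1.0, 2.0],
                [2.0, 1.0],
                [1.0, 3.0]])
k_t = np.array([3.0, 2.0, 1.0])

for beta in (0.0, 5.0, 50.0):
    print(beta, content_addressing(k_t, M_t, beta))
# beta=0  -> [0.50, 0.50]
# beta=5  -> [~0.61, ~0.39]
# beta=50 -> [~0.98, ~0.02]
```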
Neural Turing Machine
https://arxiv.org/abs/1410.5401
Addressing Mechanism: Interpolation
Inputs: $w_t^c$, $w_{t-1}$, $g_t$; output: $w_t^g$.
$$w_t^g \leftarrow g_t\, w_t^c + (1 - g_t)\, w_{t-1}$$
Example: with $w_t^c = [0.61,\ 0.39]$ and $w_{t-1} = [0.01,\ 0.90]$:
- $g_t = 0$: $w_t^g = [0.01,\ 0.90]$ (the previous weighting is kept)
- $g_t = 0.5$: $w_t^g = [0.31,\ 0.65]$ (an even blend of both)
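A tiny NumPy sketch of the interpolation gate (names are illustrative):

```python
import numpy as np

def interpolate(w_c, w_prev, g_t):
    """w_t^g = g_t * w_t^c + (1 - g_t) * w_{t-1}"""
    return g_t * w_c + (1.0 - g_t) * w_prev

w_c = np.array([0.61, 0.39])     # content-based weighting from the previous stage
w_prev = np.array([0.01, 0.90])  # weighting from the previous time step

print(interpolate(w_c, w_prev, 0.0))  # [0.01, 0.90]  gate closed: keep the old weighting
print(interpolate(w_c, w_prev, 0.5))  # [0.31, 0.645] blend of both
print(interpolate(w_c, w_prev, 1.0))  # [0.61, 0.39]  gate open: use the content weighting
```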
Neural Turing Machine
https://arxiv.org/abs/1410.5401
Addressing Mechanism: Convolutional Shift
Inputs: $w_t^g$, $s_t$; output: $w'_t$.
$$w'_t(i) \leftarrow \sum_{j=0}^{N-1} w_t^g(j)\, s_t(i - j) \quad \text{(circular convolution, indices mod } N\text{)}$$
Example: with $w_t^g = [0.01,\ 0.90]$ and $s_t = [0,\ 1]$: $w'_t(0) = 0.90$, $w'_t(1) = 0.01$, i.e. the weighting is rotated by one position.
Shift weightings illustrated on the slide: a right shift ($s_t = [0, 1]$), a left shift ($s_t = [1, 0]$), and a spread shift ($s_t = [0.5, 0.5]$) that blurs the weighting over neighbouring positions.
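A NumPy sketch of the circular convolution (a direct, unoptimized loop; names are mine):

```python
import numpy as np

def shift(w_g, s_t):
    """w'_t(i) = sum_j w_t^g(j) * s_t(i - j), with indices taken modulo N (circular)."""
    N = len(w_g)
    return np.array([sum(w_g[j] * s_t[(i - j) % N] for j in range(N)) for i in range(N)])

w_g = np.array([0.01, 0.90])
print(shift(w_g, np.array([0.0, 1.0])))   # [0.90, 0.01]   rotate by one position
print(shift(w_g, np.array([1.0, 0.0])))   # [0.01, 0.90]   weighting left in place
print(shift(w_g, np.array([0.5, 0.5])))   # [0.455, 0.455] spread across both slots
```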
Neural Turing Machine
https://arxiv.org/abs/1410.5401
Addressing Mechanism: Sharpening
Inputs: $w'_t$, $\gamma_t$; output: $w_t$.
$$w_t(i) \leftarrow \frac{w'_t(i)^{\gamma_t}}{\sum_j w'_t(j)^{\gamma_t}}$$
Example: with $w'_t = [0.01,\ 0.90]$:
- $\gamma_t = 0$: $w_t = [0.5,\ 0.5]$ (the weighting is flattened)
- $\gamma_t = 1$: $w_t \approx [0.01,\ 0.99]$ (only renormalized)
- $\gamma_t = 2$: $w_t \approx [0.0001,\ 0.9999]$ (concentrated further)
Sharpening counteracts the blurring introduced by the convolutional shift.
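A NumPy sketch of sharpening (names are illustrative):

```python
import numpy as np

def sharpen(w, gamma_t):
    """w_t(i) = w'_t(i)^gamma_t / sum_j w'_t(j)^gamma_t"""
    p = w ** gamma_t
    return p / p.sum()

w = np.array([0.01, 0.90])
print(sharpen(w, 0.0))  # [0.5, 0.5]          gamma=0 flattens the weighting
print(sharpen(w, 1.0))  # [~0.011, ~0.989]    gamma=1 only renormalizes
print(sharpen(w, 2.0))  # [~0.0001, ~0.9999]  gamma>1 concentrates it further
```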
Neural Turing Machine
https://arxiv.org/abs/1410.5401
[Diagram: the same architecture; the Controller is a neural network, here an RNN (LSTM)]
Experiment – Copy train time
Experiment – Copy result
Input: random 8-bit binary vectors.
NTM > LSTM: the NTM learns the copy task faster and reproduces the sequences more cleanly.
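For reference, a sketch of how copy-task data can be generated. Only the 8-bit random vectors come from the slide; the sequence length, the blank recall phase, and folding the delimiter into the same bits (the paper uses a separate delimiter channel) are simplifying assumptions:

```python
import numpy as np

def copy_task_sequence(seq_len, n_bits=8, rng=np.random.default_rng()):
    """Random binary sequence the model must reproduce after a delimiter."""
    seq = rng.integers(0, 2, size=(seq_len, n_bits)).astype(np.float32)
    delimiter = np.ones((1, n_bits), dtype=np.float32)     # marks the end of the input
    blanks = np.zeros((seq_len, n_bits), dtype=np.float32)  # recall phase: no input
    inputs = np.concatenate([seq, delimiter, blanks])       # model reads, then recalls
    targets = np.concatenate([np.zeros((seq_len + 1, n_bits), dtype=np.float32), seq])
    return inputs, targets

x, y = copy_task_sequence(seq_len=5)
print(x.shape, y.shape)  # (11, 8) (11, 8)
```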
Memory-Augmented Neural Networks
https://arxiv.org/pdf/1605.06065.pdf
The NTM is a form of meta-learning: memory augmentation is used to build a model that learns rapidly from few examples.
Memory-Augmented Neural Networks (MANN)
https://arxiv.org/pdf/1605.06065.pdf
Recall the NTM addressing mechanism: content-based addressing (similarity between the memory contents and a controller-produced key) combined with location-based addressing (addressing by memory location, e.g. storing $x$ and $y$ at separate addresses to compute $f(x, y) = x \times y$), used concurrently.
=> MANN does not use location-based addressing.
=> It uses content-based addressing only.
MANN
https://arxiv.org/pdf/1605.06065.pdf
[Diagram: the NTM addressing pipeline shown again for reference (previous state $w_{t-1}$, memory $M_t$, and controller outputs $k_t, \beta_t, g_t, s_t, \gamma_t$ feeding Content Addressing -> Interpolation -> Convolutional Shift -> Sharpening), with the content-based and location-based parts labelled]
MANN
https://arxiv.org/pdf/1605.06065.pdf
[Diagram: the same pipeline, with the output of the content-addressing stage relabelled $w_t^r$, the MANN read weighting]
MANN
https://arxiv.org/pdf/1605.06065.pdf
Addressing Mechanism: Content Addressing only
Read weighting:
$$w_t^r(i) \leftarrow \frac{\exp\big(K(k_t, M_t(i))\big)}{\sum_j \exp\big(K(k_t, M_t(j))\big)}, \qquad K(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|} \ \text{(cosine similarity)}$$
Example: with key $k_t = [3, 2, 1]$ and memory slots $M_t(0) = [1, 2, 1]$, $M_t(1) = [2, 1, 3]$, the read weighting is $w_t^r = [0.53,\ 0.47]$.
Compared with the NTM, there is no key-strength term and no location-based stages.
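A NumPy sketch of the MANN read weighting (same memory and key as the slide; names are mine):

```python
import numpy as np

def mann_read_weights(k_t, M_t):
    """w_t^r(i) = softmax_i( cosine(k_t, M_t(i)) ): content-based only, no key strength."""
    sims = (M_t.T @ k_t) / (np.linalg.norm(M_t, axis=0) * np.linalg.norm(k_t))
    e = np.exp(sims)
    return e / e.sum()

M_t = np.array([[1.0, 2.0],
                [2.0, 1.0],
                [1.0, 3.0]])
k_t = np.array([3.0, 2.0, 1.0])
w_r = mann_read_weights(k_t, M_t)
print(w_r)          # [~0.52, ~0.48], close to the slide's [.53, .47]
print(w_r @ M_t.T)  # read vector r_t = sum_i w_r(i) * M_t(i)
```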
MANN: Write Heads and Least Recently Used Access (LRUA)
https://arxiv.org/pdf/1605.06065.pdf
Usage weights accumulate with a decay parameter $\gamma$:
$$w_t^u \leftarrow \gamma\, w_{t-1}^u + w_t^r + w_t^w$$
The least-used weighting marks the slots whose usage is smallest:
$$w_t^{lu}(i) = \begin{cases} 0, & \text{if } w_t^u(i) \text{ is large enough} \\ 1, & \text{otherwise} \end{cases}$$
The write weighting blends the previous read weighting and the previous least-used weighting through a sigmoid of a scalar gate parameter $\alpha$:
$$w_t^w \leftarrow \sigma(\alpha)\, w_{t-1}^r + \big(1 - \sigma(\alpha)\big)\, w_{t-1}^{lu}$$
Writing adds the key (write) vector $k_t$ into memory:
$$M_t(i) \leftarrow M_{t-1}(i) + w_t^w(i)\, k_t$$
So new information is written either to the most recently read slot or to the least-used slot.
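A sketch of one LRUA write step under my reading of the slide. The `n_least` threshold for deciding which slots count as least-used, the slot-per-row layout, and the example numbers are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lrua_write(M_prev, w_u_prev, w_r_prev, k_t, alpha, n_least=1):
    """One LRUA write: blend previous read weights with the least-used weighting."""
    # Least-used weighting: 1 for the n_least smallest usage weights, 0 elsewhere
    threshold = np.sort(w_u_prev)[n_least - 1]
    w_lu_prev = (w_u_prev <= threshold).astype(float)
    # w_t^w <- sigmoid(alpha) * w_{t-1}^r + (1 - sigmoid(alpha)) * w_{t-1}^lu
    w_w = sigmoid(alpha) * w_r_prev + (1.0 - sigmoid(alpha)) * w_lu_prev
    # M_t(i) <- M_{t-1}(i) + w_t^w(i) * k_t   (slots stored as rows here)
    M_t = M_prev + np.outer(w_w, k_t)
    return M_t, w_w

M_prev = np.zeros((4, 3))                       # 4 slots, width 3
w_u_prev = np.array([0.9, 0.1, 0.4, 0.2])       # usage so far: slot 1 is least used
w_r_prev = np.array([0.7, 0.1, 0.1, 0.1])       # last step's read weighting
M_t, w_w = lrua_write(M_prev, w_u_prev, w_r_prev,
                      k_t=np.array([3.0, 2.0, 1.0]), alpha=0.0)  # sigmoid(0) = 0.5
print(w_w)  # [0.35, 0.55, 0.05, 0.05] -- the least-used slot receives most of the write

# After reading from M_t (content-based) to get w_t^r, usage is updated as
# w_t^u <- gamma * w_{t-1}^u + w_t^r + w_t^w
```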
Experiment – Data
One episode:
Input: $(x_0, \text{null}), (x_1, y_0), (x_2, y_1), \dots, (x_T, y_{T-1})$
Output: $(y'_0), (y'_1), (y'_2), \dots, (y'_T)$
The model estimates $p(y_t \mid x_t; D_{1:t-1}; \theta)$: at each step the RNN receives the current input together with the label of the previous input, e.g. $(x_0, \text{null}) \to (y'_0)$, then $(x_1, y_0) \to (y'_1)$, and so on, so it must hold class information in memory until the class reappears.
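A sketch of how one such episode could be assembled (the function name and the per-episode shuffling details are assumptions; only the one-step label offset comes from the slide):

```python
import numpy as np

def make_episode(images, labels, rng=np.random.default_rng()):
    """Pair each input with the *previous* step's label, as in the MANN setup."""
    order = rng.permutation(len(images))
    xs, ys = images[order], labels[order]
    # labels are offset by one time step; the first input is paired with a null label
    prev_y = np.concatenate([np.zeros_like(ys[:1]), ys[:-1]])
    inputs = list(zip(xs, prev_y))   # (x_t, y_{t-1})
    targets = ys                     # the model must predict y_t at every step
    return inputs, targets

imgs = np.random.rand(6, 20, 20)          # e.g. six 20x20 Omniglot-style images
labels = np.array([0, 1, 0, 2, 1, 2])     # episode labels
inputs, targets = make_episode(imgs, labels)
```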
Experiment – Data
Omniglot dataset: over 1,600 classes.
1,200 classes for training, 423 classes for testing (images downscaled to 20x20), plus rotation augmentation.
Experiment – Classification Result
Labels are presented as one-hot vectors. Each episode uses five randomly chosen classes, and the model is trained for 100,000 episodes (every episode presents effectively 'new' classes).
Accuracy is reported per instance, i.e. by how many times a class has already appeared within the episode.
Active One-Shot https://cs.stanford.edu/~woodward/papers/active_one_shot_learning_2016.pdf
MANN plus a decision at each step: predict the label, or pass and request it.
Just like a quiz show!
Active One-Shot
https://cs.stanford.edu/~woodward/papers/active_one_shot_learning_2016.pdf
At each step the RNN receives $(y_t, x_{t+1})$ if it asked for the label, or $(0, x_{t+1})$ otherwise, and it outputs either a label request or a prediction $(y'_t, 0)$:
- Request the label: the true label $y_t$ is supplied with the next input, and the reward is $r_t = -0.05$.
- Predict: the reward is $r_t = +1$ if $y'_t = y_t$, and $-1$ otherwise.
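A minimal sketch of this reward scheme (the action encoding is mine; the reward values come from the slide):

```python
def step_reward(action, y_true, y_pred=None):
    """Reward for one step of the active one-shot task."""
    if action == "request_label":
        return -0.05                              # pay a small cost to be told the label
    return 1.0 if y_pred == y_true else -1.0      # predict: +1 if correct, -1 otherwise

print(step_reward("request_label", y_true=3))       # -0.05
print(step_reward("predict", y_true=3, y_pred=3))   # 1.0
print(step_reward("predict", y_true=3, y_pred=5))   # -1.0
```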
Results https://cs.stanford.edu/~woodward/papers/active_one_shot_learning_2016.pdf
Want some more?
http://proceedings.mlr.press/v37/romera-paredes15.pdf
https://arxiv.org/pdf/1606.04080.pdf
Reference
• https://tensorflow.blog/tag/one-shot-learning/
• http://www.modulabs.co.kr/DeepLAB_library/11115
• https://www.youtube.com/watch?v=CzQSQ_0Z-QU
• https://www.slideshare.net/JisungDavidKim/oneshot-learning
• https://norman3.github.io/papers/docs/neural_turing_machine.html
• https://www.slideshare.net/webaquebec/guillaume-chevalier-deep-learning-avec-tensor-flow
• https://www.slideshare.net/ssuser06e0c5/metalearning-with-memory-augmented-neural-networks