Object detection and Instance Segmentation - Hichem Felouat
The document discusses object detection and instance segmentation models like YOLOv5, Faster R-CNN, EfficientDet, Mask R-CNN, and TensorFlow's object detection API. It provides information on labeling images with bounding boxes for training these models, including open-source and commercial annotation tools. The document also covers evaluating object detection models using metrics like mean average precision (mAP) and intersection over union (IoU). It includes an example of training YOLOv5 on a custom dataset.
Deep Learning: Recurrent Neural Network (Chapter 10) Larry Guo
This material is an in-depth study report on the Recurrent Neural Network (RNN).
Material mainly from the Deep Learning book ("the bible"), http://www.deeplearningbook.org/
Topics: Briefing, Theory Proof, Variation, Gated RNN Intuition, Real World Application
Application (CNN+RNN on SVHN)
Also a video (In Chinese)
https://www.youtube.com/watch?v=p6xzPqRd46w
Deep learning and neural networks (using simple mathematics) - Amine Bendahmane
The document provides an overview of machine learning and deep learning concepts through a series of diagrams and explanations. It begins by introducing concepts like regression, classification, and clustering. It then discusses supervised vs unsupervised learning before explaining neural networks and components like the perceptron, multi-layer perceptrons, and convolutional neural networks. It notes how neural networks learn representations and separate data through hidden layers.
Convolutional neural network from VGG to DenseNet - SungminYou
This document summarizes recent developments in convolutional neural networks (CNNs) for image recognition, including residual networks (ResNets) and densely connected convolutional networks (DenseNets). It reviews CNN structure and components like convolution, pooling, and ReLU. ResNets address degradation problems in deep networks by introducing identity-based skip connections. DenseNets connect each layer to every other layer to encourage feature reuse, addressing vanishing gradients. The document outlines the structures of ResNets and DenseNets and their advantages over traditional CNNs.
In machine learning, a convolutional neural network is a class of deep, feed-forward artificial neural networks that have successfully been applied for analyzing visual imagery.
Part 2 of the Deep Learning Fundamentals Series, this session discusses Tuning Training (including hyperparameters, overfitting/underfitting), Training Algorithms (including different learning rates, backpropagation), Optimization (including stochastic gradient descent, momentum, Nesterov Accelerated Gradient, RMSprop, Adaptive algorithms - Adam, Adadelta, etc.), and a primer on Convolutional Neural Networks. The demos included in these slides are running on Keras with TensorFlow backend on Databricks.
Summary:
There are three parts in this presentation.
A. Why do we need Convolutional Neural Network
- Problems we face today
- Solutions for problems
B. LeNet Overview
- The origin of LeNet
- The result after using LeNet model
C. LeNet Techniques
- LeNet structure
- Function of every layer
The following GitHub link points to a repository where I rebuilt LeNet without any deep learning package. I hope it helps you better understand the basics of Convolutional Neural Networks.
Github Link : https://github.com/HiCraigChen/LeNet
LinkedIn : https://www.linkedin.com/in/YungKueiChen
Generative Adversarial Networks (GANs) are clearly the next big thing in deep learning. Yann LeCun called them "the most interesting idea in the last 10 years in ML" and even "the coolest thing since sliced bread". What problem do GANs solve? In machine learning, regression and classification are by now familiar tasks, but getting a machine to go further and create structured, complex objects (e.g. images, sentences) remains a major challenge. With GANs, machines can already draw convincingly realistic human faces, draw a picture matching a piece of descriptive text, and even generate anime character portraits (the anime portraits on the left were generated by the machine itself). This course introduces generative adversarial networks, one of the most cutting-edge techniques in deep learning.
http://imatge-upc.github.io/telecombcn-2016-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
GoogLeNet introduced several key insights for designing efficient deep learning networks:
1. Exploit local correlations in images by concatenating 1x1, 3x3, and 5x5 convolutions along with pooling.
2. Decrease dimensions before expensive convolutions using 1x1 convolutions for dimension reduction.
3. Stack inception modules upon each other, occasionally inserting max pooling layers, to allow tweaking each module.
4. Counter vanishing gradients with intermediate losses added to the total loss for training deep networks.
5. End with a global average pooling layer instead of fully connected layers to avoid overfitting.
The document discusses convolutional neural networks (CNNs). It begins with an introduction and overview of CNN components like convolution, ReLU, and pooling layers. Convolution layers apply filters to input images to extract features, ReLU introduces non-linearity, and pooling layers reduce dimensionality. CNNs are well-suited for image data since they can incorporate spatial relationships. The document provides an example of building a CNN using TensorFlow to classify handwritten digits from the MNIST dataset.
1. The document discusses various machine learning classification algorithms including neural networks, support vector machines, logistic regression, and radial basis function networks.
2. It provides examples of using straight lines and complex boundaries to classify data with neural networks. Maximum margin hyperplanes are used for support vector machine classification.
3. Logistic regression is described as useful for binary classification problems by using a sigmoid function and cross entropy loss. Radial basis function networks can perform nonlinear classification with a kernel trick.
An introduction to Keras, a high-level neural networks library written in Python. Keras makes deep learning more accessible, is fantastic for rapid prototyping, and can run on top of TensorFlow, Theano, or CNTK. These slides focus on examples, starting with logistic regression and building towards a convolutional neural network.
The presentation was given at the Austin Deep Learning meetup: https://www.meetup.com/Austin-Deep-Learning/events/237661902/
A comprehensive tutorial on Convolutional Neural Networks (CNN) which talks about the motivation behind CNNs and Deep Learning in general, followed by a description of the various components involved in a typical CNN layer. It explains the theory involved with the different variants used in practice and also, gives a big picture of the whole network by putting everything together.
Next, there's a discussion of the various state-of-the-art frameworks being used to implement CNNs to tackle real-world classification and regression problems.
Finally, the implementation of CNNs is demonstrated by implementing the paper 'Age and Gender Classification Using Convolutional Neural Networks' by Levi and Hassner (2015).
The document discusses artificial neural networks. It describes their basic structure and components, including dendrites that receive input signals, a soma that processes the inputs, and an axon that transmits output signals. It also explains how neurons are connected at synapses to transfer signals between neurons. Finally, it mentions different types of activation functions that can be used in neural networks.
Recurrent Neural Networks. Part 1: Theory - Andrii Gakhov
The document provides an overview of recurrent neural networks (RNNs) and their advantages over feedforward neural networks. It describes the basic structure and training of RNNs using backpropagation through time. RNNs can process sequential data of variable lengths, unlike feedforward networks. However, RNNs are difficult to train due to vanishing and exploding gradients. More advanced RNN architectures like LSTMs and GRUs address this by introducing gating mechanisms that allow the network to better control the flow of information.
The document describes the structure and functioning of a feedforward neural network. It notes that the network contains an input layer with n-dimensional vectors, L-1 hidden layers with n neurons each, and an output layer with k neurons. Each neuron has a pre-activation and activation value. The pre-activation at layer i is the weighted sum of outputs from layer i-1 plus a bias. The activation is this pre-activation passed through an activation function. Backpropagation is used to minimize a loss function through gradient descent to learn the network's weights and biases parameters.
The document discusses using the SPEA2 algorithm for multi-objective optimization to find non-dominated classification rules from transaction data. It describes classification rule mining, objectives of accuracy, comprehensibility and interestingness, and the SPEA2 approach which uses selection, crossover and mutation operators over generations to find a non-dominated solution set. A case study applies SPEA2 on insurance broker transaction data to extract non-dominated rules relating customer attributes to insurance products.
In an era when data science is all the rage, an engineer curious about new technology naturally wants to expand the arsenal and learn new data analysis tools. R, a scripting language developed by statisticians specifically for data exploration and analysis, backed by a huge open-source community and tens of thousands of packages of every kind, is the first choice among today's data science tools.
However, R's design logic differs from that of ordinary programming languages, and engineers' past experience with other languages often becomes an obstacle when learning it. This course starts from the basics of R so that, through lectures and interactive hands-on sessions, students can thoroughly grasp R's core concepts, learn to ask questions of data with R, and write efficient yet highly readable R code from a data analysis perspective.
(1) This document provides a quick tour of machine learning concepts including the components, types, and step-by-step process of machine learning.
(2) It discusses machine learning applications in areas like credit approval, education, recommender systems, and reinforcement learning.
(3) The tour outlines the key components of a machine learning problem including the target function, training data, learning algorithm, hypothesis set, and learned hypothesis. It also distinguishes between supervised, unsupervised, and semi-supervised learning problems.
MixTaiwan 20170222 - Min Sun (NTHU EE): AI, The Next Big Thing - Mix Taiwan
Speaker bio:
Min Sun, Assistant Professor, Department of Electrical Engineering, National Tsing Hua University
Dr. Min Sun teaches in the Department of Electrical Engineering at National Tsing Hua University. After graduating from the Department of Electronics Engineering at National Chiao Tung University, he earned an M.S. in Electrical Engineering from Stanford and a Ph.D. in Electrical Engineering: Systems from the University of Michigan, Ann Arbor, followed by postdoctoral work in computer engineering at the University of Washington, Seattle. His research interests are computer vision, machine learning, and human-computer interaction. Building on recent deep learning breakthroughs in computer vision, he develops systems that span different subfields of AI, such as automatic video captioning (vision x natural language) and intelligent machines that interact with human behavior (vision x control).
(1) The document discusses using autoencoders for image classification. Autoencoders are neural networks trained to encode inputs so they can be reconstructed, learning useful features in the process. (2) Stacked autoencoders and convolutional autoencoders are evaluated on the MNIST handwritten digit dataset. Greedy layerwise training is used to construct deep pretrained networks. (3) Visualization of hidden unit activations shows the features learned by the autoencoders. The main difference between autoencoders and convolutional networks is that convolutional networks have more hardwired topological constraints due to the convolutional and pooling operations.
Jeff Dean at AI Frontiers: Trends and Developments in Deep Learning Research - AI Frontiers
In this talk at the AI Frontiers conference, Jeff Dean discusses recent trends and developments in deep learning research. Jeff touches on the significant progress that this research has produced in a number of areas, including computer vision, language understanding, translation, healthcare, and robotics. These advances are driven both by new algorithmic approaches to some of these problems and by the ability to scale computation for training ever larger models on larger datasets. Finally, one of the reasons for the rapid spread of the ideas and techniques of deep learning has been the availability of open source libraries such as TensorFlow. He gives an overview of why these software libraries have an important role in making the benefits of machine learning available throughout the world.
https://telecombcn-dl.github.io/2018-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both an algorithmic and computational perspectives.
What Is Deep Learning? | Introduction to Deep Learning | Deep Learning Tutori... - Simplilearn
This Deep Learning presentation will help you understand what Deep Learning is, why we need Deep Learning, and the applications of Deep Learning, along with a detailed explanation of Neural Networks and how they work. Deep learning is inspired by the structure and function of the human brain, realized through artificial neural networks. These networks, which represent the decision-making process of the brain, use complex algorithms that process data in a non-linear way, learning in an unsupervised manner to make choices based on the input. This Deep Learning tutorial is ideal for professionals with beginner to intermediate levels of experience. Now, let us dive deep into this topic and understand what Deep Learning actually is.
Below topics are explained in this Deep Learning Presentation:
1. What is Deep Learning?
2. Why do we need Deep Learning?
3. Applications of Deep Learning
4. What is Neural Network?
5. Activation Functions
6. Working of Neural Network
Simplilearn’s Deep Learning course will transform you into an expert in deep learning techniques using TensorFlow, the open-source software library designed to conduct machine learning & deep neural network research. With our deep learning course, you’ll master deep learning and TensorFlow concepts, learn to implement algorithms, build artificial neural networks and traverse layers of data abstraction to understand the power of data and prepare you for your new role as deep learning scientist.
Why Deep Learning?
TensorFlow is one of the most popular software platforms used for deep learning and contains powerful tools to help you build and implement artificial neural networks.
Advancements in deep learning are being seen in smartphone applications, creating efficiencies in the power grid, driving advancements in healthcare, improving agricultural yields, and helping us find solutions to climate change. With this Tensorflow course, you’ll build expertise in deep learning models, learn to operate TensorFlow to manage neural networks and interpret the results.
You can gain in-depth knowledge of Deep Learning by taking our Deep Learning certification training course. With Simplilearn’s Deep Learning course, you will prepare for a career as a Deep Learning engineer as you master concepts and techniques including supervised and unsupervised learning, mathematical and heuristic aspects, and hands-on modeling to develop algorithms.
There is booming demand for skilled deep learning engineers across a wide range of industries, making this deep learning course with TensorFlow training well-suited for professionals at the intermediate to advanced level of experience. We recommend this deep learning online course particularly for the following professionals:
1. Software engineers
2. Data scientists
3. Data analysts
4. Statisticians with an interest in deep learning
Deep Learning Tutorial | Deep Learning Tutorial For Beginners | What Is Deep ... - Simplilearn
The document discusses deep learning and neural networks. It begins by defining deep learning as a subfield of machine learning that is inspired by the structure and function of the brain. It then discusses how neural networks work, including how data is fed as input and passed through layers with weighted connections between neurons. The neurons perform operations like multiplying the weights and inputs, adding biases, and applying activation functions. The network is trained by comparing the predicted and actual outputs to calculate error and adjust the weights through backpropagation to reduce error. Deep learning platforms like TensorFlow, PyTorch, and Keras are also mentioned.
Neural network basics and an introduction to Deep learning - Tapas Majumdar
Deep learning tools and techniques can be used to build convolutional neural networks (CNNs). Neural networks learn from observational training data by automatically inferring rules to solve problems. Neural networks use multiple hidden layers of artificial neurons to process input data and produce output. Techniques like backpropagation, cross-entropy cost functions, softmax activations, and regularization help neural networks learn more effectively and avoid issues like overfitting.
This document discusses a deep learning course at Carnegie Mellon University for fall 2016. It covers topics like the popularization of backpropagation for training neural networks, unsupervised pre-training of deep networks, and how convolutional neural networks' win in the 2012 ImageNet competition led to increased interest in deep learning research. It also shows the architecture of a convolutional neural network and how it is split across two GPUs during training.
Hardware Acceleration for Machine Learning - CastLabKAIST
This document provides an overview of a lecture on hardware acceleration for machine learning. The lecture will cover deep neural network models like convolutional neural networks and recurrent neural networks. It will also discuss various hardware accelerators developed for machine learning, including those designed for mobile/edge and cloud computing environments. The instructor's background and the agenda topics are also outlined.
Deep learning is a subset of machine learning and artificial intelligence that uses multilayer neural networks to enable computers to learn from large amounts of data. Convolutional neural networks are commonly used for deep learning tasks involving images. Recurrent neural networks are used for sequential data like text or time series. Deep learning models can learn high-level features from data without relying on human-defined features. This allows them to achieve high performance in application areas such as computer vision, speech recognition, and natural language processing.
Separating Hype from Reality in Deep Learning with Sameer Farooqui - Databricks
Deep Learning is all the rage these days, but where does the reality of what Deep Learning can do end and the media hype begin? In this talk, I will dispel common myths about Deep Learning that are not necessarily true and help you decide whether you should practically use Deep Learning in your software stack.
I’ll begin with a technical overview of common neural network architectures like CNNs, RNNs, GANs and their common use cases like computer vision, language understanding or unsupervised machine learning. Then I’ll separate the hype from reality around questions like:
• When should you prefer traditional ML systems like scikit learn or Spark.ML instead of Deep Learning?
• Do you no longer need to do careful feature extraction and standardization if using Deep Learning?
• Do you really need terabytes of data when training neural networks or can you ‘steal’ pre-trained lower layers from public models by using transfer learning?
• How do you decide which activation function (like ReLU, leaky ReLU, ELU, etc) or optimizer (like Momentum, AdaGrad, RMSProp, Adam, etc) to use in your neural network?
• Should you randomly initialize the weights in your network or use more advanced strategies like Xavier or He initialization?
• How easy is it to overfit/overtrain a neural network, and what are the common techniques to avoid overfitting (like l1/l2 regularization, dropout and early stopping)?
Deep learning techniques like convolutional neural networks (CNNs) and deep neural networks have achieved human-level performance on certain tasks. Pioneers in the field include Geoffrey Hinton, who co-invented backpropagation, Yann LeCun who developed CNNs for image recognition, and Andrew Ng who helped apply these techniques at companies like Baidu and Coursera. Deep learning is now widely used for applications such as image recognition, speech recognition, and distinguishing objects like dogs from cats, often outperforming previous machine learning methods.
This is a single-day course that allows the learner to get experience with the basic details of deep learning. The first half is building a network using python/numpy only, and in the second half we build a more advanced network using TensorFlow/Keras.
At the end you will find a list of useful pointers to continue.
course git: https://gitlab.com/eshlomo/EazyDnn
A fast-paced introduction to Deep Learning concepts, such as activation functions, cost functions, backpropagation, and then a quick dive into CNNs. Basic knowledge of vectors, matrices, and elementary calculus (derivatives), are helpful in order to derive the maximum benefit from this session.
Next we'll see a simple neural network using Keras, followed by an introduction to TensorFlow and TensorBoard. (Bonus points if you know Zorn's Lemma, the Well-Ordering Theorem, and the Axiom of Choice.)
We present basic concepts of machine learning, such as supervised and unsupervised learning, types of tasks, how some algorithms work, neural networks, and deep learning concepts, and how to apply them in your work.
An introduction to Deep Learning (DL) concepts, such as neural networks, back propagation, activation functions, CNNs, RNNs (if time permits), and the CLT/AUT/fixed-point theorems, along with code samples in Java and TensorFlow.
Part 1 of the Deep Learning Fundamentals Series, this session discusses the use cases and scenarios surrounding Deep Learning and AI; reviews the fundamentals of artificial neural networks (ANNs) and perceptrons; discuss the basics around optimization beginning with the cost function, gradient descent, and backpropagation; and activation functions (including Sigmoid, TanH, and ReLU). The demos included in these slides are running on Keras with TensorFlow backend on Databricks.
An introduction to Deep Learning (DL) concepts, such as neural networks, back propagation, activation functions, CNNs, and GANs, along with a simple yet complete neural network.
Digit recognizer by convolutional neural network - Ding Li
A convolutional neural network is used to recognize handwritten digits from images. The CNN uses convolutional and max pooling layers to extract local features from the images. These local features are then fed into fully connected layers to combine them into global features used to predict the digit (0-9) in each image with a softmax output layer. The model is trained on 60,000 images and achieves 99.67% accuracy on the test set after 30 training epochs. While powerful, it is unclear if humans can fully understand the "mind" and logic of artificial neural networks.
I conducted a workshop on TensorFlow 2.0 at Facebook Dev Circle. It mostly covers the importance of TensorFlow for implementing deep neural networks.
You can check the related demo at:
https://github.com/rayyan17/Introduction-To-Tensor-Flow.git
This document is a presentation by Ted Chang about creating new opportunities for Taiwan's intelligent transformation. It discusses paradigm shifts in technology such as mobile phones and cloud computing. It introduces concepts like the Internet of Things, artificial intelligence, and how they can be combined. It argues that key driving forces for the future will be machine learning, big data, cloud computing and AI. The presentation envisions applications of these technologies in areas like future medicine and smart manufacturing. It ends by emphasizing the importance of wisdom and intelligence in shaping the future.
- The document discusses how artificial intelligence can enable earlier and safer medicine.
- It provides background on the author and their expertise in biomedical informatics and roles as editor-in-chief of several academic journals.
- Key applications of AI in healthcare discussed include using machine learning on large medical datasets to detect suspicious moles earlier, reduce medication errors, and more accurately predict cancer occurrence up to 12 months in advance.
- The author argues that AI has the potential to transform medicine by enabling more preventive and earlier detection approaches compared to traditional reactive healthcare models.
User-Agent Interaction (V): Meera checks with Jane-ML
PA_Meera: Mina, do you have trouble in debugging?
Mina: Yes, is there anyone who has done this?
Personal Agent [Meera]: Jane may be able to help. Let me check with her personal assistant Jane-ML.
Jane-ML: Jane has done a similar debugging problem before. She is available now and willing to help.
1) Kaggle is the largest platform for AI and data science competitions, acquired by Google in 2017. It has been used by companies like Bosch, Mercedes, and Asus for challenges like improving production lines, accelerating testing processes, and component failure prediction.
2) The document discusses the author's experiences winning silver medals in Kaggle competitions involving camera model identification, passenger screening algorithms, and pneumonia detection. For camera model identification, the author used transfer learning with InceptionResNetV2 and high-pass filters to identify camera models from images.
3) For passenger screening, the author modified a 2D CNN to 3D and used 3D data augmentation to rank in the top 7% of the $1
[Taiwan AI Academy] Bridging AI to Precision Agriculture through IoT - Taiwan Data Science Conference
The document describes a system for precision agriculture using IoT. It involves sensors collecting environmental data from fields and feeding it to a control board connected to actuators like irrigation systems. The data is also sent to an IoTtalk engine and AgriTalk server in the cloud for analysis and remote access/control through an AgriGUI interface. Equations were developed to estimate nutrient levels like nitrogen from sensor readings to help optimize crop growth.
The document discusses Open Robot Club and includes several links to its website and YouTube videos. It provides information on the club's computing resources like NVIDIA V100 GPUs. Tables with metrics like underkill and overkill percentages are included for different types of tasks like AI AOI and PCB inspection. The club's website and demos are referenced throughout.
Graph Machine Learning - Past, Present, and Future - kashipong
Graph machine learning, despite its many commonalities with graph signal processing, has developed as a relatively independent field.
This presentation will trace the historical progression from graph data mining in the 1990s, through graph kernel methods in the 2000s, to graph neural networks in the 2010s, highlighting the key ideas and advancements of each era. Additionally, recent significant developments, such as the integration with causal inference, will be discussed.
2. Deep learning attracts lots of attention.
• I believe you have seen lots of exciting results before.
This talk focuses on the basic techniques.
[Figure: deep learning trends at Google. Source: SIGMOD / Jeff Dean]
3. Outline
Lecture I: Introduction of Deep Learning
Lecture II: Tips for Training Deep Neural Network
Lecture III: Variants of Neural Network
Lecture IV: Next Wave
5. Outline of Lecture I
Introduction of Deep Learning
Why Deep?
“Hello World” for Deep Learning
Let’s start with general machine learning.
6. Machine Learning ≈ Looking for a Function
• Speech Recognition: f(audio) = “How are you”
• Image Recognition: f(image) = “Cat”
• Playing Go: f(board position) = “5-5” (next move)
• Dialogue System: f(“Hi”) = “Hello” (what the user said → system response)
7. Framework
Model: a set of functions f1, f2, ⋯
Image Recognition: f(image) = “cat”
e.g. f1 maps one image to “cat” and another to “dog”; a bad f2 maps them to “money” and “snake”.
8. Framework
Model: a set of functions f1, f2, ⋯
Image Recognition: f(image) = “cat”
Training Data: function inputs (images) and function outputs (labels “monkey”, “cat”, “dog”) → Supervised Learning
Goodness of function f: measured on the training data (the better f fits it, the better).
9. Framework
Training: Step 1: a set of functions f1, f2, ⋯ (Model); Step 2: goodness of function f on the training data (“monkey”, “cat”, “dog”); Step 3: pick the “best” function f*.
Testing: using f*: f*(image) = “cat”
10. Three Steps for Deep Learning
Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……
11. Three Steps for Deep Learning
Step 1: define a set of functions → Neural Network
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……
13. Neural Network: a neuron is a simple function
z = a1 w1 + ⋯ + ak wk + ⋯ + aK wK + b
The inputs a1, …, aK are multiplied by the weights w1, …, wK and summed together with a bias b; an activation function σ then turns z into the neuron’s output a = σ(z).
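As a minimal sketch of this neuron in numpy (assuming the sigmoid activation introduced on the next slide):

import numpy as np

def neuron(a, w, b):
    # weighted sum of the inputs plus the bias: z = a1*w1 + ... + aK*wK + b
    z = np.dot(w, a) + b
    # sigmoid activation squashes z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Example from the next slide: inputs (1, -1), weights (1, -2), bias 1 -> z = 4
print(neuron(np.array([1, -1]), np.array([1, -2]), 1))  # ~0.98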
14. Neural Network: example neuron
With inputs (1, −1), weights (1, −2), and bias 1:
z = 1·1 + (−1)·(−2) + 1 = 4
Sigmoid activation function: σ(z) = 1 / (1 + e^(−z))
Output: σ(4) ≈ 0.98
15. Neural Network
Different connections lead to different network structures.
Weights and biases are the network parameters θ.
Each neuron can have different values of weights and biases.
20. Output Layer (Option)
• Softmax layer as the output layer
Ordinary layer: y1 = σ(z1), y2 = σ(z2), y3 = σ(z3)
In general, the output of the network can be any value; it may not be easy to interpret.
21. Output Layer (Option)
• Softmax layer as the output layer:
yi = e^(zi) / Σ_{j=1..3} e^(zj)
Example: (z1, z2, z3) = (3, 1, −3) → (e^(z1), e^(z2), e^(z3)) ≈ (20, 2.7, 0.05) → (y1, y2, y3) ≈ (0.88, 0.12, ≈0)
The outputs behave like probabilities: 1 > yi > 0 and Σi yi = 1.
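A quick numeric check of the softmax formula above, as a numpy sketch (the shift by max(z) is a standard stability precaution, not part of the slide):

import numpy as np

def softmax(z):
    # subtracting max(z) does not change the result but avoids overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([3.0, 1.0, -3.0])))  # ~[0.88, 0.12, 0.002]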
22. Example Application
Input: a 16 × 16 = 256-pixel image, x1, x2, …, x256 (ink → 1, no ink → 0)
Output: y1, y2, …, y10; each dimension represents the confidence of a digit (y1: is 1, y2: is 2, …, y10: is 0)
e.g. output (0.1, 0.7, 0.2, …) → the image is “2”
23. Example Application
• Handwriting Digit Recognition: Machine(image) = “2”
What is needed is a function with a 256-dim vector as input and a 10-dim vector as output (y1: is 1, y2: is 2, …, y10: is 0) → a Neural Network.
24. Example Application
Input Layer (x1, x2, …, xN) → Layer 1 → Layer 2 → ⋯ → Layer L (Hidden Layers) → Output Layer (y1: is 1, y2: is 2, …, y10: is 0) → “2”
A network structure defines a function set containing the candidates for Handwriting Digit Recognition. You need to decide the network structure so that a good function is in your function set.
25. FAQ
• Q: How many layers? How many neurons for each layer? → Trial and Error + Intuition
• Q: Can the structure be automatically determined?
26. Three Steps for Deep Learning
Step 1: define a set of functions → Neural Network
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……
27. Training Data
• Preparing training data: images and their labels, e.g. “5”, “0”, “4”, “1”, “9”, “2”, “1”, “3”
The learning target is defined on the training data.
28. Learning Target
Input: 16 × 16 = 256 pixels x1, x2, …, x256 (ink → 1, no ink → 0); softmax outputs y1 (is 1), y2 (is 2), …, y10 (is 0).
The learning target is:
Input an image of “1” → y1 has the maximum value
Input an image of “2” → y2 has the maximum value
30. Total Loss
For R training examples x1, …, xR, the network outputs y1, …, yR are compared with the targets ŷ1, …, ŷR, each contributing a loss l_r.
Total loss: L = Σ_{r=1..R} l_r → as small as possible
Find the network parameters θ* that minimize the total loss L, i.e. find a function in the function set that minimizes total loss L.
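A small sketch of this total loss in numpy, assuming cross-entropy between the softmax output and the one-hot target as the per-example loss l_r (the slide itself leaves the choice of loss to a later section):

import numpy as np

def total_loss(Y, Y_hat):
    # Y: (R, 10) network outputs; Y_hat: (R, 10) one-hot targets
    per_example = -np.sum(Y_hat * np.log(Y + 1e-12), axis=1)  # l_r for each r
    return per_example.sum()                                  # L = sum of l_r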
31. Three Steps for Deep Learning
Step 1: define a set of functions → Neural Network
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……
32. How to pick the best function
Find the network parameters θ* that minimize total loss L.
Network parameters θ = {w1, w2, w3, ⋯, b1, b2, b3, ⋯}
Enumerate all possible values? Infeasible: e.g. a speech recognition network with 8 layers of 1000 neurons each has 1000 × 1000 = 10^6 weights between consecutive layers, i.e. millions of parameters in total.
33. Gradient Descent
Find the network parameters θ* = {w1, w2, ⋯, b1, b2, ⋯} that minimize total loss L.
Pick an initial value for w (random, or RBM pre-train; random is usually good enough).
34. Gradient Descent
Pick an initial value for w, then compute ∂L/∂w:
Positive ∂L/∂w → decrease w
Negative ∂L/∂w → increase w
(http://chico386.pixnet.net/album/photo/171572850)
35. Gradient Descent
Pick an initial value for w, compute ∂L/∂w, then move by −η ∂L/∂w:
w ← w − η ∂L/∂w
η is called the “learning rate”. Repeat.
36. Gradient Descent
Pick an initial value for w, compute ∂L/∂w, update w ← w − η ∂L/∂w.
Repeat until ∂L/∂w is approximately zero (i.e. when the update becomes tiny).
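A minimal sketch of this update loop on a toy loss (L(w) = (w - 3)^2 is my own stand-in; the real L is the total loss over the training data):

# gradient descent on L(w) = (w - 3)^2, whose gradient is dL/dw = 2*(w - 3)
w, eta = 0.0, 0.1          # initial value and learning rate
for _ in range(100):
    grad = 2 * (w - 3)
    w = w - eta * grad     # w <- w - eta * dL/dw
    if abs(grad) < 1e-6:   # stop when dL/dw is approximately zero
        break
print(w)                   # ~3.0, the minimizer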
40. Gradient Descent (two parameters w1, w2)
Compute ∂L/∂w1 and ∂L/∂w2, then move by (−η ∂L/∂w1, −η ∂L/∂w2).
Hopefully, we would reach a minimum …
[Contour figure: color = value of total loss L]
41. Gradient Descent: Difficulty
• Gradient descent never guarantees global minima.
Different initial points reach different minima, so different results.
There are some tips to help you avoid local minima, but no guarantee.
42. Gradient Descent
You are playing Age of Empires: you compute ∂L/∂w1, ∂L/∂w2 and move by (−η ∂L/∂w1, −η ∂L/∂w2), but you cannot see the whole map.
43. Gradient Descent
This is the “learning” of machines in deep learning ……
Even AlphaGo uses this approach.
I hope you are not too disappointed :p
[Figure: what people imagine …… vs. what actually happens …..]
44. Backpropagation
• Backpropagation: an efficient way to compute ∂L/∂w
• Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
Don’t worry about ∂L/∂w: the toolkits will handle it. (A related toolkit was developed by NTU student 周伯威.)
45. Concluding Remarks
Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……
46. Outline of Lecture I
Introduction of Deep Learning
Why Deep?
“Hello World” for Deep Learning
47. Deeper is Better?
Layers × Size   Word Error Rate (%)        Layers × Size   Word Error Rate (%)
1 × 2k          24.2
2 × 2k          20.4
3 × 2k          18.4
4 × 2k          17.8
5 × 2k          17.2                       1 × 3772        22.5
7 × 2k          17.1                       1 × 4634        22.6
                                           1 × 16k         22.1
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Not surprising: more parameters, better performance.
48. Universality Theorem
Any continuous function f : R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons).
Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html
So why a “deep” neural network, not a “fat” neural network?
49. Fat + Short vs. Thin + Tall
Shallow: x1, x2, …, xN feed one wide hidden layer.
Deep: x1, x2, …, xN feed many stacked layers.
Which one is better, given the same number of parameters?
50. Fat + Short vs. Thin + Tall
Layers × Size   Word Error Rate (%)        Layers × Size   Word Error Rate (%)
1 × 2k          24.2
2 × 2k          20.4
3 × 2k          18.4
4 × 2k          17.8
5 × 2k          17.2                       1 × 3772        22.5
7 × 2k          17.1                       1 × 4634        22.6
                                           1 × 16k         22.1
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Why?
51. Analogy (this page is for EE background)
Logic circuits consist of gates. Two layers of logic gates can represent any Boolean function, but using multiple layers of logic gates to build some functions is much simpler → fewer gates needed.
Neural networks consist of neurons. A network with one hidden layer can represent any continuous function, but using multiple layers of neurons to represent some functions is much simpler → fewer parameters → less data?
52. Modularization
• Deep → Modularization
Training one classifier per class directly from the image:
Classifier 1: girls with long hair (長髮女)
Classifier 2: boys with long hair (長髮男): few examples → weak
Classifier 3: girls with short hair (短髮女)
Classifier 4: boys with short hair (短髮男)
Classes with few examples (e.g. boys with long hair) yield weak classifiers.
53. Modularization
• Deep → Modularization
Instead, first train basic classifiers for the attributes: long or short hair? Boy or girl? Comparing long-hair vs. short-hair examples and boy vs. girl examples, each basic classifier can have sufficient training examples.
54. Modularization
• Deep → Modularization
The basic attribute classifiers (long or short? boy or girl?) are shared by the following classifiers as modules. Built on these modules, Classifiers 1-4 (girls/boys with long/short hair) become fine classifiers that can each be trained with little data.
55. Modularization
• Deep → Modularization
In a deep network the first layer learns the most basic classifiers; the second layer uses the first layer as modules to build classifiers; the third uses the second, and so on.
The modularization is automatically learned from data. → Less training data?
56. Modularization
• Deep → Modularization
The first layer learns the most basic classifiers; later layers use earlier layers as modules.
Reference: Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision, ECCV 2014 (pp. 818-833).
57. Outline of Lecture I
Introduction of Deep Learning
Why Deep?
“Hello World” for Deep Learning
59. Keras
• François Chollet is the author of Keras. He currently works for Google as a deep learning engineer and researcher.
• Keras means horn in Greek.
• Documentation: http://keras.io/
• Example: https://github.com/fchollet/keras/tree/master/examples
61. Example Application
• Handwriting Digit Recognition: Machine(28 × 28 image) = “1”
“Hello world” for deep learning.
MNIST Data: http://yann.lecun.com/exdb/mnist/
Keras provides a data set loading function: http://keras.io/datasets/
64. Keras
Step 3.1: Configuration: set the loss, the optimizer, and the learning rate η (e.g. 0.1) for the update w ← w − η ∂L/∂w. (How the optimizer works: next lecture.)
Step 3.2: Find the optimal network parameters by fitting the model on the training data (images) and their labels (digits).
65. Keras
Step 3.2: Find the optimal network parameters.
https://www.tensorflow.org/versions/r0.8/tutorials/mnist/beginners/index.html
The training data are numpy arrays: images of shape (number of training examples, 28 × 28 = 784) and labels of shape (number of training examples, 10).
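Putting slides 59-65 together, a minimal sketch of this “hello world” in Keras (the layer sizes are my own choice, the batch size 100 / 20 epochs follow the mini-batch slides below, and the API spelling follows the current tf.keras):

import numpy as np
from tensorflow import keras

# load MNIST with Keras's data set loading function
(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255  # (60000, 784)
y_train = keras.utils.to_categorical(y_train, 10)           # (60000, 10) one-hot

# Step 1: define a set of functions (the network structure)
model = keras.Sequential([
    keras.layers.Dense(500, activation="sigmoid", input_shape=(784,)),
    keras.layers.Dense(500, activation="sigmoid"),
    keras.layers.Dense(10, activation="softmax"),
])

# Step 2 and Step 3.1: goodness of function (loss) and configuration (optimizer)
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# Step 3.2: find the optimal network parameters
model.fit(x_train, y_train, batch_size=100, epochs=20)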
67. Keras
• Using the GPU to speed up training
• Way 1: THEANO_FLAGS=device=gpu0 python YourCode.py
• Way 2 (in your code):
  import os
  os.environ["THEANO_FLAGS"] = "device=gpu0"
70. Recipe of Deep Learning
Neural Network: Step 1 (define a set of functions) → Step 2 (goodness of function) → Step 3 (pick the best function).
Good results on training data? NO → go back and modify the three steps. YES:
Good results on testing data? NO → Overfitting! YES → done.
71. Do not always blame overfitting
Bad results on the testing data are overfitting only if the results on the training data are good; if they are bad on the training data too, the network is simply not well trained.
72. Recipe of Deep Learning
Good results on training data? YES → good results on testing data?
There are different approaches for different problems, e.g. dropout for good results on testing data.
73. Recipe of Deep Learning (for good results on training data)
Choosing proper loss / Mini-batch / New activation function / Adaptive Learning Rate / Momentum
78. Recipe of Deep Learning (for good results on training data)
Choosing proper loss / Mini-batch / New activation function / Adaptive Learning Rate / Momentum
79. Mini-batch
Randomly initialize the network parameters.
Pick the 1st mini-batch (e.g. x1, x31, …): L′ = l1 + l31 + ⋯ → update the parameters once.
Pick the 2nd mini-batch (e.g. x2, x16, …): L″ = l2 + l16 + ⋯ → update the parameters once.
… until all mini-batches have been picked: one epoch. Repeat the above process.
We do not really minimize total loss!
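The same loop as a numpy-style sketch (grad_loss, a stand-in for backpropagation over one mini-batch, is hypothetical):

import numpy as np

def train(params, X, Y, grad_loss, eta=0.1, batch_size=100, epochs=20):
    R = len(X)
    for _ in range(epochs):                # one pass below = one epoch
        order = np.random.permutation(R)   # shuffle each epoch (Keras default)
        for start in range(0, R, batch_size):
            idx = order[start:start + batch_size]
            g = grad_loss(params, X[idx], Y[idx])  # gradient of L' on this batch
            params = params - eta * g              # update parameters once
    return params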
80. Mini-batch
Pick the 1st batch, L′ = l1 + l31 + ⋯, update parameters once; pick the 2nd batch, L″ = l2 + l16 + ⋯, update parameters once; … until all mini-batches have been picked (one epoch).
E.g. 100 examples in a mini-batch, repeated for 20 epochs.
81. Mini-batch
Randomly initialize the network parameters, then update them once per mini-batch (L′ = l1 + l31 + ⋯, L″ = l2 + l16 + ⋯, …).
L is different each time we update the parameters: we do not really minimize total loss!
83. Mini-batch is Faster
Original gradient descent: update after seeing all examples (one update per epoch).
With mini-batch: if there are 20 batches, update 20 times in one epoch, seeing only one batch per update.
The speed advantage is not always true with parallel computing: one pass over all examples can take the same time either way (for data sets that are not super large). Still, mini-batch has better performance!
84. Mini-batch is Better!
Testing accuracy: mini-batch 0.84, no batch 0.12.
[Figure: training accuracy over epochs for mini-batch vs. no batch]
85. Mini-batch
Shuffle the training examples for each epoch, so the mini-batches are composed differently every epoch: the batch holding x1 groups l1 + l31 in epoch 1 but l1 + l17 in epoch 2, and so on.
Don’t worry: this is the default of Keras.
86. Recipe of Deep Learning (for good results on training data)
Choosing proper loss / Mini-batch / New activation function / Adaptive Learning Rate / Momentum
87. Hard to get the power of Deep …
Deeper usually does not imply better (results on training data).
89. Vanishing Gradient Problem
In a deep sigmoid network (x1, x2, …, xN in; y1, y2, …, yM out):
Earlier layers: smaller gradients → learn very slowly → still almost random.
Later layers: larger gradients → learn very fast → already converge, based on the almost-random earlier layers!?
98. Maxout
• Learnable activation function [Ian J. Goodfellow, ICML’13]
With inputs x1, x2, group the linear units and take the max within each group, e.g.
First layer: max(5, 7) = 7 and max(−1, 1) = 1; second layer: max(1, 2) = 2 and max(4, 3) = 4.
ReLU is a special case of Maxout. You can have more than 2 elements in a group.
99. Maxout
• Learnable activation function [Ian J. Goodfellow, ICML’13]
• The activation function of a maxout network can be any piecewise linear convex function.
• The number of pieces depends on how many elements are in a group (2 elements in a group vs. 3 elements in a group).
ReLU is a special case of Maxout.
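A small numpy sketch of one maxout layer (the shapes are my own illustration: each output unit owns a group of k linear pieces):

import numpy as np

def maxout(x, W, b):
    # W: (units, k, in_dim), b: (units, k) -- k linear functions per unit
    z = np.einsum("ukd,d->uk", W, x) + b
    return z.max(axis=1)  # each unit outputs the max over its group

# With k = 2 and the pieces fixed to (w.x + b, 0), the output is max(w.x + b, 0):
# exactly ReLU, which is why ReLU is a special case of Maxout.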
100. Recipe of Deep Learning (for good results on training data)
Choosing proper loss / Mini-batch / New activation function / Adaptive Learning Rate / Momentum
101. Learning Rates
Set the learning rate η carefully: if the learning rate is too large, the total loss may not decrease after each update.
102. Learning Rates
Set the learning rate η carefully: if the learning rate is too large, the total loss may not decrease after each update; if the learning rate is too small, training would be too slow.
103. Learning Rates
• Popular & simple idea: reduce the learning rate by some factor every few epochs.
• At the beginning we are far from the destination, so we use a larger learning rate; after several epochs we are close to the destination, so we reduce the learning rate.
• E.g. 1/t decay: η_t = η / √(t + 1)
• The learning rate cannot be one-size-fits-all: give different parameters different learning rates.
104. Adagrad
Original: w ← w − η ∂L/∂w
Adagrad: w ← w − η_w ∂L/∂w, a parameter-dependent learning rate
η_w = η / √(Σ_{i=0..t} (g_i)²)
where η is a constant and g_i is the ∂L/∂w obtained at the i-th update: the learning rate is divided by the root of the summed squares of the previous derivatives.
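The Adagrad update as a numpy sketch (the toy quadratic loss is my own illustration; the small epsilon guards the first division):

import numpy as np

eta = 1.0
w = np.array([5.0, 3.0])
g_sq_sum = np.zeros_like(w)   # running sum of squared gradients g_i^2

def grad(w):                  # gradient of the toy loss w1^2 + 10*w2^2
    return np.array([2 * w[0], 20 * w[1]])

for t in range(100):
    g = grad(w)
    g_sq_sum += g ** 2
    w -= eta / (np.sqrt(g_sq_sum) + 1e-8) * g   # per-parameter learning rate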
107. Not the whole story ……
• Adagrad [John Duchi, JMLR’11]
• RMSprop: https://www.youtube.com/watch?v=O3sxAc4hxZU
• Adadelta [Matthew D. Zeiler, arXiv’12]
• “No more pesky learning rates” [Tom Schaul, arXiv’12]
• AdaSecant [Caglar Gulcehre, arXiv’14]
• Adam [Diederik P. Kingma, ICLR’15]
• Nadam: http://cs229.stanford.edu/proj2015/054_report.pdf
108. Recipe of Deep Learning (for good results on training data)
Choosing proper loss / Mini-batch / New activation function / Adaptive Learning Rate / Momentum
109. Hard to find optimal network parameters
Plotting total loss against the value of a network parameter w:
Very slow at a plateau (∂L/∂w ≈ 0)
Stuck at a saddle point (∂L/∂w = 0)
Stuck at a local minimum (∂L/∂w = 0)
110. In the physical world ……
• Momentum: a rolling ball keeps moving past flat spots. How about putting this phenomenon into gradient descent?
111. Momentum
Movement = negative of ∂L/∂w + momentum (carried over from the previous movement).
Even where ∂L/∂w = 0, the momentum keeps the parameters moving: still no guarantee of reaching global minima, but it gives some hope ……
[Figure: negative of ∂L/∂w, momentum, and the real movement along the cost curve]
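The momentum update in numpy form (the decay factor lam for the previous movement is my own notation; the slide leaves it implicit):

import numpy as np

eta, lam = 0.01, 0.9
w = np.array([5.0, 3.0])
movement = np.zeros_like(w)

def grad(w):                  # same toy gradient as in the Adagrad sketch
    return np.array([2 * w[0], 20 * w[1]])

for t in range(200):
    movement = lam * movement - eta * grad(w)  # previous movement + new gradient
    w += movement             # keeps moving even where dL/dw is ~0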
113. Let’s try it
• ReLU, 3 layers
Testing accuracy: original 0.96, Adam 0.97.
[Figure: training accuracy for Adam vs. original]
114. Recipe of Deep Learning (for good results on testing data)
Early Stopping / Regularization / Dropout / Network Structure
115. Why Overfitting?
• Training data and testing data can be different.
The learning target is defined by the training data, and the parameters achieving that target do not necessarily give good results on the testing data.
116. Panacea for Overfitting
• Have more training data
• Create more training data (?)
E.g. handwriting recognition: shift each original training image by 15° to create new training data.
118. Why Overfitting?
• For experiments, we added some noise to the testing data; training is not influenced.
Testing accuracy: clean 0.97, noisy 0.50.
119. Recipe of Deep Learning (for good results on testing data)
Early Stopping / Weight Decay / Dropout / Network Structure
121. Recipe of Deep Learning (for good results on testing data)
Early Stopping / Weight Decay / Dropout / Network Structure
122. Weight Decay
• Our brain prunes out the useless links between neurons.
Doing the same thing to a machine’s brain improves the performance.
124. Weight Decay
• Implementation
Original: w ← w − η ∂L/∂w
Weight decay: w ← 0.99 w − η ∂L/∂w, i.e. multiply by (1 − λ) with λ = 0.01, so the weights get smaller and smaller.
Keras: http://keras.io/regularizers/
125. Recipe of Deep Learning (for good results on testing data)
Early Stopping / Weight Decay / Dropout / Network Structure
127. Dropout
Training: each time before updating the parameters, each neuron has a p% chance to drop out. The resulting thinner network is used for training: the structure of the network is changed.
For each mini-batch, we resample the dropout neurons.
128. Dropout
Testing: no dropout.
If the dropout rate at training is p%, all the weights are multiplied by (1 − p)%. Assume the dropout rate is 50%: if a weight w = 1 after training, set w = 0.5 for testing.
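A numpy sketch of the two phases (the mask is resampled for every mini-batch during training; at test time the weights are simply scaled):

import numpy as np

p = 0.5  # dropout rate

def forward_train(a, W):
    mask = np.random.rand(a.shape[0]) >= p  # each neuron dropped with prob. p
    return W @ (a * mask)                   # the thinner network for this batch

def forward_test(a, W):
    return (W * (1 - p)) @ a                # no dropout; weights times (1 - p)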
129. Dropout: Intuitive Reason
When people team up, if everyone expects the partners to do the work, nothing gets done in the end. However, if you know your partner will drop out, you will do better. (“My partner will slack off, so I have to do a good job.”)
When testing, no one actually drops out, so we obtain good results eventually.
130. Dropout: Intuitive Reason
• Why should the weights be multiplied by (1 − p)% (p = dropout rate) when testing?
Training with dropout (rate 50%): z = w1 x1 + w2 x2 + w3 x3 + w4 x4 with half the inputs dropped on average.
Testing with no dropout and the weights from training: z′ ≈ 2z.
Multiplying the weights by (1 − p)% = 0.5 gives z′ ≈ z.
131. Dropout is a kind of ensemble.
Ensemble: train a bunch of networks with different structures (Networks 1-4), each on its own subset (Sets 1-4) of the training set.
132. Dropout is a kind of ensemble.
Ensemble: feed the testing data x to Networks 1-4, obtain y1, y2, y3, y4, and average them.
133. Dropout is a kind of ensemble.
Training with dropout: each mini-batch (mini-batch 1, 2, 3, 4, …) trains one sampled network; some parameters in the networks are shared. With M neurons there are 2^M possible networks.
134. Dropout is a kind of ensemble.
Testing with dropout: ideally, average the outputs y1, y2, y3, … of all sampled networks on the testing data x. In practice, multiplying all the weights by (1 − p)% gives approximately the same y. Surprising, but it works.
135. More about dropout
• More references for dropout: [Nitish Srivastava, JMLR’14] [Pierre Baldi, NIPS’13] [Geoffrey E. Hinton, arXiv’12]
• Dropout works better with Maxout [Ian J. Goodfellow, ICML’13]
• Dropconnect [Li Wan, ICML’13]: dropout deletes neurons, dropconnect deletes the connections between neurons
• Annealed dropout [S.J. Rennie, SLT’14]: the dropout rate decreases by epochs
• Standout [J. Ba, NIPS’13]: each neuron has a different dropout rate
138. Recipe of Deep Learning (for good results on testing data)
Early Stopping / Regularization / Dropout / Network Structure
CNN is a very good example! (next lecture)
140. Recipe of Deep Learning
Neural Network: Step 1 (define a set of functions) → Step 2 (goodness of function) → Step 3 (pick the best function).
Good results on training data? NO → revisit the three steps. YES:
Good results on testing data? NO → revisit, this time against overfitting. YES → done.
149. Variants of Neural Networks
Convolutional Neural Network (CNN): widely used in image processing
Recurrent Neural Network (RNN)
150. Why CNN for Image?
• When processing an image, the first layer of a fully connected network would be very large: a 100 × 100 × 3 image flattened into the input of a 1000-neuron layer already needs 3 × 10^7 weights.
Can the fully connected network be simplified by considering the properties of image recognition?
151. Why CNN for Image
• Some patterns are much smaller than the whole image. A neuron does not have to see the whole image to discover the pattern (e.g. a “beak” detector): connecting to a small region needs fewer parameters.
152. Why CNN for Image
• The same patterns appear in different regions. An “upper-left beak” detector and a “middle beak” detector do almost the same thing, so they can use the same set of parameters.
153. Why CNN for Image
• Subsampling the pixels will not change the object: a subsampled bird is still a bird. We can subsample the pixels to make the image smaller, so the network needs fewer parameters to process it.
154. Three Steps for Deep Learning
Step 1: define a set of functions → Convolutional Neural Network
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……
155. The whole CNN
Image → [Convolution → Max Pooling] (can repeat many times) → Flatten → Fully Connected Feedforward Network → “cat”, “dog”, ……
156. The whole CNN
Convolution handles Property 1 (some patterns are much smaller than the whole image) and Property 2 (the same patterns appear in different regions); Max Pooling handles Property 3 (subsampling the pixels will not change the object). The [Convolution → Max Pooling] stage can repeat many times before Flatten.
157. The whole CNN
Image → [Convolution → Max Pooling] (can repeat many times) → Flatten → Fully Connected Feedforward Network → “cat”, “dog”, ……
158. CNN: Convolution
A 6 × 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
Filter 1 (3 × 3 matrix):
 1 -1 -1
-1  1 -1
-1 -1  1
Filter 2 (3 × 3 matrix):
-1  1 -1
-1  1 -1
-1  1 -1
……
The filters are the network parameters to be learned. Each filter detects a small pattern (3 × 3): Property 1.
167. CNN: Max Pooling
The 6 × 6 image goes through Conv (a 4 × 4 feature map per filter) and Max Pooling, giving a new but smaller 2 × 2 image:
Filter 1 channel: 3 0 / 3 1
Filter 2 channel: -1 1 / 0 3
Each filter is a channel.
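A compact numpy sketch of these two stages (stride 1, no padding, non-overlapping 2 x 2 pooling):

import numpy as np

img = np.array([[1,0,0,0,0,1],
                [0,1,0,0,1,0],
                [0,0,1,1,0,0],
                [1,0,0,0,1,0],
                [0,1,0,0,1,0],
                [0,0,1,0,1,0]])
filt1 = np.array([[ 1,-1,-1],
                  [-1, 1,-1],
                  [-1,-1, 1]])  # Filter 1 from the convolution slide

def conv2d(x, k):
    # valid cross-correlation with stride 1: 6x6 -> 4x4 here
    kh, kw = k.shape
    return np.array([[np.sum(x[i:i+kh, j:j+kw] * k)
                      for j in range(x.shape[1] - kw + 1)]
                     for i in range(x.shape[0] - kh + 1)])

def maxpool2(x):
    # non-overlapping 2x2 max pooling: 4x4 -> 2x2
    return x.reshape(x.shape[0]//2, 2, x.shape[1]//2, 2).max(axis=(1, 3))

print(maxpool2(conv2d(img, filt1)))  # [[3, 0], [3, 1]]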
168. The whole CNN
[Convolution → Max Pooling] (can repeat many times) produces a new image, smaller than the original; the number of channels is the number of filters.
169. The whole CNN
Image → Convolution → Max Pooling (a new image) → Convolution → Max Pooling (a new image) → Flatten → Fully Connected Feedforward Network → “cat”, “dog”, ……
177. Convolution vs. fully connected
The 6 × 6 image (dim = 36) passes through convolution with the two 3 × 3 filters, with max pooling implemented like the maxout units shown earlier (take the max of each group). The convolution output has dim = 4 × 4 × 2 = 32, yet the convolution stage needs only 9 × 2 = 18 parameters; a fully connected layer mapping the 36 inputs to those 32 outputs would need 36 × 32 = 1152 parameters.
178. Convolutional Neural Network
Step 1: define a set of functions → Convolutional Neural Network (convolution, max pooling, fully connected)
Step 2: goodness of function, e.g. CNN(image) vs. the target (1, 0, 0, …) over “monkey”, “cat”, “dog”
Step 3: pick the best function
Learning: nothing special, just gradient descent ……
179. Playing Go
Network input: the 19 × 19 board position as a 19 × 19 vector (black: 1, white: −1, none: 0); output: the next move, also a 19 × 19 vector.
A fully connected feedforward network can be used, but a CNN performs much better: treat the 19 × 19 position as a 19 × 19 image.
180. Playing Go
Training: records of previous plays, e.g. 進藤光 v.s. 社清春 (black: 5之五, white: 天元, black: 五之5, …).
The network learns to predict each next move: after black’s 5之五, target “天元” = 1 (else 0); after white’s 天元, target “五之5” = 1 (else 0).
181. Why CNN for playing Go?
• Some patterns are much smaller than the whole image: AlphaGo uses 5 × 5 filters for the first layer.
• The same patterns appear in different regions.
182. Why CNN for playing Go?
• Subsampling the pixels will not change the object? AlphaGo does not use max pooling …… How to explain this???
183. Variants of Neural Networks
Convolutional Neural Network (CNN)
Recurrent Neural Network (RNN): a neural network with memory
184. Example Application
• Slot Filling: “I would like to arrive Taipei on November 2nd.” → a ticket booking system fills the slots Destination: Taipei; time of arrival: November 2nd.
186. 1-of-N Encoding
How to represent each word as a vector? Each dimension corresponds to a word in the lexicon; the dimension for the word is 1, and the others are 0. The vector is lexicon-size, e.g. lexicon = {apple, bag, cat, dog, elephant}:
apple = [1 0 0 0 0]
bag = [0 1 0 0 0]
cat = [0 0 1 0 0]
dog = [0 0 0 1 0]
elephant = [0 0 0 0 1]
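A tiny sketch of this encoding in Python:

lexicon = ["apple", "bag", "cat", "dog", "elephant"]

def one_of_n(word):
    # 1 at the word's dimension, 0 everywhere else
    return [1 if w == word else 0 for w in lexicon]

print(one_of_n("cat"))  # [0, 0, 1, 0, 0]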
187. Beyond 1-of-N encoding
• Dimension for “Other”: append an “other” dimension, so words outside the lexicon, e.g. w = “Gandalf” or w = “Sauron”, get (apple, bag, cat, dog, elephant) = 0 and “other” = 1.
• Word hashing: represent w = “apple” by its letter trigrams over 26 × 26 × 26 dimensions (a-p-p = 1, p-p-l = 1, p-l-e = 1, the rest 0).
188. Example Application
Solving slot filling with a feedforward network?
Input: a word, each word represented as a vector (e.g. “Taipei”).
Output: a probability distribution y1, y2 over the slots (dest, time of departure) that the input word belongs to.
189. Example Application
Problem? “arrive Taipei on November 2nd” → other, dest, other, time, time; but in “leave Taipei on November 2nd”, Taipei is the place of departure.
The same input “Taipei” must yield different outputs (dest vs. place of departure): the neural network needs memory!
190. Three Steps for Deep Learning
Step 1: define a set of functions → Recurrent Neural Network
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……
191. Recurrent Neural Network (RNN)
The outputs of the hidden layer (a1, a2) are stored in the memory; the memory can be considered as another input (alongside x1, x2) when computing y1, y2.
192. RNN
The same network is used again and again: for “arrive Taipei on November 2nd”, x1 = arrive, x2 = Taipei, x3 = on, …; at each step the stored hidden state is fed back (a1 → a2, a2 → a3), and the output yt gives the probability of word t in each slot.
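A minimal numpy sketch of this recurrence (the weight names W_x, W_a, W_y and the tanh hidden activation are my own choices; the slide leaves them implicit):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, W_x, W_a, W_y, b):
    ys, a = [], np.zeros(W_a.shape[0])      # the memory starts at 0
    for x in xs:                            # the same network at every step
        a = np.tanh(W_x @ x + W_a @ a + b)  # new hidden state, stored in memory
        ys.append(softmax(W_y @ a))         # slot probabilities for this word
    return ys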
193. RNN
“arrive Taipei” vs. “leave Taipei”: because “arrive” and “leave” store different values in the memory, the network gives different slot probabilities for the same word “Taipei”.
194. Of course it can be deep …
Multiple hidden layers can be stacked: at each time step t, t+1, t+2 the input xt flows up through several recurrent layers to the output yt.
196. Long Short-term Memory (LSTM)
A special neuron with 4 inputs and 1 output, built around a memory cell:
Input gate: a signal from another part of the network controls whether the input enters the cell.
Output gate: a signal controls whether the cell’s value is read out.
Forget gate: a signal controls whether the cell’s value is kept or erased.
197. LSTM cell computation
Inputs: cell input z, input-gate signal z_i, forget-gate signal z_f, output-gate signal z_o. The gate activation f is usually a sigmoid, between 0 and 1, mimicking an open or closed gate.
New cell value: c′ = g(z) f(z_i) + c f(z_f)
Output: a = h(c′) f(z_o)
Here c is the old cell value, g and h are activation functions, and the products are multiplications.
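Those two equations as a numpy sketch (one cell; the four pre-activations z, z_i, z_f, z_o come from the rest of the network):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(c, z, z_i, z_f, z_o, g=np.tanh, h=np.tanh):
    c_new = g(z) * sigmoid(z_i) + c * sigmoid(z_f)  # c' = g(z)f(z_i) + c f(z_f)
    a = h(c_new) * sigmoid(z_o)                     # a  = h(c') f(z_o)
    return a, c_new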
203. Multiple-layer LSTM
This is quite standard now. Don’t worry if you cannot understand the unrolled diagram: Keras can handle it. Keras supports the “LSTM”, “GRU”, and “SimpleRNN” layers.
(https://img.komicolle.org/2015-09-20/src/14426967627131.gif)
204. Three Steps for Deep Learning
Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……
205. Training
[Figure: the unrolled RNN, with the same weights W copied at every step,
over the sentence "arrive Taipei on November 2nd".]
Training sentences are labeled word by word:
arrive → other, Taipei → dest, on → other, November → time, 2nd → time
The learning target for each word is the 1-of-N encoding of its slot label.
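Continuing the Keras sketch above, steps 2 and 3 (goodness of function, picking the best function) could look like this; x_train and y_train are hypothetical placeholders, not data from the slides:

```python
# Cross-entropy between the per-word slot distribution and the 1-of-N target.
model.compile(optimizer="adam", loss="categorical_crossentropy")
# x_train: (sentences, max_len) word indices
# y_train: (sentences, max_len, 3) one-hot slot labels
# model.fit(x_train, y_train, epochs=10)
```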
206. Step 1:
define a set
of function
Step 2:
goodness of
function
Step 3: pick
the best
function
Three Steps for Deep Learning
Deep Learning is so simple ……
207. Learning
RNN learning is very difficult in practice.
Backpropagation through time (BPTT): the unrolled network shares one
set of weights, updated as w ← w − η ∂L/∂w.
208. Unfortunately ……
• RNN-based networks are not always easy to learn
Real experiments on language modeling
(Thanks to 曾柏翔 for providing the experimental results.)
[Plot: total loss vs. epoch; training succeeds only sometimes, if lucky.]
209. The error surface is rough.
[Figure: the total loss plotted over w1 and w2; the error surface is
either very flat or very steep.]
Solution: clipping the gradients [Razvan Pascanu, ICML'13]
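A minimal sketch of the clipping idea (the threshold 1.0 is an arbitrary illustration):

```python
# Gradient clipping: cap the gradient before the update, so a step taken
# on a steep "cliff" of the error surface stays bounded.
import numpy as np

def clipped_update(w, grad, lr=0.01, clip=1.0):
    grad = np.clip(grad, -clip, clip)  # element-wise cap on the gradient
    return w - lr * grad

# Keras optimizers expose the same idea directly, e.g.
# tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=1.0)
```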
210. Why? Toy Example
A toy network that multiplies by the same recurrent weight w at each of
1000 time steps, with input sequence 1, 0, 0, …, 0, so y_1000 = w^999:
w = 1 → y_1000 = 1
w = 1.01 → y_1000 ≈ 20000 (large ∂L/∂w → small learning rate?)
w = 0.99 → y_1000 ≈ 0 (small ∂L/∂w → large learning rate?)
w = 0.01 → y_1000 ≈ 0
Because the same w is reused at every time step, a small change in w is
amplified or suppressed exponentially.
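The toy numbers are easy to reproduce with a two-line check:

```python
# The slide's toy computation: after 1000 time steps, y_1000 = w^999.
for w in [1.0, 1.01, 0.99, 0.01]:
    print(w, w ** 999)
# 1.0 -> 1.0, 1.01 -> ~2e4 (explodes), 0.99 -> ~4e-5, 0.01 -> 0.0 (vanishes)
```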
211. Helpful Techniques
• Long Short-term Memory (LSTM)
• Can deal with gradient vanishing (not gradient explode)
Memory and input are added, so the influence never disappears unless
the forget gate is closed → no gradient vanishing
(if the forget gate is opened).
Gated Recurrent Unit (GRU): simpler than LSTM [Cho, EMNLP'14]
212. Helpful Techniques
Vanilla RNN initialized with the identity matrix + ReLU activation
function [Quoc V. Le, arXiv'15]:
outperforms or is comparable with LSTM on 4 different tasks.
Clockwork RNN [Jan Koutnik, JMLR'14]
Structurally Constrained Recurrent Network (SCRN) [Tomas Mikolov, ICLR'15]
213. More Applications ……
In slot filling, input and output are both sequences with the same
length: one slot probability distribution per input word
("arrive Taipei on November 2nd" → one output per word).
RNN can do more than that!
214. Many to one
• Input is a vector sequence, but output is only one vector
Sentiment Analysis, e.g. classifying movie reviews
(超好雷 / 好雷 / 普雷 / 負雷 / 超負雷, from very positive to very negative):
"看了這部電影覺得很高興……" ("Watching this movie made me happy …") → Positive (正雷)
"這部電影太糟了……" ("This movie is terrible …") → Negative (負雷)
"這部電影很棒……" ("This movie is great …") → Positive (正雷)
The RNN reads the review word by word (我 覺 得 太 糟 了) and outputs
a single label at the end.
Keras Example:
https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py
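A hedged, stripped-down sketch of the linked example's structure (the sizes are illustrative; see imdb_lstm.py for the real thing):

```python
# Many-to-one: only the final LSTM output feeds the classifier.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(input_dim=20000, output_dim=128),  # word indices -> vectors
    LSTM(128),                        # return_sequences=False by default
    Dense(1, activation="sigmoid"),   # P(positive review)
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```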
215. Many to Many (Output is shorter)
• Input and output are both sequences, but the output is shorter.
• E.g. Speech Recognition
Input: acoustic vector sequence (one vector per frame)
Frame-level output: 好 好 好 棒 棒 棒 棒 棒
Trimming the repeated characters gives the character sequence "好棒".
Problem? After trimming it can never output "好棒棒", which contains a
genuinely repeated character.
216. Many to Many (Output is shorter)
• Input and output are both sequences, but the output is shorter.
• Connectionist Temporal Classification (CTC) [Alex Graves,
ICML'06][Alex Graves, ICML'14][Haşim Sak, Interspeech'15][Jie Li,
Interspeech'15][Andrew Senior, ASRU'15]
Add an extra symbol "φ" representing "null":
好 φ φ 棒 φ φ φ φ → "好棒"
好 φ φ 棒 φ 棒 φ φ → "好棒棒"
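A sketch of the CTC output rule that makes this work (the cited papers' decoding is more involved; this shows only the collapse step):

```python
# CTC collapse: merge repeated symbols, then drop the null "φ".
# A null between two identical characters keeps them distinct,
# which is how "好棒棒" stays recoverable.
def ctc_collapse(frames, null="φ"):
    out, prev = [], None
    for s in frames:
        if s != prev and s != null:
            out.append(s)
        prev = s
    return "".join(out)

print(ctc_collapse(["好", "φ", "φ", "棒", "φ", "φ", "φ", "φ"]))   # 好棒
print(ctc_collapse(["好", "φ", "φ", "棒", "φ", "棒", "φ", "φ"]))  # 好棒棒
```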
217. Many to Many (No Limitation)
• Input and output are both sequences with different lengths.
→ Sequence-to-sequence learning
• E.g. Machine Translation (machine learning → 機器學習)
The encoder reads "machine learning"; its final state contains all the
information about the input sequence.
218. Many to Many (No Limitation)
• Input and output are both sequences with different lengths.
→ Sequence-to-sequence learning
• E.g. Machine Translation (machine learning → 機器學習)
The decoder keeps generating: 機 器 學 習 慣 性 ……
It doesn't know when to stop ("慣性" means inertia: it just keeps going).
219. Many to Many (No Limitation)
推 tlkagk: =========斷==========
Ref: http://zh.pttpedia.wikia.com/wiki/%E6%8E%A5%E9%BE%8D%E6%8E%A8%E6%96%87
(Netizen Encyclopedia, the PTT wiki, on the word-chain posting game)
220. Many to Many (No Limitation)
• Input and output are both sequences with different lengths.
→ Sequence-to-sequence learning
• E.g. Machine Translation (machine learning → 機器學習)
Add a stop symbol "===" (斷): the decoder outputs 機 器 學 習 === and stops.
[Ilya Sutskever, NIPS'14][Dzmitry Bahdanau, arXiv'15]
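A sketch of how the stop symbol is used at generation time; decode_step is a hypothetical wrapper around the trained decoder RNN, not an API from the slides:

```python
# Greedy decoding sketch: keep generating until the stop symbol appears.
# decode_step(prev_token, state) -> (next_token, new_state) is assumed.
def generate(decode_step, max_len=50):
    out, state, prev = [], None, "<start>"
    while len(out) < max_len:
        token, state = decode_step(prev, state)
        if token == "===":          # the learned "斷" symbol: stop here
            break
        out.append(token)
        prev = token
    return out
```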
221. One to Many
• Input an image, but output a sequence of words
Caption Generation: a CNN turns the input image into a single vector
for the whole image, and the RNN decodes it into
"a woman is …… ===".
[Kelvin Xu, arXiv'15][Li Yao, ICCV'15]
226. Outline
Supervised Learning
• Ultra Deep Network
• Attention Model
Reinforcement Learning
Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
New network structure
235. Outline
Supervised Learning
• Ultra Deep Network
• Attention Model
Reinforcement Learning
Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
New network structure
238. Attention-based Model v2: Neural Turing Machine
A DNN/RNN maps the input to the output; a reading head controller
places a reading head over the machine's memory, and a writing head
controller places a writing head that can also modify the memory.
240. Reading Comprehension
• End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J. Weston,
R. Fergus. NIPS, 2015.
[Figure: the position of the reading head in the story.]
Keras has an example:
https://github.com/fchollet/keras/blob/master/examples/babi_memnn.py
243. Visual Question Answering
• Huijuan Xu, Kate Saenko. Ask, Attend and Answer: Exploring
Question-Guided Spatial Attention for Visual Question
Answering. arXiv Pre-Print, 2015
244. Speech Question Answering
• TOEFL Listening Comprehension Test by Machine
• Example:
Question: “ What is a possible origin of Venus’ clouds? ”
Audio Story:
Choices:
(A) gases released as a result of volcanic activity
(B) chemical reactions caused by high surface temperatures
(C) bursts of radio energy from the planet's surface
(D) strong winds that blow dust into the atmosphere
(The original story is 5 min long.)
245. Simple Baselines
[Bar chart: accuracy (%) of seven naive approaches (1)-(7), including
random guessing, (2) selecting the shortest choice as the answer, and
(4) selecting the choice semantically most similar to the others.]
Experimental setup: 717 questions for training,
124 for validation, 122 for testing
246. Model Architecture
Question: "what is a possible origin of Venus' clouds?" → semantic
analysis → question semantics.
Audio Story → speech recognition → semantic analysis, with attention
over the recognized story. ASR output excerpt:
"…… It be quite possible that this be due to volcanic eruption because
volcanic eruption often emit gas. If that be the case volcanism could
very well be the root cause of Venus 's thick cloud cover. And also we
have observe burst of radio energy from the planet 's surface. These
burst be similar to what we see when volcano erupt on earth ……"
Answer: select the choice most similar to the attended answer.
Everything is learned from training examples.
252. Outline
Supervised Learning
• Ultra Deep Network
• Attention Model
Reinforcement Learning
Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
New network structure
257. Supervised vs. Reinforcement
• Supervised: for each board position the teacher gives the next move
("5-5", "3-3").
• Reinforcement Learning: first move …… many moves …… Win!
Only the final outcome tells the machine how well it played.
Alpha Go is supervised learning + reinforcement learning.
258. Difficulties of Reinforcement
Learning
• It may be better to sacrifice immediate reward to
gain more long-term reward
• E.g. Playing Go
• Agent’s actions affect the subsequent data it
receives
• E.g. Exploration
260. Application: Interactive Retrieval
• Interactive retrieval is helpful. [Wu & Lee, INTERSPEECH 16]
User: "Deep Learning"
Machine: Is "Deep Learning" related to Machine Learning?
Machine: Is "Deep Learning" related to Education?
261. Deep Reinforcement Learning
• Different network depth
[Plot: retrieval performance vs. amount of user interaction; deeper
networks give better retrieval performance with less user labor.]
The task cannot be addressed by a linear model; some depth is needed.
262. More applications
• Alpha Go, Playing Video Games, Dialogue
• Flying Helicopter
• https://www.youtube.com/watch?v=0JL04JJjocc
• Driving
• https://www.youtube.com/watch?v=0xo1Ldx3L5Q
• Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI
• http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-giant-electricity-bill-with-deepmind-powered-ai
263. To learn deep reinforcement learning ……
• Lectures of David Silver
• http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
• 10 lectures (1.5 hours each)
• Deep Reinforcement Learning
• http://videolectures.net/rldm2015_silver_reinforcement_learning/
264. Outline
Supervised Learning
• Ultra Deep Network
• Attention Model
Reinforcement Learning
Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
New network structure
265. Does the machine know what the world looks like?
Draw something!
Ref: https://openai.com/blog/generative-models/
266. Deep Dream
• Given a photo, machine adds what it sees ……
http://deepdreamgenerator.com/
268. Deep Style
• Given a photo, make its style like famous paintings
https://dreamscopeapp.com/
278. Outline
Supervised Learning
• Ultra Deep Network
• Attention Model
Reinforcement Learning
Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
New network structure
280. Machine Reading
• Machines learn the meaning of words from reading a lot of documents
without supervision
Word Vector / Embedding
[Figure: a 2-D embedding space where related words cluster: dog, cat,
rabbit; jump, run; flower, tree.]
281. Machine Reading
• Generating a word vector/embedding is unsupervised:
a neural network maps a word (e.g. "Apple") to its vector.
The training data is just a lot of text.
282. Machine Reading
• Machines learn the meaning of words from reading a lot of documents
without supervision
• A word can be understood by its context:
"蔡英文 was sworn into office on 5/20" /
"馬英九 was sworn into office on 5/20"
→ 蔡英文 and 馬英九 are something very similar.
"You shall know a word by the company it keeps."
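As a hedged sketch of learning word vectors from context, using gensim (an outside library, not mentioned in the slides):

```python
# Word vectors learned purely from co-occurring context, no labels needed.
from gensim.models import Word2Vec

sentences = [["蔡英文", "520", "宣誓", "就職"],
             ["馬英九", "520", "宣誓", "就職"]]
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, seed=0)
# Same context -> similar vectors (toy data, so the number is illustrative):
print(model.wv.similarity("蔡英文", "馬英九"))
```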
285. Machine Reading
• Machines learn the meaning of words from reading a lot of documents
without supervision
286. Demo
• The model used in the demo is provided by 陳仰德
• Part of the project was done by 陳仰德 and 林資偉
• TA: 劉元銘
• The training data is from PTT (collected by 葉青峰)
287. Outline
Supervised Learning
• Ultra Deep Network
• Attention Model
Reinforcement Learning
Unsupervised Learning
• Image: Realizing what the World Looks Like
• Text: Understanding the Meaning of Words
• Audio: Learning human language without supervision
New network structure
288. Learning from Audio Books
The machine listens to lots of audio books [Chung, Interspeech'16].
It does not have any prior knowledge, like an infant.
289. Audio Word to Vector
• An audio segment corresponding to an unknown word is mapped to a
fixed-length vector.
290. Audio Word to Vector
• The audio segments corresponding to words with similar
pronunciations are close to each other.
[Figure: embedding space where the segments for "ever" / "never" and
for "dog" / "dogs" form nearby clusters.]
296. Concluding Remarks
Lecture I: Introduction of Deep Learning
Lecture II: Tips for Training Deep Neural Network
Lecture III: Variants of Neural Network
Lecture IV: Next Wave
297. Will AI soon take over most jobs?
• New Job in the AI Age:
AI trainer (machine learning expert, data scientist)
http://www.express.co.uk/news/science/651202/First-step-towards-The-Terminator-becoming-reality-AI-beats-champ-of-world-s-oldest-game