Introduction To Deep Learning
MIT 6.S191
Alexander Amini
January 28, 2019
ARTIFICIAL INTELLIGENCE: Any technique that enables computers to mimic human behavior
MACHINE LEARNING: Ability to learn without explicitly being programmed
DEEP LEARNING: Extract patterns from data using neural networks
Learned feature hierarchy: Lines & Edges → Eyes & Nose & Ears → Facial Structure
1958: Perceptron
• Learnable Weights
1986: Backpropagation
• Multi-Layer Perceptron
1995: Deep Convolutional NN
• Digit Recognition

1. Big Data
• Larger Datasets
• Easier Collection & Storage
2. Hardware
• Graphics Processing Units (GPUs)
• Massively Parallelizable
3. Software
• Improved Techniques
• New Models
• Toolboxes
The output is a non-linear activation function applied to a linear combination of the inputs:

$\hat{y} = g\left( \sum_{i=1}^{m} x_i\, w_i \right)$
Adding a bias term $w_0$, the output is still a non-linear activation function applied to a linear combination of the inputs:

$\hat{y} = g\left( w_0 + \sum_{i=1}^{m} x_i\, w_i \right) = g\left( w_0 + X^T W \right)$

where $X = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix}$ and $W = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix}$
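To make the forward pass concrete, here is a minimal NumPy sketch of the equation above; the sigmoid used for g is just one common choice (introduced in the next section), and the input and weight values are illustrative, not from the lecture.

import numpy as np

def g(z):
    # non-linear activation function; sigmoid used here as an example
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(X, W, w0):
    # y_hat = g(w0 + X^T W): bias plus dot product, passed through g
    return g(w0 + np.dot(X, W))

X = np.array([0.5, -1.0, 2.0])   # example inputs x_1 ... x_m
W = np.array([0.1, 0.4, -0.2])   # example weights w_1 ... w_m
print(perceptron(X, W, w0=1.0))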
Activation Functions

$\hat{y} = g\left( w_0 + X^T W \right)$

Example: the sigmoid function

$g(z) = \sigma(z) = \dfrac{1}{1 + e^{-z}}$

We have $w_0 = 1$ and $W = \begin{bmatrix} 3 \\ -2 \end{bmatrix}$, so:

$\hat{y} = g\left( w_0 + X^T W \right) = g\left( 1 + 3x_1 - 2x_2 \right)$

This is just a line in 2D!
[Plot: the line $1 + 3x_1 - 2x_2 = 0$ in the $(x_1, x_2)$ plane, dividing it into two half-planes]

Assume we have the input $X = \begin{bmatrix} -1 \\ 2 \end{bmatrix}$:

$\hat{y} = g\left( 1 + (3)(-1) - (2)(2) \right) = g(-6) \approx 0.002$
The Perceptron: Example
$\hat{y} = g\left( 1 + 3x_1 - 2x_2 \right)$

[Plot: the line $1 + 3x_1 - 2x_2 = 0$ divides the $(x_1, x_2)$ plane in two: where $1 + 3x_1 - 2x_2 < 0$ we get $\hat{y} < 0.5$, and where $1 + 3x_1 - 2x_2 > 0$ we get $\hat{y} > 0.5$]
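A quick numeric check of the worked example above, assuming the sigmoid activation; the values $w_0 = 1$, $W = [3, -2]$, and $X = [-1, 2]$ are taken directly from the example.

import numpy as np

z = 1 + 3 * (-1) - 2 * 2            # w0 + 3*x1 - 2*x2 = -6
y_hat = 1.0 / (1.0 + np.exp(-z))    # sigmoid(-6)
print(y_hat)                        # ~0.0025, i.e. approximately 0.002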
Writing the linear combination as $z = w_0 + \sum_{j=1}^{m} x_j\, w_j$, the perceptron output is simply $\hat{y} = g(z)$.
Multi Output Perceptron
With multiple outputs, each output gets its own linear combination of the inputs:

$y_i = g(z_i)$,   where   $z_i = w_{0,i} + \sum_{j=1}^{m} x_j\, w_{j,i}$
Single Layer Neural Network
[Diagram: inputs $x_1, \ldots, x_m$ feed hidden units $z_1, \ldots, z_{d_1}$ through weights $W^{(1)}$; the hidden units feed outputs $\hat{y}_1, \hat{y}_2$ through weights $W^{(2)}$]

For example, the second hidden unit is:

$z_2^{(1)} = w_{0,2}^{(1)} + \sum_{j=1}^{m} x_j\, w_{j,2}^{(1)} = w_{0,2}^{(1)} + x_1 w_{1,2}^{(1)} + x_2 w_{2,2}^{(1)} + x_3 w_{3,2}^{(1)}$
Single Layer Neural Network
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

m, d1 = 3, 16                        # example sizes: m input features, d1 hidden units
inputs = Input(shape=(m,))           # input layer
hidden = Dense(d1)(inputs)           # hidden layer with d1 units
outputs = Dense(2)(hidden)           # output layer with 2 units
model = Model(inputs, outputs)
Deep Neural Network

[Diagram: inputs $x_1, \ldots, x_m$ feed a stack of hidden layers; layer $k$ has units $z_{k,1}, \ldots, z_{k,n_k}$, and the final layer produces outputs $\hat{y}_1, \hat{y}_2$]
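A minimal sketch of stacking hidden layers into a deep model, extending the tf.keras style of the code above; the layer widths and the relu activations are illustrative choices, not from the lecture.

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

m, n1, n2 = 3, 16, 16                       # example sizes: m inputs, two hidden layers
inputs = Input(shape=(m,))
z1 = Dense(n1, activation='relu')(inputs)   # hidden layer 1
z2 = Dense(n2, activation='relu')(z1)       # hidden layer 2
outputs = Dense(2)(z2)                      # two outputs: y_hat_1, y_hat_2
model = Model(inputs, outputs)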
[Scatter plot: data points plotted by $x_2$ = hours spent on the final project; legend: Pass / Fail. A new point at $(4, 5)$ is marked “?”]
Feeding the input $x = \begin{bmatrix} 4, 5 \end{bmatrix}$ through the network: Predicted: 0.1, Actual: 1.
The loss of our network measures the cost incurred from incorrect predictions
The empirical loss measures the total loss over our entire dataset
  $x^{(i)}$     $f(x^{(i)})$    $y^{(i)}$
  (4, 5)        0.1             1
  (2, 1)        0.8             0
  (5, 8)        0.6             1
   ⋮             ⋮              ⋮

$J(W) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left( f\left(x^{(i)}; W\right),\; y^{(i)} \right)$   (Predicted, Actual)

Also known as:
• Objective function
• Cost function
• Empirical Risk
Binary Cross Entropy Loss
Cross entropy loss can be used with models that output a probability between 0 and 1
$J(W) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log f\left(x^{(i)}; W\right) + \left(1 - y^{(i)}\right) \log\left(1 - f\left(x^{(i)}; W\right)\right) \right]$   (Actual, Predicted)
Mean Squared Error Loss
Mean squared error loss can be used with regression models that output continuous real numbers

  $x^{(i)}$     $f(x^{(i)})$    $y^{(i)}$ (Final Grades, percentage)
  (4, 5)        30              90
  (2, 1)        80              20
  (5, 8)        85              95
   ⋮             ⋮              ⋮

$J(W) = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f\left(x^{(i)}; W\right) \right)^2$   (Actual, Predicted)
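A minimal NumPy sketch of both losses computed on the toy tables above, so the formulas are explicit; the small clipping epsilon is an added numerical-safety detail, not part of the lecture.

import numpy as np

# binary cross entropy: probabilistic predictions, binary labels
f = np.array([0.1, 0.8, 0.6])                  # predicted f(x; W)
y = np.array([1.0, 0.0, 1.0])                  # actual labels
f = np.clip(f, 1e-7, 1 - 1e-7)                 # avoid log(0)
bce = -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))

# mean squared error: continuous predictions (final grades, percentage)
f_reg = np.array([30.0, 80.0, 85.0])           # predicted
y_reg = np.array([90.0, 20.0, 95.0])           # actual
mse = np.mean((y_reg - f_reg) ** 2)

print(bce, mse)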
We want to find the network weights that achieve the lowest loss
$W^* = \arg\min_W \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left( f\left(x^{(i)}; W\right), y^{(i)} \right) = \arg\min_W J(W)$
Remember: $W = \left\{ W^{(0)}, W^{(1)}, \cdots \right\}$

[Plot: the loss surface $J(w_0, w_1)$ over two weights $w_0$ and $w_1$]
Loss Optimization
1. Randomly pick an initial $(w_0, w_1)$
2. Compute gradient, $\frac{\partial J(W)}{\partial W}$
3. Take small step in opposite direction of gradient

Gradient Descent: repeat until convergence

[Plot: the loss surface $J(w_0, w_1)$, with successive steps descending toward a minimum]
Algorithm
1. Initialize weights randomly $\sim \mathcal{N}(0, \sigma^2)$        weights = tf.random_normal(shape, stddev=sigma)
2. Loop until convergence:
3.     Compute gradient, $\frac{\partial J(W)}{\partial W}$
4.     Update weights, $W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$        weights_new = weights.assign(weights - lr * grads)
5. Return weights
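A minimal sketch of the loop above on a toy one-dimensional loss $J(w) = (w - 3)^2$, whose gradient is known in closed form; the learning rate and step count are illustrative, not from the lecture.

import numpy as np

np.random.seed(0)
w = np.random.normal(0.0, 1.0)       # 1. initialize weight randomly ~ N(0, sigma^2)
lr = 0.1                             # learning rate (eta)

for _ in range(100):                 # 2. loop (fixed number of steps here)
    grad = 2.0 * (w - 3.0)           # 3. compute gradient dJ(w)/dw for J(w) = (w - 3)^2
    w = w - lr * grad                # 4. update weight: w <- w - eta * dJ/dw

print(w)                             # 5. return weight; converges to ~3.0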
Computing Gradients: Backpropagation

How does a small change in one weight (e.g. $w_2$) affect the final loss $J(W)$?

[Diagram: $x \xrightarrow{w_1} z_1 \xrightarrow{w_2} \hat{y} \rightarrow J(W)$]

Apply the chain rule, working backwards from the loss:

$\frac{\partial J(W)}{\partial w_2} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2}$

$\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$

Repeat this for every weight in the network using gradients from later layers
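A minimal NumPy sketch of the chain rule above for the two-weight network $x \to z_1 = w_1 x \to \hat{y} = g(w_2 z_1)$ with a squared-error loss; all numeric values are illustrative, not from the lecture.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 2.0, 1.0                               # input and target
w1, w2 = 0.5, -0.3                            # weights

# forward pass
z1 = w1 * x
y_hat = sigmoid(w2 * z1)
J = (y - y_hat) ** 2                          # loss

# backward pass: gradients from later layers are reused by earlier layers
dJ_dyhat = -2.0 * (y - y_hat)                 # dJ/dy_hat
dyhat_dw2 = y_hat * (1 - y_hat) * z1          # dy_hat/dw2
dyhat_dz1 = y_hat * (1 - y_hat) * w2          # dy_hat/dz1
dz1_dw1 = x                                   # dz1/dw1

dJ_dw2 = dJ_dyhat * dyhat_dw2                 # chain rule for w2
dJ_dw1 = dJ_dyhat * dyhat_dz1 * dz1_dw1       # chain rule for w1
print(dJ_dw1, dJ_dw2)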
Neural Networks in Practice: Optimization
Training Neural Networks is Difficult

Remember: optimization through gradient descent

$W \leftarrow W - \eta\, \frac{\partial J(W)}{\partial W}$
Setting the Learning Rate

Small learning rate converges slowly and gets stuck in false local minima

[Plots: loss $J(W)$ versus the weights, starting from an initial guess, for different learning rates]
How to deal with this?

Idea 1: Try lots of different learning rates and see what works “just right”

Idea 2: Do something smarter! Design an adaptive learning rate that “adapts” to the landscape

• Adam        tf.train.AdamOptimizer        (Kingma et al. “Adam: A Method for Stochastic Optimization.” 2014)
• RMSProp     tf.train.RMSPropOptimizer
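A minimal sketch of plugging one of these optimizers into a training step, written in TensorFlow 1.x style to match the tf.train names above; the toy loss (a single variable pushed toward 3.0) and the learning rate are illustrative.

import tensorflow as tf                         # assumes TensorFlow 1.x

w = tf.Variable(0.0)
loss = tf.square(w - 3.0)                       # toy loss J(w)

optimizer = tf.train.AdamOptimizer(learning_rate=0.01)   # or tf.train.RMSPropOptimizer(0.01)
train_op = optimizer.minimize(loss)             # computes gradients and applies updates

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        sess.run(train_op)
    print(sess.run(w))                          # approaches 3.0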
Algorithm
1. Initialize weights randomly $\sim \mathcal{N}(0, \sigma^2)$
2. Loop until convergence:
3.     Compute gradient, $\frac{\partial J(W)}{\partial W}$
4.     Update weights, $W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$
5. Return weights

Can be very computationally intensive to compute!
Stochastic Gradient Descent

Algorithm
1. Initialize weights randomly $\sim \mathcal{N}(0, \sigma^2)$
2. Loop until convergence:
3.     Pick single data point $i$
4.     Compute gradient, $\frac{\partial J_i(W)}{\partial W}$
5.     Update weights, $W \leftarrow W - \eta \frac{\partial J_i(W)}{\partial W}$
6. Return weights

Easy to compute but very noisy (stochastic)!
Stochastic Gradient Descent
Algorithm
1. Initialize weights randomly $\sim \mathcal{N}(0, \sigma^2)$
2. Loop until convergence:
3.     Pick batch of $B$ data points
4.     Compute gradient, $\frac{\partial J(W)}{\partial W} = \frac{1}{B} \sum_{k=1}^{B} \frac{\partial J_k(W)}{\partial W}$
5.     Update weights, $W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$
6. Return weights

Fast to compute and a much better estimate of the true gradient!
Mini-batches while training
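A minimal NumPy sketch of the mini-batch loop above for a linear model trained with mean squared error; the dataset, batch size B, and learning rate are illustrative, not from the lecture.

import numpy as np

np.random.seed(0)
X = np.random.randn(1000, 3)                        # toy dataset: 1000 points, 3 features
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * np.random.randn(1000)

W = np.random.normal(0.0, 0.1, size=3)              # 1. initialize weights randomly
lr, B = 0.05, 32                                    # learning rate and batch size

for _ in range(500):                                # 2. loop (fixed number of steps here)
    idx = np.random.choice(len(X), size=B, replace=False)   # 3. pick batch of B data points
    Xb, yb = X[idx], y[idx]
    grad = 2.0 / B * Xb.T @ (Xb @ W - yb)           # 4. gradient of MSE averaged over the batch
    W = W - lr * grad                               # 5. update weights

print(W)                                            # close to [2.0, -1.0, 0.5]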
Regularization

What is it? Technique that constrains our optimization problem to discourage complex models
Regularization 1: Dropout
• During training, randomly set some activations to 0

[Diagram: a network with two hidden layers, shown repeatedly with different subsets of hidden units dropped on each pass]
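A minimal sketch of adding dropout between dense layers with tf.keras, assuming the model-building style used earlier in the lecture; the layer sizes and the 0.5 dropout rate are illustrative.

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.models import Model

inputs = Input(shape=(3,))
h1 = Dense(64, activation='relu')(inputs)
h1 = Dropout(0.5)(h1)                 # randomly set 50% of activations to 0 during training
h2 = Dense(64, activation='relu')(h1)
h2 = Dropout(0.5)(h2)
outputs = Dense(2)(h2)
model = Model(inputs, outputs)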
Regularization 2: Early Stopping
• Stop training before we have a chance to overfit

[Plot: loss versus training iterations for the training and testing sets (legend: Testing, Training). Training loss keeps decreasing, while testing loss eventually begins to rise; that point marks the boundary between the under-fitting and over-fitting regimes, and is where training should stop]
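A minimal sketch of early stopping via a tf.keras callback, monitoring a held-out validation loss as a stand-in for the testing curve above; the model, data names, and patience value are illustrative, not from the lecture.

import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',              # watch the held-out (testing) loss curve
    patience=5,                      # stop after 5 epochs without improvement
    restore_best_weights=True)       # roll back to the weights from the best epoch

# assuming a compiled model and training data (x_train, y_train):
# model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])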
Core Foundation Review

[Diagrams: the perceptron (weighted sum of inputs passed through a non-linear activation) and a deep model built by stacking these units into hidden layers with outputs $\hat{y}_1, \hat{y}_2$]

Classification example: the network outputs a probability for each class, e.g. Lincoln 0.8, Washington 0.1, Jefferson 0.05, Obama 0.05.
Problems?
How can we use spatial structure in the input to inform the architecture of the network?
[Figure: a 3x3 filter applied to a patch of the input image; element-wise multiplication and summation gives a single output value (9 in the example shown)]

We slide the 3x3 filter over the input image, element-wise multiply, and add the outputs…
For a neuron $(p, q)$ in the hidden layer, connected to the input through a 4x4 filter (a matrix of weights $w_{ij}$), the computation is:
1) applying a window of weights
2) computing linear combinations
3) activating with a non-linear function

$\sum_{i=1}^{4} \sum_{j=1}^{4} w_{ij}\, x_{i+p,\, j+q} + b$
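A minimal NumPy sketch of this sliding-window computation using a 3x3 filter (matching the earlier example); the image, filter values, and ReLU activation are illustrative, not from the lecture.

import numpy as np

np.random.seed(0)
image = np.random.rand(8, 8)            # toy 8x8 grayscale image
filt = np.random.rand(3, 3)             # 3x3 filter: matrix of weights w_ij
bias = 0.1

out_h, out_w = image.shape[0] - 3 + 1, image.shape[1] - 3 + 1
feature_map = np.zeros((out_h, out_w))
for p in range(out_h):
    for q in range(out_w):
        patch = image[p:p + 3, q:q + 3]                  # window of the image at (p, q)
        z = np.sum(filt * patch) + bias                  # element-wise multiply, sum, add bias
        feature_map[p, q] = max(z, 0.0)                  # activate with a non-linearity (ReLU)

print(feature_map.shape)                # (6, 6)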
CNNs: Spatial Arrangement of Output Volume
Layer Dimensions: $h \times w \times d$, where $h$ and $w$ are spatial dimensions (height and width) and $d$ (depth) = number of filters

Stride: filter step size

Receptive Field: locations in the input image that a node is path-connected to
1) Reduced dimensionality
2) Spatial invariance
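A minimal tf.keras sketch of a convolutional layer followed by max pooling, assuming the two points above refer to pooling; the input size, filter count, kernel size, and stride are illustrative, not from the lecture.

import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D
from tensorflow.keras.models import Model

inputs = Input(shape=(32, 32, 3))                  # input volume: h x w x depth
x = Conv2D(filters=16, kernel_size=3, strides=1,   # d = 16 filters of size 3x3, stride 1
           activation='relu')(inputs)              # output volume: 30 x 30 x 16
x = MaxPooling2D(pool_size=2)(x)                   # pooling downsamples spatial dims: 15 x 15 x 16
model = Model(inputs, x)
model.summary()                                    # prints the layer dimensions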
Classification task: produce a list of object categories present in the image (1000 categories).
“Top 5 error”: rate at which the model does not output the correct label in its top 5 predictions
Other tasks include:
single-object localization, object detection from video/image, scene classification, scene parsing
[Chart: ImageNet classification top-5 error (%) by year, 2010–2015, compared against human performance]

Year / Model        Top-5 error (%)   Notes
2012                16.4
2013: ZFNet         11.7              8 layers, more filters
2014: VGG           7.3               19 layers
2014: GoogLeNet     6.7               “Inception” modules; 22 layers, 5 million parameters
2015: ResNet        3.57              152 layers
Human               5.1

[Chart: the same top-5 errors alongside the number of layers per model; the deepest networks achieve the lowest error]
MNIST: handwritten digits
ImageNet: 22K categories, 14M images
Places: natural scenes
CIFAR-10: classes include automobile, bird, cat, deer, dog, frog, horse, ship, truck
Deep Learning for Computer Vision: Impact