Lecture 2 Deep Learning Overview
CS 404/504, Fall 2021
Lecture Outline
Machine Learning Basics
[Figure: overview of machine learning. In supervised learning, labeled data and a learning algorithm produce a learned model during training; the model is then used for prediction on new data (e.g., class A vs. class B). Typical tasks: classification and regression (supervised learning), clustering (unsupervised learning).]
• Nearest Neighbor – for each test data point, assign the class label of the nearest training data point
  ▪ Adopt a distance function to find the nearest neighbor
    o Calculate the distance to each data point in the training set, and assign the class of the nearest data point (minimum distance)
  ▪ It does not require learning a set of weights
[Figure: 2D feature space with training examples from class 1 (x) and training examples from class 2 (o); a test example (+) is assigned the class of the nearest training example. Picture from: James Hays – Machine Learning Overview]
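A minimal NumPy sketch of this nearest-neighbor rule (the toy data below is made up for illustration):

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, X_test):
    """Assign to each test point the label of its nearest training point
    (Euclidean distance); no weights are learned."""
    predictions = []
    for x in X_test:
        distances = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
        predictions.append(y_train[np.argmin(distances)]) # label of the closest one
    return np.array(predictions)

# Toy example: two classes in a 2D feature space
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([1, 1, 2, 2])
X_test = np.array([[0.1, 0.0], [1.0, 0.9]])
print(nearest_neighbor_predict(X_train, y_train, X_test))  # -> [1 2]
```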
Linear Classifier
Machine Learning Basics
Non-linear Techniques
Linear vs Non-linear Techniques
• Non-linear SVM
  ▪ The original input space is mapped to a higher-dimensional feature space where the training set is linearly separable
  ▪ Define a non-linear kernel function to calculate a non-linear decision boundary in the original feature space (see the sketch below)
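As a hedged illustration (the lecture does not prescribe a library), scikit-learn's SVC with an RBF kernel computes exactly this kind of non-linear decision boundary; the ring-shaped toy data is made up:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data that is not linearly separable in the original space:
# class 0 near the origin, class 1 on a surrounding ring.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 100)
inner = rng.normal(0, 0.3, (100, 2))                    # class 0
outer = np.c_[2 * np.cos(angles), 2 * np.sin(angles)]   # class 1
X = np.vstack([inner, outer])
y = np.r_[np.zeros(100), np.ones(100)]

# The RBF kernel implicitly maps the inputs to a higher-dimensional space
# where the two classes become linearly separable.
clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(clf.score(X, y))  # close to 1.0 on this toy set
```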
• Both the binary and multi-class classification problems can be linearly or non-linearly separated
  ▪ Figure: linearly and non-linearly separated data for binary classification problem
Picture from: Fei-Fei Li, Andrej Karpathy, Justin Johnson – Understanding and Visualizing CNNs
No-Free-Lunch Theorem
Machine Learning Basics
Why is DL Useful?
Introduction to Deep Learning
• Deep learning (DL) is a machine learning subfield that uses multiple layers for learning data representations
  ▪ DL is exceptionally effective at learning patterns
• DL applies a multi-layer process for learning rich hierarchical features (i.e., data representations)
  ▪ Input image pixels → Edges → Textures → Parts → Objects
Representational Power
Introduction to Deep Learning
[Figure: a neural network for handwritten digit recognition. The input is a 16 x 16 = 256-dimensional vector (ink → 1, no ink → 0); each output dimension y1, …, y10 represents the confidence of a digit, e.g., y2 = 0.7 means the image is "2". Slide credit: Hung-yi Lee – Deep Learning Tutorial]
[Figure: the network acts as a function ("machine") mapping an input image to the outputs y1, …, y10, e.g., recognizing the digit "2".]
[Figure: a single neuron computes a weighted sum of its inputs plus a bias and passes it through an activation function; the network parameters are the weights and the biases, together with the choice of activation functions.]
Matrix Operation
Introduction to Neural Networks
• Matrix operations are helpful when working with multidimensional inputs and outputs
  ▪ Example with a two-neuron layer, $a = \sigma(Wx + b)$:
  $\sigma\left(\begin{bmatrix} 1 & -2 \\ -1 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ -1 \end{bmatrix} + \begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) = \sigma\left(\begin{bmatrix} 4 \\ -2 \end{bmatrix}\right) = \begin{bmatrix} 0.98 \\ 0.12 \end{bmatrix}$
Matrix Operation
Introduction to Neural Networks
• Layer by layer, the network computes its activations from the input vector $x$:
  $a^{1} = \sigma(W^{1}x + b^{1}), \quad a^{2} = \sigma(W^{2}a^{1} + b^{2}), \quad \ldots, \quad y = \sigma(W^{L}a^{L-1} + b^{L})$
• The whole network is therefore a composition of matrix operations:
  $y = f(x) = \sigma\left(W^{L} \cdots \sigma\left(W^{2}\,\sigma\left(W^{1}x + b^{1}\right) + b^{2}\right) \cdots + b^{L}\right)$
  (a minimal sketch of this forward pass follows below)
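The composition of matrix operations above can be written as a short loop over the layers; a minimal NumPy sketch using the sigmoid activation and the two-neuron example from the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """y = sigma(W_L ... sigma(W_2 sigma(W_1 x + b_1) + b_2) ... + b_L)"""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # one layer: matrix-vector product, add bias, activate
    return a

# The two-neuron example: W = [[1, -2], [-1, 1]], b = [1, 0], x = [1, -1]
W1 = np.array([[1.0, -2.0], [-1.0, 1.0]])
b1 = np.array([1.0, 0.0])
x = np.array([1.0, -1.0])
print(forward(x, [W1], [b1]))   # approximately [0.98, 0.12]
```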
Softmax Layer
Introduction to Neural Networks
• The softmax output layer exponentiates each score and normalizes the results, so the outputs lie between 0 and 1 and sum to 1
[Figure: example values: an ordinary (sigmoid) output layer maps the scores 1 and −3 to 0.73 and 0.05, while the softmax layer exponentiates them (e¹ = 2.7, e⁻³ = 0.05) and normalizes (0.12 and ≈ 0).]
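A minimal NumPy sketch of the softmax computation (the scores below are illustrative, not the slide's exact example):

```python
import numpy as np

def softmax(z):
    """Exponentiate each score and normalize so the outputs lie in (0, 1) and sum to 1."""
    e = np.exp(z - np.max(z))        # subtract the max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, -3.0])       # illustrative scores
print(np.round(softmax(z), 3))       # [0.727, 0.268, 0.005], a probability-like vector
```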
Activation Functions
Introduction to Neural Networks
Activation: Sigmoid
Introduction to Neural Networks
• Sigmoid function σ: takes a real-valued number and "squashes" it into the range between 0 and 1
  ▪ The output can be interpreted as the firing rate of a biological neuron
    o Not firing = 0; fully firing = 1
  ▪ When the neuron's activations are close to 0 or 1, sigmoid neurons saturate
    o Gradients in these regions are almost zero (almost no signal will flow)
  ▪ Sigmoid activations are less common in modern NNs
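A minimal NumPy sketch of the sigmoid, together with the tanh and ReLU activations introduced on the following slides:

```python
import numpy as np

def sigmoid(z):
    """Squashes a real value into (0, 1); saturates (near-zero gradient) for large |z|."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Squashes a real value into (-1, 1); zero-centered, but also saturates."""
    return np.tanh(z)

def relu(z):
    """max(0, z): no saturation for positive inputs, cheap to compute."""
    return np.maximum(0.0, z)

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```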
Activation: Tanh
Introduction to Neural Networks
Activation: ReLU
Introduction to Neural Networks
Activation: Linear
Introduction to Neural Networks
• Linear function means that the output signal is proportional to the input signal to the neuron
  ▪ If the value of the constant c is 1, it is also called the identity activation function
  ▪ This activation type is used in regression problems
    o E.g., the last layer can have a linear activation function, in order to output a real number (and not a class membership)
Training NNs
Training Neural Networks
[Figure: training set-up for digit recognition: a network with 16 x 16 = 256 inputs and a softmax output layer; the outputs y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), …, y10 = 0.2 ("is 0") are compared against the desired digit. Slide credit: Hung-yi Lee – Deep Learning Tutorial]
[Figure: the cost compares the network outputs (e.g., y1 = 0.2, y2 = 0.3, …, y10 = 0.5) against the true label "1", encoded as the target vector (1, 0, …, 0).]
[Figure: the total loss is the sum of the costs over all N training examples x1, x2, x3, …, xN and the corresponding network outputs y1, y2, y3, …, yN. Slide credit: Hung-yi Lee – Deep Learning Tutorial]
Loss Functions
Training Neural Networks
• Classification tasks (see the loss sketch below)
  [Figure: table of training examples and their target class labels.]
• Regression tasks
  [Figure: table of training examples and their real-valued targets; Output Layer: linear (identity) or sigmoid activation.]
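As a hedged sketch of the usual choices, cross-entropy for classification (with softmax outputs) and mean squared error for regression, implemented in NumPy with made-up example numbers:

```python
import numpy as np

def cross_entropy(y_true_onehot, y_pred_probs, eps=1e-12):
    """Classification: average of -sum_k t_k * log(p_k) over the training examples."""
    p = np.clip(y_pred_probs, eps, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(p), axis=1))

def mean_squared_error(y_true, y_pred):
    """Regression: average squared difference between targets and predictions."""
    return np.mean((y_true - y_pred) ** 2)

# Tiny examples with made-up numbers
t = np.array([[0, 1, 0], [1, 0, 0]])              # one-hot class labels
p = np.array([[0.1, 0.8, 0.1], [0.7, 0.2, 0.1]])  # predicted probabilities
print(cross_entropy(t, p))                         # (-ln 0.8 - ln 0.7) / 2, about 0.29
print(mean_squared_error(np.array([1.0, 2.0]), np.array([1.5, 1.0])))  # 0.625
```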
Training NNs
Training Neural Networks
• Gradient descent repeats the update loop: compute the gradient of the loss, update the parameters in the direction opposite to the gradient, then go to step 2 and repeat
• The gradient descent algorithm stops when a local minimum of the loss surface is reached
  ▪ GD does not guarantee reaching a global minimum
  ▪ However, empirical evidence suggests that GD works well for NNs
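A minimal sketch of the gradient-descent loop on a one-dimensional quadratic loss (the loss, learning rate, and starting point are made up for illustration):

```python
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Repeat: compute the gradient at the current parameters, step against it."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)   # step in the negative gradient direction
    return theta

# Loss L(theta) = (theta - 3)^2, so grad L = 2 * (theta - 3); minimum at theta = 3
print(gradient_descent(lambda t: 2 * (t - 3), theta0=0.0))  # converges near 3.0
```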
Backpropagation
Training Neural Networks
• Besides the local minima problem, the GD algorithm can be very slow at plateaus, and it can get stuck at saddle points
• Gradient descent with momentum uses the momentum of the gradient for parameter optimization
[Figure: the real movement of the parameters combines the negative of the gradient with the momentum term, which keeps the parameters moving even where the gradient = 0. Slide credit: Hung-yi Lee – Deep Learning Tutorial]
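A minimal sketch of the momentum update on the same one-dimensional quadratic loss: the velocity term is a decayed sum of past gradients, so the parameters keep moving across plateaus and points where the gradient = 0 (the value beta = 0.9 is a common default, assumed here):

```python
def gd_with_momentum(grad, theta0, lr=0.1, beta=0.9, steps=200):
    """v accumulates past gradients; theta moves along v instead of the raw gradient."""
    theta, v = theta0, 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(theta)   # momentum: decay previous velocity, add new step
        theta = theta + v
    return theta

# Same 1D quadratic as before: minimum at theta = 3
print(gd_with_momentum(lambda t: 2 * (t - 3), theta0=0.0))  # converges near 3.0
```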
[Figure: comparison of GD with momentum and GD with Nesterov momentum.]
Adam
Training Neural Networks
Learning Rate
Training Neural Networks
• Learning rate
  ▪ The gradient tells us the direction in which the loss has the steepest rate of increase, but it does not tell us how far along the opposite direction we should step
  ▪ Choosing the learning rate (also called the step size) is one of the most important hyper-parameter settings for NN training
[Figure: loss curves when the learning rate is too small vs. too large.]
Learning Rate
Training Neural Networks
• Learning rate scheduling is applied to change the values of the learning rate during the training
  ▪ Annealing is reducing the learning rate over time (a.k.a. learning rate decay)
    o Approach 1: reduce the learning rate by some factor every few epochs
      ● Typical values: reduce the learning rate by a half every 5 epochs, or divide by 10 every 20 epochs
    o Approach 2: exponential or cosine decay gradually reduce the learning rate over time
    o Approach 3: reduce the learning rate by a constant (e.g., by half) whenever the validation loss stops improving
      ● In TensorFlow: tf.keras.callbacks.ReduceLROnPlateau()
        ○ Monitor: validation loss, factor: 0.1 (i.e., divide by 10), patience: 10 (how many epochs to wait before applying it), minimum learning rate: 1e-6 (when to stop)
  ▪ Warmup is gradually increasing the learning rate initially, and afterward letting it cool down until the end of the training (see the callback sketch below)
[Figure: exponential decay, cosine decay, and warmup schedules.]
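A hedged TensorFlow/Keras sketch of Approach 3 using tf.keras.callbacks.ReduceLROnPlateau with the values quoted above; the model and data are placeholders, not part of the lecture:

```python
import tensorflow as tf

# Placeholder model and data, just to make the callback usage concrete.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Divide the learning rate by 10 when the validation loss has not improved
# for 10 epochs, and never go below 1e-6.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.1, patience=10, min_lr=1e-6)

x = tf.random.normal((256, 16))
y = tf.random.normal((256, 1))
model.fit(x, y, validation_split=0.25, epochs=5, callbacks=[reduce_lr], verbose=0)
```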
• In some cases, during training, the gradients can become either very small (vanishing gradients) or very large (exploding gradients)
  ▪ They result in very small or very large updates of the parameters
  ▪ Solutions: change the learning rate, ReLU activations, regularization, LSTM units in RNNs
Generalization
Generalization
• Underfitting
  ▪ The model is too "simple" to represent all the relevant class characteristics
  ▪ E.g., model with too few parameters
  ▪ Produces high error on the training set and high error on the validation set
• Overfitting
  ▪ The model is too "complex" and fits irrelevant characteristics (noise) in the data
  ▪ E.g., model with too many parameters
  ▪ Produces low error on the training set and high error on the validation set
Overfitting
Generalization
• Overfitting – a model with high capacity fits the noise in the data instead of the underlying relationship
Regularization: Dropout
Regularization
• Dropout (see the sketch below)
  ▪ Randomly drop units (along with their connections) during training
  ▪ Each unit is dropped with a fixed rate p (the dropout rate), independent of the other units
  ▪ The hyper-parameter p needs to be chosen (tuned)
    o Often, between 20% and 50% of the units are dropped
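A hedged Keras sketch of dropout between fully-connected layers; the layer sizes are placeholders, and the Dropout rate argument is the fraction of units dropped during training (0.5 and 0.2 here, i.e., within the 20% to 50% range mentioned above):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(256,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # drop 50% of these units at each training step
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.2),   # drop 20% here; dropout is disabled at inference time
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```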
Regularization: Early Stopping
Regularization
• Early-stopping
  ▪ During model training, use a validation set
    o E.g., validation/train ratio of about 25% to 75%
  ▪ Stop when the validation accuracy (or loss) has not improved after n epochs (see the callback sketch below)
    o The parameter n is called patience
[Figure: training and validation curves; training is stopped when the validation performance stops improving.]
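A hedged Keras sketch of early stopping with a 25% validation split; the model, data, and the patience value n = 5 are placeholders:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Stop when the validation loss has not improved for `patience` epochs,
# and roll back to the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

x = tf.random.normal((400, 16))
y = tf.random.normal((400, 1))
model.fit(x, y, validation_split=0.25, epochs=100, callbacks=[early_stop], verbose=0)
```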
Batch Normalization
Regularization
Hyper-parameter Tuning
Hyper-parameter Tuning
• Grid search
  ▪ Check all values in a range with a step value
• Random search
  ▪ Randomly sample values for the parameters (see the sketch below)
  ▪ Often preferred to grid search
• Bayesian hyper-parameter optimization
  ▪ An active area of research
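A minimal random-search sketch over two hyper-parameters (learning rate and dropout rate); the search ranges and the train_and_validate function are hypothetical stand-ins for an actual training pipeline:

```python
import random

def train_and_validate(learning_rate, dropout_rate):
    """Placeholder: train a model with these hyper-parameters, return validation accuracy."""
    # In practice this would build, train, and evaluate a network.
    return random.random()

random.seed(0)
best = None
for _ in range(20):                       # 20 random trials
    lr = 10 ** random.uniform(-5, -1)     # sample the learning rate on a log scale
    p = random.uniform(0.2, 0.5)          # sample the dropout rate
    acc = train_and_validate(lr, p)
    if best is None or acc > best[0]:
        best = (acc, lr, p)

print("best validation accuracy %.3f with lr=%.2g, dropout=%.2f" % best)
```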
k-Fold Cross-Validation
k-Fold Cross-Validation
Ensemble Learning
Ensemble Learning
[Figure: a shallow (wide) NN vs. a deep NN applied to the same input. Slide credit: Hung-yi Lee – Deep Learning Tutorial]
Convolutional Neural Networks
• Convolutional neural networks (CNNs) were primarily designed for image data
• CNNs use a convolutional operator for extracting data features
  ▪ Allows parameter sharing
  ▪ Efficient to train
  ▪ Have fewer parameters than NNs with fully-connected layers
• CNNs are robust to spatial translations of objects in images
• A convolutional filter slides (i.e., convolves) across the image
[Figure: a 3x3 convolutional filter sliding over the input matrix.]
• When the convolutional filters are scanned over the image, they capture useful features
  ▪ E.g., edge detection by convolution with the filter below (a sketch of the sliding operation follows):

    0  1  0
    1 -4  1
    0  1  0
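A minimal NumPy sketch of sliding this filter over an image; since the filter is symmetric, the correlation computed here equals the convolution, and the tiny image (dark left half, bright right half) is made up to show the edge response:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image and sum the element-wise products at each position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# The edge-detection filter from the slide; it responds where the intensity changes.
edge_filter = np.array([[0,  1, 0],
                        [1, -4, 1],
                        [0,  1, 0]])

# A tiny image with a vertical edge in the middle.
image = np.array([[0, 0, 0, 9, 9, 9]] * 6, dtype=float)
print(conv2d_valid(image, edge_filter))   # non-zero only around the edge, zero in flat regions
```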
• In CNNs, hidden units in a layer are only connected to a small region of the layer before it (called a local receptive field)
  ▪ The depth of each feature map corresponds to the number of convolutional filters used at each layer
[Figure: filters with weights w1, …, w8 (Filter 1, Filter 2) connect local regions of the Input Image to the Layer 1 and Layer 2 feature maps.]
[Figure: an example CNN architecture alternating convolutional ("Conv") layers and max-pooling ("Max Pool") layers, classifying input images into scene categories such as Bedroom, Kitchen, Bathroom, and Outdoor.]
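A hedged Keras sketch in the spirit of the figure: convolution and max-pooling layers followed by a softmax over four scene classes; the input size, filter counts, and layer sizes are all assumptions:

```python
import tensorflow as tf

num_classes = 4  # e.g. Bedroom, Kitchen, Bathroom, Outdoor

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),                    # assumed RGB input size
    tf.keras.layers.Conv2D(16, 3, activation="relu"),     # convolutional layer, 3x3 filters
    tf.keras.layers.MaxPooling2D(2),                      # max pooling halves the spatial size
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```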
Residual CNNs
Convolutional Neural Networks
• Recurrent NNs (RNNs) are used for modeling sequential data and data with varying length of inputs and outputs
  ▪ Videos, text, speech, DNA sequences, human skeletal data
• RNNs introduce recurrent connections between the neurons (see the sketch below)
  ▪ This allows processing sequential data one element at a time by selectively passing information across a sequence
  ▪ Memory of the previous inputs is stored in the model's internal state and affects the model predictions
  ▪ Can capture correlations in sequential data
• RNNs use backpropagation-through-time for training
• RNNs are more sensitive to the vanishing gradient problem than CNNs
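A hedged Keras sketch of a small recurrent model for sequence classification; the vocabulary size, layer sizes, and the binary output are assumptions (a SimpleRNN layer is used here, LSTM layers are discussed next):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),           # variable-length token sequences
    tf.keras.layers.Embedding(input_dim=10000, output_dim=32),
    tf.keras.layers.SimpleRNN(64),                           # hidden state carries memory of past inputs
    tf.keras.layers.Dense(1, activation="sigmoid"),          # e.g. one binary prediction per sequence
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```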
[Figure: an RNN unrolled over time: the inputs x1, x2, x3 update the hidden states h0, h1, h2, h3, which produce the outputs.]
• RNNs can have one or many inputs and one or many outputs
  ▪ Image captioning: an image input, a sequence output, e.g., "A person riding a motorbike on dirt road"
  ▪ Machine translation: a sequence input, a sequence output, e.g., "Happy Diwali" → "शुभ दीपावली"
Bidirectional RNNs
Recurrent Neural Networks
LSTM Networks
Recurrent Neural Networks
• LSTM cell
  ▪ Input gate, output gate, forget gate, memory cell
  ▪ LSTM can learn long-term correlations within data sequences (the standard cell updates are given below)
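For reference, one standard formulation of the LSTM cell updates, where $\sigma$ is the sigmoid, $\odot$ denotes element-wise multiplication, and $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input:

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) && \text{(candidate memory)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(memory cell update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state / output)}
\end{aligned}
$$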
References