Lecture 2 Deep Learning Overview
CS 404/504, Fall 2021
Lecture Outline
Machine Learning Basics
[Figure: overview of machine learning. In supervised learning, labeled data and a learning algorithm produce a learned model during training; the model is then used for prediction on new data (e.g., class A vs. class B). Typical tasks: classification and regression (supervised learning), clustering (unsupervised learning).]
• Nearest Neighbor – for each test data point, assign the class label of the nearest training data point
  ▪ Adopt a distance function to find the nearest neighbor
    o Calculate the distance to each data point in the training set, and assign the class of the nearest data point (minimum distance)
  ▪ It does not require learning a set of weights
[Figure: 2D feature space with training examples from class 1 (x) and training examples from class 2 (o); a test example (+) is assigned the class of the nearest training example. Picture from: James Hays – Machine Learning Overview]
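A minimal NumPy sketch of this nearest-neighbor rule (the toy data below is made up for illustration):

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, X_test):
    """Assign to each test point the label of its nearest training point
    (Euclidean distance); no weights are learned."""
    predictions = []
    for x in X_test:
        distances = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
        predictions.append(y_train[np.argmin(distances)]) # label of the closest one
    return np.array(predictions)

# Toy example: two classes in a 2D feature space
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([1, 1, 2, 2])
X_test = np.array([[0.1, 0.0], [1.0, 0.9]])
print(nearest_neighbor_predict(X_train, y_train, X_test))  # -> [1 2]
```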
Linear Classifier
Machine Learning Basics
Non-linear Techniques
Linear vs Non-linear Techniques
• Non-linear SVM
  ▪ The original input space is mapped to a higher-dimensional feature space where the training set is linearly separable
  ▪ Define a non-linear kernel function to calculate a non-linear decision boundary in the original feature space (see the sketch below)
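As a hedged illustration (the lecture does not prescribe a library), scikit-learn's SVC with an RBF kernel computes exactly this kind of non-linear decision boundary; the ring-shaped toy data is made up:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data that is not linearly separable in the original space:
# class 0 near the origin, class 1 on a surrounding ring.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 100)
inner = rng.normal(0, 0.3, (100, 2))                    # class 0
outer = np.c_[2 * np.cos(angles), 2 * np.sin(angles)]   # class 1
X = np.vstack([inner, outer])
y = np.r_[np.zeros(100), np.ones(100)]

# The RBF kernel implicitly maps the inputs to a higher-dimensional space
# where the two classes become linearly separable.
clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(clf.score(X, y))  # close to 1.0 on this toy set
```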
• Both the binary and multi-class classification problems can be linearly or non-linearly separated
  ▪ Figure: linearly and non-linearly separated data for binary classification problem
Picture from: Fei-Fei Li, Andrej Karpathy, Justin Johnson – Understanding and Visualizing CNNs
No-Free-Lunch Theorem
Machine Learning Basics
Why is DL Useful?
Introduction to Deep Learning
• Deep learning (DL) is a machine learning subfield that uses multiple layers for learning data representations
  ▪ DL is exceptionally effective at learning patterns
• DL applies a multi-layer process for learning rich hierarchical features (i.e., data representations)
  ▪ Input image pixels → Edges → Textures → Parts → Objects
Representational Power
Introduction to Deep Learning
[Figure: a neural network for handwritten digit recognition. The input is a 16 x 16 = 256-dimensional vector (ink → 1, no ink → 0); each output dimension y1, …, y10 represents the confidence of a digit, e.g., y2 = 0.7 means the image is "2". Slide credit: Hung-yi Lee – Deep Learning Tutorial]
[Figure: the network acts as a function ("machine") mapping an input image to the outputs y1, …, y10, e.g., recognizing the digit "2".]
[Figure: a single neuron computes a weighted sum of its inputs plus a bias and passes it through an activation function; the network parameters are the weights and the biases, together with the choice of activation functions.]
Matrix Operation
Introduction to Neural Networks
• Matrix operations are helpful when working with multidimensional inputs and outputs
  ▪ Example with a two-neuron layer, $a = \sigma(Wx + b)$:
  $\sigma\left(\begin{bmatrix} 1 & -2 \\ -1 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ -1 \end{bmatrix} + \begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) = \sigma\left(\begin{bmatrix} 4 \\ -2 \end{bmatrix}\right) = \begin{bmatrix} 0.98 \\ 0.12 \end{bmatrix}$
Matrix Operation
Introduction to Neural Networks
• Layer by layer, the network computes its activations from the input vector $x$:
  $a^{1} = \sigma(W^{1}x + b^{1}), \quad a^{2} = \sigma(W^{2}a^{1} + b^{2}), \quad \ldots, \quad y = \sigma(W^{L}a^{L-1} + b^{L})$
• The whole network is therefore a composition of matrix operations:
  $y = f(x) = \sigma\left(W^{L} \cdots \sigma\left(W^{2}\,\sigma\left(W^{1}x + b^{1}\right) + b^{2}\right) \cdots + b^{L}\right)$
  (a minimal sketch of this forward pass follows below)
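The composition of matrix operations above can be written as a short loop over the layers; a minimal NumPy sketch using the sigmoid activation and the two-neuron example from the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """y = sigma(W_L ... sigma(W_2 sigma(W_1 x + b_1) + b_2) ... + b_L)"""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # one layer: matrix-vector product, add bias, activate
    return a

# The two-neuron example: W = [[1, -2], [-1, 1]], b = [1, 0], x = [1, -1]
W1 = np.array([[1.0, -2.0], [-1.0, 1.0]])
b1 = np.array([1.0, 0.0])
x = np.array([1.0, -1.0])
print(forward(x, [W1], [b1]))   # approximately [0.98, 0.12]
```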
Softmax Layer
Introduction to Neural Networks
• The softmax output layer exponentiates each score and normalizes the results, so the outputs lie between 0 and 1 and sum to 1
[Figure: example values: an ordinary (sigmoid) output layer maps the scores 1 and −3 to 0.73 and 0.05, while the softmax layer exponentiates them (e¹ = 2.7, e⁻³ = 0.05) and normalizes (0.12 and ≈ 0).]
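A minimal NumPy sketch of the softmax computation (the scores below are illustrative, not the slide's exact example):

```python
import numpy as np

def softmax(z):
    """Exponentiate each score and normalize so the outputs lie in (0, 1) and sum to 1."""
    e = np.exp(z - np.max(z))        # subtract the max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, -3.0])       # illustrative scores
print(np.round(softmax(z), 3))       # [0.727, 0.268, 0.005], a probability-like vector
```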
Activation Functions
Introduction to Neural Networks
Activation: Sigmoid
Introduction to Neural Networks
• Sigmoid function σ: takes a real-valued number and "squashes" it into the range between 0 and 1
  ▪ The output can be interpreted as the firing rate of a biological neuron
    o Not firing = 0; fully firing = 1
  ▪ When the neuron's activations are close to 0 or 1, sigmoid neurons saturate
    o Gradients in these regions are almost zero (almost no signal will flow)
  ▪ Sigmoid activations are less common in modern NNs
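A minimal NumPy sketch of the sigmoid, together with the tanh and ReLU activations introduced on the following slides:

```python
import numpy as np

def sigmoid(z):
    """Squashes a real value into (0, 1); saturates (near-zero gradient) for large |z|."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Squashes a real value into (-1, 1); zero-centered, but also saturates."""
    return np.tanh(z)

def relu(z):
    """max(0, z): no saturation for positive inputs, cheap to compute."""
    return np.maximum(0.0, z)

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```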
Activation: Tanh
Introduction to Neural Networks
Activation: ReLU
Introduction to Neural Networks
Activation: Linear
Introduction to Neural Networks
• Linear function means that the output signal is proportional to the input signal to the neuron
  ▪ If the value of the constant c is 1, it is also called the identity activation function
  ▪ This activation type is used in regression problems
    o E.g., the last layer can have a linear activation function, in order to output a real number (and not a class membership)
Training NNs
Training Neural Networks
[Figure: training set-up for digit recognition: a network with 16 x 16 = 256 inputs and a softmax output layer; the outputs y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), …, y10 = 0.2 ("is 0") are compared against the desired digit. Slide credit: Hung-yi Lee – Deep Learning Tutorial]
[Figure: the cost compares the network outputs (e.g., y1 = 0.2, y2 = 0.3, …, y10 = 0.5) against the true label "1", encoded as the target vector (1, 0, …, 0).]
[Figure: the total loss is the sum of the costs over all N training examples x1, x2, x3, …, xN and the corresponding network outputs y1, y2, y3, …, yN. Slide credit: Hung-yi Lee – Deep Learning Tutorial]
Loss Functions
Training Neural Networks
• Classification tasks (see the loss sketch below)
  [Figure: table of training examples and their target class labels.]
• Regression tasks
  [Figure: table of training examples and their real-valued targets; Output Layer: linear (identity) or sigmoid activation.]
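As a hedged sketch of the usual choices, cross-entropy for classification (with softmax outputs) and mean squared error for regression, implemented in NumPy with made-up example numbers:

```python
import numpy as np

def cross_entropy(y_true_onehot, y_pred_probs, eps=1e-12):
    """Classification: average of -sum_k t_k * log(p_k) over the training examples."""
    p = np.clip(y_pred_probs, eps, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(p), axis=1))

def mean_squared_error(y_true, y_pred):
    """Regression: average squared difference between targets and predictions."""
    return np.mean((y_true - y_pred) ** 2)

# Tiny examples with made-up numbers
t = np.array([[0, 1, 0], [1, 0, 0]])              # one-hot class labels
p = np.array([[0.1, 0.8, 0.1], [0.7, 0.2, 0.1]])  # predicted probabilities
print(cross_entropy(t, p))                         # (-ln 0.8 - ln 0.7) / 2, about 0.29
print(mean_squared_error(np.array([1.0, 2.0]), np.array([1.5, 1.0])))  # 0.625
```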
Training NNs
Training Neural Networks
• Gradient descent repeats the update loop: compute the gradient of the loss, update the parameters in the direction opposite to the gradient, then go to step 2 and repeat
• The gradient descent algorithm stops when a local minimum of the loss surface is reached
  ▪ GD does not guarantee reaching a global minimum
  ▪ However, empirical evidence suggests that GD works well for NNs
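A minimal sketch of the gradient-descent loop on a one-dimensional quadratic loss (the loss, learning rate, and starting point are made up for illustration):

```python
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Repeat: compute the gradient at the current parameters, step against it."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)   # step in the negative gradient direction
    return theta

# Loss L(theta) = (theta - 3)^2, so grad L = 2 * (theta - 3); minimum at theta = 3
print(gradient_descent(lambda t: 2 * (t - 3), theta0=0.0))  # converges near 3.0
```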
Backpropagation
Training Neural Networks
• Besides the local minima problem, the GD algorithm can be very slow at plateaus, and it can get stuck at saddle points
• Gradient descent with momentum uses the momentum of the gradient for parameter optimization
[Figure: the real movement of the parameters combines the negative of the gradient with the momentum term, which keeps the parameters moving even where the gradient = 0. Slide credit: Hung-yi Lee – Deep Learning Tutorial]
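A minimal sketch of the momentum update on the same one-dimensional quadratic loss: the velocity term is a decayed sum of past gradients, so the parameters keep moving across plateaus and points where the gradient = 0 (the value beta = 0.9 is a common default, assumed here):

```python
def gd_with_momentum(grad, theta0, lr=0.1, beta=0.9, steps=200):
    """v accumulates past gradients; theta moves along v instead of the raw gradient."""
    theta, v = theta0, 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(theta)   # momentum: decay previous velocity, add new step
        theta = theta + v
    return theta

# Same 1D quadratic as before: minimum at theta = 3
print(gd_with_momentum(lambda t: 2 * (t - 3), theta0=0.0))  # converges near 3.0
```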
[Figure: comparison of GD with momentum and GD with Nesterov momentum.]
Adam
Training Neural Networks
Learning Rate
Training Neural Networks
• Learning rate
  ▪ The gradient tells us the direction in which the loss has the steepest rate of increase, but it does not tell us how far along the opposite direction we should step
  ▪ Choosing the learning rate (also called the step size) is one of the most important hyper-parameter settings for NN training
[Figure: loss curves when the learning rate is too small vs. too large.]
Learning Rate
Training Neural Networks
• Learning rate scheduling is applied to change the values of the learning rate during the training
  ▪ Annealing is reducing the learning rate over time (a.k.a. learning rate decay)
    o Approach 1: reduce the learning rate by some factor every few epochs
      ● Typical values: reduce the learning rate by a half every 5 epochs, or divide by 10 every 20 epochs
    o Approach 2: exponential or cosine decay gradually reduce the learning rate over time
    o Approach 3: reduce the learning rate by a constant (e.g., by half) whenever the validation loss stops improving
      ● In TensorFlow: tf.keras.callbacks.ReduceLROnPlateau()
        ○ Monitor: validation loss, factor: 0.1 (i.e., divide by 10), patience: 10 (how many epochs to wait before applying it), minimum learning rate: 1e-6 (when to stop)
  ▪ Warmup is gradually increasing the learning rate initially, and afterward letting it cool down until the end of the training (see the callback sketch below)
[Figure: exponential decay, cosine decay, and warmup schedules.]
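A hedged TensorFlow/Keras sketch of Approach 3 using tf.keras.callbacks.ReduceLROnPlateau with the values quoted above; the model and data are placeholders, not part of the lecture:

```python
import tensorflow as tf

# Placeholder model and data, just to make the callback usage concrete.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Divide the learning rate by 10 when the validation loss has not improved
# for 10 epochs, and never go below 1e-6.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.1, patience=10, min_lr=1e-6)

x = tf.random.normal((256, 16))
y = tf.random.normal((256, 1))
model.fit(x, y, validation_split=0.25, epochs=5, callbacks=[reduce_lr], verbose=0)
```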
• In some cases, during training, the gradients can become either very small (vanishing gradients) or very large (exploding gradients)
  ▪ They result in very small or very large updates of the parameters
  ▪ Solutions: change the learning rate, ReLU activations, regularization, LSTM units in RNNs
Generalization
Generalization
• Underfitting
  ▪ The model is too "simple" to represent all the relevant class characteristics
  ▪ E.g., model with too few parameters
  ▪ Produces high error on the training set and high error on the validation set
• Overfitting
  ▪ The model is too "complex" and fits irrelevant characteristics (noise) in the data
  ▪ E.g., model with too many parameters
  ▪ Produces low error on the training set and high error on the validation set
Overfitting
Generalization
• Overfitting – a model with high capacity fits the noise in the data instead of the underlying relationship
Regularization: Dropout
Regularization
• Dropout (see the sketch below)
  ▪ Randomly drop units (along with their connections) during training
  ▪ Each unit is dropped with a fixed rate p (the dropout rate), independent of the other units
  ▪ The hyper-parameter p needs to be chosen (tuned)
    o Often, between 20% and 50% of the units are dropped
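A hedged Keras sketch of dropout between fully-connected layers; the layer sizes are placeholders, and the Dropout rate argument is the fraction of units dropped during training (0.5 and 0.2 here, i.e., within the 20% to 50% range mentioned above):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(256,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # drop 50% of these units at each training step
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.2),   # drop 20% here; dropout is disabled at inference time
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```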
Regularization: Early Stopping
Regularization
• Early-stopping
  ▪ During model training, use a validation set
    o E.g., validation/train ratio of about 25% to 75%
  ▪ Stop when the validation accuracy (or loss) has not improved after n epochs (see the callback sketch below)
    o The parameter n is called patience
[Figure: training and validation curves; training is stopped when the validation performance stops improving.]
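A hedged Keras sketch of early stopping with a 25% validation split; the model, data, and the patience value n = 5 are placeholders:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Stop when the validation loss has not improved for `patience` epochs,
# and roll back to the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

x = tf.random.normal((400, 16))
y = tf.random.normal((400, 1))
model.fit(x, y, validation_split=0.25, epochs=100, callbacks=[early_stop], verbose=0)
```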
Batch Normalization
Regularization
Hyper-parameter Tuning
Hyper-parameter Tuning
• Grid search
  ▪ Check all values in a range with a step value
• Random search
  ▪ Randomly sample values for the parameters (see the sketch below)
  ▪ Often preferred to grid search
• Bayesian hyper-parameter optimization
  ▪ An active area of research
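A minimal random-search sketch over two hyper-parameters (learning rate and dropout rate); the search ranges and the train_and_validate function are hypothetical stand-ins for an actual training pipeline:

```python
import random

def train_and_validate(learning_rate, dropout_rate):
    """Placeholder: train a model with these hyper-parameters, return validation accuracy."""
    # In practice this would build, train, and evaluate a network.
    return random.random()

random.seed(0)
best = None
for _ in range(20):                       # 20 random trials
    lr = 10 ** random.uniform(-5, -1)     # sample the learning rate on a log scale
    p = random.uniform(0.2, 0.5)          # sample the dropout rate
    acc = train_and_validate(lr, p)
    if best is None or acc > best[0]:
        best = (acc, lr, p)

print("best validation accuracy %.3f with lr=%.2g, dropout=%.2f" % best)
```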
k-Fold Cross-Validation
k-Fold Cross-Validation
Ensemble Learning
Ensemble Learning
[Figure: a shallow (wide) NN vs. a deep NN applied to the same input. Slide credit: Hung-yi Lee – Deep Learning Tutorial]
Convolutional Neural Networks
• Convolutional neural networks (CNNs) were primarily designed for image data
• CNNs use a convolutional operator for extracting data features
  ▪ Allows parameter sharing
  ▪ Efficient to train
  ▪ Have fewer parameters than NNs with fully-connected layers
• CNNs are robust to spatial translations of objects in images
• A convolutional filter slides (i.e., convolves) across the image
[Figure: a 3x3 convolutional filter sliding over the input matrix.]
• When the convolutional filters are scanned over the image, they capture useful features
  ▪ E.g., edge detection by convolution with the filter below (a sketch of the sliding operation follows):

    0  1  0
    1 -4  1
    0  1  0
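A minimal NumPy sketch of sliding this filter over an image; since the filter is symmetric, the correlation computed here equals the convolution, and the tiny image (dark left half, bright right half) is made up to show the edge response:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image and sum the element-wise products at each position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# The edge-detection filter from the slide; it responds where the intensity changes.
edge_filter = np.array([[0,  1, 0],
                        [1, -4, 1],
                        [0,  1, 0]])

# A tiny image with a vertical edge in the middle.
image = np.array([[0, 0, 0, 9, 9, 9]] * 6, dtype=float)
print(conv2d_valid(image, edge_filter))   # non-zero only around the edge, zero in flat regions
```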
• In CNNs, hidden units in a layer are only connected to a small region of the layer before it (called a local receptive field)
  ▪ The depth of each feature map corresponds to the number of convolutional filters used at each layer
[Figure: filters with weights w1, …, w8 (Filter 1, Filter 2) connect local regions of the Input Image to the Layer 1 and Layer 2 feature maps.]
[Figure: an example CNN architecture alternating convolutional ("Conv") layers and max-pooling ("Max Pool") layers, classifying input images into scene categories such as Bedroom, Kitchen, Bathroom, and Outdoor.]
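A hedged Keras sketch in the spirit of the figure: convolution and max-pooling layers followed by a softmax over four scene classes; the input size, filter counts, and layer sizes are all assumptions:

```python
import tensorflow as tf

num_classes = 4  # e.g. Bedroom, Kitchen, Bathroom, Outdoor

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),                    # assumed RGB input size
    tf.keras.layers.Conv2D(16, 3, activation="relu"),     # convolutional layer, 3x3 filters
    tf.keras.layers.MaxPooling2D(2),                      # max pooling halves the spatial size
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```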
Residual CNNs
Convolutional Neural Networks
• Recurrent NNs (RNNs) are used for modeling sequential data and data with varying length of inputs and outputs
  ▪ Videos, text, speech, DNA sequences, human skeletal data
• RNNs introduce recurrent connections between the neurons (see the sketch below)
  ▪ This allows processing sequential data one element at a time by selectively passing information across a sequence
  ▪ Memory of the previous inputs is stored in the model's internal state and affects the model predictions
  ▪ Can capture correlations in sequential data
• RNNs use backpropagation-through-time for training
• RNNs are more sensitive to the vanishing gradient problem than CNNs
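A hedged Keras sketch of a small recurrent model for sequence classification; the vocabulary size, layer sizes, and the binary output are assumptions (a SimpleRNN layer is used here, LSTM layers are discussed next):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),           # variable-length token sequences
    tf.keras.layers.Embedding(input_dim=10000, output_dim=32),
    tf.keras.layers.SimpleRNN(64),                           # hidden state carries memory of past inputs
    tf.keras.layers.Dense(1, activation="sigmoid"),          # e.g. one binary prediction per sequence
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```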
[Figure: an RNN unrolled over time: the inputs x1, x2, x3 update the hidden states h0, h1, h2, h3, which produce the outputs.]
• RNNs can have one or many inputs and one or many outputs
  ▪ Image captioning: an image input, a sequence output, e.g., "A person riding a motorbike on dirt road"
  ▪ Machine translation: a sequence input, a sequence output, e.g., "Happy Diwali" → "शुभ दीपावली"
Bidirectional RNNs
Recurrent Neural Networks
LSTM Networks
Recurrent Neural Networks
• LSTM cell
  ▪ Input gate, output gate, forget gate, memory cell
  ▪ LSTM can learn long-term correlations within data sequences (the standard cell updates are given below)
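For reference, one standard formulation of the LSTM cell updates, where $\sigma$ is the sigmoid, $\odot$ denotes element-wise multiplication, and $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input:

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) && \text{(candidate memory)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(memory cell update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state / output)}
\end{aligned}
$$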
References