Module 2
Deep Learning
Introduction to deep learning, Deep feed forward network, Training deep models, Optimization
techniques - Gradient Descent (GD), GD with momentum, Nesterov accelerated GD, Stochastic
GD, AdaGrad, RMSProp, Adam. Regularization Techniques - L1 and L2 regularization, Early
stopping, Dataset augmentation, Parameter sharing and tying, Injecting noise at input,
Ensemble methods, Dropout, Parameter initialization.
Steps for Training Deep Models
1. Data Collection
Gather a large, diverse dataset suitable for the task (images, text, etc.).
2. Data Preprocessing
Clean the data (remove noise, handle missing values).
Normalize or scale the data to ensure it is in a consistent range.
3. Model Design
Choose an appropriate architecture (e.g., CNN for images, RNN for sequences).
Decide the number of layers, neurons, and activation functions.
4. Split Data
Divide the data into training, validation, and test sets.
5. Initialization
Initialize the model weights properly to avoid vanishing or exploding gradients.
6. Forward Pass
Input data through the network, layer by layer, to calculate the output.
7. Loss Calculation
Compute the loss (difference between predicted output and actual output) using a loss
function.
8. Backward Pass (Backpropagation)
Compute the gradients of the loss with respect to the weights and update the parameters using an optimizer.
9. Model Evaluation
Evaluate model performance on the validation set after each epoch to tune
hyperparameters and avoid overfitting.
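As an illustration of how these steps fit together, here is a minimal NumPy sketch of the full pipeline (not taken from the slides; the synthetic dataset, the single-hidden-layer architecture and every hyperparameter are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

# Steps 1-2: collect and preprocess data (synthetic binary classification, standardized features)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 4: split into training and validation sets
X_train, X_val = X[:800], X[800:]
y_train, y_val = y[:800], y[800:]

# Steps 3 and 5: design a one-hidden-layer network and initialize it with small random weights
W1 = rng.normal(scale=0.1, size=(2, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(16, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for epoch in range(50):
    # Step 6: forward pass
    h = np.tanh(X_train @ W1 + b1)
    p = sigmoid(h @ W2 + b2)

    # Step 7: loss calculation (binary cross-entropy)
    loss = -np.mean(y_train * np.log(p + 1e-9) + (1 - y_train) * np.log(1 - p + 1e-9))

    # Step 8: backward pass (backpropagation) and gradient-descent parameter update
    dz2 = (p - y_train) / len(X_train)
    dW2 = h.T @ dz2; db2 = dz2.sum(axis=0)
    dh = (dz2 @ W2.T) * (1 - h ** 2)
    dW1 = X_train.T @ dh; db1 = dh.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

    # Step 9: evaluate on the validation set after each epoch
    val_p = sigmoid(np.tanh(X_val @ W1 + b1) @ W2 + b2)
    val_acc = np.mean((val_p > 0.5) == y_val)

print(f"final training loss {loss:.3f}, validation accuracy {val_acc:.2f}")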
Optimization Techniques in Deep Learning
• Optimization techniques in deep learning refer to mathematical and algorithmic
methods used to minimize (or maximize) an objective function, typically the loss
function.
• The goal of optimization is to adjust the model's parameters (weights and biases)
to improve its performance on a given task.
• Gradient-Based Optimization techniques rely on computing gradients of the loss
function to update the model parameters.
• Gradient Descent (GD)
• Batch Gradient Descent (BGD)
• Mini-Batch Gradient Descent
• Stochastic Gradient Descent (SGD)
• Variants of SGD (momentum-based and adaptive optimizers)
• Momentum
• Nesterov Accelerated Gradient (NAG)
• Adagrad
• RMSprop
• Adam (Adaptive Moment Estimation)
Gradient Descent
• Gradient Descent is an iterative optimization algorithm that minimizes a loss function by repeatedly stepping in the direction of the negative gradient.
• Graphically, this means finding the lowest point on the function curve.
• The gradient (slope) is positive on the right side of a minimum and negative on
the left side.
• The slope is close to zero at the minimum, indicating a critical point.
• The Gradient Descent (GD) algorithm starts from a random point and moves
downhill to reach the minimum.
• The gradient represents the direction and rate of steepest ascent; we move in
the opposite direction to minimize the function.
• The learning rate controls the step size—if too high, the algorithm may
overshoot; if too low, convergence is slow.
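As a minimal sketch of the update rule w := w - eta * f'(w) on a simple one-dimensional function (the function, starting point and learning rate below are illustrative choices, not from the slides):

def f(w):
    # objective: a simple convex bowl with its minimum at w = 3
    return (w - 3.0) ** 2

def grad_f(w):
    # analytic gradient (slope) of f
    return 2.0 * (w - 3.0)

w = -4.0      # arbitrary starting point
eta = 0.1     # learning rate: too large overshoots, too small converges slowly

for step in range(50):
    w = w - eta * grad_f(w)   # step against the gradient (direction of steepest descent)

print(f"w after 50 steps: {w:.4f}, f(w) = {f(w):.6f}")   # w ends up close to 3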
1. Batch Gradient Descent (BGD)
• Uses the entire dataset to compute the gradient of the loss function.
• Updates model parameters only after evaluating all training samples.
Advantages:
• Produces a stable and smooth convergence.
• Moves steadily toward the optimal solution.
Disadvantages:
• Computationally expensive for large datasets.
• Requires significant memory and processing power.
• Slower updates since it waits for the entire dataset before making a move.
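A small sketch of batch gradient descent for linear regression, where every update uses the gradient averaged over the entire (synthetic) dataset; the data and hyperparameters are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))                      # the full training set
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)

w = np.zeros(3)
eta = 0.1
for epoch in range(200):
    grad = X.T @ (X @ w - y) / len(X)              # mean-squared-error gradient over ALL samples
    w -= eta * grad                                # one parameter update per full pass

print("learned weights:", np.round(w, 3))          # close to [2, -1, 0.5]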
2. Mini-Batch Gradient Descent
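The mini-batch variant computes the gradient on small, shuffled subsets of the data, so parameters are updated many times per epoch while each update stays far cheaper than a full-batch pass; it is the standard compromise between batch GD and SGD. A minimal sketch on the same kind of synthetic regression problem (the batch size and other settings are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)

w = np.zeros(3)
eta, batch_size = 0.1, 32                          # illustrative hyperparameters
for epoch in range(50):
    idx = rng.permutation(len(X))                  # reshuffle the data every epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]          # indices of one mini-batch
        grad = X[b].T @ (X[b] @ w - y[b]) / len(b) # gradient over the mini-batch only
        w -= eta * grad                            # update after every mini-batch

print("learned weights:", np.round(w, 3))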
3. Gradient Descent with Momentum
Gradient Descent with momentum is an optimization technique that helps
accelerate gradient descent by smoothing out updates and reducing oscillations.
Why Momentum?
• Standard Gradient Descent (GD) can be slow, especially when gradients oscillate
in different directions.
• Momentum helps GD move faster in the right direction by accumulating past
gradients and using them to update weights.
• This is especially useful in valleys or ravines where standard GD might zig-zag
slowly.
How Does Momentum Help?
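In one common formulation (a sketch, not reproduced from the slide): a velocity vector accumulates an exponentially weighted sum of past gradients, and the weights move along the velocity instead of the raw gradient, which damps oscillations across a ravine while building up speed along it. The elongated quadratic objective and the coefficients below are illustrative:

import numpy as np

def grad(w):
    # gradient of an elongated quadratic "ravine": flat along w[0], steep along w[1]
    return np.array([0.2 * w[0], 20.0 * w[1]])

w = np.array([10.0, 1.0])
v = np.zeros(2)                    # velocity (accumulated past gradients)
eta, beta = 0.05, 0.9              # learning rate and momentum coefficient

for step in range(200):
    v = beta * v + grad(w)         # v_t = beta * v_{t-1} + gradient
    w = w - eta * v                # w_t = w_{t-1} - eta * v_t

print("w after 200 steps:", np.round(w, 4))   # approaches the minimum at (0, 0)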
4. Stochastic Gradient Descent (SGD)
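As a minimal sketch under the usual formulation (not reproduced from the slides), SGD updates the parameters after every single, randomly ordered training example; the updates are noisy but extremely cheap. The synthetic regression data and step size are illustrative:

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)

w = np.zeros(3)
eta = 0.01
for epoch in range(20):
    for i in rng.permutation(len(X)):             # visit the samples in random order
        grad_i = (X[i] @ w - y[i]) * X[i]         # gradient from ONE sample only
        w -= eta * grad_i                         # noisy but very cheap update

print("learned weights:", np.round(w, 3))         # close to [2, -1, 0.5]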
5. Stochastic Gradient Descent (SGD) with
Momentum
• Momentum accelerates SGD by incorporating a fraction of the previous
update into the current update.
• Smooths out updates and prevents oscillations, especially in ravines (steep in
some directions, flat in others).
• Remembers past update direction, reducing reliance on only the current
gradient.
• Prevents zig-zagging, leading to faster convergence, especially in high-
dimensional spaces.
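For reference, one standard way of writing the two update rules (a common textbook formulation, not reproduced from a particular slide):

Without momentum (plain SGD, one sample or mini-batch i per step):
    w_{t+1} = w_t - \eta \nabla L_i(w_t)

With momentum (velocity v, momentum coefficient \beta, typically about 0.9):
    v_{t+1} = \beta v_t + \nabla L_i(w_t)
    w_{t+1} = w_t - \eta v_{t+1}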
• Each contour line represents a region of equal loss (or cost)—the closer you get to
the center (red dot), the lower the loss.
• The red dot represents the global minimum, where the optimization should ideally
converge.
6. Nesterov Accelerated Gradient (NAG)
• Standard momentum-based gradient descent updates parameters using a velocity
term to smooth out updates.
• NAG improves upon this by calculating the gradient at the "look-ahead" position
instead of the current position, reducing oscillations and overshooting issues.
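One common formulation of the NAG update (a standard textbook form; the equations on the original slides are not reproduced here) evaluates the gradient at the point the momentum step alone would reach:

    look-ahead point:  \tilde{w}_t = w_t - \eta \beta v_t
    velocity update:   v_{t+1} = \beta v_t + \nabla L(\tilde{w}_t)
    weight update:     w_{t+1} = w_t - \eta v_{t+1}

Because the correction uses the gradient at the look-ahead position, NAG can slow down before it overshoots a minimum, something plain momentum only notices one step later.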
7. Adaptive Gradient (AdaGrad)
• Modifies SGD by adapting the learning rate per parameter based on how
frequently it has been updated.
• Infrequently updated parameters get larger updates, frequently updated ones
get smaller updates.
• Works well for sparse data where some parameters rarely get gradients.
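A minimal NumPy sketch of the AdaGrad update under the standard formulation (the objective and hyperparameters are illustrative): each parameter accumulates the sum of its squared gradients, and its step is divided by the square root of that sum.

import numpy as np

def grad(w):
    # gradient of an elongated quadratic: small along w[0], large along w[1]
    return np.array([0.2 * w[0], 20.0 * w[1]])

w = np.array([10.0, 1.0])
G = np.zeros(2)                       # running sum of squared gradients, per parameter
eta, eps = 1.0, 1e-8

for step in range(500):
    g = grad(w)
    G += g ** 2                               # accumulate squared gradients (only grows)
    w -= eta * g / (np.sqrt(G) + eps)         # per-parameter adaptive step size

# Both coordinates move toward the minimum at (0, 0), but because G only grows,
# the effective learning rate keeps shrinking over time (the issue RMSProp addresses).
print("w after 500 steps:", np.round(w, 4))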
8. RMSProp (Root Mean Square Propagation)
• Improves AdaGrad by introducing an exponentially decaying moving average of
past squared gradients, preventing the learning rate from vanishing too quickly.
• This ensures that learning continues even after many updates.
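A minimal sketch under the standard formulation (objective and hyperparameters illustrative); the only change from AdaGrad is replacing the ever-growing sum with an exponentially decaying average of squared gradients:

import numpy as np

def grad(w):
    return np.array([0.2 * w[0], 20.0 * w[1]])

w = np.array([10.0, 1.0])
s = np.zeros(2)                       # decaying average of squared gradients
eta, rho, eps = 0.05, 0.9, 1e-8

for step in range(500):
    g = grad(w)
    s = rho * s + (1 - rho) * g ** 2          # decaying average instead of a raw sum
    w -= eta * g / (np.sqrt(s) + eps)         # effective step size no longer vanishes

print("w after 500 steps:", np.round(w, 4))   # both coordinates end up near 0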
9. Adam (Adaptive Moment Estimation)
• Combines momentum (for stable updates) and RMSProp (for adaptive learning
rates).
• Uses two moving averages:
• First moment estimate : captures the mean of past gradients (like
momentum).
• Second moment estimate : captures the variance of past gradients (like
RMSProp).
• Bias correction ensures unbiased estimates early in training.
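A minimal sketch of the standard Adam update (the objective is illustrative; the hyperparameters are the commonly used defaults):

import numpy as np

def grad(w):
    return np.array([0.2 * w[0], 20.0 * w[1]])

w = np.array([10.0, 1.0])
m = np.zeros(2)                       # first moment: decaying mean of gradients
v = np.zeros(2)                       # second moment: decaying mean of squared gradients
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g           # momentum-like average
    v = beta2 * v + (1 - beta2) * g ** 2      # RMSProp-like average
    m_hat = m / (1 - beta1 ** t)              # bias correction: moments start at zero,
    v_hat = v / (1 - beta2 ** t)              # so early estimates are scaled up
    w -= eta * m_hat / (np.sqrt(v_hat) + eps)

print("w after 500 steps:", np.round(w, 4))   # w heads toward the minimum at (0, 0)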
Regularization Techniques
Regularization techniques are used to prevent overfitting by adding penalties
to the model's complexity, making it generalize better to new, unseen data.
Key Points:
• Prevents Overfitting: Reduces the model's tendency to memorize noise in
the training data.
• Controls Complexity: Limits the model's complexity by penalizing large or
unnecessary parameters.
• Improves Generalization: Helps the model perform well on unseen data.
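As one concrete example (the per-technique slides are not reproduced here), a minimal sketch of L2 regularization on linear regression: the L2 penalty (lambda times the squared norm of the weights) adds 2*lambda*w to the gradient, so every update also shrinks the weights toward zero (weight decay). The synthetic data and the value of lambda are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.0]) + 0.5 * rng.normal(size=200)

def fit(lam):
    w = np.zeros(5)
    eta = 0.1
    for _ in range(500):
        grad = X.T @ (X @ w - y) / len(X) + 2 * lam * w   # data gradient + L2 penalty gradient
        w -= eta * grad                                   # equivalent to decaying w each step
    return w

print("no regularization:", np.round(fit(0.0), 3))
print("L2, lambda = 0.1: ", np.round(fit(0.1), 3))        # weights pulled toward zero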
Previous Year Questions
1. Elucidate the features of deep feed forward networks. (Slide 2)
2. Explain Nesterov Accelerated Gradient Descent with equations. (Slide 21-23)
3. Compare batch gradient descent and stochastic gradient descent. Describe the
advantages and limitations of each.
4. Compare RMSProp with AdaGrad. (Slide 24-27)
5. Why is proper parameter initialization crucial for training deep networks?
10. A supervised learning problem is given to model a deep feed forward neural
network. Suggest solutions for a small sized dataset for training.
11. Explain how L2 regularization improves the performance of deep feed forward
neural networks.
12. Differentiate stochastic gradient descent with and without momentum. Give
equations for the weight update in SGD with and without momentum.
13. State how to apply early stopping in the context of learning using Gradient
Descent.
14. Why is it necessary to use a validation set (instead of simply using the test set)
when using early stopping?
15. Describe the effect on bias and variance when a neural network is modified with
a larger number of hidden units followed by dropout regularization.
16. Describe the advantage of using the Adam optimizer instead of basic gradient descent.
Course Level Assessment Questions
1. Derive a mathematical expression to show L2 regularization as weight decay. Explain
how L2 regularization improves the performance of deep feed forward neural
networks.
2. In stochastic gradient descent, each pass over the dataset requires the same
number of arithmetic operations, whether we use minibatches of size 1 or size 1000.
Why can it nevertheless be more computationally efficient to use minibatches of size
1000?
3. State how to apply early stopping in the context of learning using Gradient
Descent. Why is it necessary to use a validation set (instead of simply using the test
set) when using early stopping?
4. Suppose that a model does well on the training set, but only achieves an accuracy
of 85% on the validation set. You conclude that the model is overfitting, and plan to
use L1 or L2 regularization to fix the issue. However, you learn that some of the
examples in the data may be incorrectly labeled. Which form of regularisation would
you prefer to use and why?
5. Describe one advantage of using Adam optimizer instead of basic gradient descent.
Model Questions
1. Derive the weight update rule in gradient descent when the error function is
a) mean squared error, b) cross entropy.
2. Discuss methods to prevent overfitting in neural networks.
3. Differentiate gradient descent with and without momentum. Give equations for
the weight update in GD with and without momentum. Illustrate plateaus, saddle
points and slowly varying gradients.
4. Suppose a supervised learning problem is given to model a deep feed forward
neural network. Suggest solutions for the following: a) small sized dataset for training,
b) dataset with unlabeled data, c) large dataset but data from a different distribution.
5. Describe the effect on bias and variance when a neural network is modified with
a larger number of hidden units followed by dropout regularization.