
CST414 DEEP LEARNING
Module-2 PART-III
SYLLABUS
Module-2 (Deep Learning): Introduction to deep learning, deep feedforward networks, training deep models. Optimization techniques - Gradient Descent (GD), GD with momentum, Nesterov accelerated GD, Stochastic GD, AdaGrad, RMSProp, Adam. Regularization techniques - L1 and L2 regularization, early stopping, dataset augmentation, parameter sharing and tying, injecting noise at input, ensemble methods, Dropout, parameter initialization.
PARAMETER SHARING
• Parameter sharing is enabled by domain-specific insights.
• Examples of parameter-sharing methods are as follows:
• 1. Sharing weights in autoencoders:
• The symmetric weights in the encoder and decoder portions of the autoencoder are often shared, i.e., the decoder weight matrix is the transpose of the encoder weight matrix.
• In a single-layer autoencoder with linear activation, weight sharing forces orthogonality among the different hidden components of the weight matrix.
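A minimal NumPy sketch of this kind of weight tying in a single-layer linear autoencoder (the layer sizes and variable names are illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    def tied_autoencoder_forward(x, W, b_enc, b_dec):
        """Single-layer linear autoencoder with tied weights:
        the decoder reuses the transpose of the encoder weight matrix W."""
        h = x @ W.T + b_enc        # encoder uses W
        x_hat = h @ W + b_dec      # decoder uses the same W (transposed relationship)
        return x_hat

    d, k = 10, 4                              # illustrative input / hidden sizes
    W = rng.standard_normal((k, d)) * 0.01    # the single shared weight matrix
    x = rng.standard_normal((5, d))
    x_hat = tied_autoencoder_forward(x, W, np.zeros(k), np.zeros(d))

Because only one weight matrix exists, any gradient update to the decoder automatically affects the encoder as well.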
• 2. Recurrent neural networks:
• These networks are often used for modeling sequential data, such as time-series, biological sequences, and text.
• In recurrent neural networks, a time-layered representation of the network is created, in which the neural network is replicated across layers associated with time stamps.
• Since each time stamp is assumed to use the same model, the parameters are shared between the different layers.
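A hedged sketch of this replication across time stamps, using a plain NumPy RNN cell (the function and variable names are illustrative):

    import numpy as np

    def rnn_forward(x_seq, W_xh, W_hh, b_h):
        """Plain RNN unrolled over time: the same W_xh, W_hh and b_h are reused
        in every time-stamped copy of the network."""
        h = np.zeros(W_hh.shape[0])
        for x_t in x_seq:                               # one replicated layer per time stamp
            h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)    # identical parameters at each step
        return h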
• 3. Convolutional neural networks: Convolutional neural networks are used for image recognition and prediction.
• The inputs of the network are arranged into a rectangular grid pattern, as are all the layers of the network.
• The basic idea is that a rectangular patch of the image corresponds to a portion of the visual field, and it should be interpreted in the same way no matter where it is located. Therefore, the same filter weights are shared across all spatial locations.
• An additional type of weight sharing is soft weight sharing.
• In soft weight sharing, the parameters are not completely tied, but a penalty is associated with them being different.
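A small sketch of soft weight sharing, assuming a quadratic penalty on the difference between two parameter sets (the quadratic form and the coefficient lam are illustrative choices, not specified in the slides):

    import numpy as np

    def soft_sharing_penalty(w_a, w_b, lam=0.01):
        """Instead of tying w_a and w_b exactly, add a penalty that grows with
        their difference, so they are encouraged (not forced) to stay close."""
        return 0.5 * lam * np.sum((w_a - w_b) ** 2)

    # total_loss = data_loss + soft_sharing_penalty(W_layer1, W_layer2)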
INJECTING NOISE AT INPUT

• Noise injection is a form of penalty-based regularization of the weights.
• The use of Gaussian noise in the input is roughly equivalent to L2-regularization in single-layer networks with linear activation.
• The de-noising autoencoder is based on noise injection rather than penalization of the weights or hidden units.
• The goal of the de-noising autoencoder is to reconstruct good examples from corrupted training data.
• Therefore, the type of noise should be calibrated to the nature of the input.
• Several different types of noise can be added:
• 1. Gaussian noise: This type of noise is appropriate for real-valued inputs.
• The added noise has zero mean and variance λ > 0 for each input. Here, λ is the regularization parameter.
• 2. Masking noise: The basic idea is to set a fraction f of the inputs to zero in order to corrupt the inputs.
• This type of approach is particularly useful when working with binary inputs.
• 3. Salt-and-pepper noise: In this case, a fraction f of the inputs is set to either the minimum or maximum possible value according to a fair coin flip.
• The approach is typically used for binary inputs, for which the minimum and maximum values are 0 and 1, respectively.
• The inputs to the autoencoder are corrupted training records, and the outputs are the uncorrupted data records.
• The autoencoder learns to recognize the fact that the input is corrupted and that the true representation of the input needs to be reconstructed.
• Note that the noise in the training data is explicitly added, whereas that in the test data is already present as a result of various application-specific reasons.
• The nature of the noise added to the input training data should be based on insights about the type of corruption present in the test data.
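A hedged sketch of the three corruption schemes listed above, applied to the inputs of a de-noising autoencoder (parameter names such as lam and f follow the slide notation; the function names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def gaussian_noise(x, lam=0.1):
        """Real-valued inputs: add zero-mean noise with variance lam."""
        return x + rng.normal(0.0, np.sqrt(lam), size=x.shape)

    def masking_noise(x, f=0.2):
        """Set a fraction f of the inputs to zero."""
        return x * (rng.random(x.shape) >= f)

    def salt_and_pepper_noise(x, f=0.2, lo=0.0, hi=1.0):
        """Set a fraction f of the inputs to the minimum or maximum value
        according to a fair coin flip."""
        corrupt = rng.random(x.shape) < f
        coin = rng.random(x.shape) < 0.5
        return np.where(corrupt, np.where(coin, hi, lo), x)

    # Training pairs for the de-noising autoencoder: (corrupted input, original record).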
ENSEMBLE METHODS
• One way of reducing the error of a classifier is to find a way to reduce either its bias or its variance without affecting the other component.
• Ensemble methods are commonly used in machine learning, and two examples of such methods are bagging and boosting.
• The former is a method for variance reduction, whereas the latter is a method for bias reduction.
• Most ensemble methods in neural networks are focused on variance reduction. This is because neural networks are valued for their ability to build arbitrarily complex models in which the bias is relatively low.
Bagging and Subsampling
• The basic idea is to generate new training data sets from the single instance of the base data by sampling.
• The sampling can be performed with or without replacement.
• The predictions across the different data sets can then be averaged to yield the final prediction.
• If a sufficient number of training data sets is used, the variance of the prediction is greatly reduced (it would approach 0 if the data sets were fully independent).
• It is common to use the softmax to yield probabilistic predictions of discrete outputs.
• If probabilistic predictions are averaged, it is common to average the logarithms of these values.
• For discrete predictions, arithmetically averaged voting is used.
• The main difference between bagging and subsampling is whether or not replacement is used in the creation of the sampled training data sets.
• 1. Bagging: In bagging, the training data is sampled with replacement.
• The sample size s may be different from the training data size n, although it is common to set s to n.
• The resampled data will contain duplicates, and about a fraction (1 − 1/n)^n ≈ 1/e of the original data set will not be included at all.
• A model is constructed on the resampled training data set, and each test instance is predicted using this model.
• The entire process of resampling and model building is repeated m times.
• For a given test instance, each of these m models is applied to the test instance.
• The predictions from the different models are then averaged to yield a single robust prediction.
• Although it is customary to choose s = n in bagging, the best results are often obtained by choosing values of s much less than n.
• 2. Subsampling is similar to bagging, except that the different models are constructed on samples of the data created without replacement.
• The predictions from the different models are averaged. In this case, it is essential to choose s < n, because choosing s = n yields the same training data set and identical results across the different ensemble components.
• When sufficient training data is available, subsampling is often preferable to bagging.
• Using bagging makes sense when the amount of available data is limited.
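A minimal NumPy sketch of both sampling schemes (the function name, seed, and the commented prediction-averaging line are illustrative, not from the slides):

    import numpy as np

    def make_resampled_sets(X, y, m, s, replace=True, seed=0):
        """Create m resampled training sets of size s.
        replace=True  -> bagging (sampling with replacement)
        replace=False -> subsampling (requires s < n)."""
        rng = np.random.default_rng(seed)
        n = len(X)
        sets = []
        for _ in range(m):
            idx = rng.choice(n, size=s, replace=replace)
            sets.append((X[idx], y[idx]))
        return sets

    # Ensemble prediction: average the m models' probabilistic outputs, e.g.
    # averaged = np.mean([model.predict_proba(X_test) for model in models], axis=0)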
DROPOUT
• Dropout is a method that uses node sampling instead of edge sampling in order to create a neural network ensemble.
• If a node is dropped, then all incoming and outgoing connections from that node need to be dropped as well.
• The nodes are sampled only from the input and hidden layers of the network.
• If the full neural network contains M nodes, then the total number of possible sampled networks is 2^M.
• The weights of the different sampled networks are shared. Therefore, Dropout combines node sampling with weight sharing.
• The training process then uses a single sampled example in order to update the weights of the sampled network using backpropagation.
• The training process proceeds using the following steps:
• 1. Sample a neural network from the base network. The input nodes are each sampled with probability p_i, and the hidden nodes are each sampled with probability p_h.
• All samples are independent of one another.
• 2. Sample a single training instance or a mini-batch of training instances.
• 3. Update the weights of the retained edges in the network using backpropagation on the sampled training instance or the mini-batch of training instances.
• In the Dropout method, thousands of neural networks are sampled with shared weights, and a tiny training data set is used to update the weights in each case.
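A hedged sketch of steps 1-2 using node masks on a two-layer network (the layer sizes, ReLU choice, and variable names are illustrative; the backpropagation update of step 3 is omitted):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_subnetwork(n_in, n_hidden, p_i=0.8, p_h=0.5):
        """Step 1: sample input and hidden nodes independently."""
        in_mask = rng.random(n_in) < p_i        # keep each input node with probability p_i
        hid_mask = rng.random(n_hidden) < p_h   # keep each hidden node with probability p_h
        return in_mask, hid_mask

    def subnetwork_forward(x, W1, W2, in_mask, hid_mask):
        """Forward pass through the sampled subnetwork: dropped nodes contribute
        nothing, so all their incoming and outgoing connections are removed.
        Step 3 would backpropagate through this pass and update only the
        retained edges; every subnetwork shares the same W1 and W2."""
        h = np.maximum(0.0, (x * in_mask) @ W1) * hid_mask
        return h @ W2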
• If the neural network has k probabilistic outputs corresponding to the k classes, and the jth ensemble component yields an output of p_i^(j) for the ith class, then the ensemble estimate for the ith class is computed as the geometric mean of the component outputs:

  p_i^(Ens) = [ p_i^(1) · p_i^(2) · ... · p_i^(m) ]^(1/m)

• Here, m is the total number of ensemble components, which can be rather large in the case of the Dropout method.
• The values of the probabilities are then re-normalized so that they sum to 1 over the k classes:

  p_i^(Ens) ← p_i^(Ens) / ( p_1^(Ens) + p_2^(Ens) + ... + p_k^(Ens) )
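A small sketch of this geometric-mean combination (the function name is illustrative):

    import numpy as np

    def dropout_ensemble_estimate(probs):
        """probs: array of shape (m, k) holding the k class probabilities produced
        by each of the m ensemble components. Returns the geometric mean per class,
        re-normalized to sum to 1."""
        geo = np.exp(np.mean(np.log(probs), axis=0))   # geometric mean of each column
        return geo / geo.sum()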
• A key insight of the Dropout method is that it is not necessary to evaluate the prediction on all ensemble components.
• One can perform forward propagation on only the base network (with no dropping) after re-scaling the weights.
• The basic idea is to multiply the weights going out of each unit by the probability of sampling that unit.
• By using this approach, the expected output of that unit from a sampled network is captured. This rule is referred to as the weight scaling inference rule.
• Using this rule also ensures that the input going into a unit is the same as the expected input that would occur in a sampled network.
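A hedged sketch of the weight scaling inference rule for the same illustrative two-layer network as above (scaling the activations by the keep probabilities is equivalent to scaling the outgoing weights):

    import numpy as np

    def predict_with_weight_scaling(x, W1, W2, p_i=0.8, p_h=0.5):
        """Test-time prediction with no node sampling: each unit's outgoing
        contribution is multiplied by its sampling probability, so every
        downstream unit receives its expected input from a sampled network."""
        h = np.maximum(0.0, (x * p_i) @ W1) * p_h
        return h @ W2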

 The rule is not exactly true for networks with nonlinearities.

TRACE KTU
 By dropping both input units and hidden units, Dropout
effectively incorporates noise into both the input data and the
hidden representations.

 The nature of this noise can be viewed as a kind of masking


noise in which some inputs and hidden units are set to 0.

 Noise addition is a form of regularization.


 Dropout prevents a phenomenon referred to as feature co-
20
adaptation from occurring between hidden units.
 Since the effect of Dropout is a masking noise that removes
some of the hidden units, this approach forces a certain level
of redundancy between the features learned at the different
hidden units. This type of redundancy leads to increased
robustness.

TRACE KTU
 Dropout is efficient because each of the sampled subnetworks
is trained with a small set of sampled instances

 Dropout prevents this type of co-adaptation by forcing the


neural
 network to make predictions using only a subset of the inputs
and activations
PARAMETER INITIALIZATION

Three different cases of parameter initialization are considered:

1. Initialize all parameters to zero.
2. Initialize parameters to random values from a standard normal distribution or a uniform distribution and multiply them by a scalar such as 10.
3. Initialize parameters based on:
• the Xavier recommendation.
• the Kaiming He recommendation.
1. Initialize all parameters to zero

[Figure: cost curve for zero initialization, learning rate = 0.01]

• As the cost curve shows, the neural network didn't learn anything.
• That is because of the symmetry between all neurons, which causes all neurons to receive the same update on every iteration.
• Therefore, regardless of how many iterations of the optimization algorithm we run, all the neurons would still get the same update and no learning would happen.
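A small NumPy demonstration of that symmetry (the layer sizes and tanh activation are illustrative assumptions):

    import numpy as np

    n_in, n_hidden = 4, 3
    W1 = np.zeros((n_in, n_hidden))            # all parameters initialized to zero

    x = np.random.randn(8, n_in)               # a hypothetical mini-batch
    h = np.tanh(x @ W1)                        # every hidden unit computes the same value
    print(np.allclose(h, h[:, [0]]))           # True: the units are indistinguishable,
                                               # so they also receive identical gradients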
2. Initialize parameters to random values

[Figure: cost curve for large random initialization, learning rate = 0.01]

We multiply the random values by a big number, such as 10, to show that initializing parameters to big values may cause our optimization to have higher error rates.
Random initialization here is helping, but the loss function still has a high value and may take a long time to converge to a significantly low value.
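A hedged illustration of why such large initial values hurt, assuming tanh hidden units and made-up layer sizes:

    import numpy as np

    rng = np.random.default_rng(0)
    n_prev, n_curr = 512, 256                             # illustrative layer sizes

    W_big = rng.standard_normal((n_prev, n_curr)) * 10    # "multiply by a big number such as 10"
    x = rng.standard_normal((32, n_prev))                 # a hypothetical mini-batch

    a = np.tanh(x @ W_big)
    print(np.mean(np.abs(a) > 0.99))   # nearly all tanh activations are saturated near +/-1,
                                       # so their gradients are close to zero and learning is slow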
3. Initializing parameters based on the Xavier and Kaiming He recommendations

• The Xavier method is best applied when the activation function on the hidden layers is the hyperbolic tangent, so that the weights on each hidden layer l have the following variance:
• var(W^l) = 1/m^(l-1), where m^(l-1) is the number of units in the previous layer.
• We can achieve this by multiplying the random values drawn from a standard normal distribution by sqrt(1/m^(l-1)).
[Figure: In Xavier initialization]
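A minimal sketch of the Xavier rule described above (dimensions are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def xavier_init(m_prev, m_curr):
        """Xavier initialization: standard-normal values scaled by sqrt(1/m_prev),
        so that var(W^l) = 1/m^(l-1). Suited to tanh hidden layers."""
        return rng.standard_normal((m_prev, m_curr)) * np.sqrt(1.0 / m_prev)

    W1 = xavier_init(784, 128)       # e.g. an illustrative 784 -> 128 layer
    print(W1.var())                  # close to 1/784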
• The Kaiming He method is best applied when the activation function on the hidden layers is the Rectified Linear Unit (ReLU),
• so that the weights on each hidden layer l have the following variance: var(W^l) = 2/m^(l-1).
• We can achieve this by multiplying the random values drawn from a standard normal distribution by sqrt(2/m^(l-1)).
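The corresponding sketch for the He rule (again with illustrative dimensions):

    import numpy as np

    rng = np.random.default_rng(0)

    def he_init(m_prev, m_curr):
        """Kaiming He initialization: standard-normal values scaled by sqrt(2/m_prev),
        so that var(W^l) = 2/m^(l-1). Suited to ReLU hidden layers."""
        return rng.standard_normal((m_prev, m_curr)) * np.sqrt(2.0 / m_prev)

    W1 = he_init(784, 128)
    print(W1.var())                  # close to 2/784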
END OF MODULE 2