
CST414 DEEP LEARNING
Module-2 PART-III
SYLLABUS
Module-2 (Deep Learning): Introduction to deep learning, deep feedforward networks, training deep models. Optimization techniques - Gradient Descent (GD), GD with momentum, Nesterov accelerated GD, Stochastic GD, AdaGrad, RMSProp, Adam. Regularization techniques - L1 and L2 regularization, early stopping, dataset augmentation, parameter sharing and tying, injecting noise at input, ensemble methods, Dropout, parameter initialization.
PARAMETER SHARING
• Parameter sharing is enabled by domain-specific insights.
• Examples of parameter-sharing methods are as follows:
• 1. Sharing weights in autoencoders:
• The symmetric weights in the encoder and decoder portions of the autoencoder are often shared, i.e., the decoder weight matrix is the transpose of the encoder weight matrix.
• In a single-layer autoencoder with linear activation, weight sharing forces orthogonality among the different hidden components of the weight matrix.
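A minimal NumPy sketch of this kind of weight tying in a single-layer linear autoencoder (the layer sizes and variable names are illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    def tied_autoencoder_forward(x, W, b_enc, b_dec):
        """Single-layer linear autoencoder with tied weights:
        the decoder reuses the transpose of the encoder weight matrix W."""
        h = x @ W.T + b_enc        # encoder uses W
        x_hat = h @ W + b_dec      # decoder uses the same W (transposed relationship)
        return x_hat

    d, k = 10, 4                              # illustrative input / hidden sizes
    W = rng.standard_normal((k, d)) * 0.01    # the single shared weight matrix
    x = rng.standard_normal((5, d))
    x_hat = tied_autoencoder_forward(x, W, np.zeros(k), np.zeros(d))

Because only one weight matrix exists, any gradient update to the decoder automatically affects the encoder as well.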
• 2. Recurrent neural networks:
• These networks are often used for modeling sequential data, such as time-series, biological sequences, and text.
• In recurrent neural networks, a time-layered representation of the network is created, in which the neural network is replicated across layers associated with time stamps.
• Since each time stamp is assumed to use the same model, the parameters are shared between the different layers.
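A hedged sketch of this replication across time stamps, using a plain NumPy RNN cell (the function and variable names are illustrative):

    import numpy as np

    def rnn_forward(x_seq, W_xh, W_hh, b_h):
        """Plain RNN unrolled over time: the same W_xh, W_hh and b_h are reused
        in every time-stamped copy of the network."""
        h = np.zeros(W_hh.shape[0])
        for x_t in x_seq:                               # one replicated layer per time stamp
            h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)    # identical parameters at each step
        return h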
• 3. Convolutional neural networks: Convolutional neural networks are used for image recognition and prediction.
• The inputs of the network are arranged into a rectangular grid pattern, as are all the layers of the network.
• The basic idea is that a rectangular patch of the image corresponds to a portion of the visual field, and it should be interpreted in the same way no matter where it is located. Therefore, the same filter weights are shared across all spatial locations.
• An additional type of weight sharing is soft weight sharing.
• In soft weight sharing, the parameters are not completely tied, but a penalty is associated with them being different.
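A small sketch of soft weight sharing, assuming a quadratic penalty on the difference between two parameter sets (the quadratic form and the coefficient lam are illustrative choices, not specified in the slides):

    import numpy as np

    def soft_sharing_penalty(w_a, w_b, lam=0.01):
        """Instead of tying w_a and w_b exactly, add a penalty that grows with
        their difference, so they are encouraged (not forced) to stay close."""
        return 0.5 * lam * np.sum((w_a - w_b) ** 2)

    # total_loss = data_loss + soft_sharing_penalty(W_layer1, W_layer2)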
INJECTING NOISE AT INPUT

• Noise injection is a form of penalty-based regularization of the weights.
• The use of Gaussian noise in the input is roughly equivalent to L2-regularization in single-layer networks with linear activation.
• The de-noising autoencoder is based on noise injection rather than penalization of the weights or hidden units.
• The goal of the de-noising autoencoder is to reconstruct good examples from corrupted training data.
• Therefore, the type of noise should be calibrated to the nature of the input.
• Several different types of noise can be added:
• 1. Gaussian noise: This type of noise is appropriate for real-valued inputs.
• The added noise has zero mean and variance λ > 0 for each input. Here, λ is the regularization parameter.
• 2. Masking noise: The basic idea is to set a fraction f of the inputs to zero in order to corrupt the inputs.
• This type of approach is particularly useful when working with binary inputs.
• 3. Salt-and-pepper noise: In this case, a fraction f of the inputs is set to either the minimum or maximum possible value according to a fair coin flip.
• The approach is typically used for binary inputs, for which the minimum and maximum values are 0 and 1, respectively.
• The inputs to the autoencoder are corrupted training records, and the outputs are the uncorrupted data records.
• The autoencoder learns to recognize the fact that the input is corrupted and that the true representation of the input needs to be reconstructed.
• Note that the noise in the training data is explicitly added, whereas that in the test data is already present as a result of various application-specific reasons.
• The nature of the noise added to the input training data should be based on insights about the type of corruption present in the test data.
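A hedged sketch of the three corruption schemes listed above, applied to the inputs of a de-noising autoencoder (parameter names such as lam and f follow the slide notation; the function names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def gaussian_noise(x, lam=0.1):
        """Real-valued inputs: add zero-mean noise with variance lam."""
        return x + rng.normal(0.0, np.sqrt(lam), size=x.shape)

    def masking_noise(x, f=0.2):
        """Set a fraction f of the inputs to zero."""
        return x * (rng.random(x.shape) >= f)

    def salt_and_pepper_noise(x, f=0.2, lo=0.0, hi=1.0):
        """Set a fraction f of the inputs to the minimum or maximum value
        according to a fair coin flip."""
        corrupt = rng.random(x.shape) < f
        coin = rng.random(x.shape) < 0.5
        return np.where(corrupt, np.where(coin, hi, lo), x)

    # Training pairs for the de-noising autoencoder: (corrupted input, original record).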
ENSEMBLE METHODS
• One way of reducing the error of a classifier is to find a way to reduce either its bias or its variance without affecting the other component.
• Ensemble methods are commonly used in machine learning, and two examples of such methods are bagging and boosting.
• The former is a method for variance reduction, whereas the latter is a method for bias reduction.
• Most ensemble methods in neural networks are focused on variance reduction. This is because neural networks are valued for their ability to build arbitrarily complex models in which the bias is relatively low.
Bagging and Subsampling
• The basic idea is to generate new training data sets from the single instance of the base data by sampling.
• The sampling can be performed with or without replacement.
• The predictions across the different data sets can then be averaged to yield the final prediction.
• If a sufficient number of training data sets is used, the variance of the prediction is greatly reduced (it would approach 0 if the data sets were fully independent).
• It is common to use the softmax to yield probabilistic predictions of discrete outputs.
• If probabilistic predictions are averaged, it is common to average the logarithms of these values.
• For discrete predictions, arithmetically averaged voting is used.
• The main difference between bagging and subsampling is whether or not replacement is used in the creation of the sampled training data sets.
• 1. Bagging: In bagging, the training data is sampled with replacement.
• The sample size s may be different from the training data size n, although it is common to set s to n.
• The resampled data will contain duplicates, and about a fraction (1 − 1/n)^n ≈ 1/e of the original data set will not be included at all.
• A model is constructed on the resampled training data set, and each test instance is predicted using this model.
• The entire process of resampling and model building is repeated m times.
• For a given test instance, each of these m models is applied to the test instance.
• The predictions from the different models are then averaged to yield a single robust prediction.
• Although it is customary to choose s = n in bagging, the best results are often obtained by choosing values of s much less than n.
• 2. Subsampling is similar to bagging, except that the different models are constructed on samples of the data created without replacement.
• The predictions from the different models are averaged. In this case, it is essential to choose s < n, because choosing s = n yields the same training data set and identical results across the different ensemble components.
• When sufficient training data is available, subsampling is often preferable to bagging.
• Using bagging makes sense when the amount of available data is limited.
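A minimal NumPy sketch of both sampling schemes (the function name, seed, and the commented prediction-averaging line are illustrative, not from the slides):

    import numpy as np

    def make_resampled_sets(X, y, m, s, replace=True, seed=0):
        """Create m resampled training sets of size s.
        replace=True  -> bagging (sampling with replacement)
        replace=False -> subsampling (requires s < n)."""
        rng = np.random.default_rng(seed)
        n = len(X)
        sets = []
        for _ in range(m):
            idx = rng.choice(n, size=s, replace=replace)
            sets.append((X[idx], y[idx]))
        return sets

    # Ensemble prediction: average the m models' probabilistic outputs, e.g.
    # averaged = np.mean([model.predict_proba(X_test) for model in models], axis=0)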
DROPOUT
• Dropout is a method that uses node sampling instead of edge sampling in order to create a neural network ensemble.
• If a node is dropped, then all incoming and outgoing connections from that node need to be dropped as well.
• The nodes are sampled only from the input and hidden layers of the network.
• If the full neural network contains M nodes, then the total number of possible sampled networks is 2^M.
• The weights of the different sampled networks are shared. Therefore, Dropout combines node sampling with weight sharing.
• The training process then uses a single sampled example in order to update the weights of the sampled network using backpropagation.
• The training process proceeds using the following steps:
• 1. Sample a neural network from the base network. The input nodes are each sampled with probability p_i, and the hidden nodes are each sampled with probability p_h.
• All samples are independent of one another.
• 2. Sample a single training instance or a mini-batch of training instances.
• 3. Update the weights of the retained edges in the network using backpropagation on the sampled training instance or the mini-batch of training instances.
• In the Dropout method, thousands of neural networks are sampled with shared weights, and a tiny training data set is used to update the weights in each case.
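A hedged sketch of steps 1-2 using node masks on a two-layer network (the layer sizes, ReLU choice, and variable names are illustrative; the backpropagation update of step 3 is omitted):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_subnetwork(n_in, n_hidden, p_i=0.8, p_h=0.5):
        """Step 1: sample input and hidden nodes independently."""
        in_mask = rng.random(n_in) < p_i        # keep each input node with probability p_i
        hid_mask = rng.random(n_hidden) < p_h   # keep each hidden node with probability p_h
        return in_mask, hid_mask

    def subnetwork_forward(x, W1, W2, in_mask, hid_mask):
        """Forward pass through the sampled subnetwork: dropped nodes contribute
        nothing, so all their incoming and outgoing connections are removed.
        Step 3 would backpropagate through this pass and update only the
        retained edges; every subnetwork shares the same W1 and W2."""
        h = np.maximum(0.0, (x * in_mask) @ W1) * hid_mask
        return h @ W2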
• If the neural network has k probabilistic outputs corresponding to the k classes, and the jth ensemble component yields an output of p_i^(j) for the ith class, then the ensemble estimate for the ith class is computed as the geometric mean of the component outputs:

  p_i^(Ens) = [ p_i^(1) · p_i^(2) · ... · p_i^(m) ]^(1/m)

• Here, m is the total number of ensemble components, which can be rather large in the case of the Dropout method.
• The values of the probabilities are then re-normalized so that they sum to 1 over the k classes:

  p_i^(Ens) ← p_i^(Ens) / ( p_1^(Ens) + p_2^(Ens) + ... + p_k^(Ens) )
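A small sketch of this geometric-mean combination (the function name is illustrative):

    import numpy as np

    def dropout_ensemble_estimate(probs):
        """probs: array of shape (m, k) holding the k class probabilities produced
        by each of the m ensemble components. Returns the geometric mean per class,
        re-normalized to sum to 1."""
        geo = np.exp(np.mean(np.log(probs), axis=0))   # geometric mean of each column
        return geo / geo.sum()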
• A key insight of the Dropout method is that it is not necessary to evaluate the prediction on all ensemble components.
• One can perform forward propagation on only the base network (with no dropping) after re-scaling the weights.
• The basic idea is to multiply the weights going out of each unit by the probability of sampling that unit.
• By using this approach, the expected output of that unit from a sampled network is captured. This rule is referred to as the weight scaling inference rule.
• Using this rule also ensures that the input going into a unit is the same as the expected input that would occur in a sampled network.
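A hedged sketch of the weight scaling inference rule for the same illustrative two-layer network as above (scaling the activations by the keep probabilities is equivalent to scaling the outgoing weights):

    import numpy as np

    def predict_with_weight_scaling(x, W1, W2, p_i=0.8, p_h=0.5):
        """Test-time prediction with no node sampling: each unit's outgoing
        contribution is multiplied by its sampling probability, so every
        downstream unit receives its expected input from a sampled network."""
        h = np.maximum(0.0, (x * p_i) @ W1) * p_h
        return h @ W2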

 The rule is not exactly true for networks with nonlinearities.

TRACE KTU
 By dropping both input units and hidden units, Dropout
effectively incorporates noise into both the input data and the
hidden representations.

 The nature of this noise can be viewed as a kind of masking


noise in which some inputs and hidden units are set to 0.

 Noise addition is a form of regularization.


 Dropout prevents a phenomenon referred to as feature co-
20
adaptation from occurring between hidden units.
 Since the effect of Dropout is a masking noise that removes
some of the hidden units, this approach forces a certain level
of redundancy between the features learned at the different
hidden units. This type of redundancy leads to increased
robustness.

TRACE KTU
 Dropout is efficient because each of the sampled subnetworks
is trained with a small set of sampled instances

 Dropout prevents this type of co-adaptation by forcing the


neural
 network to make predictions using only a subset of the inputs
and activations
PARAMETER INITIALIZATION

Three different cases of parameter initialization are considered:

1. Initialize all parameters to zero.
2. Initialize parameters to random values from a standard normal distribution or a uniform distribution and multiply them by a scalar such as 10.
3. Initialize parameters based on:
• the Xavier recommendation.
• the Kaiming He recommendation.
1. Initialize all parameters to zero

[Figure: cost curve for zero initialization, learning rate = 0.01]

• As the cost curve shows, the neural network didn't learn anything.
• That is because of the symmetry between all neurons, which causes all neurons to receive the same update on every iteration.
• Therefore, regardless of how many iterations of the optimization algorithm we run, all the neurons would still get the same update and no learning would happen.
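A small NumPy demonstration of that symmetry (the layer sizes and tanh activation are illustrative assumptions):

    import numpy as np

    n_in, n_hidden = 4, 3
    W1 = np.zeros((n_in, n_hidden))            # all parameters initialized to zero

    x = np.random.randn(8, n_in)               # a hypothetical mini-batch
    h = np.tanh(x @ W1)                        # every hidden unit computes the same value
    print(np.allclose(h, h[:, [0]]))           # True: the units are indistinguishable,
                                               # so they also receive identical gradients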
2. Initialize parameters to random values

[Figure: cost curve for large random initialization, learning rate = 0.01]

We multiply the random values by a big number, such as 10, to show that initializing parameters to big values may cause our optimization to have higher error rates.
Random initialization here is helping, but the loss function still has a high value and may take a long time to converge to a significantly low value.
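A hedged illustration of why such large initial values hurt, assuming tanh hidden units and made-up layer sizes:

    import numpy as np

    rng = np.random.default_rng(0)
    n_prev, n_curr = 512, 256                             # illustrative layer sizes

    W_big = rng.standard_normal((n_prev, n_curr)) * 10    # "multiply by a big number such as 10"
    x = rng.standard_normal((32, n_prev))                 # a hypothetical mini-batch

    a = np.tanh(x @ W_big)
    print(np.mean(np.abs(a) > 0.99))   # nearly all tanh activations are saturated near +/-1,
                                       # so their gradients are close to zero and learning is slow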
3. Initializing parameters based on the Xavier and Kaiming He recommendations

• The Xavier method is best applied when the activation function on the hidden layers is the hyperbolic tangent, so that the weights on each hidden layer l have the following variance:
• var(W^l) = 1/m^(l-1), where m^(l-1) is the number of units in the previous layer.
• We can achieve this by multiplying the random values drawn from a standard normal distribution by sqrt(1/m^(l-1)).
[Figure: In Xavier initialization]
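A minimal sketch of the Xavier rule described above (dimensions are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def xavier_init(m_prev, m_curr):
        """Xavier initialization: standard-normal values scaled by sqrt(1/m_prev),
        so that var(W^l) = 1/m^(l-1). Suited to tanh hidden layers."""
        return rng.standard_normal((m_prev, m_curr)) * np.sqrt(1.0 / m_prev)

    W1 = xavier_init(784, 128)       # e.g. an illustrative 784 -> 128 layer
    print(W1.var())                  # close to 1/784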
• The Kaiming He method is best applied when the activation function on the hidden layers is the Rectified Linear Unit (ReLU),
• so that the weights on each hidden layer l have the following variance: var(W^l) = 2/m^(l-1).
• We can achieve this by multiplying the random values drawn from a standard normal distribution by sqrt(2/m^(l-1)).
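The corresponding sketch for the He rule (again with illustrative dimensions):

    import numpy as np

    rng = np.random.default_rng(0)

    def he_init(m_prev, m_curr):
        """Kaiming He initialization: standard-normal values scaled by sqrt(2/m_prev),
        so that var(W^l) = 2/m^(l-1). Suited to ReLU hidden layers."""
        return rng.standard_normal((m_prev, m_curr)) * np.sqrt(2.0 / m_prev)

    W1 = he_init(784, 128)
    print(W1.var())                  # close to 2/784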
END OF MODULE 2