DEEP LEARNING
Module 2, Part III
1. Autoencoders: In a single-layer autoencoder with linear activation, weight
sharing forces orthogonality among the different hidden components of the
weight matrix.
2. Recurrent neural networks: These networks are often used for modeling
sequential data, such as time-series, biological sequences, and text.
In recurrent neural networks, a time-layered representation of the network is
created in which the neural network is replicated across layers associated
with time stamps.
Since each time stamp is assumed to use the same model, the parameters are
shared between the different layers.
3. Convolutional neural networks: These networks are used for image
recognition and prediction.
The inputs of the network are arranged into a rectangular grid pattern, along
with all the layers of the network.
The basic idea is that a rectangular patch of the image corresponds to a
portion of the visual field, and it should be interpreted in the same way no
matter where it is located (a short code sketch of these three forms of weight
sharing follows).
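The three forms of weight sharing above can be illustrated with a minimal NumPy sketch; the array shapes and variable names below are illustrative choices, not taken from the slides.

import numpy as np

rng = np.random.default_rng(0)

# 1. Autoencoder weight tying: the decoder reuses the encoder matrix (transposed),
#    so encoder and decoder share a single parameter matrix W.
X = rng.normal(size=(100, 8))
W = rng.normal(scale=0.1, size=(8, 3))
hidden = X @ W                    # linear encoder
X_hat = hidden @ W.T              # linear decoder with the *same* (tied) weights

# 2. Recurrent network: the same W_xh and W_hh are applied at every time stamp
#    of the unrolled, time-layered network.
T, d_in, d_h = 5, 4, 6
x_seq = rng.normal(size=(T, d_in))
W_xh = rng.normal(scale=0.1, size=(d_in, d_h))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
h = np.zeros(d_h)
for t in range(T):
    h = np.tanh(x_seq[t] @ W_xh + h @ W_hh)   # identical parameters at each step

# 3. Convolutional network: one small filter is applied to every rectangular
#    patch of the image, so every location is interpreted in the same way.
image = rng.normal(size=(28, 28))
kernel = rng.normal(size=(3, 3))              # a single shared 3x3 filter
out = np.zeros((26, 26))
for i in range(26):
    for j in range(26):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)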
An additional type of weight sharing is soft weight sharing.
In soft weight sharing, the parameters are not completely tied, but a penalty
is associated with them being different.
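A minimal sketch of one way to implement soft weight sharing, assuming a simple quadratic penalty 0.5·λ·(w1 − w2)² between two parameter vectors that are expected to be similar; the function names and the value of λ are illustrative.

import numpy as np

# Soft weight sharing: w1 and w2 are not forced to be equal, but the loss adds
# a penalty 0.5 * lam * ||w1 - w2||^2 that pulls them toward each other.
lam = 0.1   # strength of the penalty (illustrative value)

def sharing_penalty(w1, w2):
    return 0.5 * lam * np.sum((w1 - w2) ** 2)

def sharing_penalty_grads(w1, w2):
    # these gradients are added to the usual loss gradients of w1 and w2
    return lam * (w1 - w2), lam * (w2 - w1)

w1 = np.array([1.0, 2.0])
w2 = np.array([0.5, 2.5])
print(sharing_penalty(w1, w2))          # 0.025
print(sharing_penalty_grads(w1, w2))    # gradients that pull w1 and w2 together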
INJECTING NOISE AT INPUT
The de-noising autoencoder is based on noise injection
rather than penalization of the weights or hidden units.
The goal of the de-noising autoencoder is to reconstruct good
examples from corrupted training data.
Therefore, the type of noise should be calibrated to the nature
of the input.
Several different types of noise can be added (a small code sketch follows this list):
1. Gaussian noise: This type of noise is appropriate for real-valued inputs.
The added noise has zero mean and variance λ > 0 for each input. Here, λ is
the regularization parameter.
2. Masking noise: The basic idea is to set a fraction f of the inputs to zero
in order to corrupt the inputs.
This type of approach is particularly useful when working with
binary inputs.
3. Salt-and-pepper noise: In this case, a fraction f of the inputs is set to
either the minimum or the maximum possible value according to a fair coin
flip.
The approach is typically used for binary inputs, for which the
minimum and maximum values are 0 and 1, respectively.
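A minimal NumPy sketch of the three corruption schemes; the function names and the use of a shared random generator are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def gaussian_noise(X, lam):
    # real-valued inputs: add zero-mean noise with variance lam
    return X + rng.normal(scale=np.sqrt(lam), size=X.shape)

def masking_noise(X, f):
    # set a fraction f of the inputs to zero
    keep = rng.random(X.shape) >= f
    return X * keep

def salt_and_pepper_noise(X, f, lo=0.0, hi=1.0):
    # a fraction f of the inputs is set to lo or hi by a fair coin flip
    X = X.copy()
    corrupt = rng.random(X.shape) < f
    X[corrupt] = rng.choice([lo, hi], size=corrupt.sum())
    return X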
The inputs to the autoencoder are corrupted training records, and the outputs
are the uncorrupted data records.
The autoencoder learns to recognize the fact that the input is corrupted, and
that the true representation of the input needs to be reconstructed.
Note that the noise in the training data is explicitly added, whereas that in
the test data is already present as a result of various application-specific
reasons.
The nature of the noise added to the input training data should be based on
insights about the type of corruption present in the test data.
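A minimal sketch of training a de-noising autoencoder with masking noise, written in plain NumPy; the layer sizes, learning rate, and masking fraction are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random(size=(200, 16))                 # clean training data in [0, 1]
d_h, lr, f = 8, 0.05, 0.2                      # hidden size, learning rate, masking fraction

W1 = rng.normal(scale=0.1, size=(16, d_h)); b1 = np.zeros(d_h)
W2 = rng.normal(scale=0.1, size=(d_h, 16)); b2 = np.zeros(16)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(200):
    X_noisy = X * (rng.random(X.shape) >= f)   # corrupted input (masking noise)
    H = sigmoid(X_noisy @ W1 + b1)             # encoder
    X_hat = H @ W2 + b2                        # decoder (linear output)
    err = X_hat - X                            # target is the *clean* data
    # backpropagation of the squared reconstruction error
    gW2 = H.T @ err / len(X); gb2 = err.mean(axis=0)
    dH = (err @ W2.T) * H * (1 - H)
    gW1 = X_noisy.T @ dH / len(X); gb1 = dH.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

print(np.mean((X_hat - X) ** 2))               # reconstruction error on clean data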
ENSEMBLE METHODS
One way of reducing the error of a classifier is to find a way to
reduce either its bias or the variance without affecting the other
component.
Ensemble methods are used commonly in machine
learning, and two examples of such methods are bagging and
boosting.
The former is a method for variance reduction, whereas the
latter is a method for bias reduction
Most ensemble methods in neural networks are focused on
variance reduction. This is because neural networks are valued
for their ability to build arbitrarily complex models in which the
bias is relatively low.
Bagging and Subsampling
The basic idea is to generate new training data sets from the single instance
of the base data by sampling.
The predictions across the different data sets can then be averaged to yield
the final prediction.
If a sufficient number of training data sets is used, the variance of the
prediction will be reduced to 0.
The sampling can be performed with or without replacement.
It is common to use the softmax to yield probabilistic predictions of discrete
outputs.
If probabilistic predictions are averaged, it is common to average the
logarithms of these values (i.e., to use their geometric mean).
For discrete predictions, arithmetically averaged voting is used
The main difference between bagging and subsampling is in
terms of whether or not replacement is used in the creation
of the sampled training data sets
1. Bagging: In bagging, the training data is sampled with replacement.
The sample size s may be different from the size n of the training data,
although it is common to set s to n.
The resampled data will contain duplicates, and about a fraction
(1 − 1/n)^n ≈ 1/e of the original data set will not be included at all.
A model is constructed on the resampled training data set, and each test
instance is predicted with this model.
The entire process of resampling and model building is repeated m times.
For a given test instance, each of these m models is applied to the test data.
The predictions from the different models are then averaged to yield a single
robust prediction.
Although it is customary to choose s = n in bagging, the best results are
often obtained by choosing values of s much less than n.
2. Subsampling is similar to bagging, except that the different models are
constructed on samples of the data created without replacement.
The predictions from the different models are averaged. In this case, it is
essential to choose s < n, because choosing s = n yields the same training
data set and identical results across the different ensemble components.
When sufficient training data are available, subsampling is often preferable
to bagging.
Using bagging makes sense when the amount of available data is limited (a
short code sketch of both schemes follows).
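A minimal sketch of the two sampling schemes, assuming a scikit-learn style base learner (a decision tree is used here as a stand-in for any base model, including a neural network); the data, the ensemble size m, and the sample size s are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_test = rng.normal(size=(10, 5))

def sampled_ensemble(X, y, m=25, s=300, replace=True):
    # replace=True  -> bagging (duplicates allowed, ~1/e of the points left out)
    # replace=False -> subsampling (requires s < n)
    models = []
    for _ in range(m):
        idx = rng.choice(len(X), size=s, replace=replace)
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

models = sampled_ensemble(X, y, replace=True)    # bagging; replace=False gives subsampling
probs = np.mean([mdl.predict_proba(X_test) for mdl in models], axis=0)
print(probs.shape)   # (10, 2): averaged class probabilities over the ensemble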
DROPOUT
Dropout is a method that uses node sampling instead of edge
sampling in order to create a neural network ensemble.
If a node is dropped, then all incoming and outgoing connections
from that node need to be dropped as well.
The nodes are sampled only from the input and hidden layers of
the network.
If the full neural network contains M nodes, then the total number of possible
sampled networks is 2^M.
The weights of the different sampled networks are shared. Therefore, Dropout
combines node sampling with weight sharing.
The training process then uses a single sampled example in
order to update the weights of the sampled network using
backpropagation.
The training process proceeds using the following steps:
1. Sample a neural network from the base network. The input nodes are each
sampled with probability p_i, and the hidden nodes are each sampled with
probability p_h.
All samples are independent of one another.
2. Sample a single training instance or a mini-batch of training instances.
3. Update the weights of the retained edges in the network using
backpropagation on the sampled training instance or the mini-batch of training
instances.
In the Dropout method, thousands of neural networks are sampled with shared
weights, and a tiny training data set is used to update the weights in each
case (a short sketch of one such step follows).
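A minimal sketch of one Dropout training step on a sampled subnetwork, written in plain NumPy; only the forward pass is shown, and the keep probabilities p_i and p_h, the layer sizes, and the mini-batch size are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 10, 32, 2
W1 = rng.normal(scale=0.1, size=(d_in, d_h))    # shared weights of the base network
W2 = rng.normal(scale=0.1, size=(d_h, d_out))
p_i, p_h = 0.8, 0.5                             # sampling probabilities for input / hidden nodes

X_batch = rng.normal(size=(16, d_in))           # step 2: a sampled mini-batch

# Step 1: sample a subnetwork from the base network (independent node sampling)
input_mask = rng.random(d_in) < p_i
hidden_mask = rng.random(d_h) < p_h

# Step 3: forward pass through the sampled network; dropped nodes contribute nothing,
# so backpropagation on this mini-batch only updates the weights of retained edges.
H = np.maximum(0.0, (X_batch * input_mask) @ W1) * hidden_mask
scores = H @ W2
print(scores.shape)   # (16, 2)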
If the neural network has k probabilistic outputs corresponding to the k
classes, and the jth ensemble component yields an output of p_i^(j) for the
ith class, then the ensemble estimate for the ith class is computed as the
geometric mean of the member outputs:
p_i^(Ens) = [ ∏_{j=1}^{m} p_i^(j) ]^(1/m)
Here, m is the total number of ensemble components, which can be rather large
in the case of the Dropout method.
The values of the probabilities are re-normalized so that they sum to 1:
p_i^(Ens) ⇐ p_i^(Ens) / Σ_{q=1}^{k} p_q^(Ens)
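A small numeric sketch of the two formulas above; the three member predictions are made-up values.

import numpy as np

# p has shape (m, k): row j holds the k class probabilities of ensemble component j.
p = np.array([[0.7, 0.2, 0.1],
              [0.6, 0.3, 0.1],
              [0.8, 0.1, 0.1]])

geo = np.exp(np.mean(np.log(p), axis=0))   # geometric mean = averaging the logarithms
p_ens = geo / geo.sum()                    # re-normalize so the k values sum to 1
print(p_ens)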
A key insight of the Dropout method is that it is not necessary to evaluate
the prediction on all ensemble components.
Instead, one can perform forward propagation on only the base network (with no
dropping) after re-scaling the weights.
The basic idea is to multiply the weights going out of each unit with the
probability of sampling that unit.
By using this approach, the expected output of that unit from a sampled
network is captured. This rule is referred to as the weight scaling inference
rule.
Using this rule also ensures that the input going into a unit is the same as
the expected input that would occur in a sampled network.
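A minimal sketch of the weight scaling inference rule for a two-layer network of the same shape as in the training sketch above; the keep probabilities and layer sizes are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 10, 32, 2
W1 = rng.normal(scale=0.1, size=(d_in, d_h))
W2 = rng.normal(scale=0.1, size=(d_h, d_out))
p_i, p_h = 0.8, 0.5                       # sampling probabilities used during training

x_test = rng.normal(size=(1, d_in))

# Weight scaling inference rule: run the full base network once, but multiply the
# weights going out of each unit by the probability of sampling that unit, so each
# unit receives its expected input from a sampled network.
H = np.maximum(0.0, x_test @ (p_i * W1))
scores = H @ (p_h * W2)
print(scores)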
By dropping both input units and hidden units, Dropout
effectively incorporates noise into both the input data and the
hidden representations.
Dropout is efficient because each of the sampled subnetworks
is trained with a small set of sampled instances
PARAMETER INITIALIZATION
If all the parameters are initialized to the same value, then no matter which
optimization algorithm is used, all the neurons would still get the same
update and no learning would happen.
The parameters are therefore initialized to random values in order to break
this symmetry.
In Xavier initialization, the weights of layer l are chosen so that
var(W^l) = 1/m^(l-1), where m^(l-1) is the number of units in the previous
layer.
We can achieve this by multiplying the random values from a standard normal
distribution by √(1/m^(l-1)).
The Kaiming He method is best applied when the activation function used in the
hidden layers is the Rectified Linear Unit (ReLU), so that the weights of each
hidden layer have the following variance: var(W^l) = 2/m^(l-1).
We can achieve this by multiplying the random values from a standard normal
distribution by √(2/m^(l-1)).
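A minimal sketch of both initialization rules in NumPy; the layer sizes in the usage line are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def xavier_init(m_prev, m_curr):
    # var(W^l) = 1/m^(l-1): scale standard-normal values by sqrt(1/m_prev)
    return rng.normal(size=(m_prev, m_curr)) * np.sqrt(1.0 / m_prev)

def he_init(m_prev, m_curr):
    # var(W^l) = 2/m^(l-1): scale standard-normal values by sqrt(2/m_prev); suited to ReLU
    return rng.normal(size=(m_prev, m_curr)) * np.sqrt(2.0 / m_prev)

W1 = he_init(784, 256)
print(W1.std())   # roughly sqrt(2/784) ≈ 0.05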
END OF MODULE 2